=head1 NAME Parse::Eyapp::languageintro - Introduction to the Eyapp language =head1 The Eyapp Language =head2 Eyapp Grammar This section describes the syntax of the Eyapp language using its own notation. The grammar extends L and L grammars. Semicolons have been omitted to save space. Between C-like comments you can find an (informal) explanation of the language associated with the token. eyapp: head body tail ; symbol: LITERAL /* A string literal like 'hello' */ | ident ident: IDENT /* IDENT is [A-Za-z_][A-Za-z0-9_]* */ head: headsec '%%' headsec: decl * decl: '\n' | SEMANTIC typedecl symlist '\n' /* SEMANTIC is %semantic\s+token */ | SYNTACTIC typedecl symlist '\n' /* SYNTACTIC is %syntactic\s+token */ | TOKEN typedecl symlist '\n' /* TOKEN is %token */ | ASSOC typedecl symlist '\n' /* ASSOC is %(left|right|nonassoc) */ | START ident '\n' /* START is %start */ | HEADCODE '\n' /* HEADCODE is %{ Perl code ... %} */ | UNION CODE '\n' /* UNION CODE see yacc/bison */ | DEFAULTACTION CODE '\n' /* DEFAULTACTION is %defaultaction */ | TREE treeclauses? '\n' /* TREE is %tree */ | METATREE '\n' /* METATREE is %metatree */ | TYPE typedecl identlist '\n' /* TYPE is %type */ | EXPECT NUMBER '\n' /* EXPECT is %expect */ /* NUMBER is \d+ */ typedecl: /* empty */ | '<' IDENT '>' treeclauses: BYPASS ALIAS? | ALIAS BYPASS? symlist: symbol + identlist: ident + body: rules * '%%' rules: IDENT ':' rhss ';' rhss: rule <+ '|'> rule: optname rhs (prec epscode)? rhs: rhseltwithid * rhseltwithid : rhselt '.' IDENT | '$' rhselt | rhselt rhselt: symbol | code | '(' optname rhs ')' | rhselt STAR /* STAR is (%name\s*([A-Za-z_]\w*)\s*)?\* */ | rhselt '<' STAR symbol '>' | rhselt OPTION /* OPTION is (%name\s*([A-Za-z_]\w*)\s*)?\? */ | rhselt '<' PLUS symbol '>' | rhselt PLUS /* PLUS is (%name\s*([A-Za-z_]\w*)\s*)?\+ */ optname: (NAME IDENT)? /* NAME is %name */ | NOBYPASS IDENT /* NOBYPASS is %no\s+bypass */ prec: PREC symbol /* PREC is %prec */ epscode: code ? code: CODE /* CODE is { Perl code ... } */ | BEGINCODE /* BEGINCODE is %begin { Perl code ... } */ tail: TAILCODE ? /* TAILCODE is { Perl code ... } */ The semantic of C agrees with the semantic of C and C for all the common constructions. =head2 Comments Comments are either Perl style, from C<#> up to the end of line, or C style, enclosed between C and C<*/>. =head2 Syntactic Variables, Symbolic Tokens and String Literals Two kind of symbols may appear inside a Parse::Eyapp program: I symbols or I, called also I symbols and I symbols, called also I. Tokens are the symbols the lexical analyzer function returns to the parser. There are two kinds: I and I. I and I identifiers must conform to the regular expression C<[A-Za-z][A-Za-z0-9_]*>. When building the syntax tree (i.e. when running under the C<%tree> directive) I will be considered I (see section L). String literals are enclosed in single quotes and can contain almost anything. They will be received by the parser as double-quoted strings. Any special character as C<'"'>, C<'$'> and C<'@'> is escaped. To have a single quote inside a literal, escape it with '\'. When building the syntax tree (i.e. when running under the C<%tree> directive) I will be considered I (see section L). =head2 Parts of an C Program An Eyapp program has three parts called head, body and tail: eyapp: head body tail ; Each part is separated from the former by the symbol C<%%>: head: headsec '%%' body: rulesec '%%' =head2 The Head Section The head section contains a list of declarations headsec: decl * There are different kinds of declarations. This reference does not fully describes all the declarations that are shared with L and L. =head3 Example of Head Section In this and the next sections we will describe the basics of the Eyapp language using the file C that accompanies this distribution. This file implements a trivial calculator. Here is the header section: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '1,11p' Calc.eyp | cat -n 1 # examples/Calc.eyp 2 %right '=' 3 %left '-' '+' 4 %left '*' '/' 5 %left NEG 6 %right '^' 7 %{ 8 my %s; # symbol table 9 %} 10 11 %% =head3 Declarations and Precedence Lines 2-5 declare several tokens. The usual way to declare tokens is through the C<%token> directive. The declarations C<%nonassoc>, C<%left> and C<%right> not only declare the tokens but also associate a I with them. Tokens declared in the same line have the same precedence. Tokens declared with these directives in lines below have more precedence than those declared above. Thus, in the example above we are saying that C<"+"> and C<"-"> have the same precedence but higher precedence than =. The final effect of C<"-"> having greater precedence than = will be that an expression like: a = 4 - 5 will be interpreted as a = (4 - 5) and not as (a = 4) - 5 The use of the C<%left> indicates that - in case of ambiguity and a match between precedences - the parser must build the tree corresponding to a left parenthesization. Thus, the expression 4 - 5 - 9 will be interpreted as (4 - 5) - 9 =head3 Header Code Perl code surrounded by C<%{> and C<%}> can be inserted in the head section. Such code will be inserted in the module generated by C near the beginning. Therefore, declarations like the one of the calculator symbol table C<%s> 7 %{ 8 my %s; # symbol table 9 %} will be visible from almost any point in the file. =head3 The Start Symbol of the Grammar C<%start IDENT> declares C as the start symbol of the grammar. When C<%start> is not used, the first rule in the body section will be used. =head3 Expect The C<%expect #NUMBER> directive works as in L and suppress warnings when the number of Shift/Reduce conflicts is exactly C<#NUMBER>. See section L to know more about Shift/Reduce conflicts. =head3 Type and Union C oriented declarations like C<%type> and C<%union> are parsed but ignored. =head3 The C<%strict> Directive By default, identifiers appearing in the rule section will be classified as terminal if they don't appear in the left hand side of any production rules. The directive C<%strict> forces the declaration of all tokens. The following C program issues a warning: pl@nereida:~/LEyapp/examples$ cat -n bugyapp2.eyp 1 %strict 2 %% 3 expr: NUM; 4 %% pl@nereida:~/LEyapp/examples$ eyapp bugyapp2.eyp Warning! Non declared token NUM at line 3 of bugyapp2.eyp To keep silent the compiler declare all tokens using one of the token declaration directives (C<%token>, C<%left>, etc.) pl@nereida:~/LEyapp/examples$ cat -n bugyapp3.eyp 1 %strict 2 %token NUM 3 %% 4 expr: NUM; 5 %% pl@nereida:~/LEyapp/examples$ eyapp bugyapp3.eyp pl@nereida:~/LEyapp/examples$ It is a good practice to use C<%strict> at the beginning of your grammar. =head3 Default Action Directive In C you can modify the default action using the C<%defaultaction { Perl code }> directive. See section L. =head3 Tree Construction Directives C facilitates the construction of concrete syntax trees and abstract syntax trees (abbreviated AST from now on) through the C<%tree> C<%metatree> directives. See section L and L. =head3 Syntactic and Semantic Tokens The new token declaration directives C<%syntactic token> and C<%semantic token> can change the way C builds the abstract syntax tree. See section L. =head2 The Body The body section contains the rules describing the grammar: body: rules * '%%' rules: IDENT ':' rhss ';' rhss: (optname rhs (prec epscode)?) <+ '|'> =head3 Rules A rule is made of a left-hand-side symbol (the I), followed by a C<':'> and one or more I (or I) separated by C<'|'> and terminated by a C<';'> like in: exp: exp '+' exp | exp '-' exp | NUM ; A I (I) may be empty: input: /* empty */ | input line ; The former two productions can be abbreviated as input: line * ; The operators C<*>, C<+> and C are presented in section L. A I (This differs from L). =head3 Semantic Values and Semantic Actions In C a production rule A -> X_1 X_2 ... X_n can be followed by a I: A -> X_1 X_2 ... X_n { Perl Code } Such semantic action is nothing but Perl code that will be treated as an anonymous subroutine. The semantic action associated with production rule C X_1 X_2 ... X_n> is executed after any actions associated with the subtrees of C, C, ..., C. C parsers build the syntax tree using a left-right bottom-up traverse of the syntax tree. Each times the Parser visits the node associated with the production C X_1 X_2 ... X_n> the associated semantic action is called. Asociated with each symbol of a Parse::Eyapp grammar there is a scalar I or I. The semantic values of terminals are provided by the lexical analyzer. In the calculator example (see file C in the distribution), the semantic value associated with an expression is its numeric value. Thus in the rule: exp '+' exp { $_[1] + $_[3] } C<$_[1]> refers to the attribute of the first C, C<$_[2]> is the attribute associated with C<'+'>, which is the second component of the pair provided by the lexical analyzer and C<$_[3]> refers to the attribute of the second C. When the semantic action/anonymous subroutine is called, the arguments are as follows: =over =item * C<$_[1]> to C<$_[n]> are the attributes of the symbols C, C, ..., C. Just as C<$1> to C<$n> in L, =item * C<$_[0]> is the parser object itself. Having C<$_[0]> beeing the parser object itself allows you to call parser methods. Most L macros have been converted into parser methods. See section 'Methods Available in the Generated Class' in L. =back The returned value will be the attribute associated with the left hand side of the production. Names can be given to the attributes using the dot notation (see file C): exp.left '+' exp.right { $left + $right } See section L for more details about the I and I notations. If no action is specified and no C<%defaultaction> is specified the default action { $_[1] } will be executed instead. See section L to know more. =head3 Actions in Mid-Rule Actions can be inserted in the middle of a production like in: block: '{'.bracket { $ids->begin_scope(); } declaration*.decs statement*.sts '}' { ... } A middle production action is managed by inserting a new rule in the grammar and associating the semantic action with it: Temp: /* empty */ { $ids->begin_scope(); } Middle production actions can refer to the attributes on its left. They count as one of the components of the production. Thus the program: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '1,4p' intermediateaction2.yp %% S: 'a' { $_[1]x4 }.mid 'a' { print "$_[2], $mid, $_[3]\n"; } ; %% The auxiliar syntactic variables are named C<@#position-#order> where C<#position> is the position of the action in the rhs and C is an ordinal number. See the C<.output> file for the former example: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ eyapp -v intermediateaction2.yp pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '1,5p' intermediateaction2.output Rules: ------ 0: $start -> S $end 1: S -> 'a' @1-1 'a' 2: @1-1 -> /* empty */ when given input C the execution will produce as output C. =head3 Example of Body Section Following with the calculator example, the body is: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '12,48p' Calc.eyp | cat -n 1 start: 2 input { \%s } 3 ; 4 5 input: line * 6 ; 7 8 line: 9 '\n' { undef } 10 | exp '\n' { print "$_[1]\n" if defined($_[1]); $_[1] } 11 | error '\n' 12 { 13 $_[0]->YYErrok; 14 undef 15 } 16 ; 17 18 exp: 19 NUM 20 | $VAR { $s{$VAR} } 21 | $VAR '=' $exp { $s{$VAR} = $exp } 22 | exp.left '+' exp.right { $left + $right } 23 | exp.left '-' exp.right { $left - $right } 24 | exp.left '*' exp.right { $left * $right } 25 | exp.left '/' exp.right 26 { 27 $_[3] and return($_[1] / $_[3]); 28 $_[0]->YYData->{ERRMSG} = "Illegal division by zero.\n"; 29 $_[0]->YYError; # Pretend that a syntactic error ocurred: _Error will be called 30 undef 31 } 32 | '-' $exp %prec NEG { -$exp } 33 | exp.left '^' exp.right { $left ** $right } 34 | '(' $exp ')' { $exp } 35 ; 36 37 %% This example does not uses any of the Eyapp extensions (with the exception of the I at line 5) and the dot and dollar notations. Please, see the L pages and elsewhere documentation on L and L for more information. =head3 Solving Ambiguities and Conflicts When Eyapp analizes a grammar like: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ cat -n ambiguities.eyp 1 %% 2 exp: 3 NUM 4 | exp '-' exp 5 ; 6 %% it will produce a warning announcing the existence of I conflicts: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ eyapp ambiguities.eyp 1 shift/reduce conflict (see .output file) State 5: reduce by rule 2: exp -> exp '-' exp (default action) State 5: shifts: to state 3 with '-' pl@nereida:~/src/perl/YappWithDefaultAction/examples$ ls -ltr | tail -1 -rw-rw---- 1 pl users 1082 2007-02-06 08:26 ambiguities.output when C finds warnings automatically produces a C<.output> file describing the conflict. What the warning is saying is that an expression like C (rule 2) followed by a minus C<'-'> can be worked in more than one way. If we have an input like C the activity of a LALR(1) parser (the family of parsers to which Eyapp belongs) consists of a sequence of I. A I has as consequence the reading of the next token. A I is finding a production rule that matches and substituting the rhs of the production by the lhs. For input C the activity will be as follows (the dot is used to indicate where the next input token is): .NUM - NUM - NUM # shift NUM.- NUM - NUM # reduce exp: NUM exp.- NUM - NUM # shift exp -.NUM - NUM # shift exp - NUM.- NUM # reduce exp: NUM exp - exp.- NUM # shift/reduce conflict up this point two different decisions can be taken: the next description can be exp.- NUM # reduce by exp: exp '-' exp (rule 2) or: exp - exp -.NUM # shift '-' (to state 3) that is why it is called a I. That is also the reason for the precedence declarations in the head section. Another kind of conflicts are I. They arise when more that rhs can be applied for a reduction action. Eyapp solves the conflicts applying the following rules: =over =item * In a shift/reduce conflict, the default is the shift. =item * In a reduce/reduce conflict, the default is to reduce by the earlier grammar production (in the input sequence). =item * The precedences and associativities are associated with tokens in the declarations section. This is made by a sequence of lines beginning with one of the directives: C<%left>, C<%right>, or C<%nonassoc>, followed by a list of tokens. All the tokens on the same line have the same precedence and associativity; the lines are listed in order of increasing precedence. =item * A precedence and associativity is associated with each grammar production; it is the precedence and associativity of the I or I in the right hand side of the production. =item * The C<%prec> directive can be used when a rhs is involved in a conflict and has no tokens inside or it has but the precedence of the last token leads to an incorrect interpretation. A rhs can be followed by an optional C<%prec token> directive giving the production the precedence of the C exp: '-' exp %prec NEG { -$_[1] } =item * If there is a shift/reduce conflict, and both the grammar production and the input character have precedence and associativity associated with them, then the conflict is solved in favor of the action (shift or reduce) associated with the higher precedence. If the precedences are the same, then the associativity is used; left associative implies reduce, right associative implies shift, and nonassociating implies error. =back To solve a shift-reduce conflict between a production C SOMETHING> and a token C<'a'> you can follow this procedure: =over =item 1. Edit the C<.output> file =item 2. Search for the state where the conflict between the production and the token is. In our example it looks like: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '56,65p' ambiguities.output State 5: exp -> exp . '-' exp (Rule 2) exp -> exp '-' exp . (Rule 2) '-' shift, and go to state 3 '-' [reduce using rule 2 (exp)] $default reduce using rule 2 (exp) =item 3. Inside the state there has to be a production of the type C SOMETHING.> (with the dot at the end) indicating that a reduction must take place. There has to be also another production of the form C prefix . suffix>, where suffix can I with the involved token C<'a'>. =item 4. Decide what action shift or reduce matches the kind of trees you want. In this example we want C to produce a tree like C and not C. We want the conflict in C to be solved in favor of the reduction by C. This is achieved by declaring C<%left '-'>. =back =head3 Error Recovery The token name C is reserved for error handling. This name can be used in grammar productions; it suggests places where errors are expected, and recovery can take place: line: '\n' { undef } | exp '\n' { print "$_[1]\n" if defined($_[1]); $_[1] } | error '\n' { $_[0]->YYErrok; undef } The parser pops its stack until it enters a state where the token C is legal. It then shifts the token C and proceeds to discard tokens until finding one that is acceptable. In the example all the tokens until finding a C<'\n'> will be skipped. If no special error productions have been specified, the processing will halt. In order to prevent a cascade of error messages, the parser, after detecting an error, remains in error state until three tokens have been successfully read and shifted. If an error is detected when the parser is already in error state, no message is given, and the input token is quietly deleted. The method C used in the example communicates to the parser that a satisfactory recovery has been reached and that it can safely emit new error messages. You cannot have a literal I<'error'> in your grammar as it would confuse the driver with the I token. Use a symbolic token instead. =head2 The Tail The tail section contains Perl code. Usually the lexical analyzer and the Error management subroutines go there. A better practice however is to isolate both subroutines in a module and use them in the grammar. An example of this is in files C and C. =head3 The Lexical Analyzer The Lexical Analyzer is called each time the parser needs a new token. It is called with only one argument (the parser object) and returns a pair containing the next token and its associated attribute. The fact that is a method of the parser object means that the parser methods are accesible inside the lexical analyzer. Specially interesting is the C<$_[0]-EYYData> method which provides access to the user data area. I C<('', undef)> See below how to write a lexical analyzer (file C): 1 sub make_lexer { 2 my $input = shift; 3 4 return sub { 5 my $parser = shift; 6 7 for ($$input) { 8 m{\G[ \t]*}gc; 9 m{\G([0-9]+(?:\.[0-9]+)?)}gc and return ('NUM',$1); 10 m{\G([A-Za-z][A-Za-z0-9_]*)}gc and return ('VAR',$1); 11 m{\G\n}gc and do { $lineno++; return ("\n", "\n") }; 12 m{\G(.)}gc and return ($1,$1); 13 14 return('',undef); 15 } 16 } 17 } The subroutine C creates the lexical analyzer as a closure. The lexer returned by C is used by the C method: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '90,97p' Calc.eyp | cat -n 1 sub Run { 2 my($self)=shift; 3 my $input = shift or die "No input given\n"; 4 5 return $self->YYParse( yylex => make_lexer($input), yyerror => \&_Error, 6 #yydebug =>0x1F 7 ); 8 } =head3 The Error Report Subroutine The Error Report subroutine is also a parser method, and consequently receives as parameter the parser object. See the error report subroutine for the example in C: 1 %% 2 3 my $lineno = 1; 4 5 sub _Error { 6 my $parser = shift; 7 8 exists $parser->YYData->{ERRMSG} 9 and do { 10 print $parser->YYData->{ERRMSG}; 11 delete $parser->YYData->{ERRMSG}; 12 return; 13 }; 14 my($token)=$parser->YYCurval; 15 my($what)= $token ? "input: '$token'" : "end of input"; 16 my @expected = $parser->YYExpect(); 17 local $" = ', '; 18 print << "ERRMSG"; 19 20 Syntax error near $what (lin num $lineno). 21 Expected one of these terminals: @expected 22 ERRMSG 23 } See the L pages and elsewhere documentation on L and L for more information. =head2 Using an Eyapp Program The following is an example of a program that uses the calculator explained in the two previous sections: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ cat -n usecalc.pl 1 #!/usr/bin/perl -w 2 use strict; 3 use Calc; 4 5 my $parser = Calc->new(); 6 my $input = <<'EOI'; 7 a = 2*3 8 d = 5/(a-6) 9 b = (a+1)/7 10 c=a*3+4)-5 11 a = a+1 12 EOI 13 my $t = $parser->Run(\$input); 14 print "========= Symbol Table ==============\n"; 15 print "$_ = $t->{$_}\n" for sort keys %$t; The output for this program is (the input for each output appear as a Perl comment on the right): pl@nereida:~/src/perl/YappWithDefaultAction/examples$ eyapp Calc.eyp pl@nereida:~/src/perl/YappWithDefaultAction/examples$ usecalc.pl 6 # a = 2*3 Illegal division by zero. # d = 5/(a-6) 1 # b = (a+1)/7 Syntax error near input: ')' (lin num 4). # c=a*3+4)-5 Expected one of these terminals: -, /, ^, *, +, 7 # a = a+1 ========= Symbol Table ============== a = 7 b = 1 c = 22 =head2 Lists and Optionals The elements of a rhs can be one of these: rhselt: symbol | code | '(' optname rhs ')' | rhselt STAR /* STAR is (%name\s*([A-Za-z_]\w*)\s*)?\* */ | rhselt '<' STAR symbol '>' | rhselt OPTION /* OPTION is (%name\s*([A-Za-z_]\w*)\s*)?\? */ | rhselt '<' PLUS symbol '>' | rhselt PLUS /* PLUS is (%name\s*([A-Za-z_]\w*)\s*)?\+ */ The C, C: # File Postfix.eyp (See the examples/ directory) %right '=' %left '-' '+' %left '*' '/' %left NEG %defaultaction { return "$left $right $op"; } %% line: $exp { print "$exp\n" } ; exp: $NUM { $NUM } | $VAR { $VAR } | VAR.left '='.op exp.right | exp.left '+'.op exp.right | exp.left '-'.op exp.right | exp.left '*'.op exp.right | exp.left '/'.op exp.right | '-' $exp %prec NEG { "$exp NEG" } | '(' $exp ')' { $exp } ; %% # Support subroutines as in the Synopsis example ... The file containing the C program must be compiled with C: nereida:~/src/perl/YappWithDefaultAction/examples> eyapp Postfix.eyp Next, you have to write a client program: nereida:~/src/perl/YappWithDefaultAction/examples> cat -n usepostfix.pl 1 #!/usr/bin/perl -w 2 use strict; 3 use Postfix; 4 5 my $parser = new Postfix(); 6 $parser->Run; Now we can run the client program: nereida:~/src/perl/YappWithDefaultAction/examples> usepostfix.pl Write an expression: -(2*a-b*-3) 2 a * b 3 NEG * - NEG =head3 Default Actions, C<%name> and C In C each production rule has a name. The name of a rule can be explicitly given by the programmer using the C<%name> directive. For example, in the piece of code that follows the name C is given to the rule CYYName(); 11 bless { children => [ grep {ref($_)} @_] }, $name; 12 } 13 14 %% 15 input: 16 /* empty */ 17 { [] } 18 | input line 19 { 20 push @{$_[1]}, $_[2] if defined($_[2]); 21 $_[1] 22 } 23 ; 24 25 line: '\n' { } 26 | exp '\n' { $_[1] } 27 ; 28 29 exp: 30 NUM { $_[1] } 31 | VAR { $_[1] } 32 | %name ASSIGN 33 VAR '=' exp 34 | %name PLUS 35 exp '+' exp 36 | %name MINUS 37 exp '-' exp 38 | %name TIMES 39 exp '*' exp 40 | %name DIV 41 exp '/' exp 42 | %name UMINUS 43 '-' exp %prec NEG 44 | '(' exp ')' { $_[2] } 45 ; Inside a semantic action the name of the current rule can be recovered using the method C of the parser object. The default action (lines 8-12) computes as attribute of the left hand side a reference to an object blessed in the name of the rule. The object has an attribute C which is a reference to the list of children of the node. The call to C 11 bless { children => [ grep {ref($_)} @_] }, $name; excludes children that aren't references. Notice that the lexical analyzer only returns references for the C and C terminals: 59 sub _Lexer { 60 my($parser)=shift; 61 62 for ($parser->YYData->{INPUT}) { 63 s/^[ \t]+//; 64 return('',undef) unless $_; 65 s/^([0-9]+(?:\.[0-9]+)?)// 66 and return('NUM', bless { attr => $1}, 'NUM'); 67 s/^([A-Za-z][A-Za-z0-9_]*)// 68 and return('VAR',bless {attr => $1}, 'VAR'); 69 s/^(.)//s 70 and return($1, $1); 71 } 72 return('',undef); 73 } follows the client program: pl@nereida:~/LEyapp/examples$ cat -n uselhs.pl 1 #!/usr/bin/perl -w 2 use Lhs; 3 use Data::Dumper; 4 5 $parser = new Lhs(); 6 my $tree = $parser->Run; 7 $Data::Dumper::Indent = 1; 8 if (defined($tree)) { print Dumper($tree); } 9 else { print "Cadena no válida\n"; } When executed with input C the parser produces the following tree: ASSIGN(TIMES(PLUS(NUM[2],NUM[3]), VAR[b])) See the result of an execution: pl@nereida:~/LEyapp/examples$ uselhs.pl a=(2+3)*b $VAR1 = [ bless( { 'children' => [ bless( { 'attr' => 'a' }, 'VAR' ), bless( { 'children' => [ bless( { 'children' => [ bless( { 'attr' => '2' }, 'NUM' ), bless( { 'attr' => '3' }, 'NUM' ) ] }, 'PLUS' ), bless( { 'attr' => 'b' }, 'VAR' ) ] }, 'TIMES' ) ] }, 'ASSIGN' ) ]; The name of a production rule can be changed at execution time. See the following example: 29 exp: 30 NUM { $_[1] } 31 | VAR { $_[1] } 32 | %name ASSIGN 33 VAR '=' exp 34 | %name PLUS 35 exp '+' exp 36 | %name MINUS 37 exp '-' exp 38 { 39 my $self = shift; 40 $self->YYName('SUBSTRACT'); # rename it 41 $self->YYBuildAST(@_); # build the node 42 } 43 | %name TIMES 44 exp '*' exp 45 | %name DIV 46 exp '/' exp 47 | %name UMINUS 48 '-' exp %prec NEG 49 | '(' exp ')' { $_[2] } 50 ; When the client program is executed we can see the presence of the C nodes: pl@nereida:~/LEyapp/examples$ useyynamedynamic.pl 2-b $VAR1 = [ bless( { 'children' => [ bless( { 'attr' => '2' }, 'NUM' ), bless( { 'attr' => 'b' }, 'VAR' ) ] }, 'SUBSTRACT' ) ]; =head2 Abstract Syntax Trees : C<%tree> and C<%name> C facilitates the construction of concrete syntax trees and abstract syntax trees (abbreviated AST from now on) through the C<%tree> directive. Nodes in the AST are blessed in the production C. By default the name of a production is the concatenation of the left hand side and the production number. The production number is the ordinal number of the production as they appear in the associated C<.output> file (see option C<-v> of L). For example, given the grammar: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '9,28p' treewithoutnames.pl my $grammar = q{ %right '=' # Lowest precedence %left '-' '+' # + and - have more precedence than = Disambiguate a-b-c as (a-b)-c %left '*' '/' # * and / have more precedence than + Disambiguate a/b/c as (a/b)/c %left NEG # Disambiguate -a-b as (-a)-b and not as -(a-b) %tree # Let us build an abstract syntax tree ... %% line: exp <+ ';'> { $_[1] } /* list of expressions separated by ';' */ ; exp: NUM | VAR | VAR '=' exp | exp '+' exp | exp '-' exp | exp '*' exp | exp '/' exp | '-' exp %prec NEG | '(' exp ')' { $_[2] } ; The tree produced by the parser when feed with input C is: _PLUS_LIST(exp_6(TERMINAL[a],exp_9(exp_4(TERMINAL[2]),exp_5(TERMINAL[b])))) If we want to see the correspondence between names and rules we can generate and check the corresponding file C<.output>: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '28,42p' treewithoutnames.output Rules: ------ 0: $start -> line $end 1: PLUS-1 -> PLUS-1 ';' exp 2: PLUS-1 -> exp 3: line -> PLUS-1 4: exp -> NUM 5: exp -> VAR 6: exp -> VAR '=' exp 7: exp -> exp '+' exp 8: exp -> exp '-' exp 9: exp -> exp '*' exp 10: exp -> exp '/' exp 11: exp -> '-' exp 12: exp -> '(' exp ')' We can see now that the node C corresponds to the production C exp '*' exp>. Observe also that the Eyapp production: line: exp <+ ';'> actually produces the productions: 1: PLUS-1 -> PLUS-1 ';' exp 2: PLUS-1 -> exp and that the name of the class associated with the non empty list is C<_PLUS_LIST>. A production rule can be I using the C<%name IDENTIFIER> directive. For each production rule a namespace/package is created. I C I. Therefore, by modifying the former grammar with additional C<%name> directives: pl@nereida:~/src/perl/YappWithDefaultAction/examples$ sed -ne '8,26p' treewithnames.pl my $grammar = q{ %right '=' # Lowest precedence %left '-' '+' # + and - have more precedence than = Disambiguate a-b-c as (a-b)-c %left '*' '/' # * and / have more precedence than + Disambiguate a/b/c as (a/b)/c %left NEG # Disambiguate -a-b as (-a)-b and not as -(a-b) %tree # Let us build an abstract syntax tree ... %% line: exp <%name EXPS + ';'> { $_[1] } /* list of expressions separated by ';' */ ; exp: %name NUM NUM | %name VAR VAR | %name ASSIGN VAR '=' exp | %name PLUS exp '+' exp | %name MINUS exp '-' exp | %name TIMES exp '*' exp | %name DIV exp '/' exp | %name UMINUS '-' exp %prec NEG | '(' exp ')' { $_[2] } ; we are explictly naming the productions. Thus, all the node instances corresponding to the production C for an example. =head3 TERMINAL Nodes Nodes named C are built from the tokens provided by the lexical analyzer. C follows the same protocol than L for communication between the parser and the lexical analyzer: A couple C<($token, $attribute)> is returned by the lexical analyzer. These values are stored under the keys C and C. C nodes as all C nodes also have the attribute C but is - almost always - empty. =head3 Explicit Actions Inside C<%tree> Explicit actions can be specified by the programmer like in this line from the L C example: | '(' exp ')' { $_[2] } /* Let us simplify a bit the tree */ Explicit actions receive as arguments the references to the children nodes already built. The programmer can influence the shape of the tree by inserting these explicit actions. In this example the programmer has decided to simplify the syntax tree: the nodes associated with the parenthesis are discarded and the reference to the subtree containing the proper expression is returned. Such manoeuvre is called I. See section L to know more about I =head3 Explicitly Building Nodes With C Sometimes the best time to decorate a node with some attributes is just after being built. In such cases the programmer can take I building the node with C to inmediately proceed to decorate it. The following example illustrates the situation: Variable: %name VARARRAY $ID ('[' binary ']') <%name INDEXSPEC +> { my $self = shift; my $node = $self->YYBuildAST(@_); $node->{line} = $ID->[1]; return $node; } This production rule defines the expression to access an array element as an identifier followed by a non empty list of binary expressions C< Variable: ID ('[' binary ']')+>. Furthermore, the node corresponding to the list of indices has been named C. When no explicit action is inserted a binary node will be built having as first child the node corresponding to the identifier C<$ID> and as second child the reference to the list of binary expressions. The children corresponding to C<'['> and C<']'> are discarded since they are -by default- I (see section L). However, the programmer wants to decorate the node being built with a C attribute holding the line number in the source code where the identifier being used appears. The call to the C method C does the job of building the node. After that the node can be decorated and returned. Actually, the C<%tree> directive is semantically equivalent to: %default action { goto &Parse::Eyapp::Driver::YYBuildAST } =head3 Returning non References Under C<%tree> When a I. This fact can be used to supress nodes in the AST being built. See the following example (file C): nereida:~/src/perl/YappWithDefaultAction/examples> sed -ne '1,11p' returnnonode.yp | cat -n 1 %tree 2 %semantic token 'a' 'b' 3 %% 4 S: /* empty */ 5 | S A 6 | S B 7 ; 8 A : 'a' 9 ; 10 B : 'b' { } 11 ; since the action at line 10 returns C the C subtree will not be inserted in the AST: nereida:~/src/perl/YappWithDefaultAction/examples> usereturnnonode.pl ababa S_2(S_3(S_2(S_3(S_2(S_1,A_4(TERMINAL[a]))),A_4(TERMINAL[a]))),A_4(TERMINAL[a])) Observe the absence of Cs and C<'b'>s. =head3 Intermediate actions and C<%tree> Intermediate actions can be used to change the shape of the AST (prune it, decorate it, etc.) but the value returned by them is ignored. The grammar below has two intermediate actions. They modify the attributes of the node to its left and return a reference C<$f> to such node (lines 5 and 6): nereida:~/src/perl/YappWithDefaultAction/examples> \ sed -ne '1,10p' intermediateactiontree.yp | cat -n 1 %semantic token 'a' 'b' 2 %tree bypass 3 %% 4 S: /* empty */ 5 | S A.f { $f->{attr} = "A"; $f; } A 6 | S B.f { $f->{attr} = "B"; $f; } B 7 ; 8 A : %name A 'a' 9 ; 10 B : %name B 'b' See the client program running: nereida:~/src/perl/YappWithDefaultAction/examples> cat -n useintermediateactiontree.pl 1 #!/usr/bin/perl -w 2 use strict; 3 use Parse::Eyapp; 4 use intermediateactiontree; 5 6 { no warnings; 7 *A::info = *B::info = sub { $_[0]{attr} }; 8 } 9 10 my $parser = intermediateactiontree->new(); 11 my $t = $parser->Run; 12 print $t->str,"\n"; nereida:~/src/perl/YappWithDefaultAction/examples> useintermediateactiontree.pl aabbaa S_2(S_4(S_2(S_1,A[A],A[a]),B[B],B[b]),A[A],A[a]) The attributes of left Cs have been effectively changed by the intermediate actions from C<'a'> to C<'A'>. However no further children have been inserted. =head3 Syntactic and Semantic tokens C diferences between C and C. By default all tokens declared using string notation (i.e. between quotes like C<'+'>, C<'='>) are considered I. Tokens declared by an identifier (like C or C) are by default considered I. B. Thus, the first print in the former L C example: $parser->YYData->{INPUT} = "2*-3+b*0;--2\n"; my $t = $parser->Run; local $Parse::Eyapp::Node::INDENT=2; print "Syntax Tree:",$t->str; gives as result the following output: nereida:~/src/perl/YappWithDefaultAction/examples> synopsis.pl Syntax Tree: EXPRESION_LIST( PLUS( TIMES( NUM( TERMINAL[2] ), UMINUS( NUM( TERMINAL[3] ) ) # UMINUS ) # TIMES, TIMES( VAR( TERMINAL[b] ), NUM( TERMINAL[0] ) ) # TIMES ) # PLUS, UMINUS( UMINUS( NUM( TERMINAL[2] ) ) # UMINUS ) # UMINUS ) # EXPRESION_LIST C nodes corresponding to tokens that were defined by strings like C<'='>, C<'-'>, C<'+'>, C<'/'>, C<'*'>, C<'('> and C<')'> do not appear in the tree. C nodes corresponding to tokens that were defined using an identifer, like C or C are, by default, I and appear in the AST. =head3 Changing the Status of a Token The new token declaration directives C<%syntactic token> and C<%semantic token> can change the status of a token. For example (file C<15treewithsyntactictoken.pl> in the C directory), given the grammar: %syntactic token b %semantic token 'a' 'c' %tree %% S: %name ABC A B C | %name BC B C ; A: %name A 'a' ; B: %name B b ; C: %name C 'c' ; %% the tree build for input C will be C. =head3 Saving the Information of Syntactic Tokens in their Father The reason for the adjective C<%syntactic> applied to a token is to state that the token influences the shape of the syntax tree but carries no other information. When the syntax tree is built the node corresponding to the token is discarded. Sometimes the difference between syntactic and semantic tokens is blurred. For example the line number associated with an instance of the syntactic token C<'+'> can be used later -say during type checking- to emit a more accurate error diagnostic. But if the node was discarded the information about that line number is no longer available. When building the syntax tree C (namely the method C) checks if the method C exists and if so it will be called when dealing with a I. The method receives as argument - additionally to the reference to the attribute of the token as it is returned by the lexical analyzer - a reference to the node associated with the left hand side of the production. Here is an example (file C) of use: sub TERMINAL::save_attributes { # $_[0] is a syntactic terminal # $_[1] is the father. push @{$_[1]->{lines}}, $_[0]->[1]; # save the line number } =head3 The C clause and the C<%no bypass> directive The shape of the tree can be also modified using some C<%tree> clauses as C<%tree bypass> which will produce an automatic I of any node with only one child at tree-construction-time. A I consists in I (if a name was provided). A node may have only one child at tree-construction-time for one of two reasons. =over =item * The first occurs when the right hand side of the production was already unary like in: exp: %name NUM NUM Here - if the C clause is used - the C node will be bypassed and the child C built from the information provided by the lexical analyzer will be renamed/reblessed as C. =item * Another reason for a node to be I is the fact that though the right hand side of the production may have more than one symbol, only one of them is not a syntactic token like in: exp: '(' exp ')' =back A consequence of the global scope application of C<%tree bypass> is that undesired bypasses may occur like in exp : %name UMINUS '-' $exp %prec NEG though the right hand side has two symbols, token C<'-'> is a syntactic token and therefore only C is left. The I operation will be applied when building this node. This I can be avoided applying the C directive to the corresponding production: exp : %no bypass UMINUS '-' $exp %prec NEG The following example (file C) is the equivalent of the L C example but using the C clause instead: use Parse::Eyapp; use Parse::Eyapp::Treeregexp; sub TERMINAL::info { $_[0]{attr} } { no warnings; *VAR::info = *NUM::info = \&TERMINAL::info; } my $grammar = q{ %right '=' # Lowest precedence %left '-' '+' %left '*' '/' %left NEG # Disambiguate -a-b as (-a)-b and not as -(a-b) %tree bypass # Let us build an abstract syntax tree ... %% line: exp <%name EXPRESION_LIST + ';'> { $_[1] } ; exp: %name NUM NUM | %name VAR VAR | %name ASSIGN VAR '=' exp | %name PLUS exp '+' exp | %name MINUS exp '-' exp | %name TIMES exp '*' exp | %name DIV exp '/' exp | %no bypass UMINUS '-' $exp %prec NEG | '(' exp ')' ; %% # sub _Error, _Lexer and Run like in the synopsis example # ... }; # end grammar our (@all, $uminus); Parse::Eyapp->new_grammar( # Create the parser package/class input=>$grammar, classname=>'Calc', # The name of the package containing the parser firstline=>7 # String $grammar starts at line 7 (for error diagnostics) ); my $parser = Calc->new(); # Create a parser $parser->YYData->{INPUT} = "a=2*-3+b*0\n"; # Set the input my $t = $parser->Run; # Parse it! print "\n************\n".$t->str."\n************\n"; # Let us transform the tree. Define the tree-regular expressions .. my $p = Parse::Eyapp::Treeregexp->new( STRING => q{ { # Example of support code my %Op = (PLUS=>'+', MINUS => '-', TIMES=>'*', DIV => '/'); } constantfold: /TIMES|PLUS|DIV|MINUS/:bin(NUM, NUM) => { my $op = $Op{ref($_[0])}; $NUM[0]->{attr} = eval "$NUM[0]->{attr} $op $NUM[1]->{attr}"; $_[0] = $NUM[0]; } zero_times_whatever: TIMES(NUM, .) and { $NUM->{attr} == 0 } => { $_[0] = $NUM } whatever_times_zero: TIMES(., NUM) and { $NUM->{attr} == 0 } => { $_[0] = $NUM } uminus: UMINUS(NUM) => { $NUM->{attr} = -$NUM->{attr}; $_[0] = $NUM } }, OUTPUTFILE=> 'main.pm' ); $p->generate(); # Create the tranformations $t->s(@all); # constant folding and mult. by zero print $t->str,"\n"; when running this example with input C<"a=2*-3+b*0\n"> we obtain the following output: nereida:~/src/perl/YappWithDefaultAction/examples> bypass.pl ************ EXPRESION_LIST(ASSIGN(TERMINAL[a],PLUS(TIMES(NUM[2],UMINUS(NUM[3])),TIMES(VAR[b],NUM[0])))) ************ EXPRESION_LIST(ASSIGN(TERMINAL[a],NUM[-6])) As you can see the trees are more compact when using the C directive. =head3 The C clause of the C<%tree> directive Access to children in L is made through the C and C methods. There are occasions however where access by name to the children may be preferable. The use of the C clause with the C<%tree> directive creates accessors to the children with names specified by the programmer. The I are used for this. When dealing with a production like: A: %name A_Node Node B.bum N.pum $Chip methods C, C and C will be created for the class C. Those methods wil provide access to the respective child (first, second and third in the example). The methods are build at compile-time and therefore later transformations of the AST modifying the order of the children may invalidate the use of these getter-setters. As an example, the CPAN module L provides AST decorators from an attribute grammar specification of the AST. To work L requires named access to the children of the AST nodes. Follows an example (file C) of a small calculator: pl@nereida:~/LEyapp/examples$ cat -n CalcwithAttributeGrammar.pl 1 #!/usr/bin/perl -w 2 use strict; 3 use Parse::Eyapp; 4 use Data::Dumper; 5 use Language::AttributeGrammar; 6 7 my $grammar = q{ 8 %{ 9 # use Data::Dumper; 10 %} 11 %right '=' 12 %left '-' '+' 13 %left '*' '/' 14 %left NEG 15 %tree bypass alias 16 17 %% 18 line: $exp { $_[1] } 19 ; 20 21 exp: 22 %name NUM 23 $NUM 24 | %name VAR 25 $VAR 26 | %name ASSIGN 27 $VAR '=' $exp 28 | %name PLUS 29 exp.left '+' exp.right 30 | %name MINUS 31 exp.left '-' exp.right 32 | %name TIMES 33 exp.left '*' exp.right 34 | %name DIV 35 exp.left '/' exp.right 36 | %no bypass UMINUS 37 '-' $exp %prec NEG 38 | '(' $exp ')' { $_[2] } /* Let us simplify a bit the tree */ 39 ; 40 41 %% 42 43 sub _Error { 44 exists $_[0]->YYData->{ERRMSG} 45 and do { 46 print $_[0]->YYData->{ERRMSG}; 47 delete $_[0]->YYData->{ERRMSG}; 48 return; 49 }; 50 print "Syntax error.\n"; 51 } 52 53 sub _Lexer { 54 my($parser)=shift; 55 56 $parser->YYData->{INPUT} 57 or $parser->YYData->{INPUT} = 58 or return('',undef); 59 60 $parser->YYData->{INPUT}=~s/^\s+//; 61 62 for ($parser->YYData->{INPUT}) { 63 s/^([0-9]+(?:\.[0-9]+)?)// 64 and return('NUM',$1); 65 s/^([A-Za-z][A-Za-z0-9_]*)// 66 and return('VAR',$1); 67 s/^(.)//s 68 and return($1,$1); 69 } 70 } 71 72 sub Run { 73 my($self)=shift; 74 $self->YYParse( yylex => \&_Lexer, yyerror => \&_Error, 75 #yydebug =>0xFF 76 ); 77 } 78 }; # end grammar 79 80 81 $Data::Dumper::Indent = 1; 82 Parse::Eyapp->new_grammar( 83 input=>$grammar, 84 classname=>'Rule6', 85 firstline =>7, 86 outputfile => 'Calc.pm', 87 ); 88 my $parser = Rule6->new(); 89 $parser->YYData->{INPUT} = "a = -(2*3+5-1)\n"; 90 my $t = $parser->Run; 91 print "\n***** Before ******\n"; 92 print Dumper($t); 93 94 my $attgram = new Language::AttributeGrammar <<'EOG'; 95 96 # Compute the expression 97 NUM: $/.val = { $ } 98 TIMES: $/.val = { $.val * $.val } 99 PLUS: $/.val = { $.val + $.val } 100 MINUS: $/.val = { $.val - $.val } 101 UMINUS: $/.val = { -$.val } 102 ASSIGN: $/.val = { $.val } 103 EOG 104 105 my $res = $attgram->apply($t, 'val'); 106 107 $Data::Dumper::Indent = 1; 108 print "\n***** After ******\n"; 109 print Dumper($t); 110 print Dumper($res); =head1 SEE ALSO =over =item * L, L, L, L, L, L, L, L, L, L, L, =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The pdf file in L =item * The tutorial I C (An Introduction to Compiler Construction in seven pages) in L =item * perldoc L, =item * perldoc L, =item * perldoc L, =item * The Syntax Highlight file for vim at L and L =item * I, (Notes for a course in compiler construction) by Casiano Rodriguez-Leon. Available at L Is the more complete and reliable source for Parse::Eyapp. However is in Spanish. =item * L, =item * Man pages of yacc(1), =item * Man pages of bison(1), =item * L =item * L. =item * L =item * L =item * ocamlyacc tutorial at L =back =head1 REFERENCES =over =item * The classic Dragon's book I by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman (Addison-Wesley 1986) =item * I (See L, L and L) by Pete Jinks =back =head1 AUTHOR William N. Braswell, Jr. (Remove "NOSPAM".) =head1 ACKNOWLEDGMENTS This work has been supported by CEE (FEDER) and the Spanish Ministry of I through I number TIN2005-08818-C04-04 (ULL::OPLINK project L). Support from Gobierno de Canarias was through GC02210601 (I). The University of La Laguna has also supported my work in many ways and for many years. A large percentage of code is verbatim taken from L 1.05. The author of L is Francois Desarmenien. I wish to thank Francois Desarmenien for his L module, to my students at La Laguna and to the Perl Community. Special thanks to my family and Larry Wall. =head1 LICENSE AND COPYRIGHT Copyright © 2006, 2007, 2008, 2009, 2010, 2011, 2012 Casiano Rodriguez-Leon. Copyright © 2017 William N. Braswell, Jr. All Rights Reserved. Parse::Yapp is Copyright © 1998, 1999, 2000, 2001, Francois Desarmenien. Parse::Yapp is Copyright © 2017 William N. Braswell, Jr. All Rights Reserved. These modules are free software; you can redistribute it and/or modify it under the same terms as Perl itself. See L. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.