
ABSTRACT PARSE TREES

From Modern Compiler Implementation in C (pages 103-114)

Abstract Syntax

4.2 ABSTRACT PARSE TREES

It is possible to write an entire compiler that fits within the semantic action phrases of a Yacc parser. However, such a compiler is difficult to read and maintain. And this approach constrains the compiler to analyze the program in exactly the order it is parsed.

To improve modularity, it is better to separate issues of syntax (parsing) from issues of semantics (type-checking and translation to machine code).

One way to do this is for the parser to produce a parse tree – a data structure that later phases of the compiler can traverse. Technically, a parse tree has exactly one leaf for each token of the input and one internal node for each grammar rule reduced during the parse.


%{
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct table *Table_;
struct table {string id; int value; Table_ tail;};
Table_ Table(string id, int value, struct table *tail); /* see page 13 */
Table_ table=NULL;

/* Return the most recent value bound to id; id must be in the table. */
int lookup(Table_ table, string id) {
  assert(table!=NULL);
  if (strcmp(id, table->id)==0) return table->value;
  else return lookup(table->tail, id);
}

/* Bind id to value by pushing a new entry on the front of the list. */
void update(Table_ *tabptr, string id, int value) {
  *tabptr = Table(id, value, *tabptr);
}
%}

%union {int num; string id;}
%token <num> INT
%token <id> ID
%token ASSIGN PRINT LPAREN RPAREN COMMA
%type <num> exp
%right SEMICOLON
%left PLUS MINUS
%left TIMES DIV
%start prog

%%

prog: stm

stm : stm SEMICOLON stm
stm : ID ASSIGN exp {update(&table,$1,$3);}
stm : PRINT LPAREN exps RPAREN {printf("\n");}

exps: exp {printf("%d ", $1);}
exps: exps COMMA exp {printf("%d ", $3);}

exp : INT {$$=$1;}
exp : ID {$$=lookup(table,$1);}
exp : exp PLUS exp {$$=$1+$3;}
exp : exp MINUS exp {$$=$1-$3;}
exp : exp TIMES exp {$$=$1*$3;}
exp : exp DIV exp {$$=$1/$3;}
exp : stm COMMA exp {$$=$3;}
exp : LPAREN exp RPAREN {$$=$2;}

PROGRAM 4.4. An interpreter in imperative style.

S → S ; S          L →
S → id := E        L → L E
S → print L
E → id             B → +
E → num            B → −
E → E B E          B → ×
E → S , E          B → /

GRAMMAR 4.5. Abstract syntax of straight-line programs.

Such a parse tree, which we will call a concrete parse tree representing the concrete syntax of the source language, is inconvenient to use directly. Many of the punctuation tokens are redundant and convey no information – they are useful in the input string, but once the parse tree is built, the structure of the tree conveys the structuring information more conveniently.

Furthermore, the structure of the parse tree depends too much on the grammar! The grammar transformations shown in Chapter 3 – factoring, elimination of left recursion, elimination of ambiguity – involve the introduction of extra nonterminal symbols and extra grammar productions for technical purposes. These details should be confined to the parsing phase and should not clutter the semantic analysis.

An abstract syntax makes a clean interface between the parser and the later phases of a compiler (or, in fact, for the later phases of other kinds of program-analysis tools such as dependency analyzers). The abstract syntax tree conveys the phrase structure of the source program, with all parsing issues resolved but without any semantic interpretation.

Many early compilers did not use an abstract syntax data structure because early computers did not have enough memory to represent an entire compilation unit’s syntax tree. Modern computers rarely have this problem. And many modern programming languages (ML, Modula-3, Java) allow forward reference to identifiers defined later in the same module; using an abstract syntax tree makes compilation easier for these languages. It may be that Pascal and C require clumsy forward declarations because their designers wanted to avoid an extra compiler pass on the machines of the 1970s.

Grammar 4.5 shows the abstract syntax of a straight-line-program language. This grammar is completely impractical for parsing: the grammar is quite ambiguous, since precedence of the operators is not specified, and many of the punctuation keywords are missing.

However, Grammar 4.5 is not meant for parsing. The parser uses the concrete syntax (Program 4.6) to build a parse tree for the abstract syntax. The semantic analysis phase takes this abstract syntax tree; it is not bothered by the ambiguity of the grammar, since it already has the parse tree!

The compiler will need to represent and manipulate abstract syntax trees as data structures. In C, these data structures are organized according to the principles outlined in Section 1.3: a typedef for each nonterminal, a union-variant for each production, and so on. Program 1.5 shows the data structure declarations for Grammar 4.5.
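The organizing principle just described can be sketched for a fragment of Grammar 4.5. This is a minimal illustration under stated assumptions, not a reproduction of Program 1.5: only the nonterminals S and E and a few of their productions are covered, and the field names, the `const` in the `string` typedef, and the `checked_malloc` helper as written here are choices made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>

typedef const char *string;   /* the book's string is char *; const is added here */

/* One typedef per nonterminal (only S and E are sketched). */
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop;   /* B */

/* S: one union variant per production. */
struct A_stm_ {
    enum { A_compoundStm, A_assignStm } kind;
    union {
        struct { A_stm stm1, stm2; } compound;    /* S -> S ; S   */
        struct { string id; A_exp exp; } assign;  /* S -> id := E */
    } u;
};

/* E: one union variant per production. */
struct A_exp_ {
    enum { A_idExp, A_numExp, A_opExp } kind;
    union {
        string id;                                             /* E -> id    */
        int num;                                               /* E -> num   */
        struct { A_exp left; A_binop oper; A_exp right; } op;  /* E -> E B E */
    } u;
};

/* Allocate n bytes, aborting if memory is exhausted. */
static void *checked_malloc(size_t n) {
    void *p = malloc(n);
    assert(p != NULL);
    return p;
}

/* One constructor function per production. */
A_stm A_CompoundStm(A_stm stm1, A_stm stm2) {
    A_stm s = checked_malloc(sizeof *s);
    s->kind = A_compoundStm;
    s->u.compound.stm1 = stm1;
    s->u.compound.stm2 = stm2;
    return s;
}

A_stm A_AssignStm(string id, A_exp exp) {
    A_stm s = checked_malloc(sizeof *s);
    s->kind = A_assignStm;
    s->u.assign.id = id;
    s->u.assign.exp = exp;
    return s;
}

A_exp A_IdExp(string id) {
    A_exp e = checked_malloc(sizeof *e);
    e->kind = A_idExp;
    e->u.id = id;
    return e;
}

A_exp A_NumExp(int num) {
    A_exp e = checked_malloc(sizeof *e);
    e->kind = A_numExp;
    e->u.num = num;
    return e;
}

A_exp A_OpExp(A_exp left, A_binop oper, A_exp right) {
    A_exp e = checked_malloc(sizeof *e);
    e->kind = A_opExp;
    e->u.op.left = left;
    e->u.op.oper = oper;
    e->u.op.right = right;
    return e;
}
```

With these definitions, `A_AssignStm("b", A_OpExp(A_IdExp("a"), A_plus, A_NumExp(1)))` builds the tree for `b := a + 1`. A later phase then dispatches on `kind` and reads the matching member of `u`.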

The Yacc (or recursive-descent) parser, parsing the concrete syntax, constructs the abstract syntax tree. This is shown in Program 4.6.

POSITIONS

In a one-pass compiler, lexical analysis, parsing, and semantic analysis (type-checking) are all done simultaneously. If there is a type error that must be reported to the user, the current position of the lexical analyzer is a reasonable approximation of the source position of the error. In such a compiler, the lexical analyzer keeps a “current position” global variable, and the error-message routine just prints the value of that variable with each error message.

A compiler that uses abstract-syntax-tree data structures need not do all the parsing and semantic analysis in one pass. This makes life easier in many ways, but slightly complicates the production of semantic error messages.

The lexer reaches the end of file before semantic analysis even begins; so if a semantic error is detected in traversing the abstract syntax tree, the current position of the lexer (at end of file) will not be useful in generating a line-number for the error message. Thus, the source-file position of each node of the abstract syntax tree must be remembered, in case that node turns out to contain a semantic error.

To remember positions accurately, the abstract-syntax data structures must be sprinkled with pos fields. These indicate the position, within the original source file, of the characters from which these abstract syntax structures were derived. Then the type-checker can produce useful error messages.
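A small sketch may make the role of the pos fields concrete. Everything here is an assumption made for illustration: `A_pos` is taken to be a character offset, the constructors are simplified (the operator is omitted), and `check_arith` is a hypothetical stand-in for a type-checker that runs after parsing has finished.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef int A_pos;   /* assumed: character offset into the source file */

typedef struct A_exp_ *A_exp;
struct A_exp_ {
    enum { A_intExp, A_stringExp, A_opExp } kind;
    A_pos pos;                        /* recorded when the node is built */
    union {
        int intt;
        const char *stringg;
        struct { A_exp left, right; } op;
    } u;
};

/* Allocate a node and remember where in the source it came from. */
static A_exp new_exp(A_pos pos) {
    A_exp e = malloc(sizeof *e);
    assert(e != NULL);
    e->pos = pos;
    return e;
}

A_exp A_IntExp(A_pos pos, int i) {
    A_exp e = new_exp(pos);
    e->kind = A_intExp;
    e->u.intt = i;
    return e;
}

A_exp A_StringExp(A_pos pos, const char *s) {
    A_exp e = new_exp(pos);
    e->kind = A_stringExp;
    e->u.stringg = s;
    return e;
}

/* Simplified: a real OpExp would also carry the operator. */
A_exp A_OpExp(A_pos pos, A_exp left, A_exp right) {
    A_exp e = new_exp(pos);
    e->kind = A_opExp;
    e->u.op.left = left;
    e->u.op.right = right;
    return e;
}

/* Hypothetical check: every leaf under an operator must be an integer.
   Returns -1 if e is acceptable; otherwise reports the error using the
   position stored in the offending node and returns that position. */
A_pos check_arith(A_exp e) {
    A_pos bad;
    switch (e->kind) {
    case A_intExp:
        return -1;
    case A_stringExp:
        fprintf(stderr, "position %d: integer required\n", e->pos);
        return e->pos;
    case A_opExp:
        bad = check_arith(e->u.op.left);
        return bad >= 0 ? bad : check_arith(e->u.op.right);
    }
    return -1;
}
```

The check never consults the lexer: by the time it runs, the only trustworthy record of where a phrase came from is the position stored in the node itself.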

%{
#include "absyn.h"
%}

%union {int num; string id; A_stm stm; A_exp exp; A_expList expList;}
%token <num> INT
%token <id> ID
%token ASSIGN PRINT LPAREN RPAREN COMMA
%type <stm> stm prog
%type <exp> exp
%type <expList> exps
%left SEMICOLON
%left PLUS MINUS
%left TIMES DIV
%start prog

%%

prog: stm {$$=$1;}

stm : stm SEMICOLON stm {$$=A_CompoundStm($1,$3);}
stm : ID ASSIGN exp {$$=A_AssignStm($1,$3);}
stm : PRINT LPAREN exps RPAREN {$$=A_PrintStm($3);}

exps: exp {$$=A_ExpList($1,NULL);}
exps: exp COMMA exps {$$=A_ExpList($1,$3);}

exp : INT {$$=A_NumExp($1);}
exp : ID {$$=A_IdExp($1);}
exp : exp PLUS exp {$$=A_OpExp($1,A_plus,$3);}
exp : exp MINUS exp {$$=A_OpExp($1,A_minus,$3);}
exp : exp TIMES exp {$$=A_OpExp($1,A_times,$3);}
exp : exp DIV exp {$$=A_OpExp($1,A_div,$3);}
exp : stm COMMA exp {$$=A_EseqExp($1,$3);}
exp : LPAREN exp RPAREN {$$=$2;}

PROGRAM 4.6. Abstract-syntax builder for straight-line programs.

The lexer must pass the source-file positions of the beginning and end of each token to the parser. Ideally, the automatically generated parser should maintain a position stack along with the semantic value stack, so that the beginning and end positions of each token and phrase are available for the semantic actions to use in reporting error messages. The Bison parser generator can do this; Yacc does not. When using Yacc, one solution is to define a nonterminal symbol pos whose semantic value is a source location (line number, or line number and position within line). Then, if one wants to access the position of the PLUS from the semantic action after exp PLUS exp, the following works:


%{ extern A_exp A_OpExp(A_exp, A_binop, A_exp, position); %}
%union{int num; string id; position pos; · · · };
%type <pos> pos

pos : { $$ = EM_tokPos; }
exp : exp PLUS pos exp { $$ = A_OpExp($1,A_plus,$4,$3); }

But this trick can be dangerous. With pos after PLUS, it works; but with pos too early in the production, it does not:

exp : pos exp PLUS exp { $$ = A_OpExp($2,A_plus,$4,$1); }

This is because the LR(1) parser must reduce pos → ϵ before seeing the PLUS. A shift-reduce or reduce-reduce conflict will result.

ABSTRACT SYNTAX FOR Tiger

Figure 4.7 shows the abstract syntax of Tiger. The meaning of each constructor in the abstract syntax should be clear after a careful study of Appendix A, but there are a few points that merit explanation.

Figure 4.7 shows only the constructor functions, not the typedefs and structs. The definition of A_var would actually be written as

/* absyn.h */
typedef struct A_var_ *A_var;
struct A_var_
       {enum {A_simpleVar, A_fieldVar, A_subscriptVar} kind;
        A_pos pos;
        union {S_symbol simple;
               struct {A_var var;
                       S_symbol sym;} field;
               struct {A_var var;
                       A_exp exp;} subscript;
              } u;
       };

This follows the principles outlined on page 9.
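To show how the constructor functions relate to this struct, here is a minimal sketch of the three A_var constructors with the signatures listed in Figure 4.7. The bodies are an assumption about how such constructors are typically written, not the contents of absyn.c; `A_pos` as an int, the stand-in `S_symbol_` struct, and the helper `var_root` are hypothetical and exist only to make the sketch self-contained.

```c
#include <assert.h>
#include <stdlib.h>

typedef int A_pos;                        /* assumed representation */
typedef struct S_symbol_ *S_symbol;
struct S_symbol_ { const char *name; };   /* stand-in for the real symbol type */
typedef struct A_exp_ *A_exp;             /* left opaque in this sketch */

typedef struct A_var_ *A_var;
struct A_var_ {
    enum { A_simpleVar, A_fieldVar, A_subscriptVar } kind;
    A_pos pos;
    union {
        S_symbol simple;
        struct { A_var var; S_symbol sym; } field;
        struct { A_var var; A_exp exp; } subscript;
    } u;
};

/* Allocate a variable node and record its source position. */
static A_var new_var(A_pos pos) {
    A_var v = malloc(sizeof *v);
    assert(v != NULL);
    v->pos = pos;
    return v;
}

/* Each constructor sets the kind tag and fills the matching union member. */
A_var A_SimpleVar(A_pos pos, S_symbol sym) {
    A_var v = new_var(pos);
    v->kind = A_simpleVar;
    v->u.simple = sym;
    return v;
}

A_var A_FieldVar(A_pos pos, A_var var, S_symbol sym) {
    A_var v = new_var(pos);
    v->kind = A_fieldVar;
    v->u.field.var = var;
    v->u.field.sym = sym;
    return v;
}

A_var A_SubscriptVar(A_pos pos, A_var var, A_exp exp) {
    A_var v = new_var(pos);
    v->kind = A_subscriptVar;
    v->u.subscript.var = var;
    v->u.subscript.exp = exp;
    return v;
}

/* Hypothetical helper: the symbol at the root of a variable such as a.f[i]. */
S_symbol var_root(A_var v) {
    switch (v->kind) {
    case A_simpleVar:    return v->u.simple;
    case A_fieldVar:     return var_root(v->u.field.var);
    case A_subscriptVar: return var_root(v->u.subscript.var);
    }
    return NULL;
}
```

The helper illustrates the usual access pattern: switch on `kind`, then touch only the union member that the constructor for that kind filled in.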

/* absyn.h */
A_var A_SimpleVar(A_pos pos, S_symbol sym);
A_var A_FieldVar(A_pos pos, A_var var, S_symbol sym);
A_var A_SubscriptVar(A_pos pos, A_var var, A_exp exp);

A_exp A_VarExp(A_pos pos, A_var var);
A_exp A_NilExp(A_pos pos);
A_exp A_IntExp(A_pos pos, int i);
A_exp A_StringExp(A_pos pos, string s);
A_exp A_CallExp(A_pos pos, S_symbol func, A_expList args);
A_exp A_OpExp(A_pos pos, A_oper oper, A_exp left, A_exp right);
A_exp A_RecordExp(A_pos pos, S_symbol typ, A_efieldList fields);
A_exp A_SeqExp(A_pos pos, A_expList seq);
A_exp A_AssignExp(A_pos pos, A_var var, A_exp exp);
A_exp A_IfExp(A_pos pos, A_exp test, A_exp then, A_exp elsee);
A_exp A_WhileExp(A_pos pos, A_exp test, A_exp body);
A_exp A_BreakExp(A_pos pos);
A_exp A_ForExp(A_pos pos, S_symbol var, A_exp lo, A_exp hi, A_exp body);
A_exp A_LetExp(A_pos pos, A_decList decs, A_exp body);
A_exp A_ArrayExp(A_pos pos, S_symbol typ, A_exp size, A_exp init);

A_dec A_FunctionDec(A_pos pos, A_fundecList function);
A_dec A_VarDec(A_pos pos, S_symbol var, S_symbol typ, A_exp init);
A_dec A_TypeDec(A_pos pos, A_nametyList type);

A_ty A_NameTy(A_pos pos, S_symbol name);
A_ty A_RecordTy(A_pos pos, A_fieldList record);
A_ty A_ArrayTy(A_pos pos, S_symbol array);

A_field A_Field(A_pos pos, S_symbol name, S_symbol typ);
A_fieldList A_FieldList(A_field head, A_fieldList tail);
A_expList A_ExpList(A_exp head, A_expList tail);
A_fundec A_Fundec(A_pos pos, S_symbol name, A_fieldList params, S_symbol result, A_exp body);
A_fundecList A_FundecList(A_fundec head, A_fundecList tail);
A_decList A_DecList(A_dec head, A_decList tail);
A_namety A_Namety(S_symbol name, A_ty ty);
A_nametyList A_NametyList(A_namety head, A_nametyList tail);
A_efield A_Efield(S_symbol name, A_exp exp);
A_efieldList A_EfieldList(A_efield head, A_efieldList tail);

typedef enum {A_plusOp, A_minusOp, A_timesOp, A_divideOp,
              A_eqOp, A_neqOp, A_ltOp, A_leOp, A_gtOp, A_geOp} A_oper;

FIGURE 4.7. Abstract syntax for the Tiger language. Only the constructor functions are shown; the structure fields correspond exactly to the names of the constructor arguments.

The Tiger program

(a := 5; a+1)

translates into abstract syntax as

A_SeqExp(2,
  A_ExpList(A_AssignExp(4,A_SimpleVar(2,S_Symbol("a")),A_IntExp(7,5)),
  A_ExpList((A_OpExp(11,A_plusOp,A_VarExp(A_SimpleVar(10,S_Symbol("a"))),
                     A_IntExp(12,1))),
  NULL)))

This is a sequence expression containing two expressions separated by a semicolon: an assignment expression and an operator expression. Within these are a variable expression and two integer constant expressions.

The positions sprinkled throughout are source-code character counts. The position I have chosen to associate with an AssignExp is that of the := operator, for an OpExp that of the + operator, and so on. These decisions are a matter of taste; they represent my guesses about how they will look when included in semantic error messages.

Now consider

let var a := 5
    function f() : int = g(a)
    function g(i: int) = f()
 in f()
end

The Tiger language treats adjacent function declarations as (possibly) mutually recursive. The FunctionDec constructor of the abstract syntax takes a list of function declarations, not just a single function. The intent is that this list is a maximal consecutive sequence of function declarations. Thus, functions declared by the same FunctionDec can be mutually recursive.

Therefore, this program translates into the abstract syntax

A_LetExp(
  A_DecList(A_VarDec(S_Symbol("a"),NULL,A_IntExp(5)),
  A_DecList(A_FunctionDec(
    A_FundecList(A_Fundec(
        S_Symbol("f"),NULL,S_Symbol("int"),
        A_CallExp(S_Symbol("g"), · · ·)),
    A_FundecList(A_Fundec(
        S_Symbol("g"),
        A_FieldList(S_Symbol("i"),S_Symbol("int"),NULL),
        NULL,
        A_CallExp(S_Symbol("f"), · · ·)),
    NULL))),
  NULL)),
  A_CallExp(S_Symbol("f"), NULL))

where the positions are omitted for clarity.
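The "maximal consecutive sequence" rule can be sketched independently of any parser. The code below is an illustration under assumptions, not part of the Tiger support files: `dec_kind` and `group_decs` are hypothetical names, and a real grammar would normally achieve the same grouping through the shape of its declaration rules rather than in a separate pass.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEC_VAR, DEC_FUNCTION, DEC_TYPE } dec_kind;   /* hypothetical */

/* Assign a group number to each of the n declarations, in source order.
   Adjacent function declarations share a group, as do adjacent type
   declarations; every variable declaration gets a group of its own.
   Returns the number of groups, i.e. the number of A_dec nodes needed. */
int group_decs(const dec_kind *kinds, size_t n, int *group) {
    int ngroups = 0;
    for (size_t i = 0; i < n; i++) {
        int joins_previous =
            i > 0 && kinds[i] != DEC_VAR && kinds[i] == kinds[i - 1];
        if (!joins_previous)
            ngroups++;               /* start a new A_dec node */
        group[i] = ngroups - 1;
    }
    return ngroups;
}
```

For the program above (a variable declaration followed by functions f and g) this yields two groups: one VarDec and one FunctionDec holding both functions. A variable declaration placed between f and g would split them into separate FunctionDec nodes, and they could then no longer be mutually recursive.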

The TypeDec constructor also takes a list of type declarations, for the same reason; consider the declarations

type tree = {key: int, children: treelist}

type treelist = {head: tree, tail: treelist}

which translate to one type declaration, not two:

A_TypeDec(
  A_NametyList(A_Namety(S_Symbol("tree"),
      A_RecordTy(
        A_FieldList(A_Field(S_Symbol("key"),S_Symbol("int")),
        A_FieldList(A_Field(S_Symbol("children"),S_Symbol("treelist")),
        NULL)))),
  A_NametyList(A_Namety(S_Symbol("treelist"),
      A_RecordTy(
        A_FieldList(A_Field(S_Symbol("head"),S_Symbol("tree")),
        A_FieldList(A_Field(S_Symbol("tail"),S_Symbol("treelist")),
        NULL)))),
  NULL)))

There is no abstract syntax for “&” and “|” expressions; instead, e1&e2 is translated as if e1 then e2 else 0, and e1|e2 is translated as though it had been written if e1 then 1 else e2.
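The two translations can be written directly in terms of the A_IfExp and A_IntExp constructors of Figure 4.7. The sketch below is a minimal illustration: the node layout is cut down to integers and conditionals, the helper names `and_exp` and `or_exp` are hypothetical, and `eval` exists only so the translation can be checked.

```c
#include <assert.h>
#include <stdlib.h>

typedef int A_pos;   /* assumed representation */

typedef struct A_exp_ *A_exp;
struct A_exp_ {
    enum { A_intExp, A_ifExp } kind;
    A_pos pos;
    union {
        int intt;
        struct { A_exp test, then, elsee; } iff;
    } u;
};

static A_exp new_exp(A_pos pos) {
    A_exp e = malloc(sizeof *e);
    assert(e != NULL);
    e->pos = pos;
    return e;
}

A_exp A_IntExp(A_pos pos, int i) {
    A_exp e = new_exp(pos);
    e->kind = A_intExp;
    e->u.intt = i;
    return e;
}

A_exp A_IfExp(A_pos pos, A_exp test, A_exp then, A_exp elsee) {
    A_exp e = new_exp(pos);
    e->kind = A_ifExp;
    e->u.iff.test = test;
    e->u.iff.then = then;
    e->u.iff.elsee = elsee;
    return e;
}

/* e1 & e2  becomes  if e1 then e2 else 0 */
A_exp and_exp(A_pos pos, A_exp e1, A_exp e2) {
    return A_IfExp(pos, e1, e2, A_IntExp(pos, 0));
}

/* e1 | e2  becomes  if e1 then 1 else e2 */
A_exp or_exp(A_pos pos, A_exp e1, A_exp e2) {
    return A_IfExp(pos, e1, A_IntExp(pos, 1), e2);
}

/* Tiny evaluator, used only to check the translation. */
int eval(A_exp e) {
    if (e->kind == A_intExp)
        return e->u.intt;
    return eval(e->u.iff.test) ? eval(e->u.iff.then) : eval(e->u.iff.elsee);
}
```

In a Yacc grammar, the semantic actions for the `&` and `|` productions would build these conditionals directly. Short-circuit evaluation comes for free, because the untaken branch of the conditional is never evaluated.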

Similarly, unary negation (−i) should be represented as subtraction (0 − i) in the abstract syntax.¹ Also, where the body of a LetExp has multiple statements, we must use a seqExp. An empty statement is represented by A_seqExp(NULL).

By using these representations for &, |, and unary negation, we keep the abstract syntax data type smaller and make fewer cases for the semantic analysis phase to process. On the other hand, it becomes harder for the type-checker to give meaningful error messages that relate to the source code.

The lexer returns ID tokens with string values. The abstract syntax requires identifiers to have symbol values. The function S_symbol (from symbol.h) converts strings to symbols, and the function S_name converts back. The representation of symbols is discussed in Chapter 5.

¹ This might not be adequate in an industrial-strength compiler. The most negative two’s complement integer of a given size cannot be represented as 0 − i for any i of the same size. In floating point numbers, 0 − x is not the same as −x if x = 0. We will neglect these issues in the Tiger compiler.
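Both caveats in the footnote can be checked with a few lines of C. This is a sketch assuming IEEE 754 doubles with the default rounding mode and a two's complement int; the function names are made up for the illustration.

```c
#include <assert.h>
#include <limits.h>
#include <math.h>

/* 1 if 0 - x and -x differ in their sign bit, 0 otherwise.
   For x = 0.0, the subtraction yields +0.0 but the negation yields -0.0. */
int subtraction_changes_sign(double x) {
    double by_subtraction = 0.0 - x;
    double by_negation = -x;
    return !signbit(by_subtraction) != !signbit(by_negation);
}

/* 1 if some int i satisfies 0 - i == n without overflow, 0 otherwise.
   Such an i would have to equal -n, which must not exceed INT_MAX. */
int expressible_as_zero_minus(int n) {
    return n >= -INT_MAX;
}
```

Under those assumptions, `subtraction_changes_sign(0.0)` is 1 and `expressible_as_zero_minus(INT_MIN)` is 0, which are exactly the two cases the footnote warns about.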


The semantic analysis phase of the compiler will need to keep track of which local variables are used from within nested functions. The escape component of a varDec or field is used to keep track of this. This escape field is not mentioned in the parameters of the constructor function, but is always initialized to TRUE, which is a conservative approximation. The field type is used for both formal parameters and record fields; escape has meaning for formal parameters, but for record fields it can be ignored.

Having the escape fields in the abstract syntax is a “hack,” since escaping is a global, nonsyntactic property. But leaving escape out of the Absyn would require another data structure for describing escapes.
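As a rough illustration of the convention just described, a constructor can set escape itself instead of taking it as a parameter. The signature of A_Field follows Figure 4.7; the struct layout, the use of `<stdbool.h>`, and the stand-in symbol type are assumptions made for this sketch rather than the contents of absyn.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TRUE true    /* the book's util.h supplies its own bool and TRUE */

typedef int A_pos;                        /* assumed representation */
typedef struct S_symbol_ *S_symbol;
struct S_symbol_ { const char *name; };   /* stand-in for the real symbol type */

typedef struct A_field_ *A_field;
struct A_field_ {
    A_pos pos;
    S_symbol name, typ;
    bool escape;     /* not a constructor parameter */
};

A_field A_Field(A_pos pos, S_symbol name, S_symbol typ) {
    A_field f = malloc(sizeof *f);
    assert(f != NULL);
    f->pos = pos;
    f->name = name;
    f->typ = typ;
    f->escape = TRUE;   /* conservative: assume it escapes until shown otherwise */
    return f;
}
```

A later analysis is then free to overwrite `escape` with a more precise answer. Until it does, treating every variable as escaping is safe, only pessimistic.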

PROGRAM: ABSTRACT SYNTAX

Add semantic actions to your parser to produce abstract syntax for the Tiger language.

You should turn in the file tiger.grm.

Supporting files available in $TIGER/chap4 include:

absyn.h The abstract syntax declarations for Tiger.

absyn.c Implementation of the constructor functions.

prabsyn.[ch] A pretty-printer for abstract syntax trees, so you can see your results.

errormsg.[ch] As before.

lex.yy.c Use this only if your own lexical analyzer still isn’t working.

symbol.[ch] A module to turn strings into symbols.

makefile As usual.

parse.[ch] A driver to run your parser on an input file.

tiger.grm The skeleton of a grammar specification.

FURTHER READING

Many compilers mix recursive-descent parsing code with semantic-action code, as shown in Program 4.1; Gries [1971] and Fraser and Hanson [1995] are ancient and modern examples. Machine-generated parsers with semantic actions (in special-purpose “semantic-action mini-languages”) attached to the grammar productions were tried out in the 1960s [Feldman and Gries 1968]; Yacc [Johnson 1975] was one of the first to permit semantic action fragments to be written in a conventional, general-purpose programming language.

The notion of abstract syntax is due to McCarthy [1963], who designed the abstract syntax for Lisp [McCarthy et al. 1962]. The abstract syntax was intended to be used for writing programs until designers could get around to creating a concrete syntax with human-readable punctuation (instead of Lots of Irritating Silly Parentheses), but programmers soon got used to programming directly in abstract syntax.

The search for a theory of programming-language semantics, and a notation for expressing semantics in a compiler-compiler, led to ideas such as denotational semantics [Stoy 1977]. The denotational semanticists also advocated the separation of concrete syntax from semantics – using abstract syntax as a clean interface – because in a full-sized programming language the syntactic clutter gets in the way of understanding the semantic analysis.

EXERCISES

4.1 Write type declarations and constructor functions to express the abstract syntax of regular expressions.

4.2 Implement Program 4.4 as a recursive-descent parser, with the semantic actions embedded in the parsing functions.
