
From the document Modern Compiler Implementation in C (pages 89-99)


4. Resume normal parsing

In the two error productions illustrated above, we have taken care to follow the error symbol with an appropriate synchronizing token – in this case, right parenthesis or semicolon. Thus, the “non-error action” taken in step 3 will always shift. If instead we used the production exp → error, the “non-error action” would be reduce, and (in an SLR or LALR parser) it is possible that the original (erroneous) lookahead symbol would cause another error after the reduce action, without having advanced the input. Therefore, grammar rules that contain error not followed by a token should be used only when there is no good alternative.

Caution. One can attach semantic actions to Yacc grammar rules; whenever a rule is reduced, its semantic action is executed. Chapter 4 explains the use of semantic actions. Popping states from the stack can lead to seemingly “impossible” semantic actions, especially if the actions contain side effects. Consider this grammar fragment:

statements: statements exp SEMICOLON
          | statements error SEMICOLON
          | /* empty */

exp : increment exp decrement
    | ID

increment: LPAREN {nest=nest+1;}

decrement: RPAREN {nest=nest-1;}

3.5. ERROR RECOVERY

“Obviously” it is true that whenever a semicolon is reached, the value of nest is zero, because it is incremented and decremented in a balanced way according to the grammar of expressions. But if a syntax error is found after some left parentheses have been parsed, then states will be popped from the stack without “completing” them, leading to a nonzero value of nest. The best solution to this problem is to have side-effect-free semantic actions that build abstract syntax trees, as described in Chapter 4.
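The hazard can be reproduced in a few lines. The following Python sketch is not an LR parser; it only mimics the side effects of the increment and decrement actions, under the assumption that error recovery discards everything up to the next SEMICOLON without running any pending decrement action. The function name run_actions and the simplified statement shape (LPAREN* ID RPAREN* SEMICOLON) are illustrative choices, not taken from the text.

```python
def run_actions(tokens):
    """Mimic the side effects of the increment/decrement actions.

    Returns the value of nest observed at each SEMICOLON.
    Assumption: a statement is LPAREN^n ID RPAREN^n SEMICOLON; anything
    else is a syntax error, recovered by skipping to the next SEMICOLON.
    """
    nest = 0
    seen = []
    i = 0
    while i < len(tokens):
        opened = 0
        while i < len(tokens) and tokens[i] == "LPAREN":
            nest += 1          # action of: increment -> LPAREN
            opened += 1
            i += 1
        ok = i < len(tokens) and tokens[i] == "ID"
        if ok:
            i += 1
            closed = 0
            while i < len(tokens) and tokens[i] == "RPAREN" and closed < opened:
                nest -= 1      # action of: decrement -> RPAREN
                closed += 1
                i += 1
            ok = closed == opened and i < len(tokens) and tokens[i] == "SEMICOLON"
        if not ok:
            # Error recovery: states are popped and tokens discarded,
            # but no semantic action runs to undo earlier increments.
            while i < len(tokens) and tokens[i] != "SEMICOLON":
                i += 1
        if i < len(tokens):    # positioned at a SEMICOLON
            seen.append(nest)
            i += 1
    return seen
```

For the input ( ( ID ) ; ID ; the sketch reports nest = 1 at both semicolons: the unmatched left parenthesis in the first statement leaves the counter permanently off by one, even though the second statement is well formed.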

GLOBAL ERROR REPAIR

What if the best way to recover from the error is to insert or delete tokens from the input stream at a point before where the error was detected? Consider the following Tiger program:

let type a := intArray [ 10 ] of 0 in . . .

A local technique will discover a syntax error with := as lookahead symbol. Error recovery based on error productions would likely delete the phrase from type to 0, resynchronizing on the in token. Some local repair techniques can insert tokens as well as delete them; but even a local repair that replaces the := by = is not very good, and will encounter another syntax error at the [ token. Really, the programmer’s mistake here is in using type instead of var, but the error is detected two tokens too late.

Global error repair finds the smallest set of insertions and deletions that would turn the source string into a syntactically correct string, even if the insertions and deletions are not at a point where an LL or LR parser would first report an error. In this case, global error repair would do a single-token substitution, replacing type by var.

Burke-Fisher error repair. I will describe a limited but useful form of global error repair, which tries every possible single-token insertion, deletion, or replacement at every point that occurs no earlier than K tokens before the point where the parser reported the error. Thus, with K = 15, if the parsing engine gets stuck at the 100th token of the input, then it will try every possible repair between the 85th and 100th token.

The correction that allows the parser to parse furthest past the original reported error is taken as the best error repair. Thus, if a single-token substitution of var for type at the 98th token allows the parsing engine to proceed past the 104th token without getting stuck, this repair is a successful one.

[Figure 3.38 (diagram not reproduced): the old stack, the 6-token queue, and the current stack during a parse of the input a := 7 ; b := c + ( d := 5 + 6 , d ) $]

FIGURE 3.38. Burke-Fisher parsing, with an error-repair queue. Figure 3.18 shows the complete parse of this string according to Table 3.19.

Generally, if a repair carries the parser R = 4 tokens beyond where it originally got stuck, this is “good enough.”

The advantage of this technique is that the LL(k) or LR(k) (or LALR, etc.) grammar is not modified at all (no error productions), nor are the parsing tables modified. Only the parsing engine, which interprets the parsing tables, is modified.

The parsing engine must be able to back up K tokens and reparse. To do this, it needs to remember what the parse stack looked like K tokens ago.

Therefore, the algorithm maintains two parse stacks: the current stack and the old stack. A queue of K tokens is kept; as each new token is shifted, it is pushed on the current stack and also put onto the tail of the queue; simultaneously, the head of the queue is removed and shifted onto the old stack.

With each shift onto the old or current stack, the appropriate reduce actions are also performed. Figure 3.38 illustrates the two stacks and queue.
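The bookkeeping described above can be sketched as follows. This is a simplified model, not a parsing engine: the two stacks are represented as plain lists of tokens, whereas a real implementation would hold LR states and perform reductions after each shift. The class name TwoStackShifter and its attribute names are made up for illustration.

```python
from collections import deque


class TwoStackShifter:
    """Track which tokens have reached the current and the old stack.

    Assumption: stacks are modelled as token lists; reductions are omitted.
    """

    def __init__(self, k):
        self.k = k
        self.queue = deque()  # the most recent k shifted tokens
        self.current = []     # every token shifted so far
        self.old = []         # tokens shifted at least k tokens ago

    def shift(self, token):
        self.current.append(token)
        self.queue.append(token)
        if len(self.queue) > self.k:
            # The head of the queue leaves the window and is
            # shifted (permanently) onto the old stack.
            self.old.append(self.queue.popleft())
```

At every moment the old stack followed by the queue contents equals the current stack, so the old stack is exactly the state from which the last K tokens can be reparsed.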

Now suppose a syntax error is detected at the current token. For each possible insertion, deletion, or substitution of a token at any position of the queue, the Burke-Fisher error repairer makes that change within (a copy of) the queue, then attempts to reparse from the old stack. The success of a modification is measured by how many tokens past the current token can be parsed; generally, if three or four new tokens can be parsed, this is considered a completely successful repair.

In a language with N kinds of tokens, there are K + K · N + K · N possible deletions, insertions, and substitutions within the K-token window. Trying this many repairs is not very costly, especially considering that it happens only when a syntax error is discovered, not during ordinary parsing.
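The search itself can be sketched in Python under several simplifying assumptions. The grammar is a toy one invented for this example (statements of the form id := expr ; with +, parentheses, id and num), the recognizer stuck_at is a small recursive-descent routine rather than a table-driven LR engine, and each candidate is reparsed from the start of the input instead of from a saved old stack. The names KINDS, stuck_at, candidate_repairs and best_repair are all hypothetical, and the window here includes the error token itself.

```python
KINDS = ["id", "num", ":=", "+", "(", ")", ";"]  # toy token kinds (assumed)


class Stuck(Exception):
    def __init__(self, pos):
        self.pos = pos


def stuck_at(tokens):
    """Index of the first token the toy parser cannot accept.

    Returns len(tokens) if every token was consumed.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def eat(kind):
        nonlocal pos
        if peek() != kind:
            raise Stuck(pos)
        pos += 1

    def term():
        if peek() in ("id", "num"):
            eat(peek())
        else:
            eat("(")
            expr()
            eat(")")

    def expr():
        term()
        while peek() == "+":
            eat("+")
            term()

    try:
        while peek() != "$":
            eat("id")
            eat(":=")
            expr()
            eat(";")
    except Stuck as e:
        return e.pos
    return len(tokens)


def candidate_repairs(tokens, error_pos, k):
    """Yield (edit, repaired_tokens) for every single-token edit in the window."""
    for i in range(max(0, error_pos - k), error_pos + 1):
        yield ("delete", i, None), tokens[:i] + tokens[i + 1:]
        for kind in KINDS:
            yield ("insert", i, kind), tokens[:i] + [kind] + tokens[i:]
            if kind != tokens[i]:  # skip substitutions that change nothing
                yield ("replace", i, kind), tokens[:i] + [kind] + tokens[i + 1:]


def best_repair(tokens, k=15, r=4):
    """Return the edit that lets the parser get furthest, or None."""
    error_pos = stuck_at(tokens)
    if error_pos == len(tokens):
        return None  # nothing to repair
    remaining = len(tokens) - error_pos
    best = None
    for edit, repaired in candidate_repairs(tokens, error_pos, k):
        # Progress = how many fewer tokens are left unparsed after the edit.
        advance = remaining - (len(repaired) - stuck_at(repaired))
        if best is None or advance > best[0]:
            best = (advance, edit, repaired)
    advance, edit, repaired = best
    complete = stuck_at(repaired) == len(repaired)
    return edit if (advance >= r or complete) else None
```

For the token list of a := ( b + c ; d := e ; the toy parser gets stuck at the first semicolon. With K = 15 the sketch proposes deleting the unmatched left parenthesis; with K = 2 that token lies outside the window, and it proposes inserting a right parenthesis instead. Ties are broken in favor of the first candidate found, which is an arbitrary choice of this sketch.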

Semantic actions. Shift and reduce actions are tried repeatedly and discarded during the search for the best error repair. Parser generators usually perform programmer-specified semantic actions along with each reduce action, but the programmer does not expect that these actions will be performed repeatedly and discarded – they may have side effects. Therefore, a Burke-Fisher parser does not execute any of the semantic actions as reductions are performed on the current stack, but waits until the same reductions are performed (permanently) on the old stack.

This means that the lexical analyzer may be up to K + R tokens ahead of the point to which semantic actions have been performed. If semantic actions affect lexical analysis – as they do in C, compiling the typedef feature – this can be a problem with the Burke-Fisher approach. For languages with a pure context-free grammar approach to syntax, the delay of semantic actions poses no problem.

Semantic values for insertions. In repairing an error by insertion, the parser needs to provide a semantic value for each token it inserts, so that semantic actions can be performed as if the token had come from the lexical analyzer.

For punctuation tokens no value is necessary, but when tokens such as numbers or identifiers must be inserted, where can the value come from? The ML-Yacc parser generator, which uses Burke-Fisher error correction, has a %value directive, allowing the programmer to specify what value should be used when inserting each kind of token:

%value ID ("bogus")
%value INT (1)
%value STRING ("")
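One way to picture what these directives supply is a lookup table consulted whenever the repairer invents a token. The Python sketch below is only an illustration of that idea; the names DEFAULT_VALUES and inserted_token are hypothetical and do not describe how ML-Yacc is implemented internally.

```python
# Hypothetical table mirroring the three %value directives above.
DEFAULT_VALUES = {"ID": "bogus", "INT": 1, "STRING": ""}


def inserted_token(kind):
    """Build a (kind, value) pair for a token invented by error repair.

    Punctuation kinds are absent from the table and carry no value (None).
    """
    return (kind, DEFAULT_VALUES.get(kind))
```

With this table, an inserted ID carries the value "bogus", while an inserted SEMICOLON carries no value at all.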

Programmer-specified substitutions. Some common kinds of errors cannot be repaired by the insertion or deletion of a single token, and sometimes a particular single-token insertion or substitution is very commonly required and should be tried first. Therefore, in an ML-Yacc grammar specification the programmer can use the %change directive to suggest error corrections to be tried first, before the default “delete or insert each possible token” repairs.

%change EQ -> ASSIGN | ASSIGN -> EQ
      | SEMICOLON ELSE -> ELSE | -> IN INT END

Here the programmer is suggesting that users often write “; else” where they mean “else” and so on.

The insertion of in 0 end is a particularly important kind of correction, known as a scope closer. Programs commonly have extra left parentheses or right parentheses, or extra left or right brackets, and so on. In Tiger, another kind of nesting construct is let · · · in · · · end. If the programmer forgets to close a scope that was opened by left parenthesis, then the automatic single-token insertion heuristic can close this scope where necessary. But to close a let scope requires the insertion of three tokens, which will not be done automatically unless the compiler-writer has suggested “change nothing to in 0 end” as illustrated in the %change command above.
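The effect of such suggestions can be sketched as a list of rewrite rules that the repairer applies before falling back on the default single-token edits. The Python below is an illustrative model only: the rule table CHANGES transcribes the %change directive shown earlier, while the function preferred_repairs and its start/stop window parameters are assumptions made for this example.

```python
# Each rule maps a token sequence to its suggested replacement.
CHANGES = [
    (["EQ"], ["ASSIGN"]),
    (["ASSIGN"], ["EQ"]),
    (["SEMICOLON", "ELSE"], ["ELSE"]),
    ([], ["IN", "INT", "END"]),  # scope closer: insert "in 0 end"
]


def preferred_repairs(tokens, start, stop):
    """Yield repaired token lists, one per rule application.

    A rule is applied at a single position i with start <= i <= stop.
    An empty left-hand side matches everywhere, giving a pure insertion.
    """
    for old, new in CHANGES:
        for i in range(start, stop + 1):
            if tokens[i:i + len(old)] == old:
                yield tokens[:i] + new + tokens[i + len(old):]
```

Each yielded list would then be scored exactly like any other candidate repair, by how far the parser can proceed on it.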

PROGRAM PARSING

Use Yacc to implement a parser for the Tiger language. Appendix A describes, among other things, the syntax of Tiger.

You should turn in the file tiger.grm and a README. Supporting files available in $TIGER/chap3 include:

makefile The “makefile.”

errormsg.[ch] The Error Message structure, useful for producing error messages with file names and line numbers.

lex.yy.c The lexical analyzer. I haven’t provided the source file tiger.lex, but I’ve provided the output of Lex that you can use if your lexer isn’t working.

parsetest.c A driver to run your parser on an input file.

tiger.grm The skeleton of a file you must fill in.

You won’t need tokens.h anymore; instead, the header file for tokens is y.tab.h, which is produced automatically by Yacc from the token specification of your grammar.

Your grammar should have as few shift-reduce conflicts as possible, and no reduce-reduce conflicts. Furthermore, your accompanying documentation should list each shift-reduce conflict (if any) and explain why it is not harmful.

My grammar has a shift-reduce conflict that’s related to the confusion between

variable [ expression ]

type-id [ expression ] of expression

In fact, I had to add a seemingly redundant grammar rule to handle this confusion. Is there a way to do this without a shift-reduce conflict?


Use the precedence directives (%left, %nonassoc, %right) when it is straightforward to do so.

Do not attach any semantic actions to your grammar rules for this exercise.

Optional: Add error productions to your grammar and demonstrate that your parser can sometimes recover from syntax errors.

FURTHER READING

Conway [1963] describes a predictive (recursive-descent) parser, with a notion of FIRST sets and left-factoring. LL(k) parsing theory was formalized by Lewis and Stearns [1968].

LR(k) parsing was developed by Knuth [1965]; the SLR and LALR techniques by DeRemer [1971]; LALR(1) parsing was popularized by the development and distribution of Yacc [Johnson 1975] (which was not the first parser-generator, or “compiler-compiler,” as can be seen from the title of the cited paper).

Figure 3.29 summarizes many theorems on subset relations between grammar classes. Heilbrunner [1981] shows proofs of several of these theorems, including LL(k) ⊂ LR(k) and LL(1) ⊄ LALR(1) (see Exercise 3.14). Backhouse [1979] is a good introduction to theoretical aspects of LL and LR parsing.

Aho et al. [1975] showed how deterministic LL or LR parsing engines can handle ambiguous grammars, with ambiguities resolved by precedence directives (as described in Section 3.4).

Burke and Fisher [1987] invented the error-repair tactic that keeps a K-token queue and two parse stacks.

EXERCISES

3.1 Translate each of these regular expressions into a context-free grammar.

a. ((xy*x)|(yx*y))?

b. ((0|1)+"."(0|1)*)|((0|1)*"."(0|1)+)

*3.2 Write a grammar for English sentences using the words

time, arrow, banana, flies, like, a, an, the, fruit

and the semicolon. Be sure to include all the senses (noun, verb, etc.) of each word. Then show that this grammar is ambiguous by exhibiting more than one parse tree for “time flies like an arrow; fruit flies like a banana.”

3.3 Write an unambiguous grammar for each of the following languages. Hint: One way of verifying that a grammar is unambiguous is to run it through Yacc and get no conflicts.

a. Palindromes over the alphabet {a, b} (strings that are the same backward and forward).

b. Strings that match the regular expression a∗b∗ and have more a’s than b’s.

c. Balanced parentheses and square brackets. Example: ([[](()[()][])])

*d. Balanced parentheses and brackets, where a closing bracket also closes any outstanding open parentheses (up to the previous open bracket).

Example: [([](()[(][])]. Hint: First, make the language of balanced parentheses and brackets, where extra open parentheses are allowed; then make sure this nonterminal must appear within brackets.

e. All subsets and permutations (without repetition) of the keywords public final static synchronized transient. (Then comment on how best to handle this situation in a real compiler.)

f. Statement blocks in Pascal or ML where the semicolons separate the statements:

( statement ; ( statement ; statement ) ; statement )

g. Statement blocks in C where the semicolons terminate the statements:

{ expression; { expression; expression; } expression; }

3.4 Write a grammar that accepts the same language as Grammar 3.1, but that is suitable for LL(1) parsing. That is, eliminate the ambiguity, eliminate the left recursion, and (if necessary) left-factor.

3.5 Find nullable, FIRST, and FOLLOW sets for this grammar; then construct the LL(1) parsing table.

0 S′ → S $

1 S →

2 S → X S

3 B → \ begin { WORD }

4 E → \ end { WORD }

5 X → B S E

6 X → { S }

7 X → WORD

8 X → begin

9 X → end

10 X → \ WORD


3.6 a. Calculate nullable, FIRST, and FOLLOW for this grammar:

S → u B D z
B → B v
B → w
D → E F
E → y
E →
F → x
F →

b. Construct the LL(1) parsing table.

c. Give evidence that this grammar is not LL(1).

d. Modify the grammar as little as possible to make an LL(1) grammar that accepts the same language.

*3.7 a. Left-factor this grammar.

0 S → G $

1 G → P

2 G → P G

3 P → id : R

4 R →

5 R → id R

b. Show that the resulting grammar is LL(2). You can do this by constructing FIRST sets (etc.) containing two-symbol strings; but it is simpler to construct an LL(1) parsing table and then argue convincingly that any conflicts can be resolved by looking ahead one more symbol.

c. Show how the tok variable and advance function should be altered for recursive-descent parsing with two-symbol lookahead.

d. Use the grammar class hierarchy (Figure 3.29) to show that the (left-factored) grammar is LR(2).

e. Prove that no string has two parse trees according to this (left-factored) grammar.

3.8 Make up a tiny grammar containing left recursion, and use it to demonstrate that left recursion is not a problem for LR parsing. Then show a small example comparing growth of the LR parse stack with left recursion versus right recursion.

3.9 Diagram the LR(0) states for Grammar 3.26, build the SLR parsing table, and identify the conflicts.

3.10 Diagram the LR(1) states for the grammar of Exercise 3.7 (without left-factoring), and construct the LR(1) parsing table. Indicate clearly any conflicts.

3.11 Construct the LR(0) states for this grammar, and then determine whether it is an SLR grammar.

0 S → B $

1 B → id P

2 B → id ( E ]

3 P →

4 P → ( E )

5 E → B

6 E → B , E

3.12 a. Build the LR(0) DFA for this grammar:

0 S → E $

1 E → id

2 E → id ( E )

3 E → E + id

b. Is this an LR(0) grammar? Give evidence.

c. Is this an SLR grammar? Give evidence.

d. Is this an LR(1) grammar? Give evidence.

3.13 Show that this grammar is LALR(1) but not SLR:

0 S → X $

1 X → M a

2 X → b M c

3 X → d c

4 X → b d a

5 M → d

3.14 Show that this grammar is LL(1) but not LALR(1):

1 S → ( X

2 S → E ]

3 S → F )

4 X → E )

5 X → F ]

6 E → A

7 F → A

8 A →

*3.15 Feed this grammar to Yacc; from the output description file, construct the LALR(1) parsing table for this grammar, with duplicate entries where there are conflicts. For each conflict, show whether shifting or reducing should be chosen so that the different kinds of expressions have “conventional” precedence. Then show the Yacc-style precedence directives that resolve the conflicts this way.

0 S → E $

1 E → while E do E

2 E → id := E

3 E → E + E

4 E → id

*3.16 Explain how to resolve the conflicts in this grammar, using precedence directives, or grammar transformations, or both. Use Yacc as a tool in your investigations, if you like.

1 E → id

2 E → E B E

3 B → +

4 B → −

5 B → ×

6 B → /

*3.17 Prove that Grammar 3.8 cannot generate parse trees of the form shown in Figure 3.9. Hint: What nonterminals could possibly be where the ?X is shown? What does that tell us about what could be where the ?Y is shown?
