
Context-Free Languages

3.1 CONTEXT-FREE GRAMMARS

Think of yourself as a language processor. You can recognize a legal English sentence when you hear one; "the cat is in the hat" is at least syntactically correct (whether or not it says anything that happens to be the truth), but "hat the the in is cat" is gibberish. However you manage to do it, you can immediately tell when reading such sentences whether they are formed according to generally accepted rules for sentence structure. In this respect you are acting as a language recognizer: a device that accepts valid strings. The finite automata of the last chapter are formalized types of language recognizers.

You also, however, are capable of producing legal English sentences. Again, why you would want to do so and how you manage to do it are not our concern; but the fact is that you occasionally speak or write sentences, and in general they are syntactically correct (even when they are lies). In this respect you are acting as a language generator. In this section we shall study certain types of formal language generators. Such a device begins, when given some sort of "start" signal, to construct a string. Its operation is not completely determined from the beginning but is nevertheless limited by a set of rules. Eventually this process halts, and the device outputs a completed string. The language defined by the device is the set of all strings that it can produce.

Neither a recognizer nor a generator for the English language is at all easy to produce; indeed, designing such devices for large subsets of natural languages has been a challenging research front for several decades. Nevertheless the idea of a language generator has some explanatory force in attempts to discuss human language. More important for us, however, is the theory of generators of formal, "artificial" languages, such as the regular languages and the important class of "context-free" languages introduced below. This theory will neatly complement the study of automata, which recognize languages, and is also of practical value in the specification and analysis of computer languages.

Regular expressions can be viewed as language generators. For example, consider the regular expression a(a* U b*)b. A verbal description of how to generate a string in accordance with this expression would be the following:

First output an a. Then do one of the following two things:

Either output a number of a's or output a number of b's.

Finally output a b.

The language associated with this language generator (that is, the set of all strings that can be produced by the process just described) is, of course, exactly the regular language defined in the way described earlier by the regular expression a(a* U b*)b.

In this chapter we shall study certain more complex sorts of language generators, called context-free grammars, which are based on a more complete understanding of the structure of the strings belonging to the language. To take again the example of the language generated by a(a* U b*)b, note that any string in this language consists of a leading a, followed by a middle part, generated by (a* U b*), followed by a trailing b. If we let S be a new symbol interpreted as "a string in the language," and M be a symbol standing for "middle part," then we can express this observation by writing

S → aMb,

where → is read "can be." We call such an expression a rule. What can M, the middle part, be? The answer is: either a string of a's or a string of b's. We express this by adding the rules

M → A and M → B,

where A and B are new symbols that stand for strings of a's and b's, respectively.

Now, what is a string of a's? It can be the empty string:

A → e,

or it may consist of a leading a followed by a string of a's:

A → aA.

Similarly, for B:

B → e and B → bB.

The language denoted by the regular expression a(a* U b*)b can then be defined alternatively by the following language generator.


Start with the string consisting of the single symbol S. Find a symbol in the current string that appears to the left of → in one of the rules above. Replace an occurrence of this symbol with the string that appears to the right of → in the same rule. Repeat this process until no such symbol can be found.

For example, to generate the string aaab we start with S, as specified; we then replace S by aMb according to the first rule, S → aMb. To aMb we apply the rule M → A and obtain aAb. We then twice apply the rule A → aA to get the string aaaAb. Finally, we apply the rule A → e. In the resulting string, aaab, we cannot identify any symbol that appears to the left of → in some rule. Thus the operation of our language generator has ended, and aaab was produced, as promised.
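To make this rewriting process concrete, here is a small Python sketch of the generator just described, using the rules S → aMb, M → A, M → B, A → e, A → aA, B → e, B → bB (the dictionary encoding, the use of "" for e, and the name generate are our illustrative choices, not the book's notation).

import random

# The rules of the generator above; "" plays the role of the empty string e.
RULES = {
    "S": ["aMb"],
    "M": ["A", "B"],
    "A": ["", "aA"],
    "B": ["", "bB"],
}

def generate(start="S"):
    """Repeatedly rewrite a symbol that appears to the left of -> in some rule."""
    current = start
    while True:
        # position of the leftmost symbol that can still be rewritten
        pos = next((i for i, c in enumerate(current) if c in RULES), None)
        if pos is None:
            return current                      # only a's and b's remain
        replacement = random.choice(RULES[current[pos]])
        current = current[:pos] + replacement + current[pos + 1:]

print(generate())   # e.g. "aaab" or "abbb"

Every string it outputs matches a(a* U b*)b, and every string of that form is a possible output.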

A context-free grammar is a language generator that operates like the one above, with some such set of rules. Let us pause to explain at this point why such a language generator is called context-free. Consider the string aaAb, which was an intermediate stage in the generation of aaab. It is natural to call the strings aa and b that surround the symbol A the context of A in this particular string. Now, the rule A → aA says that we can replace A by the string aA no matter what the surrounding strings are; in other words, independently of the context of A. In Chapter 4 we examine more general grammars, in which replacements may be conditioned on the existence of an appropriate context.

In a context-free grammar, some symbols appear to the left of → in rules (S, M, A, and B in our example) and some (a and b) do not. Symbols of the latter kind are called terminals, since the production of a string consisting solely of such symbols signals the termination of the generation process. All these ideas are stated formally in the next definition.

Definition 3.1.1: A context-free grammar G is a quadruple (V, Σ, R, S), where

V is an alphabet,
Σ (the set of terminals) is a subset of V,
R (the set of rules) is a finite subset of (V - Σ) × V*, and
S (the start symbol) is an element of V - Σ.

The members of V - Σ are called nonterminals. For any A ∈ V - Σ and u ∈ V*, we write A →_G u whenever (A, u) ∈ R. For any strings u, v ∈ V*, we write u ⇒_G v if and only if there are strings x, y ∈ V* and A ∈ V - Σ such that u = xAy, v = xv'y, and A →_G v'. The relation ⇒*_G is the reflexive, transitive closure of ⇒_G. Finally, L(G), the language generated by G, is {w ∈ Σ* : S ⇒*_G w}; we also say that G generates each string in L(G).

A language L is said to be a context-free language if L = L(G) for some context-free grammar G.

When the grammar to which we refer is obvious, we write A → w and u ⇒ v instead of A →_G w and u ⇒_G v.

We call any sequence of the form

w_0 ⇒_G w_1 ⇒_G ... ⇒_G w_n

a derivation in G of w_n from w_0. Here w_0, ..., w_n may be any strings in V*, and n, the length of the derivation, may be any natural number, including zero. We also say that the derivation has n steps.

Example 3.1.1: Consider the context-free grammar G = (V, Σ, R, S), where V = {S, a, b}, Σ = {a, b}, and R consists of the rules S → aSb and S → e. A possible derivation is

S ⇒ aSb ⇒ aaSbb ⇒ aabb.

Here the first two steps used the rule S → aSb, and the last used the rule S → e. In fact, it is not hard to see that L(G) = {a^n b^n : n ≥ 0}. Hence some context-free languages are not regular.◇
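Programmatically, the derivations of this grammar are easy to spell out: applying S → aSb exactly n times and then S → e yields a^n b^n. A minimal sketch (the function name derivation_of is ours):

def derivation_of(n: int):
    """The derivation S => aSb => aaSbb => ... => a^n b^n, as a list of strings."""
    steps = ["S"]
    for i in range(1, n + 1):
        steps.append("a" * i + "S" + "b" * i)   # one application of S -> aSb
    steps.append("a" * n + "b" * n)             # the final application of S -> e
    return steps

print(" => ".join(derivation_of(2)))   # S => aSb => aaSbb => aabb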

We shall soon see, however, that all regular languages are context-free.

Example 3.1.2: Let G be the grammar (W, Σ, R, S), where

W = {S, A, N, V, P} ∪ Σ,
Σ = {Jim, big, green, cheese, ate},
R = {P → N, P → AP, S → PVP, A → big, A → green, N → cheese, N → Jim, V → ate}.

Here G is designed to be a grammar for a part of English; S stands for sentence, A for adjective, N for noun, V for verb, and P for phrase. The following are some strings in L(G).

Jim ate cheese

big Jim ate green cheese

big cheese ate Jim

Unfortunately, the following are also strings in L(G):

big cheese ate green green big green big cheese

green Jim ate green big Jim
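The grammar can also be explored mechanically. The sketch below enumerates every sentence of at most four words by breadth-first rewriting (the dictionary encoding of R and the bound max_words are our choices); it produces the intended sentences and the unfortunate ones alike, since the grammar knows nothing about meaning.

from collections import deque

# The rules of Example 3.1.2, with each right-hand side given as a list of symbols.
RULES = {
    "S": [["P", "V", "P"]],
    "P": [["N"], ["A", "P"]],
    "A": [["big"], ["green"]],
    "N": [["cheese"], ["Jim"]],
    "V": [["ate"]],
}

def sentences(max_words=4):
    found, queue = set(), deque([["S"]])
    while queue:
        form = queue.popleft()
        if len(form) > max_words:
            continue                            # too long; do not expand further
        i = next((k for k, s in enumerate(form) if s in RULES), None)
        if i is None:                           # all symbols are terminals
            found.add(" ".join(form))
            continue
        for rhs in RULES[form[i]]:              # expand the leftmost nonterminal
            queue.append(form[:i] + rhs + form[i + 1:])
    return sorted(found)

for s in sentences():
    print(s)    # includes "Jim ate cheese" as well as "cheese ate cheese"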

Example 3.1.3: Computer programs written in any programming language must satisfy some rigid criteria in order to be syntactically correct and therefore amenable to mechanical interpretation. Fortunately, the syntax of most programming languages can, unlike that of human languages, be captured by context-free grammars. We shall see in Section 3.7 that being context-free is extremely helpful when it comes to parsing a program, that is, analyzing it to understand its syntax. Here, we give a grammar that generates a fragment of many common programming languages. This language consists of all strings over the alphabet {(, ), +, *, id} that represent syntactically correct arithmetic expressions involving + and *. id stands for any identifier, that is to say, variable name.† Examples of such strings are id and id * (id * id + id), but not *id + ( or + * id.

Let G = (V, Σ, R, E), where V, Σ, and R are as follows.

V = {+, *, (, ), id, T, F, E},
Σ = {+, *, (, ), id},
R = {E → E + T,    (R1)
     E → T,        (R2)
     T → T * F,    (R3)
     T → F,        (R4)
     F → (E),      (R5)
     F → id}.      (R6)

The symbols E, T, and F are abbreviations for expression, term, and factor, respectively.

The grammar G generates the string (id * id + id) * (id + id) by the following derivation:

E ⇒ T                              by Rule R2
  ⇒ T * F                          by Rule R3
  ⇒ T * (E)                        by Rule R5
  ⇒ T * (E + T)                    by Rule R1
  ⇒ T * (T + T)                    by Rule R2
  ⇒ T * (F + T)                    by Rule R4
  ⇒ T * (id + T)                   by Rule R6
  ⇒ T * (id + F)                   by Rule R4
  ⇒ T * (id + id)                  by Rule R6
  ⇒ F * (id + id)                  by Rule R4
  ⇒ (E) * (id + id)                by Rule R5
  ⇒ (E + T) * (id + id)            by Rule R1
  ⇒ (E + F) * (id + id)            by Rule R4
  ⇒ (E + id) * (id + id)           by Rule R6
  ⇒ (T + id) * (id + id)           by Rule R2
  ⇒ (T * F + id) * (id + id)       by Rule R3
  ⇒ (F * F + id) * (id + id)       by Rule R4
  ⇒ (F * id + id) * (id + id)      by Rule R6
  ⇒ (id * id + id) * (id + id)     by Rule R6

See Problem 3.1.8 for context-free grammars that generate larger subsets of programming languages.◇

† Incidentally, discovering such identifiers (or reserved words of the language, or numerical constants) in the program is accomplished at the earlier stage of lexical analysis, by algorithms based on regular expressions and finite automata.
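Although parsing is the subject of Section 3.7, it may help to see how easily membership in L(G) can be tested for this particular grammar. The following recursive-descent recognizer is only a sketch (the function names are ours, and the left-recursive rules E → E + T and T → T * F are handled iteratively, reading an E as a T followed by zero or more occurrences of + T); it is not the parsing method developed later in this chapter.

def accepts(tokens):
    """Return True iff the token list is generated by E."""
    pos = 0  # index of the next unread token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_E():          # E -> T (+ T)*
        parse_T()
        while peek() == "+":
            expect("+")
            parse_T()

    def parse_T():          # T -> F (* F)*
        parse_F()
        while peek() == "*":
            expect("*")
            parse_F()

    def parse_F():          # F -> ( E ) | id
        if peek() == "(":
            expect("(")
            parse_E()
            expect(")")
        else:
            expect("id")

    try:
        parse_E()
        return pos == len(tokens)      # all input must be consumed
    except SyntaxError:
        return False

# The two kinds of strings discussed in the example:
print(accepts(["id", "*", "(", "id", "*", "id", "+", "id", ")"]))  # True
print(accepts(["*", "id", "+", "("]))                              # False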

Example 3.1.4: The following grammar generates all strings of properly balanced left and right parentheses: every left parenthesis can be paired with a unique subsequent right parenthesis, and every right parenthesis can be paired with a unique preceding left parenthesis. Moreover, the string between any such pair has the same property. We let G = (V, Σ, R, S), where

V = {S, (, )},
Σ = {(, )},
R = {S → e, S → SS, S → (S)}.

Two derivations in this grammar are

S ⇒ SS ⇒ S(S) ⇒ S((S)) ⇒ S(()) ⇒ (S)(()) ⇒ ()(())

and

S ⇒ SS ⇒ (S)S ⇒ ()S ⇒ ()(S) ⇒ ()((S)) ⇒ ()(()).

Thus the same string may have several derivations in a context-free grammar; in the next subsection we discuss the intricate ways in which such derivations may be related.

Incidentally, L(G) is another context-free language that is not regular (that it is not regular was the object of Problem 2.4.6).◇
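The pairing property stated at the beginning of this example also gives a direct membership test, independent of the grammar: a string over {(, )} is balanced exactly when no prefix contains more right parentheses than left ones and the whole string contains equally many of each. A minimal sketch (the function name is_balanced is ours):

def is_balanced(w: str) -> bool:
    depth = 0
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a right parenthesis with no preceding partner
                return False
        else:
            return False          # not a string over {(, )}
    return depth == 0             # every left parenthesis was paired

print(is_balanced("()(())"))   # True  -- the string derived above
print(is_balanced("(()"))      # False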

Example 3.1.5: Obviously, there are context-free languages that are not regular (we have already seen two examples). However, all regular languages are context-free. In the course of this chapter we shall encounter several proofs of this fact. For example, we shall see in Section 3.3 that context-free languages are precisely the languages accepted by certain language acceptors called pushdown automata. We shall also point out that the pushdown automaton is a generalization of the finite automaton, in the sense that any finite automaton can be trivially considered as a pushdown automaton. Hence all regular languages are context-free.

For another proof, we shall see in Section 3.5 that the class of context-free languages is closed under union, concatenation, and Kleene star (Theorem 3.5.1); furthermore, the trivial languages ∅ and {a} are definitely context-free (generated by the context-free grammars with no rules, or with only the rule S → a, respectively). Hence the class of context-free languages must contain all regular languages, the closure of the trivial languages under these operations.

Figure 3-1

But let us now show that all regular languages are context-free by a direct construction. Consider the regular language accepted by the deterministic finite automaton M = (K, Σ, δ, s, F). The same language is generated by the grammar G(M) = (V, Σ, R, S), where V = K ∪ Σ, S = s, and R consists of these rules:

R = {q → ap : δ(q, a) = p} ∪ {q → e : q ∈ F}.

That is, the nonterminals are the states of the automaton; as for rules, for each transition from q to p on input a we have in R the rule q → ap. For example, for the automaton in Figure 3-1 we would construct this grammar:

S → aS, S → bA, A → aB, A → bA, B → aS, B → bA, B → e.

It is left as an exercise to show that the resulting context-free grammar generates precisely the language accepted by the automaton (see Problem 3.1.10 for a general treatment of context-free grammars such as G(M) above, and their relationship with finite automata).◇
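The construction of G(M) is easy to carry out mechanically. In the sketch below, δ is encoded as a Python dictionary, and Figure 3-1 is read as a three-state automaton with final state B; both the encoding and that reading of the figure are our assumptions, inferred from the grammar just given.

def grammar_of_dfa(states, alphabet, delta, start, finals):
    """delta maps (state, symbol) -> state; returns (V, Sigma, R, S) of G(M)."""
    rules = []
    for (q, a), p in delta.items():
        rules.append((q, [a, p]))          # each transition gives the rule q -> a p
    for q in finals:
        rules.append((q, []))              # each final state gives the rule q -> e
    V = set(states) | set(alphabet)
    return V, set(alphabet), rules, start

# The automaton of Figure 3-1, as we read it off the grammar above (an assumption):
delta = {("S", "a"): "S", ("S", "b"): "A",
         ("A", "a"): "B", ("A", "b"): "A",
         ("B", "a"): "S", ("B", "b"): "A"}
V, Sigma, R, S = grammar_of_dfa({"S", "A", "B"}, {"a", "b"}, delta, "S", {"B"})
for lhs, rhs in R:
    print(lhs, "->", " ".join(rhs) if rhs else "e")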

Problems for Section 3.1

3.1.1. Consider the grammar G = (V, Σ, R, S), where

V = {a, b, S, A},
Σ = {a, b},
R = {S → AA, A → AAA, A → a, A → bA, A → Ab}.

(a) Which strings of L(G) can be produced by derivations of four or fewer steps?

(b) Give at least four distinct derivations for the string babbab.

(c) For any m, n, p > 0, describe a derivation in G of the string b^m a b^n a b^p.

3.1.2. Consider the grammar (V, Σ, R, S), where V, Σ, and R are defined as follows:

V = {a, b, S, A},
Σ = {a, b},
R = {S → aAa, S → bAb, S → e, A → SS}.

Give a derivation of the string baabbb in G. (Notice that, unlike all other context-free languages we have seen so far, this one is very difficult to describe in English.)

3.1.3. Construct context-free grammars that generate each of these languages.

(a) {w c w^R : w ∈ {a, b}*}

(b) {w w^R : w ∈ {a, b}*}

(c) {w ∈ {a, b}* : w = w^R}

3.1.4. Consider the alphabet Σ = {a, b, (, ), U, *, ∅}. Construct a context-free grammar that generates all strings in Σ* that are regular expressions over {a, b}.

3.1.5. Consider the context-free grammar G = (V, Σ, R, S), where

V = {a, b, S, A, B},
Σ = {a, b},
R = {S → aB, S → bA, A → a, A → aS, A → BAA, B → b, B → bS, B → ABB}.

(a) Show that ababba ∈ L(G).

(b) Prove that L(G) is the set of all nonempty strings in {a, b}* that have equal numbers of occurrences of a and b.

3.1.6. Let G be a context-free grammar and let k > 0. We let L_k(G) ⊆ L(G) be the set of all strings that have a derivation in G with k or fewer steps.

(a) What is L_5(G), where G is the grammar of Example 3.1.4?

(b) Show that, for all context-free grammars G and all k > 0, L_k(G) is finite.

3.1.7. Let G = (V, Σ, R, S), where V = {a, b, S}, Σ = {a, b}, and R = {S → aSb, S → aSa, S → bSa, S → bSb, S → e}. Show that L(G) is regular.

3.1.8. A program in a real programming language, such as C or Pascal, consists of statements, where each statement is one of several types:

(1) assignment statement, of the form id := E, where E is any arithmetic expression (generated by the grammar of Example 3.1.3).

(2) conditional statement, of the form, say, if E < E then statement, or a while statement of the form while E < E do statement.

(3) goto statement; furthermore, each statement could be preceded by a label.

(4) compound statement, that is, many statements preceded by a begin, followed by an end, and separated by a ";".

Give a context-free grammar that generates all possible statements in the simplified programming language described above.

3.1.9. Show that the following languages are context-free by exhibiting context-free grammars generating each.

(a) {a^m b^n : m ≥ n}

(b) {a^m b^n c^p d^q : m + n = p + q}

(c) {w ∈ {a, b}* : w has twice as many b's as a's}

(d) {uawb : u, w ∈ {a, b}*, |u| = |w|}

(e) {w_1 c w_2 c ... c w_k c c w_j^R : k ≥ 1, 1 ≤ j ≤ k, w_i ∈ {a, b}^+ for i = 1, ..., k}

(f) {a^m b^n : m ≤ 2n}

3.1.10. Call a context-free grammar G = (V, Σ, R, S) regular (or right-linear) if R ⊆ (V - Σ) × Σ*((V - Σ) ∪ {e}); that is, if each rule has a right-hand side that consists of a string of terminals followed by at most one nonterminal.

(a) Consider the regular grammar G = (V, Σ, R, S), where

V = {a, b, A, B, S},
Σ = {a, b},
R = {S → abA, S → B, S → baB, S → e, A → bS, B → aS, A → b}.

Construct a nondeterministic finite automaton M such that L(M) = L(G).

Trace the transitions of M that lead to the acceptance of the string abba, and compare with a derivation of the same string in G.

(b) Prove that a language is regular if and only if there is a regular grammar that generates it. (Hint: Recall Example 3.1.5.)

(c) Call a context-free grammar G = (V, Σ, R, S) left-linear if and only if R ⊆ (V - Σ) × ((V - Σ) ∪ {e})Σ*. Show that a language is regular if and only if it is the language generated by some left-linear grammar.

(d) Suppose that G = (V, Σ, R, S) is a context-free grammar such that each rule in R is either of the form A → wB or of the form A → Bw or of the form A → w, where in each case A, B ∈ V - Σ and w ∈ Σ*. Is L(G) necessarily regular? Prove it or give a counterexample.

3.2 PARSE TREES

Let G be a context-free grammar. A string w ∈ L(G) may have many derivations in G. For example, if G is the context-free grammar that generates the language of balanced parentheses (recall Example 3.1.4), then the string

()()

can be derived from S by at least two distinct derivations, namely,

S ⇒ SS ⇒ (S)S ⇒ ()S ⇒ ()(S) ⇒ ()()

and

S ⇒ SS ⇒ S(S) ⇒ (S)(S) ⇒ (S)() ⇒ ()().

However, these two derivations are in a sense "the same." The rules used are the same, and they are applied at the same places in the intermediate string.

The only difference is in the order in which the rules are applied. Intuitively, both derivations can be pictured as in Figure 3-2.

            S
          /   \
         S     S
        /|\   /|\
       ( S ) ( S )
         |     |
         e     e

          Figure 3-2

We call such a picture a parse tree. The points are called nodes; each node carries a label that is a symbol in V. The topmost node is called the root, and the nodes along the bottom are called leaves. All leaves are labeled by terminals, or possibly the empty string e. By concatenating the labels of the leaves from left to right, we obtain the derived string of terminals, which is called the yield of the parse tree.

More formally, for an arbitrary context-free grammar G = (V, Σ, R, S), we define its parse trees and their roots, leaves, and yields, as follows.

1. A single node labeled a:

       a

This is a parse tree for each a ∈ Σ. The single node of this parse tree is both the root and a leaf. The yield of this parse tree is a.

2. If A → e is a rule in R, then

       A
       |
       e

is a parse tree; its root is the node labeled A, its sole leaf is the node labeled e, and its yield is e.

3. If n ≥ 1 parse trees have roots labeled A_1, ..., A_n respectively and yields y_1, ..., y_n, and A → A_1 ... A_n is a rule in R, then the tree formed by attaching these parse trees, in order, as the children of a new root node labeled A is a parse tree. Its root is the new node labeled A, its leaves are the leaves of its constituent parse trees, and its yield is y_1 ... y_n.

4. Nothing else is a parse tree.
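This inductive definition translates directly into a small data structure. In the sketch below (the class Node and the method name yield_ are ours; yield_ avoids Python's keyword yield), a leaf holds a terminal or e, an internal node holds a nonterminal together with its children in left-to-right order, and the yield is computed exactly as in cases 1 through 3.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

    def yield_(self) -> str:
        """Concatenate the labels of the leaves from left to right; e counts as empty."""
        if not self.children:                      # a leaf
            return "" if self.label == "e" else self.label
        return "".join(c.yield_() for c in self.children)

# The parse tree of Figure 3-2, whose yield is ()():
tree = Node("S", [
    Node("S", [Node("("), Node("S", [Node("e")]), Node(")")]),
    Node("S", [Node("("), Node("S", [Node("e")]), Node(")")]),
])
print(tree.yield_())   # prints ()()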

Example 3.2.1: Recall the grammar G that generates all arithmetic expressions over id (Example 3.1.3). A parse tree with yield id * (id + id) is shown in Figure 3-3.◇

                E
                |
                T
            ____|____
           |    |    |
           T    *    F
           |      ___|___
           F     |   |   |
           |     (   E   )
           id     ___|___
                 |   |   |
                 E   +   T
                 |       |
                 T       F
                 |       |
                 F       id
                 |
                 id

            Figure 3-3

Intuitively, parse trees are ways of representing derivations of strings in L(G) so that the superficial differences between derivations, owing to the order of application of rules, are suppressed. To put it otherwise, parse trees represent equivalence classes of derivations. We make this intuition precise below.

Let G = (V, Σ, R, S) be a context-free grammar, and let D = x_1 ⇒ x_2 ⇒ ... ⇒ x_n and D' = x'_1 ⇒ x'_2 ⇒ ... ⇒ x'_n be two derivations in G, where x_i, x'_i ∈ V* for i = 1, ..., n, x_1, x'_1 ∈ V - Σ, and x_n, x'_n ∈ Σ*. That is, they are both derivations of terminal strings from a single nonterminal. We say that D precedes D', written D ≺ D', if n > 2 and there is an integer k, 1 < k < n, such that

(1) for all i ≠ k we have x_i = x'_i;

(2) x_{k-1} = x'_{k-1} = uAvBw, where u, v, w ∈ V*, and A, B ∈ V - Σ;

(3) x_k = uyvBw, where A → y ∈ R;

(4) x'_k = uAvzw, where B → z ∈ R;

(5) x_{k+1} = x'_{k+1} = uyvzw.

In other words, the two derivations are identical except for two consecutive steps, during which the same two nonterminals are replaced by the same two strings but in opposite orders in the two derivations. The derivation in which the leftmost of the two nonterminals is replaced first is said to precede the other.

Example 3.2.2: Consider the following three derivations D1, D2, and D3 in the grammar G generating all strings of balanced parentheses:

D1 = S ⇒ SS ⇒ (S)S ⇒ ((S))S ⇒ (())S ⇒ (())(S) ⇒ (())()
D2 = S ⇒ SS ⇒ (S)S ⇒ ((S))S ⇒ ((S))(S) ⇒ (())(S) ⇒ (())()
D3 = S ⇒ SS ⇒ (S)S ⇒ ((S))S ⇒ ((S))(S) ⇒ ((S))() ⇒ (())()

We have that D1 ≺ D2 and D2 ≺ D3. However, it is not the case that D1 ≺ D3, since these two derivations differ in more than one intermediate string. Notice that all three derivations have the same parse tree, the one shown in Figure 3-4.◇

            S
          /   \
         S     S
        /|\   /|\
       ( S ) ( S )
        /|\    |
       ( S )   e
         |
         e

          Figure 3-4

We say that two derivations D and D' are similar if the pair (D, D') belongs in the reflexive, symmetric, transitive closure of ≺. Since the reflexive, symmetric, transitive closure of any relation is by definition reflexive, symmetric, and transitive, similarity is an equivalence relation. To put it otherwise, two derivations are similar if they can be transformed into one another via a sequence of "switchings" in the order in which rules are applied. Such a "switching" can replace a derivation either by one that precedes it, or by one that it precedes.

Example 3.2.2 (continued): Parse trees capture exactly, via a natural isomorphism, the equivalence classes of the "similarity" equivalence relation between derivations of a string defined above. The equivalence class of the derivations of (())() corresponding to the tree in Figure 3-4 contains the derivations D1, D2, D3 shown above, and also these seven:

D4  = S ⇒ SS ⇒ (S)S ⇒ (S)(S) ⇒ ((S))(S) ⇒ (())(S) ⇒ (())()
D5  = S ⇒ SS ⇒ (S)S ⇒ (S)(S) ⇒ ((S))(S) ⇒ ((S))() ⇒ (())()
D6  = S ⇒ SS ⇒ (S)S ⇒ (S)(S) ⇒ (S)() ⇒ ((S))() ⇒ (())()
D7  = S ⇒ SS ⇒ S(S) ⇒ (S)(S) ⇒ ((S))(S) ⇒ (())(S) ⇒ (())()
D8  = S ⇒ SS ⇒ S(S) ⇒ (S)(S) ⇒ ((S))(S) ⇒ ((S))() ⇒ (())()
D9  = S ⇒ SS ⇒ S(S) ⇒ (S)(S) ⇒ (S)() ⇒ ((S))() ⇒ (())()
D10 = S ⇒ SS ⇒ S(S) ⇒ S() ⇒ (S)() ⇒ ((S))() ⇒ (())()

These ten derivations are related by ≺ as shown in Figure 3-5.

Figure 3-5

All these ten derivations are similar, because, informally, they represent applications of the same rules at the same positions in the strings, only differing in the relative order of these applications; equivalently, one can go from any one of them to any other by repeatedly following either a ≺, or an inverted ≺. There are no other derivations similar to these.

There are, however, other derivations of (())() that are not similar to the ones above, and thus are not captured by the parse tree shown in Figure 3-4. An example is the following derivation:

S ⇒ SS ⇒ SSS ⇒ S(S)S ⇒ S((S))S ⇒ S(())S ⇒ S(())(S) ⇒ S(())() ⇒ (())().

Its parse tree is shown in Figure 3-6 (compare with Figure 3-4).◇

              S
            /   \
           S     S
          / \   /|\
         S   S ( S )
         |  /|\  |
         e ( S ) e
            /|\
           ( S )
             |
             e

          Figure 3-6

Each equivalence class of derivations under similarity, that is to say, each parse tree, contains a derivation that is maximal under ≺; that is, it is not preceded by any other derivation. This derivation is called a leftmost derivation. A leftmost derivation exists in every parse tree, and it can be obtained as follows. Starting from the label of the root A, repeatedly replace the leftmost nonterminal in the current string according to the rule suggested by the parse tree. Similarly, a rightmost derivation is one that does not precede any other derivation; it is obtained from the parse tree by always expanding the rightmost nonterminal in the current string. Each parse tree has exactly one leftmost and exactly one rightmost derivation. This is so because the leftmost derivation of a parse tree is uniquely determined, since at each step there is one nonterminal to replace: the leftmost one. Similarly for the rightmost derivation. In the example above, D1 is a leftmost derivation, and D10 is a rightmost one.
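The procedure just described for reading the leftmost derivation off a parse tree can be written out directly. In the sketch below (the tuple encoding of trees and the function name leftmost_derivation are ours), the current sentential form is kept as a list of subtrees, and at each step the leftmost subtree whose root is a nonterminal is replaced by its children; applied to the parse tree of Figure 3-4, it reproduces D1.

def leftmost_derivation(tree, nonterminals):
    """Trees are (label, children) tuples; leaves have an empty children tuple."""
    def text(form):
        # the string spelled by a sentential form; leaves labeled e are empty
        return "".join("" if t[0] == "e" else t[0] for t in form)

    form = [tree]                       # start with the root, e.g. S
    steps = [text(form)]
    while True:
        # index of the leftmost subtree whose root is a nonterminal
        i = next((k for k, t in enumerate(form) if t[0] in nonterminals), None)
        if i is None:
            return steps                # only terminals (and e) remain
        form = form[:i] + list(form[i][1]) + form[i + 1:]
        steps.append(text(form))

# The parse tree of Figure 3-4, whose yield is (())():
leaf = lambda a: (a, ())
inner = ("S", (leaf("("), ("S", (leaf("e"),)), leaf(")")))
left  = ("S", (leaf("("), ("S", (leaf("("), ("S", (leaf("e"),)), leaf(")"))), leaf(")")))
tree  = ("S", (left, inner))
print(" => ".join(leftmost_derivation(tree, {"S"})))
# S => SS => (S)S => ((S))S => (())S => (())(S) => (())()   (this is D1)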

It is easy to tell when a step of a derivation can be a part of a leftmost derivation: the leftmost nonterminal must be replaced. We write x ~ y if and
