On the applicability of the longest-match rule in lexical analysis

Wuu Yang a,*, Chey-Woei Tsay b, Jien-Tsai Chan a

a Computer and Information Science Department, National Chiao-Tung University, HsinChu, Taiwan, ROC
b Department of Computer Science and Information Management, Providence University, Taichung County, Taiwan, ROC

Received 25 April 2002; accepted 28 June 2002

Abstract

The lexical analyzer of a compiler usually adopts the longest-match rule to resolve ambiguities when deciding the next token in the input stream. However, that rule may not be applicable in all situations. Because the longest-match rule is widely used, a language designer or a compiler implementor frequently overlooks the subtle implications of the rule. The consequence is either a flawed language design or a deficient implementation. We propose a method that automatically checks the applicability of the longest-match rule and identifies precisely the situations in which that rule is not applicable. The method is useful to both language designers and compiler implementors. In particular, the method is indispensable to automatic generators of language translation systems since, without the method, the generated lexical analyzers can only blindly apply the longest-match rule, which results in erroneous behavior. The crux of the method consists of two algorithms. The first computes the regular set of the sequences of tokens produced by a nondeterministic Mealy automaton when the automaton processes elements of an input regular set. The second determines, by solving a set of equations, whether a regular set and a context-free language have a nontrivial intersection.

© 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Compiler; Context-free grammar; Finite-state automaton; Lexical analyzer; Mealy automaton; Moore automaton; Parser; Regular expression; Scanner

This work was supported in part by National Science Council, Taiwan, ROC, under grants NSC 86-2213-E-009-021 and NSC 86-2213-E-009-079.

* Corresponding author. Tel.: +886-3-5712121x56614; fax: +886-3-5721490. E-mail address: wuuyang@cis.nctu.edu.tw (W. Yang).

1477-8424/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.


1. Introduction

In modern compilers, lexical analyzers are implemented according to regular-expression specifications. The lexical analyzer partitions a stream of characters into groups, called tokens.¹ Lexical ambiguities arise when a sequence of characters may be partitioned in more than one way. For instance, the six-character string “123456” may be considered as one integer of six digits or as six integers of one digit each, according to common regular-expression specifications. Intuitively, the former view, finding a longest match, is more natural and more reasonable.

The traditional model of a lexical analyzer is a Moore automaton, which can be made deterministic with the subset-construction technique [1]. However, due to the look-ahead behavior of a lexical analyzer, a Mealy machine is a better model of a lexical analyzer [2]. Lexical ambiguities arise because the Mealy automaton underlying the lexical analyzer is, in general, nondeterministic. Furthermore, there is no way to make it deterministic in general. For example, in Fig. 1(a), two classes of tokens,  and , are defined. A (deterministic) Moore automaton [1] corresponding to the two tokens is shown in Fig. 1(b).² This Moore automaton is constructed with standard techniques discussed in most compiler textbooks [3,4]. An equivalent (nondeterministic) Mealy automaton is shown in Fig. 1(c). Notice the three bold arrows that carry output tokens. The Mealy automaton is nondeterministic in that there are two outgoing edges from state 2 labeled b. A similar situation holds for state 3. (Fig. 1(d) will be explained in a later section.) When the stream of characters “abab” is fed into the automaton in Fig. 1(c), either the two tokens  or a single token  may be produced.

Fig. 1. A scanner specification and its Moore and Mealy automata.

¹ In our view, white spaces and comments also constitute tokens; only these whitespace tokens are discarded rather

The longest-match rule is generally adopted to enforce determinism on a nondeterministic Mealy automaton [2]. The rule dictates that the next token is the one that contains the largest number of input characters. Though the rule is applicable in many situations, it may not always yield the desired results. For instance, consider the string “>>” in C++ programs [5]. In the program fragment “out >> ff”, the string “>>” should be considered as a single redirection token, whereas in the program fragment “foo < bar < buzz >>”, the string “>>” should be interpreted as two separate, consecutive greater-than tokens. Upon encountering the string “>>”, the lexical analyzer needs to consult the parser to check the context in which the string occurs. For a second example, note that, in Modula-2 [6], integers, such as “10”, real numbers, such as “10.”, and the range symbol “..” are all allowed tokens. For the string “10..20”, the partitioning yields the three tokens “10.”, “.”, and “20” if the longest-match rule is observed strictly. But a correct partitioning in this case should be the three tokens “10”, “..”, and “20”. For a third example, consider the string “a(b;c)” in an Ada program. Partitioning according to the longest-match rule yields the tokens: “a”, “(”, “b”, “;”, “c”, “”, and “”, and “)”. A correct partitioning should be “a”, “”, “(”, “b”, “,”, “c”, and “)”.
Similar parser errors will occur in the following C++ program fragments: “i := j +++++ k” and “i := j ++k” if the longest-match rule is strictly obeyed. We conclude that misinterpretations by the longest-match rule occur very frequently in practice. Since there are many such misinterpretations in lexical analysis, it is necessary to have a technique that determines when the longest-match rule can be safely applied. In this paper, we propose such a technique.
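The Modula-2 case can be reproduced with a small scanner. The following is a minimal sketch, not taken from the paper: the token class names `REAL`, `INT`, `RANGE`, and `DOT` and their patterns are assumptions chosen only to mirror the “10..20” example.

```python
import re

# Hypothetical token classes (assumed for illustration, not the paper's specification).
TOKEN_PATTERNS = [
    ("REAL", re.compile(r"\d+\.\d*")),   # e.g. "10." or "10.5"
    ("INT", re.compile(r"\d+")),         # e.g. "10"
    ("RANGE", re.compile(r"\.\.")),      # the range symbol ".."
    ("DOT", re.compile(r"\.")),          # a single "."
]


def longest_match_tokens(text):
    """Partition text by always taking the longest token at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            # Keep the candidate that consumes the most characters.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


print(longest_match_tokens("10..20"))
# [('REAL', '10.'), ('DOT', '.'), ('INT', '20')]
```

The scanner greedily consumes “10.” as a real number, so the range symbol is never recognized, which is the misinterpretation described for Modula-2.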

It is tempting to conjecture that the longest-match rule is not applicable whenever a token is a prefix of another token. This is not necessarily so. For example, both “<” and “<<” are tokens in the C++ language. Since no two consecutive “<” tokens can appear in any C++ program, two consecutive “<” characters are always interpreted as a single “<<” token. In this case, the longest-match rule is applicable.

One naturally asks when the longest-match rule can be safely applied. In the past, answering this question depended on the experience of the language designer or the compiler implementor. However, this is an unreliable approach when a new language is designed or implemented. The longest-match rule is so common that a language designer or a compiler implementor frequently overlooks the subtle implications of the rule in the design or implementation. The consequence is either a flawed language design or a deficient implementation. Thus, it is important to have a method that automatically checks the applicability of the longest-match rule and warns the compiler implementor about the potential problems in the use of the rule. Such a technique is particularly indispensable in automatic generators of language translation systems, such as lex [7], Eli [8], and PCCTS [9], because, without the technique, the generated lexical analyzers can only blindly apply the longest-match rule. However, blind application of the longest-match rule only results in erroneous lexical analyzers.

² Our model of Moore automata is slightly different from the one defined in [1]: our Moore automaton always transits

A necessary condition for ambiguities is that a token is a prefix of another. A naive approach is to post a warning message whenever such a condition arises. This amounts to giving up the longest-match rule. However, past experience shows that the longest-match rule is applicable in most cases. It would be better if warning messages were issued only when an ambiguity will actually occur. We propose such an approach to detecting the ambiguity occurrences more precisely.

Our approach makes use of the context-free grammars underlying the parsers. Ambiguities are not declared prematurely whenever there is more than one partitioning of a sequence of tokens. Rather, ambiguities are declared only if there is more than one partitioning, among the many plausible partitionings, that may be part of sentences derived from the context-free grammar. To be more specific, suppose that an input stream of characters may be partitioned into two different sequences of tokens 1; 2; : : : ; j and 1; 2; : : : ; k according to a lexical specification. This could be an ambiguity for the lexical analyzer. However, if only one of the two sequences can form a valid sentence, the lexical analyzer is forced to choose that sequence; there is no ambiguity in this case.

Because it is impractical for the lexical analyzer to examine the whole input stream in deciding the next token, the lexical analyzer is constrained to read only the longest sequence of characters that may constitute a token. The input stream beyond what has already been read is assumed to contain arbitrary characters. Under this assumption, it suffices to consider alternative partitionings of a sequence of characters that constitute a token , rather than alternative partitionings of the whole input stream. The single token , of course, could be part of a valid sentence. Therefore, we examine whether any alternative partitioning of the sequence of characters for  could also be part of a valid sentence. If so, the language designer or the compiler implementor is warned of the potential ambiguity. Otherwise, the longest-match rule can be safely adopted.

The remainder of this paper is organized as follows. Section 2 presents an overview of our method. In Section 3, we propose a set of axioms to compute the possible sequences of output tokens of a finite automaton when elements of a regular set are used as input to the finite automaton. The difficulty in such computation lies in the Kleene star operator in the regular expressions. To solve this difficulty, Section 4 defines a “macro” automaton derived from the original automaton and solves a set of equations related to the macro automaton. In Section 5, we discuss how to decide, with an iterative algorithm, whether a regular set and a context-free language intersect. The complete detection algorithm is summarized in Section 6. The last section concludes this paper and discusses related work.

2. Overview of the detection method

In order to detect lexical ambiguities, a lexical specification, written in regular expressions, is transformed into a deterministic Moore automaton M [3,4]. There is a unique initial state and one or more accepting states in the Moore automaton. In order to simplify the following presentation, we will assume that different accepting states accept different classes of tokens.

Lexical ambiguities may arise when one accepting state can reach another (not necessarily distinct) accepting state via a nonnull path. An example is a path from state 3 to state 4 in Fig. 1(b). Given two accepting states s and t of a Moore automaton M, consider a path P that starts from the initial state, passes state s, and reaches state t. On the sequence of input characters corresponding to the path P, M moves from the initial state to state t, emits the token  accepted by state t, and returns to the initial state (if the longest-match rule is adopted). Alternatively, M may move from the initial state to state s, emit the token  accepted by state s, and scan the remaining part of the input from the initial state. When scanning the remaining input, M may produce other tokens. If the sequence of alternative output tokens cannot be part of any sentence accepted by the parser, the longest-match rule does produce the desired output, the single token . On the other hand, when the sequence of alternative output tokens may become part of a sentence accepted by the parser, ambiguity will arise. Thus, we need to examine the accepting-to-accepting paths, determine the sequences of output tokens when M scans input corresponding to these paths, and test whether the sequences of output tokens of M can be embedded in sentences accepted by the parser.
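The first step, finding the accepting-to-accepting pairs, can be sketched as a reachability search. The code below is only an illustration: the function name `accepting_pairs`, the dictionary encoding of the transition function, and the four-transition automaton are assumptions, and the automaton is not the one in Fig. 1(b).

```python
def accepting_pairs(delta, accepting):
    """Return all pairs (s, t) of accepting states joined by a nonempty path."""
    # Build the successor relation, ignoring the input characters.
    successors = {}
    for (source, _char), target in delta.items():
        successors.setdefault(source, set()).add(target)

    pairs = set()
    for s in accepting:
        seen = set()
        stack = list(successors.get(s, ()))   # start one step away from s
        while stack:
            state = stack.pop()
            if state in seen:
                continue
            seen.add(state)
            stack.extend(successors.get(state, ()))
        pairs.update((s, t) for t in seen if t in accepting)
    return pairs


# Hypothetical DFA: state 2 accepts "ab" and state 4 accepts "abab".
delta = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 3, (3, "b"): 4}
print(accepting_pairs(delta, {2, 4}))   # {(2, 4)}
```

Because the search starts one step away from s, a pair (s, s) is reported only when s lies on a cycle, which matches the "not necessarily distinct" accepting states joined by a nonnull path.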

Define the look-ahead set as the set of all accepting-to-accepting paths. Given two accepting states s and t of a Moore automaton M, the set of all paths from s to t is a regular set, which may be described by a regular expression. For the sake of computing alternative output tokens, each such regular expression is annotated with the token class accepted by the state s. Thus, elements of the look-ahead set are pairs (, la), where  is a token class and la is a regular expression of the input characters, called the look-ahead expression.

To compute the look-ahead expression from state s to state t, we identify the subgraph of the state-transition graph of M induced by all the states that are reachable from state s. This subgraph is also a finite automaton, in which state s is the initial state and t is the (only) accepting state. Then this new finite automaton is converted back to a regular expression, which is the required look-ahead expression. The conversion of a finite automaton to a regular expression uses a technique very similar to the one discussed in Section 4 [10].

The second step of the detection method is, for each pair (, la) of the look-ahead set, to determine the sequences of tokens produced by the automaton M when M scans strings of input characters satisfying the regular expression la. Since the look-ahead expression la may represent an infinite number of strings, it is not feasible to run M on the strings one by one. We propose a set of axioms to compute this set of sequences of output tokens in Section 3. It turns out that this set of sequences of tokens produced by M is also a regular set. Call this set the alternative output sequences corresponding to the pair (, la), denoted by AOS(, la).

Elements of AOS(, la) are potential sequences of output tokens if the longest-match rule is not applied. If any element of AOS(, la) may be part of a sentence derivable from the context-free grammar underlying the parser, there are potential lexical ambiguities. Only in such a case will the language designer or the compiler implementor be warned about this potential ambiguity.

Our next task is to determine whether any element of a regular set AOS(, la) can be part of a sentence of a context-free language. Let  be the regular expression denoting the set AOS(, la). Let  be the regular set defined by the expression VV, where V is the set of all possible tokens. Let L denote the context-free language specified by the context-free grammar underlying the parser. We solve the above problem by determining whether the intersection of the regular set  and the context-free language L is an empty set or not. If  and L do not intersect, we may safely adopt the longest-match rule. Otherwise, there is a string of input characters that can be partitioned into several distinct sequences of tokens, two or more of which may be part of sentences derivable from the context-free grammar. We use an iterative algorithm to solve the intersection problem. The details are in Section 5.

3. Computing sequences of output tokens

In this section, we show how to compute the sequences of tokens produced by a (deterministic) Moore automaton when the automaton scans a string of input characters that is an element of a regular set. When the regular set is finite, the Moore automaton can scan the strings of the regular set one by one and collect the results. The difficulty is that the regular set may be infinite. We use eight axioms and an algorithm to compute the set of all possible sequences of output tokens. The following result shows that this set of all possible sequences of output tokens is also a regular set, whose vocabulary is the set of all tokens.

In order to simplify the following presentation, we assume that the Moore automaton is deterministic. Let M be a deterministic Moore automaton (M will also denote the state-transition function of the automaton). Let a be an input character and A and B be regular expressions of input characters. Let q0 denote the initial state of M and Token(M(q, a)) denote the output token associated with state M(q, a). Let X and Y be sets of pairs of the form [q, #],³ where q is a state of M and # is a regular expression of output tokens (not input characters). Intuitively, q denotes the state of M at a certain point of time and # denotes the set of accumulated sequences of output tokens at that time. First, we define the combined transition and output function ⊕ of the automaton M on a regular expression by the following eight axioms:

(1) {[q, #]} ⊕ a = {[M(q, a), #]} ∪ {[q0, # · Token(M(q, a))]} if M(q, a) is an accepting state,
    {[q, #]} ⊕ a = {[M(q, a), #]} if M(q, a) is not an accepting state,
    {[q, #]} ⊕ a = ∅ if M(q, a) is error,
(2) X ⊕ $ = X,
(3) X ⊕ (A · B) = (X ⊕ A) ⊕ B,
(4) X ⊕ (A | B) = (X ⊕ A) ∪ (X ⊕ B),
(5) X ⊕ A∗ = limit {Y1, Y2, ...}, where Y1 = X and Yn+1 = Yn ∪ (Yn ⊕ A),
(6) ∅ ⊕ A = ∅,
(7) (X ∪ Y) ⊕ A = (X ⊕ A) ∪ (Y ⊕ A),
(8) {[q, #], [q, ']} = {[q, (# | ')]}.
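For expressions without the Kleene star, the axioms can be evaluated directly. The sketch below is an illustration under stated assumptions, not the paper's implementation: regular expressions are encoded as nested tuples (`"eps"`, `"chr"`, `"cat"`, `"alt"`), each pair carries one concrete token sequence as a Python tuple, and a set of pairs stands for their union (so Axiom 8 is implicit). The automaton and the token names `X` and `Y` are hypothetical. Axiom 5 is deliberately left out.

```python
def step(pairs, char, delta, token_of, q0):
    """Axioms 1, 6, 7: advance every pair [q, out] over one input character."""
    result = set()
    for q, out in pairs:
        target = delta.get((q, char))
        if target is None:
            continue                          # error transition contributes the empty set
        result.add((target, out))             # keep scanning without emitting
        if target in token_of:                # accepting: may also emit and restart at q0
            result.add((q0, out + (token_of[target],)))
    return result


def oplus(pairs, expr, delta, token_of, q0):
    """Evaluate pairs (+) expr for star-free expressions (Axioms 1-4, 6, 7)."""
    kind = expr[0]
    if kind == "eps":                         # Axiom 2
        return set(pairs)
    if kind == "chr":                         # Axiom 1
        return step(pairs, expr[1], delta, token_of, q0)
    if kind == "cat":                         # Axiom 3
        first = oplus(pairs, expr[1], delta, token_of, q0)
        return oplus(first, expr[2], delta, token_of, q0)
    if kind == "alt":                         # Axiom 4
        return (oplus(pairs, expr[1], delta, token_of, q0)
                | oplus(pairs, expr[2], delta, token_of, q0))
    raise ValueError(f"unsupported expression: {kind}")


# Hypothetical DFA: state 2 accepts token X ("ab"), state 4 accepts token Y ("abab").
delta = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 3, (3, "b"): 4}
token_of = {2: "X", 4: "Y"}
ab = ("cat", ("chr", "a"), ("chr", "b"))
result = oplus({(0, ())}, ("cat", ab, ab), delta, token_of, 0)
print(sorted(result))
# [(0, ('X', 'X')), (0, ('Y',)), (2, ('X',)), (4, ())]
```

On the input “abab” the result contains both the longest-match outcome (a single `Y`) and the alternative (two `X` tokens), together with the pairs in which the last token has not been emitted yet.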

Axioms 1, 2, 3, 4, 6, and 7 are quite straightforward. Axiom 1 describes the state-transition and output behavior of M. Note that M transits to the initial state immediately after a token is emitted. Note also that M reserves the right not to emit a token even if it enters an accepting state. Axiom 8 combines pairs with the same state into a single pair. Axiom 5 reflects the fact that the Kleene star operation denotes zero or more repetitions of a regular expression.

³ We use the square brackets [ ... ] for pairs concerning the sequence of output tokens. We reserve the round brackets

The limit operation in the fifth axiom is defined as follows. Given two regular expressions #1 and #2, we say #1 ≤ #2 if the regular set defined by #1 is a subset of that defined by #2. Given two pairs [q1, #1] and [q2, #2], we say [q1, #1] ≤ [q2, #2] if q1 = q2 and #1 ≤ #2. Given two sets of pairs U = {[pi, #i] | i = 1, 2, ...} and V = {[qj, 'j] | j = 1, 2, ...}, we say U ≤ V if (1) for every pair [pi, #i] in U, there is a pair [qj, 'j] in V such that [pi, #i] ≤ [qj, 'j], or (2) there is a set W such that U ≤ W and W = V (based on Axiom 8). Given a sequence of sets of pairs V1, V2, ..., we say V = limit {V1, V2, ...} if (1) Vi ≤ V for all i, and (2) for any pair [q, '] in V, and any element ' of the regular set defined by ', there exists a set Vk in the sequence of the sets of pairs such that {[q, ']} ≤ Vk. We can easily prove the following lemma according to the above definition.

Lemma. Given two sets of pairs U and V, if U ≤ V then U ∪ V = V.

Intuitively, limit {Y1, Y2, ...} is equal to Y1 ∪ Y2 ∪ .... But this definition is useless since, in Axiom 5 above, it is already known that Yn ≤ Yn+1 for all n. Note that, in general, for any m, there exists n > m such that Ym ≠ Yn. Therefore, limit {Y1, Y2, ...} ≠ Yn, for any n. Direct computation of Y1, Y2, ... effectively enumerates the elements of a regular set. This implies that direct computation of the sets Y1, Y2, ... one by one may not always result in the desired limit. We will present a method to compute the limit in the next section.

X ⊕ A∗ is a fixed point of a monotone function on a lattice, defined as follows. Consider a deterministic Moore automaton M. Define , = the set of all output tokens. Define E = {[q, #] | q is a state of M and # is a regular expression over ,}. Let 2^E denote the powerset of E. Then (2^E, ≤) is a partially ordered set. Define, for U, V ∈ 2^E,

U ⊔ V = lub(U, V) = U ∪ V,    U ⊓ V = glb(U, V).

We can show that (2^E, ⊔, ⊓) is a lattice. Unfortunately, it is not a complete lattice. On this lattice, we may define a (monotone) function fA = λU. U ∪ (U ⊕ A). Note that fA^n(X) = Yn+1, as defined in Axiom 5. The limit limit {Y1, Y2, ...} is a fixed point of fA. The fixed point corresponds to the equation (X ⊕ A∗) ⊕ A = X ⊕ A.

Based on the algorithm in the next section, we assert that the limit always exists for any X and is unique up to equivalent regular expressions and the application of Axiom 8. Furthermore, the computation of the limit in Axiom 5 will halt in a finite amount of time and the limit may be represented as a finite set of pairs (that is, the number of pairs in the limit is finite though each pair may denote an infinite regular set).

A pair [q, '] contains both the information regarding the final state and the information regarding the sequences of output tokens. Since we are only interested in the sequences of output tokens, we use the operation collect to collect all the potential output, which is defined as follows. Consider a finite set of pairs X = {[pi, #i] | i = 1, 2, ..., k}. For each pair [pi, #i] in X, let {'i,j | j = 1, 2, ..., mi} be the set of tokens associated with the accepting states that are reachable from state pi in M’s state-transition graph. Let /i denote the regular expression #i ('i,1 | 'i,2 | ... | 'i,mi). Then collect(X) = (/1 | /2 | ... | /k).

Our purpose in this section is, given a pair (, la) of the look-ahead set, to compute the set of the alternative output sequences corresponding to the pair (, la), that is, AOS(, la). Note that AOS(, la) = collect({[q0, ]} ⊕ la), where q0 is the initial state of M. An example of computing AOS is included in the next section.

4. Computing X ⊕ A* in Axiom 5

In this section, we discuss in detail how to compute the limit operation used in Axiom 5 listed in the previous section. Our purpose is to compute X ⊕ A∗, where X is a finite set of pairs of the form [q, '] and A is a regular expression. As was discussed in the previous section, it is not feasible to compute the sequence X, X ⊕ A, (X ⊕ A) ⊕ A, ..., etc. Our solution is to construct a new automaton and then solve equations of regular expressions on this new automaton.

Fig. 2. Construction of Mq,A.

Though X is a set of pairs, due to Axiom 7, we may consider one pair at a time. In order to compute {[q, ']} ⊕ A∗ on a given finite automaton M, we first construct a new macro automaton Mq,A. Mq,A is a nondeterministic Mealy automaton in which every transition corresponds to an aggregate of transitions of the original automaton M on an element of the regular set defined by A. The macro automaton Mq,A is constructed with the work-list algorithm shown in Fig. 2. Initially, the work list contains a single state, that is, state q. Then a state p is picked up from the work list and {[p, $]} ⊕ A ($ is the empty string) is computed inductively by the axioms in the previous section. Let {[p, $]} ⊕ A = {[pi, #i] | i = 1, 2, ..., k}. For each pair [pi, #i], create a new state pi in Mq,A (if one does not already exist) and create a new transition from state p to state pi, which is labeled with the output sequence #i. If pi is a newly created state, then state pi is added to the work list. The above step is repeated for each state added to the work list. This work-list algorithm terminates when the work list becomes empty. The initial state of Mq,A is the state q. Note that the edges of Mq,A are labeled with the output token sequences, rather than the input characters (since the input is always the regular expression A). Note that the work-list algorithm must terminate because there are only a finite number of states in M and each state is processed at most once. Fig. 2 is a recast of the work-list algorithm.
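The work-list construction can be sketched as follows. This is a simplified illustration, not the algorithm of Fig. 2: it assumes that A is a plain string of input characters rather than a general regular expression, it labels each macro edge with one concrete token sequence (a Python tuple), and the names `run` and `macro_automaton` are hypothetical. The three a-transitions used in the example are the ones that appear in the paper's computation of M1,a.

```python
def run(q, word, delta, token_of, q0):
    """Compute {[q, empty]} (+) word for a plain string, using Axioms 1 and 3."""
    pairs = {(q, ())}
    for char in word:
        advanced = set()
        for state, out in pairs:
            target = delta.get((state, char))
            if target is None:
                continue                       # error transition: nothing to add
            advanced.add((target, out))        # continue without emitting
            if target in token_of:             # accepting: emit and restart at q0
                advanced.add((q0, out + (token_of[target],)))
        pairs = advanced
    return pairs


def macro_automaton(q, word, delta, token_of, q0):
    """Build the states and output-labeled edges of the macro automaton."""
    states = {q}
    edges = set()                              # (source, output sequence, target)
    work_list = [q]
    while work_list:
        p = work_list.pop()
        for target, out in run(p, word, delta, token_of, q0):
            edges.add((p, out, target))
            if target not in states:           # newly created state
                states.add(target)
                work_list.append(target)
    return states, edges


# Only the a-transitions 1 -> 2, 2 -> 5, 5 -> 2 are needed here; no accepting
# state is entered on them, so the token map can stay empty.
delta = {(1, "a"): 2, (2, "a"): 5, (5, "a"): 2}
states, edges = macro_automaton(1, "a", delta, {}, 1)
print(sorted(states))   # [1, 2, 5]
print(sorted(edges))    # [(1, (), 2), (2, (), 5), (5, (), 2)]
```

Each state enters the work list at most once, so the loop runs at most once per state of M, which is the termination argument given in the text.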

After constructing the macro automaton Mq,A, we calculate the output regular expression for each state of Mq,A. The calculation is performed by solving a set of equations. For each state p of Mq,A, there is a variable P representing the regular expression associated with state p. Let {ri --#i--> p | i = 1, 2, ..., m} be the set of edges entering state p (the notation ri --#i--> p denotes an edge from ri to p labeled #i). If state p is not the initial state of Mq,A, the variable P is defined by the equation P = R1 #1 | R2 #2 | ... | Rm #m. For the initial state q of Mq,A, let {ri --#i--> q | i = 1, 2, ..., n} be the set of edges entering state q. The equation defining q’s variable is Q = ' | R1 #1 | R2 #2 | ... | Rn #n. This set of mutually recursive equations can be solved with a

After obtaining the regular expression P for each state p of automaton Mq,A, the set {[p, P] | p is a state of Mq,A, and P is the regular expression associated with state p} is {[q, ']} ⊕ A∗.

Example. Consider the scanner specification in Fig. 1(a). Two classes of tokens are specified:  and . The scanner specification is converted to the deterministic Moore automaton M shown in Fig. 1(b), where state 3 recognizes the token class  and state 4 recognizes the token class . There is an accepting-to-accepting path from state 3 to state 4. Hence, the look-ahead set is {(, a∗b)}. Then we need to compute AOS(, a∗b) = collect({[1, ]} ⊕ a∗b). First we compute {[1, ]} ⊕ a∗.

To compute {[1, ]} ⊕ a∗, we need to construct the automaton M1,a. Initially, there is a state 1 in M1,a. Then the following computation is performed, based on the axioms in Section 3:

{[1, $]} ⊕ a = {[2, $]},
{[2, $]} ⊕ a = {[5, $]},
{[5, $]} ⊕ a = {[2, $]}.

Thus, two more states 2 and 5 are created in M1,a, together with the transitions. Fig. 1(d) shows the automaton M1,a. Let P, Q, and R be the regular-expression variables for states 1, 2, and 5, respectively. Then we may set up the following equations:

P = ,
Q = P$ | R$,
R = Q$.

Solving these equations, we obtain the “least” solution P = Q = R = . (The least solution is obtained by considering the equal sign = in the equations as the derivation sign → in production systems.) Therefore, {[1, ]} ⊕ a∗ = {[1, ], [2, ], [5, ]}. Call this set X. Next we may compute X ⊕ b = {[3, ], [1, ]}. Therefore, {[1, ]} ⊕ a∗b = {[3, ], [1, ]}. This means that, if elements of the regular set defined by the look-ahead expression a∗b are fed into the automaton M, either M will end up in state 3 with no output (the token  is accounted for previous input characters) or in state 1 with a single output token . For example, suppose that the input is abaab, of which the suffix aab is the look-ahead string. The automaton in Fig. 1(b) may exhibit three kinds of possible behaviors: (1) it produces a token  and returns to the initial state 1; (2) it emits a token  and halts at state 3; or (3) it emits two tokens  and returns to state 1. Case (1) above occurs when the longest-match rule is adopted. Cases (2) and (3), corresponding to the above computation, indicate the possible states of the automaton if the longest-match rule is not adopted.

Finally, the collect operation is performed. In the automaton in Fig. 1(b), both state 1 and state 3 can reach the two accepting states 3 and 4. Each of the two tokens  and  is appended to each of the two expressions  and . Hence, we reach the result AOS(, a∗b) = {, , , }, which corresponds to the regular expression  = ( | )( | ). The regular set defined by  is tested for containment in sentences derived by the context-free grammar underlying the parser. Though the regular set contains only four elements and can be tested easily, we will propose a general solution for arbitrary regular sets in the next section.

Before we prove that the above is correct, we first claim, without a proof, that the set of equations of regular expressions is solved correctly. Specifically, we claim the following two lemmas.

Lemma. For any pair {[r, R]} ≤ {[q, ']} ⊕ A^n, for some n, the state r is included in Mq,A. Furthermore, let R be the regular expression associated with state r obtained by solving the set of equations. Then R is an element of the regular set defined by R.

Lemma. Let r be a state of Mq,A and R be the regular expression associated with state r. Let R be an element of the regular set defined by R. Then R corresponds to a finite path from the initial state to state r in Mq,A.

Below we show that the computation of {[q, ']} ⊕ A∗ is correct. Specifically, we need to prove the following theorem.

Theorem. The set {[p, P] | p is a state of Mq,A, and P is the regular expression associated with state p} computed by the method in this section is equal to {[q, ']} ⊕ A∗ defined in the previous section.

Proof. Let X denote the set {[q, ']}. Let Y denote the set {[p, P] | p is a state of Mq,A, and P is the regular expression associated with state p} computed by the method in this section. Let Y1 = X and Yn+1 = Yn ∪ (Yn ⊕ A). We need to show that Y = limit {Y1, Y2, ...}.

First we show that Yn ≤ Y, for all n. Since Yn ≤ Yn+1 for all n, consider any set of a pair {[r, R]} such that {[r, R]} ≤ Yn+1 but {[r, R]} ≰ Yn. Intuitively, {[r, R]} ≤ {[q, ']} ⊕ A^n. Based on the construction of Mq,A, the state r must be a state in the macro machine Mq,A. Let R be the regular expression associated with state r obtained by solving the set of equations. Since we assume that the set of equations of regular expressions is solved correctly, R is an element of the regular set defined by R. Hence, {[r, R]} ≤ Y. This implies that Yn ≤ Y, for all n.

Next we need to show that for any pair [r, R] in Y, and any element R of the regular set defined by R, there exists a set Yk such that {[r, R]} ≤ Yk. Since we assume that the set of equations is solved correctly, R must correspond to a finite path from the initial state to state r in Mq,A. Let k be the length of the path. Then {[r, R]} ≤ Yk.

Based on the correctness of solving the set of equations induced by the macro automaton Mq,A, we may prove the correctness of the axioms of the previous section.

Theorem. The set of the 8 axioms of Section 3 correctly computes [q, #] ⊕ A for any regular expression A in a finite amount of time.

Proof. By structural induction [11] on the syntax of the regular expressions. Details are omitted.

5. Determining the intersection of a regular language and a context-free language

Our next task is to determine whether any element of a regular set AOS(, la) can be part of a sentence of the context-free language accepted by the parser. Let  be the regular expression denoting the set AOS(, la). Let  be the regular set defined by the expression VV, where V is the set of all possible tokens. Let L denote the context-free language accepted by the parser. We solve the above problem by determining whether the intersection of the regular set  and the context-free language L is an empty set or not. Note that both  and L are based on the same vocabulary, the set of all tokens.

Let M be a finite automaton corresponding to the regular set  and G be the context-free grammar underlying the parser. We define an operator ⊗ that takes two arguments: a set of states (of M) and a string of terminals and nonterminals (of G). The result of ⊗ is a set of states (of M). The notation Q ⊗ # = Q′ means that, starting from a state of Q, M will reach a state of Q′ on an input string that is derivable from the string # (according to the production rules of G). With the ⊗ operator, the regular set and the context-free language intersect if and only if the set {q0} ⊗ S contains an accepting state of M, where q0 is the initial state of M and S is the start symbol of the grammar G.

The ⊗ operator may be viewed as an extension of the transition function of the automaton to strings of terminals and nonterminals of the context-free grammar. The operator ⊗ is defined by the following five axioms:

(1) {q} ⊗ a = {q′ | M moves from state q to state q′ on input a, where a is a terminal of G}.
(2) {q} ⊗ A = {q′ | M moves from state q to state q′ on a string of terminals #, where A is a nonterminal of G and A →∗ #}.
(3) Q ⊗ #' = (Q ⊗ #) ⊗ ', where # and ' are strings of terminals and nonterminals.
(4) (Q1 ∪ Q2) ⊗ # = (Q1 ⊗ #) ∪ (Q2 ⊗ #), where # is a string of terminals and nonterminals.
(5) ∅ ⊗ # = ∅.

The operator ⊗ is similar to the ⊕ operator defined in the previous section. They differ in two aspects: (1) the second argument of the ⊗ operator is a context-free language whereas that of the ⊕ operator is a regular set; and (2) the ⊗ operator does not compute output components.

To find {q} ⊗ a for a state q of M and a terminal a of G, we may simply examine the transition table of M. To find {q} ⊗ A, where A is a nonterminal of G, we establish a set of equations and solve the equations iteratively. Let A → α1 | α2 | ... | αm be the set of all the A-productions in G. Then

{q} ⊗ A = ({q} ⊗ α1) ∪ ({q} ⊗ α2) ∪ ... ∪ ({q} ⊗ αm).

There is one such equation for each state q of M and each nonterminal A of G. The above set of equations can be solved by an iteration algorithm. Initially, assume {q} ⊗ A = ∅ for each q and each A. Then we repeatedly evaluate the set of equations until a stable solution is reached. The iteration algorithm is shown in Fig. 3, where an expression, such as {q} ⊗ A, is treated as a variable.
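The iteration described above might be sketched in Python as follows. This is a minimal sketch under our own assumptions and is not a transcription of Fig. 3: the names `apply_string` and `solve_otimes`, the dictionary encodings of the automaton and the grammar, and the small demo automaton and grammar are all hypothetical. Each variable {q} ⊗ A starts as the empty set, and every equation is re-evaluated until no variable changes.

```python
def apply_string(states, symbols, delta, grammar, table):
    """Compute Q ⊗ X1 X2 ... Xn one symbol at a time (axioms 3, 4 and 5).

    Terminals use the transition table `delta`; nonterminals use the
    current estimate of {q} ⊗ A stored in `table`.
    """
    current = set(states)
    for sym in symbols:
        nxt = set()
        for q in current:
            if sym in grammar:                  # nonterminal: current estimate
                nxt |= table[(q, sym)]
            else:                               # terminal: one automaton move
                nxt |= delta.get((q, sym), set())
        current = nxt
    return current


def solve_otimes(states, delta, grammar):
    """Iterate the equations {q} ⊗ A = union of {q} ⊗ rhs until stable."""
    table = {(q, nt): set() for q in states for nt in grammar}
    changed = True
    while changed:
        changed = False
        for (q, nt) in table:
            new = set()
            for rhs in grammar[nt]:
                new |= apply_string({q}, rhs, delta, grammar, table)
            if new != table[(q, nt)]:           # sets only ever grow
                table[(q, nt)] = new
                changed = True
    return table


# Hypothetical demo: automaton 0 -a-> 0, 0 -c-> 1, 1 -b-> 1,
# and grammar S -> a S b | c.
demo_delta = {(0, "a"): {0}, (0, "c"): {1}, (1, "b"): {1}}
demo_grammar = {"S": [["a", "S", "b"], ["c"]]}
demo = solve_otimes([0, 1], demo_delta, demo_grammar)
```

In this demo, `demo[(0, "S")]` is `{1}` and `demo[(1, "S")]` is empty, because no sentence of the demo grammar can be read starting from state 1. An empty right-hand side is handled naturally: `apply_string` then returns the starting set unchanged.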

Example. Fig. 4(a) is a deterministic automaton corresponding to the regular set. State 1 is the initial state. Fig. 4(b) is a grammar defining the context-free language L. Fig. 4(c) is the ⊗ operator applied to the automaton in Fig. 4(a) and the context-free grammar in Fig. 4(b). For the sake of brevity, we have omitted the set symbols {...}. From the definition of the ⊗ operator, we obtain the following six equations:

{1} ⊗ T = ({1} ⊗ 8T8) ∪ ({1} ⊗ 9T9) ∪ ({1} ⊗ :)
        = ({1} ⊗ 8 ⊗ T ⊗ 8) ∪ ({1} ⊗ 9 ⊗ T ⊗ 9) ∪ ({1} ⊗ :)
        = ({2} ⊗ T ⊗ 8) ∪ ({1} ⊗ T ⊗ 9) ∪ {1},

{2} ⊗ T = ({2} ⊗ 8T8) ∪ ({2} ⊗ 9T9) ∪ ({2} ⊗ :)
        = ({2} ⊗ 8 ⊗ T ⊗ 8) ∪ ({2} ⊗ 9 ⊗ T ⊗ 9) ∪ ({2} ⊗ :)
        = (∅ ⊗ T ⊗ 8) ∪ ({1} ⊗ T ⊗ 9) ∪ {1},

{3} ⊗ T = ({3} ⊗ 8T8) ∪ ({3} ⊗ 9T9) ∪ ({3} ⊗ :)
        = ({3} ⊗ 8 ⊗ T ⊗ 8) ∪ ({3} ⊗ 9 ⊗ T ⊗ 9) ∪ ({3} ⊗ :)
        = (∅ ⊗ T ⊗ 8) ∪ (∅ ⊗ T ⊗ 9) ∪ ∅,

{1} ⊗ S = {1} ⊗ T$
        = {1} ⊗ T ⊗ $,

{2} ⊗ S = {2} ⊗ T$
        = {2} ⊗ T ⊗ $,

{3} ⊗ S = {3} ⊗ T$
        = {3} ⊗ T ⊗ $.

Fig. 3. The iteration algorithm.

Fig. 4. An automaton, a context-free grammar, and the ⊗ operator.

Let x1, x2, x3, x4, x5, and x6 denote the six terms {1} ⊗ T, {2} ⊗ T, {3} ⊗ T, {1} ⊗ S, {2} ⊗ S, and {3} ⊗ S, respectively. After some simplification, we get the following six equations:

x1 = (x2 ⊗ 8) ∪ (x1 ⊗ 9) ∪ {1},
x2 = (x1 ⊗ 9) ∪ {1},
x3 = ∅,
x4 = x1 ⊗ $,
x5 = x2 ⊗ $,
x6 = x3 ⊗ $.

Initially, assume that x1 = x2 = x3 = x4 = x5 = x6 = ∅. We repeatedly evaluate the six equations. After three iterations, we reach a stable solution: {1} ⊗ S = {3}, {2} ⊗ S = {3}, {3} ⊗ S = ∅, {1} ⊗ T = {1, 2}, {2} ⊗ T = {1}, and {3} ⊗ T = ∅.

Because {1} ⊗ S = {3}, which means that the automaton moves from state 1 (the initial state) to 3 (an accepting state) on a sentence derived from the start symbol S, the regular set and the context-free language do have common elements. An example is the string 89:98$.
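The six simplified equations of this example can be iterated directly, as in the sketch below. Fig. 4(a) is not reproduced in this text, so the transition table `DELTA` is our reconstruction from the equations and from the example string 89:98$, and should be read as an assumption; the helper names `move`, `step`, and `solve` are likewise hypothetical.

```python
# Transition table reconstructed from the equations of the example
# (an assumption: Fig. 4(a) itself is not available here).
# Format: state -> {terminal: next state}.
DELTA = {
    1: {"8": 2, "9": 1, ":": 1, "$": 3},
    2: {"9": 1, ":": 1, "$": 3},
    3: {},
}


def move(states, terminal):
    """Q ⊗ a for a set of states Q and a single terminal a."""
    return {DELTA[q][terminal] for q in states if terminal in DELTA[q]}


def step(x):
    """Evaluate the six simplified equations once, using the previous values."""
    x1, x2, x3, x4, x5, x6 = x
    return (
        move(x2, "8") | move(x1, "9") | {1},  # x1 = {1} ⊗ T
        move(x1, "9") | {1},                  # x2 = {2} ⊗ T
        set(),                                # x3 = {3} ⊗ T
        move(x1, "$"),                        # x4 = {1} ⊗ S
        move(x2, "$"),                        # x5 = {2} ⊗ S
        move(x3, "$"),                        # x6 = {3} ⊗ S
    )


def solve():
    """Start from empty sets and re-evaluate until nothing changes."""
    x = tuple(set() for _ in range(6))
    passes = 0
    while True:
        passes += 1
        new = step(x)
        if new == x:
            return x, passes
        x = new


solution, passes = solve()
```

With this reconstructed table, `solution` is ({1, 2}, {1}, ∅, {3}, {3}, ∅), which agrees with the stable solution stated above, and the third evaluation pass is the one that detects that nothing has changed.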

The iteration algorithm in Fig. 3 always halts due to its accumulative nature: each variable is a subset of the finite set of states of M, and states are only added, never removed. That the solution is correct can be proved by inductive reasoning, as follows: the addition of a state q′ into the solution of {q} ⊗ A can be traced backward eventually to a transition from a state q1 to a state q2 on a in the finite automaton, where a is a token. Thus, we can construct a string of terminals that is derivable from A and on which the finite automaton moves from state q to state q′. Conversely, if the finite automaton moves from state q to state q′ on a string derivable from A, there must be a path (of finite length) from q to q′ on the state-transition graph of the finite automaton that is labeled with the string. Thus, q′ must eventually be added to {q} ⊗ A. Specifically, the following theorem is asserted:

Theorem (Correctness of the iteration algorithm): q′ ∈ {q} ⊗ A if and only if there is a string derivable from the (terminal or nonterminal) symbol A according to the context-free grammar on which the finite automaton moves from state q to state q′.
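The theorem can be spot-checked on the example by brute force: enumerate strings derivable from T up to a bounded nesting depth, run the automaton on each, and collect the states reached. The sketch below does this under the same assumption as before, namely that the transition table `DELTA` is our reconstruction of Fig. 4(a) from the equations; the function names and the depth bound are hypothetical. A bounded enumeration can only illustrate the theorem, not prove it.

```python
# Reconstructed transition table (an assumption; Fig. 4(a) is not shown here).
DELTA = {
    1: {"8": 2, "9": 1, ":": 1, "$": 3},
    2: {"9": 1, ":": 1, "$": 3},
    3: {},
}


def derive(depth):
    """Strings derivable from T -> 8T8 | 9T9 | : with nesting at most `depth`."""
    if depth == 0:
        return {":"}
    inner = derive(depth - 1)
    return ({":"}
            | {"8" + w + "8" for w in inner}
            | {"9" + w + "9" for w in inner})


def run(q, word):
    """Run the automaton from state q; return the final state or None if stuck."""
    for a in word:
        q = DELTA[q].get(a)
        if q is None:
            return None
    return q


def reachable(q, depth=4):
    """States reached from q on some enumerated string derivable from T."""
    return {run(q, w) for w in derive(depth)} - {None}
```

For this table, `reachable(1)`, `reachable(2)`, and `reachable(3)` give {1, 2}, {1}, and the empty set, matching {1} ⊗ T, {2} ⊗ T, and {3} ⊗ T from the iteration, and `run(1, "89:98$")` ends in the accepting state 3.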

Yet another problem that we need to address is the independence of the answer from the particular finite automaton and the particular context-free grammar used in the computation. We investigate whether a regular set and a context-free language have common elements. In the iteration algorithm, an arbitrary finite automaton for the regular set and an arbitrary context-free grammar for the context-free language are chosen for setting up the set of equations. It is natural to ask whether we will reach a different answer if different finite automata or different context-free grammars are chosen. Based on the correctness of the iteration algorithm, we may claim that the same result is always reached no matter which finite automaton or context-free grammar is chosen.


Fig. 5. The complete detection algorithm.

A classical method for the intersection problem is to integrate the finite automaton of the regular set and the pushdown automaton of the context-free language into a new pushdown automaton. A new context-free grammar can then be derived from the integrated pushdown automaton. Though the ⊗ operator does not compute the exact intersection, it provides additional information relating the states of a finite automaton and the nonterminals of a context-free grammar. This information is useful in simplifying the LR parser of the context-free grammar [12].

6. The complete detection algorithm

Fig. 5 is the complete conflict detection algorithm. It first computes the deterministic Moore automaton for the set of regular expressions for tokens. Then each pair of accepting states of the automaton is examined for alternative partitionings when the longest-match rule is not observed. The regular set AOS(; la) is computed by the algorithms in Sections 3 and 4. Then the algorithm in Section 5 is applied to determine whether any element of AOS(; la) could be part of a sentence of the programming language. If so, a misinterpretation may potentially arise and hence a suitable warning message is issued. It is this warning message that indicates the precise situation in which the longest-match rule is not applicable.


7. Conclusion and related work

We have identified the applicability problem of the longest-match rule and proposed a solution. The crux of the solution consists of two algorithms. The first computes the regular set of the sequences of tokens produced by a nondeterministic Mealy automaton when the automaton processes elements of an input regular set. The second uses a set of equations to determine whether a regular set and a context-free language have a nontrivial intersection.

The work reported here is a systematic method to detect potential ambiguities in lexical analysis. It helps a language designer check the design of a new language, and it helps a compiler writer apply the longest-match rule. As far as we know, this kind of integrated check (of both the lexical specification and the syntactic specification) has not been studied in detail previously. The work reported here may be viewed as a refined longest-match rule. The longest-match rule is widely used in compiler implementation [3,4] and has been studied in detail in [13], where the author proposed a new lexical analysis method to solve the look-ahead problem. Most compiler textbooks treat a lexical analyzer as a Moore machine. The scangen scanner generator [4] exhibits some flavor of a Mealy machine. A deterministic Mealy-machine model is proposed in [2] for lexical analysis due to the longest-match rule. The work reported in [2] is concerned only with a subclass of automata, the class of finite look-ahead automata; by contrast, the work reported in this paper is applicable to infinite look-ahead automata as well as finite look-ahead ones.

References

[1] Hopcroft JE, Ullman JD. Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley, 1979.

[2] Yang W. Mealy machines are a better model of lexical analyzers. Computer Languages 1996;22(1):27–38.

[3] Aho AV, Sethi R, Ullman JD. Compilers: principles, techniques, and tools. Reading, MA: Addison-Wesley, 1986.

[4] Fischer CN, Leblanc Jr RJ. Crafting a compiler with C. Reading, MA: Benjamin/Cummings, 1991.

[5] Ellis MA, Stroustrup B. The annotated C++ reference manual. Reading, MA: Addison-Wesley, 1990.

[6] Wirth N. Programming with Modula-2, 2nd corrected ed. New York: Springer, 1983.

[7] Lesk ME, Schmidt E. LEX: a lexical analyzer generator. Computer Science Technical Report 39, Bell Labs., Murray Hill, NJ, 1975.

[8] Gray RW, Heuring VP, Levi SP, Sloane AM, Waite WM. Eli: a complete, flexible compiler construction system. Communications of the ACM 1992;35(2):121–31.

[9] Parr T. Language translation using PCCTS and C++: a reference guide. San Jose, CA: Automata Publishing, 1997. A pre-release version is available from ftp://ftp.parr-research.com/pub/pccts/Book/reference.ps.

[10] Aho AV, Ullman JD. The theory of parsing, translation, and compiling: parsing. Englewood Cliffs, NJ: Prentice-Hall, 1972.

[11] Burstall RM. Proving properties of programs by structural induction. The Computer Journal 1969;12(1):41–8.

[12] Yang W. A lattice framework for analyzing context-free languages with applications in parser simplification and data-flow analysis. Journal of Information Science and Engineering 1999;15(2):287–306.

[13] Yang W. On the look-ahead problem in lexical analysis. Acta Informatica 1995;32:459–76.

Wuu Yang received his B.S. degree in computer science from National Taiwan University in 1982 and the M.S. and Ph.D. degrees in computer science from the University of Wisconsin at Madison in 1987 and 1990, respectively. Currently he is a professor at the National Chiao-Tung University, Taiwan, Republic of China. Dr. Yang’s current research interests include Java and network security, programming languages and compilers, and attribute grammars. He is also very interested in the study of human languages and human intelligence.

Chey-Woei Tsay received his B.S. degree in computer science from National Taiwan University in 1982. After receiving a Ph.D. degree in computer science from University of Utah, he joined the Department of Computer Science and Information Management, Providence University, Taiwan, Republic of China. Dr. Tsay’s current research interests include network computation, computer graphics, programming languages and compilers, and user interface design.

Jien-Tsai Chan received his M.S. degree in Computer and Information Science from the National Chiao Tung University in 1996. Currently he is a Ph.D. candidate at the same university. His research interests include programming languages, attribute grammars, compilers, and programming systems.
