### 1.8 **FINITE REPRESENTATIONS OF LANGUAGES**

A central issue in the theory of computation is the representation of languages by finite specifications. Naturally, any finite language is amenable to finite representation by exhaustive enumeration of all the strings in the language. The issue becomes challenging only when infinite languages are considered.

Let us be somewhat more precise about the notion of "finite representation of a language." The first point to be made is that any such representation must itself be a string, a finite sequence of symbols over some alphabet $\Sigma$. Second, we certainly want different languages to have different representations, otherwise the term *representation* could hardly be considered appropriate. But these two requirements already imply that the possibilities for finite representation are severely limited. For the set $\Sigma^*$ of strings over an alphabet $\Sigma$ is countably infinite, so the number of possible representations of languages is countably infinite. (This would remain true even if we were not bound to use a particular alphabet $\Sigma$, so long as the total number of available symbols was countably infinite.) On the other hand, the set of all possible languages over a given alphabet $\Sigma$, that is, $2^{\Sigma^*}$, is uncountably infinite, since $2^{\mathbf{N}}$, and hence the power set of any countably infinite set, is not countably infinite. With only a countable number of representations and an uncountable number of things to represent, we are unable to represent all languages finitely. Thus, the most we can hope for is to find finite representations, of one sort or another, for at least some of the more interesting languages.
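The countability of the set of all strings over an alphabet can be made concrete by listing the strings in order of length, and alphabetically within each length, so that every string appears at some finite position. The following Python sketch is one such enumeration; the function name `all_strings` and the choice of ordering are illustrative assumptions, not notation from the text, and the alphabet is assumed to be nonempty.

```python
from itertools import count, islice, product


def all_strings(alphabet):
    """Yield every string over `alphabet`, shortest first.

    Assumption of this sketch: `alphabet` is a nonempty collection of
    one-character strings.
    """
    symbols = sorted(alphabet)
    for length in count(0):                      # lengths 0, 1, 2, ...
        for letters in product(symbols, repeat=length):
            yield "".join(letters)


# The first eight strings over {0, 1}, starting with the empty string.
print(list(islice(all_strings({"0", "1"}), 8)))
# ['', '0', '1', '00', '01', '10', '11', '000']
```

Every string of length $n$ is reached after finitely many steps, which is exactly what a one-to-one listing against the natural numbers requires. No analogous listing exists for the set of all languages.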

This is our first result in the theory of computation: No matter how powerful are the methods we use for representing languages, only countably many languages can be represented, so long as the representations themselves are finite. There being uncountably many languages in all, the vast majority of them will inevitably be missed under any finite representational scheme.

Of course, this is not the last thing we shall have to say along these lines.

We shall describe several ways of describing and representing languages, each more powerful than the last in the sense that each is capable of describing languages the previous one cannot. This hierarchy does not contradict the fact that all these finite representational methods are inevitably limited in scope for the reasons just explained.

We shall also want to derive ways of exhibiting particular languages that cannot be represented by the various representational methods we study. We know that the world of languages is inhabited by vast numbers of such unrepresentable specimens, but, strangely perhaps, it can be exceedingly difficult to catch one, put it on display, and document it. Diagonalization arguments will eventually assist us here.

To begin our study of finite representations, we consider expressions (strings of symbols) that describe how languages can be built up by using the operations described in the previous section.

**Example 1.8.1:** Let $L = \{w \in \{0, 1\}^* : w \text{ has two or three occurrences of } 1, \text{ the first and second of which are not consecutive}\}$. This language can be described using only singleton sets and the symbols $\cup$, $\circ$, and $^*$ as

$$\{0\}^* \circ \{1\} \circ \{0\}^* \circ \{0\} \circ \{1\} \circ \{0\}^* \circ ((\{1\} \circ \{0\}^*) \cup \emptyset^*).$$

It is not hard to see that the language represented by the above expression is precisely the language $L$ defined above. The important thing to notice is that the only symbols used in this representation are the braces $\{$ and $\}$, the parentheses ( and ), $\emptyset$, 0, 1, $^*$, $\circ$, and $\cup$. In fact, we may dispense with the braces and $\circ$ and write simply

$$L = 0^*10^*010^*(10^* \cup \emptyset^*).$$

Roughly speaking, an expression such as the one for $L$ in Example 1.8.1 is called a regular expression. That is, a regular expression describes a language exclusively by means of single symbols and $\emptyset$, combined perhaps with the symbols $\cup$ and $^*$, possibly with the aid of parentheses. But in order to keep straight the expressions about which we are talking and the "mathematical English" we are using for discussing them, we must tread rather carefully. Instead of using $\cup$, $^*$, and $\emptyset$, which are the names in this book for certain operations and sets, we introduce special symbols $\cup$, $^*$, and $\emptyset$, which should be regarded for the moment as completely free of meaningful overtones, just like the symbols $a$, $b$, and 0 used in earlier examples. In the same way, we introduce special symbols ( and ) instead of the parentheses ( and ) we have been using for doing mathematics.

The regular expressions over an alphabet $\Sigma$ are all strings over the alphabet $\Sigma \cup \{(, ), \emptyset, \cup, {}^*\}$ that can be obtained as follows.

(1) $\emptyset$ and each member of $\Sigma$ is a regular expression.

(2) If $\alpha$ and $\beta$ are regular expressions, then so is $(\alpha\beta)$.

(3) If $\alpha$ and $\beta$ are regular expressions, then so is $(\alpha \cup \beta)$.

(4) If $\alpha$ is a regular expression, then so is $\alpha^*$.

(5) Nothing is a regular expression unless it follows from (1) through (4).
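Rules (1) through (5) can be read as a recursive procedure for checking whether a given string is a regular expression. The sketch below is one possible reading in Python, not a construction taken from the text: the names `parse_regex` and `is_regex` and the nested-tuple output are assumptions made for illustration, and the alphabet `sigma` is assumed not to contain any of the five special symbols.

```python
def parse_regex(text, sigma):
    """Parse a fully parenthesized regular expression over `sigma`.

    Returns a nested tuple such as ("union", ("sym", "a"), ("empty",)),
    or raises ValueError if `text` does not follow from rules (1)-(4).
    """
    def parse(i):
        if i >= len(text):
            raise ValueError("unexpected end of input")
        ch = text[i]
        if ch == "∅":                              # rule (1): the symbol ∅
            node, i = ("empty",), i + 1
        elif ch in sigma:                          # rule (1): a member of sigma
            node, i = ("sym", ch), i + 1
        elif ch == "(":                            # rules (2) and (3)
            left, i = parse(i + 1)
            if i < len(text) and text[i] == "∪":
                right, i = parse(i + 1)
                node = ("union", left, right)
            else:
                right, i = parse(i)
                node = ("concat", left, right)
            if i >= len(text) or text[i] != ")":
                raise ValueError(f"expected ')' at position {i}")
            i += 1
        else:
            raise ValueError(f"unexpected symbol {ch!r} at position {i}")
        while i < len(text) and text[i] == "*":    # rule (4), possibly repeated
            node, i = ("star", node), i + 1
        return node, i

    node, end = parse(0)
    if end != len(text):                           # rule (5): nothing left over
        raise ValueError(f"unexpected symbol {text[end]!r} at position {end}")
    return node


def is_regex(text, sigma):
    """True if `text` is a regular expression over `sigma` in the strict sense."""
    try:
        parse_regex(text, sigma)
        return True
    except ValueError:
        return False


print(is_regex("((a∪b)*a)", {"a", "b"}))      # True: fully parenthesized
print(is_regex("a∪b∪c", {"a", "b", "c"}))     # False: parentheses are missing
```

The second call fails because the strict definition requires every union and every concatenation to be enclosed in its own pair of parentheses.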

Every regular expression represents a language, according to the interpretation of the symbols $\cup$ and $^*$ as set union and Kleene star, and of juxtaposition of expressions as concatenation. Formally, the relation between regular expressions and the languages they represent is established by a function $\mathcal{L}$, such that if $\alpha$ is any regular expression, then $\mathcal{L}(\alpha)$ is the language represented by $\alpha$. That is, $\mathcal{L}$ is a function from strings to languages. The function $\mathcal{L}$ is defined as follows.

(1) $\mathcal{L}(\emptyset) = \emptyset$, and $\mathcal{L}(a) = \{a\}$ for each $a \in \Sigma$.

(2) If $\alpha$ and $\beta$ are regular expressions, then $\mathcal{L}((\alpha\beta)) = \mathcal{L}(\alpha)\mathcal{L}(\beta)$.

(3) If $\alpha$ and $\beta$ are regular expressions, then $\mathcal{L}((\alpha \cup \beta)) = \mathcal{L}(\alpha) \cup \mathcal{L}(\beta)$.

(4) If $\alpha$ is a regular expression, then $\mathcal{L}(\alpha^*) = \mathcal{L}(\alpha)^*$.

Statement 1 defines $\mathcal{L}(\alpha)$ for each regular expression $\alpha$ that consists of a single symbol; then (2) through (4) define $\mathcal{L}(\alpha)$ for regular expressions of some length in terms of $\mathcal{L}(\alpha')$ for one or two regular expressions $\alpha'$ of smaller length. Thus every regular expression is associated in this way with some language.
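Since the four clauses follow the structure of the expression, they translate directly into a recursive function. The language of a starred expression is usually infinite, so the Python sketch below computes only the strings of length at most `max_len`. The nested-tuple encoding of expressions and the name `language` are assumptions of this sketch, not notation from the text.

```python
def language(node, max_len):
    """Return the strings of length <= max_len in the language of `node`.

    `node` is a nested tuple: ("empty",), ("sym", a), ("concat", x, y),
    ("union", x, y) or ("star", x). The encoding is assumed for illustration.
    """
    kind = node[0]
    if kind == "empty":                            # clause (1): the empty set
        return set()
    if kind == "sym":                              # clause (1): a single symbol
        return {node[1]} if max_len >= 1 else set()
    if kind == "concat":                           # clause (2)
        left = language(node[1], max_len)
        right = language(node[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "union":                            # clause (3)
        return language(node[1], max_len) | language(node[2], max_len)
    if kind == "star":                             # clause (4)
        base = language(node[1], max_len) - {""}
        result, frontier = {""}, {""}
        while frontier:                            # append one more factor per round
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind {kind!r}")


a, b = ("sym", "a"), ("sym", "b")
expr = ("concat", ("star", ("union", a, b)), a)    # the expression ((a∪b)*a)
print(sorted(language(expr, 3), key=lambda w: (len(w), w)))
# ['a', 'aa', 'ba', 'aaa', 'aba', 'baa', 'bba']
```

Truncating at each step loses nothing, because a string of length at most `max_len` can only be built from pieces that are themselves no longer than `max_len`.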

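**Example 1.8.2:** What is $\mathcal{L}(((a \cup b)^*a))$? We have the following.

$$
\begin{aligned}
\mathcal{L}(((a \cup b)^*a)) &= \mathcal{L}((a \cup b)^*)\,\mathcal{L}(a) && \text{by (2)} \\
&= \mathcal{L}((a \cup b)^*)\{a\} && \text{by (1)} \\
&= \mathcal{L}((a \cup b))^*\{a\} && \text{by (4)} \\
&= (\mathcal{L}(a) \cup \mathcal{L}(b))^*\{a\} && \text{by (3)} \\
&= (\{a\} \cup \{b\})^*\{a\} && \text{by (1) twice} \\
&= \{a, b\}^*\{a\} \\
&= \{w \in \{a, b\}^* : w \text{ ends with an } a\}
\end{aligned}
$$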

**Example 1.8.3:** What language is represented by $(c^*(a \cup (bc^*))^*)$? This regular expression represents the set of all strings over $\{a, b, c\}$ that do not have the substring $ac$. Clearly no string in $\mathcal{L}((c^*(a \cup (bc^*))^*))$ can contain the substring $ac$, since each occurrence of $a$ in such a string is either at the end of the string, or is followed by another occurrence of $a$, or is followed by an occurrence of $b$.

On the other hand, let $w$ be a string with no substring $ac$. Then $w$ begins with zero or more $c$'s. If they are removed, the result is a string with no substring $ac$ and not beginning with $c$. Any such string is in $\mathcal{L}((a \cup (bc^*))^*)$; for it can be read, left to right, as a sequence of $a$'s, $b$'s, and $c$'s, with any blocks of $c$'s immediately following $b$'s (not following $a$'s, and not at the beginning of the string). Therefore $w \in \mathcal{L}((c^*(a \cup (bc^*))^*))$. ◊
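The claim of this example can be spot-checked mechanically, though a finite check is no substitute for the proof. The sketch below uses Python's `re` module, whose syntax writes union as `|`; the translation `c*(a|bc*)*` of the expression, the function name `check`, and the bound of 8 on string length are choices made here for illustration.

```python
import re
from itertools import product

# Python's `re` writes union as "|" where the text writes ∪.
PATTERN = re.compile(r"c*(a|bc*)*")


def check(max_len):
    """Compare the expression with the property "no substring ac".

    Returns the first string on which the two descriptions disagree,
    or None if they agree on every string of length <= max_len.
    """
    for n in range(max_len + 1):
        for letters in product("abc", repeat=n):
            w = "".join(letters)
            in_language = PATTERN.fullmatch(w) is not None
            if in_language != ("ac" not in w):
                return w
    return None


print(check(8))   # None: no disagreement among strings of length at most 8
```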

**Example 1.8.4:** $(0^* \cup (((0^*(1 \cup (11)))((00^*)(1 \cup (11)))^*)0^*))$ represents the set of all strings over $\{0, 1\}$ that do not have the substring 111. ◊

Every language that can be represented by a regular expression can be represented by infinitely many of them. For example, $\alpha$ and $(\alpha \cup \emptyset)$ always represent the same language; so do $((\alpha \cup \beta) \cup \gamma)$ and $(\alpha \cup (\beta \cup \gamma))$. Since set union and concatenation are associative operations (that is, since $(L_1 \cup L_2) \cup L_3 = L_1 \cup (L_2 \cup L_3)$ for all $L_1, L_2, L_3$, and the same for concatenation), we normally omit the extra ( and ) symbols in regular expressions; for example, we treat $a \cup b \cup c$ as a regular expression even though "officially" it is not. For another example, the regular expression of Example 1.8.4 might be rewritten as $0^* \cup 0^*(1 \cup 11)(00^*(1 \cup 11))^*0^*$.

Moreover, now that we have shown that regular expressions and the languages they represent can be defined formally and unambiguously, we feel free, when no confusion can result, to blur the distinction between the regular expressions and the "mathematical English" we are using for talking about languages.

Thus we may say at one point that $a^*b^*$ is the set of all strings consisting of some number of $a$'s followed by some number of $b$'s (to be precise, we should have written $\{a\}^* \circ \{b\}^*$). At another point, we might say that $a^*b^*$ is a regular expression representing that set; in this case, to be precise, we should have written $(a^*b^*)$.

The class of regular languages over an alphabet $\Sigma$ is defined to consist of all languages $L$ such that $L = \mathcal{L}(\alpha)$ for some regular expression $\alpha$ over $\Sigma$. That is, regular languages are all languages that can be described by regular expressions.

Alternatively, regular languages can be thought of in terms of closures. The class of regular languages over $\Sigma$ is precisely the closure of the set of languages

$$\{\{\sigma\} : \sigma \in \Sigma\} \cup \{\emptyset\}$$

with respect to the functions of union, concatenation, and Kleene star.

We have already seen that regular expressions do describe some nontrivial and interesting languages. Unfortunately, we cannot describe by regular expressions some languages that have very simple descriptions by other means. For example, $\{0^n1^n : n \geq 0\}$ will be shown in Chapter 2 not to be regular. Surely any theory of the finite representation of languages will have to accommodate at least such simple languages as this. Thus regular expressions are an inadequate specification method in general.
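To see how simple this language is by other means, here is a minimal Python sketch of a membership test for it. The function name `is_0n1n` is hypothetical, and the sketch says nothing about why no regular expression exists; that is the subject of Chapter 2.

```python
def is_0n1n(w):
    """Decide whether w is n 0's followed by n 1's, for some n >= 0."""
    n, remainder = divmod(len(w), 2)
    # The length must be even, and w must equal the unique candidate 0^n 1^n.
    return remainder == 0 and w == "0" * n + "1" * n


print(is_0n1n("000111"))   # True
print(is_0n1n("00111"))    # False: two 0's but three 1's
```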

In search of a general method for finitely specifying languages, we might return to our general scheme

$$L = \{w \in \Sigma^* : w \text{ has property } P\}.$$

But which properties $P$ should we admit? For example, what makes the preceding properties, "$w$ consists of a number of 0's followed by an equal number of 1's" and "$w$ has no occurrence of 111," such obvious candidates? The reader may ponder about the right answer; but let us for now allow algorithmic properties, and only these. That is, for a property $P$ of strings to be admissible as a specification of a language, there must be an algorithm for deciding whether a given string belongs to the language. An algorithm that is specifically designed, for some language $L$, to answer questions of the form "Is string $w$ a member of $L$?" will be called a language recognition device. For example, a device for recognizing the language

$$L = \{w \in \{0, 1\}^* : w \text{ does not have } 111 \text{ as a substring}\}$$

by reading strings, a symbol at a time, from left to right, might operate like this:

Keep a count, which starts at zero and is set back to zero every time a 0 is encountered in the input; add one every time a 1 is encountered in the input; stop with a No answer if the count ever reaches three, and stop with a Yes answer if the whole string is read without the count reaching three.
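The device just described can be sketched in Python as follows. The function name `has_no_111` is an assumption, as is the decision to reject symbols other than 0 and 1 with an error. The second function compares the device, on all short strings, with a `re` translation of the regular expression of Example 1.8.4, which describes the same language.

```python
import re
from itertools import product


def has_no_111(w):
    """Recognize the strings over {0, 1} that do not contain 111."""
    count = 0                          # the count starts at zero
    for symbol in w:                   # read left to right, one symbol at a time
        if symbol == "0":
            count = 0                  # a 0 sets the count back to zero
        elif symbol == "1":
            count += 1                 # a 1 adds one to the count
            if count == 3:
                return False           # No: the count reached three
        else:
            raise ValueError(f"not a symbol of the alphabet: {symbol!r}")
    return True                        # Yes: the whole string was read


# Example 1.8.4 in Python's `re` syntax, with "|" in place of ∪.
EXAMPLE_1_8_4 = re.compile(r"0*|0*(1|11)(00*(1|11))*0*")


def agrees_with_regex(max_len):
    """Check that the device and the expression agree on all short strings."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            if has_no_111(w) != (EXAMPLE_1_8_4.fullmatch(w) is not None):
                return False
    return True


print(has_no_111("0110110"))    # True
print(has_no_111("01110"))      # False
print(agrees_with_regex(10))    # True
```

Note that the device needs to remember only a number between zero and three, however long the input is.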

An alternative and somewhat orthogonal method for specifying a language is to describe how a generic specimen in the language is produced. For example, a regular expression such as $(e \cup b \cup bb)(a \cup ab \cup abb)^*$ may be viewed as a way of generating members of a language:

To produce a member of $L$, first write down either nothing, or $b$, or $bb$; then write down $a$, or $ab$, or $abb$, and do this any number of times, including zero; all and only members of $L$ can be produced in this way.

Such language generators are not algorithms, since they are not designed to answer questions and are not completely explicit about what to do (how are we to choose which of $a$, $ab$, or $abb$ is to be written down?). But they are important and useful means of representing languages all the same. The relation between language recognition devices and language generators, both of which are types of finite language specifications, is another major subject of this book.
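One way to turn the generator for $(e \cup b \cup bb)(a \cup ab \cup abb)^*$ into a runnable procedure is to make the unspecified choices at random. The Python sketch below does this; the function name `generate`, the stopping probability of 0.3, and the use of uniform choices are arbitrary assumptions, and the text itself prescribes no particular way of choosing.

```python
import random


def generate(rng, stop_probability=0.3):
    """Produce one member of the language by making random choices.

    `rng` is a random.Random instance; `stop_probability` (an assumption
    of this sketch) is the chance of stopping before each repetition.
    """
    word = rng.choice(["", "b", "bb"])             # nothing, or b, or bb
    while rng.random() >= stop_probability:        # any number of times, maybe zero
        word += rng.choice(["a", "ab", "abb"])     # then a, or ab, or abb
    return word


rng = random.Random(0)                             # fixed seed for repeatability
print([generate(rng) for _ in range(5)])
```

Every string produced this way belongs to the language, and every member of the language has a positive chance of being produced, but the procedure still answers no membership question: it is a generator, not a recognition device.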

Problems for Section 1.8

1.8.1. What language is represented by the regular expression $(((a^*a)b) \cup b)$?

1.8.2. Rewrite each of these regular expressions as a simpler expression representing the same set.

(a) $\emptyset^* \cup a^* \cup b^* \cup (a \cup b)^*$

(b) $((a^*b^*)^*(b^*a^*)^*)^*$

(c) $(a^*b)^* \cup (b^*a)^*$

(d) $(a \cup b)^*a(a \cup b)^*$

1.8.3. Let $\Sigma = \{a, b\}$. Write regular expressions for the following sets:

(a) All strings in $\Sigma^*$ with no more than three $a$'s.

(b) All strings in $\Sigma^*$ with a number of $a$'s divisible by three.

(c) All strings in $\Sigma^*$ with exactly one occurrence of the substring $aaa$.

1.8.4. Prove that if $L$ is regular, then so is $L' = \{w : uw \in L \text{ for some string } u\}$. (Show how to construct a regular expression for $L'$ from one for $L$.)

1.8.5. Which of the following are true? Explain.

(a) $baa \in a^*b^*a^*b^*$

(b) $b^*a^* \cap a^*b^* = a^* \cup b^*$

(c) $a^*b^* \cap b^*c^* = \emptyset$

(d) $abcd \in (a(cd)^*b)^*$

1.8.6. The star height $h(\alpha)$ of a regular expression $\alpha$ is defined by induction as follows.

$$
\begin{aligned}
h(\emptyset) &= 0 \\
h(a) &= 0 \text{ for each } a \in \Sigma \\
h(\alpha \cup \beta) &= h(\alpha\beta) = \text{the maximum of } h(\alpha) \text{ and } h(\beta) \\
h(\alpha^*) &= h(\alpha) + 1
\end{aligned}
$$

For example, if $\alpha = (((ab)^* \cup b^*)^* \cup a^*)$, then $h(\alpha) = 2$. Find, in each case, a regular expression which represents the same language and has star height as small as possible.

(a) $((abc)^*ab)^*$

(b) $(a(ab^*c)^*)^*$

(c) $(c(a^*b)^*)^*$

(d) $(a^* \cup b^* \cup ab)^*$

(e) $(abb^*a)^*$
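The inductive definition of star height can be evaluated mechanically. The Python sketch below does so for expressions encoded as nested tuples (an encoding assumed here for illustration, with the hypothetical function name `star_height`) and reproduces the value 2 for the example above. It computes the star height of a given expression only; finding an equivalent expression of smaller star height is the actual exercise.

```python
def star_height(node):
    """Star height of an expression given as a nested tuple.

    Assumed encoding: ("empty",), ("sym", a), ("union", x, y),
    ("concat", x, y), ("star", x).
    """
    kind = node[0]
    if kind in ("empty", "sym"):                   # h(∅) = 0 and h(a) = 0
        return 0
    if kind in ("union", "concat"):                # maximum of the two parts
        return max(star_height(node[1]), star_height(node[2]))
    if kind == "star":                             # one more than the operand
        return star_height(node[1]) + 1
    raise ValueError(f"unknown node kind {kind!r}")


a, b = ("sym", "a"), ("sym", "b")
# The expression (((ab)* ∪ b*)* ∪ a*) from the example.
example = ("union",
           ("star", ("union", ("star", ("concat", a, b)), ("star", b))),
           ("star", a))
print(star_height(example))   # 2
```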

1.8.7. A regular expression is in disjunctive normal form if it is of the form $(\alpha_1 \cup \alpha_2 \cup \cdots \cup \alpha_n)$ for some $n \geq 1$, where none of the $\alpha_i$'s contains an occurrence of $\cup$. Show that every regular language is represented by one in disjunctive normal form.
