Regular Expressions for Languages over Infinite Alphabets
Michael Kaminski
Department of Computer Science Technion – Israel Institute of Technology Haifa 32000, Israel
kaminski@cs.technion.ac.il
Tony Tan
Department of Computer Science National University of Singapore 3 Science Drive 2
Singapore 117543
Abstract. In this paper we introduce a notion of a regular expression over infinite alphabets and show that a language is definable by an infinite alphabet regular expression if and only if it is accepted by a finite-state unification based automaton, a model of computation that is tightly related to other models of automata over infinite alphabets.
Keywords: Finite state automata, infinite alphabets, regular expressions
1. Introduction
A new model of finite-state automata dealing with infinite alphabets, called finite-state datalog automata (FSDA), was introduced in [16]. These automata were intended for the abstract study of relational languages. Since the character of relational languages requires the use of infinite alphabets of names of variables, in addition to a finite set of states, FSDA are equipped with a finite set of “registers” capable of retaining a variable name (out of an infinite set of names). The equality test, which is performed in ordinary finite-state automata (FA), was replaced with unification, which is a crucial element of relational languages.
Address for correspondence: Department of Computer Science, Technion – Israel Institute of Technology, Haifa 32000, Israel
Later, FSDA were extended in [7] to a more general model dealing with infinite alphabets, called finite-memory automata (FMA). FMA were designed to accept the infinite alphabet counterpart of the ordinary regular languages. Similarly to FSDA, FMA are equipped with a finite set of registers which are either empty or contain a symbol from the infinite alphabet, but contrary to FSDA, registers in FMA cannot contain symbols currently stored in other registers. By restricting the power of the automaton to copying a symbol to a register and comparing the content of a register with an input symbol only, without the ability to perform any functions, the automaton is only able to “remember” a finite set of input symbols. Thus, the languages accepted by FMA possess many of the properties of regular languages.
Whereas deciding emptiness and containment for FMA- (and, consequently, for FSDA-) languages is relatively simple, the problem of inclusion for FMA-languages is undecidable, see [11, 12].
An extension of FSDA to a general infinite alphabet, called finite-state unification based automata (FSUBA), was proposed in [17]. These automata are similar in many ways to FMA, but are a bit weaker, because a register of FSUBA may contain a symbol currently stored in other registers. It was shown in [17] that FSDA can be simulated by FSUBA and that the problem of inclusion for FSUBA languages is decidable.
While the study of finite automata over infinite alphabets started as purely theoretical, since the appearance of [7] and [8] it seems to have become more practically oriented. The key to applicability is finding practical interpretations of the infinite alphabet and of the languages over it.
In [11, 12], members of (the infinite) alphabet $\Sigma$ are interpreted as records of communication actions, “send” and “receive” of messages during inter-process communication. Words in a language $L$ over this alphabet are MSCs, message sequence charts, capturing behaviors of the communication network.
In [4], members of $\Sigma$ are interpreted as URLs’ addresses of internet sites, and a word in $\Sigma^*$ is interpreted as a “navigation path” in the internet, the result of some finite sequence of clicks.
In [5] there is another internet-oriented interpretation of $\Sigma$, namely, XML mark-ups of pages in a site.
In this paper we introduce a notion of a regular expression for languages over infinite alphabets and show that a language is definable by an infinite alphabet regular expression if and only if it is accepted by an FSUBA.
The paper is organized as follows. In the next section we recall the definition of FSDA from [16] and in Section 3 we recall the definition of FSUBA from [17]. In Section 4 we present the main result of our paper – unification based regular expressions for languages over infinite alphabets, whose equivalence to FSUBA is proven in Sections 5 and 6. Section 7 contains the proof of a modification of a technical lemma from [17]. Finally, Section 8 deals with the complexity of intertranslations between FSUBA and unification based regular expressions.
2. Finite-state datalog automata
We start with examples of relational languages which can and cannot be defined by finite-state datalog automata (FSDA). Relational languages are languages over infinite alphabets whose symbols are of the form $p(x_1, x_2)$, where $p$ belongs to a finite alphabet of binary relation symbols and $x_1$ and $x_2$ come from an infinite alphabet of variables. For example, the relational language
generated by the Horn grammar
is definable by finite-state datalog automata, see [16, Section 2], whereas the relational language
generated by the Horn grammar
is not, because the restrictions of FSDA-languages to finite alphabets are regular, see [8, Proposition 1].
Next we recall the definition of FSDA from [16].
A finite-state datalog automaton or, shortly, FSDA, is a system $A = \langle R, V, Q, q_0, F, r, \mu \rangle$, where
• $R$ and $V$ are a finite alphabet of binary relation symbols and an infinite alphabet of variables, respectively, $R \cap V = \emptyset$ and $\# \notin V$,¹ whereas the input alphabet of $A$ is $R \times V \times V$. That is, an input symbol is a relation $p(x_1, x_2)$, where $p \in R$ is a binary relation symbol and $x_1, x_2 \in V$ are variables.
• $Q$, $q_0 \in Q$, and $F \subseteq Q$ are a finite set of states, the initial state, and the set of final states, respectively.
• $r$ is the number of registers of $A$, which are capable of either being empty or retaining a variable from $V$.
• $\mu \subseteq Q \times R \times \{1, \ldots, r\} \times \{1, \ldots, r\} \times 2^{\{1, 2, \ldots, r\}} \times Q$ is the transition relation whose elements are called transitions. The intuitive meaning of the transition relation is as follows. If the automaton is in state $q$ reading relation $p(x_1, x_2)$ and there is a transition $(q, p, k_1, k_2, S, q') \in \mu$ such that the register $k_i$ either contains $x_i$ or is empty, $i = 1, 2$, then the automaton can enter state $q'$, copy $x_i$ into the $k_i$th register, if the latter is empty, and empty (reset) the registers whose indices belong to $S$. The above registers $k_1$ and $k_2$ are referred to as the transition registers.
An actual state of an FSDA $A$ is an element of $Q$ together with the contents of all registers of the automaton. Thus, $A$ has infinitely many states² which are pairs $(q, w)$, where $q \in Q$ and $w \in (V \cup \{\#\})^r$. Such pairs are called configurations of $A$, and the set of all configurations is denoted $Q^c$. The pair $(q_0, \#^r)$, denoted $q_0^c$, is the initial configuration, and the configurations with the first component in $F$ are called final configurations. The set of final configurations is denoted $F^c$.
The transition relation $\mu$ induces the following relation $\mu^c$ on $Q^c \times (R \times V \times V) \times Q^c$. Let $q, q' \in Q$ and $w = w_1 w_2 \cdots w_r$, $w' = w'_1 w'_2 \cdots w'_r \in (V \cup \{\#\})^r$. Then $((q, w), p(x_1, x_2), (q', w')) \in \mu^c$ if and only if there is a transition $(q, p, k_1, k_2, S, q') \in \mu$ such that the following four conditions are satisfied.
1. $w_{k_i} \in \{x_i, \#\}$, $i = 1, 2$. That is, the transition register $k_i$ either contains $x_i$ or is empty.
2. If $k_i \notin S$, then $w'_{k_i} = x_i$, $i = 1, 2$. That is, if the transition register $k_i$ is not reset in the transition, its content is $x_i$.
3. For all $j \in S$, $w'_j = \#$.
4. For all $j \notin S \cup \{k_1, k_2\}$, $w'_j = w_j$.
1 In this paper we reserve # to denote an empty register.
2 This is the major difference between ordinary finite-state automata and finite-state automata over infinite alphabets.
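Let $\boldsymbol{\sigma} = \sigma_1 \sigma_2 \cdots \sigma_n$ be a word over $R \times V \times V$. A run of the automaton $A$ on $\boldsymbol{\sigma}$ consists of a sequence of configurations $c_0, c_1, \ldots, c_n$ such that $c_0$ is the initial configuration $q_0^c$, and $(c_{i-1}, \sigma_i, c_i) \in \mu^c$, $i = 1, \ldots, n$.
We say that $A$ accepts $\boldsymbol{\sigma}$ if there exists a run $c_0, c_1, \ldots, c_n$ of $A$ on $\boldsymbol{\sigma}$ such that $c_n \in F^c$. The set of all words accepted by $A$ is denoted by $L(A)$ and is referred to as an FSDA-language. We refer the reader to [16] for additional examples of FSDA-languages and their relation to DATALOG.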
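The four conditions on the induced relation and the acceptance condition can be illustrated by a small simulator. The following is only a sketch under stated assumptions: a transition is taken to be a tuple (state, relation symbol, k1, k2, reset set, next state), an input symbol is a triple (p, x1, x2), and the three-register automaton used at the end is a hypothetical example, not one from [16].

```python
EMPTY = None  # plays the role of the empty-register marker '#'

def fsda_step(config, symbol, transitions):
    """Configurations reachable from `config` by reading `symbol` = (p, x1, x2)."""
    q, w = config
    p, x1, x2 = symbol
    successors = set()
    for (s, rel, k1, k2, reset, s2) in transitions:
        if s != q or rel != p:
            continue
        pairs = ((k1, x1), (k2, x2))
        # Condition 1: each transition register is empty or already holds its variable.
        if any(w[k - 1] not in (EMPTY, x) for k, x in pairs):
            continue
        new = list(w)
        consistent = True
        for k, x in pairs:
            if k in reset:
                continue
            # Condition 2: a transition register that is not reset ends up holding x.
            # If k1 == k2, this fails unless x1 == x2.
            if new[k - 1] not in (EMPTY, x):
                consistent = False
            new[k - 1] = x
        if not consistent:
            continue
        # Condition 3: registers in the reset set become empty.
        for j in reset:
            new[j - 1] = EMPTY
        # Condition 4 holds because all other registers were left untouched.
        successors.add((s2, tuple(new)))
    return successors

def fsda_accepts(word, q0, r, final, transitions):
    """Nondeterministic run: track every configuration reachable on the prefix read so far."""
    configs = {(q0, (EMPTY,) * r)}
    for symbol in word:
        configs = {c2 for c in configs for c2 in fsda_step(c, symbol, transitions)}
    return any(q in final for q, _ in configs)

# Hypothetical automaton accepting words of the form p(x, y) p(y, z).
CHAIN = [
    ("q0", "p", 1, 2, frozenset(), "q1"),
    ("q1", "p", 2, 3, frozenset(), "q2"),
]
print(fsda_accepts([("p", "a", "b"), ("p", "b", "c")], "q0", 3, {"q2"}, CHAIN))
```

Tracking the whole set of reachable configurations keeps the sketch short; each set stays finite because a step can only introduce the variables of the symbol just read.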
3. Finite-state unification based automata
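Till the end of this paper $\Sigma$ is an infinite alphabet not containing $\#$. For a word $w = w_1 w_2 \cdots w_r$ over $\Sigma \cup \{\#\}$, we define the content of $w$, denoted $[w]$, by $[w] = \{w_j \neq \# : j = 1, \ldots, r\}$. That is, $[w]$ consists of all symbols of $\Sigma$ which appear in $w$.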
Definition 3.1. ([17]) A finite-state unification based automaton (over $\Sigma$) or, shortly, FSUBA, is a system $A = \langle Q, q_0, F, u, \Theta, \mu \rangle$, where
• $Q$, $q_0 \in Q$, and $F \subseteq Q$ are a finite set of states, the initial state, and the set of final states, respectively.
• $u = u_1 u_2 \cdots u_r \in (\Sigma \cup \{\#\})^r$, $r \geq 1$, is the initial assignment (register initialization): the symbol in the $i$th register is $u_i$. Recall that $\#$ is reserved to denote an empty register. That is, if $u_i = \#$, then the $i$th register is empty.
• $\Theta \subseteq [u]$ is the “read only” alphabet whose symbols cannot be copied into empty registers.³ One may think of $\Theta$ as a set of the language constants which cannot be unified, cf. [16].
• $\mu \subseteq Q \times \{1, \ldots, r\} \times 2^{\{1, 2, \ldots, r\}} \times Q$ is the transition relation whose elements are called transitions. The intuitive meaning of $\mu$ is as follows. If the automaton is in state $q$ reading symbol $\sigma$ and there is a transition $(q, k, S, q') \in \mu$ such that the $k$th register either contains $\sigma$ or is empty, then the automaton can enter state $q'$, write $\sigma$ in the $k$th register (if it is empty), and erase the content of the registers whose indices belong to $S$. The $k$th register will be referred to as the transition register.
Like in the case of FSDA, an actual state of $A$ is an element of $Q$ together with the contents of all registers of the automaton. That is, $A$ has infinitely many states which are pairs $(q, w)$, where $q \in Q$ and $w \in (\Sigma \cup \{\#\})^r$. These pairs are called configurations of $A$. The set of all configurations of $A$ is denoted $Q^c$. The pair $(q_0, u)$, denoted $q_0^c$, is called the initial configuration,⁴ and the configurations with the first component in $F$ are called final configurations. The set of final configurations is denoted $F^c$.
Transition relation $\mu$ induces the following relation $\mu^c$ on $Q^c \times \Sigma \times Q^c$.
Let $q, q' \in Q$, $w = w_1 w_2 \cdots w_r$ and $w' = w'_1 w'_2 \cdots w'_r$. Then the triple $((q, w), \sigma, (q', w'))$ belongs to $\mu^c$ if and only if there is a transition $(q, k, S, q')$ in $\mu$ such that the following conditions are satisfied.
• Either $w_k = \#$ (i.e., the transition register is empty, in which case $\sigma$ is copied into it) and $\sigma \notin \Theta$, or $w_k = \sigma$ (i.e., the transition register contains $\sigma$).
• If $k \notin S$, then $w'_k = \sigma$, i.e., if the transition register is not reset in the transition, its content is $\sigma$.
• For all $j \in S$, $w'_j = \#$.
• For all $j \notin S \cup \{k\}$, $w'_j = w_j$.
Let $\boldsymbol{\sigma} = \sigma_1 \sigma_2 \cdots \sigma_n$ be a word over $\Sigma$. A run of $A$ on $\boldsymbol{\sigma}$ consists of a sequence of configurations $c_0, c_1, \ldots, c_n$ such that $c_0$ is the initial configuration $q_0^c$ and $(c_{i-1}, \sigma_i, c_i) \in \mu^c$, $i = 1, \ldots, n$.
We say that $A$ accepts $\boldsymbol{\sigma}$, if there exists a run $c_0, c_1, \ldots, c_n$ of $A$ on $\boldsymbol{\sigma}$ such that $c_n \in F^c$. The set of all words accepted by $A$ is denoted by $L(A)$ and is referred to as an FSUBA-language.
3 Of course, we could let $\Theta$ be any subset of $\Sigma$. However, since the elements of $\Theta$ cannot be copied into empty registers, the automaton can make a move with the input from $\Theta$ only if the symbol already appears in one of the automaton registers, i.e., belongs to the initial assignment.
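The run and acceptance conditions for FSUBA can be sketched as a direct nondeterministic simulation. This is a minimal illustration under assumptions that are not part of the paper: transitions are encoded as tuples (state, register index, reset set, next state) with 1-based register indices, input symbols are Python strings, and the one-register automaton `TWICE` is a hypothetical example.

```python
EMPTY = None  # plays the role of the empty-register marker '#'

def fsuba_step(config, symbol, transitions, theta):
    """Configurations reachable from `config` = (state, registers) by reading `symbol`."""
    q, w = config
    successors = set()
    for (s, k, reset, s2) in transitions:
        if s != q:
            continue
        content = w[k - 1]
        if content is EMPTY:
            # An empty transition register may only receive a symbol outside
            # the read-only alphabet theta.
            if symbol in theta:
                continue
        elif content != symbol:
            continue
        new = list(w)
        new[k - 1] = symbol      # transition register holds the symbol ...
        for j in reset:          # ... unless it is itself reset
            new[j - 1] = EMPTY
        successors.add((s2, tuple(new)))
    return successors

def fsuba_accepts(word, q0, u, final, transitions, theta=frozenset()):
    """True if some run on `word` ends in a configuration whose state is final."""
    configs = {(q0, tuple(u))}
    for symbol in word:
        configs = {c2 for c in configs
                   for c2 in fsuba_step(c, symbol, transitions, theta)}
    return any(q in final for q, _ in configs)

# Hypothetical one-register automaton: two transitions on register 1, no resets.
TWICE = [("q", 1, frozenset(), "s"), ("s", 1, frozenset(), "f")]
print(fsuba_accepts(["a", "a"], "q", [EMPTY], {"f"}, TWICE))
print(fsuba_accepts(["a", "b"], "q", [EMPTY], {"f"}, TWICE))
```

The first symbol is copied into the empty register and the second must match it, so this hypothetical automaton accepts exactly the two-letter words whose letters coincide.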
Example 3.1. Let
&
/
+++
#
be an "-register FSUBA, where#consists of the only one transition 1%. Alternatively,can be described by the following diagram.
1
%
&
&
initialization
Obviously, , if1 ", and
/
, otherwise.
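Example 3.2. ([17]) Let $A = \langle \{q, s, f\}, q, \{f\}, \#, \emptyset, \mu \rangle$ be a one-register FSUBA, where $\mu$ consists of the following two transitions: $(q, 1, \emptyset, s)$ and $(s, 1, \emptyset, f)$, see the diagram below.
[Transition diagram of $A$; register initialization: #]
4 Recall that $q_0$ and $u$ denote the initial state and the initial assignment, respectively.
Then $L(A) = \{\sigma\sigma : \sigma \in \Sigma\}$: an accepting run of $A$ on $\sigma\sigma$ is $(q, \#), (s, \sigma), (f, \sigma)$. In contrast, the language $L = \{\sigma_1 \sigma_2 : \sigma_1 \neq \sigma_2\}$ is not an FSUBA language.⁵ To prove that, assume to the contrary that for some FSUBA $A = \langle Q, q_0, F, u, \Theta, \mu \rangle$, $L = L(A)$. Since $\Sigma$ is infinite and $[u]$ is finite, $\Sigma \setminus [u]$ contains two different symbols $\sigma_1$ and $\sigma_2$. By the definition of $L$, it contains the word $\sigma_1 \sigma_2$. Let $(q_0, u), (q_1, w_1), (q_2, w_2)$, $w_i = w_{i,1} w_{i,2} \cdots w_{i,r}$, $i = 1, 2$, be an accepting run of $A$ on $\sigma_1 \sigma_2$ and let $k$ be the transition register between configurations $(q_1, w_1)$ and $(q_2, w_2)$. Since neither of $\sigma_1$ and $\sigma_2$ belongs to $[u]$ and $\sigma_1 \neq \sigma_2$, $w_{1,k} = \#$ and $w_{2,k} = \sigma_2$. Then, replacing $w_{2,k}$ with $\sigma_1$ in $(q_0, u), (q_1, w_1), (q_2, w_2)$ we obtain an accepting run of $A$ on $\sigma_1 \sigma_1$, which contradicts $L = L(A)$.⁶ The following example shows how FSDA can be simulated by FSUBA.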
Example 3.3. ([17]) Let !"#,
+++
, be an FSDA. Consider an FSUBA3
33!3 #3
, such that
7
,
3
7
# )
" "
,
3
,
!3
!,
&
/
,
, and
#consists of all transitions of the form 1, 2, or 3 below 1. "%1123 ", 2.
1123 "1
%
1123 "
, or 3.
1123 "123
, where
1123 (
#. That is, we break each transition
3
2
11
of#into three “consecutive” transitions
5 It can be readily seen that $L$ is accepted by a finite-memory automaton introduced in [7].
6The decision procedure for the inclusion of FSUBA-languages in [17] is based on a refined version of this argument.
%
"
1123 "
1
%
1123 "
2 1
3
of#3.
A straightforward induction on the word length shows that
(
if and only if
(
3
+
Example 3.4. Let
&&
%
#
be a 2-register FSUBA, where#consists of the following three transitions:
%
,
,
%
, see the diagram below.
%
%
& &
initialization It can be easily seen that
(
. Example 3.5. ([17], cf [8, Example 1].) Let
&&
%
#
be an FSUBA with two registers and#consists of the following five transitions:
,
%
,
,
%
, and
, see the diagram below.
%
%
& &
initialization
It can be easily seen that
(
there exist 5 53 such that 4 4
+
That is, consists of all words over in which some symbol appears twice or more. For example, an accepting run ofon is
&&
&&
&
&
&
&
+
Example 3.6. ([17]) Let $A = \langle Q, q_0, F, u, \Theta, \mu \rangle$ be an FSUBA such that $\#$ does not appear in $u$ and for all $(q, k, S, q') \in \mu$, $S = \emptyset$. Then $L(A)$ is a regular language over $[u]$. In general, since the restriction of a set of configurations to a finite alphabet is finite, the restrictions of FSUBA-languages to finite alphabets are regular, cf. [8, Proposition 1].
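The finiteness argument can be made concrete: when the input is restricted to a finite alphabet, only finitely many configurations are reachable, and they form the state set of an ordinary nondeterministic finite automaton. The sketch below assumes the same hypothetical encoding of transitions as tuples (state, register index, reset set, next state); the function names and the example automaton are illustrative only.

```python
from collections import deque

EMPTY = None  # plays the role of the empty-register marker '#'

def successors(config, symbol, transitions, theta):
    """One FSUBA step on `symbol` from `config` = (state, registers)."""
    q, w = config
    result = set()
    for (s, k, reset, s2) in transitions:
        if s != q:
            continue
        if w[k - 1] is EMPTY:
            if symbol in theta:
                continue
        elif w[k - 1] != symbol:
            continue
        new = list(w)
        new[k - 1] = symbol
        for j in reset:
            new[j - 1] = EMPTY
        result.add((s2, tuple(new)))
    return result

def configuration_nfa(q0, u, transitions, theta, alphabet):
    """Breadth-first exploration of configurations reachable over a finite alphabet.

    Returns the reachable configurations and the labeled edges between them;
    together they describe an ordinary NFA for the restricted language.
    """
    start = (q0, tuple(u))
    seen = {start}
    edges = {}
    queue = deque([start])
    while queue:
        c = queue.popleft()
        for a in alphabet:
            for c2 in successors(c, a, transitions, theta):
                edges.setdefault((c, a), set()).add(c2)
                if c2 not in seen:
                    seen.add(c2)
                    queue.append(c2)
    return seen, edges

# Hypothetical one-register automaton restricted to the alphabet {a, b}.
TWICE = [("q", 1, frozenset(), "s"), ("s", 1, frozenset(), "f")]
STATES, EDGES = configuration_nfa("q", [EMPTY], TWICE, frozenset(), ["a", "b"])
print(len(STATES))
```

The search terminates because every register holds either the empty marker, a symbol of the initial assignment, or a symbol of the chosen finite alphabet.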
4. Regular expressions for FSUBA languages
In this section we introduce an alternative description of FSUBA languages by the so called unification based expressions which are the infinite alphabet counterpart of the ordinary regular expressions.
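Definition 4.1. Let $X = \{x_1, \ldots, x_r\}$ be a set of variables such that $X \cap \Sigma = \emptyset$ and let $\Theta$ be a finite subset of $\Sigma$. Unification based regular expressions over $(X, \Theta)$, or shortly UB-expressions, if $(X, \Theta)$ is understood from the context, are defined as follows.
• $\emptyset$, $\epsilon$, and each element of $X \cup \Theta$ are UB-expressions.
• If $\alpha_1$ and $\alpha_2$ are UB-expressions, then so is $(\alpha_1 + \alpha_2)$.
• If $X' \subseteq X$ and $\alpha_1$ and $\alpha_2$ are UB-expressions, then so are $(\alpha_1 \cdot_{X'} \alpha_2)$ and $(\alpha_1)^{*_{X'}}$.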
The intuition behind the above definition is as follows. Each variable in $X$ corresponds to a register of the automaton and a “variable” assignment of symbols from $\Sigma$ to variables in $X$ is the register assignment. Finally, subscripts $X'$ indicate the set of registers reset by the automaton.
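The definition of languages defined by UB-expressions is based on the observation that the set of all sequences of an $r$-register FSUBA diagram labels corresponding to its accepting runs is a regular language over $\{1, \ldots, r\} \times 2^{\{1, 2, \ldots, r\}}$. Thus, with a unification based regular expression $\alpha$ over $(X, \Theta)$ we associate an ordinary regular expression over the (finite) alphabet $X \cup \Theta \cup 2^X$, denoted $\bar{\alpha}$, that is defined by induction as follows.
• If $\alpha \in \{\emptyset, \epsilon\} \cup X \cup \Theta$, then $\bar{\alpha}$ is $\alpha$.
• $\overline{(\alpha_1 + \alpha_2)}$ is $(\bar{\alpha}_1 + \bar{\alpha}_2)$.
• $\overline{(\alpha_1 \cdot_{X'} \alpha_2)}$ is $(\bar{\alpha}_1 \cdot X' \cdot \bar{\alpha}_2)$.
• Finally, $\overline{(\alpha)^{*_{X'}}}$ is $(\bar{\alpha} \cdot X')^*$.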
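The inductive translation into an ordinary regular expression can be sketched as a recursive function over a small syntax tree. This is an illustration only, and it rests on a reading of the four clauses that should be treated as an assumption: atoms are left unchanged, union is translated componentwise, a concatenation subscripted by a reset set X' becomes left · X' · right, and a star subscripted by X' becomes (body · X')*. The class names and the textual output format are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union

@dataclass(frozen=True)
class Atom:
    """The empty set, the empty word, a variable, or a constant, given by its name."""
    name: str

@dataclass(frozen=True)
class Alt:
    """Union of two UB-expressions."""
    left: "Expr"
    right: "Expr"

@dataclass(frozen=True)
class Concat:
    """Concatenation subscripted by the set of variables to reset."""
    left: "Expr"
    reset: FrozenSet[str]
    right: "Expr"

@dataclass(frozen=True)
class Star:
    """Iteration subscripted by the set of variables to reset."""
    body: "Expr"
    reset: FrozenSet[str]

Expr = Union[Atom, Alt, Concat, Star]

def show_set(reset):
    """Render a reset set as a single letter of the ordinary alphabet."""
    return "{" + ",".join(sorted(reset)) + "}"

def bar(expr):
    """Ordinary regular expression (as text) associated with a UB-expression."""
    if isinstance(expr, Atom):
        return expr.name
    if isinstance(expr, Alt):
        return f"({bar(expr.left)} + {bar(expr.right)})"
    if isinstance(expr, Concat):
        return f"({bar(expr.left)} . {show_set(expr.reset)} . {bar(expr.right)})"
    if isinstance(expr, Star):
        return f"({bar(expr.body)} . {show_set(expr.reset)})*"
    raise TypeError(f"not a UB-expression: {expr!r}")

print(bar(Concat(Atom("x"), frozenset(), Atom("x"))))
print(bar(Star(Atom("y"), frozenset({"y"}))))
```

In the output each reset set appears as one letter of the finite alphabet, which is what lets the subscripts be handled by ordinary regular-expression machinery.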
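Let $w = w_1 w_2 \cdots w_n \in (X \cup \Theta \cup 2^X)^*$. With the $i$th symbol $w_i$ of $w$, $i = 1, \ldots, n$, we associate a word $\sigma_i \in \Sigma \cup \{\epsilon\}$ as described below, cf. [7, Definition 3].
• If $w_i \in \Theta$, then $\sigma_i = w_i$.
• If $w_i = X' \subseteq X$, then $\sigma_i = \epsilon$.
• If $w_i \in X$, then $\sigma_i$ satisfies the following (global) conditions.
– If for each $i' < i$ such that $w_{i'} = w_i$, there exists $i''$, $i' < i'' < i$, such that $w_i \in w_{i''}$,⁷ then $\sigma_i$ can be any element of $\Sigma \setminus \Theta$.
– Otherwise, let $i'$ be the maximal integer less than $i$ such that $w_{i'} = w_i$ and no symbol $X' \subseteq X$ that appears between the $i'$th and the $i$th positions of $w$ contains $w_i$. Then $\sigma_i = \sigma_{i'}$.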
The word6 , where4is as defined above,5 +++, is called an instance of
6. The set of all instances of6is denoted by6.
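Whether a given word is an instance can be checked in one left-to-right pass, because every reset symbol contributes the empty word and every other symbol contributes exactly one letter. The sketch below makes several assumptions for illustration: variables and constants are Python strings, reset sets are `frozenset` objects, a symbol counts as a constant exactly when it belongs to `theta`, and a variable with no live earlier occurrence may take any value outside `theta`.

```python
def is_instance(pattern, word, theta=frozenset()):
    """Check whether `word` is an instance of `pattern`.

    `pattern` is a list whose items are variable names, constants from `theta`,
    or frozensets of variable names (reset symbols).
    """
    binding = {}   # current value of each variable that has not been reset since
    pos = 0        # next unread position of `word`
    for sym in pattern:
        if isinstance(sym, frozenset):
            # A reset symbol yields the empty word and forgets its variables.
            for x in sym:
                binding.pop(x, None)
            continue
        if pos >= len(word):
            return False
        letter = word[pos]
        pos += 1
        if sym in theta:
            # A constant must be matched literally.
            if letter != sym:
                return False
        elif sym in binding:
            # The latest occurrence was not followed by a reset: repeat its value.
            if letter != binding[sym]:
                return False
        else:
            # Fresh (or reset) variable: any symbol outside theta is allowed.
            if letter in theta:
                return False
            binding[sym] = letter
    return pos == len(word)

print(is_instance(["x", frozenset(), "x"], ["a", "a"]))
print(is_instance(["x", frozenset({"x"}), "x"], ["a", "b"]))
```

Keeping only the latest live binding per variable is enough, since the "maximal earlier position with no reset in between" is always the most recent occurrence of that variable.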
Example 4.1. Let66 ( 7 7 . Then6%66 6.8
Next, for a language*7 7 , we denote bythe set of all instances of all elements of
. That is,
6.
Finally, for a UB-expressionwe define the language(over) as the set of all instances of the elements of:.9
7Of course, in such case,must be of the form
.
8Note that
.
9Recall that
is a language over
.
Example 4.2. It can be readily seen that and behave like the ordinary concatenation and Kleene star, respectively. In addition, for a non-empty3, is redundant, because .10 Example 4.3. The language from Example 3.2 is . Similarly, for a UB-expression
over
%
, consists of all words over having the same first and last symbols.
Thus,
is the language from Example 3.5.
Example 4.4. Consider a subclass of UB-expressions, called FSDA-expressions, that is defined below.
%
, and UB-expressions of the form , where ( and ( are FSDA- expressions.
Ifandare FSDA-expressions, then so are , , and
.
It easily follows from Example 3.3 and the constructions in Sections 5 and 6 that FSDA languages are defined by FSDA-expressions and vice versa, each FSDA expression defines an FSDA language.
Theorem 4.1. A language is defined by a UB-expression if and only if it is accepted by an FSUBA.
The proof of the “if” part of Theorem 4.1 is based on a tight relationship between FSUBA and the ordinary finite automata. It is presented in the next section. The proof of the “only if” part of the theorem is based on the relevant closure properties of FSUBA-languages and is quite standard. For the sake of completeness, we present it in Section 6.
We conclude this section with one more closure property of FSUBA-languages that is an immediate corollary to Theorem 4.1.
Corollary 4.1. FSUBA languages are closed under reversing.11
Proof:
It can be easily verified that, for a UB-expression ,
, where
is defined by the following induction.
If(
%
7
7
, then
is.
is
.
is
.
is
.
Remark 4.1. Using an alternative equivalent model of computation that is similar to M-automata introduced in [8], one can show that FSUBA languages are also closed under intersection.12
10Of course,is redundant as well (but, still, very useful), because
.
11It should be pointed out that FMA languages are not closed under reversing, see [8, Example 8]. Therefore, it is unlikely that there is a kind of regular expressions for FMA languages.
12It follows from Example 3.2 that FSUBA languages are not closed under complementation.