### Regular Expressions with Binding over Data Words for Querying Graph Databases

Leonid Libkin^{1}, Tony Tan^{2}, and Domagoj Vrgoč^{3}

1 University of Edinburgh, email: libkin@inf.ed.ac.uk

2 Hasselt University and Transnational University of Limburg, email: tony.tan@uhasselt.be

3 University of Edinburgh, email: domagoj.vrgoc@ed.ac.uk

Abstract. Data words assign to each position a letter from a finite alphabet and a data value from an infinite set. Introduced as an abstraction of paths in XML documents, they recently found applications in querying graph databases as well. Those are actively studied due to applications in such diverse areas as social networks, semantic web, and biological databases. Querying formalisms for graph databases are based on specifying paths conforming to some regular conditions, which led to a study of regular expressions for data words.

Previously studied regular expressions for data words were either rather limited, or had the full expressiveness of register automata, at the expense of a quite unnatural and unintuitive binding mechanism for data values. Our goal is to introduce a natural extension of regular expressions with proper bindings for data values, similar to the notion of freeze quantifiers used in connection with temporal logics over data words, and to study both language-theoretic properties of the resulting class of languages of data words, and their applications in querying graph databases.

### 1 Introduction

Data words, unlike the usual words over a finite alphabet, assign to each position both a letter from a finite alphabet and an element of an infinite set, referred to as a data value. An example of a data word is $\binom{a}{1}\binom{b}{2}\binom{a}{3}\binom{b}{1}$. This is a data word over the finite alphabet $\{a, b\}$, with data elements coming from an infinite domain, in this case, $\mathbb{N}$. Investigations of data words picked up recently due to their importance in the study of XML documents. Those are naturally modeled as ordered unranked trees in which every node has both a label and a datum (these are referred to as data trees). Data words then model paths in data trees, and as such are essential for investigations of many path-based formalisms for XML, for instance, its navigational query language XPath. We refer the reader to [30, 7, 14, 29] for recent surveys.

While the XML data format dominated the data management landscape for a while, primarily in the 2000s, over the past few years the focus started shifting towards the graph data model. Graph-structured data appears naturally in a variety of applications, most notably social networks and the Semantic Web (as it underlies the RDF format). Its other applications include biology, network traffic, crime detection, and modeling object-oriented data [13, 20, 23, 25–28]. Such databases are represented as graphs in which nodes are objects and the edge labels specify relationships between them; see [1, 4] for surveys.

Just as in the case of XML, a crucial building block in queries against graph data deals with properties of paths in them. The most basic formalism is that of regular path queries, or RPQs, which select nodes connected by a path described by a regular language over the labeling alphabet [11]. There are multiple extensions with more complex patterns, backward navigation, regular relations over paths, and non-regular features [3, 5, 6, 8–10]. In real applications we deal with both navigational information and data, so it is essential that we look at properties of paths that also describe how data values change along them. Since such paths (as we shall explain later) are just data words, it becomes necessary to provide expressive and well-behaved mechanisms for describing languages of data words.

One of the most commonly used formalisms for describing the notion of regularity for data words is that of register automata [18]. These extend the standard NFAs with registers that can store data values; transitions can compare the currently read data value with values stored in registers.

However, register automata are not convenient for specifying properties; ideally, we want to use regular expressions to define languages. These have been looked at in the context of data words (or words over infinite alphabets), and are based on the idea of using variables for binding data values. An initial attempt to define such expressions was made in [19], but it was very limited. Another formalism, called regular expressions with memory, was shown to be equivalent to register automata [21, 22]. At first glance, they appear to be a good formalism: these are expressions like $a{\downarrow}_{x}\,(a[x^{=}])^{*}$ saying: read letter $a$, bind its data value to $x$, and read the rest of the data word checking that all letters are $a$ and the data values are the same as $x$. This will define data words $\binom{a}{d} \cdots \binom{a}{d}$ for some data value $d$. This is reminiscent of freeze quantifiers used in connection with the study of data word languages [12].

The serious problem with these expressions, however, is the binding of variables. The expression above is fine, but now consider the following expression: $a{\downarrow}_{x}\,(a[x^{=}]\,a{\downarrow}_{x})^{*}\,a[x^{=}]$. This expression re-binds variable $x$ inside the scope of another binding, and then crucially, when this happens, the original binding of $x$ is lost! Such expressions really mimic the behavior of register automata, which makes them more procedural than declarative. (The above expression defines data words of the form $\binom{a}{d_1}\binom{a}{d_1} \cdots \binom{a}{d_n}\binom{a}{d_n}$.)

Losing the original binding of a variable when reusing it inside its scope goes completely against the usual practice of writing logical expressions, programs, etc., that have bound variables. Nevertheless, this feature was essential for capturing register automata [21]. So natural questions arise:

– Can we define regular expressions for data words that use the acceptable scope/binding policies for variables? Such expressions will be more declarative than procedural, and more appropriate for being used in queries.

– Do these fall short of the full power of register automata?

– What are their basic properties, and what is the complexity of querying graph data with such expressions?

Contributions Our main contribution is to define a new formalism of regular expressions with binding, or REWBs, to study its properties, and to show how it can be used in the context of graph querying. The binding mechanism of REWBs follows the standard scoping rules, and is essentially the same as in LTL extensions with freeze quantifiers [12]. We also look at some subclasses of REWBs based on the types of conditions one can use: in simple REWBs, each condition involves at most one variable (all those shown above were such), and in positive REWBs, negation and inequality cannot be used in conditions.

We show that the class of languages defined by REWBs is strictly contained in the class of languages defined by register automata. The separating example is rather intricate, and indeed it appears that for most reasonable languages one can think of, if they are definable by register automata, they would be definable by REWBs as well. At the same time, REWBs lower the complexity of some key computational tasks related to languages of data words. For instance, nonemptiness is Pspace-complete for register automata [12], but we show that it is NP-complete for REWBs (and trivializes for simple and positive REWBs).

We consider the containment and universality problems for REWBs. In general they are undecidable, even for simple REWBs. However, the problem becomes decidable for positive REWBs.

We look at applications of REWBs in querying graph databases. The problem of query evaluation is essentially checking whether the intersection of two languages of data words is nonempty. We use this to show that the complexity of query evaluation is Pspace-complete (note that it is higher than the complexity of nonemptiness alone); for a fixed REWB, the complexity is tractable.

At the end we also sketch some results concerning a model of data word automaton that uses variables introduced in [15]. We also comment on how these can be combined with register automata to obtain a language subsuming all the previously used ones while still retaining good query evaluation bounds.

Organization We define data words and data graphs in Section 2. In Section 3 we introduce our notion of regular expression with binding (REWB) and study their nonemptiness and universality problems in Section 4 and Section 5, respectively.

In Section 6 we study REWBs as a graph database query language and in Section 7 we consider some possible extensions that could be useful in graph querying.

Due to space limitations, complete proofs of all the results are in the appendix.

### 2 Data words and data graphs

Let $\Sigma$ be a finite alphabet and $D$ a countably infinite set of data values. A data word is simply a finite string over the alphabet $\Sigma \times D$. That is, in each position a data word carries a letter from $\Sigma$ and a data value from $D$. We will denote data words by $\binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$, where $a_i \in \Sigma$ and $d_i \in D$.

A data graph (over Σ) is a pair G = (V, E), where

– V is a finite set of nodes;

– E ⊆ V × Σ × D × V is a set of edges where each edge contains a label from Σ and a data value from D.

We write V (G) and E(G) to denote the set of nodes and edges of G, respectively.

An edge $e$ from a node $u$ to a node $u'$ is written in the form $(u, \binom{a}{d}, u')$, where $a \in \Sigma$ and $d \in D$. We call $a$ the label of the edge $e$ and $d$ the data value of the edge $e$. We write $D(G)$ to denote the set of data values in $G$.

The following is an example of a data graph, with nodes $u_1, \ldots, u_6$ and edges $(u_1, \binom{a}{3}, u_2)$, $(u_3, \binom{b}{1}, u_2)$, $(u_2, \binom{a}{3}, u_5)$, $(u_6, \binom{a}{5}, u_4)$, $(u_2, \binom{a}{1}, u_4)$, $(u_4, \binom{a}{4}, u_3)$ and $(u_5, \binom{c}{7}, u_6)$.

[Figure: the example data graph on nodes $u_1, \ldots, u_6$, with each edge labeled by its letter and data value.]

A path from a node $v$ to a node $v'$ in $G$ is a sequence

$$\pi = v_1 \binom{a_1}{d_1} v_2 \binom{a_2}{d_2} v_3 \binom{a_3}{d_3} \cdots v_n \binom{a_n}{d_n} v_{n+1}$$

such that $(v_i, \binom{a_i}{d_i}, v_{i+1})$ is an edge for each $i \le n$, and $v_1 = v$ and $v_{n+1} = v'$. A path $\pi$ defines a data word $w(\pi) = \binom{a_1}{d_1}\binom{a_2}{d_2}\binom{a_3}{d_3} \cdots \binom{a_n}{d_n}$.
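The definitions of this section can be made concrete with a small sketch. The Python encoding below is only an illustration: the tuple layout of edges and the names `EXAMPLE_EDGES`, `is_path` and `data_word` are assumptions introduced here, not notation from the paper. It stores the example data graph as a set of edges and extracts the data word $w(\pi)$ of a path.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Edge = Tuple[str, str, Hashable, str]    # (source node, label, data value, target node)
DataWord = List[Tuple[str, Hashable]]    # sequence of (letter, data value) pairs

# The seven edges of the example data graph given in the text.
EXAMPLE_EDGES: Set[Edge] = {
    ("u1", "a", 3, "u2"), ("u3", "b", 1, "u2"), ("u2", "a", 3, "u5"),
    ("u6", "a", 5, "u4"), ("u2", "a", 1, "u4"), ("u4", "a", 4, "u3"),
    ("u5", "c", 7, "u6"),
}


def is_path(edges: Set[Edge], path: Sequence[Edge]) -> bool:
    """True if every step is an edge of the graph and consecutive steps share a node."""
    if any(step not in edges for step in path):
        return False
    return all(path[i][3] == path[i + 1][0] for i in range(len(path) - 1))


def data_word(path: Sequence[Edge]) -> DataWord:
    """The data word w(pi): forget the nodes, keep (label, data value) of each edge."""
    return [(a, d) for (_, a, d, _) in path]


# A path from u1 to u3 in the example graph and the data word it defines.
pi = [("u1", "a", 3, "u2"), ("u2", "a", 1, "u4"), ("u4", "a", 4, "u3")]
print(is_path(EXAMPLE_EDGES, pi), data_word(pi))
```

A path is represented here by its list of edges, which determines the visited nodes; other encodings (for instance, alternating node and edge lists) would serve equally well.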

Remark Note that we have chosen a model in which labels and data values appear in edges. Of course other variations are possible, for instance labels appearing in edges and data values in nodes. All of these easily simulate each other, very much in the same way as one can use either labeled transition systems or Kripke structures as models of temporal or modal logic formulae. In fact both models – with labels in edges and labels in nodes – have been considered in the context of semistructured data and, at least from the point of view of their expressiveness, they are viewed as equivalent. Our choice is dictated primarily by the ease of notation, as it identifies paths with data words.

### 3 Regular expressions with binding

We now define regular expressions with binding for data words. As explained already, expressions with variables for data words were previously defined in [22], but those were really designed to mimic the transitions of register automata, and had a very procedural, rather than declarative, flavor. Here we define them using proper scoping rules.

Variables will store data values; those will be compared with other variables using conditions. To define them, assume that, for each $k > 0$, we have variables $x_1, \ldots, x_k$. Then the set of conditions $C_k$ is given by the grammar:

$$c := \top \mid \bot \mid x_i^{=} \mid x_i^{\neq} \mid c \wedge c \mid c \vee c \mid \neg c, \qquad 1 \le i \le k.$$

The satisfaction of a condition is defined with respect to a data value $d \in D$ and a (partial) valuation $\nu : \{x_1, \ldots, x_k\} \to D$ of variables as follows:

– $d, \nu \models \top$ and $d, \nu \not\models \bot$;

– $d, \nu \models x_i^{=}$ iff $d = \nu(x_i)$;

– $d, \nu \models x_i^{\neq}$ iff $d \neq \nu(x_i)$;

– the semantics for Boolean connectives $\vee$, $\wedge$, and $\neg$ is standard.
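The satisfaction relation can be sketched as a short recursive function. The encoding of conditions as nested tuples and the name `satisfies` are assumptions made for illustration only; the sketch also assumes that every variable mentioned in the condition is defined in the valuation (otherwise a `KeyError` is raised), since the text above does not fix the behavior on undefined variables.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical encoding of conditions in C_k as nested tuples:
#   ("top",), ("bot",), ("eq", x), ("neq", x), ("and", c1, c2), ("or", c1, c2), ("not", c)
Condition = Tuple


def satisfies(d: Hashable, nu: Dict[str, Hashable], c: Condition) -> bool:
    """Decide d, nu |= c; assumes nu is defined on all variables occurring in c."""
    tag = c[0]
    if tag == "top":
        return True
    if tag == "bot":
        return False
    if tag == "eq":                      # x^= : current data value equals nu(x)
        return d == nu[c[1]]
    if tag == "neq":                     # x^!= : current data value differs from nu(x)
        return d != nu[c[1]]
    if tag == "and":
        return satisfies(d, nu, c[1]) and satisfies(d, nu, c[2])
    if tag == "or":
        return satisfies(d, nu, c[1]) or satisfies(d, nu, c[2])
    if tag == "not":
        return not satisfies(d, nu, c[1])
    raise ValueError(f"unknown condition: {c!r}")


# x1^= AND NOT x2^=, evaluated on data value 5 with nu(x1) = 5, nu(x2) = 7.
cond = ("and", ("eq", "x1"), ("not", ("eq", "x2")))
print(satisfies(5, {"x1": 5, "x2": 7}, cond))
```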

Next we define regular expressions with binding.

Definition 1. Let $\Sigma$ be a finite alphabet and $\{x_1, \ldots, x_k\}$ a finite set of variables. Regular expressions with binding (REWB) over $\Sigma[x_1, \ldots, x_k]$ are defined inductively as follows:

$$r := \varepsilon \mid a \mid a[c] \mid r + r \mid r \cdot r \mid r^{*} \mid a{\downarrow}_{x_i}(r) \qquad (1)$$

where $a \in \Sigma$ and $c$ is a condition in $C_k$.

A variable $x_i$ is bound if it occurs in the scope of some ${\downarrow}_{x_i}$ operator and free otherwise. More precisely, free variables of an expression are defined inductively: $\varepsilon$ and $a$ have no free variables, in $a[c]$ all variables occurring in $c$ are free, in $r_1 + r_2$ and $r_1 \cdot r_2$ the free variables are those of $r_1$ and $r_2$, the free variables of $r^{*}$ are those of $r$, and the free variables of $a{\downarrow}_{x_i}(r)$ are those of $r$ except $x_i$. We will write $r(x_1, \ldots, x_l)$ if $x_1, \ldots, x_l$ are the free variables in $r$.

A valuation on the variables $x_1, \ldots, x_k$ is a partial function $\nu : \{x_1, \ldots, x_k\} \to D$. We denote by $F(x_1, \ldots, x_k)$ the set of all valuations on $x_1, \ldots, x_k$. For a valuation $\nu$, we write $\nu[x_i \leftarrow d]$ to denote the valuation $\nu'$ obtained by fixing $\nu'(x_i) = d$ and $\nu'(x) = \nu(x)$ for all other $x \neq x_i$. Likewise, we write $\nu[\bar{x} \leftarrow \bar{d}]$ for a simultaneous substitution of values from $\bar{d} = (d_1, \ldots, d_l)$ for variables $\bar{x} = (x_1, \ldots, x_l)$. Also, the notation $\nu(\bar{x}) = \bar{d}$ means that $\nu(x_i) = d_i$ for all $i \le l$.

Semantics Let $r(\bar{x})$ be an REWB over $\Sigma[x_1, \ldots, x_k]$. A valuation $\nu \in F(x_1, \ldots, x_k)$ is compatible with $r$ if $\nu(\bar{x})$ is defined.

A regular expression $r(\bar{x})$ over $\Sigma[x_1, \ldots, x_k]$ and a valuation $\nu \in F(x_1, \ldots, x_k)$ compatible with $r$ define a language $L(r, \nu)$ of data words as follows.

– If $r = a$ and $a \in \Sigma$, then $L(r, \nu) = \{\binom{a}{d} \mid d \in D\}$.

– If $r = a[c]$, then $L(r, \nu) = \{\binom{a}{d} \mid d, \nu \models c\}$.

– If $r = r_1 + r_2$, then $L(r, \nu) = L(r_1, \nu) \cup L(r_2, \nu)$.

– If $r = r_1 \cdot r_2$, then $L(r, \nu) = L(r_1, \nu) \cdot L(r_2, \nu)$.

– If $r = r_1^{*}$, then $L(r, \nu) = L(r_1, \nu)^{*}$.

– If $r = a{\downarrow}_{x_i}(r_1)$, then $L(r, \nu) = \bigcup_{d \in D} \{\binom{a}{d}\} \cdot L(r_1, \nu[x_i \leftarrow d])$.

A REWB $r$ defines a language of data words as follows.

$$L(r) = \bigcup_{\nu \text{ compatible with } r} L(r, \nu).$$

In particular, if r is without free variables, then L(r) = L(r, ∅). We will call such REWBs closed.
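To illustrate how the scoping rule shapes the semantics, the following sketch tests membership of a data word in $L(r, \nu)$ by structural recursion. The tuple encoding of expressions and conditions, and the names `holds`, `ends` and `member`, are hypothetical choices made for this illustration; it is not an algorithm taken from the paper. The empty expression $\varepsilon$ is assumed to match exactly the empty word.

```python
from typing import Dict, Hashable, List, Set, Tuple

DataWord = List[Tuple[str, Hashable]]    # [(letter, data value), ...]

# Hypothetical encodings. Conditions: ("top",), ("bot",), ("eq", x), ("neq", x),
# ("and", c1, c2), ("or", c1, c2), ("not", c). Expressions: ("eps",), ("sym", a),
# ("test", a, c), ("alt", r1, r2), ("cat", r1, r2), ("star", r1), ("bind", a, x, r1).


def holds(d, nu, c) -> bool:
    """d, nu |= c, assuming nu is defined on the variables of c."""
    tag = c[0]
    if tag == "top": return True
    if tag == "bot": return False
    if tag == "eq":  return d == nu[c[1]]
    if tag == "neq": return d != nu[c[1]]
    if tag == "and": return holds(d, nu, c[1]) and holds(d, nu, c[2])
    if tag == "or":  return holds(d, nu, c[1]) or holds(d, nu, c[2])
    if tag == "not": return not holds(d, nu, c[1])
    raise ValueError(f"unknown condition: {c!r}")


def ends(r, nu: Dict[str, Hashable], w: DataWord, i: int) -> Set[int]:
    """All positions j such that the factor w[i:j] belongs to L(r, nu)."""
    tag = r[0]
    if tag == "eps":
        return {i}
    if tag in ("sym", "test", "bind"):
        if i >= len(w) or w[i][0] != r[1]:
            return set()
        d = w[i][1]
        if tag == "sym":
            return {i + 1}
        if tag == "test":
            return {i + 1} if holds(d, nu, r[2]) else set()
        # bind: the new value of x is visible only inside r[3]; nu is left untouched,
        # so the outer binding is restored automatically when the scope ends.
        return ends(r[3], {**nu, r[2]: d}, w, i + 1)
    if tag == "alt":
        return ends(r[1], nu, w, i) | ends(r[2], nu, w, i)
    if tag == "cat":
        return {k for j in ends(r[1], nu, w, i) for k in ends(r[2], nu, w, j)}
    if tag == "star":
        reached, frontier = {i}, {i}
        while frontier:                  # saturate; positions are bounded by len(w)
            new = {k for j in frontier for k in ends(r[1], nu, w, j)} - reached
            reached |= new
            frontier = new
        return reached
    raise ValueError(f"unknown expression: {r!r}")


def member(r, w: DataWord, nu=None) -> bool:
    """True if w is in L(r, nu); nu defaults to the empty valuation."""
    return len(w) in ends(r, dict(nu or {}), w, 0)


# a|x (a* . a[x=]) : the first and the last data value coincide.
first_eq_last = ("bind", "a", "x",
                 ("cat", ("star", ("sym", "a")), ("test", "a", ("eq", "x"))))
print(member(first_eq_last, [("a", 1), ("a", 2), ("a", 1)]))
```

Because the binding case passes an extended copy of the valuation to the recursive call, a variable re-bound inside a scope regains its outer value once that scope is left, which is the behavior required by the last clause of the semantics.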

Register automata and expressions with memory As mentioned earlier, register automata extend NFAs with the ability to store and compare data values.

Formally, an automaton with k registers is A = (Q, q_{0}, F, T ), where:

– Q is a finite set of states;

– $q_0 \in Q$ is the initial state;

– F ⊆ Q is the set of final states;

– $T$ is a finite set of transitions of the form $(q, a, c) \to (I, q')$, where $q, q'$ are states, $a$ is a label, $I \subseteq \{1, \ldots, k\}$, and $c$ is a condition in $C_k$.

Intuitively the automaton traverses a data word from left to right, starting in $q_0$, with all registers empty. If it reads $\binom{a}{d}$ in state $q$ with register configuration $\tau : \{1, \ldots, k\} \to D$, it may apply a transition $(q, a, c) \to (I, q')$ if $d, \tau \models c$; it then enters state $q'$ and changes the contents of registers $i$, with $i \in I$, to $d$. For more details on register automata we refer the reader to [18, 22].
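The run of a register automaton described above can be simulated by tracking the set of reachable (state, register contents) pairs. The sketch below is a minimal illustration under several assumptions that are not fixed by the text: conditions are passed as Python callables, an empty register is represented by `None`, and an equality test against an empty register fails. The example automaton and all names (`accepts`, `PAIRS`, `top`, `eq1`) are hypothetical.

```python
from typing import Callable, FrozenSet, Hashable, List, Optional, Set, Tuple

Regs = Tuple[Optional[Hashable], ...]          # register contents; None means empty
Cond = Callable[[Hashable, Regs], bool]        # condition as a predicate on (d, tau)
Transition = Tuple[str, str, Cond, FrozenSet[int], str]   # (q, a, c, I, q')


def accepts(k: int, q0: str, finals: Set[str],
            transitions: List[Transition],
            word: List[Tuple[str, Hashable]]) -> bool:
    """Nondeterministic run: keep every reachable (state, registers) configuration."""
    configs = {(q0, (None,) * k)}
    for a, d in word:
        nxt = set()
        for q, tau in configs:
            for (p, b, c, I, p2) in transitions:
                if p == q and b == a and c(d, tau):
                    # registers are numbered 1..k; those in I are overwritten with d
                    new_tau = tuple(d if i in I else v
                                    for i, v in enumerate(tau, start=1))
                    nxt.add((p2, new_tau))
        configs = nxt
    return any(q in finals for q, _ in configs)


top: Cond = lambda d, tau: True
eq1: Cond = lambda d, tau: tau[0] == d         # x_1^= ; fails on an empty register

# A one-register automaton for data words in which positions 1-2, 3-4, ...
# carry equal data values (the language of the re-binding example in Section 1).
PAIRS: List[Transition] = [
    ("q0", "a", top, frozenset({1}), "q1"),
    ("q1", "a", eq1, frozenset(), "q2"),
    ("q2", "a", top, frozenset({1}), "q1"),
]
print(accepts(1, "q0", {"q2"}, PAIRS, [("a", 4), ("a", 4), ("a", 9), ("a", 9)]))
```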

Expressions introduced in [21] had a similar syntax but rather different semantics. They were built using $a{\downarrow}_{x}$, concatenation, union and Kleene star. That is, no binding was introduced with $a{\downarrow}_{x}$; rather it directly matched the operation of putting a value in a register. In contrast, we use proper bindings of variables; expression $a{\downarrow}_{x}$ appears only in the context $a{\downarrow}_{x}(r)$ where it binds $x$ inside the expression $r$ only. This corresponds to the standard binding policies in logic, or in programs.

Example 1. We list several examples of languages expressible with our expressions. In all cases below we have a singleton alphabet $\Sigma = \{a\}$.

– The language that consists of data words where the data value in the first position is different from the others is given by: $a{\downarrow}_{x}((a[x^{\neq}])^{*})$.

– The language that consists of data words where the data values in the first and the last position are the same is given by: $a{\downarrow}_{x}(a^{*} \cdot a[x^{=}])$.

– The language that consists of data words where there are two positions with the same data value: $a^{*} \cdot a{\downarrow}_{x}(a^{*} \cdot a[x^{=}]) \cdot a^{*}$.

Note that in REWBs in the above example the conditions are very simple: they are either $x^{=}$ or $x^{\neq}$. We will call such expressions simple REWBs.

We shall also consider positive REWBs where negation and inequality are disallowed in conditions. That is, all the conditions $c$ are constructed using the following syntax: $c := \top \mid x_i^{=} \mid c \wedge c \mid c \vee c$, where $1 \le i \le k$.

We finish this section by showing that REWBs are strictly weaker than register automata (i.e., proper binding of variables has a cost – albeit small – in terms of expressiveness).

Theorem 1. The class of languages defined by REWBs is strictly contained in the class of languages accepted by register automata.

That the class of languages defined by REWBs is contained in the class of languages defined by register automata can be proved by using a similar inductive construction as in [21, Proposition 5.3]. The idea behind the construction of the separating example follows the intuition that defining scope of variables restricts the power of the language, compared to register automata where once stored, the value remains in the register until rewritten. As the proof is rather technical and lengthy, we present it in the appendix.

We note that the separating example is rather intricate, and certainly not a natural language one would think of. In fact, all natural languages definable with register automata that we used here as examples – and many more, especially those suitable for graph querying – are definable by REWBs.

### 4 The nonemptiness problem

We now look at the standard language-theoretic problem of nonemptiness:

Nonemptiness for REWBs

Input: A REWB $r$ over $\Sigma[x_1, \ldots, x_k]$.

Task: Decide whether $L(r) \neq \emptyset$.

More generally, one can ask if $L(r, \nu) \neq \emptyset$ for a REWB $r$ and a compatible valuation $\nu$.

Recall that for register automata, the nonemptiness problem is Pspace-complete [12] (and the same bound applies to regular expressions with memory [22]). Introducing proper binding, we lose little expressiveness and yet can lower the complexity.

Theorem 2. The nonemptiness problem for REWBs is NP-complete.

The proof is in the appendix. Note that for simple and positive REWBs the problem trivializes.

Proposition 1. – For every simple REWB $r$ over $\Sigma[x_1, \ldots, x_k]$, and for every valuation $\nu$ compatible with $r$, we have $L(r, \nu) \neq \emptyset$.

– For every positive REWB $r$ over $\Sigma[x_1, \ldots, x_k]$, there is a valuation $\nu$ such that $L(r, \nu) \neq \emptyset$.

### 5 Containment and universality

We now turn our attention to language containment. That is, we are dealing with the following problem:

Containment for REWBs

Input: Two REWBs r_{1}, r_{2} over Σ[x_{1}, . . . , x_{k}].

Task: Decide whether L(r1) ⊆ L(r2).

When r2 is a fixed expression denoting all data words, this is the universality problem. We show that both are undecidable.

In fact, we show a stronger statement, that universality of simple REWBs that use just a single variable is already undecidable.

Universality for one-variable REWBs

Input: An REWB $r$ over $\Sigma[x]$.

Task: Decide whether L(r) = (Σ × D)^{∗}.

Theorem 3. Universality for one-variable REWBs is undecidable. In particular, containment for REWBs is undecidable too.

While restriction to simple REWBs does not make the problem decidable, the restriction to positive REWBs does: as is often the case, static analysis tasks become easier without negation.

Theorem 4. The containment problem for positive REWBs is decidable.

Proof. It is rather straightforward to show that any positive REWB can be converted into a register automaton without inequality [19]. The decidability of the language containment follows from the fact that the containment problem for register automata without inequality is decidable [31].

### 6 REWBs as a query language for data graphs

Standard mechanisms for querying graph databases are based on regular path queries, or RPQs: those select nodes connected by a path belonging to a given regular language [4, 11, 9, 10]. For data graphs, we follow the same idea, but now paths are specified by REWBs, since they contain data. In this section we study the complexity of this querying formalism.

We first explain how the problem of query evaluation can be cast as a problem of checking nonemptiness of language intersection.

Note that a data graph G can be viewed as an automaton, generating data words. That is, given a data graph G = (V, E), and a pair of nodes s, t, we let L(G, s, t) be {w(π) | π is a path from s to t in G}; this is a set of data words.

Let $r(\bar{x})$ be a REWB over $\Sigma[x_1, \ldots, x_k]$. For $\nu$ compatible with $r$, we let $L(G, s, t, r, \nu)$ be $L(G, s, t) \cap L(r, \nu)$. Then for a graph $G = (V, E)$, we define the answer to $r$ over $G$ as the set $Q(r, G)$ of triples $(s, t, \bar{d}) \in V \times V \times D^{k}$, such that $L(G, s, t, r, \nu[\bar{x} \leftarrow \bar{d}]) \neq \emptyset$. In other words, there is a path $\pi$ in $G$ from $s$ to $t$ such that $w(\pi) \in L(r, \nu)$, where $\nu(\bar{x}) = \bar{d}$.

If $r$ is a closed REWB, we do not need a valuation in the above definition. That is, $Q(r, G)$ is the set of pairs of nodes $(s, t)$ such that $L(G, s, t) \cap L(r) \neq \emptyset$, i.e., there is a path $\pi$ in $G$ from $s$ to $t$ such that $w(\pi) \in L(r)$.

In what follows we are interested in the query evaluation and query containment problems. For simplicity we will work with closed REWBs only. We start with query evaluation.

Query Evaluation for REWB

Input: A data graph G, two nodes s, t ∈ V (G) and a REWB r.

Task: Decide whether (s, t) ∈ Q(r, G).

Note that in this problem, both the data graph and the query, given by r, are inputs; this is referred to as the combined complexity of query evaluation. If the expression r is fixed, we are talking about data complexity.

Recall that for the usual graphs (without data), the combined complexity of evaluating RPQs is polynomial, but if conjunctions of RPQs are taken, it goes up to NP (and could be NP-complete, in fact [11, 10]). When we look at data graphs and specify paths with register automata, combined complexity jumps to Pspace-complete [21].

However, we have seen that REWBs are less expressive than register automata, so perhaps a lower NP bound would apply to them? One way to try to do it is to find a polynomial bound on the length of a minimal path witnessing a REWB in a data graph. The next proposition shows that this is impossible, since in some cases the shortest witnessing path will be exponentially long, even if the REWB uses only one variable.

Proposition 2. Let $\Sigma = \{\$, \text{¢}, a, b\}$ be a finite alphabet. There exists a family of data graphs $\{G_n(s, t)\}_{n>1}$ with two distinguished nodes $s$ and $t$, and a family of closed REWBs $\{r_n\}_{n>1}$ such that

– each $G_n(s, t)$ is of size $O(n)$;

– each $r_n$ is a closed REWB over $\Sigma[x]$ of length $O(n)$; and

– every data word in $L(G_n, s, t, r_n)$ is of length $\Omega(2^{\lfloor n/2 \rfloor})$.

The proof of this is rather involved and can be found in the appendix.

Next we describe the complexity of the query evaluation problem. It turns out that it matches that for register automata.

Theorem 5. – The complexity of query evaluation for REWB is Pspace-complete.

– For each fixed $r$, the complexity of query evaluation for REWB is in NLogspace.

In other words, the combined complexity of queries based on REWBs is Pspace-complete, and their data complexity is in NLogspace (and of course it can be NLogspace-complete even for very simple expressions, e.g., $\Sigma^{*}$, which just expresses reachability). Note that the combined complexity is acceptable (it matches, for example, the combined complexity of standard relational query languages such as relational calculus and algebra), and that data complexity is the best possible for a language that can express the reachability problem.

We prove Pspace membership by showing how to transform REWBs into regular expressions when only finitely many data values are considered. Since the expression in question is of length exponential in the size of the input, standard on-the-fly construction of product with the input graph (viewed as an NFA) gives us the desired bound. Details of this construction, as well as the proof of hardness, can be found in the appendix. The same proof, for a fixed r, gives us the bound for data complexity.
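The key observation behind this argument, that only the finitely many data values occurring in the graph matter, can be illustrated by a direct search. The sketch below does not implement the construction from the appendix: instead of translating the REWB into a regular expression, it explores configurations consisting of a graph node and a stack of pending subexpressions with their valuations, and it stores all visited configurations, so it makes no attempt to respect the space bound. The encodings and the names `holds` and `evaluate` are assumptions made for illustration.

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[str, str, Hashable, str]    # (source node, label, data value, target node)

# Hypothetical encodings. Conditions: ("top",), ("bot",), ("eq", x), ("neq", x),
# ("and", c1, c2), ("or", c1, c2), ("not", c). Expressions: ("eps",), ("sym", a),
# ("test", a, c), ("alt", r1, r2), ("cat", r1, r2), ("star", r1), ("bind", a, x, r1).


def holds(d, nu, c) -> bool:
    tag = c[0]
    if tag == "top": return True
    if tag == "bot": return False
    if tag == "eq":  return d == nu[c[1]]
    if tag == "neq": return d != nu[c[1]]
    if tag == "and": return holds(d, nu, c[1]) and holds(d, nu, c[2])
    if tag == "or":  return holds(d, nu, c[1]) or holds(d, nu, c[2])
    if tag == "not": return not holds(d, nu, c[1])
    raise ValueError(f"unknown condition: {c!r}")


def evaluate(edges: Set[Edge], s: str, t: str, r) -> bool:
    """Decide (s, t) in Q(r, G) for a closed REWB r by searching configurations."""
    start = (s, ((r, frozenset()),))     # valuations are frozensets of (var, value)
    seen, stack = {start}, [start]
    while stack:
        v, cont = stack.pop()
        if not cont:                     # nothing left to match
            if v == t:
                return True
            continue
        (e, nu), rest = cont[0], cont[1:]
        tag, succs = e[0], []
        if tag == "eps":
            succs.append((v, rest))
        elif tag == "alt":
            succs += [(v, ((e[1], nu),) + rest), (v, ((e[2], nu),) + rest)]
        elif tag == "cat":
            succs.append((v, ((e[1], nu), (e[2], nu)) + rest))
        elif tag == "star":              # either stop, or unfold once more
            succs += [(v, rest), (v, ((e[1], nu), (e, nu)) + rest)]
        else:                            # sym, test, bind: follow one edge labeled e[1]
            for (u, a, d, u2) in edges:
                if u != v or a != e[1]:
                    continue
                if tag == "sym":
                    succs.append((u2, rest))
                elif tag == "test":
                    if holds(d, dict(nu), e[2]):
                        succs.append((u2, rest))
                elif tag == "bind":      # new binding is visible only inside e[3]
                    inner = frozenset({**dict(nu), e[2]: d}.items())
                    succs.append((u2, ((e[3], inner),) + rest))
        for c in succs:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return False


G = {("u1", "a", 3, "u2"), ("u3", "b", 1, "u2"), ("u2", "a", 3, "u5"),
     ("u6", "a", 5, "u4"), ("u2", "a", 1, "u4"), ("u4", "a", 4, "u3"),
     ("u5", "c", 7, "u6")}
r = ("bind", "a", "x", ("cat", ("star", ("sym", "a")), ("test", "a", ("eq", "x"))))
print(evaluate(G, "u1", "u5", r), evaluate(G, "u1", "u4", r))
```

On the example data graph of Section 2, the expression $a{\downarrow}_{x}(a^{*} \cdot a[x^{=}])$ selects the pair $(u_1, u_5)$ through the path with data word $\binom{a}{3}\binom{a}{3}$, while no path from $u_1$ to $u_4$ over the letter $a$ ends with the data value it started with.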

Note that the upper bound follows from the connection with register automata. In order to make our presentation self-contained we opted to present a different proof in the appendix.

By examining the proofs of Theorem 5 and Theorem 3 we observe that lower bounds already hold for both simple and positive REWBs. That is, we get the following.

Corollary 1. The following holds for simple REWBs.

– Combined complexity of simple (or positive) REWB queries is Pspace-complete.

– Data complexity of simple (or positive) REWB queries is NLogspace-complete.

Another important problem in querying graphs is query containment. In general, the query containment problem asks, for two REWBs $r_1, r_2$ over $\Sigma[x_1, \ldots, x_k]$, whether $Q(r_1, G) \subseteq Q(r_2, G)$ for every data graph $G$. For the REWB-based queries we look at, this problem is easily seen to be equivalent to language containment. Using this fact and the results of Section 5 we obtain the following.

Corollary 2. Query containment is undecidable for REWBs and simple REWBs. It becomes decidable if we restrict our queries to positive REWBs.

### 7 Conclusions and Extensions

After conducting an extensive study of their language-theoretic properties and their ability to query graph data we conclude that REWBs can serve as a highly expressive language that still retains good query evaluation properties. Although weaker than register automata and their expression counterpart – regular expressions with memory, REWBs come with a more natural and declarative syntax and have a lower complexity of some language-theoretic properties such as nonemptiness. They also complete a picture of expressions that relate to register automata – a question that often came up in the discussions about the connection of regular expressions with memory (REMs) and register automata [21, 22], as they can be seen as a natural restriction of REMs with proper scoping rules.

As we have seen, both in this paper and in previous work on graph querying, all of the considered formalisms have a combined complexity of query evaluation that is either a low degree polynomial, or Pspace-complete. A natural question to ask is if there is a formalism whose combined complexity lies between these two classes.

An answer to this can be given using a model of automata that extends NFAs in a similar way that REWBs extend regular expressions – by allowing usage of variables. These automata, called variable automata, were introduced in [15] and although originally defined for words over an infinite alphabet, they can easily be modified to handle data words. Intuitively, they can be viewed as NFAs with a guess of data values to be assigned to variables, with the run of the automaton verifying correctness of the guess. An example of a variable automaton recognizing the language of all words where the last data value is different from all others is given in the following image.

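[Figure: a variable automaton with initial state $q_a$ and a second state $q_b$; its transitions carry the letter $a$ together with the variable x or the symbol ?.]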

Here we observe that variable automata use two sorts of variables – an ordinary bound variable x that is assigned a unique value, and a special free variable ?, whose every occurrence is assigned a value different from the ones assigned to the bound variables.

It can be shown that variable automata, used as a graph querying formalism, have NP-complete combined complexity of query evaluation and that their deterministic subclass [15] has coNP query containment. Due to space limitations we defer the technical details of these results to the appendix.

The somewhat synthetic nature of variable automata and their usage of the free variable makes them incomparable with REWBs and register automata, as the example above demonstrates. A natural question then is whether there is a model that encompasses both and still retains the same good query evaluation bounds. It can be shown that by allowing variable automata to use the full power of registers we get a model that subsumes all of the previously studied models and whose combined complexity is no worse than that of register automata. As the details of the construction are rather lengthy we defer them to the appendix.

### References

1. S. Abiteboul, P. Buneman, D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.

2. S. Abiteboul, R. Hull, V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

3. S. Abiteboul, V. Vianu. Regular path queries with constraints. JCSS 58 (1999), 428–452.

4. R. Angles, C. Gutiérrez. Survey of graph database models. ACM Comput. Surv. 40(1): (2008).

5. P. Barceló, D. Figueira, L. Libkin. Graph logics with rational relations and the generalized intersection problem. In LICS 2012.

6. P. Barceló, L. Libkin, A. W. Lin, P. Wood. Expressive languages for path queries over graph-structured data. ACM TODS, 37(4) (2012).

7. M. Bojanczyk. Automata for Data Words and Data Trees. In RTA 2010, pages 1–4.

8. D. Calvanese, G. de Giacomo, M. Lenzerini, M. Y. Vardi. Containment of conjunctive regular path queries with inverse. In KR’00, pages 176–185.

9. D. Calvanese, G. de Giacomo, M. Lenzerini, M. Y. Vardi. Rewriting of regular expressions and regular path queries. JCSS, 64(3):443–465 (2002).

10. M. P. Consens, A. O. Mendelzon. GraphLog: a visual formalism for real life recursion. In PODS’90, pages 404–416.

11. I. Cruz, A. Mendelzon, P. Wood. A graphical query language supporting recursion. In SIGMOD’87, pages 323–330.

12. S. Demri, R. Lazić. LTL with the freeze quantifier and register automata. ACM TOCL 10(3): (2009).

13. W. Fan. Graph pattern matching revised for social network analysis. In ICDT 2012, pages 8–21.

14. D. Figueira. Reasoning on words and trees with data. PhD thesis, 2010.

15. O. Grumberg, O. Kupferman, S. Sheinvald. Variable automata over infinite alphabets. In LATA’10, pages 561–572.

16. O. Grumberg, O. Kupferman, S. Sheinvald. Variable automata over infinite alphabets. Manuscript, 2011.

17. C. Gutierrez, C. Hurtado, A. Mendelzon. Foundations of semantic Web databases. J. Comput. Syst. Sci. 77(3): 520–541 (2011).

18. M. Kaminski, N. Francez. Finite-memory automata. TCS 134(2): 329–363 (1994).

19. M. Kaminski and T. Tan. Regular expressions for languages over infinite alphabets. Fundamenta Informaticae, 69(3):301–318 (2006).

20. U. Leser. A query language for biological networks. Bioinformatics 21 (suppl 2) (2005), ii33–ii39.

21. L. Libkin, D. Vrgoč. Regular path queries on graphs with data. In ICDT’12, pages 74–85.

22. L. Libkin, D. Vrgoč. Regular expressions for data words. LPAR’12, pages 274–288.

23. R. Milo, S. Shen-Orr, et al. Network motifs: simple building blocks of complex networks. Science 298(5594) (2002), 824–827.

24. F. Neven, T. Schwentick, V. Vianu. Finite state machines for strings over infinite alphabets. ACM TOCL 5(3): 403–435 (2004).

25. F. Olken. Graph data management for molecular biology. OMICS 7: 75–78 (2003).

26. J. Pérez, M. Arenas, C. Gutierrez. Semantics and complexity of SPARQL. ACM TODS 34(3): 1–45 (2009).

27. R. Ronen and O. Shmueli. SoQL: a language for querying and creating data in social networks. In ICDE 2009, pages 1595–1602.

28. M. San Martín, C. Gutierrez. Representing, querying and transforming social networks with RDF/SPARQL. In ESWC 2009, pages 293–307.

29. T. Schwentick. A Little Bit Infinite? On Adding Data to Finitely Labelled Structures. In STACS 2008, pages 17–18.

30. L. Segoufin. Automata and logics for words and trees over an infinite alphabet. In CSL’06, pages 41–57.

31. A. Tal. Decidability of Inclusion for Unification Based Automata. M.Sc. thesis (in Hebrew), Technion, 1999.

## APPENDIX

### Proofs

### Proof of Theorem 1

To prove the theorem we define a language P that separates register automata from REWBs.

For a positive integer m ≥ 1, we define a language P_{m} over the unary alphabet Σ = {a}, which consists of the data words of the form:

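$$
\binom{a}{d_0}\binom{a}{d_1}\binom{a}{e_0}\binom{a}{e_1}\,\underbrace{\cdots}_{v_1}\,\binom{a}{d_1}\binom{a}{d_2}\,\underbrace{\cdots}_{w_1}\,\binom{a}{e_1}\binom{a}{e_2}\,\underbrace{\cdots}_{v_2}\,\binom{a}{d_2}\binom{a}{d_3}\,\underbrace{\cdots}_{w_2}\,\binom{a}{e_2}\binom{a}{e_3}\;\cdots\cdots\;\binom{a}{e_{m-2}}\binom{a}{e_{m-1}}\,\underbrace{\cdots}_{v_{m-1}}\,\binom{a}{d_{m-1}}\binom{a}{d_m}\,\underbrace{\cdots}_{w_{m-1}}\,\binom{a}{e_{m-1}}\binom{a}{e_m}
$$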

where m ≥ 1 and for each i = 1, . . . , m, the data value d_{i} does not appear in v_{i} and the data value e_{i} does not appear in w_{i}. We then define the language P as

$$
P := \bigcup_{m \ge 1} P_m .
$$

Now Theorem 1 follows immediately from Lemmas 1 and 2 below.

Lemma 1. The language P is accepted by a two-register automaton.

Proof. It is straightforward to show that the language P is accepted by a two-register automaton: one register takes care of the d_{i}’s and the other of the e_{i}’s.
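The idea can be illustrated by a small deterministic simulation. The sketch below is not the automaton of the proof, which the text leaves implicit. It assumes that a word over Σ = {a} is given as the list of its data values, that P_{1} consists of the four-letter words, and it uses two variables `reg_d` and `reg_e` (hypothetical names) in the role of the two registers. Since d_{i} does not appear in v_{i}, the first later occurrence of the stored value marks the end of v_{i}, and similarly for e_{i} and w_{i}.

```python
def accepts_P(data):
    """Check membership in P for a word given as its list of data values.

    Assumption: every letter is 'a', so only the data values are passed in.
    """
    n = len(data)
    if n < 4:
        return False
    reg_d, reg_e = data[1], data[3]  # store d_1 and e_1
    pos = 4
    while pos < n:
        # Skip v_i: all values differ from the stored d_i.
        while pos < n and data[pos] != reg_d:
            pos += 1
        if pos + 1 >= n:
            return False  # d_i or its successor d_{i+1} is missing
        reg_d = data[pos + 1]  # overwrite the register with d_{i+1}
        pos += 2
        # Skip w_i: all values differ from the stored e_i.
        while pos < n and data[pos] != reg_e:
            pos += 1
        if pos + 1 >= n:
            return False  # e_i or its successor e_{i+1} is missing
        reg_e = data[pos + 1]  # overwrite the register with e_{i+1}
        pos += 2
    return True
```

For instance, the value sequence 0, 1, 10, 11, 5, 1, 2, 6, 11, 12 is accepted with v_{1} = 5 and w_{1} = 6, while dropping the last four values leaves the word without its closing e-block and it is rejected.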

Lemma 2. The language P is not definable by REWBs.

Next we prove Lemma 2. Note that for simplicity we prove the Lemma for the case of simple REWBs. It is straightforward to see that the same proof works in the case of REWBs that use multiple comparisons in one condition.

The proof is rather technical and will require a few auxiliary notions.

Let r be an REWB over Σ[x_{1}, . . . , x_{k}]. A derivation tree t with respect to r is a tree whose internal nodes are labeled with pairs (r′, ν), where r′ is a subexpression of r and ν ∈ F[x_{1}, . . . , x_{k}], constructed as follows. The root node is labeled with (r, ∅). For a node u labeled with (r′, ν), its children are labeled as follows.

– If r′ = a, then u has only one child: a leaf node labeled with ^{a}_{d} for some d ∈ D.

– If r′ = a[x^{=}], then u has only one child: a leaf node labeled with ^{a}_{ν(x)}.

– If r′ = a[x^{≠}], then u has only one child: a leaf node labeled with ^{a}_{d} for some d ≠ ν(x).

– If r′ = r_{1} + r_{2}, then u has only one child: a node labeled with either (r_{1}, ν) or (r_{2}, ν).

– If r′ = r_{1} · r_{2}, then u has exactly two children: the left child is labeled with (r_{1}, ν) and the right child is labeled with (r_{2}, ν).

– If r′ = r_{1}^{∗}, then u has either only one child, a leaf node labeled with ε, or at least one child, each labeled with (r_{1}, ν).

– If r′ = a ↓_{x} ·(r_{1}), then u has exactly two children: the left child is labeled with ^{a}_{d} and the right child is labeled with (r_{1}, ν[x ← d]), for some data value d ∈ D.

A derivation tree t defines a data word w(t) as the word read on the leaf nodes of t from left to right.
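The rules above can be read as a recursive membership test. The following is a minimal sketch under stated assumptions: expressions are encoded as nested tuples (a hypothetical encoding, with tags such as `"bind"` and `"cond"` chosen here), a data word is a list of (letter, datum) pairs, and `ends(r, w, i, nu)` returns every position j such that the factor w[i:j] has a derivation tree rooted at (r, ν). The `"eps"` case is an addition for convenience.

```python
def holds(cond, d, nu):
    """Evaluate a condition on datum d under valuation nu."""
    kind = cond[0]
    if kind == "eq":       # x^=
        return nu[cond[1]] == d
    if kind == "neq":      # x^≠
        return nu[cond[1]] != d
    if kind == "and":
        return holds(cond[1], d, nu) and holds(cond[2], d, nu)
    if kind == "or":
        return holds(cond[1], d, nu) or holds(cond[2], d, nu)
    raise ValueError(kind)


def ends(r, w, i, nu):
    """Positions j such that w[i:j] is derivable from (r, nu)."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "let":      # a
        return {i + 1} if i < len(w) and w[i][0] == r[1] else set()
    if kind == "cond":     # a[c]
        ok = i < len(w) and w[i][0] == r[1] and holds(r[2], w[i][1], nu)
        return {i + 1} if ok else set()
    if kind == "alt":      # r1 + r2
        return ends(r[1], w, i, nu) | ends(r[2], w, i, nu)
    if kind == "cat":      # r1 · r2, both parts under the same valuation
        return {j for m in ends(r[1], w, i, nu) for j in ends(r[2], w, m, nu)}
    if kind == "star":     # r1*, iterated until no new position is reached
        reached, frontier = {i}, {i}
        while frontier:
            new = {j for m in frontier for j in ends(r[1], w, m, nu)} - reached
            reached |= new
            frontier = new
        return reached
    if kind == "bind":     # a ↓x ·(r1): bind x to the datum just read
        _, a, x, body = r
        if i < len(w) and w[i][0] == a:
            return ends(body, w, i + 1, {**nu, x: w[i][1]})
        return set()
    raise ValueError(kind)


def member(r, w):
    """Is w derivable from (r, ∅)?"""
    return len(w) in ends(r, w, 0, {})
```

For example, with r = a ↓_{x} ·(b^{∗} · a[x^{=}]), the word ^{a}_{1} ^{b}_{7} ^{b}_{8} ^{a}_{1} is accepted, whereas ^{a}_{1} ^{b}_{7} ^{a}_{2} is not, because the last datum differs from the value bound to x.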

Proposition 3. A data word w ∈ L(r, ∅) if and only if there exists a derivation tree t such that w = w(t).

Proof. We start with the “only if” direction. Suppose that w ∈ L(r, ∅). By induction on the length of r, we can construct a derivation tree t such that w = w(t). The induction is straightforward: the induction step follows the recursive definition of REWBs, where r is either a, a[x^{=}], a[x^{≠}], r_{1} + r_{2}, r_{1} · r_{2}, r_{1}^{∗} or a ↓_{x} ·(r_{1}).

Now we prove the “if” direction. We are going to show that for every node u in t, if u is labeled with (r′, ν), then w_{u}(t) ∈ L(r′, ν), where w_{u}(t) denotes the word read on the leaves of the subtree rooted at u. This can be proved by induction on the height of the node u, which is defined as follows.

– The height of a leaf node is 0.

– The height of an internal node u is one plus the maximum of the heights of its children.

The induction is straightforward: the base case consists of the nodes of height zero, and the induction step is carried out on nodes of height h, with the induction hypothesis assumed to hold on nodes of height < h.

For a node u in a derivation tree t, the word induced by the node u is the subword made up of the leaf nodes in the subtree rooted at u. We denote this subword by w_{u}(t). If w(t) = w_{1} w_{u}(t) w_{2}, then the index pair of the node u is the pair of integers (i, j) such that i = length(w_{1}) + 1 and j = length(w_{1} w_{u}(t)).

A derivation tree t induces a binary relation R_{t} as follows.

R_{t} = {(i, j) | (i, j) is the index pair of a node u in t labeled with an expression of the form a ↓_{x} ·(r′)}.

Note that R_{t} is a partial function from the set {1, . . . , length(w(t))} to itself,
where if R_{t}(i) is defined, then i < R_{t}(i).

For a pair (i, j) ∈ R_{t}, we say that the variable x is associated with (i, j) if (i, j) is the index pair of a node u in t labeled with a label of the form a ↓_{x} ·(r′).

Two pairs (i, j) and (i′, j′), where i < j and i′ < j′, cross each other if either i < i′ < j < j′ or i′ < i < j′ < j.

Proposition 4. For any derivation tree t, the binary relation R_{t} induced by it does not contain any two pairs (i, j) and (i′, j′) that cross each other.

Proof. Suppose (i, j), (i′, j′) ∈ R_{t}. Let u and u′ be the nodes whose index pairs are (i, j) and (i′, j′), respectively. There are two cases.

– One of the nodes u and u′ is a descendant of the other.

Suppose u is a descendant of u′. Then we have i′ < i < j ≤ j′.

– Neither of the nodes u and u′ is a descendant of the other.

Suppose the node u′ is on the right side of u, that is, w_{u′}(t) is on the right side of w_{u}(t) in w(t). Then we have i < j < i′ < j′.

In either case (i, j) and (i′, j′) do not cross each other. This completes the proof of our claim.
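Proposition 4 can be checked mechanically on a concrete tree. The sketch below assumes a hypothetical encoding of derivation trees as nested tuples, in which an internal node only records whether its label is of the form a ↓_{x} ·(r′). It computes the index pairs of the binding nodes and tests whether two pairs cross, using the symmetric reading of the definition of crossing.

```python
# Hypothetical encoding of a derivation tree:
#   ("leaf", letter, datum)          a leaf carrying one position of w(t)
#   ("eps",)                         a leaf carrying the empty word
#   ("node", is_binding, children)   an internal node; is_binding is True
#                                    when the label is of the form a ↓x ·(r')

def collect_pairs(tree, start, pairs):
    """Return the length of the word below `tree`, which starts at position
    `start` (1-based), and append the index pairs of binding nodes to `pairs`."""
    kind = tree[0]
    if kind == "leaf":
        return 1
    if kind == "eps":
        return 0
    _, is_binding, children = tree
    length = 0
    for child in children:
        length += collect_pairs(child, start + length, pairs)
    if is_binding:
        pairs.append((start, start + length - 1))
    return length


def binding_pairs(tree):
    """The relation R_t, as a list of index pairs."""
    pairs = []
    collect_pairs(tree, 1, pairs)
    return pairs


def cross(p, q):
    """Do the pairs p = (i, j) and q = (i', j') cross each other?"""
    (i, j), (i2, j2) = p, q
    return i < i2 < j < j2 or i2 < i < j2 < j
```

As a usage example, a tree for a ↓_{x} ·((a ↓_{y} ·(a[y^{=}])) · a[x^{=}]) on the word ^{a}_{1} ^{a}_{2} ^{a}_{2} ^{a}_{1} yields the pairs (2, 3) and (1, 4), which are nested and hence do not cross.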

Now we are ready to prove Lemma 2.

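Proof of Lemma 2. Suppose to the contrary that there is an REWB r over Σ[x_{1}, . . . , x_{k}] such that L(r) = P, where Σ = {a}. Consider the following word w ∈ P_{m}, where m = k + 2:

$$
w = \binom{a}{d_0}\binom{a}{d_1}\binom{a}{e_0}\binom{a}{e_1}\,\underbrace{\cdots}_{v_1}\,\binom{a}{d_1}\binom{a}{d_2}\,\underbrace{\cdots}_{w_1}\,\binom{a}{e_1}\binom{a}{e_2}\,\underbrace{\cdots}_{v_2}\,\binom{a}{d_2}\binom{a}{d_3}\,\underbrace{\cdots}_{w_2}\,\binom{a}{e_2}\binom{a}{e_3}\;\cdots\cdots\;\binom{a}{e_{m-2}}\binom{a}{e_{m-1}}\,\underbrace{\cdots}_{v_{m-1}}\,\binom{a}{d_{m-1}}\binom{a}{d_m}\,\underbrace{\cdots}_{w_{m-1}}\,\binom{a}{e_{m-1}}\binom{a}{e_m}
$$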

where

– each of the data values in v_{1}, w_{1}, . . . , v_{m−1}, w_{m−1} appears exactly once in w;

– d_{0}, d_{1}, . . . , d_{m}, e_{0}, e_{1}, . . . , e_{m} are pairwise different.

Let t be the derivation tree of w. Consider the binary relation R_{t} and the
following sets A and B.

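$$
A = \Bigl\{ \mathrm{length}(w') \;\Big|\; w' \text{ is the prefix } \tbinom{a}{d_0}\tbinom{a}{d_1}\cdots\tbinom{a}{d_{l-1}}\tbinom{a}{d_l} \text{ of } w, \text{ where } 1 \le l \le m-1 \Bigr\}
$$

$$
B = \Bigl\{ \mathrm{length}(w') \;\Big|\; w' \text{ is the prefix } \tbinom{a}{d_0}\tbinom{a}{d_1}\cdots\tbinom{a}{e_{l-1}}\tbinom{a}{e_l} \text{ of } w, \text{ where } 1 \le l \le m-1 \Bigr\}
$$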

Claim. The relation R_{t} is a function on A ∪ B. That is, for every h ∈ A ∪ B, there is h′ such that (h, h′) ∈ R_{t}.

Proof. Suppose there exists h ∈ A ∪ B such that R_{t}(h) is not defined. Assume that h ∈ A. Let l be the index with 1 ≤ l ≤ m − 1 such that h = length(w′), where w′ is the prefix ^{a}_{d_{0}} ^{a}_{d_{1}} · · · ^{a}_{d_{l−1}} ^{a}_{d_{l}} of w.

If R_{t}(h) is not defined, then for any valuation ν found in the nodes of t, d_{l} ∉ Image(ν). So, the word

$$
w'' = \binom{a}{d_0}\binom{a}{d_1}\binom{a}{e_0}\binom{a}{e_1}\cdots\binom{a}{d_{l-1}}\binom{a}{f}\cdots\binom{a}{e_{l-1}}\binom{a}{e_l}\cdots\binom{a}{d_l}\binom{a}{d_{l+1}}\cdots\cdots
$$

is also in L(r), where f is a new data value. That is, the word w″ is obtained by replacing the first appearance of d_{l} with f. This contradicts the fact that P = L(r), since w″ ∉ P. The same reasoning applies when h ∈ B. This completes the proof of our claim.

Remark 1. Without loss of generality, we can assume that each variable in the REWB r is introduced only once. Otherwise, we can rename the variable.

Claim. There exist (h_{1}, h_{2}), (h′_{1}, h′_{2}) ∈ R_{t} such that h_{1} < h_{2} < h′_{1} < h′_{2}, h_{1}, h′_{1} ∈ A, and both (h_{1}, h_{2}) and (h′_{1}, h′_{2}) have the same associated variable.

Proof. The cardinality of A is |A| = k + 1, and by the claim above R_{t} is defined on every element of A. Since there are only k variables, there exist a variable x ∈ {x_{1}, . . . , x_{k}} and two distinct pairs (h_{1}, h_{2}), (h′_{1}, h′_{2}) ∈ R_{t} with h_{1}, h′_{1} ∈ A that are both associated with the variable x. By Remark 1, no variable is introduced twice in r, so the nodes u, u′ with index pairs (h_{1}, h_{2}), (h′_{1}, h′_{2}) are not descendants of each other, and we have h_{1} < h_{2} < h′_{1} < h′_{2}, or h′_{1} < h′_{2} < h_{1} < h_{2}. This completes the proof of our claim.

From the following claim we immediately get that P ≠ L(r).

Claim. There exists a word w″ such that w″ ∉ P, but w″ ∈ L(r).

Proof. The word w″ is constructed from the word w. By the previous claim, there exist (h_{1}, h_{2}), (h′_{1}, h′_{2}) ∈ R_{t} such that h_{1} < h_{2} < h′_{1} < h′_{2}, h_{1}, h′_{1} ∈ A, and both pairs have the same associated variable.

By definition of the language P, between h_{1} and h′_{1} there exists an index l ∈ B such that h_{1} < l < h′_{1}. (Recall that the set A contains the positions of the data values d_{i}, and the set B the positions of the data values e_{i}.)

Let h be the maximum of such indices. The index h is not the index of the last e, hence R_{t}(h) exists and R_{t}(h) < h_{2}, by Proposition 4. Now the data value in position R_{t}(h) is different from the data value in position h. To get w″, we replace the data value in position h with a new data value f; this does not change the acceptance of the word by the REWB r, so w″ ∈ L(r).

However, the word w″ given by

$$
w'' = \binom{a}{d_0}\binom{a}{d_1}\binom{a}{e_0}\binom{a}{e_1}\cdots\cdots\binom{a}{e_{l-1}}\binom{a}{f}\cdots\binom{a}{e_l}\binom{a}{e_{l+1}}\cdots\cdots
$$

is not in P, by definition.

Thus, this completes the proof of our claim.

This completes the proof of Lemma 2.

### Proof of Theorem 2

To prove the NP-upper bound we will need the following Proposition.

Proposition 5. For every REWB r over Σ[x_{1}, . . . , x_{k}] and every valuation ν compatible with r, if L(r, ν) ≠ ∅, then there exists a data word w ∈ L(r, ν) of length O(|r|).

Proof. The proof is by induction on the length of r. The basis is when the length of r is 1. There are two cases: a[c] and a; and it is trivial that our proposition holds.

Let r be an REWB and ν a valuation compatible with r. For the induction hypothesis, we assume that our proposition holds for all REWBs of shorter length than r. For the induction step, we prove our proposition for r. There are four cases.

– Case 1: r = r_{1} + r_{2}.

If L(r, ν) ≠ ∅, then by definition either L(r_{1}, ν) or L(r_{2}, ν) is not empty. So, by the induction hypothesis, either

• there exists w_{1} ∈ L(r_{1}, ν) such that |w_{1}| = O(|r_{1}|); or

• there exists w_{2} ∈ L(r_{2}, ν) such that |w_{2}| = O(|r_{2}|).

Thus, by definition, there exists w ∈ L(r, ν) such that |w| = O(|r|).

– Case 2: r = r_{1} · r_{2}.

If L(r, ν) ≠ ∅, then by definition L(r_{1}, ν) and L(r_{2}, ν) are not empty. So by the induction hypothesis

• there exists w_{1} ∈ L(r_{1}, ν) such that |w_{1}| = O(|r_{1}|); and

• there exists w_{2} ∈ L(r_{2}, ν) such that |w_{2}| = O(|r_{2}|).

Thus, by definition, w_{1} · w_{2} ∈ L(r, ν) and |w_{1} · w_{2}| = O(|r|).

– Case 3: r = (r_{1})^{∗}.

This case is trivial since ε ∈ L(r, ν).

– Case 4: r = a ↓_{x_{i}} (r_{1}).

If L(r, ν) ≠ ∅, then by definition L(r_{1}, ν[x_{i} ← d]) is not empty for some data value d. By the induction hypothesis, there exists w_{1} ∈ L(r_{1}, ν[x_{i} ← d]) such that |w_{1}| = O(|r_{1}|). By definition, ^{a}_{d} w_{1} ∈ L(r, ν), and its length is 1 + |w_{1}| = O(|r|).

This completes the proof of Proposition 5.
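The induction can be made explicit by tracking the length of a shortest witness. The derivation below is a sketch under two assumptions that the text does not state: |r| counts the symbols of r, so that a binary operator adds at least 1 to |r_{1}| + |r_{2}| and a unary operator or a binding adds at least 1 to |r_{1}|; and s(r, ν), a name introduced here, denotes the minimum length of a word in L(r, ν) whenever this language is nonempty.

```latex
\begin{align*}
s(a,\nu) &= s(a[c],\nu) = 1, \\
s(r_1 + r_2,\nu) &= \min\{\, s(r_i,\nu) : i \in \{1,2\},\ L(r_i,\nu) \neq \emptyset \,\}
   \le \max(|r_1|,|r_2|), \\
s(r_1 \cdot r_2,\nu) &= s(r_1,\nu) + s(r_2,\nu) \le |r_1| + |r_2|, \\
s(r_1^{*},\nu) &= 0, \\
s(a{\downarrow}_{x}(r_1),\nu) &= 1 + \min\{\, s(r_1,\nu[x \leftarrow d]) : d \in D,\ L(r_1,\nu[x \leftarrow d]) \neq \emptyset \,\}
   \le 1 + |r_1|.
\end{align*}
```

In each case the right-hand side is at most |r|, so under these assumptions the bound of Proposition 5 holds with constant 1.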

The NP membership follows from Proposition 5: given an REWB r, we simply guess a data word w ∈ L(r) of length O(|r|). The verification that w ∈ L(r) can be done deterministically in polynomial time.

Note that the data values here can be made small as well. It is well known that in a word accepted by a register automaton one can replace the data values with the ones from the set {1, . . . , k + 1}, where k is the number of registers [18, 22], while retaining the acceptance condition. Thus we can always assume that the values appearing in our word are not bigger than the number of variables in our expression plus one.

We prove NP-hardness via a reduction from 3-SAT.

Assume that ϕ = (ℓ_{1,1} ∨ ℓ_{1,2} ∨ ℓ_{1,3}) ∧ · · · ∧ (ℓ_{n,1} ∨ ℓ_{n,2} ∨ ℓ_{n,3}) is the given 3-CNF formula, where each ℓ_{i,j} is a literal. Let x_{1}, . . . , x_{k} denote the variables occurring in ϕ. We say that the literal ℓ_{i,j} is negative if it is a negation of a variable. Otherwise, we call it a positive literal.

We will define an REWB r over Σ[y_{1}, z_{1}, y_{2}, z_{2}, . . . , y_{k}, z_{k}] of length O(n) such that ϕ is satisfiable if and only if L(r) ≠ ∅.

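Let r be the following REWB.

$$
r := a{\downarrow}_{y_1}\bigl(a{\downarrow}_{z_1}\bigl(a{\downarrow}_{y_2}\bigl(a{\downarrow}_{z_2}\bigl(\cdots\bigl(a{\downarrow}_{y_k}\bigl(a{\downarrow}_{z_k}\bigl((r_{1,1}+r_{1,2}+r_{1,3})\cdots(r_{n,1}+r_{n,2}+r_{n,3})\bigr)\bigr)\cdots\bigr)\bigr)\bigr)\bigr)\bigr),
$$

where

$$
r_{i,j} := \begin{cases}
b[y_k^{=} \wedge z_k^{=}] & \text{if } \ell_{i,j} = x_k, \\
b[y_k^{=} \wedge z_k^{\neq}] + b[z_k^{=} \wedge y_k^{\neq}] & \text{if } \ell_{i,j} = \neg x_k.
\end{cases}
$$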

Obviously, |r| = O(n). We are going to prove that ϕ is satisfiable if and only if L(r) ≠ ∅.
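The construction is purely syntactic and can be sketched as a small generator. The code below assumes a hypothetical input format in which a clause is a triple of nonzero integers (i stands for x_{i} and −i for ¬x_{i}), and it prints the expression in an ad hoc concrete syntax chosen here (`&` for ∧, `!=` for the ≠ superscript); neither is taken from the text.

```python
def literal_gadget(lit):
    """The expression r_{i,j} for one literal, in an ad hoc concrete syntax."""
    i = abs(lit)
    if lit > 0:
        return f"b[y{i}= & z{i}=]"
    return f"b[y{i}= & z{i}!=] + b[z{i}= & y{i}!=]"


def build_rewb(clauses, k):
    """Build the expression r for a 3-CNF formula over the variables x_1..x_k."""
    # One binder for y_i and one for z_i per variable; each opens a parenthesis.
    prefix = "".join(f"a↓y{i}(a↓z{i}(" for i in range(1, k + 1))
    # One parenthesised sum per clause; clauses are concatenated.
    body = "".join(
        "(" + " + ".join(literal_gadget(lit) for lit in clause) + ")"
        for clause in clauses
    )
    return prefix + body + ")" * (2 * k)
```

The output has 2k binders followed by one constant-size block per literal, which matches the linear size bound since k ≤ 3n.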

Assume first that ϕ is satisfiable. Then there is an assignment f : {x_{1}, . . . , x_{k}} → {0, 1} making ϕ true. We define the valuation ν : {y_{1}, z_{1}, . . . , y_{k}, z_{k}} → {0, 1} as follows.

– If f(x_{i}) = 1, then ν(y_{i}) = ν(z_{i}) = 1.

– If f(x_{i}) = 0, then ν(y_{i}) = 0 and ν(z_{i}) = 1.

We define the following data word.

$$
w := \binom{a}{\nu(y_1)}\binom{a}{\nu(z_1)}\cdots\binom{a}{\nu(y_k)}\binom{a}{\nu(z_k)}\,\underbrace{\binom{b}{1}\cdots\binom{b}{1}}_{n \text{ times}}
$$

To see that w ∈ L(r), we observe that the first 2k letters are parsed to bind the variables y_{1}, z_{1}, . . . , y_{k}, z_{k} to the corresponding values determined by ν. To parse the remaining ^{b}_{1} · · · ^{b}_{1}, we observe that for each i ∈ {1, . . . , n}, the clause ℓ_{i,1} ∨ ℓ_{i,2} ∨ ℓ_{i,3} is true according to the assignment f if and only if ^{b}_{1} ∈ L(r_{i,1} + r_{i,2} + r_{i,3}, ν).

Conversely, assume that L(r) ≠ ∅. Let

$$
w = \binom{a}{d_{y_1}}\binom{a}{d_{z_1}}\cdots\binom{a}{d_{y_k}}\binom{a}{d_{z_k}}\binom{b}{d_1}\cdots\binom{b}{d_n} \in L(r).
$$

We define the following assignment f : {x_{1}, . . . , x_{k}} → {0, 1}.

$$
f(x_i) = \begin{cases}
1 & \text{if } d_{y_i} = d_{z_i}, \\
0 & \text{if } d_{y_i} \neq d_{z_i}.
\end{cases}
$$

We are going to show that f is a satisfying assignment for ϕ. Since w ∈ L(r), we have

$$
\binom{b}{d_1}\cdots\binom{b}{d_n} \in L\bigl((r_{1,1}+r_{1,2}+r_{1,3})\cdots(r_{n,1}+r_{n,2}+r_{n,3}),\ \nu\bigr),
$$

where ν(y_{i}) = d_{y_{i}} and ν(z_{i}) = d_{z_{i}}. In particular, we have for every j = 1, . . . , n,

$$
\binom{b}{d_j} \in L(r_{j,1}+r_{j,2}+r_{j,3},\ \nu).
$$

W.l.o.g., assume that ^{b}_{d_{j}} ∈ L(r_{j,1}, ν). There are two cases.

– If r_{j,1} = b[y_{i}^{=} ∧ z_{i}^{=}], then by definition ℓ_{j,1} = x_{i}, and d_{y_{i}} = d_{j} = d_{z_{i}}, so f(x_{i}) = 1; hence the clause ℓ_{j,1} ∨ ℓ_{j,2} ∨ ℓ_{j,3} is true under the assignment f.

– If r_{j,1} = b[y_{i}^{=} ∧ z_{i}^{≠}] + b[z_{i}^{=} ∧ y_{i}^{≠}], then by definition ℓ_{j,1} = ¬x_{i}, and d_{y_{i}} ≠ d_{z_{i}}, so f(x_{i}) = 0; hence the clause ℓ_{j,1} ∨ ℓ_{j,2} ∨ ℓ_{j,3} is true under the assignment f.

Thus, the assignment f is a satisfying assignment for the formula ϕ. This completes the proof of our theorem.
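Both directions of the reduction can be checked by brute force on small formulas. The sketch below makes several assumptions of its own: a clause is a triple of nonzero integers (i for x_{i}, −i for ¬x_{i}), a data word is given only by its list of data values (the first 2k letters are taken to be a and the remaining n letters b), and the function names are hypothetical. The function `accepted` evaluates the clause gadgets directly instead of running a general REWB matcher.

```python
from itertools import product


def gadget_accepts(lit, d, dy, dz):
    """Is the letter b with datum d accepted by the gadget of literal `lit`,
    when y_i is bound to dy[i] and z_i is bound to dz[i]?"""
    i = abs(lit)
    if lit > 0:                      # b[y_i^= ∧ z_i^=]
        return d == dy[i] and d == dz[i]
    # b[y_i^= ∧ z_i^≠] + b[z_i^= ∧ y_i^≠]
    return (d == dy[i] and d != dz[i]) or (d == dz[i] and d != dy[i])


def accepted(data, clauses, k):
    """Membership of the word with data values `data` in L(r)."""
    dy = {i: data[2 * i - 2] for i in range(1, k + 1)}
    dz = {i: data[2 * i - 1] for i in range(1, k + 1)}
    tail = data[2 * k:]
    return len(tail) == len(clauses) and all(
        any(gadget_accepts(lit, d, dy, dz) for lit in clause)
        for clause, d in zip(clauses, tail)
    )


def word_from_assignment(f, n):
    """Data values of the word w built from an assignment f (dict i -> 0/1)."""
    data = []
    for i in sorted(f):
        data += [f[i], 1]            # ν(y_i) = f(x_i), ν(z_i) = 1
    return data + [1] * n


def assignment_from_word(data, k):
    """The assignment with f(x_i) = 1 iff the values bound to y_i, z_i agree."""
    return {i: int(data[2 * i - 2] == data[2 * i - 1]) for i in range(1, k + 1)}


def satisfies(f, clauses):
    return all(any(f[abs(lit)] == (1 if lit > 0 else 0) for lit in clause)
               for clause in clauses)
```

On a small formula one can enumerate all assignments with `itertools.product` and confirm that `accepted(word_from_assignment(f, n), clauses, k)` agrees with `satisfies(f, clauses)`, and that every accepted value sequence over a three-element domain yields a satisfying assignment via `assignment_from_word`.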

### Proof of Proposition 1

First we consider the case of simple REWBs.

The proof is by induction on the length of r. The basis is when the length of r is 1. There are three cases: a[x_{i}^{=}], a[x_{i}^{≠}] and a; and it is trivial that our proposition holds.

Let r be an REWB and ν a valuation compatible with r. For the induction hypothesis, we assume that our proposition holds for all REWBs of shorter length than r. For the induction step, we prove our proposition for r. There are four cases.

– Case 1: r = r_{1} + r_{2}.

By the induction hypothesis, both L(r_{1}, ν) and L(r_{2}, ν) are not empty, thus, by definition, L(r, ν) is also not empty.

– Case 2: r = r_{1} · r_{2}.

By the induction hypothesis, both L(r_{1}, ν) and L(r_{2}, ν) are not empty, thus, by definition, L(r, ν) is also not empty.

– Case 3: r = (r_{1})^{∗}.

This case is trivial, since ε ∈ L(r, ν), thus, L(r, ν) ≠ ∅.

– Case 4: r = a ↓_{x_{i}} (r_{1}).

By the induction hypothesis, L(r_{1}, ν[x_{i} ← d]) is not empty for an arbitrary data value d. Thus, by definition, L(r, ν) is also not empty.
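The induction for simple REWBs is constructive, and the witness it produces can be computed directly. The sketch below assumes a hypothetical tuple encoding of simple REWBs (tags `"let"`, `"eq"`, `"neq"`, `"alt"`, `"cat"`, `"star"`, `"bind"`), data values taken from the natural numbers, and a valuation `nu` that is defined on every variable used before it is bound. The choices made in each case (the left summand, the datum 0 for unconstrained letters) are arbitrary.

```python
def fresh(nu):
    """Smallest natural number that is not in the image of nu."""
    used = set(nu.values())
    d = 0
    while d in used:
        d += 1
    return d


def witness(r, nu):
    """Return one data word in L(r, nu) as a list of (letter, datum) pairs."""
    kind = r[0]
    if kind == "let":                 # a: any datum works
        return [(r[1], 0)]
    if kind == "eq":                  # a[x^=]: reuse the value of x
        return [(r[1], nu[r[2]])]
    if kind == "neq":                 # a[x^≠]: a value outside the image of nu
        return [(r[1], fresh(nu))]
    if kind == "alt":                 # r1 + r2: either summand works
        return witness(r[1], nu)
    if kind == "cat":                 # r1 · r2: concatenate the two witnesses
        return witness(r[1], nu) + witness(r[2], nu)
    if kind == "star":                # r1*: the empty word
        return []
    if kind == "bind":                # a ↓x (r1): bind x to an arbitrary datum
        _, a, x, body = r
        return [(a, 0)] + witness(body, {**nu, x: 0})
    raise ValueError(kind)
```

For the expression a ↓_{x} (b[x^{≠}] · c[x^{=}]) and the empty valuation, this produces the word ^{a}_{0} ^{b}_{1} ^{c}_{0}.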

Next we prove the claim for positive REWBs.

Namely, we show that if, for a data value d ∈ D, we define the valuation ν_{d} by ν_{d}(x) := d for every variable x of the expression, then L(r, ν_{d}) ≠ ∅.

The proof is by induction on the length of r. The basis is when the length of r is 1. There are two cases: a[c] and a; and it is trivial that our proposition holds.

Let r be a positive REWB. For the induction hypothesis, we assume that our proposition holds for all REWBs of shorter length than r. For the induction step, we prove our proposition for r. There are four cases.

– Case 1: r = r_{1} + r_{2}.

By the induction hypothesis, L(r_{1}, ν_{d}) and L(r_{2}, ν_{d}) are nonempty, thus, by definition, L(r, ν_{d}) is also not empty.

– Case 2: r = r_{1} · r_{2}.

By the induction hypothesis, both L(r_{1}, ν_{d}) and L(r_{2}, ν_{d}) are not empty, thus, by definition, L(r, ν_{d}) is also not empty.