Efficient algorithms for regular expression constrained sequence alignment

(1)

Efficient algorithms for regular expression constrained

sequence alignment

✩

Yun-Sheng Chung

a

_{, Chin Lung Lu}

b

_{, Chuan Yi Tang}

a,∗

a_{Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC} b_{Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan 300, ROC}

Received 16 October 2006; received in revised form 18 March 2007; accepted 23 April 2007 Available online 29 April 2007

Communicated by F.Y.L. Chin

Abstract

Imposing constraints is an effective means to incorporate biological knowledge into alignment procedures. As in the PROSITE database, functional sites of proteins can be effectively described as regular expressions. In an alignment of protein sequences it is natural to expect that functional motifs should be aligned together. Due to this motivation, Arslan introduced the regular expression constrained sequence alignment problem and proposed an algorithm which, if implemented naïvely, can take time and space up to O(|Σ|2|V |4n2)and O(|Σ|2|V |4n), respectively, where Σ is the alphabet, n is the sequence length, and V is the set of states in an automaton equivalent to the input regular expression. In this paper we propose a more efficient algorithm solving this problem which takes O(|V |3n2)time and O(|V |2n)space in the worst case. If|V | = O(log n) we propose another algorithm with time complexity O(|V |2log|V |n2). The time complexity of our algorithms is independent of Σ, which is desirable in protein applications where the formulation of this problem originates; a factor of|Σ|2= 400 in the time complexity of the previously proposed algorithm would significantly affect the efficiency in practice.

Keywords: Constrained sequence alignment; Algorithms; Regular expression; Dynamic programming

1. Introduction

Sequence alignment is one of the most fundamen-tal problems in computational biology. Due to its im-portance, it has been extensively studied in the past (see, e.g., [8]). As biological knowledge and predictions

✩ _{A preliminary version of this work was presented in the 17th}

An-nual Symposium on Combinatorial Pattern Matching, CPM 2006.

* _{Corresponding author.}

E-mail addresses: [email protected] (Y.-S. Chung), [email protected] (C.L. Lu), [email protected] (C.Y. Tang).

grow, it is often desirable if one can incorporate more in-formation into the alignment procedure in hope that the alignment result can be more biologically meaningful and reasonable. In particular, when the input sequences share some properties, one then expects that the re-sulting alignment should not violate these properties. Such kind of preservation is in its nature a satisfac-tion of properly defined constraints. Due to this need, Tang et al. [11] defined the constrained sequence align-ment problem (CSA, for short). They considered the alignment of RNase sequences which share a conserved sequence of residues H, K, H. It is expected that the

(2)

Fig. 1. Left: automaton A, equivalent to R= a. Right: weighted automaton M, constructed by A × A.

alignment result should have these conserved residues aligned together. A constraint is then defined in [11] as a sequence of characters. The problem requires that the alignment result must contain a sequence of columns, with length the same as the constraint, such that each character in the constraint appears in exactly one of these columns and that each column consists solely of a character. The goal is to find the best such align-ment.

Later, Chin et al. [3] proposed an efficient algorithm for the pairwise version of CSA, along with a 2-ap-proximation algorithm for the multiple alignment ver-sion with scoring function satisfying triangle inequality. Tsai et al. [13] generalized the definition of a constraint from a sequence of characters to a sequence of strings (patterns), and the occurrences of each pattern in the se-quences need not be identical to the pattern specified in the input. Instead, the Hamming distances between each pattern and its occurrences need only to be within a user-specified threshold. This formulation enables the user to align sequences so that some known motifs are required to be aligned together. Lu and Huang [9] then reduced the memory requirement of the algorithm in [13], which significantly improves the applicability of the tool. The constrained longest common subse-quence problem, a special case of CSA, was studied in [12,4].

As indicated by Arslan [2], it is well known that pro-teins with similar functions often share some motifs. The PROSITE database [7] collects biologically signif-icant sites and motifs of proteins. In particular, these motifs can be conveniently represented as regular ex-pressions. Hence an alignment tool able to incorporate patterns in regular expressions would be useful. To ful-fill this need, Arslan introduced the regular expression constrained sequence alignment problem (RECSA, for short) [2]. A feasible solution of RECSA is an alignment containing a run of contiguous columns such that both of the two substrings corresponding to these columns match the regular expression. The following example, constructed by Arslan [2], clearly illustrates the

defini-tion and its difference with a standard alignment without constraint: T G F P S V G K T K D D - - - - A | | | | | | | | T - F - S V A - - K D D D G K S A T - - - G F P S V G K T K D D A | | T F S V A K D D D G K S - - - A ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗

In all examples given in this section, a match of identical symbols is scored 1 and all other cases are scored 0. The upper alignment shown above is an op-timal alignment without constraint, while the lower one is an optimal constrained alignment. The con-straint R is (G+ A)GK(S + T), the P-loop mo-tif. The starred columns “support” the satisfaction of constraint R: both GFPSVGKT and AKDDDGKS match R. For more descriptions about biological applications of RECSA, the reader is referred to Arslan’s paper [2].

There are well-established algorithms to convert R into an ε-free NFA A (see, e.g., [6]). In [2], R is con-verted into A first. Then a weighted finite automaton M is constructed from A to accept all alignments satisfy-ing R. Automaton M is essentially a product machine of A. Each state in M corresponds to a pair of states in A. The alphabet of M corresponds to edit operations; an edit operation is represented as a pair of symbols in Σ₋= Σ ∪ {−}, excluding (−, −). The transitions in M and their relationship with those in A are illus-trated in Fig. 1. Given a sequence of edit operations, or equivalently an alignment, as input to M, some states in M can be reached from the initial state (q0, q0). Con-sider the example in Fig. 1: given[t−_ta], states (q0, q1) as well as (q0, q0)are reached, while given[t−ctat], no

state other than (q0, q0)is reached. In particular, given a feasible constrained alignment satisfying R, for ex-ample,[ta_tac−], state (q1, q1), the final state of M in the example, is reached; the self loop on the final state is added for this purpose. As in a standard dynamic pro-gramming alignment algorithm, Arslan’s algorithm it-erates over pairs (i1, i2)of indices of the input strings.

(3)

On each (i1, i2), a corresponding Mi1,i2, an automaton with the same transitions as those in M, is constructed. Each state (p, q) of Mi1,i2 contains the score of an opti-mal alignment of S1[1..i1] and S2[1..i2] such that (p, q) can be reached if the alignment is given to Mi1,i2 as input; such (p, q) is called active in [2]. If no such alignment exists for a state (p, q) of Mi1,i2, its corre-sponding score is set to−∞ and it is not active. Hence the optimal score of aligning S1and S2can be found by taking the maximum of the scores stored in all the fi-nal states in M_|S₁_|,|S₂_|. Given an additional input symbol (a, b)∈ Σ₋2 \ {(−, −)}, the active states in Mi1,i2 may change, and the scores are updated by adding the score of the current input symbol (edit operation). The result-ing weighted automaton is denoted by M_i(a,b)

1,i2 . The score updating rule in Arslan’s algorithm is

Mi1,i2= max M(S1[i1],−) i1−1,i2 , M (−,S2[i2]) i1,i2−1 , M (S1[i1],S2[i2]) i1−1,i2−1 , where the maximum operation applied on the three weighted automata yields a weighted automaton with each state scored by the maximum score from the cor-responding states in the three automata.

To update the active states and the scores, all transi-tions in M are examined by the algorithm in [2]. The time complexity stated in [2] is O(rn2), where r is the size of the transition function of M. Let V be the set of states in A. There can be up to|Σ| transitions from a state of A to another. Hence the number of transitions in A can be up to O(|Σ||V |2_{), and that in M can be} up to O(|Σ|2_{|V |}4₎_{since M is made by a product of A.} The time complexity of the algorithm in [2] can then be O(|Σ|2|V |4n2), if implemented naïvely. In addition, O(n) copies of M are maintained throughout in [2], and the space complexity stated in [2] is O(rn), which can be up to O(|Σ|2_{|V |}4_{n), if implemented naïvely. In [2] it} is not mentioned how to reconstruct the optimal align-ment. The stated space complexity is for the computa-tion of the optimal score. Here we make the dependence on the alphabet size explicit since in protein applications for which RECSA is originally formulated,|Σ|2is 400 which would significantly affect the efficiency.

The critical information in M is the scores on the states; storing the entire M is unnecessary. In this paper, we directly use a matrix to store scores. The matrix is implemented using a simple array and can be easily ma-nipulated. Time and space complexity of our algorithm are, respectively, bounded by O(|V |3_n2₎_{and O(|V |}2_n) in the worst case; we also mention how the optimal alignment can be reconstructed within this space bound. When|V | = O(log n), which is not remote since in gen-eral R is much shorter than the input sequences [10], we propose another algorithm to take advantage of

op-erations on O(log n)-length data, which takes constant time in the standard unit-cost RAM model. The result-ing time complexity is O(|V |2_log_{|V | n}2₎_{in this case.} Although M is not used in our algorithm, it serves as a conceptual framework facilitating our presentation.

2. Preliminaries

For convenience throughout this paper we assume that |S1| = |S2| = n. An edit operation can be defined to be a symbol in Σ−× Σ−\ {(−, −)}, where Σ−= Σ∪ {−}. An edit operation is written as either a pair (a, b) or a column vector[a_b] in this paper. Define ρ : Σ₋∗ → Σ∗ to be the “removing spaces” operator, e.g., ρ(a−tc−g) = atcg. A sequence A = [a1

b₁ · · · am

bm] of edit operations is called an alignment of S1 and S2 if ρ(a1· · · am)= S1 and ρ(b1· · · bm)= S2. Let γ be a real-valued scoring function defined on sequences of edit operations satisfying

γ a1 b1· · · am bm = ⎧ ⎪ ⎨ ⎪ ⎩ 0 if m= 0, i.e., if it is an empty alignment; γ ([a1 b1· · · am−1 bm−1]) + γ ([ am bm]) otherwise.

Let R be a regular expression over alphabet Σ . The goal of RECSA is to find a maximum-scored alignment A = [a1

b1· · ·

am

bm] of S1 and S2 such that there exist 1 and 2with both ρ(a1· · · a2)and ρ(b1· · · b2) match-ing R.

Let A= (Q, Σ, δ, q0, F ) be an ε-free NFA equiv-alent to R, which can be constructed manually or by any established algorithm. Then δ(p, ε)= δ(p, −) = {p} for all p ∈ Q. We also define δ(Q_{, a)} _{to be}

p∈Qδ(p, a), where Q⊆ Q and a ∈ Σ. In this pa-per we use Q and V interchangeably. Following the notations in [2], M = (QM, WM, ΣM, δM, q₀M, FM), where QM = Q × Q, WM: QM → R, ΣM = Σ₋× Σ₋\ {(−, −)}, q₀M= (q0, q0)and FM= F × F . Tran-sition function δM is defined as

δM(p, q), (a, b)= ⎧ ⎪ ⎨ ⎪ ⎩ δ(p, a)× δ(q, b) if (p, q) /∈ FM∪ {(q0, q0)}; δ(p, a)× δ(q, b) ∪ {(p, q)} otherwise.

Function δM can be naturally extended to be defined on the Cartesian product of QM and the set of all possible alignments. A state (p, q) of Mi1,i2is active iff there ex-ists some alignment A of S1[1..i1] and S2[1..i2] such that (p, q) ∈ δM(q₀M,A). Each state (p, q) in Mi1,i2 has a score Wi1,i2(p, q) assigned to it. If (p, q) is ac-tive, then Wi1,i2(p, q)is the score of an optimal align-ment A of S1[1..i1] and S2[1..i2] such that (p, q) ∈

(4)

δM(q₀M,A). Otherwise no such alignment exists and Wi1,i2(p, q)= −∞.

3. The algorithms

In this section we present two algorithms, the first for the general case and the second for the case that|V | = O(log n).

3.1. The general case

Recall that automaton Mi1,i2 accepts alignments of S1[1..i1] and S2[1..i2] satisfying the regular ex-pression constraint. Also recall that once we have Wn,n(p, q)for all p, q∈ Q, the optimal score of align-ing S1and S2with the constraint satisfied can be eas-ily found by max(p,q)∈F ×FWn,n(p, q). Hence if we can compute all Wi1,i2 then the problem is solved. There is a simple relationship among Wi1,i2, Wi1−1,i2, Wi1,i2−1 and Wi1−1,i2−1which allows one to compute Wi1,i2 based on the other three. Define W

(a,b)

i1,i2 (p, q)= max_(p,q): (p,q)∈δM_((p,q),(a,b))Wi1,i2(p, q)+ γ (a, b). Then for i1, i2 1, the relationship can be stated as fol-lows: Wi1,i2(p, q) = maxW(S1[i1],−) i1−1,i2 (p, q), W (−,S2[i2]) i1,i2−1 (p, q), W(S1[i1],S2[i2]) i1−1,i2−1 (p, q) . (1)

The algorithm in [2] is based on (1). On each (i1, i2), weighted automaton Mi1,i2, whose transitions are the same as those in M, is constructed such that state (p, q) in Mi1,i2 carries the score Wi1,i2(p, q). Given Mi1−1,i2, M

(S1[i1],-)

i1−1,i2 can be computed, which is also a weighted automaton where state (p, q) carries the score W(S1[i1],−) i1−1,i2 (p, q); M (−,S2[i2]) i1,i2−1 and M (S1[i1],S2[i2]) i1−1,i2−1 are similarly defined. Then Mi1,i2 is computed as max{M(S1[i1],−) i1−1,i2 , M (_−,S2[i2]) i1,i2−1 , M (S1[i1],S2[i2]) i1−1,i2−1 }, which is a weighted automaton where state (p, q) has score as defined in the right-hand side of (1). To compute M(S1[i1],−)

i1−1,i2 from Mi1−1,i2, in [2] each transition of Mi1−1,i2 is examined a constant number of times so that for each (p, q) the score W(S1[i1],−)

i1−1,i2 (p, q) can be computed. Automata M(−,S2[i2])

i1,i2−1 and M

(S1[i1],S2[i2]) i1−1,i2−1 are computed in the same way, and Mi1,i2 can then be com-puted easily.

Let E be the set of “label-free” arcs in A, i.e., (q, q)∈ E iff q ∈ δ(q, a)for some a∈ Σ. In partic-ular, there can be at most one “label-free” arc from state qto q. Clearly, there can be up to O(|Σ||E|) transitions in A and O(|Σ|2|E|2)transitions in M.

It is not necessary to maintain the whole machine structure of M on each (i1, i2). The relevant informa-tion is the scores Wi1,i2(p, q). On each (i1, i2)we use table Li1,i2 to store Wi1,i2. In addition, to handle the transitions we need not use δM directly; the use of δ is sufficient. We construct a table T to represent δ: T[q, a; q] = 1 if q ∈ δ(q, a), and T[q, a; q] = 0 oth-erwise. This construction is easy, taking O(|Σ||V |2₎ time and space if automaton A is originally represented as an adjacency list. The computation of Li1,i2[p, q] = Wi1,i2(p, q)can then be improved, and is detailed as fol-lows.

For (i1, i2) fixed, first consider the computation of W(S1[i1],S2[i2])

i1−1,i2−1 (p, q). In terms of δ instead of δ M_, W(S1[i1],S2[i2])

i1−1,i2−1 (p, q)can be written as given in formula (2) below. Hence the use of δ suffices.

Assume that Li1−1,i2−1[p, q] has been correctly computed for all p, q ∈ V . Then to compute L(S1[i1],S2[i2])

i1−1,i2−1 [p, q] for all p, q ∈ V , expected to be equal to W(S1[i1],S2[i2])

i1−1,i2−1 (p, q)and initialized to be−∞, we proceed as follows. Let L be a temporary table with all entries L[p, q] initialized to be −∞.

1: for p, p∈ V with T [p, S1[i1]; p] = 1 do

2: for q∈ V do

3: L[p, q]

← max{L[p, q], Li1−1,i2−1[p, q] + γ (S1[i1],

S2[i2])};

4: for q, q∈ V with T [q, S2[i2]; q] = 1 do

5: for p∈ V do 6: L(S1[i1],S2[i2]) i1−1,i2−1 [p, q] ← max{L(S1[i1],S2[i2]) i1−1,i2−1 [p, q], L[p, q _]}; 7: for p, q∈ V with (p, q) ∈ (F × F ) ∪ {(q0, q0)} do 8: L(S1[i1],S2[i2]) i1−1,i2−1 [p, q] ← max{L(S1[i1],S2[i2]) i1−1,i2−1 [p, q],

Li1−1,i2−1[p, q] + γ (S1[i1], S2[i2])}; In the first part (lines 1–3) L[p, q_{] is computed to be} maxp: p_∈δ(p,S1[i1])Li1−1,i2−1[p, q]+γ (S1[i1],S2[i2]). In the second part (lines 4–6) L(S1[i1],S2[i2])

i1−1,i2−1 [p, q] is ⎧ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ maxp,q: p∈δ(p,S₁[i1]),q∈δ(q,S2[i2])Wi1−1,i2−1(p, q)+ γ (S1[i1], S2[i2]), if (p, q) /∈ FM∪ {(q0, q0)}; maxp,q: p∈δ(p,S1[i1]),q∈δ(q,S2[i2])or(p,q)=(p,q)Wi1−1,i2−1(p, q)+ γ (S1[i1], S2[i2]), otherwise. (2)

(5)

computed to be maxq: q∈δ(q,S2[i2])L[p, q], which in turn is maxp: p_∈δ(p,S1[i1]),q: q∈δ(q,S2[i2])Li1−1,i2−1[p, q] + γ (S1[i1], S2[i2]), as desired. The last two lines deal with the self loops on the final and initial states in M.

For p∈ V , let deg(p) be the in degree of p in the graph (V , E), where E is the set of label-free arcs ob-tained from A. Then the time taken by the above code segment is O( _p_∈Vdeg(p)×|V |), which is O(|E||V |). Basically, the two-step approach makes the algorithm presented here faster than the one in [2].

The computation for each of L(S1[i1],−) i1−1,i2 and L(−,S2[i2])

i1,i2−1 is easier (e.g., to compute L (S1[i1],−)

i1−1,i2 , we can remove lines 4–6 in the above code segment, replace γ (S1[i1], S2[i2]) with γ (S1[i1], −), etc.), which also takes O(|E||V |) time.

Now we consider the boundary cases when i1= 0 or i2= 0. It is clear that W0,0(q0, q0)= 0 and that W0,0(p, q)= −∞ for (p, q) = q₀M since the only align-ment of S1[1..0] = ε and S2[1..0] = ε is an empty align-ment, with a score of 0 by the definition of γ , and from q₀M we can only reach q₀Mgiven an empty alignment as input to M. In addition, it can be seen that

Wi1,i2(p, q)= ⎧ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ W(S1[i1],−) i1−1,0 (p, q) if i1>0 and i2= 0, W(−,S2[i2]) 0,i2−1 (p, q) if i2>0 and i1= 0.

In [2], the weights for all the states (p, q)= (q0, q0)in Mi1,0for i1>0 or in M0,i2for i2>0 are all set to−∞, which, without appropriate assumptions on γ (e.g., tri-angle inequality; no such assumption is made in [2]), would lead to a failure of considering possible optimal constrained alignments like[_{− t}c −] where R = c + t, S1= c and S2= t. Certainly, such a problem may be fixed by initializing weights in the manner stated above.

Now that L(S1[i1],S2[i2]) i1−1,i2−1 , L (S1[i1],−) i1−1,i2 and L (_−,S2[i2]) i1,i2−1 can be computed, by (1) it is easy to compute Li1,i2 in O(|V |2₎ _{time for a fixed (i}

1, i2). Graph (V , E) is connected if considered undirected, and hence |V | = O(|E|). Therefore the total time of our algorithm is O(|V ||E|n2_).

To reconstruct the optimal constrained alignment, first recall that in an optimal constrained alignmentA = [a1

b1· · ·

am

bm] of S1and S2, there exist 1and 2such that δ(q0, ρ(a1· · · a2))∩ F = ∅ and δ(q0, ρ(b1· · · b2))∩ F = ∅ (recall that ρ(·) is the “removing spaces” opera-tion). We may regardA as composed of three parts: an optimal unconstrained alignment of ρ(a1· · · a1−1)and ρ(b1· · · b1−1), that of ρ(a1· · · a2)and ρ(b1· · · b2) and that of ρ(a2+1· · · am) and ρ(b2+1· · · bm). Note

that, an optimal unconstrained alignment of ρ(a1· · · a2) and ρ(b1· · · b2)automatically satisfies the constraint, since both strings match the regular expression. Let j1= |ρ(a1· · · a1−1)|, j1 = |ρ(a1· · · a2)|, j2= |ρ(b1 · · · b1−1)| and j2= |ρ(b1· · · b2)|. Clearly if we know j1, j1, j2and j2 then the constrained alignmentA can be constructed in O(n) space by using Hirschberg’s celebrated divide-and-conquer algorithm [5] to align S1[1..j1] and S2[1..j2], align S1[j1+ 1..j₁] and S2[j2+ 1..j₂], and align S1[j₁ + 1..n] and S2[j₂+ 1..n]. For this purpose we compute matrices ηi1,i2

1 and η

i1,i2 2 for each (i1, i2). Let A = [ab1₁· · ·

am

bm] be the alignment im-plied by Li1,i2[p, q]. If (p, q) ∈ F × F , then η

i1,i2 1 [p, q] stores (j1, j2)and η₂i1,i2[p, q] stores (j₁, j₂)such that p∈ δ(q0, S1[j1+ 1..j1]) and q ∈ δ(q0, S2[j2+ 1..j2]), where S1[j1+ 1..j1] = ρ(a1· · · a2)and S2[j2+ 1..j2] = ρ(b1· · · b2)for some 1and 2. If (p, q) /∈ F × F , then ηi1,i2

1 [p, q] stores (j1, j2) such that p∈ δ(q0, S1[j1 + 1..i1]) and q ∈ δ(q0, S2[j1 + 1..i2]), where S1[j1 + 1..i1] = ρ(a1· · · am) and S2[j2 + 1..i2] = ρ(b1· · · bm) for some 1; η

i1,i2

2 [p, q] simply stores (i1, i2). Note that when computing Li1,i2 we can de-termine the value of (am, bm); for example, (am, bm)= (S1[i1], −) if Li1,i2[p, q] = L

(S1[i1],−) i1−1,i2 [p, q].

To see how to compute these values we assume that (am, bm)= (S1[i1], −); the other two cases are similar. Let (p, q)∈ δ(q₀M,[a1

b1 · · ·

am−1

bm−1]) be such that L(S1[i1],−)

i1−1,i2 [p, q] = Li1−1,i2[p, q] + γ (S1[i1], −) and that (p, q)∈ δM((p, q), (S1[i1], −)). Such (p, q)can be determined during the computation of L(S1[i1],−)

i1−1,i2 [p, q]. Consider first that (p, q) /∈ F × F . Then ηi1,i2

2 [p, q] = (i1, i2). If (p, q) = (q0, q0), then clearly we can set ηi1,i2

1 [p, q] = (i1− 1, i2). If (p, q)= (q0, q0) then we can set ηi1,i2

1 [p, q] = η i1−1,i2

1 [p, q] since if (j1, j2)= ηi₁1−1,i2[p, q] then p∈ δ(q0, S1[j1+ 1..i1 −1]) and q ∈ δ(q0, S2[j2+1..i2]) where S1[j1+1..i1− 1] = ρ(a1· · · am−1) and S2[j2 + 1..i2] = ρ(b1· · · bm−1) for some 1. Since (p, q) /∈ F × F , and ei-ther p = q0 or p= p, we have (p, q)∈ δM((p, q), (S1[i1], −)) iff p ∈ δ(p, S1[i1]). Hence p ∈ δ(q0, S1[j1+1..i1]) and q ∈ δ(q0, S2[j2+1..i2]) with S1[j1+ 1..i1] = ρ(a1· · · am)and S2[j2+1..i2] = ρ(b1· · · bm). Now consider (p, q)∈ F × F . When (p, q) = (p, q), then we simply set ηi1,i2

1 [p, q] = η i1−1,i2 1 [p, q] and ηi1,i2 2 [p, q] = η i1−1,i2 2 [p, q]. When (p, q) = (p, q), then ηi1,i2 1 [p, q] and η i1,i2

2 [p, q] are set as in the case where (p, q) /∈ F × F .

(6)

1: for q∈ V do

2: Compute array X such that Li1−1,i2[X[j], q] Li1−1,i2[X[j + 1], q]; /* by sorting */ 3: Initialize bit vector Y of length|V | to be all 0;

4: for j= 1 to |V | do

5: T← T [X[j], S1[i1]] ⊗ ¯Y ;

/* Tis a temporary bit vector,⊗ is the bitwise AND operator and ¯Y is the bitwise complement of Y . */

6: Y← Y ⊕ T; /*⊕ is the bitwise OR operator. */ 7: t← T1; /* The number of bits with value 1 in T. */

8: for c← 1 to t do

9: L(S1[i1],−)

i1−1,i2 [λ[T

_{], q] ← L}_i

1−1,i2[X[j], q] + γ (S1[i1], −); /* λ[T] is the position of the leftmost bit with value 1 in T. */ 10: Set bit λ[T] of Tto 0; 11: for p, q∈ V with (p, q) ∈ (F × F ) ∪ {(q0, q0)} do 12: L(S1[i1],−) i1−1,i2 [p, q] ← max{L (S1[i1],−) i1−1,i2 [p, q], Li1−1,i2[p, q] + γ (S1[i1], −)}; Algorithm 1. Therefore ηi1,i2 1 and η i1,i2

2 can be set without diffi-culty. For the boundary case, all entries in η₁0,0and η0,0₂ are simply set to (0, 0). Let

(p∗, q∗)= arg max (p,q)_{∈F ×F} Ln,n[p, q] .

With the knowledge of ηn,n₁ [p∗, q∗] and ηn,n₂ [p∗, q∗], an optimal constrained alignment of S1and S2can then be constructed in linear space.

When computing Li1,i2 all those Li₁,i₂ with i1 i1− 2 are not necessary. Similar arguments are also ap-plicable to ηi1,i2

1 and η

i1,i2

2 . Each of these matrices has O(|V |2)entries. Given ηn,n₁ and η₂n,n, the reconstruction of an optimal constrained alignment takes O(n) space (and O(n2) time so the time complexity stated before is not affected). Therefore the space complexity of our algorithm is O(|V |2_{n). On the other hand, the space} complexity stated in [2] is O(rn), where r is the number of transitions in M, without reconstruction of the opti-mal alignment. It is clear that r= (|V |2).

3.2. An algorithm for short regular expression constraints

In general R is much shorter than S1 and S2 [10]. The number of states in the automaton A equivalent to R may hence also be small relative to n. Here we present another algorithm which can be combined with the algorithm presented in the last subsection to yield a more efficient algorithm when|V | = O(log n). Under this assumption, we represent each T[q, a] (regarded as a vector containing entries T[q, a; p] for all p ∈ V ) by a bit vector of length |V |; such a bit vector now takes O(1) space to store. For brevity we only show the com-putation of L(S1[i1],−)

i1−1,i2 . The other cases can be handled similarly.

To compute L(S1[i1],−)

i1−1,i2 [p, q], we need the maximum of Li1−1,i2[p, q] + γ (S1[i1], −) over all p∈ V such that p∈ δ(p, S1[i1]). If computed directly, this maxi-mum takes time proportional to the number of such p. However, if we already know which p is responsible for the maximum, then no more computation is needed. Consider the computation of L(S1[i1],−)

i1−1,i2 [p, q] for all p ∈ V for a fixed q∈ V . Suppose that Li1−1,i2[X[j], q] Li1−1,i2[X[j + 1], q], where j = 1, . . . , |V | − 1 and {X[i]: 1 i |V |} = V . Clearly, for all those p with p∈ δ(X[1], S1[i1]), L(S_i₁_−1,i1[i1],−)₂ [p, q] can be sim-ply set to Li1−1,i2[X[1], q] + γ (S1[i1], −). As to those pwith p∈ δ(X[2], S1[i1]), if p /∈ δ(X[1], S1[i1]) also, then L(S1[i1],−)

i1−1,i2 [p, q] can be simply set to Li1−1,i2[X[2], q] + γ (S1[i1], −). Continue with this process, all L(S1[i1],−)

i1−1,i2 [p, q] can be computed. Constant time opera-tions on data with O(log n) bits make the determination of whether p∈ δ(X[j], S1[i1]) \_kj−1₌₁δ(X[k], S1[i1]) very efficient. Indeed, if we have a bit vector Y such that Y[p] = 1 iff p ∈j_k−1₌₁δ(X[k], S1[i1]), then all those p∈ δ(X[j], S1[i1])\

j−1

k=1δ(X[k], S1[i1]) can be found in constant time by taking T[X[j], S1[i1]]⊗ ¯Y , where ⊗ is the bitwise AND operation and ¯Y is the bitwise com-plement of Y . This is the fundamental reason why the algorithm here can achieve the desired efficiency; note that comparable efficiency cannot be achieved for this operation without the assumption of |V | = O(log n). The procedure is formally stated in Algorithm 1.

The correctness should have been clear from the above discussion. Here simply note that lines 7–10 im-plement the statement “For each p∈ Tset L(S1[i1],−)

i1−1,i2 [p, q] ← Li1−1,i2[X[j], q] + γ (S1[i1], −).”

We now analyze the time complexity of the above code segment. Line 2 takes O(|V |2_log_{|V |) total time.} Line 3 takes O(|V |) total time. Lines 5–7 take O(1) time

(7)

for each iteration, hence take O(|V |2₎_{time in total. For} a fixed q, line 8 is executed O(|V |) times since the T for a specific value of j is disjoint from all those cor-responding to other values of j . Hence the total time taken by lines 8–10 is O(|V |2_{). Lines 11–12 also take} O(|V |2₎_{total time. As we need to iterate over all} in-dex pairs (i1, i2), the overall time complexity for solving RECSA is O(|V |2_log_{|V | n}2_{). Therefore by combining} the algorithm presented in the last subsection, we can solve RECSA in time O(min{|V | log |V |, |E|}|V | n2_).

The above results are summarized as follows.

Theorem 1. RECSA can be solved in O(|V ||E|n2) time and O(|V |2_{n) space. If} _{|V | = O(log n), then it can} be solved in O(min{|E|, |V | log |V |}|V |n2_{) time and} O(|V |2_{n) space.}

4. Future works

In practice it is important to have an algorithm to align multiple sequences with the given regular expres-sion constraint satisfied. In [2,1] it is suggested to ex-tend the structure of weighted finite automata to support multiple sequences. However the size of such automata grows exponentially with the number of sequences, be-coming an exponential multiplicative factor to the ex-ponential time complexity of a multiple sequence align-ment procedure. Hence a reasonable progressive algo-rithm of RECSA for multiple sequences is still in de-mand.

References

[1] A.N. Arslan, Multiple sequence alignment containing a sequence of regular expressions, in: Proc. IEEE Symposium on

Compu-tational Intelligence in Bioinformatics and CompuCompu-tational Biol-ogy, 2005, pp. 1–7.

[2] A.N. Arslan, Regular expression constrained sequence align-ment, in: A. Apostolico, M. Crochemore, K. Park (Eds.), CPM 2005, in: Lecture Notes in Computer Science, vol. 3537, Springer, 2005, pp. 322–333.

[3] F.Y.L. Chin, N.L. Ho, T.W. Lam, P.W.H. Wong, Efficient con-strained multiple sequence alignment with performance guar-antee, Journal of Bioinformatics and Computational Biology 3 (2005) 1–18.

[4] F.Y.L. Chin, A.D. Santis, A.L. Ferrara, N.L. Ho, S.K. Kim, A simple algorithm for the constrained sequence problems, In-formation Processing Letters 90 (2004) 175–179.

[5] D.S. Hirschberg, A linear space algorithm for computing max-imal common subsequences, Communications of the ACM 18 (1975) 341–343.

[6] J.E. Hopcroft, R. Motwani, J.D. Ullman, Introduction to Au-tomata Theory, Languages, and Computation, second ed., Addison-Wesley, 2001.

[7] N. Hulo, C.J.A. Sigrist, V. Le Saux, P.S. Langendijk-Genevaux, L. Bordoli, A. Gattiker, E. De Castro, P. Bucher, A. Bairoch, Recent improvements to the PROSITE database, Nucleic Acids Research 32 (2004) 134–137.

[8] T. Jiang, Y. Xu, M.Q. Zhang (Eds.), Current Topics in Computa-tional Molecular Biology, MIT Press, 2002.

[9] C.L. Lu, Y.P. Huang, A memory-efficient algorithm for multiple sequence alignment with constraints, Bioinformatics 21 (2005) 20–30.

[10] C.J. Sigrist, L. Cerutti, N. Hulo, A. Gattiker, L. Falquet, M. Pag-ni, A. Bairoch, P. Bucher, PROSITE: A documented database using patterns and profiles as motif descriptors, Briefings in Bioinformatics 3 (2002) 265–274.

[11] C.Y. Tang, C.L. Lu, M.D.-T. Chang, Y.-T. Tsai, Y.-J. Sun, K.-M. Chao, J.-M. Chang, Y.-H. Chiou, C.-M. Wu, H.-T. Chang, W.-I. Chou, Constrained sequence alignment tool development and its application to RNase family alignment, Journal of Bioinformat-ics and Computational Biology 1 (2003) 267–287.

[12] Y.-T. Tsai, The constrained common sequence problem, Infor-mation Processing Letters 88 (2003) 173–176.

[13] Y.-T. Tsai, Y.P. Huang, C.T. Yu, C.L. Lu, MuSiC: A tool for multiple sequence alignment with constraints, Bioinformatics 20 (2004) 2309–2311.