An almost-linear time and linear space algorithm for the longest common subsequence problem

(1)

An almost-linear time and linear space algorithm

for the longest common subsequence problem

✩

J.Y. Guo

∗

, F.K. Hwang

Department of Applied Mathematics, National Chiaotung University, Hsinchu, Taiwan, ROC 30500 Received 12 July 2004; received in revised form 2 December 2004

Available online 27 January 2005 Communicated by Wen-Lian Hsu

Abstract

There are two general approaches to the longest common subsequence problem. The dynamic programming approach takes quadratic time but linear space, while the nondynamic-programming approach takes less time but more space. We propose a new implementation of the latter approach which seems to get the best for both time and space for the DNA application.

Keywords: Algorithms; Primal-dual algorithm; Longest common subsequence

1. Introduction

Mutations in DNA arise naturally in an evolution process. These mutations include substitutions, inser-tions and deleinser-tions of nucleotides, leading to “editing” of DNA texts. A sequence comparison of two DNA sequences attempts to align the two sequences to min-imize a function of these mutations. The most com-monly used function is the so-called edit distance first introduced by Levenshtein [5] which simply counts the number of mutations. If substitutions are not

al-✩ _{Research partially supported by ROC National Science council}

grant NSC 90-2115-M-009-007.

* _{Corresponding author.}

E-mail address: [email protected] (J.Y. Guo).

lowed, then the alignment minimizing the edit distance will produce a longest common subsequence (LCS) of the two sequences. Note that the LCS problem had been studied by mathematicians for general sequences long before the edit distance was introduced for DNA sequences.

Assume that both sequences are of O(n) length. Needleman and Wunsch [6] gave an O(n2) time and

O(n2) space dynamic programming algorithm for the

LCS problem. Hirschberg [2] improved to O(n) space by using a divide-and-conquer technique. Later, Hunt and Szymanski [4], and Hirschberg [3], both noticed that not all steps in the dynamic-programming pro-cedure need to be processed and they proposed more efficient nondynamic-programming algorithms. Hunt and Szymanski’s algorithm was improved by

(2)

tolico [1] to require O(n log n) time and O(n+ l) space, where l denotes the number of matches be-tween two sequences. Hirschberg’s algorithm requires O(Ln) time and O(n+ Ln) space, where L is the length of an LCS. Pevzner and Weterman [7] recog-nized that these algorithms can be cast into a primal-dual set-up. The derived primal-primal-dual algorithm, as pre-sented by Pevzner and Waterman, takes O(l+ Ln) time and O(l+ Ln) space. In this paper we give an O(nL) time and O(n) space implementation of the primal-dual algorithm.

2. The primal-dual algorithm

Let I = {I1, I2, . . . , Im} and J = {J1, J2, . . . , Jn} denote two DNA sequences when Ii, Jj∈ {A, C, G, T}. DefineP = {(i, j): Ii= Jj}. Assume m = O(n). Then typically, |P| = O(n2). This is the case if each

nu-cleotide independently has probability pA, pC, pG, pT

of being A, C, G, T, respectively. We will also denote

P = {p1, p2, . . . , pl} where each pk is a pair (ik, jk). The partial order≺ is defined by

px≺ py if ix< iy, jx< jy.

The conjugate partial order≺∗is defined by

px≺∗py if ix iy, jx jy.

Let< denotes the partial order such that

px< py if either px≺ py or px≺∗py.

Pevzner and Waterman proved that< is a linear order

p1< p2< · · · < pl. Note that|<| = |P| = l which is typically O(n2).

The algorithm, as presented in [7], assigns p1, p2, . . . one at a time (in order) to sets C1, C2, . . . such

that the elements in a given Ck can be linearly or-dered in ≺∗. Suppose that p1, p2, . . . , pu have been assigned to C1, C2, . . . , Cv. Let p∗₁, p₂∗, . . . , pv∗denote the≺∗-maximum elements of C1, C2, . . . , Cv, respec-tively. Let k, 1 k v, be the minimum index such

that p∗_k ≺∗ pu+1. Assign pu+1 to Ck. If no such k exists, assign pu+1 to Cv+1. We also set a counter b(pu+1) such that b(pu+1)=    0 if k= 1, p_k∗₋₁ if 2 k v,

p_v∗ if k does not exist.

Note that if b(pu₊₁)= 0, then b(pu₊₁)⊀∗pu₊₁. Sup-pose p1, p2, . . . , pl are assigned to C1, C2, . . . , CL. Then L is the length of an LCS. An LCS can be backtracked from any element in CL by using the b function. Once an LCS is identified, a corresponding (nonunique) alignment can be obtained by filling in between pkand pk+1the unmatched nucleotides from both sequence in an arbitrary order as long as being consistent with each sequence order.

The following example, taken from [7], illustrates the algorithm. I=IT1 I2 G I3 C I4 A I5 T I6 A, J=JA1 J2 T J3 C J4 T J5 G J6 A J7 T, P = (1, 2),p1 p2 (1, 4), p3 (1, 7), p4 (2, 5), p5 (3, 3), p6 (4, 1), p7 (4, 6), p8 (5, 2), p9 (5, 4), p10 (5, 7), p11 (6, 1), p12 (6, 6) <: p3, p2, p1, p4, p5, p7, p6, p10, p9, p8, p12, p11.

For example, p3< p2since p3≺∗p2, while p1< p4since p1≺ p4.

The assignment of pu+1, u = 0, 1, . . . , 11, and b(pu+1) are given in Table 1.

To find an LCS, we can start from p12 to obtain p12 p9 p5 p1or from p10to obtain p10 p7 p5 p1. Using the former, an optimal alignment can

be

– TGCAT – A– AT – C – TGAT

It takes O(n+ l) time and space to construct P and O(l log l) time to<-order P. It takes O(lL) time and O(l+ L) space to construct C1, . . . , CL.

Table 1

C1 C2 C3 C4

p3 p2 p1 p6 p11 p4 p5 p8 p7 p9 p10 p12

(3)

3. An O(nL) time and O(n) space implementation

We construct a table X with 5 rows marked by j , A, C, G, T and n+1 columns marked by n, n−1, . . . , 1, 0 (the indices of J ). Column n is empty. If index n is of nucleotide N , then column n− 1 has entry n in row N and copies the other entries from column n. In general, if index j is of nucleotide N , the column j− 1 has entry j in row N and copies the other entries from column j .

For example, for J= ATCTGAT see Table 2. It is easily verified that the entries in each row are nonincreasing in j . Next we construct a table Y with L+ 1 columns (L is unknown at the beginning) marked by C0, C1, . . . , CL, and 6 rows marked by j, i, A, C, G, T. Along with table Y , we also set up a

backtrack function b. At the beginning, only the C0

column is filled with entries 0, 0, A(0), C(0), G(0),

T (0), the last four entries from table X. Then we

pro-ceed with the indices of I one by one in order to construct Y . Suppose index 1 is of nucleotide N .

In-Table 2 j 7 6 5 4 3 2 1 0 A – – 6 6 6 6 6 1 C – – – – – 3 3 3 G – – – 5 5 5 5 5 T – 7 7 7 4 4 2 2

spect row N in Y and we find only one index T (0). Fill column C1 with entries T (0), 1, A(T (0)), C(T (0)), G(T (0)), T (T (0)), and set b(1, T (0))= (0, 0).

Suppose we are dealing with index y of nucleotide

N where Ck is the large x such that Cx is nonempty. By our construction, entries in row j of Y are in-creasing (easily observed after we finish describing the implementation). Hence entries in row A, C, G, T are nondecreasing. Inspect row N which, say, has entries

n0 n1 · · · nkfor k L. For each ni in the order from large to small, we do the following:

Let jw denotes the value of j in column Cw, 0 w k. Compare nk with jk, jk−1, . . . until the first column Cw(k)such that jw(k)< nk. We fill the col-umn Cw(k)+1(or replace its entries) with nk, y, A(nk), C(nk), G(nk), T (nk). Set b(y, nk)= (i, j) where (i, j) is from Cw(k). In general, suppose nz has just filled the column Cw(z)+1 with z, y, A(nz), C(nz), G(nz), T (nz). Let nv be the next ni < nz. We compare nv with jw(z), jw(z)−1, . . . until Cw(v) is found. Set b(y, nv)= (i, j) where (i, j) is from Cw(v).

We demonstrate this procedure by the example

I

index: T1G2C3A4T5A6 indexJ : A1T2C3T4G5A6T7

We will fill in Y column by column until a column needs to be replaced, then we draw a new Y with the new column (see Tables 3 and 4).

Table 3 C0 C1 C2 C0 C1 C2 C3 C0 C1 C2 C3 C4 j 0 2 5 0 2 3 6 0 1 3 6 7 i 0 1 2 0 1 3 4 0 4 3 4 5 A 1 6 6 1 6 6 – 1 6 6 – – C 3 3 – 3 3 – – 3 3 – – – G 5 5 – 5 5 5 – 5 5 5 – – T 2 4 7 2 4 4 7 2 2 4 7 – b(1, 2)= (0, 0), b(3, 3) = (1, 2), b(5, 7) = (4, 6), b(2, 5) = (1, 2), b(4, 6) = (3, 3), b(4, 1) = (0, 0). Table 4 C0 C1 C2 C3 C4 C0 C1 C2 C3 C4 j 0 1 2 4 7 0 1 2 4 6 i 0 4 5 5 5 0 6 5 5 6 A 1 6 6 6 – 1 6 6 6 – C 3 3 3 – – 3 3 3 – – G 5 5 5 5 – 5 5 5 5 – T 2 2 4 7 – 2 2 4 7 7 b(5, 4)= (3, 3), b(6, 6) = (5, 4), b(5, 2) = (4, 1), b(6, 1) = 0.

(4)

Finally, take a pair (i, j ) from any CLcolumn, we can trace an LCS with length L through the b func-tion. In the above example, (6, 6) is a pair in C4. From b(6, 6)= (5, 4), b(5, 4) = (3, 3), b(3, 3) = (1, 2), we

obtain the LCS: (I1, J2), (I3, J3), (I5, J4), (I6, J6). If

we start from the pair (5, 7), then we have (I1, J2), (I3, J3), (I4, J6), (I5, J7).

We now prove that this procedure is indeed an im-plementation of the primal-dual algorithm. Note that we process the pairs inP in the lexicographical order of (i, j ). So pairs with the same i, called the i-batch, are processed consecutively.

Suppose we are processing the i-batch, and C1, . . . , Ckare nonempty. Let (i1, j1), . . . , (ik, jk) be the max-imal pair in C1, . . . , Ck, respectively. Then j1< j2< · · · < jk.

It suffices to prove jw < jw+1. If (iw, jw) is processed before (iw+1, jw+1), then

iw iw+1 and jw< jw+1

or (iw+1, jw+1) would be assigned to Cw. If (iw, jw) is processed afterwards, and (i_w, j_w) was the maximal

pair of Cw when (iw+1, jw+1) was processed, then iw iw iw+1 and jw jw < jw+1.

Note that all pairs (i, j) processed before the i-batch have i < i. Hence an i-pair can either ∗_(i_{, j}_{), or be noncomparable, but not smaller. More} specifically (i, j )∗(ih, jh) if and only if j jh. So an i-pair (i, j ) joins Chif and only if

jh−1< j jh

and if j > jh, then (i, j ) starts a new Ck+1. Thus pairs in the i-batch are partitioned into several inter-vals where pairs in the same interval go to the same

Ch. Also note that i-pairs are always comparable in

≺∗_{since j}∗

1< j2∗<· · · < jg∗implies (i, j∗)∗(i, j₂∗)∗ · · ·∗(i, j_g∗).

So we only need to assign one pair (i, j ) in each in-terval h to Ch where j is minimal among all i-pairs in the interval. It is easily verified that the (i, j ) pair in column Ch of table Y is indeed the maximal pair (ih, jh) of Ch. So the entry in row N and column Ch gives the minimal index x > jh of a nucleotide of type N . Therefore, if N is the next nucleotide to be processed, then all the j -values of the maximal pairs in C1, . . . , Ck, (C0gives the overall minimum j ) are

provided by row N .

We now check the time complexity of this imple-mentation. Table X can be constructed in O(n) time. To construct the dynamic table Y , we need to go through the O(n) elements of I . Since the entries in both row j and row N are ordered, starting from com-paring the maximal entries of both row, each compar-ison eliminates one entry from further comparcompar-isons. Since there are at most 2L entries in the two rows, it takes O(L)-time to locate the entries{ni} of row N. Inserting the column of ni (and possibly deleting a column) takes constant time. The backtrack function needs to be updated at most L times, and it takes con-stant time to update it. So processing each element of I takes O(L) time, and the construction of table Y takes O(nL) time. We have an O(nL) time algorithm. It is also easily seen that tables X and Y can be constructed in O(n) space.

4. Conclusions

For the LCS problem, the dynamic programming approach requires quadratic time but linear space, while the nondynamic-programming approach re-quires O(n log n) time or O(Ln) time, which is almost linear when the length of an LCS is small compared to

n, but more than linear space. We gave a

nondynamic-programming implementation with O(Ln) time and O(n) space, efficient in both time and space.

Although our presentation is for a DNA sequence, the implementation is valid for any general sequence with, say, p alphabets. If p is treated as a variable, then the time complexity would be O(n(L+ p)) and the space complexity O(np). We may also drop the assumption that both sequences are of lengths of O(n) order. If the lengths of the two sequences, m < n, are not equal, then either the time complexity would be O(mp+ nL) and the space complexity O(mp), or m and n are interchanged in the above complexities.

References

[1] A. Apostolico, Improving the worst-case performance of the Hunt–Szymanski strategy for the longest common subsequence of two strings, Inform. Process. Lett. 23 (1986) 63–69. [2] D.S. Hirschberg, A linear space algorithm for computing

(5)

[3] D.S. Hirschberg, Algorithms for the longest common subse-quence problem, J. ACM 24 (1977) 664–675.

[4] J.W. Hunt, T.G. Szymanski, A fast algorithm for computing longest common subsequences, Comm. ACM 20 (1977) 350– 353.

[5] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Phys. Dokl. 6 (1966) 707–710.

[6] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970) 443–453.

[7] P.A. Pevzner, M.S. Waterman, Generalized sequence alignment and duality, Adv. Appl. Math. 14 (2) (1993) 139–171.