Recent Developments in Linear-Space Alignment Methods: A Survey

(1)

A Survey

KUN-MAO CHAO,¹ROSS C. HARDISON,²and WEBB MILLER^{1, †}

ABSTRACT

A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms.

Specifically, these algorithms align two sequences using only ‘‘linear space’’, i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., end-to-end) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U [i] of the second sequence, for given values of L and U . The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linear-space methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k≥ 1. One is a linear-space version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multi-sequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the _β-like globin gene cluster of several mammals.

Key Words: sequence analysis, dynamic programming, linear-space algorithm, multiple alignment

INTRODUCTION

Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for ‘‘rigorous’’ alignment of DNA and protein sequences. For a number of useful alignment-scoring schemes, this method is guaranteed to produce an alignment of two giv en sequences having the highest possible score.

1Department of Computer Science and Engineering,²Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802.

†Corresponding author.

(2)

For alignment scores that are popular with molecular biologists, dynamic-programming alignment of two sequences requires quadratic time, i.e., time proportional to the product of the two sequence lengths. In particular, this holds for affine gap costs, that is, scoring schemes under which a gap of length k is penalized g+ ek, where g is a fixed

‘‘gap-opening penalty’’ and e is a ‘‘gap-extension penalty’’ (Gotoh, 1982). (More general alignment scores, which are more expensive to optimize, were considered by Waterman et al., 1976, but have not found wide-spread use.) Quadratic time is necessitated by the inspection of every pair (i, j), where i is a position in the first sequence and j is a position in the second sequence. For many applications, e.g., database searches, such an exhaus- tive examination of position pairs may not be worth the effort, and a number of faster methods have been proposed.

Standard implementations of dynamic-programming alignment also require a quadratic amount of computer memory to explicitly produce a highest-scoring alignment.

(Computing just the largest alignment score in less space is straightforward.) For certain applications, such as careful analysis of a few long DNA sequences, this space restriction is more important than the time constraint. For that reason, a number of methods have been proposed by biologists to decrease the amount of space allocated for each position pair, (i, j).

As it turns out, an alignment strategy that uses only linear space had earlier been published by a computer scientist. The original formulation (Hirschberg, 1975) was for an alignment-scoring scheme that is too restrictive to be of general utility in molecular biology, but the basic idea is quite robust and works readily for affine gap penalties (Myers and Miller, 1988). We feel that the power of this approach is still not fully appre- ciated by biologists, and this survey is meant to assist in attracting the deserved recogni- tion.

After developing the basic linear-space dynamic-programming alignment algorithm, we present the three recent extensions mentioned in the Abstract. Except for the last extension, the ideas are described for the simplification of affine gap costs where g = 0.

Then two linear-space methods for computing k best non-intersecting local alignments for any k≥ 1 are outlined. These developments culminate in multi-sequence alignment algorithms that are the centerpiece of a software system we are building to analyze mam- malian _β-like globin gene clusters. This system, which is described near the end of the paper, is intended for eventual application to sequence data from other important genomic loci, when those data become available.

AV AILABILITY

The C source programs for our linear-space alignment methods are available by anonymous ftp from groucho.cse.psu.edu. Table 1 summarizes the available linear-space alignment programs mounted on the ftp site. The authors may be contacted by electronic mail to [email protected].

(3)

Directory name Function References band Aligning within a diagonal band Chao et al. (1992) BMB Constrained sequence alignment Chao et al. (1993a) robust Locating well-conserved regions Chao et al. (1993b) sub Computing all suboptimal alignments Chao (1994)

sim Local similarities Huang and Miller (1991)

falign Building local alignments from fragments Chao and Miller (1994) yama Multiple sequence alignment Hardison et al. (1993b, 1994)

Table 1. Summary of our linear-space alignment tools.

THE DYNAMIC-PROGRAMMING ALIGNMENT ALGORITHM

It is quite helpful to recast the problem of aligning two sequences as an equivalent problem of finding a maximum-score path in a certain graph, as has been observed by a number of authors, including Myers and Miller (1989). This alternative formulation allows the problem to be visualized in a way that permits the use of geometric intuition.

We find this visual imagery critical for keeping track of the low-level details that arise in development and implementation of dynamic-programming alignment algorithms.

In this and the next three sections we assume the following simple alignment-scoring scheme. For each possible aligned pair

[

^xy

]

, where each of x and y is either a normal sequence entry or the symbol ‘‘−’’, there is an assigned score _σ(

[

^xy

]

). The score of a pairwise alignment is defined to be the sum of the _σ-values of its aligned pairs (i.e., columns).

Recall that a directed graph G = (V , E) consists of a set V of nodes (also called ver- tices) and a set E of edges. The edge from node u to node v, if it exists, is denoted u→ v.

A sequence of consecutive edges

u₁→ u2, u₂→ u3, . . . , u_k₋₁→ uk

is a path from u₁ to u_k. If each edge u→ v is assigned a scoreσ(u→ v), then the score of such a path is

Σ

^k−1ⁱ⁼¹ ^σ^(uⁱ^{→ u}ⁱ⁺¹^).

We now describe the relationship between maximum-score paths and optimal align- ments. Consider two sequences, A = a₁a₂. . . a_M and B = b₁b₂. . . b_N. That is, A con- tains M symbols and B contains N symbols, where the symbols are from an arbitrary

‘‘alphabet’’ that does not contain the dash symbol, ‘‘−’’. The alignment graph for A and B, denoted G_{A, B}, is an edge-labeled directed graph. The nodes of G_{A, B} are the pairs (i, j) where i∈[0, M] and j∈[0, N]. (We use the notation [p, q] for the set { p, p+ 1, . . . , q − 1, q}.) When graphed, these nodes are arrayed in M + 1 rows (row i corresponds to a_i for i∈[1, M], with an additional row 0) and N + 1 columns (column j corresponds to b_j for j∈[1, N]). The edge set for GA, B consists of the following edges,

(4)

labeled as indicated.

1. (i− 1, j) → (i, j) for i∈[1, M] and j∈[0, N], labeled

[

^a₋ⁱ

]

2. (i, j− 1) → (i, j) for i∈[0, M] and j∈[1, N], labeled

[

b⁻_j

]

3. (i− 1, j − 1) → (i, j) for i∈[1, M] and j∈[1, N], labeled

[

b^aⁱ_j

]

Fig. 1 provides an example of the construction.

− T

C−

− T

− C

− T

C− C−

CT

CC

− C

− T

TT

CT

− T

C−

− T

C− C−

CT

CC

− C

− C FIG. 1. Alignment graph GA, Bfor the sequences A= TC and B = CTC.

It is instructive to look for a path from (0, 0) (the upper left corner of the graph of Fig. 1) to (2, 3) (the lower right) such that the labels along the path ‘‘spell’’ the alignment:

-TC CTC

The first aligned pair is

[

C⁻

]

, so the first edge must be horizontal. The second pair is

[

^TT

]

^,

so the second edge must be diagonal. The third pair is

[

^CC

]

, so the third edge must be diagonal. Generally, when a path descends from row i− 1 to row i, it picks up an aligned pair with top entry a_i. A path from (0, 0) to (M , N ) has zero or more horizontal edges, then a vertical or diagonal edges to row 1, then zero or more horizontal edges, then an edge to row 2, then . . ., so the top entries of the labels along the path are a₁, a₂, . . ., possi- bly with some interspersed dashes. Similarly, the bottom entries spell B if dashes are ignored, so the aligned pairs spell an alignment of A and B. Indeed, alignments are in general equivalent to paths, as we now state more precisely.

Fact: Let G_{A, B} be the alignment graph for sequences A and B. With each path from (0, 0) to (M, N ) associate the alignment formed by concatenating the edge labels along the path, i.e., the alignment ‘‘spelled’’ by the path. Then every such path determines an alignment of A and B, and every alignment of A and B is determined by a unique path.

In other words, there is a one-to-one correspondence between paths in G_{A, B} from (0, 0) to (M , N ) and alignments of A and B. Furthermore, if the score _σ(_π) is assigned to each edge of G_{A, B}, where_π is the aligned pair labeling that edge, then a path’s score is exactly

(5)

the score of the corresponding alignment.

At each node, the score is computed from the scores of immediate predecessors and of entering edges, which are pictured in Fig. 2. The procedure of Fig. 3 computes the maximum alignment score by considering rows of G_{A, B} in order, sweeping left to right within each row. S[i, j] denotes the maximum score of a path from (0, 0) to (i, j). Lines 7-10 mirror Fig. 2. In row 0 there is but a single edge entering a node (lines 2-3), and similarly for column 0 (line 5). This is a quadratic-space procedure since it uses the (M+1)-by-(N +1) array S to hold all node-scores.

j−1 j i−1

i σ

(

b⁻_j

)

ai−

(

σ

)

bj

ai

(

^σ

)

FIG. 2. Edges entering node (i, j) and their scores.

1. S[0, 0]← 0 2. for j← 1 to N do

3. S[0, j]← S[0, j − 1] +σ(

[

b⁻_j

]

⁾

4. for i← 1 to M do

5. S[i, 0]← S[i − 1, 0] +σ(

[

^a₋ⁱ

]

^}

6. for j← 1 to N do

7. Vertical← S[i − 1, j] +σ(

[

^a₋ⁱ

]

⁾

8. Diagonal← S[i − 1, j − 1] +σ(

[

b^aⁱj

]

⁾

9. Horizontal← S[i, j − 1] +σ(

[

b⁻j

]

⁾

10. S[i, j]← max{Vertical, Diagonal, Horizontal}

11. write "Maximum alignment score is" S[M , N ]

FIG. 3. Quadratic-space, score-only alignment algorithm.

The next step is to see that the optimal alignment score for A and B can be com- puted in linear space. Indeed, it is apparent that the scores in row i of S depend only on those in row i− 1. Thus, after treating row i, the space used for values in row i − 1 can be recycled to hold values in row i+ 1. In other words, we can get by with space for two rows, since all that we ultimately want is the single score S[M, N ].

In fact, a single array, S[0. . N ], is adequate. S[ j] holds the most recently computed value in column j, so that as values of S are computed, they overwrite old values. There is a slight conflict in this strategy, since two ‘‘active’’ values are needed in the current col- umn, necessitating an additional scalar, s, to hold one of them. Fig. 4 shows the grid locations of values in S and of scalars s and c when (i, j) is reached in the computation.

(6)

S[k] holds path scores for row i when k < j, and for row i− 1 when k ≥ j. Fig. 5 is a direct translation of Fig. 3 using the memory-allocation scheme of Fig. 4.

i i−1

j−1 j

s c

FIG. 4. Grid locations of entries of a vector of length N+ 1 just before the maximum path- score is evaluated at node (i, j). Additionally, a scalar s holds the path score at (i− 1, j − 1) and c holds the score at (i, j− 1).

1. S[0]← 0

2. for j← 1 to N do

3. S[ j]← S[ j − 1] +σ(

[

b⁻j

]

⁾

4. for i← 1 to M do

5. s← S[0]

6. S[0]← c ← S[0] +σ(

[

^a₋ⁱ

]

⁾

7. for j← 1 to N do

8. c← max{S[ j] +σ(

[

^a₋ⁱ

]

^{), s}⁺^σ⁽

[

b^aⁱj

]

^{), c}⁺^σ⁽

[

b⁻j

]

^)}

9. s← S[ j]

10. S[ j]← c

11. write "Maximum alignment score is" S[N ] FIG. 5. Linear-space computation of alignment scores.

We will soon need to perform this computation in the reverse direction. Here, the relevant edges are the ones leaving node (i, j), as pictured in Fig. 6, and the quadratic- space algorithm is given in Fig. 7. A slight generalization of a linear-space version of Fig. 7 appears in lines 26-35 of Fig. 9; its derivation is left as an exercise for the reader.

(7)

j i

i+1

a−

(

σ

)

i+1 j+1

(

^σ a

)

b i+1

(

j+1

σ b

)

− j+1

FIG. 6. Edges leaving (i, j) and their scores.

1. S[M, N ]← 0

2. for j← N − 1 down to 0 do

3. S[M, j]← S[M, j + 1] +σ(

[

b⁻j+1

]

⁾

4. for i← M − 1 down to 0 do

5. S[i, N ]← S[i + 1, N] +σ(

[

^a₋ⁱ⁺¹

]

⁾

6. for j← N − 1 down to 0 do

7. S[i, j]← max

⎧⎪

⎪⎨

⎪⎪

⎩

S[i+ 1, j] +^σ(

[

^aⁱ⁺¹₋

]

⁾

S[i+ 1, j + 1] +σ(

[

b^aⁱ⁺¹j+1

]

⁾

S[i, j+ 1] +σ(

[

b⁻j+1

]

⁾

8. write "Maximum alignment score is" S[0, 0]

FIG. 7. Computation of alignment scores in the reverse direction.

HIRSCHBERG’S INSIGHT

We are now ready to describe Hirschberg’s linear-space alignment algorithm; the algorithm delivers an explicit optimal alignment, not merely its score. First, make a ‘‘for- ward’’ score-only pass (Fig. 5), stopping at the middle row, i.e., row mid =⎣M/2⎦. Then make a backward score-only pass (the linear-space version of Fig. 7), again stopping at the middle row. Thus, for each point along the middle row, we now hav e the optimal score from (0, 0) to that point and the optimal score from that point to (M , N ). Adding those numbers gives the optimal score over all paths from (0, 0) to (M , N ) that pass through that point. A sweep along the middle row, checking those sums, determines a point (mid, j) where an optimal path crosses the middle row. This reduces the problem to finding an optimal path from (0, 0) to (mid, j) and an optimal path from (mid, j) to (M , N ), which is done recursively.

Fig. 8A shows the two subproblems and each of their ‘‘subsubproblems’’. Note that regardless of where the optimal path crosses the middle row, the total of the sizes of the two subproblems is just half the size of the original problem, where problem size is

(8)

measured by the number of nodes. Similarly, the total sizes of all subsubproblems is a fourth the original size. Letting T be the size of the original, it follows that the total sizes of all problems, at all levels of recursion, is at most T + ½T + ¼T . . . = 2T . Since com- putation time is directly proportional to the problem size, this approach will deliver an optimal alignment in about twice the time needed to compute merely its score.

Fig. 8B shows a typical point in the alignment process. The initial portion of an optimal path will have been determined, and the current problem is to report the aligned pairs along an optimal path from (i₁, j₁) to (i₂, j₂). Fig. 9 provides detailed pseudo-code for the linear-space alignment algorithm.

N

M 00

optimal midpoint M

2

N

M 00

optimal path i₁

i₂

j1 j

2

(A) (B)

FIG. 8. (A) The two subproblems and four subsubproblems in Hirschberg’s linear-space alignment procedure. (B) Snapshot of the execution of Hirschberg’s algorithm. Shaded areas indicate problems remaining to be solved.

(9)

1. shared strings a₁a₂. . . a_M, b₁b₂. . . b_N

2. shared temporary integer arrays S⁻[0. . N ], S⁺[0. . N ] 3. procedure Align(M , N )

4. if M = 0 then

5. for j ← 1 to N do 6. write

[

b⁻_j

]

7. else

8. path(0, 0, M , N )

9. recursive procedure path(i₁, j₁, i₂, j₂) 10. if i₁+ 1 = i2 or j₁ = j2then

11. write aligned pairs for maximum-score path from (i₁, j₁) to (i₂, j₂)

12. else

13. mid ← ⎣(i¹+ i2)/2⎦

14. /* find maximum path scores from (i₁, j₁) */

15. S⁻[ j₁]← 0

16. for j ← j1+ 1 to j2do

17. S⁻[ j]← S⁻[ j− 1] +σ(

[

b⁻j

]

⁾

18. for i ← i1+ 1 to mid do

19. s ← S⁻[ j₁]

20. S⁻[ j₁]← c ← S⁻[ j₁]+σ(

[

^a−ⁱ

]

⁾

21. for j ← j1+ 1 to j2do

22. c← max{S⁻[ j]+σ(

[

^a−ⁱ

]

^{), s}⁺^σ⁽

[

b^aⁱ_j

]

^{), c}⁺^σ⁽

[

b⁻_j

]

^)}

23. s ← S⁻[ j]

24. S⁻[ j]← c

25. /* find maximum path scores to (i₂, j₂) */

26. S⁺[ j₂]← 0

27. for j ← j2− 1 down to j1 do 28. S⁺[ j]← S⁺[ j+ 1] +σ(

[

b⁻j+1

]

⁾

29. for i ← i2− 1 down to mid do

30. s ← S⁺[ j₂]

31. S⁺[ j₂]← c ← S⁺[ j₂]+σ(

[

^aⁱ⁺¹₋

]

⁾

32. for j ← j2− 1 down to j1 do

33. c← max{S⁺[ j]+σ(

[

^aⁱ⁺¹₋

]

^{), s}⁺^σ⁽

[

b^aⁱ⁺¹j+1

]

^{), c}⁺^σ⁽

[

b⁻j+1

]

^)}

34. s ← S⁺[ j]

35. S⁺[ j]← c

36. /* find where maximum-score path crosses row mid */

37. j ← value x ∈[ j1, j₂] that maximizes S⁻[x]+ S⁺[x]

38. path(i₁, j₁, mid, j) 39. path(mid, j, i₂, j₂)

FIG. 9. Linear-space alignment algorithm.

(10)

ALIGNMENT WITHIN A NARROW REGION

Our first variant of Hirschberg’s strategy computes an alignment when all operations are constrained to a narrow region of the dynamic-programming grid. In particular, sup- pose that L[i] and U [i] specify lower and upper bounds, respectively, on the columns to be considered in row i, for all relevant values of i. Of course, the algorithms of the pre- ceding section can be modified to inspect only points satisfying these constraints, but the resulting methods are not as time-efficient as we would like.

The deficiency is most obvious in the ‘‘limiting case’’ where L[i] = U [i] = i for all i, i.e., when the region degenerates to a diagonal line. The number of grid points is then just N , the sequence length. Splitting the problem along the middle row produces two subproblems of total size equal to that of the original problem, which is also the total of the sizes of the ‘‘grandchildren’’ problems. Since the recursion depth is log₂ N , the run- ning time of a straightforward application of Hirschberg’s strategy exceeds the score-only time by the factor log N .

We hav e developed several alignment-producing methods with time bound proportional to the region’s area. One approach is based on the observation that score-only time is attained by Hirschberg’s method if the region contains at least half of the dynamic-programming grid points. This holds since the number of point-inspections to produce the alignment is at most twice the number of dynamic-programming grid points, and hence at most four times the region’s area.

Given a narrow region, R, we can proceed as follows. We may as well assume that L and U are non-decreasing since if, e.g., L[i] were larger than L[i+ 1], we could set L[i+ 1] to equal L[i] without affecting the set of constrained alignments. Enclose as many rows as possible from the top of the region in an upright rectangle, subject to the condition that the rectangle’s area at most doubles the area of its intersection with R.

Then starting with the first row of R not in the rectangle, cover additional rows of R with a second such rectangle, and so on. Fig. 10 illustrates this process of covering R with rectangles and Fig. 11 gives the algorithm for locating the partition line segments.

FIG. 10. Covering a narrow region, R, with non-intersecting rectangles, each of which has more points inRthan outside R. The dark lines show where values of S⁺are saved during the backward pass described in the text.

(11)

1. I ← empty 2. i0← 0

3. sum_widths← U[0] − L[0] + 1 4. for i← 1 to M do

5. width← U[i] − L[i] + 1

6. sum_widths← sum_widths + width

7. if (U [i]− L[i0]+ 1) × (i − i0+ 1) > 2 × sum_widths then

8. I ← I∪{i}

9. i0← i

10. sum_widths← width

FIG. 11. Algorithm to partition a region of grid points.

A score-only backward pass is made over R, computing S⁺. Values of S⁺ are retained for the top line in every rectangle (the top rectangle can be skipped). See Fig.

10. It can be shown that the total length of these pieces cannot exceed three times the total number of columns, as required for a linear space bound. Next, perform a score- only forward pass, stopping at the last row in the first rectangle. A sweep along the boundary between the first and second rectangles locates a crossing edge on an optimal path though R. That is, we can find a point p on the last row of the first rectangle and a point q on the first row of the second rectangle such that there is a vertical or diagonal edge e from p to q, and e is on an optimal path. Such an optimal path can be found by applying Hirschberg’s strategy to R’s intersection with the first rectangle (omitting columns following p) and recursively computing a path from q through the remainder of R. This process inspects a grid point at most once during the backward pass, once in a forward pass computing p and q, and an average of four times for applying Hirschberg’s method to R’s intersection with a rectangle. Details of this approach can be found in Chao et al., 1993b, while a substantially different method that attains a factor of essen- tially 2 is presented by Chao et al. (1993a).

FINDING ALIGNED PAIRS IN NEARLY-OPTIMAL ALIGNMENTS Our second variant finds all edges that are contained in at least one path whose score exceeds a given threshold,_τ. Again, a recursive subproblem will consist of applying the alignment algorithm over a rectangular portion of the original dynamic-programming matrix, but now it is necessary that we continue to work with values S⁻ and S⁺ that are defined relative to the original problem. To accomplish this, each problem to be solved is defined by specifying values of S⁻for nodes on the upper and left borders of the defining rectangle, and values of S⁺for the lower and right borders.

To divide a problem of this form, a forward pass propagates values of S⁻to nodes in the middle row, and a backward pass propagates values of S⁺ to the row that follows the middle row. This information allows us to determine all edges starting in the middle row that are contained in a path of score at least_τ. We are left with two subproblems to solve, but their total area can be as large as that of the original problem. See Fig. 12A. For rea- sons that are analogous to ones explained in the previous section, this means that the divide-and-conquer process can require time and space exceeding the score-only

(12)

requirements by the factor log M .

i₁

i₂ midi

nearly−optimal path that is leftmost in row midi nearly−optimal path that is rightmost in row midi

i₁

i₂ midi

j midj j

1 2

(A) (B)

FIG. 12. Undesirable (A) and desirable (B) divisions into subproblems when computing all edges that lie in nearly-optimal paths, i.e., paths scoring at leastτ.

The following strategy (Fig. 12B) reduces the space requirement to linear. A for- ward pass propagates values of S⁻to nodes in the middle row and the middle column, and a backward pass propagates values of S⁺to those nodes. The data determining any one of the four subproblems, i.e., the arrays of S-values on its borders, is then at most half the size of the set of data defining the parent problem. The maximum total space requirement is realized when recursion reaches a directly solvable problem in the extreme upper left corner of the original grid; at that time there are essentially 2(M+ N) S-values saved for borders of the original problem, M+ N values on the middle row and column of the orig- inal problem, (M+ N)/2 values for the upper left subproblem, (M + N)/4 values for the upper-left-most subsubproblem, etc., giving a total of about 4(M+ N) retained S-values.

Fig. 13 outlines the approach.

(13)

1. recursive procedure sub_opt(i1, j1, i2, j2, boundary score vectors) 2. if i1+ 1 ≥ i2or j1+ 1 ≥ j2then

3. write all edges with score at leastτ.

4. else

5. midi← ⎣(i¹+ i2)/2⎦ 6. midj← ⎣( j¹+ j2)/2⎦

7. A linear-space forward computation is performed to compute S⁻: store S⁻[i, j] if i= midi or j = midj;

8. A linear-space backward computation is performed to compute S⁺: store S⁺[i, j] if i= midi or j = midj;

9. /* Divide the problem by row midi and column midj */

10. for each of the four subproblems do

11. Let i′1, j′1, i′2, and j′2be the shrunken boundary row and column indexes.

12. sub_opt(i′1, j′1, i′2, j′2, new boundary score vectors)

FIG. 13. The algorithm for computing all edges that are contained in at least one path whose score exceeds a given threshold,τ.

The log M factor still afflicts the running time, though a precise analysis shows that it affects only grid points lying between nearly-optimal paths, and a more sophisticated algorithm reduces that penalty factor to log log M (Chao, 1994).

ALIGNING TWO ALIGNMENTS

To obtain satisfactory alignments of DNA or protein sequences, one generally needs an alignment-scoring scheme that is more flexible than the ones treated so far in this paper (Fitch and Smith, 1983). Such schemes may assign a score for certain adjacent pairs of columns, in addition to a score for every individual column. A case in point con- cerns aligning two sequences that are themselves alignments (i.e., sequences of fixed- height columns), where the potential resulting alignments are scored using what is called the sum of pairwise scores with quasi-natural gap costs (Altschul, 1989).

To illustrate the basic idea behind these scores, suppose we are aligning A and B, where A is a three-way alignment of DNA sequences and B is a compatible two-way alignment. Thus a_i (i.e., the i^th entry of A) is a column containing three symbols, each being one of A, C, G, T , or ‘‘−’’. Given these five sequences (three from A and two from B), there are

(

⁵2

)

= 10 possible sequence pairs. With each such pair of DNA sequences, say sequences p and q with 1 ≤ p < q ≤ 5, we have an associated set of scoresσp, q for aligned pairs. A five-column (i.e., with 5 entries) is assigned the score_σ, defined as the sum of the ten scores for aligned pairs obtained by selecting two of the five entries.

Moreover, a ‘‘gap cost’’ g_{p, q} is given for each sequence pair. These costs are used to define a score for each adjacent pair of five-columns, as follows. Fix a pair of five- columns. For each sequence pair ( p, q), the pair of five-columns is assessed a penalty g_{p, q}if the p^th and q^th rows of the pair exhibit any one of the six patterns given in Fig. 14.

These counts tend to over-estimate the actual number of gaps in the pairwise alignment induced by the five-way alignment, but their use is mandated by considerations of execution efficiency (Altschul, 1989).

(14)

For x x−

x x

x −x − x x

− − −x x

−x

−−

−x

−−

, , or count 1 gap start.

For or count 1 gap end.

FIG. 14. Rules for counting gaps and quasi-gaps given a pair of adjacent columns of a multi-sequence alignment. A pair of rows has been extracted, giving a pair of aligned pairs.

Here x denotes an entry other than ‘‘−’’. The total count for the adjacent multi-sequence columns is the number of gap starts and gap ends, summed over all pairs of rows.

Such alignment problems can be modeled as follows. With each dynamic-program- ming grid point (i, j), associate three nodes, denoted (i, j)_V, (i, j)_D and (i, j)_H. The V (vertical) node has edges entering from each of the three nodes above it, the D node has three entering diagonal edges and the H node has three entering horizontal edges.

(Nodes in row 0 or column 0 may have fewer entering edges.) Scores are assigned to nodes and edges as depicted in Fig. 15. Again, alignments correspond to paths in the graph, but now a path’s score is the sum of the scores attached to its edges and its nodes.

j−1 j

ai

b_j b_j−1

( )

σ⁽²⁾ a

−i−1

( )

σ⁽²⁾

( )

σ⁽²⁾ a_i−1 b−_j−1a

bi_j

a_i b_j

ai

b_j−1

( )

σ⁽²⁾ −

( )

σ⁽²⁾

( )

σ⁽²⁾ b−_j−1

ai

b−_j b−_j b−_j

(

^a−ⁱ

)

σ⁽²⁾ a

−i−1

(

−

)

σ⁽²⁾ a_i−1

− b_j

a_i

(

−

)

σ⁽²⁾ b_j ai

σ

(

^a−ⁱ

)

σ

(

^ab_jⁱ

)

σ

(

b_j

)

− i−1

i

V H

D V

H D

V H

D

V H

D

FIG. 15. The three nodes at grid point (i, j), the edges entering those nodes, and the scores associated with those nodes and edges. This model assigns a scoreσ⁽²⁾(c1c2) to every adja- cent column-pair c1c2.

Given this model of the alignment problem, one can proceed as before. For exam- ple, to determine the score, D[i, j], of an optimal path ending at (i, j)_D, maximize over the three entering edges, i.e.,

(15)

D[i, j]←σ(

[

b^aⁱj

]

⁾^{+ max}

⎧⎪

⎪⎨

⎪⎪

⎩

V [i− 1, j − 1] +σ⁽²⁾(

[

^a₋ⁱ⁻¹b^aⁱj

]

⁾

D[i− 1, j − 1] +σ⁽²⁾(

[

b^aⁱ⁻¹j−1

a_i bj

]

⁾

H [i− 1, j − 1] +σ⁽²⁾(

[

b⁻_j−1 ai

b_j

]

⁾

Also, Hirschberg’s strategy extends to this model. That is (1) a score-only computation can be performed using essentially just the space to store values in one row and (2) forward and backward passes to the middle row allow one to recursively construct an optimal path in approximately twice the score-only time. Fig. 16 provides detailed pseudo- code for the linear-space alignment algorithm.

1. shared string arrays a1a2. . . a_M, b1b2. . . b_N

2. shared temporary integer arrays V⁻[0. . N ], D⁻[0. . N ], H⁻[0. . N ], V⁺[0. . N ], D⁺[0. . N ], H⁺[0. . N ] 3. procedure Align(M , N )

4. if M = 0 then

5. write

[

b⁻₁

] [

b⁻₂

]

^{. . .}

[

b⁻_N

]

6. else

7. path(0, 0,′D′, M + 1, N + 1, ′D′) 8. recursive procedure path(i1, j1, type₁, i2, j2, type₂) 9. if i1+ 1 = i2or j1= j2then

10. write aligned pairs for maximum-score path from (i1, j1)type₁to (i2, j2)type₂

11. else

12. mid← ⎣(i¹+ i2)/2⎦

13. /* A linear-space forward computation is performed to compute V⁻and D⁻*/

14. V⁻[ j]← maximum path score from (i1, j1)type₁to (mid, j)V, for j∈[ j1, j2] 15. D⁻[ j]← maximum path score from (i1, j1)type₁to (mid, j)D, for j∈[ j1, j2]

16. /* A linear-space backward computation is performed to compute V⁺and D⁺*/

17. V⁺[ j]← maximum path score from (mid, j)V to (i2, j2)type₂, for j∈[ j1, j2] 18. D⁺[ j]← maximum path score from (mid, j)Dto (i2, j2)type₂, for j∈[ j1, j2] 19. /* find where maximum-score path crosses mid column */

20. (J , X )← values J ∈[ j1, j2] and X∈{′V′, ′D′} that maximize X⁻[J ]+ X⁺[J ] 21. path(i1, j1, type₁, mid, J , X )

22. path(mid, J , X , i2, j2, type₂)

FIG. 16. Algorithm for delivering an optimal alignment in the model of aligning two alignments.

LOCAL ALIGNMENT

In many applications, a global (i.e., end-to-end) alignment of the two giv en sequences is inappropriate; instead, a local alignment (i.e., involving only a part of each sequence) is desired. In other words, one seeks a high-scoring path that need not termi- nate at the corners of the dynamic-programming grid (Smith and Waterman, 1981). The

(16)

highest local alignment score can be computed as follows:

S[i, j]← max

⎧⎪

⎪⎪

⎨⎪

⎪⎪

⎩

0 if 0≤ i ≤ M and 0 ≤ j ≤ N

S[i− 1, j] +σ(

[

^a₋ⁱ

]

⁾ ^{if 1}≤ i ≤ M and 0 ≤ j ≤ N S[i− 1, j − 1] +σ(

[

b^aⁱj

]

⁾ ^{if 1}≤ i ≤ M and 1 ≤ j ≤ N S[i, j− 1] +σ(

[

b⁻j

]

⁾ ^{if 0}≤ i ≤ M and 1 ≤ j ≤ N

A single highest-scoring alignment can be found by locating the alignment’s end points (which is straightforward to do in linear space), then applying Hirschberg’s strategy to the two substrings bracketed by those points.

Further complications arise when one seeks k best alignments, where k > 1. For computing an arbitrary number of non-intersecting and high-scoring local alignments, Waterman and Eggert (1987) developed a very time-efficient method. Producing a linear- space variant of their algorithm requires ideas that differ significantly from those pre- sented in previous sections (Huang and Miller, 1991; Huang et al., 1990). In brief, the strategy is to record the interesting parts of S in a space-efficient fashion and to perform local recomputation each time a best alignment is reported.

For each vertex v= (i, j), define First(v) to be the last vertex in some fixed topologi- cal order such that an alignment beginning at First(v) and ending at v has score S[i, j]. It can be shown that if vertices u and v satisfy First(u)≠ First(v), then an optimal align- ment from First(u) to u and an optimal alignment from First(v) to v have no vertex in common. Vertices are partitioned into equivalence classes by their First-values; vertices u and v are in the same class if and only if First(u)= First(v).

The algorithm works as follows. A forward pass is made through all vertices to con- struct k best equivalence classes. After delivering the best alignment (which is in the equivalence class with the highest score), a backward pass is performed to locate the region that has to be recomputed in order to update the recorded k best equivalence classes. A forward pass is then made over the region to find newly-exposed equivalence classes. This process is repeated until the k^th best alignment is delivered. Fig. 17 gives the outline. This algorithm has also been adapted to run efficiently on a parallel computer (Huang et al, 1992).

1. Compute k best equivalence classes in a single sweep 2. for i← 1 to k do

3. Deliver an optimal alignment in the equivalence class with the highest score 4. if i≠ k then

5. Determine a rectangle to be recomputed

6. Obtain k− i best equivalence classes by recomputing the rectangle FIG. 17. Algorithm for computing k best local alignments.

(17)

FAST LOCAL ALIGNMENT

To attain greater speed for local alignment, Wilbur and Lipman (1983, 1984) employed the strategy of building alignments from "fragments", such as short runs of exact matches. This strategy has been very successful as an initial phase for database search. With protein sequences, it might work better to begin with inexact but high-scor- ing matches, such as those used by the blast program (Altschul et al., 1990). In general, a full-resolution alignment process can be made more efficient by restricting the search to a ‘‘neighborhood’’ of the alignment-from-fragments (Pearson and Lipman, 1988; Pear- son, 1990; Chao et al., 1994).

It is straightforward to construct a highest-scoring alignment from fragments in O(F²) time, where F is the total number of fragments (Wilbur and Lipman, 1983). Based on the ‘‘candidate-list paradigm’’, Eppstein et al. (1992) developed an algorithm that runs in O(F log log F) time. (Strictly speaking, these times should involve the two sequence lengths as additive terms, and the ‘‘log log F’’ can be improved slightly.) How- ev er, the data structure employed to obtain this theoretical efficiency is unusable in prac- tice. With a practical data structure, the complexity becomes O(F log F), which is still a great improvement over O(F²) for problems of the size we regularly solve.

When long DNA sequences are aligned, it may be impractical to store all of the fragments. The algorithm of Eppstein et al. (1992) can be easily adapted to compute merely the optimal score of a local alignment in O(M+ N) space by keeping only those fragments in current (four) candidate lists, which are all of size O(M+ N). To deliver a local alignment in O(M+ N) space, Chao and Miller (1994) extend the approach of Hirschberg (1975) to identify one or two fragments near the middle of an optimal alignment, then recursively compute the alignment’s remaining prefix and suffix. Several possibilities arise when identifying the middle fragments from the information retained in the candidate lists created by the forward and backward passes to the middle rows. Chao and Miller (1994) show that all possibilities can be checked in O(c) time where there are c columns in the subproblem. The resulting algorithm runs in O((M+ N + F log N) log M) time and O(M+ N) space. The ‘‘log M’’ factor from the F log N log M term arises in the analysis because in theory the two subproblems could contain almost all fragments of the parent problem.

Chao and Miller (1994) also give a time-efficient, linear-space algorithm for con- structing k best non-intersecting alignments from fragments, following the strategy of Huang and Miller (1991). A forward pass is made through the entire set of fragments to find the first and last fragments on an optimal alignment. As paths to fragments f are discovered, they are divided into equivalence classes according to the first fragment on a highest-scoring alignment ending at f , and information about k best pairwise non-equiv- alent paths is retained. When fragments of a highest-scoring alignment are discarded, it is sufficient to recompute scores for fragments in the equivalence class containing the alignment’s fragments. It is shown that this recomputation can be carried out efficiently.

(18)

MULTIPLE ALIGNMENT TOOLS

We hav e implemented a family of multi-sequence alignment programs, based on the algorithms described in this paper. The user is required to specify an aligning tree that indicates an appropriate order for progressive alignment; typically this tree will mirror the sequences’ phylogenetic tree. Each program then operates progressively (Feng and Doolittle, 1987), i.e., by repeatedly aligning two sequences or smaller alignments to build up the final alignment. At each progressive step the sum of pairwise scores with quasi- natural gap costs is maximized (subject to the constraint that the step leaves columns of its input alignments intact). Thus, each step involves a pairwise alignment process that may be equivalent to a generalized (three nodes per grid point) optimal path problem.

For example, an evolutionary tree for (homologous sequences from) human, galago, rabbit and mouse is given in panel (A) of Fig. 18. The corresponding order for progressive alignment would be to align sequences from human and galago, then to align the resulting pairwise alignment with the rabbit sequences, and finally to align the 3-way alignment with the mouse sequence, denoted:

((human galago)rabbit)mouse

For human, galago, lemur and rabbit, the evolutionary tree in panel (B) suggests that one align the prosimian primate sequences (galago and lemur), then align the human sequence and the prosimian alignment, and finally align the 3-way alignment with the rabbit sequence, denoted:

(human(galago lemur))rabbit

lemur

(A) (B)

human galago rabbit mouse human galago rabbit

FIG. 18 Evolutionary trees. Branch lengths in these two idealized trees are not related to ev olutionary time.

Different substitution scores_σ_{p, q} are permitted for each sequence pair. Moreover, the programs support a flexible scheme for scoring gaps. In general, a gap of length k at the left end of sequence p is penalized g^L_p + e^L_pk, a gap of length k at the right end of sequence p is penalized g^R_p + e^R_pk, and an internal gap or progressive quasi-gap of length k in the pairwise alignment of sequences p and q is penalized g_{p, q} + e_{p, q}k. Different scores for each sequence pair are appropriate when aligning sequences at varying phylogenetic distances, and we need flexible end-gap scores (in particular, the option of setting them to 0) for aligning sequences of vastly different lengths.

(19)

One program, called yama (yet another multiple aligner), implements the general- ized Hirschberg’s linear-space algorithm (Fig. 16). The time taken to align alignments A and B of lengths M and N , respectively, is analyzed as follows. First, a pass over A determines, for each column of A, the sum of all pairwise_σ scores (i.e., for each pair of entries in that column), the number of pairwise gap starts and the number of pairwise gap ends (using the rules of Fig. 14). This requires time O(r²_AM ), where r_A is the number of rows in A. (If we wanted to penalize all quasi-gaps, not just the progressive quasi-gaps, we would also compute the number of additional gap ends that would be introduced by placing a null column before each column.) A similar pass is made over B. Let k_A,B be the number of pairs of basic sequences, one from A and one from B, i.e., k_A,B = r_Ar_B. Each of the_σ and_σ⁽²⁾ values in Fig. 15 costs O(k_A,B) time to compute. Since there are O(1) such values per grid point and O(MN ) grid points, the total time to align A and B, including the initial passes over A and B, is O(r²_AM+ r²BN+ kA,BMN ). Under any normal conditions (e.g., r_A ≤ N), the time is O(kA,BMN ). The computation time for delivering an optimal alignment is approximately twice the time needed to compute merely its score.

Thus computing an optimal alignment of A and B takes time O(k_A,BMN ). Finally, con- sider the time needed to align n sequences progressively. Let k_i denote k_A,B at the i^th step, where alignments A and B are aligned. There are n− 1 such steps, and the sum of the k_i is ½n(n− 1), since every pair of basic sequences is counted exactly once. Thus, if the final multiple alignment has length L, then the total time complexity is:

n−1

i

Σ

=1^O(kⁱ^L²⁾^{= O(n}²^L²⁾

Assuming that most of the basic sequences have length nearly L, the time for progressive multiple alignment is thus roughly equal to the time for computing all pairwise alignments. We remark that several other programs save time in each progressive step by aligning consensus sequences that summarize alignments A and B. We typically align only a few sequences (i.e., n is small) under conditions where execution time is not a con- cern, so the n²factor is quite affordable.

To increase efficiency, a step of the progressive alignment algorithm can be constrained to the portion of the dynamic-programming grid lying between two boundary lines. Another program, called yama2, performs constrained alignment using lower and upper bounding functions L_{p, q} and U_{p, q}for some or all sequence pairs. When a progres- sive step aligns alignments A and B, each pair ( p, q), such that sequence p appears in A and sequence q appears in B, creates a constraint on the step, and all such constraints are enforced.

Finally, yama3 considers constrained alignments consisting of aligned pairs in nearly-optimal alignments. In our intended usage, a set of ‘‘patterns’’ is specified (per- haps the consensus sequences for transcription factor binding sites); yama3 selects, from among all alignments with the highest score, an alignment with the largest number of conserved blocks that match a pattern.

For the alignment discussed in the next section, we began with a 73,326-basepair sequence that covers most of the_β-like globin gene locus of humans. Twenty-six shorter sequences, with lengths from a few hundred to over 10,000, were extracted from the corresponding sequence data for five other mammals. These supposedly homologous sequences were identified by applying a pairwise (k-best local) alignment program to