
Algorithmica (1995) 13:106-134

Algorithmica

© 1995 Springer-Verlag New York Inc.

Linear-Space Algorithms that Build Local Alignments from Fragments 1

Kun-Mao Chao² and W. Miller²

Abstract. This paper presents practical algorithms for building an alignment of two long sequences from a collection of "alignment fragments," such as all occurrences of identical 5-tuples in each of two DNA sequences. We first combine a time-efficient algorithm developed by Galil and coworkers with a space-saving approach of Hirschberg to obtain a local alignment algorithm that uses O((M + N + F log N) log M) time and O(M + N) space to align sequences of lengths M and N from a pool of F alignment fragments. Ideas of Huang and Miller are then employed to develop a time- and space-efficient algorithm that computes n best nonintersecting alignments for any n > 1. An example illustrates the utility of these methods.

Key Words. Sequence comparison, Local alignment, Dynamic programming, Candidate-list paradigm, Linear-space algorithm.

1. Introduction. The unfortunate scarcity of interaction between biologists and computer scientists is well illustrated by the parallel developments of dynamic-programming methods for comparing sequences. Such methods were independently discovered by biologists (Needleman and Wunsch, 1970), computer scientists (Wagner and Fischer, 1974), and workers in other fields. (For a survey of the history, see Sankoff and Kruskal, 1983.) The precise relationship between the methods developed by the two communities is somewhat obscured by notational differences, most but not all of which are insignificant. Computer scientists typically make explicit the relationship between two sequences by producing a list of editing operations that converts one sequence to the other, while biologists prefer to see an alignment of the sequences. Moreover, mathematically oriented researchers find it natural to compare sequences using a distance "metric," i.e., where a sequence is at distance zero from itself and large values of the measure correspond to highly divergent sequences. On the other hand, biologists tend to think in terms of similarity scores, i.e., a sequence has a very high similarity score when compared with itself, similar sequences have a smaller, but still positive, similarity score, and a pair of unrelated sequences has a negative score. It is tempting to believe that there is some kind of "duality" between minimizing a distance measure and maximizing a similarity score that makes it immaterial which one is adopted. Indeed, this is true under certain circumstances (Smith et al., 1981), though not for "local" alignments (see below).

1 This work was supported in part by Grant R01 LM05110 from the National Library of Medicine.

2 Department of Computer Science, The Pennsylvania State University, University Park, PA 16802, USA.

Received June 1, 1992; revised February 18, 1993. Communicated by E. W. Myers.

Biologists need more general measurements of sequence relatedness than are typically considered by computer scientists. The most popular formulation in the computer-science literature is the "longest common subsequence problem," which is equivalent to scoring alignments by simply counting the number of exact matches. For comparing protein sequences, it is important to permit the bonus awarded for aligning two symbols to depend on the particular symbol pair (Feng et al., 1985). For both DNA and protein sequences, it is standard to penalize a long gap (i.e., deletion from one of the sequences) less than the sum of the penalties for a set of shorter gaps of the same total length (Fitch and Smith, 1983). This is usually accomplished by charging g + t·e for a gap of length t. Thus the "gap-open penalty" g is assessed for every gap, regardless of length, and an additional "gap-extension penalty" e is charged for every sequence entry in the gap. Such penalties are called affine gap penalties. Gotoh (1982) showed how to compute optimal alignments efficiently under such scoring rules.

Even more general models for quantifying sequence relatedness have been proposed. For example, it is sometimes useful to have the penalty for adding a symbol to a gap depend on the position of the gap within the sequence (Gribskov et al., 1990), which is motivated by the observation that insertions in certain regions of a protein sequence can be much more likely than at other regions. Another generalization is to let the incremental gap cost δi = ci+1 − ci, where a k-symbol gap costs ck, be a monotone function of i, e.g., δ1 ≥ δ2 ≥ ⋯ (Waterman, 1984; Miller and Myers, 1988; Galil and Giancarlo, 1989). In this paper we model only affine gap penalties since we are concerned mainly with DNA sequences (for which position-dependent penalties are not so useful as for proteins) and since concave or convex gap penalties have yet to be found truly useful. Indeed, there is some evidence that monotonic gap-extension penalties incorrectly model nature in certain circumstances (Pascarella and Argos, 1992).
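To make the affine gap model that the paper adopts concrete, here is a minimal Python sketch (our own illustration, not part of the original paper); the names gap, g, and e mirror the notation above, and the particular penalty values are arbitrary.

def gap(t, g=1.0, e=0.5):
    # Affine gap penalty: one gap-open charge g plus a per-symbol extension charge e.
    return g + t * e if t > 0 else 0.0

# A single gap of length 6 is penalized less than three gaps of total length 6,
# which is the behaviour the affine model is meant to capture.
assert gap(6) < gap(2) + gap(2) + gap(2)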

Besides needing flexible ways of scoring alignments, biologists typically want to compute kinds of alignments that are not equivalent to the models studied by computer scientists, who generally consider a problem equivalent to global alignment, i.e., computing an optimal alignment that is required to extend from the starts of the given sequences to their ends. Biologists frequently find it more useful to seek an alignment that is highest-scoring among all alignments between an arbitrary section of the first sequence and an arbitrary section of the second sequence (Smith and Waterman, 1981), which is called the local alignment problem.

Probably the most useful of these variations for aligning biological sequences is that of computing "n best nonintersecting" local alignments. Care must be taken to formalize this notion in a way that allows subtle matches lying near much stronger (but perhaps less interesting) matches to be found, while still permitting efficient computation (Waterman and Eggert, 1987; Waterman, 1989). Earlier attempts to define the "n best local alignments problem" in terms of minimizing a measure of the distance between two sequences led to substantially more cumbersome algorithms (Goad and Kanehisa, 1982; Sellers, 1984).

On the computer-science side, Hirschberg (1975) discovered a method for computing longest common subsequences using only linear space (space proportional to the sum of the sequence lengths) rather than the naive quadratic space (space proportional to the product of the sequence lengths). Although space, rather than time, is often the constraining factor when applying dynamic-programming techniques to biological sequences (e.g., with a DNA sequence of length 50,000, a quadratic-space method uses billions of computer memory locations), biologists did not discover the technique for themselves. While Hirschberg's original formulation was for alignment scores that are unrealistically simple for applications in biology, Myers and Miller (1988) (also Miller and Myers, 1988) extended the approach to affine gap costs. Moreover, linear-space methods have been developed for the "n best local alignment problem" (Huang et al., 1990; Huang and Miller, 1991).
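Because Hirschberg's divide-and-conquer idea is reused throughout this paper, a compact sketch of it for the plain longest-common-subsequence problem may help fix ideas. This is our own illustration of the classical technique cited above, not the fragment algorithm developed below, and the function names are ours.

def lcs_row(a, b):
    # Last row of the LCS length table for a versus b, using O(len(b)) space.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(curr[-1], prev[j]))
        prev = curr
    return prev

def hirschberg(a, b):
    # Recover an actual longest common subsequence using only linear space.
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = lcs_row(a[:mid], b)              # forward scores to the split row
    right = lcs_row(a[mid:][::-1], b[::-1])  # backward scores from the end
    # Split b at the column where the forward and backward scores add up best.
    k = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])

# Prints one longest common subsequence of the two short sequences of Figure 1.
print(hirschberg("GACTTGACTAGAG", "AGCTACTGTGAAT"))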

To attain greater speed, biologists have employed the strategy of building alignments from alignment fragments (Wilbur and Lipman, 1983, 1984). For example, one could specify some fragment length k ≥ 1 and work with fragments consisting of a segment of length at least k that occurs in both sequences. With protein sequences, it might well work better to begin with inexact but high-scoring matches, such as those used by the blast program (Altschul et al., 1990) for other purposes. In any case, algorithms that optimize the score over alignments constructed from fragments can run faster than algorithms that optimize over all possible alignments.

Alignments constructed from fragments (or often just the alignments' scores) have been very successful as initial filtering criteria within programs that search a sequence database for matches to a query sequence; database sequences whose alignment score with the query sequence falls below a threshold are ignored, and the remainder are subjected to a slower but higher-resolution alignment process.

Moreover, the high-resolution process can be made more efficient by restricting the search to a "neighborhood" of the alignment-from-fragments (Pearson and Lipman, 1988; Pearson, 1990; Chao et al., 1992).

It is straightforward to construct an optimal alignment from fragments in O(F²) time, where F is the total number of fragments (Wilbur and Lipman, 1983). Eppstein et al. (1992a) develop such an alignment algorithm that runs in O(F log log F) time. (Strictly speaking, these times should involve the two sequence lengths as additive terms, and the "log log F" can be improved slightly.) However, the data structure employed to obtain this theoretical efficiency is unusable in practice. With a practical data structure, the complexity becomes O(F log F), which is still a great improvement over O(F²) for problems of the size we regularly solve.

It should be noted that the basic approach of Eppstein et al. was discovered independently by Myers and Huang (1992) in solving a different problem; they present an alternative and very useful graphical description of how the method works.

By adapting a number of existing ideas and proposing a few new ones, this paper develops a method for constructing n best nonintersecting local alignments from given alignment fragments. Moreover, the algorithm runs in linear space and utilizes alignment scoring schemes that are appropriate for biological problems.

We begin with a review of the algorithm of Eppstein et al., reformulated to compute the score of the best local alignment with affine gap costs. The next step, which is again straightforward, is to utilize Hirschberg's approach to find an optimal alignment (not merely its score) using only linear space. The final step, and the one involving a serious extension of what was already known, is to compute n best local alignments from fragments, following the general outline of Huang and Miller (1991).

The strength of this paper lies in the utility of the algorithm developed. The algorithm is central to our plans for dealing with DNA sequences of lengths between 10⁵ and 10⁶, an area that will soon be important. Whereas alignments built from, say, exact matches of length 5 are not sufficient, in themselves, for our studies of gene regulation and molecular evolution (Hardison and Miller, 1993), this algorithm provides a sensitive filtering procedure for locating regions whose similarity can then be measured by a more accurate technique, like that of Chao et al. (1993). Moreover, its rather considerable algorithmic complexity is appropriate; it is expected to outperform simpler methods substantially.

A useful attribute of the algorithm given here is that its position on the sensitivity-versus-speed spectrum is tuned by the choice of initial fragments. For instance, in an example at the end of the paper, using 6-words (exact matches of length at least 6) permits adequate detection of fairly subtle similarities with a 16-fold speedup compared with full-resolution alignment. Near the other extreme, Chao et al. (1993) use 8-words to attain a 1000-fold speedup when finding an alignment of length over 100,000 (between chloroplast genomes of tobacco and liverwort).

Unfortunately, several desirable topics must be omitted from this report. A thorough evaluation of the method's effectiveness for its intended uses would require specification of the accompanying high-resolution alignment technique and a careful discussion of biological properties of sequences that are only now starting to become available; hence it lies beyond the scope of this paper. Also, only an incomplete theoretical analysis is offered, since no one has as yet shown how to give a satisfying performance analysis of even the simplest realistic n-best alignment method (Waterman and Eggert, 1987).

2. The Basic Definitions. Let the given sequences be A = a1a2⋯aM and B = b1b2⋯bN. The discussion is simplified by specifying a set of alignment fragments and their scores, though it is straightforward to apply the algorithms described below to other classes of alignment fragments. With this in mind, define a fragment to be a triple (i, j, k) such that ai = bj, ai+1 = bj+1, …, ai+k−1 = bj+k−1, and k ≥ k_min, for a fixed minimum fragment length k_min. Moreover, a fragment is to be maximal, i.e., not properly contained in another fragment. Fragment f' = (i', j', k') is said to be above fragment f = (i, j, k) if i' + k' ≤ i and j' + k' ≤ j, and f is then below f'. Notice that this defines a partial-ordering relation. An alignment is defined as a sequence of fragments f1, f2, …, fl such that fi is above fi+1 for i = 1, …, l − 1. Henceforth, the terms "alignment" and "path" are used interchangeably. See Figure 1. It should be noted that these definitions originate from Eppstein et al. (1992a).
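For concreteness, the definitions above can be coded directly. The Python sketch below is our own illustration (the helper names are ours); it uses a straightforward diagonal scan in place of the suffix tree mentioned in Section 3 to enumerate maximal fragments of length at least k_min, with the same 1-based (i, j, k) triples as the text.

from collections import namedtuple

Fragment = namedtuple("Fragment", "i j k")   # a[i..i+k-1] == b[j..j+k-1], 1-based

def diag(f):
    return f.j - f.i

def is_above(fp, f):
    # fp = (i', j', k') is above f = (i, j, k): it ends at or before the start of f
    # in both sequences, which is the partial order used to chain fragments.
    return fp.i + fp.k <= f.i and fp.j + fp.k <= f.j

def maximal_fragments(a, b, k_min):
    # Simple O(MN) enumeration by scanning every diagonal for maximal runs of
    # matches; the paper generates fragments in O(M + N + F) time with a suffix
    # tree, but this is enough to experiment with the definitions.
    M, N = len(a), len(b)
    frags = []
    for d in range(-(M - 1), N):                 # diagonal d = j - i (0-based here)
        i, j = (0, d) if d >= 0 else (-d, 0)
        run = 0
        while i < M and j < N:
            if a[i] == b[j]:
                run += 1
            else:
                if run >= k_min:
                    frags.append(Fragment(i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= k_min:
            frags.append(Fragment(i - run + 1, j - run + 1, run))
    return frags

frags = maximal_fragments("GACTTGACTAGAG", "AGCTACTGTGAAT", 2)
print(len(frags), "maximal fragments of length >= 2")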

[Figure 1. (a) Graphical representation of the alignment fragments consisting of exact matches of length 2 or more between GACTTGACTAGAG and AGCTACTGTGAAT. (b) Traditional representation of the local alignment generated from the three dark diagonal line segments of (a):

C T T G A C T - A G A
C T - - A C T G T G A ]

The score of fragment f = (i, j, k) is defined as k and denoted sc(f). Penalties for connecting fragments are needed for scoring alignments. Let the nonnegative constant g and positive constants e and r be the penalties for opening up a gap (horizontal or vertical displacement), for extending a gap by one symbol, and for replacing one symbol by another, respectively. The affine function gap(t) = g + te is charged for a gap of length t. Define the diagonal of fragment f = (i, j, k) to be j − i, denoted by Diag(f). Suppose f' = (i', j', k') is above f = (i, j, k). The penalty of connecting f' and f, denoted by Connect(f', f), is defined as follows (see Figure 2):

[Figure 2. Connecting two fragments: (1) Diag(f) = Diag(f'); (2) Diag(f) > Diag(f'); (3) Diag(f) < Diag(f').]

Case 1: Diag(f) = Diag(f'):

Connect(f', f) = (i − i' − k')r.

Case 2: Diag(f) > Diag(f'):

Connect(f', f) = gap(Diag(f) − Diag(f')) + (i − i' − k')r.

Case 3: Diag(f) < Diag(f'):

Connect(f', f) = gap(Diag(f') − Diag(f)) + (j − j' − k')r.

We assume that r < 2e, guaranteeing that the penalty for connecting f' and f is the minimum among all possible ways of connection. The score of an alignment (f1, f2, …, fl) is defined as the sum of the fragment scores, minus the connection penalties of adjacent fragments, i.e.,

Σ_{i=1..l} sc(fi) − Σ_{i=1..l−1} Connect(fi, fi+1).

For example, if we take g = e = r = 1, then the score of the alignment in Figure 1(b) is 7 − 6 = 1. Notice that all symbol replacements are considered equal; there are no provisions for utilizing identical or similar sequence elements between fragments. It is this idealization that permits efficient computation of maximum-score alignments.
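The connection penalty and the alignment score can be transcribed directly from the definitions above. The sketch below (helper names are ours) reproduces the worked example: with g = e = r = 1 and the three fragments we read off Figure 1(b), namely (3, 3, 2), (7, 5, 3), and (11, 10, 2), the score comes out as 7 − 6 = 1.

from collections import namedtuple

Fragment = namedtuple("Fragment", "i j k")       # as in the sketch above

def diag(f):
    return f.j - f.i

def connect(fp, f, g=1, e=1, r=1):
    # Penalty for following fragment fp = (i', j', k') by a fragment f = (i, j, k)
    # below it, with gap-open g, gap-extension e, and replacement penalty r.
    def gap(t):
        return g + t * e
    if diag(f) == diag(fp):                                      # Case 1
        return (f.i - fp.i - fp.k) * r
    if diag(f) > diag(fp):                                       # Case 2
        return gap(diag(f) - diag(fp)) + (f.i - fp.i - fp.k) * r
    return gap(diag(fp) - diag(f)) + (f.j - fp.j - fp.k) * r     # Case 3

def alignment_score(path, g=1, e=1, r=1):
    # Sum of fragment scores minus the connection penalties of adjacent fragments.
    total = sum(f.k for f in path)
    return total - sum(connect(path[i], path[i + 1], g, e, r)
                       for i in range(len(path) - 1))

path = [Fragment(3, 3, 2), Fragment(7, 5, 3), Fragment(11, 10, 2)]
print(alignment_score(path))    # 1, as computed in the text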

The local alignment problem is thus to find an alignment with the highest score.

For a global alignment problem, it is necessary to assess additional penalties for connecting the alignment to (0, 0) and (M + 1, N + 1).

Let Score(f) be the maximum score over all alignments ending at f, so that max_f {Score(f)} gives the score of the best local alignment. Because alignments must use whole fragments, the principle of optimality yields the recurrence relation:

Score(f) = max{0, max_{f' above f} {Score(f') − Connect(f', f)}} + sc(f).

A straightforward dynamic-programming method, essentially just an optimal-path algorithm for directed acyclic graphs, solves the recurrence in O(F 2) time, where F is the number of fragments (Wilbur and Lipman, 1984).
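A hedged sketch of this straightforward quadratic method follows (the Wilbur-Lipman-style baseline, not the faster candidate-list algorithm of the next section): fragments are processed in a topological order of the "above" relation, and every earlier fragment is tried as a predecessor. The function is generic in connect, so it can be used with the connection penalty and Fragment helpers sketched earlier; the names are ours.

def best_local_score(fragments, connect):
    # O(F^2) evaluation of the recurrence Score(f) = max{0, max over f' above f of
    # Score(f') - Connect(f', f)} + sc(f), where sc(f) = k.
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))   # start row, then column
    score = {}
    best = 0
    for idx, f in enumerate(frags):
        i, j, k = f
        s = 0
        for fp in frags[:idx]:                 # every fragment above f starts earlier
            ip, jp, kp = fp
            if ip + kp <= i and jp + kp <= j:  # fp is above f
                s = max(s, score[fp] - connect(fp, f))
        score[f] = s + k
        best = max(best, score[f])
    return best

For instance, feeding it the maximal_fragments and connect helpers sketched above gives the best fragment-based local alignment score of the Figure 1 sequences.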

3. The Algorithm of Eppstein et al. The algorithm of Eppstein et al. (1992a) is based on the "candidate-list paradigm" (Miller and Myers, 1988; Galil and Park, 1992), i.e., we keep lists of all fragments f' that are candidates for maximizing Score(f') − Connect(f', f) for some fragment f at which Score(f) will later be evaluated. The computation proceeds by rows. When f's starting position is reached, the fragment above it that determines Score(f) can be found by searching through some candidate lists. Once f's end position is reached, f is added to any candidate lists that should contain it.

[Figure 3. Influence regions: the region below a fragment is divided into a left-influence region, a diagonal-influence region, and a right-influence region.]

To facilitate the process, the region below a fragment is divided into three disjoint subregions, as depicted in Figure 3. Suppose f' = (i', j', k') is above f = (i, j, k). Then f is said to be under the right influence of f' if Diag(f') < Diag(f), while f is under the left influence of f' if Diag(f') > Diag(f). The right-influence region of f' is defined to be the range consisting of all points (x, y) such that i' + k' ≤ x ≤ M, j' + k' ≤ y ≤ N, and y − x > Diag(f'). Similarly, the left-influence region of f' is defined to be the range consisting of all points (x, y) such that i' + k' ≤ x ≤ M, j' + k' ≤ y ≤ N, and y − x < Diag(f'). Let

RI(f) = max{Score(f') − Connect(f', f) such that f is under the right influence of f'},

LI(f) = max{Score(f') − Connect(f', f) such that f is under the left influence of f'},

DI(f) = max{Score(f') − Connect(f', f) such that Diag(f) = Diag(f')}.

Then Score(f) = max{0, RI(f), LI(f), DI(f)} + sc(f).

DI(f) can be determined by considering only the nearest previous fragment on the same diagonal. RI(f) requires more effort, but is similar to, and somewhat simpler than, LI(f). Thus, we discuss only computation of LI(f); the interested reader can refer to Eppstein et al. (1992a) for the algorithm's complete description.
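Since DI(f) depends only on the nearest previous fragment on f's diagonal, it can be served from a small map keyed by diagonal. A minimal sketch of that bookkeeping (names ours) is:

def di(f, nearest, score, r=1):
    # DI(f) from the nearest previous fragment on f's diagonal, or 0 if there is
    # none usable.  `nearest` maps a diagonal j - i to the last fragment already
    # processed on it; `score` maps fragments to their computed Score values.
    fp = nearest.get(f.j - f.i)
    if fp is None or fp.i + fp.k > f.i:
        return 0
    return score[fp] - (f.i - fp.i - fp.k) * r   # Connect on a common diagonal

# After Score(f) has been computed, record f as the most recent fragment on its
# diagonal:  nearest[f.j - f.i] = f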

Lemma 1 says that if f is superior to f' for some point in their common left-influence region (Figure 4), it is superior for all points in that region. Define Decay(f: x, y) as Score(f) − Connect(f, h), where h is an imaginary fragment starting at (x, y).

LEMMA 1. If Decay(f: x, y) ≥ Decay(f': x, y) for some (x, y) in the common left-influence region of fragments f and f', then the inequality holds for all entries in that region.

[Figure 4. Common left-influence region of f and f'.]

PROOF. It suffices to show that Decay(f: x, y) − Decay(f': x, y) is independent of x and y. If f = (i, j, k), then

Decay(f: x, y) = Score(f) − gap(Diag(f) + x − y) − (y − j − k)r,

where r is the symbol-replacement penalty. An analogous equation holds for f' = (i', j', k'), so

Decay(f: x, y) − Decay(f': x, y) = Score(f) − Score(f') + (Diag(f') − Diag(f))e + (j + k − j' − k')r,

where e is the gap-extension penalty. □

A fragment f' is said to left-dominate a region if, for any fragment f starting in that region, LI(f) = Score(f') − Connect(f', f). Figure 5 depicts the regions that are left-dominated by each fragment. (Ties can be broken arbitrarily.) Each row is divided into intervals by the vertical and diagonal lines that separate regions, and the same fragment can left-dominate more than one of these intervals (e.g., fragment f1 in Figure 6(2)). Two candidate lists are used for computing LI(f).

[Figure 5. Regions left-dominated by some fragments. Circles indicate the lower end of a fragment, and solid circles specify those fragments that are still "active" at the last row.]

[Figure 6. Locating the fragment that left-dominates the interval containing p.]

One sorted list, denoted by LC, gives the columns where vertical region boundaries intersect the current row. Given a point p in the current row, LC is searched for the largest entry strictly less than p, to which is attached a pointer to the "active" fragment that left-dominates the interval immediately to the entry's right. The other sorted list, denoted by LD, gives the diagonals where region boundaries intersect the current row. Given p, LD is searched for the largest entry not exceeding p's diagonal, to which is attached a pointer to the active fragment that left-dominates the interval beginning at that diagonal. If p is the initial point of fragment f, then LI(f) is easily calculated since one of the two fragments found by the searches determines LI(f).

Figure 6 illustrates the process of searching LC and LD to find the active fragment that left-dominates the interval containing p. Searching LC finds c and returns f1. In Figure 6(1), p is in the interval left-dominated by f1. In Figure 6(2), p is not under the left influence of f1, but searching LD finds d and returns f3, which left-dominates d. (Notice that d is not in the left-influence region of f2.) Incidentally, Figure 6(2) indicates why it will not work to search LD for the first entry larger than p's diagonal.

Once LI(f) has been computed for all f's starting at the current row, LC and LD are updated for the next row as follows. For each fragment f ending at the current row, first determine if it left-dominates some region. In general (except when f ends on some column boundary in LC or some diagonal boundary in LD), this is done by searching LC and LD to find the fragment, say f', that left-dominates the endpoint of f, and then computing Decay(f: x, y) − Decay(f': x, y) for any (x, y) in their common left-influence region. If f is superior to f', then modify LC and LD to reflect the new region left-dominated by f. Several cases arise when f is added to LC and LD. Here we take one case as an example. Readers can refer to Eppstein et al. for more details.

Suppose that the closest boundary to the left of f's endpoint is a diagonal and the closest boundary to the right is a column (see Figure 7). We keep intersection lists, denoted by CUT(i), giving the columns where two boundary lines intersect in row i. In the case at hand, the borders on either side of f's endpoint intersect at a point a in the CUT list for some later row (Figure 7), and a must be removed from that list. Two new intersection points, b and c, must be added to the appropriate lists (if they fall within the grid). Moreover, f now left-dominates the region R2 and f' left-dominates the regions R1 and R3. Thus, we add one column left-dominated by f to LC and one diagonal left-dominated by f' to LD.

[Figure 7. The case when the endpoint of f lies between a diagonal and a column boundary.]

Before leaving row i, points saved in CUT(i) need to be treated. For each intersection point, we extend either the column or the diagonal, in the following sense. At a given intersection point, three regions come together (see Figure 8). Let f1, f2, and f3 be the left-dominating fragments for regions R1, R2, and R3, respectively. If Decay(f1: x, y) > Decay(f3: x, y) for some (x, y) in their common left-influence region, then we:

(1) Remove from LC the column where f3 ends.

(2) Change the pointer for diagonal Diag(f1) from f2 to f3.

(3) If the interval just right of the intersection point is terminated by a column boundary, we add an entry to the CUT list for the row where diagonal Diag(f1) crosses that column.

Otherwise, we:

(1) Remove diagonal Diag(f1) from LD.

(2) If the interval just left of the intersection point begins at a diagonal boundary, we add an entry to the appropriate CUT list.

The following pseudocode gives the outline for computing LI(f) and updating LC, LD, and the CUT lists.

[Figure 8. An intersection point, where regions R1, R2, and R3 come together.]

for i ← 1 to M do
{   for each fragment f starting at row i do
    {   f' ← SEARCHL(LC, LD, f)
        LI(f) ← Score(f') − Connect(f', f)
    }
    for each fragment f ending at row i do
        UPDATEL(LC, LD, CUT, f)
    Handle points in CUT(i).
}

Consider now the running time of the complete alignment algorithm. The diagonal influence DI(f) can be computed in O(1) time since it involves merely finding the nearest fragment above f and on f's diagonal. RI(f) can be computed in a simpler way than LI(f) because the right-influence regions are bounded by rows (instead of columns) and diagonals. Therefore, the intersection lists are no longer needed. Furthermore, only one list, sorted by diagonals, needs to be maintained. Suppose that the candidate lists (LC, LD, and the list for the right influence) are implemented as balanced search trees, so that each search or update operation takes O(log t) time, where the size of the tree is t ≤ N. The fragments can be generated in O(M + N + F) time using a suffix tree, so the total time for the above algorithm, ignoring the updating of CUT lists, is O(M + N + F log N).

Note that the total number of CUT points handled is at most 2F, since a fragment determines at most two boundary lines (Figure 5) and a boundary line dies at each intersection point. Each CUT point can be handled in O(log N) time, so the total time for a balanced-tree implementation is O(M + N + F log N).

4. A Linear-Space Algorithm for Local Alignment. When long DNA sequences are aligned, it may be impractical to store all of the fragments. For example, there are 3,504,057 maximal fragments of length at least 5 between a 73,326-symbol DNA sequence containing the human β-like globin gene cluster and a corresponding 44,594-symbol sequence for a rabbit.

To compute the optimal score of a local alignment, it is sufficient to keep only those fragments in current candidate lists (e.g., LC and LD). Fragments that start in row i are generated when the algorithm reaches row i. Also, for each fragment f , the number of candidate lists containing f is maintained, and f is deleted when the number reaches zero. Since the size of a candidate list is O(M + N), the total space requirement for this score-only method is O(M + N).

Explicitly producing a best local alignment in linear space is more difficult. Our approach works in two phases. In outline, we locate the first and last fragments of a best local alignment, then use a linear-space global alignment algorithm to compute an optimal global alignment for the sequence segments bounded by these two fragments. In order to locate the first and last fragments, we compute for each f the first fragment on a path to f of score Score(f). When Score(f) is determined, this first fragment is either f itself (if Score(f) = sc(f)) or equal to the first fragment for the f' determining Score(f). A somewhat more efficient strategy for locating the first and last fragments can be found in Huang et al. (1990).
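The bookkeeping for the first fragment amounts to carrying one extra pointer through the recurrence. The sketch below shows it on the quadratic baseline of Section 2 for clarity (the same pointer travels unchanged through the candidate-list algorithm); the names are our own.

def scores_and_first(frags_in_order, connect):
    # frags_in_order must respect the "above" partial order (e.g. sorted by start
    # row, then start column).  Returns Score(f) and the first fragment of a best
    # path ending at f, for every fragment f.
    score, first = {}, {}
    for idx, f in enumerate(frags_in_order):
        i, j, k = f
        best_s, best_fp = 0, None
        for fp in frags_in_order[:idx]:
            ip, jp, kp = fp
            if ip + kp <= i and jp + kp <= j:          # fp is above f
                s = score[fp] - connect(fp, f)
                if s > best_s:
                    best_s, best_fp = s, fp
        score[f] = best_s + k
        first[f] = f if best_fp is None else first[best_fp]
    return score, first

The last fragment of a best local alignment is any f maximizing score[f], and first[f] then gives the other endpoint needed for the divide-and-conquer phase described next.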

Locating the first and last fragments of a best local alignment reduces the problem to generating a global alignment in the region bounded by these two fragments. Specifically, if the first fragment is f' = (i', j', k') and the last fragment is f = (i, j, k), then the local alignment we seek consists of f', followed by a global alignment of a_{i'+k'} a_{i'+k'+1} ⋯ a_{i−1} and b_{j'+k'} b_{j'+k'+1} ⋯ b_{j−1} (discarding fragments that expand to larger fragments in the complete sequences), followed by f. To solve the global problem, we use the strategy devised by Hirschberg (1975). Begin by applying a global, cost-only variant of the algorithm of Eppstein et al. It differs from the local alignment algorithm described above in that the 0 term in the equation

Score(f) = max{0, RI(f), LI(f), DI(f)} + sc(f)

is replaced by a penalty for reaching f from the starts of the sequence segments defining the global problem. Let Score⁻(f) denote this global "forward" score at f. The process is stopped after processing the middle row, i.e., row m = ⌊(i' + k' + i − 1)/2⌋, where i', etc., are as above, and the current candidate lists are saved. Then an "inverted" version of the algorithm is applied to compute Score⁺(f), defined as the maximum score of an alignment from f to the ends of the segments defining the global problem. This backward pass is stopped after processing row m + 1.

The goal is now to identify one or two fragments near the middle of an optimal alignment, then recursively compute the alignment's remaining prefix and suffix.

Several possibilities arise when identifying the middle fragments from the information retained in the candidate lists created by the forward and backward passes to the middle rows (see Figure 9). As is now shown, each possibility can be checked in O(c) time, where there are c = j − j' − k' columns in the subproblem, so we need only check them all and pick the one yielding the best alignment.

The first case is that an optimal alignment uses a fragment that includes rows m and m + 1 (Figure 9(1)). For each such crossing fragment f, Score⁻(f) + Score⁺(f) − sc(f) gives the best score of all (global for the subproblem) alignments using f, and all of these values can be inspected in O(c) time.

[Figure 9. Cases that arise when dividing a global alignment problem.]

For all the remaining cases, fix an optimal alignment, let f⁻ be the last fragment on the alignment that lies on or above row m, and let f⁺ be the first fragment on or after row m + 1. (Actually f⁻ and f⁺ could be "pseudofragments" corresponding to the upper left or lower right corners of the subproblem, which are added to create a global alignment problem.) Our second case is where f⁻ and f⁺ lie on the same diagonal. In O(c) time we can loop over all fragments associated with lists for DI in the upper and lower problems, in order of increasing diagonal, and determine all potential pairs (f⁻, f⁺). The optimal score over all paths that jump from f⁻ to f⁺ is Score⁻(f⁻) + Score⁺(f⁺) − Connect(f⁻, f⁺).

Case (3) of Figure 9 is where Diag(f⁺) > Diag(f⁻). As before, our decision to compute by rows makes handling right influences simpler than left influences, so we omit explicit treatment of Case (3).

When Diag(f⁺) < Diag(f⁻), it is impossible for both f⁻ to intersect row m and f⁺ to intersect row m + 1. (This is because f⁺ has to start strictly after the column where f⁻ ends.) Without loss of generality, suppose that f⁻ does not intersect row m. Project the lower end of f⁻ vertically onto row m, obtaining point p of Cases (4a) and (4b) of Figure 9. Thus if f⁻ = (i⁻, j⁻, k⁻), then p = (m, j⁻ + k⁻ − 1). There are two subcases.

The first subcase is when the diagonal containing p does not exceed Diag(f⁺).

For each fragment f⁺ associated with the list for DI of the lower problem, let f⁺'s diagonal intersect row m + 1 at point q. Recall that the candidate lists for the upper problem were updated to be ready for row m + 1 when treating row m. Point q is under the left influence of f⁻, so f⁻ can be determined from q in O(log c) time. The best score for a path jumping from f⁻ to f⁺ is again Score⁻(f⁻) + Score⁺(f⁺) − Connect(f⁻, f⁺). Better yet, we can arrange that the possible fragments f⁺ and the left-influence intervals in row m + 1 can be enumerated in order of increasing diagonal in O(c) time, which gives a linear-time treatment of Case (4a).

The other subcase is when the diagonal containing p exceeds Diag(f⁺). Let point s be the intersection of row m + 1 and the diagonal containing p. Then s is under the left influence of f⁻ and p is under the backward left influence of f⁺. Therefore, given p, we could determine s, f⁻, f⁺, and the highest score of a global alignment that jumps from f⁻ to f⁺, all in O(log c) time. Actually, we can arrange that the left-influence intervals in row m + 1 and the backward left-influence intervals in row m can be enumerated from left to right in O(c) time, which covers Case (4b) in linear time.

The fragments f⁻ and f⁺ (or f in Case (1)) reduce the problem to two subproblems (Figure 10), which are solved recursively. Problems with fewer than k_min rows or columns, where k_min is the minimum fragment length, are handled by doing nothing. Clearly, an optimal alignment is computed by this method using only O(M + N) space.

Let T(M, N, F) denote the worst-case time to apply the global "divide-and-conquer" alignment algorithm to sequences of lengths M and N with F fragments.

[Figure 10. The two subproblems.]

Our next goal is to show that

T(M, N, F) = O((M + N + F log N) log M).

It then follows immediately that the same time bound holds for the local alignment algorithm (which performs an additional O(M + N + F log N)-time pass to find the first and last fragment).

First, consider subproblem sizes when a problem is divided (Figure 10). Both subproblems have fewer than M/2 rows. If the upper subproblem has N⁻ columns and F⁻ fragments and the lower subproblem has N⁺ columns and F⁺ fragments, then N⁻ + N⁺ ≤ N and F⁻ + F⁺ ≤ F. Previous considerations show that the time to split a problem with size parameters M, N, and F into those subproblems (not counting the time to solve the subproblems) is bounded by τ(M + N + F log N) for some constant τ, where we interpret log x to mean max{log₂ x, 1}. Then T(M, N, F) ≤ τ(M + N + F log N) log M. To see this, first note that it holds if the problem is such that no recursive calls are made. For other problems, by induction:

T(M, N, F) ≤ τ(M + N + F log N) + T(M/2, N⁻, F⁻) + T(M/2, N⁺, F⁺)
           ≤ τ(M + N + F log N) + τ(M/2 + N⁻ + F⁻ log N⁻) log(M/2) + τ(M/2 + N⁺ + F⁺ log N⁺) log(M/2)
           ≤ τ(M + N + F log N) + τ(M + N + F log N) log(M/2)
           = τ(M + N + F log N) log M.

5. The n Best Local Alignments. A pair of long and related sequences will often exhibit a number of important local similarities. To discover them, it is inadequate to determine a highest-scoring alignment, a second-highest, a third-highest, and so on, since trivial perturbations of the highest-scoring alignment will often dominate the list. The following strategy yields far more useful results. First, compute a highest-scoring alignment. Remove all fragments in that alignment and

find a highest-scoring alignment from the remaining fragments. Remove all fragments in that second alignment and find a highest-scoring alignment from the remaining fragments, and so on until n alignments have been reported. We refer to this process as computing n best nonintersecting local alignments. Formally speaking, two local alignments of sequences A and B intersect if they share a fragment. A list α1, α2, …, αn of alignments of A and B is referred to as n best local alignments of A and B if α1 is a highest-scoring local alignment and if, for 2 ≤ i ≤ n, αi is highest scoring among all local alignments that do not intersect α1, α2, …, αi−1. It should be noted that different tie-breaking rules may result in different n best local alignments. (See Huang and Miller (1991) for an analogous example.)

A straightforward implementation of computing n best nonintersecting local alignments, which starts anew with each reduced set of fragments, is unnecessarily inefficient. Typically, most or all of the computed alignments will be far shorter than the underlying sequences, and discarding the alignment's fragments affects Score(f) only for fragments f lying near the alignment. The idea, then, is to develop an incremental approach that repeats only those parts of the computation where results may change. For traditional sequence alignment via dynamic programming, Waterman and Eggert (1987) developed a quadratic-space algorithm and a linear-space algorithm was given by Huang and Miller (1991).

We next present a time-efficient, linear-space algorithm for constructing the n best nonintersecting alignments from fragments, following the strategy of Huang and Miller (1991). It is assumed that n is known a priori. In outline, the algorithm works as follows. A forward pass is made through the entire set of fragments to find the first and last fragments on an optimal alignment. This pass differs from the earlier procedure in that as paths to fragments f are discovered, they are divided into equivalence classes according to the first fragment on a highest-scoring alignment ending at f , and information about the n best pairwise nonequivalent paths is retained. When fragments of a highest-scoring alignment are discarded, it is sufficient to recompute scores for fragments in the equivalence class containing the alignment's fragments (Lemma 2, below).

5.1. Equivalence Classes. In general, we use G to denote a set of fragments.

Specifically, G1 is the original set of maximal fragments between sequences A and B, and Gm for m > 1 is obtained from Gm−1 by removing the fragments of a highest-scoring local alignment. Let Scorem(f) be the maximum score over all alignments from Gm ending at f, and let ≺ denote any topological order on the fragments in G1 (relative to the "above" relation). Firstm(f) is defined to be the last fragment in this ordering such that there is an alignment of score Scorem(f) from that fragment to f using only fragments in Gm (i.e., the topological order is used to break ties).

LEMMA 2. Fix m ≥ 1 and let u be the fragment such that Gm+1 is formed from Gm by removing the fragments of an optimal alignment from Firstm(u) to u. If v is a fragment with Firstm(v) ≠ Firstm(u), then Scorem+1(v) = Scorem(v) and Firstm+1(v) = Firstm(v).

PROOF. The critical observation is that an optimal path (alignment) from Firstm(v) to v cannot share a fragment with an optimal path from Firstm(u) to u. To see this, suppose that f occurred on both paths. Without loss of generality, Firstm(v) follows Firstm(u) in the chosen topological order, and it follows readily that an optimal path from Firstm(v) to u exists, a contradiction. (For more details, see the proof of Lemma 1 of Huang and Miller (1991).) □

Define a relation Em over the fragments in Gm by u Em v if and only if Firstm(u) = Firstm(v). Em is an equivalence relation, and hence partitions the fragments in Gm into equivalence classes. For each equivalence class C of Em, define Scorem(C) = max{Scorem(f): f ∈ C}.

Let W be the (n − m + 1)th highest equivalence-class score in Gm. (As alignments are reported and the equivalence classes are refined, W will in general increase.) The effective region of f is chosen so that if f' starts outside of the effective region, then Score(f) − Connect(f, f') ≤ W. The following lemma explains how to determine the effective region.

LEMMA 3. Suppose Scorem(f) > W and define

h = ⌈max{(Scorem(f) − W − g)/e, (Scorem(f) − W)/r}⌉,

where g, e, and r are the penalties for connecting fragments. If f' is a fragment lying below f and starting more than h rows or more than h columns after the end of f, then Scorem(f) − Connect(f, f') ≤ W.

PROOF. First suppose that Diag(f') = Diag(f). If more than h rows separate f and f', then Connect(f, f') > hr ≥ Score(f) − W, i.e., Score(f) − Connect(f, f') ≤ W. Otherwise, suppose without loss of generality that f' is under the left influence of f. Let f = (i, j, k) and f' = (i', j', k'). Connect(f, f') = gap(Diag(f) − Diag(f')) + (j' − j − k)r = g + (j − i − j' + i')e + (j' − j − k)r = g + h1·e + h2·r, where h1 + h2 = i' − i − k ≥ h. If r ≥ e, then Connect(f, f') ≥ g + he ≥ g + Score(f) − W − g = Score(f) − W. Otherwise, Connect(f, f') ≥ g + hr ≥ hr ≥ Score(f) − W. □

Let Box(T, L, B, R) denote the rectangle whose upper left corner is (T, L) and lower right corner is (B, R). Let f = (i, j, k). If Score(f) > W, then the effective region of f is the rectangle Box(i + k − 1, j + k − 1, min{M, i + k − 1 + h}, min{N, j + k − 1 + h}), where h is defined in Lemma 3. In general, these regions are square (see Figure 11), except when truncated at an edge of the dynamic-programming grid. If Score(f) < W, then f's effective region is empty.
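A direct transcription of Lemma 3 and the definition above into Python (our own function name; the threshold W, the sequence lengths M and N, and the penalties are passed in):

from math import ceil

def effective_region(f, score_f, W, M, N, g=1, e=1, r=1):
    # Box(T, L, B, R) such that a fragment starting outside it cannot follow f on
    # an alignment scoring more than the threshold W; empty when Score(f) < W.
    if score_f < W:
        return None
    h = ceil(max((score_f - W - g) / e, (score_f - W) / r))
    i, j, k = f
    top, left = i + k - 1, j + k - 1
    return (top, left, min(M, top + h), min(N, left + h))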

Each of the retained equivalence classes C is represented by a 7-tuple:

(S, F, u, T, L, B, R),

[Figure 11. A rectangle containing all effective regions of an equivalence class.]

where

S = Scorem(C),

F = Firstm(f) for all f ∈ C,

Firstm(u) = F and Scorem(u) = Scorem(C), and

Box(T, L, B, R) contains all effective regions of fragments in C.

Henceforth, we use tuple to designate such a 7-tuple, and refer to the entries of tuple C by C.S, C.F, …, C.R.

5.2. Algorithm Outline. We are now ready to discuss the algorithm outline of Huang and Miller (1991) in more detail.

Algorithm outline

1. Compute n best tuples (S, F, u, T, L, B, R) in G1 in a single sweep
   for m ← 1 to n do
2. {   C ← a maximum-score tuple in Gm
3.     Construct an optimal alignment from C.F to C.u
       if m ≠ n then
4.     {   Determine T ≤ C.T and L ≤ C.L so that no alignment from Gm+1 starting outside Box(T, L, C.B, C.R) and ending inside Box(C.T, C.L, C.B, C.R) has score greater than W
5.         Obtain n − m best tuples in Gm+1 by recomputing Box(T, L, C.B, C.R)
       }
   }

Once a best class C from Gm has been located and its optimal alignment reported, we need to discover any high-scoring alignments that were hidden by that alignment. To do so, perform a backward computation to locate row T and column L such that it is sufficient to recompute the region Box(T, L, C. B, C. R).

(Any alignment starting outside Box(T, L, C.B, C.R) and ending in Box(C.T, C.L, C.B, C.R) has score at most W, and hence can be ignored.) That is, a forward pass will then be performed inside Box(T, L, C.B, C.R) to look for fragments where Scorem+1(f) exceeds W, in which case the set of retained equivalence classes is altered. It should be noted that potentially more efficient methods exist for delimiting a recomputation region that is not necessarily rectangular, but keeping to rectangular regions simplifies the discussion.

For step 1, the algorithm of Section 3 is employed to find n best tuples, which are maintained in a list. When a better-score tuple is found, it replaces a minimum-score tuple in the list. Step 3 is accomplished by applying the linear-space method discussed in Section 4. Since the (forward) recomputation in step 5 is straightforward and similar to step 1, we leave it as an exercise to the reader.

Step 4 is more involved. In Section 5.3 we discuss a backward computation to locate T and L in step 4. Finally, Section 5.4 describes a variant of the algorithm of Eppstein et al. that is required in the backward computation.

5.3. The Backward Computation. In the following we describe how to locate T and L in step 4. Let Score(f) denote the maximum score over all alignments starting at f = (i,j, k) and ending in Box(T, L, C. B, C. R), and let Last(f) be the obvious analog of First(f). Define

Extt(f) = max{0, i − ⌈max{Score(f)/e, Score(f)/r}⌉},
Extl(f) = max{0, j − ⌈max{Score(f)/e, Score(f)/r}⌉}.

Notice that opening up a gap is not penalized. Roughly speaking, Extt(f) and Extl(f) are the boundary row and column, respectively, such that extending from a row ≤ Extt(f) or a column ≤ Extl(f) to f will not gain any additional score. The following two lemmas pave the way for termination conditions. (Only the alignments ending in Box(T, L, C.B, C.R) are considered.)
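In the same spirit, Extt and Extl are simple arithmetic on the backward score of Section 5.3; a small sketch (our own names, with that Score(f) passed in as back_score):

from math import ceil

def ext_bounds(f, back_score, e=1, r=1):
    # (Extt(f), Extl(f)): extending an alignment from a row at or above Extt(f),
    # or from a column at or left of Extl(f), to f gains no additional score.
    reach = ceil(max(back_score / e, back_score / r))
    i, j, _ = f
    return max(0, i - reach), max(0, j - reach)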

LEMMA 4. If f is a fragment lying above f' and ending above row Extt(f'), such that the effective region of f does not intersect any row ≥ Extt(f'), then the score of any alignment in which f and f' are adjacent is at most W.

PROOF. First we show Connect(f, f') ≥ Score(f) − W + Score(f'). Let f = (i, j, k) and f' = (i', j', k').

Case 1: Diag(f') = Diag(f). By definition,

Connect(f, f') = (i' − i − k)r
             = (Extt(f') − i − k)r + (i' − Extt(f'))r
             = Connect(f, f'') + ⌈max{Score(f')/e, Score(f')/r}⌉·r
               (where f'' is a pseudofragment starting at p = (Extt(f'), Extt(f') + Diag(f')))
             ≥ Score(f) − W + Score(f')   (by Lemma 3).

[Figure 12. Divide Connect(f, f') at row Extt(f'): cases (1), (2), (3a), and (3b).]

Case 2: Diag(f') > Diag(f). The proof for this case is similar to Case 1.

Case 3a: Diag(f') < Diag(f) and j + k − 1 ≤ Extt(f') + Diag(f'). The proof for this case is similar to Case 1.

Case 3b: Diag(f') < Diag(f) and j + k − 1 > Extt(f') + Diag(f'). Since the effective region of f does not intersect any row ≥ Extt(f'), it is easy to see that (Score(f) − W − g)/e < Diag(f) − (j + k − Extt(f')). Thus,

Score(f) − W < g + (Diag(f) − (j + k − Extt(f')))e.

By definition,

Connect(f, f') = gap(Diag(f) − Diag(f')) + (j' − j − k)r
             = g + (Diag(f) − Diag(f'))e + (j' − j − k)r
             = g + (Diag(f) − (j + k − Extt(f')))e + (j + k − Extt(f') − Diag(f'))e + (j' − j − k)r
             ≥ Score(f) − W + (j + k − Extt(f') − j' + i')e + (j' − j − k)r
             = Score(f) − W + h1·e + h2·r
               (where h1 + h2 = i' − Extt(f') = ⌈max{Score(f')/e, Score(f')/r}⌉)
             ≥ Score(f) − W + Score(f').

It follows that the score of an optimal alignment in which f and f' are adjacent is Score(f) + Score(f') − Connect(f, f') ≤ W. □

[Figure 13. Subcase 2 of Lemma 6.]

LEMMA 5. If f is a fragment lying above f' and ending left of column Extl(f'), such that the effective region of f does not intersect any column ≥ Extl(f'), then the score of any alignment in which f and f' are adjacent is at most W.

Now, termination conditions for a backward computation are given. It involves simultaneously determining an intermediate rectangle Box(T', L', C.B, C.R). The backward computation is terminated when it meets the following two conditions (see Figure 13):

(i) If f is still active after we have treated row T and column L, and Last(f) ends in Box(T', L', C.B, C.R), then Extt(f) ≥ T and Extl(f) ≥ L. (By "active" we mean those fragments associated with some candidate list or those ending closest to T and L on their diagonal.)

(ii) No rectangle for a saved class intersects the F-shaped shaded region. (T < T' unless T' = 0, and L < L' unless L' = 0.)

The following lemma shows that when the backward computation meets the termination conditions, it fulfills the need of step 4.

LEMMA 6. Let P be an alignment that starts outside Box(T, L, C.B, C.R) and ends in Box(C.T, C.L, C.B, C.R). Then Score(P) ≤ W.

PROOF. Assume T ≠ 0 and L ≠ 0 (cases when T = 0 and/or L = 0 can be handled similarly). Let f be the first fragment of P that ends in Box(T, L, C.B, C.R). There are two subcases.

Subcase 1: f starts outside Box(T, L, C.B, C.R). By condition (i), it is clear that Last(f) must end in the shaded region, which implies Score(P) ≤ W by condition (ii).

Subcase 2: f starts in Box(T, L, C.B, C.R) (Figure 13). In this case, some alignment P' aligning an active fragment (possibly f itself), say f', exists such that f' is the first fragment of P' that starts in Box(T, L, C.B, C.R), and Score(P) ≤ Score(P').

We now show that Score(P') ≤ W. If Last(f') ends outside Box(T', L', C.B, C.R), it is easy to show that Score(P') ≤ Score(Last(f')) ≤ W. Otherwise, Last(f') must end in Box(T', L', C.B, C.R), which means Extt(f') ≥ T and Extl(f') ≥ L by condition (i). Together with condition (ii), this guarantees that the effective region of any fragment in P' that is above f' does not intersect any row ≥ Extt(f') or any column ≥ Extl(f'). By Lemmas 4 and 5, we have Score(P') ≤ W. Thus Score(P) ≤ W. □

Figure 14 gives the backward computation for locating T and L. Assume that LIST stores the n − m + 1 best tuples for Gm, and that only fragments in Gm+1 are considered. When locate stops, the two termination conditions are met. Furthermore, T is guaranteed to be strictly less than T' unless T' = 0, and similarly for L. Section 5.4 explains how to update lists affected by adding row t and column l.

5.4. Interleaving Computations in the Row and Column Directions. Since earlier discussions concerned top-to-bottom computations, the following description is couched in those terms, though in actuality the computation of T and L runs in the reverse direction (Figure 14). Computation of T and L differs from the computation of the first and last fragments of an optimal local alignment in the following respects.

1. Fragments contained in alignments reported earlier are not considered.

2. Fragments that cross column C. R or row C. B are ignored.

3. Columns and rows are added to the computed region (in an effort to satisfy condition (i), above), so row and column lengths vary in an unpredictable manner.

4. If extending T and/or L causes the region to intersect a rectangle associated with another equivalence class, C', then T' and L' must be extended to guarantee T' ≤ C'.T and L' ≤ C'.L (Figure 13).

Additions (1) and (2) do not warrant further discussion here, and the procedure disjoint of Figure 14 covers condition (4). The only important complication caused by property (3) is that candidate lists for rows might be affected while computing in the column direction, and vice versa. For example, in Figure 15, adding more columns to the region might cause q to right-dominate some interval of row t + 1 (the candidate lists for row t + 1 remain after treating row t), though q is currently not represented as influencing row t + 1. Note that computation of RI(f) in the column direction works like computation of LI(f) in the row direction.

5.4.1. Updating the Right-Influence Candidate List for Row t + 1 when Adding Column l. Consider the effect upon the right-influence candidate list for row t + 1 from the fragments ending at column l before or at row t. There are two phases in the updating procedure. The first phase deletes those ignorable fragments. Let p and q be fragments ending at column l where q ends above p. Fragment q is said to be ignorable if Decay(p: x, y) ≥ Decay(q: x, y) for some (x, y) in their common right-influence region (see Figure 15). This is because the right-influence

Procedure locate
    Perform a backward computation in region Box(C.T, C.L, C.B, C.R).
    t ← T' ← C.T
    l ← L' ← C.L
    T ← max{0, min{T' − 1, min{Extt(f): f ends in Box(C.T, C.L, C.B, C.R)}}}
    L ← max{0, min{L' − 1, min{Extl(f): f ends in Box(C.T, C.L, C.B, C.R)}}}
    repeat
    {   while t > T or l > L do
        {   while t > T do
            {   t ← t − 1
                for each fragment f ending at row t within columns l and C.R do
                {   Compute Score(f)
                    if Last(f) ends in Box(T', L', C.B, C.R) then
                    {   T ← min{T, Extt(f)}
                        L ← min{L, Extl(f)}
                    }
                }
                Update lists affected by adding row t
            }
            while l > L do
            {   l ← l − 1
                for each fragment f ending at column l within rows t and C.B do
                {   Compute Score(f)
                    if Last(f) ends in Box(T', L', C.B, C.R) then
                    {   T ← min{T, Extt(f)}
                        L ← min{L, Extl(f)}
                    }
                }
                Update lists affected by adding column l
            }
        }
    } until disjoint(T, L, T', L', C.B, C.R) or T = L = 0

boolean disjoint(var T, L, T', L'; b, r)
    for each c in LIST do
    {   if c.T ≤ b and c.L ≤ r and c.B ≥ T and c.R ≥ L and (c.T < T' or c.L < L') then
        {   T' ← min{T', c.T}
            L' ← min{L', c.L}
            T ← max{0, min{T, T' − 1}}
            L ← max{0, min{L, L' − 1}}
            for each active fragment f do
                if Last(f) ends in Box(T', L', b, r) then
                {   T ← min{T, Extt(f)}
                    L ← min{L, Extl(f)}
                }
            return false
        }
    }
    return true

Fig. 14. Algorithm for computing T and L.

[Figure 15. The effect on Rlist when adding one more column.]

region of q after row t is contained in p's right-influence region and the inequality holds for all entries in that region (analogous to Lemma 1). Let c1, c2, …, ch be the list of the fragments ending at column l before row t + 1, in decreasing row order. The following pseudocode removes the ignorable fragments from the list.

u ← 1
v ← 2
while v ≤ h do
{   if Decay(cu: x, y) ≥ Decay(cv: x, y) for some (x, y) in their common right-influence region then
        remove cv from the list
    else
        u ← v
    v ← v + 1
}

After the removal of those ignorable fragments, the remaining fragments have the property that if one fragment, say f, ends above another, say f', in column l before row t + 1, then Decay(f: x, y) > Decay(f': x, y) for all (x, y) in their common right-influence region.

Phase 2 is to determine which of the remaining fragments from phase 1 could possibly right-dominate some region after row t. Let Rlist be the right-influence candidate list, sorted by diagonals, for row t + 1 before adding column l. For each of the remaining fragments from phase 1, say f, search Rlist with Diag(f) to find the right-dominating fragment, say f'. If Decay(f: x, y) > Decay(f': x, y) for some (x, y) in their common right-influence region, f is added to Rlist. Adding f might cause the deletion of some fragments in Rlist. Those fragments can be detected by

sweeping Rlist from Diag(f) to the right until reaching the end of Rlist, or some fragment, say f'', such that Decay(f: x, y) < Decay(f'': x, y) for some (x, y) in their common right-influence region.

The time spent in phase 1 is O(h), which can be charged as O(1) per fragment.

For each remaining fragment, phase 2 performs one search operation, one possible insertion, and some possible deletions caused by adding that fragment. Suppose that the right-influence candidate list is implemented as a balanced search tree, so that each of the search, insertion, and deletion operations can be done in time O(log SR), where SR is the maximum size of the right-influence candidate list.

Charge the search and insertion cost to the fragment itself. However, the deletion cost is charged to the deleted fragment. Since each fragment can be deleted from the right-influence candidate list at most once, it follows that for each fragment we charge O(log SR) cost in these two phases.

5.4.2. Updating the Left-Influence Candidate Lists for Row t + 1 when Adding Column l. Let LC and LD be the left-influence candidate lists for row t + 1 before adding column l. Let CUT be the intersection lists for the row direction before adding column l. There are two phases in updating LC, LD, and the CUT lists for the row direction. The first phase deletes those ignorable fragments. Let p and q be fragments ending at column l where q ends above p. Fragment p is said to be ignorable if Decay(q: x, y) ≥ Decay(p: x, y) for some (x, y) in their common left-influence region (see Figure 16). This is because the left-influence region of p after row t is contained in q's left-influence region and the inequality holds for all entries in that region (Lemma 1). The method for this phase is similar to phase 1 in Section 5.4.1. After the removal of those ignorable fragments, the remaining fragments have the property that if one fragment, say f, ends above another, say f', in column l before row t + 1, then Decay(f: x, y) < Decay(f': x, y) for all (x, y) in their common left-influence region.

[Figure 16. The effect on LC, LD, and CUT when adding a column.]
