• 沒有找到結果。

2.2 Progressive multiple sequence alignment

3.1.2 Divide and conquer method

In this section, we shall use divide-and-conquer method to design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a1a2. . . am and B = b1b2. . . bn, a given order set C = (C1, C2, . . . , Cγ) of γ constraints and a given error threshold .

Recall that Hirschberg [17] developed a linear-space algorithm for solving the longest common subsequence problem based on the technique of divide and conquer. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences [6, 21]. In this paper, we generalize the Hirschberg’s algorithm so that it is able to deal with the constrained pairwise sequence alignment. As compared with others, our generalization is more complicated because the grid graph G we deal with here is 3D, instead of 2D, and the input sequences are accompanied with several constraints which need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (imid, jmid, kmid) on an optimal path from M0(0, 0) to Mγ(m, n) in G so that we are able to divide the constrained alignment problem into two smaller constrained alignment problems, then these smaller constrained alignment problems are continued to be divided in the same way, and finally the optimal constrained alignment is obtained completely by merging the series of the calculated mid-points (see Figure 3.2 for an illustration).

Before describing our algorithm, some notation must be introduced as follows. Let Aiand Bj denote the suffixes ai+1ai+2. . . amand bj+1bj+2. . . bnof A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let Ckdenote the ordered subset (Ck+1, Ck+2, . . . , Cγ) for 1 ≤ k ≤ γ. Define Mk(i, j) to be the score of an optimal constrained alignment of Ai and Bj w.r.t. Ck, and define MSk(i, j), MDk(i, j) and MIk(i, j) to be the maximum score of all constrained alignments of Ai and Bj w.r.t. Ck that begin with a substitution pair

PSfrag replacements

M0(0, 0)

Mγ(m, n) imid

kmid

jmid

Figure 3.2: Schematic diagram of divide and conquer approach: two light gray areas are the reduced subproblems after middle position (imid, jmid, kmid) is determined, each of which will be further divided into two subproblems of dark gray areas.

(i.e, (ai+1, bj+1)), a deletion pair (i.e., (ai+1, −)) and an insertion pair (i.e., (−, bj+1)), respectively. Let Ck(h) = (C1, C2, . . . , Ck−1, Ck,h) and Ck(h) = (Ck,h, Ck+1, . . . , Cγ), where Ck,h= ck,h+1ck,h+2. . . ck,λk. Let Nk(i, j, h) denote the score of an optimal semi-constrained alignment L of Ai and Bj w.r.t. Ck(h) that begins with a band whose induced consensus is equal to Ck,h. Note that the recurrences for computing matrices Mk, MSk, MDk, MIk and Nk can be developed similarly as those for computing Mk, MSk, MDk, MIkand Nk, respectively. Clearly, MSk(i, j) = Mk(i−1, j−1)+σ(ai, bj). For simplicity, let Ai(h) (respectively, Bj(h)) denote the suffix of Ai (respectively, Bj) with length of h (i.e., Ai(h) = ai−h+1. . . ai and Bj(h) = bj−h+1. . . bj). If dH(Aik), Ck) ≤ λk×  and dH(Bjk), Ck) ≤ λk× , then we can reformulate the recurrence of Nk as follows: Nk(i, j, 1) = Mk−1(i − 1, j − 1) + σ(ai, bj) and Nk(i, j, h) = Nk(i − 1, j − 1, h − 1) + σ(ai, bj) for each 1 < h ≤ λk.

Next, we describe our divide-and-conquer algorithm, called as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. C as follows.

The key point is to determine the middle position (imid, jmid, kmid) of the optimal path in G to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment L, we use

score(L) to denote the score of L. Let Lγ(A, B) be an optimal constrained alignments of A and B w.r.t. C and clearly score(Lγ) = Mγ(m, n). Let imid = bm2c. Then we partition Lγ(A, B) into two parts by cutting it at the position immediately after aimid and we let L1γ(A, B) denote the part containing aimid and L2γ(A, B) denote the remaining part. Let bjmid be the last character in L1γ(A, B) from B, and let kmid be the largest index so that a prefix of Ckmid with length hmid appears in L1γ(A, B). Then there are two possibilities when we consider the last aligned pair of L1γ(A, B).

Case 1: The last aligned pair of L1γ(A, B) is a substitution pair (i.e., (aimid, bjmid)).

In this case, we have Mγ(m, n) = score(Lγ(A, B)) = score(L1γ(A, B)) + score(L2γ(A, B)). If (aimid, bjmid) is not a constrained column in Lγ(A, B), then L1γ(A, B) is an optimal constrained alignment of Aimid and Bjmid w.r.t. Ckmid ending with a substitution pair (aimid, bjmid), and L2γ(A, B) is an optimal constrained alignment of Aimid and Bjmid w.r.t. Ck. Hence, Mγ(m, n) = MSk

mid(imid, jmid) + Mkmid(imid, jmid).

If (aimid, bjmid) is a constrained column in Lγ(A, B), then L1γ(A, B) is an optimal semi-constrained alignment of Aimid and Bjmid w.r.t. Ckmid(hmid) ending with a band B1 whose induced consensus is equal to Ckmid,hmid. If hmid < λkmid, then L2γ(A, B) is an optimal semi-constrained alignment of Aimid and Bjmid w.r.t. Ckmid(hmid) beginning with a band B2 whose induced consensus is equal to Ckmid,hmid. Moreover, the induced consensus of the merge of B1 and B2 have to be equal to Ckmid. In this case, we compensate it by adding wo.

In summary, the recurrence of Mγ(m, n) is derived as follows. above recurrence is not changed, but can be reformulated as follows.

Mγ(m, n) = max

Now, we show how to use O(γλn), instead of O(γmn), memory to determine jmid, kmid and hmid, where λ = max1≤k≤γλk and usually λ << m. In fact, a single

Note that the old values in entry E(k, j) will be moved into an extra entry, called as Vk

PSfrag replacements the entry (i, j, k) of G, marked with ”?”, is reached for the computation.

whose space is equal to E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into Vk, however, we need to first move Mk(i−1, j −1) in Vkinto a space, called as vk,k+1. The mechanism above will enable us to compute Nk(i, j, 1), which needs to refer to Mk−1(i − 1, j − 1) that is kept in vk−1,k, and compute Nk(i, j, h) for each 2 ≤ h ≤ λk, which needs to refer to Nk(i−1, j−1, h−1) that is kept in Vk, compute MSk(i, j) which needs to refer Mk(i−1, j −1) that is kept in Vk, and finally we are able to compute Mk(i, j). Figure 3.3 shows the grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when we reach the entry (i, j, k) of G for the computation, where E(k) denotes the the kth row of E. Hence, the totally needed space for computing and storing all Mk(imid, j), MSk(imid, j), MDk(imid, j) MIk(imid, j) and Nk(imid, j, h) is the sum of the space of matrix E, the space of all Vk, 0 ≤ k ≤ γ, and the space of all vk,k+1, 0 ≤ k < γ, which is equal to O(γλn). Similarly, the process of computing and storing all Mk(imid, j), MSk(imid, j), MDk(imid, j) MIk(imid, j) and Nk(imid, j, h) still needs O(γλn) space. Hence, the determination of jmid, kmid and hmid can be done in O(γλn) space. The details of CPSA-DC algorithm are described as follows, where the program code of BestScoreRev is similar to that of BestScore and hence is omitted.

Algorithm CPSA-DC(istart, iend, jstart, jend, kstart, kend)

Input: Sequences aistart. . . aiend and bjstart. . . bjend with constraints (Ckstart, . . . , Ckend) Step 1: if (istart > iend) or (jstart > jend) then

Align the nonempty sequence with spaces;

else

end if

CPSA-DC(imid+ λk− hmid+ 1, iend, jmid+ λk− hmid+ 1, jend, kmid+ 1, kend);

end if

if type = case 5 then

CPSA-DC(istart, imid− λk, jstart, jmid− λk, kstart, kmid− 1);

Align aimid−λk+1. . . aimid with bjmid−λ+1. . . bjmid; CPSA-DC(imid+ 1, iend, jmid+ 1, jend, kmid+ 1, kend);

end if

Algorithm BestScore(istart, iend, jstart, jend, kstart, kend)

Input: Sequences aistart. . . aiend and bjstart. . . bjend with constraints (Ckstart, . . . , Ckend) Output:

Step 1: /* Reindex */

m = istart− iend+ 1; n = jstart− jend+ 1; γ = kstart− kend+ 1;

Step 2: /* Initialization */

for j = 0 to n do for k = 0 to γ do

if (j = 0) and (k = 0) then Mk(·, j) = 0; else Mk(·, j) = −∞;

if (j = 0) or (k > 0) then MIk(·, j) = −∞; else MIk(·, j) = −wo− jwe; MSk(·, j) = MDk(·, j) = −∞;

if k ≥ 1 then

for h = 1 to λk do Nk(·, j, h) = −∞;

end for end if end for end for

Step 3: /* Computation */

for i = 1 to m do

for k = 0 to γ do /* For the case of j = 0 */

Vk(Mk(·, 0)) = Mk(·, 0);

if k ≥ 1 then

for h = 1 to λk do

Vk(Nk(·, 0, h)) = Nk(·, 0, h));

Vk(H1(k, h)) = H1(k, h);

Vk(H2(k, h)) = H2(k, h);

end for end if

MSk(·, 0) = MIk(·, 0) = −∞;

Mk(·, 0) = MDk(·, 0) = −wo− jwe; end for

for j = 1 to n do /* For the case of j > 0 */

for k = 0 to γ do

tempk(Mk(·, j)) = Mk(·, j) ; if k ≥ 1 then

for h = 1 to λk do

tempk(Nk(·, j, h)) = Nk(·, j, h);

tempk(H1(k, h)) = H1(k, h);

tempk(H2(k, h)) = H2(k, h);

end for end if

MSk(·, j) = V (Mk(·, j)) + σ(aistart+i−1, bjstart+j−1);

MDk(·, j) = max{MDk(·, j) − we, Mk(·, j) − wo− we};

MIk(·, j) = max{MIk(·, j − 1) − we, Mk(·, j − 1) − wo− we};

if k ≥ 1 then

for h = 1 to λk do if h = 1 then

Nk(·, j, h) = vk−1,k+ σ(aistart+i−1, bjstart+j−1);

if aistart+i−1 6= ck,h then H1(k, h) = 1; else H1(k, h) = 0;

if bjstart+j−1 6= ck,h then H2(k, h) = 1; else H2(k, h) = 0;

else

Now, we analyze the time complexity of our CPSA-DC algorithm for solving the constrained pairwise sequence alignment. As illustrated in Figure 3.2, after determining the middle position (imid, jmid, kmid) of the optimal path in G, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (imid, jmid, kmid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of the entries in G. In is not hard to see that the time complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of

the subproblem. Let T denote the size of the original problem (i.e., T = γmn). Then the total time complexity of our CPSA-DC algorithm is equal to T +T2 +T4 + · · · = 2T , which is twice as high as the CPSA-DP algorithm.

相關文件