Divide and conquer method - Progressive multiple sequence alignment


3.1.2 Divide and conquer method

In this section, we use the divide-and-conquer method to design a memory-efficient algorithm for solving the CPSA problem with two given sequences $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$, a given ordered set $C = (C_1, C_2, \ldots, C_\gamma)$ of $\gamma$ constraints, and a given error threshold $\epsilon$.

Recall that Hirschberg [17] developed a linear-space algorithm for solving the longest common subsequence problem based on the technique of divide and conquer. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences [6, 21]. In this paper, we generalize Hirschberg's algorithm so that it can deal with constrained pairwise sequence alignment. Compared with the other extensions, our generalization is more complicated because the grid graph $G$ we deal with here is 3D instead of 2D, and the input sequences are accompanied by several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position $(i_{mid}, j_{mid}, k_{mid})$ on an optimal path from $M_0(0, 0)$ to $M_\gamma(m, n)$ in $G$, so that the constrained alignment problem can be divided into two smaller constrained alignment problems. These smaller problems are then divided recursively in the same way, and finally the optimal constrained alignment is obtained by merging the series of calculated mid-points (see Figure 3.2 for an illustration).
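To make the underlying idea concrete, the following is a minimal Python sketch of the classical divide-and-conquer scheme for the plain longest common subsequence problem, not of the constrained 3D algorithm developed in this section. The function names `lcs_lengths` and `hirschberg_lcs` are illustrative and do not come from the original work.

```python
def lcs_lengths(a, b):
    """Last row of the LCS-length table of a against every prefix of b.

    Only two rows are kept at any time, so the space is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev


def hirschberg_lcs(a, b):
    """Return one longest common subsequence of the strings a and b."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Forward scores for the first half of a, reverse scores for the second half.
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    # The split column of b is the one that maximizes the combined score.
    n = len(b)
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    # Solve the two smaller subproblems and concatenate their answers.
    return hirschberg_lcs(a[:mid], b[:split]) + hirschberg_lcs(a[mid:], b[split:])
```

The constrained version has to choose a constraint index and a position inside a constraint in addition to the split column, which is what makes the generalization harder than this 2D sketch suggests.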

Figure 3.2: Schematic diagram of the divide-and-conquer approach. The two light gray areas are the reduced subproblems after the middle position $(i_{mid}, j_{mid}, k_{mid})$ is determined; each of them is further divided into two subproblems, shown as dark gray areas.

Before describing our algorithm, we introduce some notation. Let $\overline{A}_i$ and $\overline{B}_j$ denote the suffixes $a_{i+1} a_{i+2} \ldots a_m$ and $b_{j+1} b_{j+2} \ldots b_n$ of $A$ and $B$, respectively, for $1 \le i \le m$ and $1 \le j \le n$. Let $\overline{C}_k$ denote the ordered subset $(C_{k+1}, C_{k+2}, \ldots, C_\gamma)$ for $1 \le k \le \gamma$. Define $\overline{M}_k(i, j)$ to be the score of an optimal constrained alignment of $\overline{A}_i$ and $\overline{B}_j$ w.r.t. $\overline{C}_k$, and define $\overline{M}^S_k(i, j)$, $\overline{M}^D_k(i, j)$ and $\overline{M}^I_k(i, j)$ to be the maximum scores of all constrained alignments of $\overline{A}_i$ and $\overline{B}_j$ w.r.t. $\overline{C}_k$ that begin with a substitution pair (i.e., $(a_{i+1}, b_{j+1})$), a deletion pair (i.e., $(a_{i+1}, -)$) and an insertion pair (i.e., $(-, b_{j+1})$), respectively.

Let $C_k(h) = (C_1, C_2, \ldots, C_{k-1}, C_{k,h})$ and $\overline{C}_k(h) = (\overline{C}_{k,h}, C_{k+1}, \ldots, C_\gamma)$, where $\overline{C}_{k,h} = c_{k,h+1} c_{k,h+2} \ldots c_{k,\lambda_k}$. Let $\overline{N}_k(i, j, h)$ denote the score of an optimal semi-constrained alignment $L$ of $\overline{A}_i$ and $\overline{B}_j$ w.r.t. $\overline{C}_k(h)$ that begins with a band whose induced consensus is equal to $\overline{C}_{k,h}$. Note that the recurrences for computing the matrices $\overline{M}_k$, $\overline{M}^S_k$, $\overline{M}^D_k$, $\overline{M}^I_k$ and $\overline{N}_k$ can be developed similarly to those for computing $M_k$, $M^S_k$, $M^D_k$, $M^I_k$ and $N_k$, respectively. Clearly, $M^S_k(i, j) = M_k(i-1, j-1) + \sigma(a_i, b_j)$.

For simplicity, let $A_i(h)$ (respectively, $B_j(h)$) denote the suffix of $A_i$ (respectively, $B_j$) of length $h$ (i.e., $A_i(h) = a_{i-h+1} \ldots a_i$ and $B_j(h) = b_{j-h+1} \ldots b_j$). If $d_H(A_i(\lambda_k), C_k) \le \lambda_k \times \epsilon$ and $d_H(B_j(\lambda_k), C_k) \le \lambda_k \times \epsilon$, then we can reformulate the recurrence of $N_k$ as follows: $N_k(i, j, 1) = M_{k-1}(i-1, j-1) + \sigma(a_i, b_j)$ and $N_k(i, j, h) = N_k(i-1, j-1, h-1) + \sigma(a_i, b_j)$ for each $1 < h \le \lambda_k$.
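The Hamming-distance condition that guards the recurrence of $N_k$ can be illustrated with a short Python sketch. It assumes that each constraint is a plain string, that $d_H$ counts mismatching positions, and that the error threshold is a fraction between 0 and 1. The helper names `hamming_distance`, `tail` and `band_may_end_here` are hypothetical and chosen only for this illustration.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(u, v))


def tail(seq, i, h):
    """Return a_{i-h+1} ... a_i (1-based, as in the text), or None if h > i."""
    if h > i or h < 0:
        return None
    return seq[i - h:i]


def band_may_end_here(a, b, i, j, c_k, eps):
    """Check d_H(A_i(lam), C_k) <= lam * eps and d_H(B_j(lam), C_k) <= lam * eps."""
    lam = len(c_k)
    seg_a, seg_b = tail(a, i, lam), tail(b, j, lam)
    if seg_a is None or seg_b is None:
        return False  # not enough characters to hold a band of length lam
    limit = lam * eps
    return (hamming_distance(seg_a, c_k) <= limit
            and hamming_distance(seg_b, c_k) <= limit)
```

For instance, with the constraint `"GTA"` and a threshold of 0, a band may end at positions (5, 5) of `"ACGTAC"` and `"TTGTAA"` because both length-3 segments ending there equal the constraint exactly.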

Next, we describe our divide-and-conquer algorithm, called the CPSA-DC algorithm, for computing an optimal constrained alignment between $A$ and $B$ w.r.t. $C$.

The key point is to determine the middle position $(i_{mid}, j_{mid}, k_{mid})$ of the optimal path in $G$, which divides the problem into two subproblems, each of which is recursively divided into two smaller subproblems in the same way. Given an alignment $L$, we use $score(L)$ to denote the score of $L$. Let $L_\gamma(A, B)$ be an optimal constrained alignment of $A$ and $B$ w.r.t. $C$; clearly, $score(L_\gamma(A, B)) = M_\gamma(m, n)$. Let $i_{mid} = \lfloor m/2 \rfloor$. We partition $L_\gamma(A, B)$ into two parts by cutting it at the position immediately after $a_{i_{mid}}$, and we let $L^1_\gamma(A, B)$ denote the part containing $a_{i_{mid}}$ and $L^2_\gamma(A, B)$ denote the remaining part. Let $b_{j_{mid}}$ be the last character of $B$ in $L^1_\gamma(A, B)$, and let $k_{mid}$ be the largest index such that a prefix of $C_{k_{mid}}$ of length $h_{mid}$ appears in $L^1_\gamma(A, B)$. Then there are two possibilities when we consider the last aligned pair of $L^1_\gamma(A, B)$.

Case 1: The last aligned pair of $L^1_\gamma(A, B)$ is a substitution pair (i.e., $(a_{i_{mid}}, b_{j_{mid}})$).

In this case, we have $M_\gamma(m, n) = score(L_\gamma(A, B)) = score(L^1_\gamma(A, B)) + score(L^2_\gamma(A, B))$. If $(a_{i_{mid}}, b_{j_{mid}})$ is not a constrained column in $L_\gamma(A, B)$, then $L^1_\gamma(A, B)$ is an optimal constrained alignment of $A_{i_{mid}}$ and $B_{j_{mid}}$ w.r.t. $C_{k_{mid}}$ ending with the substitution pair $(a_{i_{mid}}, b_{j_{mid}})$, and $L^2_\gamma(A, B)$ is an optimal constrained alignment of the suffixes $\overline{A}_{i_{mid}} = a_{i_{mid}+1} \ldots a_m$ and $\overline{B}_{j_{mid}} = b_{j_{mid}+1} \ldots b_n$ w.r.t. the remaining constraints $\overline{C}_{k_{mid}}$. Hence, $M_\gamma(m, n) = M^S_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid})$.

If $(a_{i_{mid}}, b_{j_{mid}})$ is a constrained column in $L_\gamma(A, B)$, then $L^1_\gamma(A, B)$ is an optimal semi-constrained alignment of $A_{i_{mid}}$ and $B_{j_{mid}}$ w.r.t. $C_{k_{mid}}(h_{mid})$ ending with a band $B_1$ whose induced consensus is equal to $C_{k_{mid}, h_{mid}}$. If $h_{mid} < \lambda_{k_{mid}}$, then $L^2_\gamma(A, B)$ is an optimal semi-constrained alignment of $\overline{A}_{i_{mid}}$ and $\overline{B}_{j_{mid}}$ w.r.t. $\overline{C}_{k_{mid}}(h_{mid})$ beginning with a band $B_2$ whose induced consensus is equal to $\overline{C}_{k_{mid}, h_{mid}}$. Moreover, the induced consensus of the merge of $B_1$ and $B_2$ has to be equal to $C_{k_{mid}}$. In this case, we compensate for it by adding $w_o$.

In summary, the recurrence of M_γ(m, n) is derived as follows. above recurrence is not changed, but can be reformulated as follows.

M_γ(m, n) = max

Now, we show how to use $O(\gamma \lambda n)$, instead of $O(\gamma m n)$, memory to determine $j_{mid}$, $k_{mid}$ and $h_{mid}$, where $\lambda = \max_{1 \le k \le \gamma} \lambda_k$ and usually $\lambda \ll m$. In fact, a single

Figure 3.3: ... the entry $(i, j, k)$ of $G$, marked with "?", is reached for the computation.

Note that the old values in entry $E(k, j)$ are moved into an extra entry, called $V_k$, whose space is equal to that of $E(k, j)$, before they are overwritten by their newly computed values. Before moving the old values in $E(k, j)$ into $V_k$, however, we first need to move $M_k(i-1, j-1)$ from $V_k$ into a space called $v_{k,k+1}$. This mechanism enables us to compute $N_k(i, j, 1)$, which refers to $M_{k-1}(i-1, j-1)$ kept in $v_{k-1,k}$; to compute $N_k(i, j, h)$ for each $2 \le h \le \lambda_k$, which refers to $N_k(i-1, j-1, h-1)$ kept in $V_k$; to compute $M^S_k(i, j)$, which refers to $M_k(i-1, j-1)$ kept in $V_k$; and finally to compute $M_k(i, j)$. Figure 3.3 shows the grid locations of $E(k-1)$, $E(k)$ and the values in $V_{k-1}$ and $V_k$ when we reach the entry $(i, j, k)$ of $G$ for the computation, where $E(k)$ denotes the $k$th row of $E$.

Hence, the total space needed for computing and storing all $M_k(i_{mid}, j)$, $M^S_k(i_{mid}, j)$, $M^D_k(i_{mid}, j)$, $M^I_k(i_{mid}, j)$ and $N_k(i_{mid}, j, h)$ is the sum of the space of matrix $E$, the space of all $V_k$, $0 \le k \le \gamma$, and the space of all $v_{k,k+1}$, $0 \le k < \gamma$, which is $O(\gamma \lambda n)$. Similarly, the process of computing and storing all $\overline{M}_k(i_{mid}, j)$, $\overline{M}^S_k(i_{mid}, j)$, $\overline{M}^D_k(i_{mid}, j)$, $\overline{M}^I_k(i_{mid}, j)$ and $\overline{N}_k(i_{mid}, j, h)$ also needs $O(\gamma \lambda n)$ space. Hence, the determination of $j_{mid}$, $k_{mid}$ and $h_{mid}$ can be done in $O(\gamma \lambda n)$ space. The details of the CPSA-DC algorithm are described as follows, where the program code of BestScoreRev is similar to that of BestScore and is hence omitted.

Algorithm CPSA-DC(i_start, i_end, j_start, j_end, k_start, k_end)

Input: Sequences a_{i_start} ... a_{i_end} and b_{j_start} ... b_{j_end} with constraints (C_{k_start}, ..., C_{k_end})

Step 1: if (i_start > i_end) or (j_start > j_end) then

    Align the nonempty sequence with spaces;

else

end if

CPSA-DC(i_mid + λ_k − h_mid + 1, i_end, j_mid + λ_k − h_mid + 1, j_end, k_mid + 1, k_end);

end if

if type = case 5 then

    CPSA-DC(i_start, i_mid − λ_k, j_start, j_mid − λ_k, k_start, k_mid − 1);

    Align a_{i_mid−λ_k+1} ... a_{i_mid} with b_{j_mid−λ_k+1} ... b_{j_mid};

    CPSA-DC(i_mid + 1, i_end, j_mid + 1, j_end, k_mid + 1, k_end);

end if

Algorithm BestScore(i_start, i_end, j_start, j_end, k_start, k_end)

Input: Sequences a_{i_start} ... a_{i_end} and b_{j_start} ... b_{j_end} with constraints (C_{k_start}, ..., C_{k_end})

Output:

Step 1: /* Reindex */

m = i_end − i_start + 1; n = j_end − j_start + 1; γ = k_end − k_start + 1;

Step 2: /* Initialization */

for j = 0 to n do
    for k = 0 to γ do
        if (j = 0) and (k = 0) then M_k(·, j) = 0; else M_k(·, j) = −∞;
        if (j = 0) or (k > 0) then M^I_k(·, j) = −∞; else M^I_k(·, j) = −w_o − j·w_e;
        M^S_k(·, j) = M^D_k(·, j) = −∞;
        if k ≥ 1 then
            for h = 1 to λ_k do
                N_k(·, j, h) = −∞;
            end for
        end if
    end for
end for

Step 3: /* Computation */

for i = 1 to m do
    for k = 0 to γ do /* For the case of j = 0 */
        V_k(M_k(·, 0)) = M_k(·, 0);
        if k ≥ 1 then
            for h = 1 to λ_k do
                V_k(N_k(·, 0, h)) = N_k(·, 0, h);
                V_k(H₁(k, h)) = H₁(k, h);
                V_k(H₂(k, h)) = H₂(k, h);
            end for
        end if
        M^S_k(·, 0) = M^I_k(·, 0) = −∞;
        M_k(·, 0) = M^D_k(·, 0) = −w_o − i·w_e;
    end for

    for j = 1 to n do /* For the case of j > 0 */
        for k = 0 to γ do
            temp_k(M_k(·, j)) = M_k(·, j);
            if k ≥ 1 then
                for h = 1 to λ_k do
                    temp_k(N_k(·, j, h)) = N_k(·, j, h);
                    temp_k(H₁(k, h)) = H₁(k, h);
                    temp_k(H₂(k, h)) = H₂(k, h);
                end for
            end if
            M^S_k(·, j) = V_k(M_k(·, j)) + σ(a_{i_start+i−1}, b_{j_start+j−1});
            M^D_k(·, j) = max{M^D_k(·, j) − w_e, M_k(·, j) − w_o − w_e};
            M^I_k(·, j) = max{M^I_k(·, j − 1) − w_e, M_k(·, j − 1) − w_o − w_e};
            if k ≥ 1 then
                for h = 1 to λ_k do
                    if h = 1 then
                        N_k(·, j, h) = v_{k−1,k} + σ(a_{i_start+i−1}, b_{j_start+j−1});
                        if a_{i_start+i−1} ≠ c_{k,h} then H₁(k, h) = 1; else H₁(k, h) = 0;
                        if b_{j_start+j−1} ≠ c_{k,h} then H₂(k, h) = 1; else H₂(k, h) = 0;
                    else
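The row-overwriting scheme used by BestScore can be sketched in a much simpler setting. The Python example below computes only the last row of an ordinary affine-gap global alignment (no constraints, no $N_k$ table), overwriting a single row in place and saving the diagonal value before it is lost, which plays the role that $V_k$ plays in the text. The function names `last_row_scores` and `match_score` and the scoring values are assumptions made for this illustration, and a gap of length $l$ is assumed to cost $w_o + l \cdot w_e$.

```python
NEG_INF = float("-inf")


def match_score(x, y):
    """Illustrative substitution score sigma: +1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def last_row_scores(a, b, sigma, w_o, w_e):
    """Return M(len(a), j) for j = 0..len(b) using O(len(b)) space."""
    n = len(b)
    # Row 0: a prefix of b can only be aligned with a single run of insertions.
    M = [0.0] + [-w_o - j * w_e for j in range(1, n + 1)]
    D = [NEG_INF] * (n + 1)  # best score ending with a deletion pair
    for i in range(1, len(a) + 1):
        diag = M[0]  # M(i-1, 0), saved before it is overwritten
        M[0] = D[0] = -w_o - i * w_e
        ins = NEG_INF  # best score ending with an insertion pair, at (i, 0)
        for j in range(1, n + 1):
            old = M[j]  # M(i-1, j): needed now and as the next diagonal
            sub = diag + sigma(a[i - 1], b[j - 1])
            D[j] = max(D[j] - w_e, old - w_o - w_e)
            ins = max(ins - w_e, M[j - 1] - w_o - w_e)
            M[j] = max(sub, D[j], ins)
            diag = old
    return M
```

Running the same routine on the reversed sequences gives the scores needed for the second half, in the same way that BestScoreRev complements BestScore.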

Now, we analyze the time complexity of our CPSA-DC algorithm for solving the constrained pairwise sequence alignment problem. As illustrated in Figure 3.2, after determining the middle position $(i_{mid}, j_{mid}, k_{mid})$ of the optimal path in $G$, we can divide the original problem into two subproblems, each of which can be further divided recursively into two smaller subproblems in the same way. Note that regardless of where the optimal path passes through $(i_{mid}, j_{mid}, k_{mid})$, the total size of the two reduced subproblems is at most half the size of the original problem, where the size is measured by the number of entries in $G$. It is not hard to see that the time needed to determine the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let $T$ denote the size of the original problem (i.e., $T = \gamma m n$). Then the total time complexity of our CPSA-DC algorithm is at most $T + \frac{T}{2} + \frac{T}{4} + \cdots = 2T$, which is at most twice that of the CPSA-DP algorithm.
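The halving argument can be written out as a short derivation. It is a sketch under simplifying assumptions that are not stated in the original: $m$ is even, so that $i_{mid} = m/2$, and the columns occupied by a split band are ignored.

```latex
\begin{align*}
\underbrace{i_{mid}\, j_{mid}\, k_{mid}}_{\text{first subproblem}}
 + \underbrace{(m - i_{mid})(n - j_{mid})(\gamma - k_{mid})}_{\text{second subproblem}}
 &= \frac{m}{2}\bigl( j_{mid} k_{mid} + (n - j_{mid})(\gamma - k_{mid}) \bigr) \\
 &\le \frac{m}{2}\bigl( j_{mid}\,\gamma + (n - j_{mid})\,\gamma \bigr)
  = \frac{\gamma m n}{2} = \frac{T}{2}, \\[4pt]
T + \frac{T}{2} + \frac{T}{4} + \cdots
 &= T \sum_{r=0}^{\infty} 2^{-r} = 2T .
\end{align*}
```

The inequality uses only $k_{mid} \le \gamma$ and $\gamma - k_{mid} \le \gamma$, so it holds wherever the optimal path crosses row $i_{mid}$.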

From the document: MuSiC and MuSiC-ME: Efficient Tools for Constrained Multiple Sequence Alignment (pp. 20-30)