MuSiC and MuSiC-ME：有效率的限制型多重序列比對工具

全文

(1)國立交通大學生物資訊所碩士論文. MuSiC and MuSiC-ME：有效率的限制型多重序列比對工具 MuSiC and MuSiC-ME: Efficient Tools for Multiple Sequence Alignment with Constraints. 研究生：黃彥斌指導教授：盧錦隆. 教授. 中華民國九十三年六月.

(2) MuSic and MuSiC-ME：有效率的限制型多重序列比對工具 MuSiC and MuSiC-ME: Efficient Tools for Multiple Sequence Alignment with Constraints 研究生：黃彥斌. Student：Yen Pin Huang. 指導教授：盧錦隆教授. Advisor：Prof. Chin Lung Lu. 國立交通大學生物資訊所碩士論文. A Thesis Submitted to Institute of Bioinformatics College of Biological Science and Technology National Chiao Tung University in partial Fulfillment of the Requirements for the Degree of Master in Biological Science and Technology. June 2004 Hsinchu, Taiwan, Republic of China. 中華民國九十三年六月.

(3) 2d¿b ÊÞÓ’m£lÞÓíä2, Ö½åªú (Multiple Sequence Alignment) Ê ê¨!ÄñCŸë”åíÞÓ<2,u'àí x ¦ÞÓçðúåí !Z/Š?/Æ“É[˛ø<¥íw…, àº“¶Pí{! }äÈíÂvœ §”!¯í¶P 3Öíº4£\è4í Motifs Ä¤, ÊdÖ½åªúív `, ÞÓçðı?ø_ x?Déø<!Z4í/Š?4í/\G4íA| C{!ªJ§ªÊø– Í7, ñ‡˛íÖ½åªú x×·Ì¶Å—¥ Û°, ÄÑFb·É;Wåíqñ §ª7I7wFÊ!Z/Š?/Æ“,˛ øím7 Ä¤, Ê¥_d2, Bb3bíñíu û˝êø˚ÑÌ„ íÖ½åªú xVj²¥_½æ ¥_ xorUà6pÖ‘íå£ø <Ì„‘K, ©_Ì„‘Kú@Êå,ø<!Z4í/Š?4í/\è4íA |C{!, Í(ßÞø Ö½åªúU)ø<Å—Ì„‘KíA|C{!§ ªÊø– BbSà7F‚í Progressive íj¶Vj²Ì„íÖ½åªú½ æ 9õ,, ¥_j¶í-Êk?´ql|^0íÆ¶Vj²Ì„s‘ åªú½æ (Constrained Pairwise Sequence Alignment Problem) BbSà Dynamic Programming. £ Divide-and-Conquer ¥s.°íj¶}ql|ø_Ê. vÈ,^0£Çø_Ê˛È,^0íÆ¶V°)|7“íÌ„ís‘å ªú Í(, Bby;W¥sÆ¶}ê|s_Ì„Ö½åªú x: MuSiC (Multiple Seuqnece Alignment with Constraints) £ MuSiC-ME (MemoryEfficient MuSiC). °v, Bb6‚àø<zÕèƒ (¨Ö SARS) í 30 UTR åV. ¿t¥s xíªà4, 1*wFßÞíåªú2vƒÊ SARS èƒ2}~ LA Pseudoknot !ZíÒi. i.

(4) Abstract Multiple sequence alignment (MSA) has received much attention in the fields of bioinformatics and computational biology because it is very useful for discovering the biological meanings of sequences. Usually, biologists may have advanced knowledge about the structures/functionalities/evolutionary relationships of sequences of interest, such as active site residues, intramolecular disulfide bonds, substrate binding sites, enzyme activities and conserved motifs (consensuses).. They would ex-. pect an MSA program that is able to align these sequences such that the structural/functional/conserved bases (i.e., nucleotides or residues) are aligned together. However, most available MSA programs cannot satisfy such a requirement because they generate an alignment based only on the content of the sequences, ignoring the known functional/structural/conserved information. Hence, in this thesis, we study and develop a so-called constrained multiple sequence alignment (CMSA) tool, which takes as input the sequences and several user-specified constraints, each with corresponding to the known functional/structural/conserved bases, and generates an output of alignment in which the bases corresponding to a user-specified constraint are aligned together. We use the progressive approach to design efficient programs for heuristically solving the CMSA problem. The kernel of this approach is an efficient algorithms for optimally solving the constrained pairwise alignment (CPSA) problem. We use two different approaches, called as dynamic programming and divide-and-conquer, to design a time-efficient algorithm and a memory-efficient algorithm respectively for optimally solving the CPSA program. Based on those two algorithms, we then develop two programs, called MuSiC (Multiple Sequence Alignment with Constraints) and MuSiCME (Memory-Efficient MuSiC), respectively. To demonstrate the applicabilities of our programs, we test them on a data set of RNA sequences of 30 UTRs of several coronaviruses, including SARS, for detecting a fragment in the 30 UTR of SARS that is able to fold into a stable pseudoknotted structure, where such a pseudoknot is considered to be involved in the RNA replication of coronaviruses.. ii.

(5) Ðá >áBíNû`¤c–»4-íNû, `û7BrÖdû˝íj¶, 1/Ê Bû˝,`X@½v, .ÀwíTX›Œ; >á{%6Œ¬BíÅ, J£v 2¥Bí°ç ç!_£¤; >áðAí< DXM, éB)JÌRÌ?íêA î=çP. iii.

(6) Contents Chinese Abstract. i. Abstract. ii. Acknowledgement. iii. 1 Introduction. 1. 2 Preliminaries. 5. 2.1. Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 5. 2.2. Progressive multiple sequence alignment . . . . . . . . . . . . . . . . .. 6. 3 Algorithms 3.1. 3.2. 9. Constrained pairwise sequence alignment . . . . . . . . . . . . . . . . .. 9. 3.1.1. Dynamic programming method . . . . . . . . . . . . . . . . . .. 9. 3.1.2. Divide and conquer method . . . . . . . . . . . . . . . . . . . .. 13. Constrained multiple sequence alignment . . . . . . . . . . . . . . . . .. 23. 4 Implementation and Discussion. 25. 4.1. MuSiC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 25. 4.2. MuSiC-ME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 28. 5 Conclusions. 31. References. 33. iv.

(7) List of Figures 3.1. The schematic diagram of four adjacent entries of G, where entry I (i, j, k) consists of three nodes Mk , MD k and Mk corresponding to I Mk (i, j), MD k (i, j), Mk (i, j), respectively. . . . . . . . . . . . . . . . . .. 3.2. 12. Schematic diagram of divide and conquer approach: two light gray areas are the reduced subproblems after middle position (imid , jmid , kmid ) is determined, each of which will be further divided into two subproblems of dark gray areas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 3.3. 14. The grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when the entry (i, j, k) of G, marked with ”?”, is reached for the computation.. 17. 4.1. The interface of MuSiC. . . . . . . . . . . . . . . . . . . . . . . . . . .. 26. 4.2. The partial display of the resulting CMSA of MuSiC by aligning the sequences of SARS-TW1 30 UTR with those of other five coronaviruses.. 4.3. 27. The diagram of the predicted pseudoknot in the 30 UTR of SARS-TW1 ranging from 29460 to 29521 bp. . . . . . . . . . . . . . . . . . . . . . .. 28. 4.4. The interface of MuSiC-ME. . . . . . . . . . . . . . . . . . . . . . . . .. 29. 4.5. The partial display of the resulting CMSA of MuSiC-ME by aligning the sequences of SARS-TW1 30 UTR with those of other five coronaviruses.. 4.6. 30. The partial display of the resulting MSA of Clustal W 1.82 by aligning the 30 UTR sequences of six coronaviruses, where the bases not in the pseudoknots are marked with dots. . . . . . . . . . . . . . . . . . . . .. v. 30.

(8) Chapter 1 Introduction Multiple sequence alignment (MSA) is one of the fundamental problems in bioinformatics and computational biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, identification of conserved motifs and domains in a group of related proteins, secondary and tertiary structure prediction of a protein/RNA and so on [4, 5, 14, 23, 24]. The sum-of-pairs score is widely used for selecting an optimal MSA. This kind of MSA problem, called sum-of-pairs MSA (SPMSA) problem, can be solved by extending the dynamic programming algorithm of Needleman and Wunsch for aligning two sequence [22]. In the worst case, it needs to take O(2k nk ) time to align k sequences of length n. This exponential time limits the dynamic programming technique to align only a small number of short sequences. Actually, the SPMSA problem has been shown to be NP-complete [3, 40], which means that it seems to be impossible to design an efficient algorithm to find the mathematically optimal alignment. Hence, some approximate and heuristic methods are introduced to overcome this problem. For the approximate methods, Gusfield [13] first proposed a polynomial time approximation algorithm with performance ratio of 2 − k2 . Then Pevzner [26] improved the performance ratio to 2 − k3 . Recently, Bafna, Lawler and Pevzner [2] further improved the performance ratio to 2 −. l k. for any fixed l. It is worth mentioning that. Li, Ma and Wang [19] have given a polynomial time approximation scheme for finding a multiple sequence alignment within a constant band, which is often useful in many 1.

(9) practical cases. For the heuristic methods, the most widely used heuristic methods are so-called progressive strategies [8, 11, 16, 35, 37]. Usually,. biologists. may. have. advanced. knowledge. about. the. struc-. tures/functionalities/evolutionary relationships of sequences of interest, such as active site residues, intramolecular disulfide bonds, substrate binding sites, enzyme activities and conserved motifs (consensuses). They would expect an MSA program that is able to align these sequences such that the structural/functional/conserved bases (i.e., nucleotides or residues) are aligned together.. However, the currently. existing MSA tools, such as CLUSTAL W, do not guarantee that the generated alignments meet such a requirement that some particular bases would be aligned together.. Hence, Tang et al.. [34] defined and studied the so-called constrained. multiple sequence alignment (CMSA) problem that given the input sequences with several user-specified constraints, generates an MSA in which the bases corresponding to a user-specified constraint are aligned together, where each user-specified constraint corresponds to the know functional/structural/constrained bases. Tang et al. designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment tool based on the progressive method, where each constraint used by Tang et al. is a single base. Their proposed algorithm for two sequences runs in O(γn4 ) time and consumes O(n4 ) space, where γ is the number of the constrained bases and n is the maximum length of sequences. Later, this result was improved independently to O(γn2 ) time and O(γn2 ) space using the same approach of dynamic programming [7, 44]. In fact, each of the columns requested to be aligned together can represent a conserved site of a protein/DNA/RNA family and each conserved site may consist of a short segment of bases, instead of a single base. In other words, the user-specified constraint may be a fragment of bases. For some applications, biologists may further expect that some mismatches are allowed among the bases of the columns requested to be aligned. Hence, in this thesis, we consider a more generalized CMSA problem in which the user-specified constraints are a fragment of bases and some mismatched 2.

(10) may occur in the columns requested to be aligned. We use the progressive approach to design efficient programs for heuristically solving the CMSA problem. The kernel of this approach is an efficient algorithms for optimally solving the constrained pairwise alignment (CPSA) problem. First of all, we use the dynamic programming technique to design an algorithm of O(γn2 ) time and O(γn2 ) space optimally solving CPSA and then use this algorithm as the kernel to develope CMSA tool, called as MuSiC (Multiple Sequence Alignment with Constraints). The result greatly increases the performances and practical usage of the CMSA tools developed by the progressive approach. However, the requirement of O(γn2 ) memory still limits it to align a set of short sequences, at most several hundreds of bases. To align large genomic sequences, there is a need to design a memory-efficient algorithm for the CPSA problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in the second part, we adopt the so-called divideand-conquer approach to design a memory-efficient algorithm for optimally solving the CPSA problem, which runs in O(γn2 ) time, but consumes only O(γλn) space, where λ is the maximum of the lengths of constraints and usually λ << n in practical applications. Based on this algorithm, we have finally developed a memory-efficient CMSA tool, called as MuSiC-ME (Memory-Efficient MuSiC), using the progressive approach. To demonstrate the applicabilities of our programs, we test them on a data set of RNA sequences of 30 untranslated regions (UTRs) of several coronaviruses, including SARS, for detecting a fragment in the 30 UTR of SARS that is able to fold into a stable pseudoknotted structure, where such a pseudoknot is considered to be involve in the RNA replication of coronaviruses. The rest of this thesis is organized as follows. In Chapter 2, we give the formal definition of the CMSA problem we study in this thesis and also introduce the main steps of the adopted progressive MSA. In Chapter 3, we first use the dynamic programming technique to design a time-efficient algorithm for optimally solving the CPSA problem, then use the divide-and-conquer technique to design a memory-efficient algorithm for the CPSA problem, and finally used the progressive approach to develop two CMSA programs. In Chapter 4, we demonstrate the applicability of our developed programs 3.

(11) by testing them on a data set of RNA sequences. Finally, we make some conclusions in Chapter 5.. 4.

(12) Chapter 2 Preliminaries 2.1. Problem Formulation. Let S = {S1 , S2 , · · · , Sk } be the set of k sequences over the alphabet Σ. Then a multiple sequence alignment (MSA) of S is a rectangular matrix consisting of k rows of characters of Σ ∪ {-} such that no column consists entirely of dashes and removing dashes from row i leaves Si for any 1 ≤ i ≤ k. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of S with the optimal SP score is the so-called sum-of-pairs MSA problem [4, 5, 14, 23, 24]. Let l(P ) be the length of a fragment sequence P . Let dH (P 0 , P 00 ) denote the Hamming distance between two fragments P 0 and P 00 of equal length (i.e., l(P 0 ) = l(P 00 )), which is equal to the number of mismatched pairs in the alignment of P 0 and P 00 without any gap. Given an alignment L of S, a band is defined as a block of consecutive columns in L (i.e., a submatrix of L). For any band B of L, B(Si ) denotes the fragment of Si whose residues/nucleotides are all in the band B, where 1 ≤ i ≤ k. A sequence S = s1 s2 . . . sλ is said to appear in L if L contains a band B of λ columns, say x1 , x2 , . . . , xλ , such that the characters of column xj , where 1 ≤ j ≤ λ, are all equal to sj , or equivalently, B(Si ) = S for each 1 ≤ i ≤ k. If dH (B(Si ), S) ≤ l(S) × for 5.

(13) a given error ratio 0 ≤ < 1 (i.e., some mismatches are allowed between B(Si ) and S), then S is said to approximately appear in L. From the biological viewpoint, S can be considered as the consensus among the fragment sequences in B and hence S is also called as an induced consensus by the band B. For any two sequences S 0 and S 00 , S 0 ≺ S 00 is used to denote that S 0 (approximately) appears strictly before S 00 in L (i.e., their corresponding bands do not overlap). Let C = (C1 , C2 , . . . , Cγ ) be a ordered set of γ constraints, each Ci = ci,1 ci,2 . . . ci,λi with length of λi . Then the constrained multiple sequence alignment (CMSA) of S with respect to C is defined to be an MSA L of S in which all constraints of C approximately appear in the order C1 ≺ C2 ≺ . . . ≺ Cγ such that dH (Bj (Si ), Cj ) ≤ λj × for each band Bj whose induced consensus is Cj , where 1 ≤ i ≤ k and 1 ≤ j ≤ γ. Given a set S of k sequences along with an ordered set C of γ constraints and an error ratio , the so-called constrained multiple sequence alignment problem is to find a CMSA w.r.t. C with the optimal SP score.. 2.2. Progressive multiple sequence alignment. The progressive approach is one of the widely used heuristics for efficiently finding a good MSA of several sequences. The ideas behind it are as follows [8, 11, 16, 35, 37]: 1. Compute the distance matrix by aligning all pairs of sequences: Usually, this distance matrix is obtained by applying FASTA [20, 27] or the dynamic programming algorithm of Needleman and Wunsch [22] to each pair of sequences. 2. Construct the guide tree from the distance matrix: For the existing progressive methods, they mainly differ in the method used to build the guide tree for directing the order of alignment of sequence. To build the guide tree, for example, PILEUP (a program of GCG packages) uses UPGMA (Unweighted Pair-Group Method using Arithmetic mean) method [33] and CLUSTAL W [37] uses NJ (Neighbor-Joining) method [31]. 3. Progressively align the sequences according to the branching order in the guide tree: Initially, the closest two sequences in the tree are aligned using the normal 6.

(14) dynamic programming algorithm. After aligning, this pair of sequences is fixed and any introduced gaps cannot be shifted later (i,e., once a gap, always a gap). Then the next two closest pre-aligned groups of sequences are joined in the same way until all sequences have been aligned. (Here, we may consider a sequence as an aligned group of a sequence.) To align two groups of the pre-aligned sequences, the score between any two positions in these two groups is usually the arithmetic average of the scores for all possible character comparisons at those positions. We call this kind of scoring methods as a set-to-set scoring. In fact, MST has been used as a significant tool for data classification in the fields of biological data analysis. In [34], Tang et al. proposed a variant of progressive method by using the Kruskal MST to construct the guide tree, called Kruskal merging order tree. The Kruskal merging order tree of k sequences is constructed as follows. First, we create a complete graph G = (V, E) of k sequences in a way that each vertex of V represents a sequence and each edge e of E is associated with a weight d(e) to represent the distance between the corresponding sequences of its end-vertices. Then we use the Kruskal’s algorithm [18] to construct the Kruskal MST of G, denoted by T . For completeness, we describe the Kruskal method for constructing T as follows. 1. Sort all edges of E in non-decreasing order according to their distances. 2. Initially, T is empty. Then we repeatedly add the edges of E in non-decreasing order to T in a way that if the currently adding edge e to T dose not create a cycle in T , then we add e to T ; otherwise, we discard e. Next, according to the Kruskal MST T , we build the Kruskal merging order tree TK as follows. 1. Let V = {v1 , v2 , . . . , vk } and e1 , e2 , . . . , ek−1 be the edges of T with d(e1 ) ≤ d(e2 ) ≤ . . . ≤ d(ek−1 ). 2. For each vertex vi ∈ V , we create a tree Ti such that Ti contains only a node vi . For the purpose of merging trees, we consider Ti as a rooted tree by designating vi as its root, and define the merge of two tree Ti and Tj respectively rooted at 7.

(15) vi and vj to be a new tree rooted at a new vertex u such that vi and vj become the children of u. 3. For each eK = (vi , vj ), where K increases from 1 to k − 1, we find the trees Ti and Tj containing vi and vj respectively and then merge them into a new tree. This process is continued until the remaining is only one tree. Then this final tree is the so-called Kruskal merging order tree TK . The construction of G for k sequences can be done in O(k 2 ) time and the computation of the Kruskal’s MST T of G can be done in O(k 2 log k) time. Then the construction of TK from T can be implemented by the disjoint set union and find algorithm proposed by Gabow and Tarjan [12] in O(m + k) time, where m denotes the number of union and find operations. It is not hard to see that m = O(k) and hence the construction of TK takes O(k) time. Therefore, the total time complexity of constructing TK is O(k 2 log k).. 8.

(16) Chapter 3 Algorithms Here, we describe our algorithms to efficiently solve the CMSA problem based on the progressive approach adopted by Tang et al. [34]. The ideas behind this progressive approach are first to design an efficient algorithm to optimally solve the constrained pairwise sequence alignment (CPSA) and then use it as a kernel to progressively align the input sequences into a CMSA according to the branching order of a guide tree. The main different part of our progressive algorithm from Tang’s is the algorithms for solving the CPSA problem. In the following, we first use the dynamic programming technique to design a time-efficient algorithm for optimally solving the CPSA problem, then use the divide-and-conquer technique to design a memory-efficient algorithm for the CPSA problem, and finally use the progressive approach to develop two CMSA programs.. 3.1 3.1.1. Constrained pairwise sequence alignment Dynamic programming method. In this section, we shall use dynamic programming method to design a time-efficient algorithm for solving the CPSA problem with two given sequences A = a1 a2 . . . am and B = b1 b2 . . . bn , a given order set C = (C1 , C2 , . . . , Cγ ) of γ constraints and a given error threshold . For any two characters a, b ∈ Σ, let σ(a, b) denote the score of aligning a with b. 9.

(17) The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with wo + l × we , where wo > 0 is the gap-open penalty and we > 0 is the gap-extension penalty. For convenience, let Ai = a1 a2 . . . ai , Bj = b1 b2 . . . bj and Ck = (C1 , C2 , . . . , Ck ), where 1 ≤ i ≤ m, 1 ≤ j ≤ n, 1 ≤ k ≤ γ. Let Mk (i, j) denote the score of an optimal constrained alignment of Ai and Bj w.r.t. Ck . Clearly, Mγ (m, n) is the score of an optimal constrained alignment of A and B w.r.t. C. L is called as a semi-constrained alignment of Ai and Bj w.r.t. Ck if it is a constrained alignment of Ai and Bj w.r.t. Ck−1 and also ends (or begins) with a band whose induced consensus is equal to a prefix of Ck (or a suffix of C1 ). Nk (i, j, h) is defined to be the score of an optimal semi-constrained alignment of Ai and Bj w.r.t. Ck that ends with an induced I consensus equal to Ck,h , where Ck,h = ck,1 ck,2 . . . ck,h . Let MD k (i, j) and Mk (i, j) be. the maximum scores of all constrained alignments of Ai and Bj w.r.t. Ck that end with a deletion pair (i.e., (ai , −)) and an insertion pair (i.e., (−, bj )), respectively. By the definition, it is not hard to derive the recurrence of Mk (i, j), where 1 ≤ i ≤ m and 1 ≤ j ≤ n, as follows. If k = 0, then. Mk (i, j) = max.    Mk (i − 1, j − 1) + σ(ai , bj ),    . MD k (i, j),       MI (i, j). k. If 1 ≤ k ≤ γ, then. Mk (i, j) = max.    Mk (i − 1, j − 1) + σ(ai , bj ),         MD (i, j), k    MIk (i, j),        N (i, j, λ ). k k. Clearly, Nk (i, j, λk ) = Mk−1 (i−λk , j−λk )+Σ0≤h≤λk −1 σ(ai−h , bi−h ), if dH (Ai (λk ), Ck ) ≤ λk × and dH (Bj (λk ), Ck ) ≤ λk × , where Ai (λk ) = ai−λk +1 . . . ai and Bj (λk ) = bj−λk +1 . . . bj . Otherwise, Nk (i, j, λk ) = −∞. To simply describe the computation of I S MD k (i, j) and Mk (i, j), we introduce another notation Mk (i, j) which is defined to be. the maximum score of all constrained alignments of Ai and Bj w.r.t. Ck that end with a substitution pair (i.e., (ai , bj )). Let LD k (Ai , Bj ) denote the alignment of Ai and Bj 10.

(18) 0 with score MD k (i, j) which ends with a deletion pair (ai , −). Let L be the portion of. LD k (Ai , Bj ) before the last aligned pair (ai , −). Then there are three possibilities when we consider the last aligned pair of L0 . Case 1: The last aligned pair of L0 is a substitution pair. Then the score of L0 is MSk (i − 1, j) and (ai , −) is charged by a gap-open penalty and a gap-extension D S penalty in MD k (i, j). Hence, Mk (i, j) = Mk (i − 1, j) − wo − we .. Case 2: The last aligned pair of L0 is a deletion pair. Then the score of L0 is MD k (i − 1, j) and (ai , −) is charged by only one gap-extension penalty in MD k (i, j). Hence, D MD k (i, j) = Mk (i − 1, j) − we .. Case 3: The last aligned pair of L0 is an insertion pair. Then the score of L0 is MIk (i − 1, j) and (ai , −) is charged by a gap-open penalty and a gap-extension D I penalty in MD k (i, j). Hence, Mk (i, j) = Mk (i − 1, j) − wo − we .. In summary, we have    MSk (i − 1, j) − wo − we ,    . D MD k (i, j) = max  Mk (i − 1, j) − we ,.      MI (i − 1, j) − w − w . o e k. However, by including an extra MD k (i − 1, j) − wo − we into the right-hand site of the above recurrence, we can reformulate the above recurrence as MD k (i, j) = max.    Mk (i − 1, j) − wo − we ,   MD (i − 1, j) − w . e k. Similar to the discussion above, the recurrence of MIk (i, j) can be derived as    Mk (i, j − 1) − wo − we ,. MIk (i, j) = max .  MI (i, j − 1) − w . e k. I The initializations of Mk (i, j), MD k (i, j) and Mk (i, j) for all 0 ≤ k ≤ γ are as follows. I • If k = 0, then Mk (0, 0) = 0, MD k (0, 0) = Mk (0, 0) = −∞, Mk (i, 0) = I MD k (i, 0) = −wo − iwe and Mk (i, 0) = −∞ for all 1 ≤ i ≤ m, and Mk (0, j) =. MIk (0, j) = −wo − jwe and MD k (0, j) = −∞ for all 1 ≤ j ≤ n. 11.

(19) (i − 1, j − 1, k). (i − 1, j, k). MD k. Nk. MD k. Nk. 0. 0 0. PSfrag replacements. MIk. 0. −wo − we MIk. Mk. 0. 0. Mk. MD k. Nk. 0. 0 0. MIk. 0. −we. (i, j, k). MD k. Nk. −wo − we. (i, j − 1, k). σ(ai , bj ). −we. −wo − we. −we. 0 −wo − we MIk. Mk. 0. Mk. −we. Figure 3.1:. The schematic diagram of four adjacent entries of G, where. entry (i, j, k) consists of three nodes Mk , MD and MIk corresponding to k I Mk (i, j), MD k (i, j), Mk (i, j), respectively.. I • If 1 ≤ k ≤ γ, then Mk (0, 0) = 0, MD k (0, 0) = Mk (0, 0) = −∞, Mk (i, j) = I MD k (i, j) = Mk (i, j) = −∞ for all 1 ≤ i ≤ m and 1 ≤ j ≤ n.. According to the recurrences above, we can design an algorithm to compute Mγ (m, n) and its corresponding constrained alignment using the technique of dynamic programming as follows. For convenience, we can depict the recurrences of I matrices Mk , MD k , Mk and Nk for all 0 ≤ k ≤ γ by a 3D grid graph G, which. consists of (m + 1) × (n + 1) × (γ + 1) entries and each entry (i, j, k) consists of I D I four nodes Mk , MD k , Mk and Nk corresponding to Mk (i, j), Mk (i, j), Mk (i, j), and. Nk (i, j, λk ), respectively. Figure 3.1 illustrates the relationship of four adjacent entries (i, j, k), (i − 1, j, k), (i, j − 1, k) and (i − 1, j − 1, k) of G for each fixed k. Note that there is a directed edge, which is not shown in Figure 3.1, with weight Σ0≤h≤λk −1 σ(ai−h , bj−h ) from the Mk−1 node of the entry (i − λk , j − λk , k − 1) to the Nk node of the entry (i, j, k). Then each path from M0 (0, 0) node of entry (0, 0, 0) to Mγ (m, n) node of entry (m, n, γ) corresponds to a constrained alignment of A and B w.r.t. C. As a re12.

(20) sult, an optimal constrained alignment of A and B can be obtained by backtracking a shortest path from Mγ (m, n) to M0 (0, 0) in G. It is not hard to see that the algorithm costs both computer time and memory in the order of O(γmn). We call the above algorithm based on the dynamic programming approach as CPSA-DP algorithm.. 3.1.2. Divide and conquer method. In this section, we shall use divide-and-conquer method to design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a1 a2 . . . am and B = b1 b2 . . . bn , a given order set C = (C1 , C2 , . . . , Cγ ) of γ constraints and a given error threshold . Recall that Hirschberg [17] developed a linear-space algorithm for solving the longest common subsequence problem based on the technique of divide and conquer. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences [6, 21]. In this paper, we generalize the Hirschberg’s algorithm so that it is able to deal with the constrained pairwise sequence alignment. As compared with others, our generalization is more complicated because the grid graph G we deal with here is 3D, instead of 2D, and the input sequences are accompanied with several constraints which need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (imid , jmid , kmid ) on an optimal path from M0 (0, 0) to Mγ (m, n) in G so that we are able to divide the constrained alignment problem into two smaller constrained alignment problems, then these smaller constrained alignment problems are continued to be divided in the same way, and finally the optimal constrained alignment is obtained completely by merging the series of the calculated mid-points (see Figure 3.2 for an illustration). Before describing our algorithm, some notation must be introduced as follows. Let Ai and B j denote the suffixes ai+1 ai+2 . . . am and bj+1 bj+2 . . . bn of A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let C k denote the ordered subset (Ck+1 , Ck+2 , . . . , Cγ ) for 1 ≤ k ≤ γ. Define Mk (i, j) to be the score of an optimal constrained alignment of Ai S. D. I. and B j w.r.t. C k , and define Mk (i, j), Mk (i, j) and Mk (i, j) to be the maximum score of all constrained alignments of Ai and B j w.r.t. C k that begin with a substitution pair 13.

(21) kmid. M0 (0, 0). PSfrag replacements imid. Mγ (m, n). jmid. Figure 3.2: Schematic diagram of divide and conquer approach: two light gray areas are the reduced subproblems after middle position (imid , jmid , kmid ) is determined, each of which will be further divided into two subproblems of dark gray areas. (i.e, (ai+1 , bj+1 )), a deletion pair (i.e., (ai+1 , −)) and an insertion pair (i.e., (−, bj+1 )), respectively. Let Ck (h) = (C1 , C2 , . . . , Ck−1 , Ck,h ) and C k (h) = (C k,h , Ck+1 , . . . , Cγ ), where C k,h = ck,h+1 ck,h+2 . . . ck,λk . Let N k (i, j, h) denote the score of an optimal semiconstrained alignment L of Ai and B j w.r.t. C k (h) that begins with a band whose induced consensus is equal to C k,h . Note that the recurrences for computing matrices S. D. I. Mk , Mk , Mk , Mk and N k can be developed similarly as those for computing Mk , I S MSk , MD k , Mk and Nk , respectively. Clearly, Mk (i, j) = Mk (i−1, j−1)+σ(ai , bj ). For. simplicity, let Ai (h) (respectively, Bj (h)) denote the suffix of Ai (respectively, Bj ) with length of h (i.e., Ai (h) = ai−h+1 . . . ai and Bj (h) = bj−h+1 . . . bj ). If dH (Ai (λk ), Ck ) ≤ λk × and dH (Bj (λk ), Ck ) ≤ λk × , then we can reformulate the recurrence of Nk as follows: Nk (i, j, 1) = Mk−1 (i − 1, j − 1) + σ(ai , bj ) and Nk (i, j, h) = Nk (i − 1, j − 1, h − 1) + σ(ai , bj ) for each 1 < h ≤ λk . Next, we describe our divide-and-conquer algorithm, called as CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. C as follows. The key point is to determine the middle position (imid , jmid , kmid ) of the optimal path in G to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems using the same way. Given an alignment L, we use 14.

(22) score(L) to denote the score of L. Let Lγ (A, B) be an optimal constrained alignments of A and B w.r.t. C and clearly score(Lγ ) = Mγ (m, n). Let imid = b m2 c. Then we partition Lγ (A, B) into two parts by cutting it at the position immediately after aimid and we let L1γ (A, B) denote the part containing aimid and L2γ (A, B) denote the remaining part. Let bjmid be the last character in L1γ (A, B) from B, and let kmid be the largest index so that a prefix of Ckmid with length hmid appears in L1γ (A, B). Then there are two possibilities when we consider the last aligned pair of L1γ (A, B). Case 1: The last aligned pair of L1γ (A, B) is a substitution pair (i.e., (aimid , bjmid )). In this case, we have Mγ (m, n) = score(Lγ (A, B)) = score(L1γ (A, B)) + score(L2γ (A, B)).. If (aimid , bjmid ) is not a constrained column in Lγ (A, B), then. L1γ (A, B) is an optimal constrained alignment of Aimid and Bjmid w.r.t. Ckmid ending with a substitution pair (aimid , bjmid ), and L2γ (A, B) is an optimal constrained alignment of Aimid and B jmid w.r.t. C k . Hence, Mγ (m, n) = MSkmid (imid , jmid ) + Mkmid (imid , jmid ). If (aimid , bjmid ) is a constrained column in Lγ (A, B), then L1γ (A, B) is an optimal semiconstrained alignment of Aimid and Bjmid w.r.t. Ckmid (hmid ) ending with a band B1 whose induced consensus is equal to Ckmid ,hmid . If hmid < λkmid , then L2γ (A, B) is an optimal semi-constrained alignment of Aimid and B jmid w.r.t. C kmid (hmid ) beginning with a band B2 whose induced consensus is equal to C kmid ,hmid . Moreover, the induced consensus of the merge of B1 and B2 have to be equal to Ckmid . In this case, we have Mγ (m, n) = Nkmid (imid , jmid , hmid ) + N kmid (imid , jmid , hmid ). If hmid = λkmid , then L2γ (A, B) is an optimal constrained alignment of Aimid and B jmid w.r.t. C kmid (hmid ), and hence Mγ (m, n) = Nkmid (imid , jmid , λkmid ) + Mkmid (imid , jmid ). Case 2: The last aligned pair of L1γ (A, B) is a deletion pair (i.e., (aimid , −)). If the first aligned pair in L2γ (A, B) is not a deletion pair, then Mγ (m, n) = S. I. D max{MD kmid (imid , jmid )+Mkmid (imid , jmid ), Mkmid (imid , jmid )+Mkmid (imid , jmid )}. If the. first aligned pair in L2γ (A, B) is a deletion pair, then Mγ (m, n) = MD kmid (imid , jmid ) + D. Mkmid (imid , jmid ) + wo . Since the open penalty of the gap containing aimid and aimid +1 D. in Lγ (A, B) is charged twice by MD kmid (imid , jmid ) and Mkmid (imid , jmid ), we need to compensate it by adding wo .. 15.

(23) In summary, the recurrence of Mγ (m, n) is derived as follows.. Mγ (m, n) = max.  S    MD  kmid (imid , jmid ) + Mkmid (imid , jmid ),     I   MD (imid , jmid ) + Mkmid (imid , jmid ),  k  mid       MD (imid , jmid ) + MD (imid , jmid ) + wo ,.                   .    MSkmid (imid , jmid ) + Mkmid (imid , jmid ),         Nkmid (imid , jmid , hmid ) + N kmid (imid , jmid , hmid ),        Nk (imid , jmid , λk ) + Mk (imid , jmid ).                  . kmid. kmid. mid. mid. mid. D. By adding extra MD kmid (imid , jmid ) + Mkmid (imid , jmid ) into the right-hand side, the above recurrence is not changed, but can be reformulated as follows.. Mγ (m, n) = max.     MD  kmid (imid , jmid ) + Mkmid (imid , jmid ),     D   MD  kmid (imid , jmid ) + Mkmid (imid , jmid ) + wo ,   . MSkmid (imid , jmid ) + Mkmid (imid , jmid ),        Nkmid (imid , jmid , hmid ) + N kmid (imid , jmid , hmid ),        Nk (imid , jmid , λk ) + Mk (imid , jmid ) mid. mid. mid.                             . In other words, jmid , kmid and hmid are the indices j, k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h < λk , such that the following maximal value is the maximum.     MD  k (imid , j) + Mk (imid , j),     D   MD  k (imid , j) + Mk (imid , j) + wo ,   .               .       Nk (imid , j, h) + N k (imid , j, h),        Nk (imid , j, λk ) + Mk (imid , j).              . max  MSk (imid , j) + Mk (imid , j),. Now, we show how to use O(γλn), instead of O(γmn), memory to determine jmid , kmid and hmid , where λ = max1≤k≤γ λk and usually λ << m. In fact, a single matrix E of size (γ + 1) × (n + 1) with each entry E(k, j) of (λ + 4) space is enough to I compute Mk (imid , j), MSk (imid , j), MD k (imid , j) Mk (imid , j) and Nk (imid , j, h), where. 1 ≤ j ≤ n, 0 ≤ k ≤ γ, 1 ≤ h ≤ λk . When reaching the entry (i, j, k) of 3D grid graph G, we use entry E(k, j) of E to hold the most recently computed values of Mk (i, j), I MSk (i, j), MD k (i, j) Mk (i, j) and Nk (i, j, h), which clearly needs a total of λk + 4 space.. Note that the old values in entry E(k, j) will be moved into an extra entry, called as Vk 16.

(24) j k Vk. PSfrag replacements. vk−1,k. i. E(k). ?. i−1 Vk−1 i. E(k − 1). j. Figure 3.3: The grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when the entry (i, j, k) of G, marked with ”?”, is reached for the computation. whose space is equal to E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into Vk , however, we need to first move Mk (i−1, j −1) in Vk into a space, called as vk,k+1 . The mechanism above will enable us to compute Nk (i, j, 1), which needs to refer to Mk−1 (i − 1, j − 1) that is kept in vk−1,k , and compute Nk (i, j, h) for each 2 ≤ h ≤ λk , which needs to refer to Nk (i−1, j−1, h−1) that is kept in Vk , compute MSk (i, j) which needs to refer Mk (i−1, j −1) that is kept in Vk , and finally we are able to compute Mk (i, j). Figure 3.3 shows the grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when we reach the entry (i, j, k) of G for the computation, where E(k) denotes the the kth row of E. Hence, the totally needed I space for computing and storing all Mk (imid , j), MSk (imid , j), MD k (imid , j) Mk (imid , j). and Nk (imid , j, h) is the sum of the space of matrix E, the space of all Vk , 0 ≤ k ≤ γ, and the space of all vk,k+1 , 0 ≤ k < γ, which is equal to O(γλn). Similarly, the S. D. I. process of computing and storing all Mk (imid , j), Mk (imid , j), Mk (imid , j) M k (imid , j) and N k (imid , j, h) still needs O(γλn) space. Hence, the determination of jmid , kmid and hmid can be done in O(γλn) space. The details of CPSA-DC algorithm are described as follows, where the program code of BestScoreRev is similar to that of BestScore and hence is omitted.. 17.

(25) Algorithm CPSA-DC(istart , iend , jstart , jend , kstart , kend ) Input: Sequences aistart . . . aiend and bjstart . . . bjend with constraints (Ckstart , . . . , Ckend ) Step 1: if (istart > iend ) or (jstart > jend ) then Align the nonempty sequence with spaces; else imid = b istart2+iend c; BestScore(istart , imid , jstart , jend , kstart , kend ); BestScoreRev(imid + 1, iend , jstart , jend , kstart , kend ); end if Step 2: max = −∞; for j = jstart − 1 to jend do for k = kstart − 1 to kend do if MD k (·, j) + Mk (·, j) > max then max = MD k (·, j) + Mk (·, j); jmid = j; kmid = k; type = case 1; end if D. if MD k (·, j) + Mk (·, j) > max then D. max = MD k (·, j) + Mk (·, j); jmid = j; kmid = k; type = case 2; end if if MSk (·, j) + Mk (·, j) > max then max = MSk (·, j) + Mk (·, j); jmid = j; kmid = k; type = case 3; end if if k ≥ 1 then for h = 1 to λk − 1 do 1 (k,h) 2 (k,h) if ( H1 (k,h)+H ≤ ) and ( H2 (k,h)+H ≤ ) then λk λk. if Nk (·, j, h) + N k (·, j, h) > max then max = Nk (·, j, h) + N k (·, j, h); jmid = j; kmid = k; hmid = h; type = case 4; 18.

(26) end if end if end for k) ≤ ) and ( H2λ(k,h) ≤ ) then if ( H1 (k,λ λk k. if Nk (·, j, λk ) + Mk (·, j) > max then max = Nk (·, j, λk ) + Mk (·, j); jmid = j; kmid = k; hmid = h; type = case 5; end if end if end if end for end for Step 3: if type = case 1 then CPSA-DC(istart , imid − 1, jstart , jmid , kstart , kmid ); Align aimid with a space; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend ); end if if type = case 2 then CPSA-DC(istart , imid − 1, jstart , jmid , kstart , kmid ); Align aimid aimid +1 with two spaces; CPSA-DC(imid + 2, iend , jmid + 1, jend , kmid + 1, kend ); end if if type = case 3 then CPSA-DC(istart , imid − 1, jstart , jmid − 1, kstart , kmid ); Align aimid with bjmid ; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend ); end if if type = case 4 then CPSA-DC(istart , imid − hmid , jstart , jmid − hmid , kstart , kmid − 1); Align aimid −hmid +1 . . . aimid +λk −hmid with bjmid −hmid +1 . . . bjmid +λk −hmid ; 19.

(27) CPSA-DC(imid + λk − hmid + 1, iend , jmid + λk − hmid + 1, jend , kmid + 1, kend ); end if if type = case 5 then CPSA-DC(istart , imid − λk , jstart , jmid − λk , kstart , kmid − 1); Align aimid −λk +1 . . . aimid with bjmid −λ+1 . . . bjmid ; CPSA-DC(imid + 1, iend , jmid + 1, jend , kmid + 1, kend ); end if. Algorithm BestScore(istart , iend , jstart , jend , kstart , kend ) Input: Sequences aistart . . . aiend and bjstart . . . bjend with constraints (Ckstart , . . . , Ckend ) Output: Step 1: /* Reindex */ m = istart − iend + 1; n = jstart − jend + 1; γ = kstart − kend + 1; Step 2: /* Initialization */ for j = 0 to n do for k = 0 to γ do if (j = 0) and (k = 0) then Mk (·, j) = 0; else Mk (·, j) = −∞; if (j = 0) or (k > 0) then MIk (·, j) = −∞; else MIk (·, j) = −wo − jwe ; MSk (·, j) = MD k (·, j) = −∞; if k ≥ 1 then for h = 1 to λk do Nk (·, j, h) = −∞; end for end if end for end for Step 3: /* Computation */ for i = 1 to m do for k = 0 to γ do /* For the case of j = 0 */ 20.

(28) Vk (Mk (·, 0)) = Mk (·, 0); if k ≥ 1 then for h = 1 to λk do Vk (Nk (·, 0, h)) = Nk (·, 0, h)); Vk (H1 (k, h)) = H1 (k, h); Vk (H2 (k, h)) = H2 (k, h); end for end if MSk (·, 0) = MIk (·, 0) = −∞; Mk (·, 0) = MD k (·, 0) = −wo − jwe ; end for for j = 1 to n do /* For the case of j > 0 */ for k = 0 to γ do tempk (Mk (·, j)) = Mk (·, j) ; if k ≥ 1 then for h = 1 to λk do tempk (Nk (·, j, h)) = Nk (·, j, h); tempk (H1 (k, h)) = H1 (k, h); tempk (H2 (k, h)) = H2 (k, h); end for end if MSk (·, j) = V (Mk (·, j)) + σ(aistart +i−1 , bjstart +j−1 ); D MD k (·, j) = max{Mk (·, j) − we , Mk (·, j) − wo − we };. MIk (·, j) = max{MIk (·, j − 1) − we , Mk (·, j − 1) − wo − we }; if k ≥ 1 then for h = 1 to λk do if h = 1 then Nk (·, j, h) = vk−1,k + σ(aistart +i−1 , bjstart +j−1 ); if aistart +i−1 6= ck,h then H1 (k, h) = 1; else H1 (k, h) = 0; if bjstart +j−1 6= ck,h then H2 (k, h) = 1; else H2 (k, h) = 0; 21.

(29) else Nk (·, j, h) = Vk (Nk (·, j, h − 1)) + σ(aistart +i−1 , bjstart +j−1 ); if aistart +i−1 6= ck,h then H1 (k, h) = H1 (k, h − 1) + 1; if bjstart +j−1 6= ck,h then H2 (k, h) = H2 (k, h − 1) + 1; end if end for end if    MD (·, j), MI (·, j), Nk (·, j, λk ) k k. Mk (·, j) = max .   . ;.   V (M (·, j)) + σ(a  k k istart +i−1 , bjstart +j−1 ),. vk,k+1 = Vk (Mk (·, j)); Vk (Mk (·, j)) = tempk (Mk (·, j)); if k ≥ 1 then for h = 1 to λk do Vk (N (·, j, h)) = tempk (Nk (·, j, h)); H1 (k, h) = tempk (H1 (k, h)); H2 (k, h) = tempk (H2 (k, h)); end for end if end for end for. Now, we analyze the time complexity of our CPSA-DC algorithm for solving the constrained pairwise sequence alignment. As illustrated in Figure 3.2, after determining the middle position (imid , jmid , kmid ) of the optimal path in G, we can divide the original problem into two subproblems, each of which further can be recursively divided into two smaller subproblems using the same way. Note that regardless of where the optimal path passes through (imid , jmid , kmid ), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of the entries in G. In is not hard to see that the time complexity of determining the middle position of each subproblem at each recursive stage is proportional to the size of 22.

(30) the subproblem. Let T denote the size of the original problem (i.e., T = γmn). Then the total time complexity of our CPSA-DC algorithm is equal to T + T2 + T4 + · · · = 2T , which is twice as high as the CPSA-DP algorithm.. 3.2. Constrained multiple sequence alignment. In this section, we use Algorithm CPSA-DP and CPSA-DC in previous section as kernels to design two CMSA algorithms, called Algorithm CMSA-DP and CMSA-DC respectively, for progressively aligning the input sequences into a CMSA according to the branching order of a guide tree, where the guide tree we use here is the socalled Kruskal merging order tree. We refer the reader to Section 2 for the details of constructing the Kruskal merging order tree. Except for the adopted CPSA kernel, CMSA-DP and CMSA-DC have the same execution steps, which are described as follows. 1. Compute the distance matrix D by globally aligning all pairs of sequences using Algorithm CPSA-DP or CPSA-DC , where D(i, j) denotes the distance between sequences Si and Sj . 2. Create a complete graph G from the distance matrix D and then compute the Kruskal merging order tree Tk from G to serve as the guide tree. 3. Progressively align the sequences according to the branching order of the guide tree Tk in a way that the currently two closest pre-aligned groups of sequences are joined by applying Algorithm CPSA-DP or CPSA-DC to these two groups of sequences, where the score between any two positions in these two groups is the arithmetic average of the scores for all possible character comparisons at those positions. In the following, we analyze the time complexity of Algorithm CMSA-DP/CMSADC. It is not hard to see that step 1 costs O(γk 2 n2 ) time, where n is the maximum of the lengths of k sequences. According to the paper of Tang et al. [34] , step 2 can be done in O(k 2 log k) time. In step 3, there are at most O(k) iterations for calling 23.

(31) Algorithm CPSA-DP/CPSA-DC, whose time complexity is O(γn2 ), to join two prealigned groups of sequences. Hence, the time complexity of step 3 is O(γkn2 ). Clearly, the cost of Algorithm CMSA-DP/CMSA-DC is dominated by step 1 and hence its time complexity is O(γk 2 n2 ).. 24.

(32) Chapter 4 Implementation and Discussion 4.1. MuSiC. We use Java language to implement the CMSA-DP algorithm as a web server, called as MuSiC, which is a short for Multiple Sequence Alignment with Constraints. It can be easily accessed via a simple web interface (see Figure 4.1). The input of the MuSiC system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of bases that (approximately) appears in all input sequences. The output of MuSiC is a constrained multiple sequence alignment in which the fragments of the input sequences whose bases exhibit a given degree of similarity to a constraint are aligned together. The use of the proposed MuSiC system is illustrated below to help to detect a fragment of an RNA sequence in the 30 untranslated region (UTR) of the SARS-TW1 sequence, which can fold itself into a pseudoknot structure. The structural elements in the 50 and 30 UTRs of a plus-straind RNA virus have been postulated to be involved in RNA replication, transcription and translation by interacting with viral or cellular proteins. Much evidence supports the fact that the pseudoknots in 30 UTRs among coronaviruses participate in the replication of RNA [43]. The SARS (Severe Acute Respiratory Syndrome) virus, which caused several hundreds of deaths since its outbreak in early 2003, is a novel type of coronavirus, so a pseudoknot is expected to be observed in its 30 UTR. By comparing the sequences of the phylogenetically conserved pseudoknots among many coronaviruses, 25.

(33) Figure 4.1: The interface of MuSiC. Williams et al. found that these sequences contain several fragments of conserved nucleotides (consensuses). They found 12 consensuses, say CU, CA, AA, GG, C, UG, A, G, AG, U and A, among coronaviruses including HCV-229E (human coronavirus), PEDV (porcine epidemic diarrhea virus), TGEV (porcine transmissible gastroenteritis virus), BCV (bovine coronavirus) and MHV (mouse hepatitis virus). To determine whether or not the 30 UTR of SARS has a pseudoknot, SARS-TW1 (AY291451) was chosen as the test subject and the MuSiC system was used to align the sequence of the 30 UTR of SARS-TW1 with those of the detected pseudoknots in the 30 UTRs of BCV, MHV, PEDV, TGEV and HCV-229E coronaviruses. The consensuses described above were used as the constrained sequences in the proposed MuSiC system and then the default scoring matrix and gap penalties were chosen with the initial setting = 0. As a result, no CMSA was found to satisfy the requirement, because the postulated pseudoknot in the 30 UTR of SARS-TW1 may comprise the fragments that are only partially, rather than completely, similar to the constraints. Hence, this 26.

(34) Figure 4.2: The partial display of the resulting CMSA of MuSiC by aligning the sequences of SARS-TW1 30 UTR with those of other five coronaviruses. case was tested again with letting = 0.5 so that in the band of the resulting CMSA, of length two or three, no more than one mismatch may exist between the fragment of each input sequence and the constraint. Consequently, as indicated in Figure 4.2 , a satisfied CMSA was found. Note that, the band of the resulting CMSA that corresponds to a constraint is black and its corresponding constraint is displayed beneath it. In some bands of this resulting CMSA, such as the fourth, sixth and ninth, at most one mismatch exists between the fragment of each input sequence and the corresponding constraint. Moreover, the part of SARS-TW1 aligned with the pseudoknot sequences of other coronaviruses is interspersed with only two gaps of length one. This finding suggests that this part of SARS-TW1 may fold itself into a pseudoknot structure and possibly be involved in replicating SARS viruses. Therefore, this SARS-TW1 fragment is further validated by applying PKNOTS, developed by the Eddy group [30], to determine whether it can fold itself into a pseudoknot structure with a stable free energy. Consequently, this fragment of SARS-TW1 indeed folds itself into a stable pseudoknot whose base pairings have a topology, as indicated in Figure 4.3 , that is very similar to those of other coronaviruses described in the literature [43]. However, whether or not this SARS-TW1 fragment participates in replicating the RNA of SARS must be investigated experimentally in the laboratory.. 27.

(35) PSfrag replacements. AGAAUGAAUUCUCG 50 CUACU. GAACACG CUUGUGC. A. CAAAUCAAU GUUUAGUUA. ACUUUAA 30. UAG. Figure 4.3: The diagram of the predicted pseudoknot in the 30 UTR of SARS-TW1 ranging from 29460 to 29521 bp.. 4.2. MuSiC-ME. We also use Java language to implement the CMSA-DC algorithm as a web server, called as MuSiC-ME that is short for Memory-Efficient tool for Multiple Sequence Alignment with Constraints (see Figure 4.4). It is worth mentioning that for MuSiCME, the letters representing the constraints are not just the individual bases, but also the IUPAC (International Union of Pure and Applied Chemistry) codes. For example, nucleotides N and R have the meanings of any bases and purine (i.e., A or G), respectively. To demonstrate the practicability of our MuSiC-ME system, we also use it to detect a fragment of an RNA sequence in the 30 UTR of the SARS-TW1 sequence being able to fold into a stable pseudoknot. In this test, we aligned the sequence of the 30 UTR of SARS-TW1 with those of the 30 UTRs of BCV, MHV, PEDV, TGEV and HCV-229E coronaviruses, and used the consensuses as described before as the constraints. Since these constraints are too short, they occur frequently in the large genomic sequences. For our purpose, we further combine a few of constraints into a new and larger constraint as follows. Among the consensuses above, the first and second (respectively, ninth and tenth) consensuses are located in the 50 (respectively, 30 ) end of stem 1 and they are separated by other 4 (respectively, 4) bases, and the seventh and eighth (respectively, eleventh and twelfth) consensuses are located in the 50 (respectively, 30 ) end of stem 2 and they are separated by other 3 (respectively, 3) bases. Since both stems 1 and 2 contain no loops, we are able to combine the consensuses in the same stem into a new and larger constraint. Hence, we have new constraints like CUNNNNC for the 50 end of stem 1, GNNNNAG for the 28.

(36) Figure 4.4: The interface of MuSiC-ME. 30 end of stem 1, UNNNA for the 50 end of stem 2, and UNNNA for the 30 end of stem 2. Finally, we got eight constraints with the order of (CUNNNNC, A, AA, G, C, UNNNA, GNNNNAG, UNNNA) for this test. After running MuSiC-ME, a satisfied CMSA was found as shown in Figure 4.5. This resulting CMSA implies that the fragment of SARS-TW1 between the first band and the last band may fold into a pseudoknot structure that is possibly involved in replicating SARS viruses. In fact, the fragment is the one we found in the test with MuSiC and hence it can fold into a stable pseudoknot as shown in Figure 4.3. Note that above test was run on IBM PC with 1.26 GHz processor and 128 MB RAM under Linux system. Under this limited memory environment, the instance can not be executed by the MuSiC system whose kernel, the CPSA-DP algorithm, was implemented by the dynamic programming approach, due to running out of memory. The input sequences above were also tested by Clustal W 1.82, the most commonly used MSA tool. According to its resulting MSA as shown in Figure 4.6, the fragments 29.

(37) Figure 4.5: The partial display of the resulting CMSA of MuSiC-ME by aligning the sequences of SARS-TW1 30 UTR with those of other five coronaviruses. of all pseudoknots, including our detected pseudoknot for SARS-TW1, are not able to be aligned well so that it is difficult for us to identify the exact fragment of the SARS-TW1 pseudoknot from this MSA.. Figure 4.6: The partial display of the resulting MSA of Clustal W 1.82 by aligning the 30 UTR sequences of six coronaviruses, where the bases not in the pseudoknots are marked with dots.. 30.

(38) Chapter 5 Conclusions In this thesis, we studied a generalized CMSA problem, which is to find a CMSA for the input sequences with several user-specified constraints such that the fragments of the input sequences whose bases exhibit a given degree of similarity to a constraint are aligned together. In this model, each of the user-specified constraint may be a fragment of bases, instead of a single base only, and the adopted gap scoring is the socalled affine gap penalty that penalizes a gap once for opening and then proportionally to its length. First, we adopted the dynamic programming and divide-and-conquer techniques to design a time-efficient algorithm and a memory-efficient algorithm respectively for optimally solving the CPSA problem. Based on these two kernel algorithms, we developed two CMSA programs, called as MuSiC and MuSiC-ME respectively, using the progressive approach. The MuSiC program generates a good CMSA efficiently, but its high memory requirement limits it to align a set of short sequences, at most several hundreds of bases. The MuSiC-ME program made it possible to align several largescale sequences with constraints through the desktop PC with the limited memory. In this system, moreover, the letters allowed to represent the constraints are the IUPAC codes, which will enable the user to define a more flexible constraints or combine several small constraints with fixed distances into a large one. The practicabilities of MuSiC and MuSiC-ME were also demonstrated by helping us to detect the fragment of the 30 UTR of SARS that is able to fold itself into a stable pseudoknot for possibly 31.

(39) participating in replicating the RNA of SARS.. 32.

(40) References [1] Altschul, S. & Lipman, D. (1989) Trees stars and multiple biological sequence alignment. SIAM Journal on Applied Mathematics, 49, 197–209. [2] Bafna, V., Lawler, E. L. & Pevzner, P. A. (1997) Approximation algorithms for multiple sequence alignment. Theoretical Computer Science, 182, 233–244. [3] Bonizzoni, P. & Vedova, G. D. (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theoretical Computer Science, 259, 63–79. [4] Carrillo, H. & Lipman, D. (1988) The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48, 1073–1082. [5] Chan, S. C., Wong, A. K. C. & Chiu, D. K. Y. (1992) A survey of multiple sequence comparison methods. Bulletin of Mathematical Biology, 54, 563–598. [6] Chao, K. M., Hardison, R. C. & W, W. M. (1994) Recent developments in linearspace alignment methods: a survey. Journal of Computational Biology, 1, 271– 291. [7] Chin, F. Y. L., Ho, N. L., Lamy, T. W., Wong, P. W. H. & Chan, M. Y. (2003) Efficient constrained multiple sequence alignment with performance guarantee. In Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003) pp. 337–346 IEEE, Los Alamitos, CA. [8] Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16, 10881–10890.. 33.

(41) [9] Deiman, B. & Pleij, C. W. A. (1997) Pseudoknots: a vital feature in viral RNA. Seminars in Virology, 8, 166–175. [10] Depiereux, E. & Feytmans, E. (1992) MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. Computer Applications in the Biosciences, 8, 501–509. [11] Feng, D. F. & Doolittle, R. F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 351–360. [12] Gabow, H. & Tarjan, R. (1985) A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences, 30, 209–221 [13] Gusfield, D. (1993) Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55, 141–154. [14] Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press. [15] Hein, J. (1989) A new method that simultaneously aligns and reconstruct ancestral sequences for any number of homologous sequences, when the phylogeny is given. Molecular Biology and Evolution, 6, 649–668. [16] Higgins, D. & Sharpe, P. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [17] Hirschberg, D. (1975) A linear space algorithm for computing maximal common subseqeuences. Communications of the ACM, 18, 341–343. [18] Kruskal, J. (1956) On the shrtest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48–50. [19] Li, M., Ma, B. & Wang, L. (2000) Near optimal multiple alignment within a band in polynomial time. In Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing (STOC 2000) pp. 425–434 ACM Press, Portland.. 34.

(42) [20] Lipman, D. & Pearson, W. (1985) Rapid and sensitive protein simularity search. Science, 227, 1435–1411. [21] Myers, E. W. & Miller, W. (1988) Optimal alignment in linear space. CABIOS, 4, 11–17. [22] Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Evolution, 48, 443–453 [23] Nicholas, H. B., Ropelewski, A. J. & Deerfield, D. W. (2002) Strategies for multiple sequence alignment. Biotechniques, 32, 592–603. [24] Notredame, C. (2002) Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144. [25] Notredame, C., Higgins, D. G. & Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302, 205–217. [26] Pevzner, P. A. (1992) Multiple alignment, communication cost, and graph matching. SIAM Journal on Applied Mathematics, 52, 1763–1779. [27] Pearson, W. (1991) Searching protein sequence libraries: Computation of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithm. Genomics, 11, 635–650. [28] Pleij, C. W. A. (1994) RNA pseudoknots. Current Opinion in Structural Biology, 4, 337–344. [29] Ravi, R. & Kececioglu, J. (1998) Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discrete Applied Mathematics, 88, 355–366. [30] Rivas, E. & Eddy, S. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285, 2053–2068. 35.

(43) [31] Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new mothod for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–425. [32] Schuler, G. D., Altschul, S. F. & Lipman, D. J. (1991) A workbench for multiple alignment construction and analysis. Proteins: Structure, Function and Genetics, 9, 180–190. [33] Sneath, P. & Sokal, R. (1973) Numerical Taxonomy. Freeman, San Francisco, CA. [34] Tang, C. Y., Lu, C. L., Chang, M. D. T., Tsai, Y. T., Sun, Y. J., Chao, K. M., Chang, J. M., Chiou, Y. H., Wu, C. M., Chang, H. T. & Chou, W. I. (2003) Constrained multiple sequence alignment tool development and its application to RNase family alignment. Journal of Bioinformatics and Computational Biology, 1, 267–287. [35] Taylor, W. R. (1987) Multiple sequence alignment by a pairwise algorithm. CABIOS, 3, 81–87. [36] Taylor, W. R. (1994) Motif-biased protein sequence alignment. Journal of Computational Biology, 1, 297–310. [37] Thompson, J. D., Higgs, D. G. & Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Research, 22, 4673–4680. [38] Thompson, J. D., Plewniak, F., Thierry, J.-C. & Poch, O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Research, 28, 2919–2926. [39] Wang, L. & Gusfield, D. (1997) Improved approximation algorithms for tree alignment. Journal of Algorithms, 25, 255-273. [40] Wang, L. & Jiang, T. (1994) On the complexity of multiple sequence alignment. Journal of Computational Biology, 1, 337–348. 36.

(44) [41] Wang, L., Jiang, T. & Gusfield, D. (2000) A more efficient approximation scheme for tree alignment. SIAM Journal on Computing, 30, 283–299. [42] Wang, L., Jiang, T. & Lawler, E. (1996) Approximation algorithms for tree alignment with a givn phylogeny. Algorithmica, 16, 302–315. [43] Williams, G. D., Chang, R.-Y. & Brian, D. A. (1999) A phylogenetically conserved hairpin-type 39 untranslated region pseudoknot functions in coronavirus RNA replication. Journal of Virology, 73, 8349–8355. [44] Yu, C. T. (2003). Efficient algorithms for constrained sequence alignment problems. Master’s thesis Department of Computer Science and Information Management, Providence University.. 37.

(45)