In this section, we use Algorithm CPSA-DP and CPSA-DC in previous section as kernels to design two CMSA algorithms, called Algorithm CMSA-DP and CMSA-DC respectively, for progressively aligning the input sequences into a CMSA according to the branching order of a guide tree, where the guide tree we use here is the so-called Kruskal merging order tree. We refer the reader to Section 2 for the details of constructing the Kruskal merging order tree. Except for the adopted CPSA kernel, CMSA-DP and CMSA-DC have the same execution steps, which are described as follows.
1. Compute the distance matrix D by globally aligning all pairs of sequences using Algorithm CPSA-DP or CPSA-DC , where D(i, j) denotes the distance between sequences Si and Sj.
2. Create a complete graph G from the distance matrix D and then compute the Kruskal merging order tree Tk from G to serve as the guide tree.
3. Progressively align the sequences according to the branching order of the guide tree Tk in a way that the currently two closest pre-aligned groups of sequences are joined by applying Algorithm CPSA-DP or CPSA-DC to these two groups of sequences, where the score between any two positions in these two groups is the arithmetic average of the scores for all possible character comparisons at those positions.
In the following, we analyze the time complexity of Algorithm CMSA-DP/CMSA-DC. It is not hard to see that step 1 costs O(γk2n2) time, where n is the maximum of the lengths of k sequences. According to the paper of Tang et al. [34] , step 2 can be done in O(k2log k) time. In step 3, there are at most O(k) iterations for calling
Algorithm CPSA-DP/CPSA-DC, whose time complexity is O(γn2), to join two pre-aligned groups of sequences. Hence, the time complexity of step 3 is O(γkn2). Clearly, the cost of Algorithm CMSA-DP/CMSA-DC is dominated by step 1 and hence its time complexity is O(γk2n2).
Chapter 4
Implementation and Discussion
4.1 MuSiC
We use Java language to implement the CMSA-DP algorithm as a web server, called as MuSiC, which is a short for Multiple Sequence Alignment with Constraints. It can be easily accessed via a simple web interface (see Figure 4.1). The input of the MuSiC system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of bases that (approximately) appears in all input sequences. The output of MuSiC is a constrained multiple sequence alignment in which the fragments of the input sequences whose bases exhibit a given degree of similarity to a constraint are aligned together. The use of the proposed MuSiC system is illus-trated below to help to detect a fragment of an RNA sequence in the 30 untranslated region (UTR) of the SARS-TW1 sequence, which can fold itself into a pseudoknot structure. The structural elements in the 50 and 30 UTRs of a plus-straind RNA virus have been postulated to be involved in RNA replication, transcription and transla-tion by interacting with viral or cellular proteins. Much evidence supports the fact that the pseudoknots in 30 UTRs among coronaviruses participate in the replication of RNA [43]. The SARS (Severe Acute Respiratory Syndrome) virus, which caused several hundreds of deaths since its outbreak in early 2003, is a novel type of coron-avirus, so a pseudoknot is expected to be observed in its 30 UTR. By comparing the sequences of the phylogenetically conserved pseudoknots among many coronaviruses,
Figure 4.1: The interface of MuSiC.
Williams et al. found that these sequences contain several fragments of conserved nucleotides (consensuses). They found 12 consensuses, say CU, CA, AA, GG, C, UG, A, G, AG, U and A, among coronaviruses including HCV-229E (human coronavirus), PEDV (porcine epidemic diarrhea virus), TGEV (porcine transmissible gastroenteritis virus), BCV (bovine coronavirus) and MHV (mouse hepatitis virus). To determine whether or not the 30 UTR of SARS has a pseudoknot, SARS-TW1 (AY291451) was chosen as the test subject and the MuSiC system was used to align the sequence of the 30 UTR of SARS-TW1 with those of the detected pseudoknots in the 30 UTRs of BCV, MHV, PEDV, TGEV and HCV-229E coronaviruses. The consensuses de-scribed above were used as the constrained sequences in the proposed MuSiC system and then the default scoring matrix and gap penalties were chosen with the initial setting = 0. As a result, no CMSA was found to satisfy the requirement, because the postulated pseudoknot in the 30 UTR of SARS-TW1 may comprise the fragments that are only partially, rather than completely, similar to the constraints. Hence, this
Figure 4.2: The partial display of the resulting CMSA of MuSiC by aligning the se-quences of SARS-TW1 30 UTR with those of other five coronaviruses.
case was tested again with letting = 0.5 so that in the band of the resulting CMSA, of length two or three, no more than one mismatch may exist between the fragment of each input sequence and the constraint. Consequently, as indicated in Figure 4.2 , a satisfied CMSA was found. Note that, the band of the resulting CMSA that corre-sponds to a constraint is black and its corresponding constraint is displayed beneath it.
In some bands of this resulting CMSA, such as the fourth, sixth and ninth, at most one mismatch exists between the fragment of each input sequence and the corresponding constraint. Moreover, the part of SARS-TW1 aligned with the pseudoknot sequences of other coronaviruses is interspersed with only two gaps of length one. This finding suggests that this part of SARS-TW1 may fold itself into a pseudoknot structure and possibly be involved in replicating SARS viruses. Therefore, this SARS-TW1 fragment is further validated by applying PKNOTS, developed by the Eddy group [30], to de-termine whether it can fold itself into a pseudoknot structure with a stable free energy.
Consequently, this fragment of SARS-TW1 indeed folds itself into a stable pseudoknot whose base pairings have a topology, as indicated in Figure 4.3 , that is very similar to those of other coronaviruses described in the literature [43]. However, whether or not this SARS-TW1 fragment participates in replicating the RNA of SARS must be investigated experimentally in the laboratory.
PSfrag replacements
Figure 4.3: The diagram of the predicted pseudoknot in the 30 UTR of SARS-TW1 ranging from 29460 to 29521 bp.