2 Preliminaries
2.3 Progressive Multiple Sequence Alignment
2.3 Progressive Multiple Sequence Alignment
The progressive approach is one of the widely used heuristics for efficiently finding a good MSA of several sequences. The ideas behind it are as follows [13, 17, 21, 49, 50].
1. Compute the distance matrix by aligning all pairs of sequences: Usually, this distance matrix is obtained by applying FASTA [27, 39] or the dynamic programming algorithm of Needleman and Wunsch [38] to each pair of sequences.
2. Construct the guide tree from the distance matrix: For the existing progressive methods, they mainly differ in the method used to build the guide tree for directing the order of alignment of sequence. To build the guide tree, for example, PILEUP (a program of GCG packages) uses UPGMA (Unweighted
Pair-Group Method using Arithmetic mean) method [46] and CLUSTAL W [50] uses NJ (Neighbor-Joining) method [43].
3. Progressively align the sequences according to the branching order in the guide tree: Initially, the closest two sequences in the tree are aligned using the normal dynamic programming algorithm. After aligning, this pair of sequences is fixed and any introduced gaps cannot be shifted later (i,e., once a gap, always a gap). Then the next two closest pre-aligned groups of sequences are joined in the same way until all sequences have been aligned. (Here, we may consider a sequence as an aligned group of a sequence.) To align two groups of the pre-aligned sequences, the score between any two positions in these two groups is usually the arithmetic average of the scores for all possible character comparisons at those positions. We call this kind of scoring methods as a set-to-set scoring.
In fact, MST has been used as a significant tool for data classification in the fields of biological data analysis. In [48], Tang et al. proposed a variant of progressive method by using the Kruskal MST to construct the guide tree, called Kruskal merging order tree. The Kruskal merging order tree of k sequences is constructed as follows.
First, we create a complete graph G = (V,E) of k sequences in a way that each vertex of V represents a sequence and each edge e of E is associated with a weight d(e) to represent the distance between the corresponding sequences of its end-vertices. Then we use the Kruskal’s algorithm [25] to construct the Kruskal MST of G, denoted by T .For completeness, we describe the Kruskal method for constructing T as follows.
1. Sort all edges of E in non-decreasing order according to their distances.
2. Initially, T is empty. Then we repeatedly add the edges of E in non-decreasing order to T in a way that if the currently adding edge e to T dose not create a cycle in T , then we add e to T ; otherwise, we discard e.
Next, according to the Kruskal MST T , we build the Kruskal merging order tree TK as follows.
1. Let V = {v1, v2, . . . , vk} and e1, e2, . . . , ek-1 be the edges of T with d(e1) ≦ d(e2) ≦ . . . ≦ d(ek-1).
2. For each vertex vi∈V , we create a tree Ti such that Ti contains only a node vi. For the purpose of merging trees, we consider Ti as a rooted tree by designating vi as its root, and define the merge of two tree Ti and Tj
respectively rooted at vi and vj to be a new tree rooted at a new vertex u such that vi and vj become the children of u.
3. For each ek = (vi, vj), where K increases from 1 to k - 1, we find the trees Ti and Tj containing vi and vj respectively and then merge them into a new tree. This process is continued until the remaining is only one tree. Then this final tree is the so-called Kruskal merging order tree TK.
The construction of G for k sequences can be done in O(k2) time and the computation of the Kruskal’s MST T of G can be done in O(k2 log k) time. Then the construction of TK from T can be implemented by the disjoint set union and find algorithm proposed by Gabow and Tarjan [33] in O(m + k) time, where m denotes the number of union and find operations. It is not hard to see that m = O(k) and hence the construction of TK takes O(k) time. Therefore, the total time complexity of constructing TK is O(k2log k).
Chapter 3 Algorithm
In this chapter, we shall propose an algorithm to solve the problem of pairwise sequence alignment with several regular expression constraints (RECPSA). Then, we shall present how to extend this algorithm for dealing with multiple sequences with several regular expression constraints (RECMSA).
3.1 Algorithm for RECPSA (Pairwise Sequence Alignment with Regular Expression Constraints)
First consider the case of pairwise alignment. For simplicity let the two input sequences have equal lengths of n. Let Ah =(Qh,Σ,δh,qh,Fh) be an ε-free NFA equivalent to Rh (qh is the initial state of Ah), and
be the weighted automaton corresponding to
A weighted automaton. Also let M be an NFA obtained by “concatenating” M
2 1,i i
Mh
1, . . . , Mm
by adding an “empty move” from each final state of Mh to the initial state of Mh+1, 1 ≤ h < m. The initial state of M is set to be the initial state of M1, and the final states of M are the final states of Mm. Hence M accepts all feasible constrained alignments (see Fig. 3.3 for an example). The score held in state (p, q) of (resp. ) is denoted
as (resp. ). The score represents the score of a best alignment A of S
2 the score of optimally aligning S
) us from the initial state of to state (p,q). The goal of the algorithm is to compute , since is the optimal score of aligning S
)
S2 with all m regular expression constraints satisfied.
The algorithm iterates over all pairs (i1, i2) of indices of sequences S1 and S2, row by row. It computes Mhi1,i2for all 1 ≤ i1, i2 ≤ n and 1 ≤ h ≤ m throughout its execution.
Let Vh and Eh be the set of states and set of arcs in automaton Ah, respectively. Denote as Lih1,i2 a |Vh| × |Vh| table used to hold Whi1,i2. On each sequence index pair (i1, i2), the
algorithm computes Lih1,i2, . . . , Lim1,i2, in the order. For all (p, q) ∈ , [p,q] is computed using the (first) algorithm Chung et al. proposed in [12]. The same is performed for , h > 1, except that [q of [p, q] rather than -∞ as the other states are, the maximum taken over all final states (p, q) of M
The above procedure takes ⎟
⎠
O
)
time by [12]. In what follows we present how to reconstruct the optimalalignment in ⎟
2 space without affecting the time complexity, by a generalization of the method Chung et al. proposed in [12].
Consider an optimal alignment A satisfying all constraints. Intuitively, if we know the substrings of S1 and S2 responsible for A’s satisfaction of the constraints, then A (or another optimal solution) can be reconstructed efficiently. Suppose that the substrings of S1 and S2 responsible for A’s satisfying the lth regular expression are
[
bl el]
proper order to obtain A. Each alignment can be computed in linear space by Hirschberg’s celebrated divide-and-conquer algorithm [14]. In the example of Fig.3.1,, , , , , , , .
1 1
1 =
b e11 =1 b12 =1 e12 =1 b12 =3 e12 =3 b22 =2 e22 =2
Two types of |Vh|×|Vh| tables, βli1,h,i2 and ηli1,h,i2, 1≤l ≤h, are maintained for this
purpose. Let A be the alignment underlying the score . Then keeps the indices of S
)
1 and S2 corresponding to the column immediately before the first column of leaving the initial state of Ml. If l= h (resp. l<h), then keeps the indices of S1 and S2 corresponding to the first column of A which leads us to arrive at state (p, q)(resp. a final state) of Ml. It can be seen that, if (p, q) is the final state of discussed in the last paragraph.
1l −1 values are still critical).
[p q
The computation of and , h=1,…, m, proceeds in the same manner as score among all final states. In general, when
[ ]
p qWhen row i1 is being considered, all tables for rows less than i1 are not necessary and can be discarded. Therefore the overall space requirement, including the reconstruction of the optimal alignment, is ⎟
⎠
In [2], an algorithm extending the one in [1] to support multiple constraints and multiple sequences, is proposed. In the pairwise case, the algorithm in [2] has time
complexity ⎟
and space complexity ⎟
⎠
V =
)
, hence the algorithm presented here never takes more time than the one in [2]. In the worst case, Eh can be proportional to Vh2, and the algorithm in [2]time, while the one presented here takes ⎟
⎠
Hence the time complexity of the algorithm presented here compares favorably with
2If (p, q) is not a final state, then the value held in ηhi1,h,i2[ ]p,q may not be as described in the last paragraph. But as just mentioned, this does not affect the correctness.
the one in [2]. As to the space complexity, if Eh =O
( )
Vh , then the algorithm herealgorithm here is more space efficient. More importantly, the stated space complexity for the algorithm in [2] does not include the reconstruction of the alignment; only the alignment score can be computed. It is clearly an important issue for a web-server to report the alignment. If a naive backtracking method is used to augment the algorithm in [2] to reconstruct the alignment, the space requirement would be ⎟
⎠ which is too high for practical use.
Now, we apply this algorithm to solve the following example.
Input Scoring matrix
Figure 3.1:An illustration of a constrained alignment.
t Figure 3.2:The matrix of weighted automata.
5
Figure 3.3: An illustration of the weighted automaton M corresponding to the example in Fig.3.1. State (p2, p1), etc., is denoted as p21, etc., for brevity. (a)
respectively, while the final states are p22 and r22, respectively. The initial and final states of M4,4 are p11 and r22, respectively. (b) The alignment underlying
is the optimal alignment of S
]
3.2 Algorithm for RECMSA (Multiple Sequence Alignment with Regular Expression Constraints)
In this section, we use Algorithm RECPSA in previous section as kernels to design an RECMSA algorithm, for progressively aligning the input sequences into a RECMSA according to the branching order of a guide tree, where the guide tree we use here is the so-called Kruskal merging order tree. We refer the reader to Section 2 for the details of constructing the Kruskal merging order tree. Except for the adopted RECPSA kernel, the execution steps, which are described as follows.
1. Compute the distance matrix D by globally aligning all pairs of sequences using Algorithm RECPSA, where D(i, j) denotes the distance between sequences Si
and Sj .
2. Create a complete graph G from the distance matrix D and then compute the Kruskal merging order tree Tk from G to serve as the guide tree.
3. Progressively align the sequences according to the branching order of the guide tree Tk in a way that the currently two closest pre-aligned groups of sequences are joined by applying Algorithm RECPSA to these two groups of sequences, where the score between any two positions in these two groups is the arithmetic average of the scores for all possible character comparisons at those positions.
In the following, we analyze the time complexity of Algorithm RECPSA. It is not hard to see that step 1 costs O(k2n2) time, where n is the maximum of the lengths of k sequences. According to the paper of Tang et al. [48] , step 2 can be done in O(k2log k) time. In step 3, there are at most O(k) iterations for calling Algorithm RECPSA, whose time complexity is ⎟
⎠
, to join two pre-aligned groups of sequences.
Hence, the time complexity of step 3 is ⎟
3 , hence the cost of Algorithm RECMSA is dominated by step 3 and
hence its time complexity is ⎟
⎠
Chapter 4
Implementation of RE-MuSiC
In this chapter, we shall introduce the implementation of RE-MuSiC, as well as its web interface, and then describe how to use it in details. Besides, we shall introduce the syntax of regular expression utilized in RE-MuSiC.
4.1 RE-MuSiC
The kernel of RE-MuSiC (short of Multiple Sequence Alignment with Regular Expression Constraints) was implemented by C and its web interface by PHP and HTML. RE-MuSiC (http://140.113.239.131/RE-MUSIC) can be easily accessed via a simple web interface (see Figure 4.1). The input of the RE-MuSiC web server consists of a set of Protein/DNA/RNA sequences and a set of user-specified regular expression constraints. The output of RE-MuSiC is a multiple sequence alignment with regular expression constraints.
4.2 Usage of RE-MuSiC
In this section, we shall describe the usage of RE-MuSiC step by step, the output of RE-MuSiC and other information including scoring matrices, gap-penalty, and syntax of regular expressions currently used in RE-MuSiC.
4.2.1 Input of RE-MuSiC
1. Input a set of genomic sequences in the FASTA format in the top blank field (1).
2. Enter one or more regular expressions that are separated by spaces in the
"Regular expression constraints" field (2). Note that each constraint of regular expression should be put in quotes first. For example, if users have two regular expressions, say [ST]-x(2)-[DE] and
G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}, then they may key in the following line in the "Regular expression constraints" field:
"[ST]-x(2)-[DE]" "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}"
If no constraint is specified, then RE-MuSiC produces an unconstrained alignment.
3. Select the type of input sequences that can be either protein or DNA/RNA (3).
4. Just click "Execute RE-MuSiC" button (4) if users would like to run RE-MuSiC with default parameters; otherwise, they continue with the following parameter settings.
5. Select a suitable scoring matrix for protein or DNA/RNA sequences from a list of predefined matrices (5).
6. Key in two real values for gap open penalty (6) and gap extension penalty (7), respectively, since the RE-MuSiC web server penalizes the gaps using the affine gap penalty function.
7. Check the checkbox and enter an email address (8) if the user would also like to receive an email that contains a hyperlink to the RE-MuSiC result from the server. Note that the RE-MuSiC result will be kept on the server only for 24 hours.
8. Click "Execute RE-MuSiC" button to run RE-MuSiC (4).
Figure 4.1: The web interface of RE-MuSiC.
4.2.2 Output of RE-MuSiC
In the first part of the output page, RE-MuSiC shows the user-specified parameters, including scoring matrix, gap open and extension penalties, regular expression constraints and so on. Next, RE-MuSiC outputs its result of the constrained sequence alignment, in which the columns whose residues/nucleotides match regular expression constraints are shaded in yellow (refer to Figures 4.2 and 4.3 for examples). On addition, RE-MuSiC allows users to download the RE-MuSiC alignments in FASTA format or ClustalW format.
Figure 4.2: An example of the output of RE-MuSiC for protein sequences, where the residues in the first block of columns shaded in yellow match the first regular expression of "[ST]-x-[RK]", and those in the second block of columns shaded in yellow match the second regular expression of
"G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}".
Figure 4.3: An example of the output of RE-MuSiC for RNA sequences. Notice that a gap appears in the block of matching the 1st regular expression constraint, which is not allowed to happen in MuSiC.
4.2.3 Scoring Matrices
For protein sequences, three inbuilt series of scoring matrices are used in RE-MuSiC system: (1) GONNET 250 (default), (2) BLOSUM 30, 45, 62, and 80, and (3) PAM 30, 70, 120, 250, and 350. For DNA/RNA sequences, RE-MuSiC provides identity matrix only.
4.2.4 Gap Penalty
The RE-MuSiC web server penalizes the gaps with the so-called affine gap penalty function, which charges the score of "Gap open penalty" for the existence of a gap and the score of "Gap extension penalty" for each residue/nucleotide in the gap.
The default values of "Gap open penalty" for protein and DNA/RNA sequences are 10.0 and 15.0, respectively, and those of "Gap extension penalty" are 0.2 and 6.66, respectively. All these default values can be modified by the user, depending on the evolutionary distance between the input sequences of interest.
4.2.5 Syntax of Regular Expression Used in RE-MuSiC
Regular expression is a pattern-defining notation that describes a string (or, equivalently, sequence here) or a set of strings. In RE-MuSiC, the conventions of describing a pattern of regular expression are the same as those used in PROSITE.
• The standard IUPAC one-letter codes for the amino acids and nucleotides are used in the regular expression of RE-MuSiC. Notice here that the symbol 'x'/'X' is used for a position where any amino acid or nucleotide is accepted.
IUPAC Codes for Amino Acids
Letter Meaning Letter Meaning
A A (Alanine) N N (Asparagine)
B D, N P P (Proline)
C C (Cystine) Q Q (Glutamine) D D (Aspartic Acid) R R (Arginine) E E (Glutamic Acid) S S (Serine) F F (Phenylalanine) T T (Threonine) G G (Glycine) V V (Valine) H H (Histidine) W W (Tryptophan)
I I (Isoleucine) X X (Unknown or Other Amino Acid) K K (Lysine) Y Y (Tyrosine)
L L (Leucine) Z E, Q M M (Methionine)
IUPAC Codes for Nucleotides
Letter Meaning Letter Meaning
A A (Adenine) X/N A, C, G, T B C, G, T R A, G (Purine) C C (Cytosine) S C, G
D A, G, T T T (Thymine) G G (Guanine) U U (Uracil) H A, C, T V A, C, G
K G, T W A, T
M A, C Y C, T (Pyrimidine)
• The amino acids (or nucleotides) that are allowed to appear at a given position are indicated by listing them in a pair of square brackets '[ ]'.
o For example, [ALT] stands for Ala (A), Leu (L) or Thr (T).
• The amino acids (or nucleotides) that are not accepted at a given position are indicated by listing them in a pair of braces '{ }'.
o For example, {AM} stands for any amino acid except Ala (A) and Met (M).
• Each element in a pattern of regular expression is separated from its neighbors by a dash '-'.
o For example, [GA]-G-K-[ST] means that the first position of the pattern can be occupied by either Gly (G) or Ala (A), the second and third positions must be Gly (G) and Lys (K), respectively, and the last position can be either Ser (S) or Thr (T).
• Repetition of an element in a pattern is indicated by appending, immediately following that element, an integer or a pair of integers (meaning the allowed range of the number of repetitions) in parentheses. For example,
o x(3) equals to x-x-x, a meaning pattern of any three amino acids (or nucleotides).
o A(3) equals to A-A-A, a meaning pattern of three amino acids of Ala (A).
o x(2,4) equals to x-x, x-x-x, or x-x-x-x.
For example, based on the above conventions, [AC]-x-V-x(4)-{ED} is translated as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}. A subsequence matches a given regular expression, if it can be described by this regular expression. For example, "AgVdefgB" matches the above regular expression.
Chapter 5 Experiments
In this chapter, we shall demonstrate the applicability of our RE-MuSiC by testing it on two data sets, one with protein sequences and the other with RNA sequences.
5.1 Protein Sequences with Active Site Residues
In this experiment, we analyzed three GST (Glutathione S-Transferase) proteins as follows.
1. AtGST: a phi class GST from plant Arabidopsis thaliana whose PDBID is 1GNW:A.
2. SjGST: an alpha class GST from non-mammalian Schistosoma japonicum (flat worm) whose PDBID is 1M99:A.
3. SsGST: a pi class GST from mammalian Sus scrofa (pig) whose PDBID is 2GSR:A.
Notice that the structural similarity between these three proteins is very high (see Figure 5.1), although their pairwise sequence identities are extremely low. In particular, the glutathione binding sites (G-sites) of these GST proteins have been found to have a conserved architecture, and the glutathione backbone conformations adopted in these GSTs from different species are quite similar [42]. Also, the cheimical natures of their residues acting as G-site ligands and interactions facilitated with glutathione exhibit analogy [42]. Due to their low sequence identity, it is hard to use a typical alignment tool, like ClustalW, to produce an accurate alignment (see
Figure 5.2 for example). In this ClustalW alignment as shown in Figure 5.3, one of the active site residues shared by these three GST proteins was not aligned well. They actually should be aligned together, however, if we superpose the crystal structures of these three GST proteins [42]. By querying PROSITE with these three GST protein sequences, we noticed that they all share a pattern of PS00006 ("[ST]-x(2)-[DE]") in common. Using this pattern as the constraint, we aligned again the three GST protein sequences mentioned above with RE-MuSiC, resulting in an alignment as shown in Figure 5.3 that indeed satisfies the requested constraint (i.e., those residues matching
"[ST]-x(2)-[DE]" were lined up). In addition, it is worth mentioning here that all the active site residues shared among these GSTs were also aligned together. This suggests that, with additional information regarding some common patterns, RE-MuSiC is more reliable in aligning together biologically important residues from a set of closely related proteins, even when their sequence similarities are low.
Demonstrated by this experiment, therefore, our RE-MuSiC web server is helpful in the detection of active site residues in a given set of protein sequences.
Figure 5.1: (a) Alpha class GST structure to which SjGST from non-mammalian S.
japonicum (flat worm) belongs, (b) Phi class GST structure to which AtGST from plant A. thaliana belongs, (c) Pi class GST structure to which SsGST from mammalian S. scrofa (pig) belongs.
Figure 5.2: The active site residues shared by all three GST proteins are marked in boxes. The active site residues in green boxes are aligned together, but the others in red boxes are not.
Figure 5.3: The constrained sequence alignment produced by RE-MuSiC, using the pattern of "[ST]-x(2)-[DE]" (PS00006) as the constraint, in which the residues shaded in yellow match the pattern. In addition, the residues in green boxes that correspond to the active sites shared by these three GST proteins are aligned together.
5.2 RNA Sequences with Phylogenetically Conserved Pseudoknots
In this experiment, we aligned the 3' untranslated region (3'-UTR) sequences of
In this experiment, we aligned the 3' untranslated region (3'-UTR) sequences of