4 Implementation

4.2 Usage of RE-MuSiC

4.2.5 Syntax of Regular Expression Used in RE-MuSiC

A regular expression is a pattern-defining notation that describes a string (or, equivalently, a sequence here) or a set of strings. In RE-MuSiC, the conventions for writing a regular expression pattern are the same as those used in PROSITE.

• The standard IUPAC one-letter codes for the amino acids and nucleotides are used in the regular expressions of RE-MuSiC. Note that the symbol 'x' (or 'X') denotes a position where any amino acid or nucleotide is accepted.

IUPAC Codes for Amino Acids

Letter  Meaning              Letter  Meaning
A       A (Alanine)          N       N (Asparagine)
B       D, N                 P       P (Proline)
C       C (Cysteine)         Q       Q (Glutamine)
D       D (Aspartic Acid)    R       R (Arginine)
E       E (Glutamic Acid)    S       S (Serine)
F       F (Phenylalanine)    T       T (Threonine)
G       G (Glycine)          V       V (Valine)
H       H (Histidine)        W       W (Tryptophan)
I       I (Isoleucine)       X       X (Unknown or Other Amino Acid)
K       K (Lysine)           Y       Y (Tyrosine)
L       L (Leucine)          Z       E, Q
M       M (Methionine)

IUPAC Codes for Nucleotides

Letter  Meaning              Letter  Meaning
A       A (Adenine)          X/N     A, C, G, T
B       C, G, T              R       A, G (Purine)
C       C (Cytosine)         S       C, G
D       A, G, T              T       T (Thymine)
G       G (Guanine)          U       U (Uracil)
H       A, C, T              V       A, C, G
K       G, T                 W       A, T
M       A, C                 Y       C, T (Pyrimidine)

• The amino acids (or nucleotides) that are allowed to appear at a given position are indicated by listing them in a pair of square brackets '[ ]'.

o For example, [ALT] stands for Ala (A), Leu (L) or Thr (T).

• The amino acids (or nucleotides) that are not accepted at a given position are indicated by listing them in a pair of braces '{ }'.

o For example, {AM} stands for any amino acid except Ala (A) and Met (M).

• Each element in a pattern of regular expression is separated from its neighbors by a dash '-'.

o For example, [GA]-G-K-[ST] means that the first position of the pattern can be occupied by either Gly (G) or Ala (A), the second and third positions must be Gly (G) and Lys (K), respectively, and the last position can be either Ser (S) or Thr (T).

• Repetition of an element in a pattern is indicated by appending, immediately after that element, an integer or a pair of integers (the allowed range of the number of repetitions) in parentheses. For example,

o x(3) is equivalent to x-x-x, a pattern meaning any three amino acids (or nucleotides).

o A(3) is equivalent to A-A-A, a pattern meaning three consecutive Ala (A) residues.

o x(2,4) is equivalent to x-x, x-x-x, or x-x-x-x.

For example, based on the above conventions, [AC]-x-V-x(4)-{ED} is translated as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}. A subsequence matches a given regular expression if it can be described by that regular expression. For example, "AgVdefgB" matches the above regular expression.
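The conventions above map closely onto ordinary regular expression syntax. The following is a minimal sketch, not the RE-MuSiC implementation, of how such a pattern could be translated into a Python regular expression. The function names prosite_to_regex and matches are hypothetical, matching is assumed to be case-insensitive, and ambiguity codes such as B or Z are treated as plain letters rather than expanded.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; ambiguity codes are not expanded.
    """
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repetition suffix such as "(3)" or "(2,4)".
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element.strip())
        if m is None:
            raise ValueError(f"empty element in pattern: {pattern!r}")
        core, low, high = m.groups()

        if core in ("x", "X"):
            atom = "."                        # any residue
        elif core.startswith("[") and core.endswith("]"):
            atom = core                       # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            atom = "[^" + core[1:-1] + "]"    # forbidden residues
        elif len(core) == 1 and core.isalpha():
            atom = core                       # a single fixed residue
        else:
            raise ValueError(f"unrecognized element: {element!r}")

        if low is not None:
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(atom)
    return "".join(parts)


def matches(pattern: str, subsequence: str) -> bool:
    """Return True if the whole subsequence is described by the pattern."""
    regex = prosite_to_regex(pattern)
    return re.fullmatch(regex, subsequence, flags=re.IGNORECASE) is not None


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))     # [AC].V.{4}[^ED]
print(matches("[AC]-x-V-x(4)-{ED}", "AgVdefgB"))  # True
```

In this sketch the dash separators simply disappear, because concatenation is implicit in Python regular expressions, while braces become a negated character class.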

Chapter 5 Experiments

In this chapter, we shall demonstrate the applicability of our RE-MuSiC by testing it on two data sets, one with protein sequences and the other with RNA sequences.

5.1 Protein Sequences with Active Site Residues

In this experiment, we analyzed three GST (Glutathione S-Transferase) proteins as follows.

1. AtGST: a phi class GST from plant Arabidopsis thaliana whose PDBID is 1GNW:A.

2. SjGST: an alpha class GST from non-mammalian Schistosoma japonicum (flat worm) whose PDBID is 1M99:A.

3. SsGST: a pi class GST from mammalian Sus scrofa (pig) whose PDBID is 2GSR:A.

Notice that the structural similarity among these three proteins is very high (see Figure 5.1), although their pairwise sequence identities are extremely low. In particular, the glutathione binding sites (G-sites) of these GST proteins have been found to have a conserved architecture, and the glutathione backbone conformations adopted in these GSTs from different species are quite similar [42]. The chemical natures of the residues acting as G-site ligands, and their interactions with glutathione, are also analogous [42]. Because of the low sequence identity, it is hard for a typical alignment tool such as ClustalW to produce an accurate alignment (see Figure 5.2 for example). In the ClustalW alignment shown in Figure 5.2, one of the active site residues shared by these three GST proteins was not aligned well, although superposing the crystal structures of the three proteins shows that it should be [42]. By querying PROSITE with these three GST protein sequences, we found that they all share the pattern PS00006 ("[ST]-x(2)-[DE]"). Using this pattern as the constraint, we aligned the three GST protein sequences again with RE-MuSiC, obtaining the alignment shown in Figure 5.3, which indeed satisfies the requested constraint (i.e., the residues matching "[ST]-x(2)-[DE]" were lined up). In addition, all the active site residues shared among these GSTs were also aligned together. This suggests that, given additional information about common patterns, RE-MuSiC is more reliable in aligning biologically important residues from a set of closely related proteins, even when their sequence similarities are low.

This experiment therefore demonstrates that the RE-MuSiC web server is helpful in detecting active site residues in a given set of protein sequences.
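A pattern as short as "[ST]-x(2)-[DE]" may occur more than once in a sequence, so it can be useful to list its occurrences before using it as a constraint. The sketch below is an illustration only: the pattern is written by hand as a Python regular expression, the function name find_sites is hypothetical, and the toy sequence is invented and is not one of the three GST sequences.

```python
import re

# PS00006 "[ST]-x(2)-[DE]" written directly as a Python regex. The lookahead
# wrapper lets overlapping occurrences be reported as well.
PS00006 = re.compile(r"(?=([ST].{2}[DE]))")


def find_sites(sequence: str):
    """Return (1-based start position, matched substring) for each occurrence."""
    return [(m.start() + 1, m.group(1)) for m in PS00006.finditer(sequence)]


toy_sequence = "MTSAEDKGSLLE"  # hypothetical sequence, for illustration only
print(find_sites(toy_sequence))
# [(2, 'TSAE'), (3, 'SAED'), (9, 'SLLE')]
```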

Figure 5.1: (a) Alpha class GST structure to which SjGST from non-mammalian S. japonicum (flat worm) belongs, (b) Phi class GST structure to which AtGST from plant A. thaliana belongs, (c) Pi class GST structure to which SsGST from mammalian S. scrofa (pig) belongs.

Figure 5.2: The active site residues shared by all three GST proteins are marked in boxes. The active site residues in green boxes are aligned together, but the others in red boxes are not.

Figure 5.3: The constrained sequence alignment produced by RE-MuSiC, using the pattern of "[ST]-x(2)-[DE]" (PS00006) as the constraint, in which the residues shaded in yellow match the pattern. In addition, the residues in green boxes that correspond to the active sites shared by these three GST proteins are aligned together.

5.2 RNA Sequences with Phylogenetically Conserved Pseudoknots

In this experiment, we aligned the 3' untranslated region (3'-UTR) sequences of the following four coronaviruses, where HCoV-229E and PEDV are group 1 coronaviruses, while BCoV and MHV belong to group 2.

1. HCoV-229E: human 229E coronavirus whose accession number in GenBank is af304460.

2. PEDV: porcine epidemic diarrhea virus whose accession number in GenBank is af353511.

3. BCoV: bovine coronavirus whose accession number in GenBank is af220295.

4. MHV: murine hepatitis virus whose accession number in GenBank is af201929.

Phylogenetically conserved pseudoknots found in the 3'-UTRs of these four coronaviruses (refer to Figure 5.4) have been postulated to be involved in their RNA replication [54]. Notice that the pairwise sequence identities between sequences from different groups are extremely low. Hence, it was difficult for ClustalW to align the sequence fragments corresponding to the conserved pseudoknots (see Figure 5.5 for example). However, as shown in Figure 5.6, this goal was achieved by RE-MuSiC when the pseudoknot consensus (shown in Figure 5.7), derived by Williams et al. [54] from the 3'-UTRs of various coronaviruses, was used as the constraint. In general, loops in pseudoknots are less conserved. To make the consensus more flexible, we therefore treat the nucleotides in the loop regions as "don't care" symbols ("x") and describe the consensus as

"x(5)-C-U-x(4)-C-x(15,16)-U-G-x(2)-A-x(5,7)-G-x(4)-A-G-x(7,10)-U-x(3)-A-x(5)".

This experiment suggests that RE-MuSiC is able to help locate those sequence fragments that are conserved from the structural point of view.
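Because the consensus constraint above contains ranged repetitions such as x(15,16), the fragments it matches can vary in length. As a small worked example, the sketch below sums the minimum and maximum repetition counts of each element to obtain the shortest and longest fragment the pattern can describe. The function name length_range is hypothetical and the code is an illustration, not part of RE-MuSiC.

```python
import re

CONSENSUS = ("x(5)-C-U-x(4)-C-x(15,16)-U-G-x(2)-A-x(5,7)-G-x(4)-A-G-"
             "x(7,10)-U-x(3)-A-x(5)")


def length_range(pattern: str):
    """Return (shortest, longest) length of a fragment matching the pattern."""
    shortest = longest = 0
    for element in pattern.split("-"):
        # Capture the optional "(n)" or "(n,m)" repetition suffix.
        m = re.fullmatch(r".+?(?:\((\d+)(?:,(\d+))?\))?", element)
        low = int(m.group(1)) if m.group(1) else 1   # no suffix: exactly one
        high = int(m.group(2)) if m.group(2) else low
        shortest += low
        longest += high
    return shortest, longest


print(length_range(CONSENSUS))  # (61, 67)
```

Under these conventions the consensus describes fragments of 61 to 67 nucleotides; the six nucleotides of slack come from the three ranged elements x(15,16), x(5,7) and x(7,10).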

Figure 5.4: Phylogenetically conserved pseudoknots in the 3'-UTRs of four coronaviruses. (a) HCoV-229E (human 229E coronavirus), (b) PEDV (porcine epidemic diarrhea virus), (c) BCoV (bovine coronavirus), (d) MHV (murine hepatitis virus).

Figure 5.5: A partial view of the alignment produced by ClustalW, where the fragments shaded in light blue correspond to the phylogenetically conserved pseudoknots in the 3'-UTRs of the four coronaviruses. Notably, these four shaded fragments were not aligned together.

Figure 5.6: A partial view of the alignment produced by RE-MuSiC using the constraint of "x(5)-C-U-x(4)-C-x(15,16)-U-G-x(2)-A-x(5,7)-G-x(4)-A-G-x(7,10)-U-x(3)-A-x(5)", where the fragments shaded in yellow, corresponding to the phylogenetically conserved pseudoknots in the 3'-UTRs of the four coronaviruses, are aligned together.

Figure 5.7: The pseudoknot consensus, adapted from [54], which was derived by Williams et al. from the 3'-UTRs of various coronaviruses, including HCoV-229E, PEDV, BCoV and MHV.

Chapter 6 Conclusions

In this thesis, we studied the RECMSA problem, whose aim is to find an alignment of the input sequences, subject to several user-specified regular expression constraints, such that the substrings of the input sequences that match each regular expression constraint are aligned together. In this model, each of the user-specified constraints is a regular expression, which is useful for expressing biologically important sites such as those stored in PROSITE, as well as structural elements, which often involve variable ranges. In contrast, the plain-strings-with-mismatches model adopted in the previously available tools, MuSiC and MuSiC-ME, is not flexible enough to express such patterns.

We adopted the dynamic programming and divide-and-conquer techniques to design a time- and memory-efficient algorithm for optimally solving the RECPSA problem. In addition, we designed a method to find, in the resulting alignment, the regions responsible for satisfying the constraints. Based on the algorithm, we developed a web server, RE-MuSiC, for the RECMSA problem using the progressive approach. The algorithm underlying RE-MuSiC represents an improvement over the previously proposed algorithm [2], and is more appropriate for implementation in a web server.

Experiments on GST proteins and on coronaviruses with phylogenetically conserved pseudoknots demonstrated that, with additional knowledge incorporated, RE-MuSiC is able to produce meaningful alignments in which important residues or structural elements can be aligned properly, even if the similarity among input sequences is low. Such ability is also useful for prediction purposes.

References

[1] Arslan, A.N. (2005) Regular expression constrained sequence alignment. In Proc. 16th Annual Symposium on Combinatorial Pattern Matching (CPM05), vol. 3537 of Lecture Notes in Computer Science, Springer, pp. 322-333.

[2] Arslan, A.N. (2005) Multiple sequence alignment containing a sequence of regular expressions. In Proc. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB05), pp.1-7.

[3] Arslan, A. N. (2006) An algorithm with linear expected running time for string editing with substitutions and substring reversals. The Proceedings of the Biotechnology and Bioinformatics Symposium (BIOT-2006), pp. 90-96, Provo, Utah, October 20-21, 2006 (acceptance rate = 40%)

[4] Arslan, A. N. and He, D. (2006) An improved algorithm for the regular expression constrained multiple sequence alignment problem. The Proceedings of the 6th IEEE Symposium on Bioinformatics and Biotechnology (BIBE 2006), pp. 121-126, Washington, DC, October 16-18, 2006.

[5] Arslan, A. N. (≥2007) Regular expression constrained sequence alignment, Journal of Discrete Algorithms, Elsevier (to appear)

[6] Bafna, V., Lawler, E. L. & Pevzner, P. A. (1997) Approximation algorithms for multiple sequence alignment. Theoretical Computer Science, 182, 233–244.

[7] Bonizzoni, P. & Vedova, G. D. (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theoretical Computer Science, 259, 63–79.

[8] Carrillo, H. & Lipman, D. (1988) The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48, 1073–1082.

[9] Chan, S. C., Wong, A. K. C. & Chiu, D. K. Y. (1992) A survey of multiple sequence comparison methods. Bulletin of Mathematical Biology, 54, 563–598.

[10] Cheng, C.Y., Chang, C.H., Wu, Y.J., & Li, Y.K. (2006) Exploration of glycosyl hydrolase family 75, a chitosanase from Aspergillus fumigatus. J. Biol. Chem., 281, 3137-3144.

[11] Chin, F.Y.L., Ho, N.L., Lam, T.W., Wong, P.W.H., & Chan, M.Y. (2005) Efficient constrained multiple sequence alignment with performance guarantee. J. Bioinform. Comput. Biol., 3, 1-18.

[12] Chung, Y.-S., Lu, C.L., & Tang, C.Y. (2006) Efficient algorithms for regular expression constrained sequence alignment. In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM06), vol. 4009 of Lecture Notes in Computer Science, Springer, pp. 389-400.

[13] Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16, 10881–10890.

[14] Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18, 341–343.

[15] Depiereux, E., & Feytmans, E. (1992) MATCH-BOX: A fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci., 8, 501-509.

[16] Dunbrack, R.L. (2006) Sequence comparison and protein structure prediction. Curr. Opin. Struct. Biol., 16, 374-384.

[17] Feng, D. F. & Doolittle, R. F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 351–360.

[18] Gusfield, D. (1993) Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55, 141–154.

[19] Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

[20] He, D., Arslan, A. N., and Ling, A. C. H. (2006) A fast algorithm for the constrained multiple sequence alignment problem. Acta Cybernetica, 17, 701-717.

[21] Higgins, D. & Sharpe, P. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.

[22] Hopcroft, J.E., Motwani, R., & Ullman, J.D. (2001) Introduction to Automata Theory, Languages, and Computation, 2nd edn. Addison-Wesley.

[23] Huang, C.H., Lu, C.L., & Chiu, H.T. (2005) A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics, 21, 3501-3508.

[24] Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., and Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, 134-137.

[25] Kruskal, J. (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48–50.

[26] Li, M., Ma, B. & Wang, L. (2000) Near optimal multiple alignment within a band in polynomial time. In Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing (STOC 2000) pp. 425–434 ACM Press, Portland.

[27] Lipman, D. & Pearson, W. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.

[28] Lu, C.L., & Huang, Y.P. (2005) A memory-efficient algorithm for multiple sequence alignment with constraints. Bioinformatics, 21, 20-30.

[29] Myers, G., Selznick, S., Zhang, Z., & Miller, W. (1996) Progressive multiple alignment with constraints. J. Comput. Biol., 3, 563-572.

[30] Morgenstern, B., Dress, A., & Werner, T. (1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93, 12098-12103.

[31] Morgenstern, B., Frech, K., Dress, A., & Werner, T. (1998) DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290-294.

[32] Morgenstern, B. (1999) DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211-218.

[33] Morgenstern, B. (2004) DIALIGN: Multiple DNA and protein sequence alignment at BiBi-Serv. Nucleic Acids Res., 32, W33-W36.

[34] Morgenstern, B., Werner, N., Prohaska, S.J., Schneider, R.S.I., Subramanian, A.R., Stadler, P.F., & Weyer-Menkhoff, J. (2005) Multiple sequence alignment with user-defined constraints at GOBICS. Bioinformatics, 21, 1271-1273.

[35] Morgenstern, B., Prohaska, S.J., Pohler, D., & Stadler, P.F. (2006) Multiple sequence alignment with user-defined anchor points. Algorithms Mol. Biol., 1, 6.

[36] Notredame, C. (2002) Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131-144.

[37] Nicholas, H. B., Ropelewski, A. J. & Deerfield, D. W. (2002) Strategies for multiple sequence alignment. Biotechniques, 32, 592–603.

[38] Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

[39] Pearson, W. (1991) Searching protein sequence libraries: Computation of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithm. Genomics, 11, 635–650.

[40] Pevzner, P. A. (1992) Multiple alignment, communication cost, and graph matching. SIAM Journal on Applied Mathematics, 52, 1763–1779.

[41] Reeder, J., Hochsmann, M., Rehmsmeier, M., Voss, B., & Giegerich, R. (2006) Beyond Mfold: Recent advances in RNA bioinformatics. J. Biotechnol., 124, 41-55.

[42] Reinemer, P., Prade, L., Hof, P., Neuefeind, T., Huber, R., Zettl, R., Palme, K., Schell, J., Koelln, I., Bartunik, H.D. et al. (1996) Three-dimensional structure of glutathione S-transferase from Arabidopsis thaliana at 2.2 Å resolution: Structural characterization of herbicide-conjugating plant glutathione S-transferases and a novel active site architecture. J. Mol. Biol., 255, 289-309.

[43] Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–425.

[44] Sammeth, M., Morgenstern, B., & Stoye, J. (2003) Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics, 19 Suppl. 2, ii189-ii195.

[45] Schuler, G.D., Altschul, S.F., & Lipman, D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins: Struct., Funct., Genet., 9, 180-190.

[46] Sneath, P. & Sokal, R. (1973) Numerical Taxonomy. Freeman, San Francisco, CA.

[47] Song, B., Choi, J.H., Chen, G.Y., Szymanski, J., Zhang, G.Q., Tung, A.K.H., Kang, J., Kim, S., & Yang, J. (2006) ARCS: An aggregated related column scoring scheme for aligned sequences. Bioinformatics, 22, 2326-2332.

[48] Tang, C.Y., Lu, C.L., Chang, M.D.T., Tsai, Y.T., Sun, Y.J., Chao, K.M., Chang, J.M., Chiou, Y.H., Wu, C.M., Chang, H.T., and Chou, W.I. (2003) Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol., 1, 267-287.

[49] Taylor, W. R. (1987) Multiple sequence alignment by a pairwise algorithm. CABIOS, 3, 81–87.

[50] Thompson, J.D., Higgins, D.G., & Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.

[51] Thompson, J.D., Plewniak, F., Thierry, J.-C., & Poch, O. (2000) DbClustal: Rapid and reliable multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919-2926.

[52] Tsai, Y.-T., Huang, Y.P., Yu, C.T., & Lu, C.L. (2004) MuSiC: A tool for multiple sequence alignment with constraints. Bioinformatics, 20, 2309-2311.

[53] Wang, L. & Jiang, T. (1994) On the complexity of multiple sequence alignment. Journal of Computational Biology, 1, 337–348.

[54] Williams, G.D., Chang, R.Y. and Brian, D.A. (1999) A phylogenetically conserved hairpin-type 3’ untranslated region pseudoknot functions in coronavirus RNA replication. J. Virol., 73, 8349-8355.
