Algorithm for RECMSA - A Study of Regular Expression Constrained Multiple Sequence Alignment

3 Algorithms

3.2 Algorithm for RECMSA

In this section, we use Algorithm RECPSA from the previous section as the kernel of an RECMSA algorithm that progressively aligns the input sequences into an RECMSA according to the branching order of a guide tree. The guide tree we use here is the so-called Kruskal merging order tree; we refer the reader to Section 2 for the details of its construction. The execution steps built around the RECPSA kernel are as follows.

1. Compute the distance matrix D by globally aligning all pairs of sequences using Algorithm RECPSA, where D(i, j) denotes the distance between sequences Si and Sj.

2. Create a complete graph G from the distance matrix D and then compute the Kruskal merging order tree Tk from G to serve as the guide tree.

3. Progressively align the sequences according to the branching order of the guide tree Tk in a way that the currently two closest pre-aligned groups of sequences are joined by applying Algorithm RECPSA to these two groups of sequences, where the score between any two positions in these two groups is the arithmetic average of the scores for all possible character comparisons at those positions.
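The guide-tree construction in step 2 and the column scoring in step 3 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the RE-MuSiC implementation: it assumes that the Kruskal merging order is the order in which Kruskal's minimum spanning tree algorithm joins components of G, and the names `kruskal_merge_order`, `column_score`, and the toy `identity` scoring function are hypothetical. How gaps inside a column are scored is left to the supplied scoring function.

```python
from itertools import product


def kruskal_merge_order(dist):
    """Return the sequence of group merges made by Kruskal's algorithm.

    dist is a symmetric k x k distance matrix D; each merge is a pair of
    tuples holding the sequence indices of the two groups being joined.
    """
    k = len(dist)
    parent = list(range(k))              # union-find forest over sequence indices
    groups = {i: [i] for i in range(k)}  # root index -> members of that group

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Edges of the complete graph G, closest pairs first.
    edges = sorted((dist[i][j], i, j) for i in range(k) for j in range(i + 1, k))
    merges = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                     # already in the same pre-aligned group
        merges.append((tuple(groups[ri]), tuple(groups[rj])))
        parent[rj] = ri
        groups[ri] = groups[ri] + groups.pop(rj)
    return merges


def column_score(col_a, col_b, score):
    """Arithmetic average of score(a, b) over all character pairs."""
    pairs = list(product(col_a, col_b))
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def identity(a, b):
    """Toy scoring function (an assumption): 1 for a match, 0 otherwise."""
    return 1.0 if a == b else 0.0
```

For the toy matrix D = [[0, 1, 4], [1, 0, 2], [4, 2, 0]], the sketch first joins sequences 0 and 1 and then joins that group with sequence 2. In step 3, each such merge would be carried out by a call to Algorithm RECPSA on the two groups.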

In the following, we analyze the time complexity of Algorithm RECMSA. It is not hard to see that step 1 costs O(k²n²) time, where n is the maximum length of the k input sequences. According to Tang et al. [48], step 2 can be done in O(k² log k) time. In step 3, Algorithm RECPSA is called in at most O(k) iterations, each joining two pre-aligned groups of sequences, so step 3 costs O(k) times the running time of one RECPSA call. Step 3 therefore dominates the cost of Algorithm RECMSA and determines its overall time complexity.

Chapter 4 Implementation of RE-MuSiC

In this chapter, we introduce the implementation of RE-MuSiC and its web interface, and then describe how to use it in detail. We also introduce the regular expression syntax used in RE-MuSiC.

4.1 RE-MuSiC

The kernel of RE-MuSiC (short for Multiple Sequence Alignment with Regular Expression Constraints) was implemented in C, and its web interface in PHP and HTML. RE-MuSiC (http://140.113.239.131/RE-MUSIC) can be accessed via a simple web interface (see Figure 4.1). The input of the RE-MuSiC web server consists of a set of protein/DNA/RNA sequences and a set of user-specified regular expression constraints. The output of RE-MuSiC is a multiple sequence alignment that satisfies the regular expression constraints.

4.2 Usage of RE-MuSiC

In this section, we shall describe the usage of RE-MuSiC step by step, the output of RE-MuSiC and other information including scoring matrices, gap-penalty, and syntax of regular expressions currently used in RE-MuSiC.

4.2.1 Input of RE-MuSiC

1. Input a set of genomic sequences in the FASTA format in the top blank field (1).

2. Enter one or more regular expressions, separated by spaces, in the "Regular expression constraints" field (2). Note that each regular expression constraint should be enclosed in quotes. For example, if users have two regular expressions, say [ST]-x(2)-[DE] and G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}, then they may key in the following line in the "Regular expression constraints" field:

"[ST]-x(2)-[DE]" "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}"

If no constraint is specified, then RE-MuSiC produces an unconstrained alignment.

3. Select the type of input sequences that can be either protein or DNA/RNA (3).

4. Click the "Execute RE-MuSiC" button (4) to run RE-MuSiC with default parameters; otherwise, continue with the following parameter settings.

5. Select a suitable scoring matrix for protein or DNA/RNA sequences from a list of predefined matrices (5).

6. Key in two real values for gap open penalty (6) and gap extension penalty (7), respectively, since the RE-MuSiC web server penalizes the gaps using the affine gap penalty function.

7. Check the checkbox and enter an email address (8) if the user would also like to receive an email that contains a hyperlink to the RE-MuSiC result from the server. Note that the RE-MuSiC result will be kept on the server only for 24 hours.

8. Click the "Execute RE-MuSiC" button (4) to run RE-MuSiC.
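The quoting convention for the "Regular expression constraints" field in step 2 can be illustrated with a short Python sketch. It only shows one plausible way to split such a line into individual patterns; the function name `parse_constraints` is hypothetical, and the actual parsing done by the RE-MuSiC server (written in C and PHP) is not described in this text.

```python
import shlex


def parse_constraints(field_text):
    """Split the constraints field into a list of pattern strings.

    shlex.split honours double quotes, so each quoted regular
    expression becomes one list element with its quotes removed.
    """
    return shlex.split(field_text)


line = '"[ST]-x(2)-[DE]" "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}"'
for pattern in parse_constraints(line):
    print(pattern)
```

An empty field yields an empty list, which corresponds to the unconstrained alignment case mentioned in step 2.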

Figure 4.1: The web interface of RE-MuSiC.

4.2.2 Output of RE-MuSiC

In the first part of the output page, RE-MuSiC shows the user-specified parameters, including the scoring matrix, the gap open and extension penalties, and the regular expression constraints. Next, RE-MuSiC outputs the constrained sequence alignment, in which the columns whose residues/nucleotides match the regular expression constraints are shaded in yellow (refer to Figures 4.2 and 4.3 for examples). In addition, RE-MuSiC allows users to download the alignments in FASTA format or ClustalW format.

Figure 4.2: An example of the output of RE-MuSiC for protein sequences, where the residues in the first block of columns shaded in yellow match the first regular expression "[ST]-x-[RK]", and those in the second block of columns shaded in yellow match the second regular expression "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}".

Figure 4.3: An example of the output of RE-MuSiC for RNA sequences. Notice that a gap appears in the block matching the first regular expression constraint, which is not allowed in MuSiC.

4.2.3 Scoring Matrices

For protein sequences, three built-in series of scoring matrices are available in the RE-MuSiC system: (1) GONNET 250 (default), (2) BLOSUM 30, 45, 62, and 80, and (3) PAM 30, 70, 120, 250, and 350. For DNA/RNA sequences, RE-MuSiC provides only the identity matrix.

4.2.4 Gap Penalty

The RE-MuSiC web server penalizes the gaps with the so-called affine gap penalty function, which charges the score of "Gap open penalty" for the existence of a gap and the score of "Gap extension penalty" for each residue/nucleotide in the gap.

The default values of "Gap open penalty" for protein and DNA/RNA sequences are 10.0 and 15.0, respectively, and those of "Gap extension penalty" are 0.2 and 6.66, respectively. All these default values can be modified by the user, depending on the evolutionary distance between the input sequences of interest.
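As a small worked illustration of the affine gap penalty described above, the following Python sketch charges the gap open penalty once per gap plus the gap extension penalty for each residue/nucleotide in it. The function name `affine_gap_penalty` and the dictionary `DEFAULT_GAP_PENALTIES` are hypothetical; only the default values are taken from the text.

```python
# Default (gap open, gap extension) penalties quoted in the text.
DEFAULT_GAP_PENALTIES = {
    "protein": (10.0, 0.2),
    "dna_rna": (15.0, 6.66),
}


def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty of one gap of the given length under the affine model.

    The open penalty is charged once for the existence of the gap and
    the extension penalty once for every position inside the gap.
    """
    if length <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


gap_open, gap_extend = DEFAULT_GAP_PENALTIES["protein"]
print(affine_gap_penalty(3, gap_open, gap_extend))
```

With the protein defaults, a gap of three residues costs 10.0 + 3 × 0.2, that is, about 10.6, so one long gap is penalized much less than three separate single-residue gaps (3 × 10.2 = 30.6).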

4.2.5 Syntax of Regular Expression Used in RE-MuSiC

A regular expression is a pattern-defining notation that describes a string (or, equivalently, a sequence here) or a set of strings. In RE-MuSiC, the conventions for describing a regular expression pattern are the same as those used in PROSITE.

• The standard IUPAC one-letter codes for the amino acids and nucleotides are used in the regular expression of RE-MuSiC. Notice here that the symbol 'x'/'X' is used for a position where any amino acid or nucleotide is accepted.

IUPAC Codes for Amino Acids

Letter  Meaning              Letter  Meaning
A       A (Alanine)          N       N (Asparagine)
B       D, N                 P       P (Proline)
C       C (Cysteine)         Q       Q (Glutamine)
D       D (Aspartic Acid)    R       R (Arginine)
E       E (Glutamic Acid)    S       S (Serine)
F       F (Phenylalanine)    T       T (Threonine)
G       G (Glycine)          V       V (Valine)
H       H (Histidine)        W       W (Tryptophan)
I       I (Isoleucine)       X       X (Unknown or Other Amino Acid)
K       K (Lysine)           Y       Y (Tyrosine)
L       L (Leucine)          Z       E, Q
M       M (Methionine)

IUPAC Codes for Nucleotides

Letter  Meaning        Letter  Meaning
A       A (Adenine)    X/N     A, C, G, T
B       C, G, T        R       A, G (Purine)
C       C (Cytosine)   S       C, G
D       A, G, T        T       T (Thymine)
G       G (Guanine)    U       U (Uracil)
H       A, C, T        V       A, C, G
K       G, T           W       A, T
M       A, C           Y       C, T (Pyrimidine)

• The amino acids (or nucleotides) that are allowed to appear at a given position are indicated by listing them in a pair of square brackets '[ ]'.

o For example, [ALT] stands for Ala (A), Leu (L) or Thr (T).

• The amino acids (or nucleotides) that are not accepted at a given position are indicated by listing them in a pair of braces '{ }'.

o For example, {AM} stands for any amino acid except Ala (A) and Met (M).

• Each element in a pattern of regular expression is separated from its neighbors by a dash '-'.

o For example, [GA]-G-K-[ST] means that the first position of the pattern can be occupied by either Gly (G) or Ala (A), the second and third positions must be Gly (G) and Lys (K), respectively, and the last position can be either Ser (S) or Thr (T).

• Repetition of an element in a pattern is indicated by appending, immediately following that element, an integer or a pair of integers (meaning the allowed range of the number of repetitions) in parentheses. For example,

o x(3) is equivalent to x-x-x, meaning a pattern of any three amino acids (or nucleotides).

o A(3) is equivalent to A-A-A, meaning a pattern of three Ala (A) residues.

o x(2,4) is equivalent to x-x, x-x-x, or x-x-x-x.

For example, based on the above conventions, [AC]-x-V-x(4)-{ED} is translated as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}. A subsequence matches a given regular expression, if it can be described by this regular expression. For example, "AgVdefgB" matches the above regular expression.
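The conventions above map naturally onto ordinary regular expressions. The following Python sketch translates a PROSITE-style pattern, as described in this section, into a pattern for Python's `re` module. It is an illustration only, not the parser used inside RE-MuSiC: the function name `prosite_to_regex` is hypothetical, and the sketch assumes that every letter other than x/X is matched literally (IUPAC ambiguity codes such as B or Z are not expanded).

```python
import re

# One pattern element: a letter, [allowed] or {forbidden},
# optionally followed by (n) or (n,m).
_ELEMENT = re.compile(
    r"([A-Za-z]|\[[A-Za-z]+\]|\{[A-Za-z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element.strip())
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core in ("x", "X"):
            atom = "."                         # any residue/nucleotide
        elif core.startswith("{"):
            atom = "[^" + core[1:-1] + "]"     # anything except the listed ones
        else:
            atom = core                        # single letter or [ ... ] class
        if low is not None:                    # repetition (n) or (n,m)
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(atom)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(bool(re.fullmatch(regex, "AgVdefgB")))
```

For the pattern [AC]-x-V-x(4)-{ED} the sketch produces `[AC].V.{4}[^ED]`, and the string "AgVdefgB" from the example above matches it in full.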

Chapter 5 Experiments

In this chapter, we shall demonstrate the applicability of our RE-MuSiC by testing it on two data sets, one with protein sequences and the other with RNA sequences.

5.1 Protein Sequences with Active Site Residues

In this experiment, we analyzed three GST (Glutathione S-Transferase) proteins as follows.

1. AtGST: a phi class GST from plant Arabidopsis thaliana whose PDBID is 1GNW:A.

2. SjGST: an alpha class GST from non-mammalian Schistosoma japonicum (flat worm) whose PDBID is 1M99:A.

3. SsGST: a pi class GST from mammalian Sus scrofa (pig) whose PDBID is 2GSR:A.

Notice that the structural similarity among these three proteins is very high (see Figure 5.1), although their pairwise sequence identities are extremely low. In particular, the glutathione binding sites (G-sites) of these GST proteins have been found to have a conserved architecture, and the glutathione backbone conformations adopted in these GSTs from different species are quite similar [42]. Also, the chemical natures of their residues acting as G-site ligands and the interactions formed with glutathione are analogous [42]. Due to their low sequence identity, it is hard for a typical alignment tool such as ClustalW to produce an accurate alignment (see Figure 5.2 for an example). In the ClustalW alignment shown in Figure 5.2, one of the active site residues shared by these three GST proteins was not aligned well, although these residues should be aligned together if we superpose the crystal structures of the three GST proteins [42]. By querying PROSITE with these three GST protein sequences, we noticed that they all share the pattern PS00006 ("[ST]-x(2)-[DE]"). Using this pattern as the constraint, we aligned the three GST protein sequences again with RE-MuSiC, resulting in the alignment shown in Figure 5.3, which indeed satisfies the requested constraint (i.e., the residues matching "[ST]-x(2)-[DE]" were lined up). In addition, it is worth mentioning that all the active site residues shared among these GSTs were also aligned together. This suggests that, with additional information about common patterns, RE-MuSiC is more reliable in aligning biologically important residues from a set of closely related proteins, even when their sequence similarities are low.

This experiment therefore demonstrates that our RE-MuSiC web server is helpful in the detection of active site residues in a given set of protein sequences.
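To give a feel for what the constraint PS00006 selects, the short Python sketch below lists every position in a sequence where "[ST]-x(2)-[DE]" can match. The regular expression is a hand translation of the pattern, the function name `find_sites` is hypothetical, and the sequence used is a made-up toy string, not one of the GST sequences analyzed here.

```python
import re

# Hand translation of "[ST]-x(2)-[DE]"; the lookahead lets
# overlapping occurrences be reported as well.
PS00006 = re.compile(r"(?=([ST].{2}[DE]))")


def find_sites(seq):
    """Return (0-based start, matched substring) for every occurrence."""
    return [(m.start(), m.group(1)) for m in PS00006.finditer(seq)]


toy_sequence = "MKSAAEGTLKD"  # hypothetical sequence for illustration only
for start, site in find_sites(toy_sequence):
    print(start, site)
```

Such a short pattern typically occurs at several positions in a protein, so which occurrences end up aligned together is decided by the constrained alignment itself rather than by pattern matching alone.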

Figure 5.1: (a) Alpha class GST structure to which SjGST from non-mammalian S. japonicum (flat worm) belongs, (b) Phi class GST structure to which AtGST from plant A. thaliana belongs, (c) Pi class GST structure to which SsGST from mammalian S. scrofa (pig) belongs.

Figure 5.2: The active site residues shared by all three GST proteins are marked in boxes. The active site residues in green boxes are aligned together, but the others in red boxes are not.

Figure 5.3: The constrained sequence alignment produced by RE-MuSiC, using the pattern of "[ST]-x(2)-[DE]" (PS00006) as the constraint, in which the residues shaded in yellow match the pattern. In addition, the residues in green boxes that correspond to the active sites shared by these three GST proteins are aligned together.

5.2 RNA Sequences with Phylogenetically Conserved Pseudoknots

In this experiment, we aligned the 3' untranslated region (3'-UTR) sequences of the following four coronaviruses, where HCoV-229E and PEDV are group 1 coronaviruses, while BCoV and MHV belong to group 2.

1. HCoV-229E: human 229E coronavirus whose accession number in GenBank is af304460.

2. PEDV: porcine epidemic diarrhea virus whose accession number in GenBank is af353511.

3. BCoV: bovine coronavirus whose accession number in GenBank is af220295.

4. MHV: murine hepatitis virus whose accession number in GenBank is af201929.

Phylogenetically conserved pseudoknots found in the 3'-UTRs of these four coronaviruses (refer to Figure 5.4) have been postulated to be involved in their RNA replication [54]. Notice that the pairwise sequence identities between sequences in different groups are extremely low. Hence, it was difficult for ClustalW to align the sequence fragments corresponding to the conserved pseudoknots (see Figure 5.5 for an example). However, as shown in Figure 5.6, this goal was achieved by RE-MuSiC when the pseudoknot consensus (shown in Figure 5.7), derived by Williams et al. [54] from the 3'-UTRs of various coronaviruses, was used as the constraint. In general, loops in pseudoknots are less conserved. Hence, to make the consensus more flexible, we treat the nucleotides in the loop regions as "don't care" symbols ("x") and describe the consensus as

"x(5)-C-U-x(4)-C-x(15,16)-U-G-x(2)-A-x(5,7)-G-x(4)-A-G-x(7,10)-U-x(3)-A-x(5)".

This experiment suggests that RE-MuSiC is able to help locate those sequence fragments that are conserved from the structural point of view.

Figure 5.4: Phylogenetically conserved pseudoknots in the 3'-UTRs of four coronaviruses. (a) HCoV-229E (human 229E coronavirus), (b) PEDV (porcine epidemic diarrhea virus), (c) BCoV (bovine coronavirus), (d) MHV (murine hepatitis virus).

Figure 5.5: A partial view of the alignment produced by ClustalW, where the fragments shaded in light blue correspond to the phylogenetically conserved pseudoknots in the 3'-UTRs of the four coronaviruses. Notably, these four shaded fragments were not aligned together.

Figure 5.6: A partial view of the alignment produced by RE-MuSiC using the constraint "x(5)-C-U-x(4)-C-x(15,16)-U-G-x(2)-A-x(5,7)-G-x(4)-A-G-x(7,10)-U-x(3)-A-x(5)", where the fragments shaded in yellow, corresponding to the phylogenetically conserved pseudoknots in the 3'-UTRs of the four coronaviruses, are aligned together.

Figure 5.7: The consensus adapted from [54], which was derived by Williams et al. from the 3'-UTRs of various coronaviruses, including HCoV-229E, PEDV, BCoV and MHV.

Chapter 6 Conclusions

In this thesis, we studied the RECMSA problem, whose aim is to find an RECMSA of the input sequences under several user-specified regular expression constraints, such that the substrings of the input sequences that match each regular expression constraint are aligned together. In this model, each of the user-specified constraints is a regular expression, which is useful for expressing biologically important sites such as those stored in PROSITE, as well as structural elements, which often involve variable ranges. In contrast, the plain-strings-with-mismatches model adopted in the previously available tools, MuSiC and MuSiC-ME, is not flexible enough to express such patterns.

We adopted dynamic programming and divide-and-conquer techniques to design a time- and memory-efficient algorithm for optimally solving the RECPSA problem. In addition, we designed a method to find, in the resulting alignment, the regions responsible for the satisfaction of the constraints. Based on this algorithm, we developed the web server RE-MuSiC for the RECMSA problem using the progressive approach. The algorithm underlying RE-MuSiC represents an improvement over the previously proposed algorithm [2], and is more appropriate for implementation in a web server.

Experiments on GST proteins and on coronaviruses with phylogenetically conserved pseudoknots demonstrated that, with additional knowledge incorporated, RE-MuSiC is able to produce meaningful alignments in which important residues or structural elements can be aligned properly, even if the similarity among input sequences is low. Such ability is also useful for prediction purposes.

References

[1] Arslan, A.N. (2005) Regular expression constrained sequence alignment. In Proc. 16th Annual Symposium on Combinatorial Pattern Matching (CPM05), vol. 3537 of Lecture Notes in Computer Science, Springer, pp. 322-333.

[2] Arslan, A.N. (2005) Multiple sequence alignment containing a sequence of regular expressions. In Proc. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB05), pp.1-7.

[3] Arslan, A. N. (2006) An algorithm with linear expected running time for string editing with substitutions and substring reversals. The Proceedings of the Biotechnology and Bioinformatics Symposium (BIOT-2006), pp. 90-96, Provo, Utah, October 20-21, 2006 (acceptance rate = 40%)

[4] Arslan, A. N. and He, D. (2006) An improved algorithm for the regular expression constrained multiple sequence alignment problem. The Proceedings of the 6th IEEE Symposium on Bioinformatics and Biotechnology (BIBE 2006), pp. 121-126, Washington, DC, October 16-18, 2006.

[5] Arslan, A. N. (≥2007) Regular expression constrained sequence alignment, Journal of Discrete Algorithms, Elsevier (to appear)

[6] Bafna, V., Lawler, E. L. & Pevzner, P. A. (1997) Approximation algorithms for multiple sequence alignment. Theoretical Computer Science, 182, 233–244.

[7] Bonizzoni, P. & Vedova, G. D. (2001) The complexity of multiple sequence alignment with SP-score that is a metric. Theoretical Computer Science, 259, 63–79.

[8] Carrillo, H. & Lipman, D. (1988) The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48, 1073–1082.

[9] Chan, S. C., Wong, A. K. C. & Chiu, D. K. Y. (1992) A survey of multiple sequence comparison methods. Bulletin of Mathematical Biology, 54, 563–598.

[10] Cheng, C.Y., Chang, C.H., Wu, Y.J., & Li, Y.K. (2006) Exploration of glycosyl hydrolase family 75, a chitosanase from Aspergillus fumigatus. J. Biol. Chem., 281, 3137-3144.

[11] Chin, F.Y.L., Ho, N.L., Lam, T.W., Wong, P.W.H., & Chan, M.Y. (2005) Efficient constrained multiple sequence alignment with performance guarantee. J. Bioinform. Comput. Biol., 3, 1-18.

[12] Chung, Y.-S., Lu, C.L., & Tang, C.Y. (2006) Efficient algorithms for regular expression constrained sequence alignment. In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM06), vol. 4009 of Lecture Notes in Computer Science, Springer, pp. 389-400.

[13] Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16, 10881–10890.

[14] Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18, 341–343.

[15] Depiereux, E., & Feytmans, E. (1992) MATCH-BOX: A fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci., 8, 501-509.

[16] Dunbrack, R.L. (2006) Sequence comparison and protein structure prediction. Curr. Opin. Struct. Biol., 16, 374-384.

[17] Feng, D. F. & Doolittle, R. F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 351–360.

[18] Gusfield, D. (1993) Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55, 141–154.

[19] Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

[20] He, D., Arslan, A. N., and Ling, A. C. H. (2006) A fast algorithm for the constrained multiple sequence alignment problem. Acta Cybernetica, 17, 701-717.

[21] Higgins, D. & Sharpe, P. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.

[22] Hopcroft, J.E., Motwani, R., & Ullman, J.D. (2001) Introduction to Automata Theory, Languages, and Computation. 2nd edn. Addison-Wesley.

[23] Huang, C.H., Lu, C.L., & Chiu, H.T. (2005) A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics, 21, 3501-3508.

[24] Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., and Bairoch, A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, 134-137.

[25] Kruskal, J. (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48–50.

[26] Li, M., Ma, B. & Wang, L. (2000) Near optimal multiple alignment within a band in polynomial time. In Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing (STOC 2000) pp. 425–434 ACM Press, Portland.

[27] Lipman, D. & Pearson, W. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.

[28] Lu, C.L., & Huang, Y.P. (2005) A memory-efficient algorithm for multiple sequence alignment with constraints. Bioinformatics, 21, 20-30.

[29] Myers, G., Selznick, S., Zhang, Z., & Miller, W. (1996) Progressive multiple alignment with constraints. J. Comput. Biol., 3, 563-572.

[30] Morgenstern, B., Dress, A., & Werner, T. (1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA, 93, 12098-12103.

[31] Morgenstern, B., Frech, K., Dress, A., & Werner, T. (1998) DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics, 14, 290-294.

[32] Morgenstern, B. (1999) DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211-218.

[33] Morgenstern, B. (2004) DIALIGN: Multiple DNA and protein sequence alignment at BiBi-Serv. Nucleic Acids Res., 32, W33-W36.

[34] Morgenstern, B., Werner, N., Prohaska, S.J., Schneider, R.S.I., Subramanian, A.R., Stadler, P.F., & Weyer-Menkhoff, J. (2005) Multiple sequence alignment with user-defined constraints at GOBICS. Bioinformatics, 21, 1271-1273.

[35] Morgenstern, B., Prohaska, S.J., Pohler, D., & Stadler, P.F. (2006) Multiple sequence alignment with user-defined anchor points. Algorithms Mol. Biol., 1, 6.

[36] Notredame, C. (2002) Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131-144.

[37] Nicholas, H. B., Ropelewski, A. J. & Deerfield, D. W. (2002) Strategies for multiple sequence alignment. Biotechniques, 32, 592–603.

[38] Needleman, S. & Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

[39] Pearson, W. (1991) Searching protein sequence libraries: Computation of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithm. Genomics, 11, 635–650.

[40] Pevzner, P. A. (1992) Multiple alignment, communication cost, and graph matching. SIAM Journal on Applied Mathematics, 52, 1763–1779.

[41] Reeder, J., Hochsmann, M., Rehmsmeier, M., Voss, B., & Giegerich, R. (2006) Beyond Mfold: Recent advances in RNA bioinformatics. J. Biotechnol., 124, 41-55.

[42] Reinemer, P., Prade, L., Hof, P., Neuefeind, T., Huber, R., Zettl, R., Palme, K., Schell, J., Koelln, I., Bartunik, H.D. et al. (1996) Three-dimensional structure of glutathione S-transferase from Arabidopsis thaliana at 2.2 Å resolution: Structural characterization of herbicide-conjugating plant glutathione S-transferases and a novel active site architecture. J. Mol. Biol., 255, 289-309.

[43] Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–425.

[44] Sammeth, M., Morgenstern, B., & Stoye, J. (2003) Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics, 19 Suppl. 2, ii189-ii195.

[45] Schuler, G.D., Altschul, S.F., & Lipman, D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins: Struct., Funct., Genet., 9, 180-190.

[46] Sneath, P. & Sokal, R. (1973) Numerical Taxonomy. Freeman, San Francisco, CA.

[47] Song, B., Choi, J.H., Chen, G.Y., Szymanski, J., Zhang, G.Q., Tung, A.K.H., Kang, J., Kim, S., & Yang, J. (2006) ARCS: An aggregated related column scoring scheme for aligned sequences. Bioinformatics, 22, 2326-2332.

[48] Tang, C.Y., Lu, C.L., Chang, M.D.T., Tsai, Y.T., Sun, Y.J., Chao, K.M., Chang, J.M., Chiou, Y.H., Wu, C.M., Chang, H.T., and Chou, W.I. (2003) Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol., 1, 267-287.

[49] Taylor, W. R. (1987) Multiple sequence alignment by a pairwise algorithm. CABIOS, 3, 81–87.

[50] Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence
