
Chapter 4 Recognizing Protein Structural Domains and SCOP Superfamilies

4.3 Experimental Results and Discussion

4.3.2 Example analysis

Figure 4.2 shows a fastSCOP result with the multi-domain immunophilin AtFKBP42 from Arabidopsis thaliana (PDB code 2IF4-A) [100] as the query structure. This protein was released on Oct 31, 2006, and has not been recorded in SCOP. As shown in Figure 4.2A, fastSCOP recognized two domains and their SCOP superfamilies for this query: the FKBP-like superfamily (SCOP entry d.26.1) and the TPR-like superfamily (SCOP entry a.118.8). The FKBP domain (Figure 4.2C) of AtFKBP42 consists of a six-stranded anti-parallel β-sheet wrapped around a short α-helix, and is similar to those of FKBP52 (PDB code 1Q1C-A) [101], FKBP 25 (PDB code 1PBK) [102], FKBP 13 (PDB code 1U79-A) [103] and FKBP 12 (PDB code 1BKF) [104]. The FKBP domain has been demonstrated to interact with the plasma membrane-localized ABC transporters AtPGP1 and AtPGP, which directly mediate cellular auxin efflux [105]. The TPR domain of AtFKBP42 is completely helical and binds to AtHSP90, which is critical to plant development and phenotypic plasticity [106, 107].

After the structural domains and evolutionary superfamilies are recognized, the fastSCOP server allows users to browse similar structures in these superfamilies. Using AtFKBP42 as a query, the server identified 13 and 17 similar structures of the FKBP-like domain and the TPR domain, respectively. Figure 4.2B illustrates the multiple amino-acid sequence alignment and structural alphabet alignment between AtFKBP42 and five FKBP-like homologous proteins, including FKBP52, FKBP 25, FKBP 13 and FKBP 12. The aligned secondary structures are represented as a continuous color spectrum from red through orange, yellow, green and blue to violet (Figures 4.2B and 4.2C). The structural alphabets were strongly conserved in regions of secondary structure, namely β-strands (represented by structural alphabets E, F, H, K, and N) and α-helices (represented by structural alphabets A, Y, B, C, and D). These results reveal that the structural alphabet sequences are much better conserved than the amino acid sequences, which explains why 3D-BLAST detected these distantly related proteins.
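The comparison of conservation described above can be illustrated with a minimal sketch that computes the percent identity of an aligned pair, once for amino-acid letters and once for structural-alphabet letters. The sequence fragments below are hypothetical and are not taken from Figure 4.2B; the function name `percent_identity` is also an assumption made for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned (non-gap) columns in which both letters match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore any column in which either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments: amino acids, then structural alphabets.
aa_query, aa_hit = "GKVLTVH-YEG", "GQTCVVHRYTG"
sa_query, sa_hit = "EEFHKKN-AAY", "EEFHKKNLAAY"

aa_identity = percent_identity(aa_query, aa_hit)
sa_identity = percent_identity(sa_query, sa_hit)
print(f"amino-acid identity: {aa_identity:.1f}%")
print(f"structural-alphabet identity: {sa_identity:.1f}%")
```

In this toy case the structural-alphabet strings agree in every aligned column while the amino-acid strings agree in only half of them, which mirrors the qualitative observation made for the FKBP-like homologs.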

4.4 Web service

The fastSCOP server is accessible at "http://fastSCOP.life.nctu.edu.tw/". The server can identify the structural domains and determine the evolutionary classification of a query structure from evolutionary classification databases. Users input a PDB code with a protein chain (e.g. 2IF4-A). When the query structure is a new protein structure, the fastSCOP server enables users to input the structure file in PDB format.
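As an illustration of the query format mentioned above, the following sketch splits an identifier such as 2IF4-A into a PDB code and a chain. It assumes a four-character PDB code that begins with a digit and an optional single-character chain identifier; the server's actual validation rules are not described here, and `parse_query` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Assumed format: four-character PDB code starting with a digit, then
# an optional "-" followed by a one-character chain identifier.
_QUERY_RE = re.compile(r"^([0-9][A-Za-z0-9]{3})(?:-([A-Za-z0-9]))?$")


def parse_query(query: str) -> Tuple[str, Optional[str]]:
    """Split a query such as '2IF4-A' into (PDB code, chain or None)."""
    match = _QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError(f"not a PDB code with optional chain: {query!r}")
    return match.group(1).upper(), match.group(2)


print(parse_query("2IF4-A"))
print(parse_query("1pbk"))
```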

The server yields the structural domains and SCOP superfamilies of a query structure in 6 seconds on average (Figure 4.2A). It can present the members of the assigned SCOP superfamily and provide both multiple sequence alignments and multiple structural alignments (Figure 4.2B) on request. The aligned structures are rendered in PNG format using the MolScript and Raster3D packages (Figures 4.2C and 4.2D). The server also allows a user to download the aligned structure coordinates in PDB format.

4.5 Summary

This work demonstrated the robustness and feasibility of the fastSCOP server for recognizing the structural domains and the evolutionary classifications of protein structures.

The key contribution of this work is the cooperative integration in fastSCOP of 3D-BLAST (a fast structural database search tool) and MAMMOTH (a fast detailed structural alignment tool); the former is required for efficiency and the latter for accuracy. Future work will adapt fastSCOP to other evolutionary classification databases, such as CATH. Additionally, fastSCOP can be applied to derive structural motifs and sequence motifs from multiple structure and sequence alignments.

Chapter 5 Conclusions

5.1 Summary

In this thesis, a new approach named 3D-BLAST is proposed for fast structural database searches. The core idea of 3D-BLAST is to design a structural alphabet, which is used to encode 3D protein structure databases into structural alphabet sequence databases (SADB), and a structural alphabet substitution matrix (SASM). We then enhanced the sequence alignment tool BLAST so that it searches the SADB using the SASM to rapidly determine protein structure homology or evolutionary classification. 3D-BLAST was designed to maintain the advantages of BLAST, including its robust statistical basis, effective and reliable database search capabilities, and established reputation in biology.
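To make the role of the substitution matrix concrete, the sketch below scores two equal-length structural alphabet strings column by column. The four-letter matrix `TOY_SASM` and its values are hypothetical and chosen only for illustration (the actual SASM covers the full structural alphabet); gapped alignment and the BLAST heuristics are deliberately left out.

```python
# Hypothetical scores: A and Y stand for helix-like letters, E and K for
# strand-like letters. These are NOT the published SASM values.
TOY_SASM = {
    ("A", "A"): 5, ("Y", "Y"): 5, ("E", "E"): 6, ("K", "K"): 6,
    ("A", "Y"): 2, ("E", "K"): 3,
    ("A", "E"): -4, ("A", "K"): -4, ("Y", "E"): -4, ("Y", "K"): -4,
}


def pair_score(x: str, y: str, matrix=TOY_SASM) -> int:
    """Look up a symmetric substitution score for two letters."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def ungapped_score(query: str, hit: str, matrix=TOY_SASM) -> int:
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(query) != len(hit):
        raise ValueError("ungapped alignment needs equal-length strings")
    return sum(pair_score(x, y, matrix) for x, y in zip(query, hit))


similar = ungapped_score("AAYEEK", "AYYEKK")
dissimilar = ungapped_score("AAA", "EEE")
print(similar, dissimilar)
```

Two strings that describe the same local structure accumulate a high score even when individual letters differ, whereas a helix segment aligned to a strand segment is penalized.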

3D-BLAST is rapid and accurate in scanning a large protein structural database, and is useful in an initial scan for similar protein structures, which can then be refined using detailed structural comparison methods. However, the use of 3D-BLAST as a search tool also has several limitations, which arise when (a) 3D-BLAST makes minor shifts in aligning two local segments with similar letters, (b) the E-values of the hit proteins are insignificant, or (c) the query is a multiple-domain protein. To address these limitations, an automated server (fastSCOP) is presented, which integrates a fast structure database search tool (3D-BLAST) and a detailed structural alignment tool (MAMMOTH) to recognize SCOP domains and evolutionary superfamilies of a query structure. The classification accuracy of this server is 98% for 464 single-domain queries and 122 multiple-domain queries.
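The combination of a fast pre-screen with a detailed refinement step can be sketched as a two-stage selection. This is only a schematic under stated assumptions: the hit identifiers, E-values, Z-scores, the `top_n` cutoff, and the function `prescreen_then_refine` are all hypothetical, and the real server's decision rules are not reproduced here.

```python
from typing import Callable, List, Optional, Tuple


def prescreen_then_refine(
    fast_hits: List[Tuple[str, float]],
    detailed_z: Callable[[str], float],
    top_n: int = 3,
) -> Optional[str]:
    """Return the shortlisted hit with the highest detailed-alignment Z-score."""
    # Stage 1: keep the top_n hits with the smallest E-values (fast search).
    shortlist = sorted(fast_hits, key=lambda hit: hit[1])[:top_n]
    if not shortlist:
        return None
    # Stage 2: re-rank the shortlist with the slower, more detailed comparison.
    return max((hit_id for hit_id, _ in shortlist), key=detailed_z)


# Hypothetical hits: (identifier, E-value reported by the fast search).
hits = [("hitA", 1e-3), ("hitB", 5.0), ("hitC", 0.5), ("hitD", 9.0)]
# Hypothetical Z-scores standing in for a detailed structural aligner.
z_scores = {"hitA": 4.0, "hitB": 12.0, "hitC": 6.0, "hitD": 20.0}

best = prescreen_then_refine(hits, z_scores.__getitem__)
print(best)
```

The expensive comparison is run only on the shortlist, so a hit outside the pre-screen (hitD here) is never evaluated even if its detailed score would be high; this is the efficiency versus accuracy trade-off that the shortlist size controls.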

In addition, this study analyzed the feasibility of studying the Space-Related Pharmamotif (SRP) and presented preliminary results of applying SRP to biosynthesis or cancer pathways. We believe that 3D-BLAST can be adopted to develop a motif search tool, called 3D-PHI-BLAST, for rapid pharmalog searches.

5.2 Major Contributions

In short, the major contributions of this thesis can be summarized as follows:

1. We have developed a novel kappa-alpha (κ, α) plot derived structural alphabet and a novel BLOSUM-like substitution matrix, called the structural alphabet substitution matrix (SASM), which is used to search a structural alphabet database (SADB).

2. We present a novel protein structure database search tool, 3D-BLAST, that is useful for analyzing novel structures and can return a ranked list of alignments. This tool has the features of BLAST (for example, robust statistical basis, and effective and reliable search capabilities) and employs a kappa-alpha (κ, α) plot derived structural alphabet and a new substitution matrix. 3D-BLAST searches more than 12,000 protein structures in 1.2 s and yields good results in zones with low sequence similarity.

3. We have built an automated server (fastSCOP), which integrates a fast structure database search tool (3D-BLAST) and a detailed structural comparison tool (MAMMOTH), to recognize SCOP domains and SCOP superfamilies of a query structure. MAMMOTH provides the Z-score and root-mean-square deviation (RMSD) of the Cα atom positions of the aligned residues between the query structure and the hit structure according to the Euclidean distance between corresponding residues, rather than the distance between amino acid 'types' used in sequence alignments. Combining 3D-BLAST and MAMMOTH reduces the ill effects of 3D-BLAST and improves the assignment accuracy.
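The RMSD over Cα positions mentioned in the third contribution can be written out in a few lines. This is a minimal sketch that assumes the two structures are already superposed and that the coordinate lists are in residue correspondence; it does not perform the optimal superposition that a structural alignment tool carries out, and the coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ca_rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD between corresponding C-alpha positions of two superposed chains."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared Euclidean distances between corresponding residues.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in angstroms: chain b is chain a shifted by 1 along z.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
rmsd = ca_rmsd(chain_a, chain_b)
print(f"{rmsd:.2f}")
```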

5.3 Future Perspectives

5.3.1 Space-Related Pharmamotif discovery in interaction site of protein

Small protein sequence or structural segments with highly conserved properties may have important biological functions. On the basis of conservation criteria, such as physicochemical properties and structural similarity, several conserved segments have been identified in proteins that belong to the same family and share a specific function. These segments are termed 'structural motifs'. With their spatial orientation and preserved structural similarity, these motifs represent the conserved core of each protein family.

Previous studies have predicted the fold and function of a protein using short segments of sequence and/or structural elements [108-111].

Various methods have been proposed so far for automated motif discovery in a set of protein sequences [112]. These discovery methods, such as PRINTS [37], PROSITE [38, 113], and Pfam [39], use aligned sequences or a multiple sequence alignment (MSA) as input. In contrast, TEIRESIAS [40], PRATT2 [41] and a specific pattern growth approach [42] identify frequent patterns directly from unaligned biological sequences. Although motif discovery approaches that use only unaligned sequences are more efficient and less computationally intensive, they may yield patterns with less biological meaning.

Moreover, many functional and evolutionary relationships between homologous proteins are so distant that they cannot be clearly detected through MSA and are evident only from pairwise or multiple comparison of the 3D structures. In addition, sequence-based representations are only an approximation of the underlying structural and functional information. Therefore, structural motifs identified at the 3D structure level provide significant and reliable information.

A set of functional structural motifs need not be contiguous in sequence and may be discovered from the spatial clustering of similar side chains that come from different parts of homologous proteins. Finding shared structural motifs in a protein family can be applied to map the interaction sites of different proteins with the same partner [114] or to locate the binding site for a common ligand. In addition, sequence and structure motifs have applications in drug design [115] when the motifs map to functional sites and ligand binding sites.

In the future, we will propose a novel approach for systems biology and drug design based on the recently developed 3D-BLAST method of protein structural identification [34-36]. We will design new structural motifs, named Space-Related Pharmamotifs (SRP), that describe the interacting environment in a protein active site. An SRP is defined as a set of space-related structural motifs shared by a set of similar protein sub-site structures that consistently interact with a ligand, DNA or a peptide.

[Figure 5.1 panel labels: 3D-BLAST: fast homologous search / pre-screen; each protein is used to search against nr-PDB; SRP: a set of structural motifs]

Figure 5.1 The framework of Space-Related Pharmamotif discovery and pharmalogs search.

Figure 5.1 shows the conceptual framework of fast SRP discovery and fast pharmalog search using SRP. For a group of proteins with similar function and ligand, we build a set of structural motifs that describe the interacting environment and thereby provide fast SRP discovery.

Using tertiary protein structure, 3D-BLAST not only allows a fast protein similarity search but also identifies the 23-state structural alphabet (SA) sequences that represent the local structure of an SRP. We integrate 3D-BLAST and detailed structural alignment tools (MAMMOTH [10] and MAMMOTH-multi [116]) to recognize sub-site structures that consistently interact with a ligand. We use 3D-BLAST to quickly scan the PDB database [4] and select the homologous structures. MAMMOTH and MAMMOTH-multi are then used to align the query structure with each homologous structure in turn and to refine the amino acid positions of the alignment. Finally, we identify the SRP based on the functional or ligand-binding sites of the proteins and their spatial orientation.
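One simple way to pick out the residues that form a ligand-binding environment, as a starting point for the final identification step described above, is a distance cutoff around the ligand atoms. The sketch below is an assumption made for illustration and not the SRP procedure itself: the 4.0 Å cutoff, the residue numbers, the coordinates, and the function `binding_site_residues` are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def binding_site_residues(
    residue_atoms: Dict[int, Sequence[Point]],
    ligand_atoms: Sequence[Point],
    cutoff: float = 4.0,  # assumed contact distance in angstroms
) -> List[int]:
    """Return residue numbers with any atom within `cutoff` of any ligand atom."""
    site = []
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            site.append(res_id)
    return sorted(site)


# Hypothetical coordinates: residues 10 and 55 are far apart in sequence
# but both lie close to the ligand in space; residue 11 does not.
residues = {
    10: [(0.0, 0.0, 0.0)],
    11: [(10.0, 0.0, 0.0)],
    55: [(1.0, 1.0, 1.0)],
}
ligand = [(0.0, 0.0, 2.0)]

site = binding_site_residues(residues, ligand)
print(site)
```

The selected residues are not contiguous in sequence, which is consistent with the point that functional structural motifs can arise from spatial clustering of residues from different parts of a protein.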

In addition, our approach can be applied to fast pharmalog search using SRP, which we name 3D-PHI-BLAST (Figure 5.1). From the results of SRP discovery, we can collect SRPs with various functions into a database. Using a protein of unknown function as the query, 3D-PHI-BLAST may provide a rapid motif search through the protein structure and SRP databases to predict the function and the ligand/DNA/peptide pharmacophore binding model.

5.3.2 Immunoinformatics

In the future, the 23-state structural alphabet will be applied to peptide drug design and to the development of immunoinformatics. For peptide drug design, we will focus on peptide-peptide interactions and build a peptide fragment profile database. This database will be constructed using 3D-BLAST, our structural motif database and extensive information about various peptide-peptide interactions.

In addition, we will propose an immunoinformatics system that includes a structural immunoinformatics methodology and immunological databases. The system will be able to screen and design antibodies and peptides with high specificity for diagnostic and therapeutic applications. We will develop several structural bioinformatics methods and enhance or modify them for immunological purposes. We will build integrated immunological databases, which include a CDR segment database, an epitope database and a CDR-epitope interaction database.

Additionally, we will offer services for searching across these databases and report the statistical significance of a search to indicate the reliability of the prediction. Furthermore, we will develop an antibody selection platform as a practical application. This platform will be combined with a phage-display library and a yeast cell-display library. It will also provide rapid motif search to predict therapeutic peptides, together with visualization of drug selection.

Bibliography

1. Burley, S.K., et al., Structural genomics: beyond the human genome project. Nature Genetics, 1999. 23: p. 151-157.

2. Burley, S.K. and J.B. Bonanno, Structural genomics of proteins from conserved biochemical pathways and processes. Current Opinion in Structural Biology, 2002. 12: p. 383-391.

3. Todd, A.E., et al., Progress of structural genomics initiatives: an analysis of solved target structures. Journal of Molecular Biology, 2005. 348: p. 1235-1260.

4. Deshpande, N., et al., The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Research, 2005. 33: p. D233-D237.

5. Watson, J.D., R.A. Laskowski, and J.M. Thornton, Predicting protein function from sequence and structural data. Current Opinion in Structural Biology, 2005. 15: p. 275-284.

6. Altschul, S.F., et al., Basic local alignment search tool. Journal of Molecular Biology, 1990. 215: p. 403-410.

7. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25: p. 3389-3402.

8. Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 1993. 233: p. 123-138.

9. Shindyalov, I.N. and P.E. Bourne, Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 1998. 11: p. 739-747.

10. Ortiz, A.R., C.E. Strauss, and O. Olmea, MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Science, 2002. 11: p. 2606-2621.

11. Madej, T., J.F. Gibrat, and S.H. Bryant, Threading a database of protein cores. Proteins, 1995. 23: p. 356-369.

12. Aung, Z. and K.L. Tan, Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics, 2004. 20: p. 1045-1052.

13. Shyu, C.R., et al., ProteinDBS: a real-time retrieval system for protein structure comparison. Nucleic Acids Research, 2004. 32: p. W572-W575.

14. Martin, A.C., The ups and downs of protein topology; rapid comparison of protein structure. Protein Engineering, 2000. 13: p. 829-837.

15. Guyon, F., et al., SA-Search: a web tool for protein structure mining based on a Structural Alphabet. Nucleic Acids Research, 2004. 32: p. W545-W548.

16. Carpentier, M., S. Brouillet, and J. Pothier, YAKUSA: a fast structural database scanning method. Proteins, 2005. 61: p. 137-151.

17. Bystroff, C. and D. Baker, Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of Molecular Biology, 1998. 281: p. 565-577.

18. Camproux, A.C., R. Gautier, and P. Tuffery, A hidden markov model derived structural alphabet for proteins. Journal of Molecular Biology, 2004. 339: p. 591-605.

19. de Brevern, A.G., C. Etchebest, and S. Hazout, Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins, 2000. 41: p. 271-287.

20. Fetrow, J.S., M.J. Palumbo, and G. Berg, Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins, 1997. 27: p. 249-271.

21. Kolodny, R., et al., Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology, 2002. 323: p. 297-307.

22. Levitt, M., Accurate modeling of protein conformation by automatic segment matching. Journal of Molecular Biology, 1992. 226: p. 507-533.

23. Rooman, M.J., J. Rodriguez, and S.J. Wodak, Automatic definition of recurrent local structure motifs in proteins. Journal of Molecular Biology, 1990. 213: p. 327-336.

24. de Brevern, A.G., New assessment of a structural alphabet. In Silico Biol, 2005. 5: p. 283-289.

25. Tyagi, M., et al., A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins, 2006. 65: p. 32-39.

26. Tyagi, M., et al., Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Res, 2006. 34: p. W119-W123.

27. Unger, R. and J.L. Sussman, The importance of short structural motifs in protein structure analysis. J Comput Aided Mol Des, 1993. 7: p. 457-472.

28. Fourrier, L., C. Benros, and A.G. de Brevern, Use of a structural alphabet for analysis of short loops connecting repetitive structures. BMC Bioinformatics, 2004. 5: p. 58.

29. Lo, W.C., et al., Protein structural similarity search by Ramachandran codes. BMC Bioinformatics, 2007. 8: p. 307.

30. Lo, W.C., et al., iSARST: an integrated SARST web server for rapid protein structural similarity searches. Nucleic Acids Res, 2009. 37(Web Server issue): p. W545-51.

31. Ramachandran, G.N. and V. Sasisekharan, Conformation of polypeptides and proteins. Adv Protein Chem, 1968. 23: p. 283-438.

32. Lo, W.C. and P.C. Lyu, CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships. Genome Biol, 2008. 9(1): p. R11.

33. Chothia, C. and A.M. Lesk, The relation between the divergence of sequence and structure in proteins. EMBO J., 1986. 5: p. 823-826.

34. Tung, C.H., J.W. Huang, and J.M. Yang, Kappa-alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biology, 2007. 8: p. R31.1-R31.16.

35. Yang, J.M. and C.H. Tung, Protein structure database search and evolutionary classification. Nucleic Acids Research, 2006. 34: p. 3646-3659.

36. Tung, C.H. and J.M. Yang, fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies. Nucleic Acids Research, 2007. 35: p. W438-W443.

37. Attwood, T.K., et al., PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research, 2003. 31: p. 400-402.

38. Hulo, N., et al., The PROSITE database. Nucleic Acids Research, 2006. 34: p. D227-D230.

39. Bateman, A., et al., The Pfam protein families database. Nucleic Acids Research, 2004. 32: p. D138-D141.

40. Rigoutsos, I. and A. Floratos, Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 1998. 14: p. 55-67.

41. Jonassen, I., J.F. Collins, and D.G. Higgins, Finding flexible patterns in unaligned protein sequences. Protein Science, 1995. 4: p. 1587-1595.

42. Ye, K., W.A. Kosters, and A.P. Ijzerman, An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinformatics, 2007. 23: p. 687-693.

43. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 1995. 247: p. 536-540.

44. Henikoff, S. and J.G. Henikoff, Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 1992. 89: p. 10915-10919.

45. Huang, C.C., et al., Structural basis of tyrosine sulfation and VH-gene usage in antibodies that recognize the HIV type 1 coreceptor-binding site on gp120. Proceedings of the National Academy of Sciences of the United States of America, 2004. 101: p. 2706-2711.

46. Adachi, S., et al., Direct observation of photolysis-induced tertiary structural changes in hemoglobin. Proceedings of the National Academy of Sciences of the United States of America, 2003. 100: p. 7039-7044.

47. Takano, K., Y. Yamagata, and K. Yutani, Role of amino acid residues at turns in the conformational stability and folding of human lysozyme. Biochemistry, 2000. 39: p. 8655-8665.

48. Hutchinson, E.G. and J.M. Thornton, PROMOTIF--a program to identify and analyze structural motifs in proteins. Protein Science, 1996. 5: p. 212-220.

49. Banner, D.W., et al., Atomic coordinates for triose phosphate isomerase from chicken muscle. Biochemical and Biophysical Research Communications, 1976. 72: p. 146-155.

50. Hogbom, M., et al., The radical site in chlamydial ribonucleotide reductase defines a new R2 subclass. Science, 2004. 305: p. 245-248.

51. Kumar, S. and M. Bansal, Geometrical and sequence characteristics of alpha-helices in globular proteins. Biophysical Journal, 1998. 75: p. 1935-1944.

52. Barlow, D.J. and J.M. Thornton, Helix geometry in proteins. Journal of Molecular Biology, 1988. 201: p. 601-619.

53. Milner-White, E.J., Recurring loop motif in proteins that occurs in right-handed and left-handed forms. Its relationship with alpha-helices and beta-bulge loops. Journal of Molecular Biology, 1988. 199: p. 503-511.

54. Pearson, W.R. and D.J. Lipman, Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988. 85: p. 2444-2448.

55. Karplus, K., C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 1998. 14: p. 846-856.

56. Pearl, F., et al., The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Research, 2005. 33: p. D247-D251.

57. Vetting, M.W., et al., A bacterial acetyltransferase capable of regioselective N-acetylation of antibiotics and histones. Chemistry & Biology, 2004. 11: p. 565-573.

58. Nagano, N., C.A. Orengo, and J.M. Thornton, One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. Journal of Molecular Biology, 2002. 321: p. 741-765.

59. Wolf, E., et al., Crystal structure of a GCN5-related N-acetyltransferase: Serratia marcescens aminoglycoside 3-N-acetyltransferase. Cell, 1998. 94: p. 439-449.

60. Peapus, D.H., et al., Structural characterization of the enzyme-substrate, enzyme-intermediate, and enzyme-product complexes of thiamin phosphate synthase. Biochemistry, 2001. 40: p. 10103-10114.

61. Terwilliger, T.C., Structural genomics in North America. Nature Structural Biology, 2000. 7 Suppl: p. 935-939.

62. Wilmanns, M., et al., Structural conservation in parallel beta/alpha-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis. Biochemistry, 1991. 30: p. 9161-9169.

63. Schaffer, A.A., et al., IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 1999. 15: p. 1000-1011.

64. Uzumaki, T., et al., Crystal structure of the C-terminal clock-oscillator domain of the cyanobacterial KaiA protein. Nature Structural & Molecular Biology, 2004. 11: p. 623-631.

65. Lamb, A.L., et al., Heterodimeric structure of superoxide dismutase in complex with its metallochaperone. Nature Structural Biology, 2001. 8: p. 751-755.

66. Rosenzweig, A.C., et al., Crystal structure of the Atx1 metallochaperone protein at 1.02 A resolution. Structure, 1999. 7: p. 605-617.

67. Hurley, J.K., et al., Structure-function relationships in Anabaena ferredoxin: correlations between X-ray crystal structures, reduction potentials, and rate constants of electron transfer to ferredoxin:NADP+ reductase for site-specific ferredoxin mutants. Biochemistry, 1997. 36: p. 11100-11117.

68. Zhang, C. and S.H. Kim, Overview of structural genomics: from structure to function. Current Opinion in Chemical Biology, 2003. 7: p. 28-32.

69. Chance, M.R., et al., High-throughput computational and experimental techniques in structural genomics. Genome Research, 2004. 14: p. 2145-2154.

70. Grandori, R. and J. Carey, Six new candidate members of the alpha/beta twisted open-sheet family detected by sequence similarity to flavodoxin. Protein Science, 1994. 3: p. 2185-2193.

71. Frazao, C., et al., Structure of a dioxygen reduction enzyme from Desulfovibrio gigas. Nature Structural Biology, 2000. 7: p. 1041-1045.

72. Harris, M.A., et al., The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 2004. 32: p. D258-D261.

73. Falquet, L., et al., The PROSITE database, its status in 2002. Nucleic Acids Research,
