PiSA-BLAST: A New Tool for Protein Structure Alignment and Database Search


Master's Thesis, Institute of Bioinformatics, National Chiao Tung University. PiSA-BLAST: A New Tool for Protein Structure Alignment and Database Search. Student: Chi-hua Tung. Advisor: Prof. Jinn-Moon Yang. July 2005.

PiSA-BLAST: A New Tool for Protein Structure Alignment and Database Search. Student: Chi-hua Tung. Advisor: Jinn-Moon Yang. A Thesis Submitted to the Institute of Bioinformatics, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Bioinformatics. July 2005, Hsinchu, Taiwan, Republic of China.

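Abstract (in Chinese)

As the number of known protein structures has grown rapidly in recent years, effective methods for searching structure databases have become increasingly important. When a new protein crystal structure is determined, researchers want to know whether it resembles proteins of known structure, and how closely. Because the number of protein crystal structures is large, researchers need an accurate and efficient tool for finding similar structures. In this thesis we develop a new tool, PiSA-BLAST, that provides accurate alignments and also greatly speeds up structure searching. The tool is based on two structural descriptors defined by the DSSP program, the kappa angle and the alpha angle; analyzing them with a clustering algorithm yields a table of conversion rules. Using this table, every known structure in the protein structure database is converted into a one-dimensional sequence, and these sequences are collected into a sequence database. For this sequence database we also develop a new scoring matrix for computing alignment scores. We then combine these with the well-known sequence alignment tool BLAST: given a query protein structure, PiSA-BLAST searches and aligns against a structure database containing a large number of sequences, without actually superimposing two three-dimensional structures, and returns a list of similar proteins.

We selected five test sets from the SCOP and PDB databases to validate the performance of PiSA-BLAST. Taking as an example the search of 108 query structures against SCOP 95, a database containing 9,354 protein structures, the average precisions of PiSA-BLAST and CE over the 108 queries are 78.2% and 82.1%, respectively. The total search time of PiSA-BLAST is only 34 seconds, far less than the 1,169,832 seconds required by CE. PSI-BLAST achieves an average precision of 69.8% and takes 18.3 seconds in total.

The results of this thesis lead to the following conclusions. (1) PiSA-BLAST can search structure databases at a speed close to that of BLAST and is about 34,000 times faster than CE. (2) PiSA-BLAST achieves accuracy close to that of CE and gives more accurate search results than amino-acid-based sequence alignment tools such as BLAST and PSI-BLAST. These results show that the structural codes and scoring matrix we developed are valid and usable for structure alignment. (3) Just as BLAST outputs an e-value for sequence alignment, PiSA-BLAST provides this value for structure searches. In our tests, PiSA-BLAST reaches 90% accuracy when the e-value is below the threshold e-15. (4) PiSA-BLAST can serve as a fast filter for structure alignment: a quick search is run first, and its hits are then analyzed a second time with slower but more detailed and reliable tools such as CE and DALI. (5) A web service has been built for PiSA-BLAST, so users can search structure databases online in real time. In summary, this research should make a considerable contribution to genomics and proteomics.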

ABSTRACT

Structural database searching has become increasingly important with the growing number of known protein structures. This increase was nearly exponential in the early 1990s and has become linear over the past several years. As more and more protein crystal structures become available, the demand for a fast and accurate method of searching for structures similar to a query structure is high. In this thesis we develop a new tool, termed PiSA-BLAST, for protein structure database searching that does not require the alignment of two 3D structures. The method aligns protein structures by transforming 3D structures into 1D sequences. It uses the kappa and alpha angles, derived from the DSSP program, to represent the protein 3D structure. Based on segment information and a clustering method, we transform the structural information given by the kappa and alpha angles into coded regions. Each protein 3D structure can then be converted into a 1D sequence, and we develop a new substitution matrix for the 23 new codes that serves as the scoring matrix for sequence alignment. The encoded sequences are collected as a structure database. BLAST, a well-known sequence alignment tool, is then used to search the structure database in a short time and return a list of structurally similar proteins.

We evaluated PiSA-BLAST on five diverse data sets from SCOP and the Protein Data Bank. For the SCOP 95 data set, with 108 queries on 9,354 protein domains, the average precisions of PiSA-BLAST and CE are 78.2% and 82.1%, respectively, and the total execution times are 34 seconds for PiSA-BLAST and about 1,169,832 seconds for CE. For PSI-BLAST, the average precision is 69.8% and the time is 18.3 seconds. Based on these experiments, we summarize several observations: (1) PiSA-BLAST is as fast as BLAST for protein structure database searching and is about 34,000 times faster than CE on the SCOP 95 database. (2) The accuracy of PiSA-BLAST is close to that of CE and much better than that of BLAST and PSI-BLAST, which are based on amino acid sequences. These results imply that our new structural codes and substitution matrix are useful for protein structure alignment. (3) An e-value of e-15 in PiSA-BLAST structure database searching is as significant as an e-value of e-3 in BLAST sequence database searching; PiSA-BLAST achieves about 90% accuracy for a query when the e-value is less than e-15. (4) PiSA-BLAST is a useful filtering tool before a detailed database search with tools such as CE and DALI. (5) PiSA-BLAST can provide real-time web services for protein structure database searching, as BLAST does for protein sequence searching. We believe that this work is important for structural genomics and proteomics.

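Acknowledgements

Looking back on three years of graduate study, I first and foremost sincerely thank my advisor, Professor Jinn-Moon Yang. Throughout the research he devoted a great deal of effort and time to guiding my thesis and my work with care. His meticulousness, rigor, persistence and dedication in academic research will remain goals for me to keep learning toward. The successful completion of this thesis owes much to the help of all my labmates. Special thanks go to 章維, whose early preliminary research and programming skill laid a solid foundation for me and allowed my research to proceed smoothly. I also thank my two good friends, 開文 and 靜婷. When I was at my lowest, they gave me endless encouragement, comfort and warmth; when I was happiest, they went on outings with me, chatted with me and helped me relieve stress. Without their company I could hardly have come this far. Finally, I thank my family, who quietly gave me support, encouragement and motivation, so that I could pursue my studies without worry and give them my all. There are far too many people to thank, so let me simply thank heaven.

Chi-hua, Summer '05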

CONTENTS

Abstract (in Chinese)
Abstract
Acknowledgements (in Chinese)
Contents
List of Tables
List of Figures

Chapter 1. Introduction
1.1 Motivations and Purposes
1.2 Related Works
1.3 Thesis Overview

Chapter 2. Materials and Methods
2.1 Preparing Training Set from Protein Structure Database
2.2 Dividing Protein Structures into Segments by Kappa-Alpha Angle Map
2.3 Finding Representative Segments and Using Nearest Neighbor Clustering Algorithm for New Codes Assigning
2.4 Generating a Substitution Matrix for 23 New Codes
2.5 Structure Searching by Sequence Alignment tool: BLAST and PSI-BLAST
2.6 Evaluating the Performance
2.7 Practical Applications

Chapter 3. Results and Discussions
3.1 Representative Segments of 23 New Codes
3.2 The Substitution Matrix for 23 New Codes
3.3 Evaluating Statistical Significance
3.4 Speed Evaluations
3.5 Performance Factor Analysis: Sequence Identity, Structure Similarity and Expect Value
3.6 Same Searching Cases Analysis
3.7 PiSA-BLAST on Practical Applications
3.8 Web Service

Chapter 4. Conclusions
4.1 Summary
4.2 Major Contributions and Future Perspectives

References

List of Tables

Table 1. A small test set selected from previous work [14]
Table 2. Summary of 108 queries selected from SCOP all and SCOP 95
Table 3. Comparison PiSA-BLAST with six methods on the dataset shown in Table 1
Table 4. Executing times of 20 queries on the database with 200 proteins shown in Table 1
Table 5. Running times of 108 queries on the database with 33311 proteins shown in Table 2
Table 6. Comparison running times of BLAST, PSI-BLAST and PiSA-BLAST for 108 queries searching on five databases selected from PDB and SCOP
Table 7. Average precisions of five alignment tools on 108 queries searching on the SCOP 95 database

List of Figures

Figure 1. Step-by-step illustration of the PiSA-BLAST methodology using 1brbI as the query protein searching against nr-PDB (protein data bank)
Figure 2. Overview of our method
Figure 3. Comparison the amino acids compositions of our train set, including 1584 proteins for encoding the structured codes and the substitute matrix, with three well-known structure databases (DSSP database, SCOP 95 and SCOP 40 database)
Figure 4. The kappa-alpha distribution of 263696 segments in our training set (792 protein pairs)
Figure 5. Accumulated distributions of (A) 20 kinds of amino acids and (B) 23 new codes in training set
Figure 6. The conformations of the representative segments of 23 new codes
Figure 7. The conformations of representative segment in each cell of four main groups: (I) helix codes (A, Y, B, C, D) have 4 segments; (II) helix-like codes (G, I, L) have 12 segments; (III) strand codes (E, F, H) have 15 segments; (IV) strand-like codes (K, N) have 11 segments
Figure 8. The distribution relationship between 23 new codes (in PiSA-BLAST) and 8 secondary structure codes (in DSSP): (A) The structural-coded distribution of helix codes (H, G and I) in DSSP; (B) The structural-coded distribution of strand codes (E and B) in DSSP; (C) The structural-coded distribution of loop codes (S, T and others) in DSSP
Figure 9. The average precisions of PiSA-BLAST on 108 queries searching on SCOP 95 using various values of λ and gap penalty
Figure 10. The average precision plot of PiSA-BLAST on 108 queries searching on SCOP 95 using various values of λ
Figure 11. The substitution matrix of 23 new codes
Figure 12. Recall-precision curves of CE using z-score and rmsd to order searching results on 108 queries searching the SCOP 95 database
Figure 13. Recall-precision curves of five alignment tools for 108 queries on the large database of 33311 proteins indicated in Table 2
Figure 14. Recall-precision curves for 108 queries with CE, BLAST, PSI-BLAST, PiSA-BLAST and PiSA-PSI-BLAST on SCOP 95 database
Figure 15. ROC curves of three tools performing 108 queries on the large database of 33311 proteins shown in Table 2
Figure 16. ROC curves of five tools perform 108 queries on SCOP 95% database
Figure 17. The illustration of “chain-break” problem in CE alignment
Figure 18. The illustration of the problem of ordering the searching results by Z-score in CE alignment
Figure 19. The relationship between e-value and structure similarity in PiSA-BLAST
Figure 20. The relationship between e-value and precision in PiSA-BLAST
Figure 21. Comparison PiSA-BLAST with BLAST with high sequence identity (> 25%) on two databases: (A) the database with 33311 proteins shown in Table 2 and (B) the SCOP 95
Figure 22. Comparison PiSA-BLAST with BLAST with low sequence identity (< 25%) on two databases: (A) the database with 33311 proteins shown in Table 2 and (B) the SCOP 95
Figure 23. Comparison PiSA-BLAST with BLAST with high Z-score (> 3.5 by CE) on two databases: (A) the database with 33311 proteins shown in Table 2 and (B) the SCOP 95
Figure 24. Comparison PiSA-BLAST with BLAST with low Z-score (< 3.5 by CE) on two databases: (A) the database with 33311 proteins shown in Table 2 and (B) the SCOP 95
Figure 25. The correlations between Z-score (CE) and sequence identity calculated by (A) PiSA-BLAST and (B) BLAST
Figure 26. The results of FASTA, PiSA-BLAST and CE alignment to related domains: query protein “d1qe0a1” and subject protein “d1nj1a1”
Figure 27. The results of FASTA, PiSA-BLAST and CE alignment to related domains: query protein “d1gr3a_” and subject protein “d1aly__”
Figure 28. The results of FASTA, PiSA-BLAST and CE align with related domains: query protein “d1dbqa_” and subject protein “d1tlfa_”
Figure 29. The results of FASTA, PiSA-BLAST and CE align with related domains: query protein “d1cjwa_” and subject one “d1cm0a_”
Figure 30. A bad case in our method for comparison of two related domain proteins, d1mkma1 and d1e17a_
Figure 31. A false positive example in PiSA-BLAST for comparison of two non-related domain proteins, (A) d1jbga_ and (B) d1pk5a_
Figure 32. A practical application of fold assignment using PiSA-BLAST
Figure 33. The illustration of PiSA-BLAST web service

Chapter 1 Introduction

1.1 Motivations and Purposes

Protein structures are being determined at a very rapid rate: as of 07-Jun-2005 there were more than 31,000 proteins in the Protein Data Bank (PDB), and the number is increasing daily. As a result, faster tools for structural comparison and database searching have become essential. Protein structure comparisons have been made since the very early days of protein crystallography, and these pioneering works have been reviewed [1]. However, the early methods are too slow to handle the volume of data that is now available.

In general, the similarity of two remotely homologous proteins cannot be detected by sequence comparison alone, because the amino acid sequences do not carry sufficient information. We therefore need to compare their 3D structures, which are better preserved than sequences throughout evolution. Usually a protein structure is compared against a database of other protein structures to find the structures that are similar to it.

Here we develop a novel structure alignment tool, termed PiSA-BLAST, for protein structure comparison and fast database searching. PiSA-BLAST can not only scan a whole protein database as fast as sequence alignment but also obtain acceptable accuracy. Our method uses segment information, namely the kappa and alpha angles derived from the DSSP program, to represent the local 3D structures of proteins. With a nearest neighbor clustering algorithm [2], we transform the 2D information of kappa and alpha angles into 23 new coded residues. In this way, each protein 3D structure in the PDB can be described as a 1D sequence. After the transformation, we develop a new substitution matrix for the 23 codes and use it in place of the default matrix of the sequence alignment tool. The structure comparison is then performed by a well-known sequence alignment program such as BLAST [3, 4], which searches for similar coded sequences converted from other proteins. Our results show that PiSA-BLAST is 5000 times faster than the popular CE method for structural database searching, while its overall accuracy is only slightly inferior to that of CE. Although our method does not reach the accuracy of CE, it can be used as a pre-filtering tool before a detailed database search by more delicate but slower structure alignment tools.

1.2 Related Works

Past research has shown that different amino acid sequences may determine similar protein structures [5, 6]. If two proteins share 30% or more sequence identity, they may have quite similar 3D structures [7]. However, sequence comparison alone cannot provide the required information in the twilight zone of protein sequence alignments [8]. Using only sequence alignment to detect structural similarity misses proteins with low sequence identity but high structural similarity. Structural comparison must be performed in this case.

Many methods have been proposed and implemented for structural comparison. The classical pairwise comparison methods include DALI [9], VAST [10, 11] and CE [12]. These are two-level methods, which start by finding matching pairs of secondary structure elements (SSEs) or Cα backbone fragments and then proceed to the detailed search for matching Cα atom pairs. The distance matrix alignment (DALI) algorithm is the core of FSSP [9]. It builds residue-to-residue distance matrices and uses Monte Carlo optimization to compare them. The vector alignment search tool (VAST) represents protein secondary structure elements as vectors in order to compare 3D protein structures and determine protein structure neighbors [10, 11]. In the combinatorial extension (CE) method, the proteins are broken into aligned fragment pairs, which are then joined into an optimal path for the full alignment [12]. These methods give good-quality answers, but when performing a database search they all rely on exhaustive searching, which results in slow response times.

Methods such as TopScan [13] are examples of pairwise comparison methods that take SSEs as the basic elements to be compared. These methods are less accurate, but much faster, than the two-level methods. However, when searching against a very large database, they still cannot provide the required quick response time. The design strategy of ProtDex2 [14] is to apply information retrieval (IR) approaches, using SSEs as the basic elements, in order to perform rapid database searching without accessing every 3D structure in the database. ProtDex2 first builds an inverted-file index based on the feature vectors of the relationships among the SSEs of all the protein structures in the database.

Unlike 1-dimensional sequence comparison, comparing two structures to determine their similarity is complex and computationally expensive. Although some of the related methods are very efficient for pairwise structure comparison, their main disadvantage is that a structural database search compares the query structure exhaustively against all protein structures in the database. Exhaustive searching has given satisfactory response times until now.

However, given the rapid growth of structural databases expected in the near future, such exhaustive structural database searching will become prohibitively expensive.

With a query protein structure, we search through the database and report the structures that are similar to the query. A similarity threshold may be defined, and the structures whose scores are equal to or above the threshold are reported. Because a global search through a structural database is very expensive, fast but rough searching methods such as TopScan [13] and ProtDex2 [14] can be used as a pre-filter before further database searching. In this way, structures that are very unlikely to appear in the report can be eliminated by a quick screening before the expensive comparison.

1.3 Thesis Overview

We develop a novel sequence-based structure alignment tool, PiSA-BLAST, for fast database searching. In Chapter 2, we prepare a training set from the ASTRAL SCOP database 1.65 40% set. We divide the domain proteins of the training set into segments with various kappa and alpha angles. We then find a representative segment for each kappa-alpha angle cell and use a clustering algorithm to group these representative segments. After that, we assign a new code to each group. Next, we develop a substitution matrix for the new codes and use it to replace the default matrix of the sequence alignment tool. Finally, we run the sequence alignment tool to perform fast protein structure searches in a database.

In Chapter 3, we show the conformations of representative segments that belong to the same coding region, together with the new substitution matrix. In addition, we evaluate the database searching time and screening performance of PiSA-BLAST on several test sets by precision, recall, false positive rate and ROC curve. We also discuss the relationship between precision, sequence identity, structure similarity and expect value, and give examples that explain how PiSA-BLAST works in practical applications and what weaknesses it has.

Chapter 4 presents conclusions and future perspectives. Our major contribution is a novel, fast structure alignment tool for protein database searching. The coded sequence has biological meaning. From 3D to 1D, PiSA-BLAST decreases execution time by translating a 3D structure into a 1D sequence and aligning structures at the sequence level. From 1D to 3D, PiSA-BLAST enhances the accuracy of sequence alignment for structure searching by adding segment information to the 1D sequence. Because its structure database searching is fast, PiSA-BLAST can be applied to biological problems such as fold assignment and homology searching. Furthermore, PiSA-BLAST could be used in the future for several practical applications, for example multiple structure alignment, finding structural motifs, protein function prediction, and protein-protein interaction.

Chapter 2 Materials and Methods

A step-by-step illustration of the PiSA-BLAST methodology is shown in Figure 1. The input is a query protein with a known 3D structure, searched against a structure database. Every 3D structure in the database is divided into 5-mer structure segments according to its kappa and alpha angles. After determining the segments, we translate them into an encoded sequence according to the kappa-alpha clustering map. The next step is to run the structure alignment on the encoded sequences using the sequence alignment tool BLAST. As a result, we obtain the alignment score, the structure similarity and even the superposition sites of the two aligned proteins.

The flowchart of the research steps is shown in Figure 2. First, we prepare a training set from the ASTRAL SCOP database 1.65 40% set [15, 16]. Second, we divide the domain proteins of the training set into segments with various kappa and alpha angles. Then we find a representative segment for each kappa-alpha angle cell and use a clustering algorithm [2] to group these representative segments. After that, we assign a new code to each group. Next, we develop a substitution matrix for the new codes and use it to replace the default matrix of the sequence alignment tool. We can then use the sequence alignment tool to perform fast protein structure searches in a database and evaluate the performance. Finally, we apply PiSA-BLAST to practical applications.

2.1 Preparing Training Set from Protein Structure Database

We prepare 792 pairs of domain proteins from the ASTRAL SCOP database 1.65 40% set [15, 16] for developing the 3D-to-1D coding and establishing the new substitution matrix. The training set is collected according to the following principles.

First, we select families with at least two domain proteins, 882 families in total. From these families, we randomly select one domain pair per ten domain proteins. Each domain pair belongs to the same family, and the sequence identity of each pair is less than 40%. Second, after structure alignment by CE, the RMSD of each domain pair is less than 5 Å. Third, no selected domain protein contains the residue “X”.

We expect our training set to reflect the real amino acid composition. Figure 3 compares the amino acid composition of our training set, which includes 1584 proteins used for deriving the structural codes and the substitution matrix, with three well-known structure databases (the DSSP database [17], SCOP 95 and SCOP 40 [15]). The amino acid composition distributions of these four databases are similar, so our training set can provide correct and meaningful information.

2.2 Dividing Protein Structures into Segments by Kappa-Alpha Angle Map

The kappa angle is the virtual bond angle (bend angle) defined by the three C-alpha atoms of residues I-2, I and I+2; it ranges from 0° to 180°. The alpha angle is the virtual torsion angle (dihedral angle) defined by the four C-alpha atoms of residues I-1, I, I+1 and I+2; it ranges from -180° to 180° (described at http://www.cmbi.kun.nl/gv/dssp/de-scrip.html#SECSTRUC). Following these definitions, we define a local structure five residues long as a segment.

The 792 domain protein pairs were divided into a total of 263,696 segments.
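As a rough illustration of the two angles defined in Section 2.2, the following Python sketch computes kappa and alpha for every five-residue segment of a C-alpha trace. It is not the DSSP implementation. It assumes that kappa is measured as the angle between the direction vector from residue I-2 to I and the direction vector from I to I+2 (so that a straight chain gives 0°), and that alpha follows the usual signed dihedral convention. The function names and the idealised helix used as input are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def kappa_angle(ca, i):
    """Bend angle in degrees at residue i, from C-alpha atoms i-2, i, i+2.

    Assumption: angle between the two chain-direction vectors (0 = straight).
    """
    v1 = _sub(ca[i], ca[i - 2])
    v2 = _sub(ca[i + 2], ca[i])
    cos_k = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    cos_k = max(-1.0, min(1.0, cos_k))  # guard against rounding error
    return math.degrees(math.acos(cos_k))

def alpha_angle(ca, i):
    """Signed virtual torsion in degrees of C-alpha atoms i-1, i, i+1, i+2."""
    b1 = _sub(ca[i], ca[i - 1])
    b2 = _sub(ca[i + 1], ca[i])
    b3 = _sub(ca[i + 2], ca[i + 1])
    n1 = _cross(b1, b2)
    n2 = _cross(b2, b3)
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def segment_angles(ca):
    """(kappa, alpha) for each residue with two neighbours on both sides."""
    return [(kappa_angle(ca, i), alpha_angle(ca, i))
            for i in range(2, len(ca) - 2)]

def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Hypothetical test input: C-alpha trace of an idealised right-handed helix."""
    return [(radius * math.cos(math.radians(turn_deg * k)),
             radius * math.sin(math.radians(turn_deg * k)),
             rise * k) for k in range(n)]

for kappa, alpha in segment_angles(ideal_helix(8)):
    print(f"kappa={kappa:.1f}  alpha={alpha:.1f}")
```

For this idealised helix, kappa comes out near 110° and alpha near 50°. Real coordinates would be read from a PDB file, and the values actually used by PiSA-BLAST are those reported by DSSP.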

These segments are distributed over a range of kappa and alpha angles. Figure 4 shows the distribution of the 263,696 segments over kappa and alpha angles; the color bar on the right side shows the distribution scale. The segments are encoded into 23 codes based on this distribution. The helix-like segments (e.g., A, B, C and D) include more than 9000 segments whose alpha angles range from 40° to 60° and whose kappa angles range from 100° to 120°. The strand-like segments (e.g., E and F) include over 3000 segments with alpha angles from -180° to -140° and kappa angles from 0° to 20°.

Because of the large number of segments, we cluster them in order to choose representative segments and assign meaningful codes.

2.3 Finding Representative Segments and Using Nearest Neighbor Clustering Algorithm for New Codes Assigning

There are 648 cells in total on the kappa-alpha angle map K. Each cell contains many segments, as shown in Figure 4. We use the following simple procedure to choose one representative segment for each cell.

We build an inter-segment distance matrix for the cell. Let dij be the structural distance (measured by a superimposition program [18]) between segment i and segment j, where i and j run over all segments in the cell. We then sum each column of the distance matrix and take the column with the minimum sum. The representative segment of a cell is thus the segment with the lowest total structural distance to the other segments.

After finding a representative segment for every cell, we group representative segments with similar conformations by nearest neighbor clustering.
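The column-sum rule for choosing a representative amounts to picking the medoid of each cell. A minimal Python sketch, assuming the pairwise structural distances for one cell have already been computed (the thesis obtains them from a superimposition program [18]; here they are simply passed in as a symmetric matrix, and the function name is hypothetical):

```python
def representative_index(dist):
    """Index of the segment whose summed distance to all others is smallest.

    dist is a square, symmetric list of lists with zeros on the diagonal.
    Ties are resolved in favour of the lowest index.
    """
    n = len(dist)
    column_sums = [sum(dist[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=column_sums.__getitem__)

# Toy cell with three segments; the distances are made-up numbers.
toy = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
print(representative_index(toy))
```

In the toy matrix the column sums are 5, 3 and 6, so the second segment (index 1) is chosen.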

The nearest neighbor clustering algorithm [2] operates on a matrix D, where N is the number of representative segments to be clustered. D stores the RMSD values between representative segments: Dij is a measure of the structural similarity (computed by the superimposition program [18]) between representative segments i and j. Clusters are formed recursively by adding other representative segments according to the nearest neighbor criterion. The method is as follows.

Input:
(1) The matrix D, which stores the RMSD values between all pairs of representative segments. Dij is a measure of the structural similarity between representative segments i and j (0≦i, j≦648).
(2) The matrix K, which collects the numbers of segments at the various kappa and alpha angles. Kab is the number of segments at alpha angle a° and kappa angle b° (0≦a≦36, 0≦b≦18).

Output: The encoding rule map E, which assigns one letter of the alphabet to each alpha-kappa cell. The size of the encoding rule map is 36*18, following the ranges of the alpha and kappa angles: the alpha angle is divided into 10° intervals from -180° to 180°, and the kappa angle into 10° intervals from 0° to 180°.

Steps:
(1) Among the cells of E that have not yet been assigned a code, select the cell Eab with the largest Kab as the center of a new cluster.

(2) Let representative segment i be the representative segment of this center. Sort the values Di,0 to Di,648.
(3) Following the sorted order from top to bottom, repeatedly add each cell Ea’b’ to the cluster with center Eab if Ea’b’ satisfies both of the following conditions.
(3.1) Given a threshold t on the nearest neighbor distance, and letting representative segment j be the representative segment of Ea’b’, Dij is less than t.
(3.2) Given a threshold u on the maximum number of fragments, the total number of fragments in the cluster after adding the cell is still less than u.
(4) Check whether the cell Ea’b’ has already been grouped into another cluster.
(4.1) If not, add the cell Ea’b’ to the cluster with center Eab and record Dij for the optimized clustering.
(4.2) Otherwise, compare the previously recorded and the present values of Dij.
(4.2.1) If the present Dij is less than the previous one, re-assign Ea’b’ to the present cluster, provided that the total number of fragments in this cluster remains less than u.
(4.2.2) If the previous Dij is less than the present one, do nothing and keep the previous cluster.
(5) Repeat steps 1 to 4 until every cell of E is clustered into one of 21 groups.
(6) The first group has only one cell of E. This cell Eab is assigned code “A”. Segments coded “A” with an alpha angle of more than 46° and a kappa angle of less than 114° are assigned another code, “Y”.
(7) Every cell of E in the second group is assigned code “B”, every cell in the third group code “C”, and so on; the cells in the last group are assigned code “X”. The letters J, O and U are excluded from the code assignment.
(8) If Kab is less than 40, the cell is assigned code “Z”.

(22) (9) Every Eab is now assigned one code; output the encoding rule map E.

The thresholds t and u are chosen according to the number of groups desired. Here t is 0.72 and u is 18450, which yields 21 groups. The threshold u is set to 7% of the total number of segments.

Each group of cells is assigned a new code. There are 21 codes, named with the letters “A” to “X” (excluding “J”, “O” and “U”). If the number of segments in a cell is less than 50, the cell is assigned code “Z”. In addition, when a structure is encoded as a sequence, a code “A” with alpha angle greater than 46° and kappa angle less than 114° is assigned to another code, “Y”.

2.4 Generating a Substitution Matrix for 23 New Codes

The substitution matrix is generated following the method used for BLOSUM62 [19]. The elements of the matrix are calculated as follows. For each residue position in the training set, a pair database of aligned structural pairs, statistics are counted at each aligned position. Each protein chain is treated as a coded sequence aligned to a structure. The substitution score for codes i and j in homologous structures is given by the information value [20].

Let fij be the total number of amino acid i, j pairs (1≦j≦i≦20) in each entry of the frequency table. Then the observed probability of occurrence for each i, j pair is

$$ q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij} \tag{1} $$
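The pair counting and the normalisation of Equation (1) can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names, the gap symbol, the choice to skip gapped columns, and the toy alignments are all assumptions made for illustration.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; the thesis does not specify one


def count_aligned_pairs(aligned_pairs):
    """Count unordered code pairs f_ij over all aligned, gap-free columns."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(seq_a, seq_b):
            if x == GAP or y == GAP:
                continue  # assumption: columns containing a gap are not counted
            # Sorting makes (x, y) and (y, x) the same entry, i.e. j <= i.
            counts[tuple(sorted((x, y)))] += 1
    return counts


def observed_probabilities(counts):
    """Equation (1): q_ij = f_ij divided by the sum of all f_ij."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no aligned pairs to count")
    return {pair: n / total for pair, n in counts.items()}


# Toy aligned coded sequences (hypothetical, not from the training set).
toy_alignments = [("AABH", "AYBH"), ("HHE-", "HEEA")]
f = count_aligned_pairs(toy_alignments)
q = observed_probabilities(f)
print(q[("H", "H")])
```

Storing each pair under a sorted key keeps only the lower triangle of the frequency table, which matches the index range 1≦j≦i used in Equation (1).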

(23) Next we estimate the expected probability of occurrence for each i, j pair. It is assumed that the observed pair frequencies are those of the population. In general, the probability of occurrence of the ith amino acid in an i, j pair is

$$ p_i = q_{ii} + \sum_{j \neq i} q_{ij} / 2 \tag{2} $$

The expected probability of occurrence eij for each i, j pair is then pipj for i = j and pipj + pjpi = 2pipj for i≠j:

$$ e_{ij} = \begin{cases} p_i p_j & \text{if } i = j \\ 2 p_i p_j & \text{if } i \neq j \end{cases} \tag{3} $$

The substitution matrix scores are then defined as

$$ s_{ij} = \lambda \log_2 \frac{q_{ij}}{e_{ij}} \tag{4} $$

where λ is an arbitrary positive rational number. Here, λ is set to 1.89 for the best performance and efficiency.

The overall procedure for choosing the λ value and the optimized gap penalties is as follows. In the first step, we tested λ values from 1.0 to 10.0 in steps of 0.5. The results showed that λ values between 1.5 and 2.5 perform better. In the second step, we examined λ in finer detail, from 1.5 to 2.5 in steps of 0.1. Furthermore, we tested six sets of gap-open and gap-extension penalties together with the λ value to find the parameters that optimize the performance of PiSA-BLAST. As Figure 9 shows,

(24) the best combination of parameters is a gap-open penalty of 8, a gap-extension penalty of 2, and a λ value from 1.8 to 1.9. Finally, following the results of the second step, we searched for the best λ value between 1.82 and 1.93. Figure 10 demonstrates that database searching performs best when λ is 1.89.

2.5 Structure Searching by Sequence Alignment Tool: BLAST and PSI-BLAST

We downloaded standalone BLAST 2.2.10 [3, 4] from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/snapshot/2004-12-05/ and replaced the default matrix, “BLOSUM62”, with the substitution matrix for the 23 new codes. The program “formatdb” is used to create our own database of 3D-1D coded, FASTA-formatted protein sequences for BLAST searching. We execute BLAST with the program “blastall”, which can perform all five flavors of BLAST comparison. A typical use of blastall, performing a “blastp” search (protein vs. protein) with a query file called INPUT, is:

blastall -p blastp -d DATABASE -i INPUT -o OUTPUT -M BLOSUM62 -G 8 -E 2 -F F

The search is performed against the database DATABASE and the output is placed in the result file OUTPUT. The other blastall options shown above are “-M BLOSUM62”, which is the default scoring matrix; “-G 8 -E 2”, which sets the gap-open penalty to 8 and the gap-extension penalty to 2; and “-F F”, which tells blastall not to filter the query sequence.

Furthermore, we also combine Position-Specific Iterated BLAST, or PSI-BLAST, with our method for detailed database searching [3]. The PSI-BLAST program can do an iterative

(25) search in which sequences found in one round of searching are used to build a score model for the next round of searching. While PSI-BLAST is running, the position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. PSI-BLAST is run with a separate command:

blastpgp -d DATABASE -i INPUT -o OUTPUT -F F -G 8 -E 2 -j 3 -t F -h 1e-15

The program “blastpgp” takes a protein query and performs a PSI-BLAST search against a protein database, creating a position-specific matrix. Some of the arguments used for PSI-BLAST are the same as for BLAST. The options that differ are “-j 3”, the maximum number of rounds; “-t F”, which means that the program does not use composition-based statistics; and “-h 1e-15”, the e-value threshold for including sequences in the score matrix model. The default e-value threshold is 0.001. However, in order to obtain correct results and the best performance, we change the value from 0.001 to 1e-15 for PiSA-BLAST.

The top part of the PSI-BLAST output for each round separates the sequences into two groups: sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes many diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string “producing”, which is part of the header for each round and is unlikely to appear elsewhere in the output. PSI-BLAST “converges” and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.

2.6 Evaluating the Performance

(26) We compare the results of BLAST, PSI-BLAST, CE and PiSA-BLAST against the SCOP classification [15, 16], which biologists regard as the gold standard. The SCOP classification hierarchy has four levels: class, fold, superfamily and family, of which family is the most detailed. In our tests, a protein in the result set is counted as a true hit if it belongs to the same SCOP family as the query protein.

We used three test sets to evaluate the performance. The first two experiments follow Aung et al. [14]. One involves a small database and a limited number of queries, and the other a large database and a greater number of queries. The third experiment involves the same number of queries as the second test set and the SCOP 1.65 95% database.

In the first test set, there are 10 proteins from the Globins family (a.1.1.2 in SCOP) and 10 proteins from the Serine/Threonine kinases family (d.144.1.1 in SCOP), taken from the representative ASTRAL dataset with less than 40% sequence homology. These 20 proteins are designated as the query proteins. Table 1 shows this small test set, selected from previous work [14]: the database has 200 members, and the 20 queries from the two SCOP families are listed. The other 180 proteins were selected from the four major classes (All-α, All-β, α/β and α+β) of the same representative dataset and combined with the 20 query proteins to form the small target database of 200 proteins.

In the second and third test sets, we conduct further experiments using the large database of 33311 proteins from Aung et al. [14] and the 9354 domain proteins of the SCOP 1.65 95% database. From them, Aung et al. selected 108 query proteins, which belong to 108 medium-size families (with ≥40 and ≤180 members) from the four major classes,

(27) and which have less than 40% sequence homology to each other. The 108 query proteins are listed in Table 2.

Four common metrics were used to evaluate the quality of database searching: precision, recall, false positive rate and the ROC curve. Precision is defined as Ah/Th. Recall and false positive rate are given by Ah/A and (Th-Ah)/(T-A), respectively. Here, Ah is the number of true hits in the hit list, Th is the total number of domain proteins in the hit list, A is the total number of true hits in the database, and T is the total number of domain proteins in the database, 33311 or 9354 for the two large databases. The ROC curve plots sensitivity against “1-specificity”; sensitivity is equal to recall, and “1-specificity” is equal to the false positive rate.

A true hit, also called a relevant retrieval, is defined as the retrieval of a protein from the database that belongs to the same “family” as the query. BLAST, PSI-BLAST and PiSA-BLAST can rank the retrieved subject proteins by e-value and alignment score. CE, however, does not provide a sorted retrieval list when used for one-against-all searching, so we must sort its search results ourselves to obtain the retrieval list. Figure 12 shows the recall-precision curves of CE when z-score and RMSD are used to order the search results for the 108 queries against the SCOP 95 database. The CE results sorted by z-score are much more accurate than those sorted by RMSD. Hence, the CE search results are sorted by z-score.

We measured the database searching time of CE, BLAST, PSI-BLAST and PiSA-BLAST on the same machine (a Linux platform with Pentium IV 2.8 GHz processors and 2 GB of memory). We use the default parameters for CE, BLAST and PSI-BLAST, and 7 target databases: small

(28) database of 200 proteins, large database of 33311 proteins, PDB, nr-PDB, the SCOP 1.65 database, SCOP 1.65 95% and SCOP 1.65 40%, as the databases searched by these methods.

2.7 Practical Applications

With advances in protein crystallography for determining the structures of protein molecules, structure databases such as PDB are growing at a very fast rate. It is possible that many new protein structures have been crystallized while their function and fold are still unknown.

Because its structure database searching is fast, PiSA-BLAST can be applied to biological problems such as fold assignment and homology searching. Here, we apply PiSA-BLAST to two biological applications: fold assignment and function prediction.

We took the same 108 proteins as in the test set above as queries for PiSA-BLAST database searching. These proteins have well-known functions and have been assigned to particular fold families in the SCOP and CATH databases. The search was performed against the PDB database released on 19 April of this year. We then examined the 100 top-ranked proteins in the PiSA-BLAST output.

We assume that the top 100 contain several proteins of unknown fold or function. If such a new protein is similar to the query protein with high statistical significance, we can predict its function and fold family with confidence.
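The fold-assignment idea of Section 2.7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the Hit record, the identifiers and the fold labels are hypothetical, and the e-value cutoff of 1e-15 is a placeholder borrowed from the threshold quoted elsewhere in this thesis rather than a value prescribed for this application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    subject_id: str
    e_value: float
    fold: Optional[str]  # None when the fold of the subject is unknown


def propose_fold_assignments(query_fold, hits, top_n=100, max_e_value=1e-15):
    """Transfer the query fold to unannotated hits among the top-ranked results."""
    # Rank hits by e-value (most significant first) and keep the top_n.
    ranked = sorted(hits, key=lambda hit: hit.e_value)[:top_n]
    return {
        hit.subject_id: query_fold
        for hit in ranked
        if hit.fold is None and hit.e_value <= max_e_value
    }


# Hypothetical hit list for a query whose fold label is "a.1".
toy_hits = [
    Hit("subject_1", 1e-40, "a.1"),  # already annotated, skipped
    Hit("subject_2", 1e-20, None),   # unknown fold, significant hit
    Hit("subject_3", 1e-3, None),    # unknown fold, not significant
]
proposals = propose_fold_assignments("a.1", toy_hits)
print(proposals)
```

Keeping the cutoff and the list length as parameters makes it easy to trade coverage against confidence in the proposed assignments.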

(29) Chapter 3 Results and Discussions

3.1 Representative Segments of 23 New Codes

The representative segments and the 23 new codes defined by the nearest-neighbor clustering method are meaningful. Figure 4 shows the 23-letter code mapped onto the kappa and alpha angles, together with the distribution of segments. Figure 5 shows the accumulated distributions of the 20 amino acids and the 23 structural codes in the training set. The accumulated distribution of the 23 codes is similar to that of the 20 amino acids. The most frequent of the 20 amino acids is leucine (L), at 9.26%. The most frequent of the 23 new codes for PiSA-BLAST is H, at 6.99%.

Figure 6 shows the conformations of the representative segments of the 23 new codes. According to their conformation and the distribution of DSSP secondary structure, the representative segments of codes A, Y, B, C and D are called helix segments and those of codes G, I and L helix-like segments. The representative segments of codes E, F and H are called strand segments and those of codes K and N strand-like segments. The representative segments of the other codes are classified as loop-like segments and display a variety of conformations between those of the helix-like and strand-like segments. Figure 7 gives further evidence, displaying the conformations of the representative segment in each cell of the four main groups. As the conformations show, the structures of segments are very similar within the same secondary-structure region.
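To make the mapping from angles to codes concrete, the following Python sketch looks up a code in a 36 × 18 rule map using the 10° intervals described in Chapter 2 and applies the “A” to “Y” rule. The rule map used here is a toy placeholder, not the map of Figure 4; the function names and the handling of the upper interval edge are also assumptions.

```python
def angle_bin(angle_deg, lower, upper, width=10.0):
    """Index of the width-degree interval that contains angle_deg."""
    if not lower <= angle_deg <= upper:
        raise ValueError("angle outside the expected range")
    n_bins = int((upper - lower) / width)
    # Assumption: the upper edge itself is folded into the last interval.
    return min(int((angle_deg - lower) // width), n_bins - 1)


def encode_residue(alpha_deg, kappa_deg, rule_map):
    """Look up the structural code for one (alpha, kappa) angle pair."""
    a = angle_bin(alpha_deg, -180.0, 180.0)  # 36 alpha intervals
    b = angle_bin(kappa_deg, 0.0, 180.0)     # 18 kappa intervals
    code = rule_map[a][b]
    # Rule from the encoding procedure: "A" becomes "Y" in this sub-region.
    if code == "A" and alpha_deg > 46.0 and kappa_deg < 114.0:
        code = "Y"
    return code


# Toy rule map (hypothetical): every cell is "Z" except one cell set to "A".
toy_map = [["Z"] * 18 for _ in range(36)]
toy_map[23][11] = "A"  # alpha in [50, 60), kappa in [110, 120)

codes = [
    encode_residue(55.0, 112.0, toy_map),
    encode_residue(55.0, 116.0, toy_map),
    encode_residue(-175.0, 5.0, toy_map),
]
print("".join(codes))
```

Encoding a whole structure then amounts to applying encode_residue to the angle pair of every residue and concatenating the resulting letters.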

(30) Figure 8 shows the relationship between the distributions of the 23 new codes (in PiSA-BLAST) and the 8 secondary structure codes (in DSSP [17]). It illustrates clearly that the helix, helix-like, strand and strand-like segments defined by PiSA-BLAST are highly related to the DSSP secondary structures, which explains why the conformations of representative segments with the same code are similar. As shown in Figure 8(A), the helix and helix-like codes “AYBCDGIL” occur in large numbers in the helix codes “HGI” defined by DSSP. In Figure 8(B), we also see that a considerable share of the strand and strand-like codes “EFHKN” defined by PiSA-BLAST falls in the DSSP strand codes E and B.

From the conformations in Figure 7 and the distribution of secondary structure in Figure 8, we conclude that the 23 codes in the encoding rule map are meaningful.

3.2 The Substitution Matrix for 23 New Codes

The substitution matrix of the 23 new codes is given in Figure 11. The matrix offers insight into the substitution preferences of the 23 new codes between homologous structures. All identical new codes with the same secondary structure have positive substitution scores. The scores in the diagonal cells are much higher than those in the off-diagonal cells. The red dotted square (A, Y, C, B and D) contains the scores for aligning helix codes to helix codes, and the blue dotted square (H, E and F) the scores for aligning strand codes to strand codes. The scores for aligning helix codes to strand codes are the smallest.

In Figure 11, the red dotted square shows the substitution scores of helix and helix-like codes. The mean of the scores between helix and helix-like codes is greater than zero. The blue dotted square shows the substitution scores of strand and strand-like codes; the average of the scores in the blue square is also greater than zero. The yellow region displays

(31) positive scores in the substitution matrix. In addition, the orange region shows the negative substitution scores between helix and strand codes, which are dissimilar secondary structures. Furthermore, the light yellow region shows clearly that the substitution scores between helix and helix-like codes, or between strand and strand-like codes, are smaller than those in the yellow region.

These relationships are well known, showing that the substitution matrix embodies conventional knowledge about structural information in proteins.

3.3 Evaluating Statistical Significance

PiSA-BLAST is more accurate than BLAST and other tools for structure database searching. In Table 3, we compare PiSA-BLAST with well-known tools for searching the small database. In Table 3, row i gives the rank each method needs to reach in order to retrieve i relevant answers. For example, row 6 says that when 6 answers are required, the top 6 ranked answers from DALI, CE, ProtDex2 and PiSA-BLAST are the 6 relevant answers from the same family as the query, while BLAST ranks the 6 relevant answers among its top 18 retrievals.

PiSA-BLAST performs as well as CE and DALI in searching the small database. To obtain all the relevant answers, PiSA-BLAST retrieves the same number of proteins as the detailed comparison methods DALI and CE. BLAST and PSI-BLAST, which use amino acid sequences to search for homologous proteins, have to retrieve more proteins than DALI, CE and PiSA-BLAST, which use structural information to search the database.

The accuracy comparison is shown in Figures 13 and 14. The results are shown as

(32) recall-precision curves. Again, a relevant retrieval is defined as the retrieval of a protein from the database that belongs to the same ‘family’ as the query. Figure 13 gives the recall-precision curves of five alignment tools for the 108 queries listed in Table 2 on the large database of 33311 proteins. It shows clearly that PiSA-BLAST is the best and TopScan the worst of these five approaches. BLAST and PSI-BLAST, which use sequence information only, cannot provide the right relevant retrievals, even though PSI-BLAST searches iteratively. The results of ProtDex2 and TopScan, two fast structure alignment tools, are taken from [14]. ProtDex2 [14] and TopScan [13] can search the database quickly at the sequence level but lose a considerable amount of structural information.

In Figure 14, we compare the performance of PiSA-BLAST with the CE, PiSA-PSI-BLAST, BLAST and PSI-BLAST methods on the SCOP 1.65 95% database. The recall-precision curves in Figure 14 show that CE is more accurate than the other methods. The accuracy of PiSA-BLAST is close to that of CE, and PiSA-BLAST is about 34000 times faster than CE. Surprisingly, PiSA-PSI-BLAST improves only slightly on PiSA-BLAST. In contrast, PSI-BLAST performs much better than BLAST. At 10% recall, the precision of BLAST and PSI-BLAST is as high as that of PiSA-BLAST. At 20% recall, PiSA-BLAST and PiSA-PSI-BLAST provide the same accuracy as CE. However, at recall of 20% and above, the precision of BLAST and PSI-BLAST decreases quickly.

The ROC curves for the 108 queries on the large databases are shown in Figures 15 and 16. PiSA-BLAST and PiSA-PSI-BLAST perform close to CE and are more accurate than the sequence alignment tools BLAST and PSI-BLAST. Table 7 shows the average precision of BLAST, PSI-BLAST, PiSA-BLAST, CE and PiSA-PSI-BLAST for each query protein in searching the SCOP 95% database.
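The recall-precision and ROC curves discussed here are built from the quantities defined in Section 2.6: precision Ah/Th, recall Ah/A and false positive rate (Th-Ah)/(T-A). A minimal Python sketch of how one curve point per rank could be computed is given below; the function name and the toy ranked list are assumptions, not the evaluation script used in this work.

```python
def search_metrics(ranked_is_true_hit, total_true_hits, database_size):
    """Precision, recall and false positive rate after each rank of a hit list.

    ranked_is_true_hit: booleans in ranked order, True for a same-family hit.
    total_true_hits:    A, the number of true hits in the database.
    database_size:      T, the number of domain proteins in the database.
    """
    points = []
    true_hits = 0  # Ah, true hits seen so far
    for hits_so_far, is_true in enumerate(ranked_is_true_hit, start=1):
        true_hits += bool(is_true)
        precision = true_hits / hits_so_far                 # Ah / Th
        recall = true_hits / total_true_hits                # Ah / A
        false_positive_rate = (hits_so_far - true_hits) / (
            database_size - total_true_hits                 # (Th - Ah) / (T - A)
        )
        points.append((precision, recall, false_positive_rate))
    return points


# Toy ranked list (hypothetical): 4 retrieved proteins, 3 of them true hits,
# in a database of 10 proteins that contains 4 true hits in total.
curve = search_metrics([True, True, False, True], total_true_hits=4, database_size=10)
for precision, recall, fpr in curve:
    print(f"precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f}")
```

Plotting precision against recall gives a recall-precision curve, and plotting recall against the false positive rate gives the ROC curve.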

(33) We now discuss the results of CE and PiSA-PSI-BLAST. The overall accuracy of CE is better than that of the other methods. However, CE's homology searching shows weaknesses for some queries, where it is even worse than PiSA-BLAST. As shown in Table 7, CE obtains worse results for the following query proteins: #6 d1b3ra1, #19 d1d3ga_, #21 d1dbqa_, #22 d1di0a_, #29 d1e4ft1, #32 d1ej8a_, #62 d1i1ra1, #90 d1qfja2, #102 d1ggwa_, #104 d2cmd_1.

There are two reasons for the worse results of CE, according to our observations. First, some retrieved domain proteins have chain breaks in their 3D structure files. “Chain break” means that the residue numbering is non-continuous within one domain or chain. When a protein has a chain break, CE may treat it as two chains and perform an incorrect structure comparison, as shown in Figure 17. Some subject proteins show this condition in the searches for query proteins such as #6 d1b3ra1, #21 d1dbqa_, #22 d1di0a_ and #104 d2cmd_1. Here, we take the subject protein “d1c41a_” in the search for query protein “#22 d1di0a_” as an example, because the precision of this subject protein in CE is only 0.00813.

As shown in Figure 17, the chain break in subject protein “d1c41a_” is marked with a blue square in Figures 17(A) and (C). The residue numbering is non-continuous from 76 to 107. The structural alignment of the two proteins is somewhat unsatisfactory. Furthermore, the alignment length is shorter than the length of the query protein, and both the Z-score and the RMSD are quite low in the alignment result in Figure 17(D). In addition, we observed that CE determines the wrong length for the domain protein “d1c41a_”: its original length is 165, but the size detected by CE is only 72 because of the chain-break problem. PiSA-BLAST, in contrast, is not influenced by chain breaks. Even when the residue numbering is broken, the structure encoding in the PiSA-BLAST method is still continuous.
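The chain-break condition described above, non-continuous residue numbering within one chain, can be checked with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, the numbering is a toy list that merely mimics a gap from 76 to 107, and PDB insertion codes are ignored.

```python
def find_chain_breaks(residue_numbers):
    """Return (last_before_gap, first_after_gap) for every numbering gap."""
    breaks = []
    for previous, current in zip(residue_numbers, residue_numbers[1:]):
        # Assumption: consecutive residues differ by exactly 1 in number.
        if current - previous != 1:
            breaks.append((previous, current))
    return breaks


# Toy numbering (hypothetical): residues 1-76 followed by residues 107-119.
toy_numbering = list(range(1, 77)) + list(range(107, 120))
gaps = find_chain_breaks(toy_numbering)
print(gaps)
```

A tool that splits a chain at every such gap would see two short fragments, whereas an encoding that simply walks the residues in file order, as PiSA-BLAST is described to do, stays continuous.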