Chapter 3. Results and Discussions
3.3 Evaluating Statistical Significance
PiSA-BLAST is more accurate than BLAST and other tools for structure database searching. Table 3 compares PiSA-BLAST with well-known tools for searching a small database. In Table 3, row i gives, for each method, the number of top-ranked retrievals needed to obtain i relevant answers. For example, row 6 shows that when 6 answers are required, the top 6 ranked answers from DALI, CE, ProtDex2 and PiSA-BLAST are the 6 relevant answers from the same family as the query, whereas BLAST places the 6 relevant answers among its top 18 retrievals.
PiSA-BLAST performs as well as CE and DALI in small database searching. To obtain all the relevant answers, PiSA-BLAST retrieves the same number of proteins as the detailed comparison methods DALI and CE. BLAST and PSI-BLAST, which search for homologous proteins using amino acid sequences, have to retrieve more proteins than DALI, CE and PiSA-BLAST, which search the database using structural information.
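The row definition used in Table 3 can be illustrated with a minimal sketch. The function name and the example ranking below are hypothetical and are not taken from the thesis; the sketch only assumes that each retrieved protein is labelled as relevant (same family as the query) or not.

```python
from typing import List, Sequence


def ranks_of_relevant(ranked_is_relevant: Sequence[bool]) -> List[int]:
    """Return, for i = 1, 2, ..., the rank at which the i-th relevant answer appears."""
    return [rank for rank, relevant in enumerate(ranked_is_relevant, start=1) if relevant]


# Hypothetical ranked output of one method for one query
# (True = the retrieved protein is in the same family as the query).
example = [True, True, False, True, False, False, True]
print(ranks_of_relevant(example))  # [1, 2, 4, 7]
```

Entry i of the returned list corresponds to row i of such a table: a method whose value in row 6 is 6 ranks all six relevant answers first, while a value of 18 means twelve irrelevant proteins were ranked among them.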
The accuracy comparison is shown in Figures 13 and 14 as recall-precision curves. Again, a relevant retrieval is defined as retrieving a protein from the database that belongs to the same ‘family’ as the query. Figure 13 gives the recall-precision curves of five alignment tools for 108 queries on the large database of 33311 proteins listed in Table 2. It clearly shows that PiSA-BLAST is the best and TopScan the worst of these five approaches. BLAST and PSI-BLAST, which use sequence information only, cannot reliably retrieve the relevant proteins, even though PSI-BLAST searches iteratively.
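As a minimal sketch of how one recall-precision curve can be computed from a single ranked retrieval list, consider the following. The function name and the example data are hypothetical; how the thesis averages the curves over the 108 queries is not stated here and is not reproduced.

```python
from typing import List, Sequence, Tuple


def recall_precision_points(
    ranked_is_relevant: Sequence[bool], total_relevant: int
) -> List[Tuple[float, float]]:
    """Return one (recall, precision) point for each relevant retrieval in the ranking."""
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_is_relevant, start=1):
        if relevant:
            hits += 1
            recall = hits / total_relevant   # fraction of the family found so far
            precision = hits / rank          # fraction of retrievals that are relevant
            points.append((recall, precision))
    return points


# Hypothetical ranking for a query whose family has 4 members in the database.
for recall, precision in recall_precision_points([True, False, True, True], 4):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```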
The results of ProtDex2 and TopScan, two fast structure alignment tools, are taken from [14]. ProtDex2 [14] and TopScan [13] search the database quickly at the sequence level but lose a considerable amount of structural information.
In Figure 14, we compare the performance of PiSA-BLAST with CE, PiSA-PSI-BLAST, BLAST and PSI-BLAST on the SCOP 1.65 95% database. The recall-precision curves in Figure 14 clearly show that CE is more accurate than the other methods. The accuracy of PiSA-BLAST is close to that of CE, and PiSA-BLAST is about 34000 times faster than CE. Surprisingly, PiSA-PSI-BLAST improves only slightly on PiSA-BLAST.
In contrast, PSI-BLAST performs much better than BLAST. At 10% recall, the precision of BLAST and PSI-BLAST is as high as that of PiSA-BLAST. At 20% recall, PiSA-BLAST and PiSA-PSI-BLAST provide the same accuracy as CE. However, at a recall of 20% and above, the precision of BLAST and PSI-BLAST decreases quickly.
The ROC curves for the 108 queries on the large databases are shown in Figures 15 and 16. PiSA-BLAST and PiSA-PSI-BLAST perform close to CE and are more accurate than the sequence alignment tools BLAST and PSI-BLAST. Table 7 shows the average precision of BLAST, PSI-BLAST, PiSA-BLAST, CE and PiSA-PSI-BLAST for each query protein in the SCOP 95% database search.
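For readers unfamiliar with ROC curves, the following minimal sketch shows how the points of one curve can be derived from a ranked retrieval list. The function name and the example data are hypothetical, and the sketch assumes the standard definition (true positive rate against false positive rate at each rank cutoff); the exact procedure used for Figures 15 and 16 is not specified in this section.

```python
from typing import List, Sequence, Tuple


def roc_points(
    ranked_is_relevant: Sequence[bool], total_relevant: int, total_irrelevant: int
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) after each retrieval."""
    points = [(0.0, 0.0)]
    true_pos = 0
    false_pos = 0
    for relevant in ranked_is_relevant:
        if relevant:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / total_irrelevant, true_pos / total_relevant))
    return points


# Hypothetical ranking over a tiny database with 2 relevant and 2 irrelevant proteins.
print(roc_points([True, False, True, False], total_relevant=2, total_irrelevant=2))
```

A method whose curve rises toward the top-left corner ranks relevant proteins ahead of irrelevant ones.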
We now discuss the results of CE and PiSA-PSI-BLAST. The overall accuracy of CE is better than that of the other methods. However, for some queries the homology search results of CE are weak, and even worse than those of PiSA-BLAST. As shown in Table 7, CE obtains worse results for the following query proteins: #6 d1b3ra1, #19 d1d3ga_, #21 d1dbqa_, #22 d1di0a_, #29 d1e4ft1, #32 d1ej8a_, #62 d1i1ra1, #90 d1qfja2, #102 d1ggwa_, #104 d2cmd_1.
According to our observation, there are two reasons for the worse results of CE. First, some retrieved domain proteins have a chain-break in their 3D structure files. “Chain-break” means that the residue numbering is non-continuous within one domain or chain. When a protein has a chain-break, CE may treat it as two chains and perform an incorrect structure comparison, as shown in Figure 17. Some subject proteins have this condition in the searches for query proteins such as #6 d1b3ra1, #21 d1dbqa_, #22 d1di0a_ and #104 d2cmd_1. Here, we take subject protein “d1c41a_” for query protein “#22 d1di0a_” as an example, because the precision of this subject protein in CE is only 0.00813. The chain-break in subject protein “d1c41a_” is marked with a blue square in Figures 17(A) and (C): the residue numbering is non-continuous from 76 to 107. The structure alignment of the two proteins is somewhat unsatisfactory. Furthermore, the alignment length is shorter than the length of the query protein, and both the Z-score and the RMSD are quite low, as the alignment result in Figure 17(D) shows. We also observed that CE determines the wrong length for the domain protein “d1c41a_”: its original length is 165, but the size detected by CE is only 72 because of the chain-break. PiSA-BLAST, in contrast, is not influenced by chain-breaks. Even when the residue numbering is broken, the structure encoding in PiSA-BLAST remains continuous.
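The chain-break condition, as defined in this section (non-continuous residue numbering within one domain or chain), can be detected with a simple check such as the sketch below. The function name is hypothetical, the example numbering is invented to mirror the 76 to 107 gap mentioned for “d1c41a_” and is not read from the actual structure file, and insertion codes are ignored for simplicity.

```python
from typing import List, Sequence, Tuple


def find_chain_breaks(residue_numbers: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (last residue before gap, first residue after gap) for each numbering gap."""
    breaks = []
    for previous, current in zip(residue_numbers, residue_numbers[1:]):
        if current - previous > 1:  # numbering jumps, so residues are missing or renumbered
            breaks.append((previous, current))
    return breaks


# Invented numbering: residues 1..76 followed directly by 107..120.
numbering = list(range(1, 77)) + list(range(107, 121))
print(find_chain_breaks(numbering))  # [(76, 107)]
```

A tool that splits the chain at such a gap would see only the first 76 or so residues, which is consistent with the shortened length reported for CE, whereas a method that encodes the residues in file order is unaffected by the jump in numbering.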
Second, a lower Z-score does not necessarily mean a dissimilar structure. Some protein comparisons have a low Z-score but a good RMSD. We observed this issue for the following query proteins: #19 d1d3ga_, #32 d1ej8a_ and #90 d1qfja2. Here, we take subject protein “d1eso__” for query protein “#32 d1ej8a_” as an example. The precision of this subject protein is only 0.2. Figure 18 illustrates the problem of ordering the search results by Z-score in CE alignment: the comparison of two similar structures has a good RMSD but a poor Z-score. The alignment between the query and subject proteins is clearly reasonable. The main secondary structures of the two proteins are aligned appropriately, and the gaps inserted into the alignment fall only in loop regions of the two proteins. The structures of the query and subject proteins are similar and the RMSD is 2.07, but the Z-score is only 4.4. Therefore, the subject protein is ranked 50th, behind 40 false positive proteins. CE performs poorly in such cases because we sorted the retrieval lists by Z-score only. The results provided by CE might be sorted in a better way, such as by combining Z-score with RMSD.
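One possible way to combine Z-score with RMSD is sketched below. This is only an illustration of the idea, not a method evaluated in this work: the linear score `z_score - rmsd_weight * rmsd`, the weight of 1.0, and the false positive entry are all assumptions made for the example. Only the Z-score of 4.4 and the RMSD of 2.07 for “d1eso__” come from the text.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    name: str
    z_score: float
    rmsd: float


def rerank(hits: List[Hit], rmsd_weight: float = 1.0) -> List[Hit]:
    """Sort hits by the hypothetical score z_score - rmsd_weight * rmsd, best first."""
    return sorted(hits, key=lambda h: h.z_score - rmsd_weight * h.rmsd, reverse=True)


hits = [
    Hit("hypothetical_false_positive", z_score=4.6, rmsd=5.0),  # invented values
    Hit("d1eso__", z_score=4.4, rmsd=2.07),                      # values quoted in the text
]
print([h.name for h in rerank(hits)])
```

With Z-score alone the invented false positive would rank first; with the combined score the hit with the lower RMSD moves ahead. Any real weighting would need to be tuned and validated on the benchmark queries.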
There is one probable explanation for why PiSA-PSI-BLAST did not improve performance as expected. Changing the e-value threshold for including sequences in the PSI-BLAST position-specific matrix model may lead to different alignment results. Although we chose the most appropriate e-value threshold, 10^-15, PiSA-PSI-BLAST may still perform worse in some cases.
For example, in the search for query protein “#3 d1ajsa_”, too many incorrect domain proteins, which are not in the same family as the query protein, have e-values below the threshold. There are 79 subject proteins below the e-value threshold, but 63 of them are not in the same family as the query protein. Therefore, the position-specific matrix model built by PSI-BLAST may include wrong information and drive the iterated search toward a wrong result.
On the other hand, in several cases only a few domain proteins from the same family as the query fall below the threshold. Accordingly, the position-specific matrix model may not contain enough sequence information to perform a correct search. For example, only 3 proteins are below the e-value threshold in the search for query protein “#85 d1pina2”.
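Both situations described above (too many off-family hits below the threshold, or too few hits overall) can be checked by counting the hits that would enter the position-specific matrix model. The sketch below assumes the threshold written in the text is 10^-15 (1e-15); the function name, the family labels and the individual e-values are hypothetical, and the hit list is synthetic, built only to reproduce the counts quoted for query “#3 d1ajsa_” (79 hits below the threshold, 63 of them off-family).

```python
from typing import Iterable, Tuple


def inclusion_summary(
    hits: Iterable[Tuple[float, str]], query_family: str, threshold: float = 1e-15
) -> Tuple[int, int]:
    """Return (number of hits below the threshold, how many of them share the query family)."""
    included = [family for e_value, family in hits if e_value < threshold]
    same_family = sum(1 for family in included if family == query_family)
    return len(included), same_family


# Synthetic hits: 16 same-family and 63 off-family below the threshold,
# plus a few invented hits above it that must be ignored.
hits = (
    [(1e-20, "query_family")] * 16
    + [(1e-20, "other_family")] * 63
    + [(1e-5, "query_family")] * 4
)
included, same_family = inclusion_summary(hits, "query_family")
print(included, same_family, f"{same_family / included:.1%}")
```

In this reconstruction only 16 of the 79 included proteins (about 20%) belong to the query family, which illustrates how the model can be dominated by off-family sequences.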
PiSA-BLAST provides a theoretically expected number, like the e-value of BLAST, to indicate how reliable a hit is. Here, we propose 10^-15 as the significance threshold according to our observation. Figure 19 shows the relationship between e-value and structural similarity in PiSA-BLAST. The 1681 points on the plot represent all query and subject protein pairs from the search of the SCOP 95 database. There are 943 points in area (A) and only 79 points in area (B). When the e-value is less than 10^-15, 98.6% and 92.2% of the proteins retrieved by PiSA-BLAST have Z-scores above 4.0 and 5.0, respectively.
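The percentages quoted for Figure 19 are fractions of protein pairs below the e-value threshold whose Z-score exceeds a cutoff. A minimal sketch of that calculation follows, assuming the threshold written in the text is 10^-15 (1e-15). The function name and the five example pairs are hypothetical; the actual 1681 data points are not reproduced here.

```python
from typing import Iterable, Tuple


def fraction_with_z_above(
    pairs: Iterable[Tuple[float, float]], z_cutoff: float, e_threshold: float = 1e-15
) -> float:
    """Among pairs with e-value below e_threshold, return the fraction with Z-score > z_cutoff."""
    z_scores = [z for e_value, z in pairs if e_value < e_threshold]
    if not z_scores:
        return 0.0
    return sum(1 for z in z_scores if z > z_cutoff) / len(z_scores)


# Invented (e-value, Z-score) pairs for illustration only.
pairs = [(1e-30, 12.1), (1e-20, 6.3), (1e-18, 4.5), (1e-16, 3.2), (1e-3, 2.0)]
print(fraction_with_z_above(pairs, z_cutoff=4.0))  # 0.75
print(fraction_with_z_above(pairs, z_cutoff=5.0))  # 0.5
```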
Figure 20 shows the relationship between e-value and precision in PiSA-BLAST for the 108 queries on the SCOP 95 database. The yellow bars show the distribution of protein pairs with a PiSA-BLAST e-value less than 10^-15, and the red bars show the distribution of those with an e-value greater than 10^-15. Protein pairs with a precision of 80% or higher account for 91% of the protein pairs with a PiSA-BLAST e-value below 10^-15. Hence, the value of 10^-15 that we propose is reasonable.