
4.4.4 Alternative Ways of Performance Evaluation

To evaluate the local-fit RMS quality we compare the histograms of local-fit RMS error measuring the deviations of fragments from their corresponding building blocks (Figure 8(a)).

The histograms show that SMCM places more fragments in the low-error range (a larger area under the curve at low RMS values) than TSCA. This indicates that SMCM represents more fragments with good building blocks, that is, with lower reconstruction errors.


Figure 8. (a) Histograms of local-fit RMS errors for SMCM and TSCA (b) Protein by protein comparison of local-fit RMS error for SMCM and TSCA

Finally, Figure 8(b) compares the local-fit RMS errors of SMCM and TSCA protein by protein, with the proteins sorted in descending order of their SMCM local-fit RMS error. The SMCM error is usually lower than the corresponding TSCA error: in this particular case, it is lower for about 75% of the proteins. The results on the updated database are quite similar to those on the original database. We have also experimented with fragment lengths 5, 6 and

reconstruction error for both databases that we have experimented with. For SMCM the reconstruction error obtained with fragment length 7 is 7.57 which is slightly higher than that with hexamers. However, for TSCA the best reconstruction error of 7.59 is achieved with fragment length 7 while the error with fragment length 6 is 8.14.
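The two comparisons used in this section (the error histogram and the protein-by-protein curve) can be sketched as follows. This is a minimal illustration, not the code used for the experiments: it assumes the per-fragment local-fit RMS errors of both methods are already available as arrays in the same fragment order, and the function name `compare_local_fit` and its parameters are hypothetical.

```python
import numpy as np

def compare_local_fit(errors_a, errors_b, protein_ids, bins=25, max_rms=2.5):
    """Compare per-fragment local-fit RMS errors of two methods (A and B).

    errors_a, errors_b: one error per fragment, same fragment order.
    protein_ids: the protein each fragment belongs to.
    """
    errors_a = np.asarray(errors_a, dtype=float)
    errors_b = np.asarray(errors_b, dtype=float)
    protein_ids = np.asarray(protein_ids)

    # Histograms over a shared set of bin edges, so the counts are comparable.
    edges = np.linspace(0.0, max_rms, bins + 1)
    hist_a, _ = np.histogram(errors_a, bins=edges)
    hist_b, _ = np.histogram(errors_b, bins=edges)

    # Average local-fit RMS error per protein for each method.
    proteins = np.unique(protein_ids)
    mean_a = np.array([errors_a[protein_ids == p].mean() for p in proteins])
    mean_b = np.array([errors_b[protein_ids == p].mean() for p in proteins])

    # Sort proteins in descending order of method A's error.
    order = np.argsort(-mean_a)
    # Fraction of proteins on which method A has the lower average error.
    frac_a_lower = float(np.mean(mean_a < mean_b))
    return hist_a, hist_b, proteins[order], mean_a[order], mean_b[order], frac_a_lower
```

With SMCM as method A and TSCA as method B, the last returned value corresponds to the kind of statement made above about the fraction of proteins for which the SMCM error is lower.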

4.4.5 Evaluation of the Library of Building Blocks on Other Datasets

To evaluate the quality of the library of building blocks, we use it to reconstruct the proteins used in two more recent studies, by Micheletti et al. [14] and Kolodny et al. [13]. We have excluded a few proteins with sequence discontinuities [51]. These two datasets contain 10 and 144 test proteins, respectively. For the reconstruction, we follow a scheme similar in spirit to the method in [13]: the reconstruction process tries to minimize the global-fit RMS deviation (RMSD). While reconstructing residue by residue, we choose the building block with the minimum global-fit RMSD instead of the one with the best local-fit RMS. Note that the local-fit RMSD at each residue will usually be higher during such a reconstruction. For the Micheletti et al. dataset, the global-fit RMSD obtained is 0.92 Å, which is slightly lower than the 1.06 Å reported in [14]. For the dataset used by Kolodny et al., the authors reported global-fit RMSDs between 0.76 Å and 2.9 Å for different fragment lengths. Using hexamers on this dataset, we achieve a global-fit RMSD of only 1.05 Å, which is better than the 1.26 Å reported in [13]. This further supports the claim that SMCM can extract biologically meaningful structural motifs that can be used to reconstruct protein structures.

5. Incremental Structural Mountain Clustering Methods (ISMCM)

When the training dataset is very large, finding the clusters becomes time-consuming. To shorten the training time, we propose an incremental approach. In the first step, we choose the longest protein in the training set and use it as the only protein for clustering to find the building blocks. We then evaluate the performance by counting, for each protein in the whole training set, the unassigned hexamers (those that cannot be assigned to any building block within 1 Å). The protein with the largest count of unassigned hexamers is added to the set of proteins selected for clustering in the next step. The two chosen proteins are then used for clustering, and the next protein with the largest unassigned count is selected.

The same process is repeated until the unassigned ratio (abbreviated as U_ratio) of the whole set of hexamers falls below a threshold. Thus, we use only part of the original training dataset to cover the most frequently occurring patterns and find the building blocks from it. This saves computation time because the number of training fragments is reduced.
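The unassigned count per protein and the overall U_ratio can be computed as in the sketch below. It assumes that the local-fit RMS distances between every fragment and every building block have already been computed into a matrix; the function name `unassigned_stats` and its arguments are illustrative only.

```python
import numpy as np

def unassigned_stats(dist, protein_ids, threshold=1.0):
    """Count unassigned fragments per protein and the overall unassigned ratio.

    dist[i, j]: local-fit RMS (in Å) between fragment i and building block j.
    protein_ids[i]: the protein that fragment i comes from.
    A fragment is unassigned when no building block lies within `threshold`.
    """
    dist = np.asarray(dist, dtype=float)
    protein_ids = np.asarray(protein_ids)

    # A fragment is unassigned if even its nearest block is too far away.
    unassigned = dist.min(axis=1) > threshold

    counts = {
        p: int(unassigned[protein_ids == p].sum())
        for p in np.unique(protein_ids).tolist()
    }
    u_ratio = float(unassigned.mean())
    return counts, u_ratio
```

The protein with the largest value in `counts` is the candidate to add next, and `u_ratio` is the quantity compared against the stopping threshold.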

Following the derivation in Chapter 4, when the value of n is reduced, the approximate computational complexities of O(n²) and O(nc) are also reduced.

5.1 ISMCM Algorithm

Algorithm:

Input: Dataset P = {the complete list of proteins for training}

Choose: Threshold on the unassigned ratio to stop the iteration

Repeat until the unassigned ratio is less than the threshold

1. Move the protein with the largest unassigned count from P₂ into P₁. Note that P₁ and P₂ satisfy the conditions P₁ ∪ P₂ = P and P₁ ∩ P₂ = {}. (In the first iteration, the longest protein is chosen and moved into the selected set P₁.)

2. Find the building blocks from P₁ using SMCM.

3. Compute the unassigned count of hexamers for each protein in P₂. These are the counts of hexamers that cannot be represented by any building block derived from P₁ within an RMS error of 1 Å. Also, compute the unassigned ratio of the whole set of hexamers.

End Repeat.

The incremental version of TSCA (ITSCA) can be written exactly in the same manner.
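The incremental loop described above can be sketched in a few lines. This is only an outline under stated assumptions: `find_blocks` stands in for the clustering step (SMCM or TSCA) and `is_unassigned` for the 1 Å assignment test, both supplied by the caller; these names, and representing a protein's length by its number of fragments, are choices made for illustration.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

def incremental_selection(
    proteins: Dict[Hashable, Sequence],
    find_blocks: Callable[[List], object],
    is_unassigned: Callable[[object, object], bool],
    u_threshold: float,
) -> Tuple[List[Hashable], object, float]:
    """Grow the selected set P1 one protein at a time until U_ratio < u_threshold.

    proteins maps a protein ID to its list of fragments.
    find_blocks(fragments) returns the building blocks (placeholder for SMCM).
    is_unassigned(fragment, blocks) is True when no block fits the fragment.
    """
    if not proteins:
        raise ValueError("need at least one protein")

    remaining = dict(proteins)                       # P2
    # First iteration: the longest protein seeds the selected set.
    first = max(remaining, key=lambda pid: len(remaining[pid]))
    selected = [first]                               # P1
    del remaining[first]
    total = sum(len(frags) for frags in proteins.values())

    while True:
        train = [f for pid in selected for f in proteins[pid]]
        blocks = find_blocks(train)

        # Unassigned count for every protein still in P2.
        counts = {
            pid: sum(is_unassigned(f, blocks) for f in frags)
            for pid, frags in remaining.items()
        }
        # U_ratio is taken over the whole set of fragments, P1 included.
        missed_in_p1 = sum(is_unassigned(f, blocks) for f in train)
        u_ratio = (sum(counts.values()) + missed_in_p1) / total

        if u_ratio < u_threshold or not remaining:
            return selected, blocks, u_ratio

        # Move the protein with the largest unassigned count from P2 to P1.
        worst = max(counts, key=counts.get)
        selected.append(worst)
        del remaining[worst]
```

Passing a different `find_blocks` callable gives the ITSCA variant without changing the loop, which mirrors the remark that ITSCA can be written in the same manner.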

5.2. RESULTS

According to the incremental algorithm, we find the building blocks and evaluate the global-fit RMS (GRMS) and local-fit RMS (LRMS) for both Training set and Test set until U_ratio is less than some threshold. The results are given in the tables within this section.

5.2.1 Results on Dataset A

For the incremental versions of the two algorithms, we have varied the fragment length from 5 to 7 and the choice of α from 3.5 to 5.5. For each fragment length, we report results with the best choice of α. Table 18 summarizes the results of the ISMCM algorithm on Dataset AOLD. In Table 18, U_ratio denotes the percentage of all fragments that cannot be assigned to any building block within a distance of 1 Å. In this case too, we find that fragment length 6 with α = 5 yields the best result, a global-fit RMS error of 7.19, which is less than the 7.3 reported by Unger et al. These results are produced using the same dataset as used by Unger et al. [12].

Table 18. ISMCM results on Dataset AOLD

Frag. length  α    Protein count  PDB No.  Library size  U_ratio  Train LRMS  Train GRMS  Test LRMS  Test GRMS
5             3.5  1              4HHBb    20            20.5%    0.60        7.86        0.81       10.53
5             3.5  2              1PCY     38            2.9%     0.44        6.64        0.62       8.67
5             3.5  3              1BP2     40            1.5%     0.43        7.60        0.61       8.07
5             3.5  4              5PTI     44            0.0%     0.42        5.90        0.60       8.08
6             5.0  1              4HHBb    35            35.0%    0.77        5.80        1.08       9.48
6             5.0  2              1PCY     73            9.4%     0.49        7.06        0.81       8.79
6             5.0  3              1BP2     93            3.0%     0.41        5.07        0.76       8.36
6             5.0  4              5PTI     104           0.0%     0.37        3.35        0.75       7.19
7             3.5  1              4HHBb    51            43.8%    0.95        8.68        1.38       10.16
7             3.5  2              1PCY     117           16.4%    0.49        4.67        0.98       8.27
7             3.5  3              1BP2     153           6.2%     0.36        3.69        0.93       7.95
7             3.5  4              5PTI     173           0.0%     0.28        2.19        0.90       7.60

Table 18 shows that with one protein in the training set, the test error is quite high. As we increase the number of proteins in the training set, the number of building blocks increases and the training and test errors generally decrease. Table 18 also reveals that increasing the number of training proteins from 1 to 2 changes the number of building blocks and the test error far more drastically than increasing it from 3 to 4. This asymptotic behavior, which is illustrated further with Dataset B, indicates the utility and consistency of the incremental version of SMCM.

Table 19 depicts the performance of the ISMCM on the updated version of Dataset ANEW.

other hand, when we apply the incremental version of TSCA to the same dataset with fragment length 6 and use the 4 training proteins to construct the building blocks, we obtain 101 clusters with no unassigned hexamers (unassigned ratio = 0%). The local-fit RMS error is 0.76 and the global-fit RMS error is 8.14 on the test data (see Table 20). Since ISMCM uses six more building blocks than ITSCA, to make a fair comparison between the two we remove the trailing 6 of the 107 ISMCM building blocks. Thus, both methods now use the same number of building blocks to represent all target fragments and to reconstruct the first 60 residues of the 71 proteins whose lengths are larger than 60. For ISMCM with only 101 clusters, the local-fit RMS error increases very marginally to 0.73 and the global-fit RMS error increases from 7.32 to 7.55 (a 3% increase), which is still better than the 8.14 realized by ITSCA with the same fragment length of six.

Table 19. ISMCM results on the updated Dataset ANEW

Frag. length  α    Protein count  PDB No.  Library size  U_ratio  Train LRMS  Train GRMS  Test LRMS  Test GRMS
5             5.5  1              4HHBb    20            20.7%    0.60        7.88        0.79       10.37
5             5.5  2              1PCY     35            3.9%     0.44        7.76        0.61       9.14
5             5.5  3              1BP2     39            0.7%     0.42        5.14        0.59       7.97
5             5.5  4              5PTI     45            0.0%     0.41        4.63        0.59       8.86
6             5.5  1              4HHBb    35            35.0%    0.77        6.60        1.06       9.63
6             5.5  2              1PCY     74            9.9%     0.49        6.59        0.79       8.29
6             5.5  3              1BP2     94            3.2%     0.41        4.92        0.74       8.21
6             5.5  4              5PTI     107           0.0%     0.36        4.00        0.72       7.32
7             3.5  1              4HHBb    51            43.8%    0.95        8.28        1.36       10.06
7             3.5  2              1PCY     117           16.7%    0.49        4.46        0.96       8.03
7             3.5  3              1BP2     152           6.2%     0.36        2.76        0.91       7.83
7             3.5  4              5PTI     174           0.0%     0.28        1.90        0.88       7.57

When we apply ITSCA to the updated Dataset ANEW, we get the best result with fragment length 7 and all 4 proteins. However, the test global-fit RMS error is 7.59, which is still higher than the global-fit RMS error of 7.32 produced by the SMCM with fragment length 6. The results are summarized in Table 20.

Table 20. ITSCA results on the updated Dataset ANEW

Frag. length  Protein count  PDB No.  Library size  U_ratio  Train LRMS  Train GRMS  Test LRMS  Test GRMS
5             1              4HHBb    18            21.0%    0.69        8.28        0.86       10.81
5             2              1PCY     29            4.6%     0.57        7.48        0.70       9.07
5             3              1BP2     34            2.2%     0.50        8.00        0.64       9.04
5             4              5PTI     38            0.0%     0.49        5.64        0.62       8.21
6             1              4HHBb    34            34.7%    0.81        7.39        1.08       10.12
6             2              1PCY     70            9.9%     0.53        7.33        0.82       8.83
6             3              1BP2     90            3.4%     0.46        6.63        0.77       8.29
6             4              5PTI     101           0.0%     0.43        5.94        0.76       8.14
7             1              4HHBb    51            43.3%    0.96        8.62        1.37       10.05
7             2              1PCY     113           16.4%    0.51        4.67        0.98       8.31
7             3              1BP2     151           6.2%     0.38        3.47        0.92       7.95
7             4              5PTI     168           0.0%     0.31        2.12        0.90       7.59

5.2.2. Results on the Dataset B

For this dataset too, we have experimented with fragment lengths 5, 6 and 7, as summarized in Table 21 and Table 22 for ISMCM and ITSCA, respectively. From these two tables we find that ISMCM with fragment length 7 produces the best global-fit RMS error of 14.67, which is better than the best global-fit RMS error of 16.26 achieved by ITSCA with fragment length 7. However, ISMCM usually finds more building blocks than ITSCA. For example, Table 22 shows that with fragment length 7 and five training proteins, ITSCA finds 756 building blocks, resulting in a global-fit RMS reconstruction error of 16.26, whereas ISMCM finds 871 building blocks, yielding the best reconstruction error of 14.67. Just to compare the performance, when ISMCM uses 716 building blocks (fragment length 7, number of proteins equal to 4) the global-fit RMS

ISMCM is primarily not by the fact that it finds and uses more building blocks but because of the quality of the building blocks, which are placed at the centers of dense areas of data points (here, 3-D structures of length 5, 6 or 7).

Table 21. ISMCM results on Dataset B

Frag. length  α    Protein count  PDB No.  Library size  U_ratio  Train LRMS  Train GRMS  Test LRMS  Test GRMS
5             5.5  1              1YGE     61            2.7%     0.52        17.02       0.58       19.78
5             5.5  2              1CZFa    73            1.7%     0.52        17.36       0.58       19.49
5             5.5  3              1DMR     81            1.0%     0.52        17.23       0.57       19.38
5             5.5  4              1SMD     86            0.7%     0.50        16.71       0.55       18.63
5             5.5  5              1LAM     91            0.5%     0.50        16.75       0.55       17.83
5             5.5  6              1PPN     92            0.4%     0.50        16.43       0.55       17.93
6             5    1              1YGE     171           11.0%    0.62        15.93       0.69       18.03
6             5    2              1DMR     221           6.5%     0.59        15.44       0.66       17.20
6             5    3              1CZFa    258           5.0%     0.61        15.58       0.67       17.26
6             5    4              1SMD     288           4.1%     0.59        14.38       0.65       16.70
6             5    5              1B4Va    316           3.4%     0.58        14.46       0.64       16.06
6             5    6              3SIL     354           2.8%     0.57        14.44       0.64       16.30
7             5    1              1YGE     337           27.8%    0.76        15.51       0.83       17.52
7             5    2              1DMR     517           19.5%    0.70        14.29       0.78       15.95
7             5    3              1CZFa    614           16.7%    0.68        14.13       0.76       16.59
7             5    4              1SMD     716           14.3%    0.67        13.76       0.75       15.44
7             5    5              1KAPp    798           12.3%    0.65        13.39       0.74       14.91
7             5    6              3SIL     871           11.0%    0.64        13.39       0.73       14.67

A comparison of the global-fit RMS errors in Tables 21 and 22 reveals that the ISMCM errors are usually lower than the ITSCA errors. Table 21 also shows that the number of building blocks increases as we increase the number of training proteins. However, the increase in the number of building blocks when the number of training proteins goes from 1 to 2 is much larger than when it goes from 4 to 5, and this is true for all fragment lengths. Moreover, going beyond five proteins increases the number of building blocks only marginally. These are very desirable attributes of any incremental algorithm, and they suggest that beyond a certain number, adding more proteins to the training data will not have much effect on the building blocks.
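The growth in library size can be read directly off Table 21. The short sketch below copies the library sizes from that table and computes how many building blocks each additional training protein contributes; the names `LIBRARY_SIZE` and `growth` are illustrative.

```python
# Library sizes from Table 21 (ISMCM, Dataset B), keyed by fragment length.
# Each list gives the size after 1, 2, ..., 6 training proteins.
LIBRARY_SIZE = {
    5: [61, 73, 81, 86, 91, 92],
    6: [171, 221, 258, 288, 316, 354],
    7: [337, 517, 614, 716, 798, 871],
}

def growth(sizes):
    """Building blocks added by each additional training protein."""
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]

for length, sizes in LIBRARY_SIZE.items():
    steps = growth(sizes)
    # steps[0] is the 1 -> 2 increase, steps[3] is the 4 -> 5 increase.
    print(f"length {length}: increments {steps}, "
          f"1->2 adds {steps[0]}, 4->5 adds {steps[3]}")
```

For every fragment length, the first increment is larger than the 4 to 5 increment, which is the comparison made in the paragraph above.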

Table 22. ITSCA results on Dataset B

Frag. length  Protein count  PDB No.  Library size  U_ratio  Train LRMS  Train GRMS  Test LRMS  Test GRMS
5             1              1YGE     56            3.1%     0.62        19.29       0.66       20.83
5             2              1DMR     56            2.0%     0.61        18.38       0.64       20.31
5             3              1CZFa    70            1.3%     0.56        17.36       0.60       19.23
5             4              1LAM     74            0.9%     0.60        18.88       0.63       20.27
5             5              1SMD     80            0.8%     0.60        18.77       0.63       20.23
5             6              1BXOa    79            0.6%     0.59        18.50       0.63       20.01
6             1              1YGE     151           13.7%    0.68        16.68       0.74       18.90
6             2              1DMR     208           7.9%     0.68        15.44       0.73       17.60
6             3              1CZFa    239           6.2%     0.67        15.27       0.72       17.14
6             4              1SMD     274           4.9%     0.66        15.39       0.71       17.25
6             5              1LAM     289           4.4%     0.61        15.15       0.67       16.47
6             6              1QKSa    306           3.7%     0.60        15.22       0.66       16.66
7             1              1YGE     327           29.7%    0.80        16.18       0.86       17.76
7             2              1DMR     496           20.4%    0.79        16.09       0.84       17.26
7             3              1CZFa    577           17.9%    0.77        15.49       0.83       16.65
7             4              1SMD     685           15.0%    0.75        15.66       0.82       16.91
7             5              1VNS     756           13.3%    0.74        15.22       0.81       16.26
7             6              1QKSa    806           12.2%    0.73        15.17       0.80       16.30

To investigate this further, Table 23 reports the results when we increase the number of proteins in the training set to 12 with fragment length 6. Table 23 shows that the first six proteins generate 354 building blocks, whereas the six additional proteins add only 68. When the number of proteins is increased from 11 to 12, the number of building blocks increases by just 1. This behavior is reflected more clearly in Figure 9, which displays the variation of the number of building blocks and of the global-fit RMS error on the test data as functions of the number of proteins in the training set. Thus use of more proteins in the training data

computational cost increases when more proteins are used for training, while the marginal benefit decreases. As Figure 9 shows, the library size stops growing and the test GRMS stops decreasing. The details are listed in Table 23.

Table 23. ISMCM results on Dataset B using 12 training proteins


Figure 9. The variation of library size and that of reconstruction errors as functions of the number of training proteins.

5.2.3 Alternative Ways of Performance Evaluation

To evaluate the quality of the building blocks for Dataset ANEW, we compare the histograms of local-fit RMS errors, which measure the deviations of fragments from their corresponding building blocks (Figure 10(a)). ISMCM places more fragments in the low-error range (a larger area under the curve at low RMS values) than ITSCA, which indicates that ISMCM represents more fragments with good building blocks, that is, with lower errors. Finally, we sort the proteins according to their ISMCM local-fit RMS deviations and compare each with the ITSCA local-fit RMS error for the same protein. The two resulting curves, shown in Figure 10(b), reveal that the ISMCM local-fit RMS errors are usually lower than the ITSCA errors.

[Figure 10 panels: (a) Count versus RMS; (b) Avg. RMS versus Protein. Legend: ISMCM, ITSCA.]

Figure 10. (a) Histograms of local-fit RMS errors for ISMCM and ITSCA (b) Protein by protein comparison of local-fit RMS error for ISMCM and ITSCA on Dataset ANEW

To compare the performance of both methods on the Dataset B, we proceed in the same way as we did for Dataset A. We remove the trailing clusters with smaller number of members to make both methods use the same number of building blocks. In Figure 11(a) and 11(b), we compare the histogram of local-fit RMS errors and average local-fit RMS error per protein, respectively. Like Dataset A, here we also find that ISMCM outperforms ITSCA with respect to these evaluation criteria.
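Equalizing the library sizes before such a comparison amounts to dropping the least populated clusters. A minimal sketch follows, assuming each building block comes with the member count of its cluster; the function name `trim_library` and its arguments are hypothetical.

```python
def trim_library(blocks, member_counts, target_size):
    """Keep the `target_size` most populated building blocks.

    blocks: list of building blocks (any representation).
    member_counts: number of cluster members for each block, same order.
    The surviving blocks are returned in their original order.
    """
    if len(blocks) != len(member_counts):
        raise ValueError("blocks and member_counts must have equal length")
    if not 0 <= target_size <= len(blocks):
        raise ValueError("target_size must be between 0 and len(blocks)")

    # Rank block indices by cluster population, largest first.
    ranked = sorted(range(len(blocks)),
                    key=lambda i: member_counts[i], reverse=True)
    keep = sorted(ranked[:target_size])
    return [blocks[i] for i in keep]
```

For example, trimming the 107 ISMCM building blocks to the 101 found by ITSCA would be `trim_library(blocks, member_counts, 101)`, after which both methods are evaluated with the same number of building blocks.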

[Figure 11 panels: (a) Count versus RMS; (b) Avg. RMS versus Protein. Legend: ISMCM, ITSCA.]

Figure 11. (a) Histograms of local-fit RMS errors for ISMCM and ITSCA (b) Protein by protein comparison of local-fit RMS error for ISMCM and ITSCA on Dataset B

5.2.4 Visual Assessment of the Quality of the Building Blocks

In this section, we visually examine how well the building blocks can represent the target fragments. We consider two examples, one with a very good fit (Figure 12) and the other with a relatively poor fit (Figure 13) that is still within the 1 Å threshold. Figure 12(a) shows a building block (AVGFMLA), whereas Figure 12(b) shows a target fragment (TLSELHC). At first glance the two structures look quite different. However, Figure 12(c), the rotated version of the building block AVGFMLA obtained after the best molecular fit with the target, looks almost identical to Figure 12(b). The superimposition of AVGFMLA and TLSELHC in Figure 12(d) clearly demonstrates an excellent fit between the two. The four panels of Figure 13 show the representation of the target fragment EGVEIAC by the building block ENAIGGS. Although this is a poorer fit in terms of the local-fit RMS error, the building block still matches the target very well (Figure 13(d)).
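The "rotated and shifted" building block and its local-fit RMS error can be obtained by a rigid-body superposition. The sketch below uses the Kabsch algorithm, a standard way to compute the best molecular fit; it is an assumption that this matches the fitting procedure used here, and the function name `local_fit_rms` and the choice of one 3-D point per residue are illustrative.

```python
import numpy as np

def local_fit_rms(block, target):
    """Superimpose `block` onto `target` and return (rms, aligned_block).

    block, target: (N, 3) arrays of coordinates, one point per residue.
    Uses the Kabsch algorithm: optimal rotation plus translation,
    with reflections excluded.
    """
    P = np.asarray(block, dtype=float)
    Q = np.asarray(target, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("block and target must both have shape (N, 3)")

    # Remove the translation by centering both point sets.
    p_center, q_center = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_center, Q - q_center

    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    aligned = Pc @ R.T + q_center            # rotated and shifted block
    rms = float(np.sqrt(np.mean(np.sum((aligned - Q) ** 2, axis=1))))
    return rms, aligned
```

A fragment would then count as represented by a building block when the returned `rms` is within the 1 Å threshold, and `aligned` corresponds to what panel (c) of Figures 12 and 13 displays.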

Figure 12. Representation of a target fragment using a building block, a good fit case. (a) The original building block; (b) The target hexamer; (c) The rotated and shifted building block; (d) The building block and target hexamer superimposed after alignment.

Figure 13. Representation of a target fragment using a building block, a poor fit case. (a) The original building block; (b) The target hexamer; (c) The rotated and shifted building block; (d) The building block and target hexamer superimposed after alignment.

Next, we show the biological structures of the building blocks of the two most populated clusters found by ISMCM and compare them with the building blocks of the top two clusters found by ITSCA. The most typical helical building block found by ISMCM is AVGFMLA, located at residues 324-330 of 1SMD, whereas the most populated building block found by ITSCA is GAAQVIM, located at residues 147-153 of 1DMR. The fact that GAAQVIM is also a helical structure and is included in the cluster of AVGFMLA suggests that the ITSCA cluster associated with GAAQVIM and the ISMCM cluster associated with AVGFMLA represent the same biological structural motif. Figure 14(a) and Figure 14(b) show these two building blocks, and it is clear that they represent the same structural unit. Similarly, the most typical extended strand found by ISMCM, TKVIFEG, is located at residues 43-49 of 1CZFa, whereas its ITSCA counterpart, GIKIYVS, is located at residues 464-470 of 1SMD. These two building blocks are depicted in Figure 15(a) and Figure 15(b). It can be seen that these building blocks represent similar structures of biological significance.

Figure 14. (a) ISMCM building block (AVGFMLA) at residue 324-330 of 1SMD (b) ITSCA building block (GAAQVIM) at residue 147-153 of 1DMR

Figure 15. (a) ISMCM building block (TKVIFEG) at residue 43-49 of 1CZFa (b) ITSCA building block (GIKIYVS) at residue 464-470 of 1SMD

6. CONCLUSIONS AND FUTURE WORK

We have applied the concept of combinatorial fusion to improve accuracy in protein structure prediction. In particular, we have achieved overall predictive accuracy rates of 87% for the four classes and 69.6% for the 27 folding patterns. We improve on the previous results of Huang et al. [9] (65.5% for folding structures) and Ding and Dubchak [8] (56.5% for folding structures) by incorporating the method of combinatorial fusion with the RBFN neural network using the hierarchical learning architecture. These rates are higher than previous results, which demonstrates that data fusion is a viable method for feature selection and combination in the prediction and classification of protein structures. Other work has improved on those results using other machine learning techniques such as kernel methods, SVM and genetic algorithms. For example, Yu et al. [43] have obtained a good accuracy rate using SVM with n-peptide coding schemes and jury voting. Future work can be performed to improve these results using our combinatorial fusion approach.

We have also presented a structural variant of the mountain clustering method that is suitable for data such as the 3-D structures of protein fragments. We have analyzed SMCM and TSCA and have demonstrated that, since TSCA does not take into account the geometry of the data, it may extract poorer building blocks than SMCM. The utility of this algorithm is demonstrated on the same dataset used by Unger et al. In fact, its superiority is demonstrated on two versions of the dataset (the original one and the newly updated one on the same set of proteins). To compare the quality of reconstructions visually, we also proposed two alternative evaluations, which reveal that SMCM building blocks usually perform better than TSCA building blocks, both in terms of the local-fit RMS histogram and in terms of the average RMS deviation for individual proteins. Our experiments demonstrate that SMCM can find useful building blocks to successfully reconstruct the 3-D protein structures for the first 60 residues (as done by Unger et al.) of all test proteins with a global-fit RMS error within 7.19 Å. It can also obtain good local-fit RMS errors, indicating that these building blocks can model the nearby fragments within tolerable errors.

Both SMCM and TSCA are computationally expensive when the training dataset is large. Hence we proposed an incremental version of SMCM. The same concept is also used to obtain an incremental version of TSCA. We have experimented extensively with these two algorithms using two versions of the dataset used by Unger et al. as well as another dataset used by other researchers. The incremental SMCM is found to be quite effective and to exhibit the properties expected from an incremental algorithm. More specifically, as the number of proteins in the training set increases, the growth in the number of building blocks slows down and, consequently, so does the rate of decrease in the global reconstruction error on both the training and test data. Moreover, the incremental SMCM is found to be more effective than the incremental TSCA. Although SMCM usually finds more building blocks than TSCA, we have demonstrated that the improved performance of SMCM comes from the quality of the building blocks, which are placed at the centers of areas dense in training data.

None of the algorithms discussed here can take into account fragments of variable length.

To extend the algorithms for fragments of variable length, we need measures of similarity between fragments of different lengths. For example, if we have two fragments both are helix,
