5. INCREMENTAL STRUCTURAL MOUNTAIN CLUSTERING METHODS (ISMCM)

5.2. RESULTS

5.2.2. Results on the Dataset B

For this dataset too, we experimented with fragment lengths 5, 6 and 7; the results are summarized in Table 21 for ISMCM and in Table 22 for ITSCA. From these two tables we find that ISMCM with fragment length 7 produces the best global-fit RMS error, 14.67, which is better than the best global-fit RMS error of 16.26 achieved by ITSCA with fragment length 7. However, ISMCM usually finds more building blocks than ITSCA. For example, Table 22 shows that with fragment length 7 and five training proteins, ITSCA finds 756 building blocks, which results in a global-fit RMS reconstruction error of 16.26, whereas ISMCM (Table 21, six training proteins) finds 871 building blocks, yielding the best reconstruction error of 14.67. Just to compare the performance, when ISMCM uses 716 building blocks (fragment length 7, number of proteins equal to 4) the global-fit RMS

ISMCM is primarily not by the fact that it finds and uses more building blocks but because of the quality of the building blocks, which are placed at the centers of dense areas of data points (here, 3-D structures of length 5, 6 or 7).

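Table 21. ISMCM results on Dataset B

| Frag. length | α | Protein count | PDB No. | Library size | U_ratio | Train LRMS | Train GRMS | Test LRMS | Test GRMS |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 5.5 | 1 | 1YGE | 61 | 2.7% | 0.52 | 17.02 | 0.58 | 19.78 |
| 5 | 5.5 | 2 | 1CZFa | 73 | 1.7% | 0.52 | 17.36 | 0.58 | 19.49 |
| 5 | 5.5 | 3 | 1DMR | 81 | 1.0% | 0.52 | 17.23 | 0.57 | 19.38 |
| 5 | 5.5 | 4 | 1SMD | 86 | 0.7% | 0.50 | 16.71 | 0.55 | 18.63 |
| 5 | 5.5 | 5 | 1LAM | 91 | 0.5% | 0.50 | 16.75 | 0.55 | 17.83 |
| 5 | 5.5 | 6 | 1PPN | 92 | 0.4% | 0.50 | 16.43 | 0.55 | 17.93 |
| 6 | 5 | 1 | 1YGE | 171 | 11.0% | 0.62 | 15.93 | 0.69 | 18.03 |
| 6 | 5 | 2 | 1DMR | 221 | 6.5% | 0.59 | 15.44 | 0.66 | 17.20 |
| 6 | 5 | 3 | 1CZFa | 258 | 5.0% | 0.61 | 15.58 | 0.67 | 17.26 |
| 6 | 5 | 4 | 1SMD | 288 | 4.1% | 0.59 | 14.38 | 0.65 | 16.70 |
| 6 | 5 | 5 | 1B4Va | 316 | 3.4% | 0.58 | 14.46 | 0.64 | 16.06 |
| 6 | 5 | 6 | 3SIL | 354 | 2.8% | 0.57 | 14.44 | 0.64 | 16.30 |
| 7 | 5 | 1 | 1YGE | 337 | 27.8% | 0.76 | 15.51 | 0.83 | 17.52 |
| 7 | 5 | 2 | 1DMR | 517 | 19.5% | 0.70 | 14.29 | 0.78 | 15.95 |
| 7 | 5 | 3 | 1CZFa | 614 | 16.7% | 0.68 | 14.13 | 0.76 | 16.59 |
| 7 | 5 | 4 | 1SMD | 716 | 14.3% | 0.67 | 13.76 | 0.75 | 15.44 |
| 7 | 5 | 5 | 1KAPp | 798 | 12.3% | 0.65 | 13.39 | 0.74 | 14.91 |
| 7 | 5 | 6 | 3SIL | 871 | 11.0% | 0.64 | 13.39 | 0.73 | 14.67 |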

Comparison of the global-fit RMS errors in Tables 21 and 22 reveals that the ISMCM errors are usually lower than those of ITSCA. From Table 21 we also find that the number of building blocks increases as we increase the number of training proteins. However, the increase when the number of training proteins goes from 1 to 2 is much larger than the increase when it goes from 4 to 5, and this is true for all fragment lengths. Moreover, going beyond five proteins increases the number of building blocks only marginally. These are very desirable attributes of any incremental algorithm, and they suggest that beyond a certain number, increasing the number of proteins in the training data will not have much effect on the building blocks.
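The growth pattern described above can be checked directly from the library sizes in Table 21. The following is a minimal sketch; the names `LIBRARY_SIZE` and `increments` are introduced here for illustration and do not come from the original work.

```python
# Library sizes copied from Table 21 (ISMCM, Dataset B), keyed by fragment
# length; position i holds the library size after i + 1 training proteins.
LIBRARY_SIZE = {
    5: [61, 73, 81, 86, 91, 92],
    6: [171, 221, 258, 288, 316, 354],
    7: [337, 517, 614, 716, 798, 871],
}


def increments(sizes):
    """Return the number of building blocks added by each extra protein."""
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]


for length, sizes in LIBRARY_SIZE.items():
    print(f"fragment length {length}: {increments(sizes)}")
```

For every fragment length, the first increment (1 to 2 proteins) is larger than the fourth (4 to 5 proteins), which is the comparison made in the text.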

Table 22. ITSCA results on Dataset B

| Frag. length | Protein count | PDB No. | Library size | U_ratio | Train LRMS | Train GRMS | Test LRMS | Test GRMS |
|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 1YGE | 56 | 3.1% | 0.62 | 19.29 | 0.66 | 20.83 |
| 5 | 2 | 1DMR | 56 | 2.0% | 0.61 | 18.38 | 0.64 | 20.31 |
| 5 | 3 | 1CZFa | 70 | 1.3% | 0.56 | 17.36 | 0.60 | 19.23 |
| 5 | 4 | 1LAM | 74 | 0.9% | 0.60 | 18.88 | 0.63 | 20.27 |
| 5 | 5 | 1SMD | 80 | 0.8% | 0.60 | 18.77 | 0.63 | 20.23 |
| 5 | 6 | 1BXOa | 79 | 0.6% | 0.59 | 18.50 | 0.63 | 20.01 |
| 6 | 1 | 1YGE | 151 | 13.7% | 0.68 | 16.68 | 0.74 | 18.90 |
| 6 | 2 | 1DMR | 208 | 7.9% | 0.68 | 15.44 | 0.73 | 17.60 |
| 6 | 3 | 1CZFa | 239 | 6.2% | 0.67 | 15.27 | 0.72 | 17.14 |
| 6 | 4 | 1SMD | 274 | 4.9% | 0.66 | 15.39 | 0.71 | 17.25 |
| 6 | 5 | 1LAM | 289 | 4.4% | 0.61 | 15.15 | 0.67 | 16.47 |
| 6 | 6 | 1QKSa | 306 | 3.7% | 0.60 | 15.22 | 0.66 | 16.66 |
| 7 | 1 | 1YGE | 327 | 29.7% | 0.80 | 16.18 | 0.86 | 17.76 |
| 7 | 2 | 1DMR | 496 | 20.4% | 0.79 | 16.09 | 0.84 | 17.26 |
| 7 | 3 | 1CZFa | 577 | 17.9% | 0.77 | 15.49 | 0.83 | 16.65 |
| 7 | 4 | 1SMD | 685 | 15.0% | 0.75 | 15.66 | 0.82 | 16.91 |
| 7 | 5 | 1VNS | 756 | 13.3% | 0.74 | 15.22 | 0.81 | 16.26 |
| 7 | 6 | 1QKSa | 806 | 12.2% | 0.73 | 15.17 | 0.80 | 16.30 |
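The statement that the ISMCM global-fit errors are usually lower can be made concrete by comparing the Test GRMS columns of Tables 21 and 22 row by row. This is only an illustrative sketch: the variable and function names are hypothetical, and rows are paired by fragment length and protein count even though the two methods do not always use the same training proteins in a given row.

```python
# Test GRMS values copied from Table 21 (ISMCM) and Table 22 (ITSCA),
# keyed by fragment length; position i corresponds to i + 1 training proteins.
ISMCM_TEST_GRMS = {
    5: [19.78, 19.49, 19.38, 18.63, 17.83, 17.93],
    6: [18.03, 17.20, 17.26, 16.70, 16.06, 16.30],
    7: [17.52, 15.95, 16.59, 15.44, 14.91, 14.67],
}
ITSCA_TEST_GRMS = {
    5: [20.83, 20.31, 19.23, 20.27, 20.23, 20.01],
    6: [18.90, 17.60, 17.14, 17.25, 16.47, 16.66],
    7: [17.76, 17.26, 16.65, 16.91, 16.26, 16.30],
}


def count_wins(errors_a, errors_b):
    """Count the paired rows in which method A has the strictly lower error."""
    return sum(a < b for a, b in zip(errors_a, errors_b))


wins_by_length = {
    length: count_wins(ISMCM_TEST_GRMS[length], ITSCA_TEST_GRMS[length])
    for length in ISMCM_TEST_GRMS
}
total_wins = sum(wins_by_length.values())
total_rows = sum(len(v) for v in ISMCM_TEST_GRMS.values())
print(wins_by_length, f"{total_wins} of {total_rows} rows")
```

The two exceptions are the three-protein rows for fragment lengths 5 and 6, where ITSCA has the slightly lower test error.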

To investigate this further, Table 23 reports the results obtained when we increased the number of proteins in the training set to 12 with fragment length 6. Table 23 shows that the first six proteins generated 354 building blocks, whereas the six additional proteins added only 68. When the number of proteins is increased from 11 to 12, the number of building blocks increases by just 1. This behavior is reflected more clearly in Figure 9, which displays the variation of the number of building blocks and of the global-fit RMS error on the test data as a function of the number of proteins in the training set. Thus use of more proteins in the training data

computational cost will be increasing when more proteins are used for training and the marginal benefit is decreasing. As Figure 9 shows, the library size eventually stops growing and the test GRMS stops decreasing. The details are listed in Table 23.

Table 23. ISMCM results on Dataset B using 12 training proteins

Figure 9. The variation of library size and that of reconstruction errors as functions of the number of training proteins.

5.2.3. Alternative Ways of Performance Evaluation

To evaluate the quality of the building blocks for Dataset ANEW, we compare the histograms of local-fit RMS errors, which measure the deviations of fragments from their corresponding building blocks (Figure 10(a)). The total count of fragments with lower RMS error (the area under the curve) is larger for ISMCM than for ITSCA, which indicates that with ISMCM more fragments are represented by good building blocks with lower errors. Finally, we sort the proteins according to their ISMCM local-fit RMS deviations and compare the results one by one with the ITSCA local-fit RMS error for the same protein. The two resulting curves, shown in Figure 10(b), reveal that the ISMCM local-fit RMS errors are usually lower than the ITSCA errors.

[Figure 10 plots: (a) Count versus RMS; (b) Avg. RMS versus Protein; curves for ISMCM and ITSCA.]

Figure 10. (a) Histograms of local-fit RMS errors for ISMCM and ITSCA (b) Protein by protein comparison of local-fit RMS error for ISMCM and ITSCA on Dataset ANEW
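The two evaluations behind Figure 10 can be sketched as follows. All numeric values and protein labels below are hypothetical placeholders, not data from the study, and the helper names `count_below` and `paired_by_reference` are introduced here for illustration only.

```python
def count_below(errors, threshold):
    """Count fragments whose local-fit RMS error is at most `threshold`."""
    return sum(e <= threshold for e in errors)


def paired_by_reference(ref_avg, other_avg):
    """Sort proteins by the reference method's average local-fit RMS and
    return (protein, reference error, other error) triples in that order."""
    order = sorted(ref_avg, key=ref_avg.get)
    return [(p, ref_avg[p], other_avg[p]) for p in order]


# Hypothetical per-fragment local-fit RMS errors (illustration only).
ismcm_fragment_rms = [0.21, 0.35, 0.48, 0.62, 0.90, 1.40]
itsca_fragment_rms = [0.30, 0.55, 0.71, 0.88, 1.20, 1.75]

# Hypothetical per-protein average local-fit RMS errors (illustration only).
ismcm_avg = {"P1": 0.62, "P2": 0.48, "P3": 0.71}
itsca_avg = {"P1": 0.70, "P2": 0.55, "P3": 0.69}

low_ismcm = count_below(ismcm_fragment_rms, 0.5)
low_itsca = count_below(itsca_fragment_rms, 0.5)
pairs = paired_by_reference(ismcm_avg, itsca_avg)
```

Plotting the second and third elements of `pairs` against the protein rank would give two curves of the kind shown in panel (b).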

To compare the performance of both methods on Dataset B, we proceed in the same way as we did for Dataset A. We remove the trailing clusters, those with the fewest members, so that both methods use the same number of building blocks. In Figures 11(a) and 11(b), we compare the histograms of local-fit RMS errors and the average local-fit RMS error per protein, respectively. As with Dataset A, we find that ISMCM outperforms ITSCA with respect to these evaluation criteria.
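The trimming step can be sketched as below, assuming each library is available as a mapping from a building-block label to its cluster member count. The labels, counts and the function name `trim_to_common_size` are hypothetical; the sketch covers only the selection of clusters and says nothing about how fragments of removed clusters are handled afterwards.

```python
def trim_to_common_size(clusters_a, clusters_b):
    """Keep the k most populated clusters of each library, where k is the
    size of the smaller library, so both end up with k building blocks."""
    k = min(len(clusters_a), len(clusters_b))

    def top_k(clusters):
        # Rank clusters by member count, largest first, and keep k of them.
        ranked = sorted(clusters.items(), key=lambda item: item[1], reverse=True)
        return dict(ranked[:k])

    return top_k(clusters_a), top_k(clusters_b)


# Hypothetical member counts per building block (illustration only).
library_a = {"A1": 120, "A2": 45, "A3": 7, "A4": 2}
library_b = {"B1": 98, "B2": 60, "B3": 11}

trimmed_a, trimmed_b = trim_to_common_size(library_a, library_b)
```

Here `trimmed_a` loses its least populated cluster (`A4`), while `trimmed_b` is unchanged because it is already the smaller library.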

[Figure 11 plots: (a) Count versus RMS; (b) Avg. RMS versus Protein; curves for ISMCM and ITSCA.]

Figure 11. (a) Histograms of local-fit RMS errors for ISMCM and ITSCA (b) Protein by protein comparison of local-fit RMS error for ISMCM and ITSCA on Dataset B

5.2.4. Visual Assessment of the Quality of the Building Blocks

In this section, we examine visually how well the building blocks can represent the target fragments. For this we consider two examples, one with a very good fit (Figure 12) and the other with a relatively poor fit (Figure 13) that is still within the 1 Å threshold. Figure 12(a) shows a building block (AVGFMLA), whereas Figure 12(b) shows a target fragment (TLSELHC). At first sight the two structures look quite different. However, Figure 12(c), the rotated version of the building block AVGFMLA obtained after the best molecular fit with the target, looks almost identical to Figure 12(b). The superimposition of AVGFMLA and TLSELHC shown in Figure 12(d) clearly demonstrates an excellent fit between the two. The four panels in Figure 13 show the representation of the target fragment EGVEIAC by the building block ENAIGGS. Although this is a poorer fit in terms of the local-fit RMS error, the building block still matches the target very well (Figure 13(d)).
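The "rotate, shift and superimpose" step illustrated in Figures 12 and 13 can be sketched with the Kabsch algorithm, a standard method for least-squares rigid superposition. The source does not state which fitting procedure was used, so this is an assumption, and the coordinates below are synthetic rather than taken from any protein; `kabsch_fit` is a name introduced here.

```python
import numpy as np


def kabsch_fit(block, target):
    """Rigidly superimpose `block` onto `target` (both N x 3 arrays) and
    return (rmsd, fitted_block) for the least-squares optimal rotation."""
    P = np.asarray(block, dtype=float)
    Q = np.asarray(target, dtype=float)
    p_center, q_center = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - p_center, Q - q_center

    # Covariance matrix and its singular value decomposition.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    fitted = P0 @ R.T + q_center
    rmsd = float(np.sqrt(((fitted - Q) ** 2).sum(axis=1).mean()))
    return rmsd, fitted


# Synthetic 7-point "fragment" on a helix-like curve (illustration only).
t = np.arange(7, dtype=float)
block = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])

# Target = the same shape rotated 40 degrees about z and shifted.
angle = np.deg2rad(40.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
target = block @ Rz.T + np.array([3.0, -2.0, 1.5])

rmsd, fitted = kabsch_fit(block, target)
```

Because the synthetic target is an exact rigid copy of the block, the residual RMS after fitting is numerically zero; for real fragments the residual would be the local-fit RMS error discussed in the text.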

Figure 12. Representation of a target fragment using a building block, a good fit case. (a) The original building block; (b) The target fragment; (c) The rotated and shifted building block; (d) The building block and target fragment superimposed after alignment.

Figure 13. Representation of a target fragment using a building block, a poor fit case. (a) The original building block; (b) The target fragment; (c) The rotated and shifted building block; (d) The building block and target fragment superimposed after alignment.

Next, we show the structures of the building blocks of the two most populated clusters found by ISMCM and compare them with the building blocks of the top two clusters found by ITSCA. The most typical helical building block found by ISMCM is AVGFMLA, located at residues 324-330 of 1SMD, whereas the most populated building block found by ITSCA is GAAQVIM, located at residues 147-153 of 1DMR. The fact that GAAQVIM is also a helical structure and is included in the cluster of AVGFMLA suggests that the ITSCA cluster associated with GAAQVIM and the ISMCM cluster associated with AVGFMLA represent the same biological structural motif. Figure 14(a) and Figure 14(b) show these two building blocks, and it is clear that they represent the same structural unit. Similarly, we find that the most typical extended strand found by ISMCM, TKVIFEG, is located at residues 43-49 of 1CZFa, whereas its ITSCA counterpart, GIKIYVS, is located at residues 464-470 of 1SMD. These two building blocks are depicted in Figure 15(a) and Figure 15(b). It can be seen that these building blocks represent similar structures of biological significance.

Figure 14. (a) ISMCM building block (AVGFMLA) at residues 324-330 of 1SMD (b) ITSCA building block (GAAQVIM) at residues 147-153 of 1DMR

Figure 15. (a) ISMCM building block (TKVIFEG) at residues 43-49 of 1CZFa (b) ITSCA building block (GIKIYVS) at residues 464-470 of 1SMD

From the document: Data Fusion and Mountain Clustering Methods Applied to Improving Protein Structure Prediction and Analysis (pp. 69-78)