Experimental results - Prediction of Protein Subcellular Localization and RNA Binding Sites

3.2 Results

3.2.1 Experimental results

Experiment 1: The benefit of using the TFPSSM weighting scheme

The overall accuracy of 1NN_TFIDF and 1NN_TFPSSM for each gapped distance is shown in Figure 3.3. The highest overall accuracy of 1NN_TFPSSM is 89.47%, reached when l equals 4, 5, and 13. This is considerably higher than the best 1NN_TFIDF score of 74.38%, reached when l equals 4. Therefore, adopting the TFPSSM weighting scheme significantly improves on the performance of 1NN_TFIDF.

The performance of 1NN_TFIDF and 1NN_TFPSSM in the high- and low-homology data sets is shown in Table 3.2. 1NN_TFPSSM dramatically improves the performance of 1NN_TFIDF by about 26% in overall accuracy on PSLow661.

Hence, incorporating PSSM into the weighting scheme is useful for improving performance, because sequence information alone is insufficient in the low-homology data set.

Figure 3.3: Overall accuracy of 1NN_TFIDF and 1NN_TFPSSM with respect to gapped distances on the PS1444 data set.

Table 3.2: The comparison of 1NN_TFIDF and 1NN_TFPSSM on the PSHigh783 and PSLow661 data sets.

             PSHigh783                            PSLow661
             1NN_TFPSSM        1NN_TFIDF          1NN_TFPSSM        1NN_TFIDF
Loc. Sites   Acc.(%)   MCC     Acc.(%)   MCC      Acc.(%)   MCC     Acc.(%)   MCC
CP           94.20     0.96    71.01     0.74     83.25     0.77    41.15     0.36
IM           99.31     0.99    98.62     0.89     82.93     0.82    84.15     0.48
PP           95.86     0.94    86.21     0.89     74.05     0.63    38.17     0.46
OM           99.66     0.99    95.88     0.95     85        0.82    66.00     0.48
EC           96.99     0.96    92.48     0.91     57.89     0.51    28.07     0.26
Overall      97.96     -       91.83     -        79.43     -       53.86     -
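The per-site accuracy and MCC columns reported in these tables can be illustrated with a small sketch. It assumes the usual one-vs-rest convention, in which the accuracy of a localization site is the fraction of proteins at that site that are predicted correctly, and MCC is the standard Matthews correlation coefficient. The function names and the toy labels are hypothetical and are not taken from the original experiments.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from one-vs-rest counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def site_scores(true_sites, predicted_sites, site):
    """Return (accuracy, MCC) for one localization site, one-vs-rest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_sites, predicted_sites):
        if truth == site and pred == site:
            tp += 1
        elif truth == site:
            fn += 1
        elif pred == site:
            fp += 1
        else:
            tn += 1
    # Assumed definition: correctly predicted proteins of this site
    # divided by all proteins of this site.
    accuracy = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, mcc(tp, fp, tn, fn)

# Hypothetical labels, not taken from PS1444
true_sites = ["CP", "CP", "IM", "IM", "PP"]
predicted_sites = ["CP", "IM", "IM", "IM", "PP"]
cp_accuracy, cp_mcc = site_scores(true_sites, predicted_sites, "CP")
```

Because MCC also penalizes false positives, a site can have high accuracy but a noticeably lower MCC, which may help when reading rows where the two columns diverge.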

Experiment 2: The effect of incorporating PSSM information and the gapped-dipeptide encoding scheme

Table 3.3 shows the performance of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps, and 1NN_PSI-BLASTnr on the PSHigh783 and PSLow661 data sets. The overall accuracy on the PSHigh783 data set is very similar for all methods.

However, for the PSLow661 data set, 1NN_ClustalW, 1NN_PSI-BLASTps, and 1NN_PSI-BLASTnr attain 42.97%, 57.94%, and 66.57%, respectively, in overall accuracy. This result reveals that better performance can be achieved when a larger database is used in constructing the PSSM. It also lends support to our assumption that incorporating more information into the PSSM is more effective for predicting proteins with low sequence identity to the training set. Most notably, 1NN_TFPSSM outperforms 1NN_PSI-BLASTnr by 12.86% in overall accuracy. This suggests that incorporating PSSM through the gapped-dipeptide encoding scheme significantly improves predictive performance, especially for proteins of low sequence identity.

Table 3.3: Comparison of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps and 1NN_PSI-BLASTnr for the PSHigh783 and PSLow661 data sets

PSHigh783
             1NN_TFPSSM       1NN_ClustalW     1NN_PSI-BLASTps   1NN_PSI-BLASTnr
Loc. Sites   Acc.(%)   MCC    Acc.(%)   MCC    Acc.(%)   MCC     Acc.(%)   MCC

             1NN_TFPSSM       1NN_ClustalW     1NN_PSI-BLASTps   1NN_PSI-BLASTnr
             Acc.(%)   MCC    Acc.(%)   MCC    Acc.(%)   MCC     Acc.(%)   MCC

Experiment 3: The benefit of PLSA feature reduction

Determining the reduced size of PLSA

The reduced size of PLSA is determined from the LSA singular values. Figure 3.4 shows the singular values in decreasing order for the data sets built with different gapped-distance upper bounds.

In Figure 3.4, the 40th largest singular value is close to zero, but in the inset it is the 160th largest singular value that is close to zero. Hence, the reduced feature size of PLSA is set to 40, 80, and 160. We do not test larger reduced sizes, or every reduced size one by one, in consideration of training efficiency and to avoid overfitting the data.
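The idea of reading a candidate reduced size off the singular value spectrum can be sketched as follows. This is a minimal illustration only: it assumes a feature matrix `X` (here synthetic low-rank data, not the PS1444 features) and a hypothetical relative tolerance `tol` for what counts as "close to zero". Neither the tolerance nor the function name comes from the original work, where the sizes were chosen by inspecting Figure 3.4.

```python
import numpy as np

def count_significant_singular_values(matrix, tol=1e-8):
    """Count singular values larger than tol times the largest one."""
    # np.linalg.svd returns singular values sorted in decreasing order.
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
# Synthetic 200 x 100 matrix of rank 40 (hypothetical data)
X = rng.standard_normal((200, 40)) @ rng.standard_normal((40, 100))
suggested_size = count_significant_singular_values(X)
```

On real data the spectrum usually decays gradually rather than dropping to numerical zero, so the tolerance would have to be chosen by inspection, much as the text does with the main plot and the inset.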

Figure 3.4: Singular values in decreasing order for each gapped distance. The inset shows the singular values without the largest one, for a more detailed view.

For a single PLSA reduced size, training and testing PSLDoc over all gapped distances take about 1.5 hours and 2 to 3 minutes, respectively. In contrast, PSLDoc-PLSA (PSLDoc without PLSA feature reduction) takes about 180 hours for training and 1.4 hours for testing. Figure 3.5 shows the performance of PSLDoc-PLSA and PSLDoc, where PSLDoc_Fx denotes PSLDoc with PLSA reduced size x.

The highest overall accuracies among all gapped distances for PSLDoc_F40, PSLDoc_F80, and PSLDoc_F160 are 92.31%, 93.01%, and 92.52%, respectively, which are 0.83%, 1.52%, and 1.04% better than that of PSLDoc-PLSA. Using PLSA therefore improves not only learning efficiency but also performance. In the following experiments, PSLDoc uses gapped distance 13 and a PLSA reduced size of 80.


Figure 3.5: Overall accuracy of PSLDoc_F40, PSLDoc_F80, PSLDoc_F160 and PSLDoc-PLSA with respect to gapped distance on the PS1444 dataset.

Experiment 4: The benefit of SVM and PLSA feature reduction

Table 3.4 shows the performance of PSLDoc, 1NN_TFPSSM, and 1NN_ClustalW on PSHigh783 and PSLow661. The overall accuracy of 1NN_ClustalW on PSHigh783 (97.32%) is very similar to that reported by Yu et al. (97.7%). 1NN_TFPSSM and PSLDoc perform better than 1NN_ClustalW on PSHigh783. On PSLow661, PSLDoc improves on 1NN_TFPSSM by 7.41%, owing to the non-linear SVM classification and the PLSA feature reduction and extraction. This shows that PSLDoc is suitable for both the high- and low-homology data sets.

Table 3.4: Comparison of PSLDoc, 1NN_TFPSSM, and 1NN_ClustalW for the PSHigh783 and PSLow661 data sets.

        PSHigh783                                            PSLow661
        PSLDoc          1NN_TFPSSM      1NN_ClustalW         PSLDoc          1NN_TFPSSM      1NN_ClustalW
Loc.    Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC         Acc.(%)  MCC    Acc.(%)  MCC    Acc.(%)  MCC

Table 3.5: Comparison of PSLDoc, HYBRID, and PSORTb v.2.0 on the PS1444 data set. The performance of PSLDoc with a three-way data split procedure is given in parentheses.

             PSLDoc                        HYBRID           PSORTb v.2.0
Loc. Sites   Acc.           MCC            Acc.    MCC      Acc.    MCC
CP           94.96(94.24)   0.91(0.91)     95.00   0.89     70.10   0.77
IM           93.20(93.53)   0.94(0.94)     90.60   0.92     92.60   0.92
PP           89.13(89.13)   0.87(0.85)     88.80   0.84     69.20   0.78
OM           95.65(95.14)   0.95(0.94)     95.10   0.93     94.90   0.95
EC           90.00(87.37)   0.87(0.86)     85.30   0.87     78.90   0.86
Overall      93.01(92.45)   -              91.60   -        82.60   -

Experiment 5: Comparison of PSLDoc, HYBRID and PSORTb v.2.0

Table 3.5 shows the performance of PSLDoc, HYBRID, and PSORTb v.2.0 on PS1444. PSLDoc achieves the best overall accuracy, 93.01%, compared with 91.6% for HYBRID and 82.6% for PSORTb v.2.0.

Experiment 6: PSLDoc under different prediction thresholds versus PSORTb v.2.0 on the PS1444 data set

Prediction confidence

The probabilities estimated by LIBSVM are used to determine the confidence levels of classifications. The class with the largest probability is chosen as the final predicted class. The confidence of this prediction, the prediction confidence (Jones, 1999), is taken to be the largest probability minus the second largest probability. Figure 3.6 shows the relationship between accuracy and prediction confidence. For proteins with prediction confidence in the range [0.9-1], the prediction accuracy is near 100% (99.12%).
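The prediction confidence described above can be sketched in a few lines. The sketch assumes the per-class probability estimates are already available as a dictionary keyed by localization site. The function name and the example probabilities are hypothetical and do not come from LIBSVM output or from the original experiments.

```python
def prediction_confidence(probabilities):
    """Return (predicted_class, confidence) for one protein.

    Confidence is the largest class probability minus the second largest.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best_class, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_class, best_p - second_p

# Hypothetical probability estimates for the five localization sites
probs = {"CP": 0.05, "IM": 0.80, "PP": 0.10, "OM": 0.03, "EC": 0.02}
label, confidence = prediction_confidence(probs)
```

Here the predicted site is IM and the confidence is roughly 0.7, so this protein would fall in the [0.7-0.8) bin of Figure 3.6 (up to floating-point rounding at the bin edge).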

[Figure 3.6 plot. x-axis: Prediction Confidence, in bins of width 0.1 from [0-0.1) to [0.9-1]; y-axis: Overall Accuracy (%), from 0 to 100.]

Figure 3.6: Overall accuracy of PSLDoc with respect to prediction confidence. [x,y) denotes a prediction confidence of at least x and less than y.

Prediction threshold

Gardy et al. suggested that when a prediction system is unable to generate a confident prediction, it should report a result of “Unknown”, because biologists usually prefer correct predictions (high precision) to prediction coverage (recall) (Gardy et al., 2005). To provide more precise prediction results, we set a prediction threshold to filter out predictions with low confidence. That is, the SVM classifier outputs a prediction only when the prediction confidence is above the threshold; otherwise it outputs “Unknown” (Gardy et al., 2005; Gardy et al., 2003). Recall and precision for each prediction threshold are shown in Figure 3.7.
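The thresholding rule and its effect on precision and recall can be sketched as follows. This is an illustration under stated assumptions: overall precision is taken as the number of correct predictions divided by the number of proteins that receive a prediction, and overall recall as the number of correct predictions divided by the number of all proteins. The function names and the toy data are hypothetical.

```python
def predict_with_threshold(probabilities, threshold):
    """Predict the most probable class, or "Unknown" if the confidence
    (largest minus second largest probability) is not above threshold."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best_class, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_class if best_p - second_p > threshold else "Unknown"

def precision_recall(true_labels, predicted_labels):
    """Overall precision and recall when "Unknown" outputs are allowed."""
    pairs = list(zip(true_labels, predicted_labels))
    answered = [(t, p) for t, p in pairs if p != "Unknown"]
    correct = sum(t == p for t, p in answered)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(pairs) if pairs else 0.0
    return precision, recall

# Hypothetical probability estimates for four proteins
true_labels = ["CP", "IM", "PP", "OM"]
estimates = [
    {"CP": 0.90, "IM": 0.05, "PP": 0.05},  # confident and correct
    {"CP": 0.50, "IM": 0.40, "PP": 0.10},  # low confidence
    {"CP": 0.10, "IM": 0.10, "PP": 0.80},  # confident and correct
    {"IM": 0.70, "OM": 0.20, "CP": 0.10},  # confident but wrong
]
predictions = [predict_with_threshold(e, 0.3) for e in estimates]
precision, recall = precision_recall(true_labels, predictions)
```

Raising the threshold turns more low-confidence outputs into "Unknown", which tends to raise precision at the cost of recall. This is the trade-off traced out in Figure 3.7.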


Figure 3.7: Overall accuracy of PSLDoc with respect to prediction confidence. The value above the point denotes the corresponding prediction threshold.

Table 3.6 shows the performance of PSLDoc under different prediction thresholds. With the prediction threshold set to 0.7, PSLDoc achieves slightly better recall than PSORTb v.2.0 (83.66% versus 82.6%) and also better precision (97.89% versus 95.8%). When the prediction threshold is set to 0.3, PSLDoc achieves precision comparable to that of PSORTb v.2.0 (95.77% versus 95.8%), while its recall is much better (89.27% versus 82.6%).

Table 3.6: Comparison of PSLDoc under the prediction threshold 0.7, PSLDoc under the prediction threshold 0.3, and PSORTb v.2.0.

        PSLDoc_PreThr=0.7             PSLDoc_PreThr=0.3             PSORTb v.2.0
Loc.    TP  FP  FN  Pre.  Rec.        TP  FP  FN  Pre.  Rec.        TP  FP  FN  Pre.  Rec.
