Performance evaluation - Mining semantically interpretable knowledge and predicting the binding between protein residues and DNA

4.1.1 Using original feature sets in PDNA-62

In our preliminary experiments, each cf value was examined in combination with each weight. To understand the influence of the cf value, part of this investigation is shown in Figure 4.1.

Although the curves show similar tendencies, they differ in their results, including total accuracy, NP value, and decision tree size. In general, smaller cf values produce more compact, and therefore more readable, decision trees. But tree size has to be balanced against performance: Figure 4.1 shows that an excessively low cf value harms classification.

This influence gradually diminishes, however, once the cf value reaches a certain range. Therefore, we choose cf = 20 for the remaining experiments in this thesis.
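The link between a small cf value and a small tree can be illustrated with the pessimistic error estimate described for C4.5-style pruning. The sketch below is an assumption about the mechanism, not a description of C5.0's internals, which this section does not give: it takes cf as a confidence level (cf = 20 is written as 0.20) and computes the upper limit on a leaf's error rate from a binomial model. The function names are hypothetical.

```python
from math import comb

def binom_cdf(errors, n, p):
    """P(X <= errors) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(errors + 1))

def pessimistic_error(errors, n, cf):
    """Upper confidence limit on the error rate of a leaf (C4.5-style, assumed).

    Solves P(X <= errors | n, p) = cf for p by bisection.
    `cf` is a fraction, so the setting cf = 20 corresponds to 0.20.
    """
    lo, hi = errors / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > cf:
            lo = mid  # the limit lies above mid
        else:
            hi = mid  # the limit lies at or below mid
    return (lo + hi) / 2

if __name__ == "__main__":
    # Hypothetical leaf: 20 training cases, 2 of them misclassified.
    for cf in (0.05, 0.20, 0.50):
        print(cf, round(pessimistic_error(2, 20, cf), 3))
```

Under this assumption, lowering cf raises the estimated error of every leaf, so replacing a subtree by a single leaf looks comparatively attractive more often and the pruned tree becomes smaller.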

Figure 4.1. Comparison of NP value and total accuracy under different parameters. Along each curve, the weight increases gradually from right to left.

Decision trees were built with the C5.0 system using the same features as in [4]. The gray line in Figure 4.2 shows the performance obtained in this way, compared with the result of the original paper, which is represented by the hollow triangle (NP value 61.1%, total accuracy 79.1%). Gray points lying to the upper right of the hollow triangle are better than the referenced paper on both measures. Points to the upper left have a better NP value than the referenced paper but a worse total accuracy, and points to the lower right show the opposite: a better total accuracy but a worse NP value.
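The reading of the plot regions can be written out as a small helper. This is only an illustrative sketch: it assumes, as the description implies, that total accuracy runs along the horizontal axis and NP value along the vertical axis, and the function name and the example point are hypothetical rather than results from the experiments.

```python
# Reference result of [4] as quoted in the text, in percent.
REF_NP = 61.1
REF_ACC = 79.1

def region(np_value, total_accuracy):
    """Locate a result relative to the reference point (the hollow triangle)."""
    better_np = np_value > REF_NP
    better_acc = total_accuracy > REF_ACC
    if better_np and better_acc:
        return "upper right: better NP value and better total accuracy"
    if better_np:
        return "upper left: better NP value, worse total accuracy"
    if better_acc:
        return "lower right: worse NP value, better total accuracy"
    return "lower left: no improvement on either measure"

if __name__ == "__main__":
    # Hypothetical point, used only to show the call.
    print(region(np_value=65.0, total_accuracy=75.0))
```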

Figure 4.2. Comparison of the results of Ahmad et al. (2004) with our proposed method. The gray line uses the original features, as suggested by Ahmad et al. (2004), and the weight is shown beside each point. The hollow triangle shows the result of Ahmad et al. (2004). The dotted lines are drawn only for ease of explanation and are unrelated to the experimental results.

The added weights affect the distribution of the results: the larger the weight, the better the NP value. The number beside each point in Figure 4.2 is its weight parameter. With lower weights the trees are pruned too heavily, because the binding data are treated as noise. Larger weights therefore give the binding data, which are the minority of the total, a chance of being retained rather than eliminated. Total accuracy is sacrificed, however, because classifying one binding datum correctly may cause more mistakes elsewhere, i.e. more nonbinding data are treated as noise. By the same reasoning, with smaller weights more nonbinding data are classified correctly but more binding data are misclassified, even though a better total accuracy is observed. According to the NP function, a small relative gain in correctly classified nonbinding data combined with a large relative loss in correctly classified binding data decreases the NP value. Consequently, choosing intermediate weights yields both an appropriate NP value and an appropriate total accuracy. The results with intermediate weights lie in the upper right of Figure 4.2.
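The trade-off described above can be seen on a single leaf. The sketch below assumes that the weight simply multiplies the count of each binding case when the leaf label is decided; how the weights actually enter C5.0 is not specified in this section, and the leaf counts and function name are hypothetical.

```python
def leaf_outcome(n_binding, n_nonbinding, weight):
    """Label a leaf by weighted majority and count the unweighted errors.

    Assumption: every binding case counts `weight` times, every
    nonbinding case counts once.
    """
    if n_binding * weight > n_nonbinding:
        return "binding", n_nonbinding  # all nonbinding cases are misclassified
    return "nonbinding", n_binding      # all binding cases are misclassified

if __name__ == "__main__":
    # Hypothetical leaf with 3 binding and 10 nonbinding residues.
    for weight in (1, 2, 4):
        label, errors = leaf_outcome(3, 10, weight)
        print(weight, label, errors)
```

With weight 1 or 2 the leaf is labeled nonbinding and the 3 binding cases are lost; with weight 4 the leaf is labeled binding, so the 3 binding cases are recovered at the price of 10 misclassified nonbinding cases. This is the sense in which correct binding predictions can cost total accuracy.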

4.1.2 Using the proposed feature sets in PDNA-62

The black line in Figure 4.3 shows the performance obtained when the proposed 11-dimensional feature set is given to C5.0. This feature set yields more useful results than the original one: both the NP value and the total accuracy reach higher levels along the black curve in Figure 4.3. The trend can be analyzed with the help of Table 4.1.

As larger weights are added, the sensitivity (the proportion of binding data predicted correctly) rises, while the specificity (the proportion of nonbinding data predicted correctly) falls. Because the nonbinding data make up the larger part of the whole, the total accuracy decreases while the NP value increases.
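The opposite movement of total accuracy and NP value follows from the arithmetic of the measures. The sketch below assumes that the NP value is the mean of sensitivity and specificity, as in the net prediction measure of Ahmad et al. (2004); this section does not restate the definition. The confusion counts are hypothetical and chosen only to mimic an imbalanced data set, they are not results from Table 4.1.

```python
def metrics(tp, fn, tn, fp):
    """Compute the four measures from confusion counts.

    tp/fn: binding residues predicted correctly / incorrectly.
    tn/fp: nonbinding residues predicted correctly / incorrectly.
    Assumption: NP value = (sensitivity + specificity) / 2.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    total_accuracy = (tp + tn) / (tp + fn + tn + fp)
    np_value = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "total_accuracy": total_accuracy,
        "np_value": np_value,
    }

if __name__ == "__main__":
    # Hypothetical data set: 100 binding and 900 nonbinding residues.
    low_weight = metrics(tp=30, fn=70, tn=870, fp=30)
    high_weight = metrics(tp=70, fn=30, tn=720, fp=180)
    print(low_weight)
    print(high_weight)
```

In this made-up case the larger weight gains 40 binding residues but loses 150 nonbinding ones, so total accuracy drops, while the NP value rises because sensitivity and specificity are each normalized by their own class size.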

Figure 4.3. Comparison of the results under different parameters. The black line shows the result obtained with the proposed feature set of 11 features; the weight is shown beside each point. The rest of the presentation is the same as in Figure 4.2.

Table 4.1. Performance of the different feature sets. With cf = 20, the different weights show the difference in performance between the original and the new features.

weight    the original features    the new features

Based on the performance reported above, we suggest that this classification problem be approached with intermediate weights, from 2 to 4, and an appropriate cf value, near 20. With the proposed 11 features and the DT method, these settings give suitable results with respect to total accuracy, NP value, and tree size. Figure 4.4 compares the NP values obtained at similar total accuracy under the different conditions, which makes the comparison fair and easy to read. Figure 4.4 shows that the DT method improves the classification results for the same features, and that with the C5.0 model the newly proposed features provide a serviceable way of predicting DNA-binding proteins.

Figure 4.4. Comparison of our results with previous research. The white bar shows the result of Ahmad et al. (2004). The gray and black bars are our results using the original and the proposed features, respectively, with cf = 20 and weight = 4 in the DT method.
