TFPSSM Vote. Based on the first experiment, we adopt the PCA parameter setting that performs best with TFPSSM-1NN and evaluate our method on CAFA2-Swiss and CAFA3-Swiss.
Because Gene Ontology is a hierarchical structure, there are two types of annotation: leaf only annotation and propagated annotation. As the names imply, a leaf only annotation contains only the leaf terms assigned to a protein in the Gene Ontology hierarchy, whereas a propagated annotation extends each leaf term with all of its ancestors. Voting with propagated labels therefore also takes the paths through the Gene Ontology into account. We want to investigate how voting with leaf only labels versus propagated labels influences our voting strategy.
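The difference between the two annotation types can be illustrated with a minimal sketch. It assumes that the ontology is available as a mapping from each term to its direct parents and that the propagated set includes the leaf terms themselves; the function name `propagate` and the toy term identifiers are hypothetical and serve only as illustration.

```python
def propagate(leaf_terms, parents):
    """Return the leaf terms plus all of their ancestors in the ontology graph."""
    seen = set()
    stack = list(leaf_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        # Walk upward through every direct parent of the current term.
        stack.extend(parents.get(term, ()))
    return seen


# Toy ontology with hypothetical term IDs: each term maps to its direct parents.
toy_parents = {
    "GO:c": ["GO:b"],
    "GO:b": ["GO:a"],
    "GO:d": ["GO:a"],
    "GO:a": [],
}

leaf_only = {"GO:c"}
propagated = propagate(leaf_only, toy_parents)
print(sorted(propagated))
```

In this toy case a neighbor annotated with GO:c votes for one term under leaf only voting and for three terms under propagated voting, so shared ancestors can accumulate votes from neighbors whose leaf terms differ.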
4.5.3 Experiment 3: TFPSSM CATH
In this experiment we examine the effect of combining TFPSSM with CATH FunFams. First, we select K proteins using a dynamic threshold, as in Dynamic-KNN. We then count the FunFams shared between each selected protein and the query protein. In the prediction phase, we use this intersection count as the voting weight.
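The weighting step described above can be sketched as follows. This is only an illustration under stated assumptions: the neighbors are taken as already selected by the dynamic threshold, each protein carries a set of FunFam identifiers and a set of GO terms, and the scores are normalized by the total weight (the normalization and the name `funfam_vote` are assumptions, not taken from the original method description).

```python
from collections import defaultdict


def funfam_vote(query_funfams, neighbors):
    """Score GO terms, weighting each neighbor by its shared FunFam count.

    neighbors: list of (funfam_set, go_term_set) for the selected training proteins.
    Returns None when no neighbor shares a FunFam with the query (protein skipped).
    """
    scores = defaultdict(float)
    total = 0
    for funfams, terms in neighbors:
        weight = len(query_funfams & funfams)  # FunFam intersection count
        if weight == 0:
            continue
        total += weight
        for term in terms:
            scores[term] += weight
    if total == 0:
        return None
    # Assumed normalization: divide by the total weight so scores lie in (0, 1].
    return {term: s / total for term, s in scores.items()}


# Hypothetical FunFam and GO identifiers.
query = {"ff1", "ff2"}
selected = [
    ({"ff1", "ff2"}, {"GO:a", "GO:b"}),  # shares 2 FunFams
    ({"ff2", "ff3"}, {"GO:a"}),          # shares 1 FunFam
    ({"ff9"}, {"GO:z"}),                 # shares none, contributes nothing
]
print(funfam_vote(query, selected))
```

A query whose FunFams overlap with none of the selected proteins returns `None`, which corresponds to a protein that is left unpredicted.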
4.5.4 Experiment 4: Testing
From the previous experiments, we obtain the settings that perform best on the training data from CAFA2-Swiss. We use these settings and training datasets to predict the CAFA2 benchmark dataset as the test data. Hence, we can evaluate our methods and compare them with other methods' performance on CAFA2 on the same basis.
In this chapter, we present the experimental results as well as discussions on comparative performance of our proposed methods for predicting protein function.
5.1 Experiment 1: PCA
The Fmax of TFPSSM 1NN and the two baseline models on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters is summarized in Figure 5.1. The explained variance ratio ranges from 90% to 98.5% in steps of 0.5%. As each dataset has its own properties, we do not compare Fmax across datasets. Nevertheless, the results indicate that whitening improves the Fmax of TFPSSM 1NN significantly.
Table 5.1 shows the average number of proteins over five folds in the redundant and non-redundant datasets. Clustering reduces the number of protein sequences by about 29% to 33%, and the SVD for PCA computed from the non-redundant dataset still exhibits the same representativeness.
According to the results of this experiment, TFPSSM features benefit from whitening pre-processing combined with an SVD computed from the non-redundant dataset. In addition, TFPSSM 1NN outperforms the two baseline models on both the CAFA2-Swiss and CAFA3-Swiss datasets. The best explained ratio varies across training datasets and ontologies (Table 5.2), and these values are used in the further experiments.
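The pre-processing discussed in this section can be sketched with NumPy. The sketch assumes that "whitening" means rescaling each retained principal component to unit variance and that the number of components is the smallest one whose cumulative explained variance reaches the target ratio. The function names and the synthetic data are hypothetical; the actual implementation used in the experiments may differ.

```python
import numpy as np


def fit_whitened_pca(X_fit, explained_ratio):
    """Fit PCA by SVD and keep the fewest components reaching the target ratio."""
    mean = X_fit.mean(axis=0)
    _, s, vt = np.linalg.svd(X_fit - mean, full_matrices=False)
    var = s ** 2 / (X_fit.shape[0] - 1)        # variance along each component
    cum = np.cumsum(var) / var.sum()           # cumulative explained ratio
    k = int(np.searchsorted(cum, explained_ratio)) + 1
    k = min(k, len(var))
    return mean, vt[:k], np.sqrt(var[:k])


def transform(X, mean, components, std):
    """Project onto the kept components and rescale each to unit variance."""
    return (X - mean) @ components.T / std


# Synthetic stand-in for a non-redundant feature matrix (assumption, not real data).
rng = np.random.default_rng(0)
X_nonredundant = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.1, 0.01])

mean, components, std = fit_whitened_pca(X_nonredundant, explained_ratio=0.95)
Z = transform(X_nonredundant, mean, components, std)
print(components.shape[0], Z.var(axis=0, ddof=1))
```

Fitting on the non-redundant subset and then applying `transform` to every protein mirrors the setup above, in which the SVD is computed from the clustered data only.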
Figure 5.1: Fmax of TFPSSM 1NN on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters
Table 5.1: Average protein amount of five folds in redundant dataset and non-redundant dataset
Type  Dataset      # of redundant dataset  # of non-redundant dataset  Reduced ratio by cluster
BPO   CAFA2-Swiss  32,582                  22,231                      31.77%
BPO   CAFA3-Swiss  40,650                  27,158                      33.19%
CCO   CAFA2-Swiss  32,457                  22,521                      30.61%
CCO   CAFA3-Swiss  39,462                  26,631                      32.51%
MFO   CAFA2-Swiss  20,845                  14,711                      29.43%
MFO   CAFA3-Swiss  28,267                  19,254                      31.89%
Table 5.2: Best PCA explained ratio and average dimension
Dimensionality reduction uses PCA with whitening pre-processing and an SVD computed from the non-redundant training dataset.
Type  Dataset      Explained Ratio  Dimension  Average Fmax
BPO   CAFA2-Swiss  96.0%            107        0.4021
BPO   CAFA3-Swiss  96.0%            101        0.4167
CCO   CAFA2-Swiss  95.0%            51         0.6610
CCO   CAFA3-Swiss  95.0%            46         0.6436
MFO   CAFA2-Swiss  96.5%            121        0.5796
MFO   CAFA3-Swiss  96.5%            126        0.5701
5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting
Because the interaction between the K-nearest neighbor algorithm and weighted voting is complicated, we consider multiple combinations of different KNN algorithms, voting weights, and voting with leaf only annotation or propagated annotation in this experiment. In the following experiments, propagated annotation is denoted as pro-* and leaf only annotation is denoted as leaf-* in the figures.
5.2.1 Fixed-KNN
In the Fixed-KNN experiment, we set K from 1 to 10 and vote with propagated annotation and leaf only annotation using three different weight assignment rules. The results are depicted in Figure 5.2. Setting K larger than 1 yields better performance for BPO and CCO on both the CAFA2-Swiss and CAFA3-Swiss datasets. For MFO, however, the benefit decreases once K is larger than 3. Among the three weight computation schemes, 'Inverse' is more reliable than the other two, so we employ the 'Inverse' approach in further experiments. The results also show that voting with propagated annotation is better than voting with leaf only annotation, as expected.
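A minimal sketch of Fixed-KNN with distance-based voting is given below. It assumes Euclidean distance between feature vectors and takes the 'Inverse' rule to mean a weight of one over the distance, with a small constant to avoid division by zero; the exact weight formulas used in the experiments are not restated here, so the default `weight` function and the normalization are assumptions.

```python
import math
from collections import defaultdict


def knn_vote(query, train, k, weight=lambda d: 1.0 / (d + 1e-9)):
    """Score GO terms from the k nearest training proteins.

    train: list of (feature_vector, go_term_set).
    weight: maps a distance to a vote weight; the default is an assumed
    inverse-distance rule.
    """
    ranked = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    scores = defaultdict(float)
    total = 0.0
    for vec, terms in ranked:
        w = weight(math.dist(query, vec))
        total += w
        for term in terms:
            scores[term] += w
    # Normalize by the total weight so that scores lie in (0, 1].
    return {term: s / total for term, s in scores.items()}


# Toy training set with hypothetical GO identifiers.
toy_train = [
    ((1.0, 0.0), {"GO:a", "GO:b"}),
    ((2.0, 0.0), {"GO:a"}),
    ((10.0, 0.0), {"GO:c"}),
]
print(knn_vote((0.0, 0.0), toy_train, k=2))
```

Passing a different `weight` function (for example a constant weight) changes the voting rule without touching the neighbor selection, which is how several weighting schemes can be compared for the same K.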
Figure 5.2: Fmax of Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss with different K, voting weights and voting annotation
5.2.2 Dynamic-KNN
Figure 5.3 and Table 5.3 present the Fmax and coverage of Dynamic-KNN on the CAFA2-Swiss and CAFA3-Swiss datasets with different dynamic thresholds and voting annotations under partial evaluation mode. The threshold of Dynamic-KNN is determined from the series of distances between each training protein and its nearest neighbor in the training dataset, and we consider the 1st, 2nd, or 3rd quartile of this distance series. Not surprisingly, the coverage corresponds to the quartile used as the threshold. Using the 2nd quartile as the threshold not only achieves better performance on the three ontologies but also covers roughly half of the test data. This experiment also shows that voting with propagated annotation is an effective strategy for our problem.
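The threshold construction can be sketched as follows, assuming Euclidean distance and a brute-force nearest-neighbor search. The function names are hypothetical, the quartile interpolation is whatever `statistics.quantiles` uses by default, and treating a neighbor at exactly the threshold distance as included is an assumption.

```python
import math
import statistics


def nearest_neighbor_distances(vectors):
    """Distance from each training vector to its closest other training vector."""
    return [
        min(math.dist(v, u) for j, u in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]


def quartile_thresholds(vectors):
    """Q1, Q2, Q3 of the nearest-neighbor distance series."""
    return statistics.quantiles(nearest_neighbor_distances(vectors), n=4)


def neighbors_within(query, vectors, threshold):
    """Indices of training vectors no farther than the threshold from the query."""
    return [i for i, v in enumerate(vectors) if math.dist(query, v) <= threshold]


# Toy one-dimensional training set (hypothetical values).
toy_vectors = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]
q1, q2, q3 = quartile_thresholds(toy_vectors)

print(neighbors_within((2.0,), toy_vectors, q2))   # query with neighbors
print(neighbors_within((20.0,), toy_vectors, q2))  # query with none, left unpredicted
```

A query with an empty neighbor list receives no prediction, which is why a tighter quartile lowers the coverage.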
From the previous two experiments we can conclude that voting with propagated annotation is better than voting with leaf only annotation. In the following Hybrid-KNN experiment, we therefore consider only voting with propagated annotation to reduce the complexity.
Figure 5.3: Fmax of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different dynamic threshold, voting weights and voting annotation under partial evaluation mode
Table 5.3: Average coverage of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different threshold
Type Dataset Quartile # of testing data # of predict coverage
BPO
Figure 5.4 shows the Fmax of Fixed-KNN voting with propagated annotation under full evaluation mode, together with the average Fmax over the three weight assignment methods for Dynamic-KNN with propagated annotation under partial evaluation mode. Dynamic-KNN achieves a higher Fmax than Fixed-KNN if we ignore the coverage of the testing data. In other words, Dynamic-KNN splits the query proteins into an 'easy to predict' part and a 'hard to predict' part, according to whether or not they have neighbor proteins within the dynamic threshold, and predicts only the 'easy to predict' part.
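To make the distinction between the two evaluation modes concrete, the following sketch computes a protein-centric Fmax and the coverage. It assumes the usual CAFA-style definition, in which precision is averaged over proteins with at least one prediction above the threshold and recall is averaged over all benchmark proteins (full mode) or only over proteins that received predictions (partial mode). The function names, the threshold grid, and the toy data are assumptions for illustration, not the evaluation code used in the experiments.

```python
def fmax(predictions, truth, partial=False, steps=100):
    """Protein-centric Fmax under an assumed CAFA-style definition.

    predictions: {protein: {term: score in (0, 1]}}; proteins may be missing.
    truth: {protein: non-empty set of true terms} for every benchmark protein.
    partial: if True, only proteins that received predictions are evaluated.
    """
    if partial:
        targets = [p for p in truth if predictions.get(p)]
    else:
        targets = list(truth)
    best = 0.0
    for step in range(1, steps + 1):
        t = step / steps
        precisions, recall_sum = [], 0.0
        for p in targets:
            predicted = {term for term, s in predictions.get(p, {}).items() if s >= t}
            hits = len(predicted & truth[p])
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth[p])
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(targets)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


def coverage(predictions, truth):
    """Fraction of benchmark proteins that received at least one prediction."""
    return sum(1 for p in truth if predictions.get(p)) / len(truth)


# Toy benchmark with hypothetical identifiers: p2 receives no prediction.
toy_truth = {"p1": {"GO:a", "GO:b"}, "p2": {"GO:c"}}
toy_pred = {"p1": {"GO:a": 0.9, "GO:b": 0.6, "GO:x": 0.3}}

print(fmax(toy_pred, toy_truth), fmax(toy_pred, toy_truth, partial=True))
print(coverage(toy_pred, toy_truth))
```

In the toy case the unpredicted protein lowers recall in full mode but is ignored in partial mode, so the partial-mode Fmax is higher at a coverage of one half. This is the same effect that makes a partial-mode Dynamic-KNN score not directly comparable to a full-mode Fixed-KNN score.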
Figure 5.4 also indicates that in some cases Dynamic-KNN with the 1st or 3rd quartile as threshold performs even worse than Fixed-KNN. We therefore inspect only the combination of fixed K from 1 to 10 with the 2nd quartile as the dynamic threshold, using the three different voting weight assignment methods.
Figure 5.4: Fmax of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss
5.2.3 Hybrid-KNN
In this experiment, we study the effect of combining Fixed-KNN and Dynamic-KNN. Based on the experimental results of Fixed-KNN and Dynamic-KNN, we examine the combination of fixed K from 1 to 10 with the 2nd quartile as the dynamic threshold, using the Inverse voting weight and voting with propagated annotation.
The Fmax of Hybrid-KNN and Fixed-KNN with the Inverse voting weight is presented in Figure 5.5. It turns out that the gain in Fmax from combining Fixed-KNN and Dynamic-KNN is not obvious. Moreover, the optimal K for Hybrid-KNN does not correspond exactly to that in the Fixed-KNN experiments. For example, on BPO in CAFA2-Swiss the optimal K is 7 for Fixed-KNN but 4 for Hybrid-KNN. This phenomenon can be explained by the easy part and hard part of the data discussed previously.
Figure 5.6 shows the Fmax of Dynamic-KNN on the easy part and the Fmax of Fixed-KNN on the easy part and the hard part separately. Figure 5.6 reveals that the optimal K for Fixed-KNN differs between the easy part and the hard part. The Fmax of Dynamic-KNN with the Q2 threshold is almost equal to that of the optimal Fixed-KNN in BPO and MFO, and somewhat worse than the optimal Fixed-KNN in CCO. Accordingly, Hybrid-KNN predicts the easy part with Dynamic-KNN and the hard part with Fixed-KNN.
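The neighbor selection of such a hybrid scheme can be sketched as below: a query that has training proteins within the dynamic threshold uses them all, and any other query falls back to its K nearest neighbors. The sketch assumes Euclidean distance; the function name and the returned mode label are hypothetical and only illustrate the routing between the easy part and the hard part.

```python
import math


def hybrid_neighbors(query, vectors, threshold, k):
    """Select neighbors by threshold if any qualify, otherwise take the k nearest.

    Returns a (mode, indices) pair, where mode is "dynamic" for the easy part
    and "fixed" for the hard part.
    """
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    within = [i for i in ranked if math.dist(query, vectors[i]) <= threshold]
    if within:
        return "dynamic", within
    return "fixed", ranked[:k]


# Toy one-dimensional training set (hypothetical values).
toy_points = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]

print(hybrid_neighbors((0.5,), toy_points, threshold=1.0, k=2))   # easy part
print(hybrid_neighbors((20.0,), toy_points, threshold=1.0, k=2))  # hard part
```

Because the fixed K is then applied only to queries in the hard part, its best value need not match the best K found when Fixed-KNN handles all queries.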
Figure 5.5: Fmax of Hybrid-KNN and Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss
Figure 5.6: Fmax of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss under partial evaluation mode
Easy to predict part is denoted as Easy-*, and hard to predict part is denoted as Hard-*.
5.3 Experiment 3: TFPSSM-CATH
Figure 5.7 and Table 5.4 show the Fmax and coverage of TFPSSM CATH on the CAFA2-Swiss and CAFA3-Swiss datasets under partial evaluation mode. TFPSSM CATH skips a query protein when there is not enough protein family information, for example when no FunFam falls under the E-value threshold for the query protein, or when the FunFam intersection between the query protein and the training proteins is empty. Because TFPSSM CATH uses the same threshold as Dynamic-KNN, Figure 5.7 also compares the Fmax of the two methods on the proteins that TFPSSM CATH can predict. For these proteins in the easy part, TFPSSM CATH exhibits better performance than Dynamic-KNN. However, as observed in Figure 5.8, the combination of TFPSSM CATH and Fixed-KNN does not yield better performance than Hybrid-KNN. Because TFPSSM Vote bases its voting weight on Euclidean distance whereas TFPSSM CATH bases it on the intersection count, this difference in voting weight design makes the combination worse.
Figure 5.7: Fmax of TFPSSM CATH and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss under partial evaluation mode
Table 5.4: Average coverage of TFPSSM CATH on CAFA2-Swiss and CAFA3-Swiss with different threshold
Type Dataset Quartile # of testing data # of predict coverage
BPO
Figure 5.8: Fmax of Hybrid-KNN and combination of TFPSSM CATH and Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss under full evaluation mode