Experiment 3: TFPSSM-CATH - Large-Scale Protein Function Prediction Based on Data Science Methods

TFPSSM Vote. Based on the first experiment, we adopt the PCA parameter setting that performs best with TFPSSM-1NN and evaluate our method on CAFA2-Swiss and CAFA3-Swiss.

Because Gene Ontology is hierarchical, there are two types of annotation: leaf-only annotation and propagated annotation. As the names imply, a leaf-only annotation consists of the leaf nodes of the Gene Ontology tree, whereas a propagated annotation adds all ancestors of the leaf annotation. Voting with propagated labels therefore also takes the paths through the Gene Ontology into account. We want to investigate how voting with leaf-only labels or with propagated labels influences our voting strategy.
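The difference between the two labelling schemes can be illustrated with a minimal sketch. The toy hierarchy, the term identifiers, and the function names `propagate` and `vote` are hypothetical placeholders rather than real GO terms or the thesis implementation, and the sketch assumes that a propagated label set keeps the leaf term itself alongside its ancestors.

```python
# Toy child -> parents map. The term IDs are hypothetical, not real GO terms.
PARENTS = {
    "GO:C": ["GO:B"],
    "GO:B": ["GO:A"],
    "GO:D": ["GO:A"],
    "GO:A": [],  # root of the toy hierarchy
}


def propagate(leaf_terms, parents=PARENTS):
    """Return the leaf terms together with all of their ancestors."""
    seen = set()
    stack = list(leaf_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


def vote(neighbor_annotations, propagated):
    """Give each neighbor one vote per term, with or without propagation."""
    scores = {}
    for terms in neighbor_annotations:
        labels = propagate(terms) if propagated else set(terms)
        for term in labels:
            scores[term] = scores.get(term, 0) + 1
    return scores


# Two neighbors annotated with different leaves that share an ancestor.
neighbors = [["GO:C"], ["GO:D"]]
leaf_scores = vote(neighbors, propagated=False)
pro_scores = vote(neighbors, propagated=True)
```

With leaf-only labels the two neighbors never agree on a term, whereas with propagated labels both of them support the shared ancestor `GO:A`.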

4.5.3 Experiment 3: TFPSSM CATH

In this experiment we examine the effect of combining TFPSSM with CATH FunFams. First, we select K proteins using a dynamic threshold, as in Dynamic-KNN; we then count the FunFams shared between each selected protein and the query protein. In the prediction phase, this intersection count serves as the voting weight.
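A minimal sketch of intersection-count voting follows. The protein and FunFam identifiers are made up, the function names are hypothetical, and normalizing the weights so that they sum to 1 is an assumption of this sketch, not something the text specifies.

```python
def funfam_weights(query_funfams, neighbors):
    """neighbors: list of (protein_id, set_of_funfams) already selected by the
    dynamic threshold. Returns {protein_id: shared FunFam count}, keeping only
    proteins that share at least one FunFam with the query."""
    query = set(query_funfams)
    weights = {}
    for protein_id, funfams in neighbors:
        overlap = len(query & set(funfams))
        if overlap > 0:
            weights[protein_id] = overlap
    return weights


def weighted_vote(weights, annotations):
    """Score each term by the normalized weights of the proteins carrying it.
    Returns None when no neighbor shares a FunFam (the query is skipped)."""
    total = sum(weights.values())
    if total == 0:
        return None
    scores = {}
    for protein_id, weight in weights.items():
        for term in annotations[protein_id]:
            scores[term] = scores.get(term, 0.0) + weight / total
    return scores


# Hypothetical data: P1 shares two FunFams with the query, P2 one, P3 none.
query = {"FF1", "FF2", "FF3"}
selected = [("P1", {"FF1", "FF2"}), ("P2", {"FF3"}), ("P3", {"FF9"})]
annotations = {"P1": {"GO:X"}, "P2": {"GO:X", "GO:Y"}, "P3": {"GO:Z"}}

weights = funfam_weights(query, selected)
scores = weighted_vote(weights, annotations)
```

Here `P3` contributes nothing because it shares no FunFam with the query, and `P1` counts twice as much as `P2`.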

4.5.4 Experiment 4: Testing

Building on the previous experiments, we obtain the settings that perform best on the CAFA2-Swiss training data. We then use these settings and training datasets to predict the CAFA2 benchmark dataset as test data, so that we can evaluate our methods and compare them with other methods' performance on CAFA2 on the same basis.

In this chapter, we present the experimental results as well as discussions on comparative performance of our proposed methods for predicting protein function.

5.1 Experiment 1: PCA

Figure 5.1 summarizes the F_max of TFPSSM 1NN and two baseline models on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters. The explained variance ratio ranges from 90% to 98.5% with a step size of 0.5%. Because each dataset has its own properties, we do not compare F_max across datasets. Nevertheless, the results indicate that whitening significantly improves F_max for TFPSSM 1NN.

Table 5.1 shows the average number of proteins over five folds in the redundant and non-redundant datasets. Clustering reduces the number of protein sequences by roughly 30% (between 29.43% and 33.19%), and the SVD for PCA computed from the non-redundant dataset remains equally representative.

According to the results of this experiment, TFPSSM features benefit from whitening preprocessing with the SVD computed from the non-redundant dataset. In addition, TFPSSM 1NN outperforms the two baseline models on both the CAFA2-Swiss and CAFA3-Swiss datasets. The best explained ratio varies across training datasets and ontologies; the values in Table 5.2 are used in the further experiments.
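The preprocessing described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the thesis code: the random data stands in for TFPSSM features, the "non-redundant" subset is simply every second row, and the function names are hypothetical. scikit-learn's `PCA` class offers comparable `n_components` and `whiten` options.

```python
import numpy as np


def fit_whitening_pca(X_fit, explained_ratio):
    """Fit PCA by SVD on X_fit and keep the fewest components whose
    cumulative explained variance ratio reaches explained_ratio."""
    mean = X_fit.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_fit - mean, full_matrices=False)
    var = S ** 2 / (X_fit.shape[0] - 1)      # variance along each component
    cum = np.cumsum(var) / var.sum()         # cumulative explained ratio
    k = min(int(np.searchsorted(cum, explained_ratio)) + 1, len(var))
    return mean, Vt[:k], np.sqrt(var[:k])


def transform(X, mean, components, std):
    """Project onto the kept components and whiten to unit variance.
    Assumes every kept component has non-zero variance."""
    return (X - mean) @ components.T / std


# Stand-in data: 200 "proteins" with 10 features of different scales.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 10)) * np.arange(1, 11)
X_nr = X_all[::2]                            # placeholder non-redundant subset

mean, components, std = fit_whitening_pca(X_nr, explained_ratio=0.96)
Z_nr = transform(X_nr, mean, components, std)    # data used for the fit
Z_all = transform(X_all, mean, components, std)  # all proteins, same projection
```

Fitting on the smaller subset and applying the projection to every protein mirrors the idea of computing the SVD from the non-redundant dataset only.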

Figure 5.1: F_max of TFPSSM 1NN on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters

Table 5.1: Average protein amount of five folds in redundant dataset and non-redundant dataset

Type  Dataset      # of redundant dataset  # of non-redundant dataset  Reduced ratio by cluster
BPO   CAFA2-Swiss  32,582                  22,231                      31.77%
BPO   CAFA3-Swiss  40,650                  27,158                      33.19%
CCO   CAFA2-Swiss  32,457                  22,521                      30.61%
CCO   CAFA3-Swiss  39,462                  26,631                      32.51%
MFO   CAFA2-Swiss  20,845                  14,711                      29.43%
MFO   CAFA3-Swiss  28,267                  19,254                      31.89%
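As a quick check, the reduced ratio in Table 5.1 appears to be the share of sequences removed by clustering, that is (redundant - non-redundant) / redundant. The short sketch below recomputes it from the counts in the table; the formula is inferred from the numbers and the variable names are ours.

```python
# Counts copied from Table 5.1: (redundant, non-redundant) per ontology/dataset.
TABLE_5_1 = {
    ("BPO", "CAFA2-Swiss"): (32582, 22231),
    ("BPO", "CAFA3-Swiss"): (40650, 27158),
    ("CCO", "CAFA2-Swiss"): (32457, 22521),
    ("CCO", "CAFA3-Swiss"): (39462, 26631),
    ("MFO", "CAFA2-Swiss"): (20845, 14711),
    ("MFO", "CAFA3-Swiss"): (28267, 19254),
}


def reduced_ratio(redundant, non_redundant):
    """Percentage of sequences removed by clustering."""
    return 100.0 * (redundant - non_redundant) / redundant


ratios = {key: round(reduced_ratio(*counts), 2) for key, counts in TABLE_5_1.items()}
for (ontology, dataset), ratio in ratios.items():
    print(f"{ontology} {dataset}: {ratio:.2f}%")
```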

Table 5.2: Best PCA explained ratio and average dimension

Reduced dimension is carried out under the PCA with whiten preprocess and SVD from non-redundant training dataset.

Type  Dataset      Explained Ratio  Dimension  Average F_max
BPO   CAFA2-Swiss  96.0%            107        0.4021
BPO   CAFA3-Swiss  96.0%            101        0.4167
CCO   CAFA2-Swiss  95.0%            51         0.6610
CCO   CAFA3-Swiss  95.0%            46         0.6436
MFO   CAFA2-Swiss  96.5%            121        0.5796
MFO   CAFA3-Swiss  96.5%            126        0.5701

National Chengchi University (國立政治大學)

5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting

Because the interaction between the K-nearest neighbor algorithm and weighted voting is complicated, this experiment considers multiple combinations of KNN algorithms, voting weights, and voting with leaf-only or propagated annotation. In the figures for the following experiments, propagated annotation is denoted as pro-* and leaf-only annotation as leaf-*.

5.2.1 Fixed-KNN

In the Fixed-KNN experiment, we set K from 1 to 10 and vote with propagated annotation and leaf-only annotation using three different weight assignment rules. The results are depicted in Figure 5.2. We observe that setting K larger than 1 yields better performance in BPO and CCO on the CAFA2-Swiss and CAFA3-Swiss datasets. In MFO, however, the benefit decreases once K exceeds 3. Among the three weight computation schemes, 'Inverse' is more reliable than the other two, so we employ the 'Inverse' approach in further experiments. The results also reveal that voting with propagated annotation is better than voting with leaf-only annotation, as expected.
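A minimal sketch of Fixed-KNN with distance-based weights is given below. It assumes that 'Inverse' means a weight proportional to 1/distance and that the weights are normalized to sum to 1; the exact formula is not restated in this section, and the feature vectors, term identifiers, function names, and the small `eps` guard against division by zero are all hypothetical.

```python
import math


def fixed_knn(query, train, k):
    """train: list of (feature_vector, set_of_terms).
    Returns the k nearest training proteins as (distance, terms) pairs."""
    ranked = sorted(
        ((math.dist(query, vector), terms) for vector, terms in train),
        key=lambda pair: pair[0],
    )
    return ranked[:k]


def inverse_vote(neighbors, eps=1e-9):
    """Score each term by the normalized inverse distance of its supporters."""
    weights = [1.0 / (distance + eps) for distance, _ in neighbors]
    total = sum(weights)
    scores = {}
    for weight, (_, terms) in zip(weights, neighbors):
        for term in terms:
            scores[term] = scores.get(term, 0.0) + weight / total
    return scores


# Toy training set: distances from the query are 1, 2 and about 7.07.
train = [
    ((1.0, 0.0), {"GO:A"}),
    ((0.0, 2.0), {"GO:B"}),
    ((5.0, 5.0), {"GO:A"}),
]
neighbors = fixed_knn((0.0, 0.0), train, k=2)
scores = inverse_vote(neighbors)
```

With K = 2 the neighbor at distance 1 receives twice the weight of the neighbor at distance 2, so `GO:A` scores about 2/3 and `GO:B` about 1/3.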


Figure 5.2: F_max of Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss with different K, voting weights and voting annotation


5.2.2 Dynamic-KNN

Figure 5.3 and Table 5.3 present the F_max and coverage of Dynamic-KNN on the CAFA2-Swiss and CAFA3-Swiss datasets with different dynamic thresholds and voting annotations under partial evaluation mode. The Dynamic-KNN threshold is determined from the series of distances between each training protein and its nearest neighbor in the training dataset; we consider the 1st, 2nd, and 3rd quartiles of this distance series. Not surprisingly, the coverage tracks the quartile chosen as the threshold. Using the 2nd quartile as the threshold not only achieves better performance on the three ontologies but also covers about half of the test data. This experiment also shows that voting with propagated annotation is an effective strategy for our problem.
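The threshold selection and the resulting coverage can be sketched as follows. The one-dimensional vectors are toy data, the function names are hypothetical, and the choice of the "inclusive" quantile method is an assumption; the text does not say how the quartiles were computed.

```python
import math
import statistics


def nn_distance_series(train_vectors):
    """Distance from each training protein to its nearest other training protein."""
    return [
        min(math.dist(v, u) for j, u in enumerate(train_vectors) if j != i)
        for i, v in enumerate(train_vectors)
    ]


def quartile_threshold(series, quartile):
    """quartile is 1, 2 or 3; returns that quartile of the distance series."""
    return statistics.quantiles(series, n=4, method="inclusive")[quartile - 1]


def dynamic_neighbors(query, train_vectors, threshold):
    """Indices of all training proteins within the threshold of the query."""
    return [i for i, v in enumerate(train_vectors) if math.dist(query, v) <= threshold]


def coverage(queries, train_vectors, threshold):
    """Fraction of queries that have at least one neighbor within the threshold."""
    predicted = sum(1 for q in queries if dynamic_neighbors(q, train_vectors, threshold))
    return predicted / len(queries)


train_vectors = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]
series = nn_distance_series(train_vectors)      # nearest-neighbor distances
q2 = quartile_threshold(series, quartile=2)     # median of the series
queries = [(0.5,), (20.0,)]
cov = coverage(queries, train_vectors, q2)
```

A query far from every training protein, such as `(20.0,)` here, receives no neighbors and is left unpredicted, which is why coverage falls below 100% in partial evaluation mode.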

From the previous two experiments we conclude that voting with propagated annotation is better than voting with leaf-only annotation. In the following Hybrid-KNN experiment, we therefore consider only voting with propagated annotation to reduce the complexity.

Figure 5.3: F_max of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different dynamic threshold, voting weights and voting annotation under partial evaluation mode

Table 5.3: Average coverage of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different thresholds

Type  Dataset  Quartile  # of testing data  # of predictions  Coverage

BPO

Figure 5.4 shows the F_max of Fixed-KNN voting with propagated annotation under full evaluation mode, together with the average F_max of the three weight assignment methods with propagated annotation for Dynamic-KNN under partial evaluation mode. Dynamic-KNN achieves a better F_max than Fixed-KNN if we ignore the coverage of the testing data. In other words, Dynamic-KNN splits the query proteins into an 'easy to predict' part and a 'hard to predict' part, according to whether they have neighbor proteins within the dynamic threshold, and predicts only the 'easy to predict' part.
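The gap between the two evaluation modes can be made concrete with a small sketch of the protein-centric F_max used in CAFA-style evaluation. Precision is averaged over proteins with at least one prediction at a given threshold; recall is averaged over all benchmark proteins in full mode and only over proteins that received predictions in partial mode. The function name, the threshold grid, and the toy data are assumptions of this sketch.

```python
def f_max(predictions, truth, mode="full", thresholds=None):
    """predictions: {protein: {term: score}}, truth: {protein: set_of_terms}.
    Every truth set is assumed to be non-empty."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    if mode == "full":
        targets = list(truth)
    else:  # partial: only proteins that received at least one prediction
        targets = [p for p in truth if predictions.get(p)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein in targets:
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & truth[protein])
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth[protein]))
        if not precisions or not recalls:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy benchmark: P1 is predicted, P2 has no neighbor and is skipped.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
predictions = {"P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:D": 0.2}}

partial_score = f_max(predictions, truth, mode="partial")
full_score = f_max(predictions, truth, mode="full")
```

In this toy case the partial-mode score is perfect because the skipped protein is ignored, while the full-mode score drops to about 0.67 because the unpredicted protein contributes zero recall.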

Figure 5.4 also indicates that, in some cases, Dynamic-KNN with the 1st or 3rd quartile as threshold performs even worse than Fixed-KNN. We therefore inspect only the combination of fixed K from 1 to 10 with the 2nd quartile as the dynamic threshold, using three different voting weight assignment methods.


Figure 5.4: F_max of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss

In this experiment, we study the effects of combining Fixed-KNN and Dynamic-KNN. Based on the experimental results of Fixed-KNN and Dynamic-KNN, we examine the combination of fixed K from 1 to 10 and the 2nd quartile as the dynamic threshold, with Inverse voting weight and voting with propagated annotation.

The F_max of Hybrid-KNN and Fixed-KNN with Inverse voting weight is presented in Figure 5.5. It turns out that the F_max benefit from combining Fixed-KNN and Dynamic-KNN is not obvious. Moreover, the optimal K for Hybrid-KNN does not correspond exactly to that in the Fixed-KNN experiments. For example, on BPO in CAFA2-Swiss the optimal K is 7 for Fixed-KNN but 4 for Hybrid-KNN. This phenomenon can be explained by the easy part and hard part in the training data discussed previously.

Figure 5.6 shows the F_max of Dynamic-KNN on the easy part and the F_max of Fixed-KNN on the easy part and the hard part separately. It reveals that the optimal K for Fixed-KNN differs between the easy part and the hard part. The F_max of Dynamic-KNN with the Q2 threshold is almost equal to that of the optimal Fixed-KNN in BPO and MFO, and somewhat worse than the optimal Fixed-KNN in CCO. Accordingly, Hybrid-KNN predicts the easy part with Dynamic-KNN and the hard part with Fixed-KNN.
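The dispatch between the two strategies can be sketched as below: a query with at least one training protein inside the dynamic threshold is treated as 'easy' and uses all such neighbors, and any other query falls back to the K nearest proteins. The function name, the toy vectors, and the returned mode label are hypothetical.

```python
import math


def hybrid_neighbors(query, train_vectors, threshold, k):
    """Return (mode, indices): Dynamic-KNN neighbors when any training protein
    lies within the threshold, otherwise the k nearest (Fixed-KNN fallback)."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )
    within = [i for i in ranked if math.dist(query, train_vectors[i]) <= threshold]
    if within:
        return "dynamic", within
    return "fixed", ranked[:k]


train_vectors = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]
easy = hybrid_neighbors((0.5,), train_vectors, threshold=2.0, k=2)
hard = hybrid_neighbors((20.0,), train_vectors, threshold=2.0, k=2)
```

Because the fallback always returns K neighbors, every query receives a prediction, which is what allows the hybrid scheme to be scored under full evaluation mode.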


Figure 5.5: F_max of Hybrid-KNN and Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss


Figure 5.6: F_max of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss under partial evaluation mode

Easy to predict part is denoted as Easy-*, and hard to predict part is denoted as Hard-*.


5.3 Experiment 3: TFPSSM-CATH

Figure 5.7 and Table 5.4 show the F_max and coverage of TFPSSM CATH on the CAFA2-Swiss and CAFA3-Swiss datasets under partial evaluation mode. TFPSSM CATH skips a query protein when there is not enough protein family information, for example when no FunFam falls under the E-value threshold for the query protein, or when the FunFam intersection between the query protein and the training proteins is empty. Because TFPSSM CATH uses the same threshold as Dynamic-KNN, Figure 5.7 also compares the F_max of the two methods on the proteins that TFPSSM CATH can predict. For proteins in the easy part, TFPSSM CATH performs better than Dynamic-KNN. However, as Figure 5.8 shows, combining TFPSSM CATH with Fixed-KNN does not yield better performance than Hybrid-KNN. TFPSSM Vote derives its weights from Euclidean distance, whereas TFPSSM CATH derives them from the intersection count; this difference in voting weight design results in the weaker combination.

Figure 5.7: F_max of TFPSSM CATH and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss under partial evaluation mode

Table 5.4: Average coverage of TFPSSM CATH on CAFA2-Swiss and CAFA3-Swiss with different thresholds

Type  Dataset  Quartile  # of testing data  # of predictions  Coverage

BPO


Figure 5.8: F_max of Hybrid-KNN and combination of TFPSSM CATH and Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss under full evaluation mode


From the document: Large-Scale Protein Function Prediction Based on Data Science Methods - NCCU Academic Hub (pages 35-49)
