National Chengchi University
Figure 3.2: TFPSSM
pose higher-dimensional datasets into lower-dimensional ones using a set of successive orthogonal components that explain a maximum amount of the variance. In our study, the orthogonal components are computed from the training data and applied to both the training data and the testing data with the decomposition module of scikit-learn v0.19.0. The number of reduced dimensions is determined by the chosen explained variance ratio. Accordingly, we designed a series of experiments that use different explained variance ratios on different training datasets. The details are provided in Chapter 4.
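The following is a minimal sketch of this step, assuming synthetic feature matrices in place of real TFPSSM vectors. The helper name `fit_pca_on_training`, the 0.95 ratio, and the `whiten` flag are illustrative choices, not the settings reported in this thesis. It only shows that the components are fitted on the training data and then reused unchanged for the testing data.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_pca_on_training(train_X, test_X, explained_ratio=0.95, whiten=False):
    """Fit PCA on the training matrix only, then project both matrices.

    A float n_components in (0, 1) asks scikit-learn to keep the smallest
    number of components whose cumulative explained variance ratio
    exceeds that value (this requires svd_solver="full").
    """
    pca = PCA(n_components=explained_ratio, svd_solver="full", whiten=whiten)
    train_reduced = pca.fit_transform(train_X)  # components learned here
    test_reduced = pca.transform(test_X)        # same components reused
    return pca, train_reduced, test_reduced


# Synthetic stand-ins for feature vectors (assumption: 40 raw dimensions).
rng = np.random.RandomState(0)
train_X = rng.rand(200, 40)
test_X = rng.rand(50, 40)

pca, train_reduced, test_reduced = fit_pca_on_training(train_X, test_X)
print(train_reduced.shape, test_reduced.shape)
```

Fitting on the training data alone keeps information from the testing data out of the projection, which matters when the same procedure is repeated inside cross-validation.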
3.3 CATH information
We adopt the HMMer model of FunFams released on the CATH Gene3D web server to predict the FunFams of the query protein and of the proteins in the training data. In our experiment, we set the best-domain E-value threshold to $10^{-5}$. That is, when the HMM scan reports a FunFam with a best-domain E-value lower than $10^{-5}$, we consider the protein a member of that FunFam.
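The thresholding rule can be sketched as follows. The input format (a list of FunFam identifier and best-domain E-value pairs) and the identifiers themselves are hypothetical; parsing the actual HMM scan output is not shown.

```python
E_VALUE_THRESHOLD = 1e-5  # best-domain E-value threshold from the text


def assign_funfams(hits, threshold=E_VALUE_THRESHOLD):
    """Return the FunFams whose best-domain E-value is strictly below threshold.

    hits: iterable of (funfam_id, best_domain_evalue) pairs (assumed format).
    """
    return {funfam_id for funfam_id, evalue in hits if evalue < threshold}


# Hypothetical scan results for one protein.
hits = [("FF_A", 3.2e-12), ("FF_B", 1e-5), ("FF_C", 0.4)]
assigned = assign_funfams(hits)
print(assigned)
```

A protein can be assigned to several FunFams, or to none, under this rule.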
3.4 Gene Ontology prediction by K-nearest neighbor algorithm and weighted voting
We designed three methods that use the TFPSSM vector and the K-nearest neighbor (KNN) algorithm to predict protein function.
3.4.1 TFPSSM 1NN
The first method is TFPSSM with one nearest neighbor. We find the query protein's nearest neighbor in the training data in Euclidean space and predict that the query protein has the same GO terms as that neighbor. Because we want every protein to receive a prediction and want to keep the method simple, the confidence score of each prediction is set to 1.00.
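A minimal sketch of the one-nearest-neighbor rule is given below. The data layout (a list of vector and GO term set pairs) and the toy vectors and term names are assumptions for illustration.

```python
import math


def predict_1nn(query_vec, training):
    """Copy the GO terms of the nearest training protein with score 1.0.

    training: list of (vector, go_terms) pairs (assumed layout).
    """
    _, nearest_terms = min(
        training, key=lambda item: math.dist(query_vec, item[0])
    )
    return {term: 1.0 for term in nearest_terms}


# Toy two-dimensional vectors with placeholder term names.
training = [
    ((0.0, 0.0), {"term_a", "term_b"}),
    ((5.0, 5.0), {"term_c"}),
]
prediction = predict_1nn((0.5, 0.2), training)
print(prediction)
```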
3.4.2 TFPSSM Vote
The second method is TFPSSM with K nearest neighbors and a weighted vote based on Euclidean distance. We designed three branches of TFPSSM Vote to determine K and three different voting weights to predict GO terms.
3.4.2.1 Three branches of TFPSSM to determine K
We consider three K-nearest neighbors algorithms:
1. Fixed K nearest neighbors for KNN, referred to as the fixed-KNN.
2. A dynamic threshold to select the K nearest neighbors, in which the threshold is set to the 1st, 2nd, or 3rd quartile of the series of distances between each training protein and its first nearest neighbor in the training data, referred to as the dynamic-KNN.
3. A combination of fixed-KNN and dynamic-KNN: if a protein cannot be predicted by dynamic-KNN, we apply fixed-KNN instead. This is referred to as the hybrid-KNN.
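The three branches can be sketched as below. This is one reading of the description, under the assumption that dynamic-KNN keeps every training protein whose distance to the query is at most the threshold; the function names and toy vectors are hypothetical.

```python
import math
import statistics


def sorted_neighbors(query, training):
    """Return (distance, index) pairs for all training vectors, nearest first."""
    return sorted((math.dist(query, vec), i) for i, vec in enumerate(training))


def first_nn_distances(training):
    """Distance from each training vector to its nearest other training vector.

    Quadratic in the number of vectors; fine for a sketch only.
    """
    return [
        min(math.dist(vec, other) for j, other in enumerate(training) if j != i)
        for i, vec in enumerate(training)
    ]


def dynamic_threshold(training, quartile=2):
    """Pick the 1st, 2nd, or 3rd quartile of the first-neighbor distances."""
    quartiles = statistics.quantiles(first_nn_distances(training), n=4)
    return quartiles[quartile - 1]


def fixed_knn(query, training, k):
    return sorted_neighbors(query, training)[:k]


def dynamic_knn(query, training, threshold):
    # Assumption: keep every neighbor within the threshold distance.
    return [(d, i) for d, i in sorted_neighbors(query, training) if d <= threshold]


def hybrid_knn(query, training, threshold, k):
    # Fall back to fixed-KNN when no neighbor lies within the threshold.
    return dynamic_knn(query, training, threshold) or fixed_knn(query, training, k)


training = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6)]
threshold = dynamic_threshold(training, quartile=2)
near = dynamic_knn((0.2, 0), training, threshold)
far = dynamic_knn((3, 3), training, threshold)
fallback = hybrid_knn((3, 3), training, threshold, k=1)
```

In this toy example the query at (3, 3) has no neighbor within the threshold, so dynamic-KNN makes no prediction and hybrid-KNN falls back to the single nearest neighbor.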
3.4.2.2 Three voting weights to predict GO terms
In the part concerning weighted voting, we design three weights for TFPSSM Vote:
1. Inverse of the Euclidean distance between the query protein and each of its K nearest neighbors, referred to as Inverse.
2. Square root of the inverse of the Euclidean distance between the query protein and each of its K nearest neighbors, referred to as Sqrt.
3. Equal voting weights (all weights set to 1), referred to as Equal.
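The three weights can be written as small functions. The guard against a zero distance (the `eps` parameter) is an assumption added for the sketch; the text does not say how identical vectors are handled.

```python
import math


def inverse_weight(distance, eps=1e-12):
    """Inverse: 1 / d, with a small floor on d to avoid division by zero."""
    return 1.0 / max(distance, eps)


def sqrt_weight(distance, eps=1e-12):
    """Sqrt: square root of the inverse distance."""
    return math.sqrt(1.0 / max(distance, eps))


def equal_weight(distance):
    """Equal: every neighbor votes with weight 1, regardless of distance."""
    return 1.0


for d in (0.25, 1.0, 4.0):
    print(d, inverse_weight(d), sqrt_weight(d), equal_weight(d))
```

Inverse favors close neighbors most strongly, Sqrt dampens that preference, and Equal ignores distance altogether.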
3.4.3 TFPSSM CATH
The third method is TFPSSM with K nearest neighbors and a weighted vote based on the size of the CATH FunFam intersection. K is determined as in TFPSSM Vote, but a different voting weight scheme is employed. With the CATH FunFam HMMer model, we can obtain the FunFams of each protein, so we use the number of FunFams shared between the query protein and each of its K nearest neighbors as the voting weight instead of the Euclidean distance.
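As a sketch of this weight, assuming each protein's FunFams are available as a set of identifiers (the identifiers below are placeholders):

```python
def funfam_weight(query_funfams, neighbor_funfams):
    """Voting weight: number of FunFams shared by the query and a neighbor."""
    return len(set(query_funfams) & set(neighbor_funfams))


query = {"FF_1", "FF_2", "FF_3"}
neighbors = {
    "neighbor_a": {"FF_2", "FF_3", "FF_9"},
    "neighbor_b": {"FF_7"},
}
weights = {name: funfam_weight(query, ffs) for name, ffs in neighbors.items()}
print(weights)
```

Under this definition a neighbor that shares no FunFam with the query receives weight 0 and therefore does not contribute to the vote.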
3.4.4 Normalization of weighted voting
To solve the multi-label prediction problem, we use a weighted voting strategy: each protein similar to the query protein casts a weighted vote for each of its annotated GO terms as a candidate annotation of the query protein. After voting is finished, the total score of each GO term is normalized to lie between 0 and 1 by dividing it by the maximum total score, and the normalized score is taken as the confidence score of that GO annotation prediction. Figure 3.3 gives an illustration of our normalized weighted voting strategy.
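The voting and normalization steps can be sketched as follows. The input layout (a list of weight and GO term set pairs, one per neighbor) and the term names are assumptions for illustration.

```python
from collections import defaultdict


def weighted_vote(neighbors):
    """Sum neighbor weights per GO term, then divide by the largest total.

    neighbors: list of (weight, go_terms) pairs (assumed layout).
    Returns {go_term: confidence in [0, 1]}.
    """
    totals = defaultdict(float)
    for weight, terms in neighbors:
        for term in terms:
            totals[term] += weight
    top = max(totals.values(), default=0.0)
    if top <= 0.0:
        return {}  # no usable votes
    return {term: score / top for term, score in totals.items()}


neighbors = [
    (2.0, {"term_a", "term_b"}),
    (1.0, {"term_b", "term_c"}),
]
scores = weighted_vote(neighbors)
print(scores)
```

Here `term_b` collects a total of 3.0 and becomes the reference, so it gets confidence 1.0, while `term_a` and `term_c` get 2/3 and 1/3.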
Figure 3.3: Weighted Voting
3.5 System architecture
We propose a protein function prediction method based on homology extension and protein family. Figure 3.4 depicts our current system architecture. The three methods described above differ only in the final step, the post-processing stage.
Figure 3.4: System architecture of proposed framework for prediction of protein functions
To compare with other methods and to make our experiments reproducible, we follow the same evaluation measures and datasets used in CAFA. In this chapter, we discuss the datasets, the cross-validation procedure for model training, the evaluation measures, the baseline models, and the experiment design.
4.1 Data sets
In our experiments, we use data from both CAFA2 and CAFA3. At the end of CAFA2, Function-SIG released the training data, the test data, and the evaluation metrics of each evaluated method. The CAFA2 training data comes from three different databases (GO Consortium, UniProt-GOA, and Swiss-Prot), but we use only the Swiss-Prot data as our training dataset, since the annotation evidence codes from Swiss-Prot are more reliable. The CAFA2 testing data is the benchmark dataset used to evaluate each submitted method. We use the CAFA2 training data to train our model and predict the CAFA2 benchmark dataset in order to compare with other methods. For convenience, we refer to the Swiss-Prot training dataset in CAFA2 as CAFA2-Swiss and to the test dataset in CAFA2 as CAFA2-benchmark.
In addition, we use the training data from CAFA3. (CAFA3 provides only data from Swiss-Prot, which we refer to as CAFA3-Swiss.) Because CAFA3 is still in the evaluation phase, there are no ground-truth labels for its test data. Still, we can use the CAFA3 training data to verify the stability of our method using cross-validation. Table 4.1 gives a short summary of each dataset, including the number of protein sequences, the number of GO terms, and the median number of GO terms per protein in BPO, CCO, and MFO.
Because our biological knowledge is far from complete, gene function annotations are added, removed, or updated in the Gene Ontology database over time. For CAFA2-Swiss and CAFA2-benchmark, we use the same Gene Ontology database employed in CAFA2, which was released on July 15, 2013. For CAFA3-Swiss, we use the Gene Ontology database released on June 1, 2016.
Table 4.1: Dataset statistics
Dataset Type # of seq. # of GO # of median GO
4.2 Cross-validation
We use five-fold cross-validation to examine the stability of our proposed method on the training dataset. Cross-validation is a model validation technique that is widely used in prediction problems. In a real prediction problem, a model is trained on labeled data and then asked to predict unlabeled test data. Cross-validation imitates this setting during the training phase by splitting the training dataset into training data and validation data. This procedure helps detect pitfalls such as overfitting. In five-fold cross-validation, we split the training dataset into five partitions and take one fold as the validation data and the other four folds as the training data. This step is repeated for five rounds, and each round uses a different fold as validation data, as shown in Figure 4.1.
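The splitting procedure can be sketched as below. Shuffling with a fixed seed before partitioning is an assumption of this sketch; the text does not state how proteins are assigned to folds.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (training, validation) lists for each of the k rounds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


proteins = list(range(23))  # stand-ins for protein identifiers
splits = list(k_fold_splits(proteins, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

Every protein appears in the validation data of exactly one round and in the training data of the other four.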
Figure 4.1: Five-fold cross-validation
4.3 Evaluation measures
There are two major types of evaluation for protein function prediction: protein-centric and term-centric. In this research, we focus on protein-centric evaluation. The prediction for each term has a score between 0 and 1, which is regarded as a confidence score. Thus, a decision threshold $\tau$ must be applied to determine the set of predicted GO terms $P(\tau)$. Similarly, the set of experimentally determined GO terms is denoted by $T$. To determine the quality of a prediction, a similarity function must be calculated between $P(\tau)$ and $T$ for each protein in an evaluation set. For each protein $i$ and threshold $\tau$, we define its precision as
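$$pr_i(\tau) = \frac{\sum_{v \in O} I\big(v \in P_i(\tau) \wedge v \in T_i\big)}{\sum_{v \in O} I\big(v \in P_i(\tau)\big)} \tag{4.3.1}$$
and recall as
$$rc_i(\tau) = \frac{\sum_{v \in O} I\big(v \in P_i(\tau) \wedge v \in T_i\big)}{\sum_{v \in O} I\big(v \in T_i\big)} \tag{4.3.2}$$
where $I(\cdot)$ is an indicator function.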
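The per-protein precision and recall can be computed with set operations, as in the sketch below. It assumes that a term counts as predicted when its score is at least the threshold; the term names are placeholders.

```python
def precision_recall(predicted_scores, true_terms, tau):
    """Per-protein precision and recall at decision threshold tau.

    predicted_scores: {term: confidence}; true_terms: set of terms.
    Returns None for a measure whose denominator is zero.
    """
    # Assumption: a term is predicted when its score is >= tau.
    predicted = {t for t, s in predicted_scores.items() if s >= tau}
    hits = len(predicted & true_terms)
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(true_terms) if true_terms else None
    return precision, recall


predicted_scores = {"term_a": 1.0, "term_x": 0.5}
true_terms = {"term_a", "term_b"}
low = precision_recall(predicted_scores, true_terms, tau=0.3)
high = precision_recall(predicted_scores, true_terms, tau=0.9)
print(low, high)
```

Raising the threshold drops the low-confidence wrong term, so precision rises from 0.5 to 1.0 while recall stays at 0.5.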
Using a database of $N$ proteins, methods can be evaluated under full mode or partial mode. If a method does not have enough information to predict some proteins, it skips those proteins and is evaluated under partial mode with lower coverage. We determine the average precision from the individual scores over the set of $m(\tau)$ proteins for which at least one prediction was made above threshold $\tau$ as
$$pr(\tau) = \frac{1}{m(\tau)} \cdot \sum_{i=1}^{m(\tau)} pr_i(\tau) \tag{4.3.3}$$
The average recall, on the other hand, is calculated as
$$rc(\tau) = \frac{1}{n_e} \cdot \sum_{i=1}^{n_e} rc_i(\tau) \tag{4.3.4}$$
on the entire set of $n_e$ test proteins. In partial evaluation mode $n_e = m(0)$, and in full evaluation mode $n_e = N$. To provide a single-score evaluation of computational models, we use the maximum F-measure over all thresholds, defined as:
$$F_{\max} = \max_{\tau} \left\{ \frac{2 \cdot pr(\tau) \cdot rc(\tau)}{pr(\tau) + rc(\tau)} \right\} \tag{4.3.5}$$
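The averaging and the maximization can be sketched together as follows. The threshold grid (0.01 to 1.00 in steps of 0.01), the rule that a term is predicted when its score is at least the threshold, and the toy data are assumptions of this sketch; it is not the official CAFA evaluation code.

```python
def f_max(predictions, truths, full_mode=True, thresholds=None):
    """Maximum F-measure over thresholds, following (4.3.3) to (4.3.5).

    predictions: {protein: {term: score}}; truths: {protein: set of terms}.
    Every protein in truths is assumed to have at least one true term.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid
    if full_mode:
        n_e = len(truths)  # n_e = N
    else:
        n_e = sum(1 for p in truths if predictions.get(p))  # n_e = m(0)
    best = 0.0
    for tau in thresholds:
        precisions, recall_sum = [], 0.0
        for protein, true_terms in truths.items():
            scores = predictions.get(protein, {})
            predicted = {t for t, s in scores.items() if s >= tau}
            if not predicted:
                continue  # contributes no precision and zero recall
            hits = len(predicted & true_terms)
            precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions or n_e == 0:
            continue
        pr = sum(precisions) / len(precisions)  # average over m(tau)
        rc = recall_sum / n_e                   # average over n_e
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


truths = {"p1": {"a", "b"}, "p2": {"c"}, "p3": {"d"}}
predictions = {"p1": {"a": 1.0, "x": 0.5}, "p2": {"c": 0.8}}  # p3 is skipped
f_full = f_max(predictions, truths, full_mode=True)
f_partial = f_max(predictions, truths, full_mode=False)
print(f_full, f_partial)
```

In this toy example the unpredicted protein `p3` lowers recall only in full mode, so the partial-mode score is higher than the full-mode score.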
4.4 Baseline models
In our experiments, we use two baseline models, Naïve and BLAST, for comparison with our method. Both baseline models are adopted from the MATLAB evaluation code for the CAFA2 experiment [15].
4.4.1 Naïve method
The Naïve method predicts terms based on their frequency in the training data, and the normalized frequency is the score of the predicted term. As a result, the Naïve method assigns the same prediction to every query protein.
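A sketch of this baseline is shown below. It assumes that the normalized frequency is the fraction of training proteins annotated with a term; the baseline code in [15] may normalize differently.

```python
from collections import Counter


def naive_scores(training_annotations):
    """Score each term by the fraction of training proteins that carry it.

    training_annotations: list of sets of GO terms, one set per protein.
    """
    counts = Counter(term for terms in training_annotations for term in terms)
    n_proteins = len(training_annotations)
    return {term: count / n_proteins for term, count in counts.items()}


training_annotations = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
scores = naive_scores(training_annotations)
print(scores)  # the same scores are returned for every query protein
```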
4.4.2 BLAST method
The BLAST method predicts terms based on the results of a BLAST search against the training data. BLAST first returns the proteins with high local alignment identity to the query protein. The BLAST method then predicts terms from these hit proteins and converts the E-value into a score.
4.5 Experiment design
We design the following experiments to evaluate the performance of each step in the framework we propose. Experiments on the training datasets are carried out with five-fold cross-validation. All experiments are evaluated with the evaluation measures described in the previous section. In these experiments, we use non-redundant training data, which is obtained by clustering the redundant training data with the ultra-fast sequence analysis tool USEARCH [16] at the 50% identity threshold recommended by its author. The experiments are described in the following subsections.
4.5.1 Experiment 1: PCA
This experiment is designed to evaluate the effect of different parameters in the feature reduction step with PCA. To simplify the process, we consider only TFPSSM-1NN on CAFA2-Swiss and CAFA3-Swiss. There are three factors we wish to discuss: 1) the size of the reduced dimensions under different explained variance ratios, 2) the influence of computing the Singular Value Decomposition (SVD), which is used to project vectors onto a lower-dimensional space, from redundant or non-redundant training data, and 3) the benefit of whitening, a preprocessing step that scales each component to unit variance.
4.5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting
The purpose of this experiment is to examine the interaction among the K-nearest neighbors algorithm, the voting weights, and voting with leaf-only or propagated annotations for TFPSSM Vote. Based on the first experiment, we use the PCA parameter setting with the best performance on TFPSSM-1NN and evaluate the performance of our method on CAFA2-Swiss and CAFA3-Swiss.
Because Gene Ontology is a hierarchical structure, there are two types of annotation: leaf-only annotation and propagated annotation. As the names imply, a leaf-only annotation consists of the leaf nodes of the Gene Ontology hierarchy assigned to a protein, and a propagated annotation adds all the ancestors of those leaf annotations. Voting with propagated labels means that we also need to consider the paths in the Gene Ontology. Therefore, we want to investigate how voting with leaf-only labels or propagated labels influences our voting strategy.
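Propagation can be sketched as a traversal of the parent relation. The toy hierarchy and term names are hypothetical, and the sketch assumes that the propagated set keeps the leaf terms themselves alongside their ancestors.

```python
def propagate(terms, parents):
    """Return the given terms together with all of their ancestors.

    parents: {term: set of direct parent terms} (assumed representation).
    """
    result = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in result:
            continue  # already visited through another path
        result.add(term)
        stack.extend(parents.get(term, ()))
    return result


# Toy hierarchy in which "leaf" has two parents that share one root.
parents = {
    "leaf": {"mid_1", "mid_2"},
    "mid_1": {"root"},
    "mid_2": {"root"},
}
propagated = propagate({"leaf"}, parents)
print(sorted(propagated))
```

Because a term can have several parents, the visited set prevents shared ancestors such as `root` from being processed twice.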
4.5.3 Experiment 3: TFPSSM CATH
In this experiment, we want to discuss the effect of combining TFPSSM and CATH FunFams. First, we select K proteins with a dynamic threshold, as in Dynamic-KNN. Then we calculate the size of the FunFam intersection between each of those proteins and the query protein. In the prediction phase, we use the intersection size as the voting weight.
4.5.4 Experiment 4: Testing
From the previous experiments, we obtain the settings with the best performance on the CAFA2-Swiss training data. We use these settings and training datasets to predict the CAFA2 benchmark dataset as the test data. Hence, we can evaluate our methods and compare them with the other methods evaluated in CAFA2 on the same basis.
In this chapter, we present the experimental results as well as discussions on comparative performance of our proposed methods for predicting protein function.
5.1 Experiment 1: PCA
The Fmax of TFPSSM 1NN and the two baseline models on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters is summarized in Figure 5.1. The explained variance ratio ranges from 90% to 98.5%, with a step size of 0.5%. As each dataset has its own properties, we do not compare Fmax across datasets. However, the results still indicate that whitening substantially improves the Fmax of TFPSSM 1NN.
Table 5.1 shows the average number of proteins over the five folds in the redundant and non-redundant datasets. With clustering, we reduce the number of protein sequences by roughly 30% (between 29.43% and 33.19%), and the SVD for PCA computed from the non-redundant dataset still exhibits the same representativeness.
According to the results of this experiment, TFPSSM features benefit from whitening combined with an SVD computed from the non-redundant dataset. In addition, TFPSSM 1NN outperforms the two baseline models on both the CAFA2-Swiss and CAFA3-Swiss datasets. The best explained variance ratio varies across training datasets and ontologies; the values in Table 5.2 are used in the further experiments.
Figure 5.1: Fmax of TFPSSM 1NN on CAFA2-Swiss and CAFA3-Swiss with different PCA parameters
Table 5.1: Average number of proteins over five folds in the redundant and non-redundant datasets

| Type | Dataset | # of proteins (redundant) | # of proteins (non-redundant) | Reduction ratio by clustering |
|------|-------------|--------|--------|--------|
| BPO | CAFA2-Swiss | 32,582 | 22,231 | 31.77% |
| BPO | CAFA3-Swiss | 40,650 | 27,158 | 33.19% |
| CCO | CAFA2-Swiss | 32,457 | 22,521 | 30.61% |
| CCO | CAFA3-Swiss | 39,462 | 26,631 | 32.51% |
| MFO | CAFA2-Swiss | 20,845 | 14,711 | 29.43% |
| MFO | CAFA3-Swiss | 28,267 | 19,254 | 31.89% |
Table 5.2: Best PCA explained ratio and average dimension
The reduced dimension is obtained from PCA with whitening and with the SVD computed from the non-redundant training dataset.

| Type | Dataset | Explained Ratio | Dimension | Average Fmax |
|------|-------------|-------|-----|--------|
| BPO | CAFA2-Swiss | 96.0% | 107 | 0.4021 |
| BPO | CAFA3-Swiss | 96.0% | 101 | 0.4167 |
| CCO | CAFA2-Swiss | 95.0% | 51 | 0.6610 |
| CCO | CAFA3-Swiss | 95.0% | 46 | 0.6436 |
| MFO | CAFA2-Swiss | 96.5% | 121 | 0.5796 |
| MFO | CAFA3-Swiss | 96.5% | 126 | 0.5701 |
5.2 Experiment 2: K-nearest neighbors algorithm and weighted voting
Because the interaction between the K-nearest neighbor algorithm and weighted voting is complicated, we consider multiple combinations of different KNN algorithms, voting weights, and voting with leaf-only or propagated annotation in this experiment. In the following experiments, propagated annotation is denoted as pro-* and leaf-only annotation is denoted as leaf-* in the figures.
5.2.1 Fixed-KNN
In the Fixed-KNN experiment, we set K from 1 to 10 and vote with propagated annotation and leaf-only annotation using the three different weight assignment rules. The results are depicted in Figure 5.2. We observe that better performance is obtained by setting K larger than 1 in BPO and CCO on both the CAFA2-Swiss and CAFA3-Swiss datasets. In MFO, however, the benefit decreases once K is larger than 3. Among the three weight computation schemes, 'Inverse' is more reliable than the other two. As a result, we employ the 'Inverse' approach in further experiments. The results also reveal that voting with propagated annotation is better than voting with leaf-only annotation, as expected.
Figure 5.2: Fmax of Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss with different K, voting weights and voting annotation
5.2.2 Dynamic-KNN
Figure 5.3 and Table 5.3 present the Fmax and coverage of Dynamic-KNN on the CAFA2-Swiss and CAFA3-Swiss datasets with different dynamic thresholds and voting annotations under partial evaluation mode. The threshold of Dynamic-KNN is determined from the series of distances between each training protein and its nearest neighbor in the training dataset, and we consider the 1st, 2nd, or 3rd quartile of this distance series. Not surprisingly, the coverage corresponds to the quartile used as the threshold. Using the 2nd quartile as the threshold not only achieves the best performance on the three ontologies but also covers about half of the test data. In this experiment, we also find that voting with propagated annotation is an effective strategy for our problem.
From the previous two experiments, we conclude that voting with propagated annotation is better than voting with leaf-only annotation. Hence, in the following Hybrid-KNN experiment, we consider only voting with propagated annotation to reduce the complexity.
Figure 5.3: Fmax of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different dynamic threshold, voting weights and voting annotation under partial evaluation mode
Table 5.3: Average coverage of Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss with different thresholds
Type Dataset Quartile # of testing data # of predict coverage
BPO
Figure 5.4 shows the Fmax of Fixed-KNN voting with propagated annotation under full evaluation mode and the average Fmax of the three weight assignment methods with propagated annotation on Dynamic-KNN under partial evaluation mode. Dynamic-KNN achieves a better Fmax than Fixed-KNN if we ignore the coverage of the testing data. In other words, Dynamic-KNN splits the query proteins into an 'easy to predict' part and a 'hard to predict' part, according to whether or not they have neighbor proteins within the dynamic threshold, and predicts only the 'easy to predict' part.
Figure 5.4 also indicates that in some cases Dynamic-KNN with the 1st or 3rd quartile as the threshold is even worse than Fixed-KNN, so we inspect only the combination of fixed K from 1 to 10 and the 2nd quartile as the dynamic threshold with the three different voting weight assignment methods.
Figure 5.4: Fmax of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss
5.2.3 Hybrid-KNN
In this experiment, we study the effects of combining Fixed-KNN and Dynamic-KNN. Based on the experimental results of Fixed-KNN and Dynamic-KNN, we examine the combination of fixed K from 1 to 10 and the 2nd quartile as the dynamic threshold, with the Inverse voting weight and voting with propagated annotation.
The Fmax of Hybrid-KNN and Fixed-KNN with the Inverse voting weight is presented in Figure 5.5. It turns out that the Fmax gain from combining Fixed-KNN and Dynamic-KNN is not obvious. Moreover, the optimal K for Hybrid-KNN does not correspond exactly to that in the Fixed-KNN experiments. For example, the optimal K for Fixed-KNN is 7 on BPO in CAFA2-Swiss, yet the optimal K for Hybrid-KNN is 4 on BPO in CAFA2-Swiss. This phenomenon can be explained by the easy part and the hard part of the training data discussed previously.
Figure 5.6 shows the Fmax of Dynamic-KNN on the easy part and the Fmax of Fixed-KNN on the easy part and the hard part separately. Figure 5.6 reveals that the optimal K for Fixed-KNN differs between the easy part and the hard part. The Fmax of Dynamic-KNN with the Q2 threshold is almost equal to that of the optimal Fixed-KNN in BPO and MFO, and somewhat worse than that of the optimal Fixed-KNN in CCO. Accordingly, Hybrid-KNN is able to predict the easy part with Dynamic-KNN and the hard part with Fixed-KNN.
Figure 5.5: Fmax of Hybrid-KNN and Fixed-KNN on CAFA2-Swiss and CAFA3-Swiss
Figure 5.6: Fmax of Fixed-KNN and Dynamic-KNN on CAFA2-Swiss and CAFA3-Swiss under partial evaluation mode
Easy to predict part is denoted as Easy-*, and hard to predict part is denoted as Hard-*.