Prediction accuracy affected by enlargement of synonymous words

Chapter 3 Protein Secondary Structure Prediction

3.3 DISCUSSIONS

3.3.2 Prediction accuracy affected by enlargement of synonymous words

Although the PSI-BLAST parameter b is set to 500 for searches, not every query protein has that many similar proteins in the database used to generate sequence alignments. Some query proteins are so unusual that PSI-BLAST reports only a few similar proteins, or none at all. In such cases, SymPred does not have enough synonymous words to generate reliable predictions. Conversely, some query proteins have many highly similar proteins in the database, which results in duplicate synonymous words. Thus, apart from the number of sequence alignments, the number of distinct synonymous words may affect SymPred’s performance. We therefore analyze the relationship between the number of distinct synonymous words and SymPred’s prediction performance.

To study this relationship, we set different thresholds and, for each threshold, select the corresponding subset u of test protein sequences. The selection criterion is defined as follows. For each test protein t in DsspNr-25, let v denote the number of distinct synonymous words in the word set of t, and let L be the sequence length of t; then e = v/L expresses v as a multiple of L. If e is greater than or equal to the threshold, the protein t is added to u. We compare the average Q3 accuracy of the proteins in u across the different thresholds.
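The selection criterion can be sketched as follows. This is a minimal illustration, not the original implementation: the record fields (`v`, `L`, `q3`) and the toy values are hypothetical, and the average is assumed to be a per-protein mean of Q3, which the text does not state explicitly.

```python
from statistics import mean

def select_by_threshold(proteins, threshold):
    """Return the subset u: proteins whose ratio e = v / L is >= threshold."""
    return [p for p in proteins if p["v"] / p["L"] >= threshold]

def mean_q3(proteins):
    """Per-protein mean of Q3 (assumed averaging scheme); None for an empty subset."""
    return mean(p["q3"] for p in proteins) if proteins else None

# Toy records (hypothetical values, for illustration only).
proteins = [
    {"id": "a", "v": 1200, "L": 100, "q3": 80.0},   # e = 12
    {"id": "b", "v": 15000, "L": 100, "q3": 90.0},  # e = 150
    {"id": "c", "v": 0, "L": 50, "q3": 70.0},       # e = 0
]

for threshold in (0, 5, 150):
    u = select_by_threshold(proteins, threshold)
    print(threshold, len(u), mean_q3(u))
```

Raising the threshold shrinks u to the proteins with the richest sets of distinct synonymous words, so the trend in the average Q3 of u shows how accuracy depends on e.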

Table 8 shows the prediction performance of SymPred and SymPsiPred with respect to different thresholds. The results show a positive correlation between the number of distinct synonymous words and the prediction performance of both methods. For SymPred, the accuracy improves from 81.0% to 83.5% when the threshold increases from e≧0 to e≧150. Notably, SymPred predicts approximately 75% of the proteins in DsspNr-25 (those with e≧75) with 83.1% accuracy, and more than 50% of the protein sequences (those with e≧150) with 83.5% accuracy. For SymPsiPred, the accuracy increases from 83.9% to 85.5% over the same range of thresholds. These results imply that SymPred and SymPsiPred have the potential to achieve higher accuracy as the number of protein sequences in the NCBInr database increases.

Table 8 – The relationship between the number of distinct synonymous words and the prediction performance. For each test protein t of length L in DsspNr-25, let v denote the number of distinct synonymous words of t. Define e = v/L, the multiplicity of v over L. If e is greater than or equal to a threshold, the protein t is selected. The results show that there is a positive correlation between the number of distinct synonymous words and the prediction performance of SymPred and SymPsiPred.

Selection criterion           e≧0    e≧5    e≧25   e≧50   e≧75   e≧100  e≧125  e≧150
Number of selected proteins   8297   7983   7252   6660   6178   5637   5035   4378
Q3 (%) of SymPred             81.0   81.6   82.3   82.8   83.1   83.3   83.4   83.5
Q3 (%) of SymPsiPred          83.9   84.3   84.8   85.1   85.2   85.3   85.4   85.5
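The coverage figures quoted in the text ("approximately 75%" and "more than 50%") can be checked against the counts in Table 8. The short sketch below only redoes that arithmetic; the dictionary and function names are illustrative.

```python
# Counts copied from Table 8: threshold on e -> number of selected proteins.
selected = {0: 8297, 5: 7983, 25: 7252, 50: 6660,
            75: 6178, 100: 5637, 125: 5035, 150: 4378}

def coverage(threshold):
    """Percentage of all DsspNr-25 test proteins that satisfy e >= threshold."""
    return 100.0 * selected[threshold] / selected[0]

for threshold in (75, 150):
    print(f"e >= {threshold}: {coverage(threshold):.1f}% of proteins")
```

With these counts, the proteins with e≧75 make up about 74.5% of the test set and those with e≧150 about 52.8%.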

3.3.3 Essential Residues

Since the confidence level of a residue x is the ratio of the voting scores x receives to the sum of the normalization factors, it reflects the degree of sequence conservation in protein evolution. We therefore use the confidence levels to represent the degrees of importance of residues in determining the structure and function of a protein sequence.

To study the effectiveness of essential residues, we developed a general prediction method, called ProtoPred, which only uses the secondary structural information as the single feature for general proteome prediction problems, such as function prediction and enzyme/non-enzyme classification. The confidence levels are used as weights to indicate the degrees of importance of residues when finding protein templates for the prediction.

ProtoPred: A Prototype of Prediction Method

Figure 6 shows the main algorithm of ProtoPred. ProtoPred is a simple template-based method for general prediction problems. It uses a standard query-template alignment algorithm of the kind frequently used in homology modeling and threading methods [108-110].

To train ProtoPred, we used a sliding window of size w to extract the real secondary structure fragments from each of the training proteins. Each structure fragment carried the related information from its protein of origin, such as function labels or protein classes. These fragments were treated as templates for prediction.

In the test phase, we used the same sliding window to extract the predicted secondary structure fragments from the target protein. Each structure fragment (denoted as s) was searched against the template pool: we compared s with each template t in the pool and estimated their similarity as follows.

For each position x (from 1 to w), if s[x] was identical to t[x], then t received a weighted score from s, namely the confidence level of s[x]. Each s selected the best template t, the one with the highest sum of weighted scores (denoted as Sum_ws). If the best template t was labeled as class A, the target protein received a score of Sum_ws for class A. Finally, the target protein was predicted to belong to the class with the highest total score.

Figure 6 – The main algorithm of ProtoPred. (a) Template extraction (b) The prediction procedure.
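The procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the original ProtoPred code: the function names are hypothetical, each training protein is assumed to carry a single class label, ties between templates are broken by pool order, and the toy structures and confidence levels are made up.

```python
from collections import defaultdict

def extract_fragments(structure, w):
    """Slide a window of size w over a secondary-structure string."""
    return [(i, structure[i:i + w]) for i in range(len(structure) - w + 1)]

def build_templates(training, w):
    """Template extraction: (real structure, label) pairs -> (fragment, label) pool."""
    return [(frag, label)
            for structure, label in training
            for _, frag in extract_fragments(structure, w)]

def weighted_score(s, conf, t):
    """Sum the confidence levels of s at the positions where s and t are identical."""
    return sum(c for a, b, c in zip(s, t, conf) if a == b)

def predict(pred_structure, confidence, templates, w):
    """Each fragment s votes Sum_ws for the class of its best template."""
    class_scores = defaultdict(float)
    for i, s in extract_fragments(pred_structure, w):
        conf = confidence[i:i + w]
        sum_ws, label = max(
            ((weighted_score(s, conf, t), lab) for t, lab in templates),
            key=lambda pair: pair[0],  # first best template wins on ties (assumption)
        )
        class_scores[label] += sum_ws
    if not class_scores:  # target shorter than the window
        return None
    return max(class_scores, key=class_scores.get)

# Toy data (hypothetical): two training proteins and one target.
training = [("HHHHHHCC", "enzyme"), ("EEEECCEE", "non-enzyme")]
templates = build_templates(training, w=4)
target = "HHHHCC"
confidence = [0.9, 0.9, 0.8, 0.8, 0.5, 0.5]
print(predict(target, confidence, templates, w=4))
```

Because mismatched positions contribute nothing and matched positions contribute their confidence level, residues with low confidence have little influence on which template, and hence which class, is chosen.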

Knowledge of protein functions is crucial to the understanding of biological processes. Since the experimental procedures for protein function annotation are inherently low-throughput, accurate computational techniques for protein function prediction are useful tools. Automated protein function prediction methods include direct homology-based approaches and indirect subsequence- or feature-based approaches. In the indirect subsequence-based approaches, often only specific subsequences are crucial for the protein to perform its function [109]. This motivated us to use the essential residues in function prediction.

We downloaded the protein function labels from the Gene Ontology Annotation Database (goa_pdb) [111]. Since we needed a dataset whose protein sequences are non-redundant (mutual sequence identity below 25%) and of known secondary structure, we took the intersection of goa_pdb and DsspNr-25. The resulting dataset contains 2677 proteins and 1539 distinct function labels. It is worth noting that the function labels include all GO annotations for the 2677 proteins, covering biological processes, molecular functions, and cellular components. For example, the function labels of protein 1ak6 are 3779 (molecular function: actin binding) and 5622 (cellular component: intracellular).

In this application, we focus on verifying the efficacy of different sources of protein secondary structure (PSS): the real secondary structures, the secondary structures predicted by SymPred, and the secondary structures predicted by PSIPRED. Using one of these sources, ProtoPred predicts a specific function label for a target protein from among the 1539 candidates, rather than a general function. The prediction accuracy is 100% if the predicted function label belongs to the target protein; otherwise it is 0%. For example, if we predict the function of 1ak6 as 3779 (or 5622), the accuracy is 100%. The hierarchical structure of GO annotations is not exploited in our prediction method, though it could be used to improve prediction accuracy [6].
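The accuracy measure amounts to a set-membership test per protein, averaged over the dataset. The sketch below assumes predictions and annotations are held in dictionaries keyed by protein identifier; the function name, the protein "toy1", and its labels are hypothetical, while the labels of 1ak6 are those given in the text.

```python
def label_accuracy(predictions, annotations):
    """Percentage of proteins whose predicted label is among their annotated labels."""
    hits = sum(1 for pid, label in predictions.items() if label in annotations[pid])
    return 100.0 * hits / len(predictions)

annotations = {"1ak6": {3779, 5622}, "toy1": {1234}}  # "toy1" is a made-up entry
predictions = {"1ak6": 5622, "toy1": 9999}            # one hit, one miss
print(label_accuracy(predictions, annotations))
```

Under this measure, a prediction counts as correct whichever of a protein's labels it matches, so proteins with many annotations are easier to score on than proteins with a single label.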

ProtoPred extracts structure fragments using a sliding window of size w. Table 9 shows the results for several window sizes. ProtoPred achieves its highest accuracy with the secondary structure predicted by SymPred for all studied window sizes except 11, where the three sources tie at 21.0% because the window is too short to capture the structural uniqueness of different function classes. For example, with a window size of 51, the prediction accuracies of ProtoPred using the real structure, PSIPRED’s prediction, and SymPred’s prediction are 49.8%, 35.4%, and 57.6%, respectively. Notably, the Q3 accuracies of PSIPRED and SymPred on this dataset are 80.3% and 81.1%: although the two methods perform similarly at PSS prediction, their effectiveness as features is quite different. Moreover, ProtoPred performs better with SymPred’s prediction than with the real structure. A possible explanation for this discrepancy is that different structures within a protein are not equally important for its function. When the real structure is used as the feature in ProtoPred’s prediction, the structural identities of low-relevance residues dilute the influence of the major residues. This suggests that SymPred can identify the essential residues that are crucial for proteins to perform their functions.

Table 9 – The accuracy (%) of function predictions using different structure sources and different window sizes.

Window size      11     21     31     41     51     61     71
Real structure   21.0   21.1   31.5   45.5   49.8   51.8   53.0
PSIPRED          21.0   21.0   23.3   28.9   35.4   40.6   44.0
SymPred          21.0   21.5   39.4   53.8   57.6   58.3   59.1

Experimental Results on Enzyme/Non-enzyme Classification Using Essential Residues

Many protein function prediction methods focus on only one specific type of function [112-113]. Enzyme/non-enzyme classification is a special case of function prediction: we do not have to predict a functional type, only to distinguish enzymes from non-enzymes. Dobson and Doig used multiple features, such as secondary structure, amino acid propensities, and surface properties, for this binary classification. They divided the features into 52 sub-features and selected 36 optimal sub-features for the SVM models that generate the classifier. The overall accuracies are 77.16% and 80.14% with the 52 and the 36 sub-features, respectively.

We downloaded Dobson and Doig’s dataset, which contains 1076 proteins. Since SymPred’s prediction was the most effective feature among the different sources of PSS in the protein function prediction above, ProtoPred uses SymPred’s prediction as the input feature for enzyme/non-enzyme classification. ProtoPred achieves an overall accuracy of 81.8%.

In this application, we use only the secondary structural information for enzyme/non-enzyme classification and still achieve a better result (81.8% versus 80.14%). This suggests that secondary structural information with essential residue annotation may be sufficient to predict protein functions, which supports the conclusion of Przytycka et al. [11].
