
Chapter 4 Protein Subcellular Localization Prediction

4.2.4 Evaluation of the multi-localized confidence score (MLCS)

A significant number of eukaryotic proteins are known to localize to multiple subcellular organelles; it is therefore important to distinguish single-localized proteins from multi-localized ones. We used the entire ngLOC dataset to compare how well different MLCS thresholds separate the two groups. As performance measures, we used the proportion of true positives among the multi-localized proteins and the proportion of true negatives among the single-localized proteins.

A true positive represents a multi-localized protein whose MLCS is above the threshold and a true negative represents a single-localized protein whose MLCS is below the threshold.

Figure 12 plots the cumulative percentages of true positives and true negatives against the MLCS threshold: the true-negative curve increases along the MLCS axis, whereas the true-positive curve decreases. With an MLCS threshold of 40, 60.7% of multi-localized proteins are true positives and 96.5% of single-localized proteins are true negatives; that is, 60.7% of multi-localized proteins obtain an MLCS of 40 or higher, whereas only 3.5% of single-localized proteins do. With a threshold of 20, 86.3% of multi-localized proteins are true positives and 82.8% of single-localized proteins are true negatives. For comparison, the best result reported for ngLOC is 76% true positives among multi-localized proteins and 81% true negatives among single-localized proteins, obtained at an MLCS threshold of 40. At a threshold of 20, KnowPredsite exceeds both of these rates, which indicates that KnowPredsite differentiates multi-localized proteins from single-localized ones better than ngLOC does.
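The two rates defined above can be computed for any threshold with a few lines of Python. This is a minimal sketch, not the evaluation code used in this study: the function name `mlcs_rates` and the score lists are made up for illustration, and counting a score exactly equal to the threshold as "above" is an assumption based on the phrase "MLCS of 40 or higher".

```python
from typing import Sequence, Tuple


def mlcs_rates(multi_scores: Sequence[float],
               single_scores: Sequence[float],
               threshold: float) -> Tuple[float, float]:
    """Return (true-positive %, true-negative %) for one MLCS threshold."""
    # True positive: a multi-localized protein scoring at or above the threshold
    # (treating a tie as "above" is an assumption of this sketch).
    tp = sum(1 for s in multi_scores if s >= threshold)
    # True negative: a single-localized protein scoring below the threshold.
    tn = sum(1 for s in single_scores if s < threshold)
    return 100.0 * tp / len(multi_scores), 100.0 * tn / len(single_scores)


# Made-up MLCS values for illustration only; these are not ngLOC data.
multi = [100.0, 85.0, 42.0, 36.24, 15.0]
single = [7.40, 2.62, 12.0, 45.0]

for t in (20, 40):
    tp_rate, tn_rate = mlcs_rates(multi, single, t)
    print(f"threshold={t}: TP={tp_rate:.1f}%  TN={tn_rate:.1f}%")
```

Sweeping `threshold` over the whole MLCS range would give the two curves of the kind plotted in Figure 12: raising the threshold can only lower the true-positive rate and raise the true-negative rate.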

4.3 Discussions

Unlike most machine learning methods, whose model parameters are not biologically explainable, KnowPredsite produces explainable predictions through a transparent and traceable process. For a query protein, KnowPredsite can show the template sequences and the confidence scores they contribute, which helps in interpreting the prediction results. In this section, we select four sequences from the ngLOC dataset, EF1A2_RABIT, RASH_HUMAN, MCA3_MOUSE, and CFDP2_BOVIN, to demonstrate how KnowPredsite predictions can be interpreted.

Tables 13 to 15 show, for the first three proteins respectively, the prediction result and the template sequences extracted from the synonymous dictionary for that prediction. In each table, the prediction result gives the MLCS and the confidence score of each candidate localization site for the query protein.

Each table also lists the template proteins that vote for the localization sites. Only the eight template proteins that contribute most to the confidence scores of the query sequence are listed. For each template, the table gives its contribution to the confidence score of each localization site and its sequence identity to the query protein as calculated by ClustalW (denoted SI).
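As a rough illustration of how such a list of top templates could be produced, the sketch below ranks templates by their largest single-site contribution and keeps the first k. The ranking criterion, the function name `top_templates`, and the dictionary layout are assumptions made for this example; the text only states that the templates contributing most are listed. The values are a few rows taken from Table 14.

```python
from typing import Dict, List

# site code -> confidence score contributed by one template
Contribution = Dict[str, float]


def top_templates(contributions: Dict[str, Contribution], k: int = 8) -> List[str]:
    """Return the k templates with the largest single-site contribution.

    Ranking by the maximum contribution is an assumed criterion.
    """
    ranked = sorted(contributions,
                    key=lambda name: max(contributions[name].values()),
                    reverse=True)
    return ranked[:k]


# A few rows from Table 14, deliberately given out of order.
templates = {
    "RAS_LIMLI": {"PLA": 4.15},
    "RASN_HUMAN": {"CYT": 13.19, "GOL": 13.19},
    "RASK_HUMAN": {"PLA": 13.88},
    "LET60_CAEEL": {"PLA": 10.55},
}

print(top_templates(templates, k=3))
```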

In the example of EF1A2_RABIT shown in Table 13, KnowPredsite predicts the protein to be single-localized in the cytoplasm (CYT), since its MLCS is very low (7.40) and CYT has the highest confidence score. However, the localization site of EF1A2_RABIT reported in the ngLOC dataset is the nucleus (NUC). Examining the eight template proteins, we find that they all have high sequence identity with EF1A2_RABIT and that all of them are localized in CYT except EF1A2_RAT, which is localized in NUC. According to the Gene Ontology annotation, it is localized in both CYT and NUC, which are the two sites with the highest confidence scores in KnowPredsite’s prediction.

Table 13 – Prediction result of EF1A2_RABIT.

Query CYT CSK END EXC GOL LYS MIT NUC* PLA POX MLCS

EF1A2_RABIT 95.45 0 0 1.45 0 0 0.04 2.97 0.05 0 7.40

Template CYT CSK END EXC GOL LYS MIT NUC PLA POX SI

EF1A2_RAT 0 0 0 0 0 0 0 2.94 0 0 99.78

EF1A_CHICK 2.77 0 0 0 0 0 0 0 0 0 92.22

EF1A1_HUMAN 2.75 0 0 0 0 0 0 0 0 0 92.22

EF1A1_RAT 2.75 0 0 0 0 0 0 0 0 0 92.22

EF1A0_XENLA 2.69 0 0 0 0 0 0 0 0 0 90.06

EF1A_BRARE 2.64 0 0 0 0 0 0 0 0 0 90.06

EF1A2_XENLA 2.64 0 0 0 0 0 0 0 0 0 88.79

EF1A3_XENLA 2.60 0 0 0 0 0 0 0 0 0 88.55

*: correct answer; SI: sequence identity.

In the example of RASH_HUMAN shown in Table 14, KnowPredsite predicts RASH_HUMAN to be localized in the plasma membrane (PLA) and the cytoplasm (CYT). However, the correct localization sites are the cytoplasm and the Golgi apparatus (GOL). In the prediction result, the confidence score of PLA is much higher than those of CYT and GOL, and most of the template proteins are localized in PLA.

According to the annotations in Gene Ontology and SwissProt, RASH_HUMAN is localized in PLA and GOL, and the template protein RASN_HUMAN is also localized in PLA and GOL. If the new annotation data are applied, KnowPredsite can predict RASH_HUMAN correctly.

Table 14 – Prediction result of RASH_HUMAN.

Query CYT* CSK END EXC GOL* LYS MIT NUC PLA POX MLCS

RASH_HUMAN 18.95 0.06 0.09 0.09 13.74 0.04 0.24 0.25 83.61 0 36.24

Template CYT CSK END EXC GOL LYS MIT NUC PLA POX SI

RASK_HUMAN 0 0 0 0 0 0 0 0 13.88 0 86.32

RASK_MOUSE 0 0 0 0 0 0 0 0 13.81 0 86.32

RASN_HUMAN 13.19 0 0 0 13.19 0 0 0 0 0 85.19

LET60_CAEEL 0 0 0 0 0 0 0 0 10.55 0 74.07

RAS3_RHIRA 0 0 0 0 0 0 0 0 5.05 0 57.07

RAS1_RHIRA 0 0 0 0 0 0 0 0 4.88 0 58.62

RAS2_RHIRA 0 0 0 0 0 0 0 0 4.33 0 35.20

RAS_LIMLI 0 0 0 0 0 0 0 0 4.15 0 46.03

*: correct answer; SI: sequence identity.

As for MCA3_MOUSE shown in Table 15, KnowPredsite assigns it an MLCS of 100 and correctly predicts it to be localized in the cytoplasm (CYT) and the nucleus (NUC). Examining the template proteins, we observe that KnowPredsite identifies related proteins, i.e., proteins with the same localization as the query protein, such as EF1G1_YEAST and NU155_RAT, even though they share very low sequence identity with the query protein (8.67% and 3.17%, respectively). Notably, these two template proteins rank second and seventh, respectively, among all template proteins. Furthermore, although GSTA_PLEPL has higher sequence identity with the query protein (15.86%) than EF1G1_YEAST does, the confidence score contributed by EF1G1_YEAST is much higher than that contributed by GSTA_PLEPL (2.74 vs. 0.35). This shows that the contributed confidence score is not necessarily positively correlated with sequence identity when the template sequences are dissimilar to the query sequence. In this example, EF1G1_YEAST shares more local similarities (peptide fragments) with the query protein than GSTA_PLEPL does. Even if MCA3_HUMAN, which shares 88.51% sequence identity with the query protein, is removed from the template pool, KnowPredsite still predicts MCA3_MOUSE correctly.

Table 15 – Prediction result of MCA3_MOUSE. Templates marked with ‘+’ are those that have the same localization annotation as the query protein.

Query CYT* CSK END EXC GOL LYS MIT NUC* PLA POX MLCS

MCA3_MOUSE 95.46 0.3 0.27 0.36 0.2 0.01 1.13 93.59 1.82 0.22 100

Template CYT CSK END EXC GOL LYS MIT NUC PLA POX SI

MCA3_HUMAN+ 89.16 0 0 0 0 0 0 89.16 0 0 88.51

EF1G1_YEAST+ 2.74 0 0 0 0 0 0 2.47 0 0 8.67

EF1G2_YEAST 0.49 0 0 0 0 0 0.49 0 0 0 8.50

GSTA_PLEPL 0.35 0 0 0 0 0 0 0 0 0 15.86

SYEC_YEAST 0.16 0 0 0 0 0 0 0 0 0 3.86

CCNA1_MOUSE 0 0.15 0 0 0 0 0 0 0 0 7.36

NU155_RAT+ 0.14 0 0 0 0 0 0 0.14 0 0 3.17

GCYB2_HUMAN 0.14 0 0 0 0 0 0 0 0 0 4.86

*: correct answer; SI: sequence identity.
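The mismatch between contribution and sequence identity in Table 15 can be checked directly by ranking the templates both ways. The sketch below copies each template's largest single-site contribution and its SI value from Table 15; the variable names are illustrative only.

```python
# (template, largest single-site contribution, sequence identity in %), from Table 15
rows = [
    ("MCA3_HUMAN", 89.16, 88.51),
    ("EF1G1_YEAST", 2.74, 8.67),
    ("EF1G2_YEAST", 0.49, 8.50),
    ("GSTA_PLEPL", 0.35, 15.86),
    ("SYEC_YEAST", 0.16, 3.86),
    ("CCNA1_MOUSE", 0.15, 7.36),
    ("NU155_RAT", 0.14, 3.17),
    ("GCYB2_HUMAN", 0.14, 4.86),
]

# Order the template names by contribution and by sequence identity, highest first.
by_contribution = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
by_identity = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]

for name, _, _ in rows:
    print(f"{name:12s} contribution rank {by_contribution.index(name) + 1}, "
          f"identity rank {by_identity.index(name) + 1}")
```

GSTA_PLEPL ranks second by sequence identity but fourth by contribution, while EF1G1_YEAST ranks third by identity but second by contribution, which is the reversal discussed above.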

Among the multi-localized proteins, there are 318 for which the Blast-hit method is unable to find similar sequences. However, KnowPredsite correctly predicts the localization sites of around half of them. We randomly choose one example, CFDP2_BOVIN, to demonstrate KnowPredsite’s capability of identifying related sequences from the

CYT only, and 32 are localized into NUC only. Their sequence identities against CFDP2_BOVIN are very low, ranging from 3.47% to 13.8%. The result suggests that local similarity captured by our method is beneficial for PSL prediction when global sequence similarity is very low.

Another example comes from a user’s query. We have also implemented KnowPredsite as a web server that provides a public prediction service. This example likewise demonstrates local similarities among proteins with low sequence identity.

Table 16 shows the prediction result for a query protein submitted by a user. The query protein, X1005941, should be the same protein as the first template, PBX1_MOUSE, since the two share 100% sequence identity. Therefore, its correct localization site should be the nucleus. In addition to the 100% identical sequence, we identify several other sequences localized in the same site, although their sequence identities with the query protein are very low, ranging from 7.67% to 15.82%. According to the prediction result, we can still predict the query protein correctly without referring to the first template sequence. This shows that proteins with low sequence similarity can not only share synonymous words but also be localized to the same site.

Table 16 – An example from a user’s query.

Query CYT CSK END EXC GOL LYS MIT NUC PLA POX MLCS

X1005941 0.83 0.08 0.1 0.32 0.11 0 0.16 98.18 0.5 0.01 2.62

Template CYT CSK END EXC GOL LYS MIT NUC PLA POX SI

PBX1_MOUSE 0 0 0 0 0 0 0 90.25 0 0 100

MEIS1_MOUSE 0 0 0 0 0 0 0 1.15 0 0 12.09

MEIS1_XENLA 0 0 0 0 0 0 0 1.14 0 0 12.79

PKNX2_HUMAN 0 0 0 0 0 0 0 0.87 0 0 15.82

TGIF_HUMAN 0 0 0 0 0 0 0 0.53 0 0 11.34

B3_USTMA 0 0 0 0 0 0 0 0.47 0 0 10.71

TGIF2_HUMAN 0 0 0 0 0 0 0 0.36 0 0 7.67

TGIF_MOUSE 0 0 0 0 0 0 0 0.3 0 0 10.47

4.4 Availability

The KnowPredsite web server, as well as the ngLOC dataset, is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/. Figure 13 shows a screenshot of the KnowPredsite web server. Like the SymPred and SymPsiPred web servers, KnowPredsite takes either a single sequence or multiple sequences and predicts the localization sites of the protein(s). The sequence input should be in FASTA format, and each query protein should be longer than 30 residues so that the PSI-BLAST search yields significant sequence alignments. If an e-mail address is provided, the prediction result of each query protein is sent to the user as soon as the prediction is completed.

Moreover, users can freely set the similarity-level threshold before running the prediction. The prediction result is an HTML file showing the prediction scores and the template proteins used. The template proteins are listed together with their sequence identities with the query protein to show how the prediction was made.
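Before submitting sequences, a user may want to check that the input meets the stated requirements. The following is a minimal sketch, not part of the web server: it parses FASTA text and keeps only the sequences longer than 30 residues. The function names `read_fasta` and `long_enough`, the constant `MIN_LENGTH`, and the example records are hypothetical.

```python
from typing import Dict, Iterable, Optional

MIN_LENGTH = 30  # query sequences should be longer than this many residues


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence}."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def long_enough(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the sequences strictly longer than MIN_LENGTH residues."""
    return {name: seq for name, seq in records.items() if len(seq) > MIN_LENGTH}


# Hypothetical input: one sequence of 31 residues and one of exactly 30.
example = [">query_ok", "M" * 31, ">query_short", "M" * 30]
print(list(long_enough(read_fasta(example))))  # only query_ok passes
```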

Figure 13 – The KnowPredsite web server.

4.5 Summaries

In this study, we propose a highly accurate subcellular localization prediction method for single- and multi-localized proteins, called KnowPredsite, which is based on a synonymous dictionary instead of frequently used machine learning approaches. The synonymous dictionary, called SynonymDict, is compiled from a given dataset of proteins with known localization site annotation to capture local similarity between proteins so that related proteins with the same localization can be identified. Using these related proteins obtained from the synonymous dictionary, the localization site of a query protein can be better predicted.

We used the ngLOC dataset to evaluate the performance of KnowPredsite. The dataset consists of 25887 single-localized proteins and 2169 multi-localized proteins of ten subcellular proteomes from 1923 species. In order to compare KnowPredsite with ngLOC and the baseline Blast-hit method, we performed ten-fold cross validation on the dataset.

The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and Blast-hit. In particular, on multi-localized sequences KnowPredsite outperforms ngLOC by 8.2% in accuracy when a prediction is counted as correct if at least one site is correctly identified, and by 12.4% when it is counted as correct only if both sites are correctly identified.
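The two accuracy criteria for multi-localized proteins differ only in how a prediction is matched against the annotation. The sketch below makes the difference concrete under the assumption that predictions and annotations are given as sets of site codes; the function name `multi_site_accuracy` is made up, and the last two example proteins are hypothetical. The first two pairs correspond to the RASH_HUMAN and MCA3_MOUSE examples discussed in Section 4.3.

```python
from typing import Iterable, Set, Tuple


def multi_site_accuracy(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (at-least-one-site accuracy %, all-sites accuracy %).

    Each pair is (predicted sites, annotated sites).
    """
    pairs = list(pairs)
    # Lenient criterion: the prediction shares at least one site with the annotation.
    at_least_one = sum(1 for predicted, annotated in pairs if predicted & annotated)
    # Strict criterion: every annotated site appears in the prediction.
    all_sites = sum(1 for predicted, annotated in pairs if annotated <= predicted)
    return 100.0 * at_least_one / len(pairs), 100.0 * all_sites / len(pairs)


examples = [
    ({"PLA", "CYT"}, {"CYT", "GOL"}),  # RASH_HUMAN: only CYT matches
    ({"CYT", "NUC"}, {"CYT", "NUC"}),  # MCA3_MOUSE: both sites match
    ({"MIT", "CYT"}, {"MIT", "CYT"}),  # hypothetical protein: both sites match
    ({"EXC", "PLA"}, {"NUC", "GOL"}),  # hypothetical protein: no site matches
]

lenient, strict = multi_site_accuracy(examples)
print(f"at least one site: {lenient:.1f}%, both sites: {strict:.1f}%")
```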

A major advantage of dictionary based approaches is that the prediction process is

prediction results in our experiments, we find that KnowPredsite can efficiently use local similarity to identify related sequences even when their sequence identity is low so as to predict localization site with high accuracy.

When more proteins with known localization sites become available, most machine learning based methods need to retrain their prediction models. In contrast, KnowPredsite can be easily improved by incrementally expanding the synonymous dictionary, i.e., adding new synonymous word entries or updating existing entries with new protein sources and their localization site information. This feature makes the KnowPredsite prediction system extensible and efficient to maintain.
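The incremental update described above can be pictured with a small data-structure sketch. The layout below (a mapping from each synonymous word to a list of source proteins with their localization sites) is an assumption made for illustration and is not the actual SynonymDict format; the class name, method names, and the example words are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

Source = Tuple[str, Tuple[str, ...]]  # (protein name, localization sites)


class SynonymDictSketch:
    """Toy dictionary: synonymous word -> protein sources with their sites."""

    def __init__(self) -> None:
        self.entries: DefaultDict[str, List[Source]] = defaultdict(list)

    def add(self, word: str, protein: str, sites: Iterable[str]) -> None:
        """Add a new word entry or extend an existing one; no retraining step."""
        source = (protein, tuple(sites))
        if source not in self.entries[word]:
            self.entries[word].append(source)

    def sources(self, word: str) -> List[Source]:
        """Return the protein sources recorded for a word (empty if unknown)."""
        return list(self.entries.get(word, []))


dictionary = SynonymDictSketch()
# "ARRKR" and "KRGKQ" are made-up words used only for this example.
dictionary.add("ARRKR", "PBX1_MOUSE", ["NUC"])
dictionary.add("ARRKR", "MEIS1_MOUSE", ["NUC"])  # update an existing entry
dictionary.add("KRGKQ", "MCA3_HUMAN", ["CYT", "NUC"])  # add a new entry
print(dictionary.sources("ARRKR"))
```

Each call touches only the entry for one word, so newly annotated proteins can be incorporated without rebuilding what is already stored.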