Results and discussion - Cross-species prediction of protein-protein interactions

Chapter 4. Applications of Homologous Protein-protein Interactions

4.1 Cross-species prediction of protein-protein interactions

4.1.2 Results and discussion

4.1.2.1 Accuracy of rank-based interolog mapping

To be practical for predicting interactions, a method needs reliable prediction accuracy and acceptable coverage. In this preliminary study, we propose a new “rank-based interolog mapping” method. On the one hand, this method relaxes the best-match mapping to obtain higher coverage of the total interactome. On the other hand, it selects only some, not all, of the homologs in the target organism, to address the two problems of generalized interolog mapping.

Figure 14. Comparison of the accuracy of the rank-based interolog (yellow, blue, and pink lines), best-match (green line), and generalized interolog mapping (deep blue line) methods in (A) worm-yeast mapping and (B) the four mappings. Plot axis: number of true positives (0 to 5500). “E10+rank”, “E40+rank”, and “E70+rank” denote Acc(10^-10, R), Acc(10^-40, R), and Acc(10^-70, R), respectively, with R ∈ [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 'All'].

First, we map only worm interactions onto the yeast genome. We assess the prediction accuracy of our method, best-match mapping, and generalized interolog mapping against sets of gold standard positives P and negatives N (see Section 4.1.5). Figure 14A shows the relationship between accuracy and coverage in the worm-yeast mapping. The blue line indicates the accuracy of generalized interolog mapping for JE thresholds from 10^-190 to 10^-10. The green line indicates the accuracy of best-match mapping at JE < 10^-10. There are three clear observations:

1. When only pairs of the top R homologs (see Section 4.1.5) are selected as candidate interactions (i.e. rank-based interologs), the accuracy is usually better than that of generalized interolog mapping. For example, the purple line plots Acc(10^-70, R), R ∈ [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 'All'], where 'All' means R has no limit; JE < 10^-70 was used as a good threshold for predicting interactions in Yu et al. (2004). The accuracies at R = 1, 5, and 10 are 0.22, 0.26, and 0.21, respectively, all better than Acc(10^-70, 'All') = 0.11.

2. If JE < 10^-110, Acc(JE, 'All') rises sharply, but the number of true positives is below 25. In other words, the coverage of yeast interactions is very low.

3. The maximum number of true positives in best-match mapping is 50 (at JE < 10^-10). Here, too, the coverage of yeast interactions is low.
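The selection of rank-based candidates described in observation 1 can be sketched in Python. This is a minimal sketch under stated assumptions: the exact ranking rule is defined in Section 4.1.5 and is not restated here, so the code assumes that homologs are ranked by ascending E-value and that candidate pairs are all combinations of the top R homologs of each query protein. The function name `top_r_pairs`, the identifiers, and the E-values are hypothetical.

```python
from itertools import product


def top_r_pairs(homologs_a, homologs_b, r=None):
    """Return candidate pairs built from the top-r homologs of each protein.

    homologs_a, homologs_b: lists of (protein_id, e_value) tuples.
    r: number of top-ranked homologs kept per protein; None plays the
       role of 'All' (no limit on R).
    """
    # Assumption: a smaller E-value means a better (higher-ranked) homolog.
    ranked_a = sorted(homologs_a, key=lambda h: h[1])
    ranked_b = sorted(homologs_b, key=lambda h: h[1])
    if r is not None:
        ranked_a, ranked_b = ranked_a[:r], ranked_b[:r]
    return [(a_id, b_id) for (a_id, _), (b_id, _) in product(ranked_a, ranked_b)]


# Hypothetical homolog lists (made-up identifiers and E-values, all < 10^-10).
homologs_a = [(f"A{i}", 10.0 ** -(200 - 9 * i)) for i in range(21)]
homologs_b = [(f"B{j}", 10.0 ** -(150 - 20 * j)) for j in range(7)]

print(len(top_r_pairs(homologs_a, homologs_b, r=5)))   # 25 candidate pairs
print(len(top_r_pairs(homologs_a, homologs_b, r=10)))  # 70 candidate pairs
print(len(top_r_pairs(homologs_a, homologs_b)))        # 147 candidate pairs
```

With R = 'All' the candidate set is the same as in generalized interolog mapping at the chosen E-value cutoff; smaller R trades coverage for a shorter, higher-ranked candidate list.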

To gather better statistics, we map interactions in worm, fruit fly, mouse, and human onto the yeast genome and assess them against our gold standards. We perform a similar analysis in Figure 14B. The number of true positives plotted in Figure 14B is the sum of the true positives in the worm-yeast, fly-yeast, mouse-yeast, and human-yeast mappings. The accuracy is calculated from the summed true and false positives of the four mapping processes.

In Figure 14B, the comparison among rank-based interolog, best-match, and generalized interolog mapping is similar to that in Figure 14A. The accuracies at R = 1, 5, and 10 are 0.21, 0.17, and 0.12, respectively, all better than Acc(10^-70, 'All') = 0.04.

4.1.2.2 Functional similarity between homologous protein pairs

To quantitatively assess the unreliable homologous pairs in rank-based interologs, we construct the sets P’ and N’ (see Section 4.1.5). The Gene Ontology (GO) Consortium provides a standardized vocabulary organized into three structured ontologies, which describe molecular function (MF), biological process (BP), and cellular component (CC)⁴⁰. These annotations allow the functional similarity of genes or their products to be assessed. Following Wu et al. (2006)⁴¹, we calculate the functional similarities between query (in the four organisms) and target (in yeast) interactions using GO annotations.

However, not all protein sequences have GO annotations. Table 2 shows the percentage of TP’(10^-10, 'All') and FP’(10^-10, 'All') annotated with terms of the CC and BP ontologies in each organism.

Here TP’(10^-10, 'All') = TP(10^-10, 'All') ∩ P’ and FP’(10^-10, 'All') = FP(10^-10, 'All') ∩ N’.
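The two intersections above can be illustrated with Python sets. This is only a toy sketch: every protein identifier is made up, and the helper `pair` is a hypothetical convenience that treats an interaction as an unordered pair, which is an assumption about how pairs are compared.

```python
def pair(a, b):
    """Represent an undirected protein pair so that (a, b) equals (b, a)."""
    return frozenset((a, b))


# Toy data only; every identifier below is made up for illustration.
tp = {pair("Y1", "Y2"), pair("Y3", "Y4"), pair("Y5", "Y6")}  # TP(10^-10, 'All')
fp = {pair("Y1", "Y7"), pair("Y8", "Y9")}                    # FP(10^-10, 'All')
p_prime = {pair("Y2", "Y1"), pair("Y5", "Y6")}               # P'
n_prime = {pair("Y9", "Y8")}                                 # N'

tp_prime = tp & p_prime  # TP' = TP ∩ P'
fp_prime = fp & n_prime  # FP' = FP ∩ N'

print(len(tp_prime), len(fp_prime))  # 2 1
```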

Table 2. Comparison of true and false positives selected by JE and those evaluated by CC and BP annotations

Species   TP(10^-10, 'All')   FP(10^-10, 'All')   TP’(10^-10, 'All')   FP’(10^-10, 'All')
Worm      788                 13971               412 (52.3%)          2778 (19.9%)
Fly       780                 73235               362 (46.4%)          23148 (31.6%)
Mouse     912                 37636               685 (75.1%)          27770 (73.8%)
Human     2790                187149              1661 (59.5%)         128752 (68.8%)
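The percentages in Table 2 are the shares of TP(10^-10, 'All') and FP(10^-10, 'All') that remain in TP’(10^-10, 'All') and FP’(10^-10, 'All'). The short Python sketch below recomputes them from the counts in the table; the variable and function names are illustrative only.

```python
# Counts copied from Table 2: (TP, FP, TP', FP') at JE < 10^-10, R = 'All'.
table2 = {
    "Worm": (788, 13971, 412, 2778),
    "Fly": (780, 73235, 362, 23148),
    "Mouse": (912, 37636, 685, 27770),
    "Human": (2790, 187149, 1661, 128752),
}


def percent(part, whole):
    """Percentage of `whole` covered by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


for species, (tp, fp, tp_prime, fp_prime) in table2.items():
    print(f"{species}: TP' = {percent(tp_prime, tp)}%, "
          f"FP' = {percent(fp_prime, fp)}%")
```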

The recall statistics of TP’(10^-10, 'All') and FP’(10^-10, 'All') in the four mappings are shown in Figure 15. Figure 15A shows the relationship between recall and rank. For example, in the worm-yeast mapping, the recall of TP’(10^-10, 'All') at R = 1, 5, and 10 is 5.1% (21/412), 59.0% (243/412), and 98.3% (405/412), respectively. At R = 1, 5, and 10, the recall of FP’(10^-10, 'All') is 0.5% (15/2778), 6.7% (185/2778), and 14.8% (411/2778), respectively. The four mappings from the source organisms to yeast show similar trends.
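Recall at a rank cutoff is the fraction of a set whose pairs rank at or above that cutoff. The Python sketch below makes the worm-yeast arithmetic explicit. It takes the rank of each pair as given (the rank definition is in Section 4.1.5), and the rank list is a placeholder built only to reproduce the cumulative counts quoted above (21, 243, and 405 of 412); it is not the real per-pair data.

```python
def recall_at_rank(pair_ranks, r):
    """Fraction of pairs whose rank is less than or equal to r."""
    if not pair_ranks:
        raise ValueError("pair_ranks must not be empty")
    return sum(1 for rank in pair_ranks if rank <= r) / len(pair_ranks)


# Placeholder ranks for the 412 worm-yeast TP' pairs: 21 pairs at rank 1,
# 222 more by rank 5, 162 more by rank 10, and 7 beyond rank 10.
tp_prime_ranks = [1] * 21 + [5] * 222 + [10] * 162 + [11] * 7

for r in (1, 5, 10):
    print(f"R = {r}: recall = {100 * recall_at_rank(tp_prime_ranks, r):.1f}%")
```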

In contrast, no single JE threshold satisfies both demands at once: high recall of true positives and low recall of false positives. For example, the recall of FP’(10^-10, 'All') is 16.3% at JE < 10^-40, near that at R = 10, but the recall of TP’(10^-10, 'All') is only 58.3%. At JE < 10^-40, the recall of true and false positives is 10.4% and 2.6%, respectively. This result suggests that, for a given percentage of false positives, the rank-based mapping method can predict more reliable interactions than the best-match and generalized interolog mapping methods.

Figure 15. Recall of TP’(10^-10, 'All') and FP’(10^-10, 'All') against (A) rank and (B) JE. TP’(10^-10, 'All') and FP’(10^-10, 'All') of each mapping are represented by blue and pink solid lines, respectively.

4.1.2.3 Orthologous interactions

Figures 16A and 16B show the distribution of orthologous interactions against rank and JE. The total number of orthologous interactions in the four mappings is 1,626. The orthologous interactions between the four organisms and yeast are concentrated in top-1 to top-5 (99.4% in total, 1616/1626; 99.8%, 1622/1626, for top-1 to top-10) but spread over a wide range of JE (e.g. 6.9% in 10^-50 < JE < 10^-40 and 10.5% in 10^-80 < JE < 10^-70). This result suggests two things. First, best-match mapping may not be adequate because it loses ~44% of orthologous interactions.

Second, generalized interolog mapping with any given JE threshold loses part of the orthologous interactions. Although loosening the JE threshold raises the coverage of orthologous interactions, the false positives increase sharply. Our rank-based interolog mapping method can provide higher coverage of orthologous interactions with an acceptable number of false positives.

Figure 16. Distribution of total orthologous interactions in the four mappings against (A) rank and (B) JE. In panel (A), the rank axis runs from Top-1 to Top-10 plus > Top-10, and the other axis is the percentage of orthologous interactions.

4.1.3 Discussion

4.1.3.1 Case analysis

We use two cases to explain why the rank-based interolog mapping method works. In the first case, the query interaction P43686-P62195 in human has 231 pairs of possible homologs (E-value < 10^-10) in yeast. P43686 and P62195 are the 26S protease regulatory subunits 6B and 8 of the proteasome, respectively⁴². Figure 17A shows that all 15 true positives with 0.8 < RSS^GO_{A-B,A'-B'} < 1.0 rank within the top 10, whereas 97% (29/30) of the true and false positives with RSS^GO_{A-B,A'-B'} < 0.8 fall outside the top 10. The rank of each pair is calculated in the same way as described for Figure 16. In contrast, Figure 17B shows that the reliable true positives spread over 10^-170 < JE < 10^-40, which suggests that any given JE threshold would lose part of the true positives.

Similarly, Figures 17C and 17D represent another case, the interaction O16368-Q9GZH5 in worm. O16368 is the 26S protease regulatory subunit 4, and Q9GZH5 is non-ATPase protein 1 of the proteasome regulatory particle⁴³. All 49 true positives (0.8 < RSS^GO_{A-B,A'-B'} < 1.0) rank within the top 10.

These reliable true positives spread in 10^-130 < JE < 10^-30. In this case, there are no true and

Figure 17. Three cases of rank-based interolog mapping. (A) and (B) show the TP’(10^-10, 'All') (colored blue) and FP’(10^-10, 'All') (colored pink) of the query interaction P43686-P62195 in human. Similarly, (C) and (D) show the TP’(10^-10, 'All') and FP’(10^-10, 'All') of the query interaction O17071-Q09583 in worm. (E) and (F) show the TP’(10^-10, 'All') and FP’(10^-10, 'All') of the query interaction O44156-Q27488 in worm. RSS denotes RSS^GO_{A-B,A'-B'}.

4.1.3.2 Three types of good predictions

We classify our predictions between organisms into three types. Because JE < 10^-70 was considered a good threshold for predicting interactions, we present the advantages of rank-based interologs in detail relative to JE < 10^-70.

First, the true and false positives of a query interaction have various JE. This suggests that any given JE, such as 10^-70, may not be a good cutoff. For example, O16368-Q9GZH5 is a known interaction in worm. The two proteins have 21 and 2 possible homologs (E-value < 10^-10) in yeast, respectively. Here, |FP(10^-10, 5)| = 0, |TP(10^-10, 10)| = 14, and |FP(10^-10, 10)| = 4. Acc(10^-10, 5) and Acc(10^-10, 10) are 1.0 and 0.78, respectively, both better than Acc(10^-70, 'All').

In cases of the second type, the true positives of a query interaction have JE ranging from above to below 10^-70, and there are few or no false positives. Most of the true positives are ranked in the top R. For this type, the query interaction O17071-Q09583 in worm is used as an example. O17071 has 21 homologs (E-value < 10^-10) and Q09583 has 7 homologs (E-value < 10^-10) in yeast. |TP(10^-10, 'All')| is 49 of the total 147 homolog pairs. In this case, |TP(10^-70, 'All')| = 7 and |FP(10^-70, 'All')| = 0, so the prediction accuracy using the threshold JE < 10^-70 is 0.14. By contrast, |TP(10^-40, 'All')| = 43 and |FP(10^-40, 'All')| = 0, |TP(10^-10, 5)| = 25 and |FP(10^-10, 5)| = 0, and |TP(10^-10, 10)| = 49 and |FP(10^-10, 10)| = 0. The accuracies at R = 5 and 10 are 0.51 and 1.0, respectively.

Third, all pairs in TP(10^-10, 'All') of a query interaction have JE higher than 10^-70. For this type, we take the interaction O44156-Q27488 as an example. Both O44156 and Q27488 have 7 possible homologs (E-value < 10^-10) in yeast. The minimum JE of all pairs of homologs is 7.7 × 10^-63. In other words, no true positive has JE < 10^-70, so |TP(10^-70, 'All')| = 0. By contrast, |TP(10^-10, 5)| = 15 and |FP(10^-10, 5)| = 0, and |TP(10^-10, 10)| = 21 and |FP(10^-10, 10)| = 0, so the accuracies of the rank-based interolog mapping method at R = 5 and 10 are 0.71 and 1.0, respectively.
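The formal definition of Acc is given in Section 4.1.5 and is not restated in this section. The figures quoted for the second and third types are, however, numerically consistent with dividing the selected true positives by the sum of all true positives of the query interaction, |TP(10^-10, 'All')|, and the selected false positives. The Python sketch below assumes that form purely to make the arithmetic explicit; the function name `assumed_accuracy` and the formula itself are assumptions, not the definition used in this work. For O44156-Q27488, |TP(10^-10, 'All')| is taken as 21 because each protein has only 7 homologs, so R = 10 already includes every pair.

```python
def assumed_accuracy(tp_selected, fp_selected, tp_all):
    """Assumed form: TP / (all true positives of the query + selected FP).

    This formula is inferred from the quoted numbers and is an assumption.
    """
    denominator = tp_all + fp_selected
    if denominator == 0:
        raise ValueError("no true or false positives to evaluate")
    return tp_selected / denominator


# (label, |TP| selected, |FP| selected, |TP(10^-10, 'All')|) from the text.
cases = [
    ("O17071-Q09583, JE < 10^-70", 7, 0, 49),
    ("O17071-Q09583, R = 5", 25, 0, 49),
    ("O17071-Q09583, R = 10", 49, 0, 49),
    ("O44156-Q27488, JE < 10^-70", 0, 0, 21),
    ("O44156-Q27488, R = 5", 15, 0, 21),
    ("O44156-Q27488, R = 10", 21, 0, 21),
]

for label, tp_sel, fp_sel, tp_all in cases:
    print(f"{label}: {assumed_accuracy(tp_sel, fp_sel, tp_all):.2f}")
```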
