A question caused by local sequence alignment

Chapter 3. Evidence Supplying the Existence of Homologous Protein-protein

3.4 Discussion

3.4.2 A question caused by local sequence alignment

In this study, we proposed the concept of homologous PPIs and inferred the transferability of domain pairs and function pairs. We also applied the concept to construct a web server, PPISearch. While evaluating the results of searching for homologous PPIs, we encountered a problem: we used BLASTP as a fast sequence alignment tool to search for potential homologs, but the search results may be biased by local sequence alignments.

We present an example of this problem in Figure 12. Q8INB9 is a RAC serine/threonine-protein kinase of the fruit fly with three Pfam domains. In this kinase, the protein kinase domain (red) has 258 amino acids, covering 42% of the whole sequence (611 amino acids). The potential homologs with BLASTP E-value ≤ 10^-10 have the protein kinase domain but lack the interacting domain PF00169 (deep blue, pleckstrin homology domain). This example illustrates the problem of searching for potential homologs by local alignment: a domain with large coverage can dominate the alignment. This problem will be addressed in future work.

[Figure 12 graphic. Query pair: Q8INB9 (RAC serine/threonine-protein kinase; protein kinase domain covering 42% of the sequence) and Q9VPR4 (Notchless). Interacting domains: PF00169 and PF00400. Hits shown at E-value ≤ 10^-10: Q6C292 (4E-90), P25382 (1E-115), P18961 (1E-86), P12688 (7E-86), P11792 (2E-85), P38622 (2E-26). Annotation: homologs found by BLASTP alignments may be biased because of the domain(s) with large coverage.]

Figure 12. The search result for the interacting protein pair Q8INB9-Q9VPR4. The potential homologs of Q8INB9 keep the protein kinase domain (colored red) but have no PF00169 domain (colored deep blue) to interact with the potential homolog of Q9VPR4.

Additionally, as described in Sections 3.1.3 and 3.1.4, we discussed possible reasons why members of a PPI family have domains and binding models different from those of the query protein.

Figure 13 shows an example of this observation. Transcription factor IIIB (TFIIIB), consisting of the TATA-binding protein (TBP), TFIIB-related factor (BRF-1), and BDP-1, is a central component in basal and regulated transcription by RNA polymerase III³⁶. When we searched for homologs of yeast BRF-1 using BLASTP, the search results included protein sequences with E-values ≤ 10^-10 that had no interacting domain (colored blue). This observation suggests that local alignment methods, such as BLASTP, may return unreliable homologs because of locally similar regions in the sequences. This problem will also be addressed in future work.

[Figure 13 graphic. Labels: Homologs; IIB (Pyrococcus furiosus); TBP-related.]

Figure 13. An example of our method selecting proteins without the interacting domain (E-values ≤ 10^-10) as homologs. The interacting domains of chains A and B of the complex 1NGM are colored blue (PF07741) and yellow (PF08515), respectively. Q00403, the human transcription initiation factor IIB, does not have the domain PF07741 but has an E-value ≤ 10^-10.

Chapter 4. Applications of Homologous Protein-protein Interactions

In this chapter, we apply homologous PPIs to the cross-species prediction of PPIs and to cross-species network comparison. Many experimental approaches, for example the yeast two-hybrid system, mass spectrometry, and tandem affinity purification, have been used to decipher PPI networks. To complement these experimental techniques, a number of computational methods for predicting PPIs, such as PathBLAST^{16, 17} and interologs^{5, 37} (i.e. conservation of interactions across species), have been developed¹⁹.

The concept of interolog (originally introduced by Walhout et al.³⁷) combines known PPIs from one or more source species and orthology relationships between the source and target species to predict PPIs in the target species. Yu et al. (2004)⁵ extended and assessed the concept of interologs to provide a “generalized interolog mapping” method (see below). We considered that our discovery (described in Section 3.4 Discussion) can be used to advance the generalized interolog mapping method.

In addition, cross-species network comparison provides insights into the relationships between the proteins of an organism, thereby contributing to a better understanding of cellular processes. However, large-scale interaction networks are available for only a few model organisms. We consider the concept of homologous PPIs useful for the systematic transfer of PPI networks between multiple species.

4.1 Cross-species prediction of protein-protein interactions

4.1.1 Background

Protein-protein interactions play an essential role in cellular functions. With the rapidly increasing number of sequenced genomes, approaches that predict PPIs from one organism (with abundant known interactions) to another organism (with fewer interaction data) have become highly valuable; in other words, approaches that reliably transfer PPI annotation from one organism to another⁵.

The concept of “interologs” is as follows: if interacting proteins A and B in one organism (source) have interacting orthologs A' and B' in another organism (target), the pairs A-B and A'-B' are called interologs. Operationally, the ortholog of a protein is defined as its best-matching homolog in another organism. Matthews et al. (2001)¹⁸ proposed a “best-match mapping” method to predict worm (C. elegans) interactions from the yeast (S. cerevisiae) interactome. This method considers all pairs of best-matching homologs (homologs are defined by BLASTP E-value ≤ 10^-10) of interacting yeast proteins as potential interologs.

Additionally, Yu et al. (2004)⁵ extended and assessed the concept of interologs to provide a “generalized interolog mapping” method. The mapping method regards all pairs of homologs, which have joint similarities (see Section 4.1.5) larger than a certain cutoff, as possible interologs. Their results showed that interaction annotation could be reliably transferred between two organisms if a pair of proteins has a joint E-value (JE) < 10^-70.

There are interesting questions in best-match and generalized interolog mapping methods.

First, the best-match mapping method suffers from low coverage of the total interactome⁵ because it uses only best matches. To address this question, Yu et al. (2004) proposed the method of generalized interolog mapping.

Second, homologs of a query protein selected at a certain E-value sometimes differ from the query protein in subcellular compartment, biological process, or function. For example, YLL034C in yeast has a low E-value (< 10^-120) with protein Q01853 in mouse, but YLL034C has no CDC48 domain (for protein degradation)³⁸. Protein pairs containing such sequences may not be reliable candidates for interologs.

Orthology relationships between organisms are commonly used in predicting interactions³⁹. The third question is that orthologous protein interactions between two species have widely varying JE values. For example, the two protein pairs P47857-P12382 and Q8R317-P40142 in mouse have the orthologous interactions YMR205C-YGR240C and YMR276W-YBR117C in yeast, with JE = 10^-171 and 10^-27, respectively. In other words, any fixed cutoff would usually lose part of the orthologous interactions.

To address these three questions, we propose, as a preliminary study, a new “rank-based interolog mapping” method for predicting protein-protein interactions between species. This method uses only part, not all, of the homologs of interacting proteins to gather possible interologs.

4.1.2 Results and discussion

4.1.2.1 Accuracy of rank-based interolog mapping

For a method of predicting interactions to be practical, it should have reliable prediction accuracy and acceptable coverage. In this preliminary study, we propose a new “rank-based interolog mapping” method. This method relaxes best-match mapping to obtain a higher coverage of the total interactome. At the same time, it selects part, not all, of the homologs in the target organism to address the two questions of generalized interolog mapping.

[Figure 14 graphic. Horizontal axis: number of true positives (0 to 5500).]

Figure 14. The comparison of accuracy of rank-based interolog (yellow, blue, and pink lines), best-match (green line), and generalized interolog mapping (deep blue line) methods in (A) worm-yeast mapping and (B) the four mappings. “E10+rank”, “E40+rank”, and “E70+rank” mean Acc(10^-10,R), Acc(10^-40,R), and Acc(10^-70,R), respectively, with R∈[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,'All'].

First, we map only worm interactions onto the yeast genome. We assess the predicting accuracy of our method, best-match and generalized interolog mapping against sets of gold standard positives P and negatives N (see Section 4.1.5). Figure 14A shows the relationship between accuracy and coverage in the worm-yeast mapping. The blue line indicates the accuracy of generalized interolog mapping from JE < 10^-190 to 10^-10. The green line indicates the accuracy of best-match mapping at JE < 10^-10. There are three clear observations:

1. When only pairs of the top R homologs (see Section 4.1.5) are selected as candidate interactions (i.e. rank-based interologs), the accuracy is usually better than that of generalized interolog mapping. For example, the purple line consists of the points Acc(10^-70,R), R∈[1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,'All'], where 'All' means R has no limit; JE < 10^-70 was used as a good threshold for predicting interactions in Yu et al. (2004). The accuracies at R = 1, 5, and 10 are 0.22, 0.26, and 0.21, respectively, all better than Acc(10^-70, 'All') = 0.11.

2. If JE < 10^-110, Acc(JE, 'All') rises sharply, but the number of true positives is < 25. In other words, the coverage of yeast interactions is very low.

3. The maximum number of true positives in best-match mapping is 50 (at JE < 10^-10). Similarly, the coverage of yeast interactions is low.

To gather better statistics, we map interactions in worm, fruit fly, mouse, and human onto the yeast genome and assess them against our gold standards. We perform a similar analysis in Figure 14B. The number of true positives plotted in Figure 14B is the sum of the true positives in the worm-yeast, fly-yeast, mouse-yeast, and human-yeast mappings. The accuracy is calculated from the sums of the true and false positives in the four mapping processes.

In Figure 14B, the comparison among rank-based interolog, best-match, and generalized interolog mapping is similar to that in Figure 14A. The accuracies at R = 1, 5, and 10 are 0.21, 0.17, and 0.12, respectively, all better than Acc(10^-70, 'All') = 0.04.

4.1.2.2 Functional similarity between homologous protein pairs

To quantitatively assess the unreliable homologous pairs in rank-based interologs, we construct the sets P’ and N’ (see Section 4.1.5). The Gene Ontology (GO) consortium provides a standardized vocabulary organized into three structured ontologies, which describe molecular function (MF), biological process (BP), and cellular component (CC)⁴⁰. This annotation allows the functional similarity of genes or their products to be assessed. Based on Wu et al. (2006)⁴¹, we calculate the functional similarities between query (in the four organisms) and target (in yeast) interactions by using GO annotations.

However, not all protein sequences have GO annotations. Table 2 shows the percentage of TP’(10^-10, 'All') and FP’(10^-10, 'All') with terms of the CC and BP ontologies in each organism.

Here TP’(10^-10, 'All') = TP(10^-10, 'All') ∩ P’ and FP’(10^-10, 'All') = FP(10^-10, 'All') ∩ N’.

Table 2. Comparison of true and false positives selected by JE and that evaluated by CC and BP annotations

Species   TP(10^-10, 'All')   FP(10^-10, 'All')   TP’(10^-10, 'All')   FP’(10^-10, 'All')
Worm      788                 13971               412 (52.3%)          2778 (19.9%)
Fly       780                 73235               362 (46.4%)          23148 (31.6%)
Mouse     912                 37636               685 (75.1%)          27770 (73.8%)
Human     2790                187149              1661 (59.5%)         128752 (68.8%)

The recall statistics of TP’(10^-10, 'All') and FP’(10^-10, 'All') in the four mappings are shown in Figure 15. Figure 15A shows the relationship between recall and rank. For example, in the worm-yeast mapping, the recall of TP’(10^-10, 'All') at R = 1, 5, and 10 is 5.1% (21/412), 59.0% (243/412), and 98.3% (405/412), respectively. At R = 1, 5, and 10, the recall of FP’(10^-10, 'All') is 0.5% (15/2778), 6.7% (185/2778), and 14.8% (411/2778), respectively. The four mappings from the source organisms to yeast show similar trends.
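The recall figures above can be reproduced with a short sketch. It assumes that a homolog pair (A'i, B'j) enters the candidate set at the smallest R for which both i ≤ R and j ≤ R, which follows from pairing the top R homologs of each protein; the function names `pair_rank` and `recall_at_rank` are illustrative and do not come from the original work.

```python
def pair_rank(rank_a, rank_b):
    """Smallest R at which the pair (A'_i, B'_j) is a rank-based interolog.

    Assumption: a pair is selected at rank limit R only if both homologs
    are within the top R, so the pair rank is the larger of the two ranks.
    """
    return max(rank_a, rank_b)


def recall_at_rank(pair_ranks, rank_limit):
    """Fraction of the given pairs that are selected at rank limit R."""
    selected = sum(1 for r in pair_ranks if r <= rank_limit)
    return selected / len(pair_ranks)


def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / total, 1)


# Worm-yeast counts quoted in the text: pairs selected at R = 1, 5, 10.
worm_tp = {1: 21, 5: 243, 10: 405}    # out of 412 pairs in TP'
worm_fp = {1: 15, 5: 185, 10: 411}    # out of 2778 pairs in FP'
tp_recall = {r: percent(n, 412) for r, n in worm_tp.items()}
fp_recall = {r: percent(n, 2778) for r, n in worm_fp.items()}
```

Expressing recall as a function of the pair rank makes the trade-off explicit: raising R from 5 to 10 adds most of the remaining true positives while the selected false positives stay a small fraction of FP’.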

In contrast, no single JE cutoff satisfies both demands at once: high recall of true positives and low recall of false positives. For example, the recall of FP’(10^-10, 'All') is 16.3% at JE < 10^-40, near that at R = 10, but the recall of TP’(10^-10, 'All') is only 58.3%. At JE < 10^-40, the recall of true and false positives is 10.4% and 2.6%, respectively. This result suggests that the rank-based mapping method can predict more reliable interactions at a given percentage of false positives than the best-match and generalized interolog mapping methods.

Figure 15. The relationship between recall of TP’(10^-10, 'All') and FP’(10^-10, 'All') against (A) rank and (B) JE. TP’(10^-10, 'All') and FP’(10^-10, 'All') of each mapping are represented by blue and pink solid lines, respectively.

4.1.2.3 Orthologous interactions

Figures 16A and 16B show the distribution of orthologous interactions against rank and JE. The total number of orthologous interactions in the four mappings is 1,626. The orthologous interactions between the four organisms and yeast are concentrated in top-1 to top-5 (99.4% in total, 1616/1626; for top-1 to top-10, 99.8%, 1622/1626) but are spread over a wide range of JE (e.g. 6.9% in 10^-50 < JE < 10^-40 and 10.5% in 10^-80 < JE < 10^-70). This result suggests two things. First, best-match mapping may not be adequate because it would lose ~44% of orthologous interactions. Second, generalized interolog mapping with any given JE cutoff would lose part of the orthologous interactions. Although loosening the JE cutoff could raise the coverage of orthologous interactions, the number of false positives would increase sharply. Our rank-based interolog mapping method can provide higher coverage of orthologous interactions with an acceptable number of false positives.

[Figure 16 graphic. Panel A: percentage of orthologous interactions against rank (Top-1 to Top-10, > Top-10). Panel B: percentage of orthologous interactions against JE.]

Figure 16. Distribution of total orthologous interactions in the four mappings against (A) rank and (B) JE.

4.1.3 Discussion

4.1.3.1 Case analysis

We use two cases to explain why the rank-based interolog mapping method works. In the first case, the query interaction P43686-P62195 in human has 231 pairs of possible homologs (E-value < 10^-10) in yeast. P43686 and P62195 are the 26S protease regulatory subunits 6B and 8 of the proteasome, respectively⁴². Figure 17A shows that all 15 true positives (0.8 < RSS^GO_{A-B,A'-B'} < 1.0) are ranked within the top 10, whereas 97% (29/30) of the true and false positives with RSS^GO_{A-B,A'-B'} < 0.8 fall outside the top 10. The rank of each pair is calculated in the same way as described for Figure 16. In comparison, Figure 17B shows that the reliable true positives are spread over 10^-170 < JE < 10^-40, which suggests that any given JE cutoff would lose part of the true positives.

Similarly, Figures 17C and 17D present another case, the interaction O16368-Q9GZH5 in worm. O16368 is 26S protease regulatory subunit 4, and Q9GZH5 is non-ATPase protein 1 of the proteasome regulatory particle⁴³. All 49 true positives (0.8 < RSS^GO_{A-B,A'-B'} < 1.0) are ranked within the top 10.

These reliable true positives spread in 10^-130 < JE < 10^-30. In this case, there are no true and

Figure 17. Three cases of rank-based interolog mapping. (A) and (B) show the TP’(10^-10, 'All') (colored blue) and FP’(10^-10, 'All') (colored pink) of the query interaction P43686-P62195 in human. Similarly, (C) and (D) show the TP’(10^-10, 'All') and FP’(10^-10, 'All') of the query interaction O17071-Q09583 in worm. (E) and (F) show the TP’(10^-10, 'All') and FP’(10^-10, 'All') of the query interaction O44156-Q27488 in worm. RSS is RSS^GO_{A-B,A'-B'}.

4.1.3.2 Three types of good predictions

We classify our predictions between organisms into three types. As JE < 10^-70 was considered a good threshold for predicting interactions, we present the advantages of rank-based interologs in detail relative to JE < 10^-70.

First, the true and false positives of a query interaction have various JE values. This suggests that any given JE cutoff, such as 10^-70, may not be a good one. For example, O16368-Q9GZH5 is a known interaction in worm. The two proteins have 21 and 2 possible homologs (E-value < 10^-10) in yeast, respectively.

|FP(10^-10, 5)| = 0, |TP(10^-10, 10)| = 14, and |FP(10^-10, 10)| = 4. Acc(10^-10, 5) and Acc(10^-10, 10) are 1.0 and 0.78, both better than Acc(10^-70, 'All').

In cases of the second type, the true positives of a query interaction have JE values both above and below 10^-70, and there are no or few false positives. Most of the true positives are ranked in the top R. For this type, the query interaction O17071-Q09583 in worm is used as an example. O17071 has 21 homologs (E-value < 10^-10) and Q09583 has 7 homologs (E-value < 10^-10) in yeast. |TP(10^-10, 'All')| is 49 of the total 147 homolog pairs. In this case, |TP(10^-70, 'All')| = 7 and |FP(10^-70, 'All')| = 0, so the threshold JE < 10^-70 recovers only 0.14 (7/49) of the true positives. In contrast, |TP(10^-40, 'All')| = 43 and |FP(10^-40, 'All')| = 0, |TP(10^-10, 5)| = 25 and |FP(10^-10, 5)| = 0, and |TP(10^-10, 10)| = 49 and |FP(10^-10, 10)| = 0. The fractions of true positives recovered at R = 5 and 10 are 0.51 (25/49) and 1.0 (49/49), respectively.

Third, all pairs in TP(10^-10, 'All') of a query interaction have JE higher than 10^-70. For this type, we take the interaction O44156-Q27488 as an example. Both O44156 and Q27488 have 7 possible homologs (E-value < 10^-10) in yeast. The minimum JE of all pairs of homologs is 7.7 × 10^-63. In other words, no true positive has JE < 10^-70, so |TP(10^-70, 'All')| = 0. In contrast, |TP(10^-10, 5)| = 15 and |FP(10^-10, 5)| = 0, and |TP(10^-10, 10)| = 21 and |FP(10^-10, 10)| = 0; the rank-based interolog mapping method thus recovers 0.71 (15/21) and 1.0 (21/21) of the true positives at R = 5 and 10, respectively.

4.1.4 Summary

In this preliminary study, we propose a rank-based interolog mapping method for predicting interactions across species. This method relaxes the best-match mapping method to obtain a higher coverage of the total interactome. At the same time, it selects part, not all, of the homologs in the target organism to improve on the generalized interolog mapping method.

Four mappings of worm-yeast, fly-yeast, mouse-yeast, and human-yeast are included in our preliminary study. In general, rank-based mapping method could predict more reliable interactions (including positives annotated by CC and BP ontologies and orthologous interactions) under a given percentage of false positives than best-match and generalized interolog mapping methods.

4.1.5 Methods

4.1.5.1 Rank-based interolog mapping

Interolog mapping is a process that maps interactions in the source organism onto the target organism to predict possible interactions. To address the three questions of best-match and generalized interolog mapping described above, we introduce a new “rank-based interolog

homologs could be defined as the proteins having an E-value < 10^-10 from BLASTP^{18, 44}. An overview of the rank-based interolog mapping is depicted in Figure 18.


Figure 18. Schematic illustration of the rank-based interolog mapping method. Proteins A'1, A'2,…, A'm and B'1, B'2,…, B'n are possible homologs (E-value < 10^-10) of proteins A and B in the source organism, respectively. All possible pairs between homologs A'1,…, A'R and B'1,…, B'R are called rank-based interologs.

The steps are as follows:

1. For any given protein in the source organism (e.g. worm), we collect all of its homologs in the target organism with BLASTP E-value < 10^-10.

2. The possible homologs of proteins A and B in the target organism (yeast) are each ranked by their E-values from low to high (i.e. from 0 to 10^-10).

3. The homologs ranked in the top R are selected and paired with each other. The possible protein pairs between the homologs A'1,…, A'R and B'1,…, B'R are called rank-based interologs.

In contrast, all protein pairs between the homologs A'1,…, A'm and B'1,…, B'n are generalized interologs.
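The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the input format (a dictionary from target-protein ID to BLASTP E-value), the function names, and the example identifiers are all hypothetical.

```python
from itertools import product

E_CUTOFF = 1e-10  # homolog threshold stated in step 1


def top_homologs(hits, rank_limit=None, e_cutoff=E_CUTOFF):
    """Steps 1 and 2: keep hits with E-value < cutoff, rank them from
    low to high E-value, and return the IDs of the top R.

    hits: dict of target-protein ID -> BLASTP E-value (assumed format).
    rank_limit: R; None plays the role of 'All' (no limit).
    """
    homologs = [(e, pid) for pid, e in hits.items() if e < e_cutoff]
    homologs.sort()  # lowest E-value first; ties broken by ID
    ranked = [pid for _, pid in homologs]
    return ranked if rank_limit is None else ranked[:rank_limit]


def rank_based_interologs(hits_a, hits_b, rank_limit=None):
    """Step 3: pair the top R homologs of A with the top R homologs of B.

    With rank_limit=None, every homolog pair is returned, i.e. the
    generalized interologs before any joint E-value cutoff is applied.
    """
    top_a = top_homologs(hits_a, rank_limit)
    top_b = top_homologs(hits_b, rank_limit)
    return list(product(top_a, top_b))


# Made-up hits for an interacting source pair A-B (illustrative IDs only).
hits_a = {"Y1": 1e-120, "Y2": 3e-45, "Y3": 2e-12, "Y4": 1e-3}
hits_b = {"Z1": 5e-80, "Z2": 1e-20}

top2_pairs = rank_based_interologs(hits_a, hits_b, rank_limit=2)
all_pairs = rank_based_interologs(hits_a, hits_b)
```

In this toy input, Y4 is discarded in step 1 because its E-value is not below 10^-10, so the unrestricted mapping yields 3 × 2 pairs while R = 2 yields 2 × 2 pairs.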

The best-match mapping method considers pairs between the best-matching homologs as the candidates of interaction¹⁸. The generalized interolog mapping method uses all pairs of homologs, which have joint similarities larger than a certain cutoff, to find possible interactions in the target organism⁵. In this preliminary study, we consider the protein pairs between the top R possible homologs as the candidates of interaction in the target organism.

4.1.5.2 Source data sets

To assess the rank-based interolog mapping method, we need source organisms with known interaction data. In this preliminary study, worm, fruit fly, mouse, and human are used as source organisms. We collect the interactions of these four organisms recorded in IntAct database⁴⁵ (Table 3). We then map these interactions onto the yeast genome. The protein sequences of these four source organisms and the target organism yeast are from SWISS-PROT and SGD database⁴⁶, respectively.

4.1.5.3 Gold standard target data sets

Set of gold standard positives (P)

To assess the performance of interolog mapping, we need a collection of known interactions as positives in the target organism. Previously, a data set derived from the MIPS complex catalog, which contains 8,250 unique interacting protein pairs, has been used as a standard reference for known interactions^{5, 47, 48}. We also use the MIPS interactions as the set of gold standard positives (P).

Table 3. Source data sets derived from IntAct

Species          Worm    Human    Fly      Mouse   Total
Number of PPIs   4,653   18,943   19,774   2,728   46,098

Set of gold standard negatives (N)

A set of negatives (i.e. non-interacting protein pairs) in yeast is necessary for evaluating our method. Jansen et al. (2003)⁴⁹ considered pairs of proteins in different subcellular compartments to be good estimates of non-interacting proteins. This set has 2,708,746 such protein pairs. However, we find that 3,689 pairs in this set are also recorded as interactions in the core database of DIP⁵⁰. We exclude these pairs and take the remaining 2,705,057 protein pairs as the set of gold standard negatives in this preliminary study.

4.1.5.4 Accuracy of interolog mapping

We assess the prediction accuracy of our method, best-match mapping, and generalized interolog mapping against P and N in yeast. The accuracy (Acc) is calculated as follows:

\[
\mathrm{Acc}(J,R) = \frac{|TP(J,R)|}{|TP(J,R)| + |FP(J,R)|}
\]

In this equation, TP(J,R) = H(J,R) ∩ P and FP(J,R) = H(J,R) ∩ N. H(J,R) denotes the set of rank-based interologs, best-matching homologous pairs, or generalized interologs in yeast at a given cutoff. For example, in rank-based interolog mapping, J is a given joint E-value (see below) and R is the number of homologs selected by ranking (i.e. top R), whereas in generalized interolog mapping, J is a given joint E-value and R has no limit. |TP(J,R)| and |FP(J,R)| are the numbers of true and false positives at a given J and R.
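As a hedged sketch of this calculation, the code below computes Acc from a candidate set H and the gold standard sets P and N using set intersections. It assumes that protein pairs are unordered, so each pair is stored as a frozenset; the function names and toy identifiers are illustrative only.

```python
def as_pair_set(pairs):
    """Convert an iterable of (x, y) pairs into a set of unordered pairs."""
    return {frozenset(p) for p in pairs}


def acc_from_counts(n_tp, n_fp):
    """Acc = |TP| / (|TP| + |FP|); None if no candidate is in P or N."""
    total = n_tp + n_fp
    return n_tp / total if total else None


def accuracy(candidates, positives, negatives):
    """Accuracy of a candidate set H against gold standards P and N.

    TP = H ∩ P and FP = H ∩ N; candidates found in neither set are ignored,
    as in the definition above.
    """
    h = as_pair_set(candidates)
    tp = h & as_pair_set(positives)
    fp = h & as_pair_set(negatives)
    return acc_from_counts(len(tp), len(fp))


# Toy example with made-up identifiers.
H = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")]
P = [("a", "b"), ("c", "b"), ("c", "d")]   # ("c", "b") matches ("b", "c")
N = [("d", "e")]
toy_acc = accuracy(H, P, N)

# Counts quoted in Section 4.1.3.2: |TP(10^-10, 10)| = 14, |FP(10^-10, 10)| = 4.
quoted_acc = round(acc_from_counts(14, 4), 2)
```

Because candidates that fall in neither P nor N do not enter the ratio, Acc depends on how well the gold standards cover the candidate set, which is worth keeping in mind when comparing mappings with very different numbers of candidates.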

4.1.5.5 Joint E-value (JE)

JE is the geometric mean of the E-values of the two pairs of homologous proteins. For example, if the E-values of A-A' and B-B' are E_{A-A'} and E_{B-B'}, the JE between the pairs A-B and A'-B' is

\[
JE = \sqrt{E_{A\text{-}A'} \times E_{B\text{-}B'}}
\]
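Since JE is defined as the geometric mean of two E-values, it can be computed in log space, where the geometric mean becomes the average of the two log10 E-values. The sketch below takes that route because the product of two very small E-values underflows to zero in double-precision arithmetic. The function names and the choice to map an E-value of 0 to negative infinity are assumptions made for illustration.

```python
import math


def log10_joint_evalue(e_a, e_b):
    """log10 of the geometric mean of two E-values.

    log10(sqrt(e_a * e_b)) = (log10(e_a) + log10(e_b)) / 2.
    An E-value reported as 0 is mapped to -infinity (an assumption).
    """
    if e_a < 0 or e_b < 0:
        raise ValueError("E-values must be non-negative")
    if e_a == 0 or e_b == 0:
        return -math.inf
    return 0.5 * (math.log10(e_a) + math.log10(e_b))


def passes_je_cutoff(e_a, e_b, log10_cutoff=-70.0):
    """True if JE < 10**log10_cutoff (default: the 10^-70 threshold)."""
    return log10_joint_evalue(e_a, e_b) < log10_cutoff


# The naive product of two small E-values underflows to 0.0 ...
naive_product = 1e-170 * 1e-170
# ... while the log-space value is still well defined.
log_je = log10_joint_evalue(1e-170, 1e-170)
```

Working with log10(JE) also makes cutoffs such as 10^-70 or 10^-40 simple comparisons against -70 or -40.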

4.1.5.6 GO similarity measure

We assume that if a pair A'-B' in yeast is a reliable homologous pair of A-B, then A and A', and B and B', are in similar subcellular compartments and biological processes and have similar molecular functions. Wu et al. (2006) proposed a method to measure the semantic similarity between two proteins using CC and BP annotations. Based on the relative specificity similarity (RSS) defined by their method, we calculate the similarity in cellular component and biological process between a protein pair A-B and its rank-based interolog A'-B' (RSS^GO_{A-B,A'-B'}). RSS values for the CC (RSS
