Methods - Cross-species prediction of protein-protein interactions

Chapter 4. Applications of Homologous Protein-protein Interactions

4.1 Cross-species prediction of protein-protein interactions

4.1.5 Methods

4.1.5.1 Rank-based interolog mapping

Interolog mapping is a process that maps interactions in the source organism onto the target organism to predict possible interactions. To address the three questions of best-match and generalized interolog mapping described above, we introduce a new “rank-based interolog

homologs could be defined as the proteins having an E-value < 10^-10 from BLASTP^{18, 44}. An overview of the rank-based interolog mapping is depicted in Figure 18.

Figure 18. Schematic illustration of the rank-based interolog mapping method (panels A and B). Proteins A'1, A'2,…, A'm and B'1, B'2,…, B'n are possible homologs (E-value < 10^-10) of proteins A and B in the source organism, respectively. All possible pairs between homologs A'1,…, A'R and B'1,…, B'R are called rank-based interologs.

The steps are as follows:

1. For any given protein in the source organism (e.g. worm), we collect all of its homologs in the target organism (yeast) with a BLASTP E-value < 10^-10.

2. These possible homologs of proteins A and B in the target organism are ranked by their E-values from low to high (i.e. from 0 to 10^-10), respectively.

3. The homologs ranked in the top R are selected to pair with each other. The possible protein pairs between the homologs A'1,…, A'R and B'1,…, B'R are called rank-based interologs.

In contrast, all protein pairs between the homologs A'1,…, A'm and B'1,…, B'n are generalized interologs.

The best-match mapping method considers pairs between the best-matching homologs as candidate interactions¹⁸. The generalized interolog mapping method uses all pairs of homologs whose joint similarity exceeds a certain cutoff to find possible interactions in the target organism⁵. In this preliminary study, we consider the protein pairs between the top R possible homologs as candidate interactions in the target organism.
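The three mapping variants described above differ only in how many ranked homologs are kept on each side. The following is a minimal sketch, not the implementation used in this study: it assumes the BLASTP hits have already been parsed into (target ID, E-value) pairs, and the function names and the protein IDs in the example are hypothetical.

```python
from itertools import product

E_CUTOFF = 1e-10  # homolog threshold stated in the text


def top_ranked_homologs(hits, r=None, e_cutoff=E_CUTOFF):
    """Return target IDs whose E-value is below the cutoff, best first.

    hits: iterable of (target_id, e_value) pairs (assumed pre-parsed BLASTP hits).
    r:    keep only the top r homologs; None keeps every homolog.
    """
    passing = sorted((e, t) for t, e in hits if e < e_cutoff)  # ascending E-value
    ranked = [t for _, t in passing]
    return ranked if r is None else ranked[:r]


def map_interaction(hits_a, hits_b, r=None):
    """Candidate target pairs for one source interaction A-B.

    r=1 gives the best-match pair, r=R gives rank-based interologs,
    and r=None gives all pairs between the two homolog lists.
    """
    return list(product(top_ranked_homologs(hits_a, r),
                        top_ranked_homologs(hits_b, r)))


# Hypothetical hits for source proteins A and B against the target proteome.
hits_a = [("Y1", 1e-50), ("Y2", 1e-30), ("Y3", 1e-12), ("Y4", 1e-5)]
hits_b = [("Z1", 1e-80), ("Z2", 1e-20)]
rank_based = map_interaction(hits_a, hits_b, r=2)
```

Setting `r=None` here applies only the per-protein E-value cutoff; any additional joint-similarity cutoff would be applied to the resulting pairs afterwards.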

4.1.5.2 Source data sets

To assess the rank-based interolog mapping method, we need source organisms with known interaction data. In this preliminary study, worm, fruit fly, mouse, and human are used as source organisms. We collect the interactions of these four organisms recorded in the IntAct database⁴⁵ (Table 3). We then map these interactions onto the yeast genome. The protein sequences of these four source organisms and of the target organism yeast are from the SWISS-PROT and SGD databases⁴⁶, respectively.

4.1.5.3 Gold standard target data sets

Set of gold standard positives (P)

To assess the performance of interolog mapping, we need a collection of known interactions as positives in the target organism. Previously, a data set derived from the MIPS complex catalog, which contains 8,250 unique interacting protein pairs, has been used as a standard reference for known interactions^{5, 47, 48}. We also consider the MIPS interactions as

Table 2. Source data sets derived from IntAct

Species          Worm     Human     Fly       Mouse    Total
Number of PPIs   4,653    18,943    19,774    2,728    46,098

Set of gold standard negatives (N)

A set of negatives (i.e. non-interacting protein pairs) in yeast is necessary for evaluating our method. Jansen et al. (2003)⁴⁹ considered pairs of proteins in different subcellular compartments as good estimates for non-interacting proteins. This set has 2,708,746 such protein pairs. However, we find that 3,689 interactions in this set are also recorded in the core database of DIP⁵⁰. We exclude these interactions and take the remaining 2,705,057 protein pairs as the set of gold standard negatives in this preliminary study.

4.1.5.4 Accuracy of interolog mapping

We assess the prediction accuracy of our method, best-match mapping, and generalized interolog mapping against P and N in yeast. The accuracy (Acc) is calculated as follows:

$$\mathrm{Acc} = \frac{|TP(J,R)|}{|TP(J,R)| + |FP(J,R)|}$$

In this equation, TP(J,R) = H(J,R) ∩ P and FP(J,R) = H(J,R) ∩ N. H(J,R) denotes the set of rank-based interologs, best-matching homologous pairs, or generalized interologs in yeast at a certain cutoff. For example, in rank-based interolog mapping, J is a given joint E-value (see below) and R is the number of homologs selected by ranking (i.e. top R). In generalized interolog mapping, J is a certain joint E-value and R is unlimited. |TP(J,R)| and |FP(J,R)| are the numbers of true and false positives at a given J and R.
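The accuracy defined above is a ratio of two set intersections, which the following sketch makes explicit. It assumes that interactions are unordered pairs (A-B equals B-A), which the text does not state, and the function names and example pairs are hypothetical.

```python
def normalize(pairs):
    """Treat each interaction as an unordered pair (an assumption of this sketch)."""
    return {frozenset(p) for p in pairs}


def accuracy(predicted, positives, negatives):
    """Acc = |TP| / (|TP| + |FP|), with TP = H ∩ P and FP = H ∩ N.

    Predicted pairs found in neither P nor N do not affect the value.
    Returns None when no predicted pair falls in P or N.
    """
    h = normalize(predicted)
    tp = len(h & normalize(positives))
    fp = len(h & normalize(negatives))
    if tp + fp == 0:
        return None
    return tp / (tp + fp)


# Hypothetical example: two predictions in P, one in N, one in neither.
predicted = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
positives = [("b", "a"), ("c", "d")]
negatives = [("e", "f")]
acc = accuracy(predicted, positives, negatives)
```

In this example the pair g-h is ignored because it belongs to neither gold standard set, so the accuracy is computed from three pairs only.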

4.1.5.5 Joint E-value (JE)

JE is the geometric mean of the E-values of the two homologous protein pairs. For example, if the E-values of A-A' and B-B' are E_{A-A'} and E_{B-B'}, the JE between pairs A-B and A'-B' is

$$JE = \sqrt{E_{A-A'} \times E_{B-B'}}$$
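Taking JE as the geometric mean of the two E-values, as stated above, one practical detail is that the product of two very small E-values can underflow to zero in double precision. The sketch below is one possible way to compute JE, averaging logarithms instead of multiplying; the function name is hypothetical, and treating a reported E-value of 0 as yielding JE = 0 is an assumption of this sketch.

```python
import math


def joint_e_value(e_a, e_b):
    """Geometric mean of two E-values, computed in log space.

    Multiplying first (e.g. 1e-180 * 1e-200) underflows to 0.0 in float64,
    so the logarithms are averaged instead.
    """
    if e_a < 0 or e_b < 0:
        raise ValueError("E-values must be non-negative")
    if e_a == 0.0 or e_b == 0.0:
        return 0.0  # assumption: an E-value reported as 0 gives JE = 0
    return math.exp(0.5 * (math.log(e_a) + math.log(e_b)))


je_small = joint_e_value(1e-180, 1e-200)  # stays positive, close to 1e-190
```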

4.1.5.6 GO similarity measure

We assume that if a pair A'-B' in yeast is a reliable homologous pair of A-B, then A and A', and B and B', would be in similar subcellular compartments, take part in similar biological processes, and have similar molecular functions. Wu et al. (2006) proposed a method to measure the semantic similarity between two proteins by using CC and BP annotations. Based on the relative specificity similarities (RSS) defined by their method, we calculate the similarity in cellular component and biological process between a protein pair A-B and its rank-based interolog A'-B' (RSS^GO_{A-B,A'-B'}). The RSS values for the CC (RSS^CC) and BP (RSS^BP) ontologies give the similarity of the CC and BP terms of a given protein pair, respectively. The values are between 0 and 1.0. The equations we used are as follows.

Wu et al. (2006) provided three confidence levels for yeast protein pairs annotated in the CC and BP ontologies. Their results showed that 78% of the interactions in their positive dataset fall into the high-confidence segment of 0.8 < RSS^CC < 1.0 and 0.8 < RSS^BP < 1.0. They suggested that the highest-confidence segment may contain most yeast protein-protein

dataset including these protein-protein interactions with 0.8 < RSS^GO_{A-B,A'-B'} < 1.0 in P. N' is the dataset consisting of N and those interactions of P that have RSS^GO_{A-B,A'-B'} < 0.8.

4.1.5.7 Orthologous interactions between source and target organisms

We identified the orthologous proteins between the source organisms and yeast using the Ensembl database (Mar, 2008)⁵¹. To compare the coverage of orthologous interactions among our method, best-match mapping, and generalized interolog mapping, we identify and count the interacting pairs of orthologs among all pairs of possible homologs (E-value < 10^-10). For example, the interacting proteins P47857-P12382 in mouse have the orthologous interaction YMR205C-YGR240C. In Figure 16A, the label “Top-n” is the maximum of the ranks of E_{A-A'} and E_{B-B'}. For example, YMR205C and YGR240C are the top 1 and top 2 in the ranking of homologs by E-value, respectively, so the label of the pair YMR205C-YGR240C is n = max(1, 2) = 2.
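The “Top-n” label can be sketched in a few lines. The ranks in the example reproduce those given in the text for YMR205C and YGR240C, but the E-values themselves are invented for illustration, and the function names are hypothetical.

```python
def rank_of(target_id, hits):
    """1-based rank of target_id among hits sorted by ascending E-value."""
    ranked = [t for t, _ in sorted(hits, key=lambda h: h[1])]
    return ranked.index(target_id) + 1


def top_n_label(pair, hits_a, hits_b):
    """n = max(rank of A' among homologs of A, rank of B' among homologs of B)."""
    a_prime, b_prime = pair
    return max(rank_of(a_prime, hits_a), rank_of(b_prime, hits_b))


# Invented E-values chosen so that YMR205C ranks 1st for A and YGR240C ranks 2nd for B.
hits_a = [("YGR240C", 1e-140), ("YMR205C", 1e-150)]
hits_b = [("YMR205C", 1e-160), ("YGR240C", 1e-145)]
n = top_n_label(("YMR205C", "YGR240C"), hits_a, hits_b)
```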

4.2 Cross-species network comparison by using homologous PPIs

The comparison of biochemical networks across multiple species can be preliminarily studied through the concept of homologous PPIs. The discovery of sequence homologs to a known protein often provides clues for understanding the function of a newly sequenced gene. As an increasing number of reliable PPIs become available, identifying homologous PPIs should be useful for understanding a newly determined PPI. Moreover, homologous PPIs may share similar functions and domains. We consider the concept of “homologous PPIs”, which to our knowledge is proposed here for the first time, as a starting point to compare and map protein-protein interaction networks across multiple species. Recently, several PPI databases (e.g. IntAct and BioGRID) allow users to input one protein or gene name, or a pair of them, to acquire the PPIs associated with the query protein(s). Few computational methods^{12, 13} have applied homologous interactions to assess the reliability of PPIs.

The cross-species network comparison can be used to identify the corresponding pathways from one organism to another. For systems biology, the comparison of networks provides clues for understanding important issues such as the evolution of networks. For drug discovery, pathway-level comparison can indicate candidate key targets (proteins or protein-protein interactions) for inhibiting a specific disease mechanism. Because, as described above, the concept of homologous PPIs is useful for finding PPIs that share similar functions and domains, we have applied this concept to search for, and even rebuild, the corresponding pathways from one organism to another.

We used the pathway of non-small cell lung cancer as an example (Figure 19A). The PPI of Raf (A-Raf proto-oncogene serine/threonine-protein kinase) and MEK (mitogen-activated protein kinase kinase) is a key component of the Ras-dependent signaling pathway from receptors to the nucleus^{52, 53}. Clinically, inhibition of MEKs suppressed Raf-mediated cellular growth. The three Raf family members, Raf-1, B-Raf, and A-Raf, are highly homologous; they were originally isolated as oncogenes contributing to cellular transformation and are among the best-characterized Ras effectors that activate the mitogen-activated protein kinase (MAPK) signaling pathway. Raf directly phosphorylates and activates MEK via two conserved serine residues in the kinase activation loop of MEK. The two members of the MEK family, MEK-1 and MEK-2, in turn, are dual-specificity threonine and tyrosine kinases that phosphorylate and activate ERKs (extracellular signal-regulated kinases).

Figure 19B shows the homologous PPIs of the query PPI, A-Raf (UniProt accession number: P10398) and MEK-1 (Q02750). For this query, the PPISearch server identifies a PPI family including five homologous PPIs in human and two homologous PPIs in mouse that belong to Raf family proteins (Raf-1, A-Raf, and B-Raf), which phosphorylate and activate MEK-1 and MEK-2. These homologous PPIs share the same Pfam domain annotation with the query.

Based on this search, we will be able to provide a reliable corresponding Ras-dependent signaling pathway in mouse. The Raf family proteins P10398, P15056, and P04049 (human) and P28028 (mouse) can be used for network comparison. Likewise, the MEK proteins Q02750 and P36507 (human) and P31938 and Q63932 (mouse) can be used for network comparison.

[Figure 19 image. Labels recovered from the figure: pathway components Grb2, Sos, Ras, Raf, MEK, ERK, and PI3K (shown twice); "Interactions of A-Raf and MEK-1"; MEK-1 (Q02750); E-value columns, including 1e-180; Pfam domain PF00069.]

Figure 19. The homologous PPIs of the query PPI, A-Raf (P10398) and MEK-1 (Q02750). (A) Diagram of the pathways of non-small cell lung cancer in human and mouse. The key component, the interaction between Rafs and MEKs, is boxed. (B) For this query, the PPISearch server identifies a PPI family including five homologous PPIs in human and two homologous PPIs in mouse that belong to Raf family proteins (Raf-1, A-Raf, and B-Raf), which phosphorylate and activate MEK-1 and MEK-2. These homologous PPIs share the same Pfam domain annotation with the query.

Chapter 5. Conclusion

5.1 Summary

The interactions between proteins are critical to most biological processes. To identify and characterize protein-protein interactions and their networks, many high-throughput experimental approaches, such as yeast two-hybrid screening, mass spectrometry, and tandem affinity purification, and computational methods (phylogenetic profiles, known 3D complexes, and interologs) have been proposed. Some PPI databases, such as IntAct, MIPS, DIP, MINT, and BioGRID, have accumulated PPIs submitted by biologists as well as those from literature mining, high-throughput experiments, and other data sources. As these interaction databases continue to grow in size, they become increasingly useful for the above goal and for the analysis of newly identified interactions.

To address this issue, we have proposed the PPISearch server for searching homologous PPIs across multiple species and annotating the query protein pair. To our knowledge, PPISearch is the first public server that identifies homologous PPIs from annotated PPI databases and infers the transferability of interacting domains and functions between homologous PPIs and the query. PPISearch is an easy-to-use web server that allows users to input a pair of protein sequences. The server then finds homologous PPIs in multiple species from five public databases (IntAct, MIPS, DIP, MINT, and BioGRID) and annotates the query. Our results demonstrate that this server achieves high agreement on interacting domain-domain pairs and function pairs between query protein pairs and their respective homologous PPIs.

This study demonstrated the utility and feasibility of the PPISearch server in identifying homologous PPIs and inferring conserved DDPs and MFPs from PPI families. By allowing users to input a pair of protein sequences, PPISearch is the first server that can identify homologous PPIs from annotated PPI databases and infer transferability of interacting domains and functions between homologous PPIs and a query. Our experimental results demonstrated that the query protein pair and its homologous PPIs achieved high agreement on conserved DDPs and MFPs. We believe that PPISearch is a fast homologous PPIs search server and is able to provide valuable annotations for a newly determined PPI.

5.2 Future works

5.2.1 Directions for future research

There are several directions for future research:

1. To supply more biological evidence supporting the concept of homologous PPIs, we will add pathway-level and complex-level insights (described in Section 3.4).

2. Based on the PPI families constructed by our methodology, we will be able to investigate the evolutionary relationships between PPIs across multiple species.

3. The PPI families constructed by our methodology will supply us with clues for studying the evolutionary conservation of PPIs across multiple species.

4. We have obtained preliminary evidence supporting the feasibility of applying homologous PPIs to cross-species PPI prediction and network comparison (described in Sections 4.1 and 4.2). We will study these two issues in more depth.

5. We will add approaches such as FASTA to our methodology to address the problem caused by local sequence alignments (described in Section 3.4).

5.2.2 Combination of sequence-based and structure-based interolog mapping

Our laboratory has developed the concept of “3D-domain interologs”^{4, 54}. We will combine the sequence-based homologous PPIs with this structure-based method of interolog mapping. The concept of 3D-domain interologs is described in detail below.

To study the mechanisms of PPIs in multiple species, domain-domain interactions, which are regarded as the key to PPIs, should be identified. With the rapid increase in the number of protein structures, domain-domain interactions can be studied by identifying interacting domains from three-dimensional (3D) structural complexes. Known 3D structures of interacting proteins provide interacting domains and atomic details for thousands of direct physical interactions.

By considering the interacting domains of an interacting protein pair, we have proposed the concept of “3D-domain interolog mapping” to improve generalized interolog mapping. The two physically interacting domain sequences of a 3D-dimer protein structure are used as queries to identify 3D-domain interolog candidates by searching the protein sequences of genomes with PSI-BLAST. The proteins with both significant sequence similarity and the same interacting domains are considered 3D-domain homologs, forming a homolog family. Here, we define 3D-domain homologs as follows: (1) candidates in both alignments have a significant PSI-BLAST E-value (< 10^-8); (2) candidates have 25% domain sequence identity in both sequences in the PSI-BLAST alignment; (3) candidates have 25% sequence identity in both sequences on contacted residues in the PSI-BLAST alignment. The 3D-domain interolog candidates are defined as all protein pairs between the two homolog families derived from the two sequences of a structural 3D-dimer (Figure 20). We believe that 3D-domain interolog mapping can be used to study the evolution of the interacting domain through 3D-domain homologous families from multiple

[Figure 20 image. Labels recovered from the figure: "3D-domain interolog mapping & generalized interolog mapping"; ecTbetaR2 (interacting domain); TGF beta (interacting domain); Pfam-B 73018; PKinase; Pfam-B 211911; Activin_recp; TGF_beta_GS.]

Figure 20. Architecture of 3D-domain interolog mapping. Human TGF-B3 and TGFBR-2 co-crystallize in PDB⁵⁵. Four zebrafish proteins homologous to human TGF-B3 are found by PSI-BLAST. Likewise, five zebrafish proteins are homologous to human TGFBR-2. Through generalized interolog mapping, all possible pairs between the two families are considered generalized interologs (shown as black and green lines with arrows). Moreover, we can find the interacting domains of the TGF-B3-TGFBR-2 complex (the TGF beta domain is shown in gray and the ecTbetaR2 domain in light green) by exploring the co-crystal structure. The pairs of proteins that contain these interacting domains are considered 3D-domain interologs (shown as green lines with arrows).
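The three criteria for 3D-domain homologs listed above can be read as a filter over PSI-BLAST hits, followed by pairing the two resulting families. The sketch below is only an illustration under stated assumptions: the `Alignment` record and its field names are hypothetical, the identities are assumed to be precomputed fractions, and “25% identity” is interpreted as “at least 25%”.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one PSI-BLAST hit against an interacting-domain query."""
    protein_id: str
    e_value: float
    domain_identity: float   # fraction of identical residues over the aligned domain
    contact_identity: float  # fraction of identical residues at contacting positions


def is_3d_domain_homolog(hit, e_cutoff=1e-8, min_identity=0.25):
    """Apply criteria (1)-(3); thresholds follow the text, '>=' is an assumption."""
    return (hit.e_value < e_cutoff
            and hit.domain_identity >= min_identity
            and hit.contact_identity >= min_identity)


def interolog_candidates(hits_a, hits_b):
    """All pairs between the two 3D-domain homolog families."""
    family_a = [h.protein_id for h in hits_a if is_3d_domain_homolog(h)]
    family_b = [h.protein_id for h in hits_b if is_3d_domain_homolog(h)]
    return list(product(family_a, family_b))


# Hypothetical hits for the two interacting-domain queries of one 3D-dimer.
hits_a = [Alignment("a1", 1e-30, 0.60, 0.50), Alignment("a2", 1e-12, 0.40, 0.10)]
hits_b = [Alignment("b1", 1e-20, 0.30, 0.25), Alignment("b2", 1e-5, 0.90, 0.90)]
candidates = interolog_candidates(hits_a, hits_b)
```

Here `a2` is rejected on contact-residue identity and `b2` on E-value, so only the pair a1-b1 remains.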

By combining homologous PPIs and 3D-domain interologs, we will develop a new scoring function to model protein interfaces. The scoring function is

E = Einteracting + Econsensus + Esimilarity

The scoring function is composed of the interacting force (Einteracting), the consensus of residues (Econsensus), and the template similarity (Esimilarity). We have applied this function and 3D-domain interologs to measure the interaction changes during evolution and the effect of residue substitution on the binding interface.

[Figure 21 image. Labels recovered from the figure: Human, Mouse, Cow; Type 1, Type 2; "Interacting score >= threshold"; Homologs; Interacting protein pair.]

Figure 21. Overview of mutation analysis in protein-protein interactions.

In the comparison of biochemical networks across species, protein-protein interactions may be conserved or non-conserved. Figure 21 shows an overview of how we analyze why a protein-protein interaction is retained or lost in different organisms. In our study, we have obtained a reliable threshold of E (= Einteracting + Econsensus + Esimilarity) to estimate whether two proteins will interact with each other. The contacting residues of homologous proteins among organisms are colored yellow and green.

In Figure 21, the protein-protein interaction exists in human and mouse (≥ threshold) but not in cow (< threshold). This observation suggests that the mutation in human (colored red) may not disrupt the interaction (called a Type 1 mutation), but the mutation in cow (colored blue) may cause the loss of this interaction (a Type 2 mutation). This model could help us to perform large-scale analyses of changes in interacting modes and residues among multiple organisms. These analyses will help us to understand the causes of conservation and diversity in protein-protein interaction networks.
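The thresholding logic described above can be sketched as follows. The function names are hypothetical, and the numeric scores and the threshold in the example are invented for illustration; the text does not report the actual values or the scale of E.

```python
def interface_score(e_interacting, e_consensus, e_similarity):
    """E = Einteracting + Econsensus + Esimilarity, as defined in the text."""
    return e_interacting + e_consensus + e_similarity


def mutation_type(score, threshold):
    """Type 1: the interaction is kept (score >= threshold); Type 2: it is lost."""
    return "Type 1" if score >= threshold else "Type 2"


# Invented component scores for two organisms and an invented threshold.
THRESHOLD = 1.5
human = mutation_type(interface_score(1.0, 0.5, 0.5), THRESHOLD)
cow = mutation_type(interface_score(0.25, 0.25, 0.25), THRESHOLD)
```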

References

1. Watson, J. D., Laskowski, R. A. & Thornton, J. M. Predicting protein function from sequence and structural data. Current Opinion in Structural Biology 15, 275-284 (2005).

2. Yang, J.-M. & Tung, C.-H. Protein structure database search and evolutionary classification. Nucleic Acids Research 34, 3646-3659 (2006).

3. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U. S. A. 96, 4285-4288 (1999).

4. Chen, Y.-C., Lo, Y.-S., Hsu, W.-C. & Yang, J.-M. 3D-partner: a web server to infer interacting partners and binding models. Nucleic Acids Research 35, W561-567 (2007).

5. Yu, H. Y. et al. Annotation transfer between genomes: Protein-protein interologs and protein-DNA regulogs. Genome Research 14, 1107-1118 (2004).

6. Shoemaker, B. A. & Panchenko, A. R. Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Computational Biology 3, 337-344 (2007).

7. Kerrien, S. et al. IntAct - open source resource for molecular interaction data. Nucleic Acids Research 35, D561-D565 (2007).

8. Mewes, H. W. et al. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Research 36, D196-D201 (2008).

9. Salwinski, L. et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research 32, D449-D451 (2004).

10. Chatr-Aryamontri, A. et al. MINT: the molecular INTeraction database. Nucleic Acids Research 35, D572-D574 (2007).

11. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34, D535-D539 (2006).

12. Patil, A. & Nakamura, H. Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics 6, 100-112 (2005).

13. Saeed, R. & Deane, C. An assessment of the uses of homologous interactions. Bioinformatics 24, 689-695 (2008).

14. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239-259 (2007).

15. Michaut, M. et al. InteroPORC: automated inference of highly conserved protein interaction networks. Bioinformatics 24, 1625-1631 (2008).

16. Kelley, B. et al. Conserved pathways within bacteria and yeast as revealed by global

17. Sharan, R. et al. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. U. S. A. 102, 1974-1979 (2005).

18. Matthews, L. R. et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Research 11, 2120-2126 (2001).

19. Shoemaker, B. A. & Panchenko, A. R. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Computational Biology 3, e43 (2007).

20. Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Research 36, D281-D288 (2008).

21. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Research 37, D211-D215 (2009).

22. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25-29 (2000).

23. Kersey, P. et al. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Research 33, D297-D302 (2005).

24. Andreeva, A. et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research 32, D226-D229 (2004).

25. Kriventseva, E. V., Fleischmann, W., Zdobnov, E. M. & Apweiler, R. CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Research 29, 33-36 (2001).

26. Bonifacino, J. S. & Traub, L. M. Signals for sorting of transmembrane proteins to endosomes and lysosomes. Annual Review of Biochemistry 72, 395-447 (2003).

27. Heldwein, E. E. et al. Crystal structure of the clathrin adaptor protein 1 core. Proc. Natl. Acad. Sci. U. S. A. 101, 14108-14113 (2004).

28. Lieb, J. D., Albrecht, M. R., Chuang, P. T. & Meyer, B. J. MIX-1: An essential component of the C-elegans mitotic machinery executes x chromosome dosage compensation. Cell 92, 265-277 (1998).

29. Hagstrom, K. A., Holmes, V. F., Cozzarelli, N. R. & Meyer, B. J. C. elegans condensin promotes mitotic chromosome architecture, centromere organization, and sister chromatid segregation during mitosis and meiosis. Genes & Development 16, 729-742 (2002).

30. Hirano, M. & Hirano, T. Hinge-mediated dimerization of SMC protein is essential for its dynamic interaction with DNA. EMBO Journal 21, 5733-5744 (2002).

31. Massague, J., Blain, S. W. & Lo, R. S. TGF-beta signaling in growth control, cancer,
