Thesis overview - A Study of Homologous Protein-Protein Interactions

Chapter 1. Introduction

1.3 Thesis overview

The thesis is organized as follows. In Chapter 2, we proposed the concept of homologous PPIs and a method for finding the homologous PPIs of a given query PPI. We then combined this method with a non-redundant PPI data set to construct a web server, PPISearch. The PPI data set consists of 290,137 PPIs derived from 576 species.

In Chapter 3, we presented evidence supporting the existence of PPI families, reported the results of homologous PPI searches, and discussed our observations of PPI families. We examined the new concept of homologous PPIs from four perspectives. Moreover, we used case studies to illustrate this concept and the statistical analyses of PPI families. Our results demonstrated the utility and feasibility of the PPISearch server in identifying homologous PPIs and inferring conserved domains and functions from PPI families.

By allowing users to input a pair of protein sequences, PPISearch is the first server that can identify homologous PPIs from annotated PPI databases and infer the transferability of interacting domains and functions between homologous PPIs and a query. We showed that PPISearch is a fast homologous PPI search server and is able to provide valuable annotations for a newly determined PPI.

In Chapter 4, we applied the concept of homologous PPIs (i.e. PPI families) to cross-species prediction of PPIs and cross-species network comparisons. In recent years, a number of computational methods, such as PathBLAST^{16, 17} and interologs^{5, 18}, have been developed to predict PPIs¹⁹ and thereby complement experimental techniques (e.g. the yeast two-hybrid system and mass spectrometry). The concept of interologs has been extended into a "generalized interolog mapping" method⁵. Our results showed that our discovery can be used to advance the generalized interolog mapping method. In addition, we used case studies to show that the concept of homologous PPIs is useful for a systematic transfer of PPI networks between multiple species.

Chapter 2. Methods and Materials for Finding Homologous Protein-protein Interactions

In this chapter, we presented the concept of the PPI family and the method for finding the homologous PPIs of a given query PPI. Figure 1 illustrates the concept of searching for homologous PPIs. For this purpose, we constructed a non-redundant PPI data set. The data set consists of PPIs derived from five public databases: IntAct, MIPS, DIP, MINT and BioGRID. The total number of PPIs in this data set is 290,137 from 576 species.

Figure 1. Illustration of searching homologous interactions. Interacting proteins A and B are the query protein pair given by users. A1'-A4' and B1'-B3' are the homologs (defined by BLASTP E-value) of proteins A and B, respectively. The homolog pairs recorded in the integrated PPI

Additionally, we presented how we evaluate the reliability of homologous PPIs, which are defined by sequence similarity (BLASTP E-values) and joint sequence similarity (joint E-value) between the query protein pair and these homologous PPIs.

2.1 Overview of homologous protein-protein interaction search

In this study, we developed a methodology for searching homologous PPIs and used it to construct a web server, PPISearch. Figure 2 shows how the PPISearch server searches for the homologous PPIs of a query protein pair (A and B) through the following steps (Figure 2A).

The server first identifies the homologous families (A' and B') of A and B, respectively, with E-values ≤ 10^-10 by using BLASTP to scan the annotated PPI databases (Figure 2B and C). All protein pairs formed between A' and B' are considered candidate homologous PPIs. From these candidates, we selected as homologous PPIs those that are recorded in the annotated databases and have significant joint sequence similarity (joint E-value ≤ 10^-40) to the query (Figure 2D). Then, we measured the conservation ratios of domain-domain pairs (DDPs; Pfam²⁰ and InterPro²¹ domains) and protein functions (Gene Ontology annotations²²) derived from these homologous PPIs of the query (Figure 2E).

Step 1: Query a pair of protein sequences (A and B).

Step 2: Identify the homologous protein families (A' and B') of A and B, respectively, with E-values ≤ 10^-10 by using BLASTP against the annotated PPI databases (290,137 protein-protein interactions).

Step 3: Identify homologous PPIs, which are protein pairs of A' and B' recorded in the annotated databases with joint E-values (J_E) ≤ 10^-40.

Step 4: Measure the conservation ratios of all domain-domain pairs (DDPs) and protein functions derived from these homologous PPIs of the query. DDPs and function terms are considered conserved if their ratios are ≥ 0.6.

Step 5: Output the homologous PPIs, conserved DDPs and functions, and multiple sequence alignments across multiple species for the query.

Figure 2. Overview of the PPISearch server for homologous protein-protein interaction search and conservation analysis using proteins σ1A-adaptin and γ1-adaptin as the query. (A) The main procedure. (B) Identify homologs of σ1A-adaptin and γ1-adaptin using BLASTP to scan the annotated PPI databases. (C) The homologous families of σ1A-adaptin and γ1-adaptin with E-values ≤ 10^-10. (D) Homologous PPIs of the query. (E) Conservation ratios of domain-domain pairs derived from homologous PPIs.

2.2 Homologous protein-protein interaction

The concept of homologous PPIs is the core of this study; it is used to identify the PPI family and to measure the DDP and functional conservation of a query protein pair (A and B). We defined a homologous PPI as follows: (1) the homologs of A and B are proteins with significant sequence similarity (BLASTP E-values ≤ 10^-10)^{5, 18}; (2) there is significant joint sequence similarity (joint E-value J_E ≤ 10^-40) between the two pairs, i.e. (A, A1') and (B, B1'), formed by the query proteins (A and B) and their respective homologs (A1' and B1') recorded in annotated PPI databases. This work followed previous studies^{5, 18} to define joint sequence similarity as

$$J_E = \sqrt{E_A \times E_B} \qquad (1)$$

where E_A is the E-value between proteins A and A1', and E_B is the E-value between proteins B and B1'. Here, J_E ≤ 10^-40 is considered a significant similarity according to statistical analysis of 290,137 annotated PPIs and 6,597 orthologous PPI families collected from the PORC database (see Chapter 3).
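The search procedure described above (Steps 1 to 3) can be sketched as follows. This is a minimal illustration, not the PPISearch implementation: it assumes that BLASTP has already been run and that its E-values are available as dictionaries, that the joint E-value is the geometric mean of the two E-values as in the generalized interolog mapping approach cited above, and that the annotated PPIs are stored as unordered pairs. The function and variable names are hypothetical.

```python
import math
from itertools import product

E_HOMOLOG = 1e-10   # BLASTP E-value cutoff for homologs (from the text)
J_E_CUTOFF = 1e-40  # joint E-value cutoff (from the text)


def joint_e_value(e_a, e_b):
    """Joint E-value, assumed here to be the geometric mean of E_A and E_B."""
    return math.sqrt(e_a * e_b)


def find_homologous_ppis(hits_a, hits_b, known_ppis):
    """Return (homolog_of_A, homolog_of_B, joint_E) tuples, best hits first.

    hits_a, hits_b: dict mapping protein ID -> BLASTP E-value against A or B.
    known_ppis: set of frozenset({id1, id2}) recorded in the annotated data set.
    """
    # Step 2: keep only significant homologs of each query protein.
    homologs_a = {p: e for p, e in hits_a.items() if e <= E_HOMOLOG}
    homologs_b = {p: e for p, e in hits_b.items() if e <= E_HOMOLOG}

    results = []
    # Step 3: every pair (A', B') is a candidate; keep recorded pairs
    # whose joint E-value passes the cutoff.
    for (pa, ea), (pb, eb) in product(homologs_a.items(), homologs_b.items()):
        if frozenset((pa, pb)) not in known_ppis:
            continue
        je = joint_e_value(ea, eb)
        if je <= J_E_CUTOFF:
            results.append((pa, pb, je))
    return sorted(results, key=lambda r: r[2])
```

Working with the E-values directly keeps the sketch close to the definition; a production version would more likely compare log-transformed values to avoid floating-point underflow for very small E-values.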

2.3 Non-redundant data set for searching homologous PPIs

Table 1 lists the dates and numbers of PPIs of the five public databases.

Table 1. Five source data sets of PPIs

Database    Number of PPIs    Date
IntAct      147,634           Dec. 14, 2008
MIPS         18,529           Oct. 1, 2008
DIP          52,445           Oct. 14, 2008
MINT         77,846           Oct. 28, 2008
BioGRID     150,827           Dec. 17, 2008
Total       447,281

After removing redundant PPIs, the annotated data set used in this study has 290,137 PPIs.

These PPIs were identified experimentally from 576 species.

We briefly describe these public databases as follows:

(1) IntAct: IntAct is a freely available, open-source database system for protein interaction data. All interactions are derived from literature curation or direct user submissions⁷.

(2) MIPS: The Munich Information Center for Protein Sequences (MIPS) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. The PPIs in MIPS are compiled from manually curated, literature-based protein interaction databases to serve as an accepted set of reliable annotated interaction data⁸.

(3) DIP: The DIP database catalogs experimentally determined PPIs and combines information from a variety of sources to create a single, consistent set of PPIs. The data stored in DIP were curated both manually by expert curators and automatically by computational approaches that use knowledge about the PPI networks extracted from the most reliable, core subset of the DIP data⁹.

(4) MINT: The Molecular INTeraction database focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators¹⁰.

(5) BioGRID: The BioGRID (Biological General Repository for Interaction Datasets) database was developed to house and distribute collections of protein and genetic interactions from major model organism species. BioGRID currently contains ~150,000 interactions from six different species, derived from both high-throughput studies and conventional focused studies¹¹.

The tabular PPI data files were downloaded from the five public databases. We merged all of these PPIs and removed duplicates by using UniProt accession numbers. A total of 290,137 PPIs in 576 species were included in our investigation.
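The merging step can be illustrated with a short sketch. It assumes that each source has already been parsed into pairs of UniProt accession numbers and that an interaction A-B is the same record as B-A; neither the parsing nor the handling of pair order is specified in the text, and the function names are hypothetical.

```python
def canonical_pair(acc_a, acc_b):
    """Order-independent key for an interaction between two UniProt accessions."""
    return tuple(sorted((acc_a, acc_b)))


def merge_ppi_sources(sources):
    """Merge PPIs from several databases and drop duplicates.

    sources: dict mapping database name -> iterable of (acc_a, acc_b) pairs.
    Returns a dict mapping each unique pair -> sorted list of source databases.
    """
    merged = {}
    for db_name, pairs in sources.items():
        for acc_a, acc_b in pairs:
            # The same interaction reported by two databases, or in either
            # order, collapses onto one key.
            merged.setdefault(canonical_pair(acc_a, acc_b), set()).add(db_name)
    return {pair: sorted(dbs) for pair, dbs in merged.items()}
```

Keeping the list of source databases for each pair is a design choice of this sketch; it makes it easy to see how much the five sources overlap.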

2.4 Annotations of homologous protein-protein interactions

A query protein pair and its homologous PPIs, significant both in sequence similarity and in joint sequence similarity, can be considered a PPI family. The concept of PPI families is similar to that of the protein sequence family^{20, 21} and the protein structure family²⁴. We believe that PPI families can be applied widely in biological investigations. Here, we assume that the members of a PPI family are conserved in specific functions and in interacting domain(s). Using this conservation within a PPI family, our server can be used to annotate the protein functions and DDPs of a query protein pair.

2.4.1 Transferability of molecular function

The members of a PPI family often have similar molecular functions. We used the molecular function (MF) terms of Gene Ontology²² to annotate the functions of a query protein pair. The conservation ratio (CRF_m) of an MF term pair (MFP) m in the homologous PPIs of a query i is used to measure the agreement and is defined in Equation (2).

Additionally, the shared ratio of MFPs (SRF), which is statistically derived from 106,997 annotated queries, is used to estimate the transferability of conserved function pairs shared by the query and its homologous PPIs. The SRF at a given ratio k is defined as

$$SRF(k) = \frac{\sum_{i \in Q} f_i(CRF_m \ge k)}{\sum_{i \in Q} F_i(CRF_m \ge k)} \qquad (3)$$

where Q is the set of annotated PPIs in the databases; i is a query protein pair; f_i(CRF_m ≥ k) is the number of MFPs with CRF_m ≥ k that are shared by the query i and its homologous PPIs; and F_i(CRF_m ≥ k) is the total number of MFPs with CRF_m ≥ k, where the MFPs are derived from the homologous PPIs of the query i. Here, k is set to 0.6.
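A conservation ratio of this kind can be sketched as below. The sketch assumes that the ratio of a term pair is the fraction of a query's homologous PPIs annotated with that pair; the exact denominator used by the server is not recoverable from the text here, so this definition and the function names are assumptions made for illustration only.

```python
from collections import Counter


def conservation_ratios(family_annotations):
    """Fraction of homologous PPIs that carry each annotation pair.

    family_annotations: list with one set of annotation pairs (for example
    MF term pairs) per homologous PPI of a query.
    """
    n = len(family_annotations)
    if n == 0:
        return {}
    # Count each pair at most once per homologous PPI.
    counts = Counter(pair for ann in family_annotations for pair in set(ann))
    return {pair: c / n for pair, c in counts.items()}


def conserved_pairs(family_annotations, threshold=0.6):
    """Pairs whose conservation ratio reaches the threshold (0.6 in the text)."""
    ratios = conservation_ratios(family_annotations)
    return {pair for pair, r in ratios.items() if r >= threshold}
```

The same two functions apply unchanged to domain-domain pairs, since only the type of annotation differs.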

2.4.2 Transferability of domain-domain pairs

A query protein pair and its homologous PPIs often show conserved interacting DDPs. To measure the occurrence of each DDP in a PPI family, we define the conservation ratio (CRD_p) of a DDP p in the homologous PPIs of a query protein pair i [Equation (4)].

Figure 2D and E show an example of calculating the CRD values of four DDPs. In addition, to evaluate statistically the transferability of DDPs between a query and its homologous PPIs, this study defines the shared ratio (SRD) of DDPs using CRD_p and 103,762 annotated PPIs as query protein pairs. The SRD of DDPs at a given ratio c is given as

$$SRD(c) = \frac{\sum_{i \in Q} d_i(CRD_p \ge c)}{\sum_{i \in Q} D_i(CRD_p \ge c)} \qquad (5)$$

where Q is the set of annotated PPIs in the databases (here, the total number of PPIs in Q is 103,762); i is a query protein pair; d_i(CRD_p ≥ c) is the number of DDPs with CRD_p ≥ c that are shared by the query i and its homologous PPIs; and D_i(CRD_p ≥ c) is the total number of DDPs with CRD_p ≥ c, where the DDPs are derived from the homologous PPIs of the query i. Here, this work used a statistical approach to determine the threshold c of CRD_p (here, c = 0.6) that yields reliable DDP annotations with an acceptable level of D_i. Please note that CRD_p is computed for a single query protein pair, whereas SRD is computed over a set of queries.
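The shared ratio aggregates over all queries, which the following sketch makes explicit. It assumes that, for each query, the set of pairs annotated on the query itself and the conservation ratios derived from its homologous PPIs are already available; the input layout and the function name are hypothetical.

```python
def shared_ratio(queries, threshold=0.6):
    """Shared ratio over a set of queries.

    queries: iterable of (query_pairs, ratios), where query_pairs is the set
    of pairs (DDPs or MFPs) annotated on the query and ratios maps each pair
    found in its homologous PPIs to its conservation ratio.
    """
    shared = 0  # numerator: conserved pairs also present in the query
    total = 0   # denominator: all conserved pairs from the homologous PPIs
    for query_pairs, ratios in queries:
        conserved = {pair for pair, r in ratios.items() if r >= threshold}
        total += len(conserved)
        shared += len(conserved & set(query_pairs))
    return shared / total if total else 0.0
```

Because the counts are summed before dividing, queries with many conserved pairs weigh more than queries with few, which differs from averaging per-query ratios.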

2.5 Data sets for evaluating the approach of searching homologous protein-protein interactions

To evaluate the usefulness of the PPISearch server for discovering homologous PPIs and for annotating a query protein pair, we constructed two query protein sets, termed HOM and ORT. The HOM and ORT data sets were used to assess the performance of PPISearch in searching homologous PPIs and to determine the threshold of the joint E-value J_E [Equation (1)] (Figure 3A).

2.5.1 HOM data set

The HOM set includes all 290,137 PPIs. It was applied to infer the relations between the conservation ratios [CRF and CRD, defined in Equations (2) and (4)] and the transferability of MFPs and DDPs, respectively, between a query and its homologous PPIs.

2.5.2 ORT data set

The ORT set has 6,597 orthologous PPI families (14,571 PPIs) derived from the annotated PPI database and the PORC orthology database. PORC (putative orthologous clusters) data were defined as orthologous families from the Integr8²³ and CluSTr²⁵ databases. These clusters cover all sequenced organisms (1125 bacteria, 125 eukaryota and 50 archaea in release 94). Each entry in PORC represents a cluster of genes grouped by the similarity of their longest protein product. According to the construction process of PORC, a gene cluster contains at most a single protein from a given species, and a protein can be assigned to only a single cluster.
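One way to picture how orthologous PPI families could be derived from these two resources is sketched below. The text does not spell out the grouping rule, so the sketch assumes that two PPIs belong to the same family when their proteins fall into the same unordered pair of PORC clusters, and that a family needs at least two member PPIs. Both rules and all names are assumptions.

```python
from collections import defaultdict


def orthologous_ppi_families(ppis, cluster_of, min_size=2):
    """Group PPIs by the pair of orthologous clusters their proteins belong to.

    ppis: iterable of (protein_a, protein_b).
    cluster_of: dict mapping protein ID -> cluster ID (one cluster per protein).
    """
    families = defaultdict(list)
    for a, b in ppis:
        if a not in cluster_of or b not in cluster_of:
            continue  # proteins without a cluster cannot be placed in a family
        key = tuple(sorted((cluster_of[a], cluster_of[b])))
        families[key].append((a, b))
    # Assumed rule: a family must contain PPIs from at least min_size records.
    return {k: v for k, v in families.items() if len(v) >= min_size}
```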

Chapter 3. Evidence Supplying the Existence of Homologous Protein-protein Interactions

In this chapter, we presented evidence for the existence of homologous PPIs (Section 3.1) and the results of the homologous PPI search (i.e. PPISearch) (Sections 3.2-3.3), and discussed our observations of PPI families. We used case studies to describe the insights used to examine the concept of homologous PPIs and statistically analyzed PPI families (Section 3.4).

3.1 Evidence of the existence of homologous PPIs

We analyzed the results of homologous PPIs from four perspectives. First, we observed the conservation of biological function in PPI families. Second, we observed the conservation of domain pairs in PPI families. Third, because members of PPI families share conserved domain pairs, we observed the conservation of interacting domains in PPI families (based on protein 3D structures). Finally, because members of PPI families share conserved interacting domains, we observed the conservation of the binding interface between the two proteins of each PPI in PPI families. The evidence from these four perspectives supported the existence of homologous PPIs (i.e. PPI families).

3.1.1 Conservation of molecular function in PPI families

To verify the discovery of homologous PPIs, we selected two query protein sets, termed HOM and ORT. HOM and ORT were used to assess PPI families and to evaluate the threshold of the joint E-value J_E (Figure 3A). In addition, the HOM set was applied to infer the relation between the conservation ratio (CRF, defined in Chapter 2) and the transferability of MFPs between a query and its homologous PPIs (Figure 3B).

The HOM set includes all 290,137 PPIs, and the ORT set has 6,597 orthologous PPI families (14,571 PPIs) derived from the annotated PPI database and the PORC orthology database.

HOM and ORT were used to assess the PPISearch server in identifying homologous PPIs and orthologous PPIs, respectively, by searching the annotated PPI database (290,137 PPIs with 54,422 proteins). Figure 3A shows the relationships between the joint E-value J_E and the numbers of orthologous PPIs (black) and homologous PPIs (red). Orthologous PPIs often have the same functions and domains. When J_E ≤ 10^-40, the number of orthologous PPIs decreases significantly; conversely, the number of homologous PPIs decreases more gradually than it does when J_E ≥ 10^-40. This result shows that the proposed method is able to identify 98.2% of the orthologous PPIs with a reasonable number of homologous PPIs when J_E ≤ 10^-40.
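The kind of threshold scan summarized in Figure 3A can be sketched as follows. The sketch assumes that the joint E-values of the orthologous and homologous hits have already been collected into two lists; the function names and the grid of cutoffs are hypothetical and chosen only for illustration.

```python
def fraction_retained(joint_e_values, cutoff):
    """Fraction of hits whose joint E-value is at or below the cutoff."""
    values = list(joint_e_values)
    if not values:
        return 0.0
    return sum(1 for je in values if je <= cutoff) / len(values)


def scan_cutoffs(ortholog_jes, homolog_jes, exponents=range(-10, -101, -10)):
    """For each cutoff 10**exp, report ortholog coverage and homolog count."""
    ortholog_jes = list(ortholog_jes)
    homolog_jes = list(homolog_jes)
    rows = []
    for exp in exponents:
        cutoff = 10.0 ** exp
        coverage = fraction_retained(ortholog_jes, cutoff)
        n_homologs = sum(1 for je in homolog_jes if je <= cutoff)
        rows.append((exp, coverage, n_homologs))
    return rows
```

A cutoff is attractive when the ortholog coverage is still high while the number of retained homologous PPIs has dropped to a manageable size, which is the trade-off the text describes for J_E ≤ 10^-40.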

[Figure 3B: x-axis, conservation ratio of MF pairs in homologous PPIs; y-axes, shared ratio of MF pairs (SRF) and number of MF pairs (NMFP)]

[Figure 3C: x-axis, conservation ratio of DDPs in homologous PPIs; y-axes, shared ratio of DDPs (SRD) and number of domain pairs (NDP)]

Figure 3. Conservations of biological functions and domain pairs in PPI families. (A) The relationships between joint E-value JE and the numbers of orthologous PPIs (black) and homologous PPIs (red) derived from 290,137 annotated PPIs. (B) The relationships between the conservation ratios of molecular function pairs (MFPs) with the shared ratios of MFPs and with the number (dotted lines) of MFPs derived from 106,997 PPI families. The shared ratio of MFPs is 0.69 and the number of MFPs is 454,251 if the conservation ratio is 0.6 and the joint E-value is 10^-40 (green lines). (C) The relationships between conservation ratios of DDPs with shared ratios of DDPs and with the number (dotted lines) of DDPs derived from 103,762 PPI families. The shared ratio of DDPs is 0.88 and the number of DDPs is 252,728 when the conservation ratio is 0.6 and joint E-value is 10^-40 (green lines).

To evaluate the transferability of MFPs between a query and its homologous PPIs, we used the SRF [Equation (3)]. The HOM set was also used to evaluate the utility of the PPISearch server in annotating the query protein pair. After excluding proteins without GO molecular function annotations from the query set, 106,997 PPIs were used to evaluate the transferability (SRF) of conserved MFPs between these query PPIs and their respective homologous PPIs (Figure 3B). The members of a PPI family have similar molecular functions, and SRF values are highly correlated with the conservation ratios (CRF) of MFPs. When the CRF is 0.6 and the joint E-value is 10^-40 (green lines), the SRF is 0.69 and the number of MFPs is 454,251.

3.1.2 Conservation of domain pairs in PPI families

In addition, the HOM set was applied to infer the relation between the conservation ratio (CRD, defined in Chapter 2) and the transferability of DDPs between a query and its homologous PPIs (Figure 3C). To evaluate this transferability, we used the SRD [Equation (5)]. After excluding proteins without domain annotations from the query set, 103,762 PPIs were used to evaluate the transferability (SRD) of conserved DDPs between these query PPIs and their respective homologous PPIs.

Figure 3C shows the relationship between the conservation ratios (CRD) of DDPs and the SRD. The SRD increases significantly (solid lines) as the CRD increases up to 0.6. Conversely, the number of DDPs derived from 103,762 PPI families decreases (dotted lines) as the CRD increases. If the CRD is set to 0.6 and the joint E-value is set to 10^-40 (green lines), the SRD is 0.88 and the number of DDPs is 252,728. This result demonstrates that members of a PPI family reliably share DDPs (or interacting domains). Similar results were obtained for the transferability of conserved functions between homologous PPIs and the query (Figure 3B).

These results reveal that PPI families achieve a high SRD with a reasonable number of DDPs when the joint E-value is set to 10^-40. In summary, these experimental results demonstrate that this server achieves high agreement on MFPs and DDPs between the queries and their respective homologous PPIs.

3.1.3 Conservation of interacting domains in PPI families

The two lines of evidence above were obtained by sequence-based searching. As an increasing amount of structural data has become available (e.g. protein complexes in PDB), we used structure-based views to examine the concept of homologous PPIs. In this section, we examined the conservation of interacting domains in PPI families, based on the assumption that the members of a PPI family have similar interacting domains.

Firstly, we collected a data set of protein complexes. Each complex was composed of two protein chains (i.e. heterodimer or homodimer) and was (1) recorded in the annotated PPI database (290,137 PPIs) and (2) recorded in iPfam database. iPfam is a database []. After selecting protein complexes from PDB, a data set of 1,014 complexes (in other words, 1,014 PPIs) was constructed.

Figure 4 illustrates how we calculated the conservation of interacting domains in PPI families. Proteins A and B form an interacting protein pair with two physical domain-domain interactions (DDIs). If a protein pair formed by the homologs A' and B' (E-value ≤ 10^-10) kept ≥ 1 DDI, we considered that pair to have interacting domains similar to those of the query pair A-B.

We compared two partitions, "PPI families" and "Non PPI families". The "PPI families" partition consisted of PPIs with J_E ≤ 10^-40, for example, the PPIs circled in blue. Conversely, the "Non PPI families" partition consisted of PPIs with J_E > 10^-40.

Figure 4. Illustration of how we calculated the conservation of interacting domains in PPI families. The rectangles colored green, blue, light blue, yellow, and red represent domains. The query interacting protein pair A-B has two DDIs. All pairs of A' and B' homologs are marked "Yes" (keeping ≥ 1 DDI with the query PPI A-B) or "No" (keeping no DDI with the query PPI A-B).
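The marking scheme of Figure 4 can be sketched as a small tally. The sketch assumes that a candidate pair keeps a DDI when both domains of at least one query DDI are present on the corresponding homologs, and that the "PPI families" partition contains the pairs with J_E ≤ 10^-40, as in the definition of Section 2.2. The domain identifiers, the input layout and the function names are hypothetical.

```python
J_E_CUTOFF = 1e-40  # joint E-value threshold from the text


def keeps_ddi(query_ddis, domains_a, domains_b):
    """True if at least one query DDI can be formed by the candidate pair.

    query_ddis: set of (domain_on_A, domain_on_B) pairs observed in the query.
    domains_a, domains_b: sets of domains on the homologs A' and B'.
    """
    return any(da in domains_a and db in domains_b for da, db in query_ddis)


def tally_by_partition(query_ddis, candidates):
    """Count candidate pairs by partition and by whether they keep a DDI.

    candidates: iterable of (domains_a, domains_b, joint_e_value).
    """
    counts = {
        "PPI families": {"keeping": 0, "not keeping": 0},
        "Non PPI families": {"keeping": 0, "not keeping": 0},
    }
    for domains_a, domains_b, joint_e in candidates:
        partition = "PPI families" if joint_e <= J_E_CUTOFF else "Non PPI families"
        kept = keeps_ddi(query_ddis, domains_a, domains_b)
        counts[partition]["keeping" if kept else "not keeping"] += 1
    return counts
```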

Figure 5 shows the results of observing the conservation of interacting domains in PPI families. We found that, in the set "PPI families", the number of PPIs keeping ≥ 1 DDI (11,060 PPIs) was 2.35-fold that of PPIs not keeping any DDI (4,699 PPIs). In comparison, in the set "Non PPI families", the number of PPIs not keeping any DDI (27,264 PPIs) was 1.74-fold that of PPIs keeping ≥ 1 DDI (15,653 PPIs).

[Figure 5: bar chart; x-axis, type (PPI families, Non PPI families); y-axis, number of PPIs; bars, keeping DDI and non-keeping DDI]

Figure 5. Conservation of interacting domains in PPI families. In the set "PPI families", the number of PPIs keeping ≥ 1 DDI is 11,060 (blue) and that of PPIs not keeping any DDI is 4,699 (red). In comparison, in the set "Non PPI families", the number of PPIs not keeping any DDI is 27,264 (red) and that of PPIs keeping ≥ 1 DDI is 15,653 (blue).

These results indicated that the conservation of interacting domains was higher in PPI families than in non-PPI families. In the discussion in Section 3.4, we suggested a possible reason why 4,699 PPIs in the set "PPI families" did not keep any DDI.
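The fold values quoted for Figure 5 follow directly from the four counts, as the short calculation below shows. The proportion of pairs keeping a DDI in each set is not stated in the text; it is computed here only as an additional way to read the same counts.

```python
# Counts reported for Figure 5.
counts = {
    "PPI families": {"keeping": 11060, "not_keeping": 4699},
    "Non PPI families": {"keeping": 15653, "not_keeping": 27264},
}


def keeping_fraction(group):
    """Proportion of PPIs in a set that keep at least one DDI."""
    c = counts[group]
    return c["keeping"] / (c["keeping"] + c["not_keeping"])


# Ratio of keeping to not keeping within PPI families.
fold_in_families = (
    counts["PPI families"]["keeping"] / counts["PPI families"]["not_keeping"]
)
# Ratio of not keeping to keeping within Non PPI families.
fold_in_non_families = (
    counts["Non PPI families"]["not_keeping"] / counts["Non PPI families"]["keeping"]
)

print(round(fold_in_families, 2), round(fold_in_non_families, 2))  # 2.35 1.74
print(round(keeping_fraction("PPI families"), 2),
      round(keeping_fraction("Non PPI families"), 2))  # 0.7 0.36
```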

3.1.4 Conservation of binding model in PPI families

After acquiring the evidence for the conservation of interacting domains, we were interested in observing the similarity of structural binding interfaces within PPI families. This idea was derived from the assumption that the members of a PPI family have similar structural
