
Chapter 2. Methods and Materials for Finding Homologous Protein-protein

2.5 Data sets for evaluating the approach of searching homologous protein-protein

2.5.2 ORT data set

The ORT set has 6,597 orthologous PPI families (14,571 PPIs) derived from the annotated PPI database and the PORC orthology database. PORC (putative orthologous clusters) data are orthologous families defined from the Integr8 [23] and CluSTr [25] databases. These clusters cover all sequenced organisms (1,125 bacteria, 125 eukaryota and 50 archaea in release 94). Each entry in PORC represents a cluster of genes grouped by the similarity of their longest protein product. According to the construction process of PORC, a gene cluster contains at most one protein from a given species, and a protein can be assigned to only one cluster.

Chapter 3.

Evidence Supplying the Existence of Homologous Protein-protein Interactions

In this chapter, we present the evidence for the existence of homologous PPIs (Section 3.1) and the results of the homologous PPI search with PPISearch (Sections 3.2-3.3), and discuss our observations on PPI families. We use case studies to describe the insights used to examine the concept of homologous PPIs and to statistically analyze PPI families (Section 3.4).

3.1 Evidence of the existence of homologous PPIs

We analyzed the results of the homologous PPI search from four views. First, we observed the conservation of biological function in PPI families. Second, we observed the conservation of domain pairs in PPI families. Third, because members of a PPI family share conserved domain pairs, we observed the conservation of interacting domains in PPI families (based on protein 3D structures). Finally, because members of a PPI family share conserved interacting domains, we observed the conservation of the binding interface between the two proteins of each PPI in PPI families. The evidence from these four views supports the existence of homologous PPIs (i.e. PPI families).

3.1.1 Conservation of molecular function in PPI families

To verify the discovery of homologous PPIs, we selected two query protein sets, termed HOM and ORT. HOM and ORT are used to assess PPI families and to evaluate the threshold of the joint E-value JE (Figure 3A). In addition, the HOM set was applied to infer the relation between the conservation ratio (CRF, defined in Chapter 2) and the transferability of MFPs between a query and its homologous PPIs (Figure 3B).

The HOM set includes all 290,137 PPIs, and the ORT set has 6,597 orthologous PPI families (14,571 PPIs) derived from the annotated PPI database and the PORC orthology database.

HOM and ORT were used to assess the PPISearch server in identifying homologous PPIs and orthologous PPIs, respectively, by searching the annotated PPI database (290,137 PPIs with 54,422 proteins). Figure 3A shows the relationships between the joint E-value JE and the numbers of orthologous PPIs (black) and homologous PPIs (red). The orthologous PPIs often have the same functions and domains. When JE ≤ 10^-40, the number of orthologous PPIs decreases significantly; conversely, the number of homologous PPIs decreases more gradually than that at JE ≥ 10^-40. This result shows that the proposed method is able to identify 98.2% of orthologous PPIs with a reasonable number of homologous PPIs when JE ≤ 10^-40.
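The thresholding step described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the function name `filter_by_joint_evalue`, and the example values are hypothetical, and the joint E-value is assumed to have been computed beforehand as defined in Chapter 2.

```python
from typing import NamedTuple


class CandidatePPI(NamedTuple):
    protein_a: str
    protein_b: str
    joint_evalue: float  # JE, assumed precomputed as defined in Chapter 2


def filter_by_joint_evalue(candidates, threshold=1e-40):
    """Keep candidates with JE <= threshold (a smaller JE means stronger homology)."""
    return [c for c in candidates if c.joint_evalue <= threshold]


# Hypothetical candidates, for illustration only (not data from this study).
candidates = [
    CandidatePPI("A1", "B1", 1e-120),
    CandidatePPI("A2", "B2", 1e-55),
    CandidatePPI("A3", "B3", 1e-12),
]
family = filter_by_joint_evalue(candidates)          # cutoff 10^-40
strict = filter_by_joint_evalue(candidates, 1e-100)  # stricter cutoff
```

Tightening the cutoff can only shrink the returned set, which matches the trade-off discussed above between the number of hits and their reliability.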

[Figure 3, panels B and C. Panel B axes: conservation ratio of MF pairs in homologous PPIs; shared ratio of MF pairs (SRF); number of MF pairs (NMFP). Panel C axes: conservation ratio of DDPs in homologous PPIs; shared ratio of DDPs (SRD); number of domain pairs (NDP). Legend entries include SRF (logJE < -10) and SRD (logJE < -10).]

Figure 3. Conservation of biological functions and domain pairs in PPI families. (A) The relationships between the joint E-value JE and the numbers of orthologous PPIs (black) and homologous PPIs (red) derived from 290,137 annotated PPIs. (B) The relationships between the conservation ratios of molecular function pairs (MFPs) and both the shared ratios of MFPs and the number of MFPs (dotted lines) derived from 106,997 PPI families. The shared ratio of MFPs is 0.69 and the number of MFPs is 454,251 if the conservation ratio is 0.6 and the joint E-value is 10^-40 (green lines). (C) The relationships between the conservation ratios of DDPs and both the shared ratios of DDPs and the number of DDPs (dotted lines) derived from 103,762 PPI families. The shared ratio of DDPs is 0.88 and the number of DDPs is 252,728 when the conservation ratio is 0.6 and the joint E-value is 10^-40 (green lines).

To evaluate the transferability of MFPs between a query and its homologous PPIs, we used the SRF [Equation (3)]. The HOM set is also used to evaluate the utility of the PPISearch server in annotating the query protein pair. By excluding proteins without GO molecular function annotations from the query set, 106,997 PPIs are used to evaluate the transferability (SRF) of conserved MFPs between these query PPIs and their respective homologous PPIs (Figure 3B). The members of a PPI family have similar molecular functions, and SRF ratios are highly correlated with the conservation ratios (CRF) of MFPs. When the CRF is 0.6 and the joint E-value is 10^-40 (green lines), the SRF is 0.69 and the number of MFPs is 454,251.

3.1.2 Conservation of domain pairs in PPI families

In addition, the HOM set was applied to infer the relation between the conservation ratio (CRD, defined in Chapter 2) and the transferability of DDPs between a query and its homologous PPIs (Figure 3C). To evaluate the transferability of DDPs between a query and its homologous PPIs, we used the SRD [Equation (5)]. By excluding proteins without domain annotations from the query set, 103,762 PPIs are used to evaluate the transferability (SRD) of conserved DDPs between these query PPIs and their respective homologous PPIs (Figure 3C).

Figure 3C shows the relationship between the conservation ratios (CRD) of DDPs and the SRD ratios. The SRD ratio increases significantly (solid lines) as the CRD increases while CRD ≤ 0.6. Conversely, the number of DDPs derived from 103,762 PPI families decreases (dotted lines) as the CRD increases. If the CRD is set to 0.6 and the joint E-value is set to 10^-40 (green lines), the SRD is 0.88 and the number of DDPs is 252,728. This result demonstrates that members of a PPI family reliably share DDPs (or interacting domains). Additionally, similar results were obtained for the transferability of conserved functions between homologous PPIs and the query (Figure 3B).

These results reveal that PPI families achieve a high SRD with a reasonable number of DDPs when the joint E-value is set to 10^-40. In summary, these experimental results demonstrate that the PPISearch server achieves high agreement on MFPs and DDPs between the query and its respective homologous PPIs.

3.1.3 Conservation of interacting domains in PPI families

The two lines of evidence above were obtained by sequence-based searching. As an increasing amount of structural data has become available (e.g. protein complexes in PDB), we used structure-based views to examine the concept of homologous PPIs. In this section, we observed the conservation of interacting domains in PPI families, based on the assumption that the members of a PPI family have similar interacting domains.

First, we collected a data set of protein complexes. Each complex was composed of two protein chains (i.e. a heterodimer or homodimer) and was recorded in both (1) the annotated PPI database (290,137 PPIs) and (2) the iPfam database. iPfam is a database []. After selecting protein complexes from PDB, a data set of 1,014 complexes (in other words, 1,014 PPIs) was constructed.

Figure 4 illustrates how we calculated the conservation of interacting domains in PPI families. Proteins A and B form an interacting protein pair with two physical domain-domain interactions (DDIs). If a pair of homologs A' and B' (E-value ≤ 10^-10) kept ≥ 1 DDI, we considered that pair to have interacting domains similar to those of the query pair A-B. We compared two partitions, "PPI families" and "Non PPI families". The "PPI families" partition consisted of PPIs with JE ≤ 10^-40, for example, the PPIs circled in blue. Conversely, the "Non PPI families" partition consisted of PPIs with JE > 10^-40.
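The "keeping ≥ 1 DDI" test can be sketched as a simple set check. The function name `keeps_ddi` and the domain identifiers below are hypothetical, and the sketch assumes that each protein's domain annotations and the query's DDIs are already available as sets.

```python
def keeps_ddi(query_ddis, domains_a, domains_b):
    """Return True if the candidate pair A'-B' retains at least one DDI of the query.

    query_ddis: set of (domain_on_A, domain_on_B) pairs observed for the query A-B.
    domains_a, domains_b: sets of domain identifiers annotated on A' and B'.
    """
    return any(da in domains_a and db in domains_b for da, db in query_ddis)


# Hypothetical domain identifiers, for illustration only.
query_ddis = {("D1", "D3"), ("D2", "D4")}

kept = keeps_ddi(query_ddis, {"D1", "D5"}, {"D3"})  # D1-D3 is retained
lost = keeps_ddi(query_ddis, {"D1"}, {"D4"})        # D1 and D4 never interact in the query
```

Note that the second candidate carries one domain from each of the two query DDIs but no complete interacting pair, so it is marked "No" under this reading.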

Figure 4. Illustration of how we calculated the conservation of interacting domains in PPI families. The rectangles colored green, blue, light blue, yellow, and red represent domains. The query interacting protein pair A-B has two DDIs. Each pair of A' and B' homologs is marked "Yes" (keeping ≥ 1 DDI of the query PPI A-B) or "No" (keeping no DDI of the query PPI A-B).

Figure 5 shows the results of observing the conservation of interacting domains in PPI families. In the "PPI families" set, the number of PPIs keeping ≥ 1 DDI (11,060 PPIs) was 2.35-fold that of PPIs keeping no DDI (4,699 PPIs). In comparison, in the "Non PPI families" set, the number of PPIs keeping no DDI (27,264 PPIs) was 1.74-fold that of PPIs keeping ≥ 1 DDI (15,653 PPIs).
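The fold values quoted above follow directly from the four reported counts. The short check below recomputes them; the dictionary layout and function names are illustrative only.

```python
# Counts reported in the text for the two partitions.
counts = {
    "PPI families": {"keeping": 11_060, "not_keeping": 4_699},
    "Non PPI families": {"keeping": 15_653, "not_keeping": 27_264},
}


def keep_ratio(c):
    """Ratio of DDI-keeping PPIs to non-keeping PPIs."""
    return c["keeping"] / c["not_keeping"]


def keep_fraction(c):
    """Fraction of PPIs in the partition that keep >= 1 DDI."""
    return c["keeping"] / (c["keeping"] + c["not_keeping"])


fam = counts["PPI families"]
non = counts["Non PPI families"]

fold_fam = round(keep_ratio(fam), 2)      # keeping vs. not keeping, PPI families
fold_non = round(1 / keep_ratio(non), 2)  # not keeping vs. keeping, Non PPI families
```

Expressed as fractions, roughly 70% of the PPIs in "PPI families" keep at least one DDI, against roughly 36% in "Non PPI families".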

[Figure 5 plot. Axes: type (PPI families, Non PPI families); number of PPIs (0 to 30,000). Series: keeping DDI, non-keeping DDI.]

Figure 5. Conservation of interacting domains in PPI families. In the "PPI families" set, 11,060 PPIs keep ≥ 1 DDI (blue) and 4,699 PPIs keep no DDI (red). In comparison, in the "Non PPI families" set, 27,264 PPIs keep no DDI (red) and 15,653 PPIs keep ≥ 1 DDI (blue).

These results indicate that interacting domains are more highly conserved in PPI families than in non-PPI families. In Section 3.4 (Discussion), we give a possible reason why 4,699 PPIs in the "PPI families" set keep no DDI.

3.1.4 Conservation of binding model in PPI families

After obtaining the evidence for the conservation of interacting domains, we were interested in observing the similarity of structural binding interfaces within PPI families. This idea derives from the assumption that the members of a PPI family have similar structural binding interfaces between the two protein partners of each PPI. We have developed 3D-partner, a web tool that predicts interacting partners and binding models of a query protein sequence through structure complexes and a new scoring function.

For this purpose, we collected from PDB a data set of 517 heterodimers, because 3D-partner was developed based on protein structures of heterodimers. As in Section 3.1.3, we compared the two subsets "PPI families" (4,998 PPIs) and "Non PPI families" (9,102 PPIs). The results of the comparison are shown in Figure 6. We used a Z-score to measure the similarity of binding models and, with a threshold, to identify interacting partners of the query. The proportion of true positives rises as a higher Z-score threshold is used. The P value of a t-test between the Z-scores of the two subsets is less than 10^-30, which indicates a significant difference between the two subsets.

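As a rough sketch of the kind of two-sample comparison described above, the snippet below computes a t statistic from two lists of Z-scores using only the Python standard library. The text does not say which t-test variant was used; Welch's unequal-variance form is assumed here, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances assumed)."""
    na, nb = len(sample_a), len(sample_b)
    standard_error = sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical Z-scores, for illustration only (not data from this study).
family_z = [2.1, 3.4, 1.8, 4.0, 2.9]
non_family_z = [0.2, -0.5, 1.1, 0.4, -1.0]
t_stat = welch_t(family_z, non_family_z)
```

A large positive statistic means the first subset has the higher mean Z-score; converting it to a P value additionally requires the t distribution, which a statistics package would supply.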
[Figure 6 plot. Axes: similarity of binding model (Z-score), binned from -4 to >7; number of PPIs (0 to 3,500). Series: PPI families, Non PPI families. Annotation: -log(JE) > 40.]

Figure 6. Distribution of Z-score (i.e. similarity of binding models) in two subsets of "PPI families" (blue) and "Non PPI families" (red). There are 4,998 and 9,102 PPIs in the two subsets, respectively.

In addition, we were interested in why many PPIs in the “PPI families” subset have low similarity to the query PPIs. For this purpose, we used a descriptor, aligned contact residue identity (CI), to observe the similarity of binding interfaces in a PPI family. Figure 7 illustrates how we calculated CI values.
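The CI calculation can be sketched as the fraction of contact positions at which an aligned sequence carries the same residue as the reference chain. The three sequences below are the aligned segments shown in Figure 7; the contact positions are read off the contact-residue pattern in that figure, which is an assumption about the figure layout, and the function name is hypothetical. Because the figure labels cannot be matched to rows unambiguously here, the two aligned sequences are simply called homolog 1 and homolog 2.

```python
def contact_identity(reference, aligned, contact_positions):
    """Fraction of contact positions where the aligned residue equals the reference."""
    if len(reference) != len(aligned):
        raise ValueError("sequences must come from the same alignment block")
    identical = sum(reference[i] == aligned[i] for i in contact_positions)
    return identical / len(contact_positions)


# Aligned segments as shown in Figure 7 (reference: 3FKS chain T).
reference = "VFGLNNIQAEESGVKGMALNLEPGQVG"
homolog_1 = "VFGLNNLQAEELVEFGMALNLEPGQVG"
homolog_2 = "AHGLDNVMSGENAVMGMALNLEENNVG"

# 0-based positions of the 15 contact residues (GLNNIQAE, LNLEP, QV); assumed from the figure.
contacts = list(range(2, 10)) + list(range(18, 23)) + [24, 25]

ci_1 = round(contact_identity(reference, homolog_1, contacts), 2)  # 14/15
ci_2 = round(contact_identity(reference, homolog_2, contacts), 2)  # 8/15
```

These reproduce the two values given in Figure 7, CI = 14/15 = 0.93 and CI = 8/15 = 0.53.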

We selected the PPIs in the "PPI families" subset for observation (Figure 8) and example analysis. Figure 8 shows the distributions of CI values for PPIs with Z-score ≥ 1.96 (i.e. 95% confidence interval) and for those with Z-score < 1.96. We found that 93.5% of PPIs with Z-score < 1.96 have CI = 0. In other words, the search method we currently use can return PPIs whose binding models differ from that of the query PPI. In Section 3.4 (Discussion), we give a possible reason for this observation.
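The cutoff of 1.96 is the familiar critical value of the standard normal distribution: about 95% of the distribution lies within ±1.96. The check below uses the Python standard library and assumes the Z-scores are approximately standard normal, which the text does not state explicitly.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Value below which 97.5% of the distribution lies (two-sided 95% cutoff).
two_sided_95 = std_normal.inv_cdf(0.975)

# Probability mass above 1.96 in the upper tail.
upper_tail = 1 - std_normal.cdf(1.96)
```

Under that assumption, only about 2.5% of unrelated pairs would be expected to score above the cutoff by chance.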

[Figure 7 alignment. Contact residues on 3FKS chain T at the interface with 3FKS chain S: GLNNIQAE LNLEP QV. Aligned sequences (labels shown: 3FKS:T, P09219, P0ABB0): VFGLNNIQAEESGVKGMALNLEPGQVG; VFGLNNLQAEELVEFGMALNLEPGQVG; AHGLDNVMSGENAVMGMALNLEENNVG. Values shown: CI = 14/15 = 0.93 and CI = 8/15 = 0.53.]

Figure 7. An example of how to calculate CI values. The 15 residues colored by red are part of contact residues (on 3FKS chain T) in the interacting interface between 3FKS chains T and chain S. The underlined residues in the aligned sequences P09219 and P0ABB0 are the residues which are identical to the contact residues on 3FKS chain T. In this case, CI of

[Figure 8 plot. Axes: contact residue identity (CI), 0 to 1 in steps of 0.1; number of PPIs (0 to 2,500). Series: Z-score < 1.96, Z-score > 1.96.]

Figure 8. Distribution of CI values in PPIs with Z-score ≥ 1.96 (blue) and Z-score < 1.96 (red).

3.2 Input, output, and options of the PPISearch server

PPISearch is an easy-to-use web server (Figure 9). Users input a pair of protein sequences in FASTA format or as UniProt IDs, and choose E-value thresholds for homologs and for homologous PPIs (Figure 9A). In addition, users can set the CRD and CRF thresholds, specific species, and the number of homologous PPIs per species.

To evaluate the usefulness of the PPISearch server for the discovery of homologous PPIs and for the annotation of a query protein pair, we selected two query protein sets, termed HOM and ORT. HOM and ORT are used to assess PPISearch performance and to determine the threshold of the joint E-value JE (Figure 3A). In addition, the HOM set was applied to infer the relations between the conservation ratios (CRD and CRF, defined in Chapter 2) and the transferability of DDPs and MFPs, respectively, between a query and its homologous PPIs (Figure 3B and C). The HOM set includes all 290,137 PPIs, and the ORT set has 6,597 orthologous PPI families (14,571 PPIs) derived from the annotated PPI database and the PORC orthology database.

HOM and ORT were used to assess the PPISearch server in identifying homologous PPIs and orthologous PPIs, respectively, by searching the annotated PPI database (290,137 PPIs with 54,422 proteins). Figure 3A shows the relationships between the joint E-value JE and the numbers of orthologous PPIs (black) and homologous PPIs (red). The orthologous PPIs often have the same functions and domains. When JE ≤ 10^-40, the number of orthologous PPIs decreases significantly; conversely, the number of homologous PPIs decreases more gradually than that at JE ≥ 10^-40. This result shows that the proposed method is able to identify 98.2% of orthologous PPIs with a reasonable number of homologous PPIs when JE ≤ 10^-40.

To evaluate the transferability of DDPs and MFPs between a query and its homologous PPIs, we used the SRD [Equation (3)] and SRF [Equation (5)]. The HOM set is used to evaluate the utility of the PPISearch server in annotating the query protein pair. By excluding proteins without domain annotations from the query set, 103,762 PPIs are used to evaluate the transferability (SRD) of conserved DDPs between these query PPIs and their respective homologous PPIs (Figure 3C). The transferability (SRF) of conserved functions between the 106,997 PPIs and their homologous PPIs is assessed by excluding proteins without GO molecular function terms from the original query set (Figure 3B).

Figure 3C shows the relationship between the conservation ratios (CRD) of DDPs and the SRD ratios. The SRD ratio increases significantly (solid lines) as the CRD increases while CRD ≤ 0.6. Conversely, the number of DDPs derived from 103,762 PPI families decreases (dotted lines) as the CRD increases. If the CRD is set to 0.6 and the joint E-value is set to 10^-40 (green lines), the SRD is 0.88 and the number of DDPs is 252,728. This result demonstrates that members of a PPI family derived by PPISearch reliably share DDPs (or interacting domains). Additionally, similar results were obtained for the transferability of conserved functions between homologous PPIs and the query (Figure 3B). The members of a PPI family have similar molecular functions, and SRF ratios are highly correlated with the conservation ratios (CRF) of MFPs. When the CRF is 0.6 and the joint E-value is 10^-40 (green lines), the SRF is 0.69 and the number of MFPs is 454,251.

These results reveal that the PPISearch server achieves a high SRD with a reasonable number of DDPs when the joint E-value is set to 10^-40. In summary, these experimental results demonstrate that this server achieves high agreement on DDPs and MFPs between the query and its respective homologous PPIs.

Typically, the PPISearch server yields homologous PPIs within 20 seconds when the sequence length is ≤ 350 (Figure 9B). This server identifies homologous PPIs in multiple species; conservations and GO annotations of protein functions; conservations and annotations of DDPs; and the best-matched protein pairs of the query (Figure 9C). Additionally, the PPISearch server provides multiple sequence alignments of homologous PPIs and indicates the conserved residues based on amino acid types. For each homologous PPI, this server shows the alignments and experimental annotations (e.g. interaction types, experimental methods, gene names and GO terms).

Figure 9. The PPISearch server search results using proteins MIX-1 and SMC-4 of Caenorhabditis elegans as the query. (A) The user interface for assignments of query protein sequences and E-value thresholds of homologs and homologous PPIs. (B) Homologous PPIs of MIX-1–SMC-4 in multiple species and public databases. (C) Conserved protein functions (GO terms) and domain-domain pairs (Pfam and InterPro) of homologous PPIs with a conservation

3.3 Example analysis of homologous PPI search

3.3.1 σ1A-adaptin and γ1-adaptin

Figure 2C and D show search results using σ1A-adaptin (UniProt accession number: P61967) and γ1-adaptin (P22892) of Mus musculus as the query. These two proteins are components of the heterotetrameric adaptor protein complex 1 (AP-1), which mediates clathrin-coated vesicle transport from the trans-Golgi network to the endosome [26]. According to the crystal structure (PDB code 1W63) [27], this protein pair interacts physically, but the interaction is not recorded in the annotated PPI database. For this query, the PPISearch server identifies 14 homologous PPIs, forming a PPI family, from four species (human, mouse, fruit fly and yeast). This PPI family has four DDPs (Figure 2E): PF01217-PF01602 (CRD is 1.0), PF01217-PF02883 (0.93), PF01217-PF02296 (0.14) and PF01217-PF07718 (0.07). The two DDPs with the highest CRD ratios (PF01217-PF01602 and PF01217-PF02883) are the domain compositions of the query, and PF01217-PF01602 is the interacting domain pair [27].
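The four CRD values above are consistent with reading the CRD of a DDP as the fraction of the 14 family members that contain that domain pair. The formal definition is given in Chapter 2, so this reading, and the per-DDP member counts below, are assumptions chosen to reproduce the reported ratios.

```python
family_size = 14  # homologous PPIs reported for this query

# Assumed number of family members containing each DDP (not stated in the text).
members_with_ddp = {
    "PF01217-PF01602": 14,
    "PF01217-PF02883": 13,
    "PF01217-PF02296": 2,
    "PF01217-PF07718": 1,
}

# Conservation ratio per DDP under the assumed definition, rounded as in the text.
crd = {ddp: round(n / family_size, 2) for ddp, n in members_with_ddp.items()}

# DDPs passing a conservation threshold of 0.6.
conserved = [ddp for ddp, ratio in crd.items() if ratio >= 0.6]
```

Under this reading, only the first two DDPs would pass a CRD threshold of 0.6.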

This server allows users to choose the JE threshold of homologous PPIs. For example, when JE is set to 10^-100 (the default value is 10^-40), the number of homologous PPIs decreases from 14 to 10 by filtering out the last four PPIs (Figure 2D). These 10 homologous PPIs consistently include the two DDPs PF01217-PF01602 and PF01217-PF02883, each with a CRD of 1.0.

Furthermore, users can choose the best match or the number of homologous PPIs in a species. In this manner, the PPISearch server is able to select the primary homologous PPIs of each species for specific applications, such as evolutionary analysis of essential proteins.

3.3.2 MIX-1 and SMC-4

Mitotic chromosome and X-chromosome-associated protein (MIX-1, Q09591) and structural maintenance of chromosomes protein 4 (SMC-4, Q20060) of Caenorhabditis elegans are members of the SMC protein family and are required for mitotic chromosome segregation [28]. Both MIX-1 and SMC-4 are essential components of the condensin complex, which converts interphase chromatin into mitotic-like condensed chromosomes [28, 29]. Using C. elegans MIX-1 and SMC-4 as the query protein pair with JE set to 10^-40, the PPISearch server found seven homologous interactions in the annotated PPI databases (Figure 9B). These seven homologous PPIs are consistently SMC–SMC protein interactions, including SMC-2–SMC-4, SMC-3–SMC-4 and SMC-2–SMC-1, in four species. Among these homologous PPIs, two, Q95347-Q9NTJ3 (Homo sapiens) and P38989-Q12267 (Saccharomyces cerevisiae), are orthologous interactions of the query MIX-1–SMC-4 [23].

These seven homologous PPIs of MIX-1 and SMC-4 include 136 GO term pairs. Among these, the CRF ratios of four GO MF term pairs and two GO BP term pairs exceed 0.6 (Figure 9C). These six GO term pairs are consistent with the term-pair combinations of MIX-1 and SMC-4. For example, MIX-1 and SMC-4 have the same two GO MF annotations, protein binding (GO:0005515) and ATP binding (GO:0005524). Additionally, these seven homologous PPIs contain four DDPs with CRD ratios of 1.0. These four DDPs, PF02463-PF02463, PF06470-PF02463, PF02463-PF06470 and PF06470-PF06470, are recorded in iPfam [20] and are consistent with the query pair. The hinge-hinge interaction (PF02463-PF02463) has been demonstrated experimentally and is conserved in the eukaryotic SMC-2–SMC-4 heterodimer [30]. These analytical results reveal that the PPISearch server is able to identify homologous PPIs that share conserved DDPs and MFPs with the query.

3.4 Discussion

3.4.1 Example analysis for giving more insights into PPI family

In the preceding sections, we introduced the concept of homologous PPIs and gave statistical evidence and biological examples to support it. As a next step, we will provide more evidence to verify the homologous PPIs identified by our methodology. We will examine this issue from four views: (1) domain composition of PPIs, (2) biological functions of PPIs, (3) the locations of PPIs in pathways, and (4) PPIs in manually curated complexes. In other words, we assume that if a PPI is “homologous” to another PPI, the two have the same specific function and interacting domains, and are experimentally identified in the same pathway and/or in the same protein complex.

The first two views have already been used to evaluate the concept of homologous PPIs through Pfam annotations and GO terms. Currently, we are starting to gain insights into the homology of PPIs through the last two views. As a preliminary test of our assumption, we use components of the transforming growth factor β (TGF-β) system as an example.
