
PCFAMILY: A WEB SERVER FOR SEARCHING HOMOLOGOUS PROTEIN COMPLEXES

The proteins in a cell often assemble into complexes to carry out their functions and play an essential role in biological processes. The PCFamily server identifies template-based homologous protein complexes (called a protein complex family) and infers functional modules of the query proteins. The server first finds homologous structure complexes of the query by using BLASTP to search the structural template database (11,263 complexes). PCFamily then searches for the homologous complexes of the templates (query) in a complete genomic database (Integr8, with 6,352,363 protein sequences in 2,274 species). From these homologous complexes across multiple species, the server infers binding models (e.g. hydrogen bonds and conserved amino acids in the interfaces), functional modules, and the conserved interacting domains and Gene Ontology annotations of the protein complex family.

Experimental results demonstrate that the PCFamily server is useful for visualizing binding models and annotating the query proteins. We believe that the server can provide valuable insights for determining functional modules of biological networks across multiple species. The PCFamily server is available at http://pcfamily.life.nctu.edu.tw.

3-1. Introduction

Protein complexes are fundamental units of macromolecular organization and their composition is also known to vary according to cellular requirements 7. To identify and characterize the protein complexes, genome-scale interaction discovery approaches, such as two-hybrid system or affinity purification 70,71, have been proposed. However, these methods

protein-protein interactions (PPI) 8,9,28,29 and structure complexes 31, previous studies have suggested that the total number of protein-protein interaction types is limited (~10,000 types) 72 and that the quaternary structures (QS) can be clustered into 3,151 QS families 73.

A known three-dimensional (3D) structure complex provides the physical protein interaction topology, the interacting domains, and atomically detailed binding models of the interactions. Recently, some studies utilized template-based methods (i.e. comparative modeling 32 and fold recognition 33), which search a 3D-complex library, to model a large set of yeast complexes 45,74. However, these methods are too time-consuming for searching a large complete genomic database with many species (e.g. Integr8) 30 for all possible homologous PPIs or complexes, which are useful for exploring the interface evolution of a specific 3D structure complex.

To address these issues, we substantially enhanced and modified both the PPI family search (sequence-based PPI search method 23) and 3D-domain interologs with the template-based scoring function (3D-template PPI prediction method 39). To our knowledge, PCFamily is the first public server that identifies homologous complexes (≥ two proteins) and the module evolution of the query. For a set of query protein sequences, this server provides the template-based homologous complexes (called a protein complex family (PCF)) in multiple species, graphic visualization of conserved interacting residues and binding models (interfaces), conserved Gene Ontology (GO) annotations 36, and interacting domains. Our results demonstrate that this server achieves high agreement on interacting domains and GO annotations between query proteins and their respective homologous complexes.

3-2. Method and Implementation

Figure 3-1. Overview of the PCFamily server for homologous complexes search using proteins Skp1, Skp2, and Cks1 of Rattus norvegicus as the query.

(A) The main procedure. (B) Identify the template candidate (PDB code 2ast) of the query using BLASTP and template-based scoring function to scan the structural template database. (C) The topology of the template. (D) The homologous PPI families of interfaces A-B and B-C of the template searching on Integr8 database. (E) Template-based homologous complexes of the query.

Step labels shown in Figure 3-1A:

Step 1: Query a set of protein sequences.

Step 2: Search the template candidates with the complex similarity (joint Z-value ≥ 3).

Step 3: Identify the template-based homologous PPI families (A-B and B-C) of the interfaces of chains A-B and B-C, respectively, using interface similarity Z-values ≥ 3 from the complete genomic database (Integr8).

Step 5: Measure the conservation ratios of domain compositions and GO annotations. The server provides multiple binding models and conserved interacting residues of these homologous complexes.

Step 6: Output homologous complexes and the interfaces A-B and B-C with joint Z-values (JZ) ≥ 3.

Figure 3-1 details how the PCFamily server searches the template-based homologous complexes (PCF) of a set of query protein sequences (Fig. 3-1A). First, the server uses BLASTP to search for template candidates in the structural template database (11,263 structure complexes selected from the Protein Data Bank (PDB)). It then applies the template-based scoring function 39 to statistically evaluate the complex similarity (joint Z-value ≥ 3.0) between the query proteins and the candidates (Figs. 3-1B and 3-1C). After a template is selected, the server searches the PPI family of each interface of the template with Z-value ≥ 3.0 from a complete genomic database (Integr8 version 103, containing 6,352,363 protein sequences in 2,274 species) 30 (Figs. 3-2A and 3-1D). These PPI families are combined into homologous complexes with significant complex similarity (joint Z-value ≥ 3.0) according to the interfaces of the 3D-complex template (Fig. 3-1E). For the PCF including the query, the server measures the conservation ratio (CR) of the domain composition (DC) and the CRs of biological processes (BP), cellular components (CC), and molecular functions (MF) using Gene Ontology annotations. Finally, the server provides the homologous complexes; a graphic visualization of the complex topology; detailed residue interactions and interface alignments across multiple species (Fig. 3-2); and the conservation of GO annotations and DCs.
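The combination of PPI families into candidate complexes can be pictured as a join over the shared chain. The following Python sketch assumes a trimer template with interfaces A-B and B-C, that each PPI family is a list of (species, first protein, second protein, Z-value) records, and that a candidate complex must reuse the same B homolog within one species. The record layout, the function name, and the protein labels are hypothetical and are not taken from the PCFamily implementation. The sketch only enumerates candidates; the joint Z-value filter would be applied afterwards.

```python
from collections import defaultdict
from typing import NamedTuple


class HomologousPPI(NamedTuple):
    species: str
    first: str      # homolog of the first template chain of the interface
    second: str     # homolog of the second template chain of the interface
    z_value: float  # interface similarity Z-value


def combine_families(family_ab, family_bc, z_min=3.0):
    """Join the A-B and B-C PPI families on the shared B homolog."""
    # Index B-C members by (species, B homolog) for fast lookup.
    by_b = defaultdict(list)
    for ppi in family_bc:
        if ppi.z_value >= z_min:
            by_b[(ppi.species, ppi.first)].append(ppi)

    complexes = []
    for ab in family_ab:
        if ab.z_value < z_min:
            continue  # interface A-B is not significantly similar
        for bc in by_b.get((ab.species, ab.second), []):
            complexes.append(
                (ab.species, ab.first, ab.second, bc.second,
                 (ab.z_value, bc.z_value))
            )
    return complexes


# Hypothetical toy families, for illustration only.
family_ab = [
    HomologousPPI("species_1", "A1", "B1", 4.2),
    HomologousPPI("species_2", "A2", "B2", 3.5),
    HomologousPPI("species_2", "A3", "B2", 2.1),  # below the Z-value cutoff
]
family_bc = [
    HomologousPPI("species_1", "B1", "C1", 3.8),
    HomologousPPI("species_2", "B9", "C2", 5.0),  # different B homolog
]
candidates = combine_families(family_ab, family_bc)
print(candidates)
```

Indexing one family by the shared chain keeps the join linear in the family sizes, which matters when a family spans thousands of species.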

Figure 3-2. Binding models and multiple sequence alignments of PPI family in Skp1-Skp2-Cks1 complex (PDB code 2ast).

(A) The atomic binding model with hydrogen bonds (red dashed lines) for each interface of the template. (B) Multiple sequence alignments of the PPI family of the interface A (Skp1)-B (Skp2).

Homologous complex

The concept of homologous complex (≥ two proteins) is extended from homologous PPIs 23 and 3D-domain interologs with template-based scoring function 39. Here, we used a 3D-trimer template T (proteins A, B, and C) with two interfaces A-B and B-C as a simple case to define the homologous complex of T as follows: (1) A', B' and C' are the homologous proteins of A, B, and C, respectively, with significant sequence similarity (BLASTP E-values ≤ 10⁻¹⁰) 40,44; (2) A'-B' and B'-C' are the template-based homologous PPIs of A-B and B-C, respectively, with significant interface similarity (Z-value ≥ 3.0) 39; (3) there is significant complex similarity (joint Z-value ≥ 3.0) between complexes A'-B'-C' and A-B-C. The joint Z-value of the complex similarity is defined as

J_z = [combining operator illegible in the source]_{i=1}^{n} Z_i    (1)

where n is the number of interfaces of a template (T); Zi is the Z-value (interface similarity) of the template-based homologous PPI i (e.g. A'-B') based on the template interface (e.g. A-B).

Here, JZ ≥ 3.0 is considered as significant similarity according to the statistical analysis of 941 3D-structure complexes with 2,138,123 homologous complexes.
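The three criteria of a homologous complex can be checked in sequence, as in the minimal Python sketch below. The function and constant names are hypothetical. The rule that combines the interface Z-values into a joint Z-value is passed in as a function, and the geometric mean used as the default is purely an illustrative assumption, not a restatement of Equation (1).

```python
import math

E_VALUE_MAX = 1e-10   # criterion (1): BLASTP E-value cutoff
Z_MIN = 3.0           # criterion (2): interface similarity cutoff
JOINT_Z_MIN = 3.0     # criterion (3): complex similarity cutoff


def geometric_mean(z_values):
    """Illustrative combining rule only; an assumption, not Equation (1)."""
    return math.prod(z_values) ** (1.0 / len(z_values))


def is_homologous_complex(e_values, z_values, joint_z=geometric_mean):
    """Apply criteria (1)-(3) to one candidate complex.

    e_values: BLASTP E-values of A', B', C' against A, B, C.
    z_values: interface Z-values, one per template interface.
    joint_z:  function that combines the interface Z-values.
    """
    if any(e > E_VALUE_MAX for e in e_values):
        return False  # some protein is not a significant homolog
    if any(z < Z_MIN for z in z_values):
        return False  # some interface is not significantly similar
    return joint_z(z_values) >= JOINT_Z_MIN


# Hypothetical values, for illustration only.
accepted = is_homologous_complex([1e-50, 1e-32, 1e-15], [4.1, 3.6])
rejected = is_homologous_complex([1e-50, 1e-5, 1e-15], [4.1, 3.6])
print(accepted, rejected)
```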

Template-based scoring function

We have recently proposed a template-based scoring function to determine the reliability of a PPI derived from a 3D-dimer structure 39. For a predicted template-based PPI, this scoring function assigns a score that combines residue-residue interacting scores, which consist of the steric (Evdw) and hydrogen-bond (ESF) energies, and sequence consensus scores, which consist of the couple-conserved residue score (Econs) and the contact-residue similarity score (Esim). Finally, we calculate the Z-value of the score for this PPI using the mean and standard deviation of the scores of 10,000 random interfaces generated by mutating 60% of the interface residues.
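The standardization step can be illustrated with a short Python sketch. It assumes that a higher score is better and that mutated residues are drawn uniformly from the 20 standard amino acids; the source does not state either point. The scoring function here is a toy identity count standing in for the real energy and consensus terms, and the interface string is hypothetical.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_interface(residues, fraction=0.6, rng=random):
    """Replace a given fraction of interface residues with random amino acids."""
    mutated = list(residues)
    n_mut = round(fraction * len(mutated))
    for pos in rng.sample(range(len(mutated)), n_mut):
        mutated[pos] = rng.choice(AMINO_ACIDS)
    return "".join(mutated)


def z_value(score, background_scores):
    """Standardize a score against the random-interface background."""
    mean = statistics.fmean(background_scores)
    std = statistics.pstdev(background_scores)
    return (score - mean) / std


def toy_score(candidate, template):
    """Hypothetical stand-in for the real scoring function: identity count."""
    return sum(a == b for a, b in zip(candidate, template))


rng = random.Random(0)
template = "QWENSYTRKDLLGHAVCFMP"  # hypothetical interface residues
background = [
    toy_score(mutate_interface(template, 0.6, rng), template)
    for _ in range(10_000)
]
z = z_value(toy_score(template, template), background)
print(round(z, 2))
```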

Annotations of homologous complexes

A 3D-complex template and its homologous complexes can be considered as a PCF. The concept of the PCF is analogous to the notions of protein sequence family 20, protein structure family 22 and PPI family 23. We believe that PCFs can be applied widely in biological investigations. We assume that the members of a PCF are conserved on GO annotations, interacting domain(s) and binding model(s). Using these conservations of a PCF, the PCFamily server can annotate the GO terms (BP, CC, and MF) and DCs of query proteins. To statistically evaluate the agreement of GO terms and DCs between the template and its PCF (with N homologous complexes), we define the agreement ratio (AR) using the conservation ratio (CR=Na/N), where Na is the number of homologous complexes with the same GO term (or DC) in a PCF. The AR is given as

AR = \sum_{i \in Q} \left( A_i(CR \ge c) \,/\, T_i(CR \ge c) \right)    (2)

where Q is a set of query templates; Ti (CR≥ c) is the total number of the GO terms (or DCs) of template i when CR ≥ c; Ai (CR ≥ c) is the number of the agreement GO terms (or DCs) of template i when CR ≥ c.
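As a small worked illustration of the conservation ratio CR = Na/N, the Python sketch below counts how many complexes in a family carry each term and then derives the per-template counts used in the agreement ratio. It adopts one possible reading, stated here as an assumption: T_i counts the terms of the family with CR ≥ c, and A_i counts those that the template also carries. The function names and the term labels are hypothetical.

```python
from collections import Counter


def conservation_ratios(pcf_annotations):
    """Return CR = Na / N for every term seen in a protein complex family.

    pcf_annotations: one set of terms (GO terms or domain compositions)
    per homologous complex in the family.
    """
    n = len(pcf_annotations)
    counts = Counter(term for terms in pcf_annotations for term in set(terms))
    return {term: na / n for term, na in counts.items()}


def agreement_counts(template_terms, pcf_annotations, c=0.6):
    """Return (A_i, T_i) for one template at conservation cutoff c."""
    cr = conservation_ratios(pcf_annotations)
    conserved = {term for term, ratio in cr.items() if ratio >= c}
    agreed = conserved & set(template_terms)
    return len(agreed), len(conserved)


# Hypothetical family of four homologous complexes.
pcf = [
    {"T1", "T2"},
    {"T1", "T2"},
    {"T1"},
    {"T1", "T3"},
]
cr = conservation_ratios(pcf)
a_i, t_i = agreement_counts({"T1"}, pcf, c=0.5)
print(cr, a_i, t_i)
```

In this toy family, T1 occurs in all four complexes (CR = 1.0), T2 in two (CR = 0.5), and T3 in one (CR = 0.25).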

3-3. Input, Output and Options

PCFamily is an easy-to-use web server (Fig. 3-3). Users input one protein sequence or a set of protein sequences in FASTA format, or a 3D protein complex structure (PDB code) (Fig. 3-3A). Typically, the PCFamily server yields structural template candidates within 25 seconds when the query consists of three sequences of ≤ 450 amino acids each (Fig. 3-3B). For the query, the server shows the template candidate and its PCF; detailed atomic interactions of the interfaces and binding models, visualized with Jmol 75; the protein interaction topology (Fig. 3-3C); multiple sequence alignments (MSA) with hydrogen-bond residues and conserved residues (Fig. 3-3D); and the CRs of DCs and GO terms (BP, CC and MF) (Fig. 3-3E).

Figure 3-3. The PCFamily server search results using proteins Epor, Epo, and Epor of Mus musculus as the query.

(A) The user interface for inputting the query protein sequences or PDB code. (B) The template candidate of the query. (C) The numbers of conserved domains and GO term conservations, interfaces, protein interaction topology, and homologous complexes of the query (selected template). (D) Multiple sequence alignments and interacting residue conservations of homologous PPIs of the interface A (Epo)-B (Epor). (E) Conserved domain and GO term compositions of the protein complex family.


3-4. Example Analysis

The complex of Skp1, Skp2, and Cks1

Figure 3-1 shows the search results using S-phase kinase-associated protein 1 (Skp1, UniProt accession number: Q6PEC4), S-phase kinase-associated protein 2 (Skp2, B2GUZ0), and RGD1561797 protein (Cks1, B2RZ99) of Rattus norvegicus as the query. Skp1 and Skp2 are subunits of the SCFSkp2 ubiquitin ligase complex that regulates proteolysis of the p27Kip1 protein in cell cycle progression 76,77. Recognition and ubiquitination of p27Kip1 by the SCFSkp2 ubiquitin-ligase complex requires the accessory protein Cks1 76. According to the KEGG pathway database 78, Skp1-Skp2 and Skp2-Cks1 in Rattus norvegicus are recorded in the ubiquitin mediated proteolysis pathway and the small cell lung cancer pathway, respectively. For this query, the PCFamily server found the template candidate (PDB code 2ast 76) (Fig. 3-1C) and 43 homologous complexes (called the SCF complex family) from nine species (e.g. Homo sapiens, Rattus norvegicus, and Bos taurus) (Fig. 3-1E). Among these 43 homologous complexes, one complex (Homo sapiens) is recorded in the IntAct database 8, and three homologous complexes are recorded in the KEGG pathway database: the query in Rattus norvegicus, Q9WTX5 (Skp1)-Q9Z0Z3 (Skp2)-P61025 (Cks1b) in Mus musculus, and Q3ZCF3 (SKP1)-A7MB09 (SKP2)-Q0P5A5 (CKS1B) in Bos taurus. In addition, six members are Skp1-Skp2-Cks1b (or Cks2) complexes, which are closely related to the query and the template. All members of this PCF have the same DC PF01466 (Skp1)-PF00646 (F-box)-PF01111 (CKS) and a highly conserved DC PF03931 (Skp1_POZ)-PF00646-PF01111 (CR=0.95). The query proteins contain both DCs (Fig. 3-1E).

The PCFamily server provides the binding model and MSAs of each interface (Figs. 3-2 and 3-4) based on the template. Interface A-B (Fig. 3-2A) contains three main hydrogen bonds: Gln1097-Trp2097, Glu1156-Tyr2128, and Asn1157-Ser2121. These six residues are conserved in mammals (Fig. 3-2B). Additionally, PCFamily identifies six sidechain-sidechain hydrogen bonds that form a network stabilizing the interface B-C 76 (Fig. 3-4). The interacting residues that form hydrogen bonds are often highly conserved and are useful for observing the interface evolution across multiple species.

Figure 3-4. Binding models and multiple sequence alignments of PPI family in Skp1-Skp2-Cks1 complex (PDB code 2ast).

(A) The atomic binding model with hydrogen bonds (red dash lines) for each interface of the template. (B) Multiple sequence alignments of the interface B-C (Skp2-Cks1). This interface includes 11 and 26 homologous proteins of the chains B (Skp2) and C (Cks1), respectively.

Epor-Epo-Epor complex

Erythropoietin (Epo) stimulates the proliferation and differentiation of cells (e.g. erythroid precursor cells) 79,80. Epo binds and orientates two cell-surface erythropoietin receptors (Epor) to activate cells and trigger an intracellular phosphorylation cascade 81. Using Mus musculus Epor (P14753), Epo (P07321), and Epor (P14753) as the query proteins (Fig. 3-3A), the PCFamily server found the template candidate (PDB code 1eer) (Fig. 3-3B) and its six homologous Epor-Epo-Epor complexes in three species (Fig. 3-3C). Among these six complexes, three are recorded in KEGG: P19235-P01588-P19235 (Homo sapiens), P14753-P07321-P14753 (Mus musculus), and Q5FVS4-P29676-Q5FVS4 (Rattus norvegicus). Two complexes are formed by Epo (P29676) binding to Epors Q07303 79 and O35545 82, respectively. PCFamily shows the MSAs with hydrogen-bond and conserved residues in the interfaces A-B (Fig. 3-3D) and A-C (Fig. 3-5) of the Epor-Epo-Epor PCF.

Figure 3-5. Multiple sequence alignments of the (Epo-Epor) A-C interface of template cytokine/receptor complex (PDB code 1eer).

This interface includes five and six homologous proteins of the chains A (erythropoietin) and C (erythropoietin receptors), respectively.


This PCF includes 65 GO term compositions. Among them, the CRs of two MF compositions and three CC compositions exceed 0.6 (Fig. 3-3E). The query has these five GO term compositions, for example GO:0004900 (erythropoietin receptor activity)-GO:0005128 (erythropoietin receptor binding)-GO:0004900. Additionally, the query and these homologous complexes consistently contain two conserved DCs (CR=1): PF00041-PF00758-PF00041 and PF09067-PF00758-PF09067. PF00758-PF00041 and PF00758-PF09067 are recorded in iPfam 35. These results suggest that the PCFamily server can identify homologous complexes for studying the interface evolution and annotating the query.

3-5. Results

Figure 3-6. Evaluations of the PCFamily server on 941 protein complex families.

(A) The distributions of recall (solid) and precision (dot) with different joint Z-value thresholds. (B) The relationships between agreement ratios and the conservation ratios of domain compositions (DC), biological processes (BP), molecular functions (MF), and cellular components (CC).

To evaluate the accuracy of the PCFamily server in discovering homologous complexes and annotating query proteins, we selected a non-redundant query structural template set. This set, comprising 941 protein complexes (2,979 sequences and 2,042 interfaces, called NR941), was selected from the PDB released on Feb 24, 2006. For searching homologous complexes, NR941 was used to assess PCFamily performance and to determine the threshold of the joint Z-value Jz (Equation (1)) on the Integr8 database (Fig. 3-6A). In addition, the NR941 set was applied to calculate the CRs of DCs (and GO terms) for each PCF and to infer the relations between the CRs and ARs (Equation (2)) of DCs and GO terms (Fig. 3-6B).

We defined gold standard positive and negative sets to measure the performance of the PCFamily server. Here, we used a trimer structural template T (proteins A, B, and C) with two interfaces A-B and B-C as a simple case to describe a positive complex (A'-B'-C') of T as follows: (1) A', B' and C' are homologs of A, B, and C, respectively, with significant sequence similarity (BLASTP E-values ≤ 10⁻¹⁰) 40,44; (2) A'-B' and B'-C' are PPIs recorded in annotated PPI databases (e.g. IntAct) and have the same interacting domains as A-B and B-C, respectively. Based on these rules, the gold standard positive set includes 770 complexes derived from Integr8 for the set NR941. The gold standard negative set, in contrast, was generated according to the assumption that proteins located in the same subcellular localization and acting in similar biological processes are more likely to form a complex than proteins involved in different processes. This study applied the relative specificity similarity (RSS) 69 to measure the BP and CC similarities of PPIs based on the GO terms.

Based on 198,882 interactions in the IntAct database, we considered a complex candidate to be a negative case if the BP and CC RSS scores of any interface of the complex are less than 0.4 (Fig. 3-7). Here, the negative set consists of 1,960 complexes.
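The negative-set rule can be written as a one-line predicate. The Python sketch below reads the rule as requiring both the BP and the CC RSS score of at least one interface to fall below 0.4; this reading, the function name, and the example scores are assumptions made for illustration.

```python
RSS_MAX = 0.4  # RSS cutoff taken from the text


def is_negative_complex(interface_rss):
    """Return True if any interface has both BP and CC RSS below the cutoff.

    interface_rss: list of (bp_rss, cc_rss) pairs, one per interface.
    """
    return any(bp < RSS_MAX and cc < RSS_MAX for bp, cc in interface_rss)


# Hypothetical RSS scores for the two interfaces of a trimer.
negative = is_negative_complex([(0.9, 0.8), (0.1, 0.3)])
not_negative = is_negative_complex([(0.9, 0.8), (0.1, 0.7)])
print(negative, not_negative)
```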

Precision, recall and F-measure were utilized to assess the reliability of the PCFamily server for searching homologous complexes. The F-measure is given as (2 × precision × recall) / (precision + recall), where the precision and recall are computed using the gold standard positive and negative sets. Figure 3-6A shows the relationships between the joint Z-value Jz and the recall and precision using 941 complexes on the Integr8 database. The recall decreases significantly when the joint Z-value is ≥ 3; conversely, the precision increases slightly when the joint Z-value is between 3 and 4. The recall and precision are 0.82 and 0.45, respectively, and the PCFamily server yields the highest F-measure value (0.55) if the threshold of the joint Z-value is set to 3.
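The threshold selection can be sketched as a sweep over candidate cutoffs. The Python example below assumes that a complex is predicted whenever its joint Z-value reaches the cutoff, that predicted gold-standard positives count as true positives, and that predicted gold-standard negatives count as false positives. The scores are hypothetical and do not reproduce the values in Figure 3-6A.

```python
def precision_recall_f(positive_scores, negative_scores, threshold):
    """Score one joint Z-value cutoff against gold-standard sets."""
    tp = sum(s >= threshold for s in positive_scores)
    fp = sum(s >= threshold for s in negative_scores)
    fn = len(positive_scores) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical joint Z-values, for illustration only.
positives = [2.5, 3.1, 3.4, 3.9, 4.6, 5.2]
negatives = [1.2, 2.8, 3.0, 3.3, 4.1]

# Keep the cutoff with the highest F-measure.
best_f, best_threshold = max(
    (precision_recall_f(positives, negatives, t)[2], t)
    for t in (2.0, 3.0, 4.0, 5.0)
)
print(best_f, best_threshold)
```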

Figure 3-6B shows the relationships between the ARs and the CRs of DCs, BP, CC, and MF. If the CR of DCs is greater than 0.6 (black), the AR between the query and its homologous complexes exceeds 0.95 (Equation (2)). If the CR of GO terms (i.e. BP, CC, and MF) is greater than 0.6, the ARs are consistently at least 0.74: 0.77 for BP (green), 0.74 for CC (yellow), and 0.75 for MF (red). These experimental results demonstrate that the server achieves high agreement on DCs and GO terms between the query (i.e. template complexes) and their respective homologous complexes.

Figure 3-7. The distributions of the biological process (BP) and cellular component (CC) RSS scores on 84,082 protein-protein interactions selected from the IntAct database.

Among the 198,882 interactions recorded in IntAct, BP and CC RSS scores could be calculated for 84,082 interactions.

The BP and CC RSS scores of 14,188 (16.88%) and 1,742 (2.07%) interactions, respectively, are less than 0.4.

[Figure 3-7 plot. x-axis: RSS score, binned as =0, 0~0.1, 0.1~0.2, ..., 0.9~1.0; y-axis: Percentage (%), 0 to 35; series: BP and CC.]

3-6. Conclusions

This study demonstrates the utility and feasibility of the PCFamily server in identifying homologous complexes and inferring conserved domains and GO terms from protein complex families. PCFamily is the first server to provide homologous complexes in multiple species; graphic visualization of the complex topology and detailed atomic residue-residue interactions; interface alignments; and conservations of GO terms and domain compositions. Our experimental results demonstrate that the query and its homologous complexes achieve high agreement on domains and GO terms. We believe that PCFamily is a fast server for searching homologous complexes and can provide valuable insights for determining functional modules of biological networks across multiple species.

Chapter 4. Structural interactome of multiple vertebrate