STRUCTURAL INTERACTOME OF MULTIPLE VERTEBRATE GENOMES

A crucial step toward understanding the spatiotemporal dynamics of a cellular system is to investigate protein-protein interaction (PPI) networks and biochemical processes. Current large-scale methods are often unable to answer how one protein interacts with another within a given PPI network, or to describe the relationship between protein mutations and disease syndromes. To address this issue, we substantially enhanced and modified our previous PPI family search and 3D-domain interologs with a template-based scoring function. Our method can efficiently enlarge the set of PPIs annotated with residue-based binding models in the structure resolved networks of H. sapiens, M. musculus, and D. rerio. This work is the first to construct structure resolved PPI networks across multiple species. The PPIs with atomic residue-based binding models in the derived structure resolved networks achieved high agreement with Gene Ontology similarities. Furthermore, the architecture of these networks is scale-free, which is consistent with most cellular networks. In addition, our derived networks can be used to observe the consensus proteins and modules that are highly conserved across multiple organisms. These consensus proteins are often essential genes and are related to diseases recorded in OMIM.

Experimental results also indicate that disease-related mutations are often located on the interacting residues of PPIs. Our results demonstrate that the structure resolved PPI networks in vertebrates can provide valuable insights for understanding the mechanisms of biological processes.

4-1. Introduction

A crucial step toward understanding the spatiotemporal dynamics of a cellular system is to investigate protein-protein interaction (PPI) networks and biochemical processes ^3,83,84. Many high-throughput experimental methods, such as yeast two-hybrid screening ^25,26 and co-affinity purification ²⁷, and computational approaches have been proposed to construct the PPI network within an organism. These large-scale methods are often unable to answer how one protein interacts with another, or to describe the relationship between protein mutations and disease syndromes. Previous studies have combined protein structure information with experimental PPIs to investigate how mutations affect protein interactions in disease ^14-16. Based on experimental PPIs, a structurally resolved human protein interaction network has been reconstructed to examine the relationships between genes, mutations and associated disorders ¹⁶. However, experimental PPIs are concentrated in several well-studied organisms (e.g. S. cerevisiae), whereas the PPIs of most species are far from complete. For example, only 227 PPIs for D. rerio and 7,736 PPIs for Mus musculus are recorded in five public databases ^8-10,85,86 (e.g. BioGRID and IntAct).

Discovering the sequence homologs of a known protein provides clues for understanding the function of a newly sequenced gene. We have proposed the "protein-protein interaction family" to annotate genome-scale PPIs through homologous PPIs ²³ by searching a complete genomic database (Integr8, containing 6,352,363 protein sequences in 2,274 species) ³⁰. Furthermore, a known three-dimensional (3D) structure complex can provide the interacting domains and atomically detailed binding models of an interaction. Some methods have utilized template-based approaches (i.e. comparative modeling ³² and fold recognition ³³) to predict PPIs by assessing interface preference through the fitness to known template structures. However, these methods ^32,33 are too time-consuming to search all possible protein-protein pairs in a large genome-scale database across multiple species. Therefore, by utilizing both the "protein-protein interaction family" and 3D structure complexes, we are able to construct structure resolved PPI networks with binding mechanisms in multiple organisms.

To address this issue, we substantially enhanced and modified our previous PPI family search (sequence-based PPI search method ²³) and 3D-domain interologs with a template-based scoring function (3D-template PPI prediction method ⁸⁷). Our method can efficiently enlarge the set of PPIs annotated with residue-based binding models in the structure resolved networks of H. sapiens, M. musculus, and D. rerio. For each structure resolved network, we investigated the reliability by using the Gene Ontology and the network architecture (i.e. scale-free network). In addition, our method can identify the conserved proteins and network modules across multiple networks. These conserved proteins are highly related to essential genes and to diseases recorded in "Online Mendelian Inheritance in Man" (OMIM ¹¹). Furthermore, we demonstrated that disease-related mutations are enriched on the interacting residues, especially those forming hydrogen bonds. These results indicate that the structure resolved PPI networks can provide insight for understanding the mechanisms of biological processes and interactomes.

4-2. Methods and Materials

Constructing the structure resolved PPI networks

A major challenge of systems biology is to understand the networks of interacting genes, proteins and small molecules that produce biological functions. To efficiently enlarge the set of protein interactions annotated with residue-based binding models, we have proposed the concept of "3D-domain interolog mapping" ^39,87: for a known 3D-structure complex (template T with chains A and B), domain a (in chain A) interacts with domain b (in chain B) in one species. The proteins of the homolog families A' and B' of A and B have significant sequence similarity (i.e. BLASTP E-values ≤10^-10) and contain the interacting domains a and b, respectively. All possible protein pairs between these two homolog families are considered as protein-protein interaction candidates using the template T. Then, we utilize our previous scoring system ^39,87 to evaluate the binding model similarity between the candidates and the template. According to this concept, protein sequence databases can be searched to annotate protein-protein interactions across multiple species efficiently.
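The candidate-generation step of this concept can be illustrated with a minimal sketch. It assumes that BLASTP hits for the two template chains are already available as simple records; the field names (`protein`, `e_value`, `domains`), the function names, and the toy domain labels are hypothetical and not taken from the original method. The template-based scoring step (binding model similarity) is not reproduced here.

```python
from itertools import product

E_VALUE_CUTOFF = 1e-10  # BLASTP E-value threshold stated in the text


def homolog_family(hits, required_domain):
    """Keep hits that pass the E-value cutoff and contain the interacting domain.

    `hits` is a list of dicts with hypothetical keys:
    'protein' (identifier), 'e_value' (float), 'domains' (set of domain labels).
    """
    return [
        hit["protein"]
        for hit in hits
        if hit["e_value"] <= E_VALUE_CUTOFF and required_domain in hit["domains"]
    ]


def interolog_candidates(hits_a, hits_b, domain_a, domain_b):
    """Return all protein pairs between the homolog families A' and B'."""
    family_a = homolog_family(hits_a, domain_a)
    family_b = homolog_family(hits_b, domain_b)
    return list(product(family_a, family_b))


# Toy data (illustrative only).
hits_a = [
    {"protein": "A1", "e_value": 1e-50, "domains": {"dom_a"}},
    {"protein": "A2", "e_value": 1e-5, "domains": {"dom_a"}},   # fails the E-value cutoff
    {"protein": "A3", "e_value": 1e-30, "domains": {"dom_x"}},  # lacks domain a
]
hits_b = [
    {"protein": "B1", "e_value": 1e-40, "domains": {"dom_b"}},
    {"protein": "B2", "e_value": 1e-12, "domains": {"dom_b", "dom_y"}},
]

candidates = interolog_candidates(hits_a, hits_b, "dom_a", "dom_b")
print(candidates)
```

In this sketch every surviving pair is only a candidate; in the described method each candidate would still have to be scored against the template before being accepted.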

Figure 4-1. The overview of constructing structure resolved PPI networks in three vertebrates through "3D-domain interolog mapping"

(A) 3D-domain interolog mapping is used to infer the homologous PPIs through structural templates and complete genome databases. (B) The structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio. (C) The human PPI network with the disease data derived from OMIM. The size and color of a node (protein) denote the numbers of interactions and diseases, respectively.

Figure 4-1 illustrates the overview of constructing structure resolved PPI networks in three vertebrates through "3D-domain interolog mapping". First, a structure template library comprising 60,618 3D-dimers involved in 24,815 complexes was selected from the Protein Data Bank (PDB ⁸⁸) released on Sep 2, 2011 (Fig. 4-1A). The interacting residues and scoring functions defined in our previous studies ^39,87 were used to identify similar binding interfaces of PPIs. After the 3D-dimer template library and template profiles were built, we inferred the homologous PPIs of each template interface with Z-value ≥ 3.0 from a complete genomic database (Integr8 ⁸⁹) (Fig. 4-1A). According to these homologous PPIs in H. sapiens, M. musculus, and D. rerio, we constructed and aligned the structure resolved PPI networks (Fig. 4-1B).

Multiple network alignment

Network alignment methods can be roughly divided into global alignment and local alignment. By searching for a single comprehensive mapping of the whole set of proteins and protein interactions from different species ⁹⁰, global network alignment can address interactome evolution in terms of conserved and species-specific proteins (and PPIs). Two basic issues should be addressed by a network alignment method. First, the method should provide the importance (such as hub status and conservation) of proteins and PPIs in multiple networks across species. Second, for a selected protein and PPI, the scoring function of the method should reflect the similarity of the aligned proteins and PPIs in the networks.

Here, we describe a new global network alignment method based on "3D-domain interolog mapping". According to the definition of "3D-domain interolog mapping", the protein-protein interactions of the same family share the same interacting domains and have similar binding models. Therefore, for a specific PPI, the other members of its PPI family can be considered as the corresponding alignment candidates in other organisms.

Our global alignment method applies a greedy strategy in which the PPI with the highest importance is aligned with the highest priority. Here, we evaluated the importance of a given PPI (I) within the network based on its degree, conservation, and reliability. Two proteins that form a PPI and have high degrees are usually hubs in a network. The degree (DI) of a PPI formed by proteins a and b is defined as DI = Da + Db, where Da and Db are the degrees of proteins a and b, respectively. A PPI present in many organisms is usually essential and plays an important role in biological functions and processes. Therefore, the conservation (CI) of a PPI (I) is defined as CI = TaxI / 11, where TaxI is the number of taxonomy divisions, defined by the NCBI taxonomy database ⁹¹, of the interacting proteins in the family of PPI I. Here, the maximum number of taxonomy divisions is 11. The reliability (RI) of a PPI is defined as RI = (EI + TI) / 2, where EI is 1 if the PPI I is recorded in the five public PPI databases (IntAct ⁸, DIP ⁹, MIPS ²⁸, BioGRID ¹⁰, and MINT ²⁹); otherwise, EI is 0. TI is set to 1 when the 3D-dimer template and the PPI I are in the same organism; otherwise, TI is 0. Finally, the importance (S) of a given PPI (I) is calculated as S = CI + RI + TI.
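The quantities defined above can be written out as a short sketch. It follows the formulas exactly as stated in the text (DI = Da + Db, CI = TaxI / 11, RI = (EI + TI) / 2, S = CI + RI + TI); the function and parameter names are hypothetical. Note that DI is computed here because the text defines it, although it does not appear in the stated expression for S.

```python
MAX_TAXONOMY_DIVISIONS = 11  # maximum number of NCBI taxonomy divisions, per the text


def ppi_degree(degree_a, degree_b):
    """DI = Da + Db: sum of the degrees of the two interacting proteins."""
    return degree_a + degree_b


def conservation(num_tax_divisions):
    """CI = TaxI / 11."""
    return num_tax_divisions / MAX_TAXONOMY_DIVISIONS


def reliability(in_public_db, template_same_organism):
    """RI = (EI + TI) / 2, with EI and TI encoded as 0 or 1."""
    e_i = 1 if in_public_db else 0
    t_i = 1 if template_same_organism else 0
    return (e_i + t_i) / 2


def importance(num_tax_divisions, in_public_db, template_same_organism):
    """S = CI + RI + TI, as written in the text."""
    t_i = 1 if template_same_organism else 0
    return (
        conservation(num_tax_divisions)
        + reliability(in_public_db, template_same_organism)
        + t_i
    )


# A PPI found in all 11 divisions, recorded experimentally, with a same-organism template.
s_max = importance(11, True, True)
# A PPI found in all 11 divisions, with no experimental record and a cross-organism template.
s_conserved_only = importance(11, False, False)
print(s_max, s_conserved_only)
```

Under these definitions S ranges from 0 to 3, and a same-organism template contributes both through RI and directly through TI.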

The network alignment algorithm

Given the three structure resolved PPI networks of H. sapiens (NH), M. musculus (NM), and D. rerio (ND), we perform multiple network alignment by aligning NM and ND to NH. The algorithm is summarized in Figure 4-2 and proceeds as follows:

(1) For each PPI of NH, NM and ND, we calculate the importance (S) of the PPI by using the equation S = CI + RI + TI described in the previous paragraph.

(2) After calculating the importance of all PPIs in NH, each PPI is prioritized according to its importance. Then, we greedily pick the PPI I with the highest importance and its corresponding PPI family (FH).

(3) We select the most similar PPIs IM and ID of M. musculus and D. rerio, respectively, in FH based on the significant joint sequence similarity between two pairs, i.e., (A, A₁') and (B, B₁'), of I (A and B) and IM (A₁' and B₁'). This work followed previous

Finally, we identified 1,887 proteins and 5,845 PPIs that are consensus across the structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio.
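The greedy procedure in steps (1) to (3) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the joint sequence similarity is supplied as a precomputed score (its exact definition is not given above), each mouse or zebrafish PPI is assumed to be aligned at most once, and all names (`greedy_align`, `families`, `similarity`) are hypothetical.

```python
def greedy_align(human_ppis, families, importance, similarity):
    """Align mouse and zebrafish PPIs to human PPIs in order of decreasing importance.

    human_ppis: list of human PPI identifiers (network NH)
    families:   dict mapping a human PPI to {'mouse': [...], 'zebrafish': [...]},
                the members of its PPI family in the other two networks
    importance: dict mapping a human PPI to its importance S
    similarity: dict mapping (human_ppi, other_ppi) to a joint similarity score
                (assumed precomputed; higher means more similar)
    """
    used = {"mouse": set(), "zebrafish": set()}  # assumption: one-to-one alignment
    alignment = []
    for ppi in sorted(human_ppis, key=lambda p: importance[p], reverse=True):
        row = {"human": ppi}
        for species in ("mouse", "zebrafish"):
            candidates = [
                c for c in families.get(ppi, {}).get(species, [])
                if c not in used[species]
            ]
            if candidates:
                best = max(candidates, key=lambda c: similarity[(ppi, c)])
                used[species].add(best)
                row[species] = best
            else:
                row[species] = None
        alignment.append(row)
    return alignment


def consensus(alignment):
    """Human PPIs that have an aligned partner in both other species."""
    return [r["human"] for r in alignment if r["mouse"] and r["zebrafish"]]


# Toy data (illustrative only).
human_ppis = ["H1", "H2"]
importance = {"H1": 2.0, "H2": 3.0}
families = {
    "H1": {"mouse": ["M1"], "zebrafish": []},
    "H2": {"mouse": ["M1", "M2"], "zebrafish": ["Z1"]},
}
similarity = {("H1", "M1"): 0.9, ("H2", "M1"): 0.8, ("H2", "M2"): 0.6, ("H2", "Z1"): 0.7}

alignment = greedy_align(human_ppis, families, importance, similarity)
print(alignment)
print(consensus(alignment))
```

In the toy data H2 is processed first because it has the higher importance, so it claims M1 even though H1 is more similar to M1; this is the expected behavior of a greedy strategy and shows why the ordering by importance matters.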

Collecting the list of disease-associated genes, mutations, and diseases

To further investigate the relationship between disease-associated genes and mutations in the structure resolved human PPI network, we collected disease-related mutations from the OMIM ¹¹ database. The database of single nucleotide polymorphisms (dbSNP ⁹², build 132) is a public-domain archive for a broad collection of simple genetic polymorphisms. According to the OMIM ¹¹ database, which contains the relationships between genes and diseases, we collected all "OMIM-curated-records" from the dbSNP database. We obtained 15,995 mutations, including in-frame and truncating mutations, in 1,949 genes. For further analysis, we selected 2,202 mutations in 137 genes to validate the structurally resolved human PPI network with annotations of mutations and diseases (Fig. 4-1C). Here, the sizes and colors of the nodes (proteins) denote the numbers of interactions and diseases, respectively. A larger node represents a protein with more PPIs, and a redder node denotes a protein with more diseases. There are two main disease hubs (No. of diseases > 10): TGFR4 and TGFR3, with 14 and 13 diseases, respectively.

4-3. Results

Structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio

To evaluate the structural PPI networks annotated with residue-based binding models, we compared the numbers of proteins and PPIs in our structure resolved PPI networks with those of the human structural PPI network ¹⁶, whose construction method can only be applied to well-studied species. According to the PPIs recorded in five public databases, the number of PPIs in human (67,596 PPIs) is much larger than those in mouse (7,735 PPIs) and zebrafish (221 PPIs) (Table 4-1). The method proposed by Wang, X. J. et al. ¹⁶ would be of limited use for mouse and zebrafish because it requires both experimental PPIs and protein structures. Conversely, our method, using "3D-domain interolog mapping" and the "PPI family", is able to efficiently enlarge the set of PPIs annotated with residue-based binding models, and is especially useful for seldom-studied organisms (e.g. zebrafish) or newly sequenced organisms.

Although most of the PPIs derived from our "3D-domain interolog mapping" are not yet confirmed by experiments, our previous works have achieved high annotation precision and high agreement with the ddG of experimental binding energies and with experimental PPIs ^23,24,39,87.

Table 4-1. Statistics of proteins and PPIs derived from our results, five public databases, and Wang, X. J. et al. ¹⁶ for H. sapiens, M. musculus, and D. rerio

Species       No. proteins    3D-domain interologs       Five public databases*2    Wang, X. J. et al. ¹⁶
              in genome*1     No. proteins   No. PPIs    No. proteins   No. PPIs    No. proteins   No. PPIs
H. sapiens    56,006          9,493          39,058      12,206         67,596      2,816          4,222
M. musculus   36,379          7,689          33,125      4,177          7,735       -              -
D. rerio      21,601          5,084          21,236      137            221         -              -

*1 The number of proteins in a specific genome is calculated by using the Integr8 database.

*2 The experimental PPIs are derived from five public databases (IntAct, MIPS, DIP, MINT, and BioGRID)

To further verify the quality of our structure resolved PPI networks, we compared the Gene Ontology (GO) ⁹³ similarities, including biological process (BP), cellular component (CC), and molecular function (MF), between interacting protein pairs and all protein pairs in a structural PPI network. Here, we applied the relative specificity similarity (RSS) ⁶⁹ to measure the GO similarity between two proteins. Figure 4-3 illustrates the RSS score distributions of BP, CC, and MF for interacting protein pairs and all protein pairs in the structural PPI network. Interacting protein pairs are enriched when the BP, CC, and MF RSS scores are higher than 0.7 (Fig. 4-3). In addition, the RSS scores of interacting protein pairs are significantly greater than those of all protein pairs according to the statistical hypothesis test (Mann–Whitney U test, p-value < 10^-40; Fig. 4-3). The RSS score distributions of BP, CC, and MF for interacting protein pairs and all protein pairs within the mouse and zebrafish networks (Fig. 4-4) are similar to those of the human network. These results illustrate the importance of structural resolution and imply that the interacting proteins in our structure resolved PPI networks share significantly similar GO annotations.
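The comparison of two score distributions with the Mann–Whitney U test reported for Figure 4-3 can be illustrated with a small pure-Python sketch. It is not the analysis code of this work: it uses average ranks for ties, a normal approximation without tie or continuity correction, and a one-sided p-value, and the sample values below are invented for illustration.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, z-score, one-sided p-value that x tends to be larger).

    Uses average ranks for ties and a normal approximation without tie or
    continuity correction, which is adequate only as an illustration.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return u_x, z, p_one_sided


# Invented RSS-like scores: interacting pairs versus background pairs.
interacting = [0.80, 0.90, 0.75]
background = [0.10, 0.20, 0.30, 0.40]

u, z, p = mann_whitney_u(interacting, background)
print(u, z, p)
```

With samples as large as those in Figure 4-3 (thousands of interacting pairs against millions of background pairs), even modest shifts between the distributions produce extremely small p-values, so the effect size visible in the plotted distributions should be read alongside the test.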

Figure 4-3. The distributions of relative specificity similarity (RSS) of BP, CC, and MF of the interacting protein pairs in the derived structural PPI networks

(A) The BP RSS distributions of 10,163 interacting protein pairs and all protein pairs (3,925,772 pairs). (B) The CC RSS distributions of 9,254 interacting protein pairs and all protein pairs (3,424,256 pairs). (C) The MF RSS distributions of 12,387 interacting protein pairs and all protein pairs (4,331,532 pairs). Only the protein pairs with BP (CC or MF) annotations are considered. The BP, CC, and MF RSS scores of interacting pairs show a significant enrichment when the RSS score >= 0.7. The interacting pairs have significantly higher RSS scores than random pairs in the networks according to the Mann–Whitney U test (p-value < 10^-40).

A network with a power-law degree distribution is called scale-free, a name that is rooted in the statistical physics literature. An important finding about cellular network architecture is that most networks within the cell approximate a scale-free topology ⁹⁴. Therefore, our structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio were evaluated based on the characteristic of scale-free networks that P(k), the probability of a node having k links, decreases as the node degree increases, following a power law P(k) ~ k^-γ that appears as a straight line on a log-log plot (Fig. 4-5). The degree exponents γ are 2.127, 2.088, and 1.958 in the structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio, respectively. In general, the smaller the value of γ, the more important the role of the hubs is in the network. A scale-free network typically has a degree exponent 2 ≤ γ ≤ 3, but can also exist with an exponent less than 2 ^94,95. This result is consistent with the architecture (i.e. scale-free network property) of some cellular networks ⁹⁵.
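As an illustration of how a degree exponent can be read off a log-log plot, the sketch below computes P(k) from an edge list and estimates γ as the negative slope of a least-squares line through (log k, log P(k)). The fitting procedure used for Figure 4-5 is not specified in the text, so the least-squares fit is an assumption (maximum-likelihood estimators are a common alternative), and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes having exactly k links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    num_nodes = len(degree)
    return {k: c / num_nodes for k, c in counts.items()}


def fit_degree_exponent(pk):
    """Estimate gamma in P(k) ~ k^-gamma by least squares on the log-log plot."""
    xs = [math.log10(k) for k in pk]
    ys = [math.log10(p) for p in pk.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# A star graph: one hub with three links and three leaves with one link each.
star_pk = degree_distribution([("hub", "p1"), ("hub", "p2"), ("hub", "p3")])

# Synthetic check: an exact power law with exponent 2 should be recovered.
synthetic_pk = {k: k ** -2.0 for k in range(1, 50)}
gamma = fit_degree_exponent(synthetic_pk)
print(star_pk, gamma)
```

On real networks the high-degree tail is sparse and noisy, so the fitted value depends on binning and on the range of k included in the fit.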

Figure 4-4. The distributions of BP, CC, and MF RSS scores on interacting protein pairs and all protein pairs within the mouse and zebrafish networks

The BP, CC, and MF RSS scores show a significant enrichment when the RSS score >= 0.7 in both the mouse and zebrafish networks.

Figure 4-5. The node degree distributions of three structure resolved PPI networks: (A) H. sapiens, (B) M. musculus, and (C) D. rerio

The degree exponents γ are 2.127, 2.088, and 1.958 in the structure resolved PPI networks of H. sapiens, M. musculus, and D. rerio, respectively. A scale-free network typically has a degree exponent 2 ≤ γ ≤ 3, but can also exist with an exponent less than 2. These three structural networks are scale-free networks.

The structure resolved PPI networks analysis

To further investigate the biological meaning of our networks, we analyzed the grouping property of the human network by using the Gene Ontology annotations. Here, we define the grouping property of a network as the tendency of proteins that are involved in similar processes and located in similar cellular components to be neighbors in the network. We assigned each protein to one of four cellular components (i.e. nucleus, intracellular, membrane, and others) based on the CC annotations (Fig. 4-6A). We also assigned each protein to one of six biological processes (i.e. immune response, transport, signal transduction, protein metabolic process, nucleic acid metabolic process, and others) based on the BP annotations (Fig. 4-6B).

Figure 4-6. Characteristics of the structure resolved protein network in H. sapiens using GO annotations.

(A) According to the GO cellular component (CC) annotations, the proteins in the structure resolved protein network can be annotated into four CC terms (groups), including 218 proteins in nuclear part (GO:0044428, red), 829 proteins in intracellular (GO:0005622, yellow), 1,265 proteins in membrane (GO:0016020, green), and others (gray). (B) Based on biological processes, 281 proteins are annotated with nucleobase-containing compound metabolic process (e.g., transcription) (GO:0006139, red); 613 proteins are annotated with protein metabolic process (e.g., translation) (GO:0019538, yellow); 710 proteins are annotated with signal transduction (GO:0007165, green); 364 proteins are annotated with transport (GO:0006810, blue); and 436 proteins are annotated with immune response (GO:0006955, black).


Figure 4-7. Six major cellular processes in our derived network of H. sapiens.

According to the GO annotations (Fig. 4-6), our derived structure resolved PPI network could be grouped into six major cellular processes, including nucleic acid metabolic process (e.g., transcription); protein metabolic process (e.g., translation); intracellular signal transduction process; membrane signal transduction process; transport process; proteolysis process (e.g. proteasome); and immune responses.

According to the GO annotations, our derived structure resolved PPI network could be grouped into six major cellular processes, including nucleic acid metabolic process (e.g., transcription); protein metabolic process (e.g., translation); intracellular signal transduction process; membrane signal transduction process; transport process; proteolysis process (e.g. proteasome); and immune responses. In addition, our PPI network also reflects the communication among the six major cellular processes (Fig. 4-7). Intracellular signal transduction plays an important role in our network. This process receives the signals provided by membrane signal transduction (e.g. EGFR, FGFR, and other membrane receptors) and by the immune response (e.g. T-cell receptor). In addition, the intracellular signal transduction also


the peripheral portion of our derived network. The nucleic acid metabolic processes are the kernel processes of a living cell and could be regulated by signal transduction. In our derived network, the nucleic acid metabolic process communicates only with the intracellular signal transduction and transport processes. The results imply that the biological behavior of our derived network is consistent with our knowledge of a living cell.

The consensus proteins, processes, and organism-specific processes

According to "3D-domain interolog mapping" and the multiple network alignment described in Methods, we were able to compare the three vertebrate protein interaction networks (i.e. H. sapiens, M. musculus, and D. rerio) and identify the consensus proteins and protein-protein interactions. Here, we identified 1,887 consensus proteins and 5,845 consensus PPIs from the 4,135 proteins and 21,648 PPIs of the structure resolved human network. To further evaluate the biological meanings and network topologies of the consensus and non-consensus proteins, we investigated the consensus proteins along three dimensions: whether they are essential proteins, whether they are involved in diseases, and whether they are located in the central part (e.g. hubs) of the protein interaction network.

Essential genes are usually involved in the fundamental cellular processes that are required for the survival of an organism. As a result, essential genes are often highly conserved across multiple organisms ⁹⁶. We collected the annotations of essential proteins from the Database of Essential Genes (DEG ⁹⁷). Because few vertebrate proteins, especially those of Homo sapiens, are recorded as essential genes in DEG, we identified the essential proteins (genes) of Homo sapiens, Mus musculus, and Danio rerio by using BLAST to search Integr8 ³⁰ for orthologs of the essential genes recorded in DEG. To investigate the reliability of the ortholog mapping, we collected an orthologous protein data set (named ORT) from the COG database ⁹⁸ and evaluated the relationship between the sequence similarity (i.e. BLASTP E-value) and orthologous protein pairs. The ORT set consists of 3,050,847 orthologous protein pairs and 112,920 proteins. Figure 4-8 illustrates the sequence similarity distribution of these orthologous protein pairs. When the sequence similarity (BLASTP E-value) ≥ 10^-70, the number of all protein pairs increases significantly, which decreases the precision (No. orthologous protein pairs / No. all protein pairs); moreover, the number of orthologous protein pairs decreases more gradually
