
1-1. Background

Protein-protein interaction (PPI) networks provide key insights into complex biological systems, from how different processes communicate to the function of individual residues on a single protein. For instance, the systematic identification of protein-protein interactions1-3 or protein complexes4-7 has been a widely used strategy for understanding the physical architecture of the cell. Accordingly, several large network databases, such as IntAct8, DIP9, and BioGRID10, have been established to record hundreds of thousands of physical and genetic interactions from a wide variety of organisms.

A wealth of investigations has been undertaken to deepen our understanding of hereditary diseases. As a result, databases such as the Online Mendelian Inheritance in Man (OMIM)11 and UniProt12 together contain almost 30,000 experimentally verified mutations. Nevertheless, the exact mechanisms by which mutations alter a protein's function are in many cases poorly understood. Researchers have therefore recently begun to use PPI networks to explore genotype-to-phenotype relationships13-16, on the basis that many proteins function by interacting with other proteins. However, this approach has so far been applied only to human, because it requires high-quality PPIs with known binding mechanisms.

In addition, the concept of "homologs" is useful for identifying consensus proteins across multiple organisms and can reveal the key residues related to the functions of a given protein. Previous studies have compared PPI networks across multiple organisms to identify essential pathways and mechanisms of evolution17-19. For example, Peterson, G. J. et al. showed, by comparing 23 fungal PPI networks, that interactions change faster through binding site evolution than through gene gain or loss19. However, these studies focused only on small sub-networks or on the few organisms with abundant PPI data (e.g. Homo sapiens and Saccharomyces cerevisiae).

Figure 1-1. The overview of constructing the structure resolved PPI networks and studying the interactome behavior

(A) Protein-protein interaction families and protein complex families are used to construct the structure resolved PPI networks in multiple organisms. (B) The "interactome behavior" observed through the consensus components. (C) The structure resolved PPI networks provide insight into the mechanisms of biological processes.

To address these issues, structure resolved interaction families (i.e. protein-protein interaction families and protein complex families) are the basic elements and the core idea of our research, used to construct structure resolved PPI networks and to study the behavior of a specific PPI network. A PPI family is a group of molecular interactions that share a consensus interacting domain and binding environment and have similar biological processes. The concept of PPI families not only helps us to construct highly reliable PPI networks in specific organisms (e.g. Homo sapiens, Mus musculus, and Danio rerio) but also reveals the consensus and diverse behaviors of the interactome through comparison across multiple species (Fig. 1-1).

The methods of inferring interface families and interactomes are briefly summarized as follows.

For the protein-protein interaction family, the concept of a PPI family is similar to that of a protein sequence family20,21 and a protein structure family22. Here, the members of a PPI family are conserved in specific functions and in interacting domain(s). The conservation among homologous PPIs can be used to annotate protein functions and to provide high-quality PPIs.

Protein complexes are fundamental units of macromolecular organization, and their composition is also known to vary according to cellular requirements7. Based on homologous complexes across multiple species, a protein complex family provides binding models (e.g. hydrogen bonds and conserved amino acids in the interfaces), functional modules, conserved interacting domains, and Gene Ontology annotations of its members.

Because the members (protein-protein interactions or protein complexes) of a protein-protein interface family23 and a protein complex family24 share consensus functional annotations across multiple species, we are able to identify the conserved components in the PPI networks across multiple species and to indicate the changes of these conserved components at the interspecific level. We therefore use these strategies to reveal "interactome behavior".

1-2. Current state of constructing protein-protein interaction networks

Many high-throughput experimental and computational approaches, such as high-throughput yeast two-hybrid screening25,26 and co-affinity purification27, have been proposed to construct the PPI network within an organism. These large-scale methods are often unable to reveal how one protein interacts with another or to describe the relationship between protein mutations and disease syndromes. Previous studies have combined protein structure information with protein interaction data to investigate how mutations affect protein interactions in disease14-16. For instance, Wang, X. J. et al. generated a structurally resolved human protein interaction network to systematically examine the relationships among genes, mutations, and associated disorders16.

Table 1-1. Numbers of proteins and protein-protein interactions in 11 commonly used organisms

NCBI Taxonomy ID   Organism                    No. proteins in Integr8 database   No. PPIs in five annotated databases
9606               Homo sapiens                56,006                             67,596
10090              Mus musculus                36,379                             7,535
3702               Arabidopsis thaliana        35,825                             6,985
6239               Caenorhabditis elegans      23,154                             10,095
7227               Drosophila melanogaster     15,155                             37,674
7955               Danio rerio                 21,601                             221
10116              Rattus norvegicus           13,807                             2,199
9913               Bos taurus                  12,235                             281
9031               Gallus gallus               6,279                              70
36329              Plasmodium falciparum       5,353                              2,956
4932               Saccharomyces cerevisiae    5,727                              237,193
Total                                          231,521                            372,805

However, these methods require experimental PPI data. The experimental PPI databases (e.g. IntAct8, DIP9, MIPS28, BioGRID10, and MINT29) are dominated by a few species, especially Saccharomyces cerevisiae. Table 1-1 presents the numbers of proteins and PPIs in organisms that are commonly used in molecular research. For example, 56,006 proteins (24.19% of the total for the 11 common organisms) and 67,596 PPIs (18.1%) of Homo sapiens are recorded in the Integr8 database30 (which collects completely sequenced genomes) and the five public interaction databases, respectively. In contrast, Saccharomyces cerevisiae has only 5,727 proteins (about 2.5%) but accounts for the majority of the recorded experimental PPIs (237,193; 63.6%).
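The percentages quoted above follow directly from the counts in Table 1-1. As a minimal sketch (the dictionary and function names are illustrative and are not part of the original analysis), each organism's share of the totals can be recomputed as follows:

```python
# Counts copied from Table 1-1:
# organism -> (proteins in Integr8, PPIs in the five interaction databases)
TABLE_1_1 = {
    "Homo sapiens": (56006, 67596),
    "Mus musculus": (36379, 7535),
    "Arabidopsis thaliana": (35825, 6985),
    "Caenorhabditis elegans": (23154, 10095),
    "Drosophila melanogaster": (15155, 37674),
    "Danio rerio": (21601, 221),
    "Rattus norvegicus": (13807, 2199),
    "Bos taurus": (12235, 281),
    "Gallus gallus": (6279, 70),
    "Plasmodium falciparum": (5353, 2956),
    "Saccharomyces cerevisiae": (5727, 237193),
}


def shares(table):
    """Return organism -> (% of all proteins, % of all PPIs)."""
    total_proteins = sum(proteins for proteins, _ in table.values())
    total_ppis = sum(ppis for _, ppis in table.values())
    return {
        name: (100 * proteins / total_proteins, 100 * ppis / total_ppis)
        for name, (proteins, ppis) in table.items()
    }


# Print the two shares for every organism, largest PPI share first.
for name, (protein_pct, ppi_pct) in sorted(
    shares(TABLE_1_1).items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:26s} proteins {protein_pct:6.2f}%   PPIs {ppi_pct:6.2f}%")
```

Sorting by PPI share makes the imbalance between the number of proteins and the number of recorded interactions easy to see.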

These statistics suggest that the interaction data in current databases are overestimated, with many false-positive PPIs, for some organisms (e.g. Saccharomyces cerevisiae), whereas they are underestimated and incomplete for most organisms (e.g. Homo sapiens and Mus musculus). Both overestimated and underestimated interaction data reduce the reliability of a protein interactome constructed for a specific organism.

Protein Data Bank (PDB)31 stores three-dimensional (3D) structure complexes, from which physically interacting domains can be identified to study domain-domain interactions (DDIs) and PPIs using comparative modeling32,33. As the number of protein structures increases rapidly, some domain-domain interaction databases, such as 3did34 and iPfam35, have recently been derived from PDB. Additionally, some studies have utilized template-based methods (i.e. comparative modeling32 and fold recognition33), which search a 3D-complex library to identify homologous templates of a pair of query protein sequences, in order to predict protein-protein interactions by assessing interface preference and scoring query protein sequence pairs according to how well they fit the known template structures. However, these methods32,33 are too time-consuming to search all possible protein-protein pairs in a large genome-scale database. For example, the number of possible protein-protein pairs in the UniProt12 database (4,826,134 sequences) is about 2.33×10^13. In addition, these methods are unable to form homologous PPIs to explore protein-protein evolution for a specific structure template.
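The size of this search space can be checked with a few lines of arithmetic. The sketch below is only illustrative (the variable names are not from the original work); it shows that the quoted figure corresponds to the square of the number of sequences, i.e. all ordered pairs including self-pairs, while the number of unordered pairs of distinct sequences is about half of that.

```python
import math

# Number of UniProt sequences quoted in the text.
n_sequences = 4_826_134

# All ordered pairs, self-pairs included: n * n.
ordered_pairs = n_sequences * n_sequences

# Unordered pairs of two distinct sequences: n * (n - 1) / 2.
unordered_pairs = math.comb(n_sequences, 2)

print(f"ordered pairs:   {ordered_pairs:.2e}")
print(f"unordered pairs: {unordered_pairs:.2e}")
```

Either way, the number of pairs is on the order of 10^13, which is why scoring every pair against a template library is impractical.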

In this thesis, we present "3D-domain interolog mapping" and the "protein complex family" to construct structure resolved PPI networks across multiple organisms. "3D-domain interolog mapping" is a concept for efficiently enlarging the set of annotated protein interactions through homologous PPIs with residue-based binding models. We verified the structure resolved PPI networks against Gene Ontology annotations36 and their topological architecture (i.e. scale-free network properties). In addition, we provide the consensus proteins across three networks based on "3D-domain interolog mapping". These consensus proteins are highly related to essential genes and disease-related proteins. We believe that structure resolved PPI networks provide insight into the mechanisms of biological processes within a given PPI network.

1-3. Thesis overview

The thesis is organized as follows. In Chapter 2, to efficiently enlarge the set of protein interactions annotated with residue-based binding models, we propose a new concept, "3D-domain interolog mapping", with a scoring system to explore all homologous protein-protein interaction pairs between two homolog families, derived from a known 3D-structure dimer (template), across multiple species. Each family consists of homologous proteins that have the interacting domains of the template, which allows the domain interface evolution of two interacting homolog families to be studied. The 3D-interologs database records the evolution of protein-protein interactions across multiple species. Based on "3D-domain interolog mapping" and a template-based scoring function, we infer 173,294 homologous protein-protein interactions by using 1,895 three-dimensional (3D) structure heterodimers to search the UniProt database (4,826,134 protein sequences). The 3D-interologs database comprises 15,124 species and 283,980 protein-protein interactions, including 173,294 inferred interactions (61%) and 110,686 interactions (39%) summarized from the IntAct database. For a protein-protein interaction, the 3D-interologs database shows functional annotations (e.g. Gene Ontology), interacting domains, and binding models (e.g. hydrogen-bond interactions and conserved residues). Additionally, this database provides couple-conserved residues and the interacting evolution by exploring the interologs across multiple species. Experimental results reveal that the proposed scoring function agrees well with the binding affinities of 275 mutated residues from ASEdb. The precision and recall of our method are 0.52 and 0.34, respectively, when 563 non-redundant heterodimers are used to search the Integr8 database30 (549 complete genomes). Experimental results demonstrate that the proposed

protein-protein interaction evolution across multiple species. In addition, the top-ranked strategy and template interface score are able to significantly improve the accuracies of identifying protein-protein interactions in a complete genome.
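The candidate-generation idea behind "3D-domain interolog mapping" can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline of Chapter 2: the protein identifiers and family contents are hypothetical, candidate pairs are assumed to be restricted to proteins from the same species, and the template-based scoring function is omitted entirely.

```python
from collections import defaultdict
from itertools import product


def group_by_species(family):
    """Group (protein_id, species) records of one homolog family by species."""
    grouped = defaultdict(list)
    for protein_id, species in family:
        grouped[species].append(protein_id)
    return grouped


def candidate_interologs(family_a, family_b):
    """Pair homologs of template chain A with homologs of template chain B.

    Assumption for this sketch: only pairs from the same species are kept.
    Returns a list of (species, protein_a, protein_b) tuples.
    """
    by_species_a = group_by_species(family_a)
    by_species_b = group_by_species(family_b)
    shared_species = sorted(by_species_a.keys() & by_species_b.keys())
    candidates = []
    for species in shared_species:
        for protein_a, protein_b in product(by_species_a[species], by_species_b[species]):
            candidates.append((species, protein_a, protein_b))
    return candidates


# Hypothetical homolog families of the two chains of one template dimer.
family_a = [("A_hs", "H. sapiens"), ("A_mm", "M. musculus"), ("A_dr", "D. rerio")]
family_b = [("B_hs1", "H. sapiens"), ("B_hs2", "H. sapiens"), ("B_mm", "M. musculus")]

for species, protein_a, protein_b in candidate_interologs(family_a, family_b):
    print(species, protein_a, protein_b)
```

Restricting the search to the two homolog families of a template avoids enumerating all sequence pairs in the database. In a full method, each candidate would then be scored against the template interface.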

In Chapter 3, we present the PCFamily server, which identifies template-based homologous protein complexes (called a protein complex family) and infers functional modules of the query proteins. This server first finds homologous structure complexes of the query by using BLASTP to search the structural template database (11,263 complexes). PCFamily then searches a complete genomic database (Integr8 with 6,352,363 protein sequences in 2,274 species) for homologous complexes of the templates (query). Based on these homologous complexes across multiple species, this server infers binding models (e.g. hydrogen bonds and conserved amino acids in the interfaces), functional modules, and the conserved interacting domains and Gene Ontology annotations of the protein complex family. Experimental results demonstrate that the PCFamily server is useful for visualizing binding models and annotating the query proteins. We believe that the server is able to provide valuable insights for determining functional modules of biological networks across multiple species.

In Chapter 4, we provide structure resolved PPI networks across multiple species, including H. sapiens, M. musculus, and D. rerio. Based on structure-based homologous PPIs in multiple species, the PPIs with atomic residue-based binding models in the derived structure resolved networks achieve high agreement with Gene Ontology (BP, CC, and MF terms) similarities. Furthermore, these networks have a scale-free architecture, which is consistent with most cellular networks. In addition, our derived networks can be used to observe the consensus proteins and modules (a module being a fundamental unit formed by highly connected proteins) that are highly conserved across multiple organisms. These consensus proteins are often essential genes and are related to diseases recorded in OMIM. Experimental results also indicate that mutations of interacting residues in the PPIs are often related to diseases. Our results demonstrate that the structure resolved PPI networks can provide valuable insights for understanding the mechanisms of biological processes.

In Chapter 5, we provide a method to characterize a given PPI network. Although many graph features have been proposed to measure the roles of proteins and to identify local modular structures of high connectivity in a PPI network, the pseudoinverse of the Laplacian matrix plays a key role: it has a natural interpretation in terms of random walks on a network and defines kernels on a given network. Therefore, we propose the modularity structure matrix (MS-matrix), which is the pseudoinverse of the Laplacian matrix of a given network, to evaluate the modularity structure properties of a PPI network. To our knowledge, the MS-matrix is the first property that identifies both globally important proteins and locally dense regions within a network. For a given PPI network of S. cerevisiae, our results demonstrate that the important proteins identified by the MS-matrix are related to essential biological processes (i.e. essential genes) and are highly consistent with topological features (i.e. degree, closeness centrality, and betweenness centrality). Moreover, the relationships between proteins derived from the MS-matrix reflect Gene Ontology similarity and could be useful for module identification. Furthermore, the biological characterization (e.g. Gene Ontology) of the modules derived from the MS-matrix is similar to that of the modules collected from an experimental database (e.g. MIPS). Our results demonstrate that the MS-matrix provides insight for investigating a PPI network through important proteins and local modularity structures.
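As an illustration of the central quantity only, the following sketch computes the pseudoinverse of the Laplacian matrix for a small toy graph. It is not the MS-matrix implementation of this thesis: the function names, the toy graph, and the use of the identity L+ = (L + J/n)^-1 - J/n (valid for a connected graph with n nodes, where J is the all-ones matrix) are assumptions made for this example. The resistance distance computed at the end is proportional to the commute time of a random walk between two nodes, which is one standard way to read the random-walk interpretation mentioned above.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian L = D - A of an undirected, unweighted graph on nodes 0..n-1."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


def invert(M):
    """Invert a square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(M)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian_pseudoinverse(n, edges):
    """Pseudoinverse of L for a CONNECTED graph via (L + J/n)^-1 - J/n."""
    L = laplacian(n, edges)
    shift = Fraction(1, n)
    shifted = [[L[i][j] + shift for j in range(n)] for i in range(n)]
    inverse = invert(shifted)
    return [[inverse[i][j] - shift for j in range(n)] for i in range(n)]


def resistance_distance(Lp, u, v):
    """Effective resistance between u and v, read off the pseudoinverse."""
    return Lp[u][u] + Lp[v][v] - 2 * Lp[u][v]


# Toy graph (hypothetical): a path of three proteins, 0 - 1 - 2.
toy_edges = [(0, 1), (1, 2)]
Lp = laplacian_pseudoinverse(3, toy_edges)
for row in Lp:
    print([str(x) for x in row])
print("resistance 0-2:", resistance_distance(Lp, 0, 2))
```

Exact fractions are used here only to keep the toy example free of rounding error. For a real PPI network with thousands of proteins, a numerical linear algebra library would be the practical choice.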

In the final chapter, we summarize the results of this thesis and discuss future work. To further investigate the behavior of a PPI network within a given cell, gene expression data would provide an in-depth understanding of the dynamic organization of the PPI network and its role in the regulation of cellular processes. For example, the Connectivity Map (also known as cmap) provided by Lamb, J. et al. is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules, together with simple pattern-matching algorithms, which enable the discovery of functional connections between drugs, genes, and diseases through the transitory feature of common gene-expression changes37. Therefore, we will integrate gene expression data into the PPI network and try to illustrate the behavior of PPI networks under different cell types and conditions. For example, because the Connectivity Map provides the up-regulated and down-regulated proteins for given drugs and diseases, combining these data with our structure resolved PPI networks should help to explain the mechanisms relating drugs, genes, and diseases.

Chapter 2. 3D-interologs: An evolution database of