3D-INTEROLOGS: AN EVOLUTION DATABASE OF PHYSICAL PROTEIN-PROTEIN INTERACTIONS ACROSS MULTIPLE GENOMES

Interactions between proteins are critical to most biological processes. To identify and characterize protein-protein interactions (PPIs) and their networks, many high-throughput experimental approaches (such as yeast two-hybrid screening, mass spectrometry, and tandem affinity purification) and computational methods (phylogenetic profiles³⁸, known 3D complexes³⁹, and interologs⁴⁰) have been proposed⁴¹. Several PPI databases, such as IntAct⁸, BioGRID¹⁰, DIP⁹, MIPS²⁸, and MINT²⁹, have accumulated PPIs submitted by biologists as well as PPIs mined from the literature, high-throughput experiments, and other data sources. As these interaction databases continue to grow, they become increasingly useful for analyzing newly identified interactions.

The discovery of sequence homologs to a known protein often provides clues for understanding the function of a newly sequenced gene. As an increasing number of reliable PPIs become available, identifying homologous PPIs should be useful for understanding a newly determined PPI. Recently, several PPI databases (e.g., IntAct and BioGRID) have allowed users to input one or a pair of protein or gene names to retrieve the PPIs associated with the query protein(s). However, few computational methods⁴²,⁴³ have applied homologous interactions to assess the reliability of PPIs.

To address this issue, we proposed the concept of the "homologous protein-protein interaction"²³. We define a homologous PPI as follows: (1) homologs of A and B are proteins with significant sequence similarity (BLASTP E-values ≤ 10⁻¹⁰)⁴⁰,⁴⁴; (2) significant joint

protein pair (A and B) and their respective homologs (A₁' and B₁') recorded in annotated PPI databases. In addition, we constructed the PPISearch server for searching homologous PPIs across multiple species and annotating the query protein pair. To our knowledge, PPISearch is the first public server that identifies homologous PPIs from annotated PPI databases and infers the transferability of interacting domains and functions between homologous PPIs and the query. Our results demonstrate that this server achieves high agreement on interacting domain-domain pairs and function pairs between query protein pairs and their respective homologous PPIs.

Furthermore, known 3D structures of interacting proteins provide interacting domains and atomic details for thousands of direct physical interactions. It is usually possible to build the binding model of a protein-protein interaction by comparative modeling if a known complex structure comprising homologs of the two sequences is available³²,⁴⁵. Therefore, we developed a new scoring function³⁹, which includes the contact residue interacting score (e.g. the steric, hydrogen-bond, and electrostatic interactions) and the template consensus score (e.g. the couple-conserved residue and template similarity scores), to evaluate the interfaces between the query and its interacting candidates.

To efficiently enlarge the set of protein interactions annotated with residue-based binding models, we proposed a new concept, "3D-domain interolog mapping", with a scoring system³⁹ to explore all possible homologous protein-protein interaction pairs between the two homolog families, derived from a known 3D-structure dimer (template), across multiple species. Each family consists of homologous proteins that have the interacting domains of the template, which allows the domain interface evolution of two interacting homolog families to be studied.

The 3D-interologs database records the evolution of protein-protein interactions across multiple species. Based on “3D-domain interolog mapping” and a new scoring function, we inferred 173,294 homologous protein-protein interactions by using 1,895 three-dimensional (3D) structure heterodimers to search the UniProt database (4,826,134 protein sequences). The 3D-interologs database comprises 15,124 species and 283,980 protein-protein interactions, including 173,294 inferred interactions (61%) and 110,686 interactions (39%) summarized from the IntAct database. For a protein-protein interaction, the 3D-interologs database shows functional annotations (e.g. Gene Ontology), interacting domains, and binding models (e.g. hydrogen-bond interactions and conserved residues). Additionally, this database provides couple-conserved residues and the interacting evolution by exploring the interologs across multiple species. Experimental results reveal that the proposed scoring function agrees well with the binding affinity of 275 mutated residues from the ASEdb. The precision and recall of our method are 0.52 and 0.34, respectively, when 563 non-redundant heterodimers are used to search the Integr8 database (549 complete genomes).

Experimental results demonstrate that the proposed method can infer reliable physical protein-protein interactions and be useful for studying the protein-protein interaction evolution across multiple species. In addition, the top-ranked strategy and template interface score are able to significantly improve the accuracies of identifying protein-protein interactions in a complete genome. The 3D-interologs database is available at http://3D-interologs.life.nctu.edu.tw.

2-1. Introduction

A major challenge of post-genomic biology is to understand the networks of interacting genes, proteins, and small molecules that produce biological functions. The large number of protein interactions⁸,⁹,²⁸ generated by large-scale experimental methods²⁶,⁴⁶,⁴⁷, computational methods³²,³⁸,³⁹,⁴⁴,⁴⁸⁻⁵⁰, and integrated approaches⁵¹,⁵² provides opportunities and challenges in annotating protein functions, protein-protein interactions (PPI), and domain-domain interactions (DDI), and in modeling cellular signaling and regulatory networks. An approach based on evolutionary cross-species comparisons, such as PathBLAST⁵³,⁵⁴ and interologs (i.e. interactions that are conserved across species⁴⁰,⁴⁴), is a valuable framework for addressing these issues. However, these methods often cannot reveal how a protein interacts with another one across multiple species.

The Protein Data Bank (PDB)³¹ stores three-dimensional (3D) structure complexes, from which physically interacting domains can be identified to study DDIs and PPIs using comparative modeling³²,³³. Some DDI databases, such as 3did³⁴ and iPfam³⁵, have recently been derived from the PDB. Additionally, some methods have utilized template-based approaches (i.e. comparative modeling³² and fold recognition³³), which search a 3D-complex library to identify homologous templates of a pair of query protein sequences in order to predict protein-protein interactions by assessing interface preference, and which score the query protein pair according to how well it fits the known template structures. However, these methods³²,³³ are time-consuming when searching all possible protein-protein pairs in a large genome-scale database (Fig. 2-1A). For example, the number of possible protein-protein pairs in the UniProt database¹² (4,826,134 sequences) is about 2.33×10¹³. In addition, these methods are unable to form homologous PPIs to explore the protein-protein evolution for a specific structure template.
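The size of this search space can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the quoted figure counts ordered pairs (n × n), which is the convention that reproduces 2.33×10¹³, and it contrasts this with the number of template chains that a template-centred search would use as queries (two chains for each of the 1,895 heterodimers in the template library).

```python
# Number of UniProt protein sequences quoted in the text.
n_sequences = 4_826_134

# Ordered pairs (A, B), self-pairs included: n * n.
# Assumption: this is the convention behind the quoted 2.33e13.
ordered_pairs = n_sequences ** 2

# Unordered pairs of distinct proteins: n * (n - 1) / 2.
unordered_pairs = n_sequences * (n_sequences - 1) // 2

# A template-centred search instead uses each template chain as a query:
# 1,895 heterodimers with two chains each.
template_chain_queries = 2 * 1_895

print(f"ordered pairs:          {ordered_pairs:.3e}")
print(f"unordered pairs:        {unordered_pairs:.3e}")
print(f"template chain queries: {template_chain_queries}")
```

Either pair-counting convention gives on the order of 10¹³ query pairs, whereas the number of template chains is in the thousands.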

To address these issues, we proposed a new concept, "3D-domain interolog mapping" (Fig. 2-1B): for a known 3D-structure complex (template T with chains A and B), domain a (in chain A) interacts with domain b (in chain B) in one species. Homolog families A' and B' of A and B are proteins that have significant sequence similarity (BLASTP E-values ≤ 10⁻¹⁰) and contain domains a and b, respectively. All possible protein pairs between these two homolog families are considered as protein-protein interaction candidates using the template T. Based on this concept, protein sequence databases can be searched to predict protein-protein interactions across multiple species efficiently. When the genome of a species had been completely deciphered, we incorporated the rank of the protein-protein interaction candidates in that species into our previous scoring system³⁹ to reduce the large number of false positives. The 3D-interologs database can indicate interacting domains and contact residues in order to visualize the molecular details of a protein-protein interaction. Additionally, this database can provide couple-conserved residues and evolutionary clues of a query sequence and its partners by examining the interologs across multiple species.
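The candidate-generation step of 3D-domain interolog mapping can be sketched as a Cartesian product of the two homolog families. This is a minimal illustration, not the database's implementation: the `Homolog` record, the accession strings, and the option to restrict pairs to proteins from the same species are all assumptions introduced here.

```python
from itertools import product
from typing import NamedTuple


class Homolog(NamedTuple):
    accession: str  # hypothetical identifier, for illustration only
    species: str


def candidate_pairs(family_a, family_b, same_species_only=True):
    """Enumerate interaction candidates between homolog families A' and B'.

    Assumption: restricting pairs to a single species is optional here;
    the text only states that all pairs between the families are candidates.
    """
    pairs = []
    for a, b in product(family_a, family_b):
        if same_species_only and a.species != b.species:
            continue
        pairs.append((a, b))
    return pairs


family_a = [Homolog("A_human", "human"), Homolog("A_mouse", "mouse")]
family_b = [
    Homolog("B_human", "human"),
    Homolog("B_mouse", "mouse"),
    Homolog("B_yeast", "yeast"),
]

all_pairs = candidate_pairs(family_a, family_b, same_species_only=False)
within_species = candidate_pairs(family_a, family_b)
print(len(all_pairs), len(within_species))
```

Because the two families are collected once per template, the cost grows with the family sizes rather than with the square of the sequence database size.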

Figure 2-1. Two frameworks of template-based methods for protein-protein interactions (PPI). (A) For each query protein sequence pair, the method searches a 3D-dimer template library to identify homologous templates for exploring the query protein pair, as in MULTIPROSPECTOR³³. (B) For each structure in the 3D-dimer template library, the method searches a protein sequence database to identify homologous PPIs of the query structure, as in 3D-interologs.

2-2. Methods and Materials

Figure 2-2 illustrates the overview of the 3D-interologs database. The 3D-interologs database allows users to input the UniProt accession number (UniProt AC¹²) or the FASTA-format sequence of the query protein (Fig. 2-2A). When the input is a sequence, 3D-interologs uses BLAST to identify the hit interacting proteins. We identified the protein-protein interactions in the 3D-interologs database through structure complexes and a new scoring function using the following steps (Fig. 2-2B). First, a 3D-dimer template library comprising 1,895 heterodimers (3,790 sequences, called NR1895) was selected from the PDB released on Feb 24, 2006. Duplicate complexes, defined by a sequence identity above 98%, were removed from the library. Dimers containing chains shorter than 30 residues were also excluded³³,⁵⁵. Interacting domains and contact residues of the two chains were identified for each complex in the 3D-dimer library. Contact residues, in which any heavy atom is within a threshold distance of 4.5 Å of any heavy atom of the other chain, were regarded as the core parts of the 3D-interacting domains in a complex. Each domain was required to have at least 5 contact residues and more than 25 interacting contact-residue pairs to ensure that the interface between the two domains was reasonably extensive. After the interacting domains were determined, their SCOP domains²² were identified, and their template profiles were constructed by PSI-BLAST. PSI-BLAST was adopted to search the domain sequences against the UniRef90 database¹², in which sequences share less than 90% identity with each other; the number of iterations was set to 3.
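The contact-residue definition and the interface-size requirement above can be illustrated with a short sketch. It assumes each chain is given as a dictionary mapping a residue identifier to a list of heavy-atom coordinates; that data layout, the function names, and the toy coordinates are hypothetical, and a real implementation would read the coordinates from PDB files and use a spatial index rather than an all-against-all loop.

```python
import math

CONTACT_CUTOFF = 4.5  # angstroms; heavy-atom threshold stated in the text


def contact_residue_pairs(chain_a, chain_b, cutoff=CONTACT_CUTOFF):
    """Return residue pairs with any two heavy atoms within the cutoff.

    Each chain maps a residue id to a list of (x, y, z) heavy-atom
    coordinates (an assumed, simplified representation).
    """
    pairs = set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                pairs.add((res_a, res_b))
    return pairs


def is_extensive_interface(pairs, min_residues=5, min_pairs=25):
    """Apply the size filter: at least 5 contact residues per domain
    and more than 25 contacting residue pairs."""
    residues_a = {a for a, _ in pairs}
    residues_b = {b for _, b in pairs}
    return (
        len(residues_a) >= min_residues
        and len(residues_b) >= min_residues
        and len(pairs) > min_pairs
    )


# Toy coordinates, for illustration only.
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(10.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)], "B2": [(20.0, 0.0, 0.0)]}

toy_pairs = contact_residue_pairs(chain_a, chain_b)
print(toy_pairs, is_extensive_interface(toy_pairs))
```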

After the 3D-dimer template library and template profiles were built, we inferred candidates of interacting proteins by 3D-domain interolog mapping. To identify the interacting-protein candidates among the protein sequences in UniProt version 11.3 (containing 4,826,134 protein sequences), the chain profile of each of the two chains in a template (e.g. CA and CB, Fig. 2-2C) was used as the initial position-specific score matrix (PSSM) of PSI-BLAST. The number of iterations was set to 1; this search procedure can therefore be considered a profile-to-sequence alignment. A pair of protein sequences (e.g. S1 and S2) was considered a protein-protein interaction candidate if the sequence identity exceeded 30% and the aligned contact residue ratio (CR) was greater than 0.5 for both alignments (i.e. S1 aligning to CA and S2 aligning to CB). For each interacting candidate, the scoring function was applied to calculate the interacting score and the Z-value, which indicates the statistical significance of the interacting score. An interacting candidate was regarded as a protein-protein interaction if its Z-value was above 3.0 and it ranked in the Top 25 in one species. The candidate rank was considered within one species to reduce the ill effect of out-paralogs, which arise from a duplication event before speciation⁵⁶. These inferred interacting protein pairs were collected in the database.
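The selection thresholds described above (sequence identity above 30%, aligned contact residue ratio above 0.5 for both alignments, Z-value above 3.0, and Top 25 within a species) can be expressed as a small filter. This is a sketch under stated assumptions: the `Candidate` record and its field names are hypothetical, and ranking candidates within a species by Z-value is an assumption, since the text does not state which quantity defines the rank.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    pair: tuple        # hypothetical identifiers of S1 and S2
    species: str
    identity_a: float  # % identity of S1 aligned to chain CA
    identity_b: float  # % identity of S2 aligned to chain CB
    cr_a: float        # aligned contact residue ratio for S1
    cr_b: float        # aligned contact residue ratio for S2
    z_value: float


def select_interactions(candidates, min_identity=30.0, min_cr=0.5,
                        min_z=3.0, top_n=25):
    """Keep candidates that pass the alignment filters, rank in the
    top_n of their species, and exceed the Z-value threshold."""
    passed = [
        c for c in candidates
        if c.identity_a > min_identity and c.identity_b > min_identity
        and c.cr_a > min_cr and c.cr_b > min_cr
    ]
    by_species = defaultdict(list)
    for c in passed:
        by_species[c.species].append(c)

    selected = []
    for group in by_species.values():
        # Assumption: the within-species rank is taken by Z-value.
        group.sort(key=lambda c: c.z_value, reverse=True)
        selected.extend(c for c in group[:top_n] if c.z_value > min_z)
    return selected


candidates = [
    Candidate(("P1", "P2"), "human", 45.0, 50.0, 0.8, 0.9, 5.0),
    Candidate(("P3", "P4"), "human", 40.0, 35.0, 0.7, 0.6, 4.0),
    Candidate(("P5", "P6"), "human", 25.0, 60.0, 0.9, 0.9, 9.0),  # identity too low
    Candidate(("P7", "P8"), "yeast", 55.0, 42.0, 0.6, 0.7, 2.0),  # Z-value too low
]
hits = select_interactions(candidates)
print([c.pair for c in hits])
```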

Figure 2-2. Overview of the 3D-interologs database for protein-protein interacting evolution, protein functions annotations and binding models across multiple species.

Finally, for each hit interacting partner derived from 3D-domain interolog mapping, this database provides functional annotations (e.g. UniProt AC, organism, descriptions, and Gene Ontology (GO) annotations³⁶; Fig. 2-2D) and the visualization of the binding models and interaction evolution (Fig. 2-2C) between the query protein and its partners. We then constructed two multiple sequence alignments of the query protein and its interacting partner (Fig. 2-2C) across multiple species. Here, the interacting-protein pair with the highest Z-score in each species was chosen as the interolog for constructing the multiple sequence alignments using a star alignment. The chains (e.g. Chains A and B, Fig. 2-2C) of the hit structure template were considered as the centers, and all selected interacting-protein pairs across species were aligned to the respective chains of the template by PSI-BLAST. The 3D-interologs database annotates the important contact residues in the interface with the following color scheme: hydrogen-bond residues (green), conserved residues (orange), conserved residues with hydrogen bonds (yellow), and others (gray).

Data Sets

Two data sets were used to assess 3D-domain interolog mapping and the scoring functions. Alanine-scanning mutagenesis is frequently used as an experimental probe to determine the contribution of a residue to the binding affinity. We selected 275 mutated residues (called BA-275) from the ASEdb⁵⁷, covering 16 heterodimers whose 3D structures are known. These mutated residues are contact residues positioned at protein-protein interfaces. For each experimentally mutated residue, ASEdb gives the corresponding delta G value, which represents the change in the free energy of binding upon mutation to alanine. Residues that contribute a large amount of binding energy are often labeled as hot spots.

In addition, we selected a non-redundant set (NR-563), comprising 563 dimer protein structures from the set NR1895 to evaluate the performance of our scoring functions for predicting PPIs in S. cerevisiae and in 549 species collected in Integr8 database (2,102,196 proteins ³⁰).

2-3. Scoring Function and Matrices

We have recently proposed a scoring function to determine the reliability of a protein-protein interaction³⁹. This study enhances this scoring by dividing the template consensus score into the template similar score and the couple-conserved residue score. Based on this scoring function, the 3D-interologs database can provide the interacting evolution across multiple species as well as the statistical significance (Z-value), the binding models, and the functional annotations between the query protein and its interacting partners. The scoring function is defined as

$$E_{tot} = E_{vdw} + E_{SF} + E_{sim} + w E_{cons}$$

where E_vdw and E_SF are the interacting van der Waals energy and the special energy (i.e. hydrogen-bond energy, electrostatic energy, and disulfide-bond energy), respectively; E_sim is the template interface similar score; and E_cons is the couple-conserved residue score. The optimal w value was obtained by testing various values ranging from 0.1 to 5.0; w is set to 3 for the best performance and efficiency in predicting binding affinity (BA-275) and in predicting PPIs in S. cerevisiae and in 549 species (Integr8) using the data set NR-563. The E_vdw and E_SF are given as

$$E_{vdw} = \sum_{i,j}^{CP} \left( Vss_{ij} + Vsb_{ij} + Vsb_{ji} \right)$$

$$E_{SF} = \sum_{i,j}^{CP} \left( Tss_{ij} + Tsb_{ij} + Tsb_{ji} \right)$$

where CP denotes the number of aligned contact residues of proteins A and B aligned to a hit template; Vss_ij and Vsb_ij (Vsb_ji) are the sidechain-sidechain and sidechain-backbone van der Waals energies between residues i (in protein A) and j (in protein B), respectively. Tss_ij and Tsb_ij (Tsb_ji) are the sidechain-sidechain and sidechain-backbone special interacting energies between i and j, respectively, if the residue pair i and j forms a special bond (i.e. hydrogen bond, salt bridge, or disulfide bond) in the template structure. The van der Waals energies (Vss_ij, Vsb_ij, and Vsb_ji) and special interacting energies (Tss_ij, Tsb_ij, and Tsb_ji) were calculated from the four knowledge-based scoring matrices (Fig. 2-3), namely the sidechain-sidechain (Fig. 2-3A) and sidechain-backbone (Fig. 2-3B) van der Waals scoring matrices, and the sidechain-sidechain (Fig. 2-3C) and sidechain-backbone (Fig. 2-3D) special-bond scoring matrices.

Figure 2-3. Knowledge-based protein-protein interacting scoring matrices: (A) sidechain-sidechain van der Waals scoring matrix; (B) sidechain-backbone van der Waals scoring matrix; (C) sidechain-sidechain special-bond scoring matrix; (D) sidechain-backbone special-bond scoring matrix. The sidechain-sidechain scoring matrices are symmetric and the sidechain-backbone scoring matrices are non-symmetric. For the sidechain-sidechain van der Waals scoring matrix, the scores are high (yellow blocks) if large aliphatic residues (i.e. Val, Leu, Ile, and Met) interact with large aliphatic residues or aromatic residues (i.e. Phe, Tyr, and Trp) interact with aromatic residues. In contrast, the scores are low (orange blocks) when nonpolar residues interact with polar residues. For the sidechain-sidechain special-bond scoring matrix, the scores are high when the interacting residues (i.e. Cys with Cys) form a disulfide bond or when basic residues (i.e. Arg, Lys, and His) interact with acidic residues (Asp and Glu). The scoring values are zero if nonpolar residues interact with other residues.

These four knowledge-based matrices, which were derived using a general mathematical structure⁵⁸ from a nonredundant set of 621 3D-dimer complexes proposed by Glaser et al.⁵⁹, are the key components of the 3D-interologs database for predicting protein-protein interactions. This dataset is composed of 217 heterodimers and 404 homodimers, and the sequence identity between any two complexes is less than 30%. The entry S_ij of a scoring matrix, which is the interacting score for a contact residue pair i, j (1 ≤ i, j ≤ 20), is defined as

$$S_{ij} = \ln \frac{q_{ij}}{e_{ij}}$$

where q_ij and e_ij are the observed probability and the expected probability, respectively, of the occurrence of each i, j pair. For the sidechain-sidechain van der Waals scoring matrix, the scores are high (yellow blocks) if large aliphatic residues (i.e. Val, Leu, Ile, and Met) interact with large aliphatic residues or aromatic residues (i.e. Phe, Tyr, and Trp) interact with aromatic residues. In contrast, the scores are low (orange blocks) when nonpolar residues interact with polar residues. The two highest scores are 3.0 (Met interacting with Met) and 2.9 (Trp interacting with Trp).
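A log-odds entry of this kind can be computed from a list of observed contact pairs as sketched below. The sketch is illustrative only: the text does not specify how the expected probability is estimated, so the code assumes that it is the product of the two residues' background frequencies among all contacting residues, and it counts each pair in both orders so that the resulting matrix is symmetric, as for the sidechain-sidechain matrices.

```python
import math
from collections import Counter


def log_odds_matrix(contact_pairs):
    """Compute S_ij = ln(q_ij / e_ij) for observed contact residue pairs.

    Assumption: e_ij is the product of the background frequencies of
    residue types i and j among all contacting residues.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for i, j in contact_pairs:
        # Count both orders so the matrix is symmetric.
        pair_counts[(i, j)] += 1
        pair_counts[(j, i)] += 1
        residue_counts[i] += 1
        residue_counts[j] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    scores = {}
    for (i, j), count in pair_counts.items():
        q_ij = count / total_pairs
        e_ij = (residue_counts[i] / total_residues) * (
            residue_counts[j] / total_residues
        )
        scores[(i, j)] = math.log(q_ij / e_ij)
    return scores


# Toy contact list, for illustration only.
toy_contacts = [("LEU", "LEU"), ("LEU", "LEU"), ("ASP", "LYS"), ("ASP", "LYS")]
toy_scores = log_odds_matrix(toy_contacts)
print(toy_scores)
```

Pairs observed more often than expected by chance receive positive scores, and pairs that never occur are simply absent from the dictionary; a production implementation would need pseudocounts or a floor value for such entries.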

The value of E_sim was calculated from the BLOSUM62 matrix⁵⁸ based on two alignments between the two chains (A and B) of the template and their homologous proteins (A' and B'). In its definition, CP is the number of contact residue pairs in the template; i and j are the contact residues in chains A and B, respectively; K_ii' is the score of aligning residue i (in chain A) to i' (in protein A') and K_jj' is the score of aligning residue j (in chain B) to j' (in protein B') according to the BLOSUM62 matrix; and K_ii and K_jj are the diagonal scores of the BLOSUM62 matrix for residues i and j, respectively. The couple-conserved residue score (E_cons) was determined from the two profiles of the template. In its definition, CP is the number of contact residue pairs; M_ip is the score in the PSSM for residue type i at position p in protein A; M_jp′ is the score in the PSSM for residue type j at position p′ in protein B; and K_ii and K_jj are the diagonal scores of the BLOSUM62 matrix for residue types i and j, respectively.

To evaluate the statistical significance (Z-value) of the interacting score of a protein-protein interaction candidate, we randomly generated 10,000 interfaces by mutating 60% of the contact residues for each heterodimer in the 3D-dimer template library. Each selected residue was substituted with another amino acid residue according to the probability derived from these 621 complexes⁵⁹. The mean and standard deviation for each 3D-dimer were determined from these 10,000 random interfaces, whose scores are assumed to follow a normal distribution. Based on the mean and standard deviation, the Z-value of a protein-protein candidate predicted by this template can be calculated.
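The Z-value procedure can be sketched as follows. This is a simplified illustration rather than the database's code: the interface is represented as a flat list of residue types, the scoring function is passed in as a hypothetical callable (`score_fn`), and the toy background probabilities and toy score used in the example are made up for demonstration.

```python
import random
import statistics


def z_value(score, template_interface, score_fn, residue_probs,
            n_random=10_000, mutate_fraction=0.6, seed=0):
    """Z-value of `score` against scores of randomly mutated interfaces.

    template_interface: list of residue types at the contact positions.
    score_fn: hypothetical callable that scores an interface.
    residue_probs: background probability for each residue type.
    """
    rng = random.Random(seed)
    n_mutate = round(mutate_fraction * len(template_interface))
    random_scores = []
    for _ in range(n_random):
        interface = list(template_interface)
        for pos in rng.sample(range(len(interface)), n_mutate):
            # Substitute with a different residue type, drawn by probability.
            others = [r for r in residue_probs if r != interface[pos]]
            weights = [residue_probs[r] for r in others]
            interface[pos] = rng.choices(others, weights=weights)[0]
        random_scores.append(score_fn(interface))

    mean = statistics.fmean(random_scores)
    std = statistics.stdev(random_scores)
    if std == 0:
        raise ValueError("random interface scores have zero variance")
    return (score - mean) / std


# Toy example: the "score" is just the number of leucines in the interface.
def toy_score(interface):
    return interface.count("L")


toy_template = list("LDLDLDLDLD")
toy_probs = {"L": 0.4, "D": 0.3, "K": 0.3}
toy_z = z_value(toy_score(toy_template), toy_template, toy_score,
                toy_probs, n_random=500)
print(round(toy_z, 2))
```

Under the 3.0 threshold quoted earlier, a candidate would be kept only if its score lies more than three standard deviations above the mean of the random interfaces for its template.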

2-4. Inputs and Outputs

The 3D-interologs database server is easy to use. Users input the UniProt AC or the FASTA-format sequence of the query protein (Fig. 2-2A). The server returns a list of interacting partners with functional annotations (e.g. the gene name, the protein description, and GO annotations) (Fig. 2-2D) and visualizes the binding model and contact residues between the query protein and each partner by aligning them to the respective template sequences and structures. Additionally, the 3D-interologs system presents the interacting evolution analysis by using multiple sequence alignments of the interologs across multiple species (Fig. 2-2C), in which the significant contact residues in the interface are indicated. If Java is installed in the user’s browser, the output shows the structures, and users can dynamically view the binding model, interacting domains, and important residues in the browser.

2-5. Example Analysis

Figure 2-4 shows the search results using the human protein NXT1 (UniProt AC Q9UKK6) as the query sequence. NXT1, which is a nucleocytoplasmic transport factor and shuttles between the nucleus and cytoplasm, accumulates at the nuclear pore complexes⁶⁰. For this query, the 3D-interologs database yielded 8 hit interacting partners (Fig. 2-4A), comprising 5 partners derived from the 3D-interologs database and 5 partners from the IntAct database; thus, two partners were present in both databases. Among these 8 hits, 3 partners (i.e. UniProt AC Q68CW9, Q5H9I1, and Q9GZY0) were not recorded in the IntAct database, but they very likely interact with NXT1. Q68CW9, which is part of the protein NXF1 (UniProt AC Q9UBU9), consists of the UBA-like domain and the NTF-like domain, which is responsible for association with the protein NXT1⁶¹. The sequence of the protein Q5H9I1 is the same as that of the protein Q9H4D5 (i.e. nuclear RNA export factor 3), which binds to NXT1⁶². The protein Q9GZY0 (nuclear RNA export factor 2) binds NXT1 to export mRNA cargoes from the nucleus into the cytosol⁶³.

The protein NXT1 interacts with the protein NXF1 to form a compact heterodimer (PDB code 1jkg⁶³) and an interacting β surface, which is lined with hydrophobic and hydrophilic residues (Fig. 2-4B). Twenty hydrogen bonds or electrostatic interactions are formed in this compact interface. The salt bridge formed by NXT1 Arg134 and NXF1 Asp482 is especially important in the interface⁵⁷. The interacting evolution analysis built from 10 interologs reveals that these two residues (Arg134 and Asp482) are conserved in all species (Fig. 2-4C). Additionally, some interacting residues forming hydrogen bonds are also couple-conserved, for example NXT1 Asp76 and NXF1 Arg440; NXT1 Gln78 and NXF1 Ser417; and NXT1 Pro79 and NXF1 Asn531⁵⁷. The evolution of the interaction is valuable for revealing both couple-conserved and critical residues in the binding site.

Figure 2-4. The 3D-interologs database search results of using human NXT1 as query.

(A) Eight interacting partners of NXT1 are found in the 3D-Interologs. For each interacting partner, this server provides UniProt accession number, protein description, organism and Gene Ontology annotation. (B) Detailed
