• 沒有找到結果。

Chapter 1. INTRODUCTION

1.4 Thesis overview

In this study, we use “3D-domain interologs mapping” to predict domain annotated protein interactions. The flowchart shows in Figure 3. We collect dimers of known structure from Protein Databank. For each 3D-dimer, we estimate the probabilities with witch residue pairs occur at various contact positions and construct a pairPSSM to assess the fit of any

possible interacting protein pairs. And then we use these dimers as queries to search target protein database and predict many candidates of protein-protein interaction. We evaluate our method on three datasets; one is the multiple complex structures with the same interacting SCOP domains(32); one is protein-protein interactions in yeast proteome; and the other is yeast gene expression profiles. We finally apply our method to seven organisms commonly used in molecular research, including Homo sapiens, Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. In these seven organisms, our method predicts ~450,000 new protein interactions in which the interacting domains and residues (binding sites) are automatically modeled. These visualized interacting residues are useful for the detailed analysis of protein-protein interactions.

Chapter 2

METHODS AND MATERIALS

2.1 Overview

We apply 3D-domain interologs mapping to infer domain annotated protein-protein interactions from 3D protein complexes. Step by Step of our method is showed as follows:

(i)preparing the database of dimer template: we consider two contact chains in a 3D protein complex as 3D-complex template. We identify the contact residues of a 3D-complex template (containing chains A and B) in the PDB and define domain boundary by the SCOP database;

(ii)for each dimer template, build a protein-protein interaction position-specific matrix (pairPSSM); (iii)generating candidates: we use a dimer template to search two protein families (A’ and B’) which contain the corresponding domains (a and b) from the target protein database and consider all the protein pairs between the two family as candidates of interacting proteins; (iv)scoring: we project the contact residue positions from a dimer template to its homologous protein pair. Then, summation of energy of all contact residue pairs in the homologous protein pair based on pairPSSM is considered the interactive energy of the homologous protein pair. If the interactive energy exceed to a threshold, we predict the two proteins interact with each other.

2.2 Preparation of data sets

2.2.1 Database of 3D-dimer

The PDB database (Protein Data Bank, http://www.rcsb.org/pdb/)(27) stores many three-dimensional structure of macromolecules, some of which are cocrystallized records. We select the cocrystallized proteins with using the criteria listed below:

1. The resolution of the PDB records should be smaller than 3.0Å.

2. Each chain of the cocrystallized proteins should be comprised more than 35 amino acids to be considered as domains. If the protein is consisted of the cross-chain domain defined by SCOP, we also regard it.

3. The number of interacting residue pairs is set to be greater than 25 and each chain must contain more than 5 contact residues to make sure that the dimer is reasonably extensive(20). Interacting residue pairs are defined as a pair of residues from different chains that have at least one pair of heavy atoms within 4.5Å with each other.

4. Elimination of artificial packing complexes rather than biologically functional multimers by using PQS server(33), where adopt the reduction of solvent accessibility (ASA) due to oligomerisation.

From 35343 records in PDB (20060204), 29369 dimers of known structures are selected including 8018 heterodimers and 21351 homodimers. Then, we remove redundancy by sequence identity > 50% and leave 1122 heterodimers and 3514 homodimers, respectively.

2.2.2 Definition of protein domain

The protein domain definition is referenced by SCOP (Structure Classification of

Proteins, http://scop.berkeley.edu/)(32). In 1.69 releases, there are 70859 domains from 25973 PDB entries. The SCOP database is based on evolutionary relationships and on the principles that govern their three-dimensional structure. Strong sequence similarity alone is considered to be sufficient evidence for common ancestry. Close structural and functional similarity together is also accepted as sufficient evidence for distant homology between proteins that lack significant sequence similarity. But neither structural nor functional similarity alone is considered to be strong evidence. Therefore, the SCOP database is organized on a number of hierarchical levels, with the principle ones being family, superfamily, fold and class. Families contain protein domains that share a clear common evolutionary origin, as evidenced by sequence identity or extremely similar structure and function. Superfamilies consist of families whose proteins share very common structure and function, and therefore there is reason to believe that the different families are evolutionarily related. Folds consist of one or more Superfamilies that share a common structure core structures. Depending on the type and organization of secondary structural elements, folds are grouped into four main classes (all α, all β, α+β, α/β).

2.2.3 Data set of related 3D-dimers

We want to explore whether the two similar dimers possess the similar protein interaction type. Modeling protein interactions by homology is reasonable only when this hypothesis is valid. Here we defined the two 3D-dimers contain the same interacting SCOP domain (more than three contact residues within the domain boundary) are related-dimer pairs.

From our database of 3D-dimer, we first remove the dimers with no definition of SCOP domain then leave 5553 heterodimers and 15026 homodimers. Second, the dimers are clustered by BlastCluster(34) according sequence identity more than 80% and both sequence

coverage more than 0.8. We choose one representative dimer from each cluster if the number of interacting residue pairs more than the mean of the cluster and the resolution of crystallization is smallest. In this way, a representative set of 3D-dimers is selected, which contains 999 heterodimers and 2837 homodimers. Third, the representative set of 3D-dimers is grouped based on the domain definition in SCOP. We group the dimers which possess the same interacting domain pair in family level. Totally, there are 540 types of interacting domain pair in the 999 heterodimers and 1425 types of domain-pair in the 2837 homodimers, respectively. 189 groups of heterodimer and 488 groups of homodimer contain more than 1 member. We choose one representative member for each group and pair the representative one for the all other members in the group. These pairs of dimer are considered as related dimer pairs. In this way, we select 459 related dimer pairs from the 189 groups of heterodimer and 1412 related dimer pairs from 489 groups of homodimer, respectively.

On the other hand, we define a subset from database of 3D-dimer which include the dimers should be in two-chain PDB records (there are only two proteins in the PDB entry).

The rationale is interaction between two proteins may be bothered by other proteins if there are more than three proteins cocrystallized at the same time. There are 897 two-chain heterodimers and 3665 two-chain homodimers selected from the database of dimer template.

Finally, we can get 114 pairs of two-chain heterodimer and 616 pairs of two-chain homodimers.

2.2.4 Data set of true protein complexes and unreasonable protein pairs

Our method is based on a 3D-dimer to model all potential protein interactions and used a specific pairPSSM to determine whether the two proteins interact in nature. Here we construct a data set include the protein pairs which really form complexes in living thing and the protein

pairs don’t interact with each other (the unreasonable protein pairs). From 189 representative 3D-dimers (exclude the antibody-antigen complex), we used the PSI-BLAST to search our database of 3D-dimer which remove dimers with > 80% sequence identity and get homologous protein pair with E value threshold 0.1. If the protein pair is cocrystallized in PDB and it contains the same interactive SCOP domains for the query dimer, we consider the protein pair as positive that means this protein pair should be predicted by the 3D-dimer. In the other word, if the protein pair is not cocrystallized in PDB and it does not contain the same interactive SCOP domains for the query dimer, we consider the protein pair as negative.

In this way, we can select 224 positive protein pairs and 282 negative protein pairs.

2.2.5 Data set of yeast proteome

We test our method to predict interactions in baker’s yeast (Saccharomyces cerevisiae).

Being a model system relevant to human biology, yeast has attracted special interest from the scientific community. There are 6000 proteins but the estimated number of actual interactions is smaller than 100,000 in Saccharomyces cerevisiae(1). Recently, high-throughput interaction screens of protein interactions in S. cerevisiae had been conductive by yeast two-hybrid(2,35) and affinity purification followed by mass spectroscopy(36,37). At the same time, many large-scale experiment of gene expression for S. cerevisiae proteome are also carried out(38,39). The public available data of functional genomics in Saccharomyces cerevisiae are most comprehensive.

The yeast proteome is obtained from the web site of the SGD (Saccharomyces Genome Database, http://www.yeastgenome.org)(40). The corresponding amino acid sequences of total 5877 open reading frames (ORFs) are subsequently downloaded. The comprehensive protein-protein interactions are downloaded from the DIP database (Database of Interacting

Protein, http://dip.doe-mbi.ucla.edu/)(14). The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated, both, manually by expert curators and also automatically using computational approaches that utilize the knowledge about the networks of protein-protein interaction extracted from the most reliable, core subset of the DIP data. There are total 5882 reliable interactions downloaded from DIP and considered as positive set.

Because no directly information about which proteins do not interact, Jansen et al.(6) assumed that proteins in different compartments do not interact with each other and generate 2,708,746 non-interacting protein pairs from lists of proteins in separate subcellular compartments(41). We used these protein pairs as gold negative set.

2.2.6 Data set of yeast gene expression

The gene expression profiles of two interacting proteins were also used to access the accuracy of our method according to the basic assumption: “the gene pair with similar expression profiles is likely to encode an interacting protein pair”(42). The Rosetta compendium set consisting of the expression profiles of 300 deletion mutants and under chemical treatments(39) was used to measure the similarity of gene expression profiles of two genes

2.2.7 Performance criteria

We used two common metrics to assess the quality of our method, including mean average precision (MAP) and mean false positive rate (MFP). The mean average precision is

defined as:

where Thi is the number of compounds in a hit list containing i correct interactions; A is total number of true hits in the databases; M is the total number of template.

The mean false positive rate is defined as:

FP=

( ∑

hA=1

(

ThAh

) (

/ TA

) )

/A (4a) MFP=

( ∑

Mi=1FPi

)

/M (4b)

where Ah is the number of active ligands among the Th highest ranking compounds; T is the total number of candidates from PSI-BLAST; A is total number of true hits in the T; M is the total number of template.

In this study, the similarity of two gene-expression is defined by the Pearson correlation coefficient between the two gene-expression profiles (see 2.2.6). To test whether the mean of correlation coefficient for candidates of protein-protein interactions higher than that of non-interacting protein pairs, we calculate the T-score and the P-value for the null hypothesis of the sample mean (our prediction) smaller than the mean of gold negative set. We apply standard two sample T-test statistics:

2 where u is mean of samples, and S is the standard deviations of the samples:

=

− −

= n

i

u n x

S i

1

)2

1 ( 1

(6)

2.3 Construction of pairPSSM

Here we want to develop a method to estimate the probabilities with witch residue pairs occur at various contact positions by evolutionary profiles, leading to a more sensitive scoring system. The probabilities with witch residue pairs occur at various contact positions are transformed to energy of contact residue pair (pair Position Specific Scoring Matrix called pairPSSM). The energy calculated from pairPSSM is the specific interfacial energy. On the other hand, the energy calculated from empirical matrix is the general interfacial energy.

2.3.1 Building comprehensive and non-redundant protein database

To obtain the evolutionary profiles from multiple sequence alignment, our alignment result should be come from a comprehensive and non-redundant protein database. The protein database is obtained from the web site of the NCBI Reference Sequences database (Release 16, http://www.ncbi.nlm.nih.gov/RefSeq/). The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. In Release 16, Refseq includes 2,273,764 protein sequences across 3244 organisms. Table 1 shows the protein sequence composition in Refseq database. Although RefSeq aims to provide a non-redundant set of sequences for users, two major source of redundancy occur in RefSeq.

One is alternative splicing; the other is duplication of genes (paralog). Therefore, we used BlastCluster to remove redundancy with both the sequence identity and coverage as high as

90% in the same species. In this way, 2,109,945 protein sequences are selected into our non-redundant protein database.

2.3.2 Scoring matrix architecture

For a 3D-dimer with the number of contact residue pairs M, the empirical energy matrix of dimension 20 × 20 is replaced by a protein-protein interaction position-specific matrix (pair PSSM) of dimension M × 20. The residue pair in a contact position is considered as a single symbol. The advantage of this matrix is estimation of the probabilities with witch residue pairs occur at various contact positions, leading to a more sensitive scoring system.

2.3.3 Construction of multiple sequence alignment

To produce a multiple sequence alignment from the PSI-BLAST output, we simply collect all RefSeq sequence segments that have been aligned to the two proteins of 3D-dimer with E value below a threshold, by set to 10-9 to assure the members are similar enough to the 3D-dimer. The two proteins of 3D-dimer are used as a master for constructing two multiple sequence alignments, respectively. Any row that is > 95% identical to 3D-dimer is purged. To ensure that all the sequences found by PSI-BLAST are likely to be structurally related to the dimer template, we set a similarity threshold described by Batalov & Abagyan(43,44) was used, defined as follows:

t(L)≤31.8L0.102+m×17.4L0.29 (7)

where t(L) is the percentage sequence identity threshold dependent on alignment length L, and m is the level confidence of this threshold (in standard deviations). In this work, we use m = 3

(the identity threshold at least above 25%) to make sure all the sequences found by PSI-BLAST with similar interactive type to their dimer template (see section 3.1).

Because most the interacting proteins belong to the same species, the proteins in the two multiple sequence alignments must be arranged in order by species. For example, the 1st protein in a multiple sequence alignment is a human protein, and the 1st protein in the other must be a human protein. The 2nd protein in a multiple sequence alignment is a mouse protein, and the 2nd protein in the other must be a mouse protein.

2.3.4 Target frequency estimation

Given two multiple sequence alignment from a 3D-dimer, we generate score matrices with the theoretical foundation is that the scores for a specific contact position be of the form log (Qij/Pij), where Qij is the estimated probability for contact residue pair i&j to be found in the column and Pij is the expect probability of i&j to be found in the column. The estimate of Qij for a specific contact position should converse simply to the observed frequency of residue pair i&j in that column. However, it is complicate estimating the Qij include small sample size and prior knowledge of relationships among the residues should be considered. We implemented the data-dependent pseudocount method introduced by Tatusov et al.(45). It is relative simplicity and often performs nearly as well comparing the Dirichlet mixtures(46).

We slightly modify the data-dependent pseudocount method by using the prior knowledge of amino acid relationships embodied in the substitution matrix to generate residue pair pseudocount frequencies gij. For a given position pair of contact residues C, we construct pseudocount frequencies gij using the formula:

gij =PiPjeSij (8)

where Sij is the interactive energy of residue i&j contact in empirical energy matrix. Pi is the background probability of residue i. The rationale is that we use the prior knowledge of interactive energy between residue i&j to estimate pseudo frequency. We then estimate Qij

followed as:

β α

β α

+

= ij + ij

ij

g

Q f (9)

where the α and β are the relative weights given to observed and pseudocount residue frequency. In our study, we let α = the number of different residue-pair types in column -1 and β = 5. If the β is larger, the greater the emphasis is given to prior knowledge.

Pij is the expect probability of i&j to be found in the column and is calculated from:

Pij = PiPj (10)

where Pi is the expect probability of residue i occurring in protein surface. The residue composition of the protein interface is obtained from Lu et al.(47)(Table 2).

2.3.5 Amino acid classification

The sequence variability at each contact position could be estimated from the two multiple sequence alignment of dimer template. However, by not making concessions for conservative mutations the scheme becomes too rigid. Unlike unconservative mutations, conservative ones preserve the essential nature of the side chain. Therefore, we make some tolerances for such mutations. Saha et al.(48) made a classification based on the similarity of the environment of each amino acid residue in protein structures to the nine groups:

(i) Ala and Val; (ii) Met, Leu and Ile; (iii) Gly, Ser and Thr; (iv) Pro, Phe, Tyr and Trp; (v) Cys;

(vi) His; (vii) Arg and Lys; (viii) Asp and Glu; (ix) Asn and Gln. We test this classification of amino acid whether suitable for access the contact residue potential by calculating the standard deviation of contact residue potential in the cluster of amino acid. Figure 4A shows that the three groups {A, V}, {P, F, Y, W} and {R, K} have high standard deviations of intra contact residue potential. Therefore, we slightly modify the group as follows:

(i) Ala and Gly; (ii) Val, Met, Leu and Ile; (iii) Pro, Ser and Thr; (iv) Phe, Tyr and Trp; (v) Cys;

(vi) His and Arg; (vii) Lys; (viii) Asp and Glu; (ix) Asn and Gln. In this way, all the standard deviations of intra contact residue potential are smaller than 0.4 (Figure 4B). We consider this amino acid classification is more reasonable for measuring the contact residue potential.

2.3.6 pairPSSM evaluation

The pairPSSM is evaluated by two data sets. First, we test whether the energy calculated from pairPSSM could distinguish the true protein complexes and unreasonable protein pairs (data set described in section 2.2.4). Second, we apply our method to predict protein-protein interactions in yeast proteome (data set described in section 2.2.5) and used two common metrics (MAP and MFP described in 2.2.6) to assess the performance of pairPSSM and compare the empirical matrix used by previous method. Then, we test the similarity of gene expression profiles for the candidates of protein-protein interactions. Finally, we give two true biological examples to illustrate the operation and power of our method.

Chapter 3

RESULTS and DISCUSSIONS

In this study, we first explore the relationship between sequence similarity and interactive similarity in protein-protein interactions. Modeling protein interactions by homology is reasonable only when the correlation is high enough. Second, because the structures of proteins are unsolved, we are only able to use the method of sequence alignment to align the template and target proteins. The consistence ratio between sequence alignment and structure alignment must also be studied. Third, our method is verified in two data sets:

one is true protein complexes and unreasonable protein pairs. The other is protein-protein interactions in yeast proteome. Finally, we applied 3D-domain interologs mapping to predict protein interactions for seven organisms commonly used in molecular research.

3.1 Two issues in modeling interactions by homology

3.1.1 Similar 3D-dimers imply similar interactive types

The data set (459 pairs of related heterodimers and 1412 pairs related homodimers) described above provide all instances of a particular interaction type occurring within different complex structures, that we then wish to compare to each other and correlate with sequence similarity. To compare the binding of different instances of the two dimers with the same interacting domains, we devise an index, pair coverage, to calculate binding site overlap from the number of shared interacting residue pairs. Given a pair of related dimers A-B and

A’-B’, where A-A’ and B-B’ contain same SCOP domains. we use a structural alignment tool, CE(49), to align the A-A’ and B-B’, respectively. The pair coverage is defined as:

' ' 2

cov

B A

AB NCP

NCP erage NCPM

NCP erage NCPM