3.1 Inferred Protein-protein Interactions from Domain-domain Interactions

Our structural domain-domain interaction database currently includes information on 36465 protein chains of known 3D structure. We grouped the interactions among these chains into 1008 types of domain-domain interactions according to the Pfam domains mediating them (Table 2). We used the domain-domain interactions to predict protein-protein interactions and to assess the prediction accuracy at the protein level. After calculating the score of each candidate protein and template, we applied these domain-domain interactions to infer protein-protein interactions in eight common model organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Helicobacter pylori and Escherichia coli. We inferred 53669 protein-protein interactions in Homo sapiens, 39689 in Mus musculus, 4461 in Rattus norvegicus, 857 in Drosophila melanogaster, 941 in Caenorhabditis elegans, 1130 in Saccharomyces cerevisiae, 603 in Escherichia coli and 133 in Helicobacter pylori. In total, we inferred 101483 protein-protein interactions (Table 1), considerably more than are stored in the DIP database.
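The inference step can be illustrated with a minimal sketch: two proteins become a candidate pair when one carries a domain that is known to interact with a domain carried by the other. The function name, the data layout and the third protein are hypothetical, and the sketch deliberately omits the scoring and thresholding applied afterwards; it is not the implementation used in this work.

```python
from itertools import combinations


def infer_ppis(protein_domains, interacting_domain_pairs):
    """Return protein pairs linked by at least one known domain-domain interaction.

    protein_domains: dict mapping a protein accession to a set of Pfam IDs.
    interacting_domain_pairs: iterable of (Pfam ID, Pfam ID) tuples.
    """
    # Store domain pairs as frozensets so that (A, B) and (B, A) are equivalent.
    ddi = {frozenset(pair) for pair in interacting_domain_pairs}
    inferred = set()
    for a, b in combinations(sorted(protein_domains), 2):
        if any(frozenset((da, db)) in ddi
               for da in protein_domains[a]
               for db in protein_domains[b]):
            inferred.add((a, b))
    return inferred


# Toy annotations: the first two entries follow the FGFR2/FGF1 example in
# Section 3.4; the third protein and its domain are made up for illustration.
toy_domains = {
    "P21802": {"PF00047"},          # immunoglobulin domain
    "P05230": {"PF00167"},          # fibroblast growth factor domain
    "HYPOTHETICAL_1": {"PF_HYPOTHETICAL"},
}
toy_ddi = [("PF00047", "PF00167")]

candidate_pairs = infer_ppis(toy_domains, toy_ddi)
```

Here `candidate_pairs` contains only the pair formed by P05230 and P21802, because the made-up third protein shares no known interacting domain with either of them.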

3.2 Determining the Threshold

In this thesis, our scoring function contains four scoring terms, each with its own weight. We tested different weight combinations of the scoring terms (Figure 6). We considered that each scoring term of our scoring function has a certain effect, and that function similarity, sequence similarity and binding-site conservation are equally important. In our opinion, the terms weighted by W_e and W_p are both extended from PSI-BLAST to measure sequence similarity. In order to take each scoring term into account, we set W_k to 1, W_e to 0.5, W_p to 0.5 and W_b to 1.
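As a rough illustration of how four weighted terms might be combined, the sketch below assumes a simple linear weighted sum with the weights W_k = 1, W_e = 0.5, W_p = 0.5 and W_b = 1. The linear form, the function name and the term values are assumptions made for illustration only; the actual form of the scoring function is not restated in this section.

```python
# Weights as read from the text; the keys name the weight symbols.
WEIGHTS = {"k": 1.0, "e": 0.5, "p": 0.5, "b": 1.0}


def combined_score(terms, weights=WEIGHTS):
    """Weighted sum of the four scoring terms (assumed linear combination)."""
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical term values for one candidate protein, not real data.
example_terms = {"k": 0.8, "e": 0.6, "p": 0.4, "b": 1.0}
example_score = combined_score(example_terms)
```

With these weights, the two PSI-BLAST-based terms together carry the same total weight (0.5 + 0.5) as each of the other two terms, which matches the stated view that the three aspects are equally important.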

We used the TP/FP ratio as the benchmark to verify our predictions. The TP/FP ratio is defined as the number of our predicted protein pairs that overlap with the positive set divided by the number that overlap with the negative set. The positive dataset contains 14779 interacting protein pairs of Saccharomyces cerevisiae from DIP [30]; the negative dataset contains 2599785 non-interacting protein pairs of Saccharomyces cerevisiae [40]. Experimental results show that the TP/FP ratio is highly correlated with the values of our scoring function (Figure 7). Jansen et al. indicated that predictions are more reliable when the TP/FP ratio is greater than 1 [40]. We therefore placed the threshold where the TP/FP ratio equals 1, which corresponds to a threshold of 0.5. A threshold of 0.5 means that the score of the candidate protein divided by the score of the template must be greater than 0.5.
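The two quantities defined above can be written down directly. The following is a minimal sketch with hypothetical function names and toy protein identifiers; pairs are stored as unordered sets so that (A, B) and (B, A) count as the same pair.

```python
def pair(a, b):
    """Order-independent representation of a protein pair."""
    return frozenset((a, b))


def tp_fp_ratio(predicted, positives, negatives):
    """Overlap with the positive set divided by overlap with the negative set."""
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    if fp == 0:
        return None  # ratio is undefined when no prediction hits the negative set
    return tp / fp


def passes_threshold(candidate_score, template_score, threshold=0.5):
    """True when candidate score / template score exceeds the threshold."""
    return candidate_score / template_score > threshold


# Toy data for illustration only.
predicted = {pair("A", "B"), pair("A", "C"), pair("B", "D")}
positives = {pair("B", "A"), pair("B", "D")}
negatives = {pair("C", "A")}
ratio = tp_fp_ratio(predicted, positives, negatives)
```

In this toy case two predictions fall in the positive set and one in the negative set, so `ratio` is 2.0.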

Figure 8 shows the number of inferred protein-protein interactions (candidates) at different TP/FP ratios. Figure 7 and Figure 8 show that, as the threshold becomes more stringent, the number of candidates decreases and the TP/FP ratio rises. We regard the TP/FP ratio as a measure of the accuracy, and hence the quality, of our predictions. At higher accuracy we obtain higher-quality predictions, but the number of inferred protein-protein interactions decreases. We set the threshold at 0.5 as a compromise between quality and quantity. The TP/FP ratio is the most important information in Figure 7 and Figure 8. Because it is defined as the number of predicted protein pairs overlapping with the positive set divided by the number overlapping with the negative set, it may lose its statistical meaning when the overlap with the positive set is too low. This accounts for the perturbations in Figure 7 and Figure 8 where the positive overlap was insufficient.
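The trade-off between quality and quantity can be explored by sweeping the threshold and recording, for each value, the number of candidates kept and the resulting TP/FP ratio. The sketch below is illustrative: the function name, the toy scores and the `min_positive_overlap` cutoff are all hypothetical, since the text gives no numeric criterion for when the positive overlap is too low.

```python
def sweep_thresholds(score_ratios, positives, negatives, thresholds,
                     min_positive_overlap=3):
    """For each threshold, report (threshold, candidates kept, TP/FP, reliable).

    score_ratios maps a pair identifier to candidate score / template score.
    'reliable' is False when the overlap with the positive set is small,
    in which case the ratio should be read with caution.
    """
    rows = []
    for t in thresholds:
        kept = {p for p, r in score_ratios.items() if r > t}
        tp = len(kept & positives)
        fp = len(kept & negatives)
        ratio = tp / fp if fp else None  # undefined without negative overlap
        rows.append((t, len(kept), ratio, tp >= min_positive_overlap))
    return rows


# Toy data for illustration only.
toy_ratios = {"p1": 0.9, "p2": 0.8, "p3": 0.6, "p4": 0.4, "p5": 0.3, "p6": 0.2}
toy_positives = {"p1", "p2", "p3", "p5"}
toy_negatives = {"p4", "p6"}
rows = sweep_thresholds(toy_ratios, toy_positives, toy_negatives,
                        thresholds=[0.1, 0.35, 0.7])
```

In the toy data, raising the threshold from 0.1 to 0.35 reduces the candidates from six to four while the ratio rises from 2.0 to 3.0; at 0.7 only two candidates remain and the row is flagged as unreliable.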

3.3 Correlation Coefficients of Gene Expression

Genes with similar expression profiles are likely to encode interacting proteins. We studied the distribution of correlation coefficients for protein pairs whose predicted interaction probability is greater than a given threshold. Figure 9 shows that the correlation of gene-expression profiles is closely related to the value of our threshold. We compared the gene-expression correlation coefficients of our predictions with those of random protein pairs and of pairs in the DIP database; our predictions have a higher mean correlation coefficient.
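The comparison of mean correlation coefficients can be sketched as follows. The text does not state which correlation coefficient was used; this example assumes the Pearson coefficient, and the function names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def mean_pair_correlation(pairs, expression):
    """Mean correlation over protein pairs, given per-gene expression profiles."""
    values = [pearson(expression[a], expression[b]) for a, b in pairs]
    return sum(values) / len(values)


# Toy expression profiles for illustration only.
toy_expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
mean_r = mean_pair_correlation([("g1", "g2"), ("g1", "g3")], toy_expression)
```

Applying `mean_pair_correlation` to the predicted pairs, to random pairs and to DIP pairs would give the three means that are being compared.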

3.4 Examples

We sought an example from the literature to illustrate the operation and accuracy of the method. Some of the most intensively studied interactions are those between fibroblast growth factors (FGFs) and their receptors. FGF signaling pathways are intricate and are intertwined with the insulin-like growth factor, transforming growth factor-beta and bone morphogenetic protein pathways, and with pathways activated by vertebrate homologs of Drosophila wingless. FGFs are major regulators of embryonic development: they influence the formation of the primary body axis, neural axis, limbs and other structures. The activities of FGFs depend on their coordination of fundamental cellular functions, such as survival, replication, differentiation, adhesion and motility, through effects on gene expression and the cytoskeleton. FGFs play key roles in morphogenesis, development, angiogenesis and wound healing. There are more than 20 human FGFs that bind to one or more of 7 FGF receptors (FGFR1c, -1b, -2c, -2b, -3c, -3b and -4; c and b denote the isoforms IIIc and IIIb formed by alternative splicing [44]).

For example, PDB ID 1DJS is a protein complex of FGFR2 with FGF1, in which chain A (Swiss-Prot accession number: P21802) and chain B (Swiss-Prot accession number: P05230) interact with each other. In the DIP database, chain A is 3788N and chain B is 3787N. We analyzed the domain architecture of the interface between them and found that the interaction between these two chains can be reduced to an immunoglobulin domain (Pfam ID: PF00047) interacting with a fibroblast growth factor domain (Pfam ID: PF00167). First, we analyzed the other proteins stored in DIP that interact with 3788N and found that all of them contain the fibroblast growth factor domain. We then searched for further proteins containing the fibroblast growth factor domain. We found 22 human proteins containing this domain, and for some of them crystal structures (1II4, 1NUN) confirm that they indeed interact with FGFR2 (P21802).

3.5 Web Service

We developed a web-based database named “DAPID” to present our results. DAPID has been set up as a web service, as shown in Figure 10. Users can enter a Swiss-Prot accession number as a query, and DAPID returns the interacting partners of the query protein together with detailed information about each pair of proteins. The website of DAPID is http://gemdock.life.nctu.edu.tw/dapid/.
