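Bioinformatics Research on the Exploration of Cancer-Related Genes, Subproject 5: Integrating Microarray and Protein-Protein Interaction Information to Improve the Prediction of Cell-Cycle-Related Gene Regulatory Networks (3/3)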


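National Science Council, Executive Yuan: Research Project Report

Research Results Report (Full Version)

Project type: Integrated

Project number: NSC 95-2745-E-468-008-URD

Project period: August 1, 2006 to July 31, 2007 (ROC years 95 to 96)

Institution: Department of Biotechnology, Asia University (亞洲大學生物科技學系)

Principal investigator: 吳家樂 (Ka-Lok Ng)

Co-principal investigator: 劉湘川

Project participants: Master's students serving as part-time assistants: 蔡明成, 寥晃聖, 陳怡仲, 邱金水

Report attachments: report on attending an international conference, and published papers

Handling: this project is open to public inquiry

October 31, 2007 (ROC year 96)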


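Private University Capacity Project (私校能量計畫), Subproject 5:

Integrating Microarray and Protein-Protein Interaction Information to Improve the Prediction of Cell-Cycle-Related Gene Regulatory Networks (3/3): Full Report

Principal investigator: Ka-Lok Ng (吳家樂)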

1. Introduction

Protein-protein interactions (PPI) are an important aspect of protein function. PPI are mediated by protein domains that interact with each other to perform the necessary functions. Therefore, understanding protein interactions at the domain level gives a global view of the protein-protein interaction network (PIN). Putative domain-domain interactions (DDI) can be derived using the following approaches:

(1) association method (Deng et al. 2002),

(2) domain pair exclusion analysis (Riley et al. 2005),

(3) integrative approach (Ng et al. 2003a),

(4) domain combination pair approach, PreSPI (Han et al. 2004), and

(5) random decision forest model (Chen and Liu 2005).

2. Method

2.1 Input data

The domain combination pair approach (Han et al. 2004) is employed to derive putative protein DDI from the PPI database DIP (Salwinski et al. 2004), Jan. 16, 2006 version, which records PPI data for seven species: C. elegans, D. melanogaster, E. coli, H. pylori, H. sapiens, M. musculus and S. cerevisiae. Protein-domain annotations for the DIP proteins are obtained from the protein domain database Pfam (Finn et al. 2006). Pfam is a large collection of multiple sequence alignments for each domain family and uses hidden Markov models to find domains in new proteins. Domains in PfamA are well defined because the corresponding multiple sequence alignments and hidden Markov models have been checked, and most of the domains have been assigned functions.

2.2 Domain combination pair approach

Assuming a protein A contains n domains, there are 2^n - 1 different domain combinations, i.e. the power set of A with the empty set excluded, ps’(A), according to the domain combination pair approach. Then, given an interacting protein pair (A,B) with m and n domains respectively, there are (2^m - 1)(2^n - 1) possible DDI. The set of domain combination pairs of two proteins A and B, DC(A,B), is defined by

DC(A,B) = {ps’(A) × ps’(B)} (1)

where × denotes the Cartesian product of the sets ps’(A) and ps’(B). Since a protein can have either a single domain or multiple domains, the possible domain combination pairs are derived from each of the interacting protein pairs obtained from the DIP database (Salwinski et al. 2004).
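The construction of ps’(A) and DC(A,B) can be illustrated with a minimal Python sketch. The function names and the Pfam-style identifiers below are hypothetical and chosen only for illustration; the sketch also assumes that a domain occurring more than once in a protein is counted once, which the text does not specify.

```python
from itertools import combinations, product


def nonempty_power_set(domains):
    """Return ps'(A): every non-empty combination of a protein's domains."""
    unique = sorted(set(domains))  # assumption: repeated domains count once
    return [frozenset(combo)
            for size in range(1, len(unique) + 1)
            for combo in combinations(unique, size)]


def domain_combination_pairs(domains_a, domains_b):
    """Return DC(A,B) = ps'(A) x ps'(B) as a list of (alpha, beta) pairs."""
    return list(product(nonempty_power_set(domains_a),
                        nonempty_power_set(domains_b)))


# Hypothetical annotations: protein A has m = 2 domains, protein B has n = 3.
pairs = domain_combination_pairs(["PF00001", "PF00002"],
                                 ["PF00010", "PF00011", "PF00012"])
print(len(pairs))  # (2^2 - 1) * (2^3 - 1) = 21
```

The number of pairs grows exponentially with the number of domains, so a practical implementation would probably need to cap or prune the combinations for proteins with many domains.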

To measure the likelihood of a DDI, the domain combination pair interaction matrix M is introduced. The element M_αβ denotes the weighted interaction probability of a domain pair (α, β) for a given protein pair (A_i, B_j), and its value is given by

$$M_{\alpha\beta} = \sum_{(A_i, B_j)} \frac{1}{|ps'(A_i)| \times |ps'(B_j)|} \tag{2}$$

where |S| denotes the cardinality of set S, and the summation is over all possible pairs (A_i, B_j) such that α and β are elements of ps’(A_i) and ps’(B_j) respectively. Then, the elements of the normalized DDI interaction matrix AP_αβ, the so-called appearance probability matrix (Han et al. 2004), are defined by

$$AP_{\alpha\beta} = \frac{M_{\alpha\beta}}{\sum_{\alpha,\beta} M_{\alpha\beta}} \tag{3}$$

The matrix element APαβ represents the DDI probability of domain combination α and β.

If a DDI is found more frequently than expected by chance, it is likely that this DDI is a true interacting domain pair.
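A minimal Python sketch of how the matrices M and AP could be accumulated from a list of interacting protein pairs is given below. It assumes the weighted-count form of Eq. (2) and the normalization of Eq. (3); the protein names, domain labels and function names are hypothetical, and the ordering of each pair (A, B) is taken as given rather than symmetrized.

```python
from collections import defaultdict
from itertools import combinations, product


def ps_prime(domains):
    """Non-empty power set ps'(A) of a protein's domain set."""
    unique = sorted(set(domains))
    return [frozenset(c) for r in range(1, len(unique) + 1)
            for c in combinations(unique, r)]


def interaction_matrix(ppi_pairs, annotation):
    """Eq. (2): add 1 / (|ps'(A_i)| * |ps'(B_j)|) for every pair (alpha, beta)."""
    m = defaultdict(float)
    for a, b in ppi_pairs:
        ps_a, ps_b = ps_prime(annotation[a]), ps_prime(annotation[b])
        weight = 1.0 / (len(ps_a) * len(ps_b))
        for alpha, beta in product(ps_a, ps_b):
            m[(alpha, beta)] += weight
    return dict(m)


def appearance_probability(m):
    """Eq. (3): normalize M so that all matrix elements sum to one."""
    total = sum(m.values())
    return {key: value / total for key, value in m.items()}


# Hypothetical toy data: three proteins and two known interactions.
annotation = {"P1": ["d1", "d2"], "P2": ["d3"], "P3": ["d1"]}
ppi = [("P1", "P2"), ("P3", "P2")]
M = interaction_matrix(ppi, annotation)
AP = appearance_probability(M)
```

In this toy case the pair ({d1}, {d3}) is supported by both interactions and therefore receives the largest AP value.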

The putative DDI are computed for seven species. For each species, a negative learning set is constructed in order to improve the accuracy of DDI prediction. That is, given N proteins having K protein-protein interactions among them, the size of the negative learning set is equal to C^N_2 + N - K, where C^N_2 + N is the number of unordered protein pairs when self-pairs are included. This number represents the total number of non-interacting protein pairs for a particular species. We then calculated the probability of the DDI for the negative set of domain combination pairs, but now in the non-interaction space. Introducing the negative learning set generates three AP matrices: one for the DDI space, I; one for the domain-domain non-interaction space, R (derived from the negative learning set); and one for the region where the two overlap, C. The matrix elements of the first two matrices are denoted by AP^I_αβ and AP^R_αβ respectively, and the overlapping region of the matrices AP^I and AP^R is denoted by AP^C. In other words, the domain combination pairs of two proteins A and B can be classified into three categories: DC^I(A,B), DC^R(A,B) and DC^C(A,B).
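The size of the negative learning set can be checked with a short Python sketch. It assumes that the negative set is simply every unordered protein pair (self-pairs included) that is not recorded as interacting; the function names and the toy protein labels are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def negative_set_size(n_proteins, n_interactions):
    """Size of the negative learning set: C(N, 2) + N - K."""
    return comb(n_proteins, 2) + n_proteins - n_interactions


def negative_pairs(proteins, interactions):
    """Enumerate unordered pairs (self-pairs included) not known to interact."""
    known = {frozenset(pair) for pair in interactions}
    return [(a, b)
            for a, b in combinations_with_replacement(sorted(proteins), 2)
            if frozenset((a, b)) not in known]


# Hypothetical toy network: 4 proteins, 2 known interactions (one self-interaction).
proteins = ["A", "B", "C", "D"]
interactions = [("A", "B"), ("C", "C")]
negatives = negative_pairs(proteins, interactions)
print(len(negatives), negative_set_size(len(proteins), len(interactions)))
```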

After constructing the AP matrices, one can predict the interaction probability of the protein pair (A, B) based on the three AP matrices. Let X denote the interaction event, where X = 1 represents a PPI event and X = 0 a non-PPI event. Given the domain information for proteins A and B, one can determine the interaction probability using Bayes' rule, that is

$$P(X=1 \mid DC^C(A,B)) = \frac{P(X=1)\,P(DC^C(A,B) \mid X=1)}{P(X=1)\,P(DC^C(A,B) \mid X=1) + P(X=0)\,P(DC^C(A,B) \mid X=0)} \tag{4}$$

where

$$P(X=1) = \frac{k \, I_{total} \cdot \sum_{\alpha,\beta} (AP_I^C)_{\alpha\beta}}{k \, I_{total} \cdot \sum_{\alpha,\beta} (AP_I^C)_{\alpha\beta} + (1-k) \, R_{total} \cdot \sum_{\alpha,\beta} (AP_R^C)_{\alpha\beta}} \tag{5}$$

where I_total and R_total in the above equations represent the total number of interacting and non-interacting protein pairs, respectively, and (AP_I^C)_αβ and (AP_R^C)_αβ denote the interacting and non-interacting probability of domain combinations α and β in the overlapping space respectively. Furthermore, P(X=0) = 1 - P(X=1). The constant k is inserted into Eq. (5) because the exact ratio of I_total and R_total in nature is unknown. The ratio of the total number of interacting and non-interacting protein pairs is determined by using the method of maximum-likelihood estimation. The likelihood function L is defined by

$$L = C^n_x \, p^x (1-p)^{n-x} \tag{6}$$

where n is the total number of possible PPI, x is the total number of known PPI, and p is the probability of PPI. The parameter k is determined by the following condition,

$$\frac{\partial L}{\partial k} = 0 \tag{7}$$

Once the probabilities of the non-interacting values for the domain combination pairs are obtained, the probability of PPI is computed.
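The prior and posterior probabilities can be sketched in Python as follows. This is only an illustration under stated assumptions: the prior is taken to weight the interacting and non-interacting overlap sums by k and 1 - k as in Eq. (5), and the posterior is the standard two-class Bayes form of Eq. (4). The function names, argument names and all numerical inputs are hypothetical.

```python
def prior_interaction(k, i_total, r_total, sum_ap_i_c, sum_ap_r_c):
    """Prior P(X=1), assuming the k-weighted ratio form of Eq. (5)."""
    interacting = k * i_total * sum_ap_i_c
    non_interacting = (1.0 - k) * r_total * sum_ap_r_c
    return interacting / (interacting + non_interacting)


def posterior_interaction(prior, likelihood_ppi, likelihood_non_ppi):
    """Posterior P(X=1 | DC^C(A,B)) from Bayes' rule, as in Eq. (4)."""
    numerator = prior * likelihood_ppi
    denominator = numerator + (1.0 - prior) * likelihood_non_ppi
    return numerator / denominator


# Hypothetical numbers chosen only to exercise the two functions.
prior = prior_interaction(k=0.5, i_total=100, r_total=900,
                          sum_ap_i_c=0.2, sum_ap_r_c=0.1)
posterior = posterior_interaction(prior, likelihood_ppi=0.3,
                                  likelihood_non_ppi=0.1)
print(round(prior, 4), round(posterior, 4))
```

Because non-interacting pairs vastly outnumber interacting ones, the posterior is sensitive to k, which is presumably why the report fixes k by maximum likelihood rather than by hand.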

The probability that a protein pair (A, B), with m and n domains respectively, interacts is estimated by the Primary Interaction Probability (PIP) (Han et al. 2004). PIP(A,B) is expressed in terms of the posterior probability P(X=1 | DC^C(A,B)) and the norms ||AP|| of the AP matrices in the I-C space and in the overlapping space C (Eq. (8); see Han et al. 2004 for the explicit expression), where AP^{I-C} denotes the matrix elements that appear in the AP^I - AP^C space, and ||AP|| denotes the total sum of the matrix elements of AP.

In order to test whether the computed PIP results can provide potential PPI links between proteins, three biological pathways (the yeast septin complex, the E. coli chemotaxis pathway, and the blood coagulation pathway) are selected. The pairwise PIP values for each pathway are computed and ranked, and the PPI prediction accuracy is determined by comparison with the corresponding experimentally determined network.

Three statistical measures are defined to characterize the prediction performance: the accuracy Q, the true positive specificity S_TP, and the true negative specificity S_TN. They are defined as Q = (TP+TN)/(TP+TN+FP+FN), S_TP = TP/(TP+FP), and S_TN = TN/(TN+FP) respectively, where TP, TN, FP and FN stand for true positive, true negative, false positive, and false negative events respectively.
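The three measures can be computed directly from the confusion counts, as in the following Python sketch. The function name is hypothetical; the example counts are the threshold = 0.1 column of Table 2. Returning NaN when a denominator is zero is a choice made for this sketch, not something specified in the text.

```python
def prediction_metrics(tp, tn, fp, fn):
    """Return (Q, S_TP, S_TN) as defined in section 2.2."""
    q = (tp + tn) / (tp + tn + fp + fn)
    # A ratio is undefined when its denominator is zero (assumption: report NaN).
    s_tp = tp / (tp + fp) if (tp + fp) else float("nan")
    s_tn = tn / (tn + fp) if (tn + fp) else float("nan")
    return q, s_tp, s_tn


# E. coli chemotaxis pathway, PIP threshold 0.1 (counts from Table 2).
q, s_tp, s_tn = prediction_metrics(tp=13, tn=15, fp=26, fn=1)
print(f"Q = {q:.0%}, S_TP = {s_tp:.0%}, S_TN = {s_tn:.0%}")
# Q = 51%, S_TP = 33%, S_TN = 37%
```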

2.3 Order index

Assuming that proteins A and B interact, the AP-order index of protein A is defined by

$$H(A) = -\sum_{dA \in ps'(A)} p(dA) \log p(dA) \tag{9}$$

where dA stands for an element of ps’(A), and p(dA) denotes the DDI interaction probability of domain combination dA. For example, H(A) can be evaluated as

$$H(A) = -\sum_{dA \in ps'(A)} AP^I_{dA} \log(AP^I_{dA})$$

If H(A) is greater than H(B), then it is claimed that protein A regulates protein B. The rationale is based on the assumption that a protein containing DDI with larger AP^I values could possibly play the role of an upstream regulator.
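The AP-order index is an entropy-like sum over the domain combinations of a protein, which can be sketched in Python as follows. The mapping `ap_i` from a domain combination to its AP^I value is a hypothetical input, since the text does not spell out how AP^I_dA is obtained from the pairwise AP matrix; combinations absent from the mapping are skipped, which is also an assumption.

```python
import math
from itertools import combinations


def ps_prime(domains):
    """Non-empty power set ps'(A) of a protein's domain set."""
    unique = sorted(set(domains))
    return [frozenset(c) for r in range(1, len(unique) + 1)
            for c in combinations(unique, r)]


def ap_order_index(domains, ap_i):
    """H(A) = -sum over dA in ps'(A) of p(dA) * log p(dA)."""
    h = 0.0
    for combo in ps_prime(domains):
        p = ap_i.get(combo, 0.0)
        if p > 0.0:  # assumption: combinations without a probability contribute 0
            h -= p * math.log(p)
    return h


# Hypothetical AP^I values for three domain combinations.
ap_i = {frozenset({"d1"}): 0.5,
        frozenset({"d2"}): 0.25,
        frozenset({"d1", "d2"}): 0.25}
h_a = ap_order_index(["d1", "d2"], ap_i)  # protein A carries d1 and d2
h_b = ap_order_index(["d2"], ap_i)        # protein B carries d2 only
print("A regulates B" if h_a > h_b else "order not determined as A -> B")
```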

3. Results

3.1 Putative DDI results and InterDom

To evaluate the prediction, the putative DDI results are compared with those of the database InterDom v.1.2 (June 2004) (Ng et al. 2003b). The results are given in Table 1.

All the DDI (N_DDI) are selected from our pre-computed DDI data and compared with the InterDom records. Only DDI with a score larger than or equal to 0.4 are selected from InterDom for the comparison. InterDom assigns DDI a score from 0 to 49322; scores of 0.4 and above account for 90% of the 30037 records. The effective number of DDI, E_DDI, counts single-domain interactions only and does not include domain combination pair DDI, since these types of DDI are not available in InterDom. M_DDI stands for the matched DDI, and the matching ratio S_M is defined as M_DDI/E_DDI × 100%.

Table 1. The putative domain-domain interaction results (NDDI) obtained by the domain combination pair approach compared with that of the database InterDom.

In the InterDom comparison study, the DDI matching ratio ranges from 65.2% to 89.5% for the seven species (Table 1). An average matching ratio of 75.7% is obtained, which indicates that the model is rather successful.

3.2 The yeast septin complex

To verify whether the pre-computed DDI results can provide potential PPI links between proteins, three biological pathways, i.e. the yeast septin complex, the E. coli chemotaxis pathway and the blood coagulation pathway, are selected for further study. The predicted PPI events among the proteins in these three pathways are compared with the experimentally determined PPI networks.

For the first comparison, the yeast septin complex, which is composed of six proteins (CDC3, CDC10, CDC11, CDC12, GIN4, SHS1), is selected. Thus, there are 15 (C^6_2) possible PPI among the proteins. A PPI link is assumed if the PIP value is larger than or equal to 0.1. Our method correctly predicted that all six proteins interact with each other, that is, a prediction accuracy Q of 100% (15/15) is achieved, and likewise for S_TP and S_TN. The same prediction accuracy is reported by InterDom. In contrast, PreSPI returned an error message (PIP_Value = error) for each of the 15 possible interactions. In this case, our prediction performed much better than PreSPI.

The PIP threshold is set at the 0.1 level because this is the least stringent value among the three PPI cases we studied. A small PIP threshold predicts more PPI, but most of them are false positive interactions. In order to show how the threshold affects the prediction accuracy and specificity, a higher threshold value of 0.6 is also selected, and the results are reported in sections 3.3 and 3.4.

3.3 The E. coli chemotaxis pathway

In the second study, the E. coli chemotaxis pathway is selected. Chemotaxis is the response of cells to chemical stimuli by directed movement. The chemotaxis pathway, obtained from KEGG (Kanehisa et al. 2006), is composed of 11 proteins: the four MCP proteins (trg, tap, CheD, CheM), Aer, CheA, CheB, CheR, CheW, CheY, and CheZ. The predicted results based on DDI are given in Table 2 (a PPI is assumed if the PIP value reaches the threshold of 0.1 or 0.6).

Table 2. Predicted number of protein-protein interactions and the statistical measure results with PIP threshold of 0.1 and 0.6 in the E. coli chemotaxis pathway, and compared with that of PreSPI.

For the chemotaxis pathway, there are 55 (C^11_2) possible interactions among the 11 proteins. Our prediction returned a PIP value for each of the 55 interactions. PreSPI returned values for only 36 interactions, and the rest are not addressed. At the 0.1 threshold level, the accuracy of our prediction is comparable to that of PreSPI (51% versus 53%), while 19 more protein pairs are covered and a better true negative specificity S_TN is obtained (37% versus 29%). The InterDom database gave a null result for this pathway study. If the threshold is set to the 0.6 level, much better accuracy and specificity ratios are obtained: the accuracy Q rises from 51% to 76%, and the specificity ratios S_TP and S_TN rise from 33% to 57% and from 37% to 85%, respectively.

3.4 The blood coagulation pathway

In the last study, we applied the computed DDI data to reconstruct the blood coagulation pathway. Blood clotting occurs via three pathways (the intrinsic, extrinsic and common pathways), in which a total of 13 proteins are involved: FI, FII, FIII, FV, FVII, FVIII, FIX, FX, FXI, FXII, FXIII, PKK, and HMWK. Based on the DDI data, the predicted results are given in Table 3 (a PPI is assumed if the PIP value reaches the threshold of 0.1 or 0.6). In general there are 78 (C^13_2) possible interactions, but only 48 interactions can be determined in our computation.

Table 3. Predicted number of protein-protein interactions and the statistical measure results with PIP threshold of 0.1 and 0.6 in the blood coagulation pathway, and compared with that of PreSPI.

When comparing our results with those predicted by PreSPI, our prediction achieves a much better accuracy at the 0.1 threshold level (54% versus 23%). Both computations returned similar S_TP values (24% versus 23%); however, our calculation obtained a much better value of S_TN (57% versus 0%). If the threshold is set to the 0.6 level, slightly better accuracy and specificity ratios are obtained: the accuracy Q rises from 54% to 60%, and the specificity ratios S_TP and S_TN rise from 24% to 28% and from 57% to 65%, respectively.

The difference between our results and those of PreSPI is probably because PreSPI used the IntAct (Hermjakob et al. 2004) database for domain annotations, whereas the Pfam database is used in our work. The two databases provide somewhat different sets of domain annotations for proteins, so different inputs (the learning set as well as the negative learning set) are used by each study.

To further characterize PPI, the regulatory orders of PPI for six biological pathways are studied, and the results are given in the following sub-sections. All the pathways are taken from E. coli and yeast only, since the coverage of PPI data and domain annotations for these two species is relatively higher than for the other five species; in other words, the problems of missing domain annotations and missing DDI information are less severe in these two species.

3.5 Order index - E. coli chemotaxis pathway

The chemotaxis pathway is composed of six proteins or protein complexes: MCP, Aer, (CheA,CheW), CheB, CheY, and CheZ. The following five PPI regulatory order pairs are recorded in KEGG: MCP-(CheA,CheW), Aer-(CheA,CheW), (CheA,CheW)-CheB, (CheA,CheW)-CheY, and CheZ-CheY, where the brackets (...) denote a protein complex, and the symbol on the left of a regulatory relation X-Y is the upstream regulatory protein.

For the chemotaxis pathway, the AP-order index approach correctly predicted all five regulatory relationships, achieving a prediction accuracy of 100% (i.e. 5/5).

The same method is applied to the other five PPI pathways as well. It is demonstrated in the following subsections that the prediction accuracy of the order index approach is very encouraging.

3.6 Order index – the yeast cell cycle DNA damage checkpoint

The yeast cell cycle DNA damage checkpoint in the G2 phase is selected in this study. In this pathway there are 20 regulatory relations among the following 22 proteins or protein complexes: Rad17, Rad24, Mec3, Ddc1, Rad9, Mec1, Ctr1, Chk1, Pds1, Rad53, Cdc5, (Clb1, Cdc28), Mih1, Cak1, Cks1, Swi5, Sic1, Swe1, (Scf, Met30), Gin4, Hsl1, (Hsl7, Hsl1). The domain annotations for Mec3, Ddc1, Pds1 and Sic1 are not available, so four PPI relations are removed. In addition, the regulatory relations for Rad17-Rad24 and (Clb1, Cdc28)-Cks1 are not clearly defined by KEGG, so two more regulatory relations are removed. Therefore, only 14 relations (the second column in Table 4) among 15 proteins are considered in the prediction.

Among the 14 relations, the relative dependence of Gin4-Swe1 and Hsl1-Swe1 cannot be determined because the proteins in each pair have the same AP-order index values; hence, only 12 relations are left (the third column in Table 4). Among the 12 relations, 7 relations are correctly predicted.

The seven correct predictions are: Rad9-Mec1, Mec1-Chk1, Mec1-Rad53, Cdc5-(Clb1,Cdc28), Mih1-(Clb1,Cdc28), Cak1-(Clb1,Cdc28) and Swe1-(Clb1,Cdc28).

Hence, the regulatory order prediction accuracy for the damage checkpoint pathway is 58.3% (i.e. 7/12).

3.7 Order index – the yeast cell cycle spindle checkpoint

For the spindle checkpoint pathway, there are 12 regulatory relations among the following 14 proteins or protein complexes: Mps1, (Bub1,Bub3), (Mad1,Mad2,Mad3), (APC/C, Cdc20), (APC/C, Cdh1), Cdc14, Swi5, Sic1, Esc5, (Dbf2, Mob1), Dbf20, Tem1, Bub2 and Let1. Since the domain annotations for Sic1 and Esc5 are not recorded in the SwissProt database, three of the protein regulatory relations cannot be determined. Furthermore, one relation, (APC/C, Cdc20)-(APC/C, Cdh1), has the same AP-order index value for both members; hence 8 relations are left. The order index method correctly predicted seven PPI regulatory order pairs out of the eight relations: Mps1-(Bub1,Bub3), (Mad1,Mad2,Mad3)-(APC/C,Cdc20), Cdc14-(APC/C,Cdh1), Let1-Tem1, Tem1-Dbf20, Cdc14-Swi5, and Bub2-Tem1. Hence, the method achieves a prediction accuracy of 87.5% (i.e. 7/8).

3.8 Order index – the yeast MAPK signaling pathway, starvation

In this study the yeast starvation, osmolarity and hypotonic shock pathways are selected.

For the starvation pathway, there are six regulatory relations among the following seven proteins: Sho1, Ras2, Cdc42, Ste20, Ste11, Ste7 and Kss1. Among the six relations, the relative dependence of Ste7-Kss1 cannot be determined because the two proteins have the same AP-order index value. The order index method correctly predicted the regulatory order of the other five PPI pairs: Ras2-Cdc42, Sho1-Cdc42, Cdc42-Ste20, Ste20-Ste11, and Ste11-Ste7. The method achieves a prediction accuracy of 100% (5/5).

3.9 Order index – the yeast MAPK signaling pathway, osmolarity

For the osmolarity pathway, there are eight regulatory relations among the following nine proteins: Sho1, Sln1, Ste20, Ypd1, Ste11, Ssk1, Ssk2, Pbs2, and Hog1. Among the eight relations, the two proteins of the Pbs2-Hog1 relation have the same AP-order index value; hence seven relations are left. The order index method correctly predicted six PPI relations: Sho1-Ste20, Ste20-Ste11, Ste11-Pbs2, Ypd1-Ssk1, Ssk1-Ssk2, and Ssk2-Pbs2. Hence, the method achieves a prediction accuracy of 85.7% (6/7).

3.10 Order index – the yeast MAPK signaling pathway, hypotonic shock

For the hypotonic shock pathway, there are six regulatory relations among the following seven proteins: Mid2, Rho1, Fks1, Pkc1, Bck1, (Mkk1,Mkk2), and Slt2. Among the six relations, two relations have the same AP-order index values; hence, four relations are left. The order index method correctly predicted three PPI relations (Mid2-Rho1, Fks1-Rho1 and Rho1-Pkc1), whereas Pkc1-Bck1 is incorrectly predicted. Hence, the method achieves a prediction accuracy of 75% (3/4).

In Table 4, we summarize the total number of PPI relations recorded by KEGG with well-defined domain annotations, the number of relations for which a regulatory order can be determined, the number of correct predictions made by the order index method, and the prediction accuracy for the six pathways we selected. Overall, the order index approach achieves a prediction accuracy of 80.5%; that is, for the six PPI pathways we studied, 33 relations are correctly predicted among a total of 41 relations. A total of 48 PPI relations are studied, of which seven have the same AP-order index values; hence, the coverage rate of prediction is 85.4% (41/48).
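The overall accuracy and coverage quoted above can be reproduced from the rows of Table 4 with a few lines of Python. The dictionary layout and variable names are hypothetical; the counts are those listed in Table 4.

```python
# (total relations, relations with a determined order, correct predictions)
table4 = {
    "Chemotaxis": (5, 5, 5),
    "DNA damage": (14, 12, 7),
    "Spindle checkpoint": (9, 8, 7),
    "Starvation": (6, 5, 5),
    "Osmolarity": (8, 7, 6),
    "Hypotonic": (6, 4, 3),
}

total = sum(row[0] for row in table4.values())
determined = sum(row[1] for row in table4.values())
correct = sum(row[2] for row in table4.values())

accuracy = correct / determined   # pooled over all pathways
coverage = determined / total
print(f"accuracy = {accuracy:.1%}, coverage = {coverage:.1%}")
# accuracy = 80.5%, coverage = 85.4%
```

Note that 80.5% is the pooled ratio 33/41 rather than the mean of the six per-pathway accuracies.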

Table 4. The prediction accuracy of the AP-order index method with the threshold set to 1.0. The first column denotes the name of the studied pathway. The second column represents the total number of PPI relations recorded in KEGG with which domain annotation are well-defined. The third column represents the number of PPI relations left after taking into account of the threshold. The fourth column represents the number of regulatory orders correctly predicted by the AP-order index method. The last column denotes the prediction accuracy of the method.

3.11 Order index – robustness test

In order to test the robustness of the order index calculation, we assumed that if the AP-order index values of the two proteins in a regulatory relation differ by less than 10% (the difference between the larger value and the smaller value divided by the smaller one), then the method is not able to determine the regulatory order. The regulatory order predictions are repeated for the above six pathways, and the results are given in Table 5. The order index approach predicted 25 correct relations out of 31 relations, which amounts to a prediction accuracy of 80.6%, essentially the same as the prediction without the 10% difference criterion. This indicates that the order index approach is rather robust with respect to the choice of threshold. The coverage rate of regulatory order prediction is equal to 64.6%, i.e. 31/48.
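The decision rule with a threshold can be sketched in Python as follows. The function name and return convention are hypothetical; a threshold of 1.0 corresponds to Table 4 (only exact ties are undetermined) and a threshold of 1.1 to the 10% criterion of Table 5. The sketch assumes positive index values.

```python
def regulatory_order(h_a, h_b, threshold=1.1):
    """Return "A" or "B" for the predicted upstream regulator, or None.

    The order is left undetermined when the two AP-order index values are
    equal or when larger / smaller falls below the threshold.
    """
    larger, smaller = max(h_a, h_b), min(h_a, h_b)
    if larger == smaller or larger < threshold * smaller:
        return None
    return "A" if h_a > h_b else "B"


print(regulatory_order(1.20, 1.00))                  # A
print(regulatory_order(1.00, 1.05))                  # None (difference below 10%)
print(regulatory_order(1.00, 1.05, threshold=1.0))   # B
```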

Table 5. The prediction accuracy of the AP-order index method with the threshold set to 1.1. The first column denotes the name of the studied pathway. The second column represents the total number of PPI relations recorded in KEGG with which domain annotation are well-defined. The third column represents the number of PPI relations left after taking into account of the threshold. The fourth column represents the number of regulatory orders correctly predicted by the AP-order index method. The last column denotes the prediction accuracy of the method.

To assess the statistical significance of the method, a hypothesis test is performed on the mean number of correct predictions for the six pathways. Using a one-tailed test based on the binomial probability distribution, the null hypothesis is rejected at the 99% level.
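A one-tailed binomial test of this kind can be sketched in Python as below. This is an illustration only: it assumes a null hypothesis in which each regulatory order is guessed correctly with probability 0.5, and it pools the 33 correct predictions out of 41 relations instead of working with per-pathway means, neither of which is stated in the text. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(comb(trials, i) * p_null**i * (1.0 - p_null)**(trials - i)
               for i in range(successes, trials + 1))


# Pooled counts from Table 4: 33 correct predictions out of 41 relations.
p_value = binomial_upper_tail(33, 41)
print(p_value < 0.01)  # True: the null hypothesis is rejected at the 99% level
```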

3.12 A web-based service for PPI and regulatory order prediction

The predicted domain-domain interaction results are available at http://210.70.82.82/kzbio2/r_ap.php. Several query interfaces are implemented to facilitate data display, such as the DDI, PIP, PIP query and network reconstruction services. For instance, the PIP query service allows the user to input the Swissprot (Boeckmann et al. 2003) IDs of two proteins and obtain the probability of their interaction, i.e. the PIP value. If the Swissprot IDs are not known, the user can input the PfamA IDs of the domains instead, and the system returns the predicted probability of the protein interaction. Furthermore, the network service allows the user to reconstruct a PPI network and to predict the regulatory order of a PPI.

To reconstruct the PPI network, the user inputs a set of protein IDs or domain IDs, and the system returns a text file in which the putative PPI are listed. The predicted PPI network can be visualized by reading the output file into Cytoscape (Shannon et al. 2003).

We have also set up a web-based service for the public to use the AP-order index method for prediction, which is available at http://210.70.82.82/kzbio2/oi.php. For instance, to determine the regulatory order of Aer and (CheA, CheW), prepare the following line as input,

PF08447,PF00672,PF00015 PF01627,PF02895,PF02518,PF01584

where the first and second columns denote the PfamA annotations of the Aer and (CheA, CheW) proteins respectively.

Paste the above line into the box provided on the AP-order index web page, give a name for the output file, select E. coli in the species menu, and press the send button. The platform will return a file which states the prediction result (either A regulates B, or the regulatory order cannot be determined).
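The input line for the AP-order index page can be assembled programmatically, as in the following Python sketch. The helper name is hypothetical, and the format (comma-separated PfamA accessions per protein, with a single space between the two columns) is inferred from the example line above rather than from a documented specification of the service.

```python
import re

PFAM_ACCESSION = re.compile(r"PF\d{5}")  # Pfam accessions: "PF" plus five digits


def order_index_input_line(domains_a, domains_b):
    """Build one query line: domains of A, a space, then domains of B."""
    for accession in list(domains_a) + list(domains_b):
        if not PFAM_ACCESSION.fullmatch(accession):
            raise ValueError(f"not a PfamA accession: {accession!r}")
    return ",".join(domains_a) + " " + ",".join(domains_b)


# PfamA annotations of Aer and (CheA, CheW), as given in the example above.
line = order_index_input_line(
    ["PF08447", "PF00672", "PF00015"],
    ["PF01627", "PF02895", "PF02518", "PF01584"],
)
print(line)
```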

4. Conclusion

The domain combination pair approach is employed to derive putative protein DDI from the PPI database DIP. To evaluate the prediction performance of the approach, the predicted DDI are compared with those of the database InterDom, where an average matching ratio of 75.7% is achieved (using the Jan. 16, 2006 version of DIP).

Three PPI networks are chosen to test the prediction accuracy of our computation. The yeast septin complex and the blood coagulation pathway are reconstructed with a much better accuracy and true negative specificity than with PreSPI. For the E. coli chemotaxis pathway, comparable PPI prediction accuracy is obtained, while more protein pairs are covered and a better true negative specificity is achieved in our prediction. This indicates the merit of our calculations. Furthermore, an entropy-like quantity based on DDI data, the so-called AP-order index, is introduced to predict the regulatory order of a PPI. The prediction accuracy of this method is demonstrated for six PPI pathways, where it achieves an accuracy of 80.5% with a coverage rate of 85.4%. This suggests that the order index is a reliable parameter for determining the regulatory order of PPI.

There are two major obstacles for the PPI and regulatory order calculations: (i) many proteins do not have complete PfamA domain annotations, and (ii) DDI information is missing for many domain pairs. Much further experimental work is still needed to resolve these two problems.


Acknowledgements

The work of Drs. Ka-Lok Ng and Chien-Hung Huang is supported by the National Science Council of R.O.C. under grants NSC 95-2745-E-468-008-URD and NSC 95-2221-E-150-025-MY2 respectively. We thank Mr. Jengru Lee for developing the computer programs for the calculations and for constructing the database.

References

Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365 - 370.

Chen Xue-Wen and Liu Mei 2005. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 21, 4394-4400.

Deng Minghua, Mehta Shipra, Sun Fengzhu, and Chen Ting 2002. Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12, 1540-1548.

Finn R. D., Mistry Jaina, Schuster-Böckler Benjamin, Griffiths-Jones Sam, Hollich Volker, Lassmann Timo, Moxon Simon, Marshall Mhairi, Khanna Ajay, Durbin Richard, Eddy Sean R., Sonnhammer Erik L. L., and Bateman Alex 2006. Pfam: clans, web tools and services. Nucl. Acids Res. 34, D247-D251.

Han Dong-Soo, Kim Hong-Soog, Jang Woo-Hyuk, Lee Sung-Doke, and Suh Jung-Keun 2004. PreSPI: a domain combination based prediction system for protein-protein interaction. Nucleic Acids Res. 32, 6312-6320.

Hermjakob H., Montecchi-Palazzi L., Lewington C., Mudali S., Kerrien S., Orchard S., Vingron M., Roechert B., Roepstorff P., Valencia A., Margalit H., Armstrong J., Bairoch A., Cesareni G., Sherman D., Apweiler R. 2004. IntAct: an open source molecular interaction database. Nucl. Acids Res. 32, D452-D455.

Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K.F., Itoh M., Kawashima S., Katayama T., Araki M., and Hirakawa M. 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354-357.

Ng S.K., Zhang Z., Tan S.H. 2003a. Integrative approach for computationally inferring protein domain interactions. Bioinformatics 19, 923-929.

Ng S.K., Zhang Z., Tan S.H., Lin K. 2003b. InterDom: a database of putative interaction protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res., 31, 251-254.

Riley R., Lee C., Sabatti C., Eisenberg D. 2005. Inferring protein domain interactions from databases of interacting proteins. Genome Biology 6, R89.

Salwinski L., Miller C.S., Smith A.J., Pettit F.K., Bowie J.U., Eisenberg D. 2004. The Database of Interacting Proteins. Nucl. Acids Res. 32, D449-51.

Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B., Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498-2504.


Table 1. The putative domain-domain interaction results (N_DDI) obtained by the domain combination pair approach compared with those of the database InterDom.

Species            N_DDI    E_DDI    M_DDI    S_M (%)
C. elegans          3874     1751     1142    65.2
D. melanogaster     1346      695      523    75.4
E. coli            59062    12075     1695    66.3
H. pylori            894      276      247    89.5
H. sapiens          6327     1187      849    71.5
M. musculus         1031      206      160    77.7
S. cerevisiae      39415     4440     3772    84.5
Average                                       75.7


Table 2. Predicted number of protein-protein interactions and the statistical measures for PIP thresholds of 0.1 and 0.6 in the E. coli chemotaxis pathway, compared with those of PreSPI.

         Threshold = 0.1    Threshold = 0.6    PreSPI
TP       13                 8                  12
TN       15                 34                 7
FP       26                 6                  17
FN       1                  7                  0
total    55                 55                 36
Q        51%                76%                53%
S_TP     33%                57%                41%
S_TN     37%                85%                29%

Q = (TP+TN)/(TP+TN+FP+FN), S_TP = TP/(TP+FP), S_TN = TN/(TN+FP), total = TP+TN+FP+FN.


Table 3. Predicted number of protein-protein interactions and the statistical measures for PIP thresholds of 0.1 and 0.6 in the blood coagulation pathway, compared with those of PreSPI.

         Threshold = 0.1    Threshold = 0.6    PreSPI
TP       5                  5                  18
TN       21                 24                 0
FP       16                 13                 60
FN       6                  6                  0
total    48                 48                 78
Q        54%                60%                23%
S_TP     24%                28%                23%
S_TN     57%                65%                0%

Q = (TP+TN)/(TP+TN+FP+FN), S_TP = TP/(TP+FP), S_TN = TN/(TN+FP), total = TP+TN+FP+FN.


Table 4. The prediction accuracy of the AP-order index method with the threshold set to 1.0. The first column gives the name of the studied pathway. The second column gives the total number of PPI relations recorded in KEGG for which the domain annotations are well defined. The third column gives the number of PPI relations left after the threshold is taken into account. The fourth column gives the number of regulatory orders correctly predicted by the AP-order index method. The last column gives the prediction accuracy of the method.

Pathway name          Total no. of     Actual no. of    Correct        Accuracy (%)
                      PPI relations    PPI relations    predictions
Chemotaxis            5                5                5              100
DNA damage            14               12               7              58.3
Spindle checkpoint    9                8                7              87.5
Starvation            6                5                5              100
Osmolarity            8                7                6              85.7
Hypotonic             6                4                3              75.0
Total                 48               41               33


Table 5. The prediction accuracy of the AP-order index method with the threshold set to 1.1. The first column gives the name of the studied pathway. The second column gives the total number of PPI relations recorded in KEGG for which the domain annotations are well defined. The third column gives the number of PPI relations left after the threshold is taken into account. The fourth column gives the number of regulatory orders correctly predicted by the AP-order index method. The last column gives the prediction accuracy of the method.

Pathway name          Total no. of     Actual no. of    Correct        Accuracy (%)
                      PPI relations    PPI relations    predictions
Chemotaxis            5                3                3              100
DNA damage            14               10               7              70.0
Spindle checkpoint    9                8                7              87.5
Starvation            6                2                2              100
Osmolarity            8                4                3              75.0
Hypotonic             6                4                3              75.0
Total                 48               31               25


Self-assessment

We have completed the major aims of the proposal, that is, deriving putative domain-domain interaction pairs and introducing the AP-order index to predict the regulatory order of a protein-protein interaction pair. A web-based service was set up which provides the PPI and regulatory order prediction services to the public.

During the period 2004 and 2005, our results were presented, as either oral or poster presentations, at international and local conferences.

Publications

Journal papers

1. Ka-Lok Ng, Chien-Hung Huang*, Hsueh-Chuan Liu, Hsiang-Chuan Liu (2008) Applications of domain-domain interactions in pathway study. Computational Biology and Chemistry, 32 (in press, to appear in 2008)

2. J.D. Wang, Hsiang-Chuan Liu, Jeffrey J.P. Tsai, Ka-Lok Ng* (2007) Scaling Behavior of Maximal Repeat Distributions in Genomic Sequences. Int’l J. of Cognitive Informatics and Natural Intelligence (to appear)

3. Kuo-Ching Hsiao, Chien-Hung Huang, Ka-Lok Ng* (2006) Protein Structural Classes Prediction via Residues Environment Profile. Asian J. Health and Information Sci., 1(3), in press

4. Chien-Hung Huang, Jywe-Fei Fang, Jeffrey J.P. Tsai, Ka-Lok Ng* (2007) “Topological Robustness of the Protein-protein Interaction Networks". Lecture Notes in Bioinformatics vol. 4023, RECOMB 2005 Regulatory Genomics and Systems Biology Workshop, E. Eskin et al. (Eds.), p.166-177, Springer Verlag (SCI-indexed publication)

* corresponding author

International conference papers or posters, 2007

1. Liu Hsueh-Chuan, Huang Chien-Hung, Tsai J.F., Ng Ka-Lok*. "Applications of Domain-domain Interaction in Pathways Study". 5th Asia-Pacific Bioinformatics Conference (APBC2007), Hong Kong, 15-17 Jan. 2007, poster abstract p.42. (academic year 95)

International conference papers or posters, 2006

1. Lee Jeng-ru, Liu Hsiang-Chuan, Tsai J.F., Ng Ka-Lok*. "Large scale prediction of domain-domain interactions from protein-protein interactions". 4th Asia-Pacific Bioinformatics Conference (APBC2006), Taiwan, 13-16 Feb. 2006. P063, Poster (academic year 94)

2. Huang Chien-Hung, Tsai J.F., Fang Jywe-Fei, Ng Ka-Lok*. "Topological Stability of the protein-protein interaction networks". 4th Asia-Pacific Bioinformatics Conference (APBC2006), Taiwan, 13-16 Feb. 2006. P062, Poster (academic year 94)

3. Wang J.D., Liu Hsiang-Chuan, Ng Ka-Lok*. "Scaling Behavior of Maximal Repeat Distributions in Genome Sequences". 4th Asia-Pacific Bioinformatics Conference (APBC2006), Taiwan, 13-16 Feb. 2006. P061, Poster (academic year 94)

4. Chien-Hung Huang, Tsai J.F., Ng Ka-Lok*. "Deriving Domain-domain Interactions from Protein-protein Interactions Networks". INFORMS06, Hong Kong, 25-28 June 2006, p.40. Oral presentation (academic year 94)

5. Ng Ka-Lok*, Liu Hsueh-Chuan, Liu Hsiang-Chuan, Tsai J.F. "Reconstructing protein-protein interaction networks from domain-domain interactions". Asia Pacific Association for Medical Informatics (APAMI 2006), Taipei, October 27-29, 2006. p.31, Oral presentation, full paper (academic year 95)

6. Hsiang-Chuan Liu, Chien-Hung Huang, Ka-Lok Ng. "Protein-protein interaction pathways reconstruction from domain-domain interactions". The 7th International Conference on Systems Biology (ICSB-2006), Yokohama, Japan, 9-11 October 2006. Poster, p.42, FN43 (academic year 95)


Attachment 3

Form Y04

National Science Council, Executive Yuan: Report on Attendance at an International Academic Conference by a Subsidized Domestic Expert or Scholar

Date: 16 October 2006

Reporter: 吳家樂 (Ka-Lok Ng)

Affiliation and title: Asia University, Department of Biotechnology and Bioinformatics, Associate Professor

Time and venue: 8-12 Oct. 2006, ICSB 2006, Pacifico Yokohama, Japan

NSC approved grant number: NSC 95-2745-E-468-008-URD

Conference name: The 7th International Conference on Systems Biology 2006

Titles of papers presented:

(i) Protein-protein interaction pathways reconstruction from domain-domain interactions

(ii) Finding human miRNA genes located within promoter regions and associated with CpG islands


The report should include the following items:

1. Conference attendance

Oct. 8

• Morning session: Attended Tutorial 8 - Modeling, simulating, and analyzing biochemical systems with Copasi

• Afternoon session: Attended Tutorial 6 - Analyzing Biochemical Systems using the E-Cell System

Oct. 9

• Attended the Plenary Talks P1, P2 and P4: Oct. 9, 10:00-12:30
  • P1: Upinder S. Bhalla (The National Centre of Biological Science, Bangalore)
    "Electricity meets Chemistry: Fast and Slow Signaling in Memory"
  • P2: Atsushi Miyawaki (Riken Brain Science Institute)
    "Spatio-temporal Patterns of Intracellular Signaling"
  • P4: Luis Serrano (European Molecular Biology Laboratory)
    "Evolvability and hierarchy in rewired bacterial gene networks"

Oct. 10

• Attended the Plenary Talks P3 and P5: Oct. 10, 9:00-9:30 and 9:30-10:00
  • P3: Stephen Quake (Stanford University / HHMI)
    "Biological Large Scale Integration"
  • P5: Steve Oliver (University of Manchester)
    "Dealing with the complexity of a 'simple' eukaryotic cell"

Oct. 11

• Complex Systems Biology - Oct. 11, 9:15-10:30
  • Mihajlo D. Mesarovic (Case Western Reserve University)
    "Interaction Balance Coordination as Organizing Principle in Complex Systems Biology"
  • Jack Donald Keene (Duke University Medical Center)
    "Coordination of Gene Expression by RNA Operons"
  • Kenneth Alan Loparo (Case Western Reserve University)
    "Applications of Complex Systems Biology to the Study of Neural Systems"

• Control and System Theory for Systems Biology - Oct. 11, 11:00-12:30
  • Francis J. Doyle (University of California, Santa Barbara)
    "Robustness Analysis of Biological Networks Using Sensitivity Measures"
  • Pablo A. Iglesias (Johns Hopkins University)
    "Feedback Control Regulation of Cell Division"
  • John Doyle (California Institute of Technology)
    "The architecture of cellular regulation"

• Signal transduction - Oct. 11, 14:00-16:30
  • Hans V. Westerhoff (The University of Manchester)
    "Cell-signaling Dynamics in Time and Space"
  • William S. Hlavacek (Los Alamos National Laboratory)
    "Rules for Modeling Signal-Transduction Systems"
  • Philippe Bastiaens (EMBL Heidelberg)
    "Reaction cycles in the spatial and temporal organization of cell signaling"

2. Reflections on the conference

This year the conference was held in Yokohama, Japan. It covered a broad range of topics in systems biology.

Topics included:

• Systems Biology for Medicine

Drug discovery, Cancer Systems Biology, Systems Immunology, Cardiovascular Systems Biology, Systems Biology of Diabetes and Metabolic Syndrome

• Systems Biology of Basic Biological Processes

Developmental Systems Biology, Metabolome and Bioprocess, Signal transduction, Cyclic and Dynamical Behaviors, Systems Neuroscience, Microorganisms

• Expanding Fronts in Systems Biology

Large-Scale Biology, Bioinformatics Support for Systems Biology, Synthetic Biology, Complex Systems Biology, Systems marine biology

I gave two poster presentations, on Oct. 9 and 10. Their titles were (i) Protein-protein interaction pathways reconstruction from domain-domain interactions and (ii) Finding human miRNA genes located within promoter regions and associated with CpG islands.

Many interesting talks were presented at the conference. Their main results are highlighted below.

Oct. 8 9:30 am

Tutorial - Modeling, simulating, and analyzing biochemical systems with Copasi

Pedro Mendes (Virginia Bioinformatics Institute)

Copasi (Complex Pathway Simulator) is a software application for simulation and analysis of biochemical networks. It is developed jointly by the groups of Pedro Mendes (Virginia Bioinformatics Institute, USA) and Ursula Kummer (EML Research, Germany), and is freely available for academic use.

Copasi's current features include stochastic and deterministic time course simulation, steady-state analysis (including stability), metabolic control analysis, elementary mode analysis, mass conservation analysis, import and export of SBML level 2, optimization, parameter scanning and parameter fitting. It runs on MS Windows, Linux, OS X, and Solaris SPARC. So, it is one of the few computational tools in systems biology that are OS X compatible.

The presenters used Copasi to explain how the modelling, simulation and computational analysis of biochemical systems work. They also critically evaluated the limitations of different simulation methods.
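To illustrate the difference between the deterministic and stochastic time-course simulations mentioned above, the following minimal Python sketch simulates a single first-order decay reaction both ways. It does not use Copasi or reproduce its algorithms. The reaction, the rate constant, the explicit Euler integrator and the function names are all assumptions chosen for illustration; the stochastic version is a basic Gillespie-style direct method.

```python
import math
import random


def deterministic_decay(n0, k, t_end, dt=1e-3):
    """Integrate dn/dt = -k * n with the explicit Euler method."""
    n = float(n0)
    steps = int(round(t_end / dt))
    for _ in range(steps):
        n += -k * n * dt
    return n


def stochastic_decay(n0, k, t_end, rng):
    """Gillespie-style simulation of A -> (nothing) with propensity k * n.

    Waiting times between decay events are exponentially distributed
    with rate k * n, where n is the current molecule count.
    """
    n, t = n0, 0.0
    while n > 0:
        t += rng.expovariate(k * n)
        if t > t_end:
            break
        n -= 1
    return n


if __name__ == "__main__":
    n0, k, t_end = 1000, 1.0, 1.0
    print("analytic     :", n0 * math.exp(-k * t_end))
    print("deterministic:", deterministic_decay(n0, k, t_end))
    print("stochastic   :", stochastic_decay(n0, k, t_end, random.Random(0)))
```

The deterministic run tracks the analytic solution closely, whereas each stochastic run returns an integer molecule count that fluctuates around it. Those fluctuations are one reason the choice of simulation method matters for systems with small copy numbers.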

Oct. 8 1:30 pm

Tutorial - Analyzing Biochemical Systems using the E-Cell System

Nathan Addy, Satya Arjunan, Bin Hu, Yuri Matsuzaki, Martin Robert, Takeshi Sakurada, Koichi Takahashi (Keio University)

Bifurcation and sensitivity analysis can be used to elucidate the relationship between the dynamics of a nonlinear system in biology and the parameters of the system. The bifurcation program in E-Cell numerically computes the stable states of the system, such as the stable or oscillating point, with graphical representation of results. Elasticity coefficients with respect to amplitude and frequency, which indicate the robustness of the oscillation, are also represented. Participants experimented with these features hands-on using a simple oscillation model – the Drosophila circadian cycle model.

Metabolic control analysis can demonstrate how fluxes and intermediate concentrations in a metabolic pathway are regulated by the enzymes that constitute the system. The analysis encompasses structural analysis, elasticity coefficients and the sensitivity of metabolites to small changes in individual parameters such as enzyme concentrations or kinetic parameters. Flux and concentration control coefficients are some of the outcomes of metabolic control analysis. Participants used metabolic control analysis to evaluate Kuchel's erythrocyte model.
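As a concrete illustration of a flux control coefficient, the sketch below estimates C_i = (dJ/J)/(de_i/e_i) numerically for a toy two-step pathway S -> X -> P. The pathway, its linear rate laws and the function names are assumptions made for this example; they are not taken from E-Cell or from Kuchel's erythrocyte model.

```python
def steady_state_flux(e1, e2, substrate=1.0):
    """Steady-state flux J of the toy pathway S -> X -> P.

    Assumed rate laws: v1 = e1 * (S - X) and v2 = e2 * X.
    Setting v1 = v2 gives X = e1*S/(e1 + e2), hence J = e1*e2*S/(e1 + e2).
    """
    return e1 * e2 * substrate / (e1 + e2)


def flux_control_coefficient(enzymes, index, rel_step=1e-6):
    """Estimate C_i = (dJ/J) / (de_i/e_i) by a central finite difference."""
    up = list(enzymes)
    down = list(enzymes)
    up[index] *= 1.0 + rel_step
    down[index] *= 1.0 - rel_step
    base = steady_state_flux(*enzymes)
    delta = steady_state_flux(*up) - steady_state_flux(*down)
    return delta / (2.0 * rel_step * base)


if __name__ == "__main__":
    enzymes = (1.0, 3.0)  # hypothetical enzyme levels e1, e2
    c1 = flux_control_coefficient(enzymes, 0)
    c2 = flux_control_coefficient(enzymes, 1)
    print("C1 =", c1, "C2 =", c2, "sum =", c1 + c2)
```

For this toy model the coefficients can be checked analytically: C1 = e2/(e1 + e2) and C2 = e1/(e1 + e2), so they sum to one, in line with the summation theorem of metabolic control analysis. The less abundant enzyme carries the larger share of flux control.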


October 9 10:00-10:30 am

“Electricity meets Chemistry: Fast and Slow Signaling in Memory”

Upinder S. Bhalla

National Centre for Biological Sciences, TIFR, Bangalore, India

http://www.ncbs.res.in/~bhalla/index.html

Deliberations on memory mechanisms often seem to proceed on at least three independent tracks. One of these involves biochemical mechanisms for plasticity, including feedback loops and cellular activation. Space is another dimension, and is the arena for interactions between synapses, and propagation of signals between synapses, dendrites, and the cell body. Finally, electrical activity is a function of cell as well as network dynamics, and here too feedback may play a role through reverberating activity in network loops. It is an interesting process to develop models that impinge on all of these levels, because of the wide range of timescales, numerical techniques, and sheer computational load. It is especially tricky to get parameters for such models. I will describe a study where we have used coupled electrical and biochemical compartmental modeling, and weeded out several candidate models by comparing their predictions to our experiments. The surviving models incorporate chemical, spatial and electrical ingredients. They exhibit network-activity controlled single-cell reverberating activation, with interesting spatial consequences. We suggest that this is a form of short-term and spatially defined memory. It sits at the interface between individual synapses and dendrites, and also between network and cellular attributes of memory.

October 9 10:30-11:00 am

“Spatio-temporal Patterns of Intracellular Signaling”

Atsushi Miyawaki

RIKEN Brain Science Institute

http://www.brain.riken.go.jp/english/b_rear/b5_lob/a_miyawaki.html

“Why bio-imaging, i.e. real time fluorescence imaging?" Currently, this is a topic of great interest in the bioscience community. Many molecules involved in signal transduction have been identified, and the hierarchy among those molecules has also been elucidated. It is not uncommon to see a signal transduction diagram in which arrows are used to link molecules to show enzyme reactions and intermolecular interactions. To obtain a further understanding of a signal transduction system, however, the diagram must contain the three axes in space as well as a fourth dimension, time, because all events are controlled ingeniously in space and time. Since the isolation of green fluorescent protein (GFP) from the bioluminescent jellyfish in 1992 and later with its relatives, researchers have been awaiting the development of a tool, which enables the direct visualization of biological functions. This has been increasingly enhanced by the marriage of GFP with fluorescence resonance energy transfer (FRET) or fluorescence cross-correlation spectroscopy (FCCS), and is further expanded upon by the need for "post-genomic analyses." It is not my intent to discourage the trend seeking the visualization of biological function. I would like to propose that it is time to evaluate the true asset of "bio-imaging" for its potential and limitations in order to utilize and truly benefit from this novel technique.

October 9 12:00-12:30 pm

"Evolvability and hierarchy in rewired bacterial gene networks"

Luis Serrano

1. EMBL-CRG Systems Biology Programme, Centre for Genomic Regulation, Spain; 2. EMBL, Germany

http://www-db.embl.de/jss/EmblGroupsHD/per_397.html

Bacterial gene networks are highly plastic, allowing radical reconnections at the summit of the gene network hierarchy, fuelling evolvability. Sequencing of genetic material from several organisms has revealed that duplication and drift of existing genes has primarily molded the contents of a given genome. Though the effect of knocking out or over-expressing a particular gene has been studied in many organisms, no study has systematically explored the effect of adding new links in a biological network. To explore network plasticity, we constructed 598 recombinations of promoters (including regulatory regions) with different transcription or σ-factors in Escherichia coli, over the genetic background of the wild-type. We found that ~95% of reconnected networks are tolerated by the bacterial cell and very few give different growth profiles. Expression levels correlate with the position of the factor in the wild-type network hierarchy. Most importantly, we find that certain combinations consistently survive over the wild-type under various selection pressures. This suggests that new links in the network could readily confer a fitness advantage to individuals in a population and hence may fuel evolution.

October 10 9:00 – 9:30 am

"Biological Large Scale Integration"

Stephen Quake

Dept of Bioengineering and (by courtesy) Applied Physics, Stanford University and Howard Hughes Medical Institute

http://med.stanford.edu/profiles/Stephen_Quake/

The integrated circuit revolution changed our lives by automating computational tasks on a grand scale. My group has been asking whether a similar revolution could be enabled by automating biological tasks. To that end, we have developed a method of fabricating very small plumbing devices – chips with small channels and valves that manipulate fluids containing biological molecules and cells, instead of the more familiar chips with wires and transistors that manipulate electrons. Using this technology, we have fabricated chips that have thousands of valves in an area of one square inch. We are using these chips in applications ranging from bioreactors to structural genomics to systems biology. However, there is also a substantial amount of basic physics to explore with these systems – the properties of fluids change dramatically as the working volume is scaled from milliliters to nanoliters.

• Microfluidic system

• Large half-life of protein function

• Biological dark matter – 99% of bacteria cannot be cultivated

October 10 9:30-10:00 am

"Dealing with the complexity of a 'simple' eukaryotic cell"

Stephen G. Oliver

Faculty of Life Sciences, The University of Manchester, U.K.

http://www.ls.manchester.ac.uk/people/profile/index.asp?tb=0

Systems biology aims at taking a more synthetic or holistic approach to deciphering the workings of living organisms. Although the ultimate aim is to construct mathematical models of complete cells or organisms that have both explanatory and predictive power, we are some way from achieving such global syntheses and we need a principled way of reducing the complexity of the problem. Accordingly, we require a top-down strategy to provide an initial coarse-grained model of the cell, and a bottom-up strategy in which individual sub-systems are modeled.

Metabolic Control Analysis (MCA) is a conceptual and mathematical formalism that models the relative contributions of individual effectors in a pathway to both the flux through the pathway and the concentrations of individual intermediates within it. To exploit MCA in an initial top-down systems analysis of the eukaryotic cell, two categories of experiments are required. In category 1 experiments, flux is changed and the impact on the levels of the direct and indirect products of gene action is measured. We have measured the impact of changing the flux on the transcriptome, proteome, and metabolome of Saccharomyces cerevisiae. In this whole-cell analysis, flux equates to growth rate. In category 2 experiments, the levels of individual gene products are altered, and the impact on the flux is measured. We have used competition analyses between the complete set of heterozygous yeast deletion mutants to reveal genes encoding proteins with high flux control coefficients.

For the bottom-up approach, the initial problem is one of systems identification. While a lot of time is currently spent debating the question “What is Systems Biology?”, why (in an organism where we know so much about its biochemistry, physiology, and cell biology as S. cerevisiae) should it be a problem to identify the biological sub-systems that must be fully characterised and built into a comprehensive model of the eukaryotic cell? This problem arises because we have previously studied these biological systems in isolation and in a rigorously reductionist fashion.

Now, we must study them as parts of an integrated whole. The problem is that our current view of, say, a metabolic or signal transduction pathway is often two-dimensional (rather than four-dimensional) and is frequently poorly integrated, if at all, with other cellular pathways. Thus our view of the network of metabolic pathways may not be the same as the yeast’s. In order to gain a “yeast’s eye view”, we have coupled flux balance analysis with both metabolomics and genetics. Although the initial aim of these approaches is the identification of the ‘natural’ metabolic systems of yeast, the principles involved should be more widely applicable to the problem of biological systems identification.

October 10, 2 pm

“System level analysis and engineering of industrial bacteria”

Sang Yup Lee, KAIST

• Choose two genes from the microarray (late stage) → raise metabolite production

• Leptin production – serine-rich production increases interleukin

• Enhanced production of recombinant protein (patent)

• Silver Cell research at MBEL and Bic

• http://webcell.org

• MetaFluxNet v1.8

• Succinic acid production increased by 40 times (US$ 550 million market)

References

[1] Appl. Environ. Microbiol. (2003), 69, 5772

[2] Trends in Biotechnology (2005), 23, 349

[3] Curr. Opin. Biotech. (2006), 17, 488

"Metabolome Analysis and Synthetic Biology"

Masaru Tomita (Keio University)

Keio University was founded by Yukichi Fukuzawa (who appears on the ¥10,000 note).

• Metabolome analysis of AAP hepatotoxicity in mouse liver

• Multi-omics → synthetic biology

• Metabolome – CE/TOF-MS

• Fluxome – GC/MS – NMR, GC/TOF

• Proteome – shotgun, 2D gel

• Transcriptome – RT-PCR

• Merge two genomes – Bacillus

• Artificial operons – order of genes is important, Itaya et al.

• Metabolome factory – Tsuruoka

• In vitro enzyme rate constants (enzymes usually work at the maximum rate) are not equal to those in vivo

References

[1] PNAS (2005) 102, 15971