• 沒有找到結果。

蛋白質立體結構預測(3/3)

N/A
N/A
Protected

Academic year: 2021

Share "蛋白質立體結構預測(3/3)"

Copied!
30
0
0

加載中.... (立即查看全文)

全文

(1)

行政院國家科學委員會專題研究計畫 成果報告

蛋白質立體結構預測(3/3)

計畫類別: 個別型計畫 計畫編號: NSC93-3112-B-002-022- 執行期間: 93 年 05 月 01 日至 94 年 04 月 30 日 執行單位: 國立臺灣大學資訊工程學系暨研究所 計畫主持人: 高成炎 共同主持人: 楊進木 計畫參與人員: 詹鎮熊 博士 報告類型: 完整報告 報告附件: 出席國際會議研究心得報告及發表論文 處理方式: 本計畫可公開查詢

中 華 民 國 94 年 8 月 8 日

(2)

行政院國家科學委員會補助專題研究計畫

■ 成 果 報 告

□期中進度報告

蛋白質立體結構預測(3/3)

計畫類別:■ 個別型計畫 □ 整合型計畫

計畫編號:NSC 93-3112-B-002-022-

執行期間:

93 年 5 月 1 日至 94 年 4 月 30 日

計畫主持人:高成炎 教授

共同主持人:張文章、楊進木、林智仁 教授

計畫參與人員:詹鎮熊 博士

成果報告類型(依經費核定清單規定繳交):□精簡報告 ■完整報告

本成果報告包括以下應繳交之附件:

□赴國外出差或研習心得報告一份

□赴大陸地區出差或研習心得報告一份

□出席國際學術會議心得報告及發表之論文各一份

□國際合作研究計畫國外研究報告書一份

處理方式:除產學合作研究計畫、提升產業技術及人才培育研究計畫、

列管計畫及下列情形者外,得立即公開查詢

□涉及專利或其他智慧財產權,□一年□二年後可公開查詢

執行單位:國立臺灣大學資訊工程學系暨研究所

中 華 民 國 94 年 7 月 30 日

(3)

The results accomplished in this project has been published in Bioinformatics,

Vol. 21, pp. 1415-1420, 2005. (SCI, 2004 impact factor: 6.7) The manuscript

is attached in this report.

(4)

Cysteine separations profiles (CSP) on protein sequences infer disulfide

connectivity

East Zhao1, Hsuan-Liang Liu2, Chi-Hung Tsai1, Huai-Kuang Tsai1, Chen-hsiung Chan1, Cheng-Yan Kao1*

1Bioinformatics Laboratory, Department of Computer Science and Information Engineering,

National Taiwan University

No. 1 Sec. 4 Roosevelt Rd., Taipei, Taiwan 106

2Department of Chemical Engineering and Graduate Institute of Biotechnology, National Taipei

University of Technology

No. 1 Sec. 3 Chung-Hsiao E. Rd., Taipei, Taiwan 10608

*Corresponding author: Ph: +886-2-23625336 ext. 401 Fax: +886-2-23658741

(5)

Abstract

Motivation: Disulfide bonds play an important role in protein folding. A precise prediction of disulfide connectivity can strongly reduce the conformational search space and increase the

accuracy in protein structure prediction. Conventional disulfide connectivity predictions use

sequence information, and prediction accuracy is limited. Here, by using an alternative scheme

with global information for disulfide connectivity prediction, higher performance is obtained with

respect to other approaches.

Result: Cysteine separation profiles (CSPs) have been used to predict the disulfide connectivity of proteins. The separations among oxidized cysteine residues on a protein sequence have been

encoded into vectors named cysteine separation profiles (CSPs). Through comparisons of their

CSPs, the disulfide connectivity of a test protein is inferred from a non-redundant template set.

For non-redundant proteins in SwissProt 39 (SP39) sharing less than 30% sequence identity, the

prediction accuracy of a four-fold cross validation is 49%. The prediction accuracy of disulfide

connectivity for proteins in SwissProt 43 (SP43) is even higher (53%). The relationship between

the similarity of CSPs and the prediction accuracy is also discussed. The method proposed in this

work is relatively simple and can generate higher accuracies comparing to conventional methods.

It may be also combined with other algorithms for further improvements in protein structure

prediction.

Availability: The program and datasets are available from the authors upon request. Contact: cykao@csie.ntu.edu.tw

(6)
(7)

1. Introduction

A disulfide bond is a strong covalent bond between two cysteine residues in proteins. It

plays a key role in protein folding and in determining the structure/function relationships of

proteins (Abkevich and Shakhnovich, 2000; Wedemeyer et al., 2000; Welker et al., 2001). In

addition, it is important in maintaining a protein in its stable folded state. A disulfide

connectivity pattern can be used to discriminate the structural similarity between proteins

(Chuang et al., 2003). In protein folding prediction, the knowledge of the locations of

disulfide bonds can dramatically reduce the search in conformational space (Skolnick et al.,

1997; Huang et al., 1999). Therefore, a higher performance in predicting disulfide

connectivity pattern is likely to increase the accuracy in predicting the three-dimensional

(3D) structures of proteins through the reduction of the number of steps during

conformational space search.

Generally, the prediction of disulfide connectivity pattern in proteins consists of two

consecutive steps. Firstly, the disulfide bonding state of each cysteine residue in a protein is

predicted based on its amino acid sequence and evolutionary information using various

algorithms, such as neural networks (Fariselli et al., 1999; Fiser and Simon, 2000), support

vector machines (Chen et al., 2004), and hidden Markov models (Martelli et al., 2002).

Secondly, the location of disulfide bonds is subsequently predicted based on the bonding

state of each cysteine residue using algorithms such as Monte Carlo (MC) simulated

annealing together with weighted graph matching (Fariselli and Casadio, 2001) and

(8)

prediction accuracy of the oxidation state of cysteine residues has reached 90% (Chen et al.,

2004) and can be used confidently. However, the task of predicting disulfide connectivity

remains challenging. The best prediction accuracy ever reported so far is only 44% (Vullo

and Frasconi, 2004), in which recursive neural network was used to score connectivity

patterns represented in unidirected graphs. Such prediction accuracy is still far from being

usable, although it is much higher than that by a random predictor.

In this work, cysteine separation profiles (CSPs) of proteins are adopted for the

prediction of disulfide connectivity. It has been shown that proteins with similar disulfide

bonding patterns also share similar folds (Chuang et al., 2003; van Vlijmen et al., 2004).

Theoretical work has suggested that disulfide bonds may stabilize the structures of protein

fragments between the connected cysteine residues (Abkevich and Shakhnovich, 2000);

therefore, the separations between oxidized cysteine residues may be used in the task of

predicting disulfide connectivity. Previous works on disulfide connectivity predictions have

used graphs to represent disulfide connection patterns (Fariselli and Casadio, 2001; Vullo

and Frasconi, 2004). Protein sequences, contact potentials, and evolutionary information

have been well used to score various connection patterns. The present approach encodes

separations among cysteine residues into the form of vectors. The prediction of disulfide

connectivity is based on the comparisons of vectors from testing and template dataset, in

which similar vectors imply similar connection patterns. The method proposed here (CSP) is

(9)

2. System and Methods

2.1 Datasets

The datasets used to evaluate the predicting power of CSPs were constructed from

SwissProt release No. 39 (Bairoch and Apweiler, 2000), including sequences with annotated

disulfide bridges. Protein sequences in SwissProt release No. 39 are filtered according to

procedures described in two previous works (Fariselli and Casadio, 2001; Vullo and

Frasconi, 2004). This dataset is denoted as ‘SP39’. Another dataset based on SP39 was also

constructed; redundant sequences with pairwise sequence identity more than 30% were

removed. This non-redundant set is denoted as ‘SP39-ID30’. SP39-ID30 is used to

investigate the effects of sequence identities on the prediction accuracy of CSP.

Another dataset was further constructed to verify the predicting power of CSP. The

same filter procedures were applied to sequences in SwissProt release No. 43, where

sequences in release 39 were excluded. Thus it is possible to predict proteins newly added to

SwissProt database between releases No. 39 and No. 43. This set is denoted as ‘SP43’.

Redundant sequences with pairwise sequence identity more than 25% in SP43 were also

removed. The template set used to predict disulfide connectivity in SP43 was constructed

from SwissProt release 39. Sequences in this set were filtered as in SP39 and SP43, except

for the PDB filter. Only sequences sharing less than 30% identity with those in SP43 were

kept. This template set is denoted as ‘SP39-TEMPLATE’.

The numbers of sequences divided according to the number of disulfide bridges in

(10)

2.2 Basic Assumption

Similar disulfide bonding patterns infer similar protein structures regardless of sequence

identity (Chuang et al., 2003). Figure 1 shows an example of two proteins with the same disulfide

bonding patterns. Tick anticoagulant peptide (serine protease inhibitor, PDB id 1TAP) (Antuch et

al., 1994) and cacicludine (calcium channel blocker, PDB id 1BF0) (Gilquin et al., 1999) exhibit

the same disulfide connectivity [1-6, 2-3, 4-5], which means that the first oxidized cysteine is

connected with the sixth one, the second with the third, and the forth with the fifth. These two

proteins share sequence identity of only 18.2%, but with a Cα root-mean-square deviation

(RMSD) of 3.6Å (Chuang et al., 2003). Although the sequence identity is below the twilight zone,

the structure and separations among cysteine residues are similar for these two proteins. The

residue numbers for cysteines in the two proteins are [5, 15, 33, 39, 55, 59] and [7, 16, 32, 40, 53,

57], respectively. The positions and separations of cysteine residues are similar for these two

proteins. It is likely that cysteine separations are related to disulfide connectivity patterns, and

through the comparison of cysteine separation profiles (CSPs), the disulfide connectivity patterns

may be inferred and predicted.

2.3 Cysteine Separation Profile (CSP) and evaluation of prediction accuracy

Cysteine separation profiles (CSPs) contain cysteine separation information. Protein x

with n disulfide bonds and 2n cysteine residues has a cysteine separation profile (CSPx)

defined as

)

(

1, 2,..., 2 −1

) (

= 2 − 1, 3− 2,..., 2 − 2 −1 = n n n x C C C C C C s s s CSP

(11)

separation between cysteines Ci and Ci+1. By this definition, a protein with disulfide bonds

will have a CSP.

The divergence, D, between two CSPs is defined as follows:

− = i Y i X i s s D

where X and are the ith separations for CSPs of two different proteins X and Y.

i

s siY

The CSP of a test protein was then compared with all CSPs of template proteins. The

disulfide connectivity pattern of the test protein can be predicted as that of the template

protein with the most similar CSP, i.e., with the smallest divergence value, D. If the

divergence D between two CSPs equals to 0, the CSPs was termed ‘matched profiles’,

otherwise they are ‘mismatched profiles’. If more than one template proteins are matched,

one of the templates is randomly selected for the prediction. The ambiguous situations are

rare, only less than 2% are observed.

Prediction accuracy of our method was evaluated with Qp and Qc value, which are the

fraction of proteins with correct disulfide connectivity prediction and is defined as:

, p c p c p c C C Q Q T T = =

where Cp is number of proteins with all the disulfide connectivity correctly predicted;

Tp is the total number of test proteins; Cc is the number of disulfide connectivity correctly

(12)

3. Results

3.1 Four-fold cross validation

In order to compare with other approaches for disulfide connectivity prediction, similar criteria

were used to select our dataset. The same four-fold cross validation has been applied to our

datasets. The SP39 and SP39-ID30 datasets were divided into 4 subsets, and the disulfide

connectivity prediction was repeated 4 times. For each prediction, one of the 4 subsets was used

as the test set and the other 3 subsets were put together to form a template set. The final

prediction accuracy was averaged over the four prediction results.

Table 2 summarizes the disulfide connectivity prediction results obtained from this

study as well as those obtained from the previous works (Fariselli and Casadio, 2001; Vullo

and Frasconi, 2004). “Frequency” is a trivial method, where the prediction is based on most

frequently observed pattern in the training set. ‘MC graph-matching’ and ‘NN

graph-matching’ are both based on a graph representation of disulfide bonding patterns,

using Monte Carlo and Neural Networks for pattern recognition, respectively (Fariselli and

Casadio, 2001). The results termed BiRnn are obtained from recursive neural networks with

sequence and evolutionary information (Vullo and Frasconi, 2004), the disulfide

connectivity patterns are also represented using graphs. The prediction results from this

work are termed CSP, with dataset noted in the parenthesis. The prediction results are

divided according to the number of disulfide bridges.

The average value of Qp using CSP is 0.81 for SP39. However, redundant sequences

(13)

mismatched profiles patterns. The number of matched profile patterns is high, and is likely

to be resulted from redundant and homologous sequences in SP39 dataset. The redundancy

may have caused over fitting in SP39, even with four-fold cross validation. In order to

control and test over fitting, we extracted the sequences with pairwise sequence identities

less than 30% from SP39 and then generated another dataset, SP39-ID30. The average value

of Qp (B = 2~5) using CSP is 49% for SP39-ID30. With redundant sequences removed, the

four-fold cross validation prediction accuracy of CSPs is higher than the best results ever

reported from previous works.

The prediction accuracies for protein chains with different disulfide bridge numbers are

all significantly higher for ‘CSP (SP39)’. For proteins with 2, 4, and 5 disulfide bridges, the

prediction accuracies in ‘CSP (SP39-ID30)’ are higher than other works. The prediction

accuracy for proteins with 3 disulfide bridges is 2% lower than that of ‘BiRnn-1 profile’, but

is still significantly higher than those from other works. 3.2 Handout Prediction of New Sequences from SP43

We further validate CSP on a new dataset, SP43, which contains new sequences not

seen in SwissProt release 39. We use SP39-TEMPLATE as template set to predict disulfide

connectivity patterns of new sequences in SP43. The pairwise identities of sequences in the

template set and SP43 are less than 30%, with template sequences sharing higher identities

with those in SP43 being removed. The overall prediction accuracy in SP43 dataset is 53%,

which shows significant improvement over the prediction on the other dataset, SP39. The

(14)

bridges, the prediction accuracies in SP43 dataset are higher than those obtained with

four-fold cross validation in SP39-ID30 dataset. This implies that increasing even the

number of non-redundant templates may improve the prediction accuracy of CSP. 3.3 Examples

Three examples of CSP matching are listed in Table 3. These examples are taken from

the SSDB database (Chuang et al., 2003). The CSPs for template and query protein

sequences, as well as their divergence score D, disulfide connectivity patterns, and sequence

identities, are shown in Table 3. In the three examples, the divergence scores are all smaller

than 10, implying that they share similar disulfide positioning and connectivity patterns. The

sequence identities in the three examples are all lower than 20%, thus structure similarity

from sequence homology can be ruled out.

The structures and sequences of these examples are illustrated in Figures 1-3. The first

example is shown in Figure 1. Tick anticoagulant peptide (serine protease inhibitor, PDB id

1TAP) (Antuch et al., 1994) and cacicludine (calcium channel blocker, PDB id 1BF0)

(Gilquin et al., 1999) have a divergence score D=8, their disulfide connectivity pattern is

[1-6, 2-3, 4-5]. Example 2 is illustrated in Figure 2. Thionin (toxic arthropod protein, PDB

id 1GPS) (Bruix et al., 1993) and brazzein (thermostable sweet-tasting protein, PDB id 1brz)

(Caldwell et al., 1998) share 18.8% sequence identity. Their divergence score D is 6, and the

disulfide connectivity pattern is [1-8, 2-5, 3-4, 6-7]. The third example (Figure 3), C-type

lectin carbohydrate recognition domain of human tetranectin (PDB id 1TN3) (Kastrup et al.,

(15)

also have a divergence score of D=6 . Their sequence identity is 17.7% and the

connectivity pattern is [1-2, 3-6, 4-5]. For all proteins, the oxidized cysteine residues are

indicated in black. Cysteine residues on sequences are highlighted in bold and underline. In

each case, the cysteine residues are positioned in similar sites along the sequence, and the

separations among these cysteine residues are nearly identical.

4. Discussions

The number of possible disulfide connectivity patterns increases rapidly with number

of disulfide bridges. For a protein with n disulfide bridges (n*2 oxidized cysteines), the

number of possible disulfide connectivity patterns Np can be formulated as follows:

≤ − = − = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = n i p n i n n n n N (2 1)!! (2 1) ! 2 2 2 4 2 2 2 2 2 2 L .

Table 4 lists the number of possible disulfide connectivity patterns for proteins with

different disulfide bridge numbers. The use of CSPs may be obscure at first, since the

rapidly increasing number of pattern cannot be covered exhaustively. However, the observed

numbers of patterns in PDB peak at 5 disulfide bridges, and decline afterward. Only 45

patterns are observed for protein chains with 5 disulfide bridges, as opposed to possible 945

patterns expected. These results imply that the disulfide connectivity pattern of a protein

sequence can be predicted from a limited set of templates.

One limitation of our approach is that a pattern not presented in the training set cannot

be predicted correctly. Other machine learning approach has to enumerate all possible

patterns to obtain a prediction with maximum score (Vullo and Frasconi, 2004); therefore it

(16)

of all possible patterns is expensive (Vullo and Frasconi, 2004), our approach can achieve

comparable prediction performance in a much simpler and faster algorithm.

The use of divergence score D for profile matching deserves more discussion, e.g.

what is the divergence score cutoff for a reliable prediction? The prediction accuracies for

protein chains with different divergence score coverage are shown in Figure 4. The

divergence coverage means that a profile matches with divergence score smaller than or

equal to that specified. For example, coverage 5 means profiles matched with a divergence

score equal to or smaller than 5. Prediction results of the three datasets are illustrated. As

can be seen, when divergence coverage is 0, which means the profiles are ‘matched profiles’,

the prediction accuracy is 100%. The prediction accuracies become lower as divergence

coverage increases. With divergence coverage of 50, the prediction accuracy is slightly

higher than the overall accuracy. Divergence coverage can be used as a decision value for

adoption of CSP or machine-learning approaches. However, these divergence scores are not

normalized according to the number of disulfide bridges. Sequences with low divergence

matches can be predicted with CSP, and the connectivity patterns of other sequences can be

elucidated with neural network (Vullo and Frasconi, 2004), support vector machine (Chen et

al., 2004), or other machine-learning approaches.

5. Conclusions

In this work, we have shown that cysteine separation profiles (CSPs) can be used in

predicting disulfide connectivity patterns based on the hypothesis that proteins with similar

(17)

accuracy of CSP proposed in this study is higher than those obtained by other approaches. The

handout prediction of new sequences in SP43 dataset can reach 53%. The method mentioned here

is extremely simple; therefore the computation time is minimum comparing to other methods.

The rational behind our method is completely different from previous studies using sequence and

evolutionary information. Our method suggests that topology itself may be an important factor in

disulfide connectivity, as it has been proposed by theoretical study (Abkevich and Shakhnovich,

2000) and observations in structure databases (Chuang et al., 2003). Although many efforts have

been made to predict the disulfide connectivity patterns, current prediction accuracy is limited

around 50%. However, by combining CSP and other algorithms proposed previously (Fariselli

and Casadio, 2001; Vullo and Frasconi, 2004), it is possible to further improve the prediction

accuracy. The use of predicted disulfide connectivity patterns in ab initio protein structure

prediction and other applications would become more reliable in the foreseeable future.

6. Acknowledgements

The authors thank National Science Council of Taiwan for financial support (project number

NSC-93-3112-B-002-022).

7. References

Abkevich, V. I. and Shakhnovich, E. I. (2000) What can disulfide bonds tell us about protein energetics, function and folding: Simulations and bioinformatics analysis. J. Mol. Biol., 300, 975-985.

(18)

NMR solution structure of the recombinant tick anticoagulant protein (rtap), a factor Xa inhibitor from the tick Ornithodoros moubata. FEBS Lett., 352, 251-257.

Bairoch, A. and Apweiler, R. (2000) The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.

Bruix, M., Jimenez, M. A., Santoro, J., Gonzalez, C., Colilla, F. J., Mendez, E. and Rico, M. (1993) Solution structure of gamma 1-H and gamma 1-P thionins from barley and wheat endosperm determined by 1H-NMR: A structural motif common to toxic arthropod proteins. Biochemistry, 32, 715-724.

Caldwell, J. E., Abildgaard, F., Dzakula, Z., Ming, D., Hellekant, G. and Markley, J. L. (1998) Solution structure of the thermostable sweet-tasting protein brazzein. Nat. Struct. Biol., 5, 427-431.

Chen, Y.-C., Lin, Y.-S., Lin, C.-J. and Hwang, J.-K. (2004) Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins., 55, 1036-1042.

Chuang, C.-C., Chen, C.-Y., Yang, J.-M., Lyu, P.-C. and Hwang, J.-K. (2003) Relationship between protein structures and disulfide-bonding patterns. Proteins., 55, 1-5.

Fariselli, P., Martelli, P.L., & Casadio, R. (2002). A neural network base method for prediction the disulfide connectivity in proteins. In Damiani, E., et al. (Eds.), Knowledge based

intelligent information engineering systems and allied technologies (KES 2002), Vol. 1,

pp. 464-468. IOS Press.

Fariselli, P. and Casadio, R. (2001) Prediction of disulfide connectivity in proteins.

Bioinformatics, 17, 957-964.

Fariselli, P., Riccobelli, P. and Casadio, R. (1999) Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins., 36, 340-346.

Fiser, A. and Simon, I. (2000) Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics, 16, 251-256.

Fukuda, K., Mizuno, H., Atoda, H. and Morita, T. (2000) Crystal structure of flavocetin-a, a platelet glycoprotein Ib-binding protein, reveals a novel cyclic tetramer of c-type

(19)

lectin-like heterodimers. Biochemistry, 39, 1915-1923.

Gilquin, B., Lecoq, A., Desne, F., Guenneugues, M., Zinn-Justin, S. and Menez, A. (1999) Conformational and functional variability supported by the BPTI fold: Solution structure of the Ca2+ channel blocker calcicludine. Proteins., 34, 520-532.

Huang, E. S., Samudrala, R. and Ponder, J. W. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267-281.

Kastrup, J. S., Nielsen, B. B., Rasmussen, H., Holtet, T. L., Graversen, J. H., Etzerodt, M., Thogersen, H. C. and Larsen, I. K. (1998) Structure of the c-type lectin carbohydrate recognition domain of human tetranectin. Acta Crystallogr. D Biol. Crystallogr., 54, 757-766.

Martelli, P. L., Fariselli, P., Malaguti, L. and Casadio, R. (2002) Prediction of the

disulfide-bonding state of cysteines in proteins at 88% accuracy. Protein Sci., 11, 2735-2739.

Skolnick, J., Kolinski, A. and Ortiz, A. R. (1997) MONSSTER: A method for folding globular proteins with a small number of distance restraints. J. Mol. Biol., 265, 217-241.

van Vlijmen, H. W. T., Gupta, A., Narasimhan, L. S. and Singh, J. (2004) A novel database of disulfide patterns and its application to the discovery of distantly related homologs. J. Mol.

Biol., 335, 1083-1092.

Vullo, A. and Frasconi, P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653-659.

Wedemeyer, W. J., Welker, E., Narayan, M. and Scheraga, H. A. (2000) Disulfide bonds and protein folding. Biochemistry, 39, 4207-4216.

Welker, E., Wedemeyer, W. J., Narayan, M. and Scheraga, H. A. (2001) Coupling of

conformational folding and disulfide-bond reactions in oxidative folding of proteins.

(20)

Table 1

Number of chains in the datasets, divided according to the number of disulfide bridges (B).

Datasets B = 2 B = 3 B = 4 B = 5 B = 2~5

SP39 150 149 105 45 449

SP39-ID30 92 81 43 28 244

SP39-TEMPLATE 244 198 98 45 585

(21)

Table 2

Comparison among different disulfide connectivity prediction algorithms.

B = 2 B = 3 B = 4 B = 5 B = 2~5 Algorithms Qp Qc Qp Qc Qp Qc Qp Qc Qp Qc Frequencya 58% 58% 29% 37% 1% 10% 0% 23% 29% 32% MC graph-matchingb 56% 56% 21% 36% 17% 37% 2% 21% 29% 38% NN graph-matchingc 68% 68% 22% 37% 20% 37% 2% 26% 34% 42% BiRnn-1 sequenced 59% 59% 17% 30% 10% 22% 4% 18% 28% 32% BiRnn-1 profiled 65% 65% 46% 56% 24% 32% 8% 27% 42% 46% BiRnn-2 sequenced 59% 59% 22% 34% 18% 30% 8% 24% 31% 37% BiRnn-2 profiled 73% 73% 41% 51% 24% 37% 13% 30% 44% 49% CSP (SP39)e 89% 89% 81% 84% 81% 85% 51% 60% 81% 81% CSP (SP39-ID30)f 74% 74% 44% 53% 26% 44% 18% 31% 49% 52% CSP (SP43)g 71% 71% 49% 58% 30% 40% 28% 33% 53% 53%

a Prediction accuracy reported by Vullo and Frasconi (2004).

b Prediction accuracy reported by Fariselli and Casadio (2001).

c Prediction accuracy reported by Fariselli et al. (2002).

d Prediction accuracy reported by Vullo and Frasconi (2004).

e Prediction accuracy of CSP on SP39 with redundant sequences retained.

f Prediction accuracy of CSP with redundant sequences removed.

(22)
(23)

Table 3

Examples of cysteine separation profiles (CSP).

Template PDB id Template CSP Query PDB id Query CSP Disulfide Connectivity Pattern Divergence (D) Sequence Identity 1TAP (10, 18, 6, 16, 4) 1BF0 (9, 16, 8, 13, 4) [1-6, 2-3, 4-5] 8 18.2% 1GPS (11, 6, 4, 10, 7, 2, 4) 1BRZ (12, 6, 4, 11, 10, 2, 3) [1-8, 2-5, 3-4, 6-7] 6 18.8% 1TN3 (10, 17, 75, 16, 8) 1C3A:A (11, 17, 72, 17, 8) [1-2, 3-6, 4-5] 6 17.7%

(24)

Table 4

Number of possible disulfide connectivity patterns (Np) for protein chains with different disulfide

bridge numbers. Number of disulfide bridges (B) Np a Observed N pb B = 2 3 3 B = 3 15 15 B = 4 105 43 B = 5 945 45 B = 6 10395 29 B = 7 135135 14

a Number of possible disulfide connectivity patterns.

b Observed number of disulfide connectivity patterns. Statistics obtained from the SSDB database

(25)

Figure legends

Figure 1

The structures of two proteins with low sequence identity but share the same disulfide bonding

patterns. (a) anticoagulant protein (PDB id 1TAP), (b) calcium channel blocker (PDB id 1BF0),

and (c) the sequences of the two protein, with cysteine residues highlighted with bold and

underline. Both proteins have three disulfide bonds ([1-6, 2-3, 4-5]) and BPTI-like structures; the

sequence identity is 18.2%.

Figure 2

The structures of (a) thionin, toxic arthropod protein (PDB id 1GPS), (b) brazzein, thermostable

sweet-tasting protein (PDB id 1brz), and (c) their sequences with cysteine highlighted. The

divergence score D between these two protein sequences is 6. Both proteins have disulfide

connectivity [1-8, 2-5, 3-4, 6-7] and their sequence identity is 18.8%.

Figure 3

Structures of (a) C-type lectin carbohydrate recognition domain of human tetranectin (PDB id

1TN3), (b) flavocetin-A from Habu snake venom (PDB id 1C3A:A), and (c) their sequences with

cysteine residues highlighted. The divergence score D between these two protein sequences is 6.

Both proteins have disulfide connectivity [1-2, 3-6, 4-5] and their sequence identity is 17.7%.

Figure 4

Prediction accuracy of the datasets with various divergence coverage, which means that a profile

(26)

(Qp) is higher with lower divergence coverage. With D≤50 the prediction accuracy is still

slightly higher than the overall Qp. See text for details.

(27)
(28)
(29)
(30)

參考文獻

相關文件

2 Distributed classification algorithms Kernel support vector machines Linear support vector machines Parallel tree learning?. 3 Distributed clustering

“Does perceived organizational support mediate the relationship between procedural justice and organizational citizenship behavior”. Academy of Management

“Transductive Inference for Text Classification Using Support Vector Machines”, Proceedings of ICML-99, 16 th International Conference on Machine Learning, pp.200-209. Coppin

Solving SVM Quadratic Programming Problem Training large-scale data..

Microphone and 600 ohm line conduits shall be mechanically and electrically connected to receptacle boxes and electrically grounded to the audio system ground point.. Lines in

An instance associated with ≥ 2 labels e.g., a video shot includes several concepts Large-scale Data. SVM cannot handle large sets if using kernels

Core vector machines: Fast SVM training on very large data sets. Using the Nystr¨ om method to speed up

Core vector machines: Fast SVM training on very large data sets. Multi-class support