Identification of Antifreeze Proteins and Their Functional Residues by Support Vector Machine and Genetic Algorithms based on n-Peptide Compositions


Manuscript Number:

Title: Identification of Antifreeze Proteins and Their Functional Residues by Support Vector Machine and Genetic Algorithms based on n-Peptide Compositions

Short Title: Identify AFPs and Their Functional Residues

Article Type: Research Article

Section/Category: Other

Keywords: support vector machines; genetic algorithm; n-peptide composition; antifreeze protein; AFP

Corresponding Author: Chin Sheng Yu, Ph.D.

Corresponding Author's Institution: Feng Chia University

First Author: Chin Sheng Yu, Ph.D.

Order of Authors: Chin Sheng Yu, Ph.D.; Chih Hao Lu

Abstract: For the first time, multiple sets of global n-peptide compositions from antifreeze protein (AFP) sequences of certain cold-adapted fish and insects were analyzed using support vector machine and genetic algorithms. The identification of AFPs is difficult because they exist as evolutionarily divergent types, and because their sequences and structures are present in limited numbers in currently available databases. Our results reveal that it is feasible to identify the shared sequential features among the various structural types of AFPs. Moreover, we were able to identify residues involved in ice binding without referring to three-dimensional structures of AFPs. This approach should be useful for genomic and proteomic studies involving cold-adapted organisms.

Suggested Reviewers:

Peter L. Davies, Queen's University, peter.davies@queensu.ca (expert on antifreeze proteins)

Brendan J J McConkey, University of Waterloo, mcconkey@uwaterloo.ca (expert on antifreeze protein recognition)


Dear Prof.,

Enclosed is our paper, submitted for consideration for publication in PLoS ONE.

Further information about the paper follows.

Title: Identification of antifreeze proteins and their important residues by using support vector machines based on n-peptide compositions

Authors: Chin-Sheng Yu and Chih-Hao Lu

This paper is the first to show that antifreeze proteins and their functionally important residues can be identified by protein sequence analysis. The common characteristics of antifreeze sequences have remained unclear because current databases contain few homologs and the known types differ radically. Our approach not only provides excellent results for discriminating antifreeze proteins without using 3D structural information but, most importantly, also allows further investigation of the potential key residues at the ice-binding interface.

The authors declare that none of the material in the paper has been published or is under consideration for publication elsewhere.

I am the corresponding author, and my contact information is as follows:

Address: Department of Information Engineering and Computer Science, Feng Chia University, Taichung, 40724, Taiwan

E-mail: yucs@fcu.edu.tw

Tel: 886-4-24517250 ext. 3742

Fax: 886-4-24516101


Identification of AFPs and Their Functional Residues

Identification of Antifreeze Proteins and Their Functional Residues by Support Vector Machine and Genetic Algorithms based on n-Peptide Compositions

Chin-Sheng Yu (1,2,*) and Chih-Hao Lu (3)

From the (1) Department of Information Engineering and Computer Science and the (2) Master's Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, Taichung 40724, Taiwan, and the (3) Graduate Institute of Molecular Systems Biomedicine, China Medical University, Taichung 40402, Taiwan

(*) Correspondence to: Chin-Sheng Yu, Department of Information Engineering and Computer Science, Feng Chia University, Taichung 40724, Taiwan. FAX: +886-4-2451-6101. Phone: +886-4-2451-7250, ext. 3742. E-mail: yucs@fcu.edu.tw.

Abstract

For the first time, multiple sets of global n-peptide compositions from antifreeze protein (AFP) sequences of certain cold-adapted fish and insects were analyzed using support vector machine and genetic algorithms. The identification of AFPs is difficult because they exist as evolutionarily divergent types, and because their sequences and structures are present in limited numbers in currently available databases. Our results reveal that it is feasible to identify the shared sequential features among the various structural types of AFPs. Moreover, we were able to identify residues involved in ice binding without referring to three-dimensional structures of AFPs. This approach should be useful for genomic and proteomic studies involving cold-adapted organisms.

Keywords: support vector machines; genetic algorithm; n-peptide composition; antifreeze protein; AFP

INTRODUCTION

Antifreeze proteins (AFPs) in cold-adapted organisms prevent macroscopic ice build-up by binding to ice and thereby forestalling additional crystallization [1]. By doing so, AFPs allow organisms to survive below 0°C. It is of great interest to harness this singular property (non-antifreeze proteins cannot bind ice) for applications related to the agriculture and food industries [2,3,4,5] and to the rational design of new AFPs. However, first it is necessary to understand how AFPs and ice interact. Accurately identifying AFPs from evolutionarily divergent organisms is difficult because their sequences and structures differ radically [6,7]. To complicate matters further, the sequences, and consequently the structures, of AFPs from closely related species may also differ substantially if the species have been geographically isolated [8]. Additionally, searching for homologous sequences within databases has not been a fruitful approach given the disparity among AFP sequences. Directly studying AFP-ice interactions is also difficult, and a definitive picture of such interactions is not currently available [7]. Therefore, because many AFPs do not have structural or sequential features in common, it is challenging to correlate their sequences, structures, and function.

A large number of biochemical and structural studies [9,10,11], including site-directed mutagenesis [12,13,14] and computational experiments [15], have been performed in an attempt to understand how AFPs interact with ice on the molecular level. An ice-binding model that incorporates surface complementarity is generally accepted [16]. Recently, Doxey and colleagues [9] successfully identified AFPs for which three-dimensional (3D) crystallographic structures were available on the basis of their highly ordered and planar ice-binding surfaces, but their algorithm could not identify an AFP when only its NMR solution structure was available because the coordinates for the atoms at and near its surface were not well defined [9,17]. Additionally, their algorithm requires a 3D structure, which is not always available for a given AFP.

AFPs therefore cannot be easily distinguished from other types of proteins. Additional information is needed to understand how AFPs and ice interact on a fundamental physicochemical level before such interactions can be applied to cold-adapted mechanisms. Although the types of amino acids present are closely coupled to the ice-binding properties of AFPs [10,13], current models usually rely on only 3D structures. To make additional use of the knowledge that has accumulated over the decades, e.g., identification of the "hydrophobic surface" effect [7,11], the spatial regularity of an AFP solvent-accessible surface, the presence of nonpolar residues, and other properties directly related to the binding properties of AFPs, an algorithm that can discern these properties is necessary. Therefore, for this report, we developed an integrated approach to rapidly identify AFPs from their amino acid sequences. Our statistically based support vector machine (SVM) algorithm has been used to identify certain inherent protein traits, e.g., protein disulfide connectivities [18], subcellular localizations [19,20], and protein folds [21], when given a query sequence, and it does not require a computational mechanical model or structure comparison. For this report, during the training and testing of this algorithm for different classifiers associated with AFPs, multiple feature schemes based on n-peptide compositions extracted from the sequences were used. Then, a genetic algorithm (GA) was used iteratively for key-feature selection and to improve the identification accuracy. This integrated approach enabled the recognition of AFPs on the basis of preferred short peptide sequences, rather than on structural comparisons. The identified AFP sequence features have not been reported previously, yet they correlate well with the properties of the ice-binding interfaces. This approach is suitable for the further identification of the ice-binding surfaces of AFPs.

METHODS


The Validation Dataset that Contained AFPs and non-AFPs with Known 3D Structures—

To assess our approach without bias, we tested it using a sequence validation dataset that did not contain homologous proteins, and to examine the effects of key residues on function, we included only AFPs that had solved structures. This set contained 3762 nonredundant non-AFPs and 44 AFPs, which had been collected from the PISCES server [22] and the Protein Data Bank (PDB) [23], respectively. To include as many representative structures as possible, the non-AFPs had <25% pairwise sequence identity (SI), R-factors of 0.25 or lower, and a crystallographic resolution of 2 Å or better. The AFP sequences were separated into eight subsets on the basis of sequence identity by ClustalW2 [24]. Table 1 lists the PDB IDs of the AFPs in each subset. The AFPs in a given subset were not homologous to any of the AFPs in the other subsets. The non-AFPs were randomly divided among the eight subsets to cross-test the performance of our approach, and the subsets were then merged to train a single model for use with other (independent) datasets (see below). Under such a stringent condition, an AFP in a test subset cannot be recognized simply because it, or a homolog, was among the training sequences.

Independent Datasets—

We constructed three other datasets, none of which contained the AFPs included in the aforementioned eight subsets, to test our algorithm after training it with those subsets. The first set included three AFP structures deposited recently in the PDB [23]; the second set contained 369 nonredundant AFP sequences deposited in the UniProtKB database [25,26], which represented an evolutionarily divergent group of organisms; the third set contained two "antifreeze-like" (AFL) proteins that, while incapable of binding ice, have both a sequence and a structure that are very similar to those of the fish type III AFP [27]. Table 2 lists the number of AFPs derived from each type of organism included in the second dataset.

Feature schemes—

The n-peptide composition feature-based coding schemes, with n = 1 encoding the amino acid composition; n = 2, the dipeptide composition; n = 3, the tripeptide composition, etc., were used previously to predict protein properties [19,20,21,28], and we used them to characterize the important ice-binding features of AFPs. A set of symbols, A_n for the original amino acids; H_n for hydrophobicity [29]; V_n for the normalized van der Waals volume [29]; Z_n for polarizability [29]; P_n for polarity [29]; and F_n, S_n, and E_n for groups of residues classified according to four, seven, and eight physical/chemical properties, respectively, were used to denote the feature schemes [19]. However, to characterize the key functional residues more robustly, partitioned subsequences, g-gap dipeptides, and local amino acid composition strategies were also included [19]. The partitioned amino acid composition X_k^Y is a concatenation of all amino acid sequences of composition Y and length k. The symbol D_g identifies the frequency of a sequence in the form a(x)_g b, where a and b denote specific amino acids and (x)_g denotes the g intervening (g-gap) residues of any type between the pair. The symbol W_l indicates the amino acid composition for peptides characterized by a set of sliding windows of length l centered on a given type of amino acid. It provides information concerning the sequential neighbors of a given type of amino acid.
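
As an illustration of the n-peptide and g-gap dipeptide schemes described above, the following Python sketch computes both compositions for a single sequence. It is not the authors' implementation: the function names, the restriction to the 20 standard amino acids, and the normalization by the number of sliding positions are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (assumption: other symbols are ignored as features).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_peptide_composition(seq, n):
    """Frequency of every length-n peptide; n = 1 gives the amino acid composition."""
    total = len(seq) - n + 1  # number of overlapping n-peptides in seq
    counts = Counter(seq[i:i + n] for i in range(max(total, 0)))
    keys = ("".join(p) for p in product(AMINO_ACIDS, repeat=n))
    return {k: (counts[k] / total if total > 0 else 0.0) for k in keys}


def g_gap_dipeptide_composition(seq, g):
    """Frequency of pairs a(x)_g b: residues a and b separated by g residues of any type."""
    total = len(seq) - g - 1  # number of positions where the pair fits
    counts = Counter(seq[i] + seq[i + g + 1] for i in range(max(total, 0)))
    keys = ("".join(p) for p in product(AMINO_ACIDS, repeat=2))
    return {k: (counts[k] / total if total > 0 else 0.0) for k in keys}


if __name__ == "__main__":
    peptide = "NTALT"
    a1 = n_peptide_composition(peptide, 1)        # 20 features
    d0 = g_gap_dipeptide_composition(peptide, 0)  # 400 features, ordinary dipeptides
    d2 = g_gap_dipeptide_composition(peptide, 2)  # 400 features, two-residue gap
    print(a1["T"], d0["LT"], d2["TT"])
```

With g = 0 the g-gap scheme reduces to the ordinary dipeptide composition, which is why D_0 and n = 2 in the A_n scheme give the same 400-dimensional vector in this sketch.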


Assembly Machine-learning Algorithms—

All SVM calculations were performed using LIBSVM [30], a general library for support vector classification and regression, with the radial basis function (RBF) kernel. In addition to the SVM algorithm [31], we implemented a GA to efficiently optimize the selection of feature attributes, as detailed previously [18]. The combined use of the SVM algorithm and the GA is denoted as SVMGA. For the SVMGA, the feature attributes of each feature scheme, the penalty parameter C, and the kernel parameter γ of the RBF function used for SVM identification were determined in advance by the GA approach. The GA procedure rapidly filtered out feature attributes that were not useful for SVM identification on the basis of each feature scheme.
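
The GA details are given in [18] and are not reproduced here, so the following Python sketch only illustrates the general idea of GA-based feature selection with a binary mask. Everything in it is an assumption made for illustration: the function name `ga_select`, truncation selection, one-point crossover, bit-flip mutation, elitism, and the toy fitness function. In an SVMGA-style setting the fitness would instead be something like the cross-validated performance of an SVM trained on the masked features.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Search for a boolean feature mask that maximizes fitness(mask).

    Requires n_features >= 2 and pop_size >= 4 (hypothetical sketch, not the
    procedure of the paper).
    """
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation: occasionally toggle a feature on or off
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)
    return best


def toy_fitness(mask):
    """Stand-in for classifier performance: features 0 and 3 are 'useful',
    and every selected feature carries a small cost."""
    useful = (0, 3)
    return sum(mask[i] for i in useful) - 0.1 * sum(mask)


if __name__ == "__main__":
    mask = ga_select(8, toy_fitness)
    print([i for i, keep in enumerate(mask) if keep])
```

The per-feature cost in the toy fitness mimics the stated goal of discarding attributes that do not help classification; the penalty parameter C and kernel parameter γ could be appended to the chromosome in the same way, which is not shown here.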


The Voting System—

The coding scheme symbols given above denote the SVM classifiers that were derived from the various properties of the sequence descriptors. For simplicity, the participants in the SVM-identification system [19,20] were incorporated as:

\sum_{k=1}^{9} X_k^{A} + \sum_{g=0}^{6} D_g + \sum_{S} X_5^{S} + \sum_{l \in S'} W_l

with S = {H_3, V_3, Z_3, P_3, F_3, S_2, E_2} and S' = {7, ..., 15}. The system counts the jury votes from each classifier to determine if a protein is an AFP.
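
The jury vote described above can be sketched in Python as follows. The text states only that votes from each classifier are counted; the strict-majority threshold, the function name `jury_predict`, and the stand-in classifiers are assumptions made for this example.

```python
from typing import Callable, Sequence


def jury_predict(sequence: str,
                 classifiers: Sequence[Callable[[str], bool]]) -> bool:
    """Return True (AFP) if more than half of the classifiers vote True.

    Each classifier is any callable mapping a sequence to a boolean vote.
    The strict-majority rule is an assumption of this sketch.
    """
    votes = [bool(clf(sequence)) for clf in classifiers]
    return sum(votes) * 2 > len(votes)


if __name__ == "__main__":
    # Hypothetical stand-ins for trained per-scheme SVM classifiers.
    toy_classifiers = [
        lambda s: "T" in s,
        lambda s: s.count("A") >= 2,
        lambda s: len(s) > 3,
    ]
    print(jury_predict("NTALT", toy_classifiers))
```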


Performance Assessment—

As in previous work [19,20,21], we employed the accuracy Q_i = c_i/n_i × 100 to assess the performance of identification, i.e., the prediction accuracy, where c_i is the number of correctly identified sequences in class i ∈ {AFP, non-AFP}, and n_i is the number of sequences in class i. The overall identification accuracy is given by

P = \sum_i f_i Q_i,

where f_i = n_i/N, and N is the total number of sequences. Although Q_i provides a convenient assessment of identification performance, the Matthews Correlation Coefficient (MCC) [32] is a more informative measure of the performance and is given by:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}},

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. A value for MCC of 1, 0, or -1 represents a perfect correlation, a random correlation, or an inverse correlation, respectively. Consideration of the MCC allowed us to modify our approach to lower the number of false positives returned. To be a credible method, our approach needed to return as few false positives as possible.
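
The measures defined above can be computed directly from the confusion-matrix counts, as in the following Python sketch. The function names are hypothetical, the counts in the usage example are invented for illustration and are not results from this study, and returning 0 when the MCC denominator vanishes is a common convention assumed here.

```python
import math


def class_accuracy(correct, total):
    """Q_i = c_i / n_i * 100 for one class."""
    return correct / total * 100


def overall_accuracy(tp, tn, fp, fn):
    """P = sum_i f_i * Q_i over the classes AFP and non-AFP, with f_i = n_i / N."""
    n_afp, n_non = tp + fn, tn + fp
    n_total = n_afp + n_non
    q_afp = class_accuracy(tp, n_afp)
    q_non = class_accuracy(tn, n_non)
    return (n_afp / n_total) * q_afp + (n_non / n_total) * q_non


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Invented counts, for illustration only.
    tp, fn, fp, tn = 8, 2, 5, 85
    print(class_accuracy(tp, tp + fn), overall_accuracy(tp, tn, fp, fn),
          mcc(tp, tn, fp, fn))
```

Because the non-AFP class is far larger than the AFP class, P is dominated by Q for non-AFPs, which is one reason the MCC is the more informative of the two measures for this kind of dataset.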

AFP Sequence Homology Search—

To verify our ability to identify AFPs via their protein sequences, we tested the homology relationships among the AFP sequences. A query sequence from the second independent dataset was aligned with the sequences of the 44 AFPs of the validation set. Only these 44 AFPs were used because their 3D structures have been solved, and they had been experimentally shown to bind ice. We performed an all-against-all sequence alignment using the global alignment program ALIGN [33]. Only the top-ranked sequence of the 44 AFP sequences was then used to assess the effect of homology on AFP identification, i.e., the SI value for the query sequence and the top-ranked sequence determined the usefulness of the homology search approach.

RESULTS


Identification of AFPs in a Cross-validation Dataset—

For the cross-validation test, the non-AFPs were randomly and equally divided into eight subsets, each of which contained a single representative AFP (identified by the first PDB ID, in bold type, in each subset list in Table 1), and these sets formed the single representative AFP mode. Then, if the AFP representative had homologous sequences, these sequences were added into the corresponding subset. These sets formed the multiple representative AFP mode. The eight subsets can be thought of as eight distant branches of an evolutionary tree. For an experiment, the sequences of seven of the subsets were used to train the SVM algorithm with a given feature scheme, and then the output model of the trained algorithm was used to test the sequences in the subset that was not used for training. This training-and-testing cross-validation procedure was repeated eight times for a given feature scheme, each time omitting a different sequence subset during training. All reported results give the performance over all eight test subsets combined. The SVM classifiers were optimized so that the algorithm could assign a protein sequence as either an AFP or non-AFP sequence.
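
The subset-wise cross-validation described above amounts to a leave-one-subset-out loop. The Python sketch below shows only the splitting logic; the function name and the placeholder sequence identifiers are hypothetical, and the training and testing steps are left to whatever classifier is used.

```python
def leave_one_subset_out(subsets):
    """Yield (held_out_index, training_items, test_items) for each subset in turn.

    `subsets` is a list of lists; each inner list holds the sequences (or
    sequence identifiers) assigned to one subset.
    """
    for i, test in enumerate(subsets):
        train = [item for j, subset in enumerate(subsets) if j != i
                 for item in subset]
        yield i, train, list(test)


if __name__ == "__main__":
    # Placeholder identifiers: one AFP and two non-AFPs per subset.
    subsets = [[f"afp_{i}", f"non_{i}_a", f"non_{i}_b"] for i in range(8)]
    for held_out, train, test in leave_one_subset_out(subsets):
        # Train on seven subsets, test on the eighth.
        print(held_out, len(train), len(test))
```

Keeping homologous AFPs inside the same subset, as the text describes, is what prevents a test sequence from being recognized through a close relative in the training data.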

Table 3 contains a summary of the identification accuracies and the MCC values for the different combinations of feature schemes used for the single representative AFP mode and the multiple representative AFP mode. Only the best result for a given feature scheme is reported. The best overall identification accuracy was 62.5% for the single representative AFP mode used by the SVM algorithm. Incorporation of the GA substantially improved the identification accuracy. Using the iterative procedures mentioned above, the GA identified the largest number of true positives and the smallest number of false positives as it discarded feature attributes that were not useful for the SVM classification. The assembled SVMGA approach correctly identified all AFPs in the single representative AFP mode. Using just the smallest possible number of selected features, the SVM classifier identified more AFPs that were completely dissimilar in structure than did Doxey and colleagues, who used the structural characteristics of the AFPs [9]. After we decreased the number of false positives (FPs) as much as possible (<70 FPs remained), we tested the algorithm with the multiple representative AFP mode, which was a more realistic dataset. Although the performance of the algorithm declined with the increase in the number of divergent sequences, the identification accuracy was a respectable 54.5%.

Identification of AFPs in the Independent Datasets—

The three AFPs of the first independent dataset, which were the A chains of 2zib, 3bog, and 3boi, were all accurately identified as AFPs. We observed that the sequence of 2zib is homologous to that of 2afp, which was contained in the eighth validation subset, and the sequences of 3bog and 3boi are homologous to that of 2pne, which was contained in the sixth validation subset. In addition to accurately identifying the proteins of the first independent dataset as AFPs, the algorithm also recognized that the human and bacterial AFL proteins (PDB IDs 1wvo and 1xuz, respectively) [27] were not AFPs. The human AFL and the bacterial AFL are both very similar in sequence and structure to the fish type III AFP (PDB code 1msi).

For the AFPs of the second independent dataset, which represent a divergent group of organisms and were collected from the UniProtKB database [25,26], about 61% were correctly identified as AFPs by the SVMGA. The SI pair distribution, which characterizes the relative number of sequence pairs within each percentage sequence identity interval, was used to examine the effect of sequence homology on AFP identification. The 369 AFP sequences were each used as a query sequence to profile the SI pair distribution. Each query sequence was aligned with the 44 AFPs of the validation set and also with the other 368 sequences of the second independent dataset. The largest SI value for each query aligned with the 44 AFPs was plotted along the y axis, and the largest SI value for the same query aligned with the other 368 sequences of the second dataset was plotted along the x axis (Fig. 1). The SI values associated with AFPs in the independent dataset that were incorrectly identified by the SVMGA are colored red in Fig. 1, and most of these values are <20%, which is below the so-called midnight-zone threshold at which a structural/functional relationship can be detected [34]. Because the dataset that contained the 369 AFPs was biased, as it contained AFPs from well-characterized cold-adapted organisms, many of the points were located at the far end of the x axis.

Coding Schemes—

For the different coding-scheme SVM classifiers used in this study, we were able to reduce the number of feature attributes required by at least 50% after implementing the GA. Consequently, each remaining classifier was well suited to identifying the corresponding type of AFP (Table 4). To understand why the features were selected as classifiers, we assigned a number (vote) when the pattern of residues in a sequence matched a GA-selected feature attribute of a coding scheme. The sequence position was marked as an SVMGA key residue if it had received a majority of the jury votes from the 14 coding schemes that we used for the multiple representative AFP mode. For instance, the dipeptide LT was selected in the D_0 scheme, and the interval dipeptide T(X)_2T was selected in the D_2 scheme. Hence, for the short peptide NTALT, the L in the fourth position and the first T each received one vote, and the second T received two votes (Table 5). Eight representative AFPs are presented in Fig. 2, with their SVMGA key residues marked. Residues with >6 votes, with 4 or 5 votes, and with <3 votes are colored red, yellow, and gray, respectively. Fig. 3 illustrates the average number of SVMGA key residues in AFP sequences (black bars) and in non-AFP sequences (gray bars). The number of SVMGA key residues in AFP sequences is clearly twice that in non-AFP sequences. Approximately 70% of the SVMGA-selected key residues are solvent exposed (data not shown), which is sensible as these residues are more likely to interact with ice.
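
The per-residue vote counting can be reproduced for the NTALT example given above with a few lines of Python. This sketch covers only g-gap dipeptide attributes, and the function name and the dictionary layout for the selected attributes are assumptions made for illustration; the other coding schemes would contribute votes in an analogous way.

```python
def count_votes(seq, selected_pairs_by_gap):
    """Count votes per position from selected g-gap dipeptide attributes.

    `selected_pairs_by_gap` maps a gap g to the set of two-letter pairs "ab"
    selected for scheme D_g. Both residues of a matching a(x)_g b pattern
    receive one vote.
    """
    votes = [0] * len(seq)
    for g, pairs in selected_pairs_by_gap.items():
        for i in range(len(seq) - g - 1):
            j = i + g + 1
            if seq[i] + seq[j] in pairs:
                votes[i] += 1
                votes[j] += 1
    return votes


if __name__ == "__main__":
    # LT selected in D_0 and T(X)_2T selected in D_2, as in the example in the text.
    selected = {0: {"LT"}, 2: {"TT"}}
    print(count_votes("NTALT", selected))  # -> [0, 1, 0, 1, 2]
```

The output matches the example: the first T and the L each receive one vote, and the second T receives two.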


DISCUSSION

Previous studies have deduced the structural character of the interactions between ice and AFP molecules [7,14]. Knowing how ice and AFP molecules interact allows for the identification of AFPs given their structures (see the excellent results of Doxey and colleagues reported in Table 3). However, the method of Doxey and colleagues required the use of proteins with solved 3D structures, and therefore, until this report, there has not been a more general method for AFP identification.

For this report, we presented an integrated machine-learning method, SVMGA, that uses multiple n-peptide composition features to identify AFPs. Our results show that sequentially divergent AFPs can be identified according to their shared sequence characteristics, because neither a test sequence nor any of its homologs appeared in the training set. A set of n-peptide composition-based SVM predictors was combined to accurately recognize AFPs and, more importantly, to identify the key functional residues at the ice-binding surfaces. Several reports [7] have characterized defining residue repeats in AFP sequences, e.g., alanine-rich sequences in the α-helix of type I AFPs (A28–A34, Fig. 2f), and Thr-Cys-Thr (Fig. 2b) or Thr-Xaa-Thr (Fig. 2c) sequences in insect AFPs. The feature attributes selected by our SVMGA approach included these features. Some of the key SVMGA residues in these representative AFP structures formed relatively flat planes, e.g., the red and yellow clustered regions in Figs. 2 and 4. Additionally, the SVMGA approach identified some residues that reside at the interface between two chains of the crystallized form in the PDB, e.g., T13 and T24 in chain A of 1wfa (Fig. 2f), although the active protein is actually a monomer. We also found that other SVMGA key residues, shown in red (L12, L23, A31, and T35), reside on the same side of the flat binding interface. A similar example is the β-sheet plane of chain A in 1ezg (Fig. 2b): although the Thr-Cys-Xaa tripeptide parallel strands [35] align perfectly in the crystallized dimer form, this flattest ice-binding surface is found in the monomer, as seen by the coloration at the functional interface.

We also inspected the key residues that were identified in the eelpout type III AFP, which has been subjected to many mutagenesis studies. As mentioned in Methods, this eelpout type III AFP (PDB code 1msi) had no homolog among the AFPs in training subsets 1, 3, 4, 5, 6, 7, and 8 (Table 1). The key residues of 1msi were therefore inferred by the SVMGA approach from these dissimilar training sequences. Compared with previous studies [12,14], the SVMGA identified half of the proven ice-binding residues at the interface (Fig. 4b). Of the three residues N14, A16, and T18, which when mutated caused the greatest decreases in AFP activity, the SVMGA method found the latter two. Although our approach failed to recognize Q9, T15, V20, and Q44, the SVMGA identified the nearby residues L10, P12, L17, M22, V45, and V49. Residues L10 and P12 also reside at the ice-binding interface.

For the 369 AFPs in the second independent dataset, for which no structural information was available, the detailed results (Fig. 5) show that the identification accuracy diminished as the evolutionary distance of a protein sequence from the model fish and insect sequences increased. For sequences with very low SI values (15–20%), especially those from algae, bacteria, and plants, our approach was around 30% accurate. The identification of fish AFPs was around 60% accurate even for sequences with SI values lower than 20%. We therefore believe that the features encoded in the fish and insect sequences may be used to identify AFPs from evolutionarily divergent organisms. Additionally, as more sequence data for AFPs are accumulated, they can be used to further characterize the mechanisms of cold adaptation. Finally, our approach can be used as an efficient way to obtain high-throughput identification of protein function on a genome-wide scale. We have implemented the iAFP web service, which is available at http://140.134.24.89/~iafp/.

ACKNOWLEDGMENTS

We thank Jenn-Kang Hwang (National Chiao Tung University) for his invaluable comments and crucial insights and Chen-Hsiung Chan (Tzu Chi University) for helpful discussions. This work was supported by grants from the National Science Council, Taiwan to CSY and the National Science Council and China Medical University, Taiwan to CHL. We are grateful for the hardware and software support by the Intelligent Digit Center at Feng Chia University and the Structural Bioinformatics Core Facility at National Chiao Tung University, respectively.

REFERENCES

1. Fletcher GL, Hew CL, Davies PL (2001) Antifreeze proteins of teleost fishes. Annu Rev

Physiol 63: 359–390.

2. Knight CA (2000) Structural biology. Adding to the antifreeze agenda. Nature 406:

249–251.

3. Fan Y, Liu B, Wang HB, Wang SQ, Wang JF (2002) Cloning of an antifreeze protein gene

in carrot and influence on freeze tolerance of transgenic tobaccos. Plant Cell Rep 21:

296–301.

4. Rubinsky B, Arav A, Devries AL (1992) The cryoprotective effect of antifreeze

glycopeptides from antarctic fishes. Cryobiology 29: 69–79.


5. Griffith M, Ewart KV (1995) Antifreeze proteins and their potential use in frozen foods.

Biotechnol Adv 13: 375–402.

6. Griffith M, Yaish MW (2004) Antifreeze proteins in overwintering plants: a tale of two

activities. Trends Plant Sci 9: 399–405.

7. Jia Z, Davies PL (2002) Antifreeze proteins: an unusual receptor-ligand interaction. Trends

Biochem Sci 27: 101–106.

8. Graham LA, Lougheed SC, Ewart KV, Davies PL (2008) Lateral transfer of a lectin-like

antifreeze protein gene in fishes. PLoS ONE 3: e2616.

9. Doxey AC, Yaish MW, Griffith M, McConkey BJ (2006) Ordered surface carbons

distinguish antifreeze proteins and their ice-binding regions. Nat Biotechnol 24:

852–855.

10. Graether SP, Sykes BD (2004) Cold survival in freeze-intolerant insects: the structure and

function of beta-helical antifreeze proteins. Eur J Biochem 271: 3285–3296.

11. Harding MM, Ward LG, Haymet AD (1999) Type I 'antifreeze' proteins. Structure-activity

studies and mechanisms of ice growth inhibition. Eur J Biochem 264: 653–665.

12. Graether SP, DeLuca CI, Baardsnes J, Hill GA, Davies PL, et al. (1999) Quantitative and

qualitative analysis of type III antifreeze protein structure and function. J Biol Chem

274: 11842–11847.

13. Graether SP, Kuiper MJ, Gagne SM, Walker VK, Jia Z, et al. (2000) Beta-helix structure

and ice-binding properties of a hyperactive antifreeze protein from an insect. Nature

406: 325–328.

14. Jia Z, DeLuca CI, Chao H, Davies PL (1996) Structural basis for the binding of a globular

antifreeze protein to ice. Nature 384: 285–288.

15. Nutt DR, Smith JC (2008) Dual function of the hydration layer around an antifreeze

protein revealed by atomistic molecular dynamics simulations. J Am Chem Soc 130:

13066–13073.

16. Leinala EK, Davies PL, Jia Z (2002) Crystal structure of beta-helical antifreeze protein

points to a general ice binding model. Structure 10: 619–627.

17. Fernandez-Recio J, Totrov M, Skorodumov C, Abagyan R (2005) Optimal docking area: a

new method for predicting protein-protein interaction sites. Proteins 58: 134–143.

18. Lu CH, Chen YC, Yu CS, Hwang JK (2007) Predicting disulfide connectivity patterns.

Proteins 67: 262–270.

19. Yu CS, Chen YC, Lu CH, Hwang JK (2006) Prediction of protein subcellular localization.

Proteins 64: 643–651.

20. Yu CS, Lin CJ, Hwang JK (2004) Predicting subcellular localization of proteins for

Gram-negative bacteria by support vector machines based on n-peptide compositions.

Protein Sci 13: 1402–1406.


assignment by support vector machines using generalized npeptide coding schemes

and jury voting from multiple-parameter sets. Proteins 50: 531–536.

22. Wang G, Dunbrack RL, Jr. (2003) PISCES: a protein sequence culling server.

Bioinformatics 19: 1589–1591.

23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data

Bank. Nucleic Acids Res 28: 235–242.

24. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal

W and Clustal X version 2.0. Bioinformatics 23: 2947–2948.

25. Bairoch A, Apweiler R (2000) The SWISS-PROT protein sequence database and its

supplement TrEMBL in 2000. Nucleic Acids Res 28: 45–48.

26. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38: D142–148.

27. Hamada T, Ito Y, Abe T, Hayashi F, Guntert P, et al. (2006) Solution structure of the

antifreeze-like domain of human sialic acid synthase. Protein Sci 15: 1010–1016.

28. Chen YC, Hwang JK (2005) Prediction of disulfide connectivity from protein sequences.

Proteins 61: 507–512.

29. Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim SH (1999) Recognition of a protein fold

in the context of the Structural Classification of Proteins (SCOP) classification.

Proteins 35: 401–407.

30. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available from http://www.csie.ntu.edu.tw/~cjlin/libsvm.

31. Vapnik V (1995) The nature of statistical learning theory. New York: Springer.

32. Matthews BW (1975) Comparison of the predicted and observed secondary structure of

T4 phage lysozyme. Biochim Biophys Acta 405: 442–451.

33. Myers EW, Miller W (1988) Optimal alignments in linear space. Comput Appl Biosci 4:

11–17.

34. Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12: 85–94.

35. Liou YC, Tocilj A, Davies PL, Jia Z (2000) Mimicry of ice structure by surface hydroxyls

and water of a beta-helix antifreeze protein. Nature 406: 322–324.

36. DeLano WL (2002) The PyMOL Molecular Graphics System. San Carlos, CA, USA: DeLano Scientific. http://www.pymol.org.

FIGURE LEGENDS

Fig. 1. Sequence identity distribution for pairs of AFPs. The x-axis values are the best pairwise-matched SI values for each AFP sequence against the other 368 sequences. The y-axis values are the best pairwise-matched SI values for each of the 369 AFP sequences of the second independent dataset against the 44 sequences of the validation set. A black symbol indicates a correctly identified AFP in the independent data set, and a red symbol indicates an incorrectly identified AFP.

Fig. 2. Examples of key residues mapped onto the surfaces of the eight representative AFPs used in the cross-validation tests. The structures were drawn with PyMOL [36]. The residues colored in gray were not identified as key residues. The residues in red obtained more votes than did the residues in yellow. (a) 1c3y; (b) 1ezg; (c) 1eww; (d) 2pne; (e) 1c89; (f) 1wfa; (g) 2py2; (h) 2afp.

Fig. 3. Difference in the number of SVMGA key residues extracted from the 44 AFP and 3762 non-AFP sequences in the cross-validation dataset. Each black bar represents the mean ± standard deviation of the coverage percentage of SVMGA key residues in the AFP sequences. Each gray bar represents the corresponding value for the non-AFP sequences.

Fig. 4. The surface of the eelpout type III AFP (PDB ID 1msi) drawn with PyMOL [36]. (a) The key residues selected by the SVMGA are labeled in black. Residues Q9 and N14, which were identified as key residues in a mutagenesis study but not by the SVMGA, are labeled in blue. (b) A view of the ice-binding interface, in which all residues that are part of the interface are labeled. The residues identified by the SVMGA are in red and yellow. Residues known to be important in ice binding, but not identified by the SVMGA, are in cyan. Residue I13, which was not identified by the SVMGA, is in gray; its status as a key residue has not been determined by a mutagenesis study.

Fig. 5. The identification accuracy for the 369 AFPs from the second independent set. Each bar correlates the identification accuracy with a range of maximum SI values, which was found using the

Table 1. The eight protein subsets used for cross-validation testing.

Subset  Type                  PDB ID
1       insect AFP            1c3y
2       Type III fish AFP     1c89; 3nla; 1ucs; 1ops; 1kde; 1ame; 1msi; 1b7i; 1b7j; 1b7k; 1ekl; 1gzi; 1hg7; 1jab; 1msj; 2ame; 2jia; 2msi; 2msj; 2spg; 3ame; 3msi; 4ame; 4msi; 5msi; 6ame; 6msi; 7ame; 7msi; 8ame; 8msi; 9ame; 9msi
3       β-helical insect AFP  1ezg
4       Type I fish AFP       1wfa; 1j5b; 1y03
5       β-helical insect AFP  1eww; 1l0s; 1m8n
6       insect AFP            2pne
7       Type II fish AFP      2py2
8       Type II fish AFP      2afp

Notes: The sequences of the PDB codes given in bold type were used for the single representative AFP mode.

Table 2. The number of antifreeze protein sequences for each type of organism in the independent dataset, which contained 369 AFPs.

Organism  Number of sequences
Algae     17
Bacteria  101
Fish      123
Insects   105

Table 3. Performance of SVM and SVMGA in the eight-fold cross-validation tests using the single representative AFP mode or the multiple representative AFP mode.

Subset            Number  SVM (C+X3+V3X5)  SVMGA (14 feature schemes§)  Doxey et al. [9]§§
1                 1 (1)   0 (0)            1 (1)                        -
2                 1 (33)  0 (0)            1 (15)                       (3)
3                 1 (1)   1 (1)            1 (1)                        (1)
4                 1 (3)   0 (0)            1 (1)                        (3)
5                 1 (3)   1 (2)            1 (3)                        (2)
6                 1 (1)   1 (1)            1 (1)                        -
7                 1 (1)   1 (1)            1 (1)                        -
8                 1 (1)   1 (1)            1 (1)                        (0)
AFP accuracy              62.5% (13.6%)    100.0% (54.5%)               (90.0%)
AFP precision             21.7% (25.0%)    10.4% (25.8%)                (42.9%)
Overall accuracy          99.4% (98.5%)    98.2% (97.7%)                (99.6%)
MCC                       0.367 (0.178)    0.319 (0.365)                (0.620)
TP                        5 (6)            8 (24)                       (9)
TN                        3744 (3744)      3693 (3693)                  (3184)
FP                        18 (18)          69 (69)                      (12)
FN                        3 (38)           0 (20)                       (1)

Notes: Values given in parentheses are the numbers of homologous proteins correctly recognized in the multiple representative AFP mode.

§ 14 feature schemes: $\sum_{k=1}^{3} X_k + \sum_{g} D_g + \sum_{S} S X_5 + \sum_{l \in S'} W_l$, where g = {0,1,2,3,5}, S = {H3,V3,P3,S2}, and S' = {9,15}.

§§ Doxey and colleagues [9] used structural properties to identify the 10 AFPs in their dataset. Only 2afp, for which an NMR structure was used, was not identified correctly.
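The summary rows of Table 3 can be checked from the TP, TN, FP and FN counts. The sketch below assumes the usual definitions: AFP accuracy as TP/(TP+FN), AFP precision as TP/(TP+FP), overall accuracy as the fraction of all sequences classified correctly, and the Matthews correlation coefficient [32]. The function name `confusion_metrics` is illustrative and does not come from the original work.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return (AFP accuracy, AFP precision, overall accuracy, MCC).

    Assumed definitions (not stated explicitly in the table):
      AFP accuracy     = TP / (TP + FN)
      AFP precision    = TP / (TP + FP)
      overall accuracy = (TP + TN) / all sequences
      MCC              = Matthews correlation coefficient
    """
    afp_accuracy = tp / (tp + fn)
    afp_precision = tp / (tp + fp)
    overall = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return afp_accuracy, afp_precision, overall, mcc


# SVMGA, single representative AFP mode (counts taken from Table 3).
svmga_single = confusion_metrics(tp=8, tn=3693, fp=69, fn=0)
# SVM (C+X3+V3X5), single representative AFP mode (counts from Table 3).
svm_single = confusion_metrics(tp=5, tn=3744, fp=18, fn=3)

for name, m in (("SVMGA", svmga_single), ("SVM", svm_single)):
    print(f"{name}: {m[0]:.1%} {m[1]:.1%} {m[2]:.1%} MCC={m[3]:.3f}")
```

Under these definitions the counts reproduce the tabulated percentages and MCC values for both methods in the single representative mode.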

Table 4. The feature schemes that enabled the recognition of the AFP in a subset when the single representative mode was used. The filled circles correlate the feature schemes with the AFPs that they identified. The AFPs are denoted according to their subsets.

Feature scheme: C Wl D0 D2 D3 S2X5 H3X5 P3X5 V3X5 Z3X5
Subset
1 ●
2 ● ● ●
3 ● ● ● ● ● ● ● ●
4 ● ● ● ●
5 ● ● ● ● ● ● ● ●
6 ● ● ● ●
7 ● ● ● ● ●
8 ● ● ● ● ● ●

Table 5. An example of votes acquired by residues in a sequence from 1msi.

Sequence ….. Q9 L10 I11 P12 I13 N14 T15 A16 L17 T18 …..
Coding
C * * * *
X2 * * * *
X3 * *
D0 * *
D1 *
D2 * *
D3
D5 *
O3X5 * * *
P3X5 * * *
V3X5 * * * * * *
S2X5
W9 * * * * * * *
W15
Votes ….. 1 4 3 4 2 1 3 5 6 6 …..
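The vote row in Table 5 is consistent with a simple tally: the 35 asterisks across the 14 coding rows equal the sum of the ten vote counts, so each residue appears to receive one vote per coding scheme that marks it. A minimal sketch of such a tally follows. The per-scheme selections are hypothetical placeholders, not the assignments behind Table 5, and `tally_votes` is an illustrative name.

```python
from collections import Counter


def tally_votes(selections):
    """Count, for each residue, how many coding schemes marked it."""
    votes = Counter()
    for residues in selections.values():
        # A scheme contributes at most one vote per residue.
        votes.update(set(residues))
    return votes


# Hypothetical selections for illustration only (NOT the Table 5 data).
selections = {
    "C": ["L10", "P12", "A16"],
    "X2": ["L10", "T18"],
    "W9": ["L10", "P12", "T18"],
}

votes = tally_votes(selections)
for residue, count in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
    print(residue, count)
```

Residues can then be ranked by vote count, which parallels how residues with more votes are distinguished from those with fewer in Fig. 2 and Fig. 4.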
