2. Materials and Methods
2.1 Dataset of cellulase proteins
Cellulase proteins were retrieved from the National Center for Biotechnology Information (NCBI) by searching for all endo-glucanases (EC 3.2.1.4) and exo-glucanases (EC 3.2.1.91) by EC number. The search returned 154 endo-glucanases and 65 exo-glucanases, 219 enzymes in total; the details are shown in Tables 1 and 2. From these, we selected only 4 proteins for each glucanase class, namely those with literature-reported active site residues in the Catalytic Site Atlas (CSA) 2.2.10 and with known binding site residues. The resulting dataset, listed in Table 3, was compiled from the literature.2,9-14
2.2 Structural characteristics of cellulases
Cellulose is a natural polymer composed of repeating glucose units, each rotated 180° relative to its neighbors along the main axis.15 Because cellulose exists in a highly crystalline form, its hydrolysis requires the cooperative activity of three classes of enzymes:
i. Endo-glucanase or 1,4-β-D-glucanhydrolase (EC 3.2.1.4)
ii. Exo-glucanase or 1,4-β-D-glucan cellobiohydrolases (EC 3.2.1.91)
iii. β-glucosidases or β-glucoside glucohydrolases (EC 3.2.1.21)
The structures of endo-glucanases are commonly characterized by a groove or cleft that binds a linear cellulose chain, allowing the enzyme to act in a random manner at amorphous sites.
Exo-glucanases, or cellobiohydrolases (CBH), generally possess tunnel-like active sites, which can accept a substrate chain only via its terminal regions.16 These tunnels have proved essential to the cellobiohydrolases for the processive cleavage of cellulose chains from the reducing or nonreducing ends. The cellulose degradation process is shown in Figure 1.17
2.3 Analysis and Classification of binding sites
2.3.1 Amino acid type
Different amino acids have different propensities to be binding site residues. Binding site residues are classified by the one-letter abbreviations of the 20 standard amino acids, ordered from the hydrophobic group through the hydrophilic group to the charged group: G, A, V, L, I, M, P, F, W, Y, C, S, T, N, Q, D, E, H, R and K.
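As an illustration of this classification, the following Python sketch assigns each residue to one of the three groups and tallies the group frequencies. The text gives the ordering of the 20 residues but not the exact group boundaries, so the split used here (hydrophobic G to W, hydrophilic Y to Q, charged D to K) is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Assumed grouping: the ordering comes from the text, the boundaries do not.
HYDROPHOBIC = set("GAVLIMPFW")
HYDROPHILIC = set("YCSTNQ")
CHARGED = set("DEHRK")


def residue_class(aa: str) -> str:
    """Return the assumed physicochemical class of a one-letter amino acid code."""
    aa = aa.upper()
    if aa in HYDROPHOBIC:
        return "hydrophobic"
    if aa in HYDROPHILIC:
        return "hydrophilic"
    if aa in CHARGED:
        return "charged"
    raise ValueError(f"not a standard amino acid: {aa!r}")


def class_frequencies(residues):
    """Fraction of residues per class for labels such as 'W41' or 'D117'."""
    counts = Counter(residue_class(label[0]) for label in residues)
    total = sum(counts.values())
    return {name: counts.get(name, 0) / total
            for name in ("hydrophobic", "hydrophilic", "charged")}
```

Called on a list of residue labels such as those in Table 4, `class_frequencies` returns the three fractions that a comparison of amino acid type frequencies would use.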
2.3.2 Weighted contact number model (WCN)
It has recently been shown that in proteins the atomic mean-square displacement (or B-factor) can be related to the number of neighboring atoms (the protein contact number) and to the squared distance from the center of mass of the protein.18 Here, we refer to this method as the contact number (CN). The method can be further improved if the protein CN is scaled down by the square of the distance between the contacting pair. To account for distance, a distance-dependent contact number is defined by weighting the integral contact number with the factor $1/r_{ij}^{2}$, where $r_{ij}$ is the distance between the Cα atoms of residues $i$ and $j$:
$$\nu_i = \sum_{j \neq i}^{N} \frac{1}{r_{ij}^{2}} \qquad (1)$$
where $N$ is the total number of residues in the protein. We refer to this as the weighted CN model (WCN). The CN (or WCN) profile of a protein of $N$ residues is defined as
$$w = (w_1, w_2, \ldots, w_N) \qquad (2)$$
where $w_i$ is defined as the reciprocal contact number, i.e., $w_i = 1/\nu_i$.
For ease of comparison, we normalize $w$ to its Z-scores:
$$z_x = \frac{x - \bar{x}}{\sigma_x} \qquad (3)$$
where $\bar{x}$ and $\sigma_x$ are the mean and the standard deviation of $x$. Here $x$ designates $w$.
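A minimal numerical sketch of the WCN profile is given below. It assumes that the weighting factor is the inverse squared distance between Cα atoms, that the profile is the reciprocal of the weighted contact number, and that the Z-score uses the population standard deviation. The function name `wcn_zscores` and the NumPy-based implementation are illustrative and are not the code used in this study.

```python
import numpy as np


def wcn_zscores(ca_coords):
    """Z-scores of the reciprocal weighted contact number, one per residue.

    ca_coords: array-like of shape (N, 3), one C-alpha position per residue.
    Coordinates are assumed to be distinct (no zero distances).
    """
    xyz = np.asarray(ca_coords, dtype=float)
    # Squared distance between every pair of C-alpha atoms.
    diff = xyz[:, None, :] - xyz[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)       # exclude j == i (1/inf contributes 0)
    nu = (1.0 / d2).sum(axis=1)        # weighted contact number per residue
    w = 1.0 / nu                       # reciprocal contact number
    return (w - w.mean()) / w.std()    # population standard deviation (assumed)
```

For four equally spaced points on a line, the two end positions have fewer close neighbors than the two inner ones and therefore receive the higher z-score, which matches the reading of low z-scores as densely packed, rigid positions.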
2.3.3 Relative Solvent Accessibility (RSA)
The surface area is an important structural characteristic in binding a non-protein molecule (such as a substrate or cofactor) and in interactions within molecular complexes.19 Binding site residues are therefore generally more exposed to solvent than other residues. The relative accessibility of an amino acid is the degree to which a residue in a protein is accessible to a solvent molecule. The relative solvent accessibility is computed by
$$\mathrm{RSA}_i = \frac{\mathrm{ASA}_i}{\mathrm{MaxASA}_i} \times 100\% \qquad (4)$$
where $\mathrm{ASA}_i$ is the solvent accessibility of residue $i$, assigned using the program DSSP and given in Å², and $\mathrm{MaxASA}_i$ is the maximal accessibility for that amino acid type given by Rost and Sander.20 A residue is considered accessible if its relative accessible surface area (RSA) is ≥ 5%, a cut-off devised and optimized by Miller et al.21 A residue that is accessible in the protomer lies on the protein surface; otherwise it is in the core. That is, $\mathrm{RSA}_i$ < 5% means buried and $\mathrm{RSA}_i$ ≥ 5% means exposed. The thresholds we selected are therefore the same as those of Rost and Miller.20,21
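The RSA calculation and the 5% cut-off can be sketched as follows. The maximal accessibilities in the lookup table are the values commonly tabulated for Rost and Sander; they are included here as an assumption and should be checked against the reference before use. The function names are hypothetical, and the accessible surface area per residue is assumed to have been read from DSSP output beforehand.

```python
# Maximal accessibilities in Å^2 per residue type (assumed values, commonly
# attributed to Rost and Sander; verify against the original table).
MAX_ASA = {
    "A": 106.0, "R": 248.0, "N": 157.0, "D": 163.0, "C": 135.0,
    "Q": 198.0, "E": 194.0, "G": 84.0, "H": 184.0, "I": 169.0,
    "L": 164.0, "K": 205.0, "M": 188.0, "F": 197.0, "P": 136.0,
    "S": 130.0, "T": 142.0, "W": 227.0, "Y": 222.0, "V": 142.0,
}


def relative_accessibility(aa: str, asa: float) -> float:
    """RSA in percent for a residue of type `aa` with accessible area `asa` (Å^2)."""
    return 100.0 * asa / MAX_ASA[aa.upper()]


def is_exposed(aa: str, asa: float, cutoff: float = 5.0) -> bool:
    """True if the residue is exposed (RSA >= cutoff, in percent), else buried."""
    return relative_accessibility(aa, asa) >= cutoff
```

A glycine with 42 Å² of accessible area has an RSA of 50% and is classed as exposed, whereas a tryptophan with 10 Å² falls below the 5% cut-off and is classed as buried.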
2.3.4 Performance measures
Sensitivity and specificity are measured through the true positive rate (TPR) and the false positive rate (FPR). The TPR is given by
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (5)$$
and the FPR is given by
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (6)$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The sensitivity is equal to the TPR, and the specificity is given by
$$\mathrm{Specificity} = 1 - \mathrm{FPR} \qquad (7)$$
To determine the WCN z-score threshold for predicting binding sites, we plotted TPR against FPR. For the test cases, the WCN z-score serves as the prediction score for endo-glucanases and exo-glucanases.
Consequently, the decision threshold we selected is a WCN z-score below -0.5 for endo-glucanases and below -0.8 for exo-glucanases; the class with the higher specificity is thus taken as the WCN prediction.
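As a worked illustration of these measures, the short Python sketch below computes sensitivity, FPR and specificity from the four confusion counts. The function name `rates` is hypothetical; the example counts are those reported for 1TML with WCN alone in Table 6.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity (TPR), false positive rate and specificity from confusion counts."""
    tpr = tp / (tp + fn)            # true positive rate
    fpr = fp / (fp + tn)            # false positive rate
    return {"sensitivity": tpr, "fpr": fpr, "specificity": 1.0 - fpr}


# Counts for 1TML with the WCN threshold alone (TP, FP, FN, TN from Table 6).
example = rates(tp=7, fp=94, fn=4, tn=181)
```

Rounded to whole percentages, these counts give the sensitivity of 64% and specificity of 66% listed in the table.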
3. Results and Discussion
3.1 The dataset
We analyzed the WCN z-score distribution of the cellulase binding site residues reported in the literature. Figure 2 shows the frequency of endo-glucanase binding residues (black) compared with exo-glucanase binding residues (white). The distribution shows that most binding residues have WCN z-scores between -1.6 and 0.9. We then computed sensitivity and specificity in order to decide the proper WCN threshold for predicting binding sites, as follows.
3.2 The prediction performance
Here we evaluate a range of WCN z-score thresholds and compute the sensitivity and specificity at each one. The threshold ranges from -1.6 to 0.9 in steps of 0.1. A residue that falls under the threshold and also matches a literature binding site residue is counted as “positive”; otherwise it is counted as “negative”. Each threshold value therefore yields one pair of TPR and FPR values, which defines a point on the diagram in Figure 3 and an entry in Table 5. Figure 3(A), (B) and (C) show, from top to bottom, the diagrams for the whole cellulase dataset, the endo-glucanase group and the exo-glucanase group. Based on the sensitivity, the specificity and the relationship between TPR and FPR, we chose a WCN z-score binding site threshold of < -0.5 for endo-glucanases and < -0.8 for exo-glucanases.
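The threshold scan described here can be sketched in Python as follows. The function name `roc_points` and the input layout (one z-score and one literature label per residue) are assumptions made for illustration; the sketch also assumes that the input contains at least one binding and one non-binding residue, so that both rates are defined.

```python
import numpy as np


def roc_points(z_scores, is_binding, thresholds=None):
    """Return (threshold, TPR, FPR) for each WCN z-score threshold.

    A residue is predicted as binding when its z-score is below the threshold.
    """
    z = np.asarray(z_scores, dtype=float)
    y = np.asarray(is_binding, dtype=bool)
    if thresholds is None:
        # -1.6 to 0.9 inclusive in steps of 0.1 (26 values).
        thresholds = np.round(np.linspace(-1.6, 0.9, 26), 1)
    points = []
    for t in thresholds:
        pred = z < t
        tp = int(np.sum(pred & y))
        fp = int(np.sum(pred & ~y))
        fn = int(np.sum(~pred & y))
        tn = int(np.sum(~pred & ~y))
        points.append((float(t), tp / (tp + fn), fp / (fp + tn)))
    return points
```

Each returned tuple corresponds to one point on a TPR versus FPR diagram, so plotting the second and third elements against each other gives a curve of the kind used to choose the thresholds.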
3.3 Comprehensive analysis of endo- and exo-glucanases
Despite the good prediction performance of the WCN model, the sensitivity and specificity for predicting cellulase binding sites still need improvement, so we add an RSA filter to raise the specificity. The RSA threshold we selected (≥ 5%) is based on Rost and Miller.20,21
3.3.1 Endo-glucanases
In this study, the endo-glucanase dataset consists of the structures with PDB IDs 1TML, 2ENG, 1JS4 and 2NLR. Figure 4(A) shows the WCN model of the 1TML structure, and Figure 4(B) shows the WCN z-score distribution of 1TML. Figure 4(C) to (E) show, in surface form, the experimental binding site residues (red), the residues under the WCN threshold (< -0.5; orange) and the subset of those residues that are also exposed (orange). Figure 4(F) to (H) show the same three selections in cartoon form. Figure 4(E) and (H) therefore show the residues that satisfy the WCN and RSA criteria at the same time. The sensitivity and specificity of WCN alone and of WCN combined with RSA are compared in Table 6. Figures 5, 6 and 7, panels (A) to (H), present the same information for 2ENG, 1JS4 and 2NLR as Figure 4 does for 1TML, with the corresponding comparisons in Tables 7, 8 and 9, respectively.
Comparing the two methods, WCN alone and WCN combined with RSA, we find that far more residues of the selected enzymes fall under the WCN z-score threshold alone than pass the combined WCN and RSA criteria. Although the sensitivity decreases when RSA is included, the number of false positives drops and the number of true negatives rises, so the specificity increases substantially. The performance comparison indicates that binding site residues tend to have low WCN z-scores and to be exposed.
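The combined criterion can be expressed compactly. The sketch below is an illustration under stated assumptions rather than the procedure used in this study: the function name `predict_binding` is hypothetical, z-scores and RSA values (in percent) are assumed to be given per residue in the same order, and the defaults correspond to the endo-glucanase z-score threshold and the 5% RSA cut-off given in the text.

```python
def predict_binding(z_scores, rsa_percent, z_cutoff=-0.5, rsa_cutoff=5.0):
    """Predict binding residues as those with a low WCN z-score that are also exposed."""
    return [z < z_cutoff and rsa >= rsa_cutoff
            for z, rsa in zip(z_scores, rsa_percent)]


# Three residues: rigid and exposed, rigid but buried, flexible and exposed.
calls = predict_binding([-0.9, -0.9, 0.2], [20.0, 1.0, 40.0])
```

Only the first residue passes both tests. The second is removed by the RSA filter, which is how the filter lowers the number of false positives at the cost of discarding some true binding residues that are buried.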
3.3.2 Exo-glucanases
Following the endo-glucanases, the exo-glucanase dataset consists of the structures with PDB IDs 1CEL, 1QK2, 2HIS and 1EXP. Because 2HIS and 1EXP belong to the same family, we discuss only 1EXP. Figure 8(A) shows the WCN model of the 1CEL structure, and Figure 8(B) shows the WCN z-score distribution of 1CEL. Figure 8(C) to (E) show, in cartoon form, the experimental binding site residues (red), the residues under the WCN threshold (< -0.8; orange) and the subset of those residues that are also exposed (orange); Figure 8(E) therefore shows the residues that satisfy the WCN and RSA criteria at the same time. The sensitivity and specificity of WCN alone and of WCN combined with RSA are compared in Table 10. Figure 9 presents the same information for 1QK2 as Figure 8 does for 1CEL, with the comparison in Table 11. Figure 10(A) to (H) presents the same information for 1EXP, with the comparison in Table 12.
As with the endo-glucanases, comparing WCN alone with WCN combined with RSA shows that far more residues of the selected enzymes fall under the WCN z-score threshold (< -0.8) alone than pass the combined criteria. Although the sensitivity decreases when RSA is included, the number of false positives drops and the number of true negatives rises, so the specificity increases substantially. The performance comparison again indicates that binding site residues tend to have low WCN z-scores and to be exposed.
Figure 11 compares the frequency of amino acid types in the experimental cellulase binding sites with that in the binding sites predicted by our method based on WCN and RSA. We expected the experimental and predicted binding sites to have similar amino acid compositions. However, in the experimental data the frequencies of hydrophobic, hydrophilic and charged amino acids are 35%, 35% and 31% in endo-glucanase binding sites and 23%, 38% and 38% in exo-glucanase binding sites, whereas in our predictions they are 41%, 34% and 25% in endo-glucanases and 45%, 24% and 31% in exo-glucanases. There is thus no significant correlation between the experimental and predicted amino acid type frequencies in substrate binding. Nevertheless, the performance values for endo-glucanases and exo-glucanases in Table 13 show that specificity increases when class-specific WCN z-score thresholds are used and RSA is included. This suggests that enzymes are very specific and that binding site residues are less flexible than other residues. The WCN serves as an indicator of structural rigidity, and it is complementary to the other structural characteristics of binding sites. The WCN z-score threshold for endo-glucanases is larger than that for exo-glucanases, which suggests that the substrate-binding structures of most endo-glucanases are more flexible, whereas those of exo-glucanases are more rigid during cellulose hydrolysis. It is reasonable that endo-glucanases need more space for hydrolyzing cellulose. Thus, using WCN together with RSA may help in understanding and finding as many cellulase binding site residues as possible.
4. Conclusion
In this work, we present a structural analysis of cellulase binding sites using a dataset of 219 enzymes chosen from NCBI and CSA. The dataset is nonredundant, and we selected 4 enzymes each for endo-glucanases and exo-glucanases, 8 enzymes in total, that have experimental catalytic residue data from the literature. Our analysis shows that the WCN z-score threshold for endo-glucanases is larger than that for exo-glucanases; it is reasonable that endo-glucanases need more space for hydrolyzing cellulose. In addition, the binding site prediction performance shows that specificity can be increased by combining WCN with RSA. This means that most binding site residues are more rigid than other residues and are exposed, even though they are hydrophobic.
These binding site characteristics may help to clarify structure-function relationships. Furthermore, they will be helpful for predicting binding sites from protein structures in cellulases of unknown function, and in future work it may be possible to distinguish endo-glucanases from exo-glucanases by their binding sites.
REFERENCES
1. Richmond T. Higher plant cellulose synthases. Genome Biol 2000;1(4):REVIEWS3001.
2. Sulzenbacher G, Mackenzie LF, Wilson KS, Withers SG, Dupont C, Davies GJ. The crystal structure of a 2-fluorocellotriosyl complex of the Streptomyces lividans endoglucanase CelB2 at 1.2 Å resolution. Biochemistry 1999;38(15):4826-4833.
3. Zhang YH, Lynd LR. Toward an aggregated understanding of enzymatic hydrolysis of cellulose: noncomplexed cellulase systems. Biotechnol Bioeng 2004;88(7):797-824.
4. Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res 2009;37(Database issue):D233-238.
5. Lynd LR, Weimer PJ, van Zyl WH, Pretorius IS. Microbial cellulose utilization: fundamentals and biotechnology. Microbiol Mol Biol Rev 2002;66(3):506-577, table of contents.
6. Lu CH, Huang SW, Lai YL, Lin CP, Shih CH, Huang CC, Hsu WL, Hwang JK. On the relationship between the protein structure and protein dynamics. Proteins-Structure Function and Bioinformatics 2008;72(2):625-634.
7. Shih CH, Huang SW, Yen SC, Lai YL, Yu SH, Hwang JK. A simple way to compute protein dynamics without a mechanical model. Proteins-Structure Function and Bioinformatics 2007;68(1):34-38.
8. Lin CP, Huang SW, Lai YL, Yen SC, Shih CH, Lu CH, Huang CC, Hwang JK. Deriving protein dynamical properties from weighted protein contact number. Proteins-Structure Function and Bioinformatics 2008;72(3):929-935.
9. Zou JY, Kleywegt GJ, Stahlberg J, Driguez H, Nerinckx W, Claeyssens M, Koivula A, Teeri TT, Jones TA. Crystallographic evidence for substrate ring distortion and protein conformational changes during catalysis in cellobiohydrolase Cel6A from Trichoderma reesei. Structure with Folding & Design 1999;7(9):1035-1045.
10. Russell RB. Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution. Journal of Molecular Biology 1998;279(5):1211-1227.
11. Notenboom V, Birsan C, Nitz M, Rose DR, Warren RAJ, Withers SG. Insights into transition state stabilization of the beta-1,4-glycosidase Cex by covalent intermediate accumulation in active site mutants. Nature Structural Biology 1998;5(9):812-818.
12. Sakon J, Irwin D, Wilson DB, Karplus PA. Structure and mechanism of endo/exocellulase E4 from Thermomonospora fusca. Nature Structural Biology 1997;4(10):810-818.
13. Davies GJ, Tolley SP, Henrissat B, Hjort C, Schulein M. Structures of oligosaccharide-bound forms of the endoglucanase V from Humicola insolens at 1.9 angstrom resolution. Biochemistry 1995;34(49):16210-16220.
14. Spezio M, Wilson DB, Karplus PA. Crystal-Structure of the Catalytic Domain of a Thermophilic Endocellulase. Biochemistry 1993;32(38):9906-9916.
15. Grassick A, Murray PG, Thompson R, Collins CM, Byrnes L, Birrane G, Higgins TM, Tuohy MG. Three-dimensional structure of a thermostable native cellobiohydrolase, CBHIB, and molecular characterization of the cel7 gene from the filamentous fungus, Talaromyces emersonii. European Journal of Biochemistry 2004;271(22):4495-4506.
16. Divne C, Stahlberg J, Teeri TT, Jones TA. High-resolution crystal structures reveal how a cellulose chain is bound in the 50 angstrom long tunnel of cellobiohydrolase I from Trichoderma reesei. Journal of Molecular Biology 1998;275(2):309-325.
17. Beguin P, Gilkes NR, Kilburn DG, Miller RC, O'Neill GP, Warren RAJ. Cloning of Cellulase Genes. CRC Critical Reviews in Biotechnology 1987;6(2):129-162.
18. Halle B. Flexibility and packing in proteins. Proceedings of the National Academy of Sciences of the United States of America 2002;99(3):1274-1279.
19. Samanta U, Bahadur RP, Chakrabarti P. Quantifying the accessible surface area of protein residues in their local environment. Protein Engineering 2002;15(8):659-667.
20. Rost B, Sander C. Conservation and prediction of solvent accessibility in protein families. Proteins 1994;20(3):216-226.
21. Miller S, Lesk AM, Janin J, Chothia C. The Accessible Surface-Area and Stability of Oligomeric Proteins. Nature 1987;328(6133):834-836.
TABLES
Table 1. The dataset of exo-glucanases from NCBI and CSA
PDB GH Catalytic Residues
1BVW 6 Y174 R179 D180 D226 align 1qk2
Table 2. The dataset of endo-glucanases from NCBI and CSA
PDB GH Catalytic Residues
1A3H 5 N138 E139 H200 Y202 E228 align 1bqc
1IA6 9 D56 D59 E410 align 1js4
Table 3. Proteins with their own catalytic residue data from the literature
PDB GH Catalytic Residues
1JS4 9 D55 D58
Table 4. Binding site residues of each protein from the literature
Group            PDB    Ref.   Binding Residues
Endo-glucanase   1JS4   9      H125 W128 F205 W209 W256 D261 W313 R317 R378 Y429
Endo-glucanase   1TML   10     W41 Y73 D117 D155 H159 W162 S189 W231 E263 D265 A271
Endo-glucanase   2ENG   11     T6 R7 Y8 D10 K13 W18 A19 K21 S45 E82 S110 H119 D121 G127 G128 V129 Y147 G148 D178 N179
Endo-glucanase   2NLR   12     F8 N22 W24 H65 Y66 N100 D104 W106 E120 M122 N155 S157 Q199 E203
Exo-glucanase    1CEL   15     N141 Y145 E217 H228 W367
Table 5.1. The measurement of the whole dataset
Threshold TPR FPR
TP: true positive, TN: true negative, FN: false negative, FP: false positive.
Table 5.2. The measurement of endo-glucanase
Threshold TPR FPR
Table 5.3. The measurement of exo-glucanases
Threshold TPR FPR
Table 6. Endo-glucanase 1TML. Comparison of WCN alone and WCN combined with RSA.
1TML                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.5)                  7    94   4    181   64                66
WCN (< -0.5) & RSA (≥ 0.05)   5    24   6    251   45                91
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 7. Endo-glucanase 2ENG. Comparison of WCN alone and WCN combined with RSA.
2ENG                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.5)                  10   63   10   122   50                66
WCN (< -0.5) & RSA (≥ 0.05)   9    19   11   167   47                90
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 8. Endo-glucanase 1JS4. Comparison of WCN alone and WCN combined with RSA.
1JS4                          TP   FP    FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.5)                  4    220   6    375   40                63
WCN (< -0.5) & RSA (≥ 0.05)   3    52    7    543   30                91
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 9. Endo-glucanase 2NLR. Comparison of WCN alone and WCN combined with RSA.
2NLR                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.5)                  9    71   5    137   64                66
WCN (< -0.5) & RSA (≥ 0.05)   5    12   9    196   36                94
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 10. Exo-glucanase 1CEL. Comparison of WCN alone and WCN combined with RSA.
1CEL                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.8)                  5    96   0    333   100               78
WCN (< -0.8) & RSA (≥ 0.08)   3    19   2    410   60                96
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 11. Exo-glucanase 1QK2. Comparison of WCN alone and WCN combined with RSA.
1QK2                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.8)                  9    74   5    275   64                79
WCN (< -0.8) & RSA (≥ 0.08)   8    8    6    314   57                97
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 12. Exo-glucanase 1EXP. Comparison of WCN alone and WCN combined with RSA.
1EXP                          TP   FP   FN   TN    Sensitivity (%)   Specificity (%)
WCN (< -0.8)                  5    67   0    240   100               78
WCN (< -0.8) & RSA (≥ 0.08)   4    6    1    301   80                98
TP: true positive, FP: false positive, FN: false negative, TN: true negative, Sensitivity: TP/(TP+FN), Specificity: 1-(FP/(FP+TN)). All statistical measures are percentages (%).
Table 13. Comparison of WCN alone with WCN combined with RSA
                          Sensitivity (%)   Specificity (%)
Endo-glucanase (< -0.5)   79                78
FIGURES
Figure 1. The processive synergy mechanism of cellulose hydrolysis. (A) Cellulose consists of crystalline and amorphous regions. (B) Endo-glucanase cuts at the internal amorphous sites. (C) Exo-glucanase acts on the reducing or nonreducing ends of chains. β-glucosidases hydrolyze soluble cellodextrins and cellobiose to glucose.
Figure 2. WCN z-score distribution of literature binding site residues. The frequency of endo-glucanase (A) binding residues, colored in black, compared with exo-glucanase (B) binding residues, colored in white.
Figure 3. Diagrams of the relationship between TPR and FPR, from top to bottom: (A) all selected cellulase dataset, (B) endo-glucanase group, (C) exo-glucanase group.
Figure 4. (A) 1TML protein WCN model in putty form. (B) The WCN z-score distribution of protein 1TML. (C) to (E) Proteins in surface form: (C) 1TML experimental binding site residues colored in red; (D) the residues under the WCN threshold (< -0.5) colored in orange; (E) the residues selected by both the WCN and RSA thresholds, also colored in orange. (F) to (H) Proteins in cartoon form: (F) 1TML experimental binding site residues colored in red; (G) the residues under the WCN threshold (< -0.5) colored in orange; (H) the residues selected by both the WCN and RSA thresholds, also colored in orange.
Figure 5. (A) 2ENG protein WCN model in putty form. (B) The WCN z-score distribution of protein 2ENG. (C) to (E) Proteins in surface form: (C) 2ENG experimental binding site residues colored in red; (D) the residues under the WCN threshold (< -0.5) colored in orange; (E) the residues selected by both the WCN and RSA thresholds, also colored in orange. (F) to (H) Proteins in cartoon form: (F) 2ENG experimental binding site residues colored in red; (G) the residues under the WCN threshold (< -0.5) colored in orange; (H) the residues selected by both the WCN and RSA thresholds, also colored in orange.
Figure 6. (A) 1JS4 protein WCN model in putty form. (B) The WCN z-score