3. PROTEIN FOLD PREDICTION BY DATA FUSION APPROACH
3.2 COMPUTATIONAL FRAMEWORK AND ARCHITECTURE
We use the data sets from Ding and Dubchak [8], which originated from the SCOP database, for training and testing. The training set is selected from the database built for the prediction of 128 folding patterns in the SCOP database [21]. Any two proteins in the training set are less than 35% identical in any aligned subsequence longer than 80 residues. The independent testing set is selected from the PDB-40D set [1], [18]-[21]. All proteins in the testing set are less than 40% identical to each other, and no protein in the testing set is more than 35% identical to any protein in the training set. The total number of proteins is 698, with 313 for training and 385 for testing. These proteins are divided into 4 classes and 27 folding patterns according to their structures. Table 14 shows the number of proteins in each class and folding pattern used for training and testing in this study.
Table 14. The variety in protein structures for training and testing
Class     Folding pattern                                Training  Testing
1. all-α  1. α1: Globin-like                                   13        6
          2. α2: Cytochrome c                                   7        9
          3. α3: DNA-binding 3-helical bundle                  12       20
          4. α4: 4-helical up-and-down bundle                   7        8
          5. α5: 4-helical cytokines                            9        9
          6. α6: Alpha; EF-hand                                 7        9
2. all-β  7. β1: Immunoglobulin-like β-sandwich                30       44
          8. β2: Cupredoxins                                    9       12
          9. β3: Viral coat and capsid proteins                16       13
          10. β4: ConA-like lectins/glucanases                  7        6
          11. β5: SH3-like barrel                               8        8
          12. β6: OB-fold                                      13       19
          13. β7: Trefoil                                       8        4
          14. β8: Trypsin-like serine proteases                 9        4
          15. β9: Lipocalins                                    9        7
3. α/β    16. (α/β)1: (TIM)-barrel                             29       48
          17. (α/β)2: FAD (also NAD)-binding motif             11       12
          18. (α/β)3: Flavodoxin-like                          11       13
          19. (α/β)4: NAD(P)-binding Rossmann-fold             13       27
          20. (α/β)5: P-loop containing nucleotide             10       12
          21. (α/β)6: Thioredoxin-like                          9        8
          22. (α/β)7: Ribonuclease H-like motif                10       14
          23. (α/β)8: Hydrolases                               11        7
          24. (α/β)9: Periplasmic binding protein-like         11        4
4. α+β    25. (α+β)1: β-grasp                                   7        8
          26. (α+β)2: Ferredoxin-like                          13       27
          27. (α+β)3: Small inhibitors, toxins, lectins        12       27
3.2.2 Features
Feature extraction from the data is critical for meaningful results before the features can be subjected to machine learning techniques. Different features may result in different classifications. Two major approaches, direct and indirect coding, have been used to
assigned for each sequence which is position and length independent [9]. Ding and Dubchak [8]
proposed six direct coding features for protein structure classification. These single-parameter features are global descriptions of a peptide chain representing the proteins. These features are based on physical, chemical and structural properties of the constituent amino acids.
The six single-parameter features are amino acid composition (C), predicted secondary structure (S), hydrophobicity (H), normalized van der Waals volume (V), polarity (P) and polarizability (Z). Five multiple-parameter features, ‘CS’, ‘CSH’, ‘CSHP’, ‘CSHPV’ and ‘CSHPVZ’, were constructed to classify protein folding patterns. Ding and Dubchak [8] found that the multiple-parameter feature ‘CSHP’ gave the highest overall accuracy rate for protein structure prediction with SVM. These eleven single- and multiple-parameter features all place more emphasis on the global properties and structures of amino acid sequences than on the local interactions among neighboring peptides.
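As an illustration of a direct coding feature, the following minimal sketch computes the amino acid composition (C) of a sequence as the fraction of each of the 20 standard residues. It also shows one plausible way to build a multiple-parameter feature such as ‘CS’, namely by concatenating single-parameter vectors. The function names, the example sequence and the concatenation rule are assumptions made for illustration and are not taken from [8].

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def composition(sequence: str) -> list[float]:
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def concatenate(*feature_vectors: list[float]) -> list[float]:
    """ASSUMPTION: a multiple-parameter feature is the concatenation
    of its single-parameter vectors (e.g. C followed by S gives 'CS')."""
    combined: list[float] = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


# Hypothetical 10-residue sequence, used only to exercise the functions.
c_vec = composition("MKVLAAGIVG")
# A second vector would normally come from another property (S, H, ...);
# here the same vector is reused purely to show the concatenation.
cc_vec = concatenate(c_vec, c_vec)
```

The composition vector has a fixed length regardless of the sequence length, which is what makes such global descriptors convenient inputs for a classifier.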
Huang et al. [9] used the N-gram concept to extract features from the amino acid sequences of proteins. Two indirect coding features were proposed, generated from the bi-gram (B) and the spaced bi-gram (SB) coding schemes, respectively. These features reflect the local interactions among neighboring peptides within the 3-D structure of a protein.
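The bi-gram idea can be sketched as counting ordered pairs of residues along the sequence. The code below is a minimal illustration, not the coding scheme of [9]: it assumes that a bi-gram is a pair of adjacent residues, that a spaced bi-gram is a pair separated by one residue, and that counts are normalized to relative frequencies. The function name and the `gap` parameter are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def pair_frequencies(sequence: str, gap: int = 0) -> list[float]:
    """Relative frequency of each ordered residue pair.

    gap=0 pairs each residue with its immediate neighbor (bi-gram).
    gap=1 skips one residue between the two (ASSUMED spaced bi-gram).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    step = gap + 1
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # ignore pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


# Hypothetical toy sequence.
bigram = pair_frequencies("ACACAC")
spaced = pair_frequencies("ACACAC", gap=1)
```

Unlike the composition vector, these pair frequencies depend on the order of residues, which is how they capture local neighborhood information.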
We combined the six single-parameter features proposed by Ding and Dubchak [8] with the two indirect coding features to form two new multiple-parameter features, ‘CSHPVZ+B’ and ‘CSHPVZ+B+SB’. We showed that the feature ‘CSHPVZ+B+SB’ used together with an NN outperformed all single- and multiple-parameter features used by Ding and Dubchak [8] in terms of prediction accuracy rate for protein structure classification.
In this study, we start with eight features, ‘C’, ‘CS’, ‘CSH’, ‘CSHP’, ‘CSHPV’, ‘CSHPVZ’, ‘CSHPVZ+B’ and ‘CSHPVZ+B+SB’, to assign protein classes or folding patterns. Then, we use the method of data fusion for feature selection and combination in order to improve classification accuracy.
3.2.3 The HLA Computational Architecture
NNs have been commonly used in many machine learning and data mining applications, such as input-output mapping and bioinformatics [29], [30]. We use an NN as a multi-class classifier to build a hierarchical learning architecture (HLA) for protein structure prediction. The Multilayer Perceptron (MLP) and the Radial Basis Function Network (RBFN) are two popular NN models. The RBFN is a three-layer network with Gaussian functions and is well suited to classification [31], since its weights are measured and adjusted according to the distances between data points. It was shown in [9] that the overall prediction accuracy rate for protein structure classification using RBFN is better than that using MLP. Therefore, we adopted the RBFN model in this study, in which the single hidden layer and its nodes are generated automatically. The hidden-layer nodes represent the coordinates of the training sample clusters.
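To make the three-layer structure concrete, the sketch below shows only the forward pass of a small RBFN: each hidden node holds the coordinates of a cluster center and responds with a Gaussian of the distance to the input, and each output node is a weighted sum of the hidden activations. Training (how the centers and weights are found) is not shown. The centers, the width and the output weights are hypothetical toy values, and the function names are not taken from [9] or [31].

```python
import math


def gaussian_activations(x, centers, width):
    """Hidden layer: one Gaussian unit per cluster center.

    The activation is 1.0 when x sits on the center and decays with distance.
    """
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, center)) / (2.0 * width ** 2))
        for center in centers
    ]


def rbfn_scores(x, centers, width, output_weights):
    """Output layer: one weighted sum of hidden activations per class."""
    hidden = gaussian_activations(x, centers, width)
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]


# Hypothetical toy network: two cluster centers in 2-D, two classes.
centers = [[0.0, 0.0], [1.0, 1.0]]
output_weights = [
    [1.0, 0.0],  # class 0 responds to the first center
    [0.0, 1.0],  # class 1 responds to the second center
]
scores = rbfn_scores([0.9, 1.1], centers, width=0.5, output_weights=output_weights)
predicted_class = scores.index(max(scores))  # class with the largest score
```

Because the input lies close to the second center, the second hidden node dominates and the sketch assigns the input to class 1.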
The HLA framework, proposed in Huang et al. [9], consists of a two-level procedure. In the first level, a protein is classified into one of four classes by a multi-class classifier (classifier 1 in Figure 2). In the second level, it is further classified into one of f_i folding patterns by the corresponding multi-class classifier (f_1, f_2, f_3 and f_4 equal 6, 9, 9 and 3 for classifiers 1, 2, 3 and 4, respectively, in Figure 2). Huang et al. [9] showed that the HLA framework is an effective learning structure that reduces the number of classifiers, avoids the voting scheme, and directly indicates the reliability or confidence of the predicted result.
Our current study incorporates data fusion in HLA for the testing data set, as shown in Figure 2. For the training data set, HLA is used without data fusion. To predict which of the four classes a protein belongs to with HLA, we first use the eight individual features to assign a class to each protein in the testing data set. Then, data fusion is applied for feature selection and combination in order to improve protein class discrimination. Finally, the protein class is predicted with the combined feature. For the protein folding patterns associated with each protein class, the eight individual features are used once more to assign folding patterns to each protein in the class. Similarly, data fusion is applied again for feature selection and combination in order to improve the prediction of protein folding patterns.
Figure 2. The architecture of HLA using data fusion
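The two-level procedure described above can be sketched as follows. This is only an illustration of the control flow: the fusion rule used here (an element-wise average of the scores produced for each feature) is an assumption, no feature selection step is included, and the scorers are stand-ins for trained classifiers. All names (`fuse_scores`, `predict`, `peaked`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

FOLDS_PER_CLASS = [6, 9, 9, 3]  # f_1..f_4 as stated in the text

# A scorer maps one feature vector to one score per candidate label.
Scorer = Callable[[Sequence[float]], List[float]]


def fuse_scores(score_lists: List[List[float]]) -> List[float]:
    """ASSUMED fusion rule: element-wise average of the per-feature scores."""
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def predict(features, class_scorers, fold_scorers) -> Tuple[int, int]:
    """Level 1 picks the class; level 2 picks a fold within that class."""
    names = list(features)
    class_scores = fuse_scores([class_scorers[n](features[n]) for n in names])
    cls = argmax(class_scores)
    # Only the fold classifier belonging to the predicted class is consulted.
    fold_scores = fuse_scores([fold_scorers[cls][n](features[n]) for n in names])
    return cls, argmax(fold_scores)


def peaked(size: int, winner: int) -> Scorer:
    """Stand-in for a trained classifier: always votes for `winner`."""
    return lambda _vector: [1.0 if i == winner else 0.0 for i in range(size)]


# Three of the eight features, with placeholder vectors.
features = {"C": [0.1], "CS": [0.1, 0.2], "CSH": [0.1, 0.2, 0.3]}
class_scorers = {"C": peaked(4, 1), "CS": peaked(4, 1), "CSH": peaked(4, 0)}
fold_scorers = [
    {"C": peaked(n, 2), "CS": peaked(n, 2), "CSH": peaked(n, 0)}
    for n in FOLDS_PER_CLASS
]
cls, fold = predict(features, class_scorers, fold_scorers)
```

In this toy run two of the three features agree at each level, so the averaged scores select class index 1 and fold index 2 within that class. Note that only one second-level classifier is evaluated per protein, which reflects how the hierarchy avoids a voting scheme over all 27 folding patterns.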