

IV. Protein Fold Recognition

4.3 Multi-class Methodologies for Protein Fold Classification

Remember that we have 27 folds of data, so we have to solve a multi-class classification problem. Currently, two approaches are commonly used for combining binary SVM classifiers to perform a multi-class prediction. One is the “one-against-one” method (See Chapter 2.2.2), where k(k−1)/2 classifiers are constructed and each one is trained on data from two different classes.

Table 4.1: Non-redundant subset of 27 SCOP folds used in training and testing

Fold                                  Index   # Training data   # Test data

α
Globin-like                               1                13             6
Cytochrome c                              3                 7             9
DNA-binding 3-helical bundle              4                12            20
4-helical up-and-down bundle              7                 7             8
4-helical cytokines                       9                 9             9
EF-hand                                  11                 7             9

β
Immunoglobulin-like β-sandwich           20                30            44
Cupredoxins                              23                 9            12
Viral coat and capsid proteins           26                16            13
ConA-like lectins/glucanases             30                 7             6
SH3-like barrel                          31                 8             8
OB-fold                                  32                13            19
Trefoil                                  33                 8             4
Trypsin-like serine proteases            35                 9             4
Lipocalins                               39                 9             7

α/β
(TIM)-barrel                             46                29            48
FAD (also NAD)-binding motif             47                11            12
Flavodoxin-like                          48                11            13
NAD(P)-binding Rossmann-fold             51                13            27
P-loop containing nucleotide             54                10            12
Thioredoxin-like                         57                 9             8
Ribonuclease H-like motif                59                10            14
Hydrolases                               62                11             7
Periplasmic binding protein-like         69                11             4

α+β
β-grasp                                  72                 7             8
Ferredoxin-like                          87                13            27
Small inhibitors, toxins, lectins       110                14            27

Table 4.2: Six parameter sets extracted from protein sequence

Symbol   Parameter set                       Dimension

C        Amino acids composition                    20
S        Predicted secondary structure              21
H        Hydrophobicity                             21
V        Normalized van der Waals volume            21
P        Polarity                                   21
Z        Polarizability                             21

Another approach for multi-class classification is the “one-against-all” method (See Chapter 2.2.1), where k SVM models are constructed and the ith SVM is trained with the data in the ith class as positive and all other data as negative. A comparison of the two methods for multi-class SVM is given in [31].
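As an aside, the two strategies can be sketched with off-the-shelf SVM wrappers. The snippet below is only an illustration under assumed placeholder data (X_train, y_train, and the RBF parameters are made up); it is not the implementation used in this thesis.

# Illustrative sketch of the two multi-class strategies on placeholder data.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((300, 125))            # placeholder: 125-dim CSHVPZ features
y_train = rng.integers(0, 27, 300)          # placeholder: labels for 27 folds

# "one-against-one": k(k-1)/2 = 27*26/2 = 351 binary SVMs,
# each trained only on the data of two folds.
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=10, gamma=0.01)).fit(X_train, y_train)

# "one-against-all": k = 27 binary SVMs; the i-th is trained with
# fold i as positive and all remaining folds as negative.
ova = OneVsRestClassifier(SVC(kernel="rbf", C=10, gamma=0.01)).fit(X_train, y_train)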

After analyzing our data, we find that the number of proteins in each fold is quite small (7∼30 for the training set). If the “one-against-one” method is used, some binary classifiers may be trained on as few as 14 data points. This may introduce more noise because all possible pairs of binary classifiers are involved. In contrast, if the “one-against-all” method is used, we have more examples (the whole training set) for each binary classifier to learn from.

Meanwhile, we observed interesting results from [80], where the authors perform molecular classification of multiple tumor types. Their data set contains only 190 samples grouped into 14 classes. They found that, for both cross validation and the independent test set, “one-against-all” achieves better performance. The authors conclude that the reason is that each binary classifier in the “one-against-all” method has more examples than in the “one-against-one” method. In our multi-class fold prediction problem we face the same situation: many classes but few data. Therefore, in our implementation, we mainly consider the “one-against-all” method to generate binary classifiers for multi-class prediction.

Note that according to Ding and Dubchak [18], using multiple parameter sets and applying a majority vote on the results leads to much better prediction accuracy. Thus, in our study we build 15 encoding schemes from the six parameter sets. For the first six coding schemes, each of the six parameter sets (C, S, H, V, P, Z) is used on its own.

After some experiments, the combinations CS, HZ, SV, CSH, VPZ, HVP, CSHV, and CSHVPZ are chosen as another eight coding schemes. Note that they have different dimensionalities. For the combination CS, there are 41 (20+21) dimensions. Similarly, HZ and SV both have 42 (21+21) dimensions, and CSH, VPZ, HVP, CSHV, and CSHVPZ have 62, 63, 63, 83, and 125 dimensions, respectively.
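To make the dimension counting concrete, a coding scheme is simply the concatenation of the chosen parameter-set vectors. The sketch below uses dummy zero vectors with the dimensions from Table 4.2; the encode helper is hypothetical, not part of the actual feature-extraction code.

import numpy as np

# Dummy per-protein feature vectors; dimensions follow Table 4.2.
features = {
    "C": np.zeros(20),   # amino acids composition
    "S": np.zeros(21),   # predicted secondary structure
    "H": np.zeros(21),   # hydrophobicity
    "V": np.zeros(21),   # normalized van der Waals volume
    "P": np.zeros(21),   # polarity
    "Z": np.zeros(21),   # polarizability
}

def encode(scheme):
    # Concatenate the parameter sets named in `scheme`, e.g. "CSH".
    return np.concatenate([features[s] for s in scheme])

print(encode("CS").shape)        # (41,)
print(encode("CSHV").shape)      # (83,)
print(encode("CSHVPZ").shape)    # (125,)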

As we have 27 protein folds, for each encoding scheme there are 27 binary classifiers if the “one-against-all” strategy is used. Since we have 14 coding schemes, in total we train 14 × 27 = 378 “one-against-all” binary classifiers. Following [22], if a protein is classified as “positive,” we assign a vote to that class. If a protein is classified as “negative,” the probability that it belongs to any one of the other 26 classes is only 1/26. If we still assigned it to one of the other 26 classes, the misclassification rate could be very high. Thus, these proteins are not assigned to any class.

In our coding schemes, if any of the 14 × 27 “one-against-all” binary classifiers assigns a protein sequence to a fold class, that class gets a vote. Therefore, with the 14 coding schemes based on the above “one-against-all” strategy, each fold (class) receives zero to 14 votes. However, we found that after this procedure some proteins may not have any vote for any fold. For example, among the 385 data of the independent test set, using the parameter set “composition” only, 142 are classified as positive by some binary classifiers. If they are assigned to the corresponding folds, 126 are correctly predicted, an accuracy rate of 88.73%. The remaining 243 data are not assigned to any fold, so their status is still unknown. Results of using the 14 coding schemes are shown in Table 4.3. Although in the worst case a protein may be assigned to all 27 folds, in practice most input proteins obtain no more than one vote.
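The vote bookkeeping described above can be summarized in a short sketch; it assumes a 14 × 27 array of 0/1 outputs (one row per coding scheme, one column per fold) for a single protein, and is only an illustration of the rule, not the code used here.

import numpy as np

def ova_votes(binary_outputs):
    # binary_outputs: shape (14, 27); entry [s, f] is 1 if the
    # "one-against-all" classifier of fold f under coding scheme s
    # labels the protein as positive, else 0.
    return binary_outputs.sum(axis=0)      # per-fold votes, each in 0..14

def tentative_fold(binary_outputs):
    votes = ova_votes(binary_outputs)
    if votes.sum() == 0:
        return None                        # no vote at all: fold left unassigned
    return int(np.argmax(votes))           # otherwise the most-voted fold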

After using the above 14 coding schemes, there are still some proteins whose folds are not assigned. Since the “one-against-one” SVM classifier uses the so-called “Max Wins” strategy (See Chapter 2.2.2), after the testing procedure every protein must be assigned to a fold (class). Therefore, we use the best “one-against-one” method as the 15th coding scheme and combine it with the above 14 “one-against-all” results using a voting scheme to get the final prediction.

Here we use the same “one-against-one” method as in Ding and Dubchak [18]. For example, a combination C+H means we separately perform the “one-against-one” method on the two parameter sets C and H, and then combine the votes obtained from the two parameter sets to decide the winner.
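As a rough illustration, the sketch below shows the “Max Wins” rule over a set of pairwise classifiers and one way its single prediction could be merged with the 14 “one-against-all” vote counts as a 15th vote; the weighting and tie-breaking here are assumptions, not a specification of the thesis implementation.

import numpy as np

def max_wins(pairwise_winners, k=27):
    # pairwise_winners: iterable of fold indices, one winner per
    # "one-against-one" classifier (k*(k-1)/2 of them in total).
    wins = np.zeros(k, dtype=int)
    for fold in pairwise_winners:
        wins[fold] += 1
    return int(np.argmax(wins))            # "Max Wins" always returns some fold

def final_fold(ova_vote_counts, pairwise_winners):
    votes = np.asarray(ova_vote_counts, dtype=int).copy()
    votes[max_wins(pairwise_winners)] += 1 # the 15th scheme adds one vote
    return int(np.argmax(votes))           # every protein now gets a fold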

The best result we find is the combined C+S+H+V parameter sets, where the average accuracy reaches 58.2%. This is slightly above the 55.5% accuracy reported by Ding and Dubchak and their best result of 56.5% using C+S+H+P. Figure 4.1 shows the overall structure of our method.

Before constructing each SVM classifier, we first conduct cross validation with different parameters on the training data. The best parameters C and γ selected in this way are shown in Tables A.1 and A.2.
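A minimal sketch, assuming an RBF kernel and a standard grid search, of the kind of parameter selection described here; the grid values and data are placeholders, not the settings that produced Tables A.1 and A.2.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((200, 125))          # placeholder feature matrix
y_train = rng.integers(0, 2, 200)         # placeholder binary labels ("one-against-all")

param_grid = {
    "C": [2.0**p for p in range(-5, 11, 2)],        # illustrative grid of C values
    "gamma": [2.0**p for p in range(-15, 1, 2)],    # illustrative grid of gamma values
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)                # best (C, gamma) found by cross validation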
