
VI. Experiments

6.4 Analysis

The first observation from Table 6.2 is that for the RM-bound SVM, aggressively removing features usually performs no better than applying the RM-bound SVM once. This may suggest that the aggressive RM-bound SVM sometimes removes informative features. Furthermore, if we compare the two feature scaling methods with the other five feature selection methods, we see that in most cases the feature scaling methods provide only similar accuracy, although they do give a clear improvement on madelon, splice, and dna.

The second observation is that, among the five feature selection methods, no single method is the best for all data sets. For some problems they all select the whole set of features, so it is hard to judge them only from the testing accuracy of the selected feature subsets. If we consider the data sets with a large number of features, such as arcene, arrhythmia, dexter, dorothea, and madelon, we may conclude that the method using the F-score criterion for feature ranking is slightly better than the other four methods.
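To make the comparison above concrete, recall that the F-score of a feature (as defined in [8]) compares the separation between the class means with the within-class variances; features are then ranked by decreasing score. The following is a minimal NumPy sketch of this criterion, assuming a dense feature matrix and binary labels in {+1, -1}; the function name is ours, not from the thesis code:

```python
import numpy as np

def f_score(X, y):
    """F-score of each feature for a binary problem with labels in {+1, -1}.

    The score is large when the positive- and negative-class means of a
    feature lie far from its overall mean relative to the within-class
    variances, i.e., when the feature separates the two classes well.
    """
    pos, neg = X[y == 1], X[y == -1]
    mean_all = X.mean(axis=0)
    between = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    within = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)  # unbiased variances
    return between / (within + 1e-12)  # small guard for constant features
```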

Table 6.2: The testing accuracy obtained by each feature selection/scaling method. The numbers of features selected are also reported in parentheses for the five feature selection methods. Entries marked "--" are not available due to limitations of the programs.

Name        F-score          Chg. (dist.)     Chg. (SV)        w (Heiler)       Shuffle (prob.)  RM-bound  RM-bound (agg.)
arcene      84.00%  (2480)   81.00%  (1237)   83.00%  (4960)   80.00%  (1235)   83.00%  (4610)   --        --
arrhythmia  83.57%  (32)     77.14%  (56)     76.43%  (31)     77.14%  (56)     77.86%  (69)     79.29%    79.29%
covtype2    80.09%  (54)     80.09%  (54)     80.09%  (54)     80.09%  (53)     80.09%  (52)     78.47%    77.90%
dexter      91.67%  (242)    89.33%  (3829)   88.33%  (60)     88.67%  (3828)   85.33%  (62)     --        --
diabetes    76.67%  (8)      76.67%  (8)      76.67%  (8)      76.67%  (8)      76.00%  (6)      76.67%    76.67%
digits      100.00% (81)     100.00% (40)     99.80%  (5)      100.00% (40)     100.00% (6)      100.00%   100.00%
dorothea    93.14%  (1376)   92.57%  (669)    92.86%  (688)    92.57%  (669)    --               --        --
ijcnn       97.78%  (22)     97.78%  (22)     97.78%  (22)     97.78%  (22)     97.78%  (22)     95.64%    90.01%
image       97.72%  (18)     97.72%  (18)     97.72%  (18)     97.23%  (9)      97.72%  (18)     97.43%    82.28%
madelon     87.50%  (15)     87.83%  (15)     87.83%  (15)     87.83%  (15)     51.83%  (9)      92.00%    92.33%
ringnorm    98.51%  (20)     98.51%  (20)     98.51%  (20)     98.51%  (20)     98.51%  (20)     97.51%    97.51%
splice      93.10%  (7)      93.93%  (7)      93.93%  (7)      93.93%  (7)      93.06%  (6)      94.94%    92.41%
tree        88.49%  (18)     88.09%  (9)      88.49%  (18)     88.09%  (9)      88.45%  (17)     87.40%    87.40%
twonorm     97.11%  (20)     97.11%  (20)     97.11%  (20)     97.11%  (20)     97.23%  (19)     96.43%    96.43%
waveform    89.28%  (21)     89.28%  (21)     89.28%  (21)     89.28%  (21)     88.30%  (8)      89.52%    89.52%
dna         94.69%  (22)     95.19%  (45)     94.27%  (180)    94.27%  (180)    94.27%  (130)    95.28%    58.52%
protein     65.68%  (178)    65.46%  (356)    65.46%  (356)    65.46%  (356)    65.70%  (136)    60.90%    59.89%
satimage    90.30%  (36)     90.30%  (36)     90.30%  (36)     90.30%  (36)     90.30%  (36)     90.50%    90.50%
shuttle     99.93%  (4)      99.93%  (4)      99.93%  (4)      99.93%  (4)      99.92%  (9)      99.86%    99.86%
usps        95.27%  (256)    95.27%  (256)    95.27%  (256)    95.27%  (256)    95.27%  (256)    94.82%    94.82%

From Figure 6.1 to Figure 6.5, we can see that in general the trend of the CV accuracy against the number of features is roughly the same as the trend of the testing accuracy. This observation tells us that Algorithm 9 gives reasonable estimates of the selected feature size: if the CV accuracy of some feature size is higher, then so is the testing accuracy.
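Algorithm 9 appears earlier in the thesis and is not restated here; the sketch below only illustrates the CV-driven selection of the feature size that the paragraph above refers to. The candidate sizes (powers of two, matching the log-scaled axes of Figures 6.1-6.5) and the untuned RBF SVM are simplifying assumptions; in practice the C and gamma parameters would be selected first:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_feature_size(X, y, ranking, cv=5):
    """Pick the number of top-ranked features with the best CV accuracy.

    `ranking` lists feature indices from most to least important. CV
    accuracy serves as a proxy for testing accuracy when choosing the size.
    """
    n = X.shape[1]
    sizes = sorted({min(2 ** k, n) for k in range(int(np.log2(n)) + 2)})
    best_acc, best_size = -1.0, n
    for s in sizes:
        acc = cross_val_score(SVC(kernel="rbf"), X[:, ranking[:s]], y, cv=cv).mean()
        if acc > best_acc:
            best_acc, best_size = acc, s
    return best_size, best_acc
```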

As mentioned earlier, for many data sets all methods select the whole set of features, and we cannot tell whether the ranking is good or bad. Therefore, we instead examine the curves in those figures to see which feature ranking is better. However, the results still depend on the data set. For example, for usps and covtype2 random shuffling provides a good ranking while F-score does not; on the other hand, F-score performs well on arcene, dorothea, and protein. For data with a small number of features, we may conclude that F-score sometimes provides a slightly worse ranking than the other four strategies.

Since the method of [16] and the methods that observe the change of the decision values are quite similar, we also compare the performance of these three strategies. We find that for most data sets, such as dna, they have almost the same performance and there is no significant difference.

There is also an interesting observation for madelon when using random shuffling on features: its result is merely a random guess. We check the obtained values of the feature importance and find that they are extremely small, so there is almost no performance difference after shuffling. We conjecture that because the number of features is large (500 features) and the testing accuracy using all features is low (60%), there are too many outliers.
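The shuffling-based importance is defined in Chapter 5; the sketch below is one plausible instantiation of the idea for checking the values discussed above, measuring the drop in held-out accuracy after permuting one feature column at a time. The helper name and the use of plain accuracy (rather than, say, probability outputs) are our assumptions:

```python
import numpy as np

def shuffle_importance(model, X_val, y_val, seed=0):
    """Importance of feature j = accuracy drop after shuffling column j.

    On madelon these drops are all close to zero, so the induced ranking
    degenerates to noise, matching the random-guess behavior observed.
    """
    rng = np.random.default_rng(seed)
    base = np.mean(model.predict(X_val) == y_val)  # accuracy with intact features
    drops = np.empty(X_val.shape[1])
    for j in range(X_val.shape[1]):
        Xs = X_val.copy()
        Xs[:, j] = rng.permutation(Xs[:, j])       # destroy only feature j
        drops[j] = base - np.mean(model.predict(Xs) == y_val)
    return drops
```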

[Figure: four panel pairs for arcene, arrhythmia, covtype, and dexter. In each pair, the left panel plots CV estimates and the right panel testing accuracy against log(#features), with one curve per ranking method: F-score, Chg.(SV), Chg.(dist.), w(Heiler), and shuffle.]

Figure 6.1: Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.

[Figure: CV-estimate and testing-accuracy panel pairs for diabetes, digits, dorothea, and ijcnn; axes and legend as in Figure 6.1.]

Figure 6.2: Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.

[Figure: CV-estimate and testing-accuracy panel pairs for image, madelon, ringnorm, and splice; axes and legend as in Figure 6.1.]

Figure 6.3: Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.

[Figure: CV-estimate and testing-accuracy panel pairs for tree, twonorm, waveform, and dna; axes and legend as in Figure 6.1.]

Figure 6.4: Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.

[Figure: CV-estimate and testing-accuracy panel pairs for protein, satimage, shuttle, and usps; axes and legend as in Figure 6.1.]

Figure 6.5: Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.

Discussion and Conclusions

We have investigated several feature selection and feature scaling methods. The feature selection methods discussed in this thesis can be categorized into three kinds: the statistical score, the gradient of the decision values, and random shuffling on features. The first does not depend on the SVM, while the second and the third rely on its performance.
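As an illustration of the second kind, the sketch below ranks features by the average magnitude of the gradient of an RBF-kernel decision function evaluated at the support vectors. This is only a hedged reconstruction of the idea behind Sections 5.1-5.2, not the exact criterion used in the experiments, and the hyperparameter values are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

def rank_by_decision_gradient(X, y, C=1.0, gamma=0.1):
    """Rank features by the mean |df/dx_j| of the decision value at the SVs.

    For f(x) = sum_i a_i K(x_i, x) + b with K(u, v) = exp(-gamma ||u - v||^2),
    the gradient is df/dx(x) = -2 * gamma * sum_i a_i K(x_i, x) (x - x_i).
    """
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    sv = model.support_vectors_          # support vectors, shape (n_sv, d)
    a = model.dual_coef_.ravel()         # a_i = alpha_i * y_i
    importance = np.zeros(X.shape[1])
    for x in sv:                         # evaluate the gradient at each SV
        k = np.exp(-gamma * ((sv - x) ** 2).sum(axis=1))  # K(x_i, x)
        grad = -2.0 * gamma * ((a * k) @ (x - sv))
        importance += np.abs(grad)
    return np.argsort(-importance)       # feature indices, best first
```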

From the experiments we see that no method is best for all data sets, and sometimes there is no significant difference in performance. In addition, properties of the data sets may affect the performance. For example, on madelon random shuffling on features performs very badly, while all the other methods successfully select good features. It is therefore hard to single out the best feature selection method.

Besides the prediction performance, training time is also an issue. F-score is very simple and efficient to calculate. On the other hand, the approaches using the change of the decision values (Section 5.2.2) and the normal vector of the decision boundary (Section 5.1) have to train a good SVM model with parameter selection. The strategies that observe the change of the distribution and random shuffling on features take even more time. From the aspect of training time, these two methods are not practical when the number of features is large.


For the data sets used in this thesis we find that F-score performs well when the number of features is large, while for small data sets the two methods using the gradient of the decision values on support vectors are slightly better. The F-score method is also faster on large data. In conclusion, the F-score criterion is good enough for feature selection; if the data set is not too large and the training time is acceptable, the methods that use the gradient of the decision values on support vectors can also be considered.

Bibliography

[1] R. R. Bailey, E. J. Pettit, R. T. Borochoff, M. T. Manry, and X. Jiang. Automatic recognition of USGS land use/cover categories using statistical and neural network classifiers. In SPIE OE/Aerospace and Remote Sensing, Bellingham, WA, 1993. SPIE.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, Irvine, CA, 1998. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

[3] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[5] C.-C. Chang and C.-J. Lin. IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE, 2001.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–159, 2002.

[8] Y.-W. Chen and C.-J. Lin. Combining SVMs with various feature selection strategies. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, 2004.

[9] W. Chu, S. Keerthi, and C. Ong. Bayesian trigonometric support vector classifier. Neural Computation, 15(9):2227–2254, 2003.

[10] K.-M. Chung, W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural Computation, 15:2643–2681, 2003.

[11] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002.


[12] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[13] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[14] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[15] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.

[16] M. Heiler, D. Cremers, and C. Schnörr. Efficient feature subset selection for support vector machines. Technical Report 21, Department of Mathematics and Computer Science, Computer Vision, Graphics, and Pattern Recognition Group, University of Mannheim, D-68131 Mannheim, Germany, 2001.

[17] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[18] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of International Conference on Machine Learning, 1999.

[19] G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In International Conference on Machine Learning, pages 121–129, 1994. Journal version in AIJ, available at http://citeseer.nj.nec.com/13663.html.

[20] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

[21] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

[23] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2/3:18–22, December 2002.

[24] C.-J. Lin. A Guide to Support Vector Machines.

[25] C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.

[26] H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[27] A. K. McCallum, R. Rosenfeld, T. M. Mitchell, and A. Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In J. W. Shavlik, editor, Proceedings of ICML-98, 15th International Conference on Machine Learning, pages 359–367, Madison, US, 1998. Morgan Kaufmann Publishers, San Francisco, US.

[28] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.

[29] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[30] S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

[31] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

[32] D. Prokhorov. IJCNN 2001 neural network competition. Slide presentation in IJCNN'01, Ford Research Laboratory, 2001. Available at http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.

[33] G. Rätsch. Benchmark data sets, 1999. Available at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

[34] V. Svetnik, A. Liaw, C. Tong, and T. Wang. Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. In F. Roli, J. Kittler, and T. Windeatt, editors, Proceedings of the 5th International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science vol. 3077, pages 334–343. Springer, 2004.

[35] V. Svetnik, A. Liaw, C. Tong, and T. Wang. Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. In Multiple Classifier Systems, pages 334–343, 2004.

[36] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

[37] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, 2000.
