
Data sets are roughly separated into two categories. Table 5.1 lists the statistics of the large-scale data sets; among these, mnist is scaled. Table 5.2 lists the UCI/Statlog data sets. Of the UCI/Statlog data sets, dna, satimage, letter and shuttle provide separate training and testing sets. The others do not, so each of them is divided by an 80/20 split into a training set and a testing set.
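The 80/20 split mentioned above can be sketched in Python as follows. This is only an illustration: the text does not say whether the split is random or stratified, so the seeded shuffle and the function name split_80_20 are assumptions.

```python
import random


def split_80_20(instances, seed=0):
    """Shuffle the instances and return (train, test) with an 80/20 split.

    Assumption: a plain random shuffle with a fixed seed; the source does
    not describe how the split was actually drawn.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic: first 80% for training
    return shuffled[:cut], shuffled[cut:]


# iris has 150 instances (Table 5.2), so the split is 120 / 30.
train, test = split_80_20(range(150))
print(len(train), len(test))
```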


Table 5.1: The summary of data statistics. The second column shows the number of classes/labels in the data set. n is the number of features. The last two columns are the number of training and testing data respectively.

Data set    # classes        n  # training  # testing
news20             20   62,061      15,935      3,993
mnist              10      780      11,982      1,984
covtype             7       54     464,810    116,202
rcv1               53   47,236     518,571     15,564
sector            105   55,197       6,412      3,207
vehicle             3      100      78,823     19,705

Table 5.2: The summary of UCI/Statlog data statistics. The second column shows the number of classes/labels in the data set. n is the number of features. The last column is the number of data instances.

Data set    # classes     n  # instances
iris                3     4          150
wine                3    13          178
glass               6     9          214
vowel              11    10          528
segment             7    19        2,310
dna                 3   180        2,000
satimage            6    36        4,435
letter             26    16       15,000
shuttle             7     9       43,500

All the data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

5.2 Setting

We compare the following six implementations.

1. LR: Logistic regression solved with a trust region Newton method (Lin et al., 2008). We use the implementation in LIBLINEAR 2.91 with option -s 0.

2. L2-SVM: L2-loss support vector machine solved with a coordinate descent method. The SVM uses an L2-regularization term. The option in LIBLINEAR is -s 2.

3. L1-SVM: L1-loss support vector machine solved with a coordinate descent method. It also uses an L2-regularization term. The option in LIBLINEAR is -s 3.

4. CS: The Crammer and Singer method solved with a coordinate descent method. The option in LIBLINEAR is -s 4.

5. ME dual: The ME dual implementation using the original feature vectors.

6. ME dual-2: An implementation similar to ME dual, but using reduced feature vectors.
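The LIBLINEAR options listed above can be collected in a small Python helper that builds the command line for LIBLINEAR's train program. This is a sketch only: the helper name and the file paths are hypothetical, and ME dual and ME dual-2 are left out because the list gives no LIBLINEAR option for them.

```python
# Solver options exactly as listed above (-s 0, 2, 3 and 4).
SOLVER_OPTION = {"LR": 0, "L2-SVM": 2, "L1-SVM": 3, "CS": 4}


def train_command(method, C, data_path, model_path):
    """Build the argument list: train -s <type> -c <C> <data> <model>.

    data_path and model_path are hypothetical file names supplied by the caller.
    """
    return [
        "train",
        "-s", str(SOLVER_OPTION[method]),
        "-c", str(C),
        data_path,
        model_path,
    ]


print(" ".join(train_command("CS", 0.5, "sector.train", "sector.model")))
```

The returned list could be passed to subprocess.run on a machine where the train binary is available on the path.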

For binary solvers such as LR, L2-SVM and L1-SVM, we implement both the one-against-one approach, denoted as “1-1,” and the one-against-all approach, denoted as “1-all.”
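How the two approaches turn binary sub-models into a multi-class prediction can be sketched as follows. The sketch assumes the common rules (largest decision value for one-against-all, majority voting for one-against-one); the text does not state which prediction rule the implementations use, and the function names and input layouts are illustrative.

```python
from itertools import combinations


def predict_one_against_all(scores):
    """scores[c] is the decision value of the 'class c vs. the rest' sub-model.

    Assumed rule: predict the class with the largest decision value.
    """
    return max(scores, key=scores.get)


def predict_one_against_one(pairwise_winner, classes):
    """pairwise_winner[(a, b)] is the class chosen by the 'a vs. b' sub-model.

    Assumed rule: majority voting; ties go to the class listed first.
    """
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        votes[pairwise_winner[pair]] += 1
    return max(classes, key=lambda c: votes[c])


classes = ["a", "b", "c"]
print(predict_one_against_all({"a": -0.3, "b": 1.2, "c": 0.4}))
print(predict_one_against_one(
    {("a", "b"): "b", ("a", "c"): "c", ("b", "c"): "b"}, classes))
```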

We search for the best parameters with cross-validation. C is selected from {2^{-5}, 2^{-4}, . . . , 2^{3}}. A larger C leads to much more training time without much improvement in testing accuracy. The default value of  is chosen as 0.1. For the stopping condition, ME dual uses the relative function difference (5.1).

The procedure stops when (5.1) is less than 10^{-8}.
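The search over C described in this section can be sketched in Python. The grid follows the text; the selection rule (highest cross-validation accuracy, ties going to the smaller C) and the callable cv_accuracy are assumptions made for illustration, since the text does not give the number of folds or a tie-breaking rule.

```python
# The grid from the text: 2^-5, 2^-4, ..., 2^3 (nine values).
C_GRID = [2.0 ** e for e in range(-5, 4)]


def best_C(cv_accuracy, grid=C_GRID):
    """Return the C in the grid with the highest cross-validation accuracy.

    cv_accuracy is a hypothetical callable mapping C to an accuracy value.
    The grid is ascending and max() keeps the first maximum, so ties go
    to the smaller C.
    """
    return max(grid, key=cv_accuracy)


# Toy accuracy curve that peaks at C = 0.5, for illustration only.
print(best_C(lambda C: -abs(C - 0.5)))
```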

5.3 Comparison

The cross-validation results are in Table 5.3. For LR, L2-SVM and L1-SVM, only the data sets news20, sector and iris have higher accuracies with the one-against-all approach. With LR and L1-SVM, the two approaches perform similarly on vehicle and wine, and with L2-SVM the difference between the two approaches is small on wine and dna. One can also see that Crammer and Singer attains the highest accuracy on sector, wine and shuttle, although on wine the accuracies of all classifiers are almost the same.

Table 5.4 gives the testing accuracy. LR, L2-SVM and L1-SVM perform better with the one-against-all approach on news20, sector and wine. LR also has higher accuracy with the one-against-all approach on iris. Crammer and Singer outperforms the other classifiers on sector and shuttle. The difference in accuracy among Crammer and Singer, ME dual and ME dual-2 is less than one percentage point on most data sets. Crammer and Singer outperforms the two ME implementations on sector, iris, vowel, letter and shuttle, but is worse on glass and satimage. For segment and dna, the three implementations perform similarly. We also observe that L1-SVM and L2-SVM outperform the other classifiers most of the time.

The best C values of each classifier on every data set are listed in Table 5.5. The data sets news20, wine and dna favor small C values, while large C values are preferred for vehicle, segment, letter and shuttle.

Almost all the classifiers take more training time with the one-against-all approach. LR, however, spends more training time on sector with the one-against-one approach, and the same happens to L1-SVM on covtype. One can see that Crammer and Singer spends the least time on mnist, rcv1 and sector. Among the three “all-together” implementations, Crammer and Singer takes less time than the other two on all data sets. The training times are in Table 5.6. In testing, all classifiers spend more time with the one-against-one models. The testing times are in Table 5.7.


Table 5.3: Cross-validation accuracy of each classifier on each data set. For each data set, the highest accuracy is bold-faced.

Data set        LR      LR  L2-SVM  L2-SVM  L1-SVM  L1-SVM      CS      ME      ME
             1-all     1-1   1-all     1-1   1-all     1-1    dual    dual  dual-2
news20       83.44   81.63   83.23   80.73   82.96   80.23   82.24   82.22   81.69
covtype      71.50   72.49   71.26   72.52   71.15   72.67   72.43   72.44   72.43
mnist        91.24   93.72   90.95   93.99   91.28   94.10   92.37   92.06   92.04
rcv1         92.93   93.27   92.93   93.42   92.81   93.43   93.08   93.04   93.02
sector       92.20   90.52   93.22   91.52   93.75   91.56   94.01   92.00   92.19
vehicle      80.29   80.40   80.04   80.34   80.18   80.59   80.28   80.32   80.32
iris         88.33   85.83   88.33   86.67   88.33   86.67   86.67   85.83   85.83
wine         97.20   97.20   97.90   97.20   97.90   97.90   97.90   96.50   95.80
glass        56.98   59.30   58.14   59.88   54.65   58.72   54.07   58.72   58.72
vowel        48.23   68.32   49.41   76.12   46.81   72.34   58.16   58.63   58.63
segment      91.99   93.99   91.99   95.13   92.26   94.91   94.21   93.99   93.99
dna          94.14   93.93   93.79   93.57   93.64   94.21   93.71   93.71   93.79
satimage     83.92   87.66   83.18   87.76   82.18   87.89   86.44   86.53   86.53
letter       68.01   81.13   66.63   82.60   63.34   82.53   75.94   74.98   74.98
shuttle      92.77   96.08   91.96   96.05   93.65   97.10   97.15   95.99   95.99

Table 5.4: Testing accuracy of each classifier on each data set. For each data set, the highest accuracy is bold-faced.

Data set        LR      LR  L2-SVM  L2-SVM  L1-SVM  L1-SVM      CS      ME      ME
             1-all     1-1   1-all     1-1   1-all     1-1    dual    dual  dual-2
news20       84.47   82.79   84.72   82.07   84.24   82.07   83.62   83.67   83.35
covtype      71.67   72.62   71.34   72.65   71.19   72.79   72.52   72.59   72.59
mnist        91.83   94.41   91.67   94.47   92.01   94.45   92.93   92.64   92.64
rcv1         92.10   92.41   92.02   92.47   91.98   92.55   92.30   92.33   92.25
sector       92.70   91.61   93.92   92.55   94.08   92.55   94.36   92.64   92.67
vehicle      80.51   80.62   80.17   80.35   80.38   80.70   80.44   80.44   80.44
iris         93.33   90.00   90.00   93.33   83.33   93.33   90.00   86.67   86.67
wine         97.14   94.29   97.14   94.29   97.14   94.29   94.29   94.29   94.29
glass        66.67   73.81   64.29   73.81   61.90   66.67   64.29   69.05   69.05
vowel        44.76   74.29   45.71   78.10   37.14   77.14   56.19   51.43   51.43
segment      90.04   91.77   89.83   93.29   90.69   93.07   91.99   91.13   91.13
dna          94.35   93.68   94.44   94.01   93.42   93.42   94.18   94.18   94.01
satimage     81.75   85.55   81.05   85.85   79.90   86.15   83.95   84.45   84.45
letter       68.00   81.60   66.34   82.92   62.76   83.38   76.78   75.18   75.18
shuttle      92.96   96.23   92.24   96.19   93.83   97.30   97.39   96.09   96.09

Table 5.5: The best C of each classifier on each data set.

Data set        LR      LR  L2-SVM  L2-SVM  L1-SVM  L1-SVM      CS      ME      ME
             1-all     1-1   1-all     1-1   1-all     1-1    dual    dual  dual-2
news20      2^{-1}  2^{-2}  2^{-5}  2^{-5}  2^{-5}  2^{-5}  2^{-5}  2^{-3}  2^{-4}

Table 5.6: Training time of each classifier on each data set.

Data set        LR      LR  L2-SVM  L2-SVM  L1-SVM  L1-SVM      CS      ME      ME
             1-all     1-1   1-all     1-1   1-all     1-1    dual    dual  dual-2
news20       74.15   31.65   29.97   25.09    2.34    0.84   18.73  273.16  268.69
covtype      70.37   44.06   32.18   17.45   58.07   87.78  192.30 1151.14 1288.89
mnist        75.67   42.87   27.89   17.98   60.65    4.44    3.36   76.84  114.07
rcv1       1701.79  906.45  527.28  462.41  140.58  109.51   40.16 1396.86       ∞
sector       39.39   73.51   24.81  129.20    5.45    4.18    2.97   58.11  103.96
vehicle      19.44   13.75   14.19    9.68   23.61   13.13   24.28   31.33   41.06
glass         0.01    0.00    0.00    0.00    0.00    0.00    0.00    0.01    0.01
vowel         0.00    0.02    0.00    0.01    0.00    0.01    0.03    0.13    0.13
segment       0.06    0.03    0.05    0.00    0.11    0.05    0.16    1.41    1.61
dna           0.03    0.02    0.02    0.01    0.01    0.00    0.01    0.04    0.05
satimage      0.15    0.10    0.06    0.04    0.34    0.02    0.20    1.47     1.8
letter        1.24    0.93    0.51    0.44    0.86    0.59    1.43   14.84   16.98
shuttle       0.91    0.52    0.31    0.17    0.26    0.24    0.18    1.89    2.06


Table 5.7: Testing time of each classifier on each data set.

Data set        LR      LR  L2-SVM  L2-SVM  L1-SVM  L1-SVM      CS      ME      ME
             1-all     1-1   1-all     1-1   1-all     1-1    dual    dual  dual-2
news20        0.01    0.14    0.01    0.21    0.03    0.14    0.01    0.02    0.02
covtype       0.04    0.09    0.05    0.07    0.03    0.06    0.06    0.11    0.05
mnist         0.06    0.07    0.02    0.08    0.03    0.11    0.01    0.08    0.00
rcv1          0.11    3.17    0.08    3.19    0.09    3.21    0.13    0.08    0.13
sector        0.17    7.43    0.15    7.40    0.16    7.46    0.12    0.25    0.09
vehicle       0.02    0.02    0.03    0.03    0.00    0.03    0.04    0.03    0.03
glass         0.01    0.00    0.00    0.00    0.00    0.00    0.00    0.01    0.01
vowel         0.00    0.02    0.00    0.01    0.00    0.01    0.03    0.13    0.13
segment       0.06    0.03    0.05    0.00    0.11    0.05    0.16    1.41    1.61
dna           0.03    0.02    0.02    0.01    0.01    0.00    0.01    0.04    0.05
satimage      0.15    0.10    0.06    0.04    0.34    0.02    0.20    1.47     1.8
letter        1.24    0.93    0.51    0.44    0.86    0.59    1.43   14.84   16.98
shuttle       0.91    0.52    0.31    0.17    0.26    0.24    0.18    1.89    2.06

Discussion and Conclusion

From the experiments, one can observe that the one-against-one approach reaches higher accuracy most of the time. If there are k classes in the data set, the one-against-all approach trains k sub-models, whereas the one-against-one approach trains k(k − 1)/2 sub-models. The amount of data used to train each sub-problem is smaller with the one-against-one approach than with the one-against-all approach, because each one-against-one sub-problem involves only the instances of two classes.
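The sub-model counts k and k(k − 1)/2 can be checked with a few lines of Python. The class counts come from Tables 5.1 and 5.2; the function name is illustrative.

```python
def num_submodels(k):
    """Number of binary sub-models for k classes under each approach."""
    return {"1-all": k, "1-1": k * (k - 1) // 2}


# Class counts taken from Tables 5.1 and 5.2.
for name, k in [("iris", 3), ("covtype", 7), ("news20", 20), ("sector", 105)]:
    counts = num_submodels(k)
    print(name, counts["1-all"], counts["1-1"])
# iris 3 3
# covtype 7 21
# news20 20 190
# sector 105 5460
```

For sector, the one-against-one approach needs 5,460 sub-models against 105 for one-against-all, which illustrates how quickly the gap grows with the number of classes.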

The difference in the number of sub-models between the two approaches becomes large when there are many classes. Therefore the time and memory required become a crucial concern when training on data with many classes. The one-against-one approach leads to higher accuracy because more sub-models are obtained for each class, but it also results in longer testing time because more sub-models must be evaluated for each instance. Overall, the one-against-one approach takes less time in training, but when the number of classes is very large, as in sector, the situation is different. One should therefore make a trade-off among these factors.

For the “all-together” methods, the data sets in this work are not particularly favorable. Overall, ME dual and ME dual-2 perform a large number of logarithm computations, which take more time than regular arithmetic operations. ME dual-2 uses reduced feature vectors and outputs a highly sparse solution, which leads to less testing time. However, it spends much time looking up the table of indices of the reduced feature vectors, which results in more training time than ME dual. Although ME is widely used in NLP applications, it shows little advantage in these experiments. Nevertheless, the feature reduction approach can be studied further, so that the technique can be applied to other models with the one-against-all approach.

A. L. Berger, V. J. Della Pietra, and S. A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwriting digit recognition. In International Conference on Pattern Recognition, pages 77–87. IEEE Computer Society Press, 1994.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Computational Learning Theory, pages 35–46, 2000.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

F.-L. Huang, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Iterative scaling and coordinate descent methods for maximum entropy. In Proceedings of the 47th Annual Meeting of the Association of Computational Linguistics (ACL), 2009. Short paper.


T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, second edition, 2008.

S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A sequential dual method for large scale multi-class linear SVMs. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 408–416, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/sdm_kdd.pdf.

S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.

E. Mayoraz and E. Alpaydin. Support vector machines for multi-class classification. In IWANN (2), pages 833–842, 1999. URL http://citeseer.nj.nec.com/mayoraz98support.html.

R. Memisevic. Dual optimization of conditional probability models. Technical report, Department of Computer Science, University of Toronto, 2006.

R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004. ISSN 1533-7928.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, 1998.

H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, October 2011. URL http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf.
