In Chapter 2.5 we discussed two multi-class methods: one-against-all and one-against-one. As the latter trains k(k − 1)/2 classifiers, where k is the number of classes, we may think that one-against-one takes more training time. In fact, this statement may not be right. Assume the time for training one two-class problem (i.e., solving the dual problem (6.1)) is O((·)^d), where (·) is the number of variables. Then the training times of the two multi-class approaches are compared in Table 6.9.1. Each one-against-all problem uses all l training instances, whereas each one-against-one problem uses on average only 2l/k of them. Clearly, if d ≥ 2, the one-against-one approach has a shorter training time.
Table 6.9.1: Training complexity of two multi-class approaches

Method                           one-against-all   one-against-one
Average size per two-class SVM   l                 2l/k
# two-class SVMs                 k                 k(k − 1)/2
Training time                    k O(l^d)          (k(k − 1)/2) O((2l/k)^d)
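To make the comparison in Table 6.9.1 concrete, the following minimal sketch evaluates both training-time expressions with the constant hidden in O(·) set to one. The function names and the sample values of l, k, and d are illustrative assumptions, not taken from the text.

```python
def one_against_all_cost(l: int, k: int, d: float) -> float:
    """k two-class problems, each with l variables: k * l^d."""
    return k * l ** d


def one_against_one_cost(l: int, k: int, d: float) -> float:
    """k(k-1)/2 two-class problems, each with about 2l/k variables."""
    return k * (k - 1) / 2 * (2 * l / k) ** d


if __name__ == "__main__":
    l = 10_000  # hypothetical number of training instances
    for k in (3, 10, 100):
        for d in (1, 2, 3):
            ratio = one_against_one_cost(l, k, d) / one_against_all_cost(l, k, d)
            print(f"k={k:3d} d={d}: one-against-one / one-against-all = {ratio:.4f}")
```

Under these assumptions the ratio is (k − 1)/2 · (2/k)^d, which does not depend on l. For d = 2 it equals 2(k − 1)/k², so the advantage of one-against-one grows with the number of classes.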
To discuss the testing time, we first look at the decision function of two-class SVMs (2.10). Kernel evaluations are the main cost; they are followed by some multiplications and additions of the kernel values. For multi-class SVMs, a training instance may be a support vector of several two-class SVMs. There is no need to repeat the same kernel evaluation several times, so we conduct all such evaluations once at the beginning, before calculating the decision values. Therefore, if the percentage of training data that become support vectors in the k and k(k − 1)/2 two-class SVMs is about the same and k is not too large, the two multi-class approaches have similar testing time.
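The idea of evaluating each kernel value only once can be sketched as follows for one-against-one voting. This is an illustrative sketch, not the implementation of any particular library: the data layout (a shared list of support vectors plus, per classifier, a map from support-vector index to y_i α_i), the names, and the toy model are all assumptions, and the decision function is assumed to have the usual form Σ y_i α_i K(x_i, x) + b.

```python
import math
from collections import Counter


def rbf(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_one_against_one(x, support_vectors, classifiers, gamma):
    """Vote among pairwise classifiers, evaluating each kernel value once.

    support_vectors: union of the support vectors of all two-class SVMs.
    classifiers: list of (class_pos, class_neg, coef, b), where coef maps an
        index into support_vectors to y_i * alpha_i for that classifier.
    """
    # One kernel evaluation per distinct support vector, shared by all classifiers.
    kvalues = [rbf(sv, x, gamma) for sv in support_vectors]

    votes = Counter()
    for class_pos, class_neg, coef, b in classifiers:
        # Only multiplications and additions remain for each decision value.
        decision = sum(c * kvalues[i] for i, c in coef.items()) + b
        votes[class_pos if decision > 0 else class_neg] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy model: three classes, one support vector per class.
TOY_SVS = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
TOY_CLASSIFIERS = [
    (0, 1, {0: 1.0, 1: -1.0}, 0.0),
    (0, 2, {0: 1.0, 2: -1.0}, 0.0),
    (1, 2, {1: 1.0, 2: -1.0}, 0.0),
]

if __name__ == "__main__":
    print(predict_one_against_one([1.9, 0.1], TOY_SVS, TOY_CLASSIFIERS, gamma=1.0))
```

In the toy model every support vector appears in two of the three classifiers, so sharing the kernel values saves half of the kernel evaluations compared with evaluating each classifier independently.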
6.10 Notes
Some early work on decomposition methods includes, for example, (Osuna et al., 1997; Joachims, 1998; Platt, 1998). The idea of using only two elements for the working set is from the Sequential Minimal Optimization (SMO) method by Platt (1998). The working set selection discussed here is indeed a special version of those discussed in (Joachims, 1998; Keerthi et al., 2001).
6.11 Exercises
1. In this chapter, we use a more complicated version of Theorem 5.2.8. Prove it by following the same derivation as in the proof of that theorem.
2. Consider the following SVM formula:
$\min_{w,b,\xi}$
How will you change the derivation in Chapter 6.4?
3. Prove that the working set selection in Chapter 6.3 is equivalent to solving the following problem: zero. The constraint (6.29) implies that a descent direction involving only two variables is obtained. Then the components of α with non-zero d_t are included in the working set B, which is used to construct the sub-problem (6.2). Note that d is used only for identifying B, not as a search direction.
4. Consider $x_1, x_2, x_3$ with $\|x_1 - x_2\| = \|x_1 - x_3\| = \|x_2 - x_3\|$, $y = [1, 1, -1]^T$,
where $a = e^{-\gamma \|x_i - x_j\|^2}$ for $i \neq j$. We assume C is large, so it is not needed here. Prove that at the optimal solution,

$$\alpha^* = \left[ \frac{2}{3(1-a)}, \; \frac{2}{3(1-a)}, \; \frac{4}{3(1-a)} \right]^T. \tag{6.30}$$

Then prove that if the initial solution is zero and the decomposition method in Chapter 6.2 is considered, after k is large enough,

$$(\alpha^{k+1} - \alpha^*)^T Q (\alpha^{k+1} - \alpha^*) = \frac{1}{4} (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*). \tag{6.31}$$

This is a simple example for which the decomposition method is linearly convergent. Hence, in theory, linear convergence is already the best worst-case analysis for the decomposition method in Chapter 6.2.
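Before attempting the proof of (6.30), a quick numerical sanity check may help. The sketch below is not a proof and rests on assumptions that are not spelled out in this exercise: the dual objective is taken to be (1/2) αᵀQα − eᵀα subject to yᵀα = 0 and α ≥ 0, with Q_ij = y_i y_j K_ij, K_ii = 1, and K_ij = a for i ≠ j. The function name is hypothetical.

```python
def satisfies_kkt(a: float, tol: float = 1e-9) -> bool:
    """Check feasibility and stationarity of the alpha* in (6.30) for 0 <= a < 1."""
    y = [1.0, 1.0, -1.0]
    # Assumed Q: Q_ij = y_i * y_j * K_ij with K_ii = 1 and K_ij = a for i != j.
    Q = [[y[i] * y[j] * (1.0 if i == j else a) for j in range(3)] for i in range(3)]
    c = 1.0 / (3.0 * (1.0 - a))
    alpha = [2.0 * c, 2.0 * c, 4.0 * c]

    # Gradient of the assumed dual objective (1/2) alpha^T Q alpha - e^T alpha.
    grad = [sum(Q[i][j] * alpha[j] for j in range(3)) - 1.0 for i in range(3)]

    feasible = all(v > 0 for v in alpha) and abs(
        sum(yi * v for yi, v in zip(y, alpha))
    ) <= tol
    # All alpha_i are strictly positive, so stationarity requires a single b
    # with grad_i + b * y_i = 0 for every i, i.e. -y_i * grad_i is constant.
    b_candidates = [-y[i] * grad[i] for i in range(3)]
    stationary = max(b_candidates) - min(b_candidates) <= tol
    return feasible and stationary


if __name__ == "__main__":
    for a in (0.1, 0.5, 0.9):
        print(a, satisfies_kkt(a))
```

Because the assumed problem is convex, a feasible point that satisfies these conditions is optimal, which is what the first part of the exercise asks to establish analytically.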
5. Write a simple SVM classifier using the algorithm in this chapter. Consider the dense input format only. Basically, use the simple working set selection and solve two-variable optimization problems until a given stopping tolerance is satisfied. You can write the predictor in the same program so that the test accuracy is known immediately. Kernel elements are calculated as needed.
Requirements:
• The code should be less than 300 lines in any high-level language.
• The RBF kernel is enough.
• Run some small data sets in http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary using some given parameters. On some (maybe most) parameters your code should be faster than libsvm, but on others it is the other way around. Explain why.
• Moreover, your code should be able to train the 50,000-instance ijcnn1 data set at the same URL.
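As a possible starting point for this exercise, here is a minimal sketch of such a solver. It assumes the standard dual form (minimize (1/2) αᵀQα − eᵀα subject to yᵀα = 0 and 0 ≤ α ≤ C) and uses the maximal-violating-pair rule as the "simple working set selection"; the rule in Chapter 6.3 may differ in its details. All names are hypothetical, data reading is omitted, and no kernel cache is used, so kernel columns are recomputed on demand in every iteration.

```python
import math


def rbf(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2) for dense vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def train(X, y, C=1.0, gamma=1.0, eps=1e-3, max_iter=100_000):
    """Return (alpha, b) for dense rows X and labels y in {+1, -1}."""
    n = len(X)
    alpha = [0.0] * n
    grad = [-1.0] * n  # gradient of the assumed dual objective at alpha = 0
    b = 0.0
    snap = 1e-12 * C  # values this close to a bound are put exactly on it

    def clip(v):
        return 0.0 if v < snap else C if v > C - snap else v

    for _ in range(max_iter):
        # i maximizes -y_t * grad_t over I_up; j minimizes it over I_low.
        i = j = -1
        m, M = -math.inf, math.inf
        for t in range(n):
            v = -y[t] * grad[t]
            in_up = alpha[t] < C if y[t] > 0 else alpha[t] > 0
            in_low = alpha[t] > 0 if y[t] > 0 else alpha[t] < C
            if in_up and v > m:
                i, m = t, v
            if in_low and v < M:
                j, M = t, v
        if i < 0 or j < 0:
            break  # no feasible pair, e.g. all labels identical
        b = (m + M) / 2
        if m - M < eps:
            break  # stopping tolerance satisfied

        # Kernel columns are computed on demand; nothing is cached.
        Ki = [rbf(X[s], X[i], gamma) for s in range(n)]
        Kj = [rbf(X[s], X[j], gamma) for s in range(n)]

        # Move alpha_i by +y_i*step and alpha_j by -y_j*step (keeps y^T alpha fixed),
        # then shorten the step so both variables stay inside [0, C].
        quad = max(Ki[i] + Kj[j] - 2.0 * Ki[j], 1e-12)
        step = (m - M) / quad
        step = min(step, C - alpha[i] if y[i] > 0 else alpha[i])
        step = min(step, alpha[j] if y[j] > 0 else C - alpha[j])

        new_i = clip(alpha[i] + y[i] * step)
        new_j = clip(alpha[j] - y[j] * step)
        di, dj = new_i - alpha[i], new_j - alpha[j]
        alpha[i], alpha[j] = new_i, new_j
        for s in range(n):
            grad[s] += y[s] * (y[i] * di * Ki[s] + y[j] * dj * Kj[s])
    return alpha, b


def predict(x, X, y, alpha, b, gamma):
    """Sign of sum_t y_t * alpha_t * K(x_t, x) + b."""
    value = b + sum(
        a * yt * rbf(xt, x, gamma) for xt, yt, a in zip(X, y, alpha) if a > 0
    )
    return 1 if value >= 0 else -1


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]  # XOR-like toy data
    y = [1, 1, -1, -1]
    alpha, b = train(X, y, C=10.0, gamma=1.0)
    correct = sum(predict(x, X, y, alpha, b, 1.0) == t for x, t in zip(X, y))
    print(f"training accuracy: {correct}/{len(X)}")
```

The sketch trades speed for brevity: each iteration costs two full kernel columns, which is the kind of behavior the comparison with libsvm in the exercise is meant to explore.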
Bibliography
Bazaraa, M. S., H. D. Sherali, and C. M. Shetty (1993). Nonlinear programming : theory and algorithms (Second ed.). Wiley.
Boser, B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM Press.
Chang, C.-C., C.-W. Hsu, and C.-J. Lin (2000). The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 11 (4), 1003–1008.
Chang, C.-C. and C.-J. Lin (2001a). IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE.
Chang, C.-C. and C.-J. Lin (2001b). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cortes, C. and V. Vapnik (1995). Support-vector network. Machine Learning 20, 273–297.
Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers EC-14, 326–334.
Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.
Fletcher, R. (1987). Practical Methods of Optimization. John Wiley and Sons.
Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.
Gardy, J. L., C. Spencer, K. Wang, M. Ester, G. E. Tusnady, I. Simon, S. Hua, K. deFays, C. Lambert, K. Nakai, and F. S. Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (13), 3613–3617.
Hsu, C.-W. and C.-J. Lin (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.
Hush, D. and C. Scovel (2003). Polynomial-time decomposition algorithms for support vector machines. Machine Learning 51, 51–71.
Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.
Keerthi, S. S. and E. G. Gilbert (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning 46, 351–360.
Keerthi, S. S. and C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (7), 1667–1689.
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13, 637–649.
Knerr, S., L. Personnaz, and G. Dreyfus (1990). Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag.
Kreßel, U. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, pp. 255–268. MIT Press.
Lin, C.-J. (2001a). Formulations of support vector machines: a note from an optimization point of view. Neural Computation 13 (2), 307–317.
Lin, C.-J. (2001b). Linear convergence of a decomposition method for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
Lin, C.-J. (2001c). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12 (6), 1288–1298.
Lin, C.-J. (2002a). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks 13 (1), 248–250.
Lin, C.-J. (2002b). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 13 (5), 1045–1052.
Michie, D., D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.
Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130–136. IEEE.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.
Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation in IJCNN'01, Ford Research Laboratory. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.
Sarle, W. S. (1997). Neural Network FAQ. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.
Schölkopf, B. and A. J. Smola (2002). Learning with kernels. MIT Press.
Vapnik, V. (1998). Statistical Learning Theory. New York, NY: Wiley.
Index
attributes, 6
complementarity condition, 42
data instance, 6
features, 6
kernel function, 18
Nearest Neighbor Methods, 6
strong duality, 42