
Discussions and conclusions

From an optimization point of view, decomposition methods are like “coordinate search” methods or the “alternating variables method” (Fletcher, 1987, Chapter 2.2). They converge slowly because first- and second-order information is not used. In addition, if the working set selection is not appropriate, the algorithm may fail to converge even though the objective value strictly decreases (see, for example, Powell, 1973). However, even with such disadvantages, the decomposition method has become one of the major methods for SVM.

We think the main reason is that the decomposition method is efficient for SVM in the following situations:

1. C is small and most support vectors are at the upper bound. That is, there are not many free variables in the iterative process.

2. The problem is well-conditioned even though there are many free variables.

For example, we do not think that the adult problems with C = 1000 belong to the above cases. They are difficult problems for decomposition methods.

If for most applications we only need solutions of problems that belong to the above situations, current decomposition methods may be good enough. In particular, an SMO-type algorithm (Platt, 1998) has the advantage of not requiring any optimization solver. However, if in many cases we need to solve difficult problems (for example, when C is large), more optimization knowledge and techniques should be considered. We hope that practical applications will provide a better understanding of this issue.

Regarding the SVM formulation, we think (1.5) is simpler than (1.1) but of similar quality for our test problems. In addition, in this paper we experiment with different implementations of the working set selection. The cost is always the same: at most O(ql), achieved by selecting some of the largest and smallest components of ∇f(α^k). This may not be the case for the regular SVM formulation (1.1) due to the linear constraint y^Tα = 0. In SVMlight, the implementation is simple because (2.1) is very special. If we change the constraints of (2.1) to 0 ≤ α^k + d ≤ C, the solution procedure may be more complicated. Currently we add b^2/2 to the objective function. This is the same as finding a hyperplane passing through the origin that separates the data [φ(x_i), 1], i = 1, . . . , l. It was pointed out by Cristianini and Shawe-Taylor (2000) that the number 1 added may not be the best choice. Experimenting with different numbers can be a future issue for improving the performance of BSVM.
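
To make this equivalence concrete, the following display gives the standard argument; the augmented vectors are introduced here only for illustration (amsmath notation):

```latex
% Adding b^2/2 to the objective = a hyperplane through the origin
% for the augmented data [phi(x_i), 1] (requires amsmath).
\[
  \tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix}, \qquad
  \tilde{\phi}(x_i) = \begin{bmatrix} \phi(x_i) \\ 1 \end{bmatrix}
  \quad\Longrightarrow\quad
  \tilde{w}^{T}\tilde{\phi}(x_i) = w^{T}\phi(x_i) + b, \qquad
  \tfrac{1}{2}\|\tilde{w}\|^{2} = \tfrac{1}{2}\|w\|^{2} + \tfrac{1}{2}b^{2}.
\]
```

Hence a hyperplane through the origin for the augmented data has exactly the original decision function, and its regularization term is the original one plus b^2/2. If a constant τ is appended instead of 1, the same argument gives a penalty of b^2/(2τ^2), which is one way to view the remark of Cristianini and Shawe-Taylor (2000) that 1 may not be the best choice.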

From a numerical point of view, there are also possible advantages of using (1.5). When solving (1.2), a numerical difficulty is deciding whether a variable is at a bound or not, because exact comparisons between floating-point numbers are generally not recommended. For example, to calculate b of (1.2), we use the following KKT condition:

$(Q\alpha)_i + b\,y_i - 1 = 0 \quad \text{if } 0 < \alpha_i < C. \qquad (5.1)$

Therefore, we can calculate b by (5.1), where i is any element in B. However, when implementing (5.1), we cannot directly compare α_i to 0 or C. In SVMlight, a small ε_a = 10^{-12} > 0 is introduced: α_i is considered free if α_i ≥ ε_a and α_i ≤ C − ε_a. Otherwise, if a wrong α_i is chosen, the obtained b can be erroneous. On the other hand, if the bounded formulation is used together with appropriate solvers for the sub-problem (2.7), it is possible to compare α_i directly with 0 or C without needing an ε_a. For example, Lin and Moré (1999) used a “projected gradient” method in which all values at bounds are set by direct assignment. Hence it is safe to compare α_i with 0 or C. To be more precise, in floating-point computation, if α_i ← C has been assigned, a later comparison of α_i with C returns true because both values have the same internal representation.
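
As a small illustration of the difference between the two tests, and of how (5.1) yields b, consider the following sketch. It is not BSVM’s or SVMlight’s actual code; the function names, the default tolerance, and the assumption that the vector Qα and the labels y_i ∈ {+1, −1} are available are all for illustration.

```cpp
#include <cstddef>
#include <vector>

// Tolerance-based test in the style described above: alpha_i is treated
// as free only when it is at least eps_a away from both bounds.
bool is_free_with_tolerance(double alpha_i, double C, double eps_a = 1e-12) {
    return alpha_i >= eps_a && alpha_i <= C - eps_a;
}

// If the solver writes bound values by direct assignment (alpha[i] = 0.0
// or alpha[i] = C), the test can use exact floating-point comparison.
bool is_free_exact(double alpha_i, double C) {
    return alpha_i != 0.0 && alpha_i != C;
}

// Compute b from KKT condition (5.1): for a free alpha_i,
// (Q alpha)_i + b*y_i - 1 = 0, so b = y_i * (1 - (Q alpha)_i).
// Qalpha holds the vector Q*alpha; y holds the labels +1/-1.
double compute_b(const std::vector<double>& alpha,
                 const std::vector<double>& Qalpha,
                 const std::vector<int>& y, double C) {
    for (std::size_t i = 0; i < alpha.size(); ++i)
        if (is_free_exact(alpha[i], C))
            return y[i] * (1.0 - Qalpha[i]);
    return 0.0;  // no free variable: b must be determined another way
}
```

The exact test in is_free_exact is safe only under the direct-assignment convention described above; with a solver that merely drives α_i numerically close to a bound, the tolerance-based test is the appropriate one.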

In Platt (1998), a situation was considered where the bias b of the standard SVM formulation (1.2) is a constant instead of a variable. Then the dual problem is a bound-constrained formulation, so BSVM can be easily modified for it.

In Section 2, we demonstrate that finding a good working set is not an easy task. Sometimes a natural method turns out to be a bad choice. It is also interesting to note that for the different formulations (1.2) and (1.5), similar selection strategies (2.1) and (2.2) lead to totally different performance. On the other hand, Algorithms 2.1 and 4.1 are both very useful for (1.2) and (1.5), respectively. Therefore, for any new SVM formulation, we should be careful when applying existing selection strategies to it.

Finally we summarize some possible advantages of Algorithm 2.1:

1. It keeps the number of free variables as low as possible. This in general leads to faster convergence on difficult problems.

2. It tends to consider free variables of the current iteration again in subsequent iterations. Therefore, the columns of Q corresponding to these elements are naturally cached, as illustrated by the sketch after this list.
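
The caching idea in item 2 can be sketched as a small least-recently-used (LRU) cache of columns of Q. This is only an illustration under invented names (it is not BSVM’s actual cache), and it assumes a cache capacity of at least one column:

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal LRU cache for columns of Q: a column is recomputed only on a miss.
class ColumnCache {
public:
    ColumnCache(std::size_t capacity,
                std::function<std::vector<double>(int)> compute_column)
        : capacity_(capacity), compute_column_(std::move(compute_column)) {}

    // Return column i of Q, computing it only if it is not cached.
    const std::vector<double>& get(int i) {
        auto it = map_.find(i);
        if (it != map_.end()) {                       // hit: move i to the front
            order_.splice(order_.begin(), order_, it->second.second);
            return it->second.first;
        }
        if (map_.size() == capacity_) {               // full: evict least recently used
            map_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(i);                         // miss: compute and store column i
        auto res = map_.emplace(
            i, std::make_pair(compute_column_(i), order_.begin()));
        return res.first->second.first;
    }

private:
    std::size_t capacity_;
    std::function<std::vector<double>(int)> compute_column_;
    std::list<int> order_;                            // most recently used first
    std::unordered_map<int, std::pair<std::vector<double>,
                                      std::list<int>::iterator>> map_;
};
```

Since Algorithm 2.1 tends to select the same free variables repeatedly, their columns stay near the front of the list and are rarely recomputed.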

Acknowledgments

This work was supported in part by the National Science Council of Taiwan via the grant NSC 89-2213-E-002-013. The authors thank Chih-Chung Chang for many helpful discussions and comments. Part of the software implementation benefited from his help. They also thank Thorsten Joachims, Pavel Laskov, John Platt, and anonymous referees for helpful comments.

Notes

1. BSVM is available at http://www.csie.ntu.edu.tw/~cjlin/bsvm.

2. TRON is available at http://www.mcs.anl.gov/~more/tron.

References

Blake, C. L. & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA. (Available at http://www.ics.uci.edu/~mlearn/MLRepository.html)

Chang, C.-C., Hsu, C.-W., & Lin, C.-J. (2000). The analysis of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 11:4, 1003–1008.

Cristianini, N. & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, UK: Cambridge University Press.

Fletcher, R. (1987). Practical methods of optimization. New York: John Wiley and Sons.

Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings of the 15th Intl. Conf. Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers.

Ho, T. K. & Kleinberg, E. M. (1996). Building projectable classifiers of arbitrary complexity. In Proceedings of the 13th International Conference on Pattern Recognition (pp. 880–885). Vienna, Austria.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.). Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press.

Joachims, T. (2000). Private communication.

Laskov, P. (2002). Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46, 315–349.

Lin, C.-J. (2000). On the convergence of the decomposition method for support vector machines (Technical Report). Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan. To appear in IEEE Trans. Neural Networks.

Lin, C.-J. & Moré, J. J. (1999). Newton’s method for large-scale bound constrained problems. SIAM J. Optim., 9, 1100–1127. (Software available at http://www.mcs.anl.gov/~more/tron)

Mangasarian, O. L. & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Trans. Neural Networks, 10:5, 1032–1037.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. (Data available at anonymous ftp: ftp.ncc.up.pt/pub/statlog/)

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR’97.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.). Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press.

Powell, M. J. D. (1973). On search directions for minimization. Math. Programming, 4, 193–201.

Saunders, C., Stitson, M. O., Weston, J., Bottou, L., Schölkopf, B., & Smola, A. (1998). Support vector machine reference manual (Technical Report No. CSD-TR-98-03). Egham, UK: Royal Holloway, University of London.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (Eds.). (1998). Advances in kernel methods—support vector learning. Cambridge, MA: MIT Press.

Vanderbei, R. (1994). LOQO: An interior point code for quadratic programming (Technical Report No. SOR 94-15). Statistics and Operations Research, Princeton University. (revised November, 1998)

Vapnik, V. (1995). The nature of statistical learning theory. New York, NY: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York, NY: John Wiley.

Received February 23, 2000 Revised February 20, 2001 Accepted February 21, 2001 Final manuscript July 20, 2001
