Asymptotic Convergence of an SMO Algorithm Without Any Assumptions
Chih-Jen Lin
Department of Computer Science and Information Engineering National Taiwan University, Taipei 106, Taiwan
cjlin@csie.ntu.edu.tw
Abstract
The asymptotic convergence in Lin [6] can be applied to a modified SMO algorithm by Keerthi et al. [5] with some assumptions. Here we show that for this algorithm those assumptions are not necessary.
I. Introduction
Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, support vector machines (SVM) [9] require the solution of the following optimization problem:

min_α  f(α) = (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C,  i = 1, ..., l,    (1)
            y^T α = 0,

where C > 0 and e is the vector of all ones. Training vectors x_i are mapped into a higher dimensional space by φ, and Q_ij ≡ y_i y_j K(x_i, x_j), where K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel.
Due to the density of the matrix Q, the decomposition method is currently one of the major methods for solving (1) (e.g. [7], [3], [8]). It is an iterative process: in each iteration the index set of variables is separated into two sets B and N, where B is the working set. The variables corresponding to N are then fixed while a sub-problem in the variables corresponding to B is minimized.
Among these methods, Platt's Sequential Minimal Optimization (SMO) [8] is a simple algorithm in which only two variables are selected for the working set in each iteration, so the sub-problem can be solved analytically without an optimization package. Keerthi et al. [5] pointed out a problem in the original SMO and proposed two modified versions. The one using the two indices with the maximal violation of the Karush-Kuhn-Tucker (KKT) condition may now be the most popular implementation among SVM software (e.g. LIBSVM [1], SVMTorch [2]). It is also a special case of another popular software package, SVMlight [3]. Regarding convergence, Keerthi and Gilbert [4] proved that under a stopping criterion with any positive stopping tolerance, the algorithm terminates in a finite number of iterations. However, this result does not imply asymptotic convergence. On the other hand, the asymptotic convergence result of Lin [6] for SVMlight can be applied to this algorithm when the size of the working set is restricted to two. However, [6, Assumption IV.1] requires that every two by two principal sub-matrix of the Hessian matrix Q be positive definite. This assumption may not hold if, for example, some data points are identical. In this paper we show that without this assumption the results in [6] still follow. Hence existing implementations are asymptotically convergent without any additional conditions.
The method by Keerthi et al. is as follows. Using y_i = ±1, the KKT condition of (1) can be rewritten as

max( max_{α_i < C, y_i = 1} −∇f(α)_i,  max_{α_i > 0, y_i = −1} ∇f(α)_i )
  ≤ min( min_{α_i < C, y_i = −1} ∇f(α)_i,  min_{α_i > 0, y_i = 1} −∇f(α)_i ),    (2)
where ∇f(α) = Qα − e is the gradient of f(α) defined in (1). Then they consider
i ≡ argmax( {−∇f(α)_t | y_t = 1, α_t < C} ∪ {∇f(α)_t | y_t = −1, α_t > 0} ),    (3)
j ≡ argmin( {∇f(α)_t | y_t = −1, α_t < C} ∪ {−∇f(α)_t | y_t = 1, α_t > 0} ),    (4)

and use B ≡ {i, j} as the working set. That is, i and j are the two elements which violate the KKT condition the most.
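The selection rule (3)-(4) is straightforward to vectorize, since both sets compare the same quantity −y_t ∇f(α)_t over two index sets. The following is a minimal sketch assuming the gradient ∇f(α) = Qα − e is maintained as an array; the function and variable names are illustrative and not taken from LIBSVM or SVMTorch.

```python
import numpy as np

def select_working_set(alpha, grad, y, C):
    """Maximal-violating-pair selection, a sketch of (3) and (4).

    grad is the gradient Q @ alpha - e.  For both sets the compared
    quantity is -y_t * grad_t; only the candidate index sets differ.
    """
    # Candidates for i: alpha_t < C with y_t = 1, or alpha_t > 0 with y_t = -1.
    up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))
    # Candidates for j: alpha_t < C with y_t = -1, or alpha_t > 0 with y_t = 1.
    down = ((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0))

    score = -y * grad  # violation measure from the rewritten KKT condition (2)
    i = np.flatnonzero(up)[np.argmax(score[up])]
    j = np.flatnonzero(down)[np.argmin(score[down])]
    return int(i), int(j)
```

The KKT condition (2) holds exactly when score[i] ≤ score[j]; otherwise (i, j) is the most violating pair.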
If {α^k} is the sequence generated by the decomposition method, asymptotic convergence means that any convergent subsequence goes to an optimum of (1). The finite-termination result of Keerthi and Gilbert cannot be extended here because both sides of the inequality (2) are not continuous functions of α. In [6], the asymptotic convergence has been proved, but the author has to assume that the matrix Q satisfies
min_I ( min( eig(Q_II) ) ) > 0,    (5)

where I is any subset of {1, ..., l} with |I| ≤ 2 and min(eig(·)) is the smallest eigenvalue of a matrix ([6, Assumption IV.1]). The main purpose of this paper is to show that (5) is not necessary.
II. Main Results
The only reason why we need (5) is for Lemma IV.2 in [6]. It proves that there exists σ > 0 such that
f(α^{k+1}) ≤ f(α^k) − (σ/2) ‖α^{k+1} − α^k‖²,  for all k.    (6)
In the following we show that (6) is still valid without (5). First we note that if α^k is the current solution and B = {i, j} is selected using (3) and (4), the required minimization of the sub-problem takes place in the rectangle S = [0, C] × [0, C] along a path where y_i α_i + y_j α_j = −y_N^T α_N^k is constant. Let the parametric change in α on this path be given by α(t):

α_i(t) ≡ α_i^k + t/y_i,  α_j(t) ≡ α_j^k − t/y_j,  α_s(t) ≡ α_s^k,  ∀ s ≠ i, j.
The sub-problem is to minimize ψ(t) ≡ f(α(t)) subject to (α_i(t), α_j(t)) ∈ S. Let t̄ denote the solution of this problem and α^{k+1} = α(t̄). Clearly,
|t̄| = ‖α^{k+1} − α^k‖ / √2.    (7)

As ψ(t) is a quadratic function of t,

ψ(t) = ψ(0) + ψ′(0) t + ψ″(0) t²/2.    (8)

Since

ψ′(t) = Σ_{s=1}^{l} ∇f(α(t))_s α′_s(t) = y_i ∇f(α(t))_i − y_j ∇f(α(t))_j
      = y_i ( Σ_{s=1}^{l} Q_is α_s(t) − 1 ) − y_j ( Σ_{s=1}^{l} Q_js α_s(t) − 1 )    (9)

and

ψ″(t) = Q_ii + Q_jj − 2 y_i y_j Q_ij,    (10)
we have

ψ′(0) = y_i ∇f(α^k)_i − y_j ∇f(α^k)_j    (11)

and

ψ″(0) = φ(x_i)^T φ(x_i) + φ(x_j)^T φ(x_j) − 2 y_i² y_j² φ(x_i)^T φ(x_j) = ‖φ(x_i) − φ(x_j)‖².    (12)
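The analytic solve of this two-variable sub-problem, using ψ′(0) from (11), ψ″(0) from (10), and clipping t to keep (α_i(t), α_j(t)) in S, can be sketched as follows. This is a minimal illustration under the parametrization above (note t/y_s = t·y_s since y_s = ±1); the function name and structure are hypothetical, not an excerpt from any SVM package.

```python
import numpy as np

def two_variable_step(alpha, grad, Q, y, i, j, C):
    """One analytic sub-problem solve along alpha(t); returns the new alpha."""
    dpsi = y[i] * grad[i] - y[j] * grad[j]                  # psi'(0), eq. (11)
    ddpsi = Q[i, i] + Q[j, j] - 2 * y[i] * y[j] * Q[i, j]   # psi''(0), eq. (10)

    # Feasible t-interval keeping (alpha_i(t), alpha_j(t)) in [0, C]^2:
    # 0 <= alpha_i + t*y_i <= C  and  0 <= alpha_j - t*y_j <= C.
    bounds_i = sorted([(0 - alpha[i]) * y[i], (C - alpha[i]) * y[i]])
    bounds_j = sorted([(alpha[j] - C) * y[j], (alpha[j] - 0) * y[j]])
    lo, hi = max(bounds_i[0], bounds_j[0]), min(bounds_i[1], bounds_j[1])

    if ddpsi > 0:
        # Case 1: clip the unconstrained minimum t* = -psi'(0)/psi''(0).
        t = np.clip(-dpsi / ddpsi, lo, hi)
    else:
        # Case 2: psi is linear in t, so walk to the boundary that descends.
        t = lo if dpsi > 0 else hi

    new = alpha.copy()
    new[i] += t * y[i]
    new[j] -= t * y[j]
    return new
```

Note that the update preserves the equality constraint, since y_i(t·y_i) − y_j(t·y_j) adds t − t = 0 to y^T α.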
Then our new lemma is as follows:
Lemma II.1 If the working set selection is by using (3) and (4), there exists σ > 0 such that for any k, (6) holds.
Proof: Since Q is positive semidefinite, ψ″(t) ≥ 0, so we can consider the following two cases:
Case 1: ψ″(0) > 0. Let t* denote the unconstrained minimum of ψ, i.e. t* = −ψ′(0)/ψ″(0). Clearly, t̄ = γt* where 0 < γ ≤ 1. Then, by (8),

ψ(t̄) − ψ(0) = −γ ψ′(0)²/ψ″(0) + (γ²/2) ψ′(0)²/ψ″(0)
            ≤ −(γ²/2) ψ′(0)²/ψ″(0)
            = −(ψ″(0)/2) t̄²
            = −(ψ″(0)/4) ‖α^{k+1} − α^k‖²,    (13)

where the last equality is from (7).
Case 2: ψ″(t) = 0. By (12), φ(x_i) = φ(x_j). Using this, (9), and (11), we get

ψ′(0) = y_i ( Σ_{s=1}^{l} Q_is α_s^k − 1 ) − y_j ( Σ_{s=1}^{l} Q_js α_s^k − 1 )
      = y_i ( Σ_{s=1}^{l} y_i y_s φ(x_i)^T φ(x_s) α_s^k − 1 ) − y_j ( Σ_{s=1}^{l} y_j y_s φ(x_j)^T φ(x_s) α_s^k − 1 )
      = y_j − y_i.

With (11), since descent is assured, ψ′(0) ≠ 0. Thus y_i ≠ y_j and hence |ψ′(0)| = 2. Since ψ″(0) = ψ″(t) = 0 implies that ψ(t) is a linear function of t, with ψ(t̄) ≤ ψ(0) and |t̄| ≤ C,

ψ(t̄) − ψ(0) = −|ψ′(0) t̄| ≤ −(2/C) t̄² = −‖α^{k+1} − α^k‖² / C.    (14)
Note that ψ(0) = f(α^k) and ψ(t̄) = f(α^{k+1}). Thus, using (10), (13), and (14), if we set

σ ≡ min{ 2/C,  min_{i,j} { (Q_ii + Q_jj − 2 y_i y_j Q_ij)/2 : Q_ii + Q_jj − 2 y_i y_j Q_ij > 0 } },

then the proof is complete.
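As a sanity check, the descent bound (6) with this σ can be verified numerically on a tiny two-point problem. All values below (the matrix Q, the labels, C) are invented for illustration; Q here is positive definite, so the step falls into Case 1.

```python
import numpy as np

# Illustrative two-point problem: Q = y_i y_j K(x_i, x_j) with Q positive definite.
Q = np.array([[2.0, -1.0], [-1.0, 2.0]])
y = np.array([1, -1])
C = 1.0

f = lambda a: 0.5 * a @ Q @ a - a.sum()   # objective of (1)

# One analytic step on the working set B = {0, 1}, using (10) and (11).
alpha_k = np.zeros(2)
grad = Q @ alpha_k - 1.0
dpsi = y[0] * grad[0] - y[1] * grad[1]                  # psi'(0)
ddpsi = Q[0, 0] + Q[1, 1] - 2 * y[0] * y[1] * Q[0, 1]   # psi''(0)
t = np.clip(-dpsi / ddpsi, 0.0, C)        # feasible t-interval is [0, C] here
alpha_k1 = alpha_k + np.array([t * y[0], -t * y[1]])

# sigma = min(2/C, smallest positive (Q_ii + Q_jj - 2 y_i y_j Q_ij)/2)
sigma = min(2 / C, ddpsi / 2)
lhs = f(alpha_k1)
rhs = f(alpha_k) - sigma / 2 * np.sum((alpha_k1 - alpha_k) ** 2)
assert lhs <= rhs + 1e-12    # the bound (6) holds for this step
```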
III. Conclusion
Using [6, Theorem IV.1], the results here can be extended to the decomposition method for support vector regression, which selects the two-component working set in a similar way. A future challenge is to remove the same assumption when the size of the working set is larger than two.
Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the grant NSC 90-2213-E-002-111. The author thanks Sathiya Keerthi for many helpful comments.
References
[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[2] R. Collobert and S. Bengio. SVMTorch: A support vector machine for large-scale regression and classification problems. Journal of Machine Learning Research, 1:143–160, 2001.
[3] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[4] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46:351–360, 2002.
[5] S. S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13:637–649, 2001.
[6] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 2001. To appear.
[7] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR’97, 1997.
[8] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[9] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.