
Asymptotic Convergence of an SMO Algorithm Without Any Assumptions

Chih-Jen Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

cjlin@csie.ntu.edu.tw

Abstract

The asymptotic convergence result in Lin [6] can be applied to a modified SMO algorithm by Keerthi et al. [5] under some assumptions. Here we show that for this algorithm those assumptions are not necessary.

I. Introduction

Given training vectors x_i ∈ R^n, i = 1, …, l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, support vector machines (SVM) [9] require the solution of the following optimization problem:

$$\min_{\alpha}\ f(\alpha) = \tfrac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1,\ldots,l, \qquad y^T \alpha = 0, \tag{1}$$

where C > 0 and e is the vector of all ones. Training vectors x_i are mapped into a higher dimensional space by φ, and Q_ij ≡ y_i y_j K(x_i, x_j), where K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel.
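For concreteness, the following short Python sketch (my own illustration, not from the paper; it assumes NumPy and a Gaussian kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), and the function names are made up) builds Q from labelled data and evaluates the objective of (1).

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2) = phi(x_i)^T phi(x_j)
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def build_Q(X, y, gamma=1.0):
    # Q[i, j] = y_i * y_j * K(x_i, x_j), as defined above
    return np.outer(y, y) * rbf_kernel(X, gamma)

def objective(alpha, Q):
    # f(alpha) = 1/2 * alpha^T Q alpha - e^T alpha, the objective of (1)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()
```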

Due to the density of the matrix Q, the decomposition method is currently one of the major methods for solving (1) (e.g. [7], [3], [8]). It is an iterative process: in each iteration the index set of variables is separated into two sets B and N, where B is the working set. The variables corresponding to N are then fixed while a sub-problem on the variables corresponding to B is minimized.

Among these methods, Platt’s Sequential Minimal Optimization (SMO) [8] is a simple algorithm in which only two variables are selected for the working set in each iteration, so the sub-problem can be solved analytically without using optimization software. Keerthi et al. [5] pointed out a problem in the original SMO and proposed two modified versions. The one using the two indices with the maximal violation of the Karush-Kuhn-Tucker (KKT) condition may now be the most popular implementation among SVM software (e.g. LIBSVM [1], SVMTorch [2]). It is also a special case of another popular software, SVMlight [3]. Regarding convergence, Keerthi and Gilbert [4] have proved that under a stopping criterion and any stopping tolerance, the algorithm terminates in a finite number of iterations. However, this result does not imply asymptotic convergence. On the other hand, the asymptotic convergence of Lin [6] for the software SVMlight can be applied to this algorithm when the size of the working set is restricted to two. However, [6, Assumption IV.1] requires that any two-by-two principal sub-matrix of the Hessian matrix Q be positive definite. This assumption may not hold if, for example, some data points are identical. In this paper we show that without this assumption the results in [6] still follow. Hence existing implementations are asymptotically convergent without any additional assumptions.

The method by Keerthi et al. is as follows. Using y_i = ±1, the KKT condition of (1) can be rewritten as

$$\max\Bigl(\max_{\alpha_i < C,\, y_i = 1} -\nabla f(\alpha)_i,\ \max_{\alpha_i > 0,\, y_i = -1} \nabla f(\alpha)_i\Bigr) \le \min\Bigl(\min_{\alpha_i < C,\, y_i = -1} \nabla f(\alpha)_i,\ \min_{\alpha_i > 0,\, y_i = 1} -\nabla f(\alpha)_i\Bigr), \tag{2}$$

where ∇f(α) = Qα − e is the gradient of f(α) defined in (1). Then they consider

$$i \equiv \arg\max\bigl(\{-\nabla f(\alpha)_t \mid y_t = 1,\ \alpha_t < C\},\ \{\nabla f(\alpha)_t \mid y_t = -1,\ \alpha_t > 0\}\bigr), \tag{3}$$

$$j \equiv \arg\min\bigl(\{\nabla f(\alpha)_t \mid y_t = -1,\ \alpha_t < C\},\ \{-\nabla f(\alpha)_t \mid y_t = 1,\ \alpha_t > 0\}\bigr), \tag{4}$$

and use B ≡ {i, j} as the working set. That is, i and j are the two elements which violate the KKT condition the most.
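A minimal Python sketch of this working-set selection follows (my own illustration; the function name, array layout, and the stopping tolerance `tol` are assumptions, not part of the paper). It uses the fact that both sides of (2) can be written as −y_t ∇f(α)_t over the appropriate index sets.

```python
import numpy as np

def select_working_set(grad, y, alpha, C, tol=1e-3):
    """Maximal violating pair B = {i, j} as in (3)-(4).

    grad = Q @ alpha - e is the current gradient.  Returns None when the
    KKT condition (2) holds within the (assumed) tolerance `tol`.
    """
    vals = -y * grad  # equals -grad_t when y_t = 1 and grad_t when y_t = -1
    up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))   # index set of (3)
    low = ((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0))  # index set of (4)

    i = np.where(up)[0][np.argmax(vals[up])]    # maximizer in (3)
    j = np.where(low)[0][np.argmin(vals[low])]  # minimizer in (4)

    if vals[i] - vals[j] <= tol:  # (2) satisfied up to tol: stop
        return None
    return i, j
```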

If {α^k} is the sequence generated by the decomposition method, asymptotic convergence means that any convergent subsequence goes to an optimum of (1). The result of finite termination by Keerthi and Gilbert cannot be extended here because both sides of the inequality (2) are not continuous functions of α. In [6], the asymptotic convergence has been proved, but the author has to assume that the matrix Q satisfies

$$\min_{I}\bigl(\min(\operatorname{eig}(Q_{II}))\bigr) > 0, \tag{5}$$

where I is any subset of {1, …, l} with |I| ≤ 2 and min(eig(·)) is the smallest eigenvalue of a matrix ([6, Assumption IV.1]). The main purpose of this paper is to show that (5) is not necessary.

II. Main Results

The only reason why we need (5) is for Lemma IV.2 in [6]. It proves that there exists σ > 0 such that

$$f(\alpha^{k+1}) \le f(\alpha^k) - \frac{\sigma}{2}\|\alpha^{k+1} - \alpha^k\|^2, \quad \text{for all } k. \tag{6}$$

In the following we will show that without (5), (6) is still valid. First we note that if α^k is the current solution and B = {i, j} is selected using (3) and (4), the required minimization of the sub-problem takes place in the rectangle S = [0, C] × [0, C] along a path on which y_i α_i + y_j α_j = −y_N^T α_N^k is constant. Let the parametric change in α on this path be given by α(t):

$$\alpha_i(t) \equiv \alpha_i^k + t/y_i, \qquad \alpha_j(t) \equiv \alpha_j^k - t/y_j, \qquad \alpha_s(t) \equiv \alpha_s^k,\ \forall s \ne i, j.$$

The sub-problem is to minimize ψ(t) ≡ f(α(t)) subject to (α_i(t), α_j(t)) ∈ S. Let t̄ denote the solution of this problem and α^{k+1} = α(t̄). Clearly,

$$|\bar t| = \|\alpha^{k+1} - \alpha^k\| / \sqrt{2}. \tag{7}$$

As ψ(t) is a quadratic function of t,

$$\psi(t) = \psi(0) + \psi'(0)t + \psi''(0)t^2/2. \tag{8}$$

Since

$$\psi'(t) = \sum_{s=1}^{l} \nabla f(\alpha(t))_s\, \alpha_s'(t) = y_i \nabla f(\alpha(t))_i - y_j \nabla f(\alpha(t))_j = y_i\Bigl(\sum_{s=1}^{l} Q_{is}\alpha_s(t) - 1\Bigr) - y_j\Bigl(\sum_{s=1}^{l} Q_{js}\alpha_s(t) - 1\Bigr) \tag{9}$$

and

$$\psi''(t) = Q_{ii} + Q_{jj} - 2 y_i y_j Q_{ij}, \tag{10}$$

we have

$$\psi'(0) = y_i \nabla f(\alpha^k)_i - y_j \nabla f(\alpha^k)_j \tag{11}$$

and

$$\psi''(0) = \phi(x_i)^T\phi(x_i) + \phi(x_j)^T\phi(x_j) - 2 y_i^2 y_j^2 \phi(x_i)^T\phi(x_j) = \|\phi(x_i) - \phi(x_j)\|^2. \tag{12}$$
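The quantities ψ'(0) and ψ''(0) are exactly what a practical SMO step needs: the minimizer of the quadratic ψ along the path, clipped so that (α_i(t), α_j(t)) stays in S. The Python sketch below (my own illustration; the function name, the gradient update, and the treatment of the ψ''(0) = 0 case as a move to the feasible boundary are assumptions for the sketch, not taken from the paper) makes this concrete.

```python
import numpy as np

def smo_step(alpha, grad, Q, y, C, i, j):
    """One analytic sub-problem solve on B = {i, j}.

    Minimizes psi(t) = f(alpha(t)) with alpha_i(t) = alpha_i + t*y_i and
    alpha_j(t) = alpha_j - t*y_j, keeping (alpha_i(t), alpha_j(t)) in S.
    """
    dpsi = y[i] * grad[i] - y[j] * grad[j]                   # psi'(0), cf. (11)
    d2psi = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]  # psi''(0), cf. (10)

    # Range of t keeping alpha_i + t*y_i and alpha_j - t*y_j inside [0, C].
    lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] == 1 else (alpha[i] - C, alpha[i])
    lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] == 1 else (-alpha[j], C - alpha[j])
    t_lo, t_hi = max(lo_i, lo_j), min(hi_i, hi_j)

    if d2psi > 0:
        t = np.clip(-dpsi / d2psi, t_lo, t_hi)  # clipped minimizer (Case 1 below)
    else:
        # psi is linear in t (Case 2 below): move to the feasible boundary
        # in the descent direction.
        t = t_lo if dpsi > 0 else t_hi

    new_alpha = alpha.copy()
    new_alpha[i] += t * y[i]
    new_alpha[j] -= t * y[j]
    new_grad = grad + t * (y[i] * Q[:, i] - y[j] * Q[:, j])  # gradient at new alpha
    return new_alpha, new_grad
```

Together with the `select_working_set` sketch above, repeatedly applying this step gives one rendition of the modified SMO iteration whose convergence is analyzed below.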

Then our new lemma is as follows:

Lemma II.1 If the working set is selected using (3) and (4), then there exists σ > 0 such that (6) holds for all k.

Proof: Since Q is positive semidefinite, ψ''(t) ≥ 0, so we can consider the following two cases:

Case 1: ψ''(0) > 0. Let t* denote the unconstrained minimum of ψ, i.e. t* = −ψ'(0)/ψ''(0). Clearly, t̄ = γt* where 0 < γ ≤ 1. Then, by (8),

$$\psi(\bar t) - \psi(0) = -\gamma\frac{\psi'(0)^2}{\psi''(0)} + \frac{\gamma^2}{2}\frac{\psi'(0)^2}{\psi''(0)} \le -\frac{\gamma^2}{2}\frac{\psi'(0)^2}{\psi''(0)} = -\frac{\psi''(0)}{2}\bar t^{\,2} = -\frac{\psi''(0)}{4}\|\alpha^{k+1} - \alpha^k\|^2, \tag{13}$$

where the last equality is from (7).

Case 2: ψ''(t) = 0. By (12), φ(x_i) = φ(x_j). Using this, (9), and (11) we get

$$\psi'(0) = y_i\Bigl(\sum_{s=1}^{l} Q_{is}\alpha_s^k - 1\Bigr) - y_j\Bigl(\sum_{s=1}^{l} Q_{js}\alpha_s^k - 1\Bigr) = y_i\Bigl(\sum_{s=1}^{l} y_i y_s \phi(x_i)^T\phi(x_s)\alpha_s^k - 1\Bigr) - y_j\Bigl(\sum_{s=1}^{l} y_j y_s \phi(x_j)^T\phi(x_s)\alpha_s^k - 1\Bigr) = y_j - y_i.$$

With (11), since descent is assured, ψ'(0) ≠ 0. Thus y_i ≠ y_j and hence |ψ'(0)| = 2. Since ψ''(0) = ψ''(t) = 0 implies that ψ(t) is a linear function of t, with ψ(t̄) ≤ ψ(0) and |t̄| ≤ C,

$$\psi(\bar t) - \psi(0) = -|\psi'(0)\bar t| \le -\frac{2}{C}\bar t^{\,2} = -\frac{\|\alpha^{k+1} - \alpha^k\|^2}{C}. \tag{14}$$

Note that ψ(0) = f(α^k) and ψ(t̄) = f(α^{k+1}). Thus, using (10), (13), and (14), if we set

$$\sigma \equiv \min\Bigl\{\frac{2}{C},\ \min_{i,j}\Bigl\{\frac{Q_{ii} + Q_{jj} - 2 y_i y_j Q_{ij}}{2} : Q_{ii} + Q_{jj} - 2 y_i y_j Q_{ij} > 0\Bigr\}\Bigr\},$$

then the proof is complete.
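As a small numeric illustration (my own, not from the paper; the helper name and the tolerance used to decide positivity are assumptions), the constant σ from the proof can be computed directly from Q, y, and C:

```python
import numpy as np

def sigma_bound(Q, y, C, eps=1e-12):
    """sigma = min(2/C, min over pairs (i, j) of
    (Q_ii + Q_jj - 2*y_i*y_j*Q_ij)/2, restricted to pairs where that
    curvature is positive), as at the end of the proof of Lemma II.1.
    """
    d = np.diag(Q)
    # curvature[i, j] = Q_ii + Q_jj - 2*y_i*y_j*Q_ij = psi''(0) for the pair (i, j)
    curvature = d[:, None] + d[None, :] - 2.0 * np.outer(y, y) * Q
    positive = curvature[curvature > eps]  # pairs with psi''(0) > 0 (Case 1)
    sigma = 2.0 / C                        # contribution of Case 2
    if positive.size:
        sigma = min(sigma, positive.min() / 2.0)
    return sigma
```

Even when some training points coincide, so that some pair curvatures are zero and assumption (5) fails, σ remains positive because of the 2/C term, which is exactly the point of Case 2.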


III. Conclusion

Using [6, Theorem IV.1], the results here can be extended to the decomposition method for support vector regression, which selects the two-element working set in a similar way. The future challenge will be to remove the same assumption when the size of the working set is larger than two.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan via the grant NSC 90-2213-E-002-111. The author thanks Sathiya Keerthi for many helpful comments.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] R. Collobert and S. Bengio. SVMTorch: A support vector machine for large-scale regression and classification problems. Journal of Machine Learning Research, 1:143–160, 2001.

[3] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[4] S. S. Keerthi and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46:351–360, 2002.

[5] S. S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13:637–649, 2001.

[6] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 2001. To appear.

[7] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR’97, 1997.

[8] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
