Linear Convergence of a Decomposition Method for
Support Vector Machines
Chih-Jen Lin
Department of Computer Science and Information Engineering
National Taiwan University
Taipei 106, Taiwan
[email protected]
Abstract
Recently the asymptotic convergence of some commonly used decomposition methods for support vector machines has been established. However, their local convergence rates are still unknown. In this paper, under the assumptions that the kernel matrix is positive definite and the problem is non-degenerate, we prove the linear convergence of a popular decomposition method.
1 Introduction
Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, in two classes, and a vector $y \in R^l$ such that $y_i \in \{1, -1\}$, the support vector machines (SVM) (Cortes and Vapnik, 1995; Vapnik, 1998) require the solution of the following optimization problem:
$$\begin{aligned}
\min_{\alpha} \quad & f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& y^T \alpha = 0,
\end{aligned} \tag{1.1}$$
where e is the vector of all ones, C is the upper bound of all variables, and Q is an l by l positive semidefinite matrix. Training vectors xi are mapped into a higher
(maybe infinite) dimensional space by the function φ and Qij ≡ yiyjK(xi, xj)
where K(xi, xj) ≡ φ(xi)Tφ(xj) is the kernel.
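As a small illustration of the definition $Q_{ij} \equiv y_i y_j K(x_i, x_j)$ and of the objective in (1.1), the following sketch builds Q for a toy data set. The choice of the RBF kernel, the helper names `rbf_kernel`, `build_q`, and `objective`, and the three sample points are assumptions made only for this illustration.

```python
import math

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2); the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def build_q(X, y, gamma):
    """Return the l-by-l matrix with entries Q_ij = y_i * y_j * K(x_i, x_j)."""
    l = len(X)
    return [[y[i] * y[j] * rbf_kernel(X[i], X[j], gamma) for j in range(l)]
            for i in range(l)]

def objective(Q, alpha):
    """Evaluate f(alpha) = 0.5 * alpha^T Q alpha - e^T alpha from (1.1)."""
    l = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(l) for j in range(l))
    return 0.5 * quad - sum(alpha)

# Hypothetical toy data: three points in R^2 with labels +1, +1, -1.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [1, 1, -1]
Q = build_q(X, y, gamma=1.0)
```

Every entry of Q depends on a pair of training vectors, so Q is in general dense, which is what motivates working on a few variables at a time.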
Because the matrix Q is dense, the decomposition method is currently one of the major methods for solving SVM problems (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998). It is an iterative process: in each iteration the index set of variables is partitioned into two sets B and N, where B is the working set. In that iteration the variables corresponding to N are fixed while a sub-problem on the variables corresponding to B is minimized.
Among these decomposition methods, the software SVMlight (Joachims, 1998) is a popular one. It selects the working set B in a systematic way, and the size of B can be any even number. When B is restricted to two elements, the method coincides with a modification of the SMO algorithm by Keerthi et al. (2001). Originally proposed by Platt (1998), the Sequential Minimal Optimization (SMO) algorithm is an extreme case of the decomposition method in which working sets are restricted to two elements. The advantage of SMO is that the sub-problem in each iteration can be solved analytically, without optimization software. Other software using the same working set selection as SVMlight includes, for example, LIBSVM (Chang and Lin, 2001).
The asymptotic convergence of the decomposition method used in SVMlight was first proved in (Lin, 2001), which also gives more information about existing work on the convergence of decomposition methods. So far there are no results on the local convergence rate of decomposition methods. In this paper we establish the linear convergence of the method used by SVMlight. The analysis of convergence rates is important for optimization methods because it shows how fast an algorithm converges. It can also give more insight into practical behavior.
This paper is organized as follows. In Section 2 we briefly introduce the algorithm used by SVMlight, in particular its working set selection. Section 3 reviews existing convergence results. Section 4 presents the main result on linear convergence. Using this theoretical result, Section 5 explains some practical behaviors of decomposition methods, and Section 6 gives an example whose convergence is exactly linear. Finally, in Section 7 we discuss the relation between our proof and some earlier work which focuses on general bound-constrained optimization.
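2 The Method of SVMlight
In this section we describe the working set selection of SVMlight using the Karush-Kuhn-Tucker (KKT) condition (i.e., the optimality condition) of (1.1): if $\alpha$ is an optimal solution of (1.1), there are a number $b$ and two nonnegative vectors $\lambda$ and $\mu$ such that
$$\nabla f(\alpha) + b y = \lambda - \mu,$$
$$\lambda_i \alpha_i = 0, \quad \mu_i (C - \alpha)_i = 0, \quad \lambda_i \ge 0, \quad \mu_i \ge 0, \quad i = 1, \ldots, l,$$
where $\nabla f(\alpha) = Q\alpha - e$ is the gradient of $f(\alpha)$. This can be rewritten as
$$\begin{aligned}
\nabla f(\alpha)_i + b y_i \ge 0 \quad & \text{if } \alpha_i = 0, \\
\nabla f(\alpha)_i + b y_i \le 0 \quad & \text{if } \alpha_i = C, \\
\nabla f(\alpha)_i + b y_i = 0 \quad & \text{if } 0 < \alpha_i < C.
\end{aligned}$$
Since $y_i = \pm 1$, by defining
$$I_{up}(\alpha) \equiv \{ i \mid \alpha_i < C, y_i = 1 \text{ or } \alpha_i > 0, y_i = -1 \}, \text{ and}$$
$$I_{low}(\alpha) \equiv \{ i \mid \alpha_i < C, y_i = -1 \text{ or } \alpha_i > 0, y_i = 1 \},$$
a feasible $\alpha$ is optimal for (1.1) if and only if
$$\max_{i \in I_{up}(\alpha)} -y_i \nabla f(\alpha)_i \le \min_{i \in I_{low}(\alpha)} -y_i \nabla f(\alpha)_i. \tag{2.1}$$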
When α is not an optimal solution, if i ∈ Iup(α), j ∈ Ilow(α), and −yi∇f (α)i >
−yj∇f (α)j, following (Keerthi and Gilbert, 2002), we call such (i, j) a “violating
pair.”
If $q$, an even number, is the size of the working set $B$ and $\alpha^k$ is the current iterate, SVMlight selects the working set in the following way: $q/2$ indices are sequentially selected from elements in $I_{up}(\alpha^k)$ so that
$$-y_{i_1} \nabla f(\alpha^k)_{i_1} \ge -y_{i_2} \nabla f(\alpha^k)_{i_2} \ge \cdots \ge -y_{i_{q/2}} \nabla f(\alpha^k)_{i_{q/2}}. \tag{2.2}$$
The other $q/2$ indices are sequentially selected from $I_{low}(\alpha^k)$ such that
$$-y_{j_1} \nabla f(\alpha^k)_{j_1} \le \cdots \le -y_{j_{q/2}} \nabla f(\alpha^k)_{j_{q/2}}. \tag{2.3}$$
Therefore, SVMlight essentially puts the $q/2$ most violating pairs into the working set, and we call $(i_1, j_1)$ a “maximal violating pair.”
We consider only violating pairs, so if $-y_{i_{q/2}} \nabla f(\alpha^k)_{i_{q/2}} \le -y_{j_{q/2}} \nabla f(\alpha^k)_{j_{q/2}}$, we reduce the size of the working set. Note that the working set will not be empty, as there is at least one violating pair if $\alpha^k$ is not yet optimal.
Interestingly this working set selection was originally derived from the concept of feasible directions in constrained optimization though we feel a derivation from the violation of the KKT condition is more intuitive.
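The selection rule of (2.2) and (2.3) can be sketched in a few lines. This is only an illustrative reading of the rule described in this section, not the code of SVMlight; the function name `select_working_set`, its arguments, and the tie-breaking by index order are assumptions.

```python
def select_working_set(alpha, grad, y, C, q):
    """Return up to q/2 violating pairs (i, j), most violating first."""
    l = len(alpha)
    score = [-y[t] * grad[t] for t in range(l)]  # the values -y_t * grad f(alpha)_t
    i_up = [t for t in range(l)
            if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    i_low = [t for t in range(l)
             if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    i_up.sort(key=lambda t: -score[t])  # largest scores first, as in (2.2)
    i_low.sort(key=lambda t: score[t])  # smallest scores first, as in (2.3)
    pairs = []
    for i, j in zip(i_up[:q // 2], i_low[:q // 2]):
        if score[i] <= score[j]:  # not a violating pair: shrink the working set
            break
        pairs.append((i, j))
    return pairs

# Hypothetical input: alpha = 0, so grad f(alpha) = -e, with labels +1, +1, -1.
pairs = select_working_set([0.0, 0.0, 0.0], [-1.0, -1.0, -1.0],
                           [1, 1, -1], float("inf"), q=2)
```

With $q = 2$ the first returned pair is the maximal violating pair, and an empty list means that (2.1) holds.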
3 Existing Convergence Results
The asymptotic convergence of an optimization algorithm usually means that the limit of any convergent subsequence of its iterates is a (local) optimum. Note that a strict decrease of the objective value may not imply this property. The asymptotic convergence of decomposition methods was first studied in (Chang et al., 2000). However, the authors were able to consider only some types of decomposition methods which did not coincide with existing implementations. It was not until (Lin, 2001) that the asymptotic convergence of SVMlight was established:
Theorem 1 Assume the matrix $Q$ satisfies
$$\min_{I} \left( \min(\mathrm{eig}(Q_{II})) \right) > 0, \tag{3.1}$$
where $I$ is any subset of $\{1, \ldots, l\}$ with $|I| \le q$ and $\min(\mathrm{eig}(\cdot))$ is the smallest eigenvalue of a matrix. If $\{\alpha^k\}$ is the sequence generated by the decomposition method in Section 2, the limit of any of its convergent subsequences is an optimal solution of (1.1).
If the size of the working set is restricted to two (i.e., q = 2), (Lin, 2002a) provides a proof of the above theorem without any assumption.
Another property related to convergence is the “finite termination” of an algorithm: for a given stopping condition with any pre-specified tolerance, it concerns whether the optimization algorithm terminates in a finite number of iterations. The first such result for decomposition methods is in (Keerthi and Gilbert, 2002):
Theorem 2 If the algorithm in Section 2 is used and $q = 2$, for any given $\epsilon > 0$, after a finite number of iterations,
$$\max_{i \in I_{up}(\alpha)} -y_i \nabla f(\alpha)_i \le \min_{i \in I_{low}(\alpha)} -y_i \nabla f(\alpha)_i + \epsilon \tag{3.2}$$
is satisfied.
Note that Theorem 2 does not imply Theorem 1, as both sides of (3.2) are not continuous functions of α. That is, we cannot take their limits as ε → 0 and claim that any convergent point already satisfies the KKT condition and hence is an optimum. For the general situation of more than two elements in the working set, (Lin, 2002b) proves Theorem 2 under some minor assumptions.
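The lack of continuity can be seen in a tiny numerical sketch: moving a variable from exactly 0 to a slightly positive value changes the index sets, so the gap between the two sides of (2.1) jumps. The function name `kkt_gap` and the two-variable data are hypothetical, the gradient is held fixed only to isolate the effect of the index sets, and both index sets are assumed to be non-empty.

```python
def kkt_gap(alpha, grad, y, C):
    """Left side minus right side of (2.1); (3.2) asks for kkt_gap <= epsilon."""
    l = len(alpha)
    score = [-y[t] * grad[t] for t in range(l)]
    up = [score[t] for t in range(l)
          if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    low = [score[t] for t in range(l)
           if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    return max(up) - min(low)

y = [1, -1]
grad = [1.0, 1.0]  # hypothetical gradient, held fixed for the illustration
gap_at_bound = kkt_gap([0.0, 0.0], grad, y, C=1.0)   # I_up = {0}, I_low = {1}
gap_nearby = kkt_gap([1e-9, 1e-9], grad, y, C=1.0)   # both sets become {0, 1}
```

In this sketch the gap is negative at the bound and positive at the nearby point, although the two points differ by only 1e-9 in each coordinate.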
4 Main Results on Linear Convergence
Before proving the main results, we need some assumptions. First we assume that the kernel matrix is positive definite:
Assumption 1 K is positive definite.
Note that Q, the Hessian of (1.1), satisfies Q = diag(y) K diag(y) with diag(y)^{-1} = diag(y), so K and Q have the same eigenvalues and Q is positive definite as well. Then (1.1) is a strictly convex programming problem and hence has a unique global optimum α∗.
Theorem 1 implies that the whole sequence $\{\alpha^k\}$ of the decomposition method converges to $\alpha^*$. We can also see that Theorem II.3 of (Lin, 2002b) holds:
1. If the algorithm takes infinitely many iterations,
$$\max_{i \in I_{up}(\alpha^*)} -y_i \nabla f(\alpha^*)_i = \min_{i \in I_{low}(\alpha^*)} -y_i \nabla f(\alpha^*)_i.$$
Let us call the above quantity $b^*$.
2. After $k$ is large enough, only elements whose $-y_i \nabla f(\alpha^*)_i$ equal $b^*$ can still be modified. Furthermore, only such elements can still form violating pairs.
Therefore, in final iterations, the algorithm works only on a particular subset of variables. This makes our analysis easier, as convergence rates relate to behaviors in final iterations. Moreover, for this particular subset of variables, we need an additional assumption: problem (1.1) is non-degenerate.
Assumption 2 (Nondegeneracy) For the optimal solution $\alpha^*$, we have $\nabla f(\alpha^*)_i + b^* y_i \ne 0$ if $\alpha^*_i = 0$ or $C$.
This condition is also called strict complementarity in optimization terminology, as it means that the two values in
$$\alpha^*_i (Q\alpha^* - e + b^* y)_i = 0$$
of the KKT condition cannot both be zero. The situation is similar for $(C - \alpha^*_i)(Q\alpha^* - e + b^* y)_i = 0$. Therefore, after $k$ is large enough, all bounded variables are fixed and are not included in the working set. By treating bounded variables as constants, we are essentially solving a problem of the following form:
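$$\begin{aligned}
\min_{\alpha} \quad & f(\alpha) = \frac{1}{2}\alpha^T Q \alpha + p^T \alpha \\
\text{subject to} \quad & y^T \alpha = \Delta,
\end{aligned} \tag{4.1}$$
where $0 < \alpha^k_i < C$ for all $i$ even though we do not write down inequality constraints explicitly. Then the optimal solution $\alpha^*$ with its Lagrange multiplier $b^*$ can be obtained from the following linear system:
$$\begin{bmatrix} Q & y \\ y^T & 0 \end{bmatrix} \begin{bmatrix} \alpha^* \\ b^* \end{bmatrix} = \begin{bmatrix} -p \\ \Delta \end{bmatrix}. \tag{4.2}$$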
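In each iteration, we consider minimizing $f(\alpha^k_B + d)$ where $d$ is the direction moving from $\alpha^k_B$, so the sub-problem is
$$\begin{aligned}
\min_{d} \quad & \frac{1}{2} d^T Q_{BB} d + \nabla f(\alpha^k)_B^T d \\
\text{subject to} \quad & y_B^T d = 0,
\end{aligned} \tag{4.3}$$
where $\nabla f(\alpha^k) = Q\alpha^k + p$ now. If a solution of (4.3) is $d^k$, then $\alpha^{k+1}_B = \alpha^k_B + d^k$ and $\alpha^{k+1}_N = \alpha^k_N$. With the Lagrange multiplier $b^k$, this sub-problem can be solved by the following equation:
$$\begin{bmatrix} Q_{BB} & y_B \\ y_B^T & 0 \end{bmatrix} \begin{bmatrix} d^k \\ b^k \end{bmatrix} = \begin{bmatrix} -\nabla f(\alpha^k)_B \\ 0 \end{bmatrix}. \tag{4.4}$$
Using (4.2),
$$Q(\alpha^k - \alpha^*) = Q\alpha^k + p + b^* y = \nabla f(\alpha^k) + b^* y. \tag{4.5}$$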
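By defining $Y \equiv \mathrm{diag}(y)$ to be a diagonal matrix with elements of $y$ on the diagonal, with $y_i = \pm 1$ and hence $Yy = e$, we have from (4.5)
$$-YQ(\alpha^k - \alpha^*) = -Y\nabla f(\alpha^k) - b^* e.$$
Now without inequalities, a “maximal violating pair” is obtained simply from the maximal and the minimal elements of $-Y\nabla f(\alpha^k)$. As simultaneously subtracting a constant $b^*$ does not affect the order of a sequence, we have
$$\arg\max_i \left( -y_i (Q(\alpha^k - \alpha^*))_i \right) = \arg\max_i \left( -y_i \nabla f(\alpha^k)_i \right) \text{ and}$$
$$\arg\min_i \left( -y_i (Q(\alpha^k - \alpha^*))_i \right) = \arg\min_i \left( -y_i \nabla f(\alpha^k)_i \right). \tag{4.6}$$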
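For a working set of two elements, the linear system (4.4) can be solved in closed form. The following sketch assumes $B = \{i, j\}$, no active bounds, and a positive definite $Q_{BB}$; the function name `solve_pair` and the sample numbers (taken from a three-variable problem with off-diagonal magnitude 0.5) are assumptions for illustration.

```python
def solve_pair(Q, grad, y, i, j):
    """Solve (4.4) for B = {i, j}; return (d_i, d_j, b_k)."""
    # y_i d_i + y_j d_j = 0 forces d = t * (y_i, -y_j) for some scalar t.
    eta = Q[i][i] + Q[j][j] - 2.0 * y[i] * y[j] * Q[i][j]  # curvature along d
    t = (-y[i] * grad[i] + y[j] * grad[j]) / eta           # minimizer along d
    d_i, d_j = y[i] * t, -y[j] * t
    # Row i of (4.4): Q_ii d_i + Q_ij d_j + b_k y_i = -grad_i, and y_i^2 = 1.
    b_k = -y[i] * (grad[i] + Q[i][i] * d_i + Q[i][j] * d_j)
    return d_i, d_j, b_k

# Hypothetical data: Q for three points with a = 0.5, gradient at alpha = (2, 0, 2).
Q = [[1.0, 0.5, -0.5], [0.5, 1.0, -0.5], [-0.5, -0.5, 1.0]]
y = [1, 1, -1]
grad = [0.0, -1.0, 0.0]
d_i, d_j, b_k = solve_pair(Q, grad, y, i=1, j=0)
```

After the step, both variables in B share the value $-y_t \nabla f(\alpha^{k+1})_t = b^k$, which is why the same pair cannot be a violating pair in the next iteration.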
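The following two theorems are the main results on linear convergence. They require two technical lemmas, which are deferred to the end of this section.
Theorem 3 There is $c < 1$ such that after $k$ is large enough,
$$(\alpha^{k+1} - \alpha^*)^T Q (\alpha^{k+1} - \alpha^*) \le c (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*). \tag{4.7}$$
Proof. We directly calculate the difference between the $(k+1)$st and the $k$th iterations:
$$\begin{aligned}
& (\alpha^{k+1} - \alpha^*)^T Q (\alpha^{k+1} - \alpha^*) - (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*) && (4.8) \\
={}& 2 (d^k)^T (Q(\alpha^k - \alpha^*))_B + (d^k)^T Q_{BB} d^k \\
={}& (d^k)^T \left( 2 (Q(\alpha^k - \alpha^*))_B - \nabla f(\alpha^k)_B - b^k y_B \right) && (4.9) \\
={}& (d^k)^T \left( (Q(\alpha^k - \alpha^*))_B + (b^* - b^k) y_B \right) && (4.10) \\
={}& (d^k)^T \left( (Q(\alpha^k - \alpha^*))_B + (b^k - b^*) y_B \right) && (4.11) \\
={}& -\left[ -(Q(\alpha^k - \alpha^*))_B + (b^* - b^k) y_B \right]^T Q_{BB}^{-1} \left[ -(Q(\alpha^k - \alpha^*))_B + (b^* - b^k) y_B \right],
\end{aligned}$$
where (4.9) is from (4.4), (4.10) is from (4.5), (4.11) is obtained by using the fact $y_B^T d^k = 0$ from (4.4), and the last equality is from (4.4) and (4.5). If we define
$$\hat{Q} \equiv Y_B Q_{BB}^{-1} Y_B \text{ and } v \equiv -Y(Q(\alpha^k - \alpha^*)), \tag{4.12}$$
where $Y_B \equiv \mathrm{diag}(y_B)$, then $v_B = -Y_B (Q(\alpha^k - \alpha^*))_B$ and (4.8) becomes
$$-\left[ v_B + (b^* - b^k) e_B \right]^T \hat{Q} \left[ v_B + (b^* - b^k) e_B \right]. \tag{4.13}$$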
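Using the fact that at least one “maximal violating pair” is in $B$, with (4.6) we can define
$$v^1 \equiv \max_i (v_i) = \max_{i \in B} (v_i) \text{ and } v^l \equiv \min_i (v_i) = \min_{i \in B} (v_i). \tag{4.14}$$
We denote by $\min(\mathrm{eig}(\cdot))$ and $\max(\mathrm{eig}(\cdot))$ the minimal and maximal eigenvalues of a matrix, respectively. Then
$$\begin{aligned}
& \left[ v_B + (b^* - b^k) e_B \right]^T \hat{Q} \left[ v_B + (b^* - b^k) e_B \right] \\
\ge{}& \min(\mathrm{eig}(\hat{Q})) \left[ v_B + (b^* - b^k) e_B \right]^T \left[ v_B + (b^* - b^k) e_B \right] \\
\ge{}& \min(\mathrm{eig}(\hat{Q})) \frac{(v^1 - v^l)^2}{2} && (4.15) \\
\ge{}& \frac{\min(\mathrm{eig}(\hat{Q}))}{2} \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right)^2 \max(|v^1|, |v^l|)^2 && (4.16) \\
\ge{}& \frac{\min(\mathrm{eig}(\hat{Q}))}{2l} \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right)^2 (Q(\alpha^k - \alpha^*))^T Q(\alpha^k - \alpha^*) && (4.17) \\
\ge{}& \frac{\min(\mathrm{eig}(\hat{Q}))}{2l \max(\mathrm{eig}(Q^{-1}))} \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right)^2 (Q(\alpha^k - \alpha^*))^T Q^{-1} Q(\alpha^k - \alpha^*) \\
\ge{}& \frac{\min(\mathrm{eig}(\hat{Q}))}{2l \max(\mathrm{eig}(Q^{-1}))} \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right)^2 (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*), && (4.18)
\end{aligned}$$
where (4.15) is from (4.14) and Lemma 1, (4.16) is from Lemma 2, and (4.17) follows from (4.14).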
Here we give more details about the derivation of (4.16): If $v^1 v^l \le 0$, then of course
$$|v^1 - v^l| \ge \max(|v^1|, |v^l|).$$
With $y_i = \pm 1$,
$$\frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \le 1,$$
so (4.16) follows. On the other hand, if $v^1 v^l \ge 0$, we consider $v = (YQY)(-Y(\alpha^k - \alpha^*))$ from (4.12). Since $-e^T Y(\alpha^k - \alpha^*) = -y^T(\alpha^k - \alpha^*) = 0$, we can apply Lemma 2: With
$$|(YQY)^{-1}_{ij}| = |Q^{-1}_{ij} y_i y_j| = |Q^{-1}_{ij}| \text{ and } e^T (YQY)^{-1} e = y^T Q^{-1} y,$$
we have
$$|v^1 - v^l| \ge \max(|v^1|, |v^l|) - \min(|v^1|, |v^l|) \ge \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right) \max(|v^1|, |v^l|),$$
which implies (4.16).
Then we can define a constant $c$ as follows:
$$c \equiv 1 - \min_B \frac{\min(\mathrm{eig}(Q_{BB}^{-1}))}{2l \max(\mathrm{eig}(Q^{-1}))} \left( \frac{y^T Q^{-1} y}{\sum_{i,j} |Q^{-1}_{ij}|} \right)^2 < 1.$$
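Combining (4.13) and (4.18), after $k$ is large enough, (4.7) holds. □
The linear convergence of the objective function is as follows:
Theorem 4 There is $c < 1$ such that after $k$ is large enough,
$$f(\alpha^{k+1}) - f(\alpha^*) \le c (f(\alpha^k) - f(\alpha^*)).$$
Proof. We will show that for any $k$,
$$f(\alpha^k) - f(\alpha^*) = \frac{1}{2} (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*),$$
so the proof immediately follows from Theorem 3. Using (4.2),
$$\begin{aligned}
f(\alpha^k) - f(\alpha^*) &= \frac{1}{2} (\alpha^k)^T Q \alpha^k + p^T \alpha^k - \frac{1}{2} (\alpha^*)^T Q \alpha^* - p^T \alpha^* \\
&= \frac{1}{2} (\alpha^k)^T Q \alpha^k + (-Q\alpha^* - b^* y)^T \alpha^k - \frac{1}{2} (\alpha^*)^T Q \alpha^* - (-Q\alpha^* - b^* y)^T \alpha^* \\
&= \frac{1}{2} (\alpha^k)^T Q \alpha^k - (\alpha^*)^T Q \alpha^k + \frac{1}{2} (\alpha^*)^T Q \alpha^* && (4.19) \\
&= \frac{1}{2} (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*).
\end{aligned}$$
Since we always keep the feasibility of $\alpha^k$, we can use $y^T \alpha^k = \Delta$ to cancel out the term $y^T \alpha^*$ and have (4.19). □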
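Next we present two technical lemmas used earlier.
Lemma 1 If $v_1 \ge \cdots \ge v_l$,
$$\sum_{i=1}^{l} v_i^2 \ge \frac{(v_1 - v_l)^2}{2}.$$
Proof.
$$\sum_{i=1}^{l} v_i^2 \ge v_1^2 + v_l^2 \ge \frac{(v_1 - v_l)^2}{2}. \quad □$$
Lemma 2 If $Q$ is invertible, then for any $x$ such that
1. $e^T x = 0$,
2. $v \equiv Qx$, $\max_i ((Qx)_i) = v^1 > v^l = \min_i ((Qx)_i)$, and $v^1 v^l \ge 0$,
we have
$$\min(|v^1|, |v^l|) \le \left( 1 - \frac{e^T Q^{-1} e}{\sum_{i,j} |Q^{-1}_{ij}|} \right) \max(|v^1|, |v^l|).$$
Proof. Since $v^1 > v^l$ and $v^1 v^l \ge 0$, we have either $v^1 > v^l \ge 0$ or $0 \ge v^1 > v^l$. For the first case, if the result is wrong,
$$v^l > \left( 1 - \frac{e^T Q^{-1} e}{\sum_{i,j} |Q^{-1}_{ij}|} \right) v^1,$$
so for $j = 1, \ldots, l$,
$$v^1 - v_j \le v^1 - v^l < \left( \frac{e^T Q^{-1} e}{\sum_{i,j} |Q^{-1}_{ij}|} \right) v^1. \tag{4.20}$$
With $x = Q^{-1} v$ and (4.20),
$$\begin{aligned}
e^T x = e^T Q^{-1} v &= \sum_{i,j} Q^{-1}_{ij} v_j = \sum_{i,j} Q^{-1}_{ij} (v^1 - (v^1 - v_j)) \\
&\ge v^1 e^T Q^{-1} e - (v^1 - v^l) \sum_{i,j} |Q^{-1}_{ij}| \\
&> v^1 \left( e^T Q^{-1} e - \left( \frac{e^T Q^{-1} e}{\sum_{i,j} |Q^{-1}_{ij}|} \right) \sum_{i,j} |Q^{-1}_{ij}| \right) = 0
\end{aligned}$$
causes a contradiction. The case of $0 \ge v^1 > v^l$ is similar. □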
5 Some Practical Considerations
Earlier experiments have pointed out that if the kernel matrix is well conditioned, the decomposition method converges more quickly. This has been mentioned in, for example, (Hsu and Lin, 2002, Section 5).
Results in this paper provide more insights about this observation. Here, we discuss the situation when the RBF kernel is used (i.e., $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$). When $\gamma$ is large, $Q \to I$ and is well conditioned. We show that for larger $\gamma$, the linear convergence rate is faster:
Since $Q_{ii} = 1$, $i = 1, \ldots, l$ for the RBF kernel,
$$\sum_{i=1}^{l} \lambda_i = \mathrm{trace}(Q) = l,$$
where $\lambda_1, \ldots, \lambda_l$ are eigenvalues of $Q$. Therefore,
$$\min(\mathrm{eig}(Q_{BB}^{-1})) \le 1 \text{ and } \max(\mathrm{eig}(Q^{-1})) \ge 1.$$
With $(y^T Q^{-1} y) / \sum_{i,j} |Q^{-1}_{ij}| \le 1$,
$$\min_B \frac{\min(\mathrm{eig}(Q_{BB}^{-1}))}{l \max(\mathrm{eig}(Q^{-1}))} \left( \frac{y^T Q^{-1} y}{2 \sum_{i,j} |Q^{-1}_{ij}|} \right)^2 \le \frac{1}{4l}.$$
When $\gamma$ is large, $Q \to I$ so
$$\min_B \frac{\min(\mathrm{eig}(Q_{BB}^{-1}))}{l \max(\mathrm{eig}(Q^{-1}))} \left( \frac{y^T Q^{-1} y}{2 \sum_{i,j} |Q^{-1}_{ij}|} \right)^2 \to \frac{1}{4l},$$
its largest possible value. Therefore, the convergence seems faster when the kernel matrix is well-conditioned.
On the other hand, when Q is very ill-conditioned, 1/max(eig(Q−1)) = min(eig(Q)) can be very small. Then the rate constant c is close to 1, so the convergence is very slow. For linear SVM with the number of training samples greater than the number of attributes, Q is only positive semi-definite, so min(eig(Q)) = 0. In practice decomposition methods converge very slowly in such cases, and indeed SMO is considered possibly unsuitable for linear SVM (Chung et al., 2002). Though the results in this paper assume the positive definiteness of the kernel matrix, if we regard such linear SVM as ill-conditioned problems, our results also help to explain the slow convergence. We think that the theoretical properties of decomposition methods for linear SVM are worth further investigation.
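The claim that Q approaches the identity for large γ can be checked with a simple eigenvalue bound. The sketch below uses Gershgorin's circle theorem, which is not part of the analysis in this paper: since the diagonal of Q is 1 and $|Q_{ij}| = K(x_i, x_j)$, every eigenvalue lies in $[1 - r, 1 + r]$, where r is the largest off-diagonal row sum. The function name `rbf_eigenvalue_bounds` and the sample points (assumed distinct) are hypothetical.

```python
import math

def rbf_eigenvalue_bounds(X, gamma):
    """Gershgorin interval [1 - r, 1 + r] containing every eigenvalue of Q."""
    l = len(X)

    def k(u, v):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

    # |Q_ij| = K(x_i, x_j) because y_i * y_j is +1 or -1; the diagonal of Q is 1.
    r = max(sum(k(X[i], X[j]) for j in range(l) if j != i) for i in range(l))
    return 1.0 - r, 1.0 + r

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # hypothetical, distinct points
bounds = [rbf_eigenvalue_bounds(X, gamma) for gamma in (1.0, 5.0, 20.0)]
```

As γ grows the interval shrinks toward the single point 1, so min(eig(Q)) and max(eig(Q)) both approach 1 and the condition number approaches 1.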
6 An Example
We have shown that under some general conditions, the decomposition method discussed here is at least linearly convergent. However, it is still not clear whether the convergence is actually better than linear. Here, we present a simple example whose convergence is exactly linear. Hence, in theory, linear convergence is already the best possible worst-case result.
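Consider $x_1, x_2, x_3$ with $\|x_1 - x_2\| = \|x_1 - x_3\| = \|x_2 - x_3\|$, $y = [1, 1, -1]^T$, and $C = \infty$. If the RBF kernel is used, the dual SVM problem is
$$\begin{aligned}
\min_{\alpha_1, \alpha_2, \alpha_3} \quad & \frac{1}{2} \begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 \end{bmatrix} \begin{bmatrix} 1 & a & -a \\ a & 1 & -a \\ -a & -a & 1 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} - (\alpha_1 + \alpha_2 + \alpha_3) \\
\text{subject to} \quad & \alpha_1 + \alpha_2 - \alpha_3 = 0, \\
& 0 \le \alpha_1, \alpha_2, \alpha_3,
\end{aligned}$$
where $a = e^{-\gamma \|x_i - x_j\|^2}$. We assume $C$ is large so the upper bound is not needed here. At the optimal solution,
$$\alpha^* = \begin{bmatrix} \frac{2}{3(1-a)} & \frac{2}{3(1-a)} & \frac{4}{3(1-a)} \end{bmatrix}^T. \tag{6.1}$$
We will show that after $k$ is large enough,
$$(\alpha^{k+1} - \alpha^*)^T Q (\alpha^{k+1} - \alpha^*) = \frac{1}{4} (\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*). \tag{6.2}$$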
Now $q$, the size of the working set, must be two, so the three possible sets are {1, 2}, {1, 3}, and {2, 3}. We can see that Assumptions 1 and 2 are easily satisfied. Thus, after $k$ is large enough, $\alpha^k_i$, $i = 1, \ldots, 3$, are strictly positive. Then, in each iteration, after solving the sub-problem the two variables are positive so they have the same $y_i \nabla f(\alpha)_i$. Hence, under the rules of (2.2) and (2.3), either of the other two possible sets can be the working set of the next iteration. For example, if {1, 3} is the working set of the $k$th iteration, then for the $(k+1)$st iteration, either {2, 3} or {1, 2} can be used.
For convenience, we define $e^k_i \equiv \alpha^k_i - \alpha^*_i$, $i = 1, \ldots, 3$.
We claim that at the $k$th iteration:
1. If {1, 3} is the working set, then
$$2e^{k+1}_1 + e^{k+1}_2 = 0. \tag{6.3}$$
2. If {2, 3} is the working set, then
$$e^{k+1}_1 + 2e^{k+1}_2 = 0. \tag{6.4}$$
3. If {1, 2} is the working set, then
$$e^{k+1}_1 - e^{k+1}_2 = 0. \tag{6.5}$$
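For the first case, using (4.4),
$$\begin{bmatrix} 1 & -a \\ -a & 1 \end{bmatrix} \begin{bmatrix} \alpha^{k+1}_1 \\ \alpha^{k+1}_3 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \end{bmatrix} + b^{k+1} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = -\alpha^k_2 \begin{bmatrix} a \\ -a \end{bmatrix}. \tag{6.6}$$
With $\alpha^{k+1}_1 - \alpha^{k+1}_3 = -\alpha^k_2$, we have
$$\alpha^{k+1}_1 = \frac{1}{1-a} - \frac{\alpha^k_2}{2}.$$
Therefore, using (6.1) and $\alpha^{k+1}_2 = \alpha^k_2$,
$$2(\alpha^{k+1}_1 - \alpha^*_1) + (\alpha^{k+1}_2 - \alpha^*_2) = \frac{2}{1-a} - 3\alpha^*_2 = 0.$$
The second case, (6.4), can be derived in a similar way. For the third case, it is easy to see that if {1, 2} is the working set, $\alpha^{k+1}_1 = \alpha^{k+1}_2$. With $\alpha^*_1 = \alpha^*_2$,
$$e^{k+1}_1 = e^{k+1}_2 = \frac{e^k_1 + e^k_2}{2} = \frac{e^k_3}{2}. \tag{6.7}$$
Using $e^{k+1}_2 = e^k_2$, $e^{k+1}_1 = e^k_1$, and $e^{k+1}_1 = e^{k+1}_2 = e^k_3/2$ for the three respective cases, by induction, if $e^1_i \ne 0$, $i = 1, \ldots, 3$, then
$$e^k_i \ne 0, \quad i = 1, \ldots, 3, \text{ for all } k. \tag{6.8}$$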
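Now we are ready to prove (6.2). With $\alpha^k_3 = \alpha^k_1 + \alpha^k_2$,
$$(\alpha^k - \alpha^*)^T Q (\alpha^k - \alpha^*) = 2(1-a) \left( (e^k_1)^2 + (e^k_2)^2 + e^k_1 e^k_2 \right).$$
If {1, 3} is the working set, then with (6.3) and $\alpha^{k+1}_2 = \alpha^k_2$,
$$\begin{aligned}
(\alpha^{k+1} - \alpha^*)^T Q (\alpha^{k+1} - \alpha^*) &= 2(1-a) \left( (e^{k+1}_1)^2 + (e^{k+1}_2)^2 + e^{k+1}_1 e^{k+1}_2 \right) && (6.9) \\
&= \frac{3}{2} (1-a) (e^k_2)^2.
\end{aligned}$$
The validity of (6.2) requires
$$\frac{\frac{3}{2} (1-a) (e^k_2)^2}{2(1-a) \left( (e^k_1)^2 + (e^k_2)^2 + e^k_1 e^k_2 \right)} = \frac{1}{4},$$
which, under (6.8), is equivalent to
$$(e^k_1 - e^k_2)(e^k_1 + 2e^k_2) = 0. \tag{6.10}$$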
Since {1, 3} is the current working set, in the previous iteration, the set must be {1, 2} or {2, 3}. Thus, (6.10) follows from (6.5) and (6.4).
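The proof for the case that {2, 3} is the working set is very similar. If {1, 2} is the working set, putting (6.7) into (6.9), (6.10) becomes
$$(2e^k_1 + e^k_2)(e^k_1 + 2e^k_2) = 0.$$
Since the working set of the previous iteration must be {1, 3} or {2, 3}, this follows from (6.3) and (6.4), so the result also follows.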
Indeed by a more detailed description, we can show that if the initial solution is zero, (6.2) holds for all k = 1, 2, . . .
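The exact rate 1/4 can be observed numerically. The following sketch runs the two-variable decomposition method with maximal violating pair selection on this example, starting from zero. The value a = 0.5, the function name `run_example`, the number of iterations, and the tie-breaking by lowest index are assumptions made for illustration.

```python
def run_example(a=0.5, iters=10):
    """Run the q = 2 decomposition method on the three-variable example."""
    Q = [[1.0, a, -a], [a, 1.0, -a], [-a, -a, 1.0]]
    y = [1.0, 1.0, -1.0]
    C = float("inf")
    s = 1.0 / (3.0 * (1.0 - a))
    opt = [2.0 * s, 2.0 * s, 4.0 * s]  # the optimal solution (6.1)
    alpha = [0.0, 0.0, 0.0]

    def energy(al):
        # (alpha - alpha*)^T Q (alpha - alpha*)
        e = [al[t] - opt[t] for t in range(3)]
        return sum(e[r] * Q[r][c] * e[c] for r in range(3) for c in range(3))

    ratios = []
    for _ in range(iters):
        grad = [sum(Q[r][c] * alpha[c] for c in range(3)) - 1.0 for r in range(3)]
        score = [-y[t] * grad[t] for t in range(3)]
        up = [t for t in range(3)
              if (alpha[t] < C and y[t] > 0) or (alpha[t] > 0 and y[t] < 0)]
        low = [t for t in range(3)
               if (alpha[t] < C and y[t] < 0) or (alpha[t] > 0 and y[t] > 0)]
        i = max(up, key=lambda t: score[t])   # maximal violating pair (i, j)
        j = min(low, key=lambda t: score[t])
        eta = Q[i][i] + Q[j][j] - 2.0 * y[i] * y[j] * Q[i][j]
        step = (score[i] - score[j]) / eta    # unconstrained minimizer along the pair
        # Clip the step so that both variables stay inside [0, C].
        step = min(step,
                   C - alpha[i] if y[i] > 0 else alpha[i],
                   alpha[j] if y[j] > 0 else C - alpha[j])
        before = energy(alpha)
        alpha[i] += y[i] * step
        alpha[j] -= y[j] * step
        ratios.append(energy(alpha) / before)
    return alpha, opt, ratios

alpha, opt, ratios = run_example()
```

Each entry of `ratios` is the quotient of the two sides' quadratic forms in (6.2) for one iteration; in this run they all agree with 1/4 up to rounding error.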
7 Discussion
The decomposition method is an old optimization technique which is also called, for example, “coordinate search,” “method of alternating variables,” or “coordinate descent method.” However, in most cases only bound-constrained or unconstrained optimization problems are considered, for which linear convergence (without the non-degeneracy assumption) has been established in, for example, (Luo and Tseng, 1992) and references therein. With the additional linear constraint yTα = 0 and differences in the working set selection, we have not been able to obtain similar proofs without the non-degeneracy assumption. How to fill this gap is an issue for further research.
On the other hand, after using the non-degeneracy assumption and (Lin, 2002b, Theorem II.3) to remove inequalities, (4.1) is a very simple problem. Hence we essentially follow the structure of the proof of linear convergence of the steepest descent method for unconstrained convex quadratic programming problems (see, for example, (Nocedal and Wright, 1999, Chapter 3.3)). Two new things we have to take care of are:
1. Using the property that the “maximal violating pair” is selected, so that (4.16), an expression only on the variables of the working set, can be connected to (4.17), which is related to all variables.
2. Handling the linear constraint yTα = ∆. For the unconstrained case there are no b∗ and bk, so (4.15) can directly imply (4.16). Here we need Lemma 2 to connect them.
Acknowledgments
This work was supported in part by the National Science Council of Taiwan via the grant NSC 90-2213-E-002-111.
References
Chang, C.-C., C.-W. Hsu, and C.-J. Lin (2000). The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 11 (4), 1003–1008.
Chang, C.-C. and C.-J. Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chung, K.-M., W.-C. Kao, C.-L. Sun, and C.-J. Lin (2002). Decomposition methods for linear support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University.
Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20, 273–297.
Hsu, C.-W. and C.-J. Lin (2002). A simple decomposition method for support vector machines. Machine Learning 46, 291–314.
Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.
Keerthi, S. S. and E. G. Gilbert (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning 46, 351–360.
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001). Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation 13, 637–649.
Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12 (6), 1288–1298.
Lin, C.-J. (2002a). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks 13 (1), 248–250.
Lin, C.-J. (2002b). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Transactions on Neural Networks 13 (5), 1045–1052.
Luo, Z.-Q. and P. Tseng (1992). On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications 72 (1), 7–35.
Nocedal, J. and S. J. Wright (1999). Numerical Optimization. New York, NY: Springer-Verlag.
Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR’97, New York, NY, pp. 130–136. IEEE.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.