A Formal Analysis of Stopping Criteria of Decomposition Methods for Support Vector Machines

Chih-Jen Lin, Member, IEEE

Abstract—In a previous paper, we proved the convergence of a commonly used decomposition method for support vector machines (SVMs). However, there is no theoretical justification for its stopping criterion, which is based on the gap of the violation of the optimality condition. It is essential that this gap asymptotically approach zero, so that existing implementations are guaranteed to stop in a finite number of iterations after reaching a specified tolerance. Here, we prove this result and illustrate it with two extensions: ν-SVM and a multiclass SVM by Crammer and Singer. A further result shows that, in the final iterations of the decomposition method, only a particular set of variables is still being modified. This supports the use of the shrinking and caching techniques in some existing implementations. Finally, we prove the asymptotic convergence of a decomposition method for this multiclass SVM. Discussions on the difference between this convergence proof and the one in another paper by Lin are also included.

Index Terms—Asymptotic convergence, decomposition methods, stopping criteria, support vector machines (SVMs).

I. INTRODUCTION

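GIVEN a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in R^n$ and $y_i \in \{1, -1\}$, the support vector machines (SVMs) [3], [13] require the solution of the following optimization problem:

$$\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{1}$$

Here, training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\phi$. Then SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term. As the number of variables can become very large after mapping the data, in practice we solve the dual problem

$$\min_{\alpha} \quad f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \quad y^T \alpha = 0, \tag{2}$$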

Manuscript received June 1, 2001; revised January 16, 2002. This work was supported in part by the National Science Council of Taiwan under Grant NSC 90-2213-E-002-111.

The author is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: cjlin@csie.ntu.edu.tw).

Publisher Item Identifier S 1045-9227(02)04437-5.

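where $e$ is the vector of all ones, $C$ becomes the upper bound of all variables $\alpha_i$, and $Q$ is an $l$ by $l$ positive semidefinite matrix. Note that $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$ is called the kernel function. Then $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$ and

$$\operatorname{sgn}(w^T \phi(x) + b) = \operatorname{sgn}\Big( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \Big)$$

is the decision function.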
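As a small illustration of the quantities entering the dual problem (2), the following sketch builds the matrix whose entries are $y_i y_j K(x_i, x_j)$ and evaluates the dual objective $\frac{1}{2}\alpha^T Q \alpha - e^T \alpha$ and the decision function. The RBF kernel, the toy data, and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Assumed example kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def dual_objective(alpha, Q):
    """Objective of the dual problem: 0.5 * alpha' Q alpha - e' alpha."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def decision_function(X_train, y, alpha, b, X_new, gamma=0.5):
    """Sign of sum_i alpha_i y_i K(x_i, x) + b for each row x of X_new."""
    K = rbf_kernel(X_train, X_new, gamma)
    return np.sign((alpha * y) @ K + b)

# Toy XOR-like data (hypothetical).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * rbf_kernel(X, X)
```

Because $Q$ is dense and has $l^2$ entries, forming it explicitly as above is only feasible for small $l$, which is the motivation for decomposition methods.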

Because the matrix $Q$ is dense, the decomposition method is currently one of the major methods for solving SVMs (e.g., [6], [10], and [11]). It is an iterative process in which, at each iteration, the index set of variables is separated into two sets $B$ and $N$, where $B$ is the working set. Then, in that iteration, variables corresponding to $N$ are fixed, while a subproblem on variables corresponding to $B$ is minimized.

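Practically, we need a stopping condition for the decomposition method. Such a criterion usually uses the information of the Karush–Kuhn–Tucker (KKT) condition, that is, the optimality condition of (2): If $\alpha$ is an optimal solution of (2), there is a number $b$ and two nonnegative vectors $\lambda$ and $\mu$ such that

$$\nabla f(\alpha) + b y = \lambda - \mu, \quad \lambda_i \alpha_i = 0, \quad \mu_i (C - \alpha_i) = 0, \quad i = 1, \ldots, l,$$

where $\nabla f(\alpha) = Q\alpha - e$ is the gradient of the objective function of (2). This is usually rewritten as

$$\nabla f(\alpha)_i + b y_i \ge 0 \quad \text{if } \alpha_i = 0, \tag{3a}$$

$$\nabla f(\alpha)_i + b y_i \le 0 \quad \text{if } \alpha_i = C, \tag{3b}$$

$$\nabla f(\alpha)_i + b y_i = 0 \quad \text{if } 0 < \alpha_i < C. \tag{3c}$$

As $y_i = \pm 1$, we can further reformulate it as

$$-y_i \nabla f(\alpha)_i \le b \quad \text{if } y_i = 1, \alpha_i < C \ \text{ or } \ y_i = -1, \alpha_i > 0, \tag{4a}$$

$$-y_i \nabla f(\alpha)_i \ge b \quad \text{if } y_i = -1, \alpha_i < C \ \text{ or } \ y_i = 1, \alpha_i > 0. \tag{4b}$$

Since $C > 0$, by expressing inequalities of (4) as lower and upper bounds of $b$, this KKT condition is equivalent to

$$m(\alpha) \equiv \max\Big( \max_{\alpha_i < C,\, y_i = 1} -\nabla f(\alpha)_i, \ \max_{\alpha_i > 0,\, y_i = -1} \nabla f(\alpha)_i \Big) \le M(\alpha) \equiv \min\Big( \min_{\alpha_i < C,\, y_i = -1} \nabla f(\alpha)_i, \ \min_{\alpha_i > 0,\, y_i = 1} -\nabla f(\alpha)_i \Big). \tag{5}$$

Let $\alpha^k$ be the solution at the $k$th iteration. If $\alpha^k$ is not an optimal solution, then $m(\alpha^k) > M(\alpha^k)$. Hence, a natural stopping criterion might be

$$m(\alpha^k) - M(\alpha^k) \le \epsilon, \tag{6}$$

where $\epsilon$ is a stopping tolerance. We can see that (5) is a much simpler way of describing the KKT condition. Some existing working set selections also follow from identifying elements violating (5). More importantly, unlike earlier approaches where $b$ is calculated using (3c), the condition on free variables, we do not have to worry about whether there are free variables in the final solution. If all variables are at bounds, using (5), $b$ can be simply calculated as $(m(\alpha) + M(\alpha))/2$. Such a stopping criterion has been derived and used in, for example, [1] and [7].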
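The stopping rule can be sketched in a few lines. The code below assumes the usual maximal-violation form of the two bounds on $b$: `m` is the largest value of $-y_i \nabla f(\alpha)_i$ over indices with ($y_i = 1$, $\alpha_i < C$) or ($y_i = -1$, $\alpha_i > 0$), and `M` is the smallest value over indices with ($y_i = -1$, $\alpha_i < C$) or ($y_i = 1$, $\alpha_i > 0$). The function names, the numerical tolerance `tol`, and the midpoint choice for $b$ are illustrative assumptions, not the paper's code.

```python
import numpy as np

def violation_bounds(alpha, grad, y, C, tol=1e-12):
    """Return (m, M): the largest lower bound and smallest upper bound on b."""
    score = -y * grad
    up = ((y > 0) & (alpha < C - tol)) | ((y < 0) & (alpha > tol))
    low = ((y < 0) & (alpha < C - tol)) | ((y > 0) & (alpha > tol))
    m = score[up].max() if up.any() else -np.inf
    M = score[low].min() if low.any() else np.inf
    return m, M

def should_stop(alpha, grad, y, C, eps=1e-3):
    """Stopping test of the form m(alpha) - M(alpha) <= eps."""
    m, M = violation_bounds(alpha, grad, y, C)
    return m - M <= eps

def bias_from_bounds(m, M):
    """One possible choice of b once the gap is small: the midpoint."""
    return 0.5 * (m + M)

# At alpha = 0 the gradient of 0.5 a'Qa - e'a is -e, whatever Q is.
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha0 = np.zeros(4)
grad0 = -np.ones(4)
m0, M0 = violation_bounds(alpha0, grad0, y, C=1.0)
```

At the starting point the gap `m0 - M0` equals 2, so any tolerance below 2 keeps the method running, as expected for a non-optimal point.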

In an earlier work [8] on the convergence of the decomposition method proposed in the software $SVM^{light}$ [6], we focused on proving that any limit point of $\{\alpha^k\}$ is an optimal solution of (2). However, such results do not directly support the validity of using (6) as the stopping criterion. To be more precise, even if we have $\lim_{k \to \infty} \alpha^k = \bar\alpha$, which is an optimal solution, directly from the definition of $m(\alpha^k)$ and $m(\bar\alpha)$, we may not have $\lim_{k \to \infty} m(\alpha^k) = m(\bar\alpha)$. A similar problem happens for $M(\alpha^k)$ and $M(\bar\alpha)$. A reason is that it may be possible that $0 < \alpha_i^k < C$ for all $k$ but $\bar\alpha_i$ is at a bound. Therefore, we worry about the situation that $\{\alpha^k\}$ converges to an optimal solution but $m(\alpha^k) - M(\alpha^k)$ does not converge to zero. Then, the decomposition implementation never stops by using (6) as the criterion. In Section II, we prove that this situation will never happen.

Note that if the size of the working set is restricted to two, the finite termination of using the stopping criterion (6) has been proved in [7]. To be more precise, they prove that for any tolerance $\epsilon > 0$, the algorithm stops in a finite number of iterations. However, as discussed previously, their result does not imply the asymptotic convergence (i.e., that any limit point of $\{\alpha^k\}$ is an optimum). On the other hand, for more general analyses of the stopping criterion (6), we will use the asymptotic convergence which has been proved in [8].

In Section II, we also prove that most bounded variables are identified after finitely many steps, so the final iterations focus on a particular set of variables. This analysis supports the use of shrinking and caching techniques in the decomposition method. Section III then shows some extensions to a more complicated optimization formulation. We use two examples to illustrate our results: ν-SVM [12] and a multiclass SVM in [4]. We then also prove the asymptotic convergence of a decomposition method for this multiclass SVM. Discussions on the difference between this convergence proof and the one in [8] are also included.

II. MAIN RESULTS ON STOPPING CRITERIA

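Here, we consider a more general problem

$$\min_{\alpha} \quad f(\alpha)$$

$$\text{subject to} \quad y^T \alpha = \Delta, \quad l_i \le \alpha_i \le u_i, \quad i = 1, \ldots, l, \tag{7}$$

where $f$ is any convex and continuously differentiable function, $y_i = \pm 1$, $\Delta$ is a constant, and $l_i$ and $u_i$ are lower and upper bounds, respectively. Following the requirement that $C > 0$, here we consider only the following situation:

$$-\infty < l_i < u_i < \infty, \quad i = 1, \ldots, l. \tag{8}$$

If $l_i = u_i$ for some $i$, then $\alpha_i$ is fixed, so we can remove such variables before solving the problem by decomposition methods. We then define generalized $m(\alpha)$ and $M(\alpha)$:

$$m(\alpha) \equiv \max\Big( \max_{\alpha_i < u_i,\, y_i = 1} -\nabla f(\alpha)_i, \ \max_{\alpha_i > l_i,\, y_i = -1} \nabla f(\alpha)_i \Big) \tag{9}$$

and

$$M(\alpha) \equiv \min\Big( \min_{\alpha_i < u_i,\, y_i = -1} \nabla f(\alpha)_i, \ \min_{\alpha_i > l_i,\, y_i = 1} -\nabla f(\alpha)_i \Big). \tag{10}$$

We denote by $\arg m(\alpha)$ the set of indexes considered in (9) whose $-y_i \nabla f(\alpha)_i$ are the same as $m(\alpha)$. A similar definition goes for $\arg M(\alpha)$. Thus, if $\alpha^k$ is not an optimal solution yet, $m(\alpha^k) > M(\alpha^k)$, so $\arg m(\alpha^k) \cap \arg M(\alpha^k) = \emptyset$.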

The following theorem shows the validity of the stopping criterion (6).

Theorem II.1: Assume a decomposition method for solving (7) satisfies the following conditions.

1) .

2) At least one element of and one element of are included in the working set of each iteration.

3) For variables considered in (10), if are any two of them such that

a) satisfies

or b) satisfies

or c)

then, if is in the working set, must be selected as well.

4) Similar to Condition 3), we assume that analogous conditions hold for variables considered in (9).

5) $\{\alpha^k\}$ converges to an optimal solution of (7).

Then

(11)

Proof: We prove the theorem by contradiction and, thus, let us assume that the result (11) is wrong. Then, with Condition 1), there is an infinite set and a such that

(12)

As has infinitely many elements but the number of pairs of variables is finite, from Condition 2), there are indexes and and an infinite subsequence such that and are both in the working set. Without loss of generality, we can consider only the case that there is

such that and for all

and (13)

Thus, from the definition of and in (9) and (10)

and (14)

Since is continuously differentiable and

(15) From (12), (13), and (15), we have

(16) If we have an infinite subset of such that and , then the KKT condition of the subproblem implies

Since is a convergent sequence, taking the limit we have (17) which contradicts (16).

Therefore, we have that

or after is large enough

(18) Because of (18) and (14), there is an infinite set such that for all

or or

(19) and

(20) For the first case of (19), is selected in the working set and then modified. However, since (16) and Condition 5) imply

after is large enough, with in (19), and Condition 3) of this theorem, is also in the working set of the th iteration where . With (20), from the KKT condition of the subproblem

(21) The situation for the second case is similar. For the third case, both and are modified so and are in the working set. Hence, (21) is also valid.

Therefore, we have (21) for all . As goes to infinity, we again obtain (17) which contradicts (16). Thus, the assumption (12) is wrong so the proof is complete.

Note that Conditions 2)-4) of Theorem II.1 are requirements on the working set selection. We list conditions instead of focusing on a particular working set selection so that more flexible selections may be used.

We now check that the working set selection of $SVM^{light}$ satisfies the Conditions 2)-4) of Theorem II.1. If $q$, an even number, is the size of the working set, indexes are sequentially selected from elements that satisfy or

so that

and

(22) The other indexes are sequentially selected from elements

which satisfy or such that

and

(23)

It can be clearly seen directly from (22) and (23) that these conditions are satisfied.

An interesting note is that this working set selection was originally derived from the concept of feasible directions in constrained optimization but not from the violation of the KKT condition.

Regarding the global convergence of $\{\alpha^k\}$, which is Condition 5), unfortunately, we prove only a weaker result in [8]: under a minor assumption, every limit point of convergent subsequences is an optimal solution. However, results in [8] do imply the global convergence if (2) has a unique optimal solution. Then Theorem II.1 can be applied. For example, if $Q$ is positive definite, the solution of (2) is unique.

In the following we will show that for the algorithm used by $SVM^{light}$, Theorem II.1 is valid without needing the global convergence of $\{\alpha^k\}$. However, we still need the property proved in [8] that any limit point is an optimum. As all other conditions of Theorem II.1 are satisfied for this particular working set selection, the only remaining assumption is a minor one used in [8].

Theorem II.2: Under [8, Assumption IV.1], the decomposition method using (22) and (23) for selecting the working set

has that if , then

(24)

Proof: We also prove the theorem by contradiction. However, in addition to assuming an infinite sequence such that (12) holds, using results in [8], we further consider one of its convergent subsequences whose limit is an optimum of (2). Note that here any infinite sequence in the feasible region of (2) has at least one convergent subsequence because (8) implies that the feasible region is compact. Then, with [8, Assumption IV.1], the limit point is an optimum. Therefore, we can consider an infinite set and such that (12) holds and

is an optimum of (2).

Then, until (16), the proof is similar to that of Theorem II.1. Of course, in (15) must be replaced by


Remember that we consider the case of . We then claim that for all large enough

and (25)

If (25) is wrong

or

or (26)

For the first case, is selected in the working set and modified. Since is a convergent subsequence, [8, Th. IV.3] implies that also converges to . Thus, after is large enough, (16) implies

Hence, (22) and (23) imply that is selected in the working set as well. Therefore, from the KKT condition of the subproblem at the st iteration

which is impossible after is large enough. The situation for other cases of (26) is similar. Therefore, (25) is correct.

Since [8, Th. IV.3] shows that for any given

converges to the same point as , using the same argument above we have that for any given , after is large enough

and and

Consider . Using the same counting procedure in [8, Th. IV.5], we can show that at some and

are both selected in the working set so the KKT condition of the subproblem again shows

This is in contradiction to (27).

Therefore, the assumption (12) is wrong so

A special case of the working set selection using (22) and (23) is to restrict the size of the working set to two. That is, only

one element of and one element of are

included in the working set. This is an algorithm discussed in [7] and used in, for example, LIBSVM [1]. For this special case, [9] proves that [8, Assumption IV.1] is not necessary. Hence Theorem II.2 is valid without needing any assumption, exactly the same as results from [7].
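To make the size-two special case concrete, here is a minimal sketch of a decomposition method for the dual problem (2) in which each working set is a maximal violating pair and the loop stops when the gap between the largest lower bound and the smallest upper bound on $b$ falls below a tolerance. It is an illustration written under stated assumptions, not the code of LIBSVM or of [7]: the function name, the tolerances, the safeguard on the curvature, and the toy data are all hypothetical.

```python
import numpy as np

def smo_max_violating_pair(Q, y, C, eps=1e-3, max_iter=100_000, tol=1e-12):
    """Sketch: min 0.5 a'Qa - e'a  s.t.  0 <= a_i <= C,  y'a = 0."""
    alpha = np.zeros(len(y))
    grad = -np.ones(len(y))              # gradient Q @ alpha - e at alpha = 0
    for it in range(max_iter):
        score = -y * grad
        up = ((y > 0) & (alpha < C - tol)) | ((y < 0) & (alpha > tol))
        low = ((y < 0) & (alpha < C - tol)) | ((y > 0) & (alpha > tol))
        if not up.any() or not low.any():
            return alpha, -np.inf, it    # one bound is vacuous: nothing violates
        i = np.flatnonzero(up)[np.argmax(score[up])]    # attains the max
        j = np.flatnonzero(low)[np.argmin(score[low])]  # attains the min
        gap = score[i] - score[j]
        if gap <= eps:                   # stopping criterion on the gap
            return alpha, gap, it
        # Move along d_i = y_i * t, d_j = -y_j * t, which keeps y'a unchanged.
        curv = max(Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j], 1e-12)
        room_i = C - alpha[i] if y[i] > 0 else alpha[i]
        room_j = alpha[j] if y[j] > 0 else C - alpha[j]
        t = min(gap / curv, room_i, room_j)
        d_i, d_j = y[i] * t, -y[j] * t
        alpha[i] += d_i
        alpha[j] += d_j
        grad += Q[:, i] * d_i + Q[:, j] * d_j   # only two columns of Q are used
    return alpha, gap, max_iter

# Toy XOR-like data with an RBF kernel (hypothetical example).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Q = np.outer(y, y) * K
alpha, gap, iters = smo_max_violating_pair(Q, y, C=10.0)
```

Each iteration touches only the two columns of $Q$ that belong to the working set, which is why the size-two case is attractive when $Q$ cannot be stored.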

Next, based on the above results, we show that after is large

enough, only elements whose are or

can still be modified. For simplification, we only show results extended from Theorem II.1.

Theorem II.3: Under the same assumptions as Theorem II.1, we have the following result:

For any whose corresponding is neither nor , after is large enough, is always at a bound which is equal to .

Proof: First, we know that from the KKT condition,

if is neither nor is at a bound.

Without loss of generality, we consider

with and (28)

If the result of this theorem is wrong, happens infinitely many times. Therefore, from (28), there is an infinite set and

a such that

(29)

Now, for any , we have

or . Therefore, or

after is large enough. Since is a convergent sequence, is in a compact region. With

, there is an infinite subset of

such that exists and

(30)

Hence, (29) and (30) imply

(31)

which contradicts Theorem II.1. This completes the proof. Therefore, after is large enough, only elements in

(32) can still be possibly modified.

This analysis supports the use of shrinking techniques [6] in the decomposition method, since in the final iterations it is possible that most variables are not changed any more. That is, after identifying some variables which may eventually be at the bounds, we temporarily remove them and solve a smaller optimization problem. Though a final check is still needed, in general the training time can be greatly reduced.
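One way such a heuristic could look for the dual problem (2) is sketched below. The rule used here is an assumption borrowed from common practice rather than something stated in this paper: a variable at a bound is marked as removable when its value of $-y_i \nabla f(\alpha)_i$ lies strictly outside the interval spanned by the current smallest upper bound `M` and largest lower bound `m` on $b$, on the side that keeps it at that bound. The function name and the tolerance are hypothetical.

```python
import numpy as np

def shrink_candidates(alpha, grad, y, C, tol=1e-12):
    """Boolean mask of bounded variables that look safe to remove temporarily."""
    score = -y * grad
    at_lower = alpha <= tol
    at_upper = alpha >= C - tol
    up = ((y > 0) & ~at_upper) | ((y < 0) & ~at_lower)
    low = ((y < 0) & ~at_upper) | ((y > 0) & ~at_lower)
    m = score[up].max() if up.any() else -np.inf
    M = score[low].min() if low.any() else np.inf
    bounded = at_lower | at_upper
    # A bounded variable belongs to exactly one of the two index sets.
    return bounded & ((low & (score > m)) | (up & (score < M)))

# Hypothetical state: variables 0 and 2 are free, variables 1 and 3 are at zero.
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])
grad = np.array([0.0, 2.0, 0.0, 3.0])
mask = shrink_candidates(alpha, grad, y, C=1.0)
```

In this example the two free variables stay in the problem while the two variables at the lower bound are marked, which matches the picture that only a small set of variables is still modified near the end. A final check on the full problem is still required after the reduced problem has been solved.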

Caching is another popular technique employed in implementations of decomposition methods. We store recently used kernel elements in the computer memory in order to reduce the number of kernel evaluations. Note that in each iteration, as the variables in the working set are updated, the corresponding columns of the Hessian matrix are needed for updating the gradient. Theorem II.3 supports this caching strategy, as it shows that in the final iterations only some particular columns of the kernel matrix are still needed.
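A kernel cache of this kind can be sketched as a least-recently-used store of kernel columns. The class below is a minimal illustration under assumptions: the class name, the LRU eviction policy, and the `kernel(x, z)` callable are hypothetical and are not the caching scheme of any particular package.

```python
from collections import OrderedDict
import numpy as np

class KernelColumnCache:
    """Keep at most `capacity` kernel columns, evicting the least recently used."""

    def __init__(self, X, kernel, capacity):
        self.X = X
        self.kernel = kernel        # callable kernel(x, z) -> float
        self.capacity = capacity
        self._cols = OrderedDict()
        self.hits = 0
        self.misses = 0

    def column(self, i):
        if i in self._cols:
            self._cols.move_to_end(i)       # mark column i as most recently used
            self.hits += 1
            return self._cols[i]
        self.misses += 1
        col = np.array([self.kernel(x, self.X[i]) for x in self.X])
        self._cols[i] = col
        if len(self._cols) > self.capacity:
            self._cols.popitem(last=False)  # drop the least recently used column
        return col

# Hypothetical usage with a linear kernel and room for two columns.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cache = KernelColumnCache(X, kernel=lambda a, b: float(a @ b), capacity=2)
for idx in [0, 1, 0, 2, 1]:
    cache.column(idx)
```

If the final iterations keep requesting the same few columns, almost every request becomes a hit, so the cost of kernel evaluations drops sharply.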

For the algorithm used in $SVM^{light}$, Theorem II.3 becomes as follows:

Theorem II.4: Under [8, Assumption IV.1], if (22) and (23) are used for selecting the working set and is the limit point of any convergent subsequence , we have the following result.

For any whose corresponding is neither nor , after is large enough, is always at a bound which is equal to .

The proof is nearly the same as that for Theorem II.3.

To conclude this section, we note that (7) is a more general formulation, so results in this section apply to different formulations discussed in [8] such as support vector regression and one-class SVM.

III. EXTENSIONS

In this section, we consider a more general problem

subject to

(33)

where and . Therefore, there

are variables and linear equality constraints. Hence, it is as if there are groups of variables, where each group satisfies a linear constraint. The KKT condition requires that there are

such that for all

if if

where means . We can rewrite the KKT

condition as

(34) By defining

and the stopping criterion can be

(35) where is the stopping tolerance.

Similar to the situation in Section II, we can define two sets and . With some minor modifications on Conditions 2)–4), we can have results similar to Theorem II.1.

Theorem III.1: Assume all conditions of Theorem II.1 hold with Conditions 2)–4) replaced by the following conditions.

2’ In each iteration, if the th group has the largest , then at least one of

and one of are included in the working set.

3’ In each iteration, variables of the group with the largest satisfy Condition 3).

4’ In each iteration, variables of the group with the

smallest satisfy Condition 4).

Then

(36)

Some SVM formulations are of this form. We will give two examples: ν-SVM and a multiclass SVM by Crammer and Singer. The ν-SVM [12] can be written as the following form [2]:

subject to

where $\nu$ is a parameter to adjust the number of support vectors and training errors, and are numbers of training data in two classes, and . A stopping criterion like (35) has been used in the experiments of [2], which implemented a decomposition method modified from the one in [8]. We can easily check that Conditions 2’–4’ of Theorem III.1 are satisfied. Regarding the convergence of $\{\alpha^k\}$, though we have not explicitly written down the proof, we conjecture that the same results in [8] that every limit point is an optimal solution should still apply.

Another example is a formulation for multiclass SVM by Crammer and Singer [4]

subject to (37a)

(37b) where is the Kronecker product, is the number of classes, is an by kernel matrix, is an by identity matrix, each is an r by 1 constant vector, and is the label of the th data, and

if if

Here, is an by l vector variable and we denote

and

Hence, the stopping criterion can be

(38)

which is directly derived from (34) and (35). Note that since there are no lower bounds and all coefficients in (37a) are ,


Now there is no lower bound on the variables, so our requirement that lower bounds must be larger than $-\infty$ seems to be violated. However, for this problem we can easily set a finite lower bound as follows. Using (37a)

(39) Thus, (37) is still in the form of (33). Since the feasibility is always kept, never reaches the lower bound . Hence, the stopping criterion (35) still reduces to (38).

Implementations of decomposition methods for (37) have been discussed in [4] and [5]. Basically, the th group of variables which has the maximal violation in (35) becomes the working set. Thus, the subproblem at the th iteration is

subject to

(40)

where means . Note that (40) is a

very simple problem. In [4], two methods were proposed for it: one is an algorithm, while the other is an iterative procedure.

Since a whole group of variables is selected, Conditions 2’–4’ of Theorem III.1 hold. If the sequence $\{\alpha^k\}$ converges to an optimal solution, we have (36). In Section IV, we will prove the convergence of this decomposition method.

IV. CONVERGENCE OF A DECOMPOSITION METHOD FOR MULTICLASS SVM BY CRAMMER AND SINGER

The decomposition method mentioned in Section III for multiclass SVM is not in the category of decomposition methods considered in [8]. Hence, proofs in [8] cannot be directly used. However, since the working set selection and the subproblem are quite special, in that each time one of the groups of variables is considered, we will show that the convergence proof is even simpler. First, we prove a simple lemma.

Lemma IV.1: Consider the following problem:

subject to

(41)

where , and . Let

be the smallest eigenvalue of a matrix. If there is such that , then at an optimal solution of (41)

Proof: From the KKT condition of (41), if is an optimal solution, there is a such that

if if if

Since and

With

To use Lemma IV.1, we make the following assumption. Assumption IV.2: There exists such that for all

, the kernel matrix satisfies

From now on, we consider any one of the convergent subsequences and assume

(42)

We then have the following lemma.

Lemma IV.3: Assume and are as defined in (42). Then for any given positive integer , the sequence

converges to .

Proof: If the th group is selected as the working set, we have

where

and

Hence, an equivalent form of the subproblem (40) is

subject to

(43) where is the variable. Since is a feasible point of (37),

.

As the smallest eigenvalue of the Hessian of (43) is and (43) is in the form of (41), from Assumption IV.2 and Lemma IV.1, we have


Next, we show that is a convergent sequence. First, we know that is decreasing. Using (39), the feasible region of (37) is a compact set, so exists and

(45) Then, for the subsequence , from (44) and (45) we have

Thus

From , we can prove too.

Therefore, for any given .

We then need a technical lemma.

Lemma IV.4: Assume and are as defined in (42).

If , then after is large enough,

are not changed.

Proof: Assume is selected and changed infinitely many times. At any of these , we solve the subproblem (40) to obtain . Now, consider the following problem with variable

subject to

(46) Since is not an optimal solution of (46), we assume an optimal solution of (46) is .

Since is an optimal solution of (40), we have

(47)

As goes to infinity, from Lemma IV.3, (47) becomes

which contradicts that is not an optimal solution of (46). Therefore, is not selected after is large enough. Hence,

remains the same.

The main result on the convergence is the following.

Theorem IV.5: Assume $\{\alpha^k\}$ is the sequence generated using the algorithm by Crammer and Singer. Under Assumption IV.2, for any convergent subsequence of $\{\alpha^k\}$, its limit point is a global minimum of (37).

Proof: Using (39), we know that the feasible region of (37) is compact. Hence, has convergent subsequences. Assume is the limit point of one convergent subsequence . If is not an optimal solution of (37), from the KKT condition (34), there are some groups such that

We define

(48)

Using Lemma IV.4 and the continuity of , we can consider all large enough such that the following three statements are valid.

1) Those satisfying are not changed

any more. 2) For all

(49) 3) For all

(50) In the rest of this proof, we will have a procedure to show that some group with is still selected. This then contradicts Lemma IV.4.

Consider the th iteration where a group is selected and modified. Thus

but (51)

Using (38), we assume that

(52) Then, at the st iteration, if is selected, then

is not changed. Thus, if

(53) using (50), (52), (53), and the definition of , we have

Thus

(54) Similarly

so with (51)


Thus, in the next iterations, no matter is selected or not, we have

(56) On the other hand, from (48) and (49) and the fact that

(57)

some group with satisfies

(58)

Note that the first inequality of (58) comes from a similar derivation of (55). For (55), in (54) the difference between and is estimated. For (58), since (57), we can directly measure the difference between and .

Hence, (56) and (58) imply that should not be selected in the th iterations. Therefore, since there are groups of variables, in iterations, some group with

must be selected. This contradicts Lemma IV.4, which shows that this th group of variables should not be selected.

Therefore, any limit point of is a KKT point of (37). As (37) is a convex optimization problem, any limit point is a global minimum.

If the kernel matrix $K$ is positive definite, $K \otimes I$ is also positive definite. Then, (37) is a strictly convex problem so there is a unique optimal solution. Thus, $\{\alpha^k\}$ is a globally convergent sequence if $K$ is positive definite.
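The claim about positive definiteness can be checked numerically. The sketch below assumes that the Hessian of (37) has the Kronecker form $K \otimes I$ described after (37), with $I$ the identity matrix whose order is the number of classes. The eigenvalues of $K \otimes I$ are exactly those of $K$, each repeated once per class, so the smallest eigenvalue is unchanged. The function name and the small matrix are hypothetical.

```python
import numpy as np

def kron_identity_eigs(K, r):
    """Eigenvalues of K and of kron(K, I_r), both in ascending order."""
    eig_K = np.linalg.eigvalsh(K)
    eig_big = np.linalg.eigvalsh(np.kron(K, np.eye(r)))
    return eig_K, eig_big

# Hypothetical 2 x 2 positive definite kernel matrix and r = 3 classes.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
eig_K, eig_big = kron_identity_eigs(K, 3)
```

Since `eig_big` is `eig_K` with each value repeated three times, a positive smallest eigenvalue of the kernel matrix carries over to the larger matrix.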

V. CONCLUSION AND DISCUSSION

Originally, we tried to prove Theorem II.1 by using as few conditions on the decomposition methods as possible. Surprisingly, we finally needed most properties of an existing working set selection. After the convergence proof in [8] was finished, an open question that remained was whether more flexible working set selections still lead to convergence. So far, unfortunately, in many scenarios, properties of a systematic working set selection are always needed. This seems to hint that proving more general convergence results may not be an easy task.

Next, we discuss the convergence proof for the formulation by Crammer and Singer. We can see that Assumption IV.2 is generally true. For example, for the polynomial kernel , if all data are not zero vectors, . On the other hand, in [8] it assumes

, where is any subset of with is a square submatrix of , and

is the smallest eigenvalue of a matrix. We have to consider any as there are no restrictions on the working set. On the other hand, the reason why Assumption IV.2 is simpler is that now in each iteration one of the groups is selected where each group

has variables . Hence, the square submatrix of is reduced to a small diagonal matrix . Then all eigenvalues of are .

The proof in this paper also shows the importance of Lemma IV.3. For both proofs here and in [8], as we are not able to prove the global convergence of , instead we

prove that converges if is a

convergent subsequence. Then, this property is used to link several subproblems in subsequent iterations.

ACKNOWLEDGMENT

The author thanks four anonymous referees for their helpful comments.

REFERENCES

[1] C.-C. Chang and C.-J. Lin. (2001) LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[2] C.-C. Chang and C.-J. Lin, “Training ν-support vector classifiers: Theory and algorithms,” Neural Comput., vol. 13, no. 9, pp. 2119–2147, 2001.

[3] C. Cortes and V. Vapnik, “Support-vector network,” Machine Learning, vol. 20, pp. 273–297, 1995.

[4] K. Crammer and Y. Singer, “On the algorithmic implementation of multiclass kernel-based vector machines,” J. Machine Learning R., vol. 2, pp. 265–292, 2001.

[5] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neural Networks, vol. 13, pp. 415–425, Mar. 2002.

[6] T. Joachims, Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.

[7] S. S. Keerthi and E. G. Gilbert, “Convergence of a generalized SMO algorithm for SVM classifier design,” Machine Learning, vol. 46, pp. 351–360, 2002.

[8] C.-J. Lin, “On the convergence of the decomposition method for support vector machines,” IEEE Trans. Neural Networks, vol. 12, pp. 1288–1298, Nov. 2001.

[9] C.-J. Lin, “Asymptotic convergence of an SMO algorithm without any assumptions,” IEEE Trans. Neural Networks, vol. 13, pp. 248–250, Jan. 2002.

[10] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: An application to face detection,” in Proc. CVPR’97, 1997, pp. 130–136.

[11] J. C. Platt, Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.

[12] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett, “New support vector algorithms,” Neural Comput., vol. 12, pp. 1207–1245, 2000.

[13] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

Chih-Jen Lin (S’91–M’98) received the B.S. degree in mathematics from National Taiwan University, Taipei, Taiwan, in 1993. He received the M.S. and Ph.D. degrees from the Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, in 1997 and 1998, respectively.

Since September 1998, he has been an Assistant Professor in the Department of Computer Science and Information Engineering, National Taiwan University. His research interests include machine learning, numerical optimization, and various applications of operations research.