The analysis of decomposition methods for support vector machines


Chih-Chung Chang, Chih-Wei Hsu, and Chih-Jen Lin

Abstract—The support vector machine (SVM) is a new and promising technique for pattern recognition. It requires the solution of a large dense quadratic programming problem. Traditional optimization methods cannot be directly applied due to memory restrictions. Up to now, very few methods can handle the memory problem and an important one is the “decomposition method.” However, there is no convergence proof so far. In this paper, we connect this method to projected gradient methods and provide theoretical proofs for a version of decomposition methods. An extension to bound-constrained formulation of SVM is also provided. We then show that this convergence proof is valid for general decomposition methods if their working set selection meets a simple requirement.

Index Terms—Decomposition methods, projected gradients, support vector machines.

I. INTRODUCTION

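The support vector machine (SVM) is a new and very promising classification technique for pattern recognition. Surveys of SVM are, for example, Burges [1], Cortes and Vapnik [2], Schölkopf et al. [3], and Vapnik [4]. Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, of length $n$, and a vector $y \in R^l$ defined as follows:

$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1,} \\ -1 & \text{if } x_i \text{ in class 2,} \end{cases}$$

the support vector technique in general requires the solution of the following quadratic programming problem:

$$\begin{aligned} \min_{\alpha} \quad & \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \\ \text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\ & y^T \alpha = 0, \end{aligned} \tag{1}$$

where $e$ is the vector of all ones, $C$ is the upper bound of all variables, and $Q$ is a positive semidefinite matrix. Possible choices of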

are, for example, and .

Note that $x_i$ is considered as a support vector if $\alpha$ is the solution of (1) and $\alpha_i > 0$.
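To make the ingredients of problem (1) concrete, the following minimal sketch builds a matrix $Q$ and evaluates the objective and the constraints. It assumes entries of the common form $Q_{ij} = y_i y_j K(x_i, x_j)$ with a Gaussian kernel; the kernel choice, the parameter `gamma`, and the function names `build_Q`, `objective`, and `is_feasible` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def build_Q(X, y, gamma=0.5):
    """Dense Q with Q_ij = y_i * y_j * K(x_i, x_j); the Gaussian kernel is an assumed choice."""
    sq = np.sum(X ** 2, axis=1)
    sq_dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    K = np.exp(-gamma * sq_dist)
    return (y[:, None] * y[None, :]) * K

def objective(Q, alpha):
    """Objective of (1): 0.5 * alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def is_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C and y^T alpha = 0 up to a tolerance."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return bool(in_box and abs(y @ alpha) <= tol)
```

Note that `build_Q` stores all $l^2$ entries, which is exactly the memory cost that motivates the decomposition methods discussed next.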

Manuscript received November 10, 1999; revised April 18, 2000. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC-88-2213-E-002-097.

The authors are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, R.O.C. (e-mail: [email protected]).

Publisher Item Identifier S 1045-9227(00)05952-X.

The difficulty of solving (1) is the density of $Q$: because $Q_{ij}$ is in general not zero, $Q$ becomes a fully dense matrix. Hence a prohibitive amount of memory is required to store the matrix. Thus, traditional optimization algorithms such as Newton, quasi-Newton, etc., cannot be directly applied. Several authors (e.g., [5]–[10]) have proposed methods with successful implementations to conquer this difficulty. Unlike the Newton method, which usually involves the whole Hessian matrix $Q$, these methods only calculate components of $Q$ when they are required in the current iteration. Among them, an important one is the “decomposition method” proposed by Osuna et al. [8]. As a variation of the active set methods, they separated the index set $\{1, \ldots, l\}$ of the training set into two sets $B$ and $N$, where $\alpha_i = 0$ for $i \in N$ and $B$ is the working set if $\alpha$ is the current iterate of the algorithm. If we denote $\alpha_B$, $\alpha_N$, $y_B$, and $y_N$ as vectors containing corresponding elements, the objective value is equal to

$$\tfrac{1}{2} \alpha_B^T Q_{BB} \alpha_B - (e_B - Q_{BN} \alpha_N)^T \alpha_B + \tfrac{1}{2} \alpha_N^T Q_{NN} \alpha_N - e_N^T \alpha_N.$$

At each iteration, $\alpha_N$ is fixed and the following subproblem with the variable $\alpha_B$ is solved:

$$\begin{aligned} \min_{\alpha_B} \quad & \tfrac{1}{2} \alpha_B^T Q_{BB} \alpha_B - (e_B - Q_{BN} \alpha_N)^T \alpha_B \\ \text{subject to} \quad & 0 \le (\alpha_B)_i \le C, \quad i = 1, \ldots, q, \\ & y_B^T \alpha_B = -y_N^T \alpha_N, \end{aligned} \tag{2}$$

where $\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}$ is a permutation of the matrix $Q$ and $q$ is the size of $B$. Elements of $\alpha_N$ are kept as zero from the beginning. The Karush–Kuhn–Tucker (KKT) condition of (2)

includes of (2) and

(3)

If there is one satisfying , from (3),

could be easily obtained. Since all components of are zero, if is an optimal solution, satisfies the KKT condition of

(2): , for . Therefore, the algorithm

moves elements of $N$ which violate the KKT condition of (2) to $B$, and moves zero elements out of $B$ to $N$. If the size of $B$ is bigger than the number of support vectors, because no cycle happens, the algorithm stops in finite steps.
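The data of subproblem (2) can be sketched as follows. The sketch assumes that (2) has the usual form with quadratic term $Q_{BB}$, linear term $-(e_B - Q_{BN}\alpha_N)$, and equality right-hand side $-y_N^T \alpha_N$; the function name `subproblem` and its return layout are hypothetical.

```python
import numpy as np

def subproblem(Q, y, alpha, B):
    """Return (Q_BB, p, rhs, N) for a working set B, where the subproblem is
    min 0.5 * a^T Q_BB a + p^T a  subject to  0 <= a <= C,  y_B^T a = rhs."""
    B = np.asarray(B)
    N = np.setdiff1d(np.arange(Q.shape[0]), B)   # indices held fixed
    Q_BB = Q[np.ix_(B, B)]
    Q_BN = Q[np.ix_(B, N)]
    p = Q_BN @ alpha[N] - 1.0                    # equals -(e_B - Q_BN alpha_N)
    rhs = -(y[N] @ alpha[N])                     # keeps y^T alpha = 0 overall
    return Q_BB, p, rhs, N
```

Only the rows of $Q$ indexed by $B$ are touched here, which is why a small working set keeps the memory footprint low. In Osuna's first variant $\alpha_N = 0$, so `p` reduces to $-e_B$ and `rhs` to zero.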

The main shortcoming of the above algorithm is that we do not know the number of support vectors a priori. To handle this issue, the same authors [11] proposed to use a small number $q$ as the size of $B$. Furthermore, there is no restriction that $\alpha_N$ must only contain zeros. To be more precise, any element in $N$ which violates the KKT condition can enter $B$. Since $q$ is small in general (less than 100), this method never faces a memory problem. Good numerical results were reported, and several implementations and further improvements are given in [5], [9], [10]. However, even though the strict decrease of the objective function still holds, there is no theoretical proof to show that the sequence $\{\alpha^k\}$ converges to an optimal solution. This issue has been circulated in the SVM community for a while (see, for example, the discussion in Smola and Schölkopf [12, Sec. 5.5.2], and Keerthi et al. [13, Sec. 3]), but there is no satisfactory solution so far.


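Unlike Osuna et al., where a more “random” selection is used, Joachims [5] solves the following problem in order to select the working set:

\begin{align}
\min_{d} \quad & \nabla f(\alpha^k)^T d \nonumber \\
\text{subject to} \quad & y^T d = 0, \quad -1 \le d_i \le 1, \tag{4a} \\
& d_i \ge 0 \quad \text{if } \alpha_i^k = 0, \tag{4b} \\
& d_i \le 0 \quad \text{if } \alpha_i^k = C, \tag{4c} \\
& |\{d_i \mid d_i \neq 0\}| = q, \tag{4d}
\end{align}

where $\alpha^k$ is the $k$th iterate, $f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$, and $\nabla f(\alpha^k)$ is the gradient of $f$ at $\alpha^k$.

Note that $|\{d_i \mid d_i \neq 0\}|$ means the number of components of $d$ which are not zero. The constraint (4d) implies that a descent direction involving only $q$ variables is obtained. Then components of $\alpha^k$ with nonzero $d_i$ are included in the working set $B$ which is used to construct the subproblem (2). Note that $d$ is only used for identifying $B$ but not as a search direction. Using the decomposition method with the new technique for selecting the working set $B$, Joachims reported promising numerical results. In this paper, we will demonstrate that a variation of his algorithm theoretically converges.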
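A selection problem of the form (4) is often handled by sorting rather than by a general solver. The sketch below is one such sort-based heuristic, written under the assumption that (4) minimizes $\nabla f(\alpha^k)^T d$ subject to $y^T d = 0$, $|d_i| \le 1$, the sign restrictions at the bounds, and $q$ nonzero components. Writing $d_i = y_i s_i$ turns the objective into $\sum_i y_i \nabla f(\alpha^k)_i \, s_i$ with $\sum_i s_i = 0$, so half of the chosen indices get $s_i = +1$ and half get $s_i = -1$. The function name and the tie handling are illustrative, and this is not claimed to reproduce Joachims's implementation.

```python
import numpy as np

def select_working_set(grad, y, alpha, C, q):
    """Sort-based heuristic for a problem of the form (4); returns (B, d)."""
    omega = y * grad                     # with d_i = y_i * s_i, grad^T d = sum(omega_i * s_i)
    # s_i = +1 (d_i = y_i) must not push a variable past an active bound
    can_up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
    # s_i = -1 (d_i = -y_i) obeys the mirrored condition
    can_down = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))
    order = np.argsort(omega)            # ascending omega
    half = q // 2
    up = [i for i in order if can_up[i]][:half]           # smallest omega get s_i = +1
    taken = set(up)
    down = [i for i in order[::-1]
            if can_down[i] and i not in taken][:half]     # largest omega get s_i = -1
    m = min(len(up), len(down))          # balance both groups so that y^T d = 0
    up = np.array(up[:m], dtype=int)
    down = np.array(down[:m], dtype=int)
    d = np.zeros(len(grad))
    d[up] = y[up]
    d[down] = -y[down]
    return np.flatnonzero(d), d
```

If fewer than `q // 2` indices are available on either side, the sketch returns a smaller working set, which corresponds to relaxing the equality in (4d) to an inequality.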

In Section II, we present a more general algorithm with an analysis on the selection of the working set. Then in Section III, we prove the convergence of this algorithm using the techniques of projected gradients. Section IV extends the convergence proofs to bound-constrained formulations of SVM. In Section V, we explain that our convergence proof is valid for general decomposition methods if their working set selection meets a simple requirement. Finally we provide concluding remarks in Section VI.

A preliminary version of this paper was presented in an earlier workshop [14].

II. GENERAL ALGORITHM

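To describe the new algorithm, we replace the problem (1) by the following form:

$$\min f(\alpha) \quad \text{subject to} \quad \alpha \in \Omega, \tag{5}$$

where $f$ is a continuously differentiable function from $R^l$ to $R^1$ and the set $\Omega$ is the feasible set of (1):

$$\Omega = \{\alpha \mid 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \ y^T \alpha = 0\}. \tag{6}$$

Therefore, (1) is a special case of (5) when $f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$. In addition, we assume $C$ is a finite positive number so $\Omega$ is a bounded set.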

In the following we describe a more general algorithm for solving (5).

Step 1) Let be a positive integer with . Set ,

choose , and find an as

the initial solution. Step 2) Solve

(7)

Assume contains indexes for which . Step 3) If , stop and output as an optimal solution.

Otherwise, let

if

otherwise. (8)

Define

The mapping is defined by

For , define , and

select so that (9) and or where satisfies (10) Step 4) Find a new such that

if and

Go to Step 2).

Joachims requires that $|\{d_i \mid d_i \neq 0\}| = q$. This may not always be possible, so we modify the equality to $|\{d_i \mid d_i \neq 0\}| \le q$. Another difference is that we use $0 \le \alpha_i^k + d_i \le C$ instead of (4a)–(4c). Note that (4a)–(4c) do not ensure the feasibility so the solution of (4) may not be a feasible direction. However, we use this property for the convergence proof as will be explained in Section III.

The Step 3) of the algorithm searches for a point by following a “partial” projected gradient direction. We called a “partial” gradient direction since it contains only components of which are in the working set of . For a standard projected gradient method (e.g., Calamai and Moré

[15], and Bertsekas [16]), , where

is the projection into , and is referred as the Cauchy

point. The sufficient decrease of function values at Cauchy

points is guaranteed by conditions (9) and (10) and is used for

the convergence proof. Then by requiring ,

we will prove that any limit point of the sequence is an

optimal solution. Note that when ,

in our algorithm is a feasible solution to the subproblem (2). Since most existing decomposition methods obtain the optimal solution of (2) for , in Step 4) the requirement that

is satisfied.
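The projection used by a standard projected gradient method can be computed cheaply for a feasible set of the form $\{\alpha \mid 0 \le \alpha_i \le C, \ y^T \alpha = 0\}$ with $y_i = \pm 1$. The sketch below is a generic bisection on the multiplier of the equality constraint, not a procedure described in the paper: the projection has the form $\alpha_i(\lambda) = \mathrm{clip}(z_i - \lambda y_i, 0, C)$, and $y^T \alpha(\lambda)$ is continuous and nonincreasing in $\lambda$. The function name `project_onto_omega` and the iteration count are assumptions.

```python
import numpy as np

def project_onto_omega(z, y, C, iters=100):
    """Euclidean projection of z onto {a : 0 <= a_i <= C, y^T a = 0}, with y_i in {+1, -1}."""
    def phi(lam):
        # y^T alpha(lam); nonincreasing in lam
        return y @ np.clip(z - lam * y, 0.0, C)

    bound = np.max(np.abs(z)) + C        # phi(-bound) >= 0 >= phi(bound)
    lo, hi = -bound, bound
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid                     # multiplier too small
        else:
            hi = mid
    return np.clip(z - 0.5 * (lo + hi) * y, 0.0, C)
```

A Cauchy point in the standard method would then be `project_onto_omega(alpha - t * grad, y, C)` for a step size `t`; the "partial" variant discussed above restricts the step to the components in the working set.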

Because the use of instead of makes a difference in Step 3), there is a new obstacle in proving the convergence. In the rest of this section, we will construct the relation between (7) and the following problem:


Definition II.1: A point is a stationary point of (5) if , for all . Immediately, we have the following result.

Lemma II.2: If is the optimal objective value of (11), if and only if is a stationary point.

Proof: If is not a stationary point, there is an

such that . Thus is a feasible

solution of (11). The objective value causes a contradiction. The proof of the other direction is similar.

We will investigate (11) in detail. Except $y^T d = 0$, constraints of (11) on $d$ can be written as

where , and , for

all . The KKT condition of (11) is

(12) where is a real number, and and are multipliers associated with inequalities. If is an optimal solution of (11), the

property and (12) imply that

As is the optimal objective value of (11)

(13)

Since each term of (13) is nonpositive, if is the index such that has the smallest value

(14)

Hence if , . As , with the constraint

, there are some elements satisfying

(15) Furthermore, since all are one or , there is at least one index satisfying (15) such that

(16) Now we are ready to switch back to (7) by considering the following problem:

and

(17)

If , define . Then ,

and from (13), (14), and (16)

A similar setting can be made for the case when . Therefore, by assigning other elements of to be zero, we find a feasible solution of (17) with objective value less than . As this solution is also a feasible point of (7), if is the optimal objective value of (7), we have the following lemma.

Lemma II.3: .

From Lemma II.3, the fact that , and

Lemma II.2, we have the following lemma.

Lemma II.4: if and only if is a stationary point.
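Stationarity of a feasible point can also be tested numerically. The sketch below uses the pairwise KKT criterion that is standard in later SVM solvers, not the optimal value of (7) used in this paper: a feasible $\alpha$ is stationary for (5) exactly when there is a scalar $b$ with $\nabla f(\alpha)_i + b y_i \ge 0$ if $\alpha_i = 0$, $\le 0$ if $\alpha_i = C$, and $= 0$ otherwise, which is equivalent to the largest value of $-y_i \nabla f(\alpha)_i$ over indices that can move "up" not exceeding the smallest value over indices that can move "down". The function names and the tolerance are assumptions.

```python
import numpy as np

def violation_gap(grad, y, alpha, C):
    """max over I_up of (-y_i g_i) minus min over I_low of (-y_i g_i); <= 0 at a stationary point."""
    up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))    # alpha + t * y_i * e_i stays in the box
    low = ((y > 0) & (alpha > 0)) | ((y < 0) & (alpha < C))   # alpha - t * y_i * e_i stays in the box
    if not up.any() or not low.any():
        return -np.inf                                        # no feasible pairwise move exists
    score = -y * grad
    return float(score[up].max() - score[low].min())

def is_stationary(grad, y, alpha, C, tol=1e-8):
    return violation_gap(grad, y, alpha, C) <= tol
```

For example, with $Q = I$, $y = (1, -1)$, and $C = 1$, the point $\alpha = (1, 1)$ has zero gradient and passes the test, while $\alpha = (0, 0)$ has a gap of 2 and does not.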

III. CONVERGENCE PROOF

Based on the analysis of the special structure of SVM formulations in Section II, in this section we will provide the convergence proof.

As the only difference between and is the restriction on

the domain variable such that , following

from Calamai and Moré [15], we have the following lemma.

Lemma III.1: Let be the projection into .

1) If , then

(18) 2) is a monotone operator, that is

(19) If then strict inequality holds.

3) Given and , , the function

defined by

(20) is nonincreasing.

We define a function of ,

which yields the following inequalities: From (18)

(21) Therefore

and consequently


From (19), if , then

(23) The following lemma ensures that there is an satisfying the requirements in Step 3 of the algorithm.

Lemma III.2: If is not a stationary point of (5), then

1) For all , .

2) There is a in , and an such that for all

(24) and

3) There are positive constants and , and constants and in , and such that (9) and (10) are satisfied.

Proof: If 1) is false, we can find an such that

Then from (23), for all , .

However, from (21)

Therefore, , for all . Because is not

a stationary point, from Lemma II.4, (7) has a solution

and there exists an such that .

Because , we can find an such that

. Then

This contradicts the property that .

For 2), because $\alpha^k$ is not a stationary point and $f$ is continuously differentiable, from 1) of this Lemma and (21), we have

(25) From (20), is nonincreasing. On the other hand,

from (22), , so

(26) Thus, (25) implies

Hence the result (24) immediately follows.

We prove 3) by giving an example from Bertsekas [17]. Define , where is the smallest integer such that (24) is satisfied. From 2), we know this exists. Then by using this as , both (9) and (10) are satisfied.

The following theorem from Calamai and Moré [15] is very important to our convergence proof. Because we are now using instead of as the search direction of the

Cauchy step, to demonstrate its validity, here we present the whole proof.

Theorem III.3: Let be continuously differentiable on

Proof: Assume there is an infinite subset such that and

(27) We will prove that this assumption leads to a contradiction. From (27), for all

(28)

Since is bounded and , is bounded,

converges, and condition (9) implies . Hence (21) and (28) show that

and

In particular, we have shown that eventually and hence where satisfies (10). Now assign

and . With this, (20) implies

Since and for , we obtain

(29) Together with (21) and (23), (29) implies that for

(30) We now use (30) to obtain the desired contradiction. Since converges to zero, (30) implies that converges to zero. Thus the continuity of ,

the fact , and the

boundedness of show that

Therefore, if

then

Hence (30) establishes that for all

sufficiently large. This is the desired contradiction because (10)

guarantees that .

We are now in a position to prove the main theorem of this paper.


Theorem III.4: Let $f$ be continuously differentiable on $\Omega$. Any limit point of $\{\alpha^k\}$ is a stationary point of (5).

Proof: For any , from (18)

Hence

If is a direction at with , then setting yields

Since this inequality is true for all such , with Lemma II.3, for any

Theorem III.3 implies that converges to zero. Since converges, (9) implies that

converges to zero. Since is continuously differentiable on and is bounded, if is a limit point of a subsequence

Since this inequality is true for all , from Definition II.1, is a stationary point of (5).

Since the quadratic program of SVM is convex, Theorem III.4 assures that any limit point of $\{\alpha^k\}$ is a global minimum.

IV. BOUND-CONSTRAINED FORMULATIONS OF SVM

Recently several authors [7], [18] have proposed different methods for formulating support vector machines. The resulting quadratic program is a bound-constrained problem

$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l.$$

Note that the matrix $Q$ here is not the same as the matrix $Q$ in (1). In this section we would like to demonstrate that if the proposed algorithm in Section II is applied to this bound-constrained problem, the convergence proof can still follow.

If and are defined by the same way, now it is much easier to establish a relation between them. Instead of Lemma II.3, we have the following lemma.

Lemma IV.1:

Proof: First we assume is an optimal solution

of (11). We then divide to subsets:

. Let and we distribute the rest

elements to such that each subset has no more than

elements. If we define vectors as

otherwise

then all , are feasible solutions of (7).

Thus

Therefore

It can be clearly seen that Lemmas III.1 and III.2 and Theorem III.3 are still true without any modifications. Then in the proof of Theorem III.4, we use Lemma IV.1 to obtain the convergence result.
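For the bound-constrained case the projection is a componentwise clip, so a decomposition iteration is easy to sketch. The code below is a simplified illustration, assuming the problem $\min \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$ subject to $0 \le \alpha_i \le C$. It picks the $q$ indices with the largest projected-gradient magnitude (a greedy stand-in, not the selection problem (7)), takes a projected gradient step on those components only, and halves the step until a sufficient decrease test in the spirit of (9) holds. All names, the constant `mu`, and the tolerances are assumptions.

```python
import numpy as np

def projected_gradient(Q, alpha, C):
    """Gradient with components zeroed where an active bound blocks descent."""
    g = Q @ alpha - 1.0
    blocked = ((alpha <= 0.0) & (g > 0.0)) | ((alpha >= C) & (g < 0.0))
    return np.where(blocked, 0.0, g)

def decomposition_pg(Q, C, q, iters=5000, mu=1e-4, tol=1e-7):
    """Partial projected gradient steps on a greedily chosen working set of size q."""
    def f(a):
        return 0.5 * a @ Q @ a - a.sum()

    alpha = np.zeros(Q.shape[0])
    for _ in range(iters):
        pg = projected_gradient(Q, alpha, C)
        if np.max(np.abs(pg)) < tol:
            break                                    # approximately stationary
        g = Q @ alpha - 1.0
        B = np.argsort(-np.abs(pg))[:q]              # greedy working set
        t = 1.0
        while True:
            trial = alpha.copy()
            trial[B] = np.clip(alpha[B] - t * g[B], 0.0, C)   # step on B only, then clip
            if f(trial) <= f(alpha) + mu * g @ (trial - alpha):
                break                                # sufficient decrease reached
            t *= 0.5
        alpha = trial
    return alpha
```

On a small, well-conditioned $Q$ this loop should drive the projected gradient toward zero, matching the behavior the convergence result predicts for this problem class; it is a demonstration of the mechanism rather than a competitive solver.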

V. GENERAL WORKING SET SELECTIONS

From the proofs in Section III and the discussion in Section IV, we realize that if the optimal objective value of the working set selection subproblem is small enough relative to that of (11), the algorithm presented in Section II always converges.

Corollary V.1: Suppose Step 2) of the algorithm in Section II

is replaced by another strategy and contains indexes of the selected working set. If the optimal objective value of the following problem:

(31)

if (32)

satisfies

where is a positive constant, Theorem III.4 still holds. This corollary gives us the flexibility of choosing different types of working sets. For example, a strategy can be as follows:

Solve

(33) and define

indexes with selected from another strategy

If is the optimal objective value of (33), from Lemma II.3, we have

Hence the algorithm converges. Such a strategy might be useful in practice as (7) is a steepest-descent type selection which may not be the best choice for quick convergence.


VI. CONCLUSION

In this paper we discuss the convergence of decomposition methods for support vector machines. The proof is based on the connection between a more general algorithm and the techniques of projected gradients. We also demonstrate that the proof can be easily applied to general bound-constrained formulations of SVM. Examples of flexible working set selections are also given.

ACKNOWLEDGMENT

The third author thanks Dr. J. Moré for bringing him to the subject of support vector machines and for some very helpful discussions.

REFERENCES

[1] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining Knowl. Disc., vol. 2, no. 2, pp. 121–167, 1998.

[2] C. Cortes and V. Vapnik, “Support-vector network,” Mach. Learn., vol. 20, pp. 273–297, 1995.

[3] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., Advances in Kernel Methods—Support Vector Learning. Cambridge, MA: MIT Press, 1998.

[4] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

[5] T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.

[6] L. Kaufman, “Solving the quadratic programming problem arising in support vector classification,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999.

[7] O. L. Mangasarian and D. R. Musicant, “Successive overrelaxation for support vector machines,” IEEE Trans. Neural Networks, vol. 10, pp. 1032–1037, Sept. 1999.

[8] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: An application to face detection,” in Proc. CVPR ’97, 1997.

[9] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, Ed. Cambridge, MA: MIT Press, 1998.

[10] C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf, and A. Smola, “Support vector machine reference manual,” Royal Holloway Coll., Univ. London, Egham, U.K., Tech. Rep. CSD-TR-98-03, 1998.

[11] E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” in Proc. IEEE NNSP’97, 1997.

[12] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Royal Holloway College, Univ. London, Egham, U.K., Neuro COLT Tech. Rep. TR-1998-030, 1998.

[13] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Dept. Mech. Prod. Eng., Nat. Univ. Singapore, Tech. Rep., 1999.

[14] C.-C. Chang, C.-W. Hsu, and C.-J. Lin, “The analysis of decomposition methods for support vector machines,” in Workshop Support Vector Machines, IJCAI 99, 1999.

[15] P. H. Calamai and J. J. Moré, “Projected gradient methods for linearly constrained problems,” Math. Programming, vol. 39, pp. 93–116, 1987.

[16] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena, 1995.

[17] D. P. Bertsekas, “On the Goldstein-Levitin-Polyak gradient projection method,” IEEE Trans. Automat. Contr., vol. AC-21, pp. 174–184, 1976.

[18] T.-T. Friess, N. Cristianini, and C. Campbell, “The kernel adatron algorithm: a fast and simple learning procedure for support vector machines,” in Proc. 15th Int. Conf. Machine Learning. San Mateo, CA: Morgan Kaufmann, 1998.