The Analysis of Decomposition Methods for Support Vector Machines
Chih-Chung Chang, Chih-Wei Hsu, and Chih-Jen Lin
Abstract—The support vector machine (SVM) is a new and promising technique for pattern recognition. It requires the solution of a large dense quadratic programming problem. Traditional optimization methods cannot be directly applied due to memory restrictions. Up to now, very few methods can handle the memory problem and an important one is the “decomposition method.” However, there is no convergence proof so far. In this paper, we connect this method to projected gradient methods and provide theoretical proofs for a version of decomposition methods. An extension to bound-constrained formulation of SVM is also provided. We then show that this convergence proof is valid for general decomposition methods if their working set selection meets a simple requirement.
Index Terms—Decomposition methods, projected gradients, support vector machines.
I. INTRODUCTION
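The support vector machine (SVM) is a new and very promising classification technique for pattern recognition. Surveys of SVM are, for example, Burges [1], Cortes and Vapnik [2], Schölkopf et al. [3], and Vapnik [4]. Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, of length $n$, and a vector $y \in R^l$ defined as follows:
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1,} \\ -1 & \text{if } x_i \text{ in class 2,} \end{cases}$$
the support vector technique in general requires the solution of the following quadratic programming problem:
$$\min_{\alpha} \ \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T \alpha = 0 \tag{1}$$
where $e$ is the vector of all ones, $C$ is the upper bound of all variables, and $Q$ is an $l \times l$ positive semidefinite matrix. Possible choices of $Q_{ij}$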
are, for example, and .
Note that $x_i$ is considered as a support vector if $\alpha$ is the solution of (1) and $\alpha_i > 0$.
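To make problem (1) concrete, the following minimal sketch builds a dense $Q$ and evaluates the objective. It assumes the standard SVM dual form $Q_{ij} = y_i y_j K(x_i, x_j)$ and uses a Gaussian kernel with a hypothetical parameter `gamma`; the kernel choice and the function names are illustrative assumptions, not taken from the text.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||u - v||^2); gamma is an assumed parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def build_q(xs, ys, kernel=rbf_kernel):
    """Dense l-by-l matrix with Q[i][j] = y_i * y_j * K(x_i, x_j) (assumed form)."""
    l = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(l)]
            for i in range(l)]

def dual_objective(q, alpha):
    """Objective of (1): 0.5 * alpha^T Q alpha - e^T alpha."""
    l = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j] for i in range(l) for j in range(l))
    return 0.5 * quad - sum(alpha)

def support_vector_indices(alpha, tol=1e-12):
    """Indices i with alpha_i > 0, i.e. the support vectors of a solution."""
    return [i for i, a in enumerate(alpha) if a > tol]
```

Storing `q` takes memory quadratic in the number of training vectors, which is the difficulty that motivates decomposition methods.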
Manuscript received November 10, 1999; revised April 18, 2000. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC-88-2213-E-002-097.
The authors are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, R.O.C., (e-mail: [email protected]).
Publisher Item Identifier S 1045-9227(00)05952-X.
The difficulty of solving (1) is the density of $Q$, because $Q_{ij}$ is in general not zero and $Q$ becomes a fully dense matrix. Hence a prohibitive amount of memory is required to store the matrix. Thus, traditional optimization algorithms such as Newton, quasi-Newton, etc., cannot be directly applied. Several authors (e.g., [5]–[10]) have proposed methods with successful implementations to conquer this difficulty. Unlike the Newton method, which usually involves the whole Hessian matrix $Q$, these methods only calculate components of $Q$ when they are required in the current iteration. Among them, an important one is the “decomposition method” proposed by Osuna et al. [8]. As a variation of the active set methods, they separated the index set $\{1, \ldots, l\}$ of the training set into two sets $B$ and $N$, where $\alpha_i = 0$ for $i \in N$ and $B$ is the working set if $\alpha$ is the current iterate of the algorithm. If we denote $\alpha_B$, $\alpha_N$, $y_B$, and $y_N$ as vectors containing corresponding elements, the objective value is equal to
$$\frac{1}{2}\alpha_B^T Q_{BB}\alpha_B - (e_B - Q_{BN}\alpha_N)^T \alpha_B + \frac{1}{2}\alpha_N^T Q_{NN}\alpha_N - e_N^T \alpha_N.$$
At each iteration, $\alpha_N$ is fixed and the following subproblem with the variable $\alpha_B$ is solved:
$$\min_{\alpha_B} \ \frac{1}{2}\alpha_B^T Q_{BB}\alpha_B - (e_B - Q_{BN}\alpha_N)^T \alpha_B \quad \text{subject to} \quad 0 \le (\alpha_B)_i \le C,\ i = 1, \ldots, q, \quad y_B^T \alpha_B = -y_N^T \alpha_N \tag{2}$$
where $\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}$ is a permutation of the matrix $Q$ and $q$ is the size of $B$. Elements of $\alpha_N$ are kept as zero from the beginning. The Karush–Kuhn–Tucker (KKT) condition of (2)
includes of (2) and
(3)
If there is one satisfying , from (3),
could be easily obtained. Since all components of are zero, if is an optimal solution, satisfies the KKT condition of
(2): , for . Therefore, the algorithm
moves elements of $N$ which violate the KKT condition of (2) to $B$, and moves zero elements out of $B$ to $N$. If the size of $B$ is bigger than the number of support vectors, then, because no cycling happens, the algorithm stops in a finite number of steps.
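The splitting of the objective into a working-set part and a fixed part can be checked numerically. The sketch below assumes a symmetric $Q$ and writes $B$ for the working set and $N$ for the remaining indices; it compares the full objective $\frac{1}{2}\alpha^T Q\alpha - e^T\alpha$ with the block form $\frac{1}{2}\alpha_B^T Q_{BB}\alpha_B - (e_B - Q_{BN}\alpha_N)^T\alpha_B + \frac{1}{2}\alpha_N^T Q_{NN}\alpha_N - e_N^T\alpha_N$. The function names are hypothetical.

```python
def objective(q, alpha):
    """Full objective 0.5 * alpha^T Q alpha - e^T alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)

def block_objective(q, alpha, working_set):
    """Same value, written in terms of the blocks B (working set) and N (rest).

    Assumes q is symmetric, so the two cross terms merge into Q_BN alpha_N.
    """
    in_b = set(working_set)
    b = list(working_set)
    n = [i for i in range(len(alpha)) if i not in in_b]
    # 0.5 * alpha_B^T Q_BB alpha_B
    quad_bb = 0.5 * sum(alpha[i] * q[i][j] * alpha[j] for i in b for j in b)
    # (e_B - Q_BN alpha_N)^T alpha_B
    lin_b = sum((1.0 - sum(q[i][j] * alpha[j] for j in n)) * alpha[i] for i in b)
    # part that stays constant while alpha_N is fixed
    const_n = (0.5 * sum(alpha[i] * q[i][j] * alpha[j] for i in n for j in n)
               - sum(alpha[i] for i in n))
    return quad_bb - lin_b + const_n
```

Only the first two terms depend on the working-set variables, which is why a subproblem over $B$ needs just the rows of $Q$ indexed by $B$.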
The main shortcoming of the above algorithm is that we do not know the number of support vectors a priori. To handle this issue, the same authors [11] proposed to use a small number $q$ as the size of $B$. Furthermore, there is no restriction that $\alpha_N$ must only contain zeros. To be more precise, any element in $N$ which violates the KKT condition can enter $B$. Since $q$ is small in general (less than 100), this method never faces a memory problem. Good numerical results were reported, and several implementations and further improvements are given in [5], [9], [10]. However, even though the strict decrease of the objective function still holds, there is no theoretical proof to show that the sequence of iterates converges to an optimal solution. This issue has been circulated in the SVM community for a while (see, for example, the discussion in Smola and Schölkopf [12, Sec. 5.5.2], and Keerthi et al. [13, Sec. 3]), but there is no satisfactory solution so far.
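1045–9227/00$10.00 © 2000 IEEE
The paper by Joachims [5] has drawn much of our attention because of its method of updating $B$ and $N$. Unlike the method by Osuna et al., where a more “random” selection is used, Joachims solves the following problem in order to select the working set:
$$\begin{aligned}
\min_{d} \quad & \nabla f(\alpha^k)^T d \\
\text{subject to} \quad & y^T d = 0, \quad -1 \le d_i \le 1, & \text{(4a)} \\
& d_i \ge 0, \ \text{if } (\alpha^k)_i = 0, & \text{(4b)} \\
& d_i \le 0, \ \text{if } (\alpha^k)_i = C, & \text{(4c)} \\
& |\{d_i \mid d_i \ne 0\}| = q, & \text{(4d)}
\end{aligned}$$
where $\alpha^k$ is the $k$th iterate, $f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha$, and $\nabla f(\alpha^k)$ is the gradient of $f$ at $\alpha^k$.
Note that $|\{d_i \mid d_i \ne 0\}|$ means the number of components of $d$ which are not zero. The constraint (4d) implies that a descent direction involving only $q$ variables is obtained. Then components of $\alpha^k$ with nonzero $d_i$ are included in the working set $B$, which is used to construct the subproblem (2). Note that $d$ is only used for identifying $B$ but not as a search direction. Using the decomposition method with the new technique for selecting the working set $B$, Joachims reported promising numerical results. In this paper, we will demonstrate that a variation of his algorithm theoretically converges.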
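A working set satisfying sign restrictions of the kind in (4b) and (4c) can be found greedily by sorting $y_i \nabla f(\alpha^k)_i$. The sketch below follows the commonly described sorting heuristic: take candidates from the low end with $d_i = y_i$ and from the high end with $d_i = -y_i$, in pairs, so that $y^T d = 0$. It is an illustrative assumption about how such a selection could be coded, not the exact procedure of [5], and `select_working_set` is a hypothetical name. When too few eligible pairs exist it returns fewer than `q` indices.

```python
def select_working_set(grad, y, alpha, C, q):
    """Greedy sketch: return (B, d) with d_i in {-1, 0, +1} and y^T d = 0."""
    if q % 2 != 0:
        raise ValueError("q must be even so that y^T d = 0 can hold")
    score = [y[i] * grad[i] for i in range(len(grad))]
    order = sorted(range(len(grad)), key=lambda i: score[i])  # ascending

    def allowed(i, di):
        # d_i >= 0 when alpha_i = 0, and d_i <= 0 when alpha_i = C
        if alpha[i] <= 0.0 and di < 0:
            return False
        if alpha[i] >= C and di > 0:
            return False
        return True

    low = [i for i in order if allowed(i, y[i])][: q // 2]
    taken = set(low)
    high = [i for i in reversed(order)
            if i not in taken and allowed(i, -y[i])][: q // 2]

    d = [0] * len(grad)
    for i, j in zip(low, high):
        if score[i] >= score[j]:
            break  # this pair would not decrease grad^T d
        d[i], d[j] = y[i], -y[j]
    working_set = [i for i, di in enumerate(d) if di != 0]
    return working_set, d
```

Each accepted pair contributes `score[i] - score[j] < 0` to $\nabla f(\alpha^k)^T d$, so the returned direction is a descent direction whenever the working set is nonempty.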
In Section II, we present a more general algorithm with an analysis on the selection of the working set. Then in Section III, we prove the convergence of this algorithm using the techniques of projected gradients. Section IV extends the convergence proofs to bound-constrained formulations of SVM. In Section V, we explain that our convergence proof is valid for general decomposition methods if their working set selection meets a simple requirement. Finally we provide concluding remarks in Section VI.
A preliminary version of this paper was presented in an earlier workshop [14].
II. GENERAL ALGORITHM
To describe the new algorithm, we replace the problem (1) by the following form:
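$$\min_{\alpha} \ f(\alpha) \quad \text{subject to} \quad \alpha \in \Omega \tag{5}$$
where $f$ is a continuously differentiable function from $R^l$ to $R$ and the set $\Omega$ is the feasible set of (1):
$$\Omega = \{\alpha \mid 0 \le \alpha_i \le C,\ i = 1, \ldots, l,\ y^T \alpha = 0\}. \tag{6}$$
Therefore, (1) is a special case of (5) when $f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha$. In addition, we assume $C$ is a finite positive number so $\Omega$ is a bounded set.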
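As a small illustration of the feasible set of (1), that is, the box constraints $0 \le \alpha_i \le C$ together with the single equality $y^T\alpha = 0$, the sketch below tests membership up to a tolerance. The function name and the tolerance `tol` are assumptions introduced for the example.

```python
def in_feasible_set(alpha, y, C, tol=1e-9):
    """True if 0 <= alpha_i <= C for all i and y^T alpha = 0, up to tol."""
    box_ok = all(-tol <= a <= C + tol for a in alpha)
    balance_ok = abs(sum(yi * ai for yi, ai in zip(y, alpha))) <= tol
    return box_ok and balance_ok
```

The zero vector always passes this test, so it is a convenient initial solution for an iterative method.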
In the following we describe a more general algorithm for solving (5).
Step 1) Let be a positive integer with . Set ,
choose , and find an as
the initial solution. Step 2) Solve
(7)
Assume contains indexes for which . Step 3) If , stop and output as an optimal solution.
Otherwise, let
if
otherwise. (8)
Define
The mapping is defined by
For , define , and
select so that (9) and or where satisfies (10) Step 4) Find a new such that
if and
Go to Step 2).
Joachims requires that $|\{d_i \mid d_i \ne 0\}| = q$. This may not always be possible, so we modify the equality to $|\{d_i \mid d_i \ne 0\}| \le q$. Another difference is that we use the feasibility constraint in (7) instead of (4a)–(4c). Note that (4a)–(4c) do not ensure feasibility, so the solution of (4) may not be a feasible direction. However, we use this property for the convergence proof, as will be explained in Section III.
The Step 3) of the algorithm searches for a point by following a “partial” projected gradient direction. We called a “partial” gradient direction since it contains only components of which are in the working set of . For a standard projected gradient method (e.g., Calamai and Moré
[15], and Bertsekas [16]), , where
is the projection into , and is referred as the Cauchy
point. The sufficient decrease of function values at Cauchy
points is guaranteed by conditions (9) and (10) and is used for
the convergence proof. Then by requiring ,
we will prove that any limit point of the sequence is an
optimal solution. Note that when ,
in our algorithm is a feasible solution to the subproblem (2). Since most existing decomposition methods obtain the optimal solution of (2) for , in Step 4) the requirement that
is satisfied.
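Because Step 3) follows only a partial gradient direction rather than the full gradient, there is a new obstacle in proving the convergence. In the rest of this section, we will construct the relation between (7) and the following problem:
$$\min_{d} \ \nabla f(\alpha^k)^T d \quad \text{subject to} \quad \alpha^k + d \in \Omega. \tag{11}$$
Definition II.1: A point $\alpha \in \Omega$ is a stationary point of (5) if $\nabla f(\alpha)^T(\bar{\alpha} - \alpha) \ge 0$ for all $\bar{\alpha} \in \Omega$.
Immediately, we have the following result.
Lemma II.2: If $v$ is the optimal objective value of (11), $v = 0$ if and only if $\alpha^k$ is a stationary point.
Proof: Suppose $v = 0$. If $\alpha^k$ is not a stationary point, there is an $\bar{\alpha} \in \Omega$ such that $\nabla f(\alpha^k)^T(\bar{\alpha} - \alpha^k) < 0$. Thus $d = \bar{\alpha} - \alpha^k$ is a feasible solution of (11). The objective value $\nabla f(\alpha^k)^T d < 0$ causes a contradiction. The proof of the other direction is similar.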
We will investigate (11) in detail. Except , con-straints of (11) on can be written as
where , and , for
all . The KKT condition of (11) is
(12) where is a real number, and and are multipliers asso-ciated with inequalities. If is an optimal solution of (11), the
property and (12) imply that
As is the optimal objective value of (11)
(13)
Since each term of (13) is nonpositive, if is the index such that has the smallest value
(14)
Hence if , . As , with the constraint
, there are some elements satisfying
(15) Furthermore, since all are one or , there is at least one index satisfying (15) such that
(16) Now we are ready to switch back to (7) by considering the fol-lowing problem:
and
(17)
If , define . Then ,
and from (13), (14), and (16)
A similar setting can be made for the case when . Therefore, by assigning other elements of to be zero, we find a feasible solution of (17) with objective value less than . As this solution is also a feasible point of (7), if is the optimal objective value of (7), we have the following lemma.
Lemma II.3: .
From Lemma II.3, the fact that , and
Lemma II.2, we have the following lemma.
Lemma II.4: if and only if is a stationary point.
III. CONVERGENCE PROOF
Based on the analysis of the special structure of SVM formulations in Section II, in this section we provide the convergence proof.
As the only difference between and is the restriction on
the domain variable such that , following
from Calamai and Moré [15], we have the following lemma.
Lemma III.1: Let be the projection into .
1) If , then
(18) 2) is a monotone operator, that is
(19) If then strict inequality holds.
3) Given and , , the function
defined by
(20) is nonincreasing.
We define a function of ,
which yields the following inequalities: From (18)
(21) Therefore
and consequently
From (19), if , then
(23) The following lemma ensures that there is an satisfying the requirements in Step 3 of the algorithm.
Lemma III.2: If is not a stationary point of (5), then
1) For all , .
2) There is a in , and an such that for all
(24) and
3) There are positive constants and , and constants and in , and such that (9) and (10) are satisfied.
Proof: If 1) is false, we can find an such that
Then from (23), for all , .
However, from (21)
Therefore, , for all . Because is not
a stationary point, from Lemma II.4, (7) has a solution
and there exists an such that .
Because , we can find an such that
. Then
This contradicts the property that .
For 2), because is not a stationary point and is continu-ously differentiable, from 1) of this Lemma and (21), we have
(25) From (20), is nonincreasing. On the other hand,
from (22), , so
(26) Thus, (25) implies
Hence the result (24) immediately follows.
We prove 3) by giving an example from Bertsekas [17]. De-fine , where is the smallest integer such that (24) is satisfied. From 2), we know this exists. Then by using this as , both (9) and (10) are satisfied.
The following theorem from Calamai and Moré [15] is very important to our convergence proof. Because we are now using instead of as the search direction of the
Cauchy step, to demonstrate its validity, here we present the whole proof.
Theorem III.3: Let $f$ be continuously differentiable on $\Omega$
Proof: Assume there is an infinite subset such that and
(27) We will prove that this assumption leads to a contradiction. From (27), for all
(28)
Since is bounded and , is bounded,
converges, and condition (9) implies . Hence (21) and (28) show that
and
In particular, we have shown that eventually and hence where satisfies (10). Now assign
and . With this, (20) implies
Since and for , we obtain
(29) Together with (21) and (23), (29) implies that for
(30) We now use (30) to obtain the desired contradiction. Since converges to zero, (30) implies that converges to zero. Thus the continuity of ,
the fact , and the
boundedness of show that
Therefore, if
then
Hence (30) establishes that for all
sufficiently large. This is the desired contradiction because (10)
guarantees that .
We are now in a position to prove the main theorem of this paper.
Theorem III.4: Let $f$ be continuously differentiable on $\Omega$. Any limit point of $\{\alpha^k\}$ is a stationary point of (5).
Proof: For any , from (18)
Hence
If is a direction at with , then setting yields
Since this inequality is true for all such , with Lemma II.3, for any
Theorem III.3 implies that converges to zero. Since converges, (9) implies that
converges to zero. Since is continuously differentiable on and is bounded, if is a limit point of a subsequence
Since this inequality is true for all , from Definition II.1, is a stationary point of (5).
Since the quadratic program of SVM is convex, Theorem III.4 assures that any limit point of $\{\alpha^k\}$ is a global minimum.
IV. BOUND-CONSTRAINED FORMULATIONS OF SVM
Recently several authors [7], [18] have proposed different methods for formulating support vector machines. The resulting quadratic program is a bound-constrained problem
$$\min_{\alpha} \ \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l.$$
Note that the matrix $Q$ here is not the same as the matrix $Q$ in (1). In this section we would like to demonstrate that if the proposed algorithm in Section II is applied to this bound-constrained problem, the convergence proof still goes through.
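For a bound-constrained problem the projection onto the feasible set is a componentwise clip, which makes a projected gradient iteration easy to sketch. The code below is a standard full projected gradient method with an Armijo-type backtracking rule along the projection arc, applied to $\frac{1}{2}\alpha^T Q\alpha - e^T\alpha$ over $0 \le \alpha_i \le C$. It is not the partial-gradient algorithm of Section II, and the parameters `mu`, `beta`, `max_iter`, and `tol` are assumed values chosen for illustration.

```python
def project_box(alpha, C):
    """Projection onto {0 <= alpha_i <= C}: clip each component."""
    return [min(max(a, 0.0), C) for a in alpha]

def objective(q, alpha):
    n = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)

def gradient(q, alpha):
    """Gradient Q alpha - e of the quadratic objective."""
    return [sum(qij * aj for qij, aj in zip(row, alpha)) - 1.0 for row in q]

def projected_gradient(q, C, alpha0, mu=0.01, beta=0.5, max_iter=500, tol=1e-10):
    alpha = project_box(alpha0, C)
    for _ in range(max_iter):
        g = gradient(q, alpha)
        t = 1.0
        while True:
            trial = project_box([a - t * gi for a, gi in zip(alpha, g)], C)
            step = [b - a for a, b in zip(alpha, trial)]
            predicted = sum(gi * si for gi, si in zip(g, step))
            # sufficient decrease test; give up shrinking once t is tiny
            if objective(q, trial) <= objective(q, alpha) + mu * predicted or t < 1e-12:
                break
            t *= beta
        if max(abs(s) for s in step) <= tol:
            return trial  # projected step is (numerically) zero
        alpha = trial
    return alpha
```

With a diagonal `q` the problem separates by coordinate, so the result can be checked by hand: for `q = [[2.0, 0.0], [0.0, 0.5]]` and `C = 1.0` the unconstrained minimizer is (0.5, 2), and clipping the second coordinate gives (0.5, 1).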
If and are defined by the same way, now it is much easier to establish a relation between them. Instead of Lemma II.3, we have the following lemma.
Lemma IV.1:
Proof: First we assume is an optimal solution
of (11). We then divide to subsets:
. Let and we distribute the rest
ele-ments to such that each subset has no more than
elements. If we define vectors as
otherwise
then all , are feasible solutions of (7).
Thus
Therefore
It can be clearly seen that Lemmas III.1 and III.2 and Theorem III.3 are still true without any modifications. Then in the proof of Theorem III.4, we use Lemma IV.1 to obtain the convergence result.
V. GENERAL WORKING SET SELECTIONS
From the proofs in Section III and the discussion in Section IV, we realize that if the optimal objective value of the working set selection subproblem is sufficiently small, the algorithm presented in Section II always converges.
Corollary V.1: Suppose Step 2) of the algorithm in Section II
is replaced by another strategy and contains indexes of the selected working set. If the optimal objective value of the following problem:
(31)
if (32)
satisfies
where is a positive constant, Theorem III.4 still holds. This corollary gives us the flexibility of choosing different types of working sets. For example, a strategy can be as follows:
Solve
(33) and define
indexes with selected from another strategy
If is the optimal objective value of (33), from Lemma II.3, we have
Hence the algorithm converges. Such a strategy might be useful in practice as (7) is a steepest-descent type selection which may not be the best choice for quick convergence.
VI. CONCLUSION
In this paper we discuss the convergence of decomposition methods for support vector machines. The proof is based on the connection between a more general algorithm and the techniques of projected gradients. We also demonstrate that the proof can be easily applied to general bound-constrained formulations of SVM. Examples of flexible working set selections are also given.
ACKNOWLEDGMENT
The third author thanks Dr. J. Moré for introducing him to the subject of support vector machines and for some very helpful discussions.
REFERENCES
[1] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining Knowl. Disc., vol. 2, no. 2, pp. 121–167, 1998.
[2] C. Cortes and V. Vapnik, “Support-vector network,” Mach. Learn., vol. 20, pp. 273–297, 1995.
[3] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., Advances in Kernel Methods—Support Vector Learning. Cambridge, MA: MIT Press, 1998.
[4] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[5] T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1998.
[6] L. Kaufman, “Solving the quadratic programming problem arising in support vector classification,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[7] O. L. Mangasarian and D. R. Musicant, “Successive overrelaxation for support vector machines,” IEEE Trans. Neural Networks, vol. 10, pp. 1032–1037, Sept. 1999.
[8] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: An application to face detection,” in Proc. CVPR ’97, 1997.
[9] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, Ed. Cambridge, MA: MIT Press, 1998.
[10] C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf, and A. Smola, “Support vector machine reference manual,” Royal Holloway Coll., Univ. London, Egham, U.K., Tech. Rep. CSD-TR-98-03, 1998.
[11] E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” in Proc. IEEE NNSP’97, 1997.
[12] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Royal Holloway College, Univ. London, Egham, U.K., Neuro COLT Tech. Rep. TR-1998-030, 1998.
[13] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Dept. Mech. Prod. Eng., Nat. Univ. Singapore, Tech. Rep., 1999.
[14] C.-C. Chang, C.-W. Hsu, and C.-J. Lin, “The analysis of decomposition methods for support vector machines,” in Workshop Support Vector Machines, IJCAI 99, 1999.
[15] P. H. Calamai and J. J. Moré, “Projected gradient methods for linearly constrained problems,” Math. Programming, vol. 39, pp. 93–116, 1987.
[16] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena, 1995.
[17] D. P. Bertsekas, “On the Goldstein-Levitin-Polyak gradient projection method,” IEEE Trans. Automat. Contr., vol. AC-21, pp. 174–184, 1976.
[18] T.-T. Friess, N. Cristianini, and C. Campbell, “The kernel adatron algorithm: a fast and simple learning procedure for support vector machines,” in Proc. 15th Int. Conf. Machine Learning. San Mateo, CA: Morgan Kaufmann, 1998.