Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel

S. Sathiya Keerthi

mpessk@guppy.mpe.nus.edu.sg

Department of Mechanical Engineering, National University of Singapore, Singapore 119260, Republic of Singapore

Chih-Jen Lin

cjlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

Support vector machines (SVMs) with the gaussian (RBF) kernel have been popular for practical use. Model selection in this class of SVMs involves two hyperparameters: the penalty parameter C and the kernel width σ. This letter analyzes the behavior of the SVM classifier when these hyperparameters take very small or very large values. Our results help in understanding the hyperparameter space that leads to an efficient heuristic method of searching for hyperparameter values with small generalization errors. The analysis also indicates that if complete model selection using the gaussian kernel has been conducted, there is no need to consider linear SVM.

1 Introduction

Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y ∈ {1, −1}^l, support vector machines (SVMs) (Vapnik, 1998) require the solution of the following (primal) optimization problem:

    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^l ξ_i
    subject to   y_i (w^T z_i + b) ≥ 1 − ξ_i,        (1.1)
                 ξ_i ≥ 0, i = 1, ..., l.

Here, training vectors x_i are mapped into a higher- (maybe infinite-) dimensional space by the function φ as z_i = φ(x_i). C > 0 is the penalty parameter of the error term.


Usually we solve equation 1.1 by solving the following dual problem:

    min_α   F(α) = (1/2) α^T Q α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,        (1.2)
                y^T α = 0,

where e is the vector of all ones and Q is an l by l positive semidefinite matrix. The (i, j)th element of Q is given by Q_ij ≡ y_i y_j K(x_i, x_j), where K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is called the kernel function. Then w = Σ_{i=1}^l α_i y_i φ(x_i), and

    sgn(w^T φ(x) + b) = sgn( Σ_{i=1}^l α_i y_i K(x_i, x) + b )

is the decision function.

We are particularly interested in the gaussian kernel:

    K(x̃, x̄) = exp( −‖x̃ − x̄‖² / (2σ²) ).        (1.3)
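As a concrete reference, the kernel of equation 1.3 can be computed as follows (a minimal pure-Python sketch; the function name is ours, not from the letter):

```python
import math

def gaussian_kernel(x1, x2, sigma_sq):
    """Equation 1.3: K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-dist_sq / (2.0 * sigma_sq))
```

Note that K(x, x) = 1 and K decays to 0 as ‖x1 − x2‖ grows, so each training point influences the decision function only in a neighborhood whose size is controlled by σ².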

Our aim is to analyze the behaviors of the SVM classifier when C and/or σ² take very small or very large values. The motivation is that such an analysis will help in understanding the hyperparameter space that will lead to efficient heuristic ways of searching for points in that space with small generalization errors. Some of the behaviors that we will discuss are known in the literature (although details associated with these are usually not written down carefully), but some key behaviors are new results that are not entirely obvious. Here is a summary of the asymptotic behaviors of the SVM classifier that are derived in this article:

• Severe underfitting (the entire data space is assigned to the majority class) occurs in the following cases: (1) σ² is fixed and C → 0, (2) σ² → 0 and C is fixed to a sufficiently small value, and (3) σ² → ∞ and C is fixed.

• Severe overfitting (small regions around the training examples of the minority class are classified to be that class, while the rest of the data space is classified as the majority class) occurs in the case where σ² → 0 and C is fixed to a sufficiently large value.

• If σ² is fixed and C → ∞, the SVM classifier strictly separates the training examples of the two classes. This is a case of overfitting if the problem under consideration has noise.

• If σ² → ∞ and C = C̃σ², where C̃ is fixed, then the SVM classifier approaches the linear SVM classifier with penalty parameter C̃.



Figure 1: A figurative summary of the asymptotic behaviors. The problem has 11 examples in class 1 (×) and 7 examples in class 2 (+). Thus, class 1 is the majority class, and class 2 is the minority class. The plot in the center shows the eight (log C, log σ²) pairs tried. The decision curves corresponding to these eight pairs are displayed in the surrounding plots at respective positions. Plots without a decision curve correspond to underfitting classifiers for which the entire input region is classified as class 1.

Figure 1 gives a summary of the asymptotic behaviors.

Asymptotic behaviors of the generalization error associated with the SVM classifier as C and/or σ² take extreme values can be understood by studying corresponding behaviors of the leave-one-out (LOO) error. The LOO error is computed as follows. For the ith example, equations 1.1 and 1.2 are solved after leaving out that example. The resulting classifier is applied to check if the ith example is misclassified. The procedure is repeated for each i. The fraction of misclassified examples is the LOO error.

This article is organized as follows. In section 2, we analyze the asymptotic behaviors of the SVM classifier using the gaussian kernel. The results lead to a simple and efficient heuristic model selection strategy described in section 3. Experiments show that the proposed method is competitive with the usual cross-validation search strategy in terms of generalization error achieved, while at the same time it is much more efficient.

2 Asymptotic Behaviors

To establish various asymptotic behaviors of the SVM decision function as well as the LOO error, we need the following assumption, which will be assumed throughout the article:

Assumption 1:

1. l_1 > l_2 + 1 > 2, where l_1 and l_2 are the numbers of training examples in class 1 and class 2, respectively.¹

2. For i ≠ j, x_i ≠ x_j. That is, no two examples have identical x vectors.²

The following lemma is useful.

Lemma 1. For any given (C, σ²), the solution (α) of equation 1.2 is unique. Also, for every σ², {z_i | y_i = 1} and {z_i | y_i = −1} are linearly separable.

Proof. From Micchelli (1986), if the gaussian kernel is used and x_i ≠ x_j ∀ i ≠ j from part 2 of assumption 1, Q is positive definite. By corollary 1 of Chang and Lin (2001b), we get linear separability in z-space. Uniqueness of α follows from the fact that equation 1.2 is a strictly convex quadratic programming problem.

We now discuss the various asymptotic behaviors. As the results of each case are stated, it is useful to refer to the example shown in Figure 1. Wherever we come across results whose proofs do not shed any insight on the asymptotic behaviors, we only state the results and relegate the proofs to the appendix.

2.1 Case 1. σ² Fixed and C → 0. It can be shown (see the proof of theorem 5 in Chang & Lin, 2001b, for details) that if C is smaller than a certain positive value, the following holds:

    α_i = C  ∀ i with y_i = −1.        (2.1)

¹ If l_2 > l_1 + 1 > 2, then we can always interchange the two classes and apply all the results derived in this article. Cases where |l_1 − l_2| ≤ 1 or min{l_1, l_2} < 2 correspond to abnormal situations that are not worth discussing in detail, since in practice the numbers of examples in the two classes rarely satisfy either of these two conditions.

² This is a generic assumption that is easily satisfied if small, random perturbations are added to the training inputs.


Let us take one such C. Using equation 2.1 together with Σ_{i=1}^l y_i α_i = 0 and l_1 > l_2, it is easy to see that there exists at least one i for which α_i < C and y_i = 1. For such an i, we have

    w^T z_i + b ≥ 1.        (2.2)

For C → 0, we have α_i → 0, and so w^T z = Σ_{i=1}^l α_i y_i K(x_i, x) → 0, where z = φ(x). These imply that if X is any compact subset of R^n, then for any given 0 < a < 1, there exists C̄ > 0 such that for all C ≤ C̄, we have

    b ≥ a  and  | Σ_{i=1}^l α_i y_i K(x_i, x) | ≤ a/2  ∀ x ∈ X.        (2.3)

Hence, for all C ≤ C̄, f(x) > 0 ∀ x ∈ X.

In particular, if we take X to be the compact subset of data space that is of interest to the given problem, then for sufficiently small C, every point in this subset is classified as class 1.

The first part of assumption 1 allows us to use similar arguments for the case of equation 1.2 with one example left out. Then we can also show that as C → 0, the number of LOO errors is l_2. Thus, C → 0 corresponds to severe underfitting, as expected. Furthermore, we have the following properties as C → 0:

1. ‖w‖² = α^T Q α → 0.

2. lim_{C→0} (1/C) Σ_{i=1}^l α_i = lim_{C→0} (2/C) Σ_{i: y_i=−1} α_i = 2 l_2.

3. Using the equality of the primal and dual objective function values at optimality and the inequality α^T Q α ≤ l² C², we get

    lim_{C→0} Σ_{i=1}^l ξ_i = lim_{C→0} (1/C) ( Σ_{i=1}^l α_i − α^T Q α ) = 2 l_2.

It is useful to interpret the above asymptotic results geometrically; in particular, study the movement of the top, middle, and bottom planes defined by w^T z + b = 1, w^T z + b = 0, and w^T z + b = −1 as C → 0. By equation 2.2, at least one example of class 1 lies on or above the top plane. By property 1 given above, the distance between the top and bottom planes (which equals 2/‖w‖) goes to infinity. Hence, the middle and bottom planes are forced to move down, farther and farther away from the location of the training points, causing the half-space defined by w^T z + b ≥ 0 to cover X, the compact subset of interest to the problem, entirely after C becomes sufficiently small.


Remark 1. The results given above for C → 0 are general and apply to nongaussian kernels also, assuming, of course, that all hyperparameters associated with the kernel function are kept fixed. The results also apply if Q is a bounded function of C since theorem 5 of Chang and Lin (2001b) holds for this case.

Remark 2. For kernels whose values are bounded (e.g., the gaussian kernel), there is a C̄ such that equation 2.3 holds for all x ∈ R^n. Thus, for all C ≤ C̄,

    f(x) > 0  ∀ x ∈ R^n.

That is, for all C ≤ C̄, every point is classified as class 1.

2.2 Case 2. σ² Fixed and C → ∞. By lemma 1, {z_i | y_i = 1} and {z_i | y_i = −1} are linearly separable. This implies that it is possible to set ξ_i = 0 ∀ i while still remaining feasible for equation 1.1. Thus, as C → ∞, the solution of equation 1.1 approaches the solution of the hard margin problem:

    min_{w,b}  (1/2) w^T w
    subject to  y_i (w^T z_i + b) ≥ 1, i = 1, ..., l.        (2.4)

A formal treatment of this is in Lin (2001), which shows that if equation 2.4 is feasible, there exists a C* such that for C ≥ C*, the solution set of equation 1.1 is the same as that of equation 2.4. An easy way to see this result is to solve equation 1.2 with C = ∞, obtain the {α_i}, and set C* = max_i α_i.

The limiting SVM classifier classifies all training examples correctly, and so it is an overfitting classifier. In particular, severe overfitting occurs when σ² is small, since the flexibility of the classifier is high when σ² is small.

For the case of C → ∞, it is not possible to make any conclusions about the actual value of the LOO error. That value depends on the data set as well as on the value of σ². However, after equation 1.2 is solved using all the examples, it is possible to give bounds on the LOO error (Joachims, 2000; Vapnik & Chapelle, 2000) without solving the quadratic programs obtained by leaving out one example at a time.

2.3 Case 3. C Is Fixed and σ² → 0. Let us define δ_ij = 1 for i = j and δ_ij = 0 for i ≠ j. Since e^{−‖x_i−x_j‖²/(2σ²)} → δ_ij as σ² → 0, we consider the following problem:

    min_α   (1/2) α^T α − e^T α
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,        (2.5)
                y^T α = 0.


Using lemma 2 (the proof is in section A.1), as σ² → 0, the solution of equation 1.2 converges to that of equation 2.5. Since l_1 > l_2, the solution of equation 1.2 has 0 < α_i < C for at least one i.³ Thus, b is uniquely determined, and as σ² → 0, it approaches the value of b corresponding to the primal form of equation 2.5.

Therefore, let us study the solution of equation 2.5. In section A.2, we show that its solution is given by α_i = α⁺ if y_i = 1 and α_i = α⁻ if y_i = −1, where

    α⁻ = { C_lim   if C ≥ C_lim        α⁺ = { 2 l_2 / l     if C ≥ C_lim
         { C       if C < C_lim,           { l_2 C / l_1   if C < C_lim,        (2.6)

and C_lim = 2 l_1 / l. The threshold parameter b in the primal form corresponding to equation 2.5 can be determined using the fact that 0 < α⁺ < C (and hence all class 1 examples lie on the top plane defined by w^T z + b = 1):

    b = { (l_1 − l_2) / l   if C ≥ C_lim,
        { 1 − l_2 C / l_1   if C < C_lim.        (2.7)
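The closed forms 2.6 and 2.7 are easy to sanity-check numerically (a sketch; the function name is ours, with α⁺ the common multiplier of class 1 examples and α⁻ that of class 2):

```python
def limiting_solution(l1, l2, C):
    """Equations 2.6 and 2.7: the sigma^2 -> 0 limit of the dual solution
    of equation 2.5 and the corresponding threshold b."""
    l = l1 + l2
    c_lim = 2.0 * l1 / l
    if C >= c_lim:
        alpha_minus = c_lim
        alpha_plus = 2.0 * l2 / l
        b = (l1 - l2) / l
    else:
        alpha_minus = C
        alpha_plus = l2 * C / l1
        b = 1.0 - l2 * C / l1
    return alpha_plus, alpha_minus, b
```

In both regimes, l_1 α⁺ = l_2 α⁻ (the equality constraint y^T α = 0) and α⁺ + b = 1 (every class 1 example lies on the top plane), which is exactly what equations 2.6 and 2.7 encode.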

Consider the classifier function f(x) = w^T z + b corresponding to equation 2.5. In section A.2, we also show the following:

1. If C ≥ C_lim/2, f classifies all training examples correctly and classifies the rest of the space as class 1. Thus, it overfits the training data.

2. If C < C_lim/2, then f classifies the entire space as class 1, and so it underfits the training data.

3. The number of LOO errors is l_2.

Consider the SVM classifier corresponding to the gaussian kernel for small values of σ². Even though the number of LOO errors tends to l_2 for all C, it is important to note that the SVM classifier is qualitatively very different for large C and small C. For large C, there are small regions around each example of class 2 that are classified as class 2 (overfitting), while for small C, there are no such regions (underfitting).

It is interesting to note that if σ² is small and C is greater than a threshold that is around C_lim, from equation 2.6, the SVM classifier does not depend on C. Thus, contour lines of constant generalization error are parallel to the C axis in the region where σ² is small and C is large.

³ As we show below (see equation 2.6), the solution of equation 2.5 is well in the interior of (0, C) for at least one i. Since for small values of σ² the solution of equation 1.2 approaches that of equation 2.5, it follows that the solution of equation 1.2 also has 0 < α_i < C for at least one i.


2.4 Case 4. C Is Fixed and σ² → ∞. When σ² → ∞, we can write

    K(x̃, x̄) = exp( −‖x̃ − x̄‖² / (2σ²) )
             = 1 − ‖x̃ − x̄‖² / (2σ²) + o( ‖x̃ − x̄‖² / σ² )
             = 1 − ‖x̃‖² / (2σ²) − ‖x̄‖² / (2σ²) + x̃^T x̄ / σ² + o( ‖x̃ − x̄‖² / σ² ).        (2.8)
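The first-order expansion 2.8 can be verified numerically: the gap between the kernel and the expansion should vanish even after multiplying by σ², since it is o(1/σ²) (a small check script with names of our own choosing):

```python
import math

def kernel(x1, x2, sigma_sq):
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2.0 * sigma_sq))

def first_order(x1, x2, sigma_sq):
    """Right-hand side of equation 2.8 without the o(.) term."""
    n1 = sum(a * a for a in x1)
    n2 = sum(b * b for b in x2)
    dot = sum(a * b for a, b in zip(x1, x2))
    return 1.0 - n1 / (2.0 * sigma_sq) - n2 / (2.0 * sigma_sq) + dot / sigma_sq

def scaled_residual(x1, x2, sigma_sq):
    # An o(1/sigma^2) error stays small even after multiplying by sigma^2.
    return abs(kernel(x1, x2, sigma_sq) - first_order(x1, x2, sigma_sq)) * sigma_sq
```

For fixed x̃ = (1, 2) and x̄ = (0.5, −1), the scaled residual behaves like a constant over σ² and so keeps shrinking as σ² grows.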

Now consider equation 1.2. Using the simplification given above, we can write the first (quadratic) term of the objective function in equation 1.2 as

    Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) = T_1 + (T_2 + T_3 + T_4) / (2σ²) + Σ_i Σ_j α_i α_j y_i y_j ε_ij / σ²,

where

    T_1 = Σ_i Σ_j α_i α_j y_i y_j,    T_2 = −Σ_i Σ_j α_i α_j y_i y_j ‖x_i‖²,
    T_3 = −Σ_i Σ_j α_i α_j y_i y_j ‖x_j‖²,    T_4 = 2 Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j,

and lim_{σ²→∞} ε_ij = 0.        (2.9)

By the equality constraint of equation 1.2, T_1 = ( Σ_i α_i y_i )² = 0. We can also rewrite T_2 as T_2 = −( Σ_i α_i y_i ‖x_i‖² )( Σ_j α_j y_j ) = 0. In a similar way, T_3 = 0. By defining

    α̃_i = α_i / σ²  ∀ i,        (2.10)

equation 1.2 can be written as⁴

    min_{α̃}   F(α)/σ² = (1/2) Σ_i Σ_j α̃_i α̃_j y_i y_j K̃_ij − Σ_i α̃_i
    subject to  0 ≤ α̃_i ≤ C̃, i = 1, ..., l,        (2.11)
                y^T α̃ = 0,

where

    K̃_ij = x_i^T x_j + ε_ij  and  C̃ = C / σ².        (2.12)


Remark 3. Note that K̃_ij may not correspond to a valid kernel satisfying Mercer's condition. But that is immaterial, since we always operate with the constraint y^T α̃ = 0. In the presence of this constraint, equations 1.2 and 2.11 are equivalent.

Remark 4. If C is fixed at some value and σ² is made large, C̃ of equation 2.11 goes to zero, and so the situation is similar to case 1, discussed at the beginning of this section. By equation 2.9, K̃_ij is a bounded function for large σ² (or, equivalently, for small C̃). By the last sentence of remark 1, the results of case 1 can be applied here. Thus, for C fixed and σ² → ∞, equation 2.11 corresponds to a severely underfitting classifier. Since equations 2.11 and 1.2 correspond to the same problem in different forms, they have the same primal decision function (for full details, see equation A.8). Therefore, in this situation, we get a severely underfitting classifier.

For a given C̃, as σ² → ∞ and C varies with σ² as given by equation 2.12, we can see that equation 2.11 is close to the following linear SVM problem:

    min_{α̃}   (1/2) Σ_i Σ_j α̃_i α̃_j y_i y_j x_i^T x_j − Σ_i α̃_i
    subject to  0 ≤ α̃_i ≤ C̃, i = 1, ..., l,        (2.13)
                y^T α̃ = 0.

We are interested in their corresponding decision functions, which can lead us to analyze the performance of equation 1.1. Now the primal form of equation 2.13 is

    min_{w̃,b̃,ξ̃}  (1/2) w̃^T w̃ + C̃ Σ_{i=1}^l ξ̃_i
    subject to   y_i (w̃^T x_i + b̃) ≥ 1 − ξ̃_i,        (2.14)
                 ξ̃_i ≥ 0, i = 1, ..., l.

Let (w(σ²), b(σ²)) and (w̃, b̃) denote primal optimal solutions of equations 1.1 and 2.14, respectively. We then have the following theorem:

Theorem 1. For any x, lim_{σ²→∞} w(σ²)^T z = w̃^T x. If the optimal b̃ of equation 2.14 is unique, then lim_{σ²→∞} b(σ²) = b̃, and hence the following also hold:

1. For any x, lim_{σ²→∞} w(σ²)^T z + b(σ²) = w̃^T x + b̃.

2. If w̃^T x + b̃ ≠ 0, then for σ² sufficiently large, sgn(w(σ²)^T z + b(σ²)) = sgn(w̃^T x + b̃).

The proof is in section A.3. Thus, for a given C̃, the limiting SVM gaussian kernel classifier as σ² → ∞ is the same as the SVM linear kernel classifier for C̃. Hereafter, we will simply refer to the SVM linear kernel classifier as linear SVM. The above analysis can also be extended to show that as σ² → ∞, the LOO error corresponding to equations 1.1 and 2.13 is the same.

The above results also show that in the part of the hyperparameter space where σ² is large, if (C_1, σ_1²) and (C_2, σ_2²) are related by C_1/σ_1² = C_2/σ_2² = C̃, the classifiers corresponding to the two combinations are nearly the same. Hence, both will give nearly the same value for generalization error (or an estimate of it, such as the k-fold cross-validation error or LOO error). Thus, in this part of the hyperparameter space, contour lines of such functions will be straight lines with slope 1: log σ² = log C − log C̃. Then all classifiers defined by points on that straight line for large σ² are nearly the same as the linear SVM classifier corresponding to C̃.

Given that for any x, lim_{σ²→∞} w(σ²)^T z = w̃^T x holds without any assumption, the assumption on the uniqueness of b̃ in theorem 1 should be viewed as only a minor technical irritant.⁵ For normal situations, the uniqueness assumption is a reasonable one to make. Unless C̃ is very small, typically there will be at least one α̃_i strictly in between 0 and C̃; when such an α̃_i exists, lemma 3 in section A.1 (as applied to equation 2.14) implies the uniqueness of b̃. The case of a very small C̃ corresponds to the upper left part of the plane in which log C and log σ² are the horizontal and vertical axes. We can easily see this by considering C fixed and increasing σ² to large values (the upper part) or considering σ² fixed and decreasing C to small values (the left part). As remark 4 and case 1 of this section show, each of these asymptotic behaviors corresponds to a severely underfitting SVM decision function.

Finally, theorem 1 also indicates that if complete model selection on (C, σ²) using the gaussian kernel has been conducted, there is no need to consider linear SVM. This helps in the selection of kernels.

3 A Method of Model Selection

It is usual to take log C and log σ² as the parameters of the hyperparameter space. Putting together the results derived in the previous section, it is easy to see that in the asymptotic (outer) regions of the (log C, log σ²)

⁵ The assumption is needed to state results cleanly. If b̃ is nonunique, SVM classifiers also become nonunique, and then it becomes clumsy to talk about convergence of SVM decision functions.



Figure 2: A rough boundary curve separating the underfitting/overfitting region from the "good" region. For each fixed C̃, the equation log σ² = log C − log C̃ defines a straight line of unit slope. As σ² → ∞ along this line, the SVM classifier converges to the linear SVM classifier with penalty parameter C̃. The dotted line corresponds to the choice of C̃ that gives the optimal generalization error for the linear SVM.

space, there exists a contour of generalization error (or an estimate such as the LOO error or k-fold cross-validation error) that looks like that shown in Figure 2 and helps separate the hyperparameter space into two regions: an overfitting/underfitting region and a good region (which most likely has the hyperparameter set with the best generalization error). (For LOO, recall that in the underfitting/overfitting region, the number of LOO errors is l_2.) The straight line with unit slope in the large σ² region (log σ² = log C − log C̃) corresponds to a choice of C̃ that is small enough to make the linear SVM an underfitting one. The presence of a separating contour as outlined in Figure 2 has been observed on a number of real-world data sets (Lee, 2001).

When searching for a good set of values for log C and log σ², it is usual to form a two-dimensional uniform grid (say, r × r) of points in this space and find a combination that gives the least value for some estimate of generalization error. This is expensive, since it requires trying r² (C, σ²) pairs. The earlier discussion relating to Figure 2 suggests a simple and efficient heuristic method for finding a hyperparameter set with small generalization error: form a line of unit slope that cuts through the middle part of the good region (see the dashed line in Figure 2) and search on it for a good set of hyperparameters. The C̃ that defines this line can be set to the optimal value


of penalty parameter for the linear SVM. Thus, we propose the following procedure:

1. Search for the best C of linear SVM and call it C̃.

2. Fix C̃ from step 1, and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using the gaussian kernel.

The idea is that as σ² → ∞, SVM with the gaussian kernel behaves like linear SVM, and so the best C̃ should occur in the upper part of the "good" region in Figure 2. Then a search on the line defined by log σ² = log C − log C̃ gives an even better point in the "good" region. In many practical pattern recognition problems, a linear classifier already gives a reasonably good performance, and some added nonlinearities help obtain finer improvements in accuracy. Step 2 of our procedure can be thought of as a simple way of injecting the required nonlinearities via the gaussian kernel. Since the procedure involves only two one-dimensional searches, it requires only 2r pairs of (C, σ²) to be tried.

To test the goodness of the proposed method, we compare it with the usual method of using a two-dimensional grid search. For both, fivefold cross-validation was used to obtain estimates of generalization error. For the usual method, we uniformly discretize the [−10, 10] × [−10, 10] region to 21² = 441 points. At each point, a fivefold cross-validation is conducted. The point with the best CV accuracy is chosen and used to predict the test data.

For the proposed method, we search for C̃ by fivefold cross-validation on linear SVM using uniformly spaced log C values in [−8, 2]. Then we discretize [−8, 8] as values of log σ² and check all points satisfying log σ² = log C − log C̃. Because now fewer points have to be tried, we use the smaller grid spacing of 0.5 for both discretizations. The total number of points tried is 54.

To evaluate empirically the usefulness of the proposed method, we consider several binary problems from Rätsch (1999). For each problem, Rätsch (1999) gives 100 realizations of the given data set into (training set, test set) partitions. We consider only the first of those realizations. In addition, the problem adult, from the UCI "adult" data set (Blake & Merz, 1998), and the problem web, both as compiled by Platt (1998), are also included. For each of these two data sets, there are several realizations. For our study here, we consider only the realization with the smallest training set; the full data set with training data (including duplicated ones) removed is taken as the test set. For all data sets used, Table 1 gives the number of input variables, the number of training examples, and the number of test examples. All data sets are directly used as given in the mentioned references, without any further normalization or scaling.

The SVM software LIBSVM (Chang & Lin, 2001a), which implements a decomposition method, is employed for solving equation 1.2. Table 1 presents


Table 1: Comparison of the Model Selection Methods.

Problem    Number of  Number of  Number of  Test Set Error of   Test Set Error of
           Inputs     Training   Test       Usual Grid Method   Proposed Method
                      Examples   Examples
banana        2         400       4900      0.1235  (6, −0)     0.1178  (−2, −2)
diabetes      8         468        300      0.2433  (4, 7)      0.2433  (4, 6)
image        18        1300       1010      0.02475 (9, 4)      0.02475 (1, 0.5)
splice       60        1000       2175      0.09701 (1, 4)      0.1011  (0, 4)
ringnorm     20         400       7000      0.01429 (−2, 2)     0.018   (−3, 2)
twonorm      20         400       7000      0.031   (1, 3)      0.02914 (1, 4)
waveform     21         400       4600      0.1078  (0, 3)      0.1078  (0, 3)
tree         18         700     11,692      0.1132  (8, 4)      0.1246  (2, 2)
adult       123        1605     29,589      0.1614  (5, 6)      0.1614  (5, 6)
web         300        2477     38,994      0.02223 (5, 5)      0.02223 (5, 5)

Note: For each approach, apart from the test error, the optimal (log C, log σ²) pair is also given.

the test error of the two methods, as well as the corresponding chosen values of log C and log σ². It can be clearly seen that the new method is very competitive with the usual method in terms of test set accuracy. For large data sets, the proposed method has the great advantage that it checks many fewer points on the (log C, log σ²) plane, and so the savings in computing time can be large.

Note that in the chosen problems, the following quantities have a reasonably wide range: test error (1.5% to 25%), the number of input variables (2 to 300), and the number of training examples (400 to 2477), and so the empirical evaluation demonstrates the applicability of the proposed approach to different types of data sets.

A remaining issue is how to decide the range of log C for determining C̃ in step 1. From Table 1, we can see that log C̃ = log C − log σ² is usually not a large number. Furthermore, we observe that for all problems, after C is greater than a certain threshold, the cross-validation accuracy of the linear SVM is about the same. Therefore, if we start searching from small C values and go on to large C values, the search can be stopped after the CV accuracy stops varying much. An example of the variation of the fivefold CV accuracy of linear SVM is given in Figure 3.
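The stopping rule just described can be sketched as follows (an illustrative helper of our own; `cv_accuracy` stands in for running fivefold cross-validation of the linear SVM at a given log C):

```python
def search_log_c(log_c_values, cv_accuracy, tol=1e-3, patience=3):
    """Scan log C from small to large values; stop once the CV accuracy
    has not improved by more than tol for `patience` consecutive steps."""
    best_lc = log_c_values[0]
    best_acc = cv_accuracy(best_lc)
    flat = 0
    for lc in log_c_values[1:]:
        acc = cv_accuracy(lc)
        if acc > best_acc + tol:
            best_lc, best_acc = lc, acc
            flat = 0
        else:
            flat += 1
            if flat >= patience:
                break
    return best_lc, best_acc
```

With a CV curve shaped like Figure 3 (rising and then flattening out), the scan stops a few grid points past the knee instead of evaluating the whole range.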

For linear SVMs, we can formally establish that there exists a finite limiting value C* such that for C ≥ C*, the solution of the linear SVM remains unchanged. If {x_i : y_i = 1} and {x_i : y_i = −1} are linearly separable, then the above result is easy to appreciate; the same ideas used in case 2 can be applied to show this. However, if {x_i : y_i = 1} and {x_i : y_i = −1} are not



Figure 3: Variation of CV accuracy of linear SVM with C for the image problem.

linearly separable, such a result is harder to establish. Here, we prove the following theorem:

Theorem 2. There exists a finite value C* and a pair (w*, b*) such that (w, b) = (w*, b*) solves equation 1.1 ∀ C ≥ C*. If this decision function is used, the LOO error is the same for all C ≥ C*. Moreover, this w* is unique.

Details of the proof are in section A.4.

Appendix

A.1 Two Useful Lemmas.

Lemma 2. Consider an optimization problem of the form of equation 1.2 where Q is a function of σ² (denoted as Q(σ²)). Let α(σ²) be its solution. For a given number a, if

    Q* ≡ lim_{σ²→a} Q(σ²)

exists, then there exists a convergent sequence {α(σ_k²)} with σ_k² → a, and the limit of any such sequence is an optimal solution of equation 1.2 with the Hessian matrix Q*. Moreover, if Q* is positive definite, lim_{σ²→a} α(σ²) exists.


Proof. The feasible region of equation 1.2 is independent of σ², and it is compact. Then there exists a convergent sequence {α(σ_k²)} with lim_{k→∞} σ_k² = a. For any one such sequence, we have

    (1/2) α(σ_k²)^T Q(σ_k²) α(σ_k²) − e^T α(σ_k²) ≤ (1/2) (α*)^T Q(σ_k²) α* − e^T α*,  and        (A.1)
    (1/2) (α*)^T Q* α* − e^T α* ≤ (1/2) α(σ_k²)^T Q* α(σ_k²) − e^T α(σ_k²),

where α* is any optimal solution of equation 1.2 with the Hessian matrix Q*. If α(σ_k²) goes to ᾱ, taking the limit of equation A.1,

    (1/2) (α*)^T Q* α* − e^T α* = (1/2) ᾱ^T Q* ᾱ − e^T ᾱ.

Thus, ᾱ is an optimal solution too.

If Q* is positive definite, equation 1.2 is a strictly convex problem with a unique optimal solution. This implies that lim_{σ²→a} α(σ²) exists.

Lemma 3. If equation 1.2 has an optimal solution with at least one free variable (i.e., 0 < α_i < C for at least one i), then the optimal b of equation 1.1 is unique.

Proof. The Karush-Kuhn-Tucker (KKT) condition (i.e., the optimality condition) of equation 1.2 is that if α is an optimal solution, there are a number b and two nonnegative vectors λ and µ such that

    ∇F(α) + b y = λ − µ,
    λ_i α_i = 0, µ_i (C − α_i) = 0, λ_i ≥ 0, µ_i ≥ 0, i = 1, ..., l,

where ∇F(α) = Qα − e is the gradient of F(α) = (1/2) α^T Q α − e^T α. This can be rewritten as

    ∇F(α)_i + b y_i ≥ 0  if α_i = 0,
    ∇F(α)_i + b y_i ≤ 0  if α_i = C,        (A.2)
    ∇F(α)_i + b y_i = 0  if 0 < α_i < C.

Note that

    ∇F(α)_i = y_i w^T z_i − 1

is independent of different optimal solutions α, as the primal optimal solution w is unique.

Let (w, b, ξ) denote a primal solution. As already said, w is unique. By convexity of the solution set, the set B of all possible b solutions is an interval.


Once b is chosen, ξ is uniquely defined. By assumption, there exists b ∈ B and a corresponding Lagrange multiplier vector α(b) with a free component, say, 0 < α(b)_k < C. Thus, α(b) is an optimal solution of equation 1.2, and so, by equation A.2, ∇F(α)_k + b y_k = 0. Denote

    A_0 = {i | ∇F(α)_i + b y_i > 0},  A_C = {i | ∇F(α)_i + b y_i < 0},  and
    A_F = {i | i ∉ A_0 ∪ A_C}.

Let us define e and f to be the minimum and maximum of the following set:

    {b_new | ∇F(α)_i + b_new y_i > 0 if i ∈ A_0;
             ∇F(α)_i + b_new y_i < 0 if i ∈ A_C}.        (A.3)

Clearly, e < b < f. Suppose B is not a singleton. Now choose b_new ∈ B ∩ [e, f] such that b_new ≠ b. Let α(b_new) be any Lagrange multiplier vector corresponding to b_new. Thus, α(b_new) and b_new satisfy equation A.2. Suppose b_new > b and y_k = 1. Then

    ∇F(α)_k + b_new y_k > 0, so α(b_new)_k = 0 < α(b)_k.        (A.4)

If we use equation A.2 as applied to (b, α(b)) and (b_new, α(b_new)), equation A.3 implies the following: α(b)_i = α(b_new)_i ∀ i ∈ A_0 ∪ A_C with y_i = y_k; also, α(b)_i ≥ α(b_new)_i ∀ i ∈ A_F with y_i = y_k. Note that k is an element of this second group. Thus, with equation A.4,

    Σ_{i: y_i = y_k} α(b)_i > Σ_{i: y_i = y_k} α(b_new)_i.        (A.5)

This is a violation of the fact that both α(b) and α(b_new) are solutions of equation 1.2 since, for a given dual solution α, the dual cost is (‖w‖²/2) − 2 Σ_{i: y_i = 1} α_i, and the first term is the same for α(b) as well as α(b_new). If y_k = −1, the proof is the same, but equation A.5 becomes Σ_{i: y_i = y_k} α(b)_i < Σ_{i: y_i = y_k} α(b_new)_i. A similar contradiction can be reached if b_new < b. Thus, B is a singleton and b is unique.

A.2 Optimal Solution of Equation 2.5. The KKT conditions applied to equation 2.5 correspond to the existence of a scalar b and two nonnegative vectors λ and µ such that

    α_i − 1 + b y_i = λ_i − µ_i,        (A.6)
    α_i λ_i = 0, (C − α_i) µ_i = 0, i = 1, ..., l.


To show that the solution is given by equations 2.6 and 2.7, all that we need to do is to show the existence of λ and µ so that equation A.6 holds. For the solution 2.6, when C ≥ C_lim, using the b defined in equation 2.7,

    α_i − 1 + b y_i = 0  ∀ i,

so we can simply choose λ = µ = 0 so that equation A.6 is satisfied. If C < C_lim,

    α_i − 1 + b y_i = { 0                  if y_i = 1,
                      { C l / l_1 − 2 ≤ 0  if y_i = −1,

so equation A.6 also holds. Therefore, equation 2.6 gives an optimal solution for equation 2.5.

Let us now analyze properties of the classifier function f associated with equation 2.5. Note, using equation 2.7, that b > 0. For x ≠ x_i, exp(−‖x − x_i‖²/(2σ²)) → 0 as σ² → 0. Therefore, for such x, the classifier function corresponding to equation 2.5 is given by f(x) = b. Since b > 0, all points x not in the training set are classified as class 1, regardless of the value of C. This, together with the second part of assumption 1, implies that the number of LOO errors is equal to l_2.

For a training point x_i, we have f(x_i) → y_i α_i + b as σ² → 0. Thus, after σ² is sufficiently small, all class 1 training points are classified correctly by f. For training points x_i in class 2, we can use equations 2.6 and 2.7 to show that (i) for C > C_lim/2, all of those points are classified correctly by f, and (ii) for C ≤ C_lim/2, all of those points are classified incorrectly by f.

A.3 Proof of Theorem 1. To prove theorem 1, first we write down the primal form of equation 2.11:

    min_{w̃,b̃,ξ̃}  (1/2) w̃^T w̃ + C̃ Σ_{i=1}^l ξ̃_i
    subject to   y_i (w̃^T φ̃(x_i) + b̃) ≥ 1 − ξ̃_i,        (A.7)
                 ξ̃_i ≥ 0, i = 1, ..., l,

where φ̃(x) ≡ σ φ(x).⁶ By defining w ≡ σ w̃, multiplying the objective function of equation A.7 by σ², and using equation 2.12, equation A.7 has exactly the same form as equation 1.1, so we can say

    b = b̃,  ξ = ξ̃,  and  w̃^T φ̃(x) + b̃ = w^T φ(x) + b.        (A.8)

6It should be pointed out that equation 2.11 is not directly the dual of equation A.7.
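The rescaling argument behind equation A.8, that multiplying the kernel by σ² while dividing the penalty by σ² leaves the decision function unchanged, can be checked numerically. This is only a sketch on synthetic data, using a linear kernel as a stand-in for any kernel matrix and scikit-learn's precomputed-kernel interface:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=30) > 0, 1, -1)

K = X @ X.T        # any kernel matrix would do
s = 25.0           # the scaling factor (sigma^2 in equation A.8)

f1 = SVC(kernel="precomputed", C=10.0).fit(K, y)          # kernel K, penalty C
f2 = SVC(kernel="precomputed", C=10.0 / s).fit(s * K, y)  # kernel s*K, penalty C/s

# The dual variables scale (alpha -> alpha/s), but the decision
# function w^T phi(x) + b is identical up to solver tolerance:
d1 = f1.decision_function(K)
d2 = f2.decision_function(s * K)
print(np.max(np.abs(d1 - d2)))
```

The substitution α = β/s maps the dual of the scaled problem onto the dual of the original one, which is why the two fitted decision functions coincide.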


A difficulty in proving this theorem is that the solution of equation A.7 is an element of a vector space that is different from that of a solution of equation 2.14. Hence, to build the relation as σ² → ∞, we will consider their duals using lemma 2.

Assume α̃(σ²) is the solution of equation 2.11 under a given C̃. It is in a bounded region for all σ² > 0, so there is a convergent sequence α̃(σ²_k) → α̃ as σ²_k → ∞. We can apply lemma 2, as now the ij component of the Hessian of equation 2.11 is a function of σ²:

\[
y_i y_j \tilde K_{ij} = \sigma^2 y_i y_j \left( e^{-\|x_i - x_j\|^2/(2\sigma^2)} - 1 + \frac{\|x_i\|^2}{2\sigma^2} + \frac{\|x_j\|^2}{2\sigma^2} \right),
\]

with the limit y_i y_j x_iᵀx_j as σ² → ∞. Therefore, α̃(σ²_k) converges to an optimal solution α̃ of equation 2.13.
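This kernel limit is easy to verify numerically: for fixed vectors, K̃_ij approaches x_iᵀx_j as σ² grows. A quick sketch with arbitrary example vectors:

```python
import numpy as np

def k_tilde(xi, xj, sigma2):
    # sigma^2 * (exp(-||xi - xj||^2/(2 sigma^2)) - 1
    #            + ||xi||^2/(2 sigma^2) + ||xj||^2/(2 sigma^2))
    d2 = np.sum((xi - xj) ** 2)
    return sigma2 * (np.exp(-d2 / (2 * sigma2)) - 1) + (xi @ xi + xj @ xj) / 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
for s2 in [1e0, 1e2, 1e4, 1e8]:
    print(s2, k_tilde(xi, xj, s2))    # tends to xi @ xj = 1.0
```

The first term behaves like −‖x_i − x_j‖²/2 for large σ², and adding ‖x_i‖²/2 + ‖x_j‖²/2 leaves exactly the inner product x_iᵀx_j.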

We denote by w̃(σ²) and w̃ the unique optimal solutions of equations A.7 and 2.14, respectively. Then, for any such convergent sequence {α̃(σ²_k)}_{k=1}^∞, using yᵀα̃(σ²_k) = 0, we have that for any x,

\[
\begin{aligned}
\lim_{\sigma_k^2 \to \infty} \tilde w(\sigma_k^2)^T \tilde\phi(x)
&= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde\alpha(\sigma_k^2)_i \, \tilde\phi(x_i)^T \tilde\phi(x) \\
&= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde\alpha(\sigma_k^2)_i \left( \sigma_k^2 - \|x_i\|^2/2 + x_i^T x - \|x\|^2/2 \right) \\
&= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde\alpha(\sigma_k^2)_i \left( -\|x_i\|^2/2 + x_i^T x \right) \qquad \text{(A.9)} \\
&= \sum_{i=1}^{l} y_i \tilde\alpha_i x_i^T x + d(\tilde\alpha) = \tilde w^T x + d(\tilde\alpha), \qquad \text{(A.10)}
\end{aligned}
\]

where equation A.9 follows from equation 2.8 and yᵀα̃(σ²_k) = 0, and d(α̃) ≡ −∑_{i=1}^l y_i α̃_i ‖x_i‖²/2. In a similar way, we can prove

\[
\lim_{\sigma_k^2 \to \infty} \tilde w(\sigma_k^2)^T \tilde w(\sigma_k^2)
= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} \sum_{j=1}^{l} \tilde\alpha(\sigma_k^2)_i \tilde\alpha(\sigma_k^2)_j y_i y_j \tilde\phi(x_i)^T \tilde\phi(x_j)
= \sum_{i=1}^{l} \sum_{j=1}^{l} \tilde\alpha_i \tilde\alpha_j y_i y_j x_i^T x_j = \tilde w^T \tilde w. \tag{A.11}
\]

Note that equation A.11 follows from the discussion between equations 2.8 and 2.11. The last equality is via w̃ = ∑_{i=1}^l y_i α̃_i x_i, as w̃ is the optimal solution of equation 2.14 and α̃ is optimal for its dual, equation 2.13.

Next we consider that (w̃, b̃) is the unique optimal solution of equation 2.14. The constraints of equation A.7 imply that

\[
\max_{y_i = 1} \left\{ 1 - \tilde w(\sigma^2)^T \tilde\phi(x_i) - \tilde\xi(\sigma^2)_i \right\}
\le \tilde b(\sigma^2) \le
\min_{y_i = -1} \left\{ -1 - \tilde w(\sigma^2)^T \tilde\phi(x_i) + \tilde\xi(\sigma^2)_i \right\}.
\]

Note that the primal-dual optimality condition implies

\[
0 \le \tilde\xi(\sigma^2)_i \le \sum_{i=1}^{l} \tilde\xi(\sigma^2)_i \le \frac{e^T \tilde\alpha(\sigma^2)}{\tilde C} \le l.
\]

With equation A.9 and the assumption l₁ ≥ 1 and l₂ ≥ 1, after σ² is large enough, b̃(σ²) is in a bounded region. When (w̃(σ²), b̃(σ²)) is optimal for equation A.7, the optimal ξ̃(σ²) is

\[
\tilde\xi(\sigma^2)_i \equiv \max\left(0,\; 1 - y_i(\tilde w(\sigma^2)^T \tilde\phi(x_i) + \tilde b(\sigma^2))\right).
\]

For any convergent sequence b̃(σ²_k) → b* with σ²_k → ∞, we can further have a subsequence such that {α̃(σ²_k)} converges. Thus, we can consider any such sequence with both properties. Then equation A.10 implies

\[
\tilde\xi(\sigma_k^2)_i \to \xi^*_i = \max\left(0,\; 1 - y_i(\tilde w^T x_i + d(\tilde\alpha) + b^*)\right). \tag{A.12}
\]

Hence, (w̃, b* + d(α̃), ξ*) is feasible for equation 2.14. By defining

\[
\bar\xi(\sigma_k^2)_i \equiv \max\left(0,\; 1 - y_i(\tilde w(\sigma_k^2)^T \tilde\phi(x_i) - d(\tilde\alpha) + \tilde b)\right),
\]

(w̃(σ²_k), b̃ − d(α̃), ξ̄(σ²_k)) is feasible for equation A.7. In addition, using equation A.10,

\[
\tilde\xi_i \equiv \lim_{\sigma_k^2 \to \infty} \bar\xi(\sigma_k^2)_i = \max\left(0,\; 1 - y_i(\tilde w^T x_i + \tilde b)\right), \tag{A.13}
\]

so (w̃, b̃, ξ̃) is optimal for equation 2.14. Thus,

\[
\frac{1}{2}\tilde w^T \tilde w + \tilde C \sum_{i=1}^{l} \tilde\xi_i \le \frac{1}{2}\tilde w^T \tilde w + \tilde C \sum_{i=1}^{l} \xi^*_i,
\]

and

\[
\frac{1}{2}\tilde w(\sigma_k^2)^T \tilde w(\sigma_k^2) + \tilde C \sum_{i=1}^{l} \tilde\xi(\sigma_k^2)_i \le \frac{1}{2}\tilde w(\sigma_k^2)^T \tilde w(\sigma_k^2) + \tilde C \sum_{i=1}^{l} \bar\xi(\sigma_k^2)_i. \tag{A.14}
\]

With equations A.11, A.12, and A.13, taking the limit, A.14 becomes

\[
\frac{1}{2}\tilde w^T \tilde w + \tilde C \sum_{i=1}^{l} \xi^*_i \le \frac{1}{2}\tilde w^T \tilde w + \tilde C \sum_{i=1}^{l} \tilde\xi_i.
\]

Therefore, we have that (w̃, b* + d(α̃), ξ*) is optimal for equation 2.14. Since b̃ is unique by assumption,

\[
b^* + d(\tilde\alpha) = \tilde b. \tag{A.15}
\]

Now we are ready to prove the main result, equation 2.15. If it is wrong, there are ε > 0 and a sequence {w̃(σ²_k)} with σ²_k → ∞ such that

\[
\left| \tilde w(\sigma_k^2)^T \tilde\phi(x) + b(\sigma_k^2) - \tilde w^T x - \tilde b \right| \ge \varepsilon, \quad \forall\, k. \tag{A.16}
\]

Since we can find an infinite subset K such that lim_{k∈K, σ²_k→∞} b̃(σ²_k) = b* and equation A.10 holds, with b(σ²) = b̃(σ²) from equation A.8, the above analysis (equations A.10 and A.15) shows that

\[
\lim_{k \in K,\, \sigma_k^2 \to \infty} w(\sigma_k^2)^T \phi(x) + b(\sigma_k^2) = \tilde w^T x + d(\tilde\alpha) + \tilde b - d(\tilde\alpha) = \tilde w^T x + \tilde b.
\]

This contradicts equation A.16, so equation 2.15 is valid. Therefore, if w̃ᵀx + b̃ ≠ 0, after σ² is sufficiently large,

\[
\operatorname{sgn}\left(w(\sigma^2)^T \phi(x) + b(\sigma^2)\right) = \operatorname{sgn}\left(\tilde w^T x + \tilde b\right).
\]

A.4 Proof of Theorem 2. Let α¹ be a feasible vector of equation 1.2 for C = C₁ and α² be a feasible vector of equation 1.2 for C = C₂. We say that α¹ and α² are on the same face if the following hold: (i) {i | 0 < α¹_i < C₁} = {i | 0 < α²_i < C₂}; (ii) {i | α¹_i = C₁} = {i | α²_i = C₂}; and (iii) {i | α¹_i = 0} = {i | α²_i = 0}. To prove theorem 2, we need the following result:

Lemma 4. If C₁ < C₂ and their corresponding duals have optimal solutions at the same face, then for any C₁ ≤ C ≤ C₂, there is at least one optimal solution at the same face. Furthermore, there are optimal solutions α and b of equation 1.2 which form linear functions of C in [C₁, C₂].

Proof of Lemma 4. If α¹ and α² are optimal solutions at the same face corresponding to C₁ and C₂, then they satisfy the following KKT conditions, respectively:

\[
\begin{aligned}
Q\alpha^1 - e + b^1 y &= \lambda^1 - \mu^1, & \lambda^1_i \alpha^1_i &= 0, & (C_1 - \alpha^1_i)\mu^1_i &= 0, \\
Q\alpha^2 - e + b^2 y &= \lambda^2 - \mu^2, & \lambda^2_i \alpha^2_i &= 0, & (C_2 - \alpha^2_i)\mu^2_i &= 0.
\end{aligned}
\]

Since they are at the same face,

\[
\begin{aligned}
\lambda^2_i \alpha^1_i &= 0, & \lambda^1_i \alpha^2_i &= 0, \\
(C_1 - \alpha^1_i)\mu^2_i &= 0, & (C_2 - \alpha^2_i)\mu^1_i &= 0.
\end{aligned} \tag{A.17}
\]

As C₁ ≤ C ≤ C₂, we can have 0 ≤ τ ≤ 1 such that

\[
C = \tau C_1 + (1 - \tau) C_2. \tag{A.18}
\]

Let

\[
\begin{aligned}
\alpha &\equiv \tau\alpha^1 + (1-\tau)\alpha^2, & \lambda &\equiv \tau\lambda^1 + (1-\tau)\lambda^2, \\
\mu &\equiv \tau\mu^1 + (1-\tau)\mu^2, & b &\equiv \tau b^1 + (1-\tau) b^2.
\end{aligned} \tag{A.19}
\]

Then α, λ, µ, b satisfy the KKT conditions at C:

\[
\begin{aligned}
Q\alpha - e + by &= \lambda - \mu, & \lambda_i\alpha_i &= 0, & (C - \alpha_i)\mu_i &= 0, \\
0 \le \alpha_i &\le C, & \lambda_i \ge 0, \quad \mu_i &\ge 0, & y^T\alpha &= 0.
\end{aligned}
\]

Using equation A.18,

\[
\tau = \frac{C - C_2}{C_1 - C_2}.
\]

Putting it into equation A.19, α and b are linear functions of C where C ∈ [C₁, C₂]. This proves the lemma. Let us now prove theorem 2.

Proof of Theorem 2. As we already mentioned, if the points of the two classes are linearly separable in x space, then the proof of the result is straightforward. So let us give a proof only for the case of linearly nonseparable points. Since the number of faces is finite, by lemma 4 there exists a C* such that for C ≥ C*, there are optimal solutions at the same face. For the rest of the proof, let us consider only optimal solutions on a single face. For any C₁ > C*, lemma 4 implies that there are optimal solutions α and b which form linear functions of C in the interval [C*, C₁]. Since

\[
\sum_{i=1}^{l} \xi_i = \sum_{i:\, \alpha_i = C} -\left[(Q\alpha)_i - 1 + b y_i\right], \tag{A.20}
\]


∑_{i=1}^l ξ_i is a linear function of C in this interval and can be represented as

\[
\sum_{i=1}^{l} \xi_i = AC + B, \tag{A.21}
\]

where A and B are constants. If we consider another C₂ > C₁, ∑_{i=1}^l ξ_i is also a linear function of C in [C*, C₂]. For each C, the optimal ½wᵀw as well as ∑_{i=1}^l ξ_i are unique. Thus, the two linear functions have the same values at more than two points, so they are indeed identical. Therefore, equation A.21 holds for any C ≥ C*.

Since ∑_{i=1}^l ξ_i is a decreasing function of C (e.g., using techniques similar to Chang and Lin (2001b, lemma 4)), A ≤ 0. However, A cannot be negative, as otherwise ∑_{i=1}^l ξ_i goes to −∞ as C increases. Hence, A = 0, and so ∑_{i=1}^l ξ_i is a constant for C ≥ C*.
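This flattening of ∑ξ_i for large C can be observed empirically. The sketch below (synthetic overlapping classes; the specific C values are arbitrary and assumed to lie beyond C*) fits a linear-kernel SVC at two large C values and compares the resulting hinge-loss sums:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds: linearly nonseparable data
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

def hinge_sum(C):
    # sum_i xi_i = sum_i max(0, 1 - y_i f(x_i)) at the optimum for this C
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return np.sum(np.maximum(0.0, 1.0 - y * clf.decision_function(X)))

# Beyond C*, sum(xi_i) = B stays (essentially) constant in C
print(hinge_sum(1e3), hinge_sum(1e5))
```

How large C* is depends on the data, so the two values agree only once both C's exceed it; up to solver tolerance, the sums should then coincide.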

If (w¹, b¹, ξ¹) and (w², b², ξ²) are optimal solutions at C = C₁ and C₂, respectively, then

\[
\frac{1}{2}(w^1)^T w^1 + C_1 \sum_{i=1}^{l} \xi^1_i \le \frac{1}{2}(w^2)^T w^2 + C_1 \sum_{i=1}^{l} \xi^2_i
\]

and

\[
\frac{1}{2}(w^2)^T w^2 + C_2 \sum_{i=1}^{l} \xi^2_i \le \frac{1}{2}(w^1)^T w^1 + C_2 \sum_{i=1}^{l} \xi^1_i
\]

imply that (w¹)ᵀw¹ = (w²)ᵀw². That is, αᵀQα is a constant for C ≥ C*. Therefore, (w², b², ξ²) is also feasible and optimal when C = C₁. Since the solution of w is unique (e.g., Lin, 2001, lemma 1), w¹ = w².

If F ≡ {i | 0 < α_i < C} ≠ ∅, there is x_i such that wᵀx_i + b = y_i. Hence b¹ = b², and so the decision functions as well as the LOO rates are the same for C ≥ C*.

On the other hand, if F = ∅ and we denote by α(C) the solution of equation 1.2 at a given C, then α(C) = (C/C*)α(C*) for all C ≥ C*. As wᵀw = αᵀQα becomes a constant, we have w = 0 after C ≥ C*. However, since F = ∅, the optimal b might not be unique under the same C. For any one of such b, (w, b) is optimal for equation 1.2 for all C ≥ C*.

Finally, since wᵀw becomes a constant, for C ≥ C*, the solution of equation 1.2 is also a solution of

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\]


Acknowledgments

C.-J. L. was partially supported by the National Science Council of Taiwan, grant NSC 90-2213-E-002-111.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Technical Rep.). Irvine, CA: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Chang, C.-C., & Lin, C.-J. (2001a). LIBSVM: A library for support vector machines. Software available on-line at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, C.-C., & Lin, C.-J. (2001b). Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9), 2119–2147.

Joachims, T. (2000). Estimating the generalization performance of a SVM efficiently. In Proceedings of the International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Lee, J.-H. (2001). Model selection of the bounded SVM formulation using the RBF kernel. Master's thesis, National Taiwan University.

Lin, C.-J. (2001). Formulations of support vector machines: A note from an optimization point of view. Neural Computation, 13(2), 307–317.

Micchelli, C. A. (1986). Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2, 11–22.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Rätsch, G. (1999). Benchmark data sets. Available on-line at: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2013–2036.

Figure 1: A figurative summary of the asymptotic behaviors. The problem has 11 examples in class 1 (×) and 7 examples in class 2 (+).

Figure 2: A rough boundary curve separating the underfitting/overfitting region from the "good" region.

Table 1: Comparison of the Model Selection Methods.

Figure 3: Variation of CV accuracy of linear SVM with C for the image problem.
