
Training ν-Support Vector Classifiers: Theory and Algorithms

Chih-Chung Chang

Chih-Jen Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

The ν-support vector machine (ν-SVM) for classification proposed by Schölkopf, Smola, Williamson, and Bartlett (2000) has the advantage of using a parameter ν for controlling the number of support vectors. In this article, we investigate the relation between ν-SVM and C-SVM in detail. We show that in general they are two different problems with the same optimal solution set. Hence, we may expect that many numerical aspects of solving them are similar. However, compared to regular C-SVM, the formulation of ν-SVM is more complicated, so up to now there have been no effective methods for solving large-scale ν-SVM. We propose a decomposition method for ν-SVM that is competitive with existing methods for C-SVM. We also discuss the behavior of ν-SVM in some numerical experiments.

1 Introduction

The ν-support vector classification (Schölkopf, Smola, & Williamson, 1999; Schölkopf, Smola, Williamson, & Bartlett, 2000) is a new class of support vector machines (SVM). Given training vectors xᵢ ∈ Rⁿ, i = 1, …, l, in two classes and a vector y ∈ Rˡ such that yᵢ ∈ {1, −1}, they consider the following primal problem:

(P_ν)  min (1/2)wᵀw − νρ + (1/l) Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ ρ − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l, ρ ≥ 0.   (1.1)

Here 0 ≤ ν ≤ 1, and training vectors xᵢ are mapped into a higher- (maybe infinite) dimensional space by the function φ. This formulation is different from the original C-SVM (Vapnik, 1998):

(P_C)  min (1/2)wᵀw + C Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l.   (1.2)


In equation 1.2, a parameter C is used to penalize the variables ξᵢ. As it is difficult to select an appropriate C, in P_ν Schölkopf et al. (2000) introduce a new parameter ν, which lets one control the number of support vectors and errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions.
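These bounds are easy to observe empirically. The following minimal sketch assumes scikit-learn, whose NuSVC class implements this ν-SVM formulation; the data set is synthetic and chosen only for illustration:

# Sketch: empirically checking that nu lower-bounds the fraction of support
# vectors (assumes scikit-learn's NuSVC, which implements this nu-SVM).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 1, rng.randn(100, 2) + 1])  # two noisy classes
y = np.array([1] * 100 + [-1] * 100)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel='rbf', gamma=0.5).fit(X, y)
    frac_sv = clf.n_support_.sum() / len(y)   # fraction of support vectors
    print(f"nu = {nu}: fraction of SVs = {frac_sv:.2f}  (expected >= nu)")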

Although P_ν has such an advantage, its dual is more complicated than the dual of P_C:

(D_ν)  min (1/2)αᵀQα
       yᵀα = 0, eᵀα ≥ ν,
       0 ≤ αᵢ ≤ 1/l, i = 1, …, l,   (1.3)

where e is the vector of all ones, Q is a positive semidefinite matrix, Qᵢⱼ ≡ yᵢyⱼK(xᵢ, xⱼ), and K(xᵢ, xⱼ) ≡ φ(xᵢ)ᵀφ(xⱼ) is the kernel.

Remember that the dual of P_C is as follows:

(D_C)  min (1/2)αᵀQα − eᵀα
       yᵀα = 0, 0 ≤ αᵢ ≤ C, i = 1, …, l.

Therefore, it can be clearly seen that D_ν has one more inequality constraint. We are interested in the relation between D_ν and D_C. Though this issue has been studied in Schölkopf et al. (2000, Proposition 13), we investigate the relation in more detail in section 2. The main result, theorem 5, shows that solving them is like solving two different problems with the same optimal solution set. In addition, increasing C in C-SVM is like decreasing ν in ν-SVM. Based on the work in section 2, in section 3 we derive the formulation of ν as a decreasing function of C.

Due to the density of Q, traditional optimization algorithms such as Newton and quasi-Newton methods cannot be directly applied to solve D_C or D_ν. Currently, major methods for solving large D_C (for example, decomposition methods (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998; Saunders et al., 1998) and the method of nearest points (Keerthi, Shevade, & Murthy, 2000)) use the simple structure of the constraints. Because of the additional inequality, these methods cannot be directly used for solving D_ν. Up to now, there have been no implemented methods for large-scale ν-SVM. In section 4, we propose a decomposition method similar to the software SVM^light (Joachims, 1998) for C-SVM.

Section 5 presents numerical results. Experiments indicate that several numerical properties of solving D_C and D_ν are similar. A timing comparison shows that the proposed method for ν-SVM is competitive with existing methods for C-SVM. Finally, section 6 gives a discussion and conclusion.


2 The Relation Between ν-SVM and C-SVM

In this section we construct a relationship between D_ν and D_C; the main result is in theorem 5. The relation between D_C and D_ν has been discussed by Schölkopf et al. (2000, Proposition 13), who show that if P_ν leads to ρ > 0, then P_C with C = 1/(ρl) leads to the same decision function. Here we provide a more complete investigation.

We first try to simplify D_ν by showing that the inequality eᵀα ≥ ν can be treated as an equality:

Theorem 1. Let 0 ≤ ν ≤ 1. If D_ν is feasible, there is at least one optimal solution of D_ν that satisfies eᵀα = ν. In addition, if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy eᵀα = ν.

Proof. Since the feasible region of D_ν is bounded, if D_ν is feasible, it has at least one optimal solution. Assume D_ν has an optimal solution α with eᵀα > ν. Since eᵀα > ν ≥ 0, by defining

ᾱ ≡ (ν/eᵀα) α,

ᾱ is feasible to D_ν and eᵀᾱ = ν. Since α is an optimal solution of D_ν, with eᵀα > ν,

αᵀQα ≤ ᾱᵀQᾱ = (ν/eᵀα)² αᵀQα ≤ αᵀQα.   (2.1)

Thus ᾱ is an optimal solution of D_ν, and αᵀQα = 0. This also implies that if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy eᵀα = ν.

Therefore, in general, eᵀα ≥ ν in D_ν can be written as eᵀα = ν. Schölkopf et al. (2000) noted that in practice one can alternatively work with eᵀα = ν as an equality constraint. From the primal side, it was first shown by Crisp and Burges (1999) that ρ ≥ 0 in P_ν is redundant. Without ρ ≥ 0, the dual becomes

min (1/2)αᵀQα
    yᵀα = 0, eᵀα = ν,   (2.2)
    0 ≤ αᵢ ≤ 1/l, i = 1, …, l,

so the equality is obtained naturally. Note that this is an example in which two problems have the same optimal solution set but are associated with two duals that have different optimal solution sets. Here the primal problem,


which has more restrictions, is related to a dual with a larger feasible region. For our later analysis, we keep using D_ν rather than equation 2.2. Interestingly, we will see that the exceptional situation, where D_ν has optimal solutions with eᵀα > ν, happens only for those ν in which we are not interested.

Due to the additional inequality, the feasibility of D_ν and D_C is different. For D_C, 0 is a trivial feasible point, but D_ν may be infeasible. An example where P_ν is unbounded below and D_ν is infeasible is as follows: given three training data with y₁ = y₂ = 1 and y₃ = −1, if ν = 0.9, there is no α in D_ν that satisfies 0 ≤ αᵢ ≤ 1/3, [1, 1, −1]α = 0, and eᵀα ≥ 0.9. Hence D_ν is infeasible. When this happens, we can choose w = 0, ξ₁ = ξ₂ = 0, b = ρ, ξ₃ = 2ρ as a feasible solution of P_ν. Then the objective value is −0.9ρ + 2ρ/3, which goes to −∞ as ρ → ∞. Therefore, P_ν is unbounded.

We then describe a lemma that was first proved in Crisp and Burges (1999).

Lemma 1. D_ν is feasible if and only if ν ≤ ν_max, where

ν_max ≡ 2 min(#{yᵢ = 1}, #{yᵢ = −1}) / l,

and #{yᵢ = 1} and #{yᵢ = −1} denote the numbers of elements in the first and second classes, respectively.

Proof. Since 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, and yᵀα = 0, for any α feasible to D_ν we have eᵀα ≤ ν_max. Therefore, if D_ν is feasible, ν ≤ ν_max. On the other hand, if 0 < ν ≤ ν_max, then min(#{yᵢ = 1}, #{yᵢ = −1}) > 0, so we can define a feasible solution of D_ν:

αⱼ = ν / (2 #{yᵢ = 1})  if yⱼ = 1,  and  αⱼ = ν / (2 #{yᵢ = −1})  if yⱼ = −1.

This α satisfies 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, and yᵀα = 0. If ν = 0, clearly α = 0 is a feasible solution of D_ν.

Note that the size of ν_max depends on how balanced the training set is. If the numbers of positive and negative examples match, then ν_max = 1.
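As a small illustration (a sketch in Python; the helper name nu_max is ours), lemma 1 gives a one-line feasibility test, and it confirms the three-point example above:

# Sketch: the feasibility threshold of lemma 1.
def nu_max(y):
    pos = sum(1 for yi in y if yi == 1)
    neg = len(y) - pos
    return 2.0 * min(pos, neg) / len(y)

# The earlier example with y1 = y2 = 1, y3 = -1:
print(nu_max([1, 1, -1]))   # 2/3, so D_nu with nu = 0.9 is infeasible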

We then note that if C > 0, by dividing each variable by Cl, D_C is equivalent to the following problem:

(D′_C)  min (1/2)αᵀQα − eᵀα/(Cl)
        yᵀα = 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.


It can be clearly seen that D′_C and D_ν are very similar. We prove the following lemma about D′_C:

Lemma 2. If D′_C has two different optimal solutions α₁ and α₂, then eᵀα₁ = eᵀα₂ and α₁ᵀQα₁ = α₂ᵀQα₂. Therefore, we can define two functions eᵀα_C and α_CᵀQα_C of C, where α_C is any optimal solution of D′_C.

Proof. Since D′_C is a convex problem, if α₁ ≠ α₂ are both optimal solutions, then for all 0 ≤ λ ≤ 1,

(1/2)(λα₁ + (1−λ)α₂)ᵀQ(λα₁ + (1−λ)α₂) − eᵀ(λα₁ + (1−λ)α₂)/(Cl)
  = λ[(1/2)α₁ᵀQα₁ − eᵀα₁/(Cl)] + (1−λ)[(1/2)α₂ᵀQα₂ − eᵀα₂/(Cl)].

This implies

α₁ᵀQα₂ = (1/2)α₁ᵀQα₁ + (1/2)α₂ᵀQα₂.   (2.3)

Since Q is positive semidefinite, Q = LᵀL, so equation 2.3 implies ‖Lα₁ − Lα₂‖ = 0. Thus α₂ᵀQα₂ = α₁ᵀQα₁. Therefore, eᵀα₁ = eᵀα₂, and the proof is complete.

Next we prove a theorem on optimal solutions of D′_C and D_ν:

Theorem 2. If D′_C and D_ν share one optimal solution α* with eᵀα* = ν, their optimal solution sets are the same.

Proof. From lemma 2, any other optimal solution α of D′_C also satisfies eᵀα = ν, so α is feasible to D_ν. Since αᵀQα = (α*)ᵀQα* from lemma 2, all of D′_C's optimal solutions are also optimal solutions of D_ν. On the other hand, if α is any optimal solution of D_ν, it is feasible for D′_C. With the constraint eᵀα ≥ ν = eᵀα* and αᵀQα = (α*)ᵀQα*,

(1/2)αᵀQα − eᵀα/(Cl) ≤ (1/2)(α*)ᵀQα* − eᵀα*/(Cl).

Therefore, all optimal solutions of D_ν are also optimal for D′_C. Hence their optimal solution sets are the same.

If α is an optimal solution of D′_C, it satisfies the following Karush-Kuhn-Tucker (KKT) condition:

Qα − e/(Cl) + by = λ − ξ,
λᵀα = 0, ξᵀ(e/l − α) = 0, yᵀα = 0,
λᵢ ≥ 0, ξᵢ ≥ 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (2.4)

By setting ρ ≡ 1/(Cl) and ν ≡ eᵀα, α also satisfies the KKT condition of D_ν:

Qα − ρe + by = λ − ξ,
λᵀα = 0, ξᵀ(e/l − α) = 0,
yᵀα = 0, eᵀα ≥ ν, ρ(eᵀα − ν) = 0,
λᵢ ≥ 0, ξᵢ ≥ 0, ρ ≥ 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (2.5)

From theorem 2, this implies that for each D′_C, its optimal solution set is the same as that of D_ν, where ν = eᵀα. For each D′_C, such a D_ν is unique since, from theorem 1, if ν₁ ≠ ν₂, then D_ν₁ and D_ν₂ have different optimal solution sets.

Therefore, we have the following theorem:

Theorem 3. For each D′_C, C > 0, its optimal solution set is the same as that of one (and only one) D_ν, where ν = eᵀα and α is any optimal solution of D′_C.

Similarly, we have:

Theorem 4. If D_ν, ν > 0, has a nonempty feasible set and its objective value is not zero, then D_ν's optimal solution set is the same as that of at least one D′_C.

Proof. If the objective value of D_ν is not zero, then from the KKT condition 2.5,

αᵀQα − ρeᵀα = −Σᵢ₌₁ˡ ξᵢ/l.

Then αᵀQα > 0 and equation 2.5 imply

ρeᵀα = αᵀQα + Σᵢ₌₁ˡ ξᵢ/l > 0,

so ρ > 0 and, from ρ(eᵀα − ν) = 0, eᵀα = ν. By choosing a C > 0 such that ρ = 1/(Cl), α becomes a KKT point of D′_C. Hence from theorem 2, the optimal solution set of this D′_C is the same as that of D_ν.

Next we prove two useful lemmas. The first deals with the special situation in which the objective value of D_ν is zero.


Lemma 3. If the objective value of D_ν, ν ≥ 0, is zero and there is a D′_C, C > 0, such that any of its optimal solutions α_C satisfies eᵀα_C = ν, then ν = ν_max, and all D′_C, C > 0, have the same optimal solution set as that of D_ν.

Proof. For this D_ν, we can set ρ = 1/(Cl), so α_C is a KKT point of D_ν. Therefore, since the objective value of D_ν is zero, α_CᵀQα_C = 0. Furthermore, we have Qα_C = 0. In this case, equation 2.4 of D′_C's KKT condition becomes

−e/(Cl) + [ be_I ; −be_J ] = λ − ξ,   (2.6)

where λᵢ, ξᵢ ≥ 0, and I and J are the indices of the two classes. If be_I ≥ 0, there are three possibilities for the signs of the two blocks on the left-hand side of equation 2.6:

[ > 0 ; < 0 ],  [ < 0 ; < 0 ],  and  [ = 0 ; < 0 ].

The first case implies (α_C)_I = 0 and (α_C)_J = e_J/l. Hence if J is nonempty, yᵀα_C = 0 causes a contradiction, so all data are in the same class. Therefore, D_ν and all D′_C, C > 0, have the unique optimal solution zero due to the constraints yᵀα = 0 and α ≥ 0. Furthermore, eᵀα = ν = ν_max = 0.

The second case happens only when α_C = e/l. Then yᵀα_C = 0 and yᵢ = 1 or −1 imply that #{yᵢ = 1} = #{yᵢ = −1} and eᵀα_C = 1 = ν = ν_max. We then show that e/l is also an optimal solution of any other D′_C̄. Since 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, for any feasible α of D′_C̄, the objective function satisfies

(1/2)αᵀQα − eᵀα/(C̄l) ≥ −eᵀα/(C̄l) ≥ −1/(C̄l).   (2.7)

Now #{yᵢ = 1} = #{yᵢ = −1}, so e/l is feasible. When α = e/l, the inequality in equation 2.7 becomes an equality. Thus e/l is actually an optimal solution of all D′_C, C > 0. Therefore, D_ν and all D′_C, C > 0, have the same unique optimal solution e/l.

For the third case, b = 1/(Cl), (α_C)_J = e_J/l, and ν = eᵀα_C = 2eᵀ_J(α_C)_J = ν_max, where J is the class with fewer elements. Because there exist such a C and b, for any other C, b can be adjusted accordingly so that the KKT condition is still satisfied. Therefore, from theorem 3, all D′_C, C > 0, have the same optimal solution set as that of D_ν. The situation when be_I ≤ 0 is similar.

Lemma 4. Assume α_C is any optimal solution of D′_C. Then eᵀα_C is a continuous decreasing function of C on (0, ∞).

Proof. If C₁ < C₂ and α₁ and α₂ are optimal solutions of D′_C₁ and D′_C₂, respectively, we have

(1/2)α₁ᵀQα₁ − eᵀα₁/(C₁l) ≤ (1/2)α₂ᵀQα₂ − eᵀα₂/(C₁l)   (2.8)

and

(1/2)α₂ᵀQα₂ − eᵀα₂/(C₂l) ≤ (1/2)α₁ᵀQα₁ − eᵀα₁/(C₂l).   (2.9)

Hence

eᵀα₁/(C₂l) − eᵀα₂/(C₂l) ≤ (1/2)α₁ᵀQα₁ − (1/2)α₂ᵀQα₂ ≤ eᵀα₁/(C₁l) − eᵀα₂/(C₁l).   (2.10)

Since C₂ > C₁ > 0, equation 2.10 implies eᵀα₁ − eᵀα₂ ≥ 0. Therefore, eᵀα_C is a decreasing function on (0, ∞). From this result, we know that for any C* ∈ (0, ∞), lim_{C→(C*)⁺} eᵀα_C and lim_{C→(C*)⁻} eᵀα_C exist, and

lim_{C→(C*)⁺} eᵀα_C ≤ eᵀα_{C*} ≤ lim_{C→(C*)⁻} eᵀα_C.

To prove the continuity of eᵀα_C, it is sufficient to prove lim_{C→C*} eᵀα_C = eᵀα_{C*} for all C* ∈ (0, ∞).

If lim_{C→(C*)⁺} eᵀα_C < eᵀα_{C*}, there is a ν̄ such that

0 ≤ lim_{C→(C*)⁺} eᵀα_C < ν̄ < eᵀα_{C*}.   (2.11)

Hence ν̄ > 0. If D_ν̄'s objective value is not zero, then from theorem 4 and the fact that eᵀα_C is a decreasing function, there exists a C > C* such that α_C satisfies eᵀα_C = ν̄. This contradicts equation 2.11, where lim_{C→(C*)⁺} eᵀα_C < ν̄. Therefore, the objective value of D_ν̄ is zero. Since for all D_ν, ν ≤ ν̄, their feasible regions include that of D_ν̄, their objective values are also zero. From theorem 3, the fact that eᵀα_C is a decreasing function, and lim_{C→(C*)⁺} eᵀα_C < ν̄, each D′_C, C > C*, has the same optimal solution set as that of one D_ν, where eᵀα_C = ν < ν̄. Hence by lemma 3, eᵀα_C = ν_max for all C. This contradicts equation 2.11.

Therefore, lim_{C→(C*)⁺} eᵀα_C = eᵀα_{C*}. Similarly, lim_{C→(C*)⁻} eᵀα_C = eᵀα_{C*}. Thus,

lim_{C→C*} eᵀα_C = eᵀα_{C*}.


Theorem 5. We can define

lim_{C→∞} eᵀα_C = ν_* ≥ 0 and lim_{C→0} eᵀα_C = ν* ≤ 1,

where α_C is any optimal solution of D′_C. Then ν* = ν_max. For any ν > ν*, D_ν is infeasible. For any ν ∈ (ν_*, ν*], the optimal solution set of D_ν is the same as that of one D′_C, C > 0, where such C may be unique or any number in an interval. In addition, the optimal objective value of D_ν is strictly positive. For any 0 ≤ ν ≤ ν_*, D_ν is feasible with zero optimal objective value.

Proof. First, from lemma 4 and the fact that 0 ≤ eᵀα ≤ 1, we know that ν_* and ν* are well defined. We then prove ν* = ν_max by showing that once C is small enough, all optimal solutions α_C of D′_C satisfy eᵀα_C = ν_max.

Assume I contains the elements of the class with fewer elements and J contains the elements of the other class. If α_C is an optimal solution of D′_C, it satisfies the following KKT condition:

[ Q_II Q_IJ ; Q_JI Q_JJ ] [ (α_C)_I ; (α_C)_J ] − e/(Cl) + b_C [ y_I ; y_J ] = [ (λ_C)_I − (ξ_C)_I ; (λ_C)_J − (ξ_C)_J ],

where λ_C ≥ 0, ξ_C ≥ 0, α_Cᵀλ_C = 0, and ξ_Cᵀ(e/l − α_C) = 0. When C is small enough, b_C y_J > 0 must hold. Otherwise, since Q_JI(α_C)_I + Q_JJ(α_C)_J is bounded, Q_JI(α_C)_I + Q_JJ(α_C)_J − e_J/(Cl) + b_C y_J < 0 would imply (α_C)_J = e_J/l, which violates the constraint yᵀα = 0 if #{yᵢ = 1} ≠ #{yᵢ = −1}. Therefore, b_C y_J > 0, so b_C y_I < 0. This implies that (α_C)_I = e_I/l when C is sufficiently small. Hence eᵀα_C = ν_max = ν*.

If #{yᵢ = 1} = #{yᵢ = −1}, we can let α_C = e/l and b_C = 0. When C is small enough, this is a KKT point. Therefore, eᵀα_C = ν_max = ν* = 1.

From lemma 1 we immediately know that D_ν is infeasible if ν > ν*. From lemma 4, where eᵀα_C is a continuous function, for any ν ∈ (ν_*, ν*] there is a D′_C such that eᵀα_C = ν. Then from theorem 3, D′_C and D_ν have the same optimal solution set.

If D_ν has the same optimal solution set as those of D′_{C₁} and D′_{C₂}, where C₁ < C₂, then since eᵀα_C is a decreasing function, for any C ∈ [C₁, C₂] its optimal solutions satisfy eᵀα = ν, and from theorem 3 its optimal solution set is the same as that of D_ν. Thus, such Cs form an interval.

If ν < ν_*, D_ν must be feasible from lemma 1. It cannot have a nonzero objective value, due to theorem 4 and the definition of ν_*. For D_{ν_*}: if ν_* = 0, the objective value of D_{ν_*} is zero, as α = 0 is a feasible solution. If ν_* > 0, since the feasible regions of D_ν are bounded by 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, with theorem 1 there is a sequence {α_{νᵢ}}, ν₁ ≤ ν₂ ≤ ⋯ < ν_*, such that eᵀα_{νᵢ} = νᵢ, α_{νᵢ}ᵀQα_{νᵢ} = 0, and {α_{νᵢ}} has a convergent subsequence with limit α̂. Since eᵀα_{νᵢ} = νᵢ, we have eᵀα̂ = lim_{νᵢ→ν_*} eᵀα_{νᵢ} = ν_*. We also have 0 ≤ α̂ ≤ 1/l and yᵀα̂ = lim_{νᵢ→ν_*} yᵀα_{νᵢ} = 0, so α̂ is feasible to D_{ν_*}. However, α̂ᵀQα̂ = lim_{νᵢ→ν_*} α_{νᵢ}ᵀQα_{νᵢ} = 0, as α_{νᵢ}ᵀQα_{νᵢ} = 0 for all νᵢ. Therefore, the objective value of D_{ν_*} is also zero.

Next we show that the objective value of D_ν is zero if and only if ν ≤ ν_*. From the above discussion, if ν ≤ ν_*, the objective value of D_ν is zero. If the objective value of D_ν is zero but ν > ν_*, then lemma 3 implies ν = ν_max = ν* = ν_*, which causes a contradiction. Hence the proof is complete.

Note that when the objective value of D_ν is zero, the optimal solution w of the primal problem P_ν is zero. Crisp and Burges (1999, sec. 4) considered such a P_ν a trivial problem. Next we present a corollary:

Corollary 1. If the training data are separable, then ν_* = 0. If the training data are nonseparable, then ν_* ≥ 1/l > 0. Furthermore, if Q is positive definite, the training data are separable and ν_* = 0.

Proof. From Lin (2001, theorem 3.3), if the data are separable, there is a C* such that for all C ≥ C*, an optimal solution α_{C*} of D_{C*} is also optimal for D_C. Therefore, for D′_C, an optimal solution is α_{C*}/(Cl), and eᵀα_{C*}/(Cl) → 0 as C → ∞. Thus ν_* = 0. On the other hand, if the data are nonseparable, then no matter how large C is, some components of any optimal solution stay at the upper bound. Therefore, eᵀα_C ≥ 1/l > 0 for all C. Hence ν_* ≥ 1/l.

If Q is positive definite, the unconstrained problem

min (1/2)αᵀQα − eᵀα   (2.12)

has a unique solution at α = Q⁻¹e. If we add constraints to equation 2.12,

min (1/2)αᵀQα − eᵀα
    yᵀα = 0, αᵢ ≥ 0, i = 1, …, l,   (2.13)

is a problem with a smaller feasible region, so the objective value of equation 2.13 is bounded. By corollary 27.3.1 of Rockafellar (1970), a convex quadratic function that is bounded over a polyhedral set in a finite-dimensional space attains at least one optimal solution. Therefore, equation 2.13 is solvable. From Lin (2001, theorem 2), this implies that the following primal problem is solvable:

min (1/2)wᵀw
    yᵢ(wᵀφ(xᵢ) + b) ≥ 1, i = 1, …, l.

Hence the training data are separable, and from the first part of the corollary, ν_* = 0.


In many situations, Q is positive definite. For example, from Micchelli (1986), if the radial basis function (RBF) kernel is used and xᵢ ≠ xⱼ for i ≠ j, then Q is positive definite.
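The following sketch (assuming only numpy) checks this numerically: for distinct points, the smallest eigenvalue of the RBF Gram matrix is strictly positive, and Q = diag(y) K diag(y) inherits positive definiteness from K.

# Sketch: numerical check that the RBF Gram matrix of distinct points is
# positive definite (Micchelli, 1986).
import numpy as np

X = np.random.RandomState(1).randn(20, 3)                 # 20 distinct points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
K = np.exp(-sq / X.shape[1])                              # gamma = 1/n, as in section 5
print(np.linalg.eigvalsh(K).min())                        # > 0 up to rounding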

We illustrate the above results with some examples. Given three nonseparable training points x₁ = 0, x₂ = 1, and x₃ = 2 with y = [1, −1, 1]ᵀ, we show that this is an example of lemma 3. For all C > 0, the optimal solution of D′_C is α = [1/6, 1/3, 1/6]ᵀ. Therefore, in this case ν_* = ν* = 2/3. For D_ν, ν ≤ 2/3, an optimal solution is α = (3ν/2)[1/6, 1/3, 1/6]ᵀ, with objective value

(3ν/2)² [1/6, 1/3, 1/6] [ 0 0 0 ; 0 1 −2 ; 0 −2 4 ] [1/6, 1/3, 1/6]ᵀ = 0.

Another example shows that we may have the same value of eᵀα_C for all C in an interval, where α_C is any optimal solution of D′_C. Given

x₁ = [−1, 0]ᵀ, x₂ = [1, 1]ᵀ, x₃ = [0, −1]ᵀ, and x₄ = [0, 0]ᵀ

with y = [1, −1, 1, −1]ᵀ, part of the KKT condition of D′_C is

[ 1 1 0 0 ; 1 2 1 0 ; 0 1 1 0 ; 0 0 0 0 ] [ α₁ ; α₂ ; α₃ ; α₄ ] − (1/(4C)) e + b [ 1 ; −1 ; 1 ; −1 ] = λ − ξ.

Then one optimal solution α_C of D′_C, with the corresponding b, is

α_C = [1/4, 1/4, 1/4, 1/4]ᵀ,                  b ∈ [1 − 1/(4C), 1/(4C) − 1/2],  if 0 < C ≤ 1/3,
α_C = (1/36)[3 + 2/C, −3 + 4/C, 3 + 2/C, 9]ᵀ, b = 1/(12C),                     if 1/3 ≤ C ≤ 4/3,
α_C = [1/8, 0, 1/8, 1/4]ᵀ,                    b = 1/(4C) − 1/8,                if 4/3 ≤ C ≤ 4,
α_C = [1/(2C), 0, 1/(2C), 1/C]ᵀ,              b = −1/(4C),                     if C ≥ 4.

This is a separable problem. We have ν* = 1, ν_* = 0, and

eᵀα_C = 1,              if 0 < C ≤ 1/3,
       = 1/3 + 2/(9C),  if 1/3 ≤ C ≤ 4/3,
       = 1/2,           if 4/3 ≤ C ≤ 4,
       = 2/C,           if C ≥ 4.   (2.14)

In summary, this section shows:

• Solving D_ν and D_C is just like solving two different problems with the same optimal solution set. We may expect that many numerical aspects of solving them are similar. However, they are still two different problems, so we cannot obtain C without solving D_ν; similarly, without solving D_C, we cannot find ν.

3 The Relation Between ν and C

A formula like equation 2.14 motivates us to conjecture that all ν = eᵀα_C have a similar form. That is, in each interval of C, eᵀα_C = A + B/C, where A and B are constants independent of C. The formulation of eᵀα_C will be the main topic of this section.
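Before deriving this analytically, the curve eᵀα_C can be traced numerically. The sketch below is ours and assumes scipy; a generic SLSQP solver stands in for a proper SVM solver and is adequate only for tiny, dense problems. It solves D′_C over a grid of C values and prints ν = eᵀα_C, which is nonincreasing in C and piecewise of the form A + B/C:

# Sketch: tracing nu = e^T alpha_C over C for D'_C on a toy problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(20, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(axis=-1) / 2)
Q = np.outer(y, y) * K
l = len(y)

def nu_of_C(C):
    # D'_C: min 1/2 a^T Q a - e^T a/(C l),  y^T a = 0,  0 <= a_i <= 1/l
    obj = lambda a: 0.5 * a @ Q @ a - a.sum() / (C * l)
    res = minimize(obj, np.zeros(l), method='SLSQP',
                   bounds=[(0.0, 1.0 / l)] * l,
                   constraints=[{'type': 'eq', 'fun': lambda a: y @ a}])
    return res.x.sum()   # e^T alpha_C

for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"C = {C:7.2f}  nu = {nu_of_C(C):.4f}")   # nonincreasing in C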

We note that in equation 2.14, in each interval of C, the solutions α_C are at the same face. Here we say two vectors are at the same face if they have the same free, lower-bounded, and upper-bounded components. The following lemma deals with the situation when the α_C are at the same face:

Lemma 5. If C_a < C_b and there are α_{C_a} and α_{C_b} at the same face, then for each C ∈ [C_a, C_b] there is at least one optimal solution α_C of D′_C at the same face as α_{C_a} and α_{C_b}. Furthermore,

eᵀα_C = Δ₁ + Δ₂/C, C_a ≤ C ≤ C_b,

where Δ₁ and Δ₂ are constants independent of C. In addition, Δ₂ ≥ 0.

Proof. Let {1, …, l} be separated into two sets A and F, where A corresponds to the bounded variables and F to the free variables of α_{C_a} (or α_{C_b}, as they are at the same face). The KKT condition shows

[ Q_FF Q_FA ; Q_AF Q_AA ] [ α_F ; α_A ] − e/(Cl) + b [ y_F ; y_A ] = [ 0 ; λ_A − ξ_A ],   (3.1)
y_Fᵀα_F + y_Aᵀα_A = 0,   (3.2)
λᵢ ≥ 0, ξᵢ ≥ 0, i ∈ A.   (3.3)

If Q_FF is positive definite, the upper part of equation 3.1 gives

α_F = Q_FF⁻¹(e_F/(Cl) − Q_FAα_A − by_F).   (3.4)

Then y_Fᵀα_F + y_Aᵀα_A = 0 implies

b = [ y_Aᵀα_A + y_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A) ] / (y_FᵀQ_FF⁻¹y_F).

Therefore,

α_F = Q_FF⁻¹( e_F/(Cl) − Q_FAα_A − [ y_Aᵀα_A + y_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A) ] / (y_FᵀQ_FF⁻¹y_F) · y_F ).   (3.5)

We note that for C_a ≤ C ≤ C_b, if (α_C)_F is defined by equation 3.5 and (α_C)_A ≡ (α_{C_a})_A (= (α_{C_b})_A), then (α_C)ᵢ ≥ 0, i = 1, …, l. In addition, α_C satisfies the upper part of equation 3.1 (the part with right-hand side zero), the sign of the lower part is not changed, and equation 3.2 remains valid. Thus we have constructed an optimal solution α_C of D′_C at the same face as α_{C_a} and α_{C_b}. Then, since α_A is a constant vector for all C_a ≤ C ≤ C_b, equation 3.5 gives

eᵀα_C = e_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A − by_F) + e_Aᵀα_A
      = [ e_FᵀQ_FF⁻¹e_F − (e_FᵀQ_FF⁻¹y_F)² / (y_FᵀQ_FF⁻¹y_F) ] / (lC) + Δ₁
      = Δ₂/C + Δ₁,

where Δ₁ collects the terms independent of C.

If Q_FF is not invertible, it is positive semidefinite, so we can write Q_FF = Q̂DQ̂ᵀ, where D is diagonal with nonnegative entries and Q̂ is orthogonal. Without loss of generality we assume

D = [ D̄ 0 ; 0 0 ],

with D̄ positive definite. Then equation 3.4 can be modified to

DQ̂ᵀα_F = Q̂⁻¹(e_F/(Cl) − Q_FAα_A − by_F).

One solution of this system is

α_F = Q̂⁻ᵀ [ D̄⁻¹ 0 ; 0 0 ] Q̂⁻¹(e_F/(Cl) − Q_FAα_A − by_F).

Thus a representation similar to equation 3.4 is obtained, and all arguments follow.

Note that due to the positive semidefiniteness of Q_FF, α_F may have multiple solutions. From lemma 2, eᵀα_C is a well-defined function of C, so the representation Δ₁ + Δ₂/C is valid for all solutions. From lemma 4, eᵀα_C is a decreasing function of C, so Δ₂ ≥ 0.

The main result on the representation of eᵀα_C is the following theorem:

Theorem 6. There are 0 < C₁ < ⋯ < C_s and Aᵢ, Bᵢ, i = 1, …, s, such that

eᵀα_C = ν*,            if C ≤ C₁,
       = Aᵢ + Bᵢ/C,    if Cᵢ ≤ C ≤ Cᵢ₊₁, i = 1, …, s − 1,
       = A_s + B_s/C,  if C_s ≤ C,

where α_C is an optimal solution of D′_C. We also have

Aᵢ + Bᵢ/Cᵢ₊₁ = Aᵢ₊₁ + Bᵢ₊₁/Cᵢ₊₁, i = 1, …, s − 1.   (3.6)

Proof. From theorem 5, we know that eᵀα_C = ν* when C is sufficiently small. From lemma 4, if we gradually increase C, we will reach a C₁ such that eᵀα_C < ν* if C > C₁. If for all C ≥ C₁ the α_C are at the same face, then from lemma 5, eᵀα_C = A₁ + B₁/C for all C ≥ C₁. Otherwise, from this C₁ we can increase C to a C₂ such that for every interval (C₂, C₂ + ε), ε > 0, there is an α_C not at the same face as α_{C₁} and α_{C₂}. Then from lemma 5, for C₁ ≤ C ≤ C₂, we can have A₁ and B₁ such that eᵀα_C = A₁ + B₁/C.

We can continue this procedure. Since the number of possible faces is finite (≤ 3ˡ), there are only finitely many Cᵢ. Otherwise, there would exist Cᵢ and Cⱼ, j > i + 1, such that α_{Cᵢ} and α_{Cⱼ} are at the same face. Then lemma 5 implies that for all Cᵢ ≤ C ≤ Cⱼ, all α_C are at the same face as α_{Cᵢ} and α_{Cⱼ}. This contradicts the definition of Cᵢ₊₁.

From lemma 4, the continuity of eᵀα_C immediately implies equation 3.6.

Finally, we provide Figure 1 to demonstrate the relation between ν and C. It clearly indicates that ν is a decreasing function of C. Information about the two test problems, australian and heart, is given in section 5.

Figure 1: Relation between ν and C.

4 A Decomposition Method for ν-SVM

Based on existing decomposition methods for C-SVM, in this section we propose a decomposition method for ν-SVM.

For solving D_C, existing decomposition methods separate the index set {1, …, l} of the training data into two sets B and N, where B is the working set. If α is the current solution of the algorithm and we denote by α_B and α_N the vectors containing the corresponding elements, the objective value of D_C is equal to (1/2)α_BᵀQ_BBα_B − (e_B − Q_BNα_N)ᵀα_B + (1/2)α_NᵀQ_NNα_N − e_Nᵀα_N. At each iteration, α_N is fixed, and the following problem with the variable α_B is solved:

min (1/2)α_BᵀQ_BBα_B − (e_B − Q_BNα_N)ᵀα_B
    y_Bᵀα_B = −y_Nᵀα_N,   (4.1)
    0 ≤ (α_B)ᵢ ≤ C, i = 1, …, q,

where [ Q_BB Q_BN ; Q_NB Q_NN ] is a permutation of the matrix Q and q is the size of B. The strict decrease of the objective function holds, and the theoretical convergence was studied in Chang, Hsu, and Lin (2000), Keerthi and Gilbert (2000), and Lin (2000).

An important step in decomposition methods is the selection of the working set B. In the software SVM^light (Joachims, 1998), there is a systematic way to find the working set B. In each iteration the following problem is solved:

min ∇f(α_k)ᵀd
    yᵀd = 0, −1 ≤ dᵢ ≤ 1,   (4.2)
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = C,   (4.3)
    |{dᵢ | dᵢ ≠ 0}| = q,   (4.4)

where f(α) ≡ (1/2)αᵀQα − eᵀα, α_k is the solution at the kth iteration, and ∇f(α_k) is the gradient of f(α) at α_k. Note that |{dᵢ | dᵢ ≠ 0}| means the number of components of d that are not zero. The constraint 4.4 implies that a descent direction involving only q variables is obtained. Then

components of α_k with nonzero dᵢ are included in the working set B, which is used to construct the subproblem 4.1. Note that d is used only for identifying B, not as a search direction.

If q is an even number, Joachims (1998) showed a simple strategy for solving equations 4.2 through 4.4. First, he sorts yᵢ∇f(α_k)ᵢ, i = 1, …, l, in decreasing order. Then he successively picks the q/2 elements from the top of the sorted list for which 0 < (α_k)ᵢ < C or dᵢ = −yᵢ obeys equation 4.3. Similarly, he picks the q/2 elements from the bottom of the list for which 0 < (α_k)ᵢ < C or dᵢ = yᵢ obeys equation 4.3. Other elements of d are assigned to be zero. Thus, these q nonzero elements compose the working set. A complete analysis of his procedure is in Lin (2000, sect. 2).

To modify the above strategy for D_ν, we consider the following problem in each iteration:

min ∇f(α_k)ᵀd
    yᵀd = 0, eᵀd = 0, −1 ≤ dᵢ ≤ 1,
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = 1/l,   (4.5)
    |{dᵢ | dᵢ ≠ 0}| ≤ q,

where q is an even integer. Now f(α) ≡ (1/2)αᵀQα. Here we use ≤ instead of = because in theory q nonzero elements may not always be available. This was first pointed out by Chang et al. (2000). Note that the subproblem 4.1 becomes the following if decomposition methods are used for solving D_ν:

min (1/2)α_BᵀQ_BBα_B + (Q_BNα_N)ᵀα_B
    y_Bᵀα_B = −y_Nᵀα_N,   (4.6)
    e_Bᵀα_B = ν − e_Nᵀα_N,
    0 ≤ (α_B)ᵢ ≤ 1/l, i = 1, …, q.

Problem 4.5 is more complicated than problem 4.2, as there is an additional constraint eᵀd = 0. The situation of q = 2 has been discussed in Keerthi and Gilbert (2000). We will describe a recursive procedure for solving equation 4.5.

Consider the following problem:

min Σ_{t∈S} ∇f(α_k)_t d_t
    Σ_{t∈S} y_t d_t = 0, Σ_{t∈S} d_t = 0, −1 ≤ d_t ≤ 1,
    d_t ≥ 0 if (α_k)_t = 0, d_t ≤ 0 if (α_k)_t = 1/l,   (4.7)
    |{d_t | d_t ≠ 0, t ∈ S}| ≤ q,

which is the same as equation 4.5 if S = {1, …, l}. We denote the variables {d_t | t ∈ S} as d and the objective function Σ_{t∈S} ∇f(α_k)_t d_t as obj(d).

Algorithm 1. If q = 0, the algorithm stops and outputs d = 0. Otherwise choose a pair of indices i and j from either

i = argmin_t {∇f(α_k)_t | y_t = 1, (α_k)_t < 1/l, t ∈ S},
j = argmax_t {∇f(α_k)_t | y_t = 1, (α_k)_t > 0, t ∈ S},   (4.8)

or

i = argmin_t {∇f(α_k)_t | y_t = −1, (α_k)_t < 1/l, t ∈ S},
j = argmax_t {∇f(α_k)_t | y_t = −1, (α_k)_t > 0, t ∈ S},   (4.9)

depending on which one gives a smaller ∇f(α_k)ᵢ − ∇f(α_k)ⱼ. If there are no such i and j, or ∇f(α_k)ᵢ − ∇f(α_k)ⱼ ≥ 0, the algorithm stops and outputs the solution d = 0. Otherwise we assign dᵢ = 1, dⱼ = −1 and determine the values of the other variables by recursively solving a smaller problem of the form of equation 4.7:

min Σ_{t∈S′} ∇f(α_k)_t d_t
    Σ_{t∈S′} y_t d_t = 0, Σ_{t∈S′} d_t = 0, −1 ≤ d_t ≤ 1,
    d_t ≥ 0 if (α_k)_t = 0, d_t ≤ 0 if (α_k)_t = 1/l,   (4.10)
    |{d_t | d_t ≠ 0, t ∈ S′}| ≤ q′,

where S′ = S\{i, j} and q′ = q − 2.

Algorithm 1 assigns nonzero values to at most q/2 pairs. The indices of the nonzero elements of the solution d are used as B in the subproblem 4.6. Note that algorithm 1 can be implemented as an iterative procedure that selects q/2 pairs sequentially; the computational complexity is then similar to that of Joachims's strategy. Here, for convenience in writing proofs, we describe it recursively; a sketch of the iterative form follows. Next we prove that algorithm 1 solves equation 4.5.
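The sketch below is our transcription of the iterative form for general even q; here grad is ∇f(α_k) = Qα_k, and α is assumed feasible for the scaled bounds 0 ≤ αᵢ ≤ ub:

# Sketch: iterative form of algorithm 1 (our transcription, not LIBSVM's code).
def select_working_set(grad, alpha, y, ub, q, eps=1e-12):
    S = set(range(len(alpha)))
    B = []
    while len(B) < q:
        best = None
        for cls in (1, -1):                      # try equations 4.8 and 4.9
            up = [t for t in S if y[t] == cls and alpha[t] < ub - eps]
            lo = [t for t in S if y[t] == cls and alpha[t] > eps]
            if up and lo:
                i = min(up, key=lambda t: grad[t])
                j = max(lo, key=lambda t: grad[t])
                if best is None or grad[i] - grad[j] < best[0]:
                    best = (grad[i] - grad[j], i, j)
        if best is None or best[0] >= 0:         # no violating pair: d = 0
            break                                # (cf. the stopping rule 4.17 below)
        _, i, j = best
        B += [i, j]                              # d_i = 1, d_j = -1
        S -= {i, j}
    return B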

Lemma 6. If equation 4.7 has an optimal solution d, then it has an optimal integer solution d* with d*_t ∈ {−1, 0, 1} for all t ∈ S.

Proof. Because Σ_{t∈S} d_t = 0, if there are noninteger elements in d, there must be at least two of them. Furthermore, from the linear constraints Σ_{t∈S} y_t d_t = 0 and Σ_{t∈S} d_t = 0, we have

Σ_{t∈S, y_t=1} d_t = 0 and Σ_{t∈S, y_t=−1} d_t = 0.   (4.11)

Thus, if there are only two noninteger elements dᵢ and dⱼ, they must satisfy yᵢ = yⱼ.

Therefore, if d contains noninteger elements, there must be two of them, dᵢ and dⱼ, with yᵢ = yⱼ. If dᵢ + dⱼ = c, then

∇f(α_k)ᵢdᵢ + ∇f(α_k)ⱼdⱼ = (∇f(α_k)ᵢ − ∇f(α_k)ⱼ)dᵢ + c∇f(α_k)ⱼ.   (4.12)

Since dᵢ, dⱼ ∉ {−1, 0, 1} and −1 < dᵢ, dⱼ < 1, if ∇f(α_k)ᵢ ≠ ∇f(α_k)ⱼ, we can pick a sufficiently small ε > 0 and shift dᵢ and dⱼ by −ε(∇f(α_k)ᵢ − ∇f(α_k)ⱼ) and ε(∇f(α_k)ᵢ − ∇f(α_k)ⱼ), respectively, without violating their feasibility. Then the decrease of the objective value contradicts the assumption that d is an optimal solution. Hence we know that ∇f(α_k)ᵢ = ∇f(α_k)ⱼ.

Then we can eliminate at least one of the nonintegers by shifting dᵢ and dⱼ in opposite directions by argmin_v {|v| : v ∈ {⌈dᵢ⌉ − dᵢ, dᵢ − ⌊dᵢ⌋, ⌈dⱼ⌉ − dⱼ, dⱼ − ⌊dⱼ⌋}}. The objective value is the same because of equation 4.12 and ∇f(α_k)ᵢ = ∇f(α_k)ⱼ. We can repeat this process until an integer optimal solution d* is obtained.

Lemma 7. If equation 4.7 has an optimal integer solution d that is not all zero, and (i, j) can be chosen from equation 4.8 or 4.9, then there is an optimal integer solution d* with d*ᵢ = 1 and d*ⱼ = −1.

Proof. Because (i, j) can be chosen from equation 4.8 or 4.9, we know (α_k)ᵢ < 1/l and (α_k)ⱼ > 0. We will show that if dᵢ ≠ 1 or dⱼ ≠ −1, we can construct an optimal integer solution d* from d such that d*ᵢ = 1 and d*ⱼ = −1.

We first note that for any nonzero integer element dᵢ′, from equation 4.11, there is a nonzero integer element dⱼ′ such that dⱼ′ = −dᵢ′ and yⱼ′ = yᵢ′. We define p(i′) ≡ j′.

If dᵢ = −1, we can find i′ = p(i) such that dᵢ′ = 1 and yᵢ′ = yᵢ. Since dᵢ′ = 1, (α_k)ᵢ′ < 1/l. By the definition of i and the fact that (α_k)ᵢ < 1/l, ∇f(α_k)ᵢ ≤ ∇f(α_k)ᵢ′. Let d*ᵢ = 1, d*ᵢ′ = −1, and d*_t = d_t otherwise. Then obj(d*) ≤ obj(d), so d* is also an optimal solution. Similarly, if dⱼ = 1, we can have an optimal solution d* with d*ⱼ = −1.

Therefore, after the above transformation has been done, only three cases are left: (dᵢ, dⱼ) = (0, −1), (1, 0), and (0, 0). For the first case, we can find i′ = p(j) such that dᵢ′ = 1 and yᵢ′ = yᵢ = yⱼ. From the definition of i and the fact that (α_k)ᵢ < 1/l and (α_k)ᵢ′ < 1/l, ∇f(α_k)ᵢ ≤ ∇f(α_k)ᵢ′. We can define d*ᵢ = 1, d*ᵢ′ = 0, and d*_t = d_t otherwise. Then obj(d*) ≤ obj(d), so d* is also an optimal solution. If (dᵢ, dⱼ) = (1, 0), the situation is similar.

Finally, we check the case where dᵢ and dⱼ are both zero. Since d is a nonzero integer vector, we can consider a dᵢ′ = 1 and j′ = p(i′). From equations 4.8 and 4.9, ∇f(α_k)ᵢ − ∇f(α_k)ⱼ ≤ ∇f(α_k)ᵢ′ − ∇f(α_k)ⱼ′. Let d*ᵢ = 1, d*ⱼ = −1, d*ᵢ′ = d*ⱼ′ = 0, and d*_t = d_t otherwise. Then d* is feasible for equation 4.7 and obj(d*) ≤ obj(d). Thus, d* is an optimal solution.

Lemma 8. If there is an integer optimal solution of equation 4.7 and algorithm 1 outputs a zero vector d, then d is already an optimal solution of equation 4.7.

Proof. If the result is wrong, there is an integer optimal solution d* of equation 4.7 such that

obj(d*) = Σ_{t∈S} ∇f(α_k)_t d*_t < 0.

Without loss of generality, we can consider only the case

Σ_{t∈S, y_t=1} ∇f(α_k)_t d*_t < 0.   (4.13)

From equation 4.11 and d*_t ∈ {−1, 0, 1}, the number of indices satisfying d*_t = 1, y_t = 1 is the same as the number satisfying d*_t = −1, y_t = 1. Therefore, we must have

min_{d*_t=1, y_t=1} ∇f(α_k)_t − max_{d*_t=−1, y_t=1} ∇f(α_k)_t < 0.   (4.14)

Otherwise,

Σ_{d*_t=1, y_t=1} ∇f(α_k)_t − Σ_{d*_t=−1, y_t=1} ∇f(α_k)_t = Σ_{y_t=1} ∇f(α_k)_t d*_t ≥ 0

contradicts equation 4.13.

Then equation 4.14 implies that in algorithm 1, i and j can be chosen with dᵢ = 1 and dⱼ = −1. This contradicts the assumption that algorithm 1 outputs a zero vector.

Theorem 7. Algorithm 1 solves equation 4.7.

Proof. First we note that the set of all d satisfying |{d_t | d_t ≠ 0, t ∈ S}| ≤ q can be considered as the union of finitely many closed sets of the form {d | d_{i₁} = 0, …, d_{i_{l−q}} = 0}. Therefore, the feasible region of equation 4.7 is closed. With the bound constraints −1 ≤ dᵢ ≤ 1, i = 1, …, l, the feasible region is compact, so there is at least one optimal solution.

As q is an even integer, we assume q = 2k. We then finish the proof by induction on k:

k = 0: Then q = 0, the only feasible point of equation 4.7 is d = 0, and this is what algorithm 1 outputs, so the result holds.

k > 0: Suppose algorithm 1 outputs a vector d with dᵢ = 1 and dⱼ = −1. In this situation, the optimal solution of equation 4.7 cannot be zero: otherwise, the vector d̄ with d̄ᵢ = 1, d̄ⱼ = −1, and d̄_t = 0 for all t ∈ S\{i, j} satisfies obj(d̄) < 0, a smaller objective value than that of the zero vector. Thus, the assumptions of lemma 7 hold. Then, by the fact that equation 4.7 is solvable and lemmas 6 and 7, we know that there is an optimal integer solution d* of equation 4.7 with d*ᵢ = 1 and d*ⱼ = −1.

By induction, {d_t | t ∈ S′} is an optimal solution of equation 4.10. Since {d*_t | t ∈ S′} is also a feasible solution of equation 4.10, we have

obj(d) = ∇f(α_k)ᵢdᵢ + ∇f(α_k)ⱼdⱼ + Σ_{t∈S′} ∇f(α_k)_t d_t
       ≤ ∇f(α_k)ᵢd*ᵢ + ∇f(α_k)ⱼd*ⱼ + Σ_{t∈S′} ∇f(α_k)_t d*_t = obj(d*).   (4.15)

Thus d, the output of algorithm 1, is an optimal solution.

Suppose algorithm 1 does not output a vector d with dᵢ = 1 and dⱼ = −1. Then d is actually a zero vector, and immediately from lemma 8, d = 0 is an optimal solution.

Since equation 4.5 is a special case of equation 4.7, theorem 7 implies that algorithm 1 can solve it.

After solving D_ν, we want to calculate ρ and b in P_ν. The KKT condition, equation 2.5, shows

(Qα)ᵢ − ρ + byᵢ  = 0 if 0 < αᵢ < 1/l,
                 ≥ 0 if αᵢ = 0,
                 ≤ 0 if αᵢ = 1/l.

Define r₁ ≡ ρ − b and r₂ ≡ ρ + b. If yᵢ = 1, the KKT condition becomes

(Qα)ᵢ − r₁  = 0 if 0 < αᵢ < 1/l,
            ≥ 0 if αᵢ = 0,
            ≤ 0 if αᵢ = 1/l.   (4.16)

Therefore, if there are αᵢ that satisfy equation 4.16 with 0 < αᵢ < 1/l, then r₁ = (Qα)ᵢ. In practice, to avoid numerical errors, we average over them:

r₁ = Σ_{0<αᵢ<1/l, yᵢ=1} (Qα)ᵢ / Σ_{0<αᵢ<1/l, yᵢ=1} 1.

On the other hand, if there is no such αᵢ, then since r₁ must satisfy

max_{αᵢ=1/l, yᵢ=1} (Qα)ᵢ ≤ r₁ ≤ min_{αᵢ=0, yᵢ=1} (Qα)ᵢ,

we take r₁ to be the midpoint of this range. For yᵢ = −1, we can calculate r₂ in a similar way. After r₁ and r₂ are obtained,

ρ = (r₁ + r₂)/2 and −b = (r₁ − r₂)/2.
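A direct transcription of this recovery (a sketch; it assumes numpy arrays and that when a class has no free αᵢ, both kinds of bounded αᵢ exist so that the range in question is finite) is:

# Sketch: recovering rho and b from a dual solution of the D_nu form above.
import numpy as np

def recover_rho_b(Q, alpha, y, l):
    Qa = Q @ alpha
    r = []
    for cls in (1, -1):                      # r1 from y_i = 1, r2 from y_i = -1
        free = (y == cls) & (alpha > 0) & (alpha < 1.0 / l)
        if free.any():
            r.append(Qa[free].mean())        # average over free alpha_i
        else:                                # midpoint of the admissible range
            lo = Qa[(y == cls) & (alpha >= 1.0 / l)].max()
            hi = Qa[(y == cls) & (alpha <= 0)].min()
            r.append((lo + hi) / 2.0)
    r1, r2 = r
    return (r1 + r2) / 2.0, (r2 - r1) / 2.0  # rho, b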

Note that the KKT condition can be written as

max_{αᵢ>0, yᵢ=1} (Qα)ᵢ ≤ min_{αᵢ<1/l, yᵢ=1} (Qα)ᵢ and max_{αᵢ>0, yᵢ=−1} (Qα)ᵢ ≤ min_{αᵢ<1/l, yᵢ=−1} (Qα)ᵢ.

Hence, in practice we can use the following stopping criterion: the decomposition method stops when the solution α satisfies

−(Qα)ᵢ + (Qα)ⱼ < ε,   (4.17)

where ε > 0 is a chosen stopping tolerance, and i and j are the first pair obtained from equation 4.8 or 4.9.

In section 5, we conduct some experiments on this new method.

5 Numerical Experiments

When C is large, there may be more numerical difficulties in using decomposition methods to solve D_C (see, for example, the discussion in Hsu & Lin, 1999). There is no C in D_ν, so intuitively we may think that this difficulty no longer exists. In this section, we test the proposed decomposition method on examples with different ν and examine the required time and iterations.

Since the constraints 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, imply that the αᵢ are small, the objective value of D_ν may be very close to zero. To avoid possible numerical inaccuracy, we consider the following scaled form of D_ν:

min (1/2)αᵀQα
    yᵀα = 0, eᵀα = νl,   (5.1)
    0 ≤ αᵢ ≤ 1, i = 1, …, l.

The working set selection follows the discussion in section 4, and here we implement the special case q = 2, so the working set in each iteration contains only two elements.
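With q = 2, the subproblem 4.6 can be solved in closed form: the selected pair lies within one class, so moving the two variables in opposite directions preserves both yᵀα and eᵀα. A sketch of this update (ours; cf. Keerthi & Gilbert, 2000) for the scaled problem 5.1, with Q, grad, and alpha as numpy arrays, is:

# Sketch: analytic q = 2 step for the scaled problem 5.1 (bounds 0 <= alpha <= 1).
# i and j are in the same class, so y^T alpha and e^T alpha are unchanged.
def update_pair(Q, grad, alpha, i, j):
    denom = Q[i, i] + Q[j, j] - 2.0 * Q[i, j]        # curvature along the direction
    delta = (grad[j] - grad[i]) / max(denom, 1e-12)  # unconstrained minimizer
    delta = min(delta, 1.0 - alpha[i], alpha[j])     # clip to the box constraints
    alpha[i] += delta
    alpha[j] -= delta
    grad += delta * (Q[:, i] - Q[:, j])              # maintain grad = Q alpha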


For the initial point α¹, we assign the first elements with yᵢ = 1 the values [1, …, 1, νl/2 − ⌊νl/2⌋]ᵀ, that is, ⌊νl/2⌋ ones followed by the fractional remainder. The same values are assigned to the first elements with yᵢ = −1. All other elements are assigned zero. Unlike the decomposition method for D_C, where the zero vector is usually used as the initial solution so that ∇f(α¹) = −e, now α¹ contains about νl nonzero components. In order to obtain ∇f(α¹) = Qα¹ for equation 4.5, at the beginning of the decomposition procedure we must compute νl columns of Q. This might be a disadvantage of using ν-SVM; further investigation is needed on this issue.
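A sketch of this initialization (ours; the actual LIBSVM code may differ in details) for the scaled form 5.1:

# Sketch: initial point for the scaled problem 5.1
# (e^T alpha = nu*l, y^T alpha = 0, 0 <= alpha_i <= 1).
import numpy as np

def initial_alpha(y, nu):
    l = len(y)
    alpha = np.zeros(l)
    for cls in (1, -1):
        remaining = nu * l / 2.0               # mass placed in each class
        for i in np.where(y == cls)[0]:
            alpha[i] = min(1.0, remaining)     # ones, then the fractional part
            remaining -= alpha[i]
            if remaining <= 0:
                break
    return alpha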

We test the RBF kernel with Qᵢⱼ = yᵢyⱼ e^{−‖xᵢ−xⱼ‖²/n}, where n is the number of attributes of the training data. Our implementation is part of the software LIBSVM (version 2.03), an integrated package for SVM classification and regression. (LIBSVM is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm.)

We test problems from various collections. Problems australian to shuttle are from the Statlog collection (Michie, Spiegelhalter, & Taylor, 1994). Problems adult4 and web7 are compiled by Platt (1998) from the UCI Machine Learning Repository (Blake & Merz, 1998; Murphy & Aha, 1994). Note that all problems from Statlog have real-valued attributes, so we scale them to [−1, 1]. Problems adult4 and web7 have binary representations, so we do not conduct any scaling. Some of these problems have more than two classes, so we treat all data not in the first class as being in the second class.

As LIBSVM also implements a decomposition method with q = 2 for C-SVM (Chang & Lin, 2000), we conduct some comparisons between C-SVM and ν-SVM. Note that the two codes are nearly the same except for the different working set selections for D_ν and D_C. For each problem, we first solve its D_C form using C = 1 and C = 1000. If α_C is an optimal solution of D_C, we then calculate ν by eᵀα_C/(Cl) and solve D_ν. The stopping tolerance for solving C-SVM is set to be 10⁻³. As the α of equation 4.17 is like the α of D_C divided by C and the stopping criterion involves Qα, to have a fair comparison, the tolerance ε of equation 4.17 for equation 5.1 is set to 10⁻³/C.

The computational experiments for this section were done on a Pentium III-500 with 256 MB RAM using the gcc compiler. We used 100 MB as the cache size of LIBSVM for storing recently used Qᵢⱼ.

Tables 1 and 2 report the results for C = 1 and C = 1000, respectively. In each table, the corresponding ν is listed, and the numbers of iterations and the time (in seconds) of both algorithms are compared. Note that for the same problem, fewer iterations do not always lead to less computational time. We think there are two possible reasons: first, calculating the initial gradient for D_ν is more expensive; second, due to different contents of the cache (or different numbers of kernel evaluations), the cost of each iteration is different. We also present the number of support vectors (#SV column) as well as the number of free support vectors (#FSV column). It can be clearly seen that the proposed method for D_ν performs very well. This comparison shows the practical viability of using ν-SVM.

Table 1: Solving C-SVM and ν-SVM: C = 1 (Time in Seconds).

Problem     l       ν         C Iter.  ν Iter.  C Time  ν Time   #SV   #FSV  νl
australian  690     0.309619  1040     946      0.34    0.42     244   55    214
diabetes    768     0.574087  395      297      0.4     0.47     447   13    441
german      1000    0.556643  953      909      1.23    1.61     600   88    557
heart       270     0.43103   219      175      0.07    0.08     132   25    117
vehicle     846     0.501182  791      904      0.69    0.91     439   26    424
satimage    4435    0.083544  355      534      8.16    14.05    377   12    371
letter      15,000  0.036588  764      897      22.59   35.13    563   26    549
shuttle     43,500  0.141534  3267     6982     422.04  1058.0   6159  5     6157
adult4      4781    0.41394   1460     1464     21.14   28.86    2002  53    1980
web7        24,692  0.059718  1896     1721     74.51   102.99   1556  140   1475

Table 2: Solving C-SVM and ν-SVM: C = 1000 (Time in Seconds).

Problem     l       ν         C Iter.  ν Iter.  C Time   ν Time   #SV   #FSV  νl
australian  690     0.147234  151,438  117,758  10.98    8.65     222   167   102
diabetes    768     0.421373  216,845  137,941  18.96    11.79    376   102   324
german      1000    0.069128  79,542   81,824   11.24    11.37    509   494   70
heart       270     0.033028  11,933   11,075   0.38     0.35     100   99    9
vehicle     846     0.262569  220,973  190,324  20.07    17.01    284   111   223
satimage    4435    0.015416  44,372   45,323   28.3     28.31    136   106   69
letter      15,000  0.005789  69,052   70,604   141.4    134.14   152   100   87
shuttle     43,500  0.033965  143,273  154,558  1215.8   1468.56  1487  17    1478
adult4      4781    0.263506  359,618  350,818  257.51   244.84   1760  837   1260
web7        24,692  0.023691  187,578  187,170  1262.15  1112.07  1112  696   585

From Schölkopf et al. (2000), we know that νl is a lower bound on the number of support vectors and an upper bound on the number of bounded support vectors (also the number of misclassified training data). It can be clearly seen from Tables 1 and 2 that νl lies between the number of support vectors and the number of bounded support vectors. Furthermore, we can see that as ν becomes smaller, the total number of support vectors decreases. This is consistent with D_C, where increasing C decreases the number of support vectors.

We also observe that although the total number of support vectors decreases as ν becomes smaller, the number of free support vectors increases. When ν is decreased (C is increased), the separating hyperplane tries to fit as many training data as possible. Hence more points (that is, more free αᵢ) tend to lie on the two planes wᵀφ(x) + b = ±ρ. We illustrate this in Figures 2a and 2b, where ν = 0.5 and ν = 0.2, respectively, are used on the same problem.

Figure 2: Training data and separating hyperplanes.

Since the weakest part of the decomposition method is that it cannot consider all variables together in each iteration (only q elements are selected), a larger number of free variables may cause more difficulty. This explains why many more iterations are required when ν is smaller. Therefore, we have given an example in which the decomposition method faces a similar difficulty in solving D_C and D_ν.

6 Discussion and Conclusion

In an earlier version of this article, since we did not know how to design a decomposition method for D_ν, which has two linear constraints, we tried to remove one of them. For C-SVM, Friess, Cristianini, and Campbell (1998) and Mangasarian and Musicant (1999) added b²/2 to the objective function so that the dual does not have the linear constraint yᵀα = 0. We used a similar

approach for P_ν by considering the following new primal problem:

(P̄_ν)  min (1/2)wᵀw + (1/2)b² − νρ + (1/l) Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ ρ − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l, ρ ≥ 0.   (6.1)

The dual of P̄_ν is:

(D̄_ν)  min (1/2)αᵀ(Q + yyᵀ)α
       eᵀα ≥ ν,
       0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (6.2)

Similar to theorem 1, we can solve D̄_ν using only the equality eᵀα = ν. Hence the new problem has only one simple equality constraint and can be solved using existing decomposition methods such as SVM^light.

To be more precise, the working set selection becomes:

min ∇f(α_k)ᵀd
    eᵀd = 0, −1 ≤ dᵢ ≤ 1,
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = 1/l,
    |{dᵢ | dᵢ ≠ 0}| ≤ q,   (6.3)

where f(α) ≡ (1/2)αᵀ(Q + yyᵀ)α.

Equation 6.3 can be considered a special case of equation 4.2, since the e in eᵀd = 0 is a special case of y. Thus SVM^light's selection procedure can be directly used. An earlier version of LIBSVM implemented this decomposition method for D̄_ν. However, we later found that its performance is much worse than that of the method for D_ν. This can be seen in Tables 3 and 4, which present the same information as Tables 1 and 2 for solving D̄_ν. As the major difference is in the working set selection, we suspect that the performance gap is similar to the situation that occurred for C-SVM: Hsu and Lin (1999) showed that by directly using SVM^light's strategy, the decomposition method for

(D̄_C)  min (1/2)αᵀ(Q + yyᵀ)α − eᵀα
       0 ≤ αᵢ ≤ C, i = 1, …, l,   (6.4)

performs much worse than that for D_C. Note that the relation between D̄_C and D̄_ν is very similar to that of D_C and D_ν presented earlier. Thus we conjecture that there are some common shortcomings of using SVM^light's working

set selection for D̄_C and D̄_ν. Further investigation is needed to understand whether the explanations in Hsu and Lin (1999) also hold for D̄_ν.

Table 3: Solving D̄_ν: A Comparison with Table 1.

Problem     l       ν         ν Iter.  ν Time   #SV   #FSV
australian  690     0.309619  4871     0.64     244   53
diabetes    768     0.574087  1816     0.58     447   13
german      1000    0.556643  1641     1.67     599   87
heart       270     0.43103   527      0.1      130   23
vehicle     846     0.501182  1402     1.04     437   26
satimage    4435    0.083544  3034     15.44    380   16
letter      15,000  0.036588  7200     54.6     562   28
shuttle     43,500  0.141534  17,893   1198.83  6161  8
adult4      4781    0.41394   7500     35.03    2002  54
web7        24,692  0.059718  3109     107.5    1563  149

Table 4: Solving D̄_ν: A Comparison with Table 2.

Problem     l       ν         ν Iter.     ν Time     #SV   #FSV
australian  690     0.147234  597,205     36.06      222   167
diabetes    768     0.421373  1,811,571   132.7      376   102
german      1000    0.069128  504,114     56.33      508   493
heart       270     0.033028  48,581      1.13       100   99
vehicle     846     0.262569  1,626,315   125.51     284   112
satimage    4435    0.015416  919,695     445.42     136   106
letter      15,000  0.005789  1,484,401   2544.23    150   97
shuttle     43,500  0.033965  8,364,010   59,286.83  1487  18
adult4      4781    0.263506  8,155,518   4905.67    1759  842
web7        24,692  0.023691  28,791,608  96,912.82  1245  830

In conclusion, this article discusses the relation between ν-SVM and C-SVM in detail. In particular, we show that solving them is just like solving two different problems with the same optimal solution set. We have also proposed a decomposition method for ν-SVM. Experiments show that this method is competitive with methods for C-SVM. Hence we have demonstrated the practical viability of ν-SVM.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan, grant NSC 89-2213-E-002-013. C.-J. L. thanks Craig Saunders for bringing ν-SVM to his attention and a referee of Lin (2001) whose comments led him to think about the infeasibility of D_ν. We also thank Bernhard Schölkopf and two anonymous referees for helpful comments.
