
Training ν-Support Vector Classifiers: Theory and Algorithms

Chih-Chung Chang

Chih-Jen Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

The ν-support vector machine (ν-SVM) for classification proposed by Schölkopf, Smola, Williamson, and Bartlett (2000) has the advantage of using a parameter ν for controlling the number of support vectors. In this article, we investigate the relation between ν-SVM and C-SVM in detail. We show that in general they are two different problems with the same optimal solution set. Hence, we may expect that many numerical aspects of solving them are similar. However, compared to regular C-SVM, the formulation of ν-SVM is more complicated, so up to now there have been no effective methods for solving large-scale ν-SVM. We propose a decomposition method for ν-SVM that is competitive with existing methods for C-SVM. We also discuss the behavior of ν-SVM in some numerical experiments.

1 Introduction

The ν-support vector classification (Schölkopf, Smola, & Williamson, 1999; Schölkopf, Smola, Williamson, & Bartlett, 2000) is a new class of support vector machines (SVM). Given training vectors xᵢ ∈ Rⁿ, i = 1, …, l, in two classes and a vector y ∈ Rˡ such that yᵢ ∈ {1, −1}, they consider the following primal problem:

(P_ν)  min (1/2)wᵀw − νρ + (1/l) Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ ρ − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l, ρ ≥ 0.   (1.1)

Here 0 ≤ ν ≤ 1, and training vectors xᵢ are mapped into a higher- (maybe infinite) dimensional space by the function φ. This formulation is different from the original C-SVM (Vapnik, 1998):

(P_C)  min (1/2)wᵀw + C Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l.   (1.2)


In equation 1.2, a parameter C is used to penalize the variables ξᵢ. As it is difficult to select an appropriate C, in P_ν Schölkopf et al. (2000) introduce a new parameter ν, which lets one control the number of support vectors and errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions.
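These bounds are easy to observe empirically. The following minimal sketch assumes scikit-learn, whose NuSVC class implements this ν-SVM formulation; the data set is synthetic and chosen only for illustration:

# Sketch: empirically checking that nu lower-bounds the fraction of support
# vectors (assumes scikit-learn's NuSVC, which implements this nu-SVM).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 1, rng.randn(100, 2) + 1])  # two noisy classes
y = np.array([1] * 100 + [-1] * 100)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel='rbf', gamma=0.5).fit(X, y)
    frac_sv = clf.n_support_.sum() / len(y)   # fraction of support vectors
    print(f"nu = {nu}: fraction of SVs = {frac_sv:.2f}  (expected >= nu)")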

Although P_ν has such an advantage, its dual is more complicated than the dual of P_C:

(D_ν)  min (1/2)αᵀQα
       yᵀα = 0, eᵀα ≥ ν,
       0 ≤ αᵢ ≤ 1/l, i = 1, …, l,   (1.3)

where e is the vector of all ones, Q is a positive semidefinite matrix, Qᵢⱼ ≡ yᵢyⱼK(xᵢ, xⱼ), and K(xᵢ, xⱼ) ≡ φ(xᵢ)ᵀφ(xⱼ) is the kernel.

Remember that the dual of P_C is as follows:

(D_C)  min (1/2)αᵀQα − eᵀα
       yᵀα = 0, 0 ≤ αᵢ ≤ C, i = 1, …, l.

Therefore, it can be clearly seen that D_ν has one more inequality constraint. We are interested in the relation between D_ν and D_C. Though this issue has been studied in Schölkopf et al. (2000, Proposition 13), we investigate the relation in more detail in section 2. The main result, theorem 5, shows that solving them is like solving two different problems with the same optimal solution set. In addition, increasing C in C-SVM is like decreasing ν in ν-SVM. Based on the work in section 2, in section 3 we derive the formulation of ν as a decreasing function of C.

Due to the density of Q, traditional optimization algorithms such as Newton and quasi-Newton methods cannot be directly applied to solve D_C or D_ν. Currently, major methods for solving large D_C (for example, decomposition methods (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998; Saunders et al., 1998) and the method of nearest points (Keerthi, Shevade, & Murthy, 2000)) use the simple structure of the constraints. Because of the additional inequality, these methods cannot be directly used for solving D_ν. Up to now, there have been no implemented methods for large-scale ν-SVM. In section 4, we propose a decomposition method similar to the software SVM^light (Joachims, 1998) for C-SVM.

Section 5 presents numerical results. Experiments indicate that several numerical properties of solving D_C and D_ν are similar. A timing comparison shows that the proposed method for ν-SVM is competitive with existing methods for C-SVM. Finally, section 6 gives a discussion and conclusion.


2 The Relation Between ν-SVM and C-SVM

In this section we construct a relationship between D_ν and D_C; the main result is in theorem 5. The relation between D_C and D_ν has been discussed by Schölkopf et al. (2000, Proposition 13), who show that if P_ν leads to ρ > 0, then P_C with C = 1/(ρl) leads to the same decision function. Here we provide a more complete investigation.

We first try to simplify D_ν by showing that the inequality eᵀα ≥ ν can be treated as an equality:

Theorem 1. Let 0 ≤ ν ≤ 1. If D_ν is feasible, there is at least one optimal solution of D_ν that satisfies eᵀα = ν. In addition, if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy eᵀα = ν.

Proof. Since the feasible region of D_ν is bounded, if D_ν is feasible, it has at least one optimal solution. Assume D_ν has an optimal solution α with eᵀα > ν. Since eᵀα > ν ≥ 0, by defining

ᾱ ≡ (ν/eᵀα) α,

ᾱ is feasible to D_ν and eᵀᾱ = ν. Since α is an optimal solution of D_ν, with eᵀα > ν,

αᵀQα ≤ ᾱᵀQᾱ = (ν/eᵀα)² αᵀQα ≤ αᵀQα.   (2.1)

Thus ᾱ is an optimal solution of D_ν, and αᵀQα = 0. This also implies that if the objective value of D_ν is not zero, all optimal solutions of D_ν satisfy eᵀα = ν.

Therefore, in general, eᵀα ≥ ν in D_ν can be written as eᵀα = ν. Schölkopf et al. (2000) noted that in practice one can alternatively work with eᵀα = ν as an equality constraint. From the primal side, it was first shown by Crisp and Burges (1999) that ρ ≥ 0 in P_ν is redundant. Without ρ ≥ 0, the dual becomes

min (1/2)αᵀQα
    yᵀα = 0, eᵀα = ν,   (2.2)
    0 ≤ αᵢ ≤ 1/l, i = 1, …, l,

so the equality is obtained naturally. Note that this is an example in which two problems have the same optimal solution set but are associated with two duals that have different optimal solution sets. Here the primal problem,


which has more restrictions, is related to a dual with a larger feasible region. For our later analysis, we keep using D_ν rather than equation 2.2. Interestingly, we will see that the exceptional situation, where D_ν has optimal solutions with eᵀα > ν, happens only for those ν in which we are not interested.

Due to the additional inequality, the feasibility of D_ν and D_C is different. For D_C, 0 is a trivial feasible point, but D_ν may be infeasible. An example where P_ν is unbounded below and D_ν is infeasible is as follows: given three training data with y₁ = y₂ = 1 and y₃ = −1, if ν = 0.9, there is no α in D_ν that satisfies 0 ≤ αᵢ ≤ 1/3, [1, 1, −1]α = 0, and eᵀα ≥ 0.9. Hence D_ν is infeasible. When this happens, we can choose w = 0, ξ₁ = ξ₂ = 0, b = ρ, ξ₃ = 2ρ as a feasible solution of P_ν. Then the objective value is −0.9ρ + 2ρ/3, which goes to −∞ as ρ → ∞. Therefore, P_ν is unbounded.

We then describe a lemma that was first proved in Crisp and Burges (1999).

Lemma 1. D_ν is feasible if and only if ν ≤ ν_max, where

ν_max ≡ 2 min(#{yᵢ = 1}, #{yᵢ = −1}) / l,

and #{yᵢ = 1} and #{yᵢ = −1} denote the numbers of elements in the first and second classes, respectively.

Proof. Since 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, and yᵀα = 0, for any α feasible to D_ν we have eᵀα ≤ ν_max. Therefore, if D_ν is feasible, ν ≤ ν_max. On the other hand, if 0 < ν ≤ ν_max, then min(#{yᵢ = 1}, #{yᵢ = −1}) > 0, so we can define a feasible solution of D_ν:

αⱼ = ν / (2 #{yᵢ = 1})  if yⱼ = 1,  and  αⱼ = ν / (2 #{yᵢ = −1})  if yⱼ = −1.

This α satisfies 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, and yᵀα = 0. If ν = 0, clearly α = 0 is a feasible solution of D_ν.

Note that the size of ν_max depends on how balanced the training set is. If the numbers of positive and negative examples match, then ν_max = 1.
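As a small illustration (a sketch in Python; the helper name nu_max is ours), lemma 1 gives a one-line feasibility test, and it confirms the three-point example above:

# Sketch: the feasibility threshold of lemma 1.
def nu_max(y):
    pos = sum(1 for yi in y if yi == 1)
    neg = len(y) - pos
    return 2.0 * min(pos, neg) / len(y)

# The earlier example with y1 = y2 = 1, y3 = -1:
print(nu_max([1, 1, -1]))   # 2/3, so D_nu with nu = 0.9 is infeasible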

We then note that if C > 0, by dividing each variable by Cl, D_C is equivalent to the following problem:

(D′_C)  min (1/2)αᵀQα − eᵀα/(Cl)
        yᵀα = 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.


It can be clearly seen that D′_C and D_ν are very similar. We prove the following lemma about D′_C:

Lemma 2. If D′_C has two different optimal solutions α₁ and α₂, then eᵀα₁ = eᵀα₂ and α₁ᵀQα₁ = α₂ᵀQα₂. Therefore, we can define two functions eᵀα_C and α_CᵀQα_C of C, where α_C is any optimal solution of D′_C.

Proof. Since D′_C is a convex problem, if α₁ ≠ α₂ are both optimal solutions, then for all 0 ≤ λ ≤ 1,

(1/2)(λα₁ + (1−λ)α₂)ᵀQ(λα₁ + (1−λ)α₂) − eᵀ(λα₁ + (1−λ)α₂)/(Cl)
  = λ[(1/2)α₁ᵀQα₁ − eᵀα₁/(Cl)] + (1−λ)[(1/2)α₂ᵀQα₂ − eᵀα₂/(Cl)].

This implies

α₁ᵀQα₂ = (1/2)α₁ᵀQα₁ + (1/2)α₂ᵀQα₂.   (2.3)

Since Q is positive semidefinite, Q = LᵀL, so equation 2.3 implies ‖Lα₁ − Lα₂‖ = 0. Thus α₂ᵀQα₂ = α₁ᵀQα₁. Therefore, eᵀα₁ = eᵀα₂, and the proof is complete.

Next we prove a theorem on optimal solutions of D′_C and D_ν:

Theorem 2. If D′_C and D_ν share one optimal solution α* with eᵀα* = ν, their optimal solution sets are the same.

Proof. From lemma 2, any other optimal solution α of D′_C also satisfies eᵀα = ν, so α is feasible to D_ν. Since αᵀQα = (α*)ᵀQα* from lemma 2, all of D′_C's optimal solutions are also optimal solutions of D_ν. On the other hand, if α is any optimal solution of D_ν, it is feasible for D′_C. With the constraint eᵀα ≥ ν = eᵀα* and αᵀQα = (α*)ᵀQα*,

(1/2)αᵀQα − eᵀα/(Cl) ≤ (1/2)(α*)ᵀQα* − eᵀα*/(Cl).

Therefore, all optimal solutions of D_ν are also optimal for D′_C. Hence their optimal solution sets are the same.

If α is an optimal solution of D′_C, it satisfies the following Karush-Kuhn-Tucker (KKT) condition:

Qα − e/(Cl) + by = λ − ξ,
λᵀα = 0, ξᵀ(e/l − α) = 0, yᵀα = 0,
λᵢ ≥ 0, ξᵢ ≥ 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (2.4)

By setting ρ ≡ 1/(Cl) and ν ≡ eᵀα, α also satisfies the KKT condition of D_ν:

Qα − ρe + by = λ − ξ,
λᵀα = 0, ξᵀ(e/l − α) = 0,
yᵀα = 0, eᵀα ≥ ν, ρ(eᵀα − ν) = 0,
λᵢ ≥ 0, ξᵢ ≥ 0, ρ ≥ 0, 0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (2.5)

From theorem 2, this implies that for each D′_C, its optimal solution set is the same as that of D_ν, where ν = eᵀα. For each D′_C, such a D_ν is unique since, from theorem 1, if ν₁ ≠ ν₂, then D_ν₁ and D_ν₂ have different optimal solution sets.

Therefore, we have the following theorem:

Theorem 3. For each D′_C, C > 0, its optimal solution set is the same as that of one (and only one) D_ν, where ν = eᵀα and α is any optimal solution of D′_C.

Similarly, we have:

Theorem 4. If D_ν, ν > 0, has a nonempty feasible set and its objective value is not zero, then D_ν's optimal solution set is the same as that of at least one D′_C.

Proof. If the objective value of D_ν is not zero, then from the KKT condition 2.5,

αᵀQα − ρeᵀα = −Σᵢ₌₁ˡ ξᵢ/l.

Then αᵀQα > 0 and equation 2.5 imply

ρeᵀα = αᵀQα + Σᵢ₌₁ˡ ξᵢ/l > 0,

so ρ > 0 and, from ρ(eᵀα − ν) = 0, eᵀα = ν. By choosing a C > 0 such that ρ = 1/(Cl), α becomes a KKT point of D′_C. Hence from theorem 2, the optimal solution set of this D′_C is the same as that of D_ν.

Next we prove two useful lemmas. The first deals with the special situation in which the objective value of D_ν is zero.


Lemma 3. If the objective value of D_ν, ν ≥ 0, is zero and there is a D′_C, C > 0, such that any of its optimal solutions α_C satisfies eᵀα_C = ν, then ν = ν_max, and all D′_C, C > 0, have the same optimal solution set as that of D_ν.

Proof. For this D_ν, we can set ρ = 1/(Cl), so α_C is a KKT point of D_ν. Therefore, since the objective value of D_ν is zero, α_CᵀQα_C = 0. Furthermore, we have Qα_C = 0. In this case, equation 2.4 of D′_C's KKT condition becomes

−e/(Cl) + [ be_I ; −be_J ] = λ − ξ,   (2.6)

where λᵢ, ξᵢ ≥ 0, and I and J are the indices of the two classes. If be_I ≥ 0, there are three possibilities for the signs of the two blocks on the left-hand side of equation 2.6:

[ > 0 ; < 0 ],  [ < 0 ; < 0 ],  and  [ = 0 ; < 0 ].

The first case implies (α_C)_I = 0 and (α_C)_J = e_J/l. Hence if J is nonempty, yᵀα_C = 0 causes a contradiction, so all data are in the same class. Therefore, D_ν and all D′_C, C > 0, have the unique optimal solution zero due to the constraints yᵀα = 0 and α ≥ 0. Furthermore, eᵀα = ν = ν_max = 0.

The second case happens only when α_C = e/l. Then yᵀα_C = 0 and yᵢ = 1 or −1 imply that #{yᵢ = 1} = #{yᵢ = −1} and eᵀα_C = 1 = ν = ν_max. We then show that e/l is also an optimal solution of any other D′_C̄. Since 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, for any feasible α of D′_C̄, the objective function satisfies

(1/2)αᵀQα − eᵀα/(C̄l) ≥ −eᵀα/(C̄l) ≥ −1/(C̄l).   (2.7)

Now #{yᵢ = 1} = #{yᵢ = −1}, so e/l is feasible. When α = e/l, the inequality in equation 2.7 becomes an equality. Thus e/l is actually an optimal solution of all D′_C, C > 0. Therefore, D_ν and all D′_C, C > 0, have the same unique optimal solution e/l.

For the third case, b = 1/(Cl), (α_C)_J = e_J/l, and ν = eᵀα_C = 2eᵀ_J(α_C)_J = ν_max, where J is the class with fewer elements. Because there exist such a C and b, for any other C, b can be adjusted accordingly so that the KKT condition is still satisfied. Therefore, from theorem 3, all D′_C, C > 0, have the same optimal solution set as that of D_ν. The situation when be_I ≤ 0 is similar.

Lemma 4. Assume α_C is any optimal solution of D′_C. Then eᵀα_C is a continuous decreasing function of C on (0, ∞).

Proof. If C₁ < C₂ and α₁ and α₂ are optimal solutions of D′_C₁ and D′_C₂, respectively, we have

(1/2)α₁ᵀQα₁ − eᵀα₁/(C₁l) ≤ (1/2)α₂ᵀQα₂ − eᵀα₂/(C₁l)   (2.8)

and

(1/2)α₂ᵀQα₂ − eᵀα₂/(C₂l) ≤ (1/2)α₁ᵀQα₁ − eᵀα₁/(C₂l).   (2.9)

Hence

eᵀα₁/(C₂l) − eᵀα₂/(C₂l) ≤ (1/2)α₁ᵀQα₁ − (1/2)α₂ᵀQα₂ ≤ eᵀα₁/(C₁l) − eᵀα₂/(C₁l).   (2.10)

Since C₂ > C₁ > 0, equation 2.10 implies eᵀα₁ − eᵀα₂ ≥ 0. Therefore, eᵀα_C is a decreasing function on (0, ∞). From this result, we know that for any C* ∈ (0, ∞), lim_{C→(C*)⁺} eᵀα_C and lim_{C→(C*)⁻} eᵀα_C exist, and

lim_{C→(C*)⁺} eᵀα_C ≤ eᵀα_{C*} ≤ lim_{C→(C*)⁻} eᵀα_C.

To prove the continuity of eᵀα_C, it is sufficient to prove lim_{C→C*} eᵀα_C = eᵀα_{C*} for all C* ∈ (0, ∞).

If lim_{C→(C*)⁺} eᵀα_C < eᵀα_{C*}, there is a ν̄ such that

0 ≤ lim_{C→(C*)⁺} eᵀα_C < ν̄ < eᵀα_{C*}.   (2.11)

Hence ν̄ > 0. If D_ν̄'s objective value is not zero, then from theorem 4 and the fact that eᵀα_C is a decreasing function, there exists a C > C* such that α_C satisfies eᵀα_C = ν̄. This contradicts equation 2.11, where lim_{C→(C*)⁺} eᵀα_C < ν̄. Therefore, the objective value of D_ν̄ is zero. Since for all D_ν, ν ≤ ν̄, their feasible regions include that of D_ν̄, their objective values are also zero. From theorem 3, the fact that eᵀα_C is a decreasing function, and lim_{C→(C*)⁺} eᵀα_C < ν̄, each D′_C, C > C*, has the same optimal solution set as that of one D_ν, where eᵀα_C = ν < ν̄. Hence by lemma 3, eᵀα_C = ν_max for all C. This contradicts equation 2.11.

Therefore, lim_{C→(C*)⁺} eᵀα_C = eᵀα_{C*}. Similarly, lim_{C→(C*)⁻} eᵀα_C = eᵀα_{C*}. Thus,

lim_{C→C*} eᵀα_C = eᵀα_{C*}.


Theorem 5. We can define

lim_{C→∞} eᵀα_C = ν_* ≥ 0 and lim_{C→0} eᵀα_C = ν* ≤ 1,

where α_C is any optimal solution of D′_C. Then ν* = ν_max. For any ν > ν*, D_ν is infeasible. For any ν ∈ (ν_*, ν*], the optimal solution set of D_ν is the same as that of one D′_C, C > 0, where such C may be unique or any number in an interval. In addition, the optimal objective value of D_ν is strictly positive. For any 0 ≤ ν ≤ ν_*, D_ν is feasible with zero optimal objective value.

Proof. First, from lemma 4 and the fact that 0 ≤ eᵀα ≤ 1, we know that ν_* and ν* are well defined. We then prove ν* = ν_max by showing that once C is small enough, all optimal solutions α_C of D′_C satisfy eᵀα_C = ν_max.

Assume I contains the elements of the class with fewer elements and J contains the elements of the other class. If α_C is an optimal solution of D′_C, it satisfies the following KKT condition:

[ Q_II Q_IJ ; Q_JI Q_JJ ] [ (α_C)_I ; (α_C)_J ] − e/(Cl) + b_C [ y_I ; y_J ] = [ (λ_C)_I − (ξ_C)_I ; (λ_C)_J − (ξ_C)_J ],

where λ_C ≥ 0, ξ_C ≥ 0, α_Cᵀλ_C = 0, and ξ_Cᵀ(e/l − α_C) = 0. When C is small enough, b_C y_J > 0 must hold. Otherwise, since Q_JI(α_C)_I + Q_JJ(α_C)_J is bounded, Q_JI(α_C)_I + Q_JJ(α_C)_J − e_J/(Cl) + b_C y_J < 0 would imply (α_C)_J = e_J/l, which violates the constraint yᵀα = 0 if #{yᵢ = 1} ≠ #{yᵢ = −1}. Therefore, b_C y_J > 0, so b_C y_I < 0. This implies that (α_C)_I = e_I/l when C is sufficiently small. Hence eᵀα_C = ν_max = ν*.

If #{yᵢ = 1} = #{yᵢ = −1}, we can let α_C = e/l and b_C = 0. When C is small enough, this is a KKT point. Therefore, eᵀα_C = ν_max = ν* = 1.

From lemma 1 we immediately know that D_ν is infeasible if ν > ν*. From lemma 4, where eᵀα_C is a continuous function, for any ν ∈ (ν_*, ν*] there is a D′_C such that eᵀα_C = ν. Then from theorem 3, D′_C and D_ν have the same optimal solution set.

If D_ν has the same optimal solution set as those of D′_{C₁} and D′_{C₂}, where C₁ < C₂, then since eᵀα_C is a decreasing function, for any C ∈ [C₁, C₂] its optimal solutions satisfy eᵀα = ν, and from theorem 3 its optimal solution set is the same as that of D_ν. Thus, such Cs form an interval.

If ν < ν_*, D_ν must be feasible from lemma 1. It cannot have a nonzero objective value, due to theorem 4 and the definition of ν_*. For D_{ν_*}: if ν_* = 0, the objective value of D_{ν_*} is zero, as α = 0 is a feasible solution. If ν_* > 0, since the feasible regions of D_ν are bounded by 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, with theorem 1 there is a sequence {α_{νᵢ}}, ν₁ ≤ ν₂ ≤ ⋯ < ν_*, such that eᵀα_{νᵢ} = νᵢ, α_{νᵢ}ᵀQα_{νᵢ} = 0, and {α_{νᵢ}} has a convergent subsequence with limit α̂. Since eᵀα_{νᵢ} = νᵢ, we have eᵀα̂ = lim_{νᵢ→ν_*} eᵀα_{νᵢ} = ν_*. We also have 0 ≤ α̂ ≤ 1/l and yᵀα̂ = lim_{νᵢ→ν_*} yᵀα_{νᵢ} = 0, so α̂ is feasible to D_{ν_*}. However, α̂ᵀQα̂ = lim_{νᵢ→ν_*} α_{νᵢ}ᵀQα_{νᵢ} = 0, as α_{νᵢ}ᵀQα_{νᵢ} = 0 for all νᵢ. Therefore, the objective value of D_{ν_*} is also zero.

Next we show that the objective value of D_ν is zero if and only if ν ≤ ν_*. From the above discussion, if ν ≤ ν_*, the objective value of D_ν is zero. If the objective value of D_ν is zero but ν > ν_*, then lemma 3 implies ν = ν_max = ν* = ν_*, which causes a contradiction. Hence the proof is complete.

Note that when the objective value of D_ν is zero, the optimal solution w of the primal problem P_ν is zero. Crisp and Burges (1999, sec. 4) considered such a P_ν a trivial problem. Next we present a corollary:

Corollary 1. If the training data are separable, then ν_* = 0. If the training data are nonseparable, then ν_* ≥ 1/l > 0. Furthermore, if Q is positive definite, the training data are separable and ν_* = 0.

Proof. From Lin (2001, theorem 3.3), if the data are separable, there is a C* such that for all C ≥ C*, an optimal solution α_{C*} of D_{C*} is also optimal for D_C. Therefore, for D′_C, an optimal solution is α_{C*}/(Cl), and eᵀα_{C*}/(Cl) → 0 as C → ∞. Thus ν_* = 0. On the other hand, if the data are nonseparable, then no matter how large C is, some components of any optimal solution stay at the upper bound. Therefore, eᵀα_C ≥ 1/l > 0 for all C. Hence ν_* ≥ 1/l.

If Q is positive definite, the unconstrained problem

min (1/2)αᵀQα − eᵀα   (2.12)

has a unique solution at α = Q⁻¹e. If we add constraints to equation 2.12,

min (1/2)αᵀQα − eᵀα
    yᵀα = 0, αᵢ ≥ 0, i = 1, …, l,   (2.13)

is a problem with a smaller feasible region, so the objective value of equation 2.13 is bounded. By corollary 27.3.1 of Rockafellar (1970), a convex quadratic function that is bounded over a polyhedral set in a finite-dimensional space attains at least one optimal solution. Therefore, equation 2.13 is solvable. From Lin (2001, theorem 2), this implies that the following primal problem is solvable:

min (1/2)wᵀw
    yᵢ(wᵀφ(xᵢ) + b) ≥ 1, i = 1, …, l.

Hence the training data are separable, and from the first part of the corollary, ν_* = 0.


In many situations, Q is positive definite. For example, from Micchelli (1986), if the radial basis function (RBF) kernel is used and xᵢ ≠ xⱼ for i ≠ j, then Q is positive definite.
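The following sketch (assuming only numpy) checks this numerically: for distinct points, the smallest eigenvalue of the RBF Gram matrix is strictly positive, and Q = diag(y) K diag(y) inherits positive definiteness from K.

# Sketch: numerical check that the RBF Gram matrix of distinct points is
# positive definite (Micchelli, 1986).
import numpy as np

X = np.random.RandomState(1).randn(20, 3)                 # 20 distinct points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
K = np.exp(-sq / X.shape[1])                              # gamma = 1/n, as in section 5
print(np.linalg.eigvalsh(K).min())                        # > 0 up to rounding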

We illustrate the above results with some examples. Given three nonseparable training points x₁ = 0, x₂ = 1, and x₃ = 2 with y = [1, −1, 1]ᵀ, we show that this is an example of lemma 3. For all C > 0, the optimal solution of D′_C is α = [1/6, 1/3, 1/6]ᵀ. Therefore, in this case ν_* = ν* = 2/3. For D_ν, ν ≤ 2/3, an optimal solution is α = (3ν/2)[1/6, 1/3, 1/6]ᵀ, with objective value

(3ν/2)² [1/6, 1/3, 1/6] [ 0 0 0 ; 0 1 −2 ; 0 −2 4 ] [1/6, 1/3, 1/6]ᵀ = 0.

Another example shows that we may have the same value of eᵀα_C for all C in an interval, where α_C is any optimal solution of D′_C. Given

x₁ = [−1, 0]ᵀ, x₂ = [1, 1]ᵀ, x₃ = [0, −1]ᵀ, and x₄ = [0, 0]ᵀ

with y = [1, −1, 1, −1]ᵀ, part of the KKT condition of D′_C is

[ 1 1 0 0 ; 1 2 1 0 ; 0 1 1 0 ; 0 0 0 0 ] [ α₁ ; α₂ ; α₃ ; α₄ ] − (1/(4C)) e + b [ 1 ; −1 ; 1 ; −1 ] = λ − ξ.

Then one optimal solution α_C of D′_C, with the corresponding b, is

α_C = [1/4, 1/4, 1/4, 1/4]ᵀ,                  b ∈ [1 − 1/(4C), 1/(4C) − 1/2],  if 0 < C ≤ 1/3,
α_C = (1/36)[3 + 2/C, −3 + 4/C, 3 + 2/C, 9]ᵀ, b = 1/(12C),                     if 1/3 ≤ C ≤ 4/3,
α_C = [1/8, 0, 1/8, 1/4]ᵀ,                    b = 1/(4C) − 1/8,                if 4/3 ≤ C ≤ 4,
α_C = [1/(2C), 0, 1/(2C), 1/C]ᵀ,              b = −1/(4C),                     if C ≥ 4.

This is a separable problem. We have ν* = 1, ν_* = 0, and

eᵀα_C = 1,              if 0 < C ≤ 1/3,
       = 1/3 + 2/(9C),  if 1/3 ≤ C ≤ 4/3,
       = 1/2,           if 4/3 ≤ C ≤ 4,
       = 2/C,           if C ≥ 4.   (2.14)

In summary, this section shows:

• Solving D_ν and D_C is just like solving two different problems with the same optimal solution set. We may expect that many numerical aspects of solving them are similar. However, they are still two different problems, so we cannot obtain C without solving D_ν; similarly, without solving D_C, we cannot find ν.

3 The Relation Between ν and C

A formula like equation 2.14 motivates us to conjecture that all ν = eᵀα_C have a similar form. That is, in each interval of C, eᵀα_C = A + B/C, where A and B are constants independent of C. The formulation of eᵀα_C will be the main topic of this section.
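Before deriving this analytically, the curve eᵀα_C can be traced numerically. The sketch below is ours and assumes scipy; a generic SLSQP solver stands in for a proper SVM solver and is adequate only for tiny, dense problems. It solves D′_C over a grid of C values and prints ν = eᵀα_C, which is nonincreasing in C and piecewise of the form A + B/C:

# Sketch: tracing nu = e^T alpha_C over C for D'_C on a toy problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
X = rng.randn(20, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(axis=-1) / 2)
Q = np.outer(y, y) * K
l = len(y)

def nu_of_C(C):
    # D'_C: min 1/2 a^T Q a - e^T a/(C l),  y^T a = 0,  0 <= a_i <= 1/l
    obj = lambda a: 0.5 * a @ Q @ a - a.sum() / (C * l)
    res = minimize(obj, np.zeros(l), method='SLSQP',
                   bounds=[(0.0, 1.0 / l)] * l,
                   constraints=[{'type': 'eq', 'fun': lambda a: y @ a}])
    return res.x.sum()   # e^T alpha_C

for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"C = {C:7.2f}  nu = {nu_of_C(C):.4f}")   # nonincreasing in C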

We note that in equation 2.14, in each interval of C, the solutions α_C are at the same face. Here we say two vectors are at the same face if they have the same free, lower-bounded, and upper-bounded components. The following lemma deals with the situation when the α_C are at the same face:

Lemma 5. If C_a < C_b and there are α_{C_a} and α_{C_b} at the same face, then for each C ∈ [C_a, C_b] there is at least one optimal solution α_C of D′_C at the same face as α_{C_a} and α_{C_b}. Furthermore,

eᵀα_C = Δ₁ + Δ₂/C, C_a ≤ C ≤ C_b,

where Δ₁ and Δ₂ are constants independent of C. In addition, Δ₂ ≥ 0.

Proof. Let {1, …, l} be separated into two sets A and F, where A corresponds to the bounded variables and F to the free variables of α_{C_a} (or α_{C_b}, as they are at the same face). The KKT condition shows

[ Q_FF Q_FA ; Q_AF Q_AA ] [ α_F ; α_A ] − e/(Cl) + b [ y_F ; y_A ] = [ 0 ; λ_A − ξ_A ],   (3.1)
y_Fᵀα_F + y_Aᵀα_A = 0,   (3.2)
λᵢ ≥ 0, ξᵢ ≥ 0, i ∈ A.   (3.3)

If Q_FF is positive definite, the upper part of equation 3.1 gives

α_F = Q_FF⁻¹(e_F/(Cl) − Q_FAα_A − by_F).   (3.4)

Then y_Fᵀα_F + y_Aᵀα_A = 0 implies

b = [ y_Aᵀα_A + y_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A) ] / (y_FᵀQ_FF⁻¹y_F).

Therefore,

α_F = Q_FF⁻¹( e_F/(Cl) − Q_FAα_A − [ y_Aᵀα_A + y_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A) ] / (y_FᵀQ_FF⁻¹y_F) · y_F ).   (3.5)

We note that for C_a ≤ C ≤ C_b, if (α_C)_F is defined by equation 3.5 and (α_C)_A ≡ (α_{C_a})_A (= (α_{C_b})_A), then (α_C)ᵢ ≥ 0, i = 1, …, l. In addition, α_C satisfies the upper part of equation 3.1 (the part with right-hand side zero), the sign of the lower part is not changed, and equation 3.2 remains valid. Thus we have constructed an optimal solution α_C of D′_C at the same face as α_{C_a} and α_{C_b}. Then, since α_A is a constant vector for all C_a ≤ C ≤ C_b, equation 3.5 gives

eᵀα_C = e_FᵀQ_FF⁻¹(e_F/(Cl) − Q_FAα_A − by_F) + e_Aᵀα_A
      = [ e_FᵀQ_FF⁻¹e_F − (e_FᵀQ_FF⁻¹y_F)² / (y_FᵀQ_FF⁻¹y_F) ] / (lC) + Δ₁
      = Δ₂/C + Δ₁,

where Δ₁ collects the terms independent of C.

If Q_FF is not invertible, it is positive semidefinite, so we can write Q_FF = Q̂DQ̂ᵀ, where D is diagonal with nonnegative entries and Q̂ is orthogonal. Without loss of generality we assume

D = [ D̄ 0 ; 0 0 ],

with D̄ positive definite. Then equation 3.4 can be modified to

DQ̂ᵀα_F = Q̂⁻¹(e_F/(Cl) − Q_FAα_A − by_F).

One solution of this system is

α_F = Q̂⁻ᵀ [ D̄⁻¹ 0 ; 0 0 ] Q̂⁻¹(e_F/(Cl) − Q_FAα_A − by_F).

Thus a representation similar to equation 3.4 is obtained, and all arguments follow.

Note that due to the positive semidefiniteness of Q_FF, α_F may have multiple solutions. From lemma 2, eᵀα_C is a well-defined function of C, so the representation Δ₁ + Δ₂/C is valid for all solutions. From lemma 4, eᵀα_C is a decreasing function of C, so Δ₂ ≥ 0.

The main result on the representation of eᵀα_C is the following theorem:

Theorem 6. There are 0 < C₁ < ⋯ < C_s and Aᵢ, Bᵢ, i = 1, …, s, such that

eᵀα_C = ν*,            if C ≤ C₁,
       = Aᵢ + Bᵢ/C,    if Cᵢ ≤ C ≤ Cᵢ₊₁, i = 1, …, s − 1,
       = A_s + B_s/C,  if C_s ≤ C,

where α_C is an optimal solution of D′_C. We also have

Aᵢ + Bᵢ/Cᵢ₊₁ = Aᵢ₊₁ + Bᵢ₊₁/Cᵢ₊₁, i = 1, …, s − 1.   (3.6)

Proof. From theorem 5, we know that eᵀα_C = ν* when C is sufficiently small. From lemma 4, if we gradually increase C, we will reach a C₁ such that eᵀα_C < ν* if C > C₁. If for all C ≥ C₁ the α_C are at the same face, then from lemma 5, eᵀα_C = A₁ + B₁/C for all C ≥ C₁. Otherwise, from this C₁ we can increase C to a C₂ such that for every interval (C₂, C₂ + ε), ε > 0, there is an α_C not at the same face as α_{C₁} and α_{C₂}. Then from lemma 5, for C₁ ≤ C ≤ C₂, we can have A₁ and B₁ such that eᵀα_C = A₁ + B₁/C.

We can continue this procedure. Since the number of possible faces is finite (≤ 3ˡ), there are only finitely many Cᵢ. Otherwise, there would exist Cᵢ and Cⱼ, j > i + 1, such that α_{Cᵢ} and α_{Cⱼ} are at the same face. Then lemma 5 implies that for all Cᵢ ≤ C ≤ Cⱼ, all α_C are at the same face as α_{Cᵢ} and α_{Cⱼ}. This contradicts the definition of Cᵢ₊₁.

From lemma 4, the continuity of eᵀα_C immediately implies equation 3.6.

Finally, we provide Figure 1 to demonstrate the relation between ν and C. It clearly indicates that ν is a decreasing function of C. Information about the two test problems, australian and heart, is given in section 5.

Figure 1: Relation between ν and C.

4 A Decomposition Method for ν-SVM

Based on existing decomposition methods for C-SVM, in this section we propose a decomposition method for ν-SVM.

For solving D_C, existing decomposition methods separate the index set {1, …, l} of the training data into two sets B and N, where B is the working set. If α is the current solution of the algorithm and we denote by α_B and α_N the vectors containing the corresponding elements, the objective value of D_C is equal to (1/2)α_BᵀQ_BBα_B − (e_B − Q_BNα_N)ᵀα_B + (1/2)α_NᵀQ_NNα_N − e_Nᵀα_N. At each iteration, α_N is fixed, and the following problem with the variable α_B is solved:

min (1/2)α_BᵀQ_BBα_B − (e_B − Q_BNα_N)ᵀα_B
    y_Bᵀα_B = −y_Nᵀα_N,   (4.1)
    0 ≤ (α_B)ᵢ ≤ C, i = 1, …, q,

where [ Q_BB Q_BN ; Q_NB Q_NN ] is a permutation of the matrix Q and q is the size of B. The strict decrease of the objective function holds, and the theoretical convergence was studied in Chang, Hsu, and Lin (2000), Keerthi and Gilbert (2000), and Lin (2000).

An important step in decomposition methods is the selection of the working set B. In the software SVM^light (Joachims, 1998), there is a systematic way to find the working set B. In each iteration the following problem is solved:

min ∇f(α_k)ᵀd
    yᵀd = 0, −1 ≤ dᵢ ≤ 1,   (4.2)
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = C,   (4.3)
    |{dᵢ | dᵢ ≠ 0}| = q,   (4.4)

where f(α) ≡ (1/2)αᵀQα − eᵀα, α_k is the solution at the kth iteration, and ∇f(α_k) is the gradient of f(α) at α_k. Note that |{dᵢ | dᵢ ≠ 0}| means the number of components of d that are not zero. The constraint 4.4 implies that a descent direction involving only q variables is obtained. Then

components of α_k with nonzero dᵢ are included in the working set B, which is used to construct the subproblem 4.1. Note that d is used only for identifying B, not as a search direction.

If q is an even number, Joachims (1998) showed a simple strategy for solving equations 4.2 through 4.4. First, he sorts yᵢ∇f(α_k)ᵢ, i = 1, …, l, in decreasing order. Then he successively picks the q/2 elements from the top of the sorted list for which 0 < (α_k)ᵢ < C or dᵢ = −yᵢ obeys equation 4.3. Similarly, he picks the q/2 elements from the bottom of the list for which 0 < (α_k)ᵢ < C or dᵢ = yᵢ obeys equation 4.3. Other elements of d are assigned to be zero. Thus, these q nonzero elements compose the working set. A complete analysis of his procedure is in Lin (2000, sect. 2).

To modify the above strategy for D_ν, we consider the following problem in each iteration:

min ∇f(α_k)ᵀd
    yᵀd = 0, eᵀd = 0, −1 ≤ dᵢ ≤ 1,
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = 1/l,   (4.5)
    |{dᵢ | dᵢ ≠ 0}| ≤ q,

where q is an even integer. Now f(α) ≡ (1/2)αᵀQα. Here we use ≤ instead of = because in theory q nonzero elements may not always be available. This was first pointed out by Chang et al. (2000). Note that the subproblem 4.1 becomes the following if decomposition methods are used for solving D_ν:

min (1/2)α_BᵀQ_BBα_B + (Q_BNα_N)ᵀα_B
    y_Bᵀα_B = −y_Nᵀα_N,   (4.6)
    e_Bᵀα_B = ν − e_Nᵀα_N,
    0 ≤ (α_B)ᵢ ≤ 1/l, i = 1, …, q.

Problem 4.5 is more complicated than problem 4.2, as there is an additional constraint eᵀd = 0. The situation of q = 2 has been discussed in Keerthi and Gilbert (2000). We will describe a recursive procedure for solving equation 4.5.

Consider the following problem:

min Σ_{t∈S} ∇f(α_k)_t d_t
    Σ_{t∈S} y_t d_t = 0, Σ_{t∈S} d_t = 0, −1 ≤ d_t ≤ 1,
    d_t ≥ 0 if (α_k)_t = 0, d_t ≤ 0 if (α_k)_t = 1/l,   (4.7)
    |{d_t | d_t ≠ 0, t ∈ S}| ≤ q,

which is the same as equation 4.5 if S = {1, …, l}. We denote the variables {d_t | t ∈ S} as d and the objective function Σ_{t∈S} ∇f(α_k)_t d_t as obj(d).

Algorithm 1. If q = 0, the algorithm stops and outputs d = 0. Otherwise choose a pair of indices i and j from either

i = argmin_t {∇f(α_k)_t | y_t = 1, (α_k)_t < 1/l, t ∈ S},
j = argmax_t {∇f(α_k)_t | y_t = 1, (α_k)_t > 0, t ∈ S},   (4.8)

or

i = argmin_t {∇f(α_k)_t | y_t = −1, (α_k)_t < 1/l, t ∈ S},
j = argmax_t {∇f(α_k)_t | y_t = −1, (α_k)_t > 0, t ∈ S},   (4.9)

depending on which one gives a smaller ∇f(α_k)ᵢ − ∇f(α_k)ⱼ. If there are no such i and j, or ∇f(α_k)ᵢ − ∇f(α_k)ⱼ ≥ 0, the algorithm stops and outputs the solution d = 0. Otherwise we assign dᵢ = 1, dⱼ = −1 and determine the values of the other variables by recursively solving a smaller problem of the form of equation 4.7:

min Σ_{t∈S′} ∇f(α_k)_t d_t
    Σ_{t∈S′} y_t d_t = 0, Σ_{t∈S′} d_t = 0, −1 ≤ d_t ≤ 1,
    d_t ≥ 0 if (α_k)_t = 0, d_t ≤ 0 if (α_k)_t = 1/l,   (4.10)
    |{d_t | d_t ≠ 0, t ∈ S′}| ≤ q′,

where S′ = S\{i, j} and q′ = q − 2.

Algorithm 1 assigns nonzero values to at most q/2 pairs. The indices of the nonzero elements of the solution d are used as B in the subproblem 4.6. Note that algorithm 1 can be implemented as an iterative procedure that selects q/2 pairs sequentially; the computational complexity is then similar to that of Joachims's strategy. Here, for convenience in writing proofs, we describe it recursively; a sketch of the iterative form follows. Next we prove that algorithm 1 solves equation 4.5.
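The sketch below is our transcription of the iterative form for general even q; here grad is ∇f(α_k) = Qα_k, and α is assumed feasible for the scaled bounds 0 ≤ αᵢ ≤ ub:

# Sketch: iterative form of algorithm 1 (our transcription, not LIBSVM's code).
def select_working_set(grad, alpha, y, ub, q, eps=1e-12):
    S = set(range(len(alpha)))
    B = []
    while len(B) < q:
        best = None
        for cls in (1, -1):                      # try equations 4.8 and 4.9
            up = [t for t in S if y[t] == cls and alpha[t] < ub - eps]
            lo = [t for t in S if y[t] == cls and alpha[t] > eps]
            if up and lo:
                i = min(up, key=lambda t: grad[t])
                j = max(lo, key=lambda t: grad[t])
                if best is None or grad[i] - grad[j] < best[0]:
                    best = (grad[i] - grad[j], i, j)
        if best is None or best[0] >= 0:         # no violating pair: d = 0
            break                                # (cf. the stopping rule 4.17 below)
        _, i, j = best
        B += [i, j]                              # d_i = 1, d_j = -1
        S -= {i, j}
    return B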

Lemma 6. If equation 4.7 has an optimal solution d, then it has an optimal integer solution d* with d*_t ∈ {−1, 0, 1} for all t ∈ S.

Proof. Because Σ_{t∈S} d_t = 0, if there are noninteger elements in d, there must be at least two of them. Furthermore, from the linear constraints Σ_{t∈S} y_t d_t = 0 and Σ_{t∈S} d_t = 0, we have

Σ_{t∈S, y_t=1} d_t = 0 and Σ_{t∈S, y_t=−1} d_t = 0.   (4.11)

Thus, if there are only two noninteger elements dᵢ and dⱼ, they must satisfy yᵢ = yⱼ.

Therefore, if d contains noninteger elements, there must be two of them, dᵢ and dⱼ, with yᵢ = yⱼ. If dᵢ + dⱼ = c, then

∇f(α_k)ᵢdᵢ + ∇f(α_k)ⱼdⱼ = (∇f(α_k)ᵢ − ∇f(α_k)ⱼ)dᵢ + c∇f(α_k)ⱼ.   (4.12)

Since dᵢ, dⱼ ∉ {−1, 0, 1} and −1 < dᵢ, dⱼ < 1, if ∇f(α_k)ᵢ ≠ ∇f(α_k)ⱼ, we can pick a sufficiently small ε > 0 and shift dᵢ and dⱼ by −ε(∇f(α_k)ᵢ − ∇f(α_k)ⱼ) and ε(∇f(α_k)ᵢ − ∇f(α_k)ⱼ), respectively, without violating their feasibility. Then the decrease of the objective value contradicts the assumption that d is an optimal solution. Hence we know that ∇f(α_k)ᵢ = ∇f(α_k)ⱼ.

Then we can eliminate at least one of the nonintegers by shifting dᵢ and dⱼ in opposite directions by argmin_v {|v| : v ∈ {⌈dᵢ⌉ − dᵢ, dᵢ − ⌊dᵢ⌋, ⌈dⱼ⌉ − dⱼ, dⱼ − ⌊dⱼ⌋}}. The objective value is the same because of equation 4.12 and ∇f(α_k)ᵢ = ∇f(α_k)ⱼ. We can repeat this process until an integer optimal solution d* is obtained.

Lemma 7. If equation 4.7 has an optimal integer solution d that is not all zero, and (i, j) can be chosen from equation 4.8 or 4.9, then there is an optimal integer solution d* with d*ᵢ = 1 and d*ⱼ = −1.

Proof. Because (i, j) can be chosen from equation 4.8 or 4.9, we know (α_k)ᵢ < 1/l and (α_k)ⱼ > 0. We will show that if dᵢ ≠ 1 or dⱼ ≠ −1, we can construct an optimal integer solution d* from d such that d*ᵢ = 1 and d*ⱼ = −1.

We first note that for any nonzero integer element dᵢ′, from equation 4.11, there is a nonzero integer element dⱼ′ such that dⱼ′ = −dᵢ′ and yⱼ′ = yᵢ′. We define p(i′) ≡ j′.

If dᵢ = −1, we can find i′ = p(i) such that dᵢ′ = 1 and yᵢ′ = yᵢ. Since dᵢ′ = 1, (α_k)ᵢ′ < 1/l. By the definition of i and the fact that (α_k)ᵢ < 1/l, ∇f(α_k)ᵢ ≤ ∇f(α_k)ᵢ′. Let d*ᵢ = 1, d*ᵢ′ = −1, and d*_t = d_t otherwise. Then obj(d*) ≤ obj(d), so d* is also an optimal solution. Similarly, if dⱼ = 1, we can have an optimal solution d* with d*ⱼ = −1.

Therefore, after the above transformation has been done, only three cases are left: (dᵢ, dⱼ) = (0, −1), (1, 0), and (0, 0). For the first case, we can find i′ = p(j) such that dᵢ′ = 1 and yᵢ′ = yᵢ = yⱼ. From the definition of i and the fact that (α_k)ᵢ < 1/l and (α_k)ᵢ′ < 1/l, ∇f(α_k)ᵢ ≤ ∇f(α_k)ᵢ′. We can define d*ᵢ = 1, d*ᵢ′ = 0, and d*_t = d_t otherwise. Then obj(d*) ≤ obj(d), so d* is also an optimal solution. If (dᵢ, dⱼ) = (1, 0), the situation is similar.

Finally, we check the case where dᵢ and dⱼ are both zero. Since d is a nonzero integer vector, we can consider a dᵢ′ = 1 and j′ = p(i′). From equations 4.8 and 4.9, ∇f(α_k)ᵢ − ∇f(α_k)ⱼ ≤ ∇f(α_k)ᵢ′ − ∇f(α_k)ⱼ′. Let d*ᵢ = 1, d*ⱼ = −1, d*ᵢ′ = d*ⱼ′ = 0, and d*_t = d_t otherwise. Then d* is feasible for equation 4.7 and obj(d*) ≤ obj(d). Thus, d* is an optimal solution.

Lemma 8. If there is an integer optimal solution of equation 4.7 and algorithm 1 outputs a zero vector d, then d is already an optimal solution of equation 4.7.

Proof. If the result is wrong, there is an integer optimal solution d* of equation 4.7 such that

obj(d*) = Σ_{t∈S} ∇f(α_k)_t d*_t < 0.

Without loss of generality, we can consider only the case

Σ_{t∈S, y_t=1} ∇f(α_k)_t d*_t < 0.   (4.13)

From equation 4.11 and d*_t ∈ {−1, 0, 1}, the number of indices satisfying d*_t = 1, y_t = 1 is the same as the number satisfying d*_t = −1, y_t = 1. Therefore, we must have

min_{d*_t=1, y_t=1} ∇f(α_k)_t − max_{d*_t=−1, y_t=1} ∇f(α_k)_t < 0.   (4.14)

Otherwise,

Σ_{d*_t=1, y_t=1} ∇f(α_k)_t − Σ_{d*_t=−1, y_t=1} ∇f(α_k)_t = Σ_{y_t=1} ∇f(α_k)_t d*_t ≥ 0

contradicts equation 4.13.

Then equation 4.14 implies that in algorithm 1, i and j can be chosen with dᵢ = 1 and dⱼ = −1. This contradicts the assumption that algorithm 1 outputs a zero vector.

Theorem 7. Algorithm 1 solves equation 4.7.

Proof. First we note that the set of all d satisfying |{d_t | d_t ≠ 0, t ∈ S}| ≤ q can be considered as the union of finitely many closed sets of the form {d | d_{i₁} = 0, …, d_{i_{l−q}} = 0}. Therefore, the feasible region of equation 4.7 is closed. With the bound constraints −1 ≤ dᵢ ≤ 1, i = 1, …, l, the feasible region is compact, so there is at least one optimal solution.

As q is an even integer, we assume q = 2k. We then finish the proof by induction on k:

k = 0: Then q = 0, the only feasible point of equation 4.7 is d = 0, and this is what algorithm 1 outputs, so the result holds.

k > 0: Suppose algorithm 1 outputs a vector d with dᵢ = 1 and dⱼ = −1. In this situation, the optimal solution of equation 4.7 cannot be zero: otherwise, the vector d̄ with d̄ᵢ = 1, d̄ⱼ = −1, and d̄_t = 0 for all t ∈ S\{i, j} satisfies obj(d̄) < 0, a smaller objective value than that of the zero vector. Thus, the assumptions of lemma 7 hold. Then, by the fact that equation 4.7 is solvable and lemmas 6 and 7, we know that there is an optimal integer solution d* of equation 4.7 with d*ᵢ = 1 and d*ⱼ = −1.

By induction, {d_t | t ∈ S′} is an optimal solution of equation 4.10. Since {d*_t | t ∈ S′} is also a feasible solution of equation 4.10, we have

obj(d) = ∇f(α_k)ᵢdᵢ + ∇f(α_k)ⱼdⱼ + Σ_{t∈S′} ∇f(α_k)_t d_t
       ≤ ∇f(α_k)ᵢd*ᵢ + ∇f(α_k)ⱼd*ⱼ + Σ_{t∈S′} ∇f(α_k)_t d*_t = obj(d*).   (4.15)

Thus d, the output of algorithm 1, is an optimal solution.

Suppose algorithm 1 does not output a vector d with dᵢ = 1 and dⱼ = −1. Then d is actually a zero vector, and immediately from lemma 8, d = 0 is an optimal solution.

Since equation 4.5 is a special case of equation 4.7, theorem 7 implies that algorithm 1 can solve it.

After solving D_ν, we want to calculate ρ and b in P_ν. The KKT condition, equation 2.5, shows

(Qα)ᵢ − ρ + byᵢ  = 0 if 0 < αᵢ < 1/l,
                 ≥ 0 if αᵢ = 0,
                 ≤ 0 if αᵢ = 1/l.

Define r₁ ≡ ρ − b and r₂ ≡ ρ + b. If yᵢ = 1, the KKT condition becomes

(Qα)ᵢ − r₁  = 0 if 0 < αᵢ < 1/l,
            ≥ 0 if αᵢ = 0,
            ≤ 0 if αᵢ = 1/l.   (4.16)

Therefore, if there are αᵢ that satisfy equation 4.16 with 0 < αᵢ < 1/l, then r₁ = (Qα)ᵢ. In practice, to avoid numerical errors, we average over them:

r₁ = Σ_{0<αᵢ<1/l, yᵢ=1} (Qα)ᵢ / Σ_{0<αᵢ<1/l, yᵢ=1} 1.

On the other hand, if there is no such αᵢ, then since r₁ must satisfy

max_{αᵢ=1/l, yᵢ=1} (Qα)ᵢ ≤ r₁ ≤ min_{αᵢ=0, yᵢ=1} (Qα)ᵢ,

we take r₁ to be the midpoint of this range. For yᵢ = −1, we can calculate r₂ in a similar way. After r₁ and r₂ are obtained,

ρ = (r₁ + r₂)/2 and −b = (r₁ − r₂)/2.
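A direct transcription of this recovery (a sketch; it assumes numpy arrays and that when a class has no free αᵢ, both kinds of bounded αᵢ exist so that the range in question is finite) is:

# Sketch: recovering rho and b from a dual solution of the D_nu form above.
import numpy as np

def recover_rho_b(Q, alpha, y, l):
    Qa = Q @ alpha
    r = []
    for cls in (1, -1):                      # r1 from y_i = 1, r2 from y_i = -1
        free = (y == cls) & (alpha > 0) & (alpha < 1.0 / l)
        if free.any():
            r.append(Qa[free].mean())        # average over free alpha_i
        else:                                # midpoint of the admissible range
            lo = Qa[(y == cls) & (alpha >= 1.0 / l)].max()
            hi = Qa[(y == cls) & (alpha <= 0)].min()
            r.append((lo + hi) / 2.0)
    r1, r2 = r
    return (r1 + r2) / 2.0, (r2 - r1) / 2.0  # rho, b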

Note that the KKT condition can be written as

max_{αᵢ>0, yᵢ=1} (Qα)ᵢ ≤ min_{αᵢ<1/l, yᵢ=1} (Qα)ᵢ and max_{αᵢ>0, yᵢ=−1} (Qα)ᵢ ≤ min_{αᵢ<1/l, yᵢ=−1} (Qα)ᵢ.

Hence, in practice we can use the following stopping criterion: the decomposition method stops when the solution α satisfies

−(Qα)ᵢ + (Qα)ⱼ < ε,   (4.17)

where ε > 0 is a chosen stopping tolerance, and i and j are the first pair obtained from equation 4.8 or 4.9.

In section 5, we conduct some experiments on this new method.

5 Numerical Experiments

When C is large, there may be more numerical difficulties in using decomposition methods to solve D_C (see, for example, the discussion in Hsu & Lin, 1999). There is no C in D_ν, so intuitively we may think that this difficulty no longer exists. In this section, we test the proposed decomposition method on examples with different ν and examine the required time and iterations.

Since the constraints 0 ≤ αᵢ ≤ 1/l, i = 1, …, l, imply that the αᵢ are small, the objective value of D_ν may be very close to zero. To avoid possible numerical inaccuracy, we consider the following scaled form of D_ν:

min (1/2)αᵀQα
    yᵀα = 0, eᵀα = νl,   (5.1)
    0 ≤ αᵢ ≤ 1, i = 1, …, l.

The working set selection follows the discussion in section 4, and here we implement the special case q = 2, so the working set in each iteration contains only two elements.
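With q = 2, the subproblem 4.6 can be solved in closed form: the selected pair lies within one class, so moving the two variables in opposite directions preserves both yᵀα and eᵀα. A sketch of this update (ours; cf. Keerthi & Gilbert, 2000) for the scaled problem 5.1, with Q, grad, and alpha as numpy arrays, is:

# Sketch: analytic q = 2 step for the scaled problem 5.1 (bounds 0 <= alpha <= 1).
# i and j are in the same class, so y^T alpha and e^T alpha are unchanged.
def update_pair(Q, grad, alpha, i, j):
    denom = Q[i, i] + Q[j, j] - 2.0 * Q[i, j]        # curvature along the direction
    delta = (grad[j] - grad[i]) / max(denom, 1e-12)  # unconstrained minimizer
    delta = min(delta, 1.0 - alpha[i], alpha[j])     # clip to the box constraints
    alpha[i] += delta
    alpha[j] -= delta
    grad += delta * (Q[:, i] - Q[:, j])              # maintain grad = Q alpha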


For the initial point α¹, we assign the first elements with yᵢ = 1 the values [1, …, 1, νl/2 − ⌊νl/2⌋]ᵀ, that is, ⌊νl/2⌋ ones followed by the fractional remainder. The same values are assigned to the first elements with yᵢ = −1. All other elements are assigned zero. Unlike the decomposition method for D_C, where the zero vector is usually used as the initial solution so that ∇f(α¹) = −e, now α¹ contains about νl nonzero components. In order to obtain ∇f(α¹) = Qα¹ for equation 4.5, at the beginning of the decomposition procedure we must compute νl columns of Q. This might be a disadvantage of using ν-SVM; further investigation is needed on this issue.
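A sketch of this initialization (ours; the actual LIBSVM code may differ in details) for the scaled form 5.1:

# Sketch: initial point for the scaled problem 5.1
# (e^T alpha = nu*l, y^T alpha = 0, 0 <= alpha_i <= 1).
import numpy as np

def initial_alpha(y, nu):
    l = len(y)
    alpha = np.zeros(l)
    for cls in (1, -1):
        remaining = nu * l / 2.0               # mass placed in each class
        for i in np.where(y == cls)[0]:
            alpha[i] = min(1.0, remaining)     # ones, then the fractional part
            remaining -= alpha[i]
            if remaining <= 0:
                break
    return alpha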

We test the RBF kernel with Qᵢⱼ = yᵢyⱼ e^{−‖xᵢ−xⱼ‖²/n}, where n is the number of attributes of the training data. Our implementation is part of the software LIBSVM (version 2.03), an integrated package for SVM classification and regression. (LIBSVM is available online at http://www.csie.ntu.edu.tw/~cjlin/libsvm.)

We test problems from various collections. Problems australian to shuttle are from the Statlog collection (Michie, Spiegelhalter, & Taylor, 1994). Problems adult4 and web7 are compiled by Platt (1998) from the UCI Machine Learning Repository (Blake & Merz, 1998; Murphy & Aha, 1994). Note that all problems from Statlog have real-valued attributes, so we scale them to [−1, 1]. Problems adult4 and web7 have binary representations, so we do not conduct any scaling. Some of these problems have more than two classes, so we treat all data not in the first class as being in the second class.

As LIBSVM also implements a decomposition method with q = 2 for C-SVM (Chang & Lin, 2000), we conduct some comparisons between C-SVM and ν-SVM. Note that the two codes are nearly the same except for the different working set selections for D_ν and D_C. For each problem, we first solve its D_C form using C = 1 and C = 1000. If α_C is an optimal solution of D_C, we then calculate ν by eᵀα_C/(Cl) and solve D_ν. The stopping tolerance for solving C-SVM is set to be 10⁻³. As the α of equation 4.17 is like the α of D_C divided by C and the stopping criterion involves Qα, to have a fair comparison, the tolerance ε of equation 4.17 for equation 5.1 is set to 10⁻³/C.

The computational experiments for this section were done on a Pentium III-500 with 256 MB RAM using the gcc compiler. We used 100 MB as the cache size of LIBSVM for storing recently used Qᵢⱼ.

Tables 1 and 2 report the results for C = 1 and C = 1000, respectively. In each table, the corresponding ν is listed, and the numbers of iterations and the time (in seconds) of both algorithms are compared. Note that for the same problem, fewer iterations do not always lead to less computational time. We think there are two possible reasons: first, calculating the initial gradient for D_ν is more expensive; second, due to different contents of the cache (or different numbers of kernel evaluations), the cost of each iteration is different. We also present the number of support vectors (#SV column) as well as the number of free support vectors (#FSV column). It can be clearly seen that the proposed method for D_ν performs very well. This comparison shows the practical viability of using ν-SVM.

Table 1: Solving C-SVM and ν-SVM: C = 1 (Time in Seconds).

Problem     l       ν         C Iter.  ν Iter.  C Time  ν Time   #SV   #FSV  νl
australian  690     0.309619  1040     946      0.34    0.42     244   55    214
diabetes    768     0.574087  395      297      0.4     0.47     447   13    441
german      1000    0.556643  953      909      1.23    1.61     600   88    557
heart       270     0.43103   219      175      0.07    0.08     132   25    117
vehicle     846     0.501182  791      904      0.69    0.91     439   26    424
satimage    4435    0.083544  355      534      8.16    14.05    377   12    371
letter      15,000  0.036588  764      897      22.59   35.13    563   26    549
shuttle     43,500  0.141534  3267     6982     422.04  1058.0   6159  5     6157
adult4      4781    0.41394   1460     1464     21.14   28.86    2002  53    1980
web7        24,692  0.059718  1896     1721     74.51   102.99   1556  140   1475

Table 2: Solving C-SVM and ν-SVM: C = 1000 (Time in Seconds).

Problem     l       ν         C Iter.  ν Iter.  C Time   ν Time   #SV   #FSV  νl
australian  690     0.147234  151,438  117,758  10.98    8.65     222   167   102
diabetes    768     0.421373  216,845  137,941  18.96    11.79    376   102   324
german      1000    0.069128  79,542   81,824   11.24    11.37    509   494   70
heart       270     0.033028  11,933   11,075   0.38     0.35     100   99    9
vehicle     846     0.262569  220,973  190,324  20.07    17.01    284   111   223
satimage    4435    0.015416  44,372   45,323   28.3     28.31    136   106   69
letter      15,000  0.005789  69,052   70,604   141.4    134.14   152   100   87
shuttle     43,500  0.033965  143,273  154,558  1215.8   1468.56  1487  17    1478
adult4      4781    0.263506  359,618  350,818  257.51   244.84   1760  837   1260
web7        24,692  0.023691  187,578  187,170  1262.15  1112.07  1112  696   585

From Schölkopf et al. (2000), we know that νl is a lower bound on the number of support vectors and an upper bound on the number of bounded support vectors (also the number of misclassified training data). It can be clearly seen from Tables 1 and 2 that νl lies between the number of support vectors and the number of bounded support vectors. Furthermore, we can see that as ν becomes smaller, the total number of support vectors decreases. This is consistent with D_C, where increasing C decreases the number of support vectors.

We also observe that although the total number of support vectors decreases as ν becomes smaller, the number of free support vectors increases. When ν is decreased (C is increased), the separating hyperplane tries to fit as many training data as possible. Hence more points (that is, more free αᵢ) tend to lie on the two planes wᵀφ(x) + b = ±ρ. We illustrate this in Figures 2a and 2b, where ν = 0.5 and ν = 0.2, respectively, are used on the same problem.

Figure 2: Training data and separating hyperplanes.

Since the weakest part of the decomposition method is that it cannot consider all variables together in each iteration (only q elements are selected), a larger number of free variables may cause more difficulty. This explains why many more iterations are required when ν is smaller. Therefore, we have given an example in which the decomposition method faces a similar difficulty in solving D_C and D_ν.

6 Discussion and Conclusion

In an earlier version of this article, since we did not know how to design a decomposition method for D_ν, which has two linear constraints, we tried to remove one of them. For C-SVM, Friess, Cristianini, and Campbell (1998) and Mangasarian and Musicant (1999) added b²/2 to the objective function so that the dual does not have the linear constraint yᵀα = 0. We used a similar

approach for P_ν by considering the following new primal problem:

(P̄_ν)  min (1/2)wᵀw + (1/2)b² − νρ + (1/l) Σᵢ₌₁ˡ ξᵢ
       yᵢ(wᵀφ(xᵢ) + b) ≥ ρ − ξᵢ,
       ξᵢ ≥ 0, i = 1, …, l, ρ ≥ 0.   (6.1)

The dual of P̄_ν is:

(D̄_ν)  min (1/2)αᵀ(Q + yyᵀ)α
       eᵀα ≥ ν,
       0 ≤ αᵢ ≤ 1/l, i = 1, …, l.   (6.2)

Similar to theorem 1, we can solve D̄_ν using only the equality eᵀα = ν. Hence the new problem has only one simple equality constraint and can be solved using existing decomposition methods such as SVM^light.

To be more precise, the working set selection becomes:

min ∇f(α_k)ᵀd
    eᵀd = 0, −1 ≤ dᵢ ≤ 1,
    dᵢ ≥ 0 if (α_k)ᵢ = 0, dᵢ ≤ 0 if (α_k)ᵢ = 1/l,
    |{dᵢ | dᵢ ≠ 0}| ≤ q,   (6.3)

where f(α) ≡ (1/2)αᵀ(Q + yyᵀ)α.

Equation 6.3 can be considered a special case of equation 4.2, since the e in eᵀd = 0 is a special case of y. Thus SVM^light's selection procedure can be directly used. An earlier version of LIBSVM implemented this decomposition method for D̄_ν. However, we later found that its performance is much worse than that of the method for D_ν. This can be seen in Tables 3 and 4, which present the same information as Tables 1 and 2 for solving D̄_ν. As the major difference is in the working set selection, we suspect that the performance gap is similar to the situation that occurred for C-SVM: Hsu and Lin (1999) showed that by directly using SVM^light's strategy, the decomposition method for

(D̄_C)  min (1/2)αᵀ(Q + yyᵀ)α − eᵀα
       0 ≤ αᵢ ≤ C, i = 1, …, l,   (6.4)

performs much worse than that for D_C. Note that the relation between D̄_C and D̄_ν is very similar to that of D_C and D_ν presented earlier. Thus we conjecture that there are some common shortcomings of using SVM^light's working

set selection for D̄_C and D̄_ν. Further investigation is needed to understand whether the explanations in Hsu and Lin (1999) also hold for D̄_ν.

Table 3: Solving D̄_ν: A Comparison with Table 1.

Problem     l       ν         ν Iter.  ν Time   #SV   #FSV
australian  690     0.309619  4871     0.64     244   53
diabetes    768     0.574087  1816     0.58     447   13
german      1000    0.556643  1641     1.67     599   87
heart       270     0.43103   527      0.1      130   23
vehicle     846     0.501182  1402     1.04     437   26
satimage    4435    0.083544  3034     15.44    380   16
letter      15,000  0.036588  7200     54.6     562   28
shuttle     43,500  0.141534  17,893   1198.83  6161  8
adult4      4781    0.41394   7500     35.03    2002  54
web7        24,692  0.059718  3109     107.5    1563  149

Table 4: Solving D̄_ν: A Comparison with Table 2.

Problem     l       ν         ν Iter.     ν Time     #SV   #FSV
australian  690     0.147234  597,205     36.06      222   167
diabetes    768     0.421373  1,811,571   132.7      376   102
german      1000    0.069128  504,114     56.33      508   493
heart       270     0.033028  48,581      1.13       100   99
vehicle     846     0.262569  1,626,315   125.51     284   112
satimage    4435    0.015416  919,695     445.42     136   106
letter      15,000  0.005789  1,484,401   2544.23    150   97
shuttle     43,500  0.033965  8,364,010   59,286.83  1487  18
adult4      4781    0.263506  8,155,518   4905.67    1759  842
web7        24,692  0.023691  28,791,608  96,912.82  1245  830

In conclusion, this article discusses the relation between ν-SVM and C-SVM in detail. In particular, we show that solving them is just like solving two different problems with the same optimal solution set. We have also proposed a decomposition method for ν-SVM. Experiments show that this method is competitive with methods for C-SVM. Hence we have demonstrated the practical viability of ν-SVM.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan, grant NSC 89-2213-E-002-013. C.-J. L. thanks Craig Saunders for bringing ν-SVM to his attention and a referee of Lin (2001) whose comments led him to think about the infeasibility of D_ν. We also thank Bernhard Schölkopf and two anonymous referees for helpful comments.
