
Training ν-Support Vector Regression: Theory and Algorithms

Chih-Chung Chang

b4506055@csie.ntu.edu.tw

Chih-Jen Lin

cjlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

We discuss the relation between ε-support vector regression (ε-SVR) and ν-support vector regression (ν-SVR). In particular, we focus on properties that are different from those of C-support vector classification (C-SVC) and ν-support vector classification (ν-SVC). We then discuss some issues that do not occur in the case of classification: the possible range of ε and the scaling of target values. A practical decomposition method for ν-SVR is implemented, and computational experiments are conducted. We show some interesting numerical observations specific to regression.

1 Introduction

The ν-support vector machine (Schölkopf, Smola, Williamson, & Bartlett, 2000; Schölkopf, Smola, & Williamson, 1999) is a new class of support vector machines (SVM). It can handle both classification and regression. Properties of training ν-support vector classifiers (ν-SVC) have been discussed in Chang and Lin (2001b). In this letter, we focus on ν-support vector regression (ν-SVR). Given a set of data points, {(x_1, y_1), . . . , (x_l, y_l)}, such that x_i ∈ R^n is an input and y_i ∈ R^1 is a target output, the primal problem of ν-SVR is as follows:

(P_ν)   min   (1/2) w^T w + C (νε + (1/l) Σ_{i=1}^{l} (ξ_i + ξ_i^*))                    (1.1)
        subject to   (w^T φ(x_i) + b) − y_i ≤ ε + ξ_i,
                     y_i − (w^T φ(x_i) + b) ≤ ε + ξ_i^*,
                     ξ_i, ξ_i^* ≥ 0, i = 1, . . . , l,   ε ≥ 0.

Here, 0 ≤ ν ≤ 1, C is the regularization parameter, and training vectors x_i are mapped into a higher (maybe infinite) dimensional space by the function φ. The ε-insensitive loss function means that if w^T φ(x) is in the range of y ± ε, no loss is considered. This formulation is different from the original ε-SVR (Vapnik, 1998):

(P_ε)   min   (1/2) w^T w + (C/l) Σ_{i=1}^{l} (ξ_i + ξ_i^*)
        subject to   (w^T φ(x_i) + b) − y_i ≤ ε + ξ_i,                                  (1.2)
                     y_i − (w^T φ(x_i) + b) ≤ ε + ξ_i^*,
                     ξ_i, ξ_i^* ≥ 0, i = 1, . . . , l.
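To make the ε-insensitive loss concrete, here is a minimal Python sketch (the function name and the use of NumPy are illustrative choices, not part of the original formulation): the loss is zero whenever the prediction lies inside the tube of width 2ε around the target.

    import numpy as np

    def eps_insensitive_loss(y_true, y_pred, eps):
        # max(0, |prediction - target| - eps): no loss inside the tube y_true +/- eps
        return np.maximum(0.0, np.abs(y_pred - y_true) - eps)

    # A prediction within the tube incurs no loss; one outside is penalized linearly.
    print(eps_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 1.30]), eps=0.1))
    # -> [0.  0.2]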

As it is difficult to select an appropriate ε, Schölkopf et al. (1999) introduced a new parameter ν that lets one control the number of support vectors and training errors. To be more precise, they proved that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. In addition, with probability 1, asymptotically, ν equals both fractions.

Then there are two different dual formulations for P_ν and P_ε:

        min   (1/2) (α − α^*)^T Q (α − α^*) + y^T (α − α^*)
        subject to   e^T (α − α^*) = 0,   e^T (α + α^*) ≤ Cν,                           (1.3)
                     0 ≤ α_i, α_i^* ≤ C/l, i = 1, . . . , l.

        min   (1/2) (α − α^*)^T Q (α − α^*) + y^T (α − α^*) + ε e^T (α + α^*)
        subject to   e^T (α − α^*) = 0,                                                 (1.4)
                     0 ≤ α_i, α_i^* ≤ C/l, i = 1, . . . , l,

where Q_ij ≡ φ(x_i)^T φ(x_j) is the kernel and e is the vector of all ones. Then the approximating function is

        f(x) = Σ_{i=1}^{l} (α_i^* − α_i) φ(x_i)^T φ(x) + b.
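Once the dual variables and b are available, the approximating function can be evaluated directly. The following Python sketch assumes an RBF kernel and hypothetical variable names; it only illustrates the formula f(x) = Σ_i (α_i^* − α_i) K(x_i, x) + b.

    import numpy as np

    def rbf_kernel(u, v, gamma):
        # K(u, v) = exp(-gamma * ||u - v||^2)
        return np.exp(-gamma * np.sum((u - v) ** 2))

    def svr_predict(x, X_train, alpha, alpha_star, b, gamma):
        # f(x) = sum_i (alpha_star_i - alpha_i) K(x_i, x) + b
        k = np.array([rbf_kernel(xi, x, gamma) for xi in X_train])
        return np.dot(alpha_star - alpha, k) + b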

For regression, the parameter ν replaces ε, while in the case of classification, ν replaces C. In Chang and Lin (2001b), we discussed the relation between ν-SVC and C-SVC as well as how to solve ν-SVC in detail. Here, we are interested in different properties for regression. For example, the relation between ν-SVR and ε-SVR is not the same as that between ν-SVC and C-SVC. In addition, similar to the situation of C-SVC, we make sure that the inequality e^T(α + α^*) ≤ Cν can be replaced by an equality so that algorithms for ν-SVC can be applied to ν-SVR. They will be the main topics of sections 2 and 3. In section 4, we discuss the possible range of ε and show that it might be easier to use ν-SVM. We also demonstrate some situations where the scaling of the target values y is needed. Note that these issues do not occur for classification. Section 5 describes the decomposition algorithm implemented here, and section 6 presents computational experiments. We discuss some interesting numerical observations that are specific to support vector regression.

2 The Relation Between ν-SVR and ε-SVR

In this section, we will derive a relationship between the solution sets of ε-SVR and ν-SVR that allows us to conclude that the inequality constraint in equation 1.3 can be replaced by an equality.

In the dual formulations mentioned earlier, e^T(α + α^*) is related to Cν. Similar to Chang and Lin (2001b), we scale them to the following formulations so that e^T(α + α^*) is related to ν:

(D_ν)   min   (1/2) (α − α^*)^T Q (α − α^*) + (y/C)^T (α − α^*)
        subject to   e^T (α − α^*) = 0,   e^T (α + α^*) ≤ ν,                            (2.1)
                     0 ≤ α_i, α_i^* ≤ 1/l, i = 1, . . . , l.

(D_ε)   min   (1/2) (α − α^*)^T Q (α − α^*) + (y/C)^T (α − α^*) + (ε/C) e^T (α + α^*)
        subject to   e^T (α − α^*) = 0,
                     0 ≤ α_i, α_i^* ≤ 1/l, i = 1, . . . , l.

For convenience, following Schölkopf et al. (2000), we represent [α; α^*] as α^(*). Remember that for ν-SVC, not all 0 ≤ ν ≤ 1 lead to meaningful problems of (D_ν). Here, the situation is similar, so in the following we define a ν^*, which will be the upper bound of the interesting interval of ν.

Definition 1. Define ν^* ≡ min_{α^(*)} e^T(α + α^*), where α^(*) is any optimal solution of (D_ε), ε = 0.

Note that 0 ≤ α_i, α_i^* ≤ 1/l implies that the optimal solution set of D_ε or D_ν is bounded. Because their objective and constraint functions are all continuous, any limit point of a sequence in the optimal solution set is in it as well. Hence, the optimal solution set of D_ε or D_ν is closed and bounded (i.e., compact). Using this property, if ε = 0, there is at least one optimal solution that satisfies e^T(α + α^*) = ν^*.

The following lemma shows that for D_ε, ε > 0, at an optimal solution one of α_i and α_i^* must be zero:

Lemma 1. For any optimal solution α^(*) of D_ε, ε > 0, α_i α_i^* = 0, i = 1, . . . , l.

Proof. If the result is wrong, then we can reduce the values of two nonzero α_i and α_i^* such that α − α^* is still the same, but the term e^T(α + α^*) of the objective function is decreased. Hence, α^(*) is not an optimal solution, so there is a contradiction.

The following lemma is similar to Chang and Lin (2001b, lemma 4):

Lemma 2. If α^(*)_1 is any optimal solution of D_{ε_1}, α^(*)_2 is any optimal solution of D_{ε_2}, and 0 ≤ ε_1 < ε_2, then

        e^T(α_1 + α_1^*) ≥ e^T(α_2 + α_2^*).                                            (2.2)

Therefore, for any optimal solution α^(*) of D_ε, ε > 0, e^T(α + α^*) ≤ ν^*.

Unlike the case of classification, where e^T α^C is a well-defined function of C if α^C is an optimal solution of the C-SVC problem, here for ε-SVR, for the same D_ε, there may be different e^T(α + α^*). The main reason is that e^T α is the only linear term of the objective function of C-SVC, but for D_ε, the linear term becomes (y/C)^T(α − α^*) + (ε/C) e^T(α + α^*). We will elaborate more on this in lemma 4, where we prove that e^T(α + α^*) can be a function of ε if Q is a positive definite matrix.

The following lemma shows the relation between D_ν and D_ε. In particular, we show that for 0 ≤ ν < ν^*, any optimal solution of D_ν satisfies e^T(α + α^*) = ν.

Lemma 3. For any D_ν, 0 ≤ ν < ν^*, one of the following two situations must happen:

1. D_ν's optimal solution set is part of the solution set of a D_ε, ε > 0.

2. D_ν's optimal solution set is the same as that of a D_ε, where ε > 0 is any one element in a unique open interval.

In addition, any optimal solution of D_ν satisfies e^T(α + α^*) = ν and α_i α_i^* = 0.

Proof. The Karush-Kuhn-Tucker (KKT) condition of D_ν shows that there exist ρ ≥ 0 and b such that

        [Q −Q; −Q Q][α; α^*] + [y/C; −y/C] + (ρ/C)[e; e] − b[e; −e] = [λ − ξ; λ^* − ξ^*].        (2.3)

If ρ = 0, α^(*) is an optimal solution of D_ε, ε = 0. Then e^T(α + α^*) ≥ ν^* > ν causes a contradiction.


Therefore, ρ > 0, so the KKT condition implies that e^T(α + α^*) = ν. Then there are two possible situations:

Case 1: ρ is unique. By assigning ε = ρ, all optimal solutions of D_ν are KKT points of D_ε. Hence, D_ν's optimal solution set is part of that of a D_ε. This D_ε is unique, as otherwise we can find another ρ that satisfies equation 2.3.

Case 2: ρ is not unique. That is, there are two values ρ_1 < ρ_2. Suppose ρ_1 and ρ_2 are the smallest and largest ones satisfying the KKT condition. Again, the existence of ρ_1 and ρ_2 is based on the compactness of the optimal solution set. Then for any ρ_1 < ρ < ρ_2, we consider the problem D_ε, ε = ρ. Define ε_1 ≡ ρ_1 and ε_2 ≡ ρ_2. From lemma 2, since ε_1 < ε < ε_2,

        ν = e^T(α_1 + α_1^*) ≥ e^T(α + α^*) ≥ e^T(α_2 + α_2^*) = ν,

where α^(*) is any optimal solution of D_ε, and α^(*)_1 and α^(*)_2 are optimal solutions of D_{ε_1} and D_{ε_2}, respectively. Hence, all optimal solutions of D_ε satisfy e^T(α + α^*) = ν and all KKT conditions of D_ν, so D_ε's optimal solution set is in that of D_ν.

Hence, D_ν and D_ε share at least one optimal solution α^(*). For any other optimal solution ᾱ^(*) of D_ν, it is feasible for D_ε. Since

        (1/2)(ᾱ − ᾱ^*)^T Q (ᾱ − ᾱ^*) + (y/C)^T (ᾱ − ᾱ^*) = (1/2)(α − α^*)^T Q (α − α^*) + (y/C)^T (α − α^*)

and e^T(ᾱ + ᾱ^*) ≤ ν, we have

        (1/2)(ᾱ − ᾱ^*)^T Q (ᾱ − ᾱ^*) + (y/C)^T (ᾱ − ᾱ^*) + (ε/C) e^T (ᾱ + ᾱ^*)
            ≤ (1/2)(α − α^*)^T Q (α − α^*) + (y/C)^T (α − α^*) + (ε/C) e^T (α + α^*).

Therefore, all optimal solutions of D_ν are also optimal for D_ε. Hence, D_ν's optimal solution set is the same as that of D_ε, where ε > 0 is any one element in a unique open interval (ρ_1, ρ_2).

Finally, as D_ν's optimal solution set is the same as or part of that of a D_ε, ε > 0, from lemma 1, we have α_i α_i^* = 0.

Using the above results, we now summarize a main theorem:

Theorem 1. We have:

1. ν^* ≤ 1.

2. For any ν ∈ [ν^*, 1], D_ν has the same optimal objective value as D_ε, ε = 0.

3. For any ν ∈ [0, ν^*), lemma 3 holds. That is, one of the following two situations must happen: (a) D_ν's optimal solution set is part of the solution set of a D_ε, ε > 0, or (b) D_ν's optimal solution set is the same as that of a D_ε, where ε > 0 is any one element in a unique open interval.

4. For all D_ν, 0 ≤ ν ≤ 1, there are always optimal solutions that happen at the equality e^T(α + α^*) = ν.

Proof. From the explanation after definition 1, there exists an optimal solution of D_ε, ε = 0, which satisfies e^T(α + α^*) = ν^*. Then this α^(*) must satisfy α_i α_i^* = 0, i = 1, . . . , l, so ν^* ≤ 1. In addition, this α^(*) is also feasible for D_ν, ν ≥ ν^*. Since D_ν has the same objective function as D_ε, ε = 0, but has one more constraint, this solution of D_ε, ε = 0, is also optimal for D_ν. Hence, D_ν and D_{ν^*} have the same optimal objective value.

For 0 ≤ ν < ν^*, we already know from lemma 3 that the optimal solution happens only at the equality. For 1 ≥ ν ≥ ν^*, first we know that D_ε, ε = 0, has an optimal solution α^(*) that satisfies e^T(α + α^*) = ν^*. Then we can increase some elements of α and α^* such that the new vectors α̂^(*) satisfy e^T(α̂ + α̂^*) = ν but α̂ − α̂^* = α − α^*. Hence, α̂^(*) is an optimal solution of D_ν, which satisfies the equality constraint.

Therefore, the above results ensure that it is safe to solve the following problem instead of D_ν:

(D̄_ν)   min   (1/2) (α − α^*)^T Q (α − α^*) + (y/C)^T (α − α^*)
         subject to   e^T (α − α^*) = 0,   e^T (α + α^*) = ν,
                      0 ≤ α_i, α_i^* ≤ 1/l, i = 1, . . . , l.

This result is important because existing SVM algorithms have been able to handle these equalities easily.

Note that for ν-SVC, there is also a ν^* such that for ν ∈ (ν^*, 1], D_ν is infeasible. At that time, ν^* = 2 min(#positive data, #negative data)/l can be easily calculated (Crisp & Burges, 2000). Now for ν-SVR, it is difficult to know ν^* a priori. However, we do not have to worry about this. If a D̄_ν, ν > ν^*, is solved, a solution with its objective value equal to that of D_ε, ε = 0, is obtained. Then some α_i and α_i^* may both be nonzero.

Since there are always optimal solutions of the dual problem that satisfy e^T(α + α^*) = ν, this also implies that the ε ≥ 0 constraint of P_ν is not necessary. In the following theorem, we derive the same result directly from the primal side:

Theorem 2. Consider a problem that is the same as P_ν but without the inequality constraint ε ≥ 0. We have that for any 0 < ν < 1, any optimal solution of P_ν must satisfy ε ≥ 0.


Proof. Assume (w, b, ξ, ξ^*, ε) is an optimal solution with ε < 0. Then for each i,

        −ε − ξ_i^* ≤ w^T φ(x_i) + b − y_i ≤ ε + ξ_i                                     (2.4)

implies

        ξ_i + ξ_i^* + 2ε ≥ 0.                                                            (2.5)

With equation 2.4,

        −0 − max(0, ξ_i^* + ε) ≤ −ε − ξ_i^* ≤ w^T φ(x_i) + b − y_i ≤ ε + ξ_i ≤ 0 + max(0, ξ_i + ε).

Hence (w, b, max(0, ξ + εe), max(0, ξ^* + εe), 0) is a feasible solution of P_ν. From equation 2.5,

        max(0, ξ_i + ε) + max(0, ξ_i^* + ε) ≤ ξ_i + ξ_i^* + ε.

Therefore, with ε < 0 and 0 < ν < 1,

        (1/2) w^T w + C (νε + (1/l) Σ_{i=1}^{l} (ξ_i + ξ_i^*))
            > (1/2) w^T w + (C/l) (lε + Σ_{i=1}^{l} (ξ_i + ξ_i^*))
            ≥ (1/2) w^T w + (C/l) Σ_{i=1}^{l} (max(0, ξ_i + ε) + max(0, ξ_i^* + ε))

implies that (w, b, ξ, ξ^*, ε) is not an optimal solution. Therefore, any optimal solution of P_ν must satisfy ε ≥ 0.

Next we demonstrate an example where one D_ε corresponds to many D_ν. Given two training points, x_1 = 0, x_2 = 0, and target values y_1 = −Δ < 0 and y_2 = Δ > 0, where Δ is a positive constant. When ε = Δ, if the linear kernel is used and C = 1, D_ε becomes

        min   2ε (α_1^* + α_2)
        subject to   α_1 − α_1^* + α_2 − α_2^* = 0,
                     0 ≤ α_1, α_1^*, α_2, α_2^* ≤ 1/l.

Thus, α_1^* = α_2 = 0, so any 0 ≤ α_1 = α_2^* ≤ 1/l is an optimal solution. Therefore, for this ε, the possible e^T(α + α^*) ranges from 0 to 1. The relation between ν and ε is illustrated in Figure 1.


Figure 1: An example where one D_ε corresponds to different D_ν (the figure plots ν against ε; here ν^* = 1).

3 When the Kernel Matrix Q Is Positive Definite

In the previous section, we showed that for ε-SVR, e^T(α + α^*) may not be a well-defined function of ε, where α^(*) is any optimal solution of D_ε. Because of this difficulty, we cannot exactly apply results on the relation between C-SVC and ν-SVC to ε-SVR and ν-SVR. In this section, we show that if Q is positive definite, then e^T(α + α^*) is a function of ε, and all results discussed in Chang and Lin (2001b) hold.

Assumption 1. Q is positive definite.

Lemma 4. If ε > 0, then D_ε has a unique optimal solution. Therefore, we can define a function e^T(α + α^*) of ε, where α^(*) is the optimal solution of D_ε.

Proof. Since D_ε is a convex problem, if α^(*)_1 and α^(*)_2 are both optimal solutions, then for all 0 ≤ λ ≤ 1,

        (1/2)(λ(α_1 − α_1^*) + (1 − λ)(α_2 − α_2^*))^T Q (λ(α_1 − α_1^*) + (1 − λ)(α_2 − α_2^*))
            + (y/C)^T (λ(α_1 − α_1^*) + (1 − λ)(α_2 − α_2^*))
            + (ε/C) e^T (λ(α_1 + α_1^*) + (1 − λ)(α_2 + α_2^*))
        = λ [ (1/2)(α_1 − α_1^*)^T Q (α_1 − α_1^*) + (y/C)^T (α_1 − α_1^*) + (ε/C) e^T (α_1 + α_1^*) ]
            + (1 − λ) [ (1/2)(α_2 − α_2^*)^T Q (α_2 − α_2^*) + (y/C)^T (α_2 − α_2^*) + (ε/C) e^T (α_2 + α_2^*) ].


This implies

        (α_1 − α_1^*)^T Q (α_2 − α_2^*) = (1/2)(α_1 − α_1^*)^T Q (α_1 − α_1^*) + (1/2)(α_2 − α_2^*)^T Q (α_2 − α_2^*).        (3.1)

Since Q is positive semidefinite, Q = L^T L, so equation 3.1 implies ||L(α_1 − α_1^*) − L(α_2 − α_2^*)|| = 0. Hence, L(α_1 − α_1^*) = L(α_2 − α_2^*). Since Q is positive definite, L is invertible, so α_1 − α_1^* = α_2 − α_2^*. Since ε > 0, from lemma 1, (α_1)_i (α_1^*)_i = 0 and (α_2)_i (α_2^*)_i = 0. Thus, we have α_1 = α_2 and α_1^* = α_2^*.

For convex optimization problems, if the Hessian is positive definite, there is a unique optimal solution. Unfortunately, here the Hessian is [Q −Q; −Q Q], which is only positive semidefinite. Hence, special efforts are needed for proving the uniqueness of the optimal solution.

Note that in the above proof, L(α_1 − α_1^*) = L(α_2 − α_2^*) implies (α_1 − α_1^*)^T Q (α_1 − α_1^*) = (α_2 − α_2^*)^T Q (α_2 − α_2^*). Since α^(*)_1 and α^(*)_2 are both optimal solutions, they have the same objective value, so

        (y/C)^T (α_1 − α_1^*) + (ε/C) e^T (α_1 + α_1^*) = (y/C)^T (α_2 − α_2^*) + (ε/C) e^T (α_2 + α_2^*).

This is not enough for proving that e^T(α_1 + α_1^*) = e^T(α_2 + α_2^*). On the contrary, for ν-SVC, the objective function is (1/2) α^T Q α − e^T α, so without the positive definite assumption, α_1^T Q α_1 = α_2^T Q α_2 already implies e^T α_1 = e^T α_2. Thus, e^T α^C is a function of C.

We then state some parallel results in Chang and Lin (2001b) without proofs:

Theorem 3. If Q is positive definite, then the relation between D_ν and D_ε is summarized as follows:

1. (a) For any 1 ≥ ν ≥ ν^*, D_ν has the same optimal objective value as D_ε, ε = 0. (b) For any ν ∈ [0, ν^*), D_ν has a solution that is the same as that of either one D_ε, ε > 0, or some D_ε, where ε is any number in an interval.

2. If α^(*) is the optimal solution of D_ε, ε > 0, the relation between ν and ε is as follows: there are 0 < ε_1 < · · · < ε_s and A_i, B_i, i = 1, . . . , s, such that

        e^T(α + α^*) = ν^*              if 0 < ε ≤ ε_1,
        e^T(α + α^*) = A_i + B_i ε      if ε_i ≤ ε ≤ ε_{i+1}, i = 1, . . . , s − 1,
        e^T(α + α^*) = 0                if ε_s ≤ ε,

where α^(*) is the optimal solution of D_ε. We also have

        A_1 + B_1 ε_1 = ν^*

and

        A_{s−1} + B_{s−1} ε_s = 0.

In addition, B_i ≤ 0, i = 1, . . . , s − 1.

The second result of theorem 3 shows that ν is a piecewise linear function of ε. In addition, it is always decreasing.

4 Some Issues Specific to Regression

The motivation of ν-SVR is that it may not be easy to decide the parameter ε. Hence, here we are interested in the possible range of ε. As expected, results show that ε is related to the target values y.

Theorem 4. The zero vector is an optimal solution of D_ε if and only if

        ε ≥ (max_{i=1,...,l} y_i − min_{i=1,...,l} y_i) / 2.                             (4.1)

Proof. If the zero vector is an optimal solution of D_ε, the KKT condition implies that there is a b such that

        [y; −y] + ε [e; e] − b [e; −e] ≥ 0.

Hence, ε − b ≥ −y_i and ε + b ≥ y_i for all i. Therefore,

        ε − b ≥ −min_{i=1,...,l} y_i   and   ε + b ≥ max_{i=1,...,l} y_i,

so

        ε ≥ (max_{i=1,...,l} y_i − min_{i=1,...,l} y_i) / 2.

On the other hand, if equation 4.1 is true, we can easily check that α = α^* = 0 satisfies the KKT condition, so the zero vector is an optimal solution of D_ε.

Therefore, when using ε-SVR, the largest value of ε to try is (max_{i=1,...,l} y_i − min_{i=1,...,l} y_i)/2.

On the other hand, ε should not be too small; if ε → 0, most data are support vectors, and overfitting tends to happen. Unfortunately, we have not been able to find an effective lower bound on ε. However, intuitively we would think that it is also related to the target values y.
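A minimal Python sketch of the upper bound given by theorem 4 (the function name is a hypothetical choice): any ε at or above half the spread of the target values makes the zero vector optimal, so there is no point in trying larger values.

    import numpy as np

    def eps_upper_bound(y):
        # Largest epsilon worth trying: (max y - min y) / 2, from theorem 4.
        y = np.asarray(y, dtype=float)
        return 0.5 * (y.max() - y.min())

    print(eps_upper_bound([3.0, -1.0, 2.5]))  # -> 2.0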


As the effective range of ε is affected by the target values y, one way to resolve this difficulty for ε-SVM is to scale the target values before training. For example, if all target values are scaled to [−1, +1], then the effective range of ε will be [0, 1], the same as that of ν. Then it may be easier to choose ε.

There are other reasons to scale the target values. For example, we encountered some situations where if the target values y are not properly scaled, it is difficult to adjust the value of C. In particular, if y_i, i = 1, . . . , l, are large numbers and C is chosen to be a small number, the approximating function is nearly a constant.
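A sketch of such a preprocessing step in Python (the helper names are hypothetical; any equivalent linear scaling would do). The inverse map is returned so that predictions can be reported on the original scale.

    import numpy as np

    def scale_targets(y):
        # Linearly map y to [-1, +1]; assumes y is not constant.
        y = np.asarray(y, dtype=float)
        y_min, y_max = y.min(), y.max()
        scale = (y_max - y_min) / 2.0
        mid = (y_max + y_min) / 2.0
        unscale = lambda z: z * scale + mid   # maps predictions back to the original range
        return (y - mid) / scale, unscale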

5 Algorithms

The algorithm considered here for ν-SVR is similar to the decomposition method in Chang and Lin (2001b) for ν-SVC. The implementation is part of the software LIBSVM (Chang & Lin, 2001a). Another SVM software package that has also implemented ν-SVR is mySVM (Rüping, 2000).
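For readers who want to reproduce this kind of comparison with a current LIBSVM release, the following Python sketch uses LIBSVM's documented command-line options (-s 3 for ε-SVR, -s 4 for ν-SVR, -t 2 for the RBF kernel); the data file name and the particular parameter values are only illustrative, and depending on the installation the import path may be libsvm.svmutil or svmutil.

    # Train nu-SVR and epsilon-SVR on the same data with LIBSVM's Python interface.
    from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

    y, x = svm_read_problem('housing.scale')                # hypothetical LIBSVM-format file
    model_nu = svm_train(y, x, '-s 4 -t 2 -c 1 -n 0.4')     # nu-SVR, RBF kernel, C = 1, nu = 0.4
    model_eps = svm_train(y, x, '-s 3 -t 2 -c 1 -p 0.09')   # epsilon-SVR with an illustrative epsilon
    labels, stats, values = svm_predict(y, x, model_nu)     # stats includes MSE and squared correlation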

The basic idea of the decomposition method is that in each iteration, the indices {1, . . . , l} of the training set are separated into two sets B and N, where B is the working set and N = {1, . . . , l}\B. The vector α_N is fixed, and then a subproblem with the variable α_B is solved.
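In outline, and independently of the particular working set selection, the decomposition idea can be written as the following Python skeleton; select_working_set and solve_subproblem are placeholders for the rules discussed below, not LIBSVM routines, so this is only a sketch.

    import numpy as np

    def decomposition_method(Q_bar, p_bar, alpha_bar, select_working_set, solve_subproblem, tol=1e-3):
        # Minimize 0.5 * a'Qa + p'a by repeatedly optimizing over a small working set B,
        # keeping all variables outside B fixed.
        grad = Q_bar @ alpha_bar + p_bar                          # gradient of the objective
        while True:
            B, violation = select_working_set(alpha_bar, grad)
            if violation < tol:                                   # approximate KKT condition reached
                return alpha_bar
            old = alpha_bar[B].copy()
            alpha_bar[B] = solve_subproblem(alpha_bar, grad, B)   # analytic when |B| = 2 (SMO)
            grad += Q_bar[:, B] @ (alpha_bar[B] - old)            # cheap gradient update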

The decomposition method was first proposed for SVM classification (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998). Extensions to ε-SVR are in, for example, Keerthi, Shevade, Bhattacharyya, and Murthy (2000) and Laskov (2002). The main difference among these methods is their working set selections, which may significantly affect the number of iterations. Due to the additional equality constraint derived from equation 1.3 in the ν-SVM, more considerations on the working set selection are needed. (Discussions on classification are in Keerthi & Gilbert, 2002, and Chang & Lin, 2001b.)

For consistency with other SVM formulations in LIBSVM, we consider D_ν in the following scaled form:

        min_{ᾱ}   (1/2) ᾱ^T Q̄ ᾱ + p̄^T ᾱ
        subject to   ȳ^T ᾱ = Δ_1,                                                        (5.1)
                     ē^T ᾱ = Δ_2,
                     0 ≤ ᾱ_t ≤ C, t = 1, . . . , 2l,

where

        Q̄ = [Q −Q; −Q Q],   ᾱ = [α; α^*],   p̄ = [y; −y],   ȳ = [e; −e],   ē = [e; e],   Δ_1 = 0,   Δ_2 = Clν.


That is, we replace C/l by C. Note that because of the result in theorem 1, we are safe to use an equality constraint here in equation 5.1.

Then the subproblem is as follows:

        min_{ᾱ_B}   (1/2) ᾱ_B^T Q̄_BB ᾱ_B + (p̄_B + Q̄_BN ᾱ_N^k)^T ᾱ_B
        subject to   ȳ_B^T ᾱ_B = Δ_1 − ȳ_N^T ᾱ_N,                                        (5.2)
                     ē_B^T ᾱ_B = Δ_2 − ē_N^T ᾱ_N,
                     0 ≤ (ᾱ_B)_t ≤ C, t = 1, . . . , q,

where q is the size of the working set.

Following the idea of sequential minimal optimization (SMO) by Platt (1998), we use only two elements as the working set in each iteration. The main advantage is that an analytic solution of equation 5.2 can be obtained, so there is no need to use an optimization package.

Our working set selection follows Chang and Lin (2001b), which is a modification of the selection in the software SVMlight (Joachims, 1998). Since they dealt with the case of more general selections where the size is not restricted to two, here we have a simpler derivation directly using the KKT condition. It is similar to that in Keerthi and Gilbert (2002, section 5).

Now if only two elements i and j are selected but ȳ_i ≠ ȳ_j, then ȳ_B^T ᾱ_B = Δ_1 − ȳ_N^T ᾱ_N and ē_B^T ᾱ_B = Δ_2 − ē_N^T ᾱ_N imply that there are two equations with two variables, so in general equation 5.2 has only one feasible point. Therefore, starting from ᾱ^k, the solution of the kth iteration, it cannot be moved any more. On the other hand, if ȳ_i = ȳ_j, then ȳ_B^T ᾱ_B = Δ_1 − ȳ_N^T ᾱ_N and ē_B^T ᾱ_B = Δ_2 − ē_N^T ᾱ_N become the same equality, so there are multiple feasible solutions. Therefore, we have to keep ȳ_i = ȳ_j while selecting the working set.

The KKT condition of equation 5.1 shows that there are ρ and b such that

        ∇f(ᾱ)_i − ρ + b ȳ_i  = 0   if 0 < ᾱ_i < C,
                             ≥ 0   if ᾱ_i = 0,
                             ≤ 0   if ᾱ_i = C.

Define

        r_1 ≡ ρ − b,   r_2 ≡ ρ + b.

If ȳ_i = 1, the KKT condition becomes

        ∇f(ᾱ)_i − r_1  ≥ 0   if ᾱ_i < C,                                                 (5.3)
                       ≤ 0   if ᾱ_i > 0.


On the other hand, if ȳ_i = −1, it is

        ∇f(ᾱ)_i − r_2  ≥ 0   if ᾱ_i < C,                                                 (5.4)
                       ≤ 0   if ᾱ_i > 0.

Hence, indices i and j are selected from either

        i = argmin_t {∇f(ᾱ)_t | ȳ_t = 1, ᾱ_t < C},
        j = argmax_t {∇f(ᾱ)_t | ȳ_t = 1, ᾱ_t > 0},                                       (5.5)

or

        i = argmin_t {∇f(ᾱ)_t | ȳ_t = −1, ᾱ_t < C},
        j = argmax_t {∇f(ᾱ)_t | ȳ_t = −1, ᾱ_t > 0},                                      (5.6)

depending on which one gives a larger ∇f(ᾱ)_j − ∇f(ᾱ)_i (i.e., a larger KKT violation). If the selected ∇f(ᾱ)_j − ∇f(ᾱ)_i is smaller than a given tolerance (10^−3 in our experiments), the algorithm stops.
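A direct transcription of rules 5.5 and 5.6 into Python might look as follows (a sketch, not the LIBSVM source): within each of the two groups ȳ_t = 1 and ȳ_t = −1, take the index with the smallest gradient among variables that can still increase and the one with the largest gradient among variables that can still decrease, and keep the pair with the larger violation.

    import numpy as np

    def select_working_set(grad, y_bar, alpha_bar, C, tol=1e-3):
        # Returns (i, j) according to rules 5.5/5.6, or None if the KKT violation is below tol.
        best_i, best_j, best_violation = None, None, -np.inf
        for s in (+1, -1):                           # the two groups: y_bar_t = +1 and y_bar_t = -1
            group = np.where(y_bar == s)[0]
            up = group[alpha_bar[group] < C]         # variables that can still increase
            down = group[alpha_bar[group] > 0]       # variables that can still decrease
            if up.size == 0 or down.size == 0:
                continue
            i = up[np.argmin(grad[up])]
            j = down[np.argmax(grad[down])]
            if grad[j] - grad[i] > best_violation:
                best_i, best_j, best_violation = i, j, grad[j] - grad[i]
        return None if best_violation < tol else (best_i, best_j)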

Similar to the case of ν-SVC, here the zero vector cannot be the initial solution. This is due to the additional equality constraint ē^T ᾱ = Δ_2 of equation 5.1. Here we assign both the initial α and α^* the same values: the first ⌊νl/2⌋ elements are C, the next element is C(νl/2 − ⌊νl/2⌋), and the others are zero.
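A sketch of this initialization in Python (the function name is hypothetical); both α and α^* are filled the same way, so that ē^T ᾱ = Clν holds exactly.

    import numpy as np

    def initial_alpha(l, nu, C):
        # First floor(nu*l/2) entries equal C, the next entry takes the fractional
        # remainder C*(nu*l/2 - floor(nu*l/2)), the rest are zero.
        alpha = np.zeros(l)
        half = nu * l / 2.0
        k = int(np.floor(half))
        alpha[:k] = C
        if k < l:
            alpha[k] = C * (half - k)
        return alpha, alpha.copy()       # alpha and alpha_star start with the same values

    a, a_star = initial_alpha(l=10, nu=0.5, C=1.0)
    assert abs(a.sum() + a_star.sum() - 1.0 * 10 * 0.5) < 1e-12   # e_bar^T alpha_bar = C*l*nu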

It has been proved that if the decomposition method of LIBSVM is used for solving D_ε, ε > 0, then α_i α_i^* = 0 always holds during iterations (Lin, 2001, theorem 4.1). Now for ν-SVR we do not have this property, as α_i and α_i^* may both be nonzero during iterations.

Next, we discuss how to find ν^*. We claim that if Q is positive definite and (α, α^*) is any optimal solution of D_ε, ε = 0, then

        ν^* = Σ_{i=1}^{l} |α_i − α_i^*|.

Note that by defining β ≡ α − α^*, D_ε, ε = 0, is equivalent to

        min   (1/2) β^T Q β + (y/C)^T β
        subject to   e^T β = 0,
                     −1/l ≤ β_i ≤ 1/l, i = 1, . . . , l.

When Q is positive definite, this is a strictly convex programming problem, so there is a unique optimal solution β. That is, we have a unique α − α^*, but there may be multiple optimal (α, α^*). With the conditions 0 ≤ α_i, α_i^* ≤ 1/l, the smallest possible α_i + α_i^* with a fixed α_i − α_i^* is obtained by reducing α_i and α_i^* until one of them becomes zero; then α_i + α_i^* = |α_i − α_i^*|. In the next section, we will use the RBF kernel, so if no data points are the same, Q is positive definite.
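Given any optimal (α, α^*) of D_ε with ε = 0, for example from a general quadratic programming solver, the claim above yields ν^* in one line; a minimal Python sketch with hypothetical argument names:

    import numpy as np

    def nu_star(alpha, alpha_star):
        # nu* = sum_i |alpha_i - alpha_star_i|, valid when Q is positive definite and
        # (alpha, alpha_star) is optimal for D_epsilon with epsilon = 0.
        return float(np.abs(np.asarray(alpha) - np.asarray(alpha_star)).sum())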


6 Experiments

In this section we demonstrate some numerical comparisons between ν-SVR and ε-SVR. We test the RBF kernel with

        Q_ij = e^{−||x_i − x_j||^2 / n},

where n is the number of attributes of the training data.
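The kernel matrix used in the experiments can be computed as in the following sketch (hypothetical function name); setting the kernel width to 1/n matches the definition above, with n the number of attributes.

    import numpy as np

    def rbf_kernel_matrix(X):
        # Q_ij = exp(-||x_i - x_j||^2 / n), n = number of attributes.
        X = np.asarray(X, dtype=float)
        n = X.shape[1]
        sq = np.sum(X ** 2, axis=1)
        dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # pairwise squared distances
        return np.exp(-dist2 / n)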

The computational experiments for this section were done on a Pentium III-500 with 256 MB RAM using the gcc compiler. Our implementation is part of the software LIBSVM, which includes both ν-SVR and ε-SVR using the decomposition method. We used 100 MB as the cache size of LIBSVM for storing recently used Q_ij. The shrinking heuristic in LIBSVM is turned off for an easier comparison.

We test problems from various collections. Problems housing, abalone, mpg, pyrimidines, and triazines are from the Statlog collection (Michie, Spiegelhalter, & Taylor, 1994). From StatLib (http://lib.stat.cmu.edu/datasets), we select bodyfat, space ga, and cadata. Problem cpusmall is from the Delve archive, which collects data for evaluating learning in valid experiments (http://www.cs.toronto.edu/~delve). Problem mg is a Mackey-Glass time series where we use the same settings as the experiments in Flake and Lawrence (2002). Thus, we predict 85 time steps in the future with six inputs. For these problems, some data entries have missing attributes, so we remove them before conducting experiments. Both the target and attribute values of these problems are scaled to [−1, +1]. Hence, the effective range of ε is [0, 1].

For each problem, we first solve its D_ν form using ν = 0.2, 0.4, 0.6, and 0.8. Then we solve D_ε with ε = ρ, the value obtained from the KKT condition of D_ν (the ε column of the tables), for comparison. Tables 1 and 2 present the number of training data (l), the number of iterations, and the training time using C = 1 and C = 100, respectively. In the last column, we also list the value of ν^* for each problem.

From both tables, we have the following observations:

1. Following theoretical results, we see that as ν increases, its corresponding ε decreases.

2. If ν ≤ ν^*, as ν increases, the number of iterations of ν-SVR and of its corresponding ε-SVR increases. Note that the case ν ≤ ν^* covers all results in Table 1 and most of Table 2. Our explanation is as follows: when ν is larger, there are more support vectors, so during iterations, the number of nonzero variables is also larger. Hsu and Lin (2002) pointed out that if during iterations there are more nonzero variables than at the optimum, the decomposition method will take many iterations to reach the final face. Here, a face means the subspace obtained by considering only the free variables. An example is in Figure 2, where we plot the number of free variables during iterations against the number of iterations. To be more precise, the y-axis is the number of 0 < α_i < C and 0 < α_i^* < C, where (α, α^*) is the solution at one iteration. We can see that for solving ε-SVR or ν-SVR, it takes a lot of iterations to identify the optimal face. From the aspect of ε-SVR, we can consider (ε/C) e^T(α + α^*) as a penalty term in the objective function of D_ε.


Table 1: Solving ν-SVR and ε-SVR: C = 1 (time in seconds).

Problem       l       ν     ε          ν Iter.   ε Iter.   ν Time    ε Time    ν*
pyrimidines   74      0.2   0.135131   181       145       0.03      0.02      0.817868
                      0.4   0.064666   175       156       0.03      0.04
                      0.6   0.028517   365       331       0.04      0.03
                      0.8   0.002164   695       460       0.05      0.05
mpg           392     0.2   0.152014   988       862       0.19      0.16      0.961858
                      0.4   0.090124   1753      1444      0.32      0.27
                      0.6   0.048543   2115      1847      0.40      0.34
                      0.8   0.020783   3046      2595      0.56      0.51
bodyfat       252     0.2   0.012700   1112      1047      0.14      0.13      0.899957
                      0.4   0.006332   2318      2117      0.25      0.23
                      0.6   0.002898   3553      2857      0.37      0.31
                      0.8   0.001088   4966      3819      0.48      0.42
housing       506     0.2   0.161529   799       1231      0.30      0.34      0.946593
                      0.4   0.089703   1693      1650      0.53      0.45
                      0.6   0.046269   1759      2002      0.63      0.60
                      0.8   0.018860   2700      2082      0.85      0.65
triazines     186     0.2   0.380308   175       116       0.13      0.10      0.900243
                      0.4   0.194967   483       325       0.18      0.15
                      0.6   0.096720   422       427       0.20      0.18
                      0.8   0.033753   532       513       0.23      0.23
mg            1385    0.2   0.366606   1928      1542      1.58      1.18      0.992017
                      0.4   0.216329   3268      3294      2.75      2.35
                      0.6   0.124792   3400      3300      3.36      2.76
                      0.8   0.059115   4516      4296      4.24      3.65
abalone       4177    0.2   0.168812   4189      3713      15.68     11.69     0.994775
                      0.4   0.094959   8257      7113      30.38     22.88
                      0.6   0.055966   12,483    12,984    42.74     37.41
                      0.8   0.026165   18,302    18,277    65.98     54.04
space ga      3107    0.2   0.087070   5020      4403      10.47     7.56      0.990468
                      0.4   0.053287   8969      7731      18.70     14.44
                      0.6   0.032080   12,261    10,704    26.27     20.72
                      0.8   0.014410   16,311    13,852    32.71     27.19
cpusmall      8192    0.2   0.086285   8028      7422      82.66     59.14     0.990877
                      0.4   0.054095   16,585    15,240    203.20    120.48
                      0.6   0.031285   22,376    19,126    283.71    163.96
                      0.8   0.013842   28,262    24,840    355.59    213.25
cadata        20,640  0.2   0.294803   12,153    10,961    575.11    294.53    0.997099
                      0.4   0.168370   24,614    20,968    1096.87   574.77
                      0.6   0.097434   35,161    30,477    1530.01   851.91
                      0.8   0.044636   42,709    40,652    1883.35   1142.27

Hence, when ε is larger, fewer α_i, α_i^* are nonzero. That is, the number of support vectors is smaller.

3. There are a few problems (e.g., pyrimidines, bodyfat, and triazines) where ν ≥ ν^* is encountered. When this happens, their ε should be zero, but due to numerical inaccuracy, the output ε values are only small positive numbers. Then for different ν ≥ ν^*, when solving their corresponding D_ε, the number of iterations is about the same, as we essentially solve the same problem: D_ε with ε = 0.


Table 2: Solving ν-SVR and ε-SVR: C = 100 (time in seconds).

Problem       l       ν      ε          ν Iter.     ε Iter.     ν Time      ε Time      ν*
pyrimidines   74      0.2*   0.000554   29,758      11,978      0.63        0.27        0.191361
                      0.4*   0.000317   30,772      11,724      0.65        0.27
                      0.6*   0.000240   27,270      11,802      0.58        0.27
                      0.8*   0.000146   20,251      12,014      0.44        0.28
mpg           392     0.2    0.121366   85,120      74,878      9.53        8.26        0.876646
                      0.4    0.069775   210,710     167,719     24.50       19.32
                      0.6    0.032716   347,777     292,426     42.08       34.82
                      0.8    0.007953   383,164     332,725     47.61       40.86
bodyfat       252     0.2    0.001848   238,927     164,218     16.80       11.58       0.368736
                      0.4*   0.000486   711,157     323,016     50.77       23.24
                      0.6*   0.000291   644,602     339,569     46.23       24.33
                      0.8*   0.000131   517,370     356,316     37.28       25.55
housing       506     0.2    0.092998   154,565     108,220     24.21       16.87       0.815085
                      0.4    0.051726   186,136     182,889     30.49       29.51
                      0.6    0.026340   285,354     271,278     48.62       45.64
                      0.8    0.002161   397,115     284,253     69.16       49.12
triazines     186     0.2    0.193718   16,607      22,651      0.94        1.20        0.582147
                      0.4    0.074474   34,034      47,205      1.89        2.52
                      0.6*   0.000381   106,621     51,175      5.69        2.84
                      0.8*   0.000139   68,553      50,786      3.73        2.81
mg            1385    0.2    0.325659   190,065     195,519     87.99       89.20       0.966793
                      0.4    0.189377   291,315     299,541     139.10      141.73
                      0.6    0.107324   397,449     407,159     194.81      196.14
                      0.8    0.043439   486,656     543,520     241.20      265.27
abalone       4177    0.2    0.162593   465,922     343,594     797.48      588.92      0.988298
                      0.4    0.091815   901,275     829,951     1577.83     1449.37
                      0.6    0.053244   1,212,669   1,356,556   2193.97     2506.52
                      0.8    0.024670   1,680,704   1,632,597   2970.98     2987.30
space ga      3107    0.2    0.078294   510,035     444,455     595.42      508.41      0.984568
                      0.4    0.048643   846,873     738,805     1011.82     867.32
                      0.6    0.028933   1,097,732   1,054,464   1362.67     1268.40
                      0.8    0.013855   1,374,987   1,393,044   1778.38     1751.39
cpusmall      8192    0.2    0.070568   977,374     863,579     4304.42     3606.35     0.978351
                      0.4    0.041640   1,783,725   1,652,396   8291.12     7014.32
                      0.6    0.022280   2,553,150   2,363,251   11,673.62   10,691.95
                      0.8    0.009616   3,085,005   2,912,838   14,784.05   12,737.35
cadata        20,640  0.2    0.263428   1,085,719   1,081,038   16,003.55   15,475.36   0.995602
                      0.4    0.151341   2,135,097   2,167,643   31,936.05   31,474.21
                      0.6    0.087921   2,813,070   2,614,179   42,983.89   38,580.61
                      0.8    0.039595   3,599,953   3,379,580   54,917.10   49,754.27

Note: * marks experiments where ν ≥ ν^*.

Figure 2: Iterations and number of free variables (ν ≤ ν^*).

On the other hand, surprisingly, we see that at this time, as ν increases, it is easier to solve D_ν, with fewer iterations. Now, its solution is also optimal for D_ε, ε = 0. Contrary to the general case ν ≤ ν^*, where it is difficult to identify free variables during iterations and move them back to bounds at the optimum, here there is no strong need to do so. To be more precise, in the beginning of the decomposition method, many variables become nonzero as we try to modify them for minimizing the objective function. If, finally, most of these variables are still nonzero, we do not need the effort to put them back to bounds. In Figure 3, we plot the number of free variables against the number of iterations using problem triazines with ν = 0.6, 0.8, and ε = 0.000139 ≈ 0.


It can be clearly seen that for large ν, the decomposition method identifies the optimal face more quickly, so the total number of iterations is smaller.

4. When ν ≤ ν^*, we observe that there are minor differences in the number of iterations for ε-SVR and ν-SVR. In Table 1, for nearly all problems, ν-SVR takes a few more iterations than ε-SVR. However, in Table 2, for problems triazines and mg, ν-SVR is slightly faster. Note that there are several dissimilarities between the algorithms for ν-SVR and ε-SVR. For example, ε-SVR generally starts from the zero vector, but ν-SVR has to use a nonzero initial solution. For the working set selection, the two indices selected for ε-SVR can be any α_i or α_i^*, but the two equality constraints lead to the selections 5.5 and 5.6 for ν-SVR, where the set is from either {α_1, . . . , α_l} or {α_1^*, . . . , α_l^*}. Furthermore, as the stopping tolerance 10^−3 might be too loose in some cases, the ε obtained after solving D_ν may be a little different from the theoretical value. Hence, we actually solve two problems with slightly different optimal solution sets. All of these factors may contribute to the distinction in iterations.

5. We see that it is much harder to solve problems using C = 100 than using C = 1. The difference is even more dramatic than in the case of classification. We do not have a good explanation for this observation.

7 Conclusion

We have shown that the inequality in the ν-SVR formulation can be treated as an equality. Hence, algorithms similar to those for ν-SVC can be applied to ν-SVR. In addition, in section 6, we showed similarities and dissimilarities in the numerical properties of ε-SVR and ν-SVR. We think that in the future, the relation between C and ν (or C and ε) should be investigated in more detail. The model selection of these parameters is also an important issue.

References

Chang, C.-C., & Lin, C.-J. (2001a). LIBSVM: A library for support vector machines [Computer software]. Available on-line: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, C.-C., & Lin, C.-J. (2001b). Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9), 2119–2147.

Crisp, D. J., & Burges, C. J. C. (2000). A geometric interpretation of ν-SVM classifiers. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.

Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46, 271–290.

Hsu, C.-W., & Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.


Keerthi, S. S., & Gilbert, E. G. (2002). Convergence of a generalized SMO algo-rithm for SVM classifier design. Machine Learning, 46, 351–360.

Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2000). Improvements to SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5), 1188–1193.

Laskov, P. (2002). An improved decomposition algorithm for regression support vector machines. Machine Learning, 46, 315–350.

Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 1288–1298.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. Available on-line at anonymous ftp: ftp.ncc.up.pt/pub/statlog/.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97. New York: IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Rüping, S. (2000). mySVM—another one of those support vector machines [Computer software]. Available on-line: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

Schölkopf, B., Smola, A. J., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
