Manuscript Number: 2833

Leave-one-out Bounds for Support Vector Regression Model Selection

Ming-Wei Chang and Chih-Jen Lin
Department of Computer Science and Information Engineering
National Taiwan University
Taipei 106, Taiwan
cjlin@csie.ntu.edu.tw

Abstract
Minimizing bounds of leave-one-out (loo) errors is an important and efficient approach for support vector machine (SVM) model selection. Past research focuses on their use for classification but not regression. In this article, we derive various loo bounds for support vector regression (SVR) and discuss how they differ from those for classification. Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection. We also discuss the differentiability of loo bounds.
1 Introduction
Recently, support vector machines (Boser, Guyon, and Vapnik 1992; Cortes and Vapnik 1995) have been a promising tool for classification and regression. Their success depends on the tuning of several parameters that affect the generalization error. A popular approach is to approximate the error by a bound that is a function of the parameters. Then, we search for parameters so that this bound is minimized. Past efforts focus on such bounds for classification, and the aim of this paper is to derive bounds for regression.
We first briefly introduce support vector regression (SVR). Given training vectors x_i ∈ R^n, i = 1, . . . , l, and a vector y ∈ R^l as their target values, SVR solves

\[
\min_{w,b,\xi,\xi^*} \quad \tfrac{1}{2}w^Tw + \tfrac{C}{2}\sum_{i=1}^{l}\xi_i^2 + \tfrac{C}{2}\sum_{i=1}^{l}(\xi_i^*)^2 \tag{1.1}
\]
\[
\text{subject to}\quad -\epsilon-\xi_i^* \le w^T\phi(x_i)+b-y_i \le \epsilon+\xi_i,\quad i=1,\dots,l.
\]
Data are mapped to a higher dimensional space by the function φ, and an ε-insensitive loss function is used. We refer to this form as L2-SVR, as a two-norm penalty term ξ_i² + (ξ_i*)² is used. As w may be a huge vector variable after introducing the mapping function φ, in practice we solve the dual problem:

\[
\min_{\alpha,\alpha^*} \quad \tfrac{1}{2}(\alpha-\alpha^*)^T\tilde K(\alpha-\alpha^*) + \epsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i-\alpha_i^*) \tag{1.2}
\]
\[
\text{subject to}\quad \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0,\quad 0\le\alpha_i,\alpha_i^*,\ i=1,\dots,l,
\]
where K(x_i, x_j) = φ(x_i)ᵀφ(x_j) is the kernel function, K̃ = K + I/C, and I is the identity matrix. For the optimal w and (α, α*), the primal-dual relationship shows

\[
w = \sum_{i=1}^{l}(\alpha_i^*-\alpha_i)\phi(x_i),
\]

so the approximate function is

\[
f(x) = w^T\phi(x)+b = -\sum_{i=1}^{l}K(x,x_i)(\alpha_i-\alpha_i^*) + b. \tag{1.3}
\]
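As a concrete illustration of (1.3), the following sketch evaluates the approximate function from given dual variables. The data, the RBF kernel choice, and all numeric values are hypothetical; a real implementation would obtain α, α*, and b from a dual solver such as LIBSVM.

```python
import numpy as np

def rbf_kernel(x, z, sigma2=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), the kernel used later in (5.1)
    d = x - z
    return np.exp(-np.dot(d, d) / (2.0 * sigma2))

def svr_predict(x, X, alpha, alpha_star, b, kernel=rbf_kernel):
    # Eq. (1.3): f(x) = -sum_i K(x, x_i) (alpha_i - alpha_i^*) + b
    k = np.array([kernel(x, xi) for xi in X])
    return -k @ (alpha - alpha_star) + b

# hypothetical dual variables for three training points
X = np.array([[0.0], [1.0], [2.0]])
alpha      = np.array([0.3, 0.0, 0.1])
alpha_star = np.array([0.0, 0.2, 0.0])
f1 = svr_predict(np.array([1.0]), X, alpha, alpha_star, b=0.5)
```

Note that the toy values above satisfy α_iα_i* = 0, as the KKT conditions of (1.2) require.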
More general information about SVR can be found in the tutorial by Smola and Schölkopf (2004).
One difficulty over classification for parameter selection is that SVR possesses an additional parameter ε. Therefore, the search space of parameters is bigger than that for classification. Several works have tried to address SVR parameter selection. In (Momma and Bennett 2002), the authors perform model selection by pattern search, so the number of parameters checked is smaller than that of a full grid search. In (Kwok 2001) and (Smola, Murata, Schölkopf, and Müller 1998), the authors analyze the behavior of ε and conclude that the optimal ε scales linearly with the input noise of the training data. However, this property can be applied only when the noise is known. Gao, Gunn, Harris, and Brown (2002) derive a Bayesian framework for SVR, which leads to minimizing a function of parameters. However, its performance is not very good compared to a full grid search (Lin and Weng 2004). An improvement is in (Chu, Keerthi, and Ong 2003), which modifies the standard SVR formulation. This improved Bayesian SVR will be compared to our approach in this article.
This paper is organized as follows. Section 2 briefly reviews loo bounds for support vector classification. We derive various loo bounds for SVR in Section 3. Implementation issues are in Section 4 and experiments are in Section 5. Conclusions are in Section 6.
2 Leave-one-out Bounds for Classification: a Review
Given training vectors x_i ∈ R^n, i = 1, . . . , l, and a vector y ∈ R^l such that y_i ∈ {1, −1}, an SVM formulation for two-class classification is:

\[
\min_{w,b,\xi} \quad \tfrac{1}{2}w^Tw + \tfrac{C}{2}\sum_{i=1}^{l}\xi_i^2 \tag{2.1}
\]
\[
\text{subject to}\quad y_i(w^T\phi(x_i)+b) \ge 1-\xi_i,\quad \xi_i \ge 0,\ i=1,\dots,l.
\]
Next, we briefly review two loo bounds.
2.1 Radius Margin (RM) Bound for Classification
By defining

\[
\tilde w \equiv \begin{bmatrix} w \\ \sqrt{C}\,\xi \end{bmatrix}, \tag{2.2}
\]

Vapnik and Chapelle (2000) have shown that the following radius margin (RM) bound holds:

\[
\text{loo} \le 4\tilde R^2\|\tilde w\|^2 = 4\tilde R^2 e^T\alpha, \tag{2.3}
\]

where loo is the number of loo errors and e is a vector of all ones. In (2.3), α is the solution of the following dual problem:

\[
\max_\alpha \quad e^T\alpha - \tfrac{1}{2}\alpha^T\Big(Q+\tfrac{I}{C}\Big)\alpha \tag{2.4}
\]
\[
\text{subject to}\quad 0\le\alpha_i,\ i=1,\dots,l,\qquad y^T\alpha=0,
\]

where Q_ij = y_iy_jK(x_i, x_j). At the optimum, ‖w̃‖² = eᵀα. Define
˜ φ(xi)≡ φ(xi) ei √ C ,
where ei is a zero vector of length l except the ith component is one. Then ˜R in (2.3)
is the radius of the smallest sphere containing all ˜φ(xi), i = 1, . . . , l.
The right-hand side of (2.3) is a function of the parameters, which will then be minimized for parameter selection.
2.2 Span Bound for Classification
The span bound, another loo bound proposed in (Vapnik and Chapelle 2000), is tighter than the RM bound. Define S_t² as the optimal objective value of the following problem:

\[
\min_\lambda \quad \Big\|\tilde\phi(x_t) - \sum_{i\in F\setminus\{t\}}\lambda_i\tilde\phi(x_i)\Big\|^2 \tag{2.5}
\]
\[
\text{subject to}\quad \sum_{i\in F\setminus\{t\}}\lambda_i = 1,
\]

where F = {i | α_i > 0} is the index set of free components of an optimal α of (2.4).
Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, the span bound is:

\[
\sum_{t=1}^{l}\alpha_t S_t^2. \tag{2.6}
\]

(2.5) indicates that S_t is smaller than 2R̃, the diameter of the smallest sphere containing all φ̃(x_i). Thus, (2.6) is tighter than (2.3).
Unfortunately, S_t² is not a continuous function (Chapelle, Vapnik, Bousquet, and Mukherjee 2002), so a modified span bound is proposed:

\[
\sum_{t=1}^{l}\alpha_t \tilde S_t^2, \tag{2.7}
\]

where S̃_t² is the optimal objective value of

\[
\min_\lambda \quad \Big\|\tilde\phi(x_t) - \sum_{i\in F\setminus\{t\}}\lambda_i\tilde\phi(x_i)\Big\|^2 + \eta\sum_{i\in F\setminus\{t\}}\lambda_i^2\frac{1}{\alpha_i} \tag{2.8}
\]
\[
\text{subject to}\quad \sum_{i\in F\setminus\{t\}}\lambda_i = 1.
\]
η is a positive parameter that controls the smoothness of the bound. From (2.5) and (2.8), S_t² ≤ S̃_t², so (2.7) is also an loo bound. Define D as an l × l diagonal matrix where D_ii = η/α_i and D_ij = 0 for i ≠ j. Define a new kernel matrix K̃ with K̃(x_i, x_j) = φ̃(x_i)ᵀφ̃(x_j). We let

\[
\tilde D = \begin{bmatrix} D_{FF} & 0_F \\ 0_F^T & 0 \end{bmatrix}
\quad\text{and}\quad
M = \begin{bmatrix} \tilde K_{FF} & e_F \\ e_F^T & 0 \end{bmatrix}, \tag{2.9}
\]
where K̃_FF is the sub-matrix of K̃ corresponding to free support vectors and e_F (0_F) is a vector of |F| ones (zeros). By defining

\[
\tilde M = M + \tilde D, \tag{2.10}
\]

Chapelle, Vapnik, Bousquet, and Mukherjee (2002) showed that

\[
\tilde S_t^2 = \tilde K(x_t,x_t) - h^T(\tilde M^t)^{-1}h = \frac{1}{(\tilde M^{-1})_{tt}} - \tilde D_{tt}, \tag{2.11}
\]

where M̃^t is the sub-matrix of M̃ with the tth column and row removed and h is the tth column of M̃ excluding M̃_tt.
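The two expressions in (2.11) are equivalent by a Schur-complement identity, which can be checked numerically. The sketch below assumes M̃ = M + D̃; the random positive definite matrix and the α values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nF = 5                                   # number of free support vectors
A = rng.standard_normal((nF, nF))
K_FF = A @ A.T + nF * np.eye(nF)         # a random positive definite K~_FF
alpha = rng.uniform(0.1, 1.0, nF)        # positive alphas of free SVs
eta = 0.1
D = np.diag(eta / alpha)                 # D_ii = eta / alpha_i

# M = [[K~_FF, e], [e^T, 0]] and M~ = M + diag(D, 0), as in (2.9)
e = np.ones((nF, 1))
M = np.block([[K_FF, e], [e.T, np.zeros((1, 1))]])
Mtilde = M + np.block([[D, np.zeros((nF, 1))],
                       [np.zeros((1, nF)), np.zeros((1, 1))]])

Minv = np.linalg.inv(Mtilde)
for t in range(nF):
    # left expression of (2.11): K~(x_t, x_t) - h^T (M~^t)^{-1} h
    idx = [i for i in range(nF + 1) if i != t]
    h = Mtilde[idx, t]                   # t-th column of M~ without M~_tt
    Mt_minor = Mtilde[np.ix_(idx, idx)]  # M~ with row/column t removed
    left = K_FF[t, t] - h @ np.linalg.solve(Mt_minor, h)
    # right expression of (2.11): 1 / (M~^{-1})_tt - D~_tt
    right = 1.0 / Minv[t, t] - D[t, t]
    assert abs(left - right) < 1e-8
```

The right-hand form is cheaper in practice: one inversion of M̃ serves all t, instead of one bordered solve per t.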
Note that Chapelle, Vapnik, Bousquet, and Mukherjee (2002) did not give a formal proof of the continuity of (2.7). We address this issue in Section 4.1.
3 Leave-one-out Bounds for Regression
First, the Karush-Kuhn-Tucker (KKT) optimality conditions of (1.2) are listed here for further analysis: a vector (α, α*) is optimal for (1.2) if and only if it satisfies the constraints of (1.2) and there is a scalar b such that

\[
\begin{aligned}
-(\tilde K(\alpha-\alpha^*))_i + b &= y_i + \epsilon, &&\text{if } \alpha_i > 0,\\
-(\tilde K(\alpha-\alpha^*))_i + b &= y_i - \epsilon, &&\text{if } \alpha_i^* > 0,\\
y_i - \epsilon \le -(\tilde K(\alpha-\alpha^*))_i + b &\le y_i + \epsilon, &&\text{if } \alpha_i = \alpha_i^* = 0,
\end{aligned} \tag{3.1}
\]

where (K̃(α − α*))_i is the ith element of K̃(α − α*). From (3.1), α_iα_i* = 0 when the KKT conditions hold. General discussion of KKT conditions can be found in optimization books (e.g., (Bazaraa, Sherali, and Shetty 1993)).
3.1 Radius Margin Bound for Regression
To study the loo error for SVR, we introduce the leave-one-out problem without the tth data point:

\[
\min_{w^t,b^t,\xi^t,\xi^{t*}} \quad \tfrac{1}{2}(w^t)^T w^t + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^t)^2 + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^{t*})^2 \tag{3.2}
\]
\[
\text{subject to}\quad -\epsilon-\xi_i^{t*} \le (w^t)^T\phi(x_i)+b^t-y_i \le \epsilon+\xi_i^t,\quad i=1,\dots,t-1,t+1,\dots,l.
\]

Though ξ^t and ξ^{t*} are vectors with l − 1 elements, we define ξ^t_t = ξ^{t*}_t = 0 to make them have l elements. The approximate function of (3.2) is f^t(x) = (w^t)ᵀφ(x) + b^t, so the loo error for SVR is defined as

\[
\text{loo} \equiv \sum_{t=1}^{l}|f^t(x_t)-y_t|. \tag{3.3}
\]
The loo error is well defined if the approximate function is unique. Note that though w (and w^t) is unique due to the strictly convex term wᵀw (and (w^t)ᵀw^t), multiple b (or b^t) are possible (see, for example, the discussion in (Lin 2001a)). Therefore, we make the following assumption:

Assumption 1. Problem (1.2) and the dual problems of (3.2) all have free support vectors.

We say a dual SVR has free support vectors if (α, α*) is optimal and there are some i such that α_i > 0 or α_i* > 0 (i.e., α_i + α_i* > 0, as α_iα_i* = 0). Under this assumption, (1.3) and (3.1) imply wᵀφ(x_i) + b = y_i + ε if α_i > 0 (or = y_i − ε if α_i* > 0). As the optimal w is unique, so is b. Similarly, b^t is unique as well. We then introduce a useful lemma:
Lemma 1. Under Assumption 1,

1. If α_t > 0, f^t(x_t) ≥ y_t.
2. If α_t* > 0, f^t(x_t) ≤ y_t.

The proof is in Appendix A. If α_t > 0, the KKT condition (3.1) implies that f(x_t) is also larger than or equal to y_t. Thus, this lemma reveals the relative position of f^t(x_t) to f(x_t) and y_t.
The next lemma gives an error bound on each individual leave-one-out test.

Lemma 2. Under Assumption 1,

1. If α_t = α_t* = 0, |f^t(x_t) − y_t| = |f(x_t) − y_t| ≤ ε.
2. If α_t > 0, f^t(x_t) − y_t ≤ 4R̃²α_t + ε.
3. If α_t* > 0, y_t − f^t(x_t) ≤ 4R̃²α_t* + ε.

The proof is in Appendix B. When α_t > 0, |f^t(x_t) − y_t| = f^t(x_t) − y_t from Lemma 1. It then follows from Lemma 2 that |f^t(x_t) − y_t| ≤ 4R̃²α_t + ε. After extending this argument to the cases of α_t* > 0 and α_t = α_t* = 0, we obtain the following theorem:
Theorem 1. Under Assumption 1, the leave-one-out error (3.3) is bounded by

\[
4\tilde R^2 e^T(\alpha+\alpha^*) + \epsilon l. \tag{3.4}
\]

We now discuss the difference between proving bounds for classification and for regression. In classification, the RM bound (2.3) comes from the following derivation: if the tth training instance is wrongly classified during the loo procedure, then

\[
1 \le 4\alpha_t\tilde R^2, \tag{3.5}
\]

where α_t is the tth element of the optimal solution of (2.4). If the instance is correctly classified, its loo error is zero and still smaller than 4α_tR̃². Therefore, 4R̃²eᵀα is larger than the number of loo errors. For regression, on the other hand, there are no "wrongly classified" data, so we use Lemma 2 instead of (3.5), and Lemma 1 is required.
3.2 Span Bound for L2-SVR
Similar to the above discussion, we can obtain the span bound for L2-SVR:

Theorem 2. Under the same assumptions as Theorem 1 and the assumption that the set of support vectors remains the same during the loo procedure, the loo error of L2-SVR is bounded by

\[
\sum_{t=1}^{l}(\alpha_t+\alpha_t^*)S_t^2 + \epsilon l, \tag{3.6}
\]

where S_t² is the optimal objective value of (2.5) with F replaced by {i | α_i + α_i* > 0}.

The proof is in Appendix C. As in the classification case, (3.6) may not be continuous, so we propose a modification similar to (2.7):

\[
\sum_{t=1}^{l}(\alpha_t+\alpha_t^*)\tilde S_t^2 + \epsilon l. \tag{3.7}
\]

S̃_t² is the optimal objective value of

\[
\min_\lambda \quad \Big\|\tilde\phi(x_t)-\sum_{i\in F\setminus\{t\}}\lambda_i\tilde\phi(x_i)\Big\|^2 + \eta\sum_{i\in F\setminus\{t\}}\lambda_i^2\frac{1}{\alpha_i+\alpha_i^*} \tag{3.8}
\]
\[
\text{subject to}\quad \sum_{i\in F\setminus\{t\}}\lambda_i=1,
\]

where η is a positive parameter and F = {i | α_i + α_i* > 0}. The calculation of S̃_t² is similar to that in (2.11).
3.3 LOO Bounds for L1-SVR
L1-SVR is another commonly used form of regression. It considers the following objective function

\[
\min_{w,b,\xi,\xi^*} \quad \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i + C\sum_{i=1}^{l}\xi_i^* \tag{3.9}
\]

under the constraints of (1.1) and nonnegativity constraints on ξ, ξ*: ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, . . . , l. The name "L1" comes from the linear loss function. The dual problem is

\[
\min_{\alpha,\alpha^*} \quad \tfrac{1}{2}(\alpha-\alpha^*)^TK(\alpha-\alpha^*) + \epsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i-\alpha_i^*) \tag{3.10}
\]
\[
\text{subject to}\quad \sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0,\quad 0\le\alpha_i,\alpha_i^*\le C,\ i=1,\dots,l.
\]
The two main differences between (1.2) and (3.10) are that K̃ is replaced by K and that α_i, α_i* are upper-bounded by C. To derive loo bounds, we still require Assumption 1. With some modifications in the proof (details in Appendix D.1), Lemma 1 still holds. For Lemma 2, the results are different, as now α_t, α_t* ≤ C and ξ_t plays a role:
Lemma 3. Under Assumption 1,

1. If α_t = α_t* = 0, |f^t(x_t) − y_t| = |f(x_t) − y_t| ≤ ε.
2. If α_t > 0, f^t(x_t) − y_t ≤ 4R²α_t + ξ_t + ξ_t* + ε.
3. If α_t* > 0, y_t − f^t(x_t) ≤ 4R²α_t* + ξ_t + ξ_t* + ε.

The proof is in Appendix D.2. Note that R is now the radius of the smallest sphere containing all φ(x_i), i = 1, . . . , l. Using Lemmas 1 and 3, the bound is

\[
4R^2e^T(\alpha+\alpha^*) + e^T(\xi+\xi^*) + \epsilon l.
\]
Regarding the span bound, the proof of Theorem 2 still holds. However, S_t² is redefined as the optimal objective value of the following problem:

\[
\min_\lambda \quad \Big\|\phi(x_t)-\sum_{i\in F\setminus\{t\}}\lambda_i\phi(x_i)\Big\|^2 \tag{3.11}
\]
\[
\text{subject to}\quad \sum_{i\in F\setminus\{t\}}\lambda_i=1,
\]

where F = {i | 0 < α_i + α_i* < C}. Then an loo bound is

\[
\sum_{t=1}^{l}(\alpha_t+\alpha_t^*)S_t^2 + \sum_{t=1}^{l}(\xi_t+\xi_t^*) + \epsilon l. \tag{3.12}
\]
4 Implementation Issues
In the rest of this paper, we consider only loo bounds using L2-SVR.
4.1 Continuity and Differentiability
To use the bounds, α and α* must be well-defined functions of the parameters. That is, we need the uniqueness of the optimal dual solution. As K̃ contains the term I/C and hence is positive definite, α and α* are unique (Chang and Lin 2002, Lemma 4).
To discuss continuity and differentiability, we make an assumption about the kernel function:
Assumption 2. The kernel function is differentiable with respect to the parameters.
For continuity, we have seen that the span bound is not continuous, but the others are:

Theorem 3.

1. (α, α*) and R̃² are continuous, and so is the radius margin bound.
2. The modified span bound (3.7) is continuous.
The proof is in Appendix E. To minimize leave-one-out bounds, differentiability is important, as we may have to calculate the gradient. Unfortunately, loo bounds for L2-SVR are not differentiable. An example for the radius margin bound is in Appendix F. This situation is different from classification, where the radius margin bound for L2-SVM is differentiable (see more discussion in (Chung, Kao, Sun, Wang, and Lin 2003)). However, we may still use gradient-based methods, as gradients exist almost everywhere.

Theorem 4. The radius margin and modified span bounds are differentiable almost everywhere. If, around a given parameter set, the zero and non-zero elements of (α, α*) remain the same, then the bounds are differentiable at this parameter set.

The proof is in Appendix G. The above discussion applies to bounds for classification as well. For differentiable points, we calculate gradients in Section 4.2.
4.2 Gradient Calculation
To have the gradient of the loo bounds, we need the gradients of α + α*, R̃², and S̃_t².

4.2.1 Gradient of α + α*
Define α̂_F ≡ α*_F − α_F and recall the definition of M in (2.9). For free support vectors, the KKT optimality conditions (3.1) imply that

\[
M\begin{bmatrix}\hat\alpha_F\\ b\end{bmatrix} = \begin{bmatrix}p\\ 0\end{bmatrix},
\quad\text{where}\quad
p_i = \begin{cases} y_i-\epsilon & \text{if } \hat\alpha_i>0,\\ y_i+\epsilon & \text{if } \hat\alpha_i<0.\end{cases}
\]

We have

\[
\frac{\partial(\alpha_i+\alpha_i^*)}{\partial\theta} = z_i\frac{\partial\hat\alpha_i}{\partial\theta},
\quad\text{where}\quad
z_i = \begin{cases} 1 & \text{if } \hat\alpha_i>0,\\ -1 & \text{if } \hat\alpha_i<0.\end{cases} \tag{4.1}
\]
Except for ε, all other parameters relate to M but not to p, so for any such parameter θ,

\[
M\begin{bmatrix}\frac{\partial\hat\alpha_F}{\partial\theta}\\ \frac{\partial b}{\partial\theta}\end{bmatrix}
+ \frac{\partial M}{\partial\theta}\begin{bmatrix}\hat\alpha_F\\ b\end{bmatrix} = 0.
\]

Thus,

\[
\begin{bmatrix}\frac{\partial\hat\alpha_F}{\partial\theta}\\ \frac{\partial b}{\partial\theta}\end{bmatrix}
= -M^{-1}\begin{bmatrix}\frac{\partial\tilde K}{\partial\theta} & 0_F\\ 0_F^T & 0\end{bmatrix}\begin{bmatrix}\hat\alpha_F\\ b\end{bmatrix}
= -M^{-1}\begin{bmatrix}\frac{\partial\tilde K}{\partial\theta}\hat\alpha_F\\ 0\end{bmatrix}. \tag{4.2}
\]

If θ is ε,

\[
\begin{bmatrix}\frac{\partial\hat\alpha_F}{\partial\epsilon}\\ \frac{\partial b}{\partial\epsilon}\end{bmatrix}
= M^{-1}\begin{bmatrix}\frac{\partial p}{\partial\epsilon}\\ 0\end{bmatrix},
\quad\text{where}\quad
\frac{\partial p_i}{\partial\epsilon} = \begin{cases}-1 & \text{if } \hat\alpha_i>0,\\ 1 & \text{if } \hat\alpha_i<0.\end{cases}
\]

4.2.2 Gradient of S̃_t²
S̃_t² is defined as in (2.11), but with D_tt = η/(α_t + α_t*) for t ∈ F. Using (2.11),

\[
\frac{\partial\tilde S_t^2}{\partial\theta}
= -\frac{1}{(\tilde M^{-1})_{tt}^2}\frac{\partial(\tilde M^{-1})_{tt}}{\partial\theta}
+ \frac{\eta}{(\alpha_t+\alpha_t^*)^2}\frac{\partial(\alpha_t+\alpha_t^*)}{\partial\theta}.
\]

Note that

\[
\frac{\partial(\tilde M^{-1})_{tt}}{\partial\theta}
= \Big(\frac{\partial\tilde M^{-1}}{\partial\theta}\Big)_{tt}
= -\Big(\tilde M^{-1}\frac{\partial\tilde M}{\partial\theta}\tilde M^{-1}\Big)_{tt}, \tag{4.3}
\]

and ∂(α_t + α_t*)/∂θ can be obtained using (4.1) and (4.2). Furthermore, in (4.3),

\[
\frac{\partial\tilde M}{\partial\theta}
= \begin{bmatrix}\frac{\partial\tilde K}{\partial\theta} + \frac{\partial\tilde D}{\partial\theta} & 0_F\\ 0_F^T & 0\end{bmatrix},
\quad\text{where}\quad
\Big(\frac{\partial\tilde D}{\partial\theta}\Big)_{ii}
= -\frac{\eta}{(\alpha_i+\alpha_i^*)^2}\frac{\partial(\alpha_i+\alpha_i^*)}{\partial\theta},\quad i\in F.
\]

4.2.3 Gradient of R̃²
R̃² is the optimal objective value of

\[
\max_\beta \quad \sum_{i=1}^{l}\beta_i\tilde K(x_i,x_i) - \beta^T\tilde K\beta \tag{4.4}
\]
\[
\text{subject to}\quad 0\le\beta_i,\ i=1,\dots,l,\qquad \sum_{i=1}^{l}\beta_i=1
\]

(see, for example, (Vapnik 1998)). From (Bonnans and Shapiro 1998), it is differentiable and the gradient is

\[
\frac{\partial\tilde R^2}{\partial\theta}
= \sum_{i=1}^{l}\beta_i\frac{\partial\tilde K(x_i,x_i)}{\partial\theta} - \beta^T\frac{\partial\tilde K}{\partial\theta}\beta.
\]
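A finite-difference check of the linear-system gradient (4.2) can be sketched as follows. The parameterization K̃(θ) = K₀ + θG is hypothetical and serves only to create a differentiable bordered system M(θ); a real implementation would plug in the actual ∂K̃/∂θ of the kernel in use.

```python
import numpy as np

rng = np.random.default_rng(1)
nF = 4
A = rng.standard_normal((nF, nF))
K0 = A @ A.T + nF * np.eye(nF)                         # base kernel matrix (positive definite)
G = rng.standard_normal((nF, nF)); G = (G + G.T) / 2   # hypothetical dK~/dtheta (symmetric)
p = rng.standard_normal(nF)                            # right-hand side p (fixed; theta is not epsilon)
e = np.ones((nF, 1))

def solve(theta):
    # build M(theta) = [[K~(theta), e], [e^T, 0]] and solve M v = (p, 0)
    Kt = K0 + theta * G
    M = np.block([[Kt, e], [e.T, np.zeros((1, 1))]])
    return M, np.linalg.solve(M, np.append(p, 0.0))    # v = (alpha_hat_F, b)

theta = 0.3
M, v = solve(theta)
# analytic gradient, Eq. (4.2): dv/dtheta = -M^{-1} [ (dK~/dtheta) alpha_hat_F ; 0 ]
grad = -np.linalg.solve(M, np.append(G @ v[:nF], 0.0))
# central finite-difference check
h = 1e-6
_, vp = solve(theta + h)
_, vm = solve(theta - h)
fd = (vp - vm) / (2 * h)
assert np.allclose(grad, fd, atol=1e-4)
```

The same pattern (differentiate the linear system, reuse one factorization of M) applies to the ε case and to (4.3).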
5 Experiments
In this section, different parameter selection methods, including the proposed bounds, are compared. We consider the same real data sets used in (Lin and Weng 2004); some statistics are in Table 1. To have a reliable comparison, for each data set we randomly produce 30 training/testing splits. Each training set consists of 4/5 of the data and the remaining instances are for testing. Parameter selection with each method is applied to every training file. We then report the average and standard deviation of the 30 mean squared errors (MSE) on predicting the test sets. A method with lower MSE is better.
We compare the two proposed bounds with three other parameter selection methods:

1. RM (L2-SVR): the radius margin bound (3.4).
2. MSP (L2-SVR): the modified span bound (3.7).
Table 1: Data statistics: n is the number of features and l is the number of data instances.

Problem      n      l
pyrim       27     74
triazines   60    186
mpg          7    392
housing     13    566
add10       10  1,000
cpusmall    12  1,000
spacega      6  1,000
abalone      8  1,000
3. CV (L2-SVR): five-fold cross validation of L2-SVR on a discrete grid of parameters (details in Section 5.1).
4. CV (L1-SVR): the same as the previous method, but L1-SVR is considered.
5. BSVR: a Bayesian framework which improves the smoothness of the evidence function using a modified SVR (Chu, Keerthi, and Ong 2003).

All methods except BSVR use the Radial Basis Function (RBF) kernel
\[
K(x_i,x_j) = e^{-\|x_i-x_j\|^2/(2\sigma^2)}, \tag{5.1}
\]

where σ² is the kernel parameter. BSVR implements an extension of the RBF kernel:

\[
K(x_i,x_j) = \kappa_0\, e^{-\|x_i-x_j\|^2/(2\sigma^2)} + \kappa_b, \tag{5.2}
\]

where κ_0 and κ_b are two additional kernel parameters. Both kernels satisfy Assumption 2 on differentiability.
Implementation details and experimental results are in the following subsections.
5.1 Implementations of Various Model Selection Methods
RM and MSP are differentiable almost everywhere, so we implement a quasi-Newton method, a gradient-based optimization method, to minimize them. The parameter η in the modified span bound (2.7) is set to 0.1. Section 5.3 will discuss the impact of using different η. Following most earlier work on minimizing loo bounds, we consider parameters in the log scale: (ln C, ln σ², ln ε). Thus, if f is the function of parameters, the gradient is calculated by

\[
\frac{\partial f}{\partial\ln\theta} = \theta\frac{\partial f}{\partial\theta},
\]
Table 2: Mean and standard deviation of 30 MSEs (using 30 training/testing splits).

            RM              MSP             L2 CV           L1 CV           BSVR
Problem     MEAN   STD      MEAN   STD      MEAN   STD      MEAN   STD      MEAN   STD
pyrim       0.015  0.010    0.007  0.008    0.007  0.007    0.007  0.007    0.007  0.008
triazines   0.042  0.005    0.021  0.006    0.021  0.005    0.023  0.008    0.021  0.007
mpg         8.156  1.598    7.045  1.682    7.122  1.809    7.146  1.924    6.894  1.856
housing     23.14  7.774    9.191  2.733    9.318  2.957    11.26  5.014    10.40  3.950
add10       6.491  1.675    1.945  0.254    1.820  0.182    1.996  0.194    2.298  0.256
cpusmall    34.02  13.33    14.57  4.692    14.73  3.754    15.63  5.344    16.17  5.740
spacega     0.037  0.007    0.013  0.001    0.012  0.001    0.013  0.001    0.014  0.001
abalone     10.69  1.551    5.071  0.678    5.088  0.646    5.247  0.806    5.514  0.912
and the formulas in Section 4.2. Suppose θ is the parameter vector to be determined. The quasi-Newton method is an iterative procedure to minimize f(θ). If k is the index of the loop, the kth iteration for updating θ_k to θ_{k+1} is as follows:

1. Compute a search direction p = −H_k∇f(θ_k).
2. Find θ_{k+1} = θ_k + λp using a line search to ensure sufficient decrease.
3. Obtain H_{k+1} by

\[
H_{k+1} = \begin{cases}
\Big(I-\dfrac{st^T}{t^Ts}\Big)H_k\Big(I-\dfrac{ts^T}{t^Ts}\Big) + \dfrac{ss^T}{t^Ts} & \text{if } t^Ts>0,\\[1ex]
H_k & \text{otherwise},
\end{cases} \tag{5.3}
\]

where s = θ_{k+1} − θ_k and t = ∇f(θ_{k+1}) − ∇f(θ_k).
Here, H_k serves as the inverse of an approximate Hessian of f and is set to be the identity matrix in the first iteration. The sufficient decrease required by the line search usually means

\[
f(\theta_k+\lambda p) \le f(\theta_k) + \sigma_1\lambda\nabla f(\theta_k)^Tp, \tag{5.4}
\]

where 0 < σ_1 < 1 is a positive constant. We find the largest value λ in a set {γ^i | i = 0, 1, . . .} such that (5.4) holds (γ = 1/2 is used in this paper). We confine the search to a fixed region, so each parameter θ_i is associated with a lower bound l_i and an upper bound u_i. If in the quasi-Newton method θ_i^k + λp_i is not in [l_i, u_i], it is projected to the interval. For ln C and ln σ², we set l_i = −8 and u_i = 8. For ln ε, we also set l_i = −8, but u_i = −1. We could not use a too large u_i, as a too large ε may cause all data to be in the ε-insensitive tube and hence α = α* = 0. Then Assumption 1 does not hold and the loo bound may not be valid. More discussion on the use of the quasi-Newton method is in (Chung, Kao, Sun, Wang, and Lin 2003, Section 5).
Table 3: Average (ln C, ln σ², ln ε) of 30 runs.

            RM                    MSP                   L2 CV                 L1 CV
Problem     ln C  ln σ²  ln ε    ln C  ln σ²  ln ε    ln C  ln σ²  ln ε    ln C  ln σ²  ln ε
pyrim        4.0   1.0   -2.6     1.0   2.9   -8.0     6.4   3.1   -7.3     2.1   3.4   -6.6
triazines   -1.4  -0.6   -1.8    -0.8   3.0   -8.0     3.5   4.4   -7.2     0.6   4.0   -4.7
mpg         -0.7   0.4   -1.0     4.7   1.1   -7.1     3.6   0.6   -6.6     5.6   0.4   -1.7
housing     -1.5   1.1   -1.0     5.7   1.2   -6.9     6.6   1.6   -6.1     7.5   1.4   -1.7
add10        0.9   4.6   -1.0     8.0   3.3   -8.0     8.0   3.0   -7.8     7.9   2.9   -1.2
cpusmall    -1.1  -0.5   -1.0     7.9   1.6   -5.6     7.9   2.4   -7.0     8.0   1.7   -2.1
spacega     -6.5  -6.2   -1.7     7.4   3.1   -8.0     7.3   0.0   -6.8     6.0   1.1   -4.6
abalone     -8.0   7.0   -1.0     3.7   0.6   -8.0     4.1   0.9   -6.8     6.8   1.2   -2.1
The initial point of the quasi-Newton method is (ln C, ln σ², ln ε) = (0, 0, −3). The minimization procedure stops when

\[
\|\nabla f(\theta_k)\| < (1+f(\theta_k))\times 10^{-5}
\quad\text{or}\quad
\frac{f(\theta_{k-1})-f(\theta_k)}{f(\theta_{k-1})} < 10^{-5} \tag{5.5}
\]

happens. Each function (gradient) evaluation involves solving an SVR and is the main computational bottleneck. We use LIBSVM (Chang and Lin 2001) as the underlying solver.
For CV, we try 2,312 parameter sets with (ln C, ln σ², ln ε) ∈ [−8, −7, . . . , 8] × [−8, −7, . . . , 8] × [−8, −7, . . . , −1]. Similar to the case of using loo bounds, here we avoid considering too large ε. The parameter set with the lowest five-fold CV error is used to train the model for testing.
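The size of this grid is easy to verify: 17 values of ln C, 17 of ln σ², and 8 of ln ε give 2,312 combinations.

```python
import itertools

ln_C      = range(-8, 9)    # -8, -7, ..., 8  (17 values)
ln_sigma2 = range(-8, 9)    # 17 values
ln_eps    = range(-8, 0)    # -8, ..., -1     (8 values)

grid = list(itertools.product(ln_C, ln_sigma2, ln_eps))
assert len(grid) == 17 * 17 * 8 == 2312
```

Each of the 2,312 triples requires a full five-fold CV, i.e., five SVR trainings, which explains the large CV times in Table 4.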
For BSVR, we directly use the authors’ gradient-based implementation with the same stopping condition (5.5). Note that their evidence function is not differentiable either.
5.2 Experimental Results
Table 2 presents the mean and standard deviation of the 30 MSEs. CV (L1- and L2-SVR), MSP, and BSVR are similar, but RM is worse. In classification, Chapelle, Vapnik,
Table 4: Computational time (in seconds) for parameter selection.

Problem        RM      MSP    L2 CV     L1 CV
pyrim          0.6      0.3     84.3     99.27
triazines      2.5      5.8    310.8    440.0
mpg           10.7     63.0   1536.7    991.59
housing       23.3    159.4   2368.9   1651.5
add10        160.6    957.1   6940.2   5312.7
cpusmall     204.7    931.8  10073.7   6087.0
spacega       53.6    771.9   4246.9  15514.42
abalone       68.0   1030.4  12905.0   7523.3
Table 5: Number of function and gradient evaluations of the quasi-Newton method (average of 30 runs).

            RM             MSP
Problem     FEV    GEV     FEV    GEV
pyrim       34.7   13.5    32.2   13.5
triazines   29.8   10.4    30.0   13.2
mpg         38.2    5.0    21.6   11.7
housing     40.5    5.0    33.2   13.2
add10       44.0    4.96   27.8    5.4
cpusmall    43.9    5.0    35.6    6.66
spacega     28.6    9.96   30.1    9.93
abalone     19.0    3.0    19.9   10.3
Bousquet, and Mukherjee (2002) showed that the radius margin and span bounds perform similarly. From our experiments on the radius margin bound, parameters are more sensitive in regression than in classification. One possible reason is that the loo error for SVR is a continuous rather than a discrete measurement.
The good performance of BSVR indicates that its Bayesian evidence function is accurate, but the use of a more general kernel function may also help. On the other hand, even though MSP uses only the RBF kernel, it is competitive with BSVR. Note that as CV is conducted on a discrete set of parameters, sometimes its MSE is slightly worse than that of MSP or BSVR, which consider parameters in a continuous space.

Table 3 presents the parameters obtained by the different approaches. We do not give those of BSVR, as it considers more than three parameters. Clearly, these methods obtain quite distinct parameters even though they (except RM) give similar testing errors. This observation indicates that good parameters lie in a quite wide region. In other words, SVM is sensitive to parameters but not too sensitive. Moreover, the different regions that RM and MSP lead to also cause the two approaches to have quite different running times. More details are in Table 4 and the discussion below.
To see the performance gain of the bound-based methods, we compare the computational time in Table 4. The experiments were done on a Pentium IV 2.8 GHz computer running the Linux operating system. We did not compare the running time of BSVR, as its code is available only on MS Windows. Clearly, using bounds saves a significant amount of time.
For RM and MSP, the quasi-Newton implementation requires far fewer SVR trainings than CV. Table 5 lists the average number of function and gradient evaluations of the quasi-Newton method. Note that the number of function evaluations is the same as the number of SVRs solved. From Table 4, MSP is slower than RM though they have similar numbers of function/gradient evaluations. As they do not land in the same parameter region, their respective SVR training times differ. In other words, the individual SVR training time here is related to the parameters. MSP leads to a good region with smaller testing errors, but training SVRs with parameters in this region takes more time.
From Table 4, the computational time of CV using L1- and L2-SVR is not close. As they have different formulations (e.g., the penalty C in L1-SVR versus C/2 in L2-SVR), we do not expect them to be very similar.
5.3 Discussion
The smoothing parameter η of the modified span bound was simply set to 0.1 in the experiments. It is important to check how η affects the performance of the bound. Figure 1 presents the relation between η and the test error. From (2.8), a large η causes the modified bound to move away from the original one; thus, the performance is worse, as shown in Figure 1. However, if η is reasonably small, the performance is quite stable. Therefore, the selection of η is not difficult.
It is also interesting to investigate how tight the proposed bounds are in practice. Figure 2 compares the different bounds and the loo value. We select the best σ² and ε from CV and show the values of the bounds and loo while varying C. Clearly, the span bound is a good approximation of loo, but RM is not when C is large. This situation has
Figure 1: Effect of η on testing errors: using the first training/testing split of the problems add10, mpg, housing, and cpusmall.
happened in classification (Chung, Kao, Sun, Wang, and Lin 2003). The reason is that S_t can be much smaller than 2R̃ under some parameters. Recall that R̃ is the radius of the smallest sphere containing φ̃(x_i), i = 1, . . . , l, so R̃ is large if there are two far away points. However, the span bound finds a combination of x_i, i ∈ F\{t}, that is as close to x_t as possible.
6 Conclusions
In this article, we derive loo bounds for SVR and discuss their properties. Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection. A future study is to apply the proposed bounds to feature selection. We also would like to implement non-smooth optimization techniques, as the bounds here are not really differentiable. An implementation considering L1-SVR is also interesting.
Experiments demonstrate that minimizing the proposed bound is more efficient than cross validation on a discrete set of parameters. For a model with more than two parameters, a grid search is time consuming, so a gradient-based method may be more suitable.
Figure 2: Loo, the radius margin bound (RM), and the modified span bound (MSP): σ² and ε are fixed using the best parameters from CV (L2-SVR). The training file is spacega.
Acknowledgment
The authors would like to thank Olivier Chapelle for many helpful comments.
References
Bazaraa, M. S., H. D. Sherali, and C. M. Shetty (1993). Nonlinear Programming: Theory and Algorithms (Second ed.). Wiley.
Bonnans, J. F. and A. Shapiro (1998). Optimization problems with perturbations: a guided tour. SIAM Review 40 (2), 228–264.
Boser, B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory.
Chang, C.-C. and C.-J. Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chang, C.-C. and C.-J. Lin (2002). Training ν-support vector regression: Theory and algorithms. Neural Computation 14 (8), 1959–1977.
Chapelle, O., V. Vapnik, O. Bousquet, and S. Mukherjee (2002). Choosing multiple parameters for support vector machines. Machine Learning 46, 131–159.
Chu, W., S. Keerthi, and C. J. Ong (2003). Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks. To appear.

Chung, K.-M., W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin (2003). Radius margin bounds for support vector machines with the RBF kernel. Neural Computation 15, 2643–2681.
Clarke, F. H. (1983). Optimization and Nonsmooth Analysis. New York: Wiley.

Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20, 273–297.
Gao, J. B., S. R. Gunn, C. J. Harris, and M. Brown (2002). A probabilistic framework for SVM regression and error bar estimation. Machine Learning 46, 71–89.

Golub, G. H. and C. F. Van Loan (1996). Matrix Computations (Third ed.). The Johns Hopkins University Press.
Joachims, T. (2000). Estimating the generalization performance of an SVM efficiently. In Proceedings of the International Conference on Machine Learning, San Francisco. Morgan Kaufmann.
Kwok, J. T. (2001). Linear dependency between epsilon and the input noise in epsilon-support vector regression. In Proceedings of the International Conference on Artificial Neural Networks (ICANN).
Lin, C.-J. (2001a). Formulations of support vector machines: a note from an optimization point of view. Neural Computation 13 (2), 307–317.
Lin, C.-J. (2001b). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12 (6), 1288–1298.

Lin, C.-J. and R. C. Weng (2004). Simple probabilistic predictions for support vector regression. Technical report, Department of Computer Science, National Taiwan University.
Momma, M. and K. P. Bennett (2002). A pattern search method for model selection of support vector regression. In Proceedings of the SIAM Conference on Data Mining.

Smola, A., N. Murata, B. Schölkopf, and K.-R. Müller (1998). Asymptotically optimal choice of epsilon-loss for support vector machines. In Proceedings of the International Conference on Artificial Neural Networks.
Smola, A. J. and B. Schölkopf (2004). A tutorial on support vector regression. Statistics and Computing 14 (3), 199–222.
Ulbrich, M. (2000). Nonsmooth Newton-like Methods for Variational Inequalities and Constrained Optimization Problems in Function Spaces. Ph.D. thesis, Technische Universität München.
Vapnik, V. (1998). Statistical Learning Theory. New York, NY: Wiley.
Vapnik, V. and O. Chapelle (2000). Bounds on error expectation for support vector machines. Neural Computation 12 (9), 2013–2036.
A Proof of Lemma 1
We consider the first case, α_t > 0. Let (w^t, b^t, ξ^t, ξ^{t*}) and (w, b, ξ, ξ*) be the optimal solutions of (3.2) and (1.1), respectively. Though ξ^t and ξ^{t*} are vectors with l − 1 elements, recall that we define ξ^t_t = ξ^{t*}_t = 0 to make ξ^t and ξ^{t*} have l elements.

Note that the only difference between (3.2) and (1.1) is that (1.1) possesses the following constraint:

\[
-\epsilon-\xi_t^* \le w^T\phi(x_t)+b-y_t \le \epsilon+\xi_t. \tag{A.1}
\]
We then prove the lemma by contradiction. If the result is wrong, there are α_t > 0 and f^t(x_t) < y_t. From the KKT condition (3.1), α_t > 0 implies ξ_t > 0 and f(x_t) = y_t + ε + ξ_t > y_t + ε > y_t. Then, we have

\[
f(x_t) = w^T\phi(x_t)+b > y_t > f^t(x_t) = (w^t)^T\phi(x_t)+b^t.
\]

Therefore, there is 0 < p < 1 such that

\[
(1-p)(w^T\phi(x_t)+b) + p((w^t)^T\phi(x_t)+b^t) = y_t. \tag{A.2}
\]
Using the feasibility of the two points,

\[
(\hat w,\hat b,\hat\xi,\hat\xi^*) = (1-p)(w,b,\xi,\xi^*) + p(w^t,b^t,\xi^t,\xi^{t*})
\]

is a new feasible solution of (3.2) without considering the tth elements of ξ̂ and ξ̂*. Then

\[
\begin{aligned}
\tfrac{1}{2}\hat w^T\hat w + \tfrac{C}{2}\sum_{i\ne t}\hat\xi_i^2 + \tfrac{C}{2}\sum_{i\ne t}(\hat\xi_i^*)^2
&\le (1-p)\Big(\tfrac{1}{2}w^Tw + \tfrac{C}{2}\sum_{i\ne t}\xi_i^2 + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^*)^2\Big)\\
&\quad + p\Big(\tfrac{1}{2}(w^t)^Tw^t + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^t)^2 + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^{t*})^2\Big)\\
&\le \tfrac{1}{2}w^Tw + \tfrac{C}{2}\sum_{i\ne t}\xi_i^2 + \tfrac{C}{2}\sum_{i\ne t}(\xi_i^*)^2. \tag{A.3}
\end{aligned}
\]
The last inequality comes from the fact that (w^t, b^t, ξ^t, ξ^{t*}) is optimal for (3.2) and that, if the constraint (A.1) of (1.1) is not considered, (w, b, ξ, ξ*) is feasible for (3.2).

In this case, α_t > 0, which implies ξ_t > 0 and ξ_t* = 0 from the KKT conditions (3.1). Since ξ_t ≠ 0, ξ̂_t ≠ 0 as well. As ξ̂_t and ξ̂_t* are not considered in deriving (A.3), we can redefine ξ̂_t = ξ̂_t* = 0 such that (ŵ, b̂, ξ̂, ξ̂*) satisfies (A.1) and thus is also a feasible solution of (1.1). Then, from (A.3), ξ̂_t = 0 < ξ_t, and ξ̂_t* = ξ_t* = 0, we have

\[
\tfrac{1}{2}\hat w^T\hat w + \tfrac{C}{2}\sum_{i=1}^{l}\hat\xi_i^2 + \tfrac{C}{2}\sum_{i=1}^{l}(\hat\xi_i^*)^2
< \tfrac{1}{2}w^Tw + \tfrac{C}{2}\sum_{i=1}^{l}\xi_i^2 + \tfrac{C}{2}\sum_{i=1}^{l}(\xi_i^*)^2. \tag{A.4}
\]

Therefore, (ŵ, b̂, ξ̂, ξ̂*) is a better solution of (1.1) than (w, b, ξ, ξ*), a contradiction. The proof of the other case is similar. □
B
Proof of Lemma 2
For easier description, in this proof we introduce a different representation of the SVR dual:
\begin{align}
\min_{\bar\alpha}\quad & \frac12\bar\alpha^T\bar Q\bar\alpha + p^T\bar\alpha \tag{B.1}\\
\text{subject to}\quad & z^T\bar\alpha = 0,\quad \bar\alpha_i \ge 0,\ i = 1,\dots,2l,\nonumber
\end{align}
where
\[
\bar\alpha = \begin{bmatrix}\alpha\\ \alpha^*\end{bmatrix},\quad
p = \begin{bmatrix}\epsilon e + y\\ \epsilon e - y\end{bmatrix},\quad
z = \begin{bmatrix}-e\\ e\end{bmatrix},
\quad\text{and}\quad
\bar Q = \begin{bmatrix}K + I/C & -K - I/C\\ -K - I/C & K + I/C\end{bmatrix}. \tag{B.2}
\]
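To see the change of variables concretely, the following sketch builds $\bar Q$, $p$, and $z$ from a kernel matrix and checks that the resulting objective matches the original dual form in (1.2). The linear kernel, the data values, and $\epsilon$ below are invented purely for illustration.

```python
import random

# Hypothetical toy problem: l = 3 scalar points with a linear kernel
# K_ij = x_i * x_j (any kernel matrix works the same way).
x = [0.5, -1.0, 2.0]
y = [0.4, -0.1, 0.9]
l = len(x)
C, eps = 2.0, 0.1

K = [[x[i] * x[j] for j in range(l)] for i in range(l)]
# K tilde = K + I/C
Kt = [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(l)] for i in range(l)]

# Q bar, p as in (B.1)-(B.2)
Q = [[0.0] * (2 * l) for _ in range(2 * l)]
for i in range(l):
    for j in range(l):
        Q[i][j] = Kt[i][j]
        Q[i + l][j + l] = Kt[i][j]
        Q[i][j + l] = -Kt[i][j]
        Q[i + l][j] = -Kt[i][j]
p = [eps + yi for yi in y] + [eps - yi for yi in y]

# Any (alpha, alpha*) gives the same objective under both representations.
random.seed(0)
a = [random.random() for _ in range(l)]
astar = [random.random() for _ in range(l)]
abar = a + astar

d = [a[i] - astar[i] for i in range(l)]
obj_orig = (0.5 * sum(d[i] * Kt[i][j] * d[j] for i in range(l) for j in range(l))
            + eps * sum(a[i] + astar[i] for i in range(l))
            + sum(y[i] * d[i] for i in range(l)))
obj_bar = (0.5 * sum(abar[i] * Q[i][j] * abar[j]
                     for i in range(2 * l) for j in range(2 * l))
           + sum(p[i] * abar[i] for i in range(2 * l)))
print(abs(obj_orig - obj_bar))  # essentially 0
```

The two objective values agree for any $\alpha$, $\alpha^*$, which is all the representation (B.1) claims.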
We proceed with the proof by considering three cases.
1) $\alpha_t = \alpha^*_t = 0$
If $\alpha_t = \alpha^*_t = 0$, then
\[
\alpha_1, \dots, \alpha_{t-1}, \alpha_{t+1}, \dots, \alpha_l \tag{B.3}
\]
and
\[
\alpha^*_1, \dots, \alpha^*_{t-1}, \alpha^*_{t+1}, \dots, \alpha^*_l \tag{B.4}
\]
constitute a feasible and optimal solution of the dual problem of (3.2), because they satisfy the KKT conditions. Since there are free support vectors, $b$ can be uniquely determined from (3.1) using some positive $\alpha_i$ or $\alpha^*_i$. It follows that $b = b^t$ and $f(x_t) = f^t(x_t)$. From (3.1),
\[
|f^t(x_t) - y_t| = |f(x_t) - y_t| \le \epsilon.
\]
2) $\alpha_t > 0$
In this case, we mainly use the formulation (B.1) to represent L2-SVR, with optimal solution $\bar\alpha$. We also represent the dual of (3.2) by a form similar to (B.1) and denote by $\bar\alpha^t$ its unique optimal solution. Note that (3.2) has $l-1$ constraints, so $\bar\alpha^t$ has $2(l-1)$ elements. Here, we define $\bar\alpha^t_t = \bar\alpha^t_{t+l} = 0$ to make $\bar\alpha^t$ a vector with $2l$ elements. The KKT condition of (B.1) can be rewritten as:
\begin{align}
(\bar Q\bar\alpha)_i + bz_i + p_i &= 0, \quad\text{if } \bar\alpha_i > 0,\nonumber\\
(\bar Q\bar\alpha)_i + bz_i + p_i &> 0, \quad\text{if } \bar\alpha_i = 0. \tag{B.5}
\end{align}
Next, we follow the procedure of (Vapnik and Chapelle 2000) and (Joachims 2000). Under the representation (B.1), the predicted value at the training vector $x_t$ can be written as:
\[
f(x_t) = -\sum_{i=1}^{2l}\bar\alpha_i\bar Q_{it} + b,
\]
which equals $f(x_t)$ defined in (1.3). Here, we intend to consider
\[
f^0(x_t) = -\sum_{i\ne t,\,t+l}\bar\alpha_i\bar Q_{it} + b \tag{B.6}
\]
as an approximation of $f^t(x_t)$, since both $f^0$ and $f^t$ do not consider the $t$th training vector. However, (B.6) is not applicable since (B.3) and (B.4) are not a feasible solution of the dual of (3.2) when $\alpha_t > 0$. Therefore, we construct from $\bar\alpha$ a vector $\gamma$ that is feasible for the dual of (3.2).
Define a set $\mathcal F_t = \{i \mid \bar\alpha_i > 0,\ i \ne t,\ t+l\}$. In order to make $\gamma$ a feasible solution, let $\gamma = \bar\alpha - \eta$, where $\eta$ satisfies
\begin{align}
\eta_i &\le \bar\alpha_i, & &i \in \mathcal F_t,\nonumber\\
\eta_i &= 0, & &i \notin \mathcal F_t,\ i \ne t,\ i \ne t+l, \tag{B.7}\\
\eta_i &= \bar\alpha_i, & &i = t,\ i = t+l,\nonumber
\end{align}
and
\[
z^T\eta = 0. \tag{B.8}
\]
For example, we can set all $\eta_1, \dots, \eta_l$ to zero except $\eta_t = \alpha_t$. Then, using $\sum_{i=l+1}^{2l}\bar\alpha_i \ge \alpha_t$, we can find an $\eta$ which satisfies (B.7) and
\[
\sum_{i=l+1}^{2l}\eta_i = \alpha_t. \tag{B.9}
\]
Denote by $F$ the objective function of (B.1). Then,
\begin{align}
F(\gamma) - F(\bar\alpha) &= \frac12(\bar\alpha-\eta)^T\bar Q(\bar\alpha-\eta) + p^T(\bar\alpha-\eta) - \frac12\bar\alpha^T\bar Q\bar\alpha - p^T\bar\alpha\nonumber\\
&= \frac12\eta^T\bar Q\eta - \eta^T(\bar Q\bar\alpha + p). \tag{B.10}
\end{align}
From (B.5) and (B.7), $(\bar Q\bar\alpha + p)_i = -z_ib$ for any $i \in \mathcal F_t \cup \{t\}$, and $\eta_i = 0$ for any $i \notin \mathcal F_t \cup \{t\}$. Using (B.8), (B.10) is reduced to
\[
F(\gamma) - F(\bar\alpha) = \frac12\eta^T\bar Q\eta. \tag{B.11}
\]
Similarly, from $\bar\alpha^t$ we construct a vector $\delta$ that is a feasible solution of (B.1). Let $\bar{\mathcal F}_t = \{i \mid \bar\alpha^t_i > 0,\ i \ne t,\ t+l\}$. In order to make $\delta$ feasible, define $\delta = \bar\alpha^t - \mu$, where $\mu$ satisfies
\begin{align}
\mu_i &\le \bar\alpha^t_i, & &i \in \bar{\mathcal F}_t,\nonumber\\
\mu_i &= 0, & &i \notin \bar{\mathcal F}_t,\ i \ne t,\ i \ne t+l, \tag{B.12}\\
\mu_i &= -\bar\alpha_i, & &i = t,\ i = t+l,\nonumber
\end{align}
and
\[
z^T\mu = 0. \tag{B.13}
\]
The existence of $\mu$ easily follows from Assumption 1. With the condition $z^T\bar\alpha^t = 0$, this assumption implies that at least one of $\bar\alpha^t_{l+1}, \dots, \bar\alpha^t_{2l}$ is positive. Thus, while the $t$th element of $\delta$ is increased from zero to $\bar\alpha_t$, we can increase some of the positive $\bar\alpha^t_{l+1}, \dots, \bar\alpha^t_{2l}$ so that $z^T\delta = 0$ still holds.
Since $\delta = \bar\alpha^t - \mu$ is a feasible solution of (B.1), it follows that
\begin{align}
F(\delta) - F(\bar\alpha^t) &= \frac12(\bar\alpha^t-\mu)^T\bar Q(\bar\alpha^t-\mu) + p^T(\bar\alpha^t-\mu) - \frac12(\bar\alpha^t)^T\bar Q\bar\alpha^t - p^T\bar\alpha^t\nonumber\\
&= \frac12\mu^T\bar Q\mu - \mu^T(\bar Q\bar\alpha^t + p). \tag{B.14}
\end{align}
From (B.5),
\[
(\bar Q\bar\alpha^t + p)_i = -z_ib^t,\quad \forall i \in \bar{\mathcal F}_t, \tag{B.15}
\]
where $\bar\alpha^t$ and $b^t$ are optimal for (3.2) and its dual. By (B.12), (B.13), and (B.15),
\begin{align}
\mu^T(\bar Q\bar\alpha^t + p) &= -b^t\sum_{i\ne t,\,t+l}\mu_iz_i + \mu_t(\bar Q\bar\alpha^t + p)_t + \mu_{t+l}(\bar Q\bar\alpha^t + p)_{t+l}\nonumber\\
&= -\bar\alpha_t(\bar Q\bar\alpha^t + p + b^tz)_t - \bar\alpha_{t+l}(\bar Q\bar\alpha^t + p + b^tz)_{t+l}\nonumber\\
&= \bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) + \bar\alpha_{t+l}\big(y_t - f^t(x_t) - \epsilon\big).\nonumber
\end{align}
Thus, (B.14) is simplified to
\[
\bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) + \bar\alpha_{t+l}\big(y_t - f^t(x_t) - \epsilon\big) = F(\bar\alpha^t) - F(\delta) + \frac12\mu^T\bar Q\mu. \tag{B.16}
\]
Here we claim that $\bar\alpha_{t+l} = 0$ when $\bar\alpha_t > 0$, since from (B.5),
\[
\alpha_t\alpha^*_t = \bar\alpha_t\bar\alpha_{t+l} = 0. \tag{B.17}
\]
Note that $F(\delta) \ge F(\bar\alpha)$, as $\bar\alpha$ is the optimal solution of (B.1). Similarly, $F(\gamma) \ge F(\bar\alpha^t)$. Combining (B.17), (B.16), and (B.11),
\begin{align}
\bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) &= F(\bar\alpha^t) - F(\delta) + \frac12\mu^T\bar Q\mu\nonumber\\
&\le F(\gamma) - F(\bar\alpha) + \frac12\mu^T\bar Q\mu = \frac12\eta^T\bar Q\eta + \frac12\mu^T\bar Q\mu. \tag{B.18}
\end{align}
Let $B_\eta$ be the set containing all feasible $\eta$, that is, all $\eta$ satisfying (B.7) and (B.8). Similarly, let $B_\mu$ be the set containing all feasible $\mu$. Then,
\[
\bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) \le \min_{\eta\in B_\eta}\frac12\eta^T\bar Q\eta + \min_{\mu\in B_\mu}\frac12\mu^T\bar Q\mu, \tag{B.19}
\]
since (B.18) is valid for all feasible $\eta$ and $\mu$.
Recall $\tilde K = K + I/C$ and define $g_i = \eta_i - \eta_{i+l}$ for $i = 1, \dots, l$. From the definition of (B.1) and (B.2),
\[
\eta^T\bar Q\eta = \sum_{i=1}^l\sum_{j=1}^l g_ig_j\tilde K_{ij} = g_t^2\tilde K_{tt} + 2g_t\sum_{i\ne t}g_i\tilde K_{it} + \sum_{i\ne t}\sum_{j\ne t}g_ig_j\tilde K_{ij}.
\]
Moreover, we can rewrite
\[
\eta^T\bar Q\eta = g_t^2\Big(\tilde K_{tt} - 2\sum_{i\ne t}\lambda_i\tilde K_{it} + \sum_{i\ne t}\sum_{j\ne t}\lambda_i\lambda_j\tilde K_{ij}\Big), \tag{B.20}
\]
where
\[
\lambda_i = -\frac{g_i}{g_t},\quad i = 1, \dots, l. \tag{B.21}
\]
When $\alpha_t > 0$, $\bar\alpha_{t+l} = 0$ from (B.17). Therefore, $g_t$ is not zero, since $\eta_{t+l} = 0$ and $g_t = \bar\alpha_t$. From (B.8),
\[
\sum_{i=1,\,i\ne t}^l\lambda_i = 1. \tag{B.22}
\]
Note that
\[
\eta_i\eta_{i+l} = 0 \tag{B.23}
\]
from (B.7) and (B.17). Therefore, from $\tilde K_{ij} = \tilde\phi(x_i)^T\tilde\phi(x_j)$ and (B.23), equations (B.20), (B.21), and (B.22) imply
\[
\min_{\eta\in B_\eta}\eta^T\bar Q\eta = g_t^2\,d^2(\tilde\phi(x_t), \Lambda_t) = \bar\alpha_t^2\,d^2(\tilde\phi(x_t), \Lambda_t), \tag{B.24}
\]
where
\[
\Lambda_t = \Big\{\sum_{i=1,\,i\ne t}^l\lambda_i\tilde\phi(x_i) \,\Big|\, \sum_{i\ne t}\lambda_i = 1,\ \lambda_i \ge -\frac{\bar\alpha_i}{g_t}\text{ if }\bar\alpha_i \ge 0,\ \lambda_i \le \frac{\bar\alpha_{i+l}}{g_t}\text{ if }\bar\alpha_{i+l} \ge 0\Big\},
\]
and $d(\tilde\phi(x_t), \Lambda_t)$ is the distance between $\tilde\phi(x_t)$ and the set $\Lambda_t$ in the feature space. Define a subset of $\Lambda_t$ as:
\[
\Lambda^+_t = \Big\{\sum_{i=1,\,i\ne t}^l\lambda_i\tilde\phi(x_i) \,\Big|\, \sum_{i\ne t}\lambda_i = 1,\ \lambda_i \ge 0,\ \lambda_i \ge -\frac{\bar\alpha_i}{g_t}\text{ if }\bar\alpha_i \ge 0,\ \lambda_i \le \frac{\bar\alpha_{i+l}}{g_t}\text{ if }\bar\alpha_{i+l} \ge 0\Big\}.
\]
Recall that in (B.9), one way to find a feasible $\bar\alpha - \eta$ is to decrease some free $\bar\alpha_{l+1}, \dots, \bar\alpha_{2l}$. With $g_t = \eta_t = \bar\alpha_t > 0$, this is achieved by using positive $\lambda_i$ in $\bar\alpha_{i+l} - \lambda_ig_t$. Thus, $\Lambda^+_t$ is nonempty. As $\Lambda^+_t$ is a subset of the convex hull of $\tilde\phi(x_i)$, $i = 1, \dots, l$, $i \ne t$,
\[
d^2(\tilde\phi(x_t), \Lambda_t) \le d^2(\tilde\phi(x_t), \Lambda^+_t) \le \max_{i\ne t}\|\tilde\phi(x_t) - \tilde\phi(x_i)\|^2 \le 4\tilde R^2. \tag{B.25}
\]
(B.24) then implies
\[
\min_{\eta\in B_\eta}\eta^T\bar Q\eta \le 4\bar\alpha_t^2\tilde R^2. \tag{B.26}
\]
Similarly,
\[
\min_{\mu\in B_\mu}\mu^T\bar Q\mu \le 4\bar\alpha_t^2\tilde R^2. \tag{B.27}
\]
Combining the above inequalities and (B.19), and canceling out $\bar\alpha_t$, we have
\[
f^t(x_t) - y_t \le 4\tilde R^2\bar\alpha_t + \epsilon = 4\tilde R^2\alpha_t + \epsilon. \tag{B.28}
\]
3) $\alpha^*_t > 0$
The result can be proved through a procedure similar to that for the case of $\alpha_t > 0$. $\Box$
C
Proof of Theorem 2
Define $\bar\eta = -\bar\mu = \bar\alpha - \bar\alpha^t$. Under the assumption that the set of support vectors remains the same during the loo procedure, $\bar\eta \in B_\eta$ and $\bar\mu \in B_\mu$. Then, (B.18) becomes an equality:
\[
\bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) = \bar\eta^T\bar Q\bar\eta = \min_{\eta\in B_\eta}\eta^T\bar Q\eta = \bar\alpha_t^2\,d^2(\tilde\phi(x_t), \Lambda_t).
\]
Since $\bar\eta = \bar\alpha - \bar\alpha^t$ and the sets of support vectors of $\bar\alpha$ and $\bar\alpha^t$ are the same,
\[
d^2(\tilde\phi(x_t), \Lambda_t) = d^2(\tilde\phi(x_t), \Lambda'_t),\quad\text{where}\quad
\Lambda'_t = \Big\{\sum_{i\ne t,\,\alpha_i+\alpha^*_i>0}\lambda_i\tilde\phi(x_i) \,\Big|\, \sum_{i\ne t}\lambda_i = 1\Big\}.
\]
The reason is that, under the assumption, we do not need to consider the constraints associated with free support vectors in the definition of $\Lambda_t$. Therefore, it follows that
\[
\text{loo} \le \sum_{t=1}^l(\alpha_t + \alpha^*_t)S_t^2 + \epsilon l,
\]
where $S_t^2$ is the optimal objective value of (2.5) with $\mathcal F$ replaced by $\{i \mid i \ne t,\ \alpha_i + \alpha^*_i > 0\}$. $\Box$
D
LOO Bounds for L1-SVR
D.1
Modifications in the Proof of Lemma 1
In Lemma 1, $\alpha_t > 0$ implies $\xi_t > 0$, which is used for the strict inequality in (A.4). However, for L1-SVR, $\xi_t$ may be zero even if $\alpha_t > 0$. To prove the strict inequality, we consider (A.3), in which equality holds only if
\[
\frac12(w^t)^Tw^t + C\sum_{i\ne t}\xi^t_i + C\sum_{i\ne t}\xi^{t*}_i = \frac12w^Tw + C\sum_{i\ne t}\xi_i + C\sum_{i\ne t}\xi^*_i.
\]
In that case, $(w, b, \xi, \xi^*)$ is optimal for (3.2) as well. Using Assumption 1, $(w, b) = (w^t, b^t)$, so
\[
f^t(x_t) = f(x_t) \ge y_t
\]
contradicts the assumption that $f^t(x_t) < y_t$.
D.2
Modifications in the Proof of Lemma 2
The proof for the case of $\alpha_t = \alpha^*_t = 0$ is exactly the same, so we focus on the case of $\alpha_t > 0$. Similar to the L2 case, we consider a form like (B.1), and $\bar\alpha$ becomes the dual variable.
Now $\mathcal F_t$ is redefined as $\{i \mid 0 < \bar\alpha_i < C,\ i \ne t,\ t+l\}$. We claim that $\eta$ still exists so that
\begin{align}
0 \le \bar\alpha_i - \eta_i &\le C, & &i \in \mathcal F_t,\nonumber\\
\eta_i &= 0, & &i \notin \mathcal F_t,\ i \ne t,\ i \ne t+l, \tag{D.1}\\
\eta_i &= \bar\alpha_i, & &i = t,\ i = t+l,\nonumber
\end{align}
and
\[
z^T\eta = 0 \tag{D.2}
\]
are satisfied. In order to decrease $\bar\alpha_t$ to zero, one may decrease some free $\bar\alpha_{l+1}, \dots, \bar\alpha_{2l}$ so that (B.9) is satisfied. However, for L1-SVR, it is possible that after all free $\bar\alpha_{l+1}, \dots, \bar\alpha_{2l}$ are decreased to 0, $\bar\alpha_t$ is not zero yet. At this point we must increase some free $\bar\alpha_1, \dots, \bar\alpha_l$. Since we keep $z^T\bar\alpha = 0$ and all $\bar\alpha_{l+1}, \dots, \bar\alpha_{2l}$ have been updated to zero or remain at $C$,
\[
\bar\alpha_t + \sum_{\substack{i=1,\dots,l,\ i\ne t\\ 0<\bar\alpha_i<C}}\bar\alpha_i = \Delta C, \tag{D.3}
\]
where $\Delta \ge 1$ is an integer. (D.3) implies that one can reduce $\bar\alpha_t$ to zero and increase the free $\bar\alpha_i$, $i = 1, \dots, l$, without exceeding the upper bound.
In Appendix B, from (B.10) to (B.11), we use the property that $(\bar Q\bar\alpha + p)_i = -z_ib$ for any $i \in \mathcal F_t \cup \{t\}$. Now this equality may not hold when $i = t$. If $\bar\alpha_t = C$,
\[
(\bar Q\bar\alpha + p)_t = -z_tb - \xi_t
\]
and $\xi^*_t = 0$, so (B.11) becomes
\[
F(\gamma) - F(\bar\alpha) = \frac12\eta^T\bar Q\eta + \bar\alpha_t(\xi_t + \xi^*_t). \tag{D.4}
\]
For $\bar\alpha^t$, now $\bar{\mathcal F}_t = \{i \mid 0 < \bar\alpha^t_i < C,\ i \ne t,\ t+l\}$. Using Assumption 1, $\bar{\mathcal F}_t \ne \emptyset$. By a similar argument on the existence of $\eta$, there is $\mu$ such that
\begin{align}
0 \le \bar\alpha^t_i - \mu_i &\le C, & &i \in \bar{\mathcal F}_t,\nonumber\\
\mu_i &= 0, & &i \notin \bar{\mathcal F}_t,\ i \ne t,\ i \ne t+l, \tag{D.5}\\
\mu_i &= -\bar\alpha_i, & &i = t,\ i = t+l,\nonumber
\end{align}
and
\[
z^T\mu = 0. \tag{D.6}
\]
Thus, (B.16) holds. With (D.4), an inequality similar to (B.19) is
\[
\bar\alpha_t\big(f^t(x_t) - y_t - \epsilon\big) \le \min_{\eta\in B_\eta}\frac12\eta^T\bar Q\eta + \min_{\mu\in B_\mu}\frac12\mu^T\bar Q\mu + \bar\alpha_t(\xi_t + \xi^*_t).
\]
We then use the same derivation from (B.20) to (B.27), but in the definitions of $\Lambda_t$ and $\Lambda^+_t$, $\lambda_i$ is confined by
\[
0 \le \bar\alpha_i + \lambda_ig_t \le C \quad\text{and}\quad 0 \le \bar\alpha_{i+l} - \lambda_ig_t \le C,\quad i = 1, \dots, l. \tag{D.7}
\]
In the discussion near (D.3), a feasible $\bar\alpha - \eta$ can be obtained by decreasing some free $\bar\alpha_{l+1}, \dots, \bar\alpha_{2l}$, an operation which uses positive $\lambda_i$ in $\bar\alpha_{i+l} - \lambda_ig_t$. We may have to increase some free $\bar\alpha_1, \dots, \bar\alpha_l$ as well; this also requires positive $\lambda_i$ in $\bar\alpha_i + \lambda_ig_t$. Therefore, $\Lambda^+_t \ne \emptyset$, so (B.26) (and similarly (B.27)) follows.
Finally, (B.28) becomes
\[
f^t(x_t) - y_t \le 4\tilde R^2\alpha_t + \epsilon + \xi_t + \xi^*_t.
\]
E
Proof of Theorem 3
We consider the formulation (B.1). $\bar\alpha$ is continuous if for any $\theta^0$, $\lim_{\theta\to\theta^0}\bar\alpha(\theta) = \bar\alpha(\theta^0)$. If this is wrong, there is a convergent sequence $\{\bar\alpha(\theta_i)\}$ such that $\lim_{i\to\infty}\theta_i = \theta^0$ but $\lim_{i\to\infty}\bar\alpha(\theta_i) = \bar\alpha^0 \ne \bar\alpha(\theta^0)$. Note that the existence of the convergent sequence requires that $\{\bar\alpha(\theta_i)\}$ be in a compact set. This property has been discussed in (Lin 2001b) for L2-SVM. Since $\bar\alpha(\theta_i)$ and $\bar\alpha(\theta^0)$ are optimal solutions at $\theta_i$ and $\theta^0$, respectively,
\begin{align}
\frac12\bar\alpha(\theta_i)^T\bar Q(\theta_i)\bar\alpha(\theta_i) + p^T\bar\alpha(\theta_i) &\le \frac12\bar\alpha(\theta^0)^T\bar Q(\theta_i)\bar\alpha(\theta^0) + p^T\bar\alpha(\theta^0),\quad\text{and}\nonumber\\
\frac12\bar\alpha(\theta^0)^T\bar Q(\theta^0)\bar\alpha(\theta^0) + p^T\bar\alpha(\theta^0) &\le \frac12\bar\alpha(\theta_i)^T\bar Q(\theta^0)\bar\alpha(\theta_i) + p^T\bar\alpha(\theta_i). \tag{E.1}
\end{align}
With Assumption 2 that all kernel elements are continuous, $\lim_{i\to\infty}\bar Q(\theta_i) = \bar Q(\theta^0)$. Taking the limit of (E.1),
\[
\frac12\bar\alpha(\theta^0)^T\bar Q(\theta^0)\bar\alpha(\theta^0) + p^T\bar\alpha(\theta^0) = \frac12(\bar\alpha^0)^T\bar Q(\theta^0)\bar\alpha^0 + p^T\bar\alpha^0.
\]
Thus, $\bar\alpha^0$ is an optimal solution, too. Since the optimal solution is unique under $\theta^0$, $\bar\alpha^0 = \bar\alpha(\theta^0)$, a contradiction. Therefore, $\bar\alpha$ is continuous.
As for $\tilde R^2$, it is the optimal objective value of (4.4). By the same procedure, $\beta$ is continuous, and so is $\tilde R^2$. Therefore, the radius margin bound $4\tilde R^2e^T(\alpha + \alpha^*) + \epsilon l$ is continuous.
Next, we prove the continuity of the modified span bound. As we have proved that $\bar\alpha$ is continuous, it suffices to consider $\tilde S_t^2$ only. Define $\{\theta_i\}$ to be any sequence that converges to $\theta^0$. There are corresponding sequences $\{\bar\alpha(\theta_i)\}$ and $\{\tilde S_t^2(\theta_i)\}$. If for any convergent $\{\theta_i\}$, $\{\tilde S_t^2(\theta_i)\}$ converges to $\tilde S_t^2(\theta^0)$, then $\tilde S_t^2$ is continuous at $\theta^0$. Thus, the convergence of $\{\tilde S_t^2(\theta_i)\}$ to $\tilde S_t^2(\theta^0)$ is what we show next.
Note that for any $\bar\alpha$, we can define two index sets:
\[
\mathcal F = \{i \mid \bar\alpha_i > 0\} \quad\text{and}\quad \mathcal L = \{i \mid \bar\alpha_i = 0\}. \tag{E.2}
\]
They contain the indices of the free and lower-bounded elements of $\bar\alpha$, respectively. Thus, for any $\bar\alpha(\theta)$, there are associated sets $\mathcal F_\theta$ and $\mathcal L_\theta$. We call these sets the face of $\bar\alpha(\theta)$. Later, if we state that the faces of $\bar\alpha(\theta^1)$ and $\bar\alpha(\theta^2)$ are identical, we mean that $\mathcal F_{\theta^1} = \mathcal F_{\theta^2}$ and $\mathcal L_{\theta^1} = \mathcal L_{\theta^2}$.
Because there is only a finite number of possible faces of $\bar\alpha$, we can separate $\{\bar\alpha(\theta_i)\}$ into finitely many subsequences, each of which lies on the same face. As it suffices to prove that, for any such subsequence, $\{\tilde S_t^2(\theta_i)\}$ converges to $\tilde S_t^2(\theta^0)$, without loss of generality we assume that $\{\bar\alpha(\theta_i)\}$ are all on the same face. Since it is a convergent sequence and $\bar\alpha(\theta)$ is a continuous function, there is a fixed (possibly empty) set $J \subset \{1, \dots, l\}$ such that $\alpha_j(\theta) + \alpha^*_j(\theta) > 0$ for any $\theta \in \{\theta_i\}$, $j \in J$, and
\[
\lim_{i\to\infty}\alpha_j(\theta_i) + \alpha^*_j(\theta_i) = 0,\quad j \in J. \tag{E.3}
\]
Now, we calculate the limit of $(\tilde M^t)^{-1}$, which was defined in (2.11). We decompose $\tilde M^t$ into four blocks:
\[
\tilde M^t = \begin{bmatrix}A_1 & A_2\\ A_2^T & A_3\end{bmatrix}.
\]
Here, we rearrange $\tilde M^t$ such that
\[
A_1 = \tilde K_{JJ} + \tilde D_{JJ}.
\]
Hence, the first $|J|$ columns and rows of $\tilde M^t$ correspond to indices satisfying (E.3). $A_3$ is the sub-matrix of $\tilde M^t$ without the first $|J|$ columns and rows. We have
\[
\begin{bmatrix}A_1 & A_2\\ A_2^T & A_3\end{bmatrix}^{-1} = \begin{bmatrix}A_1^{-1}(I + A_2B^{-1}A_2^TA_1^{-1}) & -A_1^{-1}A_2B^{-1}\\ -B^{-1}A_2^TA_1^{-1} & B^{-1}\end{bmatrix}, \tag{E.4}
\]
where $B = A_3 - A_2^TA_1^{-1}A_2$. From (E.3), it follows that every diagonal element of $A_1$ diverges to infinity as $\{\theta_i\}$ approaches $\theta^0$. Therefore, from Lemma 2.3.3 of (Golub and Van Loan 1996),
\[
\lim_{i\to\infty}A_1(\theta_i)^{-1} = O, \tag{E.5}
\]
where $O$ is a $|J| \times |J|$ zero matrix. According to (E.4) and (E.5),
\[
\lim_{i\to\infty}\tilde M^t(\theta_i)^{-1} = \begin{bmatrix}O & O_1\\ O_1^T & A_3(\theta^0)^{-1}\end{bmatrix}
\quad\text{and}\quad
\lim_{i\to\infty}h(\theta_i) = \begin{bmatrix}h_J(\theta^0)\\ h^0(\theta^0)\end{bmatrix},
\]
where $O_1$ is a $|J| \times q$ zero matrix and $q$ is the number of columns of $A_3$. $h_J(\theta^0)$ and $h^0(\theta^0)$ are the sub-vectors of $h(\theta^0)$ containing the first $|J|$ and the remaining elements, respectively. Then,
\[
\lim_{i\to\infty}\tilde S_t^2(\theta_i) = \lim_{i\to\infty}h(\theta_i)^T\tilde M^t(\theta_i)^{-1}h(\theta_i) = h^0(\theta^0)^TA_3(\theta^0)^{-1}h^0(\theta^0) = \tilde S_t^2(\theta^0).
\]
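The block-inverse identity (E.4) and the limiting behavior (E.5) can be sanity-checked numerically with $1\times1$ blocks, so that every block is a scalar; the numbers below are arbitrary test values, not from the paper.

```python
def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def block_inverse(a1, a2, a3):
    """(E.4) specialized to 1x1 blocks, with B = A3 - A2^T A1^{-1} A2."""
    B = a3 - a2 * a2 / a1
    return [[(1 + a2 * a2 / (B * a1)) / a1, -a2 / (a1 * B)],
            [-a2 / (B * a1), 1.0 / B]]

a2, a3 = 0.7, 2.0
errs = []
for a1 in (3.0, 1e6):
    direct = inv2([[a1, a2], [a2, a3]])
    via_e4 = block_inverse(a1, a2, a3)
    errs.append(max(abs(direct[i][j] - via_e4[i][j])
                    for i in range(2) for j in range(2)))

# As a1 -> infinity, the inverse approaches [[0, 0], [0, 1/a3]],
# matching the limit used in (E.5).
limit = block_inverse(1e12, a2, a3)
print(errs, limit)
```

The identity holds for both moderate and very large $A_1$, and the large-$A_1$ inverse is dominated by the $B^{-1} \approx A_3^{-1}$ block, which is exactly the structure used to take the limit of $\tilde M^t(\theta_i)^{-1}$.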
Therefore, for any $\{\theta_i\}$ which converges to $\theta^0$, $\{\tilde S_t^2(\theta_i)\}$ converges to $\tilde S_t^2(\theta^0)$, so $\tilde S_t^2$ is continuous. $\Box$
F
An Example that $e^T(\alpha + \alpha^*)$ of L2-SVR Is Not Differentiable
Consider an L2-SVR problem with $\epsilon = 0.1$,
\[
\tilde K = \begin{bmatrix}1 + \frac1C & 0.3 & 0.6\\ 0.3 & 1 + \frac1C & 0.3\\ 0.6 & 0.3 & 1 + \frac1C\end{bmatrix}, \tag{F.1}
\]
and $y = \begin{bmatrix}0.4 & -0.1 & 0.9\end{bmatrix}^T$.
(F.1) can be the kernel matrix when using the RBF kernel. For example, if $\sigma = 1/\sqrt2$, then
\begin{align}
\|x_1 - x_2\| &= \sqrt{-\log 0.3} \approx 1.0973,\nonumber\\
\|x_1 - x_3\| &= \sqrt{-\log 0.6} \approx 0.7147,\nonumber\\
\|x_2 - x_3\| &= \sqrt{-\log 0.3} \approx 1.0973,\nonumber
\end{align}
and there are $x_1$, $x_2$, and $x_3$ which form a triangle satisfying the above.
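That such points exist can be verified by explicitly placing three points in $R^2$ with these pairwise distances (standard triangle coordinates; the construction itself is ours, not from the paper) and evaluating the RBF kernel $\exp(-\|u-v\|^2)$:

```python
import math

# Side lengths required by (F.1) under exp(-||u - v||^2), i.e. sigma = 1/sqrt(2).
a = math.sqrt(-math.log(0.3))   # ||x1 - x2|| ~= 1.0973
b = math.sqrt(-math.log(0.6))   # ||x1 - x3|| ~= 0.7147
c = math.sqrt(-math.log(0.3))   # ||x2 - x3|| ~= 1.0973

# Place x1 at the origin, x2 on the x-axis, and solve for x3.
x1 = (0.0, 0.0)
x2 = (a, 0.0)
px = (a * a + b * b - c * c) / (2 * a)
x3 = (px, math.sqrt(b * b - px * px))

def rbf(u, v):
    d2 = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.exp(-d2)

print(rbf(x1, x2), rbf(x1, x3), rbf(x2, x3))  # ~0.3, ~0.6, ~0.3
```

The three kernel values recover the off-diagonal entries of (F.1), confirming the triangle exists.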
Assume $\delta$ is a small positive number. If $C = 2 + \delta$, then the optimal solution is
\[
\alpha = \begin{bmatrix}\frac{5C(C-2)}{\Delta} & \frac{13C(2C+5)}{\Delta} & 0\end{bmatrix}^T
\quad\text{and}\quad
\alpha^* = \begin{bmatrix}0 & 0 & \frac{31C^2 + 55C}{\Delta}\end{bmatrix}^T,
\]
where $\Delta = 6(4C+5)(2C+5)$. It follows that
\[
\lim_{C\to2^+}\frac{\partial e^T(\alpha + \alpha^*)}{\partial C} = \frac{55}{351}.
\]
If $C = 2 - \delta$, then the optimal solution is
\[
\alpha = \begin{bmatrix}0 & \frac{4C}{7C+10} & 0\end{bmatrix}^T
\quad\text{and}\quad
\alpha^* = \begin{bmatrix}0 & 0 & \frac{4C}{7C+10}\end{bmatrix}^T.
\]
Then,
\[
\lim_{C\to2^-}\frac{\partial e^T(\alpha + \alpha^*)}{\partial C} = \frac5{36}.
\]
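Using the closed-form solutions quoted above, one-sided numerical differentiation confirms that the left and right derivatives of $e^T(\alpha + \alpha^*)$ at $C = 2$ disagree (about $5/36$ versus $55/351$), so the quantity is continuous but not differentiable there:

```python
def sum_right(C):
    """e^T(alpha + alpha*) from the closed form valid for C slightly above 2."""
    Delta = 6 * (4 * C + 5) * (2 * C + 5)
    return (5 * C * (C - 2) + 13 * C * (2 * C + 5) + 31 * C**2 + 55 * C) / Delta

def sum_left(C):
    """e^T(alpha + alpha*) from the closed form valid for C slightly below 2."""
    return 8 * C / (7 * C + 10)

h = 1e-6
d_right = (sum_right(2 + 2 * h) - sum_right(2 + h)) / h   # one-sided, from above
d_left = (sum_left(2 - h) - sum_left(2 - 2 * h)) / h      # one-sided, from below

print(d_right, 55 / 351)   # both ~0.15670
print(d_left, 5 / 36)      # both ~0.13889
```

The two branches also agree at $C = 2$ (both equal $2/3$), so the function is continuous; only its derivative jumps.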
G
Proof of Theorem 4
We prove that these bounds are piecewise differentiable. This property implies that the function is locally Lipschitz continuous and hence differentiable almost everywhere (Clarke 1983). First, we define piecewise differentiable functions:
Definition 1
1) $C^k$ is the class of functions that are $k$ times differentiable.
2) A function $f : R^n \to R^m$ is called a $PC^k$ function (piecewise $k$ times differentiable), $1 \le k \le \infty$, if $f$ is continuous and for every point $x^0 \in R^n$ there exist a neighborhood $W$ of $x^0$ and a finite collection of $C^k$ functions $f_i : W \to R^m$, $i = 1, \dots, N$, such that
\[
f(x) \in \{f_1(x), \dots, f_N(x)\},\quad \forall x \in W.
\]
$PC^1$ functions are also called piecewise differentiable functions. There are some useful properties of $PC^k$ functions.
Theorem 5
1. A function $f(x) : V \to R^m$ defined on the open set $V \subset R^n$ is piecewise differentiable if and only if its component functions $f_i(x) : V \to R$, $i = 1, \dots, m$, are piecewise differentiable.
2. (Ulbrich 2000, Proposition 2.20) The class of $PC^k$ functions is closed under composition, finite summation, and multiplication.
We then give a lemma showing that $\bar\alpha(\theta)$ is a piecewise differentiable function.
Lemma 4 If, in Assumption 2, the kernel function is $k$ times differentiable, then $\bar\alpha(\theta)$, the optimal solution of (B.1), is a $PC^k$ function.
Proof. From Definition 1, a function is $PC^k$ if, at any point, there is a neighborhood in which the function coincides with members of a finite collection of $k$-times differentiable functions. Below we construct finitely many $k$-times differentiable functions of which $\bar\alpha(\theta)$ is composed for any $\theta$.
As $\bar\alpha$ is an optimal solution of (B.1), its KKT optimality condition (3.1) can be rewritten as
\begin{align}
(\bar Q\bar\alpha)_i + p_i + z_ib &= 0, \quad\text{if } \bar\alpha_i > 0,\nonumber\\
(\bar Q\bar\alpha)_i + p_i + z_ib &\ge 0, \quad\text{if } \bar\alpha_i = 0. \tag{G.1}
\end{align}
Assume $\bar\alpha$ is obtained under a parameter set $\theta^1$ and is at the face $\mathcal F$ and $\mathcal L$ defined in (E.2). We first consider the case where $\mathcal F \ne \emptyset$. The KKT conditions of the free support vectors can be written as:
\[
\bar Q_{\mathcal F\mathcal F}\bar\alpha_{\mathcal F} + \bar Q_{\mathcal F\mathcal L}\bar\alpha_{\mathcal L} + bz_{\mathcal F} = -p_{\mathcal F}. \tag{G.2}
\]
If we combine (G.2) and the linear constraint $z^T\bar\alpha = 0$, then $\bar\alpha_{\mathcal F}$ and $b$ are the solution of the following linear system:
\[
\begin{bmatrix}\bar Q_{\mathcal F\mathcal F} & z_{\mathcal F}\\ z_{\mathcal F}^T & 0\end{bmatrix}\begin{bmatrix}\bar\alpha_{\mathcal F}\\ b\end{bmatrix} = \begin{bmatrix}-p_{\mathcal F}\\ 0\end{bmatrix}. \tag{G.3}
\]
As $\mathcal F$ and $\mathcal L$ are the face of $\bar\alpha$, the above equation uses the fact that $\bar\alpha_i = 0$ for every $i \in \mathcal L$. From (G.3),
\[
\begin{bmatrix}\bar\alpha_{\mathcal F}\\ b\end{bmatrix} = M^{-1}h, \tag{G.4}
\]
where
\[
M = \begin{bmatrix}\bar Q_{\mathcal F\mathcal F} & z_{\mathcal F}\\ z_{\mathcal F}^T & 0\end{bmatrix}
\quad\text{and}\quad
h = \begin{bmatrix}-p_{\mathcal F}\\ 0\end{bmatrix}.
\]
Next, we build a function $\gamma_{\mathcal F}(\theta) = (M^{-1}h)_{\mathcal F}$, which is obtained by removing the last component of (G.4). We claim that $\gamma_{\mathcal F}(\theta)$ is a $k$-times differentiable function, since the matrix $M$ is invertible and both $M$ and $h$ are $k$-times differentiable functions of $\theta$. Furthermore, we can construct a $k$-times differentiable function $\gamma(\theta)$ as follows:
\[
\gamma(\theta) = \begin{bmatrix}\gamma_{\mathcal F}(\theta)\\ 0_{\mathcal L}\end{bmatrix}, \tag{G.5}
\]
where $0_{\mathcal L}$ is the vector containing $|\mathcal L|$ zeros. For the other case, where $\mathcal F = \emptyset$, we can construct $\gamma(\theta) = 0_{\mathcal L}$, which is also a $k$-times differentiable function.
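The recovery of $(\bar\alpha_{\mathcal F}, b)$ from the linear system (G.3) can be illustrated with a small solve. The values of $\bar Q_{\mathcal F\mathcal F}$, $z_{\mathcal F}$, and $p_{\mathcal F}$ below are invented purely to exercise the linear algebra (a real problem would also have $\bar\alpha_{\mathcal F} > 0$), and the elimination helper is standard:

```python
def solve(A, rhs):
    """Plain Gaussian elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Toy face of size 2 (hypothetical numbers).
QFF = [[1.5, 0.3], [0.3, 1.5]]     # Q bar restricted to the free set
zF = [-1.0, 1.0]                   # signs taken from z = [-e; e]
pF = [0.5, -0.2]

# Assemble M and h of (G.4) and solve (G.3).
M = [[QFF[0][0], QFF[0][1], zF[0]],
     [QFF[1][0], QFF[1][1], zF[1]],
     [zF[0], zF[1], 0.0]]
h = [-pF[0], -pF[1], 0.0]
aF0, aF1, bb = solve(M, h)

# Residual of the free-variable KKT condition (G.2): should be ~0.
res = [QFF[i][0] * aF0 + QFF[i][1] * aF1 + bb * zF[i] + pF[i] for i in range(2)]
print(aF0, aF1, bb, res)
```

Because $M$ and $h$ vary smoothly with the parameters while the face is fixed, the solution map is as smooth as the kernel, which is exactly the argument behind $\gamma_{\mathcal F}(\theta)$.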
Notice that when $\theta = \theta^1$, $\gamma(\theta^1) = \bar\alpha(\theta^1)$. Moreover, for all parameters whose corresponding optimal solutions are at the same face, the $\bar\alpha$'s coincide with the values of a single $k$-times differentiable function. That is, for any parameter $\theta^2$ where $\bar\alpha(\theta^2)$ is at the same face as $\bar\alpha(\theta^1)$, $\gamma(\theta^2) = \bar\alpha(\theta^2)$.
Next, we collect all possible functions like $\gamma(\theta)$, which together cover $\bar\alpha(\theta)$ at any value of $\theta$. As $l$ is the number of training data, there is only a finite number of possible faces:
\[
\mathcal F_i, \mathcal L_i,\quad i = 1, \dots, N, \tag{G.6}
\]
where $N \le 2^{2l}$. For each face, we construct a function $\gamma_i(\theta)$ which, following the construction above, is $k$-times differentiable. Therefore, for any $\theta^0$ we have
\[
\bar\alpha(\theta^0) \in \{\gamma_1(\theta^0), \gamma_2(\theta^0), \dots, \gamma_N(\theta^0)\}.
\]
Since $\bar\alpha(\theta)$ is continuous by Theorem 3, $\bar\alpha(\theta) \in PC^k$ follows from Definition 1. $\Box$
Thus, using Theorem 5, the radius margin bound is a $PC^k$ function. Next, we discuss $\tilde S_t^2$. Since $\bar\alpha \in PC^k$, for every $\theta^0$ there exist a neighborhood $W$ of $\theta^0$ and a finite collection of $C^k$ functions $\bar\alpha_i$, $i = 1, \dots, N$, such that
\[
\bar\alpha(\theta) \in \{\bar\alpha_1(\theta), \dots, \bar\alpha_N(\theta)\},\quad \forall\theta \in W.
\]
For each function $\bar\alpha_i$, we construct a $C^k$ function $\tilde S_i$ by (2.11). Note that the first term of (2.11) is a $C^k$ function from Assumption 2, and only the second term involves $\bar\alpha$. Then,
\[
\tilde S_t^2(\theta) \in \{\tilde S_1(\theta), \dots, \tilde S_N(\theta)\},\quad \forall\theta \in W.
\]
Since we have shown that $\tilde S_t^2$ is continuous, $\tilde S_t^2 \in PC^k$. Furthermore, from Theorem 5, $\sum_{t=1}^l(\alpha_t + \alpha^*_t)\tilde S_t^2$ is a $PC^k$ function.
In the above analysis, if, around a given parameter set, all $(\alpha, \alpha^*)$ share the same face, then the bound coincides with a differentiable function in a neighborhood of this parameter set. Thus, the bound is differentiable there.