Simple Probabilistic Predictions for Support Vector Regression

Chih-Jen Lin

and Ruby C. Weng

Department of Computer Science

National Taiwan University, Taipei 106, Taiwan

(cjlin@csie.ntu.edu.tw)

Department of Statistics

National Chengchi University, Taipei 116, Taiwan

Abstract

Support vector regression (SVR) has been popular in the past decade, but it provides only an estimated target value instead of predictive probability intervals. Many works have addressed this issue, but sometimes the SVR formulation must be modified. This paper presents a rather simple and direct approach to constructing such intervals. We assume that the conditional distribution of the target value depends on its input only through the predicted value, and propose to model this distribution by simple functions. Experiments show that the proposed approach gives predictive intervals whose coverage is competitive with that of Bayesian SVR methods.

I. INTRODUCTION

In the past decade support vector regression (SVR) [15], [12] has been popular for regression problems. SVR provides only an estimated target value; however, the statement that the future value falls in an interval with a specified probability is more informative. This paper aims to construct predictive intervals for the future values.

For conventional linear regression, the prediction interval has been well developed; for example, see [16] for the Gaussian noise case and [3], [14] for the non-Gaussian case. SVR differs from conventional regression in that it maps input data into a high dimensional reproducing kernel Hilbert space and uses an ε-insensitive loss function. As a result, SVR has a sparse representation of solutions, and hence is relatively fast in training/testing. However, due to these differences, the existing methods for constructing prediction intervals cannot be applied. Recently, Bayesian interpretations of SVR have been developed [6], [4], [2] along the lines of Bayesian techniques for neural networks [8] and for SVM classification [13], [11]. Using a Bayesian framework, one can determine parameters in SVR by maximizing an evidence function, and at the same time derive an error bar for prediction.

Some of these Bayesian approaches perform well, but in several situations they cannot be applied. For example, they may modify the SVR formulation, so it is more difficult to use existing SVR software. In addition, some users may prefer other methods (e.g., cross validation) for parameter selection. As the best parameters are then not obtained by maximizing the Bayesian evidence function, the Bayesian error bar is not applicable. In this article, we propose a rather simple approach to construct predictive intervals under given parameters. The key ideas are to assume that the conditional distribution of the target value depends on its input only through the predicted value, and to model this distribution by some simple functions. To begin, we employ cross validation (CV) to obtain a set of out-of-sample regression residuals from the training data. These residuals are supposed to provide information regarding the distribution of prediction errors. Then, as the prediction errors are usually symmetric and concentrated around zero, we fit the residuals with zero-mean Gaussian and Laplace families. The most powerful scale-invariant test is conducted to select between the Gaussian and Laplace families. After selecting the family, the final model is determined by using the maximum likelihood estimate of the scale parameter.

The assumption that the distribution of the target value depends on its input only through the predicted value is somewhat restrictive. However, it often works well in practice or can provide a crude estimate for an initial analysis. For data whose distribution strongly depends on the input variables, we can cluster the data into different groups and apply the proposed technique to each group.

Though an error bar for prediction is a natural by-product under the Bayesian framework, the performance of such error bar estimation has not been fully investigated in the literature. In this paper, we evaluate the Bayesian approaches and our proposed methods by measuring the difference between the counted and the expected numbers of future data points lying in the interval with pre-specified probabilities. This paper is organized as follows. Section II introduces the methods and justifies their validity. Section III briefly reviews SVR and its Bayesian interpretation. Experiments and analysis on real-world data sets are in Sections IV and V, respectively. Section VI gives concluding remarks.

II. THE PROPOSED APPROACH

In regression problems, we are given a set of training data D = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, . . . , l}. We suppose that the data are collected from the model
\[
y_i = f(x_i) + \delta_i, \qquad (1)
\]
where f(x) is the underlying function and the δ_i are independent and identically distributed random noises.

Given a test point x, the distribution of y given x and D, P(y | x, D), allows one to draw probabilistic inferences about y; for example, one can construct a predictive interval I = I(x) such that y ∈ I with a pre-specified probability. Denoting by f̂ the estimated function based on D (obtained by SVR or other methods trained on D), ζ = ζ(x) ≡ y − f̂(x) is the out-of-sample residual (or prediction error), and y ∈ I is equivalent to ζ ∈ I − f̂(x). We propose to model the distribution of ζ based on a set of out-of-sample residuals {ζ_i}, i = 1, . . . , l, obtained from the training data D. The ζ_i's are generated by first conducting a k-fold cross validation to get f̂_j, j = 1, . . . , k, and then setting ζ_i ≡ y_i − f̂_j(x_i) for (x_i, y_i) in the jth fold. It is conceptually clear that the distribution of the ζ_i's may resemble that of the prediction error ζ.
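As an illustration of this step, the following minimal sketch generates the out-of-sample residuals ζ_i by k-fold cross validation. It assumes scikit-learn's SVR (which wraps LIBSVM) as a stand-in regressor; the kernel parameters shown are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cv_residuals(X, y, C=1.0, gamma=0.25, epsilon=0.1, k=5, seed=0):
    """Return zeta_i = y_i - fhat_j(x_i), where fhat_j is trained without the jth fold."""
    zeta = np.empty(len(y), dtype=float)
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
        model.fit(X[train_idx], y[train_idx])
        zeta[test_idx] = y[test_idx] - model.predict(X[test_idx])
    return zeta
```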

To further illustrate this approach, in Figure 1 we investigate the ζ_i's from a real data set (cpusmall). Basically, a discretized distribution such as the histogram of the ζ_i's could model ζ directly, but doing so is more complicated because all ζ_i's must be retained. On the contrary, distributions like Gaussian and Laplace, commonly used as noise models, require only location and scale parameters. In Figure 1 we plot the fitted curves using these two families together with the histogram of the ζ_i's. The figure shows that the distribution of the ζ_i's seems symmetric about zero and that both Gaussian and Laplace reasonably capture its shape. Thus, we propose to model ζ_i by zero-mean Gaussian and Laplace distributions, or equivalently, to model the conditional distribution of y given f̂(x) by a Gaussian or Laplace with mean f̂(x). To obtain the fitted curves, we first write the density functions of the zero-mean Laplace and Gaussian with scale parameter σ:

\[
\text{Laplace: } p(z) = \frac{1}{2\sigma} e^{-\frac{|z|}{\sigma}}; \qquad (2)
\]
\[
\text{Gaussian: } p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{z^2}{2\sigma^2}}. \qquad (3)
\]

Next, assuming that the ζ_i are independent, we can estimate the scale parameter by maximizing the likelihood. For Laplace, the maximum likelihood estimate is
\[
\sigma = \frac{\sum_{i=1}^{l} |\zeta_i|}{l}, \qquad (4)
\]
and for Gaussian,
\[
\sigma^2 = \frac{\sum_{i=1}^{l} \zeta_i^2}{l}. \qquad (5)
\]
Then we obtain the fitted curves by plugging these estimates into (2) and (3), respectively. In the rest of the paper we refer to the two methods as “Lap” and “Gau.” As we conduct CV to obtain the ζ_i, (4) is essentially the mean absolute error (MAE) of CV, and (5) the mean squared error (MSE).
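A minimal sketch of the maximum likelihood scale estimates (4) and (5), assuming the residuals ζ_i are stored in a NumPy array; as noted above, these are just the CV mean absolute error and mean squared error.

```python
import numpy as np

def laplace_scale(zeta):
    # Equation (4): sigma = (1/l) * sum |zeta_i|, i.e. the CV MAE
    return np.mean(np.abs(zeta))

def gaussian_scale(zeta):
    # Equation (5): sigma^2 = (1/l) * sum zeta_i^2, i.e. the CV MSE
    return np.sqrt(np.mean(zeta ** 2))   # returns sigma, not sigma^2
```

With the cpusmall residuals of Figure 1, these estimates give σ = 2.7948 and σ² = 19.4106, respectively.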

In theory, the distribution of ζ may depend on the input x, and accordingly the length of the predictive interval for ζ with a pre-specified coverage probability may vary from case to case, reflecting the fact that the prediction variance varies with the input value. Though our interval for ζ is free of x, and hence does not reflect this property, it can be justified if we consider the probability to be taken over all possible input values. It is also worth noting that our modeling shares some similarities with that in [10]. In the context of classification, [10] proposes to model the probability output, P(y = 1 | f̂(x)), by a sigmoid function of f̂. Both [10] and our approach assume that the conditional distribution of y given x depends on x only through f̂(x). Both propose to model this conditional distribution by simple parametric functions, and then estimate the corresponding parameters by the maximum likelihood principle.

Regarding the selection between Gaussian and Laplace, Figure 1 shows that Laplace seemingly outperforms Gaussian for the problem cpusmall. Though a graph like Figure 1 does provide information as to which family better captures the ζ_i's, such visual inspection is not efficient and can be subjective. In fact, one can select between Laplace and Gaussian without even fitting the two models. The following theorem [7, Chapter 6] gives the most powerful test among all tests that are invariant under scale transformations.


Theorem 1: Suppose that Z_1, . . . , Z_l are a random sample from a distribution with density σ^{-l} p(z_1/σ) · · · p(z_l/σ), where p(z) is either zero for z < 0 or symmetric about zero. The most powerful scale-invariant test for testing H_0: p = p_0 against H_1: p = p_1 rejects H_0 when
\[
\frac{\int_0^{\infty} \tau^{\,l-1} p_1(\tau z_1) \cdots p_1(\tau z_l)\, d\tau}{\int_0^{\infty} \tau^{\,l-1} p_0(\tau z_1) \cdots p_0(\tau z_l)\, d\tau} > c.
\]

Here “most powerful” means that when H_1 is true, the test has the highest probability of rejecting H_0. The case of Gaussian versus Laplace [5] is a special case of the theorem.

Corollary 2: (Gaussian vs. Laplace) For p_0(z) = e^{-z²/2}/√(2π) and p_1(z) = e^{-|z|}/2, the test of Theorem 1 reduces to rejecting H_0 when
\[
\frac{\sqrt{\sum_i Z_i^2}}{\sum_i |Z_i|} > c.
\]
At significance level α, the constant c satisfies
\[
P_0\!\left(\frac{\sqrt{\sum_i Z_i^2}}{\sum_i |Z_i|} > c\right) = \alpha, \qquad (6)
\]
where P_0 is the probability under H_0; that is, c is determined so that the probability of rejecting H_0 is α when H_0 is actually true. Typically α is chosen to be 0.05. Now we briefly summarize the proposed procedure:

1) Generate prediction errors ζ_1, . . . , ζ_l by cross validation on the training data.

2) Use Corollary 2 to test Gaussian against Laplace. Once the decision is made, we determine the scale parameter σ using the maximum likelihood estimate ((4) or (5)).

Note that the above procedure may also be applied to other regression techniques, though this paper mainly focuses on applying it to SVR.
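A sketch of step 2 follows, assuming the test of Corollary 2: the statistic sqrt(Σζ_i²)/Σ|ζ_i| is compared with a constant c which, because the statistic is scale invariant, can be calibrated by Monte Carlo simulation under the Gaussian null hypothesis. The Monte Carlo calibration itself is not described in the paper and is only one possible way to obtain c.

```python
import numpy as np

def test_statistic(z):
    # sqrt(sum z_i^2) / sum |z_i|, the statistic of Corollary 2
    return np.sqrt(np.sum(z ** 2)) / np.sum(np.abs(z))

def select_family(zeta, alpha=0.05, n_sim=20000, seed=0):
    """Return 'Laplace' if H0 (Gaussian) is rejected at level alpha, else 'Gaussian'."""
    rng = np.random.default_rng(seed)
    l = len(zeta)
    # Calibrate c under H0: the statistic does not depend on the scale,
    # so standard normal samples of the same size suffice.
    sims = np.array([test_statistic(rng.standard_normal(l)) for _ in range(n_sim)])
    c = np.quantile(sims, 1.0 - alpha)
    return "Laplace" if test_statistic(zeta) > c else "Gaussian"
```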

III. SVR AND ITS BAYESIAN INTERPRETATION: A REVIEW

The classical SVR considers the ε-insensitive loss function

\[
\ell_{\epsilon}(\delta) =
\begin{cases}
-\delta - \epsilon & \text{if } \delta < -\epsilon,\\
0 & \text{if } \delta \in [-\epsilon, \epsilon],\\
\delta - \epsilon & \text{if } \delta > \epsilon,
\end{cases}
\qquad (7)
\]
and solves
\[
\begin{aligned}
\min_{w, b, \xi, \xi^*} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & y_i - f(x_i) \le \epsilon + \xi_i, \qquad (8)\\
& f(x_i) - y_i \le \epsilon + \xi_i^*,\\
& \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, l.
\end{aligned}
\]


Fig. 1. Histogram of ζ_i's from the problem cpusmall (using the parameters (C, γ, ε) listed in the last row of Table II(a)). The x-axis is ζ_i obtained using five-fold CV and the y-axis is the normalized number of data in each bin of width 1. The Laplace fit uses the parameter σ = 2.7948 from (4), which is the cross-validation mean absolute error. The Gaussian fit uses the parameter σ² = 19.4106 from (5), which is the cross-validation mean squared error. Note that four of the |ζ_i|'s exceed 20, with the maximum being 51.5, but the x-axis is cut at ±20 for visual clarity.

Here f(x_i) = w^T φ(x_i) + b and data are mapped to a higher dimensional space by the function φ. Similar to support vector classification, as w may be a huge vector variable, we solve the dual problem:
\[
\begin{aligned}
\min_{\alpha, \alpha^*} \quad & \frac{1}{2} (\alpha - \alpha^*)^T K (\alpha - \alpha^*) + \epsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \qquad (9)\\
& 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \dots, l,
\end{aligned}
\]

where K is the kernel matrix with K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j). For example, the RBF kernel takes the form
\[
K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2). \qquad (10)
\]

In the literature of Bayesian SVR, f = [f(x_1), . . . , f(x_l)]^T is regarded as a random vector whose prior is assumed to be a zero-mean Gaussian process with covariance matrix Σ, and the likelihood of the data given f is assumed to be
\[
p(D \mid f) = \prod_{i=1}^{l} p(\delta_i) \propto \exp\Bigl(-C \cdot \sum_{i=1}^{l} \ell(\delta_i)\Bigr), \qquad (11)
\]
where δ_i = y_i − f(x_i), C is a positive parameter, and ℓ(·) is the loss function. The parameters in the prior and the likelihood are determined by maximizing the evidence function
\[
p(D \mid \theta) = \int p(D \mid f, \theta)\, p(f \mid \theta)\, df.
\]
If we take Σ_ij = K_ij as in (10) and ℓ as the ε-insensitive loss function, then the hyperparameter is θ = (γ, C, ε), where γ comes from the prior of f and (C, ε) from the likelihood of the data given f.

[4] gives a Bayesian interpretation to the classical SVR formulation, but without the presence of the constant term b in the underlying function f. They then derive an approximation to the logarithm of the evidence function:
\[
\ln p(D \mid \theta) \approx -(\text{optimal objective value of (9)}) - \frac{1}{2} \ln \det(2\pi K_{F,F}) + l \ln \frac{C}{2(\epsilon C + 1)} + \sum_{i \in F} \ln \frac{C}{|\alpha_i + \alpha_i^*|\,\bigl(C - |\alpha_i + \alpha_i^*|\bigr)}, \qquad (12)
\]

where α, α* is the optimal solution of the dual SVR (9) under a given θ, F is the set of their free components:
\[
F \equiv \{ i \mid 0 < \alpha_i < C \ \text{or}\ 0 < \alpha_i^* < C \}, \qquad (13)
\]
and K_{F,F} is the sub-matrix of the kernel matrix corresponding to F.

Suppose that a test case x is given for which the target value y, corrupted with noise δ, is unknown. Applying the ε-insensitive loss to (11), one has the density of δ,
\[
p(\delta) = \frac{C}{2(\epsilon C + 1)} \exp(-C \ell_{\epsilon}(\delta)), \qquad (14)
\]
from which we see that δ has mean zero and variance
\[
\sigma_{\delta}^2 = \frac{2}{C^2} + \frac{\epsilon^2 (\epsilon C + 3)}{3(\epsilon C + 1)}. \qquad (15)
\]

[4] shows that the conditional probability distribution of f(x) given D is
\[
p(f(x) \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_t} \exp\!\left( -\frac{(f(x) - \hat{f}(x))^2}{2\sigma_t^2} \right), \qquad (16)
\]
where
\[
\sigma_t^2 = K(x, x) - K_{F,x}^T K_{F,F}^{-1} K_{F,x},
\]
with K_{F,x} being the vector containing all K(x_i, x), i ∈ F, and
\[
\hat{f}(x) = (\alpha_F - \alpha_F^*)^T \cdot K_{F,x}
\]
is the decision function. Consequently, the prediction variance is
\[
\mathrm{var}\bigl(y - \hat{f}(x)\bigr) = \sigma_{\delta}^2 + \sigma_t^2,
\]
which is the square of the so-called “error bar for prediction.” The main advantages of Bayesian approaches are

1) parameter and feature selection can be done simultaneously by maximizing the evidence function, and 2) the error bar for prediction can be formulated.


TABLE I
DATA SET STATISTICS: FOR space ga, abalone, add10, AND cpusmall, RANDOM SUBSETS OF 1,000 INSTANCES ARE USED.

Problem     #data  #features
pyrim         74      27
triazines    186      60
bodyfat      252      14
mpg          392       7
housing      506      13
add10       1000      10
cpusmall    1000      12
space ga    1000       6
abalone     1000       8

The performance depends on the quality of the evidence function. To evaluate the performance of this error bar estimation, the distribution of ζ = y − f̂(x) is required. By decomposing ζ into two independent components,
\[
y - \hat{f}(x) = (y - f(x)) + (f(x) - \hat{f}(x)), \qquad (17)
\]
we can obtain the distribution of ζ by the convolution of the two densities (14) and (16).

[2] argues that the lack of smoothness of the ε-insensitive loss function may cause inaccuracy in the approximation of the evidence function, and hence in the inference about θ. Thus, they propose a soft insensitive loss function and solve a modified SVR:
\[
\begin{aligned}
\min_{w, \xi, \xi^*} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{l} \bigl(\psi(\xi_i) + \psi(\xi_i^*)\bigr) \\
\text{subject to} \quad & y_i - w^T \phi(x_i) \le (1 - \beta)\epsilon + \xi_i, \qquad (18)\\
& w^T \phi(x_i) - y_i \le (1 - \beta)\epsilon + \xi_i^*,\\
& \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, l,
\end{aligned}
\]
where
\[
\psi(\pi) =
\begin{cases}
\dfrac{\pi^2}{4\beta\epsilon} & \text{if } \pi \in [0, 2\beta\epsilon),\\[4pt]
\pi - \beta\epsilon & \text{if } \pi \in [2\beta\epsilon, \infty).
\end{cases}
\]
They derive an approximation to the logarithm of the evidence function:
\[
\ln p(D \mid \theta) \approx -(\text{optimal objective value of (18)}) - \frac{1}{2} \ln \det\!\Bigl(I + \frac{C}{2\beta\epsilon} K_{F,F}\Bigr) - l \ln Z_s, \qquad (19)
\]


TABLE II
DETAILED INFORMATION REGARDING FIVE TRAINING/TESTING SPLITS OF cpusmall. THE EXPECTED NUMBER OF COVERAGE IS (#TEST SET)×(80%) = 160. THE METHOD LAP∗ IS PROPOSED AND DESCRIBED IN SECTION IV.

C, γ, ε             Gau  Lap  Lap∗  BSVR1
64.0, 0.25, 0.500   178  174  170    23
64.0, 0.25, 0.500   176  170  164    24
64.0, 0.25, 0.004   178  166  161     1
64.0, 0.25, 0.500   167  159  158    17
64.0, 0.25, 0.250   181  169  166    12
(a) Best parameters based on CV and the numbers of test instances covered.

C, γ, ε             Gau  Lap  Lap∗  BSVR1
0.5, 0.06, 0.500    193  187  169    81
0.5, 0.06, 0.500    195  183  168    89
0.5, 0.06, 0.500    193  181  167    84
0.5, 0.06, 0.500    188  178  161    90
0.5, 0.06, 0.500    191  180  160    77
(b) Best parameters based on maximizing (12) and the numbers of test instances covered.

C, κ, κ0, κb, ε                  Gau  Lap  Lap∗  BSVR2
0.50, 0.68, 335.9, 102.2, 0.057  175  167  162   164
0.43, 0.66, 338.2, 102.0, 0.055  181  170  168   165
0.49, 0.70, 329.7, 102.2, 0.057  178  164  162   177
0.45, 0.68, 274.0, 103.0, 0.056  174  168  167   162
0.54, 0.67, 314.3, 101.6, 0.054  182  165  163   171
(c) Best parameters based on maximizing (19) and the numbers of test instances covered.

where I is the identity matrix, F has the same form as (13) but with α, α* replaced by the optimal solution of the dual of (18), and
\[
Z_s = 2(1 - \beta)\epsilon + 2\sqrt{\frac{\pi\beta\epsilon}{C}}\,\mathrm{erf}\bigl(\sqrt{C\beta\epsilon}\bigr) + \frac{2}{C} e^{-C\beta\epsilon}
\quad \text{with} \quad
\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt.
\]
Their conditional distribution of f(x) given the data has the same form as (16), but with
\[
\sigma_t^2 = K(x, x) - K_{F,x}^T \Bigl(\frac{2\beta\epsilon}{C} I + K_{F,F}\Bigr)^{-1} K_{F,x}.
\]

TABLE III
DETAILED INFORMATION REGARDING FIVE TRAINING/TESTING SPLITS OF housing. THE EXPECTED NUMBER OF COVERAGE IS (#TEST SET)×(80%) = 81.

C, γ, ε             Gau  Lap  Lap∗  BSVR1
32.0, 0.12, 0.031   87   81   78    50
8.0, 0.25, 0.062    90   86   84    97
4.0, 0.25, 0.062    93   91   87   100
64.0, 0.25, 0.125   84   80   79    78
8.0, 0.25, 0.062    93   88   86   100
(a) Best parameters based on CV and the numbers of test instances covered.

C, γ, ε             Gau  Lap  Lap∗  BSVR1
32.0, 0.50, 0.004   87   82   80    70
32.0, 0.50, 0.008   88   86   86    74
16.0, 0.50, 0.016   92   90   90    86
32.0, 0.50, 0.004   91   89   89    71
16.0, 0.50, 0.004   86   82   82    90
(b) Best parameters based on maximizing (12) and the numbers of test instances covered.

C, κ, κ0, κb, ε                  Gau  Lap  Lap∗  BSVR2
13.91, 0.72, 0.2, 73.2, 0.050    91   83   78    88
11.03, 0.54, 0.2, 81.7, 0.045    92   89   85    91
14.08, 0.48, 0.2, 60.1, 0.077    92   90   88    92
10.76, 0.41, 0.2, 81.4, 0.041    93   86   79    88
13.73, 0.43, 0.2, 66.5, 0.058    95   86   83    85
(c) Best parameters based on maximizing (19) and the numbers of test instances covered.

In contrast to (7), the loss function becomes
\[
\ell_{\epsilon,\beta}(\delta) =
\begin{cases}
-\delta - \epsilon & \text{if } \delta \in (-\infty, -(1+\beta)\epsilon),\\[2pt]
\dfrac{(\delta + (1-\beta)\epsilon)^2}{4\beta\epsilon} & \text{if } \delta \in [-(1+\beta)\epsilon, -(1-\beta)\epsilon],\\[6pt]
0 & \text{if } \delta \in (-(1-\beta)\epsilon, (1-\beta)\epsilon),\\[2pt]
\dfrac{(\delta - (1-\beta)\epsilon)^2}{4\beta\epsilon} & \text{if } \delta \in [(1-\beta)\epsilon, (1+\beta)\epsilon],\\[6pt]
\delta - \epsilon & \text{if } \delta \in ((1+\beta)\epsilon, \infty).
\end{cases}
\qquad (21)
\]


TABLE IV
AVERAGE ABSOLUTE DIFFERENCE ON NUMBER OF COVERAGES: USING CV FOR PARAMETER SELECTION.

Problem     #80%   Gau   Lap   Lap∗  Hist  BSVR1
pyrim       11.8   1.4   1.2   1.8   2.0     2.8
triazines   29.8   2.7   2.1   2.1   1.5     5.2
bodyfat     40.3   9.3   7.9   3.7   2.0     9.1
mpg         62.7   4.3   2.4   2.8   2.3    14.5
housing     81.0   8.4   4.6   3.7   5.0    17.5
add10      160.0   7.8   6.6   6.6   6.8   121.2
cpusmall   160.0  16.0   8.0   4.6   5.8   144.6
space ga   160.0   6.0   6.8   6.8   5.8    43.4
abalone    160.0  13.2   6.4   7.2   8.2   135.6
(a) Pre-specified probability = 80%.

Problem     #95%   Gau   Lap   Lap∗  Hist  BSVR1
pyrim       14.1   0.7   0.5   0.7   0.7     0.7
triazines   35.3   1.1   0.9   0.9   0.9     2.2
bodyfat     47.9   1.7   1.3   0.9   1.1     1.7
mpg         74.5   0.7   0.6   0.6   0.7     3.7
housing     96.1   2.2   2.2   2.2   2.2     7.8
add10      190.0   3.6   7.8   7.8   3.8   138.8
cpusmall   190.0   4.0   3.4   2.8   1.0   168.4
space ga   190.0   3.4   3.2   3.2   3.2    40.6
abalone    190.0   3.8   2.6   2.8   4.2   157.8
(b) Pre-specified probability = 95%.

Thus the density function of δ is
\[
p(\delta) = \frac{1}{Z_D} \exp(-C \ell_{\epsilon,\beta}(\delta)), \qquad (22)
\]
where Z_D = ∫ exp(−C ℓ_{ε,β}(δ)) dδ. Using (11) and (21), σ²_δ is
\[
\frac{2}{Z_s} \left[ \frac{(1-\beta)^3 \epsilon^3}{3}
+ \sqrt{\frac{\pi\beta\epsilon}{C}} \Bigl( \frac{2\beta\epsilon}{C} + (1-\beta)^2 \epsilon^2 \Bigr) \mathrm{erf}\bigl(\sqrt{C\beta\epsilon}\bigr)
+ \frac{4(1-\beta)\beta\epsilon^2}{C}
+ \Bigl( \frac{(1-\beta)^2 \epsilon^2}{C} + \frac{2(1+\beta)\epsilon}{C^2} + \frac{2}{C^3} \Bigr) \exp(-C\beta\epsilon) \right] \qquad (23)
\]
and the prediction variance is σ²_δ + σ²_t.

An important difference in [2] is the use of the kernel
\[
K(x_i, x_j) = \kappa_0 \exp\Bigl(-\frac{\kappa}{2} \|x_i - x_j\|^2\Bigr) + \kappa_b. \qquad (24)
\]

TABLE V
AVERAGE ABSOLUTE DIFFERENCE ON NUMBER OF COVERAGES: MAXIMIZING BSVR1 EVIDENCE FUNCTION FOR PARAMETER SELECTION.

Problem     #80%   Gau   Lap   Lap∗  Hist  BSVR1
pyrim       11.8   1.4   0.6   0.8   1.0    2.4
triazines   29.8   1.8   1.5   1.5   1.7    1.7
bodyfat     40.3   6.1   3.7   2.4   2.9    8.9
mpg         62.7   5.9   3.7   4.3   4.5    5.0
housing     81.0   7.8   4.8   4.8   5.4    8.2
add10      160.0   9.0   8.8   8.8   9.6   13.6
cpusmall   160.0  32.0  21.8   5.0   8.4   75.8
space ga   160.0   7.6   4.6   4.6   5.4   17.4
abalone    160.0  14.4   7.0   5.8   5.8   10.4
(a) Pre-specified probability = 80%.

Problem     #95%   Gau   Lap   Lap∗  Hist  BSVR1
pyrim       14.1   0.8   0.8   0.7   0.5    0.5
triazines   35.3   1.8   1.6   1.6   2.4    2.5
bodyfat     47.9   1.9   1.3   1.9   2.7    2.1
mpg         74.5   1.4   1.2   0.8   1.4    4.3
housing     96.1   2.4   2.6   2.2   1.0    5.8
add10      190.0   4.6   8.0   8.0   5.8    4.8
cpusmall   190.0   4.4   2.8   4.0   2.6   44.8
space ga   190.0   2.8   3.2   3.2   3.2    5.0
abalone    190.0   3.4   3.4   3.6   3.2    1.4
(b) Pre-specified probability = 95%.

Thus, instead of one parameter γ in the RBF kernel, here three have to be decided, and the hyperparameter is θ = (κ0, κ, κb, C, ε).

In the rest of this paper we refer to the Bayesian methods in [4] and [2] as BSVR1 and BSVR2, respectively.

IV. EXPERIMENTS

We compare the proposed approach with the two Bayesian methods reviewed in Section III. Several regression problems are considered: problems housing, abalone, mpg, pyrim, and triazines are from the Statlog collection [9]; bodyfat and space ga are from StatLib (http://lib.stat.cmu.edu/datasets); problems add10 and cpusmall are from the Delve archive (http://www.cs.toronto.edu/˜delve). For these problems, some data entries have missing attributes, so we remove them before conducting experiments. Note that the attribute values of these problems are scaled to [−1, +1], but the target values are kept the same. To save computational time, for problems with more than 1,000 instances, only a random subset of 1,000 points is used.


TABLE VI
ERROR ON COVERAGE: MAXIMIZING BSVR2 EVIDENCE FUNCTION FOR PARAMETER SELECTION.

Problem     #80%   Gau   Lap   Lap∗  Hist  BSVR2
pyrim       11.8   2.2   2.0   1.6   2.2    1.6
triazines   29.8   3.3   2.3   2.3   1.9    2.3
bodyfat     40.3   9.1   7.7   3.2   3.0    9.1
mpg         62.7   5.5   3.3   2.7   2.3    3.7
housing     81.0  11.6   5.8   3.5   2.7    7.8
add10      160.0  12.6   9.6   9.6   9.6   10.6
cpusmall   160.0  18.0   6.8   4.4   2.6    7.8
space ga   160.0   9.0   5.4   4.8   4.8    4.8
abalone    160.0  13.6   5.0   4.0   9.2   11.4
(a) Difference to 80% coverage.

Problem     #95%   Gau   Lap   Lap∗  Hist  BSVR2
pyrim       14.1   0.7   0.6   0.5   1.0    0.7
triazines   35.3   1.0   1.0   1.0   1.4    1.3
bodyfat     47.9   1.7   1.3   0.9   0.9    1.5
mpg         74.5   0.7   0.6   0.6   1.4    1.8
housing     96.1   1.4   1.0   1.6   2.0    1.5
add10      190.0   4.0   3.6   3.6   4.6    5.0
cpusmall   190.0   3.6   2.8   3.0   2.2    2.4
space ga   190.0   3.4   2.8   2.2   2.0    2.8
abalone    190.0   2.8   2.8   3.2   2.0    3.8
(b) Difference to 95% coverage.

The numbers of data instances and features are reported in Table I.

In the experiment, each data set is separated into five folds, and sequentially one fold is used for testing while the remaining folds are used for training. To have a good model, parameter selection is conducted on the training set. We consider the following methods:

1) Cross validation: (C, γ, ε) ∈ {2⁻¹, 2⁰, . . . , 2⁶} × {2⁻⁸, 2⁻⁷, . . . , 2¹} × {2⁻⁸, 2⁻⁷, . . . , 2¹} are tried, and the combination with the lowest five-fold CV mean squared error is used to train the model for testing (a code sketch of this search is given after this list). For this setting, BSVR2 is not compared, as its implementation uses (24), a kernel with more parameters.

2) Maximization of the evidence function P(D | θ) of BSVR1: We search the same space of (C, γ, ε) used in 1) and choose the setting that gives the maximal value of the evidence function. As when using CV for parameter selection, BSVR2 is not compared.

3) Maximization of the evidence function P(D | θ) of BSVR2: P(D | θ) is maximized by the gradient-based implementation used in [2]. For this setting we did not compare BSVR1, as its kernel implementation would have to be changed. On the contrary, it is still easy to apply the proposed approaches, as their implementations are independent of the parameter selection method.
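For concreteness, here is a minimal sketch of the grid search in method 1), using scikit-learn's SVR as a stand-in for LIBSVM and CV mean squared error as the selection criterion; the variable names X_train and y_train are illustrative only.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C":       [2.0 ** k for k in range(-1, 7)],   # 2^-1, ..., 2^6
    "gamma":   [2.0 ** k for k in range(-8, 2)],   # 2^-8, ..., 2^1
    "epsilon": [2.0 ** k for k in range(-8, 2)],   # 2^-8, ..., 2^1
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
# search.fit(X_train, y_train); search.best_params_ then holds the chosen (C, gamma, epsilon).
```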

Implementation details and experimental results are given in the following subsections.

A. Implementation Details

Given a pre-specified probability 1 − 2s, the performance of the various approaches is evaluated by comparing the number of test data lying in their prediction intervals with the expected number, (1 − 2s) × (# test set). For each (x, y) in the test set, the prediction interval for y is (f̂(x) − p_s, f̂(x) + p_s), where p_s is the upper sth percentile of the corresponding probability distribution of ζ (= y − f̂(x)). Therefore, we simply count the number of ζ in the test set lying in [−p_s, p_s] and compare this number with its expected value.

For a zero-mean symmetric variable Z with density p(z), p_s can be determined by solving
\[
\int_{-\infty}^{p_s} p(z)\, dz = 1 - s.
\]
For example, a Gaussian with p(z) defined in (3) has p_s = σΦ^{-1}(1 − s), where Φ(x) ≡ ∫_{-∞}^{x} (1/√(2π)) e^{−z²/2} dz, and hence the prediction interval for ζ is
\[
\bigl(-\sigma \Phi^{-1}(1 - s),\ \sigma \Phi^{-1}(1 - s)\bigr); \qquad (25)
\]
a Laplace with p(z) as in (2) has p_s = −σ ln(2s), and the resulting interval is
\[
\bigl(\sigma \ln(2s),\ -\sigma \ln(2s)\bigr). \qquad (26)
\]
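A sketch of turning the fitted scale into a (1 − 2s) prediction interval via (25) and (26); it assumes SciPy's norm.ppf for Φ⁻¹ and takes the point prediction f̂(x) and the estimated σ as inputs.

```python
import numpy as np
from scipy.stats import norm

def prediction_interval(f_hat, sigma, s=0.1, family="Laplace"):
    if family == "Gaussian":
        p_s = sigma * norm.ppf(1.0 - s)      # interval (25)
    else:
        p_s = -sigma * np.log(2.0 * s)       # interval (26)
    return f_hat - p_s, f_hat + p_s
```

For example, with the Laplace fit for cpusmall in Figure 1 (σ = 2.7948) and s = 0.1, p_s = −2.7948 ln(0.2) ≈ 4.50.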

For the variable ζ in (17) under the BSVR1 setting, we denote by p_ζ the density of ζ and write
\[
1 - s = \int_{-\infty}^{p_s} p_{\zeta}(z)\, dz = \int_{-\infty}^{\infty} \int_{-\infty}^{p_s - f} p(\delta)\, d\delta\; p_{f \mid D}(f)\, df,
\]
where p(δ) and p_{f|D}(f) are as in (14) and (16). Since the integral of (14) over a given range can be derived explicitly, the convolution here reduces to a one-dimensional integration, and the percentile problem is then resolved by numerical integration. Similar treatment can be applied to BSVR2, with p(δ) in (14) replaced by (22). Note that for the Bayesian approaches the distribution of ζ depends on the input value x, and so does p_s. Therefore, the numerical integration needs to be carried out for each test instance.

For non-symmetric distributions like the discretized histogram, we simply sort the ζ_i's into the form ζ_(1) < ζ_(2) < · · · < ζ_(l), and then set the (1 − 2s)100% prediction interval for ζ as (ζ_(s·l), ζ_((1−s)·l)). Below we refer to this method as “Hist,” which will be compared with the other approaches as well.
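A sketch of the “Hist” interval, assuming the empirical s and (1 − s) quantiles of the CV residuals are used as a practical stand-in for the order statistics above (no symmetry assumption is made):

```python
import numpy as np

def hist_interval(zeta, f_hat, s=0.1):
    # Empirical s and (1 - s) quantiles of the CV residuals
    lo, hi = np.quantile(zeta, [s, 1.0 - s])
    return f_hat + lo, f_hat + hi
```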

Here are more implementation details. For searching the best parameters by CV and then calculating the coverages, we use the software LIBSVM [1], which solves the standard SVR (9). As BSVR1 uses the formulation without the bias term b, we modify LIBSVM for such a form and use it to evaluate the evidence function. For BSVR2, the implementation from [2] is adopted.


TABLE VII
COMPARISON BETWEEN CV AND BAYES FOR PARAMETER SELECTION: MSE AND AVERAGE NUMBER OF SUPPORT VECTORS.

               CV               BSVR1              BSVR2
Problem     MSE      #SV      MSE       #SV      MSE      #SV
pyrim       0.0467    44.2    0.0864     58.6    0.0490    38.0
triazines   0.1395    83.4    0.1734    147.0    0.1291    99.0
bodyfat     0.0027    62.6    0.0087    174.2    0.0029    82.0
mpg         0.0196   149.4    0.0237    300.0    0.0193   204.0
housing     0.0214   212.6    0.0239    382.0    0.0230   272.0
add10       1.9466   634.8    8.0780    743.0    2.8555   782.0
cpusmall   16.5119   722.2  180.0748    755.0   15.9123   786.0
space ga    0.0131   465.8    0.0128    623.6    0.0149   614.0
abalone     5.3678   684.2    7.8067    757.2    5.5827   779.0

B. Results and a New Method “Lap∗”

We first use the data set cpusmall to describe some experimental findings. Table II reports the results of parameter selection and the number of test instances lying in the predictive intervals for each of the five training/testing splits. Here the parameters are selected by the different strategies described earlier. In this experiment, the test size is 200 for each split and s is set to 0.1, so the coverage probability is 0.8 and the expected number of covered instances is 160. Recall from the description under Figure 1 that there are a couple of extreme values among the ζ_i's, with the maximum as large as 51.5. Consequently, the estimate of the scale parameter σ is quite large and the resulting prediction interval ((25) or (26)) is too wide. This explains why “Gau” and “Lap” tend to over-cover the test instances. Therefore, we propose to re-estimate the scale parameter after discarding the “very extreme” ζ_i's. Here ζ_i's are called “very extreme” if they exceed ±5 × (standard deviation of the distribution). The resulting coverages are shown in the fourth column, entitled “Lap∗.” We study another problem, housing, in Table III; the results are similar.
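A sketch of the “Lap∗” re-estimation, assuming that “standard deviation of the distribution” refers to the fitted Laplace model (whose standard deviation is √2·σ); using the empirical standard deviation of the ζ_i's instead would be a similar variant.

```python
import numpy as np

def lap_star_scale(zeta):
    sigma = np.mean(np.abs(zeta))             # initial Laplace MLE, equation (4)
    std = np.sqrt(2.0) * sigma                # standard deviation of the fitted Laplace
    kept = zeta[np.abs(zeta) <= 5.0 * std]    # discard the "very extreme" residuals
    return np.mean(np.abs(kept))              # re-estimated scale
```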

Tables IV–VI present the results for all the data sets using different parameter selection strategies. Here we simply report the average absolute difference over the five splits. For each split, the absolute difference is
\[
\bigl|\, \#\{\zeta \text{ in } [-p_s, p_s]\} - (1 - 2s) \times (\#\text{ test set}) \,\bigr|
\]
with s = 0.1 and 0.025, corresponding to coverage probabilities 1 − 2s = 80% and 95%. For example, the absolute differences for cpusmall using CV and “Lap” are 14, 10, 6, 1, and 9, as given in Table II(a), and hence the averaged value is 8.0. A by-product of this experiment is the comparison between the standard SVR and the two Bayesian SVRs. Table VII presents the averages of the MSEs and the numbers of support vectors over the five training/testing splits.


V. ANALYSIS

We first consider the results of cpusmall in Table II. Table II(a) shows that, with the model parameters selected by CV, BSVR1 severely under-covers the test instances (the expected number is 160). This phenomenon can be explained by the influence of the parameters on σ²_δ in (15). The parameter (C, ε) = (64, 0.5) leads to σ²_δ ≈ 4.8 × 10⁻⁴ + 8 × 10⁻², which is rather small and causes the prediction interval to be too narrow to cover most test instances. For the third training/testing split, (C, ε) = (64, 0.004) results in an even smaller σ²_δ, and hence an even worse coverage. Though the situation is substantially improved when choosing the model parameters as the maximizer of the Bayesian evidence function (12), the results are still well below the expected value 160. This indicates that the evidence function of BSVR1 is not accurate. The MSE shown in Table VII further confirms this conclusion. On the other hand, BSVR2 gives good results in Table II(c). Its MSE in Table VII is competitive with that obtained by CV for parameter selection. The performance of BSVR2 indicates that its Bayesian evidence function (19) is accurate, and the use of a more general kernel function may also help. Regarding the proposed methods, “Gau,” “Lap,” and “Lap∗” all produce reasonable coverages regardless of which method is used for parameter selection. In general, “Lap∗” improves upon “Lap,” and for all five training/testing splits “Lap” outperforms “Gau.” The results in Table III can be explained similarly.

For the other data sets using CV as the parameter selection strategy, Table IV shows that, as in the previous two tables, BSVR1 is worse than the others. The overall performance of “Lap” is better than that of “Gau.” Indeed, the results of the most powerful test (6) are in favor of Laplace for all nine data sets. This seems to indicate that, for these problems, y − f̂(x) behaves more like a Laplace than a Gaussian. The results of “Lap” and “Lap∗” are satisfactory. The advantage of using “Lap∗” over “Lap” is apparent on the data sets bodyfat and cpusmall in Table IV(a), where the averaged absolute errors are cut nearly in half. “Hist” produces nice results as well. The main reason is that “Hist” directly makes use of the information in the ζ_i's, instead of assuming a symmetric model centered at zero. However, as mentioned in Section II, it is more complex as all ζ_i's must be retained.

Tables V and VI report results with model parameters maximizing the Bayesian evidence functions (12) and (19), respectively. For BSVR1, the coverages are much better than those in Table IV; however, in general it is still the worst among all methods. Again, as in Table II(b), this can be explained by the choice of the parameter set and its effect on σ²_δ. For the proposed methods, as before, “Lap” outperforms “Gau” in almost all cases, “Lap∗” further improves upon “Lap,” and “Hist” is quite competitive.

In summary, the experimental results indicate that the quality of the Bayesian error bar depends on the underlying Bayesian evidence function. As our proposed methods do not rely on how the parameters are obtained, they are quite stable across different parameter selection methods.

Regarding the MSEs of CV and the two Bayesian methods, Table VII shows that CV and BSVR2 are better. This result is consistent with those in the previous tables: better target value prediction also leads to better probability intervals.


VI. CONCLUSIONS

In this paper, we propose a simple approach for probabilistic prediction that is suitable for the standard SVR. Our approach starts by generating out-of-sample residuals by cross validation, and then fits the residuals with simple parametric models such as Gaussian and Laplace. The most powerful scale-invariant test is applied to effectively test Gaussian against Laplace. We then compare the approach with the Bayesian SVR methods by evaluating the performance of the prediction intervals. The experiments on real-world problems show that our simple approach works fairly well and is robust to parameter selection strategies. Moreover, in certain cases we can further improve upon our approach by re-estimating the scale parameter of the Laplace family. In summary, though we assume that the distribution of the target value depends on its input only through the predicted value, the proposed approach easily provides useful probability information for SVR analysis.

REFERENCES

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] W. Chu, S. Keerthi, and C. J. Ong. Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 15:29–44, 2004.

[3] B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331, 1983.

[4] J. B. Gao, S. R. Gunn, C. J. Harris, and M. Brown. A probabilistic framework for SVM regression and error bar estimation. Machine Learning, 46:71–89, 2002.

[5] R. V. Hogg. More light on the kurtosis and related statistics. Journal of the American Statistical Association, 67:422–424, 1972.

[6] M. H. Law and J. T. Kwok. Bayesian support vector regression. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 239–244, Key West, Florida, USA, 2001.

[7] E. L. Lehmann. Testing Statistical Hypotheses. Wiley, New York, 2nd edition, 1986.

[8] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[9] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[10] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

[11] M. Seeger. Bayesian model selection for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 603–609, 2000.

[12] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.

[13] P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46:21–52, 2002.

[14] R. A. Stine. Bootstrap prediction intervals for regression. Journal of the American Statistical Association, 80(392):1026–1029, 1985.

[15] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995.
