Parameter Selection for Linear Support Vector Regression

Jui-Yang Hsia, Chih-Jen Lin

Abstract—In linear support vector regression (SVR), the regularization parameter and the error sensitivity parameter are used to avoid overfitting the training data. A proper selection of parameters is essential for obtaining a good model, but the search process may be complicated and time-consuming. In an earlier work, Chu et al. (2015) proposed an effective parameter-selection procedure for linear classification that uses warm-start techniques to solve a sequence of optimization problems.

We extend their techniques to linear SVR but must address some new and challenging issues. In particular, linear classification involves only the regularization parameter, while linear SVR has an extra error sensitivity parameter. We investigate the effective range of each parameter and the order in which the two parameters are checked. Based on this work, an effective parameter-selection tool for linear SVR is now publicly available.

I. INTRODUCTION

Support vector regression (SVR) is a linear regression model commonly used in machine learning and data mining. It extends least-squares regression by considering an ε-insensitive loss function. Further, to avoid overfitting the training data, the concept of regularization is usually applied. An SVR thus solves an optimization problem that involves two parameters:

the regularization parameter (often referred to as C) and the error sensitivity parameter (often referred to as ε). This work aims to derive an effective strategy for selecting these two parameters. Note that we focus on linear SVR rather than kernel SVR, which also involves kernel parameters.

Parameter selection of a learning method is part of the broader subject of automated machine learning (autoML). In general we solve an optimization problem over parameters, where many global optimization algorithms can be applied (e.g., [9], [12], [13], [19], [20], [21]). Approaches specific to parameter selection for machine learning have also been available (e.g., [14], [22]). Further, methods specially designed for support vector machines (SVM) have been proposed (e.g., [1], [2], [3], [4], [5], [7], [11], [15], [17], [23], [25], [26], [27]).

Most of them focus on classification rather than regression.

Further, methods suitable for linear SVM may not be effective for kernel SVM, and vice versa. A more detailed review of past works is given in supplementary materials.

Among all existing studies, we are interested in the work [2], which applies a warm-start technique for the parameter selection of linear classification (l2-regularized logistic regression and l2-loss SVM). Because the only parameter is the regularization parameter C, their strategy is to sequentially check cross-validation (CV) accuracy at the following parameters

Cmin, Cmin∆, Cmin∆², · · · ,    (1)

Jui-Yang Hsia and Chih-Jen Lin are with National Taiwan University (e-mail: hsiajuiyang5174@gmail.com and cjlin@csie.ntu.edu.tw). This work was partially supported by MOST of Taiwan grant 107-2221-E-002-167-MY3.

where ∆ > 1 is a given constant to control the increase of the parameter and C ≤ Cmin is shown to be not useful.

The search procedure stops once the performance cannot be further improved. Between two consecutive parameters, they consider a warm-start technique for fast training. Specifically, the solution of the optimization problem under the current C is used as the initial solution in solving the next problem with the parameter ∆C. Although the idea is simple, [2] had to solve some issues in order to deliver a now widely used parameter-selection tool in the popular package LIBLINEAR [6] for linear classification.

In this work, we aim to extend the procedure in [2] for SVR.

However, because of the difference between classification and regression, and the extra parameter ε, some modifications must be made. Further, we must address the following new challenges.

We find that deriving a suitable Cmin for regression is more difficult than for classification.

For classification that involves only one parameter, the search sequence shown in (1) can be a reasonable choice.

For SVR with two parameters, more options are possible.

For example, we can consider a sequence of ε values, and for every fixed ε, we run a sequence in (1). Alternatively, we can consider a sequence of C values first, and for every C we check a sequence of ε values.

Because the search space of C ∈ (0, ∞) is huge, it is a common practice to consider a sequence in (1) by exponentially increasing the C value. However, depending on the data, ε may be in a small interval, so a linear increase/decrease of the ε value might be more suitable.

In this work, we thoroughly investigate the above issues. Our final recommended setting is to check a sequence of C values for every fixed ε value.

We choose to extend the classification work in [2] to linear SVR rather than consider some existing parameter-selection works for kernel SVR (e.g., [23], [26]) because of the following reasons. First, the procedure is simpler because we directly check a grid of (C, ε) points. Note that kernel SVR involves more parameters, so a grid search may not be feasible and more sophisticated approaches are needed.

Second, while checking (C, ε) points may be time-consuming, with the effective warm-start techniques in [2], the overall procedure is practically viable.

This paper is organized as follows. In the next section, we introduce the formulations of SVR and discuss how to obtain an approximate solution of its optimization problem.

In Section III, we discuss the relationship between solutions of optimization problems and SVR parameters. In particular, we identify a suitable range of C and ε. In Section IV,


we discuss the procedure to search parameters and show details of our implementation. Section V experimentally confirms the viability of the proposed approach. Supplementary materials are available at https://www.csie.ntu.edu.tw/cjlin/libsvmtools/warm-start/.

Our proposed procedure has been included in the package LIBLINEAR [6] (after version 2.30) for the parameter selection of linear regression.

II. SVR OPTIMIZATION PROBLEM

Consider training data (y_i, x_i) ∈ R × R^n, for i = 1, · · · , l, where y_i ∈ R is the target value and x_i ∈ R^n is the feature vector. We use l to denote the number of training instances and let n be the number of features in each instance.

Linear SVR [24] finds a model w such that w^T x_i is close to the target value y_i. It solves the following problem with a given regularization parameter C > 0 and an error sensitivity parameter ε ≥ 0:

min_w f(w; C, ε),  where  f(w; C, ε) ≡ (1/2)‖w‖² + C · L(w; ε).    (2)

In (2), ‖w‖²/2 is the regularization term and L(w; ε) is the sum of training losses defined as

L(w; ε) = Σ_{i=1}^{l} max(|w^T x_i − y_i| − ε, 0)   (L1 loss), or
L(w; ε) = Σ_{i=1}^{l} max(|w^T x_i − y_i| − ε, 0)²  (L2 loss).

SVR employs the ε-insensitive loss so that small losses of some instances are ignored. That is, the loss is zero if |w^T x_i − y_i| ≤ ε.
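To make the loss definitions concrete, below is a minimal NumPy sketch (not the paper's implementation) of the two ε-insensitive losses and the objective f(w; C, ε) in (2), assuming X is an l×n array of feature vectors and y the corresponding targets.

```python
import numpy as np

def svr_losses(w, X, y, eps):
    """Epsilon-insensitive training losses of (2): returns (L1 loss, L2 loss)."""
    violation = np.abs(X @ w - y) - eps    # |w^T x_i - y_i| - eps
    hinge = np.maximum(violation, 0.0)     # zero loss inside the eps-tube
    return hinge.sum(), np.square(hinge).sum()

def svr_objective(w, X, y, C, eps, loss="l2"):
    """f(w; C, eps) = ||w||^2 / 2 + C * L(w; eps) with the chosen loss."""
    l1, l2 = svr_losses(w, X, y, eps)
    return 0.5 * (w @ w) + C * (l2 if loss == "l2" else l1)
```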

The objective function f(w; C, ε) is strongly convex, so a unique optimal solution exists; we denote it as

w_{C,ε} = arg min_w f(w; C, ε).

Because L1 loss is not differentiable and L2 loss is differentiable but not twice differentiable, different optimization methods have been proposed for training SVR. A detailed study is in [8], which considers two types of approaches:

Newton methods for L2-loss SVR and dual coordinate descent methods for L1- and L2-loss SVR. These methods were extended from studies for linear classification (e.g., [10], [16]).

For the parameter selection of classification problems, [2] recommends a Newton method after some careful evaluations. An important reason is that a Newton method possesses some advantages under a warm-start setting for training linear classification problems. Therefore, we follow [2] and consider a Newton method to solve each SVR problem in the parameter-selection procedure.

A Newton method iteratively finds Newton directions by considering the second-order approximation of the objective function. Details of the Newton method we consider are in [8], [16]. Because differentiability is required, here we consider only L2-loss SVR and investigate its parameter selection.¹

¹Note that the Newton method requires second derivatives, but the L2-loss function is not twice differentiable. We follow [18] and consider the generalized second derivative.

Because of the nature of numerical computation, in practice we only obtain an approximate solution w̃_{C,ε} of w_{C,ε}, returned from the optimization procedure. In an iterative optimization process, a stopping condition must be imposed for finite termination. For the Newton method considered in this work, we assume the stopping condition is that w̃_{C,ε} satisfies

‖∇f(w̃_{C,ε}; C, ε)‖ ≤ τ ‖∇f(0; C, ε)‖,    (3)

where τ ∈ (0, 1) is the stopping tolerance. Clearly, (3) is related to the optimality condition ∇f(w_{C,ε}; C, ε) = 0, but we further consider a relative setting that compares with the gradient at the zero point, which is a common initial point of the optimization procedure. The condition (3) plays a role in the parameter-selection procedure, where details are in Section IV.
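As an illustration only (assuming the L2 loss, NumPy arrays, and an illustrative tolerance value), the relative condition (3) could be checked as in the sketch below; the gradient formula used here is the standard one for the L2-loss SVR objective.

```python
import numpy as np

def svr_gradient(w, X, y, C, eps):
    """Gradient of the L2-loss SVR objective f(w; C, eps)."""
    r = X @ w - y                                      # r_i = w^T x_i - y_i
    z = np.sign(r) * np.maximum(np.abs(r) - eps, 0.0)  # zero inside the eps-tube
    return w + 2.0 * C * (X.T @ z)

def reached_tolerance(w, X, y, C, eps, tau=1e-3):
    """Relative stopping condition (3): ||grad f(w)|| <= tau * ||grad f(0)||.
    The default tau is an illustrative choice, not a value from the paper."""
    g_w = svr_gradient(w, X, y, C, eps)
    g_0 = svr_gradient(np.zeros_like(w), X, y, C, eps)
    return np.linalg.norm(g_w) <= tau * np.linalg.norm(g_0)
```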

A. Warm-start Techniques for Parameter Selection

While many parameter-selection strategies are available, the approach in [2] is a conservative but reliable setting of checking the cross validation (CV) performance under different parameter values. The training set is randomly split into several folds. Each time one fold is used for validation, while other folds are considered for training. Therefore, many SVR optimization problems must be solved.

To reduce the running time, [2] considers a warm-start strategy to solve closely related optimization problems. We extend their setting to linear SVR. Suppose w_{C1,ε1} is the optimal solution under C = C1 and ε = ε1. If (C1, ε1) is slightly changed to (C2, ε2), we use w_{C1,ε1} as the initial solution for solving the new optimization problem. The idea behind such a warm-start strategy is as follows. Optimization techniques such as Newton methods iteratively generate a sequence {w^k}, k = 0, 1, . . ., converging to the optimum. Because a small change of the parameters may not cause a significant change of the optimization problem, the optimal solution of the original problem can be a good starting point for the new problem. Then the number of optimization iterations may be significantly reduced in comparison with that without warm start (e.g., using 0 as the initial solution).
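The warm-start idea can be sketched as below. This is a conceptual illustration, not LIBLINEAR's interface; the solver train_svr is a hypothetical placeholder for a Newton method that minimizes f(w; C, ε) starting from w_init and stopping via (3).

```python
import numpy as np

def solve_with_warm_start(X, y, param_path, train_svr, tau=1e-3):
    """Solve a path of closely related (C, eps) problems, reusing each solution
    as the initial point of the next problem (warm start)."""
    w = np.zeros(X.shape[1])           # cold start only for the first problem
    models = {}
    for C, eps in param_path:          # consecutive (C, eps) differ only slightly
        w = train_svr(X, y, C, eps, w_init=w, tau=tau)
        models[(C, eps)] = w.copy()
    return models
```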

We divide the parameter-selection problem for SVR into two parts. One is the search range of each parameter. The other is the design of the search procedure. We study the first in Section III and the second in Section IV.

III. RANGE OF PARAMETERS

We check the range of a parameter by assuming that the other is fixed. To simplify the notation, if ε is fixed, we denote

w_C = w_{C,ε},  w̃_C = w̃_{C,ε},  L(w) = L(w; ε),  f(w; C) = f(w; C, ε).

Similarly, if C is fixed, we have

w_ε = w_{C,ε},  w̃_ε = w̃_{C,ε},  f(w; ε) = f(w; C, ε).

For a suitable parameter range, we hope that, first, parameters achieving the best performance are within it and, second, the range is as small as possible. We follow [2] to identify


parameters that should not be considered. For example, if a parameter setting leads to a model that does not learn enough information from the training data, then underfitting occurs and such parameters should not be used.

A. Zero Vector is a Trivial Model

We begin by showing that the zero vector leads to a model that may not learn enough information from the training data.

First, because

f(w_{C,ε}; C, ε) ≤ f(0; C, ε)  and  ‖w_{C,ε}‖ ≥ ‖0‖,

we have

L(w_{C,ε}; ε) ≤ L(0; ε).

The larger training loss indicates that 0 may not learn more from the training data than any w_{C,ε}. Second, the following theorem shows that the learnability of w_C deteriorates as C approaches zero and that w_C eventually goes to the zero point.

Theorem 1. If C1 > C0, then

‖w_{C1}‖ ≥ ‖w_{C0}‖  and  L(w_{C1}) ≤ L(w_{C0}).

Further,

lim_{C→0} w_C = 0.

Note that proofs of all theorems are in the supplementary materials. From the discussion, we can treat 0 as a trivial model that underfits the training data. For any w with L(w) ≈ L(0), w may not have learned enough information from the training data.

B. Parameter C

We fix ε and discuss the upper and lower bounds for the parameter C.

1) Lower Bound of C: From the discussion in Section III-A, we check under which C values the training loss L(w_C) is close to L(0) by proving the following theorem.

Theorem 2. Consider the L2 loss. For 0 ≤ δ1 < 1, we have L(w_C) ≥ (1 − δ1) × L(0) ∀C ≤ Cmin, where Cmin is defined as

Cmin = δ1² L(0) / ( 8 (Σ_{i=1}^{l} |y_i|)² (max_i ‖x_i‖)² )   if L(0) > 0,²
Cmin = ∞                                                      if L(0) = 0.    (4)

Therefore, by choosing a δ1 close to 0, Cmin can be a lower bound for the parameter C.

²We have that L(0)/0 = ∞ if this occurs.
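A small sketch of computing the lower bound (4) for the L2 loss is given below, assuming NumPy arrays and an illustrative choice of δ1 (the value of δ1 is a user decision, as discussed above).

```python
import numpy as np

def c_min(X, y, eps, delta1=0.1):
    """Lower bound Cmin of (4) for L2-loss SVR; delta1 in [0, 1) is a choice."""
    loss_at_zero = np.square(np.maximum(np.abs(y) - eps, 0.0)).sum()  # L(0; eps)
    if loss_at_zero == 0.0:
        return np.inf
    max_row_norm = np.linalg.norm(X, axis=1).max()                    # max_i ||x_i||
    return (delta1**2 * loss_at_zero
            / (8.0 * np.abs(y).sum()**2 * max_row_norm**2))
```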

2) Upper Bound of C: We first check properties of {w_C} when C is large. Let W be the set of points that attain the minimum of L(w):

W ≡ {w | L(w) = inf_{w′} L(w′)}.

For classification problems, [2] has discussed the convergence property of {w_C} as C → ∞. We extend their results here for regression.

Theorem 3. Consider any non-negative and convex loss function. If W ≠ ∅, then

lim_{C→∞} w_C = w∞,  where  w∞ = arg min_{w∈W} ‖w‖².    (5)

If the L2 loss is used, then W ≠ ∅.

Because w∞ is a model without using regularization, overfitting tends to occur and the performance is often not the best. However, it is difficult to identify a Cmax so that if C ≥ Cmax, the model is sufficiently close to w∞. We leave more investigation to Section IV.

C. Parameter ε

We now fix C and discuss upper and lower bounds for the parameter ε.

1) Lower Bound of ε: Because ε ≥ 0, a trivial lower bound of ε is ε = 0. We argue that this is a meaningful lower bound because [8] has shown that ε = 0 often leads to a good model.

That is, for some data sets the ε-insensitive setting is not needed and regularized least-squares regression is as effective as SVR.

2) Upper Bound of ε: From the definition of the ε-insensitive loss functions, if ε is so large that for most data the loss is zero, then the model tends to underfit the training data. Thus an obvious upper bound is

εmax = max_i |y_i|.

Under this εmax, f(0) = 0 implies that w = 0 is an optimal solution of (2) and insufficient information has been learned.

IV. THE SEARCH PROCEDURE

After studying the range of each parameter, we must find an effective search procedure. Under a grid setting, a two-level loop sweeps C (or ε) first, and at the inner level, we go through values of the other parameter. Then two issues must be addressed.

The parameter to be used for the outer loop.

The search sequence of each parameter.

These two issues are complicated, so our discussion goes from decisions that are easily made to those that are less certain.

We start by checking the search sequence of the parameter C. For the parameter selection of linear classification, an exponential increase of the regularization parameter C has been commonly considered; see the sequence in (1). The reason is that C ∈ (0, ∞) is in a rather large range and we need the exponential increase of the parameter to cover the search space. The same setting should be applied for regression because we still have C ∈ (0, ∞). In addition,


we follow (1) to start from Cmin for two reasons.

First, for both classification and regression, Cmin has been specifically derived; see [2] and (4). In contrast, we do not have a clear way to calculate an upper bound and must rely on techniques discussed later in this section. Second, if we consider a decreasing sequence, solving the first optimization problem may be time-consuming. The reason is that under a large C, the model tries to better fit the training data and the optimization problem is known to be more difficult.³ Based on these reasons, regardless of whether C is used in the outer or the inner level of the loop, we always consider an increasing sequence of C values as in (1).

We now discuss the search sequence of the other parameter ε. An exponential sequence like the setting for C can be considered. However, it should be a decreasing one starting at ε = εmax. The reason is that because 0 is the solution when ε = εmax, the optimization problem is easier when ε is close to εmax. Recall that a similar reason leads us to begin the search of C at Cmin.

Instead of an exponential sequence, we argue that a sequence from a linear segmentation of [0, εmax] may be more suitable for the parameter ε. An important difference from C is that ε is in a bounded interval [εmin, εmax], where εmin = 0 and εmax < ∞. Further, while both lower and upper bounds of C tend to be values not leading to a good model, for ε, [8] has pointed out that the model using ε = 0 is often competitive.

We also have the following result.

Theorem 4. lim_{ε→0} w_ε = w_0.

If an exponentially decreasing sequence starting from εmax is considered, many problems with ε ≈ 0 are checked, but Theorem 4 shows that their resulting models are similar. In contrast, a linear sequence can clearly avoid this situation.

In Section V-A, we conduct experiments to compare the two settings (linear and exponential) for the search sequence of ε.

The remaining issue is when to stop increasing the parameter C in the search procedure, because Cmax is the only bound that cannot be explicitly obtained in Section III. We extend the setting in [2] by following Theorem 3, which states that {w_C} converges to a point w∞ as C → ∞. Their idea is to terminate the selection procedure if the approximate solutions of tstop consecutive optimization problems are the same. That is, if

w̃_C = w̃_{∆C} = w̃_{∆²C} = w̃_{∆³C} = · · · = w̃_{∆^{tstop}C},    (6)

then the search process terminates at C. It is easy to check (6) by the stopping condition (3) of the optimization procedure:⁴

‖∇f(w̃_C; ∆^t C)‖ ≤ τ ‖∇f(0; ∆^t C)‖,  t = 1, 2, · · · , tstop.    (7)

In other words, an approximate solution w̃_C satisfies the above stopping condition with t = 0, but we check if it is also the returned solution of the next several problems without any optimization iteration. We choose tstop = 5 for the experiments in Section V, though more discussion on its selection is in the supplementary materials.

³In fact, some past works consider that an efficient way to solve a single SVM under a large C is through a warm-start setting on the problems corresponding to an increasing sequence of smaller C values; see, for example, the software BSVM at https://www.csie.ntu.edu.tw/cjlin/bsvm/.

⁴We explain in the supplementary materials why, on the right-hand side of (7), the 0 point is always used.
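A minimal sketch of the termination criterion (7), reusing reached_tolerance from the earlier sketch, is shown below. The functions train_svr and evaluate_cv are hypothetical placeholders (a warm-started Newton solver and a routine returning the CV MSE at a parameter pair); cross-validation details such as per-fold warm start are omitted.

```python
import numpy as np

def search_c_sequence(X, y, eps, C_min, delta=2.0, tau=1e-3, t_stop=5,
                      train_svr=None, evaluate_cv=None):
    """Sweep C = C_min, C_min*delta, ... as in (1), warm-starting each problem,
    and stop once (3) holds without any new iteration for t_stop consecutive
    C values, i.e., the criterion (6)/(7)."""
    w, C, unchanged, results = np.zeros(X.shape[1]), C_min, 0, {}
    while unchanged < t_stop:
        if reached_tolerance(w, X, y, C, eps, tau):
            unchanged += 1             # warm-started w already satisfies (3) at C
        else:
            unchanged = 0
            w = train_svr(X, y, C, eps, w_init=w, tau=tau)
        results[C] = evaluate_cv(C, eps)   # e.g., five-fold CV MSE at (C, eps)
        C *= delta
    return results
```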

V. EXPERIMENTS

We conduct experiments on some regression sets available at https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/.

For some data sets, a scaled version is provided at the same site by linearly scaling each attribute to [−1, 1] or [0, 1]. We name the scaled version with the extension “-scale.” Our search procedure aims to find a model achieving the best five-fold CV result on the validation MSE (mean squared error). Because of space limits, we present only results on some larger sets, while leaving detailed experimental settings (including data statistics) and complete experimental results to the supplementary materials.

A. Exponential or Linear Search Sequence for the Parameter ε

In Section IV, we discuss the issue of using an exponential or a linear sequence of ε values in the search procedure. We conduct a comparison by considering C values in the following set

{Cmin, Cmin∆, Cmin∆², · · · , Cmax}    (8)

and ε values in either a linearly or an exponentially spaced sequence:

{0, εmax/20, 2·εmax/20, · · · , εmax}  or  {2^-30, 2^-30·∆, 2^-30·∆², · · · , εmax},    (9)

where

Cmax = 2^50  and  ∆ = 2.    (10)
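For concreteness, the grids in (8) and (9) with the constants in (10) could be generated as in the sketch below (assuming NumPy arrays and reusing c_min from the earlier sketch; whether Cmin is recomputed for every ε is an implementation detail not specified here).

```python
import numpy as np

def eps_sequences(y, n_lin=20, delta=2.0):
    """The two eps sequences of (9): linear over [0, eps_max] and exponential
    from 2^-30 up to eps_max."""
    eps_max = np.abs(y).max()                      # eps_max = max_i |y_i|
    linear = np.linspace(0.0, eps_max, n_lin + 1)  # {0, eps_max/20, ..., eps_max}
    exponential = [2.0**-30]
    while exponential[-1] * delta < eps_max:
        exponential.append(exponential[-1] * delta)
    exponential.append(eps_max)
    return linear, np.array(exponential)

def c_sequence(X, y, eps, delta=2.0, C_max=2.0**50):
    """The C sequence of (8): Cmin, Cmin*delta, Cmin*delta^2, ..., up to Cmax."""
    values, C = [], c_min(X, y, eps)               # c_min from the earlier sketch
    while C <= C_max:
        values.append(C)
        C *= delta
    return np.array(values)
```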

In Figure 1, for each data set we show

(log2 C, ε)  or  (log2 C, log2 ε)  versus  log2(CV MSE),

depending on the sequence for ε. We observe that if an exponential sequence is used, then CV MSE is almost the same in the entire figure. The reason is that, from Theorem 4, after ε is smaller than a certain value, CV MSE is similar.

For the purpose of exploring different CV MSE values, we conclude that a linear sequence should be more suitable in our parameter selection procedure.

B. Evaluation of Various Implementations for the Search Procedure

We compare the cross-validation MSE of the following settings:

“Full and independent” (Baseline): By using the (C, ε) values in (8) and (9), we solve all linear SVR problems independently.

For each SVR problem 0 is the initial point and the stopping condition is (3).

This setting aims to show what the resulting MSE is without any techniques to reduce the search space and without the warm-start implementation. We compare with this baseline setting to see if our procedures trade performance for efficiency.


[Figure 1 appears here: two rows of panels, (a) abalone and (b) space-ga, plotting log2(CV MSE) against log2(C) and ε (left, linear sequence) or log2(ε) (right, exponential sequence).]

Figure 1: Cross-validation MSE (log scaled) by different search sequences of ε. Left: linear. Right: exponential.

(ε, C): The warm-start setting is considered. For the search procedure, ε is in the outer loop and C is in the inner loop.

The search range of ε is shown in (9), while for C, we evaluate the following settings:

– The termination criterion (7) extended from [2] is applied. See details in Section IV.

– “No termination criterion”: We run the full grid up to the large Cmax specified in (10). Thus the number of checked (ε, C) points is the same as “Full and independent.” They differ only in whether the warm-start strategy is applied or not.

(C, ε): We apply the warm-start setting, where for the search procedure, C is in the outer loop and ε is in the inner loop.

For the termination of the C sequence, the condition (7) is less suitable here. Because C is in the outer loop, to check (7), all models under different ε values must be stored.

Therefore, we run the full grid up to the specified Cmax in (10). The number of checked (C, ε) points is thus the same as that of the “Full and independent” setting and that of “No termination criterion” in the (ε, C) setting.

To see if any MSE change occurs after applying the warm-start technique, in Table I we present the following ratio:

(Best CV MSE by applying warm start) / (Best CV MSE by “Full and independent”).    (11)

We observe that all ratios are close to one, indicating that the CV MSE is close to the baseline setting of independently running a full grid without warm start. Note that except for the use of (7) to stop the C sequence early in the (ε, C) setting, all others go over the same full grid of parameters as the baseline

“Full and independent.” For them, because the same set of SVR problems is solved, the ratio should theoretically be exactly one. However, with approximate solutions satisfying only (3), the resulting models are slightly different. From ratios all close

Table I: An MSE comparison with the baseline setting of running the full grid without warm start; see the ratio defined in (11). (C, ε) and (ε, C) indicate that C and ε are used in the outer loop of the parameter grid, respectively. YPMSD is the abbreviation of YearPredictionMSD. Ratios different from one are bold-faced.

Data set             (ε, C): criterion in (7)   (ε, C): no criterion   (C, ε): no criterion
abalone                      1.00                       1.00                   1.00
abalone-scale                1.00                       1.00                   1.00
cadata                       1.09                       1.09                   1.01
cpusmall                     1.00                       1.00                   1.03
cpusmall-scale               1.00                       1.00                   1.00
E2006-train                  0.99                       0.99                   1.00
housing                      1.04                       1.04                   1.00
housing-scale                1.00                       1.00                   1.00
log1p-E2006-train            1.00                       1.00                   1.00
mg                           1.00                       1.00                   1.00
mg-scale                     1.00                       1.00                   1.00
space-ga                     1.00                       1.00                   1.00
space-ga-scale               1.00                       1.00                   1.00
YPMSD                        1.02                       1.02                   0.99

to one in Table I we conclude that equally good approximate solutions of SVR problems are obtained after applying warm start.

More importantly, from Table I the setting via (7), without considering all grid points, also has ratios close to one, indicating that it has covered the needed parameters without sacrificing the performance.

C. Running-time Reduction of Warm-start Methods

To check the effectiveness of warm-start methods, we present in Table II the following ratio:⁵

(Running time by applying warm start) / (Running time by “Full and independent”).    (12)

A smaller ratio indicates a better time reduction by using warm start. From Table II, we have the following observations.

All the values in Table II are much smaller than one. This result shows that the warm-start techniques can significantly reduce the time required to search the parameters.

For the (ε, C) setting, the running time with/without the early termination of the C sequence is almost the same.

We give the following explanation. The criterion (7) checks the stopping condition of several consecutive optimization problems. When (7) holds, the corresponding w̃_C may be close enough to w∞ by Theorem 3, and the condition (3) may hold from then on:

‖∇f(w̃_{Cstop}; C)‖ ≤ τ ‖∇f(0; C)‖,  ∀C ≥ Cstop,    (13)

where Cstop is the value at which (7) is satisfied. Therefore, if we check more C values all the way up to the specified Cmax, at each C the optimization method terminates without running any iteration. In this situation, the early termination via (7) is not needed. However, in theory Cstop may not exist to satisfy (13) because w̃_{Cstop} is only an approximate rather than the optimal solution. Thus it is still possible that the optimization method takes time at each C and

⁵Running time is estimated by the number of CG steps in the Newton method for training SVR. See details in the supplementary materials.


Table II: Ratio defined in (12) to show the time reduction of using warm start.

Data set             (ε, C): criterion in (7)   (ε, C): no criterion   (C, ε): no criterion
abalone                      0.12                       0.12                   0.54
abalone-scale                0.11                       0.11                   0.50
cadata                       0.06                       0.06                   0.67
cpusmall                     0.06                       0.06                   0.62
cpusmall-scale               0.10                       0.10                   0.69
E2006-train                  0.14                       0.14                   0.35
housing                      0.07                       0.07                   0.66
housing-scale                0.11                       0.11                   0.59
log1p-E2006-train            0.08                       0.08                   0.62
mg                           0.13                       0.13                   0.61
mg-scale                     0.13                       0.13                   0.61
space-ga                     0.08                       0.08                   0.73
space-ga-scale               0.13                       0.13                   0.62
YPMSD                        0.04                       0.04                   0.35

we expensively run the procedure all the way up to Cmax. Further, the selection of Cmax is a tricky issue; in our experiment we choose 2^50 in (10) without a good reason.

Therefore, we can say that (7) is a relaxed condition of (13) that avoids the huge effort of possibly running up to an extremely large Cmax.

Between (C, ε) and (ε, C), we observe that (C, ε) costs more. It seems that warm start is less effective when C is fixed and ε is slightly changed. A reason might be that in our experiments, the number of SVR problems per ε value is often larger than that per C value. Then the time saving by applying warm start for the (C, ε) strategy is less dramatic.

Note that in (9) we split [εmin, εmax] into 20 intervals, but with a small Cmin and a large Cmax, the number of C values in [Cmin, Cmax] tends to be larger. A more detailed discussion is provided in the supplementary materials.

D. Comparison with Other Parameter Selection Methods

We have compared our proposed method with two existing techniques for parameter selection: simulated annealing and particle swarm optimization, whose details are in the supplementary materials. We find that these alternative approaches, while more sophisticated, are often competitive. However, they are not as robust as our search on a grid of parameters. In some situations, they lead to parameters with much worse CV MSE.

VI. RECOMMENDED PROCEDURE AND CONCLUSIONS

We have shown that the termination criterion (7) works effectively in practice. Because this criterion is applicable when C is in the inner loop and our experiments show that the (ε, C) setting takes less running time, our recommended setting is to have ε in the outer loop and C in the inner loop, with the criterion (7) imposed. A detailed algorithm is given in the supplementary materials.
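To make the recommended setting concrete, below is a minimal sketch (not the detailed algorithm in the supplementary materials, nor LIBLINEAR's implementation) that combines the earlier sketches: a linear ε grid in the outer loop and, for each ε, the warm-started C sequence terminated by (7). It reuses c_min and search_c_sequence from the previous sketches, and train_svr and cross_val_mse remain hypothetical placeholders.

```python
import numpy as np

def select_parameters(X, y, train_svr, cross_val_mse,
                      n_eps=20, delta=2.0, tau=1e-3, t_stop=5):
    """Sketch of the recommended procedure: eps in the outer loop (linear grid),
    C in the inner loop (exponentially increasing, warm-started, stopped by (7))."""
    eps_max = np.abs(y).max()                         # eps_max = max_i |y_i|
    best = {"cv_mse": np.inf, "C": None, "eps": None}
    for eps in np.linspace(0.0, eps_max, n_eps + 1):  # linear eps sequence of (9)
        C_start = c_min(X, y, eps)
        if not np.isfinite(C_start):                  # L(0; eps) = 0: w = 0 is optimal
            continue
        results = search_c_sequence(X, y, eps, C_start, delta, tau, t_stop,
                                    train_svr, cross_val_mse)
        for C, mse in results.items():
            if mse < best["cv_mse"]:
                best = {"cv_mse": mse, "C": C, "eps": eps}
    return best
```

In this sketch, warm start is applied only within each ε value; the inner sequence restarts from the zero vector because the first problem, at Cmin, is inexpensive to solve. Carrying the warm start across consecutive ε values is another possible design choice.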

We list technical insights from this development and future research issues in the supplementary materials. In summary, we have developed an effective parameter-selection procedure based on the warm-start technique for linear SVR.

REFERENCES

[1] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–159, 2002.

[2] B.-Y. Chu, C.-H. Ho, C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Warm start for parameter selection of linear classifiers. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015.

[3] K.-M. Chung, W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural Computation, 15:2643–2681, 2003.

[4] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 345–349, 2000.

[5] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51:41–59, 2003.

[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[7] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.

[8] C.-H. Ho and C.-J. Lin. Large-scale linear support vector regression. Journal of Machine Learning Research, 13:3323–3348, 2012.

[9] J. H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, 1992.

[10] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

[11] C.-M. Huang, Y.-J. Lee, D. K. Lin, and S.-Y. Huang. Model selection for support vector machines via uniform design. Computational Statistics and Data Analysis, 52(1):335–346, 2007.

[12] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks (ICNN), pages 1942–1948, 1995.

[13] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[14] R. Kohavi and G. H. John. Automatic parameter selection by minimizing estimated error. In Proceedings of the Twelfth International Conference on Machine Learning (ICML), pages 304–312, 1995.

[15] J.-H. Lee and C.-J. Lin. Automatic model selection for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2000.

[16] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008.

[17] S.-W. Lin, Z.-J. Lee, S.-C. Chen, and T.-Y. Tseng. Parameter determination of support vector machine and feature selection using simulated annealing approach. Applied Soft Computing, 8(4):1505–1512, 2008.

[18] O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913–929, 2002.

[19] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[20] J. Močkus. On Bayesian methods for seeking the extremum. In Proceedings of the IFIP Technical Conference, pages 400–404, 1974.

[21] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

[22] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.

[23] B. Üstün, W. J. Melssen, M. K. Oudenhuijzen, and L. M. C. Buydens. Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization. Analytica Chimica Acta, 544(1-2):292–305, 2005.

[24] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

[25] Z. Wen, B. Li, R. Kotagiri, J. Chen, Y. Chen, and R. Zhang. Improving efficiency of SVM k-fold cross-validation by alpha seeding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[26] C.-H. Wu, G.-H. Tzeng, and R.-H. Lin. A novel hybrid genetic algorithm for kernel function and parameter optimization in support vector regression. Expert Systems with Applications, 36(3):4725–4735, 2009.

[27] Z.-L. Wu, A. Zhang, C.-H. Li, and A. Sudjianto. Trace solution paths for SVMs via parametric quadratic programming. In KDD Workshop: Data Mining Using Matrices and Tensors, 2008.
