Differences Between SVC and SVR - Linear Support Vector Regression

II. Linear Support Vector Regression

2.1 Differences Between SVC and SVR

SVR is very similar to SVC, although they differ in several aspects. These differ-ences are not specific to the linear case, but we list them here to help our subsequent discussion.

1. Labels versus target values: SVC considers class label y ∈ {+1, −1} rather than a real number.

2. Loss functions: The loss function of SVC is

max(0, 1 − y_iw^Tx_i) or max(0, 1 − y_iw^Tx_i)².

In classification, we hope yw^Tx ≥ 1, but in regression we would like to have

− ≤ w^Tx − y ≤ .

Consequently, SVR has one more parameter than SVC. Parameter selection for SVR is thus more time consuming.

3. Number of dual variables: The dual problem of SVR has 2l variables, while SVC has only l. If a dual-based solver is applied without a careful design, the cost may be significantly higher than that for SVC.

CHAPTER III

Optimization Methods for Training Linear SVR

In this chapter, we extend two linear-SVC methods in LIBLINEAR for linear SVR.

The first is a Newton method for L2-loss SVR, while the second is a coordinate descent method for L1-/L2-loss SVR.

3.1 A Trust Region Newton Method (TRON) for L2-loss SVR

TRON (Lin and Mor´e, 1999) is a general optimization method for differentiable unconstrained and bound-constrained problems, where the primal problem of L2-loss SVR is a case. Lin et al. (2008) investigate the use of TRON for L2-loss SVC and logistic regression. In this chapter, we discuss how TRON can be applied to solve large linear L2-loss SVR.

The optimization procedure of TRON involves two layers of iterations. At the k-th outer-layer iteration, given the current position w^k, TRON sets a trust-region size ∆_k and constructs a quadratic model

q_k(s) ≡ ∇f (w^k)^Ts + 1

2s^T∇²f (w^k)s

as the approximation to f (w^k+ s) − f (w^k). Then, in the inner layer, TRON solves the following problem to find a Newton direction under a step-size constraint.

mins q_k(s) subject to ksk ≤ ∆_k. (3.1)

Algorithm 1 A trust region Newton method for L2-loss SVR

TRON adjusts the trust region ∆_k according to the approximate function reduction q_k(s) and the real function decrease; see details in Lin et al. (2008).

To compute a truncated Newton direction by solving (3.1), TRON needs the gra-dient ∇f (w) and Hessian ∇²f (w). The gradient of L2-loss SVR is

∇f (w) = w + 2C(X_I₁_,:)^T(X_I₁_,:w − y_I₁− e_I₁) − 2C(X_I₂_,:)^T(−X_I₂_,:w + y_I₂ − e_I₂), where

X ≡ [x1, . . . , xl]^T, I1 ≡ {i | w^Txi− yi > }, and I2 ≡ {i | w^Txi− yi < −}.

However, ∇²f (w) does not exist because L2-loss SVR is not twice differentiable. Fol-lowing Mangasarian (2002) and Lin et al. (2008), we use the generalized Hessian matrix.

Let

I ≡ I₁∪ I₂. The generalized Hessian can be defined as

∇²f (w) = I + 2C(XI,:)^TDI,IXI,:,

where I is the identity matrix, and D is an l-by-l diagonal matrix with

D_ii≡

For large-scale problems, we can not store an n-by-n Hessian matrix in the memory.

The same problem has occurred in classification, so Lin et al. (2008) applied an iterative method to solve (3.1). In each inner iteration, only some Hessian-vector products are required and they can be performed without storing Hessian. We consider the same setting so that for any vector v ∈ Rⁿ,

∇²f (w)v = v + 2C(X_I,:)^T(D_I,I(X_I,:v)).

For the stopping condition, we follow the current setting in LIBLINEAR to check if the gradient is small enough in compared to the initial gradient.

k∇f (w^k)k₂ ≤ _sk∇f (w⁰)k₂, (3.2) where w⁰ is the initial iterate and _s is stopping tolerance given by users. Algorithm 1 gives the basic framework of TRON.

Similar to the situation in classification, the most expensive operation is the Hessian-vector product. It costs O(|I|n) to evaluate ∇²f (w)v.

3.2 Dual Coordinate Descent Methods (DCD)

In this chapter, we introduce DCD, a coordinate descent method for the dual form of SVC/SVR. It is used in LIBLINEAR for both L1- and L2-loss SVC. We first extend the setting of Hsieh et al. (2008) to SVR and then propose a better algorithm using properties of SVR.

3.2.1 A Direct Extension from Classification to Regression

A coordinate descent method sequentially updates one variable by solving the fol-lowing subproblem.

minz f_A(α + ze_i) − f_A(α) subject to 0 ≤ α_i+ z ≤ U.

where

f_A(α + ze_i) − f_A(α) = ∇_if_A(α)z +1

2∇²_iif_A(α)z²

and e_i ∈ R^2l×1is a vector with i-th element one and others zero. The optimal value z can be solved in a closed form, so α_i is updated by

To efficiently implement (3.3), techniques that have been employed for SVC can be applied. First, we precalculate ¯Q_ii = x^T_i x_i+ λ, ∀i in the beginning. Second, (Q(α⁺−

If the current iterate α_i is updated to ¯α_i by (3.3), then vector u can be maintained by

u ←

Both (3.3) and (3.4) cost O(n), which is the same as the cost in classification.

Hsieh et al. (2008) check the projected gradient ∇^Pf_A(α) for the stopping condition because α is optimal if and only if ∇^Pf_A(α) is zero. The projected gradient is defined

Algorithm 2 A DCD method for linear L1-/L2-loss SVR

1. Given α⁺ and α⁻. Let α =α⁺ α⁻

and the corresponding u =Pl

i=1(α_i− α_i+l)x_i. the overall procedure in Algorithm 2.

Hsieh et al. (2008) apply two techniques to make a coordinate descent method faster. The first one is to permute all variables at each iteration to decide the order for update. We find that this setting is also useful for SVR. The second implementation technique is shrinking. By gradually removing some variables, smaller optimization problems are solved to save the training time. In Hsieh et al. (2008), they remove those which are likely to be bounded (i.e., 0 or U ) at optimum. Their shrinking strategy can be directly applied here, so we omit details.

While we have directly applied a coordinate descent method to solve (2.4), the procedure does not take SVR’s special structure into account. Note that α⁺ and α⁻ in (2.5) are very related. We can see that in the following situations some operations in Algorithm 2 are redundant.

1. We pointed out in Chapter II that an optimal α of (2.4) satisfies

α⁺_i α⁻_i = 0, ∀i. (3.6)

If one of α⁺ or α⁻ is positive at optimum, it is very possible that the other is zero throughout all final iterations. Because we sequentially select variables for update, these zero variables, even if not updated in steps 3.1.1–3.1.2 of Algorithm 2, still need to be checked in the beginning of step 3.1. Therefore, some operations are wasted. Shrinking can partially solve this problem, but alternatively we may explicitly use the property (3.6) in designing the coordinate descent algorithm.

2. We show that some operations in calculating the projected gradient in (3.5) are wasted if all we need is the largest component of the projected gradient. Assume α_i⁺> 0 and α⁻_i = 0. If the optimality condition at α⁻_i is not satisfied yet, then

∇^P_i+lf_A(α) = ∇_i+lf_A(α) = −(Q(α⁺− α⁻))_i+ + y_i+ λα⁻_i < 0.

We then have

0 < −∇_i+lf_A(α) = (Q(α⁺− α⁻))_i− − y_i− λα⁻_i

< (Q(α⁺− α⁻))i + − yi+ λα⁺_i = ∇ifA(α), (3.7)

so a larger violation of the optimality condition occurs at α⁺_i . Thus, when α⁺_i > 0 and α⁻_i = 0, checking ∇i+lfA(α) is not necessary if we aim to find the largest element of the projected gradient.

In the next chapter, we propose a better coordinate descent method for SVR by com-bining α⁺ and α⁻ to a vector α⁺− α⁻.

3.2.2 A New Coordinate Descent Method by Solving α⁺ and α⁻ Together

problem (3.8) can be transformed as

min

We design a coordinate descent method to solve (3.9). Interestingly, (3.9) is in a form similar to the primal optimization problem of L1-regularized regression and classification. In LIBLINEAR, a coordinate descent solver is provided for L1-regularized L2-loss SVC (Yuan et al., 2010). We will adapt some of its implementation techniques here. A difference between L1-regularized classification and the problem (3.9) is that (3.9) has additional bounded constraints.

Assume β is the current iterate and its i-th component, denoted as a scalar variable s, is being updated. That is, β is a constant vector in the subsequent discussion. Then the following one-variable subproblem is solved.

mins g(s) subject to − U ≤ s ≤ U, (3.10)

where

g(s) = f_B(β + (s − β_i)e_i) − f_B(β)

= |s| + ( ¯Qβ − y)_i(s − β_i) + 1 2

Q¯_ii(s − β_i)²+ constant. (3.11)

It is well known that (3.11) can be reduced to “soft-thresholding” in signal processing and has a closed-form minimum. However, here we decide to give detailed derivations of solving (3.10) because of several reasons. First, s is now bounded in [−U, U ]. Second, the discussion will help to explain our stopping condition and shrinking procedure.

To solve (3.10), we start with checking the derivative of g(s). Although g(s) is not differentiable at s = 0, its derivatives at s ≥ 0 and s ≤ 0 are respectively

g_p⁰(s) = + ( ¯Qβ − y)_i + ¯Q_ii(s − β_i) if s ≥ 0, and g_n⁰(s) = − + ( ¯Qβ − y)_i + ¯Q_ii(s − β_i) if s ≤ 0.

Both g_p⁰(s) and g_n⁰(s) are linear functions of s. Further,

g_n⁰(s) ≤ g_p⁰(s), ∀s ∈ R.

For any strictly convex quadratic function, the unique minimum occurs when the first derivative is zero. Because g(s) is only piece-wise quadratic, we consider three cases in Figure 3.1 according to the values of g_p⁰(s) and g_n⁰(s). In Figure 3.1(a), 0 <

g⁰_n(0) < g⁰_p(0), so g(0) is the smallest on the positive side:

g(0) ≤ g(s), ∀s ≥ 0. (3.12)

−U 0 U

g_p⁰(s) g⁰_n(s)

s^∗

(a) 0 < g⁰_n(0) < g_p⁰(0).

s^∗ = 0 s g_p⁰(s)

g_n⁰(s)

−U U

(b) g⁰_n(0) ≤ 0 ≤ g_p⁰(0).

s 0

g⁰_p(s) g_n⁰(s)

s^∗

−U U

Figure 3.1: We discuss the minimization of (3.10) using three cases. The y-axis in-dicates the value of g⁰_p(s) and g_n⁰(s). The point s^∗ denotes the optimal solution.

For s ≤ 0, g⁰_n(s) = 0 has a root because the line of g_n⁰(s) intersects the x-axis. With (3.12), this root is the minimum for both s ≤ 0 and s ≥ 0. By solving g⁰_n(s) = 0 and taking the condition 0 < g⁰_n(0), the solution of (3.10) is

β_i− − + ( ¯Qβ − y)_i

Q¯_ii if − + ( ¯Qβ − y)_i > ¯Q_iiβ_i. (3.13) We also need to take the constraint s ∈ [−U, U ] in Equation (3.10) into account. If the value obtained in (3.13) is smaller than −U , then g⁰_n(s) > 0, ∀s ≥ −U . That is, g(s) is an increasing function and the minimum is at s = −U .

The situation is similar in Figure 3.1(c), where the minimum occurs at g_p⁰(s) = 0.

For the remaining case in Figure 3.1(b),

g_n⁰(0) ≤ 0 ≤ g_p⁰(0). (3.14)

Inequalities in (3.14) imply that g(s) is a decreasing function at s ≤ 0, but is an increasing function at s ≥ 0. Thus, an optimal solution occurs at s = 0. A summary of the three cases shows that the subproblem (3.10) has the following closed form solution.

In (3.16), we simplify the solution form in (3.13) by using the property

g_p⁰(β_i) = + ( ¯Qβ − y)_i, and g_n⁰(β_i) = − + ( ¯Qβ − y)_i. (3.17)

Following the same technique in Chapter 3.2.1, we maintain a vector u and calculate

Algorithm 3 A new DCD method which solves (3.9) for linear L1-/L2-loss SVR The new DCD method to solve (3.9) is sketched in Algorithm 3.

For the convergence, we show in Appendix A that Algorithm 3 is a special case of the general framework in Tseng and Yun (2009) for non-smooth separable minimization.

Their Theorem 2(b) implies that Algorithm 3 converges in an at least linear rate.

Theorem 1 For L1-loss and L2-loss SVR, if β^k is the k-th iterate generated by Algo-rithm 3, then {β^k} globally converges to an optimal solution β^∗. The convergence rate is at least linear: there are 0 < µ < 1 and an iteration k₀ such that

f_B(β^k+1) − f_B(β^∗) ≤ µ(f_B(β^k) − f_B(β^∗)), ∀k ≥ k₀.

For the stopping condition and the shrinking procedure, we will mainly follow the setting in LIBLINEAR for L1-regularized classification. To begin, we study how to measure the violation of the optimality condition of (3.10) during the optimization procedure. From Figure 3.1(c), we see that

if 0 < β_i^∗ < U is optimal for (3.10), then g_p⁰(β_i^∗) = 0.

Thus, if 0 < β_i < U , |g_p⁰(β_i)| can be considered as the violation of the optimality. From

gives the violation of the optimality. After considering all situations, we know that

β_i is optimal for (3.10) if and only if v_i = 0,

If β is unconstrained (i.e., U = ∞), then (3.18) reduces to the minimum-norm sub-gradient used in L1-regularized problems. Based on it, Yuan et al. (2010) derive their stopping condition and shrinking scheme. We follow them to use a similar stopping condition.

kv^kk1 < skv⁰k1, (3.19) where v⁰ and v^k are the initial violation and the violation in the k-th iteration, respec-tively. Note that v^k’s components are sequentially obtained via (3.18) in l coordinate descent steps of the k-th iteration.

For shrinking, we remove bounded variables (i.e., β_i = 0, U , or −U ) if they may not be changed at the final iterations. Following Yuan et al. (2010), we use a “tighter”

form of the optimality condition to conjecture that a variable may have stuck at a bound. We shrink βi if it satisfies one of the following conditions.

β_i = 0 and g⁰_n(β_i) < −M < 0 < M < g_p⁰(β_i), (3.20) β_i = U and g⁰_p(β_i) < −M, or (3.21)

β_i = −U and g_n⁰(β_i) > M, (3.22)

where

M ≡ max

i v_i^k−1 (3.23)

is the maximal violation of the previous iteration. The condition (3.20) is equivalent to

β_i = 0 and − + M < ( ¯Qβ)_i− y_i < − M. (3.24) This is almost the same as the one used in Yuan et al. (2010); see Equation (32) in that thesis. However, there are some differences. First, because they solve L1-regularized SVC, in (3.24) becomes the constant one. Second, they scale M to a smaller value.

Note that M used in (3.20)–(3.22) controls how aggressive our shrinking scheme is. In Chapter 4.6, we will investigate the effect of using different M values.

For L2-loss SVR, α_i is not upper-bounded in the dual problem, so (3.20) becomes the only condition to shrink variables. This makes L2-loss SVR have less opportunity to shrink variables than L1-loss SVR. The same situation has been known for L2-loss SVC.

In Chapter 3.2.1, we pointed out some redundant operations in calculating the projected gradient of f_A(α⁺, α⁻). If 0 < β_i < U , we have α⁺_i = β_i and α⁻_i = 0. In this situation, Equation (3.7) indicates that for finding the maximal violation of the

optimality condition, we only need to check ∇^P_i f_A(α) rather than ∇^P_i+lf_A(α). From (3.5) and (3.17),

∇^P_i f_A(α) = ( ¯Qβ − y)_i+ = g_p⁰(β).

This is what we checked in (3.18) when 0 < β < U . Therefore, no operations are wasted.

Algorithm 4 is the overall procedure to solve (3.9). In the beginning, we set M = ∞, so no variables are shrunk at the first iteration. The set T in Algorithm 4 includes variables which have not been shrunk. During the iterations, the stopping condition of a smaller problem of T is checked. If it is satisfied but T is not the full set of variables, we reset T to be {1, . . . , l}; see the if-else statement in step 6.3 of Algorithm 4. This setting ensures that the algorithm stops only after the stopping condition for problem (3.9) is satisfied. Similar approaches have been used in LIBSVM (Chang and Lin, 2011) and some solvers in LIBLINEAR.

3.3 Difference Between Linear and Nonlinear SVR

The discussion in Chapters 3.2.1–3.2.2 concludes that α⁺_i and α⁻_i should be solved together rather than separately. Interestingly, for nonlinear (kernel) SVR, Liao et al.

(2002) argue that the opposite is better. They consider SVR with a bias term, so the dual problem contains an additional linear constraint.

i=1

(α⁺_i − α⁻_i ) = 0.

Because of this constraint, their coordinate descent implementation (called decompo-sition methods in the SVM community) must select at least two variables at a time.

They discuss the following two settings.

1. Considering f_A(α) and selecting i, j ∈ {1, . . . , 2l} at a time.

2. Selecting i, j ∈ {1, . . . , l} and then updating α_i⁺, α⁻_i , α⁺_j , and α⁻_j together. That is, a four-variable subproblem is solved.

The first setting corresponds to ours in Chapter 3.2.1, while the second is related to that in Chapter 3.2.2. We think Liao et al. (2002) prefer the first because of the following reasons, from which we can see some interesting differences between linear and nonlinear SVM.

1. For nonlinear SVM, we can afford to use gradient information for selecting the working variables; see reasons explained in Chapter 4.1 of Hsieh et al. (2008).

This is in contrast to the sequential selection for linear SVM. Following the gradient-based variable selection, Liao et al. (2002, Theorem 3.4) show that if an optimal (α⁺_i )^∗ > 0, then α⁻_i remains zero in the final iterations without being selected for update. The situation for (α⁻_i )^∗ > 0 is similar. Therefore, their coordinate descent algorithm implicitly has a shrinking implementation, so the first concern discussed in Chapter 3.2.1 is alleviated.

2. Solving a four-variable subproblem is complicated. In contrast, for the two-variable subproblem of α_i⁺ and α⁻_i , we demonstrate in Chapter 3.2.2 that a simple closed-form solution is available.

3. The implementation of coordinate descent methods for nonlinear SVM is more complicated than that for linear because of steps such as gradient-based variable selection and kernel-cache maintenance, etc. Thus, the first setting of minimizing f_A(α) possesses the advantage of being able to reuse the code of SVC. This is the approach taken by the nonlinear SVM package LIBSVM (Chang and Lin, 2011), in which SVC and SVR share the same optimization solver. In contrast, for linear SVC/SVR, the implementation is simple, so we can have a dedicated

code for SVR. In this situation, minimizing f_B(β) is more preferable than f_A(α).

Algorithm 4 Details of Algorithm 3 with a stopping condition and a shrinking im-plementation.

1. Given β and corresponding u =Pl

i=1βixi.

2. Set λ = 0 and U = C if L1-loss SVR; λ = 1/(2C) and U = ∞ if L2-loss SVR.

3. Compute the Hessian diagonal ¯Q_ii, ∀i = 1, . . . , l.

4. M ← ∞, and compute kv⁰k₁ by (3.18).

5. T ← {1, . . . , l}.

6. For k = 0, 1, 2, . . .

6.1. Randomly permute T .

6.2. For i ∈ T // select an index to update

6.2.1. g_p⁰ ← −yi+ u^Txi+ λβi+ , g_n⁰ ← −yi+ u^Txi+ λβi− .

6.2.2. Find v_i^k by (3.18).

6.2.3. If any condition in (3.20)–(3.22) is satisfied T ← T \{i}.

continue

6.2.4. Find s by (3.16).

6.2.5. u ← u + (s − β_i)x_i. 6.2.6. β_i ← s.

6.3. If kv^kk1/kv⁰k1 < s

If T = {1, . . . , l}

break else

T ← {1, . . . , l}, and M ← ∞.

else

M ← kv^kk∞.

CHAPTER IV

Experiments

In this chapter, we compare nonlinear/linear SVR and evaluate the methods de-scribed in Chapters III. Two evaluation criteria are used. The first one is mean squared error (MSE).

The other is squared correlation coefficient (R²). Given the target values y and the predicted values y⁰, R² is defined as

We consider the following data sets in our experiments. All except CTR are publicly available at LIBSVM data set.¹

• MSD: We consider this data because it is the largest regression set in the UCI Machine Learning Repository (Frank and Asuncion, 2010). It is originally from Bertin-Mahieux et al. (2011). Each instance contains the audio features of a song, and the target value is the year the song was released. The original target

1http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

value is from 1922 to 2011, but we follow Bertin-Mahieux et al. (2011) to linearly scale it to [0, 1].

• TFIDF-2006, LOG1P-2006: This data set comes from some context-based analysis and discussion of the financial condition of a corporation (Kogan et al., 2009).² The target values are the log transformed volatilities of the corporation. We use records in the last year (2006) as the testing data, while the previous five years (2001–2005) for training.

There are two different feature representations. TFIDF-2006 contains TF-IDF (term frequency and inverse document frequency) of unigrams, but LOG1P-2006 contains

log(1 + TF),

where TF is the term frequency of unigrams and bigrams. Both representations also include the volatility in the past 12 months as an additional feature.

• CTR: The data set is from an Internet company. Each feature vector is a binary representation of a web page and an advertisement block. The target value is the click-through-rate (CTR) defined as (#clicks)/(#page views).

• KDD2010b: This classification problem comes from KDD Cup 2010. The class label indicates whether a student answered a problem correctly or not on a online tutoring system. We consider this problem because of several reasons. First, we have not found other large and sparse regression problems. Second, we are interested in the performance of SVR algorithms when a classification problem is treated as a regression one.

2The raw data are available at http://www.ark.cs.cmu.edu/10K/.

Table 4.1: Data set statistics: #non-zeros means the number of non-zero elements in all training instances. Note that data sets are sorted according to the number of features.

Data #instances

#features #non-zeros

range of y training testing in training

MSD 463,715 51,630 90 41,734,346 [0, 1]

TFIDF-2006 16,087 3,308 150,360 19,971,015 [−7.90, −0.52]

LOG1P-2006 16,087 3,308 4,272,227 96,731,839 [−7.90, −0.52]

CTR 11,382,195 208,988 22,510,600 257,526,282 [0, 1]

KDD2010b 19,264,097 748,401 29,890,095 566,345,888 {0, 1}

The numbers of instances, features, nonzero elements in training data, and the range of target values are listed in Table 4.1. Except MSD, all others are large sparse data.

We use the zero vector as the initial solution of all algorithms. All implementations are in C++ and experiments are conducted on a 64-bit machine with Intel Xeon 2.0GHz CPU (E5504), 4MB cache, and 32GB main memory. Programs used for our experiment can be found at http://www.csie.ntu.edu.tw/~cjlin/liblinear/exp.html.

4.2 A Comparison Between Two DCD Algorithms

Our first experiment is to compare two DCD implementations (Algorithms 2 and 4) so that only the better one is used for subsequence analysis. For this comparison, we normalize each instance to a unit vector and consider L1-loss SVR.

Figure 4.1 presents results using parameters C = 1 and = 0.1. The x-axis is the training time, and the y-axis is the relative difference to the dual optimal function value.

f_A(α) − f_A(α^∗)

|f_A(α^∗)| , (4.1)

where α^∗ is the optimum solution. We run optimization algorithms long enough to get an approximate f_A(α^∗). In Figure 4.1, DCD-1 and DCD-1-sh are Algorithm 2 with-out/with shrinking, respectively. DCD-2, and DCD-2-sh are the proposed Algorithm 4. If shrinking is not applied, we simply plot the value (4.1) once every eight

itera-tions. With shrinking, the setting is more complicated because the stopping tolerance

_s affects the shrinking implementation; see step 6.3 in Algorithm 4. Therefore, we run Algorithms 2 and 4 several times under various _s values to obtain pairs of (training time, function value).

Results show that DCD-2 is faster than DCD-1. Because the training time (x-axis) is in log-scale, the difference is significant. This observation is consistent with our discussion in Chapter 3.2.1 that Algorithm 2 suffers from some redundant operations.

We mentioned that shrinking can reduce the overhead and this is supported by the result that DCD-1-sh becomes closer to DCD-2-sh. Based on this experiment, we only use Algorithm 4 in subsequent analysis.

This experiment also reveals how useful the shrinking technique is. For both Algo-rithms 2 and 4, we clearly observe that shrinking very effectively reduces the training time.

4.3 A Comparison Between Linear and Nonlinear SVR

We wrote in Chapter I that the motivation of this research work is to check if for some applications linear SVR can give competitive MSE/R² with nonlinear SVR, but enjoy faster training. In this chapter, we compare DCD for linear SVR with the package LIBSVM (Chang and Lin, 2011) for nonlinear SVR.

For LIBSVM, we consider RBF kernel, so Q_ij in Equation (2.5) becomes

Q_ij ≡ e^−γkxⁱ^−x^j^k²,

where γ is a user-specified parameter. Because LIBSVM’s training time is very long, we

在文檔中大規模線性支持向量迴歸 (頁 13-0)