LIBSVM: A Library for Support Vector Machines

Chih-Chung Chang and Chih-Jen Lin Department of Computer Science National Taiwan University, Taipei, Taiwan

Email: cjlin@csie.ntu.edu.tw

Initial version: 2001. Last updated: August 23, 2022.

Abstract

LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multi-class classification, probability estimates, and parameter selection are discussed in detail.

Keywords: Classification, LIBSVM, optimization, regression, support vector machines, SVM

1 Introduction

Support Vector Machines (SVMs) are a popular machine learning method for classification, regression, and other learning tasks. Since the year 2000, we have been developing the package LIBSVM as a library for support vector machines. The Web address of the package is http://www.csie.ntu.edu.tw/~cjlin/libsvm. LIBSVM is currently one of the most widely used SVM software packages. In this article,1 we present all implementation details of LIBSVM. However, this article does not intend to teach the practical use of LIBSVM. For instructions on using LIBSVM, see the README file included in the package, the LIBSVM FAQ,2 and the practical guide by Hsu et al. (2003). An earlier version of this article was published in Chang and Lin (2011).

LIBSVM supports the following learning tasks.

1This LIBSVM implementation document was created in 2001 and has been maintained at http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf.

Table 1: Representative works in some domains that have successfully used LIBSVM.

  Domain                        Representative works
  Computer vision               LIBPMK (Grauman and Darrell, 2005)
  Natural language processing   Maltparser (Nivre et al., 2007)
  Neuroimaging                  PyMVPA (Hanke et al., 2009)
  Bioinformatics                BDVal (Dorff et al., 2010)

1. SVC: support vector classification (two-class and multi-class).

2. SVR: support vector regression.

3. One-class SVM.

A typical use of LIBSVM involves two steps: first, training a data set to obtain a model and second, using the model to predict information of a testing data set. For SVC and SVR, LIBSVM can also output probability estimates. Many extensions of LIBSVM are available at libsvmtools.3

The LIBSVM package is structured as follows.

1. Main directory: core C/C++ programs and sample data. In particular, the file svm.cpp implements training and testing algorithms, where details are described in this article.

2. The tool sub-directory: this sub-directory includes tools for checking data format and for selecting SVM parameters.

3. Other sub-directories contain pre-built binary files and interfaces to other languages/software.

LIBSVM has been widely used in many areas. From 2000 to 2010, there were more than 250,000 downloads of the package. In this period, we answered more than 10,000 emails from users. Table 1 lists representative works in some domains that have successfully used LIBSVM.

This article is organized as follows. In Section 2, we describe SVM formulations supported in LIBSVM: C-support vector classification (C-SVC), ν-support vector classification (ν-SVC), distribution estimation (one-class SVM), ϵ-support vector regression (ϵ-SVR), and ν-support vector regression (ν-SVR). Section 3 then discusses performance measures, basic usage, and code organization. All SVM formulations supported in LIBSVM are quadratic minimization problems. We discuss the optimization algorithm in Section 4. Section 5 describes two implementation techniques to reduce the running time for minimizing SVM quadratic problems: shrinking and caching. LIBSVM provides some special settings for unbalanced data; details are in Section 6. Section 7 discusses our implementation for multi-class classification. Section 8 presents how to transform SVM decision values into probability values. Parameter selection is important for obtaining good SVM models. Section 9 presents a simple and useful parameter selection tool in LIBSVM. Finally, Section 10 concludes this work.

3LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools.

2 SVM Formulations

LIBSVM supports various SVM formulations for classification, regression, and distribution estimation. In this section, we present these formulations and give corresponding references. We also show performance measures used in LIBSVM.

2.1 C-Support Vector Classification

Given training vectors x_i ∈ R^n, i = 1, . . . , l, in two classes, and an indicator vector y ∈ R^l such that y_i ∈ {1, −1}, C-SVC (Boser et al., 1992; Cortes and Vapnik, 1995) solves the following primal optimization problem.

    min_{w,b,ξ}   (1/2) w^T w + C Σ_{i=1}^{l} ξ_i                                   (1)
    subject to    y_i (w^T ϕ(x_i) + b) ≥ 1 − ξ_i,
                  ξ_i ≥ 0,  i = 1, . . . , l,

where ϕ(x_i) maps x_i into a higher-dimensional space and C > 0 is the regularization parameter. Due to the possible high dimensionality of the vector variable w, usually we solve the following dual problem.

    min_α        (1/2) α^T Q α − e^T α
    subject to   y^T α = 0,                                                         (2)
                 0 ≤ α_i ≤ C,  i = 1, . . . , l,

where e = [1, . . . , 1]^T is the vector of all ones and Q is an l by l positive semidefinite matrix with Q_ij ≡ y_i y_j K(x_i, x_j), where K(x_i, x_j) ≡ ϕ(x_i)^T ϕ(x_j) is the kernel function.

After problem (2) is solved, using the primal-dual relationship, the optimal w satisfies

    w = Σ_{i=1}^{l} y_i α_i ϕ(x_i)                                                  (3)

and the decision function is

    sgn(w^T ϕ(x) + b) = sgn( Σ_{i=1}^{l} y_i α_i K(x_i, x) + b ).

We store y_i α_i ∀i, b, label names,4 support vectors, and other information such as kernel parameters in the model for prediction.
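To make the prediction step concrete, the following C++ sketch evaluates the above decision function from stored coefficients y_i α_i, the support vectors, and b. It is only an illustration under assumed names, a dense data layout, and an RBF kernel; it is not LIBSVM's svm_predict code.

#include <cmath>
#include <vector>

// One support vector: its (dense) feature vector and the stored coefficient y_i * alpha_i.
struct SupportVector { std::vector<double> x; double coef; };

// RBF kernel K(a, b) = exp(-gamma * ||a - b||^2), used here only as an example.
double rbf_kernel(const std::vector<double>& a, const std::vector<double>& b, double gamma) {
    double d = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) d += (a[k] - b[k]) * (a[k] - b[k]);
    return std::exp(-gamma * d);
}

// Decision value sum_i y_i alpha_i K(x_i, x) + b; the predicted label is its sign.
int predict_label(const std::vector<SupportVector>& svs, double b, double gamma,
                  const std::vector<double>& x) {
    double dec = b;
    for (const SupportVector& sv : svs) dec += sv.coef * rbf_kernel(sv.x, x, gamma);
    return dec > 0 ? +1 : -1;
}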

2.2 ν-Support Vector Classification

The ν-support vector classification (Schölkopf et al., 2000) introduces a new parameter ν ∈ (0, 1]. It is proved that ν is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

Given training vectors x_i ∈ R^n, i = 1, . . . , l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, the primal optimization problem is

    min_{w,b,ξ,ρ}   (1/2) w^T w − νρ + (1/l) Σ_{i=1}^{l} ξ_i
    subject to      y_i (w^T ϕ(x_i) + b) ≥ ρ − ξ_i,                                 (4)
                    ξ_i ≥ 0,  i = 1, . . . , l,  ρ ≥ 0.

The dual problem is

    min_α        (1/2) α^T Q α
    subject to   0 ≤ α_i ≤ 1/l,  i = 1, . . . , l,                                  (5)
                 e^T α ≥ ν,  y^T α = 0,

where Q_ij = y_i y_j K(x_i, x_j). Chang and Lin (2001) show that problem (5) is feasible if and only if

    ν ≤ 2 min(#{y_i = +1}, #{y_i = −1}) / l ≤ 1,

so the usable range of ν is smaller than (0, 1].

4In LIBSVM, any integer can be a label name, so we map label names to ±1 by assigning the first training instance to have y_1 = +1.

The decision function is

    sgn( Σ_{i=1}^{l} y_i α_i K(x_i, x) + b ).

It is shown that e^T α ≥ ν can be replaced by e^T α = ν (Crisp and Burges, 2000; Chang and Lin, 2001). In LIBSVM, we solve a scaled version of problem (5) because numerically α_i may be too small due to the constraint α_i ≤ 1/l.

    min_ᾱ        (1/2) ᾱ^T Q ᾱ
    subject to   0 ≤ ᾱ_i ≤ 1,  i = 1, . . . , l,                                    (6)
                 e^T ᾱ = νl,  y^T ᾱ = 0.

If α is optimal for the dual problem (5) and ρ is optimal for the primal problem (4), Chang and Lin (2001) show that α/ρ is an optimal solution of C-SVM with C = 1/(ρl). Thus, in LIBSVM, we output (α/ρ, b/ρ) in the model.5

2.3 Distribution Estimation (One-class SVM)

One-class SVM was proposed by Schölkopf et al. (2001) for estimating the support of a high-dimensional distribution. Given training vectors x_i ∈ R^n, i = 1, . . . , l, without any class information, the primal problem of one-class SVM is

    min_{w,ξ,ρ}   (1/2) w^T w − ρ + (1/(νl)) Σ_{i=1}^{l} ξ_i                        (7)
    subject to    w^T ϕ(x_i) ≥ ρ − ξ_i,                                             (8)
                  ξ_i ≥ 0,  i = 1, . . . , l.

The dual problem is

    min_α        (1/2) α^T Q α
    subject to   0 ≤ α_i ≤ 1/(νl),  i = 1, . . . , l,                               (9)
                 e^T α = 1,

where Q_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j). The decision function is

    sgn( Σ_{i=1}^{l} α_i K(x_i, x) − ρ ).

5More precisely, solving (6) obtains ρ̄ = ρl. Because ᾱ = lα, we have α/ρ = ᾱ/ρ̄. Hence, in LIBSVM we obtain α/ρ directly as ᾱ/ρ̄ from the scaled problem (6).

Similar to the case of ν-SVC, in LIBSVM, we solve a scaled version of (9).

    min_α        (1/2) α^T Q α
    subject to   0 ≤ α_i ≤ 1,  i = 1, . . . , l,                                    (10)
                 e^T α = νl.

2.4 ϵ-Support Vector Regression (ϵ-SVR)

Consider a set of training points {(x_1, z_1), . . . , (x_l, z_l)}, where x_i ∈ R^n is a feature vector and z_i ∈ R^1 is the target output. Under given parameters C > 0 and ϵ > 0, the standard form of support vector regression (Vapnik, 1998) is

    min_{w,b,ξ,ξ*}   (1/2) w^T w + C Σ_{i=1}^{l} ξ_i + C Σ_{i=1}^{l} ξ*_i
    subject to       w^T ϕ(x_i) + b − z_i ≤ ϵ + ξ_i,
                     z_i − w^T ϕ(x_i) − b ≤ ϵ + ξ*_i,
                     ξ_i, ξ*_i ≥ 0,  i = 1, . . . , l.

The dual problem is

    min_{α,α*}   (1/2)(α − α*)^T Q (α − α*) + ϵ Σ_{i=1}^{l} (α_i + α*_i) + Σ_{i=1}^{l} z_i (α_i − α*_i)
    subject to   e^T (α − α*) = 0,                                                  (11)
                 0 ≤ α_i, α*_i ≤ C,  i = 1, . . . , l,

where Q_ij = K(x_i, x_j) ≡ ϕ(x_i)^T ϕ(x_j).

After solving problem (11), the approximate function is

    Σ_{i=1}^{l} (−α_i + α*_i) K(x_i, x) + b.

In LIBSVM, we output α* − α in the model.

2.5 ν-Support Vector Regression (ν-SVR)

Similar to ν-SVC, for regression, Schölkopf et al. (2000) use a parameter ν ∈ (0, 1] to control the number of support vectors. The parameter ϵ of ϵ-SVR becomes a variable of the optimization problem here. With (C, ν) as parameters, ν-SVR solves

    min_{w,b,ξ,ξ*,ϵ}   (1/2) w^T w + C(νϵ + (1/l) Σ_{i=1}^{l} (ξ_i + ξ*_i))
    subject to         (w^T ϕ(x_i) + b) − z_i ≤ ϵ + ξ_i,
                       z_i − (w^T ϕ(x_i) + b) ≤ ϵ + ξ*_i,
                       ξ_i, ξ*_i ≥ 0,  i = 1, . . . , l,  ϵ ≥ 0.

The dual problem is

    min_{α,α*}   (1/2)(α − α*)^T Q (α − α*) + z^T (α − α*)
    subject to   e^T (α − α*) = 0,  e^T (α + α*) ≤ Cν,                              (12)
                 0 ≤ α_i, α*_i ≤ C/l,  i = 1, . . . , l.

The approximate function is

    Σ_{i=1}^{l} (−α_i + α*_i) K(x_i, x) + b.

Similar to ν-SVC, Chang and Lin (2002) show that the inequality e^T(α + α*) ≤ Cν can be replaced by an equality. Moreover, C/l may be too small because users often choose C to be a small constant like one. Thus, in LIBSVM, we treat the user-specified regularization parameter as C/l. That is, C̄ = C/l is what users specify, and LIBSVM solves the following problem.

    min_{α,α*}   (1/2)(α − α*)^T Q (α − α*) + z^T (α − α*)
    subject to   e^T (α − α*) = 0,  e^T (α + α*) = C̄lν,
                 0 ≤ α_i, α*_i ≤ C̄,  i = 1, . . . , l.

Chang and Lin (2002) prove that ϵ-SVR with parameters (C̄, ϵ) has the same solution as ν-SVR with parameters (lC̄, ν).

3 Performance Measures, Basic Usage, and Code Organization

This section describes LIBSVM's evaluation measures, shows a simple example of running LIBSVM, and presents the code organization.

3.1 Performance Measures

After solving the optimization problems listed in previous sections, users can apply the decision functions to predict labels (target values) of testing data. Let x_1, . . . , x_l̄ be the testing data and f(x_1), . . . , f(x_l̄) be the decision values (target values for regression) predicted by LIBSVM. If the true labels (true target values) of the testing data are known and denoted as y_1, . . . , y_l̄, we evaluate the prediction results by the following measures.

3.1.1 Classification

    Accuracy = (# correctly predicted data / # total testing data) × 100%

3.1.2 Regression

LIBSVM outputs MSE (mean squared error) and r² (squared correlation coefficient).

    MSE = (1/l̄) Σ_{i=1}^{l̄} (f(x_i) − y_i)²,

    r² = ( l̄ Σ_{i=1}^{l̄} f(x_i) y_i − Σ_{i=1}^{l̄} f(x_i) · Σ_{i=1}^{l̄} y_i )²
         / ( ( l̄ Σ_{i=1}^{l̄} f(x_i)² − (Σ_{i=1}^{l̄} f(x_i))² ) ( l̄ Σ_{i=1}^{l̄} y_i² − (Σ_{i=1}^{l̄} y_i)² ) ).
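As a small illustration of these two measures, the following C++ sketch computes MSE and r² from arrays of predicted and true values (array names are assumptions; this is not LIBSVM's evaluation code).

#include <cstddef>
#include <vector>

// Mean squared error and squared correlation coefficient of predictions f
// against true targets y, following the two formulas above.
void regression_measures(const std::vector<double>& f, const std::vector<double>& y,
                         double& mse, double& r2) {
    const double l = static_cast<double>(f.size());
    double sf = 0, sy = 0, sff = 0, syy = 0, sfy = 0, se = 0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        sf  += f[i];        sy  += y[i];
        sff += f[i] * f[i]; syy += y[i] * y[i];
        sfy += f[i] * y[i]; se  += (f[i] - y[i]) * (f[i] - y[i]);
    }
    mse = se / l;
    const double num = l * sfy - sf * sy;
    r2 = (num * num) / ((l * sff - sf * sf) * (l * syy - sy * sy));
}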

3.2 A Simple Example of Running LIBSVM

While detailed instructions of using LIBSVM are available in the README file of the package and the practical guide by Hsu et al. (2003), here we give a simple example.

LIBSVM includes a sample data set heart_scale of 270 instances. We split the data into a training set heart_scale.tr (170 instances) and a testing set heart_scale.te.

$ python tools/subset.py heart_scale 170 heart_scale.tr heart_scale.te

The command svm-train solves an SVM optimization problem to produce a model.6

$ ./svm-train heart_scale.tr
*
optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88

6The default solver is C-SVC using the RBF kernel (50) with C = 1 and γ = 1/n.

Next, the command svm-predict uses the obtained model to classify the testing set.

$ ./svm-predict heart_scale.te heart_scale.tr.model output
Accuracy = 83% (83/100) (classification)

The file output contains predicted class labels.

3.3 Code Organization

All LIBSVM’s training and testing algorithms are implemented in the file svm.cpp.

The two main sub-routines are svm_train and svm_predict. The training procedure is more sophisticated, so we give the code organization in Figure 1.

From Figure 1, for classification, svm_train decouples a multi-class problem into two-class problems (see Section 7) and calls svm_train_one several times. For regression and one-class SVM, it directly calls svm_train_one. The probability outputs for classification and regression are also handled in svm_train. Then, according to the SVM formulation, svm_train_one calls a corresponding sub-routine such as solve_c_svc for C-SVC and solve_nu_svc for ν-SVC. All solve_* sub-routines call the solver Solve after preparing suitable input values. The sub-routine Solve minimizes a general form of SVM optimization problems; see (13) and (24). Details of the sub-routine Solve are described in Sections 4-6.

4 Solving the Quadratic Problems

This section discusses the algorithms used in LIBSVM to solve the dual quadratic problems listed in Section 2. We split the discussion into two parts. The first part considers optimization problems with one linear constraint, while the second part considers those with two linear constraints.


Figure 1: LIBSVM's code organization for training. All sub-routines are in svm.cpp. (svm_train calls the main training sub-routine svm_train_one for two-class SVC, SVR, and one-class SVM; svm_train_one calls a formulation-specific sub-routine such as solve_c_svc or solve_nu_svc; these in turn call Solve, which solves problems (13) and (24).)

4.1 Quadratic Problems with One Linear Constraint: C-SVC, ϵ-SVR, and One-class SVM

We consider the following general form of C-SVC, ϵ-SVR, and one-class SVM.

    min_α        f(α)
    subject to   y^T α = ∆,                                                         (13)
                 0 ≤ α_t ≤ C,  t = 1, . . . , l,

where

    f(α) ≡ (1/2) α^T Q α + p^T α

and y_t = ±1, t = 1, . . . , l. The constraint y^T α = ∆ is the linear constraint. It can be clearly seen that C-SVC and one-class SVM are already in the form of problem (13). For ϵ-SVR, we use the following reformulation of Eq. (11).

    min_{α,α*}   (1/2) [(α*)^T, α^T] [ Q  −Q ; −Q  Q ] [α*; α] + [ϵe^T − z^T, ϵe^T + z^T] [α*; α]
    subject to   y^T [α*; α] = 0,
                 0 ≤ α_t, α*_t ≤ C,  t = 1, . . . , l,

where

    y = [1, . . . , 1, −1, . . . , −1]^T   (l ones followed by l minus ones).

We do not assume that Q is positive semi-definite (PSD) because sometimes non-PSD kernel matrices are used.


4.1.1 Decomposition Method for Dual Problems

The main difficulty for solving problem (13) is that Q is a dense matrix and may be too large to be stored. In LIBSVM, we consider a decomposition method to conquer this difficulty. Some earlier works on decomposition methods for SVM include, for example, Osuna et al. (1997a); Joachims (1998); Platt (1998); Keerthi et al. (2001); Hsu and Lin (2002b). Subsequent developments include, for example, Fan et al. (2005); Palagi and Sciandrone (2005); Glasmachers and Igel (2006). A decomposition method modifies only a subset of α per iteration, so only some columns of Q are needed. This subset of variables, denoted as the working set B, leads to a smaller optimization sub-problem. An extreme case of the decomposition methods is the Sequential Minimal Optimization (SMO) (Platt, 1998), which restricts B to have only two elements. Then, at each iteration, we solve a simple two-variable problem without needing any optimization software. LIBSVM considers an SMO-type decomposition method proposed in Fan et al. (2005).

Algorithm 1 (An SMO-type decomposition method in Fan et al., 2005)

1. Find α^1 as the initial feasible solution. Set k = 1.

2. If α^k is a stationary point of problem (13), stop. Otherwise, find a two-element working set B = {i, j} by WSS 1 (described in Section 4.1.2). Define N ≡ {1, . . . , l}\B. Let α^k_B and α^k_N be sub-vectors of α^k corresponding to B and N, respectively.

3. If a_ij ≡ K_ii + K_jj − 2K_ij > 0,7 solve the following sub-problem with the variable α_B = [α_i, α_j]^T:

       min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [α_i; α_j] + (p_B + Q_BN α^k_N)^T [α_i; α_j]
       subject to      0 ≤ α_i, α_j ≤ C,                                            (14)
                       y_i α_i + y_j α_j = ∆ − y_N^T α^k_N.

   Else, let τ be a small positive constant and solve

       min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [α_i; α_j] + (p_B + Q_BN α^k_N)^T [α_i; α_j]
                       + ((τ − a_ij)/4) ((α_i − α^k_i)² + (α_j − α^k_j)²)           (15)
       subject to      constraints of problem (14).

4. Set α^{k+1}_B to be the optimal solution of sub-problem (14) or (15), and α^{k+1}_N ≡ α^k_N. Set k ← k + 1 and go to Step 2.

7We abbreviate K(x_i, x_j) to K_ij.

Note that B is updated at each iteration, but for simplicity, we use B instead of B^k. If Q is PSD, then a_ij > 0. Thus, sub-problem (15) is used only to handle the situation where Q is non-PSD.

4.1.2 Stopping Criteria and Working Set Selection

The Karush-Kuhn-Tucker (KKT) optimality condition of problem (13) implies that a feasible α is a stationary point of (13) if and only if there exists a number b and two nonnegative vectors λ and ξ such that

    ∇f(α) + by = λ − ξ,
    λ_i α_i = 0,  ξ_i (C − α_i) = 0,  λ_i ≥ 0,  ξ_i ≥ 0,  i = 1, . . . , l,          (16)

where ∇f(α) ≡ Qα + p is the gradient of f(α). Note that if Q is PSD, from the primal-dual relationship, ξ, b, and the w generated by Eq. (3) form an optimal solution of the primal problem. The condition (16) can be rewritten as

    ∇_i f(α) + b y_i  ≥ 0  if α_i < C,
    ∇_i f(α) + b y_i  ≤ 0  if α_i > 0.                                              (17)

Since y_i = ±1, condition (17) is equivalent to the existence of b such that

    m(α) ≤ b ≤ M(α),

where

    m(α) ≡ max_{i∈I_up(α)} −y_i ∇_i f(α)   and   M(α) ≡ min_{i∈I_low(α)} −y_i ∇_i f(α),

and

    I_up(α) ≡ {t | α_t < C, y_t = 1 or α_t > 0, y_t = −1},  and
    I_low(α) ≡ {t | α_t < C, y_t = −1 or α_t > 0, y_t = 1}.

That is, a feasible α is a stationary point of problem (13) if and only if

    m(α) ≤ M(α).                                                                    (18)

From (18), a suitable stopping condition is

    m(α^k) − M(α^k) ≤ ϵ,                                                            (19)

where ϵ is the tolerance.

For the selection of the working set B, we use the following procedure from Section II of Fan et al. (2005).

WSS 1

1. For all t, s, define

       a_ts ≡ K_tt + K_ss − 2K_ts,   b_ts ≡ −y_t ∇_t f(α^k) + y_s ∇_s f(α^k) > 0,   (20)

   and

       ā_ts ≡ a_ts if a_ts > 0, τ otherwise.

   Select

       i ∈ arg max_t {−y_t ∇_t f(α^k) | t ∈ I_up(α^k)},
       j ∈ arg min_t {−b²_it / ā_it | t ∈ I_low(α^k), −y_t ∇_t f(α^k) < −y_i ∇_i f(α^k)}.   (21)

2. Return B = {i, j}.

The procedure selects a pair {i, j} approximately minimizing the function value; see the term −b²_it/ā_it in Eq. (21).
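The following C++ sketch shows one way WSS 1 and the stopping check (19) can be implemented from the maintained gradient. It is a simplified illustration with assumed names (G for ∇f(α^k), a user-supplied kernel function K), not LIBSVM's select_working_set routine.

#include <algorithm>
#include <cfloat>
#include <functional>
#include <vector>

const double TAU = 1e-12;   // the small positive constant tau

static bool in_Iup (double a, double C, int y) { return (y == +1) ? a < C : a > 0; }
static bool in_Ilow(double a, double C, int y) { return (y == +1) ? a > 0 : a < C; }

// Returns false if the stopping condition m(alpha) - M(alpha) <= eps already holds;
// otherwise stores the selected pair in (out_i, out_j).
bool select_working_set(const std::vector<double>& G, const std::vector<double>& alpha,
                        const std::vector<int>& y, double C, double eps,
                        const std::function<double(int,int)>& K,   // K(s,t) = kernel value
                        int& out_i, int& out_j) {
    const int l = static_cast<int>(G.size());

    // i = arg max { -y_t grad_t : t in I_up }, i.e., the index attaining m(alpha).
    int i = -1; double Gmax = -DBL_MAX;
    for (int t = 0; t < l; ++t)
        if (in_Iup(alpha[t], C, y[t]) && -y[t] * G[t] > Gmax) { Gmax = -y[t] * G[t]; i = t; }
    if (i == -1) return false;

    // j minimizes -b_it^2 / a_bar_it over t in I_low with -y_t grad_t < m(alpha);
    // Gmin tracks M(alpha) for the stopping check.
    int j = -1; double obj_min = DBL_MAX, Gmin = DBL_MAX;
    for (int t = 0; t < l; ++t) {
        if (!in_Ilow(alpha[t], C, y[t])) continue;
        Gmin = std::min(Gmin, -y[t] * G[t]);
        double b = Gmax + y[t] * G[t];                    // b_it = -y_i grad_i + y_t grad_t
        if (b <= 0) continue;
        double a = K(i, i) + K(t, t) - 2.0 * K(i, t);     // a_it
        if (a <= 0) a = TAU;                              // a_bar_it
        if (-(b * b) / a < obj_min) { obj_min = -(b * b) / a; j = t; }
    }
    if (j == -1 || Gmax - Gmin <= eps) return false;      // optimal within tolerance
    out_i = i; out_j = j;
    return true;
}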

4.1.3 Solving the Two-variable Sub-problem

Details of solving the two-variable sub-problem in Eqs. (14) and (15) are deferred to Section 6, where a more general sub-problem is discussed.

4.1.4 Maintaining the Gradient

From the discussion in Sections 4.1.1 and 4.1.2, the main operations per iteration are finding Q_BN α^k_N + p_B for constructing the sub-problem (14), and calculating ∇f(α^k) for the working set selection and the stopping condition. These two operations can be considered together because

    Q_BN α^k_N + p_B = ∇_B f(α^k) − Q_BB α^k_B                                      (22)

and

    ∇f(α^{k+1}) = ∇f(α^k) + Q_{:,B} (α^{k+1}_B − α^k_B),                            (23)

where |B| ≪ |N| and Q_{:,B} is the sub-matrix of Q including columns in B. If at the kth iteration we already have ∇f(α^k), then Eq. (22) can be used to construct the sub-problem. After the sub-problem is solved, Eq. (23) is employed to obtain the next ∇f(α^{k+1}). Therefore, LIBSVM maintains the gradient throughout the decomposition method.
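For illustration, a C++ sketch of the update (23): after α_i and α_j change, only the two corresponding kernel columns are touched. Q_i and Q_j are assumed to hold columns Q_{:,i} and Q_{:,j}; the names are illustrative, not LIBSVM's internal ones.

#include <vector>

// Eq. (23): G[t] = grad_t f(alpha) is updated using only columns i and j of Q.
void update_gradient(std::vector<double>& G,
                     const std::vector<double>& Q_i, const std::vector<double>& Q_j,
                     double delta_alpha_i, double delta_alpha_j) {
    for (std::size_t t = 0; t < G.size(); ++t)
        G[t] += Q_i[t] * delta_alpha_i + Q_j[t] * delta_alpha_j;
}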

4.1.5 The Calculation of b or ρ

After the solution α of the dual optimization problem is obtained, the variables b or ρ must be calculated as they are used in the decision function.

Note that b of C-SVC and ϵ-SVR plays the same role as −ρ in one-class SVM, so we define ρ = −b and discuss how to find ρ. If there exists α_i such that 0 < α_i < C, then from the KKT condition (18), ρ = y_i ∇_i f(α). In LIBSVM, for numerical stability, we average all these values.

    ρ = Σ_{i: 0<α_i<C} y_i ∇_i f(α) / |{i | 0 < α_i < C}|.

For the situation where no α_i satisfies 0 < α_i < C, the KKT condition (18) becomes

    −M(α) = max{y_i ∇_i f(α) | α_i = 0, y_i = −1 or α_i = C, y_i = 1}
          ≤ ρ
          ≤ −m(α) = min{y_i ∇_i f(α) | α_i = 0, y_i = 1 or α_i = C, y_i = −1}.

We take ρ to be the midpoint of the preceding range.
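A C++ sketch of this rule, assuming the gradient G, labels y, α, and C are available (illustrative names; a simplified version of what a calculate-rho routine might look like rather than LIBSVM's exact code):

#include <algorithm>
#include <cfloat>
#include <vector>

double calculate_rho(const std::vector<double>& G, const std::vector<int>& y,
                     const std::vector<double>& alpha, double C) {
    double sum_free = 0.0; int n_free = 0;
    double ub = DBL_MAX, lb = -DBL_MAX;          // upper bound -m(alpha), lower bound -M(alpha)
    for (std::size_t i = 0; i < G.size(); ++i) {
        const double yG = y[i] * G[i];           // y_i * grad_i f(alpha)
        if (alpha[i] > 0 && alpha[i] < C) { sum_free += yG; ++n_free; }
        else if ((alpha[i] == 0 && y[i] == -1) || (alpha[i] == C && y[i] == +1))
            lb = std::max(lb, yG);               // candidates bounding rho from below
        else
            ub = std::min(ub, yG);               // candidates bounding rho from above
    }
    // Average over free alpha_i if any exist; otherwise take the midpoint of the range.
    return n_free > 0 ? sum_free / n_free : (ub + lb) / 2;
}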

4.1.6 Initial Values

Algorithm 1 requires an initial feasible α. For C-SVC and ϵ-SVR, because the zero vector is feasible, we select it as the initial α.

For one-class SVM, the scaled form (10) requires that 0 ≤ α_i ≤ 1 and

    Σ_{i=1}^{l} α_i = νl.

We let the first ⌊νl⌋ elements have α_i = 1 and the (⌊νl⌋ + 1)st element have α_i = νl − ⌊νl⌋.
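A short C++ sketch of this initialization (names are illustrative):

#include <cmath>
#include <vector>

// One-class SVM: first floor(nu*l) entries are 1, the next entry gets the remainder.
std::vector<double> init_one_class_alpha(int l, double nu) {
    std::vector<double> alpha(l, 0.0);
    const int n = static_cast<int>(std::floor(nu * l));
    for (int i = 0; i < n && i < l; ++i) alpha[i] = 1.0;
    if (n < l) alpha[n] = nu * l - n;
    return alpha;
}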

4.1.7 Convergence of the Decomposition Method

Fan et al. (2005, Section III) and Chen et al. (2006) discuss the convergence of Algorithm 1 in detail. For the rate of linear convergence, List and Simon (2009) prove a result without making the assumption used in Chen et al. (2006).

4.2 Quadratic Problems with Two Linear Constraints: ν-SVC and ν-SVR

From problems (6) and (12), both ν-SVC and ν-SVR can be written as the following general form.

    min_α        (1/2) α^T Q α + p^T α
    subject to   y^T α = ∆_1,                                                       (24)
                 e^T α = ∆_2,
                 0 ≤ α_t ≤ C,  t = 1, . . . , l.

The main difference between problems (13) and (24) is that (24) has two linear constraints y^T α = ∆_1 and e^T α = ∆_2. The optimization algorithm is very similar to that for (13), so we describe only the differences.

4.2.1 Stopping Criteria and Working Set Selection

Let f(α) be the objective function of problem (24). By the same derivation as in Section 4.1.2, the KKT condition of problem (24) implies that there exist b and ρ such that

    ∇_i f(α) − ρ + b y_i  ≥ 0  if α_i < C,
    ∇_i f(α) − ρ + b y_i  ≤ 0  if α_i > 0.                                          (25)

Define

    r_1 ≡ ρ − b   and   r_2 ≡ ρ + b.                                                (26)

If y_i = 1, (25) becomes

    ∇_i f(α) − r_1  ≥ 0  if α_i < C,
    ∇_i f(α) − r_1  ≤ 0  if α_i > 0.                                                (27)

If y_i = −1, (25) becomes

    ∇_i f(α) − r_2  ≥ 0  if α_i < C,
    ∇_i f(α) − r_2  ≤ 0  if α_i > 0.                                                (28)

Hence, given a tolerance ϵ > 0, the stopping condition is

    max(m_p(α) − M_p(α), m_n(α) − M_n(α)) < ϵ,                                      (29)

where

    m_p(α) ≡ max_{i∈I_up(α), y_i=1} −y_i ∇_i f(α),    M_p(α) ≡ min_{i∈I_low(α), y_i=1} −y_i ∇_i f(α),   and
    m_n(α) ≡ max_{i∈I_up(α), y_i=−1} −y_i ∇_i f(α),   M_n(α) ≡ min_{i∈I_low(α), y_i=−1} −y_i ∇_i f(α).

The following working set selection is extended from WSS 1.

WSS 2 (Extension of WSS 1 for ν-SVM)

1. Find

       i_p ∈ arg m_p(α^k),
       j_p ∈ arg min_t {−b²_{i_p t} / ā_{i_p t} | y_t = 1, t ∈ I_low(α^k), −y_t ∇_t f(α^k) < −y_{i_p} ∇_{i_p} f(α^k)}.

2. Find

       i_n ∈ arg m_n(α^k),
       j_n ∈ arg min_t {−b²_{i_n t} / ā_{i_n t} | y_t = −1, t ∈ I_low(α^k), −y_t ∇_t f(α^k) < −y_{i_n} ∇_{i_n} f(α^k)}.

3. Return {i_p, j_p} or {i_n, j_n} depending on which one gives a smaller −b²_ij/ā_ij.

4.2.2 The Calculation of b and ρ

We have shown that the KKT condition of problem (24) implies Eqs. (27) and (28) according to y_i = 1 and −1, respectively. Now we consider the case of y_i = 1. If there exists α_i such that 0 < α_i < C, then we obtain r_1 = ∇_i f(α). In LIBSVM, for numerical stability, we average these values.

    r_1 = Σ_{i: 0<α_i<C, y_i=1} ∇_i f(α) / |{i | 0 < α_i < C, y_i = 1}|.

If there is no α_i such that 0 < α_i < C and y_i = 1, then r_1 satisfies

    max_{α_i=C, y_i=1} ∇_i f(α) ≤ r_1 ≤ min_{α_i=0, y_i=1} ∇_i f(α).

We take r_1 to be the midpoint of the preceding range. For the case of y_i = −1, we can calculate r_2 in a similar way. After r_1 and r_2 are obtained, from Eq. (26),

    ρ = (r_1 + r_2)/2   and   −b = (r_1 − r_2)/2.

4.2.3 Initial Values

For ν-SVC, the scaled form (6) requires that 0 ≤ α_i ≤ 1,

    Σ_{i: y_i=1} α_i = νl/2,   and   Σ_{i: y_i=−1} α_i = νl/2.

We let the first νl/2 elements of α_i with y_i = 1 have the value one.8 The situation for y_i = −1 is similar. The same setting is applied to ν-SVR.

5 Shrinking and Caching

This section discusses two implementation tricks (shrinking and caching) for the decomposition method and investigates the computational complexity of Algorithm 1.

5.1 Shrinking

An optimal solution α of the SVM dual problem may contain some bounded elements (i.e., α_i = 0 or C). These elements may have already been bounded in the middle of the decomposition iterations. To save training time, the shrinking technique tries to identify and remove some bounded elements, so a smaller optimization problem is solved (Joachims, 1998). The following theorem theoretically supports the shrinking technique by showing that at the final iterations of Algorithm 1 in Section 4.1.2, only a small set of variables is still changed.

Theorem 5.1 (Theorem IV in Fan et al., 2005) Consider problem (13) and assume Q is positive semi-definite.

1. The following set is independent of any optimal solution ᾱ.

       I ≡ {i | −y_i ∇_i f(ᾱ) > M(ᾱ) or −y_i ∇_i f(ᾱ) < m(ᾱ)}.

   Further, for every i ∈ I, problem (13) has a unique and bounded optimal solution at α_i.

2. Assume Algorithm 1 generates an infinite sequence {α^k}. There exists k̄ such that after k ≥ k̄, every α^k_i, i ∈ I, has reached the unique and bounded optimal solution. That is, α^k_i remains the same in all subsequent iterations. In addition, for all k ≥ k̄,

       i ∉ {t | M(α^k) ≤ −y_t ∇_t f(α^k) ≤ m(α^k)}.

If we denote A as the set containing elements not shrunk at the kth iteration, then instead of solving problem (13), the decomposition method works on a smaller problem.

    min_{α_A}    (1/2) α_A^T Q_AA α_A + (p_A + Q_AN α^k_N)^T α_A
    subject to   0 ≤ α_i ≤ C, ∀i ∈ A,                                               (30)
                 y_A^T α_A = ∆ − y_N^T α^k_N,

where N = {1, . . . , l}\A is the set of shrunk variables. Note that in LIBSVM, we always rearrange elements of α, y, and p to maintain that A = {1, . . . , |A|}. Details of the index rearrangement are in Section 5.4.

After solving problem (30), we may find that some elements are wrongly shrunk. When that happens, the original problem (13) is reoptimized from a starting point α = [α_A; α_N], where α_A is optimal for problem (30) and α_N corresponds to shrunk bounded variables.

In LIBSVM, we start the shrinking procedure in an early stage. The procedure is as follows.

1. After every min(l, 1000) iterations, we try to shrink some variables. Note that throughout the iterative process, we have

       m(α^k) > M(α^k)                                                              (31)

   because the condition (19) is not satisfied yet. Following Theorem 5.1, we conjecture that variables in the following set can be shrunk.

       {t | −y_t ∇_t f(α^k) > m(α^k), t ∈ I_low(α^k), α^k_t is bounded} ∪
       {t | −y_t ∇_t f(α^k) < M(α^k), t ∈ I_up(α^k), α^k_t is bounded}
       = {t | −y_t ∇_t f(α^k) > m(α^k), α^k_t = C, y_t = 1 or α^k_t = 0, y_t = −1} ∪
         {t | −y_t ∇_t f(α^k) < M(α^k), α^k_t = 0, y_t = 1 or α^k_t = C, y_t = −1}.  (32)

   Thus, the size of the set A is gradually reduced every min(l, 1000) iterations. The problem (30) and the way of calculating m(α^k) and M(α^k) are adjusted accordingly.


2. The preceding shrinking strategy is sometimes too aggressive. Hence, when the decomposition method achieves the following condition for the first time,

       m(α^k) ≤ M(α^k) + 10ϵ,                                                       (33)

   where ϵ is the specified stopping tolerance, we reconstruct the gradient (details in Section 5.3). Then, the shrinking procedure can be performed based on more accurate information.

3. Once the stopping condition

       m(α^k) ≤ M(α^k) + ϵ                                                          (34)

   of the smaller problem (30) is reached, we must check if the stopping condition of the original problem (13) has been satisfied. If not, then we reactivate all variables by setting A = {1, . . . , l} and start the same shrinking procedure on the problem (30).

   Note that in solving the shrunk problem (30), we only maintain its gradient Q_AA α_A + Q_AN α_N + p_A (see also Section 4.1.4). Hence, when we reactivate all variables to reoptimize the problem (13), we must reconstruct the whole gradient ∇f(α). Details are discussed in Section 5.3.

For ν-SVC and ν-SVR, because the stopping condition (29) is different from (19), the variables being shrunk are different from those in (32). For y_t = 1, we shrink elements in the following set:

    {t | −y_t ∇_t f(α^k) > m_p(α^k), α_t = C, y_t = 1} ∪
    {t | −y_t ∇_t f(α^k) < M_p(α^k), α_t = 0, y_t = 1}.

For y_t = −1, we consider the following set:

    {t | −y_t ∇_t f(α^k) > m_n(α^k), α_t = 0, y_t = −1} ∪
    {t | −y_t ∇_t f(α^k) < M_n(α^k), α_t = C, y_t = −1}.

5.2 Caching

Caching is an effective technique for reducing the computational time of the decomposition method. Because Q may be too large to be stored in the computer memory, Q_ij elements are calculated as needed. We can use available memory (called the kernel cache) to store recently used kernel columns, so these elements do not need to be recalculated. Theorem 5.1 also supports the use of caching because in final iterations, only certain columns of the matrix Q are still needed. If the cache already contains these columns, we can save kernel evaluations in final iterations.

In LIBSVM, we consider a simple least-recently-used (LRU) caching strategy. We use a circular list of structures, where each structure is defined as follows.

struct head_t {
    head_t *prev, *next;   // a circular list
    Qfloat *data;
    int len;               // data[0,len) is cached in this entry
};

A structure stores the first len elements of a kernel column. Using the pointers prev and next, it is easy to insert or delete a column. The circular list is maintained so that structures are ordered from the least-recently-used one to the most-recently-used one.

Because of shrinking, columns cached in the computer memory may have different lengths. Assume the ith column is needed and Q_{1:t,i} has been cached. If t ≤ |A|, we calculate Q_{t+1:|A|,i} and store Q_{1:|A|,i} in the cache. If t > |A|, the desired Q_{1:|A|,i} is already in the cache. In this situation, we do not change the cached contents of the ith column.
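The following C++ sketch shows how a request for the first len entries of a column might be served under such an LRU list. It reuses the head_t structure above but is otherwise a simplified illustration (no eviction of old columns), not LIBSVM's Cache::get_data.

#include <cstdlib>

typedef float Qfloat;

struct head_t { head_t *prev, *next; Qfloat *data; int len; };   // as defined above

static void lru_delete(head_t *h) { h->prev->next = h->next; h->next->prev = h->prev; }
static void lru_insert(head_t *h, head_t *lru_head) {            // insert as most recently used
    h->next = lru_head; h->prev = lru_head->prev;
    h->prev->next = h; h->next->prev = h;
}

// Returns how many entries of this column were already cached; the caller must
// compute the kernel values (*data)[cached .. len) that are still missing.
static int get_column(head_t *h, head_t *lru_head, Qfloat **data, int len) {
    if (h->len > 0) lru_delete(h);               // unlink while updating this entry
    const int cached = h->len;
    if (cached < len) {                          // grow the buffer for the longer request
        h->data = static_cast<Qfloat *>(std::realloc(h->data, sizeof(Qfloat) * len));
        h->len = len;
    }
    lru_insert(h, lru_head);                     // this column becomes the most recently used
    *data = h->data;
    return cached;
}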

5.3 Reconstructing the Gradient

If condition (33) or (34) is satisfied, LIBSVM reconstructs the gradient. Because ∇_i f(α), i = 1, . . . , |A|, have been maintained in solving the smaller problem (30), what we need is to calculate ∇_i f(α), i = |A| + 1, . . . , l. To decrease the cost of this reconstruction, throughout the iterations we maintain a vector Ḡ ∈ R^l.

    Ḡ_i = C Σ_{j: α_j = C} Q_ij,  i = 1, . . . , l.                                 (35)

Then, for i ∉ A,

    ∇_i f(α) = Σ_{j=1}^{l} Q_ij α_j + p_i = Ḡ_i + p_i + Σ_{j ∈ A, 0 < α_j < C} Q_ij α_j.   (36)

Note that we use the fact that if j ∉ A, then α_j = 0 or C.
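A C++ sketch of this reconstruction for the shrunk indices, using assumed names (G, G_bar, p, and a kernel accessor Q); it illustrates Eq. (36) and is not LIBSVM's reconstruct_gradient.

#include <functional>
#include <vector>

// Rebuild G[i] = grad_i f(alpha) for i = |A|, ..., l-1 (0-based) via Eq. (36):
// G_i = G_bar_i + p_i + sum over free j in A of Q_ij * alpha_j.
void reconstruct_gradient(std::vector<double>& G, const std::vector<double>& G_bar,
                          const std::vector<double>& alpha, const std::vector<double>& p,
                          double C, int active_size,                       // active_size = |A|
                          const std::function<double(int,int)>& Q) {
    const int l = static_cast<int>(G.size());
    for (int i = active_size; i < l; ++i)
        G[i] = G_bar[i] + p[i];                                            // bounded part + linear term
    for (int j = 0; j < active_size; ++j)                                  // "j first" loop order
        if (alpha[j] > 0 && alpha[j] < C)                                  // only free variables in A
            for (int i = active_size; i < l; ++i)
                G[i] += alpha[j] * Q(i, j);
}

The loops above use the "j first" order; the choice between the two possible orders is exactly what the following discussion addresses.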


The calculation of ∇f(α) via Eq. (36) involves a two-level loop over i and j. Using i or j first may result in a very different number of Q_ij evaluations. We discuss the differences next.

1. i first: for |A| + 1 ≤ i ≤ l, calculate Q_{i,1:|A|}. Although from Eq. (36), only {Q_ij | 0 < α_j < C, j ∈ A} are needed, our implementation obtains all of Q_{i,1:|A|} (i.e., {Q_ij | j ∈ A}). Hence, this case needs at most

       (l − |A|) · |A|                                                              (37)

   kernel evaluations. Note that LIBSVM uses a column-based caching implementation. Due to the symmetry of Q, Q_{i,1:|A|} is part of Q's ith column and may have been cached. Thus, Eq. (37) is only an upper bound.

2. j first: let

       F ≡ {j | 1 ≤ j ≤ |A| and 0 < α_j < C}.

   For each j ∈ F, calculate Q_{1:l,j}. Though only Q_{|A|+1:l,j} is needed in calculating ∇_i f(α), i = |A| + 1, . . . , l, we must get the whole column because of our cache implementation.9 Thus, this strategy needs no more than

       l · |F|                                                                      (38)

   kernel evaluations. This is an upper bound because certain kernel columns (e.g., Q_{1:|A|,j}, j ∈ A) may already be in the cache and do not need to be recalculated.

We may choose a method by comparing (37) and (38). However, the decision depends on whether Q’s elements have been cached. If the cache is large enough, then elements of Q’s first |A| columns tend to be in the cache because they have been used recently.

In contrast, Q_{i,1:|A|}, i ∉ A, needed by method 1 is less likely to be in the cache because columns not in A are not used to solve problem (30). In such a situation, method 1 may require almost (l − |A|) · |A| kernel evaluations, while method 2 needs far fewer evaluations than l · |F|.

Because method 2 takes advantage of the cache implementation, we slightly lower the estimate in Eq. (38) and use the following rule to decide the method of calculating Eq. (36):

    If (l/2) · |F| > (l − |A|) · |A|
        use method 1
    Else
        use method 2

This rule may not give the optimal choice because we do not take the cache contents into account. However, we argue that in the worst scenario, the method selected by the preceding rule is only slightly slower than the other method. This result can be proved by making the following assumptions.

• A LIBSVM training procedure involves only two gradient reconstructions. The first is performed when the 10ϵ tolerance is achieved; see Eq. (33). The second is at the end of the training procedure.

• Our rule assigns the same method to perform the two gradient reconstructions. Moreover, these two reconstructions cost a similar amount of time.

We refer to “total training time of method x” as the whole LIBSVM training time (where method x is used for reconstructing gradients), and “reconstruction time of method x” as the time of one single gradient reconstruction via method x. We then consider two situations.

1. Method 1 is chosen, but method 2 is better.

We have

       Total time of method 1
         ≤ (Total time of method 2) + 2 · (Reconstruction time of method 1)
         ≤ 2 · (Total time of method 2).                                            (39)

   We explain the second inequality in detail. Method 2 for gradient reconstruction requires l · |F| kernel elements; however, the number of kernel evaluations may be smaller because some elements have been cached. Therefore,

       l · |F| ≤ Total time of method 2.                                            (40)

   Because method 1 is chosen and Eq. (37) is an upper bound,

       2 · (Reconstruction time of method 1) ≤ 2 · (l − |A|) · |A| < l · |F|.       (41)

   Combining inequalities (40) and (41) leads to (39).

2. Method 2 is chosen, but method 1 is better.

   We consider the worst situation where Q's first |A| columns are not in the cache. As |A| + 1, . . . , l are indices of shrunk variables, most likely the remaining l − |A| columns of Q are not in the cache either, and (l − |A|) · |A| kernel evaluations are needed for method 1. Because l · |F| ≤ 2 · (l − |A|) · |A|,

       (Reconstruction time of method 2) ≤ 2 · (Reconstruction time of method 1).

   Therefore,

       Total time of method 2
         ≤ (Total time of method 1) + 2 · (Reconstruction time of method 1)
         ≤ 2 · (Total time of method 1).

Table 2: A comparison between two gradient reconstruction methods. The decomposition method reconstructs the gradient twice, after satisfying conditions (33) and (34). We show in each row the number of kernel evaluations of a reconstruction. We check two cache sizes to reflect the situations with/without enough cache. The last two rows give the total training time (gradient reconstructions and other operations) in seconds. We use the RBF kernel K(x_i, x_j) = exp(−γ∥x_i − x_j∥²).

(a) a7a: C = 1, γ = 4, ϵ = 0.001, l = 16,100.

                                      Cache = 1,000 MB            Cache = 10 MB
  Reconstruction   |F|      |A|       Method 1    Method 2        Method 1      Method 2
  First            10,597   12,476    0           21,470,526      45,213,024    170,574,272
  Second           10,630   12,476    0           0               45,213,024    171,118,048
  Training time                       102s        108s            341s          422s
  No shrinking                        111s                        381s

(b) ijcnn1: C = 16, γ = 4, ϵ = 0.5, l = 49,900.

                                      Cache = 1,000 MB             Cache = 10 MB
  Reconstruction   |F|      |A|       Method 1      Method 2       Method 1       Method 2
  First            1,767    43,678    274,297,840   5,403,072      275,695,536    88,332,330
  Second           2,308    6,023     263,843,538   28,274,195     264,813,241    115,346,805
  Training time                       189s          46s            203s           116s
  No shrinking                        42s                          87s

Table 2 compares the number of kernel evaluations in reconstructing the gradient. We consider the problems a7a and ijcnn1.10 Clearly, the proposed rule selects the better method for both problems. We implemented this technique after version 2.88 of LIBSVM.

10Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.


5.4 Index Rearrangement

In solving the smaller problem (30), we need only indices in A (e.g., α_i, y_i, and x_i, where i ∈ A). Thus, a naive implementation does not access array contents in a continuous manner. Alternatively, we can maintain A = {1, . . . , |A|} by rearranging array contents. This approach allows continuous access of array contents, but incurs costs for the rearrangement. We decide to rearrange elements in arrays because, throughout the discussion in Sections 5.2-5.3, we assume that a cached ith kernel column contains elements from the first to the tth (i.e., Q_{1:t,i}), where t ≤ l. If we did not rearrange indices so that A = {1, . . . , |A|}, then the whole column Q_{1:l,i} would have to be cached because l may be an element in A.

We rearrange indices by sequentially swapping pairs of indices. If t_1 is going to be shrunk, we find an index t_2 that should stay and then swap them. Swapping two elements in a vector α or y is easy, but swapping kernel elements in the cache is more expensive. That is, we must swap (Q_{t_1,i}, Q_{t_2,i}) for every cached kernel column i. To make the number of swapping operations small, we use the following implementation. Starting from the first and the last indices, we identify the smallest t_1 that should leave and the largest t_2 that should stay. Then, (t_1, t_2) are swapped and we continue the same procedure to identify the next pair.
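A C++ sketch of this two-pointer pass (swap_all is an assumed callback that swaps positions t_1 and t_2 in α, y, the gradient, and every cached kernel column; the names are illustrative):

#include <functional>
#include <vector>

// Walk inward from both ends: the smallest index that should leave is swapped
// with the largest index that should stay, so that A = {0, ..., |A|-1} afterwards.
void rearrange_indices(const std::vector<bool>& to_shrink,                  // true = shrink this index
                       const std::function<void(int,int)>& swap_all) {
    int t1 = 0, t2 = static_cast<int>(to_shrink.size()) - 1;
    while (t1 < t2) {
        while (t1 < t2 && !to_shrink[t1]) ++t1;   // smallest index that should leave
        while (t1 < t2 && to_shrink[t2]) --t2;    // largest index that should stay
        if (t1 < t2) { swap_all(t1, t2); ++t1; --t2; }
    }
}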

5.5 A Summary of the Shrinking Procedure

We summarize the shrinking procedure in Algorithm 2.

Algorithm 2 (Extending Algorithm 1 to include the shrinking procedure)

Initialization

1. Let α^1 be an initial feasible solution.
2. Calculate the initial ∇f(α^1) and Ḡ in Eq. (35).
3. Initialize a counter so shrinking is conducted every min(l, 1000) iterations.
4. Let A = {1, . . . , l}.

For k = 1, 2, . . .

1. Decrease the shrinking counter.
2. If the counter is zero, then shrinking is conducted.
   (a) If condition (33) is satisfied for the first time, reconstruct the gradient.
   (b) Shrink A by removing elements in the set (32). The implementation described in Section 5.4 ensures that A = {1, . . . , |A|}.
   (c) Reset the shrinking counter.
3. If α^k_A satisfies the stopping condition (34),
   (a) Reconstruct the gradient.
   (b) If α^k satisfies the stopping condition (34),
           return α^k.
       Else
           reset A = {1, . . . , l} and set the counter to one.11
4. Find a two-element working set B = {i, j} by WSS 1.
5. Obtain Q_{1:|A|,i} and Q_{1:|A|,j} from the cache or by calculation.
6. Solve sub-problem (14) or (15) by the procedures in Section 6. Update α^k to α^{k+1}.
7. Update the gradient by Eq. (23) and update the vector Ḡ.

5.6 Is Shrinking Always Better?

We found that if the number of iterations is large, then shrinking can shorten the training time. However, if we loosely solve the optimization problem (e.g., by using a large stopping tolerance ϵ), the code without using shrinking may be much faster.

In this situation, because of the small number of iterations, the time spent on all decomposition iterations can be even less than one single gradient reconstruction.

Table 2 compares the total training time with/without shrinking. For a7a, we use the default ϵ = 0.001. Under the parameters C = 1 and γ = 4, the number of iterations is more than 30,000, so shrinking is useful. However, for ijcnn1, we deliberately use a loose tolerance ϵ = 0.5, so the number of iterations is only around 4,000. Because our shrinking strategy is quite aggressive, before the first gradient reconstruction, only Q_{A,A} is in the cache. Then, we need many kernel evaluations for reconstructing the gradient, so the implementation with shrinking is slower.

If enough iterations have been run, most elements in A correspond to free α_i (0 < α_i < C); i.e., A ≈ F. In contrast, if the number of iterations is small (e.g., ijcnn1 in Table 2), many bounded elements have not been shrunk and |F| ≪ |A|. Therefore, we can check the relation between |F| and |A| to conjecture whether shrinking is useful. In LIBSVM, if shrinking is enabled and 2 · |F| < |A| in reconstructing the gradient, we issue a warning message to indicate that the code may be faster without shrinking.

5.7 Computational Complexity

While Section 4.1.7 has discussed the asymptotic convergence and the local convergence rate of the decomposition method, in this section we investigate the computational complexity.

From Section 4, two places consume most operations at each iteration: finding the working set B by WSS 1 and calculating Q_{:,B}(α^{k+1}_B − α^k_B) in Eq. (23).12 Each place requires O(l) operations. However, if Q_{:,B} is not available in the cache and each kernel evaluation costs O(n), the cost becomes O(ln) for calculating a column of kernel elements. Therefore, the complexity of Algorithm 1 is

1. #Iterations × O(l) if most columns of Q are cached throughout the iterations.

2. #Iterations × O(nl) if columns of Q are not cached and each kernel evaluation costs O(n).

Several works have studied the number of iterations of decomposition methods; see, for example, List and Simon (2007). However, the algorithms studied in these works are slightly different from LIBSVM's, so there is no theoretical result yet on LIBSVM's number of iterations. Empirically, it is known that the number of iterations may grow more than linearly with the number of training data. Thus, LIBSVM may take considerable training time for huge data sets. Many techniques, for example, Fine and Scheinberg (2001); Lee and Mangasarian (2001); Keerthi et al. (2006); Segata and Blanzieri (2010), have been developed to obtain an approximate model, but these are beyond the scope of our discussion. In LIBSVM, we provide a simple sub-sampling tool, so users can quickly train on a small subset.

12Note that because |B| = 2, once the sub-problem has been constructed, solving it takes only a constant number of operations (see details in Section 6).

6 Unbalanced Data and Solving the Two-variable Sub-problem

For some classification problems, numbers of data in different classes are unbalanced.

Some researchers (e.g., Osuna et al., 1997b, Section 2.5; Vapnik, 1998, Chapter 10.9) have proposed using different penalty parameters in the SVM formulation. For example, the C-SVM problem becomes

    min_{w,b,ξ}   (1/2) w^T w + C^+ Σ_{y_i=1} ξ_i + C^− Σ_{y_i=−1} ξ_i
    subject to    y_i (w^T ϕ(x_i) + b) ≥ 1 − ξ_i,                                   (42)
                  ξ_i ≥ 0,  i = 1, . . . , l,

where C^+ and C^− are regularization parameters for positive and negative classes, respectively. LIBSVM supports this setting, so users can choose weights for classes.

The dual problem of problem (42) is

    min_α        (1/2) α^T Q α − e^T α
    subject to   0 ≤ α_i ≤ C^+, if y_i = 1,
                 0 ≤ α_i ≤ C^−, if y_i = −1,
                 y^T α = 0.

A more general setting is to assign each instance x_i a regularization parameter C_i. If C is replaced by C_i, i = 1, . . . , l, in problem (13), most results discussed in earlier sections can be extended without problems.13 The major change of Algorithm 1 is on solving the sub-problem (14), which now becomes

    min_{α_i,α_j}   (1/2) [α_i, α_j] [ Q_ii  Q_ij ; Q_ji  Q_jj ] [α_i; α_j] + (Q_{i,N} α_N + p_i) α_i + (Q_{j,N} α_N + p_j) α_j
    subject to      y_i α_i + y_j α_j = ∆ − y_N^T α^k_N,                            (43)
                    0 ≤ α_i ≤ C_i,  0 ≤ α_j ≤ C_j.

Let α_i = α^k_i + d_i and α_j = α^k_j + d_j. The sub-problem (43) can be written as

    min_{d_i,d_j}   (1/2) [d_i, d_j] [ Q_ii  Q_ij ; Q_ij  Q_jj ] [d_i; d_j] + [∇_i f(α^k), ∇_j f(α^k)] [d_i; d_j]
    subject to      y_i d_i + y_j d_j = 0,
                    −α^k_i ≤ d_i ≤ C_i − α^k_i,  −α^k_j ≤ d_j ≤ C_j − α^k_j.

Define a_ij, b_ij, and ā_ij as in Eq. (20), and let d̂_i ≡ y_i d_i, d̂_j ≡ y_j d_j. Using d̂_i = −d̂_j, the objective function can be written as

    (1/2) ā_ij d̂_j² + b_ij d̂_j.

13This feature of using C_i, ∀i, is not included in LIBSVM, but is available as an extension at
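The remainder of this section (how the resulting one-variable problem is solved and clipped) is not included in this excerpt. Purely as an illustration of the step implied by the quadratic above, the following C++ sketch takes the unconstrained minimizer d̂_j = −b_ij/ā_ij, clips it to the feasible interval given by the box constraints on d_i and d_j, and updates α_i and α_j. All names are illustrative, and ā_ij > 0 is assumed.

#include <algorithm>

// Minimize (1/2) * a_bar * d^2 + b * d over d = d_hat_j subject to
// -alpha_i^k <= d_i <= C_i - alpha_i^k and -alpha_j^k <= d_j <= C_j - alpha_j^k,
// where d_j = y_j * d and d_i = -y_i * d.  Assumes a_bar > 0 (the tau
// modification of sub-problem (15) covers the non-positive case).
void solve_two_variable(double a_bar, double b,
                        double& alpha_i, double& alpha_j,
                        int y_i, int y_j, double C_i, double C_j) {
    // Interval for d implied by the box constraint on alpha_j ...
    double lo = (y_j == +1) ? -alpha_j : alpha_j - C_j;
    double hi = (y_j == +1) ? C_j - alpha_j : alpha_j;
    // ... intersected with the interval implied by the box constraint on alpha_i.
    lo = std::max(lo, (y_i == +1) ? alpha_i - C_i : -alpha_i);
    hi = std::min(hi, (y_i == +1) ? alpha_i : C_i - alpha_i);

    double d = -b / a_bar;                       // unconstrained minimizer
    d = std::min(std::max(d, lo), hi);           // clip to the feasible interval

    alpha_j += y_j * d;                          // d_j = y_j * d_hat_j
    alpha_i -= y_i * d;                          // d_i = y_i * d_hat_i = -y_i * d_hat_j
}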
