Supplementary Materials for “Warm Start for Parameter Selection of Linear Classifiers”

Bo-Yu Chu

Dept. of Computer Science, National Taiwan Univ., Taiwan

Chia-Hua Ho

Cheng-Hao Tsai

Chieh-Yen Lin

Chih-Jen Lin

I. PROOFS

This section includes all proofs of theorems in the paper.

First we show a useful lemma.

Lemma 1 If $w_\infty$ exists, then for any $C > 0$, the norm of the optimal solution $w_C$ is upper bounded by $\|w_\infty\|$:

$$\|w_C\| \le \|w_\infty\|, \quad \forall C > 0. \tag{I.1}$$

Proof. We prove the result by contradiction. If some $w_C$ satisfies $\|w_C\| > \|w_\infty\|$, then

$$\frac{f(w_\infty)}{C} = \frac{\|w_\infty\|^2}{2C} + L(w_\infty) < \frac{\|w_C\|^2}{2C} + L(w_\infty) \le \frac{f(w_C)}{C}, \tag{I.2}$$

where the last inequality is from the definition of $W_\infty$ in (2.1). The result in (I.2) contradicts the fact that $w_C$ is the optimal solution of (2).

I.1 Proof of Theorem 1

We first show that $w_\infty$ exists. Because $W_\infty \ne \emptyset$, we can consider any $\bar w \in W_\infty$ and define the following bounded region:

$$\{w \mid \|w\| \le \|\bar w\|\} \cap W_\infty. \tag{I.3}$$

The continuity of $L(w)$ implies that $W_\infty$ is closed. Therefore, the region defined in (I.3) is compact. This property implies that the minimum value of (7) is attained. Furthermore, $L(w)$ is convex, so $W_\infty$ is convex as well. The strict convexity of $\|w\|^2$ implies that $w_\infty$ is unique.

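Next, by Lemma 1, (I.1) implies that

$$w_C \in S, \quad \forall C > 0, \tag{I.4}$$

where $S$ is the following compact set:

$$S \equiv \{w \mid \|w\| \le \|w_\infty\|\}.$$

We then show that

$$\lim_{C\to\infty} \frac{f(w_C)}{C} = L(w_\infty). \tag{I.5}$$

This result follows from taking the limit of

$$L(w_\infty) \le L(w_C) \le \frac{f(w_C)}{C} \le \frac{f(w_\infty)}{C} = \frac{\|w_\infty\|^2}{2C} + L(w_\infty).$$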

Finally, if the result in (7) does not hold, then there are a sequence $\{w_{C_j}\}_{j=1}^{\infty}$ and a positive number $\delta$ such that

$$\|w_{C_j} - w_\infty\| \ge \delta, \quad \forall j = 1, 2, \ldots. \tag{I.6}$$

Because $S$ is compact, there exists a convergent subsequence $\{w_{C_j}\}$, $j \in J$, such that

$$\lim_{j\in J,\, j\to\infty} w_{C_j} = w^*. \tag{I.7}$$

From (I.4), $w^* \in S$, so $\|w^*\| \le \|w_\infty\|$. This property implies

$$w^* \notin W_\infty. \tag{I.8}$$

Otherwise, $\|w^*\| \le \|w_\infty\|$ and $w^* \in W_\infty$ indicate that $w^*$ is a solution of (7). However, (I.6) and the uniqueness of the solution of (7) cause a contradiction.

From (I.8),

$$L(w^*) > L(w_\infty).$$

However, (I.5), (I.7), and the continuity of $L(w)$ imply that

$$L(w^*) = L(w_\infty),$$

so we have a contradiction. Therefore, the proof is complete.
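As an informal numerical illustration of Lemma 1 and Theorem 1 (not part of the proof), the following Python sketch uses a hypothetical one-dimensional logistic-loss problem. The data `Z`, the helper names, and the bisection solver are assumptions made for this example only; bisection is applicable here because the derivative of a convex function of one variable is nondecreasing.

```python
import math

# Hypothetical 1-D data stored as z_i = y_i * x_i; mixed signs make it non-separable.
Z = [1.0, 2.0, -1.0, -0.5]

def grad_L(w):
    """Derivative of L(w) = sum_i log(1 + exp(-z_i * w))."""
    return -sum(z / (1.0 + math.exp(z * w)) for z in Z)

def solve_increasing(g, lo=-10.0, hi=10.0, iters=200):
    """Bisection for the root of an increasing function g on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w_opt(C):
    """Minimizer of f(w; C) = w^2 / 2 + C * L(w), found as the root of f'(w)."""
    return solve_increasing(lambda w: w + C * grad_L(w))

# L is strictly convex for this data, so its unique minimizer plays the role of w_inf.
w_inf = solve_increasing(grad_L)

# Optimal solutions for C = 2^-10, 2^-5, ..., 2^20.
path = [w_opt(2.0 ** k) for k in range(-10, 21, 5)]
for k, w in zip(range(-10, 21, 5), path):
    print(k, w, abs(w) <= abs(w_inf))
```

Under these assumptions, every printed $|w_C|$ stays below $|w_\infty|$ and the last entries approach $w_\infty$, which is the behavior that (I.1) and Theorem 1 describe.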

I.2 Proof of Theorem 2

It is sufficient to prove that there exists a $w^*$ such that

$$L(w^*) = \inf_w L(w).$$

The optimization problem $\inf_w L(w)$ can be rewritten as

$$\min_{w,\xi} \ \|\xi\|^2 \quad \text{subject to} \quad y_i w^T x_i \ge 1 - \xi_i, \ i = 1, \ldots, l. \tag{I.9}$$

Note that $(w, \xi) = (0, e)$ satisfies the constraints, so (I.9) is feasible. Besides, the feasible region of (I.9) is a polyhedron, and the objective function $\|\xi\|^2$ is convex quadratic. By [5, Corollary 27.3.1], the infimum value is attained.

I.3 Proof of Theorem 3

From Theorem 1, we only need to show that under condition (9), there exists a $w^*$ such that

$$L(w^*) = \inf_w L(w). \tag{I.10}$$

Because

$$\inf_w L(w) \le L(0),$$

there exists a sequence $\{w_k\}$ such that

$$L(w_k) \le L(0), \ \forall k, \quad \text{and} \quad \lim_{k\to\infty} L(w_k) = \inf_w L(w).$$

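Then

$$0 \le \log(1 + e^{-y_i w_k^T x_i}) \le L(0), \quad \forall i, k,$$

so there are a subset $J$ and constants $L_1, \ldots, L_l$ such that

$$\lim_{k\in J,\, k\to\infty} \log(1 + e^{-y_i w_k^T x_i}) = L_i, \quad \forall i. \tag{I.11}$$

If $L_i \ne 0$, $\forall i$, then

$$\lim_{k\in J,\, k\to\infty} YXw_k = v, \tag{I.12}$$

where $X$ and $Y$ are defined in (6), and

$$v = \begin{bmatrix} -\log(e^{L_1} - 1) \\ \vdots \\ -\log(e^{L_l} - 1) \end{bmatrix}.$$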

We prove $L_i \ne 0$, $\forall i$ later. If it is true, then from (I.12) we show that there exists $w^*$ such that

$$YXw^* = v, \tag{I.13}$$

and therefore $\inf_w L(w)$ is attained. Suppose instead that no such $w^*$ exists. Because

$$\min_w \|YXw - v\|^2$$

attains its minimum at some $\hat w$ following from [5], we have

$$YX\hat w \ne v.$$

However, (I.12) implies that we can always find $w_k$ such that

$$\|YXw_k - v\| < \|YX\hat w - v\|,$$

a situation that violates the optimality of $\hat w$.

The remaining task is to prove that $L_i \ne 0$, $\forall i$. If this result does not hold, then there exists an index $i$ such that $L_i = 0$. From (I.11),

$$e^{-y_i w_k^T x_i} \to 0 \quad \text{and} \quad y_i w_k^T x_i \to \infty \quad \text{as } k \to \infty,\ k \in J. \tag{I.14}$$

Thus $J$ must be an infinite set. We have

$$\|w_k\| \to \infty. \tag{I.15}$$

Otherwise, the boundedness of $\|w_k\|$ and $|y_i w_k^T x_i| \le \|w_k\| \|x_i\|$ violate (I.14). From (I.15), $w_k \ne 0$ after $k$ is large enough, so we can consider a sequence $\{w_k / \|w_k\|\}$. Because this sequence is in a compact set, there exist a subset $J'$ of $J$ and a point $\hat w$ such that

$$\lim_{k\to\infty,\, k\in J'} \frac{w_k}{\|w_k\|} = \hat w. \tag{I.16}$$

From (9), there is an instance $x_r$ such that

$$y_r \hat w^T x_r = -\epsilon < 0.$$

With (I.16), we can further find a subset $J''$ of $J'$ such that

$$y_r w_k^T x_r \le -\frac{\epsilon}{2} \|w_k\| < 0, \quad \forall k \in J''.$$

Then as $k \to \infty$, $k \in J''$,

$$L(w_k) \ge \log(1 + e^{-y_r w_k^T x_r}) \ge \log(1 + e^{\epsilon \|w_k\| / 2}) \to \infty$$

following from (I.15). This result violates the fact that

$$L(w_k) \to \inf_w L(w) \le L(0).$$

Therefore, $L_i \ne 0$, $\forall i$, and the proof is complete.

I.4 Proof of Theorem 4

From the optimality condition (see, for example, Section 3.4 in [6]),

$$\frac{(\alpha_C)_i}{C} = 2\max(0,\ 1 - y_i w_C^T x_i), \quad \forall i, \tag{I.17}$$

and

$$\frac{(\alpha_C)_i}{C} = \frac{e^{-y_i w_C^T x_i}}{1 + e^{-y_i w_C^T x_i}}, \quad \forall i, \tag{I.18}$$

for L2 and logistic losses, respectively. With $w_C \to w_\infty$ by Theorems 2 and 3 and the continuity of the max and exponential functions, taking the limit of (I.17) and (I.18) gives the desired results.
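The two optimality relations (I.17) and (I.18) can be evaluated directly from the margin $y_i w_C^T x_i$. The short Python sketch below is only an illustration; the function names and the sample margins are hypothetical, and the logistic expression is rewritten as $1/(1 + e^{m})$, which equals $e^{-m}/(1 + e^{-m})$.

```python
import math

def alpha_over_c_l2(margin):
    """(I.17): (alpha_C)_i / C for L2 loss, where margin = y_i * w_C^T x_i."""
    return 2.0 * max(0.0, 1.0 - margin)

def alpha_over_c_logistic(margin):
    """(I.18): (alpha_C)_i / C for logistic loss, written as 1 / (1 + e^margin)."""
    return 1.0 / (1.0 + math.exp(margin))

# Hypothetical margins: misclassified, on the decision boundary, well classified.
for m in (-1.0, 0.0, 2.0):
    print(m, alpha_over_c_l2(m), alpha_over_c_logistic(m))
```

Both ratios depend on $C$ only through the margin, so when $w_C$ converges the ratios converge as well, which is the step used in the proof.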

I.5 Proof of Theorem 5

Because the data are not separable, Theorem 3 implies that $w_\infty$ exists. Then from Definition 1, there exists an instance $x_i$ such that

$$y_i w_\infty^T x_i < 0.$$

From Theorem 4, clearly $(\alpha_C)_i \to \infty$ as $C \to \infty$.

I.6 Proof of Theorem 6

The existence of $C^*$, $v_1$, and $v_2$ and the optimality of $\alpha_C$ have been proved in [2, Theorem 3]. From Theorem 1 and (4),

$$\lim_{C\to\infty} \frac{w_C}{C} = 0 = (YX)^T v_1.$$

Then

$$w_C = (YX)^T v_2 = w_\infty, \quad \forall C \ge C^*.$$

I.7 Proof of Theorem 7

See the paper.

I.8 Proof of Theorem 8

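We prove (25) by contradiction. Firstly, we show that if

$$w_{C_1} = w_{C_2}, \tag{I.19}$$

then

$$w_{C_1} = w_{C_2} = w_\infty. \tag{I.20}$$

Because $f(w)$ is convex, the gradient at the optimal solution is zero. Therefore, from (I.19),

$$\begin{aligned}
\nabla f(w_{C_1}) &= w_{C_1} + C_1 \nabla L(w_{C_1}) = 0, \\
\nabla f(w_{C_2}) &= w_{C_2} + C_2 \nabla L(w_{C_2}) = w_{C_1} + C_2 \nabla L(w_{C_1}) = 0.
\end{aligned}$$

Then,

$$\nabla L(w_{C_1}) = \frac{(w_{C_1} + C_1 \nabla L(w_{C_1})) - (w_{C_1} + C_2 \nabla L(w_{C_1}))}{C_1 - C_2} = 0.$$

By the convexity of $L(w)$, $w_{C_1}$ is an optimal solution of $L(\cdot)$, so $w_{C_1} \in W_\infty$. By Lemma 1, $\|w_{C_1}\| \le \|w_\infty\|$, so by the definition of $w_\infty$ in (7), $w_{C_1} = w_{C_2} = w_\infty$.

By the assumption $\|w_\infty\| \ne 0$ and the fact that $w_\infty$ minimizes $L(w)$,

$$\nabla f(w_\infty) = w_\infty + C_1 \nabla L(w_\infty) = w_\infty \ne 0. \tag{I.21}$$

Hence $w_{C_1} \ne w_\infty$, a contradiction to (I.20). Therefore, (I.19) does not hold, and hence $w_{C_1} \ne w_{C_2}$.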

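For (26), we can obtain the result by taking the limit of the following equation:

$$\frac{\|\nabla f(w_{C/\Delta}; C)\|}{\|\nabla f(0; C)\|} = \frac{\|w_{C/\Delta} + C\nabla L(w_{C/\Delta})\|}{C\|\nabla L(0)\|} = \frac{(C - C/\Delta)\|\nabla L(w_{C/\Delta})\|}{C\|\nabla L(0)\|} = \frac{\Delta - 1}{\Delta}\,\frac{\|\nabla L(w_{C/\Delta})\|}{\|\nabla L(0)\|}, \tag{I.22}$$

where the second equality uses $w_{C/\Delta} = -(C/\Delta)\nabla L(w_{C/\Delta})$, the optimality condition of $w_{C/\Delta}$. When $C \to 0$, $w_{C/\Delta} \to 0$. By the continuity of $\nabla L(\cdot)$,

$$\lim_{C\to 0} \nabla L(w_{C/\Delta}) = \nabla L(0).$$

Therefore, (I.22) converges to $(\Delta - 1)/\Delta$ when $C \to 0$.

For (27), because $\{w_C\}$ converges to $w_\infty$ and $\nabla L(w_\infty) = 0$, the continuity of $\nabla L(\cdot)$ implies that (I.22) converges to zero as $C$ goes to $\infty$.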
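The two limits of the ratio in (I.22) can be checked numerically on a small example. The Python sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a hypothetical one-dimensional logistic-loss problem (`Z`), a bisection solver for $w_{C/\Delta}$, and $\Delta = 2$.

```python
import math

# Hypothetical 1-D data stored as z_i = y_i * x_i (non-separable: mixed signs).
Z = [1.0, 2.0, -1.0, -0.5]

def grad_L(w):
    """Derivative of L(w) = sum_i log(1 + exp(-z_i * w))."""
    return -sum(z / (1.0 + math.exp(z * w)) for z in Z)

def w_opt(C, lo=-10.0, hi=10.0, iters=200):
    """Root of f'(w; C) = w + C * L'(w) by bisection (f' is increasing in w)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + C * grad_L(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def warm_start_ratio(C, delta):
    """Left-hand side of (I.22): |f'(w_{C/delta}; C)| / |f'(0; C)|."""
    w = w_opt(C / delta)
    return abs(w + C * grad_L(w)) / abs(C * grad_L(0.0))

for log2_c in (-20, -10, 0, 10, 20):
    print(log2_c, warm_start_ratio(2.0 ** log2_c, delta=2.0))
```

With $\Delta = 2$, the printed ratio should be close to $(\Delta - 1)/\Delta = 0.5$ for the smallest $C$ and close to zero for the largest $C$, matching the limits stated for (26) and (27).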

II. CV ACCURACY UNDER SMALL REGULARIZATION PARAMETER

In this section, we explain why the CV accuracy tends to be fixed when $C$ is close to zero. First, we prove the following lemma.

Lemma 2 If $L(w)$ is nonnegative, then

$$\lim_{C\to 0} f(w_C) = \lim_{C\to 0} \|w_C\| = 0. \tag{II.1}$$

Proof. Since

$$0 \le \frac{\|w_C\|^2}{2} \le f(w_C) \le f(0),$$

by taking the limit, we have

$$0 \le \lim_{C\to 0} \frac{\|w_C\|^2}{2} \le \lim_{C\to 0} f(w_C) \le \lim_{C\to 0} f(0) = \lim_{C\to 0} C L(0) = 0.$$

Therefore, (II.1) follows.

Then we discuss the three losses separately. For L1-loss SVM, Lemma 2 implies that when $C$ is small enough,

$$|w_C^T x_i| < 1 \quad \text{for all } i = 1, \ldots, l.$$

Hence, $\xi(w_C; x_i, y_i) > 0$. By the KKT condition, $\alpha_i = C$ for all $i$. With (4), for any test instance $x$,

$$\operatorname{sgn}(w_C^T x) = \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i C x_i^T x\Big) = \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i x_i^T x\Big)$$

is independent of $C$.
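The L1-loss argument can be illustrated with a few lines of Python. This is a minimal sketch that assumes $C$ is already small enough for $\alpha_i = C$ to hold for every $i$; the toy data and function names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def l1_small_c_weights(X, y, C):
    """w_C = sum_i y_i * alpha_i * x_i with alpha_i = C for every i (assumed)."""
    n = len(X[0])
    return [C * sum(yi * xi[j] for xi, yi in zip(X, y)) for j in range(n)]

def sign(v):
    return (v > 0) - (v < 0)

# Hypothetical toy data: three training instances and one test instance.
X = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3]]
y = [1, -1, 1]
x_test = [0.4, 1.0]

# The predicted sign does not change when C changes.
signs = [sign(dot(l1_small_c_weights(X, y, C), x_test)) for C in (1e-8, 1e-6, 1e-4)]
print(signs)
```

Because $C$ only scales $w_C$ in this regime, every value of $C$ gives the same predicted label, so the CV accuracy stays fixed.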

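For L2-loss SVM, by the KKT condition,

$$\alpha_i = 2C \max(1 - y_i w_C^T x_i,\ 0), \quad \forall i = 1, \ldots, l. \tag{II.2}$$

When $w_C$ is close to zero, (II.2) becomes

$$\alpha_i = 2C(1 - y_i w_C^T x_i),$$

and $\alpha_i$ is close to $2C$. Therefore, for any test instance $x$,

$$\begin{aligned}
\operatorname{sgn}(w_C^T x) &= \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i \alpha_i x_i^T x\Big) \\
&= \operatorname{sgn}\Big(\sum_{i=1}^{l} 2C(1 - y_i w_C^T x_i)\, y_i x_i^T x\Big) \\
&= \operatorname{sgn}\Big(\sum_{i=1}^{l} (1 - y_i w_C^T x_i)\, y_i x_i^T x\Big).
\end{aligned} \tag{II.3}$$

If

$$\sum_{i=1}^{l} y_i x_i^T x \ne 0,$$

then because $\|w_C\| \to 0$ from Lemma 2, when $C$ is small enough,

$$\text{(II.3)} = \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i x_i^T x\Big).$$

The prediction is almost independent of $C$.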

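Similarly, for logistic regression, we have the following equality from the KKT condition:

$$\alpha_i = \frac{C}{1 + e^{y_i w_C^T x_i}}, \tag{II.4}$$

where the value is close to $C/2$ when $\|w_C\|$ is small. For any test instance $x$,

$$\begin{aligned}
\operatorname{sgn}(w_C^T x) &= \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i \alpha_i x_i^T x\Big) \\
&= \operatorname{sgn}\Big(\sum_{i=1}^{l} \frac{C}{1 + e^{y_i w_C^T x_i}}\, y_i x_i^T x\Big) \\
&= \operatorname{sgn}\Big(\sum_{i=1}^{l} \frac{y_i x_i^T x}{1 + e^{y_i w_C^T x_i}}\Big) \\
&= \operatorname{sgn}\Big(\sum_{i=1}^{l} y_i x_i^T x\Big)
\end{aligned}$$

if

$$\sum_{i=1}^{l} y_i x_i^T x \ne 0$$

and $C$ is small. The prediction is almost independent of $C$.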

Note that the result is slightly different from [3, Case 1], where they consider the decision and the loss functions with a bias term:

$$w^T x + b.$$

They prove that, for L1-loss SVM, the decision function always outputs the majority class when $C$ is small. Although their result also implies that the CV accuracy is fixed when $C$ is small, the value may be different from ours here. When $w_C$ is small, the bias term $b$ dominates the decision value $w_C^T x + b$, so the majority class is predicted. On the other hand, our decision function does not have a bias term, so the results still depend on $w_C^T x$.

III. A DETAILED COMPARISON USING PRIMAL NEWTON AND DUAL COORDINATE DESCENT METHODS

We begin by describing some implementation details and then give a detailed comparison.

III.1 Implementation Details

We slightly adjust solvers in LIBLINEAR for the purpose of experiments.

III.1.1 Primal-based Stopping Condition

As mentioned in Section 4, we need the primal Newton and dual coordinate descent (CD) methods to have the same stopping condition for a fair comparison. While the dual solver’s stopping condition can check the optimality condition of either the dual or the primal variables, the primal solver has only the primal variables. Therefore, we modified our dual solver to check the optimality condition (28) of the primal variables, so both solvers have the same stopping condition.

However, it is not trivial to evaluate the primal stopping condition (28) for the dual CD method. CD is an efficient method that takes only $O(nl)$ operations to update all variables once (called an outer iteration in [1]), while evaluating the stopping condition (28) has the same time complexity. If we check the stopping condition in each iteration, a large portion of the training time is spent on this extra check instead of on solving the optimization problem. Therefore, we check (28) only once in every $k$ iterations. We use $k = 10$ in our experiments.

However, checking the primal-based stopping condition only once per $k$ iterations may still have a large effect on the training time when the number of iterations is small. For example, if the dual solver’s default stopping condition can be reached in two iterations, the training time becomes five times longer because at least 10 iterations are needed. Although this situation seldom happens because the dual solver’s stopping condition is usually stricter than (28), we decide to keep the original condition and use it along with (28).
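The combined stopping rule described above can be sketched as a small driver loop. This Python sketch is not LIBLINEAR code: `run_dual_cd`, its callables, and `ToySolver` are hypothetical stand-ins used only to show how often each test is evaluated.

```python
def run_dual_cd(outer_iteration, default_converged, primal_converged,
                check_every=10, max_iter=100000, use_default=True):
    """Run outer iterations and return how many were executed before stopping.

    All three callables are hypothetical stand-ins for solver internals.
    """
    for it in range(1, max_iter + 1):
        outer_iteration()
        # The solver's own (default) test, evaluated after every outer iteration.
        if use_default and default_converged():
            return it
        # The primal test (28), evaluated only once per check_every iterations.
        if it % check_every == 0 and primal_converged():
            return it
    return max_iter


class ToySolver:
    """Toy stand-in that counts as converged after two outer iterations."""
    def __init__(self):
        self.iterations = 0

    def step(self):
        self.iterations += 1

    def done(self):
        return self.iterations >= 2


s = ToySolver()
only_primal = run_dual_cd(s.step, s.done, s.done, use_default=False)
s = ToySolver()
combined = run_dual_cd(s.step, s.done, s.done)
print(only_primal, combined)
```

In this toy run, relying only on the periodic primal check costs 10 outer iterations, while keeping the default condition alongside it stops after 2, which mirrors the five-times example in the text.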

III.1.2 Practical Implementation of (24)

When we introduced (24) as the stopping condition for the parameter-selection procedure, we assumed that $w_C$ is the optimal solution for minimizing $f(w; C)$. In practice, we have only an approximate solution $\tilde w_C$, so $w_{\Delta^{t-1}C}$ in (24) must be replaced by $\tilde w_{\Delta^{t-1}C}$.

If a primal-based solver is used, we then have the following property. At $\Delta^{-2}C$, the initial solution is

$$\bar w_{\Delta^{-2}C} = \tilde w_{\Delta^{-3}C}.$$

Because (24) has the same form as the stopping condition (28) for solving the optimization problem under a fixed regularization parameter,¹ if $\tilde w_{\Delta^{-3}C}$ satisfies (24) with $t = -2$, then $\bar w_{\Delta^{-2}C}$ is immediately returned as the approximate solution without any iteration. Therefore,

$$\tilde w_{\Delta^{-2}C} = \bar w_{\Delta^{-2}C} = \tilde w_{\Delta^{-3}C}.$$

By the same reason, we have

$$\tilde w_{\Delta^{-3}C} = \tilde w_{\Delta^{-2}C} = \tilde w_{\Delta^{-1}C} = \tilde w_C. \tag{III.1}$$

Therefore, in our implementation, we simply count the number of times the initial and the returned solutions of the optimization solver are the same. That is, if the count reaches three for a consecutive sequence of $\Delta^{-2}C$, $\Delta^{-1}C$, and $C$, then the parameter-selection procedure stops.
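The counting rule based on (III.1) can be sketched as follows. This is a minimal Python illustration, assuming the caller records, for each $C$ tried in increasing order, whether the solver returned its warm-start solution unchanged; the function names are hypothetical and not taken from the released tool.

```python
def solution_unchanged(w_initial, w_returned):
    """True when the solver returned exactly the initial (warm-start) vector."""
    return list(w_initial) == list(w_returned)

def selection_should_stop(unchanged_flags, needed=3):
    """unchanged_flags[j] is True when the j-th C value left the solution unchanged.

    Return True once the most recent `needed` consecutive flags are all True.
    """
    run = 0
    for flag in unchanged_flags:
        run = run + 1 if flag else 0  # length of the current streak of True flags
    return run >= needed

print(selection_should_stop([False, True, True, True]))   # three in a row at the end
print(selection_should_stop([True, True, False, True]))   # streak was broken
```

Comparing vectors in this way avoids the extra gradient evaluations that a direct check of (24) would need.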

If the dual solver is used, the situation is different. At $\Delta^{-3}C$, we have an approximate dual solution $\tilde\alpha_{\Delta^{-3}C}$ and the corresponding primal solution $\tilde w_{\Delta^{-3}C}$ with

$$\tilde w_{\Delta^{-3}C} = \sum_{i=1}^{l} y_i (\tilde\alpha_{\Delta^{-3}C})_i x_i.$$

Assume $\tilde w_{\Delta^{-3}C}$ satisfies (24) with $t = -2$:

$$\|\nabla f(\tilde w_{\Delta^{-3}C}; \Delta^{-2}C)\| \le \epsilon \|\nabla f(0; \Delta^{-2}C)\|. \tag{III.2}$$

Then the procedure continues to find an approximate solution $\tilde\alpha_{\Delta^{-2}C}$. The dual initial solution is

$$\bar\alpha_{\Delta^{-2}C} = \Delta\, \tilde\alpha_{\Delta^{-3}C}. \tag{III.3}$$

Because we check the primal-based stopping condition (28) for the dual coordinate descent method, (III.3) means that we effectively start with

$$\bar w_{\Delta^{-2}C} = \Delta\, \tilde w_{\Delta^{-3}C}$$

in checking this condition. It is less likely that

$$\|\nabla f(\Delta \tilde w_{\Delta^{-3}C}; \Delta^{-2}C)\| \le \epsilon \|\nabla f(0; \Delta^{-2}C)\|$$

holds, because we have assumed in (III.2) that $\tilde w_{\Delta^{-3}C}$ is a good approximate solution for minimizing $f(w; \Delta^{-2}C)$ and we have the property that $\{w_C\}$ converges to $w_\infty$. Therefore, the optimization procedure does not stop in the beginning. Instead, it takes several steps before reaching the stopping criterion. Then

$$\tilde w_{\Delta^{-2}C} \ne \tilde w_{\Delta^{-3}C} \tag{III.4}$$

and $\tilde w_{\Delta^{-2}C}$ is a better approximate solution than $\tilde w_{\Delta^{-3}C}$ at $\Delta^{-2}C$. By the convergence of $\{w_C\}$, $\tilde w_{\Delta^{-2}C}$ tends to be a better solution than $\tilde w_{\Delta^{-3}C}$ for minimizing the function $f(w; \Delta^{-1}C)$. That is, $\tilde w_{\Delta^{-2}C}$ more easily satisfies

$$\|\nabla f(\tilde w_{\Delta^{-2}C}; \Delta^{-1}C)\| \le \epsilon \|\nabla f(0; \Delta^{-1}C)\|,$$

which is the next condition in (24) to be checked. Therefore, we expect that the parameter-selection procedure stops earlier if the dual solver is used, and we will verify this result in Section III.2.1.

Because of (III.4), right after $C$ is increased and before the optimization solver is called, we must check (24) with gradient evaluations. In contrast, the implementation is easier if we apply a primal-based optimization method that uses (28) as the stopping condition, because we can take advantage of (III.1) by comparing the $\tilde w$ vectors without gradient evaluations.

1 Note that here we assume that (24) has been slightly modified to have the term $\min(l^+, l^-)/l$ in (28).

III.1.3 Maximum Iterations

Another implementation issue is when the solver should stop if its stopping condition can hardly be reached. For example, in Section 4 we have shown that the dual CD method converges slowly when $C$ is large. To avoid unreasonably long training time, all solvers in LIBLINEAR stop when a maximal number of iterations is reached, even if the stopping condition is not satisfied. So that (28) is the stopping condition that actually takes effect for both primal and dual solvers in most cases, we increase the default maximum of 1,000 iterations to 10,000 and 100,000, respectively. The limit for the dual solver is higher because its number of iterations is usually higher than that of the primal solver.

III.1.4 An Improvement of the Newton Method

When solving the optimization problem in step 5.1.1 of Algorithm 1, a trust region Newton method [4] computes the Newton direction $s$ in each iteration by solving the following trust-region sub-problem:

$$\min_s \ \nabla f(w)^T s + \frac{1}{2} s^T \nabla^2 f(w) s \quad \text{subject to} \quad \|s\| \le \delta, \tag{III.5}$$

where $w$ is the current iterate and $\delta$ is the size of the trust region. A conjugate gradient (CG) method is used to approximately solve (III.5). CG is an iterative procedure that LIBLINEAR terminates if either $s$ exceeds the trust region or

$$\|\nabla f(w) + \nabla^2 f(w) s\| \le \epsilon_{CG} \|\nabla f(w)\|, \tag{III.6}$$

where $\epsilon_{CG} = 0.1$.

When $w$ is close to the optimal $w_C$, $\|\nabla f(w)\|$ is small. Then the stopping condition becomes stricter. Therefore, for a Newton method starting with the initial $w = 0$, the CG stopping condition (III.6) is loose in the beginning but tight in the end. With warm start, because $\{w_C\} \to w_\infty$, the initial $\bar w = w_C$ is close to $w_{\Delta C}$ for a large $C$, and the CG stopping condition is tight from the beginning. For a truncated Newton method such as the trust region Newton method, early directions are not good enough, so there is no need to solve (III.5) accurately. Therefore, under the warm-start setting, the original $\epsilon_{CG} = 0.1$ in LIBLINEAR may give a condition that is too tight.
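To make the role of $\epsilon_{CG}$ in (III.6) concrete, the following Python sketch runs a plain conjugate gradient loop with that relative-residual test. It is a simplified illustration rather than LIBLINEAR's routine: the trust-region boundary check is omitted, and the matrix `H` and vector `g` are hypothetical stand-ins for $\nabla^2 f(w)$ and $\nabla f(w)$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, v):
    return [dot(row, v) for row in H]

def residual_norm(H, g, s):
    """||g + H s||, the left-hand side of (III.6)."""
    return math.sqrt(sum((gi + hi) ** 2 for gi, hi in zip(g, matvec(H, s))))

def cg_newton_direction(H, g, eps_cg, max_iter=100):
    """Approximately solve H s = -g; stop when ||g + H s|| <= eps_cg * ||g||."""
    s = [0.0] * len(g)
    r = [-gi for gi in g]          # residual r = -g - H s (s starts at zero)
    d = r[:]                       # first search direction
    g_norm = math.sqrt(dot(g, g))
    rr = dot(r, r)
    iters = 0
    while math.sqrt(rr) > eps_cg * g_norm and iters < max_iter:
        Hd = matvec(H, d)
        alpha = rr / dot(d, Hd)
        s = [si + alpha * di for si, di in zip(s, d)]
        r = [ri - alpha * hdi for ri, hdi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
        iters += 1
    return s, iters

# Hypothetical positive definite Hessian (diagonal, eigenvalues 1, 2, ..., 128).
H = [[float(2 ** i) if i == j else 0.0 for j in range(8)] for i in range(8)]
g = [1.0] * 8

s_tight, it_tight = cg_newton_direction(H, g, eps_cg=0.1)
s_loose, it_loose = cg_newton_direction(H, g, eps_cg=0.5)
print(it_tight, it_loose)
```

A looser tolerance can never need more CG steps than a tighter one on the same system, which is the effect that motivates trying a larger $\epsilon_{CG}$ under warm start.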

To understand the performance under different $\epsilon_{CG}$ values, in Table I we show the CV time and the CV rate under $\epsilon_{CG} = 0.1$ and $0.5$ for some data sets. The columns in Table I are defined as follows.

• stop C: the last C that the parameter-selection procedure checked, that is, the C value that satisfies the termination criterion (24) of the parameter-selection procedure.

• stop time: the cumulative CV time (in seconds) from $C_{\min}$ to the stop C.

• best rate: the best CV rate achieved by checking from $C_{\min}$ to the stop C.

• total time: the total CV time (in seconds) from $C_{\min}$ to $C_{\max}$.

We see that $\epsilon_{CG}$ does not affect the best CV rate found and the corresponding C very much, but for document data such as yahoo-japan and yahoo-korea, the CV time is dramatically reduced by using $\epsilon_{CG} = 0.5$ (for yahoo-korea, the total time drops from 1.60e+04 to 8.40e+03 seconds). The setting of $\epsilon_{CG} = 0.1$ causes too many CG steps in the first several (outer) iterations.

Table I: Comparison of primal Newton method with $\epsilon_{CG}$ = 0.1 or 0.5.

Data set            ε_CG   stop log₂C   stop time   best rate   total time
a9a                 0.1    -2           1.87e+01    84.77       2.30e+01
a9a                 0.5     0           2.18e+01    84.80       2.42e+01
covtype scale       0.1    -4           6.40e+02    75.66       7.22e+02
covtype scale       0.5    -4           6.37e+02    75.67       7.17e+02
german scale        0.1    -1           1.74e-01    77.10       1.98e-01
german scale        0.5    -1           2.00e-01    77.10       2.20e-01
ijcnn1              0.1     3           1.03e+01    92.43       1.11e+01
ijcnn1              0.5     3           9.41e+00    92.42       1.01e+01
rcv1 test           0.1     4           8.85e+02    97.75       9.33e+02
rcv1 test           0.5     4           8.30e+02    97.75       8.79e+02
webspam (unigram)   0.1     7           1.42e+03    92.57       1.59e+03
webspam (unigram)   0.5     4           1.23e+03    92.62       1.28e+03
yahoo-japan         0.1     3           1.79e+03    92.69       1.79e+03
yahoo-japan         0.5     3           1.27e+03    92.68       1.27e+03
yahoo-korea         0.1     5           1.60e+04    86.89       1.60e+04
yahoo-korea         0.5     5           8.40e+03    86.86       8.40e+03

Although the total CV time increases for some data sets, the difference is relatively minor.

As a result, we believe that a larger $\epsilon_{CG} = 0.5$ is a generally better setting when warm start is applied. For consistency, we use this setting for all experiments with or without warm start. That is, when standard LIBLINEAR is used in Section 4, we change $\epsilon_{CG}$ from the default 0.1 to 0.5.

III.2 Comparison Results

We check the performance on both two-class and multi-class problems in LIBSVM data sets.² Because of the large number of data sets, we run experiments on many machines at the same time. Although these machines have different computational capabilities, we ensure that the primal and the dual solvers run on the same machine for any given data set. For the same reason, the training time of the six selected data sets presented in the paper may differ from the time shown here.

The data statistics are in Tables II and III.

III.2.1 Two-class Problems

We apply our parameter-selection procedure to the data sets listed in Table II. The columns are defined in Section III.1.4.

We terminate the process if it does not stop in three days. In this case, we indicate “not finished” in the table.

We have some observations from Table IV. Firstly, because the optimization problem is not exactly solved, the best CV rates of the two solvers may be different on some data sets. In a few cases the stopping condition (28) is too loose to get the true CV rate. We will discuss this issue in Section III.3. Secondly, the dual CD method usually terminates with a smaller C. This result verifies our expectation in Section III.1.2. Finally, for data sets with more instances than features, the primal Newton method is more competitive.

On the other hand, for sparse document data sets, which have many features, the dual CD method is usually faster if we compare the training time from $C_{\min}$ to the best C. In addition to the fact that the dual CD method is more efficient at handling document data, another reason for the faster training is that it stops the parameter-selection procedure at a smaller C. However, if we check the total time, the primal Newton method may still be competitive; see, for example, KDD2010-a. The result is consistent with the known property that a first-order optimization method like the dual CD method is slower when C is large.

2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Table II: Two-class data statistics: Density is the average ratio of non-zero features per instance.

Data set                l            n            density
a1a                     1,605        119          11.649%
a2a                     2,265        119          11.651%
a3a                     3,185        122          11.365%
a4a                     4,781        122          11.365%
a5a                     6,414        122          11.366%
a6a                     11,220       122          11.368%
a7a                     16,100       122          11.369%
a8a                     22,696       123          11.277%
a9a                     32,561       123          11.276%
australian              690          14           100.000%
australian scale        690          14           87.443%
breast-cancer           683          10           100.000%
breast-cancer scale     683          10           100.000%
cod-rna                 59,535       8            100.000%
colon-cancer            62           2,000        100.000%
covtype                 581,012      54           21.998%
covtype scale           581,012      54           22.121%
diabetes                768          8            100.000%
diabetes scale          768          8            99.854%
duke                    44           7,129        100.000%
epsilon normalized      400,000      2,000        100.000%
fourclass               862          2            100.000%
fourclass scale         862          2            99.594%
german                  1,000        24           100.000%
german scale            1,000        24           95.837%
gisette scale           6,000        5,000        99.100%
heart                   270          13           100.000%
heart scale             270          13           96.239%
ijcnn1                  49,990       22           59.091%
ionosphere scale        351          34           88.411%
KDD2010-a               8,407,752    20,216,830   0.000%
KDD2010-b               19,264,097   29,890,095   0.000%
leu                     38           7,129        100.000%
liver-disorders         345          6            100.000%
liver-disorders scale   345          6            99.082%
madelon                 2,000        500          100.000%
mushrooms               8,124        112          18.750%
news20                  19,996       1,355,191    0.034%
rcv1 test               677,399      47,236       0.155%
rcv1 train              20,242       47,236       0.157%
real-sim                72,309       20,958       0.245%
skin nonskin            245,057      3            100.000%
sonar scale             208          60           99.992%
splice                  1,000        60           100.000%
splice scale            1,000        60           100.000%
svmguide1               3,089        4            100.000%
svmguide3               1,243        22           99.495%
url combined            2,396,130    3,231,961    0.004%
w1a                     2,477        300          3.823%
w2a                     3,470        300          3.878%
w3a                     4,912        300          3.885%
w4a                     7,366        300          3.892%
w5a                     9,888        300          3.881%
w6a                     17,188       300          3.888%
w7a                     24,692       300          3.890%
w8a                     49,749       300          3.883%
webspam (trigram)       350,000      16,609,143   0.022%
webspam (unigram)       350,000      254          33.517%
yahoo-japan             176,203      832,026      0.016%
yahoo-korea             460,554      3,052,939    0.011%

Table III: Multi-class data statistics: Density is the average ratio of non-zero features per instance.

Data set                l           n        #classes   density
acoustic                78,823      50       3          100.000%
acoustic scale          78,823      50       3          100.000%
aloi                    108,000     128      1,000      23.982%
aloi scale              108,000     128      1,000      23.982%
combined                78,823      100      3          100.000%
combined scale          78,823      100      3          100.000%
connect-4               67,557      126      3          33.333%
covtype                 581,012     54       7          21.998%
covtype scale           581,012     54       7          22.222%
covtype scale01         581,012     54       7          22.121%
dna scale               2,000       180      3          25.342%
glass scale             214         9        6          99.844%
iris scale              150         4        3          97.833%
letter scale            15,000      16       26         100.000%
mnist                   60,000      780      10         19.218%
mnist8m                 8,100,000   784      10         25.388%
mnist8m scale           8,100,000   784      10         25.388%
mnist scale             60,000      780      10         19.218%
news20                  15,935      62,061   20         0.129%
news20 scale            15,935      62,061   20         0.129%
pendigits               7,494       16       10         87.182%
poker                   25,010      10       10         100.000%
protein                 17,766      357      3          28.999%
rcv1 test multiclass    518,571     47,236   53         0.137%
rcv1 train multiclass   15,564      47,236   51         0.140%
satimage scale          4,435       36       6          98.990%
sector                  6,412       53       105        617.218%
sector scale            6,412       55,197   105        0.295%
segment scale           2,310       19       7          94.484%
seismic                 78,823      50       3          100.000%
seismic scale           78,823      50       3          100.000%
shuttle scale           43,500      9        7          99.771%
svmguide2               391         20       3          100.000%
svmguide4               300         10       6          100.000%
usps                    7,291       256      10         100.000%
vehicle scale           846         18       4          98.023%
vowel                   528         10       11         99.943%
vowel scale             528         10       11         100.000%
wine scale              178         13       3          99.870%

III.2.2 Multi-class Problems

LIBLINEAR implements a one-versus-the-rest approach for multi-class problems, so several two-class problems are solved. We terminate the parameter-selection procedure if the stopping condition (28) holds for all two-class problems.

Using the same setting as for binary data sets, we check the performance on the multi-class problems listed in Table III and present results in Table V. The observations on multi-class problems are consistent with those on two-class problems.

III.3 Strictness of the Stopping Condition

In Tables IV and V, we can find some data sets where the dual CD method has much higher CV accuracy than the primal Newton method (for example, 85.80 versus 68.99 on australian in Table IV). In Section III.2.1, we suspected that the stricter stopping condition for the dual CD method causes such results. We conduct experiments to check data sets with this problem by varying the stopping tolerance $\epsilon$ for the primal Newton method. For comparison, we include one data set (ijcnn) that does not suffer from this problem. Results are in the left columns of Table VI. The best CV rates are similar for ijcnn under different $\epsilon$ values. However, for other data sets, the best CV rate increases when we use a smaller $\epsilon$. This observation implies that for these data sets, the stopping tolerance itself is also a parameter that must be tuned.

To alleviate the above problem, we can always use a strict stopping tolerance, but training time may increase for other data sets. Besides, it is hard to find a stopping tolerance that is small enough for all data sets. Therefore, we propose an interactive setting to help users improve their model if they think the stopping tolerance is not strict enough.

The procedure is as follows.

1. The parameter-selection procedure stops at $\tilde C$ and outputs $w^k_{\tilde C}$, $k = 1, \ldots, K$, for all $K$ CV folds.

2. If users think the procedure stops too early with inaccurate CV accuracy because of the stopping tolerance, they can specify a stricter stopping condition and run the procedure again. They do not need to start from $C_{\min}$. Instead, the parameter-selection procedure starts at $\tilde C$ and uses $w^k_{\tilde C}$ as the initial solution for training.

3. If the problem of using a too large tolerance still occurs, users can go back to step 1 and repeat the process.

We present the result of the above interactive procedure in the right columns of Table VI. A concern about our procedure is that the best C may be smaller than the initial $\tilde C$ considered in step 2. However, a comparison between the left and right columns in Table VI shows that the proposed interactive procedure can effectively select a C value with CV accuracy close to the best, while requiring less training time than the setting of always using a small tolerance.

However, outputting $w^k_{\tilde C}$ is not always necessary, and saving and loading $K$ models also makes the implementation more complicated. Therefore, we also consider using 0 as the initial solution in step 2. The results are in Table VII.³

Table VII shows that the improvement without using $w^k_{\tilde C}$ is still very effective. Therefore, we include this version in our released parameter-selection tool.

3 The experiments of Tables VI and VII are conducted on different machines, so the times are different.

III.4 Summary

In summary of the comparison between the primal Newton method and the dual CD method, we have the following observations.

1. Although the dual CD method can solve large document data sets more efficiently than the primal Newton method, the advantage is weakened when C is very large. The situation can be very serious for some problems. To avoid such bad situations, we choose the primal Newton method in our tool.

2. The CG stopping tolerance $\epsilon_{CG}$ chosen for the primal Newton method when solving a single optimization problem may be too tight when we solve a sequence of problems using warm start.

3. Although the primal Newton method needs a stricter stopping tolerance on some data sets, we designed an interactive utility to effectively alleviate this problem.

IV. EXPERIMENTAL RESULTS OF L2-LOSS SVM

We conduct experiments on L2-loss SVM under the same setting as logistic regression. Figure I shows the CV accuracy and CV training/validation time under different regularization parameters. Note that the best C tends to be smaller than that of logistic regression. The reason is that the L2-loss function gives a larger penalty for a wrong prediction. Table VIII gives the initial function values of the primal and dual solvers. Figure II demonstrates the training time versus the relative difference from the optimal objective value under the best C values found by Algorithm 1. The comparison of cumulative running time with/without warm start is in Figure III.

References

[1] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.

[2] W.-C. Kao, K.-M. Chung, C.-L. Sun, and C.-J. Lin. Decomposition methods for linear support vector machines. Neural Comput., 16(8):1689–1704, 2004.

[3] S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput., 15(7):1667–1689, 2003.

[4] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. In ICML, 2007.

[5] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[6] H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. MLJ, 85:41–75, 2011.

Table IV: Comparison of primal Newton method and dual CD method (binary data sets).

Data set, then for the dual CD method: stop log₂C, stop time, best rate, total time; then for the primal Newton method: stop log₂C, stop time, best rate, total time

a1a 5 1.05 83.24 2.47 10 0.57 83.18 0.57

a2a 5 2.57 82.08 5.23 6 1.04 82.03 1.12

a3a 4 2.30 83.61 5.36 8 1.75 83.58 1.79

a4a 3 2.25 84.40 4.84 8 1.46 84.44 1.52

a5a 4 8.63 84.44 16.02 4 4.60 84.44 5.21

a6a 1 12.62 84.28 23.39 3 8.20 84.30 9.52

a7a 0 7.23 84.57 15.02 6 5.48 84.56 6.04

a8a 0 20.62 84.51 40.65 3 16.08 84.53 18.95

a9a -1 21.00 84.79 37.95 4 11.93 84.80 13.56

australian -16 0.39 85.80 129.38 -12 0.43 68.99 0.62

australian scale 4 0.19 86.96 0.32 6 0.08 86.81 0.09

breast-cancer -37 0.09 94.73 1,013.96 -35 0.05 65.01 0.13

breast-cancer scale 4 0.15 96.78 0.22 6 0.06 96.63 0.07

cod-rna -13 71.52 93.36 22,225.90 -10 37.93 87.56 54.00

colon-cancer 3 1.24 83.87 1.42 6 1.93 83.87 2.02

covtype -21 140.44 75.58 not finished -17 92.88 61.30 235.85

covtype scale -1 456.23 75.66 714.25 0 453.72 75.67 537.97

diabetes -3 2.67 68.36 22.89 -4 0.14 67.97 0.18

diabetes scale 7 0.24 77.34 0.27 6 0.12 77.34 0.13

duke 1 1.50 88.64 1.80 6 1.53 88.64 1.63

epsilon normalized 6 11,074.39 89.80 15,451.41 6 13,605.46 89.81 14,500.33

fourclass -7 0.14 73.78 0.32 -4 0.06 73.78 0.08

fourclass scale 5 0.10 68.68 0.14 7 0.04 68.68 0.05

german 0 1.94 77.20 13.73 2 0.21 76.50 0.24

german scale 4 0.42 77.20 0.61 5 0.24 77.10 0.27

gisette scale 0 225.62 97.22 263.80 0 225.92 97.27 261.16

heart 1 12.89 84.07 58.98 1 0.06 83.70 0.07

heart scale 6 0.28 83.33 0.36 8 0.19 83.33 0.23

ijcnn1 4 43.50 92.46 60.54 6 35.74 92.42 37.67

ionosphere scale 8 1.40 84.62 1.88 10 0.49 84.33 0.49

KDD2010-a -3 13,711.68 88.24 71,712.26 2 54,877.52 88.23 56,312.56

KDD2010-b -3 31,599.50 88.89 102,894.63 2 98,611.87 88.85 101,344.49

leu 2 1.56 89.47 1.87 4 1.80 92.11 2.01

liver-disorders -3 0.95 69.57 2.62 0 0.17 70.14 0.22

liver-disorders scale 10 0.26 66.67 0.26 10 0.05 66.09 0.05

madelon -8 9,233.95 60.30 128,458.04 -4 48.09 60.30 54.02

mushrooms 1 3.31 99.98 3.95 4 2.30 99.96 2.61

news20 10 178.54 96.56 178.54 10 257.85 96.46 257.85

rcv1 test 3 1,420.24 97.77 1,944.22 7 1,462.53 97.75 1,524.17

rcv1 train 9 71.11 97.05 75.25 10 87.48 97.02 87.48

real-sim 8 80.84 97.53 90.04 10 71.36 97.53 71.36

skin nonskin -17 117.46 90.66 337.21 -15 80.31 90.71 129.74

sonar scale 10 6.27 74.04 6.27 10 0.45 74.04 0.45

splice 3 3.18 80.80 8.23 7 0.97 80.70 1.01

splice scale 6 0.70 72.70 0.85 6 0.40 72.70 0.44

svmguide1 -11 0.81 84.49 17.96 -7 0.40 83.39 0.59

svmguide3 8 1.60 80.05 3.08 10 0.83 79.57 0.83

url combined -8 7,308.55 99.40 12,439.65 -4 7,175.31 97.75 8,691.55

w1a 10 8.76 97.46 8.76 10 2.34 97.50 2.34

w2a 10 6.42 97.38 6.42 10 1.02 97.38 1.02

w3a 10 40.97 97.68 40.97 10 5.17 97.70 5.17

w4a 10 21.65 97.81 21.65 10 2.23 97.81 2.23

w5a 10 28.92 97.87 28.92 10 3.34 97.86 3.34

w6a 9 97.88 98.07 121.84 10 18.57 97.99 18.57

w7a 8 36.50 98.19 54.53 10 11.45 98.20 11.45

w8a 7 62.88 98.37 104.72 10 29.34 98.34 29.34

webspam (trigram) 1 17,397.35 99.63 26,008.73 5 23,469.62 98.83 25,024.18

webspam (unigram) 1 622.97 92.80 1,179.12 7 674.11 92.62 703.39

yahoo-japan 9 978.55 92.69 1,099.30 10 1,501.85 92.68 1,501.85

yahoo-korea 5 4,301.03 87.35 10,690.05 9 5,964.98 87.28 6,039.39


Table V: Comparison of primal Newton method and dual CD method (multi-class data sets).

dual primal

Data set stop log₂C stop time best rate total time stop log₂C stop time best rate total time

acoustic 0 230.37 68.07 370.82 6 193.92 67.84 205.82

acoustic scale 4 422.51 70.53 693.26 3 202.46 69.85 234.97

aloi -3 79,725.37 85.92 not finished 10 163,364.21 86.83 163,364.21

aloi scale -7 55,116.27 41.16 not finished 10 192,875.19 86.81 192,875.19

combined 1 465.78 80.36 736.76 10 565.93 80.14 565.93

combined scale -3 300.03 80.49 781.66 4 304.88 79.59 334.32

connect-4 -2 204.50 75.78 385.94 0 193.89 75.68 232.62

covtype -19 1,216.15 70.56 not finished -14 884.40 61.02 1,885.82

covtype scale 5 7,145.51 71.53 8,867.22 9 3,837.61 71.53 3,872.09

covtype scale01 2 3,866.98 71.53 5,808.50 7 2,712.17 71.49 2,814.53

dna scale 7 8.42 95.00 20.79 10 1.82 95.15 1.82

glass scale 10 1.11 64.49 1.11 10 0.19 64.95 0.19

iris scale 10 0.14 88.00 0.14 10 0.06 88.00 0.06

letter scale 9 154.85 68.09 163.32 10 78.15 68.07 78.15

mnist -18 320.88 91.30 246,983.17 -10 414.50 91.15 648.09

mnist8m -21 40,992.31 86.26 not finished -19 56,154.13 85.72 149,764.99

mnist8m scale -9 156,704.34 86.28 not finished 1 156,364.96 85.72 180,256.27

mnist scale -2 957.83 91.29 5,912.23 8 947.18 91.12 976.39

news20 10 962.99 83.43 962.99 10 738.04 83.43 738.04

news20 scale 10 640.42 84.37 640.42 10 703.26 84.34 703.26

pendigits -6 26.76 93.42 4,377.59 2 10.53 92.94 12.25

poker 10 1,100.21 49.96 1,100.21 10 90.50 49.96 90.50

protein 4 51.35 68.49 201.80 9 43.90 68.51 44.58

satimage scale 8 26.89 83.52 32.61 8 13.08 83.40 13.49

sector 3 13.20 0.98 29.27 3 5.15 0.98 14.07

sector scale 10 1,623.44 92.69 1,623.44 10 2,219.52 92.58 2,219.52

segment scale 8 16.30 92.38 29.02 10 4.49 92.25 4.49

seismic 3 321.56 70.94 473.28 6 322.09 70.57 341.66

seismic scale -3 247.53 70.92 545.83 2 273.39 70.13 313.80

shuttle scale 10 315.67 92.71 315.67 10 119.03 92.47 119.03

svmguide2 10 0.42 83.12 0.42 10 0.23 83.63 0.23

svmguide4 10 3.96 57.67 3.96 10 0.29 50.67 0.29

usps 4 344.92 94.84 3,996.14 9 186.11 94.71 188.75

vehicle scale 10 16.93 78.96 16.93 10 2.05 79.20 2.05

vowel 7 7.82 44.70 10.54 10 2.44 45.08 2.44

vowel scale 10 11.31 45.64 11.31 10 3.37 45.64 3.37

wine scale 10 0.74 99.44 0.74 10 0.28 99.44 0.28


Table VI: Parameter selection under different ε and the effectiveness of the interactive utility. The initial solution is w^k_C̃ when a new ε is used.

from Cmin interactive

Data set / ε log₂C̃ best log₂C best CV rate time log₂C̃ best log₂C best CV rate time

australian

1e-02 -43.0 -16.0 68.55 0.04 -43.0 -16.0 68.55 0.04

1e-03 -43.0 -5.0 81.45 0.06 -13.0 -5.0 81.45 0.02

1e-04 -43.0 -3.0 85.94 0.08 0.0 0.0 85.36 0.01

breast-cancer

1e-02 -57.0 -57.0 65.01 0.02 -57.0 -57.0 65.01 0.02

1e-03 -57.0 -57.0 65.01 0.03 -35.0 -35.0 65.01 0.00

1e-04 -57.0 -57.0 65.01 0.03 -32.0 -32.0 65.01 0.00

1e-05 -57.0 -8.0 93.27 0.07 -29.0 -8.0 93.27 0.03

1e-06 -57.0 -2.0 94.44 0.08 -5.0 -1.0 94.44 0.01

cod-rna

1e-02 -38.0 -13.0 87.56 3.88 -38.0 -13.0 87.56 3.88

1e-03 -38.0 -11.0 87.65 4.80 -10.0 -7.0 88.41 0.70

1e-04 -38.0 0.0 93.43 8.01 -4.0 0.0 93.43 1.30

covtype

1e-02 -46.0 -23.0 61.06 105.39 -46.0 -23.0 61.06 105.39

1e-03 -46.0 -12.0 70.51 183.96 -20.0 -13.0 70.21 50.16

1e-04 -46.0 -7.0 75.24 395.74 -10.0 -7.0 75.24 111.87

ijcnn1

1e-02 -18.0 3.0 92.43 6.80 -18.0 3.0 92.43 6.80

1e-03 -18.0 5.0 92.46 7.88 6.0 6.0 92.45 0.59

1e-04 -18.0 4.0 92.45 10.17 9.0 9.0 92.45 0.44

url combined

1e-02 -31.0 -9.0 97.79 2,367.36 -31.0 -9.0 97.79 2,367.36

1e-03 -31.0 -2.0 98.98 6,861.13 -6.0 -4.0 98.72 2,156.13

1e-04 -31.0 2.0 99.46 20,350.75 -1.0 2.0 99.46 13,170.86

Table VII: Parameter selection under different ε and the effectiveness of the interactive utility. The initial solution is 0 when a new ε is used.

from Cmin interactive

Data set / ε log₂C̃ best log₂C best CV rate time log₂C̃ best log₂C best CV rate time

australian

1e-02 -43.0 -16.0 68.55 0.07 -43.0 -16.0 68.55 0.07

1e-03 -43.0 -5.0 81.45 0.11 -13.0 -5.0 81.45 0.04

1e-04 -43.0 -3.0 85.94 0.17 0.0 1.0 85.36 0.04

breast-cancer

1e-02 -57.0 -57.0 65.01 0.04 -57.0 -57.0 65.01 0.04

1e-03 -57.0 -57.0 65.01 0.05 -35.0 -35.0 65.01 0.00

1e-04 -57.0 -57.0 65.01 0.05 -33.0 -33.0 65.01 0.01

1e-05 -57.0 -8.0 93.27 0.13 -28.0 -8.0 93.27 0.05

1e-06 -57.0 -2.0 94.44 0.15 -5.0 -5.0 94.00 0.02

cod-rna

1e-02 -38.0 -13.0 87.56 9.69 -38.0 -13.0 87.56 9.69

1e-03 -38.0 -11.0 87.65 11.83 -10.0 -10.0 87.68 1.52

1e-04 -38.0 0.0 93.43 21.41 -8.0 0.0 93.43 7.25

covtype

1e-02 -46.0 -23.0 61.06 210.87 -46.0 -23.0 61.06 210.87

1e-03 -46.0 -12.0 70.51 457.31 -20.0 -13.0 70.22 186.39

1e-04 -46.0 -7.0 75.24 990.42 -10.0 -7.0 75.24 463.35

Table VIII: Difference between the initial and optimal function values. L2-loss SVM is used. The approach that is closer to the optimum is boldfaced.

log₂C primal dual ‖w̄‖²/2 primal dual ‖w̄‖²/2 primal dual ‖w̄‖²/2

madelon ijcnn webspam

−4 5.83e−05 −1.11e+00 8.05e−03 9.89e−01 −1.40e+01 1.50e+01 8.54e+01 −2.59e+02 3.45e+02

0 0.00e+00 −7.80e+00 8.26e−03 1.15e−01 −1.96e+01 1.98e+01 6.35e+02 −1.80e+03 2.43e+03

4 0.00e+00 −5.91e+01 8.28e−03 7.55e−03 −2.02e+01 2.02e+01 3.80e+03 −1.32e+04 1.70e+04

8 0.00e+00 −4.99e+02 8.28e−03 5.06e−04 −2.02e+01 2.02e+01 1.24e+04 −6.78e+04 7.99e+04

rcv1 yahoo-japan news20

−4 1.18e+02 −4.90e+02 6.08e+02 6.87e+01 −1.14e+02 1.83e+02 3.66e+01 −6.45e+01 1.01e+02

0 8.60e+02 −2.47e+03 3.33e+03 1.48e+03 −1.99e+03 3.47e+03 1.81e+02 −7.53e+02 9.34e+02

4 7.01e+03 −1.97e+04 2.67e+04 9.79e+03 −3.56e+04 4.54e+04 7.81e+01 −2.33e+03 2.41e+03

8 2.33e+04 −1.28e+05 1.51e+05 8.34e+03 −1.25e+05 1.33e+05 3.41e+01 −2.74e+03 2.78e+03


[Figure I plots omitted. Panels: (a) madelon, (b) ijcnn, (c) webspam, (d) rcv1, (e) yahoo-japan, (f) news20. x-axis: log₂C; left y-axis: CV accuracy; right y-axis: cumulative CV time. Legend: CV Rate, CV Tight, Primal-ws, Dual-ws.]

Figure I: CV accuracy and training time using L2-loss SVM with warm start. The two CV curves and the left y-axis are the CV accuracy in percentage (%). The dashed lines and the right y-axis are the cumulative training time in the CV procedure in seconds. The vertical line indicates the last C value checked by Algorithm 1.


[Figure II plots omitted. Panels: (a) madelon, (b) ijcnn, (c) webspam, (d) rcv1, (e) yahoo-japan, (f) news20. x-axis: time (sec); y-axis: relative function value difference (log). Legend: primal, dual, primal-ws, dual-ws.]

Figure II: Objective values versus training time using L2-loss SVM and the best C found by Algorithm 1.

The solid lines correspond to runs without warm start, where LIBLINEAR's default initial points are used. Primal-ws and Dual-ws are the primal and dual solvers with warm start, with initial points obtained by (13) and (14). The horizontal line marks where the condition (28) with LIBLINEAR’s default ε = 10⁻² is satisfied.
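The comparison between a default initial point and a warm start can be illustrated with a small, self-contained sketch. It is not LIBLINEAR's solver: plain gradient descent stands in for the primal Newton and dual CD methods, the toy data and all function names are invented for illustration, the warm start simply reuses the previous solution as the initial point (an assumed stand-in for (13)), and the stopping rule is a relative gradient-norm test that only resembles the condition (28).

```python
import math

def objective_grad(w, X, y, C):
    """Gradient of f(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i w.x_i)^2."""
    g = list(w)
    for xi, yi in zip(X, y):
        slack = 1.0 - yi * sum(wj * xj for wj, xj in zip(w, xi))
        if slack > 0.0:
            for j, xj in enumerate(xi):
                g[j] -= 2.0 * C * slack * yi * xj
    return g

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def train(X, y, C, w0, eps=1e-2, max_iter=100000):
    """Gradient descent with step 1/L; returns (w, number of iterations)."""
    dim = len(X[0])
    # Upper bound on the Lipschitz constant of the gradient.
    lipschitz = 1.0 + 2.0 * C * sum(sum(a * a for a in xi) for xi in X)
    # Assumed stopping rule: ||grad f(w)|| <= eps * ||grad f(0)||.
    tol = eps * norm(objective_grad([0.0] * dim, X, y, C))
    w = list(w0)
    for it in range(max_iter):
        g = objective_grad(w, X, y, C)
        if norm(g) <= tol:
            return w, it
        w = [wj - gj / lipschitz for wj, gj in zip(w, g)]
    return w, max_iter

def sweep(X, y, log2_cs, warm):
    """Train for increasing C; with warm=True reuse the previous solution."""
    dim = len(X[0])
    w = [0.0] * dim
    total_iters = 0
    solutions = []
    for k in log2_cs:
        start = w if warm else [0.0] * dim
        w, iters = train(X, y, 2.0 ** k, start)
        total_iters += iters
        solutions.append(w)
    return solutions, total_iters

# Hypothetical toy data: seven 2-D points with labels in {+1, -1}.
X = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [-1.0, -1.5],
     [-2.0, -1.0], [0.5, -0.5], [-0.5, 0.5]]
y = [1, 1, 1, -1, -1, 1, -1]
LOG2_CS = list(range(-4, 3))

cold_sols, cold_iters = sweep(X, y, LOG2_CS, warm=False)
warm_sols, warm_iters = sweep(X, y, LOG2_CS, warm=True)
print("iterations without warm start:", cold_iters)
print("iterations with warm start:   ", warm_iters)
```

Both sweeps stop under the same tolerance, so they return nearly the same solution for every C. Only the cumulative iteration count differs, which is what the warm-start curves in the figure track in terms of time.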


[Figure III plots omitted. Panels: (a) madelon, (b) ijcnn, (c) webspam, (d) rcv1, (e) yahoo-japan, (f) news20. x-axis: log₂C; y-axis: cumulative CV time (log scale). Legend: primal-ws, dual-ws, primal, dual.]

Figure III: Training time (in seconds) using L2-loss SVM with/without warm-start techniques. The vertical line indicates the last C value checked by Algorithm 1. Because the training time quickly increases when C becomes large, the y-axis is log-scaled.