Supplementary materials for "Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments"

Wei-Lin Chiang

Dept. of Computer Science National Taiwan Univ., Taiwan

[email protected]

Mu-Chu Lee

[email protected]

Chih-Jen Lin

[email protected]

I. PROOFS

Following [1], we will apply some results proved in [4], which studies CD methods for problems in the following form:

$$\min_{\alpha} \; f(\alpha) \quad \text{subject to} \quad L_i \le \alpha_i \le U_i, \tag{I.1}$$

where

$$f(\alpha) \equiv g(E\alpha) + b^T\alpha,$$

f (·) and g(·) are proper closed functions, E is a constant matrix, and Li ∈ [−∞, ∞), Ui∈ (−∞, ∞] are lower/upper bounds. It has been checked in [1] that l1 and l2 loss SVM are in the form of (I.1) and satisfy additional assumptions needed in [4].

We introduce an important class of gradient-based schemes for CD’s variable selection: the Gauss-Southwell rule. It plays an important role in our proof. The rule requires that at a CD step the variable i selected for update satisfies the following condition:

$$|d_i| \ge \beta \max_j |d_j|, \tag{I.2}$$

where

$$d_j = \min(\max(\alpha_j - \nabla_j f(\alpha), 0), U) - \alpha_j,$$

and β is a fixed constant in the interval (0, 1].
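As an illustration of condition (I.2), the following minimal sketch computes the quantities d_j for a box [0, U] and checks whether a candidate index satisfies the rule. The function names and the numerical values are hypothetical and chosen only for this example; they are not taken from the paper's code.

```python
def projected_steps(alpha, grad, upper):
    """Return d_j = min(max(alpha_j - grad_j, 0), U) - alpha_j for every j."""
    return [min(max(a - g, 0.0), upper) - a for a, g in zip(alpha, grad)]


def satisfies_gauss_southwell(i, alpha, grad, upper, beta):
    """Check |d_i| >= beta * max_j |d_j| for a candidate index i."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    d = projected_steps(alpha, grad, upper)
    return abs(d[i]) >= beta * max(abs(dj) for dj in d)


# Made-up example values (assumption): three variables with U = 1.
alpha = [0.0, 0.5, 1.0]
grad = [-2.0, 0.1, 0.3]
for i in range(3):
    print(i, satisfies_gauss_southwell(i, alpha, grad, upper=1.0, beta=0.5))
```

With β = 1 only an index attaining the largest |d_j| passes the check, while smaller β admits more candidates.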

I.1 Proof of Theorem 1

Assume the result in Theorem 1 is wrong. Based on the use of a variable $\bar t$ to count the number of updates per outer iteration, we know that if the algorithm does not terminate, at least one $\alpha_i$ is updated per outer iteration. Therefore, line 15 of Algorithm 4, which changes α, is executed infinitely many times. Let us collect all these α iterates to form an infinite sequence {α^k}. If our variable selection for each CD update satisfies the Gauss-Southwell rule, then Lemma 4.2 of [4] implies that

$$\lim_{k\to\infty} \alpha^k - P[\alpha^k - \nabla f(\alpha^k)] = 0,$$

where P[·] is the projection operator defined as

$$P[\alpha_i] = \min(\max(\alpha_i, 0), U).$$

Then there exists $\bar k$ such that

$$\|\alpha^k - P[\alpha^k - \nabla f(\alpha^k)]\|_\infty < \bar\epsilon \min(\min_j \bar Q_{jj}, 1), \quad \forall k \ge \bar k. \tag{I.3}$$

Note that $\min_j \bar Q_{jj} > 0$ because we have mentioned in Section 2.1 that instances causing $\bar Q_{jj} = 0$ can be easily removed before the optimization process.

We will prove that if an index i is updated at α^k, then

$$\left|\alpha_i^k - P\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right]\right| < \bar\epsilon. \tag{I.4}$$

This inequality is important because it violates the condition at line 14 of Algorithm 4. We will use it to obtain the contradiction. First, if

$$0 \le \alpha_i^k - \nabla_i f(\alpha^k) \le U, \tag{I.5}$$

then

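$$\left|\alpha_i^k - P\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right]\right| \le \left|\frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right| = \frac{\left|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]\right|}{\bar Q_{ii}} < \bar\epsilon, \tag{I.6}$$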

where the first inequality is from the property of the projection operator, the equality is from (I.5), and the last inequality is from (I.3). Second, if

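$$\alpha_i^k - \nabla_i f(\alpha^k) < 0,$$

then with $\bar Q_{ii} > 0$ and (I.3)

$$\left|\alpha_i^k - P\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right]\right| \le |\alpha_i^k| = \left|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]\right| < \bar\epsilon. \tag{I.7}$$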

The situation for

$$\alpha_i^k - \nabla_i f(\alpha^k) > U$$

is similar. Therefore we have (I.4). However, (I.4) violates our condition to update $\alpha_i^k$. Thus our assumption is wrong and the algorithm should terminate after a finite number of steps.
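For completeness, the third case, described above only as similar, can be written out in the same way as (I.7). The following is a sketch of that step rather than text from the original proof; it uses only the feasibility $0 \le \alpha_i^k \le U$, $\bar Q_{ii} > 0$, and (I.3).

```latex
% Third case: \alpha_i^k - \nabla_i f(\alpha^k) > U.
% Since \alpha_i^k \le U, the gradient component must be negative, so the
% scaled step also moves to the right of \alpha_i^k and its projection
% stays in [\alpha_i^k, U].
\begin{align*}
\alpha_i^k - \nabla_i f(\alpha^k) > U \ \text{and}\ \alpha_i^k \le U
  &\;\Rightarrow\; \nabla_i f(\alpha^k) < 0
   \;\Rightarrow\; \alpha_i^k \le
     P\!\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right] \le U, \\
\left|\alpha_i^k - P\!\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right]\right|
  &\le U - \alpha_i^k
   = \left|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]\right|
   < \bar\epsilon .
\end{align*}
```

The last inequality follows from (I.3) because $\min(\min_j \bar Q_{jj}, 1) \le 1$.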

It remains to prove that our variable selection follows the Gauss-Southwell rule. Because α is in the following compact set (see a proof in, for example, [1, Section 7.1]),

$$\{\alpha \mid f(\alpha) \le f(\text{initial } \alpha),\ \alpha \text{ is feasible}\}, \tag{I.8}$$

there exists a constant S such that for all α in the set defined in (I.8),

$$\max_j \left|\alpha_j - P[\alpha_j - \nabla_j f(\alpha)]\right| \le S.$$

Let

$$\beta = \frac{\bar\epsilon \min(1, \min_j \bar Q_{jj})}{S}. \tag{I.9}$$

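Then we have that at any iteration k, the selected index i satisfies

$$\begin{aligned}
\left|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]\right|
&\ge \min(1, \min_j \bar Q_{jj}) \left|\alpha_i^k - P\left[\alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar Q_{ii}}\right]\right| \\
&\ge \min(1, \min_j \bar Q_{jj})\, \bar\epsilon \\
&= \beta S \\
&\ge \beta \max_j \left|\alpha_j^k - P[\alpha_j^k - \nabla_j f(\alpha^k)]\right|.
\end{aligned}$$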

The first inequality comes from properties of the projection operator; see how we derive the first inequality in (I.6) and (I.7). The second inequality is from how we decided if an element should be updated or not, while the last is from (I.9). Therefore, our selection follows the Gauss-Southwell rule.

I.2 Proof of Theorem 2

Assume the result is wrong. Because we have shown in Section I.1 that $\{\alpha^{\epsilon_k,\bar\epsilon_k}\}$ is in a compact set, there is a convergent sub-sequence $\{\alpha^{\epsilon_k,\bar\epsilon_k}\}$, k ∈ K, such that

$$\lim_{k\in K,\,k\to\infty} \alpha^{\epsilon_k,\bar\epsilon_k} = \bar\alpha \tag{I.10}$$

and

$$\lim_{k\in K,\,k\to\infty} w^{\epsilon_k,\bar\epsilon_k} = \bar w = \sum_{j=1}^{l} \bar\alpha_j y_j x_j \ne w^*.$$

Before Algorithm 4 stops, at the last iteration, we have the following intermediate vectors:

$$\alpha^{k,1}, \dots, \alpha^{k,T}, \alpha^{\epsilon_k,\bar\epsilon_k},$$

where $\alpha^{k,t}$ corresponds to the α vector before the set $\bar B_t$ is handled. We will prove that

$$\lim_{k\in K,\,k\to\infty} \alpha^{k,1} = \cdots = \lim_{k\in K,\,k\to\infty} \alpha^{k,T} = \lim_{k\in K,\,k\to\infty} \alpha^{\epsilon_k,\bar\epsilon_k} = \bar\alpha. \tag{I.11}$$

Consider $\alpha^{k,t}$ and $\alpha^{k,t+1}$. Between these two vectors, elements in $\bar B_t$ may be updated. We can further consider the following iterates

$$\alpha^{k,t,1}, \dots, \alpha^{k,t,|\bar B_t|}. \tag{I.12}$$

Because only elements in the selected subset $B \subset \bar B_t$ are actually considered for update, many adjacent ones in (I.12) are the same. Regardless of whether $\alpha^{k,t,s} = \alpha^{k,t,s+1}$ or not, from Lemma 2 in Section 7.4 of [1],

$$f(\alpha^{k,t,s}) - f(\alpha^{k,t,s+1}) \ge \frac{1}{2} \bar Q_{ii} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\|^2 \ge \frac{1}{2} \min_j \bar Q_{jj} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\|^2,$$

where $x_i$ is assumed to be the instance considered at $\alpha^{k,t,s}$. Because our setting ensures that the function value is monotonically decreasing and f(α) is lower-bounded, we have that $f(\alpha^{\epsilon_k,\bar\epsilon_k})$ as well as $f(\alpha^{k,t,s})$, ∀s, globally converge.

Therefore,

$$\lim_{k\in K,\,k\to\infty} \left( f(\alpha^{k,t,s}) - f(\alpha^{k,t,s+1}) \right) = 0 = \lim_{k\in K,\,k\to\infty} \frac{1}{2} \min_j \bar Q_{jj} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\|^2.$$

Note that we have explained in Section 2.1 that $\min_j \bar Q_{jj} > 0$. Then the limits of all vectors in (I.12) when k ∈ K, k → ∞ are all the same. Therefore,

$$\lim_{k\in K,\,k\to\infty} \alpha^{k,t} = \lim_{k\in K,\,k\to\infty} \alpha^{k,t+1}.$$

By similar arguments, we have (I.11).

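When Algorithm 4 stops, we see that either

$$|\nabla_i^P f(\alpha^{k,t})| \le \epsilon_k, \quad \forall t = 1, \dots, T,\ \forall i \in \bar B_t \tag{I.13}$$

or

$$\alpha^{k,1} = \cdots = \alpha^{k,T} = \alpha^{\epsilon_k,\bar\epsilon_k} \ \text{and}\ \Big( |\nabla_i^P f(\alpha^{\epsilon_k,\bar\epsilon_k})| \le \delta\epsilon_k \ \text{or}\ \big|\alpha_i^{\epsilon_k,\bar\epsilon_k} - P[\alpha_i^{\epsilon_k,\bar\epsilon_k} - \nabla_i f(\alpha^{\epsilon_k,\bar\epsilon_k})]\big| \le \bar\epsilon_k \Big), \quad \forall i = 1, \dots, l. \tag{I.14}$$

The first case corresponds to the situation when M ≤ ε holds at line 18, while the second situation means that $\bar t = 0$ occurs (either B is empty in selecting elements from $\bar B$ at line 10, or $|d| < \bar\epsilon$ at line 14).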

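Because $\bar\alpha$ is not optimal, from the optimality condition there exists an index i such that

$$\begin{aligned}
&\nabla_i f(\bar\alpha) < 0 \ \text{ if } \bar\alpha_i = 0, \text{ or} \\
&\nabla_i f(\bar\alpha) > 0 \ \text{ if } \bar\alpha_i = U, \text{ or} \\
&\nabla_i f(\bar\alpha) \ne 0 \ \text{ if } 0 < \bar\alpha_i < U.
\end{aligned} \tag{I.15}$$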

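From (16), the continuity of ∇f(α), and (I.11), there exists $\bar k$ such that for all k ∈ K, k ≥ $\bar k$, and ∀t = 1, . . . , T:

Case 1 of (I.15):

$$0 \le \alpha_i^{k,t} + \epsilon_k < U, \quad \nabla_i f(\alpha^{k,t}) < -\epsilon_k.$$

Case 2 of (I.15):

$$0 < \alpha_i^{k,t} - \epsilon_k \le U, \quad \nabla_i f(\alpha^{k,t}) > \epsilon_k.$$

Case 3 of (I.15):

$$\epsilon_k < \alpha_i^{k,t} < U - \epsilon_k, \quad |\nabla_i f(\alpha^{k,t})| > \epsilon_k.$$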

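From these three cases and our setting of $\epsilon_k > \bar\epsilon_k$, we have

$$|\nabla_i^P f(\alpha^{k,t})| > \epsilon_k, \quad \forall t = 1, \dots, T,$$

and

$$\left|\alpha_i^{k,t} - P\left[\alpha_i^{k,t} - \nabla_i f(\alpha^{k,t})\right]\right| > \epsilon_k > \bar\epsilon_k, \quad \forall t = 1, \dots, T.$$

This clearly violates (I.13) and (I.14). Therefore, our assumption is wrong, and hence $\{w^{\epsilon_k,\bar\epsilon_k}\}$ converges to the optimal $w^*$.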

II. ADDITIONAL ANALYSIS OF ALGORITHM 4


II.1 Scheduling of the Parallel For Loop

A parallel for loop must assign tasks to different threads. For example, OpenMP may statically dispatch tasks to threads or dynamically assign tasks according to the load of each thread. The setting, often referred to as the scheduling of the loop, may affect the computational speed; see, for example, a study in [3, Supplement]. We applied different OpenMP scheduling schemes for the operation at line 8 of Algorithm 4. Results show that the running time is about the same. The explanation is that the for loop being parallelized is a light task: because $\bar B$ is relatively small (no more than a few thousand), it is difficult to improve the utilization of cores by improving the load balance.

III. DETAILED RESULTS OF MINI-BATCH CD

In Figure III, we compare mini-batch CD using atomic and reduce operations for updating w (see Section 2.2.1) with LIBLINEAR. The parameter $\beta_b$ in Algorithm 2 for mini-batch CD is set to be

$$\beta_b = 1 + \frac{(b - 1)(l\sigma^2 - 1)}{l - 1}, \tag{III.1}$$

where b is the batch size, l is the number of instances, and σ² is the spectral norm of the normalized data matrix

$$\bar X = [\bar x_1, \dots, \bar x_l],$$

in which each $\bar x_i = x_i/\|x_i\|$.² We implemented the mini-batch CD without the shrinking technique because the expected convergence proof in [5] is based on the randomness of all instances. For a fair comparison, LIBLINEAR without shrinking is used.
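To make the dependence of $\beta_b$ on the batch size concrete, here is a minimal sketch that evaluates (III.1). The function name `beta_b` is hypothetical, and the value of σ² is assumed to be supplied by the caller because the text does not spell out how it is computed.

```python
def beta_b(b, l, sigma_sq):
    """Evaluate (III.1): beta_b = 1 + (b - 1)(l * sigma^2 - 1) / (l - 1).

    b        -- batch size |B|
    l        -- number of instances (must be larger than 1)
    sigma_sq -- the quantity sigma^2 of (III.1), supplied by the caller
    """
    if l <= 1:
        raise ValueError("l must be larger than 1")
    return 1.0 + (b - 1) * (l * sigma_sq - 1.0) / (l - 1)


# The three batch sizes considered in this section, with a made-up sigma^2.
for b in (16, 64, 256):
    print(b, beta_b(b, l=1000, sigma_sq=0.05))
```

In this formula $\beta_b$ equals 1 when b = 1 and grows linearly with b whenever lσ² > 1, which matches the remark that a large |B| makes each update more conservative.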

An important parameter to be decided is the size of B. In (III.1), we can observe that |B| = b is strongly related to $\beta_b$, which is an important coefficient of the sub-problem in Algorithm 2. If |B| is too large, then the CD update in each iteration becomes more conservative, leading to a slow convergence. In contrast, if |B| is too small, then the overhead in parallelizing CD updates becomes significant, a situation that may lead to a worse scalability.

We consider three sizes of |B|: 16, 64 and 256. The results show that for the dense data covtype, a small |B| leads to no scalability for both implementations. Further, the atomic operations significantly slow down the program for all sizes of B. For the sparse data rcv1, the opposite result is observed: the implementation using the reduce operation is worse because each dense array $\hat u^p$ defined in (9) handles only several sparse instances. In the comparison with the single-core LIBLINEAR without the shrinking technique, the mini-batch method is clearly slower. This result leads to our decision in Section 5.3 not to include mini-batch CD in the main comparison. The experimental result also confirms our assessment in Section 2.2.1, in which we point out the difficulty of updating some w components together in a multi-core environment.

²The computational time presented in Figure III does not include the calculation of σ², which is quite time-consuming.

IV. DETAILS OF THE SHRINKING IMPLEMENTATION

In Algorithm I we give details of implementing Algorithm 4 with the shrinking technique. We directly apply the setting in [1], so Algorithm I is basically the combination of Algorithm 4 in this work and Algorithm 3 in [1].

We mentioned in Section 3.1 that LIBLINEAR actually uses

$$\max_i \nabla_i^P f(\alpha^{k,i}) - \min_i \nabla_i^P f(\alpha^{k,i}) < \epsilon$$

as the stopping condition. We follow the same setting, so at line 32 the condition becomes M − m ≤ ε. In the future we hope to change the dual CD in LIBLINEAR and the parallel extension to use

$$\max_i |\nabla_i^P f(\alpha^{k,i})| < \epsilon.$$
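The two stopping measures above can be compared with a small sketch. It uses the usual definition of the projected gradient for a variable constrained to [0, U], and, as a simplifying assumption, evaluates all components at one α rather than at the successive iterates $\alpha^{k,i}$. The function names are hypothetical.

```python
def projected_gradient(alpha_i, g, upper):
    """Projected gradient of a variable constrained to [0, upper]."""
    if alpha_i <= 0.0:
        return min(g, 0.0)   # at the lower bound only a negative gradient counts
    if alpha_i >= upper:
        return max(g, 0.0)   # at the upper bound only a positive gradient counts
    return g


def stopping_measures(alpha, grad, upper):
    """Return (M - m, max_i |PG_i|) for the given point."""
    pg = [projected_gradient(a, g, upper) for a, g in zip(alpha, grad)]
    gap = max(pg) - min(pg)
    max_abs = max(abs(v) for v in pg)
    return gap, max_abs


# Made-up values: two free variables with gradients of opposite sign.
print(stopping_measures([0.5, 0.5], [0.3, -0.2], upper=1.0))
```

When projected gradients of both signs are present, M − m exceeds the largest absolute value, so for the same ε the two conditions are not equivalent.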

V. RESULTS WITHOUT APPLYING THE SHRINKING TECHNIQUE

Under the same setting in Section 5.3, we compare asynchronous CD, Algorithm 4, and LIBLINEAR without applying the shrinking technique.

However, some issues occur in asynchronous CD without applying the shrinking technique because we found that the implementation in [2] has very slow final convergence. In [1, Section 3.1], it is noticed that a random permutation of indices at the beginning of each outer iteration leads to much faster convergence than using a fixed order of CD updates. However, the experimental code in [2] did not conduct the same random shuffling procedure as LIBLINEAR. Their setting is to split all instances into P blocks, where P is the number of threads, and all threads permute indices within their associated blocks in parallel. Therefore, the set of elements in each block remains fixed throughout iterations. To analyze the influence of the randomness, we modified the experimental code in [2] to have the same global index permutation as in LIBLINEAR. A comparison between the implementation in [2] and the new setting is in Figure IV. We can observe that for almost all data, asynchronous CD with weaker randomness converges very slowly in the later stage.
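The difference between the two shuffling schemes can be sketched as follows. This is an illustration only: the function names are hypothetical, and the use of contiguous, equally sized blocks is an assumption, since the text does not say how the code in [2] forms its blocks.

```python
import random


def global_permutation(l, rng):
    """Permute all l indices together at the start of an outer iteration."""
    order = list(range(l))
    rng.shuffle(order)
    return order


def blockwise_permutation(l, num_threads, rng):
    """Shuffle only inside fixed blocks, one block per thread.

    Assumption: blocks are contiguous index ranges of (almost) equal size.
    """
    block = -(-l // num_threads)  # ceiling division
    blocks = []
    for start in range(0, l, block):
        chunk = list(range(start, min(start + block, l)))
        rng.shuffle(chunk)
        blocks.append(chunk)
    return blocks


rng = random.Random(0)
print(global_permutation(12, rng))
print(blockwise_permutation(12, 4, rng))
```

In the block-wise scheme an index never leaves its block, so each thread sees the same set of instances in every outer iteration; the global scheme mixes all indices.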

As a comparison, we check the situation when the shrinking technique is used. Results are shown in Figure V. The difference between the two settings is less significant than that in Figure IV. The explanation is that the randomness is improved by the shrinking technique. In LIBLINEAR as well as the implementation in [2], they move shrunken elements to the end of the index list. Therefore, when the shrinking technique is applied, the order of instances may be changed at each iteration, leading to a better randomness.

Based on the above analysis, for the comparison with Algorithm 4 and LIBLINEAR without the shrinking technique we consider the new asynchronous CD implementation that permutes all indices in the beginning of each outer iteration.

Results are in Figures VI and VII for l1 and l2 losses, respectively. We observe that the scalability of Algorithm 4 is better than that in Figure 1. The explanation is that without the shrinking technique, many elements (e.g., those that will eventually be bounded) have ∇^P_if (α) = 0. Then the size of B is relatively smaller in comparison with the size of ¯B. Hence a better scalability is obtained. For asynchronous CD, the convergence is improved in url-combined.

Our guess is that because many indices have $\nabla_i^P f(\alpha) = 0$, after the gradient value is calculated, the thread does not need to update w. Because w is updated less frequently, the lag τ discussed in Section 2.2.2 is relatively small and may become smaller, and hence the desired conditions for the convergence are more easily satisfied. Unfortunately, asynchronous CD still fails for covtype. In summary, when shrinking is not applied, Algorithm 4 is competitive with asynchronous CD and is still more robust.

Figure I: A comparison of Algorithm 4 with $|\bar B|$ = 64 and 256. In the legend, “fix” means that the adaptive rule in Section 4.2 is not applied to update $|\bar B|$. All other settings are the same as Figure 1. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.

Algorithm I A parallel dual CD method in practice
 1: Specify α and calculate $w = \sum_j y_j \alpha_j x_j$
 2: Specify δ, ε, $\bar\epsilon$, $\text{init}_{\bar B}$, $\text{max}_{\bar B}$
 3: Let $\bar M \leftarrow \infty$, $\bar m \leftarrow -\infty$, $A \leftarrow \{1, \dots, l\}$
 4: $\text{now}_{\bar B} \leftarrow \text{init}_{\bar B}$
 5: while true do
 6:     Let $M \leftarrow -\infty$, $m \leftarrow \infty$, $\bar A \leftarrow A$, $\bar t \leftarrow 0$
 7:     while $\bar A \ne \emptyset$ do
 8:         Choose $\bar B \subset \bar A$ with $|\bar B| = \min(\text{now}_{\bar B}, |\bar A|)$
 9:         $\bar A \leftarrow \bar A \setminus \bar B$
10:         Calculate $\nabla_{\bar B} f(\alpha)$ in parallel
11:         $B \leftarrow \emptyset$
12:         for $i \in \bar B$ do
13:             $PG \leftarrow 0$; $G \leftarrow \nabla_i f(\alpha)$
14:             if ($\alpha_i < C$ and $G < 0$) or ($\alpha_i > 0$ and $G > 0$) then
15:                 $PG \leftarrow G$
16:             else if ($\alpha_i = 0$ and $G > \bar M$) or ($\alpha_i = C$ and $G < \bar m$) then
17:                 $A \leftarrow A \setminus \{i\}$
18:             $M \leftarrow \max(M, PG)$, $m \leftarrow \min(m, PG)$
19:             if $|PG| \ge \delta\epsilon_1$ then
20:                 $B \leftarrow B \cup \{i\}$
21:         if $|B| = 0$ then
22:             $\text{now}_{\bar B} \leftarrow \min(1.5 \times \text{now}_{\bar B}, \text{max}_{\bar B})$
23:         else if $|B| \ge \text{init}_{\bar B}$ then
24:             $\text{now}_{\bar B} \leftarrow \text{now}_{\bar B}/2$
25:         for $i \in B$ do
26:             $G \leftarrow y_i w^T x_i - 1 + D_{ii}\alpha_i$
27:             $d \leftarrow \min(\max(\alpha_i - G/\bar Q_{ii}, 0), U) - \alpha_i$
28:             if $|d| \ge \bar\epsilon$ then
29:                 $\alpha_i \leftarrow \alpha_i + d$
30:                 $w \leftarrow w + d y_i x_i$
31:                 $\bar t \leftarrow \bar t + 1$
32:     if $M - m \le \epsilon_1$ or $\bar t = 0$ then
33:         if $A = \{1, \dots, l\}$ and $\epsilon_1 \le \epsilon$ then
34:             break
35:         else
36:             $A \leftarrow \{1, \dots, l\}$, $\bar M \leftarrow \infty$, $\bar m \leftarrow -\infty$
37:             $\epsilon_1 \leftarrow \max(0.1\epsilon_1, \epsilon)$
38:     if $M \le 0$ then
39:         $\bar M \leftarrow \infty$
40:     else
41:         $\bar M \leftarrow M$
42:     if $m \ge 0$ then
43:         $\bar m \leftarrow -\infty$
44:     else
45:         $\bar m \leftarrow m$
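The serial update in lines 26-30 of Algorithm I can be sketched in a few lines. This is a simplified, dense, single-threaded illustration rather than the authors' implementation: the function name `cd_update` and the toy data are hypothetical, and $\bar Q_{ii} = x_i^T x_i + D_{ii}$ is taken from the dual CD formulation in [1].

```python
def cd_update(i, alpha, w, X, y, D_ii, upper, eps_bar):
    """Apply lines 26-30 of Algorithm I to index i; return True if alpha_i changed."""
    x_i = X[i]
    q_bar_ii = sum(v * v for v in x_i) + D_ii            # Q-bar_ii, as in [1]
    G = y[i] * sum(wj * xj for wj, xj in zip(w, x_i)) - 1.0 + D_ii * alpha[i]
    d = min(max(alpha[i] - G / q_bar_ii, 0.0), upper) - alpha[i]
    if abs(d) < eps_bar:
        return False                                     # step too small, skip
    alpha[i] += d
    for j, xj in enumerate(x_i):                         # w <- w + d * y_i * x_i
        w[j] += d * y[i] * xj
    return True


# Toy data (assumption): two instances, l1 loss with C = U = 1, so D_ii = 0.
X = [[1.0, 0.0], [0.0, 2.0]]
y = [1, -1]
alpha, w = [0.0, 0.0], [0.0, 0.0]
for i in (0, 1, 0):
    changed = cd_update(i, alpha, w, X, y, D_ii=0.0, upper=1.0, eps_bar=1e-12)
    print(i, changed, alpha, w)
```

The third call revisits index 0 and makes no change, which is the situation counted by $\bar t$ in Algorithm I: an outer iteration with no accepted update leaves $\bar t = 0$.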

VI. RESULTS OF USING DIFFERENT C VALUES

In Section 5 we present results of C = 1. We wonder if similar observations can still be made under other C values. While users may experiment with different values, an important C value is the one that achieves the best validation accuracy. Therefore, we conduct five-fold cross validation on C ∈ {2⁻¹⁰, . . . , 2¹⁰}. Then the best C is used for comparing the training time. Results for l1 and l2 losses are presented in Figures VIII and IX, respectively. All other settings are the same as those in Section 5. Results are generally similar to those in Figures 1 and 2 because in many cases the selected C is not very different from C = 1. However, we roughly see that the problems become more difficult when C is large (e.g., webspam, covtype, epsilon in Figure VIII and webspam in Figure IX). It is known that the convergence of dual CD is slower for such cases, but how scalability is affected is an issue worth investigating.
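The parameter selection described above can be sketched as a small grid search. This is an illustration under stated assumptions: the grid is taken to contain every integer power of two between 2⁻¹⁰ and 2¹⁰, the fold assignment is a simple deterministic one, and `cv_accuracy` is a hypothetical callable standing in for training and validating a model for a given C.

```python
def candidate_c_values(low_exp=-10, high_exp=10):
    """Grid {2^low_exp, ..., 2^high_exp}; integer exponents are an assumption."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def five_fold_indices(n, folds=5):
    """Assign instance t to fold t % folds (one simple way to split the data)."""
    return [[t for t in range(n) if t % folds == k] for k in range(folds)]


def select_best_c(c_values, cv_accuracy):
    """Return the C with the highest validation accuracy (ties go to the smaller C)."""
    return max(c_values, key=lambda c: (cv_accuracy(c), -c))


if __name__ == "__main__":
    import math
    # Stand-in accuracy curve that peaks at C = 16; purely illustrative.
    best = select_best_c(candidate_c_values(), lambda c: -abs(math.log2(c) - 4))
    print(best)
```

The selected value would then be used for a single training run whose time is compared across methods.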

VII. THE STOPPING CONDITION OF ALGORITHM I

To show the relationship between the training time and the closeness to the optimal objective value, we apply non-stop settings for all approaches in Figures 1 and 2. For Algorithm I (indicated as Algorithm 4 in the paper), we set ε = 0 and initial $\epsilon_1$ = 0.1. Therefore, whenever $M - m \le \epsilon_1$ is satisfied, $\epsilon_1$ is reduced by a factor of 10. For asynchronous CD and LIBLINEAR, we initially set the stopping tolerance ε to be 0.1. When M − m ≤ ε is satisfied, we keep reducing ε by a factor of 10.

In practical use, we cannot apply a non-stop setting. For Algorithm I, this means that a stopping condition under a given ε > 0 is used. It is important to check if the algorithm performs well. In this section, we give some detailed analysis and study the behavior under different stopping conditions. Finally, we make an improved version of Algorithm I regarding the issues when stopping conditions are considered.

Figure II: A comparison of Algorithm 4 with $|\bar B|$ = 256 and 1024. In the legend, “fix” means that the adaptive rule in Section 4.2 is not applied to update $|\bar B|$. All other settings are the same as Figure 1. The l1 loss is considered.

In Section 4.3, we introduced a variable $\epsilon_1$ to prevent Algorithm I from being ε-dependent. However, ε is still a lower bound of $\epsilon_1$ (see line 37 in Algorithm I). Therefore, the behavior of Algorithm I is still affected by different ε.

To see the effect of ε, we run Algorithm I under ε = 0.1 and 0.01. Results are shown in Figures XI and XII, respectively. Clearly, we can observe that Algorithm I converges more slowly in some periods, particularly in the final stage (e.g., some almost horizontal segments at the end of the curves of “Alg I-1” for problems rcv1, webspam, and covtype). There are two possible reasons. First, resetting the active set A is a time-consuming process because all gradient elements, including those that should not be checked, are calculated in the next iteration. In Algorithm I, we reset it whenever $\epsilon_1$ is reduced (see line 36); this may be too frequent. Next, we describe another reason. Because of the setting

$$\epsilon_1 \leftarrow \max(0.1\epsilon_1, \epsilon),$$

in the final stage of the algorithm, we may have $\epsilon_1 = \epsilon$. Then with shrinking, the condition

$$M - m \le \epsilon_1$$

may be quickly satisfied after only very few α elements are updated. This process may repeat several times until both M − m ≤ ε and A = {1, . . . , l} are true.

We revise the if statement in lines 40-45 of Algorithm I to prevent the frequent reset of the active set A, while still making the decrease of $\epsilon_1$ possible. To begin, we make the lower bound of $\epsilon_1$ smaller than ε by using 0.01ε. Second, the reset of the active set A occurs only if

$$M - m \le \epsilon$$

holds rather than when $\epsilon_1$ is decreased. Details are in the following statements.

 1: Specify $\epsilon_1 = 0.1$
 2: Specify $\epsilon_{\min} = \min(0.01\epsilon, \epsilon_1)$
 3: // skipped
 4: if $M - m \le \epsilon_1$ or $\bar t = 0$ then
 5:     $\epsilon_1 \leftarrow \max(0.1\epsilon_1, \epsilon_{\min})$
 6:     if $M - m \le \epsilon$ then
 7:         if $A = \{1, \dots, l\}$ and $\epsilon_1 \le \epsilon$ then
 8:             break
 9:         else
10:             $A \leftarrow \{1, \dots, l\}$, $\bar M \leftarrow \infty$, $\bar m \leftarrow -\infty$

Figure III: A comparison between single-core LIBLINEAR and multi-core mini-batch CD. The mini-batch CD implementations using atomic and reduce operations are denoted as “mba” and “mbr,” respectively. We present running time in seconds (x-axis) versus the relative difference to the optimal objective value (y-axis, log-scaled). All methods are implemented without the shrinking technique. We use 1 and 8 cores for mini-batch CD. Panels: (a) rcv1 (|B| = 16), (b) rcv1 (|B| = 64), (c) rcv1 (|B| = 256), (d) covtype (|B| = 16), (e) covtype (|B| = 64), (f) covtype (|B| = 256).

Figure X: A comparison between Algorithm I and the new setting. We set ε = 0.01 and the l1 loss is used. Panels: (a) epsilon, (b) webspam.

Figure IV: A comparison between two asynchronous dual CD implementations without applying the shrinking technique. “permutation” is our implementation that randomly permutes the indices in the beginning of each outer iteration, while the other is the experimental code from [2]. The l1 loss is considered.

In Figure X, we show the results of applying the new settings. We can observe that the less frequent reactivation of the set A does not affect the convergence speed. However, the slow convergence in the final stage may still be observed. The reason is that when

$$M - m \le \epsilon$$

holds, the situation is similar to when $\epsilon_1 = \epsilon$ occurs in the previous setting. Our earlier discussion has explained that slow convergence may happen. To improve the final convergence, we further reduce the frequency of reactivating the set A by modifying line 6 to be

$$M - m \le \rho\epsilon, \tag{VII.1}$$

where ρ < 1 but is close to 1. The reason behind this setting is that

$$M - m \le \epsilon \tag{VII.2}$$

is a condition applied on a smaller problem of only variables in A. It is easier to satisfy than a condition on all variables. When (VII.2) is satisfied, we may not be that close to the optimal solution yet and hence the reactivation of the set A is not necessary. On the other hand, if the shrinking procedure does not remove any elements, our new setting will lead to longer training time. Therefore, the parameter ρ should be only slightly smaller than 1. In Figures XI and XII, we present the result of using ρ = 0.9. Clearly, we can see that in the final stage, the training time is significantly improved for many data sets. Another observation is that when the number of cores is increased from one to eight, the improvement becomes less dramatic. The reason is that the slow convergence is from computing the gradient after resetting the set A. This calculation is parallelized in Algorithm I, so the issue of slow convergence becomes less significant.

We have also checked the situation when l2-loss is used. Results using ε = 0.1 and 0.01 are presented in Figures XIII and XIV, respectively. The improvement is less significant. The reason might be that shrinking is less effective for l2-loss SVM because $\alpha_i$ is now unbounded above.

References

[1] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.

[2] C.-J. Hsieh, H.-F. Yu, and I. S. Dhillon. PASSCoDe: Parallel asynchronous stochastic dual coordinate descent. In ICML, 2015.

[3] M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In ICDM, 2015.

[4] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35, 1992.

[5] M. Takáč, P. Richtárik, and N. Srebro. Distributed mini-batch SDCA, 2015. arXiv.


Figure V: A comparison between two asynchronous dual CD implementations with applying the shrinking technique. “permutation” is our implementation that randomly permutes the indices in the beginning of each outer iteration, while the other is the experimental code from [2]. The l1 loss is considered.


Figure VI: A comparison of dual CD methods without applying the shrinking technique. All settings are the same as Figure 1. The l1 loss is considered.


Figure VII: A comparison of dual CD methods without applying the shrinking technique. All settings are the same as Figure 1. The l2 loss is considered.


(a) rcv1 (C = 2) (b) yahoo-korea (C = 8) (c) yahoo-japan (C = 2)

(d) webspam (C = 32) (e) url-combined (C = 2) (f) KDD2010-b (C = 0.25)

(g) covtype (C = 16) (h) epsilon (C = 16) (i) HIGGS (C = 0.5)

Figure VIII: A comparison of different parallel dual CD methods. All settings are the same as Figure 1 except that the best C selected from cross validation is used; see the C value shown next to the data name. The l1 loss is used.


(a) rcv1 (C = 0.5) (b) yahoo-korea (C = 4) (c) yahoo-japan (C = 0.5)

(d) webspam (C = 32) (e) url-combined (C = 2) (f) KDD2010-b (C = 0.03125)

(g) covtype (C = 0.015625) (h) epsilon (C = 4) (i) HIGGS (C = 0.5)

Figure IX: A comparison of different parallel dual CD methods. All settings are the same as Figure 2 except that the best C selected from cross validation is used; see the C value shown next to the data name. The l2 loss is used.


Figure XI: A comparison between Algorithm I and the new setting (VII.1) with ρ = 0.9. We set the stopping tolerance ε = 0.1. The l1 loss is used.


Figure XII: A comparison between Algorithm I and the new setting (VII.1) with ρ = 0.9. We set the stopping tolerance ε = 0.01. The l1 loss is used.


Figure XIII: A comparison between Algorithm I and the new setting (VII.1) with ρ = 0.9. We set the stopping tolerance ε = 0.1. The l2 loss is used.


Figure XIV: A comparison between Algorithm I and the new setting (VII.1) with ρ = 0.9. We set the stopping tolerance ε = 0.01. The l2 loss is used.