Supplementary materials for "Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in
Multi-core Environments"
Wei-Lin Chiang
Dept. of Computer Science National Taiwan Univ., Taiwan
Mu-Chu Lee
Dept. of Computer Science National Taiwan Univ., Taiwan
Chih-Jen Lin
Dept. of Computer Science National Taiwan Univ., Taiwan
I. PROOFS
Following [1], we will apply some results proved in [4], which studies CD methods for problems in the following form:
$$\min_{\alpha} \; f(\alpha) \quad \text{subject to} \quad L_i \le \alpha_i \le U_i, \tag{I.1}$$
where
$$f(\alpha) \equiv g(E\alpha) + b^T\alpha,$$
$f(\cdot)$ and $g(\cdot)$ are proper closed functions, $E$ is a constant matrix, and $L_i \in [-\infty, \infty)$, $U_i \in (-\infty, \infty]$ are lower/upper bounds. It has been checked in [1] that l1 and l2 loss SVM are in the form of (I.1) and satisfy additional assumptions needed in [4].
We introduce an important class of gradient-based schemes for CD's variable selection: the Gauss-Southwell rule. It plays an important role in our proof. The rule requires that at a CD step the variable $i$ selected for update satisfies the following condition:
$$|d_i| \ge \beta \max_j |d_j|, \tag{I.2}$$
where
$$d_j = \min(\max(\alpha_j - \nabla_j f(\alpha), 0), U) - \alpha_j,$$
and $\beta$ is a fixed constant in the interval $(0, 1]$.
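As an illustration of condition (I.2), the following minimal sketch evaluates the quantities $d_j$ and checks whether a chosen index satisfies the rule. The function names and the toy data are assumptions made for this example; they do not come from the paper's implementation.

```python
def projected_step(alpha, grad, U):
    """Return d_j = min(max(alpha_j - grad_j, 0), U) - alpha_j for every j."""
    return [min(max(a - g, 0.0), U) - a for a, g in zip(alpha, grad)]


def satisfies_gauss_southwell(i, alpha, grad, U, beta):
    """Check condition (I.2): |d_i| >= beta * max_j |d_j|."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    d = projected_step(alpha, grad, U)
    return abs(d[i]) >= beta * max(abs(dj) for dj in d)


# Toy data (hypothetical): three variables with upper bound U = 1.
alpha = [0.0, 0.5, 1.0]
grad = [-2.0, 0.1, 0.3]
d = projected_step(alpha, grad, U=1.0)
```

With a smaller `beta` the rule admits more indices, which is why the proof only needs to exhibit some fixed positive `beta`.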
I.1 Proof of Theorem 1
Assume the result in Theorem 1 is wrong. Based on the use of a variable $\bar{t}$ to check the number of updates per outer iteration, we know that if the algorithm does not terminate, at least one $\alpha_i$ is updated per outer iteration. Therefore, line 15 of Algorithm 4 to change $\alpha$ is conducted infinitely many times. Let us collect all these $\alpha$ iterates to form an infinite sequence $\{\alpha^k\}$. If our variable selection for each CD update satisfies the Gauss-Southwell rule, then Lemma 4.2 of [4] implies that
$$\lim_{k \to \infty} \left( \alpha^k - P[\alpha^k - \nabla f(\alpha^k)] \right) = 0,$$
where $P[\cdot]$ is the projection operator defined as
$$P[\alpha_i] = \min(\max(\alpha_i, 0), U).$$
Then there exists $\bar{k}$ such that
$$\|\alpha^k - P[\alpha^k - \nabla f(\alpha^k)]\|_\infty < \bar{\varepsilon} \min(\min_j \bar{Q}_{jj}, 1), \quad \forall k \ge \bar{k}. \tag{I.3}$$
Note that $\min_j \bar{Q}_{jj} > 0$ because we have mentioned in Section 2.1 that instances causing $\bar{Q}_{jj} = 0$ can be easily removed before the optimization process.
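We will prove that if an index $i$ is updated at $\alpha^k$, then
$$\left| \alpha_i^k - P\left[ \alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar{Q}_{ii}} \right] \right| < \bar{\varepsilon}. \tag{I.4}$$
This inequality is important because it violates the condition at line 14 of Algorithm 4. We will use it to obtain the contradiction. First, if
$$0 \le \alpha_i^k - \nabla_i f(\alpha^k) \le U, \tag{I.5}$$
then
$$\left| \alpha_i^k - P\left[ \alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar{Q}_{ii}} \right] \right| \le \left| \frac{\nabla_i f(\alpha^k)}{\bar{Q}_{ii}} \right| = \frac{|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]|}{\bar{Q}_{ii}} < \bar{\varepsilon}, \tag{I.6}$$
where the first inequality is from the property of the projection operator, the equality is from (I.5), and the last inequality is from (I.3). Second, if
$$\alpha_i^k - \nabla_i f(\alpha^k) < 0,$$
then with $\bar{Q}_{ii} > 0$ and (I.3),
$$\left| \alpha_i^k - P\left[ \alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar{Q}_{ii}} \right] \right| \le |\alpha_i^k| = |\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]| < \bar{\varepsilon}. \tag{I.7}$$
The situation for
$$\alpha_i^k - \nabla_i f(\alpha^k) > U$$
is similar. Therefore we have (I.4). However, (I.4) violates our condition to update $\alpha_i^k$. Thus our assumption is wrong and the algorithm should terminate after a finite number of steps.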
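It remains to prove that our variable selection follows the Gauss-Southwell rule. Because $\alpha$ is in the following compact set (see a proof in, for example, [1, Section 7.1]),
$$\{\alpha \mid f(\alpha) \le f(\text{initial } \alpha), \ \alpha \text{ is feasible}\}, \tag{I.8}$$
there exists a constant $S$ such that for all $\alpha$ in the set defined in (I.8),
$$\max_j |\alpha_j - P[\alpha_j - \nabla_j f(\alpha)]| \le S.$$
Let
$$\beta = \frac{\bar{\varepsilon} \min(1, \min_j \bar{Q}_{jj})}{S}. \tag{I.9}$$
Then we have that at any iteration $k$, the selected index $i$ satisfies
$$\begin{aligned}
|\alpha_i^k - P[\alpha_i^k - \nabla_i f(\alpha^k)]|
&\ge \min(1, \min_j \bar{Q}_{jj}) \left| \alpha_i^k - P\left[ \alpha_i^k - \frac{\nabla_i f(\alpha^k)}{\bar{Q}_{ii}} \right] \right| \\
&\ge \min(1, \min_j \bar{Q}_{jj}) \, \bar{\varepsilon} \\
&= \beta S \\
&\ge \beta \max_j |\alpha_j^k - P[\alpha_j^k - \nabla_j f(\alpha^k)]|.
\end{aligned}$$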
The first inequality comes from properties of the projection operator; see how we derive the first inequality in (I.6) and (I.7). The second inequality is from how we decided if an element should be updated or not, while the last is from (I.9). Therefore, our selection follows the Gauss-Southwell rule.
I.2 Proof of Theorem 2
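Assume the result is wrong. Because we have shown in Section I.1 that $\{\alpha^{\varepsilon_k, \bar{\varepsilon}_k}\}$ is in a compact set, there is a convergent sub-sequence $\{\alpha^{\varepsilon_k, \bar{\varepsilon}_k}\}, k \in K$ such that
$$\lim_{k \in K, k \to \infty} \alpha^{\varepsilon_k, \bar{\varepsilon}_k} = \bar{\alpha} \tag{I.10}$$
and
$$\lim_{k \in K, k \to \infty} w^{\varepsilon_k, \bar{\varepsilon}_k} = \bar{w} = \sum_{j=1}^{l} \bar{\alpha}_j y_j x_j \ne w^*.$$
Before Algorithm 4 stops, at the last iteration, we have the following intermediate vectors:
$$\alpha^{k,1}, \dots, \alpha^{k,T}, \alpha^{\varepsilon_k, \bar{\varepsilon}_k},$$
where $\alpha^{k,t}$ corresponds to the $\alpha$ vector before the set $\bar{B}_t$ is handled. We will prove that
$$\lim_{k \in K, k \to \infty} \alpha^{k,1} = \cdots = \lim_{k \in K, k \to \infty} \alpha^{k,T} = \lim_{k \in K, k \to \infty} \alpha^{\varepsilon_k, \bar{\varepsilon}_k} = \bar{\alpha}. \tag{I.11}$$
Consider $\alpha^{k,t}$ and $\alpha^{k,t+1}$. Between these two vectors, elements in $\bar{B}_t$ may be updated. We can further consider the following iterates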
$$\alpha^{k,t,1}, \dots, \alpha^{k,t,|\bar{B}_t|}. \tag{I.12}$$
Because only elements in the selected subset $B \subset \bar{B}_t$ are actually considered for update, many adjacent ones in (I.12) are the same. Regardless of whether $\alpha^{k,t,s} = \alpha^{k,t,s+1}$ or not, from Lemma 2 in Section 7.4 of [1],
$$f(\alpha^{k,t,s}) - f(\alpha^{k,t,s+1}) \ge \frac{1}{2} \bar{Q}_{ii} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\| \ge \frac{1}{2} \min_j \bar{Q}_{jj} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\|,$$
where $x_i$ is assumed to be the instance considered at $\alpha^{k,t,s}$. Because our setting ensures that the function value is monotonically decreasing and $f(\alpha)$ is lower-bounded, we have that $f(\alpha^{\varepsilon_k, \bar{\varepsilon}_k})$ as well as $f(\alpha^{k,t,s}), \forall s$ globally converge.
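Therefore,
$$\lim_{k \in K, k \to \infty} \left( f(\alpha^{k,t,s}) - f(\alpha^{k,t,s+1}) \right) = 0 = \lim_{k \in K, k \to \infty} \frac{1}{2} \min_j \bar{Q}_{jj} \|\alpha^{k,t,s} - \alpha^{k,t,s+1}\|.$$
Note that we have explained in Section 2.1 that $\min_j \bar{Q}_{jj} > 0$. Then the limits of all vectors in (I.12) when $k \in K, k \to \infty$ are all the same. Therefore,
$$\lim_{k \in K, k \to \infty} \alpha^{k,t} = \lim_{k \in K, k \to \infty} \alpha^{k,t+1}.$$
By similar arguments, we have (I.11).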
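When Algorithm 4 stops, we see that either
$$|\nabla_i^P f(\alpha^{k,t})| \le \varepsilon_k, \quad \forall t = 1, \dots, T, \ \forall i \in \bar{B}_t \tag{I.13}$$
or
$$\alpha^{k,1} = \cdots = \alpha^{k,T} = \alpha^{\varepsilon_k, \bar{\varepsilon}_k} \ \text{and} \ \left( |\nabla_i^P f(\alpha^{\varepsilon_k, \bar{\varepsilon}_k})| \le \delta \varepsilon_k, \ \text{or} \ \left| \alpha_i^{\varepsilon_k, \bar{\varepsilon}_k} - P\left[ \alpha_i^{\varepsilon_k, \bar{\varepsilon}_k} - \nabla_i f(\alpha^{\varepsilon_k, \bar{\varepsilon}_k}) \right] \right| \le \bar{\varepsilon}_k \right), \quad \forall i = 1, \dots, l. \tag{I.14}$$
The first case corresponds to the situation when $M \le \varepsilon$ holds at line 18, while the second situation means that $\bar{t} = 0$ occurs (either $B$ is empty in selecting elements from $\bar{B}$ at line 10 or $|d| < \bar{\varepsilon}$ at line 14).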
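Because $\bar{\alpha}$ is not optimal, from the optimality condition there exists an index $i$ such that
$$\begin{aligned}
&\nabla_i f(\bar{\alpha}) < 0 \ \text{ if } \bar{\alpha}_i = 0, \text{ or} \\
&\nabla_i f(\bar{\alpha}) > 0 \ \text{ if } \bar{\alpha}_i = U, \text{ or} \\
&\nabla_i f(\bar{\alpha}) \ne 0 \ \text{ if } 0 < \bar{\alpha}_i < U.
\end{aligned} \tag{I.15}$$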
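From (16), the continuity of $\nabla f(\alpha)$, and (I.11), there exists $\bar{k}$ such that for all $k \in K$, $k \ge \bar{k}$, $\forall t = 1, \dots, T$:
Case 1 of (I.15):
$$0 \le \alpha_i^{k,t} + \varepsilon_k < U, \quad \nabla_i f(\alpha^{k,t}) < -\varepsilon_k.$$
Case 2 of (I.15):
$$0 < \alpha_i^{k,t} - \varepsilon_k \le U, \quad \nabla_i f(\alpha^{k,t}) > \varepsilon_k.$$
Case 3 of (I.15):
$$\varepsilon_k < \alpha_i^{k,t} < U - \varepsilon_k, \quad |\nabla_i f(\alpha^{k,t})| > \varepsilon_k.$$
From these three cases and our setting of $\varepsilon_k > \bar{\varepsilon}_k$, we have
$$|\nabla_i^P f(\alpha^{k,t})| > \varepsilon_k, \quad \forall t = 1, \dots, T,$$
and
$$\left| \alpha_i^{k,t} - P\left[ \alpha_i^{k,t} - \nabla_i f(\alpha^{k,t}) \right] \right| > \varepsilon_k > \bar{\varepsilon}_k, \quad \forall t = 1, \dots, T.$$
This clearly violates (I.13) and (I.14). Therefore, our assumption is wrong, and hence $\{w^{\varepsilon_k, \bar{\varepsilon}_k}\}$ converges to the optimal $w^*$.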
II. ADDITIONAL ANALYSIS OF ALGORITHM 4
II.1 Scheduling of the Parallel For Loop
A parallel for loop must assign tasks to different threads. For example, OpenMP may statically dispatch tasks to threads or dynamically assign tasks according to the load of each thread. The setting, often referred to as the scheduling of the loop, may affect the computational speed; see, for example, a study in [3, Supplement]. We applied different OpenMP scheduling schemes for the operation at line 8 of Algorithm 4. Results show that the running time is about the same. The explanation is that the for loop being parallelized is a light task: because $\bar{B}$ is relatively small (no more than a few thousand), it is difficult to improve the utilization of cores by improving the load balance.
III. DETAILED RESULTS OF MINI-BATCH CD
In Figure III, we compare mini-batch CD using atomic and reduce operations for updating $w$ (see Section 2.2.1) with LIBLINEAR. The parameter $\beta_b$ in Algorithm 2 for mini-batch CD is set to be
$$\beta_b = 1 + \frac{(b - 1)(l\sigma^2 - 1)}{l - 1}, \tag{III.1}$$
where $b$ is the batch size, $l$ is the number of instances, and $\sigma^2$ is the spectral norm of the normalized data matrix
$$\bar{X} = [\bar{x}_1, \dots, \bar{x}_l],$$
in which each $\bar{x}_i = x_i / \|x_i\|$ (see footnote 2). We implemented the mini-batch CD without the shrinking technique because the expected convergence proof in [5] is based on the randomness of all instances. For a fair comparison, LIBLINEAR without shrinking is used.
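To make the dependence of $\beta_b$ on the batch size concrete, the sketch below simply evaluates (III.1). It takes $\sigma^2$ as an input rather than computing it from data, and the values of $l$ and $\sigma^2$ used in the loop are hypothetical numbers chosen for illustration only.

```python
def beta_b(b, l, sigma2):
    """Evaluate (III.1): beta_b = 1 + (b - 1) * (l * sigma2 - 1) / (l - 1)."""
    if l < 2 or not 1 <= b <= l:
        raise ValueError("need l >= 2 and 1 <= b <= l")
    return 1.0 + (b - 1) * (l * sigma2 - 1.0) / (l - 1)


# Hypothetical values: l = 1000 instances and sigma2 = 0.05.
for b in (16, 64, 256):
    print(b, beta_b(b, 1000, 0.05))
```

For a batch of size one the formula gives $\beta_b = 1$ regardless of $\sigma^2$, and whenever $l\sigma^2 > 1$ the value grows linearly with $b$.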
An important parameter to be decided is the size of B.
In (III.1), we can observe that |B| = b is strongly related to βb, which is an important coefficient of the sub-problem in Algorithm 2. If |B| is too large, then the CD update in each iteration becomes more conservative, leading to a slow convergence. In contrast, if |B| is too small, then the overhead in parallelizing CD updates becomes significant, a situation that may lead to a worse scalability.
We consider three sizes of $|B|$: 16, 64 and 256. The results show that for the dense data covtype, a small $|B|$ leads to no scalability for both implementations. Further, the atomic operations significantly slow down the program for all sizes of $B$. For the sparse data rcv1, the opposite result is observed: the one implementing the reduce operation is worse because each dense array $\hat{u}^p$ defined in (9) handles only several sparse instances. For the comparison with the single-core LIBLINEAR without the shrinking technique, clearly the mini-batch method is slower. This result leads to our decision in Section 5.3 not to include mini-batch CD in the main comparison. The experimental result also confirms our assessment in Section 2.2.1, in which we point out the difficulty of updating some $w$ components together in a multi-core environment.
(Footnote 2: The computational time presented in Figure III does not include the calculation of $\sigma^2$, which is quite time-consuming.)
IV. DETAILS OF THE SHRINKING IMPLEMENTATION
In Algorithm I we give details of implementing Algorithm 4 with the shrinking technique. We directly apply the setting in [1], so Algorithm I is basically the combination of Algorithm 4 in this work and Algorithm 3 in [1].
We mentioned in Section 3.1 that LIBLINEAR actually uses
$$\max_i \nabla f(\alpha^{k,i}) - \min_i \nabla f(\alpha^{k,i}) < \varepsilon$$
as the stopping condition. We follow the same setting so at line 32 the condition becomes $M - m \le \varepsilon$. In the future we hope to change the dual CD in LIBLINEAR and the parallel extension to use
$$\max_i |\nabla_i^P f(\alpha^{k,i})| < \varepsilon.$$
V. RESULTS WITHOUT APPLYING THE SHRINKING TECHNIQUE
Under the same setting in Section 5.3, we compare asynchronous CD, Algorithm 4, and LIBLINEAR without applying the shrinking technique.
However, some issues occur in asynchronous CD without applying the shrinking technique because we found that the implementation in [2] has very slow final convergence. In [1, Section 3.1], it is noticed that a random permutation of indices in the beginning of each outer iteration leads to much faster convergence than the setting of using a fixed order of CD updates. However, the experimental code in [2] did not conduct the same procedure of random shuffling as LIBLINEAR. Their setting is to split all instances into $P$ blocks, where $P$ is the number of threads, and all threads permute indices within their associated blocks in parallel. Therefore, the elements in each block remain fixed throughout iterations. To analyze the influence of the randomness, we modified the experimental code in [2] to have the same global index permutation as in LIBLINEAR. A comparison between the implementation in [2] and the new setting is in Figure IV. We can observe that for almost all data, asynchronous CD with a weaker randomness converges very slowly in the later stage.
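The difference between the two shuffling schemes can be sketched as follows. This is an illustration only and is not taken from the code of [2] or LIBLINEAR; the function names are hypothetical, and the use of contiguous, nearly equal blocks is an assumption, since the text only states that the instances are split into $P$ blocks.

```python
import random


def global_permutation(l, rng):
    """Shuffle all l indices together, as done at each outer iteration."""
    order = list(range(l))
    rng.shuffle(order)
    return order


def blockwise_permutation(l, num_threads, rng):
    """Split indices into num_threads contiguous blocks and shuffle within each."""
    order = []
    for p in range(num_threads):
        start = p * l // num_threads
        end = (p + 1) * l // num_threads
        block = list(range(start, end))
        rng.shuffle(block)
        order.extend(block)
    return order


rng = random.Random(0)
l, P = 12, 3
block_order = blockwise_permutation(l, P, rng)
global_order = global_permutation(l, rng)
```

In the block-wise scheme an index never leaves its block, so the first four positions of `block_order` always hold the indices 0 to 3 in some order, whereas `global_order` can place any index anywhere.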
As a comparison, we check the situation when the shrinking technique is used. Results are shown in Figure V. The difference between the two settings is less significant than that in Figure IV. The explanation is that the randomness is improved by the shrinking technique. Both LIBLINEAR and the implementation in [2] move shrunken elements to the end of the index list. Therefore, when the shrinking technique is applied, the order of instances may be changed at each iteration, leading to a better randomness.
Based on the above analysis, for the comparison with Algorithm 4 and LIBLINEAR without the shrinking technique we consider the new asynchronous CD implementation that permutes all indices in the beginning of each outer iteration.
Results are in Figures VI and VII for l1 and l2 losses, respectively. We observe that the scalability of Algorithm 4 is better than that in Figure 1. The explanation is that without the shrinking technique, many elements (e.g., those that will eventually be bounded) have $\nabla_i^P f(\alpha) = 0$. Then the size of $B$ is relatively smaller in comparison with the size of $\bar{B}$. Hence a better scalability is obtained. For asynchronous CD, the convergence is improved in url-combined.
Our guess is that because many indices have $\nabla_i^P f(\alpha) = 0$, after the gradient value is calculated, the thread does not need to update $w$. From the less frequent update of $w$, the
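Algorithm I: A parallel dual CD method in practice
1: Specify $\alpha$ and calculate $w = \sum_j y_j \alpha_j x_j$
2: Specify $\delta$, $\varepsilon$, $\bar{\varepsilon}$, init$_{\bar{B}}$, max$_{\bar{B}}$
3: Let $\bar{M} \leftarrow \infty$, $\bar{m} \leftarrow -\infty$, $A \leftarrow \{1, \dots, l\}$
4: now$_{\bar{B}}$ $\leftarrow$ init$_{\bar{B}}$
5: while true do
6:     Let $M \leftarrow -\infty$, $m \leftarrow \infty$, $\bar{A} \leftarrow A$, $\bar{t} \leftarrow 0$
7:     while $\bar{A} \ne \emptyset$ do
8:         Choose $\bar{B} \subset \bar{A}$ with $|\bar{B}| = \min(\text{now}_{\bar{B}}, |\bar{A}|)$
9:         $\bar{A} \leftarrow \bar{A} \setminus \bar{B}$
10:         Calculate $\nabla_{\bar{B}} f(\alpha)$ in parallel
11:         $B \leftarrow \emptyset$
12:         for $i \in \bar{B}$ do
13:             $PG \leftarrow 0$; $G \leftarrow \nabla_i f(\alpha)$
14:             if ($\alpha_i < C$ and $G < 0$) or ($\alpha_i > 0$ and $G > 0$) then
15:                 $PG \leftarrow G$
16:             else if ($\alpha_i = 0$ and $G > \bar{M}$) or ($\alpha_i = C$ and $G < \bar{m}$) then
17:                 $A \leftarrow A \setminus \{i\}$
18:             $M \leftarrow \max(M, PG)$, $m \leftarrow \min(m, PG)$
19:             if $|PG| \ge \delta\varepsilon_1$ then
20:                 $B \leftarrow B \cup \{i\}$
21:         if $|B| = 0$ then
22:             now$_{\bar{B}}$ $\leftarrow \min(1.5 \times \text{now}_{\bar{B}}, \text{max}_{\bar{B}})$
23:         else if $|B| \ge \text{init}_{\bar{B}}$ then
24:             now$_{\bar{B}}$ $\leftarrow$ now$_{\bar{B}} / 2$
25:         for $i \in B$ do
26:             $G \leftarrow y_i w^T x_i - 1 + D_{ii}\alpha_i$
27:             $d \leftarrow \min(\max(\alpha_i - G/\bar{Q}_{ii}, 0), U) - \alpha_i$
28:             if $|d| \ge \bar{\varepsilon}$ then
29:                 $\alpha_i \leftarrow \alpha_i + d$
30:                 $w \leftarrow w + d y_i x_i$
31:                 $\bar{t} \leftarrow \bar{t} + 1$
32:     if $M - m \le \varepsilon_1$ or $\bar{t} = 0$ then
33:         if $A = \{1, \dots, l\}$ and $\varepsilon_1 \le \varepsilon$ then
34:             break
35:         else
36:             $A \leftarrow \{1, \dots, l\}$, $\bar{M} \leftarrow \infty$, $\bar{m} \leftarrow -\infty$
37:             $\varepsilon_1 \leftarrow \max(0.1\varepsilon_1, \varepsilon)$
38:     if $M \le 0$ then
39:         $\bar{M} \leftarrow \infty$
40:     else
41:         $\bar{M} \leftarrow M$
42:     if $m \ge 0$ then
43:         $\bar{m} \leftarrow -\infty$
44:     else
45:         $\bar{m} \leftarrow m$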
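Three scalar rules inside Algorithm I can be isolated as small functions: the projected gradient (lines 13-15), the clipped single-variable step (lines 27-28), and the adaptive batch size (lines 21-24). The sketch below is a serial illustration under stated assumptions, not the authors' implementation; the function names are hypothetical, and the integer rounding and the lower limit of 1 on the batch size are choices made here that the algorithm does not specify.

```python
def projected_gradient(alpha_i, G, C):
    """Lines 13-15: keep G only when a step against it can stay inside [0, C]."""
    if (alpha_i < C and G < 0) or (alpha_i > 0 and G > 0):
        return G
    return 0.0


def cd_step(alpha_i, G, Q_bar_ii, U, eps_bar):
    """Lines 27-28: clipped step d; return 0.0 when |d| < eps_bar (no update)."""
    d = min(max(alpha_i - G / Q_bar_ii, 0.0), U) - alpha_i
    return d if abs(d) >= eps_bar else 0.0


def next_batch_size(now_B, num_selected, init_B, max_B):
    """Lines 21-24: grow the batch if nothing was selected, halve it if many were."""
    if num_selected == 0:
        return min(int(1.5 * now_B), max_B)  # integer rounding is an assumption
    if num_selected >= init_B:
        return max(now_B // 2, 1)  # the lower limit of 1 is an assumption
    return now_B
```

For example, a variable at the lower bound with a positive gradient has projected gradient zero and is therefore never placed in $B$, which is the case the shrinking test at line 16 examines.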
Figure I: A comparison of Algorithm 4 with $|\bar{B}| = 64$ and 256. In the legend, "fix" means that the adaptive rule in Section 4.2 is not applied to update $|\bar{B}|$. All other settings are the same as Figure 1. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
lag $\tau$ discussed in Section 2.2.2 is relatively small and may become smaller, and hence the desired conditions for the convergence are more easily satisfied. Unfortunately, asynchronous CD still fails for covtype. In summary, when shrinking is not applied, Algorithm 4 is competitive with asynchronous CD and is still more robust.
VI. RESULTS OF USING DIFFERENT C VALUES
In Section 5 we present results of $C = 1$. We wonder if similar observations can still be made under other $C$ values.
While users may experiment with different values, an important $C$ value is the one that achieves the best validation accuracy. Therefore, we conduct five-fold cross validation on $C \in \{2^{-10}, \dots, 2^{10}\}$. Then the best $C$ is used for comparing the training time. Results for l1 and l2 losses are respectively presented in Figures VIII and IX. All other settings are the same as those in Section 5. Results are generally similar to those in Figures 1 and 2 because in many cases the selected $C$ is not very different from $C = 1$. However, we roughly see that the problems become more difficult when $C$ is large (e.g., webspam, covtype, epsilon in Figure VIII and webspam in Figure IX). It is known that the convergence of dual CD is slower for such cases, but how scalability is affected is an issue worth investigating.
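The selection of $C$ described above can be sketched as a grid search with five folds. This is a minimal outline under assumptions: `validation_accuracy` is a hypothetical user-supplied callable standing in for training a linear classifier and scoring it, and the folds are contiguous index ranges, which presumes the data have already been shuffled.

```python
def candidate_C_values():
    """The grid {2^-10, ..., 2^10} used in the text."""
    return [2.0 ** k for k in range(-10, 11)]


def five_fold_splits(l, folds=5):
    """Yield (train_indices, validation_indices) for contiguous folds of range(l)."""
    for f in range(folds):
        start, end = f * l // folds, (f + 1) * l // folds
        val = list(range(start, end))
        train = list(range(0, start)) + list(range(end, l))
        yield train, val


def select_best_C(l, validation_accuracy):
    """Return the C with the highest mean validation accuracy over five folds.

    validation_accuracy(C, train, val) is a hypothetical callable that trains
    on the indices in `train` and returns the accuracy on those in `val`.
    """
    best_C, best_acc = None, float("-inf")
    for C in candidate_C_values():
        accs = [validation_accuracy(C, tr, va) for tr, va in five_fold_splits(l)]
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_C, best_acc = C, mean_acc
    return best_C
```

Ties are resolved in favor of the smaller $C$ because the grid is scanned in increasing order and only a strictly better mean accuracy replaces the current choice.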
VII. THE STOPPING CONDITION OF ALGORITHM I
To show the relationship between the training time and the closeness to the optimal objective value, we apply non-stop settings for all approaches in Figures 1 and 2. For Algorithm I (indicated as Algorithm 4 in the paper), we set $\varepsilon = 0$ and initial $\varepsilon_1 = 0.1$. Therefore, whenever $M - m \le \varepsilon_1$ is satisfied, $\varepsilon_1$ is reduced by a factor of 10. For asynchronous CD and LIBLINEAR, we initially set the stopping tolerance $\varepsilon$ to be 0.1. When $M - m \le \varepsilon$ is satisfied, we keep reducing $\varepsilon$ by a factor of 10.
In practical use, we cannot apply a non-stop setting.
Figure II: A comparison of Algorithm 4 with $|\bar{B}| = 256$ and 1024. In the legend, "fix" means that the adaptive rule in Section 4.2 is not applied to update $|\bar{B}|$. All other settings are the same as Figure 1. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
For Algorithm I, this means that a stopping condition under a given $\varepsilon > 0$ is used. It is important to check if the algorithm performs well. In this section, we give some detailed analysis and study the behavior under different stopping conditions. Finally, we make an improved version of Algorithm I regarding the issues when stopping conditions are considered.
In Section 4.3, we introduced a variable $\varepsilon_1$ to prevent Algorithm I from being $\varepsilon$-dependent. However, $\varepsilon$ is still a lower bound of $\varepsilon_1$ (see line 37 in Algorithm I). Therefore, the behavior of Algorithm I is still affected by different $\varepsilon$.
To see the effect of $\varepsilon$, we run Algorithm I under $\varepsilon = 0.1$ and 0.01. Results are shown in Figures XI and XII respectively. Clearly, we can observe that Algorithm I converges more slowly in some periods, particularly in the final stage (e.g., some almost horizontal segments in the end of the curves of "Alg I-1" for problems rcv1, webspam, and covtype). There are two possible reasons. First, resetting the active set $A$ is a time-consuming process because all gradient elements, including those which should not be checked, are calculated in the next iteration. In Algorithm I, we reset it whenever $\varepsilon_1$ is reduced (see line 36); this may be too frequent. Next, we describe another reason. Because of the setting
$$\varepsilon_1 \leftarrow \max(0.1\varepsilon_1, \varepsilon),$$
in the final stage of the algorithm, we may have $\varepsilon_1 = \varepsilon$. Then with shrinking, the condition
$$M - m \le \varepsilon_1$$
may be quickly satisfied after only very few $\alpha$ elements are updated. This process may repeat several times until both $M - m \le \varepsilon$ and $A = \{1, \dots, l\}$ are true.
We revise the if statement in lines 40-45 of Algorithm I to prevent the frequent reset of the active set $A$, while still making the decrease of $\varepsilon_1$ possible. To begin, we make the lower bound of $\varepsilon_1$ smaller than $\varepsilon$ by using $0.01\varepsilon$. Second,
Figure III: A comparison between single-core LIBLINEAR and multi-core mini-batch CD. The mini-batch CD implementations by using atomic and reduce operations are respectively denoted as "mba" and "mbr." We present running time in seconds (x-axis) versus the relative difference to the optimal objective value (y-axis, log-scaled). All methods are implemented without the shrinking technique. We use 1 and 8 cores for mini-batch CD. Panels: (a) rcv1 ($|B| = 16$), (b) rcv1 ($|B| = 64$), (c) rcv1 ($|B| = 256$), (d) covtype ($|B| = 16$), (e) covtype ($|B| = 64$), (f) covtype ($|B| = 256$).
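the reset of the active set $A$ occurs only if
$$M - m \le \varepsilon$$
holds rather than when $\varepsilon_1$ is decreased. Details are in the following statements.
1: Specify $\varepsilon_1 = 0.1$
2: Specify $\varepsilon_{\min} = \min(0.01\varepsilon, \varepsilon_1)$
3: // skipped
4: if $M - m \le \varepsilon_1$ or $\bar{t} = 0$ then
5:     $\varepsilon_1 \leftarrow \max(0.1\varepsilon_1, \varepsilon_{\min})$
6:     if $M - m \le \varepsilon$ then
7:         if $A = \{1, \dots, l\}$ and $\varepsilon_1 \le \varepsilon$ then
8:             break
9:         else
10:             $A \leftarrow \{1, \dots, l\}$, $\bar{M} \leftarrow \infty$, $\bar{m} \leftarrow -\infty$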
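The revised statements can be read as a small decision function that is evaluated at the end of every outer iteration. The sketch below is one possible rendering and not the authors' code: the function name, the returned action strings, and the boolean `active_is_full` (standing for $A = \{1, \dots, l\}$) are hypothetical. The optional factor `rho` multiplies $\varepsilon$ in the test at line 6; `rho = 1.0` reproduces the statements as listed, while the text goes on to use $\rho = 0.9$ in (VII.1).

```python
def end_of_outer_iteration(M, m, t_bar, eps, eps1, eps_min,
                           active_is_full, rho=1.0):
    """Return (new_eps1, action), where action is 'continue', 'reset' or 'stop'."""
    action = "continue"
    if M - m <= eps1 or t_bar == 0:                # line 4
        eps1 = max(0.1 * eps1, eps_min)            # line 5
        if M - m <= rho * eps:                     # line 6 (rho = 1 as listed)
            if active_is_full and eps1 <= eps:     # line 7
                action = "stop"                    # line 8
            else:
                action = "reset"                   # line 10: reactivate all of A
    return eps1, action


# Initial values following lines 1-2, with a user tolerance eps = 0.01.
eps = 0.01
eps1 = 0.1
eps_min = min(0.01 * eps, eps1)
```

In this form, decreasing `eps1` and resetting the active set are decoupled: the tolerance can shrink several times while the action remains `'continue'`.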
Figure X: A comparison between Algorithm I and the new setting. We set $\varepsilon = 0.01$ and the l1 loss is used. Panels: (a) epsilon, (b) webspam.
In Figure X, we show the results of applying the new settings. We can observe that the less frequent reactivation of the set $A$ does not affect the convergence speed. However, the slow convergence in the final stage may still be observed. The reason is that when
$$M - m \le \varepsilon$$
holds, the situation is similar to when $\varepsilon_1 = \varepsilon$ occurs in the previous setting. Our earlier discussion has explained that slow convergence may happen. To improve the final convergence, we further reduce the frequency of reactivating the set $A$ by modifying line 6 to be
$$M - m \le \rho\varepsilon, \tag{VII.1}$$
where $\rho < 1$ but is close to 1. The reason behind this setting is that
$$M - m \le \varepsilon \tag{VII.2}$$
is a condition applied on a smaller problem of only variables in $A$. It holds more easily than a condition on all variables. When (VII.2) is satisfied, we may not be that close to the optimal solution yet and hence the reactivation of the set $A$ is not necessary. On the other hand, if the shrinking procedure does not remove any elements, our new setting will lead to longer training time. Therefore, the parameter $\rho$ should be only slightly smaller than 1. In Figures XI and XII, we present the result of using $\rho = 0.9$. Clearly, we can see that in the final stage, the training time is significantly improved for many data sets. Another observation is that when the number of cores is increased from one to eight, the improvement becomes less dramatic. The reason is that the slow
Figure IV: A comparison between two asynchronous dual CD implementations without applying the shrinking technique. "permutation" is our implementation that randomly permutes the indices in the beginning of each outer iteration, while the other is the experimental code from [2]. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
convergence is from computing the gradient after resetting the set A. This calculation is parallelized in Algorithm I, so the issue of slow convergence becomes less significant.
We have also checked the situation when l2-loss is used.
Results by using ε = 0.1 and 0.01 are respectively presented in Figures XIII and XIV. The improvement is less significant.
The reason might be that shrinking is less effective for l2-loss SVM because $\alpha_i$ is now unbounded above.
References
[1] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[2] C.-J. Hsieh, H.-F. Yu, and I. S. Dhillon. PASSCoDe: Parallel asynchronous stochastic dual coordinate descent. In ICML, 2015.
[3] M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In ICDM, 2015.
[4] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35, 1992.
[5] M. Takáč, P. Richtárik, and N. Srebro. Distributed mini-batch SDCA, 2015. arXiv.
Figure V: A comparison between two asynchronous dual CD implementations with applying the shrinking technique. "permutation" is our implementation that randomly permutes the indices in the beginning of each outer iteration, while the other is the experimental code from [2]. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure VI: A comparison of dual CD methods without applying the shrinking technique. All settings are the same as Figure 1. The l1 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure VII: A comparison of dual CD methods without applying the shrinking technique. All settings are the same as Figure 1. The l2 loss is considered. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure VIII: A comparison of different parallel dual CD methods. All settings are the same as Figure 1 except that the best $C$ selected from cross validation is used; see the $C$ value shown next to the data name. The l1 loss is used. Panels: (a) rcv1 ($C = 2$), (b) yahoo-korea ($C = 8$), (c) yahoo-japan ($C = 2$), (d) webspam ($C = 32$), (e) url-combined ($C = 2$), (f) KDD2010-b ($C = 0.25$), (g) covtype ($C = 16$), (h) epsilon ($C = 16$), (i) HIGGS ($C = 0.5$).
Figure IX: A comparison of different parallel dual CD methods. All settings are the same as Figure 2 except that the best $C$ selected from cross validation is used; see the $C$ value shown next to the data name. The l2 loss is used. Panels: (a) rcv1 ($C = 0.5$), (b) yahoo-korea ($C = 4$), (c) yahoo-japan ($C = 0.5$), (d) webspam ($C = 32$), (e) url-combined ($C = 2$), (f) KDD2010-b ($C = 0.03125$), (g) covtype ($C = 0.015625$), (h) epsilon ($C = 4$), (i) HIGGS ($C = 0.5$).
Figure XI: A comparison between Algorithm I and the new setting (VII.1) with $\rho = 0.9$. We set the stopping tolerance $\varepsilon = 0.1$. The l1 loss is used. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure XII: A comparison between Algorithm I and the new setting (VII.1) with $\rho = 0.9$. We set the stopping tolerance $\varepsilon = 0.01$. The l1 loss is used. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure XIII: A comparison between Algorithm I and the new setting (VII.1) with $\rho = 0.9$. We set the stopping tolerance $\varepsilon = 0.1$. The l2 loss is used. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.
Figure XIV: A comparison between Algorithm I and the new setting (VII.1) with $\rho = 0.9$. We set the stopping tolerance $\varepsilon = 0.01$. The l2 loss is used. Panels: (a) rcv1, (b) yahoo-korea, (c) yahoo-japan, (d) webspam, (e) url-combined, (f) KDD2010-b, (g) covtype, (h) epsilon, (i) HIGGS.