We compare various implementations of sparse matrix-vector multiplications when they are used in a Newton method.
2.2.1 Data Sets and Experimental Settings
We carefully choose data sets to cover different densities and different relationships between the numbers of instances and features. All data sets listed in Table 2.1 are available at the LIBSVM data set page. Note that we use the tri-gram version of webspam in our experiments.
Besides two-class data sets, we consider several multi-class problems. In LIBLINEAR, a K-class problem is decomposed into K two-class problems by the one-versus-the-rest strategy. Because each two-class problem treats one class as positive and the rest as negative, all K problems share the same x_1, ..., x_l. Their matrix-vector multiplications therefore involve the same X and X^T. We are interested in such cases because approaches like RSB need the initial construction only once.
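The one-versus-the-rest decomposition can be illustrated with a minimal sketch. The function name, the dictionary-based sparse rows, and the toy data are hypothetical and are not LIBLINEAR's actual code. The point is only that the K binary problems differ in their label vectors, while the instances are stored once.

```python
def one_vs_rest_labels(y, classes):
    """Build one +1/-1 label vector per class; the instances are not copied."""
    return {k: [1 if yi == k else -1 for yi in y] for k in classes}

# Hypothetical toy data: three instances stored once as sparse rows.
X = [{0: 1.0}, {1: 2.0}, {0: 0.5, 2: 1.5}]
y = [0, 2, 1]

# Each of the K = 3 binary problems reuses the same X with its own labels.
binary_labels = one_vs_rest_labels(y, classes=sorted(set(y)))
```

Because X is shared, any one-time preprocessing of X (such as a format conversion) would be paid once and reused by all K binary problems.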
We choose C = 1 as the regularization parameter in (2.1). An exception is C = 0.0625 for KDD2010-b for shorter training time. Our code is based on LIBLINEAR [9], which implements a trust region Newton method [26]. We compare the following approaches to conduct matrix-vector multiplications in the Newton method.
• Baseline: The one-core implementation in LIBLINEAR (Section 2.1.1) serves as the baseline in our investigation of the speedup. We use version 1.96.
• OpenMP: We employ OpenMP to parallelize the loops for matrix-vector multiplications. The scheduling we use is schedule(dynamic,256); details are discussed in Section 2.2.4. For X^T u we have two settings, OpenMP-atomic and OpenMP-array. The first uses atomic operations, while the second stores the results of each thread in an array; see Section 2.1.2.
[Figure 2.3: Speedup of calculating Xd and X^T u separately and together. Panels: (a) rcv1_binary, (b) covtype_binary.]
• MKL: We use Intel MKL version 11.2; see Section 2.1.4.
• RSB: The approach described in Section 2.1.5, which uses the recursive sparse block format. We use version 1.2.0 of the package librsb at http://librsb.sourceforge.net.
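The idea behind the OpenMP-array setting can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: Python threads stand in for OpenMP threads (so no speedup should be expected from this code), rows are split statically rather than with schedule(dynamic,256), and all names are hypothetical. Each worker accumulates its share of X^T u in a private buffer, so no atomic updates are needed, and the buffers are summed at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def xt_u_private_buffers(rows, u, n_features, n_threads=2):
    """Compute X^T u with one private accumulator per worker, then reduce."""
    # Static interleaved split of the row indices (an assumption for brevity).
    chunks = [range(t, len(rows), n_threads) for t in range(n_threads)]

    def work(chunk):
        buf = [0.0] * n_features      # private to this worker
        for i in chunk:
            for j, v in rows[i].items():
                buf[j] += u[i] * v    # no other worker writes to buf
        return buf

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(work, chunks))
    # Final reduction over the per-worker buffers.
    return [sum(col) for col in zip(*buffers)]

# Hypothetical toy matrix with 3 instances and 3 features.
rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 4.0, 1: 5.0}]
u = [1.0, 2.0, 3.0]
result = xt_u_private_buffers(rows, u, n_features=3)
```

The price of avoiding atomic operations is extra memory (one buffer of length n per thread) and a final reduction whose cost grows with the number of threads.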
Experiments are conducted on machines with 12 cores of Intel Xeon E5-2620 CPUs, which have 32K L1-cache, 256K L2-cache, and 15360K shared L3-cache. Because past comparisons (e.g., [28]) involving MKL often use the Intel icc compiler, we use it for all our programs with the options -O3 -xAVX -fPIC -openmp. Our experiments use 1, 2, 4, 6, 8, 10, and 12 threads because the machine has up to 12 cores.
2.2.2 Running Time of Sparse Matrix-vector Multiplications in Newton Methods
Although past works have pointed out the importance of sparse matrix-vector multiplications in Newton methods for data classification, no numerical results have been reported to show their cost in the entire optimization procedure. In Table 2.1, we show the percentage of running time spent on matrix-vector multiplications. Clearly, matrix-vector multiplications are the computational bottleneck because for almost all problems, more than 90% of the running time is spent on them. Therefore, it is essential to parallelize the sparse matrix-vector multiplications in a Newton method.
[Figure 2.4: Speedup of the Newton method for training logistic regression. Panels: (a) KDD2010-b, (b) url_combined, (c) webspam, (d) epsilon, (e) rcv1_binary, (f) covtype_binary, (g) rcv1_multiclass, (h) covtype_multiclass.]
2.2.3 Comparison on Matrix-vector Multiplications
Although past works from the numerical analysis and parallel processing communities have compared different strategies for matrix-vector multiplications, they did not use matrices from machine learning problems. It is interesting to see if our results are consistent with theirs. We present the speedup, defined as
$$\text{speedup} = \frac{\text{running time of LIBLINEAR}}{\text{running time of the compared approach}},$$
versus the number of threads. Note that the speedup may be larger than the number of threads because of the storage and algorithmic differences from the baseline. The speedup for Xd and X^T(DXd) may be different, so we present the results separately in Figures 2.1 and 2.2.
Results indicate that all approaches give excellent speedup for calculating Xd. For some problems, the speedup is so good that it is much higher than the number of threads.
For example, using eight threads, RSB achieves a speedup higher than 12. This result is possible because the baseline uses a row-based storage, while RSB applies a block storage for better data locality. Between RSB and MKL there is no clear winner. This observation is slightly different from [28], which shows that RSB is better than MKL. One reason might be that [28] experimented with matrices close to square, but ours here are not. Although OpenMP’s implementation is the simplest, it enjoys good speedup.
For X^T u with u = DXd, the speedup is in general worse than that for Xd.
For a few cases the difference is significant (e.g., RSB for covtype_binary). As in the situation for Xd, there is no clear winner between RSB and MKL. Of the two OpenMP implementations for X^T u, OpenMP-atomic completely fails to gain any speedup. Although we pointed out in Section 2.1.2 the possible delay caused by atomic operations, such poor results are not what we expected. In contrast, OpenMP-array gives good speedup and is sometimes even better than RSB or MKL. From this experiment we see that even for the simple implementation by OpenMP, a careful design of the loops can make a significant difference.
2.2.4 Detailed Analysis on Implementations using OpenMP
Following the dramatic difference between OpenMP-atomic and OpenMP-array for X^T(DXd), we learned that having a suitable implementation by OpenMP is non-trivial.
Therefore, we devote this subsection to a thorough investigation of some implementation issues.
First we discuss the two implementations that compute Xd and X^T(DXd) separately (Section 2.1.2) and together (Section 2.1.3). Note that OpenMP-array is used here because of its much better speedup than OpenMP-atomic. Figure 2.3 presents the speedup of the two settings. The approach of using (2.3) in Section 2.1.3 is better because the data instances x_1, ..., x_l are accessed once rather than twice. We already see an improvement in the one-core situation, so (2.3) is what should have been implemented in LIBLINEAR.
Interestingly, the gap between the two approaches widens as the number of threads increases. An explanation is that the doubled number of data accesses causes problems when more threads are accessing data at the same time. Based on this experiment, from now on when we refer to the OpenMP implementation, we mean (2.3) with the OpenMP-array setting.
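The difference between the two settings can be made concrete with a single-threaded sketch. It assumes that (2.3) has the row-wise form X^T D X d = sum_i x_i D_ii (x_i^T d), which is not restated in this section; the function names and toy data are hypothetical. The fused version reads each instance once, while the separate version makes two passes over the data.

```python
def xtdxd_fused(rows, D, d, n_features):
    """One pass: for each x_i, compute D_ii * (x_i^T d) and add it times x_i."""
    out = [0.0] * n_features
    for x_i, D_ii in zip(rows, D):
        z = sum(v * d[j] for j, v in x_i.items())  # x_i^T d
        coef = D_ii * z                            # u_i = D_ii * x_i^T d
        for j, v in x_i.items():
            out[j] += coef * v                     # accumulate u_i * x_i
    return out

def xtdxd_separate(rows, D, d, n_features):
    """Two passes: first u = D X d, then X^T u."""
    u = [D_ii * sum(v * d[j] for j, v in x_i.items())
         for x_i, D_ii in zip(rows, D)]
    out = [0.0] * n_features
    for x_i, u_i in zip(rows, u):
        for j, v in x_i.items():
            out[j] += u_i * v
    return out

# Hypothetical toy data: 2 instances, 2 features, diagonal of D as a list.
rows = [{0: 1.0, 1: 2.0}, {1: 1.0}]
D = [0.5, 2.0]
d = [1.0, 1.0]
fused = xtdxd_fused(rows, D, d, n_features=2)
separate = xtdxd_separate(rows, D, d, n_features=2)
```

Both functions return the same vector; in the fused version the second use of x_i follows immediately after the first, so the instance is likely still in cache.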
Next we investigate the implementation of OpenMP-array. We show in the Appendix that without suitable settings (in particular, the scheduling of OpenMP loops), the speedup can be poor.
2.2.5 Results on Newton Methods for Solving (2.1)
Figure 2.4 presents the speedup of the Newton method using various implementations for matrix-vector multiplications. Note that we consider end-to-end running time after excluding the initial data loading and the final model writing. Therefore, the construction time for transforming the row-based format to RSB is now included. This transformation does not occur for other approaches.
For RSB, results in Section 2.2.3 indicate that its speedup for Xd is excellent, but is poor for X^T u. Therefore, we consider a variant RSBt that generates and stores the RSB format of X^T at the beginning. Although more memory is used, we hope that the speedup for X^T u can be as good as that for Xd.
Results in Figure 2.4 show that OpenMP (the version presented in Section 2.1.3) is the best for almost all cases, while MKL comes second. RSB is worse than MKL mainly for the following reasons. First, from Figure 2.2, RSB is mostly worse than MKL for X^T u. Second, the construction time is not entirely negligible. Regarding RSBt, although we see better speedup for X^T u (details are not shown because of space limitations), the overall speed is worse than RSB for two-class problems. The reason is that the construction time is doubled. However, for multi-class problems, RSBt may become better than RSB. Because several binary problems on the same data instances are solved, the construction time for the RSB format of X^T has less impact on the total training time.
The speedup for KDD2010-b is worse than others because it has the lowest percentage of running time on matrix-vector multiplications (see Table 2.1).
We must point out that one factor in the superiority of OpenMP over the others is that it implements (2.3) by considering Xd and X^T u together. In this regard, future development and support of XX^T(·) type of operations in sparse matrix packages is an important direction.
2.2.6 Summary of Experiments
1. It is more difficult to gain speedup for X^T u than for Xd. Improving the speedup of X^T u should be a focus for the future development of programs for sparse matrix-vector multiplications.
2. With settings discussed in Section 2.2.4, simple implementations by OpenMP can achieve excellent speedup and beat existing state-of-the-art packages.
3 Parallel Dual Coordinate Descent Method
For Newton methods, we have shown in Chapter 2 that with careful implementations, excellent speedup can be achieved in a multi-core environment. Their success relies on parallelizable operations that involve all data together. As mentioned in Chapter 1, we are also interested in another important class of optimization methods, coordinate descent (CD) methods, which are an effective approach for large-scale linear classification. However, their sequential design makes parallelization difficult. In this chapter, we target multi-core environments and study parallel dual coordinate descent methods. Some details such as theoretical proofs and other experimental results can be found in the supplementary materials of [5] on our website: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear/.
Several works have proposed parallel extensions of dual CD methods (e.g., [15, 23, 34, 36, 37]), among which [15, 34, 36] focus more on multi-core environments. We can further categorize them into two types:
1. Mini-batch CD [34]. Each time a batch of instances is selected, and CD updates are applied to them in parallel.
2. Asynchronous CD [15, 36]. Threads independently update different coordinates in parallel. The convergence is often faster than that of synchronous algorithms, but sometimes the algorithm fails to converge.
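To see why the sequential design of dual CD is hard to parallelize, consider the following minimal sketch of a sequential dual CD method. It assumes the standard dual of the L1-loss linear SVM with w = sum_i y_i alpha_i x_i, which is not stated in this section; the function name and toy data are hypothetical, and this is not the parallel method proposed in this chapter. Each update reads the vector w that was modified by the previous update, which is exactly the dependency that mini-batch and asynchronous variants must relax.

```python
def dual_cd_l1svm(rows, y, C=1.0, epochs=10):
    """Sequential dual CD sketch for an L1-loss linear SVM (an assumed setting)."""
    n_features = 1 + max(j for x in rows for j in x)
    w = [0.0] * n_features
    alpha = [0.0] * len(rows)
    q_diag = [sum(v * v for v in x.values()) for x in rows]  # x_i^T x_i
    for _ in range(epochs):
        for i, x in enumerate(rows):
            if q_diag[i] == 0.0:
                continue
            # The gradient depends on the current w, i.e. on all earlier updates.
            grad = y[i] * sum(v * w[j] for j, v in x.items()) - 1.0
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j, v in x.items():
                    w[j] += delta * y[i] * v  # w changes before the next update
                alpha[i] = new_alpha
    return w, alpha

# Hypothetical, linearly separable toy problem.
rows = [{0: 1.0}, {0: 2.0, 1: 1.0}, {0: -1.0}, {0: -2.0, 1: -1.0}]
y = [1, 1, -1, -1]
w, alpha = dual_cd_l1svm(rows, y)
```

If two threads ran the inner update at the same time, each would compute its gradient from a w that misses the other's change, which is one way to see why asynchronous variants may lose convergence guarantees.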
In Section 3.1, we discuss the above approaches for parallel dual CD in detail and explain why they may be either inefficient or not robust. In Section 3.2, we propose a new and simple framework that can effectively take advantage of multi-core computation. Theoretical properties such as asymptotic convergence and finite termination under given stopping tolerances are provided in Section 3.3. In Section 3.4, we conduct thorough experiments and comparisons. Results show that our proposed method is robust and efficient.