There is no need to store a separate $X^T$
However, it is possible that threads working on $u_{i_1} x_{i_1}$ and $u_{i_2} x_{i_2}$ want to update the same component $\bar{u}_s$ at the same time:
for $i = 1, \ldots, l$ do in parallel
    for $s$ with $(x_i)_s \neq 0$ do
        $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
Atomic Operations for Parallel $X^T u$
An atomic operation prevents other threads from writing $\bar{u}_s$ at the same time.
for $i = 1, \ldots, l$ do in parallel
    for $s$ with $(x_i)_s \neq 0$ do
        atomic: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
However, waiting time can be a serious problem
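For concreteness, here is a minimal C++/OpenMP sketch of the atomic version, assuming $X$ is stored in a compressed sparse row (CSR) layout; the arrays row_ptr, col_idx, and val are illustrative, not LIBLINEAR's actual data structures:

```cpp
#include <vector>

// Sketch of the atomic version of X^T u. Row i of X occupies
// [row_ptr[i], row_ptr[i+1]) of col_idx/val (illustrative CSR layout).
void XTu_atomic(const std::vector<int>& row_ptr,
                const std::vector<int>& col_idx,
                const std::vector<double>& val,
                const std::vector<double>& u,   // length l
                std::vector<double>& ubar)      // length n, pre-zeroed
{
    const int l = (int)u.size();
    #pragma omp parallel for
    for (int i = 0; i < l; i++)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            // The atomic update stops two threads from writing the same
            // component ubar[s] at once, at the cost of waiting.
            #pragma omp atomic
            ubar[col_idx[j]] += u[i] * val[j];
        }
}
```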
Reduce Operations for Parallel $X^T u$
Another method is to use temporary arrays maintained by each thread, and to sum them up in the end
That is, store
$$\hat{u}^p = \sum \{u_i x_i \mid i \text{ run by thread } p\}$$
and then
$$\bar{u} = \sum_p \hat{u}^p$$
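A matching C++/OpenMP sketch of this reduce approach, under the same illustrative CSR layout as before: each thread fills its own array $\hat{u}^p$, and the arrays are summed at the end:

```cpp
#include <omp.h>
#include <vector>

// Sketch of the reduce version of X^T u: per-thread temporary arrays
// instead of atomic updates (same illustrative CSR layout as before).
void XTu_reduce(const std::vector<int>& row_ptr,
                const std::vector<int>& col_idx,
                const std::vector<double>& val,
                const std::vector<double>& u,   // length l
                std::vector<double>& ubar)      // length n, pre-zeroed
{
    const int l = (int)u.size(), n = (int)ubar.size();
    const int P = omp_get_max_threads();
    std::vector<std::vector<double>> uhat(P, std::vector<double>(n, 0.0));
    #pragma omp parallel
    {
        std::vector<double>& mine = uhat[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < l; i++)   // uhat^p = sum of u_i x_i run by thread p
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                mine[col_idx[j]] += u[i] * val[j];
    }
    for (int p = 0; p < P; p++)       // ubar = sum over p of uhat^p
        for (int s = 0; s < n; s++)
            ubar[s] += uhat[p][s];
}
```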
Atomic Operation: Almost No Speedup
Reduce operations are superior to atomic operations
[Figure: speedup of $X^T u$ versus number of threads (1 to 12) on rcv1_binary and covtype_binary; OMP-array (the reduce approach) achieves good speedup, while OMP-atomic gives almost none]
Subsequently we use the reduce operations
Existing Algorithms for Sparse Matrix-vector Product
This has always been an important research issue in numerical analysis
Besides our direct implementation that parallelizes loops, in the next slides we consider two existing methods
Recursive Sparse Blocks (Martone, 2014)
RSB (Recursive Sparse Blocks) is an effective format for fast parallel sparse matrix-vector multiplications
It recursively partitions a matrix into sparse blocks
Locality of memory references is improved, but the construction time is not negligible
Recursive Sparse Blocks (Cont’d)
Parallel, efficient sparse matrix-vector operations
Improved locality of memory references
But the initial construction time is about that of 20 matrix-vector multiplications, which is not negligible in some cases
We will show the results in the experiment part
Intel MKL
Intel Math Kernel Library (MKL) is a commercial library including optimized routines for linear algebra (Intel)
It supports fast matrix-vector multiplications for different sparse formats.
We consider the row-oriented format to store $X$.
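As an illustration only (a hedged sketch, not LIBLINEAR's code), computing $Xd$ through MKL's inspector-executor sparse BLAS might look as follows; the MKL 11.2 release used in the experiments also offered equivalent routines:

```cpp
#include <mkl_spblas.h>

// Hedged sketch: Xd via MKL's inspector-executor sparse BLAS, with X
// in 0-based CSR. The array names are illustrative.
void mkl_Xd(int l, int n, int* row_ptr, int* col_idx, double* val,
            const double* d,  // length n
            double* out)      // length l, receives Xd
{
    sparse_matrix_t X;
    mkl_sparse_d_create_csr(&X, SPARSE_INDEX_BASE_ZERO, l, n,
                            row_ptr, row_ptr + 1, col_idx, val);
    matrix_descr descr;
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;
    // out = 1.0 * X * d + 0.0 * out; passing SPARSE_OPERATION_TRANSPOSE
    // instead would compute X^T d.
    mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, X, descr,
                    d, 0.0, out);
    mkl_sparse_destroy(X);
}
```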
Outline
5 Multi-core linear classification
Parallel matrix-vector multiplications
Experiments
Experiments
Baseline: single-core version in LIBLINEAR 1.96
OpenMP: parallelizing loops by OpenMP
MKL: Intel MKL version 11.2
RSB: librsb version 1.2.0
Speedup of $Xd$: All are Excellent
[Figure: speedup of $Xd$ on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
More Difficult to Speed up $X^T u$
[Figure: speedup of $X^T u$ on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
Reducing Memory Access to Improve Speedup
In computing $Xd$ and $X^T(DXd)$, the data matrix is accessed twice
We notice that these two operations can be combined together:
$$X^T D X d = \sum_{i=1}^{l} x_i D_{ii} x_i^T d$$
We can then parallelize one single loop by OpenMP
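A serial C++ sketch of the combined operation (illustrative CSR layout again): each instance contributes $x_i D_{ii} x_i^T d$, so row $i$ of $X$ is read once; parallelizing the outer loop with OpenMP plus the per-thread reduction arrays shown earlier gives the combined implementation:

```cpp
#include <vector>

// Sketch of the combined X^T D X d: per instance, the inner product
// x_i^T d, then the scaled scatter D_ii * (x_i^T d) * x_i. Row i of X
// is read once instead of twice (once for Xd, once for X^T(DXd)).
void XTDXd_combined(const std::vector<int>& row_ptr,
                    const std::vector<int>& col_idx,
                    const std::vector<double>& val,
                    const std::vector<double>& D,   // diagonal of D, length l
                    const std::vector<double>& d,   // length n
                    std::vector<double>& out)       // length n, pre-zeroed
{
    const int l = (int)D.size();
    for (int i = 0; i < l; i++) {       // the single loop to parallelize
        double xid = 0.0;               // x_i^T d
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            xid += val[j] * d[col_idx[j]];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            out[col_idx[j]] += D[i] * xid * val[j];
    }
}
```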
Reducing Memory Access to Improve Speedup (Cont’d)
Better speedup as memory accesses are reduced
[Figure: speedup of the combined versus separated implementations on rcv1_binary and covtype_binary; the combined approach scales better]
The number of operations is the same, but memory access dramatically affects the idle time of threads
OpenMP Scheduling
An OpenMP loop assigns tasks to different threads.
The default schedule(static) splits the indices into $P$ blocks, each containing $l/P$ elements.
However, as tasks may be unbalanced, we can use dynamic scheduling: available threads are assigned the next tasks.
For example, schedule(dynamic,256) implies that a thread works on 256 elements each time.
Unfortunately, dynamic task assignment incurs overhead.
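A self-contained toy example of the clause; work() below is a hypothetical stand-in for the per-instance computation:

```cpp
#include <cstdio>

double work(int i) { return 0.5 * i; }  // hypothetical per-task cost

int main() {
    const int l = 100000;
    double sum = 0.0;
    // schedule(dynamic, 256): an idle thread grabs the next chunk of
    // 256 indices, trading load balance against assignment overhead.
    #pragma omp parallel for schedule(dynamic, 256) reduction(+ : sum)
    for (int i = 0; i < l; i++)
        sum += work(i);
    std::printf("sum = %g\n", sum);
    return 0;
}
```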
OpenMP Scheduling (Cont’d)
Deciding a suitable scheduling is not trivial.
Consider implementing $X^T u$ as an example. This operation involves the following three loops.
1. Initializing $\hat{u}^p = 0$, $\forall p = 1, \ldots, P$
2. Calculating $\hat{u}^p$, $\forall p$, by $\hat{u}^p = \sum \{u_i x_i \mid i \text{ run by thread } p\}$
3. Calculating $\bar{u} = \sum_{p=1}^{P} \hat{u}^p$
OpenMP Scheduling (Cont’d)
• Consider the second step. Running time (in seconds):

                         covtype_binary   rcv1_binary
  schedule(static)           0.2879          2.9387
  schedule(dynamic)          1.2611          2.6084
  schedule(dynamic, 256)     0.2558          1.6505
• Clearly, a suitable scheduling is essential
• The other two steps are more balanced, so schedule(static) is used (details omitted)
Speedup of Total Training Time
[Figure: speedup of total training time on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
Analysis of Experimental Results
For RSB, the speedup for $Xd$ is excellent, but it is poor for $X^T u$ on some $n \ll l$ data (e.g., covtype). Furthermore, the construction time is expensive
OpenMP is the best for almost all cases, mainly because of combining $Xd$ and $X^T u$ together
Therefore, with appropriate settings, simple implementations by OpenMP can achieve excellent speedup
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Multi-core linear classification
6 Distributed linear classification
7 Discussion and conclusions
Outline
6 Distributed linear classification
Distributed matrix-vector multiplications
Experiments
Data in a Distributed System
When data are too large, we may, for example, let each node store a subset of the instances
[Figure: $X$ partitioned by instances into $X_1, X_2, \ldots, X_P$, stored on nodes $1, 2, \ldots, P$]
We would like to compute the Hessian-vector product $\nabla^2 f(w)d$ in a distributed way; its main cost is $X^T D X d$
Parallel Hessian-vector Product
As in the shared-memory situation, we have
$$X^T D X d = X_1^T D_1 X_1 d + \cdots + X_P^T D_P X_P d$$
We let each node calculate $X_p^T D_p X_p d$ and sum the resulting vectors together
This is a reduce operation. We have used similar techniques for multi-core situations
Master-slave or Master-master
Master-slave: only the master gets $X^T D X d$ and runs the whole Newton method
Master-master: every node gets $X^T D X d$; then each node has all the information to finish the Newton method
Here we consider a master-master setting.
One reason is that for master-slave, the implementations on the master and the slaves are different
This is different from multi-core situations, where only one master copy is run
Allreduce Operation
We let every node get $X^T D X d$
[Figure: each of three nodes holds $d$ and computes its local product $X_1^T D_1 X_1 d$, $X_2^T D_2 X_2 d$, or $X_3^T D_3 X_3 d$; after ALLREDUCE, every node holds $X^T D X d$]
Allreduce: reducing all vectors ($X_i^T D_i X_i d$, $\forall i$) to a single vector ($X^T D X d \in R^n$) and then sending the result to every node
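A minimal MPI sketch of this step; the names are illustrative, with local_Hd holding this node's $X_p^T D_p X_p d$:

```cpp
#include <mpi.h>
#include <vector>

// Sum the per-node vectors X_p^T D_p X_p d across all nodes and leave
// the total X^T D X d (length n) on every node.
std::vector<double> allreduce_Hv(const std::vector<double>& local_Hd)
{
    std::vector<double> Hd(local_Hd.size());
    // Every node receives the elementwise sum over all nodes.
    MPI_Allreduce(local_Hd.data(), Hd.data(), (int)local_Hd.size(),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return Hd;
}
```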
Instance-wise and Feature-wise Data Splits
Instead of storing a subset of data at each node, we can store a subset of features
[Figure: instance-wise split $X_{iw,1}, X_{iw,2}, X_{iw,3}$ (blocks of rows) versus feature-wise split $X_{fw,1}, X_{fw,2}, X_{fw,3}$ (blocks of columns)]
Instance-wise and Feature-wise Data Splits (Cont’d)
Feature-wise: each machine calculates part of the Hessian-vector product
$$(\nabla^2 f(w)d)_{fw,1} = d_1 + C X_{fw,1}^T D (X_{fw,1} d_1 + \cdots + X_{fw,P} d_P)$$
$X_{fw,1} d_1 + \cdots + X_{fw,P} d_P \in R^l$ must be available on all nodes (by allreduce)
Because
$$X_p^T D_p X_p d : O(n), \qquad X_{fw,p} d_p : O(l),$$
the amount of data moved per Hessian-vector product is
Instance-wise: $O(n)$; Feature-wise: $O(l)$
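A hedged MPI sketch of the feature-wise product on one node, with $X_{fw,p}$ ($l$ rows, this node's $n_p$ columns) in an illustrative CSR layout whose column indices are local to the node:

```cpp
#include <mpi.h>
#include <vector>

// Sketch of (nabla^2 f(w) d)_{fw,p} = d_p + C X_{fw,p}^T D (sum_q X_{fw,q} d_q)
// on node p. The CSR arrays describe X_{fw,p} with node-local columns.
std::vector<double> Hd_feature_wise(const std::vector<int>& row_ptr,
                                    const std::vector<int>& col_idx,
                                    const std::vector<double>& val,
                                    const std::vector<double>& D,   // diag(D), length l
                                    const std::vector<double>& dp,  // d_p, length n_p
                                    double C)
{
    const int l = (int)D.size();
    std::vector<double> local(l, 0.0), Xd(l);
    for (int i = 0; i < l; i++)     // local part X_{fw,p} d_p
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            local[i] += val[j] * dp[col_idx[j]];
    // Allreduce of a length-l vector: O(l) data moved per product.
    MPI_Allreduce(local.data(), Xd.data(), l, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    std::vector<double> Hd(dp);     // start from d_p
    for (int i = 0; i < l; i++)     // add C X_{fw,p}^T D (Xd)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            Hd[col_idx[j]] += C * val[j] * D[i] * Xd[i];
    return Hd;
}
```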
Outline
6 Distributed linear classification
Distributed matrix-vector multiplications
Experiments
Experiments
We compare
TRON: Newton method
ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
VW: Vowpal Wabbit (Langford et al., 2007)
TRON and ADMM are implemented by MPI
Details in Zhuang et al. (2015)
Experiments (Cont’d)
[Figure: relative function value difference versus training time (sec.) on epsilon and webspam, comparing VW, ADMM-FW, ADMM-IW, TRON-FW, and TRON-IW]
32 machines are used
Horizontal line: the point where test accuracy has stabilized
Instance-wise and feature-wise splits are useful for $l \gg n$ and $l \ll n$ data, respectively
Experiments (Cont’d)
We have seen that communication cost is a big concern
In terms of running time, multi-core implementation is often faster
However, besides running time, data preparation and loading are also issues
Overall we see that distributed training is a complicated issue
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Multi-core linear classification
6 Distributed linear classification
7 Discussion and conclusions
Outline
7 Discussion and conclusions
Some resources
Conclusions
Software
• Most materials in this talk are based on our experiences in developing two popular software packages
• Kernel: LIBSVM (Chang and Lin, 2011)
http://www.csie.ntu.edu.tw/~cjlin/libsvm
• Linear: LIBLINEAR (Fan et al., 2008)
http://www.csie.ntu.edu.tw/~cjlin/liblinear
• See also a survey on linear classification in Yuan et al. (2012)
Multi-core LIBLINEAR
An extension of the software LIBLINEAR
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear
This is based on the study in Lee et al. (2015)
We already have many users. For example, one user from USC used this tool to reduce his training time from over 30 hours to 5 hours
Distributed LIBLINEAR
An extension of the software LIBLINEAR
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)
Outline
7 Discussion and conclusions
Some resources
Conclusions
Conclusions
Linear classification is an old topic, but recently there have been new and interesting applications
Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others
Linear classification will continue to be used in situations ranging from small-model to big-data applications
Acknowledgments
Many students have contributed to our research on large-scale linear classification
We also thank the partial support from the Ministry of Science and Technology in Taiwan
References I
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.
References II
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
Intel. Intel Math Kernel Library Reference Manual.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.
S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007. https://github.com/JohnLangford/vowpal_wabbit/wiki.
M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_liblinear_icdm.pdf.
References III
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.
C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin. Large-scale logistic regression and linear support vector machines using Spark. In Proceedings of the IEEE International Conference on Big Data, pages 519–528, 2014. URL http://www.csie.ntu.edu.tw/~cjlin/papers/spark-liblinear/spark-liblinear.pdf.
M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.
E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–136, 1997.
J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, Cambridge, MA, 1998. MIT Press.
T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pages 629–637. 2013.