There is no need to store a separate $X^T$.
However, threads working on $u_{i_1} x_{i_1}$ and $u_{i_2} x_{i_2}$ may want to update the same component $\bar{u}_s$ at the same time:
for $i = 1, \ldots, l$ do in parallel
    for $(x_i)_s \neq 0$ do
        $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
Atomic Operations for Parallel $X^T u$
An atomic operation prevents other threads from writing $\bar{u}_s$ at the same time.
for $i = 1, \ldots, l$ do in parallel
    for $(x_i)_s \neq 0$ do
        atomic: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
However, waiting time can be a serious problem
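The atomic loop above might be written with OpenMP roughly as follows. This is a minimal sketch under assumptions, not LIBLINEAR's actual code: the `Entry` and `SparseRow` types and the function name `xt_u_atomic` are hypothetical names chosen for illustration.

```cpp
#include <vector>

// Hypothetical sparse-row layout for illustration: (column index, value) pairs.
struct Entry { int col; double val; };
using SparseRow = std::vector<Entry>;

// Computes ubar = X^T u for a matrix X with rows x_1..x_l and n columns.
std::vector<double> xt_u_atomic(const std::vector<SparseRow>& X,
                                const std::vector<double>& u, int n) {
    std::vector<double> ubar(n, 0.0);
    double* out = ubar.data();
    const long l = static_cast<long>(X.size());

    #pragma omp parallel for
    for (long i = 0; i < l; ++i) {
        for (const Entry& e : X[i]) {
            // Threads handling different rows may hit the same column s,
            // so each update of ubar[s] is made atomic.
            #pragma omp atomic
            out[e.col] += u[i] * e.val;
        }
    }
    return ubar;
}
```

Every nonzero triggers one atomic update, which is where the waiting time comes from when many rows share the same columns.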
Reduce Operations for Parallel $X^T u$
Another method is to use temporary dense arrays, one maintained by each thread, and to sum them up at the end.
That is, store
$$\hat{u}^p = \sum_i \{u_i x_i \mid i \text{ run by thread } p\}$$
and then
$$\bar{u} = \sum_p \hat{u}^p$$
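The reduce approach can be sketched as below. The sketch assumes a fixed split of the rows into `P` contiguous blocks, one dense buffer per block; the types, the function name `xt_u_reduce`, and the parameter `P` are illustrative assumptions rather than names from the original implementation.

```cpp
#include <vector>

// Hypothetical sparse-row layout for illustration: (column index, value) pairs.
struct Entry { int col; double val; };
using SparseRow = std::vector<Entry>;

// Computes ubar = X^T u using P private dense buffers uhat[p], then sums them.
std::vector<double> xt_u_reduce(const std::vector<SparseRow>& X,
                                const std::vector<double>& u, int n, int P) {
    const long l = static_cast<long>(X.size());
    std::vector<std::vector<double>> uhat(P, std::vector<double>(n, 0.0));

    // Block p accumulates u_i * x_i over its own rows only,
    // so no two threads ever write to the same buffer.
    #pragma omp parallel for
    for (int p = 0; p < P; ++p) {
        const long begin = l * p / P;
        const long end = l * (p + 1) / P;
        for (long i = begin; i < end; ++i)
            for (const Entry& e : X[i])
                uhat[p][e.col] += u[i] * e.val;
    }

    // Final reduction: ubar = sum over p of uhat[p].
    // Each component s is written by exactly one thread.
    std::vector<double> ubar(n, 0.0);
    #pragma omp parallel for
    for (int s = 0; s < n; ++s)
        for (int p = 0; p < P; ++p)
            ubar[s] += uhat[p][s];
    return ubar;
}
```

The cost of this design is the extra memory for `P` dense arrays of length `n` and the final summation, in exchange for removing all synchronization from the inner loop.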
Atomic Operation: Almost No Speedup
Reduce operations are superior to atomic operations
[Figure: speedup versus number of threads (1 to 12), comparing OMP-array and OMP-atomic. Panels: rcv1 binary, covtype binary.]
Subsequently we use the reduce operations.
Existing Algorithms for Sparse Matrix-vector Product
Instead of parallelizing the loops directly as above, in the next slides we consider two existing methods.
Recursive Sparse Blocks (Martone, 2014)
RSB (Recursive Sparse Blocks) is an effective format for fast parallel sparse matrix-vector multiplications.
It recursively partitions a matrix into blocks, as illustrated in the figure.
Locality of memory references is improved, but the construction time is not negligible.
Recursive Sparse Blocks (Cont’d)
Parallel, efficient sparse matrix-vector operations
Improves locality of memory references
But the initial construction time is equivalent to about 20 multiplications, which is not negligible in some cases
We will show the result in the experiments
Intel MKL
Intel Math Kernel Library (MKL) is a commercial library including optimized routines for linear algebra (Intel)
It supports fast matrix-vector multiplications for different sparse formats.
We consider the row-oriented format to store $X$.
Experiments
Baseline: Single-core version in LIBLINEAR 1.96. It sequentially runs the following operations:
    $u = Xd$
    $u \leftarrow Du$
    $\bar{u} = X^T u$, where $u = DXd$
OpenMP: Use OpenMP to parallelize loops
MKL: Intel MKL version 11.2
RSB: librsb version 1.2.0
Speedup of $Xd$: All are Excellent
[Figure: speedup of $Xd$. Panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass.]
More Difficult to Speed up $X^T u$
[Figure: speedup of $X^T u$. Panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass.]
Indeed it’s not easy to have a multi-core implementation that is
1 simple, and
2 reasonably efficient
Let me describe what we eventually did in multi-core LIBLINEAR.
Reducing Memory Access to Improve Speedup
In computing $Xd$ and $X^T(DXd)$, the data matrix is accessed twice.
We notice that these two operations can be combined:
$$X^T D X d = \sum_{i=1}^{l} x_i D_{ii} x_i^T d$$
We can parallelize one single loop by OpenMP
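A sketch of the combined loop, in which each row $x_i$ is read once and used for both the inner product $x_i^T d$ and the update with $x_i$. It assumes the per-block dense buffers of the reduce approach; the types, the function name `xt_d_x_d`, and the block count `P` are hypothetical, and `D` holds only the diagonal entries $D_{ii}$.

```cpp
#include <vector>

// Hypothetical sparse-row layout for illustration: (column index, value) pairs.
struct Entry { int col; double val; };
using SparseRow = std::vector<Entry>;

// Computes X^T D X d = sum_i x_i * D_ii * (x_i^T d) in one pass over the rows.
std::vector<double> xt_d_x_d(const std::vector<SparseRow>& X,
                             const std::vector<double>& D,
                             const std::vector<double>& d, int P) {
    const long l = static_cast<long>(X.size());
    const int n = static_cast<int>(d.size());
    std::vector<std::vector<double>> uhat(P, std::vector<double>(n, 0.0));

    #pragma omp parallel for
    for (int p = 0; p < P; ++p) {
        const long begin = l * p / P;
        const long end = l * (p + 1) / P;
        for (long i = begin; i < end; ++i) {
            // First use of row i: the scalar D_ii * (x_i^T d).
            double z = 0.0;
            for (const Entry& e : X[i]) z += e.val * d[e.col];
            z *= D[i];
            // Second use of row i while it is likely still in cache.
            for (const Entry& e : X[i]) uhat[p][e.col] += z * e.val;
        }
    }

    // Sum the per-block buffers into the final result.
    std::vector<double> result(n, 0.0);
    for (int p = 0; p < P; ++p)
        for (int s = 0; s < n; ++s)
            result[s] += uhat[p][s];
    return result;
}
```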
Reducing Memory Access to Improve Speedup (Cont’d)
Better speedup as memory accesses are reduced
[Figure: speedup versus number of threads (1 to 12), comparing Combined and Separated. Panels: rcv1 binary, covtype binary.]
The number of operations is the same, but memory access dramatically affects the idle time of threads
Reducing Memory Access to Improve Speedup (Cont’d)
Therefore, if we can efficiently compute
$$\sum_{i=1}^{l} x_i D_{ii} x_i^T d \qquad (8)$$
then we probably do not need sophisticated sparse matrix-vector packages.
However, even for a simple operation like (8), careful implementations are still needed.
OpenMP Scheduling
An OpenMP loop assigns tasks to different threads.
The default schedule(static) splits the indices into $P$ blocks, each containing $l/P$ elements.
However, as tasks may be unbalanced, we can use dynamic scheduling: an available thread is assigned the next task.
For example, schedule(dynamic,256) implies that a thread works on 256 elements each time.
Unfortunately, overheads occur for the dynamic task assignment.
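As an illustration of the schedule clause, the sketch below computes $u = Xd$ row by row with chunks of 256 rows handed out dynamically. The types and the function name `x_times_d` are hypothetical; the chunk size 256 is taken from the example above, not a tuned value.

```cpp
#include <vector>

// Hypothetical sparse-row layout for illustration: (column index, value) pairs.
struct Entry { int col; double val; };
using SparseRow = std::vector<Entry>;

// Computes u = X d, one inner product per row.
std::vector<double> x_times_d(const std::vector<SparseRow>& X,
                              const std::vector<double>& d) {
    const long l = static_cast<long>(X.size());
    std::vector<double> u(l, 0.0);

    // Rows may have very different numbers of nonzeros. With dynamic
    // scheduling, a thread that finishes its chunk of 256 rows early
    // fetches the next chunk instead of idling.
    #pragma omp parallel for schedule(dynamic, 256)
    for (long i = 0; i < l; ++i) {
        double z = 0.0;
        for (const Entry& e : X[i]) z += e.val * d[e.col];
        u[i] = z;
    }
    return u;
}
```

Replacing the clause with `schedule(static)` gives the default block split; a smaller chunk size improves balance but increases the assignment overhead.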
OpenMP Scheduling (Cont’d)
Deciding suitable scheduling is not trivial.
Consider implementing $X^T u$ as an example. This operation involves the following three loops.
1 Initializing $\hat{u}^p = 0, \forall p = 1, \ldots, P$.
2 Calculating $\hat{u}^p, \forall p$ by $\hat{u}^p = \sum \{u_i x_i \mid i \text{ run by thread } p\}$
3 Calculating $\bar{u} = \sum_{p=1}^{P} \hat{u}^p$.
OpenMP Scheduling (Cont’d)
• Consider the second step

                          covtype binary   rcv1 binary
schedule(static)          0.2879           2.9387
schedule(dynamic)         1.2611           2.6084
schedule(dynamic, 256)    0.2558           1.6505

• Clearly, a suitable scheduling is essential
• The other two steps are more balanced, so schedule(static) is used (details omitted)
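Putting the three loops together with these scheduling choices might look like the sketch below. It is an assumption-laden illustration, not the LIBLINEAR source: the types and the function name `xt_u_three_steps` are hypothetical, buffers are indexed by `omp_get_thread_num()`, and serial fallbacks are defined so the code also builds without OpenMP.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds when OpenMP is not enabled.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical sparse-row layout for illustration: (column index, value) pairs.
struct Entry { int col; double val; };
using SparseRow = std::vector<Entry>;

std::vector<double> xt_u_three_steps(const std::vector<SparseRow>& X,
                                     const std::vector<double>& u, int n) {
    const int P = omp_get_max_threads();
    const long l = static_cast<long>(X.size());
    const long total = static_cast<long>(P) * n;
    // P dense buffers stored back to back: buffer p starts at offset p * n.
    std::vector<double> uhat(static_cast<std::size_t>(total));
    double* buf = uhat.data();

    // Step 1: initialize all buffers (balanced work, so static).
    // std::vector already zero-fills; the explicit loop mirrors the case
    // where the buffers are reused across many products.
    #pragma omp parallel for schedule(static)
    for (long k = 0; k < total; ++k) buf[k] = 0.0;

    // Step 2: each thread adds u_i * x_i into its own buffer
    // (unbalanced rows, so dynamic with chunks of 256).
    #pragma omp parallel for schedule(dynamic, 256)
    for (long i = 0; i < l; ++i) {
        double* mine = buf + static_cast<long>(omp_get_thread_num()) * n;
        for (const Entry& e : X[i]) mine[e.col] += u[i] * e.val;
    }

    // Step 3: sum the P buffers component by component (balanced, so static).
    std::vector<double> ubar(n, 0.0);
    double* out = ubar.data();
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < n; ++s) {
        double sum = 0.0;
        for (int p = 0; p < P; ++p) sum += buf[static_cast<long>(p) * n + s];
        out[s] = sum;
    }
    return ubar;
}
```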
Speedup of Total Training Time
[Figure: speedup of total training time. Panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass.]
Analysis of Experimental Results
For RSB, the speedup for $Xd$ is excellent, but it is poor for $X^T u$ on some $n \ll l$ data (e.g., covtype). Furthermore, construction time is expensive.
OpenMP is the best in almost all cases, mainly because of combining $Xd$ and $X^T u$ together.
Therefore, with appropriate settings, simple implementations by OpenMP can achieve excellent speedup.
Outline
1 Introduction: why are optimization and machine learning related?
2 Optimization methods for kernel support vector machines
    Decomposition methods
3 Optimization methods for linear classification
    Decomposition method
    Newton methods
    Experiments
4 Multi-core implementation
5 Discussion and conclusions
Conclusions
Optimization has been very useful for machine learning
We must incorporate machine learning knowledge in designing suitable optimization algorithms and software
The interaction between optimization and machine learning is very interesting and exciting.
References I
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
W.-L. Chiang, M.-C. Lee, and C.-J. Lin. Parallel dual coordinate descent method for large-scale linear classification in multi-core environments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016. URL http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_cddual.pdf.
C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.
R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889–1918, 2005. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf.
References II
T. Glasmachers and U. Dogan. Accelerated coordinate descent with adaptive coordinate frequencies. In Proceedings of the 5th Asian Conference on Machine Learning, volume 29 of Proceedings of Machine Learning Research, pages 72–86, 2013.
C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
Intel. Intel Math Kernel Library Reference Manual.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.
S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent, 2016. arXiv preprint arXiv:1607.08320.
References III
M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2015. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_liblinear_icdm.pdf.
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.
O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913–929, 2002.
J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse
matrix–vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.
E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–136, 1997.