There is no need to store a separate $X^T$

However, it is possible that the threads working on $u_{i_1} x_{i_1}$ and
$u_{i_2} x_{i_2}$ want to update the same component $\bar{u}_s$ at the
same time:

1: for $i = 1, \ldots, l$ do in parallel

2: for $(x_i)_s \neq 0$ do

3: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$

4: end for

5: end for
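For reference, the loop above can be sketched sequentially. This is a minimal illustration, not the LIBLINEAR code: Python stands in for the C/OpenMP loop, and the function name `xt_times_u` and the row layout (a list of `(s, value)` pairs per instance) are assumptions made for this example.

```python
def xt_times_u(rows, u, n):
    """Compute u_bar = X^T u by scanning the rows of X; X^T is never stored.

    rows[i] holds the nonzeros of x_i as (s, value) pairs (hypothetical layout).
    """
    u_bar = [0.0] * n
    for i, row in enumerate(rows):
        for s, value in row:
            # Different rows i may hit the same component s.
            u_bar[s] += u[i] * value
    return u_bar


rows = [[(0, 1.0), (2, 2.0)], [(2, 3.0)], [(1, 4.0), (2, 5.0)]]
u = [1.0, 2.0, 3.0]
print(xt_times_u(rows, u, 3))  # [1.0, 12.0, 23.0]
```

All three rows contribute to component 2, which is exactly the situation in which parallel threads would collide.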

### Atomic Operations for Parallel $X^T u$

An atomic operation prevents other threads from
writing $\bar{u}_s$ at the same time.

1: for $i = 1, \ldots, l$ do in parallel

2: for $(x_i)_s \neq 0$ do

3: atomic: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$

4: end for

5: end for

However, the time threads spend waiting for one another can be a serious problem
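The idea of a guarded update can be sketched as follows. This is only an illustration under stated assumptions: Python has no OpenMP-style atomic, so a single `threading.Lock` stands in for it (coarser than a per-component atomic), and the function name and row layout are hypothetical. Python threads also do not run this loop truly in parallel, so the sketch shows the structure, not the performance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def xt_times_u_locked(rows, u, n, num_threads=4):
    """Compute X^T u with rows handled by a thread pool and guarded updates."""
    u_bar = [0.0] * n
    lock = threading.Lock()  # stand-in for the atomic update (assumption)

    def process_row(i):
        for s, value in rows[i]:
            with lock:  # only one thread may update u_bar at a time
                u_bar[s] += u[i] * value

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(process_row, range(len(rows))))
    return u_bar


rows = [[(0, 1.0), (2, 2.0)], [(2, 3.0)], [(1, 4.0), (2, 5.0)]]
u = [1.0, 2.0, 3.0]
print(xt_times_u_locked(rows, u, 3))  # [1.0, 12.0, 23.0]
```

Every update passes through the guard, so threads that touch shared data serialize on it; this is the waiting time referred to above.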

### Reduce Operations for Parallel $X^T u$

Another method is to let each thread maintain a temporary dense array, and to sum these arrays up at the end

That is, store

$$\hat{u}^p = \sum_i \{ u_i x_i \mid i \text{ run by thread } p \}$$

and then

$$\bar{u} = \sum_p \hat{u}^p$$
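A minimal sketch of this reduce approach, assuming a contiguous split of the rows among threads and the same hypothetical row layout of `(s, value)` pairs; Python threads stand in for OpenMP threads, and the function name is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def xt_times_u_reduce(rows, u, n, num_threads=4):
    """Compute X^T u with one private dense array per thread, then sum them."""
    l = len(rows)
    # Contiguous blocks of row indices, one block per thread (assumed split).
    bounds = [l * p // num_threads for p in range(num_threads + 1)]

    def partial_sum(p):
        u_hat = [0.0] * n  # private array of thread p: no conflicts
        for i in range(bounds[p], bounds[p + 1]):
            for s, value in rows[i]:
                u_hat[s] += u[i] * value
        return u_hat

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial_sum, range(num_threads)))

    # Final reduction: u_bar is the component-wise sum of the private arrays.
    return [sum(column) for column in zip(*partials)]


rows = [[(0, 1.0), (2, 2.0)], [(2, 3.0)], [(1, 4.0), (2, 5.0)]]
u = [1.0, 2.0, 3.0]
print(xt_times_u_reduce(rows, u, 3, num_threads=2))  # [1.0, 12.0, 23.0]
```

The price of avoiding conflicts is memory: the sketch keeps `num_threads` dense arrays of length `n` alive until the final sum.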

### Atomic Operation: Almost No Speedup

Reduce operations are superior to atomic operations

[Figure: speedup versus number of threads (1 to 12), comparing OMP-array and OMP-atomic; panels: rcv1 binary and covtype binary]

Subsequently we use the reduce operations

### Existing Algorithms for Sparse Matrix-vector Product

Instead of directly parallelizing the loops ourselves, in the next slides we consider two existing methods

### Recursive Sparse Blocks (Martone, 2014)

RSB (Recursive Sparse Blocks) is an effective format for fast parallel sparse matrix-vector multiplications. It recursively partitions a matrix, as shown in the figure

Locality of memory references improved, but the construction time is not negligible

### Recursive Sparse Blocks (Cont’d)

Parallel, efficient sparse matrix-vector operations

Improve locality of memory references

But the initial construction time is about that of 20 multiplications, which is not negligible in some cases

We will show the result in the experiments

### Intel MKL

Intel Math Kernel Library (MKL) is a commercial library including optimized routines for linear algebra (Intel)

It supports fast matrix-vector multiplications for different sparse formats.

We consider the row-oriented format to store $X$.

### Experiments

Baseline: single-core version in LIBLINEAR 1.96. It sequentially runs the following operations:

$$u = Xd, \qquad u \leftarrow Du, \qquad \bar{u} = X^T u, \text{ where } u = DXd$$

OpenMP: use OpenMP to parallelize loops

MKL: Intel MKL version 11.2

RSB: librsb version 1.2.0

### Speedup of $Xd$: All Are Excellent

[Figure: speedup of $Xd$; panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass]

### More Difficult to Speed up $X^T u$

[Figure: speedup of $X^T u$; panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass]

Indeed it’s not easy to have a multi-core implementation that is

1. simple, and
2. reasonably efficient

Let me describe what we do in the end in multi-core LIBLINEAR

### Reducing Memory Access to Improve Speedup

In computing

$$Xd \quad \text{and} \quad X^T (DXd)$$

the data matrix is accessed twice

We notice that these two operations can be combined together

$$X^T D X d = \sum_{i=1}^{l} x_i D_{ii} x_i^T d$$

We can parallelize one single loop by OpenMP
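The combination can be illustrated with a small sequential sketch. It assumes the diagonal of $D$ is stored as a plain list `D` and that rows are lists of `(s, value)` pairs; both function names are hypothetical. The combined version reads each $x_i$ once per outer iteration, while the separated version makes two passes over the data.

```python
def combined_xtdxd(rows, D, d, n):
    """Single loop: accumulate x_i * (D_ii * x_i^T d) over all instances."""
    result = [0.0] * n
    for i, row in enumerate(rows):
        z = D[i] * sum(value * d[s] for s, value in row)  # D_ii * x_i^T d
        for s, value in row:
            result[s] += z * value
    return result


def separated_xtdxd(rows, D, d, n):
    """Two passes over the data: u = D X d first, then X^T u."""
    u = [D[i] * sum(value * d[s] for s, value in row)
         for i, row in enumerate(rows)]
    result = [0.0] * n
    for i, row in enumerate(rows):
        for s, value in row:
            result[s] += u[i] * value
    return result


rows = [[(0, 1.0), (2, 2.0)], [(2, 3.0)], [(1, 4.0), (2, 5.0)]]
D = [1.0, 0.5, 2.0]
d = [1.0, 1.0, 1.0]
print(combined_xtdxd(rows, D, d, 3))  # [3.0, 72.0, 100.5]
```

Both functions perform the same arithmetic; in a parallel setting the outer loop of the combined version is the single loop to distribute, with the same care about concurrent updates to `result` as for $X^T u$.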

### Reducing Memory Access to Improve Speedup (Cont’d)

Better speedup as memory accesses are reduced

[Figure: speedup versus number of threads (1 to 12), comparing Combined and Separated; panels: rcv1 binary and covtype binary]

The number of operations is the same, but memory access dramatically affects the idle time of threads

### Reducing Memory Access to Improve Speedup (Cont’d)

Therefore, if we can efficiently do

$$\sum_{i=1}^{l} x_i D_{ii} x_i^T d \qquad (8)$$

then probably we don’t need sophisticated sparse matrix-vector packages

However, for a simple operation like (8) careful implementations are still needed

### OpenMP Scheduling

An OpenMP loop assigns tasks to different threads.

The default schedule(static) splits the indices into $P$ blocks, each containing about $l/P$ elements.

However, as tasks may be unbalanced, we can use dynamic scheduling, in which each available thread is assigned the next task.

For example, schedule(dynamic,256) implies that a thread works on 256 elements each time.

Unfortunately, dynamic task assignment incurs overhead.
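The trade-off can be explored with a toy simulation. This is a simplified model, not OpenMP itself: task costs are given up front, static scheduling is modeled as contiguous blocks of roughly equal size, dynamic scheduling hands the next chunk to whichever thread becomes free first, and `overhead` is a hypothetical fixed cost per chunk assignment.

```python
import heapq


def static_makespan(costs, num_threads):
    """Finish time when indices are split into contiguous, equal-sized blocks."""
    l = len(costs)
    bounds = [l * p // num_threads for p in range(num_threads + 1)]
    return max(sum(costs[bounds[p]:bounds[p + 1]]) for p in range(num_threads))


def dynamic_makespan(costs, num_threads, chunk, overhead=0.0):
    """Finish time when each free thread grabs the next chunk of tasks."""
    free_at = [0.0] * num_threads  # time at which each thread becomes free
    heapq.heapify(free_at)
    for start in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)  # earliest available thread
        work = sum(costs[start:start + chunk])
        heapq.heappush(free_at, t + overhead + work)
    return max(free_at)


costs = [8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # one expensive task
print(static_makespan(costs, 2))                  # 11.0
print(dynamic_makespan(costs, 2, 1))              # 8.0
print(dynamic_makespan(costs, 2, 1, overhead=1.0))  # 12.0
```

In this toy setting dynamic scheduling balances the load better than the static split, but once each assignment costs something, it can end up slower than the static schedule.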

### OpenMP Scheduling (Cont’d)

Deciding suitable scheduling is not trivial.

Consider implementing $X^T u$ as an example. This
operation involves the following three loops.

1. Initializing $\hat{u}^p = 0, \ \forall p = 1, \ldots, P$.

2. Calculating $\hat{u}^p, \ \forall p$ by
   $$\hat{u}^p = \sum \{ u_i x_i \mid i \text{ run by thread } p \}$$

3. Calculating $\bar{u} = \sum_{p=1}^{P} \hat{u}^p$.

### OpenMP Scheduling (Cont’d)

• Consider the second step

| Schedule | covtype binary | rcv1 binary |
|---|---|---|
| schedule(static) | 0.2879 | 2.9387 |
| schedule(dynamic) | 1.2611 | 2.6084 |
| schedule(dynamic, 256) | 0.2558 | 1.6505 |

• Clearly, a suitable scheduling is essential

• The other two steps are more balanced, so schedule(static) is used (details omitted)

### Speedup of Total Training Time

[Figure: speedup of total training time; panels: rcv1 binary, webspam, kddb, url combined, covtype binary, rcv1 multiclass]

### Analysis of Experimental Results

For RSB, the speedup for $Xd$ is excellent, but is
poor for $X^T u$ on some $n \ll l$ data (e.g. covtype).
Furthermore, construction time is expensive.

OpenMP is the best for almost all cases, mainly
because of combining $Xd$ and $X^T u$ together.

Therefore, with appropriate settings, simple implementations by OpenMP can achieve excellent speedup.

### Outline

1. Introduction: why are optimization and machine learning related?
2. Optimization methods for kernel support vector machines
   - Decomposition methods
3. Optimization methods for linear classification
   - Decomposition method
   - Newton methods
   - Experiments
4. Multi-core implementation
5. Discussion and conclusions

### Conclusions

Optimization has been very useful for machine learning

We must incorporate machine learning knowledge in designing suitable optimization algorithms and software

The interaction between optimization and machine learning is very interesting and exciting.

### References I

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

W.-L. Chiang, M.-C. Lee, and C.-J. Lin. Parallel dual coordinate descent method for large-scale linear classification in multi-core environments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016. URL http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_cddual.pdf.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889–1918, 2005. URL http://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf.

### References II

T. Glasmachers and U. Dogan. Accelerated coordinate descent with adaptive coordinate frequencies. In Proceedings of the 5th Asian Conference on Machine Learning, volume 29 of Proceedings of Machine Learning Research, pages 72–86, 2013.

C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

Intel. Intel Math Kernel Library Reference Manual.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.

S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent, 2016. arXiv preprint arXiv:1607.08320.

### References III

M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_liblinear_icdm.pdf.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.

O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913–929, 2002.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.

E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–136, 1997.