There is no need to store a separate $X^T$
However, it is possible that threads working on $u_{i_1} x_{i_1}$ and $u_{i_2} x_{i_2}$ want to update the same component $\bar{u}_s$ at the same time:
for $i = 1, \ldots, l$ do in parallel
    for $s$ with $(x_i)_s \neq 0$ do
        $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
Atomic Operations for Parallel $X^T u$
An atomic operation prevents other threads from writing $\bar{u}_s$ at the same time.
for $i = 1, \ldots, l$ do in parallel
    for $s$ with $(x_i)_s \neq 0$ do
        atomic: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$
    end for
end for
However, waiting time can be a serious problem
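For concreteness, here is a minimal C++/OpenMP sketch of the atomic version, assuming $X$ is stored in a compressed sparse row (CSR) layout; the arrays row_ptr, col_idx, and val are illustrative, not LIBLINEAR's actual data structures:

```cpp
#include <vector>

// Sketch of the atomic version of X^T u. Row i of X occupies
// [row_ptr[i], row_ptr[i+1]) of col_idx/val (illustrative CSR layout).
void XTu_atomic(const std::vector<int>& row_ptr,
                const std::vector<int>& col_idx,
                const std::vector<double>& val,
                const std::vector<double>& u,   // length l
                std::vector<double>& ubar)      // length n, pre-zeroed
{
    const int l = (int)u.size();
    #pragma omp parallel for
    for (int i = 0; i < l; i++)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            // The atomic update stops two threads from writing the same
            // component ubar[s] at once, at the cost of waiting.
            #pragma omp atomic
            ubar[col_idx[j]] += u[i] * val[j];
        }
}
```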
Reduce Operations for Parallel $X^T u$
Another method is to use temporary arrays maintained by each thread, and to sum them up in the end
That is, store
$$\hat{u}^p = \sum \{u_i x_i \mid i \text{ run by thread } p\}$$
and then
$$\bar{u} = \sum_p \hat{u}^p$$
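A matching C++/OpenMP sketch of this reduce approach, under the same illustrative CSR layout as before: each thread fills its own array $\hat{u}^p$, and the arrays are summed at the end:

```cpp
#include <omp.h>
#include <vector>

// Sketch of the reduce version of X^T u: per-thread temporary arrays
// instead of atomic updates (same illustrative CSR layout as before).
void XTu_reduce(const std::vector<int>& row_ptr,
                const std::vector<int>& col_idx,
                const std::vector<double>& val,
                const std::vector<double>& u,   // length l
                std::vector<double>& ubar)      // length n, pre-zeroed
{
    const int l = (int)u.size(), n = (int)ubar.size();
    const int P = omp_get_max_threads();
    std::vector<std::vector<double>> uhat(P, std::vector<double>(n, 0.0));
    #pragma omp parallel
    {
        std::vector<double>& mine = uhat[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < l; i++)   // uhat^p = sum of u_i x_i run by thread p
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                mine[col_idx[j]] += u[i] * val[j];
    }
    for (int p = 0; p < P; p++)       // ubar = sum over p of uhat^p
        for (int s = 0; s < n; s++)
            ubar[s] += uhat[p][s];
}
```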
Atomic Operation: Almost No Speedup
Reduce operations are superior to atomic operations
[Figure: speedup of $X^T u$ versus number of threads (1 to 12) on rcv1_binary and covtype_binary; OMP-array (the reduce approach) achieves good speedup, while OMP-atomic gives almost none]
Subsequently we use the reduce operations
Existing Algorithms for Sparse Matrix-vector Product
This has always been an important research issue in numerical analysis
Besides our direct implementation that parallelizes loops, in the next slides we consider two existing methods
Recursive Sparse Blocks (Martone, 2014)
RSB (Recursive Sparse Blocks) is an effective format for fast parallel sparse matrix-vector multiplications
It recursively partitions a matrix into sparse blocks
Locality of memory references is improved, but the construction time is not negligible
Recursive Sparse Blocks (Cont’d)
Parallel, efficient sparse matrix-vector operations
Improved locality of memory references
But the initial construction time is about that of 20 matrix-vector multiplications, which is not negligible in some cases
We will show the results in the experiment part
Intel MKL
Intel Math Kernel Library (MKL) is a commercial library including optimized routines for linear algebra (Intel)
It supports fast matrix-vector multiplications for different sparse formats.
We consider the row-oriented format to store $X$.
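As an illustration only (a hedged sketch, not LIBLINEAR's code), computing $Xd$ through MKL's inspector-executor sparse BLAS might look as follows; the MKL 11.2 release used in the experiments also offered equivalent routines:

```cpp
#include <mkl_spblas.h>

// Hedged sketch: Xd via MKL's inspector-executor sparse BLAS, with X
// in 0-based CSR. The array names are illustrative.
void mkl_Xd(int l, int n, int* row_ptr, int* col_idx, double* val,
            const double* d,  // length n
            double* out)      // length l, receives Xd
{
    sparse_matrix_t X;
    mkl_sparse_d_create_csr(&X, SPARSE_INDEX_BASE_ZERO, l, n,
                            row_ptr, row_ptr + 1, col_idx, val);
    matrix_descr descr;
    descr.type = SPARSE_MATRIX_TYPE_GENERAL;
    // out = 1.0 * X * d + 0.0 * out; passing SPARSE_OPERATION_TRANSPOSE
    // instead would compute X^T d.
    mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, X, descr,
                    d, 0.0, out);
    mkl_sparse_destroy(X);
}
```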
Outline
5 Multi-core linear classification
Parallel matrix-vector multiplications
Experiments
Experiments
Baseline: single-core version in LIBLINEAR 1.96
OpenMP: parallelizing loops by OpenMP
MKL: Intel MKL version 11.2
RSB: librsb version 1.2.0
Speedup of $Xd$: All are Excellent
[Figure: speedup of $Xd$ on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
More Difficult to Speed up $X^T u$
[Figure: speedup of $X^T u$ on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
Reducing Memory Access to Improve Speedup
In computing $Xd$ and $X^T(DXd)$, the data matrix is accessed twice
We notice that these two operations can be combined together:
$$X^T D X d = \sum_{i=1}^{l} x_i D_{ii} x_i^T d$$
We can then parallelize one single loop by OpenMP
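A serial C++ sketch of the combined operation (illustrative CSR layout again): each instance contributes $x_i D_{ii} x_i^T d$, so row $i$ of $X$ is read once; parallelizing the outer loop with OpenMP plus the per-thread reduction arrays shown earlier gives the combined implementation:

```cpp
#include <vector>

// Sketch of the combined X^T D X d: per instance, the inner product
// x_i^T d, then the scaled scatter D_ii * (x_i^T d) * x_i. Row i of X
// is read once instead of twice (once for Xd, once for X^T(DXd)).
void XTDXd_combined(const std::vector<int>& row_ptr,
                    const std::vector<int>& col_idx,
                    const std::vector<double>& val,
                    const std::vector<double>& D,   // diagonal of D, length l
                    const std::vector<double>& d,   // length n
                    std::vector<double>& out)       // length n, pre-zeroed
{
    const int l = (int)D.size();
    for (int i = 0; i < l; i++) {       // the single loop to parallelize
        double xid = 0.0;               // x_i^T d
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            xid += val[j] * d[col_idx[j]];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            out[col_idx[j]] += D[i] * xid * val[j];
    }
}
```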
Reducing Memory Access to Improve Speedup (Cont’d)
Better speedup as memory accesses are reduced
[Figure: speedup of the combined versus separated implementations on rcv1_binary and covtype_binary; the combined approach scales better]
The number of operations is the same, but memory access dramatically affects the idle time of threads
OpenMP Scheduling
An OpenMP loop assigns tasks to different threads.
The default schedule(static) splits the indices into $P$ blocks, each containing $l/P$ elements.
However, as tasks may be unbalanced, we can use dynamic scheduling: available threads are assigned the next tasks.
For example, schedule(dynamic,256) implies that a thread works on 256 elements each time.
Unfortunately, dynamic task assignment incurs overhead.
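A self-contained toy example of the clause; work() below is a hypothetical stand-in for the per-instance computation:

```cpp
#include <cstdio>

double work(int i) { return 0.5 * i; }  // hypothetical per-task cost

int main() {
    const int l = 100000;
    double sum = 0.0;
    // schedule(dynamic, 256): an idle thread grabs the next chunk of
    // 256 indices, trading load balance against assignment overhead.
    #pragma omp parallel for schedule(dynamic, 256) reduction(+ : sum)
    for (int i = 0; i < l; i++)
        sum += work(i);
    std::printf("sum = %g\n", sum);
    return 0;
}
```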
OpenMP Scheduling (Cont’d)
Deciding a suitable scheduling is not trivial.
Consider implementing $X^T u$ as an example. This operation involves the following three loops.
1. Initializing $\hat{u}^p = 0$, $\forall p = 1, \ldots, P$
2. Calculating $\hat{u}^p$, $\forall p$, by $\hat{u}^p = \sum \{u_i x_i \mid i \text{ run by thread } p\}$
3. Calculating $\bar{u} = \sum_{p=1}^{P} \hat{u}^p$
OpenMP Scheduling (Cont’d)
• Consider the second step. Running time (in seconds):

                         covtype_binary   rcv1_binary
  schedule(static)           0.2879          2.9387
  schedule(dynamic)          1.2611          2.6084
  schedule(dynamic, 256)     0.2558          1.6505
• Clearly, a suitable scheduling is essential
• The other two steps are more balanced, so schedule(static) is used (details omitted)
Speedup of Total Training Time
[Figure: speedup of total training time on rcv1_binary, webspam, kddb, url_combined, covtype_binary, and rcv1]
Analysis of Experimental Results
For RSB, the speedup for $Xd$ is excellent, but it is poor for $X^T u$ on some $n \ll l$ data (e.g., covtype). Furthermore, the construction time is expensive
OpenMP is the best for almost all cases, mainly because of combining $Xd$ and $X^T u$ together
Therefore, with appropriate settings, simple implementations by OpenMP can achieve excellent speedup
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Multi-core linear classification
6 Distributed linear classification
7 Discussion and conclusions
Outline
6 Distributed linear classification
Distributed matrix-vector multiplications
Experiments
Data in a Distributed System
When data are too large, we may, for example, let each node store a subset of the instances
[Figure: $X$ partitioned by instances into $X_1, X_2, \ldots, X_P$, stored on nodes $1, 2, \ldots, P$]
We would like to compute the Hessian-vector product $\nabla^2 f(w)d$ in a distributed way; its main cost is $X^T D X d$
Parallel Hessian-vector Product
As in the shared-memory situation, we have
$$X^T D X d = X_1^T D_1 X_1 d + \cdots + X_P^T D_P X_P d$$
We let each node calculate $X_p^T D_p X_p d$ and sum the resulting vectors together
This is a reduce operation. We have used similar techniques for multi-core situations
Master-slave or Master-master
Master-slave: only the master gets $X^T D X d$ and runs the whole Newton method
Master-master: every node gets $X^T D X d$; then each node has all the information to finish the Newton method
Here we consider a master-master setting.
One reason is that for master-slave, the implementations on the master and the slaves are different
This is different from multi-core situations, where only one master copy is run
Allreduce Operation
We let every node get $X^T D X d$
[Figure: each of three nodes holds $d$ and computes its local product $X_1^T D_1 X_1 d$, $X_2^T D_2 X_2 d$, or $X_3^T D_3 X_3 d$; after ALLREDUCE, every node holds $X^T D X d$]
Allreduce: reducing all vectors ($X_i^T D_i X_i d$, $\forall i$) to a single vector ($X^T D X d \in R^n$) and then sending the result to every node
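A minimal MPI sketch of this step; the names are illustrative, with local_Hd holding this node's $X_p^T D_p X_p d$:

```cpp
#include <mpi.h>
#include <vector>

// Sum the per-node vectors X_p^T D_p X_p d across all nodes and leave
// the total X^T D X d (length n) on every node.
std::vector<double> allreduce_Hv(const std::vector<double>& local_Hd)
{
    std::vector<double> Hd(local_Hd.size());
    // Every node receives the elementwise sum over all nodes.
    MPI_Allreduce(local_Hd.data(), Hd.data(), (int)local_Hd.size(),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return Hd;
}
```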
Instance-wise and Feature-wise Data Splits
Instead of storing a subset of data at each node, we can store a subset of features
[Figure: instance-wise split $X_{iw,1}, X_{iw,2}, X_{iw,3}$ (blocks of rows) versus feature-wise split $X_{fw,1}, X_{fw,2}, X_{fw,3}$ (blocks of columns)]
Instance-wise and Feature-wise Data Splits (Cont’d)
Feature-wise: each machine calculates part of the Hessian-vector product
$$(\nabla^2 f(w)d)_{fw,1} = d_1 + C X_{fw,1}^T D (X_{fw,1} d_1 + \cdots + X_{fw,P} d_P)$$
$X_{fw,1} d_1 + \cdots + X_{fw,P} d_P \in R^l$ must be available on all nodes (by allreduce)
Because
$$X_p^T D_p X_p d : O(n), \qquad X_{fw,p} d_p : O(l),$$
the amount of data moved per Hessian-vector product is
Instance-wise: $O(n)$; Feature-wise: $O(l)$
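A hedged MPI sketch of the feature-wise product on one node, with $X_{fw,p}$ ($l$ rows, this node's $n_p$ columns) in an illustrative CSR layout whose column indices are local to the node:

```cpp
#include <mpi.h>
#include <vector>

// Sketch of (nabla^2 f(w) d)_{fw,p} = d_p + C X_{fw,p}^T D (sum_q X_{fw,q} d_q)
// on node p. The CSR arrays describe X_{fw,p} with node-local columns.
std::vector<double> Hd_feature_wise(const std::vector<int>& row_ptr,
                                    const std::vector<int>& col_idx,
                                    const std::vector<double>& val,
                                    const std::vector<double>& D,   // diag(D), length l
                                    const std::vector<double>& dp,  // d_p, length n_p
                                    double C)
{
    const int l = (int)D.size();
    std::vector<double> local(l, 0.0), Xd(l);
    for (int i = 0; i < l; i++)     // local part X_{fw,p} d_p
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            local[i] += val[j] * dp[col_idx[j]];
    // Allreduce of a length-l vector: O(l) data moved per product.
    MPI_Allreduce(local.data(), Xd.data(), l, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    std::vector<double> Hd(dp);     // start from d_p
    for (int i = 0; i < l; i++)     // add C X_{fw,p}^T D (Xd)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            Hd[col_idx[j]] += C * val[j] * D[i] * Xd[i];
    return Hd;
}
```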
Outline
6 Distributed linear classification
Distributed matrix-vector multiplications
Experiments
Experiments
We compare
TRON: Newton method
ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
VW: Vowpal Wabbit (Langford et al., 2007)
TRON and ADMM are implemented by MPI
Details in Zhuang et al. (2015)
Experiments (Cont’d)
[Figure: relative function value difference versus training time (sec.) on epsilon and webspam, comparing VW, ADMM-FW, ADMM-IW, TRON-FW, and TRON-IW]
32 machines are used
Horizontal line: the point where test accuracy has stabilized
Instance-wise and feature-wise splits are useful for $l \gg n$ and $l \ll n$ data, respectively
Experiments (Cont’d)
We have seen that communication cost is a big concern
In terms of running time, multi-core implementation is often faster
However, besides running time, data preparation and loading are also issues
Overall we see that distributed training is a complicated issue
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Multi-core linear classification
6 Distributed linear classification
7 Discussion and conclusions
Outline
7 Discussion and conclusions
Some resources
Conclusions
Software
• Most materials in this talk are based on our experiences in developing two popular software packages
• Kernel: LIBSVM (Chang and Lin, 2011)
http://www.csie.ntu.edu.tw/~cjlin/libsvm
• Linear: LIBLINEAR (Fan et al., 2008)
http://www.csie.ntu.edu.tw/~cjlin/liblinear
• See also a survey on linear classification in Yuan et al. (2012)
Multi-core LIBLINEAR
An extension of the software LIBLINEAR
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear
This is based on the study in Lee et al. (2015)
We already have many users. For example, one user from USC used this tool to reduce his training time from over 30 hours to 5 hours
Distributed LIBLINEAR
An extension of the software LIBLINEAR
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)
Outline
7 Discussion and conclusions
Some resources
Conclusions
Conclusions
Linear classification is an old topic, but recently there have been new and interesting applications
Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others
Linear classification will continue to be used in situations ranging from small-model to big-data applications
Acknowledgments
Many students have contributed to our research on large-scale linear classification
We also thank the partial support from the Ministry of Science and Technology in Taiwan
References I
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.
References II
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
Intel. Intel Math Kernel Library Reference Manual.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.
S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007. https://github.com/JohnLangford/vowpal_wabbit/wiki.
M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/multicore_liblinear_icdm.pdf.
References III
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.
C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin. Large-scale logistic regression and linear support vector machines using Spark. In Proceedings of the IEEE International Conference on Big Data, pages 519–528, 2014. URL http://www.csie.ntu.edu.tw/~cjlin/papers/spark-liblinear/spark-liblinear.pdf.
M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.
E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–136, 1997.
J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, Cambridge, MA, 1998. MIT Press.
T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pages 629–637. 2013.