Distributed Data Classification
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at ICML workshop on New Learning Frameworks and Models for Big Data, June 25, 2014
Outline
1 Introduction: why distributed classification
2 Example: a distributed Newton method for logistic regression
3 Discussion from the viewpoint of the application workflow
4 Conclusions
Introduction: why distributed classification
Why Distributed Data Classification?
The usual answer is that data are too big to be stored in one computer
However, we will show that the whole issue is more complicated
Let’s Start with an Example
Using the linear classifier LIBLINEAR (Fan et al., 2008) to train the rcv1 document data set (Lewis et al., 2004).
# instances: 677,399, # features: 47,236
On a typical PC:

$ time ./train rcv1_test.binary
Total time: 50.88 seconds
Loading time: 43.51 seconds
For this example,

loading time ≫ running time

In fact, two seconds of running time are enough ⇒ test accuracy becomes stable
Loading Time Versus Running Time
To see why this happens, let's discuss the complexity. Assume the memory hierarchy contains only disk, and the number of instances is l
Loading time: l × (a big constant)
Running time: l^q × (some constant), where q ≥ 1
Running time is often larger than loading time because q > 1 (e.g., q = 2 or 3)
Example: kernel methods
Loading Time Versus Running Time (Cont’d)
Therefore,

l^(q−1) > a big constant

and traditionally machine learning and data mining papers consider only running time
When l is large, we may use a linear algorithm (i.e., q = 1) for efficiency
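To make this trade-off concrete, a back-of-the-envelope sketch in Python; the per-access cost a and per-operation cost c are made-up constants for illustration, not measured values:

    # a (per-access cost) and c (per-operation cost) are hypothetical
    l = 677_399                # rcv1 size from the earlier slide
    a, c = 1e-6, 1e-9          # loading ~ l * a, running ~ l**q * c
    for q in (1, 2):
        print(f"q={q}: loading {l * a:.2f}s, running {l ** q * c:.2f}s")
    # q=1: loading (0.68s) dwarfs running; q=2: running (458.87s) dominates,
    # which is the setting traditional papers assume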
Loading Time Versus Running Time (Cont’d)
An important conclusion of this example is that computation time may not be the only concern
- If running time dominates, then we should design algorithms to reduce the number of operations
- If loading time dominates, then we should design algorithms to reduce the number of data accesses
This example is on one machine. The situation in distributed environments is even more complicated
Possible Advantages of Distributed Data Classification
Parallel data loading
Reading several TB of data from disk is slow
Using 100 machines, each with 1/100 of the data on its local disk ⇒ 1/100 loading time
But moving data to these 100 machines may be difficult!
Fault tolerance
Some data are replicated across machines: if one machine fails, others are still available
Possible Disadvantages of Distributed Data Classification
More complicated (of course)
Communication and synchronization
Everybody says to move computation to data, but this isn't that easy
Going Distributed or Not Isn’t Easy to Decide
Quote from Yann LeCun (KDnuggets News 14:n05)
“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”
Now disks and RAM are large. You may load several TB of data once and conveniently conduct all the analysis
The decision is application dependent
Example: a distributed Newton method for logistic regression
Logistic Regression
Training data {(y_i, x_i)}, x_i ∈ R^n, y_i = ±1, i = 1, . . . , l
l: # of data, n: # of features
Regularized logistic regression: min_w f(w), where

f(w) = (1/2) w^T w + C Σ_{i=1}^{l} log(1 + e^{−y_i w^T x_i})

C: regularization parameter decided by users
Twice differentiable, so we can use Newton methods
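For reference, a minimal NumPy sketch of this objective; the function and variable names are mine, not LIBLINEAR's:

    import numpy as np

    def f(w, X, y, C):
        """Objective above: X is the l x n data matrix, y the +1/-1 labels."""
        z = y * (X @ w)                       # y_i * w^T x_i for all i
        # log(1 + e^{-z}) computed stably as logaddexp(0, -z)
        return 0.5 * (w @ w) + C * np.logaddexp(0, -z).sum()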
Newton Methods
Newton direction:

min_s ∇f(w^k)^T s + (1/2) s^T ∇^2 f(w^k) s

This is the same as solving the Newton linear system

∇^2 f(w^k) s = −∇f(w^k)
The Hessian matrix ∇^2 f(w^k) is too large to be stored:
∇^2 f(w^k): n × n, n: number of features

But the Hessian has a special form:

∇^2 f(w) = I + C X^T D X
Newton Methods (Cont’d)
X: data matrix; D: diagonal with

D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})^2
Using conjugate gradient (CG) to solve the linear system, only Hessian-vector products are needed:

∇^2 f(w) s = s + C · X^T (D (X s))

Therefore, we have a Hessian-free approach
For other details, see Lin et al. (2008) and the software LIBLINEAR
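A minimal NumPy sketch of this Hessian-free approach, assuming a dense X for simplicity (LIBLINEAR itself handles sparse data and further refinements such as trust regions):

    import numpy as np

    def hessian_vector_product(s, w, X, y, C):
        """(I + C X^T D X) s without ever forming the n x n Hessian."""
        z = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))      # logistic function of y_i w^T x_i
        D = sigma * (1.0 - sigma)             # D_ii = e^{-z_i} / (1 + e^{-z_i})^2
        return s + C * (X.T @ (D * (X @ s)))

    def conjugate_gradient(hvp, b, tol=1e-3, max_iter=250):
        """Solve H s = b given only a function hvp(v) returning H v."""
        s = np.zeros_like(b)
        r = b.copy()                          # residual b - H s (s starts at 0)
        d = r.copy()
        rr = r @ r
        for _ in range(max_iter):
            Hd = hvp(d)
            alpha = rr / (d @ Hd)
            s += alpha * d
            r -= alpha * Hd
            rr_new = r @ r
            if rr_new ** 0.5 <= tol * (b @ b) ** 0.5:
                break
            d = r + (rr_new / rr) * d
            rr = rr_new
        return s

    # One Newton direction: solve (Hessian) s = -(gradient) by CG
    # s = conjugate_gradient(lambda v: hessian_vector_product(v, w, X, y, C), -g)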
Parallel Hessian-vector Product
Hessian-vector products are the computational bottleneck
X^T D X s

The data matrix X is now stored across the nodes
[Figure: X is split by rows into X_1, X_2, . . . , X_p, with X_i stored on node i]

X^T D X s = X_1^T D_1 X_1 s + · · · + X_p^T D_p X_p s
Parallel Hessian-vector Product (Cont’d)
We use allreduce to let every node get X^T D X s
[Figure: each node holds s, computes its partial product X_i^T D_i X_i s, and after the allreduce every node has X^T D X s]
Allreduce: reducing all vectors (X_i^T D_i X_i s, ∀i) to a single vector (X^T D X s ∈ R^n) and then sending the result to every node
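A hedged mpi4py sketch of this step; each node is assumed to hold its instance-wise slice of the data, and all names are illustrative:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def distributed_hvp(s, w, X_local, y_local, C):
        """Instance-wise split: this node holds the slice X_i, y_i."""
        z = y_local * (X_local @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))
        D = sigma * (1.0 - sigma)
        local = X_local.T @ (D * (X_local @ s))   # X_i^T D_i X_i s on this node
        total = np.empty_like(local)
        comm.Allreduce(local, total, op=MPI.SUM)  # sum over nodes; all receive it
        return s + C * total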
Parallel Hessian-vector Product (Cont’d)
Then each node has all the information needed to finish the Newton method
We don’t use a master-slave model because implementations on master and slaves become different
We use MPI here, but will discuss other programming frameworks later
Instance-wise and Feature-wise Data Splits
[Figure: instance-wise split (X_iw,1, X_iw,2, X_iw,3 stack rows of X) versus feature-wise split (X_fw,1, X_fw,2, X_fw,3 partition the columns)]
Feature-wise: each machine calculates part of the Hessian-vector product
(∇^2 f(w) v)_fw,1 = v_1 + C X_fw,1^T D (X_fw,1 v_1 + · · · + X_fw,p v_p)
Instance-wise and Feature-wise Data Splits (Cont’d)
X_fw,1 v_1 + · · · + X_fw,p v_p ∈ R^l must be available on all nodes (by allreduce)

Amount of data moved per Hessian-vector product:
Instance-wise: O(n); Feature-wise: O(l)
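A similar sketch for the feature-wise split; node j is assumed to store the column block X_fw,j and the piece v_j, with the diagonal D already available on every node:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def featurewise_hvp_piece(v_j, X_fw_j, D, C):
        """Node j's piece of (I + C X^T D X) v under a feature-wise split."""
        local = X_fw_j @ v_j                      # this node's share of X v, in R^l
        Xv = np.empty_like(local)
        comm.Allreduce(local, Xv, op=MPI.SUM)     # O(l) data moved per product
        return v_j + C * (X_fw_j.T @ (D * Xv))    # node j's part of the result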
Experiments
Two sets:

Data set   l         n           #nonzeros
epsilon    400,000   2,000         800,000,000
webspam    350,000   16,609,143  1,304,697,446

We use Amazon AWS
We compare
TRON: Newton method
ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
Experiments (Cont’d)
[Figure: relative function value difference versus training time (s) for ADMM-IW, ADMM-FW, TRON-IW, and TRON-FW; left: epsilon, right: webspam]
16 machines are used
Horizontal line: where test accuracy has stabilized
TRON has faster convergence than ADMM
Instance-wise and feature-wise splits are useful in different situations (recall the O(n) versus O(l) communication costs)
Other Distributed Classification Methods
We give only an example here (distributed Newton). There are many other methods
For example, distributed quasi-Newton, distributed random forests, etc.
Existing software includes, for example, Vowpal Wabbit (Langford et al., 2007)
Discussion from the viewpoint of the application workflow
Training Is Only Part of the Workflow
Previous experiments show that for a set with 0.35M instances and 16M features, distributed training using 16 machines takes 50 seconds
This looks good, but it is not the whole story
Copying data from Amazon S3 to 16 local disks takes more than 150 seconds
Distributed training may not be the bottleneck in the whole workflow
Example: CTR Prediction
CTR prediction is an important component of an advertisement system
CTR = (# clicks) / (# impressions)

A sequence of events:
Not clicked, features of user
Clicked, features of user
Not clicked, features of user
· · ·

A binary classification problem. We use the logistic regression model discussed earlier
Example: CTR Prediction (Cont’d)
System Architecture
Example: CTR Prediction (Cont’d)
We use data in a sliding window. For example, data from the past week is used to train a model for today's prediction
We keep renting local disks
An incoming instance is immediately dispatched to a local disk
Thus data moving is completed before training
For training, we rent machines to mount these disks
Data are also constantly removed
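A hypothetical sketch of this bookkeeping (the one-week window and all names are illustrative, not from the production system):

    import time
    from collections import deque

    WINDOW = 7 * 24 * 3600                # keep the past week, in seconds

    window = deque()                      # (timestamp, instance), oldest first

    def dispatch(instance, now=None):
        """Store an incoming instance at once and drop anything too old."""
        now = time.time() if now is None else now
        window.append((now, instance))    # dispatched immediately on arrival
        while window and window[0][0] < now - WINDOW:
            window.popleft()              # data are constantly removed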
Example: CTR Prediction (Cont’d)
This design effectively alleviates the problem of moving and copying data before training
However, if you want to use data from 3 months ago for analysis, data movement becomes an issue
This is an example showing that distributed training is just part of the workflow
It is important to consider all steps in the whole application
See also an essay by Jimmy Lin (2012)
What if We Don’t Maintain Data at All?
We may use an online setting, so each instance is used only once
Advantage: the classification implementation is simpler than methods like distributed Newton
Disadvantage: you may worry about accuracy
The situation may be application dependent
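For illustration, a minimal sketch of such an online method: one stochastic gradient descent pass over the logistic loss, each instance seen exactly once (the step size and the dropped regularization term are simplifications of mine):

    import numpy as np

    def online_logreg(stream, n, C=1.0, eta=0.01):
        """One SGD pass over (x, y) pairs with y in {+1, -1}."""
        w = np.zeros(n)
        for x, y in stream:                       # each instance is seen once
            z = y * (x @ w)
            # gradient of C * log(1 + e^{-y w^T x}) at this single instance
            g = -C * y * x / (1.0 + np.exp(z))
            w -= eta * g                          # regularization omitted here
        return w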
Programming Frameworks
We use MPI for the above experiments. How about others like MapReduce?
MPI is more efficient, but has no fault tolerance
In contrast, MapReduce is slow for iterative algorithms due to heavy disk I/O
Many new frameworks are being actively developed:
1. Spark (Zaharia et al., 2010)
2. REEF (Chun et al., 2013)
Selecting suitable frameworks for distributed classification isn’t that easy!
A Comparison Between MPI and Spark
Data set   l         n
epsilon    400,000   2,000    (dense features)
rcv1       677,399   47,236   (sparse features)
[Figure: relative function value difference (log scale) versus training time (s) for spark-one-core, spark-multi-cores, spark-one-core-coalesce, spark-multi-cores-coalesce, and mpi; left: epsilon, right: rcv1]
A Comparison Between MPI and Spark (Cont’d)
8 nodes in a local cluster (not AWS) are used
Spark is slower, but in general competitive
Some issues that may cause the time differences:
C versus Scala
Allreduce versus master-slave setting
Distributed LIBLINEAR
We recently released an extension of LIBLINEAR for distributed classification
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI and Spark
The development is still in an early stage. We are working hard to improve the Spark version
Your comments are very welcome.
Conclusions
Designing distributed training algorithms isn't easy.
You can parallelize existing algorithms or create new ones
Issues such as communication cost must be solved
We also need to know that distributed training is only one component of the whole workflow
System issues are important because many programming frameworks are still being developed
Overall, distributed classification is an active and exciting research topic