Large-scale Linear Classification: Status and Challenges

(1)


Chih-Jen Lin

Department of Computer Science National Taiwan University

San Francisco Machine Learning Meetup, October 30, 2014

Chih-Jen Lin (National Taiwan Univ.) 1 / 43

(2)

Outline

1 Introduction

2 Optimization methods

3 Sample applications

4 Big-data linear classification

5 Conclusions


(4)

Introduction

Linear Classification

The model is a weight vector w (for binary classification)

The decision function is

sgn(w^Tx )

Although many new and advanced techniques are available (e.g., deep learning), linear classifiers remain useful because of their simplicity

We will give an overview of this topic in this talk
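As a minimal sketch of the decision function sgn(w^Tx), assuming dense Python lists for w and x (the example values and the tie-breaking rule at zero are illustrative assumptions, not part of the talk):

```python
def predict(w, x):
    """Return +1 or -1 according to the sign of the inner product w^T x."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x))
    # A score of exactly zero is assigned to +1 here; this tie-breaking is an assumption.
    return 1 if score >= 0 else -1

w = [0.5, -1.0, 2.0]   # hypothetical weight vector
x = [1.0, 2.0, 0.5]    # hypothetical instance
label = predict(w, x)  # score = 0.5 - 2.0 + 1.0 = -0.5
```

Prediction touches each feature once, which is the O(n) cost discussed later in the talk.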

(5)

Linear and Kernel Classification

[Figure: two panels, "Linear" and "Nonlinear"]

Linear: data in the original input space; nonlinear: data mapped to other spaces

Original: [height, weight]

Nonlinear: [height, weight, weight/height²]

Kernel is one of the nonlinear methods

(6)


Linear and Nonlinear Classification

Methods such as SVM and logistic regression can be used in two ways

• Kernel methods: data mapped to another space x ⇒ φ(x )

φ(x )^Tφ(y) easily calculated; no good control on φ(·)

• Linear classification + feature engineering:

Directly use x without mapping. But x may have been carefully generated. Full control on x

We will focus on the 2nd type of approaches in this talk

(7)

Why Linear Classification?

• If φ(x) is high dimensional, the decision function sgn(w^Tφ(x)) is expensive

• Kernel methods:

$$w \equiv \sum_{i=1}^{l} \alpha_i \phi(x_i) \text{ for some } \alpha, \qquad K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$$

New decision function: $\mathrm{sgn}\left(\sum_{i=1}^{l} \alpha_i K(x_i, x)\right)$

• Special φ(x) so calculating K(x_i, x_j) is easy. Example:

$$K(x_i, x_j) \equiv (x_i^T x_j + 1)^2 = \phi(x_i)^T \phi(x_j), \qquad \phi(x) \in R^{O(n^2)}$$
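The identity between the degree-2 polynomial kernel and an explicit inner product can be checked numerically. The sketch below assumes one standard choice of φ with √2 scaling on the linear and cross terms; the slide does not spell out φ, so this particular form and the function names are assumptions.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

def poly2_map(x):
    """Explicit mapping phi(x) whose inner product reproduces poly2_kernel."""
    n = len(x)
    s = math.sqrt(2.0)
    phi = [1.0]                                   # constant term
    phi += [s * v for v in x]                     # linear terms, scaled by sqrt(2)
    phi += [v * v for v in x]                     # squared terms
    phi += [s * x[i] * x[j]                       # cross terms with i < j
            for i in range(n) for j in range(i + 1, n)]
    return phi

x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
kernel_value = poly2_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(poly2_map(x), poly2_map(z)))
```

For n = 3 the explicit vector already has 10 entries, while the kernel needs only one inner product of length 3; this is the O(n²) versus O(n) gap the slide refers to.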

(8)


Why Linear Classification? (Cont’d)

Prediction

$$w^T x \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i K(x_i, x)$$

If K(x_i, x_j) takes O(n), then

O(n) versus O(nl)

Kernel: cost related to size of training data

Linear: cheaper and simpler

(9)

Linear is Useful in Some Places

For certain problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

Especially document classification

Number of features (bag-of-words model) very large

Large and sparse data

Training on millions of instances takes just a few seconds

(10)


Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)

                    Linear              RBF Kernel
Data set        Time   Accuracy       Time   Accuracy
MNIST38          0.1      96.82       38.1      99.70
ijcnn1           1.6      91.81       26.8      98.69
covtype          1.4      76.37   46,695.8      96.11
news20           1.1      96.95      383.2      96.90
real-sim         0.3      97.44      938.3      97.82
yahoo-japan      3.1      92.63   20,955.2      93.31
webspam         25.7      93.35   15,681.8      99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features


(13)

Binary Linear Classification

Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1

l: # of data, n: # of features

$$\min_w f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; x_i, y_i)$$

w^Tw/2: regularization term (we have no time to talk about L1 regularization here)

ξ(w; x, y): loss function: we hope y w^Tx > 0

C: regularization parameter

(14)


Loss Functions

Some commonly used ones:

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x), \qquad (1)$$

$$\xi_{L2}(w; x, y) \equiv \max(0, 1 - y w^T x)^2, \qquad (2)$$

$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}). \qquad (3)$$

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
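The three losses and the regularized objective f(w) can be written down directly. This is a minimal sketch on dense lists with hypothetical toy data; the function names are assumptions, and the logistic loss is the naive formula, which can overflow for very negative margins.

```python
import math

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def l1_loss(w, x, y):
    """Loss (1): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * dot(w, x))

def l2_loss(w, x, y):
    """Loss (2): max(0, 1 - y w^T x)^2."""
    return max(0.0, 1.0 - y * dot(w, x)) ** 2

def lr_loss(w, x, y):
    """Loss (3): log(1 + exp(-y w^T x)); naive form, not overflow-safe."""
    return math.log(1.0 + math.exp(-y * dot(w, x)))

def objective(w, X, Y, C, loss):
    """f(w) = w^T w / 2 + C * sum_i loss(w; x_i, y_i)."""
    return 0.5 * dot(w, w) + C * sum(loss(w, x, y) for x, y in zip(X, Y))

# Hypothetical toy problem: two instances, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1, -1]
w = [0.5, 0.5]
values = {name: objective(w, X, Y, 1.0, fn)
          for name, fn in [("L1", l1_loss), ("L2", l2_loss), ("LR", lr_loss)]}
```

Only the loss term changes between the three models; the regularization term and the role of C stay the same.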

(15)

Loss Functions (Cont’d)

[Figure: loss ξ(w; x, y) plotted against −y w^Tx for ξ_L1, ξ_L2, and ξ_LR]

Their performance is usually similar

Optimization methods may be different because of differentiability

(16)

Optimization methods


(17)

Optimization Methods

Many unconstrained optimization methods can be applied

For kernel, optimization is over a variable α where

$$w = \sum_{i=1}^{l} \alpha_i \phi(x_i)$$

We cannot minimize over w because it may be infinite dimensional

However, for linear, minimizing over w or α is ok

(18)

Optimization Methods (Cont’d)

Among unconstrained optimization methods,

Low-order methods: quickly get a model, but slow final convergence

High-order methods: more robust and useful for ill-conditioned situations

We will quickly discuss some examples and show both types of optimization methods are useful for linear classification

(19)

Optimization: 2nd Order Methods

Newton direction (if twice differentiable)

$$\min_s \; \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k) s$$

This is the same as solving Newton linear system

$$\nabla^2 f(w^k) s = -\nabla f(w^k)$$

Hessian matrix ∇²f(w^k) too large to be stored

∇²f(w^k): n × n, n: number of features

But Hessian has a special form

$$\nabla^2 f(w) = I + C X^T D X,$$

(20)

Optimization: 2nd Order Methods (Cont’d)

X: data matrix. D: diagonal.

Using Conjugate Gradient (CG) to solve the linear system. Only Hessian-vector products are needed

$$\nabla^2 f(w) s = s + C \cdot X^T (D (X s))$$

Therefore, we have a Hessian-free approach
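The Hessian-free idea can be sketched as follows: conjugate gradient only ever asks for the product of the Hessian with a vector, so the n × n matrix is never formed. The toy data, the entries of D, and the stand-in gradient are hypothetical, and the plain CG loop below is a textbook version rather than the trust-region variant used in practice.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def hessian_vector(X, D, C, s):
    """Return (I + C X^T D X) s using only products with X and X^T."""
    Xs = [dot(row, s) for row in X]                 # X s, length l
    DXs = [d * t for d, t in zip(D, Xs)]            # D (X s)
    XtDXs = [0.0] * len(s)                          # X^T (D (X s)), length n
    for row, u in zip(X, DXs):
        for j, a in enumerate(row):
            XtDXs[j] += a * u
    return [s_j + C * h_j for s_j, h_j in zip(s, XtDXs)]

def conjugate_gradient(hv, b, tol=1e-10, max_iter=100):
    """Solve H s = b for symmetric positive definite H, given only the map s -> H s."""
    s = [0.0] * len(b)
    r = list(b)                                     # residual b - H s with s = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol:
            break
        Hp = hv(p)
        step = rr / dot(p, Hp)
        s = [a + step * q for a, q in zip(s, p)]
        r = [a - step * q for a, q in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [a + (rr_new / rr) * q for a, q in zip(r, p)]
        rr = rr_new
    return s

# Hypothetical toy data: 3 instances, 2 features.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
D = [0.25, 0.20, 0.10]      # assumed diagonal entries of D
C = 1.0
grad = [1.0, -2.0]          # stand-in for the gradient at w^k

# Newton direction: solve (I + C X^T D X) s = -grad
s = conjugate_gradient(lambda v: hessian_vector(X, D, C, v), [-g for g in grad])
```

Each CG iteration costs two passes over the data (one for X s, one for X^T u), so memory use stays proportional to the data rather than to n².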

(21)

Optimization: 1st Order Methods

We consider L1-loss and the dual SVM problem

$$\min_\alpha f(\alpha) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; \forall i,$$

where

$$f(\alpha) \equiv \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

and

$$Q_{ij} = y_i y_j x_i^T x_j, \qquad e = [1, \ldots, 1]^T$$

We will apply coordinate descent (CD) methods

The situation for L2 or LR loss is very similar

(22)

1st Order Methods (Cont’d)

Coordinate descent: a simple and classic technique

Change one variable at a time

Given current α. Let e_i = [0, ..., 0, 1, 0, ..., 0]^T.

$$\min_d f(\alpha + d e_i) = \frac{1}{2} Q_{ii} d^2 + \nabla_i f(\alpha) d + \text{constant}$$

Without constraints

$$\text{optimal } d = -\frac{\nabla_i f(\alpha)}{Q_{ii}}$$

Now 0 ≤ α_i + d ≤ C

$$\alpha_i \leftarrow \min\left(\max\left(\alpha_i - \frac{\nabla_i f(\alpha)}{Q_{ii}}, 0\right), C\right)$$
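The update rule above can be turned into a short training loop. The sketch below is an assumption-laden illustration: it uses dense lists, a fixed sweep order, and a fixed number of epochs, and it keeps w = Σ y_i α_i x_i up to date so that ∇_i f(α) = y_i w^T x_i − 1 costs O(n). That bookkeeping is a common device for linear SVM and is not stated on the slide.

```python
def dual_cd_l1_svm(X, Y, C, epochs=50):
    """Dual coordinate descent for L1-loss linear SVM on dense data."""
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n                                   # w = sum_i y_i alpha_i x_i
    Qii = [sum(v * v for v in x) for x in X]        # Q_ii = x_i^T x_i because y_i^2 = 1
    for _ in range(epochs):
        for i in range(l):
            if Qii[i] == 0.0:
                continue                            # an all-zero instance cannot be updated
            # grad_i f(alpha) = (Q alpha)_i - 1 = y_i w^T x_i - 1
            g = Y[i] * sum(a * b for a, b in zip(w, X[i])) - 1.0
            # Unconstrained step, then clip to the box [0, C]
            new_alpha = min(max(alpha[i] - g / Qii[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [w_j + delta * Y[i] * x_j for w_j, x_j in zip(w, X[i])]
                alpha[i] = new_alpha
    return alpha, w

# Hypothetical linearly separable toy data.
X = [[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]]
Y = [1, 1, -1, -1]
alpha, w = dual_cd_l1_svm(X, Y, C=1.0)
```

On this toy problem only the first instance ends up with a nonzero α; the other three already satisfy the margin and stay at zero.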

(23)

Comparisons

L2-loss SVM is used

DCDL2: Dual coordinate descent

DCDL2-S: DCDL2 with shrinking

PCD: Primal coordinate descent

TRON: Trust region Newton method

This result is from Hsieh et al. (2008)

(24)

Objective values (Time in Seconds)

[Figure: objective values versus training time in seconds; panels: news20, rcv1, yahoo-japan, yahoo-korea]

(25)

Low- versus High-order Methods

• We saw that low-order methods quickly give a model. However, high-order methods may be useful for difficult situations

• An example: # instances: 32,561, # features: 123

[Figure: two panels, objective value and accuracy]

# features is small ⇒ solving primal is more suitable

(26)

Sample applications

Dependency parsing using feature combination

Transportation-mode detection in a sensor hub


(28)

Sample applications Dependency parsing using feature combination

Dependency Parsing: an NLP Application

                      Kernel                 Linear
                   RBF      Poly-2      Linear    Poly-2
Training time   3h34m53s   3h21m51s     3m36s     3m43s
Parsing speed       0.7x         1x     1652x      103x
UAS                89.92      91.67     89.11     91.71
LAS                88.55      90.60     88.07     90.71

We get faster training/testing while maintaining good accuracy

But how to achieve this?

(29)

Linear Methods to Explicitly Train φ(x_i)

Example: low-degree polynomial mapping:

$$\phi(x) = [1, x_1, \ldots, x_n, x_1^2, \ldots, x_n^2, x_1 x_2, \ldots, x_{n-1} x_n]^T$$

For this mapping, # features = O(n²)

Recall O(n) for linear versus O(nl) for kernel

Now O(n²) versus O(nl)

Sparse data

n ⇒ n̄, average # non-zeros for sparse data

n̄ ≪ n ⇒ O(n̄²) may be much smaller than O(l n̄)
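To see why sparsity helps, the degree-2 mapping can be generated from the nonzero entries only. This is a minimal sketch assuming a sparse instance stored as a Python dict {feature_index: value}; the key layout of the output is a hypothetical choice for illustration.

```python
def poly2_sparse(x):
    """Degree-2 expansion of a sparse vector given as {feature_index: value}.

    Output keys: () for the constant 1, (i,) for x_i, and (i, j) with i <= j
    for the product x_i * x_j. Only nonzero features are generated.
    """
    idx = sorted(x)
    phi = {(): 1.0}
    for i in idx:
        phi[(i,)] = x[i]
    for a, i in enumerate(idx):
        for j in idx[a:]:
            phi[(i, j)] = x[i] * x[j]
    return phi

# Hypothetical instance: 3 nonzeros in a feature space with tens of thousands of features.
x = {3: 1.0, 17: 2.0, 40000: 0.5}
phi = poly2_sparse(x)
```

With n̄ nonzeros the expansion has (n̄ + 1)(n̄ + 2)/2 entries regardless of n, so the work per instance depends on n̄² rather than n².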

(30)


Handling High Dimensionality of φ(x)

A multi-class problem with sparse data

n         Dim. of φ(x)     l          n̄       w's # nonzeros
46,155    1,065,165,090    204,582    13.3    1,438,456

n̄: average # nonzeros per instance

Degree-2 polynomial is used

Dimensionality of w is very high, but w is sparse

Some training feature columns of x_ix_j are entirely zero

Hashing techniques are used to handle sparse w
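One way to read "hashing techniques" is to keep only the nonzero entries of w in a hash table keyed by the feature pair. The sketch below uses a Python dict for this; whether it matches the data structure of the original work is an assumption, and the weight value is made up. As a side check, n(n + 1)/2 with n = 46,155 reproduces the 1,065,165,090 reported as the dimensionality of φ(x), which suggests (but does not prove) that the products x_i x_j with i ≤ j are what is being counted.

```python
def score_poly2(x, w_sparse):
    """Sum w[(i, j)] * x_i * x_j over the nonzero pairs (i <= j) of a sparse instance."""
    idx = sorted(x)
    total = 0.0
    for a, i in enumerate(idx):
        for j in idx[a:]:
            # Pairs never stored in the table have weight zero.
            total += w_sparse.get((i, j), 0.0) * x[i] * x[j]
    return total

n = 46155
num_pairs = n * (n + 1) // 2        # number of products x_i x_j with i <= j

w_sparse = {(3, 17): 0.5}           # hypothetical model with a single nonzero weight
score = score_poly2({3: 1.0, 17: 2.0}, w_sparse)
```

Memory then scales with the number of nonzero weights (about 1.4 million in the table) instead of the full dimensionality (about one billion).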

(31)

Discussion

See more details in Chang et al. (2010)

If φ(x) is too high dimensional, people have proposed projection or hashing techniques to use fewer features as approximations

Examples: Kar and Karnick (2012); Pham and Pagh (2013)

This has been used in computational advertising (Chapelle et al., 2014)

(32)

Sample applications Transportation-mode detection in a sensor hub


(33)

Example: Classifier in a Small Device

In a sensor application (Yu et al., 2013), the classifier can use less than 16KB of RAM

Classifiers            Test accuracy   Model Size
Decision Tree          77.77           76.02KB
AdaBoost (10 trees)    78.84           1,500.54KB
SVM (RBF kernel)       85.33           1,287.15KB

Number of features: 5

We consider a degree-3 polynomial mapping

$$\text{dimensionality} = \binom{5+3}{3} + \text{bias term} = 57.$$

(34)


Example: Classifier in a Small Device

One-against-one strategy for 5-class classification

$$\binom{5}{2} \times 57 \times 4 \text{ bytes} = 2.28\text{KB}$$

Assume single precision
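The model-size arithmetic can be reproduced in a few lines. The variable names are illustrative, and reading 1KB as 1000 bytes is an assumption inferred from the 2.28KB figure.

```python
import math

num_features = 5
degree = 3
num_classes = 5
bytes_per_weight = 4        # single precision

# C(5 + 3, 3) = 56 mapped features, plus one bias term, as on the slide
dim = math.comb(num_features + degree, degree) + 1

# One-against-one trains one binary model per pair of classes
num_binary_models = math.comb(num_classes, 2)

model_bytes = num_binary_models * dim * bytes_per_weight
model_kb = model_bytes / 1000   # assumes 1KB = 1000 bytes
```

This gives 10 binary models of 57 weights each, well under the 16KB RAM budget mentioned for the sensor hub.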

Results

SVM method          Test accuracy   Model Size
RBF kernel          85.33           1,287.15KB
Polynomial kernel   84.79           2.28KB
Linear kernel       78.51           0.24KB


(36)

Big-data linear classification

Big-data Linear Classification

Nowadays data can easily be larger than memory capacity

Disk-level linear classification: Yu et al. (2012) and subsequent developments

Distributed linear classification: recently an active research topic

Example: we can parallelize the 2nd-order method discussed earlier. Recall the Hessian-vector product

∇²f (w )s = s + C · X^T(D(X s))

(37)

Parallel Hessian-vector Product

Hessian-vector products are the computational bottleneck

X^TDX s

Data matrix X is now stored in a distributed manner

X₁: node 1

X₂: node 2

...

X_p: node p

$$X^T D X s = X_1^T D_1 X_1 s + \cdots + X_p^T D_p X_p s$$
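The decomposition can be checked on one machine by splitting the rows into blocks and summing the partial products. This is a minimal sketch with hypothetical data and two simulated nodes; the final sum stands in for the communication step a real MPI or Spark implementation would perform.

```python
def xtdx_s(X, D, s):
    """Compute X^T D X s for a dense block X (list of rows) and diagonal entries D."""
    out = [0.0] * len(s)
    for row, d in zip(X, D):
        coeff = d * sum(a * b for a, b in zip(row, s))   # d_i * (x_i^T s)
        for j, a in enumerate(row):
            out[j] += coeff * a
    return out

# Hypothetical data: 4 instances, 2 features.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
D = [0.25, 0.20, 0.10, 0.50]
s = [1.0, -1.0]

# Instance-wise split over p = 2 simulated nodes.
blocks = [(X[:2], D[:2]), (X[2:], D[2:])]
partials = [xtdx_s(Xk, Dk, s) for Xk, Dk in blocks]   # each node uses only its own rows
combined = [sum(vals) for vals in zip(*partials)]     # stands in for summing n numbers across nodes
full = xtdx_s(X, D, s)
```

Each simulated node returns a vector of length n, so the amount of data to combine does not grow with the number of instances held on that node.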

(38)

Instance-wise and Feature-wise Data Splits

[Figure: instance-wise split (X_iw,1, X_iw,2, X_iw,3) and feature-wise split (X_fw,1, X_fw,2, X_fw,3)]

We won’t have time to get into details. But their communication cost is different

Data moved per Hessian-vector product

Instance-wise: O(n), Feature-wise: O(l)

(39)

Discussion: Distributed Training or Not?

One can always subsample data to one machine for deep analysis

Deciding whether to do distributed classification is itself an issue

In some areas distributed training has been successfully applied

One example is CTR (click-through rate) prediction in computational advertising

(40)

Discussion: Platform Issues

For the above-mentioned Newton methods, we have MPI and Spark implementations

We are preparing the integration into Spark MLlib

Other existing distributed linear classifiers include Vowpal Wabbit from Yahoo!/Microsoft and Sibyl from Google

Platforms such as Spark are still changing rapidly. This is a bit annoying

A careful implementation may sometimes be thousands of times faster than a casual one

(41)

Discussion: Design of Distributed Algorithms

On one computer, often we do batch rather than online learning

Online and streaming learning may be more useful for big-data applications

The example (Newton method) we showed is a synchronous parallel algorithm

Maybe asynchronous ones are better for big data?

(42)

Conclusions


(43)

Resources on Linear Classification

• Since 2007, we have been actively developing the software LIBLINEAR for linear classification

www.csie.ntu.edu.tw/~cjlin/liblinear

• A distributed extension (MPI and Spark) is now available

• An earlier survey on linear classification is Yuan et al. (2012): Recent Advances of Large-scale Linear Classification. Proceedings of the IEEE, 2012

It contains many references on this subject

(44)

Conclusions

Linear classification is an old topic, but recently there have been new and interesting applications

Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others

Linear classification will continue to be used in situations ranging from small-model to big-data applications

(45)

Acknowledgments

Many students have contributed to our research on large-scale linear classification

We also thank the National Science Council of Taiwan for partial support