Chih-Jen Lin
Department of Computer Science, National Taiwan University
Talk at Criteo, August 1, 2014
Outline
Introduction
Optimization methods
Extension of linear classification
Big-data linear classification
Conclusions and future directions
Introduction
Linear and Nonlinear Classification
(Figure: a linear separator versus a nonlinear separator on the same data)
Linear: a linear function to separate data in the original input space; nonlinear: data mapped to other spaces
Example of a nonlinear mapping:
Original features: [height, weight]
Mapped features: [height, weight, weight/height²]
Linear and Nonlinear Classification (Cont’d)
Methods such as SVM and logistic regression can be used in two ways
• Kernel methods: data mapped to another space, x ⇒ φ(x)
φ(x)^T φ(y) is easily calculated; no good control on φ(·)
• Linear classification + feature engineering:
Directly use x without mapping. But x may have been carefully generated using some nonlinear information. Full control on x
We will focus on the second type of approach in this talk
Why Linear Classification?
• If φ(x) is high dimensional, the decision function sgn(w^T φ(x)) is expensive to evaluate
• Kernel methods:
w ≡ ∑_{i=1}^{l} α_i φ(x_i) for some α, K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)
New decision function: sgn(∑_{i=1}^{l} α_i K(x_i, x))
• Special φ(x) so that calculating K(x_i, x_j) is easy. Example:
K(x_i, x_j) ≡ (x_i^T x_j + 1)² = φ(x_i)^T φ(x_j), φ(x) ∈ R^{O(n²)}
Why Linear Classification? (Cont’d)
Prediction:
w^T x versus ∑_{i=1}^{l} α_i K(x_i, x)
If computing K(x_i, x_j) takes O(n), then
O(n) versus O(nl)
Kernel: cost related to size of training data
Linear: cheaper and simpler
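To make this cost difference concrete, below is a minimal NumPy sketch (an illustration, not LIBLINEAR code; the function names and the degree-2 polynomial kernel are assumptions for the example) comparing the two prediction rules.

    import numpy as np

    def predict_linear(w, x):
        # O(n): a single inner product with the weight vector
        return np.sign(w @ x)

    def predict_kernel(alpha, X_train, x):
        # O(nl): one kernel evaluation per training instance, here the
        # degree-2 polynomial kernel K(x_i, x) = (x_i^T x + 1)^2
        K = (X_train @ x + 1.0) ** 2      # l kernel values, each costing O(n)
        return np.sign(alpha @ K)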
Linear is Useful in Some Places
For certain problems, accuracy by linear is as good as by nonlinear
But training and testing are much faster
Especially document classification:
Number of features (bag-of-words model) is very large
Data are large and sparse
Training millions of instances takes just a few seconds
Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)
                Linear                     RBF kernel
Data set        Time (s)   Accuracy (%)    Time (s)    Accuracy (%)
MNIST38         0.1        96.82           38.1        99.70
ijcnn1          1.6        91.81           26.8        98.69
covtype         1.4        76.37           46,695.8    96.11
news20          1.1        96.95           383.2       96.90
real-sim        0.3        97.44           938.3       97.82
yahoo-japan     3.1        92.63           20,955.2    93.31
webspam         25.7       93.35           15,681.8    99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Binary Linear Classification
Training data {(y_i, x_i)}, x_i ∈ R^n, y_i = ±1, i = 1, . . . , l
l: # of data, n: # of features
min_w f(w), f(w) ≡ w^T w / 2 + C ∑_{i=1}^{l} ξ(w; x_i, y_i)
w^T w / 2: regularization term (we have no time to talk about L1 regularization here)
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²,       (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): losses (1)-(2)
Logistic regression (LR): loss (3); no reference because it can be traced back to the 19th century
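As a small illustration, the three losses and the regularized objective f(w) above can be written directly in NumPy (a sketch with illustrative names, assuming dense data; this is not how LIBLINEAR evaluates them internally).

    import numpy as np

    def xi_L1(w, x, y):                   # L1 (hinge) loss, Eq. (1)
        return max(0.0, 1.0 - y * (w @ x))

    def xi_L2(w, x, y):                   # L2 (squared hinge) loss, Eq. (2)
        return max(0.0, 1.0 - y * (w @ x)) ** 2

    def xi_LR(w, x, y):                   # logistic loss, Eq. (3)
        return np.log1p(np.exp(-y * (w @ x)))

    def f(w, X, y, C, loss=xi_L2):
        # f(w) = w^T w / 2 + C * sum_i xi(w; x_i, y_i)
        return 0.5 * (w @ w) + C * sum(loss(w, xi, yi) for xi, yi in zip(X, y))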
Loss Functions (Cont’d)
(Figure: the three losses ξ_L1, ξ_L2, and ξ_LR plotted as functions of −y w^T x)
Their performance is usually similar
Loss Functions (Cont’d)
However,
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Optimization Methods
Many unconstrained optimization methods can be applied
For kernel methods, optimization is over a variable α, where
w = ∑_{i=1}^{l} α_i φ(x_i)
We cannot minimize over w because it may be infinite dimensional
However, for linear, minimizing over w or α is ok
Optimization Methods (Cont’d)
Among unconstrained optimization methods,
Low-order methods: quickly get a model, but slow final convergence
High-order methods: more robust and useful for ill-conditioned situations
We will quickly discuss some examples and show that both types of optimization methods are useful for linear classification
Optimization: 2nd Order Methods
Newton direction
min_s ∇f(w^k)^T s + (1/2) s^T ∇²f(w^k) s
This is the same as solving the Newton linear system
∇²f(w^k) s = −∇f(w^k)
Hessian matrix ∇²f(w^k) is too large to be stored
∇²f(w^k): n × n, n: number of features
But the Hessian has a special form:
∇²f(w) = I + C X^T D X,
Optimization: 2nd Order Methods (Cont’d)
X: data matrix, D: diagonal. For logistic regression,
D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})²
Using CG to solve the linear system, only Hessian-vector products are needed:
∇²f(w) s = s + C · X^T (D (X s))
Therefore, we have a Hessian-free approach
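Below is a rough sketch of one such Newton step for logistic regression, using SciPy's conjugate gradient with a matrix-free operator (dense X and illustrative names are assumed; the real TRON solver in LIBLINEAR adds a trust region, step control, and sparse-data handling).

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def newton_step(w, X, y, C):
        l, n = X.shape
        z = y * (X @ w)                             # y_i w^T x_i for all i
        sigma = 1.0 / (1.0 + np.exp(-z))
        grad = w + C * (X.T @ ((sigma - 1.0) * y))  # gradient of f(w)
        D = sigma * (1.0 - sigma)                   # D_ii = e^{-z_i} / (1 + e^{-z_i})^2

        def hess_vec(s):
            # Hessian-vector product s + C * X^T (D (X s)); the Hessian is never formed
            return s + C * (X.T @ (D * (X @ s)))

        H = LinearOperator((n, n), matvec=hess_vec)
        s, _ = cg(H, -grad, maxiter=50)             # approximately solve the Newton system
        return w + s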
Optimization: 1st Order Methods
We consider the L1 loss and the dual SVM problem
min_α f(α) subject to 0 ≤ α_i ≤ C, ∀i,
where
f(α) ≡ (1/2) α^T Q α − e^T α
and
Q_ij = y_i y_j x_i^T x_j, e = [1, . . . , 1]^T
We will apply coordinate descent (CD) methods
The situation for the L2 or LR loss is very similar
1st Order Methods (Cont’d)
Coordinate descent: a simple and classic technique
Change one variable at a time
Given the current α, let e_i = [0, . . . , 0, 1, 0, . . . , 0]^T
min_d f(α + d e_i) = (1/2) Q_ii d² + ∇_i f(α) d + constant
Without constraints, the optimal d = −∇_i f(α) / Q_ii
With the constraint 0 ≤ α_i + d ≤ C:
α_i ← min( max( α_i − ∇_i f(α) / Q_ii, 0 ), C )
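A minimal sketch of this dual CD procedure for the L1 loss is given below (NumPy, dense data, illustrative names; it maintains w = ∑_i α_i y_i x_i so that ∇_i f(α) = y_i w^T x_i − 1 costs O(n); the shrinking and random permutation of indices used in LIBLINEAR are omitted).

    import numpy as np

    def dual_cd_l1_svm(X, y, C, n_epochs=10):
        l, n = X.shape
        alpha = np.zeros(l)
        w = np.zeros(n)                        # w = sum_i alpha_i y_i x_i
        Qii = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i
        for _ in range(n_epochs):
            for i in range(l):
                if Qii[i] == 0.0:
                    continue
                G = y[i] * (w @ X[i]) - 1.0    # gradient of f along coordinate i
                new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (new_ai - alpha[i]) * y[i] * X[i]   # keep w consistent with alpha
                alpha[i] = new_ai
        return w, alpha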
Comparisons
L2-loss SVM is used
DCDL2: Dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: Primal coordinate descent
TRON: Trust region Newton method
This result is from Hsieh et al. (2008)
Objective values (Time in Seconds)
(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)
Low- versus High-order Methods
• We saw that low-order methods are efficient to obtain a model. However, high-order methods may be useful for difficult situations
• An example: # instances: 32,561, # features: 123
(Figures: objective value and accuracy versus training time on this set)
# features is small ⇒ solving primal is more suitable
Extension of Linear Classification
Linear classification can be extended in different ways
An important one is to approximate nonlinear classifiers
Goal: the better accuracy of nonlinear methods, but with faster training/testing
Examples
1. Explicit data mappings + linear classification
2. Kernel approximation + linear classification
I will focus on the first
Linear Methods to Explicitly Train φ(x_i)
Example: low-degree polynomial mapping:
φ(x) = [1, x_1, . . . , x_n, x_1², . . . , x_n², x_1x_2, . . . , x_{n−1}x_n]^T
For this mapping, # features = O(n²)
When is it useful?
Recall O(n) for linear versus O(nl) for kernel
Now O(n²) versus O(nl)
Sparse data
n ⇒ n̄, the average # of non-zeros per instance for sparse data
If n̄ ≪ n, then O(n̄²) may be much smaller than O(l n̄)
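The following small sketch (illustrative names; only the pairwise terms of φ(x) are shown) makes the point explicit: for one sparse instance we enumerate pairs of its n̄ nonzero features, so the cost is O(n̄²) rather than O(n²).

    from itertools import combinations_with_replacement

    def degree2_pairs(indices, values):
        # indices/values: nonzero feature ids of one instance and their values.
        # Returns the pairwise features x_i * x_j; cost O(n_bar^2), n_bar = len(indices).
        feats = {}
        for (i, vi), (j, vj) in combinations_with_replacement(list(zip(indices, values)), 2):
            feats[(i, j)] = vi * vj
        return feats

    # Example: an instance with 3 nonzeros out of a huge feature space
    print(degree2_pairs([3, 10, 47], [1.0, 0.5, 2.0]))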
Example: Dependency Parsing
A multi-class problem with sparse data
n        Dim. of φ(x)     l         n̄      w's # nonzeros
46,155   1,065,165,090    204,582   13.3    1,438,456

n̄: average # nonzeros per instance
Degree-2 polynomial is used
Dimensionality of w is very high, but w is sparse
Some training feature columns of x_i x_j are entirely zero
Hashing techniques are used to handle sparse w
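One common way to do this is feature hashing: map each pair (i, j) into a fixed number of buckets so that w needs only one weight per bucket. The sketch below uses illustrative names and an assumed bucket count, and is not necessarily the exact scheme of the cited parsing work.

    import numpy as np

    NUM_BUCKETS = 2 ** 22                      # size of the hashed weight vector

    def pair_bucket(i, j):
        # Map the pair feature (i, j) to a fixed-size index; collisions are accepted
        return hash((i, j)) % NUM_BUCKETS

    def decision_value_hashed(w_hashed, indices, values):
        # Decision value using hashed degree-2 features of one sparse instance
        s = 0.0
        for a in range(len(indices)):
            for b in range(a, len(indices)):
                s += w_hashed[pair_bucket(indices[a], indices[b])] * values[a] * values[b]
        return s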
Example: Dependency Parsing (Cont’d)
               LIBSVM                 LIBLINEAR
               RBF        Poly        Linear     Poly
Training time  3h34m53s   3h21m51s    3m36s      3m43s
Parsing speed  0.7x       1x          1652x      103x
UAS            89.92      91.67       89.11      91.71
LAS            88.55      90.60       88.07      90.71
We get faster training/testing, but maintain good accuracy
See detailed discussion in Chang et al. (2010)
Discussion
In the above example, we use all pairs
This is fine for some applications, but # features may become too large
People have proposed projection or hashing techniques to use fewer features as approximations
Examples: Kar and Karnick (2012); Pham and Pagh (2013)
This has been used in computational advertising (Chapelle et al., 2014)
Big-data Linear Classification
Nowadays data can be easily larger than memory capacity
Disk-level linear classification: Yu et al. (2012) and subsequent developments
Distributed linear classification: recently an active research topic
Example: we can parallelize the 2nd-order method discussed earlier. Recall the Hessian-vector product
∇²f(w) s = s + C · X^T (D (X s))
Parallel Hessian-vector Product
Hessian-vector products are the computational bottleneck
X^T D X s
The data matrix X is now stored in a distributed manner
(Figure: X is partitioned by instances into blocks X_1, X_2, . . . , X_p, stored on nodes 1, 2, . . . , p)
X^T D X s = X_1^T D_1 X_1 s + · · · + X_p^T D_p X_p s
Parallel Hessian-vector Product (Cont’d)
We use allreduce to let every node get X^T D X s
(Figure: node i computes X_i^T D_i X_i s; the allreduce operation sums these vectors and sends the result X^T D X s back to every node)
Allreduce: reducing all vectors (X_i^T D_i X_i s, ∀i) to a single vector (X^T D X s ∈ R^n) and then sending the result to every node
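A rough sketch of this step with mpi4py is shown below (an assumption for illustration; the distributed LIBLINEAR extension itself is implemented in C/C++ on top of MPI). Each node holds only its own block X_i and D_i.

    import numpy as np
    from mpi4py import MPI

    def distributed_hessian_vector(Xi, Di, s, C):
        # Xi, Di: this node's block of the data matrix and of the diagonal D
        comm = MPI.COMM_WORLD
        local = Xi.T @ (Di * (Xi @ s))             # X_i^T D_i X_i s, computed locally
        total = np.empty_like(local)
        comm.Allreduce(local, total, op=MPI.SUM)   # every node receives the summed vector
        return s + C * total                       # s + C * X^T D X s on every node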
Instance-wise and Feature-wise Data Splits
(Figure: instance-wise split X_{iw,1}, X_{iw,2}, X_{iw,3} versus feature-wise split X_{fw,1}, X_{fw,2}, X_{fw,3})
Feature-wise: each machine calculates part of the Hessian-vector product
(∇²f(w) v)_{fw,1} = v_1 + C X_{fw,1}^T D (X_{fw,1} v_1 + · · · + X_{fw,p} v_p)
Instance-wise and Feature-wise Data Splits (Cont’d)
X_{fw,1} v_1 + · · · + X_{fw,p} v_p ∈ R^l must be available on all nodes (by allreduce)
Data moved per Hessian-vector product:
Instance-wise: O(n), Feature-wise: O(l)
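For the feature-wise split, a corresponding sketch (again mpi4py for illustration, with illustrative names; D is assumed to be already available on every node) communicates an l-dimensional vector instead.

    import numpy as np
    from mpi4py import MPI

    def featurewise_hessian_vector(Xi_fw, D, vi, C):
        # Xi_fw: this node's columns of X; vi: this node's part of v
        comm = MPI.COMM_WORLD
        local = Xi_fw @ vi                       # X_{fw,i} v_i, a vector of length l
        z = np.empty_like(local)
        comm.Allreduce(local, z, op=MPI.SUM)     # z = sum_p X_{fw,p} v_p on every node
        return vi + C * (Xi_fw.T @ (D * z))      # this node's part of the Hessian-vector product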
Experiments
Two sets:
Data set   l         n            #nonzeros
epsilon    400,000   2,000        800,000,000
webspam    350,000   16,609,143   1,304,697,446

For results of more sets, see Zhuang et al. (2014)
We use Amazon AWS
We compare
1. TRON: Trust-region Newton method
2. ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
Experiments (Cont’d)
(Figures: relative function value difference versus training time in seconds on epsilon and webspam, comparing ADMM-IW, ADMM-FW, TRON-IW, and TRON-FW)
16 machines are used
Horizontal line: where test accuracy has stabilized
TRON has faster convergence than ADMM
Instance-wise and feature-wise splits are useful for l ≫ n and l ≪ n, respectively
Programming Frameworks
We use MPI for the above experiments
How about others like MapReduce?
MPI is more efficient, but has no fault tolerance
In contrast, MapReduce is slow for iterative algorithms due to heavy disk I/O
Many new frameworks are being actively developed:
1. Spark (Zaharia et al., 2010)
2. REEF (Chun et al., 2013)
Selecting suitable frameworks for distributed classification isn’t that easy!
A Comparison Between MPI and Spark
(Figure: relative function value difference (log scale) versus training time in seconds, comparing Spark LIBLINEAR, Spark LIBLINEAR-m, and MPI LIBLINEAR)
We use the data set epsilon (8 nodes). Spark is slower, but in general competitive
Conclusions and future directions
Resources on Linear Classification
Since 2007, we have been actively developing the software LIBLINEAR for linear classification:
www.csie.ntu.edu.tw/~cjlin/liblinear
It is now widely used in Internet companies
An earlier survey on linear classification is Yuan et al. (2012),
"Recent Advances of Large-scale Linear Classification", Proceedings of the IEEE, 2012
It contains many references on this subject
Distributed LIBLINEAR
We recently released an extension of LIBLINEAR for distributed classification
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI and Spark
The development is still in an early stage. Your comments are very welcome.
Conclusions
Linear classification is an old topic, but recently there are new and interesting applications
Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others
Advantages of linear: easier feature engineering
We expect that linear classification can be widely used in situations ranging from small-model to big-data classification
Acknowledgments
Many students have contributed to our research on large-scale linear classification
We also thank the National Science Council of Taiwan for its partial support