### Recent Advances in Large Linear Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at NEC Labs, August 26, 2011

Chih-Jen Lin (National Taiwan Univ.) 1 / 44

This talk is based on our recent survey paper, invited by the Proceedings of the IEEE:

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It is also related to our development of the software LIBLINEAR:

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we give overviews instead of deep technical details.


### Outline

- Introduction
- Binary linear classification
- Multi-class linear classification
- Applications in non-standard scenarios
- Data beyond memory capacity
- Discussion and conclusions


Introduction

### Linear and Nonlinear Classification

(Figure: examples of linearly and nonlinearly separable data)

By linear we mean that the data are not mapped to a higher-dimensional space.

Original features: [height, weight]

Nonlinear mapping: [height, weight, weight/height^{2}]


### Linear and Nonlinear Classification (Cont’d)

Given training data {(y_{i}, x_{i})}, x_{i} ∈ R^{n}, y_{i} = ±1, i = 1, . . . , l

l : # of data, n: # of features

Linear: find (w, b) such that the decision function is

sgn(w^{T}x + b)

Nonlinear: map data to φ(x_{i}). The decision function becomes

sgn(w^{T}φ(x) + b)

Hereafter b is omitted.


### Why Linear Classification?

• If φ(x) is high dimensional, computing w^{T}φ(x) is expensive

• Kernel methods:

w ≡ Σ^{l}_{i=1} α_{i}φ(x_{i}) for some α, K(x_{i}, x_{j}) ≡ φ(x_{i})^{T}φ(x_{j})

New decision function: sgn(Σ^{l}_{i=1} α_{i}K(x_{i}, x))

• Special φ(x) so that calculating K(x_{i}, x_{j}) is easy

• Example:

K(x_{i}, x_{j}) ≡ (x^{T}_{i}x_{j} + 1)^{2} = φ(x_{i})^{T}φ(x_{j}), φ(x) ∈ R^{O(n^{2})}
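The degree-2 example can be checked numerically: the kernel value equals the inner product of the explicit mappings. A minimal sketch (not from the talk; `phi` and `K` are illustrative names, and the mapping uses the standard √2 scaling of linear and cross terms):

```python
import itertools
import math

def phi(x):
    """Explicit degree-2 mapping for K(x, z) = (x^T z + 1)^2.
    Dimension is O(n^2): constant, linear, squared, and cross terms."""
    n = len(x)
    feats = [1.0]
    feats += [math.sqrt(2) * xi for xi in x]            # sqrt(2) * x_i
    feats += [xi * xi for xi in x]                      # x_i^2
    feats += [math.sqrt(2) * x[i] * x[j]                # sqrt(2) * x_i * x_j, i < j
              for i, j in itertools.combinations(range(n), 2)]
    return feats

def K(x, z):
    """Kernel form: cheap, O(n)."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2

x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = K(x, z)                                           # 5.5^2 = 30.25
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(abs(lhs - rhs) < 1e-9)                            # the two computations agree
```

The cheap O(n) kernel evaluation on the left is exactly why special φ(x) are chosen.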


### Why Linear Classification? (Cont’d)

Prediction:

w^{T}x versus Σ^{l}_{i=1} α_{i}K(x_{i}, x)

If computing K(x_{i}, x_{j}) takes O(n), then prediction costs

O(n) versus O(nl)

Nonlinear: more powerful in separating data

Linear: cheaper and simpler


### Linear is Useful in Some Places

For certain problems, the accuracy of linear classifiers is as good as nonlinear ones, but training and testing are much faster.

This is especially true for document classification, where the number of features (bag-of-words model) is very large.

Recently linear classification has become a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008).

They focus on large sparse data.

There are many other recent papers and software packages.


### Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)

| Data set | Linear: training time | Linear: accuracy | RBF kernel: training time | RBF kernel: accuracy |
|---|---|---|---|---|
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Sizes are reasonably large: e.g., yahoo-japan has 140k instances and 830k features.


Binary linear classification

### Binary Linear Classification

Training data {(y_{i}, x_{i})}, x_{i} ∈ R^{n}, y_{i} = ±1, i = 1, . . . , l

l : # of data, n: # of features

min_{w} r(w) + C Σ^{l}_{i=1} ξ(w; x_{i}, y_{i})

r(w): regularization term

ξ(w; x, y): loss function; we hope y w^{T}x > 0

C : regularization parameter


### Loss Functions

Some commonly used ones:

ξ_{L1}(w; x, y) ≡ max(0, 1 − y w^{T}x), (1)
ξ_{L2}(w; x, y) ≡ max(0, 1 − y w^{T}x)^{2}, and (2)
ξ_{LR}(w; x, y) ≡ log(1 + e^{−y w^{T}x}). (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): losses (1)-(2)

Logistic regression (LR): loss (3)
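The three losses (1)-(3) are straightforward to compute; a small sketch with illustrative helper names:

```python
import math

def loss_l1(w, x, y):
    """Hinge loss, Eq. (1): max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def loss_l2(w, x, y):
    """Squared hinge loss, Eq. (2)."""
    return loss_l1(w, x, y) ** 2

def loss_lr(w, x, y):
    """Logistic loss, Eq. (3): log(1 + exp(-y * w^T x))."""
    return math.log(1.0 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x))))

w, x = [0.5, -0.25], [2.0, 2.0]        # w^T x = 0.5
print(loss_l1(w, x, 1), loss_l2(w, x, 1))  # 0.5 0.25
```

Note that the logistic loss needs exp/log, which matters for the cost discussion later.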


### Loss Functions (Cont’d)

(Figure: the three loss functions ξ_{L1}, ξ_{L2}, and ξ_{LR} plotted against −y w^{T}x)

They are similar.


### Regularization

L1 versus L2:

‖w‖_{1} and w^{T}w/2

w^{T}w/2: smooth, easier to optimize

‖w‖_{1}: non-differentiable; sparse solution, possibly with many zero elements

Possible advantages of L1 regularization:

- Feature selection
- Less storage for w


### Training Linear Classifiers

Many recent developments; we won't show details here.

Why is training linear faster than nonlinear?

Recall the O(n) versus O(nl) difference in prediction:

w^{T}x and Σ^{l}_{i=1} α_{i}K(x_{i}, x)

n: # features, l : # data

A similar situation happens during training:

Σ^{l}_{t=1} α_{t}x^{T}_{i}x_{t} is often needed ⇒ O(nl) (4)


### Training Linear Classifiers (Cont’d)

By maintaining

u ≡ Σ^{l}_{t=1} y_{t}α_{t}x_{t},

the product u^{T}x_{i} costs only O(n).

u: an intermediate variable during training; it eventually approaches the final weight vector w

Key: we are able to store x_{t}, ∀t and maintain u

Nonlinear: we can't store φ(x_{t})

For linear, basically any optimization method can be applied
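The maintenance of u can be sketched with a toy dual coordinate descent for the L1-loss SVM, in the spirit of Hsieh et al. (2008); the function and the toy data below are illustrative, not LIBLINEAR's actual implementation:

```python
import random

def dual_cd_linear_svm(X, y, C=1.0, iters=20, seed=0):
    """Sketch of dual coordinate descent for the L1-loss linear SVM.
    Maintains u = sum_t y_t * alpha_t * x_t so each update costs O(n)."""
    rng = random.Random(seed)
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    u = [0.0] * n                                   # maintained vector; becomes w
    Qii = [sum(v * v for v in xi) for xi in X]      # x_i^T x_i
    for _ in range(iters):
        for i in rng.sample(range(l), l):           # random permutation of indices
            # gradient of the dual objective w.r.t. alpha_i: y_i * u^T x_i - 1
            G = y[i] * sum(uj * xj for uj, xj in zip(u, X[i])) - 1.0
            d = min(max(alpha[i] - G / Qii[i], 0.0), C) - alpha[i]
            if d != 0.0:
                alpha[i] += d
                for j in range(n):                  # O(n) update of u
                    u[j] += d * y[i] * X[i][j]
    return u

# toy separable data
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w = dual_cd_linear_svm(X, y)
print(all(yi * sum(wj * xj for wj, xj in zip(w, xv)) > 0
          for xv, yi in zip(X, y)))                 # training points correctly classified
```

Each coordinate update touches only one x_{i}, so the whole pass costs O(ln̄) for sparse data.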


### Choosing a Training Algorithm

Data property:

- # instances ≫ # features, or the other way around

Primal or dual

First-order or higher-order methods

- First-order methods are now slightly preferred, as we seldom need an accurate optimization solution

Cost of operations:

- exp/log operations are more expensive; avoid them when training LR

Others


### L1 Regularization

Non-differentiable: needs non-smooth optimization techniques, so it is difficult to apply sophisticated methods.

Currently, coordinate descent and Newton methods with coordinate descent are among the most efficient (Yuan et al., 2010; Friedman et al., 2010; Yuan et al., 2011)
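To illustrate the coordinate-descent idea for L1 regularization, here is a sketch for L1-regularized least squares (the Lasso) rather than classification; the same soft-thresholding machinery underlies the cited solvers, but the function names and toy data are illustrative:

```python
def soft_threshold(z, lam):
    """Shrink z toward 0 by lam; the source of exact zeros in the solution."""
    return (z - lam) if z > lam else (z + lam) if z < -lam else 0.0

def lasso_cd(X, y, lam=0.1, iters=100):
    """Cyclic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
    Keeps the residual r = y - Xw up to date so each coordinate step is cheap."""
    l, n = len(X), len(X[0])
    w = [0.0] * n
    r = list(y)                                         # residual with w = 0
    for _ in range(iters):
        for j in range(n):
            xj_sq = sum(X[i][j] ** 2 for i in range(l))
            # correlation of feature j with the partial residual
            rho = sum(X[i][j] * (r[i] + w[j] * X[i][j]) for i in range(l))
            w_new = soft_threshold(rho, lam) / xj_sq
            delta = w_new - w[j]
            if delta:
                for i in range(l):
                    r[i] -= delta * X[i][j]
                w[j] = w_new
    return w

w = lasso_cd([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 0.0, 2.0], lam=0.1)
print(w)  # sparse: the second coefficient is exactly zero
```

The soft-threshold step is what produces exact zeros, i.e., the sparsity discussed above.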


Multi-class linear classification

### Solving Several Binary Problems

Same methods as for nonlinear classification, but there are some subtle differences.

One-vs-rest:

w_{m}: class m positive; others negative

class of x ≡ arg max_{m=1,...,k} w^{T}_{m}x

Memory: O(kn); k: # classes

One-vs-one: w_{1,2}, . . . , w_{(k−1),k} constructed

Memory: O(k^{2}n)
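One-vs-rest prediction is just an argmax over k inner products; a minimal sketch (`ovr_predict` and the tiny weight matrix are illustrative):

```python
def ovr_predict(W, x):
    """One-vs-rest prediction: class = argmax_m w_m^T x.
    W: list of k weight vectors, i.e., O(kn) memory."""
    scores = [sum(wj * xj for wj, xj in zip(wm, x)) for wm in W]
    return max(range(len(W)), key=lambda m: scores[m])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2 features
print(ovr_predict(W, [0.2, 0.9]))           # class 1 has the largest score
```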


### Solving Several Binary Problems (Cont’d)

So one-vs-rest is more suitable than one-vs-one. This is not the case for kernelized SVM/LR.


### Considering All Data at Once

min_{w_{1},...,w_{k}} (1/2) Σ^{k}_{m=1} ‖w_{m}‖^{2}_{2} + C Σ^{l}_{i=1} ξ({w_{m}}^{k}_{m=1}; x_{i}, y_{i})

Multi-class SVM by Crammer and Singer (2001):

loss function: max_{m≠y_{i}} max(0, 1 − (w_{y_{i}} − w_{m})^{T}x_{i})

Maximum Entropy (ME), with

P(y|x) ≡ exp(w^{T}_{y}x) / Σ^{k}_{m=1} exp(w^{T}_{m}x)

loss function: −log P(y_{i}|x_{i})

Many don't think that ME is close to SVM, but it is.

Note that if # classes = 2, ME ⇒ LR.


Applications in non-standard scenarios

### Applications in Non-standard Scenarios

Linear classification can be applied in many other settings.

An important one is to approximate nonlinear classifiers.

Goal: accuracy close to nonlinear, but faster training/testing.

Two types of methods here:

- Linear methods for explicit data mappings
- Approximating kernels


### Linear Methods to Explicitly Train φ(x_{i})

Example: low-degree polynomial mapping:

φ(x) = [1, x_{1}, . . . , x_{n}, x_{1}^{2}, . . . , x_{n}^{2}, x_{1}x_{2}, . . . , x_{n−1}x_{n}]^{T}

For this mapping, # features = O(n^{2})

When is it useful?

Recall O(n) for linear versus O(nl) for kernel; now it is O(n^{2}) versus O(nl)

Sparse data:

n ⇒ n̄, the average # of non-zeros per instance

n̄ ≪ n ⇒ O(n̄^{2}) may still be smaller than O(l n̄)
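For sparse data, the degree-2 expansion only touches the non-zero entries, so its size depends on n̄ rather than n. A minimal sketch (the dictionary-based representation and function name are illustrative):

```python
from itertools import combinations_with_replacement

def poly2_expand_sparse(x):
    """Degree-2 expansion of a sparse vector given as {index: value}.
    Produces O(nbar^2) features, nbar = # of non-zeros, regardless of n."""
    feats = {(): 1.0}                                  # bias term
    for i, v in x.items():
        feats[(i,)] = v                                # linear terms
    for (i, vi), (j, vj) in combinations_with_replacement(sorted(x.items()), 2):
        feats[(i, j)] = vi * vj                        # squared and cross terms
    return feats

x = {3: 2.0, 10: 0.5}                                  # n may be huge; nbar = 2
print(len(poly2_expand_sparse(x)))                     # 1 + 2 + 3 = 6 features
```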


### High Dimensionality of φ(x) and w

• Many new considerations arise in large-scale scenarios

• For example, w has O(n^{2}) components if the degree is 2. In our application, n = 46,155, so w takes about 20 GB.

• See the detailed discussion in Chang et al. (2010)

• A related development is the COFFIN framework by Sonnenburg and Franc (2010)


### An NLP Application: Dependency Parsing

Construct a dependency graph: a multi-class problem.

(Figure: dependency parse of the sentence "John hit the ball with a bat .", with arc labels nsubj, ROOT, det, dobj, prep, pobj, p and POS tags NNP, VBD, DT, NN, IN)

Very sparse data; n̄: average # of nonzeros per instance

| n | Dim. of φ(x) | l | n̄ | # nonzeros of w |
|---|---|---|---|---|
| 46,155 | 1,065,165,090 | 204,582 | 13.3 | 1,438,456 |


### An NLP Application (Cont’d)

| | LIBSVM (RBF) | LIBSVM (Poly) | LIBLINEAR (Linear) | LIBLINEAR (Poly) |
|---|---|---|---|---|
| Training time | 3h34m53s | 3h21m51s | 3m36s | 3m43s |
| Parsing speed | 0.7x | 1x | 1652x | 103x |
| UAS | 89.92 | 91.67 | 89.11 | 91.71 |
| LAS | 88.55 | 90.60 | 88.07 | 90.71 |

Explicitly using φ(x) instead of kernels ⇒ faster training and testing.

Some interesting hashing techniques are used to handle the sparse w.


### Approximating Kernels

Following Lee and Wright (2010), we consider two categories.

Kernel matrix approximation:

Original matrix Q with Q_{ij} = y_{i}y_{j}K(x_{i}, x_{j})

Consider Q̄ = Φ̄^{T}Φ̄ ≈ Q

Φ̄ ≡ [x̄_{1}, . . . , x̄_{l}] becomes the new training data ⇒ trained by a linear classifier


### Approximating Kernels (Cont’d)

Φ̄ ∈ R^{d×l}, d ≪ l : # features ≪ # data

Testing is an issue.

Feature mapping approximation:

A mapping function φ̄ : R^{n} → R^{d} such that

φ̄(x)^{T}φ̄(t) ≈ K(x, t)

Testing is straightforward because φ̄(·) is available.

Many mappings have been proposed; in particular, hashing.

φ̄(·) may be dense or sparse.

Data beyond memory capacity

### Data Beyond Memory Capacity

Most existing algorithms assume data fit in memory. They are slow if data are larger than memory.

Frequent disk access of data; CPU time is no longer the main concern.

They cannot be run in distributed environments.

Many challenging research issues.


### When Data Cannot Fit In Memory

Running LIBLINEAR on a machine with 1 GB of memory:

once data exceed memory, disk swap causes lengthy training time.


### Disk-level Data Classification

Data larger than memory but smaller than disk.

Design algorithms so that disk access is less frequent. An example (Yu et al., 2010): a decomposition method that loads one block of data at a time while ensuring overall convergence.

But loading time becomes a big concern: reading 1 TB from a hard disk takes a very long time.


### Distributed Linear Classification

An important advantage: each node loads data from its own disk.

Parallel data loading, but how about operations?

Issues:

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern


### Distributed Linear Classification (Cont’d)

Simple approaches:

Subsampling: use a subset that fits in memory

- Simple and useful in some situations
- In a sense, you do a "reduce" operation to collect data on one computer, and then conduct detailed analysis

Bagging: train on several subsets and ensemble the results

- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)


### Distributed Linear Classification (Cont’d)

Some results of averaging models:

| | yahoo-korea | kddcup10 | webspam | epsilon |
|---|---|---|---|---|
| Using all | 87.29 | 89.89 | 99.51 | 89.78 |
| Avg. models | 86.08 | 89.64 | 98.40 | 88.83 |

Using all: solves a single linear SVM on the whole data.

Avg. models: each node solves a linear SVM on a subset.

Slightly worse, but in general OK.
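The "Avg. models" row is just a coordinate-wise mean of per-node weight vectors; a minimal sketch (`average_models` is an illustrative name):

```python
def average_models(models):
    """Average per-node weight vectors into a single model.
    Each node trains on its own subset; only the w vectors are communicated."""
    n = len(models[0])
    return [sum(w[j] for w in models) / len(models) for j in range(n)]

w1, w2 = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]  # models from two nodes
print(average_models([w1, w2]))            # [0.5, 0.5, 2.0]
```

Communication is a single vector per node, which is why this is attractive despite the small accuracy loss.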


### Distributed Linear Classification (Cont’d)

Parallel optimization: many possible approaches.

If the method involves matrix-vector products, such operations can be parallelized.

Each iteration involves communication.

Also, MapReduce is not very suitable for iterative algorithms (I/O for fault tolerance).

We should have as few iterations as possible.


### Distributed Linear Classification (Cont’d)

ADMM (Boyd et al., 2011):

min_{w_{1},...,w_{m},z} (1/2) z^{T}z + C Σ^{m}_{j=1} Σ_{i∈B_{j}} ξ_{L1}(w_{j}; x_{i}, y_{i}) + (ρ/2) Σ^{m}_{j=1} ‖w_{j} − z‖^{2}

subject to w_{j} − z = 0, ∀j

Each subproblem is updated independently, but the w_{j} must be collected.

Some have tried MapReduce, but no public implementation yet.

Convergence may not be very fast (i.e., some iterations are needed).
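To show the shape of the iterations, here is a scalar consensus-ADMM sketch in which squared terms stand in for the SVM losses so every update is closed-form; this is only an illustration of the w_{j}/z/dual updates, not the formulation above:

```python
def admm_consensus(A, rho=1.0, iters=200):
    """Scaled consensus ADMM sketch for
       min_z 0.5*z^2 + sum_j 0.5*(w_j - a_j)^2,  s.t. w_j = z  (scalars).
    Squared stand-in losses make each w_j-update closed-form."""
    m = len(A)
    z = 0.0
    w = [0.0] * m
    u = [0.0] * m                                  # scaled dual variables
    for _ in range(iters):
        for j in range(m):                         # independent per-node updates
            w[j] = (A[j] + rho * (z - u[j])) / (1.0 + rho)
        # z-update: requires collecting all w_j (the communication step)
        z = rho * sum(w[j] + u[j] for j in range(m)) / (1.0 + m * rho)
        for j in range(m):
            u[j] += w[j] - z                       # dual ascent
    return z

z = admm_consensus([3.0, 6.0])
print(z)  # approaches the consensus optimum z* = 3.0
```

The z-update is the point where all w_{j} must be collected, matching the communication concern above.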


### Distributed Linear Classification (Cont’d)

Vowpal Wabbit (Langford et al., 2007)

Since version 6.0, Hadoop support has been provided, with an L-BFGS (quasi-Newton) algorithm.

From John Langford's talk: 2.1T features, 17B samples, 1K nodes ⇒ 70 minutes.


Discussion and conclusions

### Related Topics

Structured learning:

Instead of y_{i} ∈ {+1, −1}, y_{i} becomes a vector.

Examples: conditional random fields (CRF) and structured SVM. They are linear classifiers.

Regression:

Document classification has been widely applied, but document regression (e.g., L2-regularized SVR) is less frequently used.

Example: y_{i} is a CTR (click-through rate) and x_{i} is a web page.

L1-regularized least-squares regression is another story ⇒ very popular in compressed sensing.


### Conclusions

Linear classification is an old topic, but new developments for large-scale applications are interesting.

Linear classification works on x rather than φ(x); it is easy and flexible for feature engineering.

Linear classification + feature engineering is useful for many real applications.
