Recent Advances in Large Linear Classification
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at NEC Labs, August 26, 2011
This talk is based on our recent survey paper, invited by the Proceedings of the IEEE:
G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.
It’s also related to our development of the software LIBLINEAR
www.csie.ntu.edu.tw/~cjlin/liblinear
Due to time constraints, we will give an overview instead of deep technical details.
Outline
Introduction
Binary linear classification
Multi-class linear classification
Applications in non-standard scenarios
Data beyond memory capacity
Discussion and conclusions
Introduction
Linear and Nonlinear Classification
[Figure panels: Linear vs. Nonlinear decision boundaries]
By linear we mean that the data are not mapped to a higher dimensional space
Original: [height, weight]
Nonlinear: [height, weight, weight/height²]
Linear and Nonlinear Classification (Cont’d)
Given training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1
l: # of data, n: # of features
Linear: find (w, b) such that the decision function is sgn(w^T x + b)
Nonlinear: map data to φ(x_i). The decision function becomes sgn(w^T φ(x) + b)
Later b is omitted
Why Linear Classification?
• If φ(x) is high dimensional, w^T φ(x) is expensive
• Kernel methods:
  w ≡ Σ_{i=1}^{l} α_i φ(x_i) for some α, with K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)
  New decision function: sgn(Σ_{i=1}^{l} α_i K(x_i, x))
• Use a special φ(x) so that calculating K(x_i, x_j) is easy
• Example:
  K(x_i, x_j) ≡ (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j), φ(x) ∈ R^{O(n^2)}
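As a quick sanity check, the degree-2 example can be verified numerically. The sketch below assumes one standard choice of φ (with √2 scalings on the linear and cross terms) so that the kernel value and the explicit inner product coincide; it is illustrative only.

```python
import itertools
import numpy as np

def phi_degree2(x):
    """Explicit map for K(x, z) = (x^T z + 1)^2 (one standard scaling of phi)."""
    feats = [1.0]                                        # constant term
    feats += [np.sqrt(2) * v for v in x]                 # linear terms
    feats += [v * v for v in x]                          # squared terms
    feats += [np.sqrt(2) * x[i] * x[j]                   # cross terms
              for i, j in itertools.combinations(range(len(x)), 2)]
    return np.array(feats)

x, z = np.random.rand(5), np.random.rand(5)
print((x.dot(z) + 1) ** 2)                   # kernel value
print(phi_degree2(x).dot(phi_degree2(z)))    # same value via the explicit mapping
```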
Why Linear Classification? (Cont’d)
Prediction:
w^T x versus Σ_{i=1}^{l} α_i K(x_i, x)
If evaluating K(x_i, x_j) takes O(n), then prediction costs
O(n) versus O(nl)
Nonlinear: more powerful in separating data
Linear: cheaper and simpler
Linear is Useful in Some Places
For certain problems, the accuracy of linear is as good as nonlinear
But training and testing are much faster, especially for document classification
Number of features (bag-of-words model) is very large
Recently linear classification has become a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008)
They focus on large sparse data
There are many other recent papers and software
Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)
                  Linear                   RBF kernel
Data set          Time (s)  Accuracy (%)   Time (s)   Accuracy (%)
MNIST38              0.1       96.82           38.1      99.70
ijcnn1               1.6       91.81           26.8      98.69
covtype              1.4       76.37       46,695.8      96.11
news20               1.1       96.95          383.2      96.90
real-sim             0.3       97.44          938.3      97.82
yahoo-japan          3.1       92.63       20,955.2      93.31
webspam              25.7      93.35       15,681.8      99.26
Size reasonably large: e.g., yahoo-japan has 140k instances and 830k features
Binary linear classification
Binary Linear Classification
Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1; l: # of data, n: # of features
min_w  r(w) + C Σ_{i=1}^{l} ξ(w; x_i, y_i)
r(w): regularization term
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2, and  (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)
Logistic regression (LR): (3)
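For concreteness, a direct transcription of (1)-(3); a minimal sketch assuming `w` and `x` are NumPy arrays:

```python
import numpy as np

def losses(w, x, y):
    """Evaluate the three losses (1)-(3) at a single instance (x, y)."""
    m = y * w.dot(x)                              # the margin y * w^T x
    return {"L1": max(0.0, 1.0 - m),
            "L2": max(0.0, 1.0 - m) ** 2,
            "LR": float(np.logaddexp(0.0, -m))}   # stable log(1 + e^{-m})
```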
Loss Functions (Cont’d)
[Figure: ξ_L1, ξ_L2, and ξ_LR plotted as functions of −y w^T x]
They are similar
Regularization
L1 versus L2
‖w‖_1 and w^T w/2
w^T w/2: smooth, easier to optimize
‖w‖_1: non-differentiable; sparse solution, possibly many zero elements
Possible advantages of L1 regularization:
- Feature selection
- Less storage for w
Training Linear Classifiers
Many recent developments; we won't show details here
Why is training linear faster than nonlinear?
Recall the O(n) vs. O(nl) difference in prediction:
w^T x and Σ_{i=1}^{l} α_i K(x_i, x); n: # features, l: # data
A similar situation happens here. During training,
Σ_{t=1}^{l} α_t x_i^T x_t is often needed ⇒ O(nl)    (4)
Training Linear Classifiers (Cont’d)
By maintaining u ≡ Σ_{t=1}^{l} y_t α_t x_t, computing u^T x_i costs only O(n)
u: an intermediate variable during training; it eventually approaches the final weight vector w
Key: we are able to store x_t, ∀t, and maintain u
Nonlinear: we can't store φ(x_t)
For linear, basically any optimization method can be applied
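To make the idea concrete, below is a minimal sketch of dual coordinate descent for the L1-loss linear SVM in the spirit of Hsieh et al. (2008), maintaining u so every update costs O(n). The dense NumPy arrays, fixed epoch count, and lack of shrinking are simplifying assumptions.

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, epochs=10):
    """Sketch: dual coordinate descent for the L1-loss linear SVM."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                        # u = sum_t y_t alpha_t x_t, becomes w
    Q_ii = (X ** 2).sum(axis=1)            # diagonal of the dual Hessian
    for _ in range(epochs):
        for i in np.random.permutation(l):
            if Q_ii[i] <= 0.0:             # skip empty instances
                continue
            G = y[i] * u.dot(X[i]) - 1.0   # O(n) partial gradient, not O(nl)
            d = min(max(alpha[i] - G / Q_ii[i], 0.0), C) - alpha[i]
            if d != 0.0:
                alpha[i] += d
                u += d * y[i] * X[i]       # O(n) update keeps u consistent
    return u                               # the final weight vector w
```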
Choosing a Training Algorithm
Data property
# instances ≫ # features, or the other way around
Primal or dual
First-order or higher-order
Now first-order is slightly preferred, as we seldom need an accurate optimization solution
Cost of operations
exp/log operations are more expensive; avoid them when training LR
Others
L1 Regularization
Non-differentiable: need non-smooth optimization techniques
Difficult to apply sophisticated methods
Currently, coordinate descent and Newton with coordinate descent are among the most efficient (Yuan et al., 2010; Friedman et al., 2010; Yuan et al., 2011)
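To show how coordinate descent copes with the non-differentiable term, here is a sketch for the simpler L1-regularized least-squares problem (not the LR/SVM losses of the cited papers); each coordinate has a closed-form soft-thresholding update.

```python
import numpy as np

def cd_lasso(X, y, lam=1.0, epochs=20):
    """Cyclic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    l, n = X.shape
    w = np.zeros(n)
    r = y - X.dot(w)                         # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j].dot(r) + col_sq[j] * w[j]          # correlation term
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)                    # keep residual in sync
            w[j] = w_new
    return w
```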
Multi-class linear classification
Solving Several Binary Problems
The same methods work for linear and nonlinear classification, but there are some subtle differences
One-vs-rest
w_m: class m positive; others negative
class of x ≡ arg max_{m=1,...,k} w_m^T x
Memory: O(kn); k: # classes
One-vs-one: w_{1,2}, ..., w_{(k−1),k} constructed; O(k^2 n) memory cost
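A minimal one-vs-rest sketch; the helper `train_binary` stands for any binary linear solver (e.g., the dual coordinate descent sketch earlier) and is an assumed name, not a LIBLINEAR API.

```python
import numpy as np

def train_one_vs_rest(X, Y, k, train_binary):
    """Train k binary problems: class m is positive, all other classes negative."""
    return np.vstack([train_binary(X, np.where(Y == m, 1, -1))
                      for m in range(k)])       # W has shape (k, n)

def predict_one_vs_rest(W, x):
    return int(np.argmax(W.dot(x)))             # arg max_m  w_m^T x
```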
Solving Several Binary Problems (Cont’d)
So one-vs-rest is more suitable than one-vs-one for linear classification
This isn't the case for kernelized SVM/LR
Considering All Data at Once
min_{w_1,...,w_k}  (1/2) Σ_{m=1}^{k} ‖w_m‖_2^2 + C Σ_{i=1}^{l} ξ({w_m}_{m=1}^{k}; x_i, y_i)
Multi-class SVM by Crammer and Singer (2001)
loss function: max_{m≠y} max(0, 1 − (w_y − w_m)^T x)
Maximum Entropy (ME)
loss function: −log P(y|x), where P(y|x) ≡ exp(w_y^T x) / Σ_{m=1}^{k} exp(w_m^T x)
Many don't think that ME is close to SVM, but it is
Note that if # classes = 2, ME ⇒ LR
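Evaluating the ME posterior is just a softmax over the k inner products; a small, numerically stabilized sketch, where W stacks w_1, ..., w_k as rows:

```python
import numpy as np

def me_probabilities(W, x):
    """P(y|x) = exp(w_y^T x) / sum_m exp(w_m^T x), computed for all classes."""
    scores = W.dot(x)
    scores -= scores.max()          # subtracting the max avoids overflow in exp
    e = np.exp(scores)
    return e / e.sum()              # probabilities over the k classes
```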
Applications in non-standard scenarios
Applications in Non-standard Scenarios
Linear classification can be applied in many other places
An important one is to approximate nonlinear classifiers
Goal: the better accuracy of nonlinear, but with faster training/testing
Two types of methods here:
- Linear methods on explicit data mappings
- Approximating kernels
Linear Methods to Explicitly Train φ(x_i)
Example: low-degree polynomial mapping:
φ(x) = [1, x_1, ..., x_n, x_1^2, ..., x_n^2, x_1x_2, ..., x_{n−1}x_n]^T
For this mapping, # features = O(n^2)
When is it useful?
Recall O(n) for linear versus O(nl) for kernel; now it is O(n^2) versus O(nl)
Sparse data: n ⇒ n̄, the average # of non-zeros per instance
n̄ ≪ n ⇒ O(n̄^2) may still be smaller than O(l n̄)
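A sketch of the degree-2 expansion for one sparse instance stored as a dict from feature index to value; the index-tuple feature names are purely illustrative (a real implementation would map or hash them to integer indices).

```python
import itertools

def degree2_features(x):
    """Expand a sparse instance {index: value} into its degree-2 features.

    Cost is O(nbar^2) per instance, where nbar = number of nonzeros,
    independent of the full dimension n.
    """
    phi = {(): 1.0}                                   # constant term
    for i, v in x.items():
        phi[(i,)] = v                                 # linear term
        phi[(i, i)] = v * v                           # squared term
    for (i, vi), (j, vj) in itertools.combinations(sorted(x.items()), 2):
        phi[(i, j)] = vi * vj                         # cross term
    return phi

print(degree2_features({3: 0.5, 7: 2.0}))
```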
High Dimensionality of φ(x) and w
• Many new considerations in large scenarios
• For example, w has O(n^2) components if the degree is 2. In our application, n = 46,155, so w takes 20 GB
• See detailed discussion in Chang et al. (2010)
• A related development is the COFFIN framework by Sonnenburg and Franc (2010)
An NLP Application: Dependency Parsing
Construct dependency graph: a multi-class problem
[Figure: dependency parse of "John hit the ball with a bat ." with POS tags NNP VBD DT NN IN DT NN . and arc labels nsubj, det, dobj, prep, det, pobj, p from ROOT]
Very sparse: n̄ = average # nonzeros per instance
n          Dim. of φ(x)      l          n̄      w's # nonzeros
46,155     1,065,165,090     204,582    13.3    1,438,456
An NLP Application (Cont’d)
                 LIBSVM                   LIBLINEAR
                 RBF         Poly         Linear      Poly
Training time    3h34m53s    3h21m51s     3m36s       3m43s
Parsing speed    0.7x        1x           1652x       103x
UAS              89.92       91.67        89.11       91.71
LAS              88.55       90.60        88.07       90.71
Explicitly using φ(x) instead of kernels
⇒ faster training and testing
Some interesting hashing techniques are used to handle the huge, sparse w
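A generic feature-hashing sketch (in the spirit of Weinberger et al., 2009) that maps an astronomically large feature space into a fixed-size weight table; the exact scheme used for the parser above may differ, so treat this as illustrative only.

```python
import numpy as np

def hashed_vector(feature_value_pairs, table_size=2 ** 22):
    """Hash (feature name, value) pairs into a fixed-size dense vector."""
    v = np.zeros(table_size)
    for name, value in feature_value_pairs:
        h = hash(name)                              # built-in hash; a real system
                                                    # would use a stable hash (e.g., MurmurHash)
        sign = 1.0 if (h >> 1) & 1 == 0 else -1.0   # cheap sign hash to reduce collision bias
        v[h % table_size] += sign * value
    return v                                        # w only needs table_size entries
```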
Approximating Kernels
Following Lee and Wright (2010), we consider two categories
Kernel matrix approximation:
Original matrix Q with Q_ij = y_i y_j K(x_i, x_j)
Consider Q̄ = Φ̄^T Φ̄ ≈ Q
Φ̄ ≡ [x̄_1, ..., x̄_l] becomes the new training data ⇒ trained by a linear classifier
Approximating Kernels (Cont’d)
Φ̄ ∈ R^{d×l}, d ≪ l, i.e., # features ≪ # data
Testing is an issue
Feature mapping approximation
A mapping function φ̄: R^n → R^d such that φ̄(x)^T φ̄(t) ≈ K(x, t)
Testing is straightforward because φ̄(·) is available
Many mappings have been proposed; in particular, hashing
φ̄(·) may be dense or sparse
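One well-known dense mapping of this kind is random Fourier features (Rahimi and Recht, 2007), which approximates the RBF kernel exp(−γ‖x − t‖²); the sketch below is illustrative and not necessarily the mapping used in the works cited here.

```python
import numpy as np

def random_fourier_map(X, d=500, gamma=1.0, seed=0):
    """phi_bar(x)^T phi_bar(t) ~= exp(-gamma * ||x - t||^2) for rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n, d))   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)                 # random phases
    return np.sqrt(2.0 / d) * np.cos(X.dot(W) + b)            # (l, d) mapped data
```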
Data beyond memory capacity
Data Beyond Memory Capacity
Most existing algorithms assume that data fit in memory
They are slow if data are larger than memory: frequent disk access, so CPU time is no longer the main concern
They cannot be run in distributed environments
Many challenging research issues
When Data Cannot Fit In Memory
LIBLINEAR on a machine with 1 GB memory:
Disk swap causes lengthy training time
Disk-level Data Classification
Data larger than memory but smaller than disk
Design algorithms so that disk access is less frequent
An example (Yu et al., 2010): a decomposition method that loads one block at a time but ensures overall convergence
But loading time becomes a big concern
Reading 1 TB from a hard disk takes a very long time
Distributed Linear Classification
An important advantage: each node loads data from its own disk
Parallel data loading, but how about operations?
Issues
- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern
Distributed Linear Classification (Cont’d)
Simple approaches
Subsampling: use a subset that fits in memory
- Simple and useful in some situations
- In a sense, you do a "reduce" operation to collect data to one computer, and then conduct detailed analysis
Bagging: train on several subsets and ensemble the results
- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)
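A minimal sketch of the bagging/averaging baseline used in the next slide: train one linear model per partition (e.g., per node) and average the weight vectors; `train_binary` again stands for any binary linear solver and is an assumed name.

```python
import numpy as np

def average_models(partitions, train_binary):
    """partitions: list of (X_j, y_j) blocks, one per node."""
    models = [train_binary(Xj, yj) for Xj, yj in partitions]  # embarrassingly parallel
    return np.mean(models, axis=0)                            # averaged weight vector
```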
Distributed Linear Classification (Cont’d)
Some results by averaging models
             yahoo-korea   kddcup10   webspam   epsilon
Using all       87.29        89.89      99.51     89.78
Avg. models     86.08        89.64      98.40     88.83
Using all: solves a single linear SVM
Avg. models: each node solves a linear SVM on a subset
Slightly worse but in general OK
Distributed Linear Classification (Cont’d)
Parallel optimization: many possible approaches
If the method involves matrix-vector products, then such operations can be parallelized
Each iteration involves communication
Also, MapReduce is not very suitable for iterative algorithms (I/O for fault tolerance)
Should have as few iterations as possible
Distributed Linear Classification (Cont’d)
ADMM (Boyd et al., 2011)
min_{w_1,...,w_m, z}  (1/2) z^T z + C Σ_{j=1}^{m} Σ_{i∈B_j} ξ_L1(w_j; x_i, y_i) + (ρ/2) Σ_{j=1}^{m} ‖w_j − z‖^2
subject to  w_j − z = 0, ∀j
Each subproblem is updated independently, but the w_j must be collected
Some have tried MapReduce, but no public implementation yet
Convergence may not be very fast (i.e., need some iterations)
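A consensus-ADMM sketch of the formulation above; to keep the local solves to a few gradient steps it substitutes the differentiable L2 (squared hinge) loss for ξ_L1, so it illustrates the update structure rather than the exact method.

```python
import numpy as np

def admm_consensus(parts, C=1.0, rho=1.0, outer=50, inner=20, lr=1e-3):
    """parts: list of (X_j, y_j) blocks B_j, one per node."""
    m, n = len(parts), parts[0][0].shape[1]
    W = np.zeros((m, n))                    # local models w_j
    V = np.zeros((m, n))                    # scaled dual variables
    z = np.zeros(n)                         # consensus model
    for _ in range(outer):
        for j, (Xj, yj) in enumerate(parts):             # independent per node
            w = W[j]
            for _ in range(inner):                       # approximate local solve
                margin = 1.0 - yj * Xj.dot(w)
                act = margin > 0                         # instances with nonzero loss
                grad = (-2.0 * C * Xj[act].T.dot(yj[act] * margin[act])
                        + rho * (w - z + V[j]))
                w = w - lr * grad
            W[j] = w
        z = rho * (W + V).sum(axis=0) / (1.0 + m * rho)  # z-update (from the z^T z / 2 term)
        V += W - z                                       # dual update; requires collecting w_j
    return z
```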
Distributed Linear Classification (Cont’d)
Vowpal Wabbit (Langford et al., 2007)
After version 6.0, Hadoop support has been provided
L-BFGS (quasi-Newton) algorithms
From John’s talk: 2.1T features, 17B samples, 1K nodes ⇒ 70 minutes
Discussion and conclusions
Related Topics
Structured learning
Instead of y_i ∈ {+1, −1}, y_i becomes a vector
Examples: conditional random fields (CRF) and structured SVM
They are linear classifiers
Regression
Document classification has been widely used, but document regression (e.g., L2-regularized SVR) is less frequently applied
Example: y_i is the CTR and x_i is a web page
L1-regularized least-squares regression is another story ⇒ very popular for compressed sensing
Conclusions
Linear classification is an old topic, but new developments for large-scale applications are interesting
Linear classification works on x rather than φ(x)
Easy and flexible for feature engineering
Linear classification + feature engineering useful for many real applications