### Classification

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at Asian Conference on Machine Learning, November, 2013

Chih-Jen Lin (National Taiwan Univ.) 1 / 42

This talk is mainly based on the survey paper (Yuan et al., 2012): Recent Advances of Large-scale Linear Classification, Proceedings of the IEEE, 2012.

It is also related to our development of the software LIBLINEAR: www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.


### Outline

Introduction

Optimization Methods

Extension of Linear Classification

Discussion and Conclusions

### Linear and Nonlinear Classification

(Figure: linear vs. nonlinear decision boundaries.)

By linear we mean a linear function is used to separate data in the original input space

Original: [height, weight]

Nonlinear: [height, weight, weight/height^{2}]
Kernels are one way to obtain a nonlinear classifier


### Linear and Nonlinear Classification (Cont’d)

Methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to another space x ⇒ φ(x)

φ(x)^{T}φ(y) easily calculated; no good control on φ(·)
Linear classification + feature engineering:

We use x without any mapping. Alternatively, we can say that φ(x) is our x; we have full control over x or φ(x).

We will focus on linear classification here.

### Why Linear Classification?

• If φ(x) is high dimensional, the decision function

sgn(w^{T}φ(x))

is expensive to compute. So kernel methods use

w ≡ Σ_{i=1}^{l} α_{i}φ(x_{i}) for some α, K(x_{i}, x_{j}) ≡ φ(x_{i})^{T}φ(x_{j})

Then the new decision function is

sgn(Σ_{i=1}^{l} α_{i}K(x_{i}, x))

• A special φ(x) is chosen so that calculating K(x_{i}, x_{j}) is easy. Example:

K(x_{i}, x_{j}) ≡ (x^{T}_{i}x_{j} + 1)^{2} = φ(x_{i})^{T}φ(x_{j}), φ(x) ∈ R^{O(n^{2})}

### Why Linear Classification? (Cont’d)

However, the kernel is still expensive.

Prediction: w^{T}x versus Σ_{i=1}^{l} α_{i}K(x_{i}, x)

If K(x_{i}, x_{j}) takes O(n), then the costs are O(n) versus O(nl).

Nonlinear: more powerful to separate data
Linear: cheaper and simpler
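To make the O(n) versus O(nl) comparison concrete, here is a small NumPy sketch on toy data with the linear kernel K(x_{i}, x) = x_{i}^{T}x: summing l kernel evaluations gives exactly the same prediction value as first forming w = Σ_{i} α_{i}x_{i} once and then computing w^{T}x in O(n).

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 100, 20
X = rng.standard_normal((l, n))        # training instances x_1, ..., x_l
alpha = rng.uniform(0.0, 1.0, size=l)  # some dual coefficients
x_new = rng.standard_normal(n)         # instance to predict

# Kernel-style prediction: O(nl), one kernel evaluation per training instance
kernel_pred = sum(alpha[i] * (X[i] @ x_new) for i in range(l))

# Linear prediction: form w = sum_i alpha_i x_i once, then O(n) per prediction
w = alpha @ X
linear_pred = w @ x_new

assert np.isclose(kernel_pred, linear_pred)
```

For a nonlinear kernel the same trick would require forming w in the (possibly huge) φ-space, which is exactly why kernel methods keep the O(nl) form.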

### Linear is Useful in Some Places

For certain problems, the accuracy of linear classifiers is as good as that of nonlinear ones, but training and testing are much faster.

This holds especially for document classification:
the number of features (bag-of-words model) is very large, and the data are large and sparse.

Millions of instances can be trained in just a few seconds.

### Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)

| Data set | Linear: Time (s) | Linear: Accuracy (%) | RBF kernel: Time (s) | RBF kernel: Accuracy (%) |
|---|---|---|---|---|
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Sizes are reasonably large: e.g., yahoo-japan has 140k instances and 830k features.



### Binary Linear Classification

Training data {(y_{i}, x_{i})}, x_{i} ∈ R^{n}, i = 1, . . . , l, y_{i} = ±1
l: # of data, n: # of features

min_{w} f(w), f(w) ≡ w^{T}w/2 + C Σ_{i=1}^{l} ξ(w; x_{i}, y_{i})

w^{T}w/2: regularization term (we have no time to talk about L1 regularization here)
ξ(w; x, y): loss function; we hope y w^{T}x > 0
C: regularization parameter

### Loss Functions

Some commonly used ones:

ξ_{L1}(w; x, y) ≡ max(0, 1 − y w^{T}x), (1)
ξ_{L2}(w; x, y) ≡ max(0, 1 − y w^{T}x)^{2}, (2)
ξ_{LR}(w; x, y) ≡ log(1 + e^{−y w^{T}x}). (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995):

(1)-(2)

Logistic regression (LR): (3); no reference is given because it can be traced back to the 19th century

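The three losses in (1)–(3) are simple to compute; a small NumPy sketch:

```python
import numpy as np

def l1_loss(w, x, y):   # hinge loss, Eq. (1)
    return max(0.0, 1.0 - y * np.dot(w, x))

def l2_loss(w, x, y):   # squared hinge loss, Eq. (2)
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def lr_loss(w, x, y):   # logistic loss, Eq. (3)
    return np.log1p(np.exp(-y * np.dot(w, x)))

w = np.array([1.0, -0.5])
x = np.array([0.2, 0.4])
# Here y * w^T x = 0, so the margin requirement y w^T x > 0 is not met
assert l1_loss(w, x, 1) == 1.0
assert l2_loss(w, x, 1) == 1.0
assert np.isclose(lr_loss(w, x, 1), np.log(2.0))
```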

### Loss Functions (Cont’d)

(Figure: the three loss functions ξ_{L1}, ξ_{L2}, and ξ_{LR} plotted against −y w^{T}x.)

Their performance is usually similar.

### Loss Functions (Cont’d)

However, optimization methods for them may be different

ξ_{L1}: not differentiable

ξ_{L2}: differentiable but not twice differentiable

ξ_{LR}: twice differentiable

### Outline

Introduction

Optimization Methods

Extension of Linear Classification

Discussion and Conclusions

### Optimization: 2nd Order Methods

Newton direction:

min_{s} ∇f(w^{k})^{T}s + (1/2) s^{T}∇^{2}f(w^{k})s

This is the same as solving the Newton linear system

∇^{2}f(w^{k})s = −∇f(w^{k})

The Hessian matrix ∇^{2}f(w^{k}) is too large to be stored:

∇^{2}f(w^{k}): n × n, n: number of features

But the Hessian has a special form:

∇^{2}f(w) = I + CX^{T}DX

### Optimization: 2nd Order Methods (Cont’d)

X: the data matrix; D: diagonal. For logistic regression,

D_{ii} = e^{−y_{i}w^{T}x_{i}} / (1 + e^{−y_{i}w^{T}x_{i}})^{2}

We use CG to solve the linear system; only Hessian-vector products are needed:

∇^{2}f(w)s = s + C · X^{T}(D(X s))

Therefore, we have a Hessian-free approach.
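The Hessian-vector product above is easy to verify numerically. Below is a dense-NumPy sketch; LIBLINEAR of course exploits sparsity and never forms the Hessian — the explicit H here is built only to check the result on a toy problem.

```python
import numpy as np

def hessian_vector_product(s, X, y, w, C):
    """Compute (I + C X^T D X) s for logistic regression without ever
    forming the n x n Hessian; only matrix-vector products are used."""
    t = np.exp(-y * (X @ w))
    D = t / (1.0 + t) ** 2          # diagonal entries D_ii
    return s + C * (X.T @ (D * (X @ s)))

rng = np.random.default_rng(0)
l, n, C = 50, 8, 1.0
X = rng.standard_normal((l, n))
y = rng.choice([-1.0, 1.0], size=l)
w = rng.standard_normal(n)
s = rng.standard_normal(n)

# Sanity check against an explicitly formed Hessian (toy size only)
t = np.exp(-y * (X @ w))
H = np.eye(n) + C * X.T @ np.diag(t / (1.0 + t) ** 2) @ X
assert np.allclose(hessian_vector_product(s, X, y, w, C), H @ s)
```

Each product costs O(#nonzeros of X), so CG iterations stay cheap even when n is large.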

### 2nd-order Methods (Cont’d)

In LIBLINEAR, we use the trust-region + CG approach by Steihaug (1983); see details in Lin et al. (2008)

What if we use L2 loss? It’s differentiable but not twice differentiable

ξ_{L2}(w; x, y ) ≡ max(0, 1 − y w^{T}x)^{2}
We can use a generalized Hessian (Mangasarian, 2002); details are not discussed here.

### Optimization: 1st Order Methods

We consider the L1 loss and the dual SVM problem

min_{α} f(α) subject to 0 ≤ α_{i} ≤ C, ∀i,

where

f(α) ≡ (1/2)α^{T}Qα − e^{T}α

and

Q_{ij} = y_{i}y_{j}x^{T}_{i}x_{j}, e = [1, . . . , 1]^{T}

We will apply coordinate descent methods. The situation for the L2 or LR loss is very similar.

### 1st Order Methods (Cont’d)

Coordinate descent: a simple and classic technique that changes one variable at a time.

Given the current α, let e_{i} = [0, . . . , 0, 1, 0, . . . , 0]^{T}.

min_{d} f(α + d e_{i}) = (1/2)Q_{ii}d^{2} + ∇_{i}f(α)d + constant

Without constraints, the optimal step is

d = −∇_{i}f(α)/Q_{ii}

With the constraint 0 ≤ α_{i} + d ≤ C, the update becomes

α_{i} ← min(max(α_{i} − ∇_{i}f(α)/Q_{ii}, 0), C)

### 1st Order Methods (Cont’d)

∇_{i}f(α) = (Qα)_{i} − 1 = Σ_{j=1}^{l} Q_{ij}α_{j} − 1 = Σ_{j=1}^{l} y_{i}y_{j}x^{T}_{i}x_{j}α_{j} − 1

This costs O(ln); l: # of data, n: # of features. But we can define

u ≡ Σ_{j=1}^{l} y_{j}α_{j}x_{j},

so that the gradient calculation costs only O(n):

∇_{i}f(α) = (y_{i}x_{i})^{T} Σ_{j=1}^{l} y_{j}x_{j}α_{j} − 1 = y_{i}u^{T}x_{i} − 1

### 1st Order Methods (Cont’d)

All we need is to maintain

u = Σ_{j=1}^{l} y_{j}α_{j}x_{j}.

If ᾱ_{i} is the old value and α_{i} the new one, then

u ← u + (α_{i} − ᾱ_{i})y_{i}x_{i}.

This also costs O(n).

References: probably first used for SVM by Mangasarian and Musicant (1999) and Friess et al. (1998), but popularized for linear SVM by Hsieh et al. (2008).

### 1st Order Methods (Cont’d)

Summary of the dual coordinate descent method:

Given initial α, compute u = Σ_{i} y_{i}α_{i}x_{i}.
While α is not optimal (outer iteration)
  For i = 1, . . . , l (inner iteration)
    (a) ᾱ_{i} ← α_{i}
    (b) G = y_{i}u^{T}x_{i} − 1
    (c) If α_{i} can be changed:
        α_{i} ← min(max(α_{i} − G/Q_{ii}, 0), C)
        u ← u + (α_{i} − ᾱ_{i})y_{i}x_{i}
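The loop above can be sketched in plain NumPy. This is a minimal version for illustration, without the shrinking and random-permutation tricks used by Hsieh et al. (2008) and LIBLINEAR:

```python
import numpy as np

def dual_cd_svm(X, y, C=1.0, outer_iters=50):
    """Dual coordinate descent for the L1-loss SVM (a sketch of the
    method popularized by Hsieh et al., 2008).
    X: (l, n) data matrix; y: labels in {-1, +1}.
    Returns u = sum_i y_i alpha_i x_i, i.e. the primal w."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                      # maintained u = sum_j y_j alpha_j x_j
    Qii = np.einsum('ij,ij->i', X, X)    # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(outer_iters):
        for i in range(l):
            if Qii[i] == 0.0:
                continue
            G = y[i] * (u @ X[i]) - 1.0  # gradient in O(n)
            alpha_old = alpha[i]
            # clipped single-variable minimizer of the quadratic subproblem
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]  # O(n) update of u
    return u

# Toy separable problem: the learned w should classify all points correctly
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd_svm(X, y)
assert np.all(np.sign(X @ w) == y)
```

The "if α_{i} can be changed" test of step (c) is folded into the clipped update here: when the projection leaves α_{i} unchanged, the u update is a no-op.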

### Comparisons

The L2-loss SVM is used.

DCDL2: dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: primal coordinate descent
TRON: trust-region Newton method

These results are from Hsieh et al. (2008).

### Objective values (Time in Seconds)

(Figures: objective value versus training time in seconds, on news20, rcv1, yahoo-japan, and yahoo-korea.)

### Analysis

First-order methods can quickly obtain a usable model, but second-order methods are more robust and faster in ill-conditioned situations.

Both types of optimization methods are useful for linear classification.

### An Example When # Features Small

# of instances: 32,561, # of features: 123

(Figures: objective value and accuracy versus training time.)

If the number of features is small, solving the primal problem is more suitable.

### Outline

Introduction

Optimization Methods

Extension of Linear Classification

Discussion and Conclusions

### Extension of Linear Classification

Linear classification can be extended in different ways

An important one is to approximate nonlinear classifiers

Goal: better accuracy of nonlinear but faster training/testing

Examples:

1. Explicit data mappings + linear classification
2. Kernel approximation + linear classification

I will focus on the first.

### Linear Methods to Explicitly Train φ(x_{i})

Example: a low-degree polynomial mapping:

φ(x) = [1, x_{1}, . . . , x_{n}, x_{1}^{2}, . . . , x_{n}^{2}, x_{1}x_{2}, . . . , x_{n−1}x_{n}]^{T}

For this mapping, # features = O(n^{2}).

When is it useful? Recall the O(n) cost for linear versus O(nl) for kernel; now it is O(n^{2}) versus O(nl).

For sparse data, replace n by n̄, the average # of non-zeros per instance. If n̄ ≪ n, then O(n̄^{2}) may be much smaller than O(l n̄).
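A quick sketch of an explicit degree-2 mapping. Here the mapping is scaled with √2 factors so that its inner product reproduces the degree-2 polynomial kernel exactly; the unscaled mapping on the slide spans the same feature space.

```python
import numpy as np
from itertools import combinations

def phi_degree2(x):
    """Explicit degree-2 polynomial mapping, scaled so that
    phi(x).dot(phi(y)) == (x.dot(y) + 1)**2."""
    n = len(x)
    feats = [1.0]                              # constant term
    feats += list(np.sqrt(2.0) * x)            # sqrt(2) * x_i
    feats += list(x * x)                       # x_i^2
    feats += [np.sqrt(2.0) * x[i] * x[j]       # sqrt(2) * x_i x_j, i < j
              for i, j in combinations(range(n), 2)]
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
# The explicit mapping reproduces the kernel value ...
assert np.isclose(phi_degree2(x) @ phi_degree2(y), (x @ y + 1.0) ** 2)
# ... and has 1 + n + n + n(n-1)/2 = O(n^2) features
assert len(phi_degree2(x)) == 1 + 3 + 3 + 3
```

For sparse x, only the nonzero coordinates generate nonzero features, which is why the cost is O(n̄^{2}) rather than O(n^{2}).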

### Example: Dependency Parsing

A multi-class problem with sparse data; a degree-2 polynomial mapping is used.

| n | Dim. of φ(x) | l | n̄ | # nonzeros of w |
|---|---|---|---|---|
| 46,155 | 1,065,165,090 | 204,582 | 13.3 | 1,438,456 |

n̄: average # of nonzeros per instance.

The dimensionality of w is too large, but w is sparse. Some interesting hashing techniques are used to handle the sparse w.

### Example: Dependency Parsing (Cont’d)

| | LIBSVM RBF | LIBSVM Poly | LIBLINEAR Linear | LIBLINEAR Poly |
|---|---|---|---|---|
| Training time | 3h34m53s | 3h21m51s | 3m36s | 3m43s |
| Parsing speed | 0.7x | 1x | 1652x | 103x |
| UAS | 89.92 | 91.67 | 89.11 | 91.71 |
| LAS | 88.55 | 90.60 | 88.07 | 90.71 |

We get faster training/testing while maintaining good accuracy.

See detailed discussion in Chang et al. (2010)


### Example: Classifier in a Small Device

In a sensor application (Yu et al., 2013), the classifier must use less than 16KB of RAM.

| Classifier | Test accuracy (%) | Model size |
|---|---|---|
| Decision tree | 77.77 | 76.02KB |
| AdaBoost (10 trees) | 78.84 | 1,500.54KB |
| SVM (RBF kernel) | 85.33 | 1,287.15KB |

Number of features: 5.

We consider a degree-3 mapping; its dimensionality is C(5+3, 3) + bias term = 57.

### Example: Classifier in a Small Device (Cont’d)

One-against-one strategy for 5-class classification:

C(5, 2) × 57 × 4 bytes = 2.28KB (assuming single precision)

Results:

| SVM method | Test accuracy (%) | Model size |
|---|---|---|
| RBF kernel | 85.33 | 1,287.15KB |
| Polynomial kernel | 84.79 | 2.28KB |
| Linear kernel | 78.51 | 0.24KB |
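The two counts above, the 57-dimensional mapping and the 2.28KB model, can be checked in a few lines of Python:

```python
from math import comb

# Degree-3 polynomial mapping of n = 5 features:
dim = comb(5 + 3, 3) + 1        # C(5+3, 3) monomials plus a bias term
assert dim == 57

# One-against-one for 5 classes: C(5, 2) = 10 binary classifiers, each
# storing a 57-dimensional w in single precision (4 bytes per entry)
model_bytes = comb(5, 2) * dim * 4
assert model_bytes == 2280      # = 2.28KB, matching the slide
```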

### Example: Classifier in a Small Device (Cont’d)

Running time (in seconds):

| | LIBSVM | LIBLINEAR primal | LIBLINEAR dual |
|---|---|---|---|
| Training time | 30,519.10 | 1,368.25 | 4,039.20 |

LIBSVM: polynomial kernel. LIBLINEAR: training polynomial expansions; primal uses a 2nd-order method, dual a 1st-order method.

LIBLINEAR dual converges slowly here because #data ≫ #features = 57.

### Discussion

Unfortunately, polynomial mappings easily cause high dimensionality. Some have proposed "projection" techniques that use fewer features as approximations; examples are Kar and Karnick (2012) and Pham and Pagh (2013).

Recently, ensembles of tree models (e.g., random forests or GBDT) have become very useful. But under model-size constraints (as in the second application), linear may still be the way to go.

### Outline

Introduction

Optimization Methods

Extension of Linear Classification

Discussion and Conclusions

### Big-data Linear Classification

Shared-memory and distributed scenarios are very different; here I discuss distributed classification in more detail.

The major saving is parallel data loading, but high communication cost is a big concern.

### Big-data Linear Classification (Cont’d)

Data classification is often only one component of the whole workflow.

Example: distributed feature generation may be more time consuming than the classification itself.

This explains why so far not many effective packages are available for big-data classification; many research and engineering issues remain to be solved.

### Conclusions

Linear classification is an old topic, but recently there have been new and interesting applications.

Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others.

Advantage of linear: because we work directly on x, feature engineering is easier.

We expect linear classification to be widely used in situations ranging from small-model to big-data classification.

### References I

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471–1490, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

T.-T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann Publishers, 1998.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

P. Kar and H. Karnick. Random feature maps for dot product kernels. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 583–591, 2012.


### References II

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.

O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913–929, 2002.

O. L. Mangasarian and D. R. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5):1032–1037, 1999.

N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 239–247, 2013.

T. Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20:626–637, 1983.

T. Yu, D. Wang, M.-C. Yu, C.-J. Lin, and E. Y. Chang. Careful use of machine learning methods is needed for mobile applications: A case study on transportation-mode detection. Technical report, Studio Engineering, HTC, 2013. URL http://www.csie.ntu.edu.tw/~cjlin/papers/transportation-mode/casestudy.pdf.

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. Proceedings of the IEEE, 100(9):2584–2603, 2012. URL http://www.csie.ntu.edu.tw/~cjlin/papers/survey-linear.pdf.
