### Large-scale Linear Classification: Status and Challenges

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at Criteo Machine Learning Workshop, November 8, 2017

### Outline

1. Introduction

2. Optimization methods

3. Multi-core linear classification

4. Distributed linear classification

5. Conclusions

Introduction


### Linear Classification

Although many new and advanced techniques are available (e.g., deep learning), linear classifiers remain useful because of their simplicity

Training and prediction are fast for large-scale data

Training a linear classifier means solving a large-scale optimization problem

The focus of this talk is on how to solve this optimization problem


### The Software LIBLINEAR

This talk is closely related to the research done while developing LIBLINEAR, a software package for linear classification

www.csie.ntu.edu.tw/~cjlin/liblinear

It is now one of the most used linear classification tools


### Linear and Kernel Classification

Methods such as SVM and logistic regression are often used in two ways

Kernel methods: data are mapped to another space, $x \Rightarrow \phi(x)$

$\phi(x)^T\phi(y)$ is easily calculated, but there is no good control on $\phi(\cdot)$

Feature engineering + linear classification:

Directly use $x$ without mapping. But $x$ may have been carefully generated, so we have full control on $x$


### Comparison Between Linear and Kernel

For certain problems, the accuracy of linear classifiers is as good as that of kernel classifiers

But training and testing are much faster

This holds especially for document classification

The number of features (bag-of-words model) is very large

The data are large and sparse

Training on millions of instances takes just a few seconds


### Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)

| Data set | Linear: Time (s) | Linear: Accuracy (%) | RBF Kernel: Time (s) | RBF Kernel: Accuracy (%) |
|---|---|---|---|---|
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype multiclass | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Size reasonably large: e.g., yahoo-japan has 140k instances and 830k features


### Binary Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$

$l$: # of data, $n$: # of features

$$\min_w f(w), \quad \text{where } f(w) \equiv C \sum_{i=1}^{l} \xi(w; x_i, y_i) + \begin{cases} \frac{1}{2} w^T w & \text{L2 regularization} \\ \|w\|_1 & \text{L1 regularization} \end{cases}$$

$\xi(w; x, y)$: loss function; we hope $y w^T x > 0$

$C$: regularization parameter


### Loss Functions

Some commonly used loss functions:

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x), \tag{1}$$

$$\xi_{L2}(w; x, y) \equiv \max(0, 1 - y w^T x)^2, \tag{2}$$

$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}). \tag{3}$$

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
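
As a rough illustration of the objective and the three losses (1)-(3), the following Python sketch evaluates $f(w)$ for the L2-regularized case on small dense lists. The function names (`l1_loss`, `l2_loss`, `lr_loss`, `objective`) are hypothetical and chosen for this example only; they are not LIBLINEAR's API.

```python
import math

def dot(a, b):
    """Inner product of two dense vectors stored as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def l1_loss(w, x, y):
    """Loss (1): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * dot(w, x))

def l2_loss(w, x, y):
    """Loss (2): max(0, 1 - y w^T x)^2."""
    return max(0.0, 1.0 - y * dot(w, x)) ** 2

def lr_loss(w, x, y):
    """Loss (3): log(1 + exp(-y w^T x)), written to avoid overflow."""
    z = -y * dot(w, x)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def objective(w, X, Y, C, loss):
    """f(w) = C * sum_i loss(w; x_i, y_i) + (1/2) w^T w (L2 regularization)."""
    return C * sum(loss(w, x, y) for x, y in zip(X, Y)) + 0.5 * dot(w, w)
```

For example, `objective([0.0, 0.0], X, Y, 1.0, l1_loss)` simply counts the instances, because every instance has loss 1 at $w = 0$.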

Optimization methods


### Optimization Methods

A difference between linear and kernel is that for kernel, optimization must be over a variable $\alpha$ (usually through the dual problem), where

$$w = \sum_{i=1}^{l} \alpha_i \phi(x_i)$$

We cannot minimize over $w$, which may be infinite dimensional

However, for linear, minimizing over either $w$ or $\alpha$ is ok


### Optimization Methods (Cont’d)

Unconstrained optimization methods can be categorized into

Low-order methods: quickly get a model, but slow final convergence

High-order methods: more robust and useful for ill-conditioned situations

We will show both types of optimization methods are useful for linear classification

Further, to handle large problems, the algorithms must take problem structure into account

Let’s discuss a low-order method (coordinate descent) in detail


### Coordinate Descent

We consider L1-loss and the dual SVM problem

$$\min_\alpha f(\alpha) \quad \text{subject to } 0 \le \alpha_i \le C, \ \forall i,$$

where

$$f(\alpha) \equiv \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$$

and

$$Q_{ij} = y_i y_j x_i^T x_j, \quad e = [1, \ldots, 1]^T$$

We will apply coordinate descent (CD) methods

The situation for L2 or LR loss is very similar


### Coordinate Descent (Cont’d)

For the current $\alpha$, change $\alpha_i$ while fixing the others

Let

$$e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$$

The sub-problem is

$$\min_d f(\alpha + d e_i) = \frac{1}{2} Q_{ii} d^2 + \nabla_i f(\alpha) d + \text{constant}$$

subject to $0 \le \alpha_i + d \le C$

Without constraints,

$$\text{optimal } d = -\frac{\nabla_i f(\alpha)}{Q_{ii}}$$
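
The quadratic form of the sub-problem follows from expanding $f$ along the $i$th coordinate. The short derivation below is a sketch of that step; it uses only the definitions of $f$, $Q$, and $e$ given for the dual problem, together with the symmetry of $Q$.

```latex
\begin{aligned}
f(\alpha + d e_i)
  &= \tfrac{1}{2}(\alpha + d e_i)^T Q (\alpha + d e_i) - e^T(\alpha + d e_i) \\
  &= \tfrac{1}{2}\alpha^T Q \alpha - e^T \alpha
     + d\,\bigl((Q\alpha)_i - 1\bigr) + \tfrac{1}{2} Q_{ii} d^2 \\
  &= f(\alpha) + \nabla_i f(\alpha)\, d + \tfrac{1}{2} Q_{ii} d^2,
  \qquad \nabla_i f(\alpha) = (Q\alpha)_i - 1 .
\end{aligned}
```

Here $f(\alpha)$ plays the role of the constant. Setting the derivative with respect to $d$ to zero gives $d = -\nabla_i f(\alpha)/Q_{ii}$, provided $Q_{ii} > 0$.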


### Coordinate Descent (Cont’d)

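Now, taking the constraint $0 \le \alpha_i + d \le C$ into account, the update is

$$\alpha_i \leftarrow \min\left(\max\left(\alpha_i - \frac{\nabla_i f(\alpha)}{Q_{ii}}, 0\right), C\right)$$

Note that

$$\nabla_i f(\alpha) = (Q\alpha)_i - 1 = \sum_{j=1}^{l} Q_{ij}\alpha_j - 1 = \sum_{j=1}^{l} y_i y_j x_i^T x_j \alpha_j - 1$$

Expensive: $O(ln)$, $l$: # instances, $n$: # features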


### Coordinate Descent (Cont’d)

A trick in Hsieh et al. (2008) is to define and maintain

$$u \equiv \sum_{j=1}^{l} y_j \alpha_j x_j$$

Easy gradient calculation: the cost is $O(n)$

$$\nabla_i f(\alpha) = y_i u^T x_i - 1$$

Note that this cannot be done for kernel because the mapped vector $\phi(x_i)$ is high dimensional
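
To make the saving concrete, the sketch below computes the same gradient component in two ways: directly from the sum over all instances ($O(ln)$), and through the maintained vector $u$ ($O(n)$ once $u$ is available). The helper names (`grad_naive`, `compute_u`, `grad_with_u`) and the dense-list data layout are assumptions made for this illustration.

```python
def dot(a, b):
    """Inner product of two dense vectors stored as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def grad_naive(i, alpha, X, Y):
    """Gradient component i from sum_j y_i y_j x_i^T x_j alpha_j - 1: O(l n)."""
    total = sum(Y[i] * Y[j] * dot(X[i], X[j]) * alpha[j] for j in range(len(X)))
    return total - 1.0

def compute_u(alpha, X, Y):
    """u = sum_j y_j alpha_j x_j, built once from scratch."""
    u = [0.0] * len(X[0])
    for a, x, y in zip(alpha, X, Y):
        for k, xk in enumerate(x):
            u[k] += y * a * xk
    return u

def grad_with_u(i, u, X, Y):
    """Gradient component i from y_i u^T x_i - 1: O(n) given u."""
    return Y[i] * dot(u, X[i]) - 1.0

# Toy data, made up for this example.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
Y = [1, -1, 1]
alpha = [0.5, 0.25, 1.0]
u = compute_u(alpha, X, Y)
for i in range(len(X)):
    print(i, grad_naive(i, alpha, X, Y), grad_with_u(i, u, X, Y))
```

Both columns of the printed output agree up to rounding, which is the point of the trick: only the cost differs.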


### Coordinate Descent (Cont’d)

The procedure

While $\alpha$ is not optimal (Outer iteration)

For $i = 1, \ldots, l$ (Inner iteration)

(a) $\bar\alpha_i \leftarrow \alpha_i$

(b) $G = y_i u^T x_i - 1$

(c) $\alpha_i \leftarrow \min(\max(\alpha_i - G/Q_{ii}, 0), C)$

(d) If $\alpha_i$ needs to be changed, $u \leftarrow u + (\alpha_i - \bar\alpha_i) y_i x_i$

Maintaining $u$ also costs $O(n)$
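
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not LIBLINEAR's implementation: it uses dense lists, visits instances in a fixed order, and replaces "$\alpha$ is not optimal" with a simple stopping rule (the largest change of any $\alpha_i$ in an outer iteration falls below a hypothetical tolerance `tol`). The function name `dual_cd_l1svm` is also made up for this example.

```python
def dual_cd_l1svm(X, Y, C, max_outer=100, tol=1e-6):
    """Dual coordinate descent for L1-loss SVM on dense lists (sketch)."""
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    u = [0.0] * n                                  # u = sum_j y_j alpha_j x_j
    Qii = [sum(v * v for v in x) for x in X]       # Q_ii = x_i^T x_i since y_i^2 = 1
    for _ in range(max_outer):                     # outer iteration
        max_change = 0.0
        for i in range(l):                         # inner iteration
            if Qii[i] == 0.0:
                continue                           # all-zero instance: skip
            old = alpha[i]                                         # (a)
            G = Y[i] * sum(uk * xk for uk, xk in zip(u, X[i])) - 1.0  # (b)
            new = min(max(old - G / Qii[i], 0.0), C)               # (c)
            if new != old:                                         # (d)
                delta = (new - old) * Y[i]
                for k in range(n):
                    u[k] += delta * X[i][k]
                alpha[i] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:                       # assumed stopping rule
            break
    return alpha, u

# Toy separable problem, made up for this example.
X = [[2.0, 0.0], [-2.0, 0.0]]
Y = [1, -1]
alpha, u = dual_cd_l1svm(X, Y, C=1.0)
print(alpha, u)
```

In the linear case the maintained $u$ is the primal weight vector, so the sign of $u^T x$ can be used directly for prediction.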


### Coordinate Descent (Cont’d)

Having

$$u \equiv \sum_{j=1}^{l} y_j \alpha_j x_j, \quad \nabla_i f(\alpha) = y_i u^T x_i - 1$$

and the update

$$u \leftarrow u + (\alpha_i - \bar\alpha_i) y_i x_i, \quad \bar\alpha_i: \text{old}, \ \alpha_i: \text{new}$$

is essential

This isn't the vanilla CD dating back to Hildreth (1957)

We take the problem structure into account


### Comparisons

L2-loss SVM is used

DCDL2: Dual coordinate descent

DCDL2-S: DCDL2 with shrinking

PCD: Primal coordinate descent

TRON: Trust region Newton method

This result is from Hsieh et al. (2008) with $C = 1$


### Objective values (Time in Seconds)

[Figure: objective values versus training time in seconds; panels: news20, rcv1, yahoo-japan, yahoo-korea]


### Low- versus High-order Methods

We see low-order methods are efficient, but high-order methods are useful for difficult situations

CD for dual

$ time ./train -c 1 news20.scale

2.528s

$ time ./train -c 100 news20.scale

28.589s

Newton for primal

$ time ./train -c 1 -s 2 news20.scale

8.596s

$ time ./train -c 100 -s 2 news20.scale

11.088s


### Training Medium-sized Data: Status

Basically a solved problem

However, as data and memory continue to grow, new techniques are needed for large-scale sets.

Two possible strategies are

1. Multi-core linear classification

2. Distributed linear classification

Multi-core linear classification


### Multi-core Linear Classification

Nowadays each CPU has several cores

However, parallelizing algorithms to use multiple cores may not be that easy

In fact, algorithms may need to be redesigned

For the past two years we have been working on multi-core LIBLINEAR


### Multi-core Linear Classification (Cont’d)

Three multi-core solvers have been released

1. Newton method for primal L2-regularized problem (Lee et al., 2015)

2. Coordinate descent method for dual L2-regularized problem (Chiang et al., 2016)

3. Coordinate descent method for primal L1-regularized problem (Zhuang et al., 2017)

They are practically useful. For example, one user from USC thanked us because “a job (taking >30 hours using one core) now can finish within 5 hours”

We will briefly discuss the 2nd and the 3rd


### Multi-core CD for Dual

Recall the CD algorithm for dual is

While $\alpha$ is not optimal (Outer iteration)

For $i = 1, \ldots, l$ (Inner iteration)

(a) $\bar\alpha_i \leftarrow \alpha_i$

(b) $G = y_i u^T x_i - 1$

(c) $\alpha_i \leftarrow \min(\max(\alpha_i - G/Q_{ii}, 0), C)$

(d) If $\alpha_i$ needs to be changed, $u \leftarrow u + (\alpha_i - \bar\alpha_i) y_i x_i$


### Multi-core CD for Dual (Cont’d)

The algorithm is inherently sequential

Suppose $\alpha_{i'}$ is updated after $\alpha_i$

Then $\alpha_{i'}$ must wait until the latest $u$ is obtained

The parallelization is difficult


### Multi-core CD for Dual (Cont’d)

Asynchronous CD is possible (Hsieh et al., 2015), but may diverge

We note that for a given set $\bar B$, the gradient components

$$\nabla_i f(\alpha) = y_i u^T x_i - 1, \quad \forall i \in \bar B$$

can be calculated in parallel

We then propose a framework


### Multi-core CD for Dual (Cont’d)

While $\alpha$ is not optimal

(a) Select a set $\bar B$

(b) Calculate $\nabla_{\bar B} f(\alpha)$ in parallel

(c) Select $B \subset \bar B$ with $|B| \ll |\bar B|$

(d) Sequentially update $\alpha_i$, $i \in B$


### Multi-core CD for Dual (Cont’d)

The selection of $B \subset \bar B$ with $|B| \ll |\bar B|$ is by $\nabla_{\bar B} f(\alpha)$

The idea is simple, but it takes effort to obtain a practical setting (details omitted)
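
Since the practical details are omitted, the following Python sketch only illustrates the shape of the framework and fills the gaps with assumptions that are not taken from Chiang et al. (2016): $\bar B$ is sampled at random, $B$ keeps the indices with the largest projected-gradient magnitude, and each $i \in B$ recomputes its gradient from the latest $u$ before updating. The names `parallel_dual_cd`, `bar_size`, and `b_size` are hypothetical. Python threads also do not give real multi-core speedup for this arithmetic, so the thread pool only marks which step is parallelizable.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    """Inner product of two dense vectors stored as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def projected_grad(G, a, C):
    """Gradient projected onto the box constraint 0 <= alpha_i <= C."""
    if a <= 0.0:
        return min(G, 0.0)
    if a >= C:
        return max(G, 0.0)
    return G

def parallel_dual_cd(X, Y, C, bar_size=4, b_size=2, rounds=200, workers=2, seed=0):
    rng = random.Random(seed)
    l, n = len(X), len(X[0])
    alpha, u = [0.0] * l, [0.0] * n
    Qii = [dot(x, x) for x in X]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            # (a) select a candidate set B_bar (assumption: uniform sampling)
            bar = rng.sample(range(l), min(bar_size, l))
            # (b) gradient components in parallel; u is only read here
            grads = list(pool.map(lambda i: Y[i] * dot(u, X[i]) - 1.0, bar))
            # (c) keep the most promising indices (assumed selection rule)
            ranked = sorted(zip(bar, grads), reverse=True,
                            key=lambda t: abs(projected_grad(t[1], alpha[t[0]], C)))
            B = [i for i, _ in ranked[:b_size]]
            # (d) sequential CD updates on B, each using the latest u
            for i in B:
                if Qii[i] == 0.0:
                    continue
                G = Y[i] * dot(u, X[i]) - 1.0
                new = min(max(alpha[i] - G / Qii[i], 0.0), C)
                delta = (new - alpha[i]) * Y[i]
                if delta != 0.0:
                    for k in range(n):
                        u[k] += delta * X[i][k]
                    alpha[i] = new
    return alpha, u

# Toy separable problem, made up for this example.
X = [[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]]
Y = [1, 1, -1, -1]
alpha, u = parallel_dual_cd(X, Y, C=1.0)
print(alpha, u)
```

The design point the sketch tries to show is that only step (b) runs concurrently, while every write to $u$ stays in the sequential step (d), so no update ever works with a stale $u$.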


### Multi-core CD for Dual (Cont’d)

[Figure panels: webspam, url_combined]

Alg-4: the method in Chiang et al. (2016)

Asynchronous CD (Hsieh et al., 2015)


### Multi-core CD for L1 Regularization

Currently, primal CD (Yuan et al., 2010) or its variants (Yuan et al., 2012) is the state of the art for L1

Each CD step involves one feature

Some attempts at parallel CD for L1 include

Asynchronous CD (Bradley et al., 2011)

Block CD (Bian et al., 2013)

These methods are unsatisfactory because of either divergence issues or poor speedup


### Multi-core CD for L1 Regularization (Cont’d)

We struggled for years to find a solution

Recently, in Zhuang et al. (2017), we found an effective setting

This work is partially supported by a Criteo Faculty Research Award

Our idea is simple: direct parallelization of CD

But wait... this shouldn’t work because each CD iteration is cheap


### Direct Parallelization of CD

Let’s consider a simple setting to decide if one CD step should be parallelized or not

if #non-zeros in an instance/feature ≥ a threshold then

multi-core

else

single-core

Idea: a CD step is parallelized if there are enough operations
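
A minimal sketch of this rule, assuming the dominant operation in a CD step is a sparse inner product: the product is split across workers only when the number of non-zeros reaches the threshold. The function names, the chunking scheme, and the default of 4 chunks are assumptions for illustration; the threshold default of 500 is the value quoted for dense sets in the experiments. As with any Python thread pool, this shows the dispatch logic rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def sparse_dot_serial(idx, val, w):
    """Inner product of a sparse vector (indices, values) with a dense w."""
    return sum(v * w[j] for j, v in zip(idx, val))

def sparse_dot(idx, val, w, pool, threshold=500, chunks=4):
    """Return (value, mode); parallelize only if there are enough non-zeros."""
    nnz = len(idx)
    if nnz < threshold:
        # Too little work to amortize the parallel overhead.
        return sparse_dot_serial(idx, val, w), "single-core"
    step = (nnz + chunks - 1) // chunks
    parts = [(idx[s:s + step], val[s:s + step]) for s in range(0, nnz, step)]
    partial = pool.map(lambda p: sparse_dot_serial(p[0], p[1], w), parts)
    return sum(partial), "multi-core"

w = [1.0] * 1000
with ThreadPoolExecutor(max_workers=4) as pool:
    small = sparse_dot([0, 5], [2.0, 3.0], w, pool)
    large = sparse_dot(list(range(600)), [1.0] * 600, w, pool)
print(small, large)
```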


### Direct Parallelization of CD (Cont’d)

Speedup of CD for dual, L2 regularization

| | Data set | 2 threads | 4 threads | 8 threads |
|---|---|---|---|---|
| Sparse sets | avazu-app | 0.4 | 0.3 | 0.2 |
| Sparse sets | criteo | 0.5 | 0.3 | 0.2 |
| Dense sets | epsilon_normalized | 1.3 | 1.3 | 1.1 |
| Dense sets | splice_site.t.10% | 1.8 | 2.8 | 4.1 |

CD for dual: one instance at a time

Threshold: 0 (sparse), 500 (dense)

If the threshold is 500 for sparse sets, no instance is parallelized

The speedup is poor


### % of Instances/Features Containing 50% and 80% of #non-zeros

| Data set | Instance: 50% | Instance: 80% | Feature: 50% | Feature: 80% |
|---|---|---|---|---|
| avazu-app | 50% | 80% | 0.2% | 1% |
| criteo | 50% | 80% | 0.01% | 0.2% |
| kdd2010-a | 40% | 73% | 0.03% | 2% |
| kdd2012 | 50% | 80% | 0.003% | 0.5% |
| rcv1_test | 24% | 54% | 1% | 5% |
| splice_site.t.10% | 50% | 80% | 9% | 57% |
| url_combined | 44% | 76% | 0.002% | 0.006% |
| webspam | 29% | 55% | 0.6% | 2% |
| yahoo-korea | 20% | 48% | 0.07% | 0.5% |

Features’ non-zero distribution is extremely skewed

Most non-zeros are in a few dense (and parallelizable) features


### Speedup of CD for L1 Regularization

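LR loss used. The columns 2, 4, 8 are the number of threads

| Data set | Naive: 2 | Naive: 4 | Naive: 8 | Block CD: 2 | Block CD: 4 | Block CD: 8 | Async. CD: 2 | Async. CD: 4 | Async. CD: 8 |
|---|---|---|---|---|---|---|---|---|---|
| avazu-app | 1.9 | 3.4 | 5.6 | 0.4 | 0.7 | 1.0 | 1.4 | 2.7 | 3.4 |
| criteo | 1.8 | 3.3 | 5.5 | 0.7 | 1.2 | 1.9 | 1.5 | 2.9 | 4.8 |
| epsilon_normalized | 2.0 | 4.0 | 7.9 | x | x | x | 1.3 | 2.1 | x |
| HIGGS | 2.0 | 3.9 | 7.5 | 0.7 | 0.8 | 0.9 | 1.0 | 1.3 | x |
| kdd2010-a | 1.7 | 2.4 | 3.1 | 0.8 | 1.4 | 2.4 | 1.5 | 2.7 | 4.8 |
| kdd2012 | 1.9 | 2.8 | 3.9 | 0.2 | 0.4 | 0.6 | 2.1 | 4.7 | 7.0 |
| rcv1_test | 1.9 | 3.4 | 5.9 | x | x | x | 1.3 | 2.5 | 4.5 |
| splice_site.t.10% | 1.9 | 3.6 | 6.2 | x | x | x | 1.6 | 2.7 | 4.3 |
| url_combined | 2.0 | 3.5 | 6.2 | 0.5 | 0.9 | 1.3 | 1.0 | 1.7 | 1.7 |
| webspam | 1.8 | 3.2 | 4.8 | 0.1 | 0.3 | 0.5 | 1.4 | 2.5 | 4.1 |
| yahoo-korea | 1.9 | 3.5 | 5.9 | 0.2 | 0.3 | 0.5 | 1.3 | 2.4 | 4.4 |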

Distributed linear classification


### Distributed Linear Classification

It’s even more complicated than multi-core

I don’t have time to discuss this topic in detail, but let me share some lessons

A big mistake was that we worked on distributed before multi-core


### Distributed Linear Classification (Cont’d)

A few years ago, big data was hot. So we extended a Newton solver in LIBLINEAR to MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)

We were a bit ahead of our time; Spark MLlib wasn’t even available then

Unfortunately, very few people use our code, especially the Spark one

We then moved to multi-core. Multi-core LIBLINEAR immediately gained many users


### Distributed Linear Classification (Cont’d)

Why did we fail? Several possible reasons:

Not many people have big data?

System issues are more important than we thought

At that time Spark wasn’t easy to use and was being actively changed

System configurations and application scenarios may vary significantly

An algorithm useful for systems with fast network speed may be useless for systems with slow communication


### Distributed Linear Classification (Cont’d)

Application dependency is stronger.

L2 and L1 regularization often give similar accuracy.

On a single machine, we may not want to use L1 because training is more difficult and the smaller model size isn’t that important

However, for distributed applications many have told me that they need L1

A lesson is that for people from academia, it’s better to collaborate with industry for research on distributed machine learning

Conclusions


### Conclusions

Linear classification is an old topic, but it remains useful for many applications

Efficient training relies on designing optimization algorithms that incorporate the problem structure

Many issues about multi-core and distributed linear classification still need to be studied