### Optimization Methods for Large-scale Linear Classification

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at University of Rome “La Sapienza,” June 25, 2013

Part of this talk is based on our recent survey paper in Proceedings of IEEE, 2012

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It’s also related to our development of the software LIBLINEAR

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.

### Outline

Introduction
Binary linear classification
Optimization Methods: Second-order Methods
Optimization Methods: First-order Methods
Experiments
Big-data Machine Learning
Conclusions

Introduction


### Linear and Nonlinear Classification

Some popular methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to another space x ⇒ φ(x)

φ(x)^{T}φ(y) easily calculated; no good control on φ(·)
Linear classification + feature engineering:

We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

We refer to them as nonlinear and linear classifiers; we will focus on linear here


### Linear and Nonlinear Classification

(Figure: linear vs. nonlinear decision boundaries)

By linear we mean data not mapped to a higher dimensional space

Original: [height, weight]

Nonlinear: [height, weight, weight/height^{2}]


### Linear and Nonlinear Classification (Cont’d)

Given training data {y_{i}, x_{i}}, x_{i} ∈ R^{n}, y_{i} = ±1, i = 1, . . . , l

l : # of data, n: # of features

Linear: find (w, b) such that the decision function is

sgn(w^{T}x + b)

Nonlinear: map data to φ(x_{i}). The decision function becomes

sgn(w^{T}φ(x) + b)

Later b is omitted


### Why Linear Classification?

• If φ(x) is high dimensional, w^{T}φ(x) is expensive

• Kernel methods:

w ≡ Σ_{i=1}^{l} α_{i}φ(x_{i}) for some α, K(x_{i}, x_{j}) ≡ φ(x_{i})^{T}φ(x_{j})

New decision function: sgn(Σ_{i=1}^{l} α_{i}K(x_{i}, x))

• Special φ(x) so that calculating K (x_{i}, x_{j}) is easy

• Example:

K(x_{i}, x_{j}) ≡ (x_{i}^{T}x_{j} + 1)^{2} = φ(x_{i})^{T}φ(x_{j}), φ(x) ∈ R^{O(n^{2})}
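For n = 2 this kernel identity can be checked numerically. The explicit map φ below is one illustrative choice of ordering and scaling that realizes this kernel (a sketch, not taken from the talk):

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x^T z + 1)^2, computed in O(n)."""
    return (np.dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit feature map for n = 2; its dimension grows as O(n^2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])
```

The kernel evaluates the inner product in the O(n^{2})-dimensional space at O(n) cost.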


### Why Linear Classification? (Cont’d)

Prediction

w^{T}x versus Σ_{i=1}^{l} α_{i}K(x_{i}, x)

If K(x_{i}, x_{j}) takes O(n), then

O(n) versus O(nl)

Nonlinear: more powerful to separate data
Linear: cheaper and simpler


### Linear is Useful in Some Places

For certain problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

Especially document classification

Number of features (bag-of-words model) very large

Large and sparse data

Training millions of instances in just a few seconds

Recently linear classification has become a popular research topic

Binary linear classification


### Binary Linear Classification

Training data {y_{i}, x_{i}}, x_{i} ∈ R^{n}, i = 1, . . . , l , y_{i} = ±1
l : # of data, n: # of features

min_{w} w^{T}w/2 + C Σ_{i=1}^{l} ξ(w; x_{i}, y_{i})

w^{T}w/2: regularization term

ξ(w; x, y ): loss function: we hope y w^{T}x > 0
C : regularization parameter


### Loss Functions

Some commonly used ones:

ξ_{L1}(w; x, y) ≡ max(0, 1 − y w^{T}x), (1)
ξ_{L2}(w; x, y) ≡ max(0, 1 − y w^{T}x)^{2}, (2)
ξ_{LR}(w; x, y) ≡ log(1 + e^{−y w^{T}x}). (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995):

(1)-(2)

Logistic regression (LR): (3)
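As a concrete reference, the three losses take only a few lines of Python (a minimal sketch using NumPy; function names are mine):

```python
import numpy as np

def l1_loss(w, x, y):
    """L1 (hinge) loss: max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def l2_loss(w, x, y):
    """L2 (squared hinge) loss: max(0, 1 - y w^T x)^2."""
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def lr_loss(w, x, y):
    """Logistic loss: log(1 + exp(-y w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))
```

All three are small when y w^{T}x is large and positive, i.e., when the point is classified correctly with a comfortable margin.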


### Loss Functions (Cont’d)

(Figure: ξ_{L1}, ξ_{L2}, and ξ_{LR} as functions of −y w^{T}x)

They are similar in terms of performance


### Loss Functions (Cont’d)

However,

ξ_{L1}: not differentiable

ξ_{L2}: differentiable but not twice differentiable
ξ_{LR}: twice differentiable

Many optimization methods can be used

Optimization Methods: Second-order Methods


### Truncated Newton Method

Newton direction

min_{s} ∇f(w^{k})^{T}s + (1/2)s^{T}∇^{2}f(w^{k})s

This is the same as solving Newton linear system

∇^{2}f (w^{k})s = −∇f (w^{k})

Hessian matrix ∇^{2}f (w^{k}) too large to be stored

∇^{2}f (w^{k}) : n × n, n : number of features
For document data, n can be millions or more


### Using Special Properties of Data Classification

But Hessian has a special form

∇^{2}f(w) = I + CX^{T}DX,

D diagonal. For logistic regression,

D_{ii} = e^{−y_{i}w^{T}x_{i}} / (1 + e^{−y_{i}w^{T}x_{i}})^{2}

X: data, # instances × # features

X = [x_{1}, . . . , x_{l}]^{T}


### Using Special Properties of Data Classification (Cont’d)

Using CG to solve the linear system. Only Hessian-vector products are needed

∇^{2}f (w)s = s + C · X^{T}(D(X s))
Therefore, we have a Hessian-free approach

In Lin et al. (2008), we use the trust-region + CG approach by Steihaug (1983)

Quadratic convergence is achieved
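The Hessian-vector product is easy to code. The sketch below (my own NumPy illustration with a dense X; LIBLINEAR of course works with sparse data) forms the diagonal D from the logistic loss and never materializes ∇^{2}f(w):

```python
import numpy as np

def hessian_vector_product(w, X, y, C, s):
    """Return (I + C X^T D X) s for logistic regression,
    where D_ii = e^{-t_i} / (1 + e^{-t_i})^2 with t_i = y_i w^T x_i.
    Only matrix-vector products with X are needed (Hessian-free).
    """
    t = y * (X @ w)                        # t_i = y_i w^T x_i
    sigma = 1.0 / (1.0 + np.exp(-t))
    D = sigma * (1.0 - sigma)              # diagonal of D
    return s + C * (X.T @ (D * (X @ s)))   # s + C X^T (D (X s))
```

Each product costs O(#nonzeros of X), so CG can run without ever storing the n × n Hessian.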


### Training L2-loss SVM

The loss function is differentiable but not twice differentiable

ξ_{L2}(w; x, y ) ≡ max(0, 1 − y w^{T}x)^{2}
We can use the generalized Hessian (Mangasarian, 2002)

Works well in practice, but no theoretical quadratic convergence

Optimization Methods: First-order Methods


First-order methods are popular in data classification

Reason: no need to accurately solve the optimization problem

We consider L1-loss SVM as an example here, though the same methods may be extended to L2 and logistic losses


### SVM Dual

From the primal-dual relationship:

min_{α} f(α)
subject to 0 ≤ α_{i} ≤ C, ∀i,

where

f(α) ≡ (1/2)α^{T}Qα − e^{T}α

and

Q_{ij} = y_{i}y_{j}x_{i}^{T}x_{j}, e = [1, . . . , 1]^{T}


### Dual Coordinate Descent

Very simple: minimizing one variable at a time

While α not optimal
  For i = 1, . . . , l
    min_{α_{i}} f(. . . , α_{i}, . . .)
A classic optimization technique

Traced back to Hildreth (1957) if constraints are not considered


### The Procedure

Given current α. Let e_{i} = [0, . . . , 0, 1, 0, . . . , 0]^{T}.

min_{d} f(α + d e_{i}) = (1/2)Q_{ii}d^{2} + ∇_{i}f(α)d + constant

Without constraints, the optimal

d = −∇_{i}f(α)/Q_{ii}

With the constraint 0 ≤ α_{i} + d ≤ C:

α_{i} ← min(max(α_{i} − ∇_{i}f(α)/Q_{ii}, 0), C)


### The Procedure (Cont’d)

∇_{i}f(α) = (Qα)_{i} − 1 = Σ_{j=1}^{l} Q_{ij}α_{j} − 1 = Σ_{j=1}^{l} y_{i}y_{j}x_{i}^{T}x_{j}α_{j} − 1

Directly calculating gradients costs O(ln)
l: # data, n: # features

For linear SVM, define

u ≡ Σ_{j=1}^{l} y_{j}α_{j}x_{j}

Then the gradient calculation is easy, costing only O(n):

∇_{i}f(α) = y_{i}u^{T}x_{i} − 1


### The Procedure (Cont’d)

All we need is to maintain u:

u = Σ_{j=1}^{l} y_{j}α_{j}x_{j}

If ᾱ_{i}: old; α_{i}: new, then

u ← u + (α_{i} − ᾱ_{i})y_{i}x_{i}

Also costs O(n)


### Algorithm

Given initial α and find u = Σ_{i} y_{i}α_{i}x_{i}.

While α is not optimal (outer iteration)
  For i = 1, . . . , l (inner iteration)
    (a) ᾱ_{i} ← α_{i}
    (b) G = y_{i}u^{T}x_{i} − 1
    (c) If α_{i} can be changed
        α_{i} ← min(max(α_{i} − G/Q_{ii}, 0), C)
        u ← u + (α_{i} − ᾱ_{i})y_{i}x_{i}
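A minimal NumPy sketch of this algorithm (dense data for readability, fixed outer-iteration count instead of a stopping condition; names are mine, not LIBLINEAR's):

```python
import numpy as np

def dcd_linear_svm(X, y, C=1.0, n_outer=50, seed=0):
    """Dual coordinate descent for L1-loss linear SVM.
    Maintains u = sum_j y_j alpha_j x_j so each inner step costs O(n).
    """
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)
    Qii = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i (y_i^2 = 1)
    for _ in range(n_outer):               # outer iterations
        for i in rng.permutation(l):       # random order of sub-problems
            if Qii[i] == 0.0:
                continue
            G = y[i] * np.dot(u, X[i]) - 1.0           # grad_i f(alpha), O(n)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]  # O(n) update of u
    return u                               # u plays the role of the primal w
```

The random permutation per outer iteration matches the "any random order" trick discussed later.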


### Analysis

Convergence; from Luo and Tseng (1992)

f (α^{k+1}) − f (α^{∗}) ≤ µ(f (α^{k}) − f (α^{∗})), ∀k ≥ k_{0}.
α^{∗}: optimal solution

Recently we proved the result with k_{0} = 1 (Wang and Lin, 2013)

Difficulty: the objective function is only convex, not strictly convex


### Careful Implementation

Some techniques can improve the running speed

Shrinking: remove α_{i} if it is likely to stay at a bound until the end

Easier to conduct shrinking than in the kernel case (details not shown)

Order of sub-problems being minimized:
α_{1} → α_{2} → · · · → α_{l}

Can use any random order at each outer iteration:
α_{π(1)} → α_{π(2)} → · · · → α_{π(l)}

Very effective in practice


### Difference from the Kernel Case

What if coordinate descent methods are applied to kernel classifiers?

Recall the gradient is

∇_{i}f(α) = Σ_{j=1}^{l} y_{i}y_{j}x_{i}^{T}x_{j}α_{j} − 1 = (y_{i}x_{i})^{T} (Σ_{j=1}^{l} y_{j}x_{j}α_{j}) − 1

but we cannot do this for kernels because

K(x_{i}, x_{j}) = φ(x_{i})^{T}φ(x_{j})

is not separated

If using kernel, the cost of calculating ∇_{i}f (α) must
be O(ln)


### Difference from the Kernel Case (Cont’d)

This difference is similar to our earlier discussion on the prediction cost

w^{T}x versus Σ_{i=1}^{l} α_{i}K(x_{i}, x)

O(n) versus O(nl)

However, if the O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

In contrast, the setting of using u knows only ∇_{i}f(α) rather than the whole ∇f(α)


### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use the ∇f(α) information to select the variable for update

Recall there are two types of coordinate descent methods:

Gauss-Seidel: sequential selection of variables
Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available


### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel

Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

Experiments


### Comparisons

L2-SVM is used

DCDL2: dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: primal coordinate descent
TRON: trust region Newton method


### Objective values (Time in Seconds)

(Figures: objective value vs. training time on news20, rcv1, yahoo-japan, and yahoo-korea)


### Analysis

Dual coordinate descent is very effective if # data and # features are large

Useful for document classification
Half a million data in a few seconds

However, it is less effective if

# features small: should solve primal; or
large penalty parameter C: problems are more ill-conditioned


### An Example When # Features Small

# instances: 32,561, # features: 123

(Figures: objective value and accuracy vs. training time)

Big-data Machine Learning


### Big-data Machine Learning

Data are stored distributedly

This is a new topic and much research is still going on

You may ask what the difference is from distributed optimization

They are related, but now the algorithm must avoid expensive data accesses


### Big-data Machine Learning (Cont’d)

Issues for parallelization

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern


### Simple Distributed Linear Classification I

Bagging: train on several subsets and ensemble the results

- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)

Some results by averaging models (test accuracy, %):

| | yahoo-korea | kddcup10 | webspam | epsilon |
| --- | --- | --- | --- | --- |
| Using all | 87.29 | 89.89 | 99.51 | 89.78 |
| Avg. models | 86.08 | 89.64 | 98.40 | 88.83 |

Using all: solves a single linear SVM


### Simple Distributed Linear Classification II

Avg. models: each node solves a linear SVM on a subset

Slightly worse but in general OK
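The averaging scheme can be sketched as follows (a toy single-process illustration of my own, not the setup behind the table above): split the data into m blocks, train one linear SVM per block, and average the m weight vectors.

```python
import numpy as np

def train_svm(X, y, C=1.0, n_outer=50):
    """Tiny dual coordinate descent solver for L1-loss linear SVM."""
    l, n = X.shape
    alpha, u = np.zeros(l), np.zeros(n)
    Qii = np.einsum('ij,ij->i', X, X)
    for _ in range(n_outer):
        for i in range(l):
            if Qii[i] == 0.0:
                continue
            G = y[i] * np.dot(u, X[i]) - 1.0
            a_old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
            u += (alpha[i] - a_old) * y[i] * X[i]
    return u

def average_models(X, y, m=2, C=1.0):
    """Split data into m blocks (one per 'node'), train separately, average w."""
    blocks = np.array_split(np.arange(len(y)), m)
    ws = [train_svm(X[b], y[b], C) for b in blocks]
    return np.mean(ws, axis=0)
```

In a real distributed setting each block lives on one node, and only the m weight vectors need to be communicated.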


### ADMM by Boyd et al. (2011) I

Recall the SVM problem (bias term b omitted):

min_{w} (1/2)w^{T}w + C Σ_{i=1}^{l} max(0, 1 − y_{i}w^{T}x_{i})

An equivalent optimization problem:

min_{w_{1},...,w_{m},z} (1/2)z^{T}z + C Σ_{j=1}^{m} Σ_{i∈B_{j}} max(0, 1 − y_{i}w_{j}^{T}x_{i}) + (ρ/2) Σ_{j=1}^{m} ‖w_{j} − z‖^{2}

subject to w_{j} − z = 0, ∀j


### ADMM by Boyd et al. (2011) II

The key is that

z = w_{1} = · · · = w_{m}
are all optimal w

This optimization problem was proposed in the 1970s, but is now applied to distributed machine learning

Each node has a subset B_{j} and updates w_{j}

Only w_{1}, . . . , w_{m} must be collected

Data are not moved; less communication cost

Still, we cannot afford too many iterations because of communication cost


### Vowpal Wabbit (Langford et al., 2007) I

It started as a linear classification package on a single computer

After version 6.0, Hadoop support has been provided

A hybrid approach: parallel SGD initially, then switching to LBFGS (quasi-Newton)

They argue that AllReduce is a more suitable operation than MapReduce

What is AllReduce?

Every node starts with a value and ends up with the sum at all nodes
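A single-process simulation makes the semantics concrete (my own sketch; real implementations such as MPI's Allreduce do this with a tree or ring over the network):

```python
import numpy as np

def allreduce_sum(per_node_values):
    """Simulate AllReduce: every node contributes a vector, and every
    node ends up holding the elementwise sum over all nodes."""
    total = np.sum(per_node_values, axis=0)
    return [total.copy() for _ in per_node_values]
```

For example, each node computes a local gradient over its data shard; after AllReduce every node holds the global gradient and can take the same optimization step.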


### Vowpal Wabbit (Langford et al., 2007) II

In Agarwal et al. (2012), the authors argue that many machine learning algorithms can be implemented using AllReduce

LBFGS is an example

They train 17B samples with 16M features on 1K nodes in 70 minutes

Conclusions


### Conclusions

Linear classification is an old topic, but recently there are new applications and large-scale challenges

The optimization problem can be solved by many existing techniques

However, some machine-learning aspects must be considered

In particular, data access may become a bottleneck in large-scale scenarios

Overall, linear classification is still an ongoing and exciting research area