# Optimization Methods for Large-scale Linear Classification

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at University of Rome “La Sapienza,” June 25, 2013


Part of this talk is based on our recent survey paper in Proceedings of IEEE, 2012

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It’s also related to our development of the software LIBLINEAR

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.


### Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

## Introduction


### Linear and Nonlinear Classification

Some popular methods such as SVM and logistic regression can be used in two ways:

Kernel methods: data are mapped to another space, x ⇒ φ(x). The inner product φ(x)Tφ(y) is easily calculated, but there is no good control on φ(·).

Linear classification + feature engineering: we have x without mapping. Alternatively, we can say that φ(x) is our x; we have full control on x or φ(x).

We refer to them as nonlinear and linear classifiers; we will focus on linear here.


### Linear and Nonlinear Classification

(Figures: example decision boundaries of a linear and a nonlinear classifier)

By linear we mean data not mapped to a higher dimensional space

Original: [height, weight]

Nonlinear: [height, weight, weight/height2]


### Linear and Nonlinear Classification (Cont’d)

Given training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l , yi = ±1

l : # of data, n: # of features

Linear: find (w, b) such that the decision function is

sgn(wTx + b)

Nonlinear: map data to φ(xi). The decision function becomes

sgn(wTφ(x) + b)

Later, b is omitted
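As a tiny illustration (not code from the talk), the linear decision function is a one-liner in numpy:

```python
import numpy as np

def predict(w, b, X):
    """Linear decision function sgn(w^T x + b), applied to each row of X."""
    return np.sign(X @ w + b)

w, b = np.array([1.0, -1.0]), 0.0
X = np.array([[2.0, 1.0], [0.0, 3.0]])
# predict(w, b, X) -> [ 1., -1.]
```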


### Why Linear Classification?

• If φ(x) is high dimensional, wTφ(x) is expensive

• Kernel methods:

w ≡ Σ_{i=1}^l αi φ(xi) for some α, K(xi, xj) ≡ φ(xi)Tφ(xj)

New decision function: sgn( Σ_{i=1}^l αi K(xi, x) )

• Special φ(x) so that calculating K(xi, xj) is easy

• Example:

K(xi, xj) ≡ (xiTxj + 1)2 = φ(xi)Tφ(xj), φ(x) ∈ R^{O(n2)}
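To make the O(n2)-dimensional mapping concrete, here is a small numpy check (an illustration, not part of the talk) that the degree-2 polynomial kernel equals an inner product under an explicit map φ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial map: phi(x).dot(phi(z)) == (x.dot(z) + 1)**2.
    Its dimension, 1 + 2n + n(n-1)/2, is O(n^2)."""
    n = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(5), rng.standard_normal(5)

lhs = (xi.dot(xj) + 1) ** 2   # kernel evaluation: O(n)
rhs = phi(xi).dot(phi(xj))    # explicit mapping: O(n^2)
assert np.isclose(lhs, rhs)
```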


### Why Linear Classification? (Cont’d)

Prediction:

wTx versus Σ_{i=1}^l αi K(xi, x)

If K(xi, xj) takes O(n), the costs are

O(n) versus O(nl)

Nonlinear: more powerful to separate data

Linear: cheaper and simpler


### Linear is Useful in Some Places

For certain problems, the accuracy of linear is as good as that of nonlinear, but training and testing are much faster

This is especially true for document classification: the number of features (bag-of-words model) is very large, and data are large and sparse

Training millions of data takes just a few seconds

Recently, linear classification has become a popular research topic

## Binary linear classification


### Binary Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1; l: # of data, n: # of features

min_w  wTw/2 + C Σ_{i=1}^l ξ(w; xi, yi)

wTw/2: regularization term

ξ(w; x, y): loss function; we hope y wTx > 0

C: regularization parameter
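As a sketch (not code from the talk), the objective with the hinge loss ξL1 defined on the next slide is:

```python
import numpy as np

def objective(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w.dot(w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
# at w = 0 every hinge term is 1, so f(0) = C * l
# objective(np.zeros(2), X, y, C=2.0) -> 4.0
```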


### Loss Functions

Some commonly used ones:

ξL1(w; x, y) ≡ max(0, 1 − y wTx),  (1)

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2,  (2)

ξLR(w; x, y) ≡ log(1 + e−y wTx).  (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
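For reference, the three losses are one-liners as functions of the margin m = y wTx (a simple illustration, not code from the talk):

```python
import numpy as np

def xi_L1(m):
    return np.maximum(0.0, 1.0 - m)          # (1) hinge loss

def xi_L2(m):
    return np.maximum(0.0, 1.0 - m) ** 2     # (2) squared hinge loss

def xi_LR(m):
    return np.log1p(np.exp(-m))              # (3) logistic loss

m = np.array([-1.0, 0.0, 0.5, 2.0])
# xi_L1(m) -> [2., 1., 0.5, 0.]: zero once the margin exceeds 1
```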


### Loss Functions (Cont’d)

(Figure: ξL1, ξL2, and ξLR plotted as functions of −y wTx)

They are similar in terms of performance


### Loss Functions (Cont’d)

However,

ξL1: not differentiable

ξL2: differentiable but not twice differentiable

ξLR: twice differentiable

Many optimization methods can be used

## Optimization Methods: Second-order Methods


### Truncated Newton Method

Newton direction:

min_s ∇f(wk)Ts + (1/2) sT∇2f(wk)s

This is the same as solving the Newton linear system

∇2f(wk) s = −∇f(wk)

The Hessian matrix ∇2f(wk) is too large to be stored:

∇2f(wk): n × n, n: number of features

For document data, n can be millions or more


### Using Special Properties of Data Classification

But the Hessian has a special form:

∇2f(w) = I + C XTDX, D diagonal

For logistic regression,

Dii = e−yiwTxi / (1 + e−yiwTxi)2

X: the data matrix, # instances × # features

X = [x1, . . . , xl]T


### Using Special Properties of Data Classification (Cont’d)

Use CG (conjugate gradient) to solve the linear system; only Hessian-vector products are needed:

∇2f(w) s = s + C · XT(D(X s))

Therefore, we have a Hessian-free approach.

In Lin et al. (2008), we use the trust-region + CG approach by Steihaug (1983)
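A sketch of the Hessian-vector product in numpy (dense arrays for clarity; LIBLINEAR's actual implementation works on sparse data):

```python
import numpy as np

def hess_vec(s, X, y, w, C):
    """Compute (I + C X^T D X) s for logistic regression without ever
    forming the n x n Hessian; D is kept as a length-l vector."""
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma_i = 1/(1+e^{-y_i w^T x_i})
    D = sigma * (1.0 - sigma)                    # D_ii = e^{-t}/(1+e^{-t})^2
    return s + C * (X.T @ (D * (X @ s)))
```

Plugging `hess_vec` into a conjugate-gradient solver then yields the (truncated) Newton step without storing the Hessian.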


### Training L2-loss SVM

The loss function is differentiable but not twice differentiable

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2

We can use the generalized Hessian (Mangasarian, 2002)

It works well in practice, but there is no theoretical quadratic convergence

## Optimization Methods: First-order Methods


First-order methods are popular in data classification. The reason: there is no need to accurately solve the optimization problem.

We consider L1-loss SVM as an example here, though the same methods may be extended to L2 and logistic loss.


### SVM Dual

From the primal-dual relationship, the dual problem is

min_α f(α)
subject to 0 ≤ αi ≤ C, ∀i,

where

f(α) ≡ (1/2) αTQα − eTα

and

Qij = yiyjxiTxj, e = [1, . . . , 1]T


### Dual Coordinate Descent

Very simple: minimize one variable at a time

While α is not optimal
  For i = 1, . . . , l
    min_{αi} f(. . . , αi, . . .)

A classic optimization technique, traced back to Hildreth (1957) if constraints are not considered


### The Procedure

Given the current α, let ei = [0, . . . , 0, 1, 0, . . . , 0]T with the 1 in position i.

min_d f(α + d ei) = (1/2) Qii d2 + ∇if(α) d + constant

Without constraints, the optimal step is

d = −∇if(α) / Qii

With the constraint 0 ≤ αi + d ≤ C, the update becomes

αi ← min( max( αi − ∇if(α)/Qii, 0 ), C )


### The Procedure (Cont’d)

∇if(α) = (Qα)i − 1 = Σ_{j=1}^l Qij αj − 1 = Σ_{j=1}^l yiyjxiTxj αj − 1

Directly calculating gradients costs O(ln); l: # data, n: # features

For linear SVM, define

u ≡ Σ_{j=1}^l yj αj xj

Then the gradient calculation is easy and costs only O(n):

∇if(α) = yi uTxi − 1


### The Procedure (Cont’d)

All we need is to maintain u:

u = Σ_{j=1}^l yj αj xj

If ᾱi is the old value and αi the new one, then

u ← u + (αi − ᾱi) yi xi

This also costs O(n)


### Algorithm

Given initial α, find u = Σ_i yi αi xi.

While α is not optimal (outer iteration)
  For i = 1, . . . , l (inner iteration)
    (a) ᾱi ← αi
    (b) G = yi uTxi − 1
    (c) If αi can be changed
          αi ← min(max(αi − G/Qii, 0), C)
          u ← u + (αi − ᾱi) yi xi
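The whole algorithm fits in a few lines of numpy. This is a bare-bones sketch (dense data, no shrinking), not LIBLINEAR's production code:

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, n_outer=50, seed=0):
    """Dual coordinate descent for L1-loss linear SVM.  Maintains
    u = sum_j y_j alpha_j x_j so each gradient costs only O(n)."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)
    Qii = (X ** 2).sum(axis=1)               # diagonal of Q: ||x_i||^2
    rng = np.random.default_rng(seed)
    for _ in range(n_outer):                 # outer iterations
        for i in rng.permutation(l):         # random order of sub-problems
            if Qii[i] == 0.0:
                continue
            G = y[i] * u.dot(X[i]) - 1.0     # single-variable gradient, O(n)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]   # O(n) update of u
    return u                                 # u equals the primal w

# toy separable data: the learned w classifies all points correctly
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd_l1_svm(X, y, C=1.0)
```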


### Analysis

Linear convergence, from Luo and Tseng (1992):

f(αk+1) − f(α∗) ≤ µ (f(αk) − f(α∗)), ∀k ≥ k0

α∗: an optimal solution

Recently we proved the result with k0 = 1 (Wang and Lin, 2013)

Difficulty: the objective function is only convex rather than strictly convex


### Careful Implementation

Some techniques can improve the running speed:

Shrinking: remove αi if it is likely to be at a bound (0 or C) at the end. It is easier to conduct shrinking than in the kernel case (details not shown)

Order of sub-problems being minimized: instead of the fixed order

α1 → α2 → · · · → αl

we can use any random permutation at each outer iteration:

απ(1) → απ(2) → · · · → απ(l)

This is very effective in practice


### Difference from the Kernel Case

What if coordinate descent methods are applied to kernel classifiers?

if (α) =

l

X

j =1

yiyjxTi xjαj−1 = (yixi)T

l

X

j =1

yjxjαj−1 but we cannot do this for kernel because

K (xi, xj) = φ(xi)Tφ(xj) is not separated

If using kernel, the cost of calculating ∇if (α) must be O(ln)


### Difference from the Kernel Case (Cont’d)

This difference is similar to our earlier discussion on the prediction cost:

wTx versus Σ_{i=1}^l αi K(xi, x)

O(n) versus O(nl)

However, if the O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

In contrast, the setting using u knows ∇if(α) rather than the whole ∇f(α)


### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use ∇f(α) information to select the variable for update

Recall there are two types of coordinate descent methods:

Gauss-Seidel: sequential selection of variables

Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available


### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel

Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

## Experiments


### Comparisons

L2-loss SVM is used

DCDL2: dual coordinate descent

DCDL2-S: DCDL2 with shrinking

PCD: primal coordinate descent

TRON: trust-region Newton method


### Objective values (Time in Seconds)

(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)


### Analysis

Dual coordinate descent is very effective if both # data and # features are large

Useful for document classification: half a million data in a few seconds

However, it is less effective if

# features is small: one should solve the primal instead; or

the penalty parameter C is large: problems are more ill-conditioned


### An Example When # Features Small

# instances: 32,561, # features: 123

(Figures: objective value and accuracy versus training time)

## Big-data Machine Learning


### Big-data Machine Learning

Data are stored in a distributed environment

This is a new topic, and much research is still going on

You may ask how this differs from distributed optimization

They are related, but now the algorithm must avoid expensive data accesses


### Big-data Machine Learning (Cont’d)

Issues for parallelization:

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern


### Simple Distributed Linear Classification I

Bagging: train several subsets and ensemble the results

- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)

Some results by averaging models:

| | yahoo-korea | kddcup10 | webspam | epsilon |
|---|---|---|---|---|
| Using all | 87.29 | 89.89 | 99.51 | 89.78 |
| Avg. models | 86.08 | 89.64 | 98.40 | 88.83 |

Using all: solves a single linear SVM


### Simple Distributed Linear Classification II

Avg. models: each node solves a linear SVM on a subset

Slightly worse but in general OK
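A toy simulation of the model-averaging scheme. The local solver here is a simple gradient-descent logistic regression, standing in for the linear SVM solvers used in the experiments above:

```python
import numpy as np

def train_local(X, y, lam=0.01, lr=0.1, steps=500):
    """Local solver run by each node on its own subset: regularized
    logistic regression by gradient descent (a stand-in solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # P(correct label)
        grad = lam * w - X.T @ (y * (1.0 - p)) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
Xpos = rng.normal(loc=2.0, size=(100, 2))        # two well-separated clouds
Xneg = rng.normal(loc=-2.0, size=(100, 2))
X = np.vstack([Xpos, Xneg])
y = np.concatenate([np.ones(100), -np.ones(100)])

# "distribute" the data over 4 nodes and average the 4 local models
parts = np.array_split(rng.permutation(len(y)), 4)
w_avg = np.mean([train_local(X[p], y[p]) for p in parts], axis=0)
accuracy = np.mean(np.sign(X @ w_avg) == y)
```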


### ADMM by Boyd et al. (2011) I

Recall the SVM problem (bias term b omitted):

min_w  (1/2) wTw + C Σ_{i=1}^l max(0, 1 − yiwTxi)

An equivalent optimization problem:

min_{w1,...,wm,z}  (1/2) zTz + C Σ_{j=1}^m Σ_{i∈Bj} max(0, 1 − yiwjTxi) + (ρ/2) Σ_{j=1}^m ‖wj − z‖2

subject to wj − z = 0, ∀j


### ADMM by Boyd et al. (2011) II

The key is that

z = w1 = · · · = wm are all optimal w

This optimization problem was proposed in the 1970s, but is now applied to distributed machine learning

Each node has a subset Bj and updates wj; only w1, . . . , wm must be collected

Data are not moved, so the communication cost is lower. Still, we cannot afford too many iterations because of the communication cost


### Vowpal Wabbit (Langford et al., 2007) I

It started as a linear classification package on a single computer

After version 6.0, Hadoop support has been provided

A hybrid approach: parallel SGD initially, then switch to LBFGS (quasi-Newton)

They argue that AllReduce is a more suitable operation than MapReduce

What is AllReduce? Every node starts with a value and ends up with the sum over all nodes
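A toy simulation of the AllReduce idea (not VW's actual implementation, which runs over a spanning tree of nodes):

```python
import numpy as np

def allreduce_sum(node_values):
    """Simulate AllReduce: every node starts with its own vector and
    ends up holding the elementwise sum across all nodes."""
    total = np.sum(node_values, axis=0)
    return [total.copy() for _ in node_values]

# e.g. each node computes the gradient on its local data shard;
# AllReduce then gives every node the full-data gradient
local_grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
results = allreduce_sum(local_grads)
# every node now holds [9.0, 12.0]
```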


### Vowpal Wabbit (Langford et al., 2007) II

In Agarwal et al. (2012), the authors argue that many machine learning algorithms can be implemented using AllReduce; LBFGS is an example

They train 17B samples with 16M features on 1K nodes in 70 minutes

## Conclusions


### Conclusions

Linear classification is an old topic, but recently there are new applications and large-scale challenges

The optimization problem can be solved by many existing techniques, but some machine-learning aspects must be considered

In particular, data access may become a bottleneck in large-scale scenarios

Overall, linear classification is still an ongoing and exciting research area
