
(1)

Optimization Methods for Large-scale Linear Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at University of Rome “La Sapienza,” June 25, 2013

(2)

Part of this talk is based on our recent survey paper in the Proceedings of the IEEE, 2012

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It’s also related to our development of the software LIBLINEAR

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.

(3)

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(4)

Introduction

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(5)

Introduction

Linear and Nonlinear Classification

Some popular methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to another space x ⇒ φ(x); φ(x)^T φ(y) easily calculated, but no good control on φ(·)

Linear classification + feature engineering: we have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

We refer to them as nonlinear and linear classifiers; we will focus on linear here

(6)

Introduction

Linear and Nonlinear Classification

[Figure: two example datasets, one separated by a linear boundary and one by a nonlinear boundary]

By linear we mean data not mapped to a higher dimensional space

Original: [height, weight]

Nonlinear: [height, weight, weight/height^2]

(7)

Introduction

Linear and Nonlinear Classification (Cont’d)

Given training data {y_i, x_i}, x_i ∈ R^n, i = 1, . . . , l, y_i = ±1

l: # of data, n: # of features

Linear: find (w, b) such that the decision function is sgn(w^T x + b)

Nonlinear: map data to φ(x_i). The decision function becomes sgn(w^T φ(x) + b)

Later b is omitted

(8)

Introduction

Why Linear Classification?

• If φ(x) is high dimensional, w^T φ(x) is expensive

• Kernel methods:

w ≡ Σ_{i=1}^{l} α_i φ(x_i) for some α,  K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)

New decision function: sgn( Σ_{i=1}^{l} α_i K(x_i, x) )

• Special φ(x) so that calculating K(x_i, x_j) is easy

• Example:

K(x_i, x_j) ≡ (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j),  φ(x) ∈ R^{O(n^2)}
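As a quick sanity check of this example kernel (a hypothetical illustration, not from the slides), the following Python snippet compares K(x_i, x_j) = (x_i^T x_j + 1)^2 with an explicit degree-2 mapping φ whose inner product reproduces it:

```python
import numpy as np

def poly2_kernel(xi, xj):
    """Degree-2 polynomial kernel: K(xi, xj) = (xi^T xj + 1)^2."""
    return (xi @ xj + 1.0) ** 2

def poly2_features(x):
    """Explicit degree-2 mapping phi(x) with O(n^2) components:
    [1, sqrt(2)*x_k, x_k^2, sqrt(2)*x_k*x_m (k < m)]."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2.0) * x[k] for k in range(n)]
    feats += [x[k] ** 2 for k in range(n)]
    feats += [np.sqrt(2.0) * x[k] * x[m] for k in range(n) for m in range(k + 1, n)]
    return np.array(feats)

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(5), rng.standard_normal(5)
# Both ways give the same value, but phi(x) has O(n^2) entries.
print(poly2_kernel(xi, xj), poly2_features(xi) @ poly2_features(xj))
```

The two numbers agree, while φ(x) has O(n^2) components; this is why kernel methods evaluate K directly instead of forming φ(x).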

(9)

Introduction

Why Linear Classification? (Cont’d)

Prediction

w^T x  versus  Σ_{i=1}^{l} α_i K(x_i, x)

If K(x_i, x_j) takes O(n), then

O(n) versus O(nl)

Nonlinear: more powerful to separate data

Linear: cheaper and simpler

(10)

Introduction

Linear is Useful in Some Places

For certain problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

Especially document classification:
- Number of features (bag-of-words model) very large
- Large and sparse data
- Training millions of data in just a few seconds

Recently linear classification has become a popular research topic

(11)

Binary linear classification

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(12)

Binary linear classification

Binary Linear Classification

Training data {y_i, x_i}, x_i ∈ R^n, i = 1, . . . , l, y_i = ±1

l: # of data, n: # of features

min_w  (1/2) w^T w + C Σ_{i=1}^{l} ξ(w; x_i, y_i)

w^T w / 2: regularization term

ξ(w; x, y): loss function; we hope y w^T x > 0

C: regularization parameter

(13)

Binary linear classification

Loss Functions

Some commonly used ones:

ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2,      (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
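For concreteness, here is a minimal NumPy sketch (my own naming, not LIBLINEAR code) of the three losses and the regularized objective defined on the previous slide:

```python
import numpy as np

def l1_loss(w, x, y):
    """L1 (hinge) loss: max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - y * (w @ x))

def l2_loss(w, x, y):
    """L2 (squared hinge) loss: max(0, 1 - y * w^T x)^2."""
    return max(0.0, 1.0 - y * (w @ x)) ** 2

def lr_loss(w, x, y):
    """Logistic loss: log(1 + exp(-y * w^T x))."""
    return np.log1p(np.exp(-y * (w @ x)))

def primal_objective(w, X, y, C, loss=l2_loss):
    """Regularized objective: w^T w / 2 + C * sum_i loss(w; x_i, y_i)."""
    return 0.5 * (w @ w) + C * sum(loss(w, xi, yi) for xi, yi in zip(X, y))
```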

(14)

Binary linear classification

Loss Functions (Cont’d)

[Figure: the three loss functions ξ_L1, ξ_L2, and ξ_LR plotted as functions of −y w^T x]

They are similar in terms of performance

(15)

Binary linear classification

Loss Functions (Cont’d)

However,

ξ_L1: not differentiable

ξ_L2: differentiable but not twice differentiable

ξ_LR: twice differentiable

Many optimization methods can be used

(16)

Optimization Methods: Second-order Methods

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(17)

Optimization Methods: Second-order Methods

Truncated Newton Method

Newton direction

min_s  ∇f(w^k)^T s + (1/2) s^T ∇^2 f(w^k) s

This is the same as solving the Newton linear system

∇^2 f(w^k) s = −∇f(w^k)

Hessian matrix ∇^2 f(w^k) too large to be stored

∇^2 f(w^k): n × n, n: number of features

For document data, n can be millions or more

(18)

Optimization Methods: Second-order Methods

Using Special Properties of Data Classification

But the Hessian has a special form

∇^2 f(w) = I + C X^T D X,  D diagonal

For logistic regression,

D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})^2

X: data, # instances × # features

X = [x_1, . . . , x_l]^T

(19)

Optimization Methods: Second-order Methods

Using Special Properties of Data Classification (Cont’d)

Using CG to solve the linear system. Only Hessian-vector products are needed

∇^2 f(w) s = s + C · X^T (D (X s))

Therefore, we have a Hessian-free approach

In Lin et al. (2008), we use the trust-region + CG approach by Steihaug (1983)

Quadratic convergence is achieved
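A minimal sketch of this Hessian-free product for logistic regression, assuming a SciPy sparse data matrix X and labels y in {+1, −1} (variable names are illustrative, not LIBLINEAR's):

```python
import numpy as np
import scipy.sparse as sp

def hessian_vector_product(s, w, X, y, C):
    """Compute (I + C X^T D X) s without forming the n x n Hessian.

    X: sparse l x n data matrix, y: labels in {+1, -1},
    D: diagonal with D_ii = e^{-y_i w^T x_i} / (1 + e^{-y_i w^T x_i})^2.
    """
    z = y * (X @ w)                      # y_i w^T x_i for all i
    sigma = 1.0 / (1.0 + np.exp(-z))     # sigma(y_i w^T x_i)
    d = sigma * (1.0 - sigma)            # equals e^{-z} / (1 + e^{-z})^2
    Xs = X @ s                           # costs O(nnz), not O(n^2)
    return s + C * (X.T @ (d * Xs))
```

Only matrix-vector products with X and X^T are needed, which is what makes the trust-region CG approach feasible for millions of features.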

(20)

Optimization Methods: Second-order Methods

Training L2-loss SVM

The loss function is differentiable but not twice differentiable

ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2

We can use the generalized Hessian (Mangasarian, 2002)

Works well in practice, but no theoretical quadratic convergence

(21)

Optimization Methods: First-order Methods

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(22)

Optimization Methods: First-order Methods

First-order methods are popular in data classification

Reason: no need to accurately solve the optimization problem

We consider L1-loss SVM as an example here, though the same methods may be extended to the L2 and logistic losses

(23)

Optimization Methods: First-order Methods

SVM Dual

From the primal-dual relationship,

min_α  f(α)
subject to  0 ≤ α_i ≤ C, ∀i,

where

f(α) ≡ (1/2) α^T Q α − e^T α

and

Q_ij = y_i y_j x_i^T x_j,  e = [1, . . . , 1]^T

(24)

Optimization Methods: First-order Methods

Dual Coordinate Descent

Very simple: minimizing one variable at a time

While α is not optimal
  For i = 1, . . . , l
    min_{α_i} f(. . . , α_i, . . .)

A classic optimization technique

Traced back to Hildreth (1957) if constraints are not considered

(25)

Optimization Methods: First-order Methods

The Procedure

Given current α. Let e_i = [0, . . . , 0, 1, 0, . . . , 0]^T.

min_d  f(α + d e_i) = (1/2) Q_ii d^2 + ∇_i f(α) d + constant

Without constraints,

optimal d = −∇_i f(α) / Q_ii

Now 0 ≤ α_i + d ≤ C, so

α_i ← min( max( α_i − ∇_i f(α) / Q_ii, 0 ), C )

(26)

Optimization Methods: First-order Methods

The Procedure (Cont’d)

∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^{l} Q_ij α_j − 1 = Σ_{j=1}^{l} y_i y_j x_i^T x_j α_j − 1

Directly calculating the gradient costs O(ln)

l: # data, n: # features

For linear SVM, define

u ≡ Σ_{j=1}^{l} y_j α_j x_j

Easy gradient calculation: costs O(n)

∇_i f(α) = y_i u^T x_i − 1

(27)

Optimization Methods: First-order Methods

The Procedure (Cont’d)

All we need is to maintain u

u = Σ_{j=1}^{l} y_j α_j x_j

If

ᾱ_i: old;  α_i: new

then

u ← u + (α_i − ᾱ_i) y_i x_i

Also costs O(n)

(28)

Optimization Methods: First-order Methods

Algorithm

Given initial α and find

u = Σ_i y_i α_i x_i.

While α is not optimal (Outer iteration)
  For i = 1, . . . , l (Inner iteration)
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) If α_i can be changed
          α_i ← min(max(α_i − G/Q_ii, 0), C)
          u ← u + (α_i − ᾱ_i) y_i x_i
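A minimal NumPy sketch of this dual coordinate descent loop for the L1-loss linear SVM (illustrative only; LIBLINEAR additionally uses shrinking, projected-gradient checks, and sparse data handling):

```python
import numpy as np

def dual_cd_l1_svm(X, y, C, n_outer=10):
    """Dual coordinate descent for L1-loss linear SVM (minimal sketch).

    X: l x n array, y: labels in {+1, -1}, C: regularization parameter.
    Maintains u = sum_j y_j alpha_j x_j so each gradient costs O(n).
    """
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                         # u = sum_j y_j alpha_j x_j (all zero)
    Qii = np.einsum('ij,ij->i', X, X)       # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(n_outer):                # outer iterations
        for i in np.random.permutation(l):  # random order helps in practice
            if Qii[i] == 0.0:
                continue
            G = y[i] * (u @ X[i]) - 1.0     # gradient of the i-th coordinate
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]
    return u, alpha                         # u is the primal weight vector w
```

The returned u equals w = Σ_i y_i α_i x_i, so prediction is just sgn(u^T x).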

(29)

Optimization Methods: First-order Methods

Analysis

Convergence: from Luo and Tseng (1992),

f(α^{k+1}) − f(α*) ≤ µ (f(α^k) − f(α*)), ∀k ≥ k_0

α*: optimal solution

Recently we proved the result with k_0 = 1 (Wang and Lin, 2013)

Difficulty: the objective function is only convex rather than strictly convex

(30)

Optimization Methods: First-order Methods

Careful Implementation

Some techniques can improve the running speed

Shrinking: remove αi if it is likely to be bounded at the end

Shrinking is easier to conduct than in the kernel case (details not shown)

Order of sub-problems being minimized:

α_1 → α_2 → · · · → α_l

Can use any random order at each outer iteration:

α_π(1) → α_π(2) → · · · → α_π(l)

Very effective in practice

(31)

Optimization Methods: First-order Methods

Difference from the Kernel Case

What if coordinate descent methods are applied to kernel classifiers?

Recall the gradient is

∇_i f(α) = Σ_{j=1}^{l} y_i y_j x_i^T x_j α_j − 1 = (y_i x_i)^T Σ_{j=1}^{l} y_j x_j α_j − 1,

but we cannot do this for kernel because

K(x_i, x_j) = φ(x_i)^T φ(x_j) is not separated

If using kernel, the cost of calculating ∇if (α) must be O(ln)

(32)

Optimization Methods: First-order Methods

Difference from the Kernel Case (Cont’d)

This difference is similar to our earlier discussion on the prediction cost

w^T x  versus  Σ_{i=1}^{l} α_i K(x_i, x)

O(n) versus O(nl)

However, if O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

However, if O(ln) cost is spent, the whole ∇f (α) can be maintained (details not shown here)

In contrast, the setting of using u gives only ∇_i f(α) rather than the whole ∇f(α)

(33)

Optimization Methods: First-order Methods

Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use ∇f (α) information to select variable for update

Recall there are two types of coordinate descent methods

Gauss-Seidel: sequential selection of variables

Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available

(34)

Optimization Methods: First-order Methods

Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel

Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

(35)

Experiments

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(36)

Experiments

Comparisons

L2-SVM is used

DCDL2: Dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: Primal coordinate descent
TRON: Trust region Newton method

(37)

Experiments

Objective values (Time in Seconds)

[Plots: four panels showing results on news20, rcv1, yahoo-japan, and yahoo-korea]

(38)

Experiments

Analysis

Dual coordinate descent is very effective if both # data and # features are large

Useful for document classification: half a million data in a few seconds

However, it is less effective if
- # features is small: should solve the primal; or
- the penalty parameter C is large: problems are more ill-conditioned

(39)

Experiments

An Example When # Features Small

# instances: 32,561, # features: 123

[Plots: objective value and accuracy versus training time]

(40)

Big-data Machine Learning

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(41)

Big-data Machine Learning

Big-data Machine Learning

Data are distributedly stored

This is a new topic and much research is still ongoing

You may ask what the difference is from distributed optimization

They are related, but now the algorithm must avoid expensive data accesses

(42)

Big-data Machine Learning

Big-data Machine Learning (Cont’d)

Issues for parallelization

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern

(43)

Big-data Machine Learning

Simple Distributed Linear Classification I

Bagging: train on several subsets and ensemble the results
- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)

Some results by averaging models:

             yahoo-korea  kddcup10  webspam  epsilon
Using all          87.29     89.89    99.51    89.78
Avg. models        86.08     89.64    98.40    88.83

Using all: solves a single linear SVM

(44)

Big-data Machine Learning

Simple Distributed Linear Classification II

Avg. models: each node solves a linear SVM on a subset

Slightly worse but in general OK
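A minimal sketch of this model-averaging baseline, using scikit-learn's LinearSVC as a stand-in solver for each node's subset (the experiments above used LIBLINEAR in a distributed setting; this single-machine code only illustrates the idea):

```python
import numpy as np
from sklearn.svm import LinearSVC

def average_models(X_parts, y_parts, C=1.0):
    """Train one linear SVM per data partition and average the weights.

    X_parts, y_parts: lists of per-node data blocks (subsets B_j).
    Returns the averaged weight vector and bias.
    """
    ws, bs = [], []
    for Xj, yj in zip(X_parts, y_parts):    # in practice, each node does this locally
        clf = LinearSVC(C=C, loss='hinge').fit(Xj, yj)
        ws.append(clf.coef_.ravel())
        bs.append(clf.intercept_[0])
    return np.mean(ws, axis=0), np.mean(bs)

def predict(w, b, X):
    """Decision function sgn(w^T x + b)."""
    return np.sign(X @ w + b)
```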

(45)

Big-data Machine Learning

ADMM by Boyd et al. (2011) I

Recall the SVM problem (bias term b omitted)

min_w  (1/2) w^T w + C Σ_{i=1}^{l} max(0, 1 − y_i w^T x_i)

An equivalent optimization problem:

min_{w_1,...,w_m, z}  (1/2) z^T z + C Σ_{j=1}^{m} Σ_{i∈B_j} max(0, 1 − y_i w_j^T x_i) + (ρ/2) Σ_{j=1}^{m} ||w_j − z||^2

subject to  w_j − z = 0, ∀j

(46)

Big-data Machine Learning

ADMM by Boyd et al. (2011) II

The key is that

At a solution, z = w_1 = · · · = w_m all equal the optimal w

This optimization approach was proposed in the 1970s, but is now applied to distributed machine learning

Each node has a subset B_j and updates w_j

Only w_1, . . . , w_m must be collected

Data are not moved; less communication cost

Still, we cannot afford too many iterations because of the communication cost
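For reference, a sketch of the scaled-form consensus ADMM updates for this splitting, following the general recipe in Boyd et al. (2011) (the exact variant used in practice may differ):

```latex
\begin{align*}
w_j^{k+1} &= \arg\min_{w_j}\; C\sum_{i \in B_j}\max\big(0,\, 1 - y_i w_j^T x_i\big)
             + \tfrac{\rho}{2}\,\|w_j - z^k + u_j^k\|^2, \quad j = 1,\dots,m,\\
z^{k+1} &= \arg\min_{z}\; \tfrac{1}{2}\,z^T z
             + \tfrac{\rho}{2}\sum_{j=1}^{m}\|w_j^{k+1} - z + u_j^k\|^2
         \;=\; \frac{\rho\sum_{j=1}^{m}\big(w_j^{k+1} + u_j^k\big)}{1 + m\rho},\\
u_j^{k+1} &= u_j^k + w_j^{k+1} - z^{k+1}.
\end{align*}
```

Each node solves its own w_j subproblem on B_j; only the vectors w_j (and u_j) are exchanged when forming z.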

(47)

Big-data Machine Learning

Vowpal Wabbit (Langford et al., 2007) I

It started as a linear classification package on a single computer

After version 6.0, Hadoop support has been provided

A hybrid approach: parallel SGD initially, then switch to L-BFGS (quasi-Newton)

They argue that AllReduce is a more suitable operation than MapReduce

What is AllReduce?

Every node starts with a value and ends up with the sum at all nodes
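A tiny illustration of the AllReduce semantics just described, using mpi4py (a hypothetical example; Vowpal Wabbit implements its own AllReduce on Hadoop):

```python
# Run with e.g.: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each node holds a local value (here a random "gradient" for illustration).
local_grad = np.random.rand(5)

# After Allreduce, every node holds the same element-wise sum over all nodes.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

print(f"rank {comm.Get_rank()}: sum of gradients = {global_grad}")
```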

(48)

Big-data Machine Learning

Vowpal Wabbit (Langford et al., 2007) II

In Agarwal et al. (2012), the authors argue that many machine learning algorithms can be implemented using AllReduce

L-BFGS is an example

They train 17B samples with 16M features on 1K nodes ⇒ 70 minutes

(49)

Conclusions

Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

(50)

Conclusions

Conclusions

Linear classification is an old topic, but recently there are new applications and large-scale challenges

The optimization problem can be solved by many existing techniques

However, some machine-learning aspects must be considered

In particular, data access may become a bottleneck in large-scale scenarios

Overall, linear classification is still an on-going and exciting research area
