# Optimization Methods for Large-scale Linear Classification

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at University of Rome “La Sapienza,” June 25, 2013


Part of this talk is based on our recent survey paper in Proceedings of IEEE, 2012

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It’s also related to our development of the software LIBLINEAR

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.


### Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

## Introduction


### Linear and Nonlinear Classification

Some popular methods such as SVM and logistic regression can be used in two ways:

Kernel methods: data are mapped to another space, x ⇒ φ(x). The inner product φ(x)Tφ(y) is easily calculated, but there is no good control on φ(·).

Linear classification + feature engineering: we have x without mapping. Alternatively, we can say that φ(x) is our x; we have full control on x or φ(x).

We refer to them as nonlinear and linear classifiers; we will focus on linear here.


### Linear and Nonlinear Classification

(Figures: example decision boundaries of a linear and a nonlinear classifier)

By linear we mean data not mapped to a higher dimensional space

Original: [height, weight]

Nonlinear: [height, weight, weight/height2]


### Linear and Nonlinear Classification (Cont’d)

Given training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l , yi = ±1

l : # of data, n: # of features

Linear: find (w, b) such that the decision function is

sgn(wTx + b)

Nonlinear: map data to φ(xi). The decision function becomes

sgn(wTφ(x) + b)

Later, b is omitted
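As a tiny illustration (not code from the talk), the linear decision function is a one-liner in numpy:

```python
import numpy as np

def predict(w, b, X):
    """Linear decision function sgn(w^T x + b), applied to each row of X."""
    return np.sign(X @ w + b)

w, b = np.array([1.0, -1.0]), 0.0
X = np.array([[2.0, 1.0], [0.0, 3.0]])
# predict(w, b, X) -> [ 1., -1.]
```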


### Why Linear Classification?

• If φ(x) is high dimensional, wTφ(x) is expensive

• Kernel methods:

w ≡ Σ_{i=1}^l αi φ(xi) for some α, K(xi, xj) ≡ φ(xi)Tφ(xj)

New decision function: sgn( Σ_{i=1}^l αi K(xi, x) )

• Special φ(x) so that calculating K(xi, xj) is easy

• Example:

K(xi, xj) ≡ (xiTxj + 1)2 = φ(xi)Tφ(xj), φ(x) ∈ R^{O(n2)}
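To make the O(n2)-dimensional mapping concrete, here is a small numpy check (an illustration, not part of the talk) that the degree-2 polynomial kernel equals an inner product under an explicit map φ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial map: phi(x).dot(phi(z)) == (x.dot(z) + 1)**2.
    Its dimension, 1 + 2n + n(n-1)/2, is O(n^2)."""
    n = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(5), rng.standard_normal(5)

lhs = (xi.dot(xj) + 1) ** 2   # kernel evaluation: O(n)
rhs = phi(xi).dot(phi(xj))    # explicit mapping: O(n^2)
assert np.isclose(lhs, rhs)
```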


### Why Linear Classification? (Cont’d)

Prediction:

wTx versus Σ_{i=1}^l αi K(xi, x)

If K(xi, xj) takes O(n), the costs are

O(n) versus O(nl)

Nonlinear: more powerful to separate data

Linear: cheaper and simpler


### Linear is Useful in Some Places

For certain problems, the accuracy of linear is as good as that of nonlinear, but training and testing are much faster

This is especially true for document classification: the number of features (bag-of-words model) is very large, and data are large and sparse

Training millions of data takes just a few seconds

Recently, linear classification has become a popular research topic

## Binary linear classification


### Binary Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1; l: # of data, n: # of features

min_w  wTw/2 + C Σ_{i=1}^l ξ(w; xi, yi)

wTw/2: regularization term

ξ(w; x, y): loss function; we hope y wTx > 0

C: regularization parameter
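As a sketch (not code from the talk), the objective with the hinge loss ξL1 defined on the next slide is:

```python
import numpy as np

def objective(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w.dot(w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
# at w = 0 every hinge term is 1, so f(0) = C * l
# objective(np.zeros(2), X, y, C=2.0) -> 4.0
```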


### Loss Functions

Some commonly used ones:

ξL1(w; x, y) ≡ max(0, 1 − y wTx),  (1)

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2,  (2)

ξLR(w; x, y) ≡ log(1 + e−y wTx).  (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
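For reference, the three losses are one-liners as functions of the margin m = y wTx (a simple illustration, not code from the talk):

```python
import numpy as np

def xi_L1(m):
    return np.maximum(0.0, 1.0 - m)          # (1) hinge loss

def xi_L2(m):
    return np.maximum(0.0, 1.0 - m) ** 2     # (2) squared hinge loss

def xi_LR(m):
    return np.log1p(np.exp(-m))              # (3) logistic loss

m = np.array([-1.0, 0.0, 0.5, 2.0])
# xi_L1(m) -> [2., 1., 0.5, 0.]: zero once the margin exceeds 1
```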


### Loss Functions (Cont’d)

(Figure: ξL1, ξL2, and ξLR plotted as functions of −y wTx)

They are similar in terms of performance


### Loss Functions (Cont’d)

However,

ξL1: not differentiable

ξL2: differentiable but not twice differentiable

ξLR: twice differentiable

Many optimization methods can be used

## Optimization Methods: Second-order Methods


### Truncated Newton Method

Newton direction:

min_s ∇f(wk)Ts + (1/2) sT∇2f(wk)s

This is the same as solving the Newton linear system

∇2f(wk) s = −∇f(wk)

The Hessian matrix ∇2f(wk) is too large to be stored:

∇2f(wk): n × n, n: number of features

For document data, n can be millions or more


### Using Special Properties of Data Classification

But the Hessian has a special form:

∇2f(w) = I + C XTDX, D diagonal

For logistic regression,

Dii = e−yiwTxi / (1 + e−yiwTxi)2

X: the data matrix, # instances × # features

X = [x1, . . . , xl]T


### Using Special Properties of Data Classification (Cont’d)

Use CG (conjugate gradient) to solve the linear system; only Hessian-vector products are needed:

∇2f(w) s = s + C · XT(D(X s))

Therefore, we have a Hessian-free approach.

In Lin et al. (2008), we use the trust-region + CG approach by Steihaug (1983)
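A sketch of the Hessian-vector product in numpy (dense arrays for clarity; LIBLINEAR's actual implementation works on sparse data):

```python
import numpy as np

def hess_vec(s, X, y, w, C):
    """Compute (I + C X^T D X) s for logistic regression without ever
    forming the n x n Hessian; D is kept as a length-l vector."""
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma_i = 1/(1+e^{-y_i w^T x_i})
    D = sigma * (1.0 - sigma)                    # D_ii = e^{-t}/(1+e^{-t})^2
    return s + C * (X.T @ (D * (X @ s)))
```

Plugging `hess_vec` into a conjugate-gradient solver then yields the (truncated) Newton step without storing the Hessian.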


### Training L2-loss SVM

The loss function is differentiable but not twice differentiable

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2

We can use the generalized Hessian (Mangasarian, 2002)

It works well in practice, but there is no theoretical quadratic convergence

## Optimization Methods: First-order Methods


First-order methods are popular in data classification. The reason: there is no need to accurately solve the optimization problem.

We consider L1-loss SVM as an example here, though the same methods may be extended to L2 and logistic loss.


### SVM Dual

From the primal-dual relationship, the dual problem is

min_α f(α)
subject to 0 ≤ αi ≤ C, ∀i,

where

f(α) ≡ (1/2) αTQα − eTα

and

Qij = yiyjxiTxj, e = [1, . . . , 1]T


### Dual Coordinate Descent

Very simple: minimize one variable at a time

While α is not optimal
  For i = 1, . . . , l
    min_{αi} f(. . . , αi, . . .)

A classic optimization technique, traced back to Hildreth (1957) if constraints are not considered


### The Procedure

Given the current α, let ei = [0, . . . , 0, 1, 0, . . . , 0]T with the 1 in position i.

min_d f(α + d ei) = (1/2) Qii d2 + ∇if(α) d + constant

Without constraints, the optimal step is

d = −∇if(α) / Qii

With the constraint 0 ≤ αi + d ≤ C, the update becomes

αi ← min( max( αi − ∇if(α)/Qii, 0 ), C )


### The Procedure (Cont’d)

∇if(α) = (Qα)i − 1 = Σ_{j=1}^l Qij αj − 1 = Σ_{j=1}^l yiyjxiTxj αj − 1

Directly calculating gradients costs O(ln); l: # data, n: # features

For linear SVM, define

u ≡ Σ_{j=1}^l yj αj xj

Then the gradient calculation is easy and costs only O(n):

∇if(α) = yi uTxi − 1


### The Procedure (Cont’d)

All we need is to maintain u:

u = Σ_{j=1}^l yj αj xj

If ᾱi is the old value and αi the new one, then

u ← u + (αi − ᾱi) yi xi

This also costs O(n)


### Algorithm

Given initial α, find u = Σ_i yi αi xi.

While α is not optimal (outer iteration)
  For i = 1, . . . , l (inner iteration)
    (a) ᾱi ← αi
    (b) G = yi uTxi − 1
    (c) If αi can be changed
          αi ← min(max(αi − G/Qii, 0), C)
          u ← u + (αi − ᾱi) yi xi
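The whole algorithm fits in a few lines of numpy. This is a bare-bones sketch (dense data, no shrinking), not LIBLINEAR's production code:

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, n_outer=50, seed=0):
    """Dual coordinate descent for L1-loss linear SVM.  Maintains
    u = sum_j y_j alpha_j x_j so each gradient costs only O(n)."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)
    Qii = (X ** 2).sum(axis=1)               # diagonal of Q: ||x_i||^2
    rng = np.random.default_rng(seed)
    for _ in range(n_outer):                 # outer iterations
        for i in rng.permutation(l):         # random order of sub-problems
            if Qii[i] == 0.0:
                continue
            G = y[i] * u.dot(X[i]) - 1.0     # single-variable gradient, O(n)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]   # O(n) update of u
    return u                                 # u equals the primal w

# toy separable data: the learned w classifies all points correctly
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd_l1_svm(X, y, C=1.0)
```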


### Analysis

Linear convergence, from Luo and Tseng (1992):

f(αk+1) − f(α∗) ≤ µ (f(αk) − f(α∗)), ∀k ≥ k0

α∗: an optimal solution

Recently we proved the result with k0 = 1 (Wang and Lin, 2013)

Difficulty: the objective function is only convex rather than strictly convex


### Careful Implementation

Some techniques can improve the running speed:

Shrinking: remove αi if it is likely to be at a bound (0 or C) at the end. It is easier to conduct shrinking than in the kernel case (details not shown)

Order of sub-problems being minimized: instead of the fixed order

α1 → α2 → · · · → αl

we can use any random permutation at each outer iteration:

απ(1) → απ(2) → · · · → απ(l)

This is very effective in practice


### Difference from the Kernel Case

What if coordinate descent methods are applied to kernel classifiers?

if (α) =

l

X

j =1

yiyjxTi xjαj−1 = (yixi)T

l

X

j =1

yjxjαj−1 but we cannot do this for kernel because

K (xi, xj) = φ(xi)Tφ(xj) is not separated

If using kernel, the cost of calculating ∇if (α) must be O(ln)


### Difference from the Kernel Case (Cont’d)

This difference is similar to our earlier discussion on the prediction cost:

wTx versus Σ_{i=1}^l αi K(xi, x)

O(n) versus O(nl)

However, if the O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

In contrast, the setting using u knows ∇if(α) rather than the whole ∇f(α)


### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use ∇f(α) information to select the variable for update

Recall there are two types of coordinate descent methods:

Gauss-Seidel: sequential selection of variables

Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available


### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel

Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

## Experiments


### Comparisons

L2-loss SVM is used

DCDL2: dual coordinate descent

DCDL2-S: DCDL2 with shrinking

PCD: primal coordinate descent

TRON: trust-region Newton method


### Objective values (Time in Seconds)

(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)


### Analysis

Dual coordinate descent is very effective if both # data and # features are large

Useful for document classification: half a million data in a few seconds

However, it is less effective if

# features is small: one should solve the primal instead; or

the penalty parameter C is large: problems are more ill-conditioned


### An Example When # Features Small

# instances: 32,561, # features: 123

(Figures: objective value and accuracy versus training time)

## Big-data Machine Learning


### Big-data Machine Learning

Data are stored in a distributed environment

This is a new topic, and much research is still going on

You may ask how this differs from distributed optimization

They are related, but now the algorithm must avoid expensive data accesses


### Big-data Machine Learning (Cont’d)

Issues for parallelization:

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern


### Simple Distributed Linear Classification I

Bagging: train several subsets and ensemble the results

- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)

Some results by averaging models:

| | yahoo-korea | kddcup10 | webspam | epsilon |
|---|---|---|---|---|
| Using all | 87.29 | 89.89 | 99.51 | 89.78 |
| Avg. models | 86.08 | 89.64 | 98.40 | 88.83 |

Using all: solves a single linear SVM


### Simple Distributed Linear Classification II

Avg. models: each node solves a linear SVM on a subset

Slightly worse but in general OK
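A toy simulation of the model-averaging scheme. The local solver here is a simple gradient-descent logistic regression, standing in for the linear SVM solvers used in the experiments above:

```python
import numpy as np

def train_local(X, y, lam=0.01, lr=0.1, steps=500):
    """Local solver run by each node on its own subset: regularized
    logistic regression by gradient descent (a stand-in solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # P(correct label)
        grad = lam * w - X.T @ (y * (1.0 - p)) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
Xpos = rng.normal(loc=2.0, size=(100, 2))        # two well-separated clouds
Xneg = rng.normal(loc=-2.0, size=(100, 2))
X = np.vstack([Xpos, Xneg])
y = np.concatenate([np.ones(100), -np.ones(100)])

# "distribute" the data over 4 nodes and average the 4 local models
parts = np.array_split(rng.permutation(len(y)), 4)
w_avg = np.mean([train_local(X[p], y[p]) for p in parts], axis=0)
accuracy = np.mean(np.sign(X @ w_avg) == y)
```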


### ADMM by Boyd et al. (2011) I

Recall the SVM problem (bias term b omitted):

min_w  (1/2) wTw + C Σ_{i=1}^l max(0, 1 − yiwTxi)

An equivalent optimization problem:

min_{w1,...,wm,z}  (1/2) zTz + C Σ_{j=1}^m Σ_{i∈Bj} max(0, 1 − yiwjTxi) + (ρ/2) Σ_{j=1}^m ‖wj − z‖2

subject to wj − z = 0, ∀j


### ADMM by Boyd et al. (2011) II

The key is that

z = w1 = · · · = wm are all optimal w

This optimization problem was proposed in the 1970s, but is now applied to distributed machine learning

Each node has a subset Bj and updates wj; only w1, . . . , wm must be collected

Data are not moved, so the communication cost is lower. Still, we cannot afford too many iterations because of the communication cost


### Vowpal Wabbit (Langford et al., 2007) I

It started as a linear classification package on a single computer

After version 6.0, Hadoop support has been provided

A hybrid approach: parallel SGD initially, then switch to LBFGS (quasi-Newton)

They argue that AllReduce is a more suitable operation than MapReduce

What is AllReduce? Every node starts with a value and ends up with the sum over all nodes
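A toy simulation of the AllReduce idea (not VW's actual implementation, which runs over a spanning tree of nodes):

```python
import numpy as np

def allreduce_sum(node_values):
    """Simulate AllReduce: every node starts with its own vector and
    ends up holding the elementwise sum across all nodes."""
    total = np.sum(node_values, axis=0)
    return [total.copy() for _ in node_values]

# e.g. each node computes the gradient on its local data shard;
# AllReduce then gives every node the full-data gradient
local_grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
results = allreduce_sum(local_grads)
# every node now holds [9.0, 12.0]
```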


### Vowpal Wabbit (Langford et al., 2007) II

In Agarwal et al. (2012), the authors argue that many machine learning algorithms can be implemented using AllReduce; LBFGS is an example

They train 17B samples with 16M features on 1K nodes in 70 minutes

## Conclusions


### Conclusions

Linear classification is an old topic, but recently there are new applications and large-scale challenges

The optimization problem can be solved by many existing techniques, but some machine-learning aspects must be considered

In particular, data access may become a bottleneck in large-scale scenarios

Overall, linear classification is still an ongoing and exciting research area
