(1)

Large-scale Linear Classification:

Status and Challenges

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Talk at Criteo Machine Learning Workshop, November 8, 2017

(2)

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(3)

Introduction

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(4)

Introduction

Linear Classification

Although many new and advanced techniques are available (e.g., deep learning), linear classifiers remain useful because of their simplicity

We have fast training/prediction for large-scale data

A large-scale optimization problem is solved

The focus of this talk is on how to solve this optimization problem

(5)

Introduction

The Software LIBLINEAR

My talk is closely related to research done in developing the software LIBLINEAR for linear classification

www.csie.ntu.edu.tw/~cjlin/liblinear

It is now one of the most widely used linear classification tools

(6)

Introduction

Linear and Kernel Classification

Methods such as SVM and logistic regression are often used in two ways

Kernel methods: data mapped to another space x ⇒ φ(x)

φ(x)Tφ(y) easily calculated; no good control on φ(·)

Feature engineering + linear classification:

Directly use x without mapping. But x may have been carefully generated. Full control on x
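To make the contrast concrete, here is a small Python sketch (illustrative only, not from the talk; the function names are made up): a kernel method evaluates φ(x)Tφ(y) directly from x and y without ever forming φ(x), while a linear classifier works on the explicit feature vector x through a plain inner product.

```python
import numpy as np

def rbf_kernel_value(x, y, gamma=1.0):
    # phi(x)^T phi(y) for the RBF kernel, computed without forming phi(x),
    # whose dimension is infinite; we have no direct control over phi(.)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def linear_decision_value(w, x):
    # Linear classification uses the (possibly carefully engineered)
    # feature vector x directly: the decision value is w^T x
    return float(np.dot(w, x))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 0.0])
w = np.array([0.3, -0.1, 0.8])
print(rbf_kernel_value(x, y, gamma=0.5))   # kernel evaluation
print(linear_decision_value(w, x))         # linear model evaluation
```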

(7)

Introduction

Comparison Between Linear and Kernel

For certain problems, accuracy by linear is as good as by kernel

But training and testing are much faster, especially for document classification

Number of features (bag-of-words model) is very large; data are large and sparse

Training millions of instances takes just a few seconds

(8)

Introduction

Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)

                     Linear             RBF Kernel
Data set             Time   Accuracy    Time       Accuracy
MNIST38              0.1    96.82       38.1       99.70
ijcnn1               1.6    91.81       26.8       98.69
covtype multiclass   1.4    76.37       46,695.8   96.11
news20               1.1    96.95       383.2      96.90
real-sim             0.3    97.44       938.3      97.82
yahoo-japan          3.1    92.63       20,955.2   93.31
webspam              25.7   93.35       15,681.8   99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features

(11)

Introduction

Binary Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1

l: # of data, n: # of features

min_w f(w), where

f(w) ≡ C Σ_{i=1}^{l} ξ(w; xi, yi) + (1/2) wTw   (L2 regularization)
or
f(w) ≡ C Σ_{i=1}^{l} ξ(w; xi, yi) + ‖w‖1        (L1 regularization)

ξ(w; x, y): loss function; we hope y wTx > 0

C: regularization parameter

(12)

Introduction

Loss Functions

Some commonly used loss functions:

ξL1(w; x, y) ≡ max(0, 1 − y wTx),       (1)
ξL2(w; x, y) ≡ max(0, 1 − y wTx)²,      (2)
ξLR(w; x, y) ≡ log(1 + e^{−y wTx}).     (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
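As a concrete reference, the following Python sketch (illustrative, not LIBLINEAR code) evaluates the L2-regularized objective f(w) = (1/2) wTw + C Σ ξ(w; xi, yi) for the three losses above.

```python
import numpy as np

def l1_loss(margins):
    # xi_L1(w; x, y) = max(0, 1 - y w^T x)
    return np.maximum(0.0, 1.0 - margins)

def l2_loss(margins):
    # xi_L2(w; x, y) = max(0, 1 - y w^T x)^2
    return np.maximum(0.0, 1.0 - margins) ** 2

def lr_loss(margins):
    # xi_LR(w; x, y) = log(1 + exp(-y w^T x))
    return np.log1p(np.exp(-margins))

def objective(w, X, y, C, loss=l2_loss):
    # f(w) = (1/2) w^T w + C * sum_i xi(w; x_i, y_i), with L2 regularization
    margins = y * (X @ w)          # y_i w^T x_i for all i
    return 0.5 * w @ w + C * np.sum(loss(margins))

# Tiny toy data: l = 3 instances, n = 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)
print(objective(w, X, y, C=1.0))   # equals C * l for the zero vector with the L2 loss
```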

(13)

Optimization methods

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(14)

Optimization methods

Optimization Methods

A difference between linear and kernel is that for kernel, optimization must be over a variable α (usually through the dual problem) where

w = Σ_{i=1}^{l} αi φ(xi)

We cannot minimize over w , which may be infinite dimensional

However, for linear, minimizing over w or α is ok

(15)

Optimization methods

Optimization Methods (Cont’d)

Unconstrained optimization methods can be categorized into

Low-order methods: quickly get a model, but slow final convergence

High-order methods: more robust and useful for ill-conditioned situations

We will show both types of optimization methods are useful for linear classification

Further, to handle large problems, the algorithms must take problem structure into account

Let’s discuss a low-order method (coordinate descent) in detail

(16)

Optimization methods

Coordinate Descent

We consider L1 loss and the dual SVM problem

min_α f(α)
subject to 0 ≤ αi ≤ C, ∀i,

where

f(α) ≡ (1/2) αTQα − eTα,   Qij = yi yj xiTxj,   e = [1, . . . , 1]T

We will apply coordinate descent (CD) methods

The situation for L2 or LR loss is very similar

(17)

Optimization methods

Coordinate Descent (Cont’d)

For current α, change αi by fixing others

Let ei = [0, . . . , 0, 1, 0, . . . , 0]T

The sub-problem is

min_d f(α + d ei) = (1/2) Qii d² + ∇if(α) d + constant
subject to 0 ≤ αi + d ≤ C

Without constraints, the optimal d = −∇if(α) / Qii

(18)

Optimization methods

Coordinate Descent (Cont’d)

Now with 0 ≤ αi + d ≤ C:

αi ← min( max( αi − ∇if(α)/Qii , 0 ), C )

Note that

∇if(α) = (Qα)i − 1 = Σ_{j=1}^{l} Qij αj − 1 = Σ_{j=1}^{l} yi yj xiTxj αj − 1

Expensive: O(ln), l: # of instances, n: # of features

(19)

Optimization methods

Coordinate Descent (Cont’d)

A trick in Hsieh et al. (2008) is to define and maintain

u ≡ Σ_{j=1}^{l} yj αj xj

Easy gradient calculation: the cost is O(n)

∇if(α) = yi uTxi − 1

Note that this cannot be done for kernel, as φ(xi) is high dimensional

(20)

Optimization methods

Coordinate Descent (Cont’d)

The procedure:

While α is not optimal (Outer iteration)
    For i = 1, . . . , l (Inner iteration)
        (a) ᾱi ← αi
        (b) G = yi uTxi − 1
        (c) αi ← min(max(αi − G/Qii, 0), C)
        (d) If αi needs to be changed
                u ← u + (αi − ᾱi) yi xi

Maintaining u also costs O(n)
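The following Python/NumPy sketch implements this procedure for the L1-loss dual (a conceptual rendering only; the actual LIBLINEAR solver is in C/C++ and additionally uses random permutation of the indices and shrinking).

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, outer_iters=50):
    """Dual coordinate descent for L1-loss SVM (conceptual sketch).

    X: dense (l, n) array of instances, y: labels in {+1, -1}.
    Maintains u = sum_j y_j alpha_j x_j so each update costs O(n).
    """
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)
    Qii = np.sum(X * X, axis=1)        # Q_ii = x_i^T x_i (y_i^2 = 1)
    for _ in range(outer_iters):       # outer iteration
        for i in range(l):             # inner iteration over coordinates
            if Qii[i] == 0.0:
                continue
            alpha_old = alpha[i]
            G = y[i] * np.dot(u, X[i]) - 1.0            # gradient, O(n)
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
            if alpha[i] != alpha_old:                   # maintain u, O(n)
                u += (alpha[i] - alpha_old) * y[i] * X[i]
    return u, alpha                    # u plays the role of the primal w

# Toy usage
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, alpha = dual_cd_l1_svm(X, y, C=1.0)
print(np.sign(X @ w))                  # predictions on the training data
```

For illustration the data are dense; LIBLINEAR works with sparse representations, in which each O(n) step above costs O(#non-zeros in xi) instead.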

(21)

Optimization methods

Coordinate Descent (Cont’d)

Having

u ≡ Σ_{j=1}^{l} yj αj xj,

∇if(α) = yi uTxi − 1, and

u ← u + (αi − ᾱi) yi xi   (ᾱi: old value; αi: new value)

is essential

This isn't the vanilla CD dating back to Hildreth (1957)

We take the problem structure into account

(22)

Optimization methods

Comparisons

L2-loss SVM is used

DCDL2: dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: primal coordinate descent
TRON: trust region Newton method

This result is from Hsieh et al. (2008) with C = 1

(23)

Optimization methods

Objective Values (Time in Seconds)

[Figure: objective value versus training time on four data sets: news20, rcv1, yahoo-japan, yahoo-korea]

(24)

Optimization methods

Low- versus High-order Methods

We see that low-order methods are efficient, but high-order methods are useful for difficult situations

CD for dual:

$ time ./train -c 1 news20.scale      2.528s
$ time ./train -c 100 news20.scale    28.589s

Newton for primal:

$ time ./train -c 1 -s 2 news20.scale      8.596s
$ time ./train -c 100 -s 2 news20.scale    11.088s

(25)

Optimization methods

Training Medium-sized Data: Status

Basically a solved problem

However, as data and memory continue to grow, new techniques are needed for large-scale sets.

Two possible strategies are

1 Multi-core linear classification

2 Distributed linear classification

(26)

Multi-core linear classification

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(27)

Multi-core linear classification

Multi-core Linear Classification

Nowadays each CPU has several cores

However, parallelizing algorithms to use multiple cores may not be that easy

In fact, algorithms may need to be redesigned

For the past two years we have been working on multi-core LIBLINEAR

(28)

Multi-core linear classification

Multi-core Linear Classification (Cont’d)

Three multi-core solvers have been released

1 Newton method for the primal L2-regularized problem (Lee et al., 2015)

2 Coordinate descent method for the dual L2-regularized problem (Chiang et al., 2016)

3 Coordinate descent method for the primal L1-regularized problem (Zhuang et al., 2017)

They are practically useful. For example, one user from USC thanked us because "a job (taking >30 hours using one core) now can finish within 5 hours"

We will briefly discuss the 2nd and the 3rd

(29)

Multi-core linear classification

Multi-core CD for Dual

Recall the CD algorithm for dual is

While α is not optimal (Outer iteration)
    For i = 1, . . . , l (Inner iteration)
        (a) ᾱi ← αi
        (b) G = yi uTxi − 1
        (c) αi ← min(max(αi − G/Qii, 0), C)
        (d) If αi needs to be changed
                u ← u + (αi − ᾱi) yi xi

(30)

Multi-core linear classification

Multi-core CD for Dual (Cont’d)

The algorithm is inherently sequential

Suppose αi′ is updated after αi

Then the update of αi′ must wait until the latest u is obtained

The parallelization is difficult

(31)

Multi-core linear classification

Multi-core CD for Dual (Cont’d)

Asynchronous CD is possible (Hsieh et al., 2015), but it may diverge

We note that for a given set B̄, the gradient components

∇if(α) = yi uTxi − 1, ∀i ∈ B̄

can be calculated in parallel

We then propose a framework

(32)

Multi-core linear classification

Multi-core CD for Dual (Cont’d)

While α is not optimal
    (a) Select a set B̄
    (b) Calculate ∇_B̄ f(α) in parallel
    (c) Select B ⊂ B̄ with |B| ≪ |B̄|
    (d) Sequentially update αi, i ∈ B
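A rough Python/NumPy rendering of one pass of this framework is below (illustrative only; the released multi-core solver of Chiang et al. (2016) is C++/OpenMP, and its rules for choosing B̄ and B are more involved). Step (b) appears as a single matrix-vector product, which is the part that can be evaluated in parallel across i ∈ B̄; using the projected gradient to pick B is one plausible choice, not necessarily the paper's exact rule.

```python
import numpy as np

def parallel_style_cd_pass(X, y, alpha, u, C, Bbar_size=256, B_size=16, rng=None):
    """One pass of the two-level CD framework (conceptual sketch).

    (a) pick a working set Bbar, (b) evaluate the gradient components on Bbar
    at once (the parallelizable step), (c) keep a much smaller B with the
    largest projected-gradient violations, (d) update alpha_i, i in B,
    sequentially while maintaining u.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    l, n = X.shape
    Bbar = rng.choice(l, size=min(Bbar_size, l), replace=False)        # (a)
    grad = y[Bbar] * (X[Bbar] @ u) - 1.0                               # (b)
    # Projected gradient respecting the box constraint 0 <= alpha_i <= C
    pg = np.where(alpha[Bbar] <= 0.0, np.minimum(grad, 0.0),
         np.where(alpha[Bbar] >= C, np.maximum(grad, 0.0), grad))
    order = np.argsort(-np.abs(pg))                                    # (c)
    B = Bbar[order[:min(B_size, len(Bbar))]]
    for i in B:                                                        # (d)
        Qii = X[i] @ X[i]
        if Qii == 0.0:
            continue
        G = y[i] * (X[i] @ u) - 1.0
        new_alpha = min(max(alpha[i] - G / Qii, 0.0), C)
        if new_alpha != alpha[i]:
            u += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return alpha, u
```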

(33)

Multi-core linear classification

Multi-core CD for Dual (Cont’d)

The selection of B ⊂ B̄ with |B| ≪ |B̄| is based on ∇_B̄ f(α)

The idea is simple, but it takes some effort to obtain a practical setting (details omitted)

(34)

Multi-core linear classification

Multi-core CD for Dual (Cont’d)

webspam url combined

Alg-4: the method in Chiang et al. (2016) Asynchronous CD (Hsieh et al., 2015)

(35)

Multi-core linear classification

Multi-core CD for L1 Regularization

Currently, primal CD (Yuan et al., 2010) or its variants (Yuan et al., 2012) is the state of the art for L1

Each CD step involves one feature

Some attempts at parallel CD for L1 include
    Asynchronous CD (Bradley et al., 2011)
    Block CD (Bian et al., 2013)

These methods are not satisfactory because of either divergence issues or poor speedup

(36)

Multi-core linear classification

Multi-core CD for L1 Regularization (Cont’d)

We struggled for years to find a solution

Recently, in Zhuang et al. (2017), we obtained an effective setting

This work is partially supported by a Criteo Faculty Research Award

Our idea is simple: direct parallelization of CD

But wait... this shouldn't work, because each CD iteration is cheap

(37)

Multi-core linear classification

Direct Parallelization of CD

Let's consider a simple setting to decide whether one CD step should be parallelized or not:

    if #non-zeros in an instance/feature ≥ a threshold then
        multi-core
    else
        single-core

Idea: a CD step is parallelized if there are enough operations
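A Python sketch of this decision rule follows (the threshold value and helper names are illustrative; in multi-core LIBLINEAR the parallel branch is done with OpenMP in C++, and Python threads are used here only to show the structure).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

NNZ_THRESHOLD = 500   # parallelize only if this instance/feature is dense enough

def dot_single(values, index, u):
    # Sequential sparse inner product over the non-zeros of one instance/feature
    return float(np.dot(values, u[index]))

def dot_multi(values, index, u, n_threads=4):
    # Split the non-zeros into chunks and sum partial inner products in parallel
    chunks = np.array_split(np.arange(len(index)), n_threads)
    def partial(c):
        return float(np.dot(values[c], u[index[c]]))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(partial, chunks))

def cd_inner_product(values, index, u):
    # One CD step needs an inner product over the non-zeros of an instance
    # (dual CD) or of a feature (primal CD for L1); parallelize only when
    # there is enough work to amortize the threading overhead
    if len(index) >= NNZ_THRESHOLD:
        return dot_multi(values, index, u)
    return dot_single(values, index, u)

# values, index: NumPy arrays with the non-zero values and their positions
values = np.random.rand(1000)
index = np.random.choice(10000, size=1000, replace=False)
u = np.random.rand(10000)
print(cd_inner_product(values, index, u))
```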

(38)

Multi-core linear classification

Direct Parallelization of CD (Cont’d)

Speedup of CD for dual, L2 regularization

                             #threads
Data set                    2     4     8
sparse sets
    avazu-app              0.4   0.3   0.2
    criteo                 0.5   0.3   0.2
dense sets
    epsilon normalized     1.3   1.3   1.1
    splice site.t.10%      1.8   2.8   4.1

CD for dual: one instance at a time

Threshold: 0 (sparse sets), 500 (dense sets)

With a threshold of 500 for the sparse sets, no instance would be parallelized

The speedup is poor

(39)

Multi-core linear classification

% of instances/features containing 50% and 80% of the #non-zeros

                       Instances         Features
Data set              50%     80%       50%       80%
avazu-app             50%     80%       0.2%      1%
criteo                50%     80%       0.01%     0.2%
kdd2010-a             40%     73%       0.03%     2%
kdd2012               50%     80%       0.003%    0.5%
rcv1 test             24%     54%       1%        5%
splice site.t.10%     50%     80%       9%        57%
url combined          44%     76%       0.002%    0.006%
webspam               29%     55%       0.6%      2%
yahoo-korea           20%     48%       0.07%     0.5%

Features' non-zero distribution is extremely skewed: most non-zeros are in a few dense (and parallelizable) features

(40)

Multi-core linear classification

Speedup of CD for L1 Regularization

LR loss is used

                       Naive            Block CD         Async. CD
                       #threads         #threads         #threads
Data set              2    4    8      2    4    8      2    4    8
avazu-app            1.9  3.4  5.6    0.4  0.7  1.0    1.4  2.7  3.4
criteo               1.8  3.3  5.5    0.7  1.2  1.9    1.5  2.9  4.8
epsilon normalized   2.0  4.0  7.9     x    x    x     1.3  2.1   x
HIGGS                2.0  3.9  7.5    0.7  0.8  0.9    1.0  1.3   x
kdd2010-a            1.7  2.4  3.1    0.8  1.4  2.4    1.5  2.7  4.8
kdd2012              1.9  2.8  3.9    0.2  0.4  0.6    2.1  4.7  7.0
rcv1 test            1.9  3.4  5.9     x    x    x     1.3  2.5  4.5
splice site.t.10%    1.9  3.6  6.2     x    x    x     1.6  2.7  4.3
url combined         2.0  3.5  6.2    0.5  0.9  1.3    1.0  1.7  1.7
webspam              1.8  3.2  4.8    0.1  0.3  0.5    1.4  2.5  4.1
yahoo-korea          1.9  3.5  5.9    0.2  0.3  0.5    1.3  2.4  4.4

(41)

Distributed linear classification

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(42)

Distributed linear classification

Distributed Linear Classification

It’s even more complicated than multi-core

I don’t have time to discuss this topic in detail, but let me share some lessons

A big mistake was that we worked on distributed before multi-core

(43)

Distributed linear classification

Distributed Linear Classification (Cont’d)

A few years ago, big data was hot. So we extended a Newton solver in LIBLINEAR to MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)

We were a bit ahead of time; Spark MLlib wasn’t even available then

Unfortunately, very few people use our code, especially the Spark one

We moved to multi-core. Immediately, multi-core LIBLINEAR gained many users

(44)

Distributed linear classification

Distributed Linear Classification (Cont’d)

Why did we fail? Several possible reasons:

Not many people have big data??

System issues are more important than we thought. At that time Spark wasn't easy to use and was being actively changed

System configurations and application scenarios vary significantly. An algorithm useful for systems with fast networks may be useless for systems with slow communication

(45)

Distributed linear classification

Distributed Linear Classification (Cont’d)

Application dependency is stronger in the distributed setting.

L2 and L1 regularization often give similar accuracy.

On a single machine, we may not want to use L1 because training is more difficult and the smaller model size isn’t that important

However, for distributed applications many have told me that they need L1

A lesson is that for people from academia, it’s better to collaborate with industry for research on distributed machine learning

(46)

Conclusions

Outline

1 Introduction

2 Optimization methods

3 Multi-core linear classification

4 Distributed linear classification

5 Conclusions

(47)

Conclusions

Conclusions

Linear classification is an old topic, but it remains useful for many applications

Efficient training relies on designing optimization algorithms that incorporate the problem structure

Many issues in multi-core and distributed linear classification still need to be studied
