Training Support Vector Machines: Status and Challenges

Chih-Jen Lin

Department of Computer Science
National Taiwan University

Talk at Microsoft Research Asia

(2)

Outline

Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions

(3)

Support Vector Classification

Training data $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in R^n$, $y_i = \pm 1$
Maximizing the margin [Boser et al., 1992, Cortes and Vapnik, 1995]
\[
\min_{w,b} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(1 - y_i (w^T \phi(x_i) + b),\, 0\bigr)
\]
High dimensional (maybe infinite) feature space
\[
\phi(x) = (\phi_1(x), \phi_2(x), \ldots).
\]

(4)

Support Vector Classification (Cont’d)

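The dual problem (finite # variables)
\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T \alpha = 0,
\]
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$
At optimum
\[
w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)
\]
Kernel: $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$; closed form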

(5)

Large Dense Quadratic Programming

$Q_{ij} \neq 0$, $Q$: an $l$ by $l$ fully dense matrix
\[
\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T \alpha = 0
\]
50,000 training points: 50,000 variables:
$(50{,}000^2 \times 8 / 2)$ bytes = 10GB RAM to store $Q$
Traditional optimization methods cannot be directly applied
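The storage estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `dense_q_bytes` and the extra problem sizes are illustrative assumptions, not part of the talk.

```python
def dense_q_bytes(l, bytes_per_entry=8, symmetric=True):
    """Bytes needed to store an l-by-l matrix of floating-point entries.

    With symmetric=True only half of the entries are counted,
    matching the (l^2 * 8 / 2) estimate on the slide.
    """
    total = l * l * bytes_per_entry
    return total // 2 if symmetric else total


# Sizes other than 50,000 are hypothetical, added to show the quadratic growth.
for l in (5_000, 50_000, 500_000):
    gb = dense_q_bytes(l) / 1e9  # decimal gigabytes
    print(f"l = {l:>7,}: {gb:,.1f} GB")
```

Because the cost grows with the square of l, a tenfold increase in training points means a hundredfold increase in memory.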

(6)

Decomposition Methods

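Working on some variables each time (e.g., [Osuna et al., 1997, Joachims, 1998, Platt, 1998])
Working set $B$, $N = \{1, \ldots, l\} \backslash B$ fixed
Sub-problem at the $k$th iteration:
\[
\min_{\alpha_B} \quad \frac{1}{2}
\begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix}
\begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}
-
\begin{bmatrix} e_B^T & e_N^T \end{bmatrix}
\begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}
\]
subject to $0 \le \alpha_i \le C$, $i \in B$, $y_B^T \alpha_B = -y_N^T \alpha_N^k$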

(7)

Avoid Memory Problems

The new objective function
\[
\frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + (-e_B + Q_{BN} \alpha_N^k)^T \alpha_B + \text{constant}
\]
Only $B$ columns of $Q$ needed ($|B| \ge 2$)
Calculated when used
Trade time for space
Popular software such as SVMlight and LIBSVM are of this type
Work well if data not too large (e.g., ≤ 100k)
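As a rough illustration of why only the B columns of Q are needed, the sketch below builds the reduced objective from columns computed on demand and checks that it differs from the full dual objective by a constant. All function names, the toy data, and the working set are assumptions made for this example; the constraints are ignored, and this is not the working-set selection or solver used by SVMlight or LIBSVM.

```python
def kernel_column(X, y, j, kernel):
    """Column j of Q, computed on demand: Q[i][j] = y_i * y_j * K(x_i, x_j)."""
    return [y[i] * y[j] * kernel(X[i], X[j]) for i in range(len(X))]


def sub_problem(X, y, alpha, B, kernel):
    """Return (Q_BB, p) with p = -e_B + Q_BN alpha_N.

    Only the |B| columns of Q indexed by B are formed.
    """
    cols = {j: kernel_column(X, y, j, kernel) for j in B}
    N = [t for t in range(len(X)) if t not in B]
    Q_BB = [[cols[j][i] for j in B] for i in B]
    # Q is symmetric, so row i restricted to N can be read from column i.
    p = [-1.0 + sum(cols[i][t] * alpha[t] for t in N) for i in B]
    return Q_BB, p


def reduced_objective(Q_BB, p, a_B):
    """0.5 * a_B^T Q_BB a_B + p^T a_B (the constant is dropped)."""
    m = len(a_B)
    quad = sum(a_B[r] * Q_BB[r][c] * a_B[c] for r in range(m) for c in range(m))
    return 0.5 * quad + sum(pi * ai for pi, ai in zip(p, a_B))


def full_objective(X, y, alpha, kernel):
    """0.5 * alpha^T Q alpha - e^T alpha, forming all of Q (for checking only)."""
    l = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(l) for j in range(l))
    return 0.5 * quad - sum(alpha)


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy problem; values chosen only for illustration.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
y = [1, -1, 1, -1]
alpha_k = [0.1, 0.2, 0.3, 0.4]
B = [0, 2]
Q_BB, p = sub_problem(X, y, alpha_k, B, linear_kernel)

gaps = []
for a_B in ([0.1, 0.3], [0.5, 0.0]):
    alpha = list(alpha_k)
    for pos, idx in enumerate(B):
        alpha[idx] = a_B[pos]
    gaps.append(full_objective(X, y, alpha, linear_kernel)
                - reduced_objective(Q_BB, p, a_B))
print(gaps)  # the two gaps agree: the objectives differ by a constant
```

The memory footprint of `sub_problem` is |B| columns of length l, at the price of recomputing kernel values whenever a column is requested again.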


(9)

Is It Possible to Train Large SVM?

Accurately solve quadratic programs with millions of variables or more?

General approach: very unlikely

Cases with many support vectors: quadratic time bottleneck on $Q_{SV,SV}$
Parallelization: possible but difficult in distributed environments due to high communication cost

(10)

Is It Possible to Train Large SVM? (Cont’d)

For large problems, approximation almost unavoidable

That is, don’t accurately solve the quadratic program of the full training set

(11)

Approximately Training SVM

Can be done in many aspects
Data level: sub-sampling
Optimization level: approximately solve the quadratic program
Other non-intuitive but effective ways; I will show one today

Many papers have addressed this issue

(12)

Approximately Training SVM (Cont’d)

Subsampling

Simple and often effective
Many more advanced techniques
Incremental training (e.g., [Syed et al., 1999]): data ⇒ 10 parts; train 1st part ⇒ SVs, train SVs + 2nd part, ...
Select and train good points: KNN or heuristics, e.g., [Bakır et al., 2005]

(13)

Approximately Training SVM (Cont’d)

Approximate the kernel; e.g., [Fine and Scheinberg, 2001, Williams and Seeger, 2001]

Use part of the kernel; e.g., [Lee and Mangasarian, 2001, Keerthi et al., 2006]
Early stopping of optimization algorithms: [Tsang et al., 2005] and most parallel works
And many others
Some simple but some sophisticated

(15)

Approximately Training SVM (Cont’d)

But sophisticated techniques may not always be useful
Sometimes slower than sub-sampling
covtype: 500k training and 80k testing
rcv1: 550k training and 14k testing

covtype                      rcv1
Training size   Accuracy     Training size   Accuracy
50k             92.5%        50k             97.2%
100k            95.3%        100k            97.4%
500k            98.2%        550k            97.8%

(16)

Approximately Training SVM (Cont’d)

Personally I prefer specialized approaches for large-scale scenarios
[Figure: distribution of training data by size. Medium and small: general solvers (e.g., LIBSVM, SVMlight). Large: ??]

(17)

Approximately Training SVM (Cont’d)

We don’t have many large and well labeled sets
They appear in certain application domains
Specific properties of data should be considered
May significantly improve the training speed
We will illustrate this point using linear SVM
The design of software for large and medium/small problems should be different


(19)

Linear SVM

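Data not mapped to another space
Primal without the bias term $b$
\[
\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^T x_i\bigr)
\]
Dual
\[
\min_{\alpha} \quad f(\alpha) \equiv \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le C,\ \forall i
\]
where $Q_{ij} = y_i y_j x_i^T x_j$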
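To make the two objectives concrete, here is a small sketch that evaluates the primal and the dual of the linear SVM without bias on a toy data set. The function names and the data are assumptions for illustration. The dual value is computed through w = Σ α_i y_i x_i, which gives α^T Q α = w^T w without forming Q.

```python
def primal_objective(w, X, y, C):
    """0.5 * w^T w + C * sum_i max(0, 1 - y_i w^T x_i), no bias term."""
    reg = 0.5 * sum(wj * wj for wj in w)
    hinge = sum(max(0.0, 1.0 - yi * sum(wj * xj for wj, xj in zip(w, xi)))
                for xi, yi in zip(X, y))
    return reg + C * hinge


def dual_objective(alpha, X, y):
    """f(alpha) = 0.5 * alpha^T Q alpha - e^T alpha with Q_ij = y_i y_j x_i^T x_j."""
    n = len(X[0])
    # alpha^T Q alpha equals ||sum_i alpha_i y_i x_i||^2, so Q is never formed.
    w = [sum(a * yi * xi[j] for a, xi, yi in zip(alpha, X, y)) for j in range(n)]
    return 0.5 * sum(wj * wj for wj in w) - sum(alpha)


# Hypothetical toy data, chosen only for illustration.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
y = [1, 1, -1]
C = 1.0

p_val = primal_objective([1.0, 0.0], X, y, C)
d_val = dual_objective([0.0, 1.0, 0.0], X, y)
print(p_val, d_val)
# Weak duality: any primal value is at least -f(alpha) for feasible alpha.
print(p_val >= -d_val)
```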

(20)

Linear SVM (Cont’d)

In theory, RBF kernel with certain parameters ⇒ as good as linear [Keerthi and Lin, 2003]
RBF kernel:
\[
K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}
\]
That is, test accuracy of linear ≤ test accuracy of RBF
Linear SVM not better than nonlinear; but

(21)

Linear SVM for Large Document Sets

Bag of words model (TF-IDF or others)
A large # of features
Accuracy similar with/without mapping vectors
What if training is much faster?

A very effective approximation to nonlinear SVM

(22)

A Comparison: LIBSVM and LIBLINEAR

rcv1: # data: > 600k, # features: > 40k
TF-IDF
Using LIBSVM (linear kernel): > 10 hours
Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds
Same stopping condition
Accuracy similar to nonlinear; more than 100x speedup

(23)

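Why Is Training Linear SVM Faster?
In optimization, each iteration we often need
\[
\nabla_i f(\alpha) = (Q\alpha)_i - 1
\]
Nonlinear SVM
\[
\nabla_i f(\alpha) = \sum_{j=1}^{l} y_i y_j K(x_i, x_j) \alpha_j - 1
\]
cost: $O(nl)$; $n$: # features, $l$: # data
Linear: use
\[
w \equiv \sum_{j=1}^{l} y_j \alpha_j x_j \quad \text{and} \quad \nabla_i f(\alpha) = y_i w^T x_i - 1
\]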

(24)

Faster if # iterations not l times more
For details, see
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. ICML 2008.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9(2008), 1871-1874.
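The gradient formula for the linear case suggests a simple coordinate-wise update on the dual: keep w in memory, compute one gradient entry as y_i w^T x_i − 1 in O(n), take a clipped Newton step on α_i, and patch w. The sketch below follows that idea in plain Python. It is a minimal illustration under the assumptions of dense lists, a fixed number of passes, and no shrinking or stopping test; it is not the LIBLINEAR implementation, and the name `dcd_train` and the toy data are made up for the example.

```python
import random


def dcd_train(X, y, C=1.0, epochs=50, seed=0):
    """Coordinate descent on the dual of the linear SVM (hinge loss, no bias)."""
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n  # maintained so that w = sum_j y_j * alpha_j * x_j
    q_diag = [sum(v * v for v in xi) for xi in X]  # Q_ii = x_i^T x_i
    order = list(range(l))
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            if q_diag[i] == 0.0:
                continue
            # One gradient entry costs O(n) because w is maintained.
            grad = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # Minimize over alpha_i alone, then clip to the box [0, C].
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(n):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w, alpha


# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, alpha = dcd_train(X, y, C=1.0)
margins = [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]
print(w, margins)
```

A kernel method would need a full pass over the data for the same gradient entry, which is the O(nl) versus O(n) gap described on the slide.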

Experiments

Problem        l: # data    n: # features
news20         19,996       1,355,191
yahoo-japan    176,203      832,026
rcv1           677,399      47,236

(25)

Testing Accuracy versus Training Time

[Figure panels: news20, yahoo-japan]

(26)

Training Linear SVM Always Much Faster?

No

If #data ≫ #features, the algorithm used above may not be very good
Need some other ways
But document data are not of this type
Large-scale SVM training is domain specific


(28)

Training Nonlinear SVM via Linear SVM

Revisit nonlinear SVM
\[
\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(1 - y_i w^T \phi(x_i),\, 0\bigr)
\]
Dimension of $\phi(x)$: large
If not very large, directly train SVM without kernel
Calculate $\nabla_i f(\alpha)$ at each step
Kernel: $O(nl)$

(29)

Degree-2 Polynomial Mapping

Degree-2 polynomial kernel
\[
K(x_i, x_j) = (1 + x_i^T x_j)^2
\]
Instead we do
\[
\phi(x) = [1, \sqrt{2} x_1, \ldots, \sqrt{2} x_n, x_1^2, \ldots, x_n^2, \sqrt{2} x_1 x_2, \ldots, \sqrt{2} x_{n-1} x_n]^T.
\]
Now we can just consider
\[
\phi(x) = [1, x_1, \ldots, x_n, x_1^2, \ldots, x_n^2, x_1 x_2, \ldots, x_{n-1} x_n]^T.
\]
$O(n^2)$ dimensions can cause troubles; some
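The identity between the degree-2 polynomial kernel and the explicit mapping with √2 coefficients can be checked numerically. The sketch below is an illustration only; the function names and test vectors are assumptions, and a practical implementation would exploit sparsity rather than build dense lists.

```python
import math
from itertools import combinations


def phi_degree2(x):
    """Explicit degree-2 map such that phi(x)^T phi(z) = (1 + x^T z)^2."""
    r2 = math.sqrt(2.0)
    feats = [1.0]
    feats += [r2 * v for v in x]                      # sqrt(2) * x_i
    feats += [v * v for v in x]                       # x_i^2
    feats += [r2 * x[i] * x[j]                        # sqrt(2) * x_i * x_j, i < j
              for i, j in combinations(range(len(x)), 2)]
    return feats


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (1 + x^T z)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


# Hypothetical vectors, chosen only for illustration.
x = [1.0, 2.0, -1.0]
z = [0.5, -1.0, 3.0]
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(z)))
print(explicit, poly2_kernel(x, z))
print(len(phi_degree2(x)))  # (n + 1)(n + 2) / 2 features for n inputs
```

The feature count grows as (n + 1)(n + 2) / 2, which is the O(n²) dimensionality the slide warns about.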

(30)

Accuracy Difference with linear and RBF

            Degree-2 polynomial
            Training time              Accuracy diff.
Data set    LIBLINEAR    LIBSVM        linear    RBF
a9a         1.6          89.8          0.07      0.02
real-sim    59.8         1,220.5       0.49      0.10
ijcnn1      10.7         64.2          5.63      −0.85
MNIST38     8.6          18.4          2.47      −0.40
covtype     5,211.9      ≥ 3 × 10^5    3.74      −15.98
webspam     3,228.1      ≥ 3 × 10^5    5.29      −0.76

Some problems: accuracy similar to RBF; but training much faster
A less nonlinear SVM to approximate a highly nonlinear one

(31)

NLP Applications

In NLP (Natural Language Processing) degree-2 or degree-3 polynomial kernels very popular
Competitive with RBF; better than linear
No theory yet; but possible reasons

Bigram/trigram useful

This is different from other areas (e.g., image), which mainly use RBF

Currently people complain that training is slow

(32)

SVM with Low-Degree Polynomial Mapping

Dependency Parsing

nsubj  ROOT  det  dobj  prep  det  pobj  p
John   hit   the  ball  with  a    bat   .
NNP    VBD   DT   NN    IN    DT   NN    .

                 LIBSVM                  LIBLINEAR
                 RBF         Poly        Linear    Poly
Training time    3h34m53s    3h21m51s    3m36s     3m43s
Parsing speed    0.7x        1x          1652x     103x
UAS              89.92       91.67       89.11     91.71
LAS              88.55       90.60       88.07     90.71

(34)

Dependency Parsing (Cont’d)

Details:

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Low-degree polynomial mapping of data for SVM, 2009.


(36)

What If Data Cannot Fit in Memory?

We can manage to train on data stored on disk
Details not shown here
However, what if data is too large to store in one machine?
So far not many such cases with well labeled data
It’s expensive to label data
We do see very large but low quality data
Dealing with such data is different

(37)

L1-regularized Classifiers

Replacing $\|w\|_2$ with $\|w\|_1$
\[
\min_{w} \quad \|w\|_1 + C \times (\text{losses})
\]
Sparsity: many $w$ elements are zeros
Feature selection
LIBLINEAR supports L2 loss and logistic regression
\[
\max\bigl(0,\, 1 - y_i w^T x_i\bigr)^2 \quad \text{and} \quad \log\bigl(1 + e^{-y_i w^T x_i}\bigr)
\]
If using least-square loss and $y \in R^l$, related to L1-regularized problems in signal processing
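For reference, the two losses and the L1-regularized objective can be written out as follows. This is only a sketch for evaluating the objective; the function names and toy values are assumptions, and it does not reproduce how LIBLINEAR minimizes these problems.

```python
import math


def l2_loss(margin):
    """Squared hinge loss max(0, 1 - margin)^2, with margin = y_i w^T x_i."""
    return max(0.0, 1.0 - margin) ** 2


def logistic_loss(margin):
    """log(1 + exp(-margin)), arranged to avoid overflow for large |margin|."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def l1_regularized_objective(w, X, y, C, loss):
    """||w||_1 + C * sum_i loss(y_i w^T x_i)."""
    reg = sum(abs(wj) for wj in w)
    total = sum(loss(yi * sum(wj * xj for wj, xj in zip(w, xi)))
                for xi, yi in zip(X, y))
    return reg + C * total


# Hypothetical toy values, chosen only for illustration.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
w = [0.5, 0.0]
obj_l2 = l1_regularized_objective(w, X, y, C=2.0, loss=l2_loss)
obj_lr = l1_regularized_objective(w, X, y, C=2.0, loss=logistic_loss)
print(obj_l2, obj_lr)
print(sum(1 for wj in w if wj == 0.0), "of", len(w), "weights are zero")
```

Both losses are differentiable, unlike the plain hinge loss, while the L1 term is what drives weights to exactly zero.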

(38)

Conclusions

Training large SVM is difficult

The (at least) quadratic time bottleneck
Approximation is often needed; but some are non-intuitive ways

E.g., linear SVM good approximation to nonlinear SVM for some applications

Difficult to have a general approach for all large scenarios

Special techniques are needed

(39)

Conclusions (Cont’d)

Software design for large and medium/small problems should be different
Medium/small problems: general and simple software
Sources for my past work are available on my page.
In particular,
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear
I will be happy to talk to any machine learning users here
