## Training Support Vector Machines: Status and Challenges

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Talk at Microsoft Research Asia

## Outline

- Training support vector machines
- Training large-scale SVM
- Linear SVM
- SVM with Low-Degree Polynomial Mapping
- Discussion and Conclusions

## Support Vector Classification

Training data $(x_i, y_i)$, $i = 1, \ldots, l$, with $x_i \in R^n$ and $y_i = \pm 1$.

Maximizing the margin [Boser et al., 1992, Cortes and Vapnik, 1995]:

$$\min_{w,b} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(1 - y_i (w^T \phi(x_i) + b),\, 0\bigr)$$

High dimensional (maybe infinite) feature space:

$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots).$$

## Support Vector Classification (Cont’d)

The dual problem (finite # variables):

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \qquad y^T \alpha = 0,$$

where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$.

At optimum:

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$

Kernel: $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$; closed form
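Substituting the optimal $w$ into $w^T \phi(x) + b$ gives a decision value that needs only kernel evaluations, never $\phi$ itself. The following is a minimal sketch of that substitution, assuming an RBF kernel as the closed-form example; the function names and the toy values of `alpha`, `b`, and `gamma` are made up for illustration and do not come from a trained model.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), one example of a closed-form kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_x, support_y, alpha, b, gamma):
    """sum_i alpha_i y_i K(x_i, x) + b, i.e. w^T phi(x) + b without forming phi."""
    return sum(a * y * rbf_kernel(xi, x, gamma)
               for xi, y, a in zip(support_x, support_y, alpha)) + b

# Toy values (hypothetical, chosen so that y^T alpha = 0 holds).
support_x = [[0.0, 0.0], [1.0, 1.0]]
support_y = [-1, 1]
alpha = [0.5, 0.5]

near_positive = decision_value([1.0, 1.0], support_x, support_y, alpha, b=0.0, gamma=1.0)
midpoint = decision_value([0.5, 0.5], support_x, support_y, alpha, b=0.0, gamma=1.0)
print(near_positive > 0, abs(midpoint) < 1e-12)
```

The midpoint between the two toy points is equidistant from both, so its two kernel terms cancel and its decision value is zero.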

## Large Dense Quadratic Programming

$Q_{ij} \neq 0$; $Q$ is an $l$ by $l$ fully dense matrix.

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \qquad y^T \alpha = 0$$

50,000 training points mean 50,000 variables: $(50{,}000^2 \times 8 / 2)$ bytes = 10GB RAM to store $Q$.

Traditional optimization methods cannot be directly applied.
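The storage estimate above is simple arithmetic, and a short sketch makes it easy to repeat for other training-set sizes. The helper name `dense_q_bytes` is hypothetical; it assumes 8-byte entries and that only half of the symmetric matrix is stored, as in the figure quoted on the slide.

```python
def dense_q_bytes(l, bytes_per_entry=8, symmetric=True):
    """Bytes needed to store an l-by-l dense Q (half of it if symmetry is exploited)."""
    total = l * l * bytes_per_entry
    return total // 2 if symmetric else total

for l in (50_000, 100_000, 500_000):
    print(f"l = {l:>7,}: {dense_q_bytes(l) / 1e9:,.0f} GB")
```

Memory grows quadratically with $l$, so doubling the data from 50,000 to 100,000 points quadruples the requirement.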

## Decomposition Methods

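Working on some variables each time (e.g., [Osuna et al., 1997, Joachims, 1998, Platt, 1998]).

Working set $B$; $N = \{1, \ldots, l\} \setminus B$ is fixed.

Sub-problem at the $k$th iteration:

$$\min_{\alpha_B} \quad \frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix} - \begin{bmatrix} e_B^T & e_N^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,\ i \in B, \qquad y_B^T \alpha_B = -y_N^T \alpha_N^k$$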

## Avoid Memory Problems

The new objective function:

$$\frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + (-e_B + Q_{BN} \alpha_N^k)^T \alpha_B + \text{constant}$$

- Only the $B$ columns of $Q$ are needed ($|B| \ge 2$)
- Calculated when used: trade time for space
- Popular software such as SVM^{light} and LIBSVM is of this type
- Works well if the data are not too large (e.g., ≤ 100k)
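To illustrate why only the $B$ columns of $Q$ are needed, the sketch below keeps the gradient $Q\alpha - e$ up to date after a step on the working set by computing each required column on demand. It is an assumption-laden illustration, not the code of SVM^{light} or LIBSVM: the helper names, the toy data, and the step `delta_B` are hypothetical, and a linear kernel stands in for any kernel function.

```python
def q_column(j, X, y, kernel):
    """Column j of Q, computed on demand: Q_ij = y_i y_j K(x_i, x_j)."""
    return [y[i] * y[j] * kernel(X[i], X[j]) for i in range(len(X))]

def update_gradient(grad, delta_B, X, y, kernel):
    """Update grad = Q alpha - e in place after alpha_j += d for each (j, d) in delta_B."""
    for j, d in delta_B.items():
        col = q_column(j, X, y, kernel)   # l kernel evaluations, O(l) extra memory
        for i in range(len(grad)):
            grad[i] += col[i] * d
    return grad

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
grad = [-1.0] * len(X)            # alpha = 0, so Q alpha - e = -e
delta_B = {0: 0.25, 2: 0.25}      # hypothetical step on B = {0, 2}; keeps y^T alpha = 0
update_gradient(grad, delta_B, X, y, linear_kernel)
print(grad)
```

The full $l$ by $l$ matrix is never formed; each iteration touches $|B|$ columns, which is the time-for-space trade described above.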

## Is It Possible to Train Large SVM?

- Accurately solve quadratic programs with millions of variables or more?
- General approach: very unlikely
  - Cases with many support vectors: quadratic time bottleneck on $Q_{SV,SV}$
- Parallelization: possible, but difficult in distributed environments due to high communication cost

## Is It Possible to Train Large SVM? (Cont’d)

For large problems, approximation almost unavoidable

That is, don’t accurately solve the quadratic program of the full training set

## Approximately Training SVM

- Can be done in many aspects
- Data level: sub-sampling
- Optimization level: approximately solve the quadratic program
- Other non-intuitive but effective ways; I will show one today
- Many papers have addressed this issue

## Approximately Training SVM (Cont’d)

- Subsampling: simple and often effective
- Many more advanced techniques
  - Incremental training (e.g., [Syed et al., 1999]): data ⇒ 10 parts; train 1st part ⇒ SVs, train SVs + 2nd part, ...
  - Select and train good points: KNN or heuristics, e.g., [Bakır et al., 2005]

## Approximately Training SVM (Cont’d)

- Approximate the kernel; e.g., [Fine and Scheinberg, 2001, Williams and Seeger, 2001]
- Use part of the kernel; e.g., [Lee and Mangasarian, 2001, Keerthi et al., 2006]
- Early stopping of optimization algorithms: [Tsang et al., 2005] and most parallel works
- And many others: some simple, some sophisticated

## Approximately Training SVM (Cont’d)

But sophisticated techniques may not always be useful; they are sometimes slower than sub-sampling.

- covtype: 500k training and 80k testing
- rcv1: 550k training and 14k testing

| covtype training size | covtype accuracy | rcv1 training size | rcv1 accuracy |
|---|---|---|---|
| 50k | 92.5% | 50k | 97.2% |
| 100k | 95.3% | 100k | 97.4% |
| 500k | 98.2% | 550k | 97.8% |

## Approximately Training SVM (Cont’d)

Personally, I prefer specialized approaches for large-scale scenarios.

[Figure: distribution of training data by size. Medium and small: general solvers (e.g., LIBSVM, SVM^{light}). Large: ??]

## Approximately Training SVM (Cont’d)

- We don’t have many large and well labeled sets
- They appear in certain application domains
- Specific properties of data should be considered; this may significantly improve the training speed
- We will illustrate this point using linear SVM
- The design of software for large and medium/small problems should be different

## Linear SVM

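Data not mapped to another space.

Primal without the bias term $b$:

$$\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^T x_i\bigr)$$

Dual:

$$\min_{\alpha} \quad f(\alpha) \equiv \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i \le C,\ \forall i$$

where $Q_{ij} = y_i y_j x_i^T x_j$.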

## Linear SVM (Cont’d)

In theory, RBF kernel with certain parameters ⇒ as good as linear [Keerthi and Lin, 2003].

RBF kernel:

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$$

That is,

Test accuracy of linear ≤ Test accuracy of RBF

Linear SVM not better than nonlinear; but

## Linear SVM for Large Document Sets

- Bag of words model (TF-IDF or others)
- A large # of features
- Accuracy similar with/without mapping vectors
- What if training is much faster? A very effective approximation to nonlinear SVM

## A Comparison: LIBSVM and LIBLINEAR

rcv1: # data: > 600k, # features: > 40k (TF-IDF)

- Using LIBSVM (linear kernel): > 10 hours
- Using LIBLINEAR: computation < 5 seconds; I/O 60 seconds
- Same stopping condition
- Accuracy similar to nonlinear; more than 100x speedup

## Why Training Linear SVM Is Faster?

In optimization, at each iteration we often need

$$\nabla_i f(\alpha) = (Q\alpha)_i - 1$$

Nonlinear SVM:

$$\nabla_i f(\alpha) = \sum_{j=1}^{l} y_i y_j K(x_i, x_j) \alpha_j - 1$$

cost: $O(nl)$; $n$: # features, $l$: # data

Linear: use

$$w \equiv \sum_{j=1}^{l} y_j \alpha_j x_j \quad \text{and} \quad \nabla_i f(\alpha) = y_i w^T x_i - 1$$

cost: $O(n)$, because $w$ is maintained across iterations

Faster if # iterations is not $l$ times more.

For details, see

- C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. ICML 2008.
- R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871-1874.
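The gradient trick above can be sketched as a simplified dual coordinate descent loop in the spirit of the cited ICML 2008 paper. This is an illustration under stated assumptions, not LIBLINEAR's implementation: it omits shrinking and other refinements, runs a fixed number of epochs instead of using a stopping condition, and the function name, toy data, and parameter defaults are hypothetical.

```python
import random

def dual_cd_linear_svm(X, y, C=1.0, epochs=50, seed=0):
    """Simplified dual coordinate descent for the linear SVM dual (no bias term)."""
    rng = random.Random(seed)
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n                                  # kept equal to sum_j y_j alpha_j x_j
    q_diag = [sum(v * v for v in x) for x in X]    # Q_ii = x_i^T x_i
    order = list(range(l))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            if q_diag[i] == 0.0:
                continue                           # all-zero instance: nothing to update
            # i-th gradient component, O(n) because w is maintained
            g = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # One-variable minimization, clipped to the box [0, C]
            new_alpha = min(max(alpha[i] - g / q_diag[i], 0.0), C)
            d = new_alpha - alpha[i]
            if d != 0.0:
                alpha[i] = new_alpha
                for j in range(n):
                    w[j] += d * y[i] * X[i][j]     # O(n) update of w
    return w, alpha

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, alpha = dual_cd_linear_svm(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
print(predictions)
```

Each coordinate update costs $O(n)$ instead of the $O(nl)$ a kernel evaluation loop would need, which is the source of the speedup as long as the number of iterations does not grow by a factor of $l$.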

## Experiments

| Problem | $l$: # data | $n$: # features |
|---|---|---|
| news20 | 19,996 | 1,355,191 |
| yahoo-japan | 176,203 | 832,026 |
| rcv1 | 677,399 | 47,236 |

## Testing Accuracy versus Training Time

[Figure: testing accuracy versus training time; panels: news20 and yahoo-japan]

## Training Linear SVM Always Much Faster?

- No
- If #data ≫ #features, the algorithm used above may not be very good; need some other ways
- But document data are not of this type
- Large-scale SVM training is domain specific

## Training Nonlinear SVM via Linear SVM

Revisit nonlinear SVM:

$$\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(1 - y_i w^T \phi(x_i),\, 0\bigr)$$

- Dimension of $\phi(x)$: large
- If not very large, directly train SVM without kernel
- Calculate $\nabla_i f(\alpha)$ at each step
  - Kernel: $O(nl)$

## Degree-2 Polynomial Mapping

Degree-2 polynomial kernel:

$$K(x_i, x_j) = (1 + x_i^T x_j)^2$$

Instead we do

$$\phi(x) = [1, \sqrt{2} x_1, \ldots, \sqrt{2} x_n, x_1^2, \ldots, x_n^2, \sqrt{2} x_1 x_2, \ldots, \sqrt{2} x_{n-1} x_n]^T.$$

Now we can just consider

$$\phi(x) = [1, x_1, \ldots, x_n, x_1^2, \ldots, x_n^2, x_1 x_2, \ldots, x_{n-1} x_n]^T.$$

$O(n^2)$ dimensions can cause troubles; some
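The equivalence between the degree-2 kernel and the explicit mapping with $\sqrt{2}$ coefficients can be checked numerically. The sketch below is only a sanity check on toy vectors; the function names and the vectors `x` and `z` are illustrative assumptions.

```python
import math
from itertools import combinations

def phi_degree2(x):
    """Explicit feature map whose inner product equals (1 + x^T z)^2."""
    r2 = math.sqrt(2.0)
    return ([1.0]
            + [r2 * v for v in x]                          # sqrt(2) x_i
            + [v * v for v in x]                           # x_i^2
            + [r2 * a * b for a, b in combinations(x, 2)]) # sqrt(2) x_i x_j, i < j

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (1 + x^T z)^2."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(z)))
dim = len(phi_degree2(x))
print(explicit, poly2_kernel(x, z), dim)
```

For $n$ input features the mapped vector has $(n+1)(n+2)/2$ entries, which is the $O(n^2)$ growth noted above; with sparse data, far fewer of them are nonzero.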

## Accuracy Difference with Linear and RBF

| Data set | Degree-2 polynomial training time (LIBLINEAR) | Degree-2 polynomial training time (LIBSVM) | Accuracy diff. vs. linear | Accuracy diff. vs. RBF |
|---|---|---|---|---|
| a9a | 1.6 | 89.8 | 0.07 | 0.02 |
| real-sim | 59.8 | 1,220.5 | 0.49 | 0.10 |
| ijcnn1 | 10.7 | 64.2 | 5.63 | −0.85 |
| MNIST38 | 8.6 | 18.4 | 2.47 | −0.40 |
| covtype | 5,211.9 | $\ge 3 \times 10^5$ | 3.74 | −15.98 |
| webspam | 3,228.1 | $\ge 3 \times 10^5$ | 5.29 | −0.76 |

- Some problems: accuracy similar to RBF, but training is much faster
- Use a less nonlinear SVM to approximate a highly nonlinear one

## NLP Applications

- In NLP (Natural Language Processing), degree-2 or degree-3 polynomial kernels are very popular
- Competitive with RBF; better than linear
- No theory yet; but possible reasons: bigram/trigram useful
- This is different from other areas (e.g., image), which mainly use RBF
- Currently people complain that training is slow

## Dependency Parsing

| Word | John | hit | the | ball | with | a | bat | . |
|---|---|---|---|---|---|---|---|---|
| POS tag | NNP | VBD | DT | NN | IN | DT | NN | . |
| Dependency label | nsubj | ROOT | det | dobj | prep | det | pobj | p |

| | LIBSVM RBF | LIBSVM Poly | LIBLINEAR Linear | LIBLINEAR Poly |
|---|---|---|---|---|
| Training time | 3h34m53s | 3h21m51s | 3m36s | 3m43s |
| Parsing speed | 0.7x | 1x | 1652x | 103x |
| UAS | 89.92 | 91.67 | 89.11 | 91.71 |
| LAS | 88.55 | 90.60 | 88.07 | 90.71 |

## Dependency Parsing (Cont’d)

Details:

- Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Low-degree polynomial mapping of data for SVM, 2009.

## What If Data Cannot Fit in Memory?

- We can manage to train data on disk; details not shown here
- However, what if the data are too large to store in one machine?
- So far not many such cases with well labeled data; it’s expensive to label data
- We do see very large but low quality data; dealing with such data is different

## L1-regularized Classifiers

Replacing $\|w\|_2$ with $\|w\|_1$:

$$\min_{w} \quad \|w\|_1 + C \times (\text{losses})$$

- Sparsity: many elements of $w$ are zeros
- Feature selection
- LIBLINEAR supports L2 loss and logistic regression:

$$\max\bigl(0,\, 1 - y_i w^T x_i\bigr)^2 \quad \text{and} \quad \log\bigl(1 + e^{-y_i w^T x_i}\bigr)$$

- If using least-square loss and $y \in R^l$, related to L1-regularized problems in signal processing
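The two losses and the L1-regularized objective can be written out directly. The following is a minimal sketch for evaluating the objective only, not a solver and not LIBLINEAR's code; all function names and the toy inputs are hypothetical.

```python
import math

def l2_loss(margin):
    """Squared hinge loss max(0, 1 - m)^2, with m = y_i w^T x_i."""
    return max(0.0, 1.0 - margin) ** 2

def logistic_loss(margin):
    """log(1 + exp(-m)), evaluated in a form that avoids overflow for very negative m."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def l1_regularized_objective(w, X, y, C, loss):
    """||w||_1 + C * sum_i loss(y_i w^T x_i)."""
    reg = sum(abs(v) for v in w)
    total = sum(loss(yi * sum(wj * xj for wj, xj in zip(w, xi)))
                for xi, yi in zip(X, y))
    return reg + C * total

w = [1.0, -2.0, 0.0]            # toy weights; the zero entry is a dropped feature
X = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]]
y = [1, -1]
print(l1_regularized_objective(w, X, y, C=1.0, loss=l2_loss))
print(l1_regularized_objective(w, X, y, C=1.0, loss=logistic_loss))
```

A zero entry in `w` means the corresponding feature has no influence on the prediction, which is why sparsity acts as feature selection.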

## Conclusions

- Training large SVM is difficult
  - The (at least) quadratic time bottleneck
- Approximation is often needed, but some ways are non-intuitive
  - E.g., linear SVM is a good approximation to nonlinear SVM for some applications
- Difficult to have a general approach for all large scenarios
  - Special techniques are needed

## Conclusions (Cont’d)

- Software design for large and medium/small problems should be different
  - Medium/small problems: general and simple software
- Sources for my past work are available on my page. In particular,
  - LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
  - LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear
- I will be happy to talk to any machine learning users here