Support Vector Machines and Kernel Methods: Status and Challenges


Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at K. U. Leuven Optimization in Engineering Center, January 15, 2013

(2)

Outline

Basic concepts: SVM and kernels
Dual problem and SVM variants
Practical use of SVM
Multi-class classification
Large-scale training
Linear SVM
Discussion and conclusions


(4)

Support Vector Classification

Training vectors: x_i, i = 1, . . . , l

Feature vectors. For example, a patient = [height, weight, . . .]^T

Consider a simple case with two classes. Define an indicator vector y:

$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$

A hyperplane which separates all data

(5)

[Figure: the parallel hyperplanes w^Tx + b = +1, 0, −1]

A separating hyperplane: w^Tx + b = 0

(w^Tx_i) + b ≥ 1 if y_i = 1
(w^Tx_i) + b ≤ −1 if y_i = −1

Decision function f(x) = sgn(w^Tx + b), x: test data

Many possible choices of w and b

(6)

Maximal Margin

Distance between w^Tx + b = 1 and −1:

$$2/\|w\| = 2/\sqrt{w^T w}$$

A quadratic programming problem (Boser et al., 1992)

$$\min_{w,b} \quad \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, l.$$
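The margin formula and the constraints can be checked numerically. The sketch below uses a hypothetical two-dimensional toy set and two hand-picked pairs (w, b); none of these values come from the talk. Both pairs describe the same separating hyperplane and both are feasible, but the one with the smaller norm of w has the larger margin, which is why the problem minimizes w^Tw/2.

```python
import math

def margin(w):
    """Distance between the hyperplanes w^T x + b = 1 and w^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def is_feasible(w, b, X, y):
    """Check the constraints y_i (w^T x_i + b) >= 1 for every training point."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1.0
        for xi, yi in zip(X, y)
    )

# Hypothetical toy data: two points per class.
X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [1, 1, -1, -1]

# Two scalings of the same hyperplane x_1 + x_2 = 2.
w_a, b_a = (0.5, 0.5), -1.0
w_b, b_b = (1.0, 1.0), -2.0

for w, b in [(w_a, b_a), (w_b, b_b)]:
    print(w, b, "feasible:", is_feasible(w, b, X, y), "margin:", margin(w))
```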

(7)

Data May Not Be Linearly Separable

An example:

Allow training errors

Higher dimensional (maybe infinite) feature space φ(x) = [φ₁(x), φ₂(x), . . .]^T.

(8)

Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)

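$$\min_{w,b,\xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l.$$

Example: x ∈ R³, φ(x) ∈ R¹⁰

$$\phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3]^T$$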

(9)

Finding the Decision Function

w: maybe infinite variables

The dual problem: finite number of variables

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad y^T \alpha = 0,$$

where Q_ij = y_iy_jφ(x_i)^Tφ(x_j) and e = [1, . . . , 1]^T

At optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$

A finite problem: #variables = #training data

(10)

Kernel Tricks

Q_ij = y_iy_jφ(x_i)^Tφ(x_j) needs a closed form

Example: x_i ∈ R³, φ(x_i) ∈ R¹⁰

$$\phi(x_i) = [1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3]^T$$

Then φ(x_i)^Tφ(x_j) = (1 + x_i^Tx_j)².

Kernel: K(x, y) = φ(x)^Tφ(y); common kernels:

$$e^{-\gamma \|x_i - x_j\|^2} \quad \text{(Radial Basis Function)}$$

$$(x_i^T x_j / a + b)^d \quad \text{(Polynomial kernel)}$$
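The identity φ(x_i)^Tφ(x_j) = (1 + x_i^Tx_j)² is easy to confirm numerically. The following sketch builds the ten-dimensional mapping from the slide and compares the explicit inner product with the closed form; the two test vectors are arbitrary values chosen for illustration.

```python
import math

def phi(x):
    """Explicit degree-2 mapping from R^3 to R^10, as listed on the slide."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Closed form (1 + x^T z)^2, which needs only a 3-dimensional inner product."""
    return (1.0 + dot(x, z)) ** 2

# Arbitrary illustrative vectors.
x = (1.0, 2.0, 3.0)
z = (0.5, -1.0, 2.0)

explicit = dot(phi(x), phi(z))   # inner product in R^10
closed = poly_kernel(x, z)       # same value without forming phi
print(explicit, closed)
```

The closed form costs O(n) for x ∈ Rⁿ, while the explicit mapping grows quadratically in n, which is the point of the kernel trick.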

(11)

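Can be inner product in infinite dimensional space

Assume x ∈ R¹ and γ > 0.

$$e^{-\gamma \|x_i - x_j\|^2} = e^{-\gamma (x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2}$$

$$= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots \right)$$

$$= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}} x_i \cdot \sqrt{\frac{2\gamma}{1!}} x_j + \sqrt{\frac{(2\gamma)^2}{2!}} x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}} x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}} x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}} x_j^3 + \cdots \right)$$

$$= \phi(x_i)^T \phi(x_j),$$

where

$$\phi(x) = e^{-\gamma x^2} \left[ 1, \sqrt{\frac{2\gamma}{1!}} x, \sqrt{\frac{(2\gamma)^2}{2!}} x^2, \sqrt{\frac{(2\gamma)^3}{3!}} x^3, \cdots \right]^T.$$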
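A truncated version of this infinite-dimensional map already reproduces the RBF kernel value closely. The sketch below keeps the first 20 coordinates of φ(x) for scalar inputs; the truncation length and the values of γ, x_i, x_j are assumptions made only for this illustration.

```python
import math

def phi_rbf(x, gamma, terms):
    """First `terms` coordinates of the infinite-dimensional map for scalar x."""
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

def rbf(xi, xj, gamma):
    """Closed-form RBF kernel for scalar inputs."""
    return math.exp(-gamma * (xi - xj) ** 2)

# Illustrative values, not taken from the talk.
gamma, xi, xj = 0.5, 0.8, -0.3

approx = sum(a * b for a, b in zip(phi_rbf(xi, gamma, 20), phi_rbf(xj, gamma, 20)))
exact = rbf(xi, xj, gamma)
print(approx, exact)
```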

(12)

Issues

So what kind of kernel should I use?

What kind of functions are valid kernels?

How to decide kernel parameters?

Some of these issues will be discussed later

(13)

Decision function

At optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$

Decision function

$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$

Only φ(x_i) of α_i > 0 used ⇒ support vectors
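As a minimal sketch of how prediction uses only the support vectors, the code below evaluates the kernel form of the decision function. The "model" (support vectors, α, labels, b, and γ) is a hypothetical hand-made example, not the output of any solver.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, sv, alpha, y, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, summed over the stored support vectors only."""
    return sum(a * yi * kernel(xi, x) for xi, a, yi in zip(sv, alpha, y)) + b

def predict(x, sv, alpha, y, b, kernel):
    """sgn of the decision value; a value of exactly 0 is mapped to +1 here."""
    return 1 if decision_value(x, sv, alpha, y, b, kernel) >= 0 else -1

# Hypothetical model: one support vector per class, with y^T alpha = 0.
sv = [(1.0, 1.0), (-1.0, -1.0)]
alpha = [0.7, 0.7]
y = [1, -1]
b = 0.0

print(predict((0.9, 1.2), sv, alpha, y, b, rbf_kernel))
print(predict((-2.0, -0.5), sv, alpha, y, b, rbf_kernel))
```

Training points with α_i = 0 never appear in the sum, so they do not need to be stored in the model.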

(14)

Support Vectors: More Important Data

Only φ(x_i) of α_i > 0 used ⇒ support vectors



(16)

Deriving the Dual

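For simplification, consider the problem without ξ_i

$$\min_{w,b} \quad \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1, \quad i = 1, \ldots, l.$$

Its dual is

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i, \quad i = 1, \ldots, l, \qquad y^T \alpha = 0.$$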

(17)

Lagrangian Dual

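$$\max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha),$$

where

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(w^T \phi(x_i) + b) - 1 \right]$$

Strong duality (be careful about this)

$$\min \text{Primal} = \max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)$$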

(18)

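Simplify the dual. When α is fixed,

$$\min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0, \\ \min_{w} \frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left[ y_i w^T \phi(x_i) - 1 \right] & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0. \end{cases}$$

If $\sum_{i=1}^{l} \alpha_i y_i \ne 0$, we can decrease

$$-b \sum_{i=1}^{l} \alpha_i y_i$$

in L(w, b, α) to −∞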

(19)

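If $\sum_{i=1}^{l} \alpha_i y_i = 0$, the optimum of the strictly convex function

$$\frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i \left[ y_i w^T \phi(x_i) - 1 \right]$$

happens when

$$\nabla_w L(w, b, \alpha) = 0.$$

Thus,

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).$$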

(20)

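Note that

$$w^T w = \left( \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \right)^T \left( \sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \right) = \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$$

The dual is

$$\max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0, \\ -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0. \end{cases}$$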

(21)

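Lagrangian dual: $\max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)$

−∞ is definitely not the maximum of the dual, so a dual optimal solution cannot occur when

$$\sum_{i=1}^{l} \alpha_i y_i \ne 0.$$

The dual is simplified to

$$\max_{\alpha \in R^l} \quad \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j)$$

$$\text{subject to} \quad y^T \alpha = 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, l.$$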
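The derivation can be traced on a tiny instance. The sketch below assumes a hypothetical one-dimensional problem with φ(x) = x and two points, x₁ = 1 (y₁ = +1) and x₂ = −1 (y₂ = −1). The constraint y^Tα = 0 forces α₁ = α₂, so a one-dimensional grid search (chosen here only for simplicity; it is not how SVM solvers work) finds the dual optimum, and w recovered from α gives a primal value equal to the dual value.

```python
def dual_objective(alpha, Q):
    """sum_i alpha_i - 1/2 alpha^T Q alpha: the simplified dual, to be maximized."""
    l = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad

# Hypothetical toy problem with phi(x) = x.
X = [1.0, -1.0]
y = [1, -1]
Q = [[y[i] * y[j] * X[i] * X[j] for j in range(2)] for i in range(2)]

# y^T alpha = 0 means alpha_1 = alpha_2 = a, so search over a >= 0 only.
grid = [k / 1000.0 for k in range(0, 2001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], Q))
alpha = [best_a, best_a]

# Recover w = sum_i alpha_i y_i phi(x_i) and compare primal and dual values.
w = sum(alpha[i] * y[i] * X[i] for i in range(2))
primal_value = 0.5 * w * w
dual_value = dual_objective(alpha, Q)
print(best_a, w, primal_value, dual_value)
```

With b = 0 the recovered w satisfies y_i(w x_i + b) = 1 for both points, so it is primal feasible and the two objective values coincide, as strong duality states.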

(22)

Primal versus Dual

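Recall the dual problem is

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad y^T \alpha = 0$$

and at optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \qquad (1)$$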

(25)

Primal versus Dual (Cont’d)

What if we put (1) into the primal?

$$\min_{\alpha,\xi} \quad \frac{1}{2} \alpha^T Q \alpha + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad (Q\alpha + by)_i \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (2)$$

If Q is positive definite, we can prove that the optimal α of (2) is the same as that of the dual

So the dual is not the only choice to solve when we use kernels

(26)

Other Variants

A general form for binary classification

$$\min_{w} \quad r(w) + C \sum_{i=1}^{l} \xi(w; x_i, y_i)$$

r(w): regularization term
ξ(w; x, y): loss function: we hope y w^Tx > 0
C: regularization parameter

(27)

Loss Functions

Some commonly used loss functions:

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x), \qquad (3)$$

$$\xi_{L2}(w; x, y) \equiv \max(0, 1 - y w^T x)^2, \text{ and} \qquad (4)$$

$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}). \qquad (5)$$

We omit the bias term b here

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (3)-(4)

Logistic regression (LR): (5)
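The three losses (3)-(5) are short enough to write out directly. This is a minimal sketch; the weight vector and data point are arbitrary illustrative values, and the logistic loss is written in a form that avoids overflow for large |y w^Tx|, which is an implementation choice rather than something stated in the talk.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l1_loss(w, x, y):
    """Equation (3): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * dot(w, x))

def l2_loss(w, x, y):
    """Equation (4): max(0, 1 - y w^T x)^2."""
    return max(0.0, 1.0 - y * dot(w, x)) ** 2

def lr_loss(w, x, y):
    """Equation (5): log(1 + exp(-y w^T x)), computed without overflow."""
    m = y * dot(w, x)
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

# Arbitrary illustrative values: y w^T x = 0.25.
w = (1.0, -1.0)
x = (0.5, 0.25)
y = 1

print(l1_loss(w, x, y), l2_loss(w, x, y), lr_loss(w, x, y))
```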

(28)

Loss Functions (Cont’d)

[Figure: the losses ξ_L1, ξ_L2, and ξ_LR, plotted as ξ(w; x, y) against −y w^Tx]

Indeed SVM and logistic regression are very similar

(29)

Loss Functions (Cont’d)

If we use the square loss function

$$\xi(w; x, y) \equiv (1 - y w^T x)^2$$

it becomes least-square SVM (Suykens and Vandewalle, 1999) or Gaussian process

(30)

Regularization

L1 versus L2: ‖w‖₁ and w^Tw/2

w^Tw/2: smooth, easier to optimize
‖w‖₁: non-differentiable; sparse solution; possibly many zero elements

Possible advantages of L1 regularization:

Feature selection
Less storage for w

(31)

Training SVM

The main issue is to solve the dual problem

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad y^T \alpha = 0$$

This will be discussed in Thursday’s lecture, which talks about the connection between optimization and machine learning


(33)

Let’s Try a Practical Example

A problem from astroparticle physics

1 2.61e+01 5.88e+01 -1.89e-01 1.25e+02
1 5.70e+01 2.21e+02 8.60e-02 1.22e+02
1 1.72e+01 1.73e+02 -1.29e-01 1.25e+02
0 2.39e+01 3.89e+01 4.70e-01 1.25e+02
0 2.23e+01 2.26e+01 2.11e-01 1.01e+02
0 1.64e+01 3.92e+01 -9.91e-02 3.24e+01

Training and testing sets available: 3,089 and 4,000

Data available at LIBSVM Data Sets

(34)

Training and Testing

Training the set svmguide1 to obtain svmguide1.model

$./svm-train svmguide1

Testing the set svmguide1.t

$./svm-predict svmguide1.t svmguide1.model out
Accuracy = 66.925% (2677/4000)

We see that training and testing accuracy are very different. Training accuracy is almost 100%

$./svm-predict svmguide1 svmguide1.model out
Accuracy = 99.7734% (3082/3089)

(35)

Why this Fails

Gaussian kernel is used here

We see that most kernel elements have

$$K_{ij} = e^{-\|x_i - x_j\|^2 / 4} \begin{cases} = 1 & \text{if } i = j, \\ \to 0 & \text{if } i \ne j, \end{cases}$$

because some features are in large numeric ranges

For what kind of data is K ≈ I?
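This effect can be reproduced on the sample rows shown earlier. The sketch below computes the kernel matrix with γ = 1/4 on the first four unscaled rows of the astroparticle data listed above; using only four rows is a simplification for illustration.

```python
import math

# First four feature rows from the astroparticle sample shown earlier (labels dropped).
rows = [
    [2.61e+01, 5.88e+01, -1.89e-01, 1.25e+02],
    [5.70e+01, 2.21e+02, 8.60e-02, 1.22e+02],
    [1.72e+01, 1.73e+02, -1.29e-01, 1.25e+02],
    [2.39e+01, 3.89e+01, 4.70e-01, 1.25e+02],
]

def rbf(x, z, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# K_ij = exp(-||x_i - x_j||^2 / 4) on the unscaled features.
K = [[rbf(a, b, 0.25) for b in rows] for a in rows]

n = len(rows)
max_off_diagonal = max(K[i][j] for i in range(n) for j in range(n) if i != j)
print(max_off_diagonal)  # vanishingly small: K is numerically the identity
```

Because the second feature differs by tens or hundreds between rows, every squared distance is large and every off-diagonal entry is essentially zero.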

(36)

Why this Fails (Cont’d)

If we have training data

φ(x₁) = [1, 0, . . . , 0]^T
...
φ(x_l) = [0, . . . , 0, 1]^T

then K = I

Clearly such training data can be correctly separated, but how about testing data?

So overfitting occurs

(37)

Overfitting

See the illustration in the next slide

In theory, you can easily achieve 100% training accuracy

This is useless

When training and predicting on data, we should

Avoid underfitting: small training error
Avoid overfitting: small testing error

(38)

[Figure: overfitting illustration. ● and ▲: training; ○ and △: testing]

(39)

Data Scaling

Without scaling, the above overfitting situation may occur

Also, features in greater numeric ranges may dominate

A simple solution is to linearly scale each feature to [0, 1] by:

$$\frac{\text{feature value} - \min}{\max - \min}$$

There are many other scaling methods

Scaling generally helps, but not always

(40)

Data Scaling: Same Factors

A common mistake

$./svm-scale -l -1 -u 1 svmguide1 > svmguide1.scale
$./svm-scale -l -1 -u 1 svmguide1.t > svmguide1.t.scale

-l -1 -u 1: scaling to [−1, 1]

We need to use same factors on training and testing

$./svm-scale -s range1 svmguide1 > svmguide1.scale
$./svm-scale -r range1 svmguide1.t > svmguide1.t.scale

Later we will give a real example
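The idea of saving the factors from the training set and restoring them for the test set can be sketched in a few lines. This is an illustration of the principle, not the implementation of svm-scale; the function names, the [0, 1] target range, the handling of constant features, and the toy numbers are all assumptions.

```python
def fit_ranges(rows):
    """Per-feature (min, max), computed from the training data only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_scaling(rows, ranges, lower=0.0, upper=1.0):
    """Linearly map each feature using the stored training ranges."""
    scaled = []
    for row in rows:
        out = []
        for v, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(lower)  # constant feature: a convention chosen for this sketch
            else:
                out.append(lower + (upper - lower) * (v - lo) / (hi - lo))
        scaled.append(out)
    return scaled

# Hypothetical toy data.
train = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
test = [[25.0, 500.0]]

ranges = fit_ranges(train)                 # plays the role of "-s range1"
train_scaled = apply_scaling(train, ranges)
test_scaled = apply_scaling(test, ranges)  # plays the role of "-r range1"
print(train_scaled, test_scaled)
```

Test values may fall outside [0, 1] after scaling with the training factors, and that is expected: the mapping must be identical for both sets.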

(41)

After Data Scaling

Train scaled data and then predict

$./svm-train svmguide1.scale
$./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.15%

Training accuracy is now similar

$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 96.439%

For this experiment, we use parameters C = 1, γ = 0.25, but sometimes performances are sensitive to parameters

(42)

Parameters versus Performances

If we use C = 20, γ = 400

$./svm-train -c 20 -g 400 svmguide1.scale
$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 100% (3089/3089)

100% training accuracy but

$./svm-predict svmguide1.t.scale svmguide1.scale.model o
Accuracy = 82.7% (3308/4000)

Very bad test accuracy

Overfitting happens

(43)

Parameter Selection

For SVM, we may need to select suitable parameters

They are C and kernel parameters

Example:

γ of $e^{-\gamma \|x_i - x_j\|^2}$
a, b, d of $(x_i^T x_j / a + b)^d$

How to select them so performance is better?

(44)

Performance Evaluation

Available data ⇒ training and validation

Train on the training part; test on the validation part to estimate the performance

A common way is k-fold cross validation (CV):

Data randomly separated into k groups
Each time k − 1 groups as training and one as testing

Select parameters/kernels with best CV result

There are many other methods to evaluate the performance
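The k-fold procedure can be sketched as follows. The functions `train_fn` and `predict_fn` are placeholders for any classifier; the majority-class "classifier" and the toy labels used in the demonstration are hypothetical and stand in for a real SVM only to keep the example self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly separate indices 0..n-1 into k groups of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_accuracy(X, y, k, train_fn, predict_fn, seed=0):
    """Each time train on k-1 groups and test on the remaining one."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_fn(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Placeholder classifier: always predict the majority label of the training part.
def train_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

# Hypothetical data: 8 positive and 2 negative instances.
X = [[float(i)] for i in range(10)]
y = [1] * 8 + [-1] * 2

folds = k_fold_indices(len(X), 5)
acc = cross_validation_accuracy(X, y, 5, train_majority, predict_majority)
print(acc)
```

Every instance is used for validation exactly once, so the returned rate is an estimate over the whole available set.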

(45)

Contour of CV Accuracy

(46)

The good region of parameters is quite large

SVM is sensitive to parameters, but not that sensitive

Sometimes default parameters work

but it’s good to select them if time allows

(47)

Example of Parameter Selection

Direct training and test

$./svm-train svmguide3

$./svm-predict svmguide3.t svmguide3.model o

→ Accuracy = 2.43902%

After data scaling, accuracy is still low

$./svm-scale -s range3 svmguide3 > svmguide3.scale

$./svm-scale -r range3 svmguide3.t > svmguide3.t.scale

$./svm-train svmguide3.scale

$./svm-predict svmguide3.t.scale svmguide3.scale.model o

→ Accuracy = 12.1951%

(48)

Example of Parameter Selection (Cont’d)

Select parameters by trying a grid of (C , γ) values

$ python grid.py svmguide3.scale

· · ·

128.0 0.125 84.8753

(Best C =128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)

Train and predict using the obtained parameters

$ ./svm-train -c 128 -g 0.125 svmguide3.scale

$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict

→ Accuracy = 87.8049%
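The grid search itself is a simple loop over (C, γ) pairs. The sketch below is not grid.py; it assumes exponentially spaced candidates (the exponent ranges are an assumption modeled on the exponentially growing sequences suggested in the guide of Hsu et al., 2003), and the scoring function `fake_cv` is a made-up stand-in for a real cross-validation run, shaped only so that its peak sits at the pair reported above.

```python
import math

def grid_search(cv_accuracy, log2c=range(-5, 16, 2), log2g=range(-15, 4, 2)):
    """Try every (C, gamma) = (2^a, 2^b) on the grid; return the best triple."""
    best = None
    for lc in log2c:
        for lg in log2g:
            C, gamma = 2.0 ** lc, 2.0 ** lg
            acc = cv_accuracy(C, gamma)
            if best is None or acc > best[2]:
                best = (C, gamma, acc)
    return best

# Stand-in for cross-validation accuracy; NOT real results.
def fake_cv(C, gamma):
    return 100.0 - (math.log2(C) - 7.0) ** 2 - (math.log2(gamma) + 3.0) ** 2

best_C, best_gamma, best_acc = grid_search(fake_cv)
print(best_C, best_gamma, best_acc)
```

In practice `cv_accuracy` would run k-fold cross validation with the given C and γ, so the cost is the grid size times the cost of one CV run; since the runs are independent, they can be executed in parallel.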

(49)

Selecting Kernels

RBF, polynomial, or others?

For beginners, use RBF first

Linear kernel: special case of RBF

Accuracy of linear the same as RBF under certain parameters (Keerthi and Lin, 2003)

Polynomial kernel:

(x^T_i x_j/a + b)^d

Numerical difficulties: (< 1)^d → 0, (> 1)^d → ∞

More parameters than RBF

(50)

Selecting Kernels (Cont’d)

Commonly used kernels are Gaussian (RBF), polynomial, and linear

But in different areas, special kernels have been developed. Examples

1. χ² kernel is popular in computer vision
2. String kernel is useful in some domains

(51)

A Simple Procedure for Beginners

After helping many users, we came up with the following procedure

1. Conduct simple scaling on the data
2. Consider RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$
3. Use cross-validation to find the best parameters C and γ
4. Use the best C and γ to train the whole training set
5. Test

In LIBSVM, we have a Python script easy.py implementing this procedure.

(52)

A Simple Procedure for Beginners (Cont’d)

We proposed this procedure in an “SVM guide” (Hsu et al., 2003) and implemented it in LIBSVM

From research viewpoints, this procedure is not novel. We never thought about submitting our guide somewhere

But this procedure has been tremendously useful.

It is now almost the standard thing to do for SVM beginners

(53)

A Real Example of Wrong Scaling

Separately scale each feature of training and testing data to [0, 1]

$ ../svm-scale -l 0 svmguide4 > svmguide4.scale
$ ../svm-scale -l 0 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 69.2308% (216/312) (classification)

The accuracy is low even after parameter selection

$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 89.4231% (279/312) (classification)

(54)

A Real Example of Wrong Scaling (Cont’d)

With the correct setting, the 10 features in the test data svmguide4.t.scale have the following maximal values:

0.7402, 0.4421, 0.6291, 0.8583, 0.5385, 0.7407, 0.3982, 1.0000, 0.8218, 0.9874

Scaling the test set to [0, 1] generated an erroneous set.


(56)

Multi-class Classification

k classes

One-against-the-rest: train k binary SVMs:

1st class vs. (2, . . . , k)th class
2nd class vs. (1, 3, . . . , k)th class
...

k decision functions

(w¹)^Tφ(x) + b₁
...
(w^k)^Tφ(x) + b_k

(57)

Prediction:

$$\arg\max_{j} \; (w^j)^T \phi(x) + b_j$$

Reason: If x ∈ 1st class, then we should have

(w¹)^Tφ(x) + b₁ ≥ +1
(w²)^Tφ(x) + b₂ ≤ −1
...
(w^k)^Tφ(x) + b_k ≤ −1
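The prediction rule is an arg max over the k decision values. The sketch below assumes the linear case φ(x) = x for brevity, and the three (w, b) pairs are hypothetical hand-made models rather than trained ones.

```python
def decision_values(x, models):
    """Evaluate (w^j)^T x + b_j for every one-against-the-rest model."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in models]

def predict_one_vs_rest(x, models):
    """Return the 1-based index j of the largest decision value."""
    values = decision_values(x, models)
    return 1 + max(range(len(values)), key=lambda j: values[j])

# Hypothetical models for k = 3 classes in two dimensions.
models = [
    ((1.0, 0.0), -1.0),   # class 1 vs. rest
    ((0.0, 1.0), -1.0),   # class 2 vs. rest
    ((-1.0, -1.0), -1.0), # class 3 vs. rest
]

x = (0.5, 3.0)
print(decision_values(x, models), predict_one_vs_rest(x, models))
```

Taking the arg max also resolves cases where no decision value, or more than one, is positive.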

(58)

Multi-class Classification (Cont’d)

One-against-one: train k(k − 1)/2 binary SVMs

(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)

If 4 classes ⇒ 6 binary SVMs

y_i = 1    y_i = −1   Decision functions
class 1    class 2    f¹²(x) = (w¹²)^Tx + b¹²
class 1    class 3    f¹³(x) = (w¹³)^Tx + b¹³
class 1    class 4    f¹⁴(x) = (w¹⁴)^Tx + b¹⁴
class 2    class 3    f²³(x) = (w²³)^Tx + b²³
class 2    class 4    f²⁴(x) = (w²⁴)^Tx + b²⁴
class 3    class 4    f³⁴(x) = (w³⁴)^Tx + b³⁴

(59)

For a testing instance, predict with all binary SVMs

Classes    winner
1  2       1
1  3       1
1  4       1
2  3       2
2  4       4
3  4       3

Select the class with the largest vote

class      1  2  3  4
# votes    3  1  1  1

May use decision values as well
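The voting step can be written directly from the table. In this sketch the pairwise winners are copied from the example above; breaking ties in favor of the smallest class index is an assumption of the sketch.

```python
from collections import Counter

def vote(pairwise_winners, k):
    """Count wins per class and return (votes, predicted class), classes 1..k."""
    counts = Counter(pairwise_winners.values())
    votes = [counts.get(c, 0) for c in range(1, k + 1)]
    # max() returns the first maximum, so ties go to the smallest class index.
    predicted = 1 + max(range(k), key=lambda j: votes[j])
    return votes, predicted

# Winners of the six binary SVMs from the table.
winners = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 4, (3, 4): 3}

votes, predicted = vote(winners, 4)
print(votes, predicted)
```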

(60)

More Complicated Forms

Solving a single optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004)

There are many other methods

A comparison in Hsu and Lin (2002)

RBF kernel: accuracy similar for different methods

But 1-against-1 is the fastest for training


(62)

SVM doesn’t Scale Up

Yes, if using kernels

Training millions of data is time consuming

Cases with many support vectors: quadratic time bottleneck on Q_{SV, SV}

For noisy data: # SVs increases linearly in data size (Steinwart, 2003)

Some solutions: parallelization, approximation

(63)

Parallelization

Multi-core/Shared Memory/GPU

• One line change of LIBSVM

Multicore      Shared-memory
1    80        1    100
2    48        2     57
4    32        4     36
8    27        8     28

50,000 data (kernel evaluations: 80% time)

• GPU (Catanzaro et al., 2008); Cell (Marzolla, 2010)

Distributed Environments

• Chang et al. (2007); Zanni et al. (2006); Zhu et al. (2009).

(64)

Approximately Training SVM

Can be done in many aspects

Data level: sub-sampling

Optimization level: approximately solve the quadratic program

Other non-intuitive but effective ways; I will show one today

Many papers have addressed this issue

(65)

Approximately Training SVM (Cont’d)

Subsampling: simple and often effective

More advanced techniques

Incremental training (e.g., Syed et al., 1999): Data ⇒ 10 parts; train 1st part ⇒ SVs, train SVs + 2nd part, . . .

Select and train good points: KNN or heuristics. For example, Bakır et al. (2005)

(66)

Approximately Training SVM (Cont’d)

Approximate the kernel; e.g., Fine and Scheinberg (2001); Williams and Seeger (2001)

Use part of the kernel; e.g., Lee and Mangasarian (2001); Keerthi et al. (2006)

Early stopping of optimization algorithms; e.g., Tsang et al. (2005) and others

And many more

Some are simple, others sophisticated

(67)

Approximately Training SVM (Cont’d)

Sophisticated techniques may not always be useful

Sometimes slower than sub-sampling

covtype: 500k training and 80k testing
rcv1: 550k training and 14k testing

covtype                      rcv1
Training size   Accuracy     Training size   Accuracy
50k             92.5%        50k             97.2%
100k            95.3%        100k            97.4%
500k            98.2%        550k            97.8%


(69)

Discussion: Large-scale Training

We don’t have many large and well labeled sets

Expensive to obtain true labels

Specific properties of data should be considered

We will illustrate this point using linear SVM

The design of software for very large data sets should differ by application


(71)

Linear and Kernel Classification

Methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

x ⇒ φ(x)

φ(x_i)^Tφ(x_j) easily calculated; little control on φ(·)

Linear classification + feature engineering:

We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

We refer to them as kernel and linear classifiers

(72)

Linear and Kernel Classification

Let’s check the prediction cost

$$w^T x + b \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$

If K(x_i, x_j) takes O(n), then

O(n) versus O(nl)

Linear is much cheaper

(73)

Linear and Kernel Classification (Cont’d)

Also, linear is a special case of kernel

Indeed, we can prove that accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)

Therefore, roughly we have

accuracy: kernel ≥ linear
cost: kernel ≫ linear

Speed is the reason to use linear

(74)

Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

This particularly happens for document classification

Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)

Recently linear classification is a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008)

(75)

Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

               Linear              RBF Kernel
Data set       Time   Accuracy     Time       Accuracy
MNIST38        0.1    96.82        38.1       99.70
ijcnn1         1.6    91.81        26.8       98.69
covtype        1.4    76.37        46,695.8   96.11
news20         1.1    96.95        383.2      96.90
real-sim       0.3    97.44        938.3      97.82
yahoo-japan    3.1    92.63        20,955.2   93.31
webspam        25.7   93.35        15,681.8   99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features


(78)

Extension: Training Explicit Form of Nonlinear Mappings

Linear-SVM method to train φ(x₁), . . . , φ(x_l)

Kernel not used

Applicable only if dimension of φ(x) not too large

Low-degree Polynomial Mappings

$$K(x_i, x_j) = (x_i^T x_j + 1)^2 = \phi(x_i)^T \phi(x_j)$$

$$\phi(x) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_{n-1}x_n]^T$$

When degree is small, train the explicit form of φ(x)

(79)

Testing Accuracy and Training Time

             Degree-2 Polynomial                  Accuracy diff.
             Training time (s)
Data set     LIBLINEAR   LIBSVM     Accuracy      Linear   RBF
a9a          1.6         89.8       85.06         0.07     0.02
real-sim     59.8        1,220.5    98.00         0.49     0.10
ijcnn1       10.7        64.2       97.84         5.63     −0.85
MNIST38      8.6         18.4       99.29         2.47     −0.40
covtype      5,211.9     NA         80.09         3.74     −15.98
webspam      3,228.1     NA         98.44         5.29     −0.76

Training φ(x_i) by linear: faster than kernel, and sometimes with competitive accuracy

(80)

Discussion: Directly Train φ(x_i), ∀i

See details in our work (Chang et al., 2010)

A related development: Sonnenburg and Franc (2010)

Useful for certain applications


(82)

Extensions of SVM

Multiple Kernel Learning (MKL)
Learning to rank
Semi-supervised learning
Active learning
Cost sensitive learning
Structured Learning

(83)

Conclusions

SVM and kernel methods are rather mature areas

But still quite a few interesting research issues

Many are extensions of standard classification

It is possible to identify more extensions through real applications

(84)

References I

G. H. Bakır, L. Bottou, and J. Weston. Breaking SVM complexity with cross-training. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 81–88. MIT Press, Cambridge, MA, 2005.

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Parallelizing support vector machines on distributed computers. In NIPS 21, 2007.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471–1490, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2–3):201–233, 2002.

(85)

References II

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

(86)

References III

Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.

C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.

M. Marzolla. Optimized training of support vector machines on the cell processor. Technical Report UBLCS-2010-02, Department of Computer Science, University of Bologna, Italy, Feb. 2010. URL http://www.cs.unibo.it/pub/TR/UBLCS/ABSTRACTS/2010.bib?ncstrl.cabernet//BOLOGNA-UBLCS-2010-02.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Sonnenburg and V. Franc. COFFIN : A computational framework for linear SVMs. In Proceedings of the Twenty Seventh International Conference on Machine Learning (ICML), pages 999–1006, 2010.

I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.

J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

(87)

References IV

N. A. Syed, H. Liu, and K. K. Sung. Incremental learning with support vector machines. In Workshop on Support Vector Machines, IJCAI99, 1999.

I. Tsang, J. Kwok, and P. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.

J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN99, pages 219–224, Brussels, 1999. D. Facto Press.

C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006.

Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the IEEE International Conference on Data Mining, 2009.
