### Support Vector Machines and Kernel Methods: Status and Challenges

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at K. U. Leuven Optimization in Engineering Center, January 15, 2013

### Outline

- Basic concepts: SVM and kernels
- Dual problem and SVM variants
- Practical use of SVM
- Multi-class classification
- Large-scale training
- Linear SVM
- Discussion and conclusions

### Support Vector Classification

Training vectors : x_{i}, i = 1, . . . , l
Feature vectors. For example,
A patient = [height, weight, . . .]^{T}

Consider a simple case with two classes:

Define an indicator vector y

$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1}, \\ -1 & \text{if } x_i \text{ in class 2}. \end{cases}$$

A hyperplane which separates all data

[Figure: a separating hyperplane with the level sets w^{T}x + b = +1, 0, −1]

A separating hyperplane: w^{T}x + b = 0
(w^{T}x_{i}) + b ≥ 1 if y_{i} = 1
(w^{T}x_{i}) + b ≤ −1 if y_{i} = −1

Decision function f (x) = sgn(w^{T}x + b), x: test data
Many possible choices of w and b

### Maximal Margin

Distance between w^{T}x + b = 1 and w^{T}x + b = −1:

$$\frac{2}{\|w\|} = \frac{2}{\sqrt{w^{T}w}}$$

A quadratic programming problem (Boser et al., 1992)

$$
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2} w^{T}w \\
\text{subject to} \quad & y_i(w^{T}x_i + b) \ge 1, \quad i = 1, \ldots, l.
\end{aligned}
$$
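
As a small numeric illustration of the margin formula and the constraints above, the following sketch computes 2/‖w‖ for a hand-picked hyperplane and checks y_{i}(w^{T}x_{i} + b) ≥ 1 on a toy data set. The data, `w`, and `b` are made-up values chosen for illustration; nothing here is computed by a QP solver.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w^T x + b = 1 and w^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def is_feasible(w, b, X, y):
    """Check y_i (w^T x_i + b) >= 1 for every training point."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1.0
        for xi, yi in zip(X, y)
    )

# Hypothetical toy data: two points per class in R^2.
X = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
y = [1, 1, -1, -1]

w, b = (0.5, 0.5), -1.0   # hand-picked, not produced by a solver
width = margin_width(w)
feasible = is_feasible(w, b, X, y)
print(f"feasible = {feasible}, margin width = {width:.4f}")
```

Doubling `w` and `b` keeps the same separating hyperplane and stays feasible, but halves the margin width, which is why the problem minimizes w^{T}w.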

### Data May Not Be Linearly Separable

An example:

Allow training errors

Higher dimensional ( maybe infinite ) feature space
φ(x) = [φ_{1}(x), φ_{2}(x), . . .]^{T}.

Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)

$$
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2} w^{T}w + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i(w^{T}\phi(x_i) + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \ldots, l.
\end{aligned}
$$

Example: x ∈ R^{3}, φ(x) ∈ R^{10}

$$\phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3]^{T}$$

### Finding the Decision Function

w: maybe infinite variables

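The dual problem: finite number of variables

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^{T}Q\alpha - e^{T}\alpha \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& y^{T}\alpha = 0,
\end{aligned}
$$

where Q_{ij} = y_{i}y_{j}φ(x_{i})^{T}φ(x_{j}) and e = [1, . . . , 1]^{T}

At optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$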

A finite problem: #variables = #training data

### Kernel Tricks

Q_{ij} = y_{i}y_{j}φ(x_{i})^{T}φ(x_{j}) needs a closed form
Example: x_{i} ∈ R^{3}, φ(x_{i}) ∈ R^{10}

$$\phi(x_i) = [1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3]^{T}$$

Then φ(x_{i})^{T}φ(x_{j}) = (1 + x^{T}_{i} x_{j})^{2}.

Kernel: K (x, y) = φ(x)^{T}φ(y); common kernels:

e^{−γ‖x_{i}−x_{j}‖^{2}} (Radial Basis Function)
(x^{T}_{i} x_{j}/a + b)^{d} (Polynomial kernel)

Can be inner product in infinite dimensional space
Assume x ∈ R^{1} and γ > 0.

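$$
\begin{aligned}
e^{-\gamma\|x_i - x_j\|^2} &= e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots\right) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i \cdot \sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) \\
&= \phi(x_i)^{T}\phi(x_j),
\end{aligned}
$$

where

$$\phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}x, \sqrt{\frac{(2\gamma)^2}{2!}}x^2, \sqrt{\frac{(2\gamma)^3}{3!}}x^3, \cdots\right]^{T}.$$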
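
The series above can be checked numerically. The sketch below truncates φ(x) to its first few coordinates and compares the resulting inner product with the exact one-dimensional RBF kernel value. The inputs `xi`, `xj`, and `gamma` are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def rbf_kernel_1d(xi, xj, gamma):
    """Exact kernel value exp(-gamma (xi - xj)^2) for scalar inputs."""
    return math.exp(-gamma * (xi - xj) ** 2)

def truncated_phi(x, gamma, terms):
    """First `terms` coordinates of phi(x) for the 1-D RBF kernel."""
    scale = math.exp(-gamma * x * x)
    return [
        scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
        for k in range(terms)
    ]

def approx_kernel(xi, xj, gamma, terms):
    """Inner product of the truncated feature vectors."""
    pi = truncated_phi(xi, gamma, terms)
    pj = truncated_phi(xj, gamma, terms)
    return sum(a * b for a, b in zip(pi, pj))

xi, xj, gamma = 0.8, -0.3, 0.5   # arbitrary illustrative values
exact = rbf_kernel_1d(xi, xj, gamma)
approx_3 = approx_kernel(xi, xj, gamma, 3)
approx_20 = approx_kernel(xi, xj, gamma, 20)
print(f"exact = {exact:.6f}, 3 terms = {approx_3:.6f}, 20 terms = {approx_20:.6f}")
```

With more coordinates the truncated inner product approaches the kernel value, which is consistent with φ(x) being infinite dimensional.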

### Issues

So what kind of kernel should I use?

What kind of functions are valid kernels?

How to decide kernel parameters?

Some of these issues will be discussed later

### Decision function

At optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$

Decision function

$$
\begin{aligned}
w^{T}\phi(x) + b &= \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^{T}\phi(x) + b \\
&= \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b
\end{aligned}
$$

Only φ(x_{i}) of α_{i} > 0 used ⇒ support vectors
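
A minimal sketch of evaluating this decision function with a kernel follows. The training points, labels, `alpha`, and `b` are hypothetical values picked by hand (they satisfy y^{T}α = 0 but were not obtained by solving the dual), and the RBF kernel with `gamma=0.5` is only one possible choice.

```python
import math

def rbf(u, v, gamma=0.5):
    """RBF kernel exp(-gamma ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision_value(x, X, y, alpha, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, skipping points with alpha_i = 0."""
    return sum(
        a * yi * kernel(xi, x)
        for xi, yi, a in zip(X, y, alpha)
        if a > 0
    ) + b

# Hypothetical values, chosen by hand rather than by solving the dual.
X = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
y = [-1, 1, 1]
alpha = [1.0, 1.0, 0.0]   # the third point is not a support vector
b = 0.0

support_vectors = [xi for xi, a in zip(X, alpha) if a > 0]
label = 1 if decision_value((0.9, 0.8), X, y, alpha, b, rbf) >= 0 else -1
print(f"{len(support_vectors)} support vectors, predicted label = {label}")
```

Dropping the third training point leaves every decision value unchanged, since its α_{i} is zero.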

### Support Vectors: More Important Data

Only φ(x_{i}) of α_{i} > 0 used ⇒ support vectors


### Deriving the Dual

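For simplification, consider the problem without ξ_{i}

$$
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2} w^{T}w \\
\text{subject to} \quad & y_i(w^{T}\phi(x_i) + b) \ge 1, \quad i = 1, \ldots, l.
\end{aligned}
$$

Its dual is

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^{T}Q\alpha - e^{T}\alpha \\
\text{subject to} \quad & 0 \le \alpha_i, \quad i = 1, \ldots, l, \\
& y^{T}\alpha = 0.
\end{aligned}
$$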

### Lagrangian Dual

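$$\max_{\alpha \ge 0} \; \min_{w,b} \; L(w, b, \alpha),$$

where

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^{2} - \sum_{i=1}^{l} \alpha_i \left[ y_i(w^{T}\phi(x_i) + b) - 1 \right]$$

Strong duality (be careful about this)

$$\min \text{Primal} = \max_{\alpha \ge 0} \; \min_{w,b} \; L(w, b, \alpha)$$

Simplify the dual. When α is fixed,

$$
\min_{w,b} L(w, b, \alpha) =
\begin{cases}
-\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0, \\
\min_{w} \; \frac{1}{2} w^{T}w - \sum_{i=1}^{l} \alpha_i \left[ y_i w^{T}\phi(x_i) - 1 \right] & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0.
\end{cases}
$$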

If $\sum_{i=1}^{l} \alpha_i y_i \ne 0$, we can decrease

$$-b \sum_{i=1}^{l} \alpha_i y_i$$

in L(w, b, α) to −∞

If $\sum_{i=1}^{l} \alpha_i y_i = 0$, the optimum of the strictly convex function

$$\frac{1}{2} w^{T}w - \sum_{i=1}^{l} \alpha_i \left[ y_i w^{T}\phi(x_i) - 1 \right]$$

happens when

$$\nabla_{w} L(w, b, \alpha) = 0.$$

Thus,

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i).$$

Note that

$$w^{T}w = \left( \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \right)^{T} \left( \sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \right) = \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^{T}\phi(x_j)$$

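The dual is

$$
\max_{\alpha \ge 0}
\begin{cases}
\sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^{T}\phi(x_j) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0, \\
-\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0.
\end{cases}
$$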

Lagrangian dual: max_{α≥0} min_{w,b}L(w, b, α)

−∞ is definitely not the maximum of the dual, so a dual optimal solution cannot occur when

$$\sum_{i=1}^{l} \alpha_i y_i \ne 0.$$

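Dual simplified to

$$
\begin{aligned}
\max_{\alpha \in R^{l}} \quad & \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^{T}\phi(x_j) \\
\text{subject to} \quad & y^{T}\alpha = 0, \\
& \alpha_i \ge 0, \quad i = 1, \ldots, l.
\end{aligned}
$$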
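
To make the derivation concrete, the sketch below works through a two-point problem where the dual can be scanned directly. It assumes hypothetical one-dimensional data with φ(x) = x, namely x_{1} = 1 with y_{1} = +1 and x_{2} = −1 with y_{2} = −1. The constraint y^{T}α = 0 forces α_{1} = α_{2}, so a grid scan over one scalar stands in for a real QP solver.

```python
def dual_objective(alpha, Q):
    """sum_i alpha_i - 1/2 alpha^T Q alpha (the maximization form)."""
    n = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Hypothetical 1-D data with phi(x) = x.
x = [1.0, -1.0]
y = [1.0, -1.0]
Q = [[y[i] * y[j] * x[i] * x[j] for j in range(2)] for i in range(2)]

# y^T alpha = 0 forces alpha_1 = alpha_2 = a, so scan a on a grid.
grid = [k / 1000.0 for k in range(0, 2001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], Q))
dual_value = dual_objective([best_a, best_a], Q)

# Recover w from the dual solution and evaluate the primal objective.
w = sum(a * yi * xi for a, yi, xi in zip([best_a, best_a], y, x))
primal_value = 0.5 * w * w
print(f"alpha = {best_a}, w = {w}, dual = {dual_value}, primal = {primal_value}")
```

Here the dual objective reduces to 2a − 2a², maximized at a = 1/2, which gives w = 1. With b = 0 both constraints hold with equality, and the primal and dual objective values coincide, as strong duality predicts for this convex problem.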

### More about Dual Problems

After SVM became popular, quite a few people think that for any optimization problem

⇒ Lagrangian dual exists and strong duality holds

Wrong! We usually need

Convex programming; Constraint qualification

We have them

SVM primal is convex; Linear constraints

Our problems may be infinite dimensional

Can still use Lagrangian duality

See a rigorous discussion in Lin (2001)

### Primal versus Dual

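Recall the dual problem is

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^{T}Q\alpha - e^{T}\alpha \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& y^{T}\alpha = 0
\end{aligned}
$$

and at optimum

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \qquad (1)$$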

### Primal versus Dual (Cont’d)

What if we put (1) into the primal?

$$
\begin{aligned}
\min_{\alpha,\xi} \quad & \frac{1}{2} \alpha^{T}Q\alpha + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & (Q\alpha + by)_i \ge 1 - \xi_i, \qquad (2) \\
& \xi_i \ge 0
\end{aligned}
$$

If Q is positive definite, we can prove that the optimal α of (2) is the same as that of the dual

So the dual is not the only problem we can solve when we use kernels

### Other Variants

A general form for binary classification

$$\min_{w} \quad r(w) + C \sum_{i=1}^{l} \xi(w; x_i, y_i)$$

r (w): regularization term

ξ(w; x, y ): loss function: we hope y w^{T}x > 0

C : regularization parameter

### Loss Functions

Some commonly used loss functions:

ξ_{L1}(w; x, y ) ≡ max(0, 1 − y w^{T}x), (3)
ξ_{L2}(w; x, y ) ≡ max(0, 1 − y w^{T}x)^{2}, and (4)
ξ_{LR}(w; x, y ) ≡ log(1 + e^{−y w^{T}x}). (5)
We omit the bias term b here

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (3)-(4)

Logistic regression (LR): (5)
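
The three losses (3)-(5) are easy to tabulate side by side. In the sketch below the argument `margin` stands for the scalar y w^{T}x; the function names and the sample margins are illustrative choices.

```python
import math

def l1_loss(margin):
    """Hinge loss, equation (3); `margin` stands for y * w^T x."""
    return max(0.0, 1.0 - margin)

def l2_loss(margin):
    """Squared hinge loss, equation (4)."""
    return max(0.0, 1.0 - margin) ** 2

def lr_loss(margin):
    """Logistic loss, equation (5)."""
    return math.log(1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(f"y w^T x = {m:4.1f}  L1 = {l1_loss(m):.3f}  "
          f"L2 = {l2_loss(m):.3f}  LR = {lr_loss(m):.3f}")
```

The two SVM losses are exactly zero once y w^{T}x ≥ 1, while the logistic loss is always positive and only approaches zero as the margin grows.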

### Loss Functions (Cont’d)

[Figure: the losses ξ_{L1}, ξ_{L2}, and ξ_{LR} plotted as ξ(w; x, y ) against −y w^{T}x]

Indeed SVM and logistic regression are very similar

### Loss Functions (Cont’d)

If we use the square loss function

ξ(w; x, y ) ≡ (1 − y w^{T}x)^{2}

it becomes least-square SVM (Suykens and Vandewalle, 1999) or Gaussian process

### Regularization

L1 versus L2

‖w‖_{1} and w^{T}w/2

w^{T}w/2: smooth, easier to optimize

‖w‖_{1}: non-differentiable; sparse solution; possibly many zero elements

Possible advantages of L1 regularization:

- Feature selection
- Less storage for w

### Training SVM

The main issue is to solve the dual problem

$$
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^{T}Q\alpha - e^{T}\alpha \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& y^{T}\alpha = 0
\end{aligned}
$$

This will be discussed in Thursday’s lecture, which talks about the connection between optimization and machine learning

### Let’s Try a Practical Example

A problem from astroparticle physics

1 2.61e+01 5.88e+01 -1.89e-01 1.25e+02
1 5.70e+01 2.21e+02 8.60e-02 1.22e+02
1 1.72e+01 1.73e+02 -1.29e-01 1.25e+02
0 2.39e+01 3.89e+01 4.70e-01 1.25e+02
0 2.23e+01 2.26e+01 2.11e-01 1.01e+02
0 1.64e+01 3.92e+01 -9.91e-02 3.24e+01

Training and testing sets available: 3,089 and 4,000

Data available at LIBSVM Data Sets

### Training and Testing

Training the set svmguide1 to obtain svmguide1.model

$ ./svm-train svmguide1

Testing the set svmguide1.t

$ ./svm-predict svmguide1.t svmguide1.model out

Accuracy = 66.925% (2677/4000)

We see that training and testing accuracy are very different. Training accuracy is almost 100%

$ ./svm-predict svmguide1 svmguide1.model out

Accuracy = 99.7734% (3082/3089)

### Why this Fails

Gaussian kernel is used here

We see that most kernel elements have

$$
K_{ij} = e^{-\|x_i - x_j\|^2/4}
\begin{cases}
= 1 & \text{if } i = j, \\
\to 0 & \text{if } i \ne j,
\end{cases}
$$

because some features are in large numeric ranges

For what kind of data is K ≈ I?
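
The effect is easy to reproduce on the unscaled sample rows shown earlier. The sketch below evaluates the Gaussian kernel with γ = 1/4 on the feature values of the first three rows; the helper name `rbf` is an illustrative choice.

```python
import math

def rbf(u, v, gamma):
    """Gaussian kernel exp(-gamma ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Feature values of the first three unscaled sample rows shown earlier.
X = [
    (2.61e+01, 5.88e+01, -1.89e-01, 1.25e+02),
    (5.70e+01, 2.21e+02, 8.60e-02, 1.22e+02),
    (1.72e+01, 1.73e+02, -1.29e-01, 1.25e+02),
]

gamma = 0.25   # matches the exponent ||x_i - x_j||^2 / 4
K = [[rbf(u, v, gamma) for v in X] for u in X]
for row in K:
    print(["%.3g" % value for value in row])
```

The squared distances between these rows are in the thousands, so every off-diagonal entry underflows to zero in double precision and the kernel matrix is numerically the identity.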

### Why this Fails (Cont’d)

If we have training data

φ(x_{1}) = [1, 0, . . . , 0]^{T}
...

φ(x_{l}) = [0, . . . , 0, 1]^{T}
then

K = I

Clearly such training data can be correctly separated, but how about testing data?

So overfitting occurs

### Overfitting

See the illustration in the next slide

In theory, you can easily achieve 100% training accuracy

This is useless

When training and predicting on data, we should

- Avoid underfitting: small training error
- Avoid overfitting: small testing error

[Figure: overfitting illustration. ● and ▲: training; ○ and △: testing]

### Data Scaling

Without scaling, the above overfitting situation may occur

Also, features in greater numeric ranges may dominate

A simple solution is to linearly scale each feature to [0, 1] by:

$$\frac{\text{feature value} - \min}{\max - \min}$$

There are many other scaling methods

Scaling generally helps, but not always

### Data Scaling: Same Factors

A common mistake

$ ./svm-scale -l -1 -u 1 svmguide1 > svmguide1.scale

$ ./svm-scale -l -1 -u 1 svmguide1.t > svmguide1.t.scale

-l -1 -u 1: scaling to [−1, 1]

We need to use the same factors on training and testing

$ ./svm-scale -s range1 svmguide1 > svmguide1.scale

$ ./svm-scale -r range1 svmguide1.t > svmguide1.t.scale

Later we will give a real example
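
The idea of reusing the training factors can be sketched in a few lines. This is an illustrative re-implementation of the concept, not the code of svm-scale; the function names, the toy data, and the handling of constant features are assumptions.

```python
def fit_range(rows):
    """Per-feature (min, max) computed from the training rows only."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]

def apply_range(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using stored factors."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:   # constant feature: avoid dividing by zero
                out.append(lower)
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled

# Hypothetical toy data.
train = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]
test = [[15.0, 500.0]]

ranges = fit_range(train)                 # plays the role of the saved range file
train_scaled = apply_range(train, ranges)
test_scaled = apply_range(test, ranges)   # reuses the training factors
print(train_scaled, test_scaled)
```

Note that a test value may land outside [−1, 1] when it lies beyond the training range. That is expected: the mapping stays identical for both sets, which is the point of saving and restoring the factors.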

### After Data Scaling

Train scaled data and then predict

$ ./svm-train svmguide1.scale

$ ./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict

Accuracy = 96.15%

Training accuracy is now similar

$ ./svm-predict svmguide1.scale svmguide1.scale.model o

Accuracy = 96.439%

For this experiment, we use parameters C = 1, γ = 0.25, but sometimes performance is sensitive to parameters

### Parameters versus Performances

If we use C = 20, γ = 400

$ ./svm-train -c 20 -g 400 svmguide1.scale

$ ./svm-predict svmguide1.scale svmguide1.scale.model o

Accuracy = 100% (3089/3089)

100% training accuracy but

$ ./svm-predict svmguide1.t.scale svmguide1.scale.model o

Accuracy = 82.7% (3308/4000)

Very bad test accuracy

Overfitting happens

### Parameter Selection

For SVM, we may need to select suitable parameters

They are C and kernel parameters

Example:

γ of e^{−γ‖x_{i}−x_{j}‖^{2}}
a, b, d of (x^{T}_{i} x_{j}/a + b)^{d}

How to select them so performance is better?

### Performance Evaluation

Available data ⇒ training and validation

Train on the training set; test on the validation set to estimate the performance

A common way is k-fold cross validation (CV):

Data randomly separated into k groups

Each time, k − 1 groups are used for training and one for testing

Select parameters/kernels with the best CV result

There are many other methods to evaluate the performance
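
The k-fold procedure can be sketched independently of any particular learner. In the code below, `train_fn` and `predict_fn` are hypothetical callables supplied by the caller; a majority-label predictor stands in for an SVM only so that the example runs without external libraries.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k groups of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_accuracy(X, y, k, train_fn, predict_fn, seed=0):
    """Fraction of points predicted correctly when each group is held out once."""
    folds = k_fold_indices(len(X), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_fn(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Hypothetical stand-in learner: always predict the majority training label.
def train_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

X = [[float(i)] for i in range(10)]
y = [1] * 7 + [-1] * 3
cv_acc = cross_validation_accuracy(X, y, 5, train_majority, predict_majority)
print(f"5-fold CV accuracy = {cv_acc}")
```

Every point is used for validation exactly once, so the returned value estimates performance on data not seen during training.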

### Contour of CV Accuracy

The good region of parameters is quite large

SVM is sensitive to parameters, but not that sensitive

Sometimes default parameters work, but it’s good to select them if time is allowed

### Example of Parameter Selection

Direct training and test

$./svm-train svmguide3

$./svm-predict svmguide3.t svmguide3.model o

→ Accuracy = 2.43902%

After data scaling, accuracy is still low

$./svm-scale -s range3 svmguide3 > svmguide3.scale

$./svm-scale -r range3 svmguide3.t > svmguide3.t.scale

$./svm-train svmguide3.scale

$./svm-predict svmguide3.t.scale svmguide3.scale.model o

→ Accuracy = 12.1951%

### Example of Parameter Selection (Cont’d)

Select parameters by trying a grid of (C , γ) values

$ python grid.py svmguide3.scale

· · ·

128.0 0.125 84.8753

(Best C =128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)

Train and predict using the obtained parameters

$ ./svm-train -c 128 -g 0.125 svmguide3.scale

$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict

→ Accuracy = 87.8049%
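
The grid search itself is a simple double loop over exponentially spaced values. The sketch below is not grid.py; the exponent ranges are illustrative assumptions, and `toy_cv_score` is a contrived stand-in (peaked at the values reported above) for a real cross-validation run.

```python
def grid_search(cv_score, log2c_values, log2g_values):
    """Return (best_C, best_gamma, best_score) over an exponential grid."""
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            C, gamma = 2.0 ** log2c, 2.0 ** log2g
            score = cv_score(C, gamma)
            if best is None or score > best[2]:
                best = (C, gamma, score)
    return best

# Hypothetical stand-in for a cross-validation run; a real one would train
# and validate an SVM for each (C, gamma) pair.
def toy_cv_score(C, gamma):
    return 100.0 - abs(C - 128.0) / 128.0 - abs(gamma - 0.125) * 8.0

best_C, best_gamma, best_score = grid_search(
    toy_cv_score,
    log2c_values=range(-5, 16, 2),   # illustrative range of exponents
    log2g_values=range(-15, 4, 2),
)
print(f"best C = {best_C}, best gamma = {best_gamma}")
```

Searching over exponents rather than raw values covers several orders of magnitude with few evaluations, and each (C, γ) pair can be evaluated independently.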

### Selecting Kernels

RBF, polynomial, or others?

For beginners, use RBF first

Linear kernel: special case of RBF

Accuracy of linear the same as RBF under certain parameters (Keerthi and Lin, 2003)

Polynomial kernel:

(x^{T}_{i} x_{j}/a + b)^{d}

Numerical difficulties: (< 1)^{d} → 0, (> 1)^{d} → ∞
More parameters than RBF

### Selecting Kernels (Cont’d)

Commonly used kernels are Gaussian (RBF), polynomial, and linear

But in different areas, special kernels have been developed. Examples

1. χ^{2} kernel is popular in computer vision
2. String kernel is useful in some domains

### A Simple Procedure for Beginners

After helping many users, we came up with the following procedure

1. Conduct simple scaling on the data
2. Consider RBF kernel K (x, y) = e^{−γ‖x−y‖^{2}}
3. Use cross-validation to find the best parameters C and γ
4. Use the best C and γ to train the whole training set
5. Test

In LIBSVM, we have a python script easy.py implementing this procedure.

### A Simple Procedure for Beginners (Cont’d)

We proposed this procedure in an “SVM guide” (Hsu et al., 2003) and implemented it in LIBSVM

From a research viewpoint, this procedure is not novel. We never thought about submitting our guide anywhere

But this procedure has been tremendously useful.

It is now almost the standard thing to do for SVM beginners

### A Real Example of Wrong Scaling

Separately scale each feature of training and testing data to [0, 1]

$ ../svm-scale -l 0 svmguide4 > svmguide4.scale

$ ../svm-scale -l 0 svmguide4.t > svmguide4.t.scale

$ python easy.py svmguide4.scale svmguide4.t.scale

Accuracy = 69.2308% (216/312) (classification)

The accuracy is low even after parameter selection

$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale

$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale

$ python easy.py svmguide4.scale svmguide4.t.scale

Accuracy = 89.4231% (279/312) (classification)

### A Real Example of Wrong Scaling (Cont’d)

With the correct setting, the 10 features in the test data svmguide4.t.scale have the following maximal values:

0.7402, 0.4421, 0.6291, 0.8583, 0.5385, 0.7407, 0.3982, 1.0000, 0.8218, 0.9874

Scaling the test set to [0, 1] generated an erroneous set.

### Multi-class Classification

k classes

One-against-the-rest: train k binary SVMs:

1st class vs. (2, . . . , k)th class

2nd class vs. (1, 3, . . . , k)th class

...

k decision functions

(w^{1})^{T}φ(x) + b_{1}
...

(w^{k})^{T}φ(x) + b_{k}

Prediction:

$$\arg\max_{j} \; (w^{j})^{T}\phi(x) + b_{j}$$

Reason: If x ∈ 1st class, then we should have
(w^{1})^{T}φ(x) + b_{1} ≥ +1

(w^{2})^{T}φ(x) + b_{2} ≤ −1
...

(w^{k})^{T}φ(x) + b_{k} ≤ −1
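
As a minimal sketch of the one-against-the-rest prediction rule, the code below picks the class whose decision value is largest. The decision values are hypothetical numbers standing in for (w^{j})^{T}φ(x) + b_{j}; no SVMs are trained here.

```python
def predict_one_vs_rest(decision_values):
    """Return the 1-based class index with the largest decision value."""
    best_j = max(range(len(decision_values)), key=lambda j: decision_values[j])
    return best_j + 1

# Hypothetical outputs of (w^j)^T phi(x) + b_j for k = 4 classes.
values = [1.3, -1.1, -0.4, -2.0]
predicted = predict_one_vs_rest(values)
print(f"predicted class = {predicted}")
```

The argmax still yields a prediction when no decision value reaches +1, or when several are positive, which the ideal conditions above do not cover.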

### Multi-class Classification (Cont’d)

One-against-one: train k(k − 1)/2 binary SVMs

(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)

If 4 classes ⇒ 6 binary SVMs

| y_{i} = 1 | y_{i} = −1 | Decision function |
|---|---|---|
| class 1 | class 2 | f^{12}(x) = (w^{12})^{T}x + b^{12} |
| class 1 | class 3 | f^{13}(x) = (w^{13})^{T}x + b^{13} |
| class 1 | class 4 | f^{14}(x) = (w^{14})^{T}x + b^{14} |
| class 2 | class 3 | f^{23}(x) = (w^{23})^{T}x + b^{23} |
| class 2 | class 4 | f^{24}(x) = (w^{24})^{T}x + b^{24} |
| class 3 | class 4 | f^{34}(x) = (w^{34})^{T}x + b^{34} |

For a testing instance, predict with all binary SVMs

| Classes | Winner |
|---|---|
| 1 vs. 2 | 1 |
| 1 vs. 3 | 1 |
| 1 vs. 4 | 1 |
| 2 vs. 3 | 2 |
| 2 vs. 4 | 4 |
| 3 vs. 4 | 3 |

Select the one with the largest vote

| Class | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| # votes | 3 | 1 | 1 | 1 |

May use decision values as well
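
The voting step can be sketched as follows, using the six pairwise winners from the example above. The tie-breaking rule (smallest class index) is an arbitrary assumption made for this sketch.

```python
from collections import Counter

def predict_one_vs_one(pairwise_winners):
    """Majority vote over the winners of all pairwise classifiers."""
    votes = Counter(pairwise_winners.values())
    top = max(votes.values())
    # Ties are broken by the smallest class index (an arbitrary choice here).
    return min(c for c, v in votes.items() if v == top), votes

# Winners from the example above: key = (class i, class j), value = winner.
winners = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 4, (3, 4): 3}
predicted, votes = predict_one_vs_one(winners)
print(f"predicted class = {predicted}, votes = {dict(votes)}")
```

Class 1 wins all three of its pairwise contests, so it receives three votes and is selected.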

### More Complicated Forms

Solving a single optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004)

There are many other methods

A comparison in Hsu and Lin (2002)

RBF kernel: accuracy similar for different methods

But 1-against-1 is the fastest for training

### SVM doesn’t Scale Up

Yes, if using kernels

Training millions of data is time consuming

Cases with many support vectors: quadratic time bottleneck on Q_{SV, SV}

For noisy data: # SVs increases linearly in data size (Steinwart, 2003)

Some solutions

- Parallelization
- Approximation

### Parallelization

Multi-core/Shared Memory/GPU

• One line change of LIBSVM

| Cores | Multicore | Shared-memory |
|---|---|---|
| 1 | 80 | 100 |
| 2 | 48 | 57 |
| 4 | 32 | 36 |
| 8 | 27 | 28 |

50,000 data (kernel evaluations: 80% time)

• GPU (Catanzaro et al., 2008); Cell (Marzolla, 2010)

Distributed Environments

• Chang et al. (2007); Zanni et al. (2006); Zhu et al. (2009).

### Approximately Training SVM

Can be done in many aspects

Data level: sub-sampling

Optimization level: approximately solve the quadratic program

Other non-intuitive but effective ways

I will show one today

Many papers have addressed this issue

### Approximately Training SVM (Cont’d)

Subsampling

Simple and often effective

More advanced techniques

Incremental training (e.g., Syed et al., 1999): Data ⇒ 10 parts; train 1st part ⇒ SVs, train SVs + 2nd part, . . .

Select and train good points: KNN or heuristics. For example, Bakır et al. (2005)

### Approximately Training SVM (Cont’d)

Approximate the kernel; e.g., Fine and Scheinberg (2001); Williams and Seeger (2001)

Use part of the kernel; e.g., Lee and Mangasarian (2001); Keerthi et al. (2006)

Early stopping of optimization algorithms

Tsang et al. (2005) and others

And many more

Some simple but some sophisticated

### Approximately Training SVM (Cont’d)

Sophisticated techniques may not always be useful

Sometimes slower than sub-sampling

covtype: 500k training and 80k testing

rcv1: 550k training and 14k testing

| covtype training size | covtype accuracy | rcv1 training size | rcv1 accuracy |
|---|---|---|---|
| 50k | 92.5% | 50k | 97.2% |
| 100k | 95.3% | 100k | 97.4% |
| 500k | 98.2% | 550k | 97.8% |

### Discussion: Large-scale Training

We don’t have many large and well labeled sets

Expensive to obtain true labels

Specific properties of data should be considered

We will illustrate this point using linear SVM

The design of software for very large data sets should be application dependent

### Linear and Kernel Classification

Methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

x ⇒ φ(x)

φ(x_{i})^{T}φ(x_{j}) easily calculated; little control on φ(·)

Linear classification + feature engineering:

We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

We refer to them as kernel and linear classifiers

### Linear and Kernel Classification

Let’s check the prediction cost
$$w^{T}x + b \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i K(x_i, x) + b$$

If K (x_{i}, x_{j}) takes O(n), then

O(n) versus O(nl)

Linear is much cheaper

### Linear and Kernel Classification (Cont’d)

Also, linear is a special case of kernel

Indeed, we can prove that accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)

Therefore, roughly we have

accuracy: kernel ≥ linear

cost: kernel ≫ linear

Speed is the reason to use linear

### Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

This particularly happens for document classification

Number of features (bag-of-words model) very large

Data very sparse (i.e., few non-zeros)

Recently linear classification is a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008)

### Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

| Data set | Linear: time | Linear: accuracy | RBF kernel: time | RBF kernel: accuracy |
|---|---|---|---|---|
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features

### Extension: Training Explicit Form of Nonlinear Mappings

Linear-SVM method to train φ(x_{1}), . . . , φ(x_{l})

Kernel not used

Applicable only if dimension of φ(x) not too large

Low-degree Polynomial Mappings

$$K(x_i, x_j) = (x_i^{T}x_j + 1)^{2} = \phi(x_i)^{T}\phi(x_j)$$

$$\phi(x) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_{n-1}x_n]^{T}$$

When degree is small, train the explicit form of φ(x)
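
The explicit degree-2 mapping can be written out and checked against the kernel value directly. The sketch below builds φ(x) for any dimension n; the vectors `x` and `z` are arbitrary illustrative inputs, and the function names are hypothetical.

```python
import math
from itertools import combinations

def degree2_map(x):
    """Explicit phi(x) with phi(x)^T phi(z) = (x^T z + 1)^2."""
    root2 = math.sqrt(2.0)
    linear = [root2 * xi for xi in x]
    squares = [xi * xi for xi in x]
    cross = [root2 * xi * xj for xi, xj in combinations(x, 2)]
    return [1.0] + linear + squares + cross

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [0.5, -1.0, 2.0]   # arbitrary illustrative vectors in R^3
z = [1.5, 0.25, -0.5]

explicit = dot(degree2_map(x), degree2_map(z))
kernel = (dot(x, z) + 1.0) ** 2
dimension = len(degree2_map(x))   # (n + 1)(n + 2) / 2 = 10 for n = 3
print(f"explicit = {explicit:.6f}, kernel = {kernel:.6f}, dim = {dimension}")
```

The mapped dimension grows as (n + 1)(n + 2)/2, for example 5,151 for n = 100, which is why the explicit form is practical only for low degrees and moderate n, or for sparse data.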

### Testing Accuracy and Training Time

Degree-2 polynomial mapping:

| Data set | Training time (s): LIBLINEAR | Training time (s): LIBSVM | Accuracy | Accuracy diff.: Linear | Accuracy diff.: RBF |
|---|---|---|---|---|---|
| a9a | 1.6 | 89.8 | 85.06 | 0.07 | 0.02 |
| real-sim | 59.8 | 1,220.5 | 98.00 | 0.49 | 0.10 |
| ijcnn1 | 10.7 | 64.2 | 97.84 | 5.63 | −0.85 |
| MNIST38 | 8.6 | 18.4 | 99.29 | 2.47 | −0.40 |
| covtype | 5,211.9 | NA | 80.09 | 3.74 | −15.98 |
| webspam | 3,228.1 | NA | 98.44 | 5.29 | −0.76 |

Training φ(x_{i}) by linear: faster than kernel, but sometimes competitive accuracy

### Discussion: Directly Train φ(x_{i}), ∀i

See details in our work (Chang et al., 2010)

A related development: Sonnenburg and Franc (2010)

Useful for certain applications

### Extensions of SVM

- Multiple Kernel Learning (MKL)
- Learning to rank
- Semi-supervised learning
- Active learning
- Cost sensitive learning
- Structured Learning

### Conclusions

SVM and kernel methods are rather mature areas

But still quite a few interesting research issues

Many are extensions of standard classification

It is possible to identify more extensions through real applications

### References

G. H. Bakır, L. Bottou, and J. Weston. Breaking SVM complexity with cross-training. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 81–88. MIT Press, Cambridge, MA, 2005.

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.

E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Parallelizing support vector machines on distributed computers. In NIPS 21, 2007.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471–1490, 2010. URL http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, (2–3):201–233, 2002.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.

C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.

M. Marzolla. Optimized training of support vector machines on the cell processor. Technical Report UBLCS-2010-02, Department of Computer Science, University of Bologna, Italy, Feb. 2010. URL http://www.cs.unibo.it/pub/TR/UBLCS/ABSTRACTS/2010.bib?ncstrl.cabernet//BOLOGNA-UBLCS-2010-02.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.

S. Sonnenburg and V. Franc. COFFIN : A computational framework for linear SVMs. In Proceedings of the Twenty Seventh International Conference on Machine Learning (ICML), pages 999–1006, 2010.

I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.

J. Suykens and J. Vandewalle. Least square support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

N. A. Syed, H. Liu, and K. K. Sung. Incremental learning with support vector machines. In Workshop on Support Vector Machines, IJCAI99, 1999.

I. Tsang, J. Kwok, and P. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.

J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN99, pages 219–224, Brussels, 1999. D. Facto Press.

C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006.

Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the IEEE International Conference on Data Mining, 2009.