Support vector machines: status and challenges

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at Caltech, November 2006

Outline

Basic concepts

Current status

Challenges

Conclusions

Support Vector Classification

Training vectors: $x_i$, $i = 1, \ldots, l$

Feature vectors. For example, a patient = [height, weight, ...]

Consider a simple case with two classes:

Define an indicator vector $y$:

$$y_i = \begin{cases} 1 & \text{if } x_i \text{ is in class 1}, \\ -1 & \text{if } x_i \text{ is in class 2}. \end{cases}$$

A hyperplane which separates all data

[Figure: the parallel hyperplanes $w^Tx + b = +1, 0, -1$]

A separating hyperplane: $w^Tx + b = 0$

$$w^Tx_i + b \ge 1 \quad \text{if } y_i = 1$$

$$w^Tx_i + b \le -1 \quad \text{if } y_i = -1$$

Decision function $f(x) = \mathrm{sgn}(w^Tx + b)$, $x$: test data

Many possible choices of $w$ and $b$
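The separation constraints and the decision function above can be sketched in a few lines of Python. This is a minimal illustration only: the values of `w`, `b` and the toy data are made up, and the helper names are hypothetical.

```python
# Minimal sketch of f(x) = sgn(w^T x + b) and of the separation constraints.
# The values of w, b and the data points below are made up for illustration.

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(w, b, x):
    """Return +1 or -1 according to the sign of w^T x + b (0 maps to +1)."""
    return 1 if dot(w, x) + b >= 0 else -1

def separates_with_margin(w, b, xs, ys):
    """Check y_i (w^T x_i + b) >= 1 for every training pair."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(xs, ys))

# Toy 2-D data: class 1 on the right, class 2 on the left.
xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
ys = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(separates_with_margin(w, b, xs, ys))  # True
print(decision(w, b, (0.5, 4.0)))           # 1
```

Scaling `w` down (for example to `(0.1, 0.0)`) keeps the same decision function but violates the constraints, which is one way to see that many choices of $w$ and $b$ are possible.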

Maximal Margin

Distance between $w^Tx + b = 1$ and $w^Tx + b = -1$:

$$2/\|w\| = 2/\sqrt{w^Tw}$$

A quadratic programming problem [Boser et al., 1992]:

$$\min_{w,b} \quad \frac{1}{2} w^Tw$$

$$\text{subject to} \quad y_i(w^Tx_i + b) \ge 1, \; i = 1, \ldots, l.$$
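The distance $2/\|w\|$ is stated without derivation; the short argument below is one standard way to obtain it. The points $x_1$, $x_2$ and the scalar $t$ are notation introduced here for the sketch, not taken from the slides.

```latex
% x_1 lies on w^T x + b = 1, x_2 on w^T x + b = -1, and x_1 - x_2 is
% taken parallel to w (the shortest path between the two hyperplanes).
\begin{align*}
x_1 &= x_2 + t\, w \\
w^T x_1 - w^T x_2 &= (1 - b) - (-1 - b) = 2 \;\Rightarrow\; t\, w^T w = 2 \\
\|x_1 - x_2\| &= t\, \|w\| = \frac{2}{w^T w}\, \|w\| = \frac{2}{\|w\|}
\end{align*}
% Hence maximizing 2/\|w\| is equivalent to minimizing (1/2) w^T w.
```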

Data May Not Be Linearly Separable

An example:

Allow training errors

Higher dimensional (maybe infinite) feature space: $\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$

Standard SVM [Cortes and Vapnik, 1995]

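$$\min_{w,b,\xi} \quad \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l.$$

Example: $x \in R^3$, $\phi(x) \in R^{10}$

$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$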
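To make the role of the slack variables $\xi_i$ concrete, the sketch below evaluates the soft-margin objective for a given $(w, b)$. It assumes $\phi$ is the identity map (a linear SVM) and uses made-up data and a made-up value of $C$; for fixed $(w, b)$ the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(w^Tx_i + b))$.

```python
# Sketch: evaluate the soft-margin objective for a given (w, b).
# Assumption: phi is the identity map (linear SVM); data and C are made up.

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, xs, ys):
    """Smallest feasible xi_i for fixed (w, b): max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]

def primal_objective(w, b, xs, ys, C):
    """(1/2) w^T w + C * sum_i xi_i, with the slacks chosen as above."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))

# The second point is inside the margin; the fourth is on the wrong side.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.25, 0.0)]
ys = [1, 1, -1, -1]
w, b, C = (1.0, 0.0), 0.0, 10.0

print(slacks(w, b, xs, ys))               # [0.0, 0.5, 0.0, 1.25]
print(primal_objective(w, b, xs, ys, C))  # 18.0
```

A larger `C` makes the two violating points more expensive, which pushes the optimizer toward fewer training errors at the cost of a smaller margin.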

Finding the Decision Function

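$w$: possibly infinitely many variables

The dual problem:

$$\min_{\alpha} \quad \frac{1}{2}\alpha^TQ\alpha - e^T\alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, l, \quad y^T\alpha = 0,$$

where $Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$

At optimum:

$$w = \sum_{i=1}^{l} \alpha_iy_i\phi(x_i)$$

A finite problem: #variables = #training data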
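The step from the primal to the dual is not shown on the slides. A standard Lagrangian argument, sketched below, recovers both the dual objective and the expression for $w$ at the optimum. The multipliers $\mu_i$ are notation introduced here, and for an infinite-dimensional $\phi$ the manipulation should be read as formal.

```latex
% Lagrangian of the primal, with multipliers \alpha_i \ge 0 and \mu_i \ge 0
% (\mu_i is notation introduced here; it does not appear on the slides).
\begin{align*}
L(w, b, \xi, \alpha, \mu) &= \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
  - \sum_{i=1}^{l} \alpha_i \bigl[ y_i (w^T \phi(x_i) + b) - 1 + \xi_i \bigr]
  - \sum_{i=1}^{l} \mu_i \xi_i \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; y^T \alpha = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
  \;\Rightarrow\; 0 \le \alpha_i \le C
\end{align*}
% Substituting back leaves a function of \alpha only:
\[
\max_{\alpha} \; e^T \alpha - \frac{1}{2} \alpha^T Q \alpha
\quad \Longleftrightarrow \quad
\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha .
\]
```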

Kernel Tricks

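$Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ needs a closed form

Example: $x \in R^3$, $\phi(x) \in R^{10}$

$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$

Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2 \Rightarrow K(x_i, x_j)$

Decision function:

$$w^T\phi(x) + b = \sum_{i=1}^{l} \alpha_iy_i\phi(x_i)^T\phi(x) + b$$

Only $\phi(x_i)$ with $\alpha_i > 0$ are used $\Rightarrow$ support vectors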
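The identity $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$ and the kernel form of the decision function can be checked numerically. The sketch below uses only the Python standard library; the coefficients `alphas`, the labels and the points are made up, and the function names are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map from R^3 to R^10, as in the example."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi^T xj)^2, computed without forming phi."""
    return (1.0 + dot(xi, xj)) ** 2

def decision_value(alphas, ys, points, b, x, kernel=poly_kernel):
    """sum_i alpha_i y_i K(x_i, x) + b; terms with alpha_i = 0 are skipped."""
    return sum(a * y * kernel(p, x)
               for a, y, p in zip(alphas, ys, points) if a > 0) + b

xi, xj = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)
print(poly_kernel(xi, xj))                                   # 30.25
print(abs(dot(phi(xi), phi(xj)) - poly_kernel(xi, xj)) < 1e-9)  # True

# Made-up dual solution: the second point has alpha = 0, so it is not a
# support vector and never enters the decision function.
points = [xi, (0.0, 0.0, 1.0), xj]
alphas, ys, b = [0.5, 0.0, 0.5], [1, 1, -1], 0.1
print(decision_value(alphas, ys, points, b, (1.0, 0.0, 0.0)))
```

The kernel evaluation costs one inner product in $R^3$ instead of one in $R^{10}$, which is the point of the trick when $\phi$ is high or infinite dimensional.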

Support Vectors: More Important Data

A 3-D demonstration

www.csie.ntu.edu.tw/~cjlin/libsvmtools/svmtoy3d

Solving the Dual

$$\min_{\alpha} \quad \frac{1}{2}\alpha^TQ\alpha - e^T\alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, l, \quad y^T\alpha = 0$$

$Q_{ij} \ne 0$, $Q$: an $l$ by $l$ fully dense matrix

30,000 training points: 30,000 variables

$(30{,}000^2 \times 8/2)$ bytes $= 3.6 \times 10^9$ bytes, i.e. more than 3GB of RAM to store $Q$

Optimization methods cannot be directly applied

Extensive work has been done

Now easy to solve medium-sized problems
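The storage estimate for $Q$ can be reproduced with a few lines of arithmetic. The reading of the formula is an assumption: 8 is taken as the bytes of one double-precision entry and the division by 2 as storing only half of the symmetric matrix. The function name is hypothetical.

```python
def q_storage_bytes(l, bytes_per_entry=8, symmetric=True):
    """Approximate memory needed for a dense l-by-l matrix Q.

    Assumption: with symmetric=True only half of the l*l entries are
    stored, mirroring the "/2" in the estimate above.
    """
    entries = l * l
    if symmetric:
        entries //= 2
    return entries * bytes_per_entry

n_bytes = q_storage_bytes(30_000)
print(n_bytes)           # 3600000000
print(n_bytes / 2**30)   # the same amount expressed in GiB

# The requirement grows quadratically: 10 times more data, 100 times more memory.
print(q_storage_bytes(300_000) // n_bytes)  # 100
```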

An example of training 50,000 instances using LIBSVM

$ ./svm-train -m 200 -c 16 -g 4 22features

optimization finished, #iter = 24981

Total nSV = 3370

time 5m1.456s

Calculating Q may have taken more than 5 minutes

#SVs = 3,370 ≪ 50,000

SVM properties used in optimization

A detailed discussion: www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf

Parameter/Kernel Selection

Penalty parameter $C$: balance between generalization and training errors

$$\min_{w,b,\xi} \quad \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i$$

Kernel parameters

Cross validation

Data split to training/validation

Other more efficient techniques
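Cross validation over a grid of $(C, \gamma)$ values can be sketched as follows. This is a generic illustration, not the procedure of any particular package: `evaluate` is a hypothetical user-supplied callback that trains on one index set and returns the accuracy on the other, and the dummy callback below only stands in for real training.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size.
    (In practice the data would usually be shuffled first.)"""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_score(n, k, evaluate, **params):
    """Average validation score over k folds.

    evaluate(train_idx, valid_idx, **params) is a hypothetical callback
    that trains on train_idx and returns the accuracy on valid_idx.
    """
    folds = k_fold_indices(n, k)
    scores = []
    for i, valid in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train, valid, **params))
    return sum(scores) / k

def grid_search(n, k, evaluate, Cs, gammas):
    """Return (best_score, best_C, best_gamma) over the parameter grid."""
    best = None
    for C, gamma in product(Cs, gammas):
        score = cross_val_score(n, k, evaluate, C=C, gamma=gamma)
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best

def dummy_evaluate(train_idx, valid_idx, C, gamma):
    # Stand-in for real training: pretends the score peaks at C=16, gamma=4.
    return 1.0 / (1.0 + abs(C - 16) + abs(gamma - 4))

print(k_fold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(grid_search(10, 5, dummy_evaluate, Cs=[1, 4, 16, 64], gammas=[1, 4, 16]))
# (1.0, 16, 4)
```

The cost is (number of grid points) times (number of folds) training runs, which is why the search becomes hard when the number of parameters grows.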

Difficult if the number of parameters is large

E.g., feature scaling:

$$K(x, y) = e^{-\sum_{i=1}^{n} \gamma_i (x_i - y_i)^2}$$

Some features are more important

A challenging research issue
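The feature-scaling kernel above has one parameter $\gamma_i$ per feature. A minimal Python sketch, with a hypothetical function name and made-up inputs, shows how it reduces to the ordinary RBF kernel when all $\gamma_i$ are equal and how $\gamma_i = 0$ switches a feature off.

```python
import math

def scaled_rbf(x, y, gammas):
    """K(x, y) = exp(-sum_i gamma_i (x_i - y_i)^2), one gamma per feature."""
    if not (len(x) == len(y) == len(gammas)):
        raise ValueError("x, y and gammas must have the same length")
    return math.exp(-sum(g * (a - b) ** 2 for g, a, b in zip(gammas, x, y)))

x, y = (1.0, 2.0, 3.0), (1.5, 2.0, 1.0)

# Equal gammas: the ordinary RBF kernel exp(-gamma * ||x - y||^2).
print(scaled_rbf(x, y, (1.0, 1.0, 1.0)))
# gamma_3 = 0: the third feature no longer influences the kernel value.
print(scaled_rbf(x, y, (1.0, 1.0, 0.0)))
```

With $n$ features there are $n$ kernel parameters plus $C$, so a grid with even a handful of values per parameter quickly becomes infeasible.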

Design Kernels

Still a research issue

e.g., in bioinformatics and vision, many new kernels

But one should check carefully whether the function is a valid kernel, i.e., whether

$$K(x, y) = \phi(x)^T\phi(y)$$

For example, for any two strings $s_1$, $s_2$ we can define the edit distance and consider

$$e^{-\gamma\,\mathrm{edit}(s_1, s_2)}$$

It’s not a valid kernel [Cortes et al., 2003]
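For concreteness, the function in question can be computed as below. The slides do not say which edit distance is meant; the sketch assumes the unit-cost Levenshtein distance, and the function names are hypothetical. The code only evaluates the function and does not test whether it is a valid kernel.

```python
import math

def edit_distance(s1, s2):
    """Levenshtein distance (unit-cost insert, delete, substitute).
    Assumption: this is the edit distance intended; the source does not say."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def edit_similarity(s1, s2, gamma=1.0):
    """exp(-gamma * edit(s1, s2)); symmetric, but not guaranteed to be a kernel."""
    return math.exp(-gamma * edit_distance(s1, s2))

print(edit_distance("kitten", "sitting"))  # 3
print(edit_similarity("kitten", "sitting", gamma=0.5))
```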

Multi-class Classification

Combining results of several two-class classifiers

One-against-the-rest

One-against-one

And other ways

A comparison in [Hsu and Lin, 2002]
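One common way to combine one-against-one classifiers is majority voting over all $k(k-1)/2$ pairs. The sketch below illustrates that scheme under stated assumptions: the pairwise "classifiers" are toy nearest-centre rules rather than trained SVMs, the names are hypothetical, and the tie-breaking rule is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(classes, pairwise, x):
    """Majority vote over all k(k-1)/2 pairwise classifiers.

    pairwise[(a, b)](x) is a hypothetical two-class decision function
    that returns either a or b.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    # Ties are broken by the order of `classes` (an arbitrary choice here).
    return max(classes, key=lambda c: votes[c])

# Toy 1-D example: three classes centred at 0, 5 and 10; each pairwise
# "classifier" simply picks the nearer of its two centres.
centres = {"a": 0.0, "b": 5.0, "c": 10.0}
classes = list(centres)

def make_pair(a, b):
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b

pairwise = {(a, b): make_pair(a, b) for a, b in combinations(classes, 2)}

print(len(pairwise))                                    # 3
print(one_against_one_predict(classes, pairwise, 6.0))  # b
```

One-against-the-rest would instead train $k$ classifiers, each on all of the data, whereas each pairwise problem here only involves the data of two classes.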

Challenges

Unbalanced data

Some classes have few data, some classes a lot

Different evaluation criteria?

Structural data sets

An instance may not be a vector, e.g., a tree from a sentence

Labels in order relationships

SVM for ranking

Multi-label classification

An instance associated with ≥ 2 labels, e.g., a video shot includes several concepts

Large-scale data

SVM cannot handle large sets if using kernels

Two possibilities:

Linear SVMs. In some situations, can solve much larger problems

Approximation: sub-sampling and beyond

Semi-supervised learning

Some available data unlabeled

How can we guarantee performance at least as good as that of using only labeled data?

Why is SVM Popular?

No definitive answer; in my opinion:

Reasonably easy to use and often competitive performance

Rather general: linear/nonlinear, Gaussian process/RBF networks

Basic concept relatively easy: maximal margin

It’s lucky

Conclusions

We must admit that

SVM is a rather mature area

But still quite a few interesting research issues

Many are extensions of standard classification problems

Detailed SVM tutorial in Machine Learning Summer School 2006

www.csie.ntu.edu.tw/~cjlin/talks/MLSS.pdf

References

Boser, B. E., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press.

Cortes, C., Haffner, P., and Mohri, M. (2003). Positive definite rational kernels. In Proceedings of the 16th Annual Conference on Learning Theory, pages 41–56.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273–297.

Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.