Support vector machines: status and challenges
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Caltech, November 2006
Outline
Basic concepts
Current Status
Challenges
Conclusions
Support Vector Classification
Training vectors: x_i, i = 1, ..., l
These are feature vectors; for example, a patient = [height, weight, ...]
Consider a simple case with two classes.
Define an indicator vector y:
    y_i = +1 if x_i is in class 1
    y_i = −1 if x_i is in class 2
A hyperplane which separates all data
[Figure: parallel hyperplanes w^T x + b = +1, 0, −1]
A separating hyperplane: w^T x + b = 0, with
    w^T x_i + b ≥ +1 if y_i = +1
    w^T x_i + b ≤ −1 if y_i = −1
Decision function: f(x) = sgn(w^T x + b), where x is a test point
Many possible choices of w and b
Maximal Margin
Distance between w^T x + b = 1 and w^T x + b = −1:
    2/‖w‖ = 2/√(w^T w)
A quadratic programming problem [Boser et al., 1992]:
    min_{w,b}  (1/2) w^T w
    subject to y_i (w^T x_i + b) ≥ 1, i = 1, ..., l
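The hard-margin problem above can be checked numerically. Below is a minimal sketch assuming scikit-learn is available (its SVC class wraps LIBSVM); a very large C approximates the hard-margin formulation, and the tiny separable data set is made up for illustration.

```python
# Approximate the hard-margin SVM with a very large C on separable data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [2.0, 2.0], [3.0, 3.0]])  # class +1
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Every point should satisfy y_i (w^T x_i + b) >= 1, with equality
# (up to tolerance) at the support vectors.
margins = y * (X @ w + b)
print(margins.min())
```

The smallest margin value comes out at 1 (up to solver tolerance), confirming that all constraints of the quadratic program are active or satisfied.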
Data May Not Be Linearly Separable
An example:
Two remedies:
Allow training errors
Map data to a higher dimensional (maybe infinite dimensional) feature space:
    φ(x) = (φ_1(x), φ_2(x), ...)
Standard SVM [Cortes and Vapnik, 1995]
    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
    subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
               ξ_i ≥ 0, i = 1, ..., l
Example: x ∈ R^3, φ(x) ∈ R^10
    φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)
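The mapping above is easy to check numerically. A minimal sketch assuming NumPy; for this φ, the inner product φ(x)^T φ(z) equals (1 + x^T z)^2, the polynomial kernel derived on a later slide. The test points are arbitrary.

```python
import numpy as np

def phi(x):
    """The degree-2 map from R^3 to R^10 on the slide."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
# The explicit inner product and the closed form agree.
print(phi(x) @ phi(z), (1.0 + x @ z) ** 2)
```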
Finding the Decision Function
w: maybe infinitely many variables
The dual problem:
    min_α  (1/2) α^T Q α − e^T α
    subject to 0 ≤ α_i ≤ C, i = 1, ..., l,
               y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At an optimum:
    w = Σ_{i=1}^{l} α_i y_i φ(x_i)
A finite problem: #variables = #training data
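For a linear kernel, the primal-dual relation w = Σ_i α_i y_i x_i can be verified directly. A sketch assuming scikit-learn's SVC (a wrapper of LIBSVM), whose dual_coef_ attribute stores y_i α_i for the support vectors only.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for support vectors; summing over the
# support vectors reconstructs the primal w.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # the two agree
```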
Kernel Tricks
Q_ij = y_i y_j φ(x_i)^T φ(x_j) needs a closed form
Example: x ∈ R^3, φ(x) ∈ R^10
    φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2 ⇒ the kernel K(x_i, x_j)
Decision function:
    w^T φ(x) + b = Σ_{i=1}^{l} α_i y_i φ(x_i)^T φ(x) + b
Only the φ(x_i) with α_i > 0 are used ⇒ support vectors
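The decision function really does need only the support vectors. A sketch assuming scikit-learn's SVC with an RBF kernel: f(x) is rebuilt by hand from dual_coef_ (which holds α_i y_i) and the stored support vectors, then compared with the library's own decision_function.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

def K(a, b, gamma=0.5):
    """Gaussian kernel e^{-gamma * ||a - b||^2}."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_test = np.array([2.5, 2.0])
# f(x) = sum over support vectors of alpha_i y_i K(x_i, x), plus b
f = sum(c * K(sv, x_test)
        for c, sv in zip(clf.dual_coef_.ravel(), clf.support_vectors_))
f += clf.intercept_[0]
print(np.isclose(f, clf.decision_function([x_test])[0]))  # the sums match
```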
Support Vectors: More Important Data
[Figure: a two-dimensional example; support vectors lie closest to the separating hyperplane]
A 3-D demonstration:
www.csie.ntu.edu.tw/~cjlin/libsvmtools/svmtoy3d
Outline
Basic concepts
Current Status
Challenges
Conclusions
Solving the Dual
    min_α  (1/2) α^T Q α − e^T α
    subject to 0 ≤ α_i ≤ C, i = 1, ..., l,
               y^T α = 0
Q_ij ≠ 0 in general, so Q is an l-by-l fully dense matrix
30,000 training points ⇒ 30,000 variables:
    (30,000^2 × 8) / 2 bytes ≈ 3GB of RAM to store Q
Standard optimization methods cannot be directly applied
Extensive work has been done; medium-sized problems are now easy to solve
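The memory estimate on the slide can be reproduced with a line of arithmetic: a dense l-by-l matrix of 8-byte doubles, halved because only one triangle of the symmetric Q needs to be kept.

```python
# Storage for the dense symmetric kernel matrix Q with l = 30,000.
l = 30_000
bytes_needed = l * l * 8 / 2
print(bytes_needed / 2**30)  # about 3.35 GiB, i.e. roughly the 3GB quoted
```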
An example of training 50,000 instances using LIBSVM:
    $ ./svm-train -m 200 -c 16 -g 4 22features
    optimization finished, #iter = 24981
    Total nSV = 3370
    time 5m1.456s
Calculating the full Q alone may take more than 5 minutes
#SVs = 3,370 ≪ 50,000
SVM properties are exploited in the optimization; a detailed discussion:
www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf
Parameter/Kernel Selection
Penalty parameter C: balances generalization against training errors
    min_{w,b}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
Kernel parameters also need to be chosen
Cross validation: data split into training/validation sets
Other more efficient techniques exist
Difficult if the number of parameters is large, e.g., feature-wise scaling:
    K(x, y) = e^{−Σ_{i=1}^{n} γ_i (x_i − y_i)^2}
Some features are more important than others
A challenging research issue
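A common selection procedure is cross validation over a grid of (C, γ) values. Sketched below with scikit-learn's GridSearchCV on synthetic data; this is an assumed stand-in, since the talk itself uses LIBSVM and its accompanying tools.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross validation over a small (C, gamma) grid.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)
```

In practice the grid is usually taken over exponentially spaced values (e.g. C = 2^-5, 2^-3, ..., 2^15), and the best pair is then used to retrain on the full training set.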
Design Kernels
Still a research issue; e.g., in bioinformatics and vision, many new kernels are proposed
But one should check whether the function is a valid kernel, i.e.,
    K(x, y) = φ(x)^T φ(y) for some φ
For example, for any two strings s_1, s_2 we can define an edit distance and set
    K(s_1, s_2) = e^{−γ · edit(s_1, s_2)}
This is not a valid kernel [Cortes et al., 2003]
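A function K is a valid kernel only if every Gram matrix it produces is positive semidefinite (PSD). A quick numerical check, sketched with NumPy, looks at the smallest eigenvalue: the Gaussian kernel below always passes, whereas [Cortes et al., 2003] show the edit-distance construction above can fail such a check.

```python
import numpy as np

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix K."""
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Gram matrix of the Gaussian kernel e^{-||a - b||^2} on random points.
G = np.exp(-np.array([[np.sum((a - b) ** 2) for b in X] for a in X]))
print(min_eig(G) >= -1e-10)  # PSD up to numerical tolerance
```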
Multi-class Classification
Combining the results of several two-class classifiers:
One-against-the-rest
One-against-one
And other ways
A comparison is given in [Hsu and Lin, 2002]
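One-against-one builds k(k−1)/2 binary classifiers, one-against-the-rest builds k. A sketch assuming scikit-learn's SVC (which wraps LIBSVM and uses one-against-one internally): with the "ovo" decision-function shape, one column appears per pairwise classifier.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # k = 3 classes
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

k = len(clf.classes_)
n_pairwise = k * (k - 1) // 2              # 3 pairwise classifiers for k = 3
# The "ovo" decision function has one output per pairwise classifier.
print(n_pairwise, clf.decision_function(X).shape[1])
```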
Outline
Basic concepts
Current Status
Challenges
Conclusions
Challenges
Unbalanced data
Some classes have few data, some classes a lot
Different evaluation criteria?
Structured data sets
An instance may not be a vector, e.g., a parse tree from a sentence
Labels with order relationships
SVM for ranking
Challenges (Cont’d)
Multi-label classification
An instance may be associated with ≥ 2 labels, e.g., a video shot includes several concepts
Large-scale data
SVM cannot handle large sets if kernels are used
Two possibilities:
Linear SVMs: in some situations, much larger problems can be solved
Approximation: sub-sampling and beyond
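For large-scale problems, a linear SVM avoids forming the kernel matrix entirely. A sketch assuming scikit-learn; its LinearSVC wraps the LIBLINEAR solver, which is designed for exactly this setting.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data standing in for a large sparse problem.
X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

# LinearSVC trains a linear SVM without ever storing an l-by-l matrix.
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
print(clf.score(X, y))  # training accuracy
```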
Challenges (Cont’d)
Semi-supervised learning
Some of the available data are unlabeled
Can we guarantee better performance than using only the labeled data?
Outline
Basic concepts
Current Status
Challenges
Conclusions
Why is SVM Popular?
No definitive answer; in my opinion:
Reasonably easy to use, with often competitive performance
Rather general: linear/nonlinear, like Gaussian processes/RBF networks
The basic concept is relatively easy: maximal margin
It's lucky
Conclusions
We must admit that SVM is a rather mature area,
but there are still quite a few interesting research issues;
many are extensions of standard classification problems
Detailed SVM tutorial in Machine Learning Summer School 2006
www.csie.ntu.edu.tw/~cjlin/talks/MLSS.pdf
References I
Boser, B. E., Guyon, I., and Vapnik, V. (1992).
A training algorithm for optimal margin classifiers.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press.
Cortes, C., Haffner, P., and Mohri, M. (2003).
Positive definite rational kernels.
In Proceedings of the 16th Annual Conference on Learning Theory, pages 41–56.
Cortes, C. and Vapnik, V. (1995).
Support-vector networks.
Machine Learning, 20:273–297.
Hsu, C.-W. and Lin, C.-J. (2002).
A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415–425.