# Support vector machines: status and challenges


(1)

### Support vector machines: status and challenges

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Talk at Caltech, November 2006

(2)

### Outline

Basic concepts Current Status Challenges Conclusions

(3)

### Outline

Basic concepts Current Status Challenges Conclusions

(4)

### Support Vector Classification

Training vectors: x_i, i = 1, …, l, each a feature vector. For example, a patient = [height, weight, …]

Consider a simple case with two classes. Define an indicator vector y:

y_i = +1 if x_i is in class 1, −1 if x_i is in class 2

We look for a hyperplane which separates all data.

(5)

[Figure: the separating hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = −1]

A separating hyperplane: w^T x + b = 0

(w^T x_i) + b ≥ +1 if y_i = +1
(w^T x_i) + b ≤ −1 if y_i = −1

Decision function: f(x) = sgn(w^T x + b), where x is a test point. There are many possible choices of w and b.
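As a minimal sketch (with a made-up hyperplane, not one learned from data), the decision function f(x) = sgn(w^T x + b) is just an inner product and a sign:

```python
def decision(w, b, x):
    """Linear SVM decision function f(x) = sgn(w^T x + b)."""
    value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if value >= 0 else -1

# Hypothetical hyperplane in R^2: w = (1, 1), b = -1
w, b = [1.0, 1.0], -1.0
print(decision(w, b, [2.0, 2.0]))   # point on the +1 side -> 1
print(decision(w, b, [0.0, 0.0]))   # point on the -1 side -> -1
```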

(6)

### Maximal Margin

The distance between w^T x + b = 1 and w^T x + b = −1 is

2/‖w‖ = 2/√(w^T w)

Maximizing it is a quadratic programming problem [Boser et al., 1992]:

min_{w,b} (1/2) w^T w
subject to y_i (w^T x_i + b) ≥ 1, i = 1, …, l.
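The margin formula 2/‖w‖ is easy to check numerically; a small sketch with a made-up weight vector:

```python
import math

def margin(w):
    """Margin between w^T x + b = 1 and w^T x + b = -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(margin([3.0, 4.0]))  # ||w|| = 5, so the margin is 0.4
```

Minimizing (1/2) w^T w is thus equivalent to maximizing this margin.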

(7)

### Data May Not Be Linearly Separable

[Figure: an example data set that no hyperplane separates]

Two remedies: allow training errors, and map the data into a higher dimensional (maybe infinite) feature space φ(x) = (φ_1(x), φ_2(x), …).

(8)

Standard SVM [Cortes and Vapnik, 1995]:

min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{l} ξ_i

subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, …, l.

Example: x ∈ R^3, φ(x) ∈ R^10:

φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)
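The explicit map above can be verified numerically: the inner product of φ(x_i) and φ(x_j) in R^10 equals (1 + x_i^T x_j)^2 computed directly in R^3. A small sketch:

```python
import math

def phi(x):
    """Explicit degree-2 feature map: R^3 -> R^10."""
    r2 = math.sqrt(2.0)
    x1, x2, x3 = x
    return [1.0, r2*x1, r2*x2, r2*x3, x1*x1, x2*x2, x3*x3,
            r2*x1*x2, r2*x1*x3, r2*x2*x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

xi, xj = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = dot(phi(xi), phi(xj))        # inner product in feature space
rhs = (1.0 + dot(xi, xj)) ** 2     # degree-2 polynomial kernel in R^3
print(abs(lhs - rhs) < 1e-9)       # True: the two agree
```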

(9)

### Finding the Decision Function

w: maybe infinitely many variables. The dual problem:

min_α (1/2) α^T Q α − e^T α

subject to 0 ≤ α_i ≤ C, i = 1, …, l,
y^T α = 0,

where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, …, 1]^T. At an optimum,

w = Σ_{i=1}^{l} α_i y_i φ(x_i)

A finite problem: #variables = #training data.

(10)

### Kernel Tricks

Q_ij = y_i y_j φ(x_i)^T φ(x_j) needs a closed form. Example: x ∈ R^3, φ(x) ∈ R^10:

φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)

Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2 ≡ K(x_i, x_j). Decision function:

w^T φ(x) + b = Σ_{i=1}^{l} α_i y_i φ(x_i)^T φ(x) + b

Only the φ(x_i) with α_i > 0 are used ⇒ support vectors.
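The kernelized decision function never forms φ explicitly; it only sums kernel evaluations over the support vectors. A sketch with made-up support vectors and α_i y_i products (not values from a trained model), using a Gaussian kernel:

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * d2)

def predict(support_vectors, alpha_y, b, x, kernel=rbf):
    """f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b) over support vectors only."""
    value = sum(ay * kernel(sv, x) for sv, ay in zip(support_vectors, alpha_y)) + b
    return 1 if value >= 0 else -1

# Two hypothetical support vectors with alpha_i * y_i = -1 and +1, bias 0
svs = [[0.0, 0.0], [2.0, 2.0]]
alpha_y = [-1.0, 1.0]
print(predict(svs, alpha_y, 0.0, [2.0, 1.8]))  # near the +1 support vector -> 1
```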

(11)

### Support Vectors: More Important Data

[Figure: a 2-D decision boundary; the support vectors lie on or inside the margin]

A 3-D demonstration: www.csie.ntu.edu.tw/~cjlin/libsvmtools/svmtoy3d

(12)

### Outline

Basic concepts Current Status Challenges Conclusions

(13)

### Solving the Dual

min_α (1/2) α^T Q α − e^T α

subject to 0 ≤ α_i ≤ C, i = 1, …, l,
y^T α = 0

In general Q_ij ≠ 0, so Q is an l by l fully dense matrix. With 30,000 training points there are 30,000 variables, and storing Q takes

(30,000^2 × 8 / 2) bytes ≈ 3 GB of RAM

so standard optimization methods cannot be directly applied. Extensive work has been done; it is now easy to solve medium-sized problems.
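The memory estimate above (8-byte doubles, storing only half of the symmetric Q, hence the division by 2) can be reproduced directly:

```python
l = 30_000                      # training points = dual variables
bytes_needed = l * l * 8 // 2   # double precision, half of symmetric Q
print(bytes_needed / 1e9)       # 3.6e9 bytes, roughly the 3 GB quoted
```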

(14)

An example of training 50,000 instances using LIBSVM:

$ ./svm-train -m 200 -c 16 -g 4 22features
optimization finished, #iter = 24981
Total nSV = 3370
time 5m1.456s

Calculating the full Q alone may have taken more than 5 minutes.

#SVs = 3,370 ≪ 50,000

SVM properties are used in the optimization. A detailed discussion: www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf

(15)

### Parameter/Kernel Selection

Penalty parameter C: a balance between generalization and training errors,

min_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{l} ξ_i

and kernel parameters. Cross validation: data split into training/validation sets. Other more efficient techniques exist.
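Cross validation splits the data into k folds, training on k−1 and validating on the held-out fold for each parameter setting. A minimal index-splitting helper (a hypothetical sketch, not LIBSVM's own routine):

```python
def kfold_indices(l, k):
    """Split indices 0..l-1 into k (nearly) equal validation folds."""
    folds = []
    start = 0
    for fold in range(k):
        size = l // k + (1 if fold < l % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 5-fold cross validation over 22 instances: fold sizes 5, 5, 4, 4, 4
print([len(f) for f in kfold_indices(22, 5)])
```

In practice one trains once per fold per (C, kernel-parameter) pair and keeps the setting with the best average validation accuracy.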

(16)

Difficult if the number of parameters is large. E.g., feature scaling:

K(x, y) = e^{−Σ_{i=1}^{n} γ_i (x_i − y_i)^2}

Some features are more important than others. A challenging research issue.
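The feature-scaled kernel above has one γ_i per feature; a sketch (the γ values here are illustrative, not tuned):

```python
import math

def scaled_rbf(x, y, gammas):
    """K(x, y) = exp(-sum_i gamma_i * (x_i - y_i)^2), one gamma per feature."""
    return math.exp(-sum(g * (a - b) ** 2 for g, a, b in zip(gammas, x, y)))

# A larger gamma_i makes feature i matter more; gamma_i = 0 ignores feature i
x, y = [1.0, 5.0], [1.5, 9.0]
print(scaled_rbf(x, y, [1.0, 0.0]))  # only the first feature contributes
```

With n features there are n scaling parameters plus C, which is why grid-search cross validation quickly becomes infeasible.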

(17)

### Design Kernels

Still a research issue: e.g., in bioinformatics and vision, many new kernels are proposed. But one should be careful that the function is a valid kernel, i.e.,

K(x, y) = φ(x)^T φ(y)

for some φ. For example, for any two strings s_1, s_2 we can define the edit distance and set

K(s_1, s_2) = e^{−γ · edit(s_1, s_2)}

This is not a valid kernel [Cortes et al., 2003].

(18)

### Multi-class Classification

Combining results of several two-class classifiers:

One-against-the-rest
One-against-one
And other ways

A comparison is given in [Hsu and Lin, 2002].
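One-against-one trains a classifier for every pair of classes and predicts by majority vote; a sketch of just the voting step (the pairwise decisions are supplied as inputs here, not trained):

```python
from collections import Counter

def one_vs_one_vote(pairwise_predictions):
    """Each pairwise classifier votes for one class; the majority wins."""
    votes = Counter(pairwise_predictions)
    return votes.most_common(1)[0][0]

# 3 classes -> 3 pairwise classifiers; two of them vote for class 'b'
print(one_vs_one_vote(['b', 'b', 'c']))  # -> 'b'
```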

(19)

### Outline

Basic concepts Current Status Challenges Conclusions

(20)

### Challenges

Unbalanced data

Some classes have few data, some have a lot. Different evaluation criteria?

Structural data sets

An instance may not be a vector, e.g., a tree from a sentence. Labels with order relationships: SVM for ranking.

(21)

### Challenges (Cont’d)

Multi-label classification

An instance may be associated with ≥ 2 labels, e.g., a video shot that includes several concepts.

Large-scale data

SVM cannot handle large sets if using kernels. Two possibilities:

Linear SVMs: in some situations, much larger problems can be solved
Approximation: sub-sampling and beyond

(22)

### Challenges (Cont’d)

Semi-supervised learning

Some of the available data are unlabeled. How can we guarantee performance better than using only the labeled data?

(23)

### Outline

Basic concepts Current Status Challenges Conclusions

(24)

### Why is SVM Popular?

No definitive answer; in my opinion:

Reasonably easy to use and often competitive performance
Rather general: linear/nonlinear; close relatives of Gaussian processes/RBF networks
Basic concept relatively easy: maximal margin
It's lucky

(25)

### Conclusions

SVM is a rather mature area, but still quite a few interesting research issues remain. Many are extensions of standard classification problems.

A detailed SVM tutorial was given at the Machine Learning Summer School 2006: www.csie.ntu.edu.tw/~cjlin/talks/MLSS.pdf

(26)

### References I

Boser, B. E., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press.

Cortes, C., Haffner, P., and Mohri, M. (2003). Positive definite rational kernels. In Proceedings of the 16th Annual Conference on Learning Theory, pages 41–56.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273–297.

Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.
