Optimization, Support Vector Machines,
and Machine Learning
Chih-Jen Lin
Department of Computer Science National Taiwan University
Outline
Introduction to machine learning and support vector machines (SVM)
SVM and optimization theory
SVM and numerical optimization
Practical use of SVM
Talk slides available at
http://www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf
This talk intends to give optimization researchers an overview of SVM research
What Is Machine Learning?
Extract knowledge from data
Classification, clustering, and others
We focus only on classification here
Many new optimization issues
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Examples
Handwritten digit recognition
Spam filtering
Training and testing
Methods:
Nearest neighbor
Neural networks
Decision trees
Support vector machines: another popular method
Main topic of this talk
Machine learning, applied statistics, pattern recognition
Very similar, but slightly different focuses
As it is more applied, machine learning is a bigger field
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Consider a simple case with two classes: define a vector $y$ by
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1}\\ -1 & \text{if } x_i \text{ in class 2}\end{cases}$$
A hyperplane which separates all data
A separating hyperplane $w^Tx + b = 0$ (with the hyperplanes $w^Tx + b = +1$ and $w^Tx + b = -1$ on either side):
$$w^Tx_i + b > 0 \text{ if } y_i = 1, \qquad w^Tx_i + b < 0 \text{ if } y_i = -1$$
Decision function $f(x) = \mathrm{sign}(w^Tx + b)$, $x$: test data
Variables $w$ and $b$: need to know the coefficients of a plane
Many possible choices of w and b
Select w, b with the maximal margin.
Maximal distance between $w^Tx + b = 1$ and $w^Tx + b = -1$
$$w^Tx_i + b \geq 1 \text{ if } y_i = 1, \qquad w^Tx_i + b \leq -1 \text{ if } y_i = -1$$
Distance between $w^Tx + b = 1$ and $-1$: $2/\|w\| = 2/\sqrt{w^Tw}$
$$\max \; 2/\|w\| \;\equiv\; \min \; w^Tw/2$$
$$\min_{w,b} \; \tfrac{1}{2}w^Tw \quad \text{subject to} \quad y_i(w^Tx_i + b) \geq 1, \; i = 1, \ldots, l.$$
A nonlinear programming problem
A 3-D demonstration
Notations very different from optimization
Well, this is unavoidable
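To make the maximal-margin formulation concrete, below is a minimal sketch that solves the hard-margin primal on a tiny two-class data set. The cvxpy package and the toy points are my own choices for illustration, not part of the talk.

```python
import numpy as np
import cvxpy as cp

# toy linearly separable data in R^2 (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)   # coefficients of the maximal-margin separating hyperplane
```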
Higher Dimensional Feature Spaces
Earlier we tried to find a linear separating hyperplane
Data may not be linearly separable
Non-separable case: allow training errors
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l$$
$\xi_i > 1$: $x_i$ not on the correct side of the separating plane
Nonlinear case: linearly separable in other spaces?
Higher dimensional (maybe infinite) feature space
$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$$
Example: $x \in R^3$, $\phi(x) \in R^{10}$
$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$
A standard problem (Cortes and Vapnik, 1995):
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l.$$
Finding the Decision Function
w: a vector in a high dimensional space ⇒ maybe infinite variables
The dual problem
$$\min_\alpha \; \tfrac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad y^T\alpha = 0,$$
where $Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$
At optimum, $w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$
Primal and dual: discussed later
A finite problem:
#variables = #training data
$Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ needs a closed form
Efficient calculation of high dimensional inner products
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$
$$\phi(x_i) = (1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$.
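The closed form can be checked numerically. The sketch below compares the explicit 10-dimensional mapping with $(1 + x_i^Tx_j)^2$; the particular vectors are arbitrary choices of mine.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^3 (10 dimensions)."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])

lhs = phi(xi) @ phi(xj)        # inner product in the 10-dimensional feature space
rhs = (1.0 + xi @ xj) ** 2     # closed-form kernel value
print(lhs, rhs)                # both print the same number
```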
Kernel Tricks
Kernel: $K(x, y) = \phi(x)^T\phi(y)$
No need to explicitly know $\phi(x)$
Common kernels:
$$K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2} \quad \text{(Radial Basis Function)}$$
$$K(x_i, x_j) = (x_i^Tx_j/a + b)^d \quad \text{(Polynomial kernel)}$$
They can be inner products in an infinite dimensional space
Assume $x \in R^1$ and $\gamma > 0$.
$$e^{-\gamma\|x_i - x_j\|^2} = e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_ix_j - \gamma x_j^2}$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_ix_j}{1!} + \frac{(2\gamma x_ix_j)^2}{2!} + \frac{(2\gamma x_ix_j)^3}{3!} + \cdots\right)$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1\cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i\cdot\sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2\cdot\sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3\cdot\sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) = \phi(x_i)^T\phi(x_j),$$
where
$$\phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}\,x, \sqrt{\frac{(2\gamma)^2}{2!}}\,x^2, \sqrt{\frac{(2\gamma)^3}{3!}}\,x^3, \cdots\right]^T.$$
Decision function
$w$: maybe an infinite vector
At optimum, $w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$
Decision function
$$w^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_iK(x_i, x) + b$$
No need to have w
$> 0$: 1st class, $< 0$: 2nd class
Only $\phi(x_i)$ with $\alpha_i > 0$ are used
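As a small illustration of predicting with only the support vectors, here is a hedged sketch; the RBF kernel choice, the sample support vectors, and the $\alpha$, $b$ values are made-up placeholders, not trained values.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, support_vectors, support_y, alpha, b, gamma):
    """Evaluate sum_i alpha_i * y_i * K(x_i, x) + b over support vectors only."""
    return sum(a * yi * rbf_kernel(xi, x, gamma)
               for a, yi, xi in zip(alpha, support_y, support_vectors)) + b

# hypothetical support vectors and coefficients, just to show the call
sv = np.array([[0.0, 1.0], [1.0, 0.0]])
val = decision_value(np.array([0.5, 0.5]), sv, [1, -1], [0.3, 0.3], 0.1, gamma=0.5)
print(1 if val > 0 else -1)   # predicted class
```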
Support Vectors: More Important Data
[Figure: a two-class example; the support vectors are the points closest to the decision boundary]
Is Kernel Really Useful?
Training data mapped to be linearly independent ⇒ separable
Except for this, we know little about high dimensional spaces
Kernel selection is another issue
On the one hand, very few general kernels
On the other hand, people try to design kernels specific to applications
SVM and Optimization
Dual problem is essential for SVM
There are other optimization issues in SVM
But things are not that simple
If SVM were not a good method, it would be useless to study its optimization issues
Optimization in ML Research
Every day there are new classification methods
Most are related to optimization problems
Most will never be popular
Things optimization people focus on (e.g., convergence rate) may not be that important for ML people
In machine learning
The use of optimization techniques is sometimes not rigorous
Usually an optimization algorithm should guarantee
1. Strictly decreasing
2. Convergence to a stationary point
3. Convergence rate
In some ML papers, 1 does not even hold
Some wrongly think 1 and 2 are the same
Status of SVM
Existing methods:
Nearest neighbor, Neural networks, decision trees.
SVM: similar status (competitive but may not be better)
In my opinion, after careful data pre-processing
Appropriately use NN or SVM ⇒ similar accuracy
But users may not use them properly
The chance of SVM
Easier for users to use it appropriately
Replacing NN in some applications
So SVM has survived as a ML method
There are needs to seriously study its optimization issues
A Primal-Dual Example
Let us have an example before deriving the dual
To check the primal-dual relationship:
$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$$
Two training data in $R^1$:
[Figure: △ at $x = 0$ and ● at $x = 1$]
What is the separating hyperplane ?
Primal Problem
$x_1 = 0$, $x_2 = 1$ with $y = [-1, 1]^T$. Primal problem
$$\min_{w,b} \; \tfrac{1}{2}w^2 \quad \text{subject to} \quad w\cdot 1 + b \geq 1, \quad -1(w\cdot 0 + b) \geq 1.$$
The constraints give $-b \geq 1$ and $w \geq 1 - b \geq 2$. We are minimizing $\tfrac{1}{2}w^2$, so the smallest feasible $w$ is $w = 2$.
(w, b) = (2, −1) optimal solution.
The separating hyperplane 2x − 1 = 0
[Figure: the separating point $x = 1/2$ between △ at 0 and ● at 1]
Dual Problem
Formula without penalty parameter C
$$\min_{\alpha\in R^l} \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \geq 0, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}\alpha_iy_i = 0.$$
Get the objective function
$x_1^Tx_1 = 0$, $x_1^Tx_2 = 0$, $x_2^Tx_1 = 0$, $x_2^Tx_2 = 1$
Objective function
$$\tfrac{1}{2}\alpha_2^2 - (\alpha_1 + \alpha_2) = \tfrac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}.$$
Constraints: $\alpha_1 - \alpha_2 = 0$, $0 \leq \alpha_1$, $0 \leq \alpha_2$.
Substituting $\alpha_2 = \alpha_1$ into the objective function gives $\tfrac{1}{2}\alpha_1^2 - 2\alpha_1$
Smallest value at $\alpha_1 = 2$; $\alpha_2 = 2$ as well
$[2, 2]^T$ satisfies $0 \leq \alpha_1$ and $0 \leq \alpha_2$, so it is optimal
Primal-dual relation: $w = y_1\alpha_1x_1 + y_2\alpha_2x_2 = (-1)\cdot 2\cdot 0 + 1\cdot 2\cdot 1 = 2$
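To double-check this toy example numerically, one can hand the dual to a generic solver. The sketch below uses scipy.optimize, which is my own illustration (not how SVM software actually solves the dual).

```python
import numpy as np
from scipy.optimize import minimize

# toy example: x1 = 0, x2 = 1, y = [-1, 1]
x = np.array([0.0, 1.0])
y = np.array([-1.0, 1.0])
Q = np.outer(y, y) * np.outer(x, x)                 # Q_ij = y_i y_j x_i x_j

def dual_obj(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: y @ a},)    # y^T alpha = 0
bnds = [(0, None), (0, None)]                       # alpha_i >= 0 (no C here)
res = minimize(dual_obj, x0=np.zeros(2), bounds=bnds, constraints=cons)
alpha = res.x
w = (alpha * y) @ x                                 # primal-dual relation
b = y[1] - w * x[1]                                 # from the active constraint y_2(w x_2 + b) = 1
print(alpha, w, b)                                  # expect alpha ≈ [2, 2], w ≈ 2, b ≈ -1
```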
SVM Primal and Dual
Standard SVM (Primal)
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l.$$
$w$: huge (maybe infinite) vector variable
Practically we solve dual, a different but related problem
Dual problem
$$\min_\alpha \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}y_i\alpha_i = 0.$$
$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ available in closed form
$\alpha$: $l$ variables; finite
Primal Dual Relationship
At optimum
$$\bar{w} = \sum_{i=1}^{l}\bar{\alpha}_iy_i\phi(x_i), \qquad \tfrac{1}{2}\bar{w}^T\bar{w} + C\sum_{i=1}^{l}\bar{\xi}_i = e^T\bar{\alpha} - \tfrac{1}{2}\bar{\alpha}^TQ\bar{\alpha},$$
where $e = [1, \ldots, 1]^T$.
Primal objective value = $-$(Dual objective value)
How does this dual come from ?
Derivation of the Dual
Consider a simpler problem
$$\min_{w,b} \; \tfrac{1}{2}w^Tw \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1, \; i = 1, \ldots, l.$$
Its dual
$$\min_\alpha \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \leq \alpha_i, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}y_i\alpha_i = 0.$$
Lagrangian Dual
Defined as
$$\max_{\alpha\geq 0}\left(\min_{w,b} L(w, b, \alpha)\right), \quad \text{where} \quad L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l}\alpha_i\left[y_i(w^T\phi(x_i) + b) - 1\right]$$
Strong duality: min Primal $= \max_{\alpha\geq 0}\left(\min_{w,b} L(w, b, \alpha)\right)$
Simplify the dual. When $\alpha$ is fixed,
$$\min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l}\alpha_iy_i \neq 0,\\ \min_w \; \tfrac{1}{2}w^Tw - \sum_{i=1}^{l}\alpha_i\left[y_iw^T\phi(x_i) - 1\right] & \text{if } \sum_{i=1}^{l}\alpha_iy_i = 0.\end{cases}$$
If $\sum_{i=1}^{l}\alpha_iy_i \neq 0$, we can decrease $-b\sum_{i=1}^{l}\alpha_iy_i$ in $L(w, b, \alpha)$ to $-\infty$
If $\sum_{i=1}^{l}\alpha_iy_i = 0$, the optimum of the strictly convex function $\tfrac{1}{2}w^Tw - \sum_{i=1}^{l}\alpha_i\left[y_iw^T\phi(x_i) - 1\right]$ happens when
$$\frac{\partial}{\partial w}L(w, b, \alpha) = 0.$$
Assume $w \in R^n$. $L(w, b, \alpha)$ is rewritten as
$$\tfrac{1}{2}\sum_{j=1}^{n}w_j^2 - \sum_{i=1}^{l}\alpha_i\left[y_i\sum_{j=1}^{n}w_j\phi(x_i)_j - 1\right]$$
$$\frac{\partial}{\partial w_j}L(w, b, \alpha) = w_j - \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)_j = 0$$
Thus,
$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i).$$
Note that
$$w^Tw = \left(\sum_{i=1}^{l}\alpha_iy_i\phi(x_i)\right)^T\left(\sum_{j=1}^{l}\alpha_jy_j\phi(x_j)\right) = \sum_{i,j}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)$$
The dual is
$$\max_{\alpha\geq 0}\begin{cases}\sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) & \text{if } \sum_{i=1}^{l}\alpha_iy_i = 0,\\ -\infty & \text{if } \sum_{i=1}^{l}\alpha_iy_i \neq 0.\end{cases}$$
−∞ definitely not maximum of the dual
A dual optimal solution does not happen when $\sum_{i=1}^{l}\alpha_iy_i \neq 0$. The dual is simplified to
$$\max_{\alpha\in R^l} \; \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \; i = 1, \ldots, l, \quad y^T\alpha = 0.$$
Karush-Kuhn-Tucker (KKT) conditions
The KKT condition of the dual:
$$Q\alpha - e = -by + \lambda, \qquad \alpha_i\lambda_i = 0, \qquad \lambda_i \geq 0, \; i = 1, \ldots, l$$
The KKT condition of the primal:
$$w = \sum_{i=1}^{l}\alpha_iy_ix_i, \qquad \alpha_i\left(y_i(w^Tx_i + b) - 1\right) = 0, \qquad y^T\alpha = 0, \quad \alpha_i \geq 0$$
Let $\lambda_i = y_i(w^Tx_i + b) - 1$. Then
$$(Q\alpha - e + by)_i = \sum_{j=1}^{l}y_iy_j\alpha_jx_i^Tx_j - 1 + by_i = y_iw^Tx_i - 1 + y_ib = y_i(w^Tx_i + b) - 1$$
The KKT conditions of the primal are the same as those of the dual
More about Dual Problems
w may be infinite
Seriously speaking, this is infinite programming (Lin, 2001a)
In machine learning, quite a few think that a Lagrangian dual exists for any optimization problem
This is wrong
Lagrangian duality usually needs
Convex programming problems
We have them
SVM primal is convex
Constraints are linear
Why do ML people sometimes make such mistakes?
They focus on developing new methods
It is difficult to show a counter example
Large Dense Quadratic Programming
$$\min_\alpha \; \tfrac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to} \quad y^T\alpha = 0, \; 0 \leq \alpha_i \leq C$$
$Q_{ij} \neq 0$: $Q$ is an $l$ by $l$ fully dense matrix
30,000 training points: 30,000 variables
$(30{,}000^2 \times 8/2)$ bytes $\approx$ 3GB of RAM to store $Q$: still difficult
Traditional methods:
Newton and quasi-Newton cannot be directly applied
Current methods:
Decomposition methods (e.g., (Osuna et al., 1997; Joachims, 1998; Platt, 1998))
Nearest point of two convex hulls (e.g., (Keerthi et al., 1999))
Decomposition Methods
Working on a few variables each time
Similar to coordinate-wise minimization
Working set $B$; $N = \{1, \ldots, l\}\setminus B$ fixed
Size of $B$ usually $\leq 100$
Sub-problem in each iteration:
$$\min_{\alpha_B} \; \tfrac{1}{2}\begin{bmatrix}\alpha_B^T & (\alpha_N^k)^T\end{bmatrix}\begin{bmatrix}Q_{BB} & Q_{BN}\\ Q_{NB} & Q_{NN}\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N^k\end{bmatrix} - \begin{bmatrix}e_B^T & e_N^T\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N^k\end{bmatrix}$$
$$\text{subject to} \quad 0 \leq \alpha_t \leq C, \; t \in B, \qquad y_B^T\alpha_B = -y_N^T\alpha_N^k$$
Avoid Memory Problems
The new objective function
$$\tfrac{1}{2}\alpha_B^TQ_{BB}\alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T\alpha_B + \text{constant}$$
Only the $|B|$ columns of $Q$ corresponding to $B$ are needed
Calculated when used
Decomposition Method: the Algorithm
1. Find an initial feasible $\alpha^1$. Set $k = 1$.
2. If $\alpha^k$ is stationary, stop. Otherwise, find a working set $B$. Define $N \equiv \{1, \ldots, l\}\setminus B$.
3. Solve the sub-problem of $\alpha_B$:
$$\min_{\alpha_B} \; \tfrac{1}{2}\alpha_B^TQ_{BB}\alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T\alpha_B \quad \text{subject to} \quad 0 \leq \alpha_t \leq C, \; t \in B, \quad y_B^T\alpha_B = -y_N^T\alpha_N^k$$
4. Set $\alpha_B^{k+1}$ to the sub-problem's optimum and $\alpha_N^{k+1} \equiv \alpha_N^k$; set $k \leftarrow k+1$ and go back to step 2.
Does it Really Work?
Compared to Newton, Quasi-Newton
Slow convergence
However, no need to have very accurate α
$$\mathrm{sgn}\left(\sum_{i=1}^{l}\alpha_iy_iK(x_i, x) + b\right)$$
Prediction not affected much
In some situations, # support vectors ≪ # training points
With the initial $\alpha^1 = 0$, some elements are never used
An example where ML knowledge affect optimization
Working Set Selection
Very important
Better selection ⇒ fewer iterations
But
Better selection ⇒ higher cost per iteration
Two issues:
1. Size
$|B|$ ր ⇒ # iterations ց
$|B|$ ց ⇒ # iterations ր
Size of the Working Set
Keeping all nonzero αi in the working set
If all SVs included ⇒ optimum
Few iterations (i.e., few sub-problems)
Size varies
May still have memory problems
Existing software:
Small and fixed size
Memory problems solved
Though sometimes slower
Sequential Minimal Optimization (SMO)
Consider |B| = 2 (Platt, 1998)
|B| ≥ 2 because of the linear constraint
Extreme of decomposition methods
Sub-problem analytically solved; no need to use optimization software
$$\min_{\alpha_i,\alpha_j} \; \tfrac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}\begin{bmatrix}Q_{ii} & Q_{ij}\\ Q_{ij} & Q_{jj}\end{bmatrix}\begin{bmatrix}\alpha_i\\ \alpha_j\end{bmatrix} + (Q_{BN}\alpha_N^k - e_B)^T\begin{bmatrix}\alpha_i\\ \alpha_j\end{bmatrix}$$
$$\text{s.t.} \quad 0 \leq \alpha_i, \alpha_j \leq C, \qquad y_i\alpha_i + y_j\alpha_j = -y_N^T\alpha_N^k$$
Optimization people may not think this a big advantage
Machine learning people do: they like simple code
A minor advantage in optimization
No need to have inner and outer stopping conditions
$B = \{i, j\}$
Too slow convergence?
With other tricks, |B| = 2 fine in practice
Selection by KKT violation
$$\min_\alpha \; f(\alpha) \quad \text{subject to} \quad y^T\alpha = 0, \; 0 \leq \alpha_i \leq C$$
α stationary if and only if
$$\nabla f(\alpha) + by = \lambda - \mu, \quad \lambda_i\alpha_i = 0, \quad \mu_i(C - \alpha_i) = 0, \quad \lambda_i \geq 0, \; \mu_i \geq 0, \; i = 1, \ldots, l, \quad \text{where } \nabla f(\alpha) \equiv Q\alpha - e$$
Rewritten as
$$\nabla f(\alpha)_i + by_i \geq 0 \text{ if } \alpha_i < C, \qquad \nabla f(\alpha)_i + by_i \leq 0 \text{ if } \alpha_i > 0.$$
Note $y_i = \pm 1$. The KKT conditions are further rewritten as
$$\nabla f(\alpha)_i + b \geq 0 \text{ if } \alpha_i < C, \; y_i = 1, \qquad \nabla f(\alpha)_i - b \geq 0 \text{ if } \alpha_i < C, \; y_i = -1,$$
$$\nabla f(\alpha)_i + b \leq 0 \text{ if } \alpha_i > 0, \; y_i = 1, \qquad \nabla f(\alpha)_i - b \leq 0 \text{ if } \alpha_i > 0, \; y_i = -1$$
A condition on the range of b:
$$\max\{-y_t\nabla f(\alpha)_t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1\} \;\leq\; b \;\leq\; \min\{-y_t\nabla f(\alpha)_t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1\}$$
Define
$$I_{up}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1\}, \quad \text{and} \quad I_{low}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1\}.$$
$\alpha$ is stationary if and only if it is feasible and
$$\max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i \;\leq\; \min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i$$
Violating Pair
KKT equivalent to
[Figure: values of $-y_t\nabla f(\alpha)_t$ for $t \in I_{up}(\alpha)$ and $t \in I_{low}(\alpha)$ on a number line]
Violating pair (Keerthi et al., 2001)
$$i \in I_{up}(\alpha), \quad j \in I_{low}(\alpha), \quad \text{and} \quad -y_i\nabla f(\alpha)_i > -y_j\nabla f(\alpha)_j.$$
Strict decrease if and only if B has at least one violating pair.
However, having violating pair not enough for convergence.
Maximal Violating Pair
If $|B| = 2$, it is natural to choose the indices that most violate the KKT condition:
$$i \in \arg\max_{t\in I_{up}(\alpha^k)} -y_t\nabla f(\alpha^k)_t, \qquad j \in \arg\min_{t\in I_{low}(\alpha^k)} -y_t\nabla f(\alpha^k)_t$$
Can be extended to $|B| > 2$
Calculating Gradient
To find violating pairs, gradient maintained throughout all iterations
A memory problem occurs as $\nabla f(\alpha) = Q\alpha - e$ involves $Q$
Solved by the following tricks
1. $\alpha^1 = 0$ implies $\nabla f(\alpha^1) = Q\cdot 0 - e = -e$
The initial gradient is easily obtained
2. Update $\nabla f(\alpha)$ using only $Q_{BB}$ and $Q_{BN}$:
$$\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + Q(\alpha^{k+1} - \alpha^k) = \nabla f(\alpha^k) + Q_{:,B}(\alpha^{k+1} - \alpha^k)_B$$
Only |B| columns needed per iteration
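Putting the pieces together (maximal violating pair selection, the two-variable sub-problem along a feasible direction, the stopping condition, and the gradient update), the following Python sketch shows a simplified SMO-type decomposition method. It is my own illustration, not LIBSVM's implementation: it keeps the full $Q$ in memory and omits caching and shrinking.

```python
import numpy as np

def smo_solve(K, y, C, eps=1e-3, max_iter=100000):
    """Simplified SMO sketch for  min 0.5*a'Qa - e'a
       s.t. 0 <= a_i <= C, y'a = 0, with Q_ij = y_i y_j K_ij."""
    l = len(y)
    Q = np.outer(y, y) * K
    alpha = np.zeros(l)
    G = -np.ones(l)                                   # gradient at alpha = 0 is -e
    for _ in range(max_iter):
        I_up = [t for t in range(l)
                if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
        I_low = [t for t in range(l)
                 if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
        i = max(I_up, key=lambda t: -y[t] * G[t])     # maximal violating pair
        j = min(I_low, key=lambda t: -y[t] * G[t])
        if -y[i] * G[i] + y[j] * G[j] <= eps:         # stopping condition (2)
            break
        # move along d = y_i*e_i - y_j*e_j, which keeps y'alpha = 0
        a = Q[i, i] + Q[j, j] - 2 * y[i] * y[j] * Q[i, j]
        t = (-y[i] * G[i] + y[j] * G[j]) / max(a, 1e-12)
        # clip the step so both variables stay inside [0, C]
        t = min(t,
                C - alpha[i] if y[i] == 1 else alpha[i],
                alpha[j] if y[j] == 1 else C - alpha[j])
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        G += Q[:, i] * (y[i] * t) + Q[:, j] * (-y[j] * t)   # only two columns of Q used
    return alpha
```

With a precomputed kernel matrix K, a call like `alpha = smo_solve(K, y, C=1.0)` returns an approximate dual solution.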
Selection by Gradient Information
Maximal violating pair same as using gradient information
$$\{i, j\} = \arg\min_{B:|B|=2} \mathrm{Sub}(B), \quad \text{where}$$
$$\mathrm{Sub}(B) \equiv \min_{d_B} \; \nabla f(\alpha^k)_B^Td_B \qquad (1a)$$
$$\text{subject to} \quad y_B^Td_B = 0, \qquad d_t \geq 0 \text{ if } \alpha_t^k = 0, \; t \in B, \qquad (1b)$$
$$d_t \leq 0 \text{ if } \alpha_t^k = C, \; t \in B, \qquad (1c)$$
$$-1 \leq d_t \leq 1, \; t \in B. \qquad (1d)$$
First considered in (Joachims, 1998)
Let $d \equiv [d_B; 0_N]$. (1a) comes from minimizing
$$f(\alpha^k + d) \approx f(\alpha^k) + \nabla f(\alpha^k)^Td = f(\alpha^k) + \nabla f(\alpha^k)_B^Td_B.$$
First order approximation
$0 \leq \alpha_t \leq C$ leads to (1b) and (1c).
$-1 \leq d_t \leq 1$, $t \in B$ avoids a $-\infty$ objective value
Rough explanation connecting to the maximal violating pair:
$$\nabla f(\alpha^k)_id_i + \nabla f(\alpha^k)_jd_j = y_i\nabla f(\alpha^k)_i\cdot y_id_i + y_j\nabla f(\alpha^k)_j\cdot y_jd_j = \left(y_i\nabla f(\alpha^k)_i - y_j\nabla f(\alpha^k)_j\right)\cdot(y_id_i)$$
We used $y_id_i + y_jd_j = 0$
Find $\{i, j\}$ so that $y_i\nabla f(\alpha^k)_i - y_j\nabla f(\alpha^k)_j$ is the smallest, with $y_id_i = 1$, $y_jd_j = -1$
$y_id_i = 1$ corresponds to $i \in I_{up}(\alpha^k)$; $y_jd_j = -1$ corresponds to $j \in I_{low}(\alpha^k)$
Convergence: Maximal Violating Pair
Special case of (Lin, 2001c)
Let $\bar\alpha$ be the limit of a convergent subsequence $\{\alpha^k\}$, $k \in \mathcal{K}$. If it is not stationary, $\exists$ a violating pair
$$\bar i \in I_{up}(\bar\alpha), \quad \bar j \in I_{low}(\bar\alpha), \quad \text{and} \quad -y_{\bar i}\nabla f(\bar\alpha)_{\bar i} + y_{\bar j}\nabla f(\bar\alpha)_{\bar j} > 0$$
If $i \in I_{up}(\bar\alpha)$, then $i \in I_{up}(\alpha^k)$, $\forall k \in \mathcal{K}$ large enough
If $i \in I_{low}(\bar\alpha)$, then $i \in I_{low}(\alpha^k)$, $\forall k \in \mathcal{K}$ large enough
So $\{\bar i, \bar j\}$ is a violating pair at $k \in \mathcal{K}$
From $k$ to $k+1$: if $B^k = \{\bar i, \bar j\}$, then
$\bar i \notin I_{up}(\alpha^{k+1})$ or $\bar j \notin I_{low}(\alpha^{k+1})$
because of the optimality of the sub-problem
If we can show
$$\{\alpha^k\}_{k\in\mathcal{K}} \to \bar\alpha \;\Rightarrow\; \{\alpha^{k+1}\}_{k\in\mathcal{K}} \to \bar\alpha,$$
then $\{\bar i, \bar j\}$ should not be selected at iterations $k, k+1, \ldots, k+r$
A procedure shows that in finitely many iterations it is selected: contradiction
Key of the Proof
Essentially we proved
In finitely many iterations, $B = \{\bar i, \bar j\}$ is selected, giving a contradiction
Can be used to design working sets (Lucidi et al., 2005):
$\exists N > 0$ such that for all $k$, any violating pair of $\alpha^k$ is selected at least once in iterations $k$ to $k+N$
A cyclic selection:
{1, 2}, {1, 3}, . . . , {1, l}, {2, 3}, . . . , {l − 1, l}
Beyond Maximal Violating Pair
Better working sets?
Difficult: # iterations ց but cost per iteration ր
May not imply shorter training time
A selection by second order information (Fan et al., 2005)
As $f$ is quadratic,
$$f(\alpha^k + d) = f(\alpha^k) + \nabla f(\alpha^k)^Td + \tfrac{1}{2}d^T\nabla^2 f(\alpha^k)d = f(\alpha^k) + \nabla f(\alpha^k)_B^Td_B + \tfrac{1}{2}d_B^T\nabla^2 f(\alpha^k)_{BB}d_B$$
Selection by Quadratic Information
Using second order information
$$\min_{B:|B|=2} \mathrm{Sub}(B), \quad \mathrm{Sub}(B) \equiv \min_{d_B} \; \tfrac{1}{2}d_B^T\nabla^2 f(\alpha^k)_{BB}d_B + \nabla f(\alpha^k)_B^Td_B$$
$$\text{subject to} \quad y_B^Td_B = 0, \quad d_t \geq 0 \text{ if } \alpha_t^k = 0, \; t \in B, \quad d_t \leq 0 \text{ if } \alpha_t^k = C, \; t \in B.$$
$-1 \leq d_t \leq 1$, $t \in B$ is not needed if $\nabla^2 f(\alpha^k)_{BB} = Q_{BB}$ is PD
Too expensive to check all $\binom{l}{2}$ sets
A heuristic
1. Select
$$i \in \arg\max_t \{-y_t\nabla f(\alpha^k)_t \mid t \in I_{up}(\alpha^k)\}.$$
2. Select
$$j \in \arg\min_t \{\mathrm{Sub}(\{i, t\}) \mid t \in I_{low}(\alpha^k), \; -y_t\nabla f(\alpha^k)_t < -y_i\nabla f(\alpha^k)_i\}.$$
3. Return $B = \{i, j\}$.
The same $i$ as the maximal violating pair
Check only $O(l)$ possible $B$'s to decide $j$
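A sketch of this heuristic follows. It uses the fact that, when the box constraints are ignored, the optimal value of $\mathrm{Sub}(\{i, t\})$ has the closed form $-b_{it}^2/(2a_{it})$ with $a_{it} = Q_{ii} + Q_{tt} - 2y_iy_tQ_{it}$ and $b_{it} = -y_i\nabla f(\alpha)_i + y_t\nabla f(\alpha)_t$; the function and variable names are my own.

```python
import numpy as np

def select_working_set_2nd_order(G, Q, y, alpha, C, tau=1e-12):
    """Second-order working set heuristic: same i as the maximal violating pair,
       j chosen by the closed-form value of the two-variable sub-problem."""
    l = len(y)
    I_up = [t for t in range(l)
            if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    I_low = [t for t in range(l)
             if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    i = max(I_up, key=lambda t: -y[t] * G[t])
    best_j, best_val = -1, 0.0
    for t in I_low:
        b_it = -y[i] * G[i] + y[t] * G[t]        # consider only t violating with i
        if b_it <= 0:
            continue
        a_it = max(Q[i, i] + Q[t, t] - 2 * y[i] * y[t] * Q[i, t], tau)
        val = -b_it * b_it / (2 * a_it)          # Sub({i, t}) without box constraints
        if val < best_val:
            best_val, best_j = val, t
    return i, best_j
```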
Comparison of Two Selections
Iteration and time ratio between using quadratic information and maximal violating pair
[Figure: time and iteration ratios (quadratic-information selection / maximal violating pair) on the data sets image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a, abalone, cadata, cpusmall, space_ga, and mg; curves shown for time with a 40M cache, time with a 100K cache, and total iterations]
Comparing SVM Software/Methods
In optimization, it is straightforward to compare two methods
Now a comparison under one set of parameters may not be enough
The most suitable way of doing comparisons is unclear yet
In our comparisons, we check two things:
1. Time/total iterations for several parameter sets used in parameter selection
2. Time/iterations for final parameter set
Issues about the Quadratic Selection
Asymptotic convergence holds
Faster convergence than the maximal violating pair
Better approximation per iteration
But lacks global explanation yet
What if we check all $\binom{l}{2}$ sets?
Iteration ratio between checking all and checking $O(l)$:
[Figure: iteration ratios (checking all sets / checking $O(l)$ sets) on the data sets image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, and w1a, for both parameter selection and final training]
Fewer iterations, but ratio (0.7 to 0.8) not enough to justify the higher cost per iteration
Caching and Shrinking
Speed up decomposition methods
Caching (Joachims, 1998)
Store recently used Hessian columns in computer memory
Example
$ time ./libsvm-2.81/svm-train -m 0.01 a4a
11.463s
$ time ./libsvm-2.81/svm-train -m 40 a4a
7.817s
Shrinking (Joachims, 1998)
Some bounded elements remain until the end
Heuristically resized to a smaller problem
After certain iterations, most bounded elements identified and not changed (Lin, 2002)
Stopping Condition
In optimization software such conditions are important
However, don't be surprised if you see no stopping conditions in the optimization code of ML software
Sometimes time/iteration limits more suitable
From the KKT condition,
$$\max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i \;\leq\; \min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i + \epsilon \qquad (2)$$
is
a natural stopping condition
Better Stopping Condition
Now in our software $\epsilon = 10^{-3}$
Past experience: ok but sometimes too strict
At one point we almost changed it to $10^{-1}$
Large $C$ ⇒ large $\nabla f(\alpha)$ components
Too strict ⇒ many iterations
Need a relative condition
Example of Slow Convergence
Using C = 1
$./libsvm-2.81/svm-train -c 1 australian_scale
optimization finished, #iter = 508
obj = -201.642538, rho = 0.044312
Using C = 5000
$./libsvm-2.81/svm-train -c 5000 australian_scale
optimization finished, #iter = 35241
obj = -242509.157367, rho = -7.186733
Optimization researchers may rush to solve difficult cases
That’s what I did in the beginning
It turns out that large $C$ is less used than small $C$
Finite Termination
Given ǫ, finite termination under (2) (Keerthi and Gilbert, 2002; Lin, 2002)
Not implied from asymptotic convergence, as
$$\min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i \;-\; \max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i$$
is not a continuous function of $\alpha$
We worry that $\alpha_i^k \to 0$ with $i \in I_{up}(\alpha^k)\cap I_{low}(\alpha^k)$ causes the program to never end
ML people do not care much about this
Many think finite termination is the same as asymptotic convergence
We are careful about such issues in our software
A good SVM software should
1. be a rigorous numerical optimization code
2. serve the need of users in ML and other areas
Both are equally important
Issues Not Discussed Here
Q not PSD
Solving sub-problems
Analytic form for SMO (two-variable problem)
Linear convergence (Lin, 2001b):
$$f(\alpha^{k+1}) - f(\bar\alpha) \leq c\left(f(\alpha^k) - f(\bar\alpha)\right)$$
Best worst-case analysis
Practical Use of SVM
Let Us Try An Example
A problem from astroparticle physics
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1.0 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1.0 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1.0 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
Training and testing sets available: 3,089 and 4,000 points
Data format is an issue
SVM software:
LIBSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Now one of the most used SVM software packages
Installation
On Unix:
Download the zip file and make
On Windows:
Download the zip file and
c:\> nmake -f Makefile.win
Windows binaries included in the package
Usage of
LIBSVM
Training
Usage: svm-train [options] training_set_file [model_file] options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
Testing
Training and Testing
Training
$./svm-train train.1
...*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
Testing
$./svm-predict test.1 train.1.model test.1.predict
Accuracy = 66.925% (2677/4000)
What does this Output Mean
obj: the optimal objective value of the dual SVM
rho: $-b$ in the decision function
nSV and nBSV: number of support vectors and bounded support vectors
(i.e., αi = C).
nu-svm is a somewhat equivalent form of C-SVM where C is replaced by ν.
Why this Fails
After training, nearly 100% of the data are support vectors
Training and testing accuracy are very different
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
RBF kernel used: $e^{-\gamma\|x_i - x_j\|^2}$
Then
$$K_{ij} \begin{cases} = 1 & \text{if } i = j,\\ \to 0 & \text{if } i \neq j.\end{cases}$$
$K \to I$:
$$\min_\alpha \; \tfrac{1}{2}\alpha^T\alpha - e^T\alpha \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad y^T\alpha = 0$$
Optimal solution
$$2 > \alpha = e - \frac{y^Te}{l}\,y > 0$$
$\alpha_i > 0$ ⇒ $y_i(w^Tx_i + b) = 1$
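The effect can be seen directly on the kernel matrix: on unscaled attributes with large ranges, the RBF kernel matrix is essentially the identity. The random data and the scaling factor below are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4)) * 100.0          # unscaled attributes with large ranges

def rbf_gram(X, gamma):
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

print(np.round(rbf_gram(X, 0.25), 3))          # essentially the identity matrix: K -> I
print(np.round(rbf_gram(X / 100.0, 0.25), 3))  # after scaling, off-diagonals are no longer ~0
```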
Data Scaling
Without scaling
Attributes in greater numeric ranges may dominate
Example:
      height  gender
x1    150     F
x2    180     M
x3    185     M
and $y_1 = 0$, $y_2 = 1$, $y_3 = 1$.
The separating hyperplane
[Figure: points x1, x2, x3 and the separating hyperplane in the original scale]
Decision strongly depends on the first attribute
What if the second is more important?
Linearly scale the first attribute to $[0, 1]$ by
$$\frac{\text{1st attribute} - 150}{185 - 150},$$
New points and separating hyperplane
[Figure: scaled points x1, x2, x3 and the new separating hyperplane]
[Figure: the new hyperplane transformed back to the original space]
The second attribute plays a role
More about Data Scaling
A common mistake
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Same factor on training and testing
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
We store the scaling factors used in training and apply them to the testing set
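In a script, the same idea looks like the sketch below: learn the per-attribute scaling factors on the training data only and reuse them on the test data (the role of svm-scale -s and -r). The function names are my own.

```python
import numpy as np

def fit_scaling(X, lower=-1.0, upper=1.0):
    """Learn per-attribute min/max on the TRAINING data only."""
    return X.min(axis=0), X.max(axis=0), lower, upper

def apply_scaling(X, factors):
    """Apply the same linear scaling to training or testing data."""
    col_min, col_max, lower, upper = factors
    return lower + (upper - lower) * (X - col_min) / (col_max - col_min)

X_train = np.array([[150.0, 0.0], [180.0, 1.0], [185.0, 1.0]])  # height, gender
factors = fit_scaling(X_train)
print(apply_scaling(X_train, factors))
# the SAME factors must be applied to any test set, as with svm-scale -s / -r
```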
More on Training
Train scaled data and then prediction
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
Training accuracy now is
$./svm-predict train.1.scale train.1.scale.model
Accuracy = 96.439% (2979/3089) (classification)
Default parameters: $C = 1$, $\gamma = 0.25$
Different Parameters
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model
Accuracy = 100% (3089/3089) (classification)
100% training accuracy but
$./svm-predict test.1.scale train.1.scale.model
Accuracy = 82.7% (3308/4000) (classification)
Very bad test accuracy
Overfitting and Underfitting
When training and predicting data, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
Overfitting
In theory
You can easily achieve 100% training accuracy
This is useless
Surprisingly
Many application papers did this
Parameter Selection
Sometimes important
Now the parameters are
$C$ and kernel parameters
Example:
$\gamma$ of $e^{-\gamma\|x_i - x_j\|^2}$
$a$, $b$, $d$ of $(x_i^Tx_j/a + b)^d$
How to select them?
Performance Evaluation
Training errors not important; only test errors count
$l$ training data, $x_i \in R^n$, $y_i \in \{+1, -1\}$, $i = 1, \ldots, l$, and a learning machine:
$$x \to f(x, \alpha), \quad f(x, \alpha) = 1 \text{ or } -1.$$
Different $\alpha$: different machines
The expected test error (generalization error)
$$R(\alpha) = \int \tfrac{1}{2}|y - f(x, \alpha)|\, dP(x, y)$$
$y$: class of $x$ (i.e., 1 or $-1$)
$P(x, y)$ unknown; the empirical risk (training error):
$$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l}|y_i - f(x_i, \alpha)|$$
$\tfrac{1}{2}|y_i - f(x_i, \alpha)|$: the loss. Choose $0 \leq \eta \leq 1$; with probability at least $1 - \eta$:
$$R(\alpha) \leq R_{emp}(\alpha) + \text{another term}$$
A good pattern recognition method: minimize both terms at the same time, not only $R_{emp}(\alpha) \to 0$
Performance Evaluation (Cont.)
In practice
Available data ⇒ training, validation, and (testing)
Train + validation ⇒ model
k-fold cross validation:
Data randomly separated to k groups.
Each time $k-1$ groups as training and one as testing
Select parameters with the highest CV accuracy
Another optimization problem
A Simple Procedure
1. Conduct simple scaling on the data
2. Consider the RBF kernel $K(x, y) = e^{-\gamma\|x - y\|^2}$
3. Use cross-validation to find the best parameters C and γ
4. Use the best C and γ to train the whole training set
5. Test
Best $C$ and $\gamma$ from training on $k-1$ folds versus the whole set?
In theory, a minor difference
No problem in practice
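A hedged sketch of this procedure using scikit-learn is shown below; the library choice and the synthetic data are my assumptions, since the talk itself uses the LIBSVM command-line tools.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# hypothetical data standing in for the scaled training set of the talk
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)

best = (None, None, -1.0)
for log2C in range(-5, 16, 2):            # coarse grid in log scale, like grid.py
    for log2g in range(-15, 4, 2):
        C, gamma = 2.0 ** log2C, 2.0 ** log2g
        acc = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()  # 5-fold CV
        if acc > best[2]:
            best = (C, gamma, acc)
print("best C = %g, gamma = %g, CV accuracy = %.4f" % best)
model = SVC(C=best[0], gamma=best[1]).fit(X, y)   # retrain on the whole set, then test
```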
Why trying RBF Kernel First
Linear kernel: special case of RBF (Keerthi and Lin, 2003)
Leave-one-out cross-validation accuracy of linear is the same as RBF under certain parameters
Related to optimization as well
Polynomial kernel: numerical difficulties: $(<1)^d \to 0$, $(>1)^d \to \infty$
More parameters than RBF
Parameter Selection in
LIBSVM
grid search + CV
$./grid.py train.1 train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
. . .
Contour of Parameter Selection
[Figure: contour of cross-validation accuracy (97 to 98.8%) over $\log_2 C$ from 1 to 7 and $\log_2\gamma$ from $-2$ to 3]
Simple script in LIBSVM
easy.py: a script for dummies
$python easy.py train.1 test.1
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Example: Engine Misfire Detection
Problem Description
First problem of IJCNN Challenge 2001, data from Ford
Given time series of length $T = 50{,}000$
The kth data
x1(k), x2(k), x3(k), x4(k), x5(k), y(k)
y(k) = ±1: output, affected only by x1(k), . . . , x4(k)
$x_5(k) = 1$: the $k$th data point is considered for evaluating accuracy
50,000 training data, 100,000 testing data (in two sets)
Past and future information may affect y(k)
$x_1(k)$: periodically nine 0s, one 1, nine 0s, one 1, and so on
Example:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
$x_4(k)$ more important
Background: Engine Misfire Detection
How engine works
Air-fuel mixture injected to cylinder
intake, compression, combustion, exhaust
Engine misfire: a substantial fraction of a cylinder’s air-fuel mixture fails to ignite
Frequent misfires: pollutants and costly replacement
On-board detection:
Engine crankshaft rotational dynamics with a position sensor
Training data: from some expensive experimental environment
Encoding Schemes
For SVM: each data is a vector
$x_1(k)$: periodically nine 0s, one 1, nine 0s, one 1, ...
Either 10 binary attributes $x_1(k-5), \ldots, x_1(k+4)$ for the $k$th data,
or $x_1(k)$ encoded as an integer in 1 to 10
Which one is better
We think 10 binaries better for SVM
x4(k) more important
Including x4(k − 5), . . . , x4(k + 4) for the kth data
Each training data: 22 attributes
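A possible encoding routine is sketched below. The exact layout of the 22 attributes is my assumption: the 10 binary values $x_1(k-5),\ldots,x_1(k+4)$, the current $x_2(k)$ and $x_3(k)$, and the 10 values $x_4(k-5),\ldots,x_4(k+4)$; boundary handling at the ends of the series is omitted.

```python
import numpy as np

def encode(k, x1, x2, x3, x4):
    """Sketch of one 22-attribute vector for the k-th point (0-indexed arrays).
    Layout assumed, not stated in the talk; requires 5 <= k <= len(x1) - 5."""
    window = range(k - 5, k + 5)
    return np.concatenate([[x1[t] for t in window],   # 10 binary attributes
                           [x2[k], x3[k]],            # current x2, x3
                           [x4[t] for t in window]])  # 10 values of the more important x4
```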
Training SVM
Selecting parameters; generating a good model for prediction
RBF kernel $K(x_i, x_j) = \phi(x_i)^T\phi(x_j) = e^{-\gamma\|x_i - x_j\|^2}$
Two parameters: γ and C
Five-fold cross validation on the 50,000 data
Data randomly separated to five groups
Each time four as training and one as testing
Use $C = 2^4$, $\gamma = 2^2$ and train the 50,000 data for the final model
[Figure: contour of cross-validation accuracy (97 to 98.8%) over $\log_2 C$ from 1 to 7 and $\log_2\gamma$ from $-2$ to 3]
Test set 1: 656 errors, Test set 2: 637 errors
About 3,000 support vectors out of 50,000 training data
A good case for SVM
This is just the outline. There are other details.
Conclusions
SVM optimization issues are challenging
Quite extensively studied
But better results still possible
Why work on machine learning? It is less mature than optimization
More new issues
Many other optimization issues arise from machine learning
Need to study things useful for ML tasks
While we complain about ML people's lack of optimization knowledge, we must admit this fact first:
ML people focus on developing methods, so they pay less attention to optimization details
Only if we widely apply solid optimization techniques to machine learning
can the contribution of optimization in ML be recognized