Optimization, Support Vector Machines,
and Machine Learning
Chih-Jen Lin
Department of Computer Science National Taiwan University
Outline
Introduction to machine learning and support vector machines (SVM)
SVM and optimization theory
SVM and numerical optimization
Practical use of SVM
Talk slides available at
http://www.csie.ntu.edu.tw/~cjlin/talks/rome.pdf
This talk intends to give optimization researchers an overview of SVM research
What Is Machine Learning?
Extract knowledge from data
Classification, clustering, and others
We focus only on classification here
Many new optimization issues
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Examples
Handwritten digit recognition
Spam filtering
Training and testing
Methods:
Nearest neighbor
Neural networks
Decision trees
Support vector machines: another popular method
Main topic of this talk
Machine learning, applied statistics, pattern recognition
Very similar, but slightly different focuses
As it is more applied, machine learning is a bigger field
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Consider a simple case with two classes: define a vector $y$ by
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1}\\ -1 & \text{if } x_i \text{ in class 2}\end{cases}$$
A hyperplane which separates all data
A separating hyperplane $w^Tx + b = 0$ (with the hyperplanes $w^Tx + b = +1$ and $w^Tx + b = -1$ on either side):
$$w^Tx_i + b > 0 \text{ if } y_i = 1, \qquad w^Tx_i + b < 0 \text{ if } y_i = -1$$
Decision function $f(x) = \mathrm{sign}(w^Tx + b)$, $x$: test data
Variables $w$ and $b$: need to know the coefficients of a plane
Many possible choices of w and b
Select w, b with the maximal margin.
Maximal distance between $w^Tx + b = 1$ and $w^Tx + b = -1$
$$w^Tx_i + b \geq 1 \text{ if } y_i = 1, \qquad w^Tx_i + b \leq -1 \text{ if } y_i = -1$$
Distance between $w^Tx + b = 1$ and $-1$: $2/\|w\| = 2/\sqrt{w^Tw}$
$$\max \; 2/\|w\| \;\equiv\; \min \; w^Tw/2$$
$$\min_{w,b} \; \tfrac{1}{2}w^Tw \quad \text{subject to} \quad y_i(w^Tx_i + b) \geq 1, \; i = 1, \ldots, l.$$
A nonlinear programming problem
A 3-D demonstration
Notations very different from optimization
Well, this is unavoidable
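To make the maximal-margin formulation concrete, below is a minimal sketch that solves the hard-margin primal on a tiny two-class data set. The cvxpy package and the toy points are my own choices for illustration, not part of the talk.

```python
import numpy as np
import cvxpy as cp

# toy linearly separable data in R^2 (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)   # coefficients of the maximal-margin separating hyperplane
```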
Higher Dimensional Feature Spaces
Earlier we tried to find a linear separating hyperplane
Data may not be linearly separable
Non-separable case: allow training errors
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l$$
$\xi_i > 1$: $x_i$ not on the correct side of the separating plane
Nonlinear case: linearly separable in other spaces?
Higher dimensional (maybe infinite) feature space
$$\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$$
Example: $x \in R^3$, $\phi(x) \in R^{10}$
$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$
A standard problem (Cortes and Vapnik, 1995):
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l.$$
Finding the Decision Function
w: a vector in a high dimensional space ⇒ maybe infinite variables
The dual problem
$$\min_\alpha \; \tfrac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad y^T\alpha = 0,$$
where $Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$
At optimum, $w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$
Primal and dual: discussed later
A finite problem:
#variables = #training data
$Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ needs a closed form
Efficient calculation of high dimensional inner products
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$
$$\phi(x_i) = (1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$.
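The closed form can be checked numerically. The sketch below compares the explicit 10-dimensional mapping with $(1 + x_i^Tx_j)^2$; the particular vectors are arbitrary choices of mine.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^3 (10 dimensions)."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])

lhs = phi(xi) @ phi(xj)        # inner product in the 10-dimensional feature space
rhs = (1.0 + xi @ xj) ** 2     # closed-form kernel value
print(lhs, rhs)                # both print the same number
```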
Kernel Tricks
Kernel: $K(x, y) = \phi(x)^T\phi(y)$
No need to explicitly know $\phi(x)$
Common kernels:
$$K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2} \quad \text{(Radial Basis Function)}$$
$$K(x_i, x_j) = (x_i^Tx_j/a + b)^d \quad \text{(Polynomial kernel)}$$
They can be inner products in an infinite dimensional space
Assume $x \in R^1$ and $\gamma > 0$.
$$e^{-\gamma\|x_i - x_j\|^2} = e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_ix_j - \gamma x_j^2}$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_ix_j}{1!} + \frac{(2\gamma x_ix_j)^2}{2!} + \frac{(2\gamma x_ix_j)^3}{3!} + \cdots\right)$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1\cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i\cdot\sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2\cdot\sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3\cdot\sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) = \phi(x_i)^T\phi(x_j),$$
where
$$\phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}\,x, \sqrt{\frac{(2\gamma)^2}{2!}}\,x^2, \sqrt{\frac{(2\gamma)^3}{3!}}\,x^3, \cdots\right]^T.$$
Decision function
$w$: maybe an infinite vector
At optimum, $w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$
Decision function
$$w^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_iK(x_i, x) + b$$
No need to have w
$> 0$: 1st class, $< 0$: 2nd class
Only $\phi(x_i)$ with $\alpha_i > 0$ are used
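As a small illustration of predicting with only the support vectors, here is a hedged sketch; the RBF kernel choice, the sample support vectors, and the $\alpha$, $b$ values are made-up placeholders, not trained values.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, support_vectors, support_y, alpha, b, gamma):
    """Evaluate sum_i alpha_i * y_i * K(x_i, x) + b over support vectors only."""
    return sum(a * yi * rbf_kernel(xi, x, gamma)
               for a, yi, xi in zip(alpha, support_y, support_vectors)) + b

# hypothetical support vectors and coefficients, just to show the call
sv = np.array([[0.0, 1.0], [1.0, 0.0]])
val = decision_value(np.array([0.5, 0.5]), sv, [1, -1], [0.3, 0.3], 0.1, gamma=0.5)
print(1 if val > 0 else -1)   # predicted class
```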
Support Vectors: More Important Data
[Figure: a two-class example; the support vectors are the points closest to the decision boundary]
Is Kernel Really Useful?
Training data mapped to be linearly independent ⇒ separable
Except for this, we know little about high dimensional spaces
Kernel selection is another issue
On the one hand, very few general kernels
On the other hand, people try to design kernels specific to applications
SVM and Optimization
Dual problem is essential for SVM
There are other optimization issues in SVM
But things are not that simple
If SVM were not a good method, it would be useless to study its optimization issues
Optimization in ML Research
Every day there are new classification methods
Most are related to optimization problems
Most will never be popular
Things optimization people focus on (e.g., convergence rate) may not be that important for ML people
In machine learning
The use of optimization techniques is sometimes not rigorous
Usually an optimization algorithm should guarantee
1. Strictly decreasing
2. Convergence to a stationary point
3. Convergence rate
In some ML papers, 1 does not even hold
Some wrongly think 1 and 2 are the same
Status of SVM
Existing methods:
Nearest neighbor, Neural networks, decision trees.
SVM: similar status (competitive but may not be better)
In my opinion, after careful data pre-processing
Appropriately use NN or SVM ⇒ similar accuracy
But users may not use them properly
The chance of SVM
Easier for users to use it appropriately
Replacing NN in some applications
So SVM has survived as a ML method
There are needs to seriously study its optimization issues
A Primal-Dual Example
Let us have an example before deriving the dual
To check the primal-dual relationship:
$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$$
Two training data in $R^1$:
[Figure: △ at $x = 0$ and ● at $x = 1$]
What is the separating hyperplane ?
Primal Problem
$x_1 = 0$, $x_2 = 1$ with $y = [-1, 1]^T$. Primal problem
$$\min_{w,b} \; \tfrac{1}{2}w^2 \quad \text{subject to} \quad w\cdot 1 + b \geq 1, \quad -1(w\cdot 0 + b) \geq 1.$$
The constraints give $-b \geq 1$ and $w \geq 1 - b \geq 2$. We are minimizing $\tfrac{1}{2}w^2$, so the smallest feasible $w$ is $w = 2$.
(w, b) = (2, −1) optimal solution.
The separating hyperplane 2x − 1 = 0
[Figure: the separating point $x = 1/2$ between △ at 0 and ● at 1]
Dual Problem
Formula without penalty parameter C
$$\min_{\alpha\in R^l} \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \geq 0, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}\alpha_iy_i = 0.$$
Get the objective function
$x_1^Tx_1 = 0$, $x_1^Tx_2 = 0$, $x_2^Tx_1 = 0$, $x_2^Tx_2 = 1$
Objective function
$$\tfrac{1}{2}\alpha_2^2 - (\alpha_1 + \alpha_2) = \tfrac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}.$$
Constraints: $\alpha_1 - \alpha_2 = 0$, $0 \leq \alpha_1$, $0 \leq \alpha_2$.
Substituting $\alpha_2 = \alpha_1$ into the objective function gives $\tfrac{1}{2}\alpha_1^2 - 2\alpha_1$
Smallest value at $\alpha_1 = 2$; $\alpha_2 = 2$ as well
$[2, 2]^T$ satisfies $0 \leq \alpha_1$ and $0 \leq \alpha_2$, so it is optimal
Primal-dual relation: $w = y_1\alpha_1x_1 + y_2\alpha_2x_2 = (-1)\cdot 2\cdot 0 + 1\cdot 2\cdot 1 = 2$
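To double-check this toy example numerically, one can hand the dual to a generic solver. The sketch below uses scipy.optimize, which is my own illustration (not how SVM software actually solves the dual).

```python
import numpy as np
from scipy.optimize import minimize

# toy example: x1 = 0, x2 = 1, y = [-1, 1]
x = np.array([0.0, 1.0])
y = np.array([-1.0, 1.0])
Q = np.outer(y, y) * np.outer(x, x)                 # Q_ij = y_i y_j x_i x_j

def dual_obj(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: y @ a},)    # y^T alpha = 0
bnds = [(0, None), (0, None)]                       # alpha_i >= 0 (no C here)
res = minimize(dual_obj, x0=np.zeros(2), bounds=bnds, constraints=cons)
alpha = res.x
w = (alpha * y) @ x                                 # primal-dual relation
b = y[1] - w * x[1]                                 # from the active constraint y_2(w x_2 + b) = 1
print(alpha, w, b)                                  # expect alpha ≈ [2, 2], w ≈ 2, b ≈ -1
```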
SVM Primal and Dual
Standard SVM (Primal)
$$\min_{w,b,\xi} \; \tfrac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, l.$$
$w$: huge (maybe infinite) vector variable
Practically we solve dual, a different but related problem
Dual problem
$$\min_\alpha \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}y_i\alpha_i = 0.$$
$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ available in closed form
$\alpha$: $l$ variables; finite
Primal Dual Relationship
At optimum
$$\bar{w} = \sum_{i=1}^{l}\bar{\alpha}_iy_i\phi(x_i), \qquad \tfrac{1}{2}\bar{w}^T\bar{w} + C\sum_{i=1}^{l}\bar{\xi}_i = e^T\bar{\alpha} - \tfrac{1}{2}\bar{\alpha}^TQ\bar{\alpha},$$
where $e = [1, \ldots, 1]^T$.
Primal objective value = $-$(Dual objective value)
How does this dual come from ?
Derivation of the Dual
Consider a simpler problem
$$\min_{w,b} \; \tfrac{1}{2}w^Tw \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \geq 1, \; i = 1, \ldots, l.$$
Its dual
$$\min_\alpha \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \leq \alpha_i, \; i = 1, \ldots, l, \quad \sum_{i=1}^{l}y_i\alpha_i = 0.$$
Lagrangian Dual
Defined as
$$\max_{\alpha\geq 0}\left(\min_{w,b} L(w, b, \alpha)\right), \quad \text{where} \quad L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l}\alpha_i\left[y_i(w^T\phi(x_i) + b) - 1\right]$$
Strong duality: min Primal $= \max_{\alpha\geq 0}\left(\min_{w,b} L(w, b, \alpha)\right)$
Simplify the dual. When $\alpha$ is fixed,
$$\min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l}\alpha_iy_i \neq 0,\\ \min_w \; \tfrac{1}{2}w^Tw - \sum_{i=1}^{l}\alpha_i\left[y_iw^T\phi(x_i) - 1\right] & \text{if } \sum_{i=1}^{l}\alpha_iy_i = 0.\end{cases}$$
If $\sum_{i=1}^{l}\alpha_iy_i \neq 0$, we can decrease $-b\sum_{i=1}^{l}\alpha_iy_i$ in $L(w, b, \alpha)$ to $-\infty$
If $\sum_{i=1}^{l}\alpha_iy_i = 0$, the optimum of the strictly convex function $\tfrac{1}{2}w^Tw - \sum_{i=1}^{l}\alpha_i\left[y_iw^T\phi(x_i) - 1\right]$ happens when
$$\frac{\partial}{\partial w}L(w, b, \alpha) = 0.$$
Assume $w \in R^n$. $L(w, b, \alpha)$ is rewritten as
$$\tfrac{1}{2}\sum_{j=1}^{n}w_j^2 - \sum_{i=1}^{l}\alpha_i\left[y_i\sum_{j=1}^{n}w_j\phi(x_i)_j - 1\right]$$
$$\frac{\partial}{\partial w_j}L(w, b, \alpha) = w_j - \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)_j = 0$$
Thus,
$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i).$$
Note that
$$w^Tw = \left(\sum_{i=1}^{l}\alpha_iy_i\phi(x_i)\right)^T\left(\sum_{j=1}^{l}\alpha_jy_j\phi(x_j)\right) = \sum_{i,j}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)$$
The dual is
$$\max_{\alpha\geq 0}\begin{cases}\sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) & \text{if } \sum_{i=1}^{l}\alpha_iy_i = 0,\\ -\infty & \text{if } \sum_{i=1}^{l}\alpha_iy_i \neq 0.\end{cases}$$
−∞ definitely not maximum of the dual
A dual optimal solution does not happen when $\sum_{i=1}^{l}\alpha_iy_i \neq 0$. The dual is simplified to
$$\max_{\alpha\in R^l} \; \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \; i = 1, \ldots, l, \quad y^T\alpha = 0.$$
Karush-Kuhn-Tucker (KKT) conditions
The KKT condition of the dual:
$$Q\alpha - e = -by + \lambda, \qquad \alpha_i\lambda_i = 0, \qquad \lambda_i \geq 0, \; i = 1, \ldots, l$$
The KKT condition of the primal:
$$w = \sum_{i=1}^{l}\alpha_iy_ix_i, \qquad \alpha_i\left(y_i(w^Tx_i + b) - 1\right) = 0, \qquad y^T\alpha = 0, \quad \alpha_i \geq 0$$
Let $\lambda_i = y_i(w^Tx_i + b) - 1$. Then
$$(Q\alpha - e + by)_i = \sum_{j=1}^{l}y_iy_j\alpha_jx_i^Tx_j - 1 + by_i = y_iw^Tx_i - 1 + y_ib = y_i(w^Tx_i + b) - 1$$
The KKT conditions of the primal are the same as those of the dual
More about Dual Problems
w may be infinite
Seriously speaking, this is infinite programming (Lin, 2001a)
In machine learning, quite a few think that a Lagrangian dual exists for any optimization problem
This is wrong
Lagrangian duality usually needs
Convex programming problems
We have them
SVM primal is convex
Constraints are linear
Why do ML people sometimes make such mistakes?
They focus on developing new methods
It is difficult to show a counter example
Large Dense Quadratic Programming
$$\min_\alpha \; \tfrac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to} \quad y^T\alpha = 0, \; 0 \leq \alpha_i \leq C$$
$Q_{ij} \neq 0$: $Q$ is an $l$ by $l$ fully dense matrix
30,000 training points: 30,000 variables
$(30{,}000^2 \times 8/2)$ bytes $\approx$ 3GB of RAM to store $Q$: still difficult
Traditional methods:
Newton and quasi-Newton cannot be directly applied
Current methods:
Decomposition methods (e.g., (Osuna et al., 1997; Joachims, 1998; Platt, 1998))
Nearest point of two convex hulls (e.g., (Keerthi et al., 1999))
Decomposition Methods
Working on a few variables each time
Similar to coordinate-wise minimization
Working set $B$; $N = \{1, \ldots, l\}\setminus B$ fixed
Size of $B$ usually $\leq 100$
Sub-problem in each iteration:
$$\min_{\alpha_B} \; \tfrac{1}{2}\begin{bmatrix}\alpha_B^T & (\alpha_N^k)^T\end{bmatrix}\begin{bmatrix}Q_{BB} & Q_{BN}\\ Q_{NB} & Q_{NN}\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N^k\end{bmatrix} - \begin{bmatrix}e_B^T & e_N^T\end{bmatrix}\begin{bmatrix}\alpha_B\\ \alpha_N^k\end{bmatrix}$$
$$\text{subject to} \quad 0 \leq \alpha_t \leq C, \; t \in B, \qquad y_B^T\alpha_B = -y_N^T\alpha_N^k$$
Avoid Memory Problems
The new objective function
$$\tfrac{1}{2}\alpha_B^TQ_{BB}\alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T\alpha_B + \text{constant}$$
Only the $|B|$ columns of $Q$ corresponding to $B$ are needed
Calculated when used
Decomposition Method: the Algorithm
1. Find an initial feasible $\alpha^1$. Set $k = 1$.
2. If $\alpha^k$ is stationary, stop. Otherwise, find a working set $B$. Define $N \equiv \{1, \ldots, l\}\setminus B$.
3. Solve the sub-problem of $\alpha_B$:
$$\min_{\alpha_B} \; \tfrac{1}{2}\alpha_B^TQ_{BB}\alpha_B + (-e_B + Q_{BN}\alpha_N^k)^T\alpha_B \quad \text{subject to} \quad 0 \leq \alpha_t \leq C, \; t \in B, \quad y_B^T\alpha_B = -y_N^T\alpha_N^k$$
4. Set $\alpha_B^{k+1}$ to the sub-problem's optimum and $\alpha_N^{k+1} \equiv \alpha_N^k$; set $k \leftarrow k+1$ and go back to step 2.
Does it Really Work?
Compared to Newton, Quasi-Newton
Slow convergence
However, no need to have very accurate α
$$\mathrm{sgn}\left(\sum_{i=1}^{l}\alpha_iy_iK(x_i, x) + b\right)$$
Prediction not affected much
In some situations, # support vectors ≪ # training points
With the initial $\alpha^1 = 0$, some elements are never used
An example where ML knowledge affect optimization
Working Set Selection
Very important
Better selection ⇒ fewer iterations
But
Better selection ⇒ higher cost per iteration
Two issues:
1. Size
$|B|$ ր ⇒ # iterations ց
$|B|$ ց ⇒ # iterations ր
Size of the Working Set
Keeping all nonzero αi in the working set
If all SVs included ⇒ optimum
Few iterations (i.e., few sub-problems)
Size varies
May still have memory problems
Existing software:
Small and fixed size
Memory problems solved
Though sometimes slower
Sequential Minimal Optimization (SMO)
Consider |B| = 2 (Platt, 1998)
|B| ≥ 2 because of the linear constraint
Extreme of decomposition methods
Sub-problem analytically solved; no need to use optimization software
$$\min_{\alpha_i,\alpha_j} \; \tfrac{1}{2}\begin{bmatrix}\alpha_i & \alpha_j\end{bmatrix}\begin{bmatrix}Q_{ii} & Q_{ij}\\ Q_{ij} & Q_{jj}\end{bmatrix}\begin{bmatrix}\alpha_i\\ \alpha_j\end{bmatrix} + (Q_{BN}\alpha_N^k - e_B)^T\begin{bmatrix}\alpha_i\\ \alpha_j\end{bmatrix}$$
$$\text{s.t.} \quad 0 \leq \alpha_i, \alpha_j \leq C, \qquad y_i\alpha_i + y_j\alpha_j = -y_N^T\alpha_N^k$$
Optimization people may not think this a big advantage
Machine learning people do: they like simple code
A minor advantage in optimization
No need to have inner and outer stopping conditions
$B = \{i, j\}$
Too slow convergence?
With other tricks, |B| = 2 fine in practice
Selection by KKT violation
$$\min_\alpha \; f(\alpha) \quad \text{subject to} \quad y^T\alpha = 0, \; 0 \leq \alpha_i \leq C$$
α stationary if and only if
$$\nabla f(\alpha) + by = \lambda - \mu, \quad \lambda_i\alpha_i = 0, \quad \mu_i(C - \alpha_i) = 0, \quad \lambda_i \geq 0, \; \mu_i \geq 0, \; i = 1, \ldots, l, \quad \text{where } \nabla f(\alpha) \equiv Q\alpha - e$$
Rewritten as
$$\nabla f(\alpha)_i + by_i \geq 0 \text{ if } \alpha_i < C, \qquad \nabla f(\alpha)_i + by_i \leq 0 \text{ if } \alpha_i > 0.$$
Note $y_i = \pm 1$. The KKT conditions are further rewritten as
$$\nabla f(\alpha)_i + b \geq 0 \text{ if } \alpha_i < C, \; y_i = 1, \qquad \nabla f(\alpha)_i - b \geq 0 \text{ if } \alpha_i < C, \; y_i = -1,$$
$$\nabla f(\alpha)_i + b \leq 0 \text{ if } \alpha_i > 0, \; y_i = 1, \qquad \nabla f(\alpha)_i - b \leq 0 \text{ if } \alpha_i > 0, \; y_i = -1$$
A condition on the range of b:
$$\max\{-y_t\nabla f(\alpha)_t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1\} \;\leq\; b \;\leq\; \min\{-y_t\nabla f(\alpha)_t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1\}$$
Define
$$I_{up}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = 1 \text{ or } \alpha_t > 0, y_t = -1\}, \quad \text{and} \quad I_{low}(\alpha) \equiv \{t \mid \alpha_t < C, y_t = -1 \text{ or } \alpha_t > 0, y_t = 1\}.$$
$\alpha$ is stationary if and only if it is feasible and
$$\max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i \;\leq\; \min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i$$
Violating Pair
KKT equivalent to
[Figure: values of $-y_t\nabla f(\alpha)_t$ for $t \in I_{up}(\alpha)$ and $t \in I_{low}(\alpha)$ on a number line]
Violating pair (Keerthi et al., 2001)
$$i \in I_{up}(\alpha), \quad j \in I_{low}(\alpha), \quad \text{and} \quad -y_i\nabla f(\alpha)_i > -y_j\nabla f(\alpha)_j.$$
Strict decrease if and only if B has at least one violating pair.
However, having violating pair not enough for convergence.
Maximal Violating Pair
If $|B| = 2$, it is natural to choose the indices that most violate the KKT condition:
$$i \in \arg\max_{t\in I_{up}(\alpha^k)} -y_t\nabla f(\alpha^k)_t, \qquad j \in \arg\min_{t\in I_{low}(\alpha^k)} -y_t\nabla f(\alpha^k)_t$$
Can be extended to $|B| > 2$
Calculating Gradient
To find violating pairs, gradient maintained throughout all iterations
A memory problem occurs as $\nabla f(\alpha) = Q\alpha - e$ involves $Q$
Solved by the following tricks
1. $\alpha^1 = 0$ implies $\nabla f(\alpha^1) = Q\cdot 0 - e = -e$
The initial gradient is easily obtained
2. Update $\nabla f(\alpha)$ using only $Q_{BB}$ and $Q_{BN}$:
$$\nabla f(\alpha^{k+1}) = \nabla f(\alpha^k) + Q(\alpha^{k+1} - \alpha^k) = \nabla f(\alpha^k) + Q_{:,B}(\alpha^{k+1} - \alpha^k)_B$$
Only |B| columns needed per iteration
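Putting the pieces together (maximal violating pair selection, the two-variable sub-problem along a feasible direction, the stopping condition, and the gradient update), the following Python sketch shows a simplified SMO-type decomposition method. It is my own illustration, not LIBSVM's implementation: it keeps the full $Q$ in memory and omits caching and shrinking.

```python
import numpy as np

def smo_solve(K, y, C, eps=1e-3, max_iter=100000):
    """Simplified SMO sketch for  min 0.5*a'Qa - e'a
       s.t. 0 <= a_i <= C, y'a = 0, with Q_ij = y_i y_j K_ij."""
    l = len(y)
    Q = np.outer(y, y) * K
    alpha = np.zeros(l)
    G = -np.ones(l)                                   # gradient at alpha = 0 is -e
    for _ in range(max_iter):
        I_up = [t for t in range(l)
                if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
        I_low = [t for t in range(l)
                 if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
        i = max(I_up, key=lambda t: -y[t] * G[t])     # maximal violating pair
        j = min(I_low, key=lambda t: -y[t] * G[t])
        if -y[i] * G[i] + y[j] * G[j] <= eps:         # stopping condition (2)
            break
        # move along d = y_i*e_i - y_j*e_j, which keeps y'alpha = 0
        a = Q[i, i] + Q[j, j] - 2 * y[i] * y[j] * Q[i, j]
        t = (-y[i] * G[i] + y[j] * G[j]) / max(a, 1e-12)
        # clip the step so both variables stay inside [0, C]
        t = min(t,
                C - alpha[i] if y[i] == 1 else alpha[i],
                alpha[j] if y[j] == 1 else C - alpha[j])
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        G += Q[:, i] * (y[i] * t) + Q[:, j] * (-y[j] * t)   # only two columns of Q used
    return alpha
```

With a precomputed kernel matrix K, a call like `alpha = smo_solve(K, y, C=1.0)` returns an approximate dual solution.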
Selection by Gradient Information
Maximal violating pair same as using gradient information
$$\{i, j\} = \arg\min_{B:|B|=2} \mathrm{Sub}(B), \quad \text{where}$$
$$\mathrm{Sub}(B) \equiv \min_{d_B} \; \nabla f(\alpha^k)_B^Td_B \qquad (1a)$$
$$\text{subject to} \quad y_B^Td_B = 0, \qquad d_t \geq 0 \text{ if } \alpha_t^k = 0, \; t \in B, \qquad (1b)$$
$$d_t \leq 0 \text{ if } \alpha_t^k = C, \; t \in B, \qquad (1c)$$
$$-1 \leq d_t \leq 1, \; t \in B. \qquad (1d)$$
First considered in (Joachims, 1998)
Let $d \equiv [d_B; 0_N]$. (1a) comes from minimizing
$$f(\alpha^k + d) \approx f(\alpha^k) + \nabla f(\alpha^k)^Td = f(\alpha^k) + \nabla f(\alpha^k)_B^Td_B.$$
First order approximation
$0 \leq \alpha_t \leq C$ leads to (1b) and (1c).
$-1 \leq d_t \leq 1$, $t \in B$ avoids a $-\infty$ objective value
Rough explanation connecting to the maximal violating pair:
$$\nabla f(\alpha^k)_id_i + \nabla f(\alpha^k)_jd_j = y_i\nabla f(\alpha^k)_i\cdot y_id_i + y_j\nabla f(\alpha^k)_j\cdot y_jd_j = \left(y_i\nabla f(\alpha^k)_i - y_j\nabla f(\alpha^k)_j\right)\cdot(y_id_i)$$
We used $y_id_i + y_jd_j = 0$
Find $\{i, j\}$ so that $y_i\nabla f(\alpha^k)_i - y_j\nabla f(\alpha^k)_j$ is the smallest, with $y_id_i = 1$, $y_jd_j = -1$
$y_id_i = 1$ corresponds to $i \in I_{up}(\alpha^k)$; $y_jd_j = -1$ corresponds to $j \in I_{low}(\alpha^k)$
Convergence: Maximal Violating Pair
Special case of (Lin, 2001c)
Let $\bar\alpha$ be the limit of a convergent subsequence $\{\alpha^k\}$, $k \in \mathcal{K}$. If it is not stationary, $\exists$ a violating pair
$$\bar i \in I_{up}(\bar\alpha), \quad \bar j \in I_{low}(\bar\alpha), \quad \text{and} \quad -y_{\bar i}\nabla f(\bar\alpha)_{\bar i} + y_{\bar j}\nabla f(\bar\alpha)_{\bar j} > 0$$
If $i \in I_{up}(\bar\alpha)$, then $i \in I_{up}(\alpha^k)$, $\forall k \in \mathcal{K}$ large enough
If $i \in I_{low}(\bar\alpha)$, then $i \in I_{low}(\alpha^k)$, $\forall k \in \mathcal{K}$ large enough
So $\{\bar i, \bar j\}$ is a violating pair at $k \in \mathcal{K}$
From $k$ to $k+1$: if $B^k = \{\bar i, \bar j\}$, then
$\bar i \notin I_{up}(\alpha^{k+1})$ or $\bar j \notin I_{low}(\alpha^{k+1})$
because of the optimality of the sub-problem
If we can show
$$\{\alpha^k\}_{k\in\mathcal{K}} \to \bar\alpha \;\Rightarrow\; \{\alpha^{k+1}\}_{k\in\mathcal{K}} \to \bar\alpha,$$
then $\{\bar i, \bar j\}$ should not be selected at iterations $k, k+1, \ldots, k+r$
A procedure shows that in finitely many iterations it is selected: contradiction
Key of the Proof
Essentially we proved
In finitely many iterations, $B = \{\bar i, \bar j\}$ is selected, giving a contradiction
Can be used to design working sets (Lucidi et al., 2005):
$\exists N > 0$ such that for all $k$, any violating pair of $\alpha^k$ is selected at least once in iterations $k$ to $k+N$
A cyclic selection:
{1, 2}, {1, 3}, . . . , {1, l}, {2, 3}, . . . , {l − 1, l}
Beyond Maximal Violating Pair
Better working sets?
Difficult: # iterations ց but cost per iteration ր
May not imply shorter training time
A selection by second order information (Fan et al., 2005)
As $f$ is quadratic,
$$f(\alpha^k + d) = f(\alpha^k) + \nabla f(\alpha^k)^Td + \tfrac{1}{2}d^T\nabla^2 f(\alpha^k)d = f(\alpha^k) + \nabla f(\alpha^k)_B^Td_B + \tfrac{1}{2}d_B^T\nabla^2 f(\alpha^k)_{BB}d_B$$
Selection by Quadratic Information
Using second order information
$$\min_{B:|B|=2} \mathrm{Sub}(B), \quad \mathrm{Sub}(B) \equiv \min_{d_B} \; \tfrac{1}{2}d_B^T\nabla^2 f(\alpha^k)_{BB}d_B + \nabla f(\alpha^k)_B^Td_B$$
$$\text{subject to} \quad y_B^Td_B = 0, \quad d_t \geq 0 \text{ if } \alpha_t^k = 0, \; t \in B, \quad d_t \leq 0 \text{ if } \alpha_t^k = C, \; t \in B.$$
$-1 \leq d_t \leq 1$, $t \in B$ is not needed if $\nabla^2 f(\alpha^k)_{BB} = Q_{BB}$ is PD
Too expensive to check all $\binom{l}{2}$ sets
A heuristic
1. Select
$$i \in \arg\max_t \{-y_t\nabla f(\alpha^k)_t \mid t \in I_{up}(\alpha^k)\}.$$
2. Select
$$j \in \arg\min_t \{\mathrm{Sub}(\{i, t\}) \mid t \in I_{low}(\alpha^k), \; -y_t\nabla f(\alpha^k)_t < -y_i\nabla f(\alpha^k)_i\}.$$
3. Return $B = \{i, j\}$.
The same $i$ as the maximal violating pair
Check only $O(l)$ possible $B$'s to decide $j$
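A sketch of this heuristic follows. It uses the fact that, when the box constraints are ignored, the optimal value of $\mathrm{Sub}(\{i, t\})$ has the closed form $-b_{it}^2/(2a_{it})$ with $a_{it} = Q_{ii} + Q_{tt} - 2y_iy_tQ_{it}$ and $b_{it} = -y_i\nabla f(\alpha)_i + y_t\nabla f(\alpha)_t$; the function and variable names are my own.

```python
import numpy as np

def select_working_set_2nd_order(G, Q, y, alpha, C, tau=1e-12):
    """Second-order working set heuristic: same i as the maximal violating pair,
       j chosen by the closed-form value of the two-variable sub-problem."""
    l = len(y)
    I_up = [t for t in range(l)
            if (alpha[t] < C and y[t] == 1) or (alpha[t] > 0 and y[t] == -1)]
    I_low = [t for t in range(l)
             if (alpha[t] < C and y[t] == -1) or (alpha[t] > 0 and y[t] == 1)]
    i = max(I_up, key=lambda t: -y[t] * G[t])
    best_j, best_val = -1, 0.0
    for t in I_low:
        b_it = -y[i] * G[i] + y[t] * G[t]        # consider only t violating with i
        if b_it <= 0:
            continue
        a_it = max(Q[i, i] + Q[t, t] - 2 * y[i] * y[t] * Q[i, t], tau)
        val = -b_it * b_it / (2 * a_it)          # Sub({i, t}) without box constraints
        if val < best_val:
            best_val, best_j = val, t
    return i, best_j
```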
Comparison of Two Selections
Iteration and time ratio between using quadratic information and maximal violating pair
[Figure: time and iteration ratios (quadratic-information selection / maximal violating pair) on the data sets image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a, abalone, cadata, cpusmall, space_ga, and mg; curves shown for time with a 40M cache, time with a 100K cache, and total iterations]
Comparing SVM Software/Methods
In optimization, it is straightforward to compare two methods
Now a comparison under one set of parameters may not be enough
The most suitable way of doing comparisons is unclear yet
In our comparisons, we check two things:
1. Time/total iterations for several parameter sets used in parameter selection
2. Time/iterations for final parameter set
Issues about the Quadratic Selection
Asymptotic convergence holds
Faster convergence than the maximal violating pair
Better approximation per iteration
But lacks global explanation yet
What if we check all $\binom{l}{2}$ sets?
Iteration ratio between checking all and checking $O(l)$:
[Figure: iteration ratios (checking all sets / checking $O(l)$ sets) on the data sets image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, and w1a, for both parameter selection and final training]
Fewer iterations, but ratio (0.7 to 0.8) not enough to justify the higher cost per iteration
Caching and Shrinking
Speed up decomposition methods
Caching (Joachims, 1998)
Store recently used Hessian columns in computer memory
Example
$ time ./libsvm-2.81/svm-train -m 0.01 a4a
11.463s
$ time ./libsvm-2.81/svm-train -m 40 a4a
7.817s
Shrinking (Joachims, 1998)
Some bounded elements remain until the end
Heuristically resized to a smaller problem
After certain iterations, most bounded elements identified and not changed (Lin, 2002)
Stopping Condition
In optimization software such conditions are important
However, don't be surprised if you see no stopping conditions in the optimization code of ML software
Sometimes time/iteration limits more suitable
From the KKT condition,
$$\max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i \;\leq\; \min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i + \epsilon \qquad (2)$$
is
a natural stopping condition
Better Stopping Condition
Now in our software $\epsilon = 10^{-3}$
Past experience: ok but sometimes too strict
At one point we almost changed it to $10^{-1}$
Large $C$ ⇒ large $\nabla f(\alpha)$ components
Too strict ⇒ many iterations
Need a relative condition
Example of Slow Convergence
Using C = 1
$./libsvm-2.81/svm-train -c 1 australian_scale
optimization finished, #iter = 508
obj = -201.642538, rho = 0.044312
Using C = 5000
$./libsvm-2.81/svm-train -c 5000 australian_scale
optimization finished, #iter = 35241
obj = -242509.157367, rho = -7.186733
Optimization researchers may rush to solve difficult cases
That’s what I did in the beginning
It turns out that large $C$ is less used than small $C$
Finite Termination
Given ǫ, finite termination under (2) (Keerthi and Gilbert, 2002; Lin, 2002)
Not implied from asymptotic convergence, as
$$\min_{i\in I_{low}(\alpha)} -y_i\nabla f(\alpha)_i \;-\; \max_{i\in I_{up}(\alpha)} -y_i\nabla f(\alpha)_i$$
is not a continuous function of $\alpha$
We worry that $\alpha_i^k \to 0$ with $i \in I_{up}(\alpha^k)\cap I_{low}(\alpha^k)$ causes the program to never end
ML people do not care much about this
Many think finite termination is the same as asymptotic convergence
We are careful about such issues in our software
A good SVM software should
1. be a rigorous numerical optimization code
2. serve the need of users in ML and other areas
Both are equally important
Issues Not Discussed Here
Q not PSD
Solving sub-problems
Analytic form for SMO (two-variable problem)
Linear convergence (Lin, 2001b):
$$f(\alpha^{k+1}) - f(\bar\alpha) \leq c\left(f(\alpha^k) - f(\bar\alpha)\right)$$
Best worst-case analysis
Practical Use of SVM
Let Us Try An Example
A problem from astroparticle physics
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1.0 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1.0 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1.0 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
Training and testing sets available: 3,089 and 4,000 points
Data format is an issue
SVM software:
LIBSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Now one of the most used SVM software packages
Installation
On Unix:
Download the zip file and make
On Windows:
Download the zip file and
c:\> nmake -f Makefile.win
Windows binaries included in the package
Usage of
LIBSVM
Training
Usage: svm-train [options] training_set_file [model_file] options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
Testing
Training and Testing
Training
$./svm-train train.1
...*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
Testing
$./svm-predict test.1 train.1.model test.1.predict
Accuracy = 66.925% (2677/4000)
What does this Output Mean
obj: the optimal objective value of the dual SVM
rho: $-b$ in the decision function
nSV and nBSV: number of support vectors and bounded support vectors
(i.e., αi = C).
nu-svm is a somewhat equivalent form of C-SVM where C is replaced by ν.
Why this Fails
After training, nearly 100% of the data are support vectors
Training and testing accuracy are very different
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
RBF kernel used: $e^{-\gamma\|x_i - x_j\|^2}$
Then
$$K_{ij} \begin{cases} = 1 & \text{if } i = j,\\ \to 0 & \text{if } i \neq j.\end{cases}$$
$K \to I$:
$$\min_\alpha \; \tfrac{1}{2}\alpha^T\alpha - e^T\alpha \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, l, \quad y^T\alpha = 0$$
Optimal solution
$$2 > \alpha = e - \frac{y^Te}{l}\,y > 0$$
$\alpha_i > 0$ ⇒ $y_i(w^Tx_i + b) = 1$
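The effect can be seen directly on the kernel matrix: on unscaled attributes with large ranges, the RBF kernel matrix is essentially the identity. The random data and the scaling factor below are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4)) * 100.0          # unscaled attributes with large ranges

def rbf_gram(X, gamma):
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

print(np.round(rbf_gram(X, 0.25), 3))          # essentially the identity matrix: K -> I
print(np.round(rbf_gram(X / 100.0, 0.25), 3))  # after scaling, off-diagonals are no longer ~0
```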
Data Scaling
Without scaling
Attributes in greater numeric ranges may dominate
Example:
      height  gender
x1    150     F
x2    180     M
x3    185     M
and $y_1 = 0$, $y_2 = 1$, $y_3 = 1$.
The separating hyperplane
[Figure: points x1, x2, x3 and the separating hyperplane in the original scale]
Decision strongly depends on the first attribute
What if the second is more important?
Linearly scale the first attribute to $[0, 1]$ by
$$\frac{\text{1st attribute} - 150}{185 - 150},$$
New points and separating hyperplane
[Figure: scaled points x1, x2, x3 and the new separating hyperplane]
[Figure: the new hyperplane transformed back to the original space]
The second attribute plays a role
More about Data Scaling
A common mistake
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Same factor on training and testing
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
We store the scaling factors used in training and apply them to the testing set
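In a script, the same idea looks like the sketch below: learn the per-attribute scaling factors on the training data only and reuse them on the test data (the role of svm-scale -s and -r). The function names are my own.

```python
import numpy as np

def fit_scaling(X, lower=-1.0, upper=1.0):
    """Learn per-attribute min/max on the TRAINING data only."""
    return X.min(axis=0), X.max(axis=0), lower, upper

def apply_scaling(X, factors):
    """Apply the same linear scaling to training or testing data."""
    col_min, col_max, lower, upper = factors
    return lower + (upper - lower) * (X - col_min) / (col_max - col_min)

X_train = np.array([[150.0, 0.0], [180.0, 1.0], [185.0, 1.0]])  # height, gender
factors = fit_scaling(X_train)
print(apply_scaling(X_train, factors))
# the SAME factors must be applied to any test set, as with svm-scale -s / -r
```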
More on Training
Train scaled data and then prediction
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
Training accuracy now is
$./svm-predict train.1.scale train.1.scale.model
Accuracy = 96.439% (2979/3089) (classification)
Default parameters: $C = 1$, $\gamma = 0.25$
Different Parameters
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model
Accuracy = 100% (3089/3089) (classification)
100% training accuracy but
$./svm-predict test.1.scale train.1.scale.model
Accuracy = 82.7% (3308/4000) (classification)
Very bad test accuracy
Overfitting and Underfitting
When training and predicting data, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
Overfitting
In theory
You can easily achieve 100% training accuracy
This is useless
Surprisingly
Many application papers did this
Parameter Selection
Sometimes important
Now the parameters are
$C$ and kernel parameters
Example:
$\gamma$ of $e^{-\gamma\|x_i - x_j\|^2}$
$a$, $b$, $d$ of $(x_i^Tx_j/a + b)^d$
How to select them?
Performance Evaluation
Training errors not important; only test errors count
$l$ training data, $x_i \in R^n$, $y_i \in \{+1, -1\}$, $i = 1, \ldots, l$, and a learning machine:
$$x \to f(x, \alpha), \quad f(x, \alpha) = 1 \text{ or } -1.$$
Different $\alpha$: different machines
The expected test error (generalization error)
$$R(\alpha) = \int \tfrac{1}{2}|y - f(x, \alpha)|\, dP(x, y)$$
$y$: class of $x$ (i.e., 1 or $-1$)
$P(x, y)$ unknown; the empirical risk (training error):
$$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l}|y_i - f(x_i, \alpha)|$$
$\tfrac{1}{2}|y_i - f(x_i, \alpha)|$: the loss. Choose $0 \leq \eta \leq 1$; with probability at least $1 - \eta$:
$$R(\alpha) \leq R_{emp}(\alpha) + \text{another term}$$
A good pattern recognition method: minimize both terms at the same time, not only $R_{emp}(\alpha) \to 0$
Performance Evaluation (Cont.)
In practice
Available data ⇒ training, validation, and (testing)
Train + validation ⇒ model
k-fold cross validation:
Data randomly separated to k groups.
Each time $k-1$ groups as training and one as testing
Select parameters with the highest CV accuracy
Another optimization problem
A Simple Procedure
1. Conduct simple scaling on the data
2. Consider the RBF kernel $K(x, y) = e^{-\gamma\|x - y\|^2}$
3. Use cross-validation to find the best parameters C and γ
4. Use the best C and γ to train the whole training set
5. Test
Best $C$ and $\gamma$ from training on $k-1$ folds versus the whole set?
In theory, a minor difference
No problem in practice
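A hedged sketch of this procedure using scikit-learn is shown below; the library choice and the synthetic data are my assumptions, since the talk itself uses the LIBSVM command-line tools.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# hypothetical data standing in for the scaled training set of the talk
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)

best = (None, None, -1.0)
for log2C in range(-5, 16, 2):            # coarse grid in log scale, like grid.py
    for log2g in range(-15, 4, 2):
        C, gamma = 2.0 ** log2C, 2.0 ** log2g
        acc = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()  # 5-fold CV
        if acc > best[2]:
            best = (C, gamma, acc)
print("best C = %g, gamma = %g, CV accuracy = %.4f" % best)
model = SVC(C=best[0], gamma=best[1]).fit(X, y)   # retrain on the whole set, then test
```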
Why trying RBF Kernel First
Linear kernel: special case of RBF (Keerthi and Lin, 2003)
Leave-one-out cross-validation accuracy of linear is the same as RBF under certain parameters
Related to optimization as well
Polynomial kernel: numerical difficulties: $(<1)^d \to 0$, $(>1)^d \to \infty$
More parameters than RBF
Parameter Selection in
LIBSVM
grid search + CV
$./grid.py train.1 train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
. . .
Contour of Parameter Selection
[Figure: contour of cross-validation accuracy (97 to 98.8%) over $\log_2 C$ from 1 to 7 and $\log_2\gamma$ from $-2$ to 3]
Simple script in LIBSVM
easy.py: a script for dummies
$python easy.py train.1 test.1
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Example: Engine Misfire Detection
Problem Description
First problem of IJCNN Challenge 2001, data from Ford
Given time series of length $T = 50{,}000$
The kth data
x1(k), x2(k), x3(k), x4(k), x5(k), y(k)
y(k) = ±1: output, affected only by x1(k), . . . , x4(k)
$x_5(k) = 1$: the $k$th data point is considered for evaluating accuracy
50,000 training data, 100,000 testing data (in two sets)
Past and future information may affect y(k)
$x_1(k)$: periodically nine 0s, one 1, nine 0s, one 1, and so on
Example:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
$x_4(k)$ more important
Background: Engine Misfire Detection
How engine works
Air-fuel mixture injected to cylinder
intake, compression, combustion, exhaust
Engine misfire: a substantial fraction of a cylinder’s air-fuel mixture fails to ignite
Frequent misfires: pollutants and costly replacement
On-board detection:
Engine crankshaft rotational dynamics with a position sensor
Training data: from some expensive experimental environment
Encoding Schemes
For SVM: each data is a vector
$x_1(k)$: periodically nine 0s, one 1, nine 0s, one 1, ...
Either 10 binary attributes $x_1(k-5), \ldots, x_1(k+4)$ for the $k$th data,
or $x_1(k)$ encoded as an integer in 1 to 10
Which one is better
We think 10 binaries better for SVM
x4(k) more important
Including x4(k − 5), . . . , x4(k + 4) for the kth data
Each training data: 22 attributes
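A possible encoding routine is sketched below. The exact layout of the 22 attributes is my assumption: the 10 binary values $x_1(k-5),\ldots,x_1(k+4)$, the current $x_2(k)$ and $x_3(k)$, and the 10 values $x_4(k-5),\ldots,x_4(k+4)$; boundary handling at the ends of the series is omitted.

```python
import numpy as np

def encode(k, x1, x2, x3, x4):
    """Sketch of one 22-attribute vector for the k-th point (0-indexed arrays).
    Layout assumed, not stated in the talk; requires 5 <= k <= len(x1) - 5."""
    window = range(k - 5, k + 5)
    return np.concatenate([[x1[t] for t in window],   # 10 binary attributes
                           [x2[k], x3[k]],            # current x2, x3
                           [x4[t] for t in window]])  # 10 values of the more important x4
```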
Training SVM
Selecting parameters; generating a good model for prediction
RBF kernel $K(x_i, x_j) = \phi(x_i)^T\phi(x_j) = e^{-\gamma\|x_i - x_j\|^2}$
Two parameters: γ and C
Five-fold cross validation on the 50,000 data
Data randomly separated to five groups
Each time four as training and one as testing
Use $C = 2^4$, $\gamma = 2^2$ and train the 50,000 data for the final model
[Figure: contour of cross-validation accuracy (97 to 98.8%) over $\log_2 C$ from 1 to 7 and $\log_2\gamma$ from $-2$ to 3]
Test set 1: 656 errors, Test set 2: 637 errors
About 3,000 support vectors out of 50,000 training data
A good case for SVM
This is just the outline. There are other details.
Conclusions
SVM optimization issues are challenging
Quite extensively studied
But better results still possible
Why work on machine learning? It is less mature than optimization
More new issues
Many other optimization issues arise from machine learning
Need to study things useful for ML tasks
While we complain about ML people's lack of optimization knowledge, we must admit this fact first:
ML people focus on developing methods, so they pay less attention to optimization details
Only if we widely apply solid optimization techniques to machine learning
can the contribution of optimization in ML be recognized