Linear Support Vector Machine / Support Vector Machine

Solving a Particular Standard Problem

\min_{b,w} \frac{1}{2} w^T w
subject to y_n (w^T x_n + b) \ge 1 for all n

X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad
y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}

The four constraints:
(i)   -b \ge 1
(ii)  -2w_1 - 2w_2 - b \ge 1
(iii)  2w_1 + 0w_2 + b \ge 1
(iv)   3w_1 + 0w_2 + b \ge 1

• (i) & (iii) ⇒ w_1 \ge +1;  (ii) & (iii) ⇒ w_2 \le -1;  hence \frac{1}{2} w^T w \ge 1
• (w_1 = 1, w_2 = -1, b = -1) is at the lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 - x_2 - 1): SVM? :-)
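As a sanity check, here is a minimal NumPy sketch (an illustration, not part of the original slides) verifying the hand-derived solution: all four constraints hold, the first three with equality, and sign(w^T x_n + b) reproduces every label.

```python
import numpy as np

# toy data from the slide
X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)

# hand-derived optimum: w = (1, -1), b = -1
w, b = np.array([1.0, -1.0]), -1.0

margins = y * (X @ w + b)                 # y_n (w^T x_n + b) for each n
print(margins)                            # [1. 1. 1. 2.] -> all constraints satisfied
assert np.all(margins >= 1 - 1e-12)
assert np.all(np.sign(X @ w + b) == y)    # g_SVM(x) = sign(x1 - x2 - 1) matches y
```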
Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = -1, b = -1)
margin(b, w) = \frac{1}{\|w\|} = \frac{1}{\sqrt{2}} \approx 0.707
boundary: x_1 - x_2 - 1 = 0

• examples on the boundary: 'locate' the fattest hyperplane; other examples: not needed
• call the boundary examples support vectors (candidates)

support vector machine (SVM): learn the fattest hyperplane (with the help of support vectors)
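To make the support-vector idea concrete, a small NumPy sketch (mine, not from the slides) that computes the margin 1/\|w\| and flags the examples sitting exactly on the fat boundary, i.e. those whose constraints are active:

```python
import numpy as np

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)
w, b = np.array([1.0, -1.0]), -1.0

print(1.0 / np.linalg.norm(w))            # margin = 1/sqrt(2) ~ 0.707

# support vector candidates: examples with y_n (w^T x_n + b) exactly 1
on_boundary = np.isclose(y * (X @ w + b), 1.0)
print(X[on_boundary])                     # (0,0), (2,2), (2,0); (3,0) is not needed
```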
Solving General SVM

\min_{b,w} \frac{1}{2} w^T w
subject to y_n (w^T x_n + b) \ge 1 for all n

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  — quadratic programming

quadratic programming (QP): 'easy' optimization problem
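Why is the problem 'easy'? The objective is a quadratic whose Hessian in u = (b, w) is positive semi-definite, so it is convex and any local minimum is global. A quick NumPy check of that claim (illustrative; the matrix Q below is exactly the one defined on the next slide):

```python
import numpy as np

d = 2                                     # input dimension of the toy problem
# Hessian of (1/2) w^T w as a function of u = (b, w): zero row/column for b
Q = np.block([[np.zeros((1, 1)), np.zeros((1, d))],
              [np.zeros((d, 1)), np.eye(d)]])

print(np.linalg.eigvalsh(Q))              # [0. 1. 1.] -> all >= 0, so Q is PSD
```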
Quadratic Programming

SVM problem: optimal (b, w) = ?
\min_{b,w} \frac{1}{2} w^T w
subject to y_n (w^T x_n + b) \ge 1, for n = 1, 2, ..., N

general QP problem: optimal u ← QP(Q, p, A, c)
\min_u \frac{1}{2} u^T Q u + p^T u
subject to a_m^T u \ge c_m, for m = 1, 2, ..., M

objective function: u = \begin{bmatrix} b \\ w \end{bmatrix};
Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix};  p = 0_{d+1}
constraints: a_n^T = y_n [1, x_n^T];  c_n = 1;  M = N

SVM with general QP solver: easy if you've read the manual :-)
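Following the recipe above, a minimal end-to-end sketch (mine, not from the course) that assembles Q, p, A, c for the toy data and solves the QP. No particular solver is prescribed by the slides, so this uses scipy.optimize.minimize with SLSQP as a stand-in for a general QP routine:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)
N, d = X.shape

# QP pieces exactly as on the slide, with u = [b; w]
Q = np.block([[np.zeros((1, 1)), np.zeros((1, d))],
              [np.zeros((d, 1)), np.eye(d)]])
p = np.zeros(d + 1)
A = y[:, None] * np.hstack([np.ones((N, 1)), X])   # row n is a_n^T = y_n [1, x_n^T]
c = np.ones(N)

# stand-in QP solver: min (1/2) u^T Q u + p^T u  s.t.  A u >= c
res = minimize(lambda u: 0.5 * u @ Q @ u + p @ u,
               x0=np.zeros(d + 1),
               constraints={'type': 'ineq', 'fun': lambda u: A @ u - c},
               method='SLSQP')
b, w = res.x[0], res.x[1:]
print(np.round(res.x, 6))                 # expect roughly [-1.  1. -1.], i.e. (b, w1, w2)
```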
SVM with QP Solver

Linear Hard-Margin SVM Algorithm
1. Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix};  p = 0_{d+1};  a_n^T = y_n [1, x_n^T];  c_n = 1
2. \begin{bmatrix} b \\ w \end{bmatrix} ← QP(Q, p, A, c)
3. return b & w as g_SVM

• hard-margin: nothing violates the 'fat boundary'
• linear: works on x_n directly

want non-linear? z_n = Φ(x_n) — remember? :-)
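The last line is the hook for what follows: the algorithm is unchanged if the inputs are transformed first. A brief sketch (the quadratic feature map below is a made-up illustration, not one from the slides) of feeding z_n = Φ(x_n) into the same routine:

```python
import numpy as np

def phi(X):
    """Hypothetical 2nd-order transform: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

Z = phi(np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float))
# run the same Linear Hard-Margin SVM on (Z, y): the hyperplane is linear
# in z-space, hence a non-linear boundary back in x-space
print(Z.shape)                            # (4, 5): d grew from 2 to 5, algorithm untouched
```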
Fun Time

Consider two negative examples with x_1 = (0, 0) and x_2 = (2, 2), and two positive examples with x_3 = (2, 0) and x_4 = (3, 0), as shown on page 17 of the slides. Define u, Q, p, c_n as those listed on page 20 of the slides. What are the a_n^T that need to be fed into the QP solver?

1. a_1^T = [-1, 0, 0], a_2^T = [-1, 2, 2], a_3^T = [-1, 2, 0], a_4^T = [-1, 3, 0]
2. a_1^T = [1, 0, 0], a_2^T = [1, -2, -2], a_3^T = [-1, 2, 0], a_4^T = [-1, 3, 0]
3. a_1^T = [1, 0, 0], a_2^T = [1, 2, 2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]
4. a_1^T = [-1, 0, 0], a_2^T = [-1, -2, -2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]

Reference Answer: 4
We need a_n^T = y_n [1, x_n^T].
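A quick check of the reference answer (illustrative), applying a_n^T = y_n [1, x_n^T] to all four examples at once:

```python
import numpy as np

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)

A = y[:, None] * np.hstack([np.ones((4, 1)), X])   # rows are a_n^T
print(A)   # [[-1. -0. -0.] [-1. -2. -2.] [ 1.  2.  0.] [ 1.  3.  0.]] -> choice 4
```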
Linear Support Vector Machine / Reasons behind Large-Margin Hyperplane

Why Large-Margin Hyperplane?

\min_{b,w} \frac{1}{2} w^T w
subject to y_n (w^T z_n + b) \ge 1 for all n

                  minimize    constraint
regularization    E_in        w^T w \le C
SVM               w^T w       E_in = 0 [and more]

SVM (large-margin hyperplane): 'weight-decay regularization' within E_in = 0
Large-Margin Restricts Dichotomies

consider the 'large-margin algorithm' A_ρ:
either returns g with margin(g) \ge ρ (if such g exists), or 0 otherwise

• A_0: like PLA ⇒ shatters 'general' 3 inputs
• A_1.126: more strict than SVM ⇒ cannot shatter any 3 inputs

fewer dichotomies ⇒ smaller 'VC dim.' ⇒ better generalization
VC Dimension of Large-Margin Algorithm

fewer dichotomies ⇒ smaller 'VC dim.'
consider d_VC(A_ρ) [data-dependent, needs more than the VC theory]
— instead of d_VC(H) [data-independent, covered by the VC theory]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > \frac{\sqrt{3}}{2}: cannot shatter any 3 inputs (d_VC < 3)
  — some pair of inputs must be of distance \le \sqrt{3}

generally, when X is the radius-R hyperball:
d_VC(A_ρ) \le \min\left(\frac{R^2}{ρ^2}, d\right) + 1 \le d + 1 = d_VC(perceptrons)
Benefits of Large-Margin Hyperplanes

            large-margin hyperplanes   hyperplanes   hyperplanes + feature transform Φ
#           even fewer                 not many      many
boundary    simple                     simple        sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM