Machine Learning Techniques (機器學習技法)
Lecture 1: Linear Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science
& Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/28
Course History

NTU Version
• 15-17 weeks (2+ hours each)
• highly praised, with English and blackboard teaching

Coursera Version
• 8 weeks of 'foundations' (previous course) + 8 weeks of 'techniques' (this course)
• Mandarin teaching, to reach more audience in need
• slides teaching, improved with Coursera's quiz and homework mechanisms

goal: try making Coursera version even better than NTU version

Linear Support Vector Machine: Course Introduction
Course Design

from Foundations to Techniques
• mixture of philosophical illustrations, key theory, core algorithms, usage in practice, and hopefully jokes :)
• three major techniques surrounding feature transforms:
  • Embedding Numerous Features: how to exploit and regularize numerous features?
    —inspires Support Vector Machine (SVM) model
  • Combining Predictive Features: how to construct and blend predictive features?
    —inspires Adaptive Boosting (AdaBoost) model
  • Distilling Implicit Features: how to identify and learn implicit features?
    —inspires Deep Learning model

allows students to use ML professionally
Fun Time

Which of the following descriptions of this course is true?
1. the course will be taught in Taiwanese
2. the course will tell me the techniques that create the android Lieutenant Commander Data in Star Trek
3. the course will be 16 weeks long
4. the course will focus on three major techniques

Reference Answer: 4
1. no, my Taiwanese is unfortunately not good enough for teaching (yet)
2. no, although what we teach may serve as building blocks
3. no, unless you have also joined the previous course
4. yes, let's get started!
Roadmap

1. Embedding Numerous Features: Kernel Models
   Lecture 1: Linear Support Vector Machine
   • Course Introduction
   • Large-Margin Separating Hyperplane
   • Standard Large-Margin Problem
   • Support Vector Machine
   • Reasons behind Large-Margin Hyperplane
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models
Linear Support Vector Machine: Large-Margin Separating Hyperplane
Linear Classification Revisited

PLA/pocket: h(x) = sign(s), with score s = w^T x
[figure: perceptron diagram — inputs x_0, x_1, x_2, ..., x_d weighted into score s, then thresholded to h(x)]
• plausible err = 0/1 (small flipping noise)
• minimize specially (linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)
Which Line Is Best?

• PLA? depends on randomness
• VC bound? whichever you like!

E_out(w) ≤ E_in(w) + Ω(H), where E_in(w) = 0 for any separating line and the complexity term Ω(H) depends only on d_VC = d + 1

You? rightmost one, possibly :)
Why Rightmost Hyperplane?

informal argument: if (Gaussian-like) noise on future x ≈ x_n:

x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting

distance to closest x_n ⇐⇒ amount of noise tolerance ⇐⇒ robustness of hyperplane

rightmost one: more robust, because of larger distance to closest x_n
Fat Hyperplane

• robust separating hyperplane: fat — far from both sides of examples
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane
Large-Margin Separating Hyperplane

max_w fatness(w)
subject to w classifies every (x_n, y_n) correctly
fatness(w) = min_{n=1,...,N} distance(x_n, w)

max_w margin(w)
subject to every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: formally called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane
Fun Time

Consider two examples (v, +1) and (−v, −1) where v ∈ R^2 (without padding v_0 = 1). Which of the following hyperplanes is the largest-margin separating one for the two examples? You are highly encouraged to visualize by considering, for instance, v = (3, 2).
1. x_1 = 0
2. x_2 = 0
3. v_1 x_1 + v_2 x_2 = 0
4. v_2 x_1 + v_1 x_2 = 0

Reference Answer: 3

Here the largest-margin separating hyperplane (line) must be the perpendicular bisector of the line segment between v and −v. Hence v is a normal vector of the largest-margin line. The result can be extended to the more general case of v ∈ R^d.
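The quiz answer can also be checked numerically: for v = (3, 2), compute each candidate line's margin, i.e. the distance from the closer of ±v to the line (both distances are equal here by symmetry). A minimal NumPy check; the function and dictionary names are mine, not from the slides:

```python
import numpy as np

v = np.array([3.0, 2.0])

# each candidate hyperplane a1*x1 + a2*x2 = 0, given by its normal vector a
candidates = {
    "x1 = 0":            np.array([1.0, 0.0]),
    "x2 = 0":            np.array([0.0, 1.0]),
    "v1*x1 + v2*x2 = 0": np.array([3.0, 2.0]),
    "v2*x1 + v1*x2 = 0": np.array([2.0, 3.0]),
}

def dist(x, a):
    """Distance from point x to the line through the origin with normal a."""
    return abs(a @ x) / np.linalg.norm(a)

# margin = distance to the closest of the two examples v and -v
margins = {name: min(dist(v, a), dist(-v, a)) for name, a in candidates.items()}
best = max(margins, key=margins.get)
print(best)   # the perpendicular bisector, whose normal vector is v itself
```

For v = (3, 2) the margins are 3, 2, √13 ≈ 3.61, and 12/√13 ≈ 3.33, so choice 3 indeed wins.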
Linear Support Vector Machine: Standard Large-Margin Problem
Distance to Hyperplane: Preliminary

max_w margin(w)
subject to every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

'shorten' x and w: distance needs w_0 and (w_1, ..., w_d) treated differently (to be derived)

b = w_0;  w = (w_1, ..., w_d)^T;  x = (x_1, ..., x_d)^T, no more padded x_0 = 1

for this part: h(x) = sign(w^T x + b)
Distance to Hyperplane

want: distance(x, b, w), with hyperplane w^T x' + b = 0

consider x', x'' on the hyperplane:
1. w^T x' = −b, w^T x'' = −b
2. w ⊥ hyperplane: w^T (x'' − x') = 0, where (x'' − x') is a vector on the hyperplane
3. distance = projection of (x − x') onto the direction w ⊥ to the hyperplane

[figure: point x, its distance dist(x, h) to the hyperplane through x' and x'', with normal vector w]

distance(x, b, w) = |(w^T/‖w‖)(x − x')| = (1/‖w‖) |w^T x + b|
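The final formula is easy to wrap as a helper (a minimal NumPy sketch; function name and example hyperplane are mine): points on the hyperplane should get distance 0.

```python
import numpy as np

def distance(x, b, w):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])
b = -5.0                                      # hyperplane 3 x1 + 4 x2 = 5
print(distance(np.array([1.0, 0.5]), b, w))   # on the hyperplane: 0.0
print(distance(np.array([0.0, 0.0]), b, w))   # |0 + 0 - 5| / 5 = 1.0
```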
Distance to Separating Hyperplane

distance(x, b, w) = (1/‖w‖) |w^T x + b|

• separating hyperplane: for every n, y_n (w^T x_n + b) > 0
• distance to separating hyperplane: distance(x_n, b, w) = (1/‖w‖) y_n (w^T x_n + b)

max_{b,w} margin(b, w)
subject to every y_n (w^T x_n + b) > 0
margin(b, w) = min_{n=1,...,N} (1/‖w‖) y_n (w^T x_n + b)
Margin of Special Separating Hyperplane

max_{b,w} margin(b, w)
subject to every y_n (w^T x_n + b) > 0
margin(b, w) = min_{n=1,...,N} (1/‖w‖) y_n (w^T x_n + b)

• w^T x + b = 0 same as 3w^T x + 3b = 0: scaling does not matter
• special scaling: only consider separating (b, w) such that min_{n=1,...,N} y_n (w^T x_n + b) = 1
  =⇒ margin(b, w) = 1/‖w‖

max_{b,w} 1/‖w‖
subject to min_{n=1,...,N} y_n (w^T x_n + b) = 1

(the scaled constraint already implies every y_n (w^T x_n + b) > 0)
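The special scaling is just a normalization step: divide a separating (b, w) by s = min_n y_n (w^T x_n + b). A minimal NumPy sketch; the function name is mine, and the data are the four-point example used later in the lecture:

```python
import numpy as np

def normalize(b, w, X, y):
    """Rescale a separating (b, w) so that min_n y_n (w^T x_n + b) = 1."""
    s = np.min(y * (X @ w + b))   # positive for any separating hyperplane
    return b / s, w / s

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])

b, w = normalize(-3.0, np.array([3.0, -3.0]), X, y)   # any scaled copy works
margin = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(b, w)                              # the canonical (b, w) = (-1, (1, -1))
print(margin, 1 / np.linalg.norm(w))     # after scaling, margin equals 1/||w||
```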
Standard Large-Margin Hyperplane Problem

max_{b,w} 1/‖w‖
subject to min_{n=1,...,N} y_n (w^T x_n + b) = 1

necessary constraints: y_n (w^T x_n + b) ≥ 1 for all n
original constraint: min_{n=1,...,N} y_n (w^T x_n + b) = 1
want: optimal (b, w) here (inside)
if optimal (b, w) outside, e.g. y_n (w^T x_n + b) > 1.126 for all n — can scale (b, w) to 'more optimal' (b/1.126, w/1.126) (contradiction!)

final change: max =⇒ min, remove the square root (maximizing 1/‖w‖ is minimizing w^T w), add 1/2:

min_{b,w} (1/2) w^T w
subject to y_n (w^T x_n + b) ≥ 1 for all n
Fun Time

Consider three examples (x_1, +1), (x_2, +1), (x_3, −1), where x_1 = (3, 0), x_2 = (0, 4), x_3 = (0, 0). In addition, consider the hyperplane x_1 + x_2 = 1. Which of the following is not true?
1. the hyperplane is a separating one for the three examples
2. the distance from the hyperplane to x_1 is 2
3. the distance from the hyperplane to x_3 is 1/√2
4. the example that is closest to the hyperplane is x_3

Reference Answer: 2

The distance from the hyperplane to x_1 is (1/√2)(3 + 0 − 1) = √2.
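The reference answer can be verified with the distance formula derived earlier (a quick NumPy check; variable names are mine):

```python
import numpy as np

w = np.array([1.0, 1.0])
b = -1.0                                  # the hyperplane x1 + x2 = 1
points = {"x1": np.array([3.0, 0.0]),
          "x2": np.array([0.0, 4.0]),
          "x3": np.array([0.0, 0.0])}

# distance(x, b, w) = |w^T x + b| / ||w|| for each example
dists = {name: abs(w @ x + b) / np.linalg.norm(w) for name, x in points.items()}
print(dists)   # x1: sqrt(2), not 2, so statement 2 is the false one; x3 closest
```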
Linear Support Vector Machine: Support Vector Machine
Solving a Particular Standard Problem

min_{b,w} (1/2) w^T w
subject to y_n (w^T x_n + b) ≥ 1 for all n

X = [ (0,0); (2,2); (2,0); (3,0) ],  y = (−1, −1, +1, +1)

constraints:
(i)   −b ≥ 1
(ii)  −2w_1 − 2w_2 − b ≥ 1
(iii)  2w_1 + 0w_2 + b ≥ 1
(iv)   3w_1 + 0w_2 + b ≥ 1

• (i) & (iii) =⇒ w_1 ≥ +1; (ii) & (iii) =⇒ w_2 ≤ −1
  =⇒ (1/2) w^T w ≥ 1
• (w_1 = 1, w_2 = −1, b = −1) at lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 − x_2 − 1): SVM? :)
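The hand derivation can be double-checked in code: the candidate (w_1, w_2, b) = (1, −1, −1) should satisfy all four constraints and attain the lower bound (1/2) w^T w = 1 forced by combining (i) & (iii) and (ii) & (iii). A minimal NumPy check:

```python
import numpy as np

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])

w = np.array([1., -1.])   # candidate optimum from the derivation
b = -1.0

slack = y * (X @ w + b)   # constraints (i)-(iv): all entries must be >= 1
print(slack)              # -> [1. 1. 1. 2.]
print(0.5 * w @ w)        # -> 1.0, the lower bound, so (b, w) is optimal
```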
Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = −1, b = −1); margin(b, w) = 1/‖w‖ = 1/√2 ≈ 0.707

[figure: the line x_1 − x_2 − 1 = 0 with margin 0.707 on each side]

• examples on the boundary: 'locate' the fattest hyperplane; other examples: not needed
• call boundary examples support vector (candidates)

support vector machine (SVM): learn fattest hyperplane (with help of support vectors)
Solving General SVM

min_{b,w} (1/2) w^T w
subject to y_n (w^T x_n + b) ≥ 1 for all n

• not easy manually, of course :)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  —quadratic programming

quadratic programming (QP): 'easy' optimization problem
Quadratic Programming

optimal (b, w) = ?
min_{b,w} (1/2) w^T w
subject to y_n (w^T x_n + b) ≥ 1, for n = 1, 2, ..., N

optimal u ← QP(Q, p, A, c)
min_u (1/2) u^T Q u + p^T u
subject to a_m^T u ≥ c_m, for m = 1, 2, ..., M

objective function: u = [b; w];  Q = [ [0, 0_d^T]; [0_d, I_d] ];  p = 0_{d+1}
constraints: a_n^T = y_n [1, x_n^T];  c_n = 1;  M = N

SVM with general QP solver: easy if you've read the manual :)
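As a sketch of the QP hookup: the block below builds Q, p, A, c exactly as on the slide for the earlier four-point example, then hands them to SciPy's generic SLSQP routine — standing in for a dedicated QP package, whose exact call signature varies between libraries, so treat this interface as an assumption rather than the one true QP(Q, p, A, c).

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])
N, d = X.shape

# u = (b, w); the bias b is not penalized, hence the zero row/column in Q
Q = np.zeros((d + 1, d + 1))
Q[1:, 1:] = np.eye(d)
p = np.zeros(d + 1)
A = y[:, None] * np.hstack([np.ones((N, 1)), X])   # rows a_n^T = y_n [1  x_n^T]
c = np.ones(N)

res = minimize(lambda u: 0.5 * u @ Q @ u + p @ u,
               x0=np.zeros(d + 1),
               constraints=[{"type": "ineq", "fun": lambda u: A @ u - c}],
               method="SLSQP")
b, w = res.x[0], res.x[1:]
print(np.round(res.x, 3))   # should recover (b, w1, w2) = (-1, 1, -1)
```

Note the constraint rows match the later Fun Time: a_1 = [−1, 0, 0], a_2 = [−1, −2, −2], a_3 = [1, 2, 0], a_4 = [1, 3, 0].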
SVM with QP Solver

Linear Hard-Margin SVM Algorithm
1. Q = [ [0, 0_d^T]; [0_d, I_d] ];  p = 0_{d+1};  a_n^T = y_n [1, x_n^T];  c_n = 1
2. [b; w] ← QP(Q, p, A, c)
3. return b & w as g_SVM

• hard-margin: nothing violates the 'fat boundary'
• linear: on x_n

want non-linear? z_n = Φ(x_n) — remember? :)
Fun Time

Consider two negative examples with x_1 = (0, 0) and x_2 = (2, 2); two positive examples with x_3 = (2, 0) and x_4 = (3, 0), as shown on page 17 of the slides. Define u, Q, p, c_n as those listed on page 20 of the slides. What are the a_n^T that need to be fed into the QP solver?
1. a_1^T = [−1, 0, 0], a_2^T = [−1, 2, 2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
2. a_1^T = [1, 0, 0], a_2^T = [1, −2, −2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
3. a_1^T = [1, 0, 0], a_2^T = [1, 2, 2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]
4. a_1^T = [−1, 0, 0], a_2^T = [−1, −2, −2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]

Reference Answer: 4

We need a_n^T = y_n [1, x_n^T].
Linear Support Vector Machine: Reasons behind Large-Margin Hyperplane
Why Large-Margin Hyperplane?

min_{b,w} (1/2) w^T w
subject to y_n (w^T z_n + b) ≥ 1 for all n

                  minimize    constraint
regularization    E_in        w^T w ≤ C
SVM               w^T w       E_in = 0 [and more]

SVM (large-margin hyperplane): 'weight-decay regularization' within E_in = 0
Large-Margin Restricts Dichotomies

consider 'large-margin algorithm' A_ρ: either returns g with margin(g) ≥ ρ (if such g exists), or 0 otherwise

• A_0: like PLA =⇒ shatters 'general' 3 inputs
• A_1.126: more strict than SVM =⇒ cannot shatter any 3 inputs

fewer dichotomies =⇒ smaller 'VC dim.' =⇒ better generalization
VC Dimension of Large-Margin Algorithm

fewer dichotomies =⇒ smaller 'VC dim.'

considers d_VC(A_ρ) [data-dependent, needs more than VC] — instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > √3/2: cannot shatter any 3 inputs (d_VC < 3) — some pair of inputs must be of distance ≤ √3

generally, when X is in a radius-R hyperball:

d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 ≤ d + 1 = d_VC(perceptrons)
Benefits of Large-Margin Hyperplanes

            large-margin hyperplanes   hyperplanes   hyperplanes + feature transform Φ
#           even fewer                 not many      many
boundary    simple                     simple        sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM
            large-margin hyperplanes + numerous feature transforms Φ
#           not many
boundary    sophisticated
Fun Time

Consider running the 'large-margin algorithm' A_ρ with ρ = 1/4 on a Z-space such that z = Φ(x) is of 1126 dimensions (excluding z_0) and ‖z‖ ≤ 1. What is the upper bound of d_VC(A_ρ) when calculated by min(R^2/ρ^2, d) + 1?
1. 5
2. 17
3. 1126
4. 1127

Reference Answer: 2

By the description, d = 1126 and R = 1. So the upper bound is simply min(16, 1126) + 1 = 17.
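The quiz is simple arithmetic; a trivial helper (name mine) makes the min in the bound from the previous slide explicit:

```python
def dvc_upper_bound(R, rho, d):
    """Upper bound min(R^2 / rho^2, d) + 1 on d_VC of the large-margin algorithm A_rho."""
    return min(R ** 2 / rho ** 2, d) + 1

print(dvc_upper_bound(R=1.0, rho=0.25, d=1126))    # -> 17.0: min(16, 1126) + 1
print(dvc_upper_bound(R=1.0, rho=0.0001, d=1126))  # tiny margin: falls back to d + 1
```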
Summary

1. Embedding Numerous Features: Kernel Models
   Lecture 1: Linear Support Vector Machine
   • Course Introduction: from foundations to techniques
   • Large-Margin Separating Hyperplane: intuitively more robust against noise
   • Standard Large-Margin Problem: minimize 'length of w' at special separating scale
   • Support Vector Machine: 'easy' via quadratic programming
   • Reasons behind Large-Margin Hyperplane: fewer dichotomies and better generalization
   • next: solving non-linear Support Vector Machine
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models