Machine Learning Techniques (機器學習技法)
Lecture 1: Linear Support Vector Machine
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Course History

NTU Version
• 15-17 weeks (2+ hours)
• highly praised, with English and blackboard teaching

Coursera Version
• 8 weeks of 'foundations' (previous course) + 8 weeks of 'techniques' (this course)
• Mandarin teaching, to reach more audience in need
• slides teaching, improved with Coursera's quiz and homework mechanisms

goal: try making the Coursera version even better than the NTU version
Course Design: from Foundations to Techniques
• mixture of philosophical illustrations, key theory, core algorithms, usage in practice, and hopefully jokes :-)
• three major techniques surrounding feature transforms:
  • Embedding Numerous Features: how to exploit and regularize numerous features?
    ⇒ inspires the Support Vector Machine (SVM) model
  • Combining Predictive Features: how to construct and blend predictive features?
    ⇒ inspires the Adaptive Boosting (AdaBoost) model
  • Distilling Implicit Features: how to identify and learn implicit features?
    ⇒ inspires the Deep Learning model

allows students to use ML professionally
Fun Time
Which of the following descriptions of this course is true?
1 the course will be taught in Taiwanese
2 the course will tell me the techniques that create the android Lieutenant Commander Data in Star Trek
3 the course will be 16 weeks long
4 the course will focus on three major techniques

Reference Answer: 4
1 no, my Taiwanese is unfortunately not good enough for teaching (yet)
2 no, although what we teach may serve as building blocks
3 no, unless you have also joined the previous course
4 yes, let's get started!
Roadmap
1 Embedding Numerous Features: Kernel Models

  Lecture 1: Linear Support Vector Machine
    Course Introduction
    Large-Margin Separating Hyperplane
    Standard Large-Margin Problem
    Support Vector Machine
    Reasons behind Large-Margin Hyperplane

2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
Linear Classification Revisited

PLA/pocket: h(x) = sign(s)
[figure: perceptron diagram, inputs x_0, x_1, x_2, . . . , x_d weighted into score s, then h(x) = sign(s)]
plausible err = 0/1 (small flipping noise), minimize specially (linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)
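To make the notation concrete, here is a minimal sketch of the linear hyperplane classifier h(x) = sign(w^T x) on padded inputs x = (1, x_1, . . . , x_d); Python with NumPy is assumed, and mapping sign(0) to +1 is just a convention chosen here.

```python
import numpy as np

def linear_classify(w, X):
    """h(x) = sign(w^T x) for each row of X, after padding x_0 = 1."""
    X_padded = np.column_stack([np.ones(len(X)), X])  # prepend the x_0 = 1 coordinate
    scores = X_padded @ w                             # s = w^T x for every example
    return np.where(scores >= 0, +1, -1)              # sign(s); ties at 0 mapped to +1
```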
Which Line Is Best?
• PLA? depending on randomness
• VC bound? whichever you like!
    E_out(w) ≤ E_in(w) + Ω(H), with E_in(w) = 0 and d_VC = d + 1 (the same for every separating line)

You? rightmost one, possibly :-)
Why Rightmost Hyperplane?

informal argument: if (Gaussian-like) noise on future x ≈ x_n:
  x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting
  distance to closest x_n ⇐⇒ amount of noise tolerance ⇐⇒ robustness of hyperplane

rightmost one: more robust because of larger distance to closest x_n
Fat Hyperplane
• robust separating hyperplane: fat, i.e., far from both sides of examples
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane
Large-Margin Separating Hyperplane

    max_w   fatness(w)
    subject to   w classifies every (x_n, y_n) correctly
    fatness(w) = min_{n=1,...,N} distance(x_n, w)

    max_w   margin(w)
    subject to   every y_n w^T x_n > 0
    margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: formally called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane
Fun Time
Consider two examples (v, +1) and (−v, −1) where v ∈ R^2 (without padding the v_0 = 1). Which of the following hyperplanes is the largest-margin separating one for the two examples? You are highly encouraged to visualize by considering, for instance, v = (3, 2).
1 x_1 = 0
2 x_2 = 0
3 v_1 x_1 + v_2 x_2 = 0
4 v_2 x_1 + v_1 x_2 = 0

Reference Answer: 3
Here the largest-margin separating hyperplane (line) must be the perpendicular bisector of the line segment between v and −v. Hence v is a normal vector of the largest-margin line. The result can be extended to the more general case of v ∈ R^d.
Distance to Hyperplane: Preliminary

    max_w   margin(w)
    subject to   every y_n w^T x_n > 0
    margin(w) = min_{n=1,...,N} distance(x_n, w)

'shorten' x and w: distance needs w_0 and (w_1, . . . , w_d) differently (to be derived)
• b = w_0;  w = (w_1, . . . , w_d), now without w_0
• drop the padded coordinate x_0 = 1;  x = (x_1, . . . , x_d)

for this part: h(x) = sign(w^T x + b)
Distance to Hyperplane

want: distance(x, b, w), with hyperplane w^T x′ + b = 0

consider x′, x″ on the hyperplane:
1 w^T x′ = −b,  w^T x″ = −b
2 w ⊥ hyperplane: w^T (x″ − x′) = 0, since (x″ − x′) is a vector on the hyperplane
3 distance = project (x − x′) onto the ⊥ direction of the hyperplane (i.e., onto w)

[figure: point x, points x′, x″ on the hyperplane, normal vector w, and dist(x, h)]

    distance(x, b, w) = | (w^T / ‖w‖) (x − x′) | = (1/‖w‖) |w^T x + b|
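As a quick numeric companion to the derivation, here is a minimal sketch (Python with NumPy assumed) of distance(x, b, w) = |w^T x + b| / ‖w‖; the example in the comment matches the later Fun Time with hyperplane x_1 + x_2 − 1 = 0.

```python
import numpy as np

def distance_to_hyperplane(x, b, w):
    """Distance from point x to the hyperplane {x : w^T x + b = 0}."""
    w = np.asarray(w, dtype=float)
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# e.g., distance_to_hyperplane([3, 0], -1, [1, 1]) = |3 + 0 - 1| / sqrt(2) = sqrt(2)
```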
Distance to Separating Hyperplane

    distance(x, b, w) = (1/‖w‖) |w^T x + b|

• separating hyperplane: for every n, y_n (w^T x_n + b) > 0
• distance to a separating hyperplane: distance(x_n, b, w) = (1/‖w‖) y_n (w^T x_n + b)

    max_{b,w}   margin(b, w)
    subject to   every y_n (w^T x_n + b) > 0
    margin(b, w) = min_{n=1,...,N} (1/‖w‖) y_n (w^T x_n + b)
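The margin of a candidate (b, w) over a dataset follows directly from the signed distance above; a minimal NumPy sketch:

```python
import numpy as np

def margin(b, w, X, y):
    """margin(b, w) = min_n y_n (w^T x_n + b) / ||w||; positive iff (b, w) separates the data."""
    w = np.asarray(w, dtype=float)
    signed = y * (X @ w + b)             # y_n (w^T x_n + b) for every example
    return signed.min() / np.linalg.norm(w)
```

Note that margin(3 * b, 3 * w, X, y) equals margin(b, w, X, y), which is exactly the scaling freedom exploited on the next slide.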
Margin of Special Separating Hyperplane

    max_{b,w}   margin(b, w)
    subject to   every y_n (w^T x_n + b) > 0
    margin(b, w) = min_{n=1,...,N} (1/‖w‖) y_n (w^T x_n + b)

• w^T x + b = 0 same as 3w^T x + 3b = 0: scaling does not matter
• special scaling: only consider separating (b, w) such that min_{n=1,...,N} y_n (w^T x_n + b) = 1
  =⇒ margin(b, w) = 1/‖w‖

    max_{b,w}   1/‖w‖
    subject to   every y_n (w^T x_n + b) > 0
                 min_{n=1,...,N} y_n (w^T x_n + b) = 1
Standard Large-Margin Hyperplane Problem

    max_{b,w}   1/‖w‖
    subject to   min_{n=1,...,N} y_n (w^T x_n + b) = 1

necessary constraints: y_n (w^T x_n + b) ≥ 1 for all n
original constraint: min_{n=1,...,N} y_n (w^T x_n + b) = 1
want: optimal (b, w) here (inside the necessary constraints)
if optimal (b, w) were outside, e.g. y_n (w^T x_n + b) > 1.126 for all n: can scale (b, w) down to a "more optimal" (b/1.126, w/1.126) (contradiction!)

final change: max =⇒ min, remove the square root by using w^T w instead of ‖w‖, add ½

    min_{b,w}   ½ w^T w
    subject to   y_n (w^T x_n + b) ≥ 1 for all n
Fun Time
Consider three examples (x_1, +1), (x_2, +1), (x_3, −1), where x_1 = (3, 0), x_2 = (0, 4), x_3 = (0, 0). In addition, consider the hyperplane x_1 + x_2 = 1. Which of the following is not true?
1 the hyperplane is a separating one for the three examples
2 the distance from the hyperplane to x_1 is 2
3 the distance from the hyperplane to x_3 is 1/√2
4 the example that is closest to the hyperplane is x_3

Reference Answer: 2
The distance from the hyperplane to x_1 is (1/√2)(3 + 0 − 1) = √2.
Solving a Particular Standard Problem

    min_{b,w}   ½ w^T w
    subject to   y_n (w^T x_n + b) ≥ 1 for all n

    X = [ (0, 0); (2, 2); (2, 0); (3, 0) ],    y = (−1, −1, +1, +1)

                       −b ≥ 1   (i)
    −2 w_1 − 2 w_2 − b ≥ 1   (ii)
       2 w_1 + 0 w_2 + b ≥ 1   (iii)
       3 w_1 + 0 w_2 + b ≥ 1   (iv)

• (i) & (iii) =⇒ w_1 ≥ +1;  (ii) & (iii) =⇒ w_2 ≤ −1  =⇒ ½ w^T w ≥ 1
• (w_1 = 1, w_2 = −1, b = −1) attains the lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 − x_2 − 1): SVM? :-)
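A quick numeric check of the hand-derived solution (Python with NumPy assumed): it confirms that (b, w) = (−1, (1, −1)) satisfies constraints (i)-(iv) and attains margin 1/√2.

```python
import numpy as np

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])
w = np.array([1., -1.])   # hand-derived w_1 = 1, w_2 = -1
b = -1.0                  # hand-derived b = -1

signed = y * (X @ w + b)                  # y_n (w^T x_n + b), one value per constraint
print(signed)                             # [1. 1. 1. 2.]: all >= 1, three exactly at 1
print(signed.min() / np.linalg.norm(w))   # margin = 1/||w|| = 1/sqrt(2), about 0.707
```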
Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = −1, b = −1), margin(b, w) = 1/‖w‖ = 1/√2 ≈ 0.707
[figure: the four examples with the fattest separating line x_1 − x_2 − 1 = 0 and margin 0.707]

• examples on the boundary: 'locate' the fattest hyperplane; other examples: not needed
• call boundary examples support vectors (candidates)

support vector machine (SVM): learn fattest hyperplanes (with help of support vectors)
Solving General SVM

    min_{b,w}   ½ w^T w
    subject to   y_n (w^T x_n + b) ≥ 1 for all n

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  ⇒ quadratic programming

quadratic programming (QP): 'easy' optimization problem
Quadratic Programming

optimal (b, w) = ?
    min_{b,w}   ½ w^T w
    subject to   y_n (w^T x_n + b) ≥ 1, for n = 1, 2, . . . , N

optimal u ← QP(Q, p, A, c)
    min_u   ½ u^T Q u + p^T u
    subject to   a_m^T u ≥ c_m, for m = 1, 2, . . . , M

objective function: u = [b; w];  Q = [0, 0_d^T; 0_d, I_d];  p = 0_{d+1}
constraints: a_n^T = y_n [1, x_n^T];  c_n = 1;  M = N

SVM with general QP solver: easy if you've read the manual :-)
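To illustrate the mapping, here is a minimal NumPy sketch that builds (Q, p, A, c) from data (X, y) in the form above; the commented-out `some_qp_solver` call is a hypothetical stand-in for whatever general QP routine you actually use.

```python
import numpy as np

def hard_margin_qp_terms(X, y):
    """Build Q, p, A, c for min_u (1/2) u^T Q u + p^T u  s.t.  A u >= c, with u = [b; w]."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)          # Q = [[0, 0_d^T], [0_d, I_d]]: penalize w but not b
    p = np.zeros(d + 1)            # p = 0_{d+1}
    A = y[:, None] * np.column_stack([np.ones(N), X])   # row n is a_n^T = y_n [1, x_n^T]
    c = np.ones(N)                 # c_n = 1
    return Q, p, A, c

# Q, p, A, c = hard_margin_qp_terms(X, y)
# u = some_qp_solver(Q, p, A, c)   # hypothetical solver with the slide's QP(Q, p, A, c) interface
```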
SVM with QP Solver

Linear Hard-Margin SVM Algorithm
1 Q = [0, 0_d^T; 0_d, I_d];  p = 0_{d+1};  a_n^T = y_n [1, x_n^T];  c_n = 1
2 [b; w] ← QP(Q, p, A, c)
3 return b & w as g_SVM

• hard-margin: nothing violates the 'fat boundary'
• linear: on x_n

want non-linear? z_n = Φ(x_n), remember? :-)
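Putting the three steps together on the toy dataset from earlier: this is a minimal sketch assuming SciPy is available, with its general-purpose SLSQP minimizer standing in for a dedicated QP solver (a real QP package would consume the (Q, p, A, c) terms from the previous sketch directly). It should recover b = −1, w = (1, −1).

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])
N, d = X.shape

# variable u = [b, w_1, ..., w_d]; objective (1/2) w^T w; constraints y_n (w^T x_n + b) - 1 >= 0
objective = lambda u: 0.5 * np.dot(u[1:], u[1:])
constraints = [{"type": "ineq",
                "fun": lambda u, xn=xn, yn=yn: yn * (np.dot(u[1:], xn) + u[0]) - 1.0}
               for xn, yn in zip(X, y)]

result = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
b, w = result.x[0], result.x[1:]
print(b, w)   # expected to approach b = -1, w = (1, -1): g_SVM(x) = sign(x_1 - x_2 - 1)
```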
Fun Time
Consider two negative examples with x_1 = (0, 0) and x_2 = (2, 2), and two positive examples with x_3 = (2, 0) and x_4 = (3, 0), as shown on the 'Solving a Particular Standard Problem' slide. Define u, Q, p, c_n as those listed on the 'Quadratic Programming' slide. What are the a_n^T that need to be fed into the QP solver?
1 a_1^T = [−1, 0, 0], a_2^T = [−1, 2, 2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
2 a_1^T = [1, 0, 0], a_2^T = [1, −2, −2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
3 a_1^T = [1, 0, 0], a_2^T = [1, 2, 2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]
4 a_1^T = [−1, 0, 0], a_2^T = [−1, −2, −2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]

Reference Answer: 4
We need a_n^T = y_n [1, x_n^T].
Why Large-Margin Hyperplane?

    min_{b,w}   ½ w^T w
    subject to   y_n (w^T z_n + b) ≥ 1 for all n

                      minimize    constraint
    regularization    E_in        w^T w ≤ C
    SVM               w^T w       E_in = 0 [and more]

SVM (large-margin hyperplane): 'weight-decay regularization' within E_in = 0
Large-Margin Restricts Dichotomies

consider a 'large-margin algorithm' A_ρ: either returns g with margin(g) ≥ ρ (if exists), or 0 otherwise

• A_0: like PLA =⇒ shatter 'general' 3 inputs
• A_1.126: more strict than SVM =⇒ cannot shatter any 3 inputs

fewer dichotomies =⇒ smaller 'VC dim.' =⇒ better generalization
VC Dimension of Large-Margin Algorithm

fewer dichotomies =⇒ smaller 'VC dim.'
considers d_VC(A_ρ) [data-dependent, need more than VC] instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > √3/2: cannot shatter any 3 inputs (d_VC < 3), since some two inputs must be of distance ≤ √3

generally, when X is in a radius-R hyperball:
    d_VC(A_ρ) ≤ min(R²/ρ², d) + 1 ≤ d + 1 = d_VC(perceptrons)
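A tiny helper to evaluate the bound numerically (a sketch only, with Python assumed); the numbers in the call match the Fun Time below.

```python
def vc_bound_large_margin(R, rho, d):
    """Upper bound on d_VC(A_rho): min(R^2 / rho^2, d) + 1."""
    return min(R ** 2 / rho ** 2, d) + 1

print(vc_bound_large_margin(R=1.0, rho=0.25, d=1126))  # min(16, 1126) + 1 = 17
```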
Benefits of Large-Margin Hyperplanes

                large-margin hyperplanes    hyperplanes    hyperplanes + feature transform Φ
    #           even fewer                  not many       many
    boundary    simple                      simple         sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM
                large-margin hyperplanes + numerous feature transform Φ
    #           not many
    boundary    sophisticated
Fun Time
Consider running the 'large-margin algorithm' A_ρ with ρ = 1/4 on a Z-space such that z = Φ(x) is of 1126 dimensions (excluding z_0) and ‖z‖ ≤ 1. What is the upper bound of d_VC(A_ρ) when calculated by min(R²/ρ², d) + 1?
1 5
2 17
3 1126
4 1127

Reference Answer: 2
By the description, d = 1126 and R = 1. So the upper bound is simply min(16, 1126) + 1 = 17.
Summary
1 Embedding Numerous Features: Kernel Models

  Lecture 1: Linear Support Vector Machine
    Course Introduction: from foundations to techniques
    Large-Margin Separating Hyperplane: intuitively more robust against noise
    Standard Large-Margin Problem: minimize 'length of w' at special separating scale
    Support Vector Machine: 'easy' via quadratic programming
    Reasons behind Large-Margin Hyperplane: fewer dichotomies and better generalization
• next: solving non-linear Support Vector Machine
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models