Machine Learning Techniques (機器學習技法)
Lecture 1: Linear Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Course Introduction

Course History

NTU Version
• 15–17 weeks (2+ hours each)
• highly praised, with English and blackboard teaching

Coursera Version
• 8 weeks of 'foundations' (previous course) + 8 weeks of 'techniques' (this course)
• Mandarin teaching, to reach a wider audience in need
• slides teaching, improved with Coursera's quiz and homework mechanisms

goal: try making the Coursera version even better than the NTU version
Course Design

from Foundations to Techniques
• a mixture of philosophical illustrations, key theory, core algorithms, usage in practice, and hopefully jokes :-)
• three major techniques surrounding feature transforms:
  • Embedding Numerous Features: how to exploit and regularize numerous features?
    —inspires the Support Vector Machine (SVM) model
  • Combining Predictive Features: how to construct and blend predictive features?
    —inspires the Adaptive Boosting (AdaBoost) model
  • Distilling Implicit Features: how to identify and learn implicit features?
    —inspires the Deep Learning model

allows students to use ML professionally
Fun Time

Which of the following descriptions of this course is true?
1. the course will be taught in Taiwanese
2. the course will tell me the techniques that create the android Lieutenant Commander Data in Star Trek
3. the course will be 16 weeks long
4. the course will focus on three major techniques

Reference Answer: 4
1. no, my Taiwanese is unfortunately not good enough for teaching (yet)
2. no, although what we teach may serve as building blocks
3. no, unless you have also joined the previous course
4. yes, let's get started!
Roadmap

1. Embedding Numerous Features: Kernel Models

   Lecture 1: Linear Support Vector Machine
   • Course Introduction
   • Large-Margin Separating Hyperplane
   • Standard Large-Margin Problem
   • Support Vector Machine
   • Reasons behind Large-Margin Hyperplane

2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models
Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket: $h(x) = \text{sign}(s)$
[figure: perceptron computing the score $s$ from inputs $x_0, x_1, x_2, \ldots, x_d$, then $h(x)$]
• plausible err = 0/1 (small flipping noise)
• minimized specially (linearly separable case)

linear (hyperplane) classifiers: $h(x) = \text{sign}(w^T x)$
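To make the notation concrete before the margin discussion, here is a minimal sketch of such a linear classifier; the function name `predict` and the toy data are invented for illustration, not part of the lecture.

```python
import numpy as np

def predict(w, X):
    """Linear (hyperplane) classifier: h(x) = sign(w^T x).

    w: weight vector of shape (d+1,), including the threshold weight w_0.
    X: data matrix of shape (N, d+1), each row padded with x_0 = 1.
    """
    scores = X @ w                       # s = w^T x for every example
    return np.where(scores >= 0, 1, -1)  # sign(s), mapping s = 0 to +1 by convention

# toy usage: two 2-D points, padded with x_0 = 1
X = np.array([[1.0, 2.0, 3.0],
              [1.0, -1.0, -1.0]])
w = np.array([0.0, 1.0, 1.0])
print(predict(w, X))  # [ 1 -1 ]
```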
Which Line Is Best?

[figure: three different lines, each separating the same linearly separable data]
• PLA? depends on randomness
• VC bound? whichever you like!

$$E_{\text{out}}(w) \le \underbrace{E_{\text{in}}(w)}_{0} + \underbrace{\Omega(\mathcal{H})}_{d_{\text{VC}} = d+1}$$

You? the rightmost one, possibly :-)
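The 'depends on randomness' point is easy to see in code: PLA keeps correcting mistakes until the data are separated, and which separating line it ends on depends on the order of corrections. A hypothetical sketch (the dataset and function name are invented):

```python
import numpy as np

def pla(X, y, rng):
    """Perceptron learning algorithm on padded data (x_0 = 1).
    Repeatedly picks a random misclassified example and corrects on it;
    terminates once every example is classified correctly."""
    w = np.zeros(X.shape[1])
    while True:
        mistakes = np.flatnonzero(np.sign(X @ w) != y)  # sign(0) = 0 also counts as a mistake
        if mistakes.size == 0:
            return w
        n = rng.choice(mistakes)
        w = w + y[n] * X[n]              # the PLA correction rule

# a small linearly separable set: positives above the line x2 = x1
X = np.array([[1., 0., 1.], [1., 1., 2.], [1., 2., 0.], [1., -1., -2.]])
y = np.array([1, 1, -1, -1])
for seed in range(3):
    print(pla(X, y, np.random.default_rng(seed)))  # different orders may end on different separators
```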
Why Rightmost Hyperplane?

informal argument: if there is (Gaussian-like) noise on a future $x \approx x_n$, then
• $x_n$ further from the hyperplane ⟺ tolerates more noise ⟺ more robust to overfitting
• distance to the closest $x_n$ ⟺ amount of noise tolerance ⟺ robustness of the hyperplane

rightmost one: more robust, because of the larger distance to the closest $x_n$
Fat Hyperplane

• robust separating hyperplane: fat
  —far from both sides of examples
• robustness ≡ fatness: distance to the closest $x_n$

goal: find the fattest separating hyperplane
Large-Margin Separating Hyperplane

$$\max_{w} \ \text{fatness}(w) \quad \text{subject to } w \text{ classifies every } (x_n, y_n) \text{ correctly}; \quad \text{fatness}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

restated formally:

$$\max_{w} \ \text{margin}(w) \quad \text{subject to every } y_n w^T x_n > 0; \quad \text{margin}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

• fatness: formally called margin
• correctness: $y_n = \text{sign}(w^T x_n)$

goal: find the largest-margin separating hyperplane
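As a sanity check on the definition, margin(w) is just the smallest distance from the data to the hyperplane, provided the hyperplane separates the data. A rough sketch, using the distance formula $|w^T x + b| / \|w\|$ that the 'Distance to Hyperplane' slide below derives; the helper name and data are invented:

```python
import numpy as np

def margin(b, w, X, y):
    """margin(b, w) = min_n distance(x_n, b, w) for a separating (b, w);
    distance(x, b, w) = |w^T x + b| / ||w|| (derived below in this lecture)."""
    signed = y * (X @ w + b)   # y_n (w^T x_n + b), positive iff x_n is classified correctly
    if np.any(signed <= 0):
        return -np.inf         # not a separating hyperplane: margin undefined here
    return np.min(signed) / np.linalg.norm(w)

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1, 1, -1, -1])
print(margin(0.0, np.array([-1., 1.]), X, y))  # ~0.7071: closest point sits 1/sqrt(2) from x2 = x1
```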
Fun Time

Consider two examples $(v, +1)$ and $(-v, -1)$ where $v \in \mathbb{R}^2$ (without padding the $v_0 = 1$). Which of the following hyperplanes is the largest-margin separating one for the two examples? You are highly encouraged to visualize by considering, for instance, $v = (3, 2)$.
1. $x_1 = 0$
2. $x_2 = 0$
3. $v_1 x_1 + v_2 x_2 = 0$
4. $v_2 x_1 + v_1 x_2 = 0$

Reference Answer: 3
Here the largest-margin separating hyperplane (line) must be the perpendicular bisector of the line segment between $v$ and $-v$. Hence $v$ is a normal vector of the largest-margin line. The result can be extended to the more general case of $v \in \mathbb{R}^d$.
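One can also verify the reference answer numerically with $v = (3, 2)$, computing each candidate's margin as the minimum distance from the two examples; an illustrative check, reusing the margin formula from the sketch above:

```python
import numpy as np

v = np.array([3.0, 2.0])
X = np.stack([v, -v])                # the two examples
y = np.array([1, -1])

# the four candidate hyperplanes w^T x = 0, each written via its normal vector w (b = 0)
candidates = {
    "x1 = 0":            np.array([1.0, 0.0]),
    "x2 = 0":            np.array([0.0, 1.0]),
    "v1 x1 + v2 x2 = 0": np.array([v[0], v[1]]),
    "v2 x1 + v1 x2 = 0": np.array([v[1], v[0]]),
}

for name, w in candidates.items():
    m = np.min(y * (X @ w)) / np.linalg.norm(w)  # margin; negative would mean "not separating"
    print(f"{name}: margin = {m:.3f}")
# the third candidate wins with margin sqrt(13) ~ 3.606, confirming answer 3
```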
Standard Large-Margin Problem
Distance to Hyperplane: Preliminary

$$\max_{w} \ \text{margin}(w) \quad \text{subject to every } y_n w^T x_n > 0; \quad \text{margin}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

'shorten' x and w: the distance treats $w_0$ and $(w_1, \ldots, w_d)$ differently (to be derived), so separate them and drop the $x_0 = 1$ padding:

$$b = w_0, \quad w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}; \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$$

for this part: $h(x) = \text{sign}(w^T x + b)$
Distance to Hyperplane

want: distance$(x, b, w)$, with hyperplane $w^T x' + b = 0$

consider $x'$, $x''$ on the hyperplane:
1. $w^T x' = -b$ and $w^T x'' = -b$
2. $w \perp$ hyperplane: $w^T (x'' - x') = 0$, since $(x'' - x')$ is a vector on the hyperplane
3. distance = projection of $(x - x')$ onto the hyperplane's normal $w$

[figure: point $x$, hyperplane through $x'$ and $x''$, normal vector $w$, dist$(x, h)$]

$$\text{distance}(x, b, w) = \left| \frac{w^T}{\|w\|} (x - x') \right| = \frac{1}{\|w\|} \left| w^T x + b \right|$$
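A direct transcription of this distance formula, with a numeric spot check; the function name and example line are invented for illustration:

```python
import numpy as np

def distance(x, b, w):
    """Distance from point x to the hyperplane w^T x + b = 0: |w^T x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

# spot check: the line x1 + x2 - 1 = 0 has w = (1, 1), b = -1;
# the origin should be 1/sqrt(2) ~ 0.7071 away
print(distance(np.array([0.0, 0.0]), -1.0, np.array([1.0, 1.0])))
```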
Distance to Separating Hyperplane

$$\text{distance}(x, b, w) = \frac{1}{\|w\|} \left| w^T x + b \right|$$

• separating hyperplane: for every $n$, $y_n (w^T x_n + b) > 0$
• distance to a separating hyperplane: since a correct sign means $y_n (w^T x_n + b) = |w^T x_n + b|$,
  $$\text{distance}(x_n, b, w) = \frac{1}{\|w\|} y_n (w^T x_n + b)$$

$$\max_{b,w} \ \text{margin}(b, w) \quad \text{subject to every } y_n (w^T x_n + b) > 0; \quad \text{margin}(b, w) = \min_{n=1,\ldots,N} \frac{1}{\|w\|} y_n (w^T x_n + b)$$
Margin of Special Separating Hyperplane

$$\max_{b,w} \ \text{margin}(b, w) \quad \text{subject to every } y_n (w^T x_n + b) > 0; \quad \text{margin}(b, w) = \min_{n=1,\ldots,N} \frac{1}{\|w\|} y_n (w^T x_n + b)$$

• $w^T x + b = 0$ is the same hyperplane as $3 w^T x + 3b = 0$: scaling does not matter
• special scaling: only consider separating $(b, w)$ such that $\min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$
  $\implies \text{margin}(b, w) = \frac{1}{\|w\|}$

$$\max_{b,w} \ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$$
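The special scaling is easy to demonstrate: dividing any separating $(b, w)$ by $c = \min_n y_n (w^T x_n + b)$ leaves the hyperplane (and hence the margin) unchanged while making that minimum exactly 1, so the margin becomes $1/\|w\|$. A hypothetical sketch with invented names:

```python
import numpy as np

def special_scaling(b, w, X, y):
    """Rescale a separating (b, w) by c = min_n y_n (w^T x_n + b) > 0.
    (b/c, w/c) describes the same hyperplane, but now the minimum is exactly 1,
    so the margin equals 1 / ||w/c||."""
    c = np.min(y * (X @ w + b))
    assert c > 0, "(b, w) must separate the data"
    return b / c, w / c

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1, 1, -1, -1])
b, w = special_scaling(0.0, np.array([-3.0, 3.0]), X, y)
print(b, w)                   # 0.0 [-1.  1.]: same hyperplane, specially scaled
print(1 / np.linalg.norm(w))  # ~0.7071, matching the margin computed directly earlier
```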
Standard Large-Margin Hyperplane Problem

$$\max_{b,w} \ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$$

necessary constraints: $y_n (w^T x_n + b) \ge 1$ for all $n$
original constraint: $\min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$

want: the optimal $(b, w)$ to stay here (inside the original constraint).
if the optimal $(b, w)$ were outside, e.g. $y_n (w^T x_n + b) > 1.126$ for all $n$
—can scale $(b, w)$ to the "more optimal" $\left(\frac{b}{1.126}, \frac{w}{1.126}\right)$, which still satisfies the constraints with a smaller $\|w\|$ (contradiction!)

final change: max $\Longrightarrow$ min, remove the square root in $\|w\| = \sqrt{w^T w}$, and add $\frac{1}{2}$:

$$\min_{b,w} \ \frac{1}{2} w^T w \quad \text{subject to } y_n (w^T x_n + b) \ge 1 \text{ for all } n$$
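The lecture solves this quadratic program properly in the next section; purely to make the formulation tangible, here is an illustrative sketch that hands it to a general-purpose solver (scipy's SLSQP). This is not the lecture's algorithm, and the function name is invented:

```python
import numpy as np
from scipy.optimize import minimize

def svm_hard_margin(X, y):
    """Solve min_{b,w} (1/2) w^T w  subject to  y_n (w^T x_n + b) >= 1 for all n,
    with the variables packed as z = [b, w_1, ..., w_d]."""
    d = X.shape[1]
    objective = lambda z: 0.5 * z[1:] @ z[1:]
    constraints = [{"type": "ineq",                        # "ineq" means fun(z) >= 0
                    "fun": lambda z: y * (X @ z[1:] + z[0]) - 1}]
    res = minimize(objective, np.zeros(d + 1), method="SLSQP", constraints=constraints)
    return res.x[0], res.x[1:]

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1., 1., -1., -1.])
b, w = svm_hard_margin(X, y)
print(b, w, 1 / np.linalg.norm(w))  # largest-margin separator and its margin
```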