Machine Learning Techniques (機器學習技巧)
Lecture 1: Large-Margin Linear Classification
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Agenda

Lecture 1: Large-Margin Linear Classification
• Large-Margin Separating Hyperplane
• Standard Large-Margin Problem
• Support Vector Machine
• Reasons behind Large-Margin Hyperplane
Linear Classification Revisited

PLA/pocket
[figure: perceptron, inputs x_0, x_1, x_2, ..., x_d feed a weighted score s; output h(x) = sign(s)]
• plausible err = 0/1 (small flipping noise)
• minimize specially (linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)
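Since the slide stays abstract here, a minimal NumPy sketch (mine, not from the lecture) of the linear classifier h(x) = sign(w^T x), with the x_0 = 1 coordinate absorbing the threshold:

```python
import numpy as np

def linear_classify(w, X):
    """h(x) = sign(w^T x) for each row of X; w has d + 1 entries."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1
    return np.sign(X1 @ w)
```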
Which Line Is Best?

• PLA? depending on randomness
• VC bound? whichever you like!
  E_out(w) ≤ E_in(w) + Ω(H), where E_in(w) = 0 for every separating line and Ω(H) is the same for all of them since d_VC = d + 1
• You? rightmost one, possibly :-)
Why Rightmost Hyperplane?

informal argument
• if (Gaussian-like) noise on future x ≈ x_n:
  x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting
• robustness of separating hyperplane ⇐⇒ amount of noise tolerance ⇐⇒ distance to closest x_n

rightmost one: more robust, because its closest x_n is farthest from the hyperplane
Fat Hyperplane

• robust separating hyperplane: fat, i.e. far from examples on both sides
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane
Large-Margin Separating Hyperplane

max_w  fatness(w)
subject to  w classifies every (x_n, y_n) correctly
fatness(w) = min_{n=1,...,N} distance(x_n, w)

with the usual names:

max_w  margin(w)
subject to  every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane
Distance to Hyperplane: Preliminary

max_w  margin(w)
subject to  every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

'shorten' x and w:
• distance needs w_0 and (w_1, ..., w_d) differently (to be derived)
• b = w_0;  w = (w_1, ..., w_d);  drop the x_0 = 1 coordinate, so x = (x_1, ..., x_d)

next: h(x) = sign(w^T x + b)
Distance to Hyperplane

want: distance(x, w, b), with hyperplane w^T x' + b = 0

consider x', x'' on the hyperplane:
1. w^T x' = −b
2. w ⊥ hyperplane: w^T (x'' − x') = 0, because x'' − x' is a vector on the hyperplane
3. distance = projection of (x − x') onto w, the direction ⊥ to the hyperplane

[figure: points x', x'' on the hyperplane, normal vector w, and dist(x, h) for a point x]

distance(x, w, b) = |(w^T / ||w||)(x − x')| = (1/||w||) |w^T x + b|
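A quick numeric check of this distance formula; a sketch of mine, using the hyperplane x_1 − x_2 − 1 = 0 that appears later in the lecture:

```python
import numpy as np

def distance(x, w, b):
    """distance(x, w, b) = |w^T x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

w, b = np.array([1.0, -1.0]), -1.0           # hyperplane x_1 - x_2 - 1 = 0
print(distance(np.array([2.0, 0.0]), w, b))  # 1/sqrt(2) ≈ 0.707
```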
Distance to Separating Hyperplane

distance(x, w, b) = (1/||w||) |w^T x + b|

• separating hyperplane: for every n, y_n (w^T x_n + b) > 0
• distance to separating hyperplane: distance(x_n, w, b) = (1/||w||) y_n (w^T x_n + b)

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)
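The margin of a candidate (w, b) is then a one-liner; a sketch of mine, with the convention that a non-positive value signals that (w, b) fails to separate the data:

```python
import numpy as np

def margin(w, b, X, y):
    """min_n (1/||w||) y_n (w^T x_n + b); positive iff (w, b) separates (X, y)."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```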
Margin of Special Separating Hyperplane

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)

• (w, b) and (1126w, 1126b): same hyperplane, same margin
• special scaling: only consider separating (w, b) such that min_n y_n (w^T x_n + b) = 1
  ⇒ margin(w, b) = 1/||w||

max_{b,w}  1/||w||
subject to  min_{n=1,...,N} y_n (w^T x_n + b) = 1
(the 'every y_n (w^T x_n + b) > 0' constraint is implied by the scaling, and dropped)
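Why scaling by any α > 0 (1126 included) leaves the margin untouched, in one line of algebra (my expansion of the slide's claim):

```latex
\frac{y_n\bigl((\alpha w)^{\mathsf T} x_n + \alpha b\bigr)}{\lVert \alpha w \rVert}
  = \frac{\alpha\, y_n\bigl(w^{\mathsf T} x_n + b\bigr)}{\alpha\, \lVert w \rVert}
  = \frac{y_n\bigl(w^{\mathsf T} x_n + b\bigr)}{\lVert w \rVert}
  \qquad \text{for any } \alpha > 0 .
```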
Standard Large-Margin Hyperplane Problem

max_{b,w}  1/||w|| = 1/sqrt(w^T w)
subject to  min_{n=1,...,N} y_n (w^T x_n + b) = 1

final changes:
• max ⇒ min, remove the sqrt, add 1/2
• min(...) = 1 ⇒ (...) ≥ 1: minimizing (1/2) w^T w ensures not all (...) > 1, so the relaxation does not change the optimum (sketched after this slide)

min_{b,w}  (1/2) w^T w
subject to  y_n (w^T x_n + b) ≥ 1 for all n
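The promised sketch of why the relaxation is harmless (my wording of the contradiction argument):

```latex
\text{if the optimal } (b, w) \text{ had } y_n(w^{\mathsf T} x_n + b) > 1 \text{ for all } n,
\text{ say } \ge 1.126, \\
\text{then } \bigl(\tfrac{b}{1.126}, \tfrac{w}{1.126}\bigr) \text{ would remain feasible with objective }
\tfrac{1}{2}\,\tfrac{w^{\mathsf T} w}{1.126^2} < \tfrac{1}{2}\, w^{\mathsf T} w , \\
\text{contradicting optimality; hence some constraint is tight and }
\min_n y_n(w^{\mathsf T} x_n + b) = 1 .
```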
Solving a Particular Standard Problem

min_{b,w}  (1/2) w^T w
subject to  y_n (w^T x_n + b) ≥ 1

X = [0 0; 2 2; 2 0; 3 0],  y = (−1, −1, +1, +1)

(i)    −b ≥ 1
(ii)   −2w_1 − 2w_2 − b ≥ 1
(iii)   2w_1 + 0w_2 + b ≥ 1
(iv)    3w_1 + 0w_2 + b ≥ 1

• (i) & (iii) ⇒ w_1 ≥ +1;  (ii) & (iii) ⇒ w_2 ≤ −1  ⇒  (1/2) w^T w ≥ 1
• (w_1 = 1, w_2 = −1, b = −1) attains this lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 − x_2 − 1): SVM? :-)
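A numeric cross-check of the hand calculation; a sketch that leans on scipy's general constrained minimizer (SLSQP) rather than a dedicated QP routine:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])

# u = (b, w_1, w_2); objective (1/2) w^T w
objective = lambda u: 0.5 * (u[1] ** 2 + u[2] ** 2)
# one inequality y_n (w^T x_n + b) - 1 >= 0 per example
cons = [{'type': 'ineq',
         'fun': lambda u, xn=xn, yn=yn: yn * (u[1:] @ xn + u[0]) - 1.0}
        for xn, yn in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
print(res.x)  # ≈ (-1, 1, -1), i.e. b = -1, w_1 = 1, w_2 = -1 as derived above
```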
Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = −1, b = −1)
margin(w, b) = 1/||w|| = 1/sqrt(2) ≈ 0.707

[figure: hyperplane x_1 − x_2 − 1 = 0 with a fat boundary of width 0.707 on each side]

• examples on the boundary: 'locate' the fattest hyperplane; other examples: not needed
• call the boundary examples support vector (candidates)

support vector machine (SVM): learn fattest hyperplane (with the help of support vectors)
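Given the optimal (b, w), the boundary examples are exactly the ones whose constraint is tight, y_n (w^T x_n + b) = 1; a small sketch (variable names mine) on the toy data:

```python
import numpy as np

w, b = np.array([1.0, -1.0]), -1.0
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])

# support vector candidates: constraints holding with equality
sv = np.where(np.isclose(y * (X @ w + b), 1.0))[0]
print(sv)  # [0 1 2]: three boundary examples; (3, 0) is not needed
```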
Solving General SVM

min_{b,w}  (1/2) w^T w
subject to  y_n (w^T x_n + b) ≥ 1

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  that is, quadratic programming

quadratic programming (QP): 'easy' optimization problem
Quadratic Programming

optimal (b, w) = ?

min_{b,w}  (1/2) w^T w
subject to  y_n (w^T x_n + b) ≥ 1, for n = 1, 2, ..., N

optimal u ← QP(A, c, P, r)

min_u  (1/2) u^T A u + c^T u
subject to  p_m^T u ≥ r_m, for m = 1, 2, ..., M

objective function: u = [b; w];  A = [[0, 0_d^T], [0_d, I_d]];  c = 0_{d+1}
constraints: p_n^T = y_n [1, x_n^T];  r_n = 1;  M = N

SVM with general QP solver: easy if you've read the manual :-)
SVM with QP Solver

Linear Hard-Margin SVM Algorithm
1. A = [[0, 0_d^T], [0_d, I_d]];  c = 0_{d+1};  p_n^T = y_n [1, x_n^T];  r_n = 1
2. [b; w] ← QP(A, c, P, r)
3. return b & w as g_SVM

• hard-margin: nothing violates the 'fat boundary'
• linear: operates on x_n

want non-linear? z_n = Φ(x_n), remember? :-)
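Steps 1-3 end to end; a sketch where scipy's SLSQP stands in for the generic QP(A, c, P, r) routine, reusing svm_qp_matrices from the sketch above:

```python
import numpy as np
from scipy.optimize import minimize

def linear_hard_margin_svm(X, y):
    """Return (b, w) of the fattest separating hyperplane (assumes one exists)."""
    A, c, P, r = svm_qp_matrices(X, y)                       # step 1
    obj = lambda u: 0.5 * u @ A @ u + c @ u                  # (1/2) u^T A u + c^T u
    cons = {'type': 'ineq', 'fun': lambda u: P @ u - r}      # p_n^T u >= r_n
    u = minimize(obj, np.zeros(len(c)), constraints=cons).x  # step 2
    return u[0], u[1:]                                       # step 3

b, w = linear_hard_margin_svm(X, y)   # X, y as in the toy problem above
g_svm = lambda x: np.sign(w @ x + b)  # g_SVM(x) = sign(w^T x + b)
```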
Why Large-Margin Hyperplane?

min_{b,w}  (1/2) w^T w
subject to  y_n (w^T z_n + b) ≥ 1

                  minimize    constraint
regularization    E_in        w^T w ≤ C
SVM               w^T w       E_in = 0 [and more]

SVM (large-margin hyperplane): 'weight-decay regularization' within E_in = 0
Large-Margin Restricts Dichotomies

consider 'large-margin algorithm' A_ρ: either returns g with margin(g) ≥ ρ (if such g exists), or 0 otherwise

• A_0: like PLA ⇒ shatters 'general' 3 inputs
• A_1.126: more strict than SVM ⇒ cannot shatter some 3 inputs

fewer dichotomies ⇒ smaller 'VC dim.' ⇒ better generalization
VC Dimension of Large-Margin Algorithm

fewer dichotomies ⇒ smaller 'VC dim.'

considers d_VC(A_ρ) [data-dependent, needs more than VC], instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > sqrt(3)/2: cannot shatter any 3 inputs (d_VC < 3), because some pair of inputs must be within distance sqrt(3)

generally, when X is in a radius-R hyperball:
d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 ≤ d + 1 = d_VC(perceptrons)
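Plugging the unit-circle case into the general bound reproduces the bullet above (my arithmetic, with R = 1, d = 2, ρ > sqrt(3)/2):

```latex
d_{\mathrm{VC}}(\mathcal{A}_\rho)
  \;\le\; \min\!\Bigl(\tfrac{R^2}{\rho^2},\, d\Bigr) + 1
  \;<\;   \min\!\Bigl(\tfrac{1}{3/4},\, 2\Bigr) + 1
  \;=\;   \tfrac{4}{3} + 1
  \;<\;   3 .
```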
Benefits of Large-Margin Hyperplanes

            large-margin hyperplanes   hyperplanes   hyperplanes + higher-order transforms
#           even fewer                 not many      many
boundary    simple                     simple        sophisticated

• not many: good for d_VC and generalization
• sophisticated: good for possibly better E_in

a new possibility: non-linear SVM, i.e. large-margin hyperplanes + higher-order transforms
#: not many
boundary: sophisticated