Machine Learning Techniques (機器學習技法)
Lecture 2: Dual Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap

1 Embedding Numerous Features: Kernel Models

  Lecture 1: Linear Support Vector Machine
    linear SVM: more robust, and solvable with quadratic programming

  Lecture 2: Dual Support Vector Machine
    Motivation of Dual SVM
    Lagrange Dual SVM
    Solving Dual SVM
    Messages behind Dual SVM

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models
Non-Linear Support Vector Machine Revisited

  min_{b,w} (1/2) w^T w
  s.t. y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N,  with z_n = Φ(x_n)

Non-Linear Hard-Margin SVM
1. Q = [[0, 0_d̃^T], [0_d̃, I_d̃]]; p = 0_{d̃+1}; a_n^T = y_n [1, z_n^T]; c_n = 1
2. [b; w] ← QP(Q, p, A, c)
3. return b ∈ R and w ∈ R^d̃ with g_SVM(x) = sign(w^T Φ(x) + b)

• demanded: not many (large-margin), but sophisticated boundary (feature transform)
• QP with d̃ + 1 variables and N constraints: challenging if d̃ is large, or infinite?! :-)

goal: SVM without dependence on d̃
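To make steps 1-2 concrete, here is a minimal sketch of this primal QP in Python, assuming the cvxopt package as a stand-in for the abstract QP(Q, p, A, c) routine (cvxopt is not part of the lecture; its convention is min (1/2) u^T Q u + p^T u subject to G u ≤ h, so the ≥ constraints are negated):

```python
# A minimal sketch of the primal hard-margin SVM QP, assuming the cvxopt
# package; variable u = [b; w] as in step 2 of the slide.
import numpy as np
from cvxopt import matrix, solvers

def primal_svm(Z, y):
    """Z: (N, d) transformed inputs z_n = Phi(x_n); y: (N,) labels in {-1, +1}."""
    N, d = Z.shape
    Q = np.zeros((d + 1, d + 1))   # objective (1/2) u^T Q u penalizes w only
    Q[1:, 1:] = np.eye(d)
    Q[0, 0] = 1e-9                 # tiny ridge on b; some solvers dislike singular Q
    p = np.zeros(d + 1)
    # y_n (w^T z_n + b) >= 1, rewritten as G u <= h for cvxopt's convention
    G = -y[:, None] * np.hstack([np.ones((N, 1)), Z])
    h = -np.ones(N)
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]             # optimal b and w
```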
Todo: SVM 'without' d̃

Original SVM: (convex) QP of
• d̃ + 1 variables
• N constraints

'Equivalent' SVM: (convex) QP of
• N variables
• N + 1 constraints

Warning: Heavy Math!!!!!!
• introduce some necessary math without rigor, to help understand SVM more deeply
• 'claim' some results if the details are unnecessary, like how we 'claimed' Hoeffding

'Equivalent' SVM: based on some dual problem of the Original SVM
Key Tool: Lagrange Multipliers

Regularization by constrained-minimizing E_in:
  min_w E_in(w)  s.t. w^T w ≤ C

⇔ Regularization by minimizing E_aug:
  min_w E_aug(w) = E_in(w) + (λ/N) w^T w

• C is equivalent to some λ ≥ 0, by checking the optimality condition ∇E_in(w) + (2λ/N) w = 0
• regularization: view λ as a given parameter instead of C, and solve 'easily'
• dual SVM: view the λ's as unknowns given the constraints, and solve them as variables instead

how many λ's as variables? N: one per constraint
Starting Point: Constrained to 'Unconstrained'

  min_{b,w} (1/2) w^T w  s.t. y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N

Lagrange Function: with Lagrange multipliers α_n (the λ_n of the previous slide, renamed),

  L(b, w, α) = (1/2) w^T w  [objective]  +  Σ_{n=1}^N α_n (1 − y_n (w^T z_n + b))  [constraints]

Claim:
  SVM ≡ min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) = min_{b,w} ( ∞ if violating; (1/2) w^T w if feasible )

• any 'violating' (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (some positive) ) → ∞
• any 'feasible' (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (all non-positive) ) = (1/2) w^T w

constraints now hidden in max
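The claim can be sanity-checked numerically. Here is a small sketch on hypothetical two-point data (not from the slides): for a feasible (b, w) the inner max settles at (1/2) w^T w, while for a violating (b, w) it grows without bound as the α's increase.

```python
# Numeric sanity check of the claim, on hypothetical two-point data.
import numpy as np

Z = np.array([[1.0], [-1.0]])   # z_1 = 1, z_2 = -1
y = np.array([+1.0, -1.0])

def L(b, w, alpha):
    """L(b, w, alpha) = (1/2) w^T w + sum_n alpha_n (1 - y_n (w^T z_n + b))."""
    return 0.5 * w @ w + alpha @ (1 - y * (Z @ w + b))

w_ok,  b_ok  = np.array([2.0]), 0.0   # feasible: both margins are 2 >= 1
w_bad, b_bad = np.array([0.5]), 0.0   # violating: both margins are 0.5 < 1
for a in [0.0, 1.0, 10.0, 1000.0]:
    print(a, L(b_ok, w_ok, np.array([a, a])), L(b_bad, w_bad, np.array([a, a])))
# feasible: L = 2 - 2a, so max over alpha >= 0 is (1/2) w^T w = 2, at alpha = 0
# violating: L = 0.125 + a, so max over alpha >= 0 is unbounded (infinity)
```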
Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. What is the Lagrange function L(b, w, α) of hard-margin SVM?

1. (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z + b)
2. (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z + b)
3. (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z − b)
4. (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z − b)

Reference Answer: 2

By definition, L(b, w, α) = (1/2) w^T w + α_1 (1 − y_1 (w^T z_1 + b)) + α_2 (1 − y_2 (w^T z_2 + b)) with (z_1, y_1) = (z, +1) and (z_2, y_2) = (−z, −1).
Lagrange Dual Problem

For any fixed α′ with all α′_n ≥ 0,

  min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) ≥ min_{b,w} L(b, w, α′)

because max ≥ any. For the best α′ ≥ 0 on the RHS,

  min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) ≥ max_{all α′_n ≥ 0} min_{b,w} L(b, w, α′)  [Lagrange dual problem]

because the best is one of any.

Lagrange dual problem: 'outer' maximization of α on a lower bound of the original problem
Strong Duality of Quadratic Programming

  min_{b,w} max_{all α_n ≥ 0} L(b, w, α)  [equiv. to original (primal) SVM]
  ≥ max_{all α_n ≥ 0} min_{b,w} L(b, w, α)  [Lagrange dual]

• '≥': weak duality
• '=': strong duality; true for QP if
  • convex primal
  • feasible primal (true if Φ-separable)
  • linear constraints
  (called constraint qualification)

there exists a primal-dual optimal solution (b, w, α) for both sides
Solving Lagrange Dual: Simplifications (1/2)

  max_{all α_n ≥ 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n (w^T z_n + b)) )
  [inner objective = L(b, w, α)]

• inner problem 'unconstrained'; at optimal: ∂L(b, w, α)/∂b = 0 = −Σ_{n=1}^N α_n y_n
• no loss of optimality if solving with constraint Σ_{n=1}^N α_n y_n = 0
• but wait, b can then be removed:

  max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n w^T z_n) − (Σ_{n=1}^N α_n y_n) · b )

  where the crossed-out last term is zero under the new constraint!
Solving Lagrange Dual: Simplifications (2/2)

  max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n w^T z_n) )

• inner problem 'unconstrained'; at optimal: ∂L(b, w, α)/∂w_i = 0 = w_i − Σ_{n=1}^N α_n y_n z_{n,i}
• no loss of optimality if solving with constraint w = Σ_{n=1}^N α_n y_n z_n
• but wait!

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n − w^T w )

  ⇐⇒ max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n
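Spelling out the last substitution (plain algebra, consistent with the slide): with w = Σ_n α_n y_n z_n plugged into the inner objective,

```latex
\tfrac{1}{2} w^T w + \sum_{n=1}^{N}\alpha_n
  - w^T \underbrace{\sum_{n=1}^{N}\alpha_n y_n z_n}_{=\,w}
= \tfrac{1}{2} w^T w + \sum_{n=1}^{N}\alpha_n - w^T w
= -\tfrac{1}{2}\Bigl\|\sum_{n=1}^{N}\alpha_n y_n z_n\Bigr\|^2 + \sum_{n=1}^{N}\alpha_n .
```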
KKT Optimality Conditions

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n

If primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0; w = Σ α_n y_n z_n
• primal-inner optimal (at optimal, all 'Lagrange terms' disappear): α_n (1 − y_n (w^T z_n + b)) = 0

These are called the Karush-Kuhn-Tucker (KKT) conditions: necessary for optimality [and sufficient here].

will use KKT to 'solve' (b, w) from the optimal α
Fun Time

For a single variable w, consider minimizing (1/2) w² subject to two linear constraints w ≥ 1 and w ≤ 3. We know that the Lagrange function L(w, α) = (1/2) w² + α_1 (1 − w) + α_2 (w − 3). Which of the following equations that contain α are among the KKT conditions of the optimization problem?

1. α_1 ≥ 0 and α_2 ≥ 0
2. w = α_1 − α_2
3. α_1 (1 − w) = 0 and α_2 (w − 3) = 0
4. all of the above

Reference Answer: 4

1 contains the dual-feasible constraints; 2 contains the dual-inner-optimal constraints; 3 contains the primal-inner-optimal constraints.
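A quick numeric check of the answer (hand-derived, not from the slides): the optimum of this toy problem is w = 1 with α = (1, 0), and it satisfies all four KKT conditions at once.

```python
# Hand-derived optimum of min (1/2) w^2 s.t. w >= 1, w <= 3, checked against
# the four KKT conditions of the previous slide.
w, a1, a2 = 1.0, 1.0, 0.0
assert 1 <= w <= 3                               # primal feasible
assert a1 >= 0 and a2 >= 0                       # dual feasible
assert w - a1 + a2 == 0                          # dual-inner optimal: dL/dw = 0
assert a1 * (1 - w) == 0 and a2 * (w - 3) == 0   # primal-inner optimal
```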
Dual Formulation of Support Vector Machine

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n

standard hard-margin SVM dual:

  min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
  subject to Σ_{n=1}^N y_n α_n = 0; α_n ≥ 0, for n = 1, 2, ..., N

(convex) QP of N variables and N + 1 constraints, as promised

how to solve? yeah, we know QP! :-)
Dual SVM with QP Solver

optimal α ← QP(Q, p, A, c), where the dual

  min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
  subject to Σ_{n=1}^N y_n α_n = 0; α_n ≥ 0, for n = 1, 2, ..., N

maps to the standard form

  min_α (1/2) α^T Q α + p^T α  subject to a_i^T α ≥ c_i, for i = 1, 2, ...

with
• q_{n,m} = y_n y_m z_n^T z_m
• p = −1_N
• a_≥ = y, a_≤ = −y (the equality constraint written as two inequalities); a_n = n-th unit direction
• c_≥ = 0, c_≤ = 0; c_n = 0

note: many solvers treat the equality (a_≥, a_≤) and bound (a_n) constraints specially for numerical stability
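As with the primal sketch earlier, here is a hedged rendering of the dual QP using cvxopt as a general-purpose solver (cvxopt takes the equality constraint directly rather than as the two inequalities a_≥, a_≤ above); the special large-N solvers of the next slide are what one would prefer in practice.

```python
# A sketch of the dual hard-margin SVM QP, again assuming the cvxopt package.
import numpy as np
from cvxopt import matrix, solvers

def dual_svm(Z, y):
    """Z: (N, d) transformed inputs; y: (N,) labels in {-1, +1}; returns alpha."""
    N = Z.shape[0]
    YZ = y[:, None] * Z                    # n-th row is y_n z_n^T
    Q = YZ @ YZ.T                          # q_{n,m} = y_n y_m z_n^T z_m
    p = -np.ones(N)                        # p = -1_N
    G, h = -np.eye(N), np.zeros(N)         # alpha_n >= 0, written as -alpha_n <= 0
    A, b = y.reshape(1, N), np.zeros(1)    # equality constraint: sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h),
                     matrix(A), matrix(b))
    return np.array(sol['x']).ravel()      # optimal alpha
```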
Dual SVM with Special QP Solver

optimal α ← QP(Q_D, p, A, c):

  min_α (1/2) α^T Q_D α + p^T α
  subject to special equality and bound constraints

• q_{n,m} = y_n y_m z_n^T z_m, often non-zero
• if N = 30,000, the dense Q_D (N by N, symmetric) takes > 3G RAM (about N²/2 doubles ≈ 3.6 GB)
• need a special solver for
  • not storing the whole Q_D
  • utilizing the special constraints properly, to scale up to large N

usually better to use a special solver in practice
Optimal (b, w)

KKT conditions: if primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0; w = Σ α_n y_n z_n
• primal-inner optimal (at optimal, all 'Lagrange terms' disappear): α_n (1 − y_n (w^T z_n + b)) = 0 (complementary slackness)

• optimal α =⇒ optimal w? easy from the above!
• optimal α =⇒ optimal b? a range from primal feasibility, plus an equality from complementary slackness: if one α_n > 0, then b = y_n − w^T z_n

complementary slackness: α_n > 0 ⇒ (z_n, y_n) on the fat boundary (an SV!)
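Continuing the dual_svm sketch from before (an illustration, not the lecture's own code), the two bullets translate directly; a small tolerance is needed because a numerical solver returns α's that are only approximately zero.

```python
# Recovering the optimal (b, w) from the optimal alpha via the KKT conditions.
# Z, y: the transformed data passed to dual_svm in the earlier sketch.
alpha = dual_svm(Z, y)
sv = alpha > 1e-6            # support vectors: alpha_n > 0 (up to tolerance)
w = (alpha * y) @ Z          # w = sum_n alpha_n y_n z_n
n = int(np.argmax(alpha))    # any SV works; take the one with the largest alpha
b = y[n] - w @ Z[n]          # complementary slackness: b = y_n - w^T z_n
```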
Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. After solving the dual problem of hard-margin SVM, assume that the optimal α_1 and α_2 are both strictly positive. What is the optimal b?

1. −1
2. 0
3. 1
4. not certain with the descriptions above

Reference Answer: 2

With the descriptions, at the optimal (b, w), b = +1 − w^T z = −1 + w^T z. That is, w^T z = 1 and b = 0.
Support Vectors Revisited

• examples on the boundary 'locate' the fattest hyperplane; other examples: not needed
• examples with α_n > 0: on the boundary
• call the α_n > 0 examples (z_n, y_n) support vectors (no longer just candidates)
• SV (positive α_n) ⊆ SV candidates (on the boundary)

[figure: fattest hyperplane x_1 − x_2 − 1 = 0 with margin 0.707, support vectors on the boundary]

• only SVs needed to compute w: w = Σ_{n=1}^N α_n y_n z_n = Σ_{SV} α_n y_n z_n
• only SVs needed to compute b: b = y_n − w^T z_n with any SV (z_n, y_n)

SVM: learn the fattest hyperplane by identifying the support vectors with the dual optimal solution
Representation of Fattest Hyperplane

SVM:  w_SVM = Σ_{n=1}^N α_n (y_n z_n), with α_n from the dual solution
PLA:  w_PLA = Σ_{n=1}^N β_n (y_n z_n), with β_n counting mistake corrections

w = linear combination of y_n z_n
• also true for GD/SGD-based LogReg/LinReg when w_0 = 0
• call w 'represented' by the data

SVM: represent w by the SVs only
Summary: Two Forms of Hard-Margin SVM

Primal Hard-Margin SVM
  min_{b,w} (1/2) w^T w
  sub. to y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N
• d̃ + 1 variables, N constraints: suitable when d̃ + 1 is small
• physical meaning: locate the specially-scaled (b, w)

Dual Hard-Margin SVM
  min_α (1/2) α^T Q_D α − 1^T α
  s.t. y^T α = 0; α_n ≥ 0, for n = 1, ..., N
• N variables, N + 1 simple constraints: suitable when N is small
• physical meaning: locate the SVs (z_n, y_n) and their α_n

both eventually result in the optimal (b, w) for the fattest hyperplane g_SVM(x) = sign(w^T Φ(x) + b)
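Putting the two sketches side by side on a tiny hypothetical dataset, both forms should agree on the resulting fattest hyperplane:

```python
# Hypothetical 4-point dataset, separable by the first coordinate; the fattest
# hyperplane is x_1 = 0, i.e., w = (0.5, 0) and b = 0 after the SVM scaling.
import numpy as np
Z = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, -2.0], [-2.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

b_p, w_p = primal_svm(Z, y)   # primal: d + 1 = 3 variables, N = 4 constraints
alpha = dual_svm(Z, y)        # dual: N = 4 variables, N + 1 = 5 constraints
w_d = (alpha * y) @ Z
n = int(np.argmax(alpha))
b_d = y[n] - w_d @ Z[n]
print(w_p, b_p)               # approximately (0.5, 0) and 0
print(w_d, b_d)               # should match the primal solution
```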
Are We Done Yet?

goal: SVM without dependence on d̃

  min_α (1/2) α^T Q_D α − 1^T α
  subject to y^T α = 0; α_n ≥ 0, for n = 1, 2, ..., N

• N variables, N + 1 constraints: no dependence on d̃?
• q_{n,m} = y_n y_m z_n^T z_m: an inner product in R^d̃, which is O(d̃) via naïve computation!

no dependence only if avoiding naïve computation (next lecture :-))
Fun Time

Consider applying dual hard-margin SVM on N = 5566 examples and getting 1126 SVs. Which of the following can be the number of examples that are on the fat boundary, that is, the SV candidates?

1. 0
2. 1024
3. 1234
4. 9999

Reference Answer: 3

Because SVs are always on the fat boundary, #SVs ≤ #SV candidates ≤ N.
Summary

1 Embedding Numerous Features: Kernel Models

  Lecture 2: Dual Support Vector Machine
    Motivation of Dual SVM: want to remove the dependence on d̃
    Lagrange Dual SVM: KKT conditions link primal and dual
    Solving Dual SVM: another QP, better solved with a special solver
    Messages behind Dual SVM: SVs represent the fattest hyperplane

  • next: computing inner products in R^d̃ efficiently

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models