Machine Learning Techniques
(機器學習技法)
Lecture 5: Kernel Logistic Regression
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap

Lecture 4: Soft-Margin Support Vector Machine
  allow some margin violations ξ_n while penalizing them by C;
  equivalent to upper-bounding α_n by C

1 Embedding Numerous Features: Kernel Models
  Lecture 5: Kernel Logistic Regression
    Soft-Margin SVM as Regularized Model
    SVM versus Logistic Regression
    SVM for Soft Binary Classification
    Kernel Logistic Regression
2 Combining Predictive Features: Aggregation Models
Soft-Margin SVM as Regularized Model

Wrap-Up

Hard-Margin Primal
  min_{b,w}  (1/2) w^T w
  s.t. y_n (w^T z_n + b) ≥ 1

Soft-Margin Primal
  min_{b,w,ξ}  (1/2) w^T w + C Σ_{n=1}^{N} ξ_n
  s.t. y_n (w^T z_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0

Hard-Margin Dual
  min_α  (1/2) α^T Q α − 1^T α
  s.t. y^T α = 0,  0 ≤ α_n

Soft-Margin Dual
  min_α  (1/2) α^T Q α − 1^T α
  s.t. y^T α = 0,  0 ≤ α_n ≤ C

soft-margin preferred in practice;
linear: LIBLINEAR; non-linear: LIBSVM
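As a concrete aside (not from the slides), the practical note above maps directly onto scikit-learn, whose LinearSVC and SVC wrap LIBLINEAR and LIBSVM respectively. A minimal sketch on made-up data, assuming scikit-learn is installed; note that LinearSVC's default loss is the squared hinge, a slight variant of the soft-margin formulation above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# made-up toy data; any (X, y) with y in {-1, +1} would do
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1} to match the slides' convention

# linear soft-margin SVM (LIBLINEAR underneath); C penalizes margin violations xi_n
linear_svm = LinearSVC(C=1.0).fit(X, y)

# kernel soft-margin SVM (LIBSVM underneath); the RBF kernel plays the role of z_n = Phi(x_n)
kernel_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

print(linear_svm.score(X, y), kernel_svm.score(X, y))
```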
Slack Variables ξ_n

• record 'margin violation' by ξ_n
• penalize with margin violation

  min_{b,w,ξ}  (1/2) w^T w + C · Σ_{n=1}^{N} ξ_n
  s.t. y_n (w^T z_n + b) ≥ 1 − ξ_n  and  ξ_n ≥ 0  for all n

[figure: hyperplane with margin; a point inside the margin marked as a violation]

on any (b, w),  ξ_n = margin violation = max(1 − y_n (w^T z_n + b), 0)
• (x_n, y_n) violating margin:      ξ_n = 1 − y_n (w^T z_n + b)
• (x_n, y_n) not violating margin:  ξ_n = 0

'unconstrained' form of soft-margin SVM:

  min_{b,w}  (1/2) w^T w + C Σ_{n=1}^{N} max(1 − y_n (w^T z_n + b), 0)
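To make the identity ξ_n = max(1 − y_n (w^T z_n + b), 0) and the resulting unconstrained objective concrete, here is a small NumPy sketch; the points, w, and b are made-up illustrative values rather than anything produced by an SVM solver:

```python
import numpy as np

# made-up toy values: 4 points in a 2-D Z-space with labels in {-1, +1}
Z = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, -2.0], [0.2, 0.1]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 0.5])   # hypothetical hyperplane direction
b = -0.3                   # hypothetical bias
C = 1.0

scores = Z @ w + b                       # s_n = w^T z_n + b
xi = np.maximum(1.0 - y * scores, 0.0)   # slack xi_n = margin violation of each point

# 'unconstrained' soft-margin objective: (1/2) w^T w + C * sum of margin violations
objective = 0.5 * w @ w + C * xi.sum()
print(xi)          # zero for points respecting the margin, positive otherwise
print(objective)
```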
Unconstrained Form

  min_{b,w}  (1/2) w^T w + C Σ_{n=1}^{N} max(1 − y_n (w^T z_n + b), 0)

familiar? :-)

  min  (1/2) w^T w + C Σ êrr

just L2 regularization

  min  (λ/N) w^T w + (1/N) Σ err

with shorter w, another parameter, and special err

why not solve this? :-)
• not QP, no (?) kernel trick
• max(·, 0) not differentiable, harder to solve
SVM as Regularized Model

                                minimize                      constraint
regularization by constraint    E_in                          w^T w ≤ C
hard-margin SVM                 w^T w                         E_in = 0 [and more]
L2 regularization               (λ/N) w^T w + E_in
soft-margin SVM                 (1/2) w^T w + C N Ê_in

large margin ⇐⇒ fewer hyperplanes ⇐⇒ L2 regularization of short w
soft margin ⇐⇒ special êrr
larger C (or larger constraint bound C) ⇐⇒ smaller λ ⇐⇒ less regularization

viewing SVM as regularized model:
allows extending/connecting to other learning models
Fun Time

When viewing soft-margin SVM as a regularized model, a larger C corresponds to
1  a larger λ, that is, stronger regularization
2  a smaller λ, that is, stronger regularization
3  a larger λ, that is, weaker regularization
4  a smaller λ, that is, weaker regularization

Reference Answer: 4
Comparing the formulations in the table above, we see that C corresponds to 1/(2λ). So a larger C corresponds to a smaller λ, which surely means weaker regularization.
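To spell out where the 1/(2λ) comes from, here is a short sketch that simply matches the two objectives from the table above, using the fact that scaling an objective by a positive constant does not change its minimizer:

```latex
% L2-regularized form vs. soft-margin SVM form
\min_{b,w}\ \frac{\lambda}{N} w^T w + \frac{1}{N}\sum_{n=1}^{N}\widehat{\mathrm{err}}
\qquad \text{vs.} \qquad
\min_{b,w}\ \frac{1}{2} w^T w + C\sum_{n=1}^{N}\widehat{\mathrm{err}}
% dividing the second objective by CN gives (1/(2CN)) w^T w + (1/N) sum err-hat;
% matching coefficients: lambda/N = 1/(2CN), i.e. lambda = 1/(2C) and C = 1/(2 lambda)
```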
SVM versus Logistic Regression

Algorithmic Error Measure of SVM

  min_{b,w}  (1/2) w^T w + C Σ_{n=1}^{N} max(1 − y_n (w^T z_n + b), 0)

linear score s = w^T z_n + b
• err_{0/1}(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_{0/1}
  (often called the hinge error measure)

[figure: err_{0/1} and êrr_SVM plotted against ys]

êrr_SVM: algorithmic error measure by convex upper bound of err_{0/1}
Connection between SVM and Logistic Regression

linear score s = w^T z_n + b
• err_{0/1}(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_{0/1}
• err_SCE(s, y) = log_2(1 + exp(−ys)): another upper bound of err_{0/1},
  used in logistic regression

[figure: err_{0/1}, êrr_SVM, and scaled cross-entropy plotted against ys]

                           ys → −∞        ys → +∞
êrr_SVM(s, y)              ≈ −ys          = 0
(ln 2) · err_SCE(s, y)     ≈ −ys          ≈ 0

SVM ≈ L2-regularized logistic regression
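A quick numerical sanity check of the comparison above, as a plain NumPy sketch over a made-up grid of ys values: both êrr_SVM and err_SCE upper-bound err_{0/1}, and for very negative and very positive ys the hinge and (ln 2) · err_SCE behave alike:

```python
import numpy as np

ys = np.linspace(-3, 3, 13)              # grid of y*s values, as on the horizontal axis above

err_01  = (ys <= 0).astype(float)        # err_0/1(s, y) = [[ys <= 0]]
err_svm = np.maximum(1.0 - ys, 0.0)      # hinge: max(1 - ys, 0)
err_sce = np.log2(1.0 + np.exp(-ys))     # scaled cross-entropy: log2(1 + exp(-ys))

assert np.all(err_svm >= err_01)         # hinge upper-bounds err_0/1
assert np.all(err_sce >= err_01)         # scaled cross-entropy upper-bounds err_0/1

# columns: ys, err_0/1, hinge, (ln 2) * err_SCE; the last two agree for large |ys|
print(np.c_[ys, err_01, err_svm, np.log(2) * err_sce])
```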
Linear Models for Binary Classification

PLA
  minimize err_{0/1} specially
  • pros: efficient if lin. separable
  • cons: works only if lin. separable, otherwise needing pocket

soft-margin SVM
  minimize regularized êrr_SVM by QP
  • pros: 'easy' optimization & theoretical guarantee
  • cons: loose bound of err_{0/1} for very negative ys

regularized logistic regression for classification
  minimize regularized err_SCE by GD/SGD/...
  • pros: 'easy' optimization & regularization guard
  • cons: loose bound of err_{0/1} for very negative ys

regularized LogReg ⟹ approximate SVM
SVM ⟹ approximate LogReg (?)
Fun Time

We know that êrr_SVM(s, y) is an upper bound of err_{0/1}(s, y). When is the upper bound tight? That is, when is êrr_SVM(s, y) = err_{0/1}(s, y)?
1  ys ≥ 0
2  ys ≤ 0
3  ys ≥ 1
4  ys ≤ 1

Reference Answer: 3
By plotting the figure, we can easily see that êrr_SVM(s, y) = err_{0/1}(s, y) if and only if ys ≥ 1. In that case, both error functions evaluate to 0.
SVM for Soft Binary Classification

Naïve Idea 1
1 run SVM and get (b_SVM, w_SVM)
2 return g(x) = θ(w_SVM^T x + b_SVM)
• 'direct' use of similarity: works reasonably well
• no LogReg flavor

Naïve Idea 2
1 run SVM and get (b_SVM, w_SVM)
2 run LogReg with (b_SVM, w_SVM) as w_0
3 return LogReg solution as g(x)
• not really 'easier' than original LogReg
• SVM flavor (kernel?) lost

want: flavors from both sides
A Possible Model: Two-Level Learning

  g(x) = θ(A · (w_SVM^T Φ(x) + b_SVM) + B)

• SVM flavor: fix hyperplane direction by w_SVM (kernel applies)
• LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling (A) and shifting (B)
• often A > 0 if w_SVM reasonably good
• often B ≈ 0 if b_SVM reasonably good

new LogReg problem:

  min_{A,B}  (1/N) Σ_{n=1}^{N} log(1 + exp(−y_n (A · Φ_SVM(x_n) + B))),
  where Φ_SVM(x_n) = w_SVM^T Φ(x_n) + b_SVM

two-level learning: LogReg on SVM-transformed data
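Because only the two variables (A, B) are involved, even plain gradient descent handles the new LogReg problem. A minimal NumPy sketch, where the SVM scores φ_n = w_SVM^T Φ(x_n) + b_SVM are made-up numbers standing in for the output of a trained SVM (the natural log is used, which leaves the minimizer unchanged):

```python
import numpy as np

# made-up SVM scores phi_n = w_SVM^T Phi(x_n) + b_SVM and labels y_n in {-1, +1}
phi = np.array([2.1, 0.7, 0.1, -0.4, -1.8, -2.5])
y   = np.array([ +1,  +1,  -1,   +1,   -1,   -1])

A, B = 1.0, 0.0          # start from the SVM solution itself (A = 1, B = 0)
eta = 0.1                # fixed learning rate, good enough for this sketch

for _ in range(1000):
    s = A * phi + B                        # two-level model's linear score
    sigma = 1.0 / (1.0 + np.exp(y * s))    # theta(-y s), the usual LogReg gradient weight
    grad_A = np.mean(-y * phi * sigma)     # d/dA of (1/N) sum ln(1 + exp(-y (A phi + B)))
    grad_B = np.mean(-y * sigma)           # d/dB of the same objective
    A -= eta * grad_A
    B -= eta * grad_B

print(A, B)              # fitted scaling and shifting of the SVM scores
```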
Probabilistic SVM

Platt's Model of Probabilistic SVM for Soft Binary Classification
1 run SVM on D to get (b_SVM, w_SVM) [or the equivalent α], and transform D to
  z'_n = w_SVM^T Φ(x_n) + b_SVM
  (the actual model performs this step in a more complicated manner)
2 run LogReg on {(z'_n, y_n)}_{n=1}^{N} to get (A, B)
  (the actual model adds some special regularization here)
3 return g(x) = θ(A · (w_SVM^T Φ(x) + b_SVM) + B)

• soft binary classifier not having the same boundary as the SVM classifier
  (because of B)
• how to solve LogReg: GD/SGD/or better (because there are only two variables)

kernel SVM ⟹ approx. LogReg in Z-space
exact LogReg in Z-space?
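Platt's construction can be imitated by hand with scikit-learn: take a kernel SVM's decision values as the one-dimensional transformed feature z'_n, then fit an ordinary logistic regression on them. A rough sketch on made-up data, assuming scikit-learn; it omits the refinements noted above (the more careful score computation and the special regularization), and scikit-learn's LogisticRegression adds its own default L2 penalty:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# made-up toy data with labels in {-1, +1}
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
y = 2 * y - 1

# step 1: kernel soft-margin SVM; decision_function(x) returns w_SVM^T Phi(x) + b_SVM
svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
z = svm.decision_function(X).reshape(-1, 1)   # the SVM-transformed data z'_n

# step 2: one-dimensional LogReg on (z'_n, y_n); coefficient and intercept play the roles of A and B
platt = LogisticRegression().fit(z, y)
A, B = platt.coef_[0, 0], platt.intercept_[0]

# step 3: soft binary classifier g(x) = theta(A * (w_SVM^T Phi(x) + b_SVM) + B)
def g(X_new):
    return 1.0 / (1.0 + np.exp(-(A * svm.decision_function(X_new) + B)))

print(A, B, g(X[:5]))
```

For a packaged version of the same idea, scikit-learn's SVC(probability=True) fits a similar sigmoid calibration internally.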