Machine Learning Techniques (機器學習技法)

Lecture 5: Kernel Logistic Regression

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)

Kernel Logistic Regression

Roadmap

1 Embedding Numerous Features: Kernel Models

Lecture 4: Soft-Margin Support Vector Machine
allow some margin violations ξ_n while penalizing them by C;
equivalent to upper-bounding α_n by C

Lecture 5: Kernel Logistic Regression
• Soft-Margin SVM as Regularized Model
• SVM versus Logistic Regression
• SVM for Soft Binary Classification
• Kernel Logistic Regression

2 Combining Predictive Features: Aggregation Models

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Wrap-Up

Hard-Margin Primal
min_{b,w}  (1/2) w^T w
s.t.  y_n (w^T z_n + b) ≥ 1

Soft-Margin Primal
min_{b,w,ξ}  (1/2) w^T w + C Σ_{n=1}^N ξ_n
s.t.  y_n (w^T z_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0

Hard-Margin Dual
min_α  (1/2) α^T Q α − 1^T α
s.t.  y^T α = 0,  0 ≤ α_n

Soft-Margin Dual
min_α  (1/2) α^T Q α − 1^T α
s.t.  y^T α = 0,  0 ≤ α_n ≤ C

soft-margin preferred in practice;
linear: LIBLINEAR; non-linear: LIBSVM
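
To make the wrap-up concrete, here is a minimal added sketch (not from the lecture; the toy data and parameter choices are assumptions) that trains both variants with scikit-learn, whose LinearSVC and SVC estimators are built on LIBLINEAR and LIBSVM respectively, and checks the soft-margin dual bound 0 ≤ α_n ≤ C through the dual coefficients that LIBSVM exposes.

import numpy as np
from sklearn.svm import LinearSVC, SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# linear soft-margin SVM: hinge loss + L2, handled by LIBLINEAR under the hood
linear_svm = LinearSVC(C=1.0, loss="hinge", max_iter=20000).fit(X, y)

# kernel soft-margin SVM: dual QP with 0 <= alpha_n <= C, handled by LIBSVM
kernel_svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print("LinearSVC train accuracy:", linear_svm.score(X, y))
print("SVC (RBF) train accuracy:", kernel_svm.score(X, y))
# dual_coef_ stores y_n * alpha_n for the support vectors, so |.| <= C
print("all |y_n * alpha_n| <= C:",
      bool(np.all(np.abs(kernel_svm.dual_coef_) <= 1.0 + 1e-8)))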

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Slack Variables ξ_n

record 'margin violation' by ξ_n; penalize with margin violation:

min_{b,w,ξ}  (1/2) w^T w + C · Σ_{n=1}^N ξ_n
s.t.  y_n (w^T z_n + b) ≥ 1 − ξ_n  and  ξ_n ≥ 0  for all n

[figure: a separating hyperplane with its margins and a point marked as a violation]

on any (b, w),
ξ_n = margin violation = max(1 − y_n (w^T z_n + b), 0)

• (x_n, y_n) violating margin: ξ_n = 1 − y_n (w^T z_n + b)
• (x_n, y_n) not violating margin: ξ_n = 0

'unconstrained' form of soft-margin SVM:

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)
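
The slack formula above can be evaluated directly; below is a tiny added sketch (the weight vector, bias, and points are hypothetical) that computes ξ_n = max(1 − y_n(w^T z_n + b), 0) for one point outside the margin, one inside it, and one misclassified.

import numpy as np

w = np.array([1.0, -0.5])       # hypothetical weight vector in Z-space
b = 0.2                         # hypothetical bias
Z = np.array([[2.0, 1.0],       # comfortably outside the margin
              [0.5, 0.3],       # inside the margin (0 < y*s < 1)
              [-1.0, 0.5]])     # on the wrong side (y*s < 0)
y = np.array([+1, +1, +1])

scores = Z @ w + b                          # w^T z_n + b
xi = np.maximum(1.0 - y * scores, 0.0)      # slack = recorded margin violation
for n in range(len(y)):
    status = "not violating margin" if xi[n] == 0 else "violating margin"
    print(f"n={n}: y*s = {y[n] * scores[n]:+.2f}, xi = {xi[n]:.2f}  ({status})")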

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Unconstrained Form

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)

familiar? :-)

min  (1/2) w^T w + C Σ êrr

just L2 regularization

min  (λ/N) w^T w + (1/N) Σ err

with shorter w, another parameter, and special err

why not solve this? :-)
• not QP, no (?) kernel trick
• max(·, 0) not differentiable, harder to solve
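
Even though max(·, 0) is not differentiable, the unconstrained objective can still be attacked with subgradient steps; the added sketch below (toy data, a fixed step size, and plain subgradient descent are all assumptions, not the lecture's recommended route, which stays with the QP dual) shows one way to do it.

import numpy as np

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(-1.0, 1.0, size=(30, 2)),
               rng.normal(+1.0, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
C, eta = 1.0, 0.01                    # assumed penalty and step size
w, b = np.zeros(2), 0.0

for t in range(2000):
    margins = y * (Z @ w + b)
    violating = margins < 1                        # where the hinge is active
    # subgradient of (1/2) w^T w + C * sum_n max(1 - y_n (w^T z_n + b), 0)
    grad_w = w - C * (y[violating, None] * Z[violating]).sum(axis=0)
    grad_b = -C * y[violating].sum()
    w -= eta * grad_w
    b -= eta * grad_b

objective = 0.5 * w @ w + C * np.maximum(1 - y * (Z @ w + b), 0).sum()
print("final unconstrained objective:", round(float(objective), 3))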

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

SVM as Regularized Model

• regularization by constraint:  minimize E_in  subject to w^T w ≤ C
• hard-margin SVM:  minimize w^T w  subject to E_in = 0 [and more]
• L2 regularization:  minimize (λ/N) w^T w + E_in
• soft-margin SVM:  minimize (1/2) w^T w + C·N·Ê_in

• large margin ⟺ fewer hyperplanes ⟺ L2 regularization of short w
• soft margin ⟺ special êrr
• larger C (the penalty) or larger C (the constraint bound) ⟺ smaller λ ⟺ less regularization

viewing SVM as regularized model:
allows extending/connecting to other learning models
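
The 'larger C ⟺ smaller λ' bullet follows from a one-line scaling argument (an added derivation in LaTeX notation, not on the slide); it also explains the 1/(2λ) quoted in the Fun Time answer below.

% dividing an objective by the positive constant CN does not change its minimizer
\min_{b,w}\ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\;\Longleftrightarrow\;
\min_{b,w}\ \frac{1}{2CN} w^T w + \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\;\Longleftrightarrow\;
\min_{b,w}\ \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\quad\text{with } \lambda = \frac{1}{2C},\ \text{i.e. } C = \frac{1}{2\lambda}.

So a larger C plays the role of a smaller λ, which means less regularization.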

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Fun Time

When viewing soft-margin SVM as a regularized model, a larger C corresponds to
1 a larger λ, that is, stronger regularization
2 a smaller λ, that is, stronger regularization
3 a larger λ, that is, weaker regularization
4 a smaller λ, that is, weaker regularization

Reference Answer: 4

Comparing the formulations on page 4 of the slides, we see that C corresponds to 1/(2λ). So a larger C corresponds to a smaller λ, which surely means weaker regularization.

Kernel Logistic Regression / SVM versus Logistic Regression

Algorithmic Error Measure of SVM

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)

linear score s = w^T z_n + b
• err_0/1(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_0/1
  —often called the hinge error measure

[figure: err_0/1 and êrr_SVM plotted against ys on [−3, 3]]

êrr_SVM: algorithmic error measure by convex upper bound of err_0/1
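
A quick numerical check (an added sketch over an assumed grid of ys values) confirms the convex-upper-bound claim: the hinge error never falls below err_0/1, coincides with it for ys ≥ 1, and sits strictly above it for 0 < ys < 1.

import numpy as np

ys = np.linspace(-3, 3, 601)                     # grid of y*s values
err_01 = (ys <= 0).astype(float)                 # err_0/1
err_svm = np.maximum(1.0 - ys, 0.0)              # hinge error

print("hinge >= 0/1 everywhere:", bool(np.all(err_svm >= err_01)))
print("equal for ys >= 1:", bool(np.all(err_svm[ys >= 1] == err_01[ys >= 1])))
print("strictly above for 0 < ys < 1:",
      bool(np.all(err_svm[(ys > 0) & (ys < 1)] > err_01[(ys > 0) & (ys < 1)])))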

Kernel Logistic Regression / SVM versus Logistic Regression

Connection between SVM and Logistic Regression

linear score s = w^T z_n + b
• err_0/1(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_0/1
• err_SCE(s, y) = log_2(1 + exp(−ys)): another upper bound of err_0/1, used in logistic regression

[figure: err_0/1, êrr_SVM, and the scaled cross-entropy plotted against ys on [−3, 3]]

                          ys → −∞      ys → +∞
êrr_SVM(s, y)             ≈ −ys        = 0
(ln 2) · err_SCE(s, y)    ≈ −ys        ≈ 0

SVM ≈ L2-regularized logistic regression
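
The asymptotic table can be verified numerically; the added sketch below (toy grid assumed) evaluates all three error measures, confirms that both surrogates upper-bound err_0/1, and shows that for very negative ys the quantity (ln 2)·err_SCE behaves like −ys, just as the hinge error does.

import numpy as np

ys = np.linspace(-10, 10, 2001)
err_01 = (ys <= 0).astype(float)            # err_0/1
err_svm = np.maximum(1.0 - ys, 0.0)         # hinge error
err_sce = np.log2(1.0 + np.exp(-ys))        # scaled cross-entropy

print("hinge upper-bounds err_0/1:", bool(np.all(err_svm >= err_01)))
print("SCE   upper-bounds err_0/1:", bool(np.all(err_sce >= err_01)))
# for very negative ys, both surrogates grow roughly linearly in -ys
print("at ys = -10: hinge =", float(err_svm[0]),
      " (ln 2)*err_SCE =", round(float(np.log(2.0) * err_sce[0]), 4),
      " -ys =", 10.0)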

Kernel Logistic Regression / SVM versus Logistic Regression

Linear Models for Binary Classification

PLA
• minimize err_0/1 specially
• pros: efficient if linearly separable
• cons: works only if linearly separable, otherwise needs pocket

soft-margin SVM
• minimize regularized êrr_SVM by QP
• pros: 'easy' optimization & theoretical guarantee
• cons: loose bound of err_0/1 for very negative ys

regularized logistic regression for classification
• minimize regularized err_SCE by GD/SGD/...
• pros: 'easy' optimization & regularization guard
• cons: loose bound of err_0/1 for very negative ys

regularized LogReg =⇒ approximate SVM
SVM =⇒ approximate LogReg (?)

Kernel Logistic Regression / SVM versus Logistic Regression

Fun Time

We know that êrr_SVM(s, y) is an upper bound of err_0/1(s, y). When is the upper bound tight? That is, when is êrr_SVM(s, y) = err_0/1(s, y)?
1 ys ≥ 0
2 ys ≤ 0
3 ys ≥ 1
4 ys ≤ 1

Reference Answer: 3

By plotting the figure, we can easily see that êrr_SVM(s, y) = err_0/1(s, y) if and only if ys ≥ 1. In that case, both error functions evaluate to 0.

Kernel Logistic Regression / SVM for Soft Binary Classification

SVM for Soft Binary Classification

Naïve Idea 1
1 run SVM and get (b_SVM, w_SVM)
2 return g(x) = θ(w_SVM^T x + b_SVM)
• 'direct' use of similarity—works reasonably well
• no LogReg flavor

Naïve Idea 2
1 run SVM and get (b_SVM, w_SVM)
2 run LogReg with (b_SVM, w_SVM) as w_0
3 return LogReg solution as g(x)
• not really 'easier' than original LogReg
• SVM flavor (kernel?) lost

want: flavors from both sides
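
A minimal added sketch of Naïve Idea 1 (toy data assumed; scikit-learn's SVC and its decision_function stand in for the SVM score, and θ is the logistic function): the SVM score is simply squashed through θ, with no LogReg training at all.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)   # step 1: run SVM

def theta(s):
    # logistic function theta(s) = 1 / (1 + exp(-s))
    return 1.0 / (1.0 + np.exp(-s))

scores = svm.decision_function(X[:3])                 # SVM scores for 3 points
print("soft outputs g(x):", np.round(theta(scores), 3))   # step 2: squash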

Kernel Logistic Regression / SVM for Soft Binary Classification

A Possible Model: Two-Level Learning

g(x) = θ( A · (w_SVM^T Φ(x) + b_SVM) + B )

• SVM flavor: fix hyperplane direction by w_SVM—kernel applies
• LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling (A) and shifting (B)
• often A > 0 if w_SVM reasonably good
• often B ≈ 0 if b_SVM reasonably good

new LogReg problem:

min_{A,B}  (1/N) Σ_{n=1}^N log( 1 + exp( −y_n ( A · (w_SVM^T Φ(x_n) + b_SVM) + B ) ) ),
where the bracketed SVM score w_SVM^T Φ(x_n) + b_SVM is denoted Φ_SVM(x_n)

two-level learning: LogReg on SVM-transformed data
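
Here is a minimal added sketch of the two-level idea (toy data, an RBF kernel, and scikit-learn's LogisticRegression with a very large C to approximate the unregularized problem are all assumptions; Platt's actual model differs in the ways noted on the Probabilistic SVM slide that follows): level one fixes the direction with a kernel SVM, level two fits only (A, B) on the one-dimensional feature Φ_SVM(x).

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, size=(60, 2)),
               rng.normal(+1.0, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)   # level 1: fix direction
z = svm.decision_function(X).reshape(-1, 1)           # Phi_SVM(x_n), 1-D feature

# level 2: fit only (A, B); huge C makes sklearn's own L2 penalty negligible,
# approximating the unregularized LogReg problem on the slide
logreg = LogisticRegression(C=1e6).fit(z, y)
A, B = float(logreg.coef_[0, 0]), float(logreg.intercept_[0])
print(f"A = {A:.3f} (expect > 0), B = {B:.3f} (expect close to 0)")

def g(x):
    # soft binary classifier g(x) = theta(A * Phi_SVM(x) + B)
    s = A * svm.decision_function(np.atleast_2d(x)) + B
    return 1.0 / (1.0 + np.exp(-s))

print("g on three points:", np.round(g(X[:3]), 3))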

Kernel Logistic Regression / SVM for Soft Binary Classification

Probabilistic SVM

Platt's Model of Probabilistic SVM for Soft Binary Classification
1 run SVM on D to get (b_SVM, w_SVM) [or the equivalent α], and transform D to z'_n = w_SVM^T Φ(x_n) + b_SVM
  —actual model performs this step in a more complicated manner
2 run LogReg on {(z'_n, y_n)}_{n=1}^N to get (A, B)
  —actual model adds some special regularization here
3 return g(x) = θ( A · (w_SVM^T Φ(x) + b_SVM) + B )

• soft binary classifier not having the same boundary as the SVM classifier—because of B
• how to solve LogReg: GD/SGD/or better—because only two variables

kernel SVM =⇒ approx. LogReg in Z-space
exact LogReg in Z-space?
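
In practice, scikit-learn packages a Platt-style fit behind SVC(probability=True), which internally calibrates a sigmoid on the SVM scores with some extra cross-validation; the added usage sketch below (toy data assumed) shows the calibrated outputs and echoes the slide's point that the probabilistic output need not share the raw SVM boundary.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.0, size=(60, 2)),
               rng.normal(+1.0, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

prob_svm = SVC(C=1.0, kernel="rbf", gamma=1.0, probability=True).fit(X, y)
print("P(y = +1 | x) for three points:",
      np.round(prob_svm.predict_proba(X[:3])[:, 1], 3))
# predict() thresholds the raw SVM score, while predict_proba() goes through
# the fitted sigmoid, so the two can disagree near the boundary (the slide's
# point about B shifting the decision boundary)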
