Machine Learning Techniques (機器學習技法)

Lecture 5: Kernel Logistic Regression

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)

Kernel Logistic Regression

Roadmap

1 Embedding Numerous Features: Kernel Models

Lecture 4: Soft-Margin Support Vector Machine
allow some margin violations ξ_n while penalizing them by C;
equivalent to upper-bounding α_n by C

Lecture 5: Kernel Logistic Regression
• Soft-Margin SVM as Regularized Model
• SVM versus Logistic Regression
• SVM for Soft Binary Classification
• Kernel Logistic Regression

2 Combining Predictive Features: Aggregation Models

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Wrap-Up

Hard-Margin Primal
min_{b,w}  (1/2) w^T w
s.t.  y_n (w^T z_n + b) ≥ 1

Soft-Margin Primal
min_{b,w,ξ}  (1/2) w^T w + C Σ_{n=1}^N ξ_n
s.t.  y_n (w^T z_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0

Hard-Margin Dual
min_α  (1/2) α^T Q α − 1^T α
s.t.  y^T α = 0,  0 ≤ α_n

Soft-Margin Dual
min_α  (1/2) α^T Q α − 1^T α
s.t.  y^T α = 0,  0 ≤ α_n ≤ C

soft-margin preferred in practice;
linear: LIBLINEAR; non-linear: LIBSVM
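
To make the wrap-up concrete, here is a minimal added sketch (not from the lecture; the toy data and parameter choices are assumptions) that trains both variants with scikit-learn, whose LinearSVC and SVC estimators are built on LIBLINEAR and LIBSVM respectively, and checks the soft-margin dual bound 0 ≤ α_n ≤ C through the dual coefficients that LIBSVM exposes.

import numpy as np
from sklearn.svm import LinearSVC, SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# linear soft-margin SVM: hinge loss + L2, handled by LIBLINEAR under the hood
linear_svm = LinearSVC(C=1.0, loss="hinge", max_iter=20000).fit(X, y)

# kernel soft-margin SVM: dual QP with 0 <= alpha_n <= C, handled by LIBSVM
kernel_svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)

print("LinearSVC train accuracy:", linear_svm.score(X, y))
print("SVC (RBF) train accuracy:", kernel_svm.score(X, y))
# dual_coef_ stores y_n * alpha_n for the support vectors, so |.| <= C
print("all |y_n * alpha_n| <= C:",
      bool(np.all(np.abs(kernel_svm.dual_coef_) <= 1.0 + 1e-8)))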

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Slack Variables ξ_n

record 'margin violation' by ξ_n; penalize with margin violation:

min_{b,w,ξ}  (1/2) w^T w + C · Σ_{n=1}^N ξ_n
s.t.  y_n (w^T z_n + b) ≥ 1 − ξ_n  and  ξ_n ≥ 0  for all n

[figure: a separating hyperplane with its margins and a point marked as a violation]

on any (b, w),
ξ_n = margin violation = max(1 − y_n (w^T z_n + b), 0)

• (x_n, y_n) violating margin: ξ_n = 1 − y_n (w^T z_n + b)
• (x_n, y_n) not violating margin: ξ_n = 0

'unconstrained' form of soft-margin SVM:

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)
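
The slack formula above can be evaluated directly; below is a tiny added sketch (the weight vector, bias, and points are hypothetical) that computes ξ_n = max(1 − y_n(w^T z_n + b), 0) for one point outside the margin, one inside it, and one misclassified.

import numpy as np

w = np.array([1.0, -0.5])       # hypothetical weight vector in Z-space
b = 0.2                         # hypothetical bias
Z = np.array([[2.0, 1.0],       # comfortably outside the margin
              [0.5, 0.3],       # inside the margin (0 < y*s < 1)
              [-1.0, 0.5]])     # on the wrong side (y*s < 0)
y = np.array([+1, +1, +1])

scores = Z @ w + b                          # w^T z_n + b
xi = np.maximum(1.0 - y * scores, 0.0)      # slack = recorded margin violation
for n in range(len(y)):
    status = "not violating margin" if xi[n] == 0 else "violating margin"
    print(f"n={n}: y*s = {y[n] * scores[n]:+.2f}, xi = {xi[n]:.2f}  ({status})")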

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Unconstrained Form

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)

familiar? :-)

min  (1/2) w^T w + C Σ êrr

just L2 regularization

min  (λ/N) w^T w + (1/N) Σ err

with shorter w, another parameter, and special err

why not solve this? :-)
• not QP, no (?) kernel trick
• max(·, 0) not differentiable, harder to solve
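
Even though max(·, 0) is not differentiable, the unconstrained objective can still be attacked with subgradient steps; the added sketch below (toy data, a fixed step size, and plain subgradient descent are all assumptions, not the lecture's recommended route, which stays with the QP dual) shows one way to do it.

import numpy as np

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(-1.0, 1.0, size=(30, 2)),
               rng.normal(+1.0, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
C, eta = 1.0, 0.01                    # assumed penalty and step size
w, b = np.zeros(2), 0.0

for t in range(2000):
    margins = y * (Z @ w + b)
    violating = margins < 1                        # where the hinge is active
    # subgradient of (1/2) w^T w + C * sum_n max(1 - y_n (w^T z_n + b), 0)
    grad_w = w - C * (y[violating, None] * Z[violating]).sum(axis=0)
    grad_b = -C * y[violating].sum()
    w -= eta * grad_w
    b -= eta * grad_b

objective = 0.5 * w @ w + C * np.maximum(1 - y * (Z @ w + b), 0).sum()
print("final unconstrained objective:", round(float(objective), 3))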

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

SVM as Regularized Model

• regularization by constraint:  minimize E_in  subject to w^T w ≤ C
• hard-margin SVM:  minimize w^T w  subject to E_in = 0 [and more]
• L2 regularization:  minimize (λ/N) w^T w + E_in
• soft-margin SVM:  minimize (1/2) w^T w + C·N·Ê_in

• large margin ⟺ fewer hyperplanes ⟺ L2 regularization of short w
• soft margin ⟺ special êrr
• larger C (the penalty) or larger C (the constraint bound) ⟺ smaller λ ⟺ less regularization

viewing SVM as regularized model:
allows extending/connecting to other learning models
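
The 'larger C ⟺ smaller λ' bullet follows from a one-line scaling argument (an added derivation in LaTeX notation, not on the slide); it also explains the 1/(2λ) quoted in the Fun Time answer below.

% dividing an objective by the positive constant CN does not change its minimizer
\min_{b,w}\ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\;\Longleftrightarrow\;
\min_{b,w}\ \frac{1}{2CN} w^T w + \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\;\Longleftrightarrow\;
\min_{b,w}\ \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathrm{err}}_n
\quad\text{with } \lambda = \frac{1}{2C},\ \text{i.e. } C = \frac{1}{2\lambda}.

So a larger C plays the role of a smaller λ, which means less regularization.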

Kernel Logistic Regression / Soft-Margin SVM as Regularized Model

Fun Time

When viewing soft-margin SVM as a regularized model, a larger C corresponds to
1 a larger λ, that is, stronger regularization
2 a smaller λ, that is, stronger regularization
3 a larger λ, that is, weaker regularization
4 a smaller λ, that is, weaker regularization

Reference Answer: 4

Comparing the formulations on page 4 of the slides, we see that C corresponds to 1/(2λ). So a larger C corresponds to a smaller λ, which surely means weaker regularization.

Kernel Logistic Regression / SVM versus Logistic Regression

Algorithmic Error Measure of SVM

min_{b,w}  (1/2) w^T w + C Σ_{n=1}^N max(1 − y_n (w^T z_n + b), 0)

linear score s = w^T z_n + b
• err_0/1(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_0/1
  —often called the hinge error measure

[figure: err_0/1 and êrr_SVM plotted against ys on [−3, 3]]

êrr_SVM: algorithmic error measure by convex upper bound of err_0/1
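
A quick numerical check (an added sketch over an assumed grid of ys values) confirms the convex-upper-bound claim: the hinge error never falls below err_0/1, coincides with it for ys ≥ 1, and sits strictly above it for 0 < ys < 1.

import numpy as np

ys = np.linspace(-3, 3, 601)                     # grid of y*s values
err_01 = (ys <= 0).astype(float)                 # err_0/1
err_svm = np.maximum(1.0 - ys, 0.0)              # hinge error

print("hinge >= 0/1 everywhere:", bool(np.all(err_svm >= err_01)))
print("equal for ys >= 1:", bool(np.all(err_svm[ys >= 1] == err_01[ys >= 1])))
print("strictly above for 0 < ys < 1:",
      bool(np.all(err_svm[(ys > 0) & (ys < 1)] > err_01[(ys > 0) & (ys < 1)])))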

Kernel Logistic Regression / SVM versus Logistic Regression

Connection between SVM and Logistic Regression

linear score s = w^T z_n + b
• err_0/1(s, y) = ⟦ys ≤ 0⟧
• êrr_SVM(s, y) = max(1 − ys, 0): upper bound of err_0/1
• err_SCE(s, y) = log_2(1 + exp(−ys)): another upper bound of err_0/1, used in logistic regression

[figure: err_0/1, êrr_SVM, and the scaled cross-entropy plotted against ys on [−3, 3]]

                          ys → −∞      ys → +∞
êrr_SVM(s, y)             ≈ −ys        = 0
(ln 2) · err_SCE(s, y)    ≈ −ys        ≈ 0

SVM ≈ L2-regularized logistic regression
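
The asymptotic table can be verified numerically; the added sketch below (toy grid assumed) evaluates all three error measures, confirms that both surrogates upper-bound err_0/1, and shows that for very negative ys the quantity (ln 2)·err_SCE behaves like −ys, just as the hinge error does.

import numpy as np

ys = np.linspace(-10, 10, 2001)
err_01 = (ys <= 0).astype(float)            # err_0/1
err_svm = np.maximum(1.0 - ys, 0.0)         # hinge error
err_sce = np.log2(1.0 + np.exp(-ys))        # scaled cross-entropy

print("hinge upper-bounds err_0/1:", bool(np.all(err_svm >= err_01)))
print("SCE   upper-bounds err_0/1:", bool(np.all(err_sce >= err_01)))
# for very negative ys, both surrogates grow roughly linearly in -ys
print("at ys = -10: hinge =", float(err_svm[0]),
      " (ln 2)*err_SCE =", round(float(np.log(2.0) * err_sce[0]), 4),
      " -ys =", 10.0)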

Kernel Logistic Regression / SVM versus Logistic Regression

Linear Models for Binary Classification

PLA
• minimize err_0/1 specially
• pros: efficient if linearly separable
• cons: works only if linearly separable, otherwise needs pocket

soft-margin SVM
• minimize regularized êrr_SVM by QP
• pros: 'easy' optimization & theoretical guarantee
• cons: loose bound of err_0/1 for very negative ys

regularized logistic regression for classification
• minimize regularized err_SCE by GD/SGD/...
• pros: 'easy' optimization & regularization guard
• cons: loose bound of err_0/1 for very negative ys

regularized LogReg =⇒ approximate SVM
SVM =⇒ approximate LogReg (?)

Kernel Logistic Regression / SVM versus Logistic Regression

Fun Time

We know that êrr_SVM(s, y) is an upper bound of err_0/1(s, y). When is the upper bound tight? That is, when is êrr_SVM(s, y) = err_0/1(s, y)?
1 ys ≥ 0
2 ys ≤ 0
3 ys ≥ 1
4 ys ≤ 1

Reference Answer: 3

By plotting the figure, we can easily see that êrr_SVM(s, y) = err_0/1(s, y) if and only if ys ≥ 1. In that case, both error functions evaluate to 0.

Kernel Logistic Regression / SVM for Soft Binary Classification

SVM for Soft Binary Classification

Naïve Idea 1
1 run SVM and get (b_SVM, w_SVM)
2 return g(x) = θ(w_SVM^T x + b_SVM)
• 'direct' use of similarity—works reasonably well
• no LogReg flavor

Naïve Idea 2
1 run SVM and get (b_SVM, w_SVM)
2 run LogReg with (b_SVM, w_SVM) as w_0
3 return LogReg solution as g(x)
• not really 'easier' than original LogReg
• SVM flavor (kernel?) lost

want: flavors from both sides
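
A minimal added sketch of Naïve Idea 1 (toy data assumed; scikit-learn's SVC and its decision_function stand in for the SVM score, and θ is the logistic function): the SVM score is simply squashed through θ, with no LogReg training at all.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)   # step 1: run SVM

def theta(s):
    # logistic function theta(s) = 1 / (1 + exp(-s))
    return 1.0 / (1.0 + np.exp(-s))

scores = svm.decision_function(X[:3])                 # SVM scores for 3 points
print("soft outputs g(x):", np.round(theta(scores), 3))   # step 2: squash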

Kernel Logistic Regression / SVM for Soft Binary Classification

A Possible Model: Two-Level Learning

g(x) = θ( A · (w_SVM^T Φ(x) + b_SVM) + B )

• SVM flavor: fix hyperplane direction by w_SVM—kernel applies
• LogReg flavor: fine-tune hyperplane to match maximum likelihood by scaling (A) and shifting (B)
• often A > 0 if w_SVM reasonably good
• often B ≈ 0 if b_SVM reasonably good

new LogReg problem:

min_{A,B}  (1/N) Σ_{n=1}^N log( 1 + exp( −y_n ( A · (w_SVM^T Φ(x_n) + b_SVM) + B ) ) ),
where the bracketed SVM score w_SVM^T Φ(x_n) + b_SVM is denoted Φ_SVM(x_n)

two-level learning: LogReg on SVM-transformed data
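
Here is a minimal added sketch of the two-level idea (toy data, an RBF kernel, and scikit-learn's LogisticRegression with a very large C to approximate the unregularized problem are all assumptions; Platt's actual model differs in the ways noted on the Probabilistic SVM slide that follows): level one fixes the direction with a kernel SVM, level two fits only (A, B) on the one-dimensional feature Φ_SVM(x).

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, size=(60, 2)),
               rng.normal(+1.0, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

svm = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)   # level 1: fix direction
z = svm.decision_function(X).reshape(-1, 1)           # Phi_SVM(x_n), 1-D feature

# level 2: fit only (A, B); huge C makes sklearn's own L2 penalty negligible,
# approximating the unregularized LogReg problem on the slide
logreg = LogisticRegression(C=1e6).fit(z, y)
A, B = float(logreg.coef_[0, 0]), float(logreg.intercept_[0])
print(f"A = {A:.3f} (expect > 0), B = {B:.3f} (expect close to 0)")

def g(x):
    # soft binary classifier g(x) = theta(A * Phi_SVM(x) + B)
    s = A * svm.decision_function(np.atleast_2d(x)) + B
    return 1.0 / (1.0 + np.exp(-s))

print("g on three points:", np.round(g(X[:3]), 3))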

Kernel Logistic Regression / SVM for Soft Binary Classification

Probabilistic SVM

Platt's Model of Probabilistic SVM for Soft Binary Classification
1 run SVM on D to get (b_SVM, w_SVM) [or the equivalent α], and transform D to z'_n = w_SVM^T Φ(x_n) + b_SVM
  —actual model performs this step in a more complicated manner
2 run LogReg on {(z'_n, y_n)}_{n=1}^N to get (A, B)
  —actual model adds some special regularization here
3 return g(x) = θ( A · (w_SVM^T Φ(x) + b_SVM) + B )

• soft binary classifier not having the same boundary as the SVM classifier—because of B
• how to solve LogReg: GD/SGD/or better—because only two variables

kernel SVM =⇒ approx. LogReg in Z-space
exact LogReg in Z-space?
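
In practice, scikit-learn packages a Platt-style fit behind SVC(probability=True), which internally calibrates a sigmoid on the SVM scores with some extra cross-validation; the added usage sketch below (toy data assumed) shows the calibrated outputs and echoes the slide's point that the probabilistic output need not share the raw SVM boundary.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.0, size=(60, 2)),
               rng.normal(+1.0, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

prob_svm = SVC(C=1.0, kernel="rbf", gamma=1.0, probability=True).fit(X, y)
print("P(y = +1 | x) for three points:",
      np.round(prob_svm.predict_proba(X[:3])[:, 1], 3))
# predict() thresholds the raw SVM score, while predict_proba() goes through
# the fitted sigmoid, so the two can disagree near the boundary (the slide's
# point about B shifting the decision boundary)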
