(1)

Machine Learning Foundations (機器學習基石)

Lecture 11: Linear Models for Classification

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

(2)

Linear Models for Classification

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression
gradient descent on cross-entropy error to get good logistic hypothesis

Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification
• Stochastic Gradient Descent
• Multiclass via Logistic Regression
• Multiclass via Binary Classification

4 How Can Machines Learn Better?

(3)

Linear Models for Classification Linear Models for Binary Classification

Linear Models Revisited

linear scoring function: s = w^T x

[figure: each model computes the score s = w^T x from inputs x_0, x_1, x_2, . . . , x_d and then transforms s into h(x)]

linear classification
h(x) = sign(s)
err = 0/1: plausible, but the discrete E_in(w) is NP-hard to solve

linear regression
h(x) = s
err = squared: friendly, a quadratic convex E_in(w) with a closed-form solution

logistic regression
h(x) = θ(s)
err = cross-entropy: plausible, a smooth convex E_in(w) solvable by gradient descent

can linear regression or logistic regression help linear classification?

(4)

Linear Models for Classification Linear Models for Binary Classification

Error Functions Revisited

linear scoring function: s = w^T x
for binary classification, y ∈ {−1, +1}

linear classification
h(x) = sign(s)
err(h, x, y) = ⟦h(x) ≠ y⟧
err_0/1(s, y) = ⟦sign(s) ≠ y⟧ = ⟦sign(ys) ≠ 1⟧

linear regression
h(x) = s
err(h, x, y) = (h(x) − y)^2
err_SQR(s, y) = (s − y)^2 = (ys − 1)^2

logistic regression
h(x) = θ(s)
err(h, x, y) = − ln h(yx)
err_CE(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
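The three pointwise errors above can be written directly in terms of the correctness score ys. Below is a minimal Python sketch I added for illustration (the function names are my own, not from the lecture):

```python
import math

# pointwise errors of the three linear models, as functions of the
# classification correctness score ys (y in {-1, +1}, s = w^T x)

def err_01(ys):
    """0/1 error: 1 iff sign(ys) != 1, i.e. iff ys <= 0."""
    return 1.0 if ys <= 0 else 0.0

def err_sqr(ys):
    """squared error of linear regression, rewritten as (ys - 1)^2."""
    return (ys - 1.0) ** 2

def err_ce(ys):
    """cross-entropy error of logistic regression: ln(1 + exp(-ys))."""
    return math.log(1.0 + math.exp(-ys))

if __name__ == "__main__":
    for ys in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"ys={ys:+.1f}  0/1={err_01(ys):.3f}  "
              f"sqr={err_sqr(ys):.3f}  ce={err_ce(ys):.3f}")
```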

(5)

Linear Models for Classification Linear Models for Binary Classification

Visualizing Error Functions

0/1 error: err_0/1(s, y) = ⟦sign(ys) ≠ 1⟧
squared error: err_SQR(s, y) = (ys − 1)^2
cross-entropy error: err_CE(s, y) = ln(1 + exp(−ys))
scaled cross-entropy error: err_SCE(s, y) = log_2(1 + exp(−ys))

[figure: four plots of err versus ys on [−3, 3], starting from the 0/1 error and overlaying sqr, ce, and scaled ce in turn]

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charge ys ≫ 1; small err_SQR → small err_0/1
• ce: monotonic of ys; small err_CE ↔ small err_0/1
• scaled ce: a proper upper bound of 0/1; small err_SCE ↔ small err_0/1

upper bound: useful for designing algorithmic error êrr
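The four curves can be reproduced roughly with the following sketch (my own illustration, assuming numpy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

ys = np.linspace(-3, 3, 601)          # classification correctness score ys
err_01 = (ys <= 0).astype(float)      # 0/1 error: 1 iff ys <= 0
err_sqr = (ys - 1.0) ** 2             # squared error
err_ce = np.log(1.0 + np.exp(-ys))    # cross-entropy error (natural log)
err_sce = np.log2(1.0 + np.exp(-ys))  # scaled cross-entropy error

plt.plot(ys, err_01, label="0/1")
plt.plot(ys, err_sqr, label="sqr")
plt.plot(ys, err_ce, label="ce")
plt.plot(ys, err_sce, label="scaled ce")
plt.xlabel("ys")
plt.ylabel("err")
plt.ylim(0, 6)
plt.legend()
plt.show()
```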

(9)

Linear Models for Classification Linear Models for Binary Classification

Theoretical Implication of Upper Bound

For any ys where s = w^T x:
err_0/1(s, y) ≤ err_SCE(s, y) = (1/ln 2) err_CE(s, y)

⟹ E_in^0/1(w) ≤ E_in^SCE(w) = (1/ln 2) E_in^CE(w)
⟹ E_out^0/1(w) ≤ E_out^SCE(w) = (1/ln 2) E_out^CE(w)

VC on 0/1:
E_out^0/1(w) ≤ E_in^0/1(w) + Ω^0/1 ≤ (1/ln 2) E_in^CE(w) + Ω^0/1

VC-Reg on CE:
E_out^0/1(w) ≤ (1/ln 2) E_out^CE(w) ≤ (1/ln 2) E_in^CE(w) + (1/ln 2) Ω^CE

small E_in^CE(w) ⟹ small E_out^0/1(w):
logistic/linear reg. for linear classification
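A quick numerical sanity check of the pointwise bound err_0/1(s, y) ≤ (1/ln 2) err_CE(s, y); this is my own illustrative sketch, not part of the lecture:

```python
import math

def err_01(ys):
    return 1.0 if ys <= 0 else 0.0

def err_ce(ys):
    return math.log(1.0 + math.exp(-ys))

# check err_0/1 <= err_SCE = err_CE / ln(2) on a grid of correctness scores
for i in range(-300, 301):
    ys = i / 100.0
    assert err_01(ys) <= err_ce(ys) / math.log(2.0) + 1e-12
print("bound err_0/1 <= (1/ln 2) * err_CE holds on the grid")
```

The bound is tight at ys = 0, where both sides equal 1.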

(10)

Linear Models for Classification Linear Models for Binary Classification

Regression for Classification

1 run logistic/linear regression on D with y_n ∈ {−1, +1} to get w_REG
2 return g(x) = sign(w_REG^T x)

PLA
pros: efficient, with a strong guarantee if the data is linearly separable
cons: works only if linearly separable; otherwise needs the pocket heuristic

linear regression
pros: 'easiest' optimization
cons: loose bound of err_0/1 for large |ys|

logistic regression
pros: 'easy' optimization
cons: loose bound of err_0/1 for very negative ys

linear regression is sometimes used to set w_0 for PLA/pocket/logistic regression;
logistic regression is often preferred over pocket
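A minimal sketch of the two-step recipe above, taking w_REG from the closed-form linear regression solution (my own illustrative code, assuming numpy; the synthetic data is made up):

```python
import numpy as np

def linear_regression_for_classification(X, y):
    """Train w_REG by linear regression, then classify with sign(w_REG^T x).

    X: (N, d) inputs without the constant feature; y: labels in {-1, +1}.
    """
    Z = np.hstack([np.ones((X.shape[0], 1)), X])    # add x_0 = 1
    w_reg = np.linalg.pinv(Z) @ y                   # closed-form solution

    def g(X_new):
        Z_new = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
        return np.sign(Z_new @ w_reg)

    return w_reg, g

# tiny synthetic example (made up): two roughly separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
w_reg, g = linear_regression_for_classification(X, y)
print("training 0/1 error:", np.mean(g(X) != y))
```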

(11)

Linear Models for Classification Linear Models for Binary Classification

Fun Time

Following the definition in the lecture, which of the following is not always ≥ err_0/1(y, s) when y ∈ {−1, +1}?

1 err_0/1(y, s)
2 err_SQR(y, s)
3 err_CE(y, s)
4 err_SCE(y, s)

Reference Answer: 3

Too simple, huh? :-) Anyway, note that err_0/1 is surely an upper bound of itself.

(12)

Linear Models for Classification Stochastic Gradient Descent

Two Iterative Optimization Schemes

For t = 0, 1, . . .
  w_{t+1} ← w_t + η v
when stop, return last w as g

PLA
pick one (x_n, y_n) and decide w_{t+1} by that one example
O(1) time per iteration :-)
[figure: PLA update from w(t) to w(t+1) using one misclassified example]

logistic regression (pocket)
check D and decide w_{t+1} (or the new ŵ) by all examples
O(N) time per iteration :-(

logistic regression with O(1) time per iteration?

(13)

Linear Models for Classification Stochastic Gradient Descent

Logistic Regression Revisited

w_{t+1} ← w_t + η · (1/N) Σ_{n=1}^{N} θ(−y_n w_t^T x_n)(y_n x_n)

(the averaged term is exactly −∇E_in(w_t))

want: an update direction v ≈ −∇E_in(w_t), while computing v by one single (x_n, y_n)

technique for removing the average (1/N) Σ_{n=1}^{N}: view it as an expectation E over a uniform choice of n!

stochastic gradient: ∇_w err(w, x_n, y_n) with random n
true gradient: ∇_w E_in(w) = E_{random n} ∇_w err(w, x_n, y_n)

(14)

Linear Models for Classification Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)

stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent
idea: replace the true gradient by the stochastic gradient
after enough steps: average true gradient ≈ average stochastic gradient

pros: simple & cheaper computation :-), useful for big data or online learning
cons: less stable in nature

SGD logistic regression (looks familiar? :-)):

w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n)(y_n x_n)

(the added term is exactly −∇err(w_t, x_n, y_n))
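A minimal, self-contained sketch of SGD logistic regression as described above (my own illustrative code, assuming numpy; the learning rate, iteration count, and synthetic data are made-up choices):

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic_regression(X, y, eta=0.1, T=10000, seed=0):
    """SGD step: w <- w + eta * θ(-y_n w^T x_n) * y_n x_n, one random example per step."""
    rng = np.random.default_rng(seed)
    Z = np.hstack([np.ones((X.shape[0], 1)), X])   # add x_0 = 1
    w = np.zeros(Z.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                   # uniform choice of n
        w += eta * theta(-y[n] * (w @ Z[n])) * y[n] * Z[n]
    return w

# tiny synthetic example (made up)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = sgd_logistic_regression(X, y)
Z = np.hstack([np.ones((X.shape[0], 1)), X])
print("training 0/1 error:", np.mean(np.sign(Z @ w) != y))
```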

(15)

Linear Models for Classification Stochastic Gradient Descent

PLA Revisited

SGD logistic regression: w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n)(y_n x_n)
PLA: w_{t+1} ← w_t + 1 · ⟦y_n ≠ sign(w_t^T x_n)⟧(y_n x_n)

SGD logistic regression ≈ 'soft' PLA
PLA ≈ SGD logistic regression with η = 1 when w_t^T x_n is large

two practical rules of thumb:
• stopping condition? t large enough
• η? 0.1 when x is in a proper range

(16)

Linear Models for Classification Stochastic Gradient Descent

Fun Time

Consider applying SGD on linear regression for big data. What is the update direction when using the negative stochastic gradient?

1 x_n
2 y_n x_n
3 2(w_t^T x_n − y_n) x_n
4 2(y_n − w_t^T x_n) x_n

Reference Answer: 4

Go check Lecture 9 if you have forgotten about the gradient of squared error. :-) Anyway, the update rule has a nice physical interpretation: improve w_t by 'correcting' in proportion to the residual (y_n − w_t^T x_n).
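For completeness, the one-line derivation behind the answer (my own worked step, not shown on the slide):

```latex
\nabla_{\mathbf{w}} \,(\mathbf{w}^T\mathbf{x}_n - y_n)^2
  = 2(\mathbf{w}^T\mathbf{x}_n - y_n)\,\mathbf{x}_n
\quad\Longrightarrow\quad
-\nabla_{\mathbf{w}}\,\text{err}(\mathbf{w},\mathbf{x}_n,y_n)
  = 2(y_n - \mathbf{w}^T\mathbf{x}_n)\,\mathbf{x}_n
```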

(17)

Linear Models for Classification Multiclass via Logistic Regression

Multiclass Classification

Y = {□, ♦, △, ☆} (4-class classification)

many applications in practice, especially for 'recognition'

next: use tools for {×, ◦} classification to {□, ♦, △, ☆} classification

(18)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time

□ or not? {□ = ◦, ♦ = ×, △ = ×, ☆ = ×}

(19)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time

♦ or not? {□ = ×, ♦ = ◦, △ = ×, ☆ = ×}

(20)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time

△ or not? {□ = ×, ♦ = ×, △ = ◦, ☆ = ×}

(21)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time

☆ or not? {□ = ×, ♦ = ×, △ = ×, ☆ = ◦}

(22)

Linear Models for Classification Multiclass via Logistic Regression

Multiclass Prediction: Combine Binary Classifiers

but ties? :-)

(23)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time Softly

P(□ | x)? {□ = ◦, ♦ = ×, △ = ×, ☆ = ×}

(24)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time Softly

P(♦ | x)? {□ = ×, ♦ = ◦, △ = ×, ☆ = ×}

(25)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time Softly

P(△ | x)? {□ = ×, ♦ = ×, △ = ◦, ☆ = ×}

(26)

Linear Models for Classification Multiclass via Logistic Regression

One Class at a Time Softly

P(☆ | x)? {□ = ×, ♦ = ×, △ = ×, ☆ = ◦}

(27)

Linear Models for Classification Multiclass via Logistic Regression

Multiclass Prediction: Combine Soft Classifiers

g(x) = argmax_{k∈Y} θ(w_[k]^T x)

(28)

Linear Models for Classification Multiclass via Logistic Regression

One-Versus-All (OVA) Decomposition

1 for each k ∈ Y, obtain w_[k] by running logistic regression on
  D_[k] = {(x_n, y'_n = 2⟦y_n = k⟧ − 1)}_{n=1}^{N}
2 return g(x) = argmax_{k∈Y} (w_[k]^T x)

pros: efficient, and can be coupled with any logistic-regression-like approach
cons: often unbalanced D_[k] when K is large
extension: multinomial ('coupled') logistic regression

OVA: a simple multiclass meta-algorithm to keep in your toolbox
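A minimal OVA sketch following the two steps above (my own illustrative code, assuming numpy; it uses a small gradient-descent logistic regression trainer rather than any particular library, and the synthetic data is made up):

```python
import numpy as np

def train_logreg(Z, y, eta=0.1, T=2000):
    """Plain gradient descent on the cross-entropy error E_in(w)."""
    w = np.zeros(Z.shape[1])
    for _ in range(T):
        s = Z @ w
        # gradient of E_in: mean over n of theta(-y_n w^T x_n) * (-y_n x_n)
        grad = np.mean((1.0 / (1.0 + np.exp(y * s)))[:, None] * (-y[:, None] * Z), axis=0)
        w -= eta * grad
    return w

def ova_train(X, y_multi, classes):
    """One-Versus-All: one logistic regression per class k, with labels 2[[y_n = k]] - 1."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    return {k: train_logreg(Z, np.where(y_multi == k, 1.0, -1.0)) for k in classes}

def ova_predict(X, W, classes):
    """g(x) = argmax_k w_[k]^T x (theta is monotonic, so raw scores suffice)."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    scores = np.column_stack([Z @ W[k] for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

# tiny synthetic 3-class example (made up)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (30, 2)) for c in [(0, 3), (3, -2), (-3, -2)]])
y = np.repeat([0, 1, 2], 30)
W = ova_train(X, y, classes=[0, 1, 2])
print("training accuracy:", np.mean(ova_predict(X, W, [0, 1, 2]) == y))
```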

(29)

Linear Models for Classification Multiclass via Logistic Regression

Fun Time

Which of the following best describes the training effort of OVA decomposition based on logistic regression on some K-class classification data of size N?

1 learn K logistic regression hypotheses, each from data of size N/K
2 learn K logistic regression hypotheses, each from data of size N ln K
3 learn K logistic regression hypotheses, each from data of size N
4 learn K logistic regression hypotheses, each from data of size NK

Reference Answer: 3

Note that the learning part can easily be done in parallel, while each D_[k] is essentially of the same size as the original data.

(30)

Linear Models for Classification Multiclass via Binary Classification

Source of Unbalance: One versus All

idea: make binary classification problems more balanced by one versus one

(31)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

□ or ♦? {□ = ◦, ♦ = ×, △ = nil, ☆ = nil}

(32)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

□ or △? {□ = ◦, ♦ = nil, △ = ×, ☆ = nil}

(33)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

□ or ☆? {□ = ◦, ♦ = nil, △ = nil, ☆ = ×}

(34)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

♦ or △? {□ = nil, ♦ = ◦, △ = ×, ☆ = nil}

(35)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

♦ or ☆? {□ = nil, ♦ = ◦, △ = nil, ☆ = ×}

(36)

Linear Models for Classification Multiclass via Binary Classification

One versus One at a Time

△ or ☆? {□ = nil, ♦ = nil, △ = ◦, ☆ = ×}

(37)

Linear Models for Classification Multiclass via Binary Classification

Multiclass Prediction: Combine Pairwise Classifiers

g(x) = tournament champion { w_[k,ℓ]^T x } (voting of classifiers)

(38)

Linear Models for Classification Multiclass via Binary Classification

One-versus-one (OVO) Decomposition

1 for each (k, ℓ) ∈ Y × Y, obtain w_[k,ℓ] by running linear binary classification on
  D_[k,ℓ] = {(x_n, y'_n = 2⟦y_n = k⟧ − 1) : y_n = k or y_n = ℓ}
2 return g(x) = tournament champion { w_[k,ℓ]^T x }

pros: efficient ('smaller' training problems), stable, and can be coupled with any binary classification approach
cons: uses O(K^2) vectors w_[k,ℓ], hence more space, slower prediction, and more training

OVO: another simple multiclass meta-algorithm to keep in your toolbox
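A minimal OVO sketch following the two steps above (my own illustrative code, assuming numpy; the pairwise base learner here is linear regression used as a binary classifier, which is just one possible choice):

```python
import numpy as np
from itertools import combinations

def train_pairwise(Z, y_pm):
    """Linear regression as the binary base learner: w = pinv(Z) y."""
    return np.linalg.pinv(Z) @ y_pm

def ovo_train(X, y_multi, classes):
    """One classifier w_[k,l] per class pair, trained only on examples of class k or l."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    W = {}
    for k, l in combinations(classes, 2):
        mask = (y_multi == k) | (y_multi == l)
        y_pm = np.where(y_multi[mask] == k, 1.0, -1.0)
        W[(k, l)] = train_pairwise(Z[mask], y_pm)
    return W

def ovo_predict(X, W, classes):
    """Tournament champion: each pairwise classifier votes for k or l; predict the most-voted class."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    votes = np.zeros((X.shape[0], len(classes)))
    idx = {k: i for i, k in enumerate(classes)}
    for (k, l), w in W.items():
        winner_is_k = (Z @ w) > 0
        votes[winner_is_k, idx[k]] += 1
        votes[~winner_is_k, idx[l]] += 1
    return np.array(classes)[np.argmax(votes, axis=1)]

# reuse the synthetic 3-class data pattern from the OVA sketch
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (30, 2)) for c in [(0, 3), (3, -2), (-3, -2)]])
y = np.repeat([0, 1, 2], 30)
W = ovo_train(X, y, classes=[0, 1, 2])
print("training accuracy:", np.mean(ovo_predict(X, W, [0, 1, 2]) == y))
```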

(39)

Linear Models for Classification Multiclass via Binary Classification

Fun Time

Assume that some binary classification algorithm takes exactly N^3 CPU-seconds for data of size N. Also, for some 10-class multiclass classification problem, assume that there are N/10 examples for each class. Which of the following is the total number of CPU-seconds needed for OVO decomposition based on the binary classification algorithm?

1 (9/200) N^3
2 (9/25) N^3
3 (4/5) N^3
4 N^3

Reference Answer: 2

There are 45 binary classifiers, each trained with data of size 2N/10. Note that OVA decomposition with the same algorithm would take 10 N^3 time, much worse than OVO.
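The arithmetic behind the answer, spelled out (my own worked step):

```latex
\binom{10}{2}\left(\frac{2N}{10}\right)^3
  = 45 \cdot \frac{8N^3}{1000}
  = \frac{9}{25}N^3,
\qquad
\text{versus OVA: } 10 \cdot N^3 .
```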

(40)

Linear Models for Classification Multiclass via Binary Classification

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression

Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification: three models useful in different ways
• Stochastic Gradient Descent: follow the negative stochastic gradient
• Multiclass via Logistic Regression: predict with the maximum estimated P(k|x)
• Multiclass via Binary Classification: predict the tournament champion

next: from linear to nonlinear

4 How Can Machines Learn Better?
