Machine Learning Foundations
(機器學習基石)
Lecture 11: Linear Models for Classification
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Linear Models for Classification
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression
gradient descent on cross-entropy error to get a good logistic hypothesis

Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification
• Stochastic Gradient Descent
• Multiclass via Logistic Regression
• Multiclass via Binary Classification
4 How Can Machines Learn Better?
Linear Models for Classification / Linear Models for Binary Classification
Linear Models Revisited
linear scoring function: s = w^T x
[figure: each model computes the score s from inputs x_0, x_1, x_2, ..., x_d and then outputs h(x)]

linear classification: h(x) = sign(s)
plausible err = 0/1
discrete E_in(w): NP-hard to solve

linear regression: h(x) = s
friendly err = squared
quadratic convex E_in(w): closed-form solution

logistic regression: h(x) = θ(s)
plausible err = cross-entropy
smooth convex E_in(w): gradient descent

can linear regression or logistic regression help linear classification?
Error Functions Revisited
linear scoring function: s = w^T x, for binary classification y ∈ {−1, +1}

linear classification: h(x) = sign(s)
err(h, x, y) = [[h(x) ≠ y]]
err_0/1(s, y) = [[sign(s) ≠ y]] = [[sign(ys) ≠ 1]]

linear regression: h(x) = s
err(h, x, y) = (h(x) − y)^2
err_SQR(s, y) = (s − y)^2 = (ys − 1)^2

logistic regression: h(x) = θ(s)
err(h, x, y) = −ln h(yx)
err_CE(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
Linear Models for Classification Linear Models for Binary Classification
Visualizing Error Functions
0/1:       err_0/1(s, y) = [[sign(ys) ≠ 1]]
sqr:       err_SQR(s, y) = (ys − 1)^2
ce:        err_CE(s, y) = ln(1 + exp(−ys))
scaled ce: err_SCE(s, y) = log_2(1 + exp(−ys))

[figure: four panels plotting err versus ys on [−3, 3], progressively overlaying the 0/1, sqr, ce, and scaled ce curves]

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1; small err_SQR → small err_0/1
• ce: monotonic in ys; small err_CE ↔ small err_0/1
• scaled ce: a proper upper bound of 0/1; small err_SCE ↔ small err_0/1

upper bound: useful for designing algorithmic error êrr
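To make the comparison concrete, here is a small NumPy sketch, not from the lecture, that evaluates the four pointwise errors on a grid of ys values and checks the bound relations numerically.

```python
import numpy as np

ys = np.linspace(-3, 3, 601)          # classification correctness score y * s

err_01  = (np.sign(ys) != 1).astype(float)   # 0/1 error: 1 iff ys <= 0
err_sqr = (ys - 1) ** 2                      # squared error
err_ce  = np.log(1 + np.exp(-ys))            # cross-entropy error (natural log)
err_sce = np.log2(1 + np.exp(-ys))           # scaled cross-entropy error

# scaled ce (and sqr) upper-bound the 0/1 error everywhere; plain ce does not
assert np.all(err_sce >= err_01)
assert np.all(err_sqr >= err_01)
assert not np.all(err_ce >= err_01)          # e.g. near ys = 0: ln 2 ≈ 0.693 < 1
```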
Theoretical Implication of Upper Bound
For any ys where s = w^T x:
    err_0/1(s, y) ≤ err_SCE(s, y) = (1/ln 2) · err_CE(s, y)

⟹  E_in^0/1(w) ≤ E_in^SCE(w) = (1/ln 2) · E_in^CE(w)
    E_out^0/1(w) ≤ E_out^SCE(w) = (1/ln 2) · E_out^CE(w)

VC on 0/1:
    E_out^0/1(w) ≤ E_in^0/1(w) + Ω^0/1 ≤ (1/ln 2) · E_in^CE(w) + Ω^0/1

VC-Reg on CE:
    E_out^0/1(w) ≤ (1/ln 2) · E_out^CE(w) ≤ (1/ln 2) · E_in^CE(w) + (1/ln 2) · Ω^CE

small E_in^CE(w) ⟹ small E_out^0/1(w):
logistic/linear regression can be used for linear classification
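The pointwise inequality on the first line can be checked directly from the definitions; a short verification, not on the original slide, is:

```latex
% For y in {-1, +1} and s = w^T x:
\begin{align*}
ys \le 0 &\;\Rightarrow\; \mathrm{err}_{0/1}(s,y) = 1 \le \log_2(1 + e^{-ys}) = \mathrm{err}_{\mathrm{SCE}}(s,y)
  && \text{since } e^{-ys} \ge 1,\\
ys > 0 &\;\Rightarrow\; \mathrm{err}_{0/1}(s,y) = 0 < \log_2(1 + e^{-ys}) = \mathrm{err}_{\mathrm{SCE}}(s,y),
\end{align*}
% and err_SCE(s,y) = log_2(1 + e^{-ys}) = (1/ln 2) * ln(1 + e^{-ys}) = (1/ln 2) * err_CE(s,y).
```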
Regression for Classification
1 run logistic/linear regression on D with y_n ∈ {−1, +1} to get w_REG
2 return g(x) = sign(w_REG^T x)

PLA
• pros: efficient, with a strong guarantee if linearly separable
• cons: works only if linearly separable, otherwise needs pocket

linear regression
• pros: 'easiest' optimization
• cons: loose bound of err_0/1 for large |ys|

logistic regression
• pros: 'easy' optimization
• cons: loose bound of err_0/1 for very negative ys

• linear regression is sometimes used to set w_0 for PLA/pocket/logistic regression
• logistic regression is often preferred over pocket
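A minimal NumPy sketch of the two-step recipe above, assuming full-batch gradient descent on the cross-entropy error (Lecture 10) as the regression routine; the function names, learning rate, and iteration count are illustrative, not from the slides.

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_reg(X, y, eta=0.1, T=1000):
    """Gradient descent on E_in^CE; X includes the constant feature x_0 = 1, y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of E_in^CE: (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        grad = np.mean((theta(-y * (X @ w)) * (-y))[:, None] * X, axis=0)
        w -= eta * grad
    return w

def regression_for_classification(X, y):
    w_reg = logistic_reg(X, y)           # step 1: run logistic regression on D
    return lambda x: np.sign(x @ w_reg)  # step 2: g(x) = sign(w_REG^T x)
```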
Fun Time
Following the definitions in the lecture, which of the following is not always ≥ err_0/1(s, y) when y ∈ {−1, +1}?
1 err_0/1(s, y)
2 err_SQR(s, y)
3 err_CE(s, y)
4 err_SCE(s, y)

Reference Answer: 3
Too simple, huh? :-) err_CE is not an upper bound: at ys = 0, err_CE = ln 2 ≈ 0.693 while err_0/1 = 1. Anyway, note that err_0/1 is surely an upper bound of itself.

Linear Models for Classification / Stochastic Gradient Descent
Two Iterative Optimization Schemes
For t = 0, 1, . . .
    w_{t+1} ← w_t + η v
when stopped, return the last w as g

PLA
• pick one (x_n, y_n) and decide w_{t+1} by that single example
• O(1) time per iteration :-)
[figure: PLA updating w(t) to w(t+1) with one misclassified example]

logistic regression (pocket)
• check all of D and decide w_{t+1} (or the new ŵ) by all examples
• O(N) time per iteration :-(

logistic regression with O(1) time per iteration?
Logistic Regression Revisited
w_{t+1} ← w_t + η · (1/N) Σ_{n=1}^{N} θ(−y_n w_t^T x_n) (y_n x_n),
where the averaged term is exactly −∇E_in(w_t)

• want: update direction v ≈ −∇E_in(w_t), while computing v from one single (x_n, y_n)
• technique for removing (1/N) Σ_{n=1}^{N}: view it as an expectation E over a uniform choice of n!

stochastic gradient: ∇_w err(w, x_n, y_n) with random n
true gradient: ∇_w E_in(w) = E_{random n} ∇_w err(w, x_n, y_n)
Stochastic Gradient Descent (SGD)
stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent
• idea: replace the true gradient by the stochastic gradient
• after enough steps, average true gradient ≈ average stochastic gradient
• pros: simple & cheaper computation :-), useful for big data or online learning
• cons: less stable in nature

SGD logistic regression (looks familiar? :-)):
w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n) (y_n x_n),
where θ(−y_n w_t^T x_n) (y_n x_n) = −∇err(w_t, x_n, y_n)
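A minimal NumPy sketch of the SGD logistic regression update above; the function names, learning rate, and step count are illustrative choices, not from the slides.

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic(X, y, eta=0.1, T=10000, rng=np.random.default_rng(0)):
    """SGD logistic regression: each step uses one randomly chosen (x_n, y_n).

    X includes the constant feature x_0 = 1; y takes values in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                 # uniform choice of n
        x_n, y_n = X[n], y[n]
        # stochastic gradient step: w <- w + eta * theta(-y_n w^T x_n) * y_n x_n
        w += eta * theta(-y_n * (w @ x_n)) * y_n * x_n
    return w
```

Setting η = 1 and replacing θ(·) with the mistake indicator [[y_n ≠ sign(w_t^T x_n)]] turns the inner update into the PLA update compared on the next slide.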
PLA Revisited
SGD logistic regression: w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n) (y_n x_n)
PLA:                     w_{t+1} ← w_t + 1 · [[y_n ≠ sign(w_t^T x_n)]] (y_n x_n)

• SGD logistic regression ≈ 'soft' PLA
• PLA ≈ SGD logistic regression with η = 1 when w_t^T x_n is large in magnitude (so θ saturates to 0 or 1)

two practical rules of thumb:
• stopping condition? run for a large enough t
• η? 0.1 works when x is in a proper range
Fun Time
Consider applying SGD on linear regression for big data. What is the update direction when using the negative stochastic gradient?
1 x_n
2 y_n x_n
3 2(w_t^T x_n − y_n) x_n
4 2(y_n − w_t^T x_n) x_n

Reference Answer: 4
Go check Lecture 9 if you have forgotten the gradient of the squared error: −∇_w (w^T x_n − y_n)^2 = 2(y_n − w^T x_n) x_n. :-) Anyway, the update rule has a nice physical interpretation: improve w_t by 'correcting' proportionally to the residual (y_n − w_t^T x_n).

Linear Models for Classification / Multiclass via Logistic Regression
Multiclass Classification
• Y = {□, ♦, △, ⋆} (4-class classification)
• many applications in practice, especially for 'recognition'

next: use the tools for {×, ◦} classification on the {□, ♦, △, ⋆} classification problem
One Class at a Time
□ or not? {□ = ◦, ♦ = ×, △ = ×, ⋆ = ×}
♦ or not? {□ = ×, ♦ = ◦, △ = ×, ⋆ = ×}
△ or not? {□ = ×, ♦ = ×, △ = ◦, ⋆ = ×}
⋆ or not? {□ = ×, ♦ = ×, △ = ×, ⋆ = ◦}
Multiclass Prediction: Combine Binary Classifiers
but ties? :-)
One Class at a Time Softly
P(□ | x)? {□ = ◦, ♦ = ×, △ = ×, ⋆ = ×}
P(♦ | x)? {□ = ×, ♦ = ◦, △ = ×, ⋆ = ×}
P(△ | x)? {□ = ×, ♦ = ×, △ = ◦, ⋆ = ×}
P(⋆ | x)? {□ = ×, ♦ = ×, △ = ×, ⋆ = ◦}
Multiclass Prediction: Combine Soft Classifiers
g(x) = argmax_{k∈Y} θ(w_[k]^T x)
One-Versus-All (OVA) Decomposition
1 for k ∈ Y, obtain w_[k] by running logistic regression on
  D_[k] = {(x_n, y'_n = 2[[y_n = k]] − 1)}_{n=1}^{N}
2 return g(x) = argmax_{k∈Y} w_[k]^T x

• pros: efficient, can be coupled with any logistic-regression-like approach
• cons: often unbalanced D_[k] when K is large
• extension: multinomial ('coupled') logistic regression

OVA: a simple multiclass meta-algorithm to keep in your toolbox
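A minimal sketch of the OVA meta-algorithm, assuming some train_binary(X, y) routine (for example, one of the logistic-regression sketches above) that returns a weight vector; the helper names are illustrative.

```python
import numpy as np

def ova_train(X, y, classes, train_binary):
    """One-Versus-All: one soft binary classifier per class k in Y."""
    weights = {}
    for k in classes:
        y_k = 2 * (y == k) - 1          # relabel: class k -> +1, everything else -> -1
        weights[k] = train_binary(X, y_k)
    return weights

def ova_predict(weights, x):
    """g(x) = argmax over k of the score w_[k]^T x."""
    return max(weights, key=lambda k: x @ weights[k])
```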
Fun Time
Which of the following best describes the training effort of OVA decomposition based on logistic regression on some K-class classification data of size N?
1 learn K logistic regression hypotheses, each from data of size N/K
2 learn K logistic regression hypotheses, each from data of size N ln K
3 learn K logistic regression hypotheses, each from data of size N
4 learn K logistic regression hypotheses, each from data of size NK

Reference Answer: 3
Note that the learning part can easily be done in parallel, while each binary problem uses data of essentially the same size as the original data.

Linear Models for Classification / Multiclass via Binary Classification
Source of Unbalance: One versus All
idea: make the binary classification problems more balanced by using one versus one
One versus One at a Time
□ or ♦? {□ = ◦, ♦ = ×, △ = nil, ⋆ = nil}
□ or △? {□ = ◦, ♦ = nil, △ = ×, ⋆ = nil}
□ or ⋆? {□ = ◦, ♦ = nil, △ = nil, ⋆ = ×}
♦ or △? {□ = nil, ♦ = ◦, △ = ×, ⋆ = nil}
♦ or ⋆? {□ = nil, ♦ = ◦, △ = nil, ⋆ = ×}
△ or ⋆? {□ = nil, ♦ = nil, △ = ◦, ⋆ = ×}
Multiclass Prediction: Combine Pairwise Classifiers
g(x) = tournament champion of {w_[k,ℓ]^T x} (voting of the pairwise classifiers)
One-versus-one (OVO) Decomposition
1 for (k, ℓ) ∈ Y × Y, obtain w_[k,ℓ] by running linear binary classification on
  D_[k,ℓ] = {(x_n, y'_n = 2[[y_n = k]] − 1) : y_n = k or y_n = ℓ}
2 return g(x) = tournament champion of {w_[k,ℓ]^T x}

• pros: efficient ('smaller' training problems), stable, can be coupled with any binary classification approach
• cons: uses O(K^2) weight vectors w_[k,ℓ]: more space, slower prediction, more training

OVO: another simple multiclass meta-algorithm to keep in your toolbox
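A minimal sketch of the OVO meta-algorithm with voting as the 'tournament'; train_binary is again an assumed linear binary classification routine (PLA/pocket or logistic regression), and the helper names are illustrative.

```python
import numpy as np
from itertools import combinations

def ovo_train(X, y, classes, train_binary):
    """One-versus-one: one binary classifier per pair of classes (k, l)."""
    weights = {}
    for k, l in combinations(classes, 2):
        mask = (y == k) | (y == l)            # keep only examples of classes k and l
        y_kl = 2 * (y[mask] == k) - 1         # class k -> +1, class l -> -1
        weights[(k, l)] = train_binary(X[mask], y_kl)
    return weights

def ovo_predict(weights, x, classes):
    """g(x) = tournament champion: each pairwise classifier casts one vote."""
    votes = {k: 0 for k in classes}
    for (k, l), w in weights.items():
        votes[k if x @ w > 0 else l] += 1
    return max(votes, key=votes.get)
```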
Fun Time
Assume that some binary classification algorithm takes exactly N^3 CPU-seconds for data of size N. Also, for some 10-class classification problem, assume that there are N/10 examples for each class. Which of the following is the total number of CPU-seconds needed for OVO decomposition based on the binary classification algorithm?
1 (9/200) N^3
2 (9/25) N^3
3 (4/5) N^3
4 N^3

Reference Answer: 2
There are 45 binary classifiers, each trained with data of size 2N/10 = N/5 and hence taking (N/5)^3 CPU-seconds, so the total is 45 · N^3/125 = (9/25) N^3. Note that OVA decomposition with the same algorithm would take 10 N^3 CPU-seconds, much worse than OVO.