# Machine Learning Foundations (機器學習基石)


### Lecture 11: Linear Models for Classification

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

### 3 How Can Machines Learn?

Lecture 11: Linear Models for Classification

### 4 How Can Machines Learn Better?

Linear Models for Binary Classification

## Linear Models Revisited

linear scoring function: s = wᵀx = Σ_{i=0}^{d} wᵢxᵢ

three linear models share this score s:

- linear classification: h(x) = sign(s)
- linear regression: h(x) = s
- logistic regression: h(x) = θ(s)

can linear regression or logistic regression help linear classification?
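The three hypotheses above can be sketched in code; a minimal illustration of how all three share the same scoring function (the function names are mine, not from the lecture):

```python
import numpy as np

def score(w, x):
    # shared linear scoring function s = w^T x, with x0 = 1 included in x
    return float(np.dot(w, x))

def linear_classification(w, x):
    return np.sign(score(w, x))                 # h(x) = sign(s)

def linear_regression_hypothesis(w, x):
    return score(w, x)                          # h(x) = s

def logistic_hypothesis(w, x):
    return 1.0 / (1.0 + np.exp(-score(w, x)))   # h(x) = theta(s)
```

All three compute s once; they differ only in how s is turned into an output.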

## Error Functions Revisited

linear scoring function: s = wᵀx

for binary classification with y ∈ {−1, +1}:

- err_0/1(s, y) = ⟦sign(s) ≠ y⟧ = ⟦sign(ys) ≠ 1⟧
- err_SQR(s, y) = (s − y)² = (ys − 1)²
- err_CE(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
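A direct transcription of the three error measures (a sketch; the function names are mine):

```python
import numpy as np

def err_01(s, y):
    # 0/1 error: [[ sign(s) != y ]]
    return float(np.sign(s) != y)

def err_sqr(s, y):
    # squared error: (s - y)^2, which equals (ys - 1)^2 for y in {-1, +1}
    return (s - y) ** 2

def err_ce(s, y):
    # cross-entropy error: ln(1 + exp(-ys))
    return float(np.log1p(np.exp(-y * s)))
```

The identity (s − y)² = (ys − 1)² holds because y² = 1 for y ∈ {−1, +1}, which is what lets all three errors be viewed as functions of the correctness score ys.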

## Visualizing Error Functions

[plot: err_0/1, err_SQR, err_CE, and scaled err_SCE as functions of ys]

- 0/1: err_0/1 = ⟦sign(ys) ≠ 1⟧
- sqr: err_SQR = (ys − 1)²; small err_SQR → small err_0/1
- ce: err_CE = ln(1 + exp(−ys)); monotonic in ys, so small err_CE ↔ small err_0/1
- scaled ce: err_SCE = log₂(1 + exp(−ys)); a proper upper bound of err_0/1, so small err_SCE ↔ small err_0/1

upper bound: useful for designing algorithmic error êrr
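The upper-bound claim can be checked numerically; a small sketch over a grid of correctness scores ys:

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 1001)            # classification correctness scores
err_01 = (np.sign(ys) != 1).astype(float)    # 0/1 error as a function of ys
err_sce = np.log2(1.0 + np.exp(-ys))         # scaled cross-entropy error

# err_SCE is a proper upper bound of err_0/1, with equality at ys = 0
assert np.all(err_sce + 1e-12 >= err_01)
```

At ys = 0 both errors equal 1 (log₂ 2 = 1), which is exactly where the bound is tight.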


## Theoretical Implication of Upper Bound

For any (s, y), err_0/1(s, y) ≤ err_SCE(s, y) = (1/ln 2) · err_CE(s, y). Hence

- E_in^{0/1}(w) ≤ E_in^{SCE}(w) = (1/ln 2) · E_in^{CE}(w)
- E_out^{0/1}(w) ≤ E_out^{SCE}(w) = (1/ln 2) · E_out^{CE}(w)

VC bound on 0/1:
E_out^{0/1}(w) ≤ E_in^{0/1}(w) + Ω^{0/1} ≤ (1/ln 2) · E_in^{CE}(w) + Ω^{0/1}

VC-Reg bound on CE:
E_out^{0/1}(w) ≤ (1/ln 2) · E_out^{CE}(w) ≤ (1/ln 2) · E_in^{CE}(w) + (1/ln 2) · Ω^{CE}

small E_in^{CE}(w) =⇒ small E_out^{0/1}(w):
logistic/linear reg. can be used for linear classification

## Regression for Classification

1. run logistic/linear reg. on D with y_n ∈ {−1, +1} to get w_REG
2. return g(x) = sign(wᵀ_REG x)

- PLA
  - pros: efficient, with a strong guarantee if linearly separable
  - cons: works only if lin. separable, otherwise needing a pocket-style heuristic
- linear regression
  - pros: 'easiest' optimization
  - cons: loose bound of err_0/1 for large |ys|
- logistic regression
  - pros: 'easy' optimization
  - cons: loose bound of err_0/1 for very negative ys

in practice:

- linear regression sometimes used to set w_0 for PLA/pocket/logistic regression
- logistic regression often preferred over pocket
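The two-step recipe above can be sketched with linear regression (the pseudo-inverse solution) as the regression step; the separable toy data here is my own construction, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # x0 = 1 added
w_true = np.array([0.2, 1.0, -1.0])                        # hypothetical target
y = np.sign(X @ w_true)                                    # labels in {-1, +1}

# step 1: run linear regression on D with y_n in {-1, +1}
w_reg = np.linalg.pinv(X) @ y

# step 2: classify with g(x) = sign(w_REG^T x)
g = np.sign(X @ w_reg)
accuracy = np.mean(g == y)
```

Linear regression treats the ±1 labels as real targets, so the resulting classifier is usually good but not optimal for err_0/1, matching the loose-bound caveat above.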

## Fun Time

Which of the following is an upper bound of err_0/1(y, s)?

1. err_0/1(y, s)
2. err_SQR(y, s)
3. err_CE(y, s)
4. err_SCE(y, s)

Too simple, uh? :-) Anyway, note that err_0/1 is surely an upper bound of itself.

Stochastic Gradient Descent

## Two Iterative Optimization Schemes

For t = 0, 1, …: w_{t+1} ← w_t + ηv; when stop, return the last w.

- PLA: pick (x_n, y_n) and decide w_{t+1} by the one example; O(1) time per iteration :-)
- pocket / logistic regression: check D and decide w_{t+1} by all examples; O(N) time per iteration :-(

next: logistic regression with O(1) time per iteration?

## Logistic Regression Revisited

w_{t+1} ← w_t + η · (1/N) Σ_{n=1}^{N} θ(−y_n wᵀ_t x_n)(y_n x_n)

here (1/N) Σ_{n=1}^{N} θ(−y_n wᵀ_t x_n)(y_n x_n) = −∇E_in(w_t)

- want: an update direction v ≈ −∇E_in(w_t), while computing the direction from just one example (x_n, y_n)
- technique for removing (1/N) Σ_{n=1}^{N}: view the gradient as an expectation over a uniformly random example n, with ∇_w E_in(w) = E_{random n} ∇_w err(w, x_n, y_n)

## Stochastic Gradient Descent (SGD)

stochastic gradient ∇_w err(w, x_n, y_n) with random n
= true gradient ∇_w E_in(w) + zero-mean 'noise' directions

idea: replace the true gradient by the stochastic gradient

- after enough steps, average true gradient ≈ average stochastic gradient
- pros: simple and cheap computation; useful for big data or online learning
- cons: less stable in nature

SGD logistic regression:
w_{t+1} ← w_t + η · θ(−y_n wᵀ_t x_n)(y_n x_n)
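The SGD logistic regression update can be sketched as follows; the data generation and hyperparameters are my own choices for illustration:

```python
import numpy as np

def theta(s):
    # logistic function
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic(X, y, eta=0.1, T=5000, seed=1):
    """One SGD step per iteration:
    w <- w + eta * theta(-y_n w^T x_n) * (y_n x_n), with n picked at random."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))
        w = w + eta * theta(-y[n] * (w @ X[n])) * (y[n] * X[n])
    return w
```

Each iteration touches exactly one example, so the cost per iteration is O(1) in N, which is the whole point of the scheme.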

## PLA Revisited

SGD logistic regression:
w_{t+1} ← w_t + η · θ(−y_n wᵀ_t x_n) · (y_n x_n)

PLA:
w_{t+1} ← w_t + 1 · ⟦y_n ≠ sign(wᵀ_t x_n)⟧ · (y_n x_n)

- SGD logistic regression ≈ 'soft' PLA
- PLA ≈ SGD logistic regression with η = 1 when wᵀ_t x_n is large

two practical rules of thumb:

- stopping condition? t large enough
- η? 0.1126 when x in proper range
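The comparison can be made concrete with the two single-example updates side by side; a sketch (the example point below is mine):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic_step(w, x, y, eta=1.0):
    # 'soft' PLA: always moves, with strength theta(-y w^T x)
    return w + eta * theta(-y * (w @ x)) * (y * x)

def pla_step(w, x, y):
    # 'hard' update: moves by y*x only on a mistake
    return w + float(np.sign(w @ x) != y) * (y * x)

# a confident mistake: w^T x = 10 but y = -1, so theta(10) is nearly 1
w = np.array([1.0, 0.0])
x = np.array([10.0, 0.0])
y = -1.0
```

On that confident mistake the two updates nearly coincide; on a confident correct example, θ(−ywᵀx) ≈ 0 and the SGD step barely moves, just like PLA.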

## Fun Time

Consider applying SGD to linear regression with squared error err(w, x_n, y_n) = (y_n − wᵀx_n)². Which update direction does SGD take?

1. y_n x_n
2. 2(wᵀ_t x_n − y_n) x_n
3. 2(y_n − wᵀ_t x_n) x_n

Reference answer: 3

Go check Lecture 9 if you have forgotten about the gradient of squared error. :-) Anyway, the update rule has a nice physical interpretation: improve w_t by 'correcting' it proportional to the residual (y_n − wᵀ_t x_n).
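That residual interpretation in code; a small sketch (the function name is mine):

```python
import numpy as np

def sgd_sqr_step(w, x, y, eta):
    # negative gradient of (y - w^T x)^2 is 2 (y - w^T x) x,
    # so SGD 'corrects' w in proportion to the residual
    residual = y - w @ x
    return w + eta * 2.0 * residual * x
```

A step along this direction shrinks the residual on the chosen example; with the values below it removes it entirely in one step.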

Multiclass via Logistic Regression

## Multiclass Classification

Y = {□, ◇, △, ⋆} (4-class classification)

- many applications in practice, especially for 'recognition'

next: use binary classification to approach {□, ◇, △, ⋆} classification

## One Class at a Time

- □ or not? {□ = ◦, ◇ = ×, △ = ×, ⋆ = ×}
- ◇ or not? {□ = ×, ◇ = ◦, △ = ×, ⋆ = ×}
- △ or not? {□ = ×, ◇ = ×, △ = ◦, ⋆ = ×}
- ⋆ or not? {□ = ×, ◇ = ×, △ = ×, ⋆ = ◦}

## Multiclass Prediction: Combine Binary Classifiers

[figure: the four binary decision regions overlaid]

but ties? :-)

## One Class at a Time Softly

- P(□ | x)? {□ = ◦, ◇ = ×, △ = ×, ⋆ = ×}
- P(◇ | x)? {□ = ×, ◇ = ◦, △ = ×, ⋆ = ×}
- P(△ | x)? {□ = ×, ◇ = ×, △ = ◦, ⋆ = ×}
- P(⋆ | x)? {□ = ×, ◇ = ×, △ = ×, ⋆ = ◦}

## Multiclass Prediction: Combine Soft Classifiers

g(x) = argmax_{k ∈ Y} θ(wᵀ_[k] x)

[figure: combined soft decision regions]

## One-Versus-All (OVA) Decomposition

1. for k ∈ Y: obtain w_[k] by running logistic regression on D_[k] = {(x_n, y′_n = 2⟦y_n = k⟧ − 1)}_{n=1}^{N}
2. return g(x) = argmax_{k ∈ Y} (wᵀ_[k] x)

- pros: efficient, and can be coupled with any logistic-regression-like approaches
- cons: often unbalanced D_[k] when K is large
- extension: multinomial ('coupled') logistic regression

OVA: a simple multiclass meta-algorithm
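A compact sketch of OVA on top of a logistic-regression trainer; here plain gradient descent stands in for that trainer, and the clustered toy data in the usage below is made up for illustration:

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_logreg(X, y, eta=0.1, T=2000):
    # gradient descent on E_in = mean ln(1 + exp(-y w^T x))
    w = np.zeros(X.shape[1])
    for _ in range(T):
        grad = -np.mean((theta(-y * (X @ w)) * y)[:, None] * X, axis=0)
        w -= eta * grad
    return w

def ova_train(X, y, classes):
    # one binary problem per class k: y'_n = +1 iff y_n == k, else -1
    return {k: train_logreg(X, np.where(y == k, 1.0, -1.0)) for k in classes}

def ova_predict(models, x):
    # g(x) = argmax_k w_[k]^T x
    return max(models, key=lambda k: models[k] @ x)
```

The K binary problems are independent, so they can be trained in parallel, and each one sees the full dataset, which is exactly why D_[k] gets unbalanced when K is large.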

## Fun Time

When using OVA decomposition on an N-example, K-class problem, what do we do?

1. learn K logistic regression hypotheses, each from data of size N/K
2. learn K logistic regression hypotheses, each from data of size N ln K
3. learn K logistic regression hypotheses, each from data of size N
4. learn K logistic regression hypotheses, each from data of size NK

Reference answer: 3

Note that the K learning tasks can be done in parallel, while the data for each is essentially of the same size as the original data.

Multiclass via Binary Classification

## Source of Unbalance: One versus All

idea: make the binary classification problems more balanced by one versus one

## One versus One at a Time

- □ or ◇? {□ = ◦, ◇ = ×, △ = nil, ⋆ = nil}
- □ or △? {□ = ◦, ◇ = nil, △ = ×, ⋆ = nil}
- □ or ⋆? {□ = ◦, ◇ = nil, △ = nil, ⋆ = ×}
- ◇ or △? {□ = nil, ◇ = ◦, △ = ×, ⋆ = nil}
- ◇ or ⋆? {□ = nil, ◇ = ◦, △ = nil, ⋆ = ×}
- △ or ⋆? {□ = nil, ◇ = nil, △ = ◦, ⋆ = ×}

## Multiclass Prediction: Combine Pairwise Classifiers

g(x) = tournament champion { wᵀ_[k,ℓ] x } (voting of classifiers)

## One-versus-one (OVO) Decomposition

1. for (k, ℓ) ∈ Y × Y: obtain w_[k,ℓ] by running linear binary classification on D_[k,ℓ] = {(x_n, y′_n = 2⟦y_n = k⟧ − 1) : y_n = k or y_n = ℓ}
2. return g(x) = tournament champion { wᵀ_[k,ℓ] x }

- pros: efficient ('smaller' training problems), stable, and can be coupled with any binary classification approaches
- cons: uses O(K²) classifiers w_[k,ℓ]: more space, slower prediction, more training

OVO: another simple multiclass meta-algorithm
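A sketch of the OVO recipe with linear regression (pseudo-inverse) standing in for the binary learner; any binary classification method would do, and the clustered test data below is hypothetical:

```python
import numpy as np
from itertools import combinations

def ovo_train(X, y, classes):
    # one linear classifier per pair (k, l), trained only on those two classes;
    # linear regression on +/-1 labels stands in for the binary learner
    models = {}
    for k, l in combinations(classes, 2):
        mask = (y == k) | (y == l)
        yb = np.where(y[mask] == k, 1.0, -1.0)
        models[(k, l)] = np.linalg.pinv(X[mask]) @ yb
    return models

def ovo_predict(models, x, classes):
    # voting of the pairwise classifiers ('tournament')
    votes = dict.fromkeys(classes, 0)
    for (k, l), w in models.items():
        votes[k if (w @ x) > 0 else l] += 1
    return max(votes, key=votes.get)
```

Each pairwise problem sees only the examples of its two classes, so the training sets are smaller and more balanced than in OVA, at the cost of storing K(K−1)/2 classifiers.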

## Fun Time

Assume that some binary classification algorithm takes exactly N³ CPU-seconds for data of size N. Also, for some 10-class multiclass classification problem, assume that there are N/10 examples for each class. What is the total CPU-seconds needed for OVO decomposition based on the binary classification algorithm?

Reference answer: 45 · (2N/10)³ = (9/25) N³

There are 45 binary classifiers, each trained with data of size 2N/10. Note that OVA decomposition with the same algorithm would take 10N³ CPU-seconds, much worse than OVO.
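The arithmetic behind the answer, spelled out as coefficients of N³:

```python
from math import comb

K = 10
num_pairs = comb(K, 2)                      # 45 pairwise classifiers
# each pairwise dataset has 2N/10 = N/5 examples, costing (N/5)^3 CPU-seconds
ovo_cost_coeff = num_pairs * (2 / 10) ** 3  # OVO total, as a multiple of N^3
ova_cost_coeff = K * 1.0 ** 3               # OVA: K problems on all N examples
```

So OVO costs (9/25)N³ ≈ 0.36N³ versus 10N³ for OVA here, because cubing the much smaller pairwise datasets more than pays for the larger number of classifiers.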

## Summary

### 3 How Can Machines Learn?

Lecture 11: Linear Models for Classification

### 4 How Can Machines Learn Better?
