## Machine Learning Foundations (機器學習基石)

### Lecture 11: Linear Models for Classification

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. **How** Can Machines Learn?
   - Lecture 10: Logistic Regression: **gradient descent** on **cross-entropy error** to get a good **logistic hypothesis**
   - Lecture 11: Linear Models for Classification
     - Linear Models for Binary Classification
     - Stochastic Gradient Descent
     - Multiclass via Logistic Regression
     - Multiclass via Binary Classification
4. How Can Machines Learn Better?

## Linear Models Revisited

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$ (each hypothesis computes the score $s$ from inputs $x_0, x_1, \ldots, x_d$)

### linear classification

$h(\mathbf{x}) = \text{sign}(s)$; plausible err = 0/1; discrete $E_{\text{in}}(\mathbf{w})$: **NP-hard to solve**

### linear regression

$h(\mathbf{x}) = s$; friendly err = squared; quadratic convex $E_{\text{in}}(\mathbf{w})$: **closed-form solution**

### logistic regression

$h(\mathbf{x}) = \theta(s)$; plausible err = cross-entropy; smooth convex $E_{\text{in}}(\mathbf{w})$: **gradient descent**

can linear regression or logistic regression **help linear classification?**

## Error Functions Revisited

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$; for binary classification, $y \in \{-1, +1\}$

### linear classification

$h(\mathbf{x}) = \text{sign}(s)$, $\text{err}(h, \mathbf{x}, y) = [\![\, h(\mathbf{x}) \ne y \,]\!]$:

$$\text{err}_{0/1}(s, y) = [\![\, \text{sign}(s) \ne y \,]\!] = [\![\, \text{sign}(ys) \ne 1 \,]\!]$$

### linear regression

$h(\mathbf{x}) = s$, $\text{err}(h, \mathbf{x}, y) = (h(\mathbf{x}) - y)^2$:

$$\text{err}_{\text{SQR}}(s, y) = (s - y)^2 = (ys - 1)^2$$

### logistic regression

$h(\mathbf{x}) = \theta(s)$, $\text{err}(h, \mathbf{x}, y) = -\ln h(y\mathbf{x})$:

$$\text{err}_{\text{CE}}(s, y) = \ln(1 + \exp(-ys))$$

$(ys)$: classification **correctness score**
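The three pointwise errors above can be written directly as functions of the score $s$ and the label $y$ (a minimal NumPy sketch; the function names are mine):

```python
import numpy as np

def err_01(s, y):
    # 0/1 error: [[ sign(ys) != 1 ]]
    return float(np.sign(y * s) != 1)

def err_sqr(s, y):
    # squared error: (ys - 1)^2
    return (y * s - 1.0) ** 2

def err_ce(s, y):
    # cross-entropy error: ln(1 + exp(-ys))
    return float(np.log(1.0 + np.exp(-y * s)))
```

All three depend on $(s, y)$ only through the correctness score $ys$, which is why they can be plotted on one axis.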

## Visualizing Error Functions

- 0/1: $\text{err}_{0/1}(s, y) = [\![\, \text{sign}(ys) \ne 1 \,]\!]$
- sqr: $\text{err}_{\text{SQR}}(s, y) = (ys - 1)^2$
- ce: $\text{err}_{\text{CE}}(s, y) = \ln(1 + \exp(-ys))$
- scaled ce: $\text{err}_{\text{SCE}}(s, y) = \log_2(1 + \exp(-ys))$

(figure: the four error functions plotted against $ys$)

- 0/1: 1 iff $ys \le 0$
- sqr: large if $ys \ll 1$, **but over-charges** $ys \gg 1$; small $\text{err}_{\text{SQR}} \rightarrow$ small $\text{err}_{0/1}$
- ce: monotonic in $ys$; small $\text{err}_{\text{CE}} \leftrightarrow$ small $\text{err}_{0/1}$
- scaled ce: a proper upper bound of 0/1; small $\text{err}_{\text{SCE}} \leftrightarrow$ small $\text{err}_{0/1}$

upper bound: useful for designing **algorithmic error** $\widehat{\text{err}}$


## Theoretical Implication of Upper Bound

For any $ys$ where $s = \mathbf{w}^T \mathbf{x}$:

$$\text{err}_{0/1}(s, y) \le \text{err}_{\text{SCE}}(s, y) = \tfrac{1}{\ln 2}\,\text{err}_{\text{CE}}(s, y)$$

$$\Longrightarrow \quad E_{\text{in}}^{0/1}(\mathbf{w}) \le E_{\text{in}}^{\text{SCE}}(\mathbf{w}) = \tfrac{1}{\ln 2}\, E_{\text{in}}^{\text{CE}}(\mathbf{w}), \qquad E_{\text{out}}^{0/1}(\mathbf{w}) \le E_{\text{out}}^{\text{SCE}}(\mathbf{w}) = \tfrac{1}{\ln 2}\, E_{\text{out}}^{\text{CE}}(\mathbf{w})$$

VC on 0/1:

$$E_{\text{out}}^{0/1}(\mathbf{w}) \le E_{\text{in}}^{0/1}(\mathbf{w}) + \Omega^{0/1} \le \tfrac{1}{\ln 2}\, E_{\text{in}}^{\text{CE}}(\mathbf{w}) + \Omega^{0/1}$$

VC-Reg on CE:

$$E_{\text{out}}^{0/1}(\mathbf{w}) \le \tfrac{1}{\ln 2}\, E_{\text{out}}^{\text{CE}}(\mathbf{w}) \le \tfrac{1}{\ln 2}\, E_{\text{in}}^{\text{CE}}(\mathbf{w}) + \tfrac{1}{\ln 2}\, \Omega^{\text{CE}}$$

small $E_{\text{in}}^{\text{CE}}(\mathbf{w})$ $\Longrightarrow$ small $E_{\text{out}}^{0/1}(\mathbf{w})$: **logistic/linear reg. for linear classification**
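The pointwise bound $\text{err}_{0/1} \le \text{err}_{\text{SCE}}$ is easy to check numerically (a quick sketch; the grid of $ys$ values is arbitrary):

```python
import numpy as np

ys = np.linspace(-5.0, 5.0, 1001)            # arbitrary grid of correctness scores
err_01 = (np.sign(ys) != 1).astype(float)    # [[ sign(ys) != 1 ]]
err_sce = np.log2(1.0 + np.exp(-ys))         # scaled cross-entropy
# err_0/1 <= err_SCE everywhere, with equality at ys = 0
assert np.all(err_01 <= err_sce + 1e-12)
```

At $ys = 0$ both sides equal 1, which is exactly why the $\log_2$ scaling makes the cross-entropy a proper upper bound.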

## Regression for Classification

1. run **logistic/linear reg.** on $\mathcal{D}$ with $y_n \in \{-1, +1\}$ to get $\mathbf{w}_{\text{REG}}$
2. return $g(\mathbf{x}) = \text{sign}(\mathbf{w}_{\text{REG}}^T \mathbf{x})$

### PLA

- pros: **efficient + strong guarantee if lin. separable**
- cons: works only if lin. separable, otherwise needs **pocket** heuristic

### linear regression

- pros: **'easiest' optimization**
- cons: loose bound of $\text{err}_{0/1}$ for large $|ys|$

### logistic regression

- pros: **'easy' optimization**
- cons: loose bound of $\text{err}_{0/1}$ for very negative $ys$

- **linear regression** sometimes used to set $\mathbf{w}_0$ for **PLA/pocket/logistic regression**
- **logistic regression often preferred over pocket**
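The two-step recipe can be sketched with the linear-regression variant, since its closed-form solution (the pseudo-inverse from Lecture 9) makes step 1 a one-liner (a minimal NumPy sketch; the toy data is made up for illustration):

```python
import numpy as np

def regression_for_classification(X, y):
    """Run linear regression on {-1,+1} labels, then classify by sign."""
    # add the constant coordinate x_0 = 1
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # closed-form linear regression: w_REG = pseudo-inverse(X) y
    w_reg = np.linalg.pinv(Xb) @ y
    # g(x) = sign(w_REG^T x)
    return lambda Xnew: np.sign(np.hstack([np.ones((Xnew.shape[0], 1)), Xnew]) @ w_reg)

# toy linearly separable data (illustrative)
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
g = regression_for_classification(X, y)
```

Swapping in gradient-descent logistic regression changes only how $\mathbf{w}_{\text{REG}}$ is obtained; step 2 stays the same.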

## Fun Time

Following the definition in the lecture, which of the following is not always $\ge \text{err}_{0/1}(s, y)$ when $y \in \{-1, +1\}$?

1. $\text{err}_{0/1}(s, y)$
2. $\text{err}_{\text{SQR}}(s, y)$
3. $\text{err}_{\text{CE}}(s, y)$
4. $\text{err}_{\text{SCE}}(s, y)$

### Reference Answer: 3

**Too simple, huh? :-)** Anyway, note that $\text{err}_{0/1}$ is surely an upper bound of itself.

## Two Iterative Optimization Schemes

For $t = 0, 1, \ldots$: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \mathbf{v}$; when stop, return last $\mathbf{w}$ as $g$

### PLA

pick $(\mathbf{x}_n, y_n)$ and decide $\mathbf{w}_{t+1}$ by **the one example**: $O(1)$ time per iteration **:-)**

### logistic regression (pocket)

check $\mathcal{D}$ and decide $\mathbf{w}_{t+1}$ (or new $\hat{\mathbf{w}}$) by **all examples**: $O(N)$ time per iteration **:-(**

logistic regression with $O(1)$ **time per iteration?**

## Logistic Regression Revisited

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \underbrace{\frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) y_n \mathbf{x}_n}_{-\nabla E_{\text{in}}(\mathbf{w}_t)}$$

- want: update direction $\mathbf{v} \approx -\nabla E_{\text{in}}(\mathbf{w}_t)$, while computing $\mathbf{v}$ by one single $(\mathbf{x}_n, y_n)$
- technique on removing $\frac{1}{N}\sum_{n=1}^{N}$: view it as an expectation $\mathcal{E}$ over **uniform choice of $n$**!

stochastic gradient: $\nabla_{\mathbf{w}} \text{err}(\mathbf{w}, \mathbf{x}_n, y_n)$ with random $n$

true gradient: $\nabla_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \mathop{\mathcal{E}}\limits_{\text{random } n} \nabla_{\mathbf{w}} \text{err}(\mathbf{w}, \mathbf{x}_n, y_n)$

## Stochastic Gradient Descent (SGD)

stochastic gradient = true gradient + zero-mean 'noise' directions

### Stochastic Gradient Descent

- idea: replace true gradient by stochastic gradient
- after enough steps, average true gradient ≈ average stochastic gradient
- pros: **simple & cheaper computation :-)**, useful for **big data** or **online learning**
- cons: less stable in nature

SGD logistic regression, **looks familiar? :-)**:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \underbrace{\theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) y_n \mathbf{x}_n}_{-\nabla \text{err}(\mathbf{w}_t, \mathbf{x}_n, y_n)}$$
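The SGD update above can be run as-is (a minimal NumPy sketch; the step count, seed, and toy data are mine, and $\eta = 0.1$ is an assumption in line with the lecture's rule of thumb):

```python
import numpy as np

def theta(s):
    # logistic function: theta(s) = 1 / (1 + e^{-s})
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic(X, y, eta=0.1, steps=1000, seed=0):
    """SGD logistic regression: w <- w + eta * theta(-y_n w^T x_n) y_n x_n."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(y))          # uniform choice of n
        w += eta * theta(-y[n] * (w @ X[n])) * y[n] * X[n]
    return w

# toy separable data with constant coordinate x_0 = 1 (illustrative)
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_logistic(X, y)
```

Each iteration touches exactly one example, so the per-step cost is $O(1)$ in $N$.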

## PLA Revisited

SGD logistic regression:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \cdot \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) y_n \mathbf{x}_n$$

PLA:

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + 1 \cdot [\![\, y_n \ne \text{sign}(\mathbf{w}_t^T \mathbf{x}_n) \,]\!]\, y_n \mathbf{x}_n$$

- SGD logistic regression ≈ 'soft' PLA
- PLA ≈ SGD logistic regression with $\eta = 1$ when $\mathbf{w}_t^T \mathbf{x}_n$ large

two practical rules of thumb:

- stopping condition? $t$ **large enough**
- $\eta$? **0.1 when $\mathbf{x}$ in proper range**
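The 'soft PLA' analogy is clearest when the two single-example updates sit side by side (a minimal NumPy sketch; the function names are mine):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def pla_step(w, x, y):
    # PLA: update only on a mistake (hard 0/1 indicator)
    return w + float(np.sign(w @ x) != y) * y * x

def sgd_logreg_step(w, x, y, eta=1.0):
    # SGD logistic regression: always update, softly weighted by theta(-y w^T x)
    return w + eta * theta(-y * (w @ x)) * y * x
```

When $|\mathbf{w}^T \mathbf{x}|$ is large, $\theta(-y\mathbf{w}^T\mathbf{x})$ is close to 1 on a mistake and close to 0 on a correct example, so the soft step approaches the hard PLA step.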

## Fun Time

Consider applying SGD on linear regression for big data. What is the update direction when using the negative stochastic gradient?

1. $\mathbf{x}_n$
2. $y_n \mathbf{x}_n$
3. $2(\mathbf{w}_t^T \mathbf{x}_n - y_n)\,\mathbf{x}_n$
4. $2(y_n - \mathbf{w}_t^T \mathbf{x}_n)\,\mathbf{x}_n$

### Reference Answer: 4

**Go check Lecture 9 if you have forgotten about the gradient of squared error. :-)** Anyway, the update rule has a nice physical interpretation: improve $\mathbf{w}_t$ by 'correcting' proportional to the residual $(y_n - \mathbf{w}_t^T \mathbf{x}_n)$.

## Multiclass Classification

- $\mathcal{Y} = \{\square, \diamondsuit, \triangle, \star\}$ (4-class classification)
- **many applications** in practice, especially for 'recognition'

next: use **tools for $\{\times, \circ\}$ classification** for $\{\square, \diamondsuit, \triangle, \star\}$ classification

## One Class at a Time

- $\square$ or not? $\{\square = \circ,\ \diamondsuit = \times,\ \triangle = \times,\ \star = \times\}$
- $\diamondsuit$ or not? $\{\square = \times,\ \diamondsuit = \circ,\ \triangle = \times,\ \star = \times\}$
- $\triangle$ or not? $\{\square = \times,\ \diamondsuit = \times,\ \triangle = \circ,\ \star = \times\}$
- $\star$ or not? $\{\square = \times,\ \diamondsuit = \times,\ \triangle = \times,\ \star = \circ\}$

## Multiclass Prediction: Combine Binary Classifiers

but **ties? :-)**

## One Class at a Time **Softly**

- $P(\square \mid \mathbf{x})$? $\{\square = \circ,\ \diamondsuit = \times,\ \triangle = \times,\ \star = \times\}$
- $P(\diamondsuit \mid \mathbf{x})$? $\{\square = \times,\ \diamondsuit = \circ,\ \triangle = \times,\ \star = \times\}$
- $P(\triangle \mid \mathbf{x})$? $\{\square = \times,\ \diamondsuit = \times,\ \triangle = \circ,\ \star = \times\}$
- $P(\star \mid \mathbf{x})$? $\{\square = \times,\ \diamondsuit = \times,\ \triangle = \times,\ \star = \circ\}$

## Multiclass Prediction: Combine **Soft** Classifiers

$$g(\mathbf{x}) = \mathop{\text{argmax}}_{k \in \mathcal{Y}} \theta\left(\mathbf{w}_{[k]}^T \mathbf{x}\right)$$

## One-Versus-All (OVA) Decomposition

1. for $k \in \mathcal{Y}$, obtain $\mathbf{w}_{[k]}$ by running logistic regression on $\mathcal{D}_{[k]} = \{(\mathbf{x}_n,\ y'_n = 2[\![\, y_n = k \,]\!] - 1)\}_{n=1}^{N}$
2. return $g(\mathbf{x}) = \mathop{\text{argmax}}_{k \in \mathcal{Y}} \mathbf{w}_{[k]}^T \mathbf{x}$

- pros: efficient, can be coupled with any logistic regression-like approaches
- cons: often **unbalanced** $\mathcal{D}_{[k]}$ when $K$ large
- extension: **multinomial ('coupled') logistic regression**

OVA: a simple multiclass **meta-algorithm** to keep in your toolbox
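The two-step OVA recipe can be sketched on top of the gradient-descent logistic regression from Lecture 10 (a minimal NumPy sketch; the step count, learning rate, and toy data are my assumptions):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def logreg_gd(X, y, eta=0.1, steps=3000):
    """Full-batch gradient descent on cross-entropy error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w += eta * np.mean((theta(-y * (X @ w)) * y)[:, None] * X, axis=0)
    return w

def ova_train(X, y, classes):
    # one w_[k] per class, trained on the relabeled data D_[k]: class k -> +1, rest -> -1
    return {k: logreg_gd(X, 2.0 * (y == k) - 1.0) for k in classes}

def ova_predict(W, X):
    # g(x) = argmax_k w_[k]^T x  (theta is monotonic, so raw scores suffice)
    classes = list(W)
    scores = np.column_stack([X @ W[k] for k in classes])
    return np.array([classes[i] for i in scores.argmax(axis=1)])

# toy 3-class data: three separated clusters, constant coordinate x_0 = 1 (illustrative)
X = np.array([[1.0, -4.0, 0.0], [1.0, -4.0, 1.0],
              [1.0, 4.0, 0.0], [1.0, 4.0, 1.0],
              [1.0, 0.0, 4.0], [1.0, 1.0, 4.0]])
y = np.array([0, 0, 1, 1, 2, 2])
W = ova_train(X, y, classes=[0, 1, 2])
```

The $K$ binary problems are independent, so the training loop parallelizes trivially.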

## Fun Time

Which of the following best describes the training effort of OVA decomposition based on logistic regression on some $K$-class classification data of size $N$?

1. learn $K$ logistic regression hypotheses, each from data of size $N/K$
2. learn $K$ logistic regression hypotheses, each from data of size $N \ln K$
3. learn $K$ logistic regression hypotheses, each from data of size $N$
4. learn $K$ logistic regression hypotheses, each from data of size $NK$

### Reference Answer: 3

Note that the **learning part can be easily done in parallel, while the data is essentially of the same size as the original data.**

## Source of **Unbalance: One versus All**

idea: make binary classification problems more **balanced** by one versus **one**

## One versus One at a Time

- $\square$ or $\diamondsuit$? $\{\square = \circ,\ \diamondsuit = \times,\ \triangle = \text{nil},\ \star = \text{nil}\}$
- $\square$ or $\triangle$? $\{\square = \circ,\ \diamondsuit = \text{nil},\ \triangle = \times,\ \star = \text{nil}\}$
- $\square$ or $\star$? $\{\square = \circ,\ \diamondsuit = \text{nil},\ \triangle = \text{nil},\ \star = \times\}$
- $\diamondsuit$ or $\triangle$? $\{\square = \text{nil},\ \diamondsuit = \circ,\ \triangle = \times,\ \star = \text{nil}\}$
- $\diamondsuit$ or $\star$? $\{\square = \text{nil},\ \diamondsuit = \circ,\ \triangle = \text{nil},\ \star = \times\}$
- $\triangle$ or $\star$? $\{\square = \text{nil},\ \diamondsuit = \text{nil},\ \triangle = \circ,\ \star = \times\}$

## Multiclass Prediction: Combine **Pairwise** Classifiers

$$g(\mathbf{x}) = \text{tournament champion}\left\{\mathbf{w}_{[k,\ell]}^T \mathbf{x}\right\} \quad \text{(voting of classifiers)}$$

## One-versus-one (OVO) Decomposition

1. for $(k, \ell) \in \mathcal{Y} \times \mathcal{Y}$, obtain $\mathbf{w}_{[k,\ell]}$ by running linear binary classification on $\mathcal{D}_{[k,\ell]} = \{(\mathbf{x}_n,\ y'_n = 2[\![\, y_n = k \,]\!] - 1) : y_n = k \text{ or } y_n = \ell\}$
2. return $g(\mathbf{x}) = \text{tournament champion}\left\{\mathbf{w}_{[k,\ell]}^T \mathbf{x}\right\}$

- pros: efficient ('smaller' training problems), stable, can be coupled with any binary classification approaches
- cons: uses $O(K^2)$ $\mathbf{w}_{[k,\ell]}$: more space, slower prediction, more training

OVO: another simple multiclass **meta-algorithm** to keep in your toolbox
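The OVO recipe can also be sketched with any linear binary classifier as the base learner; here, a closed-form linear regression fit stands in for brevity (a minimal NumPy sketch; the base learner choice and toy data are my assumptions):

```python
import numpy as np
from itertools import combinations

def ovo_train(X, y, classes):
    """One w_[k,l] per class pair, trained only on examples of classes k and l."""
    W = {}
    for k, l in combinations(classes, 2):
        mask = (y == k) | (y == l)
        y_kl = 2.0 * (y[mask] == k) - 1.0            # relabel: k -> +1, l -> -1
        W[(k, l)] = np.linalg.pinv(X[mask]) @ y_kl   # stand-in linear binary classifier
    return W

def ovo_predict(W, X, classes):
    # each pairwise classifier votes; the class with the most votes is the champion
    votes = np.zeros((X.shape[0], len(classes)))
    idx = {k: i for i, k in enumerate(classes)}
    for (k, l), w in W.items():
        s = X @ w
        votes[s >= 0, idx[k]] += 1
        votes[s < 0, idx[l]] += 1
    return np.array([classes[i] for i in votes.argmax(axis=1)])

# toy 3-class data on a line, constant coordinate x_0 = 1 (illustrative)
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 0.0], [1.0, 0.5], [1.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 1, 1, 2, 2])
W = ovo_train(X, y, classes=[0, 1, 2])
```

Each pairwise problem sees only about $2N/K$ examples, which is the source of both the balance and the speed of the individual training runs.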

## Fun Time

Assume that some binary classification algorithm takes exactly $N^3$ CPU-seconds for data of size $N$. Also, for some 10-class multiclass classification problem, assume that there are $N/10$ examples for each class. Which of the following is the total CPU-seconds needed for OVO decomposition based on the binary classification algorithm?

1. $\frac{9}{200} N^3$
2. $\frac{9}{25} N^3$
3. $\frac{4}{5} N^3$
4. $N^3$

### Reference Answer: 2

There are 45 binary classifiers, each trained with data of size $2N/10$. Note that OVA decomposition with the same algorithm would take $10 N^3$ CPU-seconds, much worse than OVO.