Machine Learning Foundations
(機器學習基石)
Lecture 11: Linear Models for Classification
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Linear Models for Classification
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression
gradient descent on cross-entropy error to get a good logistic hypothesis

Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification
• Stochastic Gradient Descent
• Multiclass via Logistic Regression
• Multiclass via Binary Classification
4 How Can Machines Learn Better?
Linear Models for Classification / Linear Models for Binary Classification
Linear Models Revisited
linear scoring function: s = w^T x
[figure: each model computes the score s from inputs x_0, x_1, x_2, ..., x_d and then outputs h(x)]

linear classification: h(x) = sign(s)
plausible err = 0/1
discrete E_in(w): NP-hard to solve

linear regression: h(x) = s
friendly err = squared
quadratic convex E_in(w): closed-form solution

logistic regression: h(x) = θ(s)
plausible err = cross-entropy
smooth convex E_in(w): gradient descent

can linear regression or logistic regression help linear classification?
Error Functions Revisited
linear scoring function: s = w^T x, for binary classification y ∈ {−1, +1}

linear classification: h(x) = sign(s)
err(h, x, y) = [[h(x) ≠ y]]
err_0/1(s, y) = [[sign(s) ≠ y]] = [[sign(ys) ≠ 1]]

linear regression: h(x) = s
err(h, x, y) = (h(x) − y)^2
err_SQR(s, y) = (s − y)^2 = (ys − 1)^2

logistic regression: h(x) = θ(s)
err(h, x, y) = −ln h(yx)
err_CE(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
Linear Models for Classification Linear Models for Binary Classification
Visualizing Error Functions
0/1:       err_0/1(s, y) = [[sign(ys) ≠ 1]]
sqr:       err_SQR(s, y) = (ys − 1)^2
ce:        err_CE(s, y) = ln(1 + exp(−ys))
scaled ce: err_SCE(s, y) = log_2(1 + exp(−ys))

[figure: four panels plotting err versus ys on [−3, 3], progressively overlaying the 0/1, sqr, ce, and scaled ce curves]

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1; small err_SQR → small err_0/1
• ce: monotonic in ys; small err_CE ↔ small err_0/1
• scaled ce: a proper upper bound of 0/1; small err_SCE ↔ small err_0/1

upper bound: useful for designing algorithmic error êrr
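To make the comparison concrete, here is a small NumPy sketch, not from the lecture, that evaluates the four pointwise errors on a grid of ys values and checks the bound relations numerically.

```python
import numpy as np

ys = np.linspace(-3, 3, 601)          # classification correctness score y * s

err_01  = (np.sign(ys) != 1).astype(float)   # 0/1 error: 1 iff ys <= 0
err_sqr = (ys - 1) ** 2                      # squared error
err_ce  = np.log(1 + np.exp(-ys))            # cross-entropy error (natural log)
err_sce = np.log2(1 + np.exp(-ys))           # scaled cross-entropy error

# scaled ce (and sqr) upper-bound the 0/1 error everywhere; plain ce does not
assert np.all(err_sce >= err_01)
assert np.all(err_sqr >= err_01)
assert not np.all(err_ce >= err_01)          # e.g. near ys = 0: ln 2 ≈ 0.693 < 1
```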
Theoretical Implication of Upper Bound
For any ys where s = w^T x:
    err_0/1(s, y) ≤ err_SCE(s, y) = (1/ln 2) · err_CE(s, y)

⟹  E_in^0/1(w) ≤ E_in^SCE(w) = (1/ln 2) · E_in^CE(w)
    E_out^0/1(w) ≤ E_out^SCE(w) = (1/ln 2) · E_out^CE(w)

VC on 0/1:
    E_out^0/1(w) ≤ E_in^0/1(w) + Ω^0/1 ≤ (1/ln 2) · E_in^CE(w) + Ω^0/1

VC-Reg on CE:
    E_out^0/1(w) ≤ (1/ln 2) · E_out^CE(w) ≤ (1/ln 2) · E_in^CE(w) + (1/ln 2) · Ω^CE

small E_in^CE(w) ⟹ small E_out^0/1(w):
logistic/linear regression can be used for linear classification
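The pointwise inequality on the first line can be checked directly from the definitions; a short verification, not on the original slide, is:

```latex
% For y in {-1, +1} and s = w^T x:
\begin{align*}
ys \le 0 &\;\Rightarrow\; \mathrm{err}_{0/1}(s,y) = 1 \le \log_2(1 + e^{-ys}) = \mathrm{err}_{\mathrm{SCE}}(s,y)
  && \text{since } e^{-ys} \ge 1,\\
ys > 0 &\;\Rightarrow\; \mathrm{err}_{0/1}(s,y) = 0 < \log_2(1 + e^{-ys}) = \mathrm{err}_{\mathrm{SCE}}(s,y),
\end{align*}
% and err_SCE(s,y) = log_2(1 + e^{-ys}) = (1/ln 2) * ln(1 + e^{-ys}) = (1/ln 2) * err_CE(s,y).
```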
Regression for Classification
1 run logistic/linear regression on D with y_n ∈ {−1, +1} to get w_REG
2 return g(x) = sign(w_REG^T x)

PLA
• pros: efficient, with a strong guarantee if linearly separable
• cons: works only if linearly separable, otherwise needs pocket

linear regression
• pros: 'easiest' optimization
• cons: loose bound of err_0/1 for large |ys|

logistic regression
• pros: 'easy' optimization
• cons: loose bound of err_0/1 for very negative ys

• linear regression is sometimes used to set w_0 for PLA/pocket/logistic regression
• logistic regression is often preferred over pocket
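A minimal NumPy sketch of the two-step recipe above, assuming full-batch gradient descent on the cross-entropy error (Lecture 10) as the regression routine; the function names, learning rate, and iteration count are illustrative, not from the slides.

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_reg(X, y, eta=0.1, T=1000):
    """Gradient descent on E_in^CE; X includes the constant feature x_0 = 1, y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of E_in^CE: (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        grad = np.mean((theta(-y * (X @ w)) * (-y))[:, None] * X, axis=0)
        w -= eta * grad
    return w

def regression_for_classification(X, y):
    w_reg = logistic_reg(X, y)           # step 1: run logistic regression on D
    return lambda x: np.sign(x @ w_reg)  # step 2: g(x) = sign(w_REG^T x)
```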
Fun Time
Following the definitions in the lecture, which of the following is not always ≥ err_0/1(s, y) when y ∈ {−1, +1}?
1 err_0/1(s, y)
2 err_SQR(s, y)
3 err_CE(s, y)
4 err_SCE(s, y)

Reference Answer: 3
Too simple, huh? :-) err_CE is not an upper bound: at ys = 0, err_CE = ln 2 ≈ 0.693 while err_0/1 = 1. Anyway, note that err_0/1 is surely an upper bound of itself.

Linear Models for Classification / Stochastic Gradient Descent
Two Iterative Optimization Schemes
For t = 0, 1, . . .
    w_{t+1} ← w_t + η v
when stopped, return the last w as g

PLA
• pick one (x_n, y_n) and decide w_{t+1} by that single example
• O(1) time per iteration :-)
[figure: PLA updating w(t) to w(t+1) with one misclassified example]

logistic regression (pocket)
• check all of D and decide w_{t+1} (or the new ŵ) by all examples
• O(N) time per iteration :-(

logistic regression with O(1) time per iteration?
Logistic Regression Revisited
w_{t+1} ← w_t + η · (1/N) Σ_{n=1}^{N} θ(−y_n w_t^T x_n) (y_n x_n),
where the averaged term is exactly −∇E_in(w_t)

• want: update direction v ≈ −∇E_in(w_t), while computing v from one single (x_n, y_n)
• technique for removing (1/N) Σ_{n=1}^{N}: view it as an expectation E over a uniform choice of n!

stochastic gradient: ∇_w err(w, x_n, y_n) with random n
true gradient: ∇_w E_in(w) = E_{random n} ∇_w err(w, x_n, y_n)
Stochastic Gradient Descent (SGD)
stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent
• idea: replace the true gradient by the stochastic gradient
• after enough steps, average true gradient ≈ average stochastic gradient
• pros: simple & cheaper computation :-), useful for big data or online learning
• cons: less stable in nature

SGD logistic regression (looks familiar? :-)):
w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n) (y_n x_n),
where θ(−y_n w_t^T x_n) (y_n x_n) = −∇err(w_t, x_n, y_n)
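A minimal NumPy sketch of the SGD logistic regression update above; the function names, learning rate, and step count are illustrative choices, not from the slides.

```python
import numpy as np

def theta(s):
    """Logistic function θ(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def sgd_logistic(X, y, eta=0.1, T=10000, rng=np.random.default_rng(0)):
    """SGD logistic regression: each step uses one randomly chosen (x_n, y_n).

    X includes the constant feature x_0 = 1; y takes values in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                 # uniform choice of n
        x_n, y_n = X[n], y[n]
        # stochastic gradient step: w <- w + eta * theta(-y_n w^T x_n) * y_n x_n
        w += eta * theta(-y_n * (w @ x_n)) * y_n * x_n
    return w
```

Setting η = 1 and replacing θ(·) with the mistake indicator [[y_n ≠ sign(w_t^T x_n)]] turns the inner update into the PLA update compared on the next slide.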
PLA Revisited
SGD logistic regression: w_{t+1} ← w_t + η · θ(−y_n w_t^T x_n) (y_n x_n)
PLA:                     w_{t+1} ← w_t + 1 · [[y_n ≠ sign(w_t^T x_n)]] (y_n x_n)

• SGD logistic regression ≈ 'soft' PLA
• PLA ≈ SGD logistic regression with η = 1 when w_t^T x_n is large in magnitude (so θ saturates to 0 or 1)

two practical rules of thumb:
• stopping condition? run for a large enough t
• η? 0.1 works when x is in a proper range
Fun Time
Consider applying SGD on linear regression for big data. What is the update direction when using the negative stochastic gradient?
1 x_n
2 y_n x_n
3 2(w_t^T x_n − y_n) x_n
4 2(y_n − w_t^T x_n) x_n

Reference Answer: 4
Go check Lecture 9 if you have forgotten the gradient of the squared error: −∇_w (w^T x_n − y_n)^2 = 2(y_n − w^T x_n) x_n. :-) Anyway, the update rule has a nice physical interpretation: improve w_t by 'correcting' proportionally to the residual (y_n − w_t^T x_n).

Linear Models for Classification / Multiclass via Logistic Regression
Multiclass Classification
• Y = {□, ♦, △, ⋆} (4-class classification)
• many applications in practice, especially for 'recognition'

next: use the tools for {×, ◦} classification on the {□, ♦, △, ⋆} classification problem
One Class at a Time
□ or not? {□ = ◦, ♦ = ×, △ = ×, ⋆ = ×}
♦ or not? {□ = ×, ♦ = ◦, △ = ×, ⋆ = ×}
△ or not? {□ = ×, ♦ = ×, △ = ◦, ⋆ = ×}
⋆ or not? {□ = ×, ♦ = ×, △ = ×, ⋆ = ◦}
Multiclass Prediction: Combine Binary Classifiers
but ties? :-)
One Class at a Time Softly
P(□ | x)? {□ = ◦, ♦ = ×, △ = ×, ⋆ = ×}
P(♦ | x)? {□ = ×, ♦ = ◦, △ = ×, ⋆ = ×}
P(△ | x)? {□ = ×, ♦ = ×, △ = ◦, ⋆ = ×}
P(⋆ | x)? {□ = ×, ♦ = ×, △ = ×, ⋆ = ◦}
Multiclass Prediction: Combine Soft Classifiers
g(x) = argmax_{k∈Y} θ(w_[k]^T x)
One-Versus-All (OVA) Decomposition
1 for k ∈ Y, obtain w_[k] by running logistic regression on
  D_[k] = {(x_n, y'_n = 2[[y_n = k]] − 1)}_{n=1}^{N}
2 return g(x) = argmax_{k∈Y} w_[k]^T x

• pros: efficient, can be coupled with any logistic-regression-like approach
• cons: often unbalanced D_[k] when K is large
• extension: multinomial ('coupled') logistic regression

OVA: a simple multiclass meta-algorithm to keep in your toolbox
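A minimal sketch of the OVA meta-algorithm, assuming some train_binary(X, y) routine (for example, one of the logistic-regression sketches above) that returns a weight vector; the helper names are illustrative.

```python
import numpy as np

def ova_train(X, y, classes, train_binary):
    """One-Versus-All: one soft binary classifier per class k in Y."""
    weights = {}
    for k in classes:
        y_k = 2 * (y == k) - 1          # relabel: class k -> +1, everything else -> -1
        weights[k] = train_binary(X, y_k)
    return weights

def ova_predict(weights, x):
    """g(x) = argmax over k of the score w_[k]^T x."""
    return max(weights, key=lambda k: x @ weights[k])
```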
Fun Time
Which of the following best describes the training effort of OVA decomposition based on logistic regression on some K-class classification data of size N?
1 learn K logistic regression hypotheses, each from data of size N/K
2 learn K logistic regression hypotheses, each from data of size N ln K
3 learn K logistic regression hypotheses, each from data of size N
4 learn K logistic regression hypotheses, each from data of size NK

Reference Answer: 3
Note that the learning part can easily be done in parallel, while each binary problem uses data of essentially the same size as the original data.

Linear Models for Classification / Multiclass via Binary Classification
Source of Unbalance: One versus All
idea: make the binary classification problems more balanced by using one versus one
One versus One at a Time
□ or ♦? {□ = ◦, ♦ = ×, △ = nil, ⋆ = nil}
□ or △? {□ = ◦, ♦ = nil, △ = ×, ⋆ = nil}
□ or ⋆? {□ = ◦, ♦ = nil, △ = nil, ⋆ = ×}
♦ or △? {□ = nil, ♦ = ◦, △ = ×, ⋆ = nil}
♦ or ⋆? {□ = nil, ♦ = ◦, △ = nil, ⋆ = ×}
△ or ⋆? {□ = nil, ♦ = nil, △ = ◦, ⋆ = ×}
Multiclass Prediction: Combine Pairwise Classifiers
g(x) = tournament champion of {w_[k,ℓ]^T x} (voting of the pairwise classifiers)
One-versus-one (OVO) Decomposition
1 for (k, ℓ) ∈ Y × Y, obtain w_[k,ℓ] by running linear binary classification on
  D_[k,ℓ] = {(x_n, y'_n = 2[[y_n = k]] − 1) : y_n = k or y_n = ℓ}
2 return g(x) = tournament champion of {w_[k,ℓ]^T x}

• pros: efficient ('smaller' training problems), stable, can be coupled with any binary classification approach
• cons: uses O(K^2) weight vectors w_[k,ℓ]: more space, slower prediction, more training

OVO: another simple multiclass meta-algorithm to keep in your toolbox
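A minimal sketch of the OVO meta-algorithm with voting as the 'tournament'; train_binary is again an assumed linear binary classification routine (PLA/pocket or logistic regression), and the helper names are illustrative.

```python
import numpy as np
from itertools import combinations

def ovo_train(X, y, classes, train_binary):
    """One-versus-one: one binary classifier per pair of classes (k, l)."""
    weights = {}
    for k, l in combinations(classes, 2):
        mask = (y == k) | (y == l)            # keep only examples of classes k and l
        y_kl = 2 * (y[mask] == k) - 1         # class k -> +1, class l -> -1
        weights[(k, l)] = train_binary(X[mask], y_kl)
    return weights

def ovo_predict(weights, x, classes):
    """g(x) = tournament champion: each pairwise classifier casts one vote."""
    votes = {k: 0 for k in classes}
    for (k, l), w in weights.items():
        votes[k if x @ w > 0 else l] += 1
    return max(votes, key=votes.get)
```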
Fun Time
Assume that some binary classification algorithm takes exactly N^3 CPU-seconds for data of size N. Also, for some 10-class classification problem, assume that there are N/10 examples for each class. Which of the following is the total number of CPU-seconds needed for OVO decomposition based on the binary classification algorithm?
1 (9/200) N^3
2 (9/25) N^3
3 (4/5) N^3
4 N^3

Reference Answer: 2
There are 45 binary classifiers, each trained with data of size 2N/10 = N/5 and hence taking (N/5)^3 CPU-seconds, so the total is 45 · N^3/125 = (9/25) N^3. Note that OVA decomposition with the same algorithm would take 10 N^3 CPU-seconds, much worse than OVO.