Machine Learning Foundations
(機器學習基石)
Lecture 9: Linear Regression
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
  Lecture 8: Noise and Error
  learning can happen with target distribution P(y|x) and low E_in w.r.t. err
3 How Can Machines Learn?
  Lecture 9: Linear Regression
  • Linear Regression Problem
  • Linear Regression Algorithm
  • Generalization Issue
  • Linear Regression for Binary Classification
4 How Can Machines Learn Better?
Credit Limit Problem
age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: 200,000

credit limit? 100,000
unknown target function f: X → Y (ideal credit limit formula)
training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
learning algorithm A
final hypothesis g ≈ f ('learned' formula to be used)
hypothesis set H (set of candidate formulas)

Y = R: regression
Linear Regression Hypothesis
age: 23 years
annual salary: NTD 1,000,000
year in job: 0.5 year
current debt: 200,000

• for x = (x_0, x_1, x_2, ..., x_d), the 'features of customer', approximate the desired credit limit with a weighted sum:
  y ≈ Σ_{i=0}^{d} w_i x_i
• linear regression hypothesis: h(x) = w^T x

h(x): like the perceptron, but without the sign
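As a concrete illustration, here is a minimal NumPy sketch of the hypothesis h(x) = w^T x. The feature values and weights below are made up for illustration; they are not from the lecture.

```python
import numpy as np

# Hypothetical customer features: x_0 = 1 (constant), age, annual salary
# (millions of NTD), years in job, current debt (hundreds of thousands).
# Values and weights are illustrative only.
x = np.array([1.0, 23.0, 1.0, 0.5, 2.0])
w = np.array([0.5, 0.01, 0.8, 0.2, -0.3])

h = w @ x  # h(x) = w^T x: a real-valued credit-limit estimate, no sign taken
print(h)
```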
Illustration of Linear Regression
[figure: left, x = (x_1) ∈ R, a line fit to points in the (x, y) plane; right, x = (x_1, x_2) ∈ R^2, a hyperplane fit to points in (x_1, x_2, y) space]

linear regression: find lines/hyperplanes with small residuals
The Error Measure
popular/historical error measure: squared error
  err(ŷ, y) = (ŷ − y)^2

in-sample:
  E_in(w) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2,   with h(x_n) = w^T x_n

out-of-sample:
  E_out(w) = E_{(x,y)∼P} (w^T x − y)^2

next: how to minimize E_in(w)?
Fun Time
Consider using the linear regression hypothesis h(x) = w^T x to predict the credit limit of customers x. Which feature below shall have a positive weight in a good hypothesis for the task?
1 birth month
2 monthly income
3 current debt
4 number of credit cards owned

Reference Answer: 2
Customers with higher monthly income should naturally be given a higher credit limit, which is captured by the positive weight on the ‘monthly income’ feature.
Matrix Form of E_in(w)

E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2
        = (1/N) Σ_{n=1}^{N} (x_n^T w − y_n)^2
        = (1/N) ‖ (x_1^T w − y_1, x_2^T w − y_2, ..., x_N^T w − y_N) ‖^2
        = (1/N) ‖ Xw − y ‖^2

where the rows of X (N × (d+1)) are x_1^T, ..., x_N^T; w is (d+1) × 1; and y = (y_1, y_2, ..., y_N) is N × 1.
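The matrix form translates directly into code. A minimal sketch in NumPy; the function name e_in and the synthetic data are my own, used only to check that the matrix form agrees with the per-example sum:

```python
import numpy as np

def e_in(w, X, y):
    """In-sample error in matrix form: (1/N) * ||X w - y||^2."""
    residual = X @ w - y
    return residual @ residual / len(y)

# Tiny synthetic check: N = 4 examples, d = 2 features plus the x_0 = 1 column.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((4, 1)), rng.standard_normal((4, 2))])  # N x (d+1)
y = rng.standard_normal(4)                                     # N x 1
w = rng.standard_normal(3)                                     # (d+1) x 1

# Same value as the per-example sum (1/N) * sum_n (w^T x_n - y_n)^2:
per_example = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(4)])
print(np.isclose(e_in(w, X, y), per_example))  # True
```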
min_w E_in(w) = (1/N) ‖Xw − y‖^2

[figure: the bowl-shaped convex surface of E_in(w) over w, with a unique minimum at the bottom]

• E_in(w): continuous, differentiable, convex
• necessary condition of the 'best' w:
  ∇E_in(w) ≡ ( ∂E_in/∂w_0 (w), ∂E_in/∂w_1 (w), ..., ∂E_in/∂w_d (w) ) = (0, 0, ..., 0)
  so that it is no longer possible to 'roll down'

task: find w_LIN such that ∇E_in(w_LIN) = 0
The Gradient ∇E_in(w)

E_in(w) = (1/N) ‖Xw − y‖^2 = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y )
        = (1/N) ( w^T A w − 2 w^T b + c ),   with A = X^T X, b = X^T y, c = y^T y

one w only:
  E_in(w) = (1/N) (a w^2 − 2bw + c)
  ∇E_in(w) = (1/N) (2aw − 2b)   (simple! :-))

vector w:
  E_in(w) = (1/N) (w^T A w − 2 w^T b + c)
  ∇E_in(w) = (1/N) (2Aw − 2b)   (similar, derived by definition)

∇E_in(w) = (2/N) (X^T X w − X^T y)
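The derived gradient can be sanity-checked against central finite differences. A minimal sketch; the random data, names, and step size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
w = rng.standard_normal(4)

def e_in(w):
    r = X @ w - y
    return r @ r / len(y)

def grad_e_in(w):
    """Analytic gradient from the slide: (2/N) * (X^T X w - X^T y)."""
    return 2.0 / len(y) * (X.T @ X @ w - X.T @ y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (e_in(w + eps * e) - e_in(w - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(grad_e_in(w), numeric))  # True
```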
Optimal Linear Regression Weights
task: find w_LIN such that ∇E_in(w) = (2/N) (X^T X w − X^T y) = 0

invertible X^T X:
• easy! unique solution w_LIN = (X^T X)^{-1} X^T y = X† y, where X† = (X^T X)^{-1} X^T is the pseudo-inverse of X
• often the case, because N ≫ d + 1

singular X^T X:
• many optimal solutions
• one of the solutions is w_LIN = X† y, by defining X† in other ways

practical suggestion: use a well-implemented † routine instead of (X^T X)^{-1} X^T, for numerical stability when X^T X is almost singular
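The practical suggestion can be observed numerically. In this sketch (my own construction, not from the lecture), an almost-duplicate column makes X^T X nearly singular; the SVD-based routines np.linalg.pinv and np.linalg.lstsq remain stable, while the explicit inverse may drift:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, 3))])
# Append an almost-duplicate column so that X^T X is nearly singular.
X = np.hstack([X, X[:, [1]] + 1e-8 * rng.standard_normal((N, 1))])
y = rng.standard_normal(N)

w_naive = np.linalg.inv(X.T @ X) @ X.T @ y      # explicit inverse: fragile here
w_pinv  = np.linalg.pinv(X) @ y                 # SVD-based pseudo-inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # stable least-squares routine

# Compare residual norms ||X w - y||: the stable routines agree closely,
# while the explicit inverse may give a visibly worse fit.
for w in (w_naive, w_pinv, w_lstsq):
    print(np.linalg.norm(X @ w - y))
```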
Linear Regression Algorithm
1 from D, construct the input matrix X and the output vector y by
    X = [x_1^T; x_2^T; ...; x_N^T]   (N × (d+1)),   y = (y_1, y_2, ..., y_N)   (N × 1)
2 calculate the pseudo-inverse X†   ((d+1) × N)
3 return w_LIN = X† y   ((d+1) × 1)

simple and efficient with a good † routine
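The three steps translate directly into a few lines of NumPy. A sketch; the function name linear_regression is mine, and it assumes raw features without the x_0 = 1 coordinate, which step 1 prepends:

```python
import numpy as np

def linear_regression(X_raw, y):
    N = len(y)
    X = np.hstack([np.ones((N, 1)), X_raw])  # step 1: X is N x (d+1), y is N x 1
    X_dagger = np.linalg.pinv(X)             # step 2: pseudo-inverse, (d+1) x N
    return X_dagger @ y                      # step 3: w_LIN, (d+1) x 1

# Usage on synthetic data from a known target plus noise (my own test setup):
rng = np.random.default_rng(3)
w_true = np.array([1.0, -2.0, 0.5])
X_raw = rng.standard_normal((200, 2))
y = np.hstack([np.ones((200, 1)), X_raw]) @ w_true + 0.1 * rng.standard_normal(200)
print(linear_regression(X_raw, y))  # approximately [1.0, -2.0, 0.5]
```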
Fun Time
After getting w_LIN, we can calculate the predictions ŷ_n = w_LIN^T x_n. If all ŷ_n are collected in a vector ŷ similar to how we form y, what is the matrix formula of ŷ?
1 y
2 X X^T y
3 X X† y
4 X X† X X^T y

Reference Answer: 3
Note that ŷ = X w_LIN. Then, a simple substitution of w_LIN reveals the answer.
Is Linear Regression a ‘Learning Algorithm’?
w_LIN = X† y

No!
• analytic (closed-form) solution, 'instantaneous'
• not improving E_in nor E_out iteratively

Yes!
• good E_in? yes, optimal!
• good E_out? yes, finite d_VC like perceptrons
• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_LIN) is good, learning 'happened'!
Benefit of Analytic Solution:
‘Simpler-than-VC’ Guarantee
average E_in = E_{D∼P^N} { E_in(w_LIN w.r.t. D) }
             = noise level · (1 − (d+1)/N)   (to be shown)

E_in(w_LIN) = (1/N) ‖y − ŷ‖^2          (ŷ: the predictions)
            = (1/N) ‖y − X X† y‖^2     (since ŷ = X w_LIN = X X† y)
            = (1/N) ‖(I − X X†) y‖^2   (I: the identity matrix)

call X X† the hat matrix H, because it puts the hat (∧) on y
Geometric View of Hat Matrix
[figure: the vector y in R^N, its projection ŷ onto the span of the columns of X, and the residual y − ŷ perpendicular to that span]

• ŷ = X w_LIN lies within the span of the X columns
• y − ŷ smallest: y − ŷ ⊥ span
• H: projects y to ŷ ∈ span
• I − H: transforms y to y − ŷ ⊥ span

claim: trace(I − H) = N − (d + 1). Why? :-)
An Illustrative ‘Proof’
[figure: y as an ideal f(X) in the span of X plus noise; I − H transforms the noise into the residual y − ŷ]

• if y comes from some ideal f(X) ∈ span plus noise
• noise with 'noise level' σ^2 is transformed by I − H into y − ŷ

E_in(w_LIN) = (1/N) ‖y − ŷ‖^2 = (1/N) ‖(I − H) noise‖^2
            = (1/N) (N − (d+1)) σ^2

average E_in = σ^2 · (1 − (d+1)/N)
average E_out = σ^2 · (1 + (d+1)/N)   (complicated!)
The Learning Curve
E_out = noise level · (1 + (d+1)/N)
E_in  = noise level · (1 − (d+1)/N)

[figure: learning curves of expected error versus the number of data points N; E_out decreases and E_in increases, both approaching the noise level σ^2, with the gap vanishing as N grows past d + 1]

• both converge to σ^2 (the noise level) as N → ∞
• expected generalization error: 2(d+1)/N, similar to the worst-case guarantee from VC

linear regression (LinReg): learning 'happened'!
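The E_in side of the learning curve is easy to reproduce by simulation. A sketch under my own data-generating assumptions: a random linear target plus Gaussian noise of level σ²:

```python
import numpy as np

rng = np.random.default_rng(5)
d, sigma = 5, 0.5
w_true = rng.standard_normal(d + 1)

def avg_e_in(N, trials=2000):
    """Average E_in(w_LIN) over many data sets drawn from the same distribution."""
    total = 0.0
    for _ in range(trials):
        X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
        y = X @ w_true + sigma * rng.standard_normal(N)
        w_lin = np.linalg.pinv(X) @ y
        r = X @ w_lin - y
        total += r @ r / N
    return total / trials

for N in (10, 20, 50, 100):
    # Simulated average versus the formula sigma^2 * (1 - (d+1)/N).
    print(N, round(avg_e_in(N), 4), round(sigma**2 * (1 - (d + 1) / N), 4))
```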
Fun Time
Which of the following properties of H is not true?
1 H is symmetric
2 H^2 = H (double projection = single projection)
3 (I − H)^2 = I − H (double residual transform = single transform)
4 none of the above

Reference Answer: 4
You can conclude that 2 and 3 are true by their physical meanings! :-)
Linear Classification vs. Linear Regression
Linear Classification:
  Y = {−1, +1}
  h(x) = sign(w^T x)
  err(ŷ, y) = [ŷ ≠ y]
  NP-hard to solve in general

Linear Regression:
  Y = R
  h(x) = w^T x
  err(ŷ, y) = (ŷ − y)^2
  efficient analytic solution

{−1, +1} ⊂ R: linear regression for classification?
1 run LinReg on binary classification data D (efficient)
2 return g(x) = sign(w_LIN^T x)

but what justifies this heuristic?
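The heuristic itself is two lines of code on top of the algorithm above; the justification comes on the following slides. A minimal sketch on linearly separable synthetic data of my own construction:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 300
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, 2))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))  # binary labels in {-1, +1}

w_lin = np.linalg.pinv(X) @ y  # step 1: run LinReg on the +-1 labels
g = np.sign(X @ w_lin)         # step 2: g(x) = sign(w_LIN^T x)
print(np.mean(g != y))         # 0/1 in-sample error, typically small here
```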
Relation of Two Errors
err_0/1(w, x, y) = [ sign(w^T x) ≠ y ]
err_sqr(w, x, y) = (w^T x − y)^2

[figure: two panels plotting both errors against w^T x, for desired y = +1 and for desired y = −1; the squared-error curve lies above the 0/1 step everywhere]

err_0/1 ≤ err_sqr
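The pointwise bound can be verified over a grid of scores s = w^T x for both desired labels. A quick numerical check, not a proof:

```python
import numpy as np

s = np.linspace(-3.0, 3.0, 601)  # candidate values of the score s = w^T x
for y in (-1.0, 1.0):
    err01 = (np.sign(s) != y).astype(float)  # 0/1 error
    errsqr = (s - y) ** 2                    # squared error
    print(y, np.all(err01 <= errsqr))        # True for both labels
```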
Linear Regression for Binary Classification
err_0/1 ≤ err_sqr

classification E_out(w) ≤ classification E_in(w) + √...   (VC)
                        ≤ regression E_in(w) + √...

• (loose) upper bound: use err_sqr as êrr to approximate err_0/1
• trade bound tightness for efficiency

w_LIN: useful baseline classifier, or as an initial PLA/pocket vector
Fun Time
Which of the following functions are upper bounds of the pointwise 0/1 error [ sign(w^T x) ≠ y ] for y ∈ {−1, +1}?
1 exp(−y w^T x)
2 max(0, 1 − y w^T x)
3 log_2 (1 + exp(−y w^T x))
4 all of the above

Reference Answer: 4
Plot the curves and you’ll see. Thus, all three can be used for binary classification. In fact, all three functions connect to very important algorithms in machine learning and we will discuss one of them soon in the next lecture.
Stay tuned. :-)