(1)

Machine Learning Foundations

( 機器學習基石)

Lecture 9: Linear Regression

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

(2)

Linear Regression

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 8: Noise and Error
learning can happen with target distribution P(y|x) and low E_in w.r.t. err

3 How Can Machines Learn?

Lecture 9: Linear Regression
Linear Regression Problem
Linear Regression Algorithm
Generalization Issue
Linear Regression for Binary Classification

4 How Can Machines Learn Better?

(3)

Linear Regression Linear Regression Problem

Credit Limit Problem

age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: NTD 200,000

credit limit? NTD 100,000

unknown target function f : X → Y (ideal credit limit formula)

training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)

learning algorithm A

final hypothesis g ≈ f ('learned' formula to be used)

hypothesis set H (set of candidate formulas)

Y = ℝ: regression

(4)

Linear Regression Linear Regression Problem

Linear Regression Hypothesis

age: 23 years
annual salary: NTD 1,000,000
year in job: 0.5 year
current debt: NTD 200,000

For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of customer', approximate the desired credit limit with a weighted sum:

$$y \approx \sum_{i=0}^{d} w_i x_i$$

linear regression hypothesis: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

h(x): like perceptron, but without the sign
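As a small illustration (my addition, not from the slides), here is a minimal NumPy sketch of the hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ with the usual constant coordinate $x_0 = 1$; the feature values and weights below are made up:

```python
import numpy as np

# hypothetical customer features: x0 = 1 (constant), age, annual salary
# (millions of NTD), years in job, current debt (millions of NTD)
x = np.array([1.0, 23.0, 1.0, 0.5, 0.2])

# hypothetical weights, one per feature (w0 pairs with the constant x0)
w = np.array([0.05, 0.002, 0.3, 0.04, -0.25])

# linear regression hypothesis: h(x) = w^T x, a weighted sum with no sign(.)
print(w @ x)  # predicted credit limit (millions of NTD) under these made-up weights
```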

(5)

Linear Regression Linear Regression Problem

Illustration of Linear Regression

[Figures: for $x = (x_1) \in \mathbb{R}$, data points and a fitted line in the $x$-$y$ plane; for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$, data points and a fitted plane over the $x_1$-$x_2$-$y$ space.]

linear regression: find lines/hyperplanes with small residuals

(6)

Linear Regression Linear Regression Problem

The Error Measure

popular/historical error measure: squared error

$$\text{err}(\hat{y}, y) = (\hat{y} - y)^2$$

in-sample:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\big(\underbrace{h(\mathbf{x}_n)}_{\mathbf{w}^T\mathbf{x}_n} - y_n\big)^2$$

out-of-sample:

$$E_{\text{out}}(\mathbf{w}) = \mathop{\mathbb{E}}_{(\mathbf{x},y)\sim P}\big(\mathbf{w}^T\mathbf{x} - y\big)^2$$

next: how to minimize $E_{\text{in}}(\mathbf{w})$?
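For concreteness, a short sketch (my addition, with invented data) that evaluates the in-sample error $E_{\text{in}}(\mathbf{w})$ by averaging the squared error over the examples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # rows x_n, with x0 = 1
y = rng.standard_normal(N)                                      # real-valued targets y_n
w = rng.standard_normal(d + 1)                                   # some candidate weights

# E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2
E_in = sum((w @ X[n] - y[n]) ** 2 for n in range(N)) / N
print(E_in)
```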

(7)

Linear Regression Linear Regression Problem

Fun Time

Consider using the linear regression hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ to predict the credit limit of customers $\mathbf{x}$. Which feature below shall have a positive weight in a good hypothesis for the task?

1 birth month
2 monthly income
3 current debt
4 number of credit cards owned

Reference Answer: 2

Customers with higher monthly income should naturally be given a higher credit limit, which is captured by the positive weight on the ‘monthly income’ feature.

(8)

Linear Regression Linear Regression Algorithm

Matrix Form of $E_{\text{in}}(\mathbf{w})$

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{w}^T\mathbf{x}_n - y_n\big)^2 = \frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{x}_n^T\mathbf{w} - y_n\big)^2$$

$$= \frac{1}{N}\left\|\begin{bmatrix} \mathbf{x}_1^T\mathbf{w} - y_1 \\ \mathbf{x}_2^T\mathbf{w} - y_2 \\ \vdots \\ \mathbf{x}_N^T\mathbf{w} - y_N \end{bmatrix}\right\|^2 = \frac{1}{N}\left\|\begin{bmatrix} \text{---}\ \mathbf{x}_1^T\ \text{---} \\ \text{---}\ \mathbf{x}_2^T\ \text{---} \\ \vdots \\ \text{---}\ \mathbf{x}_N^T\ \text{---} \end{bmatrix}\mathbf{w} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\right\|^2$$

$$= \frac{1}{N}\Big\|\underbrace{X}_{N\times(d+1)}\ \underbrace{\mathbf{w}}_{(d+1)\times 1} - \underbrace{\mathbf{y}}_{N\times 1}\Big\|^2$$
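A quick check, on randomly generated data, that the matrix form $\frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2$ agrees with the per-example average (a sketch I added, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # N x (d+1)
y = rng.standard_normal(N)                                      # length N
w = rng.standard_normal(d + 1)                                   # length d+1

E_in_sum = np.mean([(X[n] @ w - y[n]) ** 2 for n in range(N)])  # per-example average
E_in_mat = np.linalg.norm(X @ w - y) ** 2 / N                    # (1/N) ||Xw - y||^2
assert np.isclose(E_in_sum, E_in_mat)
```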

(9)

Linear Regression Linear Regression Algorithm

$$\min_{\mathbf{w}}\ E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2$$

[Figure: $E_{\text{in}}(\mathbf{w})$ as a smooth convex 'bowl' over $\mathbf{w}$, with the minimum at the bottom of the valley.]

$E_{\text{in}}(\mathbf{w})$: continuous, differentiable, convex

necessary condition of 'best' $\mathbf{w}$:

$$\nabla E_{\text{in}}(\mathbf{w}) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(\mathbf{w}) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(\mathbf{w}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

at such a point it is not possible to 'roll down' any further

task: find $\mathbf{w}_{\text{LIN}}$ such that $\nabla E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \mathbf{0}$

(10)

Linear Regression Linear Regression Algorithm

The Gradient $\nabla E_{\text{in}}(\mathbf{w})$

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{N}\Big(\mathbf{w}^T\underbrace{X^TX}_{A}\mathbf{w} - 2\mathbf{w}^T\underbrace{X^T\mathbf{y}}_{\mathbf{b}} + \underbrace{\mathbf{y}^T\mathbf{y}}_{c}\Big)$$

one $w$ only:

$E_{\text{in}}(w) = \frac{1}{N}\big(aw^2 - 2bw + c\big)$, so $\nabla E_{\text{in}}(w) = \frac{1}{N}(2aw - 2b)$. Simple! :-)

vector $\mathbf{w}$:

$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\big(\mathbf{w}^TA\mathbf{w} - 2\mathbf{w}^T\mathbf{b} + c\big)$, so $\nabla E_{\text{in}}(\mathbf{w}) = \frac{1}{N}(2A\mathbf{w} - 2\mathbf{b})$, similar (derived by definition).

$$\nabla E_{\text{in}}(\mathbf{w}) = \frac{2}{N}\big(X^TX\mathbf{w} - X^T\mathbf{y}\big)$$
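As a sanity check on this formula (my addition), the analytic gradient $\frac{2}{N}(X^TX\mathbf{w} - X^T\mathbf{y})$ can be compared with a finite-difference approximation on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
y = rng.standard_normal(N)
w = rng.standard_normal(d + 1)

def E_in(w):
    return np.linalg.norm(X @ w - y) ** 2 / N

grad_analytic = 2.0 / N * (X.T @ X @ w - X.T @ y)   # (2/N)(X^T X w - X^T y)

eps = 1e-6
grad_numeric = np.array([
    (E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)  # central difference per coordinate
    for e in np.eye(d + 1)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```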

(11)

Linear Regression Linear Regression Algorithm

Optimal Linear Regression Weights

task: find $\mathbf{w}_{\text{LIN}}$ such that $\frac{2}{N}\big(X^TX\mathbf{w} - X^T\mathbf{y}\big) = \nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

invertible $X^TX$: easy! unique solution

$$\mathbf{w}_{\text{LIN}} = \underbrace{\big(X^TX\big)^{-1}X^T}_{\text{pseudo-inverse } X^\dagger}\,\mathbf{y}$$

often the case because $N \gg d + 1$

singular $X^TX$: many optimal solutions

one of the solutions: $\mathbf{w}_{\text{LIN}} = X^\dagger\mathbf{y}$, by defining $X^\dagger$ in other ways

practical suggestion: use a well-implemented $\dagger$ routine instead of $\big(X^TX\big)^{-1}X^T$ for numerical stability when $X^TX$ is almost singular

(12)

Linear Regression Linear Regression Algorithm

Linear Regression Algorithm

1 from $\mathcal{D}$, construct input matrix $X$ and output vector $\mathbf{y}$ by

$$X = \underbrace{\begin{bmatrix} \text{---}\ \mathbf{x}_1^T\ \text{---} \\ \text{---}\ \mathbf{x}_2^T\ \text{---} \\ \vdots \\ \text{---}\ \mathbf{x}_N^T\ \text{---} \end{bmatrix}}_{N\times(d+1)} \qquad \mathbf{y} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N\times 1}$$

2 calculate the pseudo-inverse $\underbrace{X^\dagger}_{(d+1)\times N}$

3 return $\underbrace{\mathbf{w}_{\text{LIN}}}_{(d+1)\times 1} = X^\dagger\mathbf{y}$

simple and efficient with a good $\dagger$ routine
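A compact NumPy sketch of these three steps (my own illustration). Following the slide's practical suggestion, it relies on np.linalg.lstsq, a well-implemented least-squares routine, rather than forming $(X^TX)^{-1}X^T$ explicitly; the toy data at the bottom is invented just to exercise the function:

```python
import numpy as np

def linear_regression(X_raw, y):
    """Return w_LIN for raw inputs X_raw (N x d) and real-valued targets y (length N)."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])     # step 1: prepend x0 = 1 to build X
    # steps 2-3: w_LIN = X^dagger y, computed by a stable least-squares routine
    w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_lin

# toy usage with made-up data
rng = np.random.default_rng(3)
X_raw = rng.standard_normal((200, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])   # bias followed by 4 feature weights
y = np.hstack([np.ones((200, 1)), X_raw]) @ w_true + 0.1 * rng.standard_normal(200)
print(linear_regression(X_raw, y))               # should land near w_true
```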

(13)

Linear Regression Linear Regression Algorithm

Fun Time

After getting $\mathbf{w}_{\text{LIN}}$, we can calculate the predictions $\hat{y}_n = \mathbf{w}_{\text{LIN}}^T\mathbf{x}_n$. If all $\hat{y}_n$ are collected in a vector $\hat{\mathbf{y}}$ similar to how we form $\mathbf{y}$, what is the matrix formula of $\hat{\mathbf{y}}$?

1 $\mathbf{y}$
2 $XX^T\mathbf{y}$
3 $XX^\dagger\mathbf{y}$
4 $XX^\dagger XX^T\mathbf{y}$

Reference Answer: 3

Note that $\hat{\mathbf{y}} = X\mathbf{w}_{\text{LIN}}$. Then, a simple substitution of $\mathbf{w}_{\text{LIN}}$ reveals the answer.

(14)

Linear Regression Generalization Issue

Is Linear Regression a ‘Learning Algorithm’?

$$\mathbf{w}_{\text{LIN}} = X^\dagger\mathbf{y}$$

No!
• analytic (closed-form) solution, 'instantaneous'
• not improving $E_{\text{in}}$ nor $E_{\text{out}}$ iteratively

Yes!
• good $E_{\text{in}}$? yes, optimal!
• good $E_{\text{out}}$? yes, finite $d_{\text{VC}}$ like perceptrons
• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is good, learning 'happened'!

(15)

Linear Regression Generalization Issue

Benefit of Analytic Solution: 'Simpler-than-VC' Guarantee

$$\overline{E_{\text{in}}} = \mathop{\mathbb{E}}_{\mathcal{D}\sim P^N}\Big\{E_{\text{in}}\big(\mathbf{w}_{\text{LIN}} \text{ w.r.t. } \mathcal{D}\big)\Big\} \overset{\text{to be shown}}{=} \text{noise level}\cdot\Big(1 - \frac{d+1}{N}\Big)$$

$$E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \frac{1}{N}\big\|\mathbf{y} - \underbrace{\hat{\mathbf{y}}}_{\text{predictions}}\big\|^2 = \frac{1}{N}\big\|\mathbf{y} - X\underbrace{X^\dagger\mathbf{y}}_{\mathbf{w}_{\text{LIN}}}\big\|^2 = \frac{1}{N}\big\|(\underbrace{I}_{\text{identity}} - XX^\dagger)\,\mathbf{y}\big\|^2$$

call $XX^\dagger$ the hat matrix $H$ because it puts a hat ($\wedge$) on $\mathbf{y}$

(16)

Linear Regression Generalization Issue

Geometric View of Hat Matrix

[Figure: in $\mathbb{R}^N$, the vector $\mathbf{y}$, its projection $\hat{\mathbf{y}}$ onto the span of the columns of $X$, and the residual $\mathbf{y} - \hat{\mathbf{y}}$ perpendicular to that span.]

• $\hat{\mathbf{y}} = X\mathbf{w}_{\text{LIN}}$ lies within the span of the $X$ columns
• $\mathbf{y} - \hat{\mathbf{y}}$ smallest: $\mathbf{y} - \hat{\mathbf{y}} \perp$ span
• $H$: projects $\mathbf{y}$ to $\hat{\mathbf{y}} \in$ span
• $I - H$: transforms $\mathbf{y}$ to $\mathbf{y} - \hat{\mathbf{y}} \perp$ span

claim: $\text{trace}(I - H) = N - (d + 1)$. Why? :-)
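A small numerical check of these hat-matrix properties (symmetry, idempotence, and the trace claim) on a randomly generated $X$; this is an illustration I added for intuition, not a proof:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 5
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # N x (d+1), full column rank

H = X @ np.linalg.pinv(X)    # hat matrix H = X X^dagger
I = np.eye(N)

assert np.allclose(H, H.T)                     # H is symmetric
assert np.allclose(H @ H, H)                   # H^2 = H: projecting twice = projecting once
assert np.allclose((I - H) @ (I - H), I - H)   # (I - H)^2 = I - H
print(np.trace(I - H), N - (d + 1))            # trace(I - H) = N - (d + 1)
```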

(17)

Linear Regression Generalization Issue

An Illustrative ‘Proof’

[Figure: $\mathbf{y}$ as an ideal $f(X)$ in the span of $X$ plus a noise vector; $I - H$ transforms the noise into the residual $\mathbf{y} - \hat{\mathbf{y}}$.]

if $\mathbf{y}$ comes from some ideal $f(X) \in$ span plus noise:

• noise with per-dimension 'noise level' $\sigma^2$
• transformed by $I - H$ to be $\mathbf{y} - \hat{\mathbf{y}}$

$$E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \frac{1}{N}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{N}\|(I - H)\,\text{noise}\|^2 = \frac{1}{N}\big(N - (d+1)\big)\sigma^2$$

on average:

$$\overline{E_{\text{in}}} = \sigma^2\cdot\Big(1 - \frac{d+1}{N}\Big) \qquad \overline{E_{\text{out}}} = \sigma^2\cdot\Big(1 + \frac{d+1}{N}\Big)\ \text{(complicated!)}$$
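A Monte Carlo sketch of the claim (my addition, with invented sizes and noise level): fix $X$, repeatedly draw $\mathbf{y} = f(X) + \text{noise}$, fit $\mathbf{w}_{\text{LIN}}$, and compare the averaged errors with $\sigma^2(1 \mp \frac{d+1}{N})$. For the $E_{\text{out}}$ side it uses the simplified stand-in of fresh noise at the same inputs, which is one way the $\sigma^2(1 + \frac{d+1}{N})$ expression arises:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma, trials = 50, 4, 0.5, 5000

X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_f = rng.standard_normal(d + 1)          # ideal target f(x) = w_f^T x, lies in the span
X_pinv = np.linalg.pinv(X)

E_in_avg = E_out_avg = 0.0
for _ in range(trials):
    y = X @ w_f + sigma * rng.standard_normal(N)      # y = f(X) + noise
    w_lin = X_pinv @ y
    E_in_avg += np.mean((X @ w_lin - y) ** 2) / trials
    y_new = X @ w_f + sigma * rng.standard_normal(N)  # fresh noise at the same inputs
    E_out_avg += np.mean((X @ w_lin - y_new) ** 2) / trials

print(E_in_avg, sigma**2 * (1 - (d + 1) / N))   # average E_in  vs sigma^2 (1 - (d+1)/N)
print(E_out_avg, sigma**2 * (1 + (d + 1) / N))  # average E_out vs sigma^2 (1 + (d+1)/N)
```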

(18)

Linear Regression Generalization Issue

The Learning Curve

$$\overline{E_{\text{out}}} = \text{noise level}\cdot\Big(1 + \frac{d+1}{N}\Big) \qquad \overline{E_{\text{in}}} = \text{noise level}\cdot\Big(1 - \frac{d+1}{N}\Big)$$

[Figure: expected error versus the number of data points $N$; the $E_{\text{out}}$ curve decreases toward $\sigma^2$ from above, the $E_{\text{in}}$ curve increases toward $\sigma^2$ from below, and the gap between them shrinks as $N$ grows past $d + 1$.]

• both converge to $\sigma^2$ (noise level) as $N \to \infty$
• expected generalization error: $\frac{2(d+1)}{N}$, similar to the worst-case guarantee from VC

linear regression (LinReg): learning 'happened'!

(19)

Linear Regression Generalization Issue

Fun Time

Which of the following properties of H is not true?

1 $H$ is symmetric
2 $H^2 = H$ (double projection = single one)
3 $(I - H)^2 = I - H$ (double residual transform = single one)
4 none of the above

Reference Answer: 4

You can conclude that 2 and 3 are true by their physical meanings! :-)

(20)

Linear Regression Linear Regression for Binary Classification

Linear Classification vs. Linear Regression

Linear Classification:
• Y = {−1, +1}
• $h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x})$
• $\text{err}(\hat{y}, y) = [\![\hat{y} \neq y]\!]$
• NP-hard to solve in general

Linear Regression:
• Y = ℝ
• $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$
• $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$
• efficient analytic solution

{−1, +1} ⊂ ℝ: linear regression for classification?

1 run LinReg on binary classification data $\mathcal{D}$ (efficient)
2 return $g(\mathbf{x}) = \text{sign}(\mathbf{w}_{\text{LIN}}^T\mathbf{x})$

but what is the explanation of this heuristic?
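A sketch of the two steps on synthetic ±1 data (the data generation is my own invention; only the fit-then-sign recipe comes from the slide):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_target = np.array([0.3, 1.0, -1.0])
y = np.where(X @ w_target + 0.2 * rng.standard_normal(N) >= 0, 1.0, -1.0)  # labels in {-1, +1}

# step 1: run linear regression on the binary labels, treating them as real values
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

# step 2: classify with g(x) = sign(w_LIN^T x)
g = np.sign(X @ w_lin)
print("in-sample 0/1 error:", np.mean(g != y))
```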

(21)

Linear Regression Linear Regression for Binary Classification

Relation of Two Errors

$$\text{err}_{0/1} = \big[\!\big[\,\text{sign}(\mathbf{w}^T\mathbf{x}) \neq y\,\big]\!\big] \qquad \text{err}_{\text{sqr}} = \big(\mathbf{w}^T\mathbf{x} - y\big)^2$$

[Figures: err plotted against $\mathbf{w}^T\mathbf{x}$ for desired $y = 1$ and for desired $y = -1$; in both panels the squared error curve lies on or above the 0/1 error curve.]

$$\text{err}_{0/1} \leq \text{err}_{\text{sqr}}$$
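A quick numeric check of this picture (my addition): over a grid of scores $s = \mathbf{w}^T\mathbf{x}$ and both desired labels, the squared error is never below the 0/1 error:

```python
import numpy as np

s = np.linspace(-3.0, 3.0, 601)                  # candidate scores w^T x
for y in (-1.0, +1.0):
    err_01 = (np.sign(s) != y).astype(float)     # 0/1 error of sign(w^T x)
    err_sqr = (s - y) ** 2                       # squared error of w^T x
    assert np.all(err_01 <= err_sqr + 1e-12)     # err_0/1 <= err_sqr pointwise
```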

(22)

Linear Regression Linear Regression for Binary Classification

Linear Regression for Binary Classification

$$\text{err}_{0/1} \leq \text{err}_{\text{sqr}}$$

$$\text{classification } E_{\text{out}}(\mathbf{w}) \overset{\text{VC}}{\leq} \text{classification } E_{\text{in}}(\mathbf{w}) + \sqrt{\cdots} \leq \text{regression } E_{\text{in}}(\mathbf{w}) + \sqrt{\cdots}$$

(loose) upper bound: use $\text{err}_{\text{sqr}}$ as $\widehat{\text{err}}$ to approximate $\text{err}_{0/1}$, trading bound tightness for efficiency

$\mathbf{w}_{\text{LIN}}$: useful baseline classifier, or as initial PLA/pocket vector

(23)

Linear Regression Linear Regression for Binary Classification

Fun Time

Which of the following functions are upper bounds of the pointwise 0/1 error $[\![\,\text{sign}(\mathbf{w}^T\mathbf{x}) \neq y\,]\!]$ for $y \in \{-1, +1\}$?

1 $\exp(-y\,\mathbf{w}^T\mathbf{x})$
2 $\max(0, 1 - y\,\mathbf{w}^T\mathbf{x})$
3 $\log_2(1 + \exp(-y\,\mathbf{w}^T\mathbf{x}))$
4 all of the above

Reference Answer: 4

Plot the curves and you'll see. Thus, all three can be used for binary classification. In fact, all three functions connect to very important algorithms in machine learning, and we will discuss one of them in the next lecture.

Stay tuned. :-)

(24)

Linear Regression Linear Regression for Binary Classification

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 8: Noise and Error

3 How Can Machines Learn?

Lecture 9: Linear Regression
• Linear Regression Problem: use hyperplanes to approximate real values
• Linear Regression Algorithm: analytic solution with pseudo-inverse
• Generalization Issue: $E_{\text{out}} - E_{\text{in}} \approx \frac{2(d+1)}{N}$ on average
• Linear Regression for Binary Classification: 0/1 error ≤ squared error

next: binary classification, regression, and then?

4 How Can Machines Learn Better?
