(1)

Machine Learning Foundations

( 機器學習基石)

Lecture 9: Linear Regression

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

(2)

Linear Regression

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 8: Noise and Error
learning can happen with target distribution P(y|x) and low E_in w.r.t. err

3 How Can Machines Learn?

Lecture 9: Linear Regression
Linear Regression Problem
Linear Regression Algorithm
Generalization Issue
Linear Regression for Binary Classification

4 How Can Machines Learn Better?

(3)

Linear Regression Linear Regression Problem

Credit Limit Problem

age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: NTD 200,000

credit limit? NTD 100,000

unknown target function f : X → Y (ideal credit limit formula)

training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)

learning algorithm A

final hypothesis g ≈ f ('learned' formula to be used)

hypothesis set H (set of candidate formulas)

Y = ℝ: regression

(4)

Linear Regression Linear Regression Problem

Linear Regression Hypothesis

age: 23 years
annual salary: NTD 1,000,000
year in job: 0.5 year
current debt: NTD 200,000

For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of customer', approximate the desired credit limit with a weighted sum:

$$y \approx \sum_{i=0}^{d} w_i x_i$$

linear regression hypothesis: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

h(x): like perceptron, but without the sign
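As a small illustration (my addition, not from the slides), here is a minimal NumPy sketch of the hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ with the usual constant coordinate $x_0 = 1$; the feature values and weights below are made up:

```python
import numpy as np

# hypothetical customer features: x0 = 1 (constant), age, annual salary
# (millions of NTD), years in job, current debt (millions of NTD)
x = np.array([1.0, 23.0, 1.0, 0.5, 0.2])

# hypothetical weights, one per feature (w0 pairs with the constant x0)
w = np.array([0.05, 0.002, 0.3, 0.04, -0.25])

# linear regression hypothesis: h(x) = w^T x, a weighted sum with no sign(.)
print(w @ x)  # predicted credit limit (millions of NTD) under these made-up weights
```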

(5)

Linear Regression Linear Regression Problem

Illustration of Linear Regression

[Figures: for $x = (x_1) \in \mathbb{R}$, data points and a fitted line in the $x$-$y$ plane; for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$, data points and a fitted plane over the $x_1$-$x_2$-$y$ space.]

linear regression: find lines/hyperplanes with small residuals

(6)

Linear Regression Linear Regression Problem

The Error Measure

popular/historical error measure: squared error

$$\text{err}(\hat{y}, y) = (\hat{y} - y)^2$$

in-sample:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\big(\underbrace{h(\mathbf{x}_n)}_{\mathbf{w}^T\mathbf{x}_n} - y_n\big)^2$$

out-of-sample:

$$E_{\text{out}}(\mathbf{w}) = \mathop{\mathbb{E}}_{(\mathbf{x},y)\sim P}\big(\mathbf{w}^T\mathbf{x} - y\big)^2$$

next: how to minimize $E_{\text{in}}(\mathbf{w})$?
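For concreteness, a short sketch (my addition, with invented data) that evaluates the in-sample error $E_{\text{in}}(\mathbf{w})$ by averaging the squared error over the examples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # rows x_n, with x0 = 1
y = rng.standard_normal(N)                                      # real-valued targets y_n
w = rng.standard_normal(d + 1)                                   # some candidate weights

# E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2
E_in = sum((w @ X[n] - y[n]) ** 2 for n in range(N)) / N
print(E_in)
```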

(7)

Linear Regression Linear Regression Problem

Fun Time

Consider using the linear regression hypothesis $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ to predict the credit limit of customers $\mathbf{x}$. Which feature below shall have a positive weight in a good hypothesis for the task?

1 birth month
2 monthly income
3 current debt
4 number of credit cards owned

Reference Answer: 2

Customers with higher monthly income should naturally be given a higher credit limit, which is captured by the positive weight on the ‘monthly income’ feature.

(8)

Linear Regression Linear Regression Algorithm

Matrix Form of $E_{\text{in}}(\mathbf{w})$

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{w}^T\mathbf{x}_n - y_n\big)^2 = \frac{1}{N}\sum_{n=1}^{N}\big(\mathbf{x}_n^T\mathbf{w} - y_n\big)^2$$

$$= \frac{1}{N}\left\|\begin{bmatrix} \mathbf{x}_1^T\mathbf{w} - y_1 \\ \mathbf{x}_2^T\mathbf{w} - y_2 \\ \vdots \\ \mathbf{x}_N^T\mathbf{w} - y_N \end{bmatrix}\right\|^2 = \frac{1}{N}\left\|\begin{bmatrix} \text{---}\ \mathbf{x}_1^T\ \text{---} \\ \text{---}\ \mathbf{x}_2^T\ \text{---} \\ \vdots \\ \text{---}\ \mathbf{x}_N^T\ \text{---} \end{bmatrix}\mathbf{w} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\right\|^2$$

$$= \frac{1}{N}\Big\|\underbrace{X}_{N\times(d+1)}\ \underbrace{\mathbf{w}}_{(d+1)\times 1} - \underbrace{\mathbf{y}}_{N\times 1}\Big\|^2$$
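A quick check, on randomly generated data, that the matrix form $\frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2$ agrees with the per-example average (a sketch I added, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # N x (d+1)
y = rng.standard_normal(N)                                      # length N
w = rng.standard_normal(d + 1)                                   # length d+1

E_in_sum = np.mean([(X[n] @ w - y[n]) ** 2 for n in range(N)])  # per-example average
E_in_mat = np.linalg.norm(X @ w - y) ** 2 / N                    # (1/N) ||Xw - y||^2
assert np.isclose(E_in_sum, E_in_mat)
```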

(9)

Linear Regression Linear Regression Algorithm

$$\min_{\mathbf{w}}\ E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2$$

[Figure: $E_{\text{in}}(\mathbf{w})$ as a smooth convex 'bowl' over $\mathbf{w}$, with the minimum at the bottom of the valley.]

$E_{\text{in}}(\mathbf{w})$: continuous, differentiable, convex

necessary condition of 'best' $\mathbf{w}$:

$$\nabla E_{\text{in}}(\mathbf{w}) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(\mathbf{w}) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(\mathbf{w}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

at such a point it is not possible to 'roll down' any further

task: find $\mathbf{w}_{\text{LIN}}$ such that $\nabla E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \mathbf{0}$

(10)

Linear Regression Linear Regression Algorithm

The Gradient $\nabla E_{\text{in}}(\mathbf{w})$

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{N}\Big(\mathbf{w}^T\underbrace{X^TX}_{A}\mathbf{w} - 2\mathbf{w}^T\underbrace{X^T\mathbf{y}}_{\mathbf{b}} + \underbrace{\mathbf{y}^T\mathbf{y}}_{c}\Big)$$

one $w$ only:

$E_{\text{in}}(w) = \frac{1}{N}\big(aw^2 - 2bw + c\big)$, so $\nabla E_{\text{in}}(w) = \frac{1}{N}(2aw - 2b)$. Simple! :-)

vector $\mathbf{w}$:

$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\big(\mathbf{w}^TA\mathbf{w} - 2\mathbf{w}^T\mathbf{b} + c\big)$, so $\nabla E_{\text{in}}(\mathbf{w}) = \frac{1}{N}(2A\mathbf{w} - 2\mathbf{b})$, similar (derived by definition).

$$\nabla E_{\text{in}}(\mathbf{w}) = \frac{2}{N}\big(X^TX\mathbf{w} - X^T\mathbf{y}\big)$$
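As a sanity check on this formula (my addition), the analytic gradient $\frac{2}{N}(X^TX\mathbf{w} - X^T\mathbf{y})$ can be compared with a finite-difference approximation on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
y = rng.standard_normal(N)
w = rng.standard_normal(d + 1)

def E_in(w):
    return np.linalg.norm(X @ w - y) ** 2 / N

grad_analytic = 2.0 / N * (X.T @ X @ w - X.T @ y)   # (2/N)(X^T X w - X^T y)

eps = 1e-6
grad_numeric = np.array([
    (E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)  # central difference per coordinate
    for e in np.eye(d + 1)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```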

(11)

Linear Regression Linear Regression Algorithm

Optimal Linear Regression Weights

task: find $\mathbf{w}_{\text{LIN}}$ such that $\frac{2}{N}\big(X^TX\mathbf{w} - X^T\mathbf{y}\big) = \nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

invertible $X^TX$: easy! unique solution

$$\mathbf{w}_{\text{LIN}} = \underbrace{\big(X^TX\big)^{-1}X^T}_{\text{pseudo-inverse } X^\dagger}\,\mathbf{y}$$

often the case because $N \gg d + 1$

singular $X^TX$: many optimal solutions

one of the solutions: $\mathbf{w}_{\text{LIN}} = X^\dagger\mathbf{y}$, by defining $X^\dagger$ in other ways

practical suggestion: use a well-implemented $\dagger$ routine instead of $\big(X^TX\big)^{-1}X^T$ for numerical stability when $X^TX$ is almost singular

(12)

Linear Regression Linear Regression Algorithm

Linear Regression Algorithm

1 from $\mathcal{D}$, construct input matrix $X$ and output vector $\mathbf{y}$ by

$$X = \underbrace{\begin{bmatrix} \text{---}\ \mathbf{x}_1^T\ \text{---} \\ \text{---}\ \mathbf{x}_2^T\ \text{---} \\ \vdots \\ \text{---}\ \mathbf{x}_N^T\ \text{---} \end{bmatrix}}_{N\times(d+1)} \qquad \mathbf{y} = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N\times 1}$$

2 calculate the pseudo-inverse $\underbrace{X^\dagger}_{(d+1)\times N}$

3 return $\underbrace{\mathbf{w}_{\text{LIN}}}_{(d+1)\times 1} = X^\dagger\mathbf{y}$

simple and efficient with a good $\dagger$ routine
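A compact NumPy sketch of these three steps (my own illustration). Following the slide's practical suggestion, it relies on np.linalg.lstsq, a well-implemented least-squares routine, rather than forming $(X^TX)^{-1}X^T$ explicitly; the toy data at the bottom is invented just to exercise the function:

```python
import numpy as np

def linear_regression(X_raw, y):
    """Return w_LIN for raw inputs X_raw (N x d) and real-valued targets y (length N)."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])     # step 1: prepend x0 = 1 to build X
    # steps 2-3: w_LIN = X^dagger y, computed by a stable least-squares routine
    w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_lin

# toy usage with made-up data
rng = np.random.default_rng(3)
X_raw = rng.standard_normal((200, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])   # bias followed by 4 feature weights
y = np.hstack([np.ones((200, 1)), X_raw]) @ w_true + 0.1 * rng.standard_normal(200)
print(linear_regression(X_raw, y))               # should land near w_true
```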

(13)

Linear Regression Linear Regression Algorithm

Fun Time

After getting $\mathbf{w}_{\text{LIN}}$, we can calculate the predictions $\hat{y}_n = \mathbf{w}_{\text{LIN}}^T\mathbf{x}_n$. If all $\hat{y}_n$ are collected in a vector $\hat{\mathbf{y}}$ similar to how we form $\mathbf{y}$, what is the matrix formula of $\hat{\mathbf{y}}$?

1 $\mathbf{y}$
2 $XX^T\mathbf{y}$
3 $XX^\dagger\mathbf{y}$
4 $XX^\dagger XX^T\mathbf{y}$

Reference Answer: 3

Note that $\hat{\mathbf{y}} = X\mathbf{w}_{\text{LIN}}$. Then, a simple substitution of $\mathbf{w}_{\text{LIN}}$ reveals the answer.

(14)

Linear Regression Generalization Issue

Is Linear Regression a ‘Learning Algorithm’?

$$\mathbf{w}_{\text{LIN}} = X^\dagger\mathbf{y}$$

No!
• analytic (closed-form) solution, 'instantaneous'
• not improving $E_{\text{in}}$ nor $E_{\text{out}}$ iteratively

Yes!
• good $E_{\text{in}}$? yes, optimal!
• good $E_{\text{out}}$? yes, finite $d_{\text{VC}}$ like perceptrons
• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is good, learning 'happened'!

(15)

Linear Regression Generalization Issue

Benefit of Analytic Solution: 'Simpler-than-VC' Guarantee

$$\overline{E_{\text{in}}} = \mathop{\mathbb{E}}_{\mathcal{D}\sim P^N}\Big\{E_{\text{in}}\big(\mathbf{w}_{\text{LIN}} \text{ w.r.t. } \mathcal{D}\big)\Big\} \overset{\text{to be shown}}{=} \text{noise level}\cdot\Big(1 - \frac{d+1}{N}\Big)$$

$$E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \frac{1}{N}\big\|\mathbf{y} - \underbrace{\hat{\mathbf{y}}}_{\text{predictions}}\big\|^2 = \frac{1}{N}\big\|\mathbf{y} - X\underbrace{X^\dagger\mathbf{y}}_{\mathbf{w}_{\text{LIN}}}\big\|^2 = \frac{1}{N}\big\|(\underbrace{I}_{\text{identity}} - XX^\dagger)\,\mathbf{y}\big\|^2$$

call $XX^\dagger$ the hat matrix $H$ because it puts a hat ($\wedge$) on $\mathbf{y}$

(16)

Linear Regression Generalization Issue

Geometric View of Hat Matrix

[Figure: in $\mathbb{R}^N$, the vector $\mathbf{y}$, its projection $\hat{\mathbf{y}}$ onto the span of the columns of $X$, and the residual $\mathbf{y} - \hat{\mathbf{y}}$ perpendicular to that span.]

• $\hat{\mathbf{y}} = X\mathbf{w}_{\text{LIN}}$ lies within the span of the $X$ columns
• $\mathbf{y} - \hat{\mathbf{y}}$ smallest: $\mathbf{y} - \hat{\mathbf{y}} \perp$ span
• $H$: projects $\mathbf{y}$ to $\hat{\mathbf{y}} \in$ span
• $I - H$: transforms $\mathbf{y}$ to $\mathbf{y} - \hat{\mathbf{y}} \perp$ span

claim: $\text{trace}(I - H) = N - (d + 1)$. Why? :-)
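A small numerical check of these hat-matrix properties (symmetry, idempotence, and the trace claim) on a randomly generated $X$; this is an illustration I added for intuition, not a proof:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 5
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # N x (d+1), full column rank

H = X @ np.linalg.pinv(X)    # hat matrix H = X X^dagger
I = np.eye(N)

assert np.allclose(H, H.T)                     # H is symmetric
assert np.allclose(H @ H, H)                   # H^2 = H: projecting twice = projecting once
assert np.allclose((I - H) @ (I - H), I - H)   # (I - H)^2 = I - H
print(np.trace(I - H), N - (d + 1))            # trace(I - H) = N - (d + 1)
```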

(17)

Linear Regression Generalization Issue

An Illustrative ‘Proof’

[Figure: $\mathbf{y}$ as an ideal $f(X)$ in the span of $X$ plus a noise vector; $I - H$ transforms the noise into the residual $\mathbf{y} - \hat{\mathbf{y}}$.]

if $\mathbf{y}$ comes from some ideal $f(X) \in$ span plus noise:

• noise with per-dimension 'noise level' $\sigma^2$
• transformed by $I - H$ to be $\mathbf{y} - \hat{\mathbf{y}}$

$$E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \frac{1}{N}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{N}\|(I - H)\,\text{noise}\|^2 = \frac{1}{N}\big(N - (d+1)\big)\sigma^2$$

on average:

$$\overline{E_{\text{in}}} = \sigma^2\cdot\Big(1 - \frac{d+1}{N}\Big) \qquad \overline{E_{\text{out}}} = \sigma^2\cdot\Big(1 + \frac{d+1}{N}\Big)\ \text{(complicated!)}$$
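A Monte Carlo sketch of the claim (my addition, with invented sizes and noise level): fix $X$, repeatedly draw $\mathbf{y} = f(X) + \text{noise}$, fit $\mathbf{w}_{\text{LIN}}$, and compare the averaged errors with $\sigma^2(1 \mp \frac{d+1}{N})$. For the $E_{\text{out}}$ side it uses the simplified stand-in of fresh noise at the same inputs, which is one way the $\sigma^2(1 + \frac{d+1}{N})$ expression arises:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma, trials = 50, 4, 0.5, 5000

X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_f = rng.standard_normal(d + 1)          # ideal target f(x) = w_f^T x, lies in the span
X_pinv = np.linalg.pinv(X)

E_in_avg = E_out_avg = 0.0
for _ in range(trials):
    y = X @ w_f + sigma * rng.standard_normal(N)      # y = f(X) + noise
    w_lin = X_pinv @ y
    E_in_avg += np.mean((X @ w_lin - y) ** 2) / trials
    y_new = X @ w_f + sigma * rng.standard_normal(N)  # fresh noise at the same inputs
    E_out_avg += np.mean((X @ w_lin - y_new) ** 2) / trials

print(E_in_avg, sigma**2 * (1 - (d + 1) / N))   # average E_in  vs sigma^2 (1 - (d+1)/N)
print(E_out_avg, sigma**2 * (1 + (d + 1) / N))  # average E_out vs sigma^2 (1 + (d+1)/N)
```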

(18)

Linear Regression Generalization Issue

The Learning Curve

$$\overline{E_{\text{out}}} = \text{noise level}\cdot\Big(1 + \frac{d+1}{N}\Big) \qquad \overline{E_{\text{in}}} = \text{noise level}\cdot\Big(1 - \frac{d+1}{N}\Big)$$

[Figure: expected error versus the number of data points $N$; the $E_{\text{out}}$ curve decreases toward $\sigma^2$ from above, the $E_{\text{in}}$ curve increases toward $\sigma^2$ from below, and the gap between them shrinks as $N$ grows past $d + 1$.]

• both converge to $\sigma^2$ (noise level) as $N \to \infty$
• expected generalization error: $\frac{2(d+1)}{N}$, similar to the worst-case guarantee from VC

linear regression (LinReg): learning 'happened'!

(19)

Linear Regression Generalization Issue

Fun Time

Which of the following properties of H is not true?

1 $H$ is symmetric
2 $H^2 = H$ (double projection = single one)
3 $(I - H)^2 = I - H$ (double residual transform = single one)
4 none of the above

Reference Answer: 4

You can conclude that 2 and 3 are true by their physical meanings! :-)

(20)

Linear Regression Linear Regression for Binary Classification

Linear Classification vs. Linear Regression

Linear Classification:
• Y = {−1, +1}
• $h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x})$
• $\text{err}(\hat{y}, y) = [\![\hat{y} \neq y]\!]$
• NP-hard to solve in general

Linear Regression:
• Y = ℝ
• $h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$
• $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$
• efficient analytic solution

{−1, +1} ⊂ ℝ: linear regression for classification?

1 run LinReg on binary classification data $\mathcal{D}$ (efficient)
2 return $g(\mathbf{x}) = \text{sign}(\mathbf{w}_{\text{LIN}}^T\mathbf{x})$

but what is the explanation of this heuristic?
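A sketch of the two steps on synthetic ±1 data (the data generation is my own invention; only the fit-then-sign recipe comes from the slide):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_target = np.array([0.3, 1.0, -1.0])
y = np.where(X @ w_target + 0.2 * rng.standard_normal(N) >= 0, 1.0, -1.0)  # labels in {-1, +1}

# step 1: run linear regression on the binary labels, treating them as real values
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

# step 2: classify with g(x) = sign(w_LIN^T x)
g = np.sign(X @ w_lin)
print("in-sample 0/1 error:", np.mean(g != y))
```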

(21)

Linear Regression Linear Regression for Binary Classification

Relation of Two Errors

$$\text{err}_{0/1} = \big[\!\big[\,\text{sign}(\mathbf{w}^T\mathbf{x}) \neq y\,\big]\!\big] \qquad \text{err}_{\text{sqr}} = \big(\mathbf{w}^T\mathbf{x} - y\big)^2$$

[Figures: err plotted against $\mathbf{w}^T\mathbf{x}$ for desired $y = 1$ and for desired $y = -1$; in both panels the squared error curve lies on or above the 0/1 error curve.]

$$\text{err}_{0/1} \leq \text{err}_{\text{sqr}}$$
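A quick numeric check of this picture (my addition): over a grid of scores $s = \mathbf{w}^T\mathbf{x}$ and both desired labels, the squared error is never below the 0/1 error:

```python
import numpy as np

s = np.linspace(-3.0, 3.0, 601)                  # candidate scores w^T x
for y in (-1.0, +1.0):
    err_01 = (np.sign(s) != y).astype(float)     # 0/1 error of sign(w^T x)
    err_sqr = (s - y) ** 2                       # squared error of w^T x
    assert np.all(err_01 <= err_sqr + 1e-12)     # err_0/1 <= err_sqr pointwise
```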

(22)

Linear Regression Linear Regression for Binary Classification

Linear Regression for Binary Classification

$$\text{err}_{0/1} \leq \text{err}_{\text{sqr}}$$

$$\text{classification } E_{\text{out}}(\mathbf{w}) \overset{\text{VC}}{\leq} \text{classification } E_{\text{in}}(\mathbf{w}) + \sqrt{\cdots} \leq \text{regression } E_{\text{in}}(\mathbf{w}) + \sqrt{\cdots}$$

(loose) upper bound: use $\text{err}_{\text{sqr}}$ as $\widehat{\text{err}}$ to approximate $\text{err}_{0/1}$, trading bound tightness for efficiency

$\mathbf{w}_{\text{LIN}}$: useful baseline classifier, or as initial PLA/pocket vector

(23)

Linear Regression Linear Regression for Binary Classification

Fun Time

Which of the following functions are upper bounds of the pointwise 0/1 error $[\![\,\text{sign}(\mathbf{w}^T\mathbf{x}) \neq y\,]\!]$ for $y \in \{-1, +1\}$?

1 $\exp(-y\,\mathbf{w}^T\mathbf{x})$
2 $\max(0, 1 - y\,\mathbf{w}^T\mathbf{x})$
3 $\log_2(1 + \exp(-y\,\mathbf{w}^T\mathbf{x}))$
4 all of the above

Reference Answer: 4

Plot the curves and you'll see. Thus, all three can be used for binary classification. In fact, all three functions connect to very important algorithms in machine learning, and we will discuss one of them in the next lecture.

Stay tuned. :-)

(24)

Linear Regression Linear Regression for Binary Classification

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

Lecture 8: Noise and Error

3 How Can Machines Learn?

Lecture 9: Linear Regression
• Linear Regression Problem: use hyperplanes to approximate real values
• Linear Regression Algorithm: analytic solution with pseudo-inverse
• Generalization Issue: $E_{\text{out}} - E_{\text{in}} \approx \frac{2(d+1)}{N}$ on average
• Linear Regression for Binary Classification: 0/1 error ≤ squared error

next: binary classification, regression, and then?

4 How Can Machines Learn Better?
