(1)

6.867 Machine learning: lecture 3

Tommi S. Jaakkola MIT CSAIL

tommi@csail.mit.edu

(2)

Topics

• Beyond linear regression models

– additive regression models, examples
– generalization and cross-validation
– population minimizer

• Statistical regression models

– model formulation, motivation
– maximum likelihood estimation

(3)

Linear regression

• Linear regression functions,

f : R → R,  f(x; w) = w0 + w1x, or

f : R^d → R,  f(x; w) = w0 + w1x1 + . . . + wdxd

combined with the squared loss, are convenient because they are linear in the parameters.

(4)

Linear regression

• Linear regression functions,

f : R → R,  f(x; w) = w0 + w1x, or

f : R^d → R,  f(x; w) = w0 + w1x1 + . . . + wdxd

combined with the squared loss, are convenient because they are linear in the parameters.

– we get closed form estimates of the parameters ŵ = (XᵀX)⁻¹Xᵀy

where, for example, y = [y1, . . . , yn]ᵀ.

(5)

Linear regression

• Linear regression functions,

f : R → R,  f(x; w) = w0 + w1x, or

f : R^d → R,  f(x; w) = w0 + w1x1 + . . . + wdxd

combined with the squared loss, are convenient because they are linear in the parameters.

– we get closed form estimates of the parameters ŵ = (XᵀX)⁻¹Xᵀy

where, for example, y = [y1, . . . , yn]ᵀ.

– the resulting prediction errors εi = yi − f(xi; ŵ) are uncorrelated with any linear function of the inputs x.

(6)

Linear regression

• Linear regression functions,

f : R → R,  f(x; w) = w0 + w1x, or

f : R^d → R,  f(x; w) = w0 + w1x1 + . . . + wdxd

combined with the squared loss, are convenient because they are linear in the parameters.

– we get closed form estimates of the parameters ŵ = (XᵀX)⁻¹Xᵀy

where, for example, y = [y1, . . . , yn]ᵀ.

– the resulting prediction errors εi = yi − f(xi; ŵ) are uncorrelated with any linear function of the inputs x.

– we can easily extend these to non-linear functions of the inputs while still keeping them linear in the parameters
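As a concrete illustration of the closed-form estimate above, here is a minimal numpy sketch (not code from the lecture; fit_linear and predict are hypothetical helper names):

```python
import numpy as np

def fit_linear(X_raw, y):
    """Closed-form least squares: w_hat = (X^T X)^(-1) X^T y.

    X_raw is an (n, d) array of inputs; a column of ones is prepended
    so that w_hat[0] plays the role of the offset w0.
    """
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # lstsq solves the same normal equations but is numerically better
    # behaved than forming the inverse of X^T X explicitly.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def predict(w_hat, X_raw):
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    return X @ w_hat
```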

(7)

Beyond linear regression

• Example extension: m-th order polynomial regression where f : R → R is given by

f(x; w) = w0 + w1x + . . . + wm−1 x^(m−1) + wm x^m

– linear in the parameters, non-linear in the inputs
– solution as before

ŵ = (XᵀX)⁻¹Xᵀy

where

ŵ = [ŵ0, ŵ1, · · · , ŵm]ᵀ,   X =

1   x1   x1^2   . . .   x1^m
1   x2   x2^2   . . .   x2^m
·   ·    ·      . . .   ·
1   xn   xn^2   . . .   xn^m
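A short numpy sketch of the same computation for the polynomial case (illustrative only; poly_design_matrix and fit_poly are hypothetical names):

```python
import numpy as np

def poly_design_matrix(x, m):
    """Rows [1, x_i, x_i^2, ..., x_i^m] for a 1-d input vector x."""
    return np.vander(x, N=m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least squares fit of an m-th order polynomial: w_hat = (X^T X)^(-1) X^T y."""
    X = poly_design_matrix(x, m)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat
```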

(8)

Polynomial regression

[Figure: polynomial fits of degree 1, 3, 5, and 7 to the training data (y vs. x).]

(9)

Complexity and overfitting

• With limited training examples, our polynomial regression model may achieve zero training error but nevertheless have a large test (generalization) error

train   (1/n) Σ_{t=1}^n (yt − f(xt; ŵ))² ≈ 0
test    E_{(x,y)∼P} (y − f(x; ŵ))² ≫ 0

[Figure: an over-fit polynomial on the training data (y vs. x).]

• We suffer from over-fitting when the training error no longer bears any relation to the generalization error
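To make the train/test gap concrete, here is a small simulation sketch on assumed synthetic data (a cubic trend plus Gaussian noise; the numbers are illustrative, not the lecture's data set):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):                              # assumed ground-truth trend
    return x**3 - 2 * x

n = 10
x_train = rng.uniform(-2, 2, n)
y_train = true_f(x_train) + rng.normal(0, 1, n)
x_test = rng.uniform(-2, 2, 1000)
y_test = true_f(x_test) + rng.normal(0, 1, 1000)

for m in (1, 3, 5, 7):
    w = np.polyfit(x_train, y_train, m)     # least squares polynomial fit
    train_err = np.mean((y_train - np.polyval(w, x_train)) ** 2)
    test_err = np.mean((y_test - np.polyval(w, x_test)) ** 2)
    print(f"degree {m}: train {train_err:.3f}, test {test_err:.3f}")
# Training error shrinks as the degree grows, while test error eventually
# blows up: the high-degree polynomial over-fits the 10 training points.
```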

(10)

Avoiding over-fitting: cross-validation

• Cross-validation allows us to estimate the generalization error based on training examples alone

Leave-one-out cross-validation treats each training example in turn as a test example:

CV = (1/n) Σ_{i=1}^n (yi − f(xi; ŵ−i))²

where ŵ−i are the least squares estimates of the parameters without the i-th training example.

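A minimal sketch of the leave-one-out computation for the polynomial model (loocv_poly is a hypothetical helper; assumes 1-d numpy arrays x, y):

```python
import numpy as np

def loocv_poly(x, y, m):
    """Leave-one-out CV score for an m-th order polynomial least squares fit."""
    n = len(x)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i                 # hold out the i-th example
        w = np.polyfit(x[keep], y[keep], m)      # refit on the remaining n-1 points
        errs.append((y[i] - np.polyval(w, x[i])) ** 2)
    return np.mean(errs)

# Choosing the degree by cross-validation:
# best_m = min((1, 3, 5, 7), key=lambda m: loocv_poly(x, y, m))
```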

(11)

Polynomial regression: example cont’d

[Figure: polynomial fits of degree 1, 3, 5, and 7 (y vs. x), with leave-one-out scores: degree = 1, CV = 0.6; degree = 3, CV = 1.5; degree = 5, CV = 6.0; degree = 7, CV = 15.6.]

(12)

Additive models

• More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ1(x), . . . , φm(x)}, where each φi : R^d → R, and

f (x; w) = w0 + w1φ1(x) + . . . + wmφm(x)

• Examples:

If φi(x) = x^i, i = 1, . . . , m, then

f(x; w) = w0 + w1x + . . . + wm−1 x^(m−1) + wm x^m

(13)

Additive models

• More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ1(x), . . . , φm(x)}, where each φi : R^d → R, and

f (x; w) = w0 + w1φ1(x) + . . . + wmφm(x)

• Examples:

If φi(x) = x^i, i = 1, . . . , m, then

f(x; w) = w0 + w1x + . . . + wm−1 x^(m−1) + wm x^m

If m = d, φi(x) = xi (the i-th input coordinate), i = 1, . . . , d, then

f(x; w) = w0 + w1x1 + . . . + wdxd
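Since the model stays linear in w for any fixed set of basis functions, the same least squares machinery applies; a minimal sketch (fit_additive is a hypothetical name):

```python
import numpy as np

def fit_additive(x_list, y, basis_fns):
    """Least squares fit of f(x; w) = w0 + w1*phi_1(x) + ... + wm*phi_m(x).

    x_list: sequence of inputs (scalars or vectors); basis_fns: callables phi_i.
    """
    Phi = np.column_stack(
        [np.ones(len(x_list))]
        + [np.array([phi(x) for x in x_list]) for phi in basis_fns]
    )
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w_hat

# With phi_i(x) = x**i this reproduces polynomial regression exactly:
# w_hat = fit_additive(x, y, [lambda x, i=i: x**i for i in range(1, m + 1)])
```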

(14)

Additive models cont’d

• The basis functions can capture various (e.g., qualitative) properties of the inputs.

For example: we can try to rate companies based on text descriptions

x = text document (collection of words)

φi(x) = 1 if word i appears in the document, 0 otherwise

f(x; w) = w0 + Σ_{i∈words} wi φi(x)
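A tiny sketch of such indicator features (the vocabulary and tokenization below are hypothetical simplifications):

```python
def bow_features(document, vocabulary):
    """phi_i(x) = 1 if word i appears in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

# Hypothetical usage: rate a company from its text description.
# vocabulary = ["growth", "loss", "profit"]
# phi = bow_features("Record profit and strong growth this quarter", vocabulary)
# rating = w0 + sum(wi * fi for wi, fi in zip(w, phi))
```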

(15)

Additive models cont’d

• We can also make predictions by gauging the similarity of examples to “prototypes”.

For example, our additive regression function could be

f(x; w) = w0 + w1φ1(x) + . . . + wmφm(x)

where the basis functions are “radial basis functions”

φk(x) = exp{ −‖x − xk‖² / (2σ²) }

measuring the similarity to the prototypes; σ² controls how quickly the basis function vanishes as a function of the distance to the prototype.

(training examples themselves could serve as prototypes)
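A minimal sketch of these radial basis features, assuming the training inputs serve as the prototypes:

```python
import numpy as np

def rbf_features(x, prototypes, sigma2):
    """phi_k(x) = exp(-||x - x_k||^2 / (2 sigma^2)) for each prototype x_k."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.array([
        np.exp(-np.sum((x - np.atleast_1d(xk)) ** 2) / (2.0 * sigma2))
        for xk in prototypes
    ])

# These features can be fed to the same least squares fit as any other
# additive model; a small sigma2 makes each basis function very local.
```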

(16)

Additive models cont’d

• We can view the additive models graphically in terms of simple “units” and “weights”

[Diagram: inputs x1, x2, . . . feed basis-function units φ1(x), . . . , φm(x); their outputs are combined with weights w1, . . . , wm, plus a constant unit with weight w0, to produce f(x; w).]

• In neural networks the basis functions themselves have adjustable parameters (cf. prototypes)

(17)

Squared loss and population minimizer

• What do we get if we have unlimited training examples (the whole population) and no constraints on the regression function?

minimize  E_{(x,y)∼P} (y − f(x))²

with respect to an unconstrained function f : R → R


(18)

Squared loss and population minimizer

• To minimize

E_{(x,y)∼P} (y − f(x))² = E_{x∼P_x} [ E_{y∼P_{y|x}} (y − f(x))² ]

we can focus on each x separately, since f(x) can be chosen independently for each different x. For any particular x we can set

(∂/∂f(x)) E_{y∼P_{y|x}} (y − f(x))² = −2 E_{y∼P_{y|x}} (y − f(x)) = −2 (E{y|x} − f(x)) = 0

Thus the function we are trying to approximate is the conditional expectation

f(x) = E{y|x}
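A quick numerical check of this fact (purely illustrative numbers, not part of the lecture): for a fixed x, the conditional mean beats any other constant prediction under squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume that at some fixed input x the conditional distribution is y | x ~ N(2.0, 1).
mu = 2.0
y = rng.normal(mu, 1.0, size=100_000)

for c in (0.0, 1.0, mu, 3.0):               # candidate predictions f(x) = c
    print(f"f(x) = {c}: average squared error {np.mean((y - c) ** 2):.3f}")
# The average squared error is smallest at c = mu = E{y | x}.
```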

(19)

Topics

• Beyond linear regression models

– additive regression models, examples
– generalization and cross-validation
– population minimizer

• Statistical regression models

– model formulation, motivation
– maximum likelihood estimation

(20)

Statistical view of linear regression

• In a statistical regression model we model both the function and noise

Observed output = function + noise
y = f(x; w) + ε

where, e.g., ε ∼ N(0, σ²).

• Whatever we cannot capture with our chosen family of functions will be interpreted as noise


(21)

Statistical view of linear regression

• f (x; w) is trying to capture the mean of the observations y given the input x:

E{ y | x } = E{ f(x; w) + ε | x } = f(x; w)

where E{ y | x } is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P )


(22)

Statistical view of linear regression

• According to our statistical model

y = f(x; w) + ε,   ε ∼ N(0, σ²)

the outputs y given x are normally distributed with mean f(x; w) and variance σ²:

p(y | x, w, σ²) = (1/√(2πσ²)) exp{ −(y − f(x; w))² / (2σ²) }

(we model the uncertainty in the predictions, not just the mean)

• Loss function? Estimation?

(23)

Maximum likelihood estimation

• Given observations Dn = {(x1, y1), . . . , (xn, yn)} we find the parameters w that maximize the (conditional) likelihood of the outputs

L(Dn; w, σ²) = Π_{i=1}^n p(yi | xi, w, σ²)

Example: linear function

p(y | x, w, σ²) = (1/√(2πσ²)) exp{ −(y − w0 − w1x)² / (2σ²) }


(why is this a bad fit according to the likelihood criterion?)
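To see why a poor fit scores badly, one can evaluate the conditional log-likelihood directly; a minimal sketch for the linear-function example (gaussian_log_likelihood is a hypothetical helper):

```python
import numpy as np

def gaussian_log_likelihood(w, sigma2, x, y):
    """log L(D; w, sigma^2) for y = w0 + w1*x + noise, noise ~ N(0, sigma^2)."""
    resid = y - (w[0] + w[1] * x)
    n = len(y)
    return -0.5 * np.sum(resid**2) / sigma2 - 0.5 * n * np.log(2 * np.pi * sigma2)

# A line far from the data has large residuals, so the exponent
# -(y - w0 - w1*x)^2 / (2 sigma^2) is very negative for many points,
# and the total log-likelihood is correspondingly low.
```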

(24)

Maximum likelihood estimation cont’d

Likelihood of the observed outputs:

L(D; w, σ²) = Π_{i=1}^n P(yi | xi, w, σ²)

• It is often easier (but equivalent) to try to maximize the log-likelihood:

l(D; w, σ²) = log L(D; w, σ²) = Σ_{i=1}^n log P(yi | xi, w, σ²)

= Σ_{i=1}^n [ −(yi − f(xi; w))² / (2σ²) − log √(2πσ²) ]

= −(1/(2σ²)) Σ_{i=1}^n (yi − f(xi; w))² + . . .

(25)

Maximum likelihood estimation cont’d

• Maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined according to

Loss(yi, f(xi; w)) = − log P(yi | xi, w, σ²)

Loss defined as the negative log-probability is known as the log-loss.
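A one-example sanity check of this equivalence (illustrative numbers; uses scipy's Gaussian density): the log-loss is the squared error scaled by 1/(2σ²), plus a term that does not depend on w.

```python
import numpy as np
from scipy.stats import norm

y, f_xw, sigma2 = 1.3, 0.7, 0.5                             # made-up values
log_loss = -norm.logpdf(y, loc=f_xw, scale=np.sqrt(sigma2))  # negative log-probability
squared_term = (y - f_xw) ** 2 / (2 * sigma2)
w_independent = 0.5 * np.log(2 * np.pi * sigma2)
print(np.isclose(log_loss, squared_term + w_independent))   # True
```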

(26)

Maximum likelihood estimation cont’d

• The log-likelihood of observations

log L(D; w, σ²) = Σ_{i=1}^n log P(yi | xi, w, σ²)

is a generic fitting criterion and can be used to estimate the noise variance σ² as well.

• Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ², obtained by solving

(∂/∂σ²) log L(D; w, σ²) = 0 ?

(27)

Maximum likelihood estimation cont’d

• The log-likelihood of observations

log L(D; w, σ²) = Σ_{i=1}^n log P(yi | xi, w, σ²)

is a generic fitting criterion and can be used to estimate the noise variance σ² as well.

• Let ŵ be the maximum likelihood (here least squares) setting of the parameters. The maximum likelihood estimate of the noise variance σ² is

σ̂² = (1/n) Σ_{i=1}^n (yi − f(xi; ŵ))²

i.e., the mean squared prediction error.
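A minimal sketch of this estimate (sigma2_mle is a hypothetical helper; any fitted model f(x; ŵ) can be plugged in):

```python
import numpy as np

def sigma2_mle(residuals):
    """Maximum likelihood noise variance: the mean squared prediction error."""
    residuals = np.asarray(residuals, dtype=float)
    return np.mean(residuals**2)

# e.g. for the polynomial fit above: sigma2_hat = sigma2_mle(y - np.polyval(w, x))
```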
