6.867 Machine learning: lecture 3
Tommi S. Jaakkola MIT CSAIL
tommi@csail.mit.edu
Topics
• Beyond linear regression models
– additive regression models, examples
– generalization and cross-validation
– population minimizer
• Statistical regression models
– model formulation, motivation
– maximum likelihood estimation
Linear regression
• Linear regression functions,
f : R → R,   f(x; w) = w0 + w1 x,   or
f : R^d → R,   f(x; w) = w0 + w1 x1 + . . . + wd xd
combined with the squared loss, are convenient because they are linear in the parameters.
– we get closed-form estimates of the parameters (see the code sketch below)
  ŵ = (X^T X)^{-1} X^T y
where, for example, y = [y1, . . . , yn]^T.
– the resulting prediction errors εi = yi − f(xi; ŵ) are uncorrelated with any linear function of the inputs x.
– we can easily extend these to non-linear functions of the inputs while still keeping them linear in the parameters.
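As a concrete illustration of the closed-form estimate, here is a minimal NumPy sketch on made-up data (the synthetic dataset and variable names are illustrative, not part of the lecture):

```python
import numpy as np

# Synthetic 1D data: y = 2 + 3x + noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones for the offset w0
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares estimate: w_hat = (X^T X)^{-1} X^T y
# (solving the normal equations rather than forming the inverse explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # approximately [2, 3]
```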
Beyond linear regression
• Example extension: mth order polynomial regression where f : R → R is given by
  f(x; w) = w0 + w1 x + . . . + w_{m−1} x^{m−1} + wm x^m
– linear in the parameters, non-linear in the inputs
– solution as before,
  ŵ = (X^T X)^{-1} X^T y
where ŵ = [ŵ0, ŵ1, . . . , ŵm]^T and X is the n × (m + 1) matrix whose ith row is [1, xi, xi², . . . , xi^m].
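A sketch of the polynomial case under the same assumptions (np.vander builds the columns 1, x, x², . . . , x^m; fit_polynomial and predict are hypothetical helper names):

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Least squares fit of an m-th order polynomial in a scalar input."""
    # Columns 1, x, x^2, ..., x^m (increasing powers, as in the matrix above)
    X = np.vander(x, N=m + 1, increasing=True)
    # Equivalent to (X^T X)^{-1} X^T y, computed stably
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def predict(w_hat, x):
    """Evaluate the fitted polynomial at new inputs x."""
    X = np.vander(x, N=len(w_hat), increasing=True)
    return X @ w_hat
```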
Polynomial regression
[Figure: polynomial fits to the same data (x vs. y) for degree = 1, degree = 3, degree = 5, and degree = 7.]
Complexity and overfitting
• With limited training examples our polynomial regression model may achieve zero training error but nevertheless have a large test (generalization) error:
  train:  (1/n) Σ_{t=1}^{n} (yt − f(xt; ŵ))² ≈ 0
  test:   E_{(x,y)∼P} (y − f(x; ŵ))² ≫ 0
• We suffer from over-fitting when the training error no longer bears any relation to the generalization error
Avoiding over-fitting: cross-validation
• Cross-validation allows us to estimate the generalization error based on training examples alone
Leave-one-out cross-validation treats each training example in turn as a test example:
  CV = (1/n) Σ_{i=1}^{n} ( yi − f(xi; ŵ^{−i}) )²
where ŵ^{−i} are the least squares estimates of the parameters without the ith training example.
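A minimal leave-one-out sketch, reusing the hypothetical fit_polynomial and predict helpers from the polynomial sketch above:

```python
import numpy as np

def loo_cv(x, y, m):
    """Leave-one-out cross-validation score for an m-th order polynomial fit."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                     # drop the i-th example
        w_minus_i = fit_polynomial(x[mask], y[mask], m)
        y_hat_i = predict(w_minus_i, x[i:i+1])[0]    # predict the held-out point
        errors.append((y[i] - y_hat_i) ** 2)
    return np.mean(errors)
```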
Polynomial regression: example cont’d
[Figure: polynomial fits (x vs. y) with leave-one-out cross-validation scores: degree = 1, CV = 0.6; degree = 3, CV = 1.5; degree = 5, CV = 6.0; degree = 7, CV = 15.6.]
Additive models
• More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ1(x), . . . , φm(x)}, where each φi : R^d → R, and
  f(x; w) = w0 + w1 φ1(x) + . . . + wm φm(x)
• Examples:
If φi(x) = x^i, i = 1, . . . , m (powers of a scalar input), then
  f(x; w) = w0 + w1 x + . . . + w_{m−1} x^{m−1} + wm x^m
If m = d and φi(x) = xi, i = 1, . . . , d (the components of the input), then
  f(x; w) = w0 + w1 x1 + . . . + wd xd
Additive models cont’d
• The basis functions can capture various (e.g., qualitative) properties of the inputs.
For example: we can try to rate companies based on text descriptions
  x = text document (a collection of words)
  φi(x) = 1 if word i appears in the document, 0 otherwise
  f(x; w) = w0 + Σ_{i ∈ words} wi φi(x)
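A sketch of such indicator features for a toy vocabulary (the vocabulary, weights, and function names are made up for illustration):

```python
def word_features(document, vocabulary):
    """phi_i(x) = 1 if word i appears in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

def rate(document, w0, weights, vocabulary):
    """f(x; w) = w0 + sum_i w_i phi_i(x)."""
    phi = word_features(document, vocabulary)
    return w0 + sum(wi * fi for wi, fi in zip(weights, phi))

# Toy usage (hypothetical vocabulary and weights)
vocab = ["profit", "loss", "growth"]
print(rate("Strong growth and record profit", w0=0.0,
           weights=[1.5, -2.0, 1.0], vocabulary=vocab))
```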
Additive models cont’d
• We can also make predictions by gauging the similarity of examples to “prototypes”.
For example, our additive regression function could be
  f(x; w) = w0 + w1 φ1(x) + . . . + wm φm(x)
where the basis functions are "radial basis functions"
  φk(x) = exp{ − (1/(2σ²)) ‖x − xk‖² }
measuring the similarity to the prototypes xk; σ² controls how quickly the basis function vanishes as a function of the distance to the prototype.
(training examples themselves could serve as prototypes)
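A sketch of radial basis function features, with the option of using the training inputs themselves as prototypes (array shapes and names are assumptions for illustration):

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """phi_k(x) = exp(-||x - x_k||^2 / (2 sigma^2)) for each prototype x_k."""
    # X: (n, d) inputs, prototypes: (m, d); returns an (n, m) feature matrix
    diffs = X[:, None, :] - prototypes[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Using the training inputs themselves as prototypes, the linear coefficients
# can then be fit exactly as before:
#   Phi = np.column_stack([np.ones(len(X)), rbf_features(X, X, sigma)])
#   w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
```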
Additive models cont’d
• We can view the additive models graphically in terms of simple “units” and “weights”
[Diagram: inputs x1, x2, . . . feed into basis-function units φ1(x), . . . , φm(x); their outputs, together with a constant unit 1, are combined with weights w0, w1, . . . , wm to produce f(x; w).]
• In neural networks the basis functions themselves have adjustable parameters (cf. prototypes)
Squared loss and population minimizer
• What do we get if we have unlimited training examples (the whole population) and no constraints on the regression function?
minimize   E_{(x,y)∼P} (y − f(x))²
with respect to an unconstrained function f : R → R
Squared loss and population minimizer
• To minimize
  E_{(x,y)∼P} (y − f(x))² = E_{x∼Px} [ E_{y∼P_{y|x}} (y − f(x))² ]
we can focus on each x separately, since f(x) can be chosen independently for each different x. For any particular x we set
  (∂/∂f(x)) E_{y∼P_{y|x}} (y − f(x))² = −2 E_{y∼P_{y|x}} (y − f(x)) = −2 ( E{y|x} − f(x) ) = 0
Thus the function we are trying to approximate is the conditional expectation
  f*(x) = E{y|x}
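A quick numerical check of this claim (entirely illustrative: we fix a single x, sample y from an arbitrary conditional distribution, and see which constant prediction minimizes the average squared error):

```python
import numpy as np

rng = np.random.default_rng(1)
# For one fixed x, draw many y's from some conditional distribution p(y|x)
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # E[y|x] = 3.0 here

candidates = np.linspace(0.0, 6.0, 121)
risk = [np.mean((y - c) ** 2) for c in candidates]
best = candidates[np.argmin(risk)]
print(best, y.mean())   # the minimizing constant is (approximately) the mean of y
```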
Topics
• Beyond linear regression models
– additive regression models, examples
– generalization and cross-validation
– population minimizer
• Statistical regression models
– model formulation, motivation
– maximum likelihood estimation
Statistical view of linear regression
• In a statistical regression model we model both the function and noise
  Observed output = function + noise
  y = f(x; w) + ε
where, e.g., ε ∼ N(0, σ²).
• Whatever we cannot capture with our chosen family of functions will be interpreted as noise
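A sketch of sampling data from this model with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
w0, w1, sigma = 1.0, 2.0, 0.8               # hypothetical "true" parameters

x = rng.uniform(-2, 2, size=30)
epsilon = rng.normal(0.0, sigma, size=30)   # noise ~ N(0, sigma^2)
y = w0 + w1 * x + epsilon                   # observed output = function + noise
```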
Statistical view of linear regression
• f(x; w) is trying to capture the mean of the observations y given the input x:
  E{ y | x } = E{ f(x; w) + ε | x } = f(x; w)
where E{ y | x } is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P )
Statistical view of linear regression
• According to our statistical model
  y = f(x; w) + ε,   ε ∼ N(0, σ²)
the outputs y given x are normally distributed with mean f(x; w) and variance σ²:
  p(y | x, w, σ²) = (1/√(2πσ²)) exp{ − (1/(2σ²)) (y − f(x; w))² }
(we model the uncertainty in the predictions, not just the mean)
• Loss function? Estimation?
Maximum likelihood estimation
• Given observations Dn = {(x1, y1), . . . , (xn, yn)} we find the parameters w that maximize the (conditional) likelihood of the outputs
  L(Dn; w, σ²) = Π_{i=1}^{n} p(yi | xi, w, σ²)
Example: linear function
  p(y | x, w, σ²) = (1/√(2πσ²)) exp{ − (1/(2σ²)) (y − w0 − w1 x)² }
[Figure: an example linear fit to the data (x vs. y).]
(why is this a bad fit according to the likelihood criterion?)
Maximum likelihood estimation cont’d
Likelihood of the observed outputs:
  L(D; w, σ²) = Π_{i=1}^{n} p(yi | xi, w, σ²)
• It is often easier (but equivalent) to try to maximize the log-likelihood:
  l(D; w, σ²) = log L(D; w, σ²) = Σ_{i=1}^{n} log p(yi | xi, w, σ²)
              = Σ_{i=1}^{n} [ − (1/(2σ²)) (yi − f(xi; w))² − log √(2πσ²) ]
              = − (1/(2σ²)) Σ_{i=1}^{n} (yi − f(xi; w))² + (terms that do not depend on w)
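A numerical sanity check of this decomposition for a linear f(x; w) = w0 + w1 x under Gaussian noise (a sketch; scipy.stats.norm is used only to evaluate the Gaussian log-density, and the helper names are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(w, sigma2, x, y):
    """Sum of log p(y_i | x_i, w, sigma^2) for the linear model f(x; w) = w0 + w1 x."""
    mean = w[0] + w[1] * x
    return np.sum(norm.logpdf(y, loc=mean, scale=np.sqrt(sigma2)))

def log_likelihood_decomposed(w, sigma2, x, y):
    """Same quantity via the decomposition above: -(1/(2 sigma^2)) SSE - n log sqrt(2 pi sigma^2)."""
    resid = y - (w[0] + w[1] * x)
    n = len(y)
    return -np.sum(resid ** 2) / (2 * sigma2) - n * np.log(np.sqrt(2 * np.pi * sigma2))

# The two functions agree (up to floating point) for any w and sigma2 > 0.
```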
Maximum likelihood estimation cont’d
• Maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined according to
  Loss(yi, f(xi; w)) = − log p(yi | xi, w, σ²)
Loss defined as the negative log-probability is known as the log-loss.
Maximum likelihood estimation cont’d
• The log-likelihood of observations
  log L(D; w, σ²) = Σ_{i=1}^{n} log p(yi | xi, w, σ²)
is a generic fitting criterion and can be used to estimate the noise variance σ² as well.
• Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ², obtained by solving
  (∂/∂σ²) log L(D; ŵ, σ²) = 0 ?
• The maximum likelihood estimate of the noise variance σ² is
  σ̂² = (1/n) Σ_{i=1}^{n} (yi − f(xi; ŵ))²
i.e., the mean squared prediction error.
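A sketch of computing this estimate, reusing the synthetic x, y from the sampling sketch above and the hypothetical log_likelihood helper; nearby values of σ² should score no higher:

```python
import numpy as np

# Least squares (= maximum likelihood) fit of w on the synthetic data x, y
X = np.column_stack([np.ones_like(x), x])
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood estimate of the noise variance: the mean squared residual
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)

# The log-likelihood at sigma2_hat is at least as high as at nearby values
for s2 in (0.5 * sigma2_hat, sigma2_hat, 2.0 * sigma2_hat):
    print(s2, log_likelihood(w_hat, s2, x, y))
```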