6.867 Machine learning: lecture 3
Tommi S. Jaakkola MIT CSAIL
tommi@csail.mit.edu
Topics
• Beyond linear regression models
– additive regression models, examples
– generalization and cross-validation
– population minimizer
• Statistical regression models
– model formulation, motivation
– maximum likelihood estimation
Linear regression
• Linear regression functions,
f : R → R,   f(x; w) = w0 + w1 x,   or
f : R^d → R,   f(x; w) = w0 + w1 x1 + . . . + wd xd
combined with the squared loss, are convenient because they are linear in the parameters.
– we get closed-form estimates of the parameters (see the code sketch below)
  ŵ = (X^T X)^{-1} X^T y
where, for example, y = [y1, . . . , yn]^T.
– the resulting prediction errors εi = yi − f(xi; ŵ) are uncorrelated with any linear function of the inputs x.
– we can easily extend these to non-linear functions of the inputs while still keeping them linear in the parameters.
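As a concrete illustration of the closed-form estimate, here is a minimal NumPy sketch on made-up data (the synthetic dataset and variable names are illustrative, not part of the lecture):

```python
import numpy as np

# Synthetic 1D data: y = 2 + 3x + noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones for the offset w0
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares estimate: w_hat = (X^T X)^{-1} X^T y
# (solving the normal equations rather than forming the inverse explicitly)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # approximately [2, 3]
```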
Beyond linear regression
• Example extension: mth order polynomial regression where f : R → R is given by
  f(x; w) = w0 + w1 x + . . . + w_{m−1} x^{m−1} + wm x^m
– linear in the parameters, non-linear in the inputs
– solution as before,
  ŵ = (X^T X)^{-1} X^T y
where ŵ = [ŵ0, ŵ1, . . . , ŵm]^T and X is the n × (m + 1) matrix whose ith row is [1, xi, xi², . . . , xi^m].
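A sketch of the polynomial case under the same assumptions (np.vander builds the columns 1, x, x², . . . , x^m; fit_polynomial and predict are hypothetical helper names):

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Least squares fit of an m-th order polynomial in a scalar input."""
    # Columns 1, x, x^2, ..., x^m (increasing powers, as in the matrix above)
    X = np.vander(x, N=m + 1, increasing=True)
    # Equivalent to (X^T X)^{-1} X^T y, computed stably
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def predict(w_hat, x):
    """Evaluate the fitted polynomial at new inputs x."""
    X = np.vander(x, N=len(w_hat), increasing=True)
    return X @ w_hat
```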
Polynomial regression
[Figure: polynomial fits to the same data (x vs. y) for degree = 1, degree = 3, degree = 5, and degree = 7.]
Complexity and overfitting
• With limited training examples our polynomial regression model may achieve zero training error but nevertheless have a large test (generalization) error:
  train:  (1/n) Σ_{t=1}^{n} (yt − f(xt; ŵ))² ≈ 0
  test:   E_{(x,y)∼P} (y − f(x; ŵ))² ≫ 0
• We suffer from over-fitting when the training error no longer bears any relation to the generalization error
Avoiding over-fitting: cross-validation
• Cross-validation allows us to estimate the generalization error based on training examples alone
Leave-one-out cross-validation treats each training example in turn as a test example:
  CV = (1/n) Σ_{i=1}^{n} ( yi − f(xi; ŵ^{−i}) )²
where ŵ^{−i} are the least squares estimates of the parameters without the ith training example.
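A minimal leave-one-out sketch, reusing the hypothetical fit_polynomial and predict helpers from the polynomial sketch above:

```python
import numpy as np

def loo_cv(x, y, m):
    """Leave-one-out cross-validation score for an m-th order polynomial fit."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                     # drop the i-th example
        w_minus_i = fit_polynomial(x[mask], y[mask], m)
        y_hat_i = predict(w_minus_i, x[i:i+1])[0]    # predict the held-out point
        errors.append((y[i] - y_hat_i) ** 2)
    return np.mean(errors)
```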
Polynomial regression: example cont’d
[Figure: polynomial fits (x vs. y) with leave-one-out cross-validation scores: degree = 1, CV = 0.6; degree = 3, CV = 1.5; degree = 5, CV = 6.0; degree = 7, CV = 15.6.]
Additive models
• More generally, predictions can be based on a linear combination of a set of basis functions (or features) {φ1(x), . . . , φm(x)}, where each φi : R^d → R, and
  f(x; w) = w0 + w1 φ1(x) + . . . + wm φm(x)
• Examples:
If φi(x) = x^i, i = 1, . . . , m (powers of a scalar input), then
  f(x; w) = w0 + w1 x + . . . + w_{m−1} x^{m−1} + wm x^m
If m = d and φi(x) = xi, i = 1, . . . , d (the components of the input), then
  f(x; w) = w0 + w1 x1 + . . . + wd xd
Additive models cont’d
• The basis functions can capture various (e.g., qualitative) properties of the inputs.
For example: we can try to rate companies based on text descriptions
  x = text document (a collection of words)
  φi(x) = 1 if word i appears in the document, 0 otherwise
  f(x; w) = w0 + Σ_{i ∈ words} wi φi(x)
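A sketch of such indicator features for a toy vocabulary (the vocabulary, weights, and function names are made up for illustration):

```python
def word_features(document, vocabulary):
    """phi_i(x) = 1 if word i appears in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

def rate(document, w0, weights, vocabulary):
    """f(x; w) = w0 + sum_i w_i phi_i(x)."""
    phi = word_features(document, vocabulary)
    return w0 + sum(wi * fi for wi, fi in zip(weights, phi))

# Toy usage (hypothetical vocabulary and weights)
vocab = ["profit", "loss", "growth"]
print(rate("Strong growth and record profit", w0=0.0,
           weights=[1.5, -2.0, 1.0], vocabulary=vocab))
```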
Additive models cont’d
• We can also make predictions by gauging the similarity of examples to “prototypes”.
For example, our additive regression function could be
  f(x; w) = w0 + w1 φ1(x) + . . . + wm φm(x)
where the basis functions are "radial basis functions"
  φk(x) = exp{ − (1/(2σ²)) ‖x − xk‖² }
measuring the similarity to the prototypes xk; σ² controls how quickly the basis function vanishes as a function of the distance to the prototype.
(training examples themselves could serve as prototypes)
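A sketch of radial basis function features, with the option of using the training inputs themselves as prototypes (array shapes and names are assumptions for illustration):

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """phi_k(x) = exp(-||x - x_k||^2 / (2 sigma^2)) for each prototype x_k."""
    # X: (n, d) inputs, prototypes: (m, d); returns an (n, m) feature matrix
    diffs = X[:, None, :] - prototypes[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Using the training inputs themselves as prototypes, the linear coefficients
# can then be fit exactly as before:
#   Phi = np.column_stack([np.ones(len(X)), rbf_features(X, X, sigma)])
#   w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
```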
Additive models cont’d
• We can view the additive models graphically in terms of simple “units” and “weights”
[Diagram: inputs x1, x2, . . . feed into basis-function units φ1(x), . . . , φm(x); their outputs, together with a constant unit 1, are combined with weights w0, w1, . . . , wm to produce f(x; w).]
• In neural networks the basis functions themselves have adjustable parameters (cf. prototypes)
Squared loss and population minimizer
• What do we get if we have unlimited training examples (the whole population) and no constraints on the regression function?
minimize   E_{(x,y)∼P} (y − f(x))²
with respect to an unconstrained function f : R → R
Squared loss and population minimizer
• To minimize
  E_{(x,y)∼P} (y − f(x))² = E_{x∼Px} [ E_{y∼P_{y|x}} (y − f(x))² ]
we can focus on each x separately, since f(x) can be chosen independently for each different x. For any particular x we set
  (∂/∂f(x)) E_{y∼P_{y|x}} (y − f(x))² = −2 E_{y∼P_{y|x}} (y − f(x)) = −2 ( E{y|x} − f(x) ) = 0
Thus the function we are trying to approximate is the conditional expectation
  f*(x) = E{y|x}
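A quick numerical check of this claim (entirely illustrative: we fix a single x, sample y from an arbitrary conditional distribution, and see which constant prediction minimizes the average squared error):

```python
import numpy as np

rng = np.random.default_rng(1)
# For one fixed x, draw many y's from some conditional distribution p(y|x)
y = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # E[y|x] = 3.0 here

candidates = np.linspace(0.0, 6.0, 121)
risk = [np.mean((y - c) ** 2) for c in candidates]
best = candidates[np.argmin(risk)]
print(best, y.mean())   # the minimizing constant is (approximately) the mean of y
```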
Topics
• Beyond linear regression models
– additive regression models, examples
– generalization and cross-validation
– population minimizer
• Statistical regression models
– model formulation, motivation
– maximum likelihood estimation
Statistical view of linear regression
• In a statistical regression model we model both the function and noise
  Observed output = function + noise
  y = f(x; w) + ε
where, e.g., ε ∼ N(0, σ²).
• Whatever we cannot capture with our chosen family of functions will be interpreted as noise
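A sketch of sampling data from this model with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
w0, w1, sigma = 1.0, 2.0, 0.8               # hypothetical "true" parameters

x = rng.uniform(-2, 2, size=30)
epsilon = rng.normal(0.0, sigma, size=30)   # noise ~ N(0, sigma^2)
y = w0 + w1 * x + epsilon                   # observed output = function + noise
```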
Statistical view of linear regression
• f(x; w) is trying to capture the mean of the observations y given the input x:
  E{ y | x } = E{ f(x; w) + ε | x } = f(x; w)
where E{ y | x } is the conditional expectation of y given x, evaluated according to the model (not according to the underlying distribution P )
Statistical view of linear regression
• According to our statistical model
  y = f(x; w) + ε,   ε ∼ N(0, σ²)
the outputs y given x are normally distributed with mean f(x; w) and variance σ²:
  p(y | x, w, σ²) = (1/√(2πσ²)) exp{ − (1/(2σ²)) (y − f(x; w))² }
(we model the uncertainty in the predictions, not just the mean)
• Loss function? Estimation?
Maximum likelihood estimation
• Given observations Dn = {(x1, y1), . . . , (xn, yn)} we find the parameters w that maximize the (conditional) likelihood of the outputs
  L(Dn; w, σ²) = Π_{i=1}^{n} p(yi | xi, w, σ²)
Example: linear function
  p(y | x, w, σ²) = (1/√(2πσ²)) exp{ − (1/(2σ²)) (y − w0 − w1 x)² }
[Figure: an example linear fit to the data (x vs. y).]
(why is this a bad fit according to the likelihood criterion?)
Maximum likelihood estimation cont’d
Likelihood of the observed outputs:
  L(D; w, σ²) = Π_{i=1}^{n} p(yi | xi, w, σ²)
• It is often easier (but equivalent) to try to maximize the log-likelihood:
  l(D; w, σ²) = log L(D; w, σ²) = Σ_{i=1}^{n} log p(yi | xi, w, σ²)
              = Σ_{i=1}^{n} [ − (1/(2σ²)) (yi − f(xi; w))² − log √(2πσ²) ]
              = − (1/(2σ²)) Σ_{i=1}^{n} (yi − f(xi; w))² + (terms that do not depend on w)
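A numerical sanity check of this decomposition for a linear f(x; w) = w0 + w1 x under Gaussian noise (a sketch; scipy.stats.norm is used only to evaluate the Gaussian log-density, and the helper names are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(w, sigma2, x, y):
    """Sum of log p(y_i | x_i, w, sigma^2) for the linear model f(x; w) = w0 + w1 x."""
    mean = w[0] + w[1] * x
    return np.sum(norm.logpdf(y, loc=mean, scale=np.sqrt(sigma2)))

def log_likelihood_decomposed(w, sigma2, x, y):
    """Same quantity via the decomposition above: -(1/(2 sigma^2)) SSE - n log sqrt(2 pi sigma^2)."""
    resid = y - (w[0] + w[1] * x)
    n = len(y)
    return -np.sum(resid ** 2) / (2 * sigma2) - n * np.log(np.sqrt(2 * np.pi * sigma2))

# The two functions agree (up to floating point) for any w and sigma2 > 0.
```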
Maximum likelihood estimation cont’d
• Maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined according to
  Loss(yi, f(xi; w)) = − log p(yi | xi, w, σ²)
Loss defined as the negative log-probability is known as the log-loss.
Maximum likelihood estimation cont’d
• The log-likelihood of observations
  log L(D; w, σ²) = Σ_{i=1}^{n} log p(yi | xi, w, σ²)
is a generic fitting criterion and can be used to estimate the noise variance σ² as well.
• Let ŵ be the maximum likelihood (here least squares) setting of the parameters. What is the maximum likelihood estimate of σ², obtained by solving
  (∂/∂σ²) log L(D; ŵ, σ²) = 0 ?
• The maximum likelihood estimate of the noise variance σ² is
  σ̂² = (1/n) Σ_{i=1}^{n} (yi − f(xi; ŵ))²
i.e., the mean squared prediction error.
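A sketch of computing this estimate, reusing the synthetic x, y from the sampling sketch above and the hypothetical log_likelihood helper; nearby values of σ² should score no higher:

```python
import numpy as np

# Least squares (= maximum likelihood) fit of w on the synthetic data x, y
X = np.column_stack([np.ones_like(x), x])
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood estimate of the noise variance: the mean squared residual
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)

# The log-likelihood at sigma2_hat is at least as high as at nearby values
for s2 in (0.5 * sigma2_hat, sigma2_hat, 2.0 * sigma2_hat):
    print(s2, log_likelihood(w_hat, s2, x, y))
```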