Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Lecture 5 (draft)
Overview
• Linear regression
• Logistic regression
• Linear classifier
• Fisher linear discriminant
• Support vector machine
• Kernel PCA
• Kernel discriminant analysis
• Relevance vector machine
Announcements
• More course material available on the course web page
• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap
• Project proposal:
- Send me a one-page project proposal by Oct 19.
Netlab
• Available at http://www.ncrg.aston.ac.uk/netlab/
• The latest release of Netlab includes the following algorithms:
- PCA
- Mixtures of probabilistic PCA
- Gaussian mixture model with EM training algorithm
- Linear and logistic regression with IRLS training algorithm
- Multi-layer perceptron with linear, logistic and softmax outputs and appropriate error functions
- Radial basis function (RBF) networks with both Gaussian and non-local basis functions
- Optimizers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients
- Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)
- Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters
- Laplace approximation framework for Bayesian inference (evidence procedure)
- Automatic Relevance Determination for input selection
- Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo
- K-nearest neighbor classifier
- K-means clustering
- Generative Topographic Map
- Neuroscale topographic projection
- Gaussian Processes
- Hinton diagrams for network weights
- Self-organizing map
• Note that the code adopts a row-based convention: each data point is stored as a row vector
Supervised Learning
• Linear classifier: Linear regression, logistic regression, Fisher linear discriminant
• Nonlinear classifier: linear/nonlinear support vector machine, kernel principal component analysis, kernel discriminant analysis
• Ensemble of classifiers: AdaBoost, bagging, ensemble of homogeneous/heterogeneous classifiers
Regression
• Hypothesis class: Want to find out the relationship between the input data x and desired output y, parameterized by a function f
• Estimation: Collect a training set of examples and labels, {(x1, y1), . . . , (xN, yN)}, and find the best estimate fˆ.
• Evaluation: Measure how well fˆ generalizes to unseen data, i.e., whether fˆ(x0) agrees with y0.
Hypotheses and Estimation
• Want to find a simple linear classifier to solve classification problems
• For example, given a set of male and female images, we want to find a classifier
ŷ = f(x; θ) = sign(θ · x) (1)

where θ are the parameters of the function, x is an image, and ŷ ∈ {−1, 1}.
• Learn θ from a training set
Loss(y, ŷ) = 0 if y = ŷ, 1 if y ≠ ŷ (2)

and minimize the empirical error

(1/N) Σ_{i=1}^N Loss(yi, ŷi) = (1/N) Σ_{i=1}^N Loss(yi, f(xi; θ)) (3)
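As a concrete illustration, here is a minimal sketch of evaluating the empirical 0-1 loss of Eq. (3) for a linear classifier. It assumes numpy; the toy data and the `empirical_loss` helper are made up for this example.

```python
import numpy as np

def empirical_loss(theta, X, y):
    """Empirical 0-1 loss: (1/N) * sum_i Loss(y_i, sign(theta . x_i))."""
    y_hat = np.sign(X @ theta)      # predicted labels in {-1, +1}
    return np.mean(y_hat != y)      # fraction of misclassified examples

# Toy data whose labels come from a known linear rule
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, 0.5]])
theta_true = np.array([1.0, 1.0])
y = np.sign(X @ theta_true)

print(empirical_loss(theta_true, X, y))              # 0.0: the true rule fits perfectly
print(empirical_loss(np.array([-1.0, -1.0]), X, y))  # 1.0: the flipped rule misses every point
```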
Problem Formulation
• Posit the estimation problem as an optimization problem
• Given a training set, find θ that minimizes empirical loss:
(1/N) Σ_{i=1}^N Loss(yi, f(xi; θ))
- f can be linear or nonlinear
- yi can be discrete labels, or real valued, or other
• Model complexity and regularization
• Why do we minimize empirical loss (from training set)? After all, we are interested in the performance for new, unseen samples.
Training and Test Performance: Sampling
• Assume that each training and test example-label pair (x, y) is drawn independently from the same but unknown distribution of examples and labels.
• Represent as a joint distribution p(x, y) so that each training/test example is a sample from this distribution, (xi, yi) ∼ p.
Empirical (training) loss = (1/N) Σ_{i=1}^N Loss(yi, f(xi; θ))
Expected (test) loss = E_{(x,y)∼p}[Loss(y, f(x; θ))] (4)
• Training loss is based on the set of sampled examples and labels
• It therefore only provides an approximate estimate of the test performance, which is measured over the entire distribution
• Model complexity and regularization
• Will come back to this topic later
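The sampling view above can be checked numerically: as the sample size grows, the empirical loss of a fixed model approaches its expected loss. A minimal sketch, assuming numpy and a synthetic Gaussian data distribution invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
sigma = 0.5   # noise standard deviation

def sample(n):
    """Draw n example-label pairs (x, y) i.i.d. from a fixed p(x, y)."""
    X = rng.normal(size=(n, 2))
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    return X, y

# Squared-error loss of the true model; its expected loss is sigma^2 = 0.25.
# The empirical loss over a sample approaches this value as n grows.
for n in (10, 1000, 100000):
    X, y = sample(n)
    print(n, np.mean((y - X @ w_true) ** 2))
```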
Linear Regression
f : R → R, f(x; w) = w1 x
f : R^d → R, f(x; w) = w1 x1 + w2 x2 + . . . + wd xd (5)
• Given pairs of points (xi, yi), we want to find w that minimizes the prediction error
• Measure the prediction loss in terms of squared error, Loss(y, ŷ) = (y − ŷ)².
• The empirical loss over all N training samples is

E(w) = (1/N) Σ_{i=1}^N (yi − f(xi; w))² (6)
• In matrix form,

y = Xw (7)

    [ y1 ]   [ x1^T ]
    [ y2 ] = [ x2^T ] w (8)
    [ ... ]  [  ...  ]
    [ yN ]   [ xN^T ]

where y is N×1, each xi is d×1, X is N×d, and w is d×1.
• X is a skinny matrix, i.e., N > d.
• Over-constrained set of linear equations (more equations than unknowns)
• Solve y = Xw approximately
- Let r = Xw − y be the residual or error
- Find w that minimizes ||r||
- Equivalently, find the least squares solution w = wls that minimizes ||r||
- Linear relation between X and y
- Interpret w as a projection
Least Squares
• To find w, minimize norm of residual error
||r||² = ||Xw − y||² (9)

• Take the derivative w.r.t. w and set it to zero:

∂/∂w ||Xw − y||² = ∂/∂w (Xw − y)^T (Xw − y)
                 = 2 X^T (Xw − y)
                 = 2 (X^T X w − X^T y) = 0 (10)
• Assuming X^T X is invertible, we have

wls = (X^T X)^{-1} X^T y (11)

• Very famous and useful formula.
• X† = (X^T X)^{-1} X^T is called the pseudo-inverse of X.
• X† is a left inverse of X:

X† X = (X^T X)^{-1} X^T X = I (12)
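A minimal numpy sketch of the least squares formula on a small overdetermined system (the data here is synthetic, made up for this example):

```python
import numpy as np

# Overdetermined system: N = 6 equations, d = 2 unknowns (skinny X)
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
w_true = np.array([3.0, -2.0])
y = X @ w_true + 0.01 * rng.normal(size=6)   # small residual, so no exact solution

# Closed-form least squares via the pseudo-inverse (assumes X^T X invertible)
w_ls = np.linalg.inv(X.T @ X) @ X.T @ y

# numpy's built-in solver gives the same answer and is numerically preferable
w_np, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ls, w_np)   # both close to [3, -2]
```

In practice `np.linalg.lstsq` (or a QR/SVD factorization) is preferred over forming (X^T X)^{-1} explicitly, which squares the condition number of X.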
General Form
• Prediction given by a linear combination of basis functions

f(x, w) = Σ_{i=0}^M wi φi(x) = w^T φ(x) (13)

• Example: polynomial

f(x, w) = w0 + w1 x + w2 x² + . . . + wM x^M (14)

so that the basis functions are

φi(x) = x^i (15)
• Minimize the sum-of-squares error function

E(w) = (1/2) Σ_{i=1}^N (w^T φ(xi) − yi)² (16)

• Linear in the parameters, nonlinear in the inputs
• Solution as before:

w = (Φ^T Φ)^{-1} Φ^T y (17)

where Φ is the design matrix given by

    [ φ0(x1) . . . φM(x1) ]
Φ = [ φ0(x2) . . . φM(x2) ]
    [   ...         ...   ]
    [ φ0(xN) . . . φM(xN) ] (18)
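A sketch of the polynomial case in numpy, using `np.vander` to build the design matrix (the `design_matrix` helper and the toy data are invented for this example):

```python
import numpy as np

def design_matrix(x, M):
    """Design matrix with entries Phi[n, i] = phi_i(x_n) = x_n**i, i = 0..M."""
    return np.vander(x, M + 1, increasing=True)

# Fit a degree-2 polynomial to noiseless samples of y = 1 + 2x - x^2
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - x ** 2
Phi = design_matrix(x, 2)

# w = (Phi^T Phi)^{-1} Phi^T y, as in Eq. (17)
w = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y
print(w)   # recovers the coefficients [1, 2, -1]
```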
Polynomial Regression
Complexity and over-fitting [Jaakkola 04]
• Reality: The polynomial regression model may achieve zero empirical (training) error but nevertheless has a large test (generalization) error.
train: (1/N) Σ_{t=1}^N (yt − f(xt; w))² ≈ 0
test: E_{(x,y)∼p} (y − f(x; w))² ≫ 0 (19)

• May suffer from over-fitting when the training error no longer bears any relation to the generalization error.
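The gap between training and test error can be seen directly in a small experiment. The sketch below (assuming numpy; the sine-curve data and helper functions are made up for this example) fits polynomials of increasing degree to 10 noisy points:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_poly(x, y, M):
    """Least squares fit of a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def mse(x, y, w):
    return np.mean((y - np.vander(x, len(w), increasing=True) @ w) ** 2)

# 10 noisy training samples of a sine curve, plus a large held-out test set
x_tr = rng.uniform(-1, 1, 10)
y_tr = np.sin(np.pi * x_tr) + 0.2 * rng.normal(size=10)
x_te = rng.uniform(-1, 1, 1000)
y_te = np.sin(np.pi * x_te) + 0.2 * rng.normal(size=1000)

for M in (1, 3, 9):
    w = fit_poly(x_tr, y_tr, M)
    print(M, mse(x_tr, y_tr, w), mse(x_te, y_te, w))
# A degree-9 polynomial interpolates the 10 training points (train error ~ 0),
# yet its test error is typically far worse than the degree-3 fit's.
```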
Avoid Over-Fitting: Cross Validation
• Cross validation allows us to estimate the generalization error based on training examples alone
• Leave-one-out cross validation treats each training example in turn as a test example
CV = (1/N) Σ_{i=1}^N (yi − f(xi; w−i))² (20)

where w−i are the least squares estimates of the parameters without the i-th example.
Logistic Regression
• Model the function and noise
Observed value = function + noise
y = f(x; w) + ε (21)

where, e.g., ε ∼ N(0, σ²).
• View regression in a probabilistic sense

y = Xw + e, e ∼ N(0, σ²I) (22)

• Assume training examples are generated from a model in this class with unknown parameters w∗.

y = Xw∗ + e, e ∼ N(0, σ²I) (23)
• Want to estimate w∗.
• Read Jaakkola lecture notes (2-6) for details.
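The generative view in Eq. (23) can be simulated: draw data from a known w∗ with Gaussian noise and check that least squares recovers it. A minimal sketch, assuming numpy (w∗, σ, and the sample size are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(4)
w_star = np.array([1.5, -0.5, 2.0])
sigma = 0.3

# Generate training data from the model y = X w* + e, e ~ N(0, sigma^2 I)
X = rng.normal(size=(200, 3))
y = X @ w_star + rng.normal(scale=sigma, size=200)

# Under Gaussian noise, the least squares estimate is also the
# maximum-likelihood estimate of the unknown w*
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_hat)   # close to [1.5, -0.5, 2.0]
```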
Fisher Linear Discriminant
Read Jaakkola lecture notes 5 and the assigned paper: Fisherfaces vs. Eigenfaces.