
(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

Lecture 5 (draft)

(2)

Overview

• Linear regression

• Logistic regression

• Linear classifier

• Fisher linear discriminant

• Support vector machine

• Kernel PCA

• Kernel discriminant analysis

• Relevance vector machine

(3)

Announcements

• More course material available on the course web page

• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap

• Project proposal:

- Send me a one-page project proposal by Oct 19.


(4)

Netlab

• Available at http://www.ncrg.aston.ac.uk/netlab/

• The latest release of Netlab includes the following algorithms:

- PCA

- Mixtures of probabilistic PCA

- Gaussian mixture model with EM training algorithm

- Linear and logistic regression with IRLS training algorithm

- Multi-layer perceptron with linear, logistic and softmax outputs and appropriate error functions

- Radial basis function (RBF) networks with both Gaussian and non-local basis functions

- Optimizers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients

- Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)

- Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters

- Laplace approximation framework for Bayesian inference (evidence procedure)

(5)

- Automatic Relevance Determination for input selection

- Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo

- K-nearest neighbor classifier

- K-means clustering

- Generative Topographic Map

- Neuroscale topographic projection

- Gaussian processes

- Hinton diagrams for network weights

- Self-organizing map

• Note that the code adopts a row-based data convention: each data point is stored as a row vector.


(6)

Supervised Learning

• Linear classifiers: linear regression, logistic regression, Fisher linear discriminant

• Nonlinear classifiers: linear/nonlinear support vector machines, kernel principal component analysis, kernel discriminant analysis

• Ensembles of classifiers: AdaBoost, bagging, ensembles of homogeneous/heterogeneous classifiers

(7)

Regression

• Hypothesis class: want to model the relationship between the input data x and the desired output y with a parameterized function f

• Estimation: collect a training set of examples and labels, {(x1, y1), . . . , (xN, yN)}, and find the best estimate f̂.

• Evaluation: measure how well f̂ generalizes to unseen data, i.e., whether f̂(x0) agrees with y0.


(8)

Hypotheses and Estimation

• Want to find a simple linear classifier to solve classification problems

• For example, given a set of male and female images, we want to find a classifier

ŷ = f(x; θ) = sign(θ · x)   (1)

where θ are the parameters of the function, x is an image, and ŷ ∈ {−1, 1}.

• Learn θ from a training set using the 0/1 loss

Loss(y, ŷ) = 0 if y = ŷ,  1 if y ≠ ŷ   (2)

and minimize the average error

(1/N) ∑_{i=1}^N Loss(yi, ŷi) = (1/N) ∑_{i=1}^N Loss(yi, f(xi; θ))   (3)
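For illustration (not part of the original slides), here is a minimal NumPy sketch of Eqs. (1)-(3): a sign classifier and its empirical 0/1 loss. The data, parameter values, and function names are made up for the example.

import numpy as np

def empirical_01_loss(theta, X, y):
    """Average 0/1 loss of the classifier y_hat = sign(theta . x).

    X: N x d array, one example per row; y: labels in {-1, +1}.
    """
    y_hat = np.sign(X @ theta)        # Eq. (1)
    return np.mean(y_hat != y)        # Eqs. (2)-(3)

# toy usage with synthetic data
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = np.sign(X @ theta_true)                                 # labels produced by a "true" classifier
print(empirical_01_loss(theta_true, X, y))                  # 0.0: the true parameters make no mistakes
print(empirical_01_loss(np.array([1.0, 1.0, 1.0]), X, y))   # > 0 for a mismatched theta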

(9)

Problem Formulation

• Posit the estimation problem as an optimization problem

• Given a training set, find θ that minimizes the empirical loss:

(1/N) ∑_{i=1}^N Loss(yi, f(xi; θ))

- f can be linear or nonlinear

- yi can be discrete labels, or real valued, or other

• Model complexity and regularization

• Why do we minimize the empirical loss (computed from the training set)? After all, we are interested in the performance on new, unseen samples.


(10)

Training and Test Performance: Sampling

• Assume that each training and test example-label pair (x, y) is drawn independently from the same but unknown distribution of examples and labels.

• Represent as a joint distribution p(x, y) so that each training/test example is a sample from this distribution, (xi, yi) ∼ p.

Empirical (training) loss = (1/N) ∑_{i=1}^N Loss(yi, f(xi; θ))

Expected (test) loss = E_{(x,y)∼p}{Loss(y, f(x; θ))}   (4)

• Training loss is based on the set of sampled examples and labels

• It is an approximate estimate of the test performance measured over the entire distribution (see the sketch after this list)

• Model complexity and regularization

• Will come back to this topic later
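A small simulation makes the point concrete (added for illustration; the two-Gaussian distribution and the mean-difference classifier are assumptions of this sketch, not from the slides): the empirical loss is computed on the finite training sample, while the expected loss is approximated by a much larger fresh sample from the same p(x, y).

import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n i.i.d. (x, y) pairs: two Gaussian classes with labels in {-1, +1}."""
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, 2)) + 0.8 * y[:, None]   # class-dependent mean shift
    return X, y

def loss01(theta, X, y):
    return np.mean(np.sign(X @ theta) != y)

# "learn" theta from a small training set (difference of class means, a simple heuristic)
X_tr, y_tr = sample(30)
theta = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == -1].mean(axis=0)

train_loss = loss01(theta, X_tr, y_tr)     # empirical (training) loss
X_te, y_te = sample(200000)
test_loss = loss01(theta, X_te, y_te)      # Monte-Carlo estimate of the expected (test) loss

print(train_loss, test_loss)   # the two differ; the gap shrinks as the training set grows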

(11)

Linear Regression

f : R → R,  f(x; w) = w1x

f : R^d → R,  f(x; w) = w1x1 + w2x2 + . . . + wdxd   (5)

• Given pairs of points (xi, yi), we want to find w that minimizes the prediction error

• Measure the prediction loss in terms of the squared error, Loss(y, ŷ) = (y − ŷ)².

• The empirical loss over all N training samples is

E(w) = (1/N) ∑_{i=1}^N (yi − f(xi; w))²   (6)
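As a sketch (illustrative only; the function name is my own), Eq. (6) written directly in NumPy:

import numpy as np

def empirical_squared_loss(w, X, y):
    """E(w) = (1/N) * sum_i (yi - f(xi; w))^2 with f(x; w) = w . x, Eq. (6)."""
    residuals = y - X @ w            # yi - f(xi; w) for all i at once
    return np.mean(residuals ** 2)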


(12)

• In matrix form,

y = Xw   (7)

[ y1 ]   [ x1ᵀ ]
[ y2 ] = [ x2ᵀ ] w   (8)
[ ⋮  ]   [  ⋮  ]
[ yN ]   [ xNᵀ ]

where y is N × 1, each xi is d × 1, X is N × d, and w is d × 1.

• X is a skinny matrix, i.e., N > d.

• Over-constrained set of linear equations (more equations than unknowns)

• Solve y = Xw approximately

- Let r = Xw − y be the residual or error

- Find w that minimizes ||r||

- Equivalently, find the least-squares solution w = wls that minimizes ||r||

(13)

- Linear relation between X and y

- Interpret w as a projection


(14)

Least Squares

• To find w, minimize norm of residual error

||r||² = ||Xw − y||²   (9)

• Take derivative w.r.t. w and set to zero:

∂/∂w ||Xw − y||² = ∂/∂w (Xw − y)ᵀ(Xw − y)
                 = 2Xᵀ(Xw − y)
                 = 2(XᵀXw − Xᵀy) = 0   (10)

• Assuming XᵀX is invertible, we have

wls = (XᵀX)⁻¹ Xᵀ y   (11)

• Very famous and useful formula.


(15)

• X† = (XᵀX)⁻¹ Xᵀ is called the pseudo-inverse of X.

• X† is a left inverse of X:

X†X = (XᵀX)⁻¹ XᵀX = I   (12)
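A NumPy sketch of the least-squares solution in Eq. (11) (illustrative; the data are synthetic, and in practice a QR/SVD-based solver such as np.linalg.lstsq is preferred over forming XᵀX explicitly):

import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = rng.normal(size=(N, d))                    # skinny matrix: N > d
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy observations

# normal equations, Eq. (11): w_ls = (X^T X)^{-1} X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# equivalently via the pseudo-inverse, Eq. (12), when X has full column rank
w_pinv = np.linalg.pinv(X) @ y

# numerically stable reference solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_ls, w_pinv), np.allclose(w_ls, w_lstsq))   # True True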


(16)

General Form

• Prediction given by a linear combination of basis functions

f(x, w) = ∑_{i=0}^M wi φi(x) = wᵀφ(x)   (13)

• Example: polynomial

f(x, w) = w0 + w1x + w2x² + . . . + wM x^M   (14)

so that the basis functions are

φi(x) = x^i   (15)

• Minimize the sum-of-squares error function

E(w) = (1/2) ∑_{i=1}^N (wᵀφ(xi) − yi)²   (16)

(17)

• Linear in the parameters, nonlinear in the inputs

• Solution as before

w = (ΦᵀΦ)⁻¹ Φᵀ y   (17)

and Φ is the design matrix given by

    [ φ0(x1) . . . φM(x1) ]
Φ = [ φ0(x2) . . . φM(x2) ]   (18)
    [   ⋮      ⋱     ⋮    ]
    [ φ0(xN) . . . φM(xN) ]
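An illustrative sketch of Eqs. (13)-(18) for the polynomial basis φi(x) = x^i (the data, degree, and helper names are made up for the example):

import numpy as np

def design_matrix(x, M):
    """N x (M+1) design matrix with Phi[n, i] = phi_i(x_n) = x_n**i, Eq. (18)."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, y, M):
    """Least-squares weights w = (Phi^T Phi)^{-1} Phi^T y, Eq. (17)."""
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # stable alternative to the normal equations
    return w

# toy usage: fit a cubic to noisy samples of sin(2*pi*x)
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=20)
w = fit_polynomial(x, y, M=3)
y_hat = design_matrix(x, 3) @ w                   # predictions f(x, w) = w^T phi(x), Eq. (13)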


(18)

Polynomial Regression

(19)

Complexity and over-fitting [Jaakkola 04]

• Reality: The polynomial regression model may achieve zero empirical (training) error but nevertheless has a large test (generalization) error.

train: (1/N) ∑_{t=1}^N (yt − f(xt; w))² ≈ 0

test: E_{(x,y)∼p}(y − f(x; w))² ≫ 0   (19)

• May suffer from over-fitting, where the training error no longer bears any relation to the generalization error.


(20)

Avoid Over-Fitting: Cross Validation

• Cross validation allows us to estimate the generalization error based on training examples alone

• Leave-one-out cross validation treats each training example in turn as a test example

CV = (1/N) ∑_{i=1}^N (yi − f(xi; w−i))²   (20)

where w−i are the least-squares estimates of the parameters obtained without the i-th example.
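A leave-one-out cross-validation sketch following Eq. (20) (illustrative only; it refits the least-squares estimate N times on synthetic data and compares a few polynomial degrees):

import numpy as np

def loo_cv_error(X, y):
    """CV = (1/N) * sum_i (yi - f(xi; w_{-i}))^2, refitting without the i-th example."""
    N = len(y)
    errs = np.empty(N)
    for i in range(N):
        mask = np.arange(N) != i
        w_minus_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs[i] = (y[i] - X[i] @ w_minus_i) ** 2
    return errs.mean()

# toy usage: compare polynomial degrees by their LOO-CV error
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)
for M in (1, 3, 9):
    Phi = np.vander(x, M + 1, increasing=True)
    print(M, loo_cv_error(Phi, y))   # over-fitting degrees tend to give a larger CV error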

(21)

Logistic Regression

• Model the function and noise

Observed value = function + noise

y = f(x; w) + ε   (21)

where, e.g., ε ∼ N(0, σ²).

• View regression in a probabilistic sense

y = Xw + e,  e ∼ N(0, σ²I)   (22)

• Assume training examples are generated from a model in this class with unknown parameters w.

y = Xw + e,  e ∼ N(0, σ²I)   (23)

• Want to estimate w (a small sketch follows below).

• Read Jaakkola lecture notes (2-6) for details.
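A small simulation of the generative view in Eqs. (21)-(23) (illustrative; the true parameters and noise level are made up): data are generated as y = Xw + e with Gaussian noise, and the least-squares estimate, which coincides with the maximum-likelihood estimate under this noise model, recovers w up to noise.

import numpy as np

rng = np.random.default_rng(5)
N, d, sigma = 100, 3, 0.2
w_true = np.array([0.5, -1.0, 2.0])

X = rng.normal(size=(N, d))
e = sigma * rng.normal(size=N)             # e ~ N(0, sigma^2 I)
y = X @ w_true + e                         # Eqs. (22)-(23)

# under Gaussian noise, the ML estimate of w is the least-squares estimate
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ w_hat) ** 2)   # ML estimate of the noise variance

print(w_true, w_hat)                       # w_hat is close to w_true
print(sigma ** 2, sigma2_hat)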


(22)

Fisher Linear Discriminant

Read Jaakkola lecture notes 5 and the assigned paper: Fisherfaces vs. Eigenfaces.
