**Advanced Topics in Learning and Vision**

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

Lecture 5 (draft)

**Overview**

• Linear regression

• Logistic regression

• Linear classifier

• Fisher linear discriminant

• Support vector machine

• Kernel PCA

• Kernel discriminant analysis

• Relevance vector machine

**Announcements**

• More course material available on the course web page

• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap

• Project proposal:

- Send me one page project proposal by Oct 19.


**Netlab**

• Available at http://www.ncrg.aston.ac.uk/netlab/

• The latest release of Netlab includes the following algorithms:

- PCA

- Mixtures of probabilistic PCA

- Gaussian mixture model with EM training algorithm

- Linear and logistic regression with IRLS training algorithm

- Multi-layer perceptron with linear, logistic and softmax outputs and appropriate error functions

- Radial basis function (RBF) networks with both Gaussian and non-local basis functions

- Optimizers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients

- Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)

- Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters

- Laplace approximation framework for Bayesian inference (evidence procedure)

- Automatic Relevance Determination for input selection

- Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo

- K-nearest neighbor classifier

- K-means clustering

- Generative Topographic Map

- Neuroscale topographic projection

- Gaussian Processes

- Hinton diagrams for network weights

- Self-organizing map

• Note that the code adopts a row-based convention: each row of a data matrix is one sample


**Supervised Learning**

• Linear classifier: Linear regression, logistic regression, Fisher linear discriminant

• Nonlinear classifier: linear/nonlinear support vector machine, kernel principal component analysis, kernel discriminant analysis

• Ensemble of classifiers: AdaBoost, bagging, ensembles of homogeneous/heterogeneous classifiers

**Regression**

• Hypothesis class: Want to find out the relationship between the input data x and desired output y, parameterized by a function f

• Estimation: Collect a training set of examples and labels,
{(x_{1}, y_{1}), . . . , (x_{N}, y_{N})}, and find the best estimate fˆ.

• Evaluation: Measure how well fˆ generalizes to unseen data, i.e., whether
fˆ(x′) agrees with y′.


**Hypotheses and Estimation**

• Want to find a simple linear classifier to solve classification problems

• For example, given a set of male and female images, we want to find a classifier

ŷ = f (x; θ) = sign(θ · x) (1)

where θ are the parameters of the function, x is an image, and ŷ ∈ {−1, 1}.

• Learn θ from a training set using the 0-1 loss

Loss(y, ŷ) = 0 if y = ŷ, 1 if y ≠ ŷ (2)

and minimize the empirical error

(1/N) Σ_{i=1}^{N} Loss(y_{i}, ŷ_{i}) = (1/N) Σ_{i=1}^{N} Loss(y_{i}, f (x_{i}; θ)) (3)
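The classifier of Eq. (1), the 0-1 loss of Eq. (2), and the empirical error of Eq. (3) can be sketched in a few lines of NumPy. The data, the parameter vector θ, and the noise level below are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # rows are examples x_i
theta = np.array([1.0, -2.0])                 # a fixed parameter vector
# noisy labels in {-1, +1}, generated from the same direction as theta
y = np.sign(X @ theta + 0.5 * rng.normal(size=100))

y_hat = np.sign(X @ theta)                    # Eq. (1): y_hat = sign(theta . x)
losses = (y != y_hat).astype(float)           # Eq. (2): 0-1 loss per example
empirical_error = losses.mean()               # Eq. (3): average loss over N
```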

**Problem Formulation**

• Posit the estimation problem as an optimization problem

• Given a training set, find θ that minimizes the empirical loss

(1/N) Σ_{i=1}^{N} Loss(y_{i}, f (x_{i}; θ))

- f can be linear or nonlinear

- y_{i} can be discrete labels, or real valued, or other

• Model complexity and regularization

• Why do we minimize empirical loss (from training set)? After all, we are interested in the performance for new, unseen samples.


**Training and Test Performance: Sampling**

• Assume that each training and test example-label pair (x, y) is drawn
independently from the same but unknown distribution of examples and
labels.

• Represent it as a joint distribution p(x, y) so that each training/test example
is a sample from this distribution, (x_{i}, y_{i}) ∼ p.

Empirical (training) loss = (1/N) Σ_{i=1}^{N} Loss(y_{i}, f (x_{i}; θ))

Expected (test) loss = E_{(x,y)∼p}{Loss(y, f (x; θ))} (4)

• The training loss is computed from the sampled examples and labels; it serves as an approximate estimate of the test performance measured over the entire distribution

• Model complexity and regularization

• Will come back to this topic later
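The gap between empirical and expected loss in Eq. (4) can be illustrated by sampling: a small training set gives the empirical loss, while a very large sample from the same p(x, y) approximates the expectation. The distribution, dimensions, and sample sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n pairs (x, y) from a hypothetical joint distribution p(x, y)."""
    x = rng.normal(size=(n, 2))
    y = np.sign(x @ np.array([2.0, -1.0]) + rng.normal(size=n))
    return x, y

theta = np.array([2.0, -1.0])                 # fixed classifier parameters
X_tr, y_tr = sample(20)                       # small training sample
X_te, y_te = sample(100_000)                  # large sample approximates E_{(x,y)~p}
train_loss = np.mean(y_tr != np.sign(X_tr @ theta))   # empirical (training) loss
test_loss = np.mean(y_te != np.sign(X_te @ theta))    # Monte Carlo expected loss
```

With only 20 training points the two numbers can differ noticeably, which is exactly why training loss alone is an unreliable guide.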

**Linear Regression**

f : R → R, f (x; w) = w_{1}x

f : R^{d} → R, f (x; w) = w_{1}x_{1} + w_{2}x_{2} + . . . + w_{d}x_{d} (5)

• Given pairs of points (x_{i}, y_{i}), we want to find w that minimizes the prediction error

• Measure the prediction loss in terms of square error, Loss(y, yˆ) = (y − ˆy)^{2}.

• The empirical loss over all N training samples is

E(w) = (1/N) Σ_{i=1}^{N} (y_{i} − f (x_{i}; w))^{2} (6)


• In matrix form,

y = Xw (7)

[y_{1} y_{2} . . . y_{N}]^{T} = [x_{1} x_{2} . . . x_{N}]^{T} w (8)

where y is N × 1, each x_{i} is d × 1, X is N × d, and w is d × 1.

• X is a skinny matrix, i.e., N > d.

• Over-constrained set of linear equations (more equations than unknowns)

• Solve y = Xw *approximately*

- Let r = Xw − y be the residual or error

- Find w = w_{ls} that minimizes ||r||

- Linear relation between X and y

- Interpret w as a projection


**Least Squares**

• To find w, minimize norm of residual error

||r||^{2} = ||Xw − y||^{2} (9)

• Take the derivative w.r.t. w and set it to zero:

∂/∂w ||Xw − y||^{2} = ∂/∂w (Xw − y)^{T}(Xw − y)

= 2X^{T}(Xw − y)

= 2(X^{T}Xw − X^{T}y) = 0 (10)

• Assuming X^{T}X is invertible, we have

w_{ls} = (X^{T}X)^{−1}X^{T}y (11)

• Very famous and useful formula.


• X^{†} = (X^{T}X)^{−1}X^{T} *is called the pseudo inverse of* X.

• X^{†} *is a left inverse of* X:

X^{†}X = (X^{T}X)^{−1}X^{T}X = I (12)
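A minimal sketch of Eq. (11) on synthetic data; the dimensions, the ground-truth weights, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3                                  # skinny X: N > d
X = rng.normal(size=(N, d))
w_true = np.array([0.5, -1.0, 2.0])           # hypothetical ground truth
y = X @ w_true + 0.01 * rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)              # assumes X^T X is invertible
w_ls = XtX_inv @ X.T @ y                      # Eq. (11): (X^T X)^{-1} X^T y
# In practice np.linalg.lstsq is the numerically safer route:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes agree here; forming (X^{T}X)^{−1} explicitly is shown only to mirror the formula.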


**General Form**

• Prediction given by linear combination of basis function

f (x, w) = Σ_{i=0}^{M} w_{i}φ_{i}(x) = w^{T}φ(x) (13)

• Example: polynomial

f (x, w) = w_{0} + w_{1}x + w_{2}x^{2} + . . . + w_{M}x^{M} (14)
so that the basis functions are

φ_{i}(x) = x^{i} (15)

• Minimize the sum-of-squares error function

E(w) = (1/2) Σ_{i=1}^{N} (w^{T}φ(x_{i}) − y_{i})^{2} (16)

• Linear in the parameters, nonlinear in the inputs

• Solution as before:

w = (Φ^{T}Φ)^{−1}Φ^{T}y (17)

where Φ is the N × (M + 1) design matrix with entries Φ_{ij} = φ_{j}(x_{i}):

φ_{0}(x_{1}) . . . φ_{M}(x_{1})
φ_{0}(x_{2}) . . . φ_{M}(x_{2})
... ... ...
φ_{0}(x_{N}) . . . φ_{M}(x_{N}) (18)
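Polynomial regression via the design matrix of Eqs. (13)-(18) can be sketched as follows; the target function, noise level, and degree are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)   # hypothetical noisy target

M = 3                                          # polynomial degree
Phi = np.vander(x, M + 1, increasing=True)     # design matrix: Phi[i, j] = x_i ** j
w = np.linalg.lstsq(Phi, y, rcond=None)[0]     # solves Eq. (17) stably
train_mse = np.mean((Phi @ w - y) ** 2)        # training error of the fit
```

The model is linear in w even though the fitted curve is a nonlinear (cubic) function of x.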


**Polynomial Regression**

**Complexity and over-fitting [Jaakkola 04]**

• Reality: The polynomial regression model may achieve zero empirical (training) error but nevertheless has a large test (generalization) error.

train: (1/N) Σ_{t=1}^{N} (y_{t} − f (x_{t}; w))^{2} ≈ 0

test: E_{(x,y)∼p}(y − f (x; w))^{2} ≫ 0 (19)

• The model may suffer from over-fitting when the training error no longer bears any relation to the generalization error.


**Avoid Over-Fitting: Cross Validation**

• Cross validation allows us to estimate the generalization error based on training examples alone

• Leave-one-out cross validation treats each training example in turn as a test example

CV = (1/N) Σ_{i=1}^{N} (y_{i} − f (x_{i}; w_{−i}))^{2} (20)
where w_{−i} are the least squares estimates of the parameters without the
i-th example.
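Eq. (20) translates directly into a loop that refits the model N times, each time holding out one example. The data and the cubic basis below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 25
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)   # hypothetical data
Phi = np.vander(x, 4, increasing=True)                 # cubic polynomial basis

errs = []
for i in range(N):
    mask = np.arange(N) != i                           # drop the i-th example
    w_minus_i = np.linalg.lstsq(Phi[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - Phi[i] @ w_minus_i) ** 2)      # squared error on held-out point
cv_error = np.mean(errs)                               # Eq. (20)
```

Repeating this over several polynomial degrees and picking the degree with the smallest CV error is the standard way to use this estimate for model selection.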

**Logistic Regression**

• Model the function and noise

Observed value = function + noise

y = f (x; w) + ε (21)

where, e.g., ε ∼ N (0, σ^{2}).

• View regression in a probabilistic sense

y = Xw + e, e ∼ N (0, σ^{2}I) (22)

• Assume training examples are generated from a model in this class with
unknown parameters w^{∗}.

y = Xw^{∗} + e, e ∼ N (0, σ^{2}I) (23)

• Want to estimate w^{∗}.

• Read Jaakkola lecture notes (2-6) for details.
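The generative view of Eq. (23) can be sketched by drawing data from a model with hidden parameters w* and recovering them by least squares; all the concrete numbers here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma = 200, 3, 0.5
X = rng.normal(size=(N, d))
w_star = np.array([1.0, -2.0, 0.5])            # "unknown" true parameters
e = sigma * rng.normal(size=N)                 # e ~ N(0, sigma^2 I)
y = X @ w_star + e                             # Eq. (23)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares estimate of w*
```

With enough samples relative to the noise level, w_hat lands close to w*, which is the sense in which least squares estimates the generating parameters.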


**Fisher Linear Discriminant**

Read Jaakkola lecture notes 5 and the assigned paper: Fisherfaces vs.

Eigenfaces.