(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

Lecture 5 (draft)

(2)

Overview

• Linear regression

• Logistic regression

• Linear classifier

• Fisher linear discriminant

• Support vector machine

• Kernel PCA

• Kernel discriminant analysis

• Relevance vector machine

(3)

Announcements

• More course material available on the course web page

• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap

• Project proposal:

- Send me a one-page project proposal by Oct 19.


(4)

Netlab

• Available at http://www.ncrg.aston.ac.uk/netlab/

• The latest release of Netlab includes the following algorithms:

- PCA

- Mixtures of probabilistic PCA

- Gaussian mixture model with EM training algorithm

- Linear and logistic regression with IRLS training algorithm

- Multi-layer perceptron with linear, logistic and softmax outputs and appropriate error functions

- Radial basis function (RBF) networks with both Gaussian and non-local basis functions

- Optimizers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients

- Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)

- Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters

- Laplace approximation framework for Bayesian inference (evidence procedure)

(5)

- Automatic Relevance Determination for input selection

- Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo

- K-nearest neighbor classifier

- K-means clustering

- Generative Topographic Map

- Neuroscale topographic projection

- Gaussian Processes

- Hinton diagrams for network weights

- Self-organizing map

• Note that the code adopts a row-based convention: each row of a data matrix is one example


(6)

Supervised Learning

• Linear classifier: Linear regression, logistic regression, Fisher linear discriminant

• Nonlinear classifier: linear/nonlinear support vector machine, kernel principal component analysis, kernel discriminant analysis

• Ensemble of classifiers: AdaBoost, bagging, ensemble of homogeneous/heterogeneous classifiers

(7)

Regression

• Hypothesis class: want to find the relationship between the input data x and the desired output y, expressed as a parameterized function f

• Estimation: collect a training set of examples and labels, {(x_1, y_1), . . . , (x_N, y_N)}, and find the best estimate f̂.

• Evaluation: measure how well f̂ generalizes to unseen data, i.e., whether f̂(x_0) agrees with y_0.


(8)

Hypotheses and Estimation

• Want to find a simple linear classifier to solve classification problems

• For example, given a set of male and female images, we want to find a classifier

ŷ = f(x; θ) = sign(θ · x)    (1)

where θ are the parameters of the function, x is an image, and ŷ ∈ {−1, 1}.

• Learn θ from a training set using the loss

Loss(y, ŷ) = { 0,  y = ŷ
             { 1,  y ≠ ŷ        (2)

and minimize the error

(1/N) Σ_{i=1}^N Loss(y_i, ŷ_i) = (1/N) Σ_{i=1}^N Loss(y_i, f(x_i; θ))    (3)
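As a concrete illustration of Eqs. (1)-(3), here is a minimal NumPy sketch; the toy data, the candidate parameter vector theta, and the helper names are made up for this example. It evaluates ŷ = sign(θ · x) on each training example and averages the 0/1 loss.

import numpy as np

def zero_one_loss(y, y_hat):
    # Loss(y, y_hat): 0 if the labels agree, 1 otherwise (Eq. 2)
    return (y != y_hat).astype(float)

def empirical_error(theta, X, y):
    # Average 0/1 loss of the linear classifier sign(theta . x) over the training set (Eq. 3)
    y_hat = np.sign(X @ theta)
    return zero_one_loss(y, y_hat).mean()

# Toy data, made up for illustration: N examples in d dimensions with labels in {-1, +1}
rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))
y = np.sign(X @ rng.normal(size=d))

theta = rng.normal(size=d)            # an arbitrary candidate parameter vector
print(empirical_error(theta, X, y))   # fraction of misclassified training examples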

(9)

Problem Formulation

• Posit the estimation problem as an optimization problem

• Given a training set, find θ that minimizes empirical loss:

(1/N) Σ_{i=1}^N Loss(y_i, f(x_i; θ))

- f can be linear or nonlinear

- y_i can be discrete labels, real valued, or of other types

• Model complexity and regularization

• Why do we minimize empirical loss (from the training set)? After all, we are interested in the performance on new, unseen samples.


(10)

Training and Test Performance: Sampling

• Assume that each training and test example-label pair (x, y) is drawn independently from the same but unknown distribution of examples and labels.

• Represent as a joint distribution p(x, y) so that each training/test example is a sample from this distribution, (xi, yi) ∼ p.

Empirical (training) loss = (1/N) Σ_{i=1}^N Loss(y_i, f(x_i; θ))

Expected (test) loss = E_{(x,y)∼p}{Loss(y, f(x; θ))}    (4)

• Training loss is based on the set of sampled examples and labels

• The training loss is only an approximate estimate of the test performance, which is measured over the entire distribution

• Model complexity and regularization

• Will come back to this topic later
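The following sketch illustrates the distinction in Eq. (4); the distribution p(x, y), the label-noise level, and the parameters are all synthetic and made up for illustration. A very large held-out sample serves as a Monte Carlo approximation of the expected loss.

import numpy as np

rng = np.random.default_rng(1)
d = 5
theta_star = rng.normal(size=d)

def sample(n):
    # Draw n (x, y) pairs i.i.d. from a synthetic p(x, y) with 10% label noise
    X = rng.normal(size=(n, d))
    y = np.sign(X @ theta_star)
    flip = rng.random(n) < 0.1
    y[flip] = -y[flip]
    return X, y

X_train, y_train = sample(50)          # small training sample
X_test, y_test = sample(100_000)       # large sample approximating E_{(x,y)~p}

theta = theta_star                     # pretend these parameters were learned
train_loss = np.mean(np.sign(X_train @ theta) != y_train)   # empirical (training) loss
test_loss = np.mean(np.sign(X_test @ theta) != y_test)      # approximately 0.1, the noise level
print(train_loss, test_loss)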

(11)

Linear Regression

f : R → R,     f(x; w) = w_1 x

f : R^d → R,   f(x; w) = w_1 x_1 + w_2 x_2 + . . . + w_d x_d    (5)

• Given pairs of points (x_i, y_i), want to find w that minimizes the prediction error

• Measure the prediction loss in terms of square error, Loss(y, ŷ) = (y − ŷ)².

• The empirical loss over all N training samples

E(w) = (1/N) Σ_{i=1}^N (y_i − f(x_i; w))²    (6)
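A short sketch of Eq. (6), again with made-up toy data, evaluating the empirical squared loss of a linear predictor f(x; w) = w · x.

import numpy as np

def empirical_squared_loss(w, X, y):
    # E(w) = (1/N) * sum_i (y_i - f(x_i; w))^2 with f(x; w) = w . x  (Eq. 6)
    return np.mean((y - X @ w) ** 2)

# Toy data, made up for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)
print(empirical_squared_loss(w_true, X, y))   # close to the noise variance 0.01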


(12)

• In matrix form,

y = Xw    (7)

[ y_1 ]     [ x_1^T ]
[ y_2 ]  =  [ x_2^T ]  w    (8)
[  ⋮  ]     [   ⋮   ]
[ y_N ]     [ x_N^T ]

where y is N×1, each x_i is d×1, X is N×d, and w is d×1.

• X is a skinny matrix, i.e., N > d.

• Over-constrained set of linear equations (more equations than unknowns)

• Solve y = Xw approximately

- Let r = Xw − y be the residual or error

- Find w that minimizes ||r||

- Equivalently, find w = w_ls that minimizes ||r||

(13)

- Linear relation between X and y

- Interpret w as projection


(14)

Least Squares

• To find w, minimize norm of residual error

||r||² = ||Xw − y||²    (9)

• Take derivative w.r.t. w and set to zero:

∂/∂w ||Xw − y||² = ∂/∂w (Xw − y)^T (Xw − y)
                 = 2 X^T (Xw − y)
                 = 2 (X^T X w − X^T y) = 0        (10)

• Assuming X^T X is invertible, we have

w_ls = (X^T X)^{-1} X^T y    (11)

• Very famous and useful formula.


(15)

• X† = (X^T X)^{-1} X^T is called the pseudo-inverse of X.

• X† is a left inverse of X:

X† X = (X^T X)^{-1} X^T X = I    (12)
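Eq. (11) in code: the sketch below, on made-up toy data, computes w_ls by solving the normal equations rather than forming (X^T X)^{-1} explicitly, and checks that NumPy's pseudo-inverse and least-squares routines give the same answer.

import numpy as np

# Toy over-constrained problem (N > d), made up for illustration
rng = np.random.default_rng(3)
N, d = 50, 4
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=N)

# Normal equations: solve X^T X w = X^T y instead of inverting X^T X explicitly  (Eq. 11)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The same answer via the pseudo-inverse and via a dedicated least-squares routine
w_pinv = np.linalg.pinv(X) @ y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_ls, w_pinv), np.allclose(w_ls, w_lstsq))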


(16)

General Form

• Prediction given by a linear combination of basis functions

f(x; w) = Σ_{i=0}^M w_i φ_i(x) = w^T φ(x)    (13)

• Example: polynomial

f(x; w) = w_0 + w_1 x + w_2 x² + . . . + w_M x^M    (14)

so that the basis functions are

φ_i(x) = x^i    (15)

• Minimize the sum-of-squares error function

E(w) = (1/2) Σ_{i=1}^N (w^T φ(x_i) − y_i)²    (16)

(17)

• Linear in the parameters, nonlinear in the inputs

• Solution as before

w = (Φ^T Φ)^{-1} Φ^T y    (17)

and Φ is the design matrix given by

      [ φ_0(x_1)  . . .  φ_M(x_1) ]
Φ  =  [ φ_0(x_2)  . . .  φ_M(x_2) ]    (18)
      [     ⋮       ⋱        ⋮    ]
      [ φ_0(x_N)  . . .  φ_M(x_N) ]
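The polynomial case of Eqs. (13)-(18) in a short sketch; the 1-D toy data and the helper names are invented for illustration. It builds the design matrix with φ_i(x) = x^i, solves for w by least squares, and predicts at new inputs.

import numpy as np

def design_matrix(x, M):
    # Row i contains [phi_0(x_i), ..., phi_M(x_i)] with phi_j(x) = x^j  (Eqs. 15, 18)
    return np.vander(x, M + 1, increasing=True)

# Toy 1-D regression problem, made up for illustration
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=30)

M = 3
Phi = design_matrix(x, M)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]    # w = (Phi^T Phi)^{-1} Phi^T y  (Eq. 17)

x_new = np.array([0.0, 0.5])
print(design_matrix(x_new, M) @ w)            # predictions f(x; w) = w^T phi(x)  (Eq. 13)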


(18)

Polynomial Regression

(19)

Complexity and over-fitting [Jaakkola 04]

• Reality: The polynomial regression model may achieve zero empirical (training) error but nevertheless has a large test (generalization) error.

train:  (1/N) Σ_{t=1}^N (y_t − f(x_t; w))² ≈ 0

test:   E_{(x,y)∼p} (y − f(x; w))² ≫ 0    (19)

• May suffer from over-fitting when the training error no longer bears any relation to the generalization error.
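A small sketch of this effect under an assumed synthetic data distribution (everything here is made up for illustration): with only 10 training points, a degree-9 polynomial can drive the training error to essentially zero while its test error, estimated on a large sample from the same distribution, is typically much larger.

import numpy as np

rng = np.random.default_rng(5)

def sample(n):
    # Draw (x, y) from the same synthetic distribution
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(np.pi * x) + 0.2 * rng.normal(size=n)

def train_and_test_error(M, x_tr, y_tr, x_te, y_te):
    # Fit a degree-M polynomial by least squares and return (training error, test error)
    w = np.linalg.lstsq(np.vander(x_tr, M + 1, increasing=True), y_tr, rcond=None)[0]
    train = np.mean((np.vander(x_tr, M + 1, increasing=True) @ w - y_tr) ** 2)
    test = np.mean((np.vander(x_te, M + 1, increasing=True) @ w - y_te) ** 2)
    return train, test

x_tr, y_tr = sample(10)          # only 10 training points
x_te, y_te = sample(10_000)      # large sample standing in for the test distribution

print(train_and_test_error(3, x_tr, y_tr, x_te, y_te))  # moderate training and test error
print(train_and_test_error(9, x_tr, y_tr, x_te, y_te))  # training error ~0; test error typically much larger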


(20)

Avoid Over-Fitting: Cross Validation

• Cross validation allows us to estimate the generalization error based on training examples alone

• Leave-one-out cross validation treats each training example in turn as a test example

CV = (1/N) Σ_{i=1}^N (y_i − f(x_i; w_{−i}))²    (20)

where w_{−i} are the least squares estimates of the parameters without the i-th example.
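A direct transcription of Eq. (20) for a linear model, using made-up toy data: refit the least-squares parameters N times, each time leaving one example out, and average the squared prediction errors.

import numpy as np

def loo_cv(X, y):
    # CV = (1/N) sum_i (y_i - f(x_i; w_{-i}))^2, refitting w without the i-th example  (Eq. 20)
    N = len(y)
    errors = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        w_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        errors[i] = (y[i] - X[i] @ w_minus_i) ** 2
    return errors.mean()

# Toy data, made up for illustration
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=30)
print(loo_cv(X, y))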

(21)

Logistic Regression

• Model the function and noise

Observed value = function + noise

y = f(x; w) + ε    (21)

where, e.g., ε ∼ N(0, σ²).

• View regression in a probabilistic sense

y = Xw + e,   e ∼ N(0, σ²I)    (22)

• Assume training examples are generated from a model in this class with unknown parameters w.

y = Xw + e,   e ∼ N(0, σ²I)    (23)

• Want to estimate w.

• Read Jaakkola lecture notes (2-6) for details.
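A sketch of this probabilistic view with made-up data: samples are generated from y = Xw + e with Gaussian noise as in Eqs. (22)-(23), and w is estimated by least squares, which coincides with the maximum-likelihood estimate under this noise model.

import numpy as np

# Generate data from y = Xw + e, e ~ N(0, sigma^2 I); all values made up for illustration
rng = np.random.default_rng(7)
N, d, sigma = 200, 3, 0.3
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + sigma * rng.normal(size=N)

# Under Gaussian noise, the maximum-likelihood estimate of w is the least-squares solution,
# and sigma^2 can be estimated from the mean squared residual.
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.mean((y - X @ w_hat) ** 2)
print(w_hat, sigma2_hat)   # close to w_true and sigma^2 = 0.09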


(22)

Fisher Linear Discriminant

Read Jaakkola lecture notes 5 and the assigned paper: Fisherfaces vs. Eigenfaces.
