Christopher M. Bishop

(1)

Part 2: Unsupervised Learning

Machine Learning Techniques for Computer Vision

Microsoft Research Cambridge


(2)

Overview of Part 2

• Mixture models

• EM

• Variational Inference

• Bayesian model complexity

• Continuous latent variables

(3)

The Gaussian Distribution

• Multivariate Gaussian

• Maximum likelihood

mean covariance
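For reference, a standard way to write the multivariate Gaussian and its maximum likelihood estimates is sketched below. The symbols are assumptions rather than the slide's own notation: $\mathbf{x}$ is a $D$-dimensional vector, $\boldsymbol{\mu}$ the mean, $\boldsymbol{\Sigma}$ the covariance, and $\mathbf{x}_1, \dots, \mathbf{x}_N$ the data.

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}
    \exp\left\{ -\tfrac{1}{2}
      (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}

\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N}
  (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}
```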

(4)

Gaussian Mixtures

• Linear super-position of Gaussians

• Normalization and positivity require
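A Gaussian mixture and the constraints on its coefficients are usually written as below. The symbols $\pi_k$ (mixing coefficients), $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ and $K$ (number of components) are assumed notation.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1,
\qquad
\sum_{k=1}^{K} \pi_k = 1
```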

(5)

Example: Mixture of 3 Gaussians

[Figure: panels (a) and (b), axes from 0 to 1]

(6)

Maximum Likelihood for the GMM

• Log likelihood function

• Sum over components appears inside the log

– no closed form ML solution
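In assumed notation (data $\mathbf{x}_1, \dots, \mathbf{x}_N$, mixing coefficients $\pi_k$, component parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$), the log likelihood takes the form below. The inner sum over $k$ sits inside the logarithm, which is what prevents the stationarity conditions from decoupling into a closed-form solution.

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \,
      \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```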

(7)

EM Algorithm – Informal Derivation

(8)

EM Algorithm – Informal Derivation

• M step equations

(9)

EM Algorithm – Informal Derivation

• E step equation

(10)

EM Algorithm – Informal Derivation

• Can interpret the mixing coefficients as prior probabilities

• Corresponding posterior probabilities (responsibilities)
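The E and M steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the slides' own code: the function names, the `init_means` argument and the small `reg` term added to each covariance for numerical stability are all hypothetical choices.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(x | mean, cov) for each row of X."""
    D = X.shape[1]
    L = np.linalg.cholesky(cov)
    diff = np.linalg.solve(L, (X - mean).T)          # shape (D, N)
    maha = np.sum(diff ** 2, axis=0)                 # squared Mahalanobis distance
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + maha)

def em_gmm(X, K, n_iter=100, init_means=None, reg=1e-6, seed=0):
    """Fit a K-component Gaussian mixture by EM.

    Returns (pi, means, covs, log_liks); log_liks[i] is the log likelihood
    evaluated at the start of iteration i.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    if init_means is None:
        means = X[rng.choice(N, size=K, replace=False)].astype(float)
    else:
        means = np.array(init_means, dtype=float)
    base_cov = np.cov(X, rowvar=False).reshape(D, D) + reg * np.eye(D)
    covs = np.stack([base_cov.copy() for _ in range(K)])
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities are the posterior component probabilities.
        log_joint = np.stack(
            [np.log(pi[k]) + log_gaussian(X, means[k], covs[k]) for k in range(K)],
            axis=1)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        log_liks.append(log_norm.sum())
        # M step: re-estimate parameters from responsibility-weighted data.
        Nk = resp.sum(axis=0)
        pi = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
    return pi, means, covs, log_liks
```

The responsibilities are computed in log space so that points far from every component do not underflow to zero.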

(11)

Old Faithful Data Set

[Figure: axis label "Time between eruptions (minutes)"]

(12)

(13)

(14)

(15)

(16)

(17)

(18)

Latent Variable View of EM

• To sample from a Gaussian mixture:

– first pick one of the components with probability given by its mixing coefficient

– then draw a sample from that component

– repeat these two steps for each new data point
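The two-step sampling procedure can be sketched as below. The function name and arguments are hypothetical; `pi`, `means` and `covs` stand for the mixing coefficients and the component means and covariances.

```python
import numpy as np

def sample_gmm(pi, means, covs, n_samples, seed=0):
    """Draw samples by first choosing a component, then sampling from it."""
    rng = np.random.default_rng(seed)
    pi, means, covs = map(np.asarray, (pi, means, covs))
    # Step 1: pick a component index for every sample with probability pi[k].
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # Step 2: draw each sample from the Gaussian of its chosen component.
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z
```

Keeping `z` alongside `X` gives the complete data; discarding it leaves the incomplete data that EM has to work with.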

[Figure: panel (a), axis from 0 to 1]

(19)

Latent Variable View of EM

• Goal: given a data set, find

• Suppose we knew the colours

– maximum likelihood would involve fitting each component to the corresponding cluster

• Problem: the colours are latent (hidden) variables

(20)

Incomplete and Complete Data

[Figure: panel (a) complete data, panel (b) incomplete data; axes from 0 to 1]

(21)

Latent Variable Viewpoint

(22)

Latent Variable Viewpoint

• Binary latent variables describing which component generated each data point

• Conditional distribution of observed variable

• Prior distribution of latent variables

• Marginalizing over the latent variables we obtain
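In assumed notation, with $\mathbf{z}$ a $K$-dimensional binary vector in which exactly one element $z_k$ equals 1, these three distributions are commonly written as follows. Marginalizing recovers the original mixture density.

```latex
p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K}
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}

p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z})
  = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```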

[Figure labels: X, Z]

(23)

Graphical Representation of GMM

[Figure: directed graph with nodes π, z_n, x_n, μ and Σ, and a plate over the N data points]

(24)

Latent Variable View of EM

• Suppose we knew the values for the latent variables

– maximize the complete-data log likelihood

– trivial closed-form solution: fit each component to the corresponding set of data points

• We don’t know the values of the latent variables

– however, for given parameter values we can compute the expected values of the latent variables

(25)

Posterior Probabilities (colour coded)


(26)

Over-fitting in Gaussian Mixture Models

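• Infinities in the likelihood function when a component ‘collapses’ onto a data point: $\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \sigma_j^2 \mathbf{I}) \to \infty$ with $\boldsymbol{\mu}_j = \mathbf{x}_n$ and $\sigma_j \to 0$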

• Also, maximum likelihood cannot determine the number

K of components
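The collapse can be seen numerically in a small one-dimensional sketch. The data values and the function name are illustrative assumptions, not taken from the slides. One component stays broad while the second is centred on the first data point and its width shrinks.

```python
import numpy as np

def gmm_log_likelihood_1d(x, pi, mu, sigma):
    """Log likelihood of 1-D data under a Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]
    log_comp = (np.log(pi) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
                - 0.5 * ((x - mu) / sigma) ** 2)
    return np.logaddexp.reduce(log_comp, axis=1).sum()

x = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])   # illustrative data, not from the slides
pi = np.array([0.5, 0.5])
for s in [1e-2, 1e-4, 1e-8]:
    mu = np.array([x.mean(), x[0]])          # second component sits on x[0]
    sigma = np.array([x.std(), s])           # and its width shrinks towards zero
    ll = gmm_log_likelihood_1d(x, pi, mu, sigma)
    print(f"sigma_2 = {s:g}: log likelihood = {ll:.3f}")
```

The printed log likelihood keeps growing as the width shrinks, because the term for the point under the collapsing component grows like $-\ln \sigma$ while the broad component keeps every other term finite.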

(27)

Cross Validation

• Can select model complexity using an independent validation data set

• If data is scarce use cross-validation:

– partition data into S subsets

– train on S−1 subsets

– test on remainder

– repeat and average

• Disadvantages

– computationally expensive
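The S-fold procedure can be sketched generically as below. The `fit` and `score` callables are hypothetical placeholders for whatever model is being assessed, for example a mixture with a given number of components and its held-out log likelihood.

```python
import numpy as np

def cross_validate(X, S, fit, score, seed=0):
    """Average held-out score over S folds.

    fit(X_train) -> model and score(model, X_test) -> float are
    user-supplied (hypothetical) callables.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)    # partition into S subsets
    scores = []
    for i in range(S):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(S) if j != i])
        model = fit(X[train_idx])                         # train on S-1 subsets
        scores.append(score(model, X[test_idx]))          # test on the remainder
    return float(np.mean(scores))                         # repeat and average
```

Each candidate complexity needs S separate training runs, which is where the computational expense comes from.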

(28)

Bayesian Mixture of Gaussians

• Parameters and latent variables appear on equal footing

• Conjugate priors

[Figure: directed graph with nodes π, z_n, x_n, Λ and μ, and a plate over the N data points]

(29)

Data Set Size

• Problem 1: learn the function

for from 100 (slightly) noisy examples

– data set is computationally small but statistically large

• Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images

– data set is computationally large but statistically small

• Bayesian inference

– computationally more demanding than ML or MAP

(30)

Variational Inference

• Exact Bayesian inference intractable

• Markov chain Monte Carlo

– computationally expensive

– issues of convergence

• Variational Inference

– broadly applicable deterministic approximation

– let denote all latent variables and parameters

– approximate true posterior using a simpler distribution

– minimize Kullback-Leibler divergence

(31)

General View of Variational Inference

• For arbitrary

where
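The decomposition is commonly written as follows. The notation is an assumption: $\mathbf{X}$ denotes the observed data, $\mathbf{Z}$ all latent variables and parameters, and $q(\mathbf{Z})$ an arbitrary distribution.

```latex
\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)

\mathcal{L}(q) = \int q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} \right\} d\mathbf{Z},
\qquad
\mathrm{KL}(q \,\|\, p) = - \int q(\mathbf{Z}) \ln \left\{ \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})} \right\} d\mathbf{Z}
```

Since $\mathrm{KL}(q \,\|\, p) \ge 0$, the functional $\mathcal{L}(q)$ is a lower bound on $\ln p(\mathbf{X})$. Maximizing the bound with respect to $q$ is equivalent to minimizing the KL divergence to the true posterior.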

(32)

Variational Lower Bound

(33)

Factorized Approximation

• Goal: choose a family of q distributions which are:

– sufficiently flexible to give good approximation

– sufficiently simple to remain tractable

• Here we consider factorized distributions

• No further assumptions are required!

• Optimal solution for one factor, keeping the remainder fixed
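In assumed notation, with the latent variables and parameters $\mathbf{Z}$ split into disjoint groups $\mathbf{Z}_i$, the factorized family and the standard result for the optimal factor read as follows. $\mathbb{E}_{i \ne j}$ denotes an expectation with respect to all factors except $q_j$.

```latex
q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i)

\ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \ne j}\left[ \ln p(\mathbf{X}, \mathbf{Z}) \right] + \text{const}
```

Because each factor's solution depends on expectations under the others, the factors are updated in turn until the bound stops improving.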

(34)

[Figure: panel (a), axes x_1 and x_2, each from 0 to 1]

(35)

Lower Bound

• Can also be evaluated

• Useful for maths/code verification

• Also useful for model comparison:

(36)

Illustration: Univariate Gaussian

• Likelihood function

• Conjugate prior

• Factorized variational distribution
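The alternating updates for this example can be sketched as below. The model is an assumed, standard choice, since the slide's equations are not shown: likelihood $x_n \sim \mathcal{N}(\mu, \tau^{-1})$, prior $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \mathrm{Gam}(a_0, b_0)$ in the shape-rate convention, with factorization $q(\mu, \tau) = q(\mu)\, q(\tau)$. The function name and default hyperparameters are hypothetical.

```python
import numpy as np

def vi_univariate_gaussian(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate ascent for q(mu) = N(mu_N, 1/lambda_N), q(tau) = Gam(a_N, b_N)."""
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), x.mean()
    # These two quantities do not change between iterations.
    mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
    a_N = a0 + (N + 1) / 2.0
    e_tau = a0 / b0                        # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current E[tau].
        lambda_N = (lambda0 + N) * e_tau
        # Update q(tau) given the current first and second moments of q(mu).
        e_sq = np.sum((x - mu_N) ** 2) + N / lambda_N
        e_prior = lambda0 * ((mu_N - mu0) ** 2 + 1.0 / lambda_N)
        b_N = b0 + 0.5 * (e_sq + e_prior)
        e_tau = a_N / b_N
    lambda_N = (lambda0 + N) * e_tau       # make q(mu) consistent with final E[tau]
    return mu_N, lambda_N, a_N, b_N
```

Only `lambda_N` and `b_N` are coupled, so the iteration reduces to a fixed-point update on the expected precision `a_N / b_N`.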

(37)

Initial Configuration

[Figure: panel (a), axes µ and τ]

(38)

After Updating

[Figure: panel (b), axes µ and τ]

(39)

After Updating

[Figure: panel (c), axes µ and τ]

(40)

Converged Solution

[Figure: panel (d), axes µ and τ]

(41)

Variational Mixture of Gaussians

• Assume factorized posterior distribution

• No other approximations needed!

(42)

Variational Equations for GMM

(43)

Lower Bound for GMM

(44)

VIBES

Bishop, Spiegelhalter and Winn (2002)

(45)

ML Limit

• If instead we choose

we recover the maximum likelihood EM algorithm

(46)

Bound vs. K for Old Faithful Data

(47)

Bayesian Model Complexity

(48)

Sparse Bayes for Gaussian Mixture

• Corduneanu and Bishop (2001)

• Start with large value of K

– treat mixing coefficients as parameters

– maximize marginal likelihood

– prunes out excess components

(49)

(50)

(51)

Summary: Variational Gaussian Mixtures

• Simple modification of maximum likelihood EM code

• Small computational overhead compared to EM

• No singularities

• Automatic model order selection

(52)

Continuous Latent Variables

• Conventional PCA

– data covariance matrix

– eigenvector decomposition

• Minimizes sum-of-squares projection error

– not a probabilistic model

– how should we choose L ?
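Conventional PCA via the eigenvector decomposition of the data covariance matrix can be sketched as below. The function name and return values are hypothetical choices.

```python
import numpy as np

def pca(X, L):
    """Project data onto the L leading eigenvectors of the data covariance."""
    mean = X.mean(axis=0)
    S = np.cov(X - mean, rowvar=False, bias=True)   # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:L]           # indices of the L largest
    U = eigvecs[:, order]                           # D x L principal directions
    Z = (X - mean) @ U                              # N x L projected coordinates
    X_rec = mean + Z @ U.T                          # reconstruction in data space
    return U, eigvals[order], Z, X_rec
```

The total squared reconstruction error equals N times the sum of the discarded eigenvalues, which is the sense in which PCA minimizes the sum-of-squares projection error.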

[Figure: data point x_n and its projection x̃_n onto the principal direction u_1, axes x_1 and x_2]

(53)

Probabilistic PCA

• Tipping and Bishop (1998)

• L dimensional continuous latent space

• D dimensional data space

[Figure: latent variable z mapped along direction w into data space (axis x_2); labels "PCA" and "factor analysis"]

(54)

Probabilistic PCA

• Marginal distribution

• Advantages

– exact ML solution

– computationally efficient EM algorithm

– captures dominant correlations with few parameters

– mixtures of PPCA

– Bayesian PCA

– building block for more complex models
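The exact ML solution can be sketched as below, assuming the usual PPCA model $\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and isotropic noise of variance $\sigma^2$. The function names are hypothetical, and the arbitrary rotation of the latent space is fixed to the identity.

```python
import numpy as np

def ppca_ml(X, L):
    """Closed-form maximum likelihood fit of probabilistic PCA (requires L < D).

    Returns (W, mu, sigma2) for x = W z + mu + noise, z ~ N(0, I),
    noise ~ N(0, sigma2 * I).
    """
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[L:].mean()                      # average discarded variance
    scale = np.sqrt(np.maximum(eigvals[:L] - sigma2, 0.0))
    W = eigvecs[:, :L] * scale                       # scale each leading eigenvector
    return W, mu, sigma2

def ppca_marginal_cov(W, sigma2):
    """Covariance C = W W^T + sigma2 * I of the marginal p(x) = N(x | mu, C)."""
    return W @ W.T + sigma2 * np.eye(W.shape[0])
```

The marginal covariance matches the data covariance exactly along the L leading directions and replaces the remaining variances by their average, which is how the model captures the dominant correlations with few parameters.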

[Figure: directed graph with nodes z_n and x_n and parameter W, with a plate over the N data points]

(55)

EM for PCA

[Figure: panel (a)]

(56)

EM for PCA

[Figure: panel (b)]

(57)

EM for PCA

[Figure: panel (c)]

(58)

EM for PCA

[Figure: panel (d)]

(59)

EM for PCA

[Figure: panel (e)]

(60)

EM for PCA

[Figure: panel (f)]

(61)

EM for PCA

[Figure: panel (g)]

(62)

Bayesian PCA

• Bishop (1998)

• Gaussian prior over columns of W

• Automatic relevance determination (ARD)
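One common form of such a prior is sketched below. The notation is an assumption: $\mathbf{w}_i$ is the $i$-th column of the $D \times L$ matrix $\mathbf{W}$, and each column has its own precision hyperparameter $\alpha_i$.

```latex
p(\mathbf{W} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{L}
  \left( \frac{\alpha_i}{2\pi} \right)^{D/2}
  \exp\left\{ -\tfrac{1}{2} \alpha_i \, \mathbf{w}_i^{\mathsf T} \mathbf{w}_i \right\}
```

When an $\alpha_i$ is driven to a large value during optimization, the corresponding column $\mathbf{w}_i$ is forced towards zero, which switches off that latent dimension. This is the mechanism by which ARD determines the effective dimensionality.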

[Figure: panels "ML PCA" and "Bayesian PCA"; directed graph with nodes z_n and x_n and parameter W, with a plate over the N data points]

(63)

Non-linear Manifolds

• Example: images of a rigid object

[Figure: axes x_1, x_2 and x_3]

(64)

Bayesian Mixture of BPCA Models

[Figure: directed graph with nodes s_n, z_nm, x_n and W_m, with plates over N and M]

(65)

(66)

Flexible Sprites

• Jojic and Frey (2001)

• Automatic decomposition of video sequence into

– background model

– ordered set of masks (one per object per frame)

– foreground model (one per object per frame)

(67)

(68)

Transformed Component Analysis

• Generative model

• Now include transformations (translations)

• Extend to L layers

• Inference intractable so use variational framework

[Figure: directed graph with nodes s_l, m_l, T_nl and x_n, with plates over N and L]

(69)

(70)

Bayesian Constellation Model

• Li, Fergus and Perona (2003)

• Object recognition from small training sets

• Variational treatment of fully Bayesian model

(71)

Bayesian Constellation Model

(72)
