Christopher M. Bishop

Academic year: 2022

(1)

Part 2: Unsupervised Learning

Machine Learning Techniques for Computer Vision

Microsoft Research Cambridge

Christopher M. Bishop


(2)

Overview of Part 2

• Mixture models

• EM

• Variational Inference

• Bayesian model complexity

• Continuous latent variables

(3)

The Gaussian Distribution

• Multivariate Gaussian

• Maximum likelihood

– parameters: mean and covariance
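For reference, a sketch of the standard forms, assuming a D-dimensional variable x and a data set x_1, ..., x_N (this notation is an assumption, not taken from the slide):

```latex
\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}
    \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}

\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})
  (\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}
```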

(4)

Gaussian Mixtures

• Linear super-position of Gaussians

• Normalization and positivity require
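In the usual notation (assumed here: K components with mixing coefficients π_k, means μ_k and covariances Σ_k), the mixture and the constraints on its coefficients can be written as:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1,
\qquad
\sum_{k=1}^{K} \pi_k = 1
```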

(5)

Example: Mixture of 3 Gaussians

[Figure: panels (a) and (b); both axes run from 0 to 1]

(6)

Maximum Likelihood for the GMM

• Log likelihood function

• Sum over components appears inside the log

– no closed form ML solution
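Written out for a data set X = {x_1, ..., x_N} (symbols assumed, following the usual mixture notation), the log likelihood makes the difficulty visible: the logarithm acts on the sum over k rather than on each Gaussian directly.

```latex
\ln p(\mathbf{X}\mid\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln\left\{ \sum_{k=1}^{K} \pi_k\,
    \mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \right\}
```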

(7)

EM Algorithm – Informal Derivation

(8)

EM Algorithm – Informal Derivation

• M step equations

(9)

EM Algorithm – Informal Derivation

• E step equation

(10)

EM Algorithm – Informal Derivation

• Can interpret the mixing coefficients as prior probabilities

• Corresponding posterior probabilities (responsibilities)
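The E and M steps can be sketched for a one-dimensional mixture as follows. This is a minimal illustration using only the Python standard library, not the code behind the slides; the function names, the toy data and the initial values are all assumptions made for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Responsibilities: posterior probability of each component for each point."""
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(data, resp):
    """Re-estimate mixing coefficients, means and variances from responsibilities."""
    n = len(data)
    weights, means, variances = [], [], []
    for k in range(len(resp[0])):
        nk = sum(r[k] for r in resp)  # effective number of points in component k
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        weights.append(nk / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def fit_gmm(data, weights, means, variances, iterations=50):
    """Alternate E and M steps, recording the log likelihood after each pass."""
    history = []
    for _ in range(iterations):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history

# Toy data: two well-separated clusters (made up for this example).
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
weights, means, variances, history = fit_gmm(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The recorded log likelihood should never decrease from one iteration to the next, which is a convenient check when debugging an implementation.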

(11)

Old Faithful Data Set

[Figure: axis label "Time between eruptions (minutes)"]

(12)
(13)
(14)
(15)
(16)
(17)
(18)

Latent Variable View of EM

• To sample from a Gaussian mixture:

– first pick one of the components with probability given by its mixing coefficient

– then draw a sample from that component

– repeat these two steps for each new data point
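The two-step sampling procedure can be sketched for a one-dimensional mixture as below. The function name `sample_mixture`, the parameter values and the fixed seed are assumptions chosen for illustration.

```python
import random

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n (component, value) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: pick a component with probability equal to its mixing coefficient.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw a value from the chosen Gaussian component.
        x = rng.gauss(means[k], stds[k])
        samples.append((k, x))
    return samples

samples = sample_mixture(1000, [0.5, 0.3, 0.2], [0.0, 5.0, 10.0], [1.0, 1.0, 1.0])
```

Keeping the component index k alongside each value gives the complete data; discarding it leaves only the values, which is the incomplete data that EM has to work with.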

[Figure: panel (a); both axes run from 0 to 1]

(19)

Latent Variable View of EM

• Goal: given a data set, find the parameters of the mixture

• Suppose we knew the colours

– maximum likelihood would involve fitting each component to the corresponding cluster

• Problem: the colours are latent (hidden) variables

(20)

Incomplete and Complete Data

[Figure: (a) complete data; (b) incomplete data; both axes run from 0 to 1]

(21)

Latent Variable Viewpoint

(22)

Latent Variable Viewpoint

• Binary latent variables describing which component generated each data point

• Conditional distribution of observed variable

• Prior distribution of latent variables

• Marginalizing over the latent variables we obtain
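A sketch of these three steps in the usual 1-of-K notation, assuming z is a binary vector with exactly one element z_k equal to 1 (the symbols are assumptions consistent with the standard mixture notation):

```latex
p(\mathbf{x}\mid\mathbf{z}) = \prod_{k=1}^{K}
  \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}

p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})
  = \sum_{k=1}^{K} \pi_k\,
    \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)
```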


(23)

Graphical Representation of GMM

[Figure: graphical model with nodes π, z_n, x_n, Σ and μ, and a plate over the N data points]

(24)

Latent Variable View of EM

• Suppose we knew the values for the latent variables

– maximize the complete-data log likelihood

– trivial closed-form solution: fit each component to the corresponding set of data points

• We don’t know the values of the latent variables

– however, for given parameter values we can compute the expected values of the latent variables

(25)

Posterior Probabilities (colour coded)

[Figure: data points colour coded by posterior probability]

(26)

Over-fitting in Gaussian Mixture Models

• Infinities in likelihood function when a component ‘collapses’ onto a data point: its mean coincides with that point and its variance shrinks to zero

• Also, maximum likelihood cannot determine the number K of components
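The collapse can be seen numerically in a small one-dimensional sketch. One component is centred exactly on a data point and its variance is reduced step by step; the data, the weights and the variance schedule are made up for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    total = 0.0
    for x in data:
        total += math.log(sum(w * normal_pdf(x, m, v)
                              for w, m, v in zip(weights, means, variances)))
    return total

data = [-1.0, 0.0, 0.5, 2.0]
# Component 0 stays broad; component 1 sits exactly on the data point 0.5
# and its variance shrinks towards zero.
shrinking_variances = [1e-2, 1e-4, 1e-6, 1e-8]
values = [
    log_likelihood(data, [0.5, 0.5], [0.0, 0.5], [1.0, var])
    for var in shrinking_variances
]
```

Each further reduction of the variance raises the log likelihood, with no upper limit. The broad component keeps the other data points from driving the likelihood to zero, which is why the problem does not arise for a single Gaussian.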

(27)

Cross Validation

• Can select model complexity using an independent validation data set

• If data is scarce use cross-validation:

– partition data into S subsets

– train on S−1 subsets

– test on remainder

– repeat and average

• Disadvantages

– computationally expensive
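The S-fold procedure can be sketched generically as below. The helper names (`s_fold_splits`, `cross_validate`) and the single-Gaussian model used to exercise them are assumptions for illustration; any `fit` and `score` functions with the same shape could be substituted, for example mixtures with different numbers of components.

```python
import math

def s_fold_splits(n_points, s):
    """Yield (train_indices, test_indices) for each of the S folds."""
    folds = [list(range(i, n_points, s)) for i in range(s)]
    for i in range(s):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def cross_validate(data, s, fit, score):
    """Train on S-1 subsets, test on the remainder, repeat and average."""
    scores = []
    for train_idx, test_idx in s_fold_splits(len(data), s):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)

def fit_gaussian(points):
    """Maximum likelihood mean and variance of a single Gaussian."""
    mean = sum(points) / len(points)
    var = sum((x - mean) ** 2 for x in points) / len(points)
    return mean, var

def mean_log_density(model, points):
    """Average held-out log density under the fitted Gaussian."""
    mean, var = model
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var
        for x in points
    ) / len(points)

data = [0.1, -0.4, 1.2, 0.7, -1.1, 0.3, 0.9, -0.6, 0.0, 0.5]
cv_score = cross_validate(data, 5, fit_gaussian, mean_log_density)
```

The model is refitted S times for every candidate complexity, which is where the computational expense comes from.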

(28)

Bayesian Mixture of Gaussians

• Parameters and latent variables appear on equal footing

• Conjugate priors

[Figure: graphical model with nodes π, z_n, x_n, Λ and μ, and a plate over the N data points]

(29)

Data Set Size

• Problem 1: learn a function from 100 (slightly) noisy examples

– data set is computationally small but statistically large

• Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images

– data set is computationally large but statistically small

• Bayesian inference

– computationally more demanding than ML or MAP

(30)

Variational Inference

• Exact Bayesian inference intractable

• Markov chain Monte Carlo

– computationally expensive

– issues of convergence

• Variational Inference

– broadly applicable deterministic approximation

– treat all latent variables and parameters together as a single set of unknowns

– approximate true posterior using a simpler distribution

– minimize Kullback-Leibler divergence

(31)

General View of Variational Inference

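• For an arbitrary distribution q, the log marginal likelihood splits into a lower bound plus a Kullback-Leibler divergence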
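In one common notation (assumed here: X for the observed data, Z for all latent variables and parameters, q(Z) for the approximating distribution), the decomposition reads:

```latex
\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)

\mathcal{L}(q) = \int q(\mathbf{Z})
  \ln\left\{\frac{p(\mathbf{X},\mathbf{Z})}{q(\mathbf{Z})}\right\} d\mathbf{Z},
\qquad
\mathrm{KL}(q\,\|\,p) = -\int q(\mathbf{Z})
  \ln\left\{\frac{p(\mathbf{Z}\mid\mathbf{X})}{q(\mathbf{Z})}\right\} d\mathbf{Z}
```

Since the KL term is non-negative and the left-hand side does not depend on q, maximizing the bound with respect to q is equivalent to minimizing the divergence.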

(32)

Variational Lower Bound

(33)

Factorized Approximation

• Goal: choose a family of q distributions which are:

– sufficiently flexible to give good approximation

– sufficiently simple to remain tractable

• Here we consider factorized distributions

• No further assumptions are required!

• Optimal solution for one factor, keeping the remainder fixed
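A sketch of the standard mean-field result, assuming the unknowns Z are split into groups Z_i and X denotes the observed data (notation assumed):

```latex
q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i)

\ln q_j^{\star}(\mathbf{Z}_j)
  = \mathbb{E}_{i \neq j}\left[\ln p(\mathbf{X},\mathbf{Z})\right] + \text{const}
```

The expectation is taken over all factors other than q_j, so the updates are coupled and are applied in turn until they stop changing.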

(34)

[Figure: panel (a); axes x_1 and x_2, each running from 0 to 1]

(35)

Lower Bound

• Can also be evaluated

• Useful for maths/code verification

• Also useful for model comparison:

(36)

Illustration: Univariate Gaussian

• Likelihood function

• Conjugate prior

• Factorized variational distribution
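The coupled updates for this example can be sketched as below, assuming the textbook Gaussian-Gamma prior p(µ|τ) = N(µ | mu0, 1/(lambda0 τ)), p(τ) = Gam(τ | a0, b0) and the factorization q(µ, τ) = q(µ) q(τ). The hyperparameter names, their default values and the toy data are assumptions for illustration.

```python
def variational_gaussian(data, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, iterations=20):
    """Mean-field updates for q(mu) = N(mu_n, 1/lambda_n), q(tau) = Gam(a_n, b_n)."""
    n = len(data)
    xbar = sum(data) / n
    # The mean of q(mu) and the shape of q(tau) do not depend on the other factor.
    mu_n = (lambda0 * mu0 + n * xbar) / (lambda0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = a0 / b0  # initial guess for E[tau]
    for _ in range(iterations):
        # Update q(mu): its precision uses the current E[tau].
        lambda_n = (lambda0 + n) * e_tau
        # Update q(tau): expectations over q(mu) add a 1/lambda_n variance term.
        e_data = sum((x - mu_n) ** 2 for x in data) + n / lambda_n
        e_prior = lambda0 * ((mu_n - mu0) ** 2 + 1.0 / lambda_n)
        b_n = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_n / b_n
    return mu_n, lambda_n, a_n, b_n

mu_n, lambda_n, a_n, b_n = variational_gaussian([1.0, 2.0, 3.0, 4.0, 5.0])
```

Only lambda_n and b_n depend on each other, so the iteration reduces to alternating between those two quantities until they settle.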

(37)

Initial Configuration

[Figure: panel (a), plotted in the (µ, τ) plane]

(38)

After Updating

[Figure: panel (b), plotted in the (µ, τ) plane]

(39)

After Updating

[Figure: panel (c), plotted in the (µ, τ) plane]

(40)

Converged Solution

[Figure: panel (d), plotted in the (µ, τ) plane]

(41)

Variational Mixture of Gaussians

• Assume factorized posterior distribution

• No other approximations needed!

(42)

Variational Equations for GMM

(43)

Lower Bound for GMM

(44)

VIBES

Bishop, Spiegelhalter and Winn (2002)

(45)

ML Limit

• If instead we choose

we recover the maximum likelihood EM algorithm

(46)

Bound vs. K for Old Faithful Data

(47)

Bayesian Model Complexity

(48)

Sparse Bayes for Gaussian Mixture

• Corduneanu and Bishop (2001)

• Start with large value of K

– treat mixing coefficients as parameters

– maximize marginal likelihood

– prunes out excess components

(49)
(50)
(51)

Summary: Variational Gaussian Mixtures

• Simple modification of maximum likelihood EM code

• Small computational overhead compared to EM

• No singularities

• Automatic model order selection

(52)

Continuous Latent Variables

• Conventional PCA

– data covariance matrix

– eigenvector decomposition

• Minimizes sum-of-squares projection error

– not a probabilistic model

– how should we choose L?
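A minimal sketch of conventional PCA for the leading direction only. To stay within the standard library it uses power iteration in place of a full eigenvector decomposition, which is a substitution made for this example; the function names and the toy data are likewise assumptions.

```python
import math

def covariance_matrix(data):
    """Sample mean and (biased) covariance matrix of a list of D-dimensional rows."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [
        [sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v

# Toy data lying exactly along the direction (2, 1).
data = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
mean, cov = covariance_matrix(data)
eigenvalue, u1 = leading_eigenvector(cov)
```

Projecting the centred data onto u1 gives the one-dimensional representation; the eigenvalue is the variance captured along that direction.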

[Figure: a data point x_n and its projection x̃_n onto the direction u_1, drawn in the (x_1, x_2) plane]

(53)

Probabilistic PCA

• Tipping and Bishop (1998)

• L dimensional continuous latent space

• D dimensional data space

[Figure: latent variable z mapped into data space by w; labels: PCA, factor analysis]

(54)

Probabilistic PCA

• Marginal distribution

• Advantages

– exact ML solution

– computationally efficient EM algorithm

– captures dominant correlations with few parameters

– mixtures of PPCA

– Bayesian PCA

– building block for more complex models
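A sketch of the model behind the marginal distribution, in the notation of the standard probabilistic PCA formulation (the symbols W, μ and σ² for the D×L loading matrix, the mean and the isotropic noise variance are assumed here):

```latex
p(\mathbf{z}) = \mathcal{N}(\mathbf{z}\mid\mathbf{0},\mathbf{I}),
\qquad
p(\mathbf{x}\mid\mathbf{z}) = \mathcal{N}(\mathbf{x}\mid
  \mathbf{W}\mathbf{z}+\boldsymbol{\mu},\,\sigma^{2}\mathbf{I})

p(\mathbf{x}) = \int p(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}
  = \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\,
    \mathbf{W}\mathbf{W}^{\mathsf T}+\sigma^{2}\mathbf{I})
```

The covariance of the marginal has DL + 1 free parameters in W and σ² rather than the D(D+1)/2 of a full covariance matrix, which is the sense in which dominant correlations are captured with few parameters.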











[Figure: graphical model with nodes z_n and x_n, parameter W, and a plate over the N data points]

(55)

EM for PCA

[Figure sequence: panels (a) to (g), one per slide; axes run from −2 to 2]

(62)

Bayesian PCA

• Bishop (1998)

• Gaussian prior over columns of W

• Automatic relevance determination (ARD)

[Figure: ML PCA compared with Bayesian PCA; graphical model with nodes z_n and x_n, parameter W, and a plate over the N data points]

(63)

Non-linear Manifolds

• Example: images of a rigid object

[Figure: axes x_1, x_2 and x_3]

(64)

Bayesian Mixture of BPCA Models



[Figure: graphical model with nodes s_n, z_nm, x_n and W_m, and plates over N and M]











(65)
(66)

Flexible Sprites

• Jojic and Frey (2001)

• Automatic decomposition of video sequence into

– background model

– ordered set of masks (one per object per frame)

– foreground model (one per object per frame)

(67)
(68)

Transformed Component Analysis

• Generative model

• Now include transformations (translations)

• Extend to L layers

• Inference intractable so use variational framework

[Figure: graphical model with nodes s_l, m_l, T_nl and x_n, and plates over N and L]

(69)
(70)

Bayesian Constellation Model

• Li, Fergus and Perona (2003)

• Object recognition from small training sets

• Variational treatment of fully Bayesian model

(71)

Bayesian Constellation Model

(72)

Summary of Part 2

• Discrete and continuous latent variables

– EM algorithm

• Build complex models from simple components

– represented graphically

– incorporates prior knowledge

• Variational inference

– Bayesian model comparison
