Advanced Topics in Learning and Vision

26  Download (0)

Full text


Advanced Topics in Learning and Vision

Ming-Hsuan Yang



• Project midterm presentation: Nov 23

• Reading (due Nov 22):

- Forsyth et al.: Structure from motion and color constancy using MCMC.

- Tu et al.: Image parsing using DDMCMC

D. Forsyth, J. Haddon and S. Ioffe. The joy of sampling. International Journal of Computer Vision, volume 41, no. 1/2, pages 109–134, 2001.

Z. Tu, X. Chen, A. Yuille and S. Zhu. Image parsing: unifying segmentation, detection, and recognition. In Proceedings of International Conference on Computer Vision, pages

18–25, 2003.

• Supplementary reading:

- David MacKay. Introduction to Monte Carlo methods.

- C. Bishop. Lecture notes for British Computer Society Summer School on Pattern Recognition.



• EM algorithm

• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loop belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications


Conditional Independence

• If A is independent of B given C

p(A|B, C) = p(A|C) (1)

• Equivalently,

p(A, B|C) = p(A|B, C)p(B|C)

= p(A|C)p(B|C) (2)

• Conditional independence is crucial in applying graphical models as it significantly reduces the complexity of the graph.

• Need to be careful when nodes are not observable. See Bishop’s tutorials for details.


Markov Properties

• Markov properties allows us to exploit conditional independence in graphical models.

- Time domain: p(st|s1, . . . , st−1) = p(st|st−1), e.g., Hidden Markov Model (HMM)

- Space domain (Markov blanket): p(A|M B(A), B) = p(A|M B(A)), e.g., Markov Random Field (MRF).

• M B(A) include the parents and children of node A, and it’s children’s parents, because they can be used to explain away the children nodes.

• Shield off nodes that are conditionally independent of node A given the M B(A).


ML, MAP and Bayesian Learning

• Data set: D = {x1, . . . , xN} with distribution p(x).

• Likelihood (independent observations): L(θ) = p(D|θ) = QN

i=1 p(xi|θ).

• Maximum likelihood (ML) estimate:

θM L = arg max

θ L(θ) = arg max

θ p(D|θ) (3)

• Predictive distribution: p(x|D) ≈ p(x|θM L).

• Maximum a posteriori (MAP) estimate:

θM AP = arg max

θ p(θ|D) = arg max

θ p(D|θ)p(θ) (4)

where p(D) is usually assumed as a constant.

• Predictive distribution: p(x|D) ≈ p(x|θM AP).


Bayesian learning: marginalize over unknown parameters, rather than point estimates

p(x|D) = Z

p(x|θ)p(θ|D)dθ (5)

• Avoids overfitting problems of ML and MAP

p(D|θ) = QN

i=1 p(xi|θ)

p(D, θ, x) = p(D|θ)p(x|θ)p(θ) (6)


• Predictive distribution:

p(x|D) = 1 p(D)


p(x|θ)p(D|θ)p(θ)dθ (7)

• Model evidence:

p(D) = Z

p(D|θ)p(θ)dθ (8)


Exponential family

• Many distributions can be written as

p(x|θ) = exp{θTu(x) + g(θ) + f (x)} (9)

• Exponential family: Gaussian, Dirichlet, Gamma, Multinomial, Wishart, Bernoulli, . . .

• Use precision (inverse of variance) τ = 1/σ2. N (x|µ, τ−1) = ( τ

2π)1/2 exp{−τ

2(x − µ)2} (10)

θ =


−τ /2

 u(x) =

 x x2

g(θ) = 12 ln(τ ) − 12τ µ2 f (x) = 0



Bayesian Estimation

• Likelihood function

p(D|θ) =




p(xi|θ) (12)

• Conjugate prior: prior of θ has same functional form as likelihood

p(θ|η, ν) = exp{θTη + νg(θ) + h(η, ν)} (13)

• Posterior:

p(θ|x) = p(x|θ)p(θ) ∝ exp{θTη + ¯¯ νg(θ)} (14) where η = η + u(x),¯ ν = ν + 1.¯

• Can interpret prior as ν effective observations of value η.

• Examples (which we put prior on parameters):

- Gaussian for the mean of a Gaussian.

- Gaussian-Wishart of mean and precision of Gaussian.

- Dirichlet for the parameters of a discrete distribution.


The EM Algorithm Revisited

• Consider a model with observed data y, hidden/latent variables x, and parameters θ. The log likelihood is bounded by

L(θ) = log p(y|θ) = logR p(x, y|θ)dx

= logR q(x)p(x,y|θ)q(x) dx

≥ R q(x) log p(x,y|θ)q(x) dx ≡ F (q, θ)


where q(x) is an arbitrary density function over the hidden variables, and the lower bound holds due to the concavity of the log function, i.e., using Jensen’s inequality.

• Recall Jensen’s inequality for convex function

f (λx1 + (1 − λ)x2) ≤ λf (x1) + (1 − λ)f (x2) (16) for x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1.

• The lower bound F is a functional of q(x) and θ, F (q, θ) = PN

i=1 F(i)(q(i), θ).


• Idea: Iterate between optimizing the lower bound as a function of q and as a function of θ.

• Can prove that the log likelihood is never decreased.

E step: Optimize F w.r.t. q while holding θ fixed (i.e., compute expectation of hidden variables).

qk(x) = arg max



q(x) log p(x, y|θk−1)

q(x) (17)

qk(x) = p(x|y, θk−1) (18)

M step: Optimize F w.r.t. θ while holding q fixed (i.e., maximizing the log likelihood).

θk = arg maxθ R qk(x) log p(x,y|θ)q


θk = arg maxθ R qk(x) log p(x, y|θ)dx (19)

• In the E step, for each data point, the distribution over the hidden variables is set to the posterior for that data point, qk(i) = p(x|y(i), θk−1), ∀i.


• In the M step, the set of parameters is re-estimated by maximizing the sum of the expected log likelihood: θk = arg maxθ P

i R qk(i) log p(x, y(i)|θ)dx.

• Note that from (17) to (18),

qk(x) = arg max



log p(y|θk−1) + R q(x) log p(x|y,θq(x)k−1)dx

i (20)

• The first term is constant w.r.t. q(x) and the second term is the negative of KL divergence

KL(q(x)||p(x|y, θk−1)) = Z

q(x) log q(x)

p(x|y, θk−1) (21) which is minimized when q(x) = p(x|y, θk−1).

• In the E step, the goal is to find the posterior distribution of the hidden variables given the observed variables and the current settlings of the parameters.


• Also, since the KL divergence is zero, F (qk, θk−1) = L(θk−1) at the end of E step.

• In the M step, F is increased w.r.t. θ. Thus, F (qk, θk) ≥ F (qk, θk−1). And after the next E step, L(θk) = F (qk+1, θk) ≥ F (qk, θk). Thus

L(θk) ≥ L(θk−1).

• Can be applied to all the latent variable models including factor analysis, probabilistic PCA, mixture of factor analyzers, mixture of probabilistic PCA, mixture of Gaussians, etc.

• The likelihood often has many local optimal and EM may converge to some local optimal rather than the global one.

• Can be extended to MAP or Bayesian estimates.

• See Bishop’s lectures notes on latent variables, mixture models and EM (BCS summer school, Exeter, 2003).


Markov Chain Monte Carlo (MCMC)

• It is difficult to compute joint distribution exactly.

• Goals:

- Aim to approximate the joint distribution p(x) so that we can draw samples

- Estimate expectation of functions under p(x), e.g.,

Φ = Z

φ(x)p(x)dx (22)

• Focus on sampling problems as the expectation can be estimated by drawing random samples {x(r)}.

Φ =ˆ 1 R



φ(x(r)) (23)


• As R → ∞, Φ → Φˆ since the variance σ2 =


(φ(x) − Φ)2dx (24)

decreases as σR2.

• Good news: the accuracy of Monte Carlo estimate in (23) is independent of the dimensionality of the space sampled!

Bad news: It is difficult to draw independent samples in the high dimensional space.


Why Sampling p(x) Is Difficult?

• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p(x).

p(x) = p(x)/Z (25)

• Two difficulties in evaluating p(x).

- Typically we do not know Z

Z = Z

p(x)dx (26)

- Even if we know Z, it is difficult to draw samples to well represent or cover p(x) in the high dimensional space.


• Example:

• Decompose x = (x1, . . . , xd) in every dimension.

• Discretize x and ask for samples from discrete probability distribution over a set of uniformly spaced points {xi}, and

Z = P

i p(xi)

p(xi) = p(xi)/Z (27)

• Suppose we draw 50 samples uniformly spaced in 1-dimensional space, we need 501000 samples in 1000-dimensional space!


• Even if we draw 2 samples in each dimension, we still need 21000 samples in 1000-dimensional space.

• Related to Ising model, Boltzmann machine and Markov field.

• See MacKay for more detail on the number of samples are required to have a good approximation.


Importance Sampling

• Recall we evaluate p(x) and evaluate with p: p(x) = p(x)/Z.

• p is often complicated and difficult to draw samples from.

• Proposal density function: Assume that we have a simpler density q(x) which we can evaluate with a multiplicative constant q(x), where

q(x) = q/Zq, and from which we can generate samples.

• Introduce weights to adjust the “importance” of each sample wr = p(x(r))

q(x(r)) (28)


, and

Φ =ˆ P

r wrφ(x(r)) P

r wr (29)

• It can be shown that Φˆ converges to Φ, the mean value of φ(x) as R increases (under some constraints).

• Problem: difficult to estimate how reliable Φˆ is.

• Examples of proposal functions: Gaussian and Cauchy distributions

(p(x) ∼ 1


h 1 + (x−xγ 0)2 i where γ is a scale )


• The results suggest we should use “heavy tailed” importance sampler

• Heavy tailed: a high proportion of the population is comprised of extreme values.

Left: Gaussian Right: Cauchy


Rejection Sampling

• Further assume that the proposal function q

cq(x) > p(x) ∀x (30)

• Steps:

- First generate x from q(x) and evaluated with cq(x)

- Second generate uniformly distributed variable u from the interval [0, cq(x)].

- If u > p(x), then x is rejected. Otherwise, x is added to out set of samples {x(r)}.


Metropolis Sampling

• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).

• The Metropolis algorithm makes use of q(x) which depends on the current state x(t).

• Example: q(x0; x(t)) may be a simple Gaussian distribution centered at x(t).

• A tentative state x0 is generated from the proposal density q(x0; x(t)).


• Compute

a = p(x0)


q(x(t);x0) q(x0;x(t))

If a ≥ 1, then the new state is accepted

Otherwise, the new state is accepted with probability a


• If the step is accepted, we set x(t+1) = x0. Otherwise, we set x(t+1) = x(t).

• We need to compute p(x0)

p(x(t)) and q(x(t);x0)


• If proposal density is a simple symmetric density as a Gaussian, then the latter factor is unity and the Metropolis algorithm simply involves comparing the value of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings algorithm.

• Metropolis method is an example of Markov Chain Monte Carlo (MCMC) method.


• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence of {x(r)} is generated where each sample x(t) having a probability distribution that depends on the previous value, x(t−1).

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problems: slow convergence

• Many methods have been proposed for speed up.




Related subjects :