**Advanced Topics in Learning and Vision**

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

**Announcements**

• Project midterm presentation: Nov 23

• Reading (due Nov 22):

- Forsyth et al.: Structure from motion and color constancy using MCMC.

- Tu et al.: Image parsing using DDMCMC

*D. Forsyth, J. Haddon and S. Ioffe. The joy of sampling. International Journal of Computer*
*Vision, volume 41, no. 1/2, pages 109–134, 2001.*

Z. Tu, X. Chen, A. Yuille and S. Zhu. Image parsing: unifying segmentation, detection, and
*recognition. In Proceedings of International Conference on Computer Vision, pages*

18–25, 2003.

• Supplementary reading:

- David MacKay. Introduction to Monte Carlo methods.

- C. Bishop. Lecture notes for British Computer Society Summer School on Pattern Recognition.

**Overview**

• EM algorithm

• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loop belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications

**Conditional Independence**

• If A is independent of B given C

p(A|B, C) = p(A|C) (1)

• Equivalently,

p(A, B|C) = p(A|B, C)p(B|C)

= p(A|C)p(B|C) (2)

• Conditional independence is crucial in applying graphical models as it significantly reduces the complexity of the graph.

• Need to be careful when nodes are not observable. See Bishop’s tutorials for details.

**Markov Properties**

• Markov properties allows us to exploit conditional independence in graphical models.

- Time domain: p(s_{t}|s_{1}, . . . , s_{t−1}) = p(s_{t}|s_{t−1}), e.g., Hidden Markov Model
(HMM)

- Space domain (Markov blanket): p(A|M B(A), B) = p(A|M B(A)), e.g., Markov Random Field (MRF).

• M B(A) include the parents and children of node A, and it’s children’s parents, because they can be used to explain away the children nodes.

• Shield off nodes that are conditionally independent of node A given the M B(A).

**ML, MAP and Bayesian Learning**

• Data set: D = {x_{1}, . . . , x_{N}} with distribution p(x).

• Likelihood (independent observations): L(θ) = p(D|θ) = QN

i=1 p(x_{i}|θ).

• Maximum likelihood (ML) estimate:

θ_{M L} = arg max

θ L(θ) = arg max

θ p(D|θ) (3)

• Predictive distribution: p(x|D) ≈ p(x|θ_{M L}).

• Maximum a posteriori (MAP) estimate:

θ_{M AP} = arg max

θ p(θ|D) = arg max

θ p(D|θ)p(θ) (4)

where p(D) is usually assumed as a constant.

• Predictive distribution: p(x|D) ≈ p(x|θ_{M AP}).

• *Bayesian learning: marginalize over unknown parameters, rather than point*
estimates

p(x|D) = Z

p(x|θ)p(θ|D)dθ (5)

• Avoids overfitting problems of ML and MAP

p(D|θ) = QN

i=1 p(x_{i}|θ)

p(D, θ, x) = p(D|θ)p(x|θ)p(θ) (6)

• Predictive distribution:

p(x|D) = 1 p(D)

Z

p(x|θ)p(D|θ)p(θ)dθ (7)

• Model evidence:

p(D) = Z

p(D|θ)p(θ)dθ (8)

**Exponential family**

• Many distributions can be written as

p(x|θ) = exp{θ^{T}u(x) + g(θ) + f (x)} (9)

• Exponential family: Gaussian, Dirichlet, Gamma, Multinomial, Wishart, Bernoulli, . . .

• Use precision (inverse of variance) τ = 1/σ^{2}.
N (x|µ, τ^{−1}) = ( τ

2π)^{1/2} exp{−τ

2(x − µ)^{2}} (10)

θ =

µτ

−τ /2

u(x) =

x
x^{2}

g(θ) = ^{1}_{2} ln(_{2π}^{τ} ) − ^{1}_{2}τ µ^{2}
f (x) = 0

(11)

**Bayesian Estimation**

• Likelihood function

p(D|θ) =

N

Y

i=1

p(x_{i}|θ) (12)

• Conjugate prior: prior of θ has same functional form as likelihood

p(θ|η, ν) = exp{θ^{T}η + νg(θ) + h(η, ν)} (13)

• Posterior:

p(θ|x) = p(x|θ)p(θ) ∝ exp{θ^{T}η + ¯¯ νg(θ)} (14)
where η = η + u(x),¯ ν = ν + 1.¯

• Can interpret prior as ν effective observations of value η.

• Examples (which we put prior on parameters):

- Gaussian for the mean of a Gaussian.

- Gaussian-Wishart of mean and precision of Gaussian.

- Dirichlet for the parameters of a discrete distribution.

**The EM Algorithm Revisited**

• Consider a model with observed data y, hidden/latent variables x, and parameters θ. The log likelihood is bounded by

L(θ) = log p(y|θ) = logR p(x, y|θ)dx

= logR q(x)^{p(x,y|θ)}_{q(x)} dx

≥ R q(x) log ^{p(x,y|θ)}_{q(x)} dx ≡ F (q, θ)

(15)

where q(x) is an arbitrary density function over the hidden variables, and the lower bound holds due to the concavity of the log function, i.e., using Jensen’s inequality.

• Recall Jensen’s inequality for convex function

f (λx_{1} + (1 − λ)x_{2}) ≤ λf (x_{1}) + (1 − λ)f (x_{2}) (16)
for x_{1}, x_{2} ∈ (a, b) and 0 ≤ λ ≤ 1.

• The lower bound F is a functional of q(x) and θ, F (q, θ) = PN

i=1 F^{(i)}(q^{(i)}, θ).

• Idea: Iterate between optimizing the lower bound as a function of q and as a function of θ.

• Can prove that the log likelihood is never decreased.

• **E step: Optimize** F w.r.t. q while holding θ fixed (i.e., compute expectation
of hidden variables).

q_{k}(x) = arg max

q(x)

Z

q(x) log p(x, y|θ_{k−1})

q(x) (17)

q_{k}(x) = p(x|y, θ_{k−1}) (18)

• **M step: Optimize** F w.r.t. θ while holding q fixed (i.e., maximizing the log
likelihood).

θ_{k} = arg max_{θ} R q_{k}(x) log ^{p(x,y|θ)}_{q}

k(x)

θ_{k} = arg max_{θ} R q_{k}(x) log p(x, y|θ)dx (19)

• In the E step, for each data point, the distribution over the hidden variables
is set to the posterior for that data point, q_{k}^{(i)} = p(x|y^{(i)}, θ_{k−1}), ∀i.

• In the M step, the set of parameters is re-estimated by maximizing the sum
of the expected log likelihood: θ_{k} = arg max_{θ} P

i R q_{k}^{(i)} log p(x, y^{(i)}|θ)dx.

• Note that from (17) to (18),

q_{k}(x) = arg max

q(x)

h

log p(y|θ_{k−1}) + R q(x) log ^{p(x|y,θ}_{q(x)}^{k−1}^{)}dx

i (20)

• The first term is constant w.r.t. q(x) and the second term is the negative of KL divergence

KL(q(x)||p(x|y, θ_{k−1})) =
Z

q(x) log q(x)

p(x|y, θ_{k−1}) (21)
which is minimized when q(x) = p(x|y, θ_{k−1}).

• In the E step, the goal is to find the posterior distribution of the hidden variables given the observed variables and the current settlings of the parameters.

• Also, since the KL divergence is zero, F (q_{k}, θ_{k−1}) = L(θ_{k−1}) at the end of E
step.

• In the M step, F is increased w.r.t. θ. Thus, F (q_{k}, θ_{k}) ≥ F (q_{k}, θ_{k−1}). And
after the next E step, L(θ_{k}) = F (q_{k+1}, θ_{k}) ≥ F (q_{k}, θ_{k}). Thus

L(θ_{k}) ≥ L(θ_{k−1}).

• Can be applied to all the latent variable models including factor analysis, probabilistic PCA, mixture of factor analyzers, mixture of probabilistic PCA, mixture of Gaussians, etc.

• The likelihood often has many local optimal and EM may converge to some local optimal rather than the global one.

• Can be extended to MAP or Bayesian estimates.

• See Bishop’s lectures notes on latent variables, mixture models and EM (BCS summer school, Exeter, 2003).

**Markov Chain Monte Carlo (MCMC)**

• It is difficult to compute joint distribution exactly.

• Goals:

- Aim to approximate the joint distribution p(x) so that we can draw samples

- Estimate expectation of functions under p(x), e.g.,

Φ = Z

φ(x)p(x)dx (22)

• Focus on sampling problems as the expectation can be estimated by
drawing random samples {x^{(r)}}.

Φ =ˆ 1 R

X

r

φ(x^{(r)}) (23)

• As R → ∞, Φ → Φˆ since the variance
σ^{2} =

Z

(φ(x) − Φ)^{2}dx (24)

decreases as ^{σ}_{R}^{2}.

• Good news: the accuracy of Monte Carlo estimate in (23) is independent of
*the dimensionality of the space sampled!*

• *Bad news: It is difficult to draw independent samples in the high*
dimensional space.

**Why Sampling** p(x) **Is Difficult?**

• Assume that the target (but unknown) density function p(x) can be
evaluated, within a multiplicative constant, by p^{∗}(x).

p(x) = p^{∗}(x)/Z (25)

• Two difficulties in evaluating p^{∗}(x).

- Typically we do not know Z

Z = Z

p^{∗}(x)dx (26)

- Even if we know Z*, it is difficult to draw samples to well represent or*
*cover* p(x) in the high dimensional space.

• Example:

• Decompose x = (x_{1}, . . . , x_{d}) in every dimension.

• Discretize x and ask for samples from discrete probability distribution over
a set of uniformly spaced points {x_{i}}, and

Z = P

i p^{∗}(x_{i})

p(x_{i}) = p^{∗}(x_{i})/Z (27)

• Suppose we draw 50 samples uniformly spaced in 1-dimensional space,
we need 50^{1000} samples in 1000-dimensional space!

• Even if we draw 2 samples in each dimension, we still need 2^{1000} samples
in 1000-dimensional space.

• Related to Ising model, Boltzmann machine and Markov field.

• See MacKay for more detail on the number of samples are required to have a good approximation.

**Importance Sampling**

• Recall we evaluate p^{∗}(x) and evaluate with p^{∗}: p(x) = p^{∗}(x)/Z.

• p is often complicated and difficult to draw samples from.

• Proposal density function: Assume that we have a simpler density q(x)
which we can evaluate with a multiplicative constant q^{∗}(x), where

q(x) = q^{∗}/Z_{q}, and from which we can generate samples.

• Introduce weights to adjust the “importance” of each sample
w_{r} = p^{∗}(x^{(r)})

q^{∗}(x^{(r)}) (28)

, and

Φ =ˆ P

r w_{r}φ(x^{(r)})
P

r w_{r} (29)

• It can be shown that Φˆ converges to Φ, the mean value of φ(x) as R increases (under some constraints).

• Problem: difficult to estimate how reliable Φˆ is.

• Examples of proposal functions: Gaussian and Cauchy distributions

(p(x) ∼ ^{1}

πγ

h 1 + (^{x−x}_{γ} ^{0})^{2} ^{i} where γ is a scale )

• The results suggest we should use “heavy tailed” importance sampler

• Heavy tailed: a high proportion of the population is comprised of extreme values.

Left: Gaussian Right: Cauchy

**Rejection Sampling**

• Further assume that the proposal function q

cq^{∗}(x) > p^{∗}(x) ∀x (30)

• Steps:

- First generate x from q(x) and evaluated with cq^{∗}(x)

- Second generate uniformly distributed variable u from the interval
[0, cq^{∗}(x)].

- If u > p^{∗}(x), then x is rejected. Otherwise, x is added to out set of
samples {x^{(r)}}.

**Metropolis Sampling**

• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).

• The Metropolis algorithm makes use of q(x) which depends on the current
state x^{(t)}.

• Example: q(x^{0}; x^{(t)}) may be a simple Gaussian distribution centered at x^{(t)}.

• A tentative state x^{0} is generated from the proposal density q(x^{0}; x^{(t)}).

• Compute

a = ^{p}^{∗}^{(x}^{0}^{)}

p^{∗}(x^{(t)})

q(x^{(t)};x^{0})
q(x^{0};x^{(t)})

If a ≥ 1, then the new state is accepted

Otherwise, the new state is accepted with probability a

(31)

• If the step is accepted, we set x^{(t+1)} = x^{0}. Otherwise, we set x^{(t+1)} = x^{(t)}.

• We need to compute ^{p}^{∗}^{(x}^{0}^{)}

p^{∗}(x^{(t)}) and ^{q(x}^{(t)}^{;x}^{0}^{)}

q(x^{0};x^{(t)})

• If proposal density is a simple symmetric density as a Gaussian, then the latter factor is unity and the Metropolis algorithm simply involves comparing the value of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings algorithm.

• Metropolis method is an example of Markov Chain Monte Carlo (MCMC) method.

• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence of {x^{(r)}} is generated
where each sample x^{(t)} having a probability distribution that depends on
the previous value, x^{(t−1)}.

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problems: slow convergence

• Many methods have been proposed for speed up.