Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Announcements
• Project midterm presentation: Nov 23
• Reading (due Nov 22):
- Forsyth et al.: Structure from motion and color constancy using MCMC.
- Tu et al.: Image parsing using DDMCMC
D. Forsyth, J. Haddon and S. Ioffe. The joy of sampling. International Journal of Computer Vision, volume 41, no. 1/2, pages 109–134, 2001.
Z. Tu, X. Chen, A. Yuille and S. Zhu. Image parsing: unifying segmentation, detection, and recognition. In Proceedings of International Conference on Computer Vision, pages
18–25, 2003.
• Supplementary reading:
- David MacKay. Introduction to Monte Carlo methods.
- C. Bishop. Lecture notes for British Computer Society Summer School on Pattern Recognition.
Overview
• EM algorithm
• Markov Chain Monte Carlo (MCMC)
• Variational inference
• Belief propagation, loop belief propagation
• Gaussian process, Gaussian process latent variable model
• Applications
Conditional Independence
• If A is independent of B given C
p(A|B, C) = p(A|C) (1)
• Equivalently,
p(A, B|C) = p(A|B, C)p(B|C)
= p(A|C)p(B|C) (2)
• Conditional independence is crucial in applying graphical models as it significantly reduces the complexity of the graph.
• Need to be careful when nodes are not observable. See Bishop’s tutorials for details.
Markov Properties
• Markov properties allows us to exploit conditional independence in graphical models.
- Time domain: p(st|s1, . . . , st−1) = p(st|st−1), e.g., Hidden Markov Model (HMM)
- Space domain (Markov blanket): p(A|M B(A), B) = p(A|M B(A)), e.g., Markov Random Field (MRF).
• M B(A) include the parents and children of node A, and it’s children’s parents, because they can be used to explain away the children nodes.
• Shield off nodes that are conditionally independent of node A given the M B(A).
ML, MAP and Bayesian Learning
• Data set: D = {x1, . . . , xN} with distribution p(x).
• Likelihood (independent observations): L(θ) = p(D|θ) = QN
i=1 p(xi|θ).
• Maximum likelihood (ML) estimate:
θM L = arg max
θ L(θ) = arg max
θ p(D|θ) (3)
• Predictive distribution: p(x|D) ≈ p(x|θM L).
• Maximum a posteriori (MAP) estimate:
θM AP = arg max
θ p(θ|D) = arg max
θ p(D|θ)p(θ) (4)
where p(D) is usually assumed as a constant.
• Predictive distribution: p(x|D) ≈ p(x|θM AP).
• Bayesian learning: marginalize over unknown parameters, rather than point estimates
p(x|D) = Z
p(x|θ)p(θ|D)dθ (5)
• Avoids overfitting problems of ML and MAP
p(D|θ) = QN
i=1 p(xi|θ)
p(D, θ, x) = p(D|θ)p(x|θ)p(θ) (6)
• Predictive distribution:
p(x|D) = 1 p(D)
Z
p(x|θ)p(D|θ)p(θ)dθ (7)
• Model evidence:
p(D) = Z
p(D|θ)p(θ)dθ (8)
Exponential family
• Many distributions can be written as
p(x|θ) = exp{θTu(x) + g(θ) + f (x)} (9)
• Exponential family: Gaussian, Dirichlet, Gamma, Multinomial, Wishart, Bernoulli, . . .
• Use precision (inverse of variance) τ = 1/σ2. N (x|µ, τ−1) = ( τ
2π)1/2 exp{−τ
2(x − µ)2} (10)
θ =
µτ
−τ /2
u(x) =
x x2
g(θ) = 12 ln(2πτ ) − 12τ µ2 f (x) = 0
(11)
Bayesian Estimation
• Likelihood function
p(D|θ) =
N
Y
i=1
p(xi|θ) (12)
• Conjugate prior: prior of θ has same functional form as likelihood
p(θ|η, ν) = exp{θTη + νg(θ) + h(η, ν)} (13)
• Posterior:
p(θ|x) = p(x|θ)p(θ) ∝ exp{θTη + ¯¯ νg(θ)} (14) where η = η + u(x),¯ ν = ν + 1.¯
• Can interpret prior as ν effective observations of value η.
• Examples (which we put prior on parameters):
- Gaussian for the mean of a Gaussian.
- Gaussian-Wishart of mean and precision of Gaussian.
- Dirichlet for the parameters of a discrete distribution.
The EM Algorithm Revisited
• Consider a model with observed data y, hidden/latent variables x, and parameters θ. The log likelihood is bounded by
L(θ) = log p(y|θ) = logR p(x, y|θ)dx
= logR q(x)p(x,y|θ)q(x) dx
≥ R q(x) log p(x,y|θ)q(x) dx ≡ F (q, θ)
(15)
where q(x) is an arbitrary density function over the hidden variables, and the lower bound holds due to the concavity of the log function, i.e., using Jensen’s inequality.
• Recall Jensen’s inequality for convex function
f (λx1 + (1 − λ)x2) ≤ λf (x1) + (1 − λ)f (x2) (16) for x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1.
• The lower bound F is a functional of q(x) and θ, F (q, θ) = PN
i=1 F(i)(q(i), θ).
• Idea: Iterate between optimizing the lower bound as a function of q and as a function of θ.
• Can prove that the log likelihood is never decreased.
• E step: Optimize F w.r.t. q while holding θ fixed (i.e., compute expectation of hidden variables).
qk(x) = arg max
q(x)
Z
q(x) log p(x, y|θk−1)
q(x) (17)
qk(x) = p(x|y, θk−1) (18)
• M step: Optimize F w.r.t. θ while holding q fixed (i.e., maximizing the log likelihood).
θk = arg maxθ R qk(x) log p(x,y|θ)q
k(x)
θk = arg maxθ R qk(x) log p(x, y|θ)dx (19)
• In the E step, for each data point, the distribution over the hidden variables is set to the posterior for that data point, qk(i) = p(x|y(i), θk−1), ∀i.
• In the M step, the set of parameters is re-estimated by maximizing the sum of the expected log likelihood: θk = arg maxθ P
i R qk(i) log p(x, y(i)|θ)dx.
• Note that from (17) to (18),
qk(x) = arg max
q(x)
h
log p(y|θk−1) + R q(x) log p(x|y,θq(x)k−1)dx
i (20)
• The first term is constant w.r.t. q(x) and the second term is the negative of KL divergence
KL(q(x)||p(x|y, θk−1)) = Z
q(x) log q(x)
p(x|y, θk−1) (21) which is minimized when q(x) = p(x|y, θk−1).
• In the E step, the goal is to find the posterior distribution of the hidden variables given the observed variables and the current settlings of the parameters.
• Also, since the KL divergence is zero, F (qk, θk−1) = L(θk−1) at the end of E step.
• In the M step, F is increased w.r.t. θ. Thus, F (qk, θk) ≥ F (qk, θk−1). And after the next E step, L(θk) = F (qk+1, θk) ≥ F (qk, θk). Thus
L(θk) ≥ L(θk−1).
• Can be applied to all the latent variable models including factor analysis, probabilistic PCA, mixture of factor analyzers, mixture of probabilistic PCA, mixture of Gaussians, etc.
• The likelihood often has many local optimal and EM may converge to some local optimal rather than the global one.
• Can be extended to MAP or Bayesian estimates.
• See Bishop’s lectures notes on latent variables, mixture models and EM (BCS summer school, Exeter, 2003).
Markov Chain Monte Carlo (MCMC)
• It is difficult to compute joint distribution exactly.
• Goals:
- Aim to approximate the joint distribution p(x) so that we can draw samples
- Estimate expectation of functions under p(x), e.g.,
Φ = Z
φ(x)p(x)dx (22)
• Focus on sampling problems as the expectation can be estimated by drawing random samples {x(r)}.
Φ =ˆ 1 R
X
r
φ(x(r)) (23)
• As R → ∞, Φ → Φˆ since the variance σ2 =
Z
(φ(x) − Φ)2dx (24)
decreases as σR2.
• Good news: the accuracy of Monte Carlo estimate in (23) is independent of the dimensionality of the space sampled!
• Bad news: It is difficult to draw independent samples in the high dimensional space.
Why Sampling p(x) Is Difficult?
• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p∗(x).
p(x) = p∗(x)/Z (25)
• Two difficulties in evaluating p∗(x).
- Typically we do not know Z
Z = Z
p∗(x)dx (26)
- Even if we know Z, it is difficult to draw samples to well represent or cover p(x) in the high dimensional space.
• Example:
• Decompose x = (x1, . . . , xd) in every dimension.
• Discretize x and ask for samples from discrete probability distribution over a set of uniformly spaced points {xi}, and
Z = P
i p∗(xi)
p(xi) = p∗(xi)/Z (27)
• Suppose we draw 50 samples uniformly spaced in 1-dimensional space, we need 501000 samples in 1000-dimensional space!
• Even if we draw 2 samples in each dimension, we still need 21000 samples in 1000-dimensional space.
• Related to Ising model, Boltzmann machine and Markov field.
• See MacKay for more detail on the number of samples are required to have a good approximation.
Importance Sampling
• Recall we evaluate p∗(x) and evaluate with p∗: p(x) = p∗(x)/Z.
• p is often complicated and difficult to draw samples from.
• Proposal density function: Assume that we have a simpler density q(x) which we can evaluate with a multiplicative constant q∗(x), where
q(x) = q∗/Zq, and from which we can generate samples.
• Introduce weights to adjust the “importance” of each sample wr = p∗(x(r))
q∗(x(r)) (28)
, and
Φ =ˆ P
r wrφ(x(r)) P
r wr (29)
• It can be shown that Φˆ converges to Φ, the mean value of φ(x) as R increases (under some constraints).
• Problem: difficult to estimate how reliable Φˆ is.
• Examples of proposal functions: Gaussian and Cauchy distributions
(p(x) ∼ 1
πγ
h 1 + (x−xγ 0)2 i where γ is a scale )
• The results suggest we should use “heavy tailed” importance sampler
• Heavy tailed: a high proportion of the population is comprised of extreme values.
Left: Gaussian Right: Cauchy
Rejection Sampling
• Further assume that the proposal function q
cq∗(x) > p∗(x) ∀x (30)
• Steps:
- First generate x from q(x) and evaluated with cq∗(x)
- Second generate uniformly distributed variable u from the interval [0, cq∗(x)].
- If u > p∗(x), then x is rejected. Otherwise, x is added to out set of samples {x(r)}.
Metropolis Sampling
• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).
• The Metropolis algorithm makes use of q(x) which depends on the current state x(t).
• Example: q(x0; x(t)) may be a simple Gaussian distribution centered at x(t).
• A tentative state x0 is generated from the proposal density q(x0; x(t)).
• Compute
a = p∗(x0)
p∗(x(t))
q(x(t);x0) q(x0;x(t))
If a ≥ 1, then the new state is accepted
Otherwise, the new state is accepted with probability a
(31)
• If the step is accepted, we set x(t+1) = x0. Otherwise, we set x(t+1) = x(t).
• We need to compute p∗(x0)
p∗(x(t)) and q(x(t);x0)
q(x0;x(t))
• If proposal density is a simple symmetric density as a Gaussian, then the latter factor is unity and the Metropolis algorithm simply involves comparing the value of the target density at two points.
• The general algorithm for asymmetric q is called Metropolis-Hastings algorithm.
• Metropolis method is an example of Markov Chain Monte Carlo (MCMC) method.
• Widely used for high dimensional problems.
• Has been applied to vision problems with good success in image segmentation, recognition, etc.
• Involves a Markov process in which a sequence of {x(r)} is generated where each sample x(t) having a probability distribution that depends on the previous value, x(t−1).
• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).
• Random walk: small or large steps?
• Problems: slow convergence
• Many methods have been proposed for speed up.