Advanced Topics in Learning and Vision

(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

[email protected]

(2)

Announcements

• Project midterm presentation: Nov 23

• Reading (due Nov 22):

- Forsyth et al.: Structure from motion and color constancy using MCMC.

- Tu et al.: Image parsing using DDMCMC

D. Forsyth, J. Haddon and S. Ioffe. The joy of sampling. International Journal of Computer Vision, volume 41, no. 1/2, pages 109–134, 2001.

Z. Tu, X. Chen, A. Yuille and S. Zhu. Image parsing: unifying segmentation, detection, and recognition. In Proceedings of International Conference on Computer Vision, pages

18–25, 2003.

• Supplementary reading:

- David MacKay. Introduction to Monte Carlo methods.

- C. Bishop. Lecture notes for British Computer Society Summer School on Pattern Recognition.

(3)

Overview

• EM algorithm

• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loop belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications

(4)

Conditional Independence

• If A is independent of B given C

p(A|B, C) = p(A|C) (1)

• Equivalently,

p(A, B|C) = p(A|B, C)p(B|C)

= p(A|C)p(B|C) (2)

• Conditional independence is crucial in applying graphical models as it significantly reduces the complexity of the graph.

• Need to be careful when nodes are not observable. See Bishop’s tutorials for details.

(5)

Markov Properties

• Markov properties allows us to exploit conditional independence in graphical models.

- Time domain: p(s_t|s₁, . . . , s_t−1) = p(s_t|s_t−1), e.g., Hidden Markov Model (HMM)

- Space domain (Markov blanket): p(A|M B(A), B) = p(A|M B(A)), e.g., Markov Random Field (MRF).

• M B(A) include the parents and children of node A, and it’s children’s parents, because they can be used to explain away the children nodes.

• Shield off nodes that are conditionally independent of node A given the M B(A).

(6)

ML, MAP and Bayesian Learning

• Data set: D = {x₁, . . . , x_N} with distribution p(x).

• Likelihood (independent observations): L(θ) = p(D|θ) = QN

i=1 p(x_i|θ).

• Maximum likelihood (ML) estimate:

θ_{M L} = arg max

θ L(θ) = arg max

θ p(D|θ) (3)

• Predictive distribution: p(x|D) ≈ p(x|θ_{M L}).

• Maximum a posteriori (MAP) estimate:

θ_{M AP} = arg max

θ p(θ|D) = arg max

θ p(D|θ)p(θ) (4)

where p(D) is usually assumed as a constant.

• Predictive distribution: p(x|D) ≈ p(x|θ_{M AP}).

(7)

• Bayesian learning: marginalize over unknown parameters, rather than point estimates

p(x|D) = Z

p(x|θ)p(θ|D)dθ (5)

• Avoids overfitting problems of ML and MAP

p(D|θ) = QN

i=1 p(x_i|θ)

p(D, θ, x) = p(D|θ)p(x|θ)p(θ) (6)

(8)

• Predictive distribution:

p(x|D) = 1 p(D)

Z

p(x|θ)p(D|θ)p(θ)dθ (7)

• Model evidence:

p(D) = Z

p(D|θ)p(θ)dθ (8)

(9)

Exponential family

• Many distributions can be written as

p(x|θ) = exp{θ^Tu(x) + g(θ) + f (x)} (9)

• Exponential family: Gaussian, Dirichlet, Gamma, Multinomial, Wishart, Bernoulli, . . .

• Use precision (inverse of variance) τ = 1/σ². N (x|µ, τ⁻¹) = ( τ

2π)^1/2 exp{−τ

2(x − µ)²} (10)

θ =

µτ

−τ /2

u(x) =

x x²

g(θ) = ¹₂ ln(_2π^τ ) − ¹₂τ µ² f (x) = 0

(11)

(10)

Bayesian Estimation

• Likelihood function

p(D|θ) =

N

Y

i=1

p(x_i|θ) (12)

• Conjugate prior: prior of θ has same functional form as likelihood

p(θ|η, ν) = exp{θ^Tη + νg(θ) + h(η, ν)} (13)

• Posterior:

p(θ|x) = p(x|θ)p(θ) ∝ exp{θ^Tη + ¯¯ νg(θ)} (14) where η = η + u(x),¯ ν = ν + 1.¯

• Can interpret prior as ν effective observations of value η.

• Examples (which we put prior on parameters):

- Gaussian for the mean of a Gaussian.

- Gaussian-Wishart of mean and precision of Gaussian.

- Dirichlet for the parameters of a discrete distribution.

(11)

The EM Algorithm Revisited

• Consider a model with observed data y, hidden/latent variables x, and parameters θ. The log likelihood is bounded by

L(θ) = log p(y|θ) = logR p(x, y|θ)dx

= logR q(x)^p(x,y|θ)_q(x) dx

≥ R q(x) log ^p(x,y|θ)_q(x) dx ≡ F (q, θ)

(15)

where q(x) is an arbitrary density function over the hidden variables, and the lower bound holds due to the concavity of the log function, i.e., using Jensen’s inequality.

• Recall Jensen’s inequality for convex function

f (λx₁ + (1 − λ)x₂) ≤ λf (x₁) + (1 − λ)f (x₂) (16) for x₁, x₂ ∈ (a, b) and 0 ≤ λ ≤ 1.

• The lower bound F is a functional of q(x) and θ, F (q, θ) = PN

i=1 F⁽ⁱ⁾(q⁽ⁱ⁾, θ).

(12)

• Idea: Iterate between optimizing the lower bound as a function of q and as a function of θ.

• Can prove that the log likelihood is never decreased.

• E step: Optimize F w.r.t. q while holding θ fixed (i.e., compute expectation of hidden variables).

q_k(x) = arg max

q(x)

Z

q(x) log p(x, y|θ_k−1)

q(x) (17)

q_k(x) = p(x|y, θ_k−1) (18)

• M step: Optimize F w.r.t. θ while holding q fixed (i.e., maximizing the log likelihood).

θ_k = arg max_θ R q_k(x) log ^p(x,y|θ)_q

k(x)

θ_k = arg max_θ R q_k(x) log p(x, y|θ)dx (19)

• In the E step, for each data point, the distribution over the hidden variables is set to the posterior for that data point, q_k⁽ⁱ⁾ = p(x|y⁽ⁱ⁾, θ_k−1), ∀i.

(13)

• In the M step, the set of parameters is re-estimated by maximizing the sum of the expected log likelihood: θ_k = arg max_θ P

i R q_k⁽ⁱ⁾ log p(x, y⁽ⁱ⁾|θ)dx.

• Note that from (17) to (18),

q_k(x) = arg max

q(x)

h

log p(y|θ_k−1) + R q(x) log ^p(x|y,θ_q(x)^k−1⁾dx

i (20)

• The first term is constant w.r.t. q(x) and the second term is the negative of KL divergence

KL(q(x)||p(x|y, θ_k−1)) = Z

q(x) log q(x)

p(x|y, θ_k−1) (21) which is minimized when q(x) = p(x|y, θ_k−1).

• In the E step, the goal is to find the posterior distribution of the hidden variables given the observed variables and the current settlings of the parameters.

(14)

• Also, since the KL divergence is zero, F (q_k, θ_k−1) = L(θ_k−1) at the end of E step.

• In the M step, F is increased w.r.t. θ. Thus, F (q_k, θ_k) ≥ F (q_k, θ_k−1). And after the next E step, L(θ_k) = F (q_k+1, θ_k) ≥ F (q_k, θ_k). Thus

L(θ_k) ≥ L(θ_k−1).

• Can be applied to all the latent variable models including factor analysis, probabilistic PCA, mixture of factor analyzers, mixture of probabilistic PCA, mixture of Gaussians, etc.

• The likelihood often has many local optimal and EM may converge to some local optimal rather than the global one.

• Can be extended to MAP or Bayesian estimates.

• See Bishop’s lectures notes on latent variables, mixture models and EM (BCS summer school, Exeter, 2003).

(15)

Markov Chain Monte Carlo (MCMC)

• It is difficult to compute joint distribution exactly.

• Goals:

- Aim to approximate the joint distribution p(x) so that we can draw samples

- Estimate expectation of functions under p(x), e.g.,

Φ = Z

φ(x)p(x)dx (22)

• Focus on sampling problems as the expectation can be estimated by drawing random samples {x^(r)}.

Φ =ˆ 1 R

X

r

φ(x^(r)) (23)

(16)

• As R → ∞, Φ → Φˆ since the variance σ² =

Z

(φ(x) − Φ)²dx (24)

decreases as ^σ_R².

• Good news: the accuracy of Monte Carlo estimate in (23) is independent of the dimensionality of the space sampled!

• Bad news: It is difficult to draw independent samples in the high dimensional space.

(17)

Why Sampling p(x) Is Difficult?

• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p^∗(x).

p(x) = p^∗(x)/Z (25)

• Two difficulties in evaluating p^∗(x).

- Typically we do not know Z

Z = Z

p^∗(x)dx (26)

- Even if we know Z, it is difficult to draw samples to well represent or cover p(x) in the high dimensional space.

(18)

• Example:

• Decompose x = (x₁, . . . , x_d) in every dimension.

• Discretize x and ask for samples from discrete probability distribution over a set of uniformly spaced points {x_i}, and

Z = P

i p^∗(x_i)

p(x_i) = p^∗(x_i)/Z (27)

• Suppose we draw 50 samples uniformly spaced in 1-dimensional space, we need 50¹⁰⁰⁰ samples in 1000-dimensional space!

(19)

• Even if we draw 2 samples in each dimension, we still need 2¹⁰⁰⁰ samples in 1000-dimensional space.

• Related to Ising model, Boltzmann machine and Markov field.

• See MacKay for more detail on the number of samples are required to have a good approximation.

(20)

Importance Sampling

• Recall we evaluate p^∗(x) and evaluate with p^∗: p(x) = p^∗(x)/Z.

• p is often complicated and difficult to draw samples from.

• Proposal density function: Assume that we have a simpler density q(x) which we can evaluate with a multiplicative constant q^∗(x), where

q(x) = q^∗/Z_q, and from which we can generate samples.

• Introduce weights to adjust the “importance” of each sample w_r = p^∗(x^(r))

q^∗(x^(r)) (28)

(21)

, and

Φ =ˆ P

r w_rφ(x^(r)) P

r w_r (29)

• It can be shown that Φˆ converges to Φ, the mean value of φ(x) as R increases (under some constraints).

• Problem: difficult to estimate how reliable Φˆ is.

• Examples of proposal functions: Gaussian and Cauchy distributions

(p(x) ∼ ¹

πγ

h 1 + (^x−x_γ ⁰)² ⁱ where γ is a scale )

(22)

• The results suggest we should use “heavy tailed” importance sampler

• Heavy tailed: a high proportion of the population is comprised of extreme values.

Left: Gaussian Right: Cauchy

(23)

Rejection Sampling

• Further assume that the proposal function q

cq^∗(x) > p^∗(x) ∀x (30)

• Steps:

- First generate x from q(x) and evaluated with cq^∗(x)

- Second generate uniformly distributed variable u from the interval [0, cq^∗(x)].

- If u > p^∗(x), then x is rejected. Otherwise, x is added to out set of samples {x^(r)}.

(24)

Metropolis Sampling

• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).

• The Metropolis algorithm makes use of q(x) which depends on the current state x^(t).

• Example: q(x⁰; x^(t)) may be a simple Gaussian distribution centered at x^(t).

• A tentative state x⁰ is generated from the proposal density q(x⁰; x^(t)).

(25)

• Compute

a = ^p^∗^(x⁰⁾

p^∗(x^(t))

q(x^(t);x⁰) q(x⁰;x^(t))

If a ≥ 1, then the new state is accepted

Otherwise, the new state is accepted with probability a

(31)

• If the step is accepted, we set x^(t+1) = x⁰. Otherwise, we set x^(t+1) = x^(t).

• We need to compute ^p^∗^(x⁰⁾

p^∗(x^(t)) and ^q(x^(t)^;x⁰⁾

q(x⁰;x^(t))

• If proposal density is a simple symmetric density as a Gaussian, then the latter factor is unity and the Metropolis algorithm simply involves comparing the value of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings algorithm.

• Metropolis method is an example of Markov Chain Monte Carlo (MCMC) method.

(26)

• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence of {x^(r)} is generated where each sample x^(t) having a probability distribution that depends on the previous value, x^(t−1).

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problems: slow convergence

• Many methods have been proposed for speed up.