Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Announcements
• Project midterm presentation: Nov 22
• Reading (due Nov 29):
- Freeman et al.: Application of belief propagation to super resolution.
W. Freeman, E. Pasztor and O. Carmichael. Learning low-level vision. International Journal of Computer Vision, vol. 40, no. 1, pages 25–47, 2000.
• Supplementary reading:
- David MacKay. Introduction to Monte Carlo methods.
- Zhu, Dellaert, and Tu, ICCV 05 Tutorial: Markov Chain Monte Carlo for Computer Vision
Overview
• Markov Chain Monte Carlo (MCMC)
• Variational inference
• Belief propagation, loopy belief propagation
• Gaussian process, Gaussian process latent variable model
• Applications
Markov Chain Monte Carlo (MCMC)
• Motivation: it is difficult to compute the joint distribution p(x) exactly.
• Name of the game:
- Draw samples by running a cleverly constructed Markov chain for a long time.
- Monte Carlo integration draws samples from the distribution, and then forms sample average to approximate expectations.
• Goals:
- Aim to approximate the joint distribution p(x) so that we can draw samples.
- Estimate expectations of functions under p(x), e.g.,
Φ = ∫ φ(x) p(x) dx (1)
• Focus on the sampling problem, since the expectation can then be estimated from the drawn samples {x^(r)}:
Φ̂ = (1/R) Σ_r φ(x^(r)) (2)
• As R → ∞, Φ̂ → Φ, since the variance of Φ̂ is σ²/R, where
σ² = ∫ (φ(x) − Φ)² p(x) dx (3)
• Good news: the accuracy of the Monte Carlo estimate in (2) is independent of the dimensionality of the space sampled!
• Bad news: it is difficult to draw independent samples in high-dimensional spaces.
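To make the estimator in (2) concrete, here is a minimal numpy sketch for a toy one-dimensional case where p(x) can be sampled directly; the target N(0, 1), the test function φ(x) = x², and the sample size are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy case: p(x) is a standard Gaussian we can sample directly,
# and phi(x) = x^2, so the true expectation is E[x^2] = 1.
R = 100_000
x = rng.standard_normal(R)        # x^(1), ..., x^(R) ~ p(x)
phi_hat = np.mean(x ** 2)         # eq. (2): (1/R) * sum_r phi(x^(r))

print(f"estimate = {phi_hat:.4f} (true value 1.0)")
# The standard error shrinks like sigma / sqrt(R), regardless of
# the dimensionality of x -- the "good news" above.
```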
Why Sampling?
• Non-parametric approach.
• Versatile: accommodates arbitrary densities.
• Easy to analyze and visualize.
• Memory requirement is O(N), where N is the number of samples.
• In high dimensional space, sampling is a key step for:
- modeling: simulation, synthesis.
- learning: estimating parameters.
Why Is Sampling p(x) Difficult?
• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p∗(x).
p(x) = p∗(x)/Z (4)
• Two difficulties in sampling from p(x):
- Typically we do not know the normalizing constant
Z = ∫ p∗(x) dx (5)
- Even if we knew Z, it would still be difficult to draw samples that represent or cover p(x) well in a high-dimensional space.
• Example:
• Decompose x = (x1, . . . , xd) into its d dimensions.
• Discretize each dimension and ask for samples from the discrete probability distribution over a set of uniformly spaced points {xi}, with
Z = Σ_i p∗(xi) and p(xi) = p∗(xi)/Z (6)
• If we draw 50 uniformly spaced samples in a 1-dimensional space, we would need 50^1000 samples in a 1000-dimensional space!
• Even if we draw only 2 samples in each dimension, we still need 2^1000 samples in a 1000-dimensional space.
• Related to the Ising model, Boltzmann machines, and Markov random fields.
• See MacKay for more detail on how many samples are required for a good approximation.
Importance Sampling
• Recall that we evaluate the target p(x) only within a multiplicative constant, via p∗: p(x) = p∗(x)/Z.
• p is often complicated and difficult to draw samples from.
• Proposal density: assume that we have a simpler density q(x), which we can evaluate within a multiplicative constant via q∗(x), where q(x) = q∗(x)/Zq, and from which we can generate samples.
• Introduce weights to adjust the “importance” of each sample:
wr = p∗(x^(r)) / q∗(x^(r)) (7)
and estimate
Φ̂ = Σ_r wr φ(x^(r)) / Σ_r wr (8)
• It can be shown that Φ̂ converges to Φ, the expectation of φ(x), as R increases (under some constraints).
• Problem: it is difficult to estimate how reliable Φ̂ is.
• Examples of proposal functions: Gaussian and Cauchy distributions
(the Cauchy density is p(x) = 1 / (πγ [1 + ((x − x0)/γ)²]), where γ is a scale parameter)
• The results suggest we should use a “heavy-tailed” importance sampler.
• Heavy-tailed: a high proportion of the population consists of extreme values.
[Figure: Gaussian (left) vs. Cauchy (right) proposal densities]
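As a concrete illustration of (7) and (8), here is a minimal numpy/scipy sketch with a heavy-tailed Cauchy proposal; the unnormalized target p∗(x) = exp(−x⁴) is an assumed toy example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_star(x):                    # target known only up to a constant Z
    return np.exp(-x ** 4)

# Heavy-tailed Cauchy proposal q(x): easy to sample and to evaluate.
R = 100_000
x = stats.cauchy.rvs(size=R, random_state=rng)
w = p_star(x) / stats.cauchy.pdf(x)        # importance weights, eq. (7)

# Self-normalized estimate of E[phi(x)] under p, eq. (8), with phi(x) = x^2.
phi_hat = np.sum(w * x ** 2) / np.sum(w)
print(f"E[x^2] estimate: {phi_hat:.4f}")   # analytic value: Gamma(3/4)/Gamma(1/4) ~ 0.338
```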
Rejection Sampling
• Further assume that for the proposal density q we can find a constant c such that
c q∗(x) > p∗(x) ∀x (9)
• Steps:
- Generate x from q(x) and evaluate c q∗(x).
- Generate a uniformly distributed value u from the interval [0, c q∗(x)].
- Evaluate p∗(x); accept the sample if u ≤ p∗(x), otherwise reject it.
• Problem: need to pick an appropriate value of c.
• In general, c grows exponentially with the dimensionality.
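A minimal sketch of these steps, again with the assumed toy target p∗(x) = exp(−x⁴), a unit Gaussian proposal, and a hand-picked bound c = 3 (the ratio p∗/q peaks near 2.67 for this pair):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_star(x):                    # unnormalized target
    return np.exp(-x ** 4)

q = stats.norm(0, 1)              # proposal we can sample from
c = 3.0                           # c*q(x) >= p*(x) everywhere for this pair

samples = []
while len(samples) < 5000:
    x = q.rvs(random_state=rng)               # 1. generate x from q
    u = rng.uniform(0, c * q.pdf(x))          # 2. draw u uniformly in [0, c*q(x)]
    if u <= p_star(x):                        # 3. accept iff u lies under p*(x)
        samples.append(x)

samples = np.asarray(samples)     # independent draws from p(x) = p*(x)/Z
```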
Markov Chain
• A set of states, S = {s1, s2, . . . , sN}.
• The probability of moving from state si to state sj in one step is pij
• Transition matrix for a weather chain with states R (rain), N (nice), S (sunny):
P =
| 1/2 1/4 1/4 |
| 1/2  0  1/2 |
| 1/4 1/4 1/2 | (10)
• Define p^(n)_ij as the probability of reaching sj from si in n steps, e.g., p^(2)_13 = p11 p13 + p12 p23 + p13 p33. In general,
p^(2)_ij = Σ_k p_ik p_kj (11)
P² =
| 0.438 0.188 0.375 |
| 0.375 0.250 0.375 |
| 0.375 0.188 0.438 | (12)
P³ =
| 0.406 0.203 0.391 |
| 0.406 0.188 0.406 |
| 0.391 0.203 0.406 | (13)
P⁴ =
| 0.402 0.199 0.398 |
| 0.398 0.203 0.398 |
| 0.398 0.199 0.402 | (14)
P⁵ =
| 0.400 0.200 0.399 |
| 0.400 0.199 0.400 |
| 0.399 0.200 0.400 | (15)
P⁶ =
| 0.400 0.200 0.400 |
| 0.400 0.200 0.400 |
| 0.400 0.200 0.400 | (16)
P⁷ and P⁸ (equations (17) and (18)) are identical to P⁶ to three decimal places.
• No matter where we start, after six days the probability of a rainy day is 0.4, of a nice day 0.2, and of a sunny day 0.4.
Theorem 1. Let P be the transition matrix of a Markov chain. The ij-th entry p^(n)_ij of the matrix P^n gives the probability that the Markov chain, starting in state si, will be in state sj after n steps.
Definition 1. A Markov chain is called an ergodic chain if it is possible to go from every state to every other state (not necessarily in one move).
Definition 2. A Markov chain is called a regular chain if some power of its transition matrix has only positive elements.
Theorem 2. Let P be the transition matrix for a regular chain. Then, as n → ∞, the powers P^n approach a limiting matrix W with all rows equal to the same vector w. The vector w is a strictly positive probability vector (i.e., the components are all positive and they sum to one).
Theorem 3. Let P be a regular transition matrix, let W = lim_{n→∞} P^n, let w be the common row of W, and let c be the column vector all of whose entries are 1. Then:
1. wP = w, and any row vector v such that vP = v is a constant multiple of w.
2. Pc = c, and any column vector x such that Px = x is a multiple of c.
Definition 3. A row vector w with the property wP = w is called a fixed row vector for P. Similarly, a column vector x such that P x = x is called a fixed column vector for P.
• In other words, a fixed row vector is a left eigenvector of the matrix P
corresponding to the eigenvalue 1.
           | 1/2 1/4 1/4 |
[w1 w2 w3] | 1/2  0  1/2 | = [w1 w2 w3] (19)
           | 1/4 1/4 1/2 |
Solving this linear system, we get
w = [0.4 0.2 0.4]
• In general, we solve
wP = wI where I is the identity matrix, or equivalently
w(P − I) = 0
Theorem 4. For an ergodic Markov chain, there is a unique probability vector w such that wP = w and w is strictly positive. Any row vector such that vP = v is a multiple of w. Any column vector x such that P x = x is a constant vector.
• Subject to regularity conditions, the chain will gradually “forget” its initial state, and p^(t)(·|s0) will eventually converge to a unique stationary (or invariant) distribution which does not depend on t or s0.
• Detailed balance: a sufficient condition for p to be the stationary distribution is
w(x′; x) p(x) = w(x; x′) p(x′), ∀x and x′
where w(x′; x) denotes the transition probability from x to x′.
• The period until the chain converges to the stationary distribution is called the burn-in time.
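As a quick numerical check of Theorems 2–4 on the weather chain above, the following numpy sketch computes a high power of P and solves w(P − I) = 0 via the left eigenvector for eigenvalue 1:

```python
import numpy as np

# Transition matrix from (10), states ordered R, N, S.
P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])

# Theorem 2: powers of P approach a matrix with identical rows.
print(np.linalg.matrix_power(P, 6).round(3))   # every row -> [0.4 0.2 0.4]

# Theorem 4: w is the left eigenvector of P with eigenvalue 1,
# i.e., the solution of w(P - I) = 0, normalized to sum to one.
vals, vecs = np.linalg.eig(P.T)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
w /= w.sum()
print(w.round(3))                              # [0.4 0.2 0.4]
```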
PageRank
• PageRank: Suppose we have a set of four web pages, A, B, C, and D as depicted above, where B, C, and D have 2, 1, and 3 outbound links, respectively. The PageRank (PR) of A is
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
or, with L(·) denoting a page's number of outbound links,
PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) (20)
• Random surfer: Markov process
• The PR values are the entries of the dominant eigenvector of the modified adjacency matrix. The dominant (i.e., first) eigenvector is
R =
| PR(p1) |
| PR(p2) |
|  ...   |
| PR(pN) | (22)

where R is the solution of the system

    | q/N |           | l(p1, p1) l(p1, p2) ... l(p1, pN) |
R = | q/N | + (1 − q) | l(p2, p1)    ...                  | R (23)
    | ... |           | l(pN, p1)    ...      l(pN, pN)   |
where l(pi, pj) is an adjacency function: l(pi, pj) = 0 if page pj does not link to pi, and it is normalized such that, for each j, Σ_{i=1}^{N} l(pi, pj) = 1, i.e., the elements of each column sum up to 1.
• Related to random walk, Markov process and spectral clustering
• Can be seen as a particular dynamical system that seeks an equilibrium in the state space.
• L. Page and S. Brin. “PageRank: an eigenvector based ranking approach for hypertext.” In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.
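To make the system (23) concrete, here is a minimal power-iteration sketch; the 4-page link structure and the damping value q = 0.15 are made-up assumptions for illustration:

```python
import numpy as np

# Hypothetical column-stochastic link matrix l(pi, pj): column j spreads
# page j's rank evenly over the pages it links to (columns sum to 1).
L = np.array([[0.0, 0.5, 1.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.0, 0.5, 0.0, 0.0]])

q, N = 0.15, 4
R = np.full(N, 1.0 / N)                 # start from the uniform rank vector

for _ in range(100):                    # iterate eq. (23) to its fixed point
    R = q / N + (1 - q) * L @ R

print(R.round(3), R.sum())              # PR values; they sum to 1
```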
Metropolis Sampling
• Importance and rejection sampling only work well if the proposal density q(x) is a good approximation to p(x).
• The Metropolis algorithm instead uses a proposal density q(x′; x^(t)) which depends on the current state x^(t).
• Example: q(x′; x^(t)) may be a simple Gaussian distribution centered on x^(t).
• A tentative state x′ is generated from the proposal density q(x′; x^(t)).
• Compute the acceptance ratio
a = [p∗(x′) / p∗(x^(t))] · [q(x^(t); x′) / q(x′; x^(t))] (24)
If a ≥ 1, the new state is accepted; otherwise, the new state is accepted with probability a.
• If the step is accepted, we set x^(t+1) = x′. Otherwise, we set x^(t+1) = x^(t).
• We need to compute the ratios p∗(x′)/p∗(x^(t)) and q(x^(t); x′)/q(x′; x^(t)).
• If the proposal density is a simple symmetric density, such as a Gaussian, then the latter factor is unity, and the Metropolis algorithm simply involves comparing the values of the target density at the two points.
• The general algorithm for asymmetric q is called Metropolis-Hastings
• Widely used for high dimensional problems.
• Has been applied to vision problems with good success in image segmentation, recognition, etc.
• Involves a Markov process in which a sequence {x^(r)} is generated, each sample x^(t) having a probability distribution that depends on the previous value x^(t−1).
• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).
• Random walk: small or large steps?
• Problems: slow convergence
• Many methods have been proposed to speed it up.
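A minimal sketch of the Metropolis algorithm with a symmetric Gaussian proposal, so the q-ratio in (24) is unity; the target p∗(x) = exp(−x⁴), step size, and burn-in length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):                     # unnormalized target (toy example)
    return np.exp(-x ** 4)

T, step = 50_000, 1.0
x = 0.0
chain = np.empty(T)
for t in range(T):
    x_new = x + step * rng.standard_normal()   # symmetric proposal q(x'; x)
    a = p_star(x_new) / p_star(x)              # eq. (24); q-ratio is 1 here
    if rng.uniform() < a:                      # accept with prob. min(1, a)
        x = x_new
    chain[t] = x                               # on rejection, repeat old state

burn_in = 1_000                    # discard samples before convergence
print(chain[burn_in:].var())       # correlated-sample estimate of E[x^2]
```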
Gibbs Sampling
• Also known as heat bath method.
• Can be viewed as a Metropolis method in which the proposal distribution Q is defined in terms of the conditional distribution of the joint distribution p(x).
• Assume that while p(x) is too complex to draw samples from directly, its conditional distributions p(xi | {xj}_{j≠i}) are tractable to work with.
• In the general case of k variables, a single iteration involves sampling each variable in turn, conditioned on the most recent values of the others:
x1^(t+1) ∼ p(x1 | x2^(t), x3^(t), . . . , xk^(t))
x2^(t+1) ∼ p(x2 | x1^(t+1), x3^(t), . . . , xk^(t))
x3^(t+1) ∼ p(x3 | x1^(t+1), x2^(t+1), x4^(t), . . . , xk^(t))
. . .
• Gibbs sampling suffers from the same defect as simple Metropolis algorithms: the space is explored by a slow random walk, unless a fortuitous parameterization has been chosen that makes the distribution separable.
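A minimal sketch for a toy bivariate Gaussian with correlation ρ, chosen because both conditionals are one-dimensional Gaussians; it stands in for a p(x) whose joint is hard to sample but whose conditionals are tractable:

```python
import numpy as np

rng = np.random.default_rng(0)

rho, T = 0.9, 20_000
s = np.sqrt(1 - rho ** 2)          # std dev of each Gaussian conditional
x1 = x2 = 0.0
chain = np.empty((T, 2))
for t in range(T):
    x1 = rng.normal(rho * x2, s)   # x1^(t+1) ~ p(x1 | x2^(t))
    x2 = rng.normal(rho * x1, s)   # x2^(t+1) ~ p(x2 | x1^(t+1))
    chain[t] = x1, x2

print(np.corrcoef(chain[1_000:].T)[0, 1])   # ~ rho after burn-in
```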
MCMC Applications
• Image parsing [Tu et al. ICCV 03]
- Analyze an image with a set of pre-defined vocabulary: faces, text, and generic regions.
- Each vocabulary item is parameterized.
- Instead of a naive proposal density function, use detectors (based on image content) to obtain better proposal functions.
- Decompose the solution space into a union of many subspaces.
- Explore the solution space by designing efficient Markov chains and sampling the posterior probabilities.
• Human pose estimation [Lee and Cohen CVPR 02]
• Visual tracking
• Structure from motion
Variational Inference
• Exact Bayesian inference is intractable.
• Markov Chain Monte Carlo:
- computationally expensive
- convergence issues
• Variational inference:
- a broadly applicable deterministic approximation
- let θ denote all latent variables and parameters
- approximate the true posterior p(θ|D) using a simple distribution q(θ)
- minimize the Kullback-Leibler divergence KL(q||p)
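A minimal sketch of this idea for a one-dimensional toy posterior: fit q(θ) = N(μ, σ²) by minimizing a Monte Carlo estimate of KL(q||p), which requires only the unnormalized log posterior; the target, sample size, and optimizer below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_p_star(theta):             # unnormalized log posterior (toy example)
    return -theta ** 4

# KL(q || p) = E_q[log q - log p*] + log Z; the constant log Z does not
# affect the minimizer. Estimate the expectation with fixed base noise.
eps = rng.standard_normal(5_000)

def kl_up_to_const(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps       # reparameterized samples from q
    return np.mean(norm.logpdf(theta, mu, sigma) - log_p_star(theta))

res = minimize(kl_up_to_const, x0=[0.5, 0.0], method="Nelder-Mead")
mu, sigma = res.x[0], np.exp(res.x[1])
print(f"q(theta) = N({mu:.3f}, {sigma:.3f}^2)")
```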