Advanced Topics in Learning and Vision




Ming-Hsuan Yang



• Project midterm presentation: Nov 22

• Reading (due Nov 29):

- Freeman et al.: application of belief propagation to super-resolution.

W. Freeman, E. Pasztor and O. Carmichael. Learning low-level vision. International Journal of Computer Vision, vol. 40, no. 1, pages 25-47, 2000.

• Supplementary reading:

- David MacKay. Introduction to Monte Carlo methods.

- Zhu, Dellaert, and Tu, ICCV 2005 tutorial: Markov Chain Monte Carlo for Computer Vision.



• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loopy belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications


Markov Chain Monte Carlo (MCMC)

• Motivation: it is often difficult to compute the joint distribution exactly.

• Name of the game:

- Draw samples by running a cleverly constructed Markov chain for a long time.

- Monte Carlo integration draws samples from the distribution, and then forms sample average to approximate expectations.

• Goals:

- Aim to approximate the joint distribution p(x) so that we can draw samples.

- Estimate expectations of functions φ(x) under p(x), e.g.,

Φ = ∫ φ(x) p(x) dx (1)

• Focus on the sampling problem, since the expectation can then be estimated from random samples {x(r)}:

Φ̂ = (1/R) Σr φ(x(r)) (2)

• As R → ∞, Φ̂ → Φ, since the variance of the estimate Φ̂ decreases as σ²/R, where

σ² = ∫ p(x) (φ(x) − Φ)² dx (3)

• Good news: the accuracy of the Monte Carlo estimate in (2) is independent of the dimensionality of the sampled space!

Bad news: it is difficult to draw independent samples in high-dimensional spaces.
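As a concrete illustration of the sample average in (2), here is a minimal Python sketch (not from the lecture); the target (a standard normal), the function φ(x) = x², and the sample size are all assumptions chosen so the true value E[x²] = 1 is known:

```python
import random

def monte_carlo_expectation(phi, sampler, R):
    # Phi_hat = (1/R) * sum_r phi(x^(r)), as in eq. (2)
    return sum(phi(sampler()) for _ in range(R)) / R

random.seed(0)
# Assumed example: p(x) is a standard normal and phi(x) = x^2,
# so the true expectation is E[x^2] = 1.
est = monte_carlo_expectation(lambda x: x * x,
                              lambda: random.gauss(0.0, 1.0),
                              100_000)
print(est)   # close to 1.0
```

Note the estimator never touches the dimensionality of x explicitly, which is the "good news" above; the difficulty is obtaining the samples in the first place.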


Why Sampling?

• Non-parametric approach.

• Versatile: accommodates arbitrary densities.

• Easy for analysis and visualization.

• Memory requirements: O(N), where N is the number of samples.

• In high dimensional space, sampling is a key step for:

- modeling: simulation, synthesis.

- learning: estimating parameters.


Why Is Sampling from p(x) Difficult?

• Assume that the target (but unknown) density p(x) can be evaluated, up to a multiplicative constant, through an unnormalized function p*(x):

p(x) = p*(x)/Z (4)

• Two difficulties in evaluating p(x):

- Typically we do not know the normalizing constant

Z = ∫ p*(x) dx (5)

- Even if we knew Z, it would still be difficult to draw samples that represent or cover p(x) well in a high-dimensional space.


• Example:

• Decompose x = (x1, . . . , xd) into its components.

• Discretize each dimension and ask for samples from the discrete probability distribution over a set of uniformly spaced points {xi}, with

Z = Σi p*(xi)

p(xi) = p*(xi)/Z (6)

• If we draw 50 uniformly spaced samples in a 1-dimensional space, we need 50^1000 samples in a 1000-dimensional space!


• Even if we draw only 2 samples per dimension, we still need 2^1000 samples in a 1000-dimensional space.

• Related to the Ising model, Boltzmann machines and Markov random fields.

• See MacKay for more detail on the number of samples required for a good approximation.


Importance Sampling

• Recall that we can evaluate p(x) only up to a constant, via p*(x): p(x) = p*(x)/Z.

• p(x) is often complicated and difficult to draw samples from.

• Proposal density function: assume that we have a simpler density q(x), which we can evaluate up to a multiplicative constant through q*(x), where q(x) = q*(x)/Zq, and from which we can generate samples.

• Introduce weights to adjust the "importance" of each sample:

wr = p*(x(r)) / q*(x(r)) (7)

and

Φ̂ = Σr wr φ(x(r)) / Σr wr (8)

• It can be shown that Φ̂ converges to Φ, the mean value of φ(x), as R increases (under some constraints).

• Problem: it is difficult to estimate how reliable Φ̂ is.

• Examples of proposal functions: Gaussian and Cauchy distributions

p(x) ∝ 1 / [1 + ((x − x0)/γ)²], where γ is a scale parameter

• The results suggest we should use a "heavy-tailed" importance sampler.

• Heavy-tailed: a high proportion of the mass lies at extreme values.

(Figure) Left: Gaussian proposal. Right: Cauchy proposal.
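The self-normalized estimator of (7)-(8) with a heavy-tailed Cauchy proposal can be sketched as follows. This is an illustrative Python snippet, not the lecture's code; the unnormalized Gaussian target, the function φ(x) = x², and the sample size are assumptions:

```python
import math
import random

def importance_sampling(phi, p_star, q_star, q_sample, R):
    # Phi_hat = sum_r w_r phi(x^(r)) / sum_r w_r, with w_r = p*(x^(r)) / q*(x^(r))
    num = den = 0.0
    for _ in range(R):
        x = q_sample()
        w = p_star(x) / q_star(x)    # importance weight, eq. (7)
        num += w * phi(x)
        den += w
    return num / den                 # self-normalized estimate, eq. (8)

random.seed(0)
p_star = lambda x: math.exp(-0.5 * x * x)       # unnormalized standard normal
q_star = lambda x: 1.0 / (1.0 + x * x)          # unnormalized Cauchy (gamma = 1)
q_sample = lambda: math.tan(math.pi * (random.random() - 0.5))  # Cauchy draw
est = importance_sampling(lambda x: x * x, p_star, q_star, q_sample, 200_000)
print(est)   # close to E[x^2] = 1 under the normal target
```

The Cauchy proposal keeps the weights p*(x)/q*(x) bounded, which is exactly the heavy-tail advantage discussed above; a light-tailed proposal could produce occasional enormous weights and an unreliable Φ̂.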


Rejection Sampling

• Further assume that the proposal function q satisfies

c q*(x) > p*(x), ∀x (9)

for some known constant c.

• Steps:

- Generate x from q(x) and evaluate c q*(x).

- Generate u uniformly from [0, c q*(x)].

- Accept x if u ≤ p*(x); otherwise reject and repeat.

• Problem: need to pick the right value of c.

• In general, c grows exponentially with the dimensionality.
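The steps above can be sketched in Python. The unnormalized Gaussian target and Cauchy proposal are assumed for illustration; the bound c = 1.3 satisfies (9) here because p*(x)/q*(x) = exp(−x²/2)(1 + x²) peaks at |x| = 1 with value 2/√e ≈ 1.21:

```python
import math
import random

def rejection_sample(p_star, q_star, q_sample, c):
    # Draw x from q; accept with probability p*(x) / (c q*(x)); repeat on reject.
    while True:
        x = q_sample()
        u = random.uniform(0.0, c * q_star(x))   # uniform under the envelope
        if u <= p_star(x):                       # falls under p*: accept
            return x

random.seed(0)
p_star = lambda x: math.exp(-0.5 * x * x)        # unnormalized standard normal
q_star = lambda x: 1.0 / (1.0 + x * x)           # unnormalized Cauchy
q_sample = lambda: math.tan(math.pi * (random.random() - 0.5))
c = 1.3                                          # envelope constant for eq. (9)
samples = [rejection_sample(p_star, q_star, q_sample, c) for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(mean)   # near 0 for the symmetric target
```

The acceptance rate is roughly Z/(c Zq), so as c grows with dimension, almost every proposal is rejected, which is the problem noted above.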


Markov Chain

• A set of states, S = {s1, s2, . . . , sN}.

• The probability of moving from state si to state sj in one step is pij

• Transition matrix for the states R (rain), N (nice), S (sunny):

P =
[ 1/2  1/4  1/4 ]
[ 1/2   0   1/2 ] (10)
[ 1/4  1/4  1/2 ]

• Define p(n)ij as the probability of moving from state si to state sj in n steps, e.g., p(2)13 = p11p13 + p12p23 + p13p33

p(2) = P²:

P² =
[ 0.500  0.250  0.250 ] [ 0.500  0.250  0.250 ]   [ 0.438  0.188  0.375 ]
[ 0.500  0.000  0.500 ] [ 0.500  0.000  0.500 ] = [ 0.375  0.250  0.375 ] (12)
[ 0.250  0.250  0.500 ] [ 0.250  0.250  0.500 ]   [ 0.375  0.188  0.438 ]

P³ =
[ 0.406  0.203  0.391 ]
[ 0.406  0.188  0.406 ] (13)
[ 0.391  0.203  0.406 ]

P⁴ =
[ 0.402  0.199  0.398 ]
[ 0.398  0.203  0.398 ] (14)
[ 0.398  0.199  0.402 ]

P⁵ =
[ 0.400  0.200  0.399 ]
[ 0.400  0.199  0.400 ] (15)
[ 0.399  0.200  0.400 ]

P⁶ =
[ 0.400  0.200  0.400 ]
[ 0.400  0.200  0.400 ] (16)
[ 0.400  0.200  0.400 ]

P⁷ =
[ 0.400  0.200  0.400 ]
[ 0.400  0.200  0.400 ] (17)
[ 0.400  0.200  0.400 ]

P⁸ =
[ 0.400  0.200  0.400 ]
[ 0.400  0.200  0.400 ] (18)
[ 0.400  0.200  0.400 ]

• No matter where we start, after about 6 days the probability of a rainy day is 0.4, of a nice day is 0.2, and of a sunny day is 0.4.

Theorem 1. Let P be the transition matrix of a Markov chain. The ij-th entry p(n)ij of the matrix Pⁿ gives the probability that the Markov chain, starting in state si, will be in state sj after n steps.

Definition 1. A Markov chain is called an ergodic chain if it is possible to go from every state to every other state (not necessarily in one move). It is called a regular chain if some power of the transition matrix has only strictly positive entries.


Theorem 2. Let P be the transition matrix for a regular chain. Then, as n → ∞, the powers Pⁿ approach a limiting matrix W with all rows equal to the same vector w. The vector w is a strictly positive probability vector (i.e., the components are all positive and they sum to one).

Theorem 3. Let P be a regular transition matrix, let W = limn→∞ Pⁿ, let w be the common row of W, and let c be the column vector all of whose components are 1. Then:

1. wP = w, and any row vector v such that vP = v is a constant multiple of w.

2. Pc = c, and any column vector x such that Px = x is a multiple of c.

Definition 3. A row vector w with the property wP = w is called a fixed row vector for P. Similarly, a column vector x such that Px = x is called a fixed column vector for P.

• In other words, a fixed row vector is a left eigenvector of the matrix P corresponding to the eigenvalue 1.

[w1 w2 w3]
[ 1/2  1/4  1/4 ]
[ 1/2   0   1/2 ] = [w1 w2 w3] (19)
[ 1/4  1/4  1/2 ]

Solving this linear system (together with w1 + w2 + w3 = 1), we get

w = [0.4 0.2 0.4]

• In general, we solve

wP = wI, where I is the identity matrix, or equivalently

w(P − I) = 0
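The fixed row vector can also be checked numerically by powering the transition matrix of (10), as in Theorem 2. A minimal pure-Python sketch (not part of the lecture):

```python
def mat_mult(A, B):
    # Product of two square matrices stored as lists of rows.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Weather-chain transition matrix from eq. (10); the states are R, N, S.
P = [[0.50, 0.25, 0.25],
     [0.50, 0.00, 0.50],
     [0.25, 0.25, 0.50]]

Pn = P
for _ in range(20):          # compute P^21, well past convergence
    Pn = mat_mult(Pn, P)

print(Pn[0])   # every row approaches the fixed row vector w = [0.4, 0.2, 0.4]
```

Every row of the limiting matrix is the same vector w, matching the hand-computed sequence P² through P⁸ above.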


Theorem 4. For an ergodic Markov chain, there is a unique probability vector w such that wP = w and w is strictly positive. Any row vector such that vP = v is a multiple of w. Any column vector x such that P x = x is a constant vector.

• Subject to regularity conditions, the chain will gradually "forget" its initial state, and p(t)(·|s0) will eventually converge to a unique stationary (or invariant) distribution, which depends on neither t nor s0.

• Detailed balance (writing T for the transition kernel):

T(x′; x) p(x) = T(x; x′) p(x′), ∀x, x′

• The period until the chain converges to the stationary distribution is called the burn-in time.



• PageRank: suppose we have a set of four web pages, A, B, C, and D, as depicted above. If B, C, and D have 2, 1, and 3 outbound links respectively, the PageRank (PR) of A is

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

and in general

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) (20)

where L(p) is the number of outbound links of page p.

• Random surfer: Markov process


• The PR values are the entries of the dominant eigenvector of the modified adjacency matrix. The dominant (i.e., first) eigenvector is

R =
[ PR(p1) ]
[ PR(p2) ]
[  ...   ]
[ PR(pN) ]

where R is the solution of the system

R =
[ q/N ]           [ l(p1, p1)  l(p1, p2)  . . .  l(p1, pN) ]
[ q/N ] + (1 − q) [ l(p2, p1)     . . .                    ] R (23)
[ ... ]           [ l(pN, p1)     . . .         l(pN, pN)  ]

where l(pi, pj) is an adjacency function: l(pi, pj) = 0 if page pj does not link to pi, and the entries are normalized such that for each j, Σi l(pi, pj) = 1, i.e., the elements of each column sum to 1.

• Related to random walk, Markov process and spectral clustering


• Can be seen as a particular dynamic system that is looking for equilibrium in the state space.

• L. Page and S. Brin, "PageRank: an eigenvector based ranking approach for hypertext," In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.
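The fixed point of (23) can be found by simple iteration. The sketch below is illustrative only; since the lecture's four-page figure is not reproduced here, the link structure (B, C, D each linking only to A, and A linking only to B) is a hypothetical example:

```python
def pagerank(l, q=0.15, iters=200):
    # Fixed-point iteration on R = (q/N) 1 + (1 - q) L R, as in eq. (23).
    # l[i][j] = l(p_i, p_j), with each column summing to 1.
    N = len(l)
    R = [1.0 / N] * N
    for _ in range(iters):
        R = [q / N + (1.0 - q) * sum(l[i][j] * R[j] for j in range(N))
             for i in range(N)]
    return R

# Hypothetical 4-page web: B, C, D each link only to A; A links only to B.
# Column j describes the outgoing links of page j (columns sum to 1).
L = [[0.0, 1.0, 1.0, 1.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
R = pagerank(L)
print(R)   # page A collects the largest PageRank
```

Because the column-stochastic matrix preserves the total mass, the entries of R remain a probability vector, consistent with the stationary-distribution view of the random surfer.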


Metropolis Sampling

• Importance and rejection sampling only work well if the proposal function q(x) is a good approximation to p(x).

• The Metropolis algorithm instead uses a proposal density q(x′; x(t)) which depends on the current state x(t).

• Example: q(x′; x(t)) may be a simple Gaussian distribution centered at x(t).

• A tentative state x′ is generated from the proposal density q(x′; x(t)).


• Compute

a = [ p*(x′) / p*(x(t)) ] · [ q(x(t); x′) / q(x′; x(t)) ]

If a ≥ 1, the new state is accepted.

Otherwise, the new state is accepted with probability a.

• If the step is accepted, we set x(t+1) = x′. Otherwise, we set x(t+1) = x(t).

• We only need the ratios p*(x′)/p*(x(t)) and q(x(t); x′)/q(x′; x(t)), so the normalizing constant Z is not required.

• If the proposal density is a simple symmetric density such as a Gaussian, the latter factor is unity, and the Metropolis algorithm simply involves comparing the value of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings


• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence of states {x(t)} is generated, each sample x(t) having a probability distribution that depends on the previous value x(t−1).

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problems: slow convergence

• Many methods have been proposed for speed up.
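The Metropolis steps above can be sketched as follows. This is an assumed example, not the lecture's code: the target is an unnormalized standard normal, the symmetric Gaussian proposal makes the q-ratio unity, and the chain is deliberately started far from the mode to show burn-in:

```python
import math
import random

def metropolis(p_star, x0, steps, step_size=1.0):
    # Symmetric Gaussian proposal q(x'; x^(t)) = N(x^(t), step_size^2),
    # so the q-ratio is unity and a = p*(x') / p*(x^(t)).
    x = x0
    chain = []
    for _ in range(steps):
        x_new = random.gauss(x, step_size)
        a = p_star(x_new) / p_star(x)
        if a >= 1.0 or random.random() < a:
            x = x_new            # accept the tentative state
        chain.append(x)          # on rejection, x^(t+1) = x^(t)
    return chain

random.seed(0)
# Assumed target: unnormalized standard normal, started far from the mode.
chain = metropolis(lambda x: math.exp(-0.5 * x * x), x0=5.0, steps=50_000)
kept = chain[5_000:]             # discard the burn-in segment
mean = sum(kept) / len(kept)
print(mean)   # near 0 once the chain has forgotten x0
```

The successive samples are correlated, which is why many more iterations are needed than the effective number of independent samples, as noted above.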




Gibbs Sampling

• Also known as heat bath method.

• Can be viewed as a Metropolis method in which the proposal distribution Q is defined in terms of the conditional distributions of the joint distribution p(x).

• Assume that while p(x) is too complex to draw samples from, its conditional distributions p(xi | {xj}j≠i) are tractable to work with.

• In the general case of k variables, a single iteration involves sampling one variable at a time, conditioning on the most recent values of the others:

x1(t+1) ∼ p(x1 | x2(t), x3(t), . . . , xk(t))
x2(t+1) ∼ p(x2 | x1(t+1), x3(t), . . . , xk(t))
x3(t+1) ∼ p(x3 | x1(t+1), x2(t+1), x4(t), . . . , xk(t))
. . .


• Gibbs sampling suffers from the same defect as simple Metropolis algorithms: the space is explored by a slow random walk, unless a fortuitous parameterization has been chosen that makes the chain mix quickly.
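The component-wise updates above can be sketched for an assumed two-variable example, a standard bivariate Gaussian with correlation ρ, whose conditionals are themselves Gaussian and therefore easy to sample:

```python
import math
import random

def gibbs_bivariate_normal(rho, steps):
    # For a standard bivariate normal with correlation rho, the conditionals are
    # x1 | x2 ~ N(rho * x2, 1 - rho^2) and x2 | x1 ~ N(rho * x1, 1 - rho^2).
    sd = math.sqrt(1.0 - rho * rho)
    x1 = x2 = 0.0
    samples = []
    for _ in range(steps):
        x1 = random.gauss(rho * x2, sd)   # x1^(t+1) ~ p(x1 | x2^(t))
        x2 = random.gauss(rho * x1, sd)   # x2^(t+1) ~ p(x2 | x1^(t+1))
        samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_normal(rho=0.8, steps=50_000)
corr = sum(a * b for a, b in samples) / len(samples)
print(corr)   # empirical E[x1 x2] approaches rho = 0.8
```

With strong correlation (ρ near 1), the axis-aligned conditional steps are short, so the sampler random-walks slowly along the diagonal; this is exactly the defect noted above.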


MCMC Applications

• Image parsing [Zhu et al. ICCV 03]

- Analyze an image with a set of pre-defined vocabulary: faces, text, and generic regions.

- Each vocabulary item is parameterized.

- Instead of using naive proposal density function, use detectors (based on image contents) for better proposal functions

- Decompose the solution space into a union of many subspaces.

- Explore solution space by designing efficient Markov Chains and sample the posterior probabilities.


• Human pose estimation [Lee and Cohen CVPR 02]

• Visual tracking

• Structure from motion


Variational Inference

• Exact Bayesian inference is intractable.

• Markov Chain Monte Carlo:
- computationally expensive
- convergence issues

• Variational inference:
- broadly applicable deterministic approximation
- let θ denote all latent variables and parameters
- approximate the true posterior p(θ|D) using a simple distribution q(θ)
- minimize the Kullback-Leibler divergence KL(q||p)
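To make the objective KL(q||p) concrete, here is a small assumed example using the standard closed form for two 1-D Gaussians (a case where the divergence can be evaluated exactly, unlike the general posteriors targeted by variational inference):

```python
import math

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    # Closed-form KL(q || p) between 1-D Gaussians q = N(mu_q, sd_q^2)
    # and p = N(mu_p, sd_p^2).
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0: identical distributions
print(kl_gauss(1.0, 1.0, 0.0, 1.0))   # 0.5: penalty for the shifted mean
```

Variational inference adjusts the parameters of q (here, mean and standard deviation) to drive such a divergence to its minimum over the chosen family.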



