**Advanced Topics in Learning and Vision**

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

**Announcements**

• Project midterm presentation: Nov 22

• Reading (due Nov 29):

- Freeman et al.: Application of belief propagation to super-resolution.

*W. Freeman, E. Pasztor and O. Carmichael. Learning low-level vision. International*
*Journal of Computer Vision, vol. 40, no. 1, pages 25–47, 2000.*

• Supplementary reading:

- David MacKay. Introduction to Monte Carlo methods.

- Zhu, Dellaert, and Tu, ICCV 2005 Tutorial: Markov Chain Monte Carlo for Computer Vision

**Overview**

• Markov Chain Monte Carlo (MCMC)

• Variational inference

• Belief propagation, loopy belief propagation

• Gaussian process, Gaussian process latent variable model

• Applications

**Markov Chain Monte Carlo (MCMC)**

• Motivation: It is often difficult to compute the joint distribution exactly.

• Name of the game:

- *Draw samples by running a cleverly constructed Markov chain for a long
time.*

- *Monte Carlo integration draws samples from the distribution, and then
forms sample averages to approximate expectations.*

• Goals:

- Approximate the joint distribution p(x) so that we can draw samples from it.

- Estimate expectations of functions under p(x), e.g.,

Φ = ∫ φ(x) p(x) dx (1)

• Focus on the sampling problem, since the expectation can then be estimated
by drawing random samples {x^{(r)}}:

Φ̂ = (1/R) Σ_{r} φ(x^{(r)}) (2)

• As R → ∞, Φ̂ → Φ, since the variance of Φ̂ decreases as σ^{2}/R, where

σ^{2} = ∫ p(x) (φ(x) − Φ)^{2} dx (3)

is the variance of φ under p.

• Good news: the accuracy of the Monte Carlo estimate in (2) is independent of
*the dimensionality of the space sampled!*

• *Bad news: it is difficult to draw independent samples in high-dimensional
spaces.*
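To make the estimator in (2) concrete, here is a minimal Python sketch (not from the original notes): the target p(x) is a standard Gaussian we can sample directly, φ(x) = x², and the estimate tightens like σ/√R as R grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Test function phi(x) = x^2; its expectation under N(0, 1) is 1.
    return x**2

# Simple Monte Carlo estimate (2): draw R samples x^(r) ~ p(x), average phi.
for R in [10, 1000, 100000]:
    x = rng.standard_normal(R)     # direct sampling from p is easy here
    print(R, phi(x).mean())        # Phi_hat approaches Phi = 1 as R grows
```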

**Why Sampling?**

• Non-parametric approach.

• Versatile: accommodates arbitrary densities.

• Easy for analysis and visualization.

• Memory requirement is O(N), where N is the number of samples.

• In high dimensional space, sampling is a key step for:

- modeling: simulation, synthesis.

- learning: estimating parameters.

**Why Is Sampling** p(x) **Difficult?**

• Assume that the target (but unknown) density function p(x) can be
evaluated, up to a multiplicative constant, via p^{∗}(x):

p(x) = p^{∗}(x)/Z (4)

• Two difficulties:

- Typically we do not know the normalizing constant

Z = ∫ p^{∗}(x) dx (5)

- Even if we know Z, it is difficult to draw samples that *well represent or
cover* p(x) in a high-dimensional space.

• Example:

• Decompose x = (x_{1}, . . . , x_{d}) into its components.

• Discretize each component and ask for samples from the discrete probability
distribution over a set of uniformly spaced points {x_{i}}, with

Z = Σ_{i} p^{∗}(x_{i})

p(x_{i}) = p^{∗}(x_{i})/Z (6)

• If we ask for 50 uniformly spaced samples in a 1-dimensional space, we need
50^{1000} samples in a 1000-dimensional space!

• Even with only 2 samples per dimension, we still need 2^{1000} samples in a
1000-dimensional space.

• Related to the Ising model, Boltzmann machines, and Markov random fields.

• See MacKay for more detail on how many samples are required for a good approximation.

**Importance Sampling**

• Recall that we can evaluate the target p(x) only up to a multiplicative constant, via p^{∗}(x): p(x) = p^{∗}(x)/Z.

• p is often complicated and difficult to draw samples from.

• Proposal density function: Assume that we have a simpler density q(x)
which we can evaluate up to a multiplicative constant via q^{∗}(x), where
q(x) = q^{∗}(x)/Z_{q}, and from which we can generate samples.

• Introduce weights to adjust the “importance” of each sample:

w_{r} = p^{∗}(x^{(r)}) / q^{∗}(x^{(r)}) (7)

and

Φ̂ = Σ_{r} w_{r} φ(x^{(r)}) / Σ_{r} w_{r} (8)

• It can be shown that Φ̂ converges to Φ, the expectation of φ(x) under p, as R increases (under some constraints).

• Problem: it is difficult to estimate how reliable Φ̂ is.

• Examples of proposal functions: Gaussian and Cauchy distributions
(the Cauchy density is p(x) = 1 / ( πγ [1 + ((x − x_{0})/γ)^{2}] ), where γ is a
scale parameter).

• The results suggest we should use a “heavy-tailed” importance sampler.

• Heavy-tailed: a high proportion of the probability mass lies in extreme values.

(Figure: importance sampling with a Gaussian proposal, left, versus a Cauchy proposal, right.)
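A hedged sketch of (7)–(8), assuming a standard Gaussian target known only up to a constant and the heavy-tailed Cauchy proposal just discussed; the function names are ours, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

p_star = lambda x: np.exp(-0.5 * x**2)            # target, known up to Z
q_star = lambda x: 1.0 / (np.pi * (1.0 + x**2))   # standard Cauchy proposal

R = 100_000
x = rng.standard_cauchy(R)         # x^(r) ~ q(x)
w = p_star(x) / q_star(x)          # importance weights, eq. (7)

phi = lambda x: x**2
phi_hat = np.sum(w * phi(x)) / np.sum(w)   # eq. (8); true value is 1
print(phi_hat)
```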

**Rejection Sampling**

• Further assume that the proposal function q satisfies, for some constant c,

c q^{∗}(x) > p^{∗}(x) ∀x (9)

• Steps:

- Generate x from q(x), then generate u uniformly from [0, c q^{∗}(x)];
accept x if u ≤ p^{∗}(x), otherwise reject it.

• Problem: Need to pick the right value of c.

• In general, c grows exponentially with the dimensionality.
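A minimal rejection-sampling sketch under the same assumed Gaussian/Cauchy pair; here c = 4 is picked by hand so that (9) holds (the ratio p∗/q∗ peaks near 3.81 at x = ±1).

```python
import numpy as np

rng = np.random.default_rng(0)

p_star = lambda x: np.exp(-0.5 * x**2)            # Gaussian target, up to Z
q_star = lambda x: 1.0 / (np.pi * (1.0 + x**2))   # standard Cauchy proposal
c = 4.0          # chosen so that c*q_star(x) > p_star(x) for all x

samples = []
while len(samples) < 10_000:
    x = rng.standard_cauchy()              # generate x from q(x)
    u = rng.uniform(0.0, c * q_star(x))    # uniform height under c*q_star(x)
    if u <= p_star(x):                     # accept if u falls under p_star
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())       # ~0 and ~1 for a standard normal
```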

**Markov Chain**

• A set of states, S = {s_{1}, s_{2}, . . . , s_{N}}.

• The probability of moving from state s_{i} to state s_{j} in one step is p_{ij}

• Transition matrix for weather states R (rain), N (nice), S (sunny):

P =
[ 1/2 1/4 1/4 ]
[ 1/2  0  1/2 ]
[ 1/4 1/4 1/2 ] (10)

• Define p^{(n)}_{ij} as the probability of reaching s_{j} from s_{i} in n steps, e.g.,
p^{(2)}_{13} = p_{11}p_{13} + p_{12}p_{23} + p_{13}p_{33}, and in general

p^{(2)}_{ij} = Σ_{k} p_{ik} p_{kj} (11)

P^{2} =
[ 0.500 0.250 0.250 ] [ 0.500 0.250 0.250 ]
[ 0.500 0.000 0.500 ] [ 0.500 0.000 0.500 ]
[ 0.250 0.250 0.500 ] [ 0.250 0.250 0.500 ]
=
[ 0.438 0.188 0.375 ]
[ 0.375 0.250 0.375 ]
[ 0.375 0.188 0.438 ] (12)

P^{3} =
[ 0.406 0.203 0.391 ]
[ 0.406 0.188 0.406 ]
[ 0.391 0.203 0.406 ] (13)

P^{4} =
[ 0.402 0.199 0.398 ]
[ 0.398 0.203 0.398 ]
[ 0.398 0.199 0.402 ] (14)

P^{5} =
[ 0.400 0.200 0.399 ]
[ 0.400 0.199 0.400 ]
[ 0.399 0.200 0.400 ] (15)

P^{6} =
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ] (16)

P^{7} =
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ] (17)

P^{8} =
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ]
[ 0.400 0.200 0.400 ] (18)

• No matter where we start, after about 6 days the probability of a rainy day is
0.4, of a nice day 0.2, and of a sunny day 0.4.
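This convergence is easy to check numerically; a small NumPy sketch (not part of the original notes) that reproduces (12)–(18):

```python
import numpy as np

# Weather chain from (10): rows/columns ordered R (rain), N (nice), S (sunny).
P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])

# Successive powers of P: every row converges to the same vector [0.4 0.2 0.4].
for n in [2, 3, 4, 6, 8]:
    print(n)
    print(np.linalg.matrix_power(P, n).round(3))
```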
*Theorem 1. Let P be the transition matrix of a Markov chain. The ij-th entry p^{(n)}_{ij} of the matrix P^{n} gives the probability that the Markov chain, starting in state s_{i}, will be in state s_{j} after n steps.*

*Definition 1. A Markov chain is called an ergodic chain if it is possible to go from every state to every other state (not necessarily in one move). It is called a regular chain if some power of the transition matrix has only positive elements.*

*Theorem 2. Let P be the transition matrix for a regular chain. Then, as n → ∞, the powers P^{n} approach a limiting matrix W with all rows the same vector w. The vector w is a strictly positive probability vector (i.e., the components are all positive and they sum to one).*

*Theorem 3. Let P be a regular transition matrix, let W = lim_{n→∞} P^{n}, let w be the common row of W, and let c be the column vector all of whose elements are 1. Then:*

*1.* wP = w, *and any row vector* v *such that* vP = v *is a constant multiple of* w.

*2.* P c = c, *and any column vector* x *such that* P x = x *is a multiple of* c.

*Definition 3. A row vector w with the property wP = w is called a fixed row vector for P. Similarly, a column vector x such that P x = x is called a fixed column vector for P.*

• In other words, a fixed row vector is a left eigenvector of the matrix P

corresponding to the eigenvalue 1.

[w_{1} w_{2} w_{3}]
[ 1/2 1/4 1/4 ]
[ 1/2  0  1/2 ]
[ 1/4 1/4 1/2 ]
= [w_{1} w_{2} w_{3}] (19)

Solving this linear system, we get

w = [0.4 0.2 0.4]

• In general, we solve

wP = wI where I is the identity matrix, or equivalently

w(P − I) = 0
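Equivalently, w can be computed as the left eigenvector of P with eigenvalue 1; a small sketch (not from the notes) for the same weather chain:

```python
import numpy as np

P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])

# Left eigenvectors of P are (right) eigenvectors of P transposed.
vals, vecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(vals - 1.0))    # eigenvalue 1 for a stochastic matrix
w = np.real(vecs[:, i])
w = w / w.sum()                      # normalize so the components sum to one
print(w)                             # [0.4 0.2 0.4]
```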

*Theorem 4. For an ergodic Markov chain, there is a unique probability vector w such that wP = w, and w is strictly positive. Any row vector v such that vP = v is a multiple of w. Any column vector x such that P x = x is a constant vector.*

• Subject to regularity conditions, the chain will gradually “forget” its initial
state and p^{(t)}(·|s_{0}) will eventually converge to a unique *stationary (or
invariant) distribution*, which does not depend on t or s_{0}.

• Detailed balance:

w(x′; x)p(x) = w(x; x′)p(x′), ∀x and x′

• The period until the chain converges to a stationary distribution is called the
*burn-in time*.

**PageRank**

• PageRank: Suppose we have a set of four web pages, A, B, C, and D (as in the
figure, now omitted), where B, C, and D link to A. The PageRank (PR) of A is

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

or, in general, where L(·) counts a page’s outbound links,

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) (20)

• Random surfer: Markov process

• The PR values are the entries of the dominant eigenvector of the modified
adjacency matrix. The dominant (i.e., first) eigenvector is

R =
[ PR(p_{1}) ]
[ PR(p_{2}) ]
[    ...    ]
[ PR(p_{N}) ] (22)

where R is the solution of the system

R =
[ q/N ]            [ l(p_{1}, p_{1}) l(p_{1}, p_{2}) . . . l(p_{1}, p_{N}) ]
[ q/N ] + (1 − q)  [ l(p_{2}, p_{1})      . . .                            ]
[ ... ]            [      ...                                             ]
[ q/N ]            [ l(p_{N}, p_{1})      . . .       l(p_{N}, p_{N})     ] R (23)

where l(p_{i}, p_{j}) is an adjacency function: l(p_{i}, p_{j}) = 0 if page p_{j} does not
link to p_{i}, and it is normalized such that for each j, Σ_{i=1}^{N} l(p_{i}, p_{j}) = 1,
i.e., the elements of each column sum to 1 (a small power-iteration sketch
appears at the end of this section).

• Related to random walk, Markov process and spectral clustering

• Can be seen as a particular dynamic system that is looking for equilibrium in the state space.

• L. Page and S. Brin, “PageRank: an eigenvector based ranking approach for
hypertext,” in 21st Annual ACM/SIGIR International Conference on Research
and Development in Information Retrieval, 1998.
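As promised above, a small power-iteration sketch for the fixed point of (23). The 4-page link matrix is hypothetical (columns sum to 1, as required), and q = 0.15 is just a conventional choice, not a value from the notes.

```python
import numpy as np

# Hypothetical 4-page web: column j spreads page j's rank evenly over the
# pages it links to, so each column sums to 1.
L = np.array([[0.0, 0.5, 1.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.0, 0.5, 0.0, 0.0]])
q, N = 0.15, 4

R = np.full(N, 1.0 / N)              # start from a uniform rank vector
for _ in range(100):                 # iterate R <- q/N + (1 - q) L R
    R = q / N + (1.0 - q) * L @ R
print(R)                             # converges to the dominant eigenvector
```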

**Metropolis Sampling**

• Importance and rejection sampling only work well if proposal function q(x) is a good approximation of p(x).

• The Metropolis algorithm instead makes use of a proposal density q(x′; x^{(t)})
which depends on the current state x^{(t)}.

• Example: q(x′; x^{(t)}) may be a simple Gaussian distribution centered at x^{(t)}.

• A tentative state x′ is generated from the proposal density q(x′; x^{(t)}).

• Compute the acceptance ratio

a = [ p^{∗}(x′) / p^{∗}(x^{(t)}) ] · [ q(x^{(t)}; x′) / q(x′; x^{(t)}) ] (24)

If a ≥ 1, then the new state is accepted; otherwise, the new state is accepted
with probability a.

• If the step is accepted, we set x^{(t+1)} = x′. Otherwise, we set x^{(t+1)} = x^{(t)}.

• We only need to compute the ratios p^{∗}(x′)/p^{∗}(x^{(t)}) and q(x^{(t)}; x′)/q(x′; x^{(t)}).

• If the proposal density is a simple symmetric density such as a Gaussian, then the latter factor is unity, and the Metropolis algorithm simply involves comparing the values of the target density at two points.

• The general algorithm for asymmetric q is called Metropolis-Hastings

• Widely used for high dimensional problems.

• Has been applied to vision problems with good success in image segmentation, recognition, etc.

• Involves a Markov process in which a sequence {x^{(r)}} is generated, each
sample x^{(t)} having a probability distribution that depends on the previous
value x^{(t−1)}.

• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).

• Random walk: small or large steps?

• Problem: slow convergence.

• Many methods have been proposed to speed it up.
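A minimal Metropolis sketch, assuming a symmetric Gaussian proposal so that the q-ratio in (24) is unity; the bimodal target is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized bimodal target: a mixture of Gaussians at -3 and +3.
p_star = lambda x: np.exp(-0.5 * (x - 3.0)**2) + np.exp(-0.5 * (x + 3.0)**2)

x = 0.0
chain = []
for t in range(50_000):
    x_new = x + rng.normal(0.0, 1.0)       # symmetric proposal around x^(t)
    a = p_star(x_new) / p_star(x)          # acceptance ratio, eq. (24)
    if a >= 1.0 or rng.uniform() < a:      # accept with probability min(1, a)
        x = x_new
    chain.append(x)                        # on rejection, x^(t+1) = x^(t)

chain = np.array(chain[5_000:])            # discard the burn-in period
print(chain.mean())                        # ~0 for this symmetric target
```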


**Gibbs Sampling**

• Also known as heat bath method.

• Can be viewed as a Metropolis method in which the proposal distribution Q is defined in terms of the conditional distribution of the joint distribution p(x).

• Assume that while p(x) is too complex to draw samples from directly, its
conditional distributions p(x_{i}|{x_{j}}_{j≠i}) are tractable to work with.

• In the general case of k variables, a single iteration involves sampling each
variable in turn, conditioned on the most recent values of all the others:

x^{(t+1)}_{1} ∼ p(x_{1}|x^{(t)}_{2} , x^{(t)}_{3} , . . . , x^{(t)}_{k} )
x^{(t+1)}_{2} ∼ p(x_{2}|x^{(t+1)}_{1} , x^{(t)}_{3} , . . . , x^{(t)}_{k} )
x^{(t+1)}_{3} ∼ p(x_{3}|x^{(t+1)}_{1} , x^{(t+1)}_{2} , . . . , x^{(t)}_{k} )
. . .

• Gibbs sampling suffers from the same defect as simple Metropolis
algorithms: the space is explored by a slow random walk, unless a fortuitous
parameterization has been chosen that renders the distribution separable.
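A minimal sketch of the update scheme above (not from the notes): Gibbs sampling for a bivariate Gaussian with correlation ρ, where both conditionals are one-dimensional Gaussians that are easy to sample, even though we never sample the joint directly.

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.9        # target: standard bivariate Gaussian with correlation rho
x1, x2 = 0.0, 0.0
samples = []
for t in range(20_000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # x1 ~ p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # x2 ~ p(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[2_000:])       # discard the burn-in period
print(np.corrcoef(samples.T)[0, 1])       # ~0.9, the target correlation
```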

**MCMC Applications**

• Image parsing [Zhu et al. ICCV 03]

- Analyze an image with a set of pre-defined vocabulary: faces, text, and generic regions.

- Each vocabulary element is parameterized.

- Instead of using naive proposal density function, use detectors (based on image contents) for better proposal functions

- Decompose the solution space into a union of many subspaces.

- Explore solution space by designing efficient Markov Chains and sample the posterior probabilities.

• Human pose estimation [Lee and Cohen CVPR 02]

• Visual tracking

• Structure from motion

**Variational Inference**

• Exact Bayesian inference is intractable.

• Markov Chain Monte Carlo:

- computationally expensive

- convergence issues

• Variational inference:

- a broadly applicable deterministic approximation

- let θ denote all latent variables and parameters

- approximate the true posterior p(θ|D) using a simple distribution q(θ)

- minimize the Kullback-Leibler divergence KL(q||p)
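A closing sketch of this idea under strong simplifying assumptions (a 1-D Gaussian posterior and a Gaussian q; none of the names below come from the notes): estimate KL(q||p) by Monte Carlo, up to the constant log Z, and minimize it over the variational parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_star(th):
    # Unnormalized log posterior: here the true p(theta|D) is N(2, 0.5^2).
    return -0.5 * (th - 2.0)**2 / 0.25

def kl_estimate(m, s, n=20_000):
    # Monte Carlo estimate of KL(q||p) + log Z with q(theta) = N(m, s^2):
    # E_q[log q(theta) - log p_star(theta)], with theta sampled from q.
    th = m + s * rng.standard_normal(n)
    log_q = -0.5 * ((th - m) / s)**2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_q - log_p_star(th))

# Crude grid search over the variational parameters (m, s).
best = min((kl_estimate(m, s), m, s)
           for m in np.linspace(0.0, 4.0, 41)
           for s in np.linspace(0.1, 2.0, 39))
print(best[1], best[2])    # approaches the true posterior: m ~ 2.0, s ~ 0.5
```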