Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Announcements
• Project midterm presentation: Nov 22
• Reading (due Nov 29):
- Freeman et al.: Application of belief propagation to super resolution.
W. Freeman, E. Pasztor and O. Carmichael. Learning low-level vision. International Journal of Computer Vision, vol. 40, no. 1, pages 25–47, 2000.
• Supplementary reading:
- David MacKay. Introduction to Monte Carlo methods.
- Zhu, Dellaert, and Tu, ICCV 05 Tutorial: Markov Chain Monte Carlo for Computer Vision
Overview
• Markov Chain Monte Carlo (MCMC)
• Variational inference
• Belief propagation, loopy belief propagation
• Gaussian process, Gaussian process latent variable model
• Applications
Markov Chain Monte Carlo (MCMC)
• Motivation: it is difficult to compute the joint distribution p(x) exactly.
• Name of the game:
- Draw samples by running a cleverly constructed Markov chain for a long time.
- Monte Carlo integration draws samples from the distribution, and then forms sample average to approximate expectations.
• Goals:
- Aim to approximate the joint distribution p(x) so that we can draw samples.
- Estimate expectations of functions under p(x), e.g.,
Φ = ∫ φ(x) p(x) dx (1)
• Focus on the sampling problem, since the expectation can then be estimated from the drawn samples {x^(r)}:
Φ̂ = (1/R) Σ_r φ(x^(r)) (2)
• As R → ∞, Φ̂ → Φ, since the variance of Φ̂ is σ²/R, where
σ² = ∫ (φ(x) − Φ)² p(x) dx (3)
• Good news: the accuracy of the Monte Carlo estimate in (2) is independent of the dimensionality of the space sampled!
• Bad news: it is difficult to draw independent samples in high-dimensional spaces.
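To make the estimator in (2) concrete, here is a minimal numpy sketch for a toy one-dimensional case where p(x) can be sampled directly; the target N(0, 1), the test function φ(x) = x², and the sample size are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy case: p(x) is a standard Gaussian we can sample directly,
# and phi(x) = x^2, so the true expectation is E[x^2] = 1.
R = 100_000
x = rng.standard_normal(R)        # x^(1), ..., x^(R) ~ p(x)
phi_hat = np.mean(x ** 2)         # eq. (2): (1/R) * sum_r phi(x^(r))

print(f"estimate = {phi_hat:.4f} (true value 1.0)")
# The standard error shrinks like sigma / sqrt(R), regardless of
# the dimensionality of x -- the "good news" above.
```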
Why Sampling?
• Non-parametric approach.
• Versatile: accommodates arbitrary densities.
• Easy to analyze and visualize.
• Memory requirement is O(N), where N is the number of samples.
• In high dimensional space, sampling is a key step for:
- modeling: simulation, synthesis.
- learning: estimating parameters.
Why Is Sampling p(x) Difficult?
• Assume that the target (but unknown) density function p(x) can be evaluated, within a multiplicative constant, by p∗(x).
p(x) = p∗(x)/Z (4)
• Two difficulties in sampling from p(x):
- Typically we do not know the normalizing constant
Z = ∫ p∗(x) dx (5)
- Even if we knew Z, it would still be difficult to draw samples that represent or cover p(x) well in a high-dimensional space.
• Example:
• Decompose x = (x1, . . . , xd) into its d dimensions.
• Discretize each dimension and ask for samples from the discrete probability distribution over a set of uniformly spaced points {xi}, with
Z = Σ_i p∗(xi) and p(xi) = p∗(xi)/Z (6)
• If we draw 50 uniformly spaced samples in a 1-dimensional space, we would need 50^1000 samples in a 1000-dimensional space!
• Even if we draw only 2 samples in each dimension, we still need 2^1000 samples in a 1000-dimensional space.
• Related to the Ising model, Boltzmann machines, and Markov random fields.
• See MacKay for more detail on how many samples are required for a good approximation.
Importance Sampling
• Recall that we evaluate the target p(x) only within a multiplicative constant, via p∗: p(x) = p∗(x)/Z.
• p is often complicated and difficult to draw samples from.
• Proposal density: assume that we have a simpler density q(x), which we can evaluate within a multiplicative constant via q∗(x), where q(x) = q∗(x)/Zq, and from which we can generate samples.
• Introduce weights to adjust the “importance” of each sample:
wr = p∗(x^(r)) / q∗(x^(r)) (7)
and estimate
Φ̂ = Σ_r wr φ(x^(r)) / Σ_r wr (8)
• It can be shown that Φ̂ converges to Φ, the expectation of φ(x), as R increases (under some constraints).
• Problem: it is difficult to estimate how reliable Φ̂ is.
• Examples of proposal functions: Gaussian and Cauchy distributions
(the Cauchy density is p(x) = 1 / (πγ [1 + ((x − x0)/γ)²]), where γ is a scale parameter)
• The results suggest we should use a “heavy-tailed” importance sampler.
• Heavy-tailed: a high proportion of the population consists of extreme values.
[Figure: Gaussian (left) vs. Cauchy (right) proposal densities]
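As a concrete illustration of (7) and (8), here is a minimal numpy/scipy sketch with a heavy-tailed Cauchy proposal; the unnormalized target p∗(x) = exp(−x⁴) is an assumed toy example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_star(x):                    # target known only up to a constant Z
    return np.exp(-x ** 4)

# Heavy-tailed Cauchy proposal q(x): easy to sample and to evaluate.
R = 100_000
x = stats.cauchy.rvs(size=R, random_state=rng)
w = p_star(x) / stats.cauchy.pdf(x)        # importance weights, eq. (7)

# Self-normalized estimate of E[phi(x)] under p, eq. (8), with phi(x) = x^2.
phi_hat = np.sum(w * x ** 2) / np.sum(w)
print(f"E[x^2] estimate: {phi_hat:.4f}")   # analytic value: Gamma(3/4)/Gamma(1/4) ~ 0.338
```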
Rejection Sampling
• Further assume that for the proposal density q we can find a constant c such that
c q∗(x) > p∗(x) ∀x (9)
• Steps:
- Generate x from q(x) and evaluate c q∗(x).
- Generate a uniformly distributed value u from the interval [0, c q∗(x)].
- Evaluate p∗(x); accept the sample if u ≤ p∗(x), otherwise reject it.
• Problem: need to pick an appropriate value of c.
• In general, c grows exponentially with the dimensionality.
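A minimal sketch of these steps, again with the assumed toy target p∗(x) = exp(−x⁴), a unit Gaussian proposal, and a hand-picked bound c = 3 (the ratio p∗/q peaks near 2.67 for this pair):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_star(x):                    # unnormalized target
    return np.exp(-x ** 4)

q = stats.norm(0, 1)              # proposal we can sample from
c = 3.0                           # c*q(x) >= p*(x) everywhere for this pair

samples = []
while len(samples) < 5000:
    x = q.rvs(random_state=rng)               # 1. generate x from q
    u = rng.uniform(0, c * q.pdf(x))          # 2. draw u uniformly in [0, c*q(x)]
    if u <= p_star(x):                        # 3. accept iff u lies under p*(x)
        samples.append(x)

samples = np.asarray(samples)     # independent draws from p(x) = p*(x)/Z
```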
Markov Chain
• A set of states, S = {s1, s2, . . . , sN}.
• The probability of moving from state si to state sj in one step is pij
• Transition matrix for a weather chain with states R (rain), N (nice), S (sunny):
P =
| 1/2 1/4 1/4 |
| 1/2  0  1/2 |
| 1/4 1/4 1/2 | (10)
• Define p^(n)_ij as the probability of reaching sj from si in n steps, e.g., p^(2)_13 = p11 p13 + p12 p23 + p13 p33. In general,
p^(2)_ij = Σ_k p_ik p_kj (11)
P² =
| 0.438 0.188 0.375 |
| 0.375 0.250 0.375 |
| 0.375 0.188 0.438 | (12)
P³ =
| 0.406 0.203 0.391 |
| 0.406 0.188 0.406 |
| 0.391 0.203 0.406 | (13)
P⁴ =
| 0.402 0.199 0.398 |
| 0.398 0.203 0.398 |
| 0.398 0.199 0.402 | (14)
P⁵ =
| 0.400 0.200 0.399 |
| 0.400 0.199 0.400 |
| 0.399 0.200 0.400 | (15)
P⁶ =
| 0.400 0.200 0.400 |
| 0.400 0.200 0.400 |
| 0.400 0.200 0.400 | (16)
P⁷ and P⁸ (equations (17) and (18)) are identical to P⁶ to three decimal places.
• No matter where we start, after six days the probability of a rainy day is 0.4, of a nice day 0.2, and of a sunny day 0.4.
Theorem 1. Let P be the transition matrix of a Markov chain. The ij-th entry p^(n)_ij of the matrix P^n gives the probability that the Markov chain, starting in state si, will be in state sj after n steps.
Definition 1. A Markov chain is called an ergodic chain if it is possible to go from every state to every other state (not necessarily in one move).
Definition 2. A Markov chain is called a regular chain if some power of its transition matrix has only positive elements.
Theorem 2. Let P be the transition matrix for a regular chain. Then, as n → ∞, the powers P^n approach a limiting matrix W with all rows equal to the same vector w. The vector w is a strictly positive probability vector (i.e., the components are all positive and they sum to one).
Theorem 3. Let P be a regular transition matrix, let W = lim_{n→∞} P^n, let w be the common row of W, and let c be the column vector all of whose entries are 1. Then:
1. wP = w, and any row vector v such that vP = v is a constant multiple of w.
2. Pc = c, and any column vector x such that Px = x is a multiple of c.
Definition 3. A row vector w with the property wP = w is called a fixed row vector for P. Similarly, a column vector x such that P x = x is called a fixed column vector for P.
• In other words, a fixed row vector is a left eigenvector of the matrix P
corresponding to the eigenvalue 1.
           | 1/2 1/4 1/4 |
[w1 w2 w3] | 1/2  0  1/2 | = [w1 w2 w3] (19)
           | 1/4 1/4 1/2 |
Solving this linear system, we get
w = [0.4 0.2 0.4]
• In general, we solve
wP = wI where I is the identity matrix, or equivalently
w(P − I) = 0
Theorem 4. For an ergodic Markov chain, there is a unique probability vector w such that wP = w and w is strictly positive. Any row vector such that vP = v is a multiple of w. Any column vector x such that P x = x is a constant vector.
• Subject to regularity conditions, the chain will gradually “forget” its initial state, and p^(t)(·|s0) will eventually converge to a unique stationary (or invariant) distribution which does not depend on t or s0.
• Detailed balance: a sufficient condition for p to be the stationary distribution is
w(x′; x) p(x) = w(x; x′) p(x′), ∀x and x′
where w(x′; x) denotes the transition probability from x to x′.
• The period until the chain converges to the stationary distribution is called the burn-in time.
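As a quick numerical check of Theorems 2–4 on the weather chain above, the following numpy sketch computes a high power of P and solves w(P − I) = 0 via the left eigenvector for eigenvalue 1:

```python
import numpy as np

# Transition matrix from (10), states ordered R, N, S.
P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])

# Theorem 2: powers of P approach a matrix with identical rows.
print(np.linalg.matrix_power(P, 6).round(3))   # every row -> [0.4 0.2 0.4]

# Theorem 4: w is the left eigenvector of P with eigenvalue 1,
# i.e., the solution of w(P - I) = 0, normalized to sum to one.
vals, vecs = np.linalg.eig(P.T)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
w /= w.sum()
print(w.round(3))                              # [0.4 0.2 0.4]
```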
PageRank
• PageRank: Suppose we have a set of four web pages, A, B, C, and D as depicted above, where B, C, and D have 2, 1, and 3 outbound links, respectively. The PageRank (PR) of A is
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
or, with L(·) denoting a page's number of outbound links,
PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) (20)
• Random surfer: Markov process
• The PR values are the entries of the dominant eigenvector of the modified adjacency matrix. The dominant (i.e., first) eigenvector is
R =
| PR(p1) |
| PR(p2) |
|  ...   |
| PR(pN) | (22)

where R is the solution of the system

    | q/N |           | l(p1, p1) l(p1, p2) ... l(p1, pN) |
R = | q/N | + (1 − q) | l(p2, p1)    ...                  | R (23)
    | ... |           | l(pN, p1)    ...      l(pN, pN)   |
where l(pi, pj) is an adjacency function: l(pi, pj) = 0 if page pj does not link to pi, and it is normalized such that, for each j, Σ_{i=1}^{N} l(pi, pj) = 1, i.e., the elements of each column sum up to 1.
• Related to random walk, Markov process and spectral clustering
• Can be seen as a particular dynamical system that seeks an equilibrium in the state space.
• L. Page and S. Brin. “PageRank: an eigenvector based ranking approach for hypertext.” In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.
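To make the system (23) concrete, here is a minimal power-iteration sketch; the 4-page link structure and the damping value q = 0.15 are made-up assumptions for illustration:

```python
import numpy as np

# Hypothetical column-stochastic link matrix l(pi, pj): column j spreads
# page j's rank evenly over the pages it links to (columns sum to 1).
L = np.array([[0.0, 0.5, 1.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.5, 0.0, 0.0, 1/3],
              [0.0, 0.5, 0.0, 0.0]])

q, N = 0.15, 4
R = np.full(N, 1.0 / N)                 # start from the uniform rank vector

for _ in range(100):                    # iterate eq. (23) to its fixed point
    R = q / N + (1 - q) * L @ R

print(R.round(3), R.sum())              # PR values; they sum to 1
```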
Metropolis Sampling
• Importance and rejection sampling only work well if the proposal density q(x) is a good approximation to p(x).
• The Metropolis algorithm instead uses a proposal density q(x′; x^(t)) which depends on the current state x^(t).
• Example: q(x′; x^(t)) may be a simple Gaussian distribution centered on x^(t).
• A tentative state x′ is generated from the proposal density q(x′; x^(t)).
• Compute the acceptance ratio
a = [p∗(x′) / p∗(x^(t))] · [q(x^(t); x′) / q(x′; x^(t))] (24)
If a ≥ 1, the new state is accepted; otherwise, the new state is accepted with probability a.
• If the step is accepted, we set x^(t+1) = x′. Otherwise, we set x^(t+1) = x^(t).
• We need to compute the ratios p∗(x′)/p∗(x^(t)) and q(x^(t); x′)/q(x′; x^(t)).
• If the proposal density is a simple symmetric density, such as a Gaussian, then the latter factor is unity, and the Metropolis algorithm simply involves comparing the values of the target density at the two points.
• The general algorithm for asymmetric q is called Metropolis-Hastings
• Widely used for high dimensional problems.
• Has been applied to vision problems with good success in image segmentation, recognition, etc.
• Involves a Markov process in which a sequence {x^(r)} is generated, each sample x^(t) having a probability distribution that depends on the previous value x^(t−1).
• Since successive samples are correlated, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from p(x).
• Random walk: small or large steps?
• Problems: slow convergence
• Many methods have been proposed to speed it up.
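A minimal sketch of the Metropolis algorithm with a symmetric Gaussian proposal, so the q-ratio in (24) is unity; the target p∗(x) = exp(−x⁴), step size, and burn-in length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):                     # unnormalized target (toy example)
    return np.exp(-x ** 4)

T, step = 50_000, 1.0
x = 0.0
chain = np.empty(T)
for t in range(T):
    x_new = x + step * rng.standard_normal()   # symmetric proposal q(x'; x)
    a = p_star(x_new) / p_star(x)              # eq. (24); q-ratio is 1 here
    if rng.uniform() < a:                      # accept with prob. min(1, a)
        x = x_new
    chain[t] = x                               # on rejection, repeat old state

burn_in = 1_000                    # discard samples before convergence
print(chain[burn_in:].var())       # correlated-sample estimate of E[x^2]
```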
Gibbs Sampling
• Also known as heat bath method.
• Can be viewed as a Metropolis method in which the proposal distribution Q is defined in terms of the conditional distribution of the joint distribution p(x).
• Assume that while p(x) is too complex to draw samples from directly, its conditional distributions p(xi | {xj}_{j≠i}) are tractable to work with.
• In the general case of k variables, a single iteration involves sampling each variable in turn, conditioned on the most recent values of the others:
x1^(t+1) ∼ p(x1 | x2^(t), x3^(t), . . . , xk^(t))
x2^(t+1) ∼ p(x2 | x1^(t+1), x3^(t), . . . , xk^(t))
x3^(t+1) ∼ p(x3 | x1^(t+1), x2^(t+1), x4^(t), . . . , xk^(t))
. . .
• Gibbs sampling suffers from the same defect as simple Metropolis algorithms: the space is explored by a slow random walk, unless a fortuitous parameterization has been chosen that makes the distribution separable.
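A minimal sketch for a toy bivariate Gaussian with correlation ρ, chosen because both conditionals are one-dimensional Gaussians; it stands in for a p(x) whose joint is hard to sample but whose conditionals are tractable:

```python
import numpy as np

rng = np.random.default_rng(0)

rho, T = 0.9, 20_000
s = np.sqrt(1 - rho ** 2)          # std dev of each Gaussian conditional
x1 = x2 = 0.0
chain = np.empty((T, 2))
for t in range(T):
    x1 = rng.normal(rho * x2, s)   # x1^(t+1) ~ p(x1 | x2^(t))
    x2 = rng.normal(rho * x1, s)   # x2^(t+1) ~ p(x2 | x1^(t+1))
    chain[t] = x1, x2

print(np.corrcoef(chain[1_000:].T)[0, 1])   # ~ rho after burn-in
```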
MCMC Applications
• Image parsing [Tu et al. ICCV 03]
- Analyze an image with a set of pre-defined vocabulary: faces, text, and generic regions.
- Each vocabulary item is parameterized.
- Instead of a naive proposal density function, use detectors (based on image content) to obtain better proposal functions.
- Decompose the solution space into a union of many subspaces.
- Explore the solution space by designing efficient Markov chains and sampling the posterior probabilities.
• Human pose estimation [Lee and Cohen CVPR 02]
• Visual tracking
• Structure from motion
Variational Inference
• Exact Bayesian inference is intractable.
• Markov Chain Monte Carlo:
- computationally expensive
- convergence issues
• Variational inference:
- a broadly applicable deterministic approximation
- let θ denote all latent variables and parameters
- approximate the true posterior p(θ|D) using a simple distribution q(θ)
- minimize the Kullback-Leibler divergence KL(q||p)
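A minimal sketch of this idea for a one-dimensional toy posterior: fit q(θ) = N(μ, σ²) by minimizing a Monte Carlo estimate of KL(q||p), which requires only the unnormalized log posterior; the target, sample size, and optimizer below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_p_star(theta):             # unnormalized log posterior (toy example)
    return -theta ** 4

# KL(q || p) = E_q[log q - log p*] + log Z; the constant log Z does not
# affect the minimizer. Estimate the expectation with fixed base noise.
eps = rng.standard_normal(5_000)

def kl_up_to_const(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps       # reparameterized samples from q
    return np.mean(norm.logpdf(theta, mu, sigma) - log_p_star(theta))

res = minimize(kl_up_to_const, x0=[0.5, 0.0], method="Nelder-Mead")
mu, sigma = res.x[0], np.exp(res.x[1])
print(f"q(theta) = N({mu:.3f}, {sigma:.3f}^2)")
```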