TOO MANY CYCLES!
Intractable local computations
Even if the graph is a tree, the local functions (conditional probabilities, potentials) may not
yield tractable sum-product computations
• E.g., non-Gaussian pdfs
Active contour model
(Blake and Isard, Springer-Verlag 1998)
• Unobserved state P(ut|ut-1): ut = control points of the spline (contour); the dynamics are LINEAR (GAUSSIAN)
• Observation P(ot|ut): ot = for all measurement lines, the # of edges and the distance of the edges from the contour; the observation model is NONLINEAR
• Measurement lines are placed at fixed intervals along the contour
[Figure: Markov chain u1 → u2 → … → ut-1 → ut with observations o1, o2, …, ot-1, ot]
Approximate inference
• Monte Carlo
• Markov chain Monte Carlo
• Variational techniques
• Local probability propagation
• Alternating maximizations
Monte Carlo inference
• u = unobserved vars; o = observed vars
• Obtain random sample u(1) , u(2), …, u(R) and use it to
– Represent P(u|o)
– Estimate an expectation,
E[f] = Σu f(u) P(u|o)
E.g., P(ui=1|o) = Σu I(ui=1) P(u|o), where I(expr) = 1 if expr is true and I(expr) = 0 otherwise
Expectations from a sample
• From the sample u(1), u(2), …, u(R), we can estimate
E[f] ≅ (1/R) Σr f(u(r))
• If u(1), u(2), …, u(R) are independent draws from P(u|o), this estimate
– is unbiased
– has variance ∝ 1/R
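As a concrete illustration (not from the slides), here is a minimal Python sketch of this estimator, using a made-up Bernoulli stand-in for P(u|o):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior for illustration: P(u=1|o) = 0.3.
p_u1 = 0.3
R = 10_000

# Draw u(1), ..., u(R) independently from P(u|o).
u = rng.random(R) < p_u1

# E[f] ~= (1/R) sum_r f(u(r)); with f(u) = I(u=1) this is just the sample mean.
print(u.mean())  # close to 0.3; the error shrinks like 1/sqrt(R)
```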
Rejection sampling
• Goal: Hold o constant, draw u from P(u|o)
• Given P*(u) ∝ P(u,o): can evaluate P*(u)
• Given B(u) ≥ P*(u): can evaluate B(u) and can “sample” B(u)
• Draw u from the normalized form of B(u)
• Randomly accept u with prob P*(u)/B(u)
• Otherwise reject and draw again
• Efficiency is measured by the rejection rate
[Figure: an envelope B(u) lying everywhere above the unnormalized target P*(u)]
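A minimal sketch of this recipe, assuming a made-up unnormalized target (Beta(3,2)-shaped) and a constant envelope B(u); none of these choices come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(u):
    # Unnormalized target P*(u) ∝ P(u,o): a Beta(3,2)-shaped density on
    # [0,1], made up for this example.
    return u**2 * (1.0 - u)

B_MAX = 0.15  # constant envelope: B(u) >= P*(u) everywhere on [0,1]

def rejection_sample(n):
    samples, proposals = [], 0
    while len(samples) < n:
        u = rng.random()                      # draw u from normalized B (uniform)
        proposals += 1
        if rng.random() < p_star(u) / B_MAX:  # accept with prob P*(u)/B(u)
            samples.append(u)                 # otherwise reject and draw again
    print(f"acceptance rate: {n / proposals:.2f}")  # efficiency of the sampler
    return np.array(samples)

draws = rejection_sample(10_000)
print(draws.mean())  # the Beta(3,2) mean is 3/5 = 0.6
```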
Rejection sampling in the shifter net
• Choose P*(d,m,z) = P(d,m,z,yt,yt+1)
• Choose B(d,m,z) = 1 ≥ P*(d,m,z)
• Draw d,m,z from the uniform distribution
• Randomly accept with probability P*(d,m,z)/B(d,m,z) = P(d,m,z,yt,yt+1)
[Figure: a draw of d, m, z that explains the observed images yt, yt+1 poorly. Reject!]
Rejection sampling in the shifter net
• Choose P*(d,m,z) = P(d,m,z,yt,yt+1)
• Choose B(d,m,z) = 1 ≥ P*(d,m,z)
• Draw d,m,z from the uniform distribution
• Randomly accept with probability P*(d,m,z)/B(d,m,z) = P(d,m,z,yt,yt+1)
[Figure: this time the draw of d, m, z explains the observed images well. Accept!]
Importance sampling
• Goal: Holding o fixed, represent P(u|o) by a weighted sample
• Find P*(u) ∝ P(u,o) and Q*(u), such that can evaluate P*(u)/Q*(u) and can “sample” Q*(u)
• Sample u(1), u(2), …, u(R), from Q*(u)
• Compute weights w(r) = P*(u(r))/Q*(u(r))
• Represent P(u|o) by { u(r), w(r)/(Σj w(j)) }
• E.g., E[f] ≅ Σr f(u(r)) w(r)/(Σj w(j))
• Accuracy is given by the “effective sample size”
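A sketch of the weighted-sample recipe, reusing the same made-up Beta-shaped target with a uniform Q*(u); the effective-sample-size diagnostic shown is the usual 1/Σ(normalized weight)², an assumption rather than something stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(u):
    # Unnormalized target P*(u) ∝ P(u,o): the same made-up Beta(3,2) shape.
    return u**2 * (1.0 - u)

R = 10_000
u = rng.random(R)        # sample u(1), ..., u(R) from Q*(u) = 1 (uniform on [0,1])
w = p_star(u)            # weights w(r) = P*(u(r)) / Q*(u(r))
w_norm = w / w.sum()     # normalized weights w(r) / (sum_j w(j))

# E[f] ~= sum_r f(u(r)) w_norm(r); here f(u) = u.
print((u * w_norm).sum())  # close to the Beta(3,2) mean, 0.6

# Effective sample size: near R when the weights are even,
# much smaller when a few particles dominate.
print(1.0 / (w_norm**2).sum())
```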
Importance sampling in the shifter net
• Choose P*(d,m,z) = P(d,m,z,yt,yt+1)
• Choose Q*(d,m,z) = 1
• Draw (d,m,z)(r) from the uniform distribution
• Weight (d,m,z)(r) by P(d(r),m(r),z(r),yt,yt+1)
[Figure: a draw that explains the observed images poorly receives a low weight]
Importance sampling in the shifter net
• Choose P*(d,m,z) = P(d,m,z,yt,yt+1)
• Choose Q*(d,m,z) = 1
• Draw (d,m,z)(r) from the uniform distribution
• Weight (d,m,z)(r) by P(d(r),m(r),z(r),yt,yt+1)
[Figure: a draw that explains the observed images well receives a high weight]
A better Q-distribution
• Choose P*(d,m,z) = P(d,m,z,yt,yt+1)
• Choose Q*(d,m,z) = P(d,m,z)
• Draw (d,m,z)(r) from P(d,m,z)
• Weight (d,m,z)(r) by P(yt,yt+1|d(r),m(r),z(r))
• This is called “likelihood weighting”
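A likelihood-weighting sketch on a made-up two-node Bayes net (the numbers are invented; this is not the shifter net):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up Bayes net d -> y: P(d=1) = 0.5,
# P(y=1|d=1) = 0.9, P(y=1|d=0) = 0.2. We observe y = 1.
R = 10_000
d = rng.random(R) < 0.5        # draw the unobserved d(r) from its prior P(d)
w = np.where(d, 0.9, 0.2)      # weight each draw by the likelihood P(y=1|d(r))

# Posterior estimate: P(d=1|y=1) = (sum of weights where d=1) / (sum of weights).
print(w[d].sum() / w.sum())    # exact answer: 0.45/0.55 ~= 0.818
```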
Particle filtering (condensation)
(Isard, Blake, et al, et al, et al,…)
• Goal: Use a sample S = ut(1), …, ut(R) from P(ut|o1,…,ot-1) to sample from P(ut|o1,…,ot)
• Weight each “particle” ut(r) in S by P(ot|ut(r))
• Redraw a sample S’ from the weighted sample
• For each particle ut(r) in S’, draw ut+1 from P(ut+1|ut(r))
• Exact for infinite-size samples; for finite-size samples, it may lose modes
• This is importance sampling at each step: P(ut|o1,…,ot) ∝ P(ot|ut) P(ut|o1,…,ot-1), so the unnormalized target is P*(ut) = P(ot|ut) P(ut|o1,…,ot-1) and the proposal is Q*(ut) = P(ut|o1,…,ot-1)
[Figure: Markov chain u1 → u2 → … → ut-1 → ut with observations o1, o2, …, ot-1, ot]
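A bootstrap particle-filter sketch on a made-up 1-D nonlinear state-space model (the dynamics, noise levels, and sequence length are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 1_000  # number of particles

def transition(u):
    # P(u_t|u_{t-1}): made-up nonlinear dynamics with Gaussian noise.
    return 0.9 * u + np.sin(u) + rng.normal(0.0, 0.5, size=u.shape)

def likelihood(o, u):
    # P(o_t|u_t), unnormalized: Gaussian observation of the state.
    return np.exp(-0.5 * (o - u) ** 2 / 0.5**2)

# Simulate a short observation sequence from the model itself.
true_u, obs = np.array([0.0]), []
for _ in range(20):
    true_u = transition(true_u)
    obs.append(true_u[0] + rng.normal(0.0, 0.5))

particles = rng.normal(0.0, 1.0, R)       # rough initial sample for u_1
for o_t in obs:
    w = likelihood(o_t, particles)        # weight each particle by P(o_t|u_t(r))
    w /= w.sum()
    idx = rng.choice(R, size=R, p=w)      # redraw S' from the weighted sample
    filtered = particles[idx]             # approx draws from P(u_t|o_1,...,o_t)
    particles = transition(filtered)      # draw u_{t+1} from P(u_{t+1}|u_t(r))

print(f"true state {true_u[0]:.2f}, filtered mean {filtered.mean():.2f}")
```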
Condensation for active contours
(Blake and Isard, Springer-Verlag 1998)
• Unobserved state P(ut|ut-1): ut = control points of the spline (contour); LINEAR (GAUSSIAN) dynamics
• Observation P(ot|ut): ot = for all measurement lines, the # of edges and the distance of the edges from the contour; NONLINEAR observation model
• Sampling from P(ut|o1,…,ot) performs tracking
[Figure: Markov chain u1 → u2 → … → ut-1 → ut with observations o1, o2, …, ot-1, ot]
3D Body Tracking Model
(Sidenbladh, Black, Fleet, ECCV 2000)
Goal:
• 3D articulated model
• Perspective projection
• Monocular sequence
• Unknown environment
• Motion-based likelihood
State:
– joint angles and body pose φt
– joint/pose velocities Vt
– appearance model At
Observations:
– image It at time t
Dynamic Bayes net
[Figure: dynamic Bayes net with chains At-1 → At, Vt-1 → Vt, φt-1 → φt, and an image It generated at each time step]
3D body tracking using particle filtering
(Sidenbladh, Black, Fleet, ECCV 2000)
[Video: the estimated mean of the posterior distribution, under significant changes in view and depth]
Markov chain Monte Carlo (MCMC)
• MCMC simulates a separate Markov chain on u so that the stationary
distribution of the chain is P(u|o)
• MCMC is NOT just Monte Carlo in a Markov chain (e.g., condensation is not MCMC, but is an importance-type sampler for a Markov model)
Achieving a stationary distribution
• Goal: Sample from P(u|o), holding o fixed
• Start with any configuration u
• Repeatedly draw u’ from a transition distribution T(u’|u); this converges to the stationary distribution Q(u) defined by
Q(u’) = Σu Q(u) T(u’|u)
• Q(u) is stationary if detailed balance holds:
Q(u) T(u’|u) = Q(u’) T(u|u’)
• So, pick T(u’|u) so that
P(u|o) T(u’|u) = P(u’|o) T(u|u’)
or, multiplying both sides by P(o),
P(u,o) T(u’|u) = P(u’,o) T(u|u’)
Choosing the sample
• Approach 1:
– Simulate 1 chain for a long time and store configurations periodically
• Approach 2:
– Simulate several chains for moderate
amounts of time and collect configurations
Gibbs sampling
• Goal: Sample from P(u|o), holding o fixed
• Require it to be easy to sample from P(ui|all other vars)
– easy in discrete graphical models
– for real-valued models, may need local MC or MCMC
• Start with any configuration u
• Visit variables in u in order or at random
• Draw ui from P(ui|all other vars)
Stationary distribution = P(u|o) (by detailed balance)
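A Gibbs-sampling sketch on a small made-up Markov net (an Ising-style grid, not the shifter net), where P(ui|all other vars) depends only on the neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up Markov net: a 5x5 Ising-style grid of ±1 variables whose
# pairwise potentials (coupling J) favor agreement between neighbors.
N, J, SWEEPS, BURN_IN = 5, 0.3, 2_000, 500
u = rng.choice([-1, 1], size=(N, N))  # start with any configuration

def neighbor_sum(u, i, j):
    total = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < N and 0 <= nj < N:
            total += u[ni, nj]
    return total

stats = []
for sweep in range(SWEEPS):
    for i in range(N):                # visit the variables in order
        for j in range(N):
            # P(u_ij = +1 | all other vars) depends only on the neighbors.
            s = neighbor_sum(u, i, j)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * s))
            u[i, j] = 1 if rng.random() < p_plus else -1
    if sweep >= BURN_IN:              # store configurations after burn-in
        stats.append(u.mean())

print(np.mean(stats))  # E[mean spin] ~ 0 by symmetry at this coupling
```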
Gibbs sampling in the shifter net
[Figure: the shifter net over variables d, m and z0, z1, …, z6, shown at times t and t+1]
• Randomly initialize the unobserved variables
• Sample the variables one at a time, using
P(var|others) ∝ P(var|parents of var) × Πchildren c of var P(c|parents of c)
(var is in the parent set of each of its children)
[Successive frames show individual variables being resampled given all the others]
The Metropolis algorithm
• MCMC version of importance sampling
• As before, find P*(u) ∝ P(u,o)
• Find a proposal distribution, Q*(u’|u), such that can evaluate [P*(u’)Q*(u|u’)]/[P*(u)Q*(u’|u)] and can “sample” from Q*(u’|u)
MCMC step:
• Based on the old u, sample a new u’ from Q*(u’|u)
• Compute a = [P*(u’) Q*(u|u’)] / [P*(u) Q*(u’|u)]
• Accept u’ with probability min(a,1)
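A sketch of one long chain on a made-up two-mode target, with a symmetric random-walk proposal so that a reduces to P*(u’)/P*(u):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(u):
    # Unnormalized target P*(u) ∝ P(u,o): a made-up two-mode density.
    return np.exp(-0.5 * (u - 2.0) ** 2) + np.exp(-0.5 * (u + 2.0) ** 2)

STEP = 1.0  # random-walk proposal Q*(u'|u) = Normal(u, STEP); it is
            # symmetric, so Q*(u|u')/Q*(u'|u) = 1 and a = P*(u')/P*(u)

u, samples = 0.0, []
for _ in range(50_000):
    u_new = u + rng.normal(0.0, STEP)  # sample new u' from Q*(u'|u)
    a = p_star(u_new) / p_star(u)      # a = [P*(u')Q*(u|u')]/[P*(u)Q*(u'|u)]
    if rng.random() < min(a, 1.0):     # accept u' with probability min(a,1)
        u = u_new
    samples.append(u)

chain = np.array(samples[5_000:])      # discard burn-in
print(chain.mean())                    # near 0: the target is symmetric
```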
Metropolis for active contours
(Kristjansson and Frey, submitted to NIPS 2000)
• Problem: Highly flexible contours are difficult to track, because the probability of picking the right ut from P(ut|ut-1) is very small
• Fix: For each new observation, apply Metropolis to the control points to “tweak” them back on track
[Figure: Markov chain u1 → u2 → … → ut-1 → ut with observations o1, o2, …, ot-1, ot]
Variational methods
• Suppose P(u|o) is intractable
• Idea: Approximate P(u|o) with a distribution Q that is tractable
• Use Q to compute expectations, etc
• Parameterize Q: Q(u|Φ)
• Choose “distance” D(Q,P)
• Minimize D(Q,P) wrt Φ
Choosing the “distance”
D = Σu Q(u|Φ) log[Q(u|Φ)/P(u|o)]
• For unimodal Q, minimizing this D favors a Q that locks onto a single mode of P
D = Σu P(u|o) log[P(u|o)/Q(u|Φ)]
• For unimodal Q, minimizing this D favors a Q that spreads its mass over all the modes of P
[Figure: a bimodal P with the two resulting unimodal fits of Q]
Making the distance tractable
• Usually, the distance can’t be computed directly:
D = Σu Q(u|Φ) log[Q(u|Φ)/P(u|o)] uses P(u|o)
• Instead, use F = D − log P(o):
F = Σu Q(u|Φ) log[Q(u|Φ)/P(u,o)]
• P(u,o) factorizes for a Bayes net!
What’s needed to minimize F
F = Σu Q(u|Φ) log[Q(u|Φ)/P(u,o)]
= Σu Q(u|Φ) log Q(u|Φ) − Σu Σi Q(u|Φ) log P(xi|parents of xi)
• The first term is the negative entropy of Q
• The second term requires the solution of local log-P equations
• We want Σu to break into smaller sums

Types of Q-distribution
• Factorized: Q(u|Φ) = Πi Q(ui|Φi), aka mean field
• Q can be a graphical model that is a substructure of P
• Q can be a graphical model with a different structure from P
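A mean-field sketch on a tiny made-up model: two binary unobserved variables with P(u,o) given as a table, a factorized Q, and coordinate updates that each decrease F:

```python
import numpy as np

# Made-up model: two binary unobserved variables u1, u2; o is fixed, so
# P(u1,u2,o) is just a 2x2 table of unnormalized probabilities.
log_p = np.log(np.array([[0.30, 0.10],
                         [0.05, 0.55]]))  # rows index u1, columns index u2

# Factorized (mean-field) approximation Q(u|Φ) = Q1(u1) Q2(u2).
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])

def free_energy(q1, q2):
    # F = sum_u Q(u) log[Q(u)/P(u,o)]
    Q = np.outer(q1, q2)
    return (Q * (np.log(Q) - log_p)).sum()

for _ in range(50):
    # Coordinate updates: Q_i(u_i) ∝ exp( E_{Q of the others}[ log P(u,o) ] );
    # each update can only decrease F.
    q1 = np.exp(log_p @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ log_p); q2 /= q2.sum()

exact = np.exp(log_p) / np.exp(log_p).sum()
print("F =", round(free_energy(q1, q2), 4), " -log P(o) =",
      round(-np.log(np.exp(log_p).sum()), 4))  # F upper-bounds -log P(o)
print("Q1 =", q1.round(3), " exact P(u1|o) =", exact.sum(axis=1).round(3))
```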
Variational inference in mixed-state dynamic net
(Pavlovic, Frey and Huang, CVPR 1999)
[Figure: mixed-state dynamic Bayes net with discrete switch states s0, s1, s2, continuous states x0, x1, x2, and observations y0, y1, y2]
The model couples an HMM over the discrete states st with a linear dynamical system (LDS) over the continuous states xt. The variational Q-distribution decouples the two chains, with variational parameters qt acting as soft evidence on the HMM and ut acting as inputs to the LDS:
Q(S, X | U, Y) = [ P(s0) Π(t=1..T-1) P(st|st-1) Π(t=0..T-1) P(qt|st) ] (HMM part)
× [ P(x0) Π(t=1..T-1) P(xt|xt-1, ut) Π(t=0..T-1) P(yt|xt) ] (LDS part)
Variational inference in mixed-state dynamic net
(Pavlovic, Frey and Huang, CVPR 1999)
[Figure: P(st|Y) plotted against t, the time step; the input trace and LDS state vectors are colored by the most probable HMM state]
More examples
• An introduction to variational methods (Jordan et al., 1999)
• Variational methods based on convexity analysis (Jaakkola, 1997)
• Derivation of a simple variational technique for nonlinear Gaussian Bayes nets (Frey and Hinton, 1999)
Sum-product algorithm in graphs with cycles
• To heck with the cycles - just apply the algorithm! (Even though it’s not exact…)
• Due to the cycles, the algorithm iterates (passes messages around loops) until we stop it
• Impressive: error-correcting decoding
– within 0.15 dB of the Shannon limit
– now in 2 telecom standards
– Lucent is producing a chip
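A loopy sum-product sketch on a made-up three-node binary cycle (the potentials are invented), iterating messages until we stop it:

```python
import numpy as np

# Made-up pairwise Markov net: three binary nodes in a cycle, with the
# same attractive pairwise potential on every edge plus local evidence.
psi = np.array([[1.2, 0.6],
                [0.6, 1.2]])
phi = {0: np.array([0.7, 0.3]),
       1: np.array([0.5, 0.5]),
       2: np.array([0.4, 0.6])}
directed = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 0), (0, 2)]

# Initialize all messages to uniform, then iterate.
msgs = {e: np.ones(2) for e in directed}
for _ in range(50):
    new = {}
    for (i, j) in directed:
        # m_ij(x_j) = sum_{x_i} phi_i(x_i) psi(x_i,x_j) prod_{k!=j} m_ki(x_i)
        prod = phi[i].copy()
        for (k, l) in directed:
            if l == i and k != j:
                prod = prod * msgs[(k, l)]
        m = psi.T @ prod
        new[(i, j)] = m / m.sum()
    msgs = new

for n in phi:
    b = phi[n].copy()
    for (k, l) in directed:
        if l == n:
            b = b * msgs[(k, l)]
    print(f"node {n}: belief {b / b.sum()}")  # approximate marginals
```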
Iterative sum-product WORKS in network for layered appearance model
(Frey, CVPR 2000)
[Figure: layers ordered from far away to the camera; the variables are the index of the object in layer l, the intensity of ray n at layer l, and the intensity of ray n at the camera]
Sum-product in layered appearance model
[Figure: the input image is decomposed into Layer 1 … Layer L; beliefs over Obj 1 … Obj 4 in Layers 1 to 4 are shown after iteration 1, iteration 2, …]
Markov network for image and scene patch modeling
(from Freeman and Pasztor, ICCV 1999)
[Figure: observed image patches connected to hidden scene patches in a pairwise Markov network]
Super-resolution by iterative probability propagation
(from Freeman and Pasztor, ICCV 1999)
[Figure panels: Original 70x70; Cubic spline 280x280; Markov network result 280x280; True 280x280]
Alternating maximizations
• Pick an assignment for all variables
• Repeatedly select a tractable substructure (e.g., a tree) and find the most probable configuration in the substructure
• This is guaranteed to find a sequence of global configurations that are increasingly probable
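A sketch where the tractable substructure is the simplest possible one, a single variable at a time (iterated conditional modes rather than a tree), on a made-up Ising-style model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up Ising-style grid with a random local field h pulling each
# variable toward +1 or -1, plus couplings J that favor agreement.
N, J = 5, 0.5
h = rng.normal(0.0, 1.0, size=(N, N))
u = rng.choice([-1, 1], size=(N, N))  # pick an assignment for all variables

def local_field(u, i, j):
    s = sum(u[ni, nj]
            for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1))
            if 0 <= ni < N and 0 <= nj < N)
    return J * s + h[i, j]  # the most probable u[i,j] is its sign

changed = True
while changed:          # every accepted flip makes the global configuration
    changed = False     # more probable, so the loop terminates
    for i in range(N):
        for j in range(N):
            best = 1 if local_field(u, i, j) >= 0 else -1
            if best != u[i, j]:
                u[i, j] = best
                changed = True

print(u)  # a local maximum of the configuration's probability
```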
Tracking self-occluding objects in disparity maps
(Jojic, Turk and Huang, ICCV 1999)