# Vision by Inference and Learning in Graphical Models


### Inference and Learning in Graphical Models

Brendan Frey

(www.cs.uwaterloo.ca/~frey)

University of Waterloo

University of Illinois at Urbana-Champaign · Microsoft Research

• P. Anandan

• D. Fleet

• D. Heckerman

• T. Huang

• N. Jojic

• R. Szeliski


### Approaching vision as probabilistic inference

• Input: x = vector of pixel intensities

• Class: c = class index 1, 2, …, C

Graphics: compute P(x|c). Vision: compute P(c|x) using Bayes rule:

P(c|x) = α P(x|c)P(c), where α = 1 / Σ_c P(x|c)P(c)

### Example: P(x|c) Gaussian

• P(x|c) = (2πσ_c²)^(−1/2) exp[−(x − µ_c)² / (2σ_c²)]

[Plot: P(x|c=1)P(c=1) and P(x|c=2)P(c=2) as functions of x; the value of c that maximizes P(c|x) = α P(x|c)P(c) switches where the two curves cross.]
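A minimal sketch of this computation in Python (the priors, means, and variances below are made-up values for illustration, not ones from the lecture):

```python
import numpy as np

# Hypothetical 1-D example with two classes and Gaussian class-conditionals.
prior = np.array([0.5, 0.5])   # P(c)
mu = np.array([0.0, 3.0])      # class means
sigma = np.array([1.0, 1.5])   # class standard deviations

def class_posterior(x):
    """P(c|x) = P(x|c)P(c) / sum_c P(x|c)P(c)."""
    likelihood = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    joint = likelihood * prior              # P(x|c) P(c)
    return joint / joint.sum()              # normalize: alpha = 1 / sum_c

print(class_posterior(1.0))                 # [P(c=1|x), P(c=2|x)]
```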


### Examples: Image input

• P(Fred | Image)

• P(Happy | Image)

• P(Happy | Image, Fred)

• P(Fred | Image, Happy)

• …

### Examples: Video input

• P(User wants mouse click | Video)

• P(Pixel i from layer L at time t | Video)

• P(Shape of objects at time t | Video)

• P(Appearance of objects at time t | Video)

• P(Position of objects at time t | Video)

• P(Shape, appearance, positions of objects at time t | Video)

• …


### Optimal decision making

• If P(x|c) and P(c) are correct, picking c_MAP = argmax_c P(x|c)P(c) minimizes the number of classification errors

• If U(c,c*) is the utility of picking class c* when the true class is c, use

c_MEU = argmax_{c*} Σ_c U(c,c*) P(x|c)P(c)

• A probabilistic inference P(xi | observed x's) gives optimal decisions for xi
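A sketch of the maximum-expected-utility rule under an assumed 2-class utility matrix (the numbers are invented; the unnormalized products P(x|c)P(c) work as well as the posterior, since the normalizer does not affect the argmax):

```python
import numpy as np

# Hypothetical utilities U[c, c_star]: picking class 2 when the truth is
# class 1 is penalized much more heavily than the reverse error.
U = np.array([[ 1.0, -10.0],
              [-1.0,   1.0]])

def meu_decision(joint):
    """Pick c* maximizing sum_c U[c, c*] P(x|c) P(c); joint[c] = P(x|c)P(c)."""
    expected_utility = joint @ U            # one expected utility per c*
    return int(np.argmax(expected_utility))
```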

### Generative models

• We suppose that observations are the result of a structured generative process on system variables x1, x2, …, xN

• A generative model is a density model P(x1, x2, …, xN)


### Burglar alarm problem (Pearl 86)

Burglar: b=0 no burglar; b=1 burglar. Alarm: a=0 no alarm; a=1 alarm rings.

P(a,b) = P(b)P(a|b); e.g., P(a=1|b=1) = 0.8

Earthquakes also trigger the alarm. Earthquake: e=0 no quake; e=1 quake.

P(a,b,e) = P(e)P(b)P(a|b,e)

• Under P, are b and e independent?

• Under P, are b and e independent given a?

• Probabilistic inferences: P(b=1|a=1) = ?, P(b=1|a=1,e=0) = ?, P(b=1|a=1,e=1) = ?


### Shifter problem: Patches in motion

[Figure: examples of 6×1 patches from a video sequence at times t and t+1, labeled Still/Right and Sparse/Dense; the corresponding noise-free patches are easier to explain.]

• d = density (0=sparse, 1=dense)

• zi = noise-free intensity of pixel i at time t

• m = motion (0=still, 1=right)

• yit = noisy, observed intensity of pixel i at time t

P(d, m, z, y^t, y^{t+1}) = P(d) P(m) Π_i P(zi|d) Π_i P(yi^t|zi) Π_i P(yi^{t+1}|zi-1, zi, m)

• Σ_d Σ_m Σ_z Σ_{y^t} Σ_{y^{t+1}} P(d, m, z, y^t, y^{t+1}) = 1 ?

• Under P, does y^t depend on m?

• Probabilistic inferences: P(m | y^t, y^{t+1}) = ?, P(d | y^t, y^{t+1}) = ?, P(y^{t+1} | y^t) = ?


## BAYESIAN NETWORKS

### (directed graphical models)

• MAY be constructed using causal relationships between variables

• Quickly conveys the factorization of a distribution

• By construction, implies the distribution is normalized

• Clearly expresses dependencies and independencies between variables

• Can be used to derive fast inference algorithms


### Causal construction of burglar net

[Graph: e and b are parents of a; no edge between e and b.]

Assuming earthquakes don't cause burglaries, e is not connected to b. Earthquakes and burglars trigger the alarm, so e and b are connected to a.

### Causal construction of shifter net

[Graph: d is a parent of every zi; each zi is a parent of yi^t (time t); m, zi-1, and zi are the parents of yi^{t+1} (time t+1).]


### Definition of a Bayes net

• Directed graph: no cycles when following arrows (a "DAG")

• Unique variable associated with each node

• For each node, a conditional distribution:

P(child variable | parent variables)

• Defines a joint distribution:

P(x1, …, xN) = Π_i P(xi | parents of xi)

### Conditional probabilities in burglar net

[Graph: e and b are parents of a.]

P(e=1) = .01, P(b=1) = .1

P(a,b,e) = P(e)P(b)P(a|e,b)

P(a=1|e=0,b=0) = .001
P(a=1|e=0,b=1) = .8
P(a=1|e=1,b=0) = .9
P(a=1|e=1,b=1) = .98
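With these numbers, the inferences posed earlier can be computed by summing the joint, as in this minimal sketch:

```python
from itertools import product

P_e = {0: 0.99, 1: 0.01}                    # P(e)
P_b = {0: 0.9, 1: 0.1}                      # P(b)
P_a1 = {(0, 0): 0.001, (0, 1): 0.8,         # P(a=1 | e, b)
        (1, 0): 0.9, (1, 1): 0.98}

def joint(a, b, e):
    """P(a,b,e) = P(e) P(b) P(a|e,b)."""
    pa1 = P_a1[(e, b)]
    return P_e[e] * P_b[b] * (pa1 if a == 1 else 1.0 - pa1)

# P(b=1 | a=1): marginalize e, then normalize over b.
num = sum(joint(1, 1, e) for e in (0, 1))
den = sum(joint(1, b, e) for b, e in product((0, 1), repeat=2))
print(num / den)                            # approx 0.90

# Explaining away: also observing a quake (e=1) lowers the burglar posterior.
print(joint(1, 1, 1) / sum(joint(1, b, 1) for b in (0, 1)))  # approx 0.11
```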


### A distribution in the shifter net

P(y2^{t+1}=1 | z1=0, z2=0, m) = .01
P(y2^{t+1}=1 | z1, z2=1, m=0) = .99
P(y2^{t+1}=1 | z1, z2=0, m=0) = .01
P(y2^{t+1}=1 | z1=1, z2, m=1) = .99
P(y2^{t+1}=1 | z1=0, z2, m=1) = .01

### Direct bonuses of Bayes nets

• The Markov blanket MB[xi] for variable xi can be read off the graph, where

P(xi | MB[xi], other vars) = P(xi | MB[xi])

• Simulating P(x1,…, xN) is easy

• Normalization:

Σ_{x1} … Σ_{xN} [ Π_i P(xi | parents of xi) ] = 1


### Simulating the shifter net

• Sample d from P(d) and m from P(m)

• Sample each zi from P(zi|d) (e.g., z0 from P(z0|d))

• Sample each yi^t from P(yi^t|zi) (e.g., y1^t from P(y1^t|z1))

• Sample each yi^{t+1} from P(yi^{t+1}|zi-1, zi, m) (e.g., y1^{t+1} from P(y1^{t+1}|z0, z1, m))
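A sketch of this ancestral-sampling recipe in general: sample each variable from its conditional once all of its parents have been sampled. It is shown here for the burglar net, whose conditional probabilities were given above; the shifter net is simulated the same way, parents before children.

```python
import random

P_A1 = {(0, 0): 0.001, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.98}  # P(a=1|e,b)

def sample_burglar_net():
    """Ancestral sampling: visit nodes in a topological order of the DAG."""
    e = int(random.random() < 0.01)          # e ~ P(e)
    b = int(random.random() < 0.1)           # b ~ P(b)
    a = int(random.random() < P_A1[(e, b)])  # a ~ P(a|e,b)
    return e, b, a
```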

### Markov blankets

• What is the smallest set of variables that "isolates" a variable xi from the other variables in the network?

• The Markov blanket, MB[xi]:

P(xi | MB[xi], X \ {xi} \ MB[xi]) = P(xi | MB[xi])

• If a set S does not contain MB[xi], P(xi | S, X \ {xi} \ S) ≠ P(xi | S)


### A Markov blanket in the shifter net

MB[z6] = {d, y6^t, y6^{t+1}, m, z5}

[Graph: the shifter net, with the members of MB[z6] highlighted.]

### Pruning Bayes nets

• For variables x1,…, xN, b, suppose b does not have children

• If we delete node b and its edges, the resulting network describes

P(x1, …, xN) = Σ_b P(x1, …, xN, b)

So,

Σ_{x1} … Σ_{xN} [ Π_i P(xi | parents of xi) ] = 1

that is, the pruned network is still a normalized Bayes net.


### Pruning the shifter net

[Figure: pruning the shifter net one childless node at a time: y6^{t+1}, then y5^{t+1}, …, then y1^{t+1}, and finally m, leaving d, the zi, and the yi^t.]

P(d, z, y^t) = P(d) Π_i P(zi|d) Π_i P(yi^t|zi)

Use this simpler net to make inferences about d, z, and y^t

### Noncausal constructions

Reasons for noncausal constructions

• System is not causal

• System too complex for causal construction

• For computational efficiency, a noncausal net is preferable


Procedure for noncausal construction

• Order the variables (eg, at random)

• Add the variables, one at a time

• Make the current variable a child of all previously added variables

• Delete as many edges as possible, reducing the number of parents for the current variable

The last step requires probing the distribution for conditional independencies, as in the example that follows.

### A noncausal construction of the burglar net, order a, e, b

P(a,b,e) = P(a)P(e|a)P(b|a,e)

P(e|a) ≠ P(e), so keep a→e.

P(b|a,e) ≠ P(b), P(b|a,e) ≠ P(b|a), and P(b|a,e) ≠ P(b|e), so keep a→b and e→b.

Causal construction: P(a,b,e) = P(e)P(b)P(a|e,b) (e and b are parents of a, with no edge between them).

Non-causal construction: P(a,b,e) = P(a)P(e|a)P(b|a,e) (a, e, and b are fully connected).

### Conditional independencies

• Is xA independent of xB given xS?

Is P(xA, xB | xS) = P(xA | xS) P(xB | xS)?

• YES, if every path from xA to xB is BLOCKED

• A path can be blocked in 3 ways:

1. Head-to-tail: the path passes through a variable in xS (… → xS → …)

2. Tail-to-tail: the path diverges at a variable in xS (… ← xS → …)

3. Head-to-head: the path converges at a variable (… → v ← …) such that neither v nor any of its descendants is in xS ("xS is not a descendent")


### Independencies in the shifter net

• P(y^t, m) = P(y^t) P(m)

• P(d, m) = P(d) P(m)

• P(d, m | y^t) = P(d | y^t) P(m | y^t)

• P(d, m | y^{t+1}) ≠ P(d | y^{t+1}) P(m | y^{t+1})

### “Extreme” Bayes nets

Factorized model: P(x) = Π_i P(xi)

Unstructured model: P(x) = Π_i P(xi | x1, …, xi-1)

(always true, from the chain rule of probability)


### Mixture model (Naive Bayes)

[Graph: c is a parent of each of x1, …, x6.]

P(x, c) = P(c) Π_i P(xi|c),  P(x) = Σ_c P(x, c)

Shorthand (c discrete, x a vector): P(x, c) = P(c) P(x|c)

### Mixture of Gaussians

P(c) = π_c,  P(z|c) = N(z; µ_c, Φ_c)

[Figure: an example with π_1 = 0.6, π_2 = 0.4; the means µ_1, µ_2, the diagonal covariances diag(Φ_1), diag(Φ_2), and a sample z are shown as images.]


### Transformed mixture of Gaussians

(Frey and Jojic, CVPR 1999)

P(c) = π_c,  P(z|c) = N(z; µ_c, Φ_c),  P(s) = prior over shifts,  P(x|z,s) = N(x; shift(z,s), Ψ) with Ψ diagonal

[Figure: an example with π_1 = 0.6, π_2 = 0.4; the means, diagonal covariances, a latent image z, a shift s, and the observed image x are shown as images.]

### Layered appearance model

(Frey, CVPR 2000)

[Figure: layers ordered from far to near; the variables are the index of the object in layer l, the intensity of ray n at layer l, and the intensity of ray n at the camera.]


### Multiview layered model

(variant of Torr, Szeliski, Anandan, CVPR 1999)

Θ: parameters of the layer planes

z(x,y): depth, in the 1st view, of the pixel at (x,y) in the 1st view

L(x,y): layer of the pixel at (x,y) in the 1st view

I(x,y): vector of pixel-intensity differences between the other views and the pixel intensity in the 1st view at (x,y)

### Dynamic Bayes nets

• Just a Bayes net for time-series data


### Markov model

[Graph: chain z1 → z2 → … → zt-1 → zt → zt+1]

MB[zt] = {zt-1, zt+1}

P(zt | z1, z2, …, zt-1, zt+1, …) = P(zt | zt-1, zt+1)

P(zt | z1, z2, …, zt-1) = P(zt | zt-1)

### Hidden Markov model

[Graph: hidden chain z1 → z2 → … → zt → zt+1, with each zt a parent of its observation xt]

zt discrete, P(zt | zt-1) = "transition matrix"

xt discrete or continuous; e.g., P(xt | zt) = Normal(xt; µ_{zt}, C_{zt})


### Linear dynamic system (Kalman filter model)

[Graph: same chain structure as the hidden Markov model]

zt continuous, P(zt | zt-1) Gaussian; xt continuous, P(xt | zt) Gaussian


### Transformed hidden Markov model

(Jojic, Petrovic, Frey and Huang, CVPR 2000)

[Graph: two time slices, each containing the transformed-mixture-of-Gaussians variables c, z, s, x, with links between the hidden variables of successive slices.]


### Active contour model

(Blake and Isard, Springer-Verlag 1998)

Unobserved state: P(ut|ut-1), with ut = control points of a spline (the contour); the dynamics are LINEAR (GAUSSIAN).

Observation: P(ot|ut), with ot = for all measurement lines, the number of edges and the distances of the edges from the contour; the observation model is NONLINEAR. Measurement lines are placed at fixed intervals along the contour.

[Graph: chain u1 → u2 → … → ut, with each ut a parent of ot]

### 3D body tracking model

Goal: track a 3D articulated model under perspective projection, from a monocular sequence, in an unknown environment, using a motion-based likelihood.

State: φt = joint angles and body pose; Vt = joint/pose velocities; At = appearance model.

Observations: It = image at time t.

[Dynamic Bayes net: chains At-1 → At, Vt-1 → Vt, φt-1 → φt, repeated over time, with each image It depending on the appearance and pose at time t.]


### Switching mixture of state-space models

(Ghahramani and Hinton, 1997)

[Graph: a discrete system "switch" chain s0 → s1 → s2 and two continuous state chains x0 → x1 → x2 (system A and system B); the measurements y0, y1, y2 depend on the switch and the states.]

### Mixed-state dynamic Bayes net

(Pavlovic, Frey and Huang, CVPR 1999)

• Uses a discrete-state HMM to drive continuous dynamics (Kalman filtering)

[Graph: a discrete chain s0 → s1 → s2 (decision/action/mode) driving a continuous chain x0 → x1 → x2 (state of dynamics); the measurements y0, y1, y2 depend on the continuous state.]


## GRAPHICAL MODELS


### Markov random field (MRF)

• Undirected graph on variables

• Graph gives Markov blankets:

The Markov blanket of a variable is the variable’s neighbors

[Figure: an undirected graph; the shaded nodes indicate the variables in the Markov blanket of z.]

### The distribution for an MRF

• If P(x1, …, xN) ≠ 0 for all configurations of x1, …, xN, then P(x1, …, xN) can be expressed as

P(x1, …, xN) = α Π_c φ_c({xi : i ∈ Q_c})

• c indexes the maximal cliques

• Qc is the set of the variables in clique c

• φ_c(·) is a strictly positive function (potential) on the variables in clique c

• α is a normalizing constant


### Burglar MRF

[Graph: e, b, and a pairwise connected by undirected edges.]

1 maximal clique: Q_1 = {e, b, a}

Clique potential: φ_1(e, b, a)

Distribution: P(e, b, a) = α φ_1(e, b, a)

Are e and b independent? CAN'T TELL!

### Line processes

(Geman and Geman)

[Figure: binary patterns on a maximal clique of pixels; patterns consistent with smooth regions or a single straight edge get high φ, while patterns with isolated or inconsistent pixels get low φ.]


### Markov network for image and scene patch modeling

(from Freeman and Pasztor, ICCV 1999)

[Figure: a Markov network with a row of observed image patches, each connected to a hidden scene patch; neighboring scene patches are connected.]

### Bayes net – MRF hybrids

• Suppose we have an MRF for x, with distribution P_MRF(x)

• Suppose we have a Bayes net for z, with distribution P_BN(z)

• Then, we can add directed edges connecting variables in x to variables in z, creating a modified Bayes net P_BN(z|x)

• The joint distribution is P_BN(z|x) P_MRF(x)


### BN-MRF hybrid: multiview layered model

(variant of Torr, Szeliski, Anandan, CVPR 1999)

Θ: parameters of the layer planes

z(x,y): depth, in the 1st view, of the pixel at (x,y) in the 1st view

L(x,y): layer of the pixel at (x,y) in the 1st view

I(x,y): vector of pixel-intensity differences between the other views and the pixel intensity in the 1st view at (x,y)

### Factor graphs

(Kschischang, Frey, Loeliger, submitted to IEEE Trans. Information Theory)

• Bipartite graph: variable nodes and function nodes

• A local function is associated with each function node – this function depends on the neighboring variables

• The global function is given by the product of the local functions


### Burglar factor graphs

[Two factor graphs for the burglar net. Left: a single function node P(e,a,b) connected to e, a, and b; this expresses no independencies (like the MRF). Right: function nodes P(e), P(b), and P(a|e,b); this expresses the same independencies as the Bayes net.]

### Converting an MRF to a factor graph

• Create variable nodes

• Create one function node for each maximal clique in the MRF

• Connect each function node to the variables in the corresponding clique

• Set the function associated with each function node to the corresponding clique potential

• Global function = MRF distribution

• Each MRF has a unique factor graph

• Different factor graphs may have the same MRF


### Converting a Bayes net to a factor graph

• Create variable nodes

• For each variable, create one function node and connect it to the variable

• Connect each function node to the parents of the corresponding variable

• Set the function associated with each function node to the corresponding conditional pdf in the Bayes net

• Global function = Bayes net distribution

• If the child of each local function is indicated (eg, with an arrow), the resulting factor graph has the same semantics as a Bayes net

## INFERENCE


### Probabilistic inference

Recall that, for a correct generative model P(x1, x2, …, xN), a probabilistic inference P(xi | observed x's) gives optimal decisions for xi.

### Inference: Mixture of Gaussians

P(c) = π_c,  P(z|c) = N(z; µ_c, Φ_c)

P(c|z) = P(z|c)P(c) / Σ_c P(z|c)P(c)

[Figure: with π_1 = 0.6, π_2 = 0.4 and the pictured means and diagonal covariances, one observed z gives P(c=1|z) = .9, P(c=2|z) = .1, while another gives P(c=1|z) = .2, P(c=2|z) = .8.]
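A sketch of this posterior ("responsibility") computation for diagonal Gaussians; the parameter values are placeholders, not the images from the slide:

```python
import numpy as np

pi = np.array([0.6, 0.4])                    # P(c) = pi_c
mu = np.array([[0.0, 0.0], [2.0, 2.0]])      # means mu_c (toy 2-D "images")
var = np.array([[1.0, 1.0], [0.5, 0.5]])     # diag(Phi_c)

def responsibilities(z):
    """P(c|z) = P(z|c) P(c) / sum_c P(z|c) P(c), with diagonal Gaussians."""
    log_lik = -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    joint = np.exp(log_lik) * pi             # P(z|c) P(c)
    return joint / joint.sum()

print(responsibilities(np.array([1.8, 2.1])))
```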


### Inference: Transformed mixture of Gaussians

P(x, z, s, c) = P(x|z,s) P(s) P(z|c) P(c)

c and s are discrete; z and x are linear-Gaussian given their parents.

[Figure: given the observed image x and the pictured parameters (π_1 = 0.6, π_2 = 0.4), inference yields c_MAP = 1, the MAP shift s_MAP, and the MAP latent image z_MAP.]

### General “brute force” inference

• Suppose x1, x2, …, xN are binary

P(x1) = Σ_{x2} Σ_{x3} … Σ_{xN} P(x1, x2, …, xN)

• This takes about 2^N operations

• Generally, computing P(xi | observed x's) takes about 2^(N − #observed x's) operations
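A literal sketch of the brute-force sum, for a joint supplied as a function of N binary variables; the loop visits all 2^N configurations, which is exactly why this does not scale:

```python
from itertools import product

def marginal_x1(joint, n):
    """P(x1) by summing joint(x1, ..., xn) over the other n-1 binary vars."""
    p = {0: 0.0, 1: 0.0}
    for xs in product((0, 1), repeat=n):     # 2**n terms in total
        p[xs[0]] += joint(*xs)
    return p
```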


### Inference in Bayes nets

[Figure: the shifter net with y^t and y^{t+1} observed; inference yields P(m=1|Obs) = 0.8, P(d=1|Obs) = 0.2, and a posterior P(zi|Obs) for each noise-free pixel.]

### Observed parents “abandon” children

• We can remove the edges connecting observed parents to their children

[Graphs: a1 and a2 are parents of o; o is a parent of c1 and c2. After observing o = o′, the factors P(o|a1,a2), P(c1|o), P(c2|o) become P(o=o′|a1,a2), P(c1|o=o′), P(c2|o=o′), and the edges from o to its children are removed.]


### Sum-product algorithm

(probability propagation, forward-backward algorithm) (Gallager 1963; Pearl 1986; Lauritzen & Spiegelhalter 1988; …)

• Suppose we have a graphical model for discrete variables x1, x2, …, xN

• If the graphical model is a tree (or "close" to being a tree), the sum-product algorithm can compute P(xi | observed x's) for all xi in LINEAR TIME

### Example: Discrete Markov model

[Graph: chain A → B → C → D → E]

P(A,B,C,D,E) = P(E|D) P(D|C) P(C|B) P(B|A) P(A)

P(E) = Σ_D Σ_C Σ_B Σ_A P(E|D) P(D|C) P(C|B) P(B|A) P(A)

= Σ_D P(E|D) Σ_C P(D|C) Σ_B P(C|B) [Σ_A P(B|A) P(A)]

= Σ_D P(E|D) Σ_C P(D|C) [Σ_B P(C|B) f(B)]

= Σ_D P(E|D) [Σ_C P(D|C) g(C)]

= Σ_D P(E|D) h(D)

where f(B) = Σ_A P(B|A) P(A), g(C) = Σ_B P(C|B) f(B), and h(D) = Σ_C P(D|C) g(C).
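The same elimination in code: pushing each sum inside the product replaces the exponential sum with a sequence of small matrix-vector products. The CPTs here are random placeholders, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(k=2):
    """Random P(child|parent) as a k x k matrix; row = parent value."""
    t = rng.random((k, k))
    return t / t.sum(axis=1, keepdims=True)

PA = np.array([0.3, 0.7])                    # P(A)
PBA, PCB, PDC, PED = (random_cpt() for _ in range(4))

f = PA @ PBA       # f(B) = sum_A P(A) P(B|A)
g = f @ PCB        # g(C) = sum_B f(B) P(C|B)
h = g @ PDC        # h(D) = sum_C g(C) P(D|C)
PE = h @ PED       # P(E) = sum_D h(D) P(E|D)
print(PE, PE.sum())                          # P(E) sums to 1
```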


### General sum-product algorithm

• Messages: short vectors of numbers, interpreted as functions of discrete variables

• Messages flow in both directions on each edge

• Initially, all messages are set to 1

• Messages are updated randomly or in a given order

• Messages are fused to compute (or approximate) P(xi|Observed x’s)

### Passing messages in Bayes nets

[Graph: a and b are parents of c; c is a parent of d and e. Messages flow in both directions along every edge.]

Against an edge (from c up to its parent a):

f(a) = Σ_{c,b} h(c) q(c) P(c|a,b) g(b)

With an edge (from c down to a child):

f(c) = Σ_{a,b} q(c) P(c|a,b) g(a) h(b)

Fusion (combining all messages at c):

P(c|o) ∝ Σ_{a,b} h(c) q(c) P(c|a,b) f(a) g(b)

Each message is a function of the edge's parent variable.


### Propagating observations in Bayes nets

Observation: c = c′

Against an edge: f(a) = Σ_b h(c′) q(c′) P(c′|a,b) g(b)

With an edge: f(c′) = Σ_{a,b} q(c′) P(c′|a,b) g(a) h(b), and f(c) = 0 for c ≠ c′

### Passing messages in factor graphs

• Much simpler than Bayes nets!!!!

• Bayes nets and MRFs can be converted to factor graphs really easily

Out of a variable: f(a) = g(a) h(a), the product of the messages arriving on the variable's other edges.

Out of a function with local function φ(a,b,c): f(a) = Σ_b Σ_c φ(a,b,c) g(b) h(c)

Fusion: P(a|o) ∝ f(a) g(a) h(a)

Each message is a function of its neighboring variable.


### Result of fusion

• Unobserved variables: u1,…,uK

• Observed variables: o

• Fusion at ui estimates

P(ui, o) = Σ_{u1,…,ui-1,ui+1,…,uK} P(u1, …, uK, o)

• Local normalization:

P(ui|o) = P(ui, o) / Σ_{ui} P(ui, o)

### Properties of sum-product

• Exact in trees

• Computationally efficient even for

– linear Gaussian variables without discrete children
– observed real variables with discrete parents

• Some applications:

– Error-correcting decoding (trellis codes)
– Speech recognition (the HMM is a tree)
– Kalman tracking (the LDS net is a tree)
– Multiscale smoothing (tree on an image)


### Max-product (Viterbi) algorithm

• Replace SUM with MAX in the sum-product algorithm

• Max-product computes

Φ(ui) = max_{u1,…,ui-1,ui+1,…,uK} P(u1, …, uK, o)

• MAP configuration:

ui^MAP = argmax_{ui} Φ(ui)
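A sketch of max-product on a chain (Viterbi decoding for an HMM): each Σ becomes a max, and back-pointers recover the MAP configuration. The interface below is an assumption, with the tables passed in log-space:

```python
import numpy as np

def viterbi(log_prior, log_trans, log_obs):
    """MAP state path; log_obs[t, z] = log P(x_t|z_t), log_trans[i, j] = log P(z_j|z_i)."""
    T, K = log_obs.shape
    delta = log_prior + log_obs[0]           # max-product "message" at t=0
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: arrive at j from i
        back[t] = scores.argmax(axis=0)      # best predecessor of each state j
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```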

### What if graph is not a tree?

• Can “cluster” variables

• Can convert the graph to a “join tree”

(Lauritzen and Spiegelhalter 1988)

• Can use “bucket elimination” (Dechter 1999)

BUT, THESE METHODS ONLY WORK WHEN THE NUMBER OF CYCLES IS TRACTABLE


### Quite a few cycles

[Figure: the shifter net, whose graph contains many cycles.]


### Approximate inference

• Monte Carlo

• Markov chain Monte Carlo

• Variational techniques

• Local probability propagation

• Alternating maximizations

### Monte Carlo inference

• u = unobserved vars; o = observed vars

• Obtain a random sample u^(1), u^(2), …, u^(R) and use it to

– Represent P(u|o)

– Estimate an expectation, E[f] = Σ_u f(u) P(u|o)

E.g., P(ui=1|o) = Σ_u I(ui=1) P(u|o), where I(expr) = 1 if expr is true and I(expr) = 0 otherwise.


### Expectations from a sample

• From the sample u^(1), u^(2), …, u^(R), we can estimate

E[f] ≅ (1/R) Σ_r f(u^(r))

• If u^(1), u^(2), …, u^(R) are independent draws from P(u|o), this estimate

– is unbiased

– has variance proportional to 1/R
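A minimal sketch, assuming a sampler that returns independent draws from P(u|o):

```python
def mc_expectation(f, sampler, R=10_000):
    """Estimate E[f] = sum_u f(u) P(u|o) by (1/R) sum_r f(u^(r))."""
    return sum(f(sampler()) for _ in range(R)) / R

# e.g. P(u_i=1 | o) via the indicator f(u) = 1 if u[i] == 1 else 0
```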

### Rejection sampling

• Goal: hold o constant, draw u from P(u|o)

• Given:

– P*(u) ∝ P(u,o); can evaluate P*(u)
– B(u) ≥ P*(u); can evaluate B(u) and can "sample" from B(u)

• Draw u from the normalized form of B(u)

• Randomly accept u with probability P*(u)/B(u)

• Otherwise reject and draw again

[Figure: P*(u) lying under the envelope B(u).]

Efficiency is measured by the rejection rate.
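A generic sketch of the loop, assuming p_star and the envelope b can be evaluated pointwise and that b_sampler draws from the normalized form of B(u):

```python
import random

def rejection_sample(p_star, b, b_sampler):
    """Draw u with density proportional to p_star, given an envelope b >= p_star."""
    while True:
        u = b_sampler()                          # propose from normalized B(u)
        if random.random() < p_star(u) / b(u):   # accept with prob P*(u)/B(u)
            return u
```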


### Rejection sampling in the shifter net

• Choose P*(d,m,z) = P(d,m,z,y^t,y^{t+1})

• Choose B(d,m,z) = 1 ≥ P*(d,m,z)

• Draw d, m, z from the uniform distribution

• Randomly accept with probability P*(d,m,z)/B(d,m,z) = P(d,m,z,y^t,y^{t+1})

[Figures: two draws of (d, m, z) with y^t and y^{t+1} observed; one draw is rejected, and one is accepted.]


### Importance sampling

• Goal: holding o fixed, represent P(u|o) by a weighted sample

• Find P*(u) ∝ P(u,o) and Q*(u), such that P*(u)/Q*(u) can be evaluated and Q*(u) can be "sampled"

• Sample u^(1), u^(2), …, u^(R) from Q*(u)

• Compute weights w^(r) = P*(u^(r))/Q*(u^(r))

• Represent P(u|o) by { u^(r), w^(r)/Σ_j w^(j) }

• E.g., E[f] ≅ Σ_r f(u^(r)) w^(r) / Σ_j w^(j)

Accuracy is given by the "effective sample size".
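A sketch of the weighted-sample estimate, assuming q_sampler draws from the normalized form of Q*(u) and that P*(u)/Q*(u) can be evaluated:

```python
def importance_estimate(f, p_star, q_star, q_sampler, R=10_000):
    """Estimate E[f] under P(u|o) from draws u^(r) ~ Q* with weights P*/Q*."""
    us = [q_sampler() for _ in range(R)]
    ws = [p_star(u) / q_star(u) for u in us]     # w^(r) = P*(u^(r)) / Q*(u^(r))
    return sum(f(u) * w for u, w in zip(us, ws)) / sum(ws)
```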

### Importance sampling in the shifter net

• Choose P*(d,m,z) = P(d,m,z,y^t,y^{t+1})

• Choose Q*(d,m,z) = 1

• Draw (d,m,z)^(r) from the uniform distribution

• Weight (d,m,z)^(r) by P(d^(r),m^(r),z^(r),y^t,y^{t+1})

[Figures: two weighted draws with y^t and y^{t+1} observed.]

### A better Q-distribution

• Choose P*(d,m,z) = P(d,m,z,y^t,y^{t+1})

• Choose Q*(d,m,z) = P(d,m,z)

• Draw (d,m,z)^(r) from P(d,m,z)

• Weight (d,m,z)^(r) by P(y^t,y^{t+1}|d^(r),m^(r),z^(r))

This is called "likelihood weighting".


### Particle filtering (condensation)

(Isard, Blake, et al, et al, et al,…)

• Goal: use a sample S = {ut^(1), …, ut^(R)} from P(ut | o1,…,ot-1) to sample from P(ut | o1,…,ot)

• Weight each "particle" ut^(r) in S by P(ot | ut^(r))

• Redraw a sample S′ from the weighted sample

• For each particle ut in S′, draw ut+1 from P(ut+1 | ut)

Exact for infinite-size samples; for finite-size samples, it may lose modes.

[Graph: chain u1 → u2 → … → ut with observations o1, …, ot]

This is importance sampling at each step: P(ut | o1,…,ot) ∝ P(ot | ut) P(ut | o1,…,ot-1), with Q*(ut) = P(ut | o1,…,ot-1) and P*(ut) the right-hand side.
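A sketch of one update step, with obs_lik and dynamics_sample standing in for P(ot|ut) and P(ut+1|ut) (the names are assumptions for illustration):

```python
import random

def particle_filter_step(particles, o_t, obs_lik, dynamics_sample):
    """One condensation step: weight, resample, then propagate each particle."""
    weights = [obs_lik(o_t, u) for u in particles]   # w^(r) = P(o_t | u_t^(r))
    resampled = random.choices(particles, weights=weights, k=len(particles))
    return [dynamics_sample(u) for u in resampled]   # u_{t+1} ~ P(. | u_t)
```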

### Condensation for active contours

(Blake and Isard, Springer-Verlag 1998)

Particle filtering applied to the active contour model: unobserved state P(ut|ut-1) with ut = control points of the spline (contour), and observations P(ot|ut) given, for all measurement lines, by the number of edges and the distances of the edges from the contour.
