### Vision by Inference and Learning in Graphical Models

Brendan Frey

(www.cs.uwaterloo.ca/~frey)

University of Waterloo

University of Illinois at Urbana-Champaign

Microsoft Research

### Acknowledgements for comments and suggestions

• P. Anandan

• G. Bradski

• D. Fleet

• D. Heckerman

• T. Huang

• N. Jojic

• R. Szeliski

BRENDAN FREY

### Approaching vision as probabilistic inference

• Input: x = vector of pixel intensities

• Class: c = class index 1, 2, …, C

• Graphics: compute P(x|c)

• Vision: compute P(c|x)

*Vision using Bayes rule:*

P(c|x) = a P(x|c)P(c), where a = 1 / Σ_{c} P(x|c)P(c)

### Example: P(x|c) Gaussian

• P(x|c) = (2πσ_{c}^{2})^{-1/2} exp[-(x-µ_{c})^{2}/2σ_{c}^{2}]

[Figure: the weighted densities P(x|c=1)P(c=1) and P(x|c=2)P(c=2) plotted against x; the regions where each curve is highest give the values of c that maximize P(c|x) = a P(x|c)P(c)]
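The Bayes-rule classifier above can be sketched in a few lines. This is a minimal illustration with made-up priors, means and variances (the slide's actual parameters are not given numerically):

```python
import math

# Bayes-rule classification with 1-D Gaussian class likelihoods.
# Priors, means and standard deviations are made-up numbers for illustration.
priors = {1: 0.5, 2: 0.5}   # P(c)
means = {1: 0.0, 2: 2.0}    # mu_c
stds = {1: 1.0, 2: 1.0}     # sigma_c

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior(x):
    # P(c|x) = P(x|c)P(c) / sum_c P(x|c)P(c)
    joint = {c: gaussian(x, means[c], stds[c]) * priors[c] for c in priors}
    a = 1.0 / sum(joint.values())
    return {c: a * p for c, p in joint.items()}

print(posterior(0.2))  # mass concentrates on class 1, whose mean is nearer
```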


### Examples: Image input

• P(Fred | Image)

• P(Happy | Image)

• P(Happy | Image, Fred)

• P(Fred | Image, Happy)

• …

### Examples: Video input

• P(User wants mouse click | Video)

• P(Pixel i from layer L at time t | Video)

• P(Shape of objects at time t | Video)

• P(Appearance of objects at time t |Video)

• P(Position of objects at time t | Video)

• P(shape, appearance, positions of objects at time t | Video)

• …


### Optimal decision making

• If P(x|c) and P(c) are correct, picking

c^{MAP} = argmax_{c} P(x|c)P(c)

minimizes the number of classification errors

• If U(c,c*) is the utility of picking class c* when the true class is c, use

c^{MEU} = argmax_{c*} Σ_{c} U(c,c*)P(x|c)P(c)

• A *probabilistic inference* P(x_{i} | Observed x's) gives optimal decisions for x_{i}
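The MEU rule can be computed directly as the argmax of expected utility. A tiny sketch with made-up numbers, chosen so that the utility overrides the MAP class:

```python
# Maximum expected utility: pick argmax_{c*} sum_c U(c,c*) P(x|c)P(c).
# All numbers below are made up for illustration.
joint = {1: 0.06, 2: 0.04}        # P(x|c)P(c) for the observed x; MAP class is 1
U = {(1, 1): 1.0, (1, 2): 0.0,    # U[(true c, picked c*)]
     (2, 1): -20.0, (2, 2): 1.0}  # picking 1 when the truth is 2 is very costly

def meu_decision():
    scores = {pick: sum(U[(true, pick)] * joint[true] for true in joint)
              for pick in (1, 2)}
    return max(scores, key=scores.get), scores

pick, scores = meu_decision()
print(pick)  # 2: the asymmetric utility overrides the MAP class
```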

### Generative models

• We suppose that observations are the result of a structured generative process on system variables x_{1}, x_{2}, …, x_{N}

• A *generative model* is a density model P(x_{1}, x_{2}, …, x_{N})


### Burglar alarm problem (Pearl 86)

Burglar: b=0 no burglar; b=1 burglar
Alarm: a=0 no alarm; a=1 alarm rings

P(a,b) = P(b)P(a|b). Eg, P(a=1|b=1) = 0.8

Earthquakes also trigger the alarm. Earthquake: e=0 no quake; e=1 quake

P(a,b,e) = P(e)P(b)P(a|b,e)

### Useful questions about

P(a,b,e) = P(e)P(b)P(a|b,e)

• Under P, are b and e independent ?

• Under P, are b and e independent given a?

• Probabilistic inferences: P(b=1|a=1) = ?, P(b=1|a=1,e=1) = ?, P(b=1|a=1,e=0) = ?


P(d,m,z,y^{t},y^{t+1}) = P(d) P(m) Π_{i} P(z_{i}|d) Π_{i} P(y_{i}^{t}|z_{i}) Π_{i} P(y_{i}^{t+1}|z_{i-1},z_{i},m)

### Shifter problem: Patches in motion

Examples of 6 x 1 patches from a video sequence

[Figure: patch pairs at time t and time t+1, labeled Still/Right and Sparse/Dense. *Noise-free patches - easier to explain!*]

• d = density (0=sparse, 1=dense)

• m = motion (0=still, 1=right)

• z_{i} = noise-free intensity of pixel i at time t

• y_{i}^{t} = noisy, observed intensity of pixel i at time t

### Useful questions about

P(d,m,z,y^{t},y^{t+1}) = P(d) P(m) Π_{i} P(z_{i}|d) Π_{i} P(y_{i}^{t}|z_{i}) Π_{i} P(y_{i}^{t+1}|z_{i-1},z_{i},m)

• Σ_{d} Σ_{m} Σ_{z} Σ_{y^{t}} Σ_{y^{t+1}} P(d,m,z,y^{t},y^{t+1}) = 1 ?

• Under P, does y^{t} depend on m ?

• Probabilistic inferences: P(m|y^{t},y^{t+1}) = ?, P(d|y^{t},y^{t+1}) = ?, P(y^{t+1}|y^{t}) = ?


### PART I

## BAYESIAN NETWORKS

### (directed graphical models)

• May be constructed using causal relationships between variables

• Quickly conveys the factorization of a distribution

• By construction, implies the distribution is normalized

• Clearly expresses dependencies and independencies between variables

• Can be used to derive fast inference algorithms

### Bayesian network

[Figure: an example Bayesian network]

### Causal construction of burglar net

[Figure: nodes e and b, each with an arrow into a]

*Assuming earthquakes don't cause burglaries, e is not connected to b*

Earthquakes and burglars trigger the alarm, so e and b are connected to a

### Causal construction of shifter net

[Figure: the shifter net. d has arrows into z_{0},…,z_{6}; each z_{i} has an arrow into y_{i}^{t} (time t); z_{i-1}, z_{i} and m have arrows into y_{i}^{t+1} (time t+1)]

### Definition of a Bayes net

• Directed graph

*No cycles when following arrows, “DAG”*

• Unique variable associated with each node

• For each node, a conditional distribution:

P(child variable | parent variables)

• Defines a joint distribution:

P(x_{1},…, x_{N}) = Π_{i} P(x_{i} | parents of x_{i})

### Conditional probabilities in burglar net

[Figure: nodes e and b, each with an arrow into a]

P(e=1) = .01, P(b=1) = .1

P(a=1|e=0,b=0) = .001, P(a=1|e=0,b=1) = .8
P(a=1|e=1,b=0) = .9, P(a=1|e=1,b=1) = .98

P(a,b,e) = P(e)P(b)P(a|e,b)
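With the burglar net's numbers, any query can be answered by summing the joint P(a,b,e) = P(e)P(b)P(a|e,b). A small sketch that computes the inferences asked about earlier (the CPT values are the slide's; the query helper is ours):

```python
# Exact inference in the burglar net by enumeration of the joint.
P_e = {0: 0.99, 1: 0.01}
P_b = {0: 0.9, 1: 0.1}
P_a1 = {(0, 0): 0.001, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.98}  # P(a=1|e,b)

def joint(a, b, e):
    pa = P_a1[(e, b)] if a == 1 else 1.0 - P_a1[(e, b)]
    return P_e[e] * P_b[b] * pa

def query(b_val, a_val, e_val=None):
    # P(b=b_val | a=a_val [, e=e_val]) by summing the joint
    num = sum(joint(a_val, b_val, e) for e in (0, 1)
              if e_val is None or e == e_val)
    den = sum(joint(a_val, b, e) for b in (0, 1) for e in (0, 1)
              if e_val is None or e == e_val)
    return num / den

print(round(query(1, 1), 3))     # P(b=1|a=1)     ~ 0.899
print(round(query(1, 1, 1), 3))  # P(b=1|a=1,e=1) ~ 0.108: "explaining away"
```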

[Figure: the shifter net, with the parents z_{1}, z_{2}, m of y_{2}^{t+1} highlighted]

### A distribution in the shifter net

P(y_{2}^{t+1}=1|z_{1}=0,z_{2}=0,m) = .01
P(y_{2}^{t+1}=1|z_{1},z_{2}=1,m=0) = .99
P(y_{2}^{t+1}=1|z_{1},z_{2}=0,m=0) = .01
P(y_{2}^{t+1}=1|z_{1}=1,z_{2},m=1) = .99
P(y_{2}^{t+1}=1|z_{1}=0,z_{2},m=1) = .01

### Direct bonuses of Bayes nets

*• The Markov blanket MB[x*_{i}] for variable
x_{i} can be read off the graph, where

P(x_{i} | MB[x_{i}], other vars) = P(x_{i} | MB[x_{i}])

• Simulating P(x_{1},…, x_{N}) is easy

• Normalization:

Σ_{x_{1}} … Σ_{x_{N}} [ Π_{i} P(x_{i} | parents of x_{i}) ] = 1

### Simulating the shifter net

*Work "down" the network*, sampling each variable after its parents:

• Sample d from P(d); sample m from P(m)

• Sample each z_{i} from P(z_{i}|d)

• Sample each y_{i}^{t} from P(y_{i}^{t}|z_{i})

• Sample each y_{i}^{t+1} from P(y_{i}^{t+1}|z_{i-1},z_{i},m)
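Ancestral ("work down the network") simulation is easy to code; here it is on the smaller burglar net, whose numbers the slides give, with the empirical P(a=1) checked against the exact value:

```python
import random

# Ancestral simulation: sample each variable after its parents.
P_a1 = {(0, 0): 0.001, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.98}  # P(a=1|e,b)
rng = random.Random(0)

def sample_net():
    e = 1 if rng.random() < 0.01 else 0         # P(e=1) = .01
    b = 1 if rng.random() < 0.1 else 0          # P(b=1) = .1
    a = 1 if rng.random() < P_a1[(e, b)] else 0  # P(a=1|e,b)
    return e, b, a

samples = [sample_net() for _ in range(100_000)]
p_a_hat = sum(a for _, _, a in samples) / len(samples)
print(p_a_hat)  # close to the exact P(a=1) of about 0.089
```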

### Markov blankets

• What is the smallest set of variables that "isolates" a variable x_{i} from the other variables in the network?

• The Markov blanket, MB[x_{i}]:

P(x_{i}| MB[x_{i}], X \ {x_{i}} \ MB[x_{i}]) = P(x_{i}| MB[x_{i}])

• If set S does not contain MB[x_{i}],
P(x_{i}| S, X \ {x_{i}} \ S) ≠ P(x_{i}| S)


### A Markov blanket in the shifter net

MB[z_{6}] = {d, y_{6}^{t}, y_{6}^{t+1}, m, z_{5}}

[Figure: the shifter net, with d, z_{5}, y_{6}^{t}, m and y_{6}^{t+1} (the Markov blanket of z_{6}) highlighted]

### Pruning Bayes nets

• For variables x_{1},…, x_{N}, b, suppose b does not have children

• If we delete node b and its edges, the resulting network describes

P(x_{1},…, x_{N}) = Σ_{b} P(x_{1},…, x_{N},b)

So,

Σ_{x_{1}} … Σ_{x_{N}} [ Π_{i} P(x_{i} | parents of x_{i}) ] = 1

### Pruning the shifter net

[Figure: the childless y^{t+1} nodes are pruned one at a time, then m, leaving the net on d, z and y^{t}]

P(d,z,y^{t}) = P(d) Π_{i} P(z_{i}|d) Π_{i} P(y_{i}^{t}|z_{i})

Use this simpler net to make inferences about d and z and y^{t}

### Noncausal constructions

Reasons for noncausal constructions

• System is not causal

• System too complex for causal construction

• For computational efficiency, a noncausal net is preferable


Procedure for noncausal construction

• Order the variables (eg, at random)

• Add the variables, one at a time

• Make the current variable a child of all previously added variables

• Delete as many edges as possible, reducing the number of parents for the current variable

The last step requires probing the physical system or answering queries

### A noncausal construction of the burglar net, order a, e, b

P(a,b,e) = P(a)P(e|a)P(b|a,e)

[Figure: a with arrows into e and b; e with an arrow into b]

• P(e|a) ≠ P(e), so leave a→e

• P(b|a,e) ≠ P(b), P(b|a,e) ≠ P(b|a), P(b|a,e) ≠ P(b|e), so leave e→b and a→b


Causal construction: P(a,b,e) = P(e)P(b)P(a|e,b)

Non-causal construction: P(a,b,e) = P(a)P(e|a)P(b|a,e)

### Are e and b independent?

Causal construction: YES. Non-causal construction: CAN'T TELL!

### Conditional independencies

• Is x^{A} independent of x^{B} given x^{S}?

Is P(x^{A}, x^{B}| x^{S}) = P(x^{A} | x^{S}) P(x^{B} | x^{S}) ?

• YES, if every path from x^{A} to x^{B} is BLOCKED

• A path can be blocked in 3 ways:

1. The path passes head-to-tail through a node in x^{S}

2. The path passes tail-to-tail through a node in x^{S}

3. The path passes head-to-head through a node that is not in x^{S} and that has no descendant in x^{S}

### Independencies in the shifter net

• P(y^{t},m) = P(y^{t}) P(m)

• P(d,m) = P(d) P(m)

• P(d,m|y^{t}) = P(d|y^{t}) P(m|y^{t})

• P(d,m|y^{t+1}) ≠ P(d|y^{t+1}) P(m|y^{t+1})

[Figure: the shifter net]

### "Extreme" Bayes nets

Factorized model (no edges between x_{1},…,x_{6}):

P(x) = Π_{i} P(x_{i})

Unstructured model (every pair of variables connected):

P(x) = Π_{i} P(x_{i}|x_{1},…,x_{i-1})

- Always true, from the chain rule of probability


### Mixture model (Naive Bayes)

[Figure: discrete class c with arrows into x_{1},…,x_{6}]

P(x,c) = P(c) Π_{i} P(x_{i}|c)

P(x) = Σ_{c} P(x,c)

SHORTHAND: c → x, with P(x,c) = P(c)P(x|c)

### Mixture of Gaussians

P(c) = π_{c}

P(z|c) = N(z; µ_{c}, Φ_{c})

[Figure: cluster means µ_{1}, µ_{2} and variances diag(Φ_{1}), diag(Φ_{2}) shown as images; mixing proportions π_{1} = 0.6, π_{2} = 0.4; an example sample z]
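Simulating the mixture is the two-step ancestral draw just described: pick c from π, then draw z from the c-th Gaussian. A 1-D sketch; the means and variances are placeholders (the slide's parameters are shown as images), while π matches the slide:

```python
import random

# Simulating a mixture of Gaussians: c ~ P(c) = pi_c, then z ~ N(z; mu_c, phi_c).
pi = [0.6, 0.4]    # mixing proportions, as on the slide
mu = [0.0, 5.0]    # placeholder means
sd = [1.0, 1.0]    # placeholder standard deviations
rng = random.Random(1)

def sample_mog():
    c = 0 if rng.random() < pi[0] else 1
    return c, rng.gauss(mu[c], sd[c])

draws = [sample_mog() for _ in range(50_000)]
frac_c0 = sum(c == 0 for c, _ in draws) / len(draws)
print(frac_c0)  # close to pi_1 = 0.6
```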


### Transformed mixture of Gaussians

(Frey and Jojic, CVPR 1999)

P(c) = π_{c}

P(x|z,s) = N(x; shift(z,s), Ψ), with Ψ diagonal

[Figure: Bayes net c → z → x, with shift s → x and prior P(s); cluster parameters µ_{1}, µ_{2}, diag(Φ_{1}), diag(Φ_{2}) shown as images; π_{1} = 0.6, π_{2} = 0.4; example latent image z and observed image x]

### Layered appearance model

(Frey, CVPR 2000)

[Figure: layers ordered from far to near; variables are the index of the object in layer *l*, the intensity of ray *n* at layer *l*, and the intensity of ray *n* at the camera]

### Multiview layered model

(variant of Torr, Szeliski, Anandan, CVPR 1999)

[Figure: Θ = params of layer planes; z(x,y) = depth in 1^{st} view of pixel at x,y in 1^{st} view; L(x,y) = layer of pixel at x,y in 1^{st} view; I(x,y) = vector of pixel intensity differences between other views and pixel intensity in 1^{st} view at x,y]

### Dynamic Bayes nets

• Just a Bayes net for time-series data


### Markov model

z_{1} → z_{2} → … → z_{t-1} → z_{t} → z_{t+1} → …

MB[z_{t}] = {z_{t-1}, z_{t+1}}

P(z_{t}|z_{1},z_{2},…, z_{t-1},z_{t+1},…) = P(z_{t}|z_{t-1},z_{t+1})

P(z_{t}|z_{1},z_{2},…, z_{t-1}) = P(z_{t}|z_{t-1})

### Hidden Markov model

[Figure: hidden chain z_{1} → z_{2} → … → z_{t} → …, with an observation x_{t} hanging off each z_{t}]

z_{t} discrete, P(z_{t}|z_{t-1}) = "transition matrix"

x_{t} discrete or continuous. Eg, P(x_{t}|z_{t}) = Normal(x_{t}; µ_{z_{t}}, C_{z_{t}})

### Linear dynamic system (Kalman filter model)

[Figure: same chain structure as the hidden Markov model]

z_{t} continuous, P(z_{t}|z_{t-1}) Gaussian

x_{t} continuous, P(x_{t}|z_{t}) Gaussian

### Transformed hidden Markov model

(Jojic, Petrovic, Frey and Huang, CVPR 2000)

[Figure: two time slices (t-1 and t) of the transformed mixture of Gaussians, with variables c, s, z, x in each slice and edges linking the slices]

### Active contour model

(Blake and Isard, Springer-Verlag 1998)

Unobserved state (LINEAR, GAUSSIAN dynamics): P(u_{t}|u_{t-1})
u_{t} = control points of spline (contour)

Observation (NONLINEAR): P(o_{t}|u_{t})
o_{t} = for all measurement lines: # edges, distance of edges from contour

Measurement lines are placed at fixed intervals along the contour

[Figure: chain u_{1} → u_{2} → … → u_{t-1} → u_{t}, with each o_{t} observed from u_{t}]

3D body tracking model

(Sidenbladh, Black, Fleet, ECCV 2000)

• 3D articulated model

• Perspective projection

• Monocular sequence

• Unknown environment

• Motion-based likelihood

State:

– joint angles and body pose φ_{t}

– joint/pose velocities V_{t}

– appearance model A_{t}

Observations:

– image at time t, I_{t}

[Figure: dynamic Bayes net linking (φ_{t-1}, V_{t-1}, A_{t-1}, I_{t-1}) to (φ_{t}, V_{t}, A_{t}, I_{t}), continuing in both directions]

Switching mixture of state-space models

(Ghahramani and Hinton, 1997)

[Figure: a system "switch" chain s_{0} → s_{1} → s_{2} selects which of two state chains (state of system A, state of system B, each x_{0} → x_{1} → x_{2}) generates the measurements y_{0}, y_{1}, y_{2}]

### Mixed-state dynamic Bayes net

(Pavlovic, Frey and Huang, CVPR 1999)

• Uses discrete-state HMM to drive

continuous dynamics (Kalman filtering)

[Figure: a discrete chain s_{0} → s_{1} → s_{2} (decision/action/mode) drives a continuous chain x_{0} → x_{1} → x_{2} (state of dynamics), which generates the measurements y_{0}, y_{1}, y_{2}]

### Easy-living net

Microsoft, 2002

### PART II

## UNDIRECTED

## GRAPHICAL MODELS


### Markov random field (MRF)

• Undirected graph on variables

• Graph gives Markov blankets:

The Markov blanket of a variable is the variable’s neighbors

[Figure: a grid-structured MRF; the shaded neighbors of z are the variables in the Markov blanket for z]

### The distribution for an MRF

• If P(x_{1},…, x_{N}) ≠ 0 for all configs of x_{1},…, x_{N}, then P(x_{1},…, x_{N}) can be expressed

P(x_{1},…, x_{N}) = α Π_{c} φ_{c}({x_{i} : i ∈ Q_{c}})

• c indexes the maximal cliques

• Q_{c} is the set of the variables in clique c

• φ_{c}( ) is a strictly positive function (potential) on the variables in clique c

• α is a normalizing constant
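For small discrete MRFs the normalizing constant α can be computed by enumerating all configurations. A sketch on a single binary clique with an arbitrary made-up positive potential:

```python
from itertools import product

# Normalizing a tiny binary MRF with one maximal clique {e, b, a}.
# The potential values are arbitrary positive numbers, made up for illustration.
phi1 = {(e, b, a): 1.0 + e + 2 * b + 4 * a
        for e, b, a in product((0, 1), repeat=3)}

Z = sum(phi1.values())                   # Z = 1/alpha
P = {cfg: phi1[cfg] / Z for cfg in phi1}  # P = alpha * phi1
print(Z)  # 36.0 for these potentials; P sums to 1 by construction
```

For large grids this enumeration is exponential, which is why MRF inference usually relies on the approximate methods discussed later.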


### Burglar MRF

[Figure: e, b and a fully connected by undirected edges]

1 maximal clique: Q_{1} = {e,b,a}

Clique potential: φ_{1}(e,b,a)

Distribution: P(e,b,a) = α φ_{1}(e,b,a)

Are e and b independent? CAN'T TELL!

### Line processes

(Geman and Geman)

[Figure: binary patterns on a maximal clique of the field. *Patterns with high* φ contain continuous line segments; *patterns with low* φ contain broken or isolated pixels]

### Under P(), lines are probable

### Markov network for image and scene patch modeling

(from Freeman and Pasztor, ICCV 1999)

image patches

scene patches

### Bayes net – MRF hybrids

• Suppose we have an MRF for x, with distribution P^{MRF}(x)

• Suppose we have a Bayes net for z, with distribution P^{BN}(z)

• Then, we can add directed edges connecting variables in x to variables in z, creating a modified Bayes net, P^{BN}(z|x)

• The joint distribution is P^{BN}(z|x)P^{MRF}(x)

BN-MRF hybrid: Multiview layered model

(variant of Torr, Szeliski, Anandan, CVPR 1999)

[Figure: as before. Θ = params of layer planes; z(x,y) = depth in 1^{st} view of pixel at x,y in 1^{st} view; L(x,y) = layer of pixel at x,y in 1^{st} view; I(x,y) = vector of pixel intensity differences between other views and pixel intensity in 1^{st} view at x,y]

### Factor graphs

(Kschischang, Frey, Loeliger, subm to IEEE Trans IT)

• Bipartite graph: variable nodes and function nodes

• A *local function* is associated with each function node; this function depends on the neighboring variables

• The *global function* is given by the product of the local functions

### Burglar factor graphs

[Figure: two factor graphs on variables e, b, a]

• One function node for the full joint P(e,a,b): no independencies shown (like the MRF)

• Function nodes P(e), P(b), P(a|e,b): same independencies as the Bayes net

### Converting an MRF to a factor graph

• Create variable nodes

• Create one function node for each maximal clique in the MRF

• Connect each function node to the variables in the corresponding clique

• Set the function associated with each function node to the corresponding clique potential

• Global function = MRF distribution

• Each MRF has a unique factor graph

• Different factor graphs may have the same MRF


### Converting a Bayes net to a factor graph

• Create variable nodes

• For each variable, create one function node and connect it to the variable

• Connect each function node to the parents of the corresponding variable

• Set the function associated with each function node to the corresponding conditional pdf in the Bayes net

• Global function = Bayes net distribution

• If the child of each local function is indicated (eg, with an arrow), the resulting factor graph has the same semantics as a Bayes net

### PART III

## INFERENCE


### Probabilistic inference

Recall that for a correct *generative model* P(x_{1}, x_{2}, …, x_{N}), a *probabilistic inference*

P(x_{i} | Observed x's)

gives optimal decisions for x_{i}

### Inference: Mixture of Gaussians

P(c) = π_{c}

P(z|c) = N(z; µ_{c}, Φ_{c})

P(c|z) = P(z|c)P(c) / Σ_{c} P(z|c)P(c)

[Figure: for one input z the posterior is P(c=1|z) = .9, P(c=2|z) = .1; for another input, P(c=1|z) = .2, P(c=2|z) = .8]

### Inference: Transformed mixture of Gaussians

P(x,z,s,c) = P(x|z,s)P(s)P(z|c)P(c)

[Figure: Bayes net with discrete c into linear-Gaussian z, and discrete s with z into x (linear Gaussian); given an input x, inference yields c^{MAP} = 1, s^{MAP} and z^{MAP}]

### General “brute force” inference

• Suppose x_{1}, x_{2}, …, x_{N} are binary

P(x_{1}) = Σ_{x_{2}} Σ_{x_{3}} … Σ_{x_{N}} P(x_{1}, x_{2}, …, x_{N})

• This takes about 2^{N} operations

• Generally, computing P(x_{i}|Observed x's) takes about 2^{N - #observed x's} operations
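The exponential cost is easy to see in code: the marginal below sums the joint over all 2^{N} configurations. The chain-structured joint is made up just to have something concrete to sum:

```python
from itertools import product

# Brute-force marginalization: P(x_1) for N binary variables costs
# about 2^N joint evaluations.
N = 10

def joint(xs):
    p = 0.5                                        # P(x_1)
    for i in range(1, N):
        p *= 0.9 if xs[i] == xs[i - 1] else 0.1    # made-up P(x_i | x_{i-1})
    return p

marg = {0: 0.0, 1: 0.0}
for xs in product((0, 1), repeat=N):               # 2^N = 1024 terms
    marg[xs[0]] += joint(xs)
print(marg)  # each marginal is 0.5, by the symmetric prior on x_1
```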


### Inference in Bayes nets

[Figure: the shifter net with y^{t} and y^{t+1} observed; inference gives P(m=1|Obs) = 0.8, P(d=1|Obs) = 0.2, and P(z_{i}|Obs) for each i]

### Observed parents “abandon” children

• We can remove the edges connecting observed parents to their children

[Figure: node o with parents a_{1}, a_{2} and children c_{1}, c_{2}]

Observation: o=o'. Replace P(o|a_{1},a_{2}) by P(o=o'|a_{1},a_{2}) and P(c_{1}|o), P(c_{2}|o) by P(c_{1}|o=o'), P(c_{2}|o=o'); the edges from o to its children can then be removed.

### Sum-product algorithm

(probability propagation, forward-backward algorithm) (Gallager 1963; Pearl 1986; Lauritzen & Spiegelhalter 1986; …)

• Suppose we have a graphical model for discrete variables x_{1}, x_{2}, …, x_{N}

• If the graphical model is a tree (or "close" to being a tree), the sum-product algorithm can compute P(x_{i}|Observed x's) for all x_{i} *in LINEAR TIME*

### Example: Discrete Markov model

A → B → C → D → E

P(A,B,C,D,E) = P(E|D)P(D|C)P(C|B)P(B|A)P(A)

P(E) = Σ_{D} Σ_{C} Σ_{B} Σ_{A} P(E|D)P(D|C)P(C|B)P(B|A)P(A)

= Σ_{D} Σ_{C} Σ_{B} P(E|D)P(D|C)P(C|B) [Σ_{A} P(B|A)P(A)]

= Σ_{D} Σ_{C} P(E|D)P(D|C) [Σ_{B} P(C|B) [Σ_{A} P(B|A)P(A)]]

= Σ_{D} P(E|D) [Σ_{C} P(D|C) [Σ_{B} P(C|B) [Σ_{A} P(B|A)P(A)]]]

Naming the bracketed sums f(B) = Σ_{A} P(B|A)P(A), g(C) = Σ_{B} P(C|B)f(B), h(D) = Σ_{C} P(D|C)g(C):

P(E) = Σ_{D} P(E|D)h(D)

Each bracketed sum is a short message passed along the chain, so the total work is linear in the chain length.
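The derivation above maps directly to code: each message is a small vector, and each step is one matrix-vector product. The prior and transition numbers below are made up for illustration, with the same 2x2 transition matrix reused on every edge:

```python
# Marginal P(E) on the chain A -> B -> C -> D -> E by pushing sums inward:
# f(B) = sum_A P(B|A)P(A), g(C) = sum_B P(C|B)f(B), and so on.
PA = [0.3, 0.7]               # P(A), made up
T = [[0.8, 0.2], [0.4, 0.6]]  # T[i][j] = P(next=j | prev=i), made up

def forward(msg):
    # One sum-product step: new[j] = sum_i msg[i] * T[i][j]
    return [sum(msg[i] * T[i][j] for i in (0, 1)) for j in (0, 1)]

f = forward(PA)   # message into B
g = forward(f)    # message into C
h = forward(g)    # message into D
PE = forward(h)   # P(E): four tiny products instead of a 4-fold sum
print(PE)
```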

### General sum-product algorithm

• Messages: short vectors of numbers; interpret as functions of discrete vars

• Messages flow in both directions on each edge

**• Initially, all messages are set to 1**

• Messages are updated randomly or in a given order

• Messages are fused to compute (or approximate) P(x_{i}|Observed x's)

### Passing messages in Bayes nets

[Figure: node c with parents a, b and children d, e; messages f, g, h, q flow on the edges]

Against an edge (message from c to its parent a):

f(a) = Σ_{c,b} h(c)q(c)P(c|a,b)g(b)

With an edge (message from c to a child):

f(c) = Σ_{a,b} q(c)P(c|a,b)g(a)h(b)

Fusion at c:

P(c|o) ≅ α Σ_{a,b} h(c)q(c)P(c|a,b)f(a)g(b)

*Each message is a function of its parent*

### Propagating observations in Bayes nets

Observation: c=c'

Against an edge:

f(a) = Σ_{b} h(c')q(c')P(c'|a,b)g(b)

With an edge:

f(c') = Σ_{a,b} q(c')P(c'|a,b)g(a)h(b), and f(c) = 0 for c not equal to c'

### Passing messages in factor graphs

• Much simpler than Bayes nets!

• Bayes nets and MRFs can be converted to factor graphs really easily

Out of a variable node:

f(a) = g(a)h(a)

Out of a function node with local function φ(a,b,c):

f(a) = Σ_{b} Σ_{c} φ(a,b,c)g(b)h(c)

Fusion at a variable node:

P(a|o) ≅ f(a)g(a)h(a)

Each message is a function of its neighboring variable

### Result of fusion

• Unobserved variables: u_{1},…,u_{K}

• Observed variables: o

• Fusion at u_{i} estimates

P(u_{i},o) = Σ_{u_{1},…,u_{i-1},u_{i+1},…,u_{K}} P(u_{1},…,u_{K},o)

• Local normalization:

P(u_{i}|o) = P(u_{i},o) / Σ_{u_{i}} P(u_{i},o)

### Properties of sum-product

• Exact in trees

• Computationally efficient even for

– linear Gaussian vars, without discrete children
– observed real vars with discrete parents

• Some applications:

– Error-correcting decoding (trellis codes)
– Speech recognition (HMM is a tree)
– Kalman tracking (LDS net is a tree)
– Multiscale smoothing (tree on image), …


### Max-product (Viterbi) algorithm

• Replace SUM with MAX in the sum-product algorithm

• Max-product computes

Φ(u_{i}) = max_{u_{1},…,u_{i-1},u_{i+1},…,u_{K}} P(u_{1},…,u_{K},o)

• MAP configuration:

u_{i}^{MAP} = argmax_{u_{i}} Φ(u_{i})
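On a chain this is the Viterbi algorithm: run the forward pass with max in place of sum, keep back-pointers, and trace back the MAP configuration. The prior and transition numbers are made up for illustration:

```python
# Max-product (Viterbi) on a binary chain of length 5.
PA = [0.3, 0.7]               # made-up prior on the first variable
T = [[0.8, 0.2], [0.4, 0.6]]  # T[i][j] = P(next=j | prev=i), made up
L = 5                         # chain length

phi = [list(PA)]              # phi[t][j] = max prob of any path ending in j
back = []
for _ in range(L - 1):
    prev = phi[-1]
    cur, bp = [], []
    for j in (0, 1):
        scores = [prev[i] * T[i][j] for i in (0, 1)]
        cur.append(max(scores))
        bp.append(scores.index(max(scores)))
    phi.append(cur)
    back.append(bp)

path = [phi[-1].index(max(phi[-1]))]  # best final state...
for bp in reversed(back):
    path.append(bp[path[-1]])         # ...then follow back-pointers
path.reverse()
print(path)  # [1, 0, 0, 0, 0]: start in the likelier state, then jump
```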

### What if graph is not a tree?

• Can “cluster” variables

• Can convert the graph to a “join tree”

(Lauritzen and Spiegelhalter 1988)

• Can use “bucket elimination” (Dechter 1999)

BUT, THESE METHODS ONLY WORK WHEN THE NUMBER OF CYCLES IS TRACTABLE

### Lots of systems are best described by nets with an intractable number of cycles

*Quite a few cycles*

[Figure: the shifter net, whose undirected paths through d, z and m form many cycles]

*TOO MANY CYCLES!*

[Figure: large grid-structured nets]

### Intractable local computations

Even if the graph is a tree, the local functions (conditional probabilities, potentials) may not yield tractable sum-product computations

• Eg, non-Gaussian pdfs, as in the active contour model (Blake and Isard, Springer-Verlag 1998) shown earlier: the state dynamics P(u_{t}|u_{t-1}) are linear (Gaussian), but the observation model P(o_{t}|u_{t}) is nonlinear

### Approximate inference

• Monte Carlo

• Markov chain Monte Carlo

• Variational techniques

• Local probability propagation

• Alternating maximizations

### Monte Carlo inference

• u = unobserved vars; o = observed vars

• Obtain random sample u^{(1)}, u^{(2)}, …, u^{(R)} and use it to

– Represent P(u|o)
– Estimate an expectation,

E[f] = Σ_{u} f(u)P(u|o)

Eg, P(u_{i}=1|o) = Σ_{u} I(u_{i}=1)P(u|o)

I(expr) = 1 if expr is true, I(expr) = 0 otherwise

### Expectations from a sample

• From the sample u^{(1)}, u^{(2)}, …, u^{(R)}, we can estimate

E[f] ≅ (1/R) Σ_{r} f(u^{(r)})

• If u^{(1)}, u^{(2)}, …, u^{(R)} are independent draws from P(u|o), this estimate

– is unbiased
– has variance ∝ 1/R
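The 1/R variance is easy to check empirically. A sketch with a made-up target, E[x^2] = 1 under a standard normal: repeating the estimator at two sample sizes, the spread of the estimates shrinks roughly 100-fold when R grows 100-fold:

```python
import random

# Monte Carlo estimate of E[f]: unbiased, variance proportional to 1/R.
rng = random.Random(2)

def estimate(R):
    # Estimate E[x^2] = 1 under a standard normal from R draws
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(R)) / R

def spread(xs):
    # Empirical variance of a list of estimates
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

small = [estimate(100) for _ in range(100)]
big = [estimate(10_000) for _ in range(100)]
ratio = spread(small) / spread(big)
print(ratio)  # roughly 100, matching variance ~ 1/R
```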

### Rejection sampling

• Goal: Hold o constant, draw u from P(u|o)

Given:

• P*(u) ∝ P(u,o); *can eval P*(u)*

• B(u) ≥ P*(u); can eval B(u), can "sample" B(u)

Procedure:

• Draw u from normalized form of B(u)

• Randomly accept u with prob P*(u)/B(u)

• Otherwise reject and draw again

Efficiency measured by rejection rate
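A minimal sketch of the procedure, with a made-up 1-D target: P*(u) = u^2 on [0,1] (so the normalized target is 3u^2, with mean 3/4) and a constant bound B(u) = 1:

```python
import random

# Rejection sampling: draw from the normalized bound, accept w.p. P*(u)/B(u).
rng = random.Random(3)

def p_star(u):
    return u ** 2          # unnormalized target: can evaluate, can't sample

B_max = 1.0                # constant bound: B(u) = 1 >= P*(u) on [0, 1]

def draw():
    while True:
        u = rng.random()                       # sample from normalized B
        if rng.random() < p_star(u) / B_max:
            return u                           # accept with prob P*(u)/B(u)
        # otherwise reject and draw again

samples = [draw() for _ in range(20_000)]
mean_u = sum(samples) / len(samples)
print(mean_u)  # close to 3/4, the mean of p(u) = 3u^2
```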

### Rejection sampling in the shifter net

• Choose P*(d,m,z) = P(d,m,z,y^{t},y^{t+1})

• Choose B(d,m,z) = 1 ≥ P*(d,m,z)

• Draw d,m,z from uniform distribution

• Randomly accept with probability P*(d,m,z)/B(d,m,z) = P(d,m,z,y^{t},y^{t+1})

[Figure: a draw of d,m,z that contradicts the observed y^{t}, y^{t+1} is rejected; a draw that explains them is accepted]

### Importance sampling

• Goal: Holding o fixed, represent P(u|o) by a weighted sample

• Find P*(u) ∝ P(u,o) and Q*(u), such that can evaluate P*(u)/Q*(u) and can "sample" Q*(u)

• Sample u^{(1)}, u^{(2)}, …, u^{(R)} from Q*(u)

• Compute weights w^{(r)} = P*(u^{(r)})/Q*(u^{(r)})

• Represent P(u|o) by { u^{(r)}, w^{(r)}/(Σ_{j} w^{(j)}) }

• Eg, E[f] ≅ Σ_{r} f(u^{(r)}) w^{(r)}/(Σ_{j} w^{(j)})

Accuracy given by "effective sample size"
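The same steps in code, for a made-up 1-D target P*(u) = u^2 on [0,1] with uniform proposal Q*(u) = 1: every draw is kept, and the weights carry the correction:

```python
import random

# Importance sampling with the self-normalized estimate
# E[f] ~= sum_r f(u_r) w_r / sum_j w_j.
rng = random.Random(4)

def p_star(u):
    return u ** 2                              # unnormalized target, made up

us = [rng.random() for _ in range(20_000)]     # draws from Q*(u) = 1
ws = [p_star(u) / 1.0 for u in us]             # weights w = P*(u)/Q*(u)
est = sum(u * w for u, w in zip(us, ws)) / sum(ws)   # estimate of E[u]
print(est)  # close to 3/4, the mean of the normalized target 3u^2
```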

### Importance sampling in the shifter net

• Choose P*(d,m,z) = P(d,m,z,y^{t},y^{t+1})

• Choose Q*(d,m,z) = 1

• Draw (d,m,z)^{(r)} from uniform distribution

• Weight (d,m,z)^{(r)} by P(d^{(r)},m^{(r)},z^{(r)},y^{t},y^{t+1})

[Figure: a draw that contradicts the observed y^{t}, y^{t+1} gets low weight; a draw that explains them gets high weight]

### A better Q-distribution

• Choose P*(d,m,z) = P(d,m,z,y^{t},y^{t+1})

• Choose Q*(d,m,z) = P(d,m,z)

• Draw (d,m,z)^{(r)} from P(d,m,z)

• Weight (d,m,z)^{(r)} by P(y^{t},y^{t+1}|d^{(r)},m^{(r)},z^{(r)})

Called "likelihood weighting"
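Likelihood weighting is just importance sampling with the prior as the proposal. On the burglar net (whose CPT values the slides give), sample (e, b) from the priors and weight by the likelihood of the observation a=1:

```python
import random

# Likelihood weighting: estimate P(b=1|a=1) in the burglar net.
P_a1 = {(0, 0): 0.001, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.98}  # P(a=1|e,b)
rng = random.Random(5)

num = den = 0.0
for _ in range(200_000):
    e = 1 if rng.random() < 0.01 else 0   # draw e from its prior
    b = 1 if rng.random() < 0.1 else 0    # draw b from its prior
    w = P_a1[(e, b)]                      # weight = likelihood of a=1
    num += w * b
    den += w
est = num / den
print(est)  # close to the exact P(b=1|a=1) of about 0.9
```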

### Particle filtering (condensation)

(Isard, Blake, et al, et al, et al,…)

• Goal: Use sample S = u_{t}^{(1)},…,u_{t}^{(R)} from P(u_{t}|o_{1},…,o_{t-1}) to sample from P(u_{t}|o_{1},…,o_{t})

• Weight each "particle" u_{t}^{(r)} in S by P(o_{t}|u_{t}^{(r)})

• Redraw a sample S' from the weighted sample

• For each particle u_{t} in S', draw u_{t+1} from P(u_{t+1}|u_{t})

Exact, for infinite-size samples. For finite-size samples, it may lose modes.

P(u_{t}|o_{1},…,o_{t}) ∝ P(o_{t}|u_{t}) P(u_{t}|o_{1},…,o_{t-1})

This is importance sampling at each step, with Q*(u_{t}) = P(u_{t}|o_{1},…,o_{t-1}) and P*(u_{t}) the right-hand side above.
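The weight-resample-propagate loop can be sketched in a few lines. The model here is an assumption for illustration (a 1-D random-walk state with Gaussian observation noise), not taken from the slides:

```python
import math
import random

# Bootstrap particle filter sketch for a made-up 1-D random-walk state.
rng = random.Random(6)

def likelihood(o, u):
    return math.exp(-0.5 * (o - u) ** 2)   # unnormalized Gaussian P(o_t|u_t)

def step(particles, o):
    ws = [likelihood(o, u) for u in particles]                  # weight
    new = rng.choices(particles, weights=ws, k=len(particles))  # resample S'
    return [u + rng.gauss(0.0, 0.5) for u in new]  # propagate via P(u_{t+1}|u_t)

particles = [rng.gauss(0.0, 1.0) for _ in range(2_000)]  # initial sample
for o in [0.2, 0.4, 0.5, 0.7]:                           # made-up observations
    particles = step(particles, o)
mean = sum(particles) / len(particles)
print(mean)  # the particle mean roughly tracks the observation stream
```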

### Condensation for active contours

(Blake and Isard, Springer-Verlag 1998)

Unobserved state (LINEAR, GAUSSIAN dynamics): P(u_{t}|u_{t-1})
u_{t} = control points of spline (contour)

Observation (NONLINEAR): P(o_{t}|u_{t})
o_{t} = for all measurement lines: # edges, distance of edges from contour

[Figure: chain u_{1} → u_{2} → … → u_{t-1} → u_{t}, with each o_{t} observed from u_{t}]

• Sampling from P(u_{t}|o_{1}, …