Undirected graphical models
(Markov Random Field or Markov Network)
References:
Books: Bishop Ch. 8; PGM Ch. 4
Slides & notes: Cédric Archambeau's slides at the PASCAL Bootcamp 2010; Tibério Caetano's slides at MLSS 2008; CMU graphical models lecture notes
Prof. Shou-de Lin
sdlin@csie.ntu.edu.tw CSIE/GINM
1 SAI 2010
A Recommendation Letter Example
• The chance of getting a job depends on the recommendation letter. However, only one letter is provided to the company, and which one is determined by the 'choice' variable.
• How about the dependency (L1 ⊥ J | C = 2)?
  – A BN makes independence assertions only at the level of variables.
• How about (L1 ⊥ L2 | C, J)?
• There is no perfect map using a BN for such a distribution
[Figure: BN over the variables Choice, Letter1, Letter2, and Job]
A Friendship Example
• Each random variable represents whether a person knows a piece of news.
• A goes out with B and D; C goes out with B and D (see the right graph). A and C do not know each other, and neither do B and D.
• The left graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D | A).
• The left graph implies B and D are dependent given C and A.
• The middle graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D).
• There is no perfect map using a BN for such a distribution
Undirected graphical models
• Ideally we would like more freedom in the graph
• Markov random fields allow the specification of a different class of conditional independence (CI) statements
• The class of CI statements for MRFs can be easily defined by graphical means in undirected graphs
  – The absence of an edge implies a CI statement
• Local potential functions and the cliques in the graph completely determine the joint distribution
• They give correlations between variables, but no explicit way to generate samples
In the dating example
• P(a, b, c, d) = φ1(A,B) · φ2(B,C) · φ3(C,D) · φ4(D,A) / Z
• Z is a normalization term
5 SAI 2010
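The factorization above can be sketched numerically. A minimal Python example (the potential values are made up for illustration, not from the slides) enumerates all 2⁴ assignments of the loop A–B–C–D–A to compute Z and check that the normalized joint sums to 1:

```python
from itertools import product

# Hypothetical pairwise potentials for the 4-node loop A-B-C-D-A.
phi = {
    ("A", "B"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("B", "C"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("C", "D"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("D", "A"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
}

def unnormalized(a, b, c, d):
    """Product of the four clique potentials."""
    v = {"A": a, "B": b, "C": c, "D": d}
    p = 1.0
    for (x, y), table in phi.items():
        p *= table[(v[x], v[y])]
    return p

# Z sums the unnormalized product over all 2^4 assignments.
Z = sum(unnormalized(*xs) for xs in product([0, 1], repeat=4))

def joint(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

# After dividing by Z, the probabilities sum to 1.
total = sum(joint(*xs) for xs in product([0, 1], repeat=4))
```

The same pattern works for any set of clique potentials; only the enumeration cost grows with the number of variables.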
Def: Markov Network (MN) or Markov Random Field (MRF)
• How about a sequence of multiplications?

  p(x1, …, xn) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}),  where Z = Σ_{x1,…,xn} ∏_{i=1..K} g_{Ci}(x_{Ci})

  and each g_{Ci}(x_{Ci}) is a function of a subset of x.
• A distribution PΦ is a Gibbs distribution parameterized by a set of factors Φ = {φ1(D1), …, φk(Dk)} if it is defined as

  PΦ(X1, …, Xn) = φ1(D1) · φ2(D2) ⋯ φk(Dk) / Z  (Z is a normalization factor)
What are the proper Factors for MRF?
• Ip(H): every pair of nodes that are not connected are conditionally independent given all other nodes:
  – P(x1, x2 | X∖{x1, x2}) = P(x1 | X∖{x1, x2}) · P(x2 | X∖{x1, x2})
• Therefore, nodes that are not connected should NOT be put in the same factor; nodes that ARE connected should be put in the same factor
• How can we identify nodes that ARE connected?
– Maximal clique
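The pairwise property Ip(H) can be checked numerically on the four-node loop A–B–C–D–A. In the sketch below (the potential values are illustrative; any nonnegative pairwise potential works), A and C are not connected, so P(A, C | B, D) should factor into P(A | B, D) · P(C | B, D):

```python
from itertools import product

# Square MRF A-B-C-D-A with an illustrative pairwise potential.
def phi(u, v):
    return 2.0 if u == v else 1.0

def unnorm(a, b, c, d):
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

Z = sum(unnorm(*xs) for xs in product([0, 1], repeat=4))
P = {xs: unnorm(*xs) / Z for xs in product([0, 1], repeat=4)}

def cond_ac_given_bd(a, c, b, d):
    """P(A=a, C=c | B=b, D=d)."""
    num = P[(a, b, c, d)]
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

def cond_a_given_bd(a, b, d):
    num = sum(P[(a, b, cc, d)] for cc in (0, 1))
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

def cond_c_given_bd(c, b, d):
    num = sum(P[(aa, b, c, d)] for aa in (0, 1))
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

# Pairwise Markov property: non-adjacent A and C are independent given {B, D}.
ok = all(abs(cond_ac_given_bd(a, c, b, d)
             - cond_a_given_bd(a, b, d) * cond_c_given_bd(c, b, d)) < 1e-12
         for a, b, c, d in product([0, 1], repeat=4))
```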
Cliques and Maximal Cliques
• A clique of a graph is a complete subgraph (each pair of nodes has an edge)
• A maximal clique of a graph is a clique that is not a subset of another clique
• {A, C} forms a clique
• {A, B, D} and {B, E} are maximal cliques
[Figure: graph with edges A–C, A–B, A–D, B–D, B–E]
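A brute-force sketch of clique enumeration for this example (assuming the edge set A–C, A–B, A–D, B–D, B–E implied by the cliques listed above; exhaustive search is fine only for tiny graphs):

```python
from itertools import combinations

# Assumed edge set for the 5-node example graph.
edges = {frozenset(e) for e in
         [("A", "C"), ("A", "B"), ("A", "D"), ("B", "D"), ("B", "E")]}
nodes = {"A", "B", "C", "D", "E"}

def is_clique(s):
    """Every pair of nodes in s must share an edge."""
    return all(frozenset(p) in edges for p in combinations(s, 2))

# Enumerate all cliques of size >= 2.
cliques = [frozenset(s) for r in range(2, len(nodes) + 1)
           for s in combinations(sorted(nodes), r) if is_clique(s)]

# A maximal clique is not a proper subset of another clique.
maximal = [c for c in cliques if not any(c < d for d in cliques)]
```

On this graph the maximal cliques are {A, C}, {B, E}, and {A, B, D}; note {A, C} is also maximal here since C has no other neighbors.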
Factorization in MRF
• For a graph with K maximal cliques, we can decompose the joint as

  p(x1, …, xn) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}),  where Z = Σ_{x1,…,xn} ∏_{i=1..K} g_{Ci}(x_{Ci})

  (x_{Ci} are the nodes belonging to the maximal clique Ci)
• The factorization is in terms of local potential functions {g_{Ci}(·)}, with g_{Ci}(·) ≥ 0 for all i
• Can also be described as: a Gibbs distribution p factorizes over an MRF H if each x_{Ci} is a set of nodes forming a clique Ci of H
• The potential functions do not necessarily have a probabilistic interpretation
Example of potential function
• Let B ∈ {0, 1}, E ∈ {0, 1}, and C1 = {B, E}

[Figure: the graph with nodes A, B, C, D, E from the previous slide]

  B | E | g(B,E)
  0 | 0 | 0.4
  0 | 1 | 0.8
  1 | 0 | 3.0
  1 | 1 | 2.5

• g(B,E) is not necessarily a probability
• A CPD can be seen as a special case of the factors
Separation, CI and Factorization in MRF (1/2)
• Factorization ⇒ CI: If a probability distribution factorizes according to an undirected graph, and if X, Y and Z are disjoint subsets of nodes such that Z separates X from Y (every path from X to Y passes through Z), then the distribution satisfies X ⊥ Y | Z
Proof:
• We start by considering the case where X∪Y∪Z = S. As Z separates X from Y, there are no direct edges between X and Y. Hence, any clique in H is fully contained either in X∪Z or in Y∪Z. Let Ix be the indexes of the cliques contained in X∪Z, and let Iy be the indexes of the remaining cliques. Therefore

  P(X1, …, Xn) = ∏_{i∈Ix} φi(Di) · ∏_{i∈Iy} φi(Di) / K,  where K is the normalization factor,

  so P(X1, …, Xn) = f(X, Z) · g(Y, Z) / K ⇒ X ⊥ Y | Z
• Then we consider the case where X∪Y∪Z ⊊ S. Let U = S − (X∪Y∪Z); it is possible to partition U into two disjoint sets U1 and U2 such that Z separates X∪U1 from Y∪U2. Similar to the above argument, we can conclude

  P(X1, …, Xn) = f(X, U1, Z) · g(Y, U2, Z) / K ⇒ (X, U1 ⊥ Y, U2 | Z) ⇒ X ⊥ Y | Z
Separation, CI and Factorization in MRF (2/2)
CI properties and factorization are equivalent in an MRF:
• Factorization ⇒ CI: If a probability distribution factorizes according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B, then the distribution satisfies A ⊥ B | C
• CI ⇒ Factorization: If a positive probability distribution satisfies the CI statements implied by graph separation over the undirected graph, then it also factorizes according to this graph.
  – Known as the Hammersley–Clifford theorem (proof: section 4.4 in PGM, or section 4.2.3 in our textbook)
Local Dependency for Markov Random Field
• Weakest dependency: two nodes that are not connected are conditionally independent given all other nodes
  – The set of all CI statements satisfying this condition is called Ip(H)
• Less weak dependency: a node is conditionally independent of every other node in the network given its direct neighbors
  – The set of all CI statements satisfying this condition is called Il(H)
[Figure: example MRF with nodes X1 … X8]
Global Graph Separation
• If every path from A to B includes at least one node from C, then C is said to separate A from B in G.
• Path is blocked by C ⇒ A ⊥ B | C
• The set of all CI statements satisfying this condition is denoted I(H)
• For ANY MRF, I(H) ⇒ Il(H) ⇒ Ip(H)
• For a positive joint probability distribution P, the following three statements are equivalent (proof: see 4.3.2.2 in PGM):
  – P ⊨ I(H)
  – P ⊨ Il(H)
  – P ⊨ Ip(H)
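Graph separation itself is easy to test mechanically: remove C from the graph and check whether A can still reach B. A minimal sketch using breadth-first search, tried on the friendship square A–B–C–D (edges A–B, B–C, C–D, D–A):

```python
from collections import deque

def separates(adj, A, B, C):
    """True iff every path from A to B passes through C,
    i.e. deleting C disconnects A from B."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False          # found an unblocked path into B
        for v in adj.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# The friendship square: {B, D} separates A from C, but {B} alone does not.
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
```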
A canonical example: image denoising
This figure is from Tibério Caetano’s slide at MLSS 2008
Image denoising:
Ising model
• yi is the observed noisy pixel and xi is the unknown noise-free pixel
• There is a strong correlation between xi and yi
• Neighboring pixels xi and xj in an image are strongly correlated
• Cliques: {xi, yi} and {xi, xj}, where i and j are indices of neighboring pixels; noisy nodes are correlated with denoised nodes
• E(x, y) = h Σi xi − β Σ{i,j} xi xj − η Σi xi yi,  p(x, y) = e^{−E(x,y)} / Z
  (the h term biases the sign of xi; the β term says neighboring nodes should take the same value; the η term ties xi to its observation yi)
• The lower the energy, the better (i.e., the higher the probability)
Optimization in Ising Model
• E(x, y) = h Σi xi − β Σ{i,j} xi xj − η Σi xi yi,  p(x, y) = e^{−E(x,y)} / Z
• Iterated conditional modes (ICM):
  – It is a greedy coordinate-wise ascent method.
  – First initialize xi = yi for all i.
  – Then take one node xj at a time and evaluate the total energy for the two states xj = +1 and xj = −1; choose the better assignment.
  – The algorithm converges to a local optimum.
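A compact sketch of ICM for this energy (assuming xi ∈ {−1, +1} as in Bishop's formulation; the parameter values h, β, η below are illustrative, not from the slides):

```python
def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=5):
    """Iterated conditional modes for the Ising energy
    E(x, y) = h*sum_i(x_i) - beta*sum_neighbors(x_i*x_j) - eta*sum_i(x_i*y_i).
    y is a 2-D list of +/-1 pixels; returns the denoised labels x."""
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]                 # initialize x_i = y_i

    def local_energy(i, j, s):
        # Only the terms involving pixel (i, j) change when it flips.
        e = h * s - eta * s * y[i][j]
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                e -= beta * s * x[ni][nj]
        return e

    for _ in range(sweeps):                   # coordinate-wise greedy sweeps
        for i in range(rows):
            for j in range(cols):
                # choose the state with the lower local energy
                x[i][j] = min((1, -1), key=lambda s: local_energy(i, j, s))
    return x

# A single flipped pixel in an all-ones image is restored.
noisy = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
clean = icm_denoise(noisy)
```

Because each update can only lower the total energy, the sweep converges, but only to a local optimum, as the slide notes.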
MRF versus Bayesian network
Similarities:
• CI properties of the joint are encoded into the graph structure and define families of structured probability distributions.
• CI properties are related to concepts of separation of (groups of) nodes in the graph.
• Local entities in the graph imply the simplified algebraic structure (factorization) of the joint.
MRF versus Bayesian network
Differences:
• The set of probability distributions represented by MRFs is different from the set represented by Bayesian networks.
• MRFs have a normalization constant that couples all factors, whereas Bayesian networks do not.
• Factors in Bayesian networks are probability distributions, while factors in MRFs are nonnegative potentials.
Markov Blanket in BN and MRF
• The Markov Blanket of a node xi in either a BN or an MRF is the smallest set of nodes A such that p(xi |x~i ) = p(xi |xA )
• BN: parents, children and co-parents of children of the node
• MRF: neighbors of the node
[Figure: Markov blanket of a node in a BN (left) and in an MRF (right)]
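The BN rule (parents, children, co-parents of children) can be sketched directly from a parent map; the DAG below is hypothetical:

```python
def markov_blanket(parents, node):
    """parents maps each node to the list of its parents in the DAG.
    The blanket is: parents + children + co-parents of children."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))
    blanket.update(children)
    for c in children:
        blanket.update(p for p in parents[c] if p != node)
    return blanket

# Hypothetical DAG: A -> C <- B, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
```

For node A the blanket is {B, C}: its child C plus the co-parent B. In an MRF the same computation reduces to reading off the node's neighbors.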
Mapping a Linear Bayesian Network into an MRF
• Bayesian network: x1 → x2 → x3 → … → xn

  p(x1, …, xn) = p(x1) p(x2 | x1) ⋯ p(xn | x_{n−1})

• MRF: x1 — x2 — x3 — … — xn

  p(x1, …, xn) = (1/Z) g_{1,2}(x1, x2) g_{2,3}(x2, x3) ⋯ g_{n−1,n}(x_{n−1}, xn)

• The mapping here is straightforward. Let

  g_{1,2}(x1, x2) = p(x1) p(x2 | x1),
  g_{2,3}(x2, x3) = p(x3 | x2),
  …,
  g_{n−1,n}(x_{n−1}, xn) = p(xn | x_{n−1}),  and Z = 1
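The chain mapping can be verified numerically: with the CPDs absorbed into pairwise potentials as above, the MRF product equals the BN joint and Z = 1. The CPD values below are made up for illustration:

```python
from itertools import product

# Hypothetical CPDs for a binary chain BN x1 -> x2 -> x3.
p1 = {0: 0.6, 1: 0.4}                                      # p(x1)
p2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # p(x2|x1), keyed (x1, x2)
p3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}  # p(x3|x2), keyed (x2, x3)

# Absorb p(x1) into the first pairwise potential, so that Z = 1.
def g12(a, b):
    return p1[a] * p2[(a, b)]

def g23(b, c):
    return p3[(b, c)]

def bn_joint(a, b, c):
    return p1[a] * p2[(a, b)] * p3[(b, c)]

def mrf_joint(a, b, c):
    return g12(a, b) * g23(b, c)              # Z = 1 by construction

# The two joints agree on every assignment.
same = all(abs(bn_joint(*xs) - mrf_joint(*xs)) < 1e-12
           for xs in product([0, 1], repeat=3))
```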
Mapping head-to-head Bayesian networks into an MRF
• When there are head-to-head nodes one has to add edges to convert the Bayesian network into the undirected graph.
• This process of ‘marrying the parents’ has become known as moralization, and the resulting undirected graph, after
dropping the arrows, is called the moral graph.
[Figure: head-to-head BN x1 → x3 ← x2 (left) and its moral graph with the added edge x1 – x2 (right)]

  p(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2)
  p(x1, x2, x3) = (1/Z) g_{1,2,3}(x1, x2, x3)

• The single potential involves all three variables
Mapping a Bayesian Network into an MRF: General Process
• Add additional undirected links between all pairs of parents of each node in the graph.
• Initialize all of the clique potentials of the moral graph to 1.
• Take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials.
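The first step (moralization: marry the co-parents of each node and drop the arrow directions) can be sketched as follows; the parent map is the head-to-head example from the previous slide:

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph:
    drop arrow directions and marry all co-parents of each node."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))    # dropped arrow
        for p, q in combinations(ps, 2):
            edges.add(frozenset((p, q)))        # marry the parents
    return edges

# Head-to-head example x1 -> x3 <- x2: moralization adds the edge x1 - x2.
parents = {"x1": [], "x2": [], "x3": ["x1", "x2"]}
moral = moralize(parents)
```

The remaining two steps just initialize all clique potentials of this moral graph to 1 and multiply each CPD into one clique potential that contains its scope.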
Dependency Revisit
• P: the set of all distributions over a given set of variables
• D: the set of distributions that can be represented by a BN
• U: the set of distributions that can be represented by an MRF
Factor graph
Two types of nodes
• The circles in a factor graph represent random variables
• The squares represent factors in the joint distribution
Two nodes are neighbors if they share a common factor.
p(x1, x2, x3) = fa(x1, x2) fb(x1, x2) fc(x2, x3) fd(x3)
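A factor graph can be represented directly as a list of factors with their scopes; the factor tables below are illustrative placeholders matching the structure p(x1, x2, x3) ∝ fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3):

```python
# A factor graph as (name, scope, function) triples.
# Each function takes a dict of variable assignments.
factors = [
    ("fa", ("x1", "x2"), lambda v: 1.0 + v["x1"] * v["x2"]),
    ("fb", ("x1", "x2"), lambda v: 2.0 if v["x1"] == v["x2"] else 0.5),
    ("fc", ("x2", "x3"), lambda v: 1.5 if v["x2"] == v["x3"] else 1.0),
    ("fd", ("x3",),      lambda v: 0.8 if v["x3"] == 1 else 1.2),
]

def neighbors(var):
    """Variables sharing at least one factor with var."""
    out = set()
    for _, scope, _ in factors:
        if var in scope:
            out.update(scope)
    out.discard(var)
    return out

def unnormalized(v):
    """The joint distribution is the product of all factors."""
    p = 1.0
    for _, _, f in factors:
        p *= f(v)
    return p
```

Here `neighbors` implements the slide's definition directly: x1 and x3 are not neighbors because no factor contains both.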
Factor graph
• Factor graphs incorporate explicit details about the
factorization, but factor nodes do not correspond to CIs.
• An edge between a circle and a square states that the
corresponding function has the corresponding variable as an argument.
• The joint distribution is the product of all functions (so, the functions are factors).
• They preserve the tree structure of DAGs and undirected graphs.
Converting a BN into a factor graph
• Create variable nodes corresponding to the nodes of the original DAG.
• Create factor nodes corresponding to the conditional distributions.
• The conversion is not unique:

  f(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2)
  or
  fa(x1) = p(x1),  fb(x2) = p(x2),  fc(x1, x2, x3) = p(x3 | x1, x2)
Converting an MRF into a factor graph
• Create variable nodes corresponding to the nodes of the original undirected graph.
• Create factor nodes corresponding to the potential functions.
• The conversion is not unique:

  ψ(x1, x2, x3) = f_{1,2,3}(x1, x2, x3)
  or
  ψ(x1, x2, x3) = fa(x1, x2, x3) fb(x2, x3)
Why Factor Graphs?
• Now we have two completely different types of graphs: BNs and MRFs.
  – This is a problem because their inference strategies can be very different.
• We probably want a unified framework.
  – How about converting a BN to an MRF?
  – Problem: loops can be created.
• It is possible to convert both BNs and MRFs into trees without loops.