### Undirected graphical models

### (Markov Random Field or Markov Network)

References:

• Bishop, Ch. 8; PGM, Ch. 4 slides & notes

• Cédric Archambeau's slides at the PASCAL Bootcamp 2010; Tibério Caetano's slides at MLSS 2008

• CMU graphical models lecture notes

Prof. Shou-de Lin

sdlin@csie.ntu.edu.tw, CSIE/GINM

1 SAI 2010

### A Recommendation Letter Example

• The chance of getting a job depends on the recommendation letter. However, only one letter is provided to the company, and which letter is determined by the 'choice'.

• How about the dependency (L1 ⊥ J | C = 2)?

– A BN makes independence assertions only at the level of variables.

• How about (L1 ⊥ L2 | C, J)?

**• There is no perfect map using a BN for such a distribution**

(Figure: BN over choice, letter1, letter2, and job.)


## A Friendship Example

• Each random variable represents whether a person knows a piece of news.

• A goes out with B and D; C goes out with B and D (see the right graph). A and C do not know each other, and neither do B and D.

• The left graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D | A).

• The left graph implies B and D are dependent given C and A.

• The middle graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D).

**• There is no perfect map using a BN for such a distribution**


### Undirected graphical models

• Ideally we would like to have more freedom in the graph

• Markov random fields allow for the specification of a different class of Conditional Independence (CI) statements

• The class of CI statements for MRFs can be easily defined by graphical means in undirected graphs.

– The absence of edges implies CI statements

• Local potential functions and the cliques in the graph completely determine the joint distribution.

• Give correlations between variables, but no explicit way to generate samples


## In the dating example

### • P(a,b,c,d) = φ_{1}(A,B) φ_{2}(B,C) φ_{3}(C,D) φ_{4}(D,A) / Z

### • Z is a normalization term
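As a concrete sketch of this factorization (the potential tables below are invented for illustration, not taken from the slides), the joint and the normalization term Z for the four-node cycle can be computed by brute force:

```python
import itertools

# Hypothetical pairwise potentials for the cycle A-B-C-D-A.
# Each maps a pair of binary values to a nonnegative number.
phi1 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # phi1(A,B)
phi2 = phi1  # phi2(B,C): reuse the same table for brevity
phi3 = phi1  # phi3(C,D)
phi4 = phi1  # phi4(D,A)

def unnormalized(a, b, c, d):
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# Z sums the factor product over all joint assignments.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))

def p(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

# The probabilities sum to 1 by construction.
total = sum(p(*x) for x in itertools.product([0, 1], repeat=4))
```

Dividing by Z is exactly what turns the product of arbitrary nonnegative potentials into a valid distribution.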


### Def: Markov Network (MN) or Markov Random Field (MRF)

• How about a sequence of multiplications?

p(x_{1},…,x_{n}) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}), where g_{Ci}(x_{Ci}) is a function of a subset of x and Z = Σ_{x} ∏_{i=1..K} g_{Ci}(x_{Ci})

**• A distribution P_{Φ} is a Gibbs distribution parameterized by a set of factors Φ = {φ_{1}(D_{1}),…,φ_{k}(D_{k})} if it is defined as P_{Φ}(X_{1},…,X_{n}) = φ_{1}(D_{1}) φ_{2}(D_{2}) ⋯ φ_{k}(D_{k}) / Z (Z is a normalization factor)**


### What are the proper factors for an MRF?

• I_{p}(H): all dependencies of the form: two nodes that are not connected are conditionally independent given all other nodes

– P(x_{1}, x_{2} | X∖{x_{1},x_{2}}) = P(x_{1} | X∖{x_{1},x_{2}}) P(x_{2} | X∖{x_{1},x_{2}})

• Therefore, nodes that are not connected should NOT be put in the same factor; nodes that ARE connected should be put in the same factor

• How can we identify nodes that ARE connected?

– Maximal cliques


### Cliques and Maximal Cliques

• A clique of a graph is a complete subgraph (every pair of nodes has an edge)

• A maximal clique of a graph is a clique which is not a subset of another clique

**• {A, C} forms a clique**

**• {A, B, D} and {B, E} are maximal cliques**

(Figure: example graph on nodes A, B, C, D, E.)
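Cliques can be found by brute force for a small graph. The edge set below is a guess consistent with the slide's claims; note that with this edge set {A, C} also comes out maximal:

```python
from itertools import combinations

# A guessed edge set: {A, C} is a clique, and {A, B, D} and {B, E}
# are maximal cliques, as stated on the slide.
edges = {frozenset(e) for e in [("A", "B"), ("A", "C"), ("A", "D"),
                                ("B", "D"), ("B", "E")]}
nodes = {"A", "B", "C", "D", "E"}

def is_clique(s):
    # Every pair of nodes in s must share an edge.
    return all(frozenset(p) in edges for p in combinations(s, 2))

# Enumerate all cliques of size >= 2, then keep the maximal ones
# (those not strictly contained in another clique).
cliques = [set(s) for r in range(2, len(nodes) + 1)
           for s in combinations(sorted(nodes), r) if is_clique(s)]
maximal = [c for c in cliques
           if not any(c < other for other in cliques)]
```

This exponential enumeration is fine for a five-node example; real libraries use dedicated algorithms such as Bron–Kerbosch.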


### Factorization in MRF

• For a graph with K maximal cliques, we can decompose the joint as

p(x_{1},…,x_{n}) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}), where Z = Σ_{x} ∏_{i=1..K} g_{Ci}(x_{Ci}), with g_{Ci}(x_{Ci}) ≥ 0 for all x_{Ci}

(x_{Ci} are the nodes belonging to the maximal clique C_{i})

• The factorization is in terms of local potential functions

• Can also be described as: a Gibbs distribution p factorizes over an MRF H if each x_{Ci} is the set of nodes belonging to a clique C_{i} of H

• The potential functions do not necessarily have a probabilistic interpretation


### Example of potential function

• Let C = {B, E}, with B ∈ {0, 1} and E ∈ {0, 1}

| B | E | g(B,E) |
|---|---|--------|
| 0 | 0 | 0.4 |
| 0 | 1 | 0.8 |
| 1 | 0 | 3.0 |
| 1 | 1 | 2.5 |

• g(B,E) is not necessarily a probability

• A CPD can be seen as a special case of the factors


### Separation, CI and Factorization in MRF (1/2)

**• Factorization ⇒ CI: If a probability distribution factorizes according to an undirected graph, and if X, Y and Z are disjoint subsets of nodes such that Z separates X from Y (every path from X to Y passes through Z), then the distribution satisfies X ⊥ Y | Z**

**Proof:**

• We start by considering the case where X ∪ Y ∪ Z = S, the set of all nodes. As Z separates X from Y, there are no direct edges between X and Y. Hence, any clique in H is fully contained either in X ∪ Z or in Y ∪ Z. Let I_{X} be the indices of the cliques contained in X ∪ Z, and let I_{Y} be the indices of the remaining cliques. Therefore P(X_{1},…,X_{n}) = ∏_{i∈I_{X}} φ_{i}(D_{i}) ∏_{i∈I_{Y}} φ_{i}(D_{i}) / K, where K is the normalization factor, so

P(X_{1},…,X_{n}) = f(X, Z) g(Y, Z) / K ⇒ X ⊥ Y | Z

• Then we consider the case where X ∪ Y ∪ Z ⊂ S. Let U = S − (X ∪ Y ∪ Z); it is possible to partition U into two disjoint sets U_{1} and U_{2} such that Z separates X ∪ U_{1} from Y ∪ U_{2}. Similar to the above argument, we can conclude P(X_{1},…,X_{n}) = f(X, U_{1}, Z) g(Y, U_{2}, Z) / K ⇒ (X, U_{1} ⊥ Y, U_{2} | Z) ⇒ X ⊥ Y | Z


### Separation, CI and Factorization in MRF (2/2)

CI properties and factorization are equivalent in an MRF:

**• Factorization ⇒ CI:** If a probability distribution factorizes according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B, then the distribution satisfies A ⊥ B | C

**• CI ⇒ Factorization:** If a positive probability distribution satisfies the CI statements implied by graph separation over the undirected graph, then it also factorizes according to this graph.

– known as the Hammersley–Clifford theorem (proof: section 4.4 in PGM, or section 4.2.3 in our textbook)


### Local Dependency for Markov Random Field

• Weakest dependency: two nodes that are not connected are conditionally independent given **all other nodes**

– All dependencies satisfying this condition are denoted I_{p}(H)

• Less weak dependency: a node is conditionally independent of every other node in the network given its direct neighbors

– All dependencies satisfying this condition are denoted I_{l}(H)

(Figure: example MRF on nodes X_{1},…,X_{8}.)


### Global Graph Separation

• If every path from A to B includes at least one node from C, then C is said to separate A from B in G.

• Path is blocked by C ⇒ A ⊥ B | C

• All dependencies that satisfy this condition are denoted I(H)

• For ANY MRF, I(H) ⇒ I_{l}(H) ⇒ I_{p}(H)

**• For a positive joint probability distribution P, the following three statements are equivalent (for the proof, see 4.3.2.2 in PGM):**

– P |= I(H)

– P |= I_{l}(H)

– P |= I_{p}(H)


### A canonical example: image denoising

This figure is from Tibério Caetano’s slide at MLSS 2008


### Image denoising: the Ising model

• y_{i} is the observed noisy pixel and x_{i} is the unknown noise-free pixel

• There is a strong correlation between x_{i} and y_{i}

• Neighboring pixels x_{i} and x_{j} in an image are strongly correlated

• Cliques: {x_{i}, y_{i}} (noisy nodes are correlated with denoised nodes) and {x_{i}, x_{j}}, where i and j are indices of neighboring pixels

• E(x,y) = h Σ_{i} x_{i} − β Σ_{i,j} x_{i} x_{j} − η Σ_{i} x_{i} y_{i}, and p(x,y) = e^{−E(x,y)} / Z

– the h term biases x_{i} toward one state; the β term encourages neighboring nodes to take the same value; the η term makes x_{i} agree with y_{i}

• The lower the energy, the better (i.e., the higher the probability)


## Optimization in Ising Model

• E(x,y) = h Σ_{i} x_{i} − β Σ_{i,j} x_{i} x_{j} − η Σ_{i} x_{i} y_{i}, and p(x,y) = e^{−E(x,y)} / Z

• Iterated conditional modes (ICM):

– It is a coordinate-wise ascent method.

– First initialize x_{i} = y_{i} for all i.

– Then take one node x_{j} at a time, evaluate the total energy for the two states x_{j} = +1 and x_{j} = −1, and choose the better assignment.

– The algorithm can converge to a local optimum.
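A minimal sketch of ICM on a 1-D "image" with x_{i}, y_{i} ∈ {−1, +1}; the parameter values h, β, η and the test signal are made up for illustration:

```python
def icm_denoise(y, h=0.0, beta=2.0, eta=1.0, sweeps=5):
    """Iterated conditional modes for the Ising energy
    E(x,y) = h*sum_i x_i - beta*sum_{i,j} x_i x_j - eta*sum_i x_i y_i,
    on a 1-D chain so that the neighbor pairs are (i, i+1)."""
    x = list(y)  # initialize x_i = y_i
    n = len(x)

    def local_energy(i, xi):
        # Only the energy terms involving x_i matter when choosing x_i.
        e = h * xi - eta * xi * y[i]
        if i > 0:
            e -= beta * xi * x[i - 1]
        if i < n - 1:
            e -= beta * xi * x[i + 1]
        return e

    for _ in range(sweeps):
        for i in range(n):
            # Pick the state (+1 or -1) with the lower local energy.
            x[i] = min((+1, -1), key=lambda s: local_energy(i, s))
    return x

# A clean signal with one flipped (noisy) pixel in the middle:
# the smoothness term beta pulls it back in line with its neighbors.
noisy = [1, 1, 1, -1, 1, 1, 1]
denoised = icm_denoise(noisy)
```

With β large relative to η the neighbor agreement dominates, so the isolated flipped pixel is corrected; a real 2-D version would loop over grid neighbors instead of a chain.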


### MRF versus Bayesian network

Similarities:

• CI properties of the joint are encoded into the graph structure and define families of structured probability distributions.

• CI properties are related to concepts of separation of (groups of) nodes in the graph.

• Local entities in the graph imply the simplified algebraic structure (factorization) of the joint.


### MRF versus Bayesian network

Differences:

• The set of probability distributions represented by MRFs is different from the set represented by Bayesian networks.

• MRFs have a normalization constant that couples all factors, whereas Bayesian networks do not.

• Factors in Bayesian networks are probability distributions, while factors in MRFs are nonnegative potentials.


### Markov Blanket in BN and MRF

• The Markov blanket of a node x_{i} in either a BN or an MRF is the smallest set of nodes A such that p(x_{i} | x_{~i}) = p(x_{i} | x_{A})

• BN: the parents, children, and co-parents of children of the node

• MRF: the neighbors of the node

(Figures: Markov blankets in a BN and in an MRF.)
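The BN rule (parents, children, co-parents) can be sketched in a few lines; the DAG below is a made-up example, not the one in the figure:

```python
# A sketch: Markov blanket of a node in a DAG given as child -> parents.
parents = {
    "D": ["A", "B"],   # A and B are parents of D
    "E": ["D", "C"],   # D and C are parents of E
}

def markov_blanket(node):
    pa = set(parents.get(node, []))
    # Children: nodes that list `node` among their parents.
    children = {c for c, ps in parents.items() if node in ps}
    # Co-parents: the other parents of those children.
    co_parents = {p for c in children for p in parents.get(c, [])} - {node}
    return pa | children | co_parents

# For D: parents {A, B}, child {E}, and C as co-parent of E.
mb = markov_blanket("D")
```

In the moralized (MRF) view of this DAG, the same set is simply D's neighbors, which is why moralization preserves Markov blankets.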


### Mapping a Linear Bayesian Network into an MRF

• Bayesian network: p(x_{1},…,x_{n}) = p(x_{1}) p(x_{2}|x_{1}) ⋯ p(x_{n}|x_{n−1})

(Figure: directed chain x_{1} → x_{2} → x_{3} → … → x_{n}.)

• MRF: p(x_{1},…,x_{n}) = (1/Z) g_{1,2}(x_{1},x_{2}) g_{2,3}(x_{2},x_{3}) ⋯ g_{n−1,n}(x_{n−1},x_{n})

(Figure: undirected chain x_{1} - x_{2} - x_{3} - … - x_{n}.)

• The mapping here is straightforward. Let g_{1,2}(x_{1},x_{2}) = p(x_{1}) p(x_{2}|x_{1}), g_{2,3}(x_{2},x_{3}) = p(x_{3}|x_{2}), …, g_{n−1,n}(x_{n−1},x_{n}) = p(x_{n}|x_{n−1}), and Z = 1

### Mapping head-to-head Bayesian networks into an MRF

• When there are head-to-head nodes, one has to add edges to convert the Bayesian network into an undirected graph.

• This process of 'marrying the parents' has become known as moralization, and the resulting undirected graph, after dropping the arrows, is called the moral graph.

(Figures: the BN x_{1} → x_{3} ← x_{2} and its moral graph, with an added edge between x_{1} and x_{2}.)

• BN: p(x_{1},x_{2},x_{3}) = p(x_{1}) p(x_{2}) p(x_{3}|x_{1},x_{2})

• MRF: p(x_{1},x_{2},x_{3}) = (1/Z) g_{1,2,3}(x_{1},x_{2},x_{3}), where the single potential involves all three variables

### Mapping a Bayesian Network into an MRF: General Process

### • Add additional undirected links between all pairs of parents of each node in the graph.

### • Initialize all of the clique potentials of the moral graph to 1.

### • Take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials.
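The graph-construction part of this process (marry the parents, drop the arrows) can be sketched as follows; initializing and multiplying in the clique potentials is omitted:

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG given as node -> list of parents:
    connect all pairs of parents, then drop edge directions."""
    undirected = set()
    for child, ps in parents.items():
        # Dropped arrows: each parent-child edge becomes undirected.
        for p in ps:
            undirected.add(frozenset((p, child)))
        # Marry the parents: connect every pair of parents of this node.
        for a, b in combinations(ps, 2):
            undirected.add(frozenset((a, b)))
    return undirected

# Head-to-head example from the previous slide: x1 -> x3 <- x2.
moral = moralize({"x3": ["x1", "x2"]})
# moral contains x1-x3, x2-x3, and the added 'marriage' edge x1-x2.
```

Marrying the parents guarantees that each CPD's scope (a node plus all its parents) lies inside a clique of the moral graph, so step 3 always has a clique potential to multiply into.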


## Dependency Revisit

### • P: the set of all distributions over a given set of variables.

### • D: the set of distributions that can be represented by a BN.

### • U: the set of distributions that can be represented by an MRF.


### Factor graph

Two types of nodes

• The circles in a factor graph represent random variables

• The squares represent factors in the joint distribution

Two nodes are neighbors if they share a common factor.

p(x_{1},x_{2},x_{3}) = f_{a}(x_{1},x_{2}) f_{b}(x_{1},x_{2}) f_{c}(x_{2},x_{3}) f_{d}(x_{3})


### Factor graph

• Factor graphs incorporate explicit details about the

factorization, but factor nodes do not correspond to CIs.

• An edge between a circle and a square states that the

corresponding function has the corresponding variable as an argument.

• The joint distribution is the product of all functions (so, the functions are factors).

• They preserve the tree structure of DAGs and undirected graphs.
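One way to represent such a factor graph in code (the factor tables are invented for illustration): each square node is a factor listed with the variables it touches, and the joint is just the product of all factor values.

```python
# Factor graph for p(x1,x2,x3) = fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3):
# each factor is stored with its scope (the variables it is connected to).
factors = {
    "fa": (("x1", "x2"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}),
    "fb": (("x1", "x2"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    "fc": (("x2", "x3"), {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.0}),
    "fd": (("x3",), {(0,): 1.0, (1,): 4.0}),
}

def joint(assignment):
    # The (unnormalized) joint is the product over all factor nodes.
    result = 1.0
    for scope, table in factors.values():
        result *= table[tuple(assignment[v] for v in scope)]
    return result

# Two variables are neighbors iff they share a factor; here x1 and x3
# share none, which the scope lists make explicit.
val = joint({"x1": 0, "x2": 1, "x3": 1})
```

Keeping the scopes explicit is exactly what the square nodes of the factor graph encode, and it is what message-passing algorithms iterate over.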


### Converting a BN into a factor graph

• Create variable nodes corresponding to the original DAG.

• Create factor nodes corresponding to the conditional distributions.

• The conversion is not unique. For p(x_{1},x_{2},x_{3}) = p(x_{1}) p(x_{2}) p(x_{3}|x_{1},x_{2}):

– a single factor: f(x_{1},x_{2},x_{3}) = p(x_{1}) p(x_{2}) p(x_{3}|x_{1},x_{2})

– or separate factors: f_{a}(x_{1}) = p(x_{1}), f_{b}(x_{2}) = p(x_{2}), f_{c}(x_{1},x_{2},x_{3}) = p(x_{3}|x_{1},x_{2})

### Converting an MRF into a factor graph

• Create variable nodes corresponding to the original undirected graph.

• Create factor nodes corresponding to the potential functions.

• The conversion is not unique. For a clique {x_{1},x_{2},x_{3}} with potential g_{1,2,3}(x_{1},x_{2},x_{3}):

– a single factor: f(x_{1},x_{2},x_{3}) = g_{1,2,3}(x_{1},x_{2},x_{3})

– or a product of factors: f_{a}(x_{1},x_{2},x_{3}) f_{b}(x_{2},x_{3}) = g_{1,2,3}(x_{1},x_{2},x_{3})


## Why Factor Graphs?

### • Now we have two completely different types of graphs: BN and MRF.

– This is a problem because their inference strategies can be very different

### • We probably want a unified framework

– How about converting a BN to an MRF?

– Problem: loops can be created.

**• It is possible to convert both BNs and MRFs into trees without loops.**
