Undirected graphical models
(Markov Random Field or Markov Network)
References:
Books: Bishop Ch. 8; PGM Ch. 4
Slides & notes: Cédric Archambeau's slides at the PASCAL Bootcamp 2010; Tibério Caetano's slides at MLSS 2008; CMU graphical models lecture notes
Prof. Shou-de Lin
sdlin@csie.ntu.edu.tw CSIE/GINM
1 SAI 2010
A Recommendation Letter Example
• The chance of getting a job depends on the recommendation letter. However, only one letter is provided to the company, and which one is determined by the 'choice' variable.
• How about the dependency (L1 ⊥ J | C = 2)?
  – A BN makes independence assertions only at the level of variables.
• How about (L1 ⊥ L2 | C, J)?
• There is no perfect map using a BN for such a distribution
[Figure: BN over the variables Choice, Letter1, Letter2, and Job]
A Friendship Example
• Each random variable represents whether a person knows a piece of news.
• A goes out with B and D; C goes out with B and D (see the right graph). A and C do not know each other, and neither do B and D.
• The left graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D | A).
• The left graph implies B and D are dependent given C and A.
• The middle graph implies (A ⊥ C | {B, D}), but it also implies (B ⊥ D).
• There is no perfect map using a BN for such a distribution
Undirected graphical models
• Ideally we would like more freedom in the graph
• Markov random fields allow the specification of a different class of conditional independence (CI) statements
• The class of CI statements for MRFs can be easily defined by graphical means in undirected graphs
  – The absence of an edge implies a CI statement
• Local potential functions and the cliques in the graph completely determine the joint distribution
• They give correlations between variables, but no explicit way to generate samples
In the dating example
• P(a, b, c, d) = φ1(A,B) · φ2(B,C) · φ3(C,D) · φ4(D,A) / Z
• Z is a normalization term
5 SAI 2010
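The factorization above can be sketched numerically. A minimal Python example (the potential values are made up for illustration, not from the slides) enumerates all 2⁴ assignments of the loop A–B–C–D–A to compute Z and check that the normalized joint sums to 1:

```python
from itertools import product

# Hypothetical pairwise potentials for the 4-node loop A-B-C-D-A.
phi = {
    ("A", "B"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("B", "C"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("C", "D"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("D", "A"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
}

def unnormalized(a, b, c, d):
    """Product of the four clique potentials."""
    v = {"A": a, "B": b, "C": c, "D": d}
    p = 1.0
    for (x, y), table in phi.items():
        p *= table[(v[x], v[y])]
    return p

# Z sums the unnormalized product over all 2^4 assignments.
Z = sum(unnormalized(*xs) for xs in product([0, 1], repeat=4))

def joint(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

# After dividing by Z, the probabilities sum to 1.
total = sum(joint(*xs) for xs in product([0, 1], repeat=4))
```

The same pattern works for any set of clique potentials; only the enumeration cost grows with the number of variables.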
Def: Markov Network (MN) or Markov Random Field (MRF)
• How about a sequence of multiplications?

  p(x1, …, xn) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}),  where Z = Σ_{x1,…,xn} ∏_{i=1..K} g_{Ci}(x_{Ci})

  and each g_{Ci}(x_{Ci}) is a function of a subset of x.
• A distribution PΦ is a Gibbs distribution parameterized by a set of factors Φ = {φ1(D1), …, φk(Dk)} if it is defined as

  PΦ(X1, …, Xn) = φ1(D1) · φ2(D2) ⋯ φk(Dk) / Z  (Z is a normalization factor)
What are the proper Factors for MRF?
• Ip(H): every pair of nodes that are not connected are conditionally independent given all other nodes:
  – P(x1, x2 | X∖{x1, x2}) = P(x1 | X∖{x1, x2}) · P(x2 | X∖{x1, x2})
• Therefore, nodes that are not connected should NOT be put in the same factor; nodes that ARE connected should be put in the same factor
• How can we identify nodes that ARE connected?
– Maximal clique
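The pairwise property Ip(H) can be checked numerically on the four-node loop A–B–C–D–A. In the sketch below (the potential values are illustrative; any nonnegative pairwise potential works), A and C are not connected, so P(A, C | B, D) should factor into P(A | B, D) · P(C | B, D):

```python
from itertools import product

# Square MRF A-B-C-D-A with an illustrative pairwise potential.
def phi(u, v):
    return 2.0 if u == v else 1.0

def unnorm(a, b, c, d):
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

Z = sum(unnorm(*xs) for xs in product([0, 1], repeat=4))
P = {xs: unnorm(*xs) / Z for xs in product([0, 1], repeat=4)}

def cond_ac_given_bd(a, c, b, d):
    """P(A=a, C=c | B=b, D=d)."""
    num = P[(a, b, c, d)]
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

def cond_a_given_bd(a, b, d):
    num = sum(P[(a, b, cc, d)] for cc in (0, 1))
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

def cond_c_given_bd(c, b, d):
    num = sum(P[(aa, b, c, d)] for aa in (0, 1))
    den = sum(P[(aa, b, cc, d)] for aa in (0, 1) for cc in (0, 1))
    return num / den

# Pairwise Markov property: non-adjacent A and C are independent given {B, D}.
ok = all(abs(cond_ac_given_bd(a, c, b, d)
             - cond_a_given_bd(a, b, d) * cond_c_given_bd(c, b, d)) < 1e-12
         for a, b, c, d in product([0, 1], repeat=4))
```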
Cliques and Maximal Cliques
• A clique of a graph is a complete subgraph (each pair of nodes has an edge)
• A maximal clique of a graph is a clique that is not a subset of another clique
• {A, C} forms a clique
• {A, B, D} and {B, E} are maximal cliques
[Figure: graph with edges A–C, A–B, A–D, B–D, B–E]
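A brute-force sketch of clique enumeration for this example (assuming the edge set A–C, A–B, A–D, B–D, B–E implied by the cliques listed above; exhaustive search is fine only for tiny graphs):

```python
from itertools import combinations

# Assumed edge set for the 5-node example graph.
edges = {frozenset(e) for e in
         [("A", "C"), ("A", "B"), ("A", "D"), ("B", "D"), ("B", "E")]}
nodes = {"A", "B", "C", "D", "E"}

def is_clique(s):
    """Every pair of nodes in s must share an edge."""
    return all(frozenset(p) in edges for p in combinations(s, 2))

# Enumerate all cliques of size >= 2.
cliques = [frozenset(s) for r in range(2, len(nodes) + 1)
           for s in combinations(sorted(nodes), r) if is_clique(s)]

# A maximal clique is not a proper subset of another clique.
maximal = [c for c in cliques if not any(c < d for d in cliques)]
```

On this graph the maximal cliques are {A, C}, {B, E}, and {A, B, D}; note {A, C} is also maximal here since C has no other neighbors.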
Factorization in MRF
• For a graph with K maximal cliques, we can decompose the joint as

  p(x1, …, xn) = (1/Z) ∏_{i=1..K} g_{Ci}(x_{Ci}),  where Z = Σ_{x1,…,xn} ∏_{i=1..K} g_{Ci}(x_{Ci})

  (x_{Ci} are the nodes belonging to the maximal clique Ci)
• The factorization is in terms of local potential functions {g_{Ci}(·)}, with g_{Ci}(·) ≥ 0 for all i
• Can also be described as: a Gibbs distribution p factorizes over an MRF H if each x_{Ci} is a set of nodes forming a clique Ci of H
• The potential functions do not necessarily have a probabilistic interpretation
Example of potential function
• Let B ∈ {0, 1}, E ∈ {0, 1}, and C1 = {B, E}

[Figure: the graph with nodes A, B, C, D, E from the previous slide]

  B | E | g(B,E)
  0 | 0 | 0.4
  0 | 1 | 0.8
  1 | 0 | 3.0
  1 | 1 | 2.5

• g(B,E) is not necessarily a probability
• A CPD can be seen as a special case of the factors
Separation, CI and Factorization in MRF (1/2)
• Factorization ⇒ CI: If a probability distribution factorizes according to an undirected graph, and if X, Y and Z are disjoint subsets of nodes such that Z separates X from Y (every path from X to Y passes through Z), then the distribution satisfies X ⊥ Y | Z
Proof:
• We start by considering the case where X∪Y∪Z = S. As Z separates X from Y, there are no direct edges between X and Y. Hence, any clique in H is fully contained either in X∪Z or in Y∪Z. Let Ix be the indexes of the cliques contained in X∪Z, and let Iy be the indexes of the remaining cliques. Therefore

  P(X1, …, Xn) = ∏_{i∈Ix} φi(Di) · ∏_{i∈Iy} φi(Di) / K,  where K is the normalization factor,

  so P(X1, …, Xn) = f(X, Z) · g(Y, Z) / K ⇒ X ⊥ Y | Z
• Then we consider the case where X∪Y∪Z ⊊ S. Let U = S − (X∪Y∪Z); it is possible to partition U into two disjoint sets U1 and U2 such that Z separates X∪U1 from Y∪U2. Similar to the above argument, we can conclude

  P(X1, …, Xn) = f(X, U1, Z) · g(Y, U2, Z) / K ⇒ (X, U1 ⊥ Y, U2 | Z) ⇒ X ⊥ Y | Z
Separation, CI and Factorization in MRF (2/2)
CI properties and factorization are equivalent in an MRF:
• Factorization ⇒ CI: If a probability distribution factorizes according to an undirected graph, and if A, B and C are disjoint subsets of nodes such that C separates A from B, then the distribution satisfies A ⊥ B | C
• CI ⇒ Factorization: If a positive probability distribution satisfies the CI statements implied by graph separation over the undirected graph, then it also factorizes according to this graph.
  – Known as the Hammersley–Clifford theorem (proof: section 4.4 in PGM, or section 4.2.3 in our textbook)
Local Dependency for Markov Random Field
• Weakest dependency: two nodes that are not connected are conditionally independent given all other nodes
  – The set of all CI statements satisfying this condition is called Ip(H)
• Less weak dependency: a node is conditionally independent of every other node in the network given its direct neighbors
  – The set of all CI statements satisfying this condition is called Il(H)
[Figure: example MRF with nodes X1 … X8]
Global Graph Separation
• If every path from A to B includes at least one node from C, then C is said to separate A from B in G.
• Path is blocked by C ⇒ A ⊥ B | C
• The set of all CI statements satisfying this condition is denoted I(H)
• For ANY MRF, I(H) ⇒ Il(H) ⇒ Ip(H)
• For a positive joint probability distribution P, the following three statements are equivalent (proof: see 4.3.2.2 in PGM):
  – P ⊨ I(H)
  – P ⊨ Il(H)
  – P ⊨ Ip(H)
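Graph separation itself is easy to test mechanically: remove C from the graph and check whether A can still reach B. A minimal sketch using breadth-first search, tried on the friendship square A–B–C–D (edges A–B, B–C, C–D, D–A):

```python
from collections import deque

def separates(adj, A, B, C):
    """True iff every path from A to B passes through C,
    i.e. deleting C disconnects A from B."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False          # found an unblocked path into B
        for v in adj.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# The friendship square: {B, D} separates A from C, but {B} alone does not.
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
```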
A canonical example: image denoising
This figure is from Tibério Caetano’s slide at MLSS 2008
Image denoising:
Ising model
• yi is the observed noisy pixel and xi is the unknown noise-free pixel
• There is a strong correlation between xi and yi
• Neighboring pixels xi and xj in an image are strongly correlated
• Cliques: {xi, yi} and {xi, xj}, where i and j are indices of neighboring pixels; noisy nodes are correlated with denoised nodes
• E(x, y) = h Σi xi − β Σ{i,j} xi xj − η Σi xi yi,  p(x, y) = e^{−E(x,y)} / Z
  (the h term biases the sign of xi; the β term says neighboring nodes should take the same value; the η term ties xi to its observation yi)
• The lower the energy, the better (i.e., the higher the probability)
Optimization in Ising Model
• E(x, y) = h Σi xi − β Σ{i,j} xi xj − η Σi xi yi,  p(x, y) = e^{−E(x,y)} / Z
• Iterated conditional modes (ICM):
  – It is a greedy coordinate-wise ascent method.
  – First initialize xi = yi for all i.
  – Then take one node xj at a time and evaluate the total energy for the two states xj = +1 and xj = −1; choose the better assignment.
  – The algorithm converges to a local optimum.
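A compact sketch of ICM for this energy (assuming xi ∈ {−1, +1} as in Bishop's formulation; the parameter values h, β, η below are illustrative, not from the slides):

```python
def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=5):
    """Iterated conditional modes for the Ising energy
    E(x, y) = h*sum_i(x_i) - beta*sum_neighbors(x_i*x_j) - eta*sum_i(x_i*y_i).
    y is a 2-D list of +/-1 pixels; returns the denoised labels x."""
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]                 # initialize x_i = y_i

    def local_energy(i, j, s):
        # Only the terms involving pixel (i, j) change when it flips.
        e = h * s - eta * s * y[i][j]
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                e -= beta * s * x[ni][nj]
        return e

    for _ in range(sweeps):                   # coordinate-wise greedy sweeps
        for i in range(rows):
            for j in range(cols):
                # choose the state with the lower local energy
                x[i][j] = min((1, -1), key=lambda s: local_energy(i, j, s))
    return x

# A single flipped pixel in an all-ones image is restored.
noisy = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
clean = icm_denoise(noisy)
```

Because each update can only lower the total energy, the sweep converges, but only to a local optimum, as the slide notes.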
MRF versus Bayesian network
Similarities:
• CI properties of the joint are encoded into the graph structure and define families of structured probability distributions.
• CI properties are related to concepts of separation of (groups of) nodes in the graph.
• Local entities in the graph imply the simplified algebraic structure (factorization) of the joint.
MRF versus Bayesian network
Differences:
• The set of probability distributions represented by MRFs is different from the set represented by Bayesian networks.
• MRFs have a normalization constant that couples all factors, whereas Bayesian networks do not.
• Factors in Bayesian networks are probability distributions, while factors in MRFs are nonnegative potentials.
Markov Blanket in BN and MRF
• The Markov Blanket of a node xi in either a BN or an MRF is the smallest set of nodes A such that p(xi |x~i ) = p(xi |xA )
• BN: parents, children and co-parents of children of the node
• MRF: neighbors of the node
[Figure: Markov blanket of a node in a BN (left) and in an MRF (right)]
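The BN rule (parents, children, co-parents of children) can be sketched directly from a parent map; the DAG below is hypothetical:

```python
def markov_blanket(parents, node):
    """parents maps each node to the list of its parents in the DAG.
    The blanket is: parents + children + co-parents of children."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))
    blanket.update(children)
    for c in children:
        blanket.update(p for p in parents[c] if p != node)
    return blanket

# Hypothetical DAG: A -> C <- B, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
```

For node A the blanket is {B, C}: its child C plus the co-parent B. In an MRF the same computation reduces to reading off the node's neighbors.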
Mapping a Linear Bayesian Network into an MRF
• Bayesian network: x1 → x2 → x3 → … → xn

  p(x1, …, xn) = p(x1) p(x2 | x1) ⋯ p(xn | x_{n−1})

• MRF: x1 — x2 — x3 — … — xn

  p(x1, …, xn) = (1/Z) g_{1,2}(x1, x2) g_{2,3}(x2, x3) ⋯ g_{n−1,n}(x_{n−1}, xn)

• The mapping here is straightforward. Let

  g_{1,2}(x1, x2) = p(x1) p(x2 | x1),
  g_{2,3}(x2, x3) = p(x3 | x2),
  …,
  g_{n−1,n}(x_{n−1}, xn) = p(xn | x_{n−1}),  and Z = 1
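The chain mapping can be verified numerically: with the CPDs absorbed into pairwise potentials as above, the MRF product equals the BN joint and Z = 1. The CPD values below are made up for illustration:

```python
from itertools import product

# Hypothetical CPDs for a binary chain BN x1 -> x2 -> x3.
p1 = {0: 0.6, 1: 0.4}                                      # p(x1)
p2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # p(x2|x1), keyed (x1, x2)
p3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}  # p(x3|x2), keyed (x2, x3)

# Absorb p(x1) into the first pairwise potential, so that Z = 1.
def g12(a, b):
    return p1[a] * p2[(a, b)]

def g23(b, c):
    return p3[(b, c)]

def bn_joint(a, b, c):
    return p1[a] * p2[(a, b)] * p3[(b, c)]

def mrf_joint(a, b, c):
    return g12(a, b) * g23(b, c)              # Z = 1 by construction

# The two joints agree on every assignment.
same = all(abs(bn_joint(*xs) - mrf_joint(*xs)) < 1e-12
           for xs in product([0, 1], repeat=3))
```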
Mapping head-to-head Bayesian networks into an MRF
• When there are head-to-head nodes one has to add edges to convert the Bayesian network into the undirected graph.
• This process of ‘marrying the parents’ has become known as moralization, and the resulting undirected graph, after
dropping the arrows, is called the moral graph.
[Figure: head-to-head BN x1 → x3 ← x2 (left) and its moral graph with the added edge x1 – x2 (right)]

  p(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2)
  p(x1, x2, x3) = (1/Z) g_{1,2,3}(x1, x2, x3)

• The single potential involves all three variables
Mapping a Bayesian Network into an MRF: General Process
• Add additional undirected links between all pairs of parents of each node in the graph.
• Initialize all of the clique potentials of the moral graph to 1.
• Take each conditional distribution factor in the original directed graph and multiply it into one of the clique potentials.
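The first step (moralization: marry the co-parents of each node and drop the arrow directions) can be sketched as follows; the parent map is the head-to-head example from the previous slide:

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph:
    drop arrow directions and marry all co-parents of each node."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))    # dropped arrow
        for p, q in combinations(ps, 2):
            edges.add(frozenset((p, q)))        # marry the parents
    return edges

# Head-to-head example x1 -> x3 <- x2: moralization adds the edge x1 - x2.
parents = {"x1": [], "x2": [], "x3": ["x1", "x2"]}
moral = moralize(parents)
```

The remaining two steps just initialize all clique potentials of this moral graph to 1 and multiply each CPD into one clique potential that contains its scope.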
Dependency Revisit
• P: the set of all distributions over a given set of variables
• D: the set of distributions that can be represented by a BN
• U: the set of distributions that can be represented by an MRF
Factor graph
Two types of nodes
• The circles in a factor graph represent random variables
• The squares represent factors in the joint distribution
Two nodes are neighbors if they share a common factor.
p(x1, x2, x3) = fa(x1, x2) fb(x1, x2) fc(x2, x3) fd(x3)
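A factor graph can be represented directly as a list of factors with their scopes; the factor tables below are illustrative placeholders matching the structure p(x1, x2, x3) ∝ fa(x1,x2) fb(x1,x2) fc(x2,x3) fd(x3):

```python
# A factor graph as (name, scope, function) triples.
# Each function takes a dict of variable assignments.
factors = [
    ("fa", ("x1", "x2"), lambda v: 1.0 + v["x1"] * v["x2"]),
    ("fb", ("x1", "x2"), lambda v: 2.0 if v["x1"] == v["x2"] else 0.5),
    ("fc", ("x2", "x3"), lambda v: 1.5 if v["x2"] == v["x3"] else 1.0),
    ("fd", ("x3",),      lambda v: 0.8 if v["x3"] == 1 else 1.2),
]

def neighbors(var):
    """Variables sharing at least one factor with var."""
    out = set()
    for _, scope, _ in factors:
        if var in scope:
            out.update(scope)
    out.discard(var)
    return out

def unnormalized(v):
    """The joint distribution is the product of all factors."""
    p = 1.0
    for _, _, f in factors:
        p *= f(v)
    return p
```

Here `neighbors` implements the slide's definition directly: x1 and x3 are not neighbors because no factor contains both.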
Factor graph
• Factor graphs incorporate explicit details about the
factorization, but factor nodes do not correspond to CIs.
• An edge between a circle and a square states that the
corresponding function has the corresponding variable as an argument.
• The joint distribution is the product of all functions (so, the functions are factors).
• They preserve the tree structure of DAGs and undirected graphs.
Converting a BN into a factor graph
• Create variable nodes corresponding to the nodes of the original DAG.
• Create factor nodes corresponding to the conditional distributions.
• The conversion is not unique:

  f(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2)
  or
  fa(x1) = p(x1),  fb(x2) = p(x2),  fc(x1, x2, x3) = p(x3 | x1, x2)
Converting an MRF into a factor graph
• Create variable nodes corresponding to the nodes of the original undirected graph.
• Create factor nodes corresponding to the potential functions.
• The conversion is not unique:

  ψ(x1, x2, x3) = f_{1,2,3}(x1, x2, x3)
  or
  ψ(x1, x2, x3) = fa(x1, x2, x3) fb(x2, x3)
Why Factor Graphs?
• Now we have two completely different types of graphs: BNs and MRFs.
  – This is a problem because their inference strategies can be very different.
• We probably want a unified framework.
  – How about converting a BN to an MRF?
  – Problem: loops can be created.
• It is possible to convert both BNs and MRFs into trees without loops.