Tracking self-occluding objects in disparity maps
(Jojic, Turk and Huang, ICCV 1999)
PART IV
LEARNING THE
Maximum likelihood estimation
• Suppose we observe K IID training cases o(1)…o(K)
• o(k) is the kth training case
• For time-series, o(k) = (o1(k),o2(k), …,oT(k))
• Let θ be the parameters of the local functions in a graphical model
• Maximum likelihood estimate of θ: θML = argmaxθ Πk P(o(k)|θ)

Why maximum-likelihood?
• If θ spans all pdfs and if K → ∞, then the ML estimate matches the true data distribution
• Decisions based on P maximize utility (are Bayes-optimal)
• …works well in many applications
Complete data in Bayes nets
• All variables are observed, so
P(o|θ) = Πi P(oi|pai,θ), pai = parents of oi
• Since argmax () = argmax log (),
  θML = argmaxθ log Πk P(o(k)|θ) = argmaxθ Σk log P(o(k)|θ)
  θML = argmaxθ Σk Σi log P(oi(k)|pai(k),θ)

Log-likelihood
• Let θi ⊆ θ parameterize P(oi|pai,θi)
• Then, θiML = argmaxθi Σk log P(oi(k)|pai(k),θi)
• Learning the entire Bayes net decouples into learning the individual conditional distributions
• The sufficient statistics for learning a Bayes net are the sufficient statistics of the individual conditional distributions
Unconstrained discrete net
• Let P(oi=α | pai=β, θi) = θiαβ
• Let niαβ = # times oi=α and pai=β in the training data
• Then, θiαβML = niαβ / ( Σα′ niα′β )
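As a concrete sketch of this counting estimate (a toy example; the data, variable layout and names below are assumptions, not from the slides), one can tally niαβ from complete data and normalize over the child's values:

```python
import numpy as np
from collections import defaultdict

# Toy complete data set: each row gives (value of o_i, configuration of pa_i).
# The child/parent layout here is hypothetical, for illustration only.
data = [(1, (0, 1)), (0, (0, 1)), (1, (0, 1)), (1, (1, 1)), (0, (1, 1))]

counts = defaultdict(float)          # n_{i,alpha,beta}
parent_totals = defaultdict(float)   # sum over alpha of n_{i,alpha,beta}
for alpha, beta in data:
    counts[(alpha, beta)] += 1.0
    parent_totals[beta] += 1.0

# ML estimate: theta_{i,alpha,beta} = n_{i,alpha,beta} / sum_alpha n_{i,alpha,beta}
theta = {k: v / parent_totals[k[1]] for k, v in counts.items()}
print(theta)   # e.g. P(o_i=1 | pa_i=(0,1)) = 2/3
```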
Example: Learning the shifter net
[Shifter net diagram: variables d, z, m, yt, yt+1]
Suppose we observe all variables in 100 trials:

  z1  z2  m   y2,t+1   # counts
  1   0   1   1        99
  1   0   1   0        1

Set P(y2,t+1=1 | z1=1, z2=0, m=1) = 0.99
Logistic regression in binary nets
• Given frequencies, niαβ, ML estimate of θi is computed using iterative optimization (cf Hastie and Tibshirani 1990)
P(oi=1|pai,θi) = g(θi0 + Σn: on∈pai θin on)
[Plot: the logistic function g(z)]
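A minimal sketch of fitting such a logistic conditional by iterative optimization, using plain gradient ascent on the conditional log-likelihood rather than any particular method from the slides; the data, step size and iteration count are made up, and the sketch iterates over raw cases, which carry the same information as the counts niαβ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical complete data: rows are parent configurations, y is the binary child.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)   # parents o_n in pa_i
y = (rng.random(200) < sigmoid(1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.3)).astype(float)

theta = np.zeros(4)                      # [theta_i0, theta_i1, theta_i2, theta_i3]
Xb = np.hstack([np.ones((200, 1)), X])   # prepend a column of ones for the bias
for _ in range(500):                     # gradient ascent on the log-likelihood
    p = sigmoid(Xb @ theta)              # P(o_i=1 | pa_i, theta_i)
    theta += 0.1 * Xb.T @ (y - p) / len(y)
print(theta)
```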
Continuous child with discrete parents
• P(oi|pai=β,θi) = continuous distribution with index β and parameters θi
• Eg, Gaussian:
P(oi|pai=β,θi) = N(oi;µiβ,Ciβ),
• µiβ,Ciβ = mean and (co)variance of oi for configuration β of parents pai
• For the observations with pai=β, estimate µiβ and Ciβ (eg, by the sample mean and covariance)
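A minimal sketch of this conditional Gaussian estimate (the data and parent configurations are hypothetical): group the observations by parent configuration β and take the sample mean and variance within each group.

```python
import numpy as np
from collections import defaultdict

# Hypothetical complete data: (parent configuration beta, continuous child value o_i).
data = [((0,), 1.2), ((0,), 0.8), ((1,), 3.1), ((1,), 2.9), ((1,), 3.4)]

groups = defaultdict(list)
for beta, oi in data:
    groups[beta].append(oi)

# ML estimates: sample mean and variance of o_i within each parent configuration.
params = {beta: (np.mean(v), np.var(v)) for beta, v in groups.items()}
print(params)   # e.g. mu_{i,(1,)} ≈ 3.13, C_{i,(1,)} ≈ 0.042
```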
Continuous child with continuous parents
• Estimation becomes a regression-type problem
• Eg, linear Gaussian model:
P(oi|pai,θi) = N(oi; θi0 + Σn: on∈pai θin on, Ci)
• mean = linear function of the parents
• Estimation = linear regression
Complete data in MRFs
• All variables are observed, so
P(o|θ) = [Πi φi(cli|θ)] / ( Σo [Πi φi(cli|θ)] )
• Since argmax () = argmax log (),
  θML = argmaxθ log Πk P(o(k)|θ) = argmaxθ Σk log P(o(k)|θ)
  θML = argmaxθ Σk ( Σi log φi(cli(k)|θ) − log( Σo [Πi φi(cli|θ)] ) )
The log-partition-function term is generally intractable!
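To make the intractability concrete, here is a hypothetical brute-force evaluation of the normalizer for a tiny chain of binary variables with pairwise potentials; the 2^N enumeration is exactly what becomes infeasible as the number of variables grows.

```python
import itertools
import numpy as np

N = 12                                   # number of binary variables (small on purpose)
rng = np.random.default_rng(2)
W = rng.normal(size=N - 1)               # one pairwise potential parameter per edge

def unnorm(o):
    # product of clique potentials phi_i(cl_i) = exp(W_i * o_i * o_{i+1})
    return np.exp(sum(W[i] * o[i] * o[i + 1] for i in range(N - 1)))

# Partition function: sum over all 2^N configurations, exponential in N.
Z = sum(unnorm(o) for o in itertools.product([0, 1], repeat=N))
print(Z)
```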
Incomplete data in Bayes nets
• Partition variables x into observed vars, o, and unobserved vars, u: x = (o,u)
P(o|θ) = Σu P(o,u|θ) = Σu P(x|θ),
P(x|θ) = Πi P(xi|pai,θ), pai = parents of xi
θML = argmaxθ log Πk P(o(k)|θ)
    = argmaxθ Σk log( Σu(k) P(o(k),u(k)|θ) )
    = argmaxθ Σk log( Σu(k) [Πi P(xi(k)|pai(k),θ)] )
Problem! The summation gets in the way of the log Πi
Example: Mixture of 2 unit-variance Gaussians
P(z) = π1 α exp[−(z−µ1)²/2] + π2 α exp[−(z−µ2)²/2], where α = (2π)−1/2
The log-likelihood to be maximized is
Σk log( π1 α exp[−(z(k)−µ1)²/2] + π2 α exp[−(z(k)−µ2)²/2] )
The parameters (π1, µ1, π2, µ2) that maximize this do not have a simple, closed-form solution
• One approach: Use nonlinear optimizer (eg, Newton-type method). OR…
Expectation Maximization Algorithm
(Dempster, Laird and Rubin 1977)
• Learning was more straightforward when the data was complete
• Can we use probabilistic inference (compute P(u|o,θ)) to “fill in” the missing data and then use the learning rules for complete data?
• YES: the EM algorithm, a form of “ensemble completion”
• Initialize θ (eg, using 2nd-order statistics)
• E-Step: Compute Q(u) = P(u|o,θ) for all u
• M-Step: Holding Q(u) constant, maximize Σu Q(u) log P(o,u|θ) wrt θ
• Repeat E and M steps until convergence
• EM consistently increases log( Σu P(o,u|θ) )
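A minimal sketch of these E and M steps for the earlier mixture of two unit-variance Gaussians; the synthetic data, initialization and iteration count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
z = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(1.5, 1.0, 700)])  # synthetic data

pi = np.array([0.5, 0.5])       # mixing proportions pi_1, pi_2
mu = np.array([-1.0, 1.0])      # means mu_1, mu_2 (unit variances are fixed)

for _ in range(50):
    # E-step: Q(c=j | z_k) = responsibilities under the current parameters
    lik = pi * np.exp(-0.5 * (z[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    Q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximize sum_k sum_j Q log P(z_k, c=j) wrt pi and mu
    Nj = Q.sum(axis=0)
    pi = Nj / len(z)
    mu = (Q * z[:, None]).sum(axis=0) / Nj
print(pi, mu)
```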
EM in Bayes nets
• Recall P(o,u|θ) = Πi P(xi|pai,θ), where x = (o,u)
• Let θi ⊆ θ parameterize P(xi|pai,θi)
• Then, maximizing Σu Q(u) log P(o,u|θ) wrt θi becomes equivalent to maximizing Σxi,pai Q(xi,pai) log P(xi|pai,θi)
• Just as for complete data, learning is decoupled
Expectation Maximization Algorithm: Theory
(Following Neal and Hinton 1999)
• Problem: Maximize log( Σu P(o,u|θ) ) wrt θ
• We need to “move” the sum outside of the logarithm
• Recall Jensen’s inequality:
  Concave function of a convex sum ≥ convex sum of the concave function
  For the concave f’n log() and weights qi with qi ≥ 0, Σi qi = 1:
  log(Σi qi pi) ≥ Σi qi log(pi), where the pi’s are any positive numbers
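A quick numerical check of Jensen's inequality with made-up weights and values:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])          # weights: nonnegative, sum to 1
p = np.array([0.7, 2.0, 5.0])          # any positive numbers

lhs = np.log(np.sum(q * p))            # log of the convex sum
rhs = np.sum(q * np.log(p))            # convex sum of the logs
print(lhs, rhs, lhs >= rhs)            # the bound holds: lhs >= rhs
```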
• Rewrite log( Σu P(o,u|θ) ) by introducing Q(u):
  log( Σu Q(u) [P(o,u|θ)/Q(u)] ), where Q(u) > 0 and Σu Q(u) = 1
• Applying Jensen’s inequality,
  log( Σu P(o,u|θ) ) = log( Σu Q(u) [P(o,u|θ)/Q(u)] ) ≥ Σu Q(u) log [P(o,u|θ)/Q(u)]
• The Q(u) that makes this an equality is Q(u) = P(u|o,θ)
• The E-step maximizes the bound wrt Q(u)
• The M-step maximizes the bound wrt θ
• Since the E-step makes the bound tight, each iteration of EM is guaranteed to increase log( Σu P(o,u|θ) )
EM for mixture of Gaussians: E step
[Figure sequence: mixture model c → z, with cluster means µ1, µ2 and variances Φ1, Φ2 shown as images and π1 = 0.5, π2 = 0.5. For each image z from the data set, the E step computes the responsibilities P(c|z); for four of the training images these are 0.52/0.48, 0.51/0.49, 0.48/0.52 and 0.43/0.57.]

EM for mixture of Gaussians: M step
• Set µ1 to the average of z weighted by P(c=1|z); set µ2 to the average of z weighted by P(c=2|z)
• Set Φ1 to the average of diag((z−µ1)(z−µ1)ᵀ) weighted by P(c=1|z); set Φ2 to the average of diag((z−µ2)(z−µ2)ᵀ) weighted by P(c=2|z)

After iterating to convergence: π1 = 0.6, π2 = 0.4 [converged µ1, Φ1, µ2, Φ2 shown as images]
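A hedged sketch of these M-step updates for a diagonal-covariance mixture over image vectors; Z is a hypothetical (cases × pixels) data matrix and R holds the responsibilities P(c|z) computed in the E step.

```python
import numpy as np

def m_step(Z, R):
    """Z: (K, D) data matrix of vectorized images; R: (K, C) responsibilities P(c|z)."""
    Nc = R.sum(axis=0)                              # effective count per cluster
    pi = Nc / Z.shape[0]                            # mixing proportions
    mu = (R.T @ Z) / Nc[:, None]                    # responsibility-weighted means
    # Diagonal variances: weighted average of the squared deviations per pixel.
    phi = np.stack([(R[:, c:c+1] * (Z - mu[c]) ** 2).sum(axis=0) / Nc[c]
                    for c in range(R.shape[1])])
    return pi, mu, phi
```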
Transformed mixture of Gaussians
(Frey and Jojic, CVPR 1999)
[Diagram: class c generates the latent image z; z and the shift s generate the observed image x]
• P(c) = πc
• P(x|z,s) = N(x; shift(z,s), Ψ)
[Figure: class means µ1, µ2 and diagonal variances diag(Φ1), diag(Φ2) shown as images, with π1 = 0.6, π2 = 0.4; a prior P(s) over the shift s; and an example latent image z and observed image x]
EM for the transformed mixture of Gaussians (TMG)
[Diagram: variables x, l, c, z]
• E step: Compute P(l|x), P(c|x) and p(z|c,x) for each x in the data
• M step: Set
  – πc = avg of P(c|x)
  – ρl = avg of P(l|x)
  – µc = avg mean of p(z|c,x)
  – Φc = avg variance of p(z|c,x)
  – Ψ = avg variance of p(x − Gl z | x)
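As a rough illustration of the E step's posterior over class and shift (a simplified sketch under stated assumptions, not the authors' implementation): if z is class-conditionally Gaussian with mean µc and diagonal variance Φc, as the slide's parameters suggest, and shift(·,s) is a circular shift, then P(c,s|x) ∝ πc P(s) N(x; shift(µc,s), shift(Φc,s)+Ψ), which can be evaluated by brute force over all shifts.

```python
import numpy as np

def tmg_e_step(x, mu, phi, psi, pi):
    """Posterior over (class c, circular shift s) for one image vector x.
    mu, phi: (C, D) class means and diagonal variances; psi: (D,) noise variances;
    pi: (C,) class priors. A uniform prior over shifts is assumed here."""
    C, D = mu.shape
    logp = np.empty((C, D))
    for c in range(C):
        for s in range(D):                       # brute force over all circular shifts
            m = np.roll(mu[c], s)                # shift(mu_c, s)
            v = np.roll(phi[c], s) + psi         # shifted variances plus noise
            logp[c, s] = (np.log(pi[c]) - 0.5 * np.sum(np.log(2 * np.pi * v))
                          - 0.5 * np.sum((x - m) ** 2 / v))
    logp -= logp.max()
    post = np.exp(logp)
    return post / post.sum()                     # P(c, s | x)
```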
EM in TMG for face clustering
[Figure: cluster means learned by the TMG compared with those learned by an ordinary mixture of Gaussians (MG)]
Exact EM for the shifter net
[Shifter net diagram: variables d, z, m, yt, yt+1]
• Suppose we observe only yt and yt+1 in the training data
• Use brute-force inference to compute P(d,m,z|yt,yt+1) for every training case
• The M-step is equivalent to extending the training set with all completions of d, m, z for each training case, weighted by P(d,m,z|yt,yt+1), and then updating the conditional distributions using these expected counts (as for the complete case)
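A generic sketch of this brute-force E step for a small discrete Bayes net; the function below is a hypothetical helper (the shifter net's actual CPTs and variable layout are not reproduced here). It enumerates the hidden configurations for one training case, normalizes to obtain the posterior, and accumulates expected counts for each conditional distribution; summing these over training cases and normalizing, as in the complete-data slide, gives the M-step update.

```python
import itertools
from collections import defaultdict

def expected_counts(joint, hidden_names, observed, count_specs):
    """joint(assign) -> unnormalized P(x); observed: dict of observed values;
    count_specs: list of (child, parents) pairs whose expected counts we want."""
    posts = {}
    for values in itertools.product([0, 1], repeat=len(hidden_names)):  # brute force
        assign = dict(observed, **dict(zip(hidden_names, values)))
        posts[values] = joint(assign)
    Z = sum(posts.values())
    counts = defaultdict(float)
    for values, p in posts.items():
        assign = dict(observed, **dict(zip(hidden_names, values)))
        for child, parents in count_specs:
            key = (child, assign[child], tuple(assign[q] for q in parents))
            counts[key] += p / Z            # expected count under P(hidden | observed)
    return counts
```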
Other examples of exact EM
• Baum-Welch algorithm for learning HMMs
• Adaptive Kalman filtering
• Factor analysis (Rubin and Thayer 1982)
• Transformed HMMs (Jojic, Petrovic, Frey and Huang, CVPR 2000)
• …
When exact EM is intractable
Exact EM can be intractable if
• log P(xi|pai,θi) can’t be maximized wrt θi in a tractable way, or
• Inference is intractable
– The network has too many cycles (eg, layered vision model)
– The local computations needed in the sum-product algorithm are intractable (eg, active contours)
Partial M steps
• If log P(xi|pai,θi) can’t be maximized in closed form, compute Σxi,pai Q(xi,pai) ∂log P(xi|pai,θi)/∂θi during the M-step and learn using a gradient-based method (eg, conjugate gradients)
• Variants:
  – “GEM” (Dempster et al 1977)
  – Expectation conditional-maximization (Meng and Rubin 1992)
Approximate E steps
• If inference is intractable, we can use one of the approximate inference methods described above to obtain Q(u) ≈ P(u|o,θ) in the E-step
• Perform an exact or partial M-step
EM using condensation in mixed-state dynamic nets
(Blake et al 1999)
[Diagram: mixed-state dynamic net with discrete states s0, s1, s2, continuous states x0, x1, x2 and observations y0, y1, y2; training data: video of hands juggling balls]
Generalized EM
(Neal and Hinton 1999)
Recall
log( Σu P(o,u|θ) ) ≥ Σu Q(u) log [P(o,u|θ)/Q(u)] = B
• Generalized EM: Maximize B
• Keep a Q(u) for each training case and, in each E-step, adjust Q(u) to increase B
• In the M-step, adjust θ to increase B
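A small numerical check of this bound for a single data point under the earlier two-component unit-variance mixture (all numbers are made up): B equals the log-likelihood exactly when Q(u) is the posterior P(u|o,θ), and is smaller for any other Q.

```python
import numpy as np

z = 0.3
pi, mu = np.array([0.4, 0.6]), np.array([-2.0, 1.5])
joint = pi * np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2 * np.pi)   # P(o, u=c | theta)
loglik = np.log(joint.sum())                                     # log sum_u P(o,u|theta)

def bound(Q):
    return np.sum(Q * np.log(joint / Q))                         # B = sum_u Q log[P(o,u)/Q]

print(loglik, bound(np.array([0.5, 0.5])))     # B < log-likelihood for an arbitrary Q
print(loglik, bound(joint / joint.sum()))      # B = log-likelihood when Q is the posterior
```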
Variational EM
(Jordan et al 1999)
• Recall one of the variational “distances”:
F = D − log P(o|θ) = Σu Q(u|Φ) log[Q(u|Φ)/P(u,o|θ)]
• This is the bound from above (apart from the minus sign):
  log( Σu P(o,u|θ) ) ≥ Σu Q(u|Φ) log [P(o,u|θ)/Q(u|Φ)]
• E-step: Adjust Φ to increase the bound
• M-step: Adjust θ to increase the bound
Variational inference in mixed-state dynamic net
(Pavlovic, Frey and Huang, CVPR 1999)
[Diagram: mixed-state dynamic net with discrete states s0, s1, s2, continuous states x0, x1, x2 and observations y0, y1, y2]
[The variational posterior factorizes as Q(X, S | Y, U) = QLDS(X) QHMM(S): an LDS-like chain of per-time-step factors over the continuous states X times an HMM-like chain over the discrete states S, each with its own variational parameters]
[Figure: P(st|Y) plotted against the time step t, together with the input trace and LDS state vectors colored by the most probable HMM state]
Other examples of variational EM
• Exploiting tractable substructures
(Saul and Jordan 1996)
• Mixtures of switching state-space models
(Ghahramani and Hinton 1997)
• Recursive algorithms in graphical models
(Jaakkola and Jordan 1997)
• Factorial hidden Markov models
(Ghahramani and Jordan 1997)
• Nonlinear Gaussian Bayes nets
(Frey and Hinton 1999)
• …
Wake-sleep learning in Bayes nets
(Hinton, Dayan, Frey and Neal, Science 1995)
• In addition to the top-down generative
network, there is a bottom-up “recognition network” that does inference by simulation
• The recognition net is trained on fantasies
[Diagram: generative network (top-down, u → o) and recognition network (bottom-up, o → u)]
Some References
Books on probability and information theory
• E. T. Jaynes. Probability Theory – The Logic of Science, www.math.albany.edu:8008/JaynesBook.html.
• Cover and Thomas. Elements of Information Theory.
Books on graphical models
• F. Jensen 1996. An Introduction to Bayesian Networks, UCL Press.
• B. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication, MIT Press.
• M. I. Jordan (ed) 1999. Learning in Graphical Models, MIT Press.
Tutorials
• W. Buntine 1994. Learning with Graphical Models, www.ultimode.com/wray.
• D. Heckerman 1999. A Tutorial on Learning Bayes Nets from Data, www.research.microsoft.com/~heckerman.
• Z. Ghahramani and S. Roweis 1999. Probabilistic Models for Unsupervised Learning, www.gatsby.ucl.ac.uk.
Conferences
• Uncertainty in Artificial Intelligence
• Neural Information Processing Systems
Exact inference in graphical models
• B. J. Frey 1999. Probability propagation (the sum-product algorithm) in graphical models, tutorial available at www.cs.uwaterloo.ca/~frey.
• R. Dechter 1999. Bucket Elimination: A unifying framework for probabilistic inference, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.
• S. L. Lauritzen and D. J. Spiegelhalter 1988. Local computation with
probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society, Series B, 50:2, 157-224.
Monte Carlo inference
• R. M. Neal 1993. Probabilistic inference using Markov chain Monte Carlo Methods, technical report available at www.cs.toronto.edu/~radford.
• D. J. C. MacKay 1999. Introduction to Monte Carlo methods, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.
Variational techniques
• L. K. Saul and M. I. Jordan 1996. Exploiting tractable substructures in intractable networks, Touretzky, Mozer and Hasselmo (eds), Advances in Neural Information Processing Systems 8, MIT Press.
• T. S. Jaakkola and M. I. Jordan 1996. Computing upper and lower bounds on likelihoods in intractable networks, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence.
• Z. Ghahramani and G. E. Hinton 1997. Mixtures of switching state space models, www.gatsby.ucl.ac.uk.
• T. S. Jaakkola 1997. Variational Methods for Inference and Estimation in Graphical Models, www.ai.mit.edu/people/tommi.
• M. I. Jordan, Z. Ghahramani, T. S. Jaakkola and L. K. Saul 1999. An introduction to variational methods for graphical models, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.
• B. J. Frey and G. E. Hinton 1999. Variational inference and learning in nonlinear Gaussian belief networks, Neural Computation 11:1, 193-214.
Iterative probability propagation (sum-product algorithm)
• B. J. Frey and D. J. C. MacKay 1998. A revolution: Probability propagation in networks with cycles, Jordan, Kearns and Solla (eds), Advances in Neural Information Processing Systems 10, MIT Press.
• W. Freeman and E. Pasztor 1999. Learning low-level vision, Proceedings of the IEEE International Conference on Computer Vision, IEEE Press.
• B. J. Frey 2000. Filling in scenes by propagating probabilities through layers and into appearance models, Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, IEEE Press.