
Tracking self-occluding objects in disparity maps
(Jojic, Turk and Huang, ICCV 1999)

PART IV: LEARNING THE PARAMETERS OF GRAPHICAL MODELS

BRENDAN FREY

Maximum likelihood estimation

• Suppose we observe K IID training cases o^(1), …, o^(K)
• o^(k) is the kth training case
• For time series, o^(k) = (o_1^(k), o_2^(k), …, o_T^(k))
• Let θ be the parameters of the local functions in a graphical model
• Maximum likelihood estimate of θ:  θ_ML = argmax_θ Π_k P(o^(k) | θ)

Why maximum-likelihood?

• If θ spans all pdfs and K → ∞, then the ML estimate matches the true data distribution
• Decisions based on the resulting P maximize expected utility (they are Bayes-optimal)
• …it works well in many applications


Complete data in Bayes nets

• All variables are observed, so
  P(o|θ) = Π_i P(o_i | pa_i, θ),   pa_i = parents of o_i
• Since argmax f() = argmax log f(),
  θ_ML = argmax_θ log Π_k P(o^(k) | θ)
       = argmax_θ Σ_k log P(o^(k) | θ)   (the log-likelihood)
       = argmax_θ Σ_k Σ_i log P(o_i^(k) | pa_i^(k), θ)

• Let θ_i ⊆ θ parameterize P(o_i | pa_i, θ_i)
• Then,
  θ_i,ML = argmax_θi Σ_k log P(o_i^(k) | pa_i^(k), θ_i)
• Learning the entire Bayes net decouples into learning the individual conditional distributions
• The sufficient statistics for learning a Bayes net are the sufficient statistics of the individual conditional distributions


Unconstrained discrete net

• Let P(oi=α |pai=β,θi) = θiαβ

• Let niαβ = # times oi=α and pai=β in training data

• Then, θiαβML = niαβ

/ ( Σ

α niαβ

)

Example: Learning the shifter net

[Figure: shifter net with variables d, z, m, y_t and y_{t+1}.]

Suppose we observe all variables in 100 trials:

  z1  z2  m   y_{2,t+1}   # counts
  1   0   1   1           99
  1   0   1   0           1

Set P(y_{2,t+1} = 1 | z1 = 1, z2 = 0, m = 1) = 0.99
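As a concrete illustration, here is a minimal sketch (not code from the tutorial; data and names are illustrative) of the count-based ML estimate applied to the rows of counts above:

```python
# Minimal sketch: ML estimation of an unconstrained conditional probability
# table by counting, applied to the counts shown above for P(y_{2,t+1} | z1, z2, m).
from collections import Counter

# Hypothetical training data: (parent configuration, child value) pairs,
# where the parent configuration is (z1, z2, m) and the child is y_{2,t+1}.
data = [((1, 0, 1), 1)] * 99 + [((1, 0, 1), 0)] * 1

counts = Counter(data)                          # n_iαβ
parent_totals = Counter(pa for pa, _ in data)   # Σ_α n_iαβ

def cpt(parents, child):
    """ML estimate P(child | parents) = n_iαβ / Σ_α n_iαβ."""
    return counts[(parents, child)] / parent_totals[parents]

print(cpt((1, 0, 1), 1))   # 0.99
print(cpt((1, 0, 1), 0))   # 0.01
```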


Logistic regression in binary nets

• Given the frequencies n_iαβ, the ML estimate of θ_i is computed using iterative optimization (cf Hastie and Tibshirani 1990)
  P(o_i = 1 | pa_i, θ_i) = g(θ_i0 + Σ_{n: o_n ∈ pa_i} θ_in o_n)

[Plot: the logistic (sigmoid) function g(z), rising from 0 to 1.]
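Because the ML estimate has no closed form here, it is found iteratively. Below is a minimal sketch (not the tutorial’s code; the data and names are illustrative) of fitting such a logistic conditional by gradient ascent on the log-likelihood:

```python
# Minimal sketch: ML fitting of P(o_i = 1 | pa_i) = g(θ_0 + Σ_n θ_n o_n)
# by gradient ascent on the mean log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(parents, child, lr=0.1, iters=2000):
    """parents: (K, P) array of parent values; child: (K,) array of 0/1 values."""
    K, P = parents.shape
    X = np.hstack([np.ones((K, 1)), parents])   # column of ones for θ_0
    theta = np.zeros(P + 1)
    for _ in range(iters):
        p = sigmoid(X @ theta)                  # current P(o_i = 1 | pa_i)
        theta += lr * X.T @ (child - p) / K     # gradient of the mean log-likelihood
    return theta

# Example: two binary parents; the child is (noisily) their OR.
rng = np.random.default_rng(0)
pa = rng.integers(0, 2, size=(500, 2))
ch = ((pa[:, 0] | pa[:, 1]) ^ (rng.random(500) < 0.05)).astype(float)
print(fit_logistic(pa, ch))
```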

Continuous child with discrete parents

• P(o_i | pa_i = β, θ_i) = a continuous distribution with index β and parameters θ_i
• Eg, Gaussian:
  P(o_i | pa_i = β, θ_i) = N(o_i; µ_iβ, C_iβ)
• µ_iβ, C_iβ = mean and (co)variance of o_i for configuration β of the parents pa_i
• For the observations with pa_i = β, estimate µ_iβ and C_iβ (their sample mean and covariance)
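A minimal sketch (assumed array shapes and names) of this estimate, which is just the sample mean and covariance of the child for each configuration of the discrete parent:

```python
# Minimal sketch: ML estimation of a Gaussian child with a discrete parent.
import numpy as np

def fit_conditional_gaussian(parent, child):
    """parent: (K,) array of discrete configurations; child: (K, D) array."""
    params = {}
    for beta in np.unique(parent):
        z = child[parent == beta]
        mu = z.mean(axis=0)                       # µ_iβ
        C = np.cov(z, rowvar=False, bias=True)    # C_iβ; bias=True gives the ML (1/N) estimate
        params[beta] = (mu, C)
    return params

rng = np.random.default_rng(1)
pa = rng.integers(0, 2, size=1000)
ch = np.where(pa[:, None] == 0,
              rng.normal(0.0, 1.0, (1000, 2)),
              rng.normal(3.0, 0.5, (1000, 2)))
print({b: mu for b, (mu, C) in fit_conditional_gaussian(pa, ch).items()})
```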


Continuous child with continuous parents

• Estimation becomes a regression-type problem

• Eg, linear Gaussian model:

  P(o_i | pa_i, θ_i) = N(o_i; θ_i0 + Σ_{n: o_n ∈ pa_i} θ_in o_n, C_i)
• The mean is a linear function of the parents
• Estimation = linear regression
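A minimal sketch (assumed names) of the corresponding fit: least squares gives the weights, and the mean squared residual gives the ML noise variance C_i:

```python
# Minimal sketch: ML fitting of the linear Gaussian conditional above.
import numpy as np

def fit_linear_gaussian(parents, child):
    """parents: (K, P) array; child: (K,) array."""
    X = np.hstack([np.ones((len(child), 1)), parents])   # column of ones for θ_i0
    theta, *_ = np.linalg.lstsq(X, child, rcond=None)    # ML weights = least squares
    C = np.mean((child - X @ theta) ** 2)                # ML noise variance
    return theta, C

rng = np.random.default_rng(2)
pa = rng.normal(size=(500, 3))
ch = 1.0 + pa @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=500)
print(fit_linear_gaussian(pa, ch))
```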

Complete data in MRFs

• All variables are observed, so
  P(o|θ) = [ Π_i φ_i(cl_i | θ) ] / ( Σ_o [ Π_i φ_i(cl_i | θ) ] )
• Since argmax f() = argmax log f(),
  θ_ML = argmax_θ log Π_k P(o^(k) | θ) = argmax_θ Σ_k log P(o^(k) | θ)
  θ_ML = argmax_θ Σ_k ( Σ_i log φ_i(cl_i^(k) | θ) - log ( Σ_o [ Π_i φ_i(cl_i | θ) ] ) )

Generally intractable!
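To see where the intractability enters, consider the gradient of the log-likelihood (a standard identity, not spelled out on the slide): its second term is an expectation under the model, which requires the sum over all configurations o.

```latex
\frac{\partial}{\partial\theta}\log P(o^{(k)}|\theta)
 \;=\; \sum_i \frac{\partial}{\partial\theta}\log\phi_i\!\left(cl_i^{(k)}\,\middle|\,\theta\right)
 \;-\; \sum_{o} P(o|\theta)\sum_i \frac{\partial}{\partial\theta}\log\phi_i(cl_i|\theta)
```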


Incomplete data in Bayes nets

• Partition variables x into observed vars, o, and unobserved vars, u: x = (o,u)

  P(o|θ) = Σ_u P(o,u|θ) = Σ_u P(x|θ),
  P(x|θ) = Π_i P(x_i | pa_i, θ),   pa_i = parents of x_i

  θ_ML = argmax_θ log Π_k P(o^(k) | θ)
       = argmax_θ Σ_k log ( Σ_{u^(k)} P(o^(k), u^(k) | θ) )
       = argmax_θ Σ_k log ( Σ_{u^(k)} [ Π_i P(x_i^(k) | pa_i^(k), θ) ] )

Problem! The summation gets in the way of log Π_i

Example: Mixture of 2 unit-variance Gaussians

  P(z) = π_1 α exp[-(z-µ_1)^2/2] + π_2 α exp[-(z-µ_2)^2/2],   where α = (2π)^(-1/2)

The log-likelihood to be maximized (summed over the training cases z) is

  log( π_1 α exp[-(z-µ_1)^2/2] + π_2 α exp[-(z-µ_2)^2/2] )

The parameters (π_1, µ_1, π_2, µ_2) that maximize this do not have a simple closed-form solution.

• One approach: use a nonlinear optimizer (eg, a Newton-type method). OR…


Expectation Maximization Algorithm

(Dempster, Laird and Rubin 1977)

• Learning was more straightforward when the data was complete

• Can we use probabilistic inference (compute P(u|o,θ)) to “fill in” the missing data and then use the learning rules for complete data?
• YES: the EM algorithm
• Initialize θ (eg, using 2nd-order statistics)
• E-Step: Compute Q(u) = P(u|o,θ) for all u
• M-Step: Holding Q(u) constant, maximize Σ_u Q(u) log P(o,u|θ) wrt θ
• Repeat the E and M steps until convergence
• EM consistently increases log( Σ_u P(o,u|θ) )

Expectation Maximization (EM) Algorithm

“Ensemble completion”
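A minimal sketch (not the tutorial’s code) of EM for the earlier example, a mixture of two unit-variance Gaussians, estimating (π_1, µ_1, π_2, µ_2):

```python
# Minimal sketch: EM for a mixture of two unit-variance Gaussians.
import numpy as np

def em_two_gaussians(z, iters=100):
    pi = np.array([0.5, 0.5])
    mu = np.array([z.min(), z.max()])                      # crude initialization
    for _ in range(iters):
        # E step: responsibilities Q(c) = P(c | z, θ) for each training case
        logp = np.log(pi) - 0.5 * (z[:, None] - mu) ** 2   # up to a constant
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M step: maximize Σ_u Q(u) log P(o,u|θ) wrt (π, µ)
        pi = q.mean(axis=0)
        mu = (q * z[:, None]).sum(axis=0) / q.sum(axis=0)
    return pi, mu

rng = np.random.default_rng(3)
z = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(1.5, 1.0, 700)])
print(em_two_gaussians(z))   # roughly π = (0.3, 0.7), µ = (-2, 1.5)
```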


EM in Bayes nets

• Recall P(o,u|θ) = Π_i P(x_i | pa_i, θ),   x = (o,u)
• Let θ_i ⊆ θ parameterize P(x_i | pa_i, θ_i)
• Then, maximizing Σ_u Q(u) log P(o,u|θ) wrt θ_i becomes equivalent to maximizing
  Σ_{x_i, pa_i} Q(x_i, pa_i) log P(x_i | pa_i, θ_i)
• Just like for complete data, learning is decoupled

Expectation Maximization Algorithm: Theory

(Following Neal and Hinton 1999)

• Problem: Maximize log( Σ_u P(o,u|θ) ) wrt θ
• We need to “move” the sum outside of the logarithm
• Recall Jensen’s inequality: a concave function of a convex combination ≥ the convex combination of the concave function. For the concave function log() and weights q_i with q_i ≥ 0, Σ_i q_i = 1:
  log( Σ_i q_i p_i ) ≥ Σ_i q_i log(p_i),   where the p_i are any positive numbers


Expectation Maximization Algorithm: Theory

• Rewrite log( Σ_u P(o,u|θ) ) by introducing Q(u):
  log( Σ_u Q(u) [P(o,u|θ)/Q(u)] ),   where Q(u) > 0 and Σ_u Q(u) = 1
• Applying Jensen’s inequality,
  log( Σ_u P(o,u|θ) ) = log( Σ_u Q(u) [P(o,u|θ)/Q(u)] ) ≥ Σ_u Q(u) log [P(o,u|θ)/Q(u)]

Expectation Maximization Algorithm: Theory

  log( Σ_u P(o,u|θ) ) ≥ Σ_u Q(u) log [P(o,u|θ)/Q(u)]

• The Q(u) that makes this an equality is Q(u) = P(u|o,θ)
• The E-Step maximizes the bound wrt Q(u)
• The M-Step maximizes the bound wrt θ
• Since the E-step makes the bound tight, each iteration of EM is guaranteed to increase log( Σ_u P(o,u|θ) )
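The gap in the bound has a simple form (a standard identity the slides leave implicit): it is the KL divergence from Q(u) to the posterior, which is zero exactly when Q(u) = P(u|o,θ).

```latex
\log\Big(\sum_u P(o,u|\theta)\Big) \;-\; \sum_u Q(u)\log\frac{P(o,u|\theta)}{Q(u)}
 \;=\; \sum_u Q(u)\log\frac{Q(u)}{P(u|o,\theta)}
 \;=\; \mathrm{KL}\big(Q \,\|\, P(u|o,\theta)\big) \;\ge\; 0
```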

EM for mixture of Gaussians: E step

[Figure sequence: the model c → z, with the current parameters µ_1, Φ_1, µ_2, Φ_2 shown as images and π_1 = 0.5, π_2 = 0.5. For each image z from the data set, the E step computes the responsibilities P(c|z); for four example images these are (P(c=1|z), P(c=2|z)) = (0.52, 0.48), (0.51, 0.49), (0.48, 0.52) and (0.43, 0.57).]

EM for mixture of Gaussians: M step

[Figure sequence: the model c → z with the parameter images µ_1, Φ_1, µ_2, Φ_2 and π_1 = 0.5, π_2 = 0.5.]

• Set µ_1 to the average of z weighted by P(c=1|z); set µ_2 to the average of z weighted by P(c=2|z)
• Set Φ_1 to the average of diag((z-µ_1)^T(z-µ_1)) (the elementwise squared deviations) weighted by P(c=1|z); set Φ_2 to the average of diag((z-µ_2)^T(z-µ_2)) weighted by P(c=2|z)

After iterating to convergence:

[Figure: the model c → z with the learned parameter images µ_1, Φ_1, µ_2, Φ_2.]

π_1 = 0.6, π_2 = 0.4
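A minimal sketch (assumed shapes and names) of the full EM loop for a mixture of Gaussians with diagonal covariances, mirroring the E and M steps illustrated above; Z is a (K, D) array of data vectors such as vectorized images:

```python
# Minimal sketch: EM for a mixture of C diagonal-covariance Gaussians.
import numpy as np

def em_mog_diag(Z, C=2, iters=50, eps=1e-6):
    K, D = Z.shape
    rng = np.random.default_rng(0)
    pi = np.full(C, 1.0 / C)
    mu = Z[rng.choice(K, C, replace=False)]          # (C, D) means µ_c
    var = np.full((C, D), Z.var(axis=0) + eps)       # (C, D) diagonal variances Φ_c
    for _ in range(iters):
        # E step: log π_c + log N(z; µ_c, diag(Φ_c)) up to a constant
        logp = (np.log(pi)[None, :]
                - 0.5 * np.sum(np.log(var), axis=1)[None, :]
                - 0.5 * np.sum((Z[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :], axis=2))
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)            # responsibilities P(c|z)
        # M step: weighted averages, as on the slides
        Nc = q.sum(axis=0)
        pi = Nc / K
        mu = (q.T @ Z) / Nc[:, None]
        var = (q.T @ Z ** 2) / Nc[:, None] - mu ** 2 + eps
    return pi, mu, var
```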


Transformed mixture of Gaussians
(Frey and Jojic, CVPR 1999)

[Figure: graphical model with class c, latent image z, shift s and observed image x; example parameter images µ_1, diag(Φ_1), µ_2, diag(Φ_2) with π_1 = 0.6, π_2 = 0.4, the shift prior P(s), a latent image z, an observed image x, and the diagonal noise variance.]

• P(c) = π_c; the shift s has prior P(s)
• P(x|z,s) = N(x; shift(z,s), Ψ)

EM for the transformed mixture of Gaussians (TMG)

[Figure: graphical model with class c, transformation l, latent image z and observed image x.]

• E step: Compute P(l|x), P(c|x) and p(z|c,x) for each x in the data
• M step: Set
  – π_c = avg of P(c|x)
  – ρ_l = avg of P(l|x)
  – µ_c = avg mean of p(z|c,x)
  – Φ_c = avg variance of p(z|c,x)
  – Ψ = avg variance of p(x - G_l z | x)
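For intuition only, here is a rough sketch of the discrete part of the E step, assuming 1-D signals, circular shifts implemented with np.roll, and diagonal covariances (an illustration, not the paper’s implementation); it uses the fact that circularly shifting a diagonal Gaussian just permutes its mean and variances:

```python
# Rough sketch: posterior over class c and shift s in a transformed mixture
# of Gaussians with 1-D circular shifts and diagonal covariances.
import numpy as np

def tmg_posterior_c_s(x, pi, rho, mu, phi, psi):
    """x: (D,) observation; pi: (C,) class prior; rho: (S,) shift prior;
    mu, phi: (C, D) class means and diagonal variances; psi: (D,) noise variances.
    Returns P(c, s | x) as a (C, S) array."""
    C, S = len(pi), len(rho)
    logp = np.empty((C, S))
    for c in range(C):
        for s in range(S):
            m = np.roll(mu[c], s)            # shifted mean
            v = np.roll(phi[c], s) + psi     # shifted variance plus observation noise
            logp[c, s] = (np.log(pi[c]) + np.log(rho[s])
                          - 0.5 * np.sum(np.log(v) + (x - m) ** 2 / v))
    p = np.exp(logp - logp.max())
    return p / p.sum()
```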

EM in TMG for face clustering

[Figure: cluster means learned by the TMG compared with an ordinary mixture of Gaussians (MG).]

Exact EM for the shifter net

[Figure: shifter net with variables d, z, m, y_t and y_{t+1}.]

• Suppose we observe only y_t and y_{t+1} in the training data
• Use brute-force inference to compute P(d,m,z | y_t, y_{t+1}) for every training case
• The M-step is equivalent to extending the training set by sampling d, m, z from P(d,m,z | y_t, y_{t+1}) for every training case and then updating the conditional distributions using counts (as for the complete-data case)


Other examples of exact EM

• Baum-Welch algorithm for learning HMMs

• Adaptive Kalman filtering

• Factor analysis (Rubin and Thayer 1982)

• Transformed HMMs (Jojic, Petrovic, Frey and Huang, CVPR 2000)

• …

When exact EM is intractable

Exact EM can be intractable if

• log P(x_i | pa_i, θ_i) can’t be maximized wrt θ_i in a tractable way, or

• Inference is intractable

– The network has too many cycles (eg, layered vision model)

– The local computations needed in the sum-product algorithm are intractable (eg, active contours)


Partial M steps

• If log P(x_i | pa_i, θ_i) can’t be maximized in closed form, compute
  Σ_{x_i, pa_i} Q(x_i, pa_i) ∂log P(x_i | pa_i, θ_i) / ∂θ_i
  during the M-step and learn using a gradient-based method (eg, conjugate gradients)

• Variants:

– “GEM” (Dempster et al 1977)

– Expectation conditional-maximization

(Meng and Rubin 1992)

Approximate E steps

• If inference is intractable, we can use one of the approximate inference methods described above to obtain Q(u) ≈ P(u|o,θ) in the E-step

• Perform an exact or partial M-step


EM using condensation in mixed-state dynamic nets
(Blake et al 1999)

[Figure: mixed-state dynamic net with discrete states s0, s1, s2, continuous states x0, x1, x2 and observations y0, y1, y2, applied to video of hands juggling balls.]

Generalized EM

(Neal and Hinton 1999)

Recall
  log( Σ_u P(o,u|θ) ) ≥ Σ_u Q(u) log [P(o,u|θ)/Q(u)] = B

• Generalized EM: Maximize B

• Keep a Q(u) for each training case and in each E-step, adjust Q(u) to increase B

• In the M-step, adjust θ to increase B


Variational EM

(Jordan et al 1999)

• Recall one of the variational “distances”:
  F = D - log P(o|θ) = Σ_u Q(u|Φ) log[ Q(u|Φ) / P(u,o|θ) ]
• This is just the negative of the bound from above:
  log( Σ_u P(o,u|θ) ) ≥ Σ_u Q(u|Φ) log [ P(o,u|θ) / Q(u|Φ) ] = -F
• E-step: Adjust Φ to increase the bound
• M-step: Adjust θ to increase the bound
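Written out (a standard identity), F differs from the negative log-likelihood by a KL divergence: adjusting Φ to decrease F shrinks the KL term (tightening the bound), and adjusting θ to decrease F raises the lower bound on log P(o|θ).

```latex
F(\Phi,\theta) \;=\; \mathrm{KL}\big(Q(u|\Phi)\,\|\,P(u|o,\theta)\big) \;-\; \log P(o|\theta)
 \;\;\ge\;\; -\log P(o|\theta)
```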

Variational inference in mixed-state dynamic net

(Pavlovic, Frey and Huang, CVPR 1999)

[Figure: mixed-state dynamic net with discrete states s_t, continuous states x_t and observations y_t. The posterior over the hidden states given the observations Y is approximated by a variational distribution Q that decouples into an LDS component over the continuous states X and an HMM component over the discrete states S, linked through variational parameters.]


Variational inference in mixed-state dynamic net
(Pavlovic, Frey and Huang, CVPR 1999)

[Figure: the input trace and LDS state vectors colored by the most probable HMM state, with P(s_t|Y) plotted against the time step t.]

Other examples of variational EM

• Exploiting tractable substructures

(Saul and Jordan 1996)

• Mixtures of switching state-space models

(Ghahramani and Hinton 1997)

• Recursive algorithms in graphical models

(Jaakkola and Jordan 1997)

• Factorial hidden Markov models

(Ghahramani and Jordan 1997)

• Nonlinear Gaussian Bayes nets

(Frey and Hinton 1999)

• …


Wake-sleep learning in Bayes nets

(Hinton, Dayan, Frey and Neal, Science 1995)

• In addition to the top-down generative

network, there is a bottom-up “recognition network” that does inference by simulation

• The recognition net is trained on fantasies

[Figure: the top-down generative network (from u to o) alongside the bottom-up recognition network (from o to u).]
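A heavily simplified sketch of wake-sleep for a one-hidden-layer binary sigmoid belief net (an illustration under an assumed parameterization, not the published algorithm verbatim; all names are made up): the wake phase fills in u with the recognition net and applies a delta rule to the generative parameters, and the sleep phase generates a fantasy from the generative net and applies a delta rule to the recognition parameters.

```python
# Simplified sketch: wake-sleep for a one-hidden-layer binary sigmoid belief net.
import numpy as np

rng = np.random.default_rng(4)
D_o, D_u = 8, 4                                              # observed / hidden sizes
W = 0.01 * rng.normal(size=(D_o, D_u)); c = np.zeros(D_o)    # generative P(o|u)
b = np.zeros(D_u)                                            # generative prior P(u)
R = 0.01 * rng.normal(size=(D_u, D_o)); r = np.zeros(D_u)    # recognition P(u|o)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(o, lr=0.05):
    global W, c, b, R, r
    # Wake phase: recognition net fills in u; delta rule on the generative parameters
    u = bernoulli(sigmoid(R @ o + r))
    p_o = sigmoid(W @ u + c)
    W += lr * np.outer(o - p_o, u)
    c += lr * (o - p_o)
    b += lr * (u - sigmoid(b))
    # Sleep phase: generative net produces a fantasy; delta rule on the recognition parameters
    u_f = bernoulli(sigmoid(b))
    o_f = bernoulli(sigmoid(W @ u_f + c))
    q_u = sigmoid(R @ o_f + r)
    R += lr * np.outer(u_f - q_u, o_f)
    r += lr * (u_f - q_u)

# Toy usage: train on random binary "data"
for _ in range(1000):
    wake_sleep_step(bernoulli(np.full(D_o, 0.3)))
```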

Some References

Books on probability and information theory

E. T. Jaynes. Probability Theory – The Logic of Science, www.math.albany.edu:8008/JaynesBook.html.

Cover and Thomas. Elements of Information Theory.

Books on graphical models

F. Jensen 1996. An Introduction to Bayesian Networks, UCL Press.

B. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication, MIT Press.

M. I. Jordan (ed) 1999. Learning in Graphical Models, MIT Press.

Tutorials

W. Buntine 1994. Learning with Graphical Models, www.ultimode.com/wray.

D. Heckerman 1999. A Tutorial on Learning Bayes Nets from Data, www.research.microsoft.com/~heckerman.

Z. Ghahramani and S. Roweis 1999. Probabilistic Models for Unsupervised Learning, www.gatsby.ucl.ac.uk.

Conferences

Uncertainty in Artificial Intelligence

Neural Information Processing Systems

Exact inference in graphical models

B. J. Frey 1999. Probability propagation (the sum-product algorithm) in graphical models, tutorial available at www.cs.uwaterloo.ca/~frey.

R. Dechter 1999. Bucket Elimination: A unifying framework for probabilistic inference, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.

S. L. Lauritzen and D. J. Spiegelhalter 1988. Local computation with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society, Series B, 50:2, 157-224.

Monte Carlo inference

R. M. Neal 1993. Probabilistic inference using Markov chain Monte Carlo Methods, technical report available at www.cs.toronto.edu/~radford.

D. J. C. MacKay 1999. Introduction to Monte Carlo methods, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.

Variational techniques

L. K. Saul and M. I. Jordan 1996. Exploiting tractable substructures in intractable networks, Touretzky, Mozer and Hasselmo (eds), Advances in Neural Information Processing Systems 8, MIT Press.

T. S. Jaakkola and M. I. Jordan 1996. Computing upper and lower bounds on likelihoods in intractable networks, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence.

Z. Ghahramani and G. E. Hinton 1997. Mixtures of switching state space models, www.gatsby.ucl.ac.uk.

T. S. Jaakkola 1997. Variational Methods for Inference and Estimation in Graphical Models, www.ai.mit.edu/people/tommi.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola and L. K. Saul 1999. An introduction to variational methods for graphical models, M. I. Jordan (ed) Learning in Graphical Models, MIT Press.

B. J. Frey and G. E. Hinton 1999. Variational inference and learning in nonlinear Gaussian belief networks, Neural Computation 11:1, 193-214.

Iterative probability propagation (sum-product algorithm)

B. J. Frey and D. J. C. MacKay 1998. A revolution: Probability propagation in networks with cycles, Jordan, Kearns and Solla (eds), Advances in Neural Information Processing Systems 10, MIT Press.

W. Freeman and E. Pasztor 1999. Learning low-level vision, Proceedings of the IEEE International Conference on Computer Vision, IEEE Press.

B. J. Frey 2000. Filling in scenes by propagating probabilities through layers and into appearance models, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Press.
