• 沒有找到結果。

Problem of Gaussian Mixture

N/A
N/A
Protected

Academic year: 2022

Share "Problem of Gaussian Mixture"

Copied!
34
0
0

加載中.... (立即查看全文)

全文

(1)

Expectation Maximization

An Approach to Parameter Estimation

Jiangsheng Yu

School of Electronics Engineering and Computer Sciencec Peking University, Beijing, 100871

School of Information Engineering, Shihezi University, Xinjiang

[email protected], http://icl.pku.edu.cn/yujs

(2)

Topics

1. Problem of Gaussian mixture

2. Basic ideas of maximum likelihood estimate (MLE)a and expectation maximization (EM) 3. Applications of the EM algorithm to

(a) Censored data

(b) Parameter estimation of mixture-density (c) Hidden Markov Model (HMM)

4. Further readings and conclusion 5. Appendix

6. References

aSee Chapter 2 of [1], which also introduced the EM algorithm.

(3)

Problem of Gaussian Mixture

Problem 1 Sample X = (X1, X2, · · · , Xn) is from an α1N (µ1, σ12) + α2N (µ2, σ22) population, where α1 + α2 = 1, 0 ≤ α1, α2 ≤ 1, how to estimate the parameters (µi, σi2) and the coefficients αi?

Example 1 Gaussian mixture:

(4)

Simulation

The nonparametric density estimation of 1,000 simulated data generated by the distribution of

0.4N (3, 1) + 0.6N (−2, 4), done by S-plus 2000.

-8 -6 -4 -2 0 2 4 6

0.00.020.040.060.080.100.120.14

Gaussian mixture

0.4*N(3,1)+0.6*N(-2,4)

(5)

MLE

Definition 1 Sample X = (X1, X2, · · · , Xn) is from the population with density function p(x|Θ).

likelihood function: the density of X given Θ L(Θ|x) = p(x|Θ) =

n

Y

i=1

p(xi|Θ) (1) log-likelihood function: l(Θ|x) = log L(Θ|x)

Definition 2 The MLE of Θ is Θ = argmaxˆ

Θ L(Θ|x)

= argmax

Θ

l(Θ|x) (2)

(6)

Basic Idea of MLE

God always let the event with the biggest probability happen firstly — The MLE of Θ is to make the

sample occur the most likely.

C. F. Gauss (1777-1855) R. A. Fisher (1890-1962)

Figure 1: Founders of MLE method

(7)

Complete Data

Definition 3 The sample X = (X1, · · · , Xn) together with the missing (or latent) data Y is called complete data. For instance, in Example 1, Y ∼ B(1, α1).

Definition 4 The complete likelihood is

L(Θ|x, y) = p(x, y|Θ) (3) where p(x, y|Θ) is the joint density of X and Y given the parameter Θ.

Definition 5 The complete log-likelihood is

l(Θ|x, y) = log L(Θ|x, y) (4)

(8)

Complete MLE

By the definition of conditional density, p(x|Θ) = p(x, y|Θ)

p(y|x, Θ) (5)

where p(y|x, Θ) is the conditional density of Y given X = x and Θ. By (2) we have complete MLE

Θ = argmaxˆ

Θ

l(Θ|x)

= argmax

Θ

[log p(x, y|Θ) − log p(y|x, Θ)]

= argmax

Θ

[l(Θ|x, y) − log p(y|x, Θ)]

(6)

(9)

MSE Predicator of l (Θ |x)

Given X = x and Θ = Θt−1, where Θt−1 is the current estimates of the unknown parameters,

log p(x, Y|Θ) is a function of Y whose unique best Mean Squared Error (MSE) predicator is

Q(Θ, Θt−1) = E(log p(x, Y|Θ)|x, Θt−1) (7)

The MSE predicator of log p(Y|x, Θ) is

H(Θ, Θt−1) = E(log p(Y|x, Θ)|x, Θt−1) (8) Then, we get the MSE predicator of l(Θ|x)

l(Θ|x) ≈ Q(Θ, Θt−1) − H(Θ, Θt−1) (9)

(10)

Basic Idea of EM

Theorem 1 The procedure of Θt ← argmax

Θ

Q(Θ, Θt−1) (10) guarantees that l(Θt|x) ≥ l(Θt−1|x) with equality iff Q(Θt|x, Θt−1) = Q(Θt−1|x, Θt−1), t = 1, 2, · · · .

Note Repeat (10) till a maximal point of l(Θ|x), then choose another initial estimate of Θ randomly and re- peat the EM procedure. argmax

Θ

l(Θ|x) will be found with big probability after enough many iterations.

(11)

Proof of Theorem

l(Θt|x) − l(Θt−1|x) =

[Q(Θt|x, Θt−1) − Q(Θt−1|x, Θt−1)] + [H(Θt−1|x, Θt−1) − H(Θt|x, Θt−1)]

Q(Θt|x, Θt−1) − Q(Θt−1|x, Θt−1) ≥ 0 by (10).

H(Θt−1|x, Θt−1) − H(Θt|x, Θt−1) ≥ 0 since

Ht−1|x, Θt−1) − H(Θt|x, Θt−1)

=

Z

y∈∆

log p(y|x, Θ)|x, Θt−1)

p(y|x, Θ)|x, Θt) p(y|x, Θt−1)dy

= K(Θt−1, Θt) ≥ 0

(11)

where K(Θt−1, Θt) is the Kullback-Leibler information divergence, always non-negative (see Appendix 1). 

(12)

Scheme of EM

Expectation step: Let Θt−1 be the current estimates of the unknown parameters.

Q(Θ, Θt−1) = E(log p(x, Y|Θ)|x, Θt−1)

= Z

y∈∆

log p(x, y|Θ)p(y|x, Θt−1)dy

= Z

y∈∆

log p(x, y|Θ)p(x, yt−1) p(xt−1) dy

(12)

Maximization step: p(xt−1) is independent of Θ, Θt = argmax

Θ

Q(Θ, Θt−1)

= argmax

Θ

Z

y∈∆

log p(x, y|Θ)p(x, y|Θt−1)dy (13)

See [7] for the convergence properties of the EM algorithm.

(13)

Modified EM

From now on, we will denote

G(Θ, Θt−1) = Z

y∈∆

log p(x, y|Θ)p(x, y|Θt−1)dy (14)

By (12) and (13), contrast to (10) we have Theorem 2 The procedure of

Θt ← argmax

Θ

G(Θ, Θt−1) (15) guarantees that l(Θt|x) ≥ l(Θt−1|x) with equality iff Q(Θt|x, Θt−1) = Q(Θt−1|x, Θt−1), t = 1, 2, · · · .

(14)

Censored Data

Let Xij ∼ N (µi, σ2) be the response r.v. of the jth element among those receiving the ith treatment.

TREATMENTS RESULTS

1 X11 X12 · · · X1n1 2 X21 X22 · · · X2n2

...

k Xk1 Xk2 · · · Xknk

Figure 2: One-way layout with missing data

Problem 2 Suppose that some Xij are unknown, how do you estimate Θ = (µ1, · · · , µk, σ2)?

(15)

EM for Censored Data

1. Denote the unknown Xij by Y, the complete likelihood p(x, y|Θ) is

 1

2πσ



k

P

i=1

ni

exp

( k X

i=1 ni

X

j=1

(xij − µi)2 2

)

(16)

where the unknown xij is written by y’s.

2. Run the EM algorithm with the complete data (X, Y).

(16)

Mixture-density Problem

Problem 3 Given a sample X = (X1, · · · , Xn), consider the mixture density (or frequency)

p(x|Θ) =

m

X

i=1

αipi(x|θi) (17) with identifiable parameters Θ = (θ1, · · · , θm), where αi > 0 are the prior probabilities of each mixture

componenta and Pm

i=1 αi = 1.

Task: Find the MLE of Θ by X.

aLet Y = (Y1, · · · , Yn) be the latent data that Yi = k if the ith sample is generated by the kth mixture component.

(17)

Latent Data of Problem 3

The conditional frequency of Y = (Y1, · · · , Yn) given X = x and Θt−1 is

p(y|x, Θt−1) =

n

Y

i=1

p(yi|xi, Θt−1) (18) where Yi has the frequency table as follows

Yi 1 2 · · · j · · · m P α1 α2 · · · αj · · · αm Figure 3: Frequency table of Yi

(18)

Q (Θ, Θ

t−1

) of Problem 3

The joint density of X and Y given Θ is

p(x, y|Θ) =

n

Y

i=1

p(xi, yiyi)

=

n

Y

i=1

αyipyi(xiyi)

(19)

Given X = x and Θ = Θt−1, by (18) and (19) we have

Q(Θ, Θt−1) = E(log p(x, Y|Θ))

=

m

X

j=1 n

X

i=1

log[αjpj(xij)]p(j|xi, Θt−1) (20)

Another proof of (20) can be found in [2].

(19)

Calculating α

j

To find αj, we introduce the Lagrange multiplier λ with the constraint of Pm

j=1 αj = 1 and solve the following equation:

∂αj

"

Q(Θ, Θt−1) + λ

m

X

j=1

αj − 1

!#

= 0 (21)

or n

X

i=1

p(j|xi, Θt−1) = −λαj (22)

Summing both sides over j, we get λ = −n. Thus,

ˆ

αj = 1 n

n

X

i=1

p(j|xi, Θt−1) (23)

(20)

Hidden Markov Model

Viterbi: Which is the best urn sequence?

Baum-Welch: Which are the best parameters?

where

Figure 4: Urn Model of HMM

(21)

Parameter Estimation of HMM

There are large numbers of papers on HMM and its applications, the early frequently cited paper is [5].

Problem 4 Suppose that the observation sequence O = (O1, · · · , OT ) is given and the hidden state sequence is Q = (Q1, · · · , QT ), where r.v. Ot ranges over values V = {v1, v2, · · · , vm} and r.v. Qt ranges over states S = {1, 2, · · · , n}.

Task: Estimate the parameters λ = (A, B, π) that maximizes P(o, q|λ), where

A = (aij)n×n, where aij = P(Qt = j|Qt−1 = i), is the stochastic transition matrix, satisfying the Markov propertya. Especially, π = (πi)1×n, where πi = P(Q1 = i), are the initial state distributions.

B = (bi)1×n are the observation distributions, where bi(ot) = P(Ot = ot|Qt = i) satisfying that Pm

k=1 bi(k) = 1. We write bi(k) for bi(vk) without confusion.

aFor a random process, the probability of the current state only depends on the former state.

(22)

Strategy of Problem 4

The parameter estimation of HMM is to find λ = argmax

λ P(o, q|λ) (24)

Let λ0 be the current parameters. By (14), we have G(λ, λ0) = X

q∈ST

log P(o, q|λ)P(o, q|λ0) (25)

where ST is the set of all T -length state sequences. By EM method, the strategy of Problem 4 is

1. Calculate G(λ, λ0).

2. Find out argmax G(λ, λ0).

(23)

Estimating Each Parameter

The joint probability of O and Q given λ is

P(o, q|λ) = πq1bq1(o1)

T

Y

i=2

aqt−1qtbqt(ot) (26)

By (26), so (25) turns to

G(λ, λ0) = X

q∈ST

log πq1P(o, q0)+

X

q∈ST

T

X

t=2

log aqt−1qt

!

P(o, q0)+

X

q∈ST

" T X

t=1

log bqt(ot)

#

P(o, q0)

(27)

(24)

Estimating π

The first item in (27) is X

q∈ST

log πq1P(o, q0) =

n

X

i=1

log πiP(o, q1 = i0) (28)

By the method of Lagrange multiplier, solve

∂πi

" n X

i=1

log πiP(o, q1 = i0) + η

n

X

i=1

πi − 1

!#

= 0 (29)

we get the estimates of π = (π1, · · · , πn) ˆ

πi = P(o, q1 = i0)

P(o0) = P(q1 = i|o, λ0) (30) Denote P(qt = i|o, λ0) by γt(i), thus (30) says that ˆπi = γ1(i).

(25)

Estimating A

The second item in (27) is

n

X

i=1 n

X

j=1 T

X

t=2

log aijP(o, qt−1 = i, qt = j0) (31)

By the method of Lagrange multiplier with constraint Pn

j=1 aij = 1, we get the estimates of A = (aij)n×n

ˆ

aij =

T

X

t=2

P(o, qt−1 = i, qt = j0)

T

X

t=2

P(o, qt−1 = i0)

=

T

X

t=2

ξt(i, j)

T

X

t=2

γt(i)

(32)

where ξt(i, j) = P(qt−1 = i, qt = j|o, λ0).

(26)

Estimating B

The third item in (27) is

n

X

i=1 T

X

t=1

log bi(ot)P(o, qt = i0) (33)

By the method of Lagrange multiplier with constraint Pm

k=1 bi(k) = 1, we get the estimates of B = (bi)1×n

ˆbi(k) =

T

X

t=1

P(o, qt = i0ot,vk

T

X

t=1

P(o, qt = i0)

=

T

X

t=1

γt(i)δot,vk

T

X

t=1

γt(i)

(34)

(27)

Baum-Welch Algorithm

initialization : λ0,  // is an experiential threshold value

calculation : λ = ( ˆA, ˆB, ˆπ) where

ˆ aij =

T

X

t=2

ξt(i, j)

T

X

t=2

γt(i)

//where γt(i) = P(qt = i|o, λ0)

//and ξt(i, j) = P(qt−1 = i, qt = j|o, λ0)

ˆbi(k) =

T

X

t=1

γt(i)δot,vk

T

X

t=1

γt(i)

//where δot,vk =

1 if ot = vk 0 otherwise

ˆ

πi = γ1(i)

condition : if

log P(o, q|λ) P(o, q|λ0)

<  end

goto : otherwise, let λ0 = λ goto calculation

(28)

Some Notes

See [8] for the intuitive explanation of HMM.

1. aˆij = the expected number from urn i to urn j the expected number away from urn i

2. ˆbi(k) = the expected number of urn i observing vk the expected number of urn i

3. πˆi = the expected relative frequency of urn i at time 1.

4. Imputing to EM, the Baum-Welch Algorithm just reaches the locally optimal solution. By Hill-climbing method, usually we can get the optimal solution.

5. HMM works well in Speech Recognition, but a little bad in Chinese NLP. In practice, the assumption of Markov

property sounds unreasonable sometimes (e.g., independent words in variable distance).

(29)

Further Readings

Chapter 2 of P. J. Bickel’s excellent book [1].

In [6], M. A. Tanner summarized the up-to-date Bayesian approaches to the exploration of posterior distributions and likelihood functions. The main topics of [6] include

1. Normal/nonnormal approximations to likelihoods and posteriors 2. The EM Algorithm, the Data Augmentation Algorithm, and

3. Markov Chain Monte Carlo (MCMC) methods: the Gibbs Sampler and the Metropolis Algorithm

J. A. Bilmes discussed the problem of Gaussian mixture and the

parameter estimation of HMM (i.e. Baum-Welch algorithm) in [2], a detailed guide of the applications of EM algorithm.

T. Mitchell’s Machine Learning also gives a brief but legible

introduction to EM (see Chapter 6 — Bayesian Learning). Moreover, he introduced EM within exponential family in [4] briefly.

(30)

Conclusion

Although EM method is powerful in many cases, but we still have the following difficulties:

1. EM is a method of MLE based on the complete data, which assumes that the distribution family is given. But in practice, this assumption is

usually inaccessible.

2. It is difficult to validate the optimal solution of EM method.

3. Maximum likelihood estimates need neither exist nor be unique. The same is EM.

4. Large sample is still necessary, although it is not the fault of EM.

(31)

Appendix 1

Definition 6 If p(x) is a density function, the Kullback-Leibler information divergencea is defined by

K(θ, η) = Eθ



log p(x|θ) p(x|η)



=

Z

−∞

log p(x|θ)

p(x|η)p(x|θ)dx

(35)

Theorem 3 K(θ, η) ≥ 0 with equality iff p(x|θ) = p(x|η).

Proof By the facts of log p(x|θ)

p(x|η) = − log



1 + p(x|η) − p(x|θ) p(x|θ)



and log(1 + x) < x for all x > −1, x 6= 0. 

aSee pp116 of [1] for the details of its application to MLE.

(32)

Acknowledgement

1. The report was done when I taught AI at Shihezi University, Xinjiang. Many thanks to my colleagues at the School of Information Engineering here, who made me homefelt.

2. Thank the leaders of the school that provided me a chance to give a series of talks on Machine Learning.

3. I cannot be thankful enough to my friends Prof. Mei Changlin and Prof. Wang Hanpin. It’s Prof. Mei that

introduced his researches on Nonparametric Regression to me. And Prof. Wang showed me the state-of-the-art of

Formal Semantics. It is a great pleasure to discuss the

problems in Mathematical Statistics and Computer Science with them during our walk after dinner.

(33)

References

1. P. J. Bickel and K. A. Doksum (2001), Mathematical Statistics — Basic Ideas and Selected Topics (Second Edition). Prentice-Hall, Inc.

2. J. A. Bilmes (1998), A General Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.

3. A. P. Dempster, N. M. Laird and D. B. Rubin (1977), Maximum-likelihood from Incomplete Data via EM Algorithm. J. Royal Statist. Soc. Ser. B., 39.

4. T. Mitchell (2003), Statistical Approaches to Learning and Discovery. The course of Machine Learning at CMU.

5. L. R. Rabiner (1989), A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, Vol. 77, pp257-286.

6. M. A. Tanner (1996), Tools for Statistical Inference — Methods for the Exploration of Posterior Distributions and Likelihood Functions (Third Edition). Springer-Verlag New York, Inc.

7. C. F. J. Wu (1983), On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1), pp95-103.

8. J. S. Yu (2002), On Hidden Markov Model. Seminar report of Machine Learning at Peking University, http://icl.pku.edu.cn/yujs.

(34)

Thank you

for your attention!

數據

Figure 1: Founders of MLE method
Figure 2: One-way layout with missing data
Figure 4: Urn Model of HMM

參考文獻

相關文件

To date we had used PSO and successfully found optimal designs for experiments up to 8 factors for a mixture model, nonlinear models up to 6 parameters and also for more involved

This theorem does not establish the existence of a consis- tent estimator sequence since, with the true value θ 0 unknown, the data do not tell us which root to choose so as to obtain

vs Functional grammar (i.e. organising grammar items according to the communicative functions) at the discourse level2. “…a bridge between

11 (1998) 227–251] for the nonnegative orthant complementarity problem to the general symmet- ric cone complementarity problem (SCCP). We show that the class of merit functions

For the proposed algorithm, we establish a global convergence estimate in terms of the objective value, and moreover present a dual application to the standard SCLP, which leads to

For the proposed algorithm, we establish its convergence properties, and also present a dual application to the SCLP, leading to an exponential multiplier method which is shown

Since the subsequent steps of Gaussian elimination mimic the first, except for being applied to submatrices of smaller size, it suffices to conclude that Gaussian elimination

Since the subsequent steps of Gaussian elimination mimic the first, except for being applied to submatrices of smaller size, it suffices to conclude that Gaussian elimination