Problem of Gaussian Mixture

(1)

Expectation Maximization

An Approach to Parameter Estimation

Jiangsheng Yu

School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871

School of Information Engineering, Shihezi University, Xinjiang

[email protected], http://icl.pku.edu.cn/yujs

(2)

Topics

1. Problem of Gaussian mixture

2. Basic ideas of maximum likelihood estimate (MLE)^a and expectation maximization (EM)

3. Applications of the EM algorithm to

(a) Censored data

(b) Parameter estimation of mixture-density

(c) Hidden Markov Model (HMM)

4. Further readings and conclusion

5. Appendix

6. References

^a See Chapter 2 of [1], which also introduces the EM algorithm.

(3)

Problem of Gaussian Mixture

Problem 1 Sample $X = (X_1, X_2, \cdots, X_n)$ is from an $\alpha_1 N(\mu_1, \sigma_1^2) + \alpha_2 N(\mu_2, \sigma_2^2)$ population, where $\alpha_1 + \alpha_2 = 1$ and $0 \le \alpha_1, \alpha_2 \le 1$. How do we estimate the parameters $(\mu_i, \sigma_i^2)$ and the coefficients $\alpha_i$?

Example 1 Gaussian mixture:

(4)

Simulation

A nonparametric density estimate of 1,000 simulated data points generated from the distribution $0.4N(3, 1) + 0.6N(-2, 4)$, computed with S-PLUS 2000.

[Figure: estimated density curve titled "Gaussian mixture", 0.4*N(3,1)+0.6*N(-2,4); horizontal axis from -8 to 6, vertical axis from 0.00 to 0.14]
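The simulation can be reproduced roughly as follows. This is a minimal sketch, not the original S-PLUS code: it assumes that the second argument of $N(\cdot,\cdot)$ is the variance (so the second component has standard deviation 2), and the Gaussian kernel, the bandwidth of 0.5, the seed and the helper names `sample_mixture` and `kde` are all illustrative choices.

```python
import math
import random

def sample_mixture(n, rng):
    """Draw n points from 0.4*N(3, 1) + 0.6*N(-2, 4), reading N(mean, variance)."""
    data = []
    for _ in range(n):
        if rng.random() < 0.4:
            data.append(rng.gauss(3.0, 1.0))   # standard deviation sqrt(1)
        else:
            data.append(rng.gauss(-2.0, 2.0))  # standard deviation sqrt(4)
    return data

def kde(data, x, bandwidth):
    """Gaussian kernel density estimate at the point x."""
    total = sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return total / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
data = sample_mixture(1000, rng)
grid = [-8 + 0.5 * i for i in range(29)]              # -8, -7.5, ..., 6
density = [kde(data, x, bandwidth=0.5) for x in grid]
```

Plotting `density` against `grid` should give a two-humped curve similar to the figure, with the exact shape depending on the bandwidth.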

(5)

MLE

Definition 1 Sample $X = (X_1, X_2, \cdots, X_n)$ is from the population with density function $p(x|\Theta)$.

likelihood function: the density of $X$ given $\Theta$

$$L(\Theta|x) = p(x|\Theta) = \prod_{i=1}^{n} p(x_i|\Theta) \tag{1}$$

log-likelihood function: $l(\Theta|x) = \log L(\Theta|x)$

Definition 2 The MLE of $\Theta$ is

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} L(\Theta|x) = \operatorname*{argmax}_{\Theta} l(\Theta|x) \tag{2}$$
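As a small illustration of Definitions 1 and 2, the sketch below evaluates the log-likelihood of a single normal population and its closed-form maximizer (sample mean and the variance with divisor n). The data values and the function names are hypothetical and serve only to show the mechanics.

```python
import math

def log_likelihood(xs, mu, var):
    """l(Theta|x) = sum_i log p(x_i|Theta) for a N(mu, var) density."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * var))

def normal_mle(xs):
    """Closed-form maximizer: sample mean and variance with divisor n."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

xs = [2.1, 3.4, 2.9, 3.8, 2.5]          # hypothetical sample
mu_hat, var_hat = normal_mle(xs)
best = log_likelihood(xs, mu_hat, var_hat)
```

Moving either parameter away from `(mu_hat, var_hat)` can only lower the value returned by `log_likelihood`, which is what the argmax in (2) expresses.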

(6)

Basic Idea of MLE

God always lets the event with the biggest probability happen first: the MLE of Θ is the value that makes the observed sample most likely to occur.

C. F. Gauss (1777-1855) R. A. Fisher (1890-1962)

Figure 1: Founders of MLE method

(7)

Complete Data

Definition 3 The sample $X = (X_1, \cdots, X_n)$ together with the missing (or latent) data $Y$ is called the complete data. For instance, in Example 1, $Y \sim B(1, \alpha_1)$.

Definition 4 The complete likelihood is

$$L(\Theta|x, y) = p(x, y|\Theta) \tag{3}$$

where $p(x, y|\Theta)$ is the joint density of $X$ and $Y$ given the parameter $\Theta$.

Definition 5 The complete log-likelihood is

$$l(\Theta|x, y) = \log L(\Theta|x, y) \tag{4}$$

(8)

Complete MLE

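By the definition of conditional density,

$$p(x|\Theta) = \frac{p(x, y|\Theta)}{p(y|x, \Theta)} \tag{5}$$

where $p(y|x, \Theta)$ is the conditional density of $Y$ given $X = x$ and $\Theta$. By (2) we have the complete MLE

$$\hat{\Theta} = \operatorname*{argmax}_{\Theta} l(\Theta|x) = \operatorname*{argmax}_{\Theta} \left[\log p(x, y|\Theta) - \log p(y|x, \Theta)\right] = \operatorname*{argmax}_{\Theta} \left[l(\Theta|x, y) - \log p(y|x, \Theta)\right] \tag{6}$$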

(9)

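MSE Predictor of $l(\Theta|x)$

Given $X = x$ and $\Theta = \Theta_{t-1}$, where $\Theta_{t-1}$ is the current estimate of the unknown parameters,

• $\log p(x, Y|\Theta)$ is a function of $Y$ whose unique best Mean Squared Error (MSE) predictor is

$$Q(\Theta, \Theta_{t-1}) = E(\log p(x, Y|\Theta) \mid x, \Theta_{t-1}) \tag{7}$$

• The MSE predictor of $\log p(Y|x, \Theta)$ is

$$H(\Theta, \Theta_{t-1}) = E(\log p(Y|x, \Theta) \mid x, \Theta_{t-1}) \tag{8}$$

Then we get the MSE predictor of $l(\Theta|x)$, which is exact because $l(\Theta|x)$ does not depend on $Y$:

$$l(\Theta|x) = Q(\Theta, \Theta_{t-1}) - H(\Theta, \Theta_{t-1}) \tag{9}$$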
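The step from (5) to (9) can be written out as a short worked derivation. It uses only the definitions (5), (7) and (8), plus the fact that $p(y|x, \Theta_{t-1})$ integrates to one over $y$; it suggests that the relation in (9) holds with equality, which is also how it is used in the proof of Theorem 1.

```latex
\begin{align*}
l(\Theta|x) &= \log p(x, y|\Theta) - \log p(y|x, \Theta)
  && \text{for every } y \text{, by (5)} \\
l(\Theta|x) &= \int_{y} \bigl[\log p(x, y|\Theta) - \log p(y|x, \Theta)\bigr]\,
  p(y|x, \Theta_{t-1})\, dy
  && \text{since } \int_{y} p(y|x, \Theta_{t-1})\, dy = 1 \\
&= E(\log p(x, Y|\Theta) \mid x, \Theta_{t-1})
  - E(\log p(Y|x, \Theta) \mid x, \Theta_{t-1}) \\
&= Q(\Theta, \Theta_{t-1}) - H(\Theta, \Theta_{t-1})
  && \text{by (7) and (8)}
\end{align*}
```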

(10)

Basic Idea of EM

Theorem 1 The procedure

$$\Theta_t \leftarrow \operatorname*{argmax}_{\Theta} Q(\Theta, \Theta_{t-1}) \tag{10}$$

guarantees that $l(\Theta_t|x) \ge l(\Theta_{t-1}|x)$, with equality iff $Q(\Theta_t, \Theta_{t-1}) = Q(\Theta_{t-1}, \Theta_{t-1})$, $t = 1, 2, \cdots$.

Note Repeat (10) until a maximal point of $l(\Theta|x)$ is reached, then choose another initial estimate of $\Theta$ at random and repeat the EM procedure. After sufficiently many restarts, $\operatorname*{argmax}_{\Theta} l(\Theta|x)$ will be found with high probability.

(11)

Proof of Theorem

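$$l(\Theta_t|x) - l(\Theta_{t-1}|x) = \left[Q(\Theta_t, \Theta_{t-1}) - Q(\Theta_{t-1}, \Theta_{t-1})\right] + \left[H(\Theta_{t-1}, \Theta_{t-1}) - H(\Theta_t, \Theta_{t-1})\right]$$

• $Q(\Theta_t, \Theta_{t-1}) - Q(\Theta_{t-1}, \Theta_{t-1}) \ge 0$ by (10).

• $H(\Theta_{t-1}, \Theta_{t-1}) - H(\Theta_t, \Theta_{t-1}) \ge 0$ since

$$H(\Theta_{t-1}, \Theta_{t-1}) - H(\Theta_t, \Theta_{t-1}) = \int_{y \in \Delta} \log \frac{p(y|x, \Theta_{t-1})}{p(y|x, \Theta_t)}\, p(y|x, \Theta_{t-1})\, dy = K(\Theta_{t-1}, \Theta_t) \ge 0 \tag{11}$$

where $K(\Theta_{t-1}, \Theta_t)$ is the Kullback-Leibler information divergence, always non-negative (see Appendix 1).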

(12)

Scheme of EM

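Expectation step: Let $\Theta_{t-1}$ be the current estimate of the unknown parameters.

$$Q(\Theta, \Theta_{t-1}) = E(\log p(x, Y|\Theta) \mid x, \Theta_{t-1}) = \int_{y \in \Delta} \log p(x, y|\Theta)\, p(y|x, \Theta_{t-1})\, dy = \int_{y \in \Delta} \log p(x, y|\Theta)\, \frac{p(x, y|\Theta_{t-1})}{p(x|\Theta_{t-1})}\, dy \tag{12}$$

Maximization step: since $p(x|\Theta_{t-1})$ is independent of $\Theta$,

$$\Theta_t = \operatorname*{argmax}_{\Theta} Q(\Theta, \Theta_{t-1}) = \operatorname*{argmax}_{\Theta} \int_{y \in \Delta} \log p(x, y|\Theta)\, p(x, y|\Theta_{t-1})\, dy \tag{13}$$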

See [7] for the convergence properties of the EM algorithm.

(13)

Modified EM

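From now on, we denote

$$G(\Theta, \Theta_{t-1}) = \int_{y \in \Delta} \log p(x, y|\Theta)\, p(x, y|\Theta_{t-1})\, dy \tag{14}$$

By (12) and (13), in contrast to (10), we have

Theorem 2 The procedure

$$\Theta_t \leftarrow \operatorname*{argmax}_{\Theta} G(\Theta, \Theta_{t-1}) \tag{15}$$

guarantees that $l(\Theta_t|x) \ge l(\Theta_{t-1}|x)$, with equality iff $Q(\Theta_t, \Theta_{t-1}) = Q(\Theta_{t-1}, \Theta_{t-1})$, $t = 1, 2, \cdots$.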

(14)

Censored Data

Let $X_{ij} \sim N(\mu_i, \sigma^2)$ be the response r.v. of the $j$-th element among those receiving the $i$-th treatment.

TREATMENTS    RESULTS
1             $X_{11}$  $X_{12}$  $\cdots$  $X_{1n_1}$
2             $X_{21}$  $X_{22}$  $\cdots$  $X_{2n_2}$
...           ...
k             $X_{k1}$  $X_{k2}$  $\cdots$  $X_{kn_k}$

Figure 2: One-way layout with missing data

Problem 2 Suppose that some $X_{ij}$ are unknown. How do we estimate $\Theta = (\mu_1, \cdots, \mu_k, \sigma^2)$?

(15)

EM for Censored Data

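1. Denote the unknown $X_{ij}$ by $Y$. The complete likelihood $p(x, y|\Theta)$ is

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\sum_{i=1}^{k} n_i} \exp\left\{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{-(x_{ij} - \mu_i)^2}{2\sigma^2}\right\} \tag{16}$$

where the unknown $x_{ij}$ are written as $y$'s.

2. Run the EM algorithm with the complete data $(X, Y)$.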
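The two steps above can be sketched in code under one explicit assumption: the unknown $X_{ij}$ are treated as simply missing (unrelated to their values), so that under the current estimates each unknown value has mean $\mu_i$ and second moment $\mu_i^2 + \sigma^2$. The function name `em_one_way`, the starting values and the data are hypothetical. Each group needs at least one observed value.

```python
def em_one_way(groups, n_iter=200):
    """groups: list of lists of responses; None marks an unknown X_ij.

    Returns the estimates (mus, var) of (mu_1, ..., mu_k, sigma^2).
    """
    total = sum(len(g) for g in groups)
    mus = [0.0] * len(groups)   # deliberately crude starting values
    var = 1.0
    for _ in range(n_iter):
        new_mus = []
        ss = 0.0
        for i, g in enumerate(groups):
            obs = [x for x in g if x is not None]
            n_missing = len(g) - len(obs)
            # E-step: each unknown value contributes its expectation mu_i.
            mu_new = (sum(obs) + n_missing * mus[i]) / len(g)
            # Expected squared deviation about the new mean:
            # observed terms, plus (mu_old - mu_new)^2 + var per unknown.
            ss += sum((x - mu_new) ** 2 for x in obs)
            ss += n_missing * ((mus[i] - mu_new) ** 2 + var)
            new_mus.append(mu_new)
        # M-step: maximize the expected complete log-likelihood from (16).
        mus, var = new_mus, ss / total
    return mus, var

groups = [[5.1, 4.9, None, 5.3], [7.2, None, None, 6.8, 7.0], [3.0, 3.4, 3.2]]
mus, var = em_one_way(groups)
```

Under this assumption the iteration settles on the group means of the observed values and on the pooled within-group sum of squares divided by the number of observed values, which is a useful check on the sketch.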

(16)

Mixture-density Problem

Problem 3 Given a sample $X = (X_1, \cdots, X_n)$, consider the mixture density (or frequency)

$$p(x|\Theta) = \sum_{i=1}^{m} \alpha_i\, p_i(x|\theta_i) \tag{17}$$

with identifiable parameters $\Theta = (\theta_1, \cdots, \theta_m)$, where $\alpha_i > 0$ are the prior probabilities of each mixture component^a and $\sum_{i=1}^{m} \alpha_i = 1$.

Task: Find the MLE of $\Theta$ by $X$.

^a Let $Y = (Y_1, \cdots, Y_n)$ be the latent data such that $Y_i = k$ if the $i$-th sample is generated by the $k$-th mixture component.

(17)

Latent Data of Problem 3

The conditional frequency of $Y = (Y_1, \cdots, Y_n)$ given $X = x$ and $\Theta_{t-1}$ is

$$p(y|x, \Theta_{t-1}) = \prod_{i=1}^{n} p(y_i|x_i, \Theta_{t-1}) \tag{18}$$

where $Y_i$ has the following frequency table:

$Y_i$    1           2           $\cdots$   j           $\cdots$   m
P        $\alpha_1$  $\alpha_2$  $\cdots$   $\alpha_j$  $\cdots$   $\alpha_m$

Figure 3: Frequency table of $Y_i$

(18)

$Q(\Theta, \Theta_{t-1})$ of Problem 3

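The joint density of $X$ and $Y$ given $\Theta$ is

$$p(x, y|\Theta) = \prod_{i=1}^{n} p(x_i, y_i|\theta_{y_i}) = \prod_{i=1}^{n} \alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i}) \tag{19}$$

Given $X = x$ and $\Theta = \Theta_{t-1}$, by (18) and (19) we have

$$Q(\Theta, \Theta_{t-1}) = E(\log p(x, Y|\Theta) \mid x, \Theta_{t-1}) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left[\alpha_j\, p_j(x_i|\theta_j)\right] p(j|x_i, \Theta_{t-1}) \tag{20}$$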

Another proof of (20) can be found in [2].

(19)

Calculating $\alpha_j$

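To find $\alpha_j$, we introduce the Lagrange multiplier $\lambda$ with the constraint $\sum_{j=1}^{m} \alpha_j = 1$ and solve the following equation:

$$\frac{\partial}{\partial \alpha_j} \left[ Q(\Theta, \Theta_{t-1}) + \lambda \left( \sum_{j=1}^{m} \alpha_j - 1 \right) \right] = 0 \tag{21}$$

or

$$\sum_{i=1}^{n} p(j|x_i, \Theta_{t-1}) = -\lambda \alpha_j \tag{22}$$

Summing both sides over $j$, we get $\lambda = -n$. Thus,

$$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} p(j|x_i, \Theta_{t-1}) \tag{23}$$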
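For the Gaussian mixture of Problem 1, one EM iteration might look like the sketch below. The weight update is equation (23). The updates for the means and variances are the standard weighted-average formulas for normal components, which are not derived in these slides and are included here as an assumption. The posterior $p(j|x_i, \Theta_{t-1})$ is obtained by Bayes' rule, and the starting values, seed and sample size are arbitrary.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, alphas, mus, vars_):
    """l(Theta|x) for the mixture density (17) with normal components."""
    return sum(math.log(sum(a * normal_pdf(x, m, v)
                            for a, m, v in zip(alphas, mus, vars_))) for x in xs)

def em_step(xs, alphas, mus, vars_):
    m, n = len(alphas), len(xs)
    # E-step: resp[i][j] = p(j | x_i, Theta_{t-1}) by Bayes' rule.
    resp = []
    for x in xs:
        w = [a * normal_pdf(x, mu, v) for a, mu, v in zip(alphas, mus, vars_)]
        s = sum(w)
        resp.append([wj / s for wj in w])
    # M-step.
    new_alphas, new_mus, new_vars = [], [], []
    for j in range(m):
        nj = sum(resp[i][j] for i in range(n))
        new_alphas.append(nj / n)                                   # equation (23)
        mu = sum(resp[i][j] * xs[i] for i in range(n)) / nj         # assumed standard update
        var = sum(resp[i][j] * (xs[i] - mu) ** 2 for i in range(n)) / nj
        new_mus.append(mu)
        new_vars.append(var)
    return new_alphas, new_mus, new_vars

rng = random.Random(1)
xs = [rng.gauss(3.0, 1.0) if rng.random() < 0.4 else rng.gauss(-2.0, 2.0)
      for _ in range(500)]
params = ([0.5, 0.5], [1.0, -1.0], [1.0, 1.0])   # arbitrary initial Theta_0
history = [log_likelihood(xs, *params)]
for _ in range(50):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

The list `history` records $l(\Theta_t|x)$ after each iteration, so the monotonicity stated in Theorem 1 can be checked directly on it.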

(20)

Hidden Markov Model

Viterbi: Which is the best urn sequence?

Baum-Welch: Which are the best parameters?

Figure 4: Urn Model of HMM

(21)

Parameter Estimation of HMM

There is a large number of papers on HMM and its applications; an early and frequently cited one is [5].

Problem 4 Suppose that the observation sequence $O = (O_1, \cdots, O_T)$ is given and the hidden state sequence is $Q = (Q_1, \cdots, Q_T)$, where r.v. $O_t$ ranges over values $V = \{v_1, v_2, \cdots, v_m\}$ and r.v. $Q_t$ ranges over states $S = \{1, 2, \cdots, n\}$.

Task: Estimate the parameters $\lambda = (A, B, \pi)$ that maximize $P(o, q|\lambda)$, where

• $A = (a_{ij})_{n \times n}$, where $a_{ij} = P(Q_t = j|Q_{t-1} = i)$, is the stochastic transition matrix, satisfying the Markov property^a. In particular, $\pi = (\pi_i)_{1 \times n}$, where $\pi_i = P(Q_1 = i)$, is the initial state distribution.

• $B = (b_i)_{1 \times n}$ are the observation distributions, where $b_i(o_t) = P(O_t = o_t|Q_t = i)$, satisfying $\sum_{k=1}^{m} b_i(k) = 1$. We write $b_i(k)$ for $b_i(v_k)$ without confusion.

^a For a random process, the probability of the current state depends only on the previous state.

(22)

Strategy of Problem 4

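The parameter estimation of HMM is to find

$$\lambda^{*} = \operatorname*{argmax}_{\lambda} P(o, q|\lambda) \tag{24}$$

Let $\lambda^0$ be the current parameters. By (14), we have

$$G(\lambda, \lambda^0) = \sum_{q \in S_T^{*}} \log P(o, q|\lambda)\, P(o, q|\lambda^0) \tag{25}$$

where $S_T^{*}$ is the set of all state sequences of length $T$. By the EM method, the strategy for Problem 4 is

1. Calculate $G(\lambda, \lambda^0)$.

2. Find $\operatorname*{argmax}_{\lambda} G(\lambda, \lambda^0)$.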

(23)

Estimating Each Parameter

The joint probability of $O$ and $Q$ given $\lambda$ is

$$P(o, q|\lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \tag{26}$$
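Equation (26) can be evaluated directly, and summing it over every state sequence in $S_T^{*}$ gives $P(o|\lambda)$. The sketch below does this by brute force for a tiny model. The two-urn parameters and the observation sequence are hypothetical, states and symbols are 0-based indices, and the enumeration is only practical for very short sequences.

```python
from itertools import product

def joint_prob(o, q, pi, A, B):
    """P(o, q | lambda) as in equation (26)."""
    p = pi[q[0]] * B[q[0]][o[0]]
    for t in range(1, len(o)):
        p *= A[q[t - 1]][q[t]] * B[q[t]][o[t]]
    return p

# Hypothetical two-urn model with two observable symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
o = [0, 1, 1]

# Sum over all state sequences of length T to obtain P(o | lambda).
p_o = sum(joint_prob(o, q, pi, A, B)
          for q in product(range(len(pi)), repeat=len(o)))
```

The number of terms in the sum is $n^T$, which is why practical implementations rely on the forward-backward recursions rather than on enumeration.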

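By (26), (25) becomes

$$G(\lambda, \lambda^0) = \sum_{q \in S_T^{*}} \log \pi_{q_1}\, P(o, q|\lambda^0) + \sum_{q \in S_T^{*}} \left( \sum_{t=2}^{T} \log a_{q_{t-1} q_t} \right) P(o, q|\lambda^0) + \sum_{q \in S_T^{*}} \left[ \sum_{t=1}^{T} \log b_{q_t}(o_t) \right] P(o, q|\lambda^0) \tag{27}$$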

(24)

Estimating π

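The first term in (27) is

$$\sum_{q \in S_T^{*}} \log \pi_{q_1}\, P(o, q|\lambda^0) = \sum_{i=1}^{n} \log \pi_i\, P(o, q_1 = i|\lambda^0) \tag{28}$$

By the method of Lagrange multipliers, solving

$$\frac{\partial}{\partial \pi_i} \left[ \sum_{i=1}^{n} \log \pi_i\, P(o, q_1 = i|\lambda^0) + \eta \left( \sum_{i=1}^{n} \pi_i - 1 \right) \right] = 0 \tag{29}$$

we get the estimates of $\pi = (\pi_1, \cdots, \pi_n)$:

$$\hat{\pi}_i = \frac{P(o, q_1 = i|\lambda^0)}{P(o|\lambda^0)} = P(q_1 = i|o, \lambda^0) \tag{30}$$

Denote $P(q_t = i|o, \lambda^0)$ by $\gamma_t(i)$; thus (30) says that $\hat{\pi}_i = \gamma_1(i)$.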

(25)

Estimating A

The second term in (27) is

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{t=2}^{T} \log a_{ij}\, P(o, q_{t-1} = i, q_t = j|\lambda^0) \tag{31}$$

By the method of Lagrange multipliers with the constraint $\sum_{j=1}^{n} a_{ij} = 1$, we get the estimates of $A = (a_{ij})_{n \times n}$:

$$\hat{a}_{ij} = \frac{\sum_{t=2}^{T} P(o, q_{t-1} = i, q_t = j|\lambda^0)}{\sum_{t=2}^{T} P(o, q_{t-1} = i|\lambda^0)} = \frac{\sum_{t=2}^{T} \xi_t(i, j)}{\sum_{t=2}^{T} \gamma_{t-1}(i)} \tag{32}$$

where $\xi_t(i, j) = P(q_{t-1} = i, q_t = j|o, \lambda^0)$.

(26)

Estimating B

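The third term in (27) is

$$\sum_{i=1}^{n} \sum_{t=1}^{T} \log b_i(o_t)\, P(o, q_t = i|\lambda^0) \tag{33}$$

By the method of Lagrange multipliers with the constraint $\sum_{k=1}^{m} b_i(k) = 1$, we get the estimates of $B = (b_i)_{1 \times n}$:

$$\hat{b}_i(k) = \frac{\sum_{t=1}^{T} P(o, q_t = i|\lambda^0)\, \delta_{o_t, v_k}}{\sum_{t=1}^{T} P(o, q_t = i|\lambda^0)} = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \delta_{o_t, v_k}}{\sum_{t=1}^{T} \gamma_t(i)} \tag{34}$$

where $\delta_{o_t, v_k} = 1$ if $o_t = v_k$ and $0$ otherwise.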

(27)

Baum-Welch Algorithm

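initialization: $\lambda^0$, $\varepsilon$ // $\varepsilon$ is an empirical threshold value

calculation: $\lambda = (\hat{A}, \hat{B}, \hat{\pi})$, where

$$\hat{a}_{ij} = \frac{\sum_{t=2}^{T} \xi_t(i, j)}{\sum_{t=2}^{T} \gamma_{t-1}(i)}$$

// where $\gamma_t(i) = P(q_t = i|o, \lambda^0)$

// and $\xi_t(i, j) = P(q_{t-1} = i, q_t = j|o, \lambda^0)$

$$\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \delta_{o_t, v_k}}{\sum_{t=1}^{T} \gamma_t(i)}$$

// where $\delta_{o_t, v_k} = 1$ if $o_t = v_k$ and $0$ otherwise

$$\hat{\pi}_i = \gamma_1(i)$$

condition: if $\log \dfrac{P(o, q|\lambda)}{P(o, q|\lambda^0)} < \varepsilon$, end

goto: otherwise, let $\lambda^0 = \lambda$ and go to calculation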
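The loop above can be sketched for a single observation sequence as follows. Several points are assumptions rather than part of the slides: $\gamma_t(i)$ and $\xi_t(i,j)$ are computed with the standard forward-backward recursions, the stopping rule monitors $\log P(o|\lambda)$ (which the forward pass yields) instead of $P(o, q|\lambda)$, no numerical scaling is applied (so only short sequences are safe), and the model and data are hypothetical. States and symbols are 0-based indices.

```python
import math

def forward_backward(o, pi, A, B):
    """Return gamma[t][i], xi[t][i][j] for transitions t-1 -> t, and P(o | lambda)."""
    n, T = len(pi), len(o)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][o[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][o[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][o[t + 1]] * beta[t + 1][j] for j in range(n))
    p_o = sum(alpha[T - 1])
    gamma = [[alpha[t][i] * beta[t][i] / p_o for i in range(n)] for t in range(T)]
    xi = [None] + [
        [[alpha[t - 1][i] * A[i][j] * B[j][o[t]] * beta[t][j] / p_o
          for j in range(n)] for i in range(n)]
        for t in range(1, T)
    ]
    return gamma, xi, p_o

def baum_welch_step(o, pi, A, B):
    """One 'calculation' step; also returns P(o | lambda^0) for the input parameters."""
    n, m, T = len(pi), len(B[0]), len(o)
    gamma, xi, p_o = forward_backward(o, pi, A, B)
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(1, T)) /
              sum(gamma[t - 1][i] for t in range(1, T))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if o[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, p_o

def baum_welch(o, pi, A, B, eps=1e-6, max_iter=100):
    log_p = []
    for _ in range(max_iter):
        pi, A, B, p_o = baum_welch_step(o, pi, A, B)
        log_p.append(math.log(p_o))
        if len(log_p) > 1 and log_p[-1] - log_p[-2] < eps:
            break
    return pi, A, B, log_p

o = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]       # hypothetical data
pi_hat, A_hat, B_hat, log_p = baum_welch(
    o, [0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.4, 0.6]])
```

The list `log_p` holds the log-likelihood of the parameters entering each iteration, so it should never decrease. Each row of `A_hat` and `B_hat` sums to one because the denominators match the numerators summed over $j$ and $k$ respectively.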

(28)

Some Notes

See [8] for the intuitive explanation of HMM.

1. $\hat{a}_{ij}$ = (the expected number of transitions from urn $i$ to urn $j$) / (the expected number of transitions away from urn $i$)

2. $\hat{b}_i(k)$ = (the expected number of times urn $i$ is visited while observing $v_k$) / (the expected number of times urn $i$ is visited)

3. $\hat{\pi}_i$ = the expected relative frequency of urn $i$ at time 1.

4. Being an instance of EM, the Baum-Welch algorithm only reaches a locally optimal solution. Using a hill-climbing method, we can usually get the optimal solution.

5. HMM works well in speech recognition, but somewhat poorly in Chinese NLP. In practice, the Markov assumption is sometimes unreasonable (e.g., independent words at variable distance).

(29)

Conclusion

Although the EM method is powerful in many cases, we still face the following difficulties:

1. EM is a method of MLE based on the complete data, which assumes that the distribution family is given. In practice, this assumption is usually hard to satisfy.

2. It is difficult to validate the optimal solution of the EM method.

3. Maximum likelihood estimates need neither exist nor be unique. The same holds for EM.

4. A large sample is still necessary, although this is not the fault of EM.

(31)

Appendix 1

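Definition 6 If $p(x|\theta)$ is a density function, the Kullback-Leibler information divergence^a is defined by

$$K(\theta, \eta) = E_{\theta}\left[\log \frac{p(x|\theta)}{p(x|\eta)}\right] = \int_{-\infty}^{\infty} \log \frac{p(x|\theta)}{p(x|\eta)}\, p(x|\theta)\, dx \tag{35}$$

Theorem 3 $K(\theta, \eta) \ge 0$ with equality iff $p(x|\theta) = p(x|\eta)$.

Proof By the facts that

$$\log \frac{p(x|\theta)}{p(x|\eta)} = -\log\left[1 + \frac{p(x|\eta) - p(x|\theta)}{p(x|\theta)}\right]$$

and $\log(1 + x) < x$ for all $x > -1$, $x \ne 0$. Hence $K(\theta, \eta) \ge -\int \left[p(x|\eta) - p(x|\theta)\right] dx = 0$, with equality iff the two densities coincide.

^a See p. 116 of [1] for the details of its application to MLE.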
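Theorem 3 can be checked numerically for two normal densities. The sketch below approximates the integral in (35) with a midpoint rule and compares it with the known closed form for univariate normals. The integration range, the step count and the choice of the two densities (the components from the simulation example) are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def kl_numeric(mu1, var1, mu2, var2, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of (35) with p(x|theta) = N(mu1, var1)
    and p(x|eta) = N(mu2, var2)."""
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * h
        p = normal_pdf(x, mu1, var1)
        q = normal_pdf(x, mu2, var2)
        if p > 0.0:
            total += math.log(p / q) * p * h
    return total

def kl_closed_form(mu1, var1, mu2, var2):
    """Closed form of K for two univariate normal densities."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

k_num = kl_numeric(3.0, 1.0, -2.0, 4.0)
k_exact = kl_closed_form(3.0, 1.0, -2.0, 4.0)
k_same = kl_numeric(0.0, 1.0, 0.0, 1.0)
```

Here `k_num` and `k_exact` should agree closely and be positive, while `k_same` should be zero, matching the equality case of the theorem.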

(32)

Acknowledgement

1. This report was written while I taught AI at Shihezi University, Xinjiang. Many thanks to my colleagues at the School of Information Engineering here, who made me feel at home.

2. Thanks to the leaders of the school, who gave me the chance to give a series of talks on Machine Learning.

3. I cannot be thankful enough to my friends Prof. Mei Changlin and Prof. Wang Hanpin. It was Prof. Mei who introduced his research on Nonparametric Regression to me, and Prof. Wang showed me the state of the art of Formal Semantics. It is a great pleasure to discuss problems in Mathematical Statistics and Computer Science with them during our walks after dinner.

(33)

References

1. P. J. Bickel and K. A. Doksum (2001), Mathematical Statistics — Basic Ideas and Selected Topics (Second Edition). Prentice-Hall, Inc.

2. J. A. Bilmes (1998), A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.

3. A. P. Dempster, N. M. Laird and D. B. Rubin (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Royal Statist. Soc. Ser. B, 39(1), pp. 1-38.

4. T. Mitchell (2003), Statistical Approaches to Learning and Discovery. The course of Machine Learning at CMU.

5. L. R. Rabiner (1989), A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, Vol. 77, No. 2, pp. 257-286.

6. M. A. Tanner (1996), Tools for Statistical Inference — Methods for the Exploration of Posterior Distributions and Likelihood Functions (Third Edition). Springer-Verlag New York, Inc.

7. C. F. J. Wu (1983), On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1), pp. 95-103.

8. J. S. Yu (2002), On Hidden Markov Model. Seminar report of Machine Learning at Peking University, http://icl.pku.edu.cn/yujs.

(34)


Thank you

for your attention!