International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, California 94704-1198, (510) 643-9153, FAX (510) 643-7684
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and
Hidden Markov Models
Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
Berkeley CA, 94704 and
Computer Science Division
Department of Electrical Engineering and Computer Science U.C. Berkeley
TR-97-021 April 1998
Abstract
We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used for its solution. We first describe the abstract form of the EM algorithm as it is often given in the literature. We then develop the EM parameter estimation procedure for two applications: 1) finding the parameters of a mixture of Gaussian densities, and 2) finding the parameters of a hidden Markov model (HMM) (i.e., the Baum-Welch algorithm) for both discrete and Gaussian mixture observation models. We derive the update equations in fairly explicit detail but we do not prove any convergence properties. We try to emphasize intuition rather than mathematical rigor.
1 Maximum-likelihood
Recall the definition of the maximum-likelihood estimation problem. We have a density function $p(\mathbf{x}|\Theta)$ that is governed by the set of parameters $\Theta$ (e.g., $p$ might be a set of Gaussians and $\Theta$ could be the means and covariances). We also have a data set of size $N$, supposedly drawn from this distribution, i.e., $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. That is, we assume that these data vectors are independent and identically distributed (i.i.d.) with distribution $p$. Therefore, the resulting density for the samples is
$$p(X|\Theta) = \prod_{i=1}^{N} p(\mathbf{x}_i|\Theta) = \mathcal{L}(\Theta|X).$$
This function $\mathcal{L}(\Theta|X)$ is called the likelihood of the parameters given the data, or just the likelihood function. The likelihood is thought of as a function of the parameters $\Theta$ where the data $X$ is fixed. In the maximum-likelihood problem, our goal is to find the $\Theta$ that maximizes $\mathcal{L}$. That is, we wish to find $\Theta^*$ where
$$\Theta^* = \operatorname{argmax}_{\Theta} \mathcal{L}(\Theta|X).$$
Often we maximize $\log(\mathcal{L}(\Theta|X))$ instead because it is analytically easier.
Depending on the form of $p(\mathbf{x}|\Theta)$ this problem can be easy or hard. For example, if $p(\mathbf{x}|\Theta)$ is simply a single Gaussian distribution where $\Theta = (\mu, \sigma^2)$, then we can set the derivative of $\log(\mathcal{L}(\Theta|X))$ to zero and solve directly for $\mu$ and $\sigma^2$ (this, in fact, results in the standard formulas for the mean and variance of a data set). For many problems, however, it is not possible to find such analytical expressions, and we must resort to more elaborate techniques.

2 Basic EM
The EM algorithm is one such elaborate technique. The EM algorithm [ALR77, RW84, GJ95, JJ94, Bis95, Wu83] is a general method of finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values.
There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values, due to problems with or limitations of the observation process. The second occurs when optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters. The latter application is more common in the computational pattern recognition community.
As before, we assume that data $X$ is observed and is generated by some distribution. We call $X$ the incomplete data. We assume that a complete data set $Z = (X, Y)$ exists and also assume (or specify) a joint density function:
$$p(\mathbf{z}|\Theta) = p(\mathbf{x}, \mathbf{y}|\Theta) = p(\mathbf{y}|\mathbf{x}, \Theta)\, p(\mathbf{x}|\Theta).$$
Where does this joint density come from? Often it "arises" from the marginal density function $p(\mathbf{x}|\Theta)$ and the assumption of hidden variables and parameter value guesses (e.g., our two examples, mixture densities and Baum-Welch). In other cases (e.g., missing data values in samples of a distribution), we must assume a joint relationship between the missing and observed values.

With this new density function, we can define a new likelihood function, $\mathcal{L}(\Theta|Z) = \mathcal{L}(\Theta|X, Y) = p(X, Y|\Theta)$, called the complete-data likelihood. Note that this function is in fact a random variable since the missing information $Y$ is unknown, random, and presumably governed by an underlying distribution. That is, we can think of $\mathcal{L}(\Theta|X, Y) = h_{X,\Theta}(Y)$ for some function $h_{X,\Theta}(\cdot)$ where $X$ and $\Theta$ are constant and $Y$ is a random variable. The original likelihood $\mathcal{L}(\Theta|X)$ is referred to as the incomplete-data likelihood function.

The EM algorithm first finds the expected value of the complete-data log-likelihood $\log p(X, Y|\Theta)$ with respect to the unknown data $Y$ given the observed data $X$ and the current parameter estimates.
That is, we define:
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, Y|\Theta) \mid X, \Theta^{(i-1)}\right] \qquad (1)$$
where $\Theta^{(i-1)}$ are the current parameter estimates that we used to evaluate the expectation and $\Theta$ are the new parameters that we optimize to increase $Q$.

This expression probably requires some explanation.¹ The key thing to understand is that $X$ and $\Theta^{(i-1)}$ are constants, $\Theta$ is a normal variable that we wish to adjust, and $Y$ is a random variable governed by the distribution $f(\mathbf{y}|X, \Theta^{(i-1)})$. The right side of Equation 1 can therefore be re-written as:
$$E\left[\log p(X, Y|\Theta) \mid X, \Theta^{(i-1)}\right] = \int_{\mathbf{y} \in \Upsilon} \log p(X, \mathbf{y}|\Theta)\, f(\mathbf{y}|X, \Theta^{(i-1)})\, d\mathbf{y}. \qquad (2)$$
Note that $f(\mathbf{y}|X, \Theta^{(i-1)})$ is the marginal distribution of the unobserved data and is dependent on both the observed data $X$ and on the current parameters, and $\Upsilon$ is the space of values $\mathbf{y}$ can take on. In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters $\Theta^{(i-1)}$ and perhaps the data. In the worst of cases, this density might be very hard to obtain. Sometimes, in fact, the density actually used is
$f(\mathbf{y}, X|\Theta^{(i-1)}) = f(\mathbf{y}|X, \Theta^{(i-1)})\, f(X|\Theta^{(i-1)})$, but this does not affect subsequent steps since the extra factor, $f(X|\Theta^{(i-1)})$, is not dependent on $\Theta$.

As an analogy, suppose we have a function $h(\cdot, \cdot)$ of two variables. Consider $h(\theta, Y)$, where $\theta$ is a constant and $Y$ is a random variable governed by some distribution $f_Y(y)$. Then $q(\theta) = E_Y[h(\theta, Y)] = \int_y h(\theta, y) f_Y(y)\, dy$ is now a deterministic function that could be maximized if desired.

The evaluation of this expectation is called the E-step of the algorithm. Notice the meaning of the two arguments in the function $Q(\Theta, \Theta')$. The first argument $\Theta$ corresponds to the parameters that ultimately will be optimized in an attempt to maximize the likelihood. The second argument $\Theta'$ corresponds to the parameters that we use to evaluate the expectation.
The second step (the M-step) of the EM algorithm is to maximize the expectation we computed in the first step. That is, we find:
$$\Theta^{(i)} = \operatorname{argmax}_{\Theta}\, Q(\Theta, \Theta^{(i-1)}).$$
These two steps are repeated as necessary. Each iteration is guaranteed to increase the log-likelihood and the algorithm is guaranteed to converge to a local maximum of the likelihood function. There are many rate-of-convergence papers (e.g., [ALR77, RW84, Wu83, JX96, XJ96]) but we will not discuss them here.
¹Recall that $E[h(Y)|X = x] = \int_y h(y) f_{Y|X}(y|x)\, dy$. In the following discussion, we drop the subscripts from different density functions since argument usage should disambiguate different ones.
A modified form of the M-step is, instead of maximizing $Q(\Theta, \Theta^{(i-1)})$, to find some $\Theta^{(i)}$ such that $Q(\Theta^{(i)}, \Theta^{(i-1)}) > Q(\Theta, \Theta^{(i-1)})$. This form of the algorithm is called Generalized EM (GEM) and is also guaranteed to converge.

As presented above, it is not clear how exactly to "code up" the algorithm. This is, however, the way the algorithm is presented in its most general form. The details of the steps required to compute the given quantities are very dependent on the particular application, so they are not discussed when the algorithm is presented in this abstract form.
3 Finding Maximum Likelihood Mixture Densities Parameters via EM
The mixture-density parameter estimation problem is probably one of the most widely used appli- cations of the EM algorithm in the computational pattern recognition community. In this case, we assume the following probabilistic model:
$$p(\mathbf{x}|\Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x}|\theta_i)$$
where the parameters are $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$ such that $\sum_{i=1}^{M} \alpha_i = 1$ and each $p_i$ is a density function parameterized by $\theta_i$. In other words, we assume we have $M$ component densities mixed together with $M$ mixing coefficients $\alpha_i$.
The incomplete-data log-likelihood expression for this density from the data $X$ is given by:
$$\log(\mathcal{L}(\Theta|X)) = \log \prod_{i=1}^{N} p(\mathbf{x}_i|\Theta) = \sum_{i=1}^{N} \log\left( \sum_{j=1}^{M} \alpha_j\, p_j(\mathbf{x}_i|\theta_j) \right)$$
which is difficult to optimize because it contains the log of the sum. If we consider $X$ as incomplete, however, and posit the existence of unobserved data items $Y = \{y_i\}_{i=1}^{N}$ whose values inform us which component density "generated" each data item, the likelihood expression is significantly simplified. That is, we assume that $y_i \in \{1, \ldots, M\}$ for each $i$, and $y_i = k$ if the $i$th sample was generated by the $k$th mixture component. If we know the values of $Y$, the likelihood becomes:
$$\log(\mathcal{L}(\Theta|X, Y)) = \log(P(X, Y|\Theta)) = \sum_{i=1}^{N} \log\left( P(\mathbf{x}_i|y_i) P(y_i) \right) = \sum_{i=1}^{N} \log\left( \alpha_{y_i}\, p_{y_i}(\mathbf{x}_i|\theta_{y_i}) \right)$$
which, given a particular form of the component densities, can be optimized using a variety of techniques.
The problem, of course, is that we do not know the values of $Y$. If we assume $Y$ is a random vector, however, we can proceed.
We first must derive an expression for the distribution of the unobserved data. Let us first guess at parameters for the mixture density, i.e., we guess that $\Theta^g = (\alpha_1^g, \ldots, \alpha_M^g, \theta_1^g, \ldots, \theta_M^g)$ are the appropriate parameters for the likelihood $\mathcal{L}(\Theta^g|X, Y)$. Given $\Theta^g$, we can easily compute $p_j(\mathbf{x}_i|\theta_j^g)$ for each $i$ and $j$. In addition, the mixing parameters $\alpha_j$ can be thought of as prior probabilities of each mixture component, that is, $\alpha_j = p(\text{component } j)$. Therefore, using Bayes's rule, we can compute:
$$p(y_i|\mathbf{x}_i, \Theta^g) = \frac{\alpha_{y_i}^g\, p_{y_i}(\mathbf{x}_i|\theta_{y_i}^g)}{p(\mathbf{x}_i|\Theta^g)} = \frac{\alpha_{y_i}^g\, p_{y_i}(\mathbf{x}_i|\theta_{y_i}^g)}{\sum_{k=1}^{M} \alpha_k^g\, p_k(\mathbf{x}_i|\theta_k^g)}$$
and
$$p(\mathbf{y}|X, \Theta^g) = \prod_{i=1}^{N} p(y_i|\mathbf{x}_i, \Theta^g)$$
where $\mathbf{y} = (y_1, \ldots, y_N)$ is an instance of the unobserved data independently drawn. When we now look at Equation 2, we see that in this case we have obtained the desired marginal density by assuming the existence of the hidden variables and making a guess at the initial parameters of their distribution.
In this case, Equation 1 takes the form:
$$\begin{aligned}
Q(\Theta, \Theta^g) &= \sum_{\mathbf{y} \in \Upsilon} \log(\mathcal{L}(\Theta|X, \mathbf{y}))\, p(\mathbf{y}|X, \Theta^g) \\
&= \sum_{\mathbf{y} \in \Upsilon} \sum_{i=1}^{N} \log\left( \alpha_{y_i}\, p_{y_i}(\mathbf{x}_i|\theta_{y_i}) \right) \prod_{j=1}^{N} p(y_j|\mathbf{x}_j, \Theta^g) \\
&= \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \log\left( \alpha_{y_i}\, p_{y_i}(\mathbf{x}_i|\theta_{y_i}) \right) \prod_{j=1}^{N} p(y_j|\mathbf{x}_j, \Theta^g) \\
&= \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \sum_{\ell=1}^{M} \delta_{\ell, y_i} \log\left( \alpha_\ell\, p_\ell(\mathbf{x}_i|\theta_\ell) \right) \prod_{j=1}^{N} p(y_j|\mathbf{x}_j, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\left( \alpha_\ell\, p_\ell(\mathbf{x}_i|\theta_\ell) \right) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{\ell, y_i} \prod_{j=1}^{N} p(y_j|\mathbf{x}_j, \Theta^g) \qquad (3)
\end{aligned}$$
In this form, $Q(\Theta, \Theta^g)$ looks fairly daunting, yet it can be greatly simplified. We first note that for $\ell \in \{1, \ldots, M\}$,
$$\begin{aligned}
\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{\ell, y_i} \prod_{j=1}^{N} p(y_j|\mathbf{x}_j, \Theta^g)
&= \left( \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j|\mathbf{x}_j, \Theta^g) \right) p(\ell|\mathbf{x}_i, \Theta^g) \\
&= \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j|\mathbf{x}_j, \Theta^g) \right) p(\ell|\mathbf{x}_i, \Theta^g) = p(\ell|\mathbf{x}_i, \Theta^g) \qquad (4)
\end{aligned}$$
since $\sum_{i=1}^{M} p(i|\mathbf{x}_j, \Theta^g) = 1$. Using Equation 4, we can write Equation 3 as:
$$\begin{aligned}
Q(\Theta, \Theta^g) &= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\left( \alpha_\ell\, p_\ell(\mathbf{x}_i|\theta_\ell) \right) p(\ell|\mathbf{x}_i, \Theta^g) \\
&= \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell|\mathbf{x}_i, \Theta^g) + \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\left( p_\ell(\mathbf{x}_i|\theta_\ell) \right) p(\ell|\mathbf{x}_i, \Theta^g) \qquad (5)
\end{aligned}$$
To maximize this expression, we can maximize the term containing $\alpha_\ell$ and the term containing $\theta_\ell$ independently since they are not related.
To find the expression for $\alpha_\ell$, we introduce the Lagrange multiplier $\lambda$ with the constraint that $\sum_\ell \alpha_\ell = 1$, and solve the following equation:
$$\frac{\partial}{\partial \alpha_\ell} \left[ \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log(\alpha_\ell)\, p(\ell|\mathbf{x}_i, \Theta^g) + \lambda \left( \sum_\ell \alpha_\ell - 1 \right) \right] = 0$$
or
$$\sum_{i=1}^{N} \frac{1}{\alpha_\ell}\, p(\ell|\mathbf{x}_i, \Theta^g) + \lambda = 0.$$
Summing both sides over $\ell$, we get that $\lambda = -N$, resulting in:
$$\alpha_\ell = \frac{1}{N} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g).$$
For some distributions, it is possible to get analytical expressions for $\theta_\ell$ as functions of everything else. For example, if we assume $d$-dimensional Gaussian component distributions with mean $\mu$ and covariance matrix $\Sigma$, i.e., $\theta = (\mu, \Sigma)$, then
$$p_\ell(\mathbf{x}|\mu_\ell, \Sigma_\ell) = \frac{1}{(2\pi)^{d/2} |\Sigma_\ell|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \mu_\ell)^T \Sigma_\ell^{-1} (\mathbf{x} - \mu_\ell)}. \qquad (6)$$
To derive the update equations for this distribution, we need to recall some results from matrix algebra.
The trace of a square matrix $\operatorname{tr}(A)$ is equal to the sum of $A$'s diagonal elements. The trace of a scalar equals that scalar. Also, $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$, and $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, which implies that $\sum_i \mathbf{x}_i^T A \mathbf{x}_i = \operatorname{tr}(AB)$ where $B = \sum_i \mathbf{x}_i \mathbf{x}_i^T$. Also note that $|A|$ indicates the determinant of a matrix, and that $|A^{-1}| = 1/|A|$.

We will need to take derivatives of a function of a matrix $f(A)$ with respect to elements of that matrix. Therefore, we define $\frac{\partial f(A)}{\partial A}$ to be the matrix with $i,j$th entry $\left[ \frac{\partial f(A)}{\partial a_{i,j}} \right]$ where $a_{i,j}$ is the $i,j$th entry of $A$. The definition also applies to taking derivatives with respect to a vector. First, $\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}} = (A + A^T)\mathbf{x}$. Second, it can be shown that when $A$ is a symmetric matrix:
$$\frac{\partial |A|}{\partial a_{i,j}} = \begin{cases} A_{i,j} & \text{if } i = j \\ 2A_{i,j} & \text{if } i \neq j \end{cases}$$
where $A_{i,j}$ is the $i,j$th cofactor of $A$. Given the above, we see that:
$$\frac{\partial \log |A|}{\partial A} = \begin{cases} A_{i,j}/|A| & \text{if } i = j \\ 2A_{i,j}/|A| & \text{if } i \neq j \end{cases} = 2A^{-1} - \operatorname{diag}(A^{-1})$$
by the definition of the inverse of a matrix. Finally, it can be shown that:
$$\frac{\partial \operatorname{tr}(AB)}{\partial A} = B + B^T - \operatorname{diag}(B).$$
Taking the log of Equation 6, ignoring any constant terms (since they disappear after taking derivatives), and substituting into the right side of Equation 5, we get:
$$\sum_{\ell=1}^{M} \sum_{i=1}^{N} \log\left( p_\ell(\mathbf{x}_i|\mu_\ell, \Sigma_\ell) \right) p(\ell|\mathbf{x}_i, \Theta^g) = \sum_{\ell=1}^{M} \sum_{i=1}^{N} \left( -\frac{1}{2} \log(|\Sigma_\ell|) - \frac{1}{2} (\mathbf{x}_i - \mu_\ell)^T \Sigma_\ell^{-1} (\mathbf{x}_i - \mu_\ell) \right) p(\ell|\mathbf{x}_i, \Theta^g) \qquad (7)$$
Taking the derivative of Equation 7 with respect to $\mu_\ell$ and setting it equal to zero, we get:
$$\sum_{i=1}^{N} \Sigma_\ell^{-1} (\mathbf{x}_i - \mu_\ell)\, p(\ell|\mathbf{x}_i, \Theta^g) = 0$$
with which we can easily solve for $\mu_\ell$ to obtain:
$$\mu_\ell = \frac{\sum_{i=1}^{N} \mathbf{x}_i\, p(\ell|\mathbf{x}_i, \Theta^g)}{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)}.$$
To find $\Sigma_\ell$, note that we can write Equation 7 as:
$$\begin{aligned}
&\sum_{\ell=1}^{M} \left[ \frac{1}{2} \log(|\Sigma_\ell^{-1}|) \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, \operatorname{tr}\left( \Sigma_\ell^{-1} (\mathbf{x}_i - \mu_\ell)(\mathbf{x}_i - \mu_\ell)^T \right) \right] \\
&= \sum_{\ell=1}^{M} \left[ \frac{1}{2} \log(|\Sigma_\ell^{-1}|) \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, \operatorname{tr}\left( \Sigma_\ell^{-1} N_{\ell,i} \right) \right]
\end{aligned}$$
where $N_{\ell,i} = (\mathbf{x}_i - \mu_\ell)(\mathbf{x}_i - \mu_\ell)^T$.
Taking the derivative with respect to $\Sigma_\ell^{-1}$, we get:
$$\begin{aligned}
&\frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) \left( 2\Sigma_\ell - \operatorname{diag}(\Sigma_\ell) \right) - \frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) \left( 2N_{\ell,i} - \operatorname{diag}(N_{\ell,i}) \right) \\
&= \frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) \left( 2M_{\ell,i} - \operatorname{diag}(M_{\ell,i}) \right) = 2S - \operatorname{diag}(S)
\end{aligned}$$
where $M_{\ell,i} = \Sigma_\ell - N_{\ell,i}$ and where $S = \frac{1}{2} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, M_{\ell,i}$. Setting the derivative to zero, i.e., $2S - \operatorname{diag}(S) = 0$, implies that $S = 0$. This gives
$$\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g) \left( \Sigma_\ell - N_{\ell,i} \right) = 0$$
or
$$\Sigma_\ell = \frac{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, N_{\ell,i}}{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)} = \frac{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, (\mathbf{x}_i - \mu_\ell)(\mathbf{x}_i - \mu_\ell)^T}{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)}.$$
Summarizing, the estimates of the new parameters in terms of the old parameters are as follows:
$$\alpha_\ell^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)$$
$$\mu_\ell^{\text{new}} = \frac{\sum_{i=1}^{N} \mathbf{x}_i\, p(\ell|\mathbf{x}_i, \Theta^g)}{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)}$$
$$\Sigma_\ell^{\text{new}} = \frac{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)\, (\mathbf{x}_i - \mu_\ell^{\text{new}})(\mathbf{x}_i - \mu_\ell^{\text{new}})^T}{\sum_{i=1}^{N} p(\ell|\mathbf{x}_i, \Theta^g)}$$
Note that the above equations perform both the expectation step and the maximization step simultaneously. The algorithm proceeds by using the newly derived parameters as the guess for the next iteration.
4 Learning the parameters of an HMM, EM, and the Baum-Welch algorithm
A hidden Markov model is a probabilistic model of the joint probability of a collection of random variables $\{O_1, \ldots, O_T, Q_1, \ldots, Q_T\}$. The $O_t$ variables are either continuous or discrete observations and the $Q_t$ variables are "hidden" and discrete. Under an HMM, there are two conditional independence assumptions made about these random variables that make associated algorithms tractable. These independence assumptions are: 1) the $t$th hidden variable, given the $(t-1)$st hidden variable, is independent of previous variables, or:
$$P(Q_t | Q_{t-1}, O_{t-1}, \ldots, Q_1, O_1) = P(Q_t | Q_{t-1});$$
and 2) the $t$th observation, given the $t$th hidden variable, is independent of other variables, or:
$$P(O_t | Q_T, O_T, Q_{T-1}, O_{T-1}, \ldots, Q_{t+1}, O_{t+1}, Q_t, Q_{t-1}, O_{t-1}, \ldots, Q_1, O_1) = P(O_t | Q_t).$$
In this section, we derive the EM algorithm for finding the maximum-likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. This algorithm is also known as the Baum-Welch algorithm.
$Q_t$ is a discrete random variable with $N$ possible values $\{1, \ldots, N\}$. We further assume that the underlying "hidden" Markov chain defined by $P(Q_t | Q_{t-1})$ is time-homogeneous (i.e., is independent of the time $t$). Therefore, we can represent $P(Q_t | Q_{t-1})$ as a time-independent stochastic transition matrix $A = \{a_{i,j}\} = p(Q_t = j | Q_{t-1} = i)$. The special case of time $t = 1$ is described by the initial state distribution, $\pi_i = p(Q_1 = i)$. We say that we are in state $j$ at time $t$ if $Q_t = j$. A particular sequence of states is described by $\mathbf{q} = (q_1, \ldots, q_T)$, where $q_t \in \{1, \ldots, N\}$ is the state at time $t$.

A particular observation sequence $O$ is described as $O = (O_1 = o_1, \ldots, O_T = o_T)$. The probability of a particular observation vector at a particular time $t$ for state $j$ is described by $b_j(o_t) = p(O_t = o_t | Q_t = j)$. The complete collection of parameters for all observation distributions is represented by $B = \{b_j(\cdot)\}$.
There are two forms of output distributions we will consider. The first is a discrete observation assumption where we assume that an observation is one of $L$ possible observation symbols, $o_t \in V = \{v_1, \ldots, v_L\}$. In this case, if $o_t = v_k$, then $b_j(o_t) = p(O_t = v_k | q_t = j)$. The second form of probability distribution we consider is a mixture of $M$ multivariate Gaussians for each state, where
$$b_j(o_t) = \sum_{\ell=1}^{M} c_{j\ell}\, \mathcal{N}(o_t | \mu_{j\ell}, \Sigma_{j\ell}) = \sum_{\ell=1}^{M} c_{j\ell}\, b_{j\ell}(o_t).$$
We describe the complete set of HMM parameters for a given model by $\lambda = (A, B, \pi)$. There are three basic problems associated with HMMs:

1. Find $p(O|\lambda)$ for some $O = (o_1, \ldots, o_T)$. We use the forward (or the backward) procedure for this since it is much more efficient than direct evaluation.

2. Given some $O$ and some $\lambda$, find the best state sequence $\mathbf{q} = (q_1, \ldots, q_T)$ that explains $O$. The Viterbi algorithm solves this problem, but we will not discuss it in this paper.

3. Find $\lambda^* = \operatorname{argmax}_\lambda p(O|\lambda)$. The Baum-Welch algorithm (also called forward-backward or EM for HMMs) solves this problem, and we will develop it presently.

In subsequent sections, we will consider only the first and third problems. The second is addressed in [RJ93].
4.1 Efficient Calculation of Desired Quantities
One of the advantages of HMMs is that relatively efficient algorithms can be derived for the three problems mentioned above. Before we derive the EM algorithm directly using the $Q$ function, we review these efficient procedures.

Recall the forward procedure. We define
$$\alpha_i(t) = p(O_1 = o_1, \ldots, O_t = o_t, Q_t = i \,|\, \lambda)$$
which is the probability of seeing the partial sequence $o_1, \ldots, o_t$ and ending up in state $i$ at time $t$.
We can efficiently define $\alpha_i(t)$ recursively as:

1. $\alpha_i(1) = \pi_i\, b_i(o_1)$

2. $\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_j(o_{t+1})$

3. $p(O|\lambda) = \sum_{i=1}^{N} \alpha_i(T)$
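The three steps of the forward procedure translate directly to code. The following sketch (our own illustration) handles a discrete-observation HMM; for long sequences one would scale the recursion or work in the log domain to avoid underflow.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward procedure: alpha[t, i] = p(o_1..o_{t+1}, Q_{t+1} = i | lambda)
    (0-indexed in t). pi: (N,) initial distribution; A: (N, N) transitions
    a_ij; B: (N, L) emission probabilities b_j(k); obs: symbol indices."""
    T, N = len(obs), len(pi)
    alpha = np.empty((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # step 1: alpha_i(1)
    for t in range(T - 1):
        # step 2: alpha_j(t+1) = [sum_i alpha_i(t) a_ij] b_j(o_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha, alpha[-1].sum()                 # step 3: p(O | lambda)

# Small 2-state example (parameter values are ours, for illustration).
alpha, p_obs = forward(np.array([0.6, 0.4]),
                       np.array([[0.7, 0.3], [0.4, 0.6]]),
                       np.array([[0.9, 0.1], [0.2, 0.8]]),
                       [0, 1, 0])
```

The recursion costs $O(N^2 T)$ operations, versus $O(N^T)$ for direct summation over all state sequences, which is the efficiency gain the text refers to.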
The backward procedure is similar:
$$\beta_i(t) = p(O_{t+1} = o_{t+1}, \ldots, O_T = o_T \,|\, Q_t = i, \lambda)$$
which is the probability of the ending partial sequence $o_{t+1}, \ldots, o_T$ given that we started at state $i$ at time $t$. We can efficiently define $\beta_i(t)$ as:
1. $\beta_i(T) = 1$

2. $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$

3. $p(O|\lambda) = \sum_{i=1}^{N} \beta_i(1)\, \pi_i\, b_i(o_1)$
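The backward recursion can be sketched in the same style (again our own illustration, unscaled); on the same model and observation sequence it must reproduce the same $p(O|\lambda)$ as the forward procedure.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward procedure: beta[t, i] = p(o_{t+2}..o_T | Q_{t+1} = i, lambda)
    (0-indexed in t)."""
    T, N = len(obs), len(pi)
    beta = np.empty((T, N))
    beta[-1] = 1.0                                  # step 1: beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # step 2: beta_i(t) = sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # step 3: p(O | lambda) = sum_i beta_i(1) pi_i b_i(o_1)
    p_obs = (beta[0] * pi * B[:, obs[0]]).sum()
    return beta, p_obs

# Same example model as before (values are ours, for illustration).
beta, p_obs = backward(np.array([0.6, 0.4]),
                       np.array([[0.7, 0.3], [0.4, 0.6]]),
                       np.array([[0.9, 0.1], [0.2, 0.8]]),
                       [0, 1, 0])
```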
We now define
$$\gamma_i(t) = p(Q_t = i \,|\, O, \lambda)$$
which is the probability of being in state $i$ at time $t$ given the observation sequence $O$. Note that:
$$p(Q_t = i \,|\, O, \lambda) = \frac{p(O, Q_t = i | \lambda)}{p(O|\lambda)} = \frac{p(O, Q_t = i | \lambda)}{\sum_{j=1}^{N} p(O, Q_t = j | \lambda)}$$
Also note that, because of Markovian conditional independence,
$$\alpha_i(t)\, \beta_i(t) = p(o_1, \ldots, o_t, Q_t = i | \lambda)\, p(o_{t+1}, \ldots, o_T | Q_t = i, \lambda) = p(O, Q_t = i | \lambda)$$
so we can define things in terms of $\alpha_i(t)$ and $\beta_i(t)$ as
$$\gamma_i(t) = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}.$$
We also define
$$\xi_{ij}(t) = p(Q_t = i, Q_{t+1} = j \,|\, O, \lambda)$$
which is the probability of being in state $i$ at time $t$ and being in state $j$ at time $t+1$. This can also be expanded as:
$$\xi_{ij}(t) = \frac{p(Q_t = i, Q_{t+1} = j, O | \lambda)}{p(O|\lambda)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}$$
or as:
$$\xi_{ij}(t) = \frac{p(Q_t = i | O, \lambda)\, p(o_{t+1} \ldots o_T, Q_{t+1} = j | Q_t = i, \lambda)}{p(o_{t+1} \ldots o_T | Q_t = i, \lambda)} = \frac{\gamma_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\beta_i(t)}$$
If we sum these quantities across time, we can get some useful values. I.e., the expression
$$\sum_{t=1}^{T} \gamma_i(t)$$
is the expected number of times in state $i$ and therefore is the expected number of transitions away from state $i$ for $O$. Similarly,
$$\sum_{t=1}^{T-1} \xi_{ij}(t)$$
is the expected number of transitions from state $i$ to state $j$ for $O$. These follow from the fact that
$$\sum_t \gamma_i(t) = \sum_t E[I_t(i)] = E\left[ \sum_t I_t(i) \right]$$
and
$$\sum_t \xi_{ij}(t) = \sum_t E[I_t(i,j)] = E\left[ \sum_t I_t(i,j) \right]$$
where $I_t(i)$ is an indicator random variable that is $1$ when we are in state $i$ at time $t$, and $I_t(i,j)$ is a random variable that is $1$ when we move from state $i$ to state $j$ after time $t$.

Jumping the gun a bit, our goal in forming an EM algorithm is to estimate new parameters for the HMM by using the old parameters and the data. Intuitively, we can do this simply using relative frequencies. That is, we can define update rules as follows:
The quantity
$$\tilde{\pi}_i = \gamma_i(1) \qquad (8)$$
is the expected relative frequency spent in state $i$ at time 1.

The quantity
$$\tilde{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad (9)$$
is the expected number of transitions from state $i$ to state $j$ relative to the expected total number of transitions away from state $i$.

And, for discrete distributions, the quantity
$$\tilde{b}_i(k) = \frac{\sum_{t=1}^{T} \delta_{o_t, v_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)} \qquad (10)$$
is the expected number of times the output observations have been equal to $v_k$ while in state $i$, relative to the expected total number of times in state $i$.
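A single re-estimation pass implementing Equations 8-10 for a discrete HMM might be sketched as follows. This is our own illustration: it recomputes the unscaled forward and backward variables internally (fine for short sequences; real implementations scale them), then forms $\gamma$, $\xi$, and the relative-frequency updates.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation pass (Eqs. 8-10), single sequence.

    pi: (N,); A: (N, N); B: (N, L); obs: list of symbol indices.
    """
    T, N = len(obs), len(pi)
    # Unscaled forward and backward passes.
    alpha = np.empty((T, N)); beta = np.empty((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # gamma_i(t) and xi_ij(t)
    gamma = alpha * beta / p_obs
    xi = np.empty((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_obs
    # Updates: Eq. (8), Eq. (9), Eq. (10)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Example model (parameter values are ours, for illustration).
new_pi, new_A, new_B = baum_welch_step(np.array([0.6, 0.4]),
                                       np.array([[0.7, 0.3], [0.4, 0.6]]),
                                       np.array([[0.9, 0.1], [0.2, 0.8]]),
                                       [0, 1, 0, 0, 1])
```

Because the updates are relative frequencies of expected counts, the re-estimated $\tilde{\pi}$, rows of $\tilde{A}$, and rows of $\tilde{B}$ automatically remain valid probability distributions.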
For Gaussian mixtures, we define the probability that the $\ell$th component of the $i$th mixture generated observation $o_t$ as
$$\gamma_{i\ell}(t) = \gamma_i(t)\, \frac{c_{i\ell}\, b_{i\ell}(o_t)}{b_i(o_t)} = p(Q_t = i, X_{it} = \ell \,|\, O, \lambda)$$
where $X_{it}$ is a random variable indicating the mixture component at time $t$ for state $i$.
From the previous section on Gaussian Mixtures, we might guess that the update equations for this case are:
$$c_{i\ell} = \frac{\sum_{t=1}^{T} \gamma_{i\ell}(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
$$\mu_{i\ell} = \frac{\sum_{t=1}^{T} \gamma_{i\ell}(t)\, o_t}{\sum_{t=1}^{T} \gamma_{i\ell}(t)}$$
$$\Sigma_{i\ell} = \frac{\sum_{t=1}^{T} \gamma_{i\ell}(t)\, (o_t - \mu_{i\ell})(o_t - \mu_{i\ell})^T}{\sum_{t=1}^{T} \gamma_{i\ell}(t)}$$
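The quantities $\gamma_{i\ell}(t)$ and the mixture-weight update reduce to simple array operations once $\gamma_i(t)$ and the per-component densities are available. The following sketch is our own, with hypothetical array shapes; it is not tied to a particular emission model.

```python
import numpy as np

def mixture_posteriors(gamma, c, b_comp):
    """gamma_{i l}(t) = gamma_i(t) * c_{i l} b_{i l}(o_t) / b_i(o_t).

    gamma: (T, N) state posteriors; c: (N, M) mixture weights;
    b_comp: (T, N, M) per-component densities b_{i l}(o_t).
    Returns (T, N, M) joint state/component posteriors.
    """
    b = (c[None, :, :] * b_comp).sum(axis=2)        # b_i(o_t), shape (T, N)
    return gamma[:, :, None] * c[None, :, :] * b_comp / b[:, :, None]

def mixture_weight_update(gamma, gamma_il):
    # c_{i l} = sum_t gamma_{i l}(t) / sum_t gamma_i(t)
    return gamma_il.sum(axis=0) / gamma.sum(axis=0)[:, None]

# Tiny illustration with uniform component densities (values are ours).
g = np.array([[0.5, 0.5], [0.2, 0.8]])
c = np.array([[0.3, 0.7], [0.6, 0.4]])
bc = np.ones((2, 2, 2))
g_il = mixture_posteriors(g, c, bc)
c_new = mixture_weight_update(g, g_il)
```

Summing $\gamma_{i\ell}(t)$ over $\ell$ recovers $\gamma_i(t)$, and with uniform component densities the weight update leaves $c$ unchanged, which are quick sanity checks on an implementation.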
When there are $E$ observation sequences, the $e$th being of length $T_e$, the update equations become:
$$\pi_i = \frac{\sum_{e=1}^{E} \gamma_i^e(1)}{E}$$
$$c_{i\ell} = \frac{\sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_{i\ell}^e(t)}{\sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_i^e(t)}$$
$$\mu_{i\ell} = \frac{\sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_{i\ell}^e(t)\, o_t^e}{\sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma_{i\ell}^e(t)}$$