# alignment model set for two-model re-estimation
ALIGNMODELMMF = dir2/hmacs
ALIGNHMMLIST = hmmlist2
is necessary. HERest only needs to be invoked using that configuration file.
HERest -C config -C config.2model -S trainlist -I labs -H dir1/hmacs -M dir3 hmmlist1

The models in directory dir1 are updated using the alignment models stored in directory dir2 and the result is written to directory dir3. Note that trainlist is a standard HTK script and that the above command uses the capability of HERest to accept multiple configuration files on the command line. If each HMM is stored in a separate file, the configuration variables ALIGNMODELDIR and ALIGNMODELEXT can be used.
Only the state level alignment is obtained using the alignment models. In the exceptional case that the update model set contains mixtures of Gaussians, component level posterior probabilities are obtained from the update models themselves.
8.8 Parameter Re-Estimation Formulae
For reference purposes, this section lists the various formulae employed within the HTK parameter estimation tools. All are standard; however, the use of non-emitting states and multiple data streams leads to various special cases which are usually not covered fully in the literature.
The following notation is used in this section:

$N$  number of states
$S$  number of streams
$M_s$  number of mixture components in stream $s$
$T$  number of observations
$Q$  number of models in an embedded training sequence
$N_q$  number of states in the $q$'th model in a training sequence
$O$  a sequence of observations
$o_t$  the observation at time $t$, $1 \le t \le T$
$o_{st}$  the observation vector for stream $s$ at time $t$
$a_{ij}$  the probability of a transition from state $i$ to $j$
$c_{jsm}$  weight of mixture component $m$ in state $j$ stream $s$
$\mu_{jsm}$  vector of means for the mixture component $m$ of state $j$ stream $s$
$\Sigma_{jsm}$  covariance matrix for the mixture component $m$ of state $j$ stream $s$
$\lambda$  the set of all parameters defining a HMM
8.8.1 Viterbi Training (HInit)
In this style of model training, a set of training observations $O^r$, $1 \le r \le R$ is used to estimate the parameters of a single HMM by iteratively computing Viterbi alignments. When used to initialise a new HMM, the Viterbi segmentation is replaced by a uniform segmentation (i.e. each training observation is divided into $N$ equal segments) for the first iteration.
Apart from the first iteration on a new model, each training sequence $O$ is segmented using a state alignment procedure which results from maximising

$$\phi_N(T) = \max_i \phi_i(T)\, a_{iN}$$

for $1 < i < N$ where

$$\phi_j(t) = \left[ \max_i \phi_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\phi_1(1) = 1, \qquad \phi_j(1) = a_{1j}\, b_j(o_1)$$

for $1 < j < N$. In this and all subsequent cases, the output probability $b_j(\cdot)$ is as defined in equations 7.1 and 7.2 in section 7.1.
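The maximisation above can be illustrated with a minimal Python sketch. It is not HInit's implementation: the function name, the 0-based state indices (index 0 is the entry state, index N-1 the exit state) and the precomputed table `b[j][t]` of output probabilities are assumptions made for the example, and plain probabilities are used where a practical implementation would normally work with logarithms.

```python
def viterbi_align(a, b):
    """a: N x N transition matrix (index 0 = entry state, N-1 = exit state).
    b[j][t]: output probability of emitting state j at frame t (0-based).
    Returns (best path probability, list of emitting states, one per frame)."""
    N, T = len(a), len(b[1])
    emitting = range(1, N - 1)
    phi = [[0.0] * T for _ in range(N)]
    back = [[0] * T for _ in range(N)]
    # Initial conditions: phi_j(1) = a_1j * b_j(o_1)
    for j in emitting:
        phi[j][0] = a[0][j] * b[j][0]
    # Recursion: phi_j(t) = [max_i phi_i(t-1) a_ij] b_j(o_t)
    for t in range(1, T):
        for j in emitting:
            best = max(emitting, key=lambda i: phi[i][t - 1] * a[i][j])
            phi[j][t] = phi[best][t - 1] * a[best][j] * b[j][t]
            back[j][t] = best
    # Termination: phi_N(T) = max_i phi_i(T) a_iN
    last = max(emitting, key=lambda i: phi[i][T - 1] * a[i][N - 1])
    score = phi[last][T - 1] * a[last][N - 1]
    # Trace back the maximising state sequence
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1]][t])
    path.reverse()
    return score, path


# Illustrative left-to-right model with two emitting states and four frames.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.0, 0.0, 0.0],
     [0.9, 0.8, 0.1, 0.1],
     [0.1, 0.2, 0.7, 0.9],
     [0.0, 0.0, 0.0, 0.0]]
score, path = viterbi_align(a, b)
```

For these made-up numbers the first two frames align with the first emitting state and the last two with the second.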
If $A_{ij}$ represents the total number of transitions from state $i$ to state $j$ in performing the above maximisations, then the transition probabilities can be estimated from the relative frequencies

$$\hat{a}_{ij} = \frac{A_{ij}}{\sum_{k=2}^{N} A_{ik}}$$
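As a rough illustration of the relative-frequency estimate, the sketch below counts transitions along aligned state paths and normalises each row. The function name and the 0-based state numbering (0 for the entry state, N-1 for the exit state) are assumptions of the example, not HTK conventions.

```python
def estimate_transitions(paths, N):
    """paths: one list of emitting-state indices per training sequence.
    Returns the N x N matrix of relative-frequency estimates a_hat."""
    A = [[0] * N for _ in range(N)]
    for path in paths:
        full = [0] + list(path) + [N - 1]   # add entry and exit states
        for i, j in zip(full, full[1:]):
            A[i][j] += 1
    a_hat = [[0.0] * N for _ in range(N)]
    for i in range(N - 1):
        total = sum(A[i][1:])               # sum over k = 2..N in HTK numbering
        if total:                           # leave unvisited states at zero
            a_hat[i] = [A[i][j] / total for j in range(N)]
    return a_hat


a_hat = estimate_transitions([[1, 1, 2, 2]], 4)
```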
The sequence of states which maximises $\phi_N(T)$ implies an alignment of training data observations with states. Within each state, a further alignment of observations to mixture components is made. The tool HInit provides two mechanisms for this: for each state and each stream

1. use clustering to allocate each observation $o_{st}$ to one of $M_s$ clusters, or
2. associate each observation $o_{st}$ with the mixture component with the highest probability.

In either case, the net result is that every observation is associated with a single unique mixture component. This association can be represented by the indicator function $\psi^r_{jsm}(t)$ which is 1 if $o^r_{st}$ is associated with mixture component $m$ of stream $s$ of state $j$ and is zero otherwise.
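The means and variances are then estimated via simple averages

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)\, o^r_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)\, (o^r_{st} - \hat{\mu}_{jsm})(o^r_{st} - \hat{\mu}_{jsm})^{T}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)}$$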
Finally, the mixture weights are based on the number of observations allocated to each component

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{l=1}^{M_s} \psi^r_{jsl}(t)}$$
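The three averages can be sketched for a single state and stream as follows. To keep the example short it assumes scalar observations (so the covariance reduces to a variance) and represents the indicator function by a list `assign` giving the component index of each observation; both choices, and the function name, are illustrative only.

```python
def hard_assignment_stats(obs, assign, M):
    """obs: scalar observations aligned to one state and stream.
    assign[t]: index of the mixture component associated with obs[t].
    Returns (means, variances, weights) for the M components."""
    counts = [0] * M
    sums = [0.0] * M
    for o, m in zip(obs, assign):
        counts[m] += 1
        sums[m] += o
    # Assumes every component received at least one observation.
    means = [sums[m] / counts[m] for m in range(M)]
    sq = [0.0] * M
    for o, m in zip(obs, assign):
        sq[m] += (o - means[m]) ** 2
    variances = [sq[m] / counts[m] for m in range(M)]
    weights = [counts[m] / len(obs) for m in range(M)]
    return means, variances, weights


means, variances, weights = hard_assignment_stats(
    [1.0, 3.0, 10.0, 12.0, 14.0], [0, 0, 1, 1, 1], 2)
```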
8.8.2 Forward/Backward Probabilities
Baum-Welch training is similar to the Viterbi training described in the previous section except that the hard boundary implied by the ψ function is replaced by a soft boundary function L which represents the probability of an observation being associated with any given Gaussian mixture component. This occupation probability is computed from the forward and backward probabilities.
For the isolated-unit style of training, the forward probability $\alpha_j(t)$ for $1 < j < N$ and $1 < t \le T$ is calculated by the forward recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\alpha_1(1) = 1, \qquad \alpha_j(1) = a_{1j}\, b_j(o_1)$$

for $1 < j < N$ and final condition given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}$$
The backward probability $\beta_i(t)$ for $1 < i < N$ and $T > t \ge 1$ is calculated by the backward recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$$

with initial conditions given by

$$\beta_i(T) = a_{iN}$$
for $1 < i < N$ and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1)$$
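The isolated-unit recursions can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than HTK code: states are 0-based (index 0 is the entry state, N-1 the exit state), `b[j][t]` is a precomputed table of output probabilities, and no scaling or log arithmetic is applied, so it is only suitable for short sequences. The two final conditions give the same total probability, which makes a convenient consistency check.

```python
def forward(a, b):
    """Returns (alpha, alpha_N(T)) for transition matrix a and outputs b[j][t]."""
    N, T = len(a), len(b[1])
    emitting = range(1, N - 1)
    alpha = [[0.0] * T for _ in range(N)]
    for j in emitting:                       # alpha_j(1) = a_1j b_j(o_1)
        alpha[j][0] = a[0][j] * b[j][0]
    for t in range(1, T):
        for j in emitting:
            alpha[j][t] = sum(alpha[i][t - 1] * a[i][j] for i in emitting) * b[j][t]
    return alpha, sum(alpha[i][T - 1] * a[i][N - 1] for i in emitting)


def backward(a, b):
    """Returns (beta, beta_1(1)) for transition matrix a and outputs b[j][t]."""
    N, T = len(a), len(b[1])
    emitting = range(1, N - 1)
    beta = [[0.0] * T for _ in range(N)]
    for i in emitting:                       # beta_i(T) = a_iN
        beta[i][T - 1] = a[i][N - 1]
    for t in range(T - 2, -1, -1):
        for i in emitting:
            beta[i][t] = sum(a[i][j] * b[j][t + 1] * beta[j][t + 1] for j in emitting)
    return beta, sum(a[0][j] * b[j][0] * beta[j][0] for j in emitting)


# Illustrative model with two emitting states and a skip from the entry state.
a = [[0.0, 0.8, 0.2, 0.0],
     [0.0, 0.6, 0.3, 0.1],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [[0.0, 0.0, 0.0],
     [0.9, 0.5, 0.1],
     [0.2, 0.4, 0.8],
     [0.0, 0.0, 0.0]]
alpha, p_forward = forward(a, b)
beta, p_backward = backward(a, b)
```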
In the case of embedded training where the HMM spanning the observations is a composite constructed by concatenating Q subword models, it is assumed that at time t, the α and β values corresponding to the entry state and exit states of a HMM represent the forward and backward probabilities at time t−∆t and t+∆t, respectively, where ∆t is small. The equations for calculating α and β are then as follows.
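For the forward probability, the initial conditions are established at time $t = 1$ as follows

$$\alpha^{(q)}_1(1) = \begin{cases} 1 & \text{if } q = 1 \\ \alpha^{(q-1)}_1(1)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(1) = a^{(q)}_{1j}\, b^{(q)}_j(o_1)$$

$$\alpha^{(q)}_{N_q}(1) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(1)\, a^{(q)}_{iN_q}$$

where the superscript in parentheses refers to the index of the model in the sequence of concatenated models. All unspecified values of $\alpha$ are zero. For time $t > 1$,

$$\alpha^{(q)}_1(t) = \begin{cases} 0 & \text{if } q = 1 \\ \alpha^{(q-1)}_{N_{q-1}}(t-1) + \alpha^{(q-1)}_1(t)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(t) = \left[ \alpha^{(q)}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t-1)\, a^{(q)}_{ij} \right] b^{(q)}_j(o_t)$$

$$\alpha^{(q)}_{N_q}(t) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t)\, a^{(q)}_{iN_q}$$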
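For the backward probability, the initial conditions are set at time $t = T$ as follows

$$\beta^{(q)}_{N_q}(T) = \begin{cases} 1 & \text{if } q = Q \\ \beta^{(q+1)}_{N_{q+1}}(T)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(T) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(T)$$

$$\beta^{(q)}_1(T) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_T)\, \beta^{(q)}_j(T)$$

where once again, all unspecified $\beta$ values are zero. For time $t < T$,

$$\beta^{(q)}_{N_q}(t) = \begin{cases} 0 & \text{if } q = Q \\ \beta^{(q+1)}_1(t+1) + \beta^{(q+1)}_{N_{q+1}}(t)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(t) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(t) + \sum_{j=2}^{N_q-1} a^{(q)}_{ij}\, b^{(q)}_j(o_{t+1})\, \beta^{(q)}_j(t+1)$$

$$\beta^{(q)}_1(t) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_t)\, \beta^{(q)}_j(t)$$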
The total probability $P = \text{prob}(O|\lambda)$ can be computed from either the forward or backward probabilities

$$P = \alpha_N(T) = \beta_1(1)$$
8.8.3 Single Model Reestimation (HRest)
In this style of model training, a set of training observations $O^r$, $1 \le r \le R$ is used to estimate the parameters of a single HMM. The basic formula for the reestimation of the transition probabilities is

$$\hat{a}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^r_i(t)\, a_{ij}\, b_j(o^r_{t+1})\, \beta^r_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^r_i(t)\, \beta^r_i(t)}$$
where $1 < i < N$ and $1 < j < N$ and $P_r$ is the total probability $P = \text{prob}(O^r|\lambda)$ of the $r$'th observation. The transitions from the non-emitting entry state are reestimated by

$$\hat{a}_{1j} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^r_j(1)\, \beta^r_j(1)$$
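where $1 < j < N$ and the transitions from the emitting states to the final non-emitting exit state are reestimated by

$$\hat{a}_{iN} = \frac{\sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^r_i(T)\, \beta^r_i(T)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^r_i(t)\, \beta^r_i(t)}$$

where $1 < i < N$.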
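The three transition formulae share the same accumulators, which the following sketch makes explicit. It assumes that the forward and backward probabilities of section 8.8.2 have already been computed for each training sequence and are passed in as tables; the function name, the tuple layout and the 0-based indexing (state 0 is the entry state, N-1 the exit state) are choices made for the example only.

```python
def reestimate_transitions(a, sequences):
    """a: current N x N transition matrix.
    sequences: one (b, alpha, beta, P) tuple per training sequence, where
    b[j][t], alpha[j][t] and beta[j][t] are indexed by state j and frame t.
    Returns the re-estimated matrix a_hat."""
    N = len(a)
    emitting = range(1, N - 1)
    R = len(sequences)
    num = [[0.0] * N for _ in range(N)]    # numerators for each a_hat[i][j]
    occ = [0.0] * N                        # occupation of state i (denominator)
    for b, alpha, beta, P in sequences:
        T = len(b[1])
        for j in emitting:                 # entry transitions use frame 1 only
            num[0][j] += alpha[j][0] * beta[j][0] / P
        for i in emitting:
            occ[i] += sum(alpha[i][t] * beta[i][t] for t in range(T)) / P
            num[i][N - 1] += alpha[i][T - 1] * beta[i][T - 1] / P
            for j in emitting:
                num[i][j] += sum(alpha[i][t] * a[i][j] * b[j][t + 1] * beta[j][t + 1]
                                 for t in range(T - 1)) / P
    a_hat = [[0.0] * N for _ in range(N)]
    for j in emitting:
        a_hat[0][j] = num[0][j] / R
    for i in emitting:                     # assumes every emitting state is occupied
        for j in range(1, N):
            a_hat[i][j] = num[i][j] / occ[i]
    return a_hat


# Hand-worked example: one emitting state, two frames, P = 0.02.
a = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 0.0]]
b = [[0.0, 0.0], [0.2, 0.4], [0.0, 0.0]]
alpha = [[0.0, 0.0], [0.2, 0.04], [0.0, 0.0]]
beta = [[0.0, 0.0], [0.1, 0.5], [0.0, 0.0]]
a_hat = reestimate_transitions(a, [(b, alpha, beta, 0.02)])
```

Because the numerators for a given state add up to its occupation, each re-estimated row sums to one, which is a useful sanity check on any implementation.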
For a HMM with $M_s$ mixture components in stream $s$, the means, covariances and mixture weights for that stream are reestimated as follows. Firstly, the probability of occupying the $m$'th mixture component in stream $s$ at time $t$ for the $r$'th observation is

$$L^r_{jsm}(t) = \frac{1}{P_r}\, U^r_j(t)\, c_{jsm}\, b_{jsm}(o^r_{st})\, \beta^r_j(t)\, b^*_{js}(o^r_t)$$

where

$$U^r_j(t) = \begin{cases} a_{1j} & \text{if } t = 1 \\ \sum_{i=2}^{N-1} \alpha^r_i(t-1)\, a_{ij} & \text{otherwise} \end{cases} \tag{8.1}$$

and

$$b^*_{js}(o^r_t) = \prod_{k \ne s} b_{jk}(o^r_{kt})$$
For single Gaussian streams, the probability of mixture component occupancy is equal to the probability of state occupancy and hence it is more efficient in this case to use

$$L^r_{jsm}(t) = L^r_j(t) = \frac{1}{P_r}\, \alpha_j(t)\, \beta_j(t)$$
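To make the component occupation concrete, the sketch below evaluates it for the simplest possible case: a model with a single emitting state, one stream (so the product over the other streams is empty and equal to 1), scalar observations and an entry transition of probability 1. All of these restrictions, and the function names, are assumptions made to keep the example short. With only one emitting state the state occupation is 1 at every frame, so the component occupations at each frame must sum to 1.

```python
import math


def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def component_occupation(obs, self_loop, weights, means, variances):
    """Returns L[m][t] for a single emitting state with a Gaussian mixture output."""
    T = len(obs)
    # comp[m][t] = c_m * b_m(o_t); their sum over m is the state output b(o_t)
    comp = [[c * gauss(o, mu, var) for o in obs]
            for c, mu, var in zip(weights, means, variances)]
    b = [sum(row[t] for row in comp) for t in range(T)]
    exit_prob = 1.0 - self_loop
    alpha = [0.0] * T
    beta = [0.0] * T
    alpha[0] = b[0]                           # entry transition has probability 1
    for t in range(1, T):
        alpha[t] = alpha[t - 1] * self_loop * b[t]
    beta[T - 1] = exit_prob
    for t in range(T - 2, -1, -1):
        beta[t] = self_loop * b[t + 1] * beta[t + 1]
    P = alpha[T - 1] * exit_prob
    L = []
    for m in range(len(weights)):
        row = []
        for t in range(T):
            U = 1.0 if t == 0 else alpha[t - 1] * self_loop   # equation 8.1
            row.append(U * comp[m][t] * beta[t] / P)
        L.append(row)
    return L


L = component_occupation([0.1, 1.9, 2.2], 0.6, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```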
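Given the above definitions, the re-estimation formulae may now be expressed in terms of $L^r_{jsm}(t)$ as follows.

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)\, o^r_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)\, (o^r_{st} - \hat{\mu}_{jsm})(o^r_{st} - \hat{\mu}_{jsm})^{T}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)} \tag{8.2}$$

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_j(t)}$$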
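A minimal sketch of these occupation-weighted averages for one state and stream is given below. It assumes scalar observations from a single training sequence and takes the component occupations as an input table `L[m][t]`; the state occupation in the weight denominator is obtained by summing the component occupations over the components, which holds because the mixture output is the weighted sum of the component outputs. The function name is illustrative.

```python
def soft_update(obs, L):
    """obs[t]: scalar observations; L[m][t]: occupation of component m at frame t.
    Returns (means, variances, weights)."""
    M = len(L)
    occ = [sum(L[m]) for m in range(M)]
    means = [sum(l * o for l, o in zip(L[m], obs)) / occ[m] for m in range(M)]
    variances = [sum(l * (o - means[m]) ** 2 for l, o in zip(L[m], obs)) / occ[m]
                 for m in range(M)]
    state_occ = sum(occ)                  # sum over t of the state occupation
    weights = [occ[m] / state_occ for m in range(M)]
    return means, variances, weights


means, variances, weights = soft_update(
    [0.0, 2.0, 4.0],
    [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]])
```

If every entry of `L` is 0 or 1, the result coincides with the simple averages used in Viterbi training.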
8.8.4 Embedded Model Reestimation (HERest)
The re-estimation formulae for the embedded model case have to be modified to take account of the fact that the entry states can be occupied at any time as a result of transitions out of the previous model. The basic formula for the re-estimation of the transition probabilities is

$$\hat{a}^{(q)}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{ij}\, b^{(q)}_j(o^r_{t+1})\, \beta^{(q)r}_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$
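The transitions from the non-emitting entry states into the HMM are re-estimated by

$$\hat{a}^{(q)}_{1j} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j}\, b^{(q)}_j(o^r_t)\, \beta^{(q)r}_j(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_1(t)\, \beta^{(q)r}_1(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$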
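and the transitions out of the HMM into the non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{iN_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{iN_q}\, \beta^{(q)r}_{N_q}(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$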
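Finally, the direct transitions from non-emitting entry to non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{1N_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$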
The re-estimation formulae for the output distributions are the same as for the single model case except for the obvious additional subscript for $q$. However, the probability calculations must now allow for transitions from the entry states by changing $U^r_j(t)$ in equation 8.1 to

$$U^{(q)r}_j(t) = \begin{cases} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} & \text{if } t = 1 \\ \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)r}_i(t-1)\, a^{(q)}_{ij} & \text{otherwise} \end{cases}$$
8.8.5 Semi-Tied Transform Estimation (HERest)
In addition to estimating the standard parameters above, HERest can be used to estimate semi-tied transforms and HLDA projections. This section describes semi-tied transforms; the updates for HLDA are very similar.
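Semi-tied covariance matrices have the form

$$\mu_{m_r} = \mu_{m_r}, \qquad \Sigma_{m_r} = H_r\, \Sigma^{\text{diag}}_{m_r}\, H_r^{T} \tag{8.3}$$

For efficiency reasons the transforms are stored and likelihoods calculated using

$$\mathcal{N}(o;\, \mu_{m_r},\, H_r \Sigma^{\text{diag}}_{m_r} H_r^{T}) = \frac{1}{|H_r|}\, \mathcal{N}(H_r^{-1} o;\, H_r^{-1} \mu_{m_r},\, \Sigma^{\text{diag}}_{m_r}) = |A_r|\, \mathcal{N}(A_r o;\, A_r \mu_{m_r},\, \Sigma^{\text{diag}}_{m_r}) \tag{8.4}$$

where $A_r = H_r^{-1}$. The transformed mean, $A_r \mu_{m_r}$, is stored in the model files rather than the original mean for efficiency.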
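The identity in equation 8.4 can be checked numerically. The sketch below does so for a two-dimensional example using only the standard library; the helper names and the particular values of the transform, mean, variances and observation are invented for the illustration.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def full_gaussian(o, mu, cov):
    """Two-dimensional Gaussian density with a full covariance matrix."""
    d = [o[0] - mu[0], o[1] - mu[1]]
    e = matvec(inv2(cov), d)
    return math.exp(-0.5 * (d[0] * e[0] + d[1] * e[1])) / (2.0 * math.pi * math.sqrt(det2(cov)))


def diag_gaussian(x, mean, var):
    """Gaussian density with a diagonal covariance given as a list of variances."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return p


H = [[2.0, 1.0], [0.0, 1.0]]        # illustrative semi-tied transform
var = [1.0, 0.5]                    # diagonal variances
mu = [0.5, -1.0]
o = [1.0, 0.3]

# Sigma = H diag(var) H^T, element by element
cov = [[sum(H[i][k] * var[k] * H[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
A = inv2(H)

lhs = full_gaussian(o, mu, cov)
rhs = abs(det2(A)) * diag_gaussian(matvec(A, o), matvec(A, mu), var)
```

The right-hand side needs only a matrix-vector product and a diagonal Gaussian evaluation, which is why the transformed form is the one used for likelihood calculation.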
The estimation of semi-tied transforms is a doubly iterative process. Given a current set of covariance matrix estimates the semi-tied transforms are estimated in a similar fashion to the full variance MLLRCOV transforms.

$$a_{ri} = c_{ri}\, G^{(i)-1}_r \sqrt{\left( \frac{\beta_r}{c_{ri}\, G^{(i)-1}_r\, c_{ri}^{T}} \right)} \tag{8.5}$$

where $a_{ri}$ is the $i$th row of $A_r$, the $1 \times n$ row vector $c_{ri}$ is the vector of cofactors of $A_r$, $c_{rij} = \text{cof}(A_{rij})$, and $G^{(i)}_r$ is defined as

$$G^{(i)}_r = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^{\text{diag}\,2}_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, (o(t) - \mu_{m_r})(o(t) - \mu_{m_r})^{T} \tag{8.6}$$
This iteratively estimates one row of the transform at a time. The number of iterations is controlled by the HAdapt configuration variable MAXXFORMITER.
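Having estimated the transform the diagonal covariance matrix is updated as

$$\Sigma^{\text{diag}}_{m_r} = \text{diag}\left( \frac{A_r \sum_{t=1}^{T} L_{m_r}(t)\, (o(t) - \mu_{m_r})(o(t) - \mu_{m_r})^{T} A_r^{T}}{\sum_{t=1}^{T} L_{m_r}(t)} \right) \tag{8.7}$$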
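The update in equation 8.7 only needs the diagonal of the transformed scatter matrix, so each variance can be computed from one row of the transform. The following sketch shows this for a single component; the function name, the argument layout and the toy data are assumptions made for the example.

```python
def update_diag_covariance(A, obs, occ, mean):
    """A: n x n transform; obs[t]: n-dimensional observations;
    occ[t]: occupation of the component at frame t; mean: component mean.
    Returns the n diagonal variances of equation 8.7."""
    n = len(mean)
    # W = sum_t L(t) (o(t) - mu)(o(t) - mu)^T
    W = [[0.0] * n for _ in range(n)]
    for o, weight in zip(obs, occ):
        d = [o[k] - mean[k] for k in range(n)]
        for i in range(n):
            for j in range(n):
                W[i][j] += weight * d[i] * d[j]
    total = sum(occ)
    # The i-th diagonal element of A W A^T is a_i W a_i^T, with a_i the i-th row of A
    return [sum(A[i][k] * W[k][j] * A[i][j] for k in range(n) for j in range(n)) / total
            for i in range(n)]


obs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
occ = [1.0, 1.0, 1.0, 1.0]
plain = update_diag_covariance([[1.0, 0.0], [0.0, 1.0]], obs, occ, [0.0, 0.0])
sheared = update_diag_covariance([[1.0, 1.0], [0.0, 1.0]], obs, occ, [0.0, 0.0])
```

With the identity transform the result is the ordinary occupation-weighted variance of each dimension.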
This is the second loop: given a new estimate of the diagonal variance, a new transform can be estimated. The number of iterations of transform and covariance matrix update is controlled by the HAdapt configuration variable MAXSEMITIEDITER.