
8.7 Parameter Re-Estimation Formulae

8.7.1 Viterbi Training (HInit)

In this style of model training, a set of training observations O^r, 1 ≤ r ≤ R, is used to estimate the parameters of a single HMM by iteratively computing Viterbi alignments. When used to initialise a new HMM, the Viterbi segmentation is replaced by a uniform segmentation (i.e. each training observation is divided into N equal segments) for the first iteration.

Apart from the first iteration on a new model, each training sequence O is segmented using a state alignment procedure which results from maximising

$$\phi_N(T) = \max_i \left[ \phi_i(T)\, a_{iN} \right]$$

for 1 < i < N, where

$$\phi_j(t) = \left[ \max_i \phi_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\phi_1(1) = 1$$
$$\phi_j(1) = a_{1j}\, b_j(o_1)$$

for 1 < j < N. In this and all subsequent cases, the output probability b_j(·) is as defined in equations 7.1 and 7.2 in section 7.1.

If Aij represents the total number of transitions from state i to state j in performing the above maximisations, then the transition probabilities can be estimated from the relative frequencies

$$\hat{a}_{ij} = \frac{A_{ij}}{\sum_{k=2}^{N} A_{ik}}$$
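As a rough illustration (not the HInit source; all names below are invented for the sketch), the recursion and the relative-frequency transition estimate can be transcribed as follows, working in log probabilities with 0-based state indices:

```python
import numpy as np

def viterbi_align(a, log_b):
    """Viterbi state alignment for one training sequence.

    a     : (N, N) transition matrix; indices are 0-based here, so states 0 and
            N-1 are the non-emitting entry and exit states (1 and N in the text).
    log_b : (T, N) log output probabilities log b_j(o_t) for the emitting states.
    Returns the most likely emitting-state sequence of length T.
    """
    N, T = a.shape[0], log_b.shape[0]
    log_a = np.log(a + 1e-300)
    phi = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)

    # initial conditions: phi_j(1) = a_1j b_j(o_1) for the emitting states
    phi[0, 1:N-1] = log_a[0, 1:N-1] + log_b[0, 1:N-1]

    # recursion: phi_j(t) = [max_i phi_i(t-1) a_ij] b_j(o_t)
    for t in range(1, T):
        for j in range(1, N - 1):
            scores = phi[t-1, 1:N-1] + log_a[1:N-1, j]
            back[t, j] = 1 + int(np.argmax(scores))
            phi[t, j] = scores.max() + log_b[t, j]

    # termination: phi_N(T) = max_i phi_i(T) a_iN, then trace back
    state = 1 + int(np.argmax(phi[T-1, 1:N-1] + log_a[1:N-1, N-1]))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t, state]
        path.append(state)
    return path[::-1]

def estimate_transitions(paths, N):
    """Relative-frequency estimate a_ij = A_ij / sum_k A_ik from the alignments."""
    A = np.zeros((N, N))
    for path in paths:
        A[0, path[0]] += 1                          # entry transition
        for s, s_next in zip(path[:-1], path[1:]):
            A[s, s_next] += 1
        A[path[-1], N-1] += 1                       # exit transition
    return A / np.maximum(A.sum(axis=1, keepdims=True), 1)
```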

The sequence of states which maximises φ_N(T) implies an alignment of training data observations with states. Within each state, a further alignment of observations to mixture components is made. The tool HInit provides two mechanisms for this: for each state and each stream

1. use clustering to allocate each observation o_st to one of M_s clusters, or

2. associate each observation o_st with the mixture component with the highest probability.

In either case, the net result is that every observation is associated with a single unique mixture component. This association can be represented by the indicator function ψ^r_jsm(t), which is 1 if o^r_st is associated with mixture component m of stream s of state j and is zero otherwise.

The means and variances are then estimated via simple averages

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)\, o^{r}_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)\, (o^{r}_{st} - \hat{\mu}_{jsm})(o^{r}_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}$$

Finally, the mixture weights are based on the number of observations allocated to each component

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{l=1}^{M_s} \psi^{r}_{jsl}(t)}$$
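Given the hard assignments, these updates are plain weighted averages. A minimal sketch, assuming a single data stream and that each frame already carries a (state, component) assignment produced by one of the two mechanisms above:

```python
import numpy as np

def viterbi_update_state(obs, assign, j, M):
    """Mean, covariance and weight updates for state j from hard assignments.

    obs    : list of R arrays, each (T_r, d), of observation vectors.
    assign : list of R lists of (state, component) pairs, one per frame
             (this plays the role of the indicator function psi).
    """
    d = obs[0].shape[1]
    counts = np.zeros(M)
    means = np.zeros((M, d))
    covs = np.zeros((M, d, d))

    # first pass: occupation counts and mean numerators
    for o_r, a_r in zip(obs, assign):
        for o_t, (state, m) in zip(o_r, a_r):
            if state == j:
                counts[m] += 1
                means[m] += o_t
    means /= np.maximum(counts, 1)[:, None]

    # second pass: scatter around the new means
    for o_r, a_r in zip(obs, assign):
        for o_t, (state, m) in zip(o_r, a_r):
            if state == j:
                diff = o_t - means[m]
                covs[m] += np.outer(diff, diff)
    covs /= np.maximum(counts, 1)[:, None, None]

    weights = counts / max(counts.sum(), 1)        # c_jm from allocation counts
    return means, covs, weights
```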

8.7.2 Forward/Backward Probabilities

Baum-Welch training is similar to the Viterbi training described in the previous section except that the hard boundary implied by the ψ function is replaced by a soft boundary function L which represents the probability of an observation being associated with any given Gaussian mixture component. This occupation probability is computed from the forward and backward probabilities.

For the isolated-unit style of training, the forward probability αj(t) for 1 < j < N and 1 < t ≤ T is calculated by the forward recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\alpha_1(1) = 1$$
$$\alpha_j(1) = a_{1j}\, b_j(o_1)$$

for 1 < j < N and final condition given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}$$

The backward probability βi(t) for 1 < i < N and T > t ≥ 1 is calculated by the backward recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$$

with initial conditions given by

$$\beta_i(T) = a_{iN}$$

for 1 < i < N and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1)$$
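A direct, unscaled transcription of the two recursions might look as follows (illustrative only; a practical implementation would scale the probabilities or work in the log domain to avoid underflow):

```python
import numpy as np

def forward_backward(a, b):
    """Forward/backward probabilities for isolated-unit training.

    a : (N, N) transition matrix; 0-based indices, states 0 and N-1 are the
        non-emitting entry and exit states.
    b : (T, N) output probabilities b_j(o_t) for the emitting states.
    """
    N, T = a.shape[0], b.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # forward: alpha_j(1) = a_1j b_j(o_1), then the recursion
    alpha[0, 1:N-1] = a[0, 1:N-1] * b[0, 1:N-1]
    for t in range(1, T):
        for j in range(1, N - 1):
            alpha[t, j] = (alpha[t-1, 1:N-1] @ a[1:N-1, j]) * b[t, j]
    p_fwd = alpha[T-1, 1:N-1] @ a[1:N-1, N-1]          # alpha_N(T)

    # backward: beta_i(T) = a_iN, then the recursion
    beta[T-1, 1:N-1] = a[1:N-1, N-1]
    for t in range(T - 2, -1, -1):
        for i in range(1, N - 1):
            beta[t, i] = np.sum(a[i, 1:N-1] * b[t+1, 1:N-1] * beta[t+1, 1:N-1])
    p_bwd = np.sum(a[0, 1:N-1] * b[0, 1:N-1] * beta[0, 1:N-1])   # beta_1(1)

    return alpha, beta, p_fwd, p_bwd                   # p_fwd and p_bwd should agree
```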

In the case of embedded training where the HMM spanning the observations is a composite constructed by concatenating Q subword models, it is assumed that at time t, the α and β values corresponding to the entry state and exit states of a HMM represent the forward and backward probabilities at time t−∆t and t+∆t, respectively, where ∆t is small. The equations for calculating α and β are then as follows.

For the forward probability, the initial conditions are established at time t = 1 as follows

$$\alpha^{(q)}_1(1) = \begin{cases} 1 & \text{if } q = 1 \\ \alpha^{(q-1)}_1(1)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(1) = a^{(q)}_{1j}\, b^{(q)}_j(o_1)$$

$$\alpha^{(q)}_{N_q}(1) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(1)\, a^{(q)}_{iN_q}$$

where the superscript in parentheses refers to the index of the model in the sequence of concatenated models. All unspecified values of α are zero. For time t > 1,

$$\alpha^{(q)}_1(t) = \begin{cases} 0 & \text{if } q = 1 \\ \alpha^{(q-1)}_{N_{q-1}}(t-1) + \alpha^{(q-1)}_1(t)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(t) = \left[ \alpha^{(q)}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t-1)\, a^{(q)}_{ij} \right] b^{(q)}_j(o_t)$$

$$\alpha^{(q)}_{N_q}(t) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t)\, a^{(q)}_{iN_q}$$

For the backward probability, the initial conditions are set at time t = T as follows

$$\beta^{(q)}_{N_q}(T) = \begin{cases} 1 & \text{if } q = Q \\ \beta^{(q+1)}_{N_{q+1}}(T)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(T) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(T)$$

$$\beta^{(q)}_1(T) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_T)\, \beta^{(q)}_j(T)$$


where once again, all unspecified β values are zero. For time t < T ,

$$\beta^{(q)}_{N_q}(t) = \begin{cases} 0 & \text{if } q = Q \\ \beta^{(q+1)}_1(t+1) + \beta^{(q+1)}_{N_{q+1}}(t)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(t) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(t) + \sum_{j=2}^{N_q-1} a^{(q)}_{ij}\, b^{(q)}_j(o_{t+1})\, \beta^{(q)}_j(t+1)$$

$$\beta^{(q)}_1(t) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_t)\, \beta^{(q)}_j(t)$$

The total probability P = prob(O|λ) can be computed from either the forward or backward probabilities

$$P = \alpha_N(T) = \beta_1(1)$$
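The forward pass for the composite model can be sketched as below; this is only an unscaled transcription of the α equations above, processing the models in order q = 1, ..., Q at each time step, and the backward pass would follow the β equations analogously:

```python
import numpy as np

def embedded_forward(a_list, b_list, T):
    """Forward pass for a composite HMM built from Q concatenated models.

    a_list : Q transition matrices a^(q), each (N_q, N_q); 0-based indices with
             the non-emitting entry/exit states at 0 and N_q-1.
    b_list : Q output-probability arrays, each (T, N_q).
    The total probability is alphas[-1][T-1, -1], i.e. alpha^(Q)_{N_Q}(T).
    """
    Q = len(a_list)
    alphas = [np.zeros((T, a.shape[0])) for a in a_list]

    for t in range(T):
        for q in range(Q):          # models must be visited in order at each t
            a, b, al = a_list[q], b_list[q], alphas[q]
            Nq = a.shape[0]
            if t == 0:
                al[0, 0] = 1.0 if q == 0 else alphas[q-1][0, 0] * a_list[q-1][0, -1]
                al[0, 1:Nq-1] = a[0, 1:Nq-1] * b[0, 1:Nq-1]
            else:
                al[t, 0] = 0.0 if q == 0 else (alphas[q-1][t-1, -1]
                                               + alphas[q-1][t, 0] * a_list[q-1][0, -1])
                for j in range(1, Nq - 1):
                    al[t, j] = (al[t, 0] * a[0, j]
                                + al[t-1, 1:Nq-1] @ a[1:Nq-1, j]) * b[t, j]
            al[t, -1] = al[t, 1:Nq-1] @ a[1:Nq-1, -1]   # exit state at time t
    return alphas
```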

8.7.3 Single Model Reestimation (HRest)

In this style of model training, a set of training observations O^r, 1 ≤ r ≤ R, is used to estimate the parameters of a single HMM. The basic formula for the reestimation of the transition probabilities is

$$\hat{a}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{r}_i(t)\, a_{ij}\, b_j(o^{r}_{t+1})\, \beta^{r}_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{r}_i(t)\, \beta^{r}_i(t)}$$

where 1 < i < N and 1 < j < N and P_r is the total probability P = prob(O^r|λ) of the r'th observation. The transitions from the non-emitting entry state are reestimated by

$$\hat{a}_{1j} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^{r}_j(1)\, \beta^{r}_j(1)$$

where 1 < j < N and the transitions from the emitting states to the final non-emitting exit state are reestimated by

$$\hat{a}_{iN} = \frac{\sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^{r}_i(T)\, \beta^{r}_i(T)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{r}_i(t)\, \beta^{r}_i(t)}$$

where 1 < i < N.
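Putting the three updates together, the sums can be accumulated one training sequence at a time and combined at the end. The sketch below is illustrative only and assumes the forward/backward values have already been computed:

```python
import numpy as np

def accumulate_transition_stats(alpha, beta, a, b, P, num, den):
    """Accumulate the transition re-estimation sums for one training sequence.

    alpha, beta : (T, N) forward/backward probabilities for this sequence.
    a           : (N, N) current transitions;  b : (T, N) output probabilities.
    P           : total probability of this sequence (P_r).
    num, den    : (N, N) and (N,) accumulators shared across the R sequences;
                  afterwards a_hat[i, j] = num[i, j] / den[i].
    """
    T, N = alpha.shape
    for i in range(1, N - 1):
        den[i] += np.sum(alpha[:, i] * beta[:, i]) / P
        for j in range(1, N - 1):
            num[i, j] += np.sum(alpha[:T-1, i] * a[i, j]
                                * b[1:, j] * beta[1:, j]) / P
        num[i, N-1] += alpha[T-1, i] * beta[T-1, i] / P     # into the exit state
    # transitions out of the entry state: averaged over sequences
    num[0, 1:N-1] += alpha[0, 1:N-1] * beta[0, 1:N-1] / P
    den[0] += 1.0   # so that a_hat[0, j] = (1/R) sum_r alpha_j(1) beta_j(1) / P_r
```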

For a HMM with M_s mixture components in stream s, the means, covariances and mixture weights for that stream are reestimated as follows. Firstly, the probability of occupying the m'th mixture component in stream s at time t for the r'th observation is

$$L^{r}_{jsm}(t) = \frac{1}{P_r}\, U^{r}_j(t)\, c_{jsm}\, b_{jsm}(o^{r}_{st})\, \beta^{r}_j(t)\, b_{js}(o^{r}_t)$$

where

$$U^{r}_j(t) = \begin{cases} a_{1j} & \text{if } t = 1 \\ \sum_{i=2}^{N-1} \alpha^{r}_i(t-1)\, a_{ij} & \text{otherwise} \end{cases} \qquad (8.1)$$

and

$$b_{js}(o^{r}_t) = \prod_{k \ne s} b_{jk}(o^{r}_{kt})$$

For single Gaussian streams, the probability of mixture component occupancy is equal to the probability of state occupancy and hence it is more efficient in this case to use

$$L^{r}_{jsm}(t) = L^{r}_j(t) = \frac{1}{P_r}\, \alpha_j(t)\, \beta_j(t)$$

Given the above definitions, the re-estimation formulae may now be expressed in terms of L^r_jsm(t) as follows.

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)\, o^{r}_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)\, (o^{r}_{st} - \hat{\mu}_{jsm})(o^{r}_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)} \qquad (8.2)$$

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_j(t)}$$
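Given the occupation probabilities, the accumulators for one state and one stream might be implemented as follows (a sketch assuming the L^r_jsm(t) values have already been computed, for example from the forward/backward passes above):

```python
import numpy as np

def baum_welch_update(obs, occ, M, d):
    """Mean/covariance/weight re-estimation for one state and one stream.

    obs : list of R arrays, each (T_r, d), of stream observations o^r_st.
    occ : list of R arrays, each (T_r, M), of occupancies L^r_jsm(t).
    """
    num_mu = np.zeros((M, d))
    num_sigma = np.zeros((M, d, d))
    den = np.zeros(M)
    state_occ = 0.0

    for o_r, L_r in zip(obs, occ):
        den += L_r.sum(axis=0)                      # sum_t L_jsm(t)
        state_occ += L_r.sum()                      # sum_t L_j(t)
        num_mu += L_r.T @ o_r                       # sum_t L_jsm(t) o_st

    mu = num_mu / np.maximum(den, 1e-10)[:, None]

    for o_r, L_r in zip(obs, occ):
        for m in range(M):
            diff = o_r - mu[m]                      # (T_r, d)
            num_sigma[m] += (L_r[:, m, None] * diff).T @ diff

    sigma = num_sigma / np.maximum(den, 1e-10)[:, None, None]
    c = den / max(state_occ, 1e-10)                 # mixture weights
    return mu, sigma, c
```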

8.7.4 Embedded Model Reestimation (HERest)

The re-estimation formulae for the embedded model case have to be modified to take account of the fact that the entry states can be occupied at any time as a result of transitions out of the previous model. The basic formula for the re-estimation of the transition probabilities is

$$\hat{a}^{(q)}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{ij}\, b^{(q)}_j(o^{r}_{t+1})\, \beta^{(q)r}_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$

The transitions from the non-emitting entry states into the HMM are re-estimated by

$$\hat{a}^{(q)}_{1j} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j}\, b^{(q)}_j(o^{r}_t)\, \beta^{(q)r}_j(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_1(t)\, \beta^{(q)r}_1(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$

and the transitions out of the HMM into the non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{iN_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{iN_q}\, \beta^{(q)r}_{N_q}(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$

Finally, the direct transitions from non-emitting entry to non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{1N_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$

The re-estimation formulae for the output distributions are the same as for the single model case except for the obvious additional subscript for q. However, the probability calculations must now allow for transitions from the entry states by changing U^r_j(t) in equation 8.1 to

$$U^{(q)r}_j(t) = \begin{cases} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} & \text{if } t = 1 \\ \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)r}_i(t-1)\, a^{(q)}_{ij} & \text{otherwise} \end{cases}$$

Chapter 9

HMM Adaptation

[Figure: HEAdapt takes a speaker independent model set together with labelled adaptation (enrollment) data and produces a transformed speaker independent model set.]

Chapter 8 described how the parameters are estimated for plain continuous density HMMs within HTK, primarily using the embedded training tool HERest. Using the training strategy depicted in figure 8.2, together with other techniques, can produce high performance speaker independent acoustic models for a large vocabulary recognition system. However it is possible to build improved acoustic models by tailoring a model set to a specific speaker. By collecting data from a speaker and training a model set on this speaker's data alone, the speaker's characteristics can be modelled more accurately. Such systems are commonly known as speaker dependent systems, and on a typical word recognition task, may have half the errors of a speaker independent system. The drawback of speaker dependent systems is that a large amount of data (typically hours) must be collected in order to obtain sufficient model accuracy.

Rather than training speaker dependent models, adaptation techniques can be applied. In this case, by using only a small amount of data from a new speaker, a good speaker independent system model set can be adapted to better fit the characteristics of this new speaker.

Speaker adaptation techniques can be used in various different modes. If the true transcription of the adaptation data is known then it is termed supervised adaptation, whereas if the adaptation data is unlabelled then it is termed unsupervised adaptation. In the case where all the adaptation data is available in one block, e.g. from a speaker enrollment session, then this is termed static adaptation.

Alternatively adaptation can proceed incrementally as adaptation data becomes available, and this is termed incremental adaptation.

HTK provides two tools to adapt continuous density HMMs. HEAdapt performs offline supervised adaptation using maximum likelihood linear regression (MLLR) and/or maximum a-posteriori (MAP) adaptation, while unsupervised adaptation is supported by HVite (using only MLLR).


In this case HVite not only performs recognition, but simultaneously adapts the model set as the data becomes available through recognition. Currently, MLLR adaptation can be applied in both incremental and static modes while MAP supports only static adaptation. If MLLR and MAP adaptation is to be performed simultaneously using HEAdapt in the same pass, then the restriction is that the entire adaptation must be performed statically¹.

This chapter describes the supervised adaptation tool HEAdapt. The first sections of the chapter give an overview of MLLR and MAP adaptation and this is followed by a section describing the general usages of HEAdapt to build simple and more complex adapted systems. The chapter concludes with a section detailing the various formulae used by the adaptation tool. The use of HVite to perform unsupervised adaptation is discussed in section 13.6.2.

9.1 Model Adaptation using MLLR

9.1.1 Maximum Likelihood Linear Regression

Maximum likelihood linear regression or MLLR computes a set of transformations that will reduce the mismatch between an initial model set and the adaptation data². More specifically MLLR is a model adaptation technique that estimates a set of linear transformations for the mean and variance parameters of a Gaussian mixture HMM system. The effect of these transformations is to shift the component means and alter the variances in the initial system so that each state in the HMM system is more likely to generate the adaptation data. Note that due to computational reasons, MLLR is only implemented within HTK for diagonal covariance, single stream, continuous density HMMs.

The transformation matrix used to give a new estimate of the adapted mean is given by

$$\hat{\mu} = W \xi, \qquad (9.1)$$

where W is the n × (n + 1) transformation matrix (where n is the dimensionality of the data) and ξ is the extended mean vector,

$$\xi = [\, w\ \ \mu_1\ \ \mu_2\ \dots\ \mu_n \,]^T$$

where w represents a bias offset whose value is fixed (within HTK) at 1.

Hence W can be decomposed into

$$W = [\, b\ \ A \,] \qquad (9.2)$$

where A represents an n × n transformation matrix and b represents a bias vector.

The transformation matrix W is obtained by solving a maximisation problem using the Expectation-Maximisation (EM) technique. This technique is also used to compute the variance transformation matrix. Using EM results in the maximisation of a standard auxiliary function. (Full details are available in section 9.4.)
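Applying a given mean transform is then a single matrix-vector product, as the following minimal sketch shows (the 3-dimensional transform in the example is purely hypothetical):

```python
import numpy as np

def apply_mllr_mean(W, mu):
    """Adapt one Gaussian mean: mu_hat = W xi with xi = [w, mu_1, ..., mu_n]^T.

    W  : (n, n+1) transform, W = [b A] with the bias column first (w fixed at 1).
    mu : (n,) original mean.
    """
    xi = np.concatenate(([1.0], mu))    # extended mean vector
    return W @ xi                       # equivalently A @ mu + b

# hypothetical 3-dimensional example
W = np.hstack([np.array([[0.1], [0.0], [-0.2]]), 1.05 * np.eye(3)])
print(apply_mllr_mean(W, np.array([1.0, 2.0, 3.0])))
```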

9.1.2 MLLR and Regression Classes

This adaptation method can be applied in a very flexible manner, depending on the amount of adaptation data that is available. If a small amount of data is available then a global adaptation transform can be generated. A global transform (as its name suggests) is applied to every Gaussian component in the model set. However as more adaptation data becomes available, improved adaptation is possible by increasing the number of transformations. Each transformation is now more specific and applied to certain groupings of Gaussian components. For instance the Gaussian components could be grouped into the broad phone classes: silence, vowels, stops, glides, nasals, fricatives, etc. The adaptation data could now be used to construct more specific broad class transforms to apply to these groupings.

Rather than specifying static component groupings or classes, a robust and dynamic method is used for the construction of further transformations as more adaptation data becomes available.

MLLR makes use of a regression class tree to group the Gaussians in the model set, so that the set of transformations to be estimated can be chosen according to the amount and type of adaptation data that is available. The tying of each transformation across a number of mixture components

¹ By using two passes, one could perform incremental MLLR in the first pass (saving the new model or transform), followed by a second pass, this time using MAP adaptation.

² MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or additive noise effects.

makes it possible to adapt distributions for which there were no observations at all. With this process all models can be adapted and the adaptation process is dynamically refined when more adaptation data becomes available.

The regression class tree is constructed so as to cluster together components that are close in acoustic space, so that similar components can be transformed in a similar way. Note that the tree is built using the original speaker independent model set, and is thus independent of any new speaker. The tree is constructed with a centroid splitting algorithm, which uses a Euclidean distance measure. For more details see section 10.7. The terminal nodes or leaves of the tree specify the final component groupings, and are termed the base (regression) classes. Each Gaussian component of a model set belongs to one particular base class. The tool HHEd can be used to build a binary regression class tree, and to label each component with a base class number. Both the tree and component base class numbers are saved automatically as part of the MMF. Please refer to section 7.8 and section 10.7 for further details.
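The clustering idea can be sketched as a recursive two-centroid split of the Gaussian mean vectors. This is only an illustration of centroid splitting with a Euclidean distance, not a description of HHEd's exact algorithm:

```python
import numpy as np

def build_regression_tree(means, depth=2, n_iter=10, seed=0):
    """Grow a binary regression class tree by recursive centroid splitting
    (2-means with a Euclidean distance).  Leaves hold component indices and
    correspond to the base classes."""
    rng = np.random.default_rng(seed)

    def split(indices, level):
        if level == depth or len(indices) < 2:
            return {"components": indices}                  # a base class
        pts = means[indices]
        centroids = pts[rng.choice(len(pts), 2, replace=False)].astype(float)
        for _ in range(n_iter):
            # assign each component mean to its nearest centroid, then re-centre
            dist = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
            assign = dist.argmin(axis=1)
            for k in range(2):
                if np.any(assign == k):
                    centroids[k] = pts[assign == k].mean(axis=0)
        left, right = indices[assign == 0], indices[assign == 1]
        if len(left) == 0 or len(right) == 0:
            return {"components": indices}
        return {"left": split(left, level + 1), "right": split(right, level + 1)}

    return split(np.arange(len(means)), 0)

# example: 100 hypothetical Gaussian means in a 39-dimensional feature space
tree = build_regression_tree(np.random.default_rng(1).normal(size=(100, 39)))
```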

[Fig. 9.1 A binary regression tree: root node 1 with children 2 and 3; node 2 has leaf nodes 4 and 5, node 3 has leaf nodes 6 and 7.]

Figure 9.1 shows a simple example of a binary regression tree with four base classes, denoted as {C4, C5, C6, C7}. During "dynamic" adaptation, the occupation counts are accumulated for each of the regression base classes. The diagram shows a solid arrow and circle (or node), indicating that there is sufficient data for a transformation matrix to be generated using the data associated with that class. A dotted line and circle indicates that there is insufficient data. For example neither node 6 nor 7 has sufficient data; however when pooled at node 3, there is sufficient adaptation data.

The amount of data that is deemed sufficient is set by the user as a command-line option to HEAdapt (see reference section 14.5).

HEAdapt uses a top-down approach to traverse the regression class tree. Here the search starts at the root node and progresses down the tree generating transforms only for those nodes which

1. have sufficient data and

2. are either terminal nodes (i.e. base classes) or have any children without sufficient data.

In the example shown in figure 9.1, transforms are constructed only for regression nodes 2, 3 and 4, which can be denoted as W2, W3 and W4. Hence when the transformed model set is required, the transformation matrices (mean and variance) are applied in the following fashion to the Gaussian components in each base class:

W2 → {C5}
W3 → {C6, C7}
W4 → {C4}



At this point it is interesting to note that the global adaptation case is the same as a tree with just a root node, and is in fact treated as such.
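The node-selection rule can be sketched as a simple top-down traversal. The tree encoding and occupation counts below are invented for the example, chosen to reproduce figure 9.1:

```python
def select_transform_nodes(tree, occupation, threshold):
    """Choose the regression-tree nodes for which transforms are generated:
    a node is selected if it has sufficient occupation and is either a base
    class or has a child without sufficient occupation.  Each Gaussian then
    uses the transform of its deepest selected ancestor."""
    chosen = []

    def visit(node):
        if occupation.get(node["id"], 0.0) < threshold:
            return                                  # nothing below has enough data
        children = [node[k] for k in ("left", "right") if k in node]
        if not children or any(occupation.get(c["id"], 0.0) < threshold
                               for c in children):
            chosen.append(node["id"])
        for c in children:
            visit(c)

    visit(tree)
    return chosen

# occupation counts chosen to reproduce figure 9.1: nodes 6 and 7 lack data,
# so transforms W2, W3 and W4 are generated
tree = {"id": 1,
        "left":  {"id": 2, "left": {"id": 4}, "right": {"id": 5}},
        "right": {"id": 3, "left": {"id": 6}, "right": {"id": 7}}}
occ = {1: 900.0, 2: 500.0, 3: 400.0, 4: 300.0, 5: 200.0, 6: 250.0, 7: 150.0}
print(select_transform_nodes(tree, occ, threshold=300.0))   # -> [2, 4, 3]
```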


9.1.3 Transform Model File Format

HEAdapt estimates the required transformation statistics and can either output a transformed MMF or a transform model file (TMF). The advantage in storing the transforms as opposed to an adapted MMF is that the TMFs are considerably smaller than MMFs (especially triphone MMFs).

This section describes the format of the transform model file in detail.

The mean transformation matrix is stored as a block diagonal transformation matrix. The example block diagonal matrix A shown below contains three blocks. The first block represents the transformation for only the static components of the feature vector, while the second represents the deltas and the third the accelerations. This block diagonal matrix example makes the assumption that for the transformation, there is no correlation between the statics, deltas and delta deltas. In practice this assumption works quite well.

$$A = \begin{bmatrix} A_s & 0 & 0 \\ 0 & A_{\Delta} & 0 \\ 0 & 0 & A_{\Delta^2} \end{bmatrix}$$

This format reduces the number of transformation parameters required to be learnt, making the adaptation process faster. It also reduces the adaptation data required per transform when compared with the full case. When comparing the storage requirements, the 3 block diagonal matrix requires much less storage capacity than the full transform matrix. Note that for convenience a full transformation matrix is also stored as a block diagonal matrix, only in this case there is a single block.
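As a rough illustration of the savings (the 39-dimensional vector with 13 static, 13 delta and 13 acceleration coefficients is an assumed, typical configuration, not something prescribed by the TMF format):

```python
import numpy as np

# With a 39-dimensional feature vector (13 static, 13 delta and 13 acceleration
# coefficients), a full mean transform A has 39*39 = 1521 parameters, while
# three 13x13 diagonal blocks need only 3*169 = 507.
n, k = 39, 13
blocks = [np.eye(k), np.eye(k), np.eye(k)]          # hypothetical per-block transforms
A = np.zeros((n, n))
for i, blk in enumerate(blocks):                    # assemble the block diagonal matrix
    A[i*k:(i+1)*k, i*k:(i+1)*k] = blk
b = np.zeros(n)                                     # hypothetical bias vector
mu_hat = A @ np.ones(n) + b                         # adapted mean: mu_hat = A mu + b
print(A.size, sum(blk.size for blk in blocks))      # 1521 vs 507
```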

The variance transformation is a diagonal matrix and as such is simply stored as a vector.

Figure 9.2 shows a simple example of a TMF. In this case the feature vector has nine dimensions, and the mean transform has three diagonal blocks. The TMF can be saved in ASCII or binary format. The user header is always output in ASCII. The first two fields are speaker descriptor fields.

The next field <MMFID>, the MMF identifier, is obtained from the global options macro in the MMF, while the regression class tree identifier <RCID> is obtained from the regression tree macro name in the MMF. If global adaptation is being performed, then the <RCID> will contain the identifier global, since a tree is unnecessary in the global case. Note that the MMF and regression class tree identifiers are set within the MMF using the tool HHEd. The final two fields are optional, but HEAdapt outputs these anyway for the user's convenience. These can be edited at any time (as can all the fields if desired, but editing the <MMFID> and <RCID> fields should be avoided). The <CHAN> field should represent the adaptation data recording environment. Examples could be a particular microphone name, telephone channel or various background noise conditions. The <DESC> field allows the user to enter any other information deemed useful. An example could be the speaker's dialect region.
