
8.7 Parameter Re-Estimation Formulae

8.7.1 Viterbi Training (HInit)

In this style of model training, a set of training observations O^r, 1 ≤ r ≤ R, is used to estimate the parameters of a single HMM by iteratively computing Viterbi alignments. When used to initialise a new HMM, the Viterbi segmentation is replaced by a uniform segmentation (i.e. each training observation is divided into N equal segments) for the first iteration.

Apart from the first iteration on a new model, each training sequence O is segmented using a state alignment procedure which results from maximising

$$\phi_N(T) = \max_i \left[ \phi_i(T)\, a_{iN} \right]$$

for 1 < i < N, where

$$\phi_j(t) = \left[ \max_i \phi_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\phi_1(1) = 1$$
$$\phi_j(1) = a_{1j}\, b_j(o_1)$$

for 1 < j < N. In this and all subsequent cases, the output probability b_j(·) is as defined in equations 7.1 and 7.2 in section 7.1.

If Aij represents the total number of transitions from state i to state j in performing the above maximisations, then the transition probabilities can be estimated from the relative frequencies

$$\hat{a}_{ij} = \frac{A_{ij}}{\sum_{k=2}^{N} A_{ik}}$$
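As a rough illustration (not the HInit source; all names below are invented for the sketch), the recursion and the relative-frequency transition estimate can be transcribed as follows, working in log probabilities with 0-based state indices:

```python
import numpy as np

def viterbi_align(a, log_b):
    """Viterbi state alignment for one training sequence.

    a     : (N, N) transition matrix; indices are 0-based here, so states 0 and
            N-1 are the non-emitting entry and exit states (1 and N in the text).
    log_b : (T, N) log output probabilities log b_j(o_t) for the emitting states.
    Returns the most likely emitting-state sequence of length T.
    """
    N, T = a.shape[0], log_b.shape[0]
    log_a = np.log(a + 1e-300)
    phi = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)

    # initial conditions: phi_j(1) = a_1j b_j(o_1) for the emitting states
    phi[0, 1:N-1] = log_a[0, 1:N-1] + log_b[0, 1:N-1]

    # recursion: phi_j(t) = [max_i phi_i(t-1) a_ij] b_j(o_t)
    for t in range(1, T):
        for j in range(1, N - 1):
            scores = phi[t-1, 1:N-1] + log_a[1:N-1, j]
            back[t, j] = 1 + int(np.argmax(scores))
            phi[t, j] = scores.max() + log_b[t, j]

    # termination: phi_N(T) = max_i phi_i(T) a_iN, then trace back
    state = 1 + int(np.argmax(phi[T-1, 1:N-1] + log_a[1:N-1, N-1]))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t, state]
        path.append(state)
    return path[::-1]

def estimate_transitions(paths, N):
    """Relative-frequency estimate a_ij = A_ij / sum_k A_ik from the alignments."""
    A = np.zeros((N, N))
    for path in paths:
        A[0, path[0]] += 1                          # entry transition
        for s, s_next in zip(path[:-1], path[1:]):
            A[s, s_next] += 1
        A[path[-1], N-1] += 1                       # exit transition
    return A / np.maximum(A.sum(axis=1, keepdims=True), 1)
```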

The sequence of states which maximises φ_N(T) implies an alignment of training data observations with states. Within each state, a further alignment of observations to mixture components is made. The tool HInit provides two mechanisms for this: for each state and each stream

1. use clustering to allocate each observation o_st to one of M_s clusters, or

2. associate each observation o_st with the mixture component with the highest probability.

In either case, the net result is that every observation is associated with a single unique mixture component. This association can be represented by the indicator function ψ^r_jsm(t), which is 1 if o^r_st is associated with mixture component m of stream s of state j and is zero otherwise.

The means and variances are then estimated via simple averages

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)\, o^{r}_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)\, (o^{r}_{st} - \hat{\mu}_{jsm})(o^{r}_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}$$

Finally, the mixture weights are based on the number of observations allocated to each component

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^{r}_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{l=1}^{M_s} \psi^{r}_{jsl}(t)}$$
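Given the hard assignments, these updates are plain weighted averages. A minimal sketch, assuming a single data stream and that each frame already carries a (state, component) assignment produced by one of the two mechanisms above:

```python
import numpy as np

def viterbi_update_state(obs, assign, j, M):
    """Mean, covariance and weight updates for state j from hard assignments.

    obs    : list of R arrays, each (T_r, d), of observation vectors.
    assign : list of R lists of (state, component) pairs, one per frame
             (this plays the role of the indicator function psi).
    """
    d = obs[0].shape[1]
    counts = np.zeros(M)
    means = np.zeros((M, d))
    covs = np.zeros((M, d, d))

    # first pass: occupation counts and mean numerators
    for o_r, a_r in zip(obs, assign):
        for o_t, (state, m) in zip(o_r, a_r):
            if state == j:
                counts[m] += 1
                means[m] += o_t
    means /= np.maximum(counts, 1)[:, None]

    # second pass: scatter around the new means
    for o_r, a_r in zip(obs, assign):
        for o_t, (state, m) in zip(o_r, a_r):
            if state == j:
                diff = o_t - means[m]
                covs[m] += np.outer(diff, diff)
    covs /= np.maximum(counts, 1)[:, None, None]

    weights = counts / max(counts.sum(), 1)        # c_jm from allocation counts
    return means, covs, weights
```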

8.7.2 Forward/Backward Probabilities

Baum-Welch training is similar to the Viterbi training described in the previous section except that the hard boundary implied by the ψ function is replaced by a soft boundary function L which represents the probability of an observation being associated with any given Gaussian mixture component. This occupation probability is computed from the forward and backward probabilities.

For the isolated-unit style of training, the forward probability αj(t) for 1 < j < N and 1 < t ≤ T is calculated by the forward recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t)$$

with initial conditions given by

$$\alpha_1(1) = 1$$
$$\alpha_j(1) = a_{1j}\, b_j(o_1)$$

for 1 < j < N and final condition given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}$$

The backward probability βi(t) for 1 < i < N and T > t ≥ 1 is calculated by the backward recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$$

with initial conditions given by

$$\beta_i(T) = a_{iN}$$

for 1 < i < N and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1)$$
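A direct, unscaled transcription of the two recursions might look as follows (illustrative only; a practical implementation would scale the probabilities or work in the log domain to avoid underflow):

```python
import numpy as np

def forward_backward(a, b):
    """Forward/backward probabilities for isolated-unit training.

    a : (N, N) transition matrix; 0-based indices, states 0 and N-1 are the
        non-emitting entry and exit states.
    b : (T, N) output probabilities b_j(o_t) for the emitting states.
    """
    N, T = a.shape[0], b.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # forward: alpha_j(1) = a_1j b_j(o_1), then the recursion
    alpha[0, 1:N-1] = a[0, 1:N-1] * b[0, 1:N-1]
    for t in range(1, T):
        for j in range(1, N - 1):
            alpha[t, j] = (alpha[t-1, 1:N-1] @ a[1:N-1, j]) * b[t, j]
    p_fwd = alpha[T-1, 1:N-1] @ a[1:N-1, N-1]          # alpha_N(T)

    # backward: beta_i(T) = a_iN, then the recursion
    beta[T-1, 1:N-1] = a[1:N-1, N-1]
    for t in range(T - 2, -1, -1):
        for i in range(1, N - 1):
            beta[t, i] = np.sum(a[i, 1:N-1] * b[t+1, 1:N-1] * beta[t+1, 1:N-1])
    p_bwd = np.sum(a[0, 1:N-1] * b[0, 1:N-1] * beta[0, 1:N-1])   # beta_1(1)

    return alpha, beta, p_fwd, p_bwd                   # p_fwd and p_bwd should agree
```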

In the case of embedded training where the HMM spanning the observations is a composite constructed by concatenating Q subword models, it is assumed that at time t, the α and β values corresponding to the entry state and exit states of a HMM represent the forward and backward probabilities at time t−∆t and t+∆t, respectively, where ∆t is small. The equations for calculating α and β are then as follows.

For the forward probability, the initial conditions are established at time t = 1 as follows

$$\alpha^{(q)}_1(1) = \begin{cases} 1 & \text{if } q = 1 \\ \alpha^{(q-1)}_1(1)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(1) = a^{(q)}_{1j}\, b^{(q)}_j(o_1)$$

$$\alpha^{(q)}_{N_q}(1) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(1)\, a^{(q)}_{iN_q}$$

where the superscript in parentheses refers to the index of the model in the sequence of concatenated models. All unspecified values of α are zero. For time t > 1,

$$\alpha^{(q)}_1(t) = \begin{cases} 0 & \text{if } q = 1 \\ \alpha^{(q-1)}_{N_{q-1}}(t-1) + \alpha^{(q-1)}_1(t)\, a^{(q-1)}_{1N_{q-1}} & \text{otherwise} \end{cases}$$

$$\alpha^{(q)}_j(t) = \left[ \alpha^{(q)}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t-1)\, a^{(q)}_{ij} \right] b^{(q)}_j(o_t)$$

$$\alpha^{(q)}_{N_q}(t) = \sum_{i=2}^{N_q-1} \alpha^{(q)}_i(t)\, a^{(q)}_{iN_q}$$

For the backward probability, the initial conditions are set at time t = T as follows

$$\beta^{(q)}_{N_q}(T) = \begin{cases} 1 & \text{if } q = Q \\ \beta^{(q+1)}_{N_{q+1}}(T)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(T) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(T)$$

$$\beta^{(q)}_1(T) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_T)\, \beta^{(q)}_j(T)$$


where once again, all unspecified β values are zero. For time t < T ,

$$\beta^{(q)}_{N_q}(t) = \begin{cases} 0 & \text{if } q = Q \\ \beta^{(q+1)}_1(t+1) + \beta^{(q+1)}_{N_{q+1}}(t)\, a^{(q+1)}_{1N_{q+1}} & \text{otherwise} \end{cases}$$

$$\beta^{(q)}_i(t) = a^{(q)}_{iN_q}\, \beta^{(q)}_{N_q}(t) + \sum_{j=2}^{N_q-1} a^{(q)}_{ij}\, b^{(q)}_j(o_{t+1})\, \beta^{(q)}_j(t+1)$$

$$\beta^{(q)}_1(t) = \sum_{j=2}^{N_q-1} a^{(q)}_{1j}\, b^{(q)}_j(o_t)\, \beta^{(q)}_j(t)$$

The total probability P = prob(O|λ) can be computed from either the forward or backward probabilities

$$P = \alpha_N(T) = \beta_1(1)$$
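The forward pass for the composite model can be sketched as below; this is only an unscaled transcription of the α equations above, processing the models in order q = 1, ..., Q at each time step, and the backward pass would follow the β equations analogously:

```python
import numpy as np

def embedded_forward(a_list, b_list, T):
    """Forward pass for a composite HMM built from Q concatenated models.

    a_list : Q transition matrices a^(q), each (N_q, N_q); 0-based indices with
             the non-emitting entry/exit states at 0 and N_q-1.
    b_list : Q output-probability arrays, each (T, N_q).
    The total probability is alphas[-1][T-1, -1], i.e. alpha^(Q)_{N_Q}(T).
    """
    Q = len(a_list)
    alphas = [np.zeros((T, a.shape[0])) for a in a_list]

    for t in range(T):
        for q in range(Q):          # models must be visited in order at each t
            a, b, al = a_list[q], b_list[q], alphas[q]
            Nq = a.shape[0]
            if t == 0:
                al[0, 0] = 1.0 if q == 0 else alphas[q-1][0, 0] * a_list[q-1][0, -1]
                al[0, 1:Nq-1] = a[0, 1:Nq-1] * b[0, 1:Nq-1]
            else:
                al[t, 0] = 0.0 if q == 0 else (alphas[q-1][t-1, -1]
                                               + alphas[q-1][t, 0] * a_list[q-1][0, -1])
                for j in range(1, Nq - 1):
                    al[t, j] = (al[t, 0] * a[0, j]
                                + al[t-1, 1:Nq-1] @ a[1:Nq-1, j]) * b[t, j]
            al[t, -1] = al[t, 1:Nq-1] @ a[1:Nq-1, -1]   # exit state at time t
    return alphas
```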

8.7.3 Single Model Reestimation (HRest)

In this style of model training, a set of training observations O^r, 1 ≤ r ≤ R, is used to estimate the parameters of a single HMM. The basic formula for the reestimation of the transition probabilities is

$$\hat{a}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{r}_i(t)\, a_{ij}\, b_j(o^{r}_{t+1})\, \beta^{r}_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{r}_i(t)\, \beta^{r}_i(t)}$$

where 1 < i < N and 1 < j < N and P_r is the total probability P = prob(O^r|λ) of the r'th observation. The transitions from the non-emitting entry state are reestimated by

$$\hat{a}_{1j} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^{r}_j(1)\, \beta^{r}_j(1)$$

where 1 < j < N and the transitions from the emitting states to the final non-emitting exit state are reestimated by

$$\hat{a}_{iN} = \frac{\sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^{r}_i(T)\, \beta^{r}_i(T)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{r}_i(t)\, \beta^{r}_i(t)}$$

where 1 < i < N.
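Putting the three updates together, the sums can be accumulated one training sequence at a time and combined at the end. The sketch below is illustrative only and assumes the forward/backward values have already been computed:

```python
import numpy as np

def accumulate_transition_stats(alpha, beta, a, b, P, num, den):
    """Accumulate the transition re-estimation sums for one training sequence.

    alpha, beta : (T, N) forward/backward probabilities for this sequence.
    a           : (N, N) current transitions;  b : (T, N) output probabilities.
    P           : total probability of this sequence (P_r).
    num, den    : (N, N) and (N,) accumulators shared across the R sequences;
                  afterwards a_hat[i, j] = num[i, j] / den[i].
    """
    T, N = alpha.shape
    for i in range(1, N - 1):
        den[i] += np.sum(alpha[:, i] * beta[:, i]) / P
        for j in range(1, N - 1):
            num[i, j] += np.sum(alpha[:T-1, i] * a[i, j]
                                * b[1:, j] * beta[1:, j]) / P
        num[i, N-1] += alpha[T-1, i] * beta[T-1, i] / P     # into the exit state
    # transitions out of the entry state: averaged over sequences
    num[0, 1:N-1] += alpha[0, 1:N-1] * beta[0, 1:N-1] / P
    den[0] += 1.0   # so that a_hat[0, j] = (1/R) sum_r alpha_j(1) beta_j(1) / P_r
```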

For a HMM with M_s mixture components in stream s, the means, covariances and mixture weights for that stream are reestimated as follows. Firstly, the probability of occupying the m'th mixture component in stream s at time t for the r'th observation is

$$L^{r}_{jsm}(t) = \frac{1}{P_r}\, U^{r}_j(t)\, c_{jsm}\, b_{jsm}(o^{r}_{st})\, \beta^{r}_j(t)\, b_{js}(o^{r}_t)$$

where

$$U^{r}_j(t) = \begin{cases} a_{1j} & \text{if } t = 1 \\ \sum_{i=2}^{N-1} \alpha^{r}_i(t-1)\, a_{ij} & \text{otherwise} \end{cases} \qquad (8.1)$$

and

$$b_{js}(o^{r}_t) = \prod_{k \ne s} b_{jk}(o^{r}_{kt})$$

For single Gaussian streams, the probability of mixture component occupancy is equal to the probability of state occupancy and hence it is more efficient in this case to use

$$L^{r}_{jsm}(t) = L^{r}_j(t) = \frac{1}{P_r}\, \alpha_j(t)\, \beta_j(t)$$

Given the above definitions, the re-estimation formulae may now be expressed in terms of L^r_jsm(t) as follows.

$$\hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)\, o^{r}_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)}$$

$$\hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)\, (o^{r}_{st} - \hat{\mu}_{jsm})(o^{r}_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)} \qquad (8.2)$$

$$c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_j(t)}$$
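Given the occupation probabilities, the accumulators for one state and one stream might be implemented as follows (a sketch assuming the L^r_jsm(t) values have already been computed, for example from the forward/backward passes above):

```python
import numpy as np

def baum_welch_update(obs, occ, M, d):
    """Mean/covariance/weight re-estimation for one state and one stream.

    obs : list of R arrays, each (T_r, d), of stream observations o^r_st.
    occ : list of R arrays, each (T_r, M), of occupancies L^r_jsm(t).
    """
    num_mu = np.zeros((M, d))
    num_sigma = np.zeros((M, d, d))
    den = np.zeros(M)
    state_occ = 0.0

    for o_r, L_r in zip(obs, occ):
        den += L_r.sum(axis=0)                      # sum_t L_jsm(t)
        state_occ += L_r.sum()                      # sum_t L_j(t)
        num_mu += L_r.T @ o_r                       # sum_t L_jsm(t) o_st

    mu = num_mu / np.maximum(den, 1e-10)[:, None]

    for o_r, L_r in zip(obs, occ):
        for m in range(M):
            diff = o_r - mu[m]                      # (T_r, d)
            num_sigma[m] += (L_r[:, m, None] * diff).T @ diff

    sigma = num_sigma / np.maximum(den, 1e-10)[:, None, None]
    c = den / max(state_occ, 1e-10)                 # mixture weights
    return mu, sigma, c
```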

8.7.4 Embedded Model Reestimation (HERest)

The re-estimation formulae for the embedded model case have to be modified to take account of the fact that the entry states can be occupied at any time as a result of transitions out of the previous model. The basic formula for the re-estimation of the transition probabilities is

$$\hat{a}^{(q)}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{ij}\, b^{(q)}_j(o^{r}_{t+1})\, \beta^{(q)r}_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$

The transitions from the non-emitting entry states into the HMM are re-estimated by

$$\hat{a}^{(q)}_{1j} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j}\, b^{(q)}_j(o^{r}_t)\, \beta^{(q)r}_j(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_1(t)\, \beta^{(q)r}_1(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$

and the transitions out of the HMM into the non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{iN_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_i(t)\, a^{(q)}_{iN_q}\, \beta^{(q)r}_{N_q}(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t)}$$

Finally, the direct transitions from non-emitting entry to non-emitting exit states are re-estimated by

$$\hat{a}^{(q)}_{1N_q} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r-1} \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^{(q)r}_i(t)\, \beta^{(q)r}_i(t) + \alpha^{(q)r}_1(t)\, a^{(q)}_{1N_q}\, \beta^{(q+1)r}_1(t)}$$

The re-estimation formulae for the output distributions are the same as for the single model case except for the obvious additional subscript for q. However, the probability calculations must now allow for transitions from the entry states by changing U^r_j(t) in equation 8.1 to

$$U^{(q)r}_j(t) = \begin{cases} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} & \text{if } t = 1 \\ \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q-1} \alpha^{(q)r}_i(t-1)\, a^{(q)}_{ij} & \text{otherwise} \end{cases}$$

Chapter 9

HMM Adaptation

[Figure: HEAdapt takes a speaker independent model set together with labelled adaptation (enrollment) data and produces a transformed speaker independent model set.]

Chapter 8 described how the parameters are estimated for plain continuous density HMMs within HTK, primarily using the embedded training tool HERest. Using the training strategy depicted in figure 8.2, together with other techniques, can produce high performance speaker independent acoustic models for a large vocabulary recognition system. However it is possible to build improved acoustic models by tailoring a model set to a specific speaker. By collecting data from a speaker and training a model set on this speaker's data alone, the speaker's characteristics can be modelled more accurately. Such systems are commonly known as speaker dependent systems, and on a typical word recognition task, may have half the errors of a speaker independent system. The drawback of speaker dependent systems is that a large amount of data (typically hours) must be collected in order to obtain sufficient model accuracy.

Rather than training speaker dependent models, adaptation techniques can be applied. In this case, by using only a small amount of data from a new speaker, a good speaker independent system model set can be adapted to better fit the characteristics of this new speaker.

Speaker adaptation techniques can be used in various different modes. If the true transcription of the adaptation data is known then it is termed supervised adaptation, whereas if the adaptation data is unlabelled then it is termed unsupervised adaptation. In the case where all the adaptation data is available in one block, e.g. from a speaker enrollment session, then this is termed static adaptation.

Alternatively adaptation can proceed incrementally as adaptation data becomes available, and this is termed incremental adaptation.

HTK provides two tools to adapt continuous density HMMs. HEAdapt performs offline supervised adaptation using maximum likelihood linear regression (MLLR) and/or maximum a-posteriori (MAP) adaptation, while unsupervised adaptation is supported by HVite (using only MLLR).


In this case HVite not only performs recognition, but simultaneously adapts the model set as the data becomes available through recognition. Currently, MLLR adaptation can be applied in both incremental and static modes while MAP supports only static adaptation. If MLLR and MAP adaptation is to be performed simultaneously using HEAdapt in the same pass, then the restriction is that the entire adaptation must be performed statically¹.

This chapter describes the supervised adaptation tool HEAdapt. The first sections of the chapter give an overview of MLLR and MAP adaptation and this is followed by a section describing the general usages of HEAdapt to build simple and more complex adapted systems. The chapter concludes with a section detailing the various formulae used by the adaptation tool. The use of HVite to perform unsupervised adaptation is discussed in section 13.6.2.

9.1 Model Adaptation using MLLR

9.1.1 Maximum Likelihood Linear Regression

Maximum likelihood linear regression or MLLR computes a set of transformations that will reduce the mismatch between an initial model set and the adaptation data². More specifically MLLR is a model adaptation technique that estimates a set of linear transformations for the mean and variance parameters of a Gaussian mixture HMM system. The effect of these transformations is to shift the component means and alter the variances in the initial system so that each state in the HMM system is more likely to generate the adaptation data. Note that due to computational reasons, MLLR is only implemented within HTK for diagonal covariance, single stream, continuous density HMMs.

The transformation matrix used to give a new estimate of the adapted mean is given by

$$\hat{\mu} = W \xi, \qquad (9.1)$$

where W is the n × (n + 1) transformation matrix (where n is the dimensionality of the data) and ξ is the extended mean vector,

$$\xi = [\, w\ \ \mu_1\ \ \mu_2\ \dots\ \mu_n \,]^T$$

where w represents a bias offset whose value is fixed (within HTK) at 1.

Hence W can be decomposed into

$$W = [\, b\ \ A \,] \qquad (9.2)$$

where A represents an n × n transformation matrix and b represents a bias vector.

The transformation matrix W is obtained by solving a maximisation problem using the Expectation-Maximisation (EM) technique. This technique is also used to compute the variance transformation matrix. Using EM results in the maximisation of a standard auxiliary function. (Full details are available in section 9.4.)
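Applying a given mean transform is then a single matrix-vector product, as the following minimal sketch shows (the 3-dimensional transform in the example is purely hypothetical):

```python
import numpy as np

def apply_mllr_mean(W, mu):
    """Adapt one Gaussian mean: mu_hat = W xi with xi = [w, mu_1, ..., mu_n]^T.

    W  : (n, n+1) transform, W = [b A] with the bias column first (w fixed at 1).
    mu : (n,) original mean.
    """
    xi = np.concatenate(([1.0], mu))    # extended mean vector
    return W @ xi                       # equivalently A @ mu + b

# hypothetical 3-dimensional example
W = np.hstack([np.array([[0.1], [0.0], [-0.2]]), 1.05 * np.eye(3)])
print(apply_mllr_mean(W, np.array([1.0, 2.0, 3.0])))
```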

9.1.2 MLLR and Regression Classes

This adaptation method can be applied in a very flexible manner, depending on the amount of adaptation data that is available. If a small amount of data is available then a global adaptation transform can be generated. A global transform (as its name suggests) is applied to every Gaussian component in the model set. However as more adaptation data becomes available, improved adaptation is possible by increasing the number of transformations. Each transformation is now more specific and applied to certain groupings of Gaussian components. For instance the Gaussian components could be grouped into the broad phone classes: silence, vowels, stops, glides, nasals, fricatives, etc. The adaptation data could now be used to construct more specific broad class transforms to apply to these groupings.

Rather than specifying static component groupings or classes, a robust and dynamic method is used for the construction of further transformations as more adaptation data becomes available.

MLLR makes use of a regression class tree to group the Gaussians in the model set, so that the set of transformations to be estimated can be chosen according to the amount and type of adaptation data that is available. The tying of each transformation across a number of mixture components

¹ By using two passes, one could perform incremental MLLR in the first pass (saving the new model or transform), followed by a second pass, this time using MAP adaptation.

² MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or additive noise effects.

makes it possible to adapt distributions for which there were no observations at all. With this process all models can be adapted and the adaptation process is dynamically refined when more adaptation data becomes available.

The regression class tree is constructed so as to cluster together components that are close in acoustic space, so that similar components can be transformed in a similar way. Note that the tree is built using the original speaker independent model set, and is thus independent of any new speaker. The tree is constructed with a centroid splitting algorithm, which uses a Euclidean distance measure. For more details see section 10.7. The terminal nodes or leaves of the tree specify the final component groupings, and are termed the base (regression) classes. Each Gaussian component of a model set belongs to one particular base class. The tool HHEd can be used to build a binary regression class tree, and to label each component with a base class number. Both the tree and component base class numbers are saved automatically as part of the MMF. Please refer to section 7.8 and section 10.7 for further details.
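The clustering idea can be sketched as a recursive two-centroid split of the Gaussian mean vectors. This is only an illustration of centroid splitting with a Euclidean distance, not a description of HHEd's exact algorithm:

```python
import numpy as np

def build_regression_tree(means, depth=2, n_iter=10, seed=0):
    """Grow a binary regression class tree by recursive centroid splitting
    (2-means with a Euclidean distance).  Leaves hold component indices and
    correspond to the base classes."""
    rng = np.random.default_rng(seed)

    def split(indices, level):
        if level == depth or len(indices) < 2:
            return {"components": indices}                  # a base class
        pts = means[indices]
        centroids = pts[rng.choice(len(pts), 2, replace=False)].astype(float)
        for _ in range(n_iter):
            # assign each component mean to its nearest centroid, then re-centre
            dist = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
            assign = dist.argmin(axis=1)
            for k in range(2):
                if np.any(assign == k):
                    centroids[k] = pts[assign == k].mean(axis=0)
        left, right = indices[assign == 0], indices[assign == 1]
        if len(left) == 0 or len(right) == 0:
            return {"components": indices}
        return {"left": split(left, level + 1), "right": split(right, level + 1)}

    return split(np.arange(len(means)), 0)

# example: 100 hypothetical Gaussian means in a 39-dimensional feature space
tree = build_regression_tree(np.random.default_rng(1).normal(size=(100, 39)))
```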

[Fig. 9.1 A binary regression tree: root node 1 with children 2 and 3; node 2 has leaf nodes 4 and 5, node 3 has leaf nodes 6 and 7.]

Figure 9.1 shows a simple example of a binary regression tree with four base classes, denoted as {C4, C5, C6, C7}. During "dynamic" adaptation, the occupation counts are accumulated for each of the regression base classes. The diagram shows a solid arrow and circle (or node), indicating that there is sufficient data for a transformation matrix to be generated using the data associated with that class. A dotted line and circle indicates that there is insufficient data. For example neither node 6 nor 7 has sufficient data; however when pooled at node 3, there is sufficient adaptation data.

The amount of data that is deemed sufficient is set by the user as a command-line option to HEAdapt (see reference section 14.5).

HEAdapt uses a top-down approach to traverse the regression class tree. Here the search starts at the root node and progresses down the tree generating transforms only for those nodes which

1. have sufficient data and

2. are either terminal nodes (i.e. base classes) or have any children without sufficient data.

In the example shown in figure 9.1, transforms are constructed only for regression nodes 2, 3 and 4, which can be denoted as W2, W3 and W4. Hence when the transformed model set is required, the transformation matrices (mean and variance) are applied in the following fashion to the Gaussian components in each base class:

W2 → {C5}
W3 → {C6, C7}
W4 → {C4}



At this point it is interesting to note that the global adaptation case is the same as a tree with just a root node, and is in fact treated as such.
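The node-selection rule can be sketched as a simple top-down traversal. The tree encoding and occupation counts below are invented for the example, chosen to reproduce figure 9.1:

```python
def select_transform_nodes(tree, occupation, threshold):
    """Choose the regression-tree nodes for which transforms are generated:
    a node is selected if it has sufficient occupation and is either a base
    class or has a child without sufficient occupation.  Each Gaussian then
    uses the transform of its deepest selected ancestor."""
    chosen = []

    def visit(node):
        if occupation.get(node["id"], 0.0) < threshold:
            return                                  # nothing below has enough data
        children = [node[k] for k in ("left", "right") if k in node]
        if not children or any(occupation.get(c["id"], 0.0) < threshold
                               for c in children):
            chosen.append(node["id"])
        for c in children:
            visit(c)

    visit(tree)
    return chosen

# occupation counts chosen to reproduce figure 9.1: nodes 6 and 7 lack data,
# so transforms W2, W3 and W4 are generated
tree = {"id": 1,
        "left":  {"id": 2, "left": {"id": 4}, "right": {"id": 5}},
        "right": {"id": 3, "left": {"id": 6}, "right": {"id": 7}}}
occ = {1: 900.0, 2: 500.0, 3: 400.0, 4: 300.0, 5: 200.0, 6: 250.0, 7: 150.0}
print(select_transform_nodes(tree, occ, threshold=300.0))   # -> [2, 4, 3]
```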


9.1.3 Transform Model File Format

HEAdapt estimates the required transformation statistics and can either output a transformed MMF or a transform model file (TMF). The advantage in storing the transforms as opposed to an adapted MMF is that the TMFs are considerably smaller than MMFs (especially triphone MMFs).

This section describes the format of the transform model file in detail.

The mean transformation matrix is stored as a block diagonal transformation matrix. The example block diagonal matrix A shown below contains three blocks. The first block represents the transformation for only the static components of the feature vector, while the second represents the deltas and the third the accelerations. This block diagonal matrix example makes the assumption that for the transformation, there is no correlation between the statics, deltas and delta deltas. In practice this assumption works quite well.

$$A = \begin{bmatrix} A_s & 0 & 0 \\ 0 & A_{\Delta} & 0 \\ 0 & 0 & A_{\Delta^2} \end{bmatrix}$$

This format reduces the number of transformation parameters required to be learnt, making the adaptation process faster. It also reduces the adaptation data required per transform when compared with the full case. When comparing the storage requirements, the 3 block diagonal matrix requires much less storage capacity than the full transform matrix. Note that for convenience a full transformation matrix is also stored as a block diagonal matrix, only in this case there is a single block.
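As a rough illustration of the savings (the 39-dimensional vector with 13 static, 13 delta and 13 acceleration coefficients is an assumed, typical configuration, not something prescribed by the TMF format):

```python
import numpy as np

# With a 39-dimensional feature vector (13 static, 13 delta and 13 acceleration
# coefficients), a full mean transform A has 39*39 = 1521 parameters, while
# three 13x13 diagonal blocks need only 3*169 = 507.
n, k = 39, 13
blocks = [np.eye(k), np.eye(k), np.eye(k)]          # hypothetical per-block transforms
A = np.zeros((n, n))
for i, blk in enumerate(blocks):                    # assemble the block diagonal matrix
    A[i*k:(i+1)*k, i*k:(i+1)*k] = blk
b = np.zeros(n)                                     # hypothetical bias vector
mu_hat = A @ np.ones(n) + b                         # adapted mean: mu_hat = A mu + b
print(A.size, sum(blk.size for blk in blocks))      # 1521 vs 507
```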

The variance transformation is a diagonal matrix and as such is simply stored as a vector.

Figure 9.2 shows a simple example of a TMF. In this case the feature vector has nine dimensions, and the mean transform has three diagonal blocks. The TMF can be saved in ASCII or binary format. The user header is always output in ASCII. The first two fields are speaker descriptor fields.

The next field <MMFID>, the MMF identifier, is obtained from the global options macro in the MMF, while the regression class tree identifier <RCID> is obtained from the regression tree macro name in the MMF. If global adaptation is being performed, then the <RCID> will contain the identifier global, since a tree is unnecessary in the global case. Note that the MMF and regression class tree identifiers are set within the MMF using the tool HHEd. The final two fields are optional, but HEAdapt outputs these anyway for the user's convenience. These can be edited at any time (as can all the fields if desired, but editing the <MMFID> and <RCID> fields should be avoided). The <CHAN> field should represent the adaptation data recording environment. Examples could be a particular microphone name, telephone channel or various background noise conditions. The <DESC> field allows the user to enter any other information deemed useful. An example could be the speaker's dialect region.
