Linear Transformation Estimation Formulae



results in a simple adaptation formula. The update formula for a single stream system for state j and mixture component m is

$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm} \qquad (9.8)$$

where τ is a weighting of the a priori knowledge to the adaptation speech data and N is the occupation likelihood of the adaptation data, defined as,

$$N_{jm} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t)$$

where $\mu_{jm}$ is the speaker independent mean and $\bar{\mu}_{jm}$ is the mean of the observed adaptation data, defined as

$$\bar{\mu}_{jm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t)\, o^r_t}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t)}$$

As can be seen, if the occupation likelihood of a Gaussian component ($N_{jm}$) is small, then the mean MAP estimate will remain close to the speaker independent component mean. With MAP adaptation, every single mean component in the system is updated with a MAP estimate, based on the prior mean, the weighting and the adaptation data. Hence, MAP adaptation requires a new "speaker-dependent" model set to be saved.
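The behaviour of equation 9.8 can be illustrated with a minimal Python sketch. The function name `map_mean` and its argument layout are hypothetical and are not part of HTK; for brevity the frames of all adaptation utterances are assumed to be pooled into a single list.

```python
def map_mean(prior_mean, frames, occupancies, tau):
    """MAP estimate of one Gaussian mean, following equation 9.8.

    prior_mean  : speaker independent mean (list of floats)
    frames      : adaptation observations o_t (list of vectors)
    occupancies : occupation likelihoods L(t), one per frame
    tau         : weight given to the prior mean
    """
    n = sum(occupancies)  # N_jm, the total occupation likelihood
    if n == 0.0:
        return list(prior_mean)  # no adaptation data: keep the prior mean
    dim = len(prior_mean)
    # Occupancy-weighted mean of the adaptation data
    data_mean = [
        sum(l * o[i] for l, o in zip(occupancies, frames)) / n
        for i in range(dim)
    ]
    # Interpolate between the data mean and the prior mean
    return [
        (n / (n + tau)) * data_mean[i] + (tau / (n + tau)) * prior_mean[i]
        for i in range(dim)
    ]
```

With a total occupancy equal to τ the estimate lies halfway between the prior mean and the data mean; as the occupancy falls towards zero the estimate approaches the prior mean.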

One obvious drawback to MAP adaptation is that it requires more adaptation data to be effective when compared to MLLR, because MAP adaptation is specifically defined at the component level.

When larger amounts of adaptation training data become available, MAP begins to perform better than MLLR, due to this detailed update of each component (rather than the pooled Gaussian transformation approach of MLLR). In fact the two adaptation processes can be combined to improve performance still further, by using the MLLR transformed means as the priors for MAP adaptation (by replacing $\mu_{jm}$ in equation 9.8 with the transformed mean of equation 9.1). In this case components that have a low occupation likelihood in the adaptation data (and hence would not change much using MAP alone) have been adapted using a regression class transform in MLLR.

An example usage is shown in the following section.

9.4 Linear Transformation Estimation Formulae

For reference purposes, this section lists the various formulae employed within the HTK adaptation tool. It is assumed throughout that single stream data is used and that diagonal covariances are also used. All are standard and can be found in various literature.

The following notation is used in this section:

$\mathcal{M}$    the model set

$\hat{\mathcal{M}}$    the adapted model set

$T$    number of observations

$m$    a mixture component

$O$    a sequence of $d$-dimensional observations

$o(t)$    the observation at time $t$, $1 \le t \le T$

$\zeta(t)$    extended observation at time $t$, $1 \le t \le T$

$\mu_{m_r}$    mean vector for the mixture component $m_r$

$\xi_{m_r}$    extended mean vector for the mixture component $m_r$

$\Sigma_{m_r}$    covariance matrix for the mixture component $m_r$

$L_{m_r}(t)$    the occupancy probability for the mixture component $m_r$ at time $t$

To enable robust transformations to be trained, the transform matrices are tied across a number of Gaussians. The set of Gaussians which share a transform is referred to as a regression class. For a particular transform case Wr, the Mr Gaussian components {m1, m2, . . . , mMr} will be tied together, as determined by the regression class tree (see section 9.1.4). The standard auxiliary function shown below is used to estimate the transforms.

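$$Q(\mathcal{M}, \hat{\mathcal{M}}) = -\frac{1}{2} \sum_{r=1}^{R} \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \left[ K^{(m)} + \log(|\hat{\Sigma}_{m_r}|) + (o(t) - \hat{\mu}_{m_r})^T \hat{\Sigma}_{m_r}^{-1} (o(t) - \hat{\mu}_{m_r}) \right]$$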


where $K^{(m)}$ subsumes all constants and $L_{m_r}(t)$, the occupation likelihood, is defined as

$$L_{m_r}(t) = p(q_{m_r}(t) \mid \mathcal{M}, O_T)$$

where $q_{m_r}(t)$ indicates the Gaussian component $m_r$ at time $t$, and $O_T = \{o(1), \ldots, o(T)\}$ is the adaptation data. The occupation likelihood is obtained from the forward-backward process described in section 8.8.

9.4.1 Mean Transformation Matrix (MLLRMEAN)

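Substituting the expressions for MLLR mean adaptation

$$\hat{\mu}_{m_r} = W_r \xi_{m_r}, \qquad \hat{\Sigma}_{m_r} = \Sigma_{m_r} \qquad (9.9)$$

into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields

$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K - \frac{1}{2} \sum_{r=1}^{R} \sum_{j=1}^{d} \left( w_{rj} G_r^{(j)} w_{rj}^T - 2 w_{rj} k_r^{(j)T} \right)$$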

where $w_{rj}$ is the $j$th row of $W_r$,

$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}}\, \xi_{m_r} \xi_{m_r}^T \sum_{t=1}^{T} L_{m_r}(t) \qquad (9.10)$$

and

$$k_r^{(i)} = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)\, \frac{1}{\sigma^2_{m_r i}}\, o_i(t)\, \xi_{m_r}^T \qquad (9.11)$$

Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it with respect to the transformed mean yields the following update

$$w_{ri} = k_r^{(i)} G_r^{(i)-1} \qquad (9.12)$$
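Equations 9.10 to 9.12 can be sketched in a few lines of plain Python. This is an illustration only, not the HTK implementation: the names `solve` and `mllr_mean_row` and the dictionary layout are hypothetical, and the extended mean vector is assumed to be the mean with a leading bias term of 1.

```python
def solve(a, b):
    """Solve a x = b for square a by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mllr_mean_row(i, gaussians):
    """Row i of W_r for one regression class.

    gaussians: list of dicts with keys 'mean' (list), 'var' (diagonal
    variances), 'occ' (L(t) per frame) and 'obs' (observation vectors).
    """
    size = len(gaussians[0]['mean']) + 1
    G = [[0.0] * size for _ in range(size)]
    k = [0.0] * size
    for g in gaussians:
        xi = [1.0] + list(g['mean'])   # assumed form of the extended mean
        inv_var = 1.0 / g['var'][i]
        occ_sum = sum(g['occ'])
        weighted_obs = sum(l * o[i] for l, o in zip(g['occ'], g['obs']))
        for a in range(size):
            k[a] += inv_var * weighted_obs * xi[a]               # equation 9.11
            for b in range(size):
                G[a][b] += inv_var * occ_sum * xi[a] * xi[b]     # equation 9.10
    # w = k G^{-1}; G is symmetric, so this equals the solution of G w^T = k^T
    return solve(G, k)
```

For two one-dimensional Gaussians with means 0 and 2 whose frames are observed at 1 and 5 respectively, the estimated row is approximately `[1.0, 2.0]`, that is, an offset of 1 and a scale of 2. At least as many distinct component means as the row has elements are needed for G to be invertible.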

The above expressions assume that each base regression class $r$ has a separate transform. If regression class trees are used then the shared transform parameters may be simply estimated by combining the statistics of the base regression classes. The regression class tree is used to generate the classes dynamically, so it is not known a priori which regression classes will be used to estimate the transform. This does not present a problem, since $G^{(i)}$ and $k^{(i)}$ for the chosen regression class may be obtained from its child classes (as defined by the tree). If the parent node $R$ has children $\{R_1, \ldots, R_C\}$ then

$$k^{(i)} = \sum_{c=1}^{C} k^{(i)}_{R_c}$$

and

$$G^{(i)} = \sum_{c=1}^{C} G^{(i)}_{R_c}$$

The same approach of combining statistics from multiple children can be applied to all the estimation formulae in this section.
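Because the statistics are plain sums over components and frames, combining children amounts to element-wise addition. A minimal sketch follows; the function name `combine_children` and the `(G, k)` pair layout are hypothetical.

```python
def combine_children(children):
    """Sum the row-i statistics of the child regression classes.

    children: list of (G, k) pairs, where G is a square matrix
    (list of lists) and k is a row vector (list).
    """
    size = len(children[0][1])
    G = [[0.0] * size for _ in range(size)]
    k = [0.0] * size
    for child_G, child_k in children:
        for a in range(size):
            k[a] += child_k[a]
            for b in range(size):
                G[a][b] += child_G[a][b]
    return G, k
```

The parent transform is then estimated from the combined `G` and `k` exactly as for a base class.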

9.4.2 Variance Transformation Matrix (MLLRVAR, MLLRCOV)

Estimation of the first variance transformation matrices is only available for diagonal covariance Gaussian systems in the current implementation, though full transforms can in theory be estimated.

The Gaussian covariance is transformed using⁵

$$\hat{\mu}_{m_r} = \mu_{m_r}, \qquad \hat{\Sigma}_{m_r} = B_{m_r}^T H_r B_{m_r}$$

where $H_r$ is the linear transformation to be estimated and $B_{m_r}$ is the inverse of the Choleski factor of $\Sigma_{m_r}^{-1}$, so

$$\Sigma_{m_r}^{-1} = C_{m_r} C_{m_r}^T$$

and

$$B_{m_r} = C_{m_r}^{-1}$$

⁵ In the current implementation of the code this form of transform can only be estimated in addition to the MLLRMEAN transform.

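After rewriting the auxiliary function, the transform matrix $H_r$ is estimated from

$$H_r = \frac{\sum_{m_r=1}^{M_r} C_{m_r}^T \left[ \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^T \right] C_{m_r}}{\sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)}$$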

Here, $H_r$ is forced to be a diagonal transformation by setting the off-diagonal terms to zero, which ensures that $\hat{\Sigma}_{m_r}$ is also diagonal.
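For diagonal covariances the Choleski factor of the inverse covariance is simply diag(1/σ), so each diagonal element of the transform reduces to an occupancy-weighted average of normalised squared deviations. The sketch below assumes that the statistics are summed over all frames and all components of the regression class; the function names and the dictionary layout are hypothetical.

```python
def mllrvar_diagonal(gaussians):
    """Diagonal of H_r for diagonal-covariance Gaussians.

    gaussians: list of dicts with 'mean' (already mean-adapted), 'var',
    'occ' (L(t) per frame) and 'obs' (observation vectors).
    """
    dim = len(gaussians[0]['mean'])
    num = [0.0] * dim
    den = 0.0
    for g in gaussians:
        den += sum(g['occ'])
        for l, o in zip(g['occ'], g['obs']):
            for i in range(dim):
                # C = diag(1 / sigma), so C^T (.) C scales entry (i, i) by 1 / var_i
                num[i] += l * (o[i] - g['mean'][i]) ** 2 / g['var'][i]
    return [n / den for n in num]


def adapt_variance(var, h_diag):
    """Sigma_hat = B^T H B with B = diag(sigma): variance i is scaled by H_ii."""
    return [v * h for v, h in zip(var, h_diag)]
```

A single Gaussian with variance 2 whose frames deviate from the mean by ±2 gives a diagonal element of 2 and therefore an adapted variance of 4, which matches the variance of the adaptation data about the mean.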

The alternative form of variance adaptation is supported for full, block and diagonal transforms.

Substituting the expressions for variance adaptation

$$\hat{\mu}_{m_r} = \mu_{m_r}, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^T \qquad (9.13)$$

into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields

$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left( \beta_r \log(c_{ri} a_{ri}^T) - \frac{1}{2} \sum_{j=1}^{d} a_{rj} G_r^{(j)} a_{rj}^T \right)$$

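where

$$\beta_r = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \qquad (9.14)$$

$$A_r = H_r^{-1} \qquad (9.15)$$

$a_{ri}$ is the $i$th row of $A_r$, the $1 \times n$ row vector $c_{ri}$ is the vector of cofactors of $A_r$, $c_{rij} = \mathrm{cof}(A_{rij})$, and $G_r^{(i)}$ is defined as

$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^T \qquad (9.16)$$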

Differentiating the auxiliary function with respect to the transform $A_r$, and then maximising it with respect to the transformed mean yields the following update

$$a_{ri} = c_{ri} G_r^{(i)-1} \sqrt{\frac{\beta_r}{c_{ri} G_r^{(i)-1} c_{ri}^T}} \qquad (9.17)$$

This is an iterative optimisation scheme as the cofactors mean the estimate of row i is dependent on all the other rows (in that block). For the diagonal transform case it is of course non-iterative and simplifies to the same form as the MLLRVAR transform.
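The diagonal case can be checked directly from the auxiliary function. The following is a sketch under the assumption that $A_r$ is restricted to $\mathrm{diag}(a_{r11}, \ldots, a_{rdd})$ with positive entries, so that $c_{ri} a_{ri}^T = |A_r| = \prod_j a_{rjj}$ and only the $(j,j)$ element of $G_r^{(j)}$, written here as $g_{rj}$ (a symbol introduced for this illustration), contributes:

$$Q = K + \sum_{r=1}^{R} \left( \beta_r \sum_{j=1}^{d} \log a_{rjj} - \frac{1}{2} \sum_{j=1}^{d} a_{rjj}^2\, g_{rj} \right)$$

$$\frac{\partial Q}{\partial a_{rii}} = \frac{\beta_r}{a_{rii}} - a_{rii}\, g_{ri} = 0 \quad \Longrightarrow \quad a_{rii}^2 = \frac{\beta_r}{g_{ri}}$$

Each row is solved independently of the others, so no iteration is required. Since $H_r = A_r^{-1}$, the variance of dimension $i$ is scaled by

$$h_{rii}^2 = \frac{g_{ri}}{\beta_r} = \frac{\sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t) (o_i(t) - \hat{\mu}_{m_r i})^2}{\sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)}$$

which is the same per-dimension scale factor as the $i$th diagonal element of the MLLRVAR transform.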

9.4.3 Constrained MLLR Transformation Matrix (CMLLR)

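Substituting the expressions for CMLLR adaptation⁶

$$\hat{\mu}_{m_r} = H_r \mu_{m_r} + \tilde{b}_r, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^T \qquad (9.19)$$

into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields

$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left( \beta \log(p_{ri} w_{ri}^T) - \frac{1}{2} \sum_{j=1}^{d} \left( w_{rj} G_r^{(j)} w_{rj}^T - 2 w_{rj} k_r^{(j)T} \right) \right)$$

where

$$W_r = \left[\; -A_r \tilde{b}_r \quad H_r^{-1} \;\right] = \left[\; b_r \quad A_r \;\right] \qquad (9.20)$$

$w_{ri}$ is the $i$th row of $W_r$, the $1 \times n$ row vector $p_{ri}$ is the zero extended vector of cofactors of $A_r$, and $G_r^{(i)}$ and $k_r^{(i)}$ are defined as

$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, \zeta(t) \zeta^T(t) \qquad (9.21)$$

and

$$k_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{\mu_{m_r i}}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, \zeta^T(t) \qquad (9.22)$$

⁶ For efficiency this transformation is implemented as

$$\hat{o}^r(t) = A^r o(t) + b^r = W^r \zeta(t) \qquad (9.18)$$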

Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it with respect to the transformed mean yields the following update

$$w_{ri} = \left( \alpha p_{ri} + k_r^{(i)} \right) G_r^{(i)-1} \qquad (9.23)$$

where $\alpha$ satisfies

$$\alpha^2 p_{ri} G_r^{(i)-1} p_{ri}^T + \alpha p_{ri} G_r^{(i)-1} k_r^{(i)T} - \beta = 0 \qquad (9.24)$$

There are thus two possible solutions for $\alpha$. The solution that yields the maximum increase in the auxiliary function (obtained by simply substituting in the two options) is used. This is an iterative optimisation scheme as the cofactors mean the estimate of row $i$ is dependent on all the other rows (in that block).
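A single row update of equations 9.23 and 9.24 can be sketched for the simplest case of one-dimensional features, where $A_r$ is 1 × 1, its cofactor is 1, and no iteration over rows is needed. The function name `cmllr_row` is hypothetical, and taking the absolute value inside the logarithm is an assumption made here because the determinant may be negative.

```python
import math


def cmllr_row(G, k, p, beta):
    """One CMLLR row update for a 2 x 2 G (one-dimensional features).

    G    : 2 x 2 symmetric matrix (list of lists)
    k    : row vector of length 2
    p    : zero extended cofactor row vector of length 2
    beta : total occupancy
    """
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    G_inv = [[G[1][1] / det, -G[0][1] / det],
             [-G[1][0] / det, G[0][0] / det]]

    def row_times(v, m):
        return [v[0] * m[0][c] + v[1] * m[1][c] for c in range(2)]

    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    pG = row_times(p, G_inv)
    a, b, c = dot(pG, p), dot(pG, k), -beta   # a*alpha^2 + b*alpha + c = 0
    root = math.sqrt(b * b - 4.0 * a * c)
    candidates = []
    for alpha in ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a)):
        w = row_times([alpha * p[0] + k[0], alpha * p[1] + k[1]], G_inv)
        # Row-dependent part of the auxiliary function (abs() is an assumption)
        q = (beta * math.log(abs(dot(p, w)))
             - 0.5 * dot(row_times(w, G), w) + dot(w, k))
        candidates.append((q, w))
    # Keep the root that gives the larger auxiliary function value
    return max(candidates, key=lambda item: item[0])[1]
```

As a usage example, two frames observed at 0 and 2 with unit occupancy and a single Gaussian of mean 5 and variance 4 give G = [[0.5, 0.5], [0.5, 1.0]], k = [2.5, 2.5], p = [0, 1] and β = 2. The resulting row maps the data mean onto the model mean with a scale of magnitude 2; in one dimension the two roots give the same auxiliary function value, so either sign of the scale may be returned.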

Chapter 10

HMM System Refinement
