results in a simple adaptation formula. The update formula for a single stream system for state j and mixture component m is
$$\hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\,\bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\,\mu_{jm} \tag{9.8}$$
where τ is a weighting of the a priori knowledge to the adaptation speech data and N is the occupation likelihood of the adaptation data, defined as,
$$N_{jm} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jm}(t)$$
where $\mu_{jm}$ is the speaker independent mean and $\bar{\mu}_{jm}$ is the mean of the observed adaptation data and is defined as,
$$\bar{\mu}_{jm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jm}(t)\, o^{r}_{t}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^{r}_{jm}(t)}$$
As can be seen, if the occupation likelihood of a Gaussian component (Njm) is small, then the mean MAP estimate will remain close to the speaker independent component mean. With MAP adaptation, every single mean component in the system is updated with a MAP estimate, based on the prior mean, the weighting and the adaptation data. Hence, MAP adaptation requires a new
“speaker-dependent” model set to be saved.
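The MAP mean update of equation 9.8 can be sketched in a few lines of NumPy. This is a minimal illustration, not the toolkit's own code: the function name `map_mean_update` and the array layout (all adaptation utterances concatenated into one array of frames) are assumptions made for the example.

```python
import numpy as np

def map_mean_update(prior_mean, obs, occ, tau):
    """MAP estimate of one Gaussian mean, following equation 9.8.

    prior_mean : (d,)   speaker independent mean
    obs        : (T, d) adaptation frames (utterances concatenated; an assumption)
    occ        : (T,)   occupation likelihood of this component for each frame
    tau        : weighting of the prior knowledge
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    obs = np.asarray(obs, dtype=float)
    occ = np.asarray(occ, dtype=float)

    n = occ.sum()                      # occupation likelihood N_jm
    if n == 0.0:
        # No adaptation data for this component: the prior mean is kept.
        return prior_mean.copy()

    data_mean = occ @ obs / n          # occupancy-weighted mean of the data
    return (n / (n + tau)) * data_mean + (tau / (n + tau)) * prior_mean
```

With a small occupation likelihood the result stays near `prior_mean`; as the occupation likelihood grows relative to `tau`, it moves towards the mean of the adaptation data.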
One obvious drawback to MAP adaptation is that it requires more adaptation data to be effective when compared to MLLR, because MAP adaptation is specifically defined at the component level.
When larger amounts of adaptation training data become available, MAP begins to perform better than MLLR, due to this detailed update of each component (rather than the pooled Gaussian transformation approach of MLLR). In fact, the two adaptation processes can be combined to improve performance still further, by using the MLLR transformed means as the priors for MAP adaptation (by replacing µjm in equation 9.8 with the transformed mean of equation 9.1). In this case, components that have a low occupation likelihood in the adaptation data (and hence would not change much using MAP alone) have already been adapted using a regression class transform in MLLR.
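As a rough sketch of this combination, the prior mean in the MAP update can be replaced by the MLLR transformed mean. The function name `map_mean_with_mllr_prior` is hypothetical, and the extended mean vector is assumed to carry a leading 1 for the offset term, so that the transform has one more column than rows.

```python
import numpy as np

def map_mean_with_mllr_prior(W, si_mean, data_mean, occupancy, tau):
    """MAP mean update that uses the MLLR transformed mean as the prior.

    W         : (d, d+1) MLLR mean transform (offset column first; an assumption)
    si_mean   : (d,)     speaker independent mean
    data_mean : (d,)     occupancy-weighted mean of the adaptation data
    occupancy : float    occupation likelihood of the component
    tau       : float    weighting of the prior knowledge
    """
    W = np.asarray(W, dtype=float)
    si_mean = np.asarray(si_mean, dtype=float)
    data_mean = np.asarray(data_mean, dtype=float)

    extended_mean = np.concatenate(([1.0], si_mean))
    prior = W @ extended_mean          # MLLR transformed mean used as the prior
    return (occupancy * data_mean + tau * prior) / (occupancy + tau)
```

A component with zero occupancy then ends up at its MLLR transformed mean rather than at the unadapted speaker independent mean.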
An example usage is shown in the following section.
9.4 Linear Transformation Estimation Formulae
For reference purposes, this section lists the various formulae employed within the HTK adaptation tool. It is assumed throughout that single stream data is used and that diagonal covariances are also used. All of the formulae are standard and can be found in the literature.
The following notation is used in this section
$\mathcal{M}$          the model set
$\hat{\mathcal{M}}$    the adapted model set
$T$                    number of observations
$m$                    a mixture component
$O$                    a sequence of d-dimensional observations
$o(t)$                 the observation at time t, 1 ≤ t ≤ T
$\zeta(t)$             extended observation at time t, 1 ≤ t ≤ T
$\mu_{m_r}$            mean vector for the mixture component $m_r$
$\xi_{m_r}$            extended mean vector for the mixture component $m_r$
$\Sigma_{m_r}$         covariance matrix for the mixture component $m_r$
$L_{m_r}(t)$           the occupancy probability for the mixture component $m_r$ at time t
To enable robust transformations to be trained, the transform matrices are tied across a number of Gaussians. The set of Gaussians which share a transform is referred to as a regression class. For a particular transform case Wr, the Mr Gaussian components {m1, m2, . . . , mMr} will be tied together, as determined by the regression class tree (see section 9.1.4). The standard auxiliary function shown below is used to estimate the transforms.
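$$Q(\mathcal{M}, \hat{\mathcal{M}}) = -\frac{1}{2} \sum_{r=1}^{R} \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \left[ K^{(m)} + \log(|\hat{\Sigma}_{m_r}|) + (o(t) - \hat{\mu}_{m_r})^{T} \hat{\Sigma}_{m_r}^{-1} (o(t) - \hat{\mu}_{m_r}) \right]$$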
where $K^{(m)}$ subsumes all constants and $L_{m_r}(t)$, the occupation likelihood, is defined as,
$$L_{m_r}(t) = p(q_{m_r}(t) \mid \mathcal{M}, O_T)$$
and $q_{m_r}(t)$ indicates the Gaussian component $m_r$ at time t, and $O_T = \{o(1), \ldots, o(T)\}$ is the adaptation data. The occupation likelihood is obtained from the forward-backward process described in section 8.8.
9.4.1 Mean Transformation Matrix (MLLRMEAN)
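Substituting the expressions for MLLR mean adaptation
$$\hat{\mu}_{m_r} = W_r \xi_{m_r}, \qquad \hat{\Sigma}_{m_r} = \Sigma_{m_r} \tag{9.9}$$
into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields
$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K - \frac{1}{2} \sum_{r=1}^{R} \sum_{j=1}^{d} \left( w_{rj} G_r^{(j)} w_{rj}^{T} - 2 w_{rj} k_r^{(j)T} \right)$$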
where $w_{rj}$ is the jth row of $W_r$,
$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^{2}_{m_r i}}\, \xi_{m_r} \xi_{m_r}^{T} \sum_{t=1}^{T} L_{m_r}(t) \tag{9.10}$$
and
$$k_r^{(i)} = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)\, \frac{1}{\sigma^{2}_{m_r i}}\, o_i(t)\, \xi_{m_r}^{T} \tag{9.11}$$
Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it with respect to the transformed mean, yields the following update
$$w_{ri} = k_r^{(i)} G_r^{(i)-1} \tag{9.12}$$
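The row-by-row estimate of equations 9.10 to 9.12 can be illustrated with the following NumPy sketch for a single regression class. It is an assumption-laden example rather than the tool's implementation: the function name `estimate_mllr_mean_transform`, the array shapes, and the convention that the extended mean vector starts with a 1 for the offset term are all choices made here.

```python
import numpy as np

def estimate_mllr_mean_transform(means, variances, obs, occ):
    """Estimate one MLLR mean transform W of shape (d, d+1).

    means     : (M, d) means of the components tied to this transform
    variances : (M, d) diagonal covariance entries
    obs       : (T, d) adaptation observations o(t)
    occ       : (M, T) occupation likelihoods L_m(t)
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    obs = np.asarray(obs, dtype=float)
    occ = np.asarray(occ, dtype=float)
    n_comp, dim = means.shape

    # Extended mean vectors; the leading 1 (offset term) is an assumption.
    xi = np.hstack([np.ones((n_comp, 1)), means])
    total_occ = occ.sum(axis=1)        # sum over t of L_m(t), shape (M,)
    weighted_obs = occ @ obs           # sum over t of L_m(t) o(t), shape (M, d)

    W = np.empty((dim, dim + 1))
    for i in range(dim):
        inv_var = 1.0 / variances[:, i]
        # Equation 9.10: G accumulates the weighted outer products of xi.
        G = (xi * (total_occ * inv_var)[:, None]).T @ xi
        # Equation 9.11: k accumulates the weighted observations times xi.
        k = (weighted_obs[:, i] * inv_var) @ xi
        # Equation 9.12: w = k G^{-1}; G is symmetric, so a linear solve suffices.
        W[i] = np.linalg.solve(G, k)
    return W
```

The solve requires G to be invertible, which in practice means the class must contain enough components with occupied frames (at least d+1 linearly independent extended means).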
The above expressions assume that each base regression class r has a separate transform. If regression class trees are used then the shared transform parameters may be simply estimated by combining the statistics of the base regression classes. The regression class tree is used to generate the classes dynamically, so it is not known a priori which regression classes will be used to estimate the transform. This does not present a problem, since $G^{(i)}$ and $k^{(i)}$ for the chosen regression class may be obtained from its child classes (as defined by the tree). If the parent node R has children $\{R_1, \ldots, R_C\}$ then
$$k^{(i)} = \sum_{c=1}^{C} k^{(i)}_{R_c}$$
and
$$G^{(i)} = \sum_{c=1}^{C} G^{(i)}_{R_c}$$
The same approach of combining statistics from multiple children can be applied to all the estimation formulae in this section.
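Because the statistics are plain sums, pooling them for a parent node amounts to element-wise addition. The sketch below assumes, purely for illustration, that each child holds its statistics as a pair of arrays: G with shape (d, n, n), one matrix per row of the transform, and k with shape (d, n). The function name `combine_child_statistics` is hypothetical.

```python
import numpy as np

def combine_child_statistics(children):
    """Sum the (G, k) statistics of child regression classes for their parent.

    children : non-empty list of (G, k) pairs, with G of shape (d, n, n)
               and k of shape (d, n); this layout is an assumption.
    """
    if not children:
        raise ValueError("a parent node needs at least one child")
    G_parent = np.zeros_like(np.asarray(children[0][0], dtype=float))
    k_parent = np.zeros_like(np.asarray(children[0][1], dtype=float))
    for G_child, k_child in children:
        G_parent += G_child
        k_parent += k_child
    return G_parent, k_parent
```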
9.4.2 Variance Transformation Matrix (MLLRVAR, MLLRCOV)
Estimation of the first variance transformation matrices is only available for diagonal covariance Gaussian systems in the current implementation, though full transforms can in theory be estimated.
The Gaussian covariance is transformed using⁵
$$\hat{\mu}_{m_r} = \mu_{m_r}, \qquad \hat{\Sigma}_{m_r} = B_{m_r}^{T} H_r B_{m_r}$$
where $H_r$ is the linear transformation to be estimated and $B_{m_r}$ is the inverse of the Choleski factor of $\Sigma_{m_r}^{-1}$, so
$$\Sigma_{m_r}^{-1} = C_{m_r} C_{m_r}^{T}$$
and
$$B_{m_r} = C_{m_r}^{-1}$$
⁵ In the current implementation of the code, this form of transform can only be estimated in addition to the MLLRMEAN transform.
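After rewriting the auxiliary function, the transform matrix $H_r$ is estimated from,
$$H_r = \frac{\sum_{m_r=1}^{M_r} C_{m_r}^{T} \left[ \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^{T} \right] C_{m_r}}{\sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)}$$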
Here, $H_r$ is forced to be a diagonal transformation by setting the off-diagonal terms to zero, which ensures that $\hat{\Sigma}_{m_r}$ is also diagonal.
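For diagonal covariances the Choleski factor is itself diagonal, so the diagonal of the transform reduces to an occupancy-weighted average of the normalised squared deviations. The sketch below assumes that the statistics are summed over all frames and all components of the regression class; the function names and array shapes are illustrative only.

```python
import numpy as np

def estimate_mllrvar_diagonal(adapted_means, variances, obs, occ):
    """Diagonal of the variance transform H for one regression class.

    adapted_means : (M, d) component means after mean adaptation
    variances     : (M, d) diagonal covariance entries before adaptation
    obs           : (T, d) adaptation observations
    occ           : (M, T) occupation likelihoods
    """
    adapted_means = np.asarray(adapted_means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    obs = np.asarray(obs, dtype=float)
    occ = np.asarray(occ, dtype=float)

    diff = obs[None, :, :] - adapted_means[:, None, :]       # (M, T, d)
    scatter = np.einsum("mt,mtd->md", occ, diff ** 2)        # weighted squared deviations
    # With diagonal covariances C = diag(1 / sigma), so C^T [.] C divides by sigma^2.
    return (scatter / variances).sum(axis=0) / occ.sum()

def apply_mllrvar_diagonal(variances, h_diag):
    """Adapted variances: B^T H B with B = diag(sigma) scales each variance by H_ii."""
    return np.asarray(variances, dtype=float) * h_diag
```

For a class containing a single component, the adapted variance equals the occupancy-weighted sample variance of the data around the adapted mean.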
The alternative form of variance adaptation (MLLRCOV) is supported for full, block and diagonal transforms.
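Substituting the expressions for variance adaptation
$$\hat{\mu}_{m_r} = \mu_{m_r}, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^{T} \tag{9.13}$$
into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields
$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left\{ \beta_r \log(c_{ri} a_{ri}^{T}) - \frac{1}{2} \sum_{j=1}^{d} \left( a_{rj} G_r^{(j)} a_{rj}^{T} \right) \right\}$$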
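where
$$\beta_r = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \tag{9.14}$$
$$A_r = H_r^{-1} \tag{9.15}$$
$a_{ri}$ is the ith row of $A_r$, the 1 × n row vector $c_{ri}$ is the vector of cofactors of $A_r$, $c_{rij} = \mathrm{cof}(A_{rij})$, and $G_r^{(i)}$ is defined as
$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^{2}_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^{T} \tag{9.16}$$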
Differentiating the auxiliary function with respect to the transform $A_r$, and then maximising it with respect to the transformed mean, yields the following update
$$a_{ri} = c_{ri} G_r^{(i)-1} \sqrt{\frac{\beta_r}{c_{ri} G_r^{(i)-1} c_{ri}^{T}}} \tag{9.17}$$
This is an iterative optimisation scheme as the cofactors mean the estimate of row i is dependent on all the other rows (in that block). For the diagonal transform case it is of course non-iterative and simplifies to the same form as the MLLRVAR transform.
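The iterative row update of equation 9.17 might be sketched as follows for a single full transform. The function name `estimate_mllrcov_transform`, the identity initialisation and the fixed number of sweeps are assumptions of this example; the cofactor row is obtained from the determinant and inverse of the current estimate, which is valid while that estimate is non-singular.

```python
import numpy as np

def estimate_mllrcov_transform(G, beta, n_iter=20):
    """Row-by-row estimate of A = H^{-1} for one regression class.

    G      : (d, d, d) array, G[i] being the accumulator of equation 9.16 for row i
    beta   : total occupancy of the regression class (equation 9.14)
    n_iter : number of sweeps over the rows (an assumed stopping rule)
    """
    G = np.asarray(G, dtype=float)
    dim = G.shape[0]
    G_inv = np.stack([np.linalg.inv(G[i]) for i in range(dim)])

    A = np.eye(dim)                    # assumed starting point
    for _ in range(n_iter):
        for i in range(dim):
            # Row i of the cofactor matrix: cof(A)_ij = det(A) * (A^{-1})_ji.
            cof = np.linalg.det(A) * np.linalg.inv(A)[:, i]
            cg = cof @ G_inv[i]
            # Equation 9.17: scale so that the row satisfies the stationarity condition.
            A[i] = cg * np.sqrt(beta / (cg @ cof))
    return A
```

The transform applied to the covariances is then the inverse of the returned matrix. When every row shares the same accumulator, the result whitens the accumulated scatter, which gives a simple sanity check.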
9.4.3 Constrained MLLR Transformation Matrix (CMLLR)
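Substituting the expressions for CMLLR adaptation⁶
$$\hat{\mu}_{m_r} = H_r \mu_{m_r} + \tilde{b}_r, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^{T} \tag{9.19}$$
into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields
$$Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left\{ \beta \log(p_{ri} w_{ri}^{T}) - \frac{1}{2} \sum_{j=1}^{d} \left( w_{rj} G_r^{(j)} w_{rj}^{T} - 2 w_{rj} k_r^{(j)T} \right) \right\}$$
where
$$W_r = \left[\; -A_r \tilde{b}_r \quad H_r^{-1} \;\right] = \left[\; b_r \quad A_r \;\right] \tag{9.20}$$
$w_{ri}$ is the ith row of $W_r$, the 1 × n row vector $p_{ri}$ is the zero extended vector of cofactors of $A_r$, $\beta$ is the total occupancy of the regression class as in equation 9.14, and $G_r^{(i)}$ and $k_r^{(i)}$ are defined as
$$G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^{2}_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, \zeta(t) \zeta^{T}(t) \tag{9.21}$$
and
$$k_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{\mu_{m_r i}}{\sigma^{2}_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, \zeta^{T}(t) \tag{9.22}$$
⁶ For efficiency this transformation is implemented as
$$\hat{o}_r(t) = A_r o(t) + b_r = W_r \zeta(t) \tag{9.18}$$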
Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it with respect to the transformed mean, yields the following update
$$w_{ri} = \left( \alpha p_{ri} + k_r^{(i)} \right) G_r^{(i)-1} \tag{9.23}$$
where α satisfies
$$\alpha^{2} p_{ri} G_r^{(i)-1} p_{ri}^{T} + \alpha\, p_{ri} G_r^{(i)-1} k_r^{(i)T} - \beta = 0 \tag{9.24}$$
There are thus two possible solutions for α. The solution that yields the maximum increase in the auxiliary function (obtained by simply substituting in the two options) is used. This is an iterative optimisation scheme as the cofactors mean the estimate of row i is dependent on all the other rows (in that block).
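A minimal sketch of this row update is given below for a single full transform. Several details are assumptions of the example rather than statements about the tool: the function name `estimate_cmllr_transform`, the identity initialisation, the fixed number of sweeps, the extended observation carrying a leading 1 (so the offset is the first column of W), and the use of the absolute value inside the logarithm when comparing the two roots.

```python
import numpy as np

def estimate_cmllr_transform(G, k, beta, n_iter=20):
    """Row-by-row estimate of W = [b A] for one regression class.

    G      : (d, d+1, d+1) accumulators of equation 9.21, one per row
    k      : (d, d+1)      accumulators of equation 9.22, one per row
    beta   : total occupancy of the regression class
    n_iter : number of sweeps over the rows (an assumed stopping rule)
    """
    G = np.asarray(G, dtype=float)
    k = np.asarray(k, dtype=float)
    dim = G.shape[0]
    G_inv = np.stack([np.linalg.inv(G[i]) for i in range(dim)])

    W = np.hstack([np.zeros((dim, 1)), np.eye(dim)])   # assumed start: b = 0, A = I
    for _ in range(n_iter):
        for i in range(dim):
            A = W[:, 1:]
            # Zero extended cofactor row: cof(A)_ij = det(A) * (A^{-1})_ji.
            cof = np.linalg.det(A) * np.linalg.inv(A)[:, i]
            p = np.concatenate(([0.0], cof))

            # Coefficients of the quadratic in alpha (equation 9.24).
            qa = p @ G_inv[i] @ p
            qb = p @ G_inv[i] @ k[i]
            root = np.sqrt(qb * qb + 4.0 * qa * beta)

            best_value, best_row = -np.inf, None
            for alpha in ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)):
                w = (alpha * p + k[i]) @ G_inv[i]          # equation 9.23
                # Terms of the auxiliary function that depend on row i.
                value = beta * np.log(abs(p @ w)) - 0.5 * (w @ G[i] @ w - 2.0 * (w @ k[i]))
                if value > best_value:
                    best_value, best_row = value, w
            W[i] = best_row
    return W
```

For a class with a single unit-variance component, the estimated transform maps the adaptation data to have that component's mean and an identity covariance, which provides a convenient check of the update.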