8.9 Discriminative Training
8.9.1 Discriminative Parameter Re-Estimation Formulae
For both MMI and MPE training, the estimation of the model parameters is based on variants of the Extended Baum-Welch (EBW) algorithm. In HTK the following form is used to estimate the means and covariance matrices.³
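$$
\hat{\mu}_{jm} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\left(L^{\text{num}\,r}_{jm}(t) - L^{\text{den}\,r}_{jm}(t)\right) o^{r}_{t} + D_{jm}\mu_{jm} + \tau^{I}\mu^{p}_{jm}}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\left(L^{\text{num}\,r}_{jm}(t) - L^{\text{den}\,r}_{jm}(t)\right) + D_{jm} + \tau^{I}} \tag{8.13}
$$
and
$$
\hat{\Sigma}_{jm} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\left(L^{\text{num}\,r}_{jm}(t) - L^{\text{den}\,r}_{jm}(t)\right) o^{r}_{t} o^{rT}_{t} + D_{jm}G^{s}_{jm} + \tau^{I}G^{p}_{jm}}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\left(L^{\text{num}\,r}_{jm}(t) - L^{\text{den}\,r}_{jm}(t)\right) + D_{jm} + \tau^{I}} - \hat{\mu}_{jm}\hat{\mu}^{T}_{jm} \tag{8.14}
$$
where
$$
G^{s}_{jm} = \Sigma_{jm} + \mu_{jm}\mu^{T}_{jm} \tag{8.15}
$$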
$$
G^{p}_{jm} = \Sigma^{p}_{jm} + \mu^{p}_{jm}\mu^{pT}_{jm} \tag{8.16}
$$
The difference between the MMI and MPE criteria lies in how the numerator, $L^{\text{num}\,r}_{jm}(t)$, and denominator, $L^{\text{den}\,r}_{jm}(t)$, "occupancy probabilities" are computed. For MMI, these are the posterior probabilities of Gaussian component occupation for either the numerator or denominator lattice.
However, for MPE, in order to keep the same form of re-estimation formulae as MMI, an MPE-based analogue of the "occupation probability" is computed. This is related to an approximate error measure for each phone marked for the denominator: positive values are treated as numerator statistics and negative values as denominator statistics.
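The mean and covariance updates in (8.13) to (8.16) can be illustrated with a minimal Python sketch. It is not the HMMIRest implementation: it assumes a single Gaussian component with one feature dimension, takes the occupancies as flat per-frame lists pooled over all utterances, and the function name `ebw_update` and its argument names are hypothetical.

```python
def ebw_update(num_occ, den_occ, obs, mu, var, d_const, tau_i, mu_p, var_p):
    """One EBW update of a one-dimensional Gaussian, following (8.13)-(8.16).

    num_occ, den_occ: per-frame numerator and denominator occupancies
    obs:              per-frame scalar observations
    mu, var:          current mean and variance
    d_const:          smoothing constant D_jm
    tau_i:            I-smoothing constant
    mu_p, var_p:      prior mean and variance
    """
    # Numerator minus denominator occupancy for each frame.
    diff = [n - d for n, d in zip(num_occ, den_occ)]
    occ = sum(diff)
    first = sum(g * o for g, o in zip(diff, obs))       # first-order statistics
    second = sum(g * o * o for g, o in zip(diff, obs))  # second-order statistics

    g_s = var + mu * mu          # (8.15), scalar case
    g_p = var_p + mu_p * mu_p    # (8.16), scalar case

    denom = occ + d_const + tau_i
    new_mu = (first + d_const * mu + tau_i * mu_p) / denom                 # (8.13)
    new_var = (second + d_const * g_s + tau_i * g_p) / denom - new_mu ** 2  # (8.14)
    return new_mu, new_var
```

Under these assumptions, two limiting cases are easy to check by hand. When the numerator and denominator occupancies cancel and no I-smoothing is applied, the update returns the current parameters unchanged. When the denominator occupancies, the smoothing constant and the I-smoothing constant are all zero, it reduces to the usual ML estimate.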
In these update formulae there are a number of parameters to be set.
• Smoothing constant, $D_{jm}$: this is a state-component specific parameter that determines the contribution of the counts from the current model parameter estimates. In HMMIRest this is set at
$$
D_{jm} = \max\left( E \sum_{r=1}^{R}\sum_{t=1}^{T_r} L^{\text{den}\,r}_{jm}(t),\; 2D^{\min}_{jm} \right) \tag{8.17}
$$
where $D^{\min}_{jm}$ is the minimum value of $D_{jm}$ to ensure that $\hat{\Sigma}_{jm}$ is positive semi-definite. $E$ is specified using the configuration variable E.
• I-smoothing constant, $\tau^{I}$: global smoothing term to improve generalisation by using the state-component priors, $\mu^{p}_{jm}$ and $\Sigma^{p}_{jm}$. This is set using the configuration option ISMOOTHTAU.
• Prior parameters, $\mu^{p}_{jm}$ and $\Sigma^{p}_{jm}$: the prior parameters with which the counts from the training data are smoothed. These may be obtained from a number of sources. Supported options are:
1. dynamic ML-estimates (default): the ML estimates of the mean and covariance matrices, given the current model parameters, are used.
2. dynamic MMI-estimates: for MPE training the MMI estimates of the mean and covariance matrices, given the current model parameters, can be used. To set this option the following configuration entries must be added:
    MMIPRIOR = TRUE
    MMITAUI = 50
The MMI estimates for the prior can themselves make use of I-smoothing onto a dynamic ML prior. The smoothing constant for this is specified using MMITAUI.
3. static estimates: fixed prior parameters can be specified and used for all iterations. A single MMF file can be specified on the command line using the -Hprior option and the following configuration file entries added:
    PRIORTAU = 25
    STATICPRIOR = TRUE
where PRIORTAU specifies the prior constant, $\tau^{I}$, to be used, rather than the standard I-smoothing value.
³ Discriminative training with multiple streams can also be run. However, to simplify the notation it is assumed that only a single stream is being used.
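The choice of smoothing constant in (8.17) can be sketched in a few lines of Python. This is an illustration only: the function name `smoothing_constant` is hypothetical, the denominator occupancies are assumed to be supplied as one list per utterance, and the value of the minimum constant that keeps the covariance valid is assumed to have been computed elsewhere and is simply passed in.

```python
def smoothing_constant(den_occ_per_utt, e_const, d_min):
    """Return D_jm = max(E * total denominator occupancy, 2 * D_min), as in (8.17).

    den_occ_per_utt: list of per-utterance lists of denominator occupancies
    e_const:         the constant E (set by the configuration variable E)
    d_min:           minimum D_jm that keeps the covariance estimate valid
                     (assumed to be computed elsewhere)
    """
    total_den = sum(sum(utt) for utt in den_occ_per_utt)
    return max(e_const * total_den, 2.0 * d_min)
```

For a component with a large denominator occupancy the first term dominates, so the update is damped in proportion to how often the component competes with the correct hypothesis. For rarely occupied components the floor of twice the minimum value applies instead.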
The best configuration options and parameter settings are task and criterion specific and so need to be determined empirically. The values shown in the tutorial section of this book can be treated as a reasonable starting point. Note that the grammar scale factors used in the tutorial are low compared to those often used in typical large vocabulary speech recognition systems, where values in the range 12-15 are used.
The estimation of the weights and the transition matrices has a similar form. Only the component prior updates will be described here. $c^{(0)}_{jm}$ is initialised to the current model parameter $c_{jm}$. The values are then updated 100 times using the following iterative update rule:
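$$
c^{(i+1)}_{jm} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} L^{\text{num}\,r}_{jm}(t) + k_{jm}c^{(i)}_{jm} + \tau^{W}c^{p}_{jm}}{\sum_{n}\left(\sum_{r=1}^{R}\sum_{t=1}^{T_r} L^{\text{num}\,r}_{jn}(t) + k_{jn}c^{(i)}_{jn} + \tau^{W}c^{p}_{jn}\right)} \tag{8.18}
$$
where
$$
k_{jm} = \max_{n}\left\{\frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} L^{\text{den}\,r}_{jn}(t)}{c_{jn}}\right\} - \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} L^{\text{den}\,r}_{jm}(t)}{c_{jm}} \tag{8.19}
$$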
In a similar fashion to the estimation of the means and covariance matrices, there is a range of forms that can be used to specify the prior for the component or the transition matrix entry. The same configuration options used for the mean and covariance matrix will determine the exact form of the prior.
For the component prior the I-smoothing weight, $\tau^{W}$, is specified using the configuration variable ISMOOTHTAUW. This is normally set to 1. The equivalent smoothing term for the transition matrices is set using ISMOOTHTAUT, and again a value of 1 is often used.
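The iterative weight update of (8.18) and (8.19) can be sketched as follows. This is a minimal illustration, not the HMMIRest code: the function name `update_weights` is hypothetical, the occupancies are assumed to be already summed over all utterances and frames for each component of one state, and all current weights are assumed to be strictly positive.

```python
def update_weights(num_occ, den_occ, weights, prior, tau_w=1.0, iterations=100):
    """Iterate (8.18), with k_jm from (8.19), for the component weights of one state.

    num_occ[m], den_occ[m]: total numerator / denominator occupancy of component m
    weights[m]:             current weights c_jm (assumed > 0)
    prior[m]:               prior weights c^p_jm
    tau_w:                  I-smoothing weight for the component priors
    iterations:             number of updates (the text states 100)
    """
    # (8.19): k_jm is the largest den/weight ratio minus this component's ratio.
    ratios = [d / c for d, c in zip(den_occ, weights)]
    max_ratio = max(ratios)
    k = [max_ratio - r for r in ratios]

    c = list(weights)  # c^(0) is initialised to the current weights
    for _ in range(iterations):
        numer = [n + k_m * c_m + tau_w * p
                 for n, k_m, c_m, p in zip(num_occ, k, c, prior)]
        total = sum(numer)
        c = [x / total for x in numer]  # (8.18): normalise over components
    return c
```

Because the denominator of (8.18) is the sum of the numerators over all components, the weights returned by each iteration sum to one. Note that $k_{jm}$ is computed once from the current model weights and is held fixed while $c^{(i)}_{jm}$ is iterated.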
8.9.2 Lattice-Based Discriminative Training
For both the MMI and MPE training criteria a set of possible hypotheses for each utterance must be considered. To get these confusable hypotheses the training data must first be recognised. Rather than perform an explicit recognition on each iteration of HMMIRest, HTK uses Lattice Based Discriminative Training, in which word lattices are first created with e.g. HDecode, and then these lattices are used for all iterations of discriminative training.
To make the operation of HMMIRest more efficient, the times of the HMM/phone boundaries are also marked in the lattices. This creates so-called phone-marked lattices and it is this form of lattice that is used by HMMIRest. For each utterance used for discriminative training, two lattices need to be created. The first is a phone-marked lattice that represents the correct word sequence (also known as a "numerator" lattice). The second is a phone-marked lattice for competing hypotheses: the "denominator" lattice. These names derive from the MMI objective function, but the same phone-marked lattices are also required for MPE. The numerator lattice is found by generating phone-level alignments in lattice form from the correct word level transcription, while the denominator lattice uses a phone-marked form of the lattice representing confusable hypotheses. In both cases, these are created using either HVite or HDecode.mod, a version of HDecode that can output lattices with model-level alignments (model names and segmentations) included in the lattice structure.
For examples of lattice generation and phone-marking see the tutorial section.
8.9.3 Improved Generalisation
In order to improve the generalisation capability of discriminative training, several techniques can be used: lattice generation using weak language models, and acoustic log likelihood scaling. These are performed in addition to the smoothing described in Section 8.9.1.
• weak language models: in order to increase the number of reasonable errors made during the denominator lattice generation stage, a simpler language model, for example a unigram