Creating Monophone HMMs

/root/sjy/waves/S0002.wav /root/sjy/train/S0002.mfc
/root/sjy/waves/S0003.wav /root/sjy/train/S0003.mfc
/root/sjy/waves/S0004.wav /root/sjy/train/S0004.mfc
(etc.)

Files containing lists of files are referred to as script files³ and by convention are given the extension scp (although HTK does not demand this). Script files are specified using the standard -S option and their contents are read simply as extensions to the command line. Thus, they avoid the need for command lines with several thousand arguments⁴.

[Figure: HCopy, controlled by the configuration file (config) and the script file (codetr.scp), converts the waveform files (S0001.wav, S0002.wav, S0003.wav, etc.) into MFCC files (S0001.mfc, S0002.mfc, S0003.mfc, etc.).]

Fig. 3.6 Step 5

Assuming that the above script is stored in the file codetr.scp, the training data would be coded by executing

HCopy -T 1 -C config -S codetr.scp

This is illustrated in Fig. 3.6. A similar procedure is used to code the test data, after which all of the pieces are in place to start training the HMMs.
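A script file such as codetr.scp need not be typed by hand. The following is a minimal sketch of how the source/target pairs might be generated, assuming the directory layout shown in the example above. The function name make_coding_script and the list of utterance IDs are hypothetical and are not part of HTK.

```python
from pathlib import PurePosixPath


def make_coding_script(wav_dir, mfc_dir, utterance_ids):
    """Return HCopy script lines: one 'source target' pair per utterance."""
    wav_dir, mfc_dir = PurePosixPath(wav_dir), PurePosixPath(mfc_dir)
    return [f"{wav_dir / (uid + '.wav')} {mfc_dir / (uid + '.mfc')}"
            for uid in utterance_ids]


# Hypothetical utterance IDs; in practice they could be read from the
# names of the files found in the waveform directory.
lines = make_coding_script("/root/sjy/waves", "/root/sjy/train",
                           ["S0002", "S0003", "S0004"])

# The text that would be saved as codetr.scp.
script_text = "\n".join(lines) + "\n"
print(script_text, end="")
```

The resulting text can be saved as codetr.scp and passed to HCopy with the -S option as shown above.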

3.2 Creating Monophone HMMs

In this section, the creation of a well-trained set of single-Gaussian monophone HMMs will be described. The starting point will be a set of identical monophone HMMs in which every mean and variance is identical. These are then retrained, short-pause models are added and the silence model is extended slightly. The monophones are then retrained.

Some of the dictionary entries have multiple pronunciations. However, when HLEd was used to expand the word level MLF to create the phone level MLFs, it arbitrarily selected the first pronunciation it found. Once reasonable monophone HMMs have been created, the recogniser tool HVite can be used to perform a forced alignment of the training data. By this means, a new phone level MLF is created in which the choice of pronunciations depends on the acoustic evidence. This new MLF can be used to perform a final re-estimation of the monophone HMMs.

3.2.1 Step 6 - Creating Flat Start Monophones

The first step in HMM training is to define a prototype model. The parameters of this model are not important; its purpose is to define the model topology. For phone-based systems, a good topology to use is 3-state left-right with no skips, such as the following

~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<State> 3
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<State> 4
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>

where each ellipsed vector is of length 39. This number, 39, is computed from the length of the parameterised static vector (MFCC_0 = 13) plus the delta coefficients (+13) plus the acceleration coefficients (+13).

3 Not to be confused with files containing edit scripts.

4 Most UNIX shells, especially the C shell, only allow a limited and quite small number of arguments.
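Since the prototype vectors are tedious to type in full, the prototype text can be generated. The sketch below is illustrative only: the helper names vector_size and make_proto are hypothetical, and the definition is assumed to be enclosed in the usual <BeginHMM> and <EndHMM> delimiters. It also checks the arithmetic behind the vector length of 39.

```python
def vector_size(num_static, deltas=True, accels=True):
    """Static coefficients plus one equally sized block per derivative stream."""
    return num_static * (1 + int(deltas) + int(accels))


def make_proto(vec_size, kind="MFCC_0_D_A", name="proto"):
    """Build a 5-state (3 emitting) left-right prototype with no skips."""
    zeros = " ".join(["0.0"] * vec_size)   # placeholder means
    ones = " ".join(["1.0"] * vec_size)    # placeholder variances
    lines = [f"~o <VecSize> {vec_size} <{kind}>", f'~h "{name}"',
             "<BeginHMM>", "<NumStates> 5"]
    for state in (2, 3, 4):                # states 1 and 5 are non-emitting
        lines += [f"<State> {state}", f"<Mean> {vec_size}", zeros,
                  f"<Variance> {vec_size}", ones]
    lines += ["<TransP> 5",
              "0.0 1.0 0.0 0.0 0.0",
              "0.0 0.6 0.4 0.0 0.0",
              "0.0 0.0 0.6 0.4 0.0",
              "0.0 0.0 0.0 0.7 0.3",
              "0.0 0.0 0.0 0.0 0.0",
              "<EndHMM>"]
    return "\n".join(lines) + "\n"


size = vector_size(13)         # 13 static (MFCC_0) + 13 delta + 13 acceleration
proto_text = make_proto(size)  # text that would be saved as the file proto
print(size)
```

Because the actual values are overwritten by HCompV, only the vector size, the parameter kind and the transition structure matter here.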

The HTK tool HCompV will scan a set of data files, compute the global mean and variance and set all of the Gaussians in a given HMM to have the same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the command

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

will create a new version of proto in the directory hmm0 in which the zero means and unit variances above have been replaced by the global speech means and variances. Note that the prototype HMM defines the parameter kind as MFCC_0_D_A. This means that delta and acceleration coefficients are to be computed and appended to the static MFCC coefficients computed and stored during the coding process described above. To ensure that these are computed during loading, the configuration file config should be modified to change the target kind, i.e. the configuration file entry for TARGETKIND should be changed to

TARGETKIND = MFCC_0_D_A

HCompV has a number of options specified for it. The -f option causes a variance floor macro (called vFloors) to be generated which is equal to 0.01 times the global variance. This is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps. The -m option asks for means to be computed as well as variances.

Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone. The format of an MMF is similar to that of an MLF and it serves a similar purpose in that it avoids having a large number of individual HMM definition files (see Fig. 3.7).
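The manual copying and relabeling of the prototype can also be scripted. The following is a minimal sketch, assuming that the prototype file contains a line of the exact form ~h "proto" followed by the model body. The function name clone_prototype, the abbreviated stand-in prototype and the short phone list are hypothetical; in practice the phones would be read from the model list monophones0.

```python
def clone_prototype(proto_text, phones, proto_name="proto"):
    """Return MMF text holding one relabeled copy of the prototype per phone."""
    lines = proto_text.splitlines()
    start = lines.index(f'~h "{proto_name}"')
    body = lines[start + 1:]     # everything from <BeginHMM> to <EndHMM>
    out = []
    for phone in phones:
        out.append(f'~h "{phone}"')
        out.extend(body)
    return "\n".join(out) + "\n"


# Abbreviated stand-in for hmm0/proto; the "..." line replaces the states
# and transition matrix of a real definition.
demo_proto = """~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
...
<EndHMM>
"""

hmmdefs_text = clone_prototype(demo_proto, ["aa", "eh", "sil"])
print(hmmdefs_text, end="")
```

Note that the ~o header line is deliberately left out of the copies, since the global options are kept in the separate macros file.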


macros

~v "varFloor1"
<Variance> 39
0.0012 0.0003 ...

hmmdefs

~h "aa"
<BeginHMM> ...
~h "eh"
<BeginHMM> ...
... etc

Fig. 3.7 Form of Master Macro Files

The flat start monophones stored in the directory hmm0 are re-estimated using the embedded re-estimation tool HERest invoked as follows

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
    -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0

The effect of this is to load all the models in hmm0 which are listed in the model list monophones0 (monophones1 less the short pause (sp) model). These are then re-estimated using the data listed in train.scp and the new model set is stored in the directory hmm1. Most of the files used in this invocation of HERest have already been described. The exception is the file macros. This should contain a so-called global options macro and the variance floor macro vFloors generated earlier. The global options macro simply defines the HMM parameter kind and the vector size, i.e.

~o <MFCC_0_D_A> <VecSize> 39

See Fig. 3.7. This can be combined with vFloors into a text file called macros.
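Combining the two parts is a simple concatenation. A minimal sketch is given below, in which the function name build_macros is hypothetical and the vFloors content is an abbreviated stand-in modelled on Fig. 3.7 rather than real HCompV output.

```python
def build_macros(vfloors_text, vec_size=39, kind="MFCC_0_D_A"):
    """Prepend the global options macro to the variance floor macro."""
    global_options = f"~o <{kind}> <VecSize> {vec_size}"
    return global_options + "\n" + vfloors_text.rstrip("\n") + "\n"


# Abbreviated stand-in for the vFloors file written by HCompV.
demo_vfloors = '~v "varFloor1"\n<Variance> 39\n0.0012 0.0003 ...\n'

macros_text = build_macros(demo_vfloors)  # text that would be saved as macros
print(macros_text, end="")
```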

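[Figure: the prototype HMM definition (proto) and the training files listed in train.scp feed HCompV, leading to the macros and hmmdefs files in hmm0. HERest then takes these together with the phone level transcription (phones0.mlf), the training files and the HMM list (monophones0), and writes macros and hmmdefs to hmm1.]

Fig. 3.8 Step 6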

The -t option sets the pruning thresholds to be used during training. Pruning limits the range of state alignments that the forward-backward algorithm includes in its summation and it can reduce the amount of computation required by an order of magnitude. For most training files, a very tight pruning threshold can be set. However, some training files will provide poorer acoustic matching and in consequence a wider pruning beam is needed. HERest deals with this by having an auto-incrementing pruning threshold. In the above example, pruning is normally 250.0. If re-estimation fails on any particular file, the threshold is increased by 150.0 and the file is reprocessed. This is repeated until either the file is successfully processed or the pruning limit of 1000.0 is exceeded. If the limit is exceeded, it is safe to assume that there is a serious problem with the training file and hence the fault should be fixed (typically it will be an incorrect transcription) or the training file should be discarded. The process leading to the initial set of monophones in the directory hmm0 is illustrated in Fig. 3.8.
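The auto-incrementing behaviour described for the -t option can be pictured with the following sketch. It is not HERest's implementation: the function process_with_auto_pruning and the callback try_file, which stands in for one attempt at processing a file with a given beam, are hypothetical.

```python
def process_with_auto_pruning(try_file, initial=250.0, step=150.0, limit=1000.0):
    """Return the beam at which try_file succeeded, or None if the limit is exceeded."""
    beam = initial
    while beam <= limit:
        if try_file(beam):      # True means the file was processed successfully
            return beam
        beam += step            # widen the beam and reprocess the file
    return None                 # file should be fixed or discarded


# Hypothetical files: one needs a beam of at least 500.0, one never succeeds.
poorly_matched = lambda beam: beam >= 500.0
mistranscribed = lambda beam: False

print(process_with_auto_pruning(poorly_matched))
print(process_with_auto_pruning(mistranscribed))
```

With the thresholds from the example, the beams tried are 250.0, 400.0, 550.0, 700.0, 850.0 and 1000.0 before the file is given up.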

Each time HERest is run it performs a single re-estimation. Each new HMM set is stored in a new directory. Execution of HERest should be repeated twice more, changing the name of the input and output directories (set with the options -H and -M) each time, until the directory hmm3 contains the final set of initialised monophone HMMs.
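The three passes differ only in their input and output directories, so the command lines can be generated in a loop. The sketch below only builds and prints the argument lists; the function name herest_command is hypothetical, and the options are exactly those of the invocation shown earlier.

```python
def herest_command(src_dir, dst_dir, mlf="phones0.mlf", hmm_list="monophones0"):
    """Build the HERest argument list for one re-estimation pass."""
    return ["HERest", "-C", "config", "-I", mlf,
            "-t", "250.0", "150.0", "1000.0",
            "-S", "train.scp",
            "-H", f"{src_dir}/macros", "-H", f"{src_dir}/hmmdefs",
            "-M", dst_dir, hmm_list]


# hmm0 -> hmm1, hmm1 -> hmm2, hmm2 -> hmm3
commands = [herest_command(f"hmm{i}", f"hmm{i + 1}") for i in range(3)]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be handed to Python's subprocess.run to execute the pass, provided the HTK tools are installed.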

3.2.2 Step 7 - Fixing the Silence Models

[Figure: topology of the silence model sil and the short pause model sp, with the shared state marked.]

Fig. 3.9 Silence Models

The previous step has generated a 3-state left-to-right HMM for each phone and also an HMM for the silence model sil. The next step is to add extra transitions from state 2 to state 4 and from state 4 to state 2 in the silence model. The idea here is to make the model more robust by allowing individual states to absorb the various impulsive noises in the training data. The backward skip allows this to happen without committing the model to transit to the following word.

Also, at this point, a 1-state short pause sp model should be created. This should be a so-called tee-model which has a direct transition from entry to exit node. The sp model has its emitting state tied to the centre state of the silence model. The required topology of the two silence models is shown in Fig. 3.9.

These silence models can be created in two stages

• Use a text editor on the file hmm3/hmmdefs to copy the centre state of the sil model to make a new sp model and store the resulting MMF hmmdefs, which includes the new sp model, in the new directory hmm4.

• Run the HMM editor HHEd to add the extra transitions required and tie the sp state to the centre sil state
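The first of these stages could also be scripted rather than done in a text editor. The following is a rough sketch under several assumptions: the helper names extract_state and make_sp are hypothetical, the demonstration model uses 2-dimensional vectors and an elided transition matrix for brevity, and the initial sp transition probabilities (0.9 and 0.1) are assumed values which are not given in the text and which are re-estimated later in any case. The tee transition is not included here because it is added by HHEd in the second stage.

```python
def extract_state(hmmdefs_text, model, state):
    """Return the lines defining one state of one model (without the <State> tag)."""
    lines = [ln.strip() for ln in hmmdefs_text.splitlines()]
    i = lines.index(f'~h "{model}"')
    # Advance to the requested <State> tag (tags are matched case-insensitively).
    while lines[i].upper().split() != ["<STATE>", str(state)]:
        i += 1
    i += 1
    body = []
    # Collect lines until the next state or the transition matrix begins.
    while not lines[i].upper().startswith(("<STATE>", "<TRANSP>")):
        body.append(lines[i])
        i += 1
    return body


def make_sp(hmmdefs_text):
    """Append an sp model whose single emitting state copies sil's centre state."""
    centre = extract_state(hmmdefs_text, "sil", 3)
    sp = ['~h "sp"', "<BeginHMM>", "<NumStates> 3", "<State> 2", *centre,
          "<TransP> 3",
          "0.0 1.0 0.0",
          "0.0 0.9 0.1",   # assumed starting values
          "0.0 0.0 0.0",
          "<EndHMM>"]
    return hmmdefs_text.rstrip("\n") + "\n" + "\n".join(sp) + "\n"


# Tiny stand-in for the sil entry of hmm3/hmmdefs.
demo_hmmdefs = """~h "sil"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 2
0.1 0.2
<Variance> 2
1.0 1.0
<State> 3
<Mean> 2
0.3 0.4
<Variance> 2
1.1 1.2
<State> 4
<Mean> 2
0.5 0.6
<Variance> 2
1.3 1.4
<TransP> 5
...
<EndHMM>
"""

with_sp = make_sp(demo_hmmdefs)  # text that would be saved as hmm4/hmmdefs
print(with_sp, end="")
```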

HHEd works in a similar way to HLEd. It applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows

HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1

where sil.hed contains the following commands

AT 2 4 0.2 {sil.transP}

AT 4 2 0.2 {sil.transP}

AT 1 3 0.3 {sp.transP}

TI silst {sil.state[3],sp.state[2]}

The AT commands add transitions to the given transition matrices and the final TI command creates a tied-state called silst. The parameters of this tied-state are stored in the hmmdefs file and within each silence model, the original state parameters are replaced by the name of this macro. Macros are described in more detail below. For now it is sufficient to regard them simply as the mechanism by which HTK implements parameter sharing. Note that the phone list used here has been changed, because the original list monophones0 has been extended by the new sp model. The new file is called monophones1 and has been used in the above HHEd command.
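The effect of an AT command on a transition matrix can be illustrated numerically. The sketch below assumes that, once the new transition has been given the requested probability, the remaining transitions out of the same state are rescaled so that the row still sums to one. The function add_transition is hypothetical, and the starting matrix is simply the prototype's, not the re-estimated sil matrix.

```python
def add_transition(trans, i, j, prob):
    """Return a copy of trans with a transition i -> j of probability prob.

    States are numbered from 1. The other transitions out of state i are
    rescaled so that the row sums to one (assumed behaviour).
    """
    new = [row[:] for row in trans]
    row = new[i - 1]
    row[j - 1] = 0.0                     # ignore any existing i -> j value
    scale = (1.0 - prob) / sum(row)      # assumes state i has other transitions
    new[i - 1] = [p * scale for p in row]
    new[i - 1][j - 1] = prob
    return new


# Illustrative starting matrix (the prototype's values).
sil_trans = [[0.0, 1.0, 0.0, 0.0, 0.0],
             [0.0, 0.6, 0.4, 0.0, 0.0],
             [0.0, 0.0, 0.6, 0.4, 0.0],
             [0.0, 0.0, 0.0, 0.7, 0.3],
             [0.0, 0.0, 0.0, 0.0, 0.0]]

edited = add_transition(sil_trans, 2, 4, 0.2)   # AT 2 4 0.2 {sil.transP}
edited = add_transition(edited, 4, 2, 0.2)      # AT 4 2 0.2 {sil.transP}
for row in edited:
    print(" ".join(f"{p:.2f}" for p in row))
```

Under this assumption, row 2 of the illustrative matrix becomes 0.48, 0.32 and 0.20 for the transitions to states 2, 3 and 4 respectively.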

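[Figure: the sil model in hmm4 (macros, hmmdefs) is edited to create sp (Edit sil -> sp). HHEd then applies the edit script (sil.hed) using the HMM list (monophones1) to produce hmm5 (macros, hmmdefs), and two passes of HERest produce hmm7 (macros, hmmdefs).]

Fig. 3.10 Step 7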

Finally, another two passes of HERest are applied using the phone transcriptions with sp models between words. This leaves the set of monophone HMMs created so far in the directory hmm7. This step is illustrated in Fig. 3.10.

3.2.3 Step 8 - Realigning the Training Data

As noted earlier, the dictionary contains multiple pronunciations for some words, particularly function words. The phone models created so far can be used to realign the training data and create new transcriptions. This can be done with a single invocation of the HTK recognition tool HVite, viz

HVite -l '*' -o SWT -b silence -C config -a -H hmm7/macros \
    -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab \
    -I words.mlf -S train.scp dict monophones1

This command uses the HMMs stored in hmm7 to transform the input word level transcription words.mlf to the new phone level transcription aligned.mlf using the pronunciations stored in the dictionary dict (see Fig. 3.11). The key difference between this operation and the original word-to-phone mapping performed by HLEd in step 4 is that the recogniser considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.

In the above, the -b option is used to insert a silence model at the start and end of each utterance. The name silence is used on the assumption that the dictionary contains an entry

silence sil

The -t option sets a pruning level of 250.0 and the -o option is used to suppress the printing of scores, word names and time boundaries in the output MLF.
