HHEd - The HTK Book

Chapter 10

HMM System Refinement


This chapter describes how the HTK tool HHEd is used, its editing language and the main operations that can be performed.

10.1 Using HHEd

The HMM editor HHEd takes as input a set of HMM definitions and outputs a new modified set, usually to a new directory. It is invoked by a command line of the form

HHEd -H MMF1 -H MMF2 ... -M newdir cmds.hed hmmlist

where cmds.hed is an edit script containing a list of edit commands. Each command is written on a separate line and begins with a two-letter command name.

The effect of executing the above command line would be to read in the HMMs listed in hmmlist and defined by files MMF1, MMF2, etc., apply the editing operations defined in cmds.hed and then write the resulting system out to the directory newdir. As with all tools, HTK will attempt to replicate the file structure of the input in the output directory. By default, any new macros generated by HHEd will be written to one or more of the existing MMFs. In doing this, HTK will attempt to ensure that the “definition before use” rule for macros is preserved, but it cannot always guarantee this. Hence, it is usually best to define explicit target file names for new macros. This can be done in two ways. Firstly, explicit target file names can be given in the edit script using the UF command. For example, if cmds.hed contained

....

UF smacs

# commands to generate state macros ....

UF vmacs

# commands to generate variance macros ....

then the output directory would contain an MMF called smacs containing a set of state macro definitions and an MMF called vmacs containing a set of variance macro definitions. These would be in addition to the existing MMF files MMF1, MMF2, etc.

Alternatively, the whole HMM system can be written to a single file using the -w option. For example,

HHEd -H MMF1 -H MMF2 ... -w newMMF cmds.hed hmmlist

would write the whole of the edited HMM set to the file newMMF.

As mentioned previously, each execution of HHEd is normally followed by re-estimation using HERest. Normally, all the information needed by HHEd is contained in the model set itself. However, some clustering operations require various statistics about the training data (see sections 10.4 and 10.5). These statistics are gathered by HERest and output to a stats file, which is then read in by HHEd. Note, however, that the statistics file generated by HERest refers to the input model set, not the re-estimated set. Thus, for example, in the following sequence, the HHEd edit script in cmds.hed contains a command (see the RO command in section 10.4) which references a statistics file (called stats) describing the HMM set defined by hmm1/MMF.

HERest -H hmm1/MMF -M hmmx -s stats hmmlist train1 train2 ....

HHEd -H hmm1/MMF -M hmm2 cmds.hed hmmlist

The required statistics file is generated by HERest but the re-estimated model set stored in hmmx/MMF is ignored and can be deleted.

10.2 Constructing Context-Dependent Models

The first stage of model refinement is usually to convert a set of initialised and trained context-independent monophone HMMs to a set of context-dependent models. As explained in section 6.4, HTK uses the convention that a HMM name of the form l-p+r denotes the context-dependent version of the phone p which is to be used when the left neighbour is the phone l and the right neighbour is the phone r. To make a set of context-dependent phone models, it is only necessary to construct a HMM list, called say cdlist, containing the required context-dependent models and then execute HHEd with a single command in its edit script


CL cdlist

The effect of this command is that for each model l-p+r in cdlist it makes a copy of the monophone p.
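The naming convention behind this cloning step can be illustrated with a short Python sketch. It is not HHEd's implementation; the helper names base_phone and clone_plan are hypothetical, and the sketch only shows how the monophone p is recovered from a name of the form l-p+r.

```python
def base_phone(name):
    """Return the centre phone p of a name written as l-p+r, l-p, p+r or p."""
    # Drop the left context: everything up to and including the first '-'.
    if "-" in name:
        name = name.split("-", 1)[1]
    # Drop the right context: everything from the first '+' onwards.
    if "+" in name:
        name = name.split("+", 1)[0]
    return name


def clone_plan(cdlist):
    """Map each context-dependent name to the monophone it would be copied from."""
    return {name: base_phone(name) for name in cdlist}


if __name__ == "__main__":
    for cd_name, mono in clone_plan(["t-ih+n", "ih+l", "s-ih", "sil"]).items():
        print(f"{cd_name:8s} <- copy of {mono}")
```

Biphone and monophone names are handled as well, since a context-dependent list for a word-internal system may contain all three forms.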

The set of context-dependent models output by the above must be re-estimated using HERest. To do this, the training data transcriptions must be converted to use context-dependent labels and the original monophone HMM list must be replaced by cdlist. In fact, it is best to do this conversion before cloning the monophones because if the HLEd TC command is used then the -n option can be used to generate the required list of context-dependent HMMs automatically.

Before building a set of context-dependent models, it is necessary to decide whether or not cross-word triphones are to be used. If they are, then word boundaries in the training data can be ignored and all monophone labels can be converted to triphones. If, however, word internal triphones are to be used, then word boundaries in the training transcriptions must be marked in some way (either by an explicit marker which is subsequently deleted or by using a short pause tee-model). This word boundary marker is then identified to HLEd using the WB command to make the TC command use biphones rather than triphones at word boundaries (see section 6.4).

All HTK tools can read and write HMM definitions in text or binary form. Text is good for seeing exactly what the tools are producing, but binary is much faster to load and store, and much more compact. Binary output is enabled either using the standard option -B or by setting the configuration variable SAVEBINARY. In the above example, the HMM set input to HHEd will contain a small set of monophones whereas the output will be a large set of triphones. In order to save storage and computation, this is usually a good point to switch to binary storage of MMFs.

10.3 Parameter Tying and Item Lists

As explained in Chapter 7, HTK uses macros to support a generalised parameter tying facility. Referring again to Fig. 7.7.8, each of the solid black circles denotes a potential tie-point in the hierarchy of HMM parameters. When two or more parameter sets are tied, the same set of parameter values is shared by all the owners of the tied set. Externally, tied parameters are represented by macros and internally they are represented by structure sharing. The accumulators needed for the numerators and denominators of the Baum-Welch re-estimation formulae given in section 8.8 are attached directly to the parameters themselves. Hence, when the values of a tied parameter set are re-estimated, all of the data which would have been used to estimate each individual untied parameter are effectively pooled, leading to more robust parameter estimation.

Note also that although parameter tying is implemented in a way which makes it transparent to the HTK re-estimation and recognition tools, in practice, these tools do notice when a system has been tied and try to take advantage of it by avoiding redundant computations.

Although macro definitions could be written by hand, in practice, tying is performed by executing HHEd commands and the resulting macros are thus generated automatically. The basic HHEd command for tying a set of parameters is the TI command which has the form

TI macroname itemlist

This causes all items in the given itemlist to be tied together and output as a macro called macroname. Macro names are written as a string of characters optionally enclosed in double quotes. The latter are necessary if the name contains one or more characters which are not letters or digits.


Fig. 10.1 Item List Construction (tree diagram with the nodes hmmName, transP, state[], stream[], dur, weights, mix[], mean and cov)

Item lists use a simple language to identify sets of points in the HMM parameter hierarchy illustrated in Fig. 7.7.8. This language is defined fully in the reference entry for HHEd. The essential idea is that item lists represent paths down the hierarchical parameter tree where the direction down should be regarded as travelling from the root of the tree towards the leaves. A path can be unique, or more usually, it can be a pattern representing a set of paths down the tree. The point at which each path stops identifies one member of the set represented by the item list.

Fig. 10.1 shows the possible paths down the tree. In text form the branches are replaced by dots and the underlined node names are possible terminating points. At the topmost level, an item list is a comma separated list of paths enclosed in braces.

Some examples should make all this clearer. Firstly, the following is a legal but somewhat long-winded way of specifying the set of items comprising states 2, 3 and 4 of the HMM called aa

{ aa.state[2],aa.state[3],aa.state[4] }

however, in practice this would be written much more compactly as

{ aa.state[2-4] }

It must be emphasised that indices in item lists are really patterns. The set represented by an item list consists of all those elements which match the patterns. Thus, if aa only had two emitting states, the above item list would not generate an error. It would simply match only two items. The reason for this is that the same pattern can be applied to many different objects. For example, the HMM name can be replaced by a list of names enclosed in brackets. Furthermore, each HMM name can include ‘?’ characters which match any single character and ‘*’ characters which match zero or more characters. Thus

{ (aa+*,iy+*,eh+*).state[2-4] }

represents states 2, 3 and 4 of all biphone models corresponding to the phonemes aa, iy and eh. If aa had just 2 emitting states and the others had 4 emitting states, then this item list would include 2 states from each of the aa models and 3 states from each of the others. Moving further down the tree, the item list

{ *.state[2-4].stream[1].mix[1,3].cov }

denotes the set of all covariance vectors (or matrices) of the first and third mixture components of stream 1, of states 2 to 4 of all HMMs. Since many HMM systems are single stream, the stream part of the path can be omitted if its value is 1. Thus, the above could have been written

{ *.state[2-4].mix[1,3].cov }

These last two examples also show that indices can be written as comma separated lists as well as ranges, for example, [1,3,4-6,9] is a valid index list representing states 1, 3, 4, 5, 6, and 9.
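The pattern semantics described above can be illustrated with a small Python sketch. It is not the HHEd parser: the function names expand_indices and match_states are hypothetical, the models dictionary is made-up example data, and Python's fnmatchcase is assumed to be a close enough stand-in for the ‘?’ and ‘*’ matching rules.

```python
from fnmatch import fnmatchcase


def expand_indices(spec):
    """Expand an index list such as '1,3,4-6,9' into a sorted list of integers."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            indices.update(range(int(low), int(high) + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


def match_states(name_patterns, index_spec, models):
    """Return (hmm name, state index) pairs matched by the patterns.

    models maps each HMM name to the state indices it actually has.
    Indices that a model does not have are skipped without error,
    mirroring the rule that indices are patterns.
    """
    wanted = expand_indices(index_spec)
    matched = []
    for name in sorted(models):
        if any(fnmatchcase(name, pattern) for pattern in name_patterns):
            matched.extend((name, s) for s in wanted if s in models[name])
    return matched


if __name__ == "__main__":
    # Hypothetical model set: aa+b has two emitting states, iy+b has four.
    models = {"aa+b": [2, 3], "iy+b": [2, 3, 4, 5], "k": [2, 3, 4]}
    print(expand_indices("1,3,4-6,9"))
    print(match_states(["aa+*", "iy+*"], "2-4", models))
```

In the usage example, the pattern state[2-4] picks up only two states from aa+b and three from iy+b, and the monophone k is ignored because it matches neither name pattern.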


When item lists are used as the argument to a TI command, the kind of items represented by the list determines the macro type in a fairly obvious way. The only non-obvious cases are firstly that lists ending in cov generate ~v, ~i, ~c, or ~x macros as appropriate. Secondly, if an explicit set of mixture components is defined as in

{ *.state[2].mix[1-5] }

then ~m macros are generated, but omitting the indices altogether denotes a special case of mixture tying which is explained later in Chapter 11.

To illustrate the use of item lists, some example TI commands can now be given. Firstly, when a set of context-dependent models is created, it can be beneficial to share one transition matrix across all variants of a phone rather than having a distinct transition matrix for each. This could be achieved by adding TI commands immediately after the CL command described in the previous section, that is

CL cdlist

TI T_ah {*-ah+*.transP}

TI T_eh {*-eh+*.transP}

TI T_ae {*-ae+*.transP}

TI T_ih {*-ih+*.transP}

... etc

As a second example, a so-called Grand Variance HMM system can be generated very easily with the following HHEd command

TI "gvar" { *.state[2-4].mix[1].cov }

where it is assumed that the HMMs are 3-state single mixture component models. The effect of this command is to tie all state distributions to a single global variance vector. For applications where there is limited training data, this technique can improve performance, particularly in noise.

Speech recognition systems will often have distinct models for silence and short pauses. A silence model sil may have the normal 3 state topology whereas a short pause model may have just a single state. To avoid the two models competing with each other, the sp model state can be tied to the centre state of the sil model thus

TI "silst" { sp.state[2], sil.state[3] }

So far nothing has been said about how the parameters are actually determined when a set of items is replaced by a single shared representative. When states are tied, the state with the broadest variances and as few as possible zero mixture component weights is selected from the pool and used as the representative. When mean vectors are tied, the average of all the mean vectors in the pool is used and when variances are tied, the largest variance in the pool is used. In all other cases, the last item in the tie-list is arbitrarily chosen as representative. All of these selection criteria are ad hoc, but since the tie operations are always followed by explicit re-estimation using HERest, the precise choice of representative for a tied set is not critical.
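Two of these selection rules can be sketched in Python. This is an illustration, not HHEd code: the function names are hypothetical, and "largest variance" is assumed here to mean the element-wise maximum over diagonal variance vectors, which the text does not state explicitly.

```python
def tie_means(means):
    """Representative mean: the average of all mean vectors in the pool."""
    dim = len(means[0])
    return [sum(m[d] for m in means) / len(means) for d in range(dim)]


def tie_variances(variances):
    """Representative variance: assumed element-wise maximum over the pool."""
    dim = len(variances[0])
    return [max(v[d] for v in variances) for d in range(dim)]


if __name__ == "__main__":
    print(tie_means([[1.0, 2.0], [3.0, 4.0]]))
    print(tie_variances([[1.0, 5.0], [2.0, 3.0]]))
```

Because re-estimation follows every tie operation, either rule only provides a starting point for the shared parameter.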

Finally, tied parameters can be untied. For example, subsequent refinements of the context-dependent model set generated above with tied transition matrices might result in a much more compact set of models for which individual transition parameters could be robustly estimated. This can be done using the UT command whose effect is to untie all of the items in its argument list. For example, the command

UT {*-iy+*.transP}

would untie the transition parameters in all variants of the iy phoneme. This untying works by simply making unique copies of the tied parameters. These untied parameters can then subsequently be re-estimated.

10.4 Data-Driven Clustering

In section 10.2, a method of triphone construction was described which involved cloning all monophones and then re-estimating them using data for which monophone labels have been replaced by triphone labels. This will lead to a very large set of models, and relatively little training data for each model. Applying the argument that context will not greatly affect the centre states of triphone models, one way to reduce the total number of parameters without significantly altering the models’ ability to represent the different contextual effects might be to tie all of the centre states across all models derived from the same monophone. This tying could be done by writing an edit script of the form

TI "iyS3" {*-iy+*.state[3]}

TI "ihS3" {*-ih+*.state[3]}

TI "ehS3" {*-eh+*.state[3]}

.... etc

Each TI command would tie all the centre states of all triphones in each phone group. Hence, if there were an average of 100 triphones per phone group then the total number of states per group would be reduced from 300 to 201.
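The arithmetic behind this figure can be checked with a few lines of Python. The function name is hypothetical, and the sketch assumes, as in the text, that every model in the group has three emitting states and that only the centre state is tied.

```python
def states_after_centre_tying(n_models, states_per_model=3):
    """Return (distinct states before tying, distinct states after tying)."""
    before = n_models * states_per_model
    # Every model keeps its own outer states; all centre states collapse to one.
    after = n_models * (states_per_model - 1) + 1
    return before, after


if __name__ == "__main__":
    before, after = states_after_centre_tying(100)
    print(f"{before} states before tying, {after} after")
```

With 100 triphones per group, the 200 left and right states remain distinct and the 100 centre states become a single shared state, giving 201.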

Explicit tyings such as these can have some positive effect but overall they are not very satisfactory. Tying all centre states is too severe and worse still, the problem of undertraining for the left and right states remains. A much better approach is to use clustering to decide which states to tie. HHEd provides two mechanisms for this. In this section a data-driven clustering approach will be described and in the next section, an alternative decision tree-based approach is presented.

Data-driven clustering is performed by the TC and NC commands. These both invoke the same bottom-up (agglomerative) hierarchical procedure. Initially all states are placed in individual clusters. The pair of clusters which when combined would form the smallest resultant cluster are merged. This process repeats until either the size of the largest cluster reaches the threshold set by the TC command or the total number of clusters has fallen to that specified by the NC command. The size of a cluster is defined as the greatest distance between any two states. The distance metric depends on the type of state distribution. For single Gaussians, a weighted Euclidean distance between the means is used and for tied-mixture systems a Euclidean distance between the mixture weights is used. For all other cases, the average probability of each component mean with respect to the other state is used. The details of the algorithm and these metrics are given in the reference section for HHEd.
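The merge loop described above can be sketched in Python. This is a simplified illustration rather than HHEd's algorithm: the function names are hypothetical, each state is reduced to a plain vector, an unweighted Euclidean distance replaces the metrics listed above, and the threshold test is assumed to stop the loop as soon as the next merger would exceed it.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Plain (unweighted) Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_size(cluster, points):
    """Greatest distance between any two members; 0.0 for a singleton."""
    return max(
        (euclidean(points[i], points[j]) for i, j in combinations(cluster, 2)),
        default=0.0,
    )


def cluster_states(points, max_size=None, num_clusters=None):
    """Merge clusters until a size threshold (TC-like) or count (NC-like) is hit."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        if num_clusters is not None and len(clusters) <= num_clusters:
            break
        # Find the pair whose merger would give the smallest resulting cluster.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            size = cluster_size(clusters[a] + clusters[b], points)
            if best is None or size < best[0]:
                best = (size, a, b)
        size, a, b = best
        if max_size is not None and size > max_size:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


if __name__ == "__main__":
    pts = [[0.0], [0.1], [5.0], [5.2]]
    print(cluster_states(pts, max_size=1.0))
    print(cluster_states(pts, num_clusters=3))
```

The exhaustive pair search is quadratic per merge, which is adequate for a sketch but would not scale to a full triphone state set.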

Fig. 10.2 Data-driven state tying (diagram showing states of the triphones t-ih+n, t-ih+ng, f-ih+l and s-ih+l being tied by the TC command)

As an example, the following HHEd script would cluster and tie the corresponding states of the triphone group for the phone ih

TC 100.0 "ihS2" {*-ih+*.state[2]}


TC 100.0 "ihS3" {*-ih+*.state[3]}

TC 100.0 "ihS4" {*-ih+*.state[4]}

In this example, each TC command performs clustering on the specified set of states, and each resulting cluster is tied and output as a macro. The macro name is generated by appending the cluster index to the macro name given in the command. The effect of this command is illustrated in Fig. 10.2. Note that if a word-internal triphone system is being built, it is sensible to include biphones as well as triphones in the item list. For example, the first command above would be written as

TC 100.0 "ihS2" {(*-ih,ih+*,*-ih+*).state[2]}

If the above TC commands are repeated for all phones, the resulting set of tied-state models will have far fewer parameters in total than the original untied set. The numeric argument immediately following the TC command name is the cluster threshold. Increasing this value will allow larger, and hence fewer, clusters. The aim, of course, is to strike the right balance between compactness and the acoustic accuracy of the individual models. In practice, the use of this command requires some experimentation to find a good threshold value. HHEd provides extensive trace output for monitoring clustering operations. Note in this respect that as well as setting tracing from the command line and the configuration file, tracing in HHEd can be set by the TR command. Thus, tracing can be controlled at the command level. Further trace information can be obtained by including the SH command at strategic points in the edit script. The effect of executing this command is to list out all of the parameter tyings currently in force.

A potential problem with the use of the TC and NC commands is that outlier states will tend to form their own singleton clusters for which there is then insufficient data to properly train. One solution to this is to use the RO command to remove outliers. This command has the form

RO thresh "statsfile"

where statsfile is the name of a statistics file output using the -s option of HERest. This statistics file holds the occupation counts for all states of the HMM set being trained. The term occupation count refers to the number of frames allocated to a particular state and can be used as a measure of how much training data is available for estimating the parameters of that state.

The RO command must be executed before the TC or NC commands used to do the actual clustering. Its effect is to simply read in the statistics information from the given file and then to set a flag instructing the TC or NC commands to remove any outliers remaining at the conclusion of the normal clustering process. This is done by repeatedly finding the cluster with the smallest total occupation count and merging it with its nearest neighbour. This process is repeated until all clusters have a total occupation count which exceeds thresh, thereby ensuring that every cluster of states will be properly trained in the subsequent re-estimation performed by HERest.
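The outlier-removal loop can be sketched in Python as follows. This is an illustration under stated assumptions, not HHEd code: the function names are hypothetical, occupation counts are passed in as a plain dictionary rather than read from a stats file, and the between-cluster distance is supplied by the caller (here a furthest-member distance over made-up one-dimensional positions).

```python
def make_distance(positions):
    """Build a furthest-member distance over hypothetical 1-D state positions."""
    def distance(c1, c2):
        return max(abs(positions[a] - positions[b]) for a in c1 for b in c2)
    return distance


def remove_outliers(clusters, occ, thresh, distance):
    """Merge the least-occupied cluster into its nearest neighbour until
    every cluster's total occupation count exceeds thresh."""
    clusters = [list(c) for c in clusters]

    def total(cluster):
        return sum(occ[s] for s in cluster)

    while len(clusters) > 1:
        smallest = min(range(len(clusters)), key=lambda k: total(clusters[k]))
        if total(clusters[smallest]) > thresh:
            break  # all clusters now have enough data
        nearest = min(
            (k for k in range(len(clusters)) if k != smallest),
            key=lambda k: distance(clusters[smallest], clusters[k]),
        )
        clusters[nearest].extend(clusters[smallest])
        del clusters[smallest]
    return clusters


if __name__ == "__main__":
    positions = {0: 0.0, 1: 0.1, 2: 5.0, 3: 5.3, 4: 9.0}
    occ = {0: 50.0, 1: 60.0, 2: 40.0, 3: 70.0, 4: 5.0}
    start = [[0, 1], [2, 3], [4]]
    print(remove_outliers(start, occ, 20.0, make_distance(positions)))
```

In the usage example, the singleton cluster holding state 4 has an occupation count of only 5, so it is absorbed by the closer of the two remaining clusters.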

On completion of the above clustering and tying procedures, many of the models may be effectively identical, since acoustically similar triphones may share common clusters for all their emitting states. They are then, in effect, so-called generalised triphones. State tying can be further exploited if the HMMs which are effectively equivalent are identified and then tied via the physical-logical mapping¹ facility provided by HMM lists (see section 7.4). The effect of this would be to reduce the total number of HMM definitions required. HHEd provides a compaction command to do all of this automatically. For example, the command

CO newList

will compact the currently loaded HMM set by identifying equivalent models and then tying them via the new HMM list output to the file newList. Note, however, that for two HMMs to be tied, they must be identical in all respects. This is one of the reasons why transition parameters are often tied across triphone groups; otherwise, HMMs with identical states would still be left distinct due to minor differences in their transition matrices.
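The idea behind compaction can be sketched in Python. This is not the CO implementation: the function name compact is hypothetical, each model is reduced to a tuple of the macro names it uses, the macro names themselves are made up for illustration, and the choice of the first name in sorted order as the physical model is an arbitrary assumption.

```python
def compact(models):
    """Map each logical HMM name to a physical HMM name.

    models maps a logical name to a hashable description of the model,
    here a tuple of the state and transition macros it refers to.
    Models with identical descriptions share one physical model.
    """
    physical_for = {}
    mapping = {}
    for name in sorted(models):
        description = models[name]
        # The first model seen with this description becomes the physical one.
        mapping[name] = physical_for.setdefault(description, name)
    return mapping


if __name__ == "__main__":
    # Hypothetical macro names; real names depend on the edit script used.
    models = {
        "t-ih+n": ("ihS2_1", "ihS3_1", "ihS4_2", "T_ih"),
        "t-ih+ng": ("ihS2_1", "ihS3_1", "ihS4_2", "T_ih"),
        "f-ih+l": ("ihS2_3", "ihS3_1", "ihS4_1", "T_ih"),
    }
    for logical, physical in compact(models).items():
        # Print the logical name, followed by the physical name when they differ.
        print(logical if logical == physical else f"{logical} {physical}")
```

If the two t-ih models referred to different transition macros, their descriptions would differ and they would stay distinct, which is the situation that tying transition matrices avoids.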

10.5 Tree-Based Clustering

One limitation of the data-driven clustering procedure described above is that it does not deal with triphones for which there are no examples in the training data. When building word-internal triphone systems, this problem can often be avoided by careful design of the training database, but when building large vocabulary cross-word triphone systems, unseen triphones are unavoidable.

¹ The physical HMM corresponding to several logical HMMs will be arbitrarily named after one of them.
