
~o <HMMSetId> ecrl_us_mono
<VecSize> 4 <MFCC>
~r "ecrl_us_mono_tree_4"
<RegTree> 4
  <Node> 1 2 3
  <Node> 2 4 5
  <Node> 3 6 7
  <TNode> 4 30
  <TNode> 5 25
  <TNode> 6 40
  <TNode> 7 39
~s "stateA" <NumMixes> 3
  <Mixture> 1 0.34
    <RClass> 4
    ~u "mean51"
    ~v "var65"
  <Mixture> 2 0.52
    <RClass> 7
    ~u "mean32"
    ~v "var65"
  <Mixture> 3 0.14
    <RClass> 5
    ~u "mean12"
    ~v "var3"

Fig. 7.18 MMF with a regression tree and classes

7.10 The HMM Definition Language

To conclude this chapter, this section presents a formal description of the HMM definition language used by HTK. Syntax is described using an extended BNF notation in which alternatives are separated by a vertical bar |, parentheses () denote factoring, brackets [ ] denote options, and braces {} denote zero or more repetitions.

All keywords are enclosed in angle brackets6 and the case of the keyword name is not significant.

White space is not significant except within double-quoted strings.

The top level structure of a HMM definition is shown by the following rule.

hmmdef = [ ~h macro ]
         <BeginHMM>
         [ globalOpts ]
         <NumStates> short
         state { state }
         [ regTree ]
         transP
         [ duration ]
         <EndHMM>

A HMM definition consists of an optional set of global options followed by the <NumStates> keyword whose following argument specifies the number of states in the model, inclusive of the non-emitting entry and exit states7. The information for each state is then given in turn, followed by the parameters of the transition matrix and the model duration parameters, if any. The name of the HMM is given by the ~h macro. If the HMM is the only definition within a file, the ~h macro name can be omitted and the HMM name is assumed to be the same as the file name.
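To make the top level rule concrete, the following is a minimal sketch of a complete single-Gaussian definition. The model name "yes", the vector size of 4 and all numeric values are invented for illustration; only the keyword structure is prescribed by the syntax rules in this section.

~h "yes"
<BeginHMM>
  <VecSize> 4 <MFCC>
  <NumStates> 4
  <State> 2
    <Mean> 4
      0.0 0.0 0.0 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 4
      0.1 0.2 0.0 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0
  <TransP> 4
    0.0 1.0 0.0 0.0
    0.0 0.6 0.4 0.0
    0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0
<EndHMM>

Here states 1 and 4 are the non-emitting entry and exit states, so only states 2 and 3 carry output distributions.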

The global options are common to all HMMs. They can be given separately using a ~o option macro

optmacro = ~o globalOpts

or they can be included in one or more HMM definitions. Global options may be repeated but no definition can change a previous definition. All global options must be defined before any other macro definition is processed. In practice this means that any HMM system which uses parameter tying must have a ~o option macro at the head of the first macro file processed.

6 This definition covers the textual version only. The syntax for the binary format is identical apart from the way that the lexical items are encoded.

7 Integer numbers are specified as either char or short. This has no effect on text-based definitions but for binary format it indicates the underlying C type used to represent the number.

The full set of global options is given below. Every HMM set must define the vector size (via <VecSize>), the stream widths (via <StreamInfo>) and the observation parameter kind. However, if only the stream widths are given, then the vector size will be inferred. If only the vector size is given, then a single stream of identical width will be assumed. All other options default to null.

globalOpts = option { option }
option = <HmmSetId> string |
         <StreamInfo> short { short } |
         <VecSize> short |
         covkind |
         durkind |
         parmkind

The <HmmSetId> option allows the user to give the MMF an identifier. This is used as a sanity check to make sure that a TMF can be safely applied to this MMF. The arguments to the <StreamInfo> option are the number of streams (default 1) and then, for each stream, the width of that stream. The <VecSize> option gives the total number of elements in each input vector. If both <VecSize> and <StreamInfo> are included then the sum of all the stream widths must equal the input vector size.
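As an illustration, a global options macro for 26-element MFCC_E_D observations split into three streams might be written as follows; the identifier mysys and the particular stream split are invented for the example.

~o <HmmSetId> mysys <StreamInfo> 3 12 12 2 <VecSize> 26 <MFCC_E_D>

The three stream widths (12 static coefficients, 12 deltas, and 2 for energy plus delta energy) sum to the vector size of 26, as required.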

The covkind defines the kind of the covariance matrix

covkind = <DiagC> | <InvDiagC> | <FullC> | <LLTC> | <XformC>

where <InvDiagC> is used internally. <LLTC> and <XformC> are not used in HTK Version 2.0.

Setting the covariance kind as a global option forces all components to have this kind. In particular, it prevents mixing full and diagonal covariances within a HMM set.

The durkind denotes the type of duration model used according to the following rules

durkind = <nullD> | <poissonD> | <gammaD> | <genD>

For anything other than <nullD>, a duration vector must be supplied for the model or each state as described below. Note that no current HTK tool can estimate or use such duration vectors.

The parameter kind is any legal parameter kind including qualified forms (see section 5.1)

parmkind = <basekind{_D|_A|_E|_N|_Z|_O|_V|_C|_K}>
basekind = <discrete> | <lpc> | <lpcepstra> | <mfcc> | <fbank> |
           <melspec> | <lprefc> | <lpdelcep> | <user>

where the syntax rule for parmkind is non-standard in that no spaces are allowed between the base kind and any subsequent qualifiers. As noted in chapter 5, <lpdelcep> is provided only for compatibility with earlier versions of HTK and its further use should be avoided.
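For example (these particular combinations are illustrative only), <MFCC_E_D> denotes MFCC coefficients qualified by appended log energy and delta coefficients, while <MFCC_E_D_Z> additionally applies cepstral mean normalisation; by the rule above, no spaces may appear between the base kind MFCC and its qualifiers.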

Each state of each HMM must have its own section defining the parameters associated with that state

state = <State: Exp> short stateinfo

where the short following <State: Exp> is the state number. State information can be defined in any order. The syntax is as follows

stateinfo = ~s macro |
            [ mixes ] [ weights ] stream { stream } [ duration ]
macro = string

A stateinfo definition consists of an optional specification of the number of mixtures, an optional set of stream weights, followed by a block of information for each stream, optionally terminated with a duration vector. Alternatively, ~s macro can be written where macro is the name of a previously defined macro.

The optional mixes in a stateinfo definition specify the number of mixture components (or discrete codebook size) for each stream of that state


mixes = <NumMixes> short {short}

where there should be one short for each stream. If this specification is omitted, it is assumed that all streams have just one mixture component.

The optional weights in a stateinfo definition define a set of exponent weights for each independent data stream. The syntax is

weights = ~w macro | <SWeights> short vector
vector = float { float }

where the short gives the number S of weights (which should match the value given in the <StreamInfo> option) and the vector contains the S stream weights γs (see section 7.1).

The definition of each stream depends on the kind of HMM set. In the normal case, it consists of a sequence of mixture component definitions optionally preceded by the stream number. If the stream number is omitted then it is assumed to be 1. For tied-mixture and discrete HMM sets, special forms are used.

stream = [ <Stream> short ]
         ( mixture { mixture } | tmixpdf | discpdf )

The definition of each mixture component consists of a Gaussian pdf optionally preceded by the mixture number and its weight

mixture = [ <Mixture> short float ] mixpdf

If the <Mixture> part is missing then mixture 1 is assumed and the weight defaults to 1.0.
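Putting the stateinfo, stream and mixture rules together, a two-mixture single-stream state macro might be written as follows; the macro name "stateB" and all values are invented for illustration.

~s "stateB" <NumMixes> 2
<Stream> 1
  <Mixture> 1 0.7
    <Mean> 4
      0.3 0.1 0.0 0.2
    <Variance> 4
      1.0 1.0 1.0 1.0
  <Mixture> 2 0.3
    <Mean> 4
      0.5 0.4 0.1 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0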

The tmixpdf option is used only for fully tied mixture sets. Since the mixpdf parts are all macros in a tied mixture system and since they are identical for every stream and state, it is only necessary to know the mixture weights. The tmixpdf syntax allows these to be specified in the following compact form

tmixpdf = <TMix> macro weightList
weightList = repShort { repShort }
repShort = short [ * char ]

where each short is a mixture component weight scaled so that a weight of 1.0 is represented by the integer 32767. The optional asterisk followed by a char is used to indicate a repeat count. For example, 0*5 is equivalent to 5 zeroes. The Gaussians which make up the pool of tied mixtures are defined using ~m macros called macro1, macro2, macro3, etc.
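For instance, assuming a pool of five tied mixtures defined by ~m macros mix1 to mix5, a hypothetical tied-mixture weight specification could read

<TMix> mix 16384 16384 0*3

which gives the first two components a weight of approximately 0.5 each (16384/32767) and the remaining three components zero weight.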

Discrete probability HMMs are defined in a similar way

discpdf = <DProb> weightList

The only difference is that the weights in the weightList are scaled log probabilities as defined in section 7.6.

The definition of a Gaussian pdf requires the mean vector to be given and one of the possible forms of covariance

mixpdf = ~m macro | [ rclass ] mean cov [ <GConst> float ]
rclass = <RClass> short
mean = ~u macro | <Mean> short vector
cov = var | inv | xform
var = ~v macro | <Variance> short vector
inv = ~i macro | (<InvCovar> | <LLTCovar>) short tmatrix
xform = ~x macro | <Xform> short short matrix
matrix = float { float }
tmatrix = matrix

In mean and var, the short preceding the vector defines the length of the vector; in inv, the short preceding the tmatrix gives the size of this square upper triangular matrix; and in xform, the two shorts preceding the matrix give the number of rows and columns. The optional <GConst>8 gives that part of the log probability of a Gaussian that can be precomputed. If it is omitted, then it will be computed during load-in; including it simply saves some time. HTK tools which output HMM definitions always include this field. The optional <RClass> stores the regression base class index that this mixture component belongs to, as specified by the regression class tree (which is also stored in the model set). HTK tools which output HMM definitions always include this field, and if there is no regression class tree then the regression identifier is set to zero.

8 Specifically, in equation 7.2, the GCONST value seen in HMM sets is calculated by multiplying the determinant of the covariance matrix by (2π)^n.
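As a worked example, assuming the stored value is the log form used in equation 7.2, GCONST = n log(2π) + log |Σ|: for a diagonal covariance with n = 4 and all variances equal to 1.0, the determinant is 1.0 and GCONST = 4 log(2π) ≈ 7.35.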

In addition to defining the output distributions, a state can have a duration probability distribution defined for it. However, no current HTK tool can estimate or use these.

duration = ~d macro | <Duration> short vector

Alternatively, as shown by the top level syntax for a hmmdef, duration parameters can be specified for a whole model.

A binary regression class tree (for the purposes of HMM adaptation as described in chapter 9) may also exist for an HMM set. This is defined by

regTree = ~r macro tree
tree = <RegTree> short nodes
nodes = ( <Node> short short short | <TNode> short int ) [ nodes ]

In tree the short preceding the nodes refers to the number of terminal nodes or leaves that the regression tree contains. Each node in nodes can either be a non-terminal <Node> or a terminal (leaf) <TNode>. For a <Node>, the three following shorts refer to the node's index number and the index numbers of its children. For a <TNode>, the short refers to the leaf's index (which corresponds to a regression base class index as stored at the component level in <RClass>, see above), while the int refers to the number of mixture components in this leaf cluster.
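Reading Fig. 7.18 by these rules: <RegTree> 4 declares a tree with four terminal nodes; non-terminal node 1 has children 2 and 3, which in turn have leaf children 4, 5 and 6, 7 respectively; and leaf 4, for example, is regression base class 4 and clusters 30 mixture components, which is why the first mixture of "stateA" carries <RClass> 4.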

Finally, the transition matrix is defined by

transP = ~t macro | <TransP> short matrix

where the short in this case should be equal to the number of states in the model.
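In the illustrative hmmdef sketch given earlier, for example, <TransP> 4 is followed by a 4 x 4 matrix: row i holds the transition probabilities out of state i, each emitting state's row sums to 1.0, and the final row is all zeros because the exit state has no successors.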

Chapter 8

HMM Parameter Estimation

[Chapter opening figure: the training tools HCompV/HInit and HRest/HERest applied to phone-level training transcriptions such as "th ih s ih z s p iy t sh", "sh t iy p s z ih s ih th" and "a b ih t l oh n g ax s p iy t sh".]

In chapter 7 the various types of HMM were described and the way in which they are represented within HTK was explained. Defining the structure and overall form of a set of HMMs is the first step towards building a recogniser. The second step is to estimate the parameters of the HMMs from examples of the data sequences that they are intended to model. This process of parameter estimation is usually called training. HTK supplies four basic tools for parameter estimation: HCompV, HInit, HRest and HERest. HCompV and HInit are used for initialisation. HCompV will set the mean and variance of every Gaussian component in a HMM definition to be equal to the global mean and variance of the speech training data. This is typically used as an initialisation stage for flat-start training. Alternatively, a more detailed initialisation is possible using HInit which will compute the parameters of a new HMM using a Viterbi style of estimation.

HRest and HERest are used to refine the parameters of existing HMMs using Baum-Welch Re-estimation. Like HInit, HRest performs isolated-unit training whereas HERest operates on complete model sets and performs embedded-unit training. In general, whole word HMMs are built using HInit and HRest, and continuous speech sub-word based systems are built using HERest initialised by either HCompV or HInit and HRest.

This chapter describes these training tools and their use for estimating the parameters of plain (i.e. untied) continuous density HMMs. Tying and special cases such as tied-mixture HMM sets and discrete probability HMMs are dealt with in later chapters. The first section of this chapter gives an overview of the various training strategies possible with HTK. This is then followed by sections covering initialisation, isolated-unit training, and embedded training. The chapter concludes with a section detailing the various formulae used by the training tools.

8.1 Training Strategies

As indicated in the introduction above, the basic operation of the HTK training tools involves reading in a set of one or more HMM definitions, and then using speech data to estimate the parameters of these definitions. The speech data files are normally stored in parameterised form such as LPC or MFCC parameters. However, additional parameters such as delta coefficients are normally computed on-the-fly whilst loading each file.

[Figure: unlabelled tokens are passed to HInit (or HCompV) and then refined by HRest to produce whole word HMMs.]

Fig. 8.1 Isolated Word Training

In fact, it is also possible to use waveform data directly by performing the full parameter conversion on-the-fly. Which approach is preferred depends on the available computing resources. The advantages of storing the data already encoded are that the data is more compact in parameterised form and pre-encoding avoids wasting compute time converting the data each time it is read in. However, if the training data is derived from CD-ROMs and they can be accessed automatically on-line, then the extra compute may be worth the saving in magnetic disk storage.

The methods for configuring speech data input to HTK tools were described in detail in chapter 5. All of the various input mechanisms are supported by the HTK training tools except direct audio input.

The precise way in which the training tools are used depends on the type of HMM system to be built and the form of the available training data. Furthermore, HTK tools are designed to interface cleanly to each other, so a large number of configurations are possible. In practice, however, HMM-based speech recognisers are either whole-word or sub-word.

As the name suggests, whole word modelling refers to a technique whereby each individual word in the system vocabulary is modelled by a single HMM. As shown in Fig. 8.1, whole word HMMs are most commonly trained on examples of each word spoken in isolation. If these training examples, which are often called tokens, have had leading and trailing silence removed, then they can be input directly into the training tools without the need for any label information. The most common method of building whole word HMMs is to first use HInit to calculate initial parameters for the model and then use HRest to refine the parameters using Baum-Welch re-estimation. Where training data is limited and recognition in adverse noise environments is needed, so-called fixed variance models can offer improved robustness. These are models in which all the variances are set equal to the global speech variance and never subsequently re-estimated. The tool HCompV can be used to compute this global variance.
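As a sketch of this procedure (all file and directory names here are invented, and only a minimal set of options is shown; see the reference sections for the full option lists), the HInit-then-HRest sequence might be invoked as

HInit -S trainlist.txt -M hmm0 proto
HRest -S trainlist.txt -M hmm1 hmm0/proto

where -S names a script file listing the training tokens and -M names the directory in which the updated model is stored.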


[Figure: on one path, labelled bootstrap utterances feed HInit and HRest to initialise each phone HMM; on the other, HCompV provides a flat start. Unlabelled utterances together with their transcriptions (e.g. "th ih s ih s p iy t sh", "sh t iy s z ih s ih th") then drive embedded re-estimation with HERest, with HHEd applied between stages, to produce the sub-word HMMs.]

Fig. 8.2 Training Subword HMMs

Although HTK gives full support for building whole-word HMM systems, the bulk of its facilities are focussed on building sub-word systems in which the basic units are the individual sounds of the language called phones. One HMM is constructed for each such phone and continuous speech is recognised by joining the phones together to make any required vocabulary using a pronunciation dictionary.

The basic procedures involved in training a set of subword models are shown in Fig. 8.2. The core process involves the embedded training tool HERest. HERest uses continuously spoken utterances as its source of training data and simultaneously re-estimates the complete set of subword HMMs. For each input utterance, HERest needs a transcription i.e. a list of the phones in that utterance. HERest then joins together all of the subword HMMs corresponding to this phone list to make a single composite HMM. This composite HMM is used to collect the necessary statistics for the re-estimation. When all of the training utterances have been processed, the total set of accumulated statistics are used to re-estimate the parameters of all of the phone HMMs. It is important to emphasise that in the above process, the transcriptions are only needed to identify the sequence of phones in each utterance. No phone boundary information is needed.

The initialisation of a set of phone HMMs prior to embedded re-estimation using HERest can be achieved in two different ways. As shown on the left of Fig. 8.2, a small set of hand-labelled bootstrap training data can be used along with the isolated training tools HInit and HRest to initialise each phone HMM individually. When used in this way, both HInit and HRest use the label information to extract all the segments of speech corresponding to the current phone HMM in order to perform isolated word training.

A simpler initialisation procedure uses HCompV to assign the global speech mean and variance to every Gaussian distribution in every phone HMM. This so-called flat start procedure implies that during the first cycle of embedded re-estimation, each training utterance will be uniformly segmented. The hope then is that enough of the phone models align with actual realisations of that phone so that on the second and subsequent iterations, the models align as intended.
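A minimal flat-start sketch, with invented file names, might therefore be

HCompV -f 0.01 -m -S trainlist.txt -M hmm0 proto
HERest -I phones.mlf -t 250.0 150.0 1000.0 -S trainlist.txt \
       -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones

where -f sets a variance floor as a fraction of the global variance, -m requests that means be updated as well as variances, -I names the transcription label file, and -t sets the forward-backward pruning thresholds.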

