### The HTK Book

### Steve Young Dan Kershaw Julian Odell Dave Ollason Valtcho Valtchev Phil Woodland

### The HTK Book (for HTK Version 3.0)

*© COPYRIGHT 1995-1999 Microsoft Corporation.*

### All Rights Reserved

### First published December 1995 Reprinted March 1996

### Revised for HTK Version 2.1 March 1997

### Revised for HTK Version 2.2 January 1999

### Revised for HTK Version 3.0 July 2000

## Contents

### I Tutorial Overview

1 The Fundamentals of HTK

1.1 General Principles of HMMs

1.2 Isolated Word Recognition

1.3 Output Probability Specification

1.4 Baum-Welch Re-Estimation

1.5 Recognition and Viterbi Decoding

1.6 Continuous Speech Recognition

1.7 Speaker Adaptation

2 An Overview of the HTK Toolkit

2.1 HTK Software Architecture

2.2 Generic Properties of a HTK Tool

2.3 The Toolkit

2.3.1 Data Preparation Tools

2.3.2 Training Tools

2.3.3 Recognition Tools

2.3.4 Analysis Tool

2.4 What's New In Version 2.2

2.4.1 Features Added To Version 2.1

3 A Tutorial Example of Using HTK

3.1 Data Preparation

3.1.1 Step 1 - the Task Grammar

3.1.2 Step 2 - the Dictionary

3.1.3 Step 3 - Recording the Data

3.1.4 Step 4 - Creating the Transcription Files

3.1.5 Step 5 - Coding the Data

3.2 Creating Monophone HMMs

3.2.1 Step 6 - Creating Flat Start Monophones

3.2.2 Step 7 - Fixing the Silence Models

3.2.3 Step 8 - Realigning the Training Data

3.3 Creating Tied-State Triphones

3.3.1 Step 9 - Making Triphones from Monophones

3.3.2 Step 10 - Making Tied-State Triphones

3.4 Recogniser Evaluation

3.4.1 Step 11 - Recognising the Test Data

3.5 Running the Recogniser Live

3.6 Adapting the HMMs

3.6.1 Step 12 - Preparation of the Adaptation Data

3.6.2 Step 13 - Generating the Transforms

3.6.3 Step 14 - Evaluation of the Adapted System

3.7 Summary

### II HTK in Depth

4 The Operating Environment

4.1 The Command Line

4.2 Script Files

4.3 Configuration Files

4.4 Standard Options

4.5 Error Reporting

4.6 Strings and Names

4.7 Memory Management

4.8 Input/Output via Pipes and Networks

4.9 Byte-swapping of HTK data files

4.10 Summary

5 Speech Input/Output

5.1 General Mechanism

5.2 Speech Signal Processing

5.3 Linear Prediction Analysis

5.4 Filterbank Analysis

5.5 Energy Measures

5.6 Delta and Acceleration Coefficients

5.7 Storage of Parameter Files

5.7.1 HTK Format Parameter Files

5.7.2 Esignal Format Parameter Files

5.8 Waveform File Formats

5.8.1 HTK File Format

5.8.2 Esignal File Format

5.8.3 TIMIT File Format

5.8.4 NIST File Format

5.8.5 SCRIBE File Format

5.8.6 SDES1 File Format

5.8.7 AIFF File Format

5.8.8 SUNAU8 File Format

5.8.9 OGI File Format

5.8.10 WAV File Format

5.8.11 ALIEN and NOHEAD File Formats

5.9 Direct Audio Input/Output

5.10 Multiple Input Streams

5.11 Vector Quantisation

5.12 Viewing Speech with HList

5.13 Copying and Coding using HCopy

5.14 Version 1.5 Compatibility

5.15 Summary

6 Transcriptions and Label Files

6.1 Label File Structure

6.2 Label File Formats

6.2.1 HTK Label Files

6.2.2 ESPS Label Files

6.2.3 TIMIT Label Files

6.2.4 SCRIBE Label Files

6.3 Master Label Files

6.3.1 General Principles of MLFs

6.3.2 Syntax and Semantics

6.3.3 MLF Search

6.3.4 MLF Examples

6.4 Editing Label Files

6.5 Summary

7 HMM Definition Files

7.1 The HMM Parameters

7.2 Basic HMM Definitions

7.3 Macro Definitions

7.4 HMM Sets

7.5 Tied-Mixture Systems

7.6 Discrete Probability HMMs

7.7 Tee Models

7.8 Regression Class Trees for Adaptation

7.9 Binary Storage Format

7.10 The HMM Definition Language

8 HMM Parameter Estimation

8.1 Training Strategies

8.2 Initialisation using HInit

8.3 Flat Starting with HCompV

8.4 Isolated Unit Re-Estimation using HRest

8.5 Embedded Training using HERest

8.6 Single-Pass Retraining

8.7 Parameter Re-Estimation Formulae

8.7.1 Viterbi Training (HInit)

8.7.2 Forward/Backward Probabilities

8.7.3 Single Model Reestimation (HRest)

8.7.4 Embedded Model Reestimation (HERest)

9 HMM Adaptation

9.1 Model Adaptation using MLLR

9.1.1 Maximum Likelihood Linear Regression

9.1.2 MLLR and Regression Classes

9.1.3 Transform Model File Format

9.2 Model Adaptation using MAP

9.3 Using HEAdapt

9.4 MLLR Formulae

9.4.1 Estimation of the Mean Transformation Matrix

9.4.2 Estimation of the Variance Transformation Matrix

10 HMM System Refinement

10.1 Using HHEd

10.2 Constructing Context-Dependent Models

10.3 Parameter Tying and Item Lists

10.4 Data-Driven Clustering

10.5 Tree-Based Clustering

10.6 Mixture Incrementing

10.7 Regression Class Tree Construction

10.8 Miscellaneous Operations

11 Discrete and Tied-Mixture Models

11.1 Modelling Discrete Sequences

11.2 Using Discrete Models with Speech

11.3 Tied Mixture Systems

11.4 Parameter Smoothing

12 Networks, Dictionaries and Language Models

12.1 How Networks are Used

12.2 Word Networks and Standard Lattice Format

12.3 Building a Word Network with HParse

12.4 Bigram Language Models

12.5 Building a Word Network with HBuild

12.6 Testing a Word Network using HSGen

12.7 Constructing a Dictionary

12.8 Word Network Expansion

12.9 Other Kinds of Recognition System

13 Decoding

13.1 Decoder Operation

13.2 Decoder Organisation

13.3 Recognition using Test Databases

13.4 Evaluating Recognition Results

13.5 Generating Forced Alignments

13.6 Decoding and Adaptation

13.6.1 Recognition with Adapted HMMs

13.6.2 Unsupervised Adaptation

13.7 Recognition using Direct Audio Input

13.8 N-Best Lists and Lattices

### III Reference Section

14 The HTK Tools

14.1 HBuild

14.1.1 Function

14.1.2 Use

14.1.3 Tracing

14.2 HCompV

14.2.1 Function

14.2.2 Use

14.2.3 Tracing

14.3 HCopy

14.3.1 Function

14.3.2 Use

14.3.3 Trace Output

14.4 HDMan

14.4.1 Function

14.4.2 Use

14.4.3 Tracing

14.5 HEAdapt

14.5.1 Function

14.5.2 Use

14.5.3 Tracing

14.6 HERest

14.6.1 Function

14.6.2 Use

14.6.3 Tracing

14.7 HHEd

14.7.1 Function

14.7.2 Use

14.7.3 Tracing

14.8 HInit

14.8.1 Function

14.8.2 Use

14.8.3 Tracing

14.9 HLEd

14.9.1 Function

14.9.2 Use

14.9.3 Tracing

14.10 HList

14.10.1 Function

14.10.2 Use

14.10.3 Tracing

14.11 HLStats

14.11.1 Function

14.11.2 Bigram Generation

14.11.3 Use

14.11.4 Tracing

14.12 HParse

14.12.1 Function

14.12.2 Network Definition

14.12.3 Compatibility Mode

14.12.4 Use

14.12.5 Tracing

14.13 HQuant

14.13.1 Function

14.13.2 VQ Codebook Format

14.13.3 Use

14.13.4 Tracing

14.14 HRest

14.14.1 Function

14.14.2 Use

14.14.3 Tracing

14.15 HResults

14.15.1 Function

14.15.2 Use

14.15.3 Tracing

14.16 HSGen

14.16.1 Function

14.16.2 Use

14.16.3 Tracing

14.17 HSLab

14.17.1 Function

14.17.2 Use

14.17.3 Tracing

14.18 HSmooth

14.18.1 Function

14.18.2 Use

14.18.3 Tracing

14.19 HVite

14.19.1 Function

14.19.2 Use

14.19.3 Tracing

15 Configuration Variables

15.1 Configuration Variables used in Library Modules

15.2 Configuration Variables used in Tools

16 Error and Warning Codes

16.1 Generic Errors

16.2 Summary of Errors by Tool and Module

17 HTK Standard Lattice Format (SLF)

17.1 SLF Files

17.2 Format

17.3 Syntax

17.4 Field Types

17.5 Example SLF file

### Part I

## Tutorial Overview


### Chapter 1

## The Fundamentals of HTK

[Figure: speech data and its transcription are passed to the training tools; the recogniser then converts unknown speech into a transcription.]

HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series and the core of HTK is similarly general-purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular recognisers. Thus, much of the infrastructure support in HTK is dedicated to this task. As shown in the accompanying picture, there are two major processing stages involved. Firstly, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Secondly, unknown utterances are transcribed using the HTK recognition tools.

The main body of this book is mostly concerned with the mechanics of these two processes. However, before launching into detail it is necessary to understand some of the basic principles of HMMs. It is also helpful to have an overview of the toolkit and to have some appreciation of how training and recognition in HTK are organised.

This first part of the book attempts to provide this information. In this chapter, the basic ideas of HMMs and their use in speech recognition are introduced. The following chapter then presents a brief overview of HTK and, for users of older versions, it highlights the main differences in version 2.0 and later. Finally in this tutorial part of the book, chapter 3 describes how a HMM-based speech recogniser can be built using HTK. It does this by describing the construction of a simple small vocabulary continuous speech recogniser.

The second part of the book then revisits the topics skimmed over here and discusses each in detail. This can be read in conjunction with the third and final part of the book which provides a reference manual for HTK. This includes a description of each tool, summaries of the various parameters used to configure HTK and a list of the error messages that it generates when things go wrong.

Finally, note that this book is concerned only with HTK as a tool-kit. It does not provide information for using the HTK libraries as a programming environment.


### 1.1 General Principles of HMMs

Fig. 1.1 Message Encoding/Decoding (a concept, that is a sequence of symbols $s_1, s_2, s_3$, etc., is realised as a speech waveform; the waveform is parameterised into speech vectors, which are recognised as the symbols $s_1, s_2, s_3$)

Speech recognition systems generally assume that the speech signal is a realisation of some message encoded as a sequence of one or more symbols (see Fig. 1.1). To effect the reverse operation of recognising the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform on the basis that for the duration covered by a single vector (typically 10ms or so), the speech waveform can be regarded as being stationary. Although this is not strictly true, it is a reasonable approximation. Typical parametric representations in common use are smoothed spectra or linear prediction coefficients plus various other representations derived from these.

The rôle of the recogniser is to effect a mapping between sequences of speech vectors and the wanted underlying symbol sequences. Two problems make this very difficult. Firstly, the mapping from symbols to speech is not one-to-one since different underlying symbols can give rise to similar speech sounds. Furthermore, there are large variations in the realised speech waveform due to speaker variability, mood, environment, etc. Secondly, the boundaries between symbols cannot be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech waveform as a sequence of concatenated static patterns.

The second problem of not knowing the word boundary locations can be avoided by restricting the task to isolated word recognition. As shown in Fig. 1.2, this implies that the speech waveform corresponds to a single underlying symbol (e.g. word) chosen from a fixed vocabulary. Despite the fact that this simpler problem is somewhat artificial, it nevertheless has a wide range of practical applications. Furthermore, it serves as a good basis for introducing the basic ideas of HMM-based recognition before dealing with the more complex continuous speech case. Hence, isolated word recognition using HMMs will be dealt with first.

### 1.2 Isolated Word Recognition

Let each spoken word be represented by a sequence of speech vectors or observations $O$, defined as

$$O = o_1, o_2, \ldots, o_T \tag{1.1}$$

where $o_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then be regarded as that of computing

$$\arg\max_i \{P(w_i|O)\} \tag{1.2}$$

where $w_i$ is the $i$'th vocabulary word. This probability is not computable directly but using Bayes' Rule gives

$$P(w_i|O) = \frac{P(O|w_i)P(w_i)}{P(O)} \tag{1.3}$$

Thus, for a given set of prior probabilities $P(w_i)$, the most probable spoken word depends only on the likelihood $P(O|w_i)$. Given the dimensionality of the observation sequence $O$, the direct estimation of the joint conditional probability $P(o_1, o_2, \ldots | w_i)$ from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities $P(O|w_i)$ is replaced by the much simpler problem of estimating the Markov model parameters.

Fig. 1.2 Isolated Word Problem (a concept, here a single word $w$, is realised as a speech waveform; the waveform is parameterised into speech vectors, which are recognised as $w$)

In HMM-based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit and each time $t$ that a state $j$ is entered, a speech vector $o_t$ is generated from the probability density $b_j(o_t)$. Furthermore, the transition from state $i$ to state $j$ is also probabilistic and is governed by the discrete probability $a_{ij}$. Fig. 1.3 shows an example of this process where the six state model moves through the state sequence $X = 1, 2, 2, 3, 4, 4, 5, 6$ in order to generate the sequence $o_1$ to $o_6$. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.

The joint probability that $O$ is generated by the model $M$ moving through the state sequence $X$ is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence $X$ in Fig. 1.3

$$P(O, X|M) = a_{12}b_2(o_1)a_{22}b_2(o_2)a_{23}b_3(o_3)\ldots \tag{1.4}$$

However, in practice, only the observation sequence $O$ is known and the underlying state sequence $X$ is hidden. This is why it is called a Hidden Markov Model.

Fig. 1.3 The Markov Generation Model (a six-state Markov model $M$ with transitions $a_{12}, a_{22}, a_{23}, a_{24}, a_{33}, a_{34}, a_{35}, a_{44}, a_{45}, a_{55}, a_{56}$ generates the observation sequence $o_1$ to $o_6$ through the output densities $b_2(o_1), b_2(o_2), b_3(o_3), b_4(o_4), b_4(o_5), b_5(o_6)$)

Given that $X$ is unknown, the required likelihood is computed by summing over all possible state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is

$$P(O|M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \tag{1.5}$$

where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model exit state.

As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is

$$\hat{P}(O|M) = \max_{X} \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \right\} \tag{1.6}$$

Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models $M_i$ corresponding to words $w_i$, equation 1.2 is solved by using 1.3 and assuming that

$$P(O|w_i) = P(O|M_i). \tag{1.7}$$

All this, of course, assumes that the parameters $\{a_{ij}\}$ and $\{b_j(o_t)\}$ are known for each model $M_i$. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected then a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: “one”, “two” and “three”. Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.
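The recognition scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how HTK computes anything: it evaluates equation 1.5 by enumerating every state sequence, which is feasible only for toy examples, and it assumes the output densities $b_j(o_t)$ have already been evaluated. The function names, the 0-based state numbering (state 0 is the entry state, the last state is the exit state) and all numbers are hypothetical.

```python
from itertools import product

def path_probability(a, b, path):
    """Product term of equation 1.5 for one sequence of emitting states."""
    exit_state = len(a) - 1
    states = [0] + list(path) + [exit_state]   # x(0), x(1)..x(T), x(T+1)
    p = a[states[0]][states[1]]
    for t, j in enumerate(path):
        p *= b[j][t] * a[j][states[t + 2]]
    return p

def total_likelihood(a, b):
    """P(O|M) by summing over all state sequences (exponential in T)."""
    num_frames = len(b[1])
    emitting = range(1, len(a) - 1)
    return sum(path_probability(a, b, x)
               for x in product(emitting, repeat=num_frames))

def recognise(models, priors):
    """Return the word maximising P(O|M_i) P(w_i); P(O) is common to all words."""
    return max(models, key=lambda w: total_likelihood(*models[w]) * priors[w])

# Hypothetical 4-state left-to-right model: a[i][j] is a transition probability.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
# b[j][t] holds b_j(o_t) for three frames; non-emitting states have no entry.
b_one = [None, [0.5, 0.4, 0.1], [0.1, 0.3, 0.6], None]
b_two = [None, [0.2, 0.2, 0.2], [0.2, 0.2, 0.2], None]

models = {"one": (a, b_one), "two": (a, b_two)}
best_word = recognise(models, {"one": 0.5, "two": 0.5})
```

With equal priors the word whose model gives the larger likelihood wins; a strongly skewed prior can reverse the decision, which is the effect of the $P(w_i)$ term in equation 1.3.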

Fig. 1.4 Using HMMs for Isolated Word Recognition ((a) Training: training examples of “one”, “two” and “three” are used to estimate the models $M_1$, $M_2$ and $M_3$; (b) Recognition: for an unknown $O$, the likelihoods $P(O|M_1)$, $P(O|M_2)$ and $P(O|M_3)$ are computed and the maximum is chosen)

### 1.3 Output Probability Specification

Before the problem of parameter estimation can be discussed in more detail, the form of the output distributions $\{b_j(o_t)\}$ needs to be made explicit. HTK is designed primarily for modelling continuous parameters using continuous density multivariate output distributions. It can also handle observation sequences consisting of discrete symbols in which case, the output distributions are discrete probabilities. For simplicity, however, the presentation in this chapter will assume that continuous density distributions are being used. The minor differences that the use of discrete probabilities entail are noted in chapter 7 and discussed in more detail in chapter 11.

In common with most other continuous density HMM systems, HTK represents output distributions by Gaussian Mixture Densities. In HTK, however, a further generalisation is made. HTK allows each observation vector at time $t$ to be split into a number of $S$ independent data streams $o_{st}$. The formula for computing $b_j(o_t)$ is then

$$b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s} \tag{1.8}$$

where $M_s$ is the number of mixture components in stream $s$, $c_{jsm}$ is the weight of the $m$'th component and $\mathcal{N}(\cdot; \mu, \Sigma)$ is a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$, that is

$$\mathcal{N}(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(o-\mu)'\Sigma^{-1}(o-\mu)} \tag{1.9}$$

where $n$ is the dimensionality of $o$.

The exponent $\gamma_s$ is a stream weight^{1}. It can be used to give a particular stream more emphasis; however, it can only be set manually. No current HTK training tools can estimate values for it.

Multiple data streams are used to enable separate modelling of multiple information sources. In HTK, the processing of streams is completely general. However, the speech input modules assume that the source data is split into at most 4 streams. Chapter 5 discusses this in more detail but for now it is sufficient to remark that the default streams are the basic parameter vector, first (delta) and second (acceleration) difference coefficients and log energy.
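As a rough illustration of equations 1.8 and 1.9, the sketch below evaluates $b_j(o_t)$ for one state. It assumes diagonal covariance matrices (each given as a list of variances) so that no matrix inversion is needed; that restriction, the function names and the data layout are choices made for this example and are not taken from HTK.

```python
import math

def gaussian_diag(o, mean, var):
    """Equation 1.9 for a diagonal covariance given as a list of variances."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + quad))

def output_probability(streams, mixtures, stream_weights):
    """Equation 1.8: product over streams of weighted mixture sums.

    streams[s]        -- observation vector o_st for stream s
    mixtures[s]       -- list of (c_jsm, mean, var) tuples for stream s
    stream_weights[s] -- exponent gamma_s
    """
    b = 1.0
    for o_s, components, gamma in zip(streams, mixtures, stream_weights):
        mix = sum(c * gaussian_diag(o_s, mean, var)
                  for c, mean, var in components)
        b *= mix ** gamma
    return b
```

Setting a stream weight above 1 sharpens that stream's contribution to the product and a weight of 0 removes it altogether, which is how a stream can be given more or less emphasis by hand.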

### 1.4 Baum-Welch Re-Estimation

To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae.

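Fig. 1.5 Representing a Mixture (a state $j$ with an M-component Gaussian mixture, entered with probability $a_{ij}$, is equivalent to single-Gaussian sub-states $j_1, j_2, \ldots, j_M$ entered with probabilities $a_{ij}c_{j1}, a_{ij}c_{j2}, \ldots, a_{ij}c_{jM}$)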

Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see Fig. 1.5).

^{1} Often referred to as a codebook exponent.

Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is

$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}}\, e^{-\frac{1}{2}(o_t-\mu_j)'\Sigma_j^{-1}(o_t-\mu_j)} \tag{1.10}$$

If there was just one state $j$ in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of $\mu_j$ and $\Sigma_j$ would be just the simple averages, that is

$$\hat{\mu}_j = \frac{1}{T}\sum_{t=1}^{T} o_t \tag{1.11}$$

and

$$\hat{\Sigma}_j = \frac{1}{T}\sum_{t=1}^{T} (o_t-\mu_j)(o_t-\mu_j)' \tag{1.12}$$

In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HInit. HInit first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.
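The first pass of the procedure just described, dividing the vectors equally amongst the states and applying equations 1.11 and 1.12 to each share, might look like the following sketch. It is not HInit's code: the function name is hypothetical, only diagonal variances are computed, the variance is taken about the freshly computed segment mean, and the subsequent Viterbi realignment passes are omitted.

```python
def flat_segment_estimates(observations, num_states):
    """Split T vectors equally among the emitting states and apply
    equations 1.11 and 1.12 (diagonal variances only) to each segment."""
    total = len(observations)
    if total < num_states:
        raise ValueError("need at least one observation per state")
    dim = len(observations[0])
    estimates = []
    for k in range(num_states):
        start = k * total // num_states
        end = (k + 1) * total // num_states
        segment = observations[start:end]
        count = len(segment)
        mean = [sum(o[d] for o in segment) / count for d in range(dim)]
        var = [sum((o[d] - mean[d]) ** 2 for o in segment) / count
               for d in range(dim)]
        estimates.append((mean, var))
    return estimates
```

For example, four one-dimensional observations shared between two states give each state the mean and variance of two consecutive frames.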

Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector $o_t$ contributes to the computation of the maximum likelihood parameter values for each state $j$. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if $L_j(t)$ denotes the probability of being in state $j$ at time $t$ then the equations 1.11 and 1.12 given above become the following weighted averages

$$\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)} \tag{1.13}$$

and

$$\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t)\,(o_t-\mu_j)(o_t-\mu_j)'}{\sum_{t=1}^{T} L_j(t)} \tag{1.14}$$

where the summations in the denominators are included to give the required normalisation.

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).
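A minimal sketch of equations 1.13 and 1.14 for a single state follows. It assumes the occupation probabilities $L_j(t)$ are already available, keeps only the diagonal of the covariance, and measures deviations from the newly computed mean; these are simplifying assumptions for the example and the function name is hypothetical.

```python
def reestimate_state(occupation, observations):
    """Weighted mean and diagonal variance of equations 1.13 and 1.14.

    occupation[t]   -- L_j(t), probability of occupying state j at frame t
    observations[t] -- observation vector o_t as a list of floats
    """
    norm = sum(occupation)                      # shared denominator
    dim = len(observations[0])
    mean = [sum(l * o[d] for l, o in zip(occupation, observations)) / norm
            for d in range(dim)]
    var = [sum(l * (o[d] - mean[d]) ** 2
               for l, o in zip(occupation, observations)) / norm
           for d in range(dim)]
    return mean, var
```

When every $L_j(t)$ is either 0 or 1 the result reduces to the simple averages of equations 1.11 and 1.12 over the frames assigned to the state.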

Of course, to apply equations 1.13 and 1.14, the probability of state occupation $L_j(t)$ must be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the forward probability^{2} $\alpha_j(t)$ for some model $M$ with $N$ states be defined as

$$\alpha_j(t) = P(o_1, \ldots, o_t, x(t) = j|M). \tag{1.15}$$

That is, $\alpha_j(t)$ is the joint probability of observing the first $t$ speech vectors and being in state $j$ at time $t$. This forward probability can be efficiently calculated by the following recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t). \tag{1.16}$$

This recursion depends on the fact that the probability of being in state $j$ at time $t$ and seeing observation $o_t$ can be deduced by summing the forward probabilities for all possible predecessor states $i$ weighted by the transition probability $a_{ij}$. The slightly odd limits are caused by the fact that states 1 and $N$ are non-emitting^{3}. The initial conditions for the above recursion are

$$\alpha_1(1) = 1 \tag{1.17}$$

$$\alpha_j(1) = a_{1j}\, b_j(o_1) \tag{1.18}$$

for $1 < j < N$ and the final condition is given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}. \tag{1.19}$$

Notice here that from the definition of $\alpha_j(t)$,

$$P(O|M) = \alpha_N(T). \tag{1.20}$$

Hence, the calculation of the forward probability also yields the total likelihood $P(O|M)$.

^{2} Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.
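The forward recursion of equations 1.16 to 1.20 can be sketched as follows. The code works with plain probabilities, so it will underflow on realistic utterance lengths; it assumes the output densities have been precomputed, and it numbers states from 0 (entry) to $N-1$ (exit) rather than from 1 to $N$. The function name and data layout are hypothetical.

```python
def forward(a, b):
    """Return (alpha, likelihood) for one observation sequence.

    a[i][j] -- transition probability from state i to state j
    b[j][t] -- output density b_j(o_t) for emitting state j, 0-based frame t
    """
    n = len(a)
    num_frames = len(b[1])
    emitting = range(1, n - 1)
    alpha = [[0.0] * num_frames for _ in range(n)]
    for j in emitting:                           # equation 1.18
        alpha[j][0] = a[0][j] * b[j][0]
    for t in range(1, num_frames):               # equation 1.16
        for j in emitting:
            reach = sum(alpha[i][t - 1] * a[i][j] for i in emitting)
            alpha[j][t] = reach * b[j][t]
    # Equations 1.19 and 1.20: transition into the non-emitting exit state.
    likelihood = sum(alpha[i][num_frames - 1] * a[i][n - 1] for i in emitting)
    return alpha, likelihood
```

The cost grows linearly with the number of frames and quadratically with the number of states, in contrast to the exponential cost of summing over every state sequence explicitly.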

The backward probability $\beta_j(t)$ is defined as

$$\beta_j(t) = P(o_{t+1}, \ldots, o_T | x(t) = j, M). \tag{1.21}$$

As in the forward case, this backward probability can be computed efficiently using the following recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1) \tag{1.22}$$

with initial condition given by

$$\beta_i(T) = a_{iN} \tag{1.23}$$

for $1 < i < N$ and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1). \tag{1.24}$$

Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,

$$\alpha_j(t)\beta_j(t) = P(O, x(t) = j|M). \tag{1.25}$$

Hence,

$$L_j(t) = P(x(t) = j|O, M) = \frac{P(O, x(t) = j|M)}{P(O|M)} = \frac{1}{P}\,\alpha_j(t)\beta_j(t) \tag{1.26}$$

where $P = P(O|M)$.
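The sketch below combines the forward and backward passes to obtain the state occupation probabilities of equation 1.26. The forward pass is included so that the block is self-contained. As before, it uses plain probabilities on precomputed output densities and 0-based state numbering, and it assumes the sequence has non-zero likelihood; the function name is hypothetical.

```python
def forward_backward(a, b):
    """Return (alpha, beta, occupation, likelihood) for one sequence.

    a[i][j] -- transition probabilities; state 0 is the entry state and
               state N-1 the exit state (both non-emitting)
    b[j][t] -- output density b_j(o_t) for emitting state j, 0-based frame t
    """
    n = len(a)
    num_frames = len(b[1])
    emitting = range(1, n - 1)
    alpha = [[0.0] * num_frames for _ in range(n)]
    beta = [[0.0] * num_frames for _ in range(n)]

    # Forward pass (equations 1.16, 1.18 and 1.19).
    for j in emitting:
        alpha[j][0] = a[0][j] * b[j][0]
    for t in range(1, num_frames):
        for j in emitting:
            reach = sum(alpha[i][t - 1] * a[i][j] for i in emitting)
            alpha[j][t] = reach * b[j][t]
    likelihood = sum(alpha[i][-1] * a[i][n - 1] for i in emitting)

    # Backward pass (equations 1.22 and 1.23).
    for i in emitting:
        beta[i][-1] = a[i][n - 1]
    for t in range(num_frames - 2, -1, -1):
        for i in emitting:
            beta[i][t] = sum(a[i][j] * b[j][t + 1] * beta[j][t + 1]
                             for j in emitting)

    # State occupation L_j(t) from equation 1.26.
    occupation = [[alpha[j][t] * beta[j][t] / likelihood
                   for t in range(num_frames)] for j in range(n)]
    return alpha, beta, occupation, likelihood
```

Two useful sanity checks follow from the definitions: at every frame the occupation probabilities of the emitting states sum to one, and evaluating equation 1.24 reproduces the likelihood given by the forward pass.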

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm is now in place. The steps in this algorithm may be summarised as follows

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as *accumulators*^{4}.

2. Calculate the forward and backward probabilities for all states $j$ and times $t$.

3. For each state $j$ and time $t$, use the probability $L_j(t)$ and the current observation vector $o_t$ to update the accumulators for that state.

4. Use the final accumulator values to calculate new parameter values.

5. If the value of $P = P(O|M)$ for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.

^{3} To understand equations involving a non-emitting state at time $t$, the time should be thought of as being $t - \delta t$ if it is an entry state, and $t + \delta t$ if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place between frames.

^{4} Note that normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can result in the denominator summations being different. Hence, in HTK the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.

All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is carried out in HTK using log arithmetic.
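In log arithmetic, products become sums of logs, but the summations in the forward and backward recursions need a way to add two probabilities that are held as logs. The helper below is a generic sketch of that operation, not HTK's own routine; the names `log_add` and `LOG_ZERO` are hypothetical.

```python
import math

LOG_ZERO = float("-inf")   # log of probability zero

def log_add(x, y):
    """Return log(exp(x) + exp(y)) without leaving the log domain."""
    if x == LOG_ZERO:
        return y
    if y == LOG_ZERO:
        return x
    if x < y:
        x, y = y, x        # make x the larger value
    # exp(y - x) <= 1, so this cannot overflow.
    return x + math.log1p(math.exp(y - x))
```

Factoring out the larger argument keeps the exponential in range even when both probabilities are far too small to be represented directly.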

The HTK program which implements the above algorithm is called HRest. In combination with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.

### 1.5 Recognition and Viterbi Decoding

The previous section has described the basic ideas underlying HMM parameter re-estimation using
the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for
*computing the forward probability also yielded as a by-product the total likelihood P (O|M ). Thus,*
*this algorithm could also be used to find the model which yields the maximum value of P (O|M**i*),
and hence, it could be used for recognition.

In practice, however, it is preferable to base recognition on the maximum likelihood state se-
quence since this generalises easily to the continuous speech case whereas the use of the total
probability does not. This likelihood is computed using essentially the same algorithm as the for-
ward probability calculation except that the summation is replaced by a maximum operation. For
*a given model M , let φ**j**(t) represent the maximum likelihood of observing speech vectors o*1 to
*o**t* *and being in state j at time t. This partial likelihood can be computed efficiently using the*
following recursion (cf. equation1.16)

*φ**j**(t) = max*

*i* *{φ**i**(t − 1)a**ij**} b**j**(o**t**).* (1.27)
where

*φ*1(1) = 1 (1.28)

*φ**j**(1) = a**1j**b**j**(o*1) (1.29)

*for 1 < j < N . The maximum likelihood ˆP (O|M ) is then given by*
*φ**N**(T ) = max*

*i* *{φ**i**(T )a**iN**}* (1.30)

As for the re-estimation case, the direct computation of likelihoods leads to underflow, hence, log likelihoods are used instead. The recursion of equation 1.27 then becomes

$$\psi_j(t) = \max_i \{\psi_i(t-1) + \log(a_{ij})\} + \log(b_j(o_t)). \tag{1.31}$$

This recursion forms the basis of the so-called Viterbi algorithm. As shown in Fig. 1.6, this algorithm can be visualised as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time). Each large dot in the picture represents the log probability of observing that frame at that time and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left-to-right column-by-column. At time $t$, each partial path $\psi_i(t-1)$ is known for all states $i$, hence equation 1.31 can be used to compute $\psi_j(t)$ thereby extending the partial paths by one time frame.
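The log-domain recursion can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the HTK implementation: the function name `viterbi_log`, the helper `ln`, the list-of-lists layout and the toy probabilities are all hypothetical. States are numbered from 0, so state 0 is the non-emitting entry state and state N-1 the non-emitting exit state. The maximisation runs over emitting predecessors only, which gives the same result because the entry state cannot be occupied after the first frame.

```python
import math

NEG_INF = -math.inf

def ln(p):
    """Log of a probability, mapping zero to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi_log(log_a, log_b):
    """Best state sequence and its log likelihood for one model.

    log_a[i][j] is the log transition probability from state i to state j;
    log_b[t][j] is the log output probability of frame t in state j.
    """
    n = len(log_a)
    emitting = range(1, n - 1)
    # Initialisation: log form of phi_j(1) = a_1j b_j(o_1).
    psi = [NEG_INF] * n
    for j in emitting:
        psi[j] = log_a[0][j] + log_b[0][j]
    back = []                                  # back pointers, one list per frame
    for t in range(1, len(log_b)):
        new_psi, ptr = [NEG_INF] * n, [0] * n
        for j in emitting:
            # Recursion: max over predecessors, then add the log output probability.
            best_i = max(emitting, key=lambda i: psi[i] + log_a[i][j])
            new_psi[j] = psi[best_i] + log_a[best_i][j] + log_b[t][j]
            ptr[j] = best_i
        back.append(ptr)
        psi = new_psi
    # Termination: transition into the non-emitting exit state.
    last = max(emitting, key=lambda i: psi[i] + log_a[i][n - 1])
    best = psi[last] + log_a[last][n - 1]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best, path

# Toy model: entry state, two emitting states, exit state; three frames.
a = [[0, 1.0, 0, 0], [0, 0.5, 0.5, 0], [0, 0, 0.5, 0.5], [0, 0, 0, 0]]
b = [[0, 0.9, 0.1, 0], [0, 0.2, 0.8, 0], [0, 0.1, 0.9, 0]]
log_a = [[ln(p) for p in row] for row in a]
log_b = [[ln(p) for p in row] for row in b]
best, path = viterbi_log(log_a, log_b)
```

For this toy model the best path occupies the first emitting state for one frame and the second for the remaining two.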


[Figure: a trellis with State (1 to 6) on the vertical axis and Speech Frame (Time) (1 to 6) on the horizontal axis; an example dot is labelled $b_3(o_4)$ and an example arc is labelled $a_{35}$.]

Fig. 1.6 The Viterbi Algorithm for Isolated Word Recognition

This concept of a path is extremely important and it is generalised below to deal with the continuous speech case.

This completes the discussion of isolated word recognition using HMMs. There is no HTK tool which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided which, along with its supporting libraries HNet and HRec, is designed to handle continuous speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a special case. This is discussed in more detail below.

### 1.6 Continuous Speech Recognition

Returning now to the conceptual model of speech production and recognition exemplified by Fig. 1.1, it should be clear that the extension to continuous speech simply involves connecting HMMs together in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol. These could be either whole words for so-called *connected speech recognition* or sub-words such as phonemes for *continuous speech recognition*. The reason for including the non-emitting entry and exit states should now be evident: these states provide the *glue* needed to join models together.

There are, however, some practical difficulties to overcome. The training data for continuous speech must consist of continuous utterances and, in general, the boundaries dividing the segments of speech corresponding to each underlying sub-word model in the sequence will not be known. In practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of the segments corresponding to a given model can then be extracted and the *isolated word* style of training described above can be used. However, the amount of data obtainable in this way is usually very limited and the resultant models will be poor estimates. Furthermore, even if there were a large amount of data, the boundaries imposed by hand-marking may not be optimal as far as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word models is regarded as a *bootstrap* operation^{5}. The main training phase involves the use of a tool called HERest which does *embedded training*.

Embedded training uses the same Baum-Welch procedure as for the isolated case but rather than training each model individually all models are trained in parallel. It works in the following steps:

1. Allocate and zero accumulators for all parameters of all HMMs.

2. Get the next training utterance.

3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of the training utterance.

4. Calculate the forward and backward probabilities for the composite HMM. The inclusion of intermediate non-emitting states in the composite model requires some changes to the computation of the forward and backward probabilities but these are only minor. The details are given in chapter 8.

5. Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators in the usual way.

6. Repeat from 2 until all training utterances have been processed.

7. Use the accumulators to calculate new parameter estimates for all of the HMMs.

These steps can then all be repeated as many times as is necessary to achieve the required convergence. Notice that although the location of symbol boundaries in the training data is not required (or wanted) for this procedure, the symbolic transcription of each training utterance is needed.

5 They can even be avoided altogether by using a flat start as described in section 8.3.
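Step 3 of the procedure above, the construction of a composite HMM, can be sketched in Python as follows. This is a simplified illustration and not the method used inside HERest: the function name `join_hmms` and the matrix layout are assumptions. Each intermediate exit/entry pair is folded away by multiplying the exit probability of one model by the entry probabilities of the next, whereas HTK keeps the intermediate non-emitting states and adjusts the forward-backward computation instead. The sketch also assumes that no model has a direct transition from its entry state to its exit state.

```python
def join_hmms(models):
    """Join transition matrices in sequence into one composite matrix.

    Each model is a square list of lists whose first and last states are
    the non-emitting entry and exit states. The composite keeps the entry
    state of the first model and the exit state of the last model.
    """
    n_emit = [len(a) - 2 for a in models]
    size = sum(n_emit) + 2
    comp = [[0.0] * size for _ in range(size)]
    offset = 1                                   # index of the first emitting state of model k
    for k, a in enumerate(models):
        n = n_emit[k]
        is_last = (k == len(models) - 1)
        if k == 0:
            for j in range(n):                   # entry transitions of the first model
                comp[0][offset + j] = a[0][1 + j]
        for i in range(n):
            row = comp[offset + i]
            for j in range(n):                   # transitions inside model k
                row[offset + j] = a[1 + i][1 + j]
            exit_p = a[1 + i][-1]
            if is_last:
                row[-1] = exit_p                 # composite exit state
            else:
                nxt = models[k + 1]
                for j in range(n_emit[k + 1]):   # exit of model k, then entry of model k+1
                    row[offset + n + j] = exit_p * nxt[0][1 + j]
        offset += n
    return comp

# A hypothetical phone model: entry state, two emitting states, exit state.
phone = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]
composite = join_hmms([phone, phone, phone])
```

The composite for the three-symbol transcription has eight states, and every row except the final exit row still sums to one, so it can be treated as a single large HMM in steps 4 and 5.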

Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are relatively minor^{6}, the corresponding extensions to the Viterbi algorithm are more substantial.

In HTK, an alternative formulation of the Viterbi algorithm is used called the *Token Passing Model*^{7}. In brief, the token passing model makes the concept of a state alignment path explicit. Imagine each state $j$ of a HMM at time $t$ holds a single moveable token which contains, amongst other information, the partial log probability $\psi_j(t)$. This token then represents a partial match between the observation sequence $o_1$ to $o_t$ and the model subject to the constraint that the model is in state $j$ at time $t$. The path extension algorithm represented by the recursion of equation 1.31 is then replaced by the equivalent *token passing algorithm* which is executed at each time frame $t$.

The key steps in this algorithm are as follows:

1. Pass a copy of every token in state $i$ to all connecting states $j$, incrementing the log probability of the copy by $\log[a_{ij}] + \log[b_j(o_t)]$.

2. Examine the tokens in every state and discard all but the token with the highest probability.

In practice, some modifications are needed to deal with the non-emitting states but these are straightforward if the tokens in entry states are assumed to represent paths extended to time $t - \delta t$ and tokens in exit states are assumed to represent paths extended to time $t + \delta t$.
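The two token passing steps can be sketched in Python for a single model. This is an illustration only and not the HRec implementation: the class `Token`, the function `token_passing`, the stored state history and the toy probabilities are all assumptions. A single token starts in the non-emitting entry state (state 0), and after the last frame the surviving tokens are moved into the non-emitting exit state.

```python
import math
from dataclasses import dataclass

NEG_INF = -math.inf

def ln(p):
    """Log of a probability, mapping zero to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF

@dataclass
class Token:
    log_prob: float
    history: tuple        # emitting states visited so far (kept for illustration)

def token_passing(log_a, log_b):
    """Return the best token to reach the exit state, or None if none can."""
    n = len(log_a)
    tokens = {0: Token(0.0, ())}              # one token waiting in the entry state
    for frame in log_b:
        new_tokens = {}
        for i, tok in tokens.items():
            for j in range(1, n - 1):
                if log_a[i][j] == NEG_INF:
                    continue                  # states i and j are not connected
                # Step 1: pass a copy to state j and update its log probability.
                copy = Token(tok.log_prob + log_a[i][j] + frame[j],
                             tok.history + (j,))
                # Step 2: keep only the best token in each state.
                if j not in new_tokens or copy.log_prob > new_tokens[j].log_prob:
                    new_tokens[j] = copy
        tokens = new_tokens
    best = None
    for i, tok in tokens.items():             # move tokens into the exit state
        if log_a[i][n - 1] == NEG_INF:
            continue
        log_prob = tok.log_prob + log_a[i][n - 1]
        if best is None or log_prob > best.log_prob:
            best = Token(log_prob, tok.history)
    return best

# Toy model: entry state, two emitting states, exit state; three frames.
a = [[0, 1.0, 0, 0], [0, 0.5, 0.5, 0], [0, 0, 0.5, 0.5], [0, 0, 0, 0]]
b = [[0, 0.9, 0.1, 0], [0, 0.2, 0.8, 0], [0, 0.1, 0.9, 0]]
best_token = token_passing([[ln(p) for p in row] for row in a],
                           [[ln(p) for p in row] for row in b])
```

The log probability carried by `best_token` is the same value that the recursion of equation 1.31 would produce; only the bookkeeping differs.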

The point of using the Token Passing Model is that it extends very simply to the continuous speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM instances and the square boxes denote *word-end* nodes. This composite network is essentially just a single large HMM and the above Token Passing algorithm applies. The only difference now is that more information is needed beyond the log probability of the best token. When the best token reaches the end of the speech, the route it took through the network must be known in order to recover the recognised sequence of models.

6 In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel operation on a network of machines.

7 See “Token Passing: a Conceptual Model for Connected Speech Recognition Systems”, SJ Young, NH Russell and JHS Thornton, CUED Technical Report F-INFENG/TR38, Cambridge University, 1989. Available by anonymous ftp from svr-ftp.eng.cam.ac.uk.


[Figure: a word loop network in which "a" is built from the model ax, "be" from b iy, "been" from b iy n, etc.]

Fig. 1.7 Recognition Network for Continuously Spoken Word Recognition

The history of a token’s route through the network may be recorded efficiently as follows. Every token carries a pointer called a *word end link*. When a token is propagated from the exit state of a word (indicated by passing through a word-end node) to the entry state of another, that transition represents a potential word boundary. Hence a record called a *Word Link Record* (WLR) is generated in which is stored the identity of the word from which the token has just emerged and the current value of the token’s link. The token’s actual link is then replaced by a pointer to the newly created WLR. Fig. 1.8 illustrates this process.

Once all of the unknown speech has been processed, the WLRs attached to the link of the best matching token (i.e. the token with the highest log probability) can be traced back to give the best matching sequence of words. At the same time the positions of the word boundaries can also be extracted if required.
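The recording and trace-back of word boundary decisions can be sketched in Python as a linked list of records. The class and function names (`WordLinkRecord`, `Token`, `pass_word_end`, `trace_back`), the `frame` field used to hold the boundary position, and the example values are all hypothetical; they illustrate the mechanism described above and do not reproduce HTK data structures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class WordLinkRecord:
    word: str                              # word the token has just emerged from
    frame: int                             # time of the word boundary (assumed field)
    prev: Optional["WordLinkRecord"]       # value of the token's link before this record

@dataclass
class Token:
    log_prob: float
    link: Optional[WordLinkRecord] = None  # the word end link

def pass_word_end(token, word, frame):
    """Return the token as it leaves a word-end node.

    A new record stores the word identity and the token's current link;
    the token's link is then replaced by a pointer to the new record.
    """
    record = WordLinkRecord(word, frame, token.link)
    return Token(token.log_prob, record)

def trace_back(token):
    """Follow the chain of records back to recover words and boundaries."""
    words = []
    record = token.link
    while record is not None:
        words.append((record.word, record.frame))
        record = record.prev
    words.reverse()
    return words

# A token that passes through the word ends of "two" and then "one".
token = Token(log_prob=0.0)
token = pass_word_end(token, "two", frame=12)
token = pass_word_end(token, "one", frame=30)
result = trace_back(token)
```

Because each record points only to its predecessor, tokens that share an early history also share the corresponding records, which is what makes the scheme efficient.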

[Figure: a word loop of the models for "one" (w uh n), "two" (t uw) and "three" (th r iy), shown before and after a word-end transition. The record created at the word end notes that the best token came from "one", and the chain of recorded decisions holds logP and the word identity (two, two, one, one) at times t-3 to t.]

Fig. 1.8 Recording Word Boundary Decisions

The token passing algorithm for continuous speech has been described in terms of recording the word sequence only. If required, the same principle can be used to record decisions at the model and state level. Also, more than just the best token at each word boundary can be saved. This gives the potential for generating a lattice of hypotheses rather than just the single best hypothesis.

Algorithms based on this idea are called *lattice N-best*. They are suboptimal because the use of a single token per state limits the number of different token histories that can be maintained. This limitation can be avoided by allowing each model state to hold multiple tokens and regarding tokens as distinct if they come from different preceding words. This gives a class of algorithm called *word N-best* which has been shown empirically to be comparable in performance to an optimal N-best algorithm.

The above outlines the main idea of Token Passing as it is implemented within HTK. The algorithms are embedded in the library modules HNet and HRec and they may be invoked using the recogniser tool called HVite. They provide single and multiple-token passing recognition, single-best output, lattice output, N-best lists, support for cross-word context-dependency, lattice rescoring and forced alignment.

### 1.7 Speaker Adaptation

Although the training and recognition techniques described previously can produce high performance recognition systems, these systems can be improved upon by customising the HMMs to the characteristics of a particular speaker. HTK provides the tools HEAdapt and HVite to perform adaptation using a small amount of enrollment or adaptation data. The two tools differ in that HEAdapt performs offline supervised adaptation while HVite recognises the adaptation data and uses the generated transcriptions to perform the adaptation. Generally, more robust adaptation is performed in a supervised mode, as provided by HEAdapt, but given an initial well trained model set, HVite can still achieve noticeable improvements in performance. Full details of adaptation and how it is used in HTK can be found in Chapter 9.

### Chapter 2

## An Overview of the HTK Toolkit


The basic principles of HMM-based recognition were outlined in the previous chapter and a number of the key HTK tools have already been mentioned. This chapter describes the software architecture of a HTK tool. It then gives a brief outline of all the HTK tools and the way that they are used together to construct and test HMM-based recognisers. For the benefit of existing HTK users, the major changes in recent versions of HTK are listed. The following chapter will then illustrate the use of the HTK toolkit by working through a practical example of building a simple continuous speech recognition system.

### 2.1 HTK Software Architecture

Much of the functionality of HTK is built into the library modules. These modules ensure that every tool interfaces to the outside world in exactly the same way. They also provide a central resource of commonly used functions. Fig.2.1 illustrates the software structure of a typical HTK tool and shows its input/output interfaces.

User input/output and interaction with the operating system is controlled by the library module HShell and all memory management is controlled by HMem. Math support is provided by HMath and the signal processing operations needed for speech analysis are in HSigP. Each of the file types required by HTK has a dedicated interface module. HLabel provides the interface for label files, HLM for language model files, HNet for networks and lattices, HDict for dictionaries, HVQ for VQ codebooks and HModel for HMM definitions.


[Figure: an HTK tool surrounded by the library modules HShell, HMem, HMath, HSigP, HLabel, HLM, HNet, HDict, HVQ, HModel, HWave, HParm, HAudio, HGraf, HUtil, HTrain, HFB, HAdapt and HRec, together with its external interfaces: speech data, labels, language models, lattices/constraint network, dictionary, HMM definitions, terminal I/O, graphical I/O, adaptation and model training.]

Fig. 2.1 Software Architecture

All speech input and output at the waveform level is via HWave and at the parameterised level via HParm. As well as providing a consistent interface, HWave and HLabel support multiple file formats allowing data to be imported from other systems. Direct audio input is supported by HAudio and simple interactive graphics is provided by HGraf. HUtil provides a number of utility routines for manipulating HMMs while HTrain and HFB contain support for the various HTK training tools. HAdapt provides support for the various HTK adaptation tools. Finally, HRec contains the main recognition processing functions.

As noted in the next section, fine control over the behaviour of these library modules is provided by setting configuration variables. Detailed descriptions of the functions provided by the library modules are given in the second part of this book and the relevant configuration variables are described as they arise. For reference purposes, a complete list is given in chapter 15.

### 2.2 Generic Properties of a HTK Tool

HTK tools are designed to run with a traditional command-line style interface. Each tool has a number of required arguments plus optional arguments. The latter are always prefixed by a minus sign. As an example, the following command would invoke the mythical HTK tool called HFoo

HFoo -T 1 -f 34.3 -a -s myfile file1 file2

This tool has two main arguments called file1 and file2 plus four optional arguments. Options are always introduced by a single letter option name followed where appropriate by the option value. The option value is always separated from the option name by a space. Thus, the value of the -f option is a real number, the value of the -T option is an integer number and the value of the -s option is a string. The -a option has no following value and it is used as a simple flag to enable or disable some feature of the tool. Options whose names are a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of a HTK tool.
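The option conventions just described can be mimicked with Python's standard `getopt` module. This is only an illustration of the convention applied to the mythical HFoo command line; HTK tools themselves do not use Python, and the variable names and the interpretation of each option value are assumptions.

```python
import getopt

# The arguments of the mythical HFoo command, as a shell would split them.
argv = "-T 1 -f 34.3 -a -s myfile file1 file2".split()

# In the option string a letter followed by ":" takes a value; "a" is a plain flag.
opts, files = getopt.getopt(argv, "T:f:as:")
options = dict(opts)

trace_level = int(options["-T"])     # integer value
real_value = float(options["-f"])    # real value
flag_set = "-a" in options           # flag with no following value
name = options["-s"]                 # string value
```

After parsing, `files` holds the two main arguments in order, and the four options are available by name.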

In addition to command line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command

HFoo -C config -f 34.3 -a -s myfile file1 file2

is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures. Configuration parameters can sometimes be used as an alternative to using command line arguments. For example, trace options can always be set within a configuration file. However, the main use of configuration files is to control the detailed behaviour of the library modules on which all HTK tools depend.

Although this style of command-line working may seem old-fashioned when compared to modern graphical user interfaces, it has many advantages. In particular, it makes it simple to write shell scripts to control HTK tool execution. This is vital for performing large-scale system building and experimentation. Furthermore, defining all operations using text-based commands allows the details of system construction or experimental procedure to be recorded and documented.

Finally, note that a summary of the command line and options for any HTK tool can be obtained simply by executing the tool with no arguments.

### 2.3 The Toolkit

The HTK tools are best introduced by going through the processing steps involved in building a sub-word based continuous speech recogniser. As shown in Fig. 2.2, there are four main phases: data preparation, training, testing and analysis.

### 2.3.1 Data Preparation Tools

In order to build a set of HMMs, a set of speech data files and their associated transcriptions are required. Very often speech data will be obtained from database archives, typically on CD-ROMs. Before it can be used in training, it must be converted into the appropriate parametric form and any associated transcriptions must be converted to have the correct format and use the required phone or word labels. If the speech needs to be recorded, then the tool HSLab can be used both to record the speech and to manually annotate it with any required transcriptions.

Although all HTK tools can parameterise waveforms *on-the-fly*, in practice it is usually better to parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. By setting the appropriate configuration variables, all input files can be converted to parametric form as they are read in. Thus, simply copying each file in this manner performs the required encoding. The tool HList can be used to check the contents of any speech file and since it can also convert input on-the-fly, it can be used to check the results of any conversions before processing large quantities of data.

Transcriptions will also need preparing. Typically the labels used in the original source transcriptions will not be exactly as required, for example, because of differences in the phone sets used. Also, HMM training might require the labels to be context-dependent. The tool HLEd is a script-driven label editor which is designed to make the required transformations to label files. HLEd can also output files to a single *Master Label File* (MLF) which is usually more convenient for subsequent processing. Finally on data preparation, HLStats can gather and display statistics on label files and where required, HQuant can be used to build a VQ codebook in preparation for building a discrete probability HMM system.

### 2.3.2 Training Tools

The second step of system building is to define the topology required for each HMM by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files and hence it is possible to edit them with any convenient text editor. Alternatively, the standard HTK distribution includes a number of example HMM prototypes and a script to generate the most common topologies automatically. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored. The purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM. The actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely.
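The simple strategy of making all transitions out of a state equally likely can be sketched in Python. The function name `equal_transitions` and the 0/1 topology mask are assumptions made for this illustration, and the output is a plain matrix rather than an HTK prototype definition file.

```python
def equal_transitions(allowed):
    """Build a transition matrix from a topology mask.

    allowed[i][j] is truthy if the topology permits a transition from
    state i to state j. Every permitted transition out of a state is
    given the same probability; rows with no transitions stay at zero.
    """
    matrix = []
    for row in allowed:
        count = sum(1 for ok in row if ok)
        if count == 0:
            matrix.append([0.0] * len(row))       # e.g. the exit state
        else:
            matrix.append([1.0 / count if ok else 0.0 for ok in row])
    return matrix

# A 5-state left-to-right topology with no skips: states 1 and 5 are the
# non-emitting entry and exit states (rows and columns 0 and 4 here).
topology = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
trans = equal_transitions(topology)
```

For this topology each emitting state receives a self-loop probability of 0.5 and a forward probability of 0.5, which is an acceptable starting point because training is very insensitive to these values.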


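[Figure: the four processing stages and their tools. Data preparation: HSLab, HCopy, HList, HLEd, HLStats, HQuant. Training: HCompV, HInit, HRest, HERest, HSmooth, HHEd, HEAdapt. Testing: HVite, with HDMan, HBuild and HParse supplying the dictionary and networks. Analysis: HResults. The data passed between stages are speech, transcriptions, HMMs, networks and the dictionary.]

Fig. 2.2 HTK Processing Stages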

The actual training process takes place in stages and it is illustrated in more detail in Fig. 2.3. Firstly, an initial set of models must be created. If there is some speech data available for which the location of the sub-word (i.e. phone) boundaries has been marked, then this can be used as *bootstrap data*. In this case, the tools HInit and HRest provide *isolated word* style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and *cuts out* all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a *segmental k-means* procedure. On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used but this time the segmental k-means procedure is replaced by the Baum-Welch re-estimation procedure described in the previous chapter. When no bootstrap data is available, a so-called *flat start* can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance. The tool HCompV can be used for this.
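The two initialisation strategies can be contrasted with a small Python sketch. It is a simplified illustration under stated assumptions and not the code of HInit or HCompV: the function names, the use of diagonal (per-dimension) variances and the toy data are hypothetical, only the first uniform-segmentation cycle is shown, and Viterbi realignment and mixture clustering are omitted. Each example is assumed to contain at least as many frames as there are emitting states.

```python
def mean_var(frames):
    """Per-dimension mean and variance of a list of equal-length vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var

def uniform_segment_stats(examples, n_states):
    """First cycle: split each example evenly among the emitting states."""
    pooled = [[] for _ in range(n_states)]
    for frames in examples:
        for t, frame in enumerate(frames):
            state = t * n_states // len(frames)   # uniform segmentation
            pooled[state].append(frame)
    return [mean_var(p) for p in pooled]

def flat_start(examples, n_states):
    """Give every state the global mean and variance of all the data."""
    all_frames = [frame for frames in examples for frame in frames]
    return [mean_var(all_frames)] * n_states

# Two toy examples of one phone, with one-dimensional feature vectors.
examples = [
    [[0.0], [0.0], [2.0], [2.0], [4.0], [4.0]],
    [[1.0], [3.0], [5.0]],
]
segmented = uniform_segment_stats(examples, n_states=3)
flat = flat_start(examples, n_states=3)
```

With uniform segmentation the three states obtain different means that follow the trend in the data, whereas the flat start gives every state the same global statistics and leaves all differentiation to later re-estimation.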


[Figure: labelled utterances are processed by HInit and HRest, or unlabelled utterances by HCompV, to create the initial sub-word HMMs; HERest and HHEd then refine the sub-word HMMs using unlabelled utterances and their transcriptions (e.g. th ih s ih s p iy t sh).]

Fig. 2.3 Training Sub-word HMMs

Once an initial set of models has been created, the tool HERest is used to perform *embedded training* using the entire training set. HERest performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously. For each training utterance, the corresponding phone models are concatenated and then the forward-backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of the training data has been processed, the accumulated statistics are used to compute re-estimates of the HMM parameters. HERest is the core HTK training tool. It is designed to process large databases, it has facilities for pruning to reduce computation, and it can be run in parallel across a network of machines.

The philosophy of system construction in HTK is that HMMs should be refined incrementally. Thus, a typical progression is to start with a simple set of single Gaussian context-independent phone models and then iteratively refine them by expanding them to include context-dependency and use multiple mixture component Gaussian distributions. The tool HHEd is a HMM definition editor which will clone models into context-dependent sets, apply a variety of parameter tyings and increment the number of mixture components in specified distributions. The usual process is to modify a set of HMMs in stages using HHEd and then re-estimate the parameters of the modified set using HERest after each stage. To improve performance for specific speakers, the tools HEAdapt and HVite can be used to adapt HMMs to better model the characteristics of particular speakers using a small amount of training or adaptation data. The end result is a speaker-adapted system.

The single biggest problem in building context-dependent HMM systems is always data insufficiency. The more complex the model set, the more data is needed to make robust estimates of its parameters, and since data is usually limited, a balance must be struck between complexity and the available data. For continuous density systems, this balance is achieved by tying parameters together as mentioned above. Parameter tying allows data to be pooled so that the shared parameters can be robustly estimated. In addition to continuous density systems, HTK also supports