The HTK Book
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, Phil Woodland
The HTK Book (for HTK Version 3.4)
© COPYRIGHT 1995-1999 Microsoft Corporation.
© COPYRIGHT 2001-2009 Cambridge University Engineering Department.
All Rights Reserved
First published December 1995
Reprinted March 1996
Revised for HTK Version 2.1 March 1997
Revised for HTK Version 2.2 January 1999
Revised for HTK Version 3.0 July 2000
Revised for HTK Version 3.1 December 2001
Revised for HTK Version 3.2 December 2002
Revised for HTK Version 3.3 April 2005
Revised for HTK Version 3.4 March 2009
Contents
I Tutorial Overview 1
1 The Fundamentals of HTK 2
1.1 General Principles of HMMs. . . 3
1.2 Isolated Word Recognition. . . 3
1.3 Output Probability Specification . . . 6
1.4 Baum-Welch Re-Estimation . . . 6
1.5 Recognition and Viterbi Decoding . . . 9
1.6 Continuous Speech Recognition . . . 10
1.7 Speaker Adaptation . . . 13
2 An Overview of the HTK Toolkit 14
2.1 HTK Software Architecture . . . 14
2.2 Generic Properties of a HTK Tool . . . 15
2.3 The Toolkit . . . 16
2.3.1 Data Preparation Tools . . . 16
2.3.2 Training Tools . . . 16
2.3.3 Recognition Tools . . . 19
2.3.4 Analysis Tool . . . 20
2.4 What’s New In Version 3.4.1 . . . 20
2.4.1 New In Version 3.4 . . . 21
2.4.2 New In Version 3.3 . . . 21
2.4.3 New In Version 3.2 . . . 21
2.4.4 New In Version 3.1 . . . 22
2.4.5 New In Version 2.2 . . . 22
2.4.6 Features Added To Version 2.1 . . . 22
3 A Tutorial Example of Using HTK 24
3.1 Data Preparation . . . 25
3.1.1 Step 1 - the Task Grammar . . . 25
3.1.2 Step 2 - the Dictionary . . . 26
3.1.3 Step 3 - Recording the Data. . . 28
3.1.4 Step 4 - Creating the Transcription Files. . . 29
3.1.5 Step 5 - Coding the Data . . . 31
3.2 Creating Monophone HMMs. . . 32
3.2.1 Step 6 - Creating Flat Start Monophones . . . 32
3.2.2 Step 7 - Fixing the Silence Models . . . 35
3.2.3 Step 8 - Realigning the Training Data . . . 36
3.3 Creating Tied-State Triphones . . . 38
3.3.1 Step 9 - Making Triphones from Monophones . . . 38
3.3.2 Step 10 - Making Tied-State Triphones . . . 40
3.4 Recogniser Evaluation . . . 43
3.4.1 Step 11 - Recognising the Test Data . . . 43
3.5 Running the Recogniser Live . . . 44
3.6 Adapting the HMMs . . . 45
3.6.1 Step 12 - Preparation of the Adaptation Data. . . 45
3.6.2 Step 13 - Generating the Transforms . . . 46
3.6.3 Step 14 - Evaluation of the Adapted System. . . 48
3.7 Adaptive training . . . 48
3.8 Semi-Tied and HLDA transforms . . . 48
3.9 Using the HTK Large Vocabulary Decoder HDecode . . . 51
3.9.1 Dictionary and Language Model . . . 51
3.9.2 Option 1 - Recognition. . . 52
3.9.3 Option 2 - Speaker Adaptation . . . 53
3.9.4 Option 3 - Lattice Generation . . . 53
3.9.5 Option 4 - Lattice Rescoring . . . 54
3.10 Discriminative Training . . . 55
3.10.1 Step 1 - Generation of Initial Maximum Likelihood Models . . . 55
3.10.2 Step 2 - Training Data LM Creation . . . 55
3.10.3 Step 3 - Word Lattice Creation . . . 56
3.10.4 Step 4 - Phone Marking of Numerator and Denominator Lattices . . . 57
3.10.5 Step 5 - Generating Discriminatively Trained Models . . . 57
3.11 Summary . . . 58
II HTK in Depth 59
4 The Operating Environment 60
4.1 The Command Line . . . 61
4.2 Script Files . . . 61
4.3 Configuration Files . . . 62
4.4 Standard Options. . . 64
4.5 Error Reporting. . . 64
4.6 Strings and Names . . . 65
4.7 Memory Management . . . 66
4.8 Input/Output via Pipes and Networks . . . 67
4.9 Byte-swapping of HTK data files . . . 67
4.10 Summary . . . 67
5 Speech Input/Output 70
5.1 General Mechanism . . . 70
5.2 Speech Signal Processing. . . 72
5.3 Linear Prediction Analysis. . . 74
5.4 Filterbank Analysis. . . 75
5.5 Vocal Tract Length Normalisation . . . 76
5.6 Cepstral Features . . . 77
5.7 Perceptual Linear Prediction . . . 78
5.8 Energy Measures . . . 79
5.9 Delta, Acceleration and Third Differential Coefficients . . . 79
5.10 Storage of Parameter Files. . . 80
5.10.1 HTK Format Parameter Files . . . 80
5.10.2 Esignal Format Parameter Files. . . 82
5.11 Waveform File Formats . . . 83
5.11.1 HTK File Format. . . 83
5.11.2 Esignal File Format . . . 83
5.11.3 TIMIT File Format . . . 83
5.11.4 NIST File Format . . . 83
5.11.5 SCRIBE File Format. . . 84
5.11.6 SDES1 File Format . . . 84
5.11.7 AIFF File Format . . . 84
5.11.8 SUNAU8 File Format . . . 84
5.11.9 OGI File Format . . . 85
5.11.10 WAV File Format . . . 85
5.11.11 ALIEN and NOHEAD File Formats . . . 85
5.12 Direct Audio Input/Output . . . 85
5.13 Multiple Input Streams . . . 87
5.14 Vector Quantisation . . . 88
5.15 Viewing Speech with HList . . . 90
5.16 Copying and Coding using HCopy . . . 92
5.17 Version 1.5 Compatibility . . . 93
5.18 Summary . . . 94
6 Transcriptions and Label Files 97
6.1 Label File Structure . . . 97
6.2 Label File Formats . . . 98
6.2.1 HTK Label Files . . . 98
6.2.2 ESPS Label Files . . . 99
6.2.3 TIMIT Label Files . . . 99
6.2.4 SCRIBE Label Files . . . 99
6.3 Master Label Files . . . 100
6.3.1 General Principles of MLFs . . . 100
6.3.2 Syntax and Semantics . . . 101
6.3.3 MLF Search. . . 101
6.3.4 MLF Examples . . . 102
6.4 Editing Label Files . . . 104
6.5 Summary . . . 107
7 HMM Definition Files 108
7.1 The HMM Parameters . . . 109
7.2 Basic HMM Definitions . . . 110
7.3 Macro Definitions. . . 113
7.4 HMM Sets. . . 117
7.5 Tied-Mixture Systems . . . 121
7.6 Discrete Probability HMMs . . . 121
7.7 Input Linear Transforms . . . 123
7.8 Tee Models . . . 123
7.9 Binary Storage Format . . . 124
7.10 The HMM Definition Language . . . 124
8 HMM Parameter Estimation 129
8.1 Training Strategies . . . 130
8.2 Initialisation using HInit . . . 132
8.3 Flat Starting with HCompV . . . 135
8.4 Isolated Unit Re-Estimation using HRest . . . 137
8.5 Embedded Training using HERest. . . 138
8.6 Single-Pass Retraining . . . 141
8.7 Two-model Re-Estimation . . . 141
8.8 Parameter Re-Estimation Formulae. . . 142
8.8.1 Viterbi Training (HInit) . . . 142
8.8.2 Forward/Backward Probabilities . . . 143
8.8.3 Single Model Reestimation(HRest) . . . 145
8.8.4 Embedded Model Reestimation (HERest) . . . 145
8.8.5 Semi-Tied Transform Estimation (HERest) . . . 146
8.9 Discriminative Training . . . 147
8.9.1 Discriminative Parameter Re-Estimation Formulae . . . 148
8.9.2 Lattice-Based Discriminative Training . . . 149
8.9.3 Improved Generalisation . . . 149
8.10 Discriminative Training using HMMIRest . . . 150
9 HMM Adaptation 153
9.1 Model Adaptation using Linear Transformations . . . 154
9.1.1 Linear Transformations . . . 154
9.1.2 Input/Output/Parent Transformations . . . 155
9.1.3 Base Class Definitions . . . 156
9.1.4 Regression Class Trees . . . 156
9.1.5 Linear Transform Format . . . 158
9.1.6 Hierarchy of Transform . . . 158
9.1.7 Multiple Stream Systems . . . 160
9.2 Adaptive Training with Linear Transforms. . . 160
9.3 Model Adaptation using MAP . . . 160
9.4 Linear Transformation Estimation Formulae . . . 161
9.4.1 Mean Transformation Matrix (MLLRMEAN) . . . 162
9.4.2 Variance Transformation Matrix (MLLRVAR, MLLRCOV). . . 162
9.4.3 Constrained MLLR Transformation Matrix (CMLLR) . . . 163
10 HMM System Refinement 165
10.1 Using HHEd . . . 166
10.2 Constructing Context-Dependent Models . . . 166
10.3 Parameter Tying and Item Lists . . . 167
10.4 Data-Driven Clustering . . . 169
10.5 Tree-Based Clustering . . . 171
10.6 Mixture Incrementing . . . 173
10.7 Regression Class Tree Construction . . . 174
10.8 Miscellaneous Operations . . . 175
11 Discrete and Tied-Mixture Models 176
11.1 Modelling Discrete Sequences . . . 176
11.2 Using Discrete Models with Speech . . . 177
11.3 Tied Mixture Systems . . . 179
11.4 Parameter Smoothing . . . 181
12 Networks, Dictionaries and Language Models 182
12.1 How Networks are Used . . . 183
12.2 Word Networks and Standard Lattice Format . . . 184
12.3 Building a Word Network with HParse . . . 186
12.4 Bigram Language Models . . . 188
12.5 Building a Word Network with HBuild . . . 190
12.6 Testing a Word Network using HSGen. . . 191
12.7 Constructing a Dictionary . . . 192
12.8 Word Network Expansion . . . 194
12.9 Other Kinds of Recognition System. . . 197
13 Decoding with HVite 199
13.1 Decoder Operation . . . 199
13.2 Decoder Organisation . . . 201
13.3 Recognition using Test Databases. . . 203
13.4 Evaluating Recognition Results . . . 204
13.5 Generating Forced Alignments . . . 207
13.6 Recognition using Direct Audio Input . . . 208
13.7 N-Best Lists and Lattices . . . 209
III Language Modelling 211
14 Fundamentals of language modelling 212
14.1 n-gram language models . . . 213
14.1.1 Word n-gram models . . . 213
14.1.2 Equivalence classes . . . 214
14.1.3 Class n-gram models . . . 214
14.2 Statistically-derived Class Maps. . . 215
14.2.1 Word exchange algorithm . . . 215
14.3 Robust model estimation . . . 217
14.3.1 Estimating probabilities . . . 217
14.3.2 Smoothing probabilities . . . 218
14.4 Perplexity . . . 219
14.5 Overview of n-Gram Construction Process. . . 220
14.6 Class-Based Language Models . . . 222
15 A Tutorial Example of Building Language Models 223
15.1 Database preparation . . . 223
15.2 Mapping OOV words. . . 226
15.3 Language model generation . . . 227
15.4 Testing the LM perplexity . . . 228
15.5 Generating and using count-based models . . . 229
15.6 Model interpolation . . . 230
15.7 Class-based models . . . 231
15.8 Problem solving. . . 234
15.8.1 File format problems . . . 234
15.8.2 Command syntax problems . . . 234
15.8.3 Word maps . . . 235
15.8.4 Memory problems . . . 235
15.8.5 Unexpected perplexities . . . 235
16 Language Modelling Reference 236
16.1 Words and Classes . . . 237
16.2 Data File Headers . . . 237
16.3 Word Map Files. . . 237
16.4 Class Map Files. . . 238
16.5 Gram Files . . . 240
16.6 Frequency-of-frequency (FoF) Files . . . 241
16.7 Word LM file formats . . . 241
16.7.1 The ARPA-MIT LM format . . . 242
16.7.2 The modified ARPA-MIT format . . . 243
16.7.3 The binary LM format . . . 243
16.8 Class LM file formats . . . 243
16.8.1 Class counts format . . . 244
16.8.2 The class probabilities format . . . 244
16.8.3 The class LM three file format . . . 244
16.8.4 The class LM single file format . . . 245
16.9 Language modelling tracing . . . 245
16.9.1 LCMap . . . 246
16.9.2 LGBase . . . 246
16.9.3 LModel . . . 246
16.9.4 LPCalc . . . 246
16.9.5 LPMerge . . . 246
16.9.6 LUtil. . . 246
16.9.7 LWMap . . . 246
16.10 Run-time configuration parameters . . . 247
16.10.1 USEINTID . . . 247
16.11 Compile-time configuration parameters . . . 248
16.11.1 LM ID SHORT . . . 248
16.11.2 LM COMPACT . . . 248
16.11.3 LMPROB SHORT . . . 248
16.11.4 INTERPOLATE MAX . . . 248
16.11.5 SANITY . . . 248
16.11.6 INTEGRITY CHECK . . . 248
IV Reference Section 250
17 The HTK Tools 251
17.1 Cluster . . . 252
17.1.1 Function . . . 252
17.1.2 Use . . . 253
17.1.3 Tracing . . . 254
17.2 HBuild. . . 255
17.2.1 Function. . . 255
17.2.2 Use . . . 255
17.2.3 Tracing . . . 256
17.3 HCompV . . . 257
17.3.1 Function. . . 257
17.3.2 Use . . . 257
17.3.3 Tracing . . . 258
17.4 HCopy . . . 259
17.4.1 Function. . . 259
17.4.2 Use . . . 259
17.4.3 Trace Output . . . 261
17.5 HDMan . . . 262
17.5.1 Function. . . 262
17.5.2 Use . . . 263
17.5.3 Tracing . . . 264
17.6 HDecode. . . 265
17.6.1 Function. . . 265
17.6.2 Use . . . 266
17.6.3 Tracing . . . 267
17.7 HERest . . . 268
17.7.1 Function. . . 268
17.7.2 Use . . . 269
17.7.3 Tracing . . . 271
17.8 HHEd . . . 272
17.8.1 Function. . . 272
17.8.2 Use . . . 280
17.8.3 Tracing . . . 280
17.9 HInit. . . 282
17.9.1 Function. . . 282
17.9.2 Use . . . 282
17.9.3 Tracing . . . 283
17.10 HLEd . . . 284
17.10.1 Function. . . 284
17.10.2 Use . . . 285
17.10.3 Tracing . . . 286
17.11 HList . . . 287
17.11.1 Function. . . 287
17.11.2 Use . . . 287
17.11.3 Tracing . . . 287
17.12 HLMCopy . . . 288
17.12.1 Function. . . 288
17.12.2 Use . . . 288
17.12.3 Tracing . . . 288
17.13 HLRescore . . . 289
17.13.1 Function. . . 289
17.13.2 Use . . . 289
17.13.3 Tracing . . . 291
17.14 HLStats . . . 292
17.14.1 Function. . . 292
17.14.2 Bigram Generation . . . 292
17.14.3 Use . . . 293
17.14.4 Tracing . . . 293
17.15 HMMIRest . . . 294
17.15.1 Function. . . 294
17.15.2 Use . . . 295
17.15.3 Tracing . . . 296
17.16 HParse . . . 297
17.16.1 Function. . . 297
17.16.2 Network Definition . . . 297
17.16.3 Compatibility Mode . . . 299
17.16.4 Use . . . 299
17.16.5 Tracing . . . 300
17.17 HQuant . . . 301
17.17.1 Function. . . 301
17.17.2 VQ Codebook Format . . . 301
17.17.3 Use . . . 301
17.17.4 Tracing . . . 302
17.18 HRest . . . 303
17.18.1 Function. . . 303
17.18.2 Use . . . 303
17.18.3 Tracing . . . 304
17.19 HResults . . . 305
17.19.1 Function. . . 305
17.19.2 Use . . . 306
17.19.3 Tracing . . . 308
17.20 HSGen . . . 309
17.20.1 Function. . . 309
17.20.2 Use . . . 309
17.20.3 Tracing . . . 309
17.21 HSLab . . . 310
17.21.1 Function. . . 310
17.21.2 Use . . . 311
17.21.3 Tracing . . . 313
17.22 HSmooth . . . 314
17.22.1 Function. . . 314
17.22.2 Use . . . 314
17.22.3 Tracing . . . 315
17.23 HVite . . . 316
17.23.1 Function. . . 316
17.23.2 Use . . . 316
17.23.3 Tracing . . . 318
17.24 LAdapt . . . 319
17.24.1 Function. . . 319
17.24.2 Use . . . 319
17.24.3 Tracing . . . 320
17.25 LBuild . . . 321
17.25.1 Function. . . 321
17.25.2 Use . . . 321
17.25.3 Tracing . . . 321
17.26 LFoF . . . 322
17.26.1 Function. . . 322
17.26.2 Use . . . 322
17.26.3 Tracing . . . 322
17.27 LGCopy . . . 323
17.27.1 Function. . . 323
17.27.2 Use . . . 323
17.27.3 Tracing . . . 324
17.28 LGList . . . 325
17.28.1 Function. . . 325
17.28.2 Use . . . 325
17.28.3 Tracing . . . 325
17.29 LGPrep . . . 326
17.29.1 Function. . . 326
17.29.2 Use . . . 327
17.29.3 Tracing . . . 328
17.30 LLink . . . 329
17.30.1 Function. . . 329
17.30.2 Use . . . 329
17.30.3 Tracing . . . 329
17.31 LMerge . . . 330
17.31.1 Function. . . 330
17.31.2 Use . . . 330
17.31.3 Tracing . . . 330
17.32 LNewMap . . . 331
17.32.1 Function. . . 331
17.32.2 Use . . . 331
17.32.3 Tracing . . . 331
17.33 LNorm . . . 332
17.33.1 Function. . . 332
17.33.2 Use . . . 332
17.33.3 Tracing . . . 332
17.34 LPlex . . . 333
17.34.1 Function. . . 333
17.34.2 Use . . . 333
17.34.3 Tracing . . . 334
17.35 LSubset . . . 335
17.35.1 Function. . . 335
17.35.2 Use . . . 335
17.35.3 Tracing . . . 335
18 Configuration Variables 336
18.1 Configuration Variables used in Library Modules . . . 336
18.2 Configuration Variables used in Tools . . . 341
19 Error and Warning Codes 344
19.1 Generic Errors . . . 344
19.2 Summary of Errors by Tool and Module . . . 346
20 HTK Standard Lattice Format (SLF) 363
20.1 SLF Files . . . 363
20.2 Format. . . 363
20.3 Syntax . . . 364
20.4 Field Types . . . 365
20.5 Example SLF file . . . 365
Part I
Tutorial Overview
Chapter 1
The Fundamentals of HTK
(Figure: Training Tools take Speech Data and Transcriptions to build a Recogniser, which converts Unknown Speech into a Transcription.)
HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series and the core of HTK is similarly general-purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular recognisers. Thus, much of the infrastructure support in HTK is dedicated to this task. As shown in the picture above, there are two major processing stages involved. Firstly, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions.
Secondly, unknown utterances are transcribed using the HTK recognition tools.
The main body of this book is mostly concerned with the mechanics of these two processes.
However, before launching into detail it is necessary to understand some of the basic principles of HMMs. It is also helpful to have an overview of the toolkit and to have some appreciation of how training and recognition in HTK is organised.
This first part of the book attempts to provide this information. In this chapter, the basic ideas of HMMs and their use in speech recognition are introduced. The following chapter then presents a brief overview of HTK and, for users of older versions, it highlights the main differences in version 2.0 and later. Finally in this tutorial part of the book, chapter 3 describes how a HMM-based speech recogniser can be built using HTK. It does this by describing the construction of a simple small vocabulary continuous speech recogniser.
The second part of the book then revisits the topics skimmed over here and discusses each in detail. This can be read in conjunction with the third and final part of the book which provides a reference manual for HTK. This includes a description of each tool, summaries of the various parameters used to configure HTK and a list of the error messages that it generates when things go wrong.
Finally, note that this book is concerned only with HTK as a tool-kit. It does not provide information for using the HTK libraries as a programming environment.
1.1 General Principles of HMMs
Fig. 1.1 Message Encoding/Decoding
Speech recognition systems generally assume that the speech signal is a realisation of some message encoded as a sequence of one or more symbols (see Fig. 1.1). To effect the reverse operation of recognising the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform on the basis that for the duration covered by a single vector (typically 10ms or so), the speech waveform can be regarded as being stationary. Although this is not strictly true, it is a reasonable approximation. Typical parametric representations in common use are smoothed spectra or linear prediction coefficients plus various other representations derived from these.
The rôle of the recogniser is to effect a mapping between sequences of speech vectors and the wanted underlying symbol sequences. Two problems make this very difficult. Firstly, the mapping from symbols to speech is not one-to-one since different underlying symbols can give rise to similar speech sounds. Furthermore, there are large variations in the realised speech waveform due to speaker variability, mood, environment, etc. Secondly, the boundaries between symbols cannot be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech waveform as a sequence of concatenated static patterns.
The second problem of not knowing the word boundary locations can be avoided by restricting the task to isolated word recognition. As shown in Fig. 1.2, this implies that the speech waveform corresponds to a single underlying symbol (e.g. word) chosen from a fixed vocabulary. Despite the fact that this simpler problem is somewhat artificial, it nevertheless has a wide range of practical applications. Furthermore, it serves as a good basis for introducing the basic ideas of HMM-based recognition before dealing with the more complex continuous speech case. Hence, isolated word recognition using HMMs will be dealt with first.
1.2 Isolated Word Recognition
Let each spoken word be represented by a sequence of speech vectors or observations O, defined as
O = o_1, o_2, \ldots, o_T    (1.1)
where o_t is the speech vector observed at time t. The isolated word recognition problem can then be regarded as that of computing

\arg\max_i \{P(w_i|O)\}    (1.2)
where w_i is the i'th vocabulary word. This probability is not computable directly but using Bayes' Rule gives

P(w_i|O) = \frac{P(O|w_i)\,P(w_i)}{P(O)}    (1.3)
Thus, for a given set of prior probabilities P(w_i), the most probable spoken word depends only on the likelihood P(O|w_i). Given the dimensionality of the observation sequence O, the direct estimation of the joint conditional probability P(o_1, o_2, \ldots | w_i) from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is
assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities P (O|wi) is replaced by the much simpler problem of estimating the Markov model parameters.
Fig. 1.2 Isolated Word Problem
In HMM-based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit, and each time t that a state j is entered, a speech vector o_t is generated from the probability density b_j(o_t). Furthermore, the transition from state i to state j is also probabilistic and is governed by the discrete probability a_{ij}. Fig. 1.3 shows an example of this process where the six state model moves through the state sequence X = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence o_1 to o_6. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.
The joint probability that O is generated by the model M moving through the state sequence X is calculated simply as the product of the transition probabilities and the output probabilities.
So for the state sequence X in Fig. 1.3
P(O, X|M) = a_{12}\,b_2(o_1)\,a_{22}\,b_2(o_2)\,a_{23}\,b_3(o_3) \ldots    (1.4)

However, in practice, only the observation sequence O is known and the underlying state sequence X is hidden. This is why it is called a Hidden Markov Model.
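Equation 1.4 can be sketched in a few lines of code. This is an illustrative toy, not an HTK model: the transition matrix, the trivial output density and the function name `joint_probability` are all made up here, states are 0-indexed, and the final exit transition is omitted to match the truncated equation.

```python
import numpy as np

def joint_probability(A, b, states, observations):
    """P(O, X | M): product of transition and output probabilities
    along the path. A[i][j] is the transition probability from state i
    to state j; b(j, o) is the output density of state j at observation
    o. The path implicitly starts in the non-emitting entry state 0."""
    p = A[0][states[0]]                       # entry transition
    for k, (j, o) in enumerate(zip(states, observations)):
        p *= b(j, o)                          # output probability b_j(o_t)
        if k + 1 < len(states):
            p *= A[states[k]][states[k + 1]]  # transition a_{ij}
    return p

# Toy model with two emitting states (1 and 2); state 0 is the entry state.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 0.0]])
b = lambda j, o: 0.5   # a deliberately trivial output density

p = joint_probability(A, b, states=[1, 1, 2], observations=[0.0, 0.1, 0.2])
# p = 1.0 * 0.5 * 0.6 * 0.5 * 0.4 * 0.5 = 0.03
```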
Fig. 1.3 The Markov Generation Model
Given that X is unknown, the required likelihood is computed by summing over all possible state sequences X = x(1), x(2), x(3), \ldots, x(T), that is

P(O|M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}    (1.5)
where x(0) is constrained to be the model entry state and x(T + 1) is constrained to be the model exit state.
As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is

\hat{P}(O|M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \right\}    (1.6)
Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models M_i corresponding to words w_i, equation 1.2 is solved by using 1.3 and assuming that

P(O|w_i) = P(O|M_i).    (1.7)
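The decision rule of equations 1.2, 1.3 and 1.7 reduces to an argmax over the models. A minimal sketch, with made-up log likelihoods and priors (in practice the likelihoods would come from the models M_i):

```python
import math

# Hypothetical per-model log likelihoods log P(O|M_i) and log priors
# log P(w_i); P(O) is constant over words, so it can be ignored.
log_likelihoods = {"one": -120.4, "two": -118.9, "three": -125.0}
log_priors = {"one": math.log(0.5), "two": math.log(0.3), "three": math.log(0.2)}

# Pick the word maximising log P(O|M_i) + log P(w_i).
recognised = max(log_likelihoods, key=lambda w: log_likelihoods[w] + log_priors[w])
```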
All this, of course, assumes that the parameters {a_{ij}} and {b_j(o_t)} are known for each model M_i. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected, a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: “one”, “two” and “three”. Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.
Fig. 1.4 Using HMMs for Isolated Word Recognition
1.3 Output Probability Specification
Before the problem of parameter estimation can be discussed in more detail, the form of the output distributions {b_j(o_t)} needs to be made explicit. HTK is designed primarily for modelling continuous parameters using continuous density multivariate output distributions. It can also handle observation sequences consisting of discrete symbols in which case, the output distributions are discrete probabilities. For simplicity, however, the presentation in this chapter will assume that continuous density distributions are being used. The minor differences that the use of discrete probabilities entail are noted in chapter 7 and discussed in more detail in chapter 11.
In common with most other continuous density HMM systems, HTK represents output distributions by Gaussian Mixture Densities. In HTK, however, a further generalisation is made. HTK allows each observation vector at time t to be split into a number of S independent data streams o_{st}. The formula for computing b_j(o_t) is then

b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s}    (1.8)

where M_s is the number of mixture components in stream s, c_{jsm} is the weight of the m'th component and \mathcal{N}(\cdot; \mu, \Sigma) is a multivariate Gaussian with mean vector \mu and covariance matrix \Sigma, that is

\mathcal{N}(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2}(o-\mu)^T \Sigma^{-1} (o-\mu)}    (1.9)

where n is the dimensionality of o.
The exponent \gamma_s is a stream weight¹. It can be used to give a particular stream more emphasis; however, it can only be set manually. No current HTK training tools can estimate values for it.
Multiple data streams are used to enable separate modelling of multiple information sources. In HTK, the processing of streams is completely general. However, the speech input modules assume that the source data is split into at most 4 streams. Chapter 5 discusses this in more detail but for now it is sufficient to remark that the default streams are the basic parameter vector, first (delta) and second (acceleration) difference coefficients and log energy.
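Equations 1.8 and 1.9 can be sketched directly. This is an illustrative toy, not HTK code: the parameter values are invented, the covariances are assumed diagonal for simplicity, and the helper names (`gaussian`, `output_prob`) are my own.

```python
import numpy as np

def gaussian(o, mu, sigma):
    """N(o; mu, Sigma) of equation (1.9), with diagonal covariance
    given as a vector of variances."""
    n = len(o)
    det = np.prod(sigma)
    quad = np.sum((o - mu) ** 2 / sigma)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * det)

def output_prob(streams, params, gammas):
    """b_j(o_t) of equation (1.8): a product over streams of mixture
    densities, each raised to its stream weight gamma_s."""
    b = 1.0
    for o_s, stream_params, gamma in zip(streams, params, gammas):
        mix = sum(c * gaussian(o_s, mu, sig) for c, mu, sig in stream_params)
        b *= mix ** gamma
    return b

# One stream, two mixture components, 2-dimensional observations.
params = [[(0.7, np.zeros(2), np.ones(2)),
           (0.3, np.ones(2), np.ones(2))]]
b = output_prob([np.zeros(2)], params, gammas=[1.0])
```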
1.4 Baum-Welch Re-Estimation
To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae.
Fig. 1.5 Representing a Mixture
Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see Fig. 1.5).
¹ Often referred to as a codebook exponent.
Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}}\, e^{-\frac{1}{2}(o_t-\mu_j)^T \Sigma_j^{-1} (o_t-\mu_j)}    (1.10)
If there were just one state j in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of \mu_j and \Sigma_j would be just the simple averages, that is

\hat{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t    (1.11)

and

\hat{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_j)(o_t - \mu_j)^T    (1.12)
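Equations 1.11 and 1.12 are just the sample mean and (biased) sample covariance. A minimal sketch on random toy data:

```python
import numpy as np

# T = 100 toy observation vectors of dimension n = 3.
rng = np.random.default_rng(0)
O = rng.normal(size=(100, 3))

mu_hat = O.mean(axis=0)              # equation (1.11)
diff = O - mu_hat
sigma_hat = diff.T @ diff / len(O)   # equation (1.12), dividing by T
```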
In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HInit. HInit first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.
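The first step of this procedure, uniform segmentation followed by per-state averaging, can be sketched as follows. This is only an illustration of the idea, not HInit itself: the data are random, the function name is invented, and the subsequent Viterbi realignment loop is omitted.

```python
import numpy as np

def uniform_segment_init(O, num_states):
    """Divide the T training vectors equally amongst the emitting
    states, then apply equations (1.11)-(1.12) within each segment
    (diagonal variances only, for simplicity)."""
    segments = np.array_split(O, num_states)
    means = np.array([seg.mean(axis=0) for seg in segments])
    variances = np.array([seg.var(axis=0) for seg in segments])
    return means, variances

# Toy training data: 90 two-dimensional vectors, three emitting states.
rng = np.random.default_rng(1)
O = rng.normal(size=(90, 2))
means, variances = uniform_segment_init(O, num_states=3)
```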
Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector o_t contributes to the computation of the maximum likelihood parameter values for each state j. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if L_j(t) denotes the probability of being in state j at time t then equations 1.11 and 1.12 given above become the following weighted averages
\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)}    (1.13)

and

\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t)(o_t - \mu_j)(o_t - \mu_j)^T}{\sum_{t=1}^{T} L_j(t)}    (1.14)
where the summations in the denominators are included to give the required normalisation.
Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).
Of course, to apply equations 1.13 and 1.14, the probability of state occupation L_j(t) must be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the forward probability² \alpha_j(t) for some model M with N states be defined as

\alpha_j(t) = P(o_1, \ldots, o_t, x(t) = j \mid M).    (1.15)

That is, \alpha_j(t) is the joint probability of observing the first t speech vectors and being in state j at time t. This forward probability can be efficiently calculated by the following recursion
\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t).    (1.16)
² Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.
This recursion depends on the fact that the probability of being in state j at time t and seeing observation o_t can be deduced by summing the forward probabilities for all possible predecessor states i weighted by the transition probability a_{ij}. The slightly odd limits are caused by the fact that states 1 and N are non-emitting³. The initial conditions for the above recursion are
\alpha_1(1) = 1    (1.17)

\alpha_j(1) = a_{1j}\, b_j(o_1)    (1.18)

for 1 < j < N and the final condition is given by

\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}.    (1.19)
Notice here that from the definition of \alpha_j(t),

P(O|M) = \alpha_N(T).    (1.20)

Hence, the calculation of the forward probability also yields the total likelihood P(O|M).
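The forward recursion of equations 1.16 to 1.20 can be sketched as follows. This is a toy illustration, not HTK code: the transition matrix and constant output probabilities are invented, and states are 0-indexed, so states 0 and N-1 play the roles of the non-emitting states 1 and N above.

```python
import numpy as np

def forward(A, B):
    """Return alpha (T x N) and the total likelihood P(O|M).

    A -- N x N transition matrix; states 0 and N-1 are non-emitting.
    B -- T x N matrix of precomputed output probabilities b_j(o_t)
         (columns 0 and N-1 unused)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0, 1:N-1] = A[0, 1:N-1] * B[0, 1:N-1]            # (1.18)
    for t in range(1, T):
        for j in range(1, N - 1):
            alpha[t, j] = (alpha[t-1, 1:N-1] @ A[1:N-1, j]) * B[t, j]  # (1.16)
    return alpha, alpha[T-1, 1:N-1] @ A[1:N-1, N-1]        # (1.19)-(1.20)

# Toy model: two emitting states, three frames, constant outputs.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
B = np.full((3, 4), 0.5)
alpha, likelihood = forward(A, B)
```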
The backward probability \beta_j(t) is defined as

\beta_j(t) = P(o_{t+1}, \ldots, o_T \mid x(t) = j, M).    (1.21)

As in the forward case, this backward probability can be computed efficiently using the following recursion
\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)    (1.22)

with initial condition given by

\beta_i(T) = a_{iN}    (1.23)
for 1 < i < N and final condition given by

\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1).    (1.24)
Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,

\alpha_j(t)\, \beta_j(t) = P(O, x(t) = j \mid M).    (1.25)

Hence,
L_j(t) = P(x(t) = j \mid O, M) = \frac{P(O, x(t) = j \mid M)}{P(O|M)} = \frac{1}{P}\, \alpha_j(t)\, \beta_j(t)    (1.26)

where P = P(O|M).
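Putting the two recursions together gives the occupancies L_j(t) of equation 1.26. Again a toy sketch with invented parameters, 0-indexed states, and no log arithmetic; a useful sanity check is that the occupancies for each frame sum to one over the emitting states.

```python
import numpy as np

def forward_backward(A, B):
    """Return the occupancies L (T x N) and P(O|M) for a model with
    non-emitting entry (0) and exit (N-1) states."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0, 1:N-1] = A[0, 1:N-1] * B[0, 1:N-1]                       # (1.18)
    for t in range(1, T):
        alpha[t, 1:N-1] = (alpha[t-1, 1:N-1] @ A[1:N-1, 1:N-1]) * B[t, 1:N-1]  # (1.16)
    P = alpha[T-1, 1:N-1] @ A[1:N-1, N-1]                             # (1.19)-(1.20)
    beta[T-1, 1:N-1] = A[1:N-1, N-1]                                  # (1.23)
    for t in range(T - 2, -1, -1):
        beta[t, 1:N-1] = A[1:N-1, 1:N-1] @ (B[t+1, 1:N-1] * beta[t+1, 1:N-1])  # (1.22)
    L = alpha * beta / P                                              # (1.26)
    return L, P

A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
B = np.full((3, 4), 0.5)
L, P = forward_backward(A, B)
# Each row of L sums to 1: every frame is distributed over the
# emitting states in proportion to alpha_j(t) beta_j(t).
```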
All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm is now in place. The steps in this algorithm may be summarised as follows
1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as accumulators⁴.
3 To understand equations involving a non-emitting state at time t, the time should be thought of as being t − δt if it is an entry state, and t + δt if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place between frames.
4 Note that normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can result in the denominator summations being different. Hence, in HTK the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.
2. Calculate the forward and backward probabilities for all states j and times t.
3. For each state j and time t, use the probability Lj(t) and the current observation vector ot
to update the accumulators for that state.
4. Use the final accumulator values to calculate new parameter values.
5. If the value of P = P(O|M) for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.
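To make step 3 concrete for the simplest case, the sketch below accumulates the numerator and denominator sums for Gaussian mean re-estimation, in which the new mean is the occupation-weighted average of the observations, μ̂j = Σt Lj(t) ot / Σt Lj(t). The arrays and function name are hypothetical; this illustrates the accumulator idea rather than HTK's implementation.

```python
import numpy as np

def reestimate_means(L, O):
    """Accumulator-style mean update: mu_j = sum_t L_j(t) o_t / sum_t L_j(t).

    L : (N, T) state occupation probabilities L_j(t).
    O : (T, D) observation vectors o_1 ... o_T.
    Returns (N, D) re-estimated means; rows for states with zero
    occupancy are left at zero.
    """
    N, T = L.shape
    D = O.shape[1]
    num = np.zeros((N, D))   # numerator accumulators (step 1)
    den = np.zeros(N)        # denominator accumulators (step 1)
    for t in range(T):       # step 3: update accumulators frame by frame
        num += np.outer(L[:, t], O[t])
        den += L[:, t]
    mu = np.zeros((N, D))    # step 4: final accumulator values -> new means
    occupied = den > 0
    mu[occupied] = num[occupied] / den[occupied, None]
    return mu
```

With multiple training sequences, the same accumulators are simply updated over every utterance before the final division is performed.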
All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.
One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is computed in HTK using log arithmetic.
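The core difficulty is that sums of probabilities must be computed in the log domain. A standard way to do this without underflow is the "log-add" kernel sketched below; this illustrates the general technique, not HTK's internal routine.

```python
import math

def log_add(log_x, log_y):
    """Compute log(x + y) given log x and log y, without underflow.

    This is the kernel needed when the forward-backward summations are
    carried out entirely in log arithmetic.
    """
    if log_x < log_y:                  # ensure log_x is the larger term
        log_x, log_y = log_y, log_x
    if log_y == float('-inf'):         # adding a zero probability
        return log_x
    # log(x + y) = log x + log(1 + y/x); y/x <= 1, so exp() cannot overflow
    return log_x + math.log1p(math.exp(log_y - log_x))
```

For example, the product of a few hundred typical frame likelihoods underflows double precision, but the corresponding sum of log likelihoods remains a perfectly ordinary number.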
The HTK program which implements the above algorithm is called HRest. In combination with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.
1.5 Recognition and Viterbi Decoding
The previous section has described the basic ideas underlying HMM parameter re-estimation using the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for computing the forward probability also yielded as a by-product the total likelihood P(O|M). Thus, this algorithm could also be used to find the model which yields the maximum value of P(O|Mi), and hence, it could be used for recognition.
In practice, however, it is preferable to base recognition on the maximum likelihood state sequence since this generalises easily to the continuous speech case whereas the use of the total probability does not. This likelihood is computed using essentially the same algorithm as the forward probability calculation except that the summation is replaced by a maximum operation. For a given model M, let φj(t) represent the maximum likelihood of observing speech vectors o1 to ot and being in state j at time t. This partial likelihood can be computed efficiently using the following recursion (cf. equation 1.16)
φj(t) = max_i { φi(t − 1) aij } bj(ot) (1.27)

where

φ1(1) = 1 (1.28)
φj(1) = a1j bj(o1) (1.29)

for 1 < j < N. The maximum likelihood P̂(O|M) is then given by

φN(T) = max_i { φi(T) aiN } (1.30)
As for the re-estimation case, the direct computation of likelihoods leads to underflow, hence, log likelihoods are used instead. The recursion of equation 1.27 then becomes

ψj(t) = max_i { ψi(t − 1) + log(aij) } + log(bj(ot)). (1.31)

This recursion forms the basis of the so-called Viterbi algorithm. As shown in Fig. 1.6, this algorithm can be visualised as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time).
Each large dot in the picture represents the log probability of observing that frame at that time and each arc between dots corresponds to a log transition probability. The log probability of any path
is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left-to-right, column-by-column. At time t, the partial path log likelihood ψi(t − 1) is known for all states i, hence equation 1.31 can be used to compute ψj(t), thereby extending the partial paths by one time frame.
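The complete recursion, including the traceback needed to recover the best state sequence, might be sketched as follows. This is illustrative only, with a hypothetical array layout in which state 0 is the non-emitting entry state and state N-1 the exit state.

```python
import numpy as np

def viterbi_log(log_a, log_b):
    """Viterbi log likelihood (equation 1.31) with path traceback.

    log_a : (N, N) log transition probabilities; state 0 entry, N-1 exit.
    log_b : (N, T) log output probabilities; rows 0 and N-1 unused.
    Returns the best log likelihood and the best emitting-state sequence.
    """
    N, T = log_b.shape
    psi = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)        # backpointers for traceback
    psi[1:N-1, 0] = log_a[0, 1:N-1] + log_b[1:N-1, 0]
    for t in range(1, T):
        for j in range(1, N - 1):
            scores = psi[1:N-1, t-1] + log_a[1:N-1, j]
            best = int(np.argmax(scores)) + 1
            psi[j, t] = scores[best - 1] + log_b[j, t]
            back[j, t] = best
    # Exit transition: max_i { psi_i(T) + log a_iN }
    final = psi[1:N-1, T-1] + log_a[1:N-1, N-1]
    j = int(np.argmax(final)) + 1
    best_ll = final[j - 1]
    path = [j]
    for t in range(T - 1, 0, -1):             # trace the best path backwards
        j = back[j, t]
        path.append(j)
    return best_ll, path[::-1]
```

Each addition in the inner loop corresponds to following one arc of the path matrix described above and summing its log transition and log output probabilities.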
Fig. 1.6 The Viterbi Algorithm for Isolated Word Recognition (the best path through a matrix of states, vertical axis, against speech frames/time, horizontal axis)
This concept of a path is extremely important and it is generalised below to deal with the continuous speech case.
This completes the discussion of isolated word recognition using HMMs. There is no HTK tool which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided which along with its supporting libraries, HNet and HRec, is designed to handle continuous speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a special case. This is discussed in more detail below.
1.6 Continuous Speech Recognition
Returning now to the conceptual model of speech production and recognition exemplified by Fig. 1.1, it should be clear that the extension to continuous speech simply involves connecting HMMs together in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol.
These could be either whole words for so-called connected speech recognition or sub-words such as phonemes for continuous speech recognition. The reason for including the non-emitting entry and exit states should now be evident, these states provide the glue needed to join models together.
There are, however, some practical difficulties to overcome. The training data for continuous speech must consist of continuous utterances and, in general, the boundaries dividing the segments of speech corresponding to each underlying sub-word model in the sequence will not be known. In practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of the segments corresponding to a given model can then be extracted and the isolated word style of training described above can be used. However, the amount of data obtainable in this way is usually very limited and the resultant models will be poor estimates. Furthermore, even if there was a large amount of data, the boundaries imposed by hand-marking may not be optimal as far as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word models is regarded as a bootstrap operation⁵. The main training phase involves the use of a tool called HERest which does embedded training.
Embedded training uses the same Baum-Welch procedure as for the isolated case but rather than training each model individually all models are trained in parallel. It works in the following steps:
5 They can even be avoided altogether by using a flat start as described in section 8.3.
1. Allocate and zero accumulators for all parameters of all HMMs.
2. Get the next training utterance.
3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of the training utterance.
4. Calculate the forward and backward probabilities for the composite HMM. The inclusion of intermediate non-emitting states in the composite model requires some changes to the computation of the forward and backward probabilities but these are only minor. The details are given in chapter 8.
5. Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators in the usual way.
6. Repeat from 2 until all training utterances have been processed.
7. Use the accumulators to calculate new parameter estimates for all of the HMMs.
These steps can then all be repeated as many times as is necessary to achieve the required convergence. Notice that although the location of symbol boundaries in the training data is not required (or wanted) for this procedure, the symbolic transcription of each training utterance is needed.
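Step 3, the construction of the composite HMM, can be illustrated for the transition matrices alone. The sketch below joins a sequence of models by collapsing each intermediate exit/entry pair of non-emitting states; it is a simplified illustration, not HTK code, and it assumes no direct entry-to-exit ("tee") transitions within a model.

```python
import numpy as np

def join_hmms(a_list):
    """Join HMM transition matrices in sequence to form a composite HMM.

    Each matrix in a_list is (n, n) with non-emitting entry state 0 and
    exit state n-1 (no direct entry-to-exit transitions assumed). The
    intermediate exit/entry pairs are collapsed so that leaving one model
    continues directly into the next model's emitting states - the "glue"
    provided by the non-emitting states. Returns the composite matrix
    with a single entry and a single exit state.
    """
    sizes = [a.shape[0] - 2 for a in a_list]   # emitting states per model
    M = sum(sizes) + 2                          # composite model size
    comp = np.zeros((M, M))
    offsets = np.cumsum([0] + sizes)
    # composite entry feeds the first model's emitting states
    comp[0, 1:1 + sizes[0]] = a_list[0][0, 1:-1]
    for k, a in enumerate(a_list):
        off = offsets[k]
        n = a.shape[0]
        # internal emitting-to-emitting transitions are copied unchanged
        comp[1+off:1+off+sizes[k], 1+off:1+off+sizes[k]] = a[1:-1, 1:-1]
        exit_prob = a[1:-1, n-1]                # probability of leaving model k
        if k + 1 < len(a_list):
            nxt = a_list[k + 1]
            noff = offsets[k + 1]
            # collapse exit of model k with entry of model k+1
            block = np.outer(exit_prob, nxt[0, 1:-1])
            comp[1+off:1+off+sizes[k], 1+noff:1+noff+sizes[k+1]] = block
        else:
            comp[1+off:1+off+sizes[k], M-1] = exit_prob
    return comp
```

The composite matrix is itself a valid HMM transition matrix, so the forward-backward computation of step 4 proceeds exactly as before.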
Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are relatively minor⁶, the corresponding extensions to the Viterbi algorithm are more substantial.
In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing Model⁷. In brief, the token passing model makes the concept of a state alignment path explicit.
Imagine each state j of a HMM at time t holds a single moveable token which contains, amongst other information, the partial log probability ψj(t). This token then represents a partial match between the observation sequence o1 to ot and the model subject to the constraint that the model is in state j at time t. The path extension algorithm represented by the recursion of equation 1.31 is then replaced by the equivalent token passing algorithm which is executed at each time frame t.
The key steps in this algorithm are as follows
1. Pass a copy of every token in state i to all connecting states j, incrementing the log probability of the copy by log[aij] + log[bj(ot)].
2. Examine the tokens in every state and discard all but the token with the highest probability.
In practice, some modifications are needed to deal with the non-emitting states but these are straightforward if the tokens in entry states are assumed to represent paths extended to time t − δt and tokens in exit states are assumed to represent paths extended to time t + δt.
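Keeping only the best token per state, the two steps above might be sketched as follows. This is an illustration of the idea, not the HRec implementation; the array layout is hypothetical, with the single initial token held in the non-emitting entry state 0 and the exit state at index N-1.

```python
import numpy as np

def token_pass(log_a, log_b):
    """One-token-per-state token passing (best log probability only).

    log_a : (N, N) log transition probabilities; state 0 entry, N-1 exit.
    log_b : (N, T) log output probabilities; rows 0 and N-1 unused.
    At each frame every token is copied to all connecting states with its
    score incremented by log a_ij + log b_j(o_t); each state then keeps
    only its best token.
    """
    N, T = log_b.shape
    tok = np.full(N, -np.inf)
    tok[0] = 0.0                 # the single initial token sits in the entry state
    for t in range(T):
        new = np.full(N, -np.inf)
        for j in range(1, N - 1):
            # step 1 (propagate copies) and step 2 (keep the best) combined
            new[j] = np.max(tok + log_a[:, j]) + log_b[j, t]
        tok = new
    # best token reaching the exit state at the end of the speech
    return np.max(tok + log_a[:, N-1])
```

Because each state retains only its highest-scoring token, this computes exactly the Viterbi log likelihood of equation 1.31.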
The point of using the Token Passing Model is that it extends very simply to the continuous speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM instances and the square boxes denote word-end nodes. This composite network is essentially just a single large HMM and the above Token Passing algorithm applies. The only difference now is that more information is needed beyond the log probability of the best token. When the best token reaches the end of the speech, the route it took through the network must be known in order to recover the recognised sequence of models.
6 In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel operation on a network of machines.
7 See “Token Passing: a Conceptual Model for Connected Speech Recognition Systems”, SJ Young, NH Russell and JHS Thornton, CUED Technical Report F-INFENG/TR38, Cambridge University, 1989. Available by anonymous ftp from svr-ftp.eng.cam.ac.uk.
Fig. 1.7 Recognition Network for Continuously Spoken Word Recognition (words such as “a” (ax), “be” (b iy) and “been” (b iy n) are built from phone HMMs and placed in a loop)
The history of a token’s route through the network may be recorded efficiently as follows. Every token carries a pointer called a word end link. When a token is propagated from the exit state of a word (indicated by passing through a word-end node) to the entry state of another, that transition represents a potential word boundary. Hence a record called a Word Link Record is generated in which is stored the identity of the word from which the token has just emerged and the current value of the token’s link. The token’s actual link is then replaced by a pointer to the newly created WLR. Fig. 1.8 illustrates this process.
Once all of the unknown speech has been processed, the WLRs attached to the link of the best matching token (i.e. the token with the highest log probability) can be traced back to give the best matching sequence of words. At the same time the positions of the word boundaries can also be extracted if required.
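The WLR mechanism can be sketched with a minimal record type. The names here are hypothetical and the structure is deliberately stripped down; the HTK structures carry more information, such as the log probability at the boundary.

```python
class WordLinkRecord:
    """Minimal Word Link Record: the word that just ended, the frame at
    which it ended, and a link to the previous record in the history."""
    def __init__(self, word, time, prev):
        self.word, self.time, self.prev = word, time, prev

def record_word_end(token_link, word, time):
    """Called when a token passes a word-end node: create a WLR storing
    the word identity and the token's current link, and return the new
    link the token should carry instead."""
    return WordLinkRecord(word, time, token_link)

def traceback(link):
    """Follow the WLR chain of the best token back to the start and
    return the recognised word sequence with word-end frame numbers."""
    words = []
    while link is not None:
        words.append((link.word, link.time))
        link = link.prev
    return words[::-1]
```

Because each record only points backwards, many tokens can share a common history prefix, which keeps the storage cost of the traceback information low.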
Fig. 1.8 Recording Word Boundary Decisions (as the best token crosses each word-end node, a Word Link Record holding the word identity and log probability is prepended to the token’s history)
The token passing algorithm for continuous speech has been described in terms of recording the word sequence only. If required, the same principle can be used to record decisions at the model and state level. Also, more than just the best token at each word boundary can be saved. This gives the potential for generating a lattice of hypotheses rather than just the single best hypothesis.
Algorithms based on this idea are called lattice N-best. They are suboptimal because the use of a single token per state limits the number of different token histories that can be maintained. This limitation can be avoided by allowing each model state to hold multiple tokens and regarding tokens as distinct if they come from different preceding words. This gives a class of algorithm called word N-best which has been shown empirically to be comparable in performance to an optimal N-best algorithm.
The above outlines the main idea of Token Passing as it is implemented within HTK. The algorithms are embedded in the library modules HNet and HRec and they may be invoked using the recogniser tool called HVite. They provide single and multiple-token passing recognition, single-best output, lattice output, N-best lists, support for cross-word context-dependency, lattice rescoring and forced alignment.
1.7 Speaker Adaptation
Although the training and recognition techniques described previously can produce high performance recognition systems, these systems can be improved upon by customising the HMMs to the characteristics of a particular speaker. HTK provides the tools HERest and HVite to perform adaptation using a small amount of enrollment or adaptation data. The two tools differ in that HERest performs offline supervised adaptation while HVite recognises the adaptation data and uses the generated transcriptions to perform the adaptation. Generally, more robust adaptation is performed in a supervised mode, as provided by HERest, but given an initial well trained model set, HVite can still achieve noticeable improvements in performance. Full details of adaptation and how it is used in HTK can be found in Chapter 9.
Chapter 2
An Overview of the HTK Toolkit
The basic principles of HMM-based recognition were outlined in the previous chapter and a number of the key HTK tools have already been mentioned. This chapter describes the software architecture of a HTK tool. It then gives a brief outline of all the HTK tools and the way that they are used together to construct and test HMM-based recognisers. For the benefit of existing HTK users, the major changes in recent versions of HTK are listed. The following chapter will then illustrate the use of the HTK toolkit by working through a practical example of building a simple continuous speech recognition system.
2.1 HTK Software Architecture
Much of the functionality of HTK is built into the library modules. These modules ensure that every tool interfaces to the outside world in exactly the same way. They also provide a central resource of commonly used functions. Fig. 2.1 illustrates the software structure of a typical HTK tool and shows its input/output interfaces.
User input/output and interaction with the operating system is controlled by the library module HShell and all memory management is controlled by HMem. Math support is provided by HMath and the signal processing operations needed for speech analysis are in HSigP. Each of the file types required by HTK has a dedicated interface module. HLabel provides the interface for label files, HLM for language model files, HNet for networks and lattices, HDict for dictionaries, HVQ for VQ codebooks and HModel for HMM definitions.
Fig. 2.1 Software Architecture (a HTK tool built on the library modules HShell, HMem, HMath, HSigP, HLabel, HLM, HNet, HDict, HVQ, HModel, HWave, HParm, HAudio, HGraf, HUtil, HTrain, HFB, HAdapt and HRec)
All speech input and output at the waveform level is via HWave and at the parameterised level via HParm. As well as providing a consistent interface, HWave and HLabel support multiple file formats allowing data to be imported from other systems. Direct audio input is supported by HAudio and simple interactive graphics is provided by HGraf. HUtil provides a number of utility routines for manipulating HMMs while HTrain and HFB contain support for the various HTK training tools. HAdapt provides support for the various HTK adaptation tools. Finally, HRec contains the main recognition processing functions.
As noted in the next section, fine control over the behaviour of these library modules is provided by setting configuration variables. Detailed descriptions of the functions provided by the library modules are given in the second part of this book and the relevant configuration variables are described as they arise. For reference purposes, a complete list is given in chapter 18.
2.2 Generic Properties of a HTK Tool
HTK tools are designed to run with a traditional command-line style interface. Each tool has a number of required arguments plus optional arguments. The latter are always prefixed by a minus sign. As an example, the following command would invoke the mythical HTK tool called HFoo
HFoo -T 1 -f 34.3 -a -s myfile file1 file2
This tool has two main arguments called file1 and file2 plus four optional arguments. Options are always introduced by a single letter option name followed where appropriate by the option value.
The option value is always separated from the option name by a space. Thus, the value of the -f option is a real number, the value of the -T option is an integer number and the value of the -s option is a string. The -a option has no following value and it is used as a simple flag to enable or disable some feature of the tool. Options whose names are a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of a HTK tool.
In addition to command line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command
HFoo -C config -f 34.3 -a -s myfile file1 file2
is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures. Multiple configuration files can be specified by repeating the -C option, e.g.
HFoo -C config1 -C config2 -f 34.3 -a -s myfile file1 file2
Configuration parameters can sometimes be used as an alternative to using command line arguments. For example, trace options can always be set within a configuration file. However, the main use of configuration files is to control the detailed behaviour of the library modules on which all HTK tools depend.
Although this style of command-line working may seem old-fashioned when compared to modern graphical user interfaces, it has many advantages. In particular, it makes it simple to write shell scripts to control HTK tool execution. This is vital for performing large-scale system building and experimentation. Furthermore, defining all operations using text-based commands allows the details of system construction or experimental procedure to be recorded and documented.
Finally, note that a summary of the command line and options for any HTK tool can be obtained simply by executing the tool with no arguments.
2.3 The Toolkit
The HTK tools are best introduced by going through the processing steps involved in building a sub-word based continuous speech recogniser. As shown in Fig. 2.2, there are four main phases: data preparation, training, testing and analysis.
2.3.1 Data Preparation Tools
In order to build a set of HMMs, a set of speech data files and their associated transcriptions are required. Very often speech data will be obtained from database archives, typically on CD-ROMs.
Before it can be used in training, it must be converted into the appropriate parametric form and any associated transcriptions must be converted to have the correct format and use the required phone or word labels. If the speech needs to be recorded, then the tool HSLab can be used both to record the speech and to manually annotate it with any required transcriptions.
Although all HTK tools can parameterise waveforms on-the-fly, in practice it is usually better to parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files.
By setting the appropriate configuration variables, all input files can be converted to parametric form as they are read in. Thus, simply copying each file in this manner performs the required encoding. The tool HList can be used to check the contents of any speech file and since it can also convert input on-the-fly, it can be used to check the results of any conversions before processing large quantities of data. Transcriptions will also need preparing. Typically the labels used in the original source transcriptions will not be exactly as required, for example, because of differences in the phone sets used. Also, HMM training might require the labels to be context-dependent. The tool HLEd is a script-driven label editor which is designed to make the required transformations to label files. HLEd can also output files to a single Master Label File (MLF) which is usually more convenient for subsequent processing. Finally on data preparation, HLStats can gather and display statistics on label files and, where required, HQuant can be used to build a VQ codebook in preparation for building a discrete probability HMM system.
2.3.2 Training Tools
The second step of system building is to define the topology required for each HMM by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files and hence it is possible to edit them with any convenient text editor. Alternatively, the standard HTK distribution includes a number of example HMM prototypes and a script to generate the most common topologies automatically. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored. The purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM. The actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely.
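For a simple left-to-right prototype in which each emitting state allows only a self-loop and a move to the next state, making all transitions out of a state equally likely gives a probability of 0.5 for each. The sketch below constructs such a matrix; it is only an illustration of the strategy, since HTK prototypes are actually written as text HMM definition files.

```python
import numpy as np

def uniform_transitions(n):
    """Transition matrix for an n-state left-to-right prototype in which
    every allowed transition out of a state is equally likely.

    State 0 is the non-emitting entry state and state n-1 the exit state;
    each emitting state allows a self-loop and a move to the next state.
    """
    a = np.zeros((n, n))
    a[0, 1] = 1.0                       # entry always enters the first emitting state
    for i in range(1, n - 1):
        a[i, i] = a[i, i + 1] = 0.5     # self-loop and forward transition, equally likely
    return a
```

Since the training process is very insensitive to these initial values, any assignment in which each row sums to one would serve equally well as a starting point.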
Fig. 2.2 HTK Processing Stages (data preparation with HSLab, HCopy, HList, HLEd, HLStats, HQuant and HDMan; training with HCompV, HInit, HRest, HERest, HSmooth, HHEd and HEAdapt; testing with HParse, HBuild and HVite; analysis with HResults)
The actual training process takes place in stages and it is illustrated in more detail in Fig. 2.3.
Firstly, an initial set of models must be created. If there is some speech data available for which the location of the sub-word (i.e. phone) boundaries have been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated word style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure.
On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used but this time the segmental k-means procedure is replaced by the Baum-Welch re-estimation procedure described in the previous chapter. When no bootstrap data is available, a so-called flat start can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance.
The tool HCompV can be used for this.