The HTK Book
Steve Young, Dan Kershaw, Julian Odell, Dave Ollason, Valtcho Valtchev, Phil Woodland
The HTK Book (for HTK Version 3.0)
© COPYRIGHT 1995-1999 Microsoft Corporation.
All Rights Reserved
First published December 1995
Reprinted March 1996
Revised for HTK Version 2.1 March 1997
Revised for HTK Version 2.2 January 1999
Revised for HTK Version 3.0 July 2000
Contents

Part I: Tutorial Overview

1 The Fundamentals of HTK
  1.1 General Principles of HMMs
  1.2 Isolated Word Recognition
  1.3 Output Probability Specification
  1.4 Baum-Welch Re-Estimation
  1.5 Recognition and Viterbi Decoding
  1.6 Continuous Speech Recognition
  1.7 Speaker Adaptation

2 An Overview of the HTK Toolkit
  2.1 HTK Software Architecture
  2.2 Generic Properties of a HTK Tool
  2.3 The Toolkit
    2.3.1 Data Preparation Tools
    2.3.2 Training Tools
    2.3.3 Recognition Tools
    2.3.4 Analysis Tool
  2.4 What's New In Version 2.2
    2.4.1 Features Added To Version 2.1

3 A Tutorial Example of Using HTK
  3.1 Data Preparation
    3.1.1 Step 1 - the Task Grammar
    3.1.2 Step 2 - the Dictionary
    3.1.3 Step 3 - Recording the Data
    3.1.4 Step 4 - Creating the Transcription Files
    3.1.5 Step 5 - Coding the Data
  3.2 Creating Monophone HMMs
    3.2.1 Step 6 - Creating Flat Start Monophones
    3.2.2 Step 7 - Fixing the Silence Models
    3.2.3 Step 8 - Realigning the Training Data
  3.3 Creating Tied-State Triphones
    3.3.1 Step 9 - Making Triphones from Monophones
    3.3.2 Step 10 - Making Tied-State Triphones
  3.4 Recogniser Evaluation
    3.4.1 Step 11 - Recognising the Test Data
  3.5 Running the Recogniser Live
  3.6 Adapting the HMMs
    3.6.1 Step 12 - Preparation of the Adaptation Data
    3.6.2 Step 13 - Generating the Transforms
    3.6.3 Step 14 - Evaluation of the Adapted System
  3.7 Summary

Part II: HTK in Depth

4 The Operating Environment
  4.1 The Command Line
  4.2 Script Files
  4.3 Configuration Files
  4.4 Standard Options
  4.5 Error Reporting
  4.6 Strings and Names
  4.7 Memory Management
  4.8 Input/Output via Pipes and Networks
  4.9 Byte-swapping of HTK data files
  4.10 Summary

5 Speech Input/Output
  5.1 General Mechanism
  5.2 Speech Signal Processing
  5.3 Linear Prediction Analysis
  5.4 Filterbank Analysis
  5.5 Energy Measures
  5.6 Delta and Acceleration Coefficients
  5.7 Storage of Parameter Files
    5.7.1 HTK Format Parameter Files
    5.7.2 Esignal Format Parameter Files
  5.8 Waveform File Formats
    5.8.1 HTK File Format
    5.8.2 Esignal File Format
    5.8.3 TIMIT File Format
    5.8.4 NIST File Format
    5.8.5 SCRIBE File Format
    5.8.6 SDES1 File Format
    5.8.7 AIFF File Format
    5.8.8 SUNAU8 File Format
    5.8.9 OGI File Format
    5.8.10 WAV File Format
    5.8.11 ALIEN and NOHEAD File Formats
  5.9 Direct Audio Input/Output
  5.10 Multiple Input Streams
  5.11 Vector Quantisation
  5.12 Viewing Speech with HList
  5.13 Copying and Coding using HCopy
  5.14 Version 1.5 Compatibility
  5.15 Summary

6 Transcriptions and Label Files
  6.1 Label File Structure
  6.2 Label File Formats
    6.2.1 HTK Label Files
    6.2.2 ESPS Label Files
    6.2.3 TIMIT Label Files
    6.2.4 SCRIBE Label Files
  6.3 Master Label Files
    6.3.1 General Principles of MLFs
    6.3.2 Syntax and Semantics
    6.3.3 MLF Search
    6.3.4 MLF Examples
  6.4 Editing Label Files
  6.5 Summary

7 HMM Definition Files
  7.1 The HMM Parameters
  7.2 Basic HMM Definitions
  7.3 Macro Definitions
  7.4 HMM Sets
  7.5 Tied-Mixture Systems
  7.6 Discrete Probability HMMs
  7.7 Tee Models
  7.8 Regression Class Trees for Adaptation
  7.9 Binary Storage Format
  7.10 The HMM Definition Language

8 HMM Parameter Estimation
  8.1 Training Strategies
  8.2 Initialisation using HInit
  8.3 Flat Starting with HCompV
  8.4 Isolated Unit Re-Estimation using HRest
  8.5 Embedded Training using HERest
  8.6 Single-Pass Retraining
  8.7 Parameter Re-Estimation Formulae
    8.7.1 Viterbi Training (HInit)
    8.7.2 Forward/Backward Probabilities
    8.7.3 Single Model Re-Estimation (HRest)
    8.7.4 Embedded Model Re-Estimation (HERest)

9 HMM Adaptation
  9.1 Model Adaptation using MLLR
    9.1.1 Maximum Likelihood Linear Regression
    9.1.2 MLLR and Regression Classes
    9.1.3 Transform Model File Format
  9.2 Model Adaptation using MAP
  9.3 Using HEAdapt
  9.4 MLLR Formulae
    9.4.1 Estimation of the Mean Transformation Matrix
    9.4.2 Estimation of the Variance Transformation Matrix

10 HMM System Refinement
  10.1 Using HHEd
  10.2 Constructing Context-Dependent Models
  10.3 Parameter Tying and Item Lists
  10.4 Data-Driven Clustering
  10.5 Tree-Based Clustering
  10.6 Mixture Incrementing
  10.7 Regression Class Tree Construction
  10.8 Miscellaneous Operations

11 Discrete and Tied-Mixture Models
  11.1 Modelling Discrete Sequences
  11.2 Using Discrete Models with Speech
  11.3 Tied Mixture Systems
  11.4 Parameter Smoothing

12 Networks, Dictionaries and Language Models
  12.1 How Networks are Used
  12.2 Word Networks and Standard Lattice Format
  12.3 Building a Word Network with HParse
  12.4 Bigram Language Models
  12.5 Building a Word Network with HBuild
  12.6 Testing a Word Network using HSGen
  12.7 Constructing a Dictionary
  12.8 Word Network Expansion
  12.9 Other Kinds of Recognition System

13 Decoding
  13.1 Decoder Operation
  13.2 Decoder Organisation
  13.3 Recognition using Test Databases
  13.4 Evaluating Recognition Results
  13.5 Generating Forced Alignments
  13.6 Decoding and Adaptation
    13.6.1 Recognition with Adapted HMMs
    13.6.2 Unsupervised Adaptation
  13.7 Recognition using Direct Audio Input
  13.8 N-Best Lists and Lattices

Part III: Reference Section

14 The HTK Tools
  14.1 HBuild
    14.1.1 Function
    14.1.2 Use
    14.1.3 Tracing
  14.2 HCompV
    14.2.1 Function
    14.2.2 Use
    14.2.3 Tracing
  14.3 HCopy
    14.3.1 Function
    14.3.2 Use
    14.3.3 Trace Output
  14.4 HDMan
    14.4.1 Function
    14.4.2 Use
    14.4.3 Tracing
  14.5 HEAdapt
    14.5.1 Function
    14.5.2 Use
    14.5.3 Tracing
  14.6 HERest
    14.6.1 Function
    14.6.2 Use
    14.6.3 Tracing
  14.7 HHEd
    14.7.1 Function
    14.7.2 Use
    14.7.3 Tracing
  14.8 HInit
    14.8.1 Function
    14.8.2 Use
    14.8.3 Tracing
  14.9 HLEd
    14.9.1 Function
    14.9.2 Use
    14.9.3 Tracing
  14.10 HList
    14.10.1 Function
    14.10.2 Use
    14.10.3 Tracing
  14.11 HLStats
    14.11.1 Function
    14.11.2 Bigram Generation
    14.11.3 Use
    14.11.4 Tracing
  14.12 HParse
    14.12.1 Function
    14.12.2 Network Definition
    14.12.3 Compatibility Mode
    14.12.4 Use
    14.12.5 Tracing
  14.13 HQuant
    14.13.1 Function
    14.13.2 VQ Codebook Format
    14.13.3 Use
    14.13.4 Tracing
  14.14 HRest
    14.14.1 Function
    14.14.2 Use
    14.14.3 Tracing
  14.15 HResults
    14.15.1 Function
    14.15.2 Use
    14.15.3 Tracing
  14.16 HSGen
    14.16.1 Function
    14.16.2 Use
    14.16.3 Tracing
  14.17 HSLab
    14.17.1 Function
    14.17.2 Use
    14.17.3 Tracing
  14.18 HSmooth
    14.18.1 Function
    14.18.2 Use
    14.18.3 Tracing
  14.19 HVite
    14.19.1 Function
    14.19.2 Use
    14.19.3 Tracing

15 Configuration Variables
  15.1 Configuration Variables used in Library Modules
  15.2 Configuration Variables used in Tools

16 Error and Warning Codes
  16.1 Generic Errors
  16.2 Summary of Errors by Tool and Module

17 HTK Standard Lattice Format (SLF)
  17.1 SLF Files
  17.2 Format
  17.3 Syntax
  17.4 Field Types
  17.5 Example SLF file
Part I
Tutorial Overview
Chapter 1
The Fundamentals of HTK
[Figure: the two major processing stages — training tools estimate HMM parameters from speech data and transcriptions; the recogniser produces a transcription of unknown speech]
HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series and the core of HTK is similarly general-purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular recognisers. Thus, much of the infrastructure support in HTK is dedicated to this task. As shown in the picture alongside, there are two major processing stages involved. Firstly, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions.
Secondly, unknown utterances are transcribed using the HTK recognition tools.
The main body of this book is mostly concerned with the mechanics of these two processes.
However, before launching into detail it is necessary to understand some of the basic principles of HMMs. It is also helpful to have an overview of the toolkit and to have some appreciation of how training and recognition in HTK is organised.
This first part of the book attempts to provide this information. In this chapter, the basic ideas of HMMs and their use in speech recognition are introduced. The following chapter then presents a brief overview of HTK and, for users of older versions, it highlights the main differences in version 2.0 and later. Finally in this tutorial part of the book, chapter 3 describes how a HMM-based speech recogniser can be built using HTK. It does this by describing the construction of a simple small vocabulary continuous speech recogniser.
The second part of the book then revisits the topics skimmed over here and discusses each in detail. This can be read in conjunction with the third and final part of the book which provides a reference manual for HTK. This includes a description of each tool, summaries of the various parameters used to configure HTK and a list of the error messages that it generates when things go wrong.
Finally, note that this book is concerned only with HTK as a tool-kit. It does not provide information for using the HTK libraries as a programming environment.
1.1 General Principles of HMMs
[Fig. 1.1 Message Encoding/Decoding]
Speech recognition systems generally assume that the speech signal is a realisation of some message encoded as a sequence of one or more symbols (see Fig. 1.1). To effect the reverse operation of recognising the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform on the basis that for the duration covered by a single vector (typically 10ms or so), the speech waveform can be regarded as being stationary. Although this is not strictly true, it is a reasonable approximation. Typical parametric representations in common use are smoothed spectra or linear prediction coefficients plus various other representations derived from these.

The rôle of the recogniser is to effect a mapping between sequences of speech vectors and the wanted underlying symbol sequences. Two problems make this very difficult. Firstly, the mapping from symbols to speech is not one-to-one since different underlying symbols can give rise to similar speech sounds. Furthermore, there are large variations in the realised speech waveform due to speaker variability, mood, environment, etc. Secondly, the boundaries between symbols cannot be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech waveform as a sequence of concatenated static patterns.

The second problem of not knowing the word boundary locations can be avoided by restricting the task to isolated word recognition. As shown in Fig. 1.2, this implies that the speech waveform corresponds to a single underlying symbol (e.g. word) chosen from a fixed vocabulary. Despite the fact that this simpler problem is somewhat artificial, it nevertheless has a wide range of practical applications. Furthermore, it serves as a good basis for introducing the basic ideas of HMM-based recognition before dealing with the more complex continuous speech case. Hence, isolated word recognition using HMMs will be dealt with first.
1.2 Isolated Word Recognition
Let each spoken word be represented by a sequence of speech vectors or observations O, defined as

    O = o_1, o_2, \ldots, o_T    (1.1)

where o_t is the speech vector observed at time t. The isolated word recognition problem can then be regarded as that of computing

    \arg\max_i \{P(w_i|O)\}    (1.2)

where w_i is the i'th vocabulary word. This probability is not computable directly but using Bayes' Rule gives

    P(w_i|O) = \frac{P(O|w_i) P(w_i)}{P(O)}    (1.3)

Thus, for a given set of prior probabilities P(w_i), the most probable spoken word depends only on the likelihood P(O|w_i). Given the dimensionality of the observation sequence O, the direct estimation of the joint conditional probability P(o_1, o_2, \ldots | w_i) from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities P(O|w_i) is replaced by the much simpler problem of estimating the Markov model parameters.
[Fig. 1.2 Isolated Word Problem]
In HMM-based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit and each time t that a state j is entered, a speech vector o_t is generated from the probability density b_j(o_t). Furthermore, the transition from state i to state j is also probabilistic and is governed by the discrete probability a_{ij}. Fig. 1.3 shows an example of this process where the six state model moves through the state sequence X = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence o_1 to o_6. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.
The joint probability that O is generated by the model M moving through the state sequence X is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence X in Fig. 1.3

    P(O, X | M) = a_{12} b_2(o_1) a_{22} b_2(o_2) a_{23} b_3(o_3) \ldots    (1.4)

However, in practice, only the observation sequence O is known and the underlying state sequence X is hidden. This is why it is called a Hidden Markov Model.
[Fig. 1.3 The Markov Generation Model]
Given that X is unknown, the required likelihood is computed by summing over all possible state sequences X = x(1), x(2), x(3), \ldots, x(T), that is

    P(O|M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t) a_{x(t)x(t+1)}    (1.5)

where x(0) is constrained to be the model entry state and x(T + 1) is constrained to be the model exit state.
As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is

    \hat{P}(O|M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t) a_{x(t)x(t+1)} \right\}    (1.6)

Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models M_i corresponding to words w_i, equation 1.2 is solved by using 1.3 and assuming that

    P(O|w_i) = P(O|M_i).    (1.7)
All this, of course, assumes that the parameters {a_{ij}} and {b_j(o_t)} are known for each model M_i. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected then a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: "one", "two" and "three". Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.
[Fig. 1.4 Using HMMs for Isolated Word Recognition: (a) training, (b) recognition]
1.3 Output Probability Specification
Before the problem of parameter estimation can be discussed in more detail, the form of the output distributions {b_j(o_t)} needs to be made explicit. HTK is designed primarily for modelling continuous parameters using continuous density multivariate output distributions. It can also handle observation sequences consisting of discrete symbols in which case, the output distributions are discrete probabilities. For simplicity, however, the presentation in this chapter will assume that continuous density distributions are being used. The minor differences that the use of discrete probabilities entail are noted in chapter 7 and discussed in more detail in chapter 11.
In common with most other continuous density HMM systems, HTK represents output distributions by Gaussian Mixture Densities. In HTK, however, a further generalisation is made. HTK allows each observation vector at time t to be split into a number of S independent data streams o_{st}. The formula for computing b_j(o_t) is then

    b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} N(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s}    (1.8)

where M_s is the number of mixture components in stream s, c_{jsm} is the weight of the m'th component and N(\cdot; \mu, \Sigma) is a multivariate Gaussian with mean vector \mu and covariance matrix \Sigma, that is

    N(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} e^{-\frac{1}{2}(o - \mu)' \Sigma^{-1} (o - \mu)}    (1.9)

where n is the dimensionality of o.

The exponent \gamma_s is a stream weight¹. It can be used to give a particular stream more emphasis, however, it can only be set manually. No current HTK training tools can estimate values for it.

Multiple data streams are used to enable separate modelling of multiple information sources. In HTK, the processing of streams is completely general. However, the speech input modules assume that the source data is split into at most 4 streams. Chapter 5 discusses this in more detail but for now it is sufficient to remark that the default streams are the basic parameter vector, first (delta) and second (acceleration) difference coefficients and log energy.

¹ Often referred to as a codebook exponent.
1.4 Baum-Welch Re-Estimation
To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae.
[Fig. 1.5 Representing a Mixture: an M-component Gaussian mixture in state j viewed as a set of single-Gaussian sub-states entered with probabilities a_{ij} c_{jm}]
Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see Fig. 1.5).
Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is

    b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} e^{-\frac{1}{2}(o_t - \mu_j)' \Sigma_j^{-1} (o_t - \mu_j)}    (1.10)

If there was just one state j in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of \mu_j and \Sigma_j would be just the simple averages, that is

    \hat{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t    (1.11)

and

    \hat{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_j)(o_t - \mu_j)'    (1.12)

In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HInit. HInit first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.
Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector o_t contributes to the computation of the maximum likelihood parameter values for each state j. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if L_j(t) denotes the probability of being in state j at time t then the equations 1.11 and 1.12 given above become the following weighted averages

    \hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t) o_t}{\sum_{t=1}^{T} L_j(t)}    (1.13)

and

    \hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t) (o_t - \mu_j)(o_t - \mu_j)'}{\sum_{t=1}^{T} L_j(t)}    (1.14)

where the summations in the denominators are included to give the required normalisation.

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).
Of course, to apply equations 1.13 and 1.14, the probability of state occupation L_j(t) must be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the forward probability² \alpha_j(t) for some model M with N states be defined as

    \alpha_j(t) = P(o_1, \ldots, o_t, x(t) = j | M).    (1.15)

That is, \alpha_j(t) is the joint probability of observing the first t speech vectors and being in state j at time t. This forward probability can be efficiently calculated by the following recursion

    \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1) a_{ij} \right] b_j(o_t).    (1.16)

² Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.
This recursion depends on the fact that the probability of being in state j at time t and seeing observation o_t can be deduced by summing the forward probabilities for all possible predecessor states i weighted by the transition probability a_{ij}. The slightly odd limits are caused by the fact that states 1 and N are non-emitting³. The initial conditions for the above recursion are

    \alpha_1(1) = 1    (1.17)

    \alpha_j(1) = a_{1j} b_j(o_1)    (1.18)

for 1 < j < N and the final condition is given by

    \alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T) a_{iN}.    (1.19)

Notice here that from the definition of \alpha_j(t),

    P(O|M) = \alpha_N(T).    (1.20)
Hence, the calculation of the forward probability also yields the total likelihood P(O|M).

The backward probability \beta_j(t) is defined as

    \beta_j(t) = P(o_{t+1}, \ldots, o_T | x(t) = j, M).    (1.21)

As in the forward case, this backward probability can be computed efficiently using the following recursion

    \beta_i(t) = \sum_{j=2}^{N-1} a_{ij} b_j(o_{t+1}) \beta_j(t+1)    (1.22)

with initial condition given by

    \beta_i(T) = a_{iN}    (1.23)

for 1 < i < N and final condition given by

    \beta_1(1) = \sum_{j=2}^{N-1} a_{1j} b_j(o_1) \beta_j(1).    (1.24)
Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,

    \alpha_j(t) \beta_j(t) = P(O, x(t) = j | M).    (1.25)

Hence,

    L_j(t) = P(x(t) = j | O, M) = \frac{P(O, x(t) = j | M)}{P(O|M)} = \frac{1}{P} \alpha_j(t) \beta_j(t)    (1.26)

where P = P(O|M).
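The complete forward-backward computation of L_j(t) can be sketched compactly. The sketch below covers the emitting states only and works in plain probabilities for clarity (real implementations, HTK included, use log arithmetic or scaling to avoid underflow); the argument conventions are assumptions for illustration:

    import numpy as np

    def state_occupation(entry, trans, exit_, b):
        # entry[j] = a_{1j}, trans[i, j] = a_{ij} between emitting states,
        # exit_[i] = a_{iN}, b[t, j] = b_j(o_t)
        T, N = b.shape
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = entry * b[0]                         # eqs 1.17-1.18
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ trans) * b[t]    # eq. 1.16
        beta[T - 1] = exit_                             # eq. 1.23
        for t in range(T - 2, -1, -1):
            beta[t] = trans @ (b[t + 1] * beta[t + 1])  # eq. 1.22
        P = alpha[T - 1] @ exit_                        # eqs 1.19-1.20
        return alpha * beta / P                         # eq. 1.26, i.e. L_j(t)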
All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm is now in place. The steps in this algorithm may be summarised as follows

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as accumulators⁴.
2. Calculate the forward and backward probabilities for all states j and times t.

3. For each state j and time t, use the probability L_j(t) and the current observation vector o_t to update the accumulators for that state.

4. Use the final accumulator values to calculate new parameter values.

5. If the value of P = P(O|M) for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.

³ To understand equations involving a non-emitting state at time t, the time should be thought of as being t − δt if it is an entry state, and t + δt if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place between frames.

⁴ Note that normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can result in the denominator summations being different. Hence, in HTK the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.
All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is computed in HTK using log arithmetic.
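In the log domain the products of equations 1.16 and 1.22 become sums, and each summation requires a "log add". A minimal sketch of that operation (assuming natural logs):

    import math

    def log_add(log_x, log_y):
        # Returns log(x + y) given log(x) and log(y), without ever forming
        # x or y directly, so very small probabilities cannot underflow.
        if log_x < log_y:
            log_x, log_y = log_y, log_x
        return log_x + math.log1p(math.exp(log_y - log_x))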
The HTK program which implements the above algorithm is called HRest. In combination with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.
1.5 Recognition and Viterbi Decoding
The previous section has described the basic ideas underlying HMM parameter re-estimation using the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for computing the forward probability also yielded as a by-product the total likelihood P(O|M). Thus, this algorithm could also be used to find the model which yields the maximum value of P(O|M_i), and hence, it could be used for recognition.
In practice, however, it is preferable to base recognition on the maximum likelihood state sequence since this generalises easily to the continuous speech case whereas the use of the total probability does not. This likelihood is computed using essentially the same algorithm as the forward probability calculation except that the summation is replaced by a maximum operation. For a given model M, let \phi_j(t) represent the maximum likelihood of observing speech vectors o_1 to o_t and being in state j at time t. This partial likelihood can be computed efficiently using the following recursion (cf. equation 1.16)

    \phi_j(t) = \max_i \{\phi_i(t-1) a_{ij}\} \, b_j(o_t)    (1.27)

where

    \phi_1(1) = 1    (1.28)

    \phi_j(1) = a_{1j} b_j(o_1)    (1.29)

for 1 < j < N. The maximum likelihood \hat{P}(O|M) is then given by

    \phi_N(T) = \max_i \{\phi_i(T) a_{iN}\}    (1.30)

As for the re-estimation case, the direct computation of likelihoods leads to underflow, hence, log likelihoods are used instead. The recursion of equation 1.27 then becomes

    \psi_j(t) = \max_i \{\psi_i(t-1) + \log(a_{ij})\} + \log(b_j(o_t)).    (1.31)
Each large dot in the picture represents the log probability of observing that frame at that time and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left-to-right column-by-column. At time t, each partial path ψi(t − 1) is known for all states i, hence equation1.31can be used to compute ψj(t) thereby extending the partial paths by one time frame.
[Fig. 1.6 The Viterbi Algorithm for Isolated Word Recognition]
This concept of a path is extremely important and it is generalised below to deal with the continuous speech case.
This completes the discussion of isolated word recognition using HMMs. There is no HTK tool which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided which along with its supporting libraries, HNet and HRec, is designed to handle continuous speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a special case. This is discussed in more detail below.
1.6 Continuous Speech Recognition
Returning now to the conceptual model of speech production and recognition exemplified by Fig. 1.1, it should be clear that the extension to continuous speech simply involves connecting HMMs together in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol. These could be either whole words for so-called connected speech recognition or sub-words such as phonemes for continuous speech recognition. The reason for including the non-emitting entry and exit states should now be evident: these states provide the glue needed to join models together.

There are, however, some practical difficulties to overcome. The training data for continuous speech must consist of continuous utterances and, in general, the boundaries dividing the segments of speech corresponding to each underlying sub-word model in the sequence will not be known. In practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of the segments corresponding to a given model can then be extracted and the isolated word style of training described above can be used. However, the amount of data obtainable in this way is usually very limited and the resultant models will be poor estimates. Furthermore, even if there was a large amount of data, the boundaries imposed by hand-marking may not be optimal as far as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word models is regarded as a bootstrap operation⁵. The main training phase involves the use of a tool called HERest which does embedded training.
Embedded training uses the same Baum-Welch procedure as for the isolated case but rather than training each model individually all models are trained in parallel. It works in the following steps:

1. Allocate and zero accumulators for all parameters of all HMMs.

2. Get the next training utterance.

3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of the training utterance.

4. Calculate the forward and backward probabilities for the composite HMM. The inclusion of intermediate non-emitting states in the composite model requires some changes to the computation of the forward and backward probabilities but these are only minor. The details are given in chapter 8.

5. Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators in the usual way.

6. Repeat from 2 until all training utterances have been processed.

7. Use the accumulators to calculate new parameter estimates for all of the HMMs.

These steps can then all be repeated as many times as is necessary to achieve the required convergence. Notice that although the location of symbol boundaries in the training data is not required (or wanted) for this procedure, the symbolic transcription of each training utterance is needed.

⁵ They can even be avoided altogether by using a flat start as described in section 8.3.
Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are relatively minor⁶, the corresponding extensions to the Viterbi algorithm are more substantial. In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing Model⁷. In brief, the token passing model makes the concept of a state alignment path explicit. Imagine each state j of a HMM at time t holds a single moveable token which contains, amongst other information, the partial log probability \psi_j(t). This token then represents a partial match between the observation sequence o_1 to o_t and the model subject to the constraint that the model is in state j at time t. The path extension algorithm represented by the recursion of equation 1.31 is then replaced by the equivalent token passing algorithm which is executed at each time frame t. The key steps in this algorithm are as follows

1. Pass a copy of every token in state i to all connecting states j, incrementing the log probability of the copy by log[a_{ij}] + log[b_j(o_t)].

2. Examine the tokens in every state and discard all but the token with the highest probability.

In practice, some modifications are needed to deal with the non-emitting states but these are straightforward if the tokens in entry states are assumed to represent paths extended to time t − δt and tokens in exit states are assumed to represent paths extended to time t + δt. A sketch of the basic per-frame step is given below.
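The per-frame step might be sketched as follows, for emitting states only; the Token structure and connectivity representation are illustrative assumptions rather than the actual structures used in HRec:

    import math
    from dataclasses import dataclass

    @dataclass
    class Token:
        log_prob: float = -math.inf
        history: tuple = ()    # e.g. word end link information

    def propagate(tokens, log_trans, log_b_t):
        # One frame of single-token passing: copy every token to each
        # connecting state (step 1), then keep only the best copy in
        # each state (step 2).
        N = len(tokens)
        new_tokens = [Token() for _ in range(N)]
        for i, tok in enumerate(tokens):
            for j in range(N):
                score = tok.log_prob + log_trans[i][j] + log_b_t[j]
                if score > new_tokens[j].log_prob:
                    new_tokens[j] = Token(score, tok.history)
        return new_tokens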
The point of using the Token Passing Model is that it extends very simply to the continuous speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-based HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM instances and the square boxes denote word-end nodes. This composite network is essentially just a single large HMM and the above Token Passing algorithm applies. The only difference now is that more information is needed beyond the log probability of the best token. When the best token reaches the end of the speech, the route it took through the network must be known in order to recover the recognised sequence of models.
⁶ In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel operation on a network of machines.

⁷ See "Token Passing: a Conceptual Model for Connected Speech Recognition Systems", SJ Young, NH Russell and JHS Thornton, CUED Technical Report F-INFENG/TR38, Cambridge University, 1989. Available by anonymous ftp from svr-ftp.eng.cam.ac.uk.
[Fig. 1.7 Recognition Network for Continuously Spoken Word Recognition]
The history of a token's route through the network may be recorded efficiently as follows. Every token carries a pointer called a word end link. When a token is propagated from the exit state of a word (indicated by passing through a word-end node) to the entry state of another, that transition represents a potential word boundary. Hence a record called a Word Link Record is generated in which is stored the identity of the word from which the token has just emerged and the current value of the token's link. The token's actual link is then replaced by a pointer to the newly created WLR. Fig. 1.8 illustrates this process.

Once all of the unknown speech has been processed, the WLRs attached to the link of the best matching token (i.e. the token with the highest log probability) can be traced back to give the best matching sequence of words. At the same time the positions of the word boundaries can also be extracted if required.
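A sketch of how Word Link Records might be created and traced back (the record layout here is illustrative only, not HRec's actual data structure):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WLR:
        word: str                # identity of the word just left
        time: int                # frame at which the boundary was crossed
        prev: Optional["WLR"]    # previous value of the token's link

    def cross_word_boundary(token_link, word, t):
        # Called when a token passes through a word-end node: record the
        # decision and return the new link for the token to carry.
        return WLR(word, t, token_link)

    def trace_back(link):
        # Follow the chain from the best token's link to recover the
        # recognised word sequence (and, via time, the word boundaries).
        words = []
        while link is not None:
            words.append(link.word)
            link = link.prev
        return words[::-1]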
[Fig. 1.8 Recording Word Boundary Decisions]
The token passing algorithm for continuous speech has been described in terms of recording the word sequence only. If required, the same principle can be used to record decisions at the model and state level. Also, more than just the best token at each word boundary can be saved. This gives the potential for generating a lattice of hypotheses rather than just the single best hypothesis. Algorithms based on this idea are called lattice N-best. They are suboptimal because the use of a single token per state limits the number of different token histories that can be maintained. This limitation can be avoided by allowing each model state to hold multiple tokens and regarding tokens as distinct if they come from different preceding words. This gives a class of algorithm called word N-best which has been shown empirically to be comparable in performance to an optimal N-best algorithm.
The above outlines the main idea of Token Passing as it is implemented within HTK. The algorithms are embedded in the library modules HNet and HRec and they may be invoked using the recogniser tool called HVite. They provide single and multiple-token passing recognition, single-best output, lattice output, N-best lists, support for cross-word context-dependency, lattice rescoring and forced alignment.
1.7 Speaker Adaptation
Although the training and recognition techniques described previously can produce high performance recognition systems, these systems can be improved upon by customising the HMMs to the characteristics of a particular speaker. HTK provides the tools HEAdapt and HVite to perform adaptation using a small amount of enrollment or adaptation data. The two tools differ in that HEAdapt performs offline supervised adaptation while HVite recognises the adaptation data and uses the generated transcriptions to perform the adaptation. Generally, more robust adaptation is performed in a supervised mode, as provided by HEAdapt, but given an initial well trained model set, HVite can still achieve noticeable improvements in performance. Full details of adaptation and how it is used in HTK can be found in Chapter 9.
Chapter 2
An Overview of the HTK Toolkit
The basic principles of HMM-based recognition were outlined in the previous chapter and a number of the key HTK tools have already been mentioned. This chapter describes the software architecture of a HTK tool. It then gives a brief outline of all the HTK tools and the way that they are used together to construct and test HMM-based recognisers. For the benefit of existing HTK users, the major changes in recent versions of HTK are listed. The following chapter will then illustrate the use of the HTK toolkit by working through a practical example of building a simple continuous speech recognition system.
2.1 HTK Software Architecture
Much of the functionality of HTK is built into the library modules. These modules ensure that every tool interfaces to the outside world in exactly the same way. They also provide a central resource of commonly used functions. Fig. 2.1 illustrates the software structure of a typical HTK tool and shows its input/output interfaces.
User input/output and interaction with the operating system is controlled by the library module HShell and all memory management is controlled by HMem. Math support is provided by HMath and the signal processing operations needed for speech analysis are in HSigP. Each of the file types required by HTK has a dedicated interface module. HLabel provides the interface for label files, HLM for language model files, HNet for networks and lattices, HDict for dictionaries, HVQ for VQ codebooks and HModel for HMM definitions.
[Fig. 2.1 Software Architecture]
All speech input and output at the waveform level is via HWave and at the parameterised level via HParm. As well as providing a consistent interface, HWave and HLabel support multiple file formats allowing data to be imported from other systems. Direct audio input is supported by HAudio and simple interactive graphics is provided by HGraf. HUtil provides a number of utility routines for manipulating HMMs while HTrain and HFB contain support for the various HTK training tools. HAdapt provides support for the various HTK adaptation tools. Finally, HRec contains the main recognition processing functions.
As noted in the next section, fine control over the behaviour of these library modules is provided by setting configuration variables. Detailed descriptions of the functions provided by the library modules are given in the second part of this book and the relevant configuration variables are described as they arise. For reference purposes, a complete list is given in chapter 15.
2.2 Generic Properties of a HTK Tool
HTK tools are designed to run with a traditional command-line style interface. Each tool has a number of required arguments plus optional arguments. The latter are always prefixed by a minus sign. As an example, the following command would invoke the mythical HTK tool called HFoo
HFoo -T 1 -f 34.3 -a -s myfile file1 file2
This tool has two main arguments called file1 and file2 plus four optional arguments. Options are always introduced by a single letter option name followed where appropriate by the option value.
The option value is always separated from the option name by a space. Thus, the value of the -f option is a real number, the value of the -T option is an integer number and the value of the -s option is a string. The -a option has no following value and it is used as a simple flag to enable or disable some feature of the tool. Options whose names are a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of a HTK tool.
In addition to command line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command
HFoo -C config -f 34.3 -a -s myfile file1 file2
is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures. Configuration parameters can sometimes be used as an alternative to using command line arguments. For example, trace options can always be set within a configuration file. However, the main use of configuration files is to control the detailed behaviour of the library modules on which all HTK tools depend.
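For illustration, a configuration file for speech coding might contain entries such as the following. The variable names shown are standard HTK configuration variables (described in chapter 5 and listed in chapter 15); the particular values are just an example, not recommendations:

    # illustrative configuration for MFCC coding
    SOURCEFORMAT = HTK
    TARGETKIND   = MFCC_0_D_A
    TARGETRATE   = 100000.0
    WINDOWSIZE   = 250000.0
    USEHAMMING   = T
    NUMCEPS      = 12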
Although this style of command-line working may seem old-fashioned when compared to modern graphical user interfaces, it has many advantages. In particular, it makes it simple to write shell scripts to control HTK tool execution. This is vital for performing large-scale system building and experimentation. Furthermore, defining all operations using text-based commands allows the details of system construction or experimental procedure to be recorded and documented.

Finally, note that a summary of the command line and options for any HTK tool can be obtained simply by executing the tool with no arguments.
2.3 The Toolkit
The HTK tools are best introduced by going through the processing steps involved in building a sub-word based continuous speech recogniser. As shown in Fig. 2.2, there are 4 main phases: data preparation, training, testing and analysis.
2.3.1 Data Preparation Tools
In order to build a set of HMMs, a set of speech data files and their associated transcriptions are required. Very often speech data will be obtained from database archives, typically on CD-ROMs. Before it can be used in training, it must be converted into the appropriate parametric form and any associated transcriptions must be converted to have the correct format and use the required phone or word labels. If the speech needs to be recorded, then the tool HSLab can be used both to record the speech and to manually annotate it with any required transcriptions.

Although all HTK tools can parameterise waveforms on-the-fly, in practice it is usually better to parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. By setting the appropriate configuration variables, all input files can be converted to parametric form as they are read in. Thus, simply copying each file in this manner performs the required encoding. The tool HList can be used to check the contents of any speech file and since it can also convert input on-the-fly, it can be used to check the results of any conversions before processing large quantities of data. Transcriptions will also need preparing. Typically the labels used in the original source transcriptions will not be exactly as required, for example, because of differences in the phone sets used. Also, HMM training might require the labels to be context-dependent. The tool HLEd is a script-driven label editor which is designed to make the required transformations to label files. HLEd can also output files to a single Master Label File (MLF) which is usually more convenient for subsequent processing. Finally on data preparation, HLStats can gather and display statistics on label files and, where required, HQuant can be used to build a VQ codebook in preparation for building a discrete probability HMM system.
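For example, a whole database might be coded in one pass with a command of the following form, where config sets the target parameter kind as above and codetr.scp (a hypothetical name) is a script file listing pairs of source and target files:

    HCopy -T 1 -C config -S codetr.scp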
2.3.2 Training Tools
The second step of system building is to define the topology required for each HMM by writing a prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions can be stored externally as simple text files and hence it is possible to edit them with any convenient text editor. Alternatively, the standard HTK distribution includes a number of example HMM prototypes and a script to generate the most common topologies automatically. With the exception of the transition probabilities, all of the HMM parameters given in the prototype definition are ignored. The purpose of the prototype definition is only to specify the overall characteristics and topology of the HMM. The actual parameters will be computed later by the training tools. Sensible values for the transition probabilities must be given but the training process is very insensitive to these. An acceptable and simple strategy for choosing these probabilities is to make all of the transitions out of any state equally likely.
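For illustration, a prototype for a 3-state left-to-right HMM with single-Gaussian output distributions of 4-dimensional MFCC vectors might be written as follows. This is a sketch in the HMM definition language described in chapter 7; since the training tools will overwrite everything except the transition structure, the zero means and unit variances are just placeholders:

    ~o <VecSize> 4 <MFCC>
    ~h "proto"
    <BeginHMM>
      <NumStates> 5
      <State> 2
        <Mean> 4
          0.0 0.0 0.0 0.0
        <Variance> 4
          1.0 1.0 1.0 1.0
      <State> 3
        <Mean> 4
          0.0 0.0 0.0 0.0
        <Variance> 4
          1.0 1.0 1.0 1.0
      <State> 4
        <Mean> 4
          0.0 0.0 0.0 0.0
        <Variance> 4
          1.0 1.0 1.0 1.0
      <TransP> 5
        0.0 1.0 0.0 0.0 0.0
        0.0 0.5 0.5 0.0 0.0
        0.0 0.0 0.5 0.5 0.0
        0.0 0.0 0.0 0.5 0.5
        0.0 0.0 0.0 0.0 0.0
    <EndHMM>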
[Fig. 2.2 HTK Processing Stages]
The actual training process takes place in stages and it is illustrated in more detail in Fig. 2.3.
Firstly, an initial set of models must be created. If there is some speech data available for which the location of the sub-word (i.e. phone) boundaries have been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated word style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure.
On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used but this time the segmental k-means procedure is replaced by the Baum-Welch re-estimation procedure described in the previous chapter. When no bootstrap data is available, a so-called flat start can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance.
The tool HCompV can be used for this.
[Fig. 2.3 Training Sub-word HMMs]
Once an initial set of models has been created, the tool HERest is used to perform embedded training using the entire training set. HERest performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously. For each training utterance, the corresponding phone models are concatenated and then the forward-backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of the training data has been processed, the accumulated statistics are used to compute re-estimates of the HMM parameters. HERest is the core HTK training tool. It is designed to process large databases, it has facilities for pruning to reduce computation and it can be run in parallel across a network of machines.
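A single embedded re-estimation pass might then be invoked with a command of the following form (the file names are hypothetical: train.scp lists the coded training files, phones.mlf holds the phone-level transcriptions, hmm0/hmmdefs holds the current model set and monophones lists the models to be trained):

    HERest -C config -I phones.mlf -S train.scp -H hmm0/hmmdefs -M hmm1 monophones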
The philosophy of system construction in HTK is that HMMs should be refined incrementally. Thus, a typical progression is to start with a simple set of single Gaussian context-independent phone models and then iteratively refine them by expanding them to include context-dependency and use multiple mixture component Gaussian distributions. The tool HHEd is a HMM definition editor which will clone models into context-dependent sets, apply a variety of parameter tyings and increment the number of mixture components in specified distributions. The usual process is to modify a set of HMMs in stages using HHEd and then re-estimate the parameters of the modified set using HERest after each stage. To improve performance for specific speakers the tools HEAdapt and HVite can be used to adapt HMMs to better model the characteristics of particular speakers using a small amount of training or adaptation data. The end result is a speaker adapted system.

The single biggest problem in building context-dependent HMM systems is always data insufficiency. The more complex the model set, the more data is needed to make robust estimates of its parameters, and since data is usually limited, a balance must be struck between complexity and the available data. For continuous density systems, this balance is achieved by tying parameters together as mentioned above. Parameter tying allows data to be pooled so that the shared parameters can be robustly estimated. In addition to continuous density systems, HTK also supports