
12.9 Other Kinds of Recognition System


Phoneme recognisers often use biphones to provide some measure of context-dependency. Provided that the HMM set contains all the necessary biphones, HNet will expand a simple phone loop into a context-sensitive biphone loop simply by setting the configuration variable FORCELEFTBI or FORCERIGHTBI to true, as appropriate.
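For example, a configuration file might contain just the following line to force right-biphone expansion (a minimal sketch; the comment and file layout are illustrative):

    # force expansion of the phone loop into right biphones
    FORCERIGHTBI = T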

Whole word recognisers can be set up in a similar way. The word network is designed using the same considerations as for a sub-word based system, but the dictionary gives the name of the whole-word HMM in place of each word pronunciation.
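For instance, a whole-word digit dictionary might contain entries such as the following, where each "pronunciation" is simply the name of a whole-word HMM (the word and model names are illustrative):

    ONE    one
    TWO    two
    THREE  three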

Finally, word spotting systems can be defined by placing each keyword in a word network in parallel with the appropriate filler models. The keywords can be whole-word models or sub-word based. Note that in this case, word transition penalties attached to the network transitions can be used to gain fine control over the false alarm rate.
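As an illustrative sketch, an HParse-style grammar for a simple word spotter might place two keywords in a loop with a filler model (the keyword and filler names are hypothetical):

    ( sil < YESTERDAY | TOMORROW | filler > sil )

Here the angle brackets denote one or more repetitions, so the network cycles through keywords and fillers between the silence models.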

Chapter 13

Decoding

[Chapter illustration: the HVite tool (built on HRec) decoding digit sequences such as "zero one two ..." in a recognition network containing sil models.]

The previous chapter has described how to construct a recognition network specifying what is allowed to be spoken and how each word is pronounced. Given such a network, its associated set of HMMs, and an unknown utterance, the probability of any path through the network can be computed. The task of a decoder is to find those paths which are the most likely.

As mentioned previously, decoding in HTK is performed by a library module called HRec. HRec uses the token passing paradigm to find the best path and, optionally, multiple alternative paths. In the latter case, it generates a lattice containing the multiple hypotheses, which can, if required, be converted to an N-best list. To drive HRec from the command line, HTK provides a tool called HVite. As well as providing basic recognition, HVite can perform forced alignments and lattice rescoring, and can recognise direct audio input.
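A typical invocation might look like the following (a sketch only; all file names are illustrative):

    HVite -H hmmdefs -S test.scp -i recout.mlf -w wdnet dict hmmlist

Here hmmdefs holds the HMM definitions, test.scp lists the speech files to recognise, wdnet is the compiled word network, dict is the pronunciation dictionary, hmmlist names the HMMs, and the recognised transcriptions are written to the master label file recout.mlf.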

To assist in evaluating the performance of a recogniser using a test database and a set of reference transcriptions, HTK also provides a tool called HResults to compute word accuracy and various related statistics. The principles and use of these recognition facilities are described in this chapter.
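For example, the command (file names again illustrative)

    HResults -I refs.mlf hmmlist recout.mlf

would compare the recognised transcriptions in recout.mlf against the reference transcriptions in refs.mlf and print the resulting sentence and word statistics.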

13.1 Decoder Operation

As described in Chapter 12 and illustrated by Fig. 12.1, decoding in HTK is controlled by a recognition network compiled from a word-level network, a dictionary and a set of HMMs. The recognition network consists of a set of nodes connected by arcs. Each node is either an HMM instance or a word-end. Each model node is itself a network consisting of states connected by arcs.

Thus, once fully compiled, a recognition network ultimately consists of HMM states connected by transitions. However, it can be viewed at three different levels: word, model and state. Fig. 13.1 illustrates this hierarchy.
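As a rough sketch in C (the type and field names here are hypothetical simplifications and are not those used by HNet), a node in the compiled network could be pictured as:

    /* One node in the compiled recognition network: either an
       instance of an HMM or a word-end marker. */
    typedef struct NetNode {
        struct HMMInstance *hmm;  /* model to enter; NULL for a word-end */
        char *word;               /* word label; NULL for a model node   */
        struct NetNode **succ;    /* arcs to the following nodes         */
        int nSucc;                /* number of outgoing arcs             */
    } NetNode;

Each HMMInstance would in turn hold the states and transitions of the model, giving the three-level hierarchy of the figure.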



[Figure: the word level (words w_{n-1}, w_n, w_{n+1}) expands at the network level into the phone models p_1, p_2 of w_n, each of which expands at the HMM level into its states s_1, s_2, s_3.]

Fig. 13.1 Recognition Network Levels

For an unknown input utterance with T frames, every path from the start node to the exit node of the network which passes through exactly T emitting HMM states is a potential recognition hypothesis. Each of these paths has a log probability which is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding observation. Within-HMM transitions are determined from the HMM parameters, between-model transitions are constant, and word-end transitions are determined by the language model likelihoods attached to the word-level network.
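In symbols, writing a_{ij} for the transition probabilities, b_j(o_t) for the output distributions, s_1 ... s_T for the emitting states on a path and w_1 ... w_K for its words, the path score can be sketched as follows (treating the constant between-model transitions as absorbed into the a terms):

\[
  \log P(\mathrm{path}) \;=\; \sum_{t=1}^{T}\Big[\log a_{s_{t-1}s_t} + \log b_{s_t}(o_t)\Big] \;+\; \sum_{k=1}^{K}\log p(w_k)
\]

where the final sum collects the language model log likelihoods picked up at word-end transitions.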

The job of the decoder is to find those paths through the network which have the highest log probability. These paths are found using a Token Passing algorithm. A token represents a partial path through the network extending from time 0 through to time t. At time 0, a token is placed in every possible start node.

At each time step, tokens are propagated along connecting transitions, stopping whenever they reach an emitting HMM state. When there are multiple exits from a node, the token is copied so that all possible paths are explored in parallel. As a token passes across transitions and through nodes, its log probability is incremented by the corresponding transition and emission probabilities.

A network node can hold at most N tokens. Hence, at the end of each time step, all but the N best tokens in any node are discarded.

As each token passes through the network it must maintain a history recording its route. The amount of detail in this history depends on the required recognition output. Normally, only word sequences are wanted and hence, only transitions out of word-end nodes need be recorded. However, for some purposes, it is useful to know the actual model sequence and the time of each model-to-model transition. Sometimes a description of each path down to the state level is required. All of this information, whatever level of detail is required, can conveniently be represented using a lattice structure.

Of course, the number of tokens allowed per node and the amount of history information requested will have a significant impact on the time and memory needed to compute the lattices. The most efficient configuration is N = 1 combined with just word-level history information, and this is sufficient for most purposes.
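The core of the algorithm for the N = 1 case can be sketched in C as follows (all type and function names here are hypothetical simplifications, not HRec's actual implementation):

    #define LZERO (-1.0E10)      /* effectively log(0): an impossible path */

    typedef struct WordLink WordLink;  /* word-level history record */

    typedef struct Token {
        double logP;             /* log probability of the partial path */
        WordLink *hist;          /* route taken so far, for the lattice */
    } Token;

    /* Pass the single best token into state j at time t: extend each
       predecessor's time t-1 token by the transition log probability
       logA[i][j], keep only the best, then add the emission log
       probability logB(j,t) = log b_j(o_t). */
    static void StepState(const Token *prev, int nStates, double **logA,
                          double (*logB)(int j, int t), int j, int t,
                          Token *out)
    {
        Token best = { LZERO, NULL };
        for (int i = 0; i < nStates; i++) {
            double p = prev[i].logP + logA[i][j];
            if (p > best.logP) {
                best = prev[i];       /* inherit the winning history */
                best.logP = p;
            }
        }
        best.logP += logB(j, t);      /* add the emission score */
        *out = best;
    }

When a token leaves a word-end node, a new WordLink record would be prepended to its history and the language model log likelihood added to its score.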

A large network will have many nodes, and one way to make a significant reduction in the computation needed is to propagate only those tokens which have some chance of being amongst the eventual winners. This process is called pruning. It is implemented at each time step by keeping a record of the best token overall and deactivating all tokens whose log probabilities fall more than a beam-width below the best. For efficiency reasons, it is best to implement primary pruning at the model rather than the state level. Thus, models are deactivated when they have no tokens in any state within the beam, and they are reactivated whenever active tokens are propagated into them.

State-level pruning is also implemented by replacing any token by a null (zero probability) token if it falls outside of the beam. If the pruning beam-width is set too small then the most likely path might be pruned before its token reaches the end of the utterance. This results in a search error.

Setting the beam-width is thus a compromise between speed and avoiding search errors.
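Continuing the sketch above (Token and LZERO as before; names again hypothetical), a beam pruning pass might look like this:

    /* After each time step, null any token falling more than
       beamWidth below the best log probability found anywhere. */
    static void PruneTokens(Token *toks, int nToks, double beamWidth)
    {
        double best = LZERO;
        for (int i = 0; i < nToks; i++)
            if (toks[i].logP > best)
                best = toks[i].logP;
        for (int i = 0; i < nToks; i++)
            if (toks[i].logP < best - beamWidth)
                toks[i].logP = LZERO;   /* pruned: becomes a null token */
    }

A fuller implementation would additionally deactivate whole models containing no surviving tokens and reactivate them when live tokens re-enter.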

