
current observation is computed for each mixture component. Then only those components which lie within a threshold of the most likely component are retained. This pruning is controlled by the -c option in HRest, HERest and HVite.

11.4 Parameter Smoothing

When large sets of context-dependent triphones are built using discrete models or tied-mixture models, under-training can be a severe problem since each state has a large number of mixture weight parameters to estimate. The HTK tool HSmooth allows these discrete probabilities or mixture component weights to be smoothed with the monophone weights using a technique called deleted interpolation.

HSmooth is used in combination with HERest working in parallel mode. The training data is split into blocks and each block is used separately to re-estimate the HMMs. However, since HERest is in parallel mode, it outputs a dump file of accumulators instead of updating the models.

HSmooth is then used in place of the second pass of HERest. It reads in the accumulator information from each of the blocks, performs deleted interpolation smoothing on the accumulator values and then outputs the re-estimated HMMs in the normal way.
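As a rough sketch of that first pass (the script files train1.scp and train2.scp and the label file labs.mlf are invented names for this example; the HMM list hlist and the definitions hmm1/HMMDefs match the HSmooth example later in this section), two data blocks might be processed with commands along the lines of

   HERest -p 1 -S train1.scp -I labs.mlf -H hmm1/HMMDefs -M hmm2 hlist
   HERest -p 2 -S train2.scp -I labs.mlf -H hmm1/HMMDefs -M hmm2 hlist

each run leaving an accumulator dump file (e.g. HER1.acc) in hmm2 rather than re-estimated HMM definitions.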

HSmooth implements a conventional deleted interpolation scheme. However, optimisation of the smoothing weights uses a fast binary chop scheme rather than the more usual Baum-Welch approach. The algorithm for finding the optimal interpolation weights for a given state and stream is as follows where the description is given in terms of tied-mixture weights but the same applies to discrete probabilities.

Assume that HERest has been set up to output N separate blocks of accumulators. Let $w_i^{(n)}$ be the i'th mixture weight estimated from accumulator blocks 1 to N excluding block n, and let $\bar{w}_i^{(n)}$ be the corresponding context-independent weight. Let $x_i^{(n)}$ be the i'th mixture weight count for the deleted block n. The derivative of the log likelihood of the deleted block, given the probability distribution with weights $c_i = \lambda w_i + (1 - \lambda)\bar{w}_i$, is given by

$$D(\lambda) = \sum_{n=1}^{N} \sum_{i=1}^{M} x_i^{(n)} \left[ \frac{w_i^{(n)} - \bar{w}_i^{(n)}}{\lambda\, w_i^{(n)} + (1 - \lambda)\, \bar{w}_i^{(n)}} \right] \qquad (11.1)$$

Since the log likelihood is a concave function of λ, this derivative is monotonically decreasing and the optimal value of λ can be found by a simple binary chop algorithm, viz.

   function FindLambdaOpt:
      if (D(0) <= 0) return 0;
      if (D(1) >= 0) return 1;
      l=0; r=1;
      for (k=1; k<=maxStep; k++){
         m = (l+r)/2;
         if (D(m) == 0) return m;
         if (D(m) > 0) l=m; else r=m;
      }
      return m;
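As a concrete illustration, the following is a minimal sketch of this binary chop written in C. It is not HTK source code: the data layout (arrays x, w and wbar indexed by block and mixture component) and the constant MAXSTEP are assumptions made purely for this example, corresponding to the quantities in equation 11.1.

   /* D(lambda): derivative of the deleted-block log likelihood, equation 11.1 */
   static double D(double lambda, int N, int M,
                   double **x, double **w, double **wbar)
   {
      double sum = 0.0;
      int n, i;
      for (n = 0; n < N; n++)
         for (i = 0; i < M; i++)
            sum += x[n][i] * (w[n][i] - wbar[n][i]) /
                   (lambda * w[n][i] + (1.0 - lambda) * wbar[n][i]);
      return sum;
   }

   #define MAXSTEP 16   /* plays the role of maxStep above; value assumed */

   /* FindLambdaOpt: binary chop search for the optimal interpolation weight */
   double FindLambdaOpt(int N, int M, double **x, double **w, double **wbar)
   {
      double l = 0.0, r = 1.0, m = 0.5, d;
      int k;
      if (D(0.0, N, M, x, w, wbar) <= 0.0) return 0.0;   /* maximum at lambda = 0 */
      if (D(1.0, N, M, x, w, wbar) >= 0.0) return 1.0;   /* maximum at lambda = 1 */
      for (k = 1; k <= MAXSTEP; k++) {
         m = 0.5 * (l + r);
         d = D(m, N, M, x, w, wbar);
         if (d == 0.0) return m;
         if (d > 0.0) l = m; else r = m;    /* optimum lies to the right/left of m */
      }
      return m;
   }

In HSmooth this search would be repeated independently for each state and stream.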

HSmooth is invoked in a similar way to HERest. For example, suppose that the directory hmm2 contains a set of accumulator files output by the first pass of HERest running in parallel mode using as source the HMM definitions listed in hlist and stored in hmm1/HMMDefs. Then the command

HSmooth -c 4 -w 2.0 -H hmm1/HMMDefs -M hmm2 hlist hmm2/*.acc

would generate a new smoothed HMM set in hmm2. Here the -w option is used to set the minimum mixture component weight in any state to twice the value of MINMIX. The -c option sets the maximum number of iterations of the binary chop procedure to be 4.

Chapter 12

Networks, Dictionaries and Language Models

[Figure: HTK software architecture diagram, showing the library modules involved in recognition, including HNet (networks), HDict (dictionaries), HLM (language models) and the decoder HRec]

The preceding chapters have described how to process speech data and how to train various types of HMM. This and the following chapter are concerned with building a speech recogniser using HTK. This chapter focuses on the use of networks and dictionaries. A network describes the sequence of words that can be recognised and, for the case of sub-word systems, a dictionary describes the sequence of HMMs that constitute each word. A word level network will typically represent either a Task Grammar which defines all of the legal word sequences explicitly or a Word Loop which simply puts all words of the vocabulary in a loop and therefore allows any word to follow any other word. Word-loop networks are often augmented by a stochastic language model.

Networks can also be used to define phone recognisers and various types of word-spotting systems.

Networks are specified using the HTK Standard Lattice Format (SLF) which is described in detail in Chapter 17. This is a general purpose text format which is used for representing multiple hypotheses in a recogniser output as well as word networks. Since SLF format is text-based, it can be written directly using any text editor. However, this can be rather tedious and HTK provides two tools which allow the application designer to use a higher-level representation. Firstly, the tool HParse allows networks to be generated from a source text containing extended BNF format grammar rules. This format was the only grammar definition language provided in earlier versions of HTK and hence HParse also provides backwards compatibility.
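By way of illustration, a small task grammar in HParse's extended BNF notation might look like the following (the words are invented for this sketch); names prefixed with $ are variables, | separates alternatives, square brackets enclose optional items and angle brackets denote one or more repetitions:

   $digit = ONE | TWO | THREE | FOUR | FIVE |
            SIX | SEVEN | EIGHT | NINE | ZERO;
   $cmd   = DIAL | CALL;
   ( SENT-START [ $cmd ] < $digit > SENT-END )

Running HParse on such a source file produces the corresponding SLF word network.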

HParse task grammars are very easy to write, but they do not allow fine control over the actual network used by the recogniser. The tool HBuild works directly at the SLF level to provide this detailed control. Its main function is to enable a large word network to be decomposed into a set of small self-contained sub-networks using as input an extended SLF format. This enhances the design process and avoids the need for unnecessary repetition.


HBuild can also be used to perform a number of special-purpose functions. Firstly, it can construct word-loop and word-pair grammars automatically. Secondly, it can incorporate a statistical bigram language model into a network. These can be generated from label transcriptions using HLStats. However, HTK supports the standard ARPA MIT-LL text format for backed-off N-gram language models, and hence import from other sources is also possible.

Whichever tool is used to generate a word network, it is important to ensure that the generated network represents the intended grammar. It is also helpful to have some measure of the difficulty of the recognition task. To assist with this, the tool HSGen is provided. This tool will generate example word sequences from an SLF network using random sampling. It will also estimate the perplexity of the network.

When a word network is loaded into a recogniser, a dictionary is consulted to convert each word in the network into a sequence of phone HMMs. A word may have multiple pronunciations in the dictionary, in which case the corresponding model sequences are joined in parallel. Options exist in this process to automatically convert the dictionary entries to context-dependent triphone models, either within a word or cross-word. Pronouncing dictionaries are a vital resource in building speech recognition systems and, in practice, word pronunciations can be derived from many different sources. The HTK tool HDMan enables a dictionary to be constructed automatically from different sources: each source can be individually edited, translated and merged to form a uniform HTK format dictionary.
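For illustration (these entries and phone symbols are invented, not taken from the text), a uniform HTK format dictionary simply lists one pronunciation per line, the word followed by its phone sequence, so a word with alternative pronunciations appears on several lines:

   BIT      b ih t
   EITHER   iy dh ax
   EITHER   ay dh ax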

The various facilities for describing a word network and expanding it into an HMM level network suitable for building a recogniser are implemented by the HTK library module HNet. The facilities for loading and manipulating dictionaries are implemented by the library module HDict, and those for loading and manipulating language models by HLM. These facilities and those provided by HParse, HBuild, HSGen, HLStats and HDMan are the subject of this chapter.

12.1 How Networks are Used

Before delving into the details of word networks and dictionaries, it will be helpful to understand their rôle in building a speech recogniser using HTK. Fig 12.1 illustrates the overall recognition process. A word network is defined using HTK Standard Lattice Format (SLF). An SLF word network is just a text file and it can be written directly with a text editor or a tool can be used to build it. HTK provides two such tools, HBuild and HParse. These both take as input a textual description and output an SLF file. Whatever method is chosen, word network SLF generation is done off-line and is part of the system build process.

An SLF file contains a list of nodes representing words and a list of arcs representing the transitions between words. The transitions can have probabilities attached to them and these can be used to indicate preferences in a grammar network. They can also be used to represent bigram probabilities in a back-off bigram network, and HBuild can generate such a bigram network automatically.
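As a rough illustration (Chapter 17 gives the full field definitions; the words and log probabilities here are invented, and the use of the l field for the link log probability is an assumption to be checked there), a minimal SLF network allowing a single YES or NO might look like this, where N and L give the numbers of nodes and links, each I= line defines a word node, and each J= line defines a link from node S to node E:

   VERSION=1.0
   N=4  L=4
   I=0  W=!NULL
   I=1  W=YES
   I=2  W=NO
   I=3  W=!NULL
   J=0  S=0  E=1  l=-0.69
   J=1  S=0  E=2  l=-0.69
   J=2  S=1  E=3
   J=3  S=2  E=3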

In addition to an SLF file, a HTK recogniser requires a dictionary to supply pronunciations for each word in the network and a set of acoustic HMM phone models. Dictionaries are input via the HTK interface module HDict.

The dictionary, HMM set and word network are input to the HTK library module HNet whose function is to generate an equivalent network of HMMs. Each word in the dictionary may have several pronunciations and in this case there will be one branch in the network corresponding to each alternative pronunciation. Each pronunciation may consist either of a list of phones or a list of HMM names. In the former case, HNet can optionally expand the HMM network to use either word internal triphones or cross-word triphones. Once the HMM network has been constructed, it can be input to the decoder module HRec and used to recognise speech input. Note that HMM network construction is performed on-line at recognition time as part of the initialisation process.
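For example (the word and model names below are purely illustrative), with word internal expansion a pronunciation b ih t for the word BIT would be mapped onto the triphone model sequence

   b+ih   b-ih+t   ih-t

whereas cross-word expansion would in addition use the last phone of the preceding word and the first phone of the following word as left and right contexts for the boundary models.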
