Word Network Expansion - The HTK Book

the typical manipulations that can be performed by HDMan. Firstly, suppose that a dictionary transcribed unstressed “-ed” endings as ih0 d but the required dictionary does not mark stress but uses a schwa in such cases, that is, the transformations

ih0 d # -> ax d

ih0 -> ih (otherwise)

are required. These could be achieved by the following 3 commands

MP axd0 ih0 d #

SP axd0 ax d #

RP ih ih0

The context sensitive replace is achieved by merging all sequences of ih0 d # and then splitting the result into the sequence ax d #. The final RP command then unconditionally replaces all occurrences of ih0 by ih. As a second similar example, suppose that all examples of ax l (as in “bottle”) are to be replaced by the single phone el provided that the immediately following phone is a non-vowel. This requires the use of the DC command to define a context consisting of all non-vowels, then a merge using MP as above followed by a context-sensitive replace

DC nonv l r w y .... m n ng #

MP axl ax l

CR el * axl nonv

SP axl ax l

the final step converts all non-transformed cases of ax l back to their original form.
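The two edit sequences above can be imitated on a single pronunciation with a few list operations. The following Python sketch is an illustration only, not HDMan's implementation: the function names are hypothetical, `#` is treated as an end-of-word marker as the examples suggest, the phone symbols in the sample inputs are illustrative, and the non-vowel set contains only the phones actually listed above (the source elides the rest).

```python
def merge(phones, new, seq):
    """MP-style merge: each occurrence of the sequence seq becomes the symbol new."""
    out, i, n = [], 0, len(seq)
    while i < len(phones):
        if phones[i:i + n] == seq:
            out.append(new)
            i += n
        else:
            out.append(phones[i])
            i += 1
    return out


def split(phones, old, seq):
    """SP-style split: each occurrence of old becomes the sequence seq."""
    out = []
    for p in phones:
        out.extend(seq if p == old else [p])
    return out


def replace(phones, new, olds):
    """RP-style replace: every phone in olds becomes new, regardless of context."""
    return [new if p in olds else p for p in phones]


def context_replace(phones, new, left_ctx, old, right_ctx):
    """CR-style replace: old becomes new only when both neighbours match.

    A context of None stands for '*' (any neighbour).
    """
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        ok_left = left_ctx is None or left in left_ctx
        ok_right = right_ctx is None or right in right_ctx
        out.append(new if p == old and ok_left and ok_right else p)
    return out


# Partial non-vowel context: only the phones named in the text (rest elided there).
NONV = {"l", "r", "w", "y", "m", "n", "ng", "#"}


def fix_ed_endings(pron):
    """First example: ih0 d # -> ax d, and ih0 -> ih otherwise."""
    p = pron + ["#"]                       # add the end-of-word marker
    p = merge(p, "axd0", ["ih0", "d", "#"])
    p = split(p, "axd0", ["ax", "d", "#"])
    p = replace(p, "ih", ["ih0"])
    return p[:-1]                          # drop the marker again


def syllabic_l(pron):
    """Second example: ax l -> el when the following phone is a non-vowel."""
    p = pron + ["#"]
    p = merge(p, "axl", ["ax", "l"])
    p = context_replace(p, "el", None, "axl", NONV)
    p = split(p, "axl", ["ax", "l"])       # restore the untransformed cases
    return p[:-1]


print(fix_ed_endings(["ih0", "n", "t", "ih0", "d"]))
print(syllabic_l(["b", "aa", "t", "ax", "l"]))
```

The temporary symbols axd0 and axl play the same role as in the edit scripts: they make a multi-phone sequence addressable as a single unit so that a context test can be applied to it.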

As a final example, a typical output transformation applied via the edit script global.ded will convert all phones to context-dependent form and append a short pause model sp at the end of each pronunciation. The following two commands will do this

TC

AS sp

For example, these commands would convert the dictionary entry

BAT b ah t

into

BAT b+ah b-ah+t ah-t sp
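The effect of these two commands on a single pronunciation can be sketched in Python as below. The function names are hypothetical and the sketch only reproduces the transformation shown for BAT; it does not model any other behaviour of HDMan.

```python
def to_triphones(phones):
    """TC-style conversion: give each phone its word-internal left/right context."""
    out = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:
            name = phones[i - 1] + "-" + name      # left context, if any
        if i + 1 < len(phones):
            name = name + "+" + phones[i + 1]      # right context, if any
        out.append(name)
    return out


def append_silence(phones, model="sp"):
    """AS-style append: add a short pause model at the end of the pronunciation."""
    return phones + [model]


print(" ".join(append_silence(to_triphones(["b", "ah", "t"]))))
```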

Finally, if the -l option is set, HDMan will generate a log file containing a summary of the pronunciations used from each source and how many words, if any, are missing. It is also possible to give HDMan a phone list using the -n option. In this case, HDMan will record how many times each phone was used and also any phones that appeared in pronunciations but are not in the phone list. This is useful for detecting errors and unexpected phone symbols in the source dictionary.
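The kind of bookkeeping described for the phone list can be sketched as follows. This is an illustrative Python sketch, not HDMan's log format: the function name `phone_usage` and the dictionary layout (word mapped to a list of phones) are assumptions made for the example.

```python
from collections import Counter


def phone_usage(dictionary, phone_list):
    """Count phone occurrences and report phones missing from the phone list."""
    counts = Counter(p for pron in dictionary.values() for p in pron)
    unknown = sorted(set(counts) - set(phone_list))
    return counts, unknown


# Small example: "u" is used in a pronunciation but absent from the phone list.
counts, unknown = phone_usage(
    {"bit": ["b", "i", "t"], "but": ["b", "u", "t"]},
    ["b", "i", "t"],
)
print(dict(counts), unknown)
```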

12.8 Word Network Expansion

Now that word networks and dictionaries have been explained, the conversion of word level networks to model-based recognition networks will be described. Referring again to Fig. 12.1, this expansion is performed automatically by the module HNet. By default, HNet attempts to infer the required expansion from the contents of the dictionary and the associated list of HMMs. However, 5 configuration parameters are supplied to apply more precise control where required: ALLOWCXTEXP, ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and FORCERIGHTBI.

The expansion proceeds in four stages.

1. Context definition

The first step is to determine how model names are constructed from the dictionary entries and whether cross-word context expansion should be performed. The dictionary is scanned and each distinct phone is classified as either


(a) Context Free

In this case, the phone is skipped when determining context. An example is a model (sp) for short pauses. This will typically be inserted at the end of every word pronunciation but since it tends to cover a very short segment of speech it should not block context-dependent effects in a cross-word triphone system.

(b) Context Independent

The phone only exists in context-independent form. A typical example would be a silence model (sil). Note that the distinction that would be made by HNet between sil and sp is that whilst both would only appear in the HMM set in context-independent form, sil would appear in the contexts of other phones whereas sp would not.

(c) Context Dependent

This classification depends on whether a phone appears in the context part of the name and whether any context dependent versions of the phone exist in the HMMSet. Context Dependent phones will be subject to model name expansion.
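One way to read this classification rule is sketched below in Python. It is an interpretation of the description above rather than HNet's actual code: a phone with context-dependent versions in the HMM list is taken to be context dependent, a phone without such versions that still appears as a context of other models is taken to be context independent, and any remaining phone is taken to be context free. The function names are hypothetical.

```python
def parse_name(name):
    """Split a model name of the form l-p+r into (left, base, right)."""
    left, rest = name.split("-", 1) if "-" in name else (None, name)
    base, right = rest.split("+", 1) if "+" in rest else (rest, None)
    return left, base, right


def classify_phones(phones, hmm_names):
    """Label each phone as context-free, context-independent or context-dependent."""
    in_context = set()   # phones seen in the context part of some model name
    has_cd = set()       # phones that have at least one context-dependent version
    for name in hmm_names:
        left, base, right = parse_name(name)
        if left is not None or right is not None:
            has_cd.add(base)
        in_context.update(c for c in (left, right) if c is not None)

    labels = {}
    for p in phones:
        if p in has_cd:
            labels[p] = "context-dependent"
        elif p in in_context:
            labels[p] = "context-independent"
        else:
            labels[p] = "context-free"
    return labels


hmms = ["sil", "sp", "sil-aa+r", "aa-r+y", "r-y+uw", "y-uw+sil"]
print(classify_phones(["sil", "sp", "aa", "r", "y", "uw"], hmms))
```

Under this reading sil and sp differ only in that sil occurs as a context of other models, which matches the distinction drawn in (b).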

2. Determination of network type

The default behaviour is to produce the simplest network possible. If the dictionary is closed (every phone name appears in the HMM list), then no expansion of phone names is performed. The resulting network is generated by straightforward substitution of each dictionary pronunciation for each word in the word network. If the dictionary is not closed but word internal context expansion would find each model in the HMM set, then word internal context expansion is used. Otherwise, full cross-word context expansion is applied.

The determination of the network type can be modified by using the configuration parameters mentioned earlier. By default ALLOWCXTEXP is set true. If ALLOWCXTEXP is set false, then no expansion of phone names is performed and each phone corresponds to the model of the same name. The default value of ALLOWXWRDEXP is false thus preventing context expansion across word boundaries. This also limits the expansion of the phone labels in the dictionary to word internal contexts only. If FORCECXTEXP is set true, then context expansion will be performed. For example, if the HMM set contained all monophones, all biphones and all triphones, then given a monophone dictionary, the default behaviour of HNet would be to generate a monophone recognition network since the dictionary would be closed. However, if FORCECXTEXP is set true and ALLOWXWRDEXP is set false then word internal context expansion will be performed. If FORCECXTEXP is set true and ALLOWXWRDEXP is set true then full cross-word context expansion will be performed.
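The decision described in this stage can be summarised as a small function. The Python sketch below is a paraphrase of the text, not HNet's code, and its names are hypothetical. Two points are assumptions because the text does not settle them: ALLOWCXTEXP being false is given precedence over FORCECXTEXP, and when the dictionary is not closed and cross-word expansion is not allowed, the sketch falls back to word internal expansion.

```python
def is_closed(dictionary, hmm_names):
    """True if every phone used in the dictionary names a model in the HMM list."""
    return all(p in hmm_names for pron in dictionary.values() for p in pron)


def choose_network_type(closed, word_internal_ok,
                        allow_cxt_exp=True, allow_xwrd_exp=False,
                        force_cxt_exp=False):
    """Return 'no expansion', 'word-internal' or 'cross-word'.

    closed           -- result of is_closed() for the dictionary and HMM list
    word_internal_ok -- True if word internal expansion finds every model
    The keyword defaults mirror the defaults stated in the text.
    """
    if not allow_cxt_exp:                 # ALLOWCXTEXP = false (assumed to win)
        return "no expansion"
    if force_cxt_exp:                     # FORCECXTEXP = true
        return "cross-word" if allow_xwrd_exp else "word-internal"
    if closed:                            # simplest network possible
        return "no expansion"
    if word_internal_ok or not allow_xwrd_exp:
        return "word-internal"            # second branch is an assumed fallback
    return "cross-word"


monophones = {"b", "i", "t", "u", "sil"}
lexicon = {"bit": ["b", "i", "t"], "but": ["b", "u", "t"]}
print(choose_network_type(is_closed(lexicon, monophones), True))
```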

3. Network expansion

Each word in the word network is transformed into a word-end node preceded by the sequence of model nodes corresponding to the word’s pronunciation. For cross word context expansion, the initial and final context dependent phones (and any preceding/following context independent ones) are duplicated as many times as is necessary to cater for each different cross word context. Each duplicated word-final phone is followed by a similarly duplicated word-end node. Null words are simply transformed into word-end nodes with no preceding model nodes.

4. Linking of models to network nodes

Each model node is linked to the corresponding HMM definition. In each case, the required HMM model name is determined from the phone name and the surrounding context names.

The algorithm used for this is

(a) Construct the context-dependent name and see if the corresponding model exists.

(b) Construct the context-independent name and see if the corresponding model exists.

If the configuration variable ALLOWCXTEXP is false (a) is skipped and if the configuration variable FORCECXTEXP is true (b) is skipped. If no matching model is found, an error is generated. When the right context is a boundary or FORCELEFTBI is true, then the context-dependent name takes the form of a left biphone, that is, the phone p with left context l becomes l-p. When the left context is a boundary or FORCERIGHTBI is true, then the context-dependent name takes the form of a right biphone, that is, the phone p with right context r becomes p+r. Otherwise, the context-dependent name is a full triphone, that is, l-p+r. Context-free phones are skipped in this process so


sil aa r sp y uw sp sil

would be expanded as

sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil

assuming that sil is context-independent and sp is context-free. For word-internal systems, the context expansion can be further controlled via the configuration variable CFWORDBOUNDARY.

When set true (default setting), context-free phones will be treated as word boundaries so

aa r sp y uw sp

would be expanded to

aa+r aa-r sp y+uw y-uw sp

Setting CFWORDBOUNDARY false would produce

aa+r aa-r+y sp r-y+uw y-uw sp
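The three expansions shown above can be reproduced with a short Python sketch. It is a simplified illustration, not HNet's implementation: the function names and the `cf_is_boundary` flag (standing in for CFWORDBOUNDARY) are hypothetical, the phone classes are passed in by hand, and the ends of the sequence are treated as boundaries.

```python
def neighbour(phones, i, step, context_free, cf_is_boundary):
    """Find the context phone to the left (step=-1) or right (step=+1) of position i.

    Context-free phones are skipped, or treated as a boundary if cf_is_boundary.
    Returns None when a boundary is reached.
    """
    j = i + step
    while 0 <= j < len(phones):
        if phones[j] in context_free:
            if cf_is_boundary:
                return None
            j += step
            continue
        return phones[j]
    return None


def expand(phones, context_free=(), context_indep=(), cf_is_boundary=False):
    """Rewrite context-dependent phones as l-p+r, l-p or p+r names."""
    out = []
    for i, p in enumerate(phones):
        if p in context_free or p in context_indep:
            out.append(p)                  # these keep their plain names
            continue
        left = neighbour(phones, i, -1, context_free, cf_is_boundary)
        right = neighbour(phones, i, +1, context_free, cf_is_boundary)
        name = p
        if left is not None:
            name = left + "-" + name
        if right is not None:
            name = name + "+" + right
        out.append(name)
    return out


cross_word = expand("sil aa r sp y uw sp sil".split(),
                    context_free={"sp"}, context_indep={"sil"})
internal_t = expand("aa r sp y uw sp".split(),
                    context_free={"sp"}, cf_is_boundary=True)
internal_f = expand("aa r sp y uw sp".split(),
                    context_free={"sp"}, cf_is_boundary=False)
for seq in (cross_word, internal_t, internal_f):
    print(" ".join(seq))
```

Keeping the neighbour search separate from the naming step makes the only difference between the two CFWORDBOUNDARY settings visible: whether a context-free phone ends the search or is stepped over.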

Note that in practice, stages (3) and (4) above actually proceed concurrently so that for the first and last phone of context-dependent models, logical models which have the same underlying physical model can be merged.
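The model lookup of stage 4 can be sketched in Python as follows. The function names are hypothetical and the code is a reading of the rules given in the text, not HNet's implementation. In particular, the order in which FORCELEFTBI and FORCERIGHTBI are tested when both are set, and the use of the plain phone name when no left context exists, are assumptions.

```python
def context_dependent_name(phone, left, right,
                           force_left_bi=False, force_right_bi=False):
    """Build l-p, p+r or l-p+r; a context of None denotes a boundary."""
    if right is None or force_left_bi:
        return f"{left}-{phone}" if left is not None else phone
    if left is None or force_right_bi:
        return f"{phone}+{right}"
    return f"{left}-{phone}+{right}"


def find_model(phone, left, right, hmm_names,
               allow_cxt_exp=True, force_cxt_exp=False,
               force_left_bi=False, force_right_bi=False):
    """Return the model name for a phone in context, or raise LookupError."""
    if allow_cxt_exp:                      # step (a): context-dependent name
        name = context_dependent_name(phone, left, right,
                                      force_left_bi, force_right_bi)
        if name in hmm_names:
            return name
    if not force_cxt_exp:                  # step (b): context-independent name
        if phone in hmm_names:
            return phone
    raise LookupError(f"no model for {phone!r} with context ({left!r}, {right!r})")


hmms = {"b+i", "b-i+t", "i-t", "b-i", "i"}
print(find_model("i", "b", "t", hmms))
print(find_model("i", "b", "t", hmms, allow_cxt_exp=False))
```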

[Figure: model nodes and word-end nodes of the expanded Bit-But network]

Fig. 12.8 Monophone Expansion of Bit-But Network

Having described the expansion process in some detail, some simple examples will help clarify the process. All of these are based on the Bit-But word network illustrated in Fig. 12.2. Firstly, assume that the dictionary contains simple monophone pronunciations, that is

bit b i t

but b u t

start sil

end sil

and the HMM set consists of just monophones

b i t u sil

In this case, HNet will find a closed dictionary. There will be no expansion and it will directly generate the network shown in Fig. 12.8. In this figure, the rounded boxes represent model nodes and the square boxes represent word-end nodes.
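For the closed monophone case, the substitution of pronunciations for words can be sketched as below. This Python sketch only builds the chain of nodes for each word (model nodes followed by a word-end node); it does not model the links between words, and the function name and tuple representation are hypothetical.

```python
def expand_word(word, dictionary):
    """Return the node chain for one word: model nodes, then a word-end node.

    A null word (empty pronunciation) yields only its word-end node.
    """
    nodes = [("model", phone) for phone in dictionary.get(word, [])]
    nodes.append(("word-end", word))
    return nodes


# Monophone dictionary from the example above.
lexicon = {
    "bit": ["b", "i", "t"],
    "but": ["b", "u", "t"],
    "start": ["sil"],
    "end": ["sil"],
}

for w in ("start", "bit", "but", "end"):
    print(w, expand_word(w, lexicon))
```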

Similarly, if the dictionary contained word-internal triphone pronunciations such as

bit b+i b-i+t i-t

but b+u b-u+t u-t

start sil

end sil

and the HMM set contains all the required models

b+i b-i+t i-t b+u b-u+t u-t sil
