Creating Tied-State Triphones


3.3 Creating Tied-State Triphones

Given a set of monophone HMMs, the final stage of model building is to create context-dependent triphone HMMs. This is done in two steps. Firstly, the monophone transcriptions are converted to triphone transcriptions and a set of triphone models are created by copying the monophones and re-estimating. Secondly, similar acoustic states of these triphones are tied to ensure that all state distributions can be robustly estimated.

3.3.1 Step 9 - Making Triphones from Monophones

Context-dependent triphones can be made by simply cloning monophones and then re-estimating using triphone transcriptions. The latter should be created first using HLEd because a side-effect is to generate a list of all the triphones for which there is at least one example in the training data.

That is, executing

HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf

will convert the monophone transcriptions in aligned.mlf to an equivalent set of triphone transcriptions in wintri.mlf. At the same time, a list of triphones is written to the file triphones1.

The edit script mktri.led contains the commands

WB sp

WB sil

TC

The two WB commands define sp and sil as word boundary symbols. These then block the addition of context in the TC command, which converts all phones (except word boundary symbols) to triphones. For example,

sil th ih s sp m ae n sp ...

becomes

sil th+ih th-ih+s ih-s sp m+ae m-ae+n ae-n sp ...

This style of triphone transcription is referred to as word internal. Note that some biphones will also be generated, as contexts at word boundaries will sometimes only include two phones.
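The word-internal expansion described above can be sketched in a few lines of Python. This is only an illustration of the labelling rule, not HLEd's implementation; the function name and its boundaries parameter are hypothetical.

```python
def to_word_internal_triphones(phones, boundaries=("sp", "sil")):
    """Expand monophone labels into word-internal triphone labels.

    Boundary symbols are copied unchanged and never supply context,
    mirroring the effect of the WB commands followed by TC.
    """
    out = []
    for i, phone in enumerate(phones):
        if phone in boundaries:
            out.append(phone)
            continue
        # Context is taken only from neighbours that are not boundaries.
        left = phones[i - 1] if i > 0 and phones[i - 1] not in boundaries else None
        right = (phones[i + 1]
                 if i + 1 < len(phones) and phones[i + 1] not in boundaries
                 else None)
        label = phone
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        out.append(label)
    return out


print(" ".join(to_word_internal_triphones("sil th ih s sp m ae n sp".split())))
```

Applied to the example sequence, this reproduces the transcription shown above, including the biphones th+ih and ih-s at the edges of the first word.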

The cloning of models can be done efficiently using the HMM editor HHEd:

HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1

where the edit script mktri.hed contains a clone command CL followed by TI commands to tie all of the transition matrices in each triphone set, that is:

CL triphones1

TI T_ah {(*-ah+*,ah+*,*-ah).transP}

TI T_ax {(*-ax+*,ax+*,*-ax).transP}

TI T_ey {(*-ey+*,ey+*,*-ey).transP}

TI T_b {(*-b+*,b+*,*-b).transP}

TI T_ay {(*-ay+*,ay+*,*-ay).transP}

...

The file mktri.hed can be generated using the Perl script maketrihed included in the HTKTutorial directory. When running the HHEd command you will get warnings about trying to tie transition matrices for the sil and sp models. Since neither model is context-dependent there aren’t actually any matrices to tie.

The clone command CL takes as its argument the name of the file containing the list of triphones (and biphones) generated above. For each model of the form a-b+c in this list, it looks for the monophone b and makes a copy of it. Each TI command takes as its argument the name of a macro and a list of HMM components. The latter uses a notation which attempts to mimic the hierarchical structure of the HMM parameter set in which the transition matrix transP can be regarded as a sub-component of each HMM. The items within the brackets are patterns designed to match the set of triphones, right biphones and left biphones for each phone.
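Because mktri.hed follows a fixed pattern, its contents can be produced mechanically from the monophone list. The following Python sketch shows one way to do so; it is not the maketrihed Perl script, and the function name and the skip parameter are hypothetical.

```python
def make_mktri_hed(monophones, triphone_list="triphones1", skip=()):
    """Return the lines of an edit script: one CL command, then one
    TI command per phone tying that phone's transition matrices."""
    lines = [f"CL {triphone_list}"]
    for phone in monophones:
        if phone in skip:
            continue
        # Patterns cover triphones, right biphones and left biphones.
        lines.append(
            f"TI T_{phone} {{(*-{phone}+*,{phone}+*,*-{phone}).transP}}"
        )
    return lines


for line in make_mktri_hed(["ah", "ax", "ey", "b", "ay"]):
    print(line)
```

Passing skip=("sil", "sp") leaves out the TI commands for the two context-independent models, which should avoid the warnings mentioned above; that behaviour is an assumption rather than something taken from the tutorial scripts.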


[Figure: on the left, the HMM definitions ~h "t-ah+p" and ~h "t-ah+b" each contain their own transition matrix; on the right, after the HHEd tie command, both definitions instead refer to the macro ~t "T_ah", which holds the single shared matrix.]

Fig. 3.12 Tying Transition Matrices

Up to now, macros and tying have only been mentioned in passing. Although a full explanation must wait until chapter 7, a brief explanation is warranted here. Tying means that two or more HMMs share the same set of parameters. On the left side of Fig. 3.12, two HMM definitions are shown. Each HMM has its own individual transition matrix. On the right side, the effect of the first TI command in the edit script mktri.hed is shown. The individual transition matrices have been replaced by a reference to a macro called T_ah which contains a matrix shared by both models.

When re-estimating tied parameters, the data which would have been used for each of the original untied parameters is pooled so that a much more reliable estimate can be obtained.

Of course, tying could affect performance if performed indiscriminately. Hence, it is important to only tie parameters which have little effect on discrimination. This is the case here where the transition parameters do not vary significantly with acoustic context but nevertheless need to be estimated accurately. Some triphones will occur only once or twice and so very poor estimates would be obtained if tying was not done. These problems of data insufficiency will affect the output distributions too, but this will be dealt with in the next step.
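The benefit of pooling can be seen in a small numerical sketch. The counts below are invented for illustration, and real re-estimation accumulates fractional occupation counts rather than integers; the function name is hypothetical.

```python
def transition_row(counts):
    """Normalise the transition counts out of one state into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical (stay, move on) counts for the same state of two triphones.
counts_t_ah_p = [1, 1]   # t-ah+p: seen only twice, so its own estimate is poor
counts_t_ah_b = [3, 5]   # t-ah+b

# With the matrices tied, the counts are pooled before normalising.
pooled = [a + b for a, b in zip(counts_t_ah_p, counts_t_ah_b)]
tied_row = transition_row(pooled)

print(transition_row(counts_t_ah_p), tied_row)
```

The untied estimate for t-ah+p rests on two observations, whereas the tied row is estimated from all ten.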

Hitherto, all HMMs have been stored in text format and could be inspected like any text file. Now, however, the model files will be getting larger, and space and load/store times become an issue. For increased efficiency, HTK can store and load MMFs in binary format. Setting the standard -B option causes this to happen.


[Figure: HLEd converts the monophone transcriptions (aligned.mlf) into triphone transcriptions (wintri.mlf); HHEd clones the monophones (hmm9) into triphones (hmm10); two passes of HERest then produce the triphones in hmm12 together with the state occupation statistics (stats).]

Fig. 3.13 Step 9

Once the context-dependent models have been cloned, the new triphone set can be re-estimated using HERest. This is done as previously except that the monophone model list is replaced by a triphone list and the triphone transcriptions are used in place of the monophone transcriptions.

For the final pass of HERest, the -s option should be used to generate a file of state occupation statistics called stats. In combination with the means and variances, these enable likelihoods to be calculated for clusters of states and are needed during the state-clustering process described below.

Fig. 3.13 illustrates this step of the HMM construction procedure. Re-estimation should again be done twice, so that the resultant model sets will ultimately be saved in hmm12.

HERest -B -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
    -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1
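The two passes differ only in their input and output directories and in the -s option on the final pass. The Python sketch below builds both argument lists without running anything. The first-pass command (hmm10 to hmm11) is inferred from the description above rather than quoted from it, and the helper name is hypothetical.

```python
def herest_passes(first_dir=10, passes=2, model_list="triphones1"):
    """Build HERest argument lists for successive re-estimation passes.

    Only the last pass requests the state occupation statistics file.
    """
    commands = []
    for k in range(passes):
        src = f"hmm{first_dir + k}"
        dst = f"hmm{first_dir + k + 1}"
        cmd = ["HERest", "-B", "-C", "config", "-I", "wintri.mlf",
               "-t", "250.0", "150.0", "1000.0"]
        if k == passes - 1:
            cmd += ["-s", "stats"]   # statistics needed later for clustering
        cmd += ["-S", "train.scp",
                "-H", f"{src}/macros", "-H", f"{src}/hmmdefs",
                "-M", dst, model_list]
        commands.append(cmd)
    return commands


for cmd in herest_passes():
    print(" ".join(cmd))
```

Each list could be handed to subprocess.run once the output directories exist; creating those directories is left out of the sketch.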

3.3.2 Step 10 - Making Tied-State Triphones

The outcome of the previous stage is a set of triphone HMMs with all triphones in a phone set sharing the same transition matrix. When estimating these models, many of the variances in the output distributions will have been floored since there will be insufficient data associated with many of the states. The last step in the model building process is to tie states within triphone sets in order to share data and thus be able to make robust parameter estimates.

In the previous step, the TI command was used to explicitly tie all members of a set of transition matrices together. However, the choice of which states to tie requires a bit more subtlety since the performance of the recogniser depends crucially on how accurately the state output distributions capture the statistics of the speech data.

HHEd provides two mechanisms which allow states to be clustered and then each cluster tied. The first is data-driven and uses a similarity measure between states. The second uses decision trees and is based on asking questions about the left and right contexts of each triphone. The decision tree attempts to find those contexts which make the largest difference to the acoustics and which should therefore distinguish clusters.

Decision tree state tying is performed by running HHEd in the normal way, i.e.

HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
    tree.hed triphones1 > log

Notice that the output is saved in a log file. This is important since some tuning of thresholds is usually needed.


The edit script tree.hed, which contains the instructions regarding which contexts to examine for possible clustering, can be rather long and complex. A script for automatically generating this file, mkclscript, is found in the RM Demo. A version of the tree.hed script, which can be used with this tutorial, is included in the HTKTutorial directory. Note that mkclscript is only capable of creating the TB commands (decision tree clustering of states). The questions (QS) still need to be defined by the user. There is, however, an example list of questions which may be suitable for some tasks (or at least useful as an example) supplied with the RM Demo (lib/quests.hed). The entire script appropriate for clustering English phone models is too long to show here; however, its main components are given by the following fragments:

RO 100.0 stats

TR 0

QS "L_Class-Stop" {p-*,b-*,t-*,d-*,k-*,g-*}

QS "R_Class-Stop" {*+p,*+b,*+t,*+d,*+k,*+g}

QS "L_Nasal" {m-*,n-*,ng-*}

QS "R_Nasal" {*+m,*+n,*+ng}

QS "L_Glide" {y-*,w-*}

QS "R_Glide" {*+y,*+w}

....

QS "L_w" {w-*}

QS "R_w" {*+w}

QS "L_y" {y-*}

QS "R_y" {*+y}

QS "L_z" {z-*}

QS "R_z" {*+z}

TR 2

TB 350.0 "aa_s2" {(aa, *-aa, *-aa+*, aa+*).state[2]}

TB 350.0 "ae_s2" {(ae, *-ae, *-ae+*, ae+*).state[2]}

TB 350.0 "ah_s2" {(ah, *-ah, *-ah+*, ah+*).state[2]}

TB 350.0 "uh_s2" {(uh, *-uh, *-uh+*, uh+*).state[2]}

....

TB 350.0 "y_s4" {(y, *-y, *-y+*, y+*).state[4]}

TB 350.0 "z_s4" {(z, *-z, *-z+*, z+*).state[4]}

TB 350.0 "zh_s4" {(zh, *-zh, *-zh+*, zh+*).state[4]}

TR 1

AU "fulllist"

CO "tiedlist"

ST "trees"

Firstly, the RO command is used to set the outlier threshold to 100.0 and load the statistics file generated at the end of the previous step. The outlier threshold determines the minimum occupancy of any cluster and prevents a single outlier state forming a singleton cluster just because it is acoustically very different to all the other states. The TR command sets the trace level to zero in preparation for loading in the questions. Each QS command loads a single question and each question is defined by a set of contexts. For example, the first QS command defines a question called L_Class-Stop which is true if the left context is any of the stops p, b, t, d, k or g.
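A question can be thought of as a set of wildcard patterns tested against a model name. The sketch below uses Python's fnmatch module as a stand-in for HTK's own pattern matcher, which may differ in detail; the function name is hypothetical, and the pattern lists are copied from the QS commands above.

```python
from fnmatch import fnmatchcase


def answers(patterns, model_name):
    """Return True if the model name matches any pattern of the question."""
    return any(fnmatchcase(model_name, p) for p in patterns)


L_CLASS_STOP = ["p-*", "b-*", "t-*", "d-*", "k-*", "g-*"]
R_NASAL = ["*+m", "*+n", "*+ng"]

print(answers(L_CLASS_STOP, "t-ah+p"))   # left context t is a stop
print(answers(L_CLASS_STOP, "m-ah+p"))   # left context m is not
print(answers(R_NASAL, "t-ah+n"))        # right context n is a nasal
```

A right biphone such as ah+p has no left context, so under this matching it answers no to every left-context question.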


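[Figure: HHEd takes the triphones (hmm12), the state occupation statistics (stats) and the edit script (tree.hed) and produces tied-state triphones (hmm13) together with an HMM list (tiedlist); two passes of HERest then produce the tied-state triphones in hmm15.]

Fig. 3.14 Step 10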

Notice that for a triphone system, it is necessary to include questions referring to both the right and left contexts of a phone. The questions should progress from wide, general classifications (such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone. Ideally, the full set of questions loaded using the QS command would include every possible context which can influence the acoustic realisation of a phone, and can include any linguistic or phonetic classification which may be relevant. There is no harm in creating extra unnecessary questions, because those which are determined to be irrelevant to the data will be ignored.

The second TR command enables intermediate level progress reporting so that each of the following TB commands can be monitored. Each of these TB commands clusters one specific set of states. For example, the first TB command applies to the first emitting state of all context-dependent models for the phone aa.
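Since every TB command has the same shape, the TB section of the script can be generated from a phone list. The Python sketch below is not mkclscript. The function name is hypothetical, the state-major ordering is inferred from the fragment shown earlier, and the inclusion of state 3 is an assumption based on three-emitting-state models.

```python
def make_tb_commands(phones, threshold=350.0, states=(2, 3, 4)):
    """Return one TB command per (state, phone) pair, grouped by state."""
    lines = []
    for state in states:
        for phone in phones:
            # The item list covers the monophone, biphones and triphones.
            lines.append(
                f'TB {threshold} "{phone}_s{state}" '
                f"{{({phone}, *-{phone}, *-{phone}+*, {phone}+*).state[{state}]}}"
            )
    return lines


for line in make_tb_commands(["aa", "ae", "ah"]):
    print(line)
```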

Each TB command works as follows. Firstly, each set of states defined by the final argument is pooled to form a single cluster. Each question in the question set loaded by the QS commands is used to split the pool into two sets. The use of two sets rather than one allows the log likelihood of the training data to be increased, and the question which maximises this increase is selected for the first branch of the tree. The process is then repeated until the increase in log likelihood achievable by any question at any node is less than the threshold specified by the first argument (350.0 in this case).
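The choice of the first question can be sketched as follows. The code assumes each state is summarised by an occupation count, a mean vector and a diagonal variance vector, and it scores a cluster with the usual single-Gaussian approximation; whether HHEd computes exactly this quantity is an assumption. All names and the toy statistics are hypothetical.

```python
import math
from fnmatch import fnmatchcase


def cluster_loglik(states):
    """Approximate log likelihood of the data if all states share one Gaussian.

    states: list of (occupancy, means, variances) tuples.
    """
    total = sum(occ for occ, _, _ in states)
    acc = 0.0
    for d in range(len(states[0][1])):
        mean = sum(occ * mu[d] for occ, mu, _ in states) / total
        second = sum(occ * (var[d] + mu[d] ** 2) for occ, mu, var in states) / total
        acc += math.log(2.0 * math.pi * (second - mean ** 2)) + 1.0
    return -0.5 * total * acc


def best_question(pool, questions, min_occ=0.0):
    """Return (question name, gain) for the split with the largest gain."""
    base = cluster_loglik(list(pool.values()))
    best = (None, 0.0)
    for name, patterns in questions.items():
        yes = [s for m, s in pool.items() if any(fnmatchcase(m, p) for p in patterns)]
        no = [s for m, s in pool.items() if not any(fnmatchcase(m, p) for p in patterns)]
        if not yes or not no:
            continue
        # Reject splits that would leave a cluster below the occupancy floor.
        if min(sum(s[0] for s in yes), sum(s[0] for s in no)) < min_occ:
            continue
        gain = cluster_loglik(yes) + cluster_loglik(no) - base
        if gain > best[1]:
            best = (name, gain)
    return best


pool = {  # toy one-dimensional statistics for state 2 of aa
    "m-aa+t": (50.0, [5.0], [1.0]), "n-aa+t": (40.0, [5.2], [1.0]),
    "y-aa+t": (60.0, [0.0], [1.0]), "w-aa+p": (30.0, [0.3], [1.0]),
    "t-aa+p": (45.0, [0.1], [1.0]),
}
questions = {"L_Nasal": ["m-*", "n-*", "ng-*"], "L_Glide": ["y-*", "w-*"]}
print(best_question(pool, questions))
```

In a full implementation the winning split would be kept only if its gain exceeds the TB threshold, and the procedure would then recurse on each of the two new clusters. The min_occ argument plays the role described for the RO outlier threshold.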

Note that the values given in the RO and TB commands affect the degree of tying and therefore the number of states output in the clustered system. The values should be varied according to the amount of training data available. As a final step to the clustering, any pair of clusters which can be merged such that the decrease in log likelihood is below the threshold is merged. On completion, the states in each cluster i are tied to form a single shared state with macro name xxx_i where xxx is the name given by the second argument of the TB command.

The set of triphones used so far only includes those needed to cover the training data. The AU command takes as its argument a new list of triphones expanded to include all those needed for recognition. This list can be generated, for example, by using HDMan on the entire dictionary (not just the training dictionary), converting it to triphones using the command TC and outputting a list of the distinct triphones to a file using the option -n:

HDMan -b sp -n fulllist -g global.ded -l flog beep-tri beep

The -b sp option specifies that the sp phone is used as a word boundary, and so is excluded from triphones. The effect of the AU command is to use the decision trees to synthesise all of the new previously unseen triphones in the new list.
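How a decision tree can supply a state for a triphone that never occurred in training can be sketched as a walk from the root to a leaf. The tree below is invented for illustration and does not come from any trained model; the tuple layout, the function name and the leaf names (which follow the xxx_i naming convention) are hypothetical, and fnmatch again stands in for HTK's pattern matching.

```python
from fnmatch import fnmatchcase

QUESTIONS = {
    "L_Nasal": ["m-*", "n-*", "ng-*"],
    "R_Class-Stop": ["*+p", "*+b", "*+t", "*+d", "*+k", "*+g"],
}

# Hypothetical tree for state 2 of aa: (question, yes branch, no branch).
# A string is a leaf holding the macro name of a tied state.
TREE_AA_S2 = ("L_Nasal",
              "aa_s2_1",
              ("R_Class-Stop", "aa_s2_2", "aa_s2_3"))


def lookup(tree, model_name):
    """Answer questions from the root down until a tied state is reached."""
    while not isinstance(tree, str):
        question, yes_branch, no_branch = tree
        matched = any(fnmatchcase(model_name, p) for p in QUESTIONS[question])
        tree = yes_branch if matched else no_branch
    return tree


for name in ["ng-aa+k", "s-aa+k", "s-aa+l"]:
    print(name, lookup(TREE_AA_S2, name))
```

Because the questions depend only on the left and right contexts, any triphone of aa reaches some leaf, whether or not it was present in the training data.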

In the document: The HTK Book (pages 47-52)