3.3 Creating Tied-State Triphones
where the edit script mktri.hed contains a clone command CL followed by TI commands to tie all of the transition matrices in each triphone set, that is:
CL triphones1
TI T_ah {(*-ah+*,ah+*,*-ah).transP}
TI T_ax {(*-ax+*,ax+*,*-ax).transP}
TI T_ey {(*-ey+*,ey+*,*-ey).transP}
TI T_b {(*-b+*,b+*,*-b).transP}
TI T_ay {(*-ay+*,ay+*,*-ay).transP}
...
The file mktri.hed can be generated using the Perl script maketrihed included in the HTKTutorial directory.
The clone command CL takes as its argument the name of the file containing the list of triphones (and biphones) generated above. For each model of the form a-b+c in this list, it looks for the monophone b and makes a copy of it. Each TI command takes as its arguments the name of a macro and a list of HMM components. The latter uses a notation which mimics the hierarchical structure of the HMM parameter set, in which the transition matrix transP can be regarded as a sub-component of each HMM. The items within brackets are patterns designed to match the set of triphones, right biphones and left biphones for each phone.
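The structure of mktri.hed is regular enough that it can be generated from a phone list. The following is a minimal Python sketch of that idea; it is not the code of the maketrihed script, and the function name make_tri_hed and its arguments are hypothetical.

```python
def make_tri_hed(phones, triphone_list="triphones1"):
    """Return the text of an HHEd script: one CL line, then one TI line per phone."""
    lines = [f"CL {triphone_list}"]
    for p in phones:
        # The three patterns match triphones (*-p+*), right biphones (p+*)
        # and left biphones (*-p) of phone p.
        lines.append(f"TI T_{p} {{(*-{p}+*,{p}+*,*-{p}).transP}}")
    return "\n".join(lines) + "\n"


# Phone list taken from the fragment shown above.
script = make_tri_hed(["ah", "ax", "ey", "b", "ay"])
print(script)
```

Each generated TI line names the macro after the centre phone, so every context-dependent variant of that phone ends up sharing one transition matrix.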
[Figure: on the left, the HMM definitions ~h "t-ah+p" and ~h "t-ah+b" each contain their own <transP> matrix (rows 0.0 1.0 0.0 .., 0.0 0.4 0.6 .., ..). On the right, after the HHEd tie command, each definition contains the reference ~t "T_ah" instead, and the macro ~t "T_ah" holds the single shared <transP> matrix.]
Fig. 3.12 Tying Transition Matrices
Up to now macros and tying have only been mentioned in passing. Although a full explanation must wait until chapter 7, a brief explanation is warranted here. Tying means that one or more HMMs share the same set of parameters. On the left side of Fig. 3.12, two HMM definitions are shown. Each HMM has its own individual transition matrix. On the right side, the effect of the first TI command in the edit script mktri.hed is shown. The individual transition matrices have been replaced by a reference to a macro called T_ah which contains a matrix shared by both models.
When reestimating tied parameters, the data which would have been used for each of the original untied parameters is pooled so that a much more reliable estimate can be obtained.
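The effect of pooling can be illustrated with a small Python sketch. It shows only the general idea (sum the accumulated statistics of the tied parameters, then normalise once); it is not HERest's implementation, and the counts are made-up values chosen for illustration.

```python
def pooled_row_estimate(count_rows):
    """Sum expected transition counts from several tied models, then normalise."""
    pooled = [sum(col) for col in zip(*count_rows)]
    total = sum(pooled)
    return [c / total for c in pooled]


# Illustrative (made-up) expected counts for one row of transP, i.e. the
# transitions leaving one state, accumulated separately for two triphones.
counts_t_ah_p = [0.0, 3.0, 1.0]    # rare triphone: very little data
counts_t_ah_b = [0.0, 57.0, 39.0]  # frequent triphone

shared_row = pooled_row_estimate([counts_t_ah_p, counts_t_ah_b])
print(shared_row)  # [0.0, 0.6, 0.4]
```

Estimated on its own, the rare triphone would give the row [0.0, 0.75, 0.25] from only four observations; the pooled estimate rests on all one hundred.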
Of course, tying could affect performance if performed indiscriminately. Hence, it is important to only tie parameters which have little effect on discrimination. This is the case here where the transition parameters do not vary significantly with acoustic context but nevertheless need to be estimated accurately. Some triphones will occur only once or twice and so very poor estimates would be obtained if tying was not done. These problems of data insufficiency will affect the output distributions too, but this will be dealt with in the next step.
Hitherto, all HMMs have been stored in text format and could be inspected like any text file. Now, however, the model files will be getting larger, and space and load/store times become an issue. For increased efficiency, HTK can store and load MMFs in binary format. Setting the standard -B option causes this to happen.
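[Figure: HLEd converts the monophone transcriptions (aligned.mlf) into triphone transcriptions (wintri.mlf). HHEd converts the monophones (hmm9) into triphones (hmm10). HERest (x2) then re-estimates these to give the triphones (hmm12) and the state occupation statistics (stats).]
Fig. 3.13 Step 9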
Once the context-dependent models have been cloned, the new triphone set can be re-estimated using HERest. This is done as previously except that the monophone model list is replaced by a triphone list and the triphone transcriptions are used in place of the monophone transcriptions.
For the final pass of HERest, the -s option should be used to generate a file of state occupation statistics called stats. In combination with the means and variances, these enable likelihoods to be calculated for clusters of states and are needed during the state-clustering process described below.
Fig. 3.13 illustrates this step of the HMM construction procedure. Re-estimation should again be done twice, so that the resultant model sets will ultimately be saved in hmm12.
HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
    -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
3.3.2 Step 10 - Making Tied-State Triphones
The outcome of the previous stage is a set of triphone HMMs with all triphones in a phone set sharing the same transition matrix. When estimating these models, many of the variances in the output distributions will have been floored since there will be insufficient data associated with many of the states. The last step in the model building process is to tie states within triphone sets in order to share data and thus be able to make robust parameter estimates.
In the previous step, the TI command was used to explicitly tie all members of a set of transition matrices together. However, the choice of which states to tie requires a bit more subtlety, since the performance of the recogniser depends crucially on how accurately the state output distributions capture the statistics of the speech data.
HHEd provides two mechanisms which allow states to be clustered and then each cluster tied.
The first is data-driven and uses a similarity measure between states. The second uses decision trees and is based on asking questions about the left and right contexts of each triphone. The decision tree attempts to find those contexts which make the largest difference to the acoustics and which should therefore distinguish clusters.
Decision tree state tying is performed by running HHEd in the normal way, i.e.
HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
    tree.hed triphones1 > log
Notice that the output is saved in a log file. This is important since some tuning of thresholds is usually needed.
The edit script tree.hed, which contains the instructions regarding which contexts to examine for possible clustering, can be rather long and complex. A script for automatically generating this file, mkclscript, is found in the RM Demo. A version of the tree.hed script which can be used with this tutorial is included in the HTKTutorial directory. Note that mkclscript is only capable of creating the TB commands (decision tree clustering of states); the questions (QS) still need to be defined by the user. There is, however, an example list of questions supplied with the RM Demo (lib/quests.hed) which may be suitable for some tasks, or at least useful as an example. The entire script appropriate for clustering English phone models is too long to show here, but its main components are given by the following fragments:
RO 100.0 stats
TR 0
QS "L_Class-Stop" {p-*,b-*,t-*,d-*,k-*,g-*}
QS "R_Class-Stop" {*+p,*+b,*+t,*+d,*+k,*+g}
QS "L_Nasal" {m-*,n-*,ng-*}
QS "R_Nasal" {*+m,*+n,*+ng}
QS "L_Glide" {y-*,w-*}
QS "R_Glide" {*+y,*+w}
....
QS "L_w" {w-*}
QS "R_w" {*+w}
QS "L_y" {y-*}
QS "R_y" {*+y}
QS "L_z" {z-*}
QS "R_z" {*+z}
TR 2
TB 350.0 "aa_s2" {(aa, *-aa, *-aa+*, aa+*).state[2]}
TB 350.0 "ae_s2" {(ae, *-ae, *-ae+*, ae+*).state[2]}
TB 350.0 "ah_s2" {(ah, *-ah, *-ah+*, ah+*).state[2]}
TB 350.0 "uh_s2" {(uh, *-uh, *-uh+*, uh+*).state[2]}
....
TB 350.0 "y_s4" {(y, *-y, *-y+*, y+*).state[4]}
TB 350.0 "z_s4" {(z, *-z, *-z+*, z+*).state[4]}
TB 350.0 "zh_s4" {(zh, *-zh, *-zh+*, zh+*).state[4]}
TR 1
AU "fulllist"
CO "tiedlist"
ST "trees"
Firstly, the RO command is used to set the outlier threshold to 100.0 and load the statistics file generated at the end of the previous step. The outlier threshold determines the minimum occupancy of any cluster and prevents a single outlier state forming a singleton cluster just because it is acoustically very different to all the other states. The TR command sets the trace level to zero in preparation for loading in the questions. Each QS command loads a single question and each question is defined by a set of contexts. For example, the first QS command defines a question called L_Class-Stop which is true if the left context is any of the stops p, b, t, d, k or g.
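How a question is answered for a given model name can be sketched in a few lines of Python. Here the standard library function fnmatchcase stands in for HTK's own pattern matcher, which is an assumption that holds only for the simple * wildcard used in these patterns; the helper name answers_yes is hypothetical.

```python
from fnmatch import fnmatchcase


def answers_yes(model_name, patterns):
    """True if the model name matches any context pattern of the question."""
    return any(fnmatchcase(model_name, pat) for pat in patterns)


# Pattern sets copied from the QS commands above.
L_CLASS_STOP = ["p-*", "b-*", "t-*", "d-*", "k-*", "g-*"]
R_NASAL = ["*+m", "*+n", "*+ng"]

print(answers_yes("t-ah+p", L_CLASS_STOP))  # True: left context is t
print(answers_yes("t-ah+p", R_NASAL))       # False: right context is p
print(answers_yes("ah+n", R_NASAL))         # True: right biphone followed by n
```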
[Figure: HHEd takes the triphones (hmm12), the state occupation statistics (stats) and the edit script (tree.hed), and produces tied-state triphones (hmm13) together with an HMM list (tiedlist). HERest (x2) then re-estimates these to give the tied-state triphones (hmm15).]
Fig. 3.14 Step 10
Notice that for a triphone system, it is necessary to include questions referring to both the right and left contexts of a phone. The questions should progress from wide, general classifications (such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone. Ideally, the full set of questions loaded using the QS command would include every possible context which can influence the acoustic realisation of a phone, and can include any linguistic or phonetic classification which may be relevant. There is no harm in creating extra unnecessary questions, because those which are determined to be irrelevant to the data will be ignored.
The second TR command enables intermediate level progress reporting so that each of the following TB commands can be monitored. Each of these TB commands clusters one specific set of states. For example, the first TB command applies to the first emitting state of all context-dependent models for the phone aa.
Each TB command works as follows. Firstly, each set of states defined by the final argument is pooled to form a single cluster. Each question in the question set loaded by the QS commands is then used to split the pool into two sets. Using two sets rather than one allows the log likelihood of the training data to be increased, and the question which maximises this increase is selected for the first branch of the tree. The process is then repeated until the increase in log likelihood achievable by any question at any node is less than the threshold specified by the first argument (350.0 in this case).
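The choice of a single split can be sketched as follows. This is a simplified illustration, not HHEd's code: it assumes one-dimensional, single-Gaussian states described only by an occupancy, a mean and a variance, it uses Python's fnmatchcase in place of HTK's pattern matcher, and the state statistics are made-up values. The names cluster_loglik and best_split are hypothetical.

```python
import math
from fnmatch import fnmatchcase


def cluster_loglik(states):
    """Approximate log likelihood of the pooled data under one shared Gaussian.

    Each state is a tuple (name, occupancy, mean, variance); 1-D for simplicity.
    """
    occ = sum(s[1] for s in states)
    mean = sum(s[1] * s[2] for s in states) / occ
    var = sum(s[1] * (s[3] + s[2] ** 2) for s in states) / occ - mean ** 2
    return -0.5 * occ * (math.log(2 * math.pi * var) + 1.0)


def best_split(states, questions, min_occ):
    """Return (gain, question name) for the question with the largest gain."""
    base = cluster_loglik(states)
    best = (0.0, None)
    for qname, patterns in questions.items():
        yes = [s for s in states if any(fnmatchcase(s[0], p) for p in patterns)]
        no = [s for s in states if s not in yes]
        if not yes or not no:
            continue  # the question does not divide this pool
        # In the spirit of the RO threshold: both halves need enough occupancy.
        if min(sum(s[1] for s in yes), sum(s[1] for s in no)) < min_occ:
            continue
        gain = cluster_loglik(yes) + cluster_loglik(no) - base
        if gain > best[0]:
            best = (gain, qname)
    return best


# Made-up statistics for state 2 of four aa triphones.
states = [("p-aa+t", 200.0, 0.0, 1.0), ("b-aa+t", 150.0, 0.2, 1.0),
          ("m-aa+t", 180.0, 3.0, 1.0), ("n-aa+t", 170.0, 3.1, 1.0)]
questions = {"L_Class-Stop": ["p-*", "b-*", "t-*", "d-*", "k-*", "g-*"],
             "R_Nasal": ["*+m", "*+n", "*+ng"],
             "L_p": ["p-*"]}

gain, question = best_split(states, questions, min_occ=100.0)
print(question, gain > 350.0)
```

In a full tree builder this step would be applied recursively to each resulting set, stopping when the best gain falls below the TB threshold.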
Note that the values given in the RO and TB commands affect the degree of tying and therefore the number of states output in the clustered system. The values should be varied according to the amount of training data available. As a final step to the clustering, any pair of clusters which can be merged such that the decrease in log likelihood is below the threshold is merged. On completion, the states in each cluster i are tied to form a single shared state with macro name xxx_i, where xxx is the name given by the second argument of the TB command.
The set of triphones used so far only includes those needed to cover the training data. The AU command takes as its argument a new list of triphones expanded to include all those needed for recognition. This list can be generated, for example, by using HDMan on the entire dictionary (not just the training dictionary), converting it to triphones using the command TC and outputting a list of the distinct triphones to a file using the option -n:
HDMan -n fulllist -g global.ded -l flog beep
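The kind of expansion involved can be illustrated with a short Python sketch that turns each pronunciation into word-internal context-dependent names and collects the distinct ones. It is an approximation of the idea only: it ignores any special treatment of silence models or context-free phones, the helper name is hypothetical, and the two dictionary entries are illustrative rather than taken from beep.

```python
def to_word_internal_triphones(phones):
    """Expand one word's phone sequence into word-internal model names."""
    names = []
    for i, p in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""
        right = f"+{phones[i + 1]}" if i < len(phones) - 1 else ""
        names.append(f"{left}{p}{right}")
    return names


# Hypothetical two-word dictionary; the pronunciations are illustrative only.
dictionary = {"SIT": ["s", "ih", "t"], "A": ["ax"]}

full_list = sorted({name
                    for pron in dictionary.values()
                    for name in to_word_internal_triphones(pron)})
print(full_list)
```

Word-initial and word-final phones become biphones, and a single-phone word stays a monophone, which is why the patterns in the edit scripts above cover all of these forms.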
The effect of the AU command is to use the decision trees to synthesise all of the new previously unseen triphones in the new list.
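Synthesising an unseen triphone amounts to answering the tree's questions from the model name alone. The following Python sketch shows this for one state. The tree shape and leaf names are hypothetical, the nested-tuple representation is chosen for illustration and is not the format written by the ST command, and fnmatchcase again stands in for HTK's pattern matcher.

```python
from fnmatch import fnmatchcase

QUESTIONS = {
    "L_Nasal": ["m-*", "n-*", "ng-*"],
    "R_Class-Stop": ["*+p", "*+b", "*+t", "*+d", "*+k", "*+g"],
}

# Hypothetical tree for state 2 of phone aa: (question, no_branch, yes_branch).
# A plain string is a leaf naming the tied state.
TREE_AA_S2 = ("L_Nasal", ("R_Class-Stop", "aa_s2_1", "aa_s2_2"), "aa_s2_3")


def find_tied_state(model_name, tree):
    """Walk the tree, answering each question from the model's context."""
    while not isinstance(tree, str):
        qname, no_branch, yes_branch = tree
        yes = any(fnmatchcase(model_name, p) for p in QUESTIONS[qname])
        tree = yes_branch if yes else no_branch
    return tree


# None of these triphones needs to have occurred in the training data.
print(find_tied_state("n-aa+k", TREE_AA_S2))  # aa_s2_3
print(find_tied_state("s-aa+k", TREE_AA_S2))  # aa_s2_2
print(find_tied_state("s-aa+r", TREE_AA_S2))  # aa_s2_1
```

Repeating the lookup for each emitting state gives a complete model for the new triphone built entirely from existing tied states.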
Once all state-tying has been completed and new models synthesised, some models may share exactly the same 3 states and transition matrices and are thus identical. The CO command is used
3.4 Recogniser Evaluation 39