each HTK tool being introduced (chapter 17), so that all command line options and arguments are clearly understood.
3.1 Data Preparation
The first stage of any recogniser development project is data preparation. Speech data is needed both for training and for testing. In the system to be built here, all of this speech will be recorded from scratch, and to do this, scripts are needed to prompt for each sentence. In the case of the test data, these prompt scripts will also provide the reference transcriptions against which the recogniser’s performance can be measured, and a convenient way to create them is to use the task grammar as a random generator. In the case of the training data, the prompt scripts will be used in conjunction with a pronunciation dictionary to provide the initial phone level transcriptions needed to start the HMM training process. Since the application requires that arbitrary names can be added to the recogniser, training data with good phonetic balance and coverage is needed. Here, for convenience, the prompt scripts needed for training are taken from the TIMIT acoustic-phonetic database.
It follows from the above that before the data can be recorded, a phone set must be defined, a dictionary must be constructed to cover both training and testing, and a task grammar must be defined.
3.1.1 Step 1 - the Task Grammar
The goal of the system to be built here is to provide a voice-operated interface for phone dialling.
Thus, the recogniser must handle digit strings and also personal name lists. Examples of typical inputs might be
Dial three three two six five four
Dial nine zero four one oh nine
Phone Woodland
Call Steve Young
HTK provides a grammar definition language for specifying simple task grammars such as this.
It consists of a set of variable definitions followed by a regular expression describing the words to recognise. For the voice dialling application, a suitable grammar might be
$digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
$name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
where the vertical bars denote alternatives, the square brackets denote optional items and the angle braces denote one or more repetitions. The complete grammar can be depicted as a network as shown in Fig. 3.1.
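To make the notation concrete, the grammar can be transcribed into an ordinary regular expression. The following is a minimal Python sketch, not a description of how HParse or the recogniser process the grammar; the names DIGIT, NAME, SENTENCE and accepts are hypothetical and are used only for illustration. Each vertical bar becomes a regex alternative, each square-bracketed item becomes an optional group, and the angle braces become a one-or-more quantifier.

```python
import re

# $digit: one alternative per digit word.
DIGIT = r"(?:ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|OH|ZERO)"

# $name: each square-bracketed first name becomes an optional group.
NAME = (
    r"(?:(?:JOOP )?JANSEN|(?:JULIAN )?ODELL|(?:DAVE )?OLLASON"
    r"|(?:PHIL )?WOODLAND|(?:STEVE )?YOUNG)"
)

# <$digit> means one or more repetitions, hence the "+" quantifier.
SENTENCE = re.compile(
    rf"SENT-START (?:DIAL(?: {DIGIT})+|(?:PHONE|CALL) {NAME}) SENT-END"
)


def accepts(words: str) -> bool:
    """Return True if the space-separated word string fits the grammar."""
    return SENTENCE.fullmatch(words) is not None


if __name__ == "__main__":
    print(accepts("SENT-START DIAL THREE THREE TWO SIX FIVE FOUR SENT-END"))
    print(accepts("SENT-START CALL STEVE YOUNG SENT-END"))
    print(accepts("SENT-START DIAL SENT-END"))  # DIAL needs at least one digit
```

With this sketch, the example inputs listed earlier are accepted, while a bare DIAL with no digits, or a first name without a surname, is rejected.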
[Figure: word network running from sent-start to sent-end, with one branch for dial followed by a loop over the digits, and one branch for phone or call followed by a name]
Fig. 3.1 Grammar for Voice Dialling
[Figure: HParse converts the grammar (gram) into the word net (wdnet)]
Fig. 3.2 Step 1
The above high level representation of a task grammar is provided for user convenience. The HTK recogniser actually requires a word network to be defined using a low level notation called HTK Standard Lattice Format (SLF), in which each word instance and each word-to-word transition is listed explicitly. This word network can be created automatically from the grammar above using the HParse tool. Thus, assuming that the file gram contains the above grammar, executing
HParse gram wdnet
will create an equivalent word network in the file wdnet (see Fig. 3.2).
3.1.2 Step 2 - the Dictionary
The first step in building a dictionary is to create a sorted list of the required words. In the telephone dialling task pursued here, it is quite easy to create a list of required words by hand. However, if the task were more complex, it would be necessary to build a word list from the sample sentences present in the training data. Furthermore, to build robust acoustic models, it is necessary to train them on a large set of sentences that contain many words and are preferably phonetically balanced. For these reasons, the training data will consist of English sentences unrelated to the telephone dialling task. Below, a short example of creating a word list from sentence prompts is given. As noted above, the training sentences given here are extracted from some prompts used with the TIMIT database, and for convenience they have been renumbered. For example, the first few items might be as follows
S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS
S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
S0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
S0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINE
etc
The desired training word list (wlist) could then be extracted automatically from these. Before using HTK, one would need to edit the text into a suitable format. For example, it would be necessary to change all white space to newlines and then to use the UNIX utilities sort and uniq to sort the words into a unique alphabetically ordered set, with one word per line. The script prompts2wlist from the HTKTutorial directory can be used for this purpose.
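The same extraction can be sketched in a few lines of Python. This is only an illustration of the idea, not the prompts2wlist script itself; the function name prompts_to_wlist is hypothetical, and the sketch assumes that the first field of each prompt line is an utterance label (such as S0001) that should not appear in the word list.

```python
def prompts_to_wlist(prompt_lines):
    """Return the sorted, unique words found in lines such as 'S0001 ONE VALIDATED ...'."""
    words = set()
    for line in prompt_lines:
        fields = line.split()
        # Assumption: the first field is the utterance label, not a word.
        words.update(fields[1:])
    # sorted() orders by character code, comparable to sort in the C locale.
    return sorted(words)


if __name__ == "__main__":
    prompts = [
        "S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS",
        "S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT",
    ]
    # Print one word per line, as expected for a word list file.
    print("\n".join(prompts_to_wlist(prompts)))
```

Collecting the words in a set plays the role of uniq, and the final sorted call plays the role of sort.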
The dictionary itself can be built from a standard source using HDMan. For this example, the British English BEEP pronouncing dictionary will be used [2]. Its phone set will be adopted without modification except that the stress marks will be removed and a short-pause (sp) will be added to the end of every pronunciation. If the dictionary contains any silence markers then the MP command will merge the sil and sp phones into a single sil. These changes can be applied using HDMan and an edit script (stored in global.ded) containing the three commands
AS sp
RS cmu
MP sil sil sp
where cmu refers to a style of stress marking in which the lexical stress level is marked by a single digit appended to the phone name (e.g. eh2 means the phone eh with level 2 stress).
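The combined effect of the three edit commands on a single pronunciation can be illustrated with a short Python sketch. It mimics only the behaviour described in the text and is not how HDMan is implemented; the function name apply_edits and the input pronunciations are hypothetical.

```python
import re


def apply_edits(phones):
    """Mimic the described effect of AS sp, RS cmu and MP sil sil sp on one pronunciation."""
    # AS sp: append a short pause to the end of the pronunciation.
    phones = list(phones) + ["sp"]
    # RS cmu: drop the single stress digit appended to a phone, e.g. eh2 -> eh.
    phones = [re.sub(r"\d$", "", p) for p in phones]
    # MP sil sil sp: merge an adjacent "sil sp" pair into a single sil.
    merged = []
    for p in phones:
        if p == "sp" and merged and merged[-1] == "sil":
            continue
        merged.append(p)
    return merged


if __name__ == "__main__":
    print(apply_edits(["s", "eh2", "v", "n"]))
    print(apply_edits(["sil"]))
```

For a stress-marked pronunciation the digit is removed and sp is appended, whereas a pronunciation consisting only of sil stays as a single sil rather than becoming sil sp.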
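[Figure: the TIMIT prompts, passed through sort | uniq and combined with the test vocabulary, give the word list (wlist); HDMan then uses wlist, the edit script (global.ded), the BEEP dictionary (beep) and the names dictionary (names) to produce the dictionary (dict)]
Fig. 3.3 Step 2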
The command
HDMan -m -w wlist -n monophones1 -l dlog dict beep names
will create a new dictionary called dict by searching the source dictionaries beep and names to find pronunciations for each word in wlist (see Fig. 3.3). Here, the wlist in question needs only to be a sorted list of the words appearing in the task grammar given above.
Note that names is a manually constructed file containing pronunciations for the proper names used in the task grammar. The option -l instructs HDMan to output a log file dlog which contains various statistics about the constructed dictionary. In particular, it indicates if there are words missing. HDMan can also output a list of the phones used, here called monophones1. Once training and test data has been recorded, an HMM will be estimated for each of these phones.
[2] Available by anonymous ftp from svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/beep.tar.gz. Note that items beginning with unmatched quotes, found at the start of the dictionary, should be removed.
The general format of each dictionary entry is
WORD [outsym] p1 p2 p3 ....
which means that the word WORD is pronounced as the sequence of phones p1 p2 p3 .... The string in square brackets specifies the string to output when that word is recognised. If it is omitted then the word itself is output. If it is included but empty, then nothing is output.
To see what the dictionary is like, here are a few entries.
A ah sp
A ax sp
A ey sp
CALL k ao l sp
DIAL d ay ax l sp
EIGHT ey t sp
PHONE f ow n sp
SENT-END [] sil
SENT-START [] sil
SEVEN s eh v n sp
TO t ax sp
TO t uw sp
ZERO z ia r ow sp
Notice that function words such as A and TO have multiple pronunciations. The entries for SENT-START and SENT-END have a silence model sil as their pronunciations and null output symbols.
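The entry format can be made concrete with a small Python sketch that splits each line into its word, output symbol and phone sequence. It handles only the format described above and assumes that the output symbol, if present, contains no spaces; it is not a substitute for HTK's own dictionary reader, and the names parse_entry and load_dictionary are hypothetical.

```python
def parse_entry(line):
    """Split 'WORD [outsym] p1 p2 ...' into (word, outsym, phones)."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    outsym = word  # default: the word itself is output
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        outsym = rest[0][1:-1]  # "[]" yields "", meaning nothing is output
        rest = rest[1:]
    return word, outsym, rest


def load_dictionary(lines):
    """Group pronunciations by word, keeping every alternative in order."""
    prons = {}
    for line in lines:
        if line.strip():
            word, _, phones = parse_entry(line)
            prons.setdefault(word, []).append(phones)
    return prons


if __name__ == "__main__":
    entries = ["CALL k ao l sp", "SENT-END [] sil", "TO t ax sp", "TO t uw sp"]
    for entry in entries:
        print(parse_entry(entry))
    print(load_dictionary(entries)["TO"])
```

Keeping a list of phone sequences per word reflects the fact that words such as A and TO have more than one pronunciation.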
3.1.3 Step 3 - Recording the Data
The training and test data will be recorded using the HTK tool HSLab. This is a combined waveform recording and labelling tool. In this example HSLab will be used just for recording, as labels already exist. However, if you do not have pre-existing training sentences (such as those from the TIMIT database) you can create them either from pre-existing text (as described above) or by labelling your training utterances using HSLab. HSLab is invoked by typing
HSLab noname
This will cause a window to appear with a waveform display area in the upper half and a row of buttons, including a record button, in the lower half. When the name of a normal file is given as argument, HSLab displays its contents. Here, the special file name noname indicates that new data is to be recorded. HSLab makes no special provision for prompting the user. However, each time the record button is pressed, it writes the subsequent recording to one of two files, noname_0. and noname_1., used alternately. Thus, it is simple to write a shell script which, for each successive line of a prompt file, outputs the prompt, waits for either noname_0. or noname_1. to appear, and then renames the file to the label that precedes the prompt (see Fig. 3.4).
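The prompting loop described above could be sketched as follows, here in Python rather than as a shell script. This is a minimal sketch under stated assumptions: the function names collect_recording and run_prompts are hypothetical, the names of the files to watch are passed in by the caller, the suffix added to each label is a free parameter, and no attempt is made to check that HSLab has finished writing a file before it is renamed.

```python
import os
import time


def collect_recording(directory, candidates, target, poll_seconds=0.5, timeout=None):
    """Wait until one of the candidate files exists, then rename it to target."""
    start = time.monotonic()
    while True:
        for name in candidates:
            path = os.path.join(directory, name)
            if os.path.exists(path):
                new_path = os.path.join(directory, target)
                os.rename(path, new_path)
                return new_path
        if timeout is not None and time.monotonic() - start > timeout:
            raise TimeoutError("no recording appeared")
        time.sleep(poll_seconds)


def run_prompts(prompt_lines, directory, candidates, suffix=""):
    """Show each prompt in turn and file the resulting recording under its label."""
    saved = []
    for line in prompt_lines:
        # The label is the first field of the prompt line, e.g. S0001.
        label, _, text = line.strip().partition(" ")
        print(f"{label}: {text}")
        saved.append(collect_recording(directory, candidates, label + suffix))
    return saved
```

In use, candidates would be the two file names that HSLab writes alternately, and prompt_lines would be read from the prompt file, so that each recording ends up named after the label of the sentence that was spoken.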
While the prompts for the training sentences were already provided above, the prompts for the test sentences need to be generated before they can be recorded. The tool HSGen can be used to do this by randomly traversing a word network and outputting each word encountered. For example, typing
HSGen -l -n 200 wdnet dict > testprompts
would generate 200 numbered test utterances, the first few of which would look something like: