5.15 Summary
This section summarises the various file formats, parameter kinds, qualifiers and configuration parameters used by HTK. Table 5.1 lists the audio speech file formats which can be read by the HWave module. Table 5.2 lists the basic parameter kinds supported by the HParm module and Fig. 5.8 shows the various automatic conversions that can be performed by appropriate choice of source and target parameter kinds. Table 5.3 lists the available qualifiers for parameter kinds.
The qualifiers other than C and K are used to describe the target kind. The source kind may already have some of these; HParm adds the rest as needed. Note that HParm can also delete qualifiers when converting from source to target. The C and K qualifiers in Table 5.3 are only used in external files to indicate compression and an attached checksum. HParm adds these qualifiers to the target form during output and only in response to setting the configuration parameters SAVECOMPRESSED and SAVEWITHCRC. Adding the C or K qualifiers to the target kind simply causes an error. Finally, Tables 5.4 and 5.5 list all of the configuration parameters along with their meanings and default values.
Name Description
HTK The standard HTK file format
TIMIT As used in the original prototype TIMIT CD-ROM
NIST The standard SPHERE format used by the US NIST
SCRIBE Subset of the European SAM standard used in the SCRIBE CD-ROM
SDES1 The Sound Designer 1 format defined by Digidesign Inc.
AIFF Audio interchange file format
SUNAU8 Subset of 8bit ".au" and ".snd" formats used by Sun and NeXT
OGI Format used by Oregon Graduate Institute similar to TIMIT
WAV Microsoft WAVE files used on PCs
ESIG Entropic Esignal file format
AUDIO Pseudo format to indicate direct audio input
ALIEN Pseudo format to indicate unsupported file, the alien header size must be set via the environment variable HDSIZE
NOHEAD As for the ALIEN format but header size is zero
Table. 5.1 Supported File Formats
Kind Meaning
WAVEFORM scalar samples (usually raw speech data)
LPC linear prediction coefficients
LPREFC linear prediction reflection coefficients
LPCEPSTRA LP derived cepstral coefficients
LPDELCEP LP cepstra + delta coef (obsolete)
IREFC LPREFC stored as 16bit (short) integers
MFCC mel-frequency cepstral coefficients
FBANK log filter-bank parameters
MELSPEC linear filter-bank parameters
USER user defined parameters
DISCRETE vector quantised codebook symbols
ANON matches actual parameter kind
Table. 5.2 Supported Parameter Kinds
Qualifier Meaning
A Acceleration coefficients appended
C External form is compressed
D Delta coefficients appended
E Log energy appended
K External form has checksum appended
N Absolute log energy suppressed
V VQ index appended
Z Cepstral mean subtracted
0 Cepstral C0 coefficient appended
Table. 5.3 Parameter Kind Qualifiers
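As an illustration of how the base kinds in Table 5.2 combine with the qualifiers in Table 5.3, the following Python sketch splits a kind string into its base kind and its qualifiers. It assumes the underscore-separated spelling of qualified kinds (for example MFCC_E_D_A). The function name parse_kind and its is_target argument are hypothetical and are not part of HTK; the sketch only mirrors the rule that C and K are rejected when they appear in a target kind.

```python
# Base parameter kinds from Table 5.2.
BASE_KINDS = {
    "WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
    "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE", "ANON",
}

# Qualifiers from Table 5.3.
QUALIFIERS = {
    "A": "acceleration coefficients appended",
    "C": "external form is compressed",
    "D": "delta coefficients appended",
    "E": "log energy appended",
    "K": "external form has checksum appended",
    "N": "absolute log energy suppressed",
    "V": "VQ index appended",
    "Z": "cepstral mean subtracted",
    "0": "cepstral C0 coefficient appended",
}

# Qualifiers that only describe external files, never a target kind.
EXTERNAL_ONLY = {"C", "K"}


def parse_kind(kind, is_target=False):
    """Split a kind such as 'MFCC_E_D_A' into ('MFCC', ['E', 'D', 'A']).

    Hypothetical helper: raises ValueError for an unknown base kind or
    qualifier, and for C or K when is_target is True.
    """
    base, *quals = kind.upper().split("_")
    if base not in BASE_KINDS:
        raise ValueError(f"unknown parameter kind: {base}")
    for q in quals:
        if q not in QUALIFIERS:
            raise ValueError(f"unknown qualifier: {q}")
        if is_target and q in EXTERNAL_ONLY:
            raise ValueError(f"qualifier {q} is not allowed in a target kind")
    return base, quals


if __name__ == "__main__":
    base, quals = parse_kind("MFCC_E_D_A", is_target=True)
    for q in quals:
        print(base, q, QUALIFIERS[q])
```

In this sketch a source kind read from an external file may carry C or K, so the check is applied only when is_target is set.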
Module Name Default Description
HAudio LINEIN T Select line input for audio
HAudio MICIN F Select microphone input for audio
HAudio LINEOUT T Select line output for audio
HAudio SPEAKEROUT F Select speaker output for audio
HAudio PHONESOUT T Select headphones output for audio
SOURCEKIND ANON Parameter kind of source
SOURCEFORMAT HTK File format of source
SOURCERATE 0.0 Sample period of source in 100ns units
HWave NSAMPLES Num samples in alien file input via a pipe
HWave HEADERSIZE Size of header in an alien file
HWave STEREOMODE Select channel: RIGHT or LEFT
HWave BYTEORDER Define byte order VAX or other
NATURALREADORDER F Enable natural read order for HTK files
NATURALWRITEORDER F Enable natural write order for HTK files
TARGETKIND ANON Parameter kind of target
TARGETFORMAT HTK File format of target
TARGETRATE 0.0 Sample period of target in 100ns units
HParm SAVECOMPRESSED F Save the output file in compressed form
HParm SAVEWITHCRC T Attach a checksum to output parameter file
HParm ADDDITHER 0.0 Level of noise added to input signal
HParm ZMEANSOURCE F Zero mean source waveform before analysis
HParm WINDOWSIZE 256000.0 Analysis window size in 100ns units
HParm USEHAMMING T Use a Hamming window
HParm PREEMCOEF 0.97 Set pre-emphasis coefficient
HParm LPCORDER 12 Order of LPC analysis
HParm NUMCHANS 20 Number of filterbank channels
HParm LOFREQ -1.0 Low frequency cut-off in fbank analysis
HParm HIFREQ -1.0 High frequency cut-off in fbank analysis
HParm USEPOWER F Use power not magnitude in fbank analysis
HParm NUMCEPS 12 Number of cepstral parameters
HParm CEPLIFTER 22 Cepstral liftering coefficient
HParm ENORMALISE T Normalise log energy
HParm ESCALE 0.1 Scale log energy
HParm SILFLOOR 50.0 Energy silence floor (dB)
HParm DELTAWINDOW 2 Delta window size
HParm ACCWINDOW 2 Acceleration window size
HParm VQTABLE NULL Name of VQ table
HParm SAVEASVQ F Save only the VQ indices
HParm AUDIOSIG 0 Audio signal number for remote control
Table. 5.4 Configuration Parameters
Module Name Default Description
HParm USESILDET F Enable speech/silence detector
HParm MEASURESIL T Measure background noise level prior to sampling
HParm OUTSILWARN T Print a warning message to stdout before measuring audio levels
HParm SPEECHTHRESH 9.0 Threshold for speech above silence level (dB)
HParm SILENERGY 0.0 Average background noise level (dB)
HParm SPCSEQCOUNT 10 Window over which speech/silence decision reached
HParm SPCGLCHCOUNT 0 Maximum number of frames marked as silence in window which is classified as speech whilst expecting start of speech
HParm SILSEQCOUNT 100 Number of frames classified as silence needed to mark end of utterance
HParm SILGLCHCOUNT 2 Maximum number of frames marked as silence in window which is classified as speech whilst expecting silence
HParm SILMARGIN 40 Number of extra frames included before and after start and end of speech marks from the speech/silence detector
HParm V1COMPAT F Set Version 1.5 compatibility mode
TRACE 0 Trace setting
Table. 5.5 Configuration Parameters (cont)
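Several of the parameters above (SOURCERATE, TARGETRATE and WINDOWSIZE) are expressed in 100ns units, which is easy to misread. The short Python sketch below shows the arithmetic under that single assumption; the helper names are hypothetical and are not part of HTK. For instance, the default WINDOWSIZE of 256000.0 corresponds to 25.6 ms, and a 16 kHz sampling rate corresponds to a sample period of 625 units.

```python
# One HTK time unit is 100 ns, so there are 10,000 units per millisecond
# and 10,000,000 units per second.
UNITS_PER_MS = 10_000.0
UNITS_PER_SECOND = 10_000_000.0


def htk_units_to_ms(units):
    """Convert a duration in 100ns units to milliseconds."""
    return units / UNITS_PER_MS


def ms_to_htk_units(ms):
    """Convert a duration in milliseconds to 100ns units."""
    return ms * UNITS_PER_MS


def sample_rate_to_htk_period(rate_hz):
    """Convert a sampling rate in Hz to a sample period in 100ns units."""
    return UNITS_PER_SECOND / rate_hz


if __name__ == "__main__":
    print("WINDOWSIZE 256000.0 in ms:", htk_units_to_ms(256000.0))
    print("10 ms frame shift in units:", ms_to_htk_units(10.0))
    print("16 kHz sample period in units:", sample_rate_to_htk_period(16000.0))
```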
Chapter 6
Transcriptions and Label Files
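[Figure: HTK software architecture. An HTK tool is surrounded by the library modules HAudio, HWave, HParm, HVQ, HSigP, HMem, HMath, HShell, HGraf, HUtil, HLabel, HLM, HNet, HDict, HModel, HTrain, HFB, HAdapt and HRec, which handle speech data, labels, language models, lattices and constraint networks, dictionaries, HMM definitions, terminal and graphical I/O, adaptation and model training.]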
Many of the operations performed by HTK which involve speech data files assume that the speech is divided into segments and each segment has a name or label. The set of labels associated with a speech file constitutes a transcription and each transcription is stored in a separate label file.
Typically, the name of the label file will be the same as the corresponding speech file but with a different extension. For convenience, label files are often stored in a separate directory and all HTK tools have an option to specify this. When very large numbers of files are being processed, label file access can be greatly facilitated by using Master Label Files (MLFs). MLFs may be regarded as index files holding pointers to the actual label files, which can either be embedded in the same index file or stored anywhere else in the file system. Thus, MLFs allow large sets of files to be stored in a single file, allow a single transcription to be shared by many logical label files, and allow arbitrary file redirection.
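The naming convention and the index idea described above can be pictured with a small conceptual model. The Python sketch below is only an illustration and does not reproduce the MLF file syntax or the HLabel interface: the names label_path_for and LabelIndex, the "lab" extension, and the use of shell-style wildcard patterns are all assumptions made for the example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def label_path_for(speech_file, label_dir=None, ext="lab"):
    """Derive a label file name: same base name, different extension,
    optionally placed in a separate label directory (assumed convention)."""
    path = PurePosixPath(speech_file).with_suffix("." + ext)
    if label_dir is not None:
        path = PurePosixPath(label_dir) / path.name
    return str(path)


class LabelIndex:
    """Conceptual stand-in for a master label file: an ordered list of
    patterns, each pointing at an embedded transcription or a directory."""

    def __init__(self):
        self.entries = []  # (pattern, kind, payload), searched in order

    def embed(self, pattern, labels):
        # The transcription is stored inside the index itself.
        self.entries.append((pattern, "embedded", labels))

    def redirect(self, pattern, directory):
        # The transcription lives in a real file under another directory.
        self.entries.append((pattern, "redirect", directory))

    def resolve(self, label_file):
        for pattern, kind, payload in self.entries:
            if fnmatchcase(label_file, pattern):
                if kind == "embedded":
                    return ("embedded", payload)
                name = PurePosixPath(label_file).name
                return ("file", str(PurePosixPath(payload) / name))
        # No entry matched: fall back to the label file as named.
        return ("file", label_file)


if __name__ == "__main__":
    index = LabelIndex()
    index.embed("*/s1.lab", ["sil", "one", "sil"])  # shared by any */s1.lab
    index.redirect("*", "/labels")                  # everything else
    for speech in ["data/s1.wav", "more/s1.wav", "data/s2.wav"]:
        print(speech, index.resolve(label_path_for(speech)))
```

The three behaviours named in the text map onto the sketch as follows: embedded entries keep many transcriptions in one index, a wildcard pattern lets one transcription serve several logical label files, and a redirect entry sends lookups to another directory.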
The HTK interface to label files is provided by the module HLabel which implements the MLF facility and support for a number of external label file formats. All of the facilities supplied by HLabel, including the supported label file formats, are described in this chapter. In addition, HTK provides a tool called HLEd for simple batch editing of label files and this is also described.
Before proceeding to the details, however, the general structure of label files will be reviewed.
6.1 Label File Structure
Most transcriptions are single-alternative and single-level, that is to say, the associated speech file is described by a single sequence of labelled segments. Most standard label formats are of this kind. Sometimes, however, it is useful to have several levels of labels associated with the same basic segment sequence. For example, in training an HMM system it is useful to have both the word level transcriptions and the phone level transcriptions side-by-side.