Summary - The HTK Book

5.15 Summary

This section summarises the various file formats, parameter kinds, qualifiers and configuration parameters used by HTK. Table 5.1 lists the audio speech file formats which can be read by the HWave module. Table 5.2 lists the basic parameter kinds supported by the HParm module and Fig. 5.8 shows the various automatic conversions that can be performed by appropriate choice of source and target parameter kinds. Table 5.3 lists the available qualifiers for parameter kinds.

With the exception of C and K, these qualifiers are used to describe the target kind. The source kind may already have some of them, and HParm adds the rest as needed. Note that HParm can also delete qualifiers when converting from source to target. The C and K qualifiers are only used in external files to indicate compression and an attached checksum respectively. HParm adds these qualifiers to the target form during output, and only in response to setting the configuration parameters SAVECOMPRESSED and SAVEWITHCRC. Adding the C or K qualifiers to the target kind simply causes an error. Finally, Tables 5.4 and 5.5 list all of the configuration parameters along with their meaning and default values.

Name     Description
HTK      The standard HTK file format
TIMIT    As used in the original prototype TIMIT CD-ROM
NIST     The standard SPHERE format used by the US NIST
SCRIBE   Subset of the European SAM standard used in the SCRIBE CD-ROM
SDES1    The Sound Designer 1 format defined by Digidesign Inc.
AIFF     Audio interchange file format
SUNAU8   Subset of 8bit ".au" and ".snd" formats used by Sun and NeXT
OGI      Format used by Oregon Graduate Institute, similar to TIMIT
WAV      Microsoft WAVE files used on PCs
ESIG     Entropic Esignal file format
AUDIO    Pseudo format to indicate direct audio input
ALIEN    Pseudo format to indicate unsupported file; the alien header size must be set via the environment variable HDSIZE
NOHEAD   As for the ALIEN format but header size is zero

Table. 5.1 Supported File Formats
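Of the formats in Table 5.1, the native HTK format is simple enough to sketch directly. The Python below is a minimal illustration, not HTK code: it assumes the usual HTK layout of a 12-byte big-endian header (nSamples and sampPeriod as 4-byte integers, sampSize and parmKind as 2-byte integers) followed by 4-byte floats, and it ignores compression, checksums and the NATURALREADORDER/NATURALWRITEORDER options. The helper names pack_htk and unpack_htk are hypothetical.

```python
import struct

# Assumed HTK header: nSamples and sampPeriod as 4-byte ints, sampSize and
# parmKind as 2-byte ints, all big-endian, 12 bytes in total.
HEADER = struct.Struct(">iiHH")


def pack_htk(frames, samp_period, parm_kind):
    """Encode equal-length float vectors as an uncompressed HTK file image (hypothetical helper)."""
    dim = len(frames[0])
    body = b"".join(struct.pack(">%df" % dim, *fr) for fr in frames)
    # sampSize is the number of bytes per vector: 4 bytes per float component
    return HEADER.pack(len(frames), samp_period, 4 * dim, parm_kind) + body


def unpack_htk(data):
    """Decode bytes produced by pack_htk into (frames, samp_period, parm_kind)."""
    n_samples, samp_period, samp_size, parm_kind = HEADER.unpack_from(data, 0)
    dim = samp_size // 4
    frames = [list(struct.unpack_from(">%df" % dim, data, HEADER.size + i * samp_size))
              for i in range(n_samples)]
    return frames, samp_period, parm_kind
```

For example, `open("sp1.mfc", "wb").write(pack_htk(frames, 100000, 70))` would write frames with a 10 ms period tagged with code 70, which under the numbering assumed here is MFCC_E (6 for MFCC plus the _E bit, octal 100).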

Kind       Meaning
WAVEFORM   scalar samples (usually raw speech data)
LPC        linear prediction coefficients
LPREFC     linear prediction reflection coefficients
LPCEPSTRA  LP derived cepstral coefficients
LPDELCEP   LP cepstra + delta coef (obsolete)
IREFC      LPREFC stored as 16bit (short) integers
MFCC       mel-frequency cepstral coefficients
FBANK      log filter-bank parameters
MELSPEC    linear filter-bank parameters
USER       user defined parameters
DISCRETE   vector quantised codebook symbols
ANON       matches actual parameter kind

Table. 5.2 Supported Parameter Kinds


Qualifier  Meaning
A          Acceleration coefficients appended
C          External form is compressed
D          Delta coefficients appended
E          Log energy appended
K          External form has checksum appended
N          Absolute log energy suppressed
V          VQ index appended
Z          Cepstral mean subtracted
0          Cepstral C0 coefficient appended

Table. 5.3 Parameter Kind Qualifiers
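The string form of a parameter kind is a base name from Table 5.2 followed by underscore-separated qualifiers from Table 5.3, as in MFCC_E_D_A. The sketch below splits such a string, rejects C and K in a target kind in the way described above, and folds the remaining qualifiers into a numeric code. The numeric values (base kinds 0 to 10, qualifier bits from octal 100 upwards) are assumed from the HTK Book's description of the HTK file header; parse_kind and target_kind_code are hypothetical names.

```python
# Assumed numeric codes for the base kinds of Table 5.2 and the qualifier
# bit masks (octal) of Table 5.3.
BASE_KINDS = {"WAVEFORM": 0, "LPC": 1, "LPREFC": 2, "LPCEPSTRA": 3,
              "LPDELCEP": 4, "IREFC": 5, "MFCC": 6, "FBANK": 7,
              "MELSPEC": 8, "USER": 9, "DISCRETE": 10}
QUALIFIERS = {"E": 0o100, "N": 0o200, "D": 0o400, "A": 0o1000, "C": 0o2000,
              "Z": 0o4000, "K": 0o10000, "0": 0o20000, "V": 0o40000}
EXTERNAL_ONLY = {"C", "K"}  # compression and checksum: external files only


def parse_kind(kind):
    """Split e.g. 'MFCC_E_D_A' into ('MFCC', ['E', 'D', 'A'])."""
    base, *quals = kind.upper().split("_")
    if base not in BASE_KINDS and base != "ANON":
        raise ValueError("unknown parameter kind: " + base)
    for q in quals:
        if q not in QUALIFIERS:
            raise ValueError("unknown qualifier: _" + q)
    return base, quals


def target_kind_code(kind):
    """Numeric code for a target kind, rejecting _C and _K as described in the text."""
    base, quals = parse_kind(kind)
    if base == "ANON":
        raise ValueError("ANON has no fixed code; it matches the actual kind")
    bad = EXTERNAL_ONLY.intersection(quals)
    if bad:
        raise ValueError("not allowed in a target kind: " + ", ".join(sorted(bad)))
    code = BASE_KINDS[base]
    for q in quals:
        code |= QUALIFIERS[q]  # each qualifier sets one bit above the base code
    return code
```

With these assumed values, MFCC_E_D_A maps to 6 + 64 + 256 + 512 = 838, while MFCC_E_C raises an error because C may only appear in the external form.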

Module  Name               Default   Description
HAudio  LINEIN             T         Select line input for audio
HAudio  MICIN              F         Select microphone input for audio
HAudio  LINEOUT            T         Select line output for audio
HAudio  SPEAKEROUT         F         Select speaker output for audio
HAudio  PHONESOUT          T         Select headphones output for audio
        SOURCEKIND         ANON      Parameter kind of source
        SOURCEFORMAT       HTK       File format of source
        SOURCERATE         0.0       Sample period of source in 100ns units
HWave   NSAMPLES                     Num samples in alien file input via a pipe
HWave   HEADERSIZE                   Size of header in an alien file
HWave   STEREOMODE                   Select channel: RIGHT or LEFT
HWave   BYTEORDER                    Define byte order: VAX or other
        NATURALREADORDER   F         Enable natural read order for HTK files
        NATURALWRITEORDER  F         Enable natural write order for HTK files
        TARGETKIND         ANON      Parameter kind of target
        TARGETFORMAT       HTK       File format of target
        TARGETRATE         0.0       Sample period of target in 100ns units
HParm   SAVECOMPRESSED     F         Save the output file in compressed form
HParm   SAVEWITHCRC        T         Attach a checksum to output parameter file
HParm   ADDDITHER          0.0       Level of noise added to input signal
HParm   ZMEANSOURCE        F         Zero mean source waveform before analysis
HParm   WINDOWSIZE         256000.0  Analysis window size in 100ns units
HParm   USEHAMMING         T         Use a Hamming window
HParm   PREEMCOEF          0.97      Set pre-emphasis coefficient
HParm   LPCORDER           12        Order of LPC analysis
HParm   NUMCHANS           20        Number of filterbank channels
HParm   LOFREQ             -1.0      Low frequency cut-off in fbank analysis
HParm   HIFREQ             -1.0      High frequency cut-off in fbank analysis
HParm   USEPOWER           F         Use power not magnitude in fbank analysis
HParm   NUMCEPS            12        Number of cepstral parameters
HParm   CEPLIFTER          22        Cepstral liftering coefficient
HParm   ENORMALISE         T         Normalise log energy
HParm   ESCALE             0.1       Scale log energy
HParm   SILFLOOR           50.0      Energy silence floor (dB)
HParm   DELTAWINDOW        2         Delta window size
HParm   ACCWINDOW          2         Acceleration window size
HParm   VQTABLE            NULL      Name of VQ table
HParm   SAVEASVQ           F         Save only the VQ indices
HParm   AUDIOSIG           0         Audio signal number for remote control

Table. 5.4 Configuration Parameters
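Because SOURCERATE, TARGETRATE and WINDOWSIZE are all given in 100ns units, a little arithmetic is needed to go from familiar values such as a 16 kHz sampling rate or a 10 ms frame shift. The helpers below are a sketch with hypothetical names; the frame-count rule (a frame is produced only when a whole analysis window fits in the signal) is an assumption about HParm's behaviour, and config_lines simply writes the `NAME = value` form of an HTK configuration file.

```python
# Durations in HTK configuration parameters such as SOURCERATE, TARGETRATE
# and WINDOWSIZE are expressed in units of 100ns.
UNITS_PER_SECOND = 10_000_000


def rate_from_hz(sample_rate_hz):
    """Sample period in 100ns units, e.g. 16 kHz -> 625.0."""
    return UNITS_PER_SECOND / sample_rate_hz


def units_from_ms(ms):
    """Milliseconds converted to 100ns units, e.g. 10 ms -> 100000.0."""
    return ms * UNITS_PER_SECOND / 1000


def num_frames(n_samples, source_rate, target_rate, window_size):
    """Frame count, assuming a frame is emitted only when a whole window fits."""
    win = round(window_size / source_rate)    # window length in samples
    step = round(target_rate / source_rate)   # frame shift in samples
    return 0 if n_samples < win else (n_samples - win) // step + 1


def config_lines(params):
    """Render a dict as 'NAME = value' lines for an HTK configuration file."""
    return "\n".join(f"{name} = {value}" for name, value in params.items())


if __name__ == "__main__":
    print(config_lines({
        "SOURCERATE": rate_from_hz(16000),
        "TARGETKIND": "MFCC_0_D_A",
        "TARGETRATE": units_from_ms(10),
        "WINDOWSIZE": units_from_ms(25),
    }))
```

With these helpers, 16 kHz input gives SOURCERATE = 625.0, and a 25 ms window with a 10 ms shift gives WINDOWSIZE = 250000.0 and TARGETRATE = 100000.0. The default WINDOWSIZE of 256000.0 corresponds to 25.6 ms.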


Module  Name               Default   Description
HParm   USESILDET          F         Enable speech/silence detector
HParm   MEASURESIL         T         Measure background noise level prior to sampling
HParm   OUTSILWARN         T         Print a warning message to stdout before measuring audio levels
HParm   SPEECHTHRESH       9.0       Threshold for speech above silence level (dB)
HParm   SILENERGY          0.0       Average background noise level (dB)
HParm   SPCSEQCOUNT        10        Window over which speech/silence decision is reached
HParm   SPCGLCHCOUNT       0         Maximum number of frames marked as silence in window which is classified as speech whilst expecting start of speech
HParm   SILSEQCOUNT        100       Number of frames classified as silence needed to mark end of utterance
HParm   SILGLCHCOUNT       2         Maximum number of frames marked as silence in window which is classified as speech whilst expecting silence
HParm   SILMARGIN          40        Number of extra frames included before and after start and end of speech marks from the speech/silence detector
HParm   V1COMPAT           F         Set Version 1.5 compatibility mode
        TRACE              0         Trace setting

Table. 5.5 Configuration Parameters (cont)
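Several HParm parameters in Table 5.4 name standard signal-processing steps. The functions below sketch what PREEMCOEF, USEHAMMING, CEPLIFTER and DELTAWINDOW refer to, using the usual forms of first-order pre-emphasis, the Hamming window, sinusoidal cepstral liftering and regression-based deltas. They are illustrative Python with hypothetical names, not HParm's implementation; the treatment of the first sample in pre_emphasise and the replication of end frames in deltas are assumptions.

```python
import math


def pre_emphasise(s, k=0.97):
    """s'[n] = s[n] - k*s[n-1] with k = PREEMCOEF; first sample scaled by (1 - k) (assumption)."""
    return [s[0] * (1.0 - k)] + [s[n] - k * s[n - 1] for n in range(1, len(s))]


def hamming(N):
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def lifter(c, L=22):
    """Liftering c'[n] = (1 + L/2 * sin(pi*n/L)) * c[n] for n = 1..len(c), L = CEPLIFTER."""
    return [(1 + L / 2 * math.sin(math.pi * n / L)) * cn
            for n, cn in enumerate(c, start=1)]


def deltas(frames, theta=2):
    """Regression deltas over +/-theta frames (theta = DELTAWINDOW).

    d[t] = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2), for k = 1..theta.
    Frames beyond either end are replaced by the first or last frame (assumption).
    """
    T = len(frames)
    denom = 2 * sum(k * k for k in range(1, theta + 1))

    def frame(i):
        return frames[min(max(i, 0), T - 1)]

    return [[sum(k * (frame(t + k)[d] - frame(t - k)[d]) for k in range(1, theta + 1)) / denom
             for d in range(len(frames[0]))]
            for t in range(T)]
```

Acceleration coefficients (ACCWINDOW) can be obtained in the same spirit by applying deltas again to the delta vectors.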

Chapter 6

Transcriptions and Label Files

[Figure: HTK software architecture, showing the library modules HShell, HMem, HMath, HSigP, HAudio, HWave, HParm, HVQ, HLabel, HLM, HNet, HDict, HModel, HUtil, HTrain, HFB, HAdapt, HRec and HGraf around an HTK tool, with inputs and outputs for speech data, labels, HMM definitions, language models, constraint networks, lattices, the dictionary, adaptation, and terminal and graphical I/O]

Many of the operations performed by HTK which involve speech data files assume that the speech is divided into segments and each segment has a name or label. The set of labels associated with a speech file constitutes a transcription and each transcription is stored in a separate label file.

Typically, the name of the label file will be the same as the corresponding speech file but with a different extension. For convenience, label files are often stored in a separate directory and all HTK tools have an option to specify this. When very large numbers of files are being processed, label file access can be greatly facilitated by using Master Label Files (MLFs). MLFs may be regarded as index files holding pointers to the actual label files, which can either be embedded in the same index file or stored anywhere else in the file system. Thus, MLFs allow large sets of files to be stored in a single file, they allow a single transcription to be shared by many logical label files and they allow arbitrary file redirection.
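To make the MLF idea concrete, the sketch below handles only the simplest case: an MLF that starts with the `#!MLF!#` header and holds immediate transcriptions, each introduced by a quoted file-name pattern and terminated by a line containing a single full stop. Label lines are assumed to be `start end name` with times in 100ns units. Pattern matching uses Python's fnmatch as a stand-in for HTK's own wildcard matching, the helpers label_name, parse_mlf and lookup are hypothetical, and redirection to other files or directories is not handled.

```python
import fnmatch
import os


def label_name(speech_path, ext="lab"):
    """Label file name: same as the speech file but with a different extension."""
    return os.path.splitext(speech_path)[0] + "." + ext


def parse_mlf(text):
    """Parse an MLF of immediate transcriptions into {pattern: [(start, end, name), ...]}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")
    entries, i = {}, 1
    while i < len(lines):
        pattern = lines[i].strip('"')               # e.g. */sp1.lab
        labels, i = [], i + 1
        while i < len(lines) and lines[i] != ".":   # a lone '.' ends a transcription
            start, end, name = lines[i].split()[:3]
            labels.append((int(start), int(end), name))
            i += 1
        if i == len(lines):
            raise ValueError("transcription for %s not terminated by '.'" % pattern)
        entries[pattern] = labels
        i += 1                                      # skip the '.'
    return entries


def lookup(entries, label_path):
    """Return the first transcription whose pattern matches label_path."""
    for pattern, labels in entries.items():
        if fnmatch.fnmatchcase(label_path, pattern):
            return labels
    raise KeyError(label_path)
```

Here a single entry with the pattern `*/sp1.lab` serves a speech file such as data/train/sp1.wav wherever it lives, which is the sense in which an MLF acts as an index rather than a directory of separate files.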

The HTK interface to label files is provided by the module HLabel which implements the MLF facility and support for a number of external label file formats. All of the facilities supplied by HLabel, including the supported label file formats, are described in this chapter. In addition, HTK provides a tool called HLEd for simple batch editing of label files and this is also described.

Before proceeding to the details, however, the general structure of label files will be reviewed.

6.1 Label File Structure

Most transcriptions are single-alternative and single-level, that is to say, the associated speech file is described by a single sequence of labelled segments. Most standard label formats are of this kind. Sometimes, however, it is useful to have several levels of labels associated with the same basic segment sequence. For example, in training a HMM system it is useful to have both the word level transcriptions and the phone level transcriptions side-by-side.
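As an illustration of multiple levels, suppose each label line has the form `start end name [score] {auxname [auxscore]}`, with an auxiliary (higher-level) label written only on the line at which its segment starts. Under that assumption, the sketch below recovers either the phone level or the word level from the same lines. The function names are hypothetical, and any numeric token following a name is taken to be that name's score.

```python
def _is_number(tok):
    try:
        float(tok)
        return True
    except ValueError:
        return False


def parse_label_line(line):
    """Split 'start end name [score] {auxname [auxscore]}' into times and (name, score) per level."""
    fields = line.split()
    start, end, levels = int(fields[0]), int(fields[1]), []
    for tok in fields[2:]:
        if levels and levels[-1][1] is None and _is_number(tok):
            levels[-1] = (levels[-1][0], float(tok))   # score for the preceding name
        else:
            levels.append((tok, None))                 # label for the next level
    return start, end, levels


def level_segments(lines, level):
    """Collect (start, end, name) segments for one level; 0 is the primary level."""
    segs = []
    for line in lines:
        start, end, levels = parse_label_line(line)
        if level < len(levels):      # a label at this level starts a new segment
            segs.append([start, end, levels[level][0]])
        elif segs:                   # otherwise the open segment is extended
            segs[-1][1] = end
    return [tuple(s) for s in segs]
```

For instance, given the lines `0 2200000 sil SIL`, `2200000 2900000 dh THE` and `2900000 3600000 ax`, level 1 yields SIL from 0 to 2200000 and THE from 2200000 to 3600000, while level 0 keeps the three phone segments.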
