
Storage of Parameter Files

From the document The HTK Book (pages 68-71)

the window size is set by ACCWINDOW. Since equation 5.16 relies on past and future speech parameter values, some modification is needed at the beginning and end of the speech. The default behaviour is to replicate the first or last vector as needed to fill the regression window.
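The default end-effect handling can be sketched as follows. Equation 5.16 is not reproduced in this excerpt, so the sketch assumes the usual regression form, d_t = Σ θ (c_{t+θ} − c_{t−θ}) / (2 Σ θ²) with θ running from 1 to Θ. The function name `regression_deltas` and the scalar-sequence input are illustrative choices, not HTK code.

```python
def regression_deltas(c, theta=2):
    """Delta coefficients by regression over a window of +/- theta frames.

    `c` is a sequence of scalar parameter values, one per frame. Frames
    outside the utterance are filled by replicating the first or last
    value, mirroring the default behaviour described in the text.
    """
    T = len(c)
    denom = 2 * sum(k * k for k in range(1, theta + 1))

    def at(t):
        # Clamp the index so out-of-range frames repeat the edge value.
        return c[min(max(t, 0), T - 1)]

    return [
        sum(k * (at(t + k) - at(t - k)) for k in range(1, theta + 1)) / denom
        for t in range(T)
    ]
```

On a linear ramp the interior deltas equal the slope, while the replicated edges pull the first and last values towards zero.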

In HTK version 1.5 and earlier, this end-effect problem was solved by using simple first-order differences at the start and end of the speech, that is

$$d_t = c_{t+1} - c_t, \quad t < \Theta \qquad (5.17)$$

and

$$d_t = c_t - c_{t-1}, \quad t \geq T - \Theta \qquad (5.18)$$

where T is the length of the data file. If required, this older behaviour can be restored by setting the configuration variable V1COMPAT to true in HParm.

For some purposes, it is useful to use simple differences throughout. This can be achieved by setting the configuration variable SIMPLEDIFFS to true in HParm. In this case, just the end-points of the delta window are used, i.e.

$$d_t = \frac{c_{t+\Theta} - c_{t-\Theta}}{2\Theta} \qquad (5.19)$$
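Equation 5.19 is small enough to sketch directly. The version below assumes that the default edge replication still applies at the start and end of the utterance when SIMPLEDIFFS is set; the text does not state this explicitly. The name `simple_deltas` is illustrative.

```python
def simple_deltas(c, theta=2):
    """Simple differences using only the end-points of the delta window.

    Implements d_t = (c[t+theta] - c[t-theta]) / (2*theta) for a sequence
    of scalar values `c`. Edge replication at the boundaries is an
    assumption made for this sketch.
    """
    T = len(c)

    def at(t):
        # Repeat the first or last value outside the utterance.
        return c[min(max(t, 0), T - 1)]

    return [(at(t + theta) - at(t - theta)) / (2 * theta) for t in range(T)]
```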

When delta and acceleration coefficients are requested, they are computed for all static parameters including energy if present. In some applications, the absolute energy is not useful but time derivatives of the energy may be. By including the E qualifier together with the N qualifier, the absolute energy is suppressed leaving just the delta and acceleration coefficients of the energy.

5.7 Storage of Parameter Files

Whereas HTK can handle waveform data in a variety of file formats, all parameterised speech data is stored externally in either native HTK format data files or Entropic Esignal format files. Entropic ESPS format is no longer supported directly, but input and output filters can be used to convert ESPS to Esignal format on input and Esignal to ESPS on output.

5.7.1 HTK Format Parameter Files

HTK format files consist of a contiguous sequence of samples preceded by a header. Each sample is a vector of either 2-byte integers or 4-byte floats. 2-byte integers are used for compressed forms as described below and for vector quantised data as described later in section 5.11. HTK format data files can also be used to store speech waveforms as described in section 5.8.

The HTK file format header is 12 bytes long and contains the following data:

  nSamples – number of samples in file (4-byte integer)
  sampPeriod – sample period in 100ns units (4-byte integer)
  sampSize – number of bytes per sample (2-byte integer)
  parmKind – a code indicating the sample kind (2-byte integer)
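As a minimal sketch, the 12-byte header can be unpacked with Python's `struct` module. The excerpt does not state the byte order, so big-endian is assumed here and exposed as a parameter; the names `HTKHeader`, `read_htk_header` and `frame_period_seconds` are illustrative.

```python
import struct
from collections import namedtuple

HTKHeader = namedtuple("HTKHeader", "n_samples samp_period samp_size parm_kind")


def read_htk_header(data, byte_order=">"):
    """Unpack the 12-byte HTK header from the start of `data` (bytes).

    byte_order=">" (big-endian) is an assumption; pass "<" for
    little-endian files.
    """
    if len(data) < 12:
        raise ValueError("an HTK header needs at least 12 bytes")
    # Two 4-byte integers followed by two 2-byte integers.
    return HTKHeader(*struct.unpack(byte_order + "iihh", data[:12]))


def frame_period_seconds(header):
    """Convert sampPeriod from 100ns units to seconds."""
    return header.samp_period * 100e-9
```

For example, a sampPeriod of 100000 corresponds to a frame period of 10 ms.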

The parameter kind consists of a 6 bit code representing the basic parameter kind plus additional bits for each of the possible qualifiers. The basic parameter kind codes are

   0  WAVEFORM   sampled waveform
   1  LPC        linear prediction filter coefficients
   2  LPREFC     linear prediction reflection coefficients
   3  LPCEPSTRA  LPC cepstral coefficients
   4  LPDELCEP   LPC cepstra plus delta coefficients
   5  IREFC      LPC reflection coef in 16 bit integer format
   6  MFCC       mel-frequency cepstral coefficients
   7  FBANK      log mel-filter bank channel outputs
   8  MELSPEC    linear mel-filter bank channel outputs
   9  USER       user defined sample kind
  10  DISCRETE   vector quantised data


and the bit-encoding for the qualifiers (in octal) is

  E  000100  has energy
  N  000200  absolute energy suppressed
  D  000400  has delta coefficients
  A  001000  has acceleration coefficients
  C  002000  is compressed
  Z  004000  has zero mean static coef.
  K  010000  has CRC checksum
  O  020000  has 0’th cepstral coef.

The A qualifier can only be specified when D is also specified. The N qualifier is only valid when both energy and delta coefficients are present. The sample kind LPDELCEP is identical to LPCEPSTRA_D and is retained for compatibility with older versions of HTK. The C and K qualifiers only exist in external files. Compressed files are always decompressed on loading and any attached CRC is checked and removed. An external file can contain both an energy term and a 0’th order cepstral coefficient. These may be retained on loading but normally one or the other is discarded (see footnote 7).
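The encoding above can be decoded with a mask for the 6-bit base code and one bit test per qualifier. This is a sketch only: the names `BASE_KINDS`, `QUALIFIER_BITS`, `decode_parm_kind` and `check_qualifiers` are illustrative, and the qualifiers are returned in bit order, which is not necessarily the order HTK uses when printing a kind name.

```python
BASE_KINDS = [
    "WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
    "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE",
]

# Qualifier letters and their octal bit values, as listed in the text.
QUALIFIER_BITS = [
    ("E", 0o000100), ("N", 0o000200), ("D", 0o000400), ("A", 0o001000),
    ("C", 0o002000), ("Z", 0o004000), ("K", 0o010000), ("O", 0o020000),
]


def decode_parm_kind(code):
    """Split a parmKind code into (base kind name, list of qualifier letters)."""
    base_code = code & 0o77  # low 6 bits hold the basic parameter kind
    if base_code >= len(BASE_KINDS):
        raise ValueError(f"unknown base kind code {base_code}")
    quals = [name for name, bit in QUALIFIER_BITS if code & bit]
    return BASE_KINDS[base_code], quals


def check_qualifiers(quals):
    """Return a list of violations of the two rules stated in the text."""
    problems = []
    if "A" in quals and "D" not in quals:
        problems.append("A requires D")
    if "N" in quals and not ("E" in quals and "D" in quals):
        problems.append("N requires both E and D")
    return problems
```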

[Figure 5.4 shows example vector layouts for the kinds LPC, LPC_E, LPC_D, LPC_E_D, LPC_E_D_N and LPC_E_D_A. Legend: Ci basic coefficients; E log energy; δCi, δE delta coefficients; ∆Ci, ∆E acceleration coefficients.]

Fig. 5.4 Parameter Vector Layout in HTK Format Files

All parameterised forms of HTK data files consist of a sequence of vectors. Each vector is organised as shown by the examples in Fig. 5.4, where various different qualified forms are listed. As can be seen, an energy value if present immediately follows the base coefficients. If delta coefficients are added, these follow the base coefficients and energy value. Note that the base form LPC is used in this figure only as an example; the same layout applies to all base sample kinds. If the 0’th order cepstral coefficient is included as well as energy then it is inserted immediately before the energy coefficient, otherwise it replaces it.
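The ordering rules in this paragraph can be sketched as a function that lists the component labels of one vector. The labels (`C1`, `C0`, `E`, and the `d`/`a` prefixes for delta and acceleration) and the function name `vector_layout` are made up for illustration. The sketch assumes the N qualifier removes only the absolute energy and does not validate qualifier combinations.

```python
def vector_layout(n_base, quals):
    """Return component labels for one parameter vector, in storage order.

    n_base is the number of base coefficients; quals is an iterable of
    qualifier letters drawn from E, N, D, A and O.
    """
    quals = set(quals)
    statics = [f"C{i}" for i in range(1, n_base + 1)]
    if "O" in quals:
        statics.append("C0")  # 0'th cepstral coefficient, before the energy
    if "E" in quals:
        statics.append("E")   # energy follows the base coefficients

    # With N, the absolute energy is dropped but its derivatives are kept.
    layout = [s for s in statics if not (s == "E" and "N" in quals)]
    if "D" in quals:
        layout += ["d" + s for s in statics]
    if "A" in quals:
        layout += ["a" + s for s in statics]
    return layout
```

For instance, `vector_layout(2, "ED")` gives `C1, C2, E, dC1, dC2, dE`, matching the LPC_E_D row of the figure for two base coefficients.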

For external storage of speech parameter files, two compression methods are provided. For LP coding only, the IREFC parameter kind exploits the fact that the reflection coefficients are bounded by ±1 and hence they can be stored as scaled integers such that +1.0 is stored as 32767 and −1.0 is stored as −32767. For other types of parameterisation, a more general compression facility indicated by the C qualifier is used. HTK compressed parameter files consist of a set of compressed parameter vectors stored as shorts such that for parameter x

$$x_{short} = A \cdot x_{float} - B$$

The coefficients A and B are defined as

$$A = \frac{2I}{x_{max} - x_{min}}$$

$$B = \frac{(x_{max} + x_{min})\, I}{x_{max} - x_{min}}$$

where x_max is the maximum value of parameter x in the whole file and x_min is the corresponding minimum. I is the maximum range of a 2-byte integer, i.e. 32767. The values of A and B are stored as two floating point vectors prepended to the start of the file immediately after the header.

7 Some applications may require the 0’th order cepstral coefficient in order to recover the filterbank coefficients from the cepstral coefficients.
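Both compression methods can be sketched in a few lines. The functions below operate on a single parameter, that is, one component of the vector taken across all frames of the file. Rounding to the nearest integer and the handling of a constant-valued parameter are assumptions, since the text does not describe them; all function names are illustrative.

```python
I_MAX = 32767  # maximum range of a 2-byte integer, as given in the text


def refc_to_irefc(k):
    """Scale a reflection coefficient in [-1, 1] to a 16-bit integer."""
    return round(k * I_MAX)


def compression_coeffs(column):
    """Compute A and B for one parameter over the whole file."""
    x_max, x_min = max(column), min(column)
    if x_max == x_min:
        # The text does not say how a zero range is handled.
        raise ValueError("parameter is constant; A and B are undefined")
    span = x_max - x_min
    return 2 * I_MAX / span, (x_max + x_min) * I_MAX / span


def compress(column):
    """Return (A, B, shorts) with x_short = A * x_float - B."""
    a, b = compression_coeffs(column)
    return a, b, [round(a * x - b) for x in column]


def decompress(a, b, shorts):
    """Invert the mapping: x_float = (x_short + B) / A."""
    return [(s + b) / a for s in shorts]
```

With these definitions the maximum value maps to +32767 and the minimum to −32767, and the round-trip error per value is at most half a quantisation step, 0.5/A.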

When an HTK tool writes out a speech file to external storage, no further signal conversions are performed. Thus, for most purposes, the target parameter kind specifies both the required internal representation and the form of the written output, if any. However, there is a distinction in the way that the external data is actually stored. Firstly, it can be compressed as described above by setting the configuration parameter SAVECOMPRESSED to true. If the target kind is LPREFC then this compression is implemented by converting to IREFC, otherwise the general compression algorithm described above is used. Secondly, in order to avoid data corruption problems, externally stored HTK parameter files can have a cyclic redundancy checksum appended. This is indicated by the qualifier K and it is generated by setting the configuration parameter SAVEWITHCRC to true. The principal tool which uses these output conversions is HCopy (see section 5.13).

5.7.2 Esignal Format Parameter Files

The default for parameter files is native HTK format. However, HTK tools also support the Entropic Esignal format for both input and output. Esignal replaces the Entropic ESPS file format. To ensure compatibility, Entropic provides conversion programs from ESPS to ESIG and vice versa.

To indicate that a source file is in Esignal format the configuration variable SOURCEFORMAT should be set to ESIG. Alternatively, -F ESIG can be specified as a command-line option. To generate Esignal format output files, the configuration variable TARGETFORMAT should be set to ESIG or the command line option -O ESIG should be set.

ESIG files consist of three parts: a preamble, a sequence of field specifications called the field list and a sequence of records. The preamble and the field list together constitute the header. The preamble is purely ASCII. Currently it consists of 6 information items that are all terminated by a new line. The information in the preamble is the following:

  line 1 – identification of the file format
  line 2 – version of the file format
  line 3 – architecture (ASCII, EDR1, EDR2, machine name)
  line 4 – preamble size (48 bytes)
  line 5 – total header size
  line 6 – record size
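Reading the preamble amounts to splitting off six newline-terminated ASCII items. The sketch below assumes that the three size items are written as decimal integers, possibly padded with spaces; the excerpt does not specify this. The names `Preamble` and `parse_esig_preamble` are illustrative.

```python
from collections import namedtuple

Preamble = namedtuple(
    "Preamble",
    "file_format version architecture preamble_size header_size record_size",
)


def parse_esig_preamble(data):
    """Parse the six preamble items from the start of `data` (bytes)."""
    parts = data.split(b"\n", 6)
    if len(parts) < 7:
        raise ValueError("expected six newline-terminated preamble items")
    items = [p.decode("ascii").strip() for p in parts[:6]]
    # The last three items are assumed to be decimal integers.
    return Preamble(items[0], items[1], items[2],
                    int(items[3]), int(items[4]), int(items[5]))
```

A caller could then check that the reported preamble size is 48 and use the header size to skip past the field list to the first record.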

All ESIG files that are output by HTK programs contain the following global fields:

commandLine the command-line used to generate the file;

recordFreq a double value that indicates the sample frequency in Hertz;

startTime a double value that indicates a time at which the first sample is presumed to be starting;

parmKind a character string that indicates the full type of parameters in the file, e.g. MFCC_E_D;

source_1 if the input file was an ESIG file, this field includes the header items in the input file.

After that there are field specifiers for the records. The first specifier is for the base kind of the parameters, e.g. MFCC. Then for each available qualifier there are additional specifiers. Possible specifiers are:

  zeroc
  energy
  delta
  delta_zeroc
  delta_energy
  accs
  accs_zeroc
  accs_energy

