
utterance, the average log probability per frame, the total acoustic likelihood, the total language model likelihood and the average number of active models.

The corresponding transcription written to the output MLF file will contain an entry of the form

"testf1.rec"

0 6200000 SIL -6067.333008
6200000 9200000 ONE -3032.359131
9200000 12300000 NINE -3020.820312
12300000 17600000 FOUR -4690.033203
17600000 17800000 SIL -302.439148
.

This shows the start and end time of each word and the total log probability. The fields output by HVite can be controlled using the -o option. For example, the option -o ST would suppress the scores and the times to give

"testf1.rec"

SIL
ONE
NINE
FOUR
SIL
.
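The timed entry shown earlier is easy to post-process outside HTK. As an illustration, the following Python sketch (parse_rec_mlf_entry is a hypothetical helper, not part of HTK) converts the body of one .rec entry into (start, end, label, score) tuples; HTK stores times in 100 ns units.

    def parse_rec_mlf_entry(lines):
        """Parse the body of a single MLF .rec entry, terminated
        by a line containing only a period."""
        words = []
        for line in lines:
            line = line.strip()
            if line == ".":                  # end of entry
                break
            start, end, label, score = line.split()
            words.append((int(start) / 1e7,  # 100 ns units -> seconds
                          int(end) / 1e7,
                          label,
                          float(score)))
        return words

    entry = ["0 6200000 SIL -6067.333008",
             "6200000 9200000 ONE -3032.359131",
             "."]
    print(parse_rec_mlf_entry(entry))
    # [(0.0, 0.62, 'SIL', -6067.333008), (0.62, 0.92, 'ONE', -3032.359131)]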

In order to use HVite effectively and efficiently, it is important to set appropriate values for its pruning thresholds and the language model scaling parameters. The main pruning beam is set by the -t option. Some experimentation will be necessary to determine appropriate levels but around 250.0 is usually a reasonable starting point. Word-end pruning (-v) and the maximum model limit (-u) can also be set if required, but these are not mandatory and their effectiveness will depend greatly on the task.

The relative levels of insertion and deletion errors can be controlled by scaling the language model likelihoods using the -s option and adding a fixed penalty using the -p option. For example, setting -s 10.0 -p -20.0 would mean that every language model log probability x would be converted to 10x − 20 before being added to the tokens emitted from the corresponding word-end node. As an extreme example, setting -p 100.0 caused the digit recogniser above to output

SIL OH OH ONE OH OH OH NINE FOUR OH OH OH OH SIL

where adding 100 to each word-end transition has resulted in a large number of insertion errors. The word inserted is “oh” primarily because it is the shortest in the vocabulary.

Another problem which may occur during recognition is the inability to arrive at the final node in the recognition network after processing the whole utterance. The user is made aware of the problem by the message “No tokens survived to final node of network”. The inability to match the data against the recognition network is usually caused by poorly trained acoustic models and/or very tight pruning beam-widths.

In such cases, partial recognition results can still be obtained by setting the HRec configuration variable FORCEOUT true. The results will be based on the most likely partial hypothesis found in the network.
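Returning to the -s and -p options, the score arithmetic they imply can be written down directly. A minimal Python sketch (variable names are illustrative) of how a token leaving a word-end node is updated:

    def word_end_score(token_score, lm_logprob, s=10.0, p=-20.0):
        # -s scales the language model log probability and -p adds a
        # fixed word insertion penalty, so with s=10.0 and p=-20.0 an
        # LM log probability x contributes 10x - 20 to the token.
        return token_score + s * lm_logprob + p

A large positive p rewards every word-end transition, which is why the -p 100.0 setting above floods the output with insertions.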

13.4 Evaluating Recognition Results

Once the test data has been processed by the recogniser, the next step is to analyse the results.

The tool HResults is provided for this purpose. HResults compares the transcriptions output by HVite with the original reference transcriptions and then outputs various statistics. HResults matches each of the recognised and reference label sequences by performing an optimal string match using dynamic programming. Except when scoring word-spotter output as described later, it does not take any notice of any boundary timing information stored in the files being compared. The optimal string match works by calculating a score for the match with respect to the reference such that identical labels match with score 0, a label insertion carries a score of 7, a deletion carries a score of 7 and a substitution carries a score of 10[2]. The optimal string match is the label alignment which has the lowest possible score.
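The optimal match can be found with standard dynamic programming. The following is a minimal Python sketch (illustrative, not HTK code) using the default weights; the traceback yields the substitution, deletion and insertion counts introduced next.

    def align(ref, rec, ins=7, dele=7, sub=10):
        """Weighted string match between a reference and a recognised
        label sequence; returns (S, D, I) error counts."""
        n, m = len(ref), len(rec)
        # cost[i][j] = lowest score matching ref[:i] against rec[:j]
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i * dele
        for j in range(1, m + 1):
            cost[0][j] = j * ins
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = cost[i-1][j-1] + (0 if ref[i-1] == rec[j-1] else sub)
                cost[i][j] = min(diag, cost[i-1][j] + dele, cost[i][j-1] + ins)
        # trace back through the matrix counting each error type
        S = D = I = 0
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + \
                    (0 if ref[i-1] == rec[j-1] else sub):
                if ref[i-1] != rec[j-1]:
                    S += 1
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i-1][j] + dele:
                D, i = D + 1, i - 1
            else:
                I, j = I + 1, j - 1
        return S, D, I

    ref = "FOUR SEVEN NINE THREE".split()
    rec = "FOUR OH SEVEN FIVE THREE".split()
    print(align(ref, rec))   # (1, 0, 1): one substitution, one insertion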

Once the optimal alignment has been found, the number of substitution errors (S), deletion errors (D) and insertion errors (I) can be calculated. The percentage correct is then

    Percent Correct = (N − D − S) / N × 100%                      (13.1)

where N is the total number of labels in the reference transcriptions. Notice that this measure ignores insertion errors. For many purposes, the percentage accuracy defined as

    Percent Accuracy = (N − D − S − I) / N × 100%                 (13.2)

is a more representative figure of recogniser performance.
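As a quick check of these two definitions, a short Python sketch that reproduces the WORD figures from the example HResults output shown below:

    def percent_correct(N, D, S):
        return 100.0 * (N - D - S) / N           # equation (13.1)

    def percent_accuracy(N, D, S, I):
        return 100.0 * (N - D - S - I) / N       # equation (13.2)

    # Counts taken from the example below: N=855, D=1, S=1, I=1
    print(round(percent_correct(855, 1, 1), 2))      # 99.77
    print(round(percent_accuracy(855, 1, 1, 1), 2))  # 99.65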

HResults outputs both of the above measures. As with all HTK tools it can process individual label files and files stored in MLFs. Here the examples will assume that both reference and test transcriptions are stored in MLFs.

As an example of use, suppose that the MLF results contains recogniser output transcriptions, refs contains the corresponding reference transcriptions and wlist contains a list of all labels appearing in these files. Then typing the command

HResults -I refs wlist results

would generate something like the following

====================== HTK Results Analysis =======================
  Date: Sat Sep 2 14:14:22 1995
  Ref : refs
  Rec : results
------------------------ Overall Results --------------------------
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
===================================================================

The first part shows the date and the names of the files being used. The line labelled SENT shows the total number of complete sentences which were recognised correctly. The second line labelled WORD gives the recognition statistics for the individual words[3].

It is often useful to visually inspect the recognition errors. Setting the -t option causes aligned test and reference transcriptions to be output for all sentences containing errors. For example, a typical output might be

Aligned transcription: testf9.lab vs testf9.rec
 LAB: FOUR    SEVEN NINE THREE
 REC: FOUR OH SEVEN FIVE THREE

Here an “oh” has been inserted by the recogniser and “nine” has been recognised as “five”.

If preferred, results output can be formatted in an identical manner to NIST scoring software by setting the -h option. For example, the results given above would appear as follows in NIST format

,-------------------------------------------------------------.
| HTK Results Analysis at Sat Sep 2 14:42:06 1995             |
| Ref: refs                                                   |
| Rec: results                                                |
|=============================================================|
|         | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
|-------------------------------------------------------------|
| Sum/Avg |  200  | 99.77   0.12   0.12   0.12   0.35   1.50  |
`-------------------------------------------------------------'

[2] The default behaviour of HResults is slightly different to the widely used US NIST scoring software which uses weights of 3, 3 and 4 and a slightly different alignment algorithm. Identical behaviour to NIST can be obtained by setting the -n option.

[3] All the examples here will assume that each label corresponds to a word but in general the labels could stand for any recognition unit such as phones, syllables, etc. HResults does not care what the labels mean but for human consumption, the labels SENT and WORD can be changed using the -a and -b options.

When computing recognition results it is sometimes inappropriate to distinguish certain labels.

For example, to assess a digit recogniser used for voice dialing it might be required to treat the alternative vocabulary items “oh” and “zero” as being equivalent. This can be done using the -e option, that is

HResults -e ZERO OH ...

If a label is equated to the special label ???, then it is ignored. Hence, for example, if the recognition output had silence marked by SIL, then setting the option -e ??? SIL would cause all the SIL labels to be ignored.

HResults contains a number of other options. Recognition statistics can be generated for each file individually by setting the -f option, and a confusion matrix can be generated by setting the -p option. When comparing phone recognition results, setting the -s option causes HResults to strip any triphone contexts. HResults can also process N-best recognition output. Setting the option -d N causes HResults to search the first N alternatives of each test output file to find the most accurate match with the reference labels.

When analysing the performance of a speaker independent recogniser it is often useful to obtain accuracy figures on a per speaker basis. This can be done using the option -k mask where mask is a pattern used to extract the speaker identifier from the test label file name. The pattern consists of a string of characters which can include the pattern matching metacharacters * and ? to match zero or more characters and a single character, respectively. The pattern should also contain a string of one or more % characters which are used as a mask to identify the speaker identifier.
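The mask semantics can be illustrated with a small Python sketch (the conversion function is a hypothetical illustration; HResults performs this matching internally):

    import re

    def speaker_from_mask(mask, filename):
        """Interpret an HResults -k mask: '*' matches any string,
        '?' matches one character and a run of '%' characters
        marks (here, captures) the speaker identifier."""
        pattern, i = "", 0
        while i < len(mask):
            ch = mask[i]
            if ch == "%":                    # run of % -> capture group
                run = 0
                while i + run < len(mask) and mask[i + run] == "%":
                    run += 1
                pattern += "(" + "." * run + ")"
                i += run
            else:
                if ch == "*":
                    pattern += ".*"
                elif ch == "?":
                    pattern += "."
                else:
                    pattern += re.escape(ch)
                i += 1
        m = re.fullmatch(pattern, filename)
        return m.group(1) if m else None

    print(speaker_from_mask("*_%%%%_????.*", "DIGITS_dgo1_0001.rec"))  # dgo1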

For example, suppose that the test filenames had the following structure

DIGITS_spkr_nnnn.rec

where spkr is a 4 character speaker id and nnnn is a 4 digit utterance id. Then executing HResults by

HResults -h -k '*_%%%%_????.*' ....

would give output of the form

,-------------------------------------------------------------.
| HTK Results Analysis at Sat Sep 2 15:05:37 1995             |
| Ref: refs                                                   |
| Rec: results                                                |
|-------------------------------------------------------------|
|  SPKR   | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
|-------------------------------------------------------------|
|  dgo1   |  20   | 100.00   0.00   0.00   0.00   0.00   0.00 |
|-------------------------------------------------------------|
|  pcw1   |  20   |  97.22   1.39   1.39   0.00   2.78  10.00 |
|-------------------------------------------------------------|
   ...
|=============================================================|
| Sum/Avg |  200  |  99.77   0.12   0.12   0.12   0.35   1.50 |
`-------------------------------------------------------------'

In addition to string matching, HResults can also analyse the results of a recogniser configured for word-spotting. In this case, there is no DP alignment. Instead, each recogniser label w is compared with the reference transcriptions. If the start and end times of w lie either side of the mid-point of an identical label in the reference, then that recogniser label represents a hit, otherwise it is a false-alarm (FA).
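A minimal sketch of this hit/false-alarm decision (Python; times in seconds, and the tuple layout is illustrative):

    def is_hit(rec_label, rec_start, rec_end, refs):
        """True if the mid-point of an identical reference label
        lies between the start and end of the recognised label."""
        for ref_label, ref_start, ref_end in refs:
            mid = 0.5 * (ref_start + ref_end)
            if ref_label == rec_label and rec_start <= mid <= rec_end:
                return True
        return False            # otherwise a false alarm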

The recogniser output must include the log likelihood scores as well as the word boundary information. These scores are used to compute the Figure of Merit (FOM) defined by NIST which is an upper-bound estimate on word spotting accuracy averaged over 1 to 10 false alarms per hour.

The FOM is calculated as follows, where it is assumed that the total duration of the test speech is T hours. For each word, all of the spots are ranked in score order. The percentage of true hits p_i found before the i'th false alarm is then calculated for i = 1 ... N + 1, where N is the first integer ≥ 10T − 0.5. The figure of merit is then defined as

    FOM = (1 / 10T) (p_1 + p_2 + ... + p_N + a p_{N+1})           (13.3)

where a = 10T − N is a factor which interpolates to 10 false alarms per hour.
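A sketch of the whole FOM computation under these definitions (Python; each spot is assumed to be classified as hit or false alarm as above, and the argument layout is illustrative):

    import math

    def figure_of_merit(spots, n_ref, T):
        """spots : (score, is_hit) pairs for one word
        n_ref : occurrences of the word in the reference
        T     : total duration of the test speech in hours
        Implements equation (13.3)."""
        spots = sorted(spots, key=lambda s: s[0], reverse=True)
        N = math.ceil(10 * T - 0.5)      # first integer >= 10T - 0.5
        p, hits = [], 0
        for score, hit in spots:
            if hit:
                hits += 1
            else:                        # reached the next false alarm
                p.append(100.0 * hits / n_ref)
                if len(p) == N + 1:
                    break
        while len(p) < N + 1:            # fewer than N+1 false alarms
            p.append(100.0 * hits / n_ref)
        a = 10 * T - N                   # interpolation factor
        return (sum(p[:N]) + a * p[N]) / (10 * T)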

