Chapter 4 Experimental Results
4.2 Monophone-based Experiments
4.2.3 Experiments
The parameters of the front-end processing are set as in Table 4-11, and the features adopted in the experiments are listed in Table 4-12. The flow chart for training the monophone-based HMMs is shown in Fig. 4-3. At the beginning, only the corpus and its corresponding Chinese phonetic alphabet transcriptions are available, so it is essential to convert the Chinese phonetic alphabets to SAMPA-T before training. Note that the corpus carries no boundary information. The six features selected in this thesis are based on LPC, MFCC, and PLP, which were introduced in Chapter 2.
In the training process there is no rule predicting how many Baum-Welch re-estimation passes will yield the best model, so it is necessary to test and verify the recognition rate after each pass in order to find the best one.
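This test-and-verify step can be sketched as a simple model-selection loop; the sketch below is illustrative only, and the `dev_scores` values are hypothetical, not measured results from this thesis.

```python
def best_iteration(dev_scores):
    """Return the 1-based re-estimation pass whose development-set score
    is highest; the first maximum wins, so no extra passes are kept."""
    best_iter, best_score = 1, dev_scores[0]
    for i, score in enumerate(dev_scores[1:], start=2):
        if score > best_score:
            best_iter, best_score = i, score
    return best_iter

# Illustrative accuracy curve: it rises, saturates, then slightly degrades,
# just as over-training a set of HMMs typically behaves.
scores = [70.1, 74.5, 76.0, 76.8, 77.2, 77.2, 77.1, 76.9]
print(best_iteration(scores))  # 5
```

In practice each pass would re-run Baum-Welch re-estimation before scoring, so stopping at the first maximum also saves training time.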
Table 4-11 The parameters of front-end processing

Sampling frequency    16 kHz
Pre-emphasis filter   1 − 0.97z⁻¹
Window                Hamming, w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1
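The front-end steps of Table 4-11 can be sketched in pure Python. The frame length and frame shift are assumptions here (a common 25 ms / 10 ms choice at 16 kHz), since Table 4-11 does not list them:

```python
import math

def preemphasis(signal, a=0.97):
    """Apply the pre-emphasis filter 1 - 0.97 z^-1 from Table 4-11."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocking(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames (trailing remainder dropped)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]

def hamming(frame):
    """Weight one frame by the standard Hamming window
    w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    N = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, x in enumerate(frame)]

# Toy 1-second signal at 16 kHz; 400-sample frames with a 160-sample shift
# correspond to the assumed 25 ms / 10 ms framing.
signal = [float(i % 30) for i in range(16000)]
frames = frame_blocking(preemphasis(signal), frame_len=400, frame_shift=160)
windowed = [hamming(f) for f in frames]
print(len(frames), len(windowed[0]))  # 98 400
```

Feature extraction (LPC, MFCC, PLP analysis) then operates on each windowed frame.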
Table 4-12 Six different features adopted in this thesis

Feature                                           Order  Filter banks  Energy  ∆   ∆²
Linear Predictive Coefficients (LPC_39)             39       –           √     √   √
Linear Predictive Coefficients (LPC_38)             38       –           √     √
Linear Predictive Reflection Coefficients (RC)      39       –           √     √   √
LPC Cepstrum Coefficients (LPCC)                    39       –           √     √   √
Mel-Frequency Cepstral Coefficients (MFCC)          39       26          √     √   √
Perceptual Linear Prediction Coefficients (PLP)     39       26          √     √   √
Fig. 4-4 gives a 3-D view of the six features computed on the word “不久” (bu4 djiou3): the variations of the 39-dimensional vectors (38-dimensional for LPC_38) from frame 1 to frame 100 are shown, where time denotes the frame index. The highest curve lies at the 13th element of the feature vectors (the 19th element for LPC_38), since this element is the energy term.
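The ∆ and ∆² terms that extend the static coefficients to 39 dimensions are usually obtained by linear regression over neighbouring frames. A minimal sketch follows, assuming the common 13-static-coefficient layout (12 coefficients plus energy) and a regression window of ±2 frames; the thesis's exact settings are not stated:

```python
def deltas(frames, theta=2):
    """Regression-based derivative:
    d_t = sum_{k=1..theta} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    padding by repeating the first and last frames at the edges."""
    T, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(T):
        d = [0.0] * dim
        for k in range(1, theta + 1):
            fwd = frames[min(t + k, T - 1)]
            bwd = frames[max(t - k, 0)]
            for i in range(dim):
                d[i] += k * (fwd[i] - bwd[i]) / denom
        out.append(d)
    return out

# Toy static features: 12 coefficients plus an energy term per frame.
static = [[float(t)] * 13 for t in range(10)]
d1 = deltas(static)            # delta
d2 = deltas(d1)                # delta-delta
obs = [s + a + b for s, a, b in zip(static, d1, d2)]
print(len(obs[0]))  # 39
```

Dropping the ∆² block and using 19 static terms instead of 13 yields the 38-dimensional LPC_38 layout in the same way.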
Fig. 4-3 Flow chart of training the monophone-based HMMs (front-end processing of the corpus: pre-emphasis, frame blocking, Hamming window; feature extraction; model training with the transcriptions; output: 62 HMM models such as b, p, d, U, sp, and sil)
Monophone-based HMMs are usually employed in Large Vocabulary Speech Recognition (LVSR). One of the factors influencing the recognition rate of LVSR is the language model, a statistical model that attempts to capture the regularities of natural language and improve performance by estimating the probability distribution of linguistic units such as words and sentences. If the recognition task consists of long paragraphs or articles, a language model should be trained. Nevertheless, the language model is not the key point of this thesis. Hence, the connected-digits corpus mentioned in Section 4.1.2 is used to test the monophone-based HMMs, which are trained with the six different feature extraction methods; the connected-digits task needs only an uncomplicated grammar in which a sentence is an arbitrary permutation of digits.

Fig. 4-4 3-D view of the variations of the feature vectors (a) LPC_38 (b) LPC_39 (c) RC (d) LPCC (e) MFCC (f) PLP
The experimental results are shown in Table 4-13 and Fig. 4-6. The total number of digits used in this experiment, denoted by T, is 8432. Three quantities must be considered when computing the recognition rate: the number of insertions (I), the number of deletions (D), and the number of substitutions (S). For example, the output sentence of the recognizer may be

i2 @`4 ba1 djiou3 sU4

while the actual sentence is

i2 ba1 liou3 sU4 san1
Fig. 4-5 Flow chart of testing the performance of different features (front-end processing of the connected-digits corpus: pre-emphasis, frame blocking, Hamming window; feature extraction; speech recognition using the 62 HMM models and the grammar; results analysis against the transcriptions of the test corpus to obtain the recognition rate)
where “@`4” is an insertion error, “djiou3” is a substitution error, and “san1” is a deletion error.
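The counts I, D, and S are obtained by aligning the recognized sentence with the actual one via dynamic programming. A minimal sketch follows (tie-breaking rules may differ from those of scoring tools such as HTK's HResults), applied to the example above:

```python
def align_counts(ref, hyp):
    """Levenshtein-align a reference and a recognized word sequence,
    returning (substitutions, deletions, insertions)."""
    m, n = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace the optimal path to count each error type.
    S = D = I = 0
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            D += 1          # reference word with no matching output
            i -= 1
        else:
            I += 1          # extra word in the recognizer output
            j -= 1
    return S, D, I

ref = "i2 ba1 liou3 sU4 san1".split()
hyp = "i2 @`4 ba1 djiou3 sU4".split()
print(align_counts(ref, hyp))  # (1, 1, 1)
```

The alignment recovers exactly one substitution (djiou3 for liou3), one deletion (san1), and one insertion (@`4), matching the analysis of the example.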
Based on the definitions above, the performance of different features can be examined through two measures, the Correct (%) and the Accuracy (%). The Correct (%) is computed by

Correct (%) = (T − D − S) / T × 100%    (4-1)

and the Accuracy (%) is defined as

Accuracy (%) = (T − D − S − I) / T × 100%    (4-2)

which means that the Accuracy (%) accounts not only for the deletion and substitution errors but also for the insertion errors. Hence, the Accuracy (%) is never higher than the Correct (%).
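Equations (4-1) and (4-2) translate directly into code; applied to the five-word example above (T = 5, S = D = I = 1):

```python
def correct_pct(T, S, D):
    """Correct (%) from Eq. (4-1): ignores insertion errors."""
    return 100.0 * (T - S - D) / T

def accuracy_pct(T, S, D, I):
    """Accuracy (%) from Eq. (4-2): also penalizes insertions,
    so it is never higher than Correct (%)."""
    return 100.0 * (T - S - D - I) / T

print(correct_pct(5, 1, 1))      # 60.0
print(accuracy_pct(5, 1, 1, 1))  # 40.0
```

For the full test set these functions would be evaluated with T = 8432 and the I, D, S counts accumulated over all sentences.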
Table 4-13 Comparison of the Corr (%) and Acc (%) of different features

Number of   LPC_38       LPC_39       RC           LPCC         MFCC         PLP
iterations  Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc
 1          73.7  40.1   69.4  38.3   82.0  46.0   87.3  67.1   89.6  67.2   90.0  67.9
 2          75.2  47.5   72.8  47.2   83.7  54.3   88.9  70.8   91.2  72.2   91.6  73.7
 3          76.4  51.3   74.8  49.4   84.0  56.1   89.2  71.6   92.0  74.2   92.2  75.6
 4          77.8  54.1   75.9  50.6   83.7  54.5   89.6  71.5   92.4  74.4   92.9  76.5
 5          78.7  55.9   76.4  52.3   83.6  54.0   89.6  71.8   92.8  75.7   93.2  77.3
 6          79.7  56.6   76.8  53.9   83.6  54.2   89.6  71.7   92.9  76.3   93.3  78.0
 7          80.1  57.3   76.9  54.2   83.6  54.3   89.5  71.9   93.0  77.1   93.4  78.6
 8          80.6  58.2   77.1  54.5   83.7  54.4   89.5  71.9   93.0  77.4   93.4  78.7
 9          81.0  59.0   77.0  54.4   83.9  54.7   89.7  72.1   93.1  77.7   93.4  78.7
10          81.1  59.1   77.1  54.3   83.9  55.0   89.6  72.1   93.1  78.1   93.4  78.8
11          81.2  59.2   77.1  54.3   84.1  55.4   89.6  72.1   93.2  78.2   93.3  78.9
12          81.2  59.2   76.8  54.0   84.2  56.0   89.6  72.2   93.2  78.3   93.2  78.9
From Fig. 4-6(a), the Correct (%) is the recognition rate without considering the insertion error, and the performance of connected-digits speaker-independent recognition based on monophone HMMs is

PLP > MFCC > LPCC > RC > LPC_38 > LPC_39    (4-3)

where PLP performs better than all the other features from iteration 1 to iteration 12 of training. The performance of all the models almost saturates by iteration 12. The maximum Correct (%) of PLP appears at iteration 8 of training and then decreases as more training iterations are performed, which shows that PLP needs less training time than the others to reach a good model. Fig. 4-6(b) shows the recognition results when the insertion error is considered. The order of performance is not the same as in (4-3), especially for RC, which suggests that RC tends to insert spurious words between two words more than the other features do.

Fig. 4-6 Comparison of the different features (a) Correct (%) (b) Accuracy (%)
Comparisons of the different features through the average and the best Correct (%) and Accuracy (%) are shown in Fig. 4-7. The order of performance from best to worst in this experiment is

PLP > MFCC > LPCC > RC > LPC_38 > LPC_39    (4-4)

except for the max Accuracy (%), where RC is worse than LPC_38; hence, the order of the best performance of the six features is

PLP > MFCC > LPCC > LPC_38 > RC > LPC_39    (4-5)

where PLP still performs better than all the other features in monophone-based speaker-independent speech recognition.
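Sorting the per-feature maxima taken from Table 4-13 reproduces orderings (4-4) and (4-5) directly:

```python
# Best (Correct %, Accuracy %) per feature over the 12 iterations,
# read off the columns of Table 4-13.
best = {
    "LPC_38": (81.2, 59.2), "LPC_39": (77.1, 54.5), "RC": (84.2, 56.1),
    "LPCC": (89.7, 72.2), "MFCC": (93.2, 78.3), "PLP": (93.4, 78.9),
}

rank_corr = sorted(best, key=lambda f: best[f][0], reverse=True)
rank_acc = sorted(best, key=lambda f: best[f][1], reverse=True)
print(" > ".join(rank_corr))  # the ordering in (4-4)
print(" > ".join(rank_acc))   # the ordering in (4-5): RC drops below LPC_38
```

The swap of RC and LPC_38 between the two rankings reflects RC's tendency toward insertion errors, which the Accuracy (%) penalizes.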
Fig.4-7 Monophone-based HMM experiment (a) Average Correct (%) (b) Average Accuracy (%) (c) Max Correct (%) (d) Max Accuracy (%)