Chapter 4 Experimental Results
4.2 Monophone-based Experiments
4.2.3 Experiments
The parameters of the front-end processing are set as in Table 4-11, and the features adopted in the experiments are listed in Table 4-12. The flow chart for training the monophone-based HMMs is shown in Fig. 4-3. At the beginning, only the corpus and its corresponding Chinese phonetic alphabet transcriptions are available, so it is essential to convert the Chinese phonetic alphabets to SAMPA-T before training. Note that the corpus carries no boundary information. The six features selected in this thesis are based on LPC, MFCC, and PLP, which were introduced in Chapter 2.
In the training process there is no rule predicting how many Baum-Welch re-estimation passes will yield the best model, so it is necessary to test and verify the recognition rate after each pass in order to find the best one.
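This test-and-verify step can be sketched as a simple model-selection loop; the sketch below is illustrative only, and the `dev_scores` values are hypothetical, not measured results from this thesis.

```python
def best_iteration(dev_scores):
    """Return the 1-based re-estimation pass whose development-set score
    is highest; the first maximum wins, so no extra passes are kept."""
    best_iter, best_score = 1, dev_scores[0]
    for i, score in enumerate(dev_scores[1:], start=2):
        if score > best_score:
            best_iter, best_score = i, score
    return best_iter

# Illustrative accuracy curve: it rises, saturates, then slightly degrades,
# just as over-training a set of HMMs typically behaves.
scores = [70.1, 74.5, 76.0, 76.8, 77.2, 77.2, 77.1, 76.9]
print(best_iteration(scores))  # 5
```

In practice each pass would re-run Baum-Welch re-estimation before scoring, so stopping at the first maximum also saves training time.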
Table 4-11 The parameters of front-end processing

Sampling frequency    16 kHz
Pre-emphasis filter   1 − 0.97z⁻¹
Window                Hamming, w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1
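The front-end steps of Table 4-11 can be sketched in pure Python. The frame length and frame shift are assumptions here (a common 25 ms / 10 ms choice at 16 kHz), since Table 4-11 does not list them:

```python
import math

def preemphasis(signal, a=0.97):
    """Apply the pre-emphasis filter 1 - 0.97 z^-1 from Table 4-11."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocking(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames (trailing remainder dropped)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]

def hamming(frame):
    """Weight one frame by the standard Hamming window
    w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    N = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, x in enumerate(frame)]

# Toy 1-second signal at 16 kHz; 400-sample frames with a 160-sample shift
# correspond to the assumed 25 ms / 10 ms framing.
signal = [float(i % 30) for i in range(16000)]
frames = frame_blocking(preemphasis(signal), frame_len=400, frame_shift=160)
windowed = [hamming(f) for f in frames]
print(len(frames), len(windowed[0]))  # 98 400
```

Feature extraction (LPC, MFCC, PLP analysis) then operates on each windowed frame.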
Table 4-12 Six different features adopted in this thesis

Feature                                           Order  Filter banks  Energy  ∆   ∆²
Linear Predictive Coefficients (LPC_39)             39       –           √     √   √
Linear Predictive Coefficients (LPC_38)             38       –           √     √
Linear Predictive Reflection Coefficients (RC)      39       –           √     √   √
LPC Cepstrum Coefficients (LPCC)                    39       –           √     √   √
Mel-Frequency Cepstral Coefficients (MFCC)          39       26          √     √   √
Perceptual Linear Prediction Coefficients (PLP)     39       26          √     √   √
Fig. 4-4 gives a 3-D view of the six features computed on the word “不久” (bu4 djiou3): the variations of the 39-dimensional vectors (38-dimensional for LPC_38) from frame 1 to frame 100 are shown, where time denotes the frame index. The highest curve lies at the 13th element of the feature vectors (the 19th element for LPC_38), since this element is the energy term.
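The ∆ and ∆² terms that extend the static coefficients to 39 dimensions are usually obtained by linear regression over neighbouring frames. A minimal sketch follows, assuming the common 13-static-coefficient layout (12 coefficients plus energy) and a regression window of ±2 frames; the thesis's exact settings are not stated:

```python
def deltas(frames, theta=2):
    """Regression-based derivative:
    d_t = sum_{k=1..theta} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    padding by repeating the first and last frames at the edges."""
    T, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(T):
        d = [0.0] * dim
        for k in range(1, theta + 1):
            fwd = frames[min(t + k, T - 1)]
            bwd = frames[max(t - k, 0)]
            for i in range(dim):
                d[i] += k * (fwd[i] - bwd[i]) / denom
        out.append(d)
    return out

# Toy static features: 12 coefficients plus an energy term per frame.
static = [[float(t)] * 13 for t in range(10)]
d1 = deltas(static)            # delta
d2 = deltas(d1)                # delta-delta
obs = [s + a + b for s, a, b in zip(static, d1, d2)]
print(len(obs[0]))  # 39
```

Dropping the ∆² block and using 19 static terms instead of 13 yields the 38-dimensional LPC_38 layout in the same way.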
Fig. 4-3 Flow chart of training the monophone-based HMMs (front-end processing of the corpus: pre-emphasis, frame blocking, Hamming window; feature extraction; model training with the transcriptions; output: 62 HMM models such as b, p, d, U, sp, and sil)
Monophone-based HMMs are usually employed in Large Vocabulary Speech Recognition (LVSR). One of the factors influencing the recognition rate of LVSR is the language model, a statistical model that attempts to capture the regularities of natural language and improve performance by estimating the probability distribution of linguistic units such as words and sentences. If the recognition task consists of long paragraphs or articles, a language model should be trained. Nevertheless, the language model is not the key point of this thesis. Hence, the connected-digits corpus mentioned in Section 4.1.2 is used to test the monophone-based HMMs, which are trained with the six different feature extraction methods; the connected-digits task needs only an uncomplicated grammar in which a sentence is an arbitrary permutation of digits.

Fig. 4-4 3-D view of the variations of the feature vectors (a) LPC_38 (b) LPC_39 (c) RC (d) LPCC (e) MFCC (f) PLP
The experimental results are shown in Table 4-13 and Fig. 4-6. The total number of digits used in this experiment, denoted by T, is 8432. Three quantities must be considered when computing the recognition rate: the number of insertions (I), the number of deletions (D), and the number of substitutions (S). For example, the output sentence of the recognizer may be

i2 @`4 ba1 djiou3 sU4

while the actual sentence is

i2 ba1 liou3 sU4 san1
Fig. 4-5 Flow chart of testing the performance of different features (front-end processing of the connected-digits corpus: pre-emphasis, frame blocking, Hamming window; feature extraction; speech recognition using the 62 HMM models and the grammar; results analysis against the transcriptions of the test corpus to obtain the recognition rate)
where “@`4” is an insertion error, “djiou3” is a substitution error, and “san1” is a deletion error.
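The counts I, D, and S are obtained by aligning the recognized sentence with the actual one via dynamic programming. A minimal sketch follows (tie-breaking rules may differ from those of scoring tools such as HTK's HResults), applied to the example above:

```python
def align_counts(ref, hyp):
    """Levenshtein-align a reference and a recognized word sequence,
    returning (substitutions, deletions, insertions)."""
    m, n = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace the optimal path to count each error type.
    S = D = I = 0
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            D += 1          # reference word with no matching output
            i -= 1
        else:
            I += 1          # extra word in the recognizer output
            j -= 1
    return S, D, I

ref = "i2 ba1 liou3 sU4 san1".split()
hyp = "i2 @`4 ba1 djiou3 sU4".split()
print(align_counts(ref, hyp))  # (1, 1, 1)
```

The alignment recovers exactly one substitution (djiou3 for liou3), one deletion (san1), and one insertion (@`4), matching the analysis of the example.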
Based on the definitions above, the performance of different features can be examined through two measures, the Correct (%) and the Accuracy (%). The Correct (%) is computed by

Correct (%) = (T − D − S) / T × 100%    (4-1)

and the Accuracy (%) is defined as

Accuracy (%) = (T − D − S − I) / T × 100%    (4-2)

which means that the Accuracy (%) accounts not only for the deletion and substitution errors but also for the insertion errors. Hence, the Accuracy (%) is never higher than the Correct (%).
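Equations (4-1) and (4-2) translate directly into code; applied to the five-word example above (T = 5, S = D = I = 1):

```python
def correct_pct(T, S, D):
    """Correct (%) from Eq. (4-1): ignores insertion errors."""
    return 100.0 * (T - S - D) / T

def accuracy_pct(T, S, D, I):
    """Accuracy (%) from Eq. (4-2): also penalizes insertions,
    so it is never higher than Correct (%)."""
    return 100.0 * (T - S - D - I) / T

print(correct_pct(5, 1, 1))      # 60.0
print(accuracy_pct(5, 1, 1, 1))  # 40.0
```

For the full test set these functions would be evaluated with T = 8432 and the I, D, S counts accumulated over all sentences.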
Table 4-13 Comparison of the Corr (%) and Acc (%) of different features

Number of   LPC_38       LPC_39       RC           LPCC         MFCC         PLP
iterations  Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc
 1          73.7  40.1   69.4  38.3   82.0  46.0   87.3  67.1   89.6  67.2   90.0  67.9
 2          75.2  47.5   72.8  47.2   83.7  54.3   88.9  70.8   91.2  72.2   91.6  73.7
 3          76.4  51.3   74.8  49.4   84.0  56.1   89.2  71.6   92.0  74.2   92.2  75.6
 4          77.8  54.1   75.9  50.6   83.7  54.5   89.6  71.5   92.4  74.4   92.9  76.5
 5          78.7  55.9   76.4  52.3   83.6  54.0   89.6  71.8   92.8  75.7   93.2  77.3
 6          79.7  56.6   76.8  53.9   83.6  54.2   89.6  71.7   92.9  76.3   93.3  78.0
 7          80.1  57.3   76.9  54.2   83.6  54.3   89.5  71.9   93.0  77.1   93.4  78.6
 8          80.6  58.2   77.1  54.5   83.7  54.4   89.5  71.9   93.0  77.4   93.4  78.7
 9          81.0  59.0   77.0  54.4   83.9  54.7   89.7  72.1   93.1  77.7   93.4  78.7
10          81.1  59.1   77.1  54.3   83.9  55.0   89.6  72.1   93.1  78.1   93.4  78.8
11          81.2  59.2   77.1  54.3   84.1  55.4   89.6  72.1   93.2  78.2   93.3  78.9
12          81.2  59.2   76.8  54.0   84.2  56.0   89.6  72.2   93.2  78.3   93.2  78.9
From Fig. 4-6(a), the Correct (%) is the recognition rate without considering the insertion error, and the performance of connected-digits speaker-independent recognition based on monophone HMMs is

PLP > MFCC > LPCC > RC > LPC_38 > LPC_39    (4-3)

where PLP performs better than all the other features from iteration 1 to iteration 12 of training. The performance of all the models almost saturates by iteration 12. The maximum Correct (%) of PLP appears at iteration 8 of training and then decreases as more training iterations are performed, which shows that PLP needs less training time than the others to reach a good model. Fig. 4-6(b) shows the recognition results when the insertion error is considered. The order of performance is not the same as in (4-3), especially for RC, which suggests that RC tends to insert spurious words between two words more than the other features do.

Fig. 4-6 Comparison of the different features (a) Correct (%) (b) Accuracy (%)
Comparisons of the different features through the average and the best Correct (%) and Accuracy (%) are shown in Fig. 4-7. The order of performance from best to worst in this experiment is

PLP > MFCC > LPCC > RC > LPC_38 > LPC_39    (4-4)

except for the max Accuracy (%), where RC is worse than LPC_38; hence, the order of the best performance of the six features is

PLP > MFCC > LPCC > LPC_38 > RC > LPC_39    (4-5)

where PLP still performs better than all the other features in monophone-based speaker-independent speech recognition.
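Sorting the per-feature maxima taken from Table 4-13 reproduces orderings (4-4) and (4-5) directly:

```python
# Best (Correct %, Accuracy %) per feature over the 12 iterations,
# read off the columns of Table 4-13.
best = {
    "LPC_38": (81.2, 59.2), "LPC_39": (77.1, 54.5), "RC": (84.2, 56.1),
    "LPCC": (89.7, 72.2), "MFCC": (93.2, 78.3), "PLP": (93.4, 78.9),
}

rank_corr = sorted(best, key=lambda f: best[f][0], reverse=True)
rank_acc = sorted(best, key=lambda f: best[f][1], reverse=True)
print(" > ".join(rank_corr))  # the ordering in (4-4)
print(" > ".join(rank_acc))   # the ordering in (4-5): RC drops below LPC_38
```

The swap of RC and LPC_38 between the two rankings reflects RC's tendency toward insertion errors, which the Accuracy (%) penalizes.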
Fig.4-7 Monophone-based HMM experiment (a) Average Correct (%) (b) Average Accuracy (%) (c) Max Correct (%) (d) Max Accuracy (%)