
Chapter 4 Experimental Results

4.3 Syllable-based Experiments

4.3.1 Syllable-based HMM used on connected-digits corpus

A connected-digits sentence is an arbitrary combination of the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, and 9". Hence, the total number of word-level HMMs needed is 12: the 10 digits, the silence model "sil", and the short-pause model "sp". The configuration of the HMMs is defined in Table 4-14.

The training database is selected from the connected-digits database described in 4.1.2: 800 files, of which 400 are spoken by 40 males and 400 by 40 females, are used to train the syllable-based HMMs. The remaining 199 files of the corpus (99 spoken by 10 females and 100 by 10 males) are used to test the performance of the different features in syllable-based speaker-independent speech recognition.

Table 4-14 Definition of the HMMs used in the syllable-based experiment

Number of syllable-based HMMs:            12 (10 digits, "sp", and "sil")
Number of states per HMM:                 8 (the first and last states are null states)
Number of Gaussian mixtures per state:    4
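To make the configuration concrete, the following is a minimal sketch of how such a model inventory could be declared with the Python hmmlearn package; the package choice, the variable names, and the diagonal-covariance assumption are illustrative, not the tooling actually used in the experiments. Since the first and last of the 8 states are null, each model has 6 emitting states.

```python
# Hypothetical sketch of the Table 4-14 model inventory: 12 GMM-HMMs,
# 6 emitting states each (8 states minus the null entry/exit states),
# and 4 Gaussian mixtures per state.
from hmmlearn.hmm import GMMHMM

DIGIT_MODELS = ["ling", "yi", "er", "san", "si", "wu",
                "liu", "qi", "ba", "jiu", "sp", "sil"]

N_EMITTING_STATES = 6   # 8 states with the null first/last states removed
N_MIXTURES = 4          # Gaussian mixtures per state (Table 4-14)

models = {
    name: GMMHMM(n_components=N_EMITTING_STATES,
                 n_mix=N_MIXTURES,
                 covariance_type="diag")   # diagonal covariance assumed
    for name in DIGIT_MODELS
}
```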

4.3.2 Experiments

The parameters of the front-end processing are the same as in Table 4-11. The features adopted in the experiment are listed in Table 4-15 and are the same as those used in the monophone-based experiments. The flow chart of training the syllable-based HMMs is shown in Fig. 4-8, where the digits "1, 2, 3, 4, 5, 6, 7, 8, 9, and 0" are denoted by "yi, er, san, si, wu, liu, qi, ba, jiu, and ling," respectively. Note that boundary information is available for this corpus, so the training procedure differs from that of the experiment in 4.2, which had no boundary information; the details of the difference were introduced in section 3.3. In practice, boundary information is beneficial for training: the HMMs are estimated more precisely with it. In addition, fewer HMMs are used than in the previous experiment. The recognition results are therefore expected to be considerably higher than those in 4.2.3. A sketch of how boundary information changes the training is given below.
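Continuing the hypothetical hmmlearn sketch above, the routine below illustrates the point: with known boundaries, each model can be re-estimated only on the frames labelled as its own unit, instead of by embedded re-estimation over whole utterances. The utterance and label formats are assumptions made for illustration.

```python
import numpy as np

def train_with_boundaries(models, utterances):
    """Bootstrap training with boundary information (sketch).

    utterances: list of (feats, labels) pairs, where feats is a
    (T, 39) feature array and labels is a list of
    (model_name, start_frame, end_frame) tuples, a hypothetical
    format standing in for the corpus's boundary annotations."""
    segments = {name: [] for name in models}
    for feats, labels in utterances:
        for name, start, end in labels:
            segments[name].append(feats[start:end])   # cut at boundaries
    for name, segs in segments.items():
        if segs:
            X = np.vstack(segs)                # stack this model's frames
            lengths = [len(s) for s in segs]   # one entry per segment
            models[name].fit(X, lengths)       # per-model Baum-Welch
    return models
```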

Fig. 4-9 shows the testing procedure of the syllable-based recognition.

Table 4-16 shows the experimental results of the different features on connected-digits speaker-independent recognition. The performance of the features, ordered from best to worst, is generally

PLP, MFCC > LPCC > RC > LPC_38 > LPC_39    (4-6)

where PLP and MFCC are nearly identical in maximum Acc (%), the main criterion used for judging the performance of the models.
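The Corr (%) and Acc (%) figures follow the standard definitions (as used, e.g., by HTK's HResults tool): with N reference labels and D deletions, S substitutions, and I insertions on a minimum-edit-distance alignment, Corr = (N - D - S)/N and Acc = (N - D - S - I)/N. A self-contained sketch of the computation:

```python
def corr_and_acc(ref, hyp):
    """Compute Corr(%) and Acc(%) between a reference label sequence
    and a recognized one via a minimum-edit-distance alignment:
    Corr = (N-D-S)/N, Acc = (N-D-S-I)/N."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (edit cost, substitutions, deletions, insertions)
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)            # delete all reference labels
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)            # insert all hypothesis labels
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            best = (c + sub, s + sub, d, ins)       # match/substitution
            c, s, d, ins = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, d + 1, ins)       # deletion
            c, s, d, ins = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, d, ins + 1)       # insertion
            dp[i][j] = best
    _, S, D, I = dp[n][m]
    return 100.0 * (n - D - S) / n, 100.0 * (n - D - S - I) / n

# Example: one deletion ("er" missed) -> Corr = Acc ≈ 66.7
print(corr_and_acc(["yi", "er", "ling"], ["yi", "ling"]))
```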

Table 4-15 Six different features adopted in this thesis

Feature                                            Order   Number of filter banks   Energy
Linear Predictive Coefficients (LPC_39)             39              -                  √
Linear Predictive Coefficients (LPC_38)             38              -                  √
Linear Predictive Reflection Coefficients (RC)      39              -                  √
LPC Cepstrum Coefficients (LPCC)                    39              -                  √
Mel-Frequency Cepstral Coefficients (MFCC)          39              26                 √
Perceptual Linear Prediction Coefficients (PLP)     39              26                 √

Fig. 4-8 Flow chart of training the syllable-based HMMs: the connected-digits corpus passes through front-end processing (pre-emphasis, frame blocking, Hamming windowing, and feature extraction); model training then uses the transcriptions and the boundary information to produce the 12 HMM models (yi, er, san, ..., ling, sp, and sil).
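As a reference for the front-end blocks in Figs. 4-8 and 4-9, here is a minimal numpy sketch of pre-emphasis, frame blocking, and Hamming windowing. The pre-emphasis coefficient 0.97 and the 25 ms/10 ms framing are common defaults assumed for illustration; the values actually used are those of Table 4-11.

```python
import numpy as np

def front_end_frames(signal, fs=16000, pre_emph=0.97,
                     frame_ms=25.0, shift_ms=10.0):
    """Pre-emphasis, frame blocking, and Hamming windowing (sketch).
    Parameter values are assumed defaults, not those of Table 4-11.
    Assumes the signal is at least one frame long."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1]
    x = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    flen = int(fs * frame_ms / 1000)      # frame length in samples
    shift = int(fs * shift_ms / 1000)     # frame shift in samples
    n_frames = 1 + max(0, (len(x) - flen) // shift)
    win = np.hamming(flen)
    # Frame blocking + windowing: one Hamming-windowed row per frame
    frames = np.stack([x[i * shift:i * shift + flen] * win
                       for i in range(n_frames)])
    return frames                          # shape: (n_frames, flen)
```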

Table 4-16 Comparison of the Corr (%) and Acc (%) of different features

Number of     LPC_38        LPC_39        RC            LPCC          MFCC          PLP
iterations    Corr   Acc    Corr   Acc    Corr   Acc    Corr   Acc    Corr   Acc    Corr   Acc
1             79.7   77.1   80.5   77.6   90.9   89.7   94.3   93.8   96.5   95.9   96.6   96.1
2             84.0   80.8   83.5   80.3   94.3   92.9   96.6   96.3   98.2   98.0   98.4   98.1
3             85.0   81.8   84.3   81.2   94.7   93.3   97.1   96.9   98.4   98.2   98.5   98.4
4             84.8   81.9   84.6   81.6   94.9   93.5   97.2   97.1   98.6   98.4   98.5   98.3
5             85.2   82.3   84.5   81.4   94.8   93.4   97.3   97.1   98.7   98.5   98.5   98.4
6             85.2   82.4   84.6   81.6   94.9   93.6   97.2   97.0   98.7   98.5   98.6   98.5
7             85.3   82.6   84.7   81.7   95.0   93.8   97.2   97.0   98.7   98.5   98.6   98.5
8             85.3   82.7   84.6   81.6   94.8   93.7   97.3   97.1   98.6   98.5   98.6   98.5
9             85.4   82.9   84.3   81.2   95.0   93.9   97.3   97.2   98.6   98.5   98.6   98.5
10            85.6   83.2   84.5   81.3   95.1   93.9   97.2   97.1   98.6   98.5   98.5   98.4

Fig. 4-9 Flow chart of testing the syllable-based HMMs: test utterances from the connected-digits corpus pass through the same front-end processing (pre-emphasis, frame blocking, Hamming windowing, and feature extraction); speech recognition then uses the 12 HMM models (yi, ..., ling, sp, and sil) together with the grammar, and the results are analysed against the transcriptions of the test corpus to obtain the recognition rate.

Comparisons of the different features in terms of average and best Correct (%) and Accuracy (%) are shown in Fig. 4-11. In this task, both PLP and MFCC are good choices for connected-digits speaker-independent recognition because of their high recognition rates (Correct (%) and Accuracy (%)). LPC_38 performs better than LPC_39 because the sampling frequency of the speech is 16 kHz: by the guideline for selecting the filter order p introduced in Chapter 2, p should be chosen as 18-20 to represent the characteristics of the filter, so the recognition rate of LPC_38 (p = 18) is higher than that of LPC_39. As for the reflection coefficients (RC), their performance is much better than that of LPC_38. From the results of the two experiments it can be inferred that RC is more suitable for representing the speech signal in small-vocabulary speech recognition.
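To make the role of the order p and of the reflection coefficients concrete, below is a small numpy sketch of the Levinson-Durbin recursion, which solves the LPC normal equations and yields the reflection coefficients k_i as a by-product; the function name and the plain autocorrelation method are illustrative.

```python
import numpy as np

def levinson_durbin(frame, p):
    """Solve the order-p LPC normal equations by the Levinson-Durbin
    recursion (sketch).  Returns the predictor coefficients a_1..a_p
    and the reflection coefficients k_1..k_p."""
    # Autocorrelation r[0..p] of the (windowed) frame
    r = np.array([np.dot(frame[:len(frame) - i], frame[i:])
                  for i in range(p + 1)])
    a = np.zeros(p + 1)     # a[0] unused; a[j] holds a_j
    k = np.zeros(p)
    e = r[0]                # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = acc / e                      # i-th reflection coeff.
        a_new = a.copy()
        a_new[i] = k[i - 1]
        a_new[1:i] = a[1:i] - k[i - 1] * a[i - 1:0:-1]
        a = a_new
        e *= 1.0 - k[i - 1] ** 2                # error shrinks each order
    return a[1:], k
```

The error energy e shrinking at each order gives a natural check on whether increasing p still buys modeling power, which is the intuition behind the p = 18-20 guideline quoted above.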

Fig. 4-10 Comparison of the different features (LPC_38, LPC_39, RC, LPCC, MFCC, and PLP) over 1-10 training iterations: (a) Correct (%) (b) Accuracy (%).

Fig. 4-11 Syllable-based HMM experiment, comparing LPC_38, LPC_39, RC, LPCC, MFCC, and PLP: (a) Average Correct (%) (b) Average Accuracy (%) (c) Max Correct (%): 85.6, 84.7, 95.1, 97.3, 98.7, and 98.6, respectively (d) Max Accuracy (%).

Chapter 5 Conclusions

A short summary of the different features follows. LPC assumes that the vocal tract transfer function can be modeled by an all-pole filter, with the number of coefficients chosen to be sufficient to represent the vocal tract. The Reflection Coefficients (RC) model the reflection rate at each transition, where the acoustic waves in the vocal tract are partially reflected and interfere with the waves approaching from behind. The LPC-derived Cepstral Coefficients (LPCC) are a compact parametric representation of the spectrum of the speech signal that can efficiently separate the excitation source from the all-pole filter. The idea of the Mel-Frequency Cepstral Coefficients (MFCC) is to use a nonlinear frequency scale to approximate the behavior of the auditory system. Perceptual Linear Predictive (PLP) analysis combines several engineering approximations of the psychophysics of human hearing, including critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law.
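As an illustration of how the LPCC is derived from the LPC coefficients, the following sketch implements the standard LPC-to-cepstrum recursion c_n = a_n + Σ_{k=1}^{n-1} (k/n) c_k a_{n-k} (for n > p only the summation term remains); the gain term c_0 is omitted, and the exact signs depend on the predictor sign convention.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a_1..a_p into LPCC c_1..c_{n_ceps}
    via the standard recursion (sketch; gain term c_0 omitted):
        c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},    n <= p
        c_n =       sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k},  n > p
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)            # c[0] unused here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0       # the a_n term
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```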

The results of the experiments can be explained by the nature of the features.

In earlier approaches to feature extraction, such as LPC and RC, the focus was on modeling the human vocal tract. However, the performance is not satisfactory, especially for multi-user systems. Moreover, the production model concerns only the vocal tract, which varies from person to person. From the experimental results, it can be inferred that the variation of the vocal tract between persons is larger than the variation of the ear between persons. From the human point of view, general communication comprises two sides, speech production and hearing.

Intuitively, the objective of a speaker-independent speech recognition system is to recognize the speech of different users: the key point is not who produced the speech but what the content was. In this case, modeling the receiving side is more effective than modeling the production side. Therefore, features based on speech perception, such as MFCC and PLP, are superior to features based on speech production, such as LPC, LPCC, and RC, in the speaker-independent experiments.

In this thesis, the performance of different speech features for speaker-independent speech recognition has been evaluated. PLP is not always better than MFCC, given the small differences between their recognition rates in the experiments of the previous chapter, but in most cases PLP and MFCC perform better than LPC, RC, and LPCC in a speaker-independent speech recognition system. The conclusions are as follows. First, features derived from the FFT spectrum (MFCC, PLP) preserve more phonetic information than those derived from the LPC spectrum (LPC, LPCC, RC). Second, the cepstrum parameters (LPCC) achieve a higher recognition rate than LPC and RC. Third, nonlinear frequency analysis performs better than linear frequency analysis. Fourth, LPC_38 performs better than LPC_39. Fifth, PLP provides the highest phonetic discrimination for monophone-based speaker-independent speech recognition. In addition, a performance comparison is given in Table 5-1; it shows that the perceptual model is more effective than the production model in a speaker-independent speech recognition system.

Table 5-1 Performance comparison of the different features (Acc, %)

                               PLP    MFCC   LPCC   RC     LPC_38  LPC_39
Monophone-based experiment     78.9   78.3   72.2   56.0   59.2    54.5
Word-based experiment          98.5   98.5   97.2   93.9   83.2    81.7

Due to the limitations of the corpus and the difficulty of training on a large database, the experiments do not cover the statistics of all tasks of interest; for example, a robustness test of the speech features for speaker-independent speech recognition in different noisy environments was not carried out. In practice, both the environment (echo, channel effects, noise, etc.) and the speakers (speaking rate, gender, age, etc.) affect the performance of the recognition system.

Since it is hard to improve the features from the viewpoint of physiology, finding a suitable feature for a particular task and adding new techniques to eliminate the influence of the environment and the speakers are more feasible directions for future work.

