Conclusions and Future Work - Improved Modulation Spectrum Factorization for Robust Speech Recognition

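This thesis investigated three improvements to nonnegative matrix factorization (NMF) and applied them to the modulation spectrum, with the aim of extracting more robust basis vectors and thereby improving the noise robustness of speech recognition.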

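The first is nonsmooth nonnegative matrix factorization (nsNMF), which changes the conventional NMF model by adding a smoothing matrix S. Because the model is a product of matrices, smoothing one factor forces the other factor to become sparse.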
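The interplay between the smoothing matrix and sparsity can be illustrated with a minimal NumPy sketch. It assumes the commonly published form of the smoothing matrix, S = (1 - θ)I + (θ/q)11ᵀ, and, for brevity, uses Euclidean-cost multiplicative updates; the function names, the parameter `theta`, and the choice of cost function are illustrative assumptions, not necessarily the configuration used in the thesis experiments.

```python
import numpy as np

def smoothing_matrix(q, theta):
    """S = (1 - theta) * I + (theta / q) * ones; theta in [0, 1], theta = 0 gives plain NMF."""
    return (1.0 - theta) * np.eye(q) + (theta / q) * np.ones((q, q))

def nsnmf(V, q, theta=0.5, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative V (n x m) by W @ S @ H.

    Assumption: Euclidean (Frobenius) cost with multiplicative updates,
    chosen here only to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, q)) + eps
    H = rng.random((q, m)) + eps
    S = smoothing_matrix(q, theta)
    for _ in range(n_iter):
        WS = W @ S                                # H sees the smoothed basis
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)
        SH = S @ H                                # W sees the smoothed encoding
        W *= (V @ SH.T) / (W @ SH @ SH.T + eps)
    return W, S, H
```

Each row of S sums to one, so S acts as an averaging operator: as `theta` grows, W S and S H become smoother, and W and H have to become sparser to keep the reconstruction accurate.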

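The second is graph regularized nonnegative matrix factorization (GNMF), which adds an extra regularization term to the loss function. Drawing on the geometric structure of the data and the local invariance assumption, it measures how closely the training utterances are related and builds a weight matrix from those relations, which increases the discriminative power of the model.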
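As a rough illustration of the regularization term, the sketch below builds a weight matrix over training samples and evaluates an objective of the usual GNMF form, ||X - UVᵀ||² + λ·Tr(VᵀLV) with L = D - W. The 0/1 k-nearest-neighbour weighting, the Euclidean distance, and all function and parameter names are assumptions made for the example; the weighting actually used in the thesis may differ.

```python
import numpy as np

def knn_weight_matrix(X, k=3):
    """Symmetric 0/1 weights between the columns of X (one column per training sample).

    Assumption: two samples are linked if either is among the other's
    k nearest neighbours in Euclidean distance.
    """
    m = X.shape[1]
    sq = (X * X).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                       # a sample is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((m, m))
    W[np.repeat(np.arange(m), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                          # symmetrise

def gnmf_objective(X, U, V, W, lam):
    """Reconstruction error plus lam times the graph smoothness penalty Tr(V^T L V)."""
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian L = D - W
    fit = np.linalg.norm(X - U @ V.T, "fro") ** 2
    smooth = np.trace(V.T @ L @ V)
    return fit + lam * smooth
```

The penalty equals half the weighted sum of squared distances between the encodings of linked samples, so it is small when related utterances receive similar encoding vectors.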

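The third is NMF with histogram equalization (HNMF). It makes use of the encoding matrix that is otherwise discarded after the training stage: histogram equalization is applied to this matrix to build and store a lookup table, so that at test time the statistics of the encoding vectors can be restored toward their clean-condition state.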
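The table-based idea can be sketched as follows, assuming a quantile table per encoding dimension and a rank-based estimate of the test-side cumulative distribution. Both choices, as well as the function names and the `n_bins` parameter, are hypothetical and serve only to show the mechanism; they are not taken from the thesis.

```python
import numpy as np

def build_reference_table(H_train, n_bins=100):
    """For each encoding dimension (row of H_train), store n_bins training quantiles."""
    probs = (np.arange(n_bins) + 0.5) / n_bins
    table = np.quantile(H_train, probs, axis=1).T      # shape (q, n_bins)
    return probs, table

def equalize(H_test, probs, table):
    """Replace each test value by the training quantile at its empirical CDF position."""
    q, t = H_test.shape
    out = np.empty((q, t), dtype=float)
    for r in range(q):
        ranks = np.argsort(np.argsort(H_test[r]))      # 0 .. t-1, rank within the utterance
        cdf = (ranks + 0.5) / t                        # empirical CDF estimate
        out[r] = np.interp(cdf, probs, table[r])       # look up the reference table
    return out
```

The mapping is monotonic within each dimension, so the ordering of the encoding values is kept while their distribution is pulled toward the one observed in training. The output also stays nonnegative whenever the training encodings are nonnegative.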

All three improvements to NMF yielded gains when applied to Aurora-2.

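Overall, nsNMF, which relies on sparsity, is the more stable method. GNMF does not bring a large improvement, but incorporating the mutual information between utterances into the NMF model still helps, and combining GNMF with nsNMF or HNMF also raises accuracy. HNMF delivers a good improvement when the number of basis vectors is small, even outperforming nsNMF, but it has the drawback that the gain becomes marginal when the number of basis vectors is large.

For the acoustic model, this thesis also used the neural networks in Kaldi (DNN, CNN) in place of the conventional GMM. The results show that the neural networks perform better under multi-condition training but are unsatisfactory under clean-condition training. Combining them with the NMF improvements also brings further gains.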

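As for future work, we hope to incorporate other kinds of information into NMF, in the way that GNMF exploits the relations between utterances. Different kinds of information differ in how much they matter to NMF, so adding information that suits the NMF model may yield larger performance gains. For HNMF, we hope to investigate further why its effectiveness drops sharply as the number of basis vectors increases.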

