
Chapter 4 Conclusions and Future Works

4.2 Future Works

Several directions are worth pursuing in future work. First, we are interested in generalizing the proposed approach to spontaneous-speech ASR. To this end, the three models (AM, LM, and HPM) need to be extended to account for the special characteristics of spontaneous speech, such as disfluency. A preliminary study has been conducted to construct a hierarchical prosodic model for spontaneous Mandarin speech [35].

Second, it would be interesting to scale up the proposed approach to ASR with a larger vocabulary that contains many compound words. This can be done by modifying the first-stage recognition in three steps: constructing an LM for a lexicon comprising both words and subwords, generating a mixed word/subword lattice using the new LM, and forming compound words from subwords by applying word-compounding rules. The second-stage recognition can then be applied directly.

Third, the computational complexity of the proposed approach needs to be reduced for on-line system implementation. This can be done by applying prosodic models to reduce the size of the word lattice generated by the first-stage recognition. Specifically, we can incorporate the syllable-juncture prosodic-acoustic model into the first-stage recognition to detect B3 and B4 breaks from long silences and generate a word lattice for each PPh-like segment instead of one large word lattice for the whole utterance. The second-stage recognition can then operate as a PPh-by-PPh decoding process. This can greatly speed up the second-stage Viterbi decoding as well as reduce the decoding delay. In addition, the size of a PPh word lattice can be further reduced by verifying its constituent words with the syllable-juncture prosodic-acoustic model, so as to exclude unqualified words whose prosodic features do not match the intraword prosodic cues.

Fourth, error analysis shows that the WER improvement of the proposed system is seriously hampered by OOVs. Since most OOVs are named entities, incorporating an LM for named entities should be helpful.

Fifth, some high-level linguistic features, such as word chunk, phrase, and syntax, are not yet used in this study. Designing new prosodic models that include them should be useful for further improving the recognition performance as well as for decoding the syntactic structure of the test utterance.

Lastly, applying the same technique to other languages, such as English, should be of interest to the speech processing community.
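As a rough illustration of the PPh-by-PPh idea, the following Python sketch splits a syllable sequence into PPh-like segments wherever a long silence occurs. It is only a minimal sketch under stated assumptions: a plain pause-duration threshold stands in for the syllable-juncture prosodic-acoustic model, and the `Syllable` structure, the function name, and the 200 ms threshold are hypothetical and not taken from this study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    """One recognized syllable (hypothetical structure for illustration)."""
    text: str
    pause_after_ms: float  # silence duration at the juncture following this syllable


def split_at_long_pauses(syllables: List[Syllable],
                         min_break_pause_ms: float = 200.0) -> List[List[Syllable]]:
    """Split a syllable sequence into PPh-like segments at long silences.

    A juncture whose pause is at least `min_break_pause_ms` is treated as a
    major break (standing in for B3/B4). The threshold is an illustrative
    assumption, not a value from the thesis.
    """
    segments: List[List[Syllable]] = []
    current: List[Syllable] = []
    for syl in syllables:
        current.append(syl)
        if syl.pause_after_ms >= min_break_pause_ms:
            # Long silence: close the current PPh-like segment here.
            segments.append(current)
            current = []
    if current:
        # Trailing syllables that are not followed by a long pause.
        segments.append(current)
    return segments
```

Each returned segment could then be given its own, smaller word lattice and passed to the second-stage decoder in turn, so that decoding can start before the whole utterance has been processed.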

Bibliography

[1] S. Ananthakrishnan and S. Narayanan, “Unsupervised adaptation of categorical prosody models for prosody labeling and speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 1, pp. 138-149, Jan. 2009.

[2] S. Ananthakrishnan and S. Narayanan, “Improved speech recognition using acoustic and lexical correlates of pitch accent in a n-best rescoring framework,” Proc. of ICASSP, pp. IV-873-IV-876, 2007.

[3] S. Ananthakrishnan and S. Narayanan, “Prosody-enriched lattices for improved syllable recognition,” Proc. of INTERSPEECH, pp. 1813-1816, 2007.

[4] K. Chen, M. Hasegawa-Johnson, A. Cohen, S. Borys, S.-S. Kim, J. Cole, and J.-Y. Choi, “Prosody dependent speech recognition on radio news corpus of American English,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 232-245, January 2006.

[5] D. H. Milone and A. J. Rubio, “Prosodic and accentual information for automatic speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 11, no. 4, pp. 321-333, July 2003.

[6] D. Vergyri, A. Stolcke, V. R. R. Gadde, L. Ferrer, and E. Shriberg, “Prosodic Mandarin speech and its application to speech-to-text conversion,” Speech Communication, vol. 36, pp. 247-265, 2002.

[9] J.-T. Huang and L.-S. Lee, “Improved large vocabulary Mandarin speech recognition using prosodic features,” Proc. of SPEECH PROSODY, 2006.

[10] J.-T. Huang and L.-S. Lee, “Prosodic modeling in large vocabulary Mandarin speech recognition,” Proc. of ICSLP, 2006.

[11] X. Lei and M. Ostendorf, “Word-level tone modeling for Mandarin speech recognition,” Proc. of ICASSP, pp. IV-665-IV-668, 2007.

[12] C. Ni, W. Liu, and B. Xu, “Improved large vocabulary Mandarin speech recognition using prosodic and lexical information in maximum entropy framework,” Proc. of CCPR, 2009.

[13] C. Ni, W. Liu, and B. Xu, “Using prosody to improve Mandarin automatic speech recognition,” Proc. of INTERSPEECH, 2010.

[14] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, “Enriching speech recognition with automatic detection of sentence boundaries and disfluencies,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1526-1540, September 2006.

[15] E. Shriberg and A. Stolcke, “Prosody modeling for automatic speech recognition and understanding,” in Proc. workshop on mathematical foundations of natural language modeling, 2002.

[16] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, “ToBI: A standard for labeling English prosody,” Proc. of ICSLP, vol. 2, pp. 867-870, 1992.

[17] D. Hirst and A. Di Cristo, “Intonation systems: A survey of twenty languages,” Cambridge University Press, 1998.

[18] V. K. R. Sridhar, S. Bangalore, and S. S. Narayanan, “Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework,” IEEE Trans. on Audio, Speech and Language Processing, vol. 16, no. 4, pp. 797-811, May 2008.

[19] J.-H. Jeon and Y. Liu, “Automatic prosodic events detection using syllable-based acoustic and syntactic features,” Proc. of ICASSP, pp. 4565-4568, 2009.

[20] C.-Y. Chiang, S.-H. Chen, H.-M. Yu, and Y.-R. Wang, “Unsupervised joint prosody labeling and modeling for Mandarin speech,” Journal of the Acoustical Society of America, vol. 125, no. 2, pp. 1164-1183, Feb. 2009.

[21] C.-Y. Tseng, S.-H. Pin, Y.-L. Lee, H.-M. Wang, and Y.-C. Chen, “Fluent speech prosody: Framework and modeling,” Speech Communication, vol. 46, pp. 284-309, 2005.

[24] J. A. Bilmes and K. Kirchhoff, “Factored language models and generalized parallel backoff,” Proc. of HLT/NAACL, pp. 4-6, 2003.

[25] A. Stolcke, “SRILM – An extensible language modeling toolkit,” in Proc. ICSLP, 2002.

[26] P. Beyerlein, “Discriminative model combination,” Proc. of ICASSP, pp. 481-484, 1998.

[27] B.-H. Juang, W. Chou, and C.-H. Lee, “Statistical and discriminative methods for speech recognition,” in Speech Recognition and Coding - New Advances and Trends, ed. A. J. Rubio Ayuso, J. M. Lopez Soler, Springer-Verlag, Berlin-Heidelberg, 1995.

[28] Mandarin microphone speech corpus – TCC300, http://www.aclclp.org.tw/use_mat.php#tcc300edu.

[29] “HTK Web-Site”, http://htk.eng.cam.ac.uk. Accessed 2009

recognition,” in Proc. ICASSP, pp. 49-52, 1986.

[31] C.-R. Huang, K.-J. Chen, F.-Y. Chen, Z.-M. Gao, and K.-Y. Chen, “Sinica treebank: design criteria, annotation guidelines, and on-line interface,” Proceedings of 2nd Chinese Language Processing Workshop, Hong Kong, pp. 29-37, 2000.

[32] S.-H. Chen, W.-H. Lai, and Y.-R. Wang, “A statistics-based pitch contour model for Mandarin speech,” Journal of the Acoustical Society of America, vol. 117, no. 2, pp. 908–925, February 2005.

[33] S.-H. Chen, W.-H. Lai, and Y.-R. Wang, “A new duration modeling approach for Mandarin speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 11, no. 4, pp. 308–320, July 2003.

[34] Y. Xu, “Contextual tonal variations in Mandarin,” Journal of Phonetics, vol. 25, pp. 61-83, 1997.

[35] Y.-L. Chou, C.-Y. Chiang, Y.-R. Wang, H.-M. Yu, S.-H. Chen, “Prosody labeling and modeling for Mandarin spontaneous speech,” Proc. of SPEECH PROSODY, Chicago, USA, May 2010.

[36] J.-H. Yang, M.-C. Liu, H.-H. Chang, C.-Y. Chiang, Y.-R. Wang, and S.-H. Chen, “Enriching Mandarin speech recognition by incorporating a hierarchical prosody model,” Proc. of ICASSP, May 2011.

[37] S.-H. Chen, J.-H. Yang, C.-Y. Chiang, M.-C. Liu, and Y.-R. Wang, “A New Prosody-Assisted Mandarin ASR System,” to appear in IEEE Transactions on Audio, Speech, & Language Processing, vol. 20, no. 5, July 2012.

[38] K. Tokuda, T. Masuko, T. Kobayashi, and S. Imai, “Mel-generalized cepstral analysis - a unified approach to speech spectral estimation,” Proceedings of the International Conference on Spoken Language Processing, pp. 1043–1046, Yokohama, Japan, September 1994.

[39] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” Proc. of ICASSP, pp. 1315-1318, June 2000.

[40] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, “The HMM-based speech synthesis system version 2.0,” Proc. of ISCA SSW6, Bonn, Germany, Aug. 2007.

Publication List

Journal Paper

[1] Sin-Horng Chen, Jyh-Her Yang, Chen-Yu Chiang, Ming-Chieh Liu, and Yih-Ru Wang, “A New Prosody-Assisted Mandarin ASR System,” to appear in IEEE Trans. on Audio, Speech, & Language Processing, vol. 20, no. 5, July 2012.

[2] Yuan-Fu Liao, Jyh-Her Yang, and Sin-Horng Chen, “Soft-decision A Priori Knowledge Interpolation for Robust Telephone Speaker Identification,” Journal of the Chinese Institute of Engineers, pp. 627-637, July 2009.

Conference Papers

[1] Yu, Hsiu-min, Hsiu-hsueh Liu, Jyh-her Yang, Chen-yu Chiang, and Sin-horng Chen, “Tonal Contrast and Pitch Range in L2 Taiwan Min Produced by Native Si-Xien Hakka Speakers,” presented at the 12th Conference on Min Languages (第 12 屆閩語國際學術研討會), in Proceedings of the 12th Conference on Min Languages, pp. 333-347, Taipei, Taiwan, 2011.

[2] Tzu-Hsuan Chiu, Chen-Yu Chiang, Yuan-Fu Liao, Jyh-Her Yang, Yih-Ru Wang, and Sin-Horng Chen, “Prosody-dependent Acoustic Modeling for Mandarin Speech Recognition,” accepted by Speech Prosody 2012.

[3] Jyh-Her Yang, Ming-Chieh Liu, Hao-Hsiang Chang, Chen-Yu Chiang, Yih-Ru Wang, and Sin-Horng Chen, “Enriching Mandarin speech recognition by incorporating a hierarchical prosody model,” in Proc. of ICASSP 2011, pp. 5052-5055, 2011.

[4] Chen-Yu Chiang, Jyh-Her Yang, Ming-Chieh Liu, Yih-Ru Wang, Yuan-Fu Liao, and Sin-Horng Chen, “A New Model-based Mandarin-speech Coding System,” in Proc. Interspeech 2011, Florence, Italy, pp. 2561-2564, Sept. 2011.

[5] Yuan-Fu Liao, Zhi-Xian Zhuang, and Jyh-Her Yang, “Maximum Likelihood A Priori Knowledge Interpolation-Based Handset Mismatch Compensation for Robust Speaker Identification”, to appear in Tsinghua Science and Technology, 2008.

[6] Yuan-Fu Liao, Jyh-Her Yang, Chi-Hui Hsu, Cheng-Chang Lee, and Jing-Teng Zeng, “A Reference Model Weighting-based Method for Robust Speech Recognition,” in Proc. of InterSpeech, 2007.

[7] Yuan-Fu Liao, Zhi-Xian Zhuang, and Jyh-Her Yang, “Maximum Likelihood A Priori Knowledge Interpolation-Based Handset Mismatch Compensation for Robust Speaker Identification”, NCMMSC, 2007.

[9] Jyh-Her Yang and Yuan-Fu Liao, “Unseen Handset Mismatch Compensation Based on A Priori Knowledge Interpolation for Robust Speaker Recognition”, in Proc. of ICSLP’2004.

[10] Jyh-Her Yang and Yuan-Fu Liao, “Unseen Handset Mismatch Compensation Based On Feature/Model-Space A Priori Knowledge Interpolation For Robust Speaker Recognition,” in Proc. of ISCSLP’2004.

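Doctoral Candidate Information

Name: 楊智合 (Jyh-Her Yang)

Gender: Male

Date of birth: December 10, 1979

Native place: Taoyuan County

Education:

B.S., Department of Electrical Engineering, National Yunlin University of Science and Technology (August 1998 to June 2002)
M.S., Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology (August 2002 to July 2004)
Ph.D. program, Institute of Communications Engineering, National Chiao Tung University (August 2004 to July 2012)

Dissertation title: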

新韻律輔助中文語音辨認系統及其應用

A New Prosody-Assisted Mandarin ASR System and Its Application
