
6.2.1 Baseline Experiments

Tables 6-13 and 6-14 present the baseline text and speech recognition experiments on the AMI meeting corpus. As the tables show, the basic LSTM language model also performs well on AMI, reducing perplexity by a relative 20.71% and the word error rate by a relative 6.99%.

                        Dev set          Test set
                        Perplexity       Perplexity

Trigram LM              85.19            76.44
Basic LSTM LM           68.02 (73.40)    60.61 (65.40)

Table 6-13: Baseline text experiments on the AMI meeting corpus
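For concreteness, the quoted relative perplexity reduction follows directly from the test-set figures in Table 6-13:

    relative reduction = (76.44 − 60.61) / 76.44 ≈ 20.71%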

                                          Dev set    Test set
                                          WER        WER

First-pass ASR decoding (trigram LM)      23.25%     23.02%
Second-pass ASR decoding (basic LSTM LM)  21.17%     20.41%

Table 6-14: Word error rates (%) on the AMI meeting corpus

6.2.2 Speaker Idiolect Features

The speaker idiolect features do not perform particularly well in the AMI text experiments, but they are effective in the speech recognition experiments, reducing the word error rate by a relative 0.5%.

                         Dev set     Test set
                         Accuracy    Accuracy

Speaker identification   74.55%      60.91%

Table 6-15: Speaker feature extractor on the AMI meeting corpus

                                        Dev set          Test set
                                        Perplexity       Perplexity

Basic LSTM LM                           68.02 (73.40)    60.61 (65.40)
CNN-based speaker idiolect features     66.28 (104.11)   59.05 (93.12)

Table 6-16: AMI meeting text experiments with speaker idiolect features as auxiliary input

                                        Dev set    Test set
                                        WER        WER

Basic LSTM LM                           21.17%     20.41%
CNN-based speaker idiolect features     21.07%     20.32%

Table 6-17: AMI meeting speech recognition experiments with speaker idiolect features as auxiliary input

6.2.3 Speaker-Adaptive Mixture Model

For the adaptive mixture model, the results are not as good as on the Mandarin meeting corpus, possibly because the speaker population is so large that the selected specific speakers account for too small a proportion of all speakers.
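The mixture mechanism is not restated in this section, but a minimal sketch of one common formulation, in which speaker-dependent interpolation weights gate several component LSTM language models, may help fix ideas; the class name, layer sizes, and gating design below are illustrative assumptions, not the thesis's exact architecture.

    import torch
    import torch.nn as nn

    class SpeakerAdaptiveMixtureLM(nn.Module):
        """Illustrative sketch: K component LSTM LMs interpolated with
        weights predicted from a speaker feature vector (assumed design)."""

        def __init__(self, vocab_size, emb_dim, hid_dim, num_components, spk_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.components = nn.ModuleList([
                nn.LSTM(emb_dim, hid_dim, batch_first=True)
                for _ in range(num_components)
            ])
            self.gate = nn.Linear(spk_dim, num_components)  # speaker -> mixture weights
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, tokens, spk_feat):
            # tokens: (batch, seq); spk_feat: (batch, spk_dim)
            x = self.embed(tokens)
            hs = [lstm(x)[0] for lstm in self.components]      # each (batch, seq, hid)
            w = torch.softmax(self.gate(spk_feat), dim=-1)     # (batch, K)
            h = sum(w[:, k, None, None] * hs[k] for k in range(len(hs)))
            return torch.log_softmax(self.out(h), dim=-1)      # next-word log-probs

Under such a formulation, the usefulness of the gate depends on how well the chosen component speakers cover the test speakers, which is in line with the explanation above.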

                                 Dev set          Test set
                                 Perplexity       Perplexity

Basic LSTM LM                    68.02 (73.40)    60.61 (65.40)
Speaker-adaptive mixture model   67.61 (99.80)    60.43 (93.60)

Table 6-18: AMI meeting text experiments with the adaptive mixture model

                                 Dev set    Test set
                                 WER        WER

Basic LSTM LM                    21.17%     20.41%
Speaker-adaptive mixture model   21.04%     20.33%

Table 6-19: AMI meeting speech recognition experiments with the adaptive mixture model

Chapter 7 Conclusions and Future Work

The rise of deep learning has brought major breakthroughs to many fields, and language modeling has been deeply affected as well: new architectures have sprung up everywhere. Although neural networks have raised the accuracy of many prediction tasks, they also have drawbacks: they run and train more slowly than traditional statistical models, and they need a sufficiently large amount of data to train well. Because of their poor runtime efficiency, neural networks cannot handle the first-pass generation of word lattices in automatic speech recognition, so at present they can only be applied to second-pass rescoring of the N-best candidate word sequences, which is also the approach adopted in this thesis. Besides their slower speed, neural network training is also easily affected by the amount of available data.
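To make the two-pass setup concrete, the following is a minimal sketch of second-pass N-best rescoring, assuming a simple linear interpolation of first-pass decoder scores with LSTM language model log-probabilities; the function names and the interpolation weight are illustrative assumptions rather than the exact recipe used in this thesis.

    def rescore_nbest(nbest, lstm_logprob, lm_weight=0.5):
        """Second-pass N-best rescoring (illustrative sketch).

        nbest:        list of (words, first_pass_score) pairs produced by the
                      first-pass decoder (e.g. with a trigram LM).
        lstm_logprob: callable returning the LSTM LM log-probability of a
                      word sequence.
        lm_weight:    assumed interpolation hyperparameter.
        """
        rescored = [
            (words, (1.0 - lm_weight) * score + lm_weight * lstm_logprob(words))
            for words, score in nbest
        ]
        # Best combined score first; the top entry is the final hypothesis.
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)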

In recent years, much language modeling research has sought to address the problem of insufficient data, and the most commonly used approach is model adaptation. Meeting speech recognition often uses a meeting's topic information to assist and adapt the language model; beyond the topic, however, the participants of a meeting also strongly shape its language. This thesis therefore proposed applying speaker adaptation to the training of meeting language models.

This thesis considered two test-time scenarios, "known speakers" and "unknown speakers", and approached the construction of speaker feature extraction models from different angles. We proposed three such models: the "speaker word-usage feature model", the "speaker idiolect feature model", and the "speaker-adaptive mixture model"; the latter two are applicable to the unknown-speaker scenario. The results show that the speaker-adaptive mixture model achieves the best performance in both the known-speaker and the unknown-speaker scenarios, while the speaker idiolect feature model performs worse than the other methods, because the retrieved negative examples fail to represent well the wording contrary to that speaker's habits.

Future research could improve the speaker idiolect model by devising more effective retrieval methods for finding negative examples. Moreover, as technology advances, the content of meeting speech changes quickly, and a lexicon can easily fail to cover an entire meeting; out-of-vocabulary (OOV) words can therefore strongly affect the final recognition result, so studying the impact of OOV words on meeting speech is becoming increasingly important. Because of its special nature, meeting speech is hard to adapt from other corpora, so transfer learning across corpora from different domains is also an important topic for the future.

Looking ahead, research on language model adaptation can be organized along three main directions:

(1) Selection of auxiliary features

Besides topic and speaker features, meeting speech recognition could also consider features such as the scene, the speaker's intention, and emotion.

(2) Extraction of auxiliary features

How to turn discrete labels into features in a continuous space that can assist the training of neural networks (illustrated in the first sketch below).

(3) Architecture of model adaptation

Most current methods that adapt a model with additional information inject the auxiliary features into the input, hidden, or output layer. Future work should look for more efficient ways to exploit such features, for example using the feature as attention weights, as done in this thesis (illustrated in the second sketch below).
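As a minimal illustration of direction (2), a discrete label such as a speaker ID can be mapped into a continuous feature vector through a learned embedding table; the sizes below are assumptions chosen for illustration.

    import torch
    import torch.nn as nn

    # Map discrete speaker IDs into a continuous space that a neural
    # language model can consume (sizes are illustrative).
    num_speakers, spk_dim = 200, 64
    speaker_embedding = nn.Embedding(num_speakers, spk_dim)

    speaker_ids = torch.tensor([3, 17, 42])      # discrete labels
    spk_feats = speaker_embedding(speaker_ids)   # (3, 64) continuous features
    # spk_feats can then be injected into the input, hidden, or output
    # layer of the language model, as described in direction (3).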
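For direction (3), one plausible reading of "using the feature as an attention weight" is to let the auxiliary feature score the language model's hidden states rather than merely being appended to a layer; the sketch below is an assumption-laden illustration, not the exact architecture of this thesis.

    import torch
    import torch.nn as nn

    class FeatureAsAttention(nn.Module):
        """Attend over LSTM hidden states with weights derived from an
        auxiliary (e.g. speaker) feature; shapes and names are assumed."""

        def __init__(self, hid_dim, feat_dim):
            super().__init__()
            self.proj = nn.Linear(feat_dim, hid_dim)  # feature -> query vector

        def forward(self, hidden_states, aux_feat):
            # hidden_states: (batch, seq, hid); aux_feat: (batch, feat)
            query = self.proj(aux_feat).unsqueeze(2)             # (batch, hid, 1)
            scores = torch.bmm(hidden_states, query).squeeze(2)  # (batch, seq)
            weights = torch.softmax(scores, dim=-1)              # attention weights
            # Feature-conditioned summary of the history, usable when
            # predicting the next word.
            context = torch.bmm(weights.unsqueeze(1), hidden_states).squeeze(1)
            return context, weights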
