Comparison and Discussion of Language Models

Model        Test-set character accuracy (%)   Absolute improvement (%)   Relative improvement (%)
Baseline     83.61                             -                          -
2-gram       85.29                             1.68                       10.25
3-gram       85.23                             1.62                       9.88
NN           84.75                             1.14                       6.96
RNN          85.17                             1.56                       9.52
SR-RNN       85.21                             1.60                       9.76
WR-RNN       85.40                             1.79                       10.90
DWR-RNN      85.34                             1.73                       10.57
Cluster-RNN  85.41                             1.80                       10.96
PLSA         83.85                             0.24                       1.46
Perceptron   85.05                             1.44                       8.79
GCLM         84.11                             0.50                       3.05
WGCLM        84.63                             1.02                       6.22
MERT         84.70                             1.09                       6.65

Table 5-13: Experimental results of the various language models

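Table 5-13 compares the different language models; the results fall into six groups. The first is the baseline character accuracy (Baseline). The second is the N-gram language models, namely the bigram (2-gram) and trigram (3-gram) models, which achieve absolute improvements of 1.68% and 1.62% and relative improvements of 10.25% and 9.88%, respectively. The bigram model outperforms the trigram model: the trigram model has far more parameters, so although it can in principle give more accurate estimates, it is also more prone to estimation errors, and in the end it does not surpass the bigram model.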
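The two improvement measures used throughout Table 5-13 can be recomputed from the accuracies alone. The sketch below assumes that the relative improvement is the fraction of the baseline character error rate that a model removes; this definition is inferred from the numbers in the table, not stated in the text, and the function names are illustrative.

```python
# Character accuracies (%) copied from Table 5-13.
BASELINE = 83.61
ACCURACY = {
    "2-gram": 85.29,
    "3-gram": 85.23,
    "NN": 84.75,
    "RNN": 85.17,
    "PLSA": 83.85,
}


def absolute_gain(acc, base=BASELINE):
    """Absolute improvement: difference in character accuracy."""
    return acc - base


def relative_gain(acc, base=BASELINE):
    """Assumed definition: share of the baseline error rate that is removed."""
    return 100.0 * (acc - base) / (100.0 - base)


for name, acc in ACCURACY.items():
    print(f"{name:8s} {absolute_gain(acc):5.2f} {relative_gain(acc):6.2f}")
```

Because the table lists accuracies rounded to two decimals, values recomputed this way can differ from the printed ones in the last digit for some rows.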

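The third group is the neural network language models. Unlike discriminative language models, the neural network language model (NN) estimates probabilities with a nonlinear method and addresses the data sparseness problem, yielding an absolute improvement of 1.14% and a relative improvement of 6.96%. The recurrent neural network language model (RNN) inherits these advantages and can additionally capture longer-distance information, so its character accuracy is a further 0.42% higher than that of the NN model, for absolute and relative improvements of 1.56% and 9.52%. Although these models, when combined with the background language model, perform worse than the bigram and trigram models, the recurrent neural network language model achieves better recognition results when used alone, that is, without the background language model.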
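One common way to combine a neural language model with a background N-gram model is linear interpolation of the two word probabilities. The sketch below assumes that scheme; the text here does not say how the combination was actually performed, and the weight `lam` and the toy probabilities are made up for illustration.

```python
import math


def interpolate(p_rnn, p_background, lam=0.5):
    """Linearly interpolate two probabilities of the same word given the same history."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_rnn + (1.0 - lam) * p_background


def sentence_log_prob(rnn_probs, background_probs, lam=0.5):
    """Sum of log interpolated word probabilities over one hypothesis."""
    return sum(
        math.log(interpolate(pr, pb, lam))
        for pr, pb in zip(rnn_probs, background_probs)
    )


# Toy per-word probabilities for a three-word hypothesis (made-up numbers).
rnn_probs = [0.20, 0.05, 0.10]
bg_probs = [0.10, 0.15, 0.10]
score = sentence_log_prob(rnn_probs, bg_probs, lam=0.5)
print(score)
```

Setting `lam=1.0` corresponds to using the neural model alone, which is the "without the background language model" condition discussed above.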

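The fourth group is the methods proposed in this thesis; for each, Table 5-13 reports the best character accuracy obtained. The recurrent neural network language model using sentence relevance information (SR-RNN) does bring a modest gain, with an absolute improvement of 1.60% and a relative improvement of 9.76%. We then explored more fine-grained relevance information: using word relevance information in the recurrent neural network language model (WR-RNN) improves on the sentence-level variant, giving an absolute improvement of 1.79% and a relative improvement of 10.90%. Word relevance information is still not ideal, however, because every occurrence of a word is mapped to the same relevance information. To address this, we made the word relevance information change dynamically (DWR-RNN). Although this brings no further gain in accuracy, it still performs well relative to the baseline, with an absolute improvement of 1.73% and a relative improvement of 10.57%. Finally, the training sentences are clustered and a separate recurrent neural network language model is trained for each cluster (Cluster-RNN). Because a test sentence can then be estimated by models suited to its characteristics, this approach performs well, with absolute and relative improvements of 1.80% and 10.96%.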
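The idea of letting each test sentence draw on the cluster models that suit it can be sketched as a similarity-weighted mixture of per-cluster probabilities. Everything specific below is an assumption made for illustration: the vector representation of sentences and clusters, the use of cosine similarity, and the normalization of the weights are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_weights(sentence_vec, centroids):
    """Turn similarities to each cluster centroid into weights that sum to 1."""
    sims = [max(cosine(sentence_vec, c), 0.0) for c in centroids]
    total = sum(sims)
    if total == 0.0:
        # No cluster is similar: fall back to a uniform mixture.
        return [1.0 / len(centroids)] * len(centroids)
    return [s / total for s in sims]


def combined_prob(per_cluster_probs, weights):
    """Weighted sum of the probabilities that each cluster model assigns to a word."""
    return sum(w * p for w, p in zip(weights, per_cluster_probs))


# Two hypothetical clusters and a test sentence lying between them.
centroids = [[1.0, 0.0], [0.0, 1.0]]
weights = cluster_weights([1.0, 1.0], centroids)
print(combined_prob([0.3, 0.1], weights))
```

A hard assignment to the single most similar cluster is the special case in which one weight is 1 and the others are 0.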

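The last two groups are the topic model and four common discriminative language models. Probabilistic latent semantic analysis (PLSA) reaches a character accuracy of 83.85%, with absolute and relative improvements of 0.24% and 1.46%. The perceptron algorithm (Perceptron) generalizes relatively poorly but works well when the training and test data are highly correlated; it therefore improves the most among the four discriminative methods, with an absolute improvement of 1.44% and a relative improvement of 8.79%. The global conditional log-linear model (GCLM) is the weakest of the four, with an absolute improvement of only 0.50% and a relative improvement of 3.05%. The weighted global conditional log-linear model (WGCLM) additionally takes sample weights into account and so outperforms GCLM, with absolute and relative improvements of 1.02% and 6.22%. Minimum error rate training (MERT) not only considers sample weights but also generalizes better, so it also performs well among the four discriminative models, with absolute and relative improvements of 1.09% and 6.65%. Comparing against these last two groups, we can see that combining the methods proposed in this thesis with the recurrent neural network language model gives better results.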

Chapter 6 Conclusion and Future Work

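Language models play an influential role in speech recognition, information retrieval, and natural language processing. In speech recognition in particular they are indispensable, since they provide the likelihood of a sentence occurring in natural language. The conventional N-gram language model is one of the most common approaches, but it struggles to capture long-distance sentence information, and it suffers from data sparseness and the curse of dimensionality, problems that have long resisted solution. In recent years new kinds of language models have been proposed, such as discriminative language models and neural network language models. Research by scholars abroad has found neural network language models to be effective: they retain the properties of N-gram language models while mitigating the curse of dimensionality, opening new perspectives for speech recognition and language modeling. Neural network language models still have shortcomings, however, such as the lack of long-distance information, high computational time complexity, and a word representation that fails to reflect the properties of words. Some researchers have therefore turned to a variant of the neural network, building language models from networks with recurrent connections, and these perform better than ordinary neural network language models.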

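The aim of the recurrent neural network language model is to keep the advantages of the ordinary neural network language model while also providing long-distance information. However, [Bengio et al., 1994] showed that under gradient descent, because of the chain rule, the gradient is multiplied by more and more factors as the time span grows and shrinks toward 0, so long-distance information cannot be learned effectively. This thesis therefore sought to improve the recurrent neural network language model further, using relevance information and dynamic adjustment of the language model to assist probability estimation. The experimental results show that relevance information does help, but the effect is not yet pronounced; the likely reason is that the relevance information interferes with the information from the input layer or from the previous time step, which limits the gain. The experiments also showed that removing part of the relevance information can raise recognition accuracy, so the representation of relevance information, and of other information, deserves attention in future work. In addition, this thesis clustered the training corpus and trained a recurrent neural network language model for each cluster, hoping to achieve better recognition accuracy by dynamically adjusting the language model. The experiments show that with two clusters, the similarity-based linear combination method works better. With four clusters, however, each cluster contains too little training data to train a recurrent neural network language model that learns well.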
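The shrinking gradient described in [Bengio et al., 1994] can be illustrated numerically. The sketch below uses a deliberately simplified scalar recurrent unit, h_t = tanh(w * h_{t-1} + x_t), which is an assumption made for illustration and not the network used in this thesis. By the chain rule, the derivative of h_t with respect to h_0 is a product of one factor w * (1 - h_k^2) per time step.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in the scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each time step contributes one more factor w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history


grads = gradient_through_time(w=0.9, inputs=[0.5] * 30)
print(grads[0], grads[9], grads[29])
```

With |w| < 1 every factor has magnitude below 1, so the product decays roughly geometrically with the number of time steps, which is why information from distant time steps contributes almost nothing to the weight update.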

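Future research could address the inability of recurrent neural network language models to learn long-distance information effectively, for example by adding different features or other information to aid estimation, or by making structural changes that target the weaknesses of the backpropagation through time algorithm. Language also keeps evolving, and words that did not exist before keep appearing, so handling the out-of-vocabulary (OOV) problem with different smoothing methods is another important topic. Combining with existing language models, such as topic models or discriminative language models, to give the language model better generalization, adaptation, and even discriminative ability is also worth exploring. The concept of the discriminative language model is quite similar to that of the neural network language model; the difference is that the former is supervised and the latter unsupervised. If the neural network language model were modified into a supervised method, recognition accuracy should improve further. We hope to combine the two kinds of language models in the future and obtain still better recognition results.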

References

1. Chinese-language references

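[邱炫盛, 2007] 邱炫盛, “利用主題與位置相關語言模型於中文連續語音辨識,” Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2007.

[劉鳳萍, 2009] 劉鳳萍, “使用鑑別式語言模型於語音辨識結果重新排序,” Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2009.

[陳冠宇, 2010] 陳冠宇, “主題模型於語音辨識使用之改進,” Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2010.

[劉家妏, 2010] 劉家妏, “多種鑑別式語言模型應用於語音辨識之研究,” Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2010.

[賴敏軒, 2011] 賴敏軒, “實證探究多種鑑別式語言模型於語音辨識之研究,” Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2011.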

2. Western-language references

[Aubert, 2002] X. L. Aubert, “An overview of decoding techniques for large vocabulary continuous speech recognition,” Computer Speech and Language, Vol. 16, No. 1, pp. 89-114, 2002.

[Alexandrescu and Kirchhoff, 2006] A. Alexandrescu and K. Kirchhoff, “Factored neural language models,” in Proc. North American Chapter of the Association for Computational Linguistics, pp. 1-4, 2006.

[Arisoy et al., 2010] E. Arisoy, M. Saraclar, B. Roark, and I. Shafran, “Syntactic and sub-lexical features for Turkish discriminative language models,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5538-5541, 2010.

[Bahl et al., 1983] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, pp. 179-190, 1983.

[Bahl et al., 1986] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual information estimation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 49-52, 1986.

[Brown et al., 1992] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, “Class-based n-gram models of natural language,” Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992.

[Bengio et al., 1993] Y. Bengio, P. Frasconi, and P. Simard, “The problem of learning long-term dependencies in recurrent networks,” in Proc. IEEE International Conference on Neural Networks, Vol. 3, pp. 1183-1188, 1993.

[Bengio et al., 1994] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 157-166, 1994.

[Bengio et al., 2001] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” in Proc. Advances in Neural Information Processing Systems, pp. 933-938, 2001.

[Boden, 2002] Mikael Boden, “A guide to recurrent neural networks and back-propagation,” in the Dallas project, 2002.

[Bellegarda, 2005] J. R. Bellegarda, “Latent semantic mapping,” IEEE Signal Processing Magazine, Vol. 22, No. 5, pp. 70-80, 2005.

[Chen and Goodman, 1996] S. F. Chen, and J. Goodman, “An empirical study of smoothing techniques for language modeling,” in Proc. the 34th annual meeting on Association for Computational Linguistics, pp. 310-318, 1996.

[Clarkson and Robinson, 1997] P. R. Clarkson, and A. J. Robinson, “Language model adaptation using mixtures and an exponentially decaying cache,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 799-802, 1997.

[Chelba and Jelinek, 2000] C. Chelba, and F. Jelinek, “Structured language modeling,” Computer Speech and Language, Vol. 14, No. 4, pp. 283-332, 2000.

[Chen et al., 2004] B. Chen, J.-W. Kuo, and W.-H. Tsai, “Lightly supervised and data-driven approaches to mandarin broadcast news transcription,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 10, No. 1, pp. 1-18, 2004.

[Davis and Mermelstein, 1980] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366, 1980.

[Elman, 1990] J. L. Elman, “Finding structure in time,” Cognitive Science, Vol. 14, No. 2, pp. 179-211, 1990.

[Gales, 1998] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, Vol. 12, pp. 75-98, 1998.

[Gildea and Hofmann, 1999] D. Gildea and T. Hofmann, “Topic-based language models using EM,” in Proc. 6th European Conference on Speech Communication and Technology, pp. 2167-2170, 1999.

[Goodman, 2001] J. Goodman, “Classes for fast maximum entropy training,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 561-564, 2001.

[Goodman, 2001] J. Goodman, “A bit of progress in language modeling,” Computer Speech and Language, pp. 403-434, 2001.

[Gao et al., 2005] J. Gao, H. Suzuki, and W. Yuan, “An empirical study on language model adaptation,” ACM Transactions on Asian Language Information Processing, Vol. 5, No. 3, pp. 209-227, 2005.

[Hermansky, 1990] H. Hermansky, “Perceptual linear predictive analysis of speech,” The Journal of the Acoustical Society of America, Vol. 87, No. 4, 1990.

[Huang et al., 2007] Z. Huang, M. P. Harper, and W. Wang, “Mandarin part-of-speech tagging and discriminative reranking,” in Proc. Empirical Methods in Natural Language Processing, pp. 1093-1102, 2007.

[Jordan, 1986] M. I. Jordan, “Attractor dynamics and parallelism in a connectionist sequential machine,” in Proc. the Eighth Annual Conference of the Cognitive Science Society, pp. 531-546, 1986.

[Juang and Katagiri, 1992] B. H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.

[Katz, 1987] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 3, pp. 400, 1987.

[Kuhn, 1988] R. Kuhn, “Speech recognition and the frequency of recently used words: A modified Markov model for natural language,” in Proc. International Conference on Computational Linguistics, pp. 348-350, 1988.

[Kneser and Ney, 1995] R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 181-184, 1995.

[Kumar, 1997] N. Kumar, Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition, Ph.D. dissertation, Johns Hopkins University, Baltimore, 1997.

[Kang et al., 2011] M. Kang, T. Ng, and L. Nguyen, “Mandarin word-character hybrid-input neural network language model,” in Proc. International Speech Communication Association, pp. 625-628, 2011.

[Lawrence et al., 1996] S. Lawrence, C. L. Giles, and S. Fong, “Can recurrent neural networks learn natural language grammars?,” in Proc. International Conference on Neural Networks, pp. 1853-1858, 1996.

[Le et al., 2011] H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon, “Structured output layer neural network language model,” in Proc. IEEE

