Averaged Perceptron Algorithm and Keyword Extraction

Chapter 5 Experimental Setup and Results

5.3 Experimental Results for the Proposed Methods

5.3.2 Averaged Perceptron Algorithm and Keyword Extraction

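Table 5-10 lists the results for the first 50 training rounds of the experiment that adds keyword features to the Averaged Perceptron algorithm. The lowest character error rate, 17.92%, occurs at round 18 of the experiment that adds keywords not restricted to long words (AllKeyword).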

| Round | Perceptron (%) | LongKeyword (%) | AllKeyword (%) | Round | Perceptron (%) | LongKeyword (%) | AllKeyword (%) |
|---|---|---|---|---|---|---|---|
| 1 | 18.09 | 18.09 | 18.09 | 26 | 18.09 | 18.09 | 18.07 |
| 2 | 18.09 | 18.09 | 18.09 | 27 | 18.04 | 18.05 | 18.04 |
| 3 | 18.10 | 18.10 | 18.10 | 28 | 18.05 | 18.04 | 18.03 |
| 4 | 18.11 | 18.11 | 18.11 | 29 | 18.04 | 18.04 | 18.03 |
| 5 | 18.11 | 18.11 | 18.11 | 30 | 18.04 | 18.02 | 18.01 |
| 6 | 18.14 | 18.14 | 18.14 | 31 | 18.01 | 17.99 | 17.98 |
| 7 | 18.12 | 18.12 | 18.12 | 32 | 18.01 | 17.99 | 17.98 |
| 8 | 18.08 | 18.08 | 18.08 | 33 | 18.00 | 18.00 | 17.98 |
| 9 | 18.09 | 18.09 | 18.09 | 34 | 17.98 | 17.98 | 17.96 |
| 10 | 18.05 | 18.05 | 18.05 | 35 | 17.98 | 17.97 | 17.96 |
| 11 | 17.99 | 17.99 | 17.99 | 36 | 17.96 | 17.96 | 17.96 |
| 12 | 17.99 | 17.99 | 17.99 | 37 | 17.95 | 17.95 | 17.94 |
| 13 | 17.95 | 17.95 | 17.95 | 38 | 17.96 | 17.95 | 17.94 |
| 14 | 17.94 | 17.94 | 17.94 | 39 | 17.99 | 17.97 | 17.96 |
| 15 | 17.95 | 17.95 | 17.94 | 40 | 17.97 | 17.97 | 17.95 |
| 16 | 17.96 | 17.97 | 17.96 | 41 | 17.99 | 17.99 | 17.98 |
| 17 | 18.01 | 17.97 | 17.96 | 42 | 18.01 | 18.01 | 18.00 |
| 18 | 17.97 | 17.95 | 17.92 | 43 | 18.00 | 18.00 | 17.99 |
| 19 | 18.01 | 17.97 | 17.95 | 44 | 18.02 | 18.01 | 18.01 |
| 20 | 18.03 | 18.02 | 18.01 | 45 | 18.03 | 18.01 | 18.01 |
| 21 | 18.03 | 18.03 | 17.99 | 46 | 18.03 | 18.02 | 18.01 |
| 22 | 17.96 | 17.99 | 17.97 | 47 | 18.04 | 18.03 | 18.02 |
| 23 | 18.02 | 18.04 | 18.03 | 48 | 18.03 | 18.03 | 18.01 |
| 24 | 18.03 | 18.02 | 18.01 | 49 | 18.03 | 18.03 | 18.03 |
| 25 | 18.09 | 18.09 | 18.08 | 50 | 18.03 | 18.03 | 18.03 |

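Figure 5-14 plots the results of the plain Averaged Perceptron algorithm and of the Averaged Perceptron algorithm with added keyword features side by side against the training round. After round 13, the methods with keyword features, whether restricted to long words or not, are in most rounds equal to or slightly better than the original Averaged Perceptron algorithm that uses only unigram and bigram features (a few rounds, such as 22 and 23 in Table 5-10, are exceptions). This suggests that keyword features may contribute to lowering the character error rate of the Averaged Perceptron algorithm.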

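In contrast to the earlier experiment that added keyword features to the Boosting algorithm, here the keywords not restricted to long words (AllKeyword) give better results than the long-word keywords (LongKeyword). A possible reason is that, in the experiments of this thesis, the Boosting algorithm updates feature weights using all candidate word sequences (Figure 5-2), whereas the Perceptron algorithm updates them using only the single highest-scoring candidate word sequence (Figure 3-4). Long-word (LongKeyword) features therefore have fewer opportunities to have their weights updated and can hardly take effect.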
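The update behaviour described above can be illustrated with a minimal sketch of averaged perceptron training on N-best lists. It is not the thesis implementation: the function names, the sparse feature dictionaries, the feature names such as `kw:long`, and the use of an oracle index to mark the reference candidate are all assumptions made for illustration.

```python
from collections import defaultdict


def score(weights, feats):
    """Linear score: dot product of a weight map and a sparse feature map."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_averaged_perceptron(nbest_lists, oracle_indices, rounds=3):
    """Train on N-best lists (hypothetical data layout).

    nbest_lists[i] is a list of sparse feature dicts, one per candidate.
    oracle_indices[i] is the index of the candidate treated as correct.
    """
    weights = defaultdict(float)  # current weights
    totals = defaultdict(float)   # running sum of weights, used for averaging
    steps = 0
    for _ in range(rounds):
        for candidates, oracle in zip(nbest_lists, oracle_indices):
            # Only the single top-scoring candidate takes part in the update.
            best = max(range(len(candidates)),
                       key=lambda k: score(weights, candidates[k]))
            if best != oracle:
                for name, value in candidates[oracle].items():
                    weights[name] += value
                for name, value in candidates[best].items():
                    weights[name] -= value
            # Accumulate after every sentence so the average covers all steps.
            for name, value in weights.items():
                totals[name] += value
            steps += 1
    return {name: total / steps for name, total in totals.items()}


# Toy example: the third candidate carries a long-keyword feature, but it is
# never the top-scoring candidate nor the oracle, so its weight is never set.
nbest = [[{"uni:x": 1.0}, {"uni:y": 1.0}, {"kw:long": 1.0}]]
averaged = train_averaged_perceptron(nbest, oracle_indices=[1], rounds=3)
print(averaged)
```

In this toy run the feature `kw:long` never receives a weight, which mirrors the explanation offered above for why features that rarely appear in the top-scoring candidate are updated less often.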

[Figure 5-14: Results of adding keyword features to the Averaged Perceptron algorithm. X-axis: training round (1 to 49 in steps of 2); y-axis: character error rate (%), 17.9 to 18.15; series: Baseline, Perceptron, Perceptron+LongKeyword, Perceptron+AllKeyword.]

Chapter 6 Conclusion

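The language model plays an important role in speech recognition. It represents the regularities of long-standing human language use and is used to judge which word sequence in the recognizer better matches actual usage. However, it may face two problems.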

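First, differences in time or domain cause a mismatch between the training corpus and the test target, so the language model must be adapted with data from the same period or domain. Second, current language models are history-based: they estimate the probability of a word from the preceding word sequence. If the recognition history contains errors, the probabilities of the word sequences are affected, the ranking may be incorrect, and the recognition result suffers.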

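Language model adaptation through discriminative training of a linear model can address both problems at once. On the one hand, adaptation data from the required period or domain can be chosen for training, which changes the recognizer's preference among word sequences. On the other hand, discriminative training adjusts the feature weights of the linear model according to the reference transcriptions in the adaptation data, forming a scoring function that follows the recognition tendencies of that data. At test time, this scoring function can rerank the multiple recognition hypotheses produced by the history-based model and so reduce ranking errors.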
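A minimal sketch of the test-time reranking step follows. The hypothesis layout (`text`, `base_score`, `features`), the `base_weight` parameter, and the example weights are hypothetical; the thesis does not specify them in this section.

```python
def rerank(nbest, weights, base_weight=1.0):
    """Sort hypotheses by baseline score plus weighted discriminative features.

    Each hypothesis is a dict with a baseline (history-based) model score and
    a sparse feature dict. All field names here are illustrative assumptions.
    """
    def total(hyp):
        feature_score = sum(weights.get(name, 0.0) * value
                            for name, value in hyp["features"].items())
        return base_weight * hyp["base_score"] + feature_score

    return sorted(nbest, key=total, reverse=True)


# Hypothesis A is preferred by the baseline model, but the trained weights
# favour the bigram that appears in hypothesis B.
nbest = [
    {"text": "A", "base_score": -10.0, "features": {"bi:x_y": 1.0}},
    {"text": "B", "base_score": -10.5, "features": {"bi:x_z": 1.0}},
]
weights = {"bi:x_y": -0.2, "bi:x_z": 1.0}
print([hyp["text"] for hyp in rerank(nbest, weights)])
```

Keeping the baseline score as one term of the sum means the linear model only corrects the original ranking instead of replacing it.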

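First, this thesis applies language model adaptation based on discriminative language model training to Chinese large-vocabulary speech recognition in order to rerank the recognition results.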

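Second, this method is examined together with model interpolation in two ways: by comparing the effectiveness of the two language model adaptation methods, and by combining them in the hope of further reducing the recognition error rate. In the comparison experiments, model interpolation performs better than the discriminative language model. In the combination experiments, the lowest character error rate in this thesis, 17.08%, is obtained, a 5.74% relative improvement over the baseline, which shows that combining the two adaptation methods helps reduce the recognition error rate.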
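For reference, the relative improvement figure can be read as follows, assuming the usual definition of relative error rate reduction (this section does not state the formula). The baseline value in the second line is back-computed from the two reported numbers and is not quoted from this section.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{CER}_{\text{baseline}} - \mathrm{CER}_{\text{new}}}
         {\mathrm{CER}_{\text{baseline}}} \times 100\%
\]
\[
\mathrm{CER}_{\text{baseline}}
  = \frac{\mathrm{CER}_{\text{new}}}{1 - 0.0574}
  = \frac{17.08\%}{0.9426} \approx 18.12\%
\]
```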

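In addition, this thesis proposes using keywords obtained by automatic keyword extraction as features for discriminative training. The automatic keyword extraction system does not rely on a dictionary; it extracts keywords from the characteristics of the text itself, that is, from language usage habits. Therefore even newly coined words, or words not listed in the dictionary, can be extracted as long as their number of occurrences exceeds a preset threshold.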
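The idea of dictionary-free, threshold-based extraction can be sketched as below. This is a deliberately simplified stand-in that counts character n-grams and keeps those occurring more often than a threshold; it is not the extraction algorithm used in the thesis, and the function name, parameters, and sample sentences are assumptions.

```python
from collections import Counter


def extract_keywords(sentences, threshold=1, min_len=2, max_len=4):
    """Return character n-grams whose count exceeds `threshold`.

    No dictionary or word segmentation is used: every substring of length
    min_len..max_len is a candidate, so unseen or newly coined words can
    surface purely from repeated usage in the text.
    """
    counts = Counter()
    for sentence in sentences:
        for n in range(min_len, max_len + 1):
            for start in range(len(sentence) - n + 1):
                counts[sentence[start:start + n]] += 1
    return {gram: count for gram, count in counts.items() if count > threshold}


sentences = ["語音辨識系統", "語音辨識結果"]
keywords = extract_keywords(sentences, threshold=1)
print(sorted(keywords, key=len, reverse=True))
```

A real extractor would also need to merge or filter overlapping candidates (for example, substrings of a longer repeated string), which this sketch leaves out.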

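Because keywords obtained by automatic extraction are selected directly from the usage patterns in the text, they should better capture the language regularities of the adaptation data as well as words not listed in the dictionary, and thus help lower the character error rate in the experiments. In both the Boosting and the Averaged Perceptron experiments, adding keywords as features helped reduce the recognition error rate, although the gain for the Averaged Perceptron is small (best rates in Table 5-10 of 17.94% without keyword features and 17.92% with AllKeyword features). Applied to a corpus with longer sentences, or to a training setting with many newly coined words, the approach might benefit recognition results even more.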


Source document: Using Discriminative Language Models to Rerank Speech Recognition Results (pp. 89-98)
