
Chapter 5: Conclusions and Future Work

5.2 Future Work

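If the test corpus is redesigned around the three categories of determinative-measure (DM) compounds, personal names, and affixed words, and the language models are trained on a larger corpus so that the probabilities of the phrase-level language models are estimated more accurately, we believe the recognition performance of the proposed method can be improved substantially.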

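The hierarchical LM still has room for improvement. Beyond recognizing individual words correctly, we also hope to use correctly recognized complete phrases to build tight relations between phrases and words, and even between phrases, so as to capture larger and more informative recognition units. One example is the recognition of consecutive DM or Nd words such as 七十八年_十二月_二日 (literally "year seventy-eight, December, the second"). Once such a sequence is recognized as one large phrase, the span of words of the same nature is known, and the phrase can then be related to other words. It may even be possible to capture the relation between a subword inside the phrase and specific words outside the phrase span; for example, the measure word inside a phrase usually influences the choice of neighboring words.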
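The idea of treating a run of consecutive DM or Nd words as one larger unit can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function name `merge_runs`, the `(word, tag)` input format, and the underscore joiner are all hypothetical, and the tags other than Nd in the example are placeholders.

```python
from itertools import groupby

def merge_runs(tagged, mergeable=("Nd", "DM"), joiner="_"):
    """Merge each run of consecutive tokens that share a mergeable tag.

    `tagged` is a list of (word, tag) pairs (a hypothetical input format).
    Runs of length one and tokens with other tags are passed through unchanged.
    """
    merged = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in mergeable and len(words) > 1:
            # Collapse the whole run into a single phrase-level unit.
            merged.append((joiner.join(words), tag))
        else:
            merged.extend((word, tag) for word in words)
    return merged

# Tags "P" and "V" are illustrative placeholders, not taken from the thesis.
tokens = [("於", "P"), ("七十八年", "Nd"), ("十二月", "Nd"), ("二日", "Nd"), ("出生", "V")]
print(merge_runs(tokens))
```

With the merged unit in hand, relations between the phrase and its neighboring words could then be counted at the phrase level rather than word by word.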

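Once this method matures, the same idea can be applied to other Chinese word classes that likewise follow specific linguistic rules. Giving each word class its own language model would describe the characteristics of that class more precisely and thereby help Chinese recognition performance. In addition, we plan to introduce a prosodic model that verifies, during recognition, whether a hypothesized lexical unit is truly a meaningful word, in order to overcome the long-standing problem of word definition in Chinese speech recognition.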


Appendix 1: Measure Word List

鄰里郡區站巷弄段號樓衖市洲地街部司課院科系級股室廳會會兒陣世輩輩子代學期學年年代下子版冊編回章面小節集卷面方面邊頭方拍板眼小節程作倍成分厘毫絲圍指象限度開聯軍師旅團營伍班排連球波端回合折摺流等票桿棒聲次

Verbal measure words (動量詞): 回次遍趟下遭番聲響圈步把仗覺頓關手腳巴掌拳頭拳眼口刀槌子板子鞭子棒棍子陣針箭槍砲場度輪周曲跤記回合票

Appendix 2: Numeral Unit Sets

Set    State ID
一～九    501
十～十九    502
一十一～九十九    503
一百～九百    504
一千～九千    505
一萬～九萬    506
十～十九萬    507
一十一萬～九十九萬    508
一百萬～九百萬    509
一千萬～九千萬    510
一億～九億    511
十一億～十九億    512
一十一億～九十九億    513
一百億～九百億    514
一千億～九千億    515
一兆～九兆    516
十兆～十九兆    517
一十一兆～九十九兆    518
一百兆～九百兆    519
一千兆～九千兆    520
一二～八九, 一二十～八九十    521
零    522
百萬, 千萬, 百億, 千億, 百兆, 千兆    523
百, 千, 萬, 億, 兆    524
兩    525
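As a rough illustration of how such a table might be used, the sketch below maps a handful of numeral units to their State IDs. It is not the implementation used in this work: the names `build_state_table` and `state_id` are hypothetical, only a few of the sets are covered, and the reading of each range (for example, 一百～九百 as the nine round hundreds) is an assumption.

```python
DIGITS = "一二三四五六七八九"

def build_state_table():
    """Build a unit -> State ID lookup for a few sets from the appendix table."""
    table = {}
    for d in DIGITS:
        table[d] = 501            # set 一～九
    table["十"] = 502             # set 十～十九 (十 itself)
    for d in DIGITS:
        table["十" + d] = 502     # 十一 ... 十九
    for d in DIGITS:
        # Assumption: each range lists the nine round multiples only.
        table[d + "百"] = 504     # set 一百～九百
        table[d + "千"] = 505     # set 一千～九千
    table["零"] = 522
    for unit in "百千萬億兆":
        table[unit] = 524         # bare magnitude units
    table["兩"] = 525
    return table

STATE_IDS = build_state_table()

def state_id(unit):
    """Return the State ID for a numeral unit, or None if it is not covered."""
    return STATE_IDS.get(unit)

print(state_id("十五"), state_id("三千"), state_id("兩"))
```

Units outside the covered sets return `None`, so a caller can distinguish them from units that have a known State ID.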

Appendix 3: Categories of Determinative-Measure (DM) Compounds

35    Da Neu Nf
36    Neu Nf Ng
37    Neu Nf Na Ng
38    Neqa Nf Ng
39    Neu Neqa Ng
40    Nes Neqa Nf
41    Nep Neqa Nf
42    Nep Neu Na
43    Nes Neu Na
44    Nes Neu Nf
45    Neqa Neu Na
46    Nep Neu Nf
47    Neqa Neu Nf
48    Neqa Neu Nf Na
49    Nes Neqa Nf Na
50    Nep Neqa Nf Ng
51    Nep Neqa Nf Neqb Ng
52    Nep Neqa Nf Neqb
53    Nep Neqa Nf Na Ng
54    Nep Neqa Nf Na
55    Nes Neqa Nf Na Ng
56    Nes Neu Nf Ng
91    Nes Nes Neqa Nf
92    Nes Nes Neqa Nf Ng
93    Nep Nes Nf Na
94    Neu Neqb
95    Neu VH Nf Na
96    Neu VH Nf
97    Nes Neu Nf Na
98    Neu Na Na
99    Nep Neu Nf Na
100    Da Neu Nf Nf
101    Neu VH Nf Ng
102    Nd Nd Ng
103    Neu Neu Nf
104    Neu Neu Nf Na
105    Neu Neu Nf Ng
106    Neu Neu Na
107    Neu Neu Nf Na Ng
108    Neu Neqa Nf Na
109    Neu Neu Nf VH
110    Nes Neu Nf Neqb Na
111    Nes Nes Neu Nf
112    Neu Neqa Na
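One way to read this table is as a lookup from a part-of-speech sequence to a category number. The sketch below illustrates that reading with a few entries copied from the table. The names `DM_CLASSES` and `longest_dm_match` are hypothetical, and preferring the longest matching pattern is an assumption made for the example rather than a rule stated in this work.

```python
# A few (POS sequence -> category number) entries copied from the table.
DM_CLASSES = {
    ("Da", "Neu", "Nf"): 35,
    ("Neu", "Nf", "Ng"): 36,
    ("Neu", "Neqb"): 94,
    ("Nd", "Nd", "Ng"): 102,
    ("Neu", "Neu", "Nf"): 103,
    ("Neu", "Neu", "Nf", "Na"): 104,
}

def longest_dm_match(tags, start=0):
    """Return (category, length) of the longest listed pattern at `start`.

    Returns None when no listed pattern matches at that position.
    Assumption: longer patterns are preferred over shorter ones.
    """
    best = None
    for pattern, category in DM_CLASSES.items():
        n = len(pattern)
        if tuple(tags[start:start + n]) == pattern:
            if best is None or n > best[1]:
                best = (category, n)
    return best

print(longest_dm_match(["Neu", "Neu", "Nf", "Na"]))
print(longest_dm_match(["VH", "Nd", "Nd", "Ng"], start=1))
```

Because both category 103 and category 104 match the first example, the longest-match rule is what selects 104 there.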
