
Chapter 6  Conclusions and Future Work

6.2 Future Work

The following points are directions worth further exploration:

First, the recognition architecture proposed in this thesis is two-stage, with the prosodic model's scores incorporated only in the second (rescoring) stage. Integrating the prosodic model directly into a one-stage recognition system should yield better results. One possible approach is to split the INITIAL/FINAL acoustic models into separate models according to their relative position within a word; the choice of feature vector then requires further experimentation, since both concatenating spectral features (e.g., MFCC) and prosodic features into a single vector, and using spectral features alone, are feasible options.
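The feature-concatenation option above can be sketched as follows. The dimensionalities and the choice of prosodic cues are illustrative assumptions, not the configuration used in this thesis:

```python
import numpy as np

def concat_features(mfcc, prosody):
    """Concatenate per-frame spectral and prosodic features.

    mfcc:    (T, 39) array, e.g. MFCC + delta + delta-delta (assumed layout)
    prosody: (T, 3) array, e.g. log-F0, energy, and a duration cue,
             already normalized per utterance (assumed)
    Returns a (T, 42) observation matrix for a one-stage recognizer.
    """
    assert mfcc.shape[0] == prosody.shape[0], "frame counts must match"
    return np.hstack([mfcc, prosody])

# Toy example: 100 frames of zero-valued features.
obs = concat_features(np.zeros((100, 39)), np.zeros((100, 3)))
print(obs.shape)  # (100, 42)
```

Whether the extra dimensions help or merely dilute the spectral information is exactly the experimental question raised above.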

Second, this thesis uses only the prosodic information of the two lowest layers of the prosodic hierarchy (the syllable layer and the prosodic-word layer), whereas the final realization of spoken language arises from the combined influence of many more layers of prosodic units, such as prosodic phrases, breath groups, and prosodic phrase groups [19]. Moreover, as discussed in Chapter 5, the broadcast-news corpus inherently contains fewer low-level boundaries, which limited the improvement obtainable from the prosodic model. This thesis currently uses normalization to remove the influence of higher-level prosodic units, but to build a more complete model, or to extract more of the information in the speech signal and thus come closer to real spoken language, the influence of these larger units must be taken into account.
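The normalization idea can be illustrated with a minimal sketch; per-utterance z-scoring of log-F0 is an assumed, generic recipe here, not necessarily the exact scheme used in the thesis:

```python
import numpy as np

def znorm(log_f0):
    """Utterance-level z-score normalization of log-F0.

    Removing the per-utterance mean and scale is one simple way to
    factor out slower, higher-level prosodic trends (speaker range,
    phrase-level declination) so that local, syllable-level variation
    stands out.  Input values and scale are illustrative only.
    """
    x = np.asarray(log_f0, dtype=float)
    return (x - x.mean()) / x.std()

z = znorm([5.0, 5.2, 5.1, 4.9, 5.3])
print(z)  # mean ~ 0 and std ~ 1 after normalization
```

The limitation noted above is that such normalization discards the higher-level structure rather than modeling it.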

Third, as noted in the introduction, experimental corpora generally lack annotations of prosodic events; furthermore, relying on expert annotation runs counter to the fact that ordinary listeners can also perceive prosody. Is it then possible to locate the prosodic units in speech automatically, using machine-learning methods rather than experts? One approach is to use a graphical model to represent the dependencies between prosodic elements and other elements of speech, such as syllables and lexical words, and then train the behavior of the prosodic elements from the corpus. Another approach, in the spirit of discriminative training of acoustic models, is to search for the prosodic-unit positions that best optimize recognition accuracy.

Fourth, if prosodically annotated corpora were available, the relationship between prosodic features and prosodic units could be investigated directly; for example, one could confirm exactly which acoustic variations accompany prosodic words, and compare the result against the feature-importance analysis performed in this thesis. One could even go further and train a language model that includes prosodic units; since such a model contains only the joint probabilities of words and prosodic units, it could be applied unchanged to experiments on other, unannotated corpora. Even a small amount of annotated data would thus help research on prosody-integrated recognition in several ways.
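A minimal sketch of what such a prosody-aware language-model statistic might look like, estimated by counting from an annotated word stream; the `<PW>` boundary tag and the toy data are hypothetical, not from the thesis corpus:

```python
from collections import Counter

# Hypothetical annotated stream where "<PW>" marks a prosodic-word
# boundary between lexical words (illustrative data only).
tokens = ["今天", "<PW>", "天氣", "很", "好", "<PW>"]

# Count how often each word is immediately followed by a boundary.
follow, total = Counter(), Counter()
for w, nxt in zip(tokens, tokens[1:]):
    if w == "<PW>":
        continue
    total[w] += 1
    if nxt == "<PW>":
        follow[w] += 1

def p_boundary_after(word):
    """Estimated P(prosodic-word boundary | word)."""
    return follow[word] / total[word] if total[word] else 0.0

print(p_boundary_after("好"))   # 1.0 in this toy stream
print(p_boundary_after("天氣"))  # 0.0 in this toy stream
```

Because the model depends only on word/boundary co-occurrence, exactly the property noted above, it could be trained once on a small annotated set and reused on unannotated corpora.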

Fifth, the conclusions noted that prosody conveys three dimensions of information; in particular, the larger prosodic units, such as prosodic phrase groups, are closely tied to the expression of semantics. Beyond improving speech recognition, therefore, development toward speech understanding applications (such as spoken-document summarization or segmentation) would extract the greatest benefit from prosodic information.

The results of this thesis raise large-vocabulary Mandarin recognition accuracy by only 1.45%, but this improvement is an encouraging start: on this new research topic, a reasonable gain was achieved with corpora carrying no ground-truth prosodic annotation and with only a small portion of the available prosodic information. If the directions above are pursued, the value of prosody-integrated speech recognition will be increasingly recognized and affirmed.

References

[1] D. Crystal, A Dictionary of Linguistics and Phonetics, 4th edition, Blackwell Publishers Inc.

[2] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Prentice Hall, 2001.

[3] C. W. Wightman and M. Ostendorf, “Automatic labeling of prosodic patterns,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, October 1994.

[4] J. H. Kim and P. C. Woodland, “The use of prosody in a combined system for punctuation generation and speech recognition,” in Proc. Eurospeech, 1997.

[5] E. Shriberg et al., “Prosody-based automatic segmentation of speech into sentences and topics,” Speech Communication, 32(1-2):127-154, 2000, Special Issue on Accessing Information in Spoken Audio.

[6] Y. Liu et al., “Automatic disfluency identification in conversational speech using multiple knowledge sources,” in Proc. Eurospeech, 2003.

[7] C.-K. Lin and L.-S. Lee, “Improved spontaneous Mandarin speech recognition by disfluency interruption point (IP) detection using prosodic features,” in Proc. Eurospeech, 2005.

[8] A. Stolcke et al., “Dialogue act modeling for automatic tagging and recognition of conversational speech,” Computational Linguistics, 26(2):339-373, 2000.

[9] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden Markov models,” Speech Communication, vol. 41, no. 4, November 2003, pp. 603-623.

[10] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, Jan. 2001.

[11] S. Kajarekar et al., “Speaker recognition using prosodic and lexical features,” in Proc. IEEE Workshop on Speech Recognition and Understanding, 2003

[12] L. Hahn, “Native speakers’ reactions to non-native stress in English discourse,” Ph.D. thesis, University of Illinois at Urbana-Champaign, 1999.

[13] K. Chen et al., “Prosody dependent speech recognition on radio news corpus of American English,” IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 1, Jan. 2006.

[14] V. R. R. Gadde, “Modeling word durations,” in Proc. ICSLP, 2000

[15] A. Stolcke et al., “Modeling the prosody of hidden events for improved word recognition,” in Proc. Eurospeech, 1999.

[16] R. Trask, A Dictionary of Phonetics and Phonology, Routledge, 1996.

[17] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, “ToBI: a standard for labeling English prosody,” in Proc. ICSLP, 1992.

[18] Tseng and F. C. Chou, “A prosodic labeling system for Mandarin speech database,” in Proc. ICPhS, 1999.

[19] Tseng et al., “Fluent speech prosody: framework and modeling,” Speech Communication, vol. 46, issues 3-4, July 2005, pp. 284-309, Special Issue on Quantitative Prosody Modeling for Natural Speech Description and Generation.

[20] “A detailed description of COSPRO and Toolkit,” http://reg.myet.com/registration/corpus/en/Papers.asp

[21] D. A. Reynolds, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, Jan. 1995.

[22] L. Breiman et al., Classification and Regression Trees, Chapman & Hall/CRC, 1984.

[23] R. O. Duda et al., Pattern Classification, Wiley-Interscience, 2001.

[24] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 1995.

[25] K.-M. Lin and C.-J. Lin, “A study on reduced support vector machines,” IEEE Trans. Neural Networks, vol. 14, 2003, pp. 1449-1459.

[26] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 28, no. 4, 1980.

[27] J. L. Packard, The Morphology of Chinese: A Linguistic and Cognitive Approach, Cambridge University Press, New York, NY, 2000.

[28] C. Wang and S. Seneff, “Robust pitch tracking for prosodic modeling,” in Proc. ICASSP, 2000.

[29] ESPS Version 5.0 Program Manual, 1993.

[30] 林婉怡, “Tone recognition of fluent Mandarin speech and its application to large-vocabulary recognition” (in Chinese), Master's thesis, Graduate Institute of Communication Engineering, National Taiwan University, 2004.

[31] S. H. Chen et al., “Vector quantization of pitch information in Mandarin speech,” IEEE Trans. Communications, vol. 38, no. 9, 1990.

[32] C. Wightman et al., “Segmental durations in the vicinity of prosodic phrase boundaries,” J. Acoust. Soc. Amer., Mar. 1992
