Conclusions and Future Work - Improving Rich Context Models for Mandarin Speech Synthesis

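This thesis examined HMM-based speech synthesis and rich-context-model speech synthesis. The baseline experiments confirmed that rich context models do improve speech quality. In the overall fluency ratings and the AB preference test, however, speech quality carried less weight, so the results were similar to those of HMM-based speech synthesis.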

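We observed that every rich context model is characterized by its own unique context description. We therefore proposed using latent semantic analysis (LSA) to extract prosody vectors from the context descriptions, together with the vector space model from information retrieval to compute the context-description similarity between the target utterance and each rich context model. This addresses a drawback of current rich-context-model synthesis, which needs an additional overfitted decision-tree-clustered model.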
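The retrieval step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes each rich context model is represented by a bag of context labels, builds a label-by-model count matrix, truncates its SVD to `k` dimensions, and ranks models by cosine similarity to the target context. The function names `build_lsa` and `retrieve`, the toy labels, and the model names `m0` to `m2` are hypothetical.

```python
import numpy as np


def build_lsa(models, k):
    """Build a rank-k latent space from context labels (illustrative only).

    models: dict mapping a model name to its list of context labels.
    Returns the label index, the projection matrix U_k, the model names,
    and one latent vector per model.
    """
    names = sorted(models)
    vocab = sorted({label for labels in models.values() for label in labels})
    index = {label: i for i, label in enumerate(vocab)}

    # Label-by-model count matrix (rows: labels, columns: models).
    A = np.zeros((len(vocab), len(names)))
    for j, name in enumerate(names):
        for label in models[name]:
            A[index[label], j] += 1.0

    # Truncated SVD: keep the k leading left singular vectors.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]

    # Project every model into the latent space the same way a query is.
    doc_vecs = A.T @ U_k
    return index, U_k, names, doc_vecs


def retrieve(query_labels, index, U_k, names, doc_vecs):
    """Return (best model name, cosine similarity) for a target context."""
    q = np.zeros(len(index))
    for label in query_labels:
        if label in index:  # labels unseen in training are ignored
            q[index[label]] += 1.0
    q_vec = q @ U_k

    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    sims = np.zeros(len(names))
    ok = norms > 0
    sims[ok] = (doc_vecs[ok] @ q_vec) / norms[ok]

    best = int(np.argmax(sims))
    return names[best], float(sims[best])


# Toy data: three hypothetical models described by made-up context labels.
models = {
    "m0": ["L-Phone==ian", "Phone==iu", "Tone==1"],
    "m1": ["L-Phone==ia", "Phone==uo", "Tone==4"],
    "m2": ["L-Phone==ian", "Phone==iu", "Tone==4"],
}
index, U_k, names, doc_vecs = build_lsa(models, k=2)
print(retrieve(["L-Phone==ian", "Phone==iu", "Tone==4"], index, U_k, names, doc_vecs))
```

The choice of `k` trades smoothing against fidelity: a small `k` merges models with similar contexts, while `k` equal to the matrix rank reproduces plain cosine similarity on the raw label counts.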

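In the subjective experiments, both the mean opinion score and the AB preference test were worse than those of the overfitted model. We attribute this to two causes. First, the retrieved rich context model is not necessarily the best model. Second, each rich context model has its own distinctive prosody, which does not necessarily match the target; after the speech parameter generation algorithm is applied, the generated speech is too close to the retrieved rich context model, which lowers the scores. A closed test on the training corpus nevertheless showed, in objective evaluation, that the proposed method does effectively approximate the cepstrum of the original speech.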

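In future work, we will address the inaccuracy of the prosody vectors obtained by applying LSA to the context labels, so that each prosody vector more accurately represents the prosody of its context label and the fluency of the final synthesized utterance improves. In addition, recent speech synthesis research has been moving toward training models with deep neural networks to improve them, or toward emerging voice conversion techniques to compensate for the muffled speech caused by statistical models. We will also combine these techniques with rich-context-model speech synthesis in the hope of obtaining better synthesized speech.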



QS "R-CountWordInUtt==28" {*#28/CUTT:*}

QS "CountWordInUtt==29" {*/CPU:29@*,*@29#*,*#29/CUTT:*}

QS "L-CountWordInUtt==29" {*/CPU:29@*}

QS "C-CountWordInUtt==29" {*@29#*}

QS "R-CountWordInUtt==29" {*#29/CUTT:*}

QS "CountWordInUtt==30" {*/CPU:30@*,*@30#*,*#30/CUTT:*}

QS "L-CountWordInUtt==30" {*/CPU:30@*}

QS "C-CountWordInUtt==30" {*@30#*}

QS "R-CountWordInUtt==30" {*#30/CUTT:*}

QS "CountUttInAll==1" {/CUTT:1}

QS "CountUttInAll==2" {/CUTT:2}

QS "CountUttInAll==3" {/CUTT:3}

QS "CountUttInAll==4" {/CUTT:4}

QS "CountUttInAll==5" {/CUTT:5}

QS "CountUttInAll==6" {/CUTT:6}

QS "CountUttInAll==7" {/CUTT:7}

QS "CountUttInAll==8" {/CUTT:8}

QS "CountUttInAll==9" {/CUTT:9}

QS "CountUttInAll==10" {/CUTT:10}


-832 L-Phone==ian "mgc_s2_810" -840

-833 Phone==iu -837 "mgc_s2_811"

-834 R-Phone==uo "mgc_s2_813" "mgc_s2_812"

-835 Phone==ch -844 -836

-836 Tone==1 "mgc_s2_815" "mgc_s2_814"

-837 L-Phone==ia "mgc_s2_817" "mgc_s2_816"

-838 C-Tone==4 -839 "mgc_s2_818"

-839 LL-Phone==sp "mgc_s2_820" "mgc_s2_819"

-840 LL-Phone==shi "mgc_s2_822" "mgc_s2_821"

-841 LL-Phone==shi "mgc_s2_824" "mgc_s2_823"

-842 C-PUNCTTAG==COMMACATEGORY "mgc_s2_826" "mgc_s2_825"

-843 Phone==a -846 "mgc_s2_827"

-844 RR-Phone==ang "mgc_s2_829" "mgc_s2_828"

-845 Phone==ts "mgc_s2_831" "mgc_s2_830"

-846 Phone==uo "mgc_s2_833" "mgc_s2_832"

-847 C-Tone==3 "mgc_s2_835" "mgc_s2_834"

-848 R-Tone==4 "mgc_s2_837" "mgc_s2_836"

-849 LL-Phone==i "mgc_s2_839" "mgc_s2_838"

-850 Tone==1 "mgc_s2_841" "mgc_s2_840"

-851 L-Phone==iu "mgc_s2_843" "mgc_s2_842"

-852 Tone==4 "mgc_s2_845" "mgc_s2_844"

-853 Phone==ai "mgc_s2_847" "mgc_s2_846"

-854 L-Phone==n "mgc_s2_849" "mgc_s2_848"

-855 R-Phone==b "mgc_s2_851" "mgc_s2_850"

-856 Phone==an "mgc_s2_853" "mgc_s2_852"

-857 LL-Phone==in "mgc_s2_855" "mgc_s2_854"

-858 Tone==6 "mgc_s2_857" "mgc_s2_856"

-859 R-Phone==ou "mgc_s2_859" "mgc_s2_858"

-860 C-CountWordInUtt==8 "mgc_s2_861" "mgc_s2_860"

-861 L-Phone==e "mgc_s2_863" "mgc_s2_862"

}
