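Future Work

While running the experiments for this thesis, we found that certain implementation details strongly affect the final generation quality. These detail tests were only qualitative, however, with no quantitative analysis, so they deserve a much larger set of experiments for further observation and discussion. In addition, many such detail tests could not be attempted; trying them one by one may further improve the quality of the current vocoders.

Training the vocoders on acoustic features extracted from human speech, applying them to more speech generation tasks, and investigating which model is better suited to each task is another topic worth exploring.

Building on the analysis in this thesis, future work could design a training set that gives vocoders more general ability and greater robustness, and could also collect cleaner, more suitable training data.

Having analyzed the strengths and weaknesses of the various vocoders and of autoregressive models, we also hope that future work can design a vocoder that is more universal, can generate in real time, and can be used across a wide range of applications.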

Appendix

The following observations were made after running a large number of mean opinion score (MOS) experiments:

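• The total number of sentences per listener should not be too large; otherwise a listener's standard tends to drift between the start and the end of the test.

• When the total number of test sentences exceeds 30, many listeners become fatigued and unwilling to continue. Even close friends filling it out as a favor mostly complain once they finish.

• The size of the reward has little effect on the response rate. What matters is that respondents do not have to spend too much time or work through a complicated interface to complete the questionnaire.

• Ratings of the different ways of collecting mean opinion scores (MOS):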

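– Close friends: ★ ★ ★ ★ ★

If the collection schedule is not tight, asking close friends is highly recommended. Most of them complete the questionnaire within two or three days or leave it for the weekend. It is best to ask on a Friday or Saturday, so that they do not forget about it after several days.

Because close friends do not answer carelessly, this option is highly recommended overall. Its drawback is that close friends alone are not enough when many responses are needed.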

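– Lab mates: ★ ★ ★ ★ ☆

Lab mates are very familiar with the original training audio and remember the real recordings, so the conditions are inherently somewhat unfair. They are stricter than ordinary listeners on audio in the 4-5 point range, and they tend to give higher scores than ordinary listeners to audio at 3 points or below. They also notice finer details and are far more sensitive to echo.

Overall, the ratings from lab mates vary much less than those from ordinary listeners and are concentrated at 3-4 points.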

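– FB: NTU 台大學生交流版 (NTU student exchange board): ★ ★ ★ ★

Offer plenty of p幣 (P coins). For the person distributing the questionnaire, the time and money required are both low, so it does not hurt much even if some respondents answer carelessly. Responses also arrive quickly, which makes this an efficient way to collect questionnaires, but keep in mind that some respondents will answer carelessly.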

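– Dcard: not recommended

The Dcard platform enforces anonymity, so the person distributing the questionnaire cannot give respondents any reward and therefore cannot attract people to fill it out. Dcard is thus not recommended for collecting questionnaires.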
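The observation above that lab mates' ratings vary less than those of ordinary listeners can be checked quantitatively by reporting the spread alongside each MOS. The following is a minimal sketch, not the procedure used in this thesis: the function name `mos_summary`, the group labels, and all ratings are hypothetical and invented for illustration, and the 95% interval assumes a normal approximation.

```python
from math import sqrt
from statistics import mean, stdev


def mos_summary(scores):
    """Return (MOS, sample std, 95% CI half-width) for a list of 1-5 ratings."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings")
    m = mean(scores)
    s = stdev(scores)                 # sample standard deviation
    half_width = 1.96 * s / sqrt(n)   # normal-approximation 95% interval
    return m, s, half_width


# Hypothetical ratings, made up for illustration only.
ratings = {
    "lab_mates": [3, 4, 3, 4, 4, 3, 3, 4],
    "general_listeners": [1, 5, 2, 5, 3, 4, 2, 5],
}

for group, scores in ratings.items():
    m, s, hw = mos_summary(scores)
    print(f"{group}: MOS = {m:.2f} ± {hw:.2f} (std {s:.2f}, n = {len(scores)})")
```

Reporting the standard deviation per collection channel makes it easier to see whether a difference in MOS between channels reflects the audio or the listener pool.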
