
5.3 Experimental Results

5.3.2 Problem of Seq2seq


The second reason is that our model's output is passed through the performance decoding proposed in [13]. When decoding the seq2seq output, we observed many warnings indicating that the sequence contains invalid events, for example a pitch with a NOTE_ON event but no matching NOTE_OFF. The decoding process automatically filters out these invalid subsequences, so even though the seq2seq output contains many segments that violate musical logic, it can still be rendered into MIDI files; in other words, the performance decoding cleans up and flatters the seq2seq output. As shown in Figure 7, the left column shows, from top to bottom, the tensor output by seq2seq, the tensor output by VMT, and the tensor of the original song. The orange boxes mark the end-of-music token of the performance encoding. We found that the seq2seq model never learns to generate this end-of-music token and simply keeps emitting further values, which shows that the tensor output by seq2seq clearly lacks musicality.
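To make this filtering step concrete, the following listing is a minimal Python sketch of the kind of clean-up the performance decoding applies: unmatched NOTE_ON events are dropped and decoding stops at the first end-of-music token. The event names (NOTE_ON, NOTE_OFF, END) and the dictionary-based event representation used here are simplified assumptions for illustration only; they are not the exact vocabulary or implementation of [13].

# A minimal sketch of the clean-up applied during performance decoding
# (illustrative only; event format is an assumption, not the scheme of [13]).
def clean_event_sequence(events):
    """Drop NOTE_ON events that never receive a NOTE_OFF, ignore stray
    NOTE_OFFs, and truncate the sequence at the first END token."""
    cleaned = []
    open_notes = {}  # pitch -> index of the pending NOTE_ON in `cleaned`

    for event in events:
        kind, pitch = event.get("type"), event.get("pitch")
        if kind == "END":
            break                      # seq2seq rarely emits this token
        if kind == "NOTE_ON":
            cleaned.append(event)
            open_notes[pitch] = len(cleaned) - 1
        elif kind == "NOTE_OFF":
            if pitch in open_notes:    # keep only NOTE_OFFs that close a note
                cleaned.append(event)
                del open_notes[pitch]
        else:
            cleaned.append(event)      # e.g. TIME_SHIFT or VELOCITY events

    # drop NOTE_ONs that were never closed (the source of the warnings)
    unmatched = set(open_notes.values())
    return [e for i, e in enumerate(cleaned) if i not in unmatched]


if __name__ == "__main__":
    seq = [
        {"type": "NOTE_ON", "pitch": 60},
        {"type": "NOTE_OFF", "pitch": 60},
        {"type": "NOTE_ON", "pitch": 64},   # never closed -> filtered out
        {"type": "NOTE_OFF", "pitch": 67},  # never opened -> ignored
    ]
    print(clean_event_sequence(seq))  # only the matched pair on pitch 60 remains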

Figure 8 shows the MIDI note sequences obtained after the performance decoding of [13]; from top to bottom are the note sequences of seq2seq, VMT, and the ground truth. The green boxes mark note subsequences that exhibit the characteristics of a musical motif, and we can see that our VMT model has learned these motif features.


Figure 7: From left to right are seq2seq and VMT; the ground truth is at the bottom. Inside the red boxes are the "END" tokens in our vocabulary dictionary.


Figure 8: From top to bottom are seq2seq, VMT, and ground truth, respectively. The left and right columns are two examples from the testing set. Inside the green boxes are musical motifs, i.e., short and constantly recurring musical phrases.



Chapter 6 Conclusion

Since there was no prior research on generating music for videos, and no dataset of videos aligned with corresponding symbolic music, we compiled and released a video-music dataset of more than 7 hours, consisting of piano scores of pop songs and the corresponding music videos, aligned through manual annotation.

We proposed a convolutional attention model (Video-Music Transformer) to generate music for videos. Compared with the sequence-to-sequence (seq2seq) model, it achieves state-of-the-art results in both music fluency and video-music matching, although the experimental results show that there is still room for improvement relative to human-composed music. We believe future work could attempt to generate 30 seconds of music at a time, so that the advantage of the attention architecture over recurrent architectures in long-sequence generation becomes more evident; moreover, the characteristics of human-composed music only emerge clearly in pieces of 30 seconds to 1 minute in length.

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[2] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298, 2017.

[3] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, Neural audio synthesis of musical notes with wavenet autoencoders. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets. Advances in neural information processing systems, 2014.

[5] G. Hadjeres, F. Pachet, and F. Nielsen, DeepBach: a Steerable Model for Bach chorales generation. arXiv preprint arXiv:1612.01010, 2016.

[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97, 2012.

[7] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 2012.

[9] F.-F. Kuo, M.-F. Chiang, M.-K. Shan, and S.-Y. Lee, Emotion-based music recommendation by association discovery from film music. Proceedings of the 13th annual ACM international conference on Multimedia, 2005.

[10] J.-C. Lin, W.-L. Wei, and H.-M. Wang, EMV-matchmaker: emotional temporal course modeling and matching for automatic music video generation. Proceedings of the 23rd ACM international conference on Multimedia, 2015.

[11] O. Mogren, C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.

[12] A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[13] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, This time with feeling: learning expressive musical performance. Neural Computing and Applications, 1-13, 2018.

[14] P. M. Todd, A connectionist approach to algorithmic composition. Computer Music Journal, 13(4), 27-43, 1989.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need. Advances in neural information processing systems, 2017.


Appendix

Figure 9: Our demo webpage. Users can use the mouse to click on examples from the testing set.


Figure 10: The webpage for No. 100-015. Users can click the top-left video to watch the original music video. At the bottom left is a slider that displays the 40 frames used as our model inputs. The right side displays the target's MIDI piano notes. Users can click the buttons to play the target's MIDI alone or synchronized with the video on the left.

Figure 11: The webpage for No. 100-015 and the results of our VMT model.


Figure 12: The webpage for No. 100-015 and the results of the Seq2seq model.
