A Semi-automatic Computer Expressive Music Performance System Using Structural Support Vector Machine

Master Thesis

Department of Electrical Engineering

College of Electrical Engineering and Computer Science

National Taiwan University

Shing Hermes Lyu (呂行)

Advisor: Shyh-Kang Jeng, Ph.D. (鄭士康)

June, 2014

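Acknowledgements

First, I would like to thank Prof. Shyh-Kang Jeng (鄭士康). Ever since I took his undergraduate project course in my junior year (thanks to the recommendation of my undergraduate advisor, Prof. 張時中), he has given me complete freedom, which allowed me to gradually develop the ability to find research topics, design experiments, give presentations and write papers. In our meetings every Saturday he also gave me very practical and to-the-point advice.

I thank the members of my oral defense committee for their valuable suggestions, which made this thesis more complete. Thanks to Prof. 王真儀 for correcting my thesis word by word and for her very professional opinions. Thanks to Prof. 王育雯, whose music theory course laid the foundation for my experiments. Thanks also to Prof. 陳宏銘, whose multimedia signal processing course gave me a deeper understanding of computer music.

Thanks to the senior members and classmates of JCMG, especially 志鴻, 御仁, 鴻心, 彥彬, 韋安, 鼎棋, 晟文, 傳佑, 如江 and 廉喬; every discussion gave me much inspiration.

Thanks to 振宇, 俞仲, 鍾愛, 宗緯, 廉喬 and 智展 for taking your precious time to record the performance samples used in the experiments. Your superb performances gave this experiment high-quality data to work with.

Thanks to Intel for the financial support and the learning opportunities over the past two-plus years. Thanks to my three managers, Allen Ouyang, Robinson Do and Chuanny Shiau, for giving me the chance to take part in various challenging projects and for letting me arrange my work and school hours freely.

Thanks to everyone in the classical music club (愛樂社), especially 順德, 芝潔, 品臻, 小女王, 迪西, 顧門口, 俊麟, 子恩, 乃嘉, 維中, 子瑩 and 彥彤. The club kindled my love of music, and the concerts we shared were often a source of inspiration for my research.

I would also like to especially thank the professors and students of the NTU Graduate Institute of Musicology (台大音樂所). The classmates in Prof. 王育雯's music theory course gave me much help and advice in the experiments. Special thanks also go to Prof. 金立群 for his assistance: the criticism and advice at WOCMAT, introducing committee members, and reposting my online questionnaire.

And I would like to thank all the contributors of the open source software community, especially the contributors of Python, music21, R, Rosegarden, Musescore, and Linux Mint Debian Edition. Without your great work, this thesis could not have become a reality.

Thanks to my family for their support and care over these twenty-some years, which allowed me to learn and grow without worries. During the busiest times of my undergraduate and graduate studies, home was always the place where I could be most relaxed and at ease.

Finally, I thank my girlfriend, who always greets me with a smile and brings me much laughter. When I was overwhelmed by research she never complained, and she always quietly supported and encouraged me.

This research could not have been completed without everyone's help. I offer my sincerest gratitude; may Jehovah bless you and your whole family.

Shing Hermes Lyu (呂行), June 10, 2014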

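Abstract (translated from the Chinese)

Computer-synthesized music has long been regarded as stiff, mechanical and lacking in musical expressiveness. An automatic computer performance system that can produce expressive music would therefore have a major impact on the music industry, personalized entertainment and the performing arts. In this thesis, we use a structural support vector machine with hidden Markov model structure (SVM-HMM) to design an automatic computer performance system that can produce expressive music. We invited six graduate students to record Muzio Clementi's Sonatinas Op.36. We manually split these recordings into phrases and extracted musical features from them with a program. After SVM-HMM trains a mathematical model on these features, the model can be used to perform scores that were not seen during training (the phrases must be annotated manually). The system currently supports monophonic melodies only. The results of a questionnaire survey show that the music produced by the system does not yet reach the level of human performance. However, according to quantitative similarity analysis, the music produced by the system is indeed closer to human performance than inexpressive MIDI music.

Keywords: computer automatic performance, structural support vector machine, support vector machine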

Abstract

Computer-generated music is known to be robotic and inexpressive. A computer system that can generate expressive performances could have a significant impact on the music production industry, personalized entertainment, or even art. In this paper, we have designed and implemented a system that generates expressive performances using a structural support vector machine with hidden Markov model output (SVM-HMM). We recorded six sets of Muzio Clementi's Sonatinas Op.36 performed by six graduate students. The recordings and scores were manually split into phrases, and their musical features were extracted automatically. Using the SVM-HMM algorithm, a mathematical model of expressive performance knowledge is learned from these features. The trained model can generate expressive performances for previously unseen scores (with user-assigned phrasings). The system currently supports monophonic music only. Subjective tests show that the computer-generated performances still cannot achieve the same level of expressiveness as human performers, but quantitative similarity measures show that the computer-generated performances are more similar to human performances than inexpressive MIDIs are.

Keywords: Computer Expressive Performance, Performance Rendering, Structural SVMs, Support Vector Machines.

List of Figures

3.1 High-level system architecture

3.2 Learning phase flow chart

3.3 Performing phase flow chart

3.4 Intervals with neighbor notes

3.5 Relative durations with neighbor note

3.6 Metric position

3.7 Systematic bias in onset deviation

4.1 Movement length (notes) distribution

4.2 Movement length (phrases) distribution

4.3 Phrase length (notes) distribution

5.1 Onset deviations by aligning last note onset

5.2 Onset deviations by aligning last note's note-off

5.3 Onset deviations using automated normalization method

5.4 Median distance between generated performances and recordings for different ε's

5.5 Execution time for different ε's

5.6 Median distance between generated performances and recordings for different C's

5.7 Execution time for different C's

5.8 Execution time for different number of quantization levels

5.9 Distribution of onset deviation values from full corpus versus single performer's corpus

5.10 Distribution of duration ratio values from full corpus versus single performer's corpus

5.11 Distribution of MIDI velocity values from full corpus versus single performer's corpus

List of Tables

4.1 Clementi's Sonatinas Op.36

4.2 Number of mistakes in the corpus. Blank cell means the performer did not record the movement

4.3 Total recorded phrases and notes count

4.4 Phrases and notes count for Clementi's Sonatina Op.36

5.1 Average (normalized) distance between generated performance and human recording, and between inexpressive MIDI and human performance

5.2 Average rating for generated performance and human recording; numbers in brackets are standard deviations

5.3 Average ratings for inexpressive MIDI and human performance

5.4 Number of participants who gives higher rating to generated performance, human recordings or equal rating

5.5 Number of participants who gives higher rating to inexpressive MIDI, human recordings or equal rating

Chapter 1 Introduction

1.1 Motivation

From the mechanical music-performing automata of the Middle Ages to the latest Japanese virtual singer Hatsune Miku, there have been many attempts to create automated systems that perform music. However, many of these systems can only generate predefined expressions. State-of-the-art text-to-speech systems can already generate fluid and natural speech, but computer performance systems still cannot perform very expressively.

Therefore, many researchers have devoted their efforts to develop systems that can automatically or semi-automatically perform music expressively. There is even a biannual contest for such systems called the Music Performance Rendering Contest (RenCon) [1].

The RenCon sets a goal that by 2050, a computer performer can win the International Chopin Piano Contest.

There are many potential applications for a computer expressive performance system; many commercial music typesetting software packages, such as Finale [2] and Sibelius [3], already have expressive playback features built in. For the entertainment industry, such systems can provide a personalized music listening experience. For the music production industry, this technology could save much of the cost of hiring musicians and paying license fees. Such systems also open up new opportunities in art, such as human-machine co-performance or interactive multimedia installations. In academia, researchers can use this technology to study the performance styles of musicians or to restore historical recording archives.

1.2 Goal and Contribution

The ultimate goal of this paper is to be able to play any music in any specified expressive style. However, due to technical and time constraints, we narrow our goal to building a computer expressive performance system that performs monophonic musical phrases by off-line supervised learning. The phrasing is left to human users, so the system built in this thesis is a semi-automatic one.

The major contribution of this paper is that we apply the structural support vector machine to construct an expressive performance system. No previous system has used the discriminative learning power of the structural support vector machine with hidden Markov model output (SVM-HMM) to improve the computer's capability to perform expressively. We also developed methods and tools to prepare an expressive performance corpus for training and to determine the necessary parameter values of the SVM-HMM. Since there is no unified way to evaluate expressive performance, we arranged subjective tests and found that our system cannot achieve the same level of expressiveness as humans. However, quantitative similarity evaluation shows that the music passages generated by our system are more similar to human performances than inexpressive MIDIs are.

1.3 Chapter Organization

In Chapter 2, we give an overview of previous works with various goals. These works are grouped by how they learn performance knowledge, and we also discuss some additional specialties such as special instrument models or special user interaction patterns. In Chapter 3, we first give a brief introduction to the mathematical background of SVM-HMM, and then give a top-down explanation of the proposed method. In Chapter 4, we explain how the corpus used for training is designed and implemented. In Chapter 5, we examine several experiments that demonstrate design trade-offs, and present the subjective test results. Finally, we summarize our work and point out possible improvements in Chapter 6. In the appendix, we present some software tools used in this research, which may be helpful for other researchers in the field of computer music.

Chapter 2 Previous Works

2.1 Various Goals and Evaluation

The general goal of a computer expressive performance system is to generate expressive music, as opposed to the robotic and dull expression of rendered MIDI. Since the definition of “expressive” is vague and ambiguous, each research project needs to define a more precise and measurable goal. The following are the most popular goals a computer expressive performance system aims to achieve:

1. To perform musical notations in a non-robotic way (no specific style).

2. To reproduce a human performance or a certain musician's style.

3. To accompany a human performance.

4. To validate a musicological theory of expressive performance.

5. To directly render computer-composed musical works.

Some systems try to perform musical notations in a non-robotic way in a general sense, without a certain style in mind. These systems have been employed in music typesetting software, like Finale [2] and Sibelius [3], to play the notation expressively. Most systems implicitly include this goal.

Systems that are designed to reproduce a certain human performance or style are usually designed and trained using a particular performer's recordings. One commercial example is the Zenph re-performance CD [4]. This CD is a reconstruction of Rachmaninov's recording archives. By analysing various performance parameters like timing and key pressure, the low-quality audio archives are re-synthesized on a modern computer-controlled piano.

If we push this idea further, we may be able to learn a performance model of Rachmaninov and perform musical pieces that Rachmaninov himself never recorded in his lifetime.

Accompaniment systems try to render expressive music that acts as an accompaniment for a human performance. The challenge is that the system must be able to track the progress of the human performance and adaptively render the accompaniment in real-time.

One commercial example is Cadenza [5], which uses the technology created by Christopher Raphael. It can track the soloist's performance and play the accompanying orchestral part accordingly.

Another goal is to validate musicological theories. Musicologists may propose theories on how music is performed expressively; by building a generative model, they can validate their theories. These systems may focus more on the specific phenomenon that the theory tries to explain than on generating music that is pleasant to humans.

Finally, some systems combine computer composition technology with expressive performance technology. These systems have a big advantage because the intention of the composer can be shared with the performer. Other systems that perform past compositions can only guess the composers' intentions by analyzing the score notations. These systems usually have their own data structures to represent music, which contain more information than traditional music notations, but the performance system is not backward compatible with past compositions.

Because of the high diversity in the goals they want to achieve, it is very hard to make fair comparisons between systems. But we can still evaluate the capabilities of these systems by the following three key indicators proposed by [6]:

1. Expressive expression capability

2. Polyphonic capability

3. Performance creativity

Expressive expression capability ranges from high-level structural expressions (e.g. tempo contrast between sections) to note-level expressions (e.g. onset, loudness, duration) or even sub-note expressions (e.g. loudness envelope, timbre). Most systems can generate note-level expressions, but higher- or lower-level expressions are much rarer.

Polyphonic capability indicates whether the system can perform polyphonic input. Polyphonic systems are more challenging than monophonic ones because they require synchronization between voices.

Performance creativity measures the ability of the system to create novel expressions. The desired level of creativity varies from goal to goal. A system aiming to recreate human performances may want to produce deterministic expressions based on the learned knowledge, while a system that is combined with a composition system may want to create highly novel performances.

Each system designs different experiments and metrics to verify its goals, so the self-reported results can hardly be compared. The only public contest that evaluates expressive performance systems is RenCon (Performance Rendering Contest) [1]. Scores (MIDI) are given to participants one hour before the competition starts. The participants must generate the expressive versions of the MIDIs in the given time, and the MIDIs are played live on a Yamaha Disklavier piano. The audience and a jury consisting of professional musicians give ratings for each performance. The performances are arranged in random order, so the audience and jury do not know which participant is behind each performance.

The RenCon is divided into fully automatic and semi-automatic categories. Since the degree of human intervention in the semi-automatic category varies widely between systems, it is not very fair to compare them.

2.2 Research Classified by Methods Used

Despite the differences between goals of different expressive performance systems, all expressive performance systems must have some strategies to learn and apply performance knowledge. There are generally two approaches: rule-based or machine-learning-based.

Using rules to generate expressive music is probably the earliest approach. Director Musices [7] is one of the early examples. Pop-E [8] is also a rule-based system; it can generate polyphonic music using its voice synchronization algorithm. Computational Music Emotion Rule System [9] tries to develop rules that express human emotions. Other examples include Hierarchical Parabola System [7, 10--12], Composer Pulse System [13, 14], Bach Fugue System [15], Trumpet Synthesis System [16, 17] and Rubato [18, 19]. Most rule-based systems focus on expressive attributes like note onset, note duration and loudness, but Hermode Tuning System [20] puts special emphasis on intonation. Rule-based systems are generally more computationally efficient because their mathematical models are much simpler than those learned by machine learning algorithms, and rules are generally more understandable to humans than complex model parameters. However, some nuances, such as subconscious deviations, may be hard to describe by rules, so there is an empirical limit on how complex a rule-based system can be. The lack of creativity is also a problem for the rule-based approach.

Another approach is to acquire performance knowledge by machine learning. Many machine learning methods have already been applied to this problem. For example, Music Interpretation System [21--23] and CaRo [24--26] both use linear regression to learn performance knowledge. However, it is very unlikely that expressive performance can be modeled by a linear system, and therefore Music Interpretation System tries to introduce non-linearity by using logic AND operations on linear regression results. Generally speaking, though, linear regression is too simple to capture the core of expressive performance.

More complicated machine-learning algorithms have also been applied: ANN Piano [27] and Emotional flute [28] use artificial neural networks. ESP Piano [29] and Music Plus One [30--32] use statistical graphical models such as the hidden Markov model (HMM) and the Bayesian belief network, but they did not use the structural support vector machine to train the HMM. KCCA Piano System [33] uses kernel regression. Drumming System [34] tries different mapping models that generate drum patterns.

Evolutionary computation such as genetic programming is used in Genetic Programming Jazz Sax [35], Sequential Covering Algorithm Genetic Algorithm [36], Generative Performance Genetic Algorithm [37] and Multi-Agent System with Imitation [38, 39]. Evolutionary computation requires long training time, and the results are less predictable. But being unpredictable also means that these systems can create interesting performances in unconventional ways.

Another possible approach is case-based reasoning. SaxEx [40--42] uses fuzzy rules based on emotions to generate Jazz saxophone performances. Kagurame [43, 44] focuses on style (Baroque, Romantic, Classical etc.) instead of emotion. Ha-Hi-Hun [45] has a more ambitious goal in mind: to accept natural language instructions like “Perform piece X in the style of Y.” Another series of studies by Widmer et al., the PLCG [46--48], uses data mining techniques to find rules for expressive performance. Its successor -- Phrase-decomposition/PLCG [49] -- adds hierarchical phrase structure capability to the original PLCG system. And the latest research in the series -- DISTALL [50, 51] -- adds hierarchical rules to the original one.

Most of the performance systems discussed above take musical notation (MusicXML, MIDI, etc.) or inexpressive audio as input. They have to figure out the expressive intention of the composer by analyzing the score. Another type of computer expressive performance system has a big advantage over the ones previously described: by combining computer composition and expressive performance, the performance module can receive the composition intention directly from the composition module. Ossia [52] and pMIMACS [53] are two examples of this category. This approach provides great possibilities for creativity, but such systems can only play their own compositions, which limits their range of application.

2.3 Additional Specialties

Most expressive performance systems implicitly or explicitly generate piano performances, because it is relatively easy to collect training samples for piano, and the piano sound is relatively easy to synthesize. Yet some systems generate music on other instruments, such as the saxophone [40--42], trumpet [16, 17], flute [28] and drums [54]. These systems require extra effort in creating instrument-specific models for training, generation and synthesis. Y.-H. Kuo et al. [55] also proposed a way to re-synthesize individual notes into a performance with smooth timbre variation, but that work focuses more on sub-note-level timbre synthesis.

If not specified, most systems handle traditional Western tonal music. However, most saxophone-based work [40--42] generates Jazz music, because saxophone is an iconic instrument in Jazz performance. And the Drumming System [54] generates Brazilian drumming music.

Performing polyphonic music is much more challenging than performing monophonic music because it requires synchronization between voices. Pop-E [8] uses a synchronization mechanism to achieve polyphonic performance. Bach Fugue System [15] is created using the polyphonic rules in music theory on fugue, so it is inherently able to play polyphonic fugues. KCCA Piano System [33] can generate homophonic music -- an upper melody with an accompaniment -- which is common in piano music. Music Plus One [30--32] is a little different because it is an accompaniment system: it adapts a non-expressive orchestral accompaniment track to the user's performance.

Chapter 3 Proposed Method

3.1 Overview

The high-level architecture of the proposed system is shown in Fig. 3.1. The system has two phases: the upper half of the figure is the learning phase, and the lower half is the performing phase. In the learning phase, pairs of scores and expressive human recordings, split into phrases by a human, are used as training examples for the structural support vector machine with hidden Markov model output (SVM-HMM) algorithm to learn the performance knowledge model. In the performing phase, a score is given to the system for expressive performance. The SVM-HMM generation module uses the performance knowledge learned in the previous phase to produce an expressive performance.

The SVM-HMM output then goes through a MIDI generator and MIDI synthesizer to produce audible performance.

All scores and recordings are monophonic and contain only one musical phrase. The phrasing is done by the author, so the system is semi-automatic. The learning algorithm applied in the learning phase, namely SVM-HMM, can only be executed off-line, while the performing phase works much faster; expressive music can be generated almost instantaneously.

There are several ways the user can control the performance style of the final output. First, the user can choose the training corpus. Theoretically, a model of a particular style can be learned from a set of samples with that particular style. Second, the user can control the structural expression by assigning the phrasings.

Figure 3.1: High-level system architecture

In the following sections, we give an overview of the theoretical background behind SVM-HMM, and then walk through the detailed steps in the learning and the performing phases. Other implementation details are also described. The features used are presented at the end of this chapter.

3.2 A Brief Introduction to SVM-HMM

In this thesis, we use the structural support vector machine to learn performance knowledge from expressive performance samples. Unlike the traditional SVM algorithm, which only produces univariate predictions, the structural SVM can produce structured predictions such as trees, graphs or sequences. The structural SVM with hidden Markov model output (SVM-HMM) has been successfully applied to the part-of-speech tagging problem [56].

There are some similarities between the part-of-speech tagging problem and the expressive performance problem. In part-of-speech tagging, one tries to identify the role a word plays in the sentence, while in expressive performance, one tries to determine how a note should be played, usually based on its role in the musical phrase. Thus, we believe that SVM-HMM is also a good candidate for expressive performance.

The following introduction and formulas are summaries of [56--58].

The traditional SVM prediction problem can be described as finding a function

\[ h : \mathcal{X} \to \mathcal{Y} \]

with the lowest prediction error, where $\mathcal{X}$ is the input feature space and $\mathcal{Y}$ is the prediction space. In a traditional SVM, elements in $\mathcal{Y}$ are labels (classification) or real values (regression).

A structural SVM extends the framework to generate structured outputs, such as trees, graphs or sequences. To support structured outputs, the problem is modified to finding a discriminant function

\[ F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \]

which maps input/output pairs to a real-valued score. To predict an output $y$ for an input $x$, one maximizes $F$ over all $y \in \mathcal{Y}$:

\[ f(x) = \arg\max_{y \in \mathcal{Y}} F(w, x, y). \]

Let $F$ be a linear function of the form

\[ F(w, x, y) = w^T \Psi(x, y), \]

where $w$ is the parameter vector and $\Psi(x, y)$ is the joint feature map relating input $x$ to output $y$. $\Psi$ can be defined to accommodate various kinds of structure.
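As a small illustration of prediction with a linear discriminant function, the sketch below enumerates a tiny output space and picks the highest-scoring label sequence. The feature map `joint_features`, the label alphabet and the weights are all made up for this example; they are not the features or labels used in this thesis, and brute-force enumeration is only feasible for very short sequences.

```python
from itertools import product

LABELS = ("soft", "loud")  # hypothetical label alphabet


def joint_features(x, y):
    """Toy joint feature map Psi(x, y): the sum of the inputs assigned to
    each label, plus a count of neighboring positions sharing a label."""
    feats = {label: 0.0 for label in LABELS}
    for xi, yi in zip(x, y):
        feats[yi] += xi
    feats["same_as_previous"] = float(sum(a == b for a, b in zip(y, y[1:])))
    return feats


def score(w, x, y):
    """Linear discriminant F(w, x, y) = w^T Psi(x, y)."""
    psi = joint_features(x, y)
    return sum(w[name] * value for name, value in psi.items())


def predict(w, x):
    """f(x) = argmax of F over every label sequence as long as x."""
    candidates = product(LABELS, repeat=len(x))
    return max(candidates, key=lambda y: score(w, x, y))


w = {"soft": -1.0, "loud": 1.0, "same_as_previous": 0.5}
x = [0.9, 0.8, -0.7]
best = predict(w, x)
```

With these toy weights the first two notes are labeled "loud" and the last one "soft". The Viterbi-style decoding mentioned later in this section replaces the exhaustive search when sequences are long.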

For each structure we want to predict, a loss function that measures the accuracy of a prediction is required. A loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ needs to satisfy the following properties:

\[ \Delta(y, y') \ge 0 \quad \text{for } y \ne y', \]

\[ \Delta(y, y) = 0. \]

The loss function is assumed to be bounded. Assume that the input-output pair $(x, y)$ is drawn from a joint distribution $P(x, y)$. The prediction problem is then to minimize the expected loss:

\[ R^{\Delta}_{P}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \Delta(y, f(x)) \, dP(x, y). \]

Since we cannot directly find the distribution $P$, we need to replace this expected loss with an empirical loss, which can be calculated from the observed training set of $(x_i, y_i)$ pairs:

\[ R^{\Delta}_{S}(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i)). \]

Now we are ready to extend SVM to structured output, starting with the linearly separable case and then extending it to a soft-margin formulation.

The linearly separable case can be expressed by a set of linear constraints

\[ \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 0. \]

The constraints imply that the ground truth $y_i$ for $x_i$ has an $F$ value no lower than that of any other $\hat{y}_i \ne y_i$.

The key concept of SVM is the large-margin principle. We not only want to find a solution that satisfies the constraints, but also want to maximize the margin between the ground truth and the second-best $\hat{y}_i$:

\[ \max_{\gamma,\, w : \|w\| = 1} \gamma \quad \text{s.t.} \quad \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \gamma, \]

which is equivalent to the convex quadratic programming problem

\[ \min_{w} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1. \]

To extend the linearly separable case to the non-separable case, slack variables $\xi_i$ are introduced to penalize prediction errors, which results in a soft-margin formulation:

\[ \min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1 - \xi_i. \]

$C$ is the weighting parameter controlling the trade-off between low training error and large margin. The optimal $C$ varies between different problems, so experiments should be conducted to find the optimal $C$ for our problem.

Intuitively, a constraint violation with a larger loss should be penalized more than one with a smaller loss. I. Tsochantaridis et al. [57] therefore proposed two possible ways to take the loss function into account. The first is to re-scale the slack variable by the inverse of the loss, so a high loss leads to a smaller re-scaled slack variable:

\[ \min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge 1 - \frac{\xi_i}{\Delta(y_i, \hat{y}_i)}. \]

The second is to re-scale the margin, which yields

\[ \min_{w,\, \xi_i \ge 0} \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \forall i \in \{1, \dots, n\},\ \forall \hat{y}_i \in \mathcal{Y} : \quad w^T [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \Delta(y_i, \hat{y}_i) - \xi_i. \]
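To make the difference between the two re-scaling schemes concrete, the following sketch computes the smallest slack that satisfies a single constraint under each formulation. Here `margin` stands for the left-hand side of the constraint, the score difference between the ground truth and one competing output, and `loss` for the loss of that competing output; the function names and numbers are illustrative assumptions only.

```python
def slack_rescaling(margin, loss):
    """Smallest xi >= 0 with margin >= 1 - xi / loss,
    which rearranges to xi >= loss * (1 - margin)."""
    return max(0.0, loss * (1.0 - margin))


def margin_rescaling(margin, loss):
    """Smallest xi >= 0 with margin >= loss - xi,
    which rearranges to xi >= loss - margin."""
    return max(0.0, loss - margin)


# A competing output with a high loss that is separated by a margin of 0.5:
narrow_slack = slack_rescaling(0.5, 3.0)    # 3.0 * (1 - 0.5)
narrow_margin = margin_rescaling(0.5, 3.0)  # 3.0 - 0.5

# The same output once it is separated by a margin of 2.0:
wide_slack = slack_rescaling(2.0, 3.0)      # a margin of 1 is already enough
wide_margin = margin_rescaling(2.0, 3.0)    # the required margin grows with the loss
```

Under slack re-scaling a margin of 1 always suffices and the loss only scales the penalty, whereas under margin re-scaling the required margin itself grows with the loss.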

But the above quadratic programming problem has a very large number ($O(n|\mathcal{Y}|)$) of constraints, which takes considerable time to solve. I. Tsochantaridis et al. [57] proposed a greedy algorithm that speeds up the process by selecting only the constraints that contribute the most to finding the solution. Initially, the solver starts with an empty working set containing no constraints. Then the solver iteratively scans the training set to find the most violated constraint under the current solution. If a constraint is violated by more than a desired threshold, it is added to the working set of constraints. The solver then re-calculates the solution under the new working set. The algorithm terminates once no more constraints can be added under the desired precision.
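The constraint-selection step of this working-set procedure can be sketched as follows, under margin re-scaling. This is only an illustration: the feature map `psi`, the Hamming loss, the brute-force search over outputs and the fixed `slacks` are assumptions made for the example, and the step that re-solves the quadratic program after each addition is omitted.

```python
from itertools import product

LABELS = (0, 1)  # hypothetical label alphabet


def psi(x, y):
    """Toy feature map: sum of the inputs assigned to each label."""
    out = [0.0, 0.0]
    for xi, yi in zip(x, y):
        out[yi] += xi
    return out


def hamming(y, y_hat):
    """Number of positions where the two label sequences differ."""
    return float(sum(a != b for a, b in zip(y, y_hat)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def most_violated(w, x, y):
    """Return the output maximizing loss minus margin, and that value."""
    best, best_violation = None, float("-inf")
    for y_hat in product(LABELS, repeat=len(x)):
        diff = [a - b for a, b in zip(psi(x, y), psi(x, y_hat))]
        violation = hamming(y, y_hat) - dot(w, diff)
        if violation > best_violation:
            best, best_violation = y_hat, violation
    return best, best_violation


def select_constraints(w, samples, slacks, epsilon):
    """One scan of the training set: collect the constraints that are
    violated by more than epsilon beyond the current slack."""
    working_set = []
    for i, (x, y) in enumerate(samples):
        y_hat, violation = most_violated(w, x, y)
        if violation > slacks[i] + epsilon:
            working_set.append((i, y_hat))
    return working_set


samples = [([1.0, -1.0], (1, 0))]
untrained = select_constraints([0.0, 0.0], samples, slacks=[0.0], epsilon=0.1)
trained = select_constraints([-2.0, 2.0], samples, slacks=[0.0], epsilon=0.1)
```

With all-zero weights every wrong output violates its constraint, so the worst one is added; with weights that already separate the ground truth, nothing is added and the loop would terminate.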

In a later work, Joachims et al. [56] created a new formulation and algorithm to further speed up training. Instead of using one slack variable for each training sample, which results in a total of $n$ slack variables, they use a single slack variable shared by all $n$ training samples. Following [56], the 1-slack version of the slack-rescaling structural SVM is

\[ \min_{w,\, \xi \ge 0} \frac{1}{2} \|w\|^2 + C\xi \quad \text{s.t.} \quad \forall (\hat{y}_1, \dots, \hat{y}_n) \in \mathcal{Y}^n : \quad \frac{1}{n} w^T \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi, \]

and the 1-slack version of the margin-rescaling structural SVM is

\[ \min_{w,\, \xi \ge 0} \frac{1}{2} \|w\|^2 + C\xi \quad \text{s.t.} \quad \forall (\hat{y}_1, \dots, \hat{y}_n) \in \mathcal{Y}^n : \quad \frac{1}{n} w^T \sum_{i=1}^{n} [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi. \]

A detailed proof that the new formulation is as general as the old one is given in [56].

With the framework described above, the only problem left is how to define the general loss function and $\Psi$. Drawing on the inter-state and time dependency concepts of the hidden Markov model, Y. Altun et al. [58] proposed two types of features for an equal-length observation/label sequence pair $(x, y)$. The first is the interaction of an observed feature of $x^s$ with a label $y^t$; the other is the interaction between neighboring labels $y^s$ and $y^t$.

To illustrate the method, we use an example from music. Let $\Psi_r(x^s)$ be the $r$-th observed feature of the note $x^s$ located at the $s$-th position of the phrase, and let $[\![ y^t = \tau ]\!]$ indicate that the $t$-th note is played at a velocity of $\tau$. The interaction of the observed feature and the label can then be written as

\[ \psi^{st}_{r\tau}(x, y) = [\![ y^t = \tau ]\!] \, \Psi_r(x^s), \quad 1 \le r \le d,\ \tau \in \Sigma, \]

where $d$ is the number of observed features and $\Sigma$ is the set of labels.

Figure 3.2: Learning phase flow chart

And the interaction between labels can be written as

\[ \hat{\psi}^{st}_{\sigma\tau}(x, y) = [\![ y^s = \sigma \wedge y^t = \tau ]\!], \quad \sigma, \tau \in \Sigma. \]

By selecting an order of dependency for the HMM, we can further restrict $s$ and $t$. For example, for a first-order HMM, $s = t$ for the first feature and $s = t - 1$ for the second feature. The two features at the same time $t$ are then stacked into a vector $\Psi(x, y; t)$. The feature map for the whole sequence is simply the sum of all the feature vectors:

\[ \Psi(x, y) = \sum_{t=1}^{T} \Psi(x, y; t). \]
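A minimal sketch of such a first-order feature map is given below, assuming each note is described by a short list of numeric score features and each label is a quantized performance value. The function name, the label names and the flattened layout (emission block followed by transition block) are choices made for this illustration, not the layout used by the SVM-HMM implementation.

```python
def sequence_feature_map(x, y, labels):
    """Psi(x, y) for a first-order model: the sum over positions t of the
    per-position vectors Psi(x, y; t).

    x      -- list of observation vectors, one per note
    y      -- list of labels, one per note
    labels -- the label alphabet, fixing the order of the output entries
    """
    d = len(x[0])
    index = {label: k for k, label in enumerate(labels)}
    n = len(labels)
    # emission[k][r]: sum of feature r over notes carrying label k (s = t)
    emission = [[0.0] * d for _ in range(n)]
    # transition[j][k]: number of times label j is followed by label k (s = t - 1)
    transition = [[0.0] * n for _ in range(n)]
    for t, (obs, label) in enumerate(zip(x, y)):
        row = emission[index[label]]
        for r in range(d):
            row[r] += obs[r]
        if t > 0:
            transition[index[y[t - 1]]][index[label]] += 1.0
    flat = [v for row in emission for v in row]
    flat += [v for row in transition for v in row]
    return flat


notes = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
velocities = ["p", "f", "f"]
phi = sequence_feature_map(notes, velocities, labels=("p", "f"))
```

For this three-note phrase the first four entries accumulate the note features per label, and the last four count the label transitions (one "p" to "f" and one "f" to "f").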

The distance, i.e. the general loss function, between two feature maps depends on the number of common label segments and on the inner product between the input feature sequences with common labels:

\[ \Delta(\Psi(x, y), \Psi(\hat{x}, \hat{y})) = \sum_{s,t} [\![ y^{s-1} = \hat{y}^{t-1} \wedge y^s = \hat{y}^t ]\!] + \sum_{s,t} [\![ y^s = \hat{y}^t ]\!] \, k(x^s, \hat{x}^t). \]

Finally, during the prediction process, a Viterbi-like decoding algorithm is used to efficiently find a $y$ that maximizes $F$.
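The decoding step can be sketched with a standard Viterbi recursion over a first-order model. The weights below are split into per-label emission vectors and pairwise transition scores purely for readability; this layout and the toy numbers are assumptions for the example, not the trained model of this thesis.

```python
def viterbi(x, labels, emission_w, transition_w):
    """Find the label sequence with the highest total linear score.

    emission_w[label]        -- weight vector applied to each observation
    transition_w[(prev, cur)] -- score for moving from label prev to cur
    """
    def emit(label, obs):
        return sum(wr * xr for wr, xr in zip(emission_w[label], obs))

    # best[label]: score of the best partial sequence ending in label
    best = {label: emit(label, x[0]) for label in labels}
    backpointers = []
    for obs in x[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition_w[(p, cur)])
            new_best[cur] = best[prev] + transition_w[(prev, cur)] + emit(cur, obs)
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final label.
    last = max(labels, key=lambda label: best[label])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


labels = ("p", "f")
emission_w = {"p": [-1.0], "f": [1.0]}
transition_w = {("p", "p"): 0.5, ("f", "f"): 0.5, ("p", "f"): 0.0, ("f", "p"): 0.0}
path, total = viterbi([[0.9], [0.8], [-0.7]], labels, emission_w, transition_w)
```

The recursion visits each note once and considers every pair of neighboring labels, so its cost grows linearly with the phrase length instead of exponentially as in exhaustive search.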

3.3 Learning Performance Knowledge

In this section, we introduce the components that make up the learning phase. The main goal of the learning phase is to extract performance knowledge from training samples. Fig. 3.2 shows the internal structure of the learning phase.

Training samples are pairs of matched scores and expressive performances (their format and preparation process are discussed in Chapter 4). The raw data from the samples is too complex to process directly, so we need to extract important features from it. Two types of features are extracted from the samples: the musicological cues from the scores (score features), and the measurable expressions from the expressive performances (performance features). We want the system to learn how the score features are “translated” into the performance features. This process is analogous to a human performer reading the explicit and implicit cues from the score and performing the music with certain expressions. The definitions of the features used are presented in Section 3.5.

3.3.1 Training Sample Loader

The training samples are loaded by the sample loader module. Since a training sample consists of a score (MusicXML format) and an expressive recording (MIDI format), the sample loader finds the two files and loads them into an intermediate representation (the music21.Stream object provided by the music21 library [59] from MIT). The music21 library converts the MusicXML and MIDI formats into a Python object hierarchy that is easy to access and manipulate from Python code.

One caveat here is that the music21 library will quantize the time in MIDI, which will destroy the subtle onset and duration expressions. And the music21 library does not handle the “ticks per quarter note” information in the MIDI header [60], which is essential for the MIDI parser to interpret the correct time scale. So, we must explicitly disable quantization and specify the “ticks per quarter note” value during MIDI loading.
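The role of the “ticks per quarter note” value can be illustrated without music21. The sketch below reads the division field from a standard MIDI file header using only the Python standard library and converts a tick count to quarter-note units without rounding. It is not the loader used in this thesis; the function names and the example tick values are made up for illustration.

```python
import struct


def read_ticks_per_quarter(header_bytes):
    """Parse the 14-byte MThd chunk of a standard MIDI file and return
    the ticks-per-quarter-note value stored in its division field."""
    chunk_id, length, fmt, n_tracks, division = struct.unpack(
        ">4sIHHH", header_bytes[:14]
    )
    if chunk_id != b"MThd" or length != 6:
        raise ValueError("not a standard MIDI header")
    if division & 0x8000:
        raise ValueError("SMPTE time division is not handled in this sketch")
    return division


def ticks_to_quarters(ticks, ticks_per_quarter):
    """Convert an absolute tick count to quarter-note units, unrounded."""
    return ticks / ticks_per_quarter


# A synthetic header: format 1, 2 tracks, 480 ticks per quarter note.
header = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
tpq = read_ticks_per_quarter(header)

# A note played 13 ticks late relative to the second beat (tick 480).
onset = ticks_to_quarters(493, tpq)
quantized = round(onset * 4) / 4  # snapping to a sixteenth-note grid
```

The unrounded onset keeps the small deviation from the beat, while the quantized value collapses to exactly 1.0, which is the kind of expressive detail that is lost when quantization is left enabled.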


3.3.2 Features Extraction

In order to keep the system architecture simple, feature extractors are designed to be independent of one another, so features can be included or removed without affecting the rest of the system. Furthermore, this enables parallel feature extraction. Sometimes, however, a feature inevitably depends on another feature: for example, the “relative duration with the previous note” is calculated from the “duration” feature. Since we want to avoid complex dependency management, the “relative duration with the previous note” feature extractor invokes the “duration” extractor itself, instead of waiting for the “duration” extractor to finish first. As a result, the “duration” feature would be computed twice. To avoid such redundant computation, we implemented a caching mechanism. Once the “duration” feature has been computed, whether during the “duration” extraction or during the “relative duration with the previous note” extraction, its value is cached for the rest of the execution session. So no matter how many feature extractors use the “duration” feature, they can get the value directly from the cache. This speeds up execution without requiring dependency handling.
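The caching idea can be illustrated with a minimal sketch. It is not the code used in this work: the function names, the toy data, and the use of `functools.lru_cache` are assumptions made for illustration, as is the choice of 1.0 for the first note's ratio.

```python
from functools import lru_cache

# Hypothetical toy data: note durations (in quarter notes) for one phrase.
DURATIONS = (1.0, 0.5, 0.5, 2.0)

calls = {"duration": 0}  # counts how often the raw extractor actually runs


@lru_cache(maxsize=None)
def duration_feature(phrase):
    """Extract the "duration" feature; the result is cached for the session."""
    calls["duration"] += 1
    return tuple(phrase)


def relative_duration_prev_feature(phrase):
    """Invoke the "duration" extractor directly instead of waiting for it."""
    d = duration_feature(phrase)
    # Assumption: the first note has no predecessor, so its ratio is set to 1.0.
    return (1.0,) + tuple(d[i] / d[i - 1] for i in range(1, len(d)))


durations = duration_feature(DURATIONS)             # computed once
ratios = relative_duration_prev_feature(DURATIONS)  # served from the cache
```

Both extractors can be called in any order, and the underlying computation still runs only once per phrase.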

The extracted features are aggregated and stored into a JavaScript Object Notation (JSON) file for the SVM-HMM module to load. By saving the features in a human-readable intermediate file, we can debug potential problems easily.

3.3.3 SVM-HMM Learning

After all features are extracted, the next step is to learn the performance knowledge from the features. In the early stage of this research, we successfully applied linear regression [61]. However, assuming the problem to be linear is clearly an oversimplification, so we switched to the structural support vector machine with hidden Markov model output (SVM-HMM) [56--58] as our supervised learning algorithm.

The SVM-HMM learning module loads the feature file from the previous stage and aggregates the features to fit the input format required by the SVM-HMM learner program. Most features from the previous stage are real-valued; since SVM-HMM only takes discrete performance features¹, quantization is required. There are many possible ways to quantize the features, and each will result in different outputs. Here we present one quantizer design as an example: for each performance feature, the mean and standard deviation over all training samples are calculated first. The range from the mean minus four standard deviations to the mean plus four standard deviations is divided into 128 uniform intervals. Values greater than the mean plus four standard deviations are quantized into the 128th bin, and values smaller than the mean minus four standard deviations are quantized into the 1st bin.

The number of intervals decides how fine-grained the quantization is. If the number is too small, subtle expressions are lost to high quantization error. If the number is too large, there are too few samples in each interval, which is undesirable from a statistical learning perspective; the training process also takes a lot of CPU and memory resources without a significant gain in prediction accuracy. The range of four standard deviations was chosen by trial and error: a narrower range causes most of the extreme values to be quantized into the largest or smallest bin, so the performance has many saturated values, while a very wide range makes each quantization bin too wide, raising the quantization error.
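The quantizer design described above can be sketched as follows. This is an illustrative sketch rather than the original implementation; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

N_BINS = 128   # number of uniform intervals, as in the text
N_SIGMA = 4.0  # half-width of the covered range, in standard deviations


def fit_quantizer(values):
    """Return (low, width): the lower edge of bin 1 and the width of each bin."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("all training values are identical; nothing to quantize")
    low = mean - N_SIGMA * std
    width = (2 * N_SIGMA * std) / N_BINS
    return low, width


def quantize(value, low, width):
    """Map a real value to a bin index in 1..N_BINS, saturating at both ends."""
    index = int((value - low) // width) + 1
    return max(1, min(N_BINS, index))


# Hypothetical training values for one performance feature.
low, width = fit_quantizer([0.0, 1.0, 2.0, 3.0, 4.0])
labels = [quantize(v, low, width) for v in (-100.0, 2.7, 100.0)]
```

Values beyond four standard deviations saturate at bin 1 or bin 128, which is the source of the saturated values discussed above.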

The theoretical background of SVM-HMM was introduced in Section 3.2. We use Thorsten Joachims's implementation, called SVM^hmm [62]. SVM^hmm is an implementation of structural SVMs for sequence tagging [58] using the training algorithm described in [57] and [56]. The SVM^hmm package contains an SVM-HMM training program called svm_hmm_learn and a prediction program called svm_hmm_classify. For architectural simplicity, we train one model for each performance feature, and each model uses all the score features to predict a single performance feature. The svm_hmm_learn program reads the features from a file in the following format: each line represents the features of one note, in time order, formatted as

PERF qid:EXNUM FEAT1:FEAT1_VAL FEAT2:FEAT2_VAL ... #comment

PERF is a quantized performance feature. The EXNUM after qid: identifies the phrase; all notes in a phrase have the same qid:EXNUM identifier. Following the identifier are the quantized score features, denoted as feature name : feature value and separated by spaces. Any text following a # symbol is a comment.

¹ SVM-HMM was initially designed for tasks such as part-of-speech tagging, in which real-valued or binary features are used to predict discrete part-of-speech tags.
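As an illustration of the line format above, the following sketch formats one note. The helper name and the example values are hypothetical; emitting the features in increasing feature-number order is an assumption about what the learner expects.

```python
def format_note_line(perf_label, phrase_id, score_features, comment=""):
    """Format one note as `PERF qid:EXNUM FEAT:VAL ... #comment`.

    score_features maps a feature number to its (quantized) value.
    """
    # Assumption: features are written in increasing feature-number order.
    feats = " ".join(f"{num}:{val}" for num, val in sorted(score_features.items()))
    line = f"{perf_label} qid:{phrase_id} {feats}"
    if comment:
        line += f" #{comment}"
    return line


# Hypothetical note: performance label 65, phrase 1, two score features.
example_line = format_note_line(65, 1, {2: 60, 1: 0.25}, comment="note 1")
```

Writing one such line per note, phrase after phrase, gives a file of the shape that the training program reads.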

Several key parameters of the training program need to be tuned. The first is the C parameter of the SVM, which controls the trade-off between lowering the training error and maximizing the margin. A larger C results in a lower training error, but the margin may be smaller. The second is the ε parameter, which controls the precision required for termination. The smaller the ε, the higher the precision, but training may require more time and computing resources. Finally, for the HMM part of the model, the orders of dependency of the transition states and the emission states need to be specified. In our case, both are left at their defaults: the transition dependency is set to one, which corresponds to the first-order Markov property, and the emission dependency is set to zero. Since we train one model for each performance feature, each model has its own set of parameters. The parameter selection experiments are presented in Chapter 5.
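The parameters above map onto command-line options of the training program. The sketch below only assembles the argument list and does not run anything. The option spellings (`-c`, `-e`, `--t`, `--e`) follow our reading of the SVM^hmm documentation and should be checked against the installed version; the file names and parameter values are hypothetical.

```python
def build_learn_command(train_file, model_file, c, epsilon,
                        transition_order=1, emission_order=0):
    """Assemble an svm_hmm_learn invocation as an argument list."""
    return [
        "svm_hmm_learn",
        "-c", str(c),                  # trade-off between training error and margin
        "-e", str(epsilon),            # precision required for termination
        "--t", str(transition_order),  # order of transition dependencies
        "--e", str(emission_order),    # order of emission dependencies
        train_file,
        model_file,
    ]


# Hypothetical values; one such command is needed per performance feature.
command = build_learn_command("onset.train", "onset.model", c=5, epsilon=0.5)
```

The resulting list can be passed to `subprocess.run` once the binary is on the search path.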

Finally, the training program outputs three model files (one for each of the three performance features), which contain the SVM-HMM model parameters, such as the support vectors, and other metadata. Since training a model takes considerable time (roughly from a dozen minutes to a few hours, depending on the number of training samples and the power of the computer), the system can only support off-line learning. However, the learning process only needs to be run once; the performance knowledge model can then be reused over and over again in the performing phase.

3.4 Performing Expressively

The performing phase uses the performance knowledge model learned in the previous phase to generate expressive performances. The input is a score file to be performed, which should not have been used as a training sample, so that the model is not evaluated on data it has already seen. Score features are extracted from it using the same routine as in the learning phase. The SVM-HMM generation module uses the learned model and the score features to predict the performance features. These features are then de-quantized back to real values, as described in Section 3.4.1. A MIDI generation module applies those performance features onto the score to produce an expressive MIDI file. The MIDI file itself is already an expressive performance. To actually hear the sound, a software synthesizer can be used to render the MIDI file into WAV or MP3 format.

Figure 3.3: Performing phase flow chart

3.4.1 SVM-HMM Generation

The feature extraction and aggregation process in the performing phase is similar to the learning phase, but the PERF fields in the SVM-HMM input file are left blank for the algorithm to predict. The svm_hmm_classify program will take these inputs with the learned model file and predict the quantized labels of the performance features. These performance features are de-quantized back to the middle point of each bin.
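De-quantization to the bin midpoint can be sketched as follows, assuming bins are numbered from 1 and that `low` (the lower edge of bin 1) and `width` (the bin width) were saved when the quantizer was fitted. Both names are hypothetical.

```python
def dequantize(bin_index, low, width):
    """Return the midpoint of bin `bin_index` (1-based)."""
    return low + (bin_index - 0.5) * width


# Hypothetical quantizer with mean 0 and standard deviation 1:
# the range [-4, 4] split into 128 bins gives a width of 0.0625.
low, width = -4.0, 8.0 / 128
first_mid = dequantize(1, low, width)
last_mid = dequantize(128, low, width)
```

With this choice, the reconstruction error for any value inside the covered range is at most half a bin width.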

3.4.2 MIDI Generation and Synthesis

The predicted performance features are then applied onto the input score, i.e., the onset timings are shifted, the durations extended or shortened, and the loudness set according to the predicted performance features. The resulting expressive performance is written to MIDI files using the music21 library [59].
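A minimal sketch of this step, independent of music21, is shown below. The data classes and the clamping rules (no negative onsets, velocity limited to 1..127) are assumptions made for illustration and are not taken from the original system.

```python
from dataclasses import dataclass


@dataclass
class ScoreNote:
    pitch: int       # MIDI pitch number
    onset: float     # nominal onset, in quarter notes
    duration: float  # nominal duration, in quarter notes


@dataclass
class PerformedNote:
    pitch: int
    onset: float
    duration: float
    velocity: int


def apply_performance(score, onset_devs, rel_durations, loudness):
    """Apply predicted (de-quantized) performance features to score notes."""
    performed = []
    for note, dev, rd, vel in zip(score, onset_devs, rel_durations, loudness):
        performed.append(PerformedNote(
            pitch=note.pitch,
            onset=max(0.0, note.onset + dev),      # shift the onset
            duration=note.duration * rd,           # stretch or shrink the duration
            velocity=max(1, min(127, round(vel)))  # keep a valid MIDI velocity
        ))
    return performed


score = [ScoreNote(60, 0.0, 1.0), ScoreNote(62, 1.0, 1.0)]
performance = apply_performance(score, [0.0, 0.125], [0.5, 1.5], [64, 200])
```

The list of performed notes can then be handed to any MIDI writer.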

In order to actually hear the expressive performance, the MIDI file can be rendered by a software MIDI synthesizer. For example, the timidity++ software synthesizer for Linux can render the MIDI file into a WAV (Waveform Audio Format) file, which can be compressed into MP3 (MPEG-2 Audio Layer III) by the lame audio encoder. Alternatively, one can use a hardware synthesizer; for example, the RenCon [1] contest uses a Yamaha Disklavier digital piano to render contestants' submissions.

Because the sub-note level expression is not the primary goal of this research, we choose a standard MIDI grand piano sound to render the music. The system can be extended to use a more advanced physical model or instrument-specific audio synthesizer.

Other sub-note level features, such as special playing techniques for the violin, can be added to the feature list and learned by the SVM-HMM model.

3.5 Features

As mentioned in Section 3.3, there are two types of features, the score features and performance features. We will present the features used in the system and discuss the difficulties encountered.

3.5.1 Score Features

Score features are musicological cues present in the score. The purpose of the score features is to simulate the high-level information a performer may perceive when reading the score. The basic time unit for these features is the note, and each note has all of the features presented below. Score features include:

Relative position in the phrase: the relative position of a note in the phrase with its value ranging from 0% to 100%. This feature is intended to capture the special expression at the start or the end of a phrase, or time-variant expressions like the arch-type loudness variation.

Pitch: the pitch of a note denoted by the MIDI pitch number (the resolution is one semitone).

Interval from the previous note: the interval between the current note and its previous note (in semitones). This feature and the next one represent the direction of the melodic line. See Fig. 3.4 for an example.

$$\Delta P^{-} = P_i - P_{i-1}$$

Figure 3.4: Intervals with neighbor notes

Interval to the next note: the interval between the current note and its following note (in semitones). See Fig. 3.4 for an example.

$$\Delta P^{+} = P_{i+1} - P_i$$

Note duration: the duration of a note (in quarter notes).

Grace notes have no duration in the MusicXML specification [63], because they are considered very short ornaments that do not occupy a real beat position. However, a zero duration is hard to handle in the mathematical formulation, so we assign a grace note the duration of a sixty-fourth note, which is far shorter than any note in our corpus.

Relative duration with the previous note: the duration of a note divided by the duration of its previous note. See Fig. 3.5 for an example. For a phrase of n notes with durations $D_1, D_2, \ldots, D_n$,

$$RD^{-} = \frac{D_i}{D_{i-1}}$$

This feature is intended to locate local changes in tempo, such as a series of rapid consecutive notes followed by a long note, which will cause a discontinuity in this feature.

Relative duration with the next note: the duration of a note divided by the duration of its following note. See Fig. 3.5 for an example.

$$RD^{+} = \frac{D_i}{D_{i+1}}$$

Figure 3.5: Relative durations with neighbor note

Figure 3.6: Metric position

Metric position: the position (beat) of a note in a measure. For example, under a time signature of 4/4, if a measure consists of five notes, they may have metric positions of 1, 2, 2.5, 3 and 4, respectively.

Metric position usually implies beat strength. In most tonal music, there exists a hierarchy of beat strength. For example, in a time signature of 4/4, the first beat is usually the strongest, the third beat is the second strongest, and the second and fourth beats are the weakest.
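The score features defined above can be computed along the following lines. This is an illustrative sketch, not the extractor used in this work. The `Note` type, the feature names, the treatment of phrase boundaries (interval 0 and ratio 1 when a neighbor is missing), the index-based relative position, and the metric-position formula (no pickup measure, one beat per quarter note) are all assumptions.

```python
from typing import NamedTuple

GRACE_DURATION = 1.0 / 16.0  # a sixty-fourth note, in quarter notes


class Note(NamedTuple):
    pitch: int       # MIDI pitch number
    onset: float     # onset in quarter notes from the start of the piece
    duration: float  # nominal duration in quarter notes (0 for grace notes)


def score_features(phrase, beats_per_measure=4.0):
    """Return one feature dict per note of a phrase (a list of Note)."""
    n = len(phrase)
    # Grace notes have zero duration in MusicXML; substitute a sixty-fourth note.
    durs = [nt.duration if nt.duration > 0 else GRACE_DURATION for nt in phrase]
    rows = []
    for i, note in enumerate(phrase):
        has_prev, has_next = i > 0, i < n - 1
        rows.append({
            # Assumption: position is the note index scaled to the range 0..1.
            "rel_position": i / (n - 1) if n > 1 else 0.0,
            "pitch": note.pitch,
            # Assumption: a missing neighbor gives a neutral interval of 0 ...
            "interval_prev": note.pitch - phrase[i - 1].pitch if has_prev else 0,
            "interval_next": phrase[i + 1].pitch - note.pitch if has_next else 0,
            "duration": durs[i],
            # ... and a neutral duration ratio of 1.
            "rel_dur_prev": durs[i] / durs[i - 1] if has_prev else 1.0,
            "rel_dur_next": durs[i] / durs[i + 1] if has_next else 1.0,
            # Assumption: no pickup measure, one beat per quarter note.
            "metric_position": note.onset % beats_per_measure + 1.0,
        })
    return rows


# The five-note 4/4 measure from the metric position example.
phrase = [Note(60, 0.0, 1.0), Note(62, 1.0, 0.5), Note(64, 1.5, 0.5),
          Note(65, 2.0, 1.0), Note(67, 3.0, 1.0)]
features = score_features(phrase)
```

For this measure the computed metric positions are 1, 2, 2.5, 3 and 4, matching the example in the text.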

3.5.2 Performance Features

Performance features are the expressions we would like to learn from a performance. They are extracted by calculating how the performance deviates from the nominal notation in the score. Performance features include:

Onset time deviation: a human performer usually adds conscious or unconscious rubato to a performance. The onset time deviation is the difference in onset timing between the performance and the score, namely

$$\Delta O = O_i^{\mathrm{perf}} - O_i^{\mathrm{score}}$$

where $O_i^{\mathrm{perf}}$ is the onset time of note $i$ in the performance and $O_i^{\mathrm{score}}$ is the onset time of note $i$ in the score.

Figure 3.7: Systematic bias in onset deviation

However, the above formula assumes that the performance is played at exactly the tempo assigned by the score. In reality, performers do not always keep up with the speed of the score, perhaps because of limited piano skills, or they may speed up or slow down certain passages to express their musical interpretation. Therefore, the performance should be linearly scaled to avoid systematic bias. We present a solution to this issue in Section 3.5.3.

Loudness: the loudness of a note, measured as MIDI velocity (0 to 127).

Relative duration: the performed duration of a note divided by the nominal duration in the score.

$$RD = \frac{D_i^{\mathrm{perf}}}{D_i^{\mathrm{score}}}$$
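The three performance features can be computed note by note as in the sketch below. It assumes that the score and the performance are already matched note-to-note and expressed in the same time unit (quarter notes). The tuple layouts and the function name are hypothetical.

```python
def performance_features(score_notes, performed_notes):
    """Compute (onset deviation, loudness, relative duration) for each note.

    score_notes:     list of (onset, duration)
    performed_notes: list of (onset, duration, velocity), matched note-to-note
    """
    features = []
    for (s_on, s_dur), (p_on, p_dur, velocity) in zip(score_notes, performed_notes):
        onset_deviation = p_on - s_on     # Delta O
        relative_duration = p_dur / s_dur  # RD
        features.append((onset_deviation, velocity, relative_duration))
    return features


score_notes = [(0.0, 1.0), (1.0, 1.0)]
performed_notes = [(0.0, 0.5, 70), (1.25, 1.5, 90)]
perf_features = performance_features(score_notes, performed_notes)
```

In this toy example the second note is played a quarter of a beat late and held 50% longer than notated.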

3.5.3 Normalizing Onset Deviation

In the previous section, we pointed out that the onset deviation feature extractor runs into difficulties when the performer does not play at the exact tempo indicated by the score. As illustrated in Fig. 3.7, if the performance is played slower than expected, the deviations at the end of the phrase become very large due to accumulated error. The same issue occurs when the performer plays faster than expected. The systematic bias caused by the difference in total duration is mixed with the local deviation. For a long phrase, the onset deviation of the last notes can be as large as a dozen quarter notes.

Such extremely large values will be learned by the model and cause erroneous predictions. A note may be delayed by a few quarter notes, causing the notes to be played in the wrong order.

In other words, the onset deviation actually contains two types of deviation: a global, systematic deviation caused by the difference between the performed and the nominal tempo, and a local deviation caused by the note-level expression. Since the intention of the onset deviation feature is to capture the note-level expression, the performance must be linearly scaled to cancel out the global deviation.

Initially, we tried two possible ways of normalization:

1. To align the onset of the first notes, and align the onset of the last notes.

2. To align the onset of the first notes, and align the end (MIDI note-off event) of the last notes.

However, neither method can robustly eliminate extreme values. Therefore, we propose an automated approach that finds the scaling ratio for which the normalized onsets of the performance best fit those of the score. The measure of fitness is defined as the Euclidean distance between the normalized performance onset sequence and the score onset sequence, represented as vectors. Brent's method [64] is used to find this optimal ratio. To speed up the optimization and avoid unreasonable local minima, a search range of [initial guess × 0.5, initial guess × 2] is imposed on the optimizer. The initial guess is a rough estimate of the ratio, calculated by aligning the first and last onsets. We then assume that the actual ratio is no smaller than half of the initial guess and no larger than twice the initial guess. The two factors 0.5 and 2 were chosen by trial and error, and most of the empirical data supports this choice. We will demonstrate the effectiveness of this solution in Section 5.1.
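The search for the scaling ratio can be sketched as below. To keep the example free of external dependencies, a golden-section search is used here as a simple stand-in for Brent's method; this is a substitution, not the optimizer used in this work. Aligning the first onsets by subtraction before scaling is also an assumption, and all function names are hypothetical. The sketch expects phrases with at least two distinct onsets.

```python
import math


def fitness(ratio, perf_onsets, score_onsets):
    """Euclidean distance between scaled performance onsets and score onsets."""
    p0, s0 = perf_onsets[0], score_onsets[0]
    return math.sqrt(sum(((p - p0) * ratio - (s - s0)) ** 2
                         for p, s in zip(perf_onsets, score_onsets)))


def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function f on [lo, hi] (stand-in for Brent's method)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


def best_scaling_ratio(perf_onsets, score_onsets):
    """Search [0.5 * guess, 2 * guess] for the best-fitting scaling ratio."""
    # Initial guess: align the first and the last onsets.
    guess = ((score_onsets[-1] - score_onsets[0])
             / (perf_onsets[-1] - perf_onsets[0]))
    return golden_section_min(
        lambda r: fitness(r, perf_onsets, score_onsets), 0.5 * guess, 2.0 * guess)


def normalized_onset_deviations(perf_onsets, score_onsets):
    """Onset deviations after cancelling the global tempo difference."""
    ratio = best_scaling_ratio(perf_onsets, score_onsets)
    p0, s0 = perf_onsets[0], score_onsets[0]
    return [(p - p0) * ratio - (s - s0) for p, s in zip(perf_onsets, score_onsets)]


# A performance played uniformly 20% slower than the score.
score_onsets = [0.0, 1.0, 2.0, 3.0]
perf_onsets = [0.0, 1.2, 2.4, 3.6]
ratio = best_scaling_ratio(perf_onsets, score_onsets)
deviations = normalized_onset_deviations(perf_onsets, score_onsets)
```

For a uniformly slowed performance like this one, the recovered ratio is close to 1/1.2 and the remaining local deviations are close to zero.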


Chapter 4 Corpus Preparation

An expressive performance corpus is a set of performance samples. Since this research is based on a supervised learning algorithm, a high-quality corpus is essential to our success. Each sample consists of a score and its corresponding human recording. Some metadata such as phrasing, structure analysis, or harmonic analysis may be included, too.

In this chapter, we review some of the existing corpora, describe the specification and formats of our corpus, and explain how we actually constructed it.

4.1 Existing Corpora

Unlike research fields such as speech processing or natural language processing, there exists virtually no publicly accessible corpus for computer expressive performance research. CrestMusePEDB [65] (PEDB stands for “Performance Expression Database”) is a corpus created by the Japan Science and Technology Agency's CREST program. However, as of this writing, we have not been able to establish contact with the database administrators to gain access to it. The corpus reportedly comes with a GUI tool for annotating expressive performance parameters from audio recordings. Its repertoire covers many piano works by well-known classical composers such as Bach, Mozart, and Chopin, recorded by world-famous pianists. According to its website [65], it contains the following data: PEDB-SCR (score text information), PEDB-DEV (performance deviation data) and PEDB-IDX (audio performance credit). However, the quality of the data is unknown.


Another example is the Magaloff Project [66], created by universities in Austria. They invited the Russian pianist Nikita Magaloff to record all of Frédéric Chopin's works for solo piano on a Bösendorfer SE computer-controlled grand piano. This corpus became the material for many subsequent studies [67--73]. Flossmann et al., one of the leading teams of the project, also won the 2008 RenCon contest with YQX [74], a system based on this corpus. However, the corpus is not open to the public.

Since neither corpus is available to us, we need to build our own. We start by defining the specification.

4.2 Corpus Specification

The corpus we need must fulfill the following criteria:

1. All the samples are monophonic, containing only a single melody without chords.

2. No human errors, such as insertion, deletion, or wrong pitch, exist in the recording; the score and recording are matched note-to-note.

3. The phrasings are annotated by a human.

4. The scores, recordings and phrasing data are in a machine-readable format.

Certain potentially useful information is not included because it is less relevant to our goal. Examples are:

1. Advanced structural analysis, such as GTTM (Generative Theory of Tonal Music) [75]

2. Harmonic analysis

3. Piano pedal usage

4. Piano fingerings

5. Techniques of other musical instruments, such as violin pizzicato, tapping, or bow techniques.


Table 4.1: Clementi's Sonatinas Op.36

| Title | Movement | Time Signature |
|---|---|---|
| No.1 Sonatina in C major | I. Allegro | 4/4 |
| | II. Andante | 3/4 |
| | III. Vivace | 3/8 |
| No.2 Sonatina in G major | I. Allegretto | 2/4 |
| | II. Allegretto | 3/4 |
| | III. Allegro | 3/8 |
| No.3 Sonatina in C major | I. Spiritoso | 4/4 |
| | II. Un poco adagio | 2/2 |
| | III. Allegro | 2/4 |
| No.4 Sonatina in F major | I. Con spirito | 3/4 |
| | II. Andante con espressione | 2/4 |
| | III. Rondó: Allegro vivace | 2/4 |
| No.5 Sonatina in G major | I. Presto | 2/2 |
| | II. Allegretto moderato | 3/8 |
| | III. Rondó: Allegro molto | 2/4 |
| No.6 Sonatina in D major | I. Allegro con spirito | 4/4 |
| | II. Allegretto | 6/8 |

We choose Clementi's Sonatinas Op.36 for our corpus. They are standard repertoire for piano students, so it is easy to find performers with a wide range of skill levels to record the corpus. These sonatinas are in the Classical style, so the learned model can potentially be extended to works by other composers of the Classical era, such as Mozart and Haydn.

There are six sonatinas in Op.36. The first five have three movements each, and the last one has two movements. The movement titles and time signatures of all the pieces are listed in Table 4.1.

MusicXML is used to represent Clementi's work in digital format. MusicXML is a digital score notation based on XML (eXtensible Markup Language); it can express most traditional music notation and metadata. Most music notation software and software tools support the MusicXML format. Although MIDI is also a possible candidate for representing the score, it is designed to carry instrument control signals rather than notation. Some music symbols may not be available in MIDI. Furthermore, MIDI represents music as a series of note-on and note-off events, which requires additional effort to transform into traditional notation.

For representing the performance, however, MIDI is the most suitable format. Using a key-pressure-sensitive digital piano, pianists can record their performances in a natural way. The recordings have high precision in time, pitch and loudness (key pressure), and polyphonic tracks can easily be recorded separately. Although a WAV (Waveform Audio Format) audio recording has higher fidelity than MIDI, it is harder for computers to parse. Without robust onset detection, pitch detection, and source separation technology, it is extremely difficult to extract the information. Manually annotating each WAV recording takes much effort, and the accuracy may not be consistent across annotators.

There is a feasible but impractical way to keep both the score and the recording in one single MIDI file. Instead of recording the actual note-on and note-off timing, we keep the nominal note-on and note-off timing from the score. Then, MIDI tempo-change events are inserted before each note to shift the performed timing of the recorded notes. Thus, the nominal time of each note represents the score, and the rendered time represents the performance. However, MIDI is limited as a score format, and recovering the performance requires complex calculations, so this method is not used in this research.

Finally, we store the phrasing, which is the only metadata we use, in a plaintext file; each line in the phrasing file gives the starting point of one phrase. The starting point is defined as the onset timing (in quarter notes) counted from the beginning of the piece¹. The phrasing is decided by the following principles:

1. Phrases may be separated by a salient pause.

2. A phrase may end with a cadence.

3. Phrases may be separated by dramatic change in tempo, key or loudness.

4. Repeated structures may be repeated phrases.

Since the phrasing controls the structural interpretation of a piece, we would like to leave this freedom of expression to the user. However, if a good automatic phrasing algorithm becomes available, it can easily be integrated into the current system to make it fully automatic.
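Reading the phrasing file described above and assigning notes to phrases could look like the following sketch. The function names and the sample file content are hypothetical; the only property taken from the text is that each line holds one phrase start, given as an onset in quarter notes.

```python
import bisect


def parse_phrase_starts(text):
    """Parse one phrase start (onset in quarter notes) per non-empty line."""
    return sorted(float(line) for line in text.splitlines() if line.strip())


def phrase_index(onset, starts):
    """Return the 0-based index of the phrase containing a note at `onset`."""
    # A note belongs to the last phrase whose start is not later than its onset.
    return max(0, bisect.bisect_right(starts, onset) - 1)


# Hypothetical phrasing file: the second phrase really starts at 2 1/3,
# but it is written as the finite decimal 2.3.
starts = parse_phrase_starts("0\n2.3\n8\n")
indices = [phrase_index(t, starts) for t in (2.0, 2 + 1 / 3, 8.0)]
```

Because a note is assigned to the last phrase whose start does not exceed its onset, writing 2.3 instead of the exact starting point still places the note at 2 1/3 in the second phrase.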

¹ For a phrase whose starting point is a recurring decimal, for example $2\tfrac{1}{3} = 2.333\ldots$, the starting point can alternatively be written as any finite decimal between the end of the previous phrase and the start of the current phrase. For example, if the previous phrase ends at beat 1 and the next phrase starts at beat $2\tfrac{1}{3} = 2.333\ldots$, the starting point of the next phrase can be written as 2.3, 2.0, etc.


4.3 Implementation

4.3.1 Score Preparation

The digital scores were downloaded from the KernScore website [76]. The scores were converted into MusicXML from the original Humdrum file format (.krn) using the music21 toolkit [59]. Because this research focuses on monophonic melodies only, the accompaniments were removed and the chords were reduced to their highest-pitched note, which is usually the most salient melody note. The reduced scores were double-checked against a printed edition published by Durand & Cie., Paris [77] to eliminate errors.
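The chord reduction step can be illustrated with plain data structures. This sketch is independent of music21, and the event layout is hypothetical.

```python
def reduce_to_melody(events):
    """Keep only the highest pitch of each event.

    events: list of (onset, pitches), where pitches is a non-empty list of
    MIDI pitch numbers sounding together at that onset.
    """
    return [(onset, max(pitches)) for onset, pitches in events]


# A C major triad followed by a single note.
melody = reduce_to_melody([(0.0, [60, 64, 67]), (1.0, [65])])
```

Single notes pass through unchanged, while each chord is replaced by its top note.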

4.3.2 MIDI Recording

We have implemented two methods for recording: first, using a Yamaha digital piano to record MIDI; second, tapping on a touch-sensitive device to express tempo, duration and loudness. For accuracy reasons, only the recordings from the Yamaha digital piano are used in the experiments.

We used a Yamaha P80 88-key graded hammer effect² digital piano for recording. Through a MIDI-to-USB converter, the keyboard was connected to the Rosegarden Digital Audio Workstation (DAW) software on a Linux computer. Rosegarden also generated the metronome sound to help the performer maintain a steady speed. The metronome is mandatory because if the tempo is not assigned during the recording, the tempo information written in the MIDI file will be invalid, which makes subsequent parsing and linear scaling very difficult. The performers were therefore asked to follow the metronome, but they could set the metronome speed as they liked and apply any level of rubato as long as the overall tempo was steady.

The second method, which is not used in the experiments, is to utilize touch-enabled input devices such as a smartphone touchscreen or a laptop touchpad. We have implemented a prototype using a Synaptics touchpad on a Lenovo ThinkPad X200i laptop. When the user taps the touchpad once, one note from the score will be played, the

² The Graded Hammer Effect feature provides a realistic key pressure response similar to that of a traditional acoustic piano.
