Performing Expressively - 利用結構性支撐向量機的具音樂表現能力之半自動電腦演奏系統

The performing phase uses the performance knowledge model learned in the previ-ous phase to generate expressive performances. The input is a score file to be performed, which should not be used as training sample to prevent overfitting. Score features will be extracted from it using the same routine as in the learning phase. The SVM-HMM gener-ation module will use the learned model and the score features to predict the performance features. These features will then be de-quantized back to real values using the method described previously. A MIDI generation module will apply those performance features

Figure 3.3: Performing phase flow chart

onto the score to produce an expressive MIDI file. The MIDI file itself is already an ex-pressive performance. To actually hear the sound, a software synthesizer can be used to render the MIDI file into a WAV or MP3 format.

3.4.1 SVM-HMM Generation

The feature extraction and aggregation process in the performing phase is similar to the learning phase, but the PERF fields in the SVM-HMM input file are left blank for the algorithm to predict. The svm_hmm_classify program will take these inputs with the learned model file and predict the quantized labels of the performance features. These performance features are de-quantized back to the middle point of each bin.

3.4.2 MIDI Generation and Synthesis

The predicted performance features are then applied onto the input score, i.e. the onset timings will be shifted, the duration extended or shortened, and the loudness shifted according to the predicted performance features. The resulted expressive performance will be transformed into MIDI files using music21 library [59].

In order to actually hear the expressive performance, the MIDI file can be rendered by a software MIDI synthesizer. For example, timidity++ software synthesizer for Linux can render the MIDI into a WAV (Waveform Audio Format) file, which can be compressed into MP3 (MPEG-2 Audio Layer III) by lame audio encoder. Alternatively,

one can use hardware synthesizers, for example, the RenCon [1] contest uses Yamaha Disklavier digital piano to render contestants' submission.

Because the sub-note level expression is not the primary goal of this research, we choose a standard MIDI grand piano sound to render the music. The system can be ex-tended to use a more advanced physical model or instrument-specific audio synthesizer.

Other sub-note level features, such as special techniques for playing the violins, can be added to the feature list and be learned by the SVM-HMM model.

3.5 Features

As mentioned in Section 3.3, there are two types of features, the score features and performance features. We will present the features used in the system and discuss the difficulties encountered.

3.5.1 Score Features

Score features are musicological cues presented in the score. The purpose of score features are to simulate the high level information a performer may perceive when he/she reads the score. The basic time unit for these features are notes. Each note will have all features presented below. Score features include:

Relative position in the phrase: the relative position of a note in the phrase with its value ranging from 0% to 100%. This feature is intended to capture the special expression at the start or the end of a phrase, or time-variant expressions like the arch-type loudness variation.

Pitch: the pitch of a note denoted by the MIDI pitch number (resolution is down to semi-tone).

Interval from the previous note: the interval between the current note and its previous note (in semitone). This feature and the next one represent the direction of the

Figure 3.4: Intervals with neighbor notes

melodic line. See Fig. 3.4 for an example.

∆P⁻ = P_i− Pi−1

Interval to the next note: the interval between the current note and its following note (in semitone). See Fig. 3.4 for an example.

∆P⁺ = P_i+1− Pi

Note duration: the duration of a note (quarter notes).

Grace notes have no duration in musicXML specification [63]. The reason for this is that grace notes are considered as very short ornaments that do not occupy real beat position. But zero duration is hard to handle in mathematic formulation. So, we assigned the duration of a sixty-fourth note for a grace note, because it is far shorter than all the notes in our corpus.

Relative Duration with the previous note: the duration of a note divided by the dura-tion of its previous note. See Fig. 3.5 for an example. For a phrase of n notes with duration D₁, D₂, . . . , D_n,

RD⁻= D_i D_i₋₁

This feature is intended to locate local changes in tempo, such as a series of rapid consecutive notes followed by a long note, which will cause a discontinuity in this feature.

Relative duration with the next note: The duration of a note divided by duration of its

Figure 3.5: Relative durations with neighbor note

Figure 3.6: Metric position

following note. See Fig. 3.5 for an example.

RD⁺= D_i D_i+1

Metric position: the position (beat) of a note in a measure. For example, under a time signature of⁴₄, if a measure consists of five notes, they will have metric positions of 1, 2, 2.5, 3 and 4, respectively.

Metric position usually implies beat strength. In most tonal music, there exists a hierarchy of beat strength. For example, for a time signature of ⁴₄, the first note is usually the strongest, the third note is the second strongest, and the second and fourth notes are the least strong ones.

3.5.2 Performance Features

Performance features are the expressive expressions we would like to learn from a per-formance. Performance features are extracted by calculating how the expression deviates from the nominal notation in the score. Performance features include:

Onset time deviation: a human performer usually adds conscious or unconscious rubato to their performance. The onset time deviation is the difference of onset timing

Figure 3.7: Systematic bias in onset deviation

between the performance and the score. Namely,

∆O = O^perf_i − O^scorei

Where O_i^perf is the onset time of note i in the performance, O_i^scoreis the onset time of note i in the score.

However, the above formula assumes the performance is played exactly at the same tempo assigned by the score. In reality, performers do not always keep up with the speed of the score, probably because of limited piano skills, or they may speed up or slow down certain passages to expressive his/her musical interpretation. Therefore, the performance should be linearly scaled to avoid systematic bias. We will present a solution to this issue in Section 3.5.3.

Loudness: the loudness of a note, measured by MIDI velocity level 0 to 127.

Relative duration: the performed duration of a note divided by the nominal duration in the score.

RD = D_i^perf D_i^score

3.5.3 Normalizing Onset Deviation

In the previous section, we have pointed out that the onset deviation feature extractor may face some difficulties when the performer did not play at the exact tempo indicated by the score. As illustrated in Fig. 3.7, if the performance is played slower than expected, the deviations at the end of the phrase will be very large due to the accumulated errors.

The same issue occurs when the performer is playing faster the expected. The systematic bias caused by the difference in total duration mixes up with the local deviation. For a long phrase, the onset deviation of the last notes can be as large as a dozen quarter notes.

This kind of extremely large values will be learned by the model and cause erroneous predictions. A note may be delayed for a few quarter notes, causing the notes to be played in the wrong order.

In other words, the onset deviation actually contains two types of deviation: a global/

systematic deviation caused by the difference between the performed and the nominal tempo, and a local deviation caused by the note-level expression. Since the intention of the onset deviation feature is to capture the note-level expression, the performance must be linearly scaled to cancel out the global deviation.

Initially, we tried two possible ways of normalization:

1. To align the onset of the first notes, and align the onset of the last notes.

2. To align the onset of the first notes, and align the end (MIDI note-off event) of the last notes.

However, neither of the methods can robustly eliminate extreme values. Therefore, we proposed an automated approach to find the best scaling ratio such that the normalized onset deviations in the performances fit best with those in the score. The measure of fitness is defined as the Euclidean distance between the normalized performance onset sequences and the score onset sequences, represented as vectors. Brent's Method [64] is used to find this optimal ratio. To speed up the optimization and prevent unreasonable local minima value, a search range of [initial guess× 0.5, initial guess × 2] is imposed on the optimizer. The initial guess is used as a rough estimate of the ratio, calculated by aligning the first and last onsets. Than we assume that the actual ratio is not smaller than half of initial guess and not larger than twice of initial guess. The two numbers 0.5 and 2 are chosen by trail and error, and most of the empirical data supports this decision. We will demonstrate the effectiveness of this solution in Section 5.1.

Chapter 4

在文檔中利用結構性支撐向量機的具音樂表現能力之半自動電腦演奏系統 (頁 30-37)