This section introduces the components that make up the learning phase. The main goal of the learning phase is to extract performance knowledge from training samples. Fig. 3.2 shows the internal structure of the learning phase.
Training samples are pairs of a score and a matched expressive performance (their format and preparation process are discussed in Chapter 4). The raw data in the samples is too complex to process directly, so we extract important features from it. Two types of features are extracted: the musicological cues from the scores (score features) and the measurable expressions from the expressive performances (performance features). We want the system to learn how the score features are “translated” into the performance features. This process is analogous to a human performer reading the explicit and implicit cues in the score and performing the music with particular expressions. The features used are defined in Section 3.5.
3.3.1 Training Sample Loader
The training samples are loaded by the sample loader module. Since a training sample consists of a score (MusicXML format) and an expressive recording (MIDI format), the sample loader finds the two files and loads them into an intermediate representation (the music21.Stream object provided by the music21 library [59] from MIT). The music21 library converts the MusicXML and MIDI data into a Python object hierarchy that is easy to access and manipulate from Python code.
One caveat is that the music21 library quantizes the timing in MIDI, which destroys the subtle onset and duration expressions. In addition, the music21 library does not handle the “ticks per quarter note” information in the MIDI header [60], which the MIDI parser needs in order to interpret the time scale correctly. We must therefore explicitly disable quantization and specify the “ticks per quarter note” value when loading MIDI.
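To illustrate why both settings matter, the following minimal sketch shows the underlying arithmetic in plain Python. It does not use the music21 API; the function names, the tick values, and the sixteenth-note grid are assumptions chosen for illustration only.

```python
def ticks_to_quarters(ticks: int, ticks_per_quarter: int) -> float:
    """Convert a MIDI tick count into a length measured in quarter notes."""
    if ticks_per_quarter <= 0:
        raise ValueError("ticks_per_quarter must be positive")
    return ticks / ticks_per_quarter


def snap_to_grid(quarters: float, divisions_per_quarter: int = 4) -> float:
    """Snap a time value to a metrical grid (4 divisions = sixteenth notes)."""
    return round(quarters * divisions_per_quarter) / divisions_per_quarter


# A note played slightly late: 505 ticks in a file with 480 ticks per quarter.
expressive_onset = ticks_to_quarters(505, 480)

# Quantization snaps the onset back to the beat and erases the deviation.
quantized_onset = snap_to_grid(expressive_onset)

# Assuming the wrong resolution (960) rescales every onset and duration.
misread_onset = ticks_to_quarters(505, 960)
```

In this sketch the expressive onset lies slightly after beat 1, the quantized onset is exactly 1.0, and the misread onset is about half the true value, which is why the loader must get both settings right.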
3.3.2 Features Extraction
To keep the system architecture simple, each feature extractor is designed to be independent of the others, so features can be added or removed without affecting the rest of the system. This independence also allows features to be extracted in parallel. Sometimes, however, a feature inevitably depends on another feature: for example, the “relative duration with the previous note” is calculated from the “duration” feature. Since we want to avoid complex dependency management, the “relative duration with the previous note” extractor invokes the “duration” extractor itself instead of waiting for the “duration” extractor to finish first. As a result, the “duration” feature would be computed twice. To avoid such redundant computation, we implemented a caching mechanism. Once the “duration” feature has been computed, whether during “duration” extraction or during “relative duration with the previous note” extraction, its value is cached for the rest of the execution session. No matter how many feature extractors use the “duration” feature, they can then read the value directly from the cache. This speeds up execution without requiring any dependency handling.
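The caching idea can be sketched as follows. This is a minimal illustration and not the system's actual code: the FeatureSession class, the note representation (a dict with a "duration" key), and the definition of relative duration as a ratio to the previous note are all assumptions.

```python
from typing import Callable, Dict, List

Note = Dict[str, float]


class FeatureSession:
    """Holds the notes of one sample and computes each feature at most once."""

    def __init__(self, notes: List[Note],
                 extractors: Dict[str, Callable[["FeatureSession"], List[float]]]):
        self.notes = notes
        self.extractors = extractors
        self.cache: Dict[str, List[float]] = {}
        self.compute_count: Dict[str, int] = {}  # only for demonstration

    def get(self, name: str) -> List[float]:
        # Run the extractor on a cache miss; afterwards serve the cached value.
        if name not in self.cache:
            self.compute_count[name] = self.compute_count.get(name, 0) + 1
            self.cache[name] = self.extractors[name](self)
        return self.cache[name]


def duration(session: FeatureSession) -> List[float]:
    return [note["duration"] for note in session.notes]


def relative_duration(session: FeatureSession) -> List[float]:
    # Invokes the "duration" feature through the session, not directly,
    # so the result is shared with every other extractor.
    durations = session.get("duration")
    if not durations:
        return []
    # Assumption: the first note has no predecessor, so its ratio is 1.0.
    return [1.0] + [cur / prev for prev, cur in zip(durations, durations[1:])]


session = FeatureSession(
    [{"duration": 1.0}, {"duration": 0.5}, {"duration": 0.5}],
    {"duration": duration, "relative_duration": relative_duration},
)
relative = session.get("relative_duration")
plain = session.get("duration")  # served from the cache
```

Because every extractor requests its inputs through the session, the order in which extractors run does not matter, and the "duration" extractor runs only once.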
The extracted features are aggregated and stored into a JavaScript Object Notation (JSON) file for the SVM-HMM module to load. By saving the features in a human-readable intermediate file, we can debug potential problems easily.
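As a rough sketch of such an intermediate file, the snippet below serializes a feature table to JSON text and reads it back. The layout (one list of values per feature name) and the feature names are hypothetical; the actual schema is not specified in this section.

```python
import json
from typing import Dict, List

FeatureTable = Dict[str, List[float]]


def features_to_json(features: FeatureTable) -> str:
    """Serialize the feature table; indentation keeps it easy to inspect."""
    return json.dumps(features, indent=2, sort_keys=True)


def features_from_json(text: str) -> FeatureTable:
    """Parse the JSON text back into a feature table."""
    return json.loads(text)


# Hypothetical feature names and values, for illustration only.
extracted = {
    "duration": [1.0, 0.5, 0.5],
    "relative_duration": [1.0, 0.5, 1.0],
}
json_text = features_to_json(extracted)
reloaded = features_from_json(json_text)
```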
3.3.3 SVM-HMM Learning
After all features are extracted, the next step is to learn the performance knowledge from them. In the early stage of this research, we successfully applied linear regression [61]. However, assuming the problem to be linear is clearly an oversimplification, so we switched to the structural support vector machine with hidden Markov model output (SVM-HMM) [56–58] as our supervised learning algorithm.
The SVM-HMM learning module loads the feature file from the previous stage and aggregates the features into the input format required by the SVM-HMM learner program. Most features from the previous stage are real-valued; since SVM-HMM only accepts discrete performance features¹, quantization is required. There are many possible ways to quantize the features, and each results in different outputs. Here we present one quantizer design as an example. For each performance feature, the mean and standard deviation over all training samples are calculated first. The range from the mean minus four standard deviations to the mean plus four standard deviations is divided into 128 uniform intervals. Values greater than the mean plus four standard deviations are quantized into the 128th bin, and values smaller than the mean minus four standard deviations are quantized into the 1st bin. The number of intervals determines how fine-grained the quantization is. If the number is too small, subtle expressions are lost to high quantization error. If the number is too large, each interval contains too few samples, which is undesirable from a statistical learning perspective; the training process also consumes considerable CPU and memory resources without a significant gain in prediction accuracy. The range of four standard deviations was chosen by trial and error. A narrower range causes most of the extreme values to be quantized into the largest or smallest bin, so the performance contains many saturated values. A very wide range, on the other hand, makes each quantization interval too large, which raises the quantization error.
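The example quantizer described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name make_quantizer is hypothetical, the population standard deviation is used, and the sample values are made up.

```python
import statistics
from typing import Callable, Sequence


def make_quantizer(values: Sequence[float], n_bins: int = 128,
                   n_std: float = 4.0) -> Callable[[float], int]:
    """Build a quantizer over [mean - n_std*std, mean + n_std*std].

    The range is split into n_bins uniform intervals numbered 1..n_bins;
    values outside the range saturate at bin 1 or bin n_bins.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std. deviation
    if std == 0:
        raise ValueError("all training values are identical; cannot quantize")
    low = mean - n_std * std
    width = (2 * n_std * std) / n_bins

    def quantize(x: float) -> int:
        index = int((x - low) // width) + 1  # 1-based bin index
        return max(1, min(n_bins, index))    # clamp out-of-range values

    return quantize


# Made-up training values for one performance feature.
quantize = make_quantizer([0.0, 1.0, 2.0, 3.0, 4.0])
far_above = quantize(100.0)    # saturates at the largest bin
far_below = quantize(-100.0)   # saturates at the smallest bin
just_above_mean = quantize(2.01)
just_below_mean = quantize(1.99)
```

With 128 bins the mean falls on the boundary between bins 64 and 65, so values just below and just above the mean land in those two bins respectively.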
The theoretical background of SVM-HMM was presented in Section 3.2. We use Thorsten Joachims's implementation, called SVM^hmm [62]. SVM^hmm is an implementation of structural SVMs for sequence tagging [58] that uses the training algorithm described in [57] and [56]. The SVM^hmm package contains an SVM-HMM training program called svm_hmm_learn and a prediction program called svm_hmm_classify. For architectural simplicity, we train one model for each performance feature, and each model uses all the score features to predict a single performance feature. The svm_hmm_learn program reads the features from a file in the following format: each line represents the features of one note, in time order, formatted as
PERF qid:EXNUM FEAT1:FEAT1_VAL FEAT2:FEAT2_VAL ... #comment
PERF is a quantized performance feature. The EXNUM after qid: identifies the phrase; all notes in a phrase have the same qid:EXNUM identifier. Following the identifier are the quantized score features, each written as feature name : feature value and separated by spaces. Any text following a # symbol is a comment.

¹ SVM-HMM was initially designed for tasks such as part-of-speech tagging, in which real-valued or binary features are used to predict discrete part-of-speech tags.
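A line in the input format described in this subsection could be produced by a helper like the one below. This is only a sketch: the function name is hypothetical, and keying the score features by increasing integer index (as in SVM-light-style formats) is an assumption that should be checked against the SVM^hmm documentation.

```python
from typing import Dict


def format_note_line(perf_bin: int, phrase_id: int,
                     score_features: Dict[int, float],
                     comment: str = "") -> str:
    """Format one note as 'PERF qid:EXNUM FEAT:VAL ... #comment'."""
    parts = [str(perf_bin), f"qid:{phrase_id}"]
    # Assumption: features are written in increasing index order.
    for index, value in sorted(score_features.items()):
        parts.append(f"{index}:{value:g}")
    line = " ".join(parts)
    if comment:
        line += f" #{comment}"
    return line


# Two notes from the same phrase share the same qid (values are made up).
lines = [
    format_note_line(64, 1, {2: 0.5, 1: 3}, "note 0"),
    format_note_line(70, 1, {1: 5, 2: 0.25}, "note 1"),
]
training_text = "\n".join(lines) + "\n"
```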
Several key parameters of the training program must be tuned. The first is the C parameter of the SVM, which controls the trade-off between lowering the training error and maximizing the margin. A larger C results in a lower training error, but the margin may be smaller. The second is the ε parameter, which controls the precision required for termination. The smaller the ε, the higher the precision, but training may require more time and computing resources. Finally, for the HMM part of the model, the order of dependencies of the transition states and of the emission states must be specified. In our case, both are left at their defaults: the transition dependency is set to one, which corresponds to the first-order Markov property, and the emission dependency is set to zero. Since we train one model for each performance feature, each model has its own set of parameters. The parameter selection experiments are presented in Chapter 5.
Finally, the training program outputs three model files (because we use three performance features), which contain the SVM-HMM model parameters, such as the support vectors and other metadata. Training a model takes considerable time (roughly from a dozen minutes to a few hours), depending on the number of training samples and the power of the computer, so the system can only support off-line learning. However, the learning process only needs to be run once; the resulting performance knowledge model can then be reused repeatedly in the performing phase.
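As a sketch of how these parameters might be passed to the trainer, the helper below assembles one svm_hmm_learn command line per performance feature. The flag names (-c, -e, --t, --e) follow the SVM^hmm documentation as best recalled and should be verified against the installed version; the file names, feature names, and parameter values are placeholders, not the values used in this work.

```python
from typing import List


def build_learn_command(train_file: str, model_file: str,
                        c: float, epsilon: float,
                        transition_order: int = 1,
                        emission_order: int = 0,
                        executable: str = "svm_hmm_learn") -> List[str]:
    """Assemble the argument list for one training run (nothing is executed)."""
    return [
        executable,
        "-c", str(c),                   # error/margin trade-off
        "-e", str(epsilon),             # precision required for termination
        "--t", str(transition_order),   # order of transition dependencies
        "--e", str(emission_order),     # order of emission dependencies
        train_file,
        model_file,
    ]


# One model per performance feature; names and values are hypothetical.
placeholder_params = {
    "feature_a": {"c": 0.1, "epsilon": 0.5},
    "feature_b": {"c": 1.0, "epsilon": 0.5},
    "feature_c": {"c": 10.0, "epsilon": 0.1},
}
commands = {
    name: build_learn_command(f"{name}.train", f"{name}.model", **params)
    for name, params in placeholder_params.items()
}
```

Each argument list can be handed to subprocess.run once the paths and parameter values have been replaced with real ones.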