An Investigation of Multi-Instrument Automatic Music Transcription


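Master's Thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University

An Investigation of Multi-Instrument Automatic Music Transcription

Advisor: Dr. Berlin Chen (陳柏琳)
Co-advisor: Dr. Li Su (蘇黎)
Graduate student: Yu-Te Wu (吳宥德)

July 2020 (Republic of China year 109)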

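Acknowledgements

Time flies. From the undergraduate project I began in college to the completion of this thesis in graduate school, I have received help from many people; without it, I would probably still be struggling with my experiments. First, I would like to thank Professor Berlin Chen, who has guided me ever since my undergraduate project. I started out knowing nothing about how to do academic research, and I am now able to complete this thesis after publishing two international journal papers, bringing my graduate studies to a perfect close. It was also through Professor Chen's introduction and help that I came to work in Professor Li Su's laboratory at Academia Sinica.

Professor Su is the next person I wish to thank. I met Professor Su during the 2017 Academia Sinica summer internship program and have received Professor Su's help and guidance ever since, starting with being taught automatic music transcription from the ground up. After the internship ended, as I entered my senior year, I published my first academic paper, which was submitted to a leading international conference, and with the help of both professors I was able to travel abroad to present it. I have continued to work with Professor Su to this day, submitting two more papers over the two years from 2019 onward, so that by the time of my thesis defense I had three international papers in total. I am truly grateful to both professors. This is an achievement I never imagined during my undergraduate years, and it brought me many valuable experiences of attending international conferences abroad.

Finally, I would like to thank my family and friends, who have accompanied and supported me all along. Thank you for your selfless help and support, and for helping me warmly and without hesitation whenever I needed it most; I am deeply moved and gratified. Your company kept my research life full of laughter and greatly eased its tedium and weariness. I sincerely dedicate this thesis to all of you, with my deepest thanks.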

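Abstract (Chinese version, translated)

Automatic music transcription (AMT) is one of the most important tasks in music information retrieval (MIR), and because of the complexity of its signals it has been regarded as one of the most challenging areas of signal processing. Among the many AMT tasks, multi-instrument transcription is a key step toward a general transcription system, yet it has received little study. The model must simultaneously identify multiple instruments and their corresponding pitches within a piece of music, where the varied timbres and rich harmonics of different instruments can interfere with one another and complicate the signal. Multi-instrument transcription is therefore a more advanced and complex problem than conventional single-instrument transcription. Beyond the inherent technical difficulty, integrating and coordinating transcription problems at different levels and handling their complex interactions also calls for a clearer and more precise problem definition, together with an effective evaluation method for the final results. In this research, we propose a method for multi-instrument automatic transcription. By developing an end-to-end pipeline from signal-level feature engineering to final evaluation, it integrates multiple techniques to better handle this complex problem. It combines signal processing techniques that clearly reveal pitch features, novel deep learning models, and concepts inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision, and further integrates a newly developed post-processing algorithm. The proposed system is general, flexible, and highly efficient across all subtasks of multi-instrument transcription. Comprehensive evaluation on the different subtasks shows the best results to date on every metric, including multi-instrument note-level transcription, which has not been studied before.

Keywords: automatic music transcription, multi-pitch estimation, multi-pitch streaming, deep learning, self-attention mechanism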

Abstract

Automatic music transcription (AMT), one of the most important tasks in music information retrieval (MIR), has been seen as one of the most challenging fields in signal processing because of the inherent complexity of its signals. Among the many AMT tasks, multi-instrument transcription is a critical step toward a general transcription system, yet it remains a less investigated field. The requirement of identifying multiple instruments and their corresponding pitches in music performances, which consist of various timbres and rich harmonic information that can interfere with each other, makes it a more advanced problem than the conventional single-instrument AMT problem. Beyond the technical difficulties, orchestrating the different levels of this complex problem also requires a clear definition of the problem scenarios and efficient evaluation approaches. In this research, we propose a multi-instrument AMT approach with a complete end-to-end flow from signal-level feature engineering to the final evaluation. Combining signal processing techniques capable of specifying pitch saliency, novel deep learning methods, and concepts inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision, and integrated with a newly developed post-processing algorithm, the proposed system is flexible and efficient for all the sub-tasks in multi-instrument AMT. Comprehensive evaluations on different sub-tasks show state-of-the-art performance, including on the task of multi-instrument note tracking, which has not been investigated before.

Keywords: automatic music transcription, multi-pitch estimation, multi-pitch streaming, deep learning, self-attention

Table of Contents

1. Introduction
   1.1. Background and Motivation
   1.2. Problem Scenarios
   1.3. Arrangement of the Thesis
2. Related Work
   2.1. Background of AMT
   2.2. Era of Deep Learning
      2.2.1. Dealing with Long-Term Sequence
   2.3. Data Representations
3. Method
   3.1. Data Representations
      3.1.1. CFP Representation
      3.1.2. Harmonic Representation
   3.2. Model
   3.3. Label Smooth
   3.4. Post-Processing
4. Experiments
   4.1. Settings
   4.2. Datasets
      4.2.1. Single-instrument Datasets
      4.2.2. Multi-instrument Datasets
   4.3. Training
   4.4. Evaluation Metrics
5. Experiment Results
   5.1. CFP Feature Comparison
   5.2. Harmonic Feature Comparison
   5.3. MPE and NT with Different Models
   5.4. MPS and NS with Different Models
   5.5. MPS and NS Instrument-level Evaluation
   5.6. Confusion Matrix Analysis
   5.7. Effect of Post-processing
   5.8. Illustration
6. Discussion
7. Conclusion and Future Works
References

List of Tables

Table 3.6. Details of model architecture
Table 4.1. Total length portion of each instrument in datasets
Table 4.2. Note length of each instrument in MusicNet
Table 5.1. Feature comparison results
Table 5.2. MPE and NT results
Table 5.3. MPS results
Table 5.4. NS results
Table 5.5. Instrument accuracy
Table 5.6. Instrument-level MPS results
Table 5.7. Instrument-level NS results
Table 5.8. Confusion matrix
Table 5.9. Results without proposed post-processing

List of Figures

Figure 1.1. Problem scenarios
Figure 2.1. Diagram of Onsets and Frames model
Figure 2.2. Concept of semantic segmentation
Figure 2.3. U-net architecture
Figure 3.1. End-to-end flow of transcription system
Figure 3.2. Illustration of feature representations
Figure 3.3. Illustration of encoder and decoder block
Figure 3.4. Illustration of ASPP block
Figure 3.5. Diagram of self-attention block
Figure 5.1. Illustration of feature, raw predictions, and MIDI

1. Introduction

1.1 Background and Motivation

Automatic music transcription (AMT), the task of converting acoustic music signals into music notation, is an enabling technology for music information retrieval (MIR), music generation, music search, music education, and musicology [1][2]. Although AMT is the foundation of many further applications, there remains much room for improvement before it forms a concrete and robust base. Building these applications requires music notation whose attributes range from local properties such as pitch, onset, offset, velocity, and timbre, to global properties such as voice, meter, tempo, and structure. To deal with this complicated problem, AMT research can be broken down into four levels: frame-level transcription of pitch, also known as multi-pitch estimation (MPE); note-level transcription of pitch, onset, and duration, referred to as note tracking (NT); stream-level transcription of note and stream attributes, named multi-pitch streaming (MPS); and notation-level transcription into human-readable scores [1]. Further introduction to AMT can be found in [1][2].

Transcribing polyphonic music signals into a higher-level symbolic representation has long been considered the Holy Grail of music listening [3] in AMT research. The diversity of problem scopes, the high complexity of polyphonic signals, mixed harmonic components, and the lack of labeled data all make the situation much more difficult to handle [1][2][4]. Most previous AMT work has focused on single-instrument transcription at the frame level, such as piano solo, which is, however, an over-simplified task that can hardly yield truly informative symbolic music notation. Only recently has the rapid evolution of deep learning techniques pushed AMT research further.
Benefiting from the extreme flexibility of deep learning [5], works combining MPE with onset/offset detection [6][7][8] or instrument classification [9] have been introduced to simulate NT or MPS. With the various extraordinary characteristics of deep learning, many interesting ideas have been employed and more powerful models introduced. For example, the state-of-the-art piano transcription method leverages two bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs), fine-tuned with multiple objectives in an end-to-end manner. The model can jointly predict note-level attributes including pitch, onset, offset, and velocity [10][11]. By providing a rich set of labels, it set a new milestone for the AMT fundamentals and opened opportunities for developing many other applications such as automatic music generation [12]. Similar works such as state-of-the-art multi-instrument recognition [13][14] and multi-instrument MPE [15], where pitch and instrument classes are predicted jointly with models based on convolutional neural networks (CNNs), have also opened up many potential future directions. Showing the possibility of jointly transcribing note-level attributes (i.e., pitch, onset, offset) and instrument-level attributes (i.e., instrument classes) at the same time, this work takes a further step into note-level multi-stream transcription, a task that has been relatively little investigated. In the following sections, we refer to this task as the multi-instrument AMT task. Existing multi-instrument AMT research usually focuses on identifying frame-wise stream activations, that is, transcribing the instrument class of each pitch event. To the best of our knowledge, few works transcribe notes and instruments concurrently, apart from a few pioneering attempts [9]. To distinguish between the frame-level MPS task and the note-level MPS task, we abbreviate these two tasks as MPS and NS, respectively.
MPS and NS are undoubtedly difficult, as the model needs to discriminate the timbres of similar instruments within a complex sound mixture at the same time. For the NS task, contextual dependencies are also required to distill the necessary temporal information about each note of the different streams. Moreover, most datasets are highly imbalanced in the amount of data per instrument, which could lead to unwanted bias towards specific instrument classes.

In this research, we incorporate multiple techniques to overcome the above-mentioned challenges and provide a systematic analysis over different scenarios. Inspired by the field of computer vision (CV), we apply the U-net [16] model because of its strong ability to extract various kinds of semantic information. The model has been widely employed in different tasks such as music source separation [17], melody extraction [18], MPE [19], and MPS [15]. Each of these tasks can be considered a semantic segmentation task, which closely resembles its CV counterpart [20], where the goal is to identify "objects" (i.e., note instances of instruments) in a given 2D representation. Based on the original model architecture, we further enhance the model with a novel self-attention mechanism [21] to better capture the long-term features of a sequence. To efficiently overcome the data imbalance issue, the label smoothing (LS) technique is also employed in the training process to improve performance on rarely seen instrument classes. At the last stage of transcribing music pieces, a post-processing procedure is required to polish the raw prediction, and it often takes effort to find a proper threshold, which may differ from piece to piece because of the varying timbral characteristics of each piece. Furthermore, the multi-instrument task requires multiple thresholds for different instruments, which makes the situation more complex to handle. An advanced post-processing procedure is thus proposed to stabilize the distribution of the raw prediction over different pieces and to simplify the thresholding process in the multi-instrument task. The proposed method enables us to impose the same threshold on different instruments.

Fig. 1.1. Illustration of different problem scenarios.

1.2 Problem Scenarios

All the challenges mentioned above stem from the additional information about instruments, compared with the conventional single-instrument MPE task. To inspect how this additional information affects the AMT task, we define three transcription scenarios:

• Pitch-only transcription: information about instruments is ignored, whether the music piece contains single- or multi-instrument information. This scenario is equivalent to the typical MPE (frame-level) or NT (note-level) tasks in AMT research.
• Instrument-informed transcription: the types of instruments in the test music pieces are assumed to be known. For example, for a music piece known to be a violin sonata, we can directly tell that a violin and a piano are present, and simply ignore all other channels.
• Instrument-agnostic transcription: the most challenging case, in which no information about instrument classes is given with the test music pieces. The model has to predict both the instrument classes and the note events. The only assumption in this scenario is closed-set recognition of instrument types, meaning that the set of instrument classes is the same for the training and testing sets.

Combining the above three scenarios with the different levels of AMT subtasks discussed above (i.e., MPE, NT, MPS, and NS) leads to six tasks, summarized as follows: 1) pitch-only MPE, 2) pitch-only NT, 3) instrument-informed MPS, 4) instrument-informed NS, 5) instrument-agnostic MPS, and 6) instrument-agnostic NS. A brief comparison is illustrated in Fig. 1.1.

1.3 Arrangement of the Thesis

The rest of this thesis is organized as follows:
Section II reviews related work on multi-instrument AMT.
Section III describes the proposed method.
Section IV reports the experimental setup.
Section V presents the experimental results.
Section VI discusses the experimental results.
Section VII gives the conclusion and future work.
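As a small bookkeeping aid, the six tasks of Section 1.2 can be enumerated from the three scenarios and the two transcription levels. The sketch below is only illustrative; the names `SCENARIO_TASKS` and `list_tasks` are hypothetical and do not come from the thesis.

```python
# Hypothetical bookkeeping of the task grid in Section 1.2; names are illustrative only.
SCENARIO_TASKS = {
    "pitch-only": {"frame": "MPE", "note": "NT"},
    "instrument-informed": {"frame": "MPS", "note": "NS"},
    "instrument-agnostic": {"frame": "MPS", "note": "NS"},
}


def list_tasks():
    """Return (scenario, level, task) triples in the order used in Section 1.2."""
    return [
        (scenario, level, task)
        for scenario, levels in SCENARIO_TASKS.items()
        for level, task in levels.items()
    ]


# Print the numbered list of tasks, one per line.
for number, (scenario, level, task) in enumerate(list_tasks(), start=1):
    print(f"{number}) {scenario} {task} ({level}-level)")
```

The frame-level and note-level variants of each scenario share the same input audio and differ only in the granularity of the expected output.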

2. Related Work

2.1 Background of AMT

Most prior studies focused on single-instrument music signals such as piano solo, or on multi-instrument pieces with mono-track output, where information about instruments is ignored. To date, more and more datasets have been proposed for AMT research. Many of them provide instrument information alongside the audio signals, such as the MIREX Multiple Fundamental Frequency Estimation (MF0) test set, the Bach10 dataset [22], the Su dataset [4], the RWC Classical Music Database [23], and the MusicNet dataset [24]. Although these datasets contain instrument class labels, the labels have seldom been used, except in certain studies on MPS [25][26][27]. However, it has long been argued that information about instruments is highly relevant to the final performance of an AMT model. The unique timbre of each instrument has its own characteristic patterns in spectral representations, which are the key to guiding the model in extracting the time and pitch information of that instrument from a given audio mixture. This insight has repeatedly proven successful in AMT systems. With this idea in mind, non-negative matrix factorization (NMF) [28] and probabilistic latent component analysis (PLCA) [29] gave a strong first step toward solving multi-instrument tasks. In these studies, multiple instrument-aware templates are obtained from pre-training or sampled from single-note data of various instrument classes. These templates are then applied to the model to decompose the given mixture spectrogram. The templates act like a memory or dataset of the characteristics of different instruments. Further constraints can be imposed on the spectral envelope or spectral smoothness to filter out undesirable template behaviors, thereby better guiding the model to capture the patterns of different instrument types [9][28][30].
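To make the template idea concrete, the following minimal sketch decomposes a toy "mixture spectrogram" V into activations H of fixed templates W, so that V ≈ WH. It uses the plain multiplicative update for the Euclidean cost, which is an assumption of this sketch and not necessarily the formulation of [28] or of PLCA [29]. The two four-bin templates, the toy activations, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def squared_error(v, w, h):
    """Sum of squared differences between V and its approximation W H."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def decompose_with_templates(v, w, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H with the templates W kept fixed."""
    n_templates, n_frames = len(w[0]), len(v[0])
    wt = [list(col) for col in zip(*w)]          # W transposed
    wtv, wtw = matmul(wt, v), matmul(wt, w)      # constant across iterations
    h = [[1.0] * n_frames for _ in range(n_templates)]
    for _ in range(n_iter):
        wtwh = matmul(wtw, h)
        # Multiplicative update: H <- H * (W^T V) / (W^T W H), element-wise.
        h = [[h[k][t] * wtv[k][t] / (wtwh[k][t] + eps)
              for t in range(n_frames)] for k in range(n_templates)]
    return h


# Toy data (assumed): 4 frequency bins, 2 instrument templates, 3 time frames.
TEMPLATES = [[1.0, 0.0], [0.5, 0.6], [0.2, 1.0], [0.0, 0.3]]
TRUE_ACTIVATIONS = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
MIXTURE = matmul(TEMPLATES, TRUE_ACTIVATIONS)
ACTIVATIONS = decompose_with_templates(MIXTURE, TEMPLATES)
```

Each row of `ACTIVATIONS` can be read as the time-varying strength of one template, which is how a template-based system attributes energy in the mixture to a particular instrument.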
Although decomposing spectrograms with instrument-aware templates has proven successful, only a few studies took a further step to multi-instrument AMT. Further investigations have examined the influence of the templates, using approaches such as independent subspace analysis (ISA) and factorial hidden Markov models (FHMM) [31], harmonic temporal clustering (HTC) [32], MPS with high-order HMMs [33], constrained clustering [25][26][27], and PLCA [9][34][35], to name but a few.

2.2 Era of Deep Learning

Recent progress in hardware development has sparked much exciting and interesting research. With more powerful hardware being manufactured, limitations on computation power have been lifted. One of the most significant developments benefiting from this progress is the renaissance of deep learning. This line of research had gone through a winter since the 1970s because of restricted computation power. As deep learning thrives, many achievements that seemed magical in the past have come true, such as object recognition [36], image generation [37], automatic music composition [38], and playing board games against humans [39], to name but a few. These works continue to inspire more research directions and to impact society. Early attempts at the MPE task with deep learning models were made in [40][41][42]. Different models were applied in these works, such as recurrent neural networks (RNNs), deep neural networks (DNNs), and convolutional neural networks (CNNs). In [41], an acoustic model and a language model were combined to better capture musical structures. The acoustic model uses a CNN to extract temporal context from the given spectral features, combined with an RNN-based music language model that infers the final prediction from the raw output of the acoustic model.
They also compared different post-processing models such as a hidden Markov model (HMM) and a naive thresholding approach; the final results show that using the RNN model for post-processing leads to the largest improvement in performance, albeit not a very significant one.
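A naive thresholding baseline of the kind mentioned above can be sketched as follows: frame-level activations are binarized with a fixed threshold, and consecutive active frames of the same pitch are grouped into notes. The threshold value of 0.5, the toy activations, and the function names are assumptions made for illustration; this is neither the post-processing of [41] nor the procedure proposed later in this thesis.

```python
def threshold_frames(activations, threshold=0.5):
    """Binarize activations, where activations[t][p] scores pitch p at frame t."""
    return [[1 if a >= threshold else 0 for a in frame] for frame in activations]


def frames_to_notes(piano_roll):
    """Group consecutive active frames per pitch into (pitch, onset, offset) tuples.

    Onset is the first active frame index; offset is the first inactive frame after it.
    """
    notes = []
    n_pitches = len(piano_roll[0]) if piano_roll else 0
    for p in range(n_pitches):
        onset = None
        for t, frame in enumerate(piano_roll):
            if frame[p] and onset is None:
                onset = t                       # note starts
            elif not frame[p] and onset is not None:
                notes.append((p, onset, t))     # note ends
                onset = None
        if onset is not None:                   # note still sounding at the end
            notes.append((p, onset, len(piano_roll)))
    return notes


# Toy activations (assumed): 4 frames, 2 pitches.
RAW = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.7], [0.1, 0.2]]
NOTES = frames_to_notes(threshold_frames(RAW))
```

Because a single global threshold ignores temporal context, short gaps or spurious single-frame activations translate directly into fragmented or extra notes, which is one motivation for the sequence models compared in [41].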

Fig. 2.1. Diagram of the Onsets and Frames model. The term FC refers to a fully connected layer. Four objectives are considered, resulting in four branches of the model. For the frame branch, the outputs of the onset and offset branches are concatenated with the output of the middle layer of the frame stack.

A more recent work, which uses a more complicated RNN architecture, was proposed in [10]. The work pays further attention to information about note events, as these are the essence of what music is composed of. Compared with most prior studies, which focus on frame-level transcription, this work marks a huge step toward general AMT. The final outputs of the model are multi-objective, meaning that multiple branches at the end of the model split from the middle of the structure. As shown in Fig. 2.1, four objectives were considered in the work: onset, frame, offset, and dynamics. The first refers to the start time of the note, the second to its duration, the third to the end time of the note, and the last to how fast the key is pressed on the piano, which produces different volume levels. A specially designed objective function is also applied to the model, which emphasizes the loss on the onset channel to improve note-level performance. The model is built from stacked bidirectional long short-term memory (Bi-LSTM) layers [43]. Thorough evaluation results for different scenarios were presented in the study, showing impressive scores on all metrics. Nonetheless, this approach is built on a massive model with tens of thousands of parameters that requires very powerful hardware to fine-tune, which is computationally intractable for many applications.

Subsequent research based on [10] also introduced extended model architectures to further improve performance [44][45]. An idea similar to [41] is applied in [45], which uses an elaborately designed post-processing step. The model outputs three probability distributions of note activations: onset, offset, and intermediate phase. The last can be regarded as the duration of the note after the onset has been triggered. For post-processing, an HMM is applied to generate the final note sequence. The HMM has four states, namely attack, decay, sustain, and release, together abbreviated ADSR. The transition probabilities of each state were set manually according to the training set. Compared with [10], the model not only has far fewer parameters and is thus easier to deploy, but also performs better at the frame level and at the note level with offset. For note-level transcription without offset, the score is only slightly lower than the baseline. A more complicated training scheme is employed in [44], which uses a generative adversarial network (GAN) for fine-tuning. GAN models usually suffer from unstable training and require additional constraints on the optimization process. Despite these tricky issues, the results in [44] still show a promising direction for designing different training schemes for the MPE task.

Apart from systems based on multi-objective optimization, fewer studies focus on developing a multi-task end-to-end model. As depicted in Fig. 2.1, the model effectively consists of four sub-models, each responsible for a different aspect of a note. However, the dependencies among the four event types raise reservations about this hard separation of the model, even though the model has two shared connections.
Sharing information inside the model could be critical for transcribing a note from different aspects, and could also help reduce the size of the model by merging branches into a single one. Consequently, a model should be able to share information between objectives while propagating the given input, and make predictions for the different objectives at the same time. We refer to a model with such a design as an end-to-end multi-task model.

Fig. 2.2. The concept of semantic segmentation. The figure on the left shows a spectrum with an area colored in orange, representing the presence of a note event. Similarly, in image semantic segmentation, as the right figure shows, the highlighted blue area outlines a cactus. Both tasks try to classify the given area, or pixels, into a specific category.

Fig. 2.3. Illustration of the U-net architecture.

In CV, one subtask similar to the multi-streaming scenario has been widely investigated. Its goal is to identify objects in a given image and then categorize them into different classes pixel by pixel, a concept named semantic segmentation (as shown in Fig. 2.2). For the AMT task, spectral representations are the natural choice of input features, and they are two-dimensional feature representations just like images. Many models have been proposed for this task and achieve promising results. Inspired by numerous applications, we choose the U-net architecture [16] as the base, with some modifications that will be described in more detail later. The model has encoder, bottleneck, and decoder parts, each made of stacks of blocks equipped with layers of different types. By stacking blocks, the model can extract increasingly high-level concepts from the given features and use this distilled information to compose the final result within the scope we set up. One critical design element of the U-net architecture is the residual connection between the encoder and decoder blocks. A major drawback of large deep learning systems is that, as more layers are stacked, the gradients can vanish quickly when back-propagation is applied to train the model [46]. One effective way to deal with this issue is to add skip connections between layers, forwarding the outputs of preceding layers directly to later layers [37] (a detailed illustration is given in Fig. 2.3).

2.2.1 Dealing with Long-Term Sequence

a) RNN-based Approach

A natural choice of model for time-related sequences is the recurrent neural network (RNN). Vanilla RNN models can be hard to train because of the same challenge faced by large deep learning systems: as the sequence grows longer, the gradient can vanish quickly. To address this, many improvements to RNNs have been proposed, such as long short-term memory (LSTM) [47], bidirectional LSTM [43], and gated recurrent neural networks [48]. These models have succeeded in preserving the gradients and have been firmly established as state-of-the-art approaches in many sequence modeling tasks, such as language modeling and machine translation [49][50]. Since then, numerous efforts have continued to push the boundaries of RNNs and encoder-decoder architectures [51][52].
Despite being good at modeling time sequences, the inherent computation at each time step generates a series of hidden states. Each state depends on the output of the previous state and makes a prediction for the current time step. This dependent nature makes it difficult to compute the output sequence in parallel within training examples, which limits the ability to model longer sequences, as it would take too much time. To alleviate this problem, practices such as factorization [53] have shown significant improvements in computational efficiency, and the work in [54] uses a conditional computation technique to further improve performance.

b) Attention-based Approach

Recent efforts on this issue have offered a strong solution to the computational limitation through the so-called attention mechanism [55][56]. The basic idea of attention is to allow each position of the state sequence to attend to every other state, that is, to have full sight of the state history. This powerful property relaxes the hard constraint of RNN models, which can depend only on a single previously generated state. In [57], a novel multi-head attention mechanism is proposed, combining the idea of parallelism with the advantages of attention. The model architecture is composed purely of multi-head attention stacks, without other types of computation such as convolution or recurrent units. This marks a milestone by introducing a new fundamental computation unit, in addition to the arithmetic units, such as linear transformation and convolution, into which most models can be broken down, and it exhibits impressive ability compared with conventional approaches. Assuming that the input sequence has length $l$, the formula of multi-head attention can be written as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{n}}\right)V \tag{1}$$

where $Q$, $K$, and $V$ are linear transformations of the original input feature vector representations, and the output has dimensionality $l \times d$, where $d$ is the feature channel. The term $n$ represents the number of heads.
The normalization function $\mathrm{softmax}$ is defined as follows:

$$\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \text{for } j = 1, \dots, K \tag{2}$$

where $z$ represents a vector with dimension $d$. This normalizes the output vector to values ranging from 0 to 1 and enlarges the gap between the greatest value and the others. The key to parallelism is splitting the $QK^{T}$ term into several smaller pieces with a factor of $\sqrt{n}$ along the feature channel $d$. These smaller parts can then be distributed to multiple computation cores to calculate the attention matrix. The attention matrix weights the importance between any pair of input feature vectors, telling each vector at a specific time step which vectors are most related to it. The final output is then generated by multiplying the attention matrix with the feature maps $V$. Various approaches have been proposed that leverage different attention mechanisms and achieve state-of-the-art performance in language translation [58], music generation [12], and speech recognition [59].

Despite its advantages in dealing with long-term sequences, one major limitation of the attention mechanism is its memory consumption, which is proportional to the square of the sequence length. This can be observed from the term $QK^{T}$. The dot product is computed element-wise along the feature dimension $d$, so the output of the term has dimension $l^{2} \times d$. For one-dimensional tasks such as language modeling, it may still be efficient enough to achieve high performance when not much hardware is available. But for tasks such as image generation, or others with two-dimensional input, the quadratic term becomes crucial in determining the feasible input size that fits into memory. One practical solution, proposed in [21], is to divide an image into non-overlapping query blocks. After the image is split into smaller blocks, self-attention is applied. With this partitioning, memory consumption can be reduced to a size suited to the computation. This approach is used in this research and will be discussed in later sections.
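The attention computation and its quadratic memory cost can be illustrated with a minimal single-head sketch in plain Python. Several choices here are assumptions made for illustration: the divisor is left as a parameter `scale`, and the example call sets it to the square root of the feature dimension, the convention of [57]; the queries, keys, and values are taken to be the input itself, without learned linear transformations; and `attention_matrix_entries` is a deliberately simplified 1-D count in which each block attends only to itself, not the exact block scheme of [21].

```python
import math


def softmax(z):
    """Eq. (2); subtracting max(z) only improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, scale):
    """Single-head attention softmax(Q K^T / scale) V on l x d lists of rows.

    Returns the l x d output and the l x l attention matrix.
    """
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights.append(softmax(scores))
    d_out = len(v[0])
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, v)) for c in range(d_out)]
           for w_row in weights]
    return out, weights


def attention_matrix_entries(length, block=None):
    """Count stored attention weights: l^2 in full, or the sum over blocks."""
    if block is None:
        return length * length
    full_blocks, remainder = divmod(length, block)
    return full_blocks * block * block + remainder * remainder


# Toy input (assumed): sequence length l = 3, feature dimension d = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
OUT, WEIGHTS = attention(X, X, X, scale=math.sqrt(2))
```

In this simplified count, a sequence of length 1024 needs 1024 × 1024 = 1,048,576 attention weights in full, but only 16 × 64 × 64 = 65,536 when split into non-overlapping blocks of 64, which illustrates why partitioning makes two-dimensional inputs feasible.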

c) ASPP-based Approach

Additionally, a generalized convolutional computation approach named atrous spatial pyramid pooling (ASPP) was proposed in [60]. This approach employs dilated convolution to enlarge the receptive field, capturing objects at various scales by varying the dilation rate. The formula can be written as follows:

$$y[i, j] = \sum_{m, l} x[i + rm,\, j + rl]\, w[m, l] \tag{3}$$

where $x$ and $y$ denote the input and output 2-D feature maps, respectively, $w$ is the convolution filter to be learned, $r$ is the dilation rate, and $[i, j]$ indicates the location on the feature maps. Standard convolution is the special case $r = 1$. ASPP then performs dilated convolution with multiple dilation rates and pools the resulting feature maps together. The time complexity of normal convolution is proportional to the square of the sequence length, the same as attention. ASPP works on a concept similar to the memory-efficient attention mechanism mentioned above: both split certain components into smaller pieces and stack them back together after the computation. For ASPP, the splitting does not literally happen; rather, with a larger dilation rate, a certain amount of computation is skipped. For instance, for a kernel of size 3 × 3, a total of nine positions in the kernel need to be calculated when a dilation rate of 1 is applied. With a dilation rate of 3, still only nine positions need to be calculated, but the effective receptive field is now 7 × 7, for which a normal convolution would need to compute 49 positions. With its better ability to capture wider contexts, ASPP has been applied to melody extraction and AMT in [18] and [15].

2.3 Data Representations

Apart from defining the model architecture, the choice of data representation is another challenging yet important decision to be made.
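For reference, the dilated convolution of Eq. (3) and the receptive-field arithmetic from the ASPP discussion in Section 2.2.1 can be checked with a short sketch. It covers only the dilated convolution, not the pooling over multiple dilation rates. The kernel indices $m$ and $l$ are assumed to start at 0, no padding is used, and the function names are hypothetical.

```python
def effective_size(kernel_size, r):
    """Side length of the area covered by a kernel with dilation rate r."""
    return (kernel_size - 1) * r + 1


def dilated_conv2d(x, w, r=1):
    """Evaluate Eq. (3) wherever the dilated kernel fits inside x (no padding)."""
    k_rows, k_cols = len(w), len(w[0])
    out_rows = len(x) - effective_size(k_rows, r) + 1
    out_cols = len(x[0]) - effective_size(k_cols, r) + 1
    y = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Only k_rows * k_cols input positions are read, whatever r is.
            row.append(sum(x[i + r * m][j + r * l] * w[m][l]
                           for m in range(k_rows) for l in range(k_cols)))
        y.append(row)
    return y


# Toy data (assumed): a 7 x 7 input of ones and a 3 x 3 kernel of ones.
ONES_7X7 = [[1] * 7 for _ in range(7)]
ONES_3X3 = [[1] * 3 for _ in range(3)]
STANDARD = dilated_conv2d(ONES_7X7, ONES_3X3, r=1)   # 5 x 5 output
DILATED = dilated_conv2d(ONES_7X7, ONES_3X3, r=3)    # 1 x 1 output
```

With a dilation rate of 3, the 3 × 3 kernel still reads nine input values per output location but spans the whole 7 × 7 input, matching the example in the text.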
As pointed out in [42], the selection of the data representation is an additional complication found particularly in the audio domain, in relation

to other tasks in which raw input data is sufficient for competitive performance. Further investigation in [42] demonstrated that the performance of MPE is sensitive not only to the spectrogram type (i.e., linear-frequency scale, log-frequency scale, or constant-Q transform), but even to very basic signal parameters such as the sampling rate. To overcome this instability, [40] employed multi-resolution short-time Fourier transforms (STFTs) computed with different window sizes as the input to a recurrent neural network. The work in [61] also takes a multi-resolution approach to vocal melody extraction, although the different resolutions are exploited in the prediction stage rather than in the input representation. A more recent work proposed the harmonic constant-Q transform (HCQT) [19], in which multiple CQTs with different minimum frequencies are stacked so that the harmonic components of each pitch are aligned. The feature is then fed into a CNN model. The idea behind this design is that stacking all the harmonic components along the channel axis could guide the model to better capture the timbre. These studies indicate that multiple data representations are desirable in deep learning, given the flexibility of combining multiple inputs and the ability of such models to extract correlations among them.

All the above-mentioned studies use spectral representations, such as the spectrogram and the CQT, as input features. The restriction to spectral representations is probably because such features were readily used in other state-of-the-art methods like non-negative matrix factorization (NMF) and sparse coding (SC). This instinctive choice of data representation should be revisited, relaxing the restriction in order to harness the full power of deep learning. To explore this idea, we look for inspiration in the literature on pitch detection functions.
A widely used representation in early feature-based MPE algorithms [62][63][64] is the generalized cepstrum (GC), which is a lag-domain representation. More recent works proposed the generalized cepstrum of spectrum (GCoS) [65][66], a generalization of the autocorrelation of spectrum [67]. Both of these data representations are potential candidates for

deep learning. In this research, different combinations of data representations are compared systematically, leading to competitive performance in different scenarios.

3. Method

There are usually three conventional stages in an AMT system: feature extraction, neural network model training, and post-processing, as shown in Fig. 3.1. Given a mono-channel music signal $\mathbf{x}$ as input, the system predicts a set of note events $\mathcal{N} := \{\mathbf{n}_i\}_{i=1}^{|\mathcal{N}|}$ from $\mathbf{x}$. The neural network model predicts a finite set of instrument classes $\mathcal{S}$, whose scope depends on the provided training data. A note event has four attributes, denoted by $\mathbf{n}_i := (p_i, t_i^{\mathrm{on}}, t_i^{\mathrm{off}}, s_i)$, where $p_i \in [21, 108]$ is the pitch value as a MIDI number, $t_i^{\mathrm{on}} \in \mathbb{R}^+$ is the onset time, $t_i^{\mathrm{off}} \in \mathbb{R}^+$ is the offset time, and $s_i \in \mathcal{S}$ is the instrument class of $\mathbf{n}_i$.

Fig 3.1. An end-to-end flow of the transcription system. The input signal is first processed by extracting multiple time-frequency representations, which are then fed into the model. The model is trained on the input and ground-truth pairs in the datasets. The raw output prediction is a piano-roll-like representation, which needs further refinement by a post-processing algorithm. The final output of the whole process is a MIDI representation.

3.1 Data Representations

Two different spectral representations are considered for pre-processing. One is the Combined Frequency and Periodicity (CFP) approach [68], and the other is the harmonic approach inspired by the Harmonic Constant-Q Transform (HCQT) proposed in [19]. As shown

in both studies, better performance can be obtained by leveraging multiple features at the same time. In [68], information from the time and frequency domains is combined. The assumption is that a note event cannot be described by the spectral view alone, but also needs hints from the periodicity, such as the generalized cepstrum. As for HCQT, [19] claims that stacking harmonic information along the channel axis can guide the model to better capture harmonic information, resulting in better transcription performance.

3.1.1 CFP Representation

Given an input signal $\mathbf{x} := \mathbf{x}[n]$, where $n$ is the time index, and a window function $\mathbf{h}$, with $\mathbf{x}, \mathbf{h} \in \mathbb{R}^N$, the CFP representation is computed from the short-time Fourier transform (STFT) as follows:

$$\mathbf{X} := \left| \sum_{m=0}^{N-1} \mathbf{x}[m + nH]\, \mathbf{h}[m]\, e^{-\frac{j 2\pi k m}{N}} \right| \tag{4}$$

where $k$ is the frequency index and $H$ is the hop size. This is the magnitude spectrogram, i.e., the square root of the power spectrogram. The data representation $\mathbf{Z}$ used as the model input is derived from $\mathbf{X}$ by generalized cepstrum processing and high-pass filtering, which removes the low-frequency parts and slowly varying components in the spectrum [68][69]. $\mathbf{Z}$ contains three channels, denoted by $\mathbf{Z}_f, \mathbf{Z}_q, \mathbf{Z}_g \in \mathbb{R}^{K \times N}$:

$$\mathbf{Z}_f[k, n] := \mathbf{Q}_f \left| \mathbf{W}_f \mathbf{X} \right|^{\gamma_f}, \tag{5}$$

$$\mathbf{Z}_q[q, n] := \mathbf{Q}_q \left| \mathbf{W}_q \mathbf{F}^{-1} \mathbf{Z}_f \right|^{\gamma_q}, \tag{6}$$

$$\mathbf{Z}_g[k, n] := \mathbf{Q}_f \left| \mathbf{W}_f \mathbf{F} \mathbf{Z}_q \right|^{\gamma_g}, \tag{7}$$

where $\mathbf{F}$ denotes the $N$-point DFT matrix, $\mathbf{W}_f$ and $\mathbf{W}_q$ are two high-pass filters for eliminating slowly varying parts, and $|\cdot|^{\gamma}$ is an element-wise power-scaled nonlinear function. The parameters $\gamma_f$, $\gamma_q$, and $\gamma_g$ are set to 0.24, 6, and 1, respectively, as suggested by [70]. As indicated in [65], most of the pitch detection functions in the literature can be derived from Eq. (5)-(7) by changing the combination of these three parameters. For example, for $(\gamma_f, \gamma_q) = (2, 1)$, $\mathbf{Z}_q$ is the autocorrelation function (ACF), since the inverse DFT of the power spectrum is the ACF; when $(\gamma_f, \gamma_q, \gamma_g) = (1, 2, 1)$, $\mathbf{Z}_g$

stands for the ACF of spectrum, which is useful in resolving the effect of missing fundamentals in a spectrum and can thus effectively reduce pitch detection errors [67]. In sum, $\mathbf{Z}_f$ is a power-scaled spectrogram, $\mathbf{Z}_q$ is a generalized cepstrum (GC), and $\mathbf{Z}_g$ is a generalized cepstrum of spectrum (GCoS). To fit the perceptual scale of pitch, two triangular filterbanks, $\mathbf{Q}_f$ and $\mathbf{Q}_q$, are applied to map a feature from the frequency and time domains to the log-frequency domain. Both filterbanks have 384 triangular filters, ranging from 16.35 Hz (C0) to 4,186 Hz (C8), with a resolution of 48 bins per octave (four bins per semitone).

Fig. 3.2. Illustration of three different data representations. From top to bottom: power-scaled spectrogram ($\mathbf{Z}_f$), generalized cepstrum ($\mathbf{Z}_q$), and generalized cepstrum of spectrum ($\mathbf{Z}_g$).

Fig. 3.2 illustrates examples of $\mathbf{Z}_f$, $\mathbf{Z}_q$, and $\mathbf{Z}_g$ for a piano solo segment. As $\mathbf{Z}_f$ tends to reveal the fundamental frequencies and their harmonics in a signal, most of its energy concentrates in the high-frequency range. On the other hand, $\mathbf{Z}_q$ usually reveals the fundamental frequencies and their sub-harmonics, which causes its energy to accumulate in the low-frequency part. In both cases, the true fundamental frequencies are mostly of weak

salience in comparison with the harmonic/sub-harmonic components. This makes the results sensitive to interference from irrelevant information such as noise. The issue is mitigated in $\mathbf{Z}_g$ through a high-pass filtering process: the high-frequency parts in $\mathbf{Z}_q$ are suppressed so as to enhance the weak fundamental frequencies located in the low-frequency range.

3.1.2 Harmonic Representation

With a pre-computed time-frequency representation, multiple pitch-shifted versions of the representation are stacked together such that the harmonic peaks are aligned to the same frequency index [19]. By doing so, we expect that a local convolutional kernel can cover the global pitch profile (i.e., the whole harmonic pattern) of a component, leading to better discrimination of note events. Furthermore, we extend this idea to the time-domain features. Consider the $m$th harmonic frequency $m f_0$ of a fundamental frequency $f_0$. In equal temperament, the pitch of $m f_0$ is $\eta(m) := \mathrm{round}(12 \log_2 m)$ semitones higher than that of $f_0$. For instance, the 2nd, 3rd, and 4th harmonics of $f_0$ are 12, 19, and 24 semitones higher than $f_0$, respectively. Likewise, the 2nd, 3rd, and 4th sub-harmonics are lower than $f_0$ by 12, 19, and 24 semitones. The formula can be written as follows:

$$\mathbf{Z}_f^{(m)}[k, n] := \mathbf{Z}_f[k + \eta(m) \cdot \delta,\; n] \tag{8}$$

$$\mathbf{Z}_q^{(m)}[k, n] := \mathbf{Z}_q[k - \eta(m) \cdot \delta,\; n] \tag{9}$$

$$\mathbf{Z}_g^{(m)}[k, n] := \mathbf{Z}_g[k + \eta(m) \cdot \delta,\; n] \tag{10}$$

where $\delta$ is the number of bins per semitone. Notice that when $m = 1$, $\mathbf{Z}_f^{(1)} = \mathbf{Z}_f$ and $\mathbf{Z}_q^{(1)} = \mathbf{Z}_q$. In this research, we set $m = 1, 2, \ldots, 6$. The representations are stacked along the channel axis, so the number of channels is a multiple of 6. Possible combinations are $\mathbf{Z}_{\mathrm{HCFP}} := [\mathbf{Z}_f^{(1:m)}, \mathbf{Z}_q^{(1:m)}]$ or $\mathbf{Z}_{\mathrm{HCFP}} := [\mathbf{Z}_f^{(1:m)}, \mathbf{Z}_q^{(1:m)}, \mathbf{Z}_g^{(1:m)}]$. The final combination will be

determined in the later experiments. Since the frequency scale has 48 bins per octave, we have $\delta = 4$. All the input audio recordings in this research are mono-channel. The STFT is computed with a Blackman-Harris window of 0.128 seconds and a hop size of 0.02 seconds.

3.2 Model

As shown in Fig. 2.3, the model architecture contains three parts: encoder, bottleneck, and decoder. In this research, we investigate two types of model: an ASPP-based model and an attention-based model. The base architecture originates from DeepLabV3 and its improved version, DeepLabV3+ [60], both of which are fully convolutional neural networks with an encoder-decoder architecture. These networks were tested on image semantic segmentation tasks and achieved state-of-the-art results. The encoder consists of four encoder block groups, having 2, 3, 4, and 5 convolutional blocks, respectively. For every group, there is a corresponding decoder block (note that this is a block, not a group). The encoder and decoder blocks are illustrated in Fig. 3.3, and the detailed model architecture is listed in Table 3.6.

Fig. 3.3. Illustration of the detailed components of the encoder and decoder blocks. For each pair of encoder and decoder blocks, there is a corresponding residual connection in between. The output of the decoder block is concatenated to the input of the decoder block after the batch normalization layer.

For the bottleneck part, two different types of block are used in this research. The first is the ASPP block, in which dilated convolution layers are employed, bringing a wider view of the latent space while keeping the same memory usage. Eq. (3) gives the general formula of dilated convolution. We use three dilation rates in this research, $r \in \{1, 2, 4\}$. Fig. 3.4 shows the concept of the ASPP block.

Fig. 3.4. Illustration of the mechanism of the ASPP block. The input feature map is processed by kernels with different dilation rates, and the outputs are pooled together to capture objects at various scales.

The second type is the attention block, forming an attention-based model. As pointed out before, the attention mechanism can consume a large amount of memory for a long sequence. In our model, the first two dimensions of each hidden layer are $K \times N$, the same as the input, meaning that the input and output channels keep the same dimension throughout the propagation in the model. To compute self-attention on a 2D feature map, the most direct way is to flatten the input of size $d \times K \times N$ into a sequence of size $d \times KN$, where $d$ is the feature dimension and $KN$ is the sequence length. This is, however, impractical in our case, since the length reaches nearly 50,000 ($KN = 384 \times 128 = 49{,}152$), and a desktop computer cannot afford the memory consumption of the quadratic term $\mathbf{Q}\mathbf{K}^T$. To overcome this issue, we adopt the technique proposed in [21]. The input features are divided into non-overlapping query blocks and processed over a memory flange that bounds the receptive field of the query blocks. The outputs are then assembled back to the same size as the input

features. The self-attention block is described as follows:

$$q_a = \mathrm{layernorm}\big(q + \mathrm{dropout}(\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}))\big) \tag{11}$$

$$q' = \mathrm{layernorm}\big(q_a + \mathrm{dropout}(\mathbf{W}_1\, \mathrm{ReLU}(\mathbf{W}_2\, q_a))\big) \tag{12}$$

where $\mathbf{Q} = \mathbf{W}_Q q$, $\mathbf{K} = \mathbf{W}_K q$, and $\mathbf{V} = \mathbf{W}_V q$. $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_1$, and $\mathbf{W}_2$ are learnable parameters that are optimized during training. In (11), $q$ is the flattened feature map bounded by the memory flange. Equation (12) describes a feed-forward neural network, where $\mathbf{W}_1$ and $\mathbf{W}_2$ are shared across all the positions in a layer. Fig. 3.5 visualizes the complete process of the self-attention block.

Fig. 3.5. The diagram of the self-attention block. The feature map from the previous layer is partitioned into non-overlapping blocks. The blocks are then fed into the self-attention layer in raster-scan order. The outputs are then concatenated back to form the final output feature map, which is passed to the next layer.

The output of the proposed model is a multi-channel representation, with each channel having dimension $K \times N$. The model predicts $|\mathcal{S}|$ instrument classes, and for each instrument class $s \in \mathcal{S}$, we use two event-type channels to represent the likelihood of a note event occurring at a specific time and pitch: one for note onset (denoted as $\mathbf{Y}_s^{\mathrm{on}}$) and the other for pitch activation (i.e., the process between the note onset and note offset events, denoted as $\mathbf{Y}_s^{\mathrm{act}}$). The note offset and the whole note event are identified in the post-processing stage based on these two channels. An additional channel is added to the output to represent the class “others”. In total, there are $2|\mathcal{S}| + 1$ channels at the output of the model. For more details of the parameter settings, Table 3.6 lists the settings of each layer of the ASPP-based and attention-based models.
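To make Eq. (11) and (12) concrete, the following is a minimal NumPy sketch of one self-attention block applied to a single flattened block of positions. It is an illustration under stated assumptions, not the thesis implementation: `Attention` is taken to be scaled dot-product attention, dropout is omitted (inference mode), layer normalization has no learnable scale or shift, and the partitioning into query blocks with a memory flange is left out. Rows are positions, so `q @ W` here plays the role of $\mathbf{W} q$ in the equations. The names `layer_norm`, `softmax`, `self_attention_block`, and the `params` dictionary are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_block(q, params):
    """One block following Eq. (11)-(12), with dropout omitted.

    q      : (L, d) array, L flattened positions with d features each.
    params : dict with W_Q, W_K, W_V of shape (d, d),
             W_2 of shape (d, d_ff) and W_1 of shape (d_ff, d).
    """
    Q = q @ params["W_Q"]
    K = q @ params["W_K"]
    V = q @ params["W_V"]
    # Assumption: scaled dot-product attention. The (L, L) score matrix is
    # the quadratic term that makes a full 384 x 128 map impractical.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attended = softmax(scores, axis=-1) @ V
    q_a = layer_norm(q + attended)                               # Eq. (11)
    ffn = np.maximum(q_a @ params["W_2"], 0.0) @ params["W_1"]   # W_1 ReLU(W_2 q_a)
    return layer_norm(q_a + ffn)                                 # Eq. (12)
```

Because the score matrix has $L^2$ entries, applying this function to small blocks rather than to the whole feature map is what keeps the memory use manageable.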

Table 3.6. Detailed settings of each layer for the ASPP-based (Conv) and attention-based (Attn) models.

Part                Layer settings
Input               Input Feature: 128 × 384 × CH
Encoder             Conv: 32/(7, 7)/(1, 1)
                    Enc Block: 32/(3, 3)/(2, 2)
                    Enc Block: 32/(3, 3)/(1, 1)
                    Enc Block: 64/(3, 3)/(2, 2)
                    2 × Enc Block: 64/(3, 3)/(1, 1)
                    Enc Block: 128/(3, 3)/(2, 2)
                    3 × Enc Block: 128/(3, 3)/(1, 1)
                    Enc Block: 256/(3, 3)/(2, 2)
                    4 × Enc Block: 256/(3, 3)/(1, 1)
Bottleneck (Conv)   ASPP: 512/(3, 3)/(1, 1)/1
                    ASPP: 512/(3, 3)/(1, 1)/2
                    ASPP: 512/(3, 3)/(1, 1)/4
                    Conv: 256/(1, 1)/(1, 1)
Bottleneck (Attn)   Attn: 64/(100, 32)/(8, 8)
                    Attn: 128/(64, 16)/(8, 8)
Decoder             Dec Block: 128/(3, 3)/(2, 2)
                    Dec Block: 64/(3, 3)/(2, 2)
                    Dec Block: 32/(3, 3)/(2, 2)
                    Dec Block: 32/(3, 3)/(2, 2)
Output              Output: 128 × 384 × C
Total parameters    Conv: 12,571,907    Attn: 7,920,343

The input dimension is 128 × 384 × CH, where CH is the number of input channels. The first two output dimensions are the same as the input shape, and the third dimension C represents the output classes. The terms Conv, Enc Block, Dec Block, ASPP, and Attn represent the convolution layer, encoder block, decoder block, ASPP block, and self-attention block, respectively. Each layer type is followed by groups of numbers separated by slashes, whose meaning depends on the layer type. For all layer types, the first number denotes the number of output channels. For Conv, Enc Block, and Dec Block, the second and third groups represent the kernel size and stride, respectively. For ASPP, the second and third parameters denote the kernel size and dilation rate. For Attn, the second and third groups refer to the query shape and memory flange. The last row of the table shows the total number of parameters of the two models.
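As a concrete reading of Eq. (3) and of the dilation rates $r \in \{1, 2, 4\}$ used in the ASPP bottleneck, the sketch below implements single-channel dilated convolution in NumPy and stacks the outputs of several rates. It is only an illustration under assumptions that are not in the text: the kernel indices $m, l$ run from 0 to the kernel size minus one, zero padding keeps the output the same size as the input, kernels are square with odd size, and "pooling together" is approximated by stacking along a new last axis. The function names `dilated_conv2d`, `effective_size`, and `aspp` are hypothetical.

```python
import numpy as np


def dilated_conv2d(x, w, r):
    """Eq. (3) on the valid region: y[i, j] = sum_{m, l} x[i + r*m, j + r*l] * w[m, l]."""
    kh, kw = w.shape
    out_h = x.shape[0] - r * (kh - 1)
    out_w = x.shape[1] - r * (kw - 1)
    y = np.zeros((out_h, out_w))
    for m in range(kh):
        for l in range(kw):
            # Each kernel tap reads a copy of x shifted by (r*m, r*l).
            y += w[m, l] * x[r * m : r * m + out_h, r * l : r * l + out_w]
    return y


def effective_size(k, r):
    """Side length of the receptive field of a k x k kernel with dilation rate r."""
    return r * (k - 1) + 1


def aspp(x, kernels, rates=(1, 2, 4)):
    """Apply one dilated convolution per rate and stack the same-size outputs."""
    outputs = []
    for w, r in zip(kernels, rates):
        pad = r * (w.shape[0] - 1) // 2   # zero padding so the output matches x
        outputs.append(dilated_conv2d(np.pad(x, pad), w, r))
    return np.stack(outputs, axis=-1)
```

With a 3 × 3 kernel, `effective_size(3, 1)` is 3 and `effective_size(3, 3)` is 7, while the number of kernel taps evaluated per output position stays at nine in both cases.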

3.3 Label Smoothing

In polyphonic music transcription, the label distribution is usually highly imbalanced. Most of the pitch activation and onset labels on the time-frequency plane are zero-valued (i.e., silence). When a piano roll is drawn on a 2D plane, a note is just a line, and a note onset is merely a dot occupying a single pixel at the start of that line. The rest of the plane is all zero, and a model trained on such labels with pixel-wise binary cross-entropy as the loss function would be misled into predicting zero everywhere. To deal with this issue, focal loss [71] is adopted. In computer vision, focal loss has proven effective in one-stage dense object detection, where examples of background classes are extremely dense but examples of foreground classes are sparse. Focal loss has also been shown useful in vocal melody extraction [18] and MPE [15]. Given the ground truth $y$ at a pixel and the model prediction $p$ at that pixel, the focal loss can be written as:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t). \tag{13}$$

The focal loss is parametrized by a weighting factor $\alpha \in [0, 1]$ and a focusing factor $\gamma \geq 0$. In (13), $\alpha_t = \alpha$ when $y = 1$ and $\alpha_t = 1 - \alpha$ otherwise, and $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise. The factor $\alpha$ balances the loss between the activation and silence examples, and the term $(1 - p_t)^{\gamma}$ balances the loss between the examples that are correctly predicted and those that are wrongly predicted. As suggested in [71], $\alpha$ and $\gamma$ are set to 0.25 and 2, respectively.

Apart from the imbalance between activation and silence events, there is also a serious imbalance between instrument classes in a multi-instrument scenario. As we will see in Table 4.1, most of the instrument classes are in the extreme minority.
In such a case, a model could eventually become over-confident in predicting the majority instrument classes, while suppressing the performance on the other instruments. Over-confidence does not occur only in

multi-instrument tasks; in piano solo transcription, there is also a tug of war between the class ‘piano’ and the class ‘others’, in which ‘piano’ plays the role of the minority, and vice versa. To handle this difficulty, the label smoothing (LS) method smooths the label distribution by imposing a penalty on over-confident outputs. In general, LS transfers ‘confidence’ from the majority classes to the minority classes. Formally, consider a dataset with samples $x$, labels $y_s$, and $|\mathcal{S}|$ different classes, $1 \leq s \leq |\mathcal{S}|$. The ideal distribution of the predicted label $\hat{y}_s$ can be described as $D(\hat{y}_s \mid x) = \Phi_{\hat{y}_s, y_s}$, where $\Phi_{\hat{y}_s, y_s} = 1$ for $\hat{y}_s = y_s$ and $\Phi_{\hat{y}_s, y_s} = 0$ otherwise. With label smoothing, the modified label distribution is

$$D'(\hat{y}_s \mid x) = (1 - \lambda)\, \Phi_{\hat{y}_s, y_s} + \lambda\, p(|\mathcal{S}|), \tag{14}$$

where $\lambda$ is the smoothing factor and $p(\cdot)$ is a prior label distribution, usually taken to be uniform over the $|\mathcal{S}|$ classes; that is, with probability $\lambda$ the label is drawn from the uniform distribution. A more general form of label smoothing is discussed in the recent work [72]. To stay aligned with the central topic of this research, we follow the naïve version of LS described above, setting $p(|\mathcal{S}|) = 1/|\mathcal{S}|$. Further details about the label smoothing technique can be found in [73][74].

3.4 Post-processing

To transform the raw prediction probabilities into a MIDI representation, a final post-processing stage is applied. The output values $\mathbf{Y}_s^{\mathrm{on}}[k, n]$ and $\mathbf{Y}_s^{\mathrm{act}}[k, n]$ represent the likelihood of onset and activation of instrument class $s$, respectively. $\mathbf{Y}_s^{\mathrm{on}}$ determines the onset time of a note event, and $\mathbf{Y}_s^{\mathrm{act}}$ determines the duration of the corresponding note event; both take values between 0 and 1 when the model makes predictions. At this stage, it is common to binarize the output values by setting a threshold and clipping the values to zero or one.
Such a treatment, however, is ineffective in multi-instrument transcription, as the distribution of the output values can differ between instrument classes. This greatly impacts the stability when

transforming different instruments into MIDI output, and additional tuning of multiple thresholds for different channels would be necessary. Therefore, a more adaptive and general post-processing method is required for multi-instrument transcription. We propose a newly developed normalization-based post-processing procedure to meet this requirement. The procedure consists of four steps: global normalization, instrument selection, local normalization, and note inference. Details are described in the following.

1) Global normalization: Two sets of channels are normalized independently: one is all the onset channels $\mathbf{Y}_{\mathcal{S}}^{\mathrm{on}}$, and the other is all the activation channels $\mathbf{Y}_{\mathcal{S}}^{\mathrm{act}}$. Each set is normalized by z-scoring, so that its overall mean and standard deviation become zero and one, respectively.

2) Instrument selection: This step follows the all-or-none principle: the instrument classes appearing in the prediction result are selected by a global threshold $\theta^{\mathrm{ins}}$. Whether an instrument should be selected is decided by a confidence value $v_s$, defined as the sum of the standard deviations of $\mathbf{Y}_s^{\mathrm{on}}$ and $\mathbf{Y}_s^{\mathrm{act}}$ (denoted as $\sigma_s^{\mathrm{on}}$ and $\sigma_s^{\mathrm{act}}$, respectively). In other words, $v_s := \sigma_s^{\mathrm{on}} + \sigma_s^{\mathrm{act}}$. A low value of $v_s$ implies that $\mathbf{Y}_s^{\mathrm{on}}$ and $\mathbf{Y}_s^{\mathrm{act}}$ approximate an all-zero prediction, which indicates that the instrument class $s$ might not exist. By contrast, a high value of $v_s$ suggests the existence of the instrument class $s$ in the music piece. The assumption behind using the standard deviation to filter instruments is that instruments present in the given music should have higher prediction values, and thus a higher standard deviation. The set of selected instruments $\mathcal{S}_p$ therefore consists of the classes whose $v_s$ is greater than the global threshold $\theta^{\mathrm{ins}}$.
Instrument classes that are filtered out are considered absent from the music piece and are not processed in the following stages. The value of $\theta^{\mathrm{ins}}$ is tuned on the validation set.

3) Local normalization: The z-score normalization is applied again, but this time

applied in a channel-wise manner: each channel ($\mathbf{Y}_{s'}^{\mathrm{on}}$ and $\mathbf{Y}_{s'}^{\mathrm{act}}$) of the selected instruments ($s' \in \mathcal{S}_p$) is z-score normalized. After this process, the output values are filtered by the thresholds $\theta^{\mathrm{on}}$ and $\theta^{\mathrm{act}}$ for the onset and activation channels, respectively. Values under the threshold are set to zero; otherwise the original value is kept. These two thresholds are tuned on the validation set.

4) Note inference: After the normalization and thresholding processes, the resulting $\mathbf{Y}_{s'}^{\mathrm{on}}[k, n]$ and $\mathbf{Y}_{s'}^{\mathrm{act}}[k, n]$ are used for the final inference of note onset and note duration. A note onset position $[k^{\mathrm{on}}, n^{\mathrm{on}}]$ is determined where $\mathbf{Y}_{s'}^{\mathrm{on}}[k, n]$ is at a local maximum, and the minimum distance between two consecutive peaks is set to $\eta = 50$ ms; this can be done by finding the maximum over a sliding window with a length of $2\eta = 100$ ms. When an onset event is detected, it triggers a search for the corresponding offset event in $\mathbf{Y}_{s'}^{\mathrm{act}}$. This offset event is at $[k^{\mathrm{on}}, n^{\mathrm{on}} + \delta]$, where the note duration $\delta$ is the smallest value that introduces a silence interval $\epsilon$ longer than 60 ms starting from $[k^{\mathrm{on}}, n^{\mathrm{on}} + \delta]$.

The pseudo-code of the complete post-processing procedure is given in Algorithm 1.

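Algorithm 1 Post-processing

Input: Model predictions $\mathbf{Y}_s^{\mathrm{on}}[k, n]$ and $\mathbf{Y}_s^{\mathrm{act}}[k, n]$, $s \in \mathcal{S}$
Output: Transcribed note attributes $\{k, t^{\mathrm{on}}, t^{\mathrm{off}}, s\}$
Parameters: Thresholds $\theta^{\mathrm{ins}}$, $\theta^{\mathrm{on}}$, and $\theta^{\mathrm{act}}$; onset distance $\eta$; silence interval $\epsilon$

    $\{\mathbf{Y}_s^{\mathrm{on}}\}_{s \in \mathcal{S}} \leftarrow \text{z-score}(\{\mathbf{Y}_s^{\mathrm{on}}\}_{s \in \mathcal{S}})$
    $\{\mathbf{Y}_s^{\mathrm{act}}\}_{s \in \mathcal{S}} \leftarrow \text{z-score}(\{\mathbf{Y}_s^{\mathrm{act}}\}_{s \in \mathcal{S}})$
    for each $s \in \mathcal{S}$ do
        $v_s \leftarrow \sigma_s^{\mathrm{on}} + \sigma_s^{\mathrm{act}}$
        if $v_s > \theta^{\mathrm{ins}}$ then $\mathcal{S}_p \leftarrow \mathcal{S}_p \cup \{s\}$
    end for
    for each $s' \in \mathcal{S}_p$ do
        $\mathbf{Y}_{s'}^{\mathrm{on}} \leftarrow \text{z-score}(\mathbf{Y}_{s'}^{\mathrm{on}})$
        $\mathbf{Y}_{s'}^{\mathrm{act}} \leftarrow \text{z-score}(\mathbf{Y}_{s'}^{\mathrm{act}})$
        $\mathbf{Y}_{s'}^{\mathrm{on}}[k, n] \leftarrow 0 \;\; \forall [k, n]$ s.t. $\mathbf{Y}_{s'}^{\mathrm{on}}[k, n] < \theta^{\mathrm{on}}$
        $\mathbf{Y}_{s'}^{\mathrm{act}}[k, n] \leftarrow 0 \;\; \forall [k, n]$ s.t. $\mathbf{Y}_{s'}^{\mathrm{act}}[k, n] < \theta^{\mathrm{act}}$
        $\{k, t^{\mathrm{on}}\} \leftarrow \arg\max_{k \in K,\; t^{\mathrm{on}} - \eta < n < t^{\mathrm{on}} + \eta} \{\mathbf{Y}_{s'}^{\mathrm{on}}[k, n]\}$
        for each $t^{\mathrm{on}}$ do
            $\delta \leftarrow \arg\min_{\delta} \{\max \mathbf{Y}_{s'}^{\mathrm{act}}[k, n + \delta : n + \delta + \epsilon] = 0\}$
            $t^{\mathrm{off}} \leftarrow t^{\mathrm{on}} + \delta$
        end for
    end for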
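The four steps of the post-processing procedure can be sketched in NumPy as follows. This is a minimal illustration rather than the thesis implementation, and it rests on several assumptions: the predictions are arrays of shape (instruments, pitch bins, frames); the onset distance and silence interval are given in frames (`eta` and `sil`, with 3 frames being a rounded stand-in for 50 to 60 ms at a 0.02 s hop); the thresholds are positive, so that kept values are non-zero; onset peaks are searched separately for each pitch bin along time; and ties inside a window are resolved in favor of the earliest frame. The function names `zscore` and `post_process` are hypothetical.

```python
import numpy as np


def zscore(a, eps=1e-8):
    """Normalize an array to zero mean and unit standard deviation."""
    return (a - a.mean()) / (a.std() + eps)


def post_process(Y_on, Y_act, th_ins, th_on, th_act, eta=3, sil=3):
    """Return a list of notes (pitch_bin, onset_frame, offset_frame, instrument).

    Y_on, Y_act : arrays of shape (S, K, N) with onset / activation predictions.
    eta, sil    : onset distance and silence interval, in frames (assumed units).
    """
    # Step 1: global normalization over all onset / all activation channels.
    Y_on, Y_act = zscore(Y_on), zscore(Y_act)
    notes = []
    for s in range(Y_on.shape[0]):
        # Step 2: instrument selection by v_s = sigma_on + sigma_act.
        if Y_on[s].std() + Y_act[s].std() <= th_ins:
            continue
        # Step 3: channel-wise normalization, then zero out values below threshold.
        on, act = zscore(Y_on[s]), zscore(Y_act[s])
        on = np.where(on < th_on, 0.0, on)
        act = np.where(act < th_act, 0.0, act)
        n_frames = on.shape[1]
        # Step 4: note inference.
        for k, n in zip(*np.nonzero(on)):
            lo, hi = max(0, n - eta), min(n_frames, n + eta + 1)
            # Keep (k, n) only if it is the (first) maximum of its window.
            if lo + int(np.argmax(on[k, lo:hi])) != n:
                continue
            # Smallest duration d followed by `sil` silent activation frames.
            d = 1
            while n + d < n_frames and act[k, n + d : n + d + sil].any():
                d += 1
            notes.append((int(k), int(n), int(n + d), s))
    return notes
```

Frame indices can be converted to seconds by multiplying with the hop size, and pitch bins to MIDI numbers by dividing by the number of bins per semitone and adding the MIDI number of the lowest bin; both conversions are omitted here.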

4. Experiments

4.1 Settings

To determine the most appropriate combination for transcribing music signals, we first compare the efficacy of the different data representations described in Section 3. A total of 7 feature combinations are used for training and evaluation: $\mathbf{Z}_f$, $\mathbf{Z}_q$, $\mathbf{Z}_g$, $[\mathbf{Z}_f, \mathbf{Z}_q]$, $[\mathbf{Z}_f, \mathbf{Z}_g]$, $[\mathbf{Z}_q, \mathbf{Z}_g]$, and $[\mathbf{Z}_f, \mathbf{Z}_q, \mathbf{Z}_g]$. After identifying the combination that gives the best performance, we compare CFP and HCQT with the same combination setup to verify whether harmonic information benefits the results.

After the feature validation, we compare different models working in conjunction with the label smoothing strategy discussed in Section 3.3. More specifically, we consider the following three settings:

- Using the ASPP as the bottleneck part to connect the encoder and decoder. Label smoothing is not applied in the training process. Since this model is a fully convolutional network, this setting is denoted by Conv hereafter.
- Using the ASPP to connect the encoder and decoder. Label smoothing is employed in the training process. We refer to this scheme as Conv-LS.
- Using the self-attention block to replace the original ASPP bottleneck part. Label smoothing is applied in the training process. This setting is designated as Attn-LS.

There are two different approaches to pitch-only (i.e., MPE and NT) transcription with the proposed model. The first is to train the model on a multi-instrument dataset but with only one instrument output, which transcribes all the instruments at the same time. The second is to predict the instruments separately, with each output channel representing a specific instrument class, and to merge them into a single channel by summing them together. In this research, the

second approach is employed, with the advantage that we can inspect the performance of the same model on both the single- and multi-instrument tasks and test its generalizability. More specifically, for all scenarios (MPE, NT, instrument-informed MPS/NS, and instrument-agnostic MPS/NS; see Fig 1), the models are trained only on the multi-instrument NS task. That is, the results of MPE and NT are all derived from the results of the NS task. The MPE results are obtained by summing the outputs over all the channels, and performing normalization and thresholding in the same way as described in Algorithm 1. For NT, the onset and pitch activation channels are summed separately into two channels. Normalization and thresholding are applied independently on the two channels. No additional training specifically tailored for MPE and NT is done, for two reasons: 1) the effectiveness of the proposed model on MPE has been demonstrated in [15], and 2) reporting the degraded (obviously underestimated) performance of MPE and NT demonstrates the generalization power of the proposed model.

4.2 Datasets

Two main datasets are used for training, and several additional datasets are used for performance validation. We validate pitch-only transcription on both the single- and multi-instrument datasets. The instrument-informed and instrument-agnostic tasks are both tested on multi-instrument datasets. The following subsections describe the datasets in more detail.

4.2.1 Single-instrument Datasets

The single-instrument dataset used in the experiments is a subset of the MAPS dataset [75]. This subset contains 60 real piano solo recordings (ENSTDkCl and ENSTDkAm), and this setting is often taken as the benchmark in piano solo transcription. In this research, this subset of the MAPS dataset is used only for testing the performance of the models, and is excluded from training.

Following the tradition set by state-of-the-art piano transcription methods, we train our models on the MAESTRO dataset [11], which contains a total of 1,184 real piano performance recordings collected from the International Piano-e-Competition, with a total length of 172.3 hours. As the dataset comprises only a single instrument, the piano, we only report the results of the MPE and NT tasks.

4.2.2 Multi-instrument Datasets

Three multi-instrument datasets are used. The first one is the MusicNet [76] dataset, which is used for training and testing. The dataset includes 330 pieces of solo and ensemble music. All of the music pieces are real-world performances, and the ground truth is generated by an audio-to-score alignment algorithm. There are 11 classes of instruments in the MusicNet dataset, namely piano (pn), violin (vn), viola (va), cello (vc), flute (fl), horn (hn), bassoon (bn), clarinet (cl), harpsichord (hpd), contrabass (db), and oboe (ob). We follow the default partition of the training and testing sets: 320 pieces are for training, and the remaining 10 pieces are for testing.

As mentioned before, there is a high imbalance in the number of samples between different instruments. Table 4.1 shows the portion of the total length of each instrument type. The length is computed by accumulating the length of each note, rather than only considering the time during which each instrument appears in the music piece. The reason for counting in this way is that the model is supposed to predict multiple notes at the same time in AMT, so the notes should be counted separately. As can be seen in Table 4.1, the piano recordings constitute a far larger portion than the other instruments. In contrast, contrabass contributes almost nothing to the length in the MusicNet dataset. Note that although the training set contains 11 instrument classes, the test set has only 7 of them.
Table 4.2 lists the test set of MusicNet in more detail. To tune the thresholds, we randomly pick 40 pieces from the training set as the validation set.

Instrument class     % in MusicNet   % in Ext-Su   % in URMP
Piano (pn)           59.24           28.62         -
Violin (vn)          16.23           30.89         34.71
Cello (vc)           9.37            10.09         14.16
Viola (va)           8.16            13.35         15.38
Clarinet (cl)        2.2             6.08          10.86
Horn (hn)            1.3             2.24          -
Bassoon (bn)         1.39            3.77          -
Flute (fl)           0.57            2.85          15.37
Oboe (ob)            0.72            1.51          6.01
Contrabass (db)      0.29            0.61          3.52
Harpsichord (hpd)    0.53            -             -

Table 4.1. Instrument classes, abbreviations, and the portion of note length in each multi-instrument dataset.

File name   Number of instrument classes   Length (secs)
1759        1 (pn)                         530
1819        3 (hn, bn, cl)                 580.2
2106        3 (vn, va, vc)                 640.6
2191        1 (vn)                         97
2298        1 (vc)                         150.4
2303        1 (pn)                         220
2382        3 (vn, va, vc)                 243.4
2416        3 (hn, bn, cl)                 267.7
2556        1 (pn)                         436.8
2628        2 (pn, vn)                     356.6

Table 4.2. Information of instrument classes in the MusicNet test set.