Chung Hua University Master's Thesis
應用無參數區別分析演算法及調變頻譜特徵分析於音樂風格之自動分類
Automatic music genre classification using nonparametric discriminant analysis and modulation spectral feature analysis
Department: Master Program, Department of Computer Science and Information Engineering
Student: M09702041 方仁政
Advisor: Dr. 李建興
February 2011
Abstract

This thesis proposes using modulation spectrum analysis to observe long-term feature variations and to extract features from them. First, a feature vector is extracted for every frame of a song (the per-frame features include MFCC, OSC, and MPEG-7 NASE). Modulation spectrum analysis is then applied to characterize the frame-to-frame variation of these features, and five modulation spectral features are extracted from each modulation subband: modulation spectral centroid, modulation spectral energy, modulation spectral contrast, modulation spectral flatness, and modulation spectral valley. In the experiments, after the required features are extracted from an input test music signal, linear normalization and principal component analysis (PCA) are applied for dimensionality reduction, followed by a nonparametric discriminant analysis (NDA) transformation. The Euclidean distance between the test signal and each music class is then computed, and the class with the smallest distance is taken as the recognition result. The experimental results clearly show that the features derived from modulation spectrum analysis outperform the conventional approach of taking the mean and standard deviation vectors over all frames, with a best classification accuracy of 89.99%.

Keywords: music genre classification, modulation spectrum analysis, principal component analysis, nonparametric discriminant analysis
ABSTRACT
With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. Since a general music database often contains millions of music tracks, it is very difficult to manage such a large digital music database. Therefore, it will be helpful if the music tracks are properly categorized. In this thesis, a novel feature set derived from modulation spectrum analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and the normalized audio spectral envelope (NASE) is proposed for capturing the time-varying behavior of music signals. The modulation spectrum is decomposed into several logarithmically spaced modulation subbands.
For each modulation subband, five modulation spectral features are computed as the feature values: modulation spectral peak, modulation spectral valley, modulation spectral energy, modulation spectral centroid, and modulation spectral flatness. Statistical aggregation of the modulation spectral features and principal component analysis (PCA) are employed to reduce the feature dimension. Furthermore, linear discriminant analysis (LDA) or nonparametric discriminant analysis (NDA) is then employed to further reduce the feature dimension and improve the classification accuracy. The experimental results show that the proposed feature vector achieves better performance than features obtained by taking the mean and standard deviation operations.
Keywords: music genre classification, modulation spectrum analysis, principal component analysis, nonparametric discriminant analysis

Acknowledgments
My graduate studies not only gave me an understanding of the field of music recognition but, more importantly, taught me the attitude and spirit of research: beyond perseverance and determination, one needs curiosity, enthusiasm, and a commitment to rigor, qualities that lead to achievement and breakthroughs in any field. The person who helped me grasp this deeper meaning is my advisor, Professor 李建興. Under his careful guidance, he not only corrected my attitude toward research but also let me experience the joy of the moment a bottleneck is finally overcome. During the writing of this thesis, I sincerely thank him for reading and revising it in detail many times so that it could be completed smoothly. I also thank Professors 石昭玲, 韓欽銓, 周智勳, and 連振昌 for their support, encouragement, and guidance in my coursework and reports.
Life in graduate school was filled with laughter, busyness, fatigue, and gratitude. I thank the senior students of our laboratory, 楊凱, 清乾, 正達, 昭偉, 懷三, 建程, 勝斌, 正崙, 偉欣, 明修, 昭弘, 佑維, and 育成, for their valuable advice and encouragement during my research, from which I benefited greatly. I am also grateful to my classmates 書峻, 信吉, 琮瑋, 堯文, 翔淵, 佩蓉, and 雅婷 for their companionship in study and in daily life; you filled my graduate years with color and joyful memories. Thanks as well to the junior students 文楷, 耀德, 珮筠, 冠霖, 子豪, 柏廷, 政揚, and 育瑋 for accompanying me through this unforgettable and wonderful time.
Finally, I thank my family, who supported me throughout, gave me warmth in difficult times, and cheered me on in the face of every obstacle. Thank you all!
CONTENTS
ABSTRACT
CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Review of Music Genre Classification Systems
1.2.1 Feature Extraction
1.2.1.1 Short-term Features
1.2.1.1.1 Timbral features
1.2.1.1.2 Rhythmic features
1.2.1.1.3 Pitch features
1.2.1.2 Long-term Features
1.2.1.2.1 Mean and standard deviation
1.2.1.2.2 Autoregressive model
1.2.1.2.3 Modulation spectrum analysis
1.2.1.2.4 Nonlinear time series analysis
1.2.2 Feature Transformation
1.2.2.1 Principal Component Analysis (PCA)
1.2.2.2 Linear Discriminant Analysis (LDA)
1.2.2.3 Nonparametric Discriminant Analysis (NDA)
1.2.3 Feature Classifier
1.3 Outline of Thesis
CHAPTER 2 THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM
2.1 Feature Extraction
2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
2.1.2 Octave-based Spectral Contrast (OSC)
2.1.3 Normalized Audio Spectral Envelope (NASE)
2.1.4 Modulation Spectral Analysis
2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)
2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)
2.1.4.3 Modulation Spectral Contrast of NASE (MASE)
2.1.5 Statistical Aggregation of Modulation Spectral Feature Values
2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)
2.1.5.2 Statistical Aggregation of MOSC (SMOSC)
2.1.5.3 Statistical Aggregation of MASE (SMASE)
2.1.6 Feature Vector Normalization
2.2 Linear Discriminant Analysis (LDA)
2.3 Nonparametric Discriminant Analysis (NDA)
2.4 Music Genre Classification Phase
CHAPTER 3 EXPERIMENT RESULTS
3.1 Comparison of different modulation spectral feature vectors
3.2 Comparison of different features using the kNN classifier
3.3 Comparison of multiple representations of each feature set
CHAPTER 4 CONCLUSION
REFERENCE
List of Tables
Table 2.1 The range of each triangular bandpass filter
Table 2.2 The range of each octave-scale bandpass filter (Sampling rate = 44.1 kHz)
Table 2.3 The range of each normalized audio spectral envelope bandpass filter
Table 2.4 Frequency interval of each modulation subband
Table 3.1 Classification accuracy (%) of different modulation spectral features using LDA or NDA as the classifier
Table 3.2 Classification accuracy (%) using kNN classifier with LDA
Table 3.3 Classification accuracy (%) using kNN classifier with NDA
Table 3.4 Classification accuracy (%) using LDA with multiple prototypes for each music class and α = 0.98
Table 3.5 Classification accuracy (%) using LDA with multiple prototypes for each music class and α = 0.99
Table 3.6 Classification accuracy (%) using NDA with multiple prototypes for each music class and α = 0.98
Table 3.7 Classification accuracy (%) using NDA with multiple prototypes for each music class and α = 0.99
List of Figures
Fig. 1.1 A hierarchical audio taxonomy [18]
Fig. 1.2 A hierarchical audio taxonomy [19]
Fig. 1.3 A music genre classification system
Fig. 2.1 The flowchart for computing MFCC
Fig. 2.2 The flowchart for computing OSC
Fig. 2.3 The flowchart for computing NASE
Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution
Fig. 2.5 The flowchart for extracting MMFCC
Fig. 2.6 The flowchart for extracting MOSC
Fig. 2.7 The flowchart for extracting MASE
Fig. 2.8 Aggregation of the row-based modulation spectral features
Fig. 2.9 Aggregation of the column-based modulation spectral features
Chapter 1 Introduction
1.1 Motivation
With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful to manage a vast number of music tracks when they are properly categorized in advance. In general, retail and online music stores organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers. However, determining the music genre of a music track by hand is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for automatic classification of unlabeled music tracks [1-11]. In this study, we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.
To classify the music genre of a given music track, discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance.
However, the improvement obtained this way is limited. In fact, employing an effective feature set has a much greater impact on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.
1.2 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has several merits: (1) People often prefer to search music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependence between music genres. Thus, hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves both the classification efficiency and accuracy. (3) The classification errors become more acceptable when using a taxonomy than with direct music genre classification, because the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.
Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and varying features can be employed at each branch point of the taxonomy.
Therefore, the hierarchical classification approach allows the managers to trace at which level the classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easier to merge leaf classes into the same parent class in a bottom-up structure, so the upper layers can be constructed easily. In their experimental results, the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5% in classification accuracy.
Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach for automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach reduces the time-consuming and expensive task of manual construction of taxonomies. It also helps when working with music collections for which no natural taxonomy exists [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature transformation, and classification. Fig. 1.3 shows the block diagram of a music genre classification system.
[Figure: taxonomy tree with root Music and node labels Classical, Non-classical; Chamber, Orchestral; Chamber with piano, Solo, String quartet, Other chamber ensembles; Symphonic, Orchestral with choir, Orchestral with soloist; Rock, Electronic/Pop, Jazz/Blues; Hard Rock, Soft Rock, Techno/Dance, Rap/Hip-Hop, Pop]
Fig. 1.1 A hierarchical audio taxonomy [18].
[Figure: taxonomy tree with root Music splitting into Classical, Pop/Rock, and Dance; intermediate nodes include Instrumental, Vocal, Organic, Electronic, Piano, Orchestra, Opera, Chorus, Rock, Country, Pop, Techno, Vocal, Percussion, Hip-Hop, Reggae, Jazz, and Latin; leaf genres include Soft rock, Hard rock, Heavy metal, Soft country, Dancing country, Cool, Easy listening, Fusion, Bebop, Swing, and Blues]
Fig. 1.2 A hierarchical audio taxonomy [19].
[Figure: block diagram in which training music and testing music each pass through feature extraction; the extracted features undergo feature selection/transformation, the training features populate a feature database, and the classification stage outputs the classified music genre]
Fig. 1.3 A music genre classification system.
1.2.1 Feature Extraction
1.2.1.1 Short-term Features
The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections in terms of the musical genres.
1.2.1.1.1 Timbral features
Timbral features are generally characterized by properties related to instrumentation or sound sources such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows.
(1) Low-energy Feature: it is defined as the percentage of analysis windows that have an RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.
(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as:

ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \left| \mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1]) \right|,

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
(3) Spectral Subband Centroid: the spectral subband centroid is defined as the center of gravity of the magnitude spectrum in each subband:

SSC_b = \frac{\sum_{n=il(b)}^{ih(b)} n \times M_b[n]}{\sum_{n=il(b)}^{ih(b)} M_b[n]},

where il(b) and ih(b) are the low and high frequency indices of the bth subband, and M_b[n] is the magnitude of the nth frequency bin of the bth subband.
(4) Spectral Bandwidth: the spectral bandwidth determines the frequency spread of the signal around its centroid:

SB_b = \left( \frac{\sum_{n=1}^{N} (n - SSC_b)^2 \times M_b[n]}{\sum_{n=1}^{N} M_b[n]} \right)^{1/2}.
(5) Spectral Rolloff: the spectral rolloff is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

\sum_{k=0}^{R_t} S_t[k] \le 0.85 \times \sum_{k=0}^{N-1} S_t[k],

where S_t[k] is the magnitude spectrum of the tth frame.
(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitude spectra of successive frames:

SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2,

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the tth frame and the (t-1)th frame, respectively.
(7) Spectral Subband Flatness: a high spectral flatness indicates that the spectrum has a uniform frequency distribution, whereas a low spectral flatness indicates that the spectrum is concentrated around a specific frequency. The spectral subband flatness can be computed as follows:

SSF_b = \frac{\left( \prod_{i=il(b)}^{ih(b)} M_b[i] \right)^{1/(ih(b)-il(b)+1)}}{\frac{1}{ih(b)-il(b)+1} \sum_{i=il(b)}^{ih(b)} M_b[i]},

where M_b[i] is the magnitude of the ith bin in the bth frequency subband, and il(b) and ih(b) are the low and high frequency indices of the bth band.

(8) Spectral Subband Energy: the spectral subband energy is defined as the log power of the magnitude of each frequency subband:

SSE_b = 10 \cdot \log_{10} \left( 1 + \sum_{i=il(b)}^{ih(b)} (M_b[i])^2 \right).
(9) Mel-Frequency Cepstral Coefficients: MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.
(10) Octave-based spectral contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.
(11) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then, each ASE coefficient is normalized by the root mean square (RMS) energy, yielding a normalized version of the ASE, called NASE.
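Several of the per-frame definitions above can be sketched compactly in NumPy. This is a simplified illustration: the helper names, the sum-to-one normalization used in the flux, and the toy inputs are our assumptions, not the thesis's exact implementation.

```python
import numpy as np

def zero_crossing_rate(frame):
    # sign(): 1 for positive input, 0 otherwise, as in definition (2)
    s = (np.asarray(frame) > 0).astype(int)
    return 0.5 * np.sum(np.abs(s[1:] - s[:-1]))

def spectral_rolloff(S, ratio=0.85):
    # smallest index R whose cumulative magnitude reaches `ratio` of the total
    cumulative = np.cumsum(np.asarray(S, dtype=float))
    return int(np.searchsorted(cumulative, ratio * cumulative[-1]))

def spectral_flux(mag_prev, mag_curr):
    # squared difference between sum-normalized magnitude spectra (definition (6))
    n_prev = mag_prev / np.sum(mag_prev)
    n_curr = mag_curr / np.sum(mag_curr)
    return float(np.sum((n_curr - n_prev) ** 2))

def subband_features(M, il, ih):
    # centroid, flatness, and log-energy of subband bins il..ih (inclusive)
    band = np.asarray(M[il:ih + 1], dtype=float)
    n = np.arange(il, ih + 1)
    centroid = np.sum(n * band) / np.sum(band)                 # SSC_b
    flatness = np.exp(np.mean(np.log(band))) / np.mean(band)   # SSF_b
    energy = 10.0 * np.log10(1.0 + np.sum(band ** 2))          # SSE_b
    return centroid, flatness, energy

# toy checks: an alternating signal crosses zero at every sample;
# a flat subband has flatness exactly 1
print(zero_crossing_rate(np.array([1.0, -1.0, 1.0, -1.0])))   # 1.5
print(spectral_rolloff(np.array([4.0, 3.0, 2.0, 1.0])))       # 2
c, f, e = subband_features(np.array([2.0, 2.0, 2.0, 2.0]), 0, 3)
print(round(f, 3))                                            # 1.0
```

Note that the flatness is the ratio of the geometric mean to the arithmetic mean of the subband magnitudes, which is exactly the fraction in definition (7).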
1.2.1.1.2 Rhythmic features
The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and sub-beats, and the relative strength of the sub-beats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.
1.2.1.1.3 Pitch features
Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval.
The pitch histogram can be estimated by multiple pitch detection techniques [21, 22].
Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1.2.1.2 Long-term Features
To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.
1.2.1.2.1 Mean and standard deviation
The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T denote the representative D-dimensional feature vector of the ith frame. The mean and standard deviation are calculated as follows:

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1,

\sigma[d] = \left( \frac{1}{T} \sum_{i=0}^{T-1} \left( x_i[d] - \mu[d] \right)^2 \right)^{1/2}, \quad 0 \le d \le D-1,

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationships between features or about the time-varying behavior of music signals.
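The mean and standard deviation aggregation described above amounts to two lines of NumPy. A minimal sketch, with a hypothetical toy frame matrix as input:

```python
import numpy as np

def aggregate_mean_std(frames):
    """frames: T x D matrix of short-term feature vectors x_i.
    Returns the 2D-dimensional long-term vector [mu; sigma]."""
    frames = np.asarray(frames, dtype=float)
    mu = frames.mean(axis=0)        # mu[d] over the T frames
    sigma = frames.std(axis=0)      # sigma[d], with the 1/T normalization used above
    return np.concatenate([mu, sigma])

# three frames of a 2-dimensional feature trajectory
F = np.array([[0.0, 1.0],
              [2.0, 1.0],
              [4.0, 1.0]])
v = aggregate_mean_std(F)
print(v[:2])  # [2. 1.]  (the per-dimension means)
```

As the text notes, this discards all temporal ordering: shuffling the rows of `F` leaves the result unchanged.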
1.2.1.2.2 Autoregressive model (AR model)
Meng et al. [9] used the AR model to analyze the time-varying texture of music signals.
They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model.
In MAR, all short-term features are modeled by a single MAR model. The difference between the MAR model and the AR model is that MAR considers the relationships between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a pth-order MAR model, the feature dimension is p×D×D, where D is the dimension of a short-term feature vector.
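A minimal numerical sketch of the DAR idea, where each feature dimension gets its own least-squares AR fit. The helper names, the fitting method, and the toy data are our assumptions; Meng et al. do not necessarily use this estimator.

```python
import numpy as np

def fit_ar_coefficients(series, p):
    """Least-squares AR(p) fit of one feature trajectory:
    series[t] ~ a_1*series[t-1] + ... + a_p*series[t-p]."""
    series = np.asarray(series, dtype=float)
    # lagged design matrix: column k holds the (k+1)-sample-delayed series
    X = np.column_stack([series[p - k - 1:len(series) - k - 1] for k in range(p)])
    y = series[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def dar_features(frames, p=2):
    """DAR-style aggregation: per-dimension mean, variance, and AR coefficients."""
    frames = np.asarray(frames, dtype=float)
    parts = [frames.mean(axis=0), frames.var(axis=0)]
    parts += [fit_ar_coefficients(frames[:, d], p) for d in range(frames.shape[1])]
    return np.concatenate(parts)

# a geometric series y[t] = 0.5 * y[t-1] is recovered exactly by AR(1)
series = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])
print(np.round(fit_ar_coefficients(series, 1), 6))  # [0.5]
```

For D feature dimensions and order p, `dar_features` returns 2D + pD values, in line with the per-dimension modeling described above.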
1.2.1.2.3 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract tempo features for music emotion classification.
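The core idea, taking an FFT along the time axis of a feature trajectory to expose slow temporal modulations, can be sketched as follows. This is a simplified illustration of the general principle, not the exact procedure of any of the cited works; the frame rate and the helper name are our assumptions.

```python
import numpy as np

def modulation_spectrum(trajectory, frame_rate):
    """Magnitude spectrum of one feature's trajectory over time.
    The frequency axis holds modulation frequencies in Hz (0 .. frame_rate/2)."""
    trajectory = np.asarray(trajectory, dtype=float)
    trajectory = trajectory - trajectory.mean()   # remove DC so slow modulations stand out
    spectrum = np.abs(np.fft.rfft(trajectory))
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / frame_rate)
    return freqs, spectrum

# a feature oscillating at 4 Hz, the modulation rate humans are most sensitive to
frame_rate = 100.0                                # frames per second (assumed)
t = np.arange(200) / frame_rate
feature = np.sin(2 * np.pi * 4.0 * t)
freqs, spec = modulation_spectrum(feature, frame_rate)
print(freqs[np.argmax(spec)])  # 4.0
```

Note that the modulation frequency range is bounded by half the frame rate, not half the audio sampling rate, which is why frame-level trajectories can resolve modulations of a few hertz.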
1.2.1.2.4 Nonlinear time series analysis
Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.
1.2.2 Feature Transformation
1.2.2.1 Principal Component Analysis (PCA)
PCA has been a widely used technique for dimensionality reduction [35]. PCA is
defined as the orthogonal projection of the data onto a lower dimensional vector space such that the variance of the projected data is maximized.
First, the D-dimensional mean vector \mu and the D×D covariance matrix \Sigma are computed from the set of D-dimensional training vectors X = \{x_j, j = 1, \ldots, N\}:

\mu = \frac{1}{N} \sum_{j=1}^{N} x_j,

\Sigma = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)(x_j - \mu)^T.

Second, the eigenvalues and corresponding eigenvectors of the covariance matrix are computed and sorted in decreasing order of the eigenvalues. Let the eigenvector v_i be associated with the eigenvalue \lambda_i, 1 \le i \le D. The first d eigenvectors having the largest eigenvalues form the columns of the D×d transformation matrix A_{PCA}:

A_{PCA} = [v_1, v_2, \ldots, v_d].

The number of selected eigenvectors d can be determined by finding the minimum integer d that satisfies the following criterion:

\frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{D} \lambda_j} \ge \alpha,

where \alpha determines what percentage of the information needs to be preserved. In this thesis, \alpha = 0.98 and \alpha = 0.99 are used. The projected vector can be computed according to the transformation matrix A_{PCA}:

x_{PCA} = A_{PCA}^T (x - \mu).
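The PCA procedure above can be sketched directly in NumPy. The function name and the toy data are ours; the eigenvalue-mass criterion with `alpha` follows the description above.

```python
import numpy as np

def pca_transform(X, alpha=0.98):
    """PCA as described above: eigen-decomposition of the covariance matrix,
    keeping the smallest d whose cumulative eigenvalue mass reaches alpha."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False, bias=True)   # (1/N) normalization
    eigvals, eigvecs = np.linalg.eigh(cov)          # returned in ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    d = int(np.searchsorted(ratio, alpha)) + 1      # minimum d with ratio >= alpha
    A = eigvecs[:, :d]                              # D x d transformation matrix
    return (X - mu) @ A, A, mu

# toy data: variance lies along a single direction, so d collapses to 1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[3.0, 0.1]])
Y, A, mu = pca_transform(X, alpha=0.98)
print(Y.shape[1])  # 1
```

Since `eigh` exploits the symmetry of the covariance matrix, the eigenvalues are real and the eigenvectors orthonormal, matching the assumptions of the derivation above.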
1.2.2.2 Linear Discriminant Analysis (LDA)
LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, music signals are too complex to be modeled well by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all classes, which does not consider class-wise differences.
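A textbook Fisher-LDA sketch of the idea, maximizing between-class scatter relative to within-class scatter. This is standard LDA, not the exact formulation deferred to Chapter 2; the helper name and toy data are ours.

```python
import numpy as np

def lda_transform(X, labels, d):
    """Classical Fisher LDA: keep the top-d eigenvectors of pinv(Sw) @ Sb."""
    X = np.asarray(X, dtype=float)
    overall_mean = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # within-class scatter
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)          # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]       # largest eigenvalues first
    A = eigvecs[:, order[:d]].real               # D x d transformation
    return X @ A, A

# two well-separated 2-D classes project onto one discriminant axis
X = np.vstack([np.random.default_rng(0).normal(0, 0.1, (20, 2)),
               np.random.default_rng(1).normal(0, 0.1, (20, 2)) + [3, 3]])
labels = np.array([0] * 20 + [1] * 20)
Y, A = lda_transform(X, labels, d=1)
print(Y.shape)  # (40, 1)
```

Note the single-Gaussian assumption in the code: each class contributes only its mean and scatter, which is exactly the limitation the text points out.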
1.2.2.3 Nonparametric Discriminant Analysis (NDA)
LDA is based on the assumption that each class is normally distributed. Thus, the recognition rate deteriorates if this assumption is not satisfied. Fukunaga proposed nonparametric discriminant analysis (NDA) [36] to overcome this non-normal distribution problem for the two-class classification task, defining a nonparametric between-class scatter matrix. In NDA, the within-class scatter matrix has the same form as in LDA; the main difference lies in the definition of the between-class scatter matrix. The detailed steps will be described in Chapter 2.
1.2.3 Feature Classifier
Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the subgenres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the subgenres are Big-Band, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that a GMM with three components achieves the best classification accuracy.
West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum and Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance of a single Gaussian classifier, a GMM with three components, and LDA, with and without a decision tree classifier. In their experiments, the GMM classifier combined with the decision tree classifier achieves the best accuracy of 82.79%.
Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters from the computed features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.
Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) with LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames that cannot be correctly classified, and the GMM model of a music genre is updated for each correctly classified frame. Moreover, an additional GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral rolloff, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genres: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian components.
Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and to extract features from the high-dissimilarity LDB nodes. First, they use wavelet packet tree decomposition to construct a five-level tree for a music signal. Then, two novel measures, the energy distribution over frequencies (D1) and the non-stationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is applied, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and non-human), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).
Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The DWPT is a variant of the DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.
Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.
1.3 Outline of Thesis
In Chapter 2, the proposed method for music genre classification will be introduced.
In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4.
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of four main modules:
feature extraction, principal component analysis (PCA), k-means clustering, and linear discriminant analysis (LDA) or nonparametric discriminant analysis (NDA). The classification phase consists of five modules: feature extraction, PCA transformation, k-means clustering, LDA (or NDA) transformation, and classification. A detailed description of each module is given below.
2.1 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.
2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps will be given below.
Step 1. Pre-emphasis:

\hat{s}[n] = s[n] - \hat{a} \times s[n-1], (1)

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for \hat{a} is 0.95.
Step 2. Framing:
Each music signal is divided into a set of overlapping frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.
Step 3. Windowing:
Each frame is multiplied by a Hamming window:
\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1, (2)

where the Hamming window function w[n] is defined as:

w[n] = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1. (3)

Step 4. Spectral Analysis:

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1, (4)

where k is the frequency index.

Step 5. Mel-scale Band-Pass Filtering:
The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1, (5)

where B is the total number of filters (B = 25 in this study), and I_{b,l} and I_{b,h} denote the low-frequency and high-frequency indices of the bth band-pass filter, respectively. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as:

I_{b,l} = f_{b,l} \, (N / f_s), \quad I_{b,h} = f_{b,h} \, (N / f_s), (6)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low and high frequencies of the bth band-pass filter, as shown in Table 2.1.
Step 6. Discrete Cosine Transform (DCT):

MFCC can be obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}(1 + E(b)) \cos\!\left( \frac{\pi \, l \, (b + 0.5)}{B} \right), \quad 0 \le l < L, (7)

where L is the length of the MFCC feature vector (L = 20 in this study).

Therefore, the MFCC feature vector can be represented as follows:

x_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T. (8)

Fig. 2.1 The flowchart for computing MFCC
Table 2.1 The range of each triangular bandpass filter
Index Low Freq. (Hz) Center Freq. (Hz) High Freq. (Hz)
Filter 1 0 100 200
Filter 2 100 200 300
Filter 3 200 300 400
Filter 4 300 400 500
Filter 5 400 500 600
Filter 6 500 600 700
Filter 7 600 700 800
Filter 8 700 800 900
Filter 9 800 900 1000
Filter 10 900 1000 1149
Filter 11 1000 1149 1320
Filter 12 1149 1320 1516
Filter 13 1320 1516 1741
Filter 14 1516 1741 2000
Filter 15 1741 2000 2297
Filter 16 2000 2297 2639
Filter 17 2297 2639 3031
Filter 18 2639 3031 3482
Filter 19 3031 3482 4000
Filter 20 3482 4000 4595
Filter 21 4000 4595 5278
Filter 22 4595 5278 6063
Filter 23 5278 6063 6964
Filter 24 6063 6964 8000
Filter 25 6964 8000 9190
(Fig. 2.1 flowchart: Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
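The windowing, FFT, Mel-band energy, and DCT steps above can be sketched as follows. This is a minimal illustration: the band index pairs and the random test frame are hypothetical stand-ins for Table 2.1 and a real audio frame, not the thesis' exact configuration.

```python
import numpy as np

def mfcc_frame(frame, band_indices, L=20):
    """Sketch of Steps 3-6 for one frame: Hamming window, FFT,
    Mel-band energies, and DCT of the log energies."""
    N = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Eq. (3)
    X = np.fft.fft(frame * w)                                      # Eq. (4)
    A = np.abs(X) ** 2                                             # squared amplitude
    E = np.array([A[lo:hi + 1].sum() for lo, hi in band_indices])  # Eq. (5)
    B = len(E)
    b = np.arange(B)
    # Eq. (7): DCT applied to log10(1 + E(b))
    return np.array([np.sum(np.log10(1 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])

# Usage with 25 dummy bands over a 1024-sample random frame (illustrative only)
frame = np.random.randn(1024)
bands = [(i * 20, i * 20 + 19) for i in range(25)]
coeffs = mfcc_frame(frame, bands)   # length-20 MFCC vector
```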
2.1.2 Octavebased Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, while spectral valleys correspond to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.
Step 1. Framing and Spectral Analysis:
An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.
Step 2. Octave Scale Filtering:
This spectrum is then divided into a number of subbands by the set of octave scale filters shown in Table 2.2. The octave scale filtering operation can be described as follows:
E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2-1, \qquad (9)
where B is the number of subbands, I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as:
I_{b,l} = f_{b,l} \cdot (N / f_s), \quad I_{b,h} = f_{b,h} \cdot (N / f_s), \qquad (10)
where fs is the sampling frequency, fb,l and fb,h are the low frequency and high frequency of the bth bandpass filter.
Step 3. Peak / Valley Selection:
Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:
Peak(b) = \log\!\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\right), \qquad (11)
Valley(b) = \log\!\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right), \qquad (12)
where \alpha is a neighborhood factor (\alpha = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:
SC(b) = Peak(b) - Valley(b). \qquad (13)
The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:
x_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T. \qquad (14)
Fig. 2.2 The flowchart for computing OSC
Table 2.2. The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)
Filter number Frequency interval (Hz)
0 [0, 0]
1 (0, 100]
2 (100, 200]
3 (200, 400]
4 (400, 800]
5 (800, 1600]
6 (1600, 3200]
7 (3200, 6400]
8 (6400, 12800]
9 (12800, 22050)
(Fig. 2.2 flowchart: Input Signal → Framing → FFT → Octave scale filtering → Peak/Valley Selection → Spectral Contrast → OSC)
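Eqs. (11)-(13) for a single subband can be sketched as follows. The random magnitude spectrum is a hypothetical stand-in for a real subband spectrum, and rounding \alpha N_b to the nearest integer is an assumption where the thesis does not specify the convention.

```python
import numpy as np

def osc_subband(mags, alpha=0.2):
    """Peak, valley, and spectral contrast for one subband's magnitude
    spectrum (Eqs. (11)-(13)); alpha = 0.2 as in the study."""
    m = np.sort(mags)[::-1]             # sort in decreasing order
    nb = len(m)
    k = max(1, int(round(alpha * nb)))  # number of bins averaged at each end (assumed rounding)
    peak = np.log(m[:k].mean())         # Eq. (11): mean of the alpha*Nb largest bins
    valley = np.log(m[-k:].mean())      # Eq. (12): mean of the alpha*Nb smallest bins
    return peak, valley, peak - valley  # Eq. (13): spectral contrast

# Usage with a random positive magnitude spectrum (illustrative only)
peak, valley, sc = osc_subband(np.abs(np.random.randn(64)) + 1e-6)
```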
2.1.3 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows:
Step 1. Framing and Spectral Analysis:
An input music signal is divided into a number of successive overlapped frames and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N, where N is the size of FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):
P(k) = \begin{cases} \dfrac{1}{N \cdot E_w} |X(k)|^2, & k = 0, \; k = N/2, \\[4pt] \dfrac{2}{N \cdot E_w} |X(k)|^2, & 0 < k < N/2, \end{cases} \qquad (15)
where E_w is the energy of the Hamming window function w(n) of size N_w:
E_w = \sum_{n=0}^{N_w-1} |w(n)|^2. \qquad (16)
Step 2. Subband Decomposition:
The power spectrum is divided into logarithmically spaced subbands spanning an 8-octave interval between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") (see Fig. 2.4). The NASE scale filtering operation can be described as follows (see Table 2.3):
E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2-1, \qquad (17)
where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16, r = 1/2 in the study):
r = 2^j \text{ octaves}, \quad -4 \le j \le 3. \qquad (18)
I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as:
I_{b,l} = f_{b,l} \cdot (N / f_s), \quad I_{b,h} = f_{b,h} \cdot (N / f_s), \qquad (19)
where fs is the sampling frequency, fb,l and fb,h are the low frequency and high frequency of the bth bandpass filter.
Step 3. Normalized Audio Spectral Envelope
The ASE coefficient for the bth subband is defined as the sum of power spectrum coefficients within this subband:
ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1. \qquad (20)
Each ASE coefficient is then converted to the decibel scale:
ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B+1. \qquad (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:
NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1, \qquad (22)
where the RMS-norm gain value R is defined as:
R = \sqrt{\sum_{b=0}^{B+1} (ASE_{dB}(b))^2}. \qquad (23)
In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame is represented as follows:
x_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T. \qquad (24)
Fig. 2.3 The flowchart for computing NASE
(Fig. 2.3 flowchart: Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)
Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; 1 coefficient below loEdge, 16 subband coefficients between loEdge and hiEdge, 1 coefficient above hiEdge)
Table 2.3. The range of each normalized audio spectral envelope band-pass filter
Filter number Frequency interval (Hz)
0 (0, 62]
1 (62, 88]
2 (88, 125]
3 (125, 176]
4 (176, 250]
5 (250, 353]
6 (353, 500]
7 (500, 707]
8 (707, 1000]
9 (1000, 1414]
10 (1414, 2000]
11 (2000, 2828]
12 (2828, 4000]
13 (4000, 5656]
14 (5656, 8000]
15 (8000, 11313]
16 (11313, 16000]
17 (16000, 22050]
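Eqs. (20)-(23) can be sketched as follows. The band index pairs and the random power spectrum are illustrative stand-ins for Table 2.3 and a real frame; only the subband-sum, dB conversion, and RMS normalization steps are shown.

```python
import numpy as np

def nase(power, band_indices):
    """Sketch of Eqs. (20)-(23): subband ASE, decibel conversion, and
    normalization by the RMS-norm gain value R."""
    ase = np.array([power[lo:hi + 1].sum() for lo, hi in band_indices])  # Eq. (20)
    ase_db = 10 * np.log10(ase)                                          # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                                     # Eq. (23)
    return ase_db / R, R                                                 # Eq. (22)

# Usage with a random power spectrum and 18 dummy bands (illustrative only)
p = np.abs(np.random.randn(512)) ** 2 + 1e-12
bands = [(i * 28, i * 28 + 27) for i in range(18)]
coeffs, R = nase(p, bands)
```

By construction the normalized coefficient vector has unit Euclidean norm, which is why R must be carried along as an extra feature component in Eq. (24).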
2.1.4 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only shortterm framebased characteristics of audio signals. In order to capture the timevarying behavior of the music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.
2.1.4.1 Modulation Spectral Analysis of MFCC (MMFCC)
To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.
Step 1. Framing and MFCC Extraction:
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.
Step 2. Modulation Spectrum Analysis:
Let MFCC_i(l) be the l-th MFCC feature value of the i-th frame, 0 \le l < L. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:
M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}(l) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L, \qquad (25)
where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:
\bar{M}_{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L, \qquad (26)
where T is the total number of texture windows in the music track.
Step 3. Contrast/Valley/Energy/Centroid/Flatness Determination:
The averaged modulation spectrum of each feature value will be decomposed into
J logarithmically spaced modulation subbands. In the study, the number of
modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP), modulation spectral valley (MSV), modulation spectral energy (MSE), modulation spectral centroid (MSCen), and modulation flatness (MSF) within each modulation subband are then evaluated:
MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}_{MFCC}(m, l), \qquad (27)
MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}_{MFCC}(m, l), \qquad (28)
MSE^{MFCC}(j, l) = 10 \cdot \log_{10}\!\left(1 + \sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} (\bar{M}_{MFCC}(m, l))^2\right), \qquad (29)
MSCen^{MFCC}(j, l) = \frac{\sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} m \times \bar{M}_{MFCC}(m, l)}{\sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} \bar{M}_{MFCC}(m, l)}, \qquad (30)
MSF^{MFCC}(j, l) = \frac{\left(\prod_{m=\Phi_{j,l}}^{\Phi_{j,h}} \bar{M}_{MFCC}(m, l)\right)^{1/(\Phi_{j,h} - \Phi_{j,l} + 1)}}{\dfrac{1}{\Phi_{j,h} - \Phi_{j,l} + 1} \sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} \bar{M}_{MFCC}(m, l)}, \qquad (31)
where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components; the MSEs express the power of each modulation subband, the MSCens indicate the mass center of the modulation spectrum, and the MSFs represent the modulation frequency distribution of a modulation subband. In addition, the difference between MSP and MSV reflects the modulation spectral contrast distribution:
MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l). \qquad (32)
As a result, all MSCs (MSVs, MSEs, MSCens, or MSFs) will form an L \times J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 5 \times 20 \times 8 = 800.
(Fig. 2.5 flowchart: Music signal → Framing → MFCC extraction → Windowing → DFT → Average Modulation Spectrum → Contrast/Valley/Energy/Centroid/Flatness Determination → MMFCC)
Fig. 2.5 The flowchart for extracting MMFCC
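The modulation spectral analysis of Steps 2 and 3 can be sketched as follows. The subband edges used here are illustrative rather than the Table 2.4 values, the input trajectories are random stand-ins for real MFCC features, and only the peak, valley, and contrast features (Eqs. (27), (28), and (32)) are shown.

```python
import numpy as np

def modulation_features(feat, W=512):
    """Sketch of Eqs. (25)-(28) and (32): modulation spectrogram of per-frame
    feature trajectories with 50% texture-window overlap, time averaging of
    magnitudes, then per-subband peak/valley/contrast. feat is (num_frames, L)."""
    T = (len(feat) - W) // (W // 2) + 1   # number of texture windows
    # Eqs. (25)-(26): FFT along time in each texture window, average magnitudes
    M = np.mean([np.abs(np.fft.fft(feat[t * W // 2: t * W // 2 + W], axis=0))
                 for t in range(T)], axis=0)          # shape (W, L)
    edges = [1, 2, 4, 8, 16, 32, 64, 128, 256]        # log-spaced edges (illustrative, J = 8)
    msp = np.array([M[lo:hi].max(axis=0)              # Eq. (27): subband peak
                    for lo, hi in zip(edges[:-1], edges[1:])])
    msv = np.array([M[lo:hi].min(axis=0)              # Eq. (28): subband valley
                    for lo, hi in zip(edges[:-1], edges[1:])])
    return msp, msv, msp - msv                        # Eq. (32): modulation spectral contrast

# Usage with random stand-in trajectories: 2048 frames, L = 20 MFCC values
feat = np.random.randn(2048, 20)
msp, msv, msc = modulation_features(feat)   # each is (J, L) = (8, 20)
```

Stacking MSP, MSV, MSE, MSCen, and MSF over all J subbands and L feature values gives the 5 × 20 × 8 = 800-dimensional MMFCC vector described above.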
2.1.4.2 Modulation Spectral Analysis of OSC (MOSC)
To observe the timevarying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC and the detailed steps will be described below.
Step 1. Framing and OSC Extraction:
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.
Step 2. Modulation Spectrum Analysis: