
中華大學 (Chung Hua University)
Master's Thesis

Automatic music genre classification using nonparametric discriminant analysis and modulation spectral feature analysis

Department: Master Program, Department of Computer Science and Information Engineering
Student: M09702041 方仁政
Advisor: Dr. 李建興

February 2011 (Republic of China year 100)

摘要 (Abstract)

This thesis uses modulation spectrum analysis to observe long-term variations of features and to extract features from these variations. First, a feature vector is extracted for each frame of the whole song (the per-frame features used in this thesis include MFCC, OSC, and MPEG-7 NASE). Modulation spectrum analysis is then applied to the variation of each feature across frames, and five modulation spectral features are extracted from the different modulation subbands: modulation spectral centroid, modulation spectral energy, modulation spectral contrast, modulation spectral flatness, and modulation spectral valley. In the experiments, the required features are extracted from the input test music signal, linear normalization and principal component analysis (PCA) are applied to reduce the feature dimension, and the features are then transformed by nonparametric discriminant analysis (NDA). The Euclidean distance between the test signal and each music genre is computed, and the genre with the smallest distance is taken as the classification result. The experimental results clearly show that the features obtained by modulation spectrum analysis outperform the conventional features formed by the mean and standard deviation vectors over all frames, and the best classification accuracy is 89.99%.

Keywords: music genre classification, modulation spectrum analysis, principal component analysis, nonparametric discriminant analysis

ABSTRACT

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. Since a general music database often contains millions of music tracks, it is very difficult to manage such a large digital music database. Therefore, it is helpful if the music tracks are properly categorized. In this paper, a novel feature set derived from modulation spectrum analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and normalized audio spectral envelope (NASE) is proposed for capturing the time-varying behavior of music signals. The modulation spectrum is decomposed into several logarithmically spaced modulation subbands.

For each modulation subband, five modulation spectral features are computed as the feature values: modulation spectral peak, modulation spectral valley, modulation spectral energy, modulation spectral centroid, and modulation flatness. Statistical aggregation of the modulation spectral features and principal component analysis (PCA) are employed to reduce the feature dimension. Linear discriminant analysis (LDA) or nonparametric discriminant analysis (NDA) is then employed to further reduce the feature dimension and improve the classification accuracy. The experimental results show that the proposed feature vector achieves better performance than that obtained by taking the mean and standard deviation operations.

Keywords: music genre classification, modulation spectrum analysis, principal component analysis, nonparametric discriminant analysis

Acknowledgements

Beyond giving me an understanding of the field of music recognition, my graduate studies taught me, above all, an attitude toward research: in addition to perseverance and determination, one needs curiosity about all things, enthusiasm, and a commitment to seeking truth from facts. With such an attitude, one can achieve breakthroughs in any field. The person who led me to this deeper understanding is my advisor, Professor 李建興. Under his careful guidance, he not only corrected my attitude toward research but also let me experience the joy of the moment a bottleneck is overcome. During the writing of this thesis, I sincerely thank him for reading and revising it in detail many times so that it could be completed smoothly. I would also like to thank Professors 石昭玲, 韓欽銓, 周智勳, and 連振昌 for their support, encouragement, and guidance in my coursework and reports. I offer them my deepest gratitude.

Graduate life has had its laughter, busyness, exhaustion, and moments of being moved. I also thank the senior members of our laboratory, 楊凱, 清乾, 正達, 昭偉, 懷三, 建程, 勝斌, 正崙, 偉欣, 明修, 昭弘, 佑維, and 育成, for their valuable advice and encouragement during my research, from which I benefited greatly. I thank my labmates 書峻, 信吉, 琮瑋, 堯文, 翔淵, 佩蓉, and 雅婷 for their companionship in study and in life and for sharing both the good times and the hard ones; you filled my graduate life with color and happy memories. I also thank the junior lab members 文楷, 耀德, 珮筠, 冠霖, 子豪, 柏廷, 政揚, and 育瑋 for accompanying me through an unforgettable and wonderful research life.

Finally, I thank my family for supporting me, giving me warmth and encouragement in difficult times so that I could face one challenge after another. Thank you!


CONTENTS

ABSTRACT………..ii

CONTENTS………..iv

CHAPTER 1………..1

INTRODUCTION………...1

1.1 Motivation………..1

1.2 Review of music genre classification system……….…1

1.2.1 Feature Extraction……….….4

1.2.1.1 Short-term features……….…4

1.2.1.1.1 Timbral features………4

1.2.1.1.2 Rhythmic features………6

1.2.1.1.3 Pitch features………7

1.2.1.2 Long-term features……….7

1.2.1.2.1 Mean and standard deviation………7

1.2.1.2.2 Autoregressive model………...7

1.2.1.2.3 Modulation spectrum analysis………..8

1.2.1.2.4 Nonlinear time series analysis………..8

1.2.2 Feature Transformation……….…………...8

1.2.2.1 Principal Component Analysis (PCA)….………...8

1.2.2.2 Linear Discriminant Analysis (LDA)………….………....9

1.2.2.3 Nonparametric Discriminant Analysis (NDA)………..10

1.2.3 Feature Classifier………..……10

1.3 Outline of Thesis………..……12

CHAPTER 2………13


THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM………...………13

2.1 Feature Extraction………13

2.1.1 Mel-Frequency Cepstral Coefficient (MFCC)……….13

2.1.2 Octave-based Spectral Contrast (OSC)………16

2.1.3 Normalized Audio Spectral Envelope (NASE)………18

2.1.4 Modulation Spectral Analysis………..21

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)…….…..21

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)…………...23

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)………25

2.1.5 Statistical Aggregation of Modulation Spectral Feature values…....27

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)……..……..29

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)……….…..31

2.1.5.3 Statistical Aggregation of MASE (SMASE)………33

2.1.6 Feature vector normalization………35

2.2 Linear Discriminant Analysis (LDA).…..………...36

2.3 Nonparametric Discriminant Analysis (NDA)………....……….…37

2.4 Music Genre Classification Phase………39

CHAPTER 3……….40

EXPERIMENT RESULTS………...40

3.1 Comparison of different modulation spectral feature vector………...…….40

3.2 Comparison of different features using kNN classifier……..………...41

3.3 Comparison of multiple representation of each feature set…..………41

CHAPTER 4……….49

CONCLUSION………49

REFERENCE………...50

List of Tables

Table 2.1 The range of each triangular band-pass filter………..15
Table 2.2 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)…17
Table 2.3 The range of each normalized audio spectral envelope band-pass filter……21
Table 2.4 Frequency interval of each modulation subband………27
Table 3.1 Classification accuracy (%) of different modulation spectral features using LDA or NDA as the classifier…….………...42
Table 3.2 Classification accuracy (%) using kNN classifier with LDA……….43
Table 3.3 Classification accuracy (%) using kNN classifier with NDA………44
Table 3.4 Classification accuracy (%) using LDA with multiple prototypes for each music class and α=0.98………...………...45
Table 3.5 Classification accuracy (%) using LDA with multiple prototypes for each music class and α=0.99……….46
Table 3.6 Classification accuracy (%) using NDA with multiple prototypes for each music class and α=0.98………..………...47
Table 3.7 Classification accuracy (%) using NDA with multiple prototypes for each music class and α=0.99………..……….……..48

List of Figures

Fig. 1.1 A hierarchical audio taxonomy [18]………...3

Fig. 1.2 A hierarchical audio taxonomy [19]………...3

Fig. 1.3 A music genre classification system………...4

Fig. 2.1 The flowchart for computing MFCC………15

Fig. 2.2 The flowchart for computing OSC………...17

Fig. 2.3 The flowchart for computing NASE………20

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution………20

Fig. 2.5 The flowchart for extracting MMFCC……….23

Fig. 2.6 The flowchart for extracting MOSC………25

Fig. 2.7 The flowchart for extracting MASE………...………..27

Fig. 2.8 Aggregation of the row-based modulation spectral features………28

Fig. 2.9 Aggregation of the column-based modulation spectral features………..…28


Chapter 1 Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, managing a vast amount of music tracks is much easier when they are properly categorized in advance. In general, retail or online music stores organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers. However, determining the music genre of a music track manually is laborious and time-consuming. Therefore, a number of supervised classification techniques have been developed for automatic classification of unlabeled music tracks [1-11]. In this study, we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance.

However, the improvement is usually limited. In fact, employing an effective feature set has a much greater impact on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) People often prefer to search for music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependence between the music genres. Thus, hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves classification efficiency and accuracy. (3) The classification errors become more acceptable by using a taxonomy than by direct music genre classification, since the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and different features can be employed at each branch point of the taxonomy.

Therefore, the hierarchical classification approach allows the managers to trace at which level the classification errors occur most frequently. Barbedo and Lopes [14] have also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchy was constructed bottom-up instead of top-down, because it is easier to merge leaf classes into the same parent class in a bottom-up construction, so the upper layers can be built easily. In their experiments, the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5% in classification accuracy.

Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach to automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach reduces the time-consuming and expensive task of manually constructing taxonomies. It is also helpful for music collections for which there is no natural taxonomy [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy [18].

Fig. 1.2 A hierarchical audio taxonomy [19].

Fig. 1.3 A music genre classification system.

1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections in terms of the musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to the instrumentation or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows (a small code sketch of several of these measures is given after the list).

(1) Low-energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of noisiness of the signal. It is defined as:

\[ \mathrm{ZCR}_t = \frac{1}{2}\sum_{n=1}^{N-1}\bigl|\operatorname{sign}(x_t[n]) - \operatorname{sign}(x_t[n-1])\bigr|, \]

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.

(3) Spectral Subband Centroid: spectral subband centroid is defined as the center of gravity of the magnitude spectrum in each subband:

\[ \mathrm{SSC}_b = \frac{\sum_{n=il(b)}^{ih(b)} n \times M_b[n]}{\sum_{n=il(b)}^{ih(b)} M_b[n]}, \]

where il(b) and ih(b) are the low-frequency and high-frequency indices of the b-th subband and M_b[n] is the magnitude of the n-th frequency bin of the b-th subband.

(4) Spectral Bandwidth: spectral bandwidth determines the frequency bandwidth of the signal.

\[ \mathrm{SB}_b = \left(\frac{\sum_{n=1}^{N} (n - \mathrm{SSC}_b)^2 \times M_b[n]}{\sum_{n=1}^{N} M_b[n]}\right)^{1/2}. \]

(5) Spectral Roll-off: spectral roll-off is a measure of spectral shape. It is defined as the frequency Rt below which 85% of the magnitude distribution is concentrated:

\[ \sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]. \]

(6) Spectral Flux: The spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

\[ \mathrm{SF}_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2, \]

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.

(7) Spectral Subband Flatness: A high spectral flatness indicates that the spectrum has a uniform frequency distribution, whereas a low spectral flatness indicates that the spectrum may concentrate around a specific frequency. The spectral subband flatness can be computed as follows:

\[ \mathrm{SSF}_b = \frac{\left(\prod_{i=il(b)}^{ih(b)} M_b[i]\right)^{\frac{1}{ih(b)-il(b)+1}}}{\frac{1}{ih(b)-il(b)+1}\sum_{i=il(b)}^{ih(b)} M_b[i]}, \]

where M_b[i] is the magnitude of the i-th bin in the b-th frequency subband, and il(b) and ih(b) are the low-frequency and high-frequency indices of the b-th subband.

(8) Spectral Subband Energy: spectral subband energy is defined as the power of the magnitude of each frequency subband:

\[ \mathrm{SSE}_b = 10\log_{10}\!\left(1 + \sum_{i=il(b)}^{ih(b)} \left(M_b[i]\right)^2\right). \]

(9) Mel-Frequency Cepstral Coefficients: MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.

(10) Octave-based spectral contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each sub-band separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(11) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the power spectrum in each logarithmic subband. Then, each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE, called NASE.
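As referenced above, the following is a minimal NumPy sketch of a few of the listed frame-level measures (zero-crossing rate, spectral roll-off, and spectral flux). The toy signal, FFT size, and the small epsilon guards are illustrative assumptions; only the 85% roll-off threshold and the sign-change counting follow the definitions in the text.

```python
# Illustrative sketch only; frame length, FFT size, and epsilon guards are assumptions.
import numpy as np

def zero_crossing_rate(frame):
    # Sign convention from the text: 1 for positive samples, 0 otherwise.
    signs = (frame > 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

def spectral_rolloff(mag, ratio=0.85):
    # Smallest bin index R such that the cumulative magnitude reaches 85% of the total.
    cumulative = np.cumsum(mag)
    return int(np.searchsorted(cumulative, ratio * cumulative[-1]))

def spectral_flux(mag, prev_mag):
    # Squared difference between successive normalized magnitude spectra.
    n_cur = mag / (np.sum(mag) + 1e-12)
    n_prev = prev_mag / (np.sum(prev_mag) + 1e-12)
    return float(np.sum((n_cur - n_prev) ** 2))

# Toy usage on two consecutive frames of a synthetic signal.
rng = np.random.default_rng(0)
frame_prev, frame_cur = rng.standard_normal(1024), rng.standard_normal(1024)
mag_prev, mag_cur = np.abs(np.fft.rfft(frame_prev)), np.abs(np.fft.rfft(frame_cur))
print(zero_crossing_rate(frame_cur), spectral_rolloff(mag_cur), spectral_flux(mag_cur, mag_prev))
```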

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and subbeats, the relative strength of subbeats to main beat.

Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval.

The pitch histogram can be estimated by multiple pitch detection techniques [21, 22].

Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include mean and standard deviation, autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], …, x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\[ \mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \qquad 0 \le d \le D-1, \]

\[ \sigma[d] = \left(\frac{1}{T}\sum_{i=0}^{T-1} \left(x_i[d] - \mu[d]\right)^2\right)^{1/2}, \qquad 0 \le d \le D-1, \]

where T is the number of frames of the input signal. This statistical method captures neither the relationship between features nor the time-varying behavior of music signals.
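A minimal sketch of this aggregation, assuming the T frame-level feature vectors have already been stacked into a T×D array (the toy data below is random):

```python
# Collapse T frame-level feature vectors of dimension D into one 2D-dimensional vector.
import numpy as np

def mean_std_aggregate(frame_features):
    """frame_features: array of shape (T, D); returns a (2*D,) long-term vector."""
    mu = frame_features.mean(axis=0)      # mu[d] in the text
    sigma = frame_features.std(axis=0)    # sigma[d] in the text (population std)
    return np.concatenate([mu, sigma])

T, D = 500, 20                            # e.g. 500 frames of 20 MFCC coefficients (toy sizes)
frames = np.random.default_rng(1).standard_normal((T, D))
print(mean_std_aggregate(frames).shape)   # (40,)
```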

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by a single MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension of the coefficients is p×D×D, where D is the dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Non-linear analysis of time series offers an alternative way to describe temporal structure, which is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The mean and standard deviations of the distances and angles in the phase space with an embedding dimension of two and unit time lag were used.

1.2.2 Feature Transformation

1.2.2.1 Principal Component Analysis (PCA)

PCA has been a widely used technique for dimensionality reduction [35]. PCA is defined as the orthogonal projection of the data onto a lower-dimensional vector space such that the variance of the projected data is maximized.

First, the D-dimensional mean vector and the D×D covariance matrix are computed for the set of D-dimensional training vectors X = {x_j, j = 1, …, N}:

\[ \boldsymbol{\mu} = \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}_j, \qquad \boldsymbol{\Sigma} = \frac{1}{N}\sum_{j=1}^{N} (\mathbf{x}_j - \boldsymbol{\mu})(\mathbf{x}_j - \boldsymbol{\mu})^{T}. \]

Second, the eigenvalues and corresponding eigenvectors of the covariance matrix are computed and sorted in decreasing order of the eigenvalues. Let the eigenvector v_i be associated with the eigenvalue λ_i, 1 ≤ i ≤ D. The first d eigenvectors having the largest eigenvalues form the columns of the D×d transformation matrix A_PCA:

\[ \mathbf{A}_{\mathrm{PCA}} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d]. \]

The number of selected eigenvectors d can be determined by finding the minimum integer that satisfies the following criterion:

\[ \frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{D} \lambda_j} \ge \alpha, \]

where α determines the percentage of information to be preserved. In this paper, α = 0.98 and α = 0.99 are used. The projected vector can then be computed according to the transformation matrix A_PCA:

\[ \mathbf{x}_{\mathrm{PCA}} = \mathbf{A}_{\mathrm{PCA}}^{T}(\mathbf{x} - \boldsymbol{\mu}). \]
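The PCA step above can be sketched as follows; the covariance estimate, eigen-decomposition, and the cumulative-eigenvalue rule for choosing d follow the equations above, while the toy training matrix and its dimensions are made up for illustration.

```python
# Sketch of PCA fitting and projection; toy data only.
import numpy as np

def pca_fit(X, alpha=0.98):
    """X: (N, D) training vectors. Returns (mean, A_pca) with A_pca of shape (D, d)."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False, bias=True)      # (1/N) sum (x - mu)(x - mu)^T
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1] # sort in decreasing order
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    d = int(np.searchsorted(cumulative, alpha)) + 1    # smallest d reaching the alpha ratio
    return mu, eigvecs[:, :d]

def pca_project(x, mu, A_pca):
    return A_pca.T @ (x - mu)                          # x_PCA = A_PCA^T (x - mu)

X = np.random.default_rng(2).standard_normal((200, 50))
mu, A = pca_fit(X, alpha=0.98)
print(A.shape, pca_project(X[0], mu, A).shape)
```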

1.2.2.2 Linear Discriminant Analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with discrimination between classes rather than representation of the individual classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix is used to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all the classes, which does not take class-wise differences into account.

1.2.2.3 Nonparametric Discriminant Analysis (NDA)

LDA is based on the assumption that each class is normally distributed. Thus, the recognition rate will deteriorate if this assumption is not satisfied. Fukunaga proposed nonparametric discriminant analysis (NDA) [36] to overcome this non-normal distribution problem for the two-class classification task, in which a nonparametric between-class scatter is defined. In NDA, the within-class scatter matrix has the same form as in LDA. The main difference lies in the definition of the between-class scatter matrix. The detailed steps will be described in Chapter 2.
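Since the detailed LDA/NDA steps are deferred to Chapter 2, the sketch below only illustrates the textbook scatter matrices behind the two methods for a two-class toy problem: the shared within-class scatter, the parametric between-class scatter of LDA, and a simple k-nearest-neighbor between-class scatter in the spirit of NDA (Fukunaga's weighting function is omitted). It is a generic illustration, not the exact formulation used in this thesis.

```python
# Generic two-class illustration of LDA/NDA scatter matrices; not the thesis formulation.
import numpy as np

def within_class_scatter(X, y):
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        Sw += diff.T @ diff
    return Sw

def lda_between_class_scatter(X, y):
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
    return Sb

def nda_between_class_scatter(X, y, k=3):
    """Replace each class mean by the local mean of the k nearest samples
    from the other class (two-class case), omitting the usual weighting."""
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        own, other = X[y == c], X[y != c]
        for x in own:
            dist = np.linalg.norm(other - x, axis=1)
            local_mean = other[np.argsort(dist)[:k]].mean(axis=0)
            d = (x - local_mean)[:, None]
            Sb += d @ d.T
    return Sb

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
print(within_class_scatter(X, y).shape, nda_between_class_scatter(X, y).shape)
```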

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their music classification system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system, in which a majority vote is taken to decide the final classification. The genres adopted in their music classification system are Rock, Classical, Heavy Metal, Drum and Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and the decision tree classifier achieves the best accuracy of 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their classification system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and the GMM model of a music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate) as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then, two novel features, the energy distribution over frequencies (D1) and nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high dissimilarity nodes. The experiment results show that when the LDB feature vector is combined with MFCC and by using LDA analysis, the average classification accuracy for the first level is 91% (artificial and natural sounds), for the second level is 99% (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech;

animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT, which is achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced.

In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, the conclusion will be given in Chapter 4.


Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of four main modules: feature extraction, principal component analysis (PCA), k-means clustering, and linear discriminant analysis (LDA) or nonparametric discriminant analysis (NDA). The classification phase consists of five modules: feature extraction, PCA transformation, k-means clustering, LDA (or NDA) transformation, and classification. Each module is described in detail below.
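A high-level sketch of how these modules could fit together in the classification phase is given below, assuming the long-term feature extraction of Section 2.1 and Euclidean nearest-prototype matching in the transformed space; the GenreModel container, the classify() helper, and the toy matrices are placeholders, not the thesis implementation.

```python
# Sketch under stated assumptions; placeholder structures only.
import numpy as np
from dataclasses import dataclass

@dataclass
class GenreModel:
    mean: np.ndarray       # mean vector used by the PCA projection
    A_pca: np.ndarray      # PCA transformation matrix (D x d1)
    A_da: np.ndarray       # LDA or NDA transformation matrix (d1 x d2)
    prototypes: dict       # genre -> array of prototype vectors (e.g. from k-means)

def classify(track_feature, model):
    """Project one long-term feature vector and return the nearest-prototype genre."""
    z = model.A_pca.T @ (track_feature - model.mean)       # PCA projection
    z = model.A_da.T @ z                                   # LDA / NDA projection
    best_genre, best_dist = None, np.inf
    for genre, protos in model.prototypes.items():
        dist = np.min(np.linalg.norm(protos - z, axis=1))  # Euclidean distance to prototypes
        if dist < best_dist:
            best_genre, best_dist = genre, dist
    return best_genre

# Toy usage with random numbers standing in for trained matrices.
rng = np.random.default_rng(0)
model = GenreModel(mean=rng.standard_normal(40),
                   A_pca=rng.standard_normal((40, 10)),
                   A_da=rng.standard_normal((10, 5)),
                   prototypes={g: rng.standard_normal((3, 5))
                               for g in ["classical", "rock", "jazz"]})
print(classify(rng.standard_normal(40), model))
```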

2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps will be given below.

Step 1. Pre-emphasis:

\[ \hat{s}[n] = s[n] - \hat{a}\times s[n-1], \qquad (1) \]
where s[n] is the current sample and s[n−1] is the previous sample; a typical value for \hat{a} is 0.95.

Step 2. Framing:

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3. Windowing:

Each frame is multiplied by a Hamming window:

\[ \tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \qquad 0 \le n \le N-1, \qquad (2) \]
where the Hamming window function w[n] is defined as:
\[ w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1. \qquad (3) \]
Step 4. Spectral Analysis:

−1. (3) Step 4. Spectral Analysis:

Take the discrete Fourier transform of each frame using FFT:

, ]

~[ ] [

1

0

2

=

=N n

Nn j k

i

i

k s n e

X

π 0≤kN−1, (4) where k is the frequency index.

Step 5. Mel-scale Band-Pass Filtering:

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

\[ E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \qquad 0 \le b < B,\ 0 \le k \le N/2-1, \qquad (5) \]
where B is the total number of filters (B is 25 in this study), and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as:
\[ I_{b,l} = f_{b,l} N / f_s, \qquad I_{b,h} = f_{b,h} N / f_s, \qquad (6) \]
where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6. Discrete cosine transform (DCT):

MFCC can be obtained by applying DCT on the logarithm of E(b):

\[ \mathrm{MFCC}_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi\, l\,(b+0.5)}{B}\right), \qquad 0 \le l < L, \qquad (7) \]
where L is the length of the MFCC feature vector (L is 20 in this study).

Therefore, the MFCC feature vector can be represented as follows:
\[ \mathbf{x}_{\mathrm{MFCC}} = [\mathrm{MFCC}(0), \mathrm{MFCC}(1), \ldots, \mathrm{MFCC}(L-1)]^{T}. \qquad (8) \]
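A compact NumPy sketch of Steps 1-6 is given below. The band edges follow Table 2.1 and, as in Eq. (5), each band energy is a plain sum of squared FFT magnitudes between the band's low and high frequencies (no triangular weighting is applied in this simplified version); the frame length, hop size, and toy input are illustrative assumptions.

```python
# Simplified MFCC extraction following Steps 1-6; frame/hop sizes are assumptions.
import numpy as np

def mfcc_of_signal(s, fs, frame_len=1024, hop=512, L=20, a=0.95, B=25):
    # Step 1: pre-emphasis  s_hat[n] = s[n] - a * s[n-1]
    s = np.append(s[0], s[1:] - a * s[:-1])
    # Band edges (Hz) from Table 2.1: 100 Hz-wide bands up to 1 kHz, then log-spaced.
    low = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1149, 1320,
           1516, 1741, 2000, 2297, 2639, 3031, 3482, 4000, 4595, 5278, 6063, 6964]
    high = [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1149, 1320, 1516, 1741,
            2000, 2297, 2639, 3031, 3482, 4000, 4595, 5278, 6063, 6964, 8000, 9190]
    window = np.hamming(frame_len)                       # Step 3: Hamming window
    mfccs = []
    for start in range(0, len(s) - frame_len + 1, hop):  # Step 2: overlapped frames
        frame = s[start:start + frame_len] * window
        A = np.abs(np.fft.rfft(frame)) ** 2              # Step 4: squared magnitude spectrum
        E = np.empty(B)
        for b in range(B):                               # Step 5: band energies, Eq. (5)
            k_lo = int(low[b] * frame_len / fs)
            k_hi = int(high[b] * frame_len / fs)
            E[b] = A[k_lo:k_hi + 1].sum()
        logE = np.log10(1.0 + E)
        l = np.arange(L)[:, None]                        # Step 6: DCT, Eq. (7)
        b_idx = np.arange(B)[None, :]
        mfccs.append((logE * np.cos(np.pi * l * (b_idx + 0.5) / B)).sum(axis=1))
    return np.array(mfccs)                               # shape: (num_frames, L)

# Toy usage: 1 second of noise at 22.05 kHz.
x = np.random.default_rng(4).standard_normal(22050)
print(mfcc_of_signal(x, fs=22050).shape)
```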


Fig. 2.1 The flowchart for computing MFCC

Table 2.1 The range of each triangular band-pass filter

Index Low Freq. (Hz) Center Freq. (Hz) High Freq. (Hz)

Filter 1 0 100 200

Filter 2 100 200 300

Filter 3 200 300 400

Filter 4 300 400 500

Filter 5 400 500 600

Filter 6 500 600 700

Filter 7 600 700 800

Filter 8 700 800 900

Filter 9 800 900 1000

Filter 10 900 1000 1149

Filter 11 1000 1149 1320

Filter 12 1149 1320 1516

Filter 13 1320 1516 1741

Filter 14 1516 1741 2000

Filter 15 1741 2000 2297

Filter 16 2000 2297 2639

Filter 17 2297 2639 3031

Filter 18 2639 3031 3482

Filter 19 3031 3482 4000

Filter 20 3482 4000 4595

Filter 21 4000 4595 5278

Filter 22 4595 5278 6063

Filter 23 5278 6063 6964

Filter 24 6063 6964 8000

Filter 25 6964 8000 9190


2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.

Step 1. Framing and Spectral Analysis:

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2. Octave Scale Filtering:

This spectrum is then divided into a number of subbands by the set of octave scale filters shown in Table 2.2. The octave scale filtering operation can be described as follows:

\[ E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \qquad 0 \le b < B,\ 0 \le k \le N/2-1, \qquad (9) \]
where B is the number of subbands, and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as:
\[ I_{b,l} = f_{b,l} N / f_s, \qquad I_{b,h} = f_{b,h} N / f_s, \qquad (10) \]
where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3. Peak / Valley Selection:

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

\[ \mathrm{Peak}(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right), \qquad (11) \]

\[ \mathrm{Valley}(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right), \qquad (12) \]

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

\[ \mathrm{SC}(b) = \mathrm{Peak}(b) - \mathrm{Valley}(b). \qquad (13) \]

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

\[ \mathbf{x}_{\mathrm{OSC}} = [\mathrm{Valley}(0), \ldots, \mathrm{Valley}(B-1), \mathrm{SC}(0), \ldots, \mathrm{SC}(B-1)]^{T}. \qquad (14) \]
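The sketch below illustrates the peak/valley computation of Eqs. (11)-(13) for a single frame, using the octave bands of Table 2.2 and the neighborhood factor α = 0.2 from the text; the FFT size, the band-edge rounding, and the epsilon guards inside the logarithms are illustrative assumptions.

```python
# Per-frame OSC sketch; band-edge rounding and epsilon guards are assumptions.
import numpy as np

OCTAVE_BANDS_HZ = [(0, 0), (0, 100), (100, 200), (200, 400), (400, 800),
                   (800, 1600), (1600, 3200), (3200, 6400), (6400, 12800),
                   (12800, 22050)]                       # Table 2.2 (fs = 44.1 kHz)

def osc_frame(frame, fs, alpha=0.2):
    M = np.abs(np.fft.rfft(frame))                       # magnitude spectrum of the frame
    n_fft = len(frame)
    valleys, contrasts = [], []
    for f_lo, f_hi in OCTAVE_BANDS_HZ:
        k_lo = int(np.ceil(f_lo * n_fft / fs))
        k_hi = min(int(f_hi * n_fft / fs), len(M) - 1)
        band = np.sort(M[k_lo:k_hi + 1])[::-1]           # M_{b,1} >= ... >= M_{b,Nb}
        n_alpha = max(1, int(round(alpha * len(band))))  # number of bins averaged
        peak = np.log(band[:n_alpha].mean() + 1e-12)     # Eq. (11)
        valley = np.log(band[-n_alpha:].mean() + 1e-12)  # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                  # Eq. (13)
    return np.array(valleys + contrasts)                 # x_OSC, Eq. (14)

frame = np.random.default_rng(5).standard_normal(2048)
print(osc_frame(frame, fs=44100).shape)                  # 2 * number of subbands
```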

Fig. 2.2 The flowchart for computing OSC

Table 2.2 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number Frequency interval (Hz)

0 [0, 0]

1 (0, 100]

2 (100, 200]

3 (200, 400]

4 (400, 800]

5 (800, 1600]

6 (1600, 3200]

7 (3200, 6400]

8 (6400, 12800]

9 (12800, 22050)


2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows:

Step 1. Framing and Spectral Analysis:

An input music signal is divided into a number of successive overlapped frames and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N, where N is the size of FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

\[ P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ N/2, \\[2mm] \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2, \end{cases} \qquad (15) \]

where E_w is the energy of the Hamming window function w(n) of size N_w:
\[ E_w = \sum_{n=0}^{N_w-1} |w(n)|^2. \qquad (16) \]

Step 2. Subband Decomposition:

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz (“loEdge”) and 16 kHz (“hiEdge”) over a range of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):

\[ E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \qquad 0 \le b < B,\ 0 \le k \le N/2-1, \qquad (17) \]
where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):
\[ r = 2^{j}\ \text{octaves}, \qquad -4 \le j \le 3. \qquad (18) \]
I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as:
\[ I_{b,l} = f_{b,l} N / f_s, \qquad I_{b,h} = f_{b,h} N / f_s, \qquad (19) \]
where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3. Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband:

\[ \mathrm{ASE}(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \qquad 0 \le b \le B+1. \qquad (20) \]
Each ASE coefficient is then converted to the decibel scale:

\[ \mathrm{ASE}_{\mathrm{dB}}(b) = 10\log_{10}\!\left(\mathrm{ASE}(b)\right), \qquad 0 \le b \le B+1. \qquad (21) \]

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value, R:

\[ \mathrm{NASE}(b) = \frac{\mathrm{ASE}_{\mathrm{dB}}(b)}{R}, \qquad 0 \le b \le B+1, \qquad (22) \]

where the RMS-norm gain value R is defined as:

\[ R = \sqrt{\sum_{b=0}^{B+1} \left(\mathrm{ASE}_{\mathrm{dB}}(b)\right)^{2}}. \qquad (23) \]

In MPEG-7, the ASE coefficients consist of one coefficient representing the power between 0 Hz and loEdge, a series of coefficients representing the power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing the power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame is represented as follows:

\[ \mathbf{x}_{\mathrm{NASE}} = [R, \mathrm{NASE}(0), \mathrm{NASE}(1), \ldots, \mathrm{NASE}(B+1)]^{T}. \qquad (24) \]
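A compact sketch of Eqs. (20)-(24): sum the power spectrum inside each subband of Table 2.3, convert to the decibel scale, and normalize by the RMS gain R. The power-spectrum input here is a toy array rather than the windowed spectrum of Eq. (15), and the epsilon inside the logarithm is an added guard.

```python
# NASE-per-frame sketch; toy power spectrum and epsilon guard are assumptions.
import numpy as np

NASE_BANDS_HZ = [(0, 62), (62, 88), (88, 125), (125, 176), (176, 250),
                 (250, 353), (353, 500), (500, 707), (707, 1000), (1000, 1414),
                 (1414, 2000), (2000, 2828), (2828, 4000), (4000, 5656),
                 (5656, 8000), (8000, 11313), (11313, 16000), (16000, 22050)]  # Table 2.3

def nase_frame(power_spectrum, fs, n_fft):
    ase = []
    for f_lo, f_hi in NASE_BANDS_HZ:                      # Eq. (20): sum per subband
        k_lo = int(np.ceil(f_lo * n_fft / fs))
        k_hi = min(int(f_hi * n_fft / fs), len(power_spectrum) - 1)
        ase.append(power_spectrum[k_lo:k_hi + 1].sum())
    ase_db = 10.0 * np.log10(np.array(ase) + 1e-12)       # Eq. (21): decibel scale
    R = np.sqrt(np.sum(ase_db ** 2))                      # Eq. (23): RMS-norm gain
    return np.concatenate([[R], ase_db / R])              # Eq. (24): [R, NASE(0..B+1)]

n_fft, fs = 2048, 44100
P = np.abs(np.fft.rfft(np.random.default_rng(6).standard_normal(n_fft))) ** 2
print(nase_frame(P, fs, n_fft).shape)                     # B + 3 coefficients
```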

Fig. 2.3 The flowchart for computing NASE

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2

Table 2.3 The range of each normalized audio spectral envelope band-pass filter
Filter number Frequency interval (Hz)

0 (0, 62]

1 (62, 88]

2 (88, 125]

3 (125, 176]

4 (176, 250]

5 (250, 353]

6 (353, 500]

7 (500, 707]

8 (707, 1000]

9 (1000, 1414]

10 (1414, 2000]

11 (2000, 2828]

12 (2828, 4000]

13 (4000, 5656]

14 (5656, 8000]

15 (8000, 11313]

16 (11313, 16000]

17 (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of the music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Analysis of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied on the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1. Framing and MFCC Extraction:

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2. Modulation Spectrum Analysis:

Let MFCC_i[l] be the l-th MFCC feature value of the i-th frame, 0 ≤ l < L. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

\[ M_t(m, l) = \sum_{n=0}^{W-1} \mathrm{MFCC}_{(W/2)t+n}[l]\; e^{-j 2\pi m n / W}, \qquad 0 \le m < W,\ 0 \le l < L, \qquad (25) \]

where M_t(m, l) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512 frames, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\[ M_{\mathrm{MFCC}}(m, l) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, l)\right|, \qquad 0 \le m < W,\ 0 \le l < L, \qquad (26) \]

where T is the total number of texture windows in the music track.

Step 3. Contrast/Valley/Energy/Centroid/Flatness Determination:

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP), modulation spectral valley (MSV), modulation spectral energy (MSE), modulation spectral centroid (MSCen), and modulation flatness (MSF) within each modulation subband are then evaluated:

\[ \mathrm{MSP}_{\mathrm{MFCC}}(j, l) = \max_{\Phi_{j,l}\le m<\Phi_{j,h}} M_{\mathrm{MFCC}}(m, l), \qquad (27) \]

\[ \mathrm{MSV}_{\mathrm{MFCC}}(j, l) = \min_{\Phi_{j,l}\le m<\Phi_{j,h}} M_{\mathrm{MFCC}}(m, l), \qquad (28) \]

\[ \mathrm{MSE}_{\mathrm{MFCC}}(j, l) = 10\log_{10}\!\left(1 + \sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} \left(M_{\mathrm{MFCC}}(m, l)\right)^{2}\right), \qquad (29) \]

\[ \mathrm{MSCen}_{\mathrm{MFCC}}(j, l) = \frac{\sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} m\, M_{\mathrm{MFCC}}(m, l)}{\sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} M_{\mathrm{MFCC}}(m, l)}, \qquad (30) \]

\[ \mathrm{MSF}_{\mathrm{MFCC}}(j, l) = \frac{\left(\prod_{m=\Phi_{j,l}}^{\Phi_{j,h}} M_{\mathrm{MFCC}}(m, l)\right)^{\frac{1}{\Phi_{j,h}-\Phi_{j,l}+1}}}{\frac{1}{\Phi_{j,h}-\Phi_{j,l}+1}\sum_{m=\Phi_{j,l}}^{\Phi_{j,h}} M_{\mathrm{MFCC}}(m, l)}, \qquad (31) \]

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, the MSVs to the non-rhythmic components, the MSEs express the power of each modulation subband, the MSCens indicate the mass center of the modulation spectrum, and the MSFs represent the flatness of the modulation frequency distribution within a modulation subband. In addition, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

\[ \mathrm{MSC}_{\mathrm{MFCC}}(j, l) = \mathrm{MSP}_{\mathrm{MFCC}}(j, l) - \mathrm{MSV}_{\mathrm{MFCC}}(j, l). \qquad (32) \]

As a result, all MSCs (and likewise the MSVs, MSEs, MSCens, and MSFs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 5×20×8 = 800.
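The sketch below walks through the modulation-spectrum computation for a single feature trajectory (for example one MFCC coefficient over all frames): FFT over texture windows of W = 512 frames with 50% overlap as in Eqs. (25)-(26), followed by the per-subband peak, valley, energy, centroid, and flatness of Eqs. (27)-(31). The logarithmically spaced subband edges are illustrative rather than those of Table 2.4, and the flatness is computed as the usual geometric-to-arithmetic mean ratio.

```python
# Modulation spectral features for one feature trajectory; subband edges are illustrative.
import numpy as np

def modulation_features(feature_traj, W=512, J=8):
    """feature_traj: 1-D array, one feature value per frame (e.g. one MFCC coefficient)."""
    hop = W // 2                                            # 50% texture-window overlap
    spectra = []
    for start in range(0, len(feature_traj) - W + 1, hop):
        seg = feature_traj[start:start + W]
        spectra.append(np.abs(np.fft.rfft(seg)))            # Eq. (25), magnitude spectrum
    M = np.mean(spectra, axis=0)                            # Eq. (26), time average
    # Illustrative logarithmically spaced subband edges over the modulation bins.
    edges = np.unique(np.geomspace(1, len(M) - 1, J + 1).round().astype(int))
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = M[lo:hi + 1]
        msp, msv = band.max(), band.min()                   # Eqs. (27), (28)
        mse = 10 * np.log10(1 + np.sum(band ** 2))          # Eq. (29)
        mscen = np.sum(np.arange(lo, hi + 1) * band) / np.sum(band)              # Eq. (30)
        msf = np.exp(np.mean(np.log(band + 1e-12))) / (np.mean(band) + 1e-12)    # Eq. (31)
        feats.extend([msp - msv, msv, mse, mscen, msf])     # MSC, MSV, MSE, MSCen, MSF
    return np.array(feats)

traj = np.random.default_rng(7).standard_normal(4096)       # e.g. MFCC(0) over 4096 frames
print(modulation_features(traj).shape)                      # 5 features per subband
```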


Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Analysis of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC and the detailed steps will be described below.

Step 1. Framing and OSC Extraction:

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2. Modulation Spectrum Analysis:
