Music Emotion Analysis - Related Work - Construction of a Personalized Music Emotion Response Prediction System

Chapter 2 Related Work

2.2 Music Emotion Analysis

Here we review three papers that are related to music emotion.

In [11], Kuo et al. proposed an emotion-based music recommendation system built on film music. A total of 107 film-music MIDI files from 20 animated films, including productions of Disney, Studio Ghibli and DreamWorks, were selected for the experiments. The music attributes included melody, rhythm and tempo, and the emotion model proposed by Reilly [22] was used to describe music emotion; its 30 kinds of emotion were grouped into 15 groups. The relationship between music features and emotions in the training data was discovered by a graph-based approach named Mixed Media Graph (MMG), which was further refined into the Music Affinity Graph to suit music recommendation. At application time, the user queried for a kind of music, and the system ranked all the music stored in the database by the music features related to the queried kind of emotion. The ten top-scoring pieces were recommended to the user. The average score of the recommendation results reached 85%.

In [18], Lu et al. focused on acoustic music data. Three feature sets (intensity, timbre and rhythm) were extracted by digital signal processing techniques to represent the characteristics of a music clip. A total of 800 representative 20-second clips were selected from 250 pieces of music, and Thayer’s two-dimensional model of mood, with the four moods Exuberance, Anxious/Frantic, Contentment and Depression, was adopted to describe music emotion. Three experts took part in the experiment and annotated the music in the database. A Gaussian mixture model (GMM) was first used to model each feature set. The resulting model classified music into the four moods and was then used to detect the mood across an entire music clip, distinguishing the sub-clips that carry different moods. The precision reported in the experiment is about 81.5%.

As described above, these systems produce only a single music classification or recommendation model, which implicitly assumes that everybody has the same emotional response to a given piece of music. Such models therefore cannot represent the situation in which listeners with different backgrounds respond to the same music with different emotions.

In [25], 18 music features, including pitch, interval, tempo, loudness, note density and tonality, were used to represent MIDI music files, and a modified Thayer’s emotion model with six kinds of emotion (Joyous, Robust, Restless, Lyrical, Sober and Gloomy) was used. A Support Vector Machine (SVM) was used to classify the music data. In the experiment, 20 listeners with and without musical backgrounds labeled the music data. Wang took into account that listeners with different backgrounds may have different emotional responses to music, and therefore trained a separate music classification model for each of the 20 listeners. These 20 sets of music emotion classification rules were then used to predict each listener’s emotional response to music.

This research predicts emotion in a user-adaptive way, but every listener needs a trained prediction model before any prediction can be made. This overhead makes the method difficult to apply to everybody. Although listeners with different backgrounds may have different emotional responses to music, those with similar backgrounds may respond similarly, so the prediction rules and knowledge learned from existing listeners could be reused to predict the emotions of new listeners.

To overcome the drawbacks of the previous research, the fact that listeners with different backgrounds may have different emotional responses to music should be taken into consideration, and the grouping effect in people’s emotional responses to music should be modeled. We propose an analysis procedure that consists of data mining techniques; it is described in detail in the following sections.

Chapter 3 Personalized Music Emotion Prediction (P-MEP)

For the same music, listeners with different user profiles may respond with different emotions, while some listeners respond with similar emotions. Generally speaking, music with different music patterns may also evoke different emotions in listeners. To model the grouped emotional responses of listeners and to find the relationship between music patterns and listeners’ music emotions, an analysis procedure for Personalized Music Emotion Prediction (P-MEP) is proposed.

3.1 Problem Definition

To analyze listeners’ music emotions together with music patterns and listener backgrounds, we make two assumptions. First, for the same music, listeners with different user profiles may have different emotional responses. Second, music with different music patterns or different music attribute values may evoke different emotions in listeners.

Two terms are used in the construction of the prediction model:

1) User Emotion Group (UEG): a group of listeners who have similar emotional responses to music. According to experts’ suggestions, the total number of user emotion groups could be three to five.

2) Representative Music: music, selected by an expert, to which listeners’ emotional responses differ. Listeners are therefore separated into different user emotion groups according to their emotional responses to the representative music.

The inputs of the problem are user profiles, music attributes and music emotions. The problem is to analyze music emotion with respect to both listeners’ differences and music attributes. The output is a set of personalized music emotion prediction rules.

3.2 Feature Selection

In this section, three kinds of input data including user profiles, music attributes and music emotion will be described in detail.

(1) User Profiles:

The user profiles, including ‘Gender’, ‘Age’, ‘Education Status’, ‘Job’, ‘Geographical Location’, ‘Constellation’ and some ‘Personality’ [28] indices, are defined as the user profiles vector in Definition 3.1 to represent a listener’s background.

The user profile features were selected based on experts’ suggestions and [28], as attributes that may affect music emotion.

Definition 3.1: User Profiles Vector

Listener: L (Background, Personality)_i denotes the i-th listener, and it has 2 categories containing 24 tuples.

․Background = <GN, AG, ES, CS, GL> denotes the background of listener

․Personality = <A, B, C, …, R, S> denotes the personality of listener

The value of each user profile attribute is listed in Table 3.1.

Table 3.1: The User Profiles of a Listener

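Category | Attribute | Value
Background | Gender (GN) | M: Male, F: Female
Background | Education Status (ES) | E: Elementary, J: Junior, S: Senior, U: Undergraduate, M: Master, D: Doctor
Background | Constellation |
Personality | C | 1: Like physical exercise, 2: Like brainstorming
Personality | D | 1: Expansive, 2: Shy
Personality | E | 1: Unambitious, 2: Ambitious
Personality | F | 1: Confident, 2: Unconfident
Personality | G | 1: Self-centered, 2: Sensitive to others
Personality | H | 1: Pedantic, 2: Creative
Personality | I | 1: Patient, 2: Impatient
Personality | J | 1: Despotic, 2: Like to be led
Personality | K | 1: Amiable, 2: Severe
Personality | L | 1: Mild, 2: Combative
Personality | M | 1: Unpredictable tempered, 2: Stable tempered
Personality | N | 1: Serious, 2: Humorous
Personality | O | 1: Cold, 2: Enthusiastic
Personality | P | 1: Responsible, 2: Irresponsible
Personality | Q | 1: Moderate, 2: Irritable
Personality | R | 1: Obstinate, 2: Like to be guided
Personality | S | 1: Self-restraint, 2: Easy-going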

(2) Music Attributes:

Music is often divided into three categories (monophonic, homophonic and polyphonic) according to the amount of concurrency present. In monophonic music only one note sounds at a time. In homophonic music multiple notes may sound at once, but all notes start and finish at the same time. Polyphonic music is the most general form, in which multiple notes may sound independently. We are concerned with polyphonic music and choose the track with the highest pitch density as the main track [4], because the composer uses the most distinct pitches in the main track. The pitch density is defined in Equation 3.1.

Equation 3.1: Pitch Density of Music [4]

$$\text{Pitch density} = \frac{NP}{AP}$$

․NP is the number of distinct pitches in the track

․AP is the number of all distinct pitches in MIDI standard, i.e. 128 [27]
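The main-track selection by pitch density can be sketched as follows. This is only an illustration under stated assumptions: each track is assumed to be already parsed into a list of MIDI note numbers, and the track names and note values are made up for the example.

```python
MIDI_PITCH_COUNT = 128  # AP: number of distinct pitches in the MIDI standard


def pitch_density(pitches):
    """Equation 3.1: NP / AP, where NP is the number of distinct pitches."""
    return len(set(pitches)) / MIDI_PITCH_COUNT


def choose_main_track(tracks):
    """Return the name of the track with the highest pitch density."""
    return max(tracks, key=lambda name: pitch_density(tracks[name]))


# Hypothetical parsed tracks: track name -> MIDI note numbers in playing order.
tracks = {
    "melody": [60, 62, 64, 65, 67, 69, 71, 72],
    "bass": [36, 36, 43, 43, 36, 36],
}
main_track = choose_main_track(tracks)
```

Here the melody track uses eight distinct pitches and the bass track only two, so the melody track is taken as the main track.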

In [9], music attributes including tempo, mode, loudness, pitch, intervals, melody, harmony, tonality, rhythm, articulation, amplitude envelope and musical form were shown to affect listeners’ emotional responses to music, and these attributes may interact with each other. In this thesis, the important sets of music attributes listed in Table 3.2 are extracted from the MIDI music files and are defined as the music attributes vector in Definition 3.2 to describe a piece of music.

Definition 3.2: Music Attributes Vector

Music: M (Pitch, Rhythm, Velocity, Timbre, Mode)_i denotes the i-th piece of music and has 5 categories containing 12 tuples.

․Pitch = <PM, PS, IM, IS, PE, PD> denotes the pitch information of music

․Rhythm = <TD> denotes the rhythm information of music

․Velocity = <LM, LS> denotes the velocity information of music

․Timbre = <TB> denotes the timbre of music

․Mode = < MD, TN > denotes the mode and tonality of music

For the music attributes, PM, PS, IM, IS, PE, PD, TD, LM and LS are numerical attributes, while TB, MD and TN are categorical attributes.

Table 3.2: Music Attributes

Category | Attribute | Description
Pitch | Pitch Mean (PM) | Mean of the pitch values of the main track.
Pitch | Pitch Standard Deviation (PS) | Standard deviation of the pitch values of the main track.
Pitch | Interval Mean (IM) | Mean of the pitch difference between adjacent notes in the main track.
Pitch | Interval Standard Deviation (IS) | Standard deviation of the pitch difference between adjacent notes in the main track.
Pitch | PE | N_j is the total number of notes with the corresponding pitch in the main track and T is the total number of notes in the main track. [4]
Pitch | Pitch Density (PD) | Defined in Equation 3.1.
Rhythm | Tempo Degree (TD) | Ratio of the number of fast measures to the number of measures in the main track. A measure is a fast measure if the average note duration in the measure is shorter than one. [4]
Velocity | Loudness Mean (LM) | Average value of the note velocities.
Velocity | Loudness Standard Deviation (LS) | Standard deviation of the note velocities.
Mode | Tonality (TN) | Tonality of the music (C, C#, D, D#, E, F, F#, G, G#, A, A#, B).
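As a rough illustration of how the numerical pitch and velocity attributes in Table 3.2 could be computed, the sketch below assumes the main track has already been parsed into (pitch, velocity) pairs. Two choices here are assumptions rather than definitions from the text: the interval is taken as the absolute pitch difference between adjacent notes, and the population standard deviation is used. The note values are invented.

```python
from statistics import mean, pstdev


def music_attributes(notes):
    """Compute PM, PS, IM, IS, PD, LM and LS for one main track.

    notes: list of (pitch, velocity) pairs in playing order.
    """
    pitches = [pitch for pitch, _ in notes]
    velocities = [velocity for _, velocity in notes]
    # Assumption: an interval is the absolute pitch difference of adjacent notes.
    intervals = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return {
        "PM": mean(pitches),
        "PS": pstdev(pitches),  # assumption: population standard deviation
        "IM": mean(intervals),
        "IS": pstdev(intervals),
        "PD": len(set(pitches)) / 128,  # Equation 3.1
        "LM": mean(velocities),
        "LS": pstdev(velocities),
    }


# Hypothetical main track with four notes.
attrs = music_attributes([(60, 80), (62, 90), (64, 100), (62, 90)])
```

The categorical attributes (TB, MD, TN) and the tempo degree are left out because their extraction depends on details not given in this section.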

(3) Music Emotion:

The music emotion vector tagged by the m-th listener for n pieces of music is defined in Definition 3.3.

Definition 3.3: Music Emotion Vector

Music Emotion: ME (E₁, E₂, E₃, …, E_i, …, E_n)_m denotes the music emotion of the m-th listener.

․E_i represents the emotion tagged for the i-th piece of music

A music emotion concept hierarchy, based on the two emotion models proposed by Thayer [25] and Reilly [22], is proposed to describe listeners’ emotional responses to music. Thayer’s two-dimensional model of mood contains the emotions Exuberance, Anxious/Frantic, Contentment and Depression. Among them, Exuberance and Contentment are positive emotions, while Anxious/Frantic and Depression are negative emotions. In addition, Exuberance and Anxious/Frantic are mapped to high intensity, while Contentment and Depression are mapped to low intensity [18].

In Reilly’s emotion model there are 30 kinds of emotions, grouped into 15 groups in [11], where the emotions in the same group are assumed to be the same emotion. In this thesis we adopt a different grouping policy: we map Reilly’s emotion model onto Thayer’s mood model according to whether each emotion is positive or negative and whether its intensity is high or low.

The music emotion concept hierarchy is shown in Figure 3.1. The first layer, named Root, has one element named Emotion. The second layer, named Layer A, has two elements named Positive (A1) and Negative (A2). In the third layer, named Layer B, Exuberance (B1) and Contentment (B2) have the a-kind-of (AKO) relation to Positive (A1), while Anxious/Frantic (B3) and Depression (B4) have the AKO relation to Negative (A2). This layer corresponds to Thayer’s two-dimensional model of mood. Finally, the fourth layer, named Layer C, has 30 elements marked C1, C2, …, C30, each of which has the AKO relation to one of B1, B2, B3 and B4. This layer corresponds to Reilly’s emotion model.

Figure 3.1 : Music Emotion Concept Hierarchy
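One simple way to hold such a hierarchy in code is a child-to-parent map of AKO links, sketched below. The Root, Layer A and Layer B entries follow the description in the text. The single Layer C entry is purely hypothetical, since the assignment of C1 to C30 to the Layer B groups is given only in the figure.

```python
# Child -> parent (AKO) links of the music emotion concept hierarchy.
AKO = {
    "A1": "Emotion",  # Positive
    "A2": "Emotion",  # Negative
    "B1": "A1",       # Exuberance
    "B2": "A1",       # Contentment
    "B3": "A2",       # Anxious/Frantic
    "B4": "A2",       # Depression
    "C1": "B1",       # hypothetical Layer C assignment, for illustration only
}


def roll_up(node, levels=1):
    """Follow AKO links upward `levels` times, stopping at the root."""
    for _ in range(levels):
        node = AKO.get(node, node)
    return node


def ancestors(node):
    """Return all ancestors of a node, nearest first."""
    path = []
    while node in AKO:
        node = AKO[node]
        path.append(node)
    return path
```

For instance, rolling Depression (B4) up one level gives Negative (A2), and every path ends at the root element Emotion.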

3.3 Analysis Procedure

To predict music emotion with respect to listeners’ differences, we propose the Personalized Music Emotion Prediction (P-MEP) analysis procedure, which consists of a series of data mining processes that extract the personalized music emotion prediction rules and then apply them in the prediction system.

Figure 3.2 : Analysis Procedure of Personalized Music Emotion Prediction

As shown in Figure 3.2, the analysis procedure consists of five phases:

(1) Data Preprocessing Phase: The user profiles, music attributes and music emotions are collected. First, every listener fills out a questionnaire that covers all the user profile attributes. Then the listeners write down the emotion they perceive after each piece of representative music is played. Finally, the music attributes are obtained by parsing the MIDI files. All three kinds of data are preprocessed and saved into a database.

(2) User Emotion Group Clustering Phase: The music emotions tagged by the listeners are transformed into emotion vectors. Then the Robust Clustering Algorithm for Categorical Attributes (ROCK) is applied to cluster the listeners into several user emotion groups.

(3) User Group Classification Phase: The ID3 classification algorithm is applied to learn the relationship between user profiles and user emotion group numbers, so that listeners can be classified into a user emotion group according to their user profiles. The output of this phase is the set of user group classification rules.

(4) Music Emotion Classification Phase: For each user emotion group, a representative music emotion vector over the pieces of music represents the emotions of all listeners in that group. Then the C4.5 classification algorithm is applied to the music attributes and music emotions to learn the music emotion prediction rules for each user emotion group.

(5) Personalized Music Emotion Prediction Rule Integration Phase: The rule integration method combines the user group classification rules and the music emotion prediction rules into the personalized music emotion prediction rules.
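To make phase (3) more concrete, the sketch below shows the information gain criterion that ID3 uses to decide which user profile attribute to split on. It is a generic illustration of ID3's split selection, not the thesis's implementation; the four listeners, the attribute names and the group labels are made-up toy data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction of `target` obtained by splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Toy data (hypothetical): simplified user profiles with a user emotion group.
listeners = [
    {"Gender": "Male", "MusicMajor": "Yes", "UEG": 1},
    {"Gender": "Female", "MusicMajor": "Yes", "UEG": 1},
    {"Gender": "Male", "MusicMajor": "No", "UEG": 2},
    {"Gender": "Female", "MusicMajor": "No", "UEG": 2},
]
gains = {a: information_gain(listeners, a, "UEG") for a in ("Gender", "MusicMajor")}
best_attribute = max(gains, key=gains.get)
```

In this toy data the group label is fully determined by the music-major attribute, so ID3 would split on it first; applying the same criterion recursively yields the decision tree from which classification rules are read off.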

Chapter 4 Rule Generation of P-MEP

4.1 Data Preprocessing

The purpose of this phase is to collect three kinds of data: user profiles, music attributes and music emotions. Suppose there are m listeners and n pieces of representative music, and the listeners annotate each piece with an emotion tag after listening to it. The three resulting kinds of vectors, namely the user profiles vector, the music attributes vector and the music emotion vector, are the output of this phase and the input of the P-MEP analysis procedure.

Example 4.1: User profiles, music attributes and music emotion vector

In the following sections, an example containing 12 listeners and 16 pieces of music is used to illustrate the P-MEP analysis procedure. The user profiles and music attributes are simplified in this example. Table 4.1 shows the twelve listeners with their user profile values, Table 4.2 shows the sixteen pieces of music with their music attribute values, and Table 4.3 shows the emotions of the 16 pieces of music as tagged by these listeners. The emotion model used is Reilly’s emotion model.

Table 4.1: Simplified User Profiles of 12 Listeners

Listener | Gender | Major in Music
L1 | Male | Yes

Table 4.2: Simplified Music Attributes of 16 Pieces of Music

Music | Pitch Mean | Tonality | Mode

Table 4.3: Emotion of 16 Pieces of Music Tagged by 12 Listeners

4.2 User Emotion Group Clustering

The first task of the analysis procedure is to find several user emotion groups from the music emotions that listeners tagged to the chosen representative music. We apply the clustering method named ROCK for this task.

ROCK, short for A Robust Clustering Algorithm for Categorical Attributes, was proposed by Guha, Rastogi and Shim in 2000 [6]. Clustering is a technique for grouping data so that the points within a single group (cluster) have similar characteristics. Traditional clustering methods such as K-Means and K-Medoids are only suitable for data points with numerical values, because they measure the similarity of data points by the numerical distance between them in a numerical space. For categorical and boolean attributes such as those of our emotion model, these methods are not suitable. We therefore choose the ROCK method and modify its similarity calculation to use our emotion concept hierarchy.

Figure 4.1: User Emotion Group Clustering

The procedure of this phase is shown in Figure 4.1. The input is the music emotion vectors of the m listeners, and the similarity of each pair of data points is calculated using the proposed music emotion concept hierarchy. Two points are identified as neighbors if their similarity is larger than the threshold. The number of links between a pair of points is the number of neighbors they have in common. Every cluster is maintained with a heap and initially contains only one point. The pair of clusters with the largest goodness value is merged repeatedly until the desired number of clusters is reached or there are no more links between any pair of clusters. Finally, a hierarchical clustering tree is constructed.

According to the experts’ suggestion, the number of user emotion groups should be three to five, so we cut the hierarchical clustering tree at the appropriate place to obtain 3 to 5 user emotion groups. The detailed algorithm is shown in Algorithm 4.1.

Algorithm 4.1: User Emotion Group Clustering (Modified ROCK Algorithm)

Purpose: To cluster all the listeners into several user emotion groups according to their emotional responses to the representative music.

Symbol Definition:

C_i: The i-th cluster
L_i: The i-th listener

Input: Emotions of n pieces of representative music tagged by m listeners (ME), Similarity Threshold (θ), Desired Number of Clusters (DNC)

Output: User Emotion Group Label of m listeners (UEGL)

Step 1: Calculate the similarity between each pair of ME vectors. ∀ i, j (i ≠ j), if the similarity of L_i and L_j > θ, then L_i and L_j are neighbors.

Step 2: Calculate the links between neighbors.

Step 3: Repeat this step until the number of clusters reaches DNC or there are no more links between any pair of clusters.

3.1 Choose the pair C_i and C_j with the largest goodness measure, where the goodness measure is

$$g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}$$

and n_i and n_j are the numbers of points in C_i and C_j.

3.2 Merge C_i and C_j into a new cluster.

3.3 Recalculate the links among the new clusters.

Step 4: Output the complete hierarchical clustering tree and UEGL.
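The steps of Algorithm 4.1 can be sketched in a few dozen lines. This is a simplified illustration, not the thesis's implementation: it takes a precomputed similarity matrix, recomputes goodness values by brute force instead of maintaining heaps, and returns flat clusters rather than a full hierarchical tree. The choice f(θ) = (1 − θ)/(1 + θ) is the one suggested in the original ROCK paper and is an assumption here, as are the six-listener similarity values.

```python
from itertools import combinations


def rock(sim, theta, dnc):
    """Cluster points given a symmetric similarity matrix (0 < theta < 1)."""
    n = len(sim)
    # Step 1: L_i and L_j (i != j) are neighbors if their similarity > theta.
    nbrs = [{j for j in range(n) if j != i and sim[i][j] > theta} for i in range(n)]
    # Step 2: link(i, j) is the number of neighbors that i and j share.
    link = {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}
    power = 1 + 2 * (1 - theta) / (1 + theta)  # assumed f(theta), see lead-in

    def cross_links(a, b):
        return sum(link[min(p, q), max(p, q)] for p in a for q in b)

    def goodness(a, b):
        expected = (len(a) + len(b)) ** power - len(a) ** power - len(b) ** power
        return cross_links(a, b) / expected

    clusters = [frozenset([i]) for i in range(n)]
    # Step 3: merge the best pair until DNC clusters remain or no links are left.
    while len(clusters) > max(dnc, 1):
        a, b = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
        if cross_links(a, b) == 0:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical similarities: listeners 0-2 and 3-5 form two tight groups.
group = [0, 0, 0, 1, 1, 1]
sim = [[1.0 if i == j else (0.9 if group[i] == group[j] else 0.1)
        for j in range(6)] for i in range(6)]
clusters = rock(sim, theta=0.6, dnc=2)
```

With these toy similarities the two groups of three listeners are recovered. Note that links, not raw similarities, drive the merging: two listeners that are neighbors of each other but share no common neighbor have zero links and are never merged under this neighbor definition.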

The similarity calculation of two listeners is based on the equation below.

Similarity of listener Lx and Ly

To determine the S_i value, we go back to the emotion concept hierarchy proposed in the previous section. In Layer C (Reilly’s model), if the emotions that L_x and L_y tagged for the i-th piece of music are the same, then S_i = 1. If they are different, we roll up to Layer B (Thayer’s model) and assign S_i as listed in Table 4.4:

․S_i = 0.8 if the two emotions differ in Layer C but fall into the same group after rolling up to Layer B.
․S_i = 0.6 if they fall into different Layer B groups but are both in the positive group or both in the negative group.
․S_i = 0.3 if they are not both positive or both negative, but are both in the high intensity group or both in the low intensity group.
․S_i = 0 if they share neither the positive/negative group nor the high/low intensity group.
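The S_i rules can be written as a small lookup, sketched below. The Layer B properties follow the text, but the Layer C to Layer B mapping is hypothetical (the real assignment is given in Figure 3.1), and aggregating the S_i values by their mean over the n pieces of music is an assumption, since the similarity equation itself is not reproduced above.

```python
# Layer B group -> (valence, intensity), following Thayer's model in the text.
LAYER_B = {
    "B1": ("positive", "high"),  # Exuberance
    "B2": ("positive", "low"),   # Contentment
    "B3": ("negative", "high"),  # Anxious/Frantic
    "B4": ("negative", "low"),   # Depression
}
# Hypothetical Layer C -> Layer B assignment, for illustration only.
C_TO_B = {"C1": "B1", "C2": "B1", "C3": "B2", "C4": "B3", "C5": "B4"}


def s_value(emotion_x, emotion_y):
    """S_i for the Layer C emotions two listeners tagged to the same music."""
    if emotion_x == emotion_y:
        return 1.0
    bx, by = C_TO_B[emotion_x], C_TO_B[emotion_y]
    if bx == by:
        return 0.8  # same Layer B group
    (valence_x, intensity_x), (valence_y, intensity_y) = LAYER_B[bx], LAYER_B[by]
    if valence_x == valence_y:
        return 0.6  # both positive or both negative
    if intensity_x == intensity_y:
        return 0.3  # both high intensity or both low intensity
    return 0.0


def listener_similarity(me_x, me_y):
    """Assumed aggregation: mean of S_i over all pieces of music."""
    return sum(s_value(a, b) for a, b in zip(me_x, me_y)) / len(me_x)
```

For example, two listeners who agree on one piece and tag a second piece with different emotions from the same Layer B group would get a similarity of (1 + 0.8) / 2 = 0.9 under this assumed averaging.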

Example 4.2: User Emotion Group Clustering

Following Example 4.1, ROCK first calculates the similarity of each pair of points. In the initial stage, each listener represents one data point. Table 4.4 shows the similarity values among the twelve listeners. Then, following Step 1 of Algorithm 4.1, a pair of points is identified as neighbors if their similarity is larger than the threshold θ, which is 0.6. Table 4.5 shows the number of links between each pair of data points, as computed in Step 2 of Algorithm 4.1. Then, following Step 3, the algorithm merges clusters based on the goodness value. Finally, the hierarchical clustering tree is generated, as shown in Figure 4.2.
