MUSIC AND LYRICS: CAN LYRICS IMPROVE EMOTION ESTIMATION FOR MUSIC?

ABSTRACT

Do lyrics hold the key to improving the performance of emotion-based music information retrieval (EMIR)? While mainstream EMIR research focuses on analyzing a song’s melody, this study explores the influence of lyrics versus melody. We conducted a user study to gather subjects’ emotion ratings on lyrics and melody, and applied statistical analysis to show how each contributes to the song’s overall Valence-Arousal (V-A) emotion level. Our results show that lyrics are not only a valid measure for emotion estimation of a song, but also provide supplementary information that can improve a melody-centric EMIR system. In addition, our data suggest that the correlation between lyrics and melody depends on the V-A quadrant in which the song resides.

1 INTRODUCTION

Emotion-based music information retrieval (EMIR) has been an emerging field within the MIR community. Given the strong association of emotion with music expressivity [8], identifying music emotion allows more intuitive music organization and searching¹. It also provides a novel approach to music recommendation [4].

Various methods have been applied to EMIR. Most mainstream research has focused on melody-based methods [13, 10, 17, 12, 19], and many studies have also incorporated meta-features such as genre, artist, and album name in training [5, 6]. Meanwhile, lyrics have seldom played a significant role in EMIR, probably because of their subtlety, as described in [18]. Nevertheless, thanks to advances in computational linguistics [3, 14] in recent decades, text emotion can be retrieved with less effort and higher precision. The work in [4] demonstrates an innovative use of lyrics emotion to estimate the emotional level of songs and to answer users’ emotion-based queries.

Even though incorporating lyrics information in EMIR is considered an innovative improvement, there has been no solid evidence to justify it. Songs are a special medium that provides lyrics text and audio stimulation at the same time. What is the role of lyrics in a song? Do lyrics contain only redundant emotion information that is already covered by the melody? Or is there some mysterious interaction between the lyrics and the melody that really constitutes our music listening experience?

¹ http://www.musicovery.com/

One published study closely related to these questions is [1]. Its conclusion is that the melodies of songs are more dominant than the lyrics in eliciting emotions. While this might be valid, it does not diminish our curiosity about whether we can extract additional information from the less emotion-dominant lyrics.

In this paper, we hope to show the value of lyrics information in predicting a song’s emotion. We conducted an experiment to gather users’ emotion ratings of the lyrics, the melody, and the song itself. In Section 3.1, applying Pearson correlation coefficient analysis to our data shows that lyrics emotion is a valid measure for estimating song emotion. In Section 3.2, we use a multiple linear regression model to demonstrate that lyrics can provide supplementary emotion hints over melody. In Section 3.3, a detailed examination suggests that the interaction between lyrics and melody might be emotion-dependent.

2 METHODOLOGY

The purpose of this experiment is to gather participants’ numerical ratings of the emotion of a song and its components. The quantitative data are then analyzed statistically to determine how the ratings correlate with one another. The experiment design and details are described below.

2.1 Emotion model

In our experiment, we adopted Thayer’s Valence-Arousal emotion plane [16], shown in Figure 1, to record subjects’ emotion. We divided the Valence-Arousal emotion plane into four categories: (1) Positive valence with High arousal (Pos-High), (2) Negative valence with High arousal (Neg-High), (3) Negative valence with Low arousal (Neg-Low), and (4) Positive valence with Low arousal (Pos-Low).

2.2 Selection of songs

Figure 1. Thayer’s arousal-valence emotion plane

The songs in the experiment were selected from a popular online Karaoke database². These songs were hits in recent years. To create the test sets, we first randomly selected twenty songs from the song database. Then, twenty annotators subjectively classified these songs into the four emotion categories described above. Among these twenty songs, we again randomly selected two from each category. The resulting eight songs were divided into two unique test sets. Each set consists of four songs, one from each category.

² http://www.karayou.com/
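The selection procedure of Section 2.2 can be sketched as follows. This is a minimal illustration, not the authors’ tooling: the song identifiers and category labels are hypothetical, and the sketch takes one category label per song as given, since the text does not say how the twenty annotators’ classifications were combined.

```python
import random

CATEGORIES = ("Pos-High", "Neg-High", "Neg-Low", "Pos-Low")


def build_test_sets(labels, rng):
    """Pick two songs per category and split them into two test sets.

    labels maps a song id to the category assigned to it (assumed given).
    """
    set_a, set_b = [], []
    for category in CATEGORIES:
        candidates = sorted(song for song, cat in labels.items() if cat == category)
        if len(candidates) < 2:
            raise ValueError(f"need at least two songs labelled {category}")
        first, second = rng.sample(candidates, 2)  # two distinct songs
        set_a.append(first)
        set_b.append(second)
    return set_a, set_b


# Hypothetical labels for twenty songs (five per category, for illustration only).
labels = {f"song_{i:02d}": CATEGORIES[i % 4] for i in range(20)}
set_a, set_b = build_test_sets(labels, random.Random(0))
print(set_a)
print(set_b)
```

Each returned set holds exactly one song per category, and the two sets share no song.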

2.3 The M, L, and ML sessions

A song comprises two main components: lyrics and melody. To have participants rate a song’s emotion based on its different components, we designed three sessions in our experiment: Lyrics-only (L), Melody-only (M), and Melody-Lyrics (ML). In the L session, we only show participants the lyrics on the screen. In the M session, we only play a song’s melody. In the ML session, both are presented to the participants. However, this is the ideal situation. The major difficulty is how to create songs for the "ideal" M session.

If we removed the human voice, which usually carries the main melody, the song’s emotion would be significantly affected. We therefore keep the human voice but prevent participants from comprehending it. Our solution is to choose a language whose meaning participants cannot understand by listening. We also need to consider the L session, in which participants must understand the lyrics by reading. Cantonese meets both requirements for two reasons: (1) its pronunciation is very different from that of Mandarin [11], and (2) participants can easily understand Cantonese lyrics by reading, without any translation, because Cantonese vocabulary and grammar are very similar to those of Mandarin [7].

Each participant is expected to rate one test set (four songs). Each song has its own L session, M session, and ML session, so each participant experiences 12 sessions. All sessions are presented to participants in random order.
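The session schedule described above (four songs, three session types each, presented in random order) can be sketched as follows. The song names are hypothetical placeholders, and the use of a seeded generator is an assumption made only so that the example is reproducible.

```python
import random

SESSION_TYPES = ("L", "M", "ML")  # Lyrics-only, Melody-only, Melody-Lyrics


def build_trials(test_set, rng):
    """Return every (song, session type) pair of a test set in random order."""
    trials = [(song, session) for song in test_set for session in SESSION_TYPES]
    rng.shuffle(trials)
    return trials


# Hypothetical test set: one song per emotion category.
test_set = ["song_a", "song_b", "song_c", "song_d"]
trials = build_trials(test_set, random.Random(42))
print(len(trials))  # 4 songs x 3 session types
```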

2.4 Emotion ratings

After each trial (i.e., after a single session), the participants are asked to rate the emotion expressed in the music on the Valence-Arousal emotion plane (see Section 4.4 for more details on the issue of objective versus subjective emotion rating). The ratings are given on nine discrete values, where ’4’ represents ’the most positive valence’ or ’the highest arousal’ and ’-4’ represents ’the most negative valence’ or ’the lowest arousal’. A rating of ’0’ represents a ’neutral’ rating.
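As a small illustration, the nine-point scale and the four categories of Section 2.1 can be encoded as below. The function names are hypothetical, and treating a rating of 0 on either axis as "Neutral" is an assumption of this sketch; the text does not say how such ratings map to a category.

```python
RATING_VALUES = range(-4, 5)  # the nine discrete values -4 .. 4


def validate_rating(value):
    """Reject anything that is not one of the nine allowed ratings."""
    if value not in RATING_VALUES:
        raise ValueError(f"rating must be an integer in [-4, 4], got {value!r}")
    return value


def quadrant(valence, arousal):
    """Map a (valence, arousal) rating pair to one of the four categories."""
    validate_rating(valence)
    validate_rating(arousal)
    if valence == 0 or arousal == 0:
        return "Neutral"  # assumption: points on an axis belong to no quadrant
    valence_part = "Pos" if valence > 0 else "Neg"
    arousal_part = "High" if arousal > 0 else "Low"
    return f"{valence_part}-{arousal_part}"


print(quadrant(3, 2))
print(quadrant(-2, -4))
```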

2.5 The participants

A total of 24 male and 11 female participants took part in our experiment. Most of them are college students aged 20 to 30. To ensure that participants cannot comprehend any semantic meaning in the M session, a Cantonese listening test is placed at the beginning of the experiment. Participants with high scores are assumed to understand spoken Cantonese. Such participants could comprehend semantic meaning from the vocals, which would violate our experiment design; hence, including their data would bias the results. The Cantonese test scores show that no participant reached our filtering threshold.

2.6 Environment

We allow multiple subjects to participate in the experiment at the same time. To facilitate the experiment, we implemented a software interface using Visual Basic and built our program on top of the Windows platform. To control the quality of the data, we added some constraints to our program. For example, we set a time constraint that prevents participants from rating a song until a minimum period of time has elapsed. This constraint implicitly encourages a participant to rate the trial only after he or she has seriously experienced the song.

3 EXPERIMENTAL RESULTS

A quick view of our experiment results can be found in Table 1. The table of means and standard deviations gives a general feeling of what the data look like. However, detailed analysis is needed to make our case. In this section, we analyze our data to support three hypotheses that emphasize the role of lyrics in estimating a song’s emotion.

3.1 Prediction power of lyrics emotion

To form the basis for using lyrics as a source for estimating song emotion, we must first ensure that lyrics emotion has explanatory power toward song emotion. To investigate the correlations between the components of a song and the song itself, we use standard Pearson correlation analysis, which yields a numerical value for how well a single component correlates with the song.

Figure 2. Snapshots of experimental user interface. (a) L session: subject can read the lyrics located in the center. (b) M session: subject listens to the melodies but cannot read any lyrics. (c) ML session: subject listens to the melodies and reads the lyrics at the same time.

Dimension  Category   Lyrics          Song            Melody
Valence    Pos-High    0.61 (2.15)     2.08 (1.82)     2.15 (1.49)
Valence    Neg-High   −1.87 (1.70)    −1.18 (2.19)    −0.85 (1.85)
Valence    Neg-Low    −1.79 (1.66)    −2.26 (1.47)    −1.57 (1.62)
Valence    Pos-Low     1.41 (2.15)     0.93 (2.08)     0.48 (1.76)
Arousal    Pos-High    1.02 (1.73)     0.98 (2.01)     1.34 (1.79)
Arousal    Neg-High    1.54 (2.03)     2.72 (1.24)     2.62 (1.44)
Arousal    Neg-Low    −0.05 (2.04)    −0.79 (2.09)    −1.33 (1.90)
Arousal    Pos-Low     0.72 (2.01)    −0.26 (2.12)    −0.10 (2.14)

Table 1. The mean (standard deviation) of valence-arousal scores for lyrics, song, and melody; scoring range: [-4, 4]

          Song Valence                      Song Arousal
Category  Lyrics valence  Melody valence    Lyrics arousal  Melody arousal
Pos-High  .45**           .46**             .40**           .80**
Neg-High  .47**           .31*              .25*            .61**
Neg-Low   .24             .53**             .33**           .40**
Pos-Low   .67**           .57**             .24             .65**

Table 2. Pearson product-moment coefficients denoting the correlation between VA values and lyrics/melody, respectively, for songs with different affective values. Note: ’*’ significant at the α = 0.05 level; ’**’ significant at the α = 0.01 level

In Table 2, we show the value of the Pearson correlation between each component and the song, separated by emotion category and emotion dimension. First, looking at the lyrics, we can see that lyrics emotion has a significant correlation with song emotion in over half of the cases, especially when the song belongs to the low arousal category. When the song has a high arousal rating, the lyrics generally have no significant correlation with the song emotion in the valence and arousal dimensions. On the other hand, melody has a significant correlation with a song’s emotion in all categories of songs and in both the valence and arousal dimensions. From the above observations, two conclusions can be drawn. First, lyrics and melody can each be used to predict song emotion. Second, the predictive power of melody exceeds that of lyrics.
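For readers who want to reproduce this kind of analysis, the Pearson product-moment coefficient can be computed as in the sketch below. The rating lists are invented for illustration and are not the study’s data. The t statistic shown is the standard textbook test for a correlation with n - 2 degrees of freedom; the text does not state which significance test was actually used.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def pearson_t(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented valence ratings on the [-4, 4] scale, for illustration only.
lyrics_valence = [1, -2, 0, 3, -1, 2, -3, 1]
song_valence = [2, -2, 1, 3, -2, 3, -3, 0]

r = pearson_r(lyrics_valence, song_valence)
print(r, pearson_t(r, len(song_valence)))
```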

While melody seems to outperform lyrics quantitatively, we think that they might have qualitative differences. We assume that the semantic information of the lyrics should make its own contribution to the song emotion, and that this additional information provides supplementary explanatory power that melody cannot cover. In the next section, we present experiment results that support this hypothesis.

3.2 Lyrics provide supplementary information to melody

To evaluate the qualitative difference between the contributions of lyrics and melody toward song emotion, we use a multiple linear regression model of melody and lyrics to analyze our data. The results are shown in Table 3.
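The comparison between single-component and dual-component models can be sketched with ordinary least squares as follows. This is an illustrative sketch using NumPy and invented ratings, not the authors’ analysis script; the helper name `r_squared` is hypothetical.

```python
import numpy as np


def r_squared(predictors, target):
    """Coefficient of determination of an OLS fit with an intercept."""
    y = np.asarray(target, dtype=float)
    columns = [np.ones(len(y))] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Invented ratings on the [-4, 4] scale, for illustration only.
lyrics = [1, -2, 0, 3, -1, 2, -3, 1]
melody = [2, -1, 1, 2, -3, 3, -2, 0]
song = [2, -2, 1, 3, -2, 3, -3, 0]

r2_lyrics = r_squared([lyrics], song)        # single lyrics model
r2_melody = r_squared([melody], song)        # single melody model
r2_dual = r_squared([lyrics, melody], song)  # dual component model (M+L)
print(r2_lyrics, r2_melody, r2_dual)
```

Because the single models are nested in the dual model, the dual model’s R-square on the same data can never be lower than either single model’s.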

          Valence                    Arousal
Category  Lyrics  Melody  Model      Lyrics  Melody  Model
Pos-High  .21**   .21**   .39**      .16**   .64**   .65**
Neg-High  .22**   .09*    .27**      .06*    .38**   .40**
Neg-Low   .06     .28**   .29**      .11**   .16**   .18**
Pos-Low   .44**   .32**   .58**      .06     .42**   .42**

Table 3. Coefficients of determination (i.e., R-square) and model testing results of three hypothesized models (single lyrics model, single melody model, and dual component model) for songs with different affective values. Note: ’*’ significant at the α = 0.05 level; ’**’ significant at the α = 0.01 level

          Valence                      Arousal
Category  Lyrics t-value  Melody t-value  Lyrics t-value  Melody t-value
Pos-High  4.14**          4.28**          1.3             8.97**
Neg-High  3.7**           1.84            1.49            5.68**
Neg-Low   0.71            4.33**          1.18            2.25*
Pos-Low   5.98**          4.38**          0.26            6.09**

Table 4. Regression coefficients of dual component models for songs with different affective values in multiple linear regression (i.e., song emotion predicted by lyrics and melody). Note: ’*’ significant at the α = 0.05 level; ’**’ significant at the α = 0.01 level

In Table 3, the dual component model (i.e., M+L) is a valid approximation of a song’s emotion. The model has significant predictive power in all categories of songs and in both the valence and arousal dimensions. In every case, the dual component model explains at least as much variance as the single lyrics model and the single melody model. We assume the superior explanatory power exists because one variable contributes supplementary information to the other variable.

To test the above assumption, we investigate the significance of the individual terms. Table 4 presents the t-values of the lyrics and melody components in the dual component model.

First, we notice that lyrics have significant explanatory power toward the model in over half of the cases, and most of them occur when the song’s emotion is in the low arousal category. Second, we see that melody has significant explanatory power in most of the cases, and generally performs better in the arousal dimension. Hence, our assumption about the superior explanatory power is correct, since both lyrics and melody have significant regression coefficients for the model.
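The per-term t-values of a dual component model can be obtained from the usual ordinary least squares formulas, as sketched below: each coefficient is divided by its standard error, where the standard errors come from the residual variance and the inverse of X'X. The data are invented for illustration (the song ratings follow the melody ratings closely by construction), and the function name is hypothetical.

```python
import numpy as np


def ols_t_values(lyrics, melody, song):
    """t-values of the intercept, lyrics, and melody terms in song ~ lyrics + melody."""
    y = np.asarray(song, dtype=float)
    X = np.column_stack([
        np.ones(len(y)),
        np.asarray(lyrics, dtype=float),
        np.asarray(melody, dtype=float),
    ])
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    sigma2 = float(residual @ residual) / (n - k)  # unbiased residual variance
    covariance = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    standard_errors = np.sqrt(np.diag(covariance))
    t = beta / standard_errors
    return {"intercept": float(t[0]), "lyrics": float(t[1]), "melody": float(t[2])}


# Invented ratings: the song rating tracks the melody rating closely.
lyrics = [1, -2, 0, 3, -1, 2, -3, 1]
melody = [2, -1, 1, 2, -3, 3, -2, 0]
song = [2, -1, 1, 3, -3, 3, -2, -1]

t_values = ols_t_values(lyrics, melody, song)
print(t_values)
```

Each t-value is then compared against a t distribution with n - k degrees of freedom to decide significance at the chosen α level.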

The above analysis supports our hypothesis that lyrics provide additional information to a melody-based song emotion estimation. Thus, although lyrics emotion might not be a superior measure to melody, the results of song emotion estimation would still be better if the lyrics information were considered in the process.

Figure 3. Four emotion categories of songs plotted on four quadrants of Valence-Arousal plane. The square represents the ML session emotion rating; the diamond represents the L session emotion rating; the triangle represents the M session emotion rating.

3.3 Effects are emotion-dependent

In our last analysis, we look separately at songs in the four emotion categories (Pos-High, Neg-High, Neg-Low, and Pos-Low). These four categories actually represent four quadrants on the Valence-Arousal emotion plane. We plot each category’s mean emotion ratings on the Valence-Arousal plane (Figure 3). From the figure, we observe that the correlation and interaction between lyrics and melody seem to vary between different song categories. We will briefly describe our observation for each quadrant below.

In the Pos-High category, the rating of the M session is close to the rating of the ML session on the Valence-Arousal plane. On the other hand, the rating of the L session is far from that of the ML session. This observation indicates that the song’s numerical ratings of arousal and valence are dominated by melody in this category. The inconsistency between the L session and the ML session is probably due to the fast and strong beats in the Pos-High category. Thus, even if lyrics alone have negative valence and a relatively low arousal rating, the participants still use melody information to explain and report their emotion.

In the Neg-Low quadrant, the lyrics information provides a better estimate of a song’s valence level, whereas the melody provides a credible estimate of the arousal level. From this observation, we assume that when a song falls in the Neg-Low category, it would be better to estimate its emotion rating with both lyrics and melody information.

In the Neg-High and Pos-Low categories, it is hard to draw any conclusion on the correlation and interaction between lyrics and melody. We could probably say that lyrics provide better valence prediction of the song, but this is still within the margin of error.

To sum up, no general conclusion can be drawn about the correlation and interaction of lyrics and melody. In each emotion category, lyrics and melody have their own interaction patterns. Thus, we conclude that the supplementary explanatory power of lyrics over melody differs among the emotion categories of songs.
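The per-quadrant comparison in this section can be made concrete with the category means reported in Table 1. The sketch below measures how far the L and M session means lie from the ML (song) means, per dimension and as a Euclidean distance on the Valence-Arousal plane. It reads the "Song" column of Table 1 as the ML session rating; the function name is hypothetical.

```python
import math

# Mean (valence, arousal) ratings per session, copied from Table 1.
TABLE_1_MEANS = {
    "Pos-High": {"L": (0.61, 1.02), "ML": (2.08, 0.98), "M": (2.15, 1.34)},
    "Neg-High": {"L": (-1.87, 1.54), "ML": (-1.18, 2.72), "M": (-0.85, 2.62)},
    "Neg-Low": {"L": (-1.79, -0.05), "ML": (-2.26, -0.79), "M": (-1.57, -1.33)},
    "Pos-Low": {"L": (1.41, 0.72), "ML": (0.93, -0.26), "M": (0.48, -0.10)},
}


def gaps_to_song(category):
    """Gaps between each component (L, M) and the song (ML) for one category."""
    means = TABLE_1_MEANS[category]
    song_valence, song_arousal = means["ML"]
    gaps = {}
    for session in ("L", "M"):
        valence, arousal = means[session]
        gaps[session] = {
            "valence": abs(valence - song_valence),
            "arousal": abs(arousal - song_arousal),
            "distance": math.hypot(valence - song_valence, arousal - song_arousal),
        }
    return gaps


for category in TABLE_1_MEANS:
    print(category, gaps_to_song(category))
```

With these means, the melody point lies closer to the song point than the lyrics point does in the Pos-High category, while in Neg-Low the lyrics gap is smaller on valence and the melody gap is smaller on arousal, in line with the observations above.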

4 DISCUSSION

4.1 Music emotion approximation model

From the experiment results, we claim that both lyrics and melody are required to best represent and predict the emotion of music. In our analysis, we use a multiple linear regression model to approximate song emotion; however, this might not be the optimal use of the lyrics and melody information. Possible variations include the intuitive linear combination, or the addition of a non-linear term to the regression. For example, an M*L term could represent the interaction between lyrics and melody. At the other extreme, one might argue that lyrics and melody should be considered individually because of their different qualitative natures, for example by using lyrics emotion to predict the valence direction and melody emotion to predict the arousal direction. Whichever method is used, the correlation and interaction of lyrics and melody are song-emotion dependent. Thus, a successful MRS might want to have different equations for different song emotion categories.
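One of the variations mentioned above, a regression with an M*L interaction term, can be sketched as follows. This is an illustration of the idea only: the model form song = b0 + b1*L + b2*M + b3*(M*L) is one possible reading of "the M*L term", and the ratings and coefficients are synthetic, chosen so that the fit can be checked by eye.

```python
import numpy as np


def fit_with_interaction(lyrics, melody, song):
    """Least-squares fit of song = b0 + b1*L + b2*M + b3*(M*L)."""
    l = np.asarray(lyrics, dtype=float)
    m = np.asarray(melody, dtype=float)
    y = np.asarray(song, dtype=float)
    X = np.column_stack([np.ones_like(l), l, m, l * m])
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients  # [b0, b1, b2, b3]


# Synthetic ratings, generated from known (hypothetical) coefficients.
lyrics = np.array([-1, -1, 1, 1, -2, 2, 0, 3], dtype=float)
melody = np.array([-1, 1, -1, 1, 2, 2, -3, 0], dtype=float)
true_coefficients = np.array([0.0, 0.5, 1.0, 0.25])
song = (true_coefficients[0]
        + true_coefficients[1] * lyrics
        + true_coefficients[2] * melody
        + true_coefficients[3] * lyrics * melody)

estimated = fit_with_interaction(lyrics, melody, song)
print(np.round(estimated, 3))
```

Fitting one such model per emotion category would give the category-specific equations suggested at the end of the paragraph above.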

4.2 Lyrics emotion estimation

The most relevant work in estimating lyrics emotion would be iPlayr [4]. It uses the lyrics text of a song to match entries in ANEW [2], a database of affective words with their PAD [15] value annotated. The query results are then calculated to form the features of a song, such as the mean and standard deviation of the PAD values. The paper then applies classification techniques on these features to estimate the lyrics’ emotion.
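The feature extraction step described above (matching lyrics words against an affective lexicon, then taking the mean and standard deviation of the matched values) can be sketched as follows. This is not iPlayr’s implementation: the lexicon entries and their numbers are invented for illustration and are not taken from ANEW, and whitespace tokenization is a simplifying assumption.

```python
import statistics

# Hypothetical word -> (pleasure, arousal, dominance) scores.
# These numbers are invented for illustration and are NOT taken from ANEW.
TOY_LEXICON = {
    "happy": (8.0, 6.0, 7.0),
    "sad": (2.0, 4.0, 3.0),
    "dance": (7.0, 7.0, 6.0),
}

DIMENSIONS = ("pleasure", "arousal", "dominance")


def lyrics_features(lyrics, lexicon):
    """Mean and population standard deviation of PAD values of matched words."""
    matches = [lexicon[word] for word in lyrics.lower().split() if word in lexicon]
    if not matches:
        return None  # no lexicon word found in the lyrics
    features = {}
    for index, name in enumerate(DIMENSIONS):
        values = [match[index] for match in matches]
        features[name + "_mean"] = statistics.mean(values)
        features[name + "_std"] = statistics.pstdev(values)
    return features


features = lyrics_features("Happy sad happy dance tonight", TOY_LEXICON)
print(features)
```

The resulting feature dictionary is what a classifier would consume in the final step.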

4.3 Melody emotion estimation

One possible method is proposed in [12]. The paper extracts melody features from a music training set and applies machine learning techniques to create an emotion classifier. The classifier takes melody features as input and outputs the emotion estimation on the Valence-Arousal plane.

4.4 Issue of subjective or objective emotion measure

It is important to recognize the difference between subjective and objective emotion measures. Subjective emotion means, in our case, how participants really feel when they are listening to the music (affect induction). Objective emotion, on the other hand, means that participants annotate the emotion that best describes the music (affect attribution). We ask participants to rate objective emotion in our experiment. For one thing, objective emotion is much easier to obtain; for another, we assume that, when it comes to music emotion, the objective and subjective measures should be highly correlated. An in-depth discussion of the difference between subjective emotion (affect induction) and objective emotion (affect attribution) can be found in [9].

4.5 Issue of limited sample size

We are aware that the number of participants and the number of selected songs in the experiment are crucial to the confidence level of our claims. Although we use a randomized process in selecting songs, the eight songs in our experiment might still be too few to be truly representative of the Cantonese song population. Nevertheless, our experiment data support the three hypotheses we make about lyrics. We believe that increasing the sample size would yield more general and accurate results.

4.6 Issue of different input sources

One could argue that we usually extract semantic information from lyrics by listening, not by reading. Although this might be true, and our emotion toward lyrics might differ because of the discrepancies between visual and auditory cognitive processes, the experiment results can still answer our questions about how lyrics information can help improve MIR performance.

Our design of the three sessions closely resembles how computers process music information. The M sessions in our experiment imitate how computers interpret the melody: the machines have knowledge about every melody piece but are given no semantic information. The L sessions are similar to how computers handle context: they take in a batch of words and process it to form meanings and interpretations. In our experiment, we use human ratings to simulate a perfect scenario in which computers rate lyrics and melody emotion consistently with humans, and study how computers can use such information to generate an accurate emotion estimate for the song as a whole.

We acknowledge that the interaction between lyrics and melody could be much more complicated than we thought, and that preventing lyrics and melody from being presented simultaneously as audio might have a significant impact on our emotion ratings. However, in this experiment we simply assume that the impact can be neglected.

5 CONCLUSION

This paper reported the results of our scientific examination of the role of lyrics in a song. By conducting user studies, we measured the quantitative contribution of lyrics and melody to a song’s emotion. The statistical analysis suggests that lyrics can serve as a valid measure for emotion estimation of a song. Our analysis further shows that lyrics can provide supplementary emotion information over melody. Another important discovery is that the correlation between lyrics and melody depends on the V-A quadrant in which the song is located. This last finding suggests that different estimation models should be used in different V-A quadrants to obtain more accurate estimates of a song’s emotion.

6 REFERENCES

[1] S. Omar Ali and Zehra F. Peynircioglu. Songs and emotions: are lyrics and melodies equal partners? Psychology of Music, 34(4):511–534, 2006.

[2] Margaret M. Bradley and Peter J. Lang. Affective norms for English words, 1999.

[3] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. Bradford Books, March 1998.

[4] David Chia-Wei Hsu. iPlayr: an emotion-aware music player. Master’s thesis, National Taiwan University, June 2007.

[5] Xiao Hu, Mert Bay, and J. Stephen Downie. Creating a simplified music mood classification ground-truth set. In ISMIR, 2007.

[6] Xiao Hu and J. Stephen Downie. Exploring mood metadata: Relationships with genre, artist and usage metadata. In ISMIR, 2007.

[7] Cheng-Teh James Huang. Logical relations in Chinese and the theory of grammar. PhD thesis, Massachusetts Institute of Technology, 1982.

[8] P.N. Juslin, J. Karlsson, E. Lindstrom, A. Friberg, and E. Schoonderwaldt. Play it again with feeling: computer feedback in musical communication of emotions. Journal of Experimental Psychology: Applied, 12(2):79–95, June 2006.

[9] Marc Leman, Valery Vermeulen, Liesbeth De Voogdt, Dirk Moelants, and Micheline Lesaffre. Prediction of musical affect using a combination of acoustic structural cues. Journal of New Music Research, 34(1):36–67, 2005.

[10] Tao Li and Mitsunori Ogihara. Detecting emotion in music. In ISMIR, 2003.

[11] Hua Lin. Gender differences in the Chinese language: a preliminary report. In Proceedings of the Ninth North American Conference on Chinese Linguistics, 1998.

[12] Chia-Chu Liu, Yi-Hsuan Yang, Ping-Hao Wu, and Homer H. Chen. Detecting and classifying emotion in popular music. In JCIS/CVPRIP, 2006.

[13] Dan Liu, Lie Lu, and Hong-Jiang Zhang. Automatic mood detection from acoustic music data. In ISMIR, 2003.

[14] Hugo Liu and Push Singh. ConceptNet: A practical commonsense reasoning toolkit. In BT Technology Journal, volume 22. Kluwer Academic Publishers, 2004.

[15] Albert Mehrabian. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4):261–292, December 1996.

[16] R.E. Thayer. The Biopsychology of Mood and Arousal. Oxford University Press, 1989.

[17] Muyuan Wang, Naiyao Zhang, and Hancheng Zhu. User-adaptive music emotion recognition. In ICSP, 2004.

[18] Bin Wei, Chengliang Zhang, and Mitsunori Ogihara. Keyword generation for lyrics. In ISMIR, 2007.

[19] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su, and Homer H. Chen. Music emotion classification: A regression approach. In ICME, 2007.