
the similarity score takes the number of objects referred to by the other target into account or not.

Take the following three users with the listening records as an example:

$O(\mathrm{User}_i) = [1, 2, 3]$, $O(\mathrm{User}_j) = [1, 2, 3]$, $O(\mathrm{User}_k) = [1, 2, 3, 4]$.

Then $\mathrm{User}_j$ is more similar to $\mathrm{User}_i$ than $\mathrm{User}_k$ based on the listening history when $\alpha = 1$; on the other hand, $\mathrm{User}_j$ and $\mathrm{User}_k$ receive the same score when $\alpha = 0$.
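To make the role of $\alpha$ concrete, the sketch below assumes a simple overlap-style similarity in which $\alpha$ toggles normalization by the size of the other target's object list; the exact functional form used in this work may differ, so the function name and shape here are illustrative only:

```python
def overlap_similarity(target, other, alpha=1.0):
    """Count shared objects; alpha controls whether the score is
    normalized by the size of the other target's history."""
    shared = len(set(target) & set(other))
    # alpha = 1: penalize targets with larger histories;
    # alpha = 0: raw overlap count, history size ignored.
    return shared / (len(other) ** alpha)

O_i, O_j, O_k = [1, 2, 3], [1, 2, 3], [1, 2, 3, 4]

print(overlap_similarity(O_i, O_j, alpha=1))  # 1.0  -> User_j more similar
print(overlap_similarity(O_i, O_k, alpha=1))  # 0.75
print(overlap_similarity(O_i, O_j, alpha=0))  # 3.0  -> same score
print(overlap_similarity(O_i, O_k, alpha=0))  # 3.0
```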

For categorical indicators, because this kind of feature usually occurs across different objects, the function O is the collection of referred objects for a target. Take user age as an example: if we want to know the similarity of listening history between 15-year-old users and 30-year-old users, the function O collects, for each age group, all the songs listened to by the users of that age.
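A small sketch of this collection step, using a hypothetical record layout of (user, age, song) triples:

```python
def collect_objects(records, age):
    """O for a categorical indicator: pool the songs of all users
    who share the given age value."""
    return {song for user, a, song in records if a == age}

records = [(1, 15, 101), (2, 30, 101), (3, 15, 102)]  # (user, age, song)
O_15, O_30 = collect_objects(records, 15), collect_objects(records, 30)
print(O_15, O_30)  # {101, 102} {101} -> compared as in the similarity above
```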

For the real-value indicators, the feature vector is normalized by the standard score:

$$ z = \frac{x - \mu}{\sigma}, $$

where $\mu$ is the mean of the population and $\sigma$ is the standard deviation of the population. The score indicates how many standard deviations an observation is above or below the mean.
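For instance, a one-line NumPy version of this standardization (population statistics, i.e. ddof = 0):

```python
import numpy as np

def standardize(x):
    """Standard score: how many standard deviations each value
    lies above or below the population mean."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

print(standardize([1.0, 2.0, 3.0, 4.0]))  # mean 2.5, population std ~1.118
```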

Finally, suppose we have a set of similarity scores for a specific target and seek to embed them into a feature vector. A simple way is to directly index them with the corresponding scores. However, a popular object generally has more similar objects than the others, which may lead to an imbalance problem in which unpopular objects can hardly obtain a similarity score. To take this balance issue into account, we only keep the top-k similar objects as the new score basis and normalize the new vector of k values to sum to 1:

$$ \bar{s}_{ij} = \frac{s_{ij}}{\sum_{j'=1}^{n} |s_{ij'}|}. \qquad (3.7) $$

The purpose of this step is to avoid an imbalance of similarity information. For example, given $s(\mathrm{User}_i) = (0, 0.8, 0.6)$ and $s(\mathrm{User}_j) = (0.1, 0, 0.2)$, without normalization $\mathrm{User}_i$ would be more likely to obtain high scores simply because its similarity vector contains larger values.
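A minimal sketch of this top-k truncation and normalization step (Eq. 3.7), assuming the scores are held in a NumPy array; the function name is illustrative:

```python
import numpy as np

def topk_l1_normalize(scores, k):
    """Keep the k largest similarity scores (by magnitude), zero out
    the rest, then rescale so the absolute values sum to 1 (Eq. 3.7)."""
    scores = np.asarray(scores, dtype=float)
    kept = np.zeros_like(scores)
    top = np.argsort(np.abs(scores))[-k:]   # indices of the top-k entries
    kept[top] = scores[top]
    total = np.abs(kept).sum()
    return kept / total if total > 0 else kept

s_i = np.array([0.0, 0.8, 0.6])
s_j = np.array([0.1, 0.0, 0.2])
print(topk_l1_normalize(s_i, k=2))  # [0.    0.571 0.429]
print(topk_l1_normalize(s_j, k=2))  # [0.333 0.    0.667]
```

After normalization, both users' scores lie on the same scale, so neither dominates merely through larger raw values.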

3.4 Extracted Features

The structure of the collected music dataset is depicted in Figure 3.2; these factors affect how people choose music. Personal factors indicate characteristics that people possess for a long period of time, such as age and gender. People with different levels of music background may appreciate music differently, which in turn affects music preference. Musical factors consist of the audio content, its profile, and even the artwork of the CD. People may choose a song because of its melody or its singer. Situational factors include those that persist for a short period of time, such as when and where you listen to music, what you are doing, and what your mood is. People usually express their feelings through listening to music, and the user-generated article reflects their recent mood.


Table 3.1: The feature sets considered in this work

abbr.  Feature                    Unique Index  Type
U      User ID                    19,596        -
S      Song ID                    30,260        -
H      Listening History          30,260        -
BY     Birth Year (of users)      100           Cb
LR     Live Region (of users)     208           Cb
M      Mood Tags (of users)       132           Cx
VAD    VAD values (of articles)   3             Cx
A      Artists (of songs)         5,175         Cb
Au     Audio Information          53            Cb
SR     Social Relation            674,932       Cx

Note: P denotes features of the user profile, Cb denotes content-based features extracted from songs, and Cx denotes context-based features extracted from users.

Figure 3.2: The structure of the LiveJournal dataset


Table 3.1 summarizes the features used in the experiments, which are described in detail below.

3.4.1 Content-based Features

Content-based features refer to features that describe either the user or the item. For describing users, we have the Birth Year (BY), Live Region (LR), and Social Relation (SR) features. The birth years of the users in our dataset fall within a window of 100 years. Moreover, the users come from 208 regions. We consider users who were born in the same year, or users who come from the same region, to be similar. In addition, from LiveJournal we can obtain social information about which users are friends and construct the social network among the users. This gives rise to the social-relation-based similarity matrix.

People who are friends with one another are likely to share similar music tastes.

For describing songs, we have Artist (A) and Audio Information (Au) features. The artist feature simply indicates the artist (among the 5,175 possible artists) of the songs.

If two songs are sung/performed by the same artist, they are likely to be more similar.

The audio features consist of 53 perceptual dimensions of music, including danceability, loudness, key, mode, tempo, pitches, and timbre. They are extracted using the EchoNest API 1, a commonly used audio feature extraction tool developed in the field of music information retrieval [9]. We can measure the similarity between two songs in this 53-dimensional feature space.
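The text does not pin down the distance measure used in this 53-dimensional space, so the following sketch assumes cosine similarity; the random vectors merely stand in for real EchoNest descriptors:

```python
import numpy as np

def audio_similarity(a, b):
    """Cosine similarity between two 53-dimensional audio feature
    vectors (danceability, loudness, key, mode, tempo, pitch and
    timbre descriptors)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)          # stand-in for real descriptors
song_a, song_b = rng.random(53), rng.random(53)
print(audio_similarity(song_a, song_b))
```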


Table 3.2: Affective Norms for English Words: 5 example words
Description    Valence    Arousal    Dominance

The user-generated articles are interesting context-based features in the dataset, but they may contain too many redundant words. Motivated by the idea of emotional matching, we convert the original content of an article into a vector of emotional words by referring to the dictionary of Affective Norms for English Words (ANEW) [4], which provides a set of normative emotional ratings for English words. There are 2,476 emotional words in ANEW in total, and about 3% of the articles are discarded because they contain no ANEW words. We keep the words that can be found in the ANEW dictionary and weight them by TF-IDF. Specifically, a word is scored by $\mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$, where

$$ \mathrm{tf}(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}, \qquad (3.8) $$

$$ \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}, \qquad (3.9) $$

and $D$ is the collection of articles, so $|D|$ is the total number of articles. A term with a high score has a high term frequency within the article and a low document frequency across the whole collection of articles. In addition, the ANEW dictionary provides normative emotional ratings for each word along Valence (or pleasantness; positive/negative affective states), Activation (or arousal; energy and stimulation level), and Dominance (or potency; a sense of control or freedom to act), the fundamental emotion dimensions found by psychologists [9]. Finally, the word vector of each article is converted to valence, arousal, and dominance (VAD) values. For example, for the sentence "I had a dream last night, I was eating a marshmallow," the VAD values would be 14.2, 10.22, and 11.13, respectively, according to Table 3.2. Moreover, we also collected the mood tags recently used by each user.
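The conversion can be sketched as below. The two ANEW entries and their ratings are invented for illustration, and the final aggregation into article-level VAD values is assumed to be a tf-idf-weighted sum, since the exact rule is not spelled out here:

```python
import math
from collections import Counter

# Hypothetical ANEW excerpt; the real dictionary has 2,476 words and
# these ratings are invented for illustration only.
ANEW = {"dream": (6.7, 4.5, 5.5), "eating": (7.0, 4.4, 6.0)}

def anew_tfidf(article, corpus):
    """Keep only the ANEW words of an article and weight them
    by tf-idf as in Eqs. (3.8) and (3.9)."""
    counts = Counter(w for w in article if w in ANEW)
    if not counts:
        return {}            # articles with no ANEW words are discarded
    max_f = max(counts.values())
    weights = {}
    for term, f in counts.items():
        tf = f / max_f
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = tf * math.log(len(corpus) / df)
    return weights

def to_vad(weights):
    """Assumed aggregation: a tf-idf-weighted sum of the per-word
    valence, arousal, and dominance ratings."""
    return tuple(sum(w * ANEW[t][d] for t, w in weights.items())
                 for d in range(3))

corpus = [["i", "had", "a", "dream"], ["i", "was", "eating"], ["no", "match"]]
print(to_vad(anew_tfidf(corpus[0], corpus)))
```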

1http://echonest.com/


Chapter 4

Experimental Results

4.1 Experimental Settings

4.1.1 Dataset

Our experiments are performed on a real-world dataset collected from a commercial website, LiveJournal1. LiveJournal is a well-known social blogging website where users listen to music while writing their online diaries. LiveJournal is unique in that, in addition to the common features of blogging, each post is accompanied by a "Mood" column and a "Music" column so that users can write down the mood and the song on their minds while posting, as Figure 4.1 exemplifies. From LiveJournal, we crawled a total of 1,928,868 listening records covering 674,932 users and 72,913 songs as an initial set.

To retain a sufficient amount of data in the training and test sets for this study, we only considered users who have more than 10 listening records and discarded the records of the other users. This filtering resulted in the final set of 225,652 listening records (11.7% of the initial set) among 19,596 users and 30,260 songs.

1http://www.livejournal.com/

Figure 4.1: LiveJournal sample posts


For evaluation, we split the dataset according to the following 80/20 rule: the full listening history of 80% of the users, together with half of the listening history of the remaining 20% of the users, is kept as the training data; the held-out half of the listening history of the remaining 20% of the users is used as the testing data. For each recorded song in the test set, we randomly add 10 songs as negative records to construct the testing pool.
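The split and negative sampling can be sketched as follows; the dictionary layout and function name are assumptions, not the actual pipeline code:

```python
import random

def split_dataset(histories, song_pool, seed=7):
    """histories: dict mapping user id -> list of listened song ids.
    80% of users keep their full history in training; the other 20%
    contribute half of their history to training, and the held-out
    half, each song paired with 10 random negatives, to testing."""
    rng = random.Random(seed)
    users = list(histories)
    rng.shuffle(users)
    cut = int(0.8 * len(users))
    train, test_pools = {}, {}
    for u in users[:cut]:
        train[u] = list(histories[u])          # full history kept
    for u in users[cut:]:
        songs = list(histories[u])
        rng.shuffle(songs)
        half = len(songs) // 2
        train[u] = songs[:half]                # half to training
        candidates = [s for s in song_pool if s not in set(songs)]
        test_pools[u] = {pos: [pos] + rng.sample(candidates, 10)
                         for pos in songs[half:]}
    return train, test_pools
```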

4.1.2 Evaluation Metrics

We employed two metrics to evaluate the recommendation performance: the truncated mean average precision at k (MAP@k) and recall. For each user, let $P(p)$ denote the precision at cut-off $p$; the average precision of a ranked list $o$ for user $u$ is

$$ AP(u, o) = \frac{\sum_{p=1}^{k} P(p) \times r_{u\,o(p)}}{I(u)}, \qquad (4.1) $$

where $o(p) = i$ means that item $i$ is ranked at position $p$ in the ordered list $o$, $r_{ui}$ indicates whether user $u$ has listened to song $i$ (1 = yes, 0 = no), and $I(u)$ is the number of songs user $u$ has listened to. MAP@k is the mean of the average precision scores over the top-k results of all target users:

$$ MAP@k = \frac{\sum_{u=1}^{U} AP(u, o)}{U}, \qquad (4.2) $$

where $U$ is the total number of target users. A higher MAP@k indicates better recommendation accuracy.
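A compact sketch of Eqs. (4.1) and (4.2), assuming $I(u)$ counts the songs user $u$ listened to in the test set and that `rankings` and `ground_truth` are hypothetical dictionaries keyed by user:

```python
def average_precision(ranked, listened, k):
    """AP as in Eq. (4.1): precision accumulated at each rank where
    a listened song appears, divided by the user's listened count."""
    hits, score = 0, 0.0
    for p, song in enumerate(ranked[:k], start=1):
        if song in listened:
            hits += 1
            score += hits / p          # P(p) at a hit position
    return score / len(listened) if listened else 0.0

def map_at_k(rankings, ground_truth, k):
    """Mean of the per-user average precision scores (Eq. 4.2)."""
    users = list(rankings)
    return sum(average_precision(rankings[u], ground_truth[u], k)
               for u in users) / len(users)

ranked = ["s3", "s1", "s9", "s2"]
print(average_precision(ranked, {"s1", "s2"}, k=4))  # (1/2 + 2/4) / 2 = 0.5
```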

Recall measures how many of the songs the user really likes are recommended by the automatic system. It is computed as:

$$ \mathrm{Recall} = \frac{|\{\text{Correct Songs}\} \cap \{\text{Returned Top-}k\ \text{Songs}\}|}{|\{\text{Correct Songs}\}|}. \qquad (4.3) $$

A high recall means that most of the songs the user actually likes or listens to are recommended.
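Eq. (4.3) translates directly; the ranked list and the set of songs the user actually listened to are hypothetical arguments:

```python
def recall_at_k(ranked, listened, k):
    """Eq. (4.3): fraction of the user's listened songs that appear
    among the returned top-k recommendations."""
    returned = set(ranked[:k])
    return len(returned & set(listened)) / len(listened)

print(recall_at_k(["s3", "s1", "s9", "s2"], {"s1", "s2"}, k=3))  # 0.5
```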

As for diversity, our evaluation metric checks the proportion of categories that appear in the top-n recommendations. We define the diversity score by the following formula:

$$ \mathrm{Coverage} = \frac{|\{\text{unique categories in the recommendations}\}|}{|\{\text{all categories}\}|}. \qquad (4.4) $$

A high score therefore means the recommendations cover a wider range of music, and a low score means the recommendations focus on only a few specific domains of music.
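A sketch of Eq. (4.4), assuming a hypothetical mapping from songs to categories; how categories are defined (e.g., by genre) is not specified here:

```python
def coverage(recommended, song_category, n_categories):
    """Eq. (4.4): proportion of all categories that appear
    in the top-n recommendation list."""
    cats = {song_category[s] for s in recommended if s in song_category}
    return len(cats) / n_categories

song_category = {"s1": "rock", "s2": "jazz", "s3": "rock"}
print(coverage(["s1", "s3"], song_category, n_categories=3))  # 1/3
```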
