
records (11.7% of the initial set) among 19,596 users and 30,260 songs.

For evaluation, we split the dataset per user according to the following 80/20 rule: for 80% of the users, we keep the full listening history as training data; for the remaining 20% of the users, half of the listening history is used as training data and the missing half as testing data. For each recorded song in the testing data, we randomly add 10 songs as negative records to construct the testing pool.
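The split described above can be sketched as follows (a minimal illustration; the function and variable names are hypothetical, and the exact sampling procedure may differ from the one actually used):

```python
import random

def split_dataset(histories, test_user_ratio=0.2, n_negatives=10, seed=42):
    """Per-user 80/20 split as described above (a sketch, not the
    original implementation): 80% of users keep their full history
    for training; the rest contribute half for training and hold out
    the other half for testing, each held-out song paired with
    `n_negatives` randomly sampled negative songs."""
    rng = random.Random(seed)
    users = list(histories)
    rng.shuffle(users)
    test_users = set(users[:int(len(users) * test_user_ratio)])

    all_songs = {s for h in histories.values() for s in h}
    train, test = {}, {}
    for u, songs in histories.items():
        if u in test_users:
            shuffled = songs[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            train[u] = shuffled[:half]        # half of history for training
            pool = []
            for s in shuffled[half:]:         # missing half for testing
                negatives = rng.sample(sorted(all_songs - set(songs)),
                                       n_negatives)
                pool.append((s, negatives))   # positive + 10 negatives
            test[u] = pool
        else:
            train[u] = songs                  # full history for 80% of users
    return train, test
```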

4.1.2 Evaluation Metrics

We employed two metrics to evaluate the recommendation performance: the truncated mean average precision at k (MAP@k) and recall. For each user, let P(p) denote the precision at cut-off p:

AP(u, o) = (Σ_{p=1}^{k} P(p) × r_{u,o(p)}) / I(u), (4.1)

where o(p) = i means that item i is ranked at position p in the ordered list o, r_{ui} indicates whether user u has listened to song i (1 = yes, 0 = no), and I(u) is the number of relevant songs for user u. MAP@k is the mean of the average precision scores for the top-k results:

MAP@k = (Σ_{u=1}^{U} AP(u, o)) / U, (4.2)

where U is the total number of target users. A higher MAP@k indicates better recommendation accuracy.
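Eqs. (4.1) and (4.2) can be sketched as follows (a minimal illustration; the normaliser I(u) is assumed here to be the number of relevant songs capped at k, which is the common convention for truncated MAP):

```python
def average_precision_at_k(ranked, relevant, k):
    """AP@k following Eq. (4.1): accumulate P(p) * r_{u,o(p)} over the
    top-k ranks and normalise by I(u) (assumed: min(|relevant|, k))."""
    hits, score = 0, 0.0
    for p, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / p          # precision at cut-off p, times r = 1
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(rankings, k):
    """MAP@k following Eq. (4.2): mean AP@k over all U target users.
    `rankings` is a list of (ranked_songs, relevant_songs) pairs."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in rankings]
    return sum(aps) / len(aps)
```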

Recall measures how many of the songs the user really likes are recommended by the automatic system. It is computed by:

Recall = |{Correct Songs} ∩ {Returned Top-k Songs}| / |{Correct Songs}|. (4.3)

High recall means that most of the songs the user actually likes or listens to are recommended.
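Eq. (4.3) amounts to a set intersection; a minimal sketch:

```python
def recall_at_k(ranked, relevant, k):
    """Recall following Eq. (4.3): the fraction of the songs the user
    actually likes that appear in the returned top-k list."""
    if not relevant:
        return 0.0
    returned = set(ranked[:k])
    return len(relevant & returned) / len(relevant)
```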

For diversity, our evaluation metric checks the proportion of categories that appear in the top-n recommendations. We define the diversity score with the following formula:

Coverage = (number of unique categories in recommendations) / (total number of categories). (4.4)

A high score means the recommendations can cover a wide range of music, while a low score means the recommendations focus on only a few specific domains of music.
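Eq. (4.4) can be sketched as follows (a minimal illustration; the `categories` mapping from song to category is a hypothetical input, not part of the original description):

```python
def coverage(recommendations, categories, all_categories):
    """Diversity score following Eq. (4.4): the proportion of unique
    music categories appearing in the top-n recommendations.
    `categories` maps each song to its category (illustrative input)."""
    seen = {categories[s] for s in recommendations if s in categories}
    return len(seen) / len(all_categories)
```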

4.2 Contextual Recommendation System

In this section, we focus on presenting how to model the relationship between a user's mood and the user's listening behavior. To enhance the benefits from different perspectives of features, we



Our first evaluation focuses on the use of CF information only for music recommendation.

We compare FM with the following three well-known CF methods: user-based CF, item-based CF, and an SVD-based approach. Below we describe the main ideas of these methods.

• User-based CF: This method weights all users with respect to their similarity to each other, and selects a subset of users (those who are highly similar to the target user) as neighbors. It predicts the rating of a specific song based on the neighbors' ratings. Let S(u) be the set of songs chosen by user u. The similarity between user u and user v is calculated by the following formula:

s_{uv} = |S(u) ∩ S(v)| / (|S(u)|^α |S(v)|^(1−α)), (4.5)

where α ∈ [0, 1] is a parameter to tune.

• Item-based CF: This method is similar to the user-based CF method. It computes the similarity between songs and scores a song based on the user's listening history.

The song similarity is calculated as follows:

s_{ij} = |U(i) ∩ U(j)| / (|U(i)|^α |U(j)|^(1−α)), (4.6)

where U(i) is the set of users who have listened to song i.

• SVD++: This method is an extended version of the SVD-based latent factor model that integrates implicit feedback into the model. Specifically, the prediction formula can be written as:

r̂_{ui} = µ + b_u + b_i + q_i^T (p_u + |N(u)|^(−1/2) Σ_{j∈N(u)} y_j),

where N(u) is the set of items with implicit feedback from user u, µ is the global mean rating, b_u is a scalar bias for user u, b_i is a scalar bias for item i, p_u is a feature vector for user u, q_i is a feature vector for item i, and y_j is an implicit feedback feature vector for item j.
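The set similarity of Eqs. (4.5)/(4.6) and the standard SVD++ prediction can be sketched as follows (a minimal illustration; the function signatures and the implicit item vectors y_j are illustrative assumptions, not the exact implementation used in the experiments):

```python
import math

def asymmetric_cosine(a, b, alpha=0.5):
    """Similarity of Eqs. (4.5)/(4.6): |A ∩ B| / (|A|^α |B|^(1−α)).
    α = 0.5 recovers the plain cosine similarity between two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) ** alpha * len(b) ** (1 - alpha))

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y, N_u):
    """Standard SVD++ prediction: µ + b_u + b_i +
    q_i · (p_u + |N(u)|^(−1/2) Σ_{j∈N(u)} y_j).
    `y` maps each item j to its implicit feedback vector y_j."""
    if N_u:
        scale = 1.0 / math.sqrt(len(N_u))
        implicit = [scale * sum(y[j][d] for j in N_u)
                    for d in range(len(p_u))]
    else:
        implicit = [0.0] * len(p_u)
    latent = [p + imp for p, imp in zip(p_u, implicit)]
    return mu + b_u + b_i + sum(qd * ld for qd, ld in zip(q_i, latent))
```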

According to the results reported in the MSD Challenge [3], we set the α in user-based CF and item-based CF to 0.5. Table 4.1 lists the preliminary results of MAP@10 and recall. As the table shows, the performance of all the implemented methods, except for the random baseline, seems to be reasonable, achieving about 0.30 to 0.38 in terms of MAP. Among the four methods, FM obtains the highest performance, showing that FM can be a competitive framework for this task. We therefore focus on the use of FM hereafter.


Table 4.1: Evaluation result of CF-based algorithms

Model MAP@10 Recall

Table 4.2: Performance of factorization machine with different feature combinations

Features MAP@10 Recall

4.2.2 FM with Content-based Features

Next, we evaluate the performance of content-based recommendation. It has been well known that the CF-based approach usually suffers from the so-called "cold start" problem that occurs when new items and new users are considered. The problem can be alleviated by adding some content-based features, making it possible to recommend a new song by comparing its audio information to the audio information of other songs. As shown in the first two rows of Table 4.2, the content-based method (i.e., U + S + Cb) outperforms the CF-based one (i.e., U + S) by a great margin.
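As an illustration of how such feature combinations might be fed to FM, the following sketch assembles a single sparse input vector for the U + S + Cb setting (the index layout and function name are assumptions for illustration; the actual feature encoding may differ):

```python
def build_fm_features(user_id, song_id, audio_features, n_users, n_songs):
    """Assemble one sparse FM input vector for the U + S + Cb setting:
    a one-hot user block, a one-hot song block, and dense audio
    (content-based) descriptors appended at the end. The index layout
    is an illustrative assumption."""
    features = {}                        # sparse {index: value}
    features[user_id] = 1.0              # U: one-hot user indicator
    features[n_users + song_id] = 1.0    # S: one-hot song indicator
    offset = n_users + n_songs
    for d, value in enumerate(audio_features):
        features[offset + d] = value     # Cb: dense audio descriptors
    return features
```

Context-based blocks (e.g., mood tags or VAD values) would be appended after the audio block in the same way.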

4.2.3 FM with Content-based and Context-based Features

We evaluate the performance of context-based recommendation by using Mood Tags and VAD, both of which represent the users' mood. As shown in the third and fourth rows of Table 4.2, adding the Mood Tags feature improves the performance from 0.3817 to 0.4134 in terms of MAP@10. This result shows that the contextual information about users' mood indeed improves the performance of recommendation. With the other contextual feature, VAD, derived from user-generated context, the performance is even higher, with the MAP@10 attaining 0.4483. This result implies that the VAD feature provides more emotional information about the user context, which might not be easily captured by mood tags only.

Table 4.3: Performance of user similarity and item similarity on the LiveJournal Dataset

Note: For the feature abbreviations, please refer to Table 3.1.

Finally, we evaluate the hybrid model that combines the two contextual features with the content-based features (i.e., U + S + Cb + M + VAD). As the last row of Table 4.2 shows, this hybrid model greatly outperforms the content-based method, achieving 0.50 and 0.65 in terms of MAP and recall, respectively. The performance differences between the hybrid model and the CF-based or content-based models are significant under the two-tailed t-test (p-value < 0.001). On the other hand, we also provide an experimental result without using the user-provided Mood Tags (i.e., U + S + Au + VAD). Comparing it to the purely hybrid CF+CB method, we still see a great performance improvement. In sum, the experimental results suggest that the contextual information mined from user-generated articles improves the quality of music recommendation.
