
The Optimization Problem

Chapter 3 Supervised FolkRank

3.2 The Optimization Problem

The training data contains the ternary relation among users, tags and items, so we can determine whether an item is relevant or not. According to the definition of NDCG, items are ranked by their relevance. Since no direct measure of relevance is available, we take the ratings as relevance. However, the relevance of an item to the query is not directly related to its rating: a rating only measures quality, and it can serve as relevance only when the item is already known to be relevant. To avoid harming precision, we set the rating of an irrelevant item to 0.

We modify the objective function proposed by H. Valizadegan et al. [25] to

where the notation 〈·〉𝐹 denotes the expectation over all possible rankings induced by the ranking function 𝐹, Q is the query set used to train for u, 𝐼𝑢 is the item set [...] and the selected query q as input. Here, we use the FolkRank model proposed above as the ranking function.

By feeding the difference between the output scores of every pair of items into the logistic function, the rank position can be approximated as in Eq. (29):

$$\langle \pi_q(i) \rangle = 1 + \sum_{j \neq i} \langle \pi_q(i, j) \rangle \qquad (29)$$

In the exact (non-smoothed) case, each pairwise competition works as follows: the winner, which has the larger score, adds 0 to its position, while the loser, which has the smaller score, adds 1. This competition can be written as the non-differentiable function

$$\langle \pi_q(i, j) \rangle = \begin{cases} 0, & \text{if } F_{u,q}(i) - F_{u,q}(j) > 0 \\ 1, & \text{if } F_{u,q}(i) - F_{u,q}(j) < 0 \end{cases}$$

For instance, the item with the largest score never receives a 1 in any competition, so by Eq. (29) its rank position is 1. Our goal is therefore to make the gap between the approximation and the exact case as small as possible. Because the logistic function is used to simulate the rank positions, the outcome of each pairwise competition approaches the exact case as η grows. This relation is depicted in Figure 3-2: the larger η is, the more precise the approximation. At the same time, as η increases the logistic function approaches the non-differentiable step function.

However, η cannot be set arbitrarily large. Although a large η makes the approximation precise, 〈𝜋𝑞(𝑖, 𝑗)〉 then behaves like the non-differentiable step function, and the derivatives of the objective function may become too large to compute because double-precision floating-point numbers overflow.
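The effect of η can be illustrated with a minimal sketch. It assumes the logistic form 1 / (1 + exp(η(F(i) − F(j)))) for the smoothed competition outcome, which is not stated explicitly in this section; the function names `pairwise_loss_prob` and `approx_rank` and the toy scores are hypothetical.

```python
import math

def pairwise_loss_prob(score_i, score_j, eta):
    """Smooth stand-in for the 0/1 outcome: close to 1 when item i scores below item j."""
    z = eta * (score_i - score_j)
    # 1 / (1 + exp(z)), written with tanh so that a large |z| cannot overflow
    return 0.5 * (1.0 - math.tanh(0.5 * z))

def approx_rank(scores, i, eta):
    """Approximate rank position of item i: 1 plus the sum of smoothed outcomes over j != i."""
    return 1.0 + sum(
        pairwise_loss_prob(scores[i], s_j, eta)
        for j, s_j in enumerate(scores)
        if j != i
    )

scores = [0.9, 0.1, 0.5]  # toy scores (assumed), not taken from the dataset
soft = [approx_rank(scores, i, eta=1.0) for i in range(len(scores))]
sharp = [approx_rank(scores, i, eta=1000.0) for i in range(len(scores))]
```

With a small η the approximate positions are blurred towards the middle of the list, whereas with a large η they coincide with the exact positions 1, 3 and 2. The tanh formulation is only one way to sidestep the overflow mentioned above; it does not remove the problem of very large derivatives.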

Figure 3-2 The relation between η and the precision of the logistic function.

Using the above approximation for 〈𝜋𝑞(𝑖)〉, ℋ̅ (𝑢, 𝑄, 𝐹) could be written as

The relevance of the item at the first rank has the greatest influence on the NDCG measure. As the rank position increases, the influence of an item's relevance decreases logarithmically. Thus, if we used the objective function without the Taylor-expansion approximation, the optimized model might prefer to push one relevant item to the first rank while leaving the others far behind. With the Taylor-expansion approximation, by contrast, the optimization may push each item forward equally.


Figure 3-3 The NDCG measure is affected by the rank position of items logarithmically. On the other hand, if we approximate it by Taylor expansion, it [...]

[...] function would be minimized. In our implementation, we use the BFGS algorithm [4, 7, 8, 9, 23] to find the optimum. Since only three parameters have to be trained, we compute the inverse Hessian matrix directly rather than approximating it iteratively as the L-BFGS algorithm [14] does. Because 𝐹𝑢,𝑞(𝑖) is polynomial, we can rewrite (𝐹𝑢,𝑞(𝑖) − 𝐹𝑢,𝑞(𝑗)) as 𝐹𝑢,𝑞,𝑖,𝑗(𝛼, 𝛽, 𝛾), where 𝛼, 𝛽, 𝛾 are the parameters to be optimized. Thus the derivatives of M(𝑢, 𝑄) with respect to 𝛼, 𝛽 and 𝛾 can be computed independently. For example, the derivatives of M(𝑢, 𝑄)

Our objective function is not convex, so gradient-based methods may not find the global minimum. We address the problem of local minima by starting from several different initial points and selecting, as the optimized parameters, the result that yields the smallest value of the objective function.
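The multi-start strategy can be sketched as follows. This is an illustration only: `toy_objective` is a hypothetical one-parameter non-convex function rather than our objective, and plain gradient descent with numerical gradients is used as a simple stand-in for BFGS.

```python
def toy_objective(params):
    """Hypothetical non-convex function with two local minima (not the thesis objective)."""
    x = params[0]
    return (x * x - 1.0) ** 2 + 0.3 * x

def numerical_gradient(f, params, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(params)):
        up = params[:k] + [params[k] + h] + params[k + 1:]
        down = params[:k] + [params[k] - h] + params[k + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def local_descent(f, start, lr=0.01, steps=2000):
    """Plain gradient descent, used here only as a simple stand-in for BFGS."""
    params = list(start)
    for _ in range(steps):
        grad = numerical_gradient(f, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

def multi_start(f, starts):
    """Run the local optimizer from every start point and keep the lowest objective value."""
    candidates = [local_descent(f, s) for s in starts]
    return min(candidates, key=f)

best = multi_start(toy_objective, [[-2.0], [0.5], [2.0]])
```

In this toy case two of the three start points end in the shallower minimum near x = 0.96, and only the start at −2.0 reaches the deeper one near x = −1.04, which is the one that `multi_start` returns.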


Chapter 4 Methodology

4.1 Data Preparation

The dataset we use is from LibraryThing and was collected by M. Clements et al. [5, 6]. LibraryThing is a social network about books. Users can rate and tag books, from which personalized catalogs are created. Based on a user's preferences, the system suggests other users who may have similar interests and recommends books the user may like.

After pruning the books and tags that appear in fewer than 5 user profiles [5], there are 7279 users, 10559 tags and 37232 books. The pruned dataset has 2056487 UIT relations. The derived UT, R and IT matrices have densities of 5.2 × 10⁻³, 2.8 × 10⁻³ and 2.0 × 10⁻³, respectively. However, many users tag and rate the same item repeatedly with slight differences. We use the rating from the user's last record for the item in the UIT relations and accumulate all the tags that the user gives to the item in the dataset. Thus, the real densities of the matrices are slightly lower than those stated above.
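The handling of repeated records can be sketched as below. The record layout `(user, item, tag, rating)` and the function name `consolidate` are assumptions made for illustration; the actual file format of the dataset is not described here.

```python
def consolidate(records):
    """records: iterable of (user, item, tag, rating) tuples in dataset order (assumed layout)."""
    ratings = {}  # (user, item) -> rating from the last record seen
    tags = {}     # (user, item) -> set of all tags the user gave the item
    for user, item, tag, rating in records:
        ratings[(user, item)] = rating  # later records overwrite earlier ones
        tags.setdefault((user, item), set()).add(tag)
    return ratings, tags

records = [
    ("u1", "book1", "fantasy", 4.0),
    ("u1", "book1", "dragons", 4.5),  # same user and item: the rating is replaced, the tag is added
    ("u2", "book1", "fantasy", 3.0),
]
ratings, tags = consolidate(records)
```

Here `ratings[("u1", "book1")]` ends up as 4.5, the value of the last record, while both tags are kept for that pair.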

As shown in Figure 4-1, we split the data into two parts, namely the training set and the testing set. To split the D matrix, we take the UIT relations given by the first 3000 users as the training set and the remaining relations as the testing set. The training set thus contains 3000 users, 8009 tags and 36596 items, while the testing set contains 4280 users, 8071 tags and 37101 items. In the training phase, we use the training set to optimize the parameter vector of our model; in the testing phase, we validate the performance of the optimized model.
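A user-based split of this kind might look like the following sketch. It assumes that "first" refers to the order in which users first appear in the records, which is not specified above; `split_by_user` and the toy records are hypothetical.

```python
def split_by_user(records, n_train_users):
    """Put every record of the first n_train_users users (by first appearance) into training."""
    order = []
    seen = set()
    for record in records:
        user = record[0]
        if user not in seen:
            seen.add(user)
            order.append(user)
    train_users = set(order[:n_train_users])
    train = [r for r in records if r[0] in train_users]
    test = [r for r in records if r[0] not in train_users]
    return train, test

records = [
    ("u1", "book1", "fantasy"),
    ("u2", "book1", "classic"),
    ("u1", "book2", "scifi"),
    ("u3", "book3", "poetry"),
]
train, test = split_by_user(records, n_train_users=2)
```

Because the split is made by user rather than by record, no user contributes relations to both sets.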


Figure 4-1 The data preparation by splitting the D matrix into two parts.

4.2 Measures for Evaluation

4.2.1 The Assumption for Relevance

The data used in the training phase are UIT relations, which only record tagging behavior rather than the results of recommendation. Since no ground truth for recommendation is available, we decide relevance by the following assumption: if the user 𝑢 has not used the tag 𝑞 to tag an item 𝑖 that 𝑢 has tagged, the tag 𝑞 is irrelevant to the item 𝑖 for the user 𝑢. Because different individuals may understand the same word differently, 𝑞 may not refer to 𝑖 in the language system of the user 𝑢, even though 𝑞 is relevant to 𝑖 in common usage. The difference becomes more obvious when a polysemous word is taken as a query.

If we presumed that other tags, which are relevant in the common case, were also relevant for the user, the precision of recommendation might suffer. Moreover, to take into account tags that the target user has not used, we would have to cluster tags before evaluation, which would complicate the experiment. Expanding the relevant tags for evaluation would require running another tag-clustering model before ours, and the evaluation would then fall into the trap of circularity. Hence we use only the tags that a user has actually used rather than expanding them for evaluation.

Still, is it possible that the user u does not tag the item 𝑖 with a tag 𝑞 that really is relevant to 𝑖 for u? Figure 4-2 shows the average distribution of tagging frequency per user. Only about 10 tags have frequencies above the average frequency, so the tags familiar to a user are few. While tagging, a user tends to use a familiar tag at hand rather than an unfamiliar one, unless the item itself is unfamiliar. Therefore, in both the training phase and the testing phase, we only use queries whose tagging frequency is above the average for the target user.
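The query selection rule can be sketched as follows, assuming that the "average" is the mean tagging frequency over the distinct tags of one user; the function name `select_queries` and the toy tag list are hypothetical.

```python
from collections import Counter

def select_queries(user_tag_assignments):
    """user_tag_assignments: list of tags, one entry per tagging action of a single user."""
    freq = Counter(user_tag_assignments)
    if not freq:
        return []
    average = sum(freq.values()) / len(freq)  # mean frequency over the user's distinct tags
    return sorted(tag for tag, count in freq.items() if count > average)

tags_used = ["scifi"] * 6 + ["history"] * 3 + ["poetry"] * 2 + ["maps"]
queries = select_queries(tags_used)
```

In this toy profile the average frequency is 3, so only `scifi` (used 6 times) is kept as a query; `history`, used exactly 3 times, is not above the average.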

Figure 4-2 The average distribution of tagging frequency per user.

4.2.2 Normalized Discounted Cumulative Gain

To evaluate the suitability of the predicted content ranking for each item ranking task, we use the Normalized Discounted Cumulative Gain (NDCG) proposed by K. Järvelin and J. Kekäläinen [12]. The basic idea of NDCG is that highly relevant items appearing lower in the resulting ranked list should be penalized: the relevance value of each item is discounted logarithmically with its rank position.

[Figure 4-2 plot: axes "Rank position of tags sorted by tagging freq." and "Freq."; series "Tagging Freq." and "Avg. Freq."]

Thus, the sum of the discounted relevance values of the first 𝑝 items is called the Discounted Cumulative Gain (DCG), which is defined as follows:

$$\mathrm{DCG}(p) = \sum_{rank=1}^{p} \frac{2^{rel(rank)} - 1}{\log_2(1 + rank)} \qquad (36)$$

where 𝑟𝑒𝑙(𝑟𝑎𝑛𝑘) is the relevance of the item whose rank position is 𝑟𝑎𝑛𝑘. Besides, to compare retrieval performance across different queries, normalization across queries is needed. The Ideal Discounted Cumulative Gain (IDCG) represents the ranked list produced by a perfect ranking algorithm, in which the resulting permutation is sorted by the relevance scores of the items. In practice, we sort the items by their relevance to obtain the IDCG, and can then compute the NDCG, which is defined as follows:

$$\mathrm{NDCG}(p) = \frac{\mathrm{DCG}(p)}{\mathrm{IDCG}(p)} \qquad (37)$$

Lacking ground-truth relevance scores for the query, M. Clements et al. [5, 6] create a gain vector 𝐠 of zeros with length |𝐼| (i.e., one entry per item). To avoid predicting content that has received a low rating, the predicted rank positions of the held-out validation items with a positive opinion 𝐫 ∈ {3, 3.5, 4, 4.5, 5} are assigned the values g ∈ {1, 2, 3, 4, 5}, respectively. In other words, an item with a low rating is treated as irrelevant. However, the rating of an item does not map directly to its relevance score, because quality (i.e., rating) and relevance are unrelated. Consider a case in which an item with rating 2.5 is relevant and another item with rating 5 is irrelevant. Under the evaluation in [5, 6], the relevant item would be neglected because of its low rating, while the irrelevant one would count as contributing to the suitability of the predicted content ranking. Hence we revise the assumption: the rating of an item maps to its relevance score only on the premise that the item is relevant.


We directly use the ratings of the items that are relevant to the query as their relevance in the NDCG. According to our assumption of relevance stated above, the relevant items are a subset of the items the target user has tagged; only items tagged by the target user 𝑢 can be judged relevant or irrelevant to the query 𝑞 for 𝑢. Moreover, the rating of an irrelevant item is set to 0. By considering all items in the list (i.e., 𝑝 = |𝐼|), the Discounted Cumulative Gain (DCG) now accumulates the discounted gain of each item:

$$\mathrm{DCG}(u, q) = \sum_{i \in I} \frac{2^{r_{u,q}(i)} - 1}{\log_2(1 + \pi_q(i))} \qquad (38)$$

The DCG value is normalized by dividing it by the optimal DCG value, i.e., the IDCG, which is computed using a static state vector in descending order. The NDCG can be written as in Eq. (39). The mean of the NDCG over all validation users can be written as in Eq. (40).

NDCG̅̅̅̅̅̅̅̅(𝑈) = 1

As the rank position increases, the influence of an item's relevance on the NDCG decreases logarithmically. A ranking may therefore yield a large NDCG value while its precision and recall are small. The NDCG thus cannot be used on its own when evaluating performance in information retrieval. Hence, we also use precision and recall.
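The relevance assumption of this section can be illustrated with a minimal sketch. It assumes the common gain (2^rel − 1) and discount log2(1 + rank), and reuses the case of a relevant item rated 2.5 and an irrelevant item rated 5; the names `dcg`, `ndcg` and the toy items are hypothetical.

```python
import math

def dcg(relevances):
    """relevances: relevance values listed in ranked order (position 1 first)."""
    return sum(
        (2.0 ** rel - 1.0) / math.log2(1.0 + rank)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(ranked_items, ratings, relevant):
    """Relevance is the rating for relevant items and 0 for every other item."""
    rels = [ratings.get(i, 0.0) if i in relevant else 0.0 for i in ranked_items]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

ratings = {"a": 5.0, "b": 2.5, "c": 4.0}
relevant = {"b", "c"}  # "a" is highly rated but was not tagged with the query
good = ndcg(["c", "b", "a"], ratings, relevant)
bad = ndcg(["a", "b", "c"], ratings, relevant)
```

Under this assumption, ranking the irrelevant but highly rated item "a" first lowers the NDCG, whereas a scheme that maps ratings to gains without checking relevance would reward it.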

4.2.3 Precision and Recall

In information retrieval, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Precision and recall are defined as follows

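$$\mathrm{precision} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|} \qquad (41)$$

$$\mathrm{recall} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|} \qquad (42)$$

The relation between precision and recall is shown in Figure 4-3.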

[Figure 4-3 diagram: a 2 × 2 grid with columns "Relevant" and "Irrelevant" and rows "Retrieved" and "Not Retrieved"; the upper-left cell is labeled "Retrieved relevant items".]

Figure 4-3 Precision and recall are the quotients of the upper-left region (orange) by the region with the red boundary and the region with the blue boundary, respectively.

A perfect precision score of 1.0 means that all the items retrieved by the search engine are relevant, and a perfect recall score of 1.0 means that all the relevant items are retrieved. Notice that neither statement says how many items are retrieved. If a search engine retrieves only the item with the highest score, that item is almost certainly relevant, so the precision is high. Conversely, if a search engine retrieves all items regardless of the query, it always obtains a recall score of 1.0. Thus, the two measures should be reported together; neither should be used while the other is neglected.
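The two extreme cases just described can be checked with a short sketch of Eqs. (41) and (42); the function name `precision_recall` and the toy item sets are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; empty denominators are reported as 0.0."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

all_items = ["a", "b", "c", "d", "e"]
relevant = ["a", "c"]
top1 = precision_recall(["a"], relevant)            # only the top-ranked item is retrieved
everything = precision_recall(all_items, relevant)  # every item is retrieved
```

Retrieving only the top item gives a precision of 1.0 with a recall of 0.5, while retrieving everything gives a recall of 1.0 with a precision of 0.4, which is why neither measure is informative alone.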

4.3 Evaluation

To make the recommendation system more practical, not only the items tagged by the user but also untagged ones should be retrieved. Therefore we propose a pre-evaluation protocol modified from [13], which we follow in every experiment. For each individual user 𝑢 in the dataset, we randomly select 20% of the items the user 𝑢 has tagged and take them as “unseen” items, which we refer to as 𝑺𝒖,𝒕. We set to zero the elements related to these “unseen” items in 𝐑 and 𝐑𝐓, and subtract one from the elements related to them in 𝐔𝐓, 𝐓𝐔, 𝐈𝐓 and 𝐓𝐈. 𝑺𝒖,𝒂 is the remaining 80% of the items the user 𝑢 has tagged, and 𝑺𝒖,𝒏 is the set of items that the user 𝑢 has not tagged yet. The protocol is depicted in Figure 4-4.

Figure 4-4 For each user, the per-user pre-evaluation protocol is followed: 𝑆𝑢,𝑡 is the 20% randomly selected set of items that the user 𝑢 has tagged, 𝑆𝑢,𝑎 is the remaining 80% of the items the user 𝑢 has tagged, and 𝑆𝑢,𝑛 corresponds to the items that the user 𝑢 has not tagged yet.
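The partition of one user's items into the three sets can be sketched as follows. The function name `holdout_split`, the fixed seed and the rounding of the 20% share are assumptions made for illustration; the matrix updates on 𝐑, 𝐔𝐓, 𝐈𝐓 and their transposes are not shown.

```python
import random

def holdout_split(tagged_items, all_items, fraction=0.2, seed=0):
    """Return (S_t, S_a, S_n) for one user: held-out, remaining tagged, and untagged items."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    tagged = sorted(tagged_items)
    n_unseen = round(fraction * len(tagged))
    s_t = set(rng.sample(tagged, n_unseen))  # "unseen" items
    s_a = set(tagged) - s_t                  # tagged items kept as history
    s_n = set(all_items) - set(tagged)       # items the user has never tagged
    return s_t, s_a, s_n

all_items = {"i%d" % k for k in range(1, 9)}
tagged = {"i1", "i2", "i3", "i4", "i5"}
s_t, s_a, s_n = holdout_split(tagged, all_items)
```

The three sets are disjoint and together cover all items, so every item falls into exactly one of them for a given user.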

Then we apply the TF-IDF weighting shown in Eqs. (20)-(25), with normalization, to each two-dimensional matrix to reduce the influence of frequently occurring elements. After normalizing all two-dimensional matrices, we combine them into the transition matrix 𝐀𝒖. We use our PageRank-like model in Eq. (26) to compute the scores of all items iteratively. Once the scores converge, the static scores are obtained, and we evaluate the prediction performance from them.
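The idea of iterating until the scores become static can be sketched with a generic damped random walk, w ← d·A·w + (1 − d)·p. This update rule is an assumption chosen for illustration and is not Eq. (26); the matrix `A`, the preference vector `p`, the damping factor `d` and the function name `iterate_scores` are all hypothetical.

```python
def iterate_scores(A, p, d=0.85, tol=1e-10, max_iter=1000):
    """A: column-stochastic matrix as a list of rows; p: preference vector summing to 1."""
    n = len(p)
    w = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_w = [
            d * sum(A[r][c] * w[c] for c in range(n)) + (1.0 - d) * p[r]
            for r in range(n)
        ]
        # stop once the scores no longer change (L1 distance below tol)
        if sum(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w

A = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
p = [1.0, 0.0, 0.0]  # all preference weight on node 0, e.g. the query
w = iterate_scores(A, p)
```

In this toy graph the node that receives the preference weight ends with the highest static score, and the two remaining nodes, which are symmetric, end with equal scores.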

Notice that before the scores are calculated, the ratings of the items that belong to 𝑺𝒖,𝒕 are set to zero. The transition matrix 𝐀𝒖 is thus constructed as if the user 𝑢 had never tagged the items in 𝑺𝒖,𝒕.

Notice that our random walk model in the training phase is the same as that in the testing phase. The only difference between the two phases is the evaluation protocol: in the training phase it computes the objective function, and in the testing phase it evaluates the performance. The series of steps that construct the transition matrix can be regarded as the pre-evaluation protocol. With this protocol, we split the UIT relations into two parts. One part contains the UIT relations that either do not involve 𝑢 or involve items that belong to 𝑺𝒖,𝒂, while the other part is 𝑺𝒖,𝒕. The former is taken as the historical information that supports recommendation, while the latter serves as the ground truth. To estimate how suitable the recommendations are for items the user has not seen, we check the retrieved relevant items that belong to 𝑺𝒖,𝒕. If our recommendation system can find most of the relevant items that belong to 𝑺𝒖,𝒕, then items that the user might like but has not seen yet can be retrieved.

On the other hand, given how the recommendation system is used, we also want as many as possible of the relevant items seen before to be retrieved. In both the training phase and the testing phase, we therefore treat retrieved items equally in the evaluation, whether they belong to 𝑺𝒖,𝒕 or 𝑺𝒖,𝒂.

For each individual user 𝑢, we select the tags that 𝑢 has used as the queries 𝑸𝒖. If a tag 𝑡 that 𝑢 uses infrequently were taken as a query, the precision of recommendation would depend mostly on the per-user pre-evaluation protocol. The reason is that if an item 𝑖𝑡, which 𝑢 has tagged with 𝑡, is selected as an “unseen” item, the values of 𝐔𝐓(𝑢, 𝑡) and 𝐓𝐔(𝑡, 𝑢) are decreased by one. Because the relation between 𝑢 and 𝑡 occurs infrequently, the values of 𝐔𝐓(𝑢, 𝑡) and 𝐓𝐔(𝑡, 𝑢) are small, and small values are sensitive to addition and subtraction.

Each query 𝑞 in the query set 𝑸𝒖 is the input of our model, and the permutation of items sorted by their scores in the static state vector is the output. The objective function is computed from the output of our random walk model. When the query process for the user 𝑢 is complete, i.e., every query in 𝑸𝒖 has been used once as the input of our training model, we combine the per-query objective functions by summation.

Our objective function is based on the NDCG [12]. If we simply took the ratings of items as their relevance to a query, the result would be distorted by irrelevant items that we fail to recognize. Therefore, to account for not only the NDCG but also the precision, we define what a relevant item is: an item that the user has tagged with the query is relevant. The evaluation that combines the NDCG and the precision is as follows.

After computing the static scores, we divide all items into two parts, namely relevant items and irrelevant items. Notice that the relevant item set and irrelevant one


Figure 4-5 shows the relation among them.

Figure 4-5: The relations among $S_{u,t}$, $S_{u,a}$, $S_{u,n}$, $S_{u,t}^{q}$, $S_{u,a}^{q}$ and $S_{u,n}^{q}$.

In the training phase, for each item 𝑖 relevant to 𝑞, we use Eq. (31) to predict the rank position of 𝑖 by comparing the static score of 𝑖 with those of the other items. Figure 4-6 shows how we compute the objective function. Notice that the ratings of the items that belong to 𝑺𝒖,𝒕 are considered when computing the objective function. In other words, we use the ratings in 𝐑 before it is processed by the per-user pre-evaluation protocol. Besides, the items that belong to 𝑺𝒖,𝒏 are regarded as irrelevant, and their ratings are set to zero. According to our assumption, items in 𝑺𝒖,𝒏 are therefore neglected.

Figure 4-6: Relevant & irrelevant items in the training phase.

We optimize the parameter vector for each individual user rather than summing the objective functions of all users before optimization. In the training phase, each per-user objective function is not convex, and neither is the sum of all the objective functions. To analyze personal behavior and its influence, we optimize the per-user objective function for each user.

From the distribution of the per-user optimized parameters, we can identify some characteristics of user behavior. For example, the parameter 𝜃 controls the weight of the target user relative to the tag selected as a query. When the value of 𝜃 is large, the user is more influential than the query, and vice versa. Considering a user who prefers to use ambiguous tags, the initial state may not include the query merely

