Chapter 4 Methodology
4.2 Measures
4.2.1 The Assumption for Relevance
The data we use in the training phase are UIT relations, which represent only the act of tagging rather than the results of recommendation. Because no ground truth for recommendation is available, we decide whether an item is relevant by the following assumption: if the user 𝑢 has not used the tag 𝑞 to tag an item 𝑖 that 𝑢 has tagged, then 𝑞 is irrelevant to 𝑖 for 𝑢. Because individuals understand the same word differently, 𝑞 may not refer to 𝑖 in the vocabulary of 𝑢, even though 𝑞 is relevant to 𝑖 in general. The difference becomes more pronounced when a polysemous word is used as the query.
If we also treated as relevant other tags that are relevant only in the general case, the precision of recommendation might suffer. Moreover, to take into account tags that the target user has not used, we would have to cluster tags before evaluation, which would complicate the experiment. Expanding the set of relevant tags would thus require running another tag-clustering model before ours, and the evaluation would fall into the trap of circularity. Hence, for evaluation we use only the tags that a user has actually applied rather than an expanded set.
Besides, is it possible that the user 𝑢 does not tag the item 𝑖 with a tag 𝑞 that really is relevant to 𝑖 for 𝑢? Figure 4-2 shows the average distribution of tagging frequency per user. About 10 tags have frequencies above the average frequency, so the tags that are familiar to a user are few. While tagging, a user is likely to reach for a familiar tag rather than an unfamiliar one, unless the item itself is unfamiliar. For this reason, in both the training phase and the testing phase, we use only queries whose tagging frequency exceeds the average for the target user.
Figure 4-2 The average distribution of tagging frequency per user.
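The relevance assumption and the query filter of this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the (user, item, tag) tuple layout, the function names `relevant_items` and `frequent_queries`, the sample data, and the reading of "average" as the mean frequency over the tags of one user are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def relevant_items(uit):
    """Map each (user, tag) pair to the set of items the user tagged with that tag.

    Under the assumption of Section 4.2.1, every other item is treated as
    irrelevant to the tag for that user.
    """
    relevant = defaultdict(set)
    for user, item, tag in uit:
        relevant[(user, tag)].add(item)
    return relevant

def frequent_queries(uit, user):
    """Return the tags whose tagging frequency exceeds the user's average frequency."""
    freq = Counter(tag for u, _, tag in uit if u == user)
    if not freq:
        return set()
    average = sum(freq.values()) / len(freq)
    return {tag for tag, count in freq.items() if count > average}

# Hypothetical UIT relations, for illustration only.
uit = [
    ("u1", "i1", "jazz"), ("u1", "i2", "jazz"), ("u1", "i3", "jazz"),
    ("u1", "i3", "live"), ("u2", "i1", "rock"),
]
relevant = relevant_items(uit)
queries = frequent_queries(uit, "u1")
```

In this toy data, user u1 uses "jazz" three times and "live" once, so only "jazz" lies above the average frequency of two and would be kept as a query.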
4.2.2 Normalized Discounted Cumulative Gain
To evaluate the suitability of the predicted content ranking for each item ranking task, we use the Normalized Discounted Cumulative Gain (NDCG) proposed by Järvelin and Kekäläinen [12]. The basic idea of NDCG is that highly relevant items appearing lower in the ranked list should be penalized: the relevance value of each item is reduced in logarithmic proportion to its rank position, giving a modified relevance value.
[Figure 4-2 axes: rank position of tags sorted by tagging freq. (horizontal), freq. (vertical); series: tagging freq., avg. freq.]
Thus, the sum of the modified relevance values of the first 𝑝 items is called the Discounted Cumulative Gain (DCG), which is defined as follows
$$\mathrm{DCG}(p) = \sum_{rank=1}^{p} \frac{2^{rel(rank)} - 1}{\log_2(1 + rank)} \qquad (36)$$
where 𝑟𝑒𝑙(𝑟𝑎𝑛𝑘) is the relevance of the item whose rank position is 𝑟𝑎𝑛𝑘. Besides, to compare retrieval performance across different queries, normalization across queries is needed. The Ideal Discounted Cumulative Gain (IDCG) is introduced to represent the ranked list produced by a perfect ranking algorithm, in which the items are sorted by their relevance scores. In practice we sort the items by their relevance to obtain the IDCG and can then compute the NDCG, which is defined as follows
$$\mathrm{NDCG}(p) = \frac{\mathrm{DCG}(p)}{\mathrm{IDCG}(p)} \qquad (37)$$
Because the ground truth of the relevance scores for a query is lacking, M. Clements et al. [5, 6] create a gain vector 𝐠 of zeros with length |𝐼| (i.e., one entry per item). To avoid rewarding the prediction of content that has received a low rating, the predicted rank positions of the held-out validation items that correspond to a positive opinion 𝐫 ∈ {3, 3.5, 4, 4.5, 5} are assigned the values g ∈ {1, 2, 3, 4, 5}, respectively. In other words, an item with a low rating is treated as irrelevant. However, the rating of an item does not map directly to its relevance score, because there is no relation between quality (i.e., rating) and relevance. Suppose, for example, that an item rated 2.5 is relevant and another item rated 5 is irrelevant. Under the evaluation in [5, 6], the relevant item would be neglected because of its low rating, while the irrelevant one would be counted as contributing to the suitability of the predicted content ranking. Hence we revise the assumption: the rating of an item maps to its relevance score only on the premise that the item is relevant.
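As a worked illustration of the DCG and NDCG definitions, consider a hypothetical ranked list of three items whose relevance values, in rank order, are 3, 0 and 2. The numbers are invented for this example, and the computation assumes the discount log2(1 + rank) that is conventionally paired with the gain 2^rel − 1.

```latex
\begin{aligned}
\mathrm{DCG}(3)  &= \frac{2^{3}-1}{\log_2 2} + \frac{2^{0}-1}{\log_2 3} + \frac{2^{2}-1}{\log_2 4}
                  = 7 + 0 + 1.5 = 8.5 \\
\mathrm{IDCG}(3) &= \frac{2^{3}-1}{\log_2 2} + \frac{2^{2}-1}{\log_2 3} + \frac{2^{0}-1}{\log_2 4}
                  \approx 7 + 1.893 + 0 = 8.893 \\
\mathrm{NDCG}(3) &= \frac{\mathrm{DCG}(3)}{\mathrm{IDCG}(3)} \approx \frac{8.5}{8.893} \approx 0.956
\end{aligned}
```

The only difference between the two sums is that the item with relevance 2 moves from rank 3 to rank 2 in the ideal ordering, which is what the normalization measures.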
We directly use the ratings of the items that are relevant to the query as their relevance values in the NDCG. According to the relevance assumption stated above, the relevant items are among those the target user has tagged: only an item tagged by the target user 𝑢 can be judged relevant or irrelevant to the query 𝑞 for 𝑢. Moreover, we assign a rating of 0 to every irrelevant item. By considering all items in the list (i.e., 𝑝 = |𝐼|), the Discounted Cumulative Gain (DCG) now accumulates the discounted gain of each item:
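$$\mathrm{DCG}(u, q) = \sum_{i \in I} \frac{2^{r_{u,q}(i)} - 1}{\log_2(1 + rank(i))} \qquad (38)$$
where $r_{u,q}(i)$ is the relevance of the item 𝑖 to the query 𝑞 for the user 𝑢 and $rank(i)$ is the rank position of 𝑖 in the predicted list.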
The DCG value is normalized by dividing it by the optimal DCG value, i.e., the IDCG, which is computed with the same relevance values sorted in descending order. The NDCG can be written as in Eq. (39).
$$\mathrm{NDCG}(u, q) = \frac{\mathrm{DCG}(u, q)}{\mathrm{IDCG}(u, q)} \qquad (39)$$
The mean of the NDCG over all validation users can be written as in Eq. (40).
$$\overline{\mathrm{NDCG}}(U) = \frac{1}{|U|} \sum_{u \in U} \mathrm{NDCG}(u, q) \qquad (40)$$
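The user-level evaluation described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function names, the dictionary layout, the sample ratings, the discount log2(1 + rank), and the use of a single query per user are all assumptions made here.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance values given in rank order (rank starts at 1)."""
    return sum((2 ** rel - 1) / math.log2(1 + rank)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_items, relevance):
    """NDCG of one ranked list; `relevance` maps item -> rating, missing items count as 0."""
    gains = [relevance.get(item, 0.0) for item in ranked_items]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def mean_ndcg(rankings, relevances):
    """Average NDCG over users; both arguments are dicts keyed by user."""
    scores = [ndcg(rankings[u], relevances[u]) for u in rankings]
    return sum(scores) / len(scores)

# Hypothetical data: predicted rankings, and the ratings of the items that
# each user tagged with the query (all other items are irrelevant, i.e. 0).
rankings = {"u1": ["i1", "i2", "i3"], "u2": ["i2", "i1", "i3"]}
relevances = {"u1": {"i1": 5.0, "i3": 2.5}, "u2": {"i1": 4.0}}
scores = {u: ndcg(rankings[u], relevances[u]) for u in rankings}
average = mean_ndcg(rankings, relevances)
```

For user u2 the only relevant item sits at rank 2 instead of rank 1, so the score reduces to 1 / log2(3); a ranking that already lists the items in descending order of rating scores exactly 1.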
As the rank position increases, the influence of an item's relevance on the NDCG decreases logarithmically. A ranking can therefore yield a large NDCG value while its precision and recall are small. For this reason the NDCG cannot be used on its own to evaluate performance in information retrieval, and we also use precision and recall.
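The following sketch illustrates how a ranking can score well on the NDCG while precision and recall stay low. The ranked list, the cutoff `k` that defines the retrieved set, and the discount log2(1 + rank) are hypothetical choices made for this example only.

```python
import math

def ndcg(relevances):
    """NDCG of relevance values listed in predicted rank order (rank 1 first)."""
    def dcg(values):
        return sum((2 ** v - 1) / math.log2(1 + rank)
                   for rank, v in enumerate(values, start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list of 100 items: one item rated 5 at rank 1,
# nine items rated 1 at ranks 92 to 100, every other item irrelevant (0).
relevances = [5] + [0] * 90 + [1] * 9
k = 10  # hypothetical cutoff: the top k items form the retrieved set

retrieved_relevant = sum(1 for r in relevances[:k] if r > 0)
total_relevant = sum(1 for r in relevances if r > 0)
precision_at_k = retrieved_relevant / k
recall_at_k = retrieved_relevant / total_relevant
score = ndcg(relevances)
```

Here the single highly rated item at rank 1 dominates the gain, so the NDCG stays above 0.9, whereas only one of the ten retrieved items is relevant and only one of the ten relevant items is retrieved.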
4.2.3 Precision and Recall
In information retrieval, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Precision and recall are defined as follows
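$$\mathrm{precision} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|} \qquad (41)$$
$$\mathrm{recall} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|} \qquad (42)$$
The relation between precision and recall is shown in Figure 4-3.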
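[Figure 4-3 diagram: a grid with columns Relevant and Irrelevant and rows Retrieved and Not Retrieved; the upper-left cell contains the retrieved relevant items.]
Figure 4-3 Precision and recall are the quotients of the upper-left region (orange) divided by, respectively, the region with the red boundary and the region with the blue boundary.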
A perfect precision score of 1.0 means that all the items retrieved by the search engine are relevant, and a perfect recall score of 1.0 means that all the relevant items are retrieved. Notice that neither statement says how many items are retrieved. If a search engine retrieves only the item with the highest rank score, that item is almost certainly relevant, so precision is close to 1.0 while recall is typically low. Conversely, if a search engine retrieves all items whatever the query is, it always obtains a recall score of 1.0, but its precision is typically low. Thus, neither measure should be used alone while the other is neglected.
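The two degenerate cases above can be checked with a short sketch of Eqs. (41) and (42). The function name `precision_recall` and the ten-item collection are hypothetical and serve only as an illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of item identifiers."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection of ten items, three of which are relevant.
all_items = {f"i{n}" for n in range(1, 11)}
relevant = {"i1", "i2", "i3"}

top_one = precision_recall(relevant, {"i1"})        # retrieve only the best-ranked item
everything = precision_recall(relevant, all_items)  # retrieve every item
```

Retrieving a single relevant item gives a precision of 1.0 with a recall of one third, while retrieving the whole collection gives a recall of 1.0 with a precision of 0.3, so each measure on its own can be satisfied trivially.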