Chapter 5 Experiment and Evaluation 53
We calculated, for each user, the ratio of the frequency of the tag at each rank to the user's total tag frequency, and then averaged these ratios over all users in our filtered data set. The result is shown in Fig. 5.4. The average tag ratios follow a power-law distribution, which means that a small number of tags dominate the weights in a (semantic) tag-based user profile. We therefore decided to build tag concepts from each user's top 30 tags for computational efficiency; the average ratio of the 30th-ranked tag is only 0.6%. Another reason for selecting the top 30 tags is that a single request to Delicious returns only the top 30 tag frequencies of a bookmark. Because we later measure similarities between user profiles and bookmark profiles in the empirical evaluation, we also restrict user profiles to the top 30 tags for fairness.
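The averaging of rank-wise tag ratios described above can be sketched as follows. This is a minimal illustration, not the code used for Fig. 5.4: the function names and the per-user input format (a dict mapping tag to usage count) are assumptions, and users with fewer tags than a given rank are assumed to contribute a ratio of 0 at that rank.

```python
def rank_ratios(tag_counts, top_k=30):
    """Ratios of a user's top_k tag frequencies to the user's total tag frequency."""
    total = sum(tag_counts.values())
    if total == 0:
        return []
    ranked = sorted(tag_counts.values(), reverse=True)[:top_k]
    return [count / total for count in ranked]


def average_rank_ratios(users, top_k=30):
    """Average the ratio at each rank over all users.

    Assumption: a user with fewer than r tags contributes 0 at rank r.
    """
    sums = [0.0] * top_k
    for tag_counts in users:
        for rank, ratio in enumerate(rank_ratios(tag_counts, top_k)):
            sums[rank] += ratio
    return [s / len(users) for s in sums] if users else sums


# Hypothetical toy data: two users with tag usage counts.
users = [
    {"design": 6, "web": 3, "css": 1},
    {"python": 4, "code": 4},
]
print(average_rank_ratios(users, top_k=3))
```

Plotting the returned list against rank on log-log axes is one way to check visually whether the averaged ratios look like a power law.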
5.3 Example Result
Table 5.3 lists the semantic similarities between the tag design and some other tags. The values are the relative semantic similarities based on WordNet, ConceptNet, and Google snippets.
Table 5.3: Example Result: Semantic Similarities between tag design and other tags
5.4 Empirical Evaluation
After crawling the user and bookmark data from Delicious and measuring the semantic similarities between the top 15,000 tags based on WordNet, ConceptNet, and Google snippets, we can construct three semantic tag-based user profiles for each user, one per semantic resource, and compare them with the baseline method, the tag-based user profile described in Eq. 3.1.
We apply 5-fold cross validation to evaluate the performance of our proposed approaches. Cross validation is a technique for assessing how well a model learned from training data will perform on future unseen data (the testing data). In 5-fold cross validation, every user's bookmark collection is partitioned into 5 subsets, and the process is repeated 5 times. Each time, a single subset is retained as the testing data and the other 4 subsets serve as the training data. The final evaluation result is the average performance over the 5 runs, each with a different subset as the testing set. That is, for each user u's bookmark collection D_u, we randomly select 80% of the bookmarks as the training data for constructing four user profiles (the tag-based user profile and the semantic tag-based user profiles based on WordNet, ConceptNet, and Google snippets), and use the other 20% as the testing data, which serve as the ground truths in our evaluation.
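The per-user partitioning can be illustrated with a short Python sketch. The function name, the fixed seed, and the round-robin assignment of shuffled bookmarks to folds are assumptions made for illustration; the text only specifies that each collection is split into 5 subsets, each used once as the testing data.

```python
import random


def five_fold_splits(bookmarks, k=5, seed=0):
    """Yield (train, test) pairs so that each of the k folds is the test set once."""
    items = list(bookmarks)
    random.Random(seed).shuffle(items)          # random partition of the collection
    folds = [items[i::k] for i in range(k)]     # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [b for j, fold in enumerate(folds) if j != i for b in fold]
        yield train, test


# Hypothetical collection of 10 bookmark ids: each round uses 8 for
# training (80%) and hides 2 as ground truths (20%).
for train, test in five_fold_splits(range(10)):
    print(sorted(train), sorted(test))
```

The evaluation result for a user would then be the average of the chosen measure over the five (train, test) rounds.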
For each round of the 5-fold cross validation, we first construct three types of tag-based profiles for each bookmark in the testing set, each consisting of the bookmark's top 30 distinct tags with their associated weights. Second, for every user and each type of tag-based profile, we calculate the similarities between the user profile and the bookmark profiles of the same type. We then sort the similarities to obtain the ranks of all the ground truths, i.e., the user's hidden bookmarks. The higher the ground truths rank, the more accurate the profile is. In total, we obtain three ranked lists per user, one for each type of profile, and we report the evaluation results by different performance measures in the following subsections.
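The ranking step can be sketched as below. This is only an illustration under stated assumptions: profiles are represented as dicts mapping tags to weights, and plain cosine similarity stands in for whichever similarity measure the profile type actually uses (the semantic profiles use the measure proposed earlier in the thesis, which is not reproduced here). All names are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse tag-weight vectors (dicts)."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def ground_truth_ranks(user_profile, bookmark_profiles, hidden):
    """Rank all candidate bookmarks by similarity and return the 1-based
    ranks of the user's hidden bookmarks (the ground truths)."""
    ranked = sorted(
        bookmark_profiles,
        key=lambda b: cosine(user_profile, bookmark_profiles[b]),
        reverse=True,
    )
    return {b: i + 1 for i, b in enumerate(ranked) if b in hidden}


# Hypothetical toy data.
user = {"design": 2.0, "css": 1.0}
bookmarks = {
    "b1": {"design": 1.0},
    "b2": {"python": 1.0},
    "b3": {"css": 1.0, "design": 1.0},
}
print(ground_truth_ranks(user, bookmarks, hidden={"b1", "b2"}))
```

Running this once per profile type gives one ranked list per type for the same user, which is the input the measures in the following subsections operate on.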
5.4.1 Precision-Recall Graph
In the area of Information Retrieval, the most common performance measures are precision and recall. Precision is the fraction of the retrieved bookmarks that are ground truths, and recall is the ratio of the number of retrieved ground truths to the total number of ground truths. Computed over the entire testing set, precision and recall do not account for the rankings of the ground truths in the retrieved data. In our evaluation, the higher the ground truths rank, the better the profile performs. Therefore, we compute precision and recall at different cut-off points, namely precision at n (P@n) and recall at n (R@n), defined below:
\[ P(u)@n = \frac{|D_u \cap Q(u, n)|}{n} \tag{5.1} \]
\[ R(u)@n = \frac{|D_u \cap Q(u, n)|}{|D_u|} \tag{5.2} \]
where Q(u, n) is the set of user u's top n most similar bookmarks in the testing set.
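As a concrete illustration of Eqs. 5.1 and 5.2, the following Python sketch computes P@n and R@n from a ranked list. The function names and the list/set representation are assumptions; `ranked` plays the role of the similarity-sorted testing set and `ground_truth` the role of D_u.

```python
def precision_recall_at_n(ranked, ground_truth, n):
    """Return (P@n, R@n) for a ranked list and a non-empty set of ground truths."""
    top_n = ranked[:n]                                   # Q(u, n)
    hits = sum(1 for b in top_n if b in ground_truth)    # |D_u ∩ Q(u, n)|
    return hits / n, hits / len(ground_truth)


def precision_recall_curve(ranked, ground_truth):
    """(P@n, R@n) at every cut-off n from 1 to the length of the ranked list."""
    return [
        precision_recall_at_n(ranked, ground_truth, n)
        for n in range(1, len(ranked) + 1)
    ]


# Hypothetical ranked list with two hidden bookmarks, b1 and b2.
ranked = ["b3", "b1", "b4", "b2"]
for n, (p, r) in enumerate(precision_recall_curve(ranked, {"b1", "b2"}), start=1):
    print(n, p, r)
```

The list of (precision, recall) pairs over all cut-offs is what a precision-recall graph plots.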
With the P@n and R@n values at all cut-off points from 1 to the number of bookmarks in the testing set, we can plot a precision-recall graph, which shows the trade-off between precision and recall, as in Fig. 5.5. Trying to increase recall typically brings more false data into the query result, thereby reducing precision. Thus precision-recall graphs have a classical concave shape, which depicts the degradation of precision at n as one traverses the ranked list.
Figure 5.5: Evaluation Result by Precision-Recall Graph
An improvement in a precision-recall graph means an increase in both precision and recall. In other words, the entire curve must move up and out to the right so that both precision and recall are higher at every point along the curve. In the precision-recall graph in Fig. 5.5, all three semantic tag-based user profiles perform better than the baseline, the tag-based user profile. The major differences between the curves lie in the range where recall is under 0.1, which means that the semantic tag-based user profiles rank the first few ground truths considerably higher than the tag-based user profile does.
5.4.2 Rank Accuracy Measures
Rank accuracy metrics measure the ability of a recommendation algorithm to produce a recommended ordering of items that matches how the user would have ordered the same items, and these metrics are more appropriate to evaluate algorithms that will be used to present ranked lists to the user. Thus we utilize two measures, mean reciprocal rank (MRR) and half-life utility measure [3], to compare the performances of three types of profiles.
The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, and the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries. We define q_{u,i} as user u's i-th most similar ground truth and the mean reciprocal rank as:
\[ \mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}(q_{u,1})} \tag{5.3} \]
where rank(i) is a function that returns the rank of item i in a given ranked list, and U is the set of users in the evaluation.
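The mean reciprocal rank as described above can be computed in a few lines of Python. The input format is an assumption: a dict mapping each user to the 1-based ranks of that user's ground truths, so that the smallest rank corresponds to the first correct answer.

```python
def mean_reciprocal_rank(ranks_by_user):
    """Average, over users, of 1 / (rank of the user's best-ranked ground truth)."""
    total = 0.0
    for ranks in ranks_by_user.values():
        total += 1.0 / min(ranks)       # only the first correct answer counts
    return total / len(ranks_by_user)


# Hypothetical ranks of hidden bookmarks for three users.
ranks_by_user = {"u1": [2, 5], "u2": [1, 9], "u3": [4]}
print(mean_reciprocal_rank(ranks_by_user))
```

Because only the best-ranked ground truth of each user enters the sum, the measure is insensitive to where the remaining ground truths fall in the list.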
From the result in Fig. 5.6, we can see that all three semantic tag-based profiles (STBPs) perform better than the baseline: the MRR of the STBP based on WordNet is 0.093, that of the STBP based on ConceptNet is 0.098, that of the STBP based on Google snippets is 0.0975, and that of the baseline is 0.067. The result shows that, on average, a user's first similar ground truth in the testing data ranks higher under the STBPs than under the tag-based profile (TBP).
Figure 5.6: Evaluation Result by Mean Reciprocal Rank (MRR)
Mean reciprocal rank considers only the rank of the first correct answer in a ranked list. We should also consider all the ground truths in a ranked list. The half-life utility metric attempts to evaluate the utility of a ranked list, where utility is defined as the difference between the user’s rating for an item and the “default rating” for an item. The default rating is generally a neutral rating. Breese et al. [3] presented the half-life utility metric for recommender systems; it is designed for tasks where the user is presented with a ranked list of results and is unlikely to browse very deeply into the list. For example, most Internet users will not browse very deeply into the results returned by search engines.
In our data set, the rating of each bookmark is binary because a bookmark either is or is not in a user's bookmark collection, so we let the rating r be 1 if the bookmark is among the user's ground truths. We define the half-life utility metric as:
\[ HU_u = \sum_{i} \frac{r}{2^{(\mathrm{rank}(q_{u,i}) - 1)/(h - 1)}} \tag{5.4} \]
where h is the half-life, i.e., the rank of the item on the list such that there is a 50% chance that a user will view that item. We let h be 10 in Eq. 5.4.
The overall score for a data set across all users is shown in Eq. 5.5. HU_i^{max} is the maximum achievable utility if the system ranked the items in the exact order that user i ranked them; in other words, all of user i's hidden bookmarks are at the top of the ranked list.
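A small Python sketch may help make the half-life utility concrete. The per-user function follows Eq. 5.4 with binary rating r = 1 and h = 10. The overall score is written here as the ratio of summed utilities to summed maximum achievable utilities, which is an assumption modeled on the normalization in Breese et al. [3]; the function names and the input format (user to list of ground-truth ranks) are likewise hypothetical.

```python
def half_life_utility(ranks, h=10, rating=1.0):
    """Eq. 5.4: each ground truth at 1-based rank k contributes
    rating / 2 ** ((k - 1) / (h - 1))."""
    return sum(rating / 2 ** ((rank - 1) / (h - 1)) for rank in ranks)


def max_half_life_utility(num_items, h=10, rating=1.0):
    """Utility when all of a user's ground truths occupy ranks 1..num_items."""
    return half_life_utility(range(1, num_items + 1), h, rating)


def overall_half_life_utility(ranks_by_user, h=10):
    """Assumed normalization: summed utility over summed maximum utility."""
    achieved = sum(half_life_utility(r, h) for r in ranks_by_user.values())
    best = sum(max_half_life_utility(len(r), h) for r in ranks_by_user.values())
    return achieved / best


# With h = 10, a single ground truth at rank 10 is worth half of one at rank 1.
print(half_life_utility([1]), half_life_utility([10]))
print(overall_half_life_utility({"u1": [3, 40], "u2": [1, 2]}))
```

Unlike the mean reciprocal rank, every ground truth contributes to the score, with a contribution that halves every h - 1 ranks further down the list.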
The result of the half-life utility metric is shown in Fig. 5.7. All three STBPs again perform better than the TBP: the utility of the STBP based on WordNet is 0.0293, that of the STBP based on ConceptNet is 0.0308, that of the STBP based on Google snippets is 0.0313, and that of the baseline is 0.0244. The half-life utility metric thus shows that semantic tag-based profiles are better than tag-based profiles when the ranks of all the correct answers in a ranked list are considered.
Figure 5.7: Evaluation Result by Half-life Utility Measure (axis: Half-life Utility Measure, 0 to 0.035; series: baseline, WordNet, ConceptNet, Google)
5.5 User Study
Although we used the precision-recall graph, mean reciprocal rank, and the half-life utility measure to evaluate the performance of our proposed semantic tag-based profiles based on different semantic resources, this kind of evaluation considers users' history data only. All unseen bookmarks are treated as wrong answers, which is an unreasonable assumption. Therefore, we designed a user study to recover the part missing from the empirical evaluation.
5.5.1 User Study Design
We designed a web page, shown in Fig. 5.8, to collect results from subjects. The requirement for a subject is an account on Delicious with enough bookmarks for constructing tag-based profiles. For each subject, we construct three profiles from the subject's whole bookmark collection for evaluation: a semantic tag-based profile based on WordNet, a semantic tag-based profile based on ConceptNet, and a tag-based profile as the baseline. For each profile of a subject, we measure the similarities between the profile and all bookmarks in our data set, excluding the bookmarks in the subject's collection. We then select the top 10 bookmarks from each profile and present the resulting bookmarks, at most 30, in random order.
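The selection of items shown to a subject can be sketched as follows. This is an illustrative reading of the procedure, with hypothetical names: `scores_by_profile` maps each profile type to a dict of bookmark similarity scores, and bookmarks selected by more than one profile are assumed to appear only once in the pool, which is why the pool holds at most 30 items.

```python
import random


def build_rating_pool(scores_by_profile, owned, top_n=10, seed=None):
    """Take the top_n unseen bookmarks per profile, merge them without
    duplicates, and return them in random order."""
    pool, seen = [], set()
    for scores in scores_by_profile.values():
        candidates = [b for b in scores if b not in owned]   # skip the subject's own bookmarks
        top = sorted(candidates, key=scores.get, reverse=True)[:top_n]
        for b in top:
            if b not in seen:
                seen.add(b)
                pool.append(b)
    random.Random(seed).shuffle(pool)   # hide which profile selected which item
    return pool


# Hypothetical similarity scores for three profile types.
scores = {
    "wordnet": {"a": 0.9, "b": 0.5, "c": 0.1},
    "conceptnet": {"a": 0.2, "b": 0.8, "d": 0.7},
    "baseline": {"c": 0.9, "d": 0.3, "own": 1.0},
}
print(build_rating_pool(scores, owned={"own"}, top_n=2, seed=1))
```

Shuffling the merged pool keeps subjects from inferring the source profile of an item from its position on the page.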
Figure 5.8: A Screen Shot of User Study
We put the data of the selected bookmarks into the web page for subjects to click. Subjects are asked to click each item and give a rating after reading the page content. They first see the data of each item, including the page title and the associated tags retrieved from Delicious. They can also see their profiles as tag clouds. After clicking the title of an item, the subject sees an information bar containing the title, the rating stars, and the text “More Info.” for showing the associated tags, with the page content displayed below, as in Fig. 5.9. After reading the content of the item, the subject rates the item according to his or her preference. The rating score ranges from 1 to 5. We also provide an icon for subjects to click if the server hosting the web page returns an error, the item has been removed, the web page is in a language the subject cannot read, etc.
Figure 5.9: Let the subject give a rating after reading the web page content
5.5.2 User Study Result
We recruited 8 subjects for our user study, and they rated 211 web pages in total. We apply the half-life utility measure to evaluate the performance of the three different types of profiles. The rating r in the half-life utility measure ranges from 1 to 5 according to the subjects' ratings, and the maximum achievable utility HU_i^{max} is obtained by setting the ratings of all of user i's items to 5.
Figure 5.10: Evaluation Result of User Study
We use HU@n to view the average performance over all subjects' top-n items only, and we show the results for HU@1, HU@3, HU@5, and HU@10. In the results in Fig. 5.10, the most similar item selected by the baseline method shows the best performance at HU@1, which means that the subjects on average rated it higher than the top-1 items selected by the semantic tag-based profiles. The semantic tag-based profiles based on ConceptNet achieve the best utilities at HU@3, HU@5, and HU@10. The utilities of the semantic tag-based profiles based on WordNet are slightly lower than those of the baseline method at HU@3 and HU@5, but higher at HU@10.
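For the user study, HU@n with graded ratings might be computed along the following lines. This sketch rests on several assumptions that the text does not settle: the rating itself (1 to 5) is used as the utility without subtracting a default rating, h stays at 10, and the score is the summed utility over all subjects divided by the summed utility obtained when every item is rated 5. The function name and input format are hypothetical.

```python
def hu_at_n(ratings_by_subject, n, h=10, max_rating=5):
    """Normalized half-life utility over each subject's top-n items.

    ratings_by_subject: one list per subject, holding that subject's
    ratings in the order the profile ranked the items (best first).
    """
    achieved = 0.0
    best = 0.0
    for ratings in ratings_by_subject:
        for position, rating in enumerate(ratings[:n]):
            discount = 2 ** (position / (h - 1))   # rank = position + 1
            achieved += rating / discount
            best += max_rating / discount          # all items rated 5
    return achieved / best if best else 0.0


# Hypothetical ratings from two subjects for one profile type.
ratings = [[4, 5, 2, 3], [3, 3, 5, 1]]
for n in (1, 3):
    print(n, hu_at_n(ratings, n))
```

Comparing the values of this score across profile types at a fixed n corresponds to comparing the bars within one HU@n group.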
Chapter 6
Conclusion
In this thesis we proposed semantic tag-based user profiles, which enrich the original tag-based user profiles with tag concepts. Each tag concept represents a common concept by a core tag and a set of semantically similar tags. We also proposed a similarity measure for semantic tag-based user profiles that eliminates the deficiency of measuring similarity between tag-based user profiles. Applying cosine similarity to two distinct tags always yields zero, but applying the same method to two tag concepts yields a nonzero similarity whenever the two concepts overlap. With this similarity measure, we can also find similar users or identify items a user is interested in.
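The contrast between distinct tags and overlapping tag concepts can be illustrated with a toy Python example. The tag concepts and their weights below are invented for illustration only, and plain cosine similarity over tag-weight dicts is used as a stand-in for the measure defined in the thesis.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse tag-weight vectors (dicts)."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Two distinct tags share no dimension, so their cosine similarity is zero.
design = {"design": 1.0}
art = {"art": 1.0}

# Hypothetical tag concepts: a core tag plus semantically similar tags.
design_concept = {"design": 1.0, "art": 0.6, "layout": 0.5}
art_concept = {"art": 1.0, "design": 0.6, "painting": 0.4}

print(cosine(design, art))                   # zero: no shared tag
print(cosine(design_concept, art_concept))   # positive: the concepts overlap
```

The overlap on "design" and "art" is what lets the two concepts register as related even though their core tags differ.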
Based on a user's resource collection and the associated sets of tags on social media sites, we can construct a semantic tag-based user profile containing a set of tag concepts to represent the user's interests. We introduced three semantic resources, WordNet, ConceptNet, and Google snippets, with the associated approaches to measure semantic similarities between tags. We presented how to construct a tag concept from a tag by spreading activation with semantic similarities, and then constructed a semantic tag-based user profile as a set of tag concepts from a user's resource collection with associated tags.
From the empirical evaluation, we showed that the semantic tag-based user profiles based on WordNet, ConceptNet, and Google snippets all performed better than the tag-based user profile under 5-fold cross validation on a data set consisting of 20,578 users and 80,000 bookmarks. In the user study, the semantic tag-based user profiles based on ConceptNet showed the best utility except when only the top-1 item is considered.
6.1 Summary of Contributions
• We proposed a semantic similarity measure for tag-based profiles with appropriate properties, and this measure eliminates the deficiency of measuring similarity by cosine similarity.
• We provided an insight into how the semantic tag-based profile of a user can be constructed from tags associated with the user’s social media collection, and the semantic relations preserved in the profile could reflect the user’s interests as the concepts.
• We proposed tag concepts capturing semantic relations between tags, and semantic similarities between tags could be measured based on different semantic resources to represent different meanings. In this thesis we utilized WordNet, ConceptNet, and Google snippets to measure semantic similarity.
6.2 Future Work
According to our definition of semantic tag-based profiles, we can construct different profiles based on different approaches and semantic resources. It may also be possible to combine different semantic resources and their associated similarity measures to construct a single semantic tag-based profile with better performance. For the same tag, different semantic resources may yield tag concepts with distinct sets of tags and associated weights. Thus, combining all such tag concepts into one is an important issue for future work.
How to filter dissimilar tags out of a tag concept is also a research issue. Further, if we can identify dissimilar tags when measuring the similarity between tag concepts, we can probably obtain more accurate semantic similarities between tag concepts and between semantic tag-based profiles.
In this thesis, we constructed semantic tag-based profiles from tag-based profiles whose tag weights are measured by a simple approach. However, tag weights can be determined by different approaches for different circumstances. For example, we can consider a temporal factor and give more weight to tags used recently. We can also combine such factors with our proposed solutions for different purposes.
Bibliography
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] M. Ames and M. Naaman. Why we tag: motivations for annotation in mobile and on-line media. In CHI ’07: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 971–980, New York, NY, USA, 2007. ACM.
[3] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43–52, 1998.
[4] M. J. Carman, M. Baillie, and F. Crestani. Tag data and personalized information retrieval. In SSM ’08: Proceeding of the 2008 ACM workshop on Search in social media, pages 27–34, New York, NY, USA, 2008. ACM.
[5] A. M. Collins and E. F. Loftus. A spreading-activation theory of semantic processing. Psychological Review, 82(6):407–428, 1975.
[6] S. A. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006.
[7] H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of collaborative tagging. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 211–220, New York, NY, USA, 2007. ACM.
[8] C. Havasi, R. Speer, and J. Alonso. Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent Advances in Natural Language Processing, Borovets, Bulgaria, September 2007.
[9] Y.-C. Huang. Tag-based profile presentation with semantic relationship, June 2008.
[10] C.-C. Hung. Tag-based user profiling for social media recommendation, June 2008.
[11] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In PKDD ’07: Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases, pages 506–514, Berlin, Heidelberg, 2007. Springer-Verlag.
[12] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference Research on Computational Linguistics, pages 9008+, September 1997.
[13] B. Krulwich. Lifestyle finder: Intelligent user profiling using large-scale demographic data. AI Magazine, 18(2):37–45, 1997.
[14] X. Li, L. Guo, and Y. E. Zhao. Tag-based social interest discovery. In WWW ’08: Proceedings of the 17th international conference on World Wide Web, pages 675–684, New York, NY, USA, 2008. ACM.
[15] Y. Li, Z. A. Bandar, and D. McLean. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering, 15(4):871–882, 2003.
[16] H. Liu and P. Singh. Conceptnet — a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[17] A. Maedche and S. Staab. Measuring similarity between ontologies. In EKAW ’02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, pages 251–263, London, UK, 2002. Springer-Verlag.
[18] C. man Au Yeung, N. Gibbins, and N. Shadbolt. A study of user profile generation from folksonomies. In Proceedings of the WWW 2008 Workshop on Social Web and Knowledge Management, 2008.
[19] D. L. Medin, R. L. Goldstone, and D. Gentner. Respects for similarity. Psychological Review, 100:254–278, 1993.
[20] E. Michlmayr and S. Cayzer. Learning User Profiles from Tagging Data and Leveraging them for Personal(ized) Information Access. In WWW ’07: Proceedings of the Workshop on Tagging and Metadata for Social Information Organization, 16th International World Wide Web Conference, May 2007.
[21] P. Mika. Ontologies are us: A unified model of social networks and semantics. In ISWC ’05: Proceedings of the 4th International Semantic Web Conference, volume 3729 of