To demonstrate the importance of the tag count information, we compare the prediction results for high count tags and low count tags in terms of the false negative rate (FNR) on six social tag prediction datasets: cal500, majorminer, and four subsets of the delicious data (dlc1 to dlc4). The results are shown in Table 3.2. The high count tags are tags whose counts are larger than a pre-defined threshold: 5 for cal500, 6 for majorminer, and 50 for the four delicious datasets. The low count tags are tags whose counts equal the smallest tag count in the respective dataset. We use binary relevance SVM (BRSVM) and IBLR to train the multi-label classifiers. The results are obtained from three-fold cross-validation.
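The grouping and the FNR computation described above can be sketched as follows. This is a minimal illustration, not the evaluation code behind Table 3.2: the record layout, the function names, and the toy data are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One record per ground-truth positive tag-instance pair:
# (tag_count, predicted_positive). This layout is hypothetical.
Pair = Tuple[int, bool]


def false_negative_rate(pairs: List[Pair]) -> float:
    """FNR in percent: share of true tag-instance pairs the classifier missed."""
    if not pairs:
        raise ValueError("no tag-instance pairs in this group")
    missed = sum(1 for _, predicted in pairs if not predicted)
    return 100.0 * missed / len(pairs)


def split_by_count(pairs: List[Pair], high_threshold: int) -> Dict[str, List[Pair]]:
    """High: count > threshold. Low: count equals the smallest count in the data."""
    smallest = min(count for count, _ in pairs)
    return {
        "high": [p for p in pairs if p[0] > high_threshold],
        "low": [p for p in pairs if p[0] == smallest],
    }


# Toy data for illustration only; not taken from any of the six datasets.
toy_pairs = [
    (2, False), (2, False), (2, True), (3, True),
    (8, True), (9, False), (12, True), (7, True),
]
groups = split_by_count(toy_pairs, high_threshold=5)
fnr = {name: false_negative_rate(group) for name, group in groups.items()}
print(fnr)
```

Pairs whose counts lie between the smallest count and the threshold belong to neither group, which matches the definitions of high and low count tags given above.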
From Table 3.2, we observe that, on all six datasets, both BRSVM and IBLR have substantially higher false negative rates on the low count tags than on the high count tags. For example, on the cal500 dataset with IBLR, the FNR is 37.61% for the high count tags and 70.94% for the low count tags. We have also conducted the same experiment with other multi-label classifiers such as GLE and MLKNN, and the results are similar to those of BRSVM and IBLR. It is clear that the low count tags are more difficult to recognize than the high count tags. This observation inspires us to further study the relationship between the tag counts and the annotated resources.

Table 3.2: Comparison of Prediction Results of High Count Tags and Low Count Tags in Terms of False Negative Rate (in %)

Dataset      Tag Count Type   Number of Tag-Instance Pairs   BRSVM   IBLR
cal500       High             308                            38.92   37.61
cal500       Low              3395                           75.74   70.94
majorminer   High             686                            31.49   38.96
majorminer   Low              4418                           54.99   55.29
dlc1         High             1497                           65.60   55.85
dlc1         Low              1951                           89.95   82.32
dlc2         High             1658                           69.54   59.83
dlc2         Low              2541                           91.46   83.35
dlc3         High             2142                           67.55   58.17
dlc3         Low              3178                           92.54   83.42
dlc4         High             2362                           64.01   62.53
dlc4         Low              3169                           91.86   77.75
Tables 3.3 and 3.4 show example URLs in the delicious dataset for which the tags “art”, “c”, “education”, “game”, “microsoft”, “music”, “travel”, and “web2.0” are high count tags and low count tags, respectively. For each tag, we select three URLs that are repeatedly annotated with the tag, i.e., the tag is a high count tag for these URLs (Table 3.3). We also select three URLs that are annotated with the tag only a few times, i.e., the tag is a low count tag for these URLs (Table 3.4). Each table also lists the count of the tag and a short description of the URL. For comparison, the last column of Tables 3.3 and 3.4 shows two other high count tags annotated to the same URL.
From Table 3.3, we observe that the high count tags usually capture the salient property of a URL. For example, the URL www.cplusplus.com, a well-known site for C++ programming language reference and documentation, has been annotated with the tag “c” 852 times; and the URL ocw.mit.edu/courses, the well-known MIT OpenCourseWare site, has been annotated with the tag “education” 1024 times. We believe that the association between a URL and its high count tags is intuitive, obvious, and significant.
In contrast, the association between a URL and its low count tags is not salient and is sometimes hard to understand. For example, the URL www.shoutcast.com, which provides free internet radio stations, has been annotated with “art” only 2 times. The reason might be that some people consider music a kind of art. Its high count tags “music” and “radio”, shown in the last column of Table 3.4, indeed capture more salient properties of this site than the “art” tag.
The URL www.python.org/doc, which is the official documentation website of the Python programming language, has been annotated with “c” only 2 times. A plausible reason is that Python provides an application programming interface (API) for the C language. Its high count tags “python” and “reference” clearly capture more salient properties of this URL than the “c” tag.
The examples in Tables 3.3 and 3.4 illustrate why the low count tags are more difficult to recognize than the high count tags, as the results in Table 3.2 show. Similar observations can be drawn from the cal500 and majorminer datasets.
We believe that the tag counts reflect how confidently the tags apply to the URLs. A low count implies that users are less confident about the tag. Therefore, misclassifying a low count tag should incur a smaller penalty.
Figure 3.1 shows the distribution of the tag counts of the eight selected tags in the delicious data. The horizontal axis indicates the natural logarithm of the tag counts; and the vertical axis indicates the number of instance-tag pairs that
Table 3.3: Some Example URLs with the Eight Example Tags as the High Count Tags in the Delicious Dataset

Target Tag  URL                            Tag Count  Description                                        Other High Count Tags
art         www.drawspace.com              1649       Provides large library of free art lessons         draw, tutori
            www.threadless.com             1383       An online shirt shop that prints designs by users  t shirt, shop
            www.julianbeever.net/pave.htm  1068       Julian Beever’s pavement drawings                  illusion, pavem
c           www.cplusplus.com              852        The C++ resources network                          programming, reference
            aelinik.free.fr/c              771        Teach Yourself C in 24 Hours                       tutori, book
            www.codeblocks.org             535        Open Source, Cross-platform Free C++ IDE           ide, opensource
education   ocw.mit.edu/courses            1024       MIT OpenCourseWare                                 free, course
            oreillyschool.com              330        O’Reilly School of Technology                      programming, oreilly
            itunes.berkeley.edu            726        UC Berkeley on iTunes                              podcasting, itun
game        secondlife.com                 3173       Secondlife is a social game in a 3D virtual world  virtual, social
            www.addictinggames.com         1324       A large source of the free online games            flash, free
            www.romnation.net              542        Provides ROMs and emulators                        rom, emul
microsoft   channel9.msdn.com              870        A social network of MSDN products                  blog, msdn
            www.microsoft.com/mac          347        Microsoft software for MAC                         mac, softwar
            ss64.com/nt                    240        A list of windows command lines                    window, cmd
music       www.last.fm                    14850      A music social network                             radio, social
            www.allmusic.com               4886       An online music database                           reference, database
            www.emusic.com                 1787       MP3 music download website                         mp3, shop
travel      www.tripadvisor.com            3105       Reviews of vacations, hotels, etc.                 hotel, review
            www.onebag.com                 1274       Offers detail about travelling light               pack, howto
            www.expedia.com                1269       Booking flights                                    flight, airfar
web2.0      www.go2web20.net               8900       A web 2.0 directory                                statistic, reference
            digg.com                       3525       A social bookmark website                          social, link
            www.librarything.com           2293       A social network for sharing books                 book, library
Table 3.4: Some Example URLs with the Eight Example Tags as the Low Count Tags in the Delicious Dataset

Target Tag  URL                          Tag Count  Description                                             Other High Count Tags
art         www.shoutcast.com            2          Free internet radio stations                            music, radio
            www.goodyblog.com            2          A blog about babies, kids, parents, and families        blog, parent
            www.tvguide.com              2          TV Guide’s official page for TV news, live event, etc.  television, entertain
c           www.python.org/doc           2          Python Documentation Index                              python, reference
            www.cons.org/cmucl           2          A free Common Lisp implementation                       lisp, programming
            www.arsmathematica.net       2          A blog about math                                       blog, math
education   www.solidworks.com           2          3D Mechanical Design and 3D CAD Software                softwar, cad
            www.ebookee.com              2          The Free eBooks Download Library                        ebook, free
            www.robotvillage.com         2          An online store for renting robots                      robot, store
game        www.butternutsquash.net      2          A blog about humor comics                               webcomic, comic
            www.world-machine.com        2          A terrain generator for game developers                 terrain, 3d
            www.flashkit.com             2          A flash developer resource site                         flash, web design
microsoft   www.dailydoseofexcel.com     2          Daily posts of Excel tips                               excel, blog
            www.vbtutor.net              2          Visual Basic Tutorial                                   tutori, vb
            www.wampserver.com           2          Apache, MySQL, PHP server                               mysql, vb
music       www.deepprose.com            2          Software for personal book and music management         softwar, mac
            www.thefind.com              2          Help shoppers find what they want to buy                shop, search
            www.foxtrot.com              2          A webcomic blog                                         comic, humor
travel      www.gearthhacks.com          2          Provides links to interesting content in Google Earth   google, map
            www.photonet.org.uk          2          A public photo gallery in London                        photography, gallery
            www.arcspace.com             2          A website about architecture                            architecture, design
web2.0      dustindiaz.com/udasss        2          UDASSS Official Documentation                           ajax, javascript
            rollerweblogger.org/project  2          An open source Java blog software                       blog, java
            www.pstut.com                2          Easy to follow photoshop tutorials                      tutori, design
Figure 3.1: Histogram for the tag counts of the eight selected tags in the delicious data. (Horizontal axis: natural logarithm of the tag count, from 0 to 8; vertical axis: number of instance-tag pairs.)
fall into each interval. In general, there are more low count tags than high count tags, and the tag count distribution roughly follows a power law (we make a similar observation on the cal500 and majorminer datasets). When a standard multi-label classification algorithm is used to predict the tags, the noisy low count tags can dominate the positive examples and cause problems for the learner. To address this problem, we propose using the tag count information to train a cost-sensitive multi-label classifier that minimizes the training error associated with the tag counts.
More specifically, the training process should assign a higher importance weight (i.e., a higher misclassification cost) to correctly classifying the reliable high count tags and a lower importance weight (i.e., a lower misclassification cost) to the low count tags.
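One way to picture this weighting is a hinge loss in which each instance-tag pair is scaled by a cost derived from its tag count. The sketch below is only an illustration under stated assumptions: the logarithmic cost function, the unit cost for absent tags, and all function names are hypothetical choices made for the example, not the formulation proposed in this work.

```python
import math
from typing import Sequence


def pair_cost(tag_count: int) -> float:
    """Hypothetical misclassification cost for one instance-tag pair.

    Assumption: log(1 + count) for observed tags, so the cost grows with the
    count, and a unit cost for absent tags (count 0).
    """
    if tag_count < 0:
        raise ValueError("tag counts must be non-negative")
    return math.log1p(tag_count) if tag_count > 0 else 1.0


def weighted_hinge_loss(scores: Sequence[float], tag_counts: Sequence[int]) -> float:
    """Cost-weighted hinge loss for one tag over all training instances.

    scores[i] is a real-valued classifier output for instance i; the label is
    +1 when the tag was assigned at least once and -1 otherwise.
    """
    if len(scores) != len(tag_counts):
        raise ValueError("scores and tag_counts must have the same length")
    total = 0.0
    for score, count in zip(scores, tag_counts):
        label = 1.0 if count > 0 else -1.0
        total += pair_cost(count) * max(0.0, 1.0 - label * score)
    return total


# The same mistake (score -1.0 on a positive pair) costs more when the
# tag count is high than when it is low.
low_miss = weighted_hinge_loss([-1.0], [2])
high_miss = weighted_hinge_loss([-1.0], [771])
print(low_miss < high_miss)
```

Under this assumed cost, a learner that minimizes the weighted loss is pushed to get the high count pairs right first, while errors on pairs with a count of 2 contribute comparatively little.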