### 國立臺灣大學電機資訊學院資訊工程學研究所 博士論文

### Department of Computer Science and Information Engineering College of Electrical Engineering and Computer Science

### National Taiwan University Doctoral Dissertation

### 成本導向多標籤學習演算法與應用

### Cost-Sensitive Multi-Label Classification with Applications

### 駱宏毅 Hung-Yi Lo

### 指導教授：林守德 博士、王新民 博士

### Advisors: Shou-De Lin, Ph.D. and Hsin-Min Wang, Ph.D.

### 中華民國 102 年 1 月

### January 2013


**誌謝 (Acknowledgements)**

"But by the grace of God I am what I am." (1 Corinthians 15:10a)

During my doctoral studies I experienced many challenges, difficulties, and setbacks. Many wonderful experiences, hard to put into words, remain in my heart: brother 鄭桂忠 from my church introduced me to Professor Shou-De Lin before I enrolled; I experienced the Lord's peace and strength while taking the last subject of my qualifying exams; I was moved by hymns and received faith and encouragement during the data mining competitions; and I received much unexpected help when submitting my third journal paper. None of these things could have been planned or arranged by my own ability. They testify to God's grace and care, and I give all glory and thanks to God.

I want to thank the two people who contributed the most to this dissertation, my advisors Professor Hsin-Min Wang and Professor Shou-De Lin. I began my national defense substitute service at the Institute of Information Science, Academia Sinica, in January 2005 and stayed on as a research assistant after the service ended; for these eight years, until my graduation from the doctoral program, I was guided and mentored by Professor Wang. In the first few years, when my research did not go smoothly, he was extremely patient with me and kept encouraging me; every time I felt disappointed and discouraged, a one-on-one meeting with him rekindled my confidence and fighting spirit. I also thank him for giving me the freedom to explore in my research, which trained me to work independently. In meetings he always examined the details of my reports very carefully and offered many valuable suggestions. I am also grateful for the great care and the large amount of time he spent on every paper, slide deck, and poster I submitted; each time he read them from beginning to end, word by word, and considered them again and again. Professor Wang's seriousness, rigor, and attention to detail demonstrate, by his own example, the attitude and devotion that research deserves, and he will always be a model for me.

I am also deeply grateful to Professor Shou-De Lin for his guidance after I entered the doctoral program at National Taiwan University. I was among his first cohort of doctoral students after he joined NTU. He has always been very approachable and has created a lab atmosphere full of research enthusiasm, unity, and cohesion. Whenever I ran into difficulties and went to him, he always thought from the students' point of view and helped as much as he could. In 2008 we entered a data mining competition; it started merely as my suggestion for a term project, but he took it very seriously and built up the lab's enthusiasm for competing, which eventually led to the remarkable record of winning awards for several consecutive years. He also gave me many opportunities to grow, such as serving as a teaching assistant, presenting a tutorial at an international conference, and mentoring junior students in the lab, all of which greatly helped me grow in knowledge and in dealing with people. He leads the lab forward with great wisdom, and the way he arranges lab meetings and research details every semester is well worth learning from. His standards for the "attitude" one should have toward people, work, and research will always be a model for me.

At the dissertation proposal review stage, I thank the committee members, Professors 陳銘憲, 林軒田, and 陳信希, for generously offering many valuable comments, which were helpful not only for my research topic but also for revising the journal paper I was submitting at the time. At my doctoral oral examination, the incisive and pertinent comments and suggestions of the committee members, Professors 林智仁, 林軒田, and 陳銘憲, Professor 簡仁宗 of National Chiao Tung University, Professor 曾新穆 of National Cheng Kung University, and Professor 李育杰 of National Taiwan University of Science and Technology, helped me revise the dissertation more thoroughly. I especially thank my master's advisor, Professor 李育杰, who led me into the fields of data mining and machine learning; he taught me data mining step by step and laid a solid foundation for my later research at Academia Sinica and during my doctoral studies. At many crossroads after graduation he also gave me much valuable advice and help. Thank you for your earnest teaching all along the way!


I thank my colleagues in the Speech, Language and Music Processing Laboratory at Academia Sinica for accompanying me through this dissertation: 王如江, 陳冠宇, 李鴻欣, 黃信德, 簡御仁, 陳傳祐, 施羽芩, 游廷碩, 陳羿帆, and many others. I especially thank 如江 for being a great research partner, 廷碩 for sharing much valuable experience about going abroad and working in industry, and 鴻欣, whose thoroughgoing research spirit I greatly admire. I thank my fellow students in the Machine Discovery and Social Network Mining Laboratory at National Taiwan University, especially my collaborators 蕭卓毅, 張峻銘, 江宗憲, 殷圖駿, 魏吟軒, 游書豪, and 蔡青樺. I also thank 李政德 for walking through these five-plus years with me and sharing much research and overseas experience; 楊政倫 for helping me polish the English of my papers; 楊明翰 and 何建彤 for teaching me the DotA game, which added much fun to an otherwise dry research life; my deskmates 顏君釗, 張博詞, and 林瑋詩 for helping me with many computer and seating issues; 嚴恩勗 for a memorable collaboration on a presentation about linear classifiers; 解巽評 for often supplying little-known, mysterious information about the lab; and 賴弘哲 and 曾建霖 for much information and help during my job search before graduation. In addition, I am very grateful to my companions in the CSIE qualifying exam study group: 王界人, 林育仕, 吳信宏, 郭人瑋, 何立勇, 曾志傑, 陳泰瑜, and 董才業. Without your company and our teaching one another, I might not even have passed the qualifying exams; I especially thank 志傑, who later became a good companion and partner of mine in church. I also thank the students and professors who joined the KDD Cup with me; I am glad to have worked hard together with all of you. I thank the friends I came to know during my doctoral studies: Dr. 楊奕軒, whom I met at ISMIR 2009, a great research partner and a model to learn from; Dr. 胡敏君, who helped me a lot with graduation and job hunting; and 任佳珉 and 黎欣捷, who accompanied me through an unforgettable ICASSP 2011 trip to Prague.

Finally, I thank my dear family. Thank you, Dad and Mom, for raising and educating me all these years so that I could pursue the doctoral degree without worry. I thank my wife for accompanying and supporting me without complaint throughout my doctoral studies; on the many nights I stayed up late rushing work, she took care of our home and children alone. I also thank her for giving birth to two lovely sons who have accompanied me along the research road; although they are often mischievous, they have added much innocent laughter to my life. Finally, I thank my younger sister, my parents-in-law, and many other relatives and friends for their support and care. I dedicate this dissertation to you, my dearest ones.

Hung-Yi Lo, January 2013


**摘要 (Chinese Abstract)**

The first part of this dissertation studies a new machine learning problem, which we call cost-sensitive multi-label classification. In this problem, the different labels of each instance can have different misclassification costs. We first use problem-reduction techniques from machine learning to reduce cost-sensitive multi-label classification to cost-sensitive single-label classification. In addition, we propose a method based on a basis expansion model for solving the cost-sensitive multi-label classification problem, called the generalized k-labelsets ensemble. In this ensemble, each basis function is a label powerset classifier, and the expansion coefficients of the basis functions are learned by minimizing the cost-sensitive error; we derive an efficient way to compute the coefficients. The method can also be applied to general multi-label classification. Experimental results on both general multi-label classification and cost-sensitive multi-label classification confirm that the proposed method yields better prediction performance.

How to determine the misclassification costs in an application is an important practical issue. The second part of this dissertation studies two applications of cost-sensitive classification: medical image classification and social tag prediction. In medical image classification, we identify a patient-imbalance problem among the positive examples that severely degrades the predictive ability of the image classifier. Based on cost-sensitive learning, we design patient-balanced learning algorithms, with which we won first place in KDD Cup 2008. For social tag prediction, we propose to use the tag counts as the misclassification costs and to solve the problem with cost-sensitive multi-label learning. Experimental results confirm that, in terms of both cost-sensitive and regular evaluation metrics, the cost-sensitive multi-label learning methods predict better than the method with which we won the MIREX 2009 music tag prediction contest. The experimental results on social bookmark prediction also confirm that the proposed method outperforms other methods.

**Keywords: cost-sensitive multi-label classification, multi-label classification, ensemble classification, music tagging and retrieval, medical image classification, patient-balanced learning, cost-sensitive classification**


**Abstract **

We study a generalization of the traditional multi-label classification, which we refer to as
cost-sensitive multi-label classification (CSML). In this problem, the misclassification cost can be
different for each instance-label pair. For solving the problem, we propose two novel and general
strategies based on the problem transformation technique. The proposed strategies transform the
CSML problem to several cost-sensitive single-label classification problems. In addition, we
*propose a basis expansion model for CSML, which we call the Generalized k-Labelsets Ensemble *
(GLE). In the basis expansion model, a basis function is a label powerset classifier trained on a
*random k-labelset. The expansion coefficients are learned by minimizing the cost-weighted global *
error between the prediction and the ground truth. GLE can also be used for traditional multi-label
classification. Experimental results on both multi-label classification and cost-sensitive multi-label
classification demonstrate that our method has better performance than other methods.

Cost-sensitive classification is based on the assumption that the cost is given according to the
*application. “Where does cost come from?” is an important practical issue. We study two *
real-world prediction tasks and link their data distribution to the cost information. The two tasks
are medical image classification and social tag prediction. In medical image classification, we
observe a patient-imbalanced phenomenon that seriously hurts the generalization ability of the
image classifier. We design several patient-balanced learning algorithms based on cost-sensitive
binary classification. The success of our patient-balanced learning methods is demonstrated by
our winning of KDD Cup 2008.

For social tag prediction, we propose to treat the tag counts as the misclassification costs and model the social tagging problem as a cost-sensitive multi-label classification problem. The experimental results in audio tag annotation and retrieval demonstrate that the CSML approaches outperform our winning method in Music Information Retrieval Evaluation eXchange (MIREX) 2009 in terms of both cost-sensitive and cost-less evaluation metrics. The results on social bookmark prediction also demonstrate that our proposed method has better performance than other methods.

**Keywords: cost-sensitive multi-label classification, multi-label classification, ensemble method, **
tag-based music annotation and retrieval, medical image classification, patient-balanced learning,
cost-sensitive learning

**Contents**

**1** **Introduction** **1**

1.1 Multi-Label Classiﬁcation . . . 2

1.2 Cost-Sensitive Single-Label Classiﬁcation . . . 3

1.3 Cost-Sensitive Multi-Label Classiﬁcation . . . 5

1.4 Dissertation Organization . . . 6

**2** **Cost-Sensitive Multi-Label Classiﬁcation** **7**
2.1 Cost-Sensitive Single-Label Classiﬁcation Methods . . . 7

2.1.1 Cost-Sensitive SVM . . . 8

2.1.2 Cost-Sensitive AdaBoost . . . 9

2.2 Reducing Cost-Sensitive Multi-Label Classiﬁcation into Cost-Sensitive Single-Label Classiﬁcation . . . 10

2.2.1 Cost-Sensitive Stacking . . . 11

2.2.2 *Cost-Sensitive RAk EL . . . 12*

2.3 *Generalized k -Labelsets Ensemble (GLE) . . . 13*

2.3.1 GLE for Multi-Label Classiﬁcation . . . 16

2.3.2 Multi-Label Learning with Hypergraphs . . . 19

2.3.3 GLE for Cost-Sensitive Multi-Label Classiﬁcation . . . 20

**3** **A Study on the Eﬀect of Tag Count in Using Multi-Label Classiﬁ-**
**cation for Social Tag Prediction** **22**
3.1 Datasets . . . 24

3.2 Analysis on the Eﬀect of Tag Count . . . 26

3.3 Evaluation Metrics Considering Tag Counts . . . 31

**4** **Application to Music Tag Annotation and Retrieval** **33**
4.1 Related Works . . . 35

4.2 System Overview . . . 36

4.3 Audio Signal Processing . . . 36

4.3.1 Feature Extraction . . . 38

4.3.2 Audio Segmentation . . . 38

4.4 The MIREX 2009 Winning Method . . . 40

4.4.1 Ranking Ensemble . . . 40

4.4.2 Probability Ensemble . . . 41

4.5 MIREX 2009 Results . . . 41

4.6 Extended Experiments . . . 43

4.6.1 Dataset . . . 44

4.6.2 Model Selection and Evaluation . . . 44

4.6.3 Experimental Results . . . 45

**5** **Experiments of Generalized k-Labelsets Ensemble** **51**
5.1 Experiments on General Multi-Label Classiﬁcation . . . 51

5.1.1 Datasets . . . 51

5.1.2 Evaluation Metrics for Multi-Label Classiﬁcation . . . 52

5.1.3 Experimental Setup . . . 53

5.1.4 Experimental Results . . . 54

5.1.5 *Experimental Results with Diﬀerent γ And ν* . . . 57

5.2 Cost-Sensitive Experiments . . . 59

**6** **Patient-Balanced Learning for Medical Image Classiﬁcation** **71**
6.1 Background . . . 72

6.2 Issues of Model Selection . . . 74

6.3 Class-Balanced SVM . . . 75

6.4 Patient-Imbalanced Problem . . . 76

6.5 Patient Balanced Learning . . . 77

6.5.1 Patient-Balanced SVM (PB-SVM) . . . 78

6.5.2 Patient-Balanced AdaBoost (PB-AdaBoost) . . . 78

6.5.3 Patient-Balanced Learning by Re-Sampling . . . 79

6.6 Experiments . . . 80

6.6.1 Datasets . . . 80

6.6.2 Experimental Setup . . . 81

6.6.3 Experimental Results . . . 82

**7** **Conclusions** **84**

**Bibliography** **86**

**List of Tables**

1.1 An Example of Multi-Label Dataset for Music Tag Annotation . . . . 2

2.1 An Example of Multi-Label Dataset with Transformed Multi-Class Labels . . . 11

3.1 The 45 Tags Used in the MIREX Audio Tag Classiﬁcation Evaluation 24

3.2 Comparison of Prediction Results of High Count Tags and Low Count Tags in Terms of False Negative Rate (in %) . . . 27

3.3 Some Example URLs with the Eight Example Tags as the High Count Tags in the Delicious Dataset . . . 29

3.4 Some Example URLs with the Eight Example Tags as the Low Count Tags in the Delicious Dataset . . . 29

4.1 Some Examples of Audio Clips with Associated Tags Obtained from the MajorMiner Website . . . 34

4.2 Music Features Used in this Work . . . 39

4.3 Evaluation Results of MIREX 2009 Audio Tag Classiﬁcation on the MajorMiner Dataset. There are 12 submissions. Our submissions without and with pre-segmentation are denoted by NOS and SEG, respectively . . . 42

4.4 Evaluation Results of MIREX 2009 Audio Tag Classiﬁcation on the Mood Dataset. There are 12 submissions. Our submissions without and with pre-segmentation are denoted by NOS and SEG, respectively 42

4.5 Performance Rankings of Our Two Submissions to MIREX 2009 Au- dio Tag Classiﬁcation on Two Datasets . . . 43

4.6 Audio Tag Annotation and Retrieval Results of Diﬀerent Classiﬁers and Ensemble Methods on the MajorMiner Dataset (in %) . . . 45

4.7 Audio Tag Annotation and Retrieval Results of Cost-Sensitive Multi-Label Classification Methods in Terms of Cost-Sensitive Metrics (in %) . . . 46

4.8 Audio Tag Annotation and Retrieval Results of Cost-Sensitive Multi- Label Classiﬁcation Methods in Terms of Regular (Cost-Less) Metrics (in %) . . . 47

5.1 Statistics of the Multi-Label Datasets . . . 52

5.2 *Selected Parameters k and M of GLE and RAk EL for the Multi-Label*
Datasets . . . 54

5.3 Experimental Results in Terms of Five Diﬀerent Evaluation Metrics.

The Numbers in Parentheses Represent the Rank of the Algorithm
Among the Compared Algorithms. The Average Rank is the Average
of the Ranks Across All Datasets. *•/◦ indicates whether GLE is*
statistically superior/inferior to the compared algorithm (the pairwise
t-test at the 5% signiﬁcance level). . . 56

5.4 Relative Improvement of GLE and Its Two Simpliﬁed Versions Over
*RAk EL in Terms of Five Diﬀerent Evaluation Metrics (In %) . . . 57*

5.5 Experimental Results in Terms of Two Cost-Sensitive Evaluation
Metrics. The Average Rank is the Average of the Ranks Across All
Datasets. *•/◦ indicates whether GLE is statistically superior/inferior*
to the compared algorithm (the pairwise t-test at the 5% signiﬁcance
level). . . 60

6.1 Statistics of the BC and PE dataset . . . 81

6.2 Performance (AUC, in %) of Diﬀerent Patient-balanced Approaches on the BC Training Set (Cross-Validation) and Test Set. . . 83

6.3 Performance (AUC, in %) of Diﬀerent Patient-balanced Approaches on the PE Training Set (Cross-Validation) and Test Set. . . 83

**List of Figures**

3.1 *Histogram for the tag counts of the eight selected tags in the delicious*
data. . . 30

4.1 Work ﬂow of the proposed audio tag annotation and retrieval system. 37

4.2 Illustration of audio segmentation. . . 49

4.3 Work ﬂow of the classiﬁer ensemble used in the MIREX 2009 winning method. . . 50

5.1 *Average relative improvement of GLE over RAk EL in terms of ﬁve*
*diﬀerent evaluation metrics with respect to diﬀerent parameters k and*
*M . . . 58*

5.2 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Scene Dataset. . . 61

5.3 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Enron Dataset. . . 62

5.4 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Cal500 Dataset. . . 63

5.5 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Majorminer Dataset. . . 64

5.6 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Medical Dataset. . . 65

5.7 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Bibtex Dataset. . . 66

5.8 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Dlc1 Dataset. . . 67

5.9 *Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Dlc2 Dataset. . . 68

*5.10 Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Dlc3 Dataset. . . 69

*5.11 Experimental Results of GLE with Diﬀerent γ And ν in Terms of*
Five Diﬀerent Evaluation Metrics on The Dlc4 Dataset. . . 70

6.1 Histograms of the number of positive instances belonging to each patient on two medical image datasets . . . 77

**Chapter 1** **Introduction**

*Machine learning is the study of computer algorithms that improve automatically through experience* [45]. It plays a key role in many areas of industry, finance, and science. Classification is one of the most important tasks in machine learning. In the traditional classification problem, the computer is given a training set $(\mathbf{x}_i, y_i)_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector of the $i$-th training sample and $y_i$ is the class label. For binary classification, $y_i \in \{+1, -1\}$ is the binary category label, while for multi-class classification, $y_i \in \{1, 2, \ldots, K\}$ denotes one of the $K$ predefined classes. The goal of these classification problems is to learn a classifier that maps data into the predefined classes. The learning procedure can be achieved by minimizing the expected misclassification cost on the training samples. The binary and multi-class classification problems are also called *single-label classification* problems, since each instance can be associated with only a single label. In this dissertation, we focus on an emerging classification problem, called *multi-label classification*, which is described below.

| Instance | Label Set |
|---|---|
| 1 | Rock, Guitar |
| 2 | Rock, Guitar, Drum |
| 3 | Rock, Guitar, Vocal |
| 4 | Country, Guitar |
| 5 | Rock, Guitar, Drum |
| 6 | R&B, Vocal |
| 7 | Country, Guitar |
| 8 | Vocal |

Table 1.1: An Example of a Multi-Label Dataset for Music Tag Annotation

**1.1** **Multi-Label Classiﬁcation**

Multi-label classiﬁcation has attracted a great deal of attention in recent years.

In contrast to single-label classification, in multi-label classification an instance can be associated with a set of labels jointly. For example, in image classification, an image may possess several concepts, such as "sea" and "sunset". Table 1.1 shows an example of a multi-label dataset for music tag classification.

We define the problem of multi-label classification as follows. Let $Y \subseteq \mathcal{L} = \{\lambda_1, \lambda_2, \ldots, \lambda_K\}$, where $\mathcal{L}$ is a finite set of $K$ possible labels. To facilitate the discussion, hereafter $Y$ is represented by a vector $\mathbf{y} = (y_1, y_2, \ldots, y_K) \in \{1, -1\}^K$, in which $y_j = 1 \Leftrightarrow \lambda_j \in Y$ and $y_j = -1 \Leftrightarrow \lambda_j \notin Y$. We denote the labels of all instances by $\mathbf{Y} \in \mathbb{R}^{N \times K}$, where the $i$-th row of $\mathbf{Y}$ is $\mathbf{y}_i$, and denote all instances by $\mathbf{X} \in \mathbb{R}^{N \times d}$, where the $i$-th row of $\mathbf{X}$ is $\mathbf{x}_i$. Given a training set $(\mathbf{x}_i, \mathbf{y}_i)_{i=1}^{N}$ that contains $N$ samples, the goal of multi-label classification is to learn a classifier $H: \mathbb{R}^d \rightarrow 2^K$ such that $H(\mathbf{x})$ predicts well which labels should be assigned to an unseen sample $\mathbf{x}$.

Multi-label classification methods can be grouped into two categories: *algorithm adaptation* and *problem transformation* [55]. The algorithm adaptation methods extend specific learning algorithms for single-label classification to solve the multi-label classification problem. Zhang and Zhou [66] extended the well-known back-propagation algorithm for multi-label learning (BPMLL). Some algorithms are extended from instance-based learning, such as multi-label K-nearest neighbor (MLKNN) [67] and instance-based learning by logistic regression (IBLR) [10]. Elisseeff and Weston [16] proposed an SVM-based method to minimize the ranking loss.

*The problem transformation methods transform the multi-label classiﬁcation*
problem to one or many single-label classiﬁcation tasks. Binary relevance and label
powerset are two popular problem transformation approaches. The binary relevance
(BR) method trains a binary classiﬁer for each label independently. Sun et al. [51]

proposed a two stage learning method to improve the binary relevance method based on hypergraph spectral learning. They exploited hypergraph spectral learning for feature transformation in the ﬁrst stage and trained binary relevance classiﬁers in the second stage. The label powerset (LP) method treats each distinct combination of labels as a diﬀerent class and, thus, treats the multi-label classiﬁcation as a multi-class classiﬁcation problem. In the next section, we introduce the concept of cost-sensitive single-label classiﬁcation.

**1.2** **Cost-Sensitive Single-Label Classiﬁcation**

Non-uniform misclassification costs are very common in a variety of real-world applications. For example, in banking authentication, the cost of mistakenly authenticating an intruder as a normal user is usually larger than that of mistakenly classifying a normal user as an intruder. The former type of mistake might result in serious losses, while the latter can be tolerated.

Current cost-sensitive learning research has been focused on binary or multi-

class classiﬁcation [32, 35]. For cost-sensitive binary classiﬁcation, Elkan proposed
*theoretical foundations [17]. Zadrozny et al. proposed a method called costing,*
*based on rejection sampling method [65]. Costing can make common cost-less bi-*
nary classiﬁcation algorithms cost-sensitive. Cost-sensitive multi-class classiﬁca-
tion research is more challenging than the binary one. MetaCost is the ﬁrst cost-
sensitive multi-class classiﬁcation algorithm, which is based on a threshold-moving
method [14]. Other approaches include re-weighting, sampling [70], and regression-
based methods [59].

Cost-sensitive classiﬁcation starts from the assumption that the cost is given
*based on the application. “Where does cost come from? ” becomes an important*
practical issue [32]. One well-known application of cost-sensitive binary classiﬁcation
is the mailing campaign in KDD Cup 1998. The task is to send mail requests to
potential donors to achieve maximum total proﬁt. Diﬀerent people might donate
different amounts of money. Missing a major donor will result in a large loss (cost),
while sending requests to those who will not donate will also lose money. Besides the
KDD Cup 1998 dataset, there are very few datasets with available cost information.

In this dissertation, we study two real-world prediction tasks and link their data
distribution to the cost information. The two tasks are medical image classiﬁca-
tion and social tag prediction. We study the medical image classiﬁcation problem
on the KDD Cup 2008 breast cancer (BC) dataset and the KDD Cup 2006 pul-
monary embolism (PE) dataset. We observe a *patient-imbalanced* problem that
seriously hurts the generalization ability of the image classifier. We design several
*patient-balanced learning algorithms based on cost-sensitive binary classiﬁcation.*

The success of our patient-balanced learning methods is demonstrated by our winning of KDD Cup 2008.

For social tag prediction, note that the tag count indicates the number of

users who have annotated the given resource with the tag. We believe the tag count
information should be considered in automatic social tagging because the count
*reﬂects the conﬁdence degree of the tag. Important, conﬁdent, and relevant tags*
(with respect to the resources, such as music tracks or websites) are usually assigned
by many diﬀerent users. We propose to treat the tag counts as the mis-classiﬁcation
costs and model the social tagging problem as a cost-sensitive learning problem. The
experimental results in audio tag annotation and retrieval demonstrate that the cost-
sensitive learning approaches outperform our winning method in Music Information
Retrieval Evaluation eXchange (MIREX) 2009. The results on social bookmark
prediction also demonstrate that our proposed method has better performance than
other methods.

**1.3** **Cost-Sensitive Multi-Label Classiﬁcation**

In addition to cost-sensitive single-label classiﬁcation, we study a novel cost-sensitive
multi-label classiﬁcation (CSML) problem, which is a generalization of the tradi-
tional multi-label classiﬁcation. In the CSML, the misclassiﬁcation cost can be
diﬀerent for each instance-label pair and is given before the training process. For
example, in image classiﬁcation, the misclassiﬁcation cost of the label “sunset” for
an image whose subject is sunset should be higher than the misclassiﬁcation cost of
an additional label "gull". More specifically, for each instance $\mathbf{x}_i$, a misclassification cost $c_j$ is coupled with each label $\lambda_j$ belonging to the label set of that instance. We give a generalized definition as follows. We are given a training set $(\mathbf{x}_i, \mathbf{y}_i, \mathbf{c}_i)_{i=1}^{N}$ that contains $N$ samples, where the $j$-th component $c_{ij}$ of $\mathbf{c}_i$ denotes the cost to be paid when the label $y_{ij}$ is misclassified. More specifically, $c_{ij}$ is a false negative cost when $y_{ij} = 1$, and a false positive cost when $y_{ij} = -1$. We denote the misclassification costs of all instances by $\mathbf{C} \in \mathbb{R}^{N \times K}$, where the $i$-th row of $\mathbf{C}$ is $\mathbf{c}_i$. The goal of cost-sensitive multi-label classification is to learn a classifier $H: \mathbb{R}^d \rightarrow 2^K$ such that $H(\mathbf{x})$ minimizes the expected misclassification cost on an unseen sample $\mathbf{x}$.

Since the problem is novel, no existing algorithm can be applied. In this
*work, we ﬁrst propose two general strategies based on the reduction technique.*

The proposed strategies reduce the CSML problem to cost-sensitive single-label
classification problem. In addition, we propose a basis expansion model for CSML,
*called Generalized k -Labelsets Ensemble (GLE), where a basis function is an LP*
*classiﬁer trained on a random k -labelset. The expansion coeﬃcients are learned to*
minimize the global error between the prediction and the ground truth. GLE can
also be used for traditional cost-less multi-label classiﬁcation.

**1.4** **Dissertation Organization**

Now we briefly outline the contents of this dissertation. In Chapter 2, we describe the proposed CSML methods. In Chapter 3, we study the effect of tag count in using multi-label classification for social tag prediction. The observation inspires us to formulate the social tag prediction task as a cost-sensitive multi-label classification problem by treating the tag counts as the misclassification costs. Chapter 4 presents our winning method in the MIREX 2009 audio tagging competition and shows the experimental results of using CSML for audio tagging. The results demonstrate that the CSML approaches outperform our MIREX 2009 winning method. Chapter 5 contains the experimental results of GLE on both general multi-label classification datasets and cost-sensitive social tagging. Chapter 6 describes the patient-imbalanced problem in medical image classification and proposes patient-balanced learning algorithms. Chapter 7 concludes this dissertation.

**Chapter 2**

**Cost-Sensitive Multi-Label** **Classiﬁcation**

In this chapter, we start by reviewing some cost-sensitive single-label classifica-
tion methods. Then, we propose two CSML methods: cost-sensitive stacking and
*cost-sensitive RAk EL, which are based on reducing CSML problem to several cost-*
*sensitive single-label problems. Finally we propose Generalized k -Labelsets Ensem-*
ble (GLE) for both multi-label classiﬁcation and cost-sensitive multi-label classiﬁ-
cation.

**2.1** **Cost-Sensitive Single-Label Classiﬁcation Meth-** **ods**

In cost-sensitive binary classification, we are given a cost-sensitive training set $(\mathbf{x}_i, y_i, c_i)_{i=1}^{N}$, where $c_i \in [0, \infty)$ is the misclassification cost. The goal of cost-sensitive classification is to learn a classifier $f(\mathbf{x})$ that minimizes the expected cost as follows:

$$E[c\, I(f(\mathbf{x}) \neq y)], \qquad (2.1)$$

*where I(·) is an indicator function that yields 1 if its argument is true, and 0*
otherwise. In contrast, the expected cost in the traditional binary classiﬁcation is

deﬁned as:

$$E[I(f(\mathbf{x}) \neq y)], \qquad (2.2)$$

which is a special case of (2.1) where all samples have an equal misclassiﬁcation cost
*c. Many diﬀerent cost-less binary classiﬁers can be extended to the cost-sensitive*
versions. In the following subsections, we describe two cost-sensitive binary classi-
ﬁers: support vector machine (SVM) and AdaBoost.
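As a small worked example of the difference between (2.1) and (2.2), the empirical (training-set) version of the expected cost simply weights each error indicator by its cost. A minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([1, -1, 1, 1])         # true labels y_i
y_pred = np.array([1, 1, -1, 1])         # classifier outputs f(x_i)
c = np.array([1.0, 0.5, 3.0, 2.0])       # per-instance costs c_i

errors = (y_pred != y_true).astype(float)
print(errors.mean())        # 0.5   -- empirical form of Eq. (2.2)
print((c * errors).mean())  # 0.875 -- empirical form of Eq. (2.1)
```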

**2.1.1** **Cost-Sensitive SVM**

SVM ﬁnds a separating surface with a large margin between training samples of
two classes in a high-dimensional feature space implicitly introduced by a computa-
tionally eﬃcient kernel mapping [12]. The large margin implies good generalization
ability according to statistical learning theory. We exploit a linear SVM classiﬁer
$f(\mathbf{x})$ of the following form:

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b. \qquad (2.3)$$

In traditional SVM, the parameters $\mathbf{w} = (w_1, w_2, \ldots, w_d)$ and $b$ can be learned by solving a minimization problem formulated as follows:

$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i, \\
\text{s.t.} \quad & y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \\
& \xi_i \geq 0,
\end{aligned} \qquad (2.4)$$

where $\xi_i$ is the training error associated with instance $\mathbf{x}_i$, and $C$ is a tuning parameter that controls the tradeoff between maximizing the margin and minimizing the training error. A cost-sensitive SVM can be learned by modifying (2.4) to

$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N} c_i\xi_i, \\
\text{s.t.} \quad & y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \\
& \xi_i \geq 0,
\end{aligned} \qquad (2.5)$$

where each cost $c_i$ is associated with the corresponding training error term $\xi_i$.
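As a concrete illustration of (2.5), the per-instance weighting of the slack variables maps directly onto the sample-weight mechanism found in common SVM implementations. The following is a minimal sketch assuming scikit-learn is available; the data and cost values are made up for illustration and this is not the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: six instances with two features and labels in {+1, -1}.
X = np.array([[0.2, 1.0], [0.4, 0.8], [1.2, 0.1],
              [1.5, 0.3], [0.1, 1.4], [1.3, 0.2]])
y = np.array([1, 1, -1, -1, 1, -1])

# Hypothetical per-instance misclassification costs c_i.
c = np.array([5.0, 1.0, 1.0, 2.0, 1.0, 1.0])

# Passing the costs as sample weights scales each slack term,
# mirroring the C * sum_i c_i * xi_i term in Eq. (2.5).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=c)
print(clf.predict(X))
```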

**2.1.2** **Cost-Sensitive AdaBoost**

AdaBoost [19] ﬁnds a highly accurate classiﬁer by combining several base classiﬁers, even though each of them is only moderately accurate. It has been successfully used in applications such as music classiﬁcation [3] and audio tag classiﬁcation [15]. The decision function of the AdaBoost classiﬁer takes the following form:

$$f(\mathbf{x}) = \sum_{t=1}^{T}\alpha_t h_t(\mathbf{x}), \qquad (2.6)$$

where $h_t(\mathbf{x})$ is the prediction score of a base classifier $h_t$ given the feature vector $\mathbf{x}$ of a test sample, $T$ is the number of base classifiers, and $\alpha_t$ can be calculated based on different versions of AdaBoost.

The base classifiers are learned iteratively. In the training phase, AdaBoost [52] maintains a weight vector $D_t$ over the training instances in each iteration and uses a base learner to find a base classifier $h_t$ that minimizes the weighted error according to $D_t$. In each iteration, the weight vector $D_t$ is updated by

$$D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(\mathbf{x}_i))}{Z_t}, \qquad (2.7)$$

where $Z_t$ is a normalization factor that makes $D_{t+1}$ a distribution. We can increase the number of base learners iteratively and stop the training process when the generalization ability on the validation set does not improve. Cost-sensitive AdaBoost [52] can be learned by modifying the update rule of the weight vector $D_t$ in (2.7) to

$$D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t c_i y_i h_t(\mathbf{x}_i))}{Z_t}, \qquad (2.8)$$

where $c_i$ is the cost of training instance $\mathbf{x}_i$. We use a decision tree as the base learner in this study.
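To make the modified update rule concrete, the following is a minimal sketch of a single boosting round with the cost-weighted re-weighting of (2.8), using a decision stump from scikit-learn as the base learner. The toy data, the costs, and the use of the standard discrete-AdaBoost formula for $\alpha_t$ are illustrative assumptions, not the exact configuration used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_sensitive_boost_round(X, y, c, D):
    """One round: fit a base learner on weights D, then apply Eq. (2.8)."""
    h = DecisionTreeClassifier(max_depth=1)       # decision stump
    h.fit(X, y, sample_weight=D)
    pred = h.predict(X)                           # predictions in {+1, -1}

    # Weighted training error and the usual AdaBoost coefficient alpha_t.
    err = np.sum(D * (pred != y)) / np.sum(D)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))

    # Cost-weighted update (Eq. 2.8): misclassified instances with a larger
    # cost c_i receive a larger weight increase.
    D_new = D * np.exp(-alpha * c * y * pred)
    D_new /= D_new.sum()                          # normalization factor Z_t
    return h, alpha, D_new

# Toy usage with made-up data and costs.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
c = np.array([3.0, 1.0, 1.0, 1.0])
D = np.full(len(y), 1.0 / len(y))
h, alpha, D = cost_sensitive_boost_round(X, y, c, D)
```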

**2.2** **Reducing Cost-Sensitive Multi-Label Classiﬁ-** **cation into Cost-Sensitive Single-Label Clas-** **siﬁcation**

*Reduction is a commonly used machine learning technique when the problem can*
not be solved by standard learning algorithms. In the research of traditional multi-
*label classiﬁcation, a category of methods called problem transformation belongs to*
this kind of technique. It transforms the multi-label classiﬁcation problem to one
or many single-label classiﬁcation tasks.

The binary relevance (BR) method is one of the popular problem transfor-
mation approaches. It trains a binary classiﬁer for each label independently. For
each label, the instances with/without the label will be treated as positive/negative
examples for training the corresponding binary classiﬁer. This manner inevitably
loses the co-occurrence information of multiple labels that might be useful. Label
correlation is useful information for multi-label classification, since some labels
often co-occur. For example, in music tag annotation, a song with the “hip hop”

tag is more likely to be also annotated with “rap” than “jazz”, while a song with the “dance” tag is more likely to be also annotated with “electronic” than “guitar”.

Label powerset (LP) [55] method is another problem transformation approach.

It treats each distinct combination of labels in the training set as a diﬀerent class and, thus, treats the multi-label classiﬁcation as a multi-class classiﬁcation prob- lem. Given a test instance, the multi-class LP classiﬁer predicts the most probable class, which can be transformed to a set of labels. Table 2.1 shows an example of multi-label dataset with transformed multi-class label based on the concept of LP. However, one major concern for this model is that, when the number of la- bels increases, the number of potential classes increases proportionally, and each class will be associated with very few training instances. Moreover, LP can only

Table 2.1: An Example of a Multi-Label Dataset with Transformed Multi-Class Labels

| Instance | Label Set | Transformed Class |
|---|---|---|
| 1 | Rock, Guitar | 1 |
| 2 | Rock, Guitar, Drum | 2 |
| 3 | Rock, Guitar, Vocal | 3 |
| 4 | Country, Guitar | 4 |
| 5 | Rock, Guitar, Drum | 2 |
| 6 | R&B, Vocal | 5 |
| 7 | Country, Guitar | 4 |
| 8 | Vocal | 6 |

predict labelsets observed in the training data. In [56], a method called Random
*k -Labelsets (RAk EL) is proposed to overcome the drawback of the traditional LP*
*method. RAk EL randomly selects a number of label subsets from the original set of*
labels and uses the LP method to train the corresponding multi-class classiﬁers. The
*ﬁnal prediction of RAk EL is made by voting of the LP classiﬁers in the ensemble.*

This method can not only reduce the number of classes, but also allow each class to
have more training instances. Experimental results have shown an improvement of
*RAk EL over LP.*
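To make the LP transformation concrete, the following is a minimal sketch that maps each distinct label combination to a multi-class label, which is the preprocessing step shared by LP and by every RAkEL base classifier; the data mirror Table 2.1 and are purely illustrative.

```python
# Label sets of the eight training instances in Table 2.1.
label_sets = [
    {"Rock", "Guitar"}, {"Rock", "Guitar", "Drum"},
    {"Rock", "Guitar", "Vocal"}, {"Country", "Guitar"},
    {"Rock", "Guitar", "Drum"}, {"R&B", "Vocal"},
    {"Country", "Guitar"}, {"Vocal"},
]

# Label powerset transformation: each distinct combination becomes a class.
class_of = {}
transformed = []
for labels in label_sets:
    key = frozenset(labels)
    if key not in class_of:
        class_of[key] = len(class_of) + 1
    transformed.append(class_of[key])

print(transformed)  # [1, 2, 3, 4, 2, 5, 4, 6], matching Table 2.1
```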

Inspired by the reduction methods for multi-label classiﬁcation, we propose two general strategies for reducing the CSML problem to cost-sensitive single-label classiﬁcation problem: a binary relevance based strategy and a label powerset based strategy. We describe these two methods in the following two subsections.

**2.2.1** **Cost-Sensitive Stacking**

In this subsection, we propose a two-stage method called cost-sensitive stacking.

Stacking [63] is a method of combining the outputs of multiple independent classi-
ﬁers for multi-label classiﬁcation. In the ﬁrst stage of cost-sensitive stacking, assume
*that the K labels are independent and we train cost-sensitive binary classiﬁers inde-*

pendently. Then, we use the outputs of all binary classifiers, $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_K(\mathbf{x})$, as features to form a new feature set. Let the new feature vector be $\mathbf{z} = (z_1, z_2, \ldots, z_K)$. We can use the new feature set together with the true labels to learn the parameters $w_{kj}$ of the stacking classifiers:

$$h_k(\mathbf{z}) = \sum_{j=1}^{K} w_{kj}\, z_j, \qquad (2.9)$$

where the weight $w_{kj}$ will be positive if label $j$ is positively correlated with label $k$; otherwise, $w_{kj}$ will be negative. The stacking classifiers can recover misclassified labels by using the correlation information captured in the weights $w_{kj}$.
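A minimal sketch of the two-stage procedure, assuming scikit-learn: the first stage trains one cost-sensitive binary SVM per label with the costs passed as sample weights (as in Section 2.1.1), and the second stage fits a linear stacking model per label on the first-stage outputs. The data shapes and the use of ridge regression for the stacking weights $w_{kj}$ are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import Ridge

def cost_sensitive_stacking_train(X, Y, C):
    """X: (N, d) features; Y: (N, K) labels in {+1, -1}; C: (N, K) costs."""
    N, K = Y.shape
    first_stage, Z = [], np.zeros((N, K))
    # Stage 1: one cost-sensitive binary classifier per label.
    for k in range(K):
        clf = SVC(kernel="linear")
        clf.fit(X, Y[:, k], sample_weight=C[:, k])
        first_stage.append(clf)
        Z[:, k] = clf.decision_function(X)   # f_k(x) becomes feature z_k
    # Stage 2: linear stacking classifiers h_k(z) = sum_j w_kj * z_j.
    second_stage = [Ridge(alpha=1.0).fit(Z, Y[:, k]) for k in range(K)]
    return first_stage, second_stage
```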

**2.2.2** **Cost-Sensitive RAkEL**

*As mentioned in the beginning of Section 2.2, a method called Random k -Labelsets*
*[58] is proposed to realize and improve the LP method. A k -labelset is a labelset*
*R ⊆ L with |R| = k. RAkEL randomly selects a number of k-labelsets from L and*
uses the LP method to train the corresponding multi-label classiﬁers. Algorithms
*1 and 2 describe the training and classiﬁcation processes of RAk EL, respectively.*

The prediction of a multi-class LP classifier $g_m$ for sample $\mathbf{x}$ is denoted by $g_m(\mathbf{x}) \in \{1, 2, \ldots, V\}$. Note that $V$ will be much smaller than $2^k$ if the data is sparse. In Algorithm 2, $q(g_m(\mathbf{x}), j)$ is defined as:

$$q(g_m(\mathbf{x}), j) = \begin{cases} 1 & j \in R_m \text{ and } j \text{ is positive in } g_m(\mathbf{x}), \\ -1 & j \in R_m \text{ and } j \text{ is negative in } g_m(\mathbf{x}), \\ \emptyset & j \notin R_m. \end{cases} \qquad (2.10)$$

For example, when $k = 2$, the classes 1, 2, 3, and 4 correspond to $(1, 1)$, $(1, -1)$, $(-1, 1)$, and $(-1, -1)$, respectively. If label $j$ is not included in $R_m$, then $q(g_m(\cdot), j)$ is undefined. If label $j$ corresponds to the first label of $R_m$, then $q(1, j)$, $q(2, j)$, $q(3, j)$, and $q(4, j)$ will be $1$, $1$, $-1$, and $-1$, respectively.
*We extend RAk EL for cost-sensitive multi-label classiﬁcation. The extension*
*is not straightforward since we are given a cost value for each label but RAk EL*

**Algorithm 1** The training process of RAkEL

- **Input:** number of models $M$, size of labelset $k$, set of labels $\mathcal{L}$, and the training set $\mathcal{D} = (\mathbf{x}_i, \mathbf{y}_i)_{i=1}^{N}$
- **Output:** an ensemble of LP classifiers $g_m$ and the corresponding $k$-labelsets $R_m$

1. Initialize $S \leftarrow \mathcal{L}^k$
2. **for** $m \leftarrow 1$ **to** $\min(M, |\mathcal{L}^k|)$ **do**
   - $R_m \leftarrow$ a $k$-labelset randomly selected from $S$
   - train the LP classifier $g_m$ based on $\mathcal{D}$ and $R_m$
   - $S \leftarrow S \setminus R_m$
3. **end**

considers a set of labels as a class. Our idea is to train the cost-sensitive LP classifier $\hat{g}_m$ by transforming the cost of each label in a labelset to the total cost of the labelset. The transformed cost $\hat{c}_i$ of a training sample $\mathbf{x}_i$ for training $\hat{g}_m$ is computed by

$$\hat{c}_i(\mathbf{c}_i, \mathbf{y}_i) = \begin{cases} \displaystyle\sum_{j \in R_m \text{ s.t. } y_{ij}=1} c_{ij} & \text{if } \exists\, j \in R_m \text{ s.t. } y_{ij} = 1, \\ 1 & \text{otherwise,} \end{cases} \qquad (2.11)$$

where $\mathbf{c}_i$ is the cost vector mentioned in Section 1.3. Therefore, we can obtain the multi-class training sample with its associated cost, $(\mathbf{x}_i, \hat{y}_i, \hat{c}_i)$, for training the LP classifier, where $\hat{y}_i \in \{1, 2, \ldots, V\}$ is the class value and $\hat{c}_i$ is the cost to be paid when the class of this instance is misclassified. We use the multi-class SVM as the LP classifier in this study, and employ the one-versus-one strategy [33] for cost-sensitive multi-class classification.
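A minimal sketch of the labelset cost transformation in (2.11), computing the transformed cost $\hat{c}_i$ of one training sample for one $k$-labelset; the input values are hypothetical.

```python
def transformed_cost(c_i, y_i, labelset):
    """Eq. (2.11): total cost of the positive labels of x_i inside R_m,
    or 1 if the sample has no positive label in the k-labelset."""
    relevant = [c_i[j] for j in labelset if y_i[j] == 1]
    return sum(relevant) if relevant else 1.0

# Example: costs and labels of one instance, with k-labelset R_m = {0, 2}.
c_i = [4.0, 1.0, 2.5, 1.0]
y_i = [1, -1, 1, -1]
print(transformed_cost(c_i, y_i, [0, 2]))  # 6.5
```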

**2.3** **Generalized k-Labelsets Ensemble (GLE)**

As RAkEL is an ensemble-based multi-label classification method, it is subject to
the theory [13] that "a necessary and sufficient condition for an ensemble of clas-
sifiers to be more accurate than any of its individual members is if the

**Algorithm 2** The classification process of RAkEL

- **Input:** number of models $M$, a test sample $\mathbf{x}$, an ensemble of LP classifiers $g_m$, and the corresponding $k$-labelsets $R_m$
- **Output:** the multi-label classification vector $\mathbf{r} = (r_1, r_2, \ldots, r_K)$

1. **for** $j \leftarrow 1$ **to** $K$ **do**
   - (a) $r_j = 0$, $n = 0$
   - (b) **for** each $g_m$ such that $j \in R_m$ **do**
     - $r_j = r_j + q(g_m(\mathbf{x}), j)$
     - $n = n + 1$
   - (c) **end**
   - (d) $r_j = r_j / n$
2. **end**

individual classifiers are accurate and diverse". Here, the term *accurate* implies that such a classifier has a lower error rate than random guessing. In RAkEL, the diversity of the classifiers is achieved by randomly selecting label subsets. The major limitation of this kind of ensemble method is that such "committee-of-diverse-experts" heuristics do not directly optimize the learning objective. In contrast, AdaBoost.M2 [19] is an ensemble method for single-label classification that directly optimizes a learning objective. Breiman [4] shows that the training procedure of AdaBoost is a form of gradient optimization that minimizes the objective function $J(F) = \sum_i \exp(-y_i F(\mathbf{x}_i))$, where $y_i$ is the class label of the instance $\mathbf{x}_i$ and $F(\cdot)$ is a hypothesis classifier of AdaBoost. Both theoretical and experimental results show the superiority of AdaBoost. Our work follows a similar idea to improve RAkEL by re-designing the learning function to optimize a global error objective function.

Another limitation of RAkEL is that it assumes every base classifier in the ensemble is equally important. However, such an assumption is problematic because the individual LP classifiers are trained on different, randomly selected $k$-labelsets; some of them may have worse predictive performance than others or may even be redundant. Researchers have shown that properly determining the weights of the base classifiers in an ensemble can improve the prediction performance [6, 19]. For example, in AdaBoost, the coefficient $\alpha_t$ of a base classifier $f_t(\mathbf{x})$ is determined analytically and is proportional to the predictive performance of $f_t(\mathbf{x})$. We believe that learning the weights of the base classifiers in an ensemble can likewise improve the prediction performance for multi-label classification.

Inspired by the success of the LP-based methods, we present a novel machine learning model for multi-label classification using the idea of the *basis expansion model* [21, Chapter 5] of the following form:

$$\mathbf{H}(\mathbf{x}) = \sum_{m=1}^{M} \beta_m h_m(\mathbf{x}), \qquad (2.12)$$

where the basis functions $h_m(\mathbf{x})$ are the LP classifiers trained on random $k$-labelsets and $\beta_m$ are their coefficients. In general, the basis expansion model treats the classifiers $h_m(\mathbf{x})$ as dictionary functions and uses a linear combination of these functions to approximate the target classifier during the learning procedure [21, Chapter 5]. Intuitively, given sufficient dictionary functions, the basis expansion model is a flexible representation of the target multi-label classifier. In our proposed GLE, the coefficients $\beta_m$ are learned to minimize the global error between the prediction of $\mathbf{H}(\mathbf{x})$ and the ground truth. The coefficients $\beta_m$ become more significant for base classifiers $h_m(\mathbf{x})$ that have better performance and are not redundant given the others. Another interpretation of our method comes from the Bayesian framework: GLE can be considered a kind of Bayesian model combination, that is, $P(Y_i \mid \mathbf{x}) = \sum_m P(Y_i \mid \mathbf{x}, h_m)\, P(h_m)$, where $Y_i$ is a predicted label set. In this respect, the coefficients $\beta_m$ approximate the prior probabilities $P(h_m)$ of the models $h_m$.

In the following subsections, we describe the proposed GLE for multi-label classification, review the concept of exploiting hypergraphs for multi-label classification, and extend GLE to cost-sensitive multi-label classification.

**2.3.1** **GLE for Multi-Label Classiﬁcation**

*Similar to RAk EL, GLE ﬁrst trains M LP-based classiﬁers using randomly selected*
*k -labelsets from the original set of labels. Then, GLE uses the base classiﬁers as*
dictionary functions and learns a linear combination of these functions. Algorithms
3 and 4 describe the training and classiﬁcation processes, respectively. The weight
* coeﬃcients β for the base classiﬁers are learned by solving a minimization problem*
formulated as follows:

$$\min_{\boldsymbol{\beta}}\quad \frac{1}{2}\Big\|Y - \sum_{m=1}^{M}\beta_m Q_m\Big\|_F^2 + \frac{\gamma}{2}\|\boldsymbol{\beta}\|_2^2 + \frac{\nu}{2}\,\mathrm{trace}\Big(\Big(\sum_{m=1}^{M}\beta_m Q_m\Big)^T L \Big(\sum_{m=1}^{M}\beta_m Q_m\Big)\Big), \qquad (2.13)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, $L$ is the normalized hypergraph Laplacian, and $Q_m \in \mathbb{R}^{N \times K}$ is a transformed prediction of $g_m$, which will be described in more detail later. The first term in the objective function aims to minimize the global error between the prediction of $\mathbf{H}(\mathbf{x})$ and the multi-label ground truth $Y$. The second term is a two-norm regularization term on the coefficients $\boldsymbol{\beta}$. The third term is a hypergraph regularization term.

The prediction of a multi-class LP classifier $g_m$ for a sample $\mathbf{x}$ is denoted by $g_m(\mathbf{x}) \in \{1, 2, \ldots, Z\}$. Note that $Z$ will be much smaller than $2^k$ if the data is sparse. Similar to RAkEL, the $(i, j)$-th element of $Q_m$ is calculated by $q(g_m(\mathbf{x}_i), j)$, which is defined as:

$$q(g_m(\mathbf{x}_i), j) = \begin{cases} 1, & \text{if } j \in R_m \text{ and } j \text{ is positive in } g_m(\mathbf{x}_i), \\ -1, & \text{if } j \in R_m \text{ and } j \text{ is negative in } g_m(\mathbf{x}_i), \\ 0, & \text{if } j \notin R_m. \end{cases} \qquad (2.14)$$

For example, when $k = 2$, the classes 1, 2, 3, and 4 correspond to $(1, 1)$, $(1, -1)$, $(-1, 1)$, and $(-1, -1)$, respectively. If label $j$ is not included in $R_m$, then $q(g_m(\mathbf{x}_i), j)$ is 0.

**Algorithm 3** The training process of GLE

- **Input:** number of models $M$, size of labelset $k$, learning parameters $\gamma$ and $\nu$, set of labels $\mathcal{L}$, and the training set $\mathcal{D} = (\mathbf{x}_i, \mathbf{y}_i)_{i=1}^{N}$
- **Output:** an ensemble of LP classifiers $g_m$, the corresponding $k$-labelsets $R_m$, and coefficients $\beta_m$

1. Initialize $S \leftarrow \mathcal{L}^k$
2. **for** $m \leftarrow 1$ **to** $\min(M, |\mathcal{L}^k|)$ **do**
   - $R_m \leftarrow$ a $k$-labelset randomly selected from $S$
   - train the LP classifier $g_m$ based on $\mathcal{D}$ and $R_m$
   - calculate the transformed prediction $Q_m$ of $g_m$ using (2.14)
   - $S \leftarrow S \setminus R_m$
3. **end**
4. Learn $\boldsymbol{\beta}$ using (2.19)

If label $j$ corresponds to the first label of $R_m$, then $q(1, j)$, $q(2, j)$, $q(3, j)$, and $q(4, j)$ will be $1$, $1$, $-1$, and $-1$, respectively. We note that the function $q(g_m(\mathbf{x}), j)$ is used to generate $h_m(\mathbf{x})$ in the final classifier (2.12) by gathering the predictions over all labels $j$.

The trade-off between fitting the training data and regularization can be controlled by the parameters $\gamma$ and $\nu$. A larger $\gamma$ leads to smoother coefficients $\boldsymbol{\beta}$. The hypergraph Laplacian $L \in \mathbb{R}^{N \times N}$ captures the high-order labelling relationship among different instances. Following spectral graph theory [11, 51], the idea behind the hypergraph regularization term is that the predictions on two instances, that is, two rows in $\sum_{m=1}^{M}\beta_m Q_m$, should be similar if the two instances have high similarity according to the hypergraph.

To solve the optimization problem (2.13), we start by rewriting the first term of the objective function through vectorizing $Q_m$ and $Y$.
_{m}**Algorithm 4 The classiﬁcation process of GLE**

**• Input: number of models M, a test sample x, an ensemble of LP classiﬁers***g*_{m}*, and the corresponding k -labelsets* *R**m* *and coeﬃcients β*_{m}

* • Output: the multi-label classiﬁcation vector r = (r*1

*, r*

_{2}

*, ..., r*

*)*

_{K}

**1. for j**

**← 1 to K do***(a) r**j* = 0

**(b) for each g**_{m}**, if j***∈ R**m* **do**

*• r**j* *= r*_{j}*+ β*_{m}*· q(g**m***(x), j)****(c) end**

**2. end**

We denote the prediction of the base classifiers by $\hat{Q} \in \mathbb{R}^{(K \cdot N) \times M}$, whose columns are obtained by reshaping each $Q_m$ into a vector in $\mathbb{R}^{K \cdot N}$, and we vectorize $Y$ into $\hat{Y} \in \mathbb{R}^{K \cdot N}$. Then, the first term of the objective function can be rewritten as $\frac{1}{2}\|\hat{Y} - \hat{Q}\boldsymbol{\beta}\|_2^2$. To further simplify the third term of the objective function, we let $Q_{m,j}$ be the $j$-th column vector of $Q_m$. We perform operations on the third term as follows:

$$\frac{\nu}{2}\,\mathrm{trace}\left(\Big(\sum_{m=1}^{M}\beta_m Q_m\Big)^T L \Big(\sum_{m=1}^{M}\beta_m Q_m\Big)\right)
= \frac{\nu}{2}\,\mathrm{trace}\begin{pmatrix} \boldsymbol{\beta}^T P_{1,1}\boldsymbol{\beta} & \cdots & \boldsymbol{\beta}^T P_{1,K}\boldsymbol{\beta} \\ \vdots & \ddots & \vdots \\ \boldsymbol{\beta}^T P_{K,1}\boldsymbol{\beta} & \cdots & \boldsymbol{\beta}^T P_{K,K}\boldsymbol{\beta} \end{pmatrix}
= \frac{\nu}{2}\sum_{j=1}^{K}\boldsymbol{\beta}^T P_{j,j}\boldsymbol{\beta}, \qquad (2.15)$$

where $P_{i,j} \in \mathbb{R}^{M \times M}$ is generated as

$$P_{i,j} = \begin{pmatrix} Q_{1,i}^T L\, Q_{1,j} & \cdots & Q_{1,i}^T L\, Q_{M,j} \\ \vdots & \ddots & \vdots \\ Q_{M,i}^T L\, Q_{1,j} & \cdots & Q_{M,i}^T L\, Q_{M,j} \end{pmatrix}. \qquad (2.16)$$

Let $\sum_{j=1}^{K} P_{j,j}$ in (2.15) be denoted as $\rho$; the loss function in (2.13) can then be rewritten as

$$\mathcal{L}(\boldsymbol{\beta}) = \frac{1}{2}(\hat{Y} - \hat{Q}\boldsymbol{\beta})^T(\hat{Y} - \hat{Q}\boldsymbol{\beta}) + \frac{\gamma}{2}\boldsymbol{\beta}^T\boldsymbol{\beta} + \frac{\nu}{2}\boldsymbol{\beta}^T\rho\,\boldsymbol{\beta}. \qquad (2.17)$$

We take the derivative of (2.17) with respect to $\boldsymbol{\beta}$ and set it equal to zero:

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = -\hat{Q}^T\hat{Y} + \hat{Q}^T\hat{Q}\boldsymbol{\beta} + \gamma I\boldsymbol{\beta} + \frac{\nu}{2}(\rho + \rho^T)\boldsymbol{\beta} = 0, \qquad (2.18)$$

where $I$ is the identity matrix. Hence, the optimization problem (2.13) has a unique solution $\boldsymbol{\beta}^*$:

$$\boldsymbol{\beta}^* = \Big(\hat{Q}^T\hat{Q} + \gamma I + \frac{\nu}{2}(\rho + \rho^T)\Big)^{-1}\hat{Q}^T\hat{Y}. \qquad (2.19)$$

The computational cost of training GLE depends on the training speed of the LP base classifier. We note that the matrix to be inverted in (2.19) is an $M$-dimensional square matrix, and typically the number of models $M$ is set from 15 to 250. Hence, solving this equation is computationally feasible, and learning $\boldsymbol{\beta}$ does not add much overhead.
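To make the closed-form solution concrete, the following is a minimal NumPy sketch of (2.15)-(2.19): it builds the vectorized prediction matrix $\hat{Q}$, the matrix $\rho = \sum_j P_{j,j}$, and solves for $\boldsymbol{\beta}^*$. The inputs (the $Q_m$ matrices, the ground truth $Y$, the Laplacian $L$, and the parameters $\gamma$ and $\nu$) are assumed to be given; this is an illustrative implementation, not the exact code used in our experiments.

```python
import numpy as np

def learn_gle_coefficients(Q_list, Y, L, gamma, nu):
    """Q_list: M transformed predictions Q_m, each of shape (N, K);
    Y: (N, K) ground truth in {+1, -1}; L: (N, N) hypergraph Laplacian."""
    M = len(Q_list)
    N, K = Y.shape

    # Q_hat: each column is a vectorized Q_m; Y_hat is the vectorized Y.
    Q_hat = np.column_stack([Q.reshape(-1, order="F") for Q in Q_list])
    Y_hat = Y.reshape(-1, order="F")

    # rho = sum_j P_{j,j}, with P_{j,j}[a, b] = Q_{a,j}^T L Q_{b,j} (Eq. 2.16).
    rho = np.zeros((M, M))
    for j in range(K):
        Qj = np.column_stack([Q[:, j] for Q in Q_list])   # (N, M)
        rho += Qj.T @ L @ Qj

    # Closed-form solution of Eq. (2.19).
    A = Q_hat.T @ Q_hat + gamma * np.eye(M) + 0.5 * nu * (rho + rho.T)
    return np.linalg.solve(A, Q_hat.T @ Y_hat)
```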

**2.3.2** **Multi-Label Learning with Hypergraphs**

A hypergraph is a generalization of the traditional graph in which an edge can connect an arbitrary non-empty subset of the vertex set [1]. We denote a hypergraph by $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set; each edge is a subset of $V$. A traditional graph is a special case of the hypergraph, in which every edge connects exactly two vertices; it is also called a "2-graph". Given a multi-label
dataset, the instances with their labels can be represented as one single hypergraph.

More speciﬁcally, the vertex is a data point and the hyperedge is a label that connects
**the instances associated with it. A normalized hypergraph Laplacian L***∈ R** ^{|V |×|V |}*
is a commonly used technique for capturing the relationship among nodes in the
hypergraph and has been used in spectral clustering [69]. Sun et al. proposed the
hypergraph spectral multi-label learning algorithm [51] to learn a low-dimensional

feature transformation $W$, such that data points sharing many common labels tend to be close to each other in the transformed space. The optimization problem for learning $W$ is formulated as follows:

$$\min_{W}\ \mathrm{trace}(W^T X L X^T W) \quad \text{s.t.}\ W^T X X^T W = I, \qquad (2.20)$$

where $W^T X$ is the transformed data.

In this dissertation, we exploit the relation information encoded in the hypergraph Laplacian $L$ in a different manner: we add a hypergraph Laplacian regularizer into the objective function for learning the coefficients of the ensemble. There are several different methods for learning the normalized hypergraph Laplacian matrix, but Agarwal et al. have shown that these methods lead to similar results [1]. Here we use the clique expansion algorithm [1, 51] to calculate the normalized hypergraph Laplacian.

The clique expansion algorithm constructs a 2-graph from the original hyper-
graph by replacing each hyperedge with a clique, that is, maintaining an edge for
each pair of vertices in the hyperedge. For a hypergraph, the vertex-edge incidence
matrix $J \in \mathbb{R}^{|V| \times |E|}$ is defined as $J(v, e) = 1$ if $v \in e$ and $0$ otherwise. We denote the weight associated with a hyperedge $e$ by $w(e)$, and the diagonal matrix whose diagonal entries are $w(e)$ by $W_H$. In the application of hypergraphs to multi-label classification, the weights are set to be uniform for each hyperedge. We denote the vertex degree in the expanded 2-graph by $d_c(u)$, where $u$ is a vertex of the 2-graph, and the diagonal matrix whose diagonal entries are $d_c(u)$ by $D_c$. The normalized hypergraph Laplacian can be calculated by

$$L = I - D_c^{-1/2}\, J\, W_H\, J^T\, D_c^{-1/2}. \qquad (2.21)$$

**2.3.3** **GLE for Cost-Sensitive Multi-Label Classiﬁcation**

GLE can also be extended to cost-sensitive multi-label classification. We follow the same training and classification procedures as shown in Algorithms 3 and 4, and modify the objective function for learning the coefficients $\boldsymbol{\beta}$ in (2.13). Recall that the first term in the objective function of (2.13) is the global error between the multi-label classifier prediction, $\sum_{m=1}^{M}\beta_m Q_m$, and the multi-label ground truth $Y$. We modify it to a cost-weighted global error by multiplying the global error with the multi-label misclassification cost matrix $C$. The optimization problem of learning the coefficients $\boldsymbol{\beta}$ for cost-sensitive multi-label classification can then be formulated as follows:

$$\min_{\boldsymbol{\beta}}\quad \frac{1}{2}\Big\|C \circ \Big(Y - \sum_{m=1}^{M}\beta_m Q_m\Big)\Big\|_F^2 + \frac{\gamma}{2}\|\boldsymbol{\beta}\|_2^2 + \frac{\nu}{2}\,\mathrm{trace}\Big(\Big(\sum_{m=1}^{M}\beta_m Q_m\Big)^T L \Big(\sum_{m=1}^{M}\beta_m Q_m\Big)\Big), \qquad (2.22)$$

where $\circ$ denotes the element-wise (Hadamard) product of two matrices. Similarly, the optimization problem (2.22) has a unique solution $\boldsymbol{\beta}^*$,

$$\boldsymbol{\beta}^* = \Big(C \circ (\hat{Q}^T\hat{Q}) + \gamma I + \frac{\nu}{2}(\rho + \rho^T)\Big)^{-1}\hat{Q}^T\hat{Y}. \qquad (2.23)$$
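For illustration, the cost-weighted objective in (2.22) can also be minimized numerically instead of through the closed form; the following sketch uses a generic SciPy optimizer, follows the same conventions as the earlier GLE sketch, and assumes the cost matrix $C$ is given.

```python
import numpy as np
from scipy.optimize import minimize

def learn_csml_gle_coefficients(Q_list, Y, C, L, gamma, nu):
    """Minimize the cost-weighted GLE objective of Eq. (2.22) over beta."""
    M = len(Q_list)

    def objective(beta):
        P = sum(b * Q for b, Q in zip(beta, Q_list))     # sum_m beta_m Q_m
        cost_err = 0.5 * np.sum((C * (Y - P)) ** 2)      # ||C o (Y - P)||_F^2 / 2
        reg = 0.5 * gamma * np.dot(beta, beta)           # two-norm term on beta
        hyper = 0.5 * nu * np.trace(P.T @ L @ P)         # hypergraph regularizer
        return cost_err + reg + hyper

    res = minimize(objective, x0=np.ones(M) / M, method="L-BFGS-B")
    return res.x
```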

**Chapter 3**

**A Study on the Eﬀect of Tag** **Count in Using Multi-Label** **Classiﬁcation for Social Tag** **Prediction**

Tags are free-text labels annotated on data in different formats, such as images, music tracks, and websites. In some cases, the tags are objective and assigned by experts, as in part-of-speech tagging or semantic role labelling. In other cases, the tags are more or less subjective, and different people might compose different tag sets for the same object. In the Web 2.0 era, more and more online services allow or even encourage users to tag objects such as images or music tracks, and consequently create large sets of subjective tagging data. The emergence of human computation [64] also creates many opportunities for users to tag media. In general, we call such tagging behavior "collective tagging" and the subjective tags produced through this process "collective tags". Collective tags usually capture different aspects of the content. For example, the types of music tags may include artist, genre, mood, and instrumentation. Given a feature representation for the content of the resources, some previous studies [15, 40] assumed that the tags are independent and, thus, transformed the social tag prediction problem into many