Classification
Reduction is a commonly used machine learning technique when a problem cannot be solved directly by standard learning algorithms. In traditional multi-label classification research, a category of methods called problem transformation belongs to this kind of technique: it transforms the multi-label classification problem into one or more single-label classification tasks.
The binary relevance (BR) method is one of the most popular problem transformation approaches. It trains a binary classifier for each label independently: for each label, the instances with and without that label are treated as positive and negative examples, respectively, for training the corresponding binary classifier. This approach inevitably loses the co-occurrence information among labels, which might be useful. Label correlation is useful information for multi-label classification because some labels often co-occur. For example, in music tag annotation, a song with the “hip hop” tag is more likely to also be annotated with “rap” than with “jazz”, while a song with the “dance” tag is more likely to also be annotated with “electronic” than with “guitar”.
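The binary relevance transformation described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `binary_relevance_targets` and the toy songs are hypothetical. It shows that each label receives its own +1/-1 target vector, so no single binary classifier ever sees which labels co-occur.

```python
def binary_relevance_targets(label_sets, labels):
    """Build one +1/-1 target vector per label (binary relevance)."""
    return {
        label: [1 if label in y else -1 for y in label_sets]
        for label in labels
    }

# Toy music-tag data (hypothetical, for illustration only).
label_sets = [
    {"hip hop", "rap"},
    {"jazz"},
    {"hip hop", "rap"},
    {"dance", "electronic"},
]
labels = ["hip hop", "rap", "jazz", "dance", "electronic"]

targets = binary_relevance_targets(label_sets, labels)
for label in labels:
    # Each vector would be the training target of one independent classifier.
    print(label, targets[label])
```

In this toy data the target vectors for “hip hop” and “rap” are identical, yet the two classifiers would be trained without any knowledge of that relationship.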
The label powerset (LP) method [55] is another problem transformation approach. It treats each distinct combination of labels in the training set as a different class and thus treats multi-label classification as a multi-class classification problem. Given a test instance, the multi-class LP classifier predicts the most probable class, which can be transformed back to a set of labels. Table 2.1 shows an example of a multi-label dataset with the transformed multi-class labels based on the concept of LP. However, one major concern with this model is that, as the number of labels increases, the number of potential classes can grow exponentially, and each class will be associated with very few training instances. Moreover, LP can only predict labelsets observed in the training data. In [56], a method called Random k-Labelsets (RAkEL) is proposed to overcome these drawbacks of the traditional LP method. RAkEL randomly selects a number of label subsets from the original set of labels and uses the LP method to train the corresponding multi-class classifiers. The final prediction of RAkEL is made by voting among the LP classifiers in the ensemble. This method not only reduces the number of classes but also allows each class to have more training instances. Experimental results have shown an improvement of RAkEL over LP.

Table 2.1: An Example of a Multi-Label Dataset with Transformed Multi-Class Labels
Instance   Label Set              Transformed Class
1          Rock, Guitar           1
2          Rock, Guitar, Drum     2
3          Rock, Guitar, Vocal    3
4          Country, Guitar        4
5          Rock, Guitar, Drum     2
6          R&B, Vocal             5
7          Country, Guitar        4
8          Vocal                  6
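The LP transformation illustrated in Table 2.1 can be sketched as follows. The function name `label_powerset_transform` is hypothetical, and numbering classes by order of first appearance is an assumption chosen to match the table.

```python
def label_powerset_transform(label_sets):
    """Map each distinct label set to a class id, numbered by first appearance."""
    class_of = {}
    classes = []
    for y in label_sets:
        key = frozenset(y)
        if key not in class_of:
            class_of[key] = len(class_of) + 1
        classes.append(class_of[key])
    return classes, class_of

# The eight instances of Table 2.1.
label_sets = [
    {"Rock", "Guitar"},
    {"Rock", "Guitar", "Drum"},
    {"Rock", "Guitar", "Vocal"},
    {"Country", "Guitar"},
    {"Rock", "Guitar", "Drum"},
    {"R&B", "Vocal"},
    {"Country", "Guitar"},
    {"Vocal"},
]

classes, class_of = label_powerset_transform(label_sets)
print(classes)  # [1, 2, 3, 4, 2, 5, 4, 6]

# Inverse map used to turn a predicted class back into a set of labels.
labels_of = {c: set(y) for y, c in class_of.items()}
print(sorted(labels_of[5]))  # ['R&B', 'Vocal']
```

Because `labels_of` only contains label sets seen during training, a classifier built on this mapping cannot output an unseen combination such as {Country, Vocal}, which is the limitation noted above.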
Inspired by the reduction methods for multi-label classification, we propose two general strategies for reducing the CSML problem to cost-sensitive single-label classification problems: a binary-relevance-based strategy and a label-powerset-based strategy. We describe these two strategies in the following two subsections.
2.2.1 Cost-Sensitive Stacking
In this subsection, we propose a two-stage method called cost-sensitive stacking. Stacking [63] is a method of combining the outputs of multiple independent classifiers for multi-label classification. In the first stage of cost-sensitive stacking, we assume that the K labels are independent and train K cost-sensitive binary classifiers independently. In the second stage, we use the outputs of all binary classifiers, f1(x), f2(x), ..., fK(x), as features to form a new feature set. Let the new feature vector be z = (z1, z2, ..., zK), where zj = fj(x). We then use the new feature set together with the true labels to learn the parameters wkj of the stacking classifiers:
    h_k(z) = \sum_{j=1}^{K} w_{kj} z_j,    (2.9)
where the weight wkj will be positive if label j is positively correlated with label k and negative otherwise. The stacking classifiers can thus recover misclassified labels by using the correlation information captured in the weights wkj.
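As a minimal sketch of how the second stage in Eq. (2.9) can recover a label, consider two correlated tags. The weights `W`, the first-stage outputs `z`, and the decision threshold of 0 are all hypothetical values chosen for illustration; in the method itself the weights are learned from data.

```python
def stacking_scores(W, z):
    """Compute h_k(z) = sum_j W[k][j] * z[j] for every label k, as in Eq. (2.9)."""
    return [sum(w_kj * z_j for w_kj, z_j in zip(row, z)) for row in W]

labels = ["hip hop", "rap"]

# First-stage outputs f_1(x), f_2(x): "rap" is weakly (and wrongly) negative.
z = [0.9, -0.1]

# Hypothetical weights: the off-diagonal entries are positive because the
# two tags are assumed to be positively correlated.
W = [
    [1.0, 0.5],
    [0.5, 1.0],
]

h = stacking_scores(W, z)
# Assumed decision rule: predict label k when h_k(z) > 0.
predicted = [label for label, score in zip(labels, h) if score > 0]
print(h, predicted)
```

Here the strong “hip hop” output pushes the “rap” score to about 0.35, so the label missed by the first stage is predicted by the second.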
2.2.2 Cost-Sensitive RAkEL
As mentioned at the beginning of Section 2.2, a method called Random k-Labelsets (RAkEL) [58] was proposed to realize and improve the LP method. A k-labelset is a labelset R ⊆ L with |R| = k. RAkEL randomly selects a number of k-labelsets from L and uses the LP method to train the corresponding multi-class classifiers. Algorithms 1 and 2 describe the training and classification processes of RAkEL, respectively.
The prediction of a multi-class LP classifier gm for sample x is denoted by gm(x) ∈ {1, 2, ..., V}. Note that V will be much smaller than 2^k if the data is sparse. In be 1, 1, −1, and −1, respectively.
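The random selection of k-labelsets in RAkEL training can be sketched as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the helper names `select_k_labelsets` and `restrict` are hypothetical, the fixed seed is only for reproducibility, and training of the LP classifier on each projected label set is omitted.

```python
import random
from itertools import combinations

def select_k_labelsets(labels, k, M, seed=0):
    """Draw min(M, |L^k|) distinct k-labelsets without replacement."""
    # Enumerate every k-labelset; feasible only for small label sets.
    all_k_labelsets = list(combinations(sorted(labels), k))
    rng = random.Random(seed)
    return rng.sample(all_k_labelsets, min(M, len(all_k_labelsets)))

def restrict(label_sets, R):
    """Project each sample's label set onto the k-labelset R.

    The projected sets are the LP targets for the classifier tied to R.
    """
    R = frozenset(R)
    return [frozenset(y) & R for y in label_sets]

labels = {"Rock", "Guitar", "Drum", "Vocal", "Country", "R&B"}
labelsets = select_k_labelsets(labels, k=3, M=5)
print(labelsets)

projected = restrict(
    [{"Rock", "Guitar", "Drum"}, {"Vocal"}],
    ("Drum", "Guitar", "Vocal"),
)
print(projected)
```

Each projected label set has at most 2^k distinct values, which is why the LP classifiers in the ensemble face far fewer classes than a single LP classifier over all labels.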
Algorithm 1 The training process of RAkEL
• Input: number of models M, size of labelset k, set of labels L, and the training set D = {(xi, yi)}, i = 1, ..., N
• Output: an ensemble of LP classifiers gm and the corresponding k-labelsets Rm
1. Initialize S ← L^k
2. for m ← 1 to min(M, |L^k|) do
   • Rm ← a k-labelset randomly selected from S
   • train the LP classifier gm based on D and Rm
   • S ← S \ {Rm}
3. end

We extend RAkEL for cost-sensitive multi-label classification. The extension is not straightforward since we are given a cost value for each label, but RAkEL considers a set of labels as a class. Our idea is to train the cost-sensitive LP classifier ĝm by transforming the cost of each label in a labelset to the total cost of the labelset. The transformed cost ĉi of a training sample xi for training ĝm is computed by
    \hat{c}_i = \sum_{j \in R_m} c_i[j],
where ci is the cost vector mentioned in Section 1.3. Therefore, we can obtain the multi-class training sample with the associated cost, (xi, ŷi, ĉi), for training the LP classifier, where ŷi ∈ {1, 2, ..., V} is the class value and ĉi is the cost to be paid when this instance is misclassified. We use the multi-class SVM as the LP classifier in this study and employ the one-versus-one strategy [33] in cost-sensitive multi-class classification.
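The cost transformation can be sketched as follows, assuming that the "total cost of the labelset" means the sum of the per-label costs over the labels in Rm. The function name `labelset_cost` and the cost values are hypothetical and serve only to illustrate the arithmetic.

```python
def labelset_cost(cost_vector, R):
    """Total cost of the labels in R, assumed to be the sum of per-label costs."""
    return sum(cost_vector[label] for label in R)

# Hypothetical per-label costs for one training sample x_i.
costs = {"Rock": 1.0, "Guitar": 0.5, "Drum": 2.0, "Vocal": 4.0}

# One k-labelset with k = 3.
R_m = ("Drum", "Guitar", "Vocal")

c_hat = labelset_cost(costs, R_m)
print(c_hat)  # 6.5
```

Under this assumption, a sample whose labels in Rm carry high individual costs receives a large ĉi, so misclassifying it is penalized more heavily when the cost-sensitive LP classifier is trained.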