
Condensed Filter Tree for Cost-Sensitive Multi-Label Classification


Chun-Liang Li R01922001@CSIE.NTU.EDU.TW

Hsuan-Tien Lin HTLIN@CSIE.NTU.EDU.TW

Department of Computer Science and Information Engineering, National Taiwan University

Abstract

Different real-world applications of multi-label classification often demand different evaluation criteria. We formalize this demand with a general setup, cost-sensitive multi-label classification (CSMLC), which takes the evaluation criteria into account during learning. Nevertheless, most existing algorithms can only focus on optimizing a few specific evaluation criteria and cannot systematically deal with different ones. In this paper, we propose a novel algorithm, called condensed filter tree (CFT), for optimizing any criterion in CSMLC. CFT is derived by reducing CSMLC to the famous filter tree algorithm for cost-sensitive multi-class classification via the label powerset construction. We successfully cope with the difficulty of having exponentially many extended classes in the powerset for representation, training and prediction by carefully designing the tree structure and focusing on the key nodes. Experimental results across many real-world datasets validate that CFT is competitive with special-purpose algorithms on special criteria and reaches better performance on general criteria.

1. Introduction

The multi-label classification problem allows each instance to be associated with a set of labels simultaneously. It has in recent years attracted much attention among researchers (Tsoumakas et al., 2010; 2012) because the problem setting matches many different real-world applications; these include bio-informatics (Elisseeff & Weston, 2002), text mining (Srivastava & Zane-Ulman, 2005) and multimedia (Turnbull et al., 2008). The different applications often come with different criteria for evaluating the performance of multi-label classification algorithms. Popular criteria include the Hamming loss, the 0/1 loss, the Rank loss, the F1 score and the Accuracy score (Tsoumakas et al., 2010).

Currently, most algorithms are designed based on none, one, or a few specific criteria. For instance, the label-wise decomposition approaches (Read et al., 2009) aim at optimizing the Hamming loss by decomposing the multi-label classification problem into several binary classification problems, one for each possible label. The label powerset approach aims at optimizing the 0/1 loss by treating each distinct label-set as a unique extended class and reducing multi-label classification to multi-class classification. The probabilistic classifier chain (PCC) (Dembczynski et al., 2010) approach estimates the probability of each possible label-set given an instance and uses the estimate to make a Bayes-optimal decision for any loss function, while the structured SVM approach (Petterson & Caetano, 2010; 2011) uses different convex surrogates for different evaluation criteria. However, both approaches require either special inference rules or loss maximizers for different evaluation criteria.

The variety of evaluation criteria calls for a more general algorithm that can cope with different criteria systematically and automatically. We formalize this need with a general setup, called cost-sensitive multi-label classification (CSMLC). CSMLC can be viewed as an extension of the popular setup of cost-sensitive multi-class classification. In CSMLC, we feed the multi-label classification algorithm with a cost function that quantifies the difference between a predicted label-set and a desired one. A general CSMLC algorithm operates on the given cost function, with the goal of performing well with respect to that cost function. Compared with the existing methodology that requires a specific design for every new application (criterion), general CSMLC algorithms save those design efforts and can easily be adapted to different application needs.

In this paper, we propose a novel algorithm for CSMLC, called the condensed filter tree (CFT). In contrast to PCC, the proposed CFT directly takes the criterion into account as the cost function during training, thereby averting the need to design a specific inference rule for each new criterion and avoiding the possibly time-consuming inference step during prediction. Inspired by the rich literature of cost-sensitive multi-class classification (Domingos, 1999; Beygelzimer et al., 2008), CFT is derived by first reducing CSMLC to cost-sensitive multi-class classification via the label powerset approach. Nevertheless, the reduction leads to exponentially many extended classes, which makes training, prediction and model representation computationally challenging.

We conquer the challenge of prediction by exploiting tree-based models for cost-sensitive multi-class classification. Tree-based models use a tree structure constructed from binary classifiers to make fast predictions, achieving time complexity logarithmic in the number of extended classes, which is linear in the number of possible labels. Furthermore, we conquer the challenge of model representation by proposing the proper-ordering and K-classifier tricks. Interestingly, the two tricks reveal a strong connection between the CFT algorithm (which is derived from the label powerset approach) and the label-wise decomposition approaches.

Finally, we conquer the training challenge by modifying the famous Filter Tree algorithm (Beygelzimer et al., 2008) for CSMLC. The modification comes from revisiting the theoretical bound of Filter Tree, which allows the proposed CFT algorithm to focus only on some key tree nodes for training efficiency.

We conduct experiments on nine real-world datasets to validate the proposed CFT algorithm. Experimental results demonstrate that for specific evaluation criteria, CFT is competitive with special-purpose algorithms, such as PCC with specifically designed inference rules or the state-of-the-art MLkNN algorithm (Zhang & Zhou, 2007). For general criteria, for which there is as yet no inference rule for PCC, CFT can reach significantly better performance. The results justify the superiority of the proposed CFT for general CSMLC problems.

2. Problem Setup

In a multi-label classification problem, we denote the feature vector by x ∈ R^d and its relevant label set by Y ⊆ {1, 2, ..., K}, where K is the number of classes. The label set Y is commonly represented as a label vector y ∈ {0, 1}^K, where y[k] = 1 if and only if k ∈ Y. Given a dataset D = {(x_n, y_n)}_{n=1}^N, which contains N i.i.d. training examples (x_n, y_n) drawn from an unknown distribution P, the goal is to design an algorithm that uses D to find a classifier h : R^d → {0, 1}^K in the training stage, with the hope that h(x) closely predicts the y of an unseen x in the prediction stage when (x, y) is drawn from P.

For evaluating the closeness of the prediction ŷ = h(x), one of the most common criteria is the Hamming loss, Hamming(y, ŷ) = (1/K) Σ_{k=1}^{K} ⟦y[k] ≠ ŷ[k]⟧. Note that the Hamming loss evaluates each label component separately and with equal weight. In addition to the Hamming loss, there are many other criteria that evaluate the components of ŷ jointly; these include the 0/1 loss, the Rank loss, the F1 score and the Accuracy score (Tsoumakas et al., 2010). In this paper, we use loss to denote a criterion that shall be minimized, and score to denote a criterion that shall be maximized.
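For illustration, the following minimal Python sketch (assuming the label vectors are represented as NumPy 0/1 arrays) computes the Hamming loss, the 0/1 loss and the per-example F1 score:

import numpy as np

def hamming_loss(y, y_hat):
    # average number of mismatched label components
    return np.mean(y != y_hat)

def zero_one_loss(y, y_hat):
    # 1 if any component differs, 0 otherwise
    return float(np.any(y != y_hat))

def f1_score(y, y_hat):
    # per-example F1: 2|y ∩ y_hat| / (|y| + |y_hat|); treated as 1 when both are empty
    denom = y.sum() + y_hat.sum()
    return 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom

y, y_hat = np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])
print(hamming_loss(y, y_hat), zero_one_loss(y, y_hat), f1_score(y, y_hat))   # 0.5 1.0 0.5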

The variety of criteria calls for a general setup of multi-label classification, called cost-sensitive multi-label classification (CSMLC), which will be the main focus of this work. CSMLC can be viewed as an extension of the popular setup of cost-sensitive multi-class classification (Domingos, 1999). In CSMLC, we assume that there is a known cost function (matrix) C : {0, 1}^K × {0, 1}^K → R, where C(y, ŷ) denotes the cost of predicting (x, y) as ŷ. The cost matrix is not only part of the prediction stage, where C(y, h(x)) evaluates the performance of any classifier h, but also part of the training stage, where C is fed as an additional piece of information to guide the learning algorithm.

The CSMLC setup meets the goal of optimizing many existing criteria, such as the (per-example) F1 score, the Accuracy score and the Rank loss (Tsoumakas et al., 2010).

Note that the setup above only considers a cost matrix C indexed by a desired vector y and a predicted vector ŷ. Thus, the setup cannot fully cover some more complicated evaluation criteria such as the micro-F1 score and the macro-F1 score, which are defined on a set of vectors. Studying the CSMLC setup can be viewed as an intermediate step toward tackling those complicated criteria in the future.

There are many existing algorithms for tackling multi-label classification, but they either do not seriously take the cost matrix (criteria) into account, or only aim at a few specific cost matrices. That is, general algorithms for CSMLC have not been well studied. One intuitive family of algorithms is label-wise decomposition. For instance, the binary relevance (BR) algorithm (Tsoumakas et al., 2010) decomposes D = {(x_n, y_n)}_{n=1}^N into K binary classification datasets D_k = {(x_n, y_n[k])}_{n=1}^N, and trains K independent binary classifiers h_k with D_k for predicting y[k]. One extension of BR is the classifier chain (CC) algorithm (Read et al., 2009), which takes D_k = {(z_n, y_n[k])}_{n=1}^N with z_n = (x_n, y_n[1], ..., y_n[k − 1]) to train h_k. One practical variant of CC, named CC-P, takes the predicted labels ŷ[k] instead of the true labels y[k] as the features in z_n.
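For illustration, a minimal Python sketch of the BR and CC decompositions is given below; it assumes a scikit-learn-style binary base learner and that every label takes both values in the training set, and CC-P would simply replace the chained true labels with the predictions of the earlier classifiers:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_br(X, Y, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # binary relevance: one independent binary classifier per label,
    # trained on D_k = {(x_n, y_n[k])}
    return [base_clf().fit(X, Y[:, k]) for k in range(Y.shape[1])]

def train_cc(X, Y, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # classifier chain: z_n = (x_n, y_n[1], ..., y_n[k-1]) as features for h_k;
    # CC-P would append the predicted labels of the earlier classifiers instead
    chain = []
    for k in range(Y.shape[1]):
        Z = np.hstack([X, Y[:, :k]])
        chain.append(base_clf().fit(Z, Y[:, k]))
    return chain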

Because CC-P (as well as BR/CC) predicts ŷ[k] separately with each h_k, arguably their main goal is to minimize the Hamming loss (Tsoumakas et al., 2010). Extending CC-P to general CSMLC, however, is non-trivial, because it is difficult to embed the 2^K × 2^K possible C(y, ŷ) cost components into K separate steps of training. One algorithm that solves the difficulty for some specific cost matrices is the probabilistic classifier chain (PCC) (Dembczynski et al., 2010). PCC avoids the embedding issue in training by adopting a soft version of CC-P/CC without any cost information to estimate the conditional probability P(y|x). The probabilistic view allows PCC to interpret CC as greedily minimizing the 0/1 loss through the chain rule. During prediction, PCC considers the cost matrix to make the Bayes-optimal decision, based on an efficient inference rule that is specifically designed for the given cost matrix.

The potential drawback of PCC is that not only is it non-trivial to estimate the conditional probability, but it is also challenging to design an efficient inference rule for each cost matrix. Because of the latter challenge, PCC can currently be used to exactly tackle only the Hamming loss, the Rank loss, and the F1 score (Dembczynski et al., 2010; 2011; 2012a). PCC can also be used with a search-based inference rule to approximately optimize the 0/1 loss (Dembczynski et al., 2012b; Kumar et al., 2013), but not other criteria in CSMLC.

Another major algorithm, known as label powerset (LP), reduces multi-label classification to multi-class classification (Tsoumakas et al., 2010). LP treats each unique pattern of the label vector as a single extended class. That is, the K possible labels are encoded into 2^K extended classes via a bijection enc : {0, 1}^K → {1, ..., 2^K}. During training, LP transforms D into D_m = {(x_n, c_n)}_{n=1}^N, where c_n = enc(y_n), and trains a multi-class classifier h_m from D_m. During prediction, LP takes h(x) = enc^{−1}(h_m(x)). Trivially, LP focuses on the 0/1 loss, because the error rate of h_m in the reduced problem is equivalent to the 0/1 loss of h. The disadvantage is that the exponentially many extended classes make LP infeasible and impractical in general.

Lo et al. (2011) propose the CS-RAKEL algorithm, which optimizes a weighted Hamming loss by extending RAKEL (Tsoumakas & Vlahavas, 2007), a representative algorithm between the label-wise decomposition and label powerset approaches. However, CS-RAKEL is designed for specific application needs and cannot tackle general CSMLC problems.

In summary, some related algorithms and their corresponding criteria are shown below. None of them can tackle general CSMLC problems.

Algorithms    Criteria Being Optimized
CC-P/CC       Hamming loss or 0/1 loss
PCC           Hamming loss, F1 score, Rank loss, 0/1 loss
LP            0/1 loss
CS-RAKEL      Weighted Hamming loss

3. Tree Model for CSMLC

Inspired by the connection between CSMLC and the rich literature of cost-sensitive classification (Domingos, 1999; Beygelzimer et al., 2008), we design a general CSMLC algorithm via this connection. Note that LP reduces multi-label classification to multi-class classification to optimize the 0/1 loss. If we follow the same reduction step but start from a general CSMLC problem, we end up with a cost-sensitive classification problem of 2^K extended classes and (implicitly) a 2^K × 2^K cost matrix. Then any existing cost-sensitive classification algorithm can be used to solve CSMLC. We call this preliminary algorithm cost-sensitive label powerset (CS-LP). As with LP, the exponential number of extended classes presents a computational challenge for CS-LP. For example, using CS-LP to reduce CSMLC to the weighted all-pairs approach (Beygelzimer et al., 2005) requires 2^K(2^K − 1)/2 comparisons for making each prediction.
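To make the blow-up concrete, the following illustrative sketch (with an F1-based cost standing in for a general C) forms the cost vector that CS-LP would attach to a single example, one entry per extended class; the enumeration is clearly feasible only for small K:

from itertools import product
import numpy as np

def f1_cost(y, y_hat):
    # an example cost matrix: 1 - (per-example F1 score)
    denom = y.sum() + y_hat.sum()
    f1 = 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom
    return 1.0 - f1

def cslp_cost_vector(y, K, cost=f1_cost):
    # one cost per extended class; product() enumerates label vectors in
    # binary-number order, matching the proper ordering introduced below
    return np.array([cost(y, np.array(c)) for c in product([0, 1], repeat=K)])

K = 3
y = np.array([1, 0, 1])
print(len(cslp_cost_vector(y, K)))   # 2**K = 8 entries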

Interestingly, PCC can be viewed as a special case of using CS-LP to reduce CSMLC to the famous Meta-Cost approach (Domingos, 1999). Meta-Cost estimates the conditional probability during training and then makes the Bayes-optimal decision with respect to the cost matrix during prediction. Similarly, PCC estimates the probability by CC-P/CC, and then infers the optimal decision with respect to the cost matrix by a specifically designed inference rule.

We take another route that uses CS-LP to reduce CSMLC to tree models for cost-sensitive classification (Beygelzimer et al., 2008). A similar idea based on the Hamming loss has been discussed in a blog post (Mineiro, 2011), but the idea has not been seriously studied for general CSMLC problems. Tree models form a binary tree of weighted binary classifiers to conduct cost-sensitive classification. Each non-leaf node of the tree is a binary classifier for deciding which subtree to go to, and each leaf node represents a class. Without loss of generality, we assume that the leaf nodes are indexed orderly by 1, 2, ..., #classes. Making a prediction for each instance follows the decisions of the binary classifiers, starting from the root down to a leaf. That is, only O(log(#classes)) decisions are required for making each prediction. In CS-LP, tree models thus result in O(K) time for each prediction, of the same order as the label-wise decomposition approaches.

Nevertheless, the number of nodes in the resulting tree structure is O(2^K), which poses challenges in representation and training. We first tackle the representation challenge in this section, and then study algorithms for training the tree model in Section 4.

Figure 1. Proper Ordering. (a) Put labels on leaf nodes orderly. (b) Index internal nodes by paths.

Proper Ordering. Recall that CS-LP needs a bijective function enc(·) : {0, 1}^K → {1, ..., 2^K} for encoding y to c and for decoding the predicted class ĉ back to the corresponding label vector ŷ. Although a prediction ĉ can be made within O(K) time in the tree model, the encoding function requires a careful design to make both enc and enc^{−1} efficient over the 2^K possible inputs. We first consider the proper-ordering trick, which lets enc(y) = BinaryNumber(y) + 1. That is, we treat each y as a binary string, encode it by computing its corresponding integer in O(K) time, and decode accordingly, as illustrated in Fig. 1(a). Based on proper ordering, if we let {0, 1} represent the decisions {L, R} in each classifier of the tree, then the label vector y (extended class c) of each leaf node is equivalent to the sequence of binary decisions made from the root to the leaf. More generally, we can index each node of the tree by t ∈ {0, 1}^{k−1}, as shown in Fig. 1(b), where t is the sequence of binary decisions from the root to the node on layer k.
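A minimal sketch of the encoding and decoding, assuming y[1] is treated as the most significant bit of the binary string:

def enc(y):
    # enc(y) = BinaryNumber(y) + 1, computed in O(K)
    c = 0
    for bit in y:
        c = (c << 1) | int(bit)
    return c + 1

def dec(c, K):
    # inverse mapping back to the label vector
    bits = []
    c -= 1
    for k in range(K - 1, -1, -1):
        bits.append((c >> k) & 1)
    return bits

assert dec(enc([1, 0, 1, 1]), 4) == [1, 0, 1, 1]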

K-Classifier Trick. Even with proper ordering, there are 2^K − 1 internal nodes (classifiers) in total on the tree. The exponential number makes representing (and training) the classifiers infeasible in practice. One existing idea for a feasible representation is the 1-classifier trick (Beygelzimer et al., 2008), which lets all 2^K − 1 internal nodes t share one classifier h(x, t). Nevertheless, using the 1-classifier trick often requires the classifier to be powerful enough to capture the different characteristics of different nodes. This requirement makes the trick less suitable for practical use. Therefore, we propose the K-classifier trick, a trade-off between using 1 classifier and 2^K − 1 classifiers.

The K-classifier trick works as follows. After proper ordering, ŷ[k] corresponds to the prediction made by one of the nodes on layer k of the tree. In other words, the purpose of all the nodes located on layer k is similar: predicting ŷ[k]. The similar purpose allows us to view each node as part of a layer classifier of the form h_k(x, t), which takes an instance x and a node index t ∈ {0, 1}^{k−1} for predicting the k-th component of the label vector. Then, equivalently, only K classifiers (one per layer) are required for representing the tree.

Connection to CC-P. By the proper-ordering and K-classifier tricks, predicting the extended class ĉ with the tree from layer 1 (root) to layer K is equivalent to predicting ŷ[1], ..., ŷ[K] with the K classifiers {h_1, ..., h_K} using x and t. Such a prediction procedure is exactly the same as the one used in CC-P, which uses the classifiers h_k with exactly the same inputs (the instance x and the predicted labels, which form the node index t in the tree) for predicting the next label.

Figure 2. (a) Training of Top-down Tree; (b) Training of Filter Tree. The cost matrix used is 1 − (F1 score). (0/1, w) denotes the direction based on proper ordering and the weight for training the instance on the node. The thick edge represents the prediction for the instance by the trained classifier on the parent node.

The use of the two tricks reveals an interesting connection between two very different families of approaches: label-wise decomposition can be viewed as a special case of label powerset (in prediction). In short, label powerset with the tree model, the proper-ordering trick and the K-classifier trick is equivalent to CC-P for prediction. Thus, by studying the role of the cost matrix during training, we can systematically extend CC-P to be cost-sensitive. Next, we discuss how to train the K classifiers subject to the cost matrix efficiently.
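Before turning to training, the following illustrative sketch (assuming scikit-learn-style layer classifiers trained on the concatenated features (x, t)) shows the shared prediction procedure:

import numpy as np

def tree_predict(layer_clfs, x):
    # layer_clfs holds h_1, ..., h_K in order; each is a binary classifier on (x, t)
    t = []                                   # node index = binary decisions so far
    for h_k in layer_clfs:
        z = np.concatenate([x, t]).reshape(1, -1)
        t.append(int(h_k.predict(z)[0]))     # go to subtree 0 or 1, i.e. predict y[k]
    return np.array(t)                       # the leaf reached is the predicted label vector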

4. Training of Tree Model

There are two major algorithms for training the binary classifiers in the tree: Top-down Tree and Filter Tree (Beygelzimer et al., 2008).

Top-down Tree (TT). Top-down Tree trains the classifiers from layer 1 (root) to layer K. Formally, for each internal node t on layer k, denote its left child by t0 and its right child by t1 on layer (k + 1), and define t* as the leaf node (prediction) with the minimum cost within the subtree T_t rooted at t. Then, for each training example (x_n, y_n) that reaches t during top-down training, we form a weighted example ((x_n, t), b_n, w_n) to train the weighted classifier h_k, where the label b_n = argmin_{i∈{0,1}} C(y_n, t_i*) represents the optimal decision, and the weight w_n = |C(y_n, t_0*) − C(y_n, t_1*)| represents the cost difference. The training examples are then split into two sets based on the decisions of the trained classifier h_k, and are used to train the child nodes t0 and t1, respectively. All the training examples are used to train the root classifier, and the whole tree is trained recursively by such divide-and-conquer steps. Note that, as illustrated in Fig. 2(a), each training example (x_n, y_n) contributes only to training the nodes on the path from the root to the predicted leaf of the example.
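The following illustrative sketch (the helper names are illustrative only) makes the construction of one weighted example concrete; it finds each child's minimum-cost leaf by brute-force enumeration, which is feasible only for small K, although for the Hamming loss the minimizer has a closed form (complete the prefix with the remaining true labels):

from itertools import product
import numpy as np

def min_cost_leaf(y, prefix, K, cost):
    # cheapest leaf (label vector) whose first len(prefix) components equal the prefix
    completions = (np.array(list(prefix) + list(tail))
                   for tail in product([0, 1], repeat=K - len(prefix)))
    return min(completions, key=lambda leaf: cost(y, leaf))

def tt_weighted_example(x, y, t, K, cost):
    # children of node t correspond to appending 0 or 1 to the prefix t
    c0 = cost(y, min_cost_leaf(y, list(t) + [0], K, cost))
    c1 = cost(y, min_cost_leaf(y, list(t) + [1], K, cost))
    b = int(c1 < c0)                    # optimal direction b_n
    w = abs(c0 - c1)                    # importance weight w_n
    return np.concatenate([x, t]), b, w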


The time complexity of Top-down Tree for CSMLC is the same as that of CC-P. In fact, if we take the Hamming loss, then w_n = 1/K is the same for every instance on every node in the k-th layer, (x_n, t_n) = (x_n, ŷ_n[1], ..., ŷ_n[k − 1]), and b_n = y_n[k]. Thus, Top-down Tree with the Hamming loss is equivalent to CC-P. That is, general Top-down Tree can be viewed as a systematic extension of CC-P to general CSMLC.

Uniform Filter Tree (UFT). It is known that Top-down Tree may suffer from a weaker theoretical guarantee (Beygelzimer et al., 2008). An alternative algorithm, Filter Tree, trains the classifiers in a bottom-up manner starting from the last non-leaf layer, and each example (x_n, y_n) is used to train all nodes. As illustrated in Fig. 2(b), the last non-leaf layer of classifiers is trained by forming weighted examples based on the better of the two leaves. After training, each node on the last layer decides the winning leaf of the two by predicting on x_n. The winning labels then form the new “leaves” of a smaller filter tree, and the classifiers on the upper layer are trained similarly. Due to the bottom-up manner, on layer k, Filter Tree considers all the predictions from layer k + 1 to layer K. That is, Filter Tree splits one original training example into 2^{k−1} examples, one for each possible node t, to train all of the 2^{k−1} nodes on the layer. When k is large, training on layer k can thus be challenging. Compared with Top-down Tree, Filter Tree is less efficient because it considers all training examples for each node, but it enjoys a stronger theoretical guarantee (Beygelzimer et al., 2008).

One possibility for training Filter Tree efficiently is to train only a few nodes per layer, with the hope that the other nodes can also perform decently because of the classifier sharing in the K-classifier trick. The original Filter Tree work (Beygelzimer et al., 2008) suggests one simple approach that splits one example to train M uniformly chosen nodes on the k-th layer to approximate the full training of the 2^{k−1} nodes. We call this algorithm Uniform Filter Tree (UFT) for CSMLC.

Condensed Filter Tree (CFT). In Filter Tree, there are 2^K possible traversal paths from the root to the leaves for each instance; however, many of them, such as paths that result in high costs, are seldom needed if we have reasonably good classifiers. Therefore, we can shift our focus to the important nodes on each layer instead of sampling uniformly for each instance. Next, we revisit the regret bound of Filter Tree, and show that the bound can be revised to focus on a key path of nodes on the tree.

In CSMLC, for a feature vector x and a distribution P|x for generating the label vector y, the regret rg of a classifier h on x is defined as

rg(h, P) = E_{y∼P|x}[C(h(x), y)] − min_g E_{y∼P|x}[C(g(x), y)].

For a distribution that generates weighted binary examples (x, b, w), the regret can be defined similarly by using w as the cost of a wrong prediction (of b) and 0 as the cost of a correct prediction.

Let y* = argmin_ỹ E_{y∼P|x} C(y, ỹ) be the ideal prediction for x under P. When t′ is an ancestor (prefix) of t on the tree, denote by ⟨t′, t⟩ the list (path) that contains the nodes on the path from node t′ to t. We call ⟨r, y*⟩ the ideal path of the tree for x, where r is the root of the tree. Similarly, for each node t, we can define the ideal path of the subtree T_t rooted at t. Beygelzimer et al. (2008) prove that, for Filter Tree, the CSMLC regret of any tree-based classifier is upper-bounded by the total regret of all the nodes on the tree. Next, we show that the total regret of only the nodes on the ideal path can readily be used to upper-bound the CSMLC regret.

Theorem 1. Under the proper-ordering and K-classifier tricks, for each x and the multi-label classifier h formed by chaining the K binary classifiers (h_1, ..., h_K) as in the prediction procedure of Filter Tree, the regret rg(h, P) is no more than

Σ_{t ∈ ⟨r, y*⟩} ⟦h_k(x, t) ≠ y*[k]⟧ · rg( h_k(x, t), FT_t(P, h_{k+1}, ..., h_K) ),

where k denotes the layer that t is on, and FT_t(P, h_{k+1}, ..., h_K) represents the procedure that generates the weighted examples (x, b, w) used to train the node at index t, based on sampling y from P|x and considering the predictions of the classifiers in the lower layers.

Proof. For each node t on layer k, h_k directs the prediction procedure to move to either node t0 or node t1. Without loss of generality, assume h_k(x, t) = 1. Denote by t̂ the prediction (leaf) on x when starting at node t. For each leaf node ỹ, let C̄(ỹ) ≡ E_{y∼P|x} C(y, ỹ). Then the node regret rg(t) is simply C̄(t̂1) − min_{i∈{0,1}} C̄(t̂i).

In addition to the regret of nodes, we also define the regret of the subtree T_t rooted at node t as the regret of the predicted path (vector) t̂ within the subtree T_t, that is, rg(T_t) = C̄(t̂) − C̄(t*), where t* denotes the optimal prediction (leaf node) in the subtree T_t. By this definition, rg(h, P) is simply rg(T_r).

The proof follows by replacing the total regret with rg(T_r) in the original Filter Tree analysis (Beygelzimer et al., 2008). Due to the space limit, we omit the complete proof here.

In Theorem 1, the bound is related only to certain nodes on the ideal path for each training example. The bound inspires us to first consider using each training example to train only the K nodes on its ideal path to obtain the classifiers h_1, ..., h_K for each layer. Then we can find the uppermost mis-classified node t on the ideal path for each example (x_n, y_n). Without loss of generality, assume t is on layer k, with y*[k] = 0 and h_k(x, t) = 1.

Figure 3. The thick edge represents the prediction of the corresponding parent node. The ideal path is ⟨r, t*⟩ and t is the first mis-classified node on it; the ideal path of the subtree T_{t1} is ⟨t1, t1*⟩ and t10 is the first mis-classified node on it. Both nodes t and t10 are on the predicted path ⟨r, t̂1⟩.

According to Theorem 1, we could decrease the regret rg(h, P) (that is, rg(T_r)) by decreasing the node regret rg(t) = C̄(t̂1) − min_{i∈{0,1}} C̄(t̂i), which can be done by decreasing C̄(t̂1) (since t̂0 relates to the regret of other nodes on the ideal path of t, we cannot easily increase C̄(t̂0) to decrease rg(t)). Because C̄(t1*) is a constant, decreasing C̄(t̂1) is equivalent to decreasing the regret rg(T_{t1}) = C̄(t̂1) − C̄(t1*) of the subtree T_{t1}. We can then recursively adopt the above procedure to optimize the subtree regret rg(T_{t1}), as shown in Fig. 3.

The procedure suggests decreasing the regret on ⟨r, t⟩ and ⟨t, t̂⟩, which together form the predicted path of x_n. Therefore, the next key path of x_n that should be included for training is its predicted path. That is, we can now train Filter Tree by adding the predicted path of each x_n. We call the resulting algorithm Condensed Filter Tree, as shown in Algorithm 1. The path-adding step can be repeated to further zoom in on the key nodes. The number of path-adding steps can be treated as a parameter M, which will be further discussed in Section 5.

In summary, we derive three efficient approaches for general CSMLC with trees: TT (a systematic extension of CC-P), UFT and CFT. Next, we compare them with other existing algorithms through experiments.

5. Experiment

We conduct experiments with different evaluation criteria on nine real-world datasets: CAL500, emotions, enron, imdb, medical, scene, slash, tmc and yeast (Tsoumakas et al., 2011; Read, 2012).

Algorithm 1 Condensed Filter Tree for CSMLC
 1: D = {(x_n, y_n)}_{n=1}^N; D_p = {((x_n, y_n), y_n)}_{n=1}^N
 2: for m = 1 to M do
 3:   for each layer k from layer K down to the root do
 4:     D_k = ∅
 5:     for each instance ((x_n, ỹ_n), y_n) ∈ D_p do
 6:       t = (ỹ_n[1], ..., ỹ_n[k − 1]); z_n = (x_n, t)
 7:       b_n = argmin_{i∈{0,1}} C(y_n, t̂_i)
 8:       w_n = |C(y_n, t̂_1) − C(y_n, t̂_0)|
 9:       D_k ← D_k ∪ {(z_n, b_n, w_n)}
10:     end for
11:     h_k ← train(D_k)
12:   end for
13:   if m < M then
14:     for each instance (x_n, y_n) ∈ D do
15:       ŷ_n = predict(h_1, ..., h_K, x_n)
16:       D_p ← D_p ∪ {((x_n, ŷ_n), y_n)}
17:     end for
18:   end if
19: end for
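For illustration, one possible Python realization of Algorithm 1 is sketched below. It assumes scikit-learn-style weighted binary learners and that both directions appear in every layer's weighted dataset; the lower-layer classifiers trained earlier in the same bottom-up pass are used to complete each child prefix into a leaf (t̂_0 and t̂_1):

import numpy as np
from sklearn.linear_model import LogisticRegression

def complete_prefix(x, prefix, clfs, K):
    # follow the already-trained lower-layer classifiers to extend a node prefix into a leaf
    t = list(prefix)
    for k in range(len(prefix), K):
        z = np.concatenate([x, t]).reshape(1, -1)
        t.append(int(clfs[k].predict(z)[0]))
    return np.array(t)

def predict_cft(clfs, x, K):
    return complete_prefix(x, [], clfs, K)

def train_cft(X, Y, cost, M=4, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # X: (N, d) features; Y: (N, K) binary label matrix; cost(y, y_hat) -> float
    N, K = Y.shape
    # D_p stores (feature vector, path used to index nodes, true label vector); initially the true path
    D_p = [(X[n], Y[n], Y[n]) for n in range(N)]
    clfs = [None] * K
    for m in range(M):
        for k in range(K - 1, -1, -1):              # bottom-up: layer K down to the root
            Z, B, W = [], [], []
            for x, y_tilde, y in D_p:
                t = list(y_tilde[:k])               # node index on this layer
                leaf0 = complete_prefix(x, t + [0], clfs, K)   # child 0 filtered downward
                leaf1 = complete_prefix(x, t + [1], clfs, K)   # child 1 filtered downward
                c0, c1 = cost(y, leaf0), cost(y, leaf1)
                Z.append(np.concatenate([x, t]))
                B.append(int(c1 < c0))              # optimal direction b_n
                W.append(abs(c1 - c0))              # importance weight w_n
            # assumes both directions occur in B; degenerate layers need special care
            clfs[k] = base_clf().fit(np.array(Z), np.array(B),
                                     sample_weight=np.array(W) + 1e-12)
        if m < M - 1:                               # add the predicted path of each example
            D_p += [(X[n], predict_cft(clfs, X[n], K), Y[n]) for n in range(N)]
    return clfs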

In the experiments, we include three kinds of algorithms in our comparison: (a) the label-wise decomposition approaches, including the classifier chain (CC), the ensemble classifier chain (ECC), and the probabilistic classifier chain (PCC); (b) the tree-based models, including Top-down Tree (TT), Uniform Filter Tree (UFT) and Condensed Filter Tree (CFT); and (c) MLkNN (Zhang & Zhou, 2007), a state-of-the-art algorithm that does not explicitly take the cost into account. We first consider three cost matrices: the Hamming loss, the Rank loss = Σ_{y[i] > y[j]} ( ⟦ŷ[i] < ŷ[j]⟧ + (1/2) ⟦ŷ[i] = ŷ[j]⟧ ), and the F1 score = 2‖y ∩ ŷ‖_1 / (‖y‖_1 + ‖ŷ‖_1). The three cost matrices correspond to known efficient inference rules for PCC (Dembczynski et al., 2010; 2011). Other criteria are compared in Section 5.3.

We couple PCC with L2-regularized logistic regression and the other algorithms with linear support vector machines as implemented in LIBLINEAR (Fan et al., 2008). For MLkNN, we use the implementation in Mulan (Tsoumakas et al., 2011). In each run of the experiment, we randomly sample 50% of the dataset for training and reserve the rest for testing. For UFT and CFT, we restrict the maximum M to 8 for efficiency. For the other parameters of each algorithm, we use cross-validation on the training set to search for the best choice. Finally, Tables 1, 2 and 3 list the results for the three cost matrices, respectively, with the mean and the standard error over 40 different random runs; the best result on each dataset is marked in bold. We also compare CFT with the other algorithms based on the t-test at the 95% confidence level. The numbers of datasets on which CFT wins, ties and loses are shown in Table 4.


Table 1. The results of the Hamming loss (the best (lowest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT (CC-P)       UFT             CFT
CAL.      0.1376 ± 0.002  0.1374 ± 0.002  0.1379 ± 0.002  0.1370 ± 0.002  0.1375 ± 0.002  0.1489 ± 0.005  0.1368 ± 0.002
emo.      0.2613 ± 0.029  0.2501 ± 0.022  0.2122 ± 0.012  0.2297 ± 0.011  0.2435 ± 0.015  0.2222 ± 0.014  0.2138 ± 0.009
enron     0.0465 ± 0.001  0.0466 ± 0.001  0.0540 ± 0.001  0.0462 ± 0.001  0.0467 ± 0.001  0.0551 ± 0.001  0.0467 ± 0.001
imdb      0.0808 ± 0.000  0.0713 ± 0.000  0.0714 ± 0.000  0.0714 ± 0.000  0.0715 ± 0.000  0.0715 ± 0.000  0.0715 ± 0.000
medical   0.0109 ± 0.001  0.0113 ± 0.001  0.0176 ± 0.001  0.0110 ± 0.001  0.0108 ± 0.001  0.0119 ± 0.001  0.0102 ± 0.001
scene     0.1118 ± 0.004  0.0971 ± 0.004  0.0942 ± 0.004  0.0962 ± 0.003  0.0980 ± 0.003  0.1032 ± 0.003  0.1004 ± 0.003
slash     0.0418 ± 0.001  0.0383 ± 0.000  0.0514 ± 0.001  0.0386 ± 0.001  0.0388 ± 0.001  0.0375 ± 0.001  0.0383 ± 0.001
tmc       0.0571 ± 0.000  0.0565 ± 0.000  0.0669 ± 0.000  0.0576 ± 0.000  0.0575 ± 0.000  0.0574 ± 0.000  0.0572 ± 0.000
yeast     0.2107 ± 0.003  0.2009 ± 0.004  0.1981 ± 0.003  0.2006 ± 0.003  0.2000 ± 0.002  0.2008 ± 0.002  0.2013 ± 0.003

Table 2. The results of the Rank loss (the best (lowest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT              UFT             CFT
CAL.      1516.0 ± 60.4   1432.6 ± 39.0   1408.9 ± 21.3   967.93 ± 12.57  965.49 ± 11.20  968.40 ± 12.03  963.13 ± 10.99
emo.      2.697 ± 0.315   2.350 ± 0.299   1.906 ± 0.120   1.763 ± 0.102   1.868 ± 0.134   1.714 ± 0.131   1.632 ± 0.093
enron     44.190 ± 0.736  42.625 ± 0.775  55.959 ± 1.386  24.379 ± 0.557  25.144 ± 0.704  25.622 ± 0.576  24.907 ± 0.625
imdb      21.312 ± 0.299  22.559 ± 0.283  24.396 ± 2.345  12.620 ± 0.044  12.665 ± 0.047  12.638 ± 0.046  12.637 ± 0.049
medical   5.882 ± 0.595   5.800 ± 0.564   5.826 ± 0.565   2.942 ± 0.327   3.611 ± 0.431   2.812 ± 0.291   3.602 ± 0.455
scene     1.022 ± 0.053   0.922 ± 0.030   0.853 ± 0.046   0.696 ± 0.024   0.744 ± 0.029   0.764 ± 0.026   0.739 ± 0.028
slash     6.603 ± 0.132   6.467 ± 0.131   8.259 ± 0.259   3.835 ± 0.080   4.358 ± 0.152   3.965 ± 0.065   4.289 ± 0.074
tmc       7.704 ± 0.089   7.306 ± 0.139   5.329 ± 0.079   3.952 ± 0.034   3.924 ± 0.042   3.912 ± 0.040   3.894 ± 0.040
yeast     9.596 ± 0.224   9.208 ± 0.143   9.735 ± 0.247   8.753 ± 0.140   8.752 ± 0.138   8.813 ± 0.148   8.747 ± 0.118

Table 3. The results of the F1 score (the best (highest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT              UFT             CFT
CAL.      0.319 ± 0.028   0.368 ± 0.015   0.318 ± 0.010   0.460 ± 0.006   0.447 ± 0.006   0.454 ± 0.005   0.473 ± 0.004
emo.      0.416 ± 0.087   0.489 ± 0.068   0.579 ± 0.030   0.639 ± 0.018   0.550 ± 0.061   0.619 ± 0.029   0.637 ± 0.016
enron     0.538 ± 0.010   0.547 ± 0.011   0.385 ± 0.021   0.574 ± 0.007   0.580 ± 0.009   0.545 ± 0.011   0.598 ± 0.010
imdb      0.256 ± 0.001   0.157 ± 0.015   0.001 ± 0.000   0.352 ± 0.015   0.371 ± 0.001   0.358 ± 0.001   0.374 ± 0.001
medical   0.784 ± 0.017   0.779 ± 0.014   0.523 ± 0.038   0.817 ± 0.015   0.789 ± 0.021   0.797 ± 0.011   0.796 ± 0.014
scene     0.687 ± 0.012   0.701 ± 0.010   0.655 ± 0.023   0.735 ± 0.011   0.721 ± 0.010   0.667 ± 0.007   0.717 ± 0.010
slash     0.489 ± 0.012   0.496 ± 0.007   0.136 ± 0.054   0.577 ± 0.008   0.517 ± 0.011   0.540 ± 0.005   0.514 ± 0.007
tmc       0.684 ± 0.003   0.693 ± 0.003   0.606 ± 0.007   0.714 ± 0.002   0.709 ± 0.002   0.687 ± 0.002   0.714 ± 0.002
yeast     0.622 ± 0.007   0.634 ± 0.007   0.607 ± 0.012   0.638 ± 0.008   0.639 ± 0.005   0.649 ± 0.006   0.649 ± 0.006

Table 5. The results of the Accuracy score and the Composite score (the best ones are marked in bold)

           Accuracy (↑)                   Composite Score (↑)
Dataset    PCC-F1          CFT            PCC-Ham or F1    CFT
CAL.       0.303 ± 0.008   0.315 ± 0.004  −0.362 ± 0.012   −0.302 ± 0.013
emo.       0.534 ± 0.021   0.535 ± 0.015  −0.566 ± 0.100   −0.460 ± 0.063
enron      0.453 ± 0.009   0.476 ± 0.009  0.300 ± 0.017    0.351 ± 0.012
imdb       0.242 ± 0.010   0.268 ± 0.001  −0.263 ± 0.055   −0.096 ± 0.001
medical    0.783 ± 0.015   0.764 ± 0.018  0.758 ± 0.016    0.747 ± 0.018
scene      0.676 ± 0.011   0.669 ± 0.010  0.150 ± 0.036    0.170 ± 0.022
slash      0.511 ± 0.009   0.481 ± 0.006  0.263 ± 0.012    0.277 ± 0.011
tmc        0.613 ± 0.004   0.614 ± 0.002  0.402 ± 0.007    0.419 ± 0.004
yeast      0.518 ± 0.012   0.539 ± 0.006  −0.398 ± 0.019   −0.376 ± 0.019

Table 4. CFT versus the other algorithms based on the t-test at the 95% confidence level (#win/#tie/#loss)

Criteria   CC       ECC      MLkNN    PCC      TT       UFT
Ham.       7/1/1    2/4/3    5/1/3    4/3/2    5/2/2    6/2/1
Rank.      9/0/0    9/0/0    9/0/0    3/2/4    4/5/0    6/1/2
F1.        9/0/0    9/0/0    9/0/0    4/2/3    7/1/1    6/2/1
Total      25/1/1   20/4/3   23/1/3   11/7/9   16/8/3   18/5/4

5.1. Cost-insensitive versus Cost-sensitive

Table 1 compares all the algorithms based on the Hamming loss. As discussed in Section 4, CC-P is equivalent to TT with the Hamming loss. In Table 1, the five algorithms that reach the best performance on some dataset are ECC, MLkNN, PCC, UFT and CFT. Moreover, ECC successfully improves the performance of CC. The state-of-the-art algorithm, MLkNN, often achieves the best results. Looking at Table 4 for the t-test results, CFT is competitive with ECC and PCC, while often being better than MLkNN.

For the other two criteria, as shown in Tables 2 and 3, the algorithms that do not consider the cost explicitly, such as CC, ECC and MLkNN, are generally worse than the cost-sensitive algorithms. The results demonstrate the importance and effectiveness of properly considering the cost information in the algorithm.

5.2. Comparison with Tree-based Algorithms

In Tables 1, 2, 3 and 4, when comparing CFT with TT, CFT wins on 16 and ties on 8 of the 27 cases by the t-test. The results justify the importance of the bottom-up training of the tree model. When comparing UFT with CFT, CFT is better than UFT on 18 and ties on 5 out of the 27 cases by the t-test. The results demonstrate the effectiveness of focusing on the key paths (nodes).

(Figure: training and test F1 score of CFT and UFT on emotions as the number of paths M increases from 2 to 8.)


We further study UFT and CFT for varying M. We show the results of the F1 score on emotions; similar behaviors are observed across the other datasets and criteria. When the number of paths increases, both the training and testing performance of CFT and UFT improve. Moreover, CFT converges to a better F1 score than UFT as M increases, which explains its better performance during testing.

While CFT is usually better than UFT, CFT loses to UFT on medical and slash in Tables 2 and 3. We study the reasons and find that the cause is overfitting. For instance, the training Rank loss of CFT on medical is 0.083, which is much smaller than the UFT result of 0.264. The result implies that CFT indeed optimizes the desired evaluation criterion during training, but the focus on key paths can suffer from worse generalization on a few datasets. A preliminary study shows that a mixture of CFT and UFT is less prone to overfitting.

5.3. Comparison between PCC and CFT

For the Hamming loss, the Rank loss and the F1 score, exact inference rules for PCC have been proposed (Dembczynski et al., 2010; 2011). From Tables 1, 2, 3 and 4, PCC and CFT are competitive with each other on the three criteria, having similar numbers of winning and losing cases.

To demonstrate the full ability of CFT, we consider two other criteria for which there is no inference rule (yet) for PCC: the Accuracy score (α-Accuracy with α = 1) (Boutell et al., 2004; Tsoumakas et al., 2010), and a composite score formed from the F1 score and the Hamming loss, with results in Table 5. The criteria are defined as Accuracy = ‖y ∩ ŷ‖_1 / ‖y ∪ ŷ‖_1 and Composite score = F1 score − 5 × Hamming loss.
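An illustrative sketch of the two criteria for a single example (treating both scores as 1 in the degenerate case where y and ŷ are both empty) follows:

import numpy as np

def accuracy_score(y, y_hat):
    # per-example (alpha = 1) Accuracy: |y ∩ y_hat| / |y ∪ y_hat|
    union = np.sum((y + y_hat) > 0)
    return 1.0 if union == 0 else np.sum(y * y_hat) / union

def composite_score(y, y_hat):
    # F1 score minus 5 times the Hamming loss
    denom = y.sum() + y_hat.sum()
    f1 = 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom
    return f1 - 5.0 * np.mean(y != y_hat)

y, y_hat = np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])
print(accuracy_score(y, y_hat), composite_score(y, y_hat))   # 0.333..., -2.0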

Here we use approximate inference rules for PCC. For the Accuracy score, we couple PCC with the inference rule of the F1 score in view of the similarity between the formulas. For the Composite score, which considers the F1 score and the Hamming loss concurrently, we run PCC with either the inference rule of the F1 score or that of the Hamming loss, and optimistically report the better of the two in Table 5.

Table 5 can be summarized as follows. Due to the similarity between the formulas, CFT and PCC-F1 reach similar results for the Accuracy score. For the Composite score, which is similar to neither the F1 score nor the Hamming loss, PCC is much worse than CFT.

When K is small, PCC can use exhaustive search to enumerate the 2^K possible ŷ and locate the Bayes-optimal ŷ. We further list the performance of this PCC-exhaust approach on emotions, scene and yeast, which have no more than 14 labels.

Inference   Acc. (↑)                   Comp. (↑)
            emo.     scene    yeast    emo.     scene    yeast
Apprx.      0.534    0.676    0.518    -0.566   0.150    -0.398
Exhau.      0.530    0.709    0.535    -0.570   0.176    -0.383

With the exhaustive inference, the performance of PCC is significantly improved in most cases. The good performance highlights the importance of exact and efficient inference rules for PCC. Nevertheless, if the desired evaluation criterion is complicated, it is non-trivial to design an exact and efficient inference rule. When comparing PCC-exhaust with CFT, we see that CFT wins on 3 cases, ties on 1 case and loses on 2 cases. Thus, the efficient CFT is quite competitive with the inefficient PCC-exhaust in performance.

6. Conclusion

We tackle the general cost-sensitive multi-label classification problem without any specific subroutine for different evaluation criteria, which meets the demands of real-world applications. We propose the condensed filter tree (CFT) algorithm by coupling several tools and ideas: the label powerset approach for reducing to cost-sensitive classification, the tree-based algorithms for cost-sensitive classification, the proper-ordering and K-classifier tricks that utilize the structural property of multi-label classification, and the theoretical bound for locating the key tree nodes (paths) for training. The resulting CFT is as efficient as the common label-wise decomposition approaches in training and prediction, with respect to the number of possible labels.

Experimental results demonstrate that CFT is competitive with leading approaches for multi-label classification, and usually outperforms those approaches on the evaluation criteria that they are not designed for.

CFT can currently handle evaluation criteria defined by a desired label vector and a predicted label vector. We can view CFT as a first step towards tackling more complicated evaluation criteria, which shall be an important future research direction.

7. Acknowledgement

We thank Profs. Yuh-Jye Lee, Chih-Jen Lin, Shou-De Lin, Chi-Jen Lu, Hung-Yi Lo, the anonymous reviewers, and the members of the NTU Computational Learning Lab for valuable suggestions. This work is mainly supported by the National Science Council (NSC 101-2628-E-002-029-MY2) of Taiwan.

References

Beygelzimer, A., Dani, V., Hayes, T., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Beygelzimer, A., Langford, J., and Ravikumar, P. Error correcting tournaments, 2008. URL http://arxiv.org/abs/0902.3176.

Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. Learning multi-label scene classification. Pattern Recognition, 2004.

Dembczynski, K., Cheng, W., and Hüllermeier, E. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems 24, 2011.

Dembczynski, K., Kotlowski, W., and Hüllermeier, E. Consistent multilabel ranking through univariate losses. In Proceedings of the 29th International Conference on Machine Learning, 2012a.

Dembczynski, K., Waegeman, W., and Hüllermeier, E. An analysis of chaining in multi-label classification. In Proceedings of the 20th European Conference on Artificial Intelligence, 2012b.

Domingos, P. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

Elisseeff, A. and Weston, J. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, 2002.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 2008.

Kumar, A., Vembu, S., Menon, A. K., and Elkan, C. Beam search algorithms for multilabel learning. Machine Learning, 2013.

Lo, H.-Y., Wang, J.-C., Wang, H.-M., and Lin, S.-D. Cost-sensitive multi-label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia, 2011.

Mineiro, P. Cost sensitive multi label: an observation, 2011. URL http://www.machinedlearnings.com/2011/05/cost-sensitive-multi-label-observation.html.

Petterson, J. and Caetano, T. S. Reverse multi-label learning. In Advances in Neural Information Processing Systems 23, 2010.

Petterson, J. and Caetano, T. S. Submodular multi-label learning. In Advances in Neural Information Processing Systems 24. 2011.

Read, J. Meka: a multi-label extension to weka, 2012. URL http://meka.sourceforge.net.

Read, J., Pfahringer, B., Holmes, G., and Frank, E. Classifier chains for multi-label classification. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, 2009.

Srivastava, A. N. and Zane-Ulman, B. Discovering recurring anomalies in text reports regarding complex space systems. In IEEE Aerospace Conference, 2005.

Tsoumakas, G. and Vlahavas, I. Random k-labelsets: an ensemble method for multilabel classification. In Machine Learning: the European Conference on Machine Learning, 2007.

Tsoumakas, G., Katakis, I., and Vlahavas, I. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer US, 2010.

Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., and Vlahavas, I. Mulan: a Java library for multi-label learning. Journal of Machine Learning Research, 2011.

Tsoumakas, G., Zhang, M.-L., and Zhou, Z.-H. Introduction to the special issue on learning from multi-label data. Journal of Machine Learning Research, 2012.

Turnbull, D., Barrington, L., Torres, D. A., and Lanckriet, G. R. G. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 2008.

Zhang, M.-L. and Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 2007.
