
Condensed Filter Tree for Cost-Sensitive Multi-Label Classification


Chun-Liang Li R01922001@CSIE.NTU.EDU.TW

Hsuan-Tien Lin HTLIN@CSIE.NTU.EDU.TW

Department of Computer Science and Information Engineering, National Taiwan University

Abstract

Different real-world applications of multi-label classification often demand different evaluation criteria. We formalize this demand with a general setup, cost-sensitive multi-label classification (CSMLC), which takes the evaluation criteria into account during learning. Nevertheless, most existing algorithms can only focus on optimizing a few specific evaluation criteria and cannot systematically deal with different ones. In this paper, we propose a novel algorithm, called condensed filter tree (CFT), for optimizing any criterion in CSMLC. CFT is derived by reducing CSMLC to the famous filter tree algorithm for cost-sensitive multi-class classification via the label powerset construction. We successfully cope with the difficulty of having exponentially many extended classes in the powerset for representation, training and prediction by carefully designing the tree structure and focusing on the key nodes. Experimental results across many real-world datasets validate that CFT is competitive with special-purpose algorithms on special criteria and reaches better performance on general criteria.

1. Introduction

The multi-label classification problem allows each instance to be associated with a set of labels simultaneously. It has in recent years attracted much attention among researchers (Tsoumakas et al., 2010; 2012) because the problem setting matches many different real-world applications; these include bio-informatics (Elisseeff & Weston, 2002), text mining (Srivastava & Zane-Ulman, 2005) and multimedia (Turnbull et al., 2008). The different applications often come with different criteria for evaluating the performance of multi-label classification algorithms. Popular criteria include the Hamming loss, the 0/1 loss, the Rank loss, the F1 score and the Accuracy score (Tsoumakas et al., 2010).

Currently, most algorithms are designed based on none, one, or a few specific criteria. For instance, the label-wise decomposition approaches (Read et al., 2009) aim at optimizing the Hamming loss by decomposing the multi-label classification problem into several binary classification problems, one for each possible label. The label powerset approach aims at optimizing the 0/1 loss by treating each distinct label-set as a unique extended class and reducing multi-label classification to multi-class classification. The probabilistic classifier chain (PCC) (Dembczynski et al., 2010) approach estimates the probability of each possible label-set given an instance and uses the estimate to make a Bayes-optimal decision for any loss function, while the structured SVM approach (Petterson & Caetano, 2010; 2011) uses different convex surrogates for different evaluation criteria. However, both approaches require either special inference rules or loss maximizers for different evaluation criteria.

The variety of evaluation criteria calls for a more general algorithm that can cope with different criteria systematically and automatically. We formalize this need with a general setup, called cost-sensitive multi-label classification (CSMLC). CSMLC can be viewed as an extension of the popular setup of cost-sensitive multi-class classification. In CSMLC, we feed the multi-label classification algorithm with a cost function that quantifies the difference between a predicted label-set and a desired one. A general CSMLC algorithm operates on the given cost function, with the goal of performing well with respect to that cost function. Compared with the existing methodology that requires a specific design for every new application (criterion), general CSMLC algorithms save those design efforts and can easily be adapted to different application needs.

In this paper, we propose a novel algorithm for CSMLC, called the condensed filter tree (CFT). In contrast to PCC, the proposed CFT directly takes the criterion into account as the cost function during training, thereby averting the need to design a specific inference rule for each new criterion and avoiding the possibly time-consuming inference step during prediction. Inspired by the rich literature of cost-sensitive multi-class classification (Domingos, 1999; Beygelzimer et al., 2008), CFT is derived by first reducing CSMLC to cost-sensitive multi-class classification via the label powerset approach. Nevertheless, the reduction leads to exponentially many extended classes, which makes training, prediction and model representation computationally challenging.

We conquer the challenge of prediction by exploiting tree-based models for cost-sensitive multi-class classification. Tree-based models use a tree structure constructed from binary classifiers to make fast predictions, achieving time complexity logarithmic in the number of extended classes, which is linear in the number of possible labels. Furthermore, we conquer the challenge of model representation by proposing the proper-ordering and K-classifier tricks. Interestingly, the two tricks reveal a strong connection between the CFT algorithm (which is derived from the label powerset approach) and the label-wise decomposition approaches.

Finally, we conquer the training challenge by modifying the famous Filter Tree algorithm (Beygelzimer et al., 2008) for CSMLC. The modification comes from revisiting the theoretical bound of Filter Tree, which allows the proposed CFT algorithm to focus only on some key tree nodes for training efficiency.

We conduct experiments on nine real-world datasets to validate the proposed CFT algorithm. Experimental results demonstrate that for specific evaluation criteria, CFT is competitive with special-purpose algorithms, such as PCC with specifically designed inference rules or the state-of-the-art MLkNN algorithm (Zhang & Zhou, 2007). For general criteria, for which there is as yet no inference rule for PCC, CFT can reach significantly better performance. The results justify the superiority of the proposed CFT for general CSMLC problems.

2. Problem Setup

In a multi-label classification problem, we denote the feature vector by x ∈ R^d and its relevant label set by Y ⊆ {1, 2, ..., K}, where K is the number of classes. The label set Y is commonly represented as a label vector y ∈ {0, 1}^K, where y[k] = 1 if and only if k ∈ Y. Given a dataset D = {(x_n, y_n)}_{n=1}^N, which contains N i.i.d. training examples (x_n, y_n) drawn from an unknown distribution P, the goal is to design an algorithm that uses D to find a classifier h : R^d → {0, 1}^K in the training stage, with the hope that h(x) closely predicts the y of an unseen x in the prediction stage when (x, y) is drawn from P.

For evaluating the closeness of the prediction ŷ = h(x), one of the most common criteria is the Hamming loss, Hamming(y, ŷ) = (1/K) Σ_{k=1}^{K} ⟦y[k] ≠ ŷ[k]⟧. Note that the Hamming loss evaluates each label component separately and with equal weight. In addition to the Hamming loss, there are many other criteria that evaluate the components of ŷ jointly; these include the 0/1 loss, the Rank loss, the F1 score and the Accuracy score (Tsoumakas et al., 2010). In this paper, we use loss to denote a criterion that shall be minimized, and score to denote a criterion that shall be maximized.
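For illustration, the following minimal Python sketch (assuming the label vectors are represented as NumPy 0/1 arrays) computes the Hamming loss, the 0/1 loss and the per-example F1 score:

import numpy as np

def hamming_loss(y, y_hat):
    # average number of mismatched label components
    return np.mean(y != y_hat)

def zero_one_loss(y, y_hat):
    # 1 if any component differs, 0 otherwise
    return float(np.any(y != y_hat))

def f1_score(y, y_hat):
    # per-example F1: 2|y ∩ y_hat| / (|y| + |y_hat|); treated as 1 when both are empty
    denom = y.sum() + y_hat.sum()
    return 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom

y, y_hat = np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])
print(hamming_loss(y, y_hat), zero_one_loss(y, y_hat), f1_score(y, y_hat))   # 0.5 1.0 0.5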

The variety of criteria calls for a general setup of multi-label classification, called cost-sensitive multi-label classification (CSMLC), which will be the main focus of this work. CSMLC can be viewed as an extension of the popular setup of cost-sensitive multi-class classification (Domingos, 1999). In CSMLC, we assume that there is a known cost function (matrix) C : {0, 1}^K × {0, 1}^K → R, where C(y, ŷ) denotes the cost of predicting (x, y) as ŷ. The cost matrix is not only part of the prediction stage, where C(y, h(x)) evaluates the performance of any classifier h, but also part of the training stage, where C is fed as an additional piece of information to guide the learning algorithm.

The CSMLC setup meets the goal of optimizing many existing criteria, such as the (per-example) F1 score, the Accuracy score and the Rank loss (Tsoumakas et al., 2010).

Note that the setup above only considers a cost matrix C indexed by a desired vector y and a predicted vector ŷ. Thus, the setup cannot fully cover some more complicated evaluation criteria such as the micro-F1 score and the macro-F1 score, which are defined on a set of vectors. Studying the CSMLC setup can be viewed as an intermediate step toward tackling those complicated criteria in the future.

There are many existing algorithms for tackling multi-label classification, but they either do not seriously take the cost matrix (criteria) into account, or only aim at a few specific cost matrices. That is, general algorithms for CSMLC have not been well studied. One intuitive family of algorithms is label-wise decomposition. For instance, the binary relevance (BR) algorithm (Tsoumakas et al., 2010) decomposes D = {(x_n, y_n)}_{n=1}^N into K binary classification datasets D_k = {(x_n, y_n[k])}_{n=1}^N, and trains K independent binary classifiers h_k with D_k for predicting y[k]. One extension of BR is the classifier chain (CC) algorithm (Read et al., 2009), which takes D_k = {(z_n, y_n[k])}_{n=1}^N with z_n = (x_n, y_n[1], ..., y_n[k − 1]) to train h_k. One practical variant of CC, named CC-P, takes the predicted labels ŷ[k] instead of the true labels y[k] as the features in z_n.
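For illustration, a minimal Python sketch of the BR and CC decompositions is given below; it assumes a scikit-learn-style binary base learner and that every label takes both values in the training set, and CC-P would simply replace the chained true labels with the predictions of the earlier classifiers:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_br(X, Y, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # binary relevance: one independent binary classifier per label,
    # trained on D_k = {(x_n, y_n[k])}
    return [base_clf().fit(X, Y[:, k]) for k in range(Y.shape[1])]

def train_cc(X, Y, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # classifier chain: z_n = (x_n, y_n[1], ..., y_n[k-1]) as features for h_k;
    # CC-P would append the predicted labels of the earlier classifiers instead
    chain = []
    for k in range(Y.shape[1]):
        Z = np.hstack([X, Y[:, :k]])
        chain.append(base_clf().fit(Z, Y[:, k]))
    return chain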

Because CC-P (as well as BR/CC) predicts ŷ[k] separately with each h_k, arguably their main goal is to minimize the Hamming loss (Tsoumakas et al., 2010). Extending CC-P to general CSMLC, however, is non-trivial, because it is difficult to embed the 2^K × 2^K possible C(y, ŷ) cost components into K separate steps of training. One algorithm that solves the difficulty for some specific cost matrices is the probabilistic classifier chain (PCC) (Dembczynski et al., 2010). PCC avoids the embedding issue in training by adopting a soft version of CC-P/CC without any cost information to estimate the conditional probability P(y|x). The probabilistic view allows PCC to interpret CC as greedily minimizing the 0/1 loss through the chain rule. During prediction, PCC considers the cost matrix to make the Bayes-optimal decision, based on an efficient inference rule that is specifically designed for the given cost matrix.

The potential drawback of PCC is that not only is it non-trivial to estimate the conditional probability, but it is also challenging to design an efficient inference rule for each cost matrix. Because of the latter challenge, PCC can currently be used to exactly tackle only the Hamming loss, the Rank loss, and the F1 score (Dembczynski et al., 2010; 2011; 2012a). PCC can also be used with a search-based inference rule to approximately optimize the 0/1 loss (Dembczynski et al., 2012b; Kumar et al., 2013), but not other criteria in CSMLC.

Another major algorithm, known as label powerset (LP), reduces multi-label classification to multi-class classification (Tsoumakas et al., 2010). LP treats each unique pattern of the label vector as a single extended class. That is, the K possible labels are encoded into 2^K extended classes via a bijection enc : {0, 1}^K → {1, ..., 2^K}. During training, LP transforms D into D_m = {(x_n, c_n)}_{n=1}^N, where c_n = enc(y_n), and trains a multi-class classifier h_m from D_m. During prediction, LP takes h(x) = enc^{−1}(h_m(x)). Trivially, LP focuses on the 0/1 loss, because the error rate of h_m in the reduced problem is equivalent to the 0/1 loss of h. The disadvantage is that the exponentially many extended classes make LP infeasible and impractical in general.

Lo et al. (2011) propose the CS-RAKEL algorithm, which optimizes a weighted Hamming loss by extending RAKEL (Tsoumakas & Vlahavas, 2007), a representative algorithm between the label-wise decomposition and label powerset approaches. However, CS-RAKEL is designed for specific application needs and cannot tackle general CSMLC problems.

In summary, some related algorithms and their corresponding criteria are shown below. None of them can tackle general CSMLC problems.

Algorithms    Criteria Being Optimized
CC-P/CC       Hamming loss or 0/1 loss
PCC           Hamming loss, F1 score, Rank loss, 0/1 loss
LP            0/1 loss
CS-RAKEL      Weighted Hamming loss

3. Tree Model for CSMLC

Inspired by the connection between CSMLC and the rich literature of cost-sensitive classification (Domingos, 1999; Beygelzimer et al., 2008), we design a general CSMLC algorithm via this connection. Note that LP reduces multi-label classification to multi-class classification to optimize the 0/1 loss. If we follow the same reduction step but start from a general CSMLC problem, we end up with a cost-sensitive classification problem of 2^K extended classes and (implicitly) a 2^K × 2^K cost matrix. Then any existing cost-sensitive classification algorithm can be used to solve CSMLC. We call this preliminary algorithm cost-sensitive label powerset (CS-LP). As with LP, the exponential number of extended classes presents a computational challenge for CS-LP. For example, using CS-LP to reduce CSMLC to the weighted all-pairs approach (Beygelzimer et al., 2005) requires 2^K(2^K − 1)/2 comparisons for making each prediction.
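To make the blow-up concrete, the following illustrative sketch (with an F1-based cost standing in for a general C) forms the cost vector that CS-LP would attach to a single example, one entry per extended class; the enumeration is clearly feasible only for small K:

from itertools import product
import numpy as np

def f1_cost(y, y_hat):
    # an example cost matrix: 1 - (per-example F1 score)
    denom = y.sum() + y_hat.sum()
    f1 = 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom
    return 1.0 - f1

def cslp_cost_vector(y, K, cost=f1_cost):
    # one cost per extended class; product() enumerates label vectors in
    # binary-number order, matching the proper ordering introduced below
    return np.array([cost(y, np.array(c)) for c in product([0, 1], repeat=K)])

K = 3
y = np.array([1, 0, 1])
print(len(cslp_cost_vector(y, K)))   # 2**K = 8 entries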

Interestingly, PCC can be viewed as a special case of using CS-LP to reduce CSMLC to the famous Meta-Cost approach (Domingos, 1999). Meta-Cost estimates the conditional probability during training and then makes the Bayes-optimal decision with respect to the cost matrix during prediction. Similarly, PCC estimates the probability by CC-P/CC, and then infers the optimal decision with respect to the cost matrix by a specifically designed inference rule.

We take another route that uses CS-LP to reduce CSMLC to tree models for cost-sensitive classification (Beygelzimer et al., 2008). A similar idea based on the Hamming loss has been discussed in a blog post (Mineiro, 2011), but the idea has not been seriously studied for general CSMLC problems. Tree models form a binary tree of weighted binary classifiers to conduct cost-sensitive classification. Each non-leaf node of the tree is a binary classifier for deciding which subtree to go to, and each leaf node represents a class. Without loss of generality, we assume that the leaf nodes are indexed orderly by 1, 2, ..., #classes. Making a prediction for each instance follows the decisions of the binary classifiers, starting from the root down to a leaf. That is, only O(log(#classes)) decisions are required for making each prediction. In CS-LP, tree models thus result in O(K) time for each prediction, of the same order as the label-wise decomposition approaches.

Nevertheless, the number of nodes in the resulting tree structure is O(2^K), which poses challenges in representation and training. We first tackle the representation challenge in this section, and then study algorithms for training the tree model in Section 4.

Figure 1. Proper Ordering. (a) Put labels on leaf nodes orderly. (b) Index internal nodes by paths.

Proper Ordering. Recall that CS-LP needs a bijective function enc(·) : {0, 1}^K → {1, ..., 2^K} for encoding y to c and for decoding the predicted class ĉ back to the corresponding label vector ŷ. Although a prediction ĉ can be made within O(K) time in the tree model, the encoding function requires a careful design to make both enc and enc^{−1} efficient over the 2^K possible inputs. We first consider the proper-ordering trick, which lets enc(y) = BinaryNumber(y) + 1. That is, we treat each y as a binary string, encode it by computing its corresponding integer in O(K) time, and decode accordingly, as illustrated in Fig. 1(a). Based on proper ordering, if we let {0, 1} represent the decisions {L, R} in each classifier of the tree, then the label vector y (extended class c) of each leaf node is equivalent to the sequence of binary decisions made from the root to the leaf. More generally, we can index each node of the tree by t ∈ {0, 1}^{k−1}, as shown in Fig. 1(b), where t is the sequence of binary decisions from the root to the node on layer k.
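A minimal sketch of the encoding and decoding, assuming y[1] is treated as the most significant bit of the binary string:

def enc(y):
    # enc(y) = BinaryNumber(y) + 1, computed in O(K)
    c = 0
    for bit in y:
        c = (c << 1) | int(bit)
    return c + 1

def dec(c, K):
    # inverse mapping back to the label vector
    bits = []
    c -= 1
    for k in range(K - 1, -1, -1):
        bits.append((c >> k) & 1)
    return bits

assert dec(enc([1, 0, 1, 1]), 4) == [1, 0, 1, 1]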

K-Classifier Trick. Even with proper ordering, there are 2^K − 1 internal nodes (classifiers) in total on the tree. The exponential number makes representing (and training) the classifiers infeasible in practice. One existing idea for a feasible representation is the 1-classifier trick (Beygelzimer et al., 2008), which lets all 2^K − 1 internal nodes t share one classifier h(x, t). Nevertheless, using the 1-classifier trick often requires the classifier to be powerful enough to capture the different characteristics of different nodes. This requirement makes the trick less suitable for practical use. Therefore, we propose the K-classifier trick, a trade-off between using 1 classifier and 2^K − 1 classifiers.

The K-classifier trick works as follows. After proper ordering, ŷ[k] corresponds to the prediction made by one of the nodes on layer k of the tree. In other words, the purpose of all the nodes located on layer k is similar: predicting ŷ[k]. The similar purpose allows us to view each node as part of a layer classifier of the form h_k(x, t), which takes an instance x and a node index t ∈ {0, 1}^{k−1} for predicting the k-th component of the label vector. Then, equivalently, only K classifiers (one per layer) are required for representing the tree.

Connection to CC-P. By the proper-ordering and K-classifier tricks, predicting the extended class ĉ with the tree from layer 1 (root) to layer K is equivalent to predicting ŷ[1], ..., ŷ[K] with the K classifiers {h_1, ..., h_K} using x and t. Such a prediction procedure is exactly the same as the one used in CC-P, which uses the classifiers h_k with exactly the same inputs (the instance x and the predicted labels, which form the node index t in the tree) for predicting the next label.

Figure 2. (a) Training of Top-down Tree; (b) Training of Filter Tree. The cost matrix used is 1 − (F1 score). (0/1, w) denotes the direction based on proper ordering and the weight for training the instance on the node. The thick edge represents the prediction for the instance by the trained classifier on the parent node.

The use of the two tricks reveals an interesting connection between two very different families of approaches: label-wise decomposition can be viewed as a special case of label powerset (in prediction). In short, label powerset with the tree model, the proper-ordering trick and the K-classifier trick is equivalent to CC-P for prediction. Thus, by studying the role of the cost matrix during training, we can systematically extend CC-P to be cost-sensitive. Next, we discuss how to train the K classifiers subject to the cost matrix efficiently.
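Before turning to training, the following illustrative sketch (assuming scikit-learn-style layer classifiers trained on the concatenated features (x, t)) shows the shared prediction procedure:

import numpy as np

def tree_predict(layer_clfs, x):
    # layer_clfs holds h_1, ..., h_K in order; each is a binary classifier on (x, t)
    t = []                                   # node index = binary decisions so far
    for h_k in layer_clfs:
        z = np.concatenate([x, t]).reshape(1, -1)
        t.append(int(h_k.predict(z)[0]))     # go to subtree 0 or 1, i.e. predict y[k]
    return np.array(t)                       # the leaf reached is the predicted label vector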

4. Training of Tree Model

There are two major algorithms for training the binary classifiers in the tree: Top-down Tree and Filter Tree (Beygelzimer et al., 2008).

Top-down Tree (TT). Top-down Tree trains the classifiers from layer 1 (root) to layer K. Formally, for each internal node t on layer k, denote its left child by t0 and its right child by t1 on layer (k + 1), and define t* as the leaf node (prediction) with the minimum cost within the subtree T_t rooted at t. Then, for each training example (x_n, y_n) that reaches t during top-down training, we form a weighted example ((x_n, t), b_n, w_n) to train the weighted classifier h_k, where the label b_n = argmin_{i∈{0,1}} C(y_n, t_i*) represents the optimal decision, and the weight w_n = |C(y_n, t_0*) − C(y_n, t_1*)| represents the cost difference. The training examples are then split into two sets based on the decisions of the trained classifier h_k, and are used to train the child nodes t0 and t1, respectively. All the training examples are used to train the root classifier, and the whole tree is trained recursively by such divide-and-conquer steps. Note that, as illustrated in Fig. 2(a), each training example (x_n, y_n) contributes only to training the nodes on the path from the root to the predicted leaf of the example.
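The following illustrative sketch (the helper names are illustrative only) makes the construction of one weighted example concrete; it finds each child's minimum-cost leaf by brute-force enumeration, which is feasible only for small K, although for the Hamming loss the minimizer has a closed form (complete the prefix with the remaining true labels):

from itertools import product
import numpy as np

def min_cost_leaf(y, prefix, K, cost):
    # cheapest leaf (label vector) whose first len(prefix) components equal the prefix
    completions = (np.array(list(prefix) + list(tail))
                   for tail in product([0, 1], repeat=K - len(prefix)))
    return min(completions, key=lambda leaf: cost(y, leaf))

def tt_weighted_example(x, y, t, K, cost):
    # children of node t correspond to appending 0 or 1 to the prefix t
    c0 = cost(y, min_cost_leaf(y, list(t) + [0], K, cost))
    c1 = cost(y, min_cost_leaf(y, list(t) + [1], K, cost))
    b = int(c1 < c0)                    # optimal direction b_n
    w = abs(c0 - c1)                    # importance weight w_n
    return np.concatenate([x, t]), b, w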


The time complexity of Top-down Tree for CSMLC is the same as that of CC-P. In fact, if we take the Hamming loss, then w_n = 1/K is the same for every instance on every node in the k-th layer, (x_n, t_n) = (x_n, ŷ_n[1], ..., ŷ_n[k − 1]), and b_n = y_n[k]. Thus, Top-down Tree with the Hamming loss is equivalent to CC-P. That is, general Top-down Tree can be viewed as a systematic extension of CC-P to general CSMLC.

Uniform Filter Tree (UFT). It is known that Top-down Tree may suffer from a weaker theoretical guarantee (Beygelzimer et al., 2008). An alternative algorithm, Filter Tree, trains the classifiers in a bottom-up manner starting from the last non-leaf layer, and each example (x_n, y_n) is used to train all nodes. As illustrated in Fig. 2(b), the last non-leaf layer of classifiers is trained by forming weighted examples based on the better of the two leaves. After training, each node on the last layer decides the winning leaf of the two by predicting on x_n. The winning labels then form the new “leaves” of a smaller filter tree, and the classifiers on the upper layer are trained similarly. Due to the bottom-up manner, on layer k, Filter Tree considers all the predictions from layer k + 1 to layer K. That is, Filter Tree splits one original training example into 2^{k−1} examples, one for each possible node t, to train all of the 2^{k−1} nodes on the layer. When k is large, training on layer k can thus be challenging. Compared with Top-down Tree, Filter Tree is less efficient because it considers all training examples for each node, but it enjoys a stronger theoretical guarantee (Beygelzimer et al., 2008).

One possibility for training Filter Tree efficiently is to train only a few nodes per layer, with the hope that the other nodes can also perform decently because of the classifier sharing in the K-classifier trick. The original Filter Tree work (Beygelzimer et al., 2008) suggests one simple approach that splits one example to train M uniformly chosen nodes on the k-th layer to approximate the full training of the 2^{k−1} nodes. We call this algorithm Uniform Filter Tree (UFT) for CSMLC.

Condensed Filter Tree (CFT). In Filter Tree, there are 2^K possible traversal paths from the root to the leaves for each instance; however, many of them, such as paths that result in high costs, are seldom needed if we have reasonably good classifiers. Therefore, we can shift our focus to the important nodes on each layer instead of sampling uniformly for each instance. Next, we revisit the regret bound of Filter Tree, and show that the bound can be revised to focus on a key path of nodes on the tree.

In CSMLC, for a feature vector x and a distribution P|x for generating the label vector y, the regret rg of a classifier h on x is defined as

rg(h, P) = E_{y∼P|x}[C(h(x), y)] − min_g E_{y∼P|x}[C(g(x), y)].

For a distribution that generates weighted binary examples (x, b, w), the regret can be defined similarly by using w as the cost of a wrong prediction (of b) and 0 as the cost of a correct prediction.

Let y* = argmin_ỹ E_{y∼P|x} C(y, ỹ) be the ideal prediction for x under P. When t′ is an ancestor (prefix) of t on the tree, denote by ⟨t′, t⟩ the list (path) that contains the nodes on the path from node t′ to t. We call ⟨r, y*⟩ the ideal path of the tree for x, where r is the root of the tree. Similarly, for each node t, we can define the ideal path of the subtree T_t rooted at t. Beygelzimer et al. (2008) prove that, for Filter Tree, the CSMLC regret of any tree-based classifier is upper-bounded by the total regret of all the nodes on the tree. Next, we show that the total regret of only the nodes on the ideal path can readily be used to upper-bound the CSMLC regret.

Theorem 1. Under the proper-ordering and K-classifier tricks, for each x and the multi-label classifier h formed by chaining the K binary classifiers (h_1, ..., h_K) as in the prediction procedure of Filter Tree, the regret rg(h, P) is no more than

Σ_{t ∈ ⟨r, y*⟩} ⟦h_k(x, t) ≠ y*[k]⟧ · rg( h_k(x, t), FT_t(P, h_{k+1}, ..., h_K) ),

where k denotes the layer that t is on, and FT_t(P, h_{k+1}, ..., h_K) represents the procedure that generates the weighted examples (x, b, w) used to train the node at index t, based on sampling y from P|x and considering the predictions of the classifiers in the lower layers.

Proof. For each node t on layer k, h_k directs the prediction procedure to move to either node t0 or node t1. Without loss of generality, assume h_k(x, t) = 1. Denote by t̂ the prediction (leaf) on x when starting at node t. For each leaf node ỹ, let C̄(ỹ) ≡ E_{y∼P|x} C(y, ỹ). Then the node regret rg(t) is simply C̄(t̂1) − min_{i∈{0,1}} C̄(t̂i).

In addition to the regret of nodes, we also define the regret of the subtree T_t rooted at node t as the regret of the predicted path (vector) t̂ within the subtree T_t, that is, rg(T_t) = C̄(t̂) − C̄(t*), where t* denotes the optimal prediction (leaf node) in the subtree T_t. By this definition, rg(h, P) is simply rg(T_r).

The proof follows by replacing the total regret with rg(T_r) in the original Filter Tree analysis (Beygelzimer et al., 2008). Due to the space limit, we omit the complete proof here.

In Theorem 1, the bound is related only to certain nodes on the ideal path for each training example. The bound inspires us to first consider using each training example to train only the K nodes on its ideal path to obtain the classifiers h_1, ..., h_K for each layer. Then we can find the uppermost mis-classified node t on the ideal path for each example (x_n, y_n). Without loss of generality, assume t is on layer k, with y*[k] = 0 and h_k(x, t) = 1.

Figure 3. The thick edge represents the prediction of the corresponding parent node. The ideal path is ⟨r, t*⟩ and t is the first mis-classified node on it; the ideal path of the subtree T_{t1} is ⟨t1, t1*⟩ and t10 is the first mis-classified node on it. Both nodes t and t10 are on the predicted path ⟨r, t̂1⟩.

According to Theorem 1, we could decrease the regret rg(h, P) (that is, rg(T_r)) by decreasing the node regret rg(t) = C̄(t̂1) − min_{i∈{0,1}} C̄(t̂i), which can be done by decreasing C̄(t̂1) (since t̂0 relates to the regret of other nodes on the ideal path of t, we cannot easily increase C̄(t̂0) to decrease rg(t)). Because C̄(t1*) is a constant, decreasing C̄(t̂1) is equivalent to decreasing the regret rg(T_{t1}) = C̄(t̂1) − C̄(t1*) of the subtree T_{t1}. We can then recursively adopt the above procedure to optimize the subtree regret rg(T_{t1}), as shown in Fig. 3.

The procedure suggests decreasing the regret on ⟨r, t⟩ and ⟨t, t̂⟩, which together form the predicted path of x_n. Therefore, the next key path of x_n that should be included for training is its predicted path. That is, we can now train Filter Tree by adding the predicted path of each x_n. We call the resulting algorithm Condensed Filter Tree, as shown in Algorithm 1. The path-adding step can be repeated to further zoom in on the key nodes. The number of path-adding steps can be treated as a parameter M, which will be further discussed in Section 5.

In summary, we derive three efficient approaches for general CSMLC with trees: TT (a systematic extension of CC-P), UFT and CFT. Next, we compare them with other existing algorithms through experiments.

5. Experiment

We conduct experiments with different evaluation criteria on nine real-world datasets: CAL500, emotions, enron, imdb, medical, scene, slash, tmc and yeast (Tsoumakas et al., 2011; Read, 2012).

Algorithm 1 Condensed Filter Tree for CSMLC
 1: D = {(x_n, y_n)}_{n=1}^N; D_p = {((x_n, y_n), y_n)}_{n=1}^N
 2: for m = 1 to M do
 3:   for each layer k from layer K down to the root do
 4:     D_k = ∅
 5:     for each instance ((x_n, ỹ_n), y_n) ∈ D_p do
 6:       t = (ỹ_n[1], ..., ỹ_n[k − 1]); z_n = (x_n, t)
 7:       b_n = argmin_{i∈{0,1}} C(y_n, t̂_i)
 8:       w_n = |C(y_n, t̂_1) − C(y_n, t̂_0)|
 9:       D_k ← D_k ∪ {(z_n, b_n, w_n)}
10:     end for
11:     h_k ← train(D_k)
12:   end for
13:   if m < M then
14:     for each instance (x_n, y_n) ∈ D do
15:       ŷ_n = predict(h_1, ..., h_K, x_n)
16:       D_p ← D_p ∪ {((x_n, ŷ_n), y_n)}
17:     end for
18:   end if
19: end for
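For illustration, one possible Python realization of Algorithm 1 is sketched below. It assumes scikit-learn-style weighted binary learners and that both directions appear in every layer's weighted dataset; the lower-layer classifiers trained earlier in the same bottom-up pass are used to complete each child prefix into a leaf (t̂_0 and t̂_1):

import numpy as np
from sklearn.linear_model import LogisticRegression

def complete_prefix(x, prefix, clfs, K):
    # follow the already-trained lower-layer classifiers to extend a node prefix into a leaf
    t = list(prefix)
    for k in range(len(prefix), K):
        z = np.concatenate([x, t]).reshape(1, -1)
        t.append(int(clfs[k].predict(z)[0]))
    return np.array(t)

def predict_cft(clfs, x, K):
    return complete_prefix(x, [], clfs, K)

def train_cft(X, Y, cost, M=4, base_clf=lambda: LogisticRegression(max_iter=1000)):
    # X: (N, d) features; Y: (N, K) binary label matrix; cost(y, y_hat) -> float
    N, K = Y.shape
    # D_p stores (feature vector, path used to index nodes, true label vector); initially the true path
    D_p = [(X[n], Y[n], Y[n]) for n in range(N)]
    clfs = [None] * K
    for m in range(M):
        for k in range(K - 1, -1, -1):              # bottom-up: layer K down to the root
            Z, B, W = [], [], []
            for x, y_tilde, y in D_p:
                t = list(y_tilde[:k])               # node index on this layer
                leaf0 = complete_prefix(x, t + [0], clfs, K)   # child 0 filtered downward
                leaf1 = complete_prefix(x, t + [1], clfs, K)   # child 1 filtered downward
                c0, c1 = cost(y, leaf0), cost(y, leaf1)
                Z.append(np.concatenate([x, t]))
                B.append(int(c1 < c0))              # optimal direction b_n
                W.append(abs(c1 - c0))              # importance weight w_n
            # assumes both directions occur in B; degenerate layers need special care
            clfs[k] = base_clf().fit(np.array(Z), np.array(B),
                                     sample_weight=np.array(W) + 1e-12)
        if m < M - 1:                               # add the predicted path of each example
            D_p += [(X[n], predict_cft(clfs, X[n], K), Y[n]) for n in range(N)]
    return clfs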

In the experiments, we include three kinds of algorithms in our comparison: (a) the label-wise decomposition approaches, including the classifier chain (CC), the ensemble classifier chain (ECC), and the probabilistic classifier chain (PCC); (b) the tree-based models, including Top-down Tree (TT), Uniform Filter Tree (UFT) and Condensed Filter Tree (CFT); and (c) MLkNN (Zhang & Zhou, 2007), a state-of-the-art algorithm that does not explicitly take the cost into account. We first consider three cost matrices: the Hamming loss, the Rank loss = Σ_{y[i] > y[j]} ( ⟦ŷ[i] < ŷ[j]⟧ + (1/2) ⟦ŷ[i] = ŷ[j]⟧ ), and the F1 score = 2‖y ∩ ŷ‖_1 / (‖y‖_1 + ‖ŷ‖_1). The three cost matrices correspond to known efficient inference rules for PCC (Dembczynski et al., 2010; 2011). Other criteria are compared in Section 5.3.

We couple PCC with L2-regularized logistic regression and the other algorithms with linear support vector machines as implemented in LIBLINEAR (Fan et al., 2008). For MLkNN, we use the implementation in Mulan (Tsoumakas et al., 2011). In each run of the experiment, we randomly sample 50% of the dataset for training and reserve the rest for testing. For UFT and CFT, we restrict the maximum M to 8 for efficiency. For the other parameters of each algorithm, we use cross-validation on the training set to search for the best choice. Finally, Tables 1, 2 and 3 list the results for the three cost matrices, respectively, with the mean and the standard error over 40 different random runs; the best result on each dataset is marked in bold. We also compare CFT with the other algorithms based on the t-test at the 95% confidence level. The numbers of datasets on which CFT wins, ties and loses are shown in Table 4.


Table 1. The results of the Hamming loss (the best (lowest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT (CC-P)       UFT             CFT
CAL.      0.1376 ± 0.002  0.1374 ± 0.002  0.1379 ± 0.002  0.1370 ± 0.002  0.1375 ± 0.002  0.1489 ± 0.005  0.1368 ± 0.002
emo.      0.2613 ± 0.029  0.2501 ± 0.022  0.2122 ± 0.012  0.2297 ± 0.011  0.2435 ± 0.015  0.2222 ± 0.014  0.2138 ± 0.009
enron     0.0465 ± 0.001  0.0466 ± 0.001  0.0540 ± 0.001  0.0462 ± 0.001  0.0467 ± 0.001  0.0551 ± 0.001  0.0467 ± 0.001
imdb      0.0808 ± 0.000  0.0713 ± 0.000  0.0714 ± 0.000  0.0714 ± 0.000  0.0715 ± 0.000  0.0715 ± 0.000  0.0715 ± 0.000
medical   0.0109 ± 0.001  0.0113 ± 0.001  0.0176 ± 0.001  0.0110 ± 0.001  0.0108 ± 0.001  0.0119 ± 0.001  0.0102 ± 0.001
scene     0.1118 ± 0.004  0.0971 ± 0.004  0.0942 ± 0.004  0.0962 ± 0.003  0.0980 ± 0.003  0.1032 ± 0.003  0.1004 ± 0.003
slash     0.0418 ± 0.001  0.0383 ± 0.000  0.0514 ± 0.001  0.0386 ± 0.001  0.0388 ± 0.001  0.0375 ± 0.001  0.0383 ± 0.001
tmc       0.0571 ± 0.000  0.0565 ± 0.000  0.0669 ± 0.000  0.0576 ± 0.000  0.0575 ± 0.000  0.0574 ± 0.000  0.0572 ± 0.000
yeast     0.2107 ± 0.003  0.2009 ± 0.004  0.1981 ± 0.003  0.2006 ± 0.003  0.2000 ± 0.002  0.2008 ± 0.002  0.2013 ± 0.003

Table 2. The results of the Rank loss (the best (lowest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT              UFT             CFT
CAL.      1516.0 ± 60.4   1432.6 ± 39.0   1408.9 ± 21.3   967.93 ± 12.57  965.49 ± 11.20  968.40 ± 12.03  963.13 ± 10.99
emo.      2.697 ± 0.315   2.350 ± 0.299   1.906 ± 0.120   1.763 ± 0.102   1.868 ± 0.134   1.714 ± 0.131   1.632 ± 0.093
enron     44.190 ± 0.736  42.625 ± 0.775  55.959 ± 1.386  24.379 ± 0.557  25.144 ± 0.704  25.622 ± 0.576  24.907 ± 0.625
imdb      21.312 ± 0.299  22.559 ± 0.283  24.396 ± 2.345  12.620 ± 0.044  12.665 ± 0.047  12.638 ± 0.046  12.637 ± 0.049
medical   5.882 ± 0.595   5.800 ± 0.564   5.826 ± 0.565   2.942 ± 0.327   3.611 ± 0.431   2.812 ± 0.291   3.602 ± 0.455
scene     1.022 ± 0.053   0.922 ± 0.030   0.853 ± 0.046   0.696 ± 0.024   0.744 ± 0.029   0.764 ± 0.026   0.739 ± 0.028
slash     6.603 ± 0.132   6.467 ± 0.131   8.259 ± 0.259   3.835 ± 0.080   4.358 ± 0.152   3.965 ± 0.065   4.289 ± 0.074
tmc       7.704 ± 0.089   7.306 ± 0.139   5.329 ± 0.079   3.952 ± 0.034   3.924 ± 0.042   3.912 ± 0.040   3.894 ± 0.040
yeast     9.596 ± 0.224   9.208 ± 0.143   9.735 ± 0.247   8.753 ± 0.140   8.752 ± 0.138   8.813 ± 0.148   8.747 ± 0.118

Table 3. The results of the F1 score (the best (highest) ones are marked in bold)

Dataset   CC              ECC             MLkNN           PCC             TT              UFT             CFT
CAL.      0.319 ± 0.028   0.368 ± 0.015   0.318 ± 0.010   0.460 ± 0.006   0.447 ± 0.006   0.454 ± 0.005   0.473 ± 0.004
emo.      0.416 ± 0.087   0.489 ± 0.068   0.579 ± 0.030   0.639 ± 0.018   0.550 ± 0.061   0.619 ± 0.029   0.637 ± 0.016
enron     0.538 ± 0.010   0.547 ± 0.011   0.385 ± 0.021   0.574 ± 0.007   0.580 ± 0.009   0.545 ± 0.011   0.598 ± 0.010
imdb      0.256 ± 0.001   0.157 ± 0.015   0.001 ± 0.000   0.352 ± 0.015   0.371 ± 0.001   0.358 ± 0.001   0.374 ± 0.001
medical   0.784 ± 0.017   0.779 ± 0.014   0.523 ± 0.038   0.817 ± 0.015   0.789 ± 0.021   0.797 ± 0.011   0.796 ± 0.014
scene     0.687 ± 0.012   0.701 ± 0.010   0.655 ± 0.023   0.735 ± 0.011   0.721 ± 0.010   0.667 ± 0.007   0.717 ± 0.010
slash     0.489 ± 0.012   0.496 ± 0.007   0.136 ± 0.054   0.577 ± 0.008   0.517 ± 0.011   0.540 ± 0.005   0.514 ± 0.007
tmc       0.684 ± 0.003   0.693 ± 0.003   0.606 ± 0.007   0.714 ± 0.002   0.709 ± 0.002   0.687 ± 0.002   0.714 ± 0.002
yeast     0.622 ± 0.007   0.634 ± 0.007   0.607 ± 0.012   0.638 ± 0.008   0.639 ± 0.005   0.649 ± 0.006   0.649 ± 0.006

Table 5. The results of the Accuracy score and the Composite score (the best ones are marked in bold)

           Accuracy (↑)                   Composite Score (↑)
Dataset    PCC-F1          CFT            PCC-Ham or F1    CFT
CAL.       0.303 ± 0.008   0.315 ± 0.004  −0.362 ± 0.012   −0.302 ± 0.013
emo.       0.534 ± 0.021   0.535 ± 0.015  −0.566 ± 0.100   −0.460 ± 0.063
enron      0.453 ± 0.009   0.476 ± 0.009  0.300 ± 0.017    0.351 ± 0.012
imdb       0.242 ± 0.010   0.268 ± 0.001  −0.263 ± 0.055   −0.096 ± 0.001
medical    0.783 ± 0.015   0.764 ± 0.018  0.758 ± 0.016    0.747 ± 0.018
scene      0.676 ± 0.011   0.669 ± 0.010  0.150 ± 0.036    0.170 ± 0.022
slash      0.511 ± 0.009   0.481 ± 0.006  0.263 ± 0.012    0.277 ± 0.011
tmc        0.613 ± 0.004   0.614 ± 0.002  0.402 ± 0.007    0.419 ± 0.004
yeast      0.518 ± 0.012   0.539 ± 0.006  −0.398 ± 0.019   −0.376 ± 0.019

Table 4. CFT versus the other algorithms based on the t-test at the 95% confidence level (#win/#tie/#loss)

Criteria   CC       ECC      MLkNN    PCC      TT       UFT
Ham.       7/1/1    2/4/3    5/1/3    4/3/2    5/2/2    6/2/1
Rank.      9/0/0    9/0/0    9/0/0    3/2/4    4/5/0    6/1/2
F1.        9/0/0    9/0/0    9/0/0    4/2/3    7/1/1    6/2/1
Total      25/1/1   20/4/3   23/1/3   11/7/9   16/8/3   18/5/4

5.1. Cost-insensitive versus Cost-sensitive

Table 1 compares all the algorithms based on the Hamming loss. As discussed in Section 4, CC-P is equivalent to TT with the Hamming loss. In Table 1, the five algorithms that reach the best performance on some dataset are ECC, MLkNN, PCC, UFT and CFT. Moreover, ECC successfully improves the performance of CC. The state-of-the-art algorithm, MLkNN, often achieves the best results. Looking at Table 4 for the t-test results, CFT is competitive with ECC and PCC, while often being better than MLkNN.

For the other two criteria, as shown in Tables 2 and 3, the algorithms that do not consider the cost explicitly, such as CC, ECC and MLkNN, are generally worse than the cost-sensitive algorithms. The results demonstrate the importance and effectiveness of properly considering the cost information in the algorithm.

5.2. Comparison with Tree-based Algorithms

In Tables 1, 2, 3 and 4, when comparing CFT with TT, CFT wins on 16 and ties on 8 of the 27 cases by the t-test. The results justify the importance of the bottom-up training of the tree model. When comparing UFT with CFT, CFT is better than UFT on 18 and ties on 5 out of the 27 cases by the t-test. The results demonstrate the effectiveness of focusing on the key paths (nodes).

(Figure: training and test F1 score of CFT and UFT on emotions as the number of paths M increases from 2 to 8.)


We further study UFT and CFT for varying M. We show the results of the F1 score on emotions; similar behaviors are observed across the other datasets and criteria. When the number of paths increases, both the training and testing performance of CFT and UFT improve. Moreover, CFT converges to a better F1 score than UFT as M increases, which explains its better performance during testing.

While CFT is usually better than UFT, CFT loses to UFT on medical and slash in Tables 2 and 3. We study the reasons and find that the cause is overfitting. For instance, the training Rank loss of CFT on medical is 0.083, which is much smaller than the UFT result of 0.264. The result implies that CFT indeed optimizes the desired evaluation criterion during training, but the focus on key paths can suffer from worse generalization on a few datasets. A preliminary study shows that a mixture of CFT and UFT is less prone to overfitting.

5.3. Comparison between PCC and CFT

For the Hamming loss, the Rank loss and the F1 score, exact inference rules for PCC have been proposed (Dembczynski et al., 2010; 2011). From Tables 1, 2, 3 and 4, PCC and CFT are competitive with each other on the three criteria, having similar numbers of winning and losing cases.

To demonstrate the full ability of CFT, we consider two other criteria for which there is no inference rule (yet) for PCC: the Accuracy score (α-Accuracy with α = 1) (Boutell et al., 2004; Tsoumakas et al., 2010), and a composite score formed from the F1 score and the Hamming loss, with results in Table 5. The criteria are defined as Accuracy = ‖y ∩ ŷ‖_1 / ‖y ∪ ŷ‖_1 and Composite score = F1 score − 5 × Hamming loss.
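An illustrative sketch of the two criteria for a single example (treating both scores as 1 in the degenerate case where y and ŷ are both empty) follows:

import numpy as np

def accuracy_score(y, y_hat):
    # per-example (alpha = 1) Accuracy: |y ∩ y_hat| / |y ∪ y_hat|
    union = np.sum((y + y_hat) > 0)
    return 1.0 if union == 0 else np.sum(y * y_hat) / union

def composite_score(y, y_hat):
    # F1 score minus 5 times the Hamming loss
    denom = y.sum() + y_hat.sum()
    f1 = 1.0 if denom == 0 else 2.0 * np.sum(y * y_hat) / denom
    return f1 - 5.0 * np.mean(y != y_hat)

y, y_hat = np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])
print(accuracy_score(y, y_hat), composite_score(y, y_hat))   # 0.333..., -2.0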

Here we use approximate inference rules for PCC. For the Accuracy score, we couple PCC with the inference rule of the F1 score in view of the similarity between the formulas. For the Composite score, which considers the F1 score and the Hamming loss concurrently, we run PCC with either the inference rule of the F1 score or that of the Hamming loss, and optimistically report the better of the two in Table 5.

Table 5 can be summarized as follows. Due to the similarity between the formulas, CFT and PCC-F1 reach similar results for the Accuracy score. For the Composite score, which is similar to neither the F1 score nor the Hamming loss, PCC is much worse than CFT.

When K is small, PCC can use exhaustive search to enumerate the 2^K possible ŷ and locate the Bayes-optimal ŷ. We further list the performance of this PCC-exhaust approach on emotions, scene and yeast, which have no more than 14 labels.

Inference   Acc. (↑)                   Comp. (↑)
            emo.     scene    yeast    emo.     scene    yeast
Apprx.      0.534    0.676    0.518    -0.566   0.150    -0.398
Exhau.      0.530    0.709    0.535    -0.570   0.176    -0.383

With the exhaustive inference, the performance of PCC is significantly improved in most cases. The good performance highlights the importance of exact and efficient inference rules for PCC. Nevertheless, if the desired evaluation criterion is complicated, it is non-trivial to design an exact and efficient inference rule. When comparing PCC-exhaust with CFT, we see that CFT wins on 3 cases, ties on 1 case and loses on 2 cases. Thus, the efficient CFT is quite competitive with the inefficient PCC-exhaust in performance.

6. Conclusion

We tackle the general cost-sensitive multi-label classification problem without any specific subroutine for different evaluation criteria, which meets the demands of real-world applications. We propose the condensed filter tree (CFT) algorithm by coupling several tools and ideas: the label powerset approach for reducing to cost-sensitive classification, the tree-based algorithms for cost-sensitive classification, the proper-ordering and K-classifier tricks that utilize the structural property of multi-label classification, and the theoretical bound for locating the key tree nodes (paths) for training. The resulting CFT is as efficient as the common label-wise decomposition approaches in training and prediction, with respect to the number of possible labels.

Experimental results demonstrate that CFT is competitive with leading approaches for multi-label classification, and usually outperforms those approaches on the evaluation criteria that they are not designed for.

CFT can currently handle evaluation criteria defined by a desired label vector and a predicted label vector. We can view CFT as a first step towards tackling more complicated evaluation criteria, which shall be an important future research direction.

7. Acknowledgement

We thank Profs. Yuh-Jye Lee, Chih-Jen Lin, Shou-De Lin, Chi-Jen Lu, Hung-Yi Lo, the anonymous reviewers, and the members of the NTU Computational Learning Lab for valuable suggestions. This work is mainly supported by the National Science Council (NSC 101-2628-E-002-029-MY2) of Taiwan.

References

Beygelzimer, A., Dani, V., Hayes, T., Langford, J., and Zadrozny, B. Error limiting reductions between classification tasks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Beygelzimer, A., Langford, J., and Ravikumar, P. Error correcting tournaments, 2008. URL http://arxiv.org/abs/0902.3176.

Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. Learning multi-label scene classification. Pattern Recognition, 2004.

Dembczynski, K., Cheng, W., and Hüllermeier, E. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems 24, 2011.

Dembczynski, K., Kotlowski, W., and Hüllermeier, E. Consistent multilabel ranking through univariate losses. In Proceedings of the 29th International Conference on Machine Learning, 2012a.

Dembczynski, K., Waegeman, W., and Hüllermeier, E. An analysis of chaining in multi-label classification. In Proceedings of the 20th European Conference on Artificial Intelligence, 2012b.

Domingos, P. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

Elisseeff, A. and Weston, J. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, 2002.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 2008.

Kumar, A., Vembu, S., Menon, A. K., and Elkan, C. Beam search algorithms for multilabel learning. Machine Learning, 2013.

Lo, H.-Y., Wang, J.-C., Wang, H.-M., and Lin, S.-D. Cost-sensitive multi-label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia, 2011.

Mineiro, P. Cost sensitive multi label: an observation, 2011. URL http://www.machinedlearnings.com/2011/05/cost-sensitive-multi-label-observation.html.

Petterson, J. and Caetano, T. S. Reverse multi-label learning. In Advances in Neural Information Processing Systems 23, 2010.

Petterson, J. and Caetano, T. S. Submodular multi-label learning. In Advances in Neural Information Processing Systems 24. 2011.

Read, J. Meka: a multi-label extension to weka, 2012. URL http://meka.sourceforge.net.

Read, J., Pfahringer, B., Holmes, G., and Frank, E. Classifier chains for multi-label classification. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, 2009.

Srivastava, A. N. and Zane-Ulman, B. Discovering recurring anomalies in text reports regarding complex space systems. In IEEE Aerospace Conference, 2005.

Tsoumakas, G. and Vlahavas, I. Random k-labelsets: an ensemble method for multilabel classification. In Machine Learning: the European Conference on Machine Learning, 2007.

Tsoumakas, G., Katakis, I., and Vlahavas, I. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer US, 2010.

Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., and Vlahavas, I. Mulan: a Java library for multi-label learning. Journal of Machine Learning Research, 2011.

Tsoumakas, G., Zhang, M.-L., and Zhou, Z.-H. Introduction to the special issue on learning from multi-label data. Journal of Machine Learning Research, 2012.

Turnbull, D., Barrington, L., Torres, D. A., and Lanckriet, G. R. G. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 2008.

Zhang, M.-L. and Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 2007.
