Cost-Sensitive Label Embedding for Multi-Label Classification


Kuan-Hao Huang · Hsuan-Tien Lin


Abstract Label embedding (LE) is an important family of multi-label classification algorithms that digest the label information jointly for better performance.

Different real-world applications evaluate performance by different cost functions of interest. Current LE algorithms often aim to optimize one specific cost function, but they can suffer from bad performance with respect to other cost functions. In this paper, we resolve the performance issue by proposing a novel cost-sensitive LE algorithm that takes the cost function of interest into account. The proposed algorithm, cost-sensitive label embedding with multidimensional scaling (CLEMS), approximates the cost information with the distances of the embedded vectors by using the classic multidimensional scaling approach for manifold learning. CLEMS is able to deal with both symmetric and asymmetric cost functions, and effectively makes cost-sensitive decisions by nearest-neighbor decoding within the embedded vectors. We derive theoretical results that justify how CLEMS achieves the desired cost-sensitivity. Furthermore, extensive experimental results demonstrate that CLEMS is significantly better than a wide spectrum of existing LE algorithms and state-of-the-art cost-sensitive algorithms across different cost functions.

Keywords Multi-label classification, Cost-sensitive, Label embedding

1 Introduction

The multi-label classification problem (MLC), which allows multiple labels to be associated with each example, is an extension of the multi-class classification problem. The MLC problem satisfies the demands of many real-world applications (Carneiro et al, 2007; Trohidis et al, 2008; Barutçuoglu et al, 2006). Different applications usually need different criteria to evaluate the prediction performance of MLC algorithms. Some popular criteria are Hamming loss, Rank loss, F1 score, and Accuracy score (Tsoumakas et al, 2010; Madjarov et al, 2012).

Kuan-Hao Huang, Hsuan-Tien Lin

CSIE Department, National Taiwan University, Taipei, Taiwan E-mail: {r03922062, htlin}@csie.ntu.edu.tw

arXiv:1603.09048v5 [cs.LG] 5 Aug 2017


Label embedding (LE) is an important family of MLC algorithms that jointly extract the information of all labels to improve the prediction performance. LE algorithms automatically transform the original labels to an embedded space, which represents the hidden structure of the labels. After conducting learning within the embedded space, LE algorithms make more accurate predictions with the help of the hidden structure.

Existing LE algorithms can be grouped into two categories based on the dimension of the embedded space: label space dimension reduction (LSDR) and label space dimension expansion (LSDE). LSDR algorithms (Hsu et al, 2009; Kapoor et al, 2012; Tai and Lin, 2012; Sun et al, 2011; Chen and Lin, 2012; Yu et al, 2014; Lin et al, 2014; Balasubramanian and Lebanon, 2012; Bi and Kwok, 2013; Bhatia et al, 2015; Yeh et al, 2017) consider a low-dimensional embedded space for digesting the information between labels and conduct more effective learning. In contrast to LSDR algorithms, LSDE algorithms (Zhang and Schneider, 2011; Ferng and Lin, 2013; Tsoumakas et al, 2011a) focus on a high-dimensional embedded space. The additional dimensions can then be used to represent different angles of joint information between the labels to reach better performance.

While LE algorithms have become major tools for tackling the MLC problem, most existing LE algorithms are designed to optimize only one or a few specific criteria. The algorithms may then perform poorly with respect to other criteria. Given that different applications demand different criteria, it is thus important to achieve cost (criterion) sensitivity to make MLC algorithms more realistic. Cost-sensitive MLC (CSMLC) algorithms consider the criterion as an additional input, and take it into account either in the training or the predicting stage. The additional input can then be used to guide the algorithm towards more realistic predictions. CSMLC algorithms have attracted research attention in recent years (Lo et al, 2011, 2014; Dembczynski et al, 2010, 2011; Li and Lin, 2014), but to the best of our knowledge, there is no work on cost-sensitive label embedding (CSLE) algorithms yet.

In this paper, we study the design of CSLE algorithms, which take the intended criterion into account in the training stage to locate a cost-sensitive hidden structure in the embedded space. The cost-sensitive hidden structure can then be used for more effective learning and more accurate predictions with respect to the criterion of interest. Inspired by the finding that many of the existing LSDR algorithms can be viewed as linear manifold learning approaches, we propose to adopt manifold learning for CSLE. Nevertheless, to embed any general and possibly complicated criterion, linear manifold learning may not be sophisticated enough. We thus start with multidimensional scaling (MDS), one famous non-linear manifold learning approach, to propose a novel CSLE algorithm. The proposed cost-sensitive label embedding with multidimensional scaling (CLEMS) algorithm embeds the cost information within the distance measure of the embedded space. We further design a mirroring trick for CLEMS to properly embed the possibly asymmetric criterion information within the symmetric distance measure. We also design an efficient procedure that conquers the difficulty of making predictions through the non-linear cost-sensitive hidden structure. Theoretical results justify that CLEMS achieves cost-sensitivity through learning in the MDS-embedded space. Extensive empirical results demonstrate that CLEMS usually reaches better performance than leading LE algorithms across different criteria. In addition, CLEMS also performs better than the state-of-the-art CSMLC algorithms (Li and Lin, 2014;

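Dembczynski et al, 2010, 2011). The results suggest that CLEMS is a promising algorithm for CSMLC.

Fig. 1 Flow of label embedding (the embedding function Φ maps the label space Y to the embedded space Z, the internal predictor g maps the feature space X to Z, and the decoding function Ψ maps Z back to Y)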

This paper is organized as follows. Section 2 formalizes the CSLE problem and Section 3 illustrates the proposed algorithm along with theoretical justifications. We discuss the experimental results in Section 4 and conclude in Section 5.

2 Cost-sensitive label embedding

In multi-label classification (MLC), we denote the feature vector of an instance by x ∈ X ⊆ R^d and denote the label vector by y ∈ Y ⊆ {0, 1}^K, where y[i] = 1 if and only if the instance is associated with the i-th label. Given the training instances D = {(x⁽ⁿ⁾, y⁽ⁿ⁾)}_{n=1}^N, the goal of MLC algorithms is to train a predictor h : X → Y from D in the training stage, with the expectation that for any unseen testing instance (x, y), the prediction ỹ = h(x) can be close to the ground truth y.

A simple criterion for evaluating the closeness between y and ỹ is

$$\text{Hamming loss}(y, \tilde{y}) = \frac{1}{K} \sum_{i=1}^{K} [\![\, y[i] \neq \tilde{y}[i] \,]\!],$$

where ⟦·⟧ is 1 if the enclosed condition holds and 0 otherwise. It is worth noting that Hamming loss separately evaluates each label component of ỹ. There are other criteria that jointly evaluate all the label components of ỹ, such as F1 score, Rank loss, 0/1 loss, and Accuracy score (Tsoumakas et al, 2010; Madjarov et al, 2012).
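As a small illustration, the Hamming loss can be computed directly from two binary label vectors. The sketch below assumes NumPy arrays (or lists) of 0/1 entries; the function name `hamming_loss` and the toy vectors are chosen here for illustration only.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label positions where the two binary vectors disagree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

y = [1, 0, 0, 1]        # toy ground truth with K = 4 labels
y_tilde = [1, 1, 0, 0]  # toy prediction; positions 2 and 4 disagree
print(hamming_loss(y, y_tilde))  # 0.5
```

Each label position contributes independently to the result, which is why the criterion cannot capture joint behavior across labels.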

Arguably the simplest algorithm for MLC is binary relevance (BR) (Tsoumakas and Katakis, 2007). BR separately trains a binary classifier for each label without considering the information of other labels. In contrast to BR, label embedding (LE) is an important family of MLC algorithms that jointly use the information of all labels to achieve better prediction performance. LE algorithms try to identify the hidden structure behind the labels. In the training stage, instead of training a predictor h directly, LE algorithms first embed each K-dimensional label vector y⁽ⁿ⁾ as an M-dimensional embedded vector z⁽ⁿ⁾ ∈ Z ⊆ R^M by an embedding function Φ : Y → Z. The embedded vector z⁽ⁿ⁾ can be viewed as the hidden structure that contains the information pertaining to all labels. Then, the algorithms train an internal predictor g : X → Z from {(x⁽ⁿ⁾, z⁽ⁿ⁾)}_{n=1}^N. In the predicting stage, for the testing instance x, LE algorithms obtain the predicted embedded vector z̃ = g(x) and use a decoding function Ψ : Z → Y to get the prediction ỹ. In other words, LE algorithms learn the predictor by h = Ψ ◦ g.

Figure 1 illustrates the flow of LE algorithms.

Existing LE algorithms can be grouped into two categories based on M (the dimension of Z) and K (the dimension of Y). LE algorithms that work with M ≤ K are termed label space dimension reduction (LSDR) algorithms. They consider a low-dimensional embedded space for digesting the information between labels and utilize different pairs of (Φ, Ψ) to conduct more effective learning. Compressed sensing (Hsu et al, 2009) and Bayesian compressed sensing (Kapoor et al, 2012) consider a random projection as Φ and obtain Ψ by solving an optimization problem per test instance. Principal label space transformation (Tai and Lin, 2012) considers Φ calculated from an optimal linear projection of the label vectors and derives Ψ accordingly. Some other works also consider optimal linear projections as Φ but take feature vectors into account in the optimality criterion, including canonical-correlation-analysis methods (Sun et al, 2011), conditional principal label space transformation (Chen and Lin, 2012), low-rank empirical risk minimization for multi-label learning (Yu et al, 2014), and feature-aware implicit label space encoding (Lin et al, 2014). Canonical-correlated autoencoder (Yeh et al, 2017) extends the linear projection works by using neural networks instead. Landmark selection method (Balasubramanian and Lebanon, 2012) and column subset selection (Bi and Kwok, 2013) design Φ to select a subset of labels as embedded vectors and derive the corresponding Ψ. Sparse local embeddings for extreme classification (Bhatia et al, 2015) trains a locally-linear projection as Φ and constructs Ψ by nearest neighbors. The smaller M in LSDR algorithms allows the internal predictor g to be learned more efficiently and effectively.

Other LE algorithms work with M > K and are called label space dimension expansion (LSDE) algorithms. Canonical-correlation-analysis output codes (Zhang and Schneider, 2011) design Φ based on canonical correlation analysis to generate additional output codes to enhance the performance. Error-correcting codes (ECC) algorithms (Ferng and Lin, 2013) utilize the encoding and decoding functions of standard error-correcting codes for communication as Φ and Ψ, respectively. Random k-labelsets (Tsoumakas et al, 2011a), a popular algorithm for MLC, can be considered as an ECC-based algorithm with the repetition code (Ferng and Lin, 2013). LSDE algorithms use additional dimensions to represent different angles of joint information between the labels to reach better performance.

To the best of our knowledge, all the existing LE algorithms above are designed for one or a few specific criteria and may perform poorly with respect to other criteria. For example, the optimality criterion within principal label space transformation (Tai and Lin, 2012) is closely related to Hamming loss. For MLC data with very few non-zero y[i], which are commonly encountered in real-world applications, optimizing Hamming loss can easily result in predictions with all ỹ[i] = 0, which suffer from a poor F1 score.
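A toy calculation makes the mismatch concrete. The sketch below assumes a single instance with K = 20 labels of which only one is relevant; the helper names and the convention that F1 equals 1 when both vectors are all-zero are assumptions made for this illustration.

```python
import numpy as np

def hamming_loss(y, y_pred):
    return float(np.mean(y != y_pred))

def f1_score(y, y_pred):
    denom = y.sum() + y_pred.sum()
    if denom == 0:
        return 1.0  # assumed convention when both vectors are all-zero
    return float(2 * np.sum(y & y_pred) / denom)

K = 20
y = np.zeros(K, dtype=int)
y[3] = 1                            # sparse ground truth: one relevant label
all_zero = np.zeros(K, dtype=int)   # the "predict nothing" output

print(hamming_loss(y, all_zero))    # small: only 1 of 20 positions is wrong
print(f1_score(y, all_zero))        # 0.0: no relevant label is recovered
```

The all-zero prediction looks nearly perfect under Hamming loss while being the worst possible prediction under F1 score.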

MLC algorithms that take the evaluation criterion into account are called cost-sensitive MLC (CSMLC) algorithms and have attracted research attention in recent years. CSMLC algorithms take the criterion as an additional input and consider it either in the training or the predicting stage. For any given criterion, CSMLC algorithms can ideally make cost-sensitive predictions with respect to the criterion without extra effort in algorithm design. Generalized k-labelsets ensemble (Lo et al, 2011, 2014) is extended from random k-labelsets (Tsoumakas et al, 2011a) and digests the criterion by giving appropriate weights to labels. The ensemble algorithm performs well for any weighted Hamming loss but cannot tackle more general criteria that jointly evaluate all the label components, such as F1 score. Two CSMLC algorithms for arbitrary criteria are probabilistic classifier chain (PCC) (Dembczynski et al, 2010, 2011) and condensed filter tree (CFT) (Li and Lin, 2014). PCC is based on estimating the probability of each label and making a Bayes-optimal inference for the evaluation criterion. While PCC can in principle be used for any criterion, it may suffer from computational difficulty unless an efficient inference rule for the criterion is designed first. CFT is based on converting the criterion into weights when learning each label. CFT conducts the weight assignment in a more sophisticated manner than generalized k-labelsets ensemble does, and can hence work with arbitrary criteria. Both PCC and CFT are extended from classifier chain (CC) (Read et al, 2011) and form a chain of labels to utilize the information of the earlier labels in the chain, but they cannot globally find the hidden structure of all labels like LE algorithms.

In this paper, we study the design of cost-sensitive label embedding (CSLE) algorithms that respect the criterion when calculating the embedding function Φ and the decoding function Ψ. We take the initiative to study CSLE algorithms, with the hope of achieving cost-sensitivity and finding the hidden structure at the same time. More precisely, we take the following CSMLC setting (Li and Lin, 2014). Consider a cost function c(y, ỹ) which represents the penalty when the ground truth is y and the prediction is ỹ. We naturally assume that c(y, ỹ) ≥ 0, with value 0 attained if and only if y and ỹ are the same. Given training instances D = {(x⁽ⁿ⁾, y⁽ⁿ⁾)}_{n=1}^N and the cost function c(·, ·), CSLE algorithms learn an embedding function Φ, a decoding function Ψ, and an internal predictor g, based on both the training instances D and the cost function c(·, ·). The objective of CSLE algorithms is to minimize the expected cost c(y, h(x)) for any unseen testing instance (x, y), where h = Ψ ◦ g.

3 Proposed algorithm

We first discuss the difficulties of directly extending state-of-the-art LE algorithms for CSLE. In particular, the decoding function Ψ of many existing algorithms, such as conditional principal label space transformation (Chen and Lin, 2012) and feature-aware implicit label space encoding (Lin et al, 2014), is derived from Φ and can be divided into two steps. The first step is using some ψ : Z → R^K that corresponds to Φ to decode the embedded vector z to a real-valued vector ŷ ∈ R^K; the second step is choosing a threshold to transform ŷ to ỹ ∈ {0, 1}^K. If the embedding function Φ is a linear function, the corresponding ψ can be efficiently computed by pseudo-inverse. However, for complicated cost functions, a linear function may not be sufficient to properly embed the cost information. On the other hand, if the embedding function Φ is a non-linear function, such as those within kernel principal component analysis (Schölkopf et al, 1998) and kernel dependency estimation (Weston et al, 2002), ψ is often difficult to derive or time-consuming to calculate, which then makes Ψ practically infeasible to compute.
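The two-step decoding for a linear Φ can be sketched as follows. This is a generic illustration rather than a reproduction of any cited algorithm: the linear embedding is taken from a truncated SVD of the centered label matrix, and the toy data, the threshold of 0.5, and the names `embed` and `decode` are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(50, 6))   # toy label matrix (assumption): N = 50, K = 6
M = 3                                  # embedded dimension

y_mean = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
P = Vt[:M]                             # linear embedding matrix, shape (M, K)
P_pinv = np.linalg.pinv(P)             # pseudo-inverse used by psi, shape (K, M)

def embed(y):
    """Linear Phi: project the centered label vector onto M directions."""
    return P @ (y - y_mean)

def decode(z, threshold=0.5):
    """Two-step Psi: real-valued reconstruction, then thresholding."""
    y_hat = P_pinv @ z + y_mean                # step 1: psi(z) in R^K
    return (y_hat >= threshold).astype(int)    # step 2: quantize to {0, 1}^K

y_tilde = decode(embed(Y[0]))
print(Y[0], y_tilde)
```

The pseudo-inverse makes step 1 a single matrix product, which is the efficiency that is lost once Φ becomes non-linear.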

To resolve the difficulties, we do not consider the two-step decoding function Ψ that depends on deriving ψ from Φ. Instead, we first fix a decent decoding function Ψ and then derive the corresponding embedding function Φ. We realize that the goal of Ψ is simply to locate the most probable label vector ỹ from Y, which is of a finite cardinality, based on the predicted embedded vector z̃ = g(x) ∈ Z. If all the embedded vectors are sufficiently far away from each other in Z, one natural decoding function is to calculate the nearest neighbor zq of z̃ and return the corresponding yq as ỹ. Such a nearest-neighbor decoding function Ψ is behind some ECC-based LSDE algorithms (Ferng and Lin, 2013) and will be adopted.
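A minimal sketch of nearest-neighbor decoding is given below. It assumes the embedded vectors are stored row-wise in an array `Q` aligned with the label vectors in `S`; these names and the toy values are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_decode(z_pred, Q, S):
    """Return the label vector in S whose embedded vector in Q is closest to z_pred."""
    dists = np.linalg.norm(Q - z_pred, axis=1)  # Euclidean distance to every candidate
    q = int(np.argmin(dists))
    return S[q]

S = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0]])      # candidate label vectors
Q = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # their embedded vectors (M = 2)

z_tilde = np.array([0.9, 0.2])                       # a slightly noisy prediction
print(nearest_neighbor_decode(z_tilde, Q, S))        # [0 1 1]
```

Because decoding only compares distances, it works for any embedding, linear or not, as long as the candidates are well separated.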

The nearest-neighbor decoding function Ψ is based on the distance measure of Z, which matches our primary need of representing the cost information. In particular, if yi is a lower-cost prediction than yj with respect to the ground truth yt, we hope that the corresponding embedded vector zi would be closer to zt than zj. Then, even if g makes a small error such that z̃ = g(x) deviates from the desired zt, the nearest-neighbor decoding function Ψ can decode to the lower-cost yi as ỹ instead of yj. In other words, for any two label vectors yi, yj ∈ Y and the corresponding embedded vectors zi, zj ∈ Z, we want the Euclidean distance between zi and zj, which is denoted by d(zi, zj), to preserve the magnitude-relationship of the cost c(yi, yj).

Based on this objective, the framework of the proposed algorithm is as follows. In the training stage, for each label vector yi ∈ Y, the proposed algorithm determines an embedded vector zi such that the distance between two embedded vectors d(zi, zj) in Z approximates the transformed cost δ(c(yi, yj)), where δ(·) is a monotonic transform function to preserve the magnitude-relationship and will be discussed later. We let the embedding function Φ be the mapping yi → zi and use Q to represent the embedded vector set {Φ(yi) | yi ∈ Y}. Then the algorithm trains a regressor g : X → Z as the internal predictor.

In the predicting stage, when receiving a testing instance x, the algorithm obtains the predicted embedded vector z̃ = g(x). Given that the cost information is embedded in the distance, for each zi ∈ Q, the distance d(zi, z̃) can be viewed as the estimated cost if the underlying truth is yi. Hence the algorithm finds zq ∈ Q such that the distance d(zq, z̃) is the smallest (the smallest estimated cost), and lets the corresponding yq = Φ⁻¹(zq) = ỹ be the final prediction for x. In other words, we have a nearest-neighbor-based Ψ, with the first step being the determination of the nearest neighbor of z̃ and the second step being the utilization of Φ⁻¹ to obtain the prediction ỹ.

Three key issues of the framework above are yet to be addressed. The first issue is the determination of the embedded vectors zi. The second issue is using the symmetric distance measure to embed the possibly asymmetric cost functions where c(yi, yj) ≠ c(yj, yi). The last issue is the choice of a proper monotonic transform function δ(·). The issues will be discussed in the following sub-sections.

3.1 Calculating the embedded vectors by multidimensional scaling

As mentioned above, our objective is to determine embedded vectors zi such that the distance d(zi, zj) approximates the transformed cost δ(c(yi, yj)). The objective can be formally defined as minimizing the embedding error

(d(zi, zj) − δ(c(yi, yj)))².

We observe that the transformed cost δ(c(yi, yj)) can be viewed as the dissimilarity between label vectors yi and yj. Computing an embedding based on the dissimilarity information matches the task of manifold learning, which is able to preserve the information and discover the hidden structure. Based on our discussions above, any approach that solves the manifold learning task can then be taken to solve the CSLE problem. Nevertheless, for CSLE, different cost functions may need different M (the dimension of Z) to achieve a decent embedding. We thus consider manifold learning approaches that can flexibly take M as the parameter, and adopt a classic manifold learning approach called multidimensional scaling (MDS) (Kruskal, 1964).

For a target dimension M, MDS attempts to discover the hidden structure of L_MDS objects by embedding their dissimilarities in an M-dimensional target space. The dissimilarity is represented by a symmetric, non-negative, and zero-diagonal dissimilarity matrix ∆, which is an L_MDS × L_MDS matrix with ∆i,j being the dissimilarity between the i-th object and the j-th object. The objective of MDS is to determine target vectors u1, u2, ..., u_{L_MDS} in the target space to minimize the stress, which is defined as

$$\sum_{i,j} W_{i,j} \left( d(u_i, u_j) - \Delta_{i,j} \right)^2,$$

where d denotes the Euclidean distance in the target space, and W is a symmetric, non-negative, and zero-diagonal matrix that carries the weight Wi,j of each object pair. There are several algorithms available in the literature for solving MDS. A representative algorithm is Scaling by MAjorizing a COmplicated Function (SMACOF) (De Leeuw, 1977), which can iteratively minimize stress. The complexity of SMACOF is generally O((L_MDS)³), but there is often room for speeding up with special weight matrices W.
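To make the weighted stress concrete, the sketch below evaluates it and minimizes it with the standard SMACOF majorization update (the Guttman transform with a pseudo-inverse to handle general weights). It is a plain NumPy illustration under assumed names (`stress`, `weighted_smacof`), a fixed iteration count, and a toy four-point example; it is not the implementation used in the experiments.

```python
import numpy as np

def pairwise_dist(U):
    diff = U[:, None, :] - U[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(U, Delta, W):
    """Weighted stress: sum over all pairs of W_ij * (d(u_i, u_j) - Delta_ij)^2."""
    return float((W * (pairwise_dist(U) - Delta) ** 2).sum())

def weighted_smacof(Delta, W, M, n_iter=200, seed=0):
    n = Delta.shape[0]
    U = np.random.default_rng(seed).standard_normal((n, M))  # random start
    V_pinv = np.linalg.pinv(np.diag(W.sum(axis=1)) - W)
    for _ in range(n_iter):
        D = pairwise_dist(U)
        with np.errstate(divide="ignore", invalid="ignore"):
            B = -W * np.where(D > 0, Delta / D, 0.0)
        B[np.diag_indices(n)] = 0.0
        B[np.diag_indices(n)] = -B.sum(axis=1)   # rows of B sum to zero
        U = V_pinv @ B @ U                       # majorization (Guttman) update
    return U

# Toy example: dissimilarities are the true distances between square corners.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
Delta = pairwise_dist(points)
W = 1.0 - np.eye(4)                              # equal weight on every pair
U = weighted_smacof(Delta, W, M=2)
print(stress(U, Delta, W))
```

Each update cannot increase the stress, so the final configuration is at least as good as the random start, although it may be a local minimum.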

The embedding error (d(zi, zj) − δ(c(yi, yj)))² and the stress (d(ui, uj) − ∆i,j)² are of very similar form. Therefore, we can view the transformed costs as the dissimilarities of embedded vectors and feed MDS with specific values of ∆ and W to calculate the embedded vectors to reduce the embedding error. Specifically, the relation between MDS and our objective can be described as follows.

MDS                             CSLE
i-th object                     label vector yi
dissimilarity ∆i,j              transformed cost δ(c(yi, yj))
target vector ui                embedded vector zi
stress (d(ui, uj) − ∆i,j)²      embedding error (d(zi, zj) − δ(c(yi, yj)))²

The most complete embedding would convert all label vectors y ∈ Y ⊆ {0, 1}^K to the embedded vectors. Nevertheless, the number of all label vectors is 2^K, which makes solving MDS infeasible. Therefore, we do not consider embedding the entire Y. Instead, we select some representative label vectors as a candidate set S ⊆ Y, and only embed the label vectors in S. While the use of S instead of Y restricts the nearest-neighbor decoding function to only predict from S, it can reduce the computational burden. One reasonable choice of S is the set of label vectors that appear in the training instances D, which is denoted as Str. We will show that using Str as S readily leads to promising performance and discuss more about the choice of the candidate set in Section 4.
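Extracting the candidate set Str from the training labels amounts to collecting the distinct rows of the label matrix. The sketch below uses a toy label matrix and a hypothetical helper name `candidate_set`; it also returns how often each label vector occurs, which is convenient when per-vector weights are needed.

```python
import numpy as np

def candidate_set(Y_train):
    """Distinct label vectors in the training set, with their occurrence counts."""
    S_tr, counts = np.unique(np.asarray(Y_train), axis=0, return_counts=True)
    return S_tr, counts

Y_train = np.array([[1, 0, 0],
                    [0, 1, 1],
                    [1, 0, 0],
                    [0, 1, 1],
                    [0, 1, 1],
                    [1, 1, 0]])
S_tr, counts = candidate_set(Y_train)
print(S_tr)     # 3 distinct label vectors instead of 2^3 = 8
print(counts)
```

The number of candidates is at most N, so the MDS problem scales with the training set rather than with 2^K.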

After choosing S, we can construct ∆ and W for solving MDS. Let L denote the number of elements in S and let C(S) be the transformed cost matrix of S, which is an L × L matrix with C(S)i,j = δ(c(yi, yj)) for yi, yj ∈ S. Unfortunately, C(S) cannot be directly used as the symmetric dissimilarity matrix ∆ because the cost function c(·, ·) may be asymmetric (c(yi, yj) ≠ c(yj, yi)). To resolve this difficulty, we propose a mirroring trick to construct a symmetric ∆ from C(S).

3.2 Mirroring trick for asymmetric cost function

The asymmetric cost function implies that each label vector yi serves two roles: as the ground truth, or as the prediction. When yi serves as the ground truth, we should use c(yi, ·) to describe the cost behavior. When yi serves as the prediction, we should use c(·, yi) to describe the cost behavior. This motivates us to view these two roles separately.

Fig. 2 Embedding cost in distance (the embedded space Z contains the prediction-role vectors z^(p)_1, z^(p)_2, z^(p)_3 and the ground-truth-role vectors z^(t)_1, z^(t)_2, z^(t)_3; distances between the two roles represent δ(c(y1, y2)) and δ(c(y2, y1)))

Fig. 3 Constructions of ∆ and W: (a) ∆, (b) W

For each yi ∈ S, we mirror it as y^(t)_i and y^(p)_i to denote viewing yi as the ground truth and the prediction, respectively. Note that the two mirrored label vectors y^(t)_i and y^(p)_i are in fact the same, but carry different meanings. Now, we have two roles of the candidate sets S^(t) = {y^(t)_i }^L_i=1 and S^(p) = {y^(p)_i }^L_i=1. Then, as illustrated by Figure 2, δ(c(yi, yj)), the transformed cost when yi is ground truth and yj is the prediction, can be viewed as the dissimilarity between the ground truth role y^(t)_i and the prediction role y_j^(p), which is symmetric for them. Similarly, δ(c(yj, yi)) can be viewed as the dissimilarity between prediction role y^(p)_i and ground truth role y^(t)_j . That is, all the asymmetric transformed costs can be viewed as the dissimilarities between the label vectors in S^(t) and S^(p).

Based on this view, instead of embedding S by MDS, we embed both S^(t) and S^(p) by considering 2L objects, the first L objects being the elements in S^(t) and the last L objects being the elements in S^(p). Following the mirroring step above, we construct symmetric ∆ and W as 2L × 2L matrices by the following equations and illustrate the constructions by Figure 3.

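$$\Delta_{i,j} = \begin{cases} \delta(c(y_i, y_{j-L})) & \text{if } (i, j) \text{ in top-right part} \\ \delta(c(y_j, y_{i-L})) & \text{if } (i, j) \text{ in bottom-left part} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$W_{i,j} = \begin{cases} f_i & \text{if } (i, j) \text{ in top-right part} \\ f_j & \text{if } (i, j) \text{ in bottom-left part} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$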

We explain the constructions and the new notation fi as follows. Given that we are concerned only about the dissimilarities between the elements in S^(t) and S^(p), we set the top-left and the bottom-right parts of W to zeros (and set the corresponding parts of ∆ conveniently to zeros as well). Then, we set the top-right part and the bottom-left part of ∆ to be the transformed costs to reflect the dissimilarities. The top-right part and the bottom-left part of ∆ are in fact C(S) and C(S)^⊤ respectively, as illustrated by Figure 3. Considering that every label vector may have different importance, to reflect this difference, we set the top-right part of weight Wi,j to be fi, the frequency of yi in D, and set the bottom-left part of weight Wi,j to be fj.
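The block structure described above can be assembled in a few lines. The sketch below assumes the transformed cost matrix C(S) is given as an L × L array `C` and the frequencies as a length-L array `f`; the function name `mirrored_matrices` and the toy values are illustrative assumptions.

```python
import numpy as np

def mirrored_matrices(C, f):
    """Build the 2L x 2L symmetric Delta and W from C(S) and the frequencies f.

    Objects 0..L-1 play the ground-truth role, objects L..2L-1 the prediction role.
    """
    C = np.asarray(C, dtype=float)
    f = np.asarray(f, dtype=float)
    L = C.shape[0]
    Delta = np.zeros((2 * L, 2 * L))
    W = np.zeros((2 * L, 2 * L))
    Delta[:L, L:] = C            # top-right part: C(S)
    Delta[L:, :L] = C.T          # bottom-left part: transpose of C(S)
    W[:L, L:] = f[:, None]       # top-right part: weight f_i of the row object
    W[L:, :L] = f[None, :]       # bottom-left part: weight f_j of the column object
    return Delta, W

C = np.array([[0.0, 1.0],
              [2.0, 0.0]])       # toy asymmetric transformed costs, L = 2
f = np.array([3, 1])             # toy frequencies
Delta, W = mirrored_matrices(C, f)
print(Delta)
print(W)
```

Both matrices come out symmetric even though `C` is not, and the within-role blocks carry zero weight so MDS ignores them.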

By solving MDS with the above-mentioned ∆ and W, we can obtain the target vectors u^(t)_i and u^(p)_i corresponding to y^(t)_i and y^(p)_i. We take U^(t) and U^(p) to denote the target vector sets {u^(t)_i}_{i=1}^L and {u^(p)_i}_{i=1}^L, respectively. Those target vectors minimize

$$\sum_{i,j} W_{i,j} \left( d(u^{(t)}_i, u^{(p)}_j) - \delta(c(y_i, y_j)) \right)^2 .$$

That is, the cost information is embedded in the distances between the elements in U^(t) and U^(p). Since we mirror each label vector yi as two roles (y^(t)_i and y^(p)_i), we need to decide which target vector (u^(t)_i or u^(p)_i) is the embedded vector zi of yi. Recall that the goal of the embedded vectors is to train an internal predictor g and obtain z̃, the "predicted" embedded vector. Therefore, we take the elements in U^(p), which serve the role of the prediction, as the embedded vectors of the elements in S, as illustrated by Figure 4(a). Accordingly, the nearest embedded vector zq should take the role of the ground truth because the cost information is embedded in the distance between these two roles of target vectors. Hence, we take U^(t) as Q, the embedded vector set in the first step of nearest-neighbor decoding, and find the nearest embedded vector zq from Q, as illustrated by Figure 4(b). The final cost-sensitive prediction ỹ = yq is the label vector corresponding to zq, which carries the cost information through nearest-neighbor decoding.

Fig. 4 Different use of two roles of embedded vectors: (a) learning g from U^(p), (b) making prediction from U^(t)

With the embedding function Φ using U^(p) and the nearest-neighbor decoding function Ψ using Q = U^(t), we have now designed a novel CSLE algorithm. We name it cost-sensitive label embedding with multidimensional scaling (CLEMS). Algorithm 1 and Algorithm 2 respectively list the training process and the predicting process of CLEMS.

Algorithm 1 Training process of CLEMS
1: Given D = {(x⁽ⁿ⁾, y⁽ⁿ⁾)}_{n=1}^N, cost function c, and embedded dimension M
2: Decide the candidate set S, and calculate ∆ and W by (1) and (2)
3: Solve MDS with ∆ and W, and obtain the two roles of embedding vectors U^(t) and U^(p)
4: Set embedding function Φ : S → U^(p) and embedded vector set Q = U^(t)
5: Train a regressor g from {(x⁽ⁿ⁾, Φ(y⁽ⁿ⁾))}_{n=1}^N

Algorithm 2 Predicting process of CLEMS
1: Given a testing example x
2: Obtain the predicted embedded vector z̃ = g(x)
3: Find zq ∈ Q such that d(zq, z̃) is the smallest
4: Make prediction ỹ = Φ⁻¹(zq)
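The training and predicting processes can be sketched end to end as follows. This is an illustrative approximation under several assumptions that are not part of the paper: the internal predictor g is a closed-form ridge regressor instead of a random forest, MDS is solved by a plain NumPy weighted SMACOF loop with a fixed iteration count, the cost is 1 − F1 score, δ is the square root, and all function names and toy data are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

def weighted_smacof(Delta, W, M, n_iter=300, seed=0):
    n = Delta.shape[0]
    U = np.random.default_rng(seed).standard_normal((n, M))
    V_pinv = np.linalg.pinv(np.diag(W.sum(axis=1)) - W)
    for _ in range(n_iter):
        D = pairwise_dist(U, U)
        with np.errstate(divide="ignore", invalid="ignore"):
            B = -W * np.where(D > 0, Delta / D, 0.0)
        B[np.diag_indices(n)] = 0.0
        B[np.diag_indices(n)] = -B.sum(axis=1)
        U = V_pinv @ B @ U
    return U

def f1_cost(y, y_pred):
    denom = y.sum() + y_pred.sum()
    return 0.0 if denom == 0 else 1.0 - 2.0 * np.sum(y & y_pred) / denom

def clems_train(X, Y, cost, M, ridge=1e-3):
    S, f = np.unique(Y, axis=0, return_counts=True)      # candidate set and frequencies
    L = len(S)
    C = np.array([[np.sqrt(cost(yi, yj)) for yj in S] for yi in S])  # delta = sqrt
    Delta = np.zeros((2 * L, 2 * L))
    W = np.zeros((2 * L, 2 * L))
    Delta[:L, L:], Delta[L:, :L] = C, C.T                 # mirroring trick
    W[:L, L:], W[L:, :L] = f[:, None], f[None, :]
    U = weighted_smacof(Delta, W, M)
    U_t, U_p = U[:L], U[L:]                               # ground-truth / prediction roles
    index = {tuple(s): i for i, s in enumerate(S)}
    Z = np.array([U_p[index[tuple(y)]] for y in Y])       # Phi(y) uses U^(p)
    Xb = np.hstack([X, np.ones((len(X), 1))])             # ridge regressor g (assumption)
    coef = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ Z)
    return {"S": S, "Q": U_t, "coef": coef}

def clems_predict(model, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    Z_pred = Xb @ model["coef"]                           # predicted embedded vectors
    q = pairwise_dist(Z_pred, model["Q"]).argmin(axis=1)  # nearest neighbor in Q = U^(t)
    return model["S"][q]

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))                          # toy features
Y = (X[:, :3] > 0).astype(int)                            # toy labels, K = 3
model = clems_train(X, Y, f1_cost, M=3)
Y_pred = clems_predict(model, X[:5])
print(Y_pred)
```

Note that the regressor is trained on U^(p) while decoding searches U^(t), mirroring the two roles described above; every prediction is therefore a member of the candidate set S.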


3.3 Theoretical guarantee and monotonic function

The last issue is how to choose the monotonic transform function δ(·). We suggest a proper monotonic function δ(·) based on the following theoretical results.

Theorem 1 For any instance (x, y), let z be the embedded vector of y, z̃ = g(x) be the predicted embedded vector, zq be the nearest embedded vector of z̃, and yq be the corresponding label vector of zq. In other words, yq is the outcome of the nearest-neighbor decoding function Ψ. Then,

$$\delta(c(y, y_q))^2 \le 5 \Big( \underbrace{\left( d(z, z_q) - \delta(c(y, y_q)) \right)^2}_{\text{embedding error}} + \underbrace{d(z, \tilde{z})^2}_{\text{regression error}} \Big).$$

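Proof Since zq is the nearest neighbor of z̃, we have d(z, z̃) ≥ ½ d(z, zq). Hence,

$$\begin{aligned}
\text{embedding error} + \text{regression error}
&= \left( d(z, z_q) - \delta(c(y, y_q)) \right)^2 + d(z, \tilde{z})^2 \\
&\ge \left( d(z, z_q) - \delta(c(y, y_q)) \right)^2 + \frac{1}{4} d(z, z_q)^2 \\
&= \frac{5}{4} \left( d(z, z_q) - \frac{4}{5} \delta(c(y, y_q)) \right)^2 + \frac{1}{5} \delta(c(y, y_q))^2 \\
&\ge \frac{1}{5} \delta(c(y, y_q))^2 .
\end{aligned}$$

This implies the theorem.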
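The two steps of the proof can be expanded as follows. This is a sketch under the assumption that z itself is among the candidates considered by nearest-neighbor decoding, so that d(z̃, zq) ≤ d(z̃, z); the shorthand a = d(z, zq) and b = δ(c(y, yq)) is introduced here only for readability.

```latex
\begin{align*}
% Step 1: triangle inequality, then z_q is at least as close to \tilde{z} as z is
d(z, z_q) &\le d(z, \tilde{z}) + d(\tilde{z}, z_q) \le 2\, d(z, \tilde{z}), \\
% Step 2: completing the square in a
(a - b)^2 + \tfrac{1}{4} a^2 &= \tfrac{5}{4} a^2 - 2ab + b^2
  = \tfrac{5}{4} \left( a - \tfrac{4}{5} b \right)^2 + \tfrac{1}{5} b^2 .
\end{align*}
```

The first line gives d(z, z̃) ≥ ½ d(z, zq), and the second shows that the lower bound is at least b²/5 because the squared term is non-negative.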

Theorem 1 implies that the cost of the prediction can be bounded by embedding error and regression error. In our framework, the embedding error can be reduced by multidimensional scaling and the regression error can be reduced by learning a good regressor g. Theorem 1 provides a theoretical explanation of how our framework achieves cost-sensitivity.

In general, any monotonic function δ(·) can be used in the proposed framework. Based on Theorem 1, we suggest δ(·) = (·)^{1/2} to directly bound the cost by c(y, yq) ≤ 5(embedding error + regression error). We will show that the suggested monotonic function leads to promising practical performance in Section 4.

4 Experiments

We conduct the experiments on nine real-world datasets (Tsoumakas et al, 2011b; Read et al, 2016) to validate the proposed algorithm, CLEMS. The details of the datasets are shown in Table 1. We evaluate the algorithms in our cost-sensitive setting with three commonly-used evaluation criteria, namely

$$\text{F1 score}(y, \tilde{y}) = \frac{2 \|y \cap \tilde{y}\|_1}{\|y\|_1 + \|\tilde{y}\|_1}, \qquad \text{Accuracy score}(y, \tilde{y}) = \frac{\|y \cap \tilde{y}\|_1}{\|y \cup \tilde{y}\|_1},$$

$$\text{Rank loss}(y, \tilde{y}) = \sum_{y[i] > y[j]} \left( [\![\, \tilde{y}[i] < \tilde{y}[j] \,]\!] + \frac{1}{2} [\![\, \tilde{y}[i] = \tilde{y}[j] \,]\!] \right).$$

Note that F1 score and Accuracy score are symmetric while Rank loss is asymmetric. For CLEMS, the input cost function is set as the corresponding evaluation criterion.
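The three criteria can be written down directly for binary label vectors. In the sketch below, the function names are illustrative, and returning 1 for F1 score and Accuracy score when both vectors are all-zero is an assumed convention for the 0/0 case.

```python
import numpy as np

def f1_score(y, y_pred):
    denom = y.sum() + y_pred.sum()
    return 1.0 if denom == 0 else float(2 * np.sum(y & y_pred) / denom)

def accuracy_score(y, y_pred):
    union = np.sum(y | y_pred)
    return 1.0 if union == 0 else float(np.sum(y & y_pred) / union)

def rank_loss(y, y_pred):
    """Sum over pairs with y[i] > y[j]: 1 if the prediction reverses the order, 1/2 if it ties."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                if y_pred[i] < y_pred[j]:
                    loss += 1.0
                elif y_pred[i] == y_pred[j]:
                    loss += 0.5
    return loss

y = np.array([1, 0, 1, 0])
y_tilde = np.array([1, 1, 0, 0])
print(f1_score(y, y_tilde))        # 0.5
print(accuracy_score(y, y_tilde))  # 1/3
print(rank_loss(y, y_tilde))       # 2.0

# Asymmetry of Rank loss: swapping the roles of the two vectors changes the value.
a = np.array([1, 0, 0])
b = np.array([0, 0, 0])
print(rank_loss(a, b), rank_loss(b, a))  # 1.0 0.0
```

F1 score and Accuracy score give the same value when the two arguments are swapped, whereas Rank loss depends on which vector is treated as the ground truth.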

All the following experimental results are averaged over 20 runs of experiments. In each run, we randomly split 50%, 25%, and 25% of the dataset for training, validation, and testing. We use the validation part to select the best parameters for all the algorithms and report the corresponding testing results. For all the algorithms, the internal predictors are set as random forest (Breiman, 2001) implemented by scikit-learn (Pedregosa et al, 2011) and the maximum depth of the trees is selected from {5, 10, ..., 35}. For CLEMS, we use the scikit-learn implementation of the SMACOF algorithm to obtain the MDS-based embedding, with the parameters of SMACOF set to the scikit-learn default values. For the other algorithms, the remaining parameters are set to the default values suggested by their original papers. In the following figures and tables, we use the notation ↑ (↓) to highlight whether a higher (lower) value indicates better performance for the evaluation criterion.

Table 1 Properties of datasets

Dataset        # of instances N   # of features d   # of labels K   # of distinct label vectors
CAL500         502                68                174             502
emotions       593                72                6               27
birds          645                260               19              133
medical        978                1449              45              94
enron          1702               1001              53              753
scene          2407               294               6               15
yeast          2417               103               14              198
slashdot       3279               1079              22              156
EUR-Lex(dc)    19348              5000              412             1615
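The splitting protocol can be sketched as follows. The helper name `split_indices`, the use of a seeded NumPy generator, and the rounding of split sizes by integer division are assumptions made for illustration; the text does not specify how fractional sizes are handled.

```python
import numpy as np

def split_indices(n, seed):
    """Random 50% / 25% / 25% split of n instance indices (train, validation, test)."""
    perm = np.random.default_rng(seed).permutation(n)
    n_train = n // 2
    n_val = n // 4
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

depth_grid = list(range(5, 36, 5))   # candidate maximum depths {5, 10, ..., 35}

# One run on a dataset with 593 instances (the size of the emotions dataset).
train_idx, val_idx, test_idx = split_indices(593, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))  # 296 148 149
print(depth_grid)
```

In each of the 20 runs, a different seed would produce a different split, and the depth with the best validation result would be used for the reported test result.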

4.1 Comparing CLEMS with LSDR algorithms

In the first experiment, we compare CLEMS with four LSDR algorithms introduced in Section 2: principal label space transformation (PLST) (Tai and Lin, 2012), conditional principal label space transformation (CPLST) (Chen and Lin, 2012), feature-aware implicit label space encoding (FaIE) (Lin et al, 2014), and sparse local embeddings for extreme classification (SLEEC) (Bhatia et al, 2015).

Since the prediction of SLEEC is a real-valued vector rather than a binary one, we choose the best threshold for quantizing the vector according to the given criterion during training. Thus, our modified SLEEC can be viewed as a “semi-cost-sensitive” algorithm that learns the threshold according to the criterion.
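The threshold selection for the modified SLEEC can be pictured with the sketch below. It assumes a single global threshold chosen from a small candidate grid and a criterion where higher is better; the function names, the grid, and the toy scores are hypothetical, and the actual quantization used in the experiments may differ in detail.

```python
import numpy as np

def f1_per_example(y, y_pred):
    """Example-based F1 score; defined as 1 when both vectors are empty (an assumption)."""
    denom = y.sum() + y_pred.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(y, y_pred).sum() / denom

def choose_threshold(scores, Y, criterion, candidates):
    """Return the candidate threshold with the best average criterion on the training data."""
    best_t, best_value = None, -np.inf
    for t in candidates:
        Y_bin = (scores >= t).astype(int)  # quantize real-valued predictions
        value = np.mean([criterion(y, p) for y, p in zip(Y, Y_bin)])
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value

# Toy real-valued predictions and ground-truth label vectors (made up for illustration).
scores = np.array([[0.9, 0.4, 0.1], [0.2, 0.6, 0.7], [0.8, 0.3, 0.55]])
Y = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1]])
t, value = choose_threshold(scores, Y, f1_per_example, candidates=[0.25, 0.5, 0.75])
print(t, value)  # 0.5 1.0
```

For a loss such as Rank loss, where lower is better, the same routine can be reused by passing the negated loss as the criterion.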

Figures 5 and 6 show the results of F1 score and Accuracy score across different embedded dimensions M. As M increases, all the algorithms reach better performance because of the better preservation of label information. CLEMS outperforms the non-cost-sensitive algorithms (PLST, CPLST, and FaIE) in most of the cases, which verifies the importance of constructing a cost-sensitive embedding. CLEMS also exhibits considerably better performance than SLEEC on most of the datasets, which demonstrates the usefulness of considering the cost information during the embedding (CLEMS) rather than after the embedding (SLEEC). The results of Rank loss are shown in Figure 7. CLEMS again achieves the best performance in most of the cases, which justifies its validity for asymmetric criteria through the mirroring trick.


Fig. 5 F1 score (↑) with the 95% confidence interval of CLEMS and LSDR algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: F1 score; curves: CLEMS, SLEEC, FaIE, PLST, CPLST.]

Fig. 6 Accuracy score (↑) with the 95% confidence interval of CLEMS and LSDR algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: Accuracy score; curves: CLEMS, SLEEC, FaIE, PLST, CPLST.]

4.2 Comparing CLEMS with LSDE algorithms

We compare CLEMS with ECC-based LSDE algorithms (Ferng and Lin, 2013). We consider two promising error-correcting codes from the original work, repetition code (ECC-RREP) and Hamming on repetition code (ECC-HAMR). The former is equivalent to the famous Random k-labelsets (RAkEL) algorithm (Tsoumakas et al, 2011a).


Fig. 7 Rank loss (↓) with the 95% confidence interval of CLEMS and LSDR algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: Rank loss; curves: CLEMS, SLEEC, FaIE, PLST, CPLST.]

Fig. 8 F1 score (↑) with the 95% confidence interval of CLEMS and LSDE algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K (CLEMS/others), with ticks 1.2/2, 1.4/4, 1.6/6, 1.8/8, 2.0/10; y-axis: F1 score; curves: CLEMS, ECC-RREP, ECC-HAMR.]

Figure 8 shows the results of F1 score. Note that in the figure, the scales of M/K for CLEMS and the other LSDE algorithms are different: the scale for CLEMS is {1.2, 1.4, 1.6, 1.8, 2.0}, while the scale for the other LSDE algorithms is {2, 4, 6, 8, 10}. Although we give the other LSDE algorithms more dimensions to embed the label information, CLEMS is still superior to them in most cases. Similar results hold for Accuracy score and Rank loss (Figures 9 and 10). The results again justify the superiority of CLEMS.


Fig. 9 Accuracy score (↑) with the 95% confidence interval of CLEMS and LSDE algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K (CLEMS/others), with ticks 1.2/2, 1.4/4, 1.6/6, 1.8/8, 2.0/10; y-axis: Accuracy score; curves: CLEMS, ECC-RREP, ECC-HAMR.]

Fig. 10 Rank loss (↓) with the 95% confidence interval of CLEMS and LSDE algorithms. [Panels: CAL500, emotions, birds, medical, enron, scene, yeast, slashdot, EUR-Lex(dc); x-axis: M / K (CLEMS/others), with ticks 1.2/2, 1.4/4, 1.6/6, 1.8/8, 2.0/10; y-axis: Rank loss; curves: CLEMS, ECC-RREP, ECC-HAMR.]

4.3 Candidate set and embedded dimension

Now, we discuss the influence of the candidate set S. In Section 3, we proposed to embed S_tr instead of Y. To verify the goodness of this choice, we compare CLEMS with different candidate sets. We consider sets sub-sampled with different percentages from S_tr to evaluate the importance of the label vectors in S_tr. Furthermore, to know whether a larger candidate set leads to better performance, we also randomly sample different percentages of additional label vectors from Y \ S_tr and merge them with S_tr as the candidate sets. The results on the three largest datasets are shown in Figures 11, 12, and 13. From the figures, we observe that sub-sampling from S_tr generally leads to worse performance; adding more candidates from Y \ S_tr, on the other hand, does not lead to significantly better performance. The two findings suggest that using S_tr as the candidate set is necessary and sufficient for decent performance.

Fig. 11 F1 score (↑) of CLEMS with different size of candidate sets. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: # of elements in S / # of elements in S_tr; y-axis: F1 score.]

Fig. 12 Accuracy score (↑) of CLEMS with different size of candidate sets. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: # of elements in S / # of elements in S_tr; y-axis: Accuracy score.]

Fig. 13 Rank loss (↓) of CLEMS with different size of candidate sets. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: # of elements in S / # of elements in S_tr; y-axis: Rank loss.]
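The candidate sets compared in this experiment could be constructed along the following lines. This is a hedged sketch: it assumes that Y is the full set {0,1}^K of binary label vectors, that additional vectors are drawn uniformly by rejection sampling, and that `build_candidate_set` and its `ratio` argument (the size of S relative to S_tr) are hypothetical names introduced for illustration.

```python
import numpy as np

def build_candidate_set(Y_train, ratio, rng):
    """Return a candidate set S with about ratio * |S_tr| distinct label vectors."""
    S_tr = np.unique(Y_train, axis=0)  # distinct label vectors seen in training
    n_tr, K = S_tr.shape
    target = int(round(ratio * n_tr))
    if target <= n_tr:
        # Sub-sample S_tr without replacement.
        idx = rng.choice(n_tr, size=target, replace=False)
        return S_tr[idx]
    # Otherwise add random vectors from {0,1}^K that are not already in S_tr.
    seen = {tuple(v) for v in S_tr}
    target = min(target, 2 ** K)  # cannot exceed the number of binary vectors
    extra = []
    while len(seen) < target:
        v = tuple(rng.integers(0, 2, size=K))
        if v not in seen:
            seen.add(v)
            extra.append(v)
    if not extra:
        return S_tr
    return np.vstack([S_tr, np.array(extra, dtype=S_tr.dtype)])

rng = np.random.default_rng(0)
Y_train = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
half = build_candidate_set(Y_train, 0.5, rng)    # sub-sampled from S_tr
double = build_candidate_set(Y_train, 2.0, rng)  # S_tr plus extra vectors
print(len(half), len(double))  # 2 8
```

A ratio of 1 recovers S_tr itself, which corresponds to the default choice of CLEMS.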

We conduct another experiment about the candidate set. Instead of random sampling, we consider S_all, which denotes the set of label vectors that appear in the training instances and the testing instances, to estimate the benefit of “peeping” at the testing label vectors and embedding them in advance. We show the results of CLEMS with S_tr (CLEMS-train) and S_all (CLEMS-all) versus different embedded dimensions in Figures 14, 15, and 16. From the figures, we see that the improvement of CLEMS-all over CLEMS-train is small and insignificant. The results imply again that S_tr readily allows nearest-neighbor decoding to make sufficiently good choices.
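Nearest-neighbor decoding over a candidate set can be illustrated with the short sketch below. The embedded vectors and the predicted vector are made-up toy values, the function name is hypothetical, and the handling of asymmetric criteria through the mirroring trick is omitted for brevity.

```python
import numpy as np

def nearest_neighbor_decode(z_pred, embedded, candidates):
    """Return the candidate label vector whose embedded vector is closest to z_pred."""
    distances = np.linalg.norm(embedded - z_pred, axis=1)
    return candidates[np.argmin(distances)]

# Toy candidate label vectors and their (made-up) two-dimensional embedded vectors.
candidates = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]])
embedded = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

z_pred = np.array([0.8, 0.3])  # predicted embedded vector for a test instance
print(nearest_neighbor_decode(z_pred, embedded, candidates))  # [0 1 1]
```

Because the decoder can only return vectors from the candidate set, enlarging the set helps only when the added vectors are actually the better predictions for some testing instances.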

Now, we discuss the embedded dimension M. From Figures 14, 15, and 16, CLEMS reaches better performance as M increases. For LSDR, M plays an important role since it decides how much information can be preserved in the embedded space. Nevertheless, for LSDE, the improvement becomes marginal when M increases. The results suggest that for LSDE, the influence of the additional dimensions is not large, and setting the embedded dimension M = K is sufficiently good in practice. One possible reason for the sufficiency is that the criteria of interest are generally not complicated enough and thus do not need more dimensions to preserve the cost information.

Fig. 14 F1 score (↑) with the 95% confidence interval of CLEMS-train and CLEMS-all. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: F1 score; curves: CLEMS-train, CLEMS-all.]

Fig. 15 Accuracy score (↑) with the 95% confidence interval of CLEMS-train and CLEMS-all. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: Accuracy score; curves: CLEMS-train, CLEMS-all.]

Fig. 16 Rank loss (↓) with the 95% confidence interval of CLEMS-train and CLEMS-all. [Panels: yeast, slashdot, EUR-Lex(dc); x-axis: M / K; y-axis: Rank loss; curves: CLEMS-train, CLEMS-all.]

4.4 Comparing CLEMS with cost-sensitive algorithms

In this section, we compare CLEMS with two state-of-the-art cost-sensitive algorithms, probabilistic classifier chain (PCC) (Dembczynski et al, 2010, 2011) and condensed filter tree (CFT) (Li and Lin, 2014). Both CLEMS and CFT can handle arbitrary criteria while PCC can handle only those criteria with efficient inference rules. In addition, we also report the results of some baseline algorithms, such as binary relevance (BR) (Tsoumakas and Katakis, 2007) and classifier chain (CC) (Read et al, 2011). Similar to previous experiments, the internal predictors of all algorithms are set as random forest (Breiman, 2001) implemented by scikit-learn (Pedregosa et al, 2011) with the same parameter selection process.

Running Time. Figure 17 illustrates the average training, predicting, and total running time when taking F1 score as the intended criterion for the six largest datasets. The running time is normalized by the running time of BR. For training time, CFT is the slowest, because it needs to iteratively estimate the importance of each label and re-train internal predictors. CLEMS, which consumes time for MDS calculations, is as expected slower than the baseline algorithms and PCC during training, but still much faster than CFT. For prediction time, all algorithms, including PCC (using inference calculation) and CLEMS (using nearest-neighbor calculation), are similarly fast. The results suggest that for CSMLC, CLEMS is superior to CFT and competitive with PCC in overall efficiency.

Fig. 17 Average running time when taking F1 score as cost function. [Panels: (a) average training time, (b) average predicting time, (c) average total running time; x-axis: medical, enron, scene, yeast, slashdot, EUR-Lex(dc); y-axis: % of BR's time; bars: BR, CC, CLEMS, CFT, PCC.]

Performance. We compare the performance of CLEMS and other algorithms across different criteria. To demonstrate the full ability of CLEMS, in addition to F1 score, Accuracy score, and Rank loss, we further consider one additional criterion, Composition loss = 1 + 5 × Hamming loss − F1 score, as used by Li and Lin (2014). We also consider three more datasets (arts, flags, and language-log) that come from other MLC works (Tsoumakas et al, 2011b; Read et al, 2016).
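As a worked illustration of Composition loss, the sketch below evaluates it for a single pair of label vectors. It assumes the usual per-example definitions, namely Hamming loss as the fraction of mismatched labels and F1 score as 2|y ∧ ỹ| / (|y| + |ỹ|), with F1 set to 1 when both vectors are empty; the function names are hypothetical.

```python
import numpy as np

def hamming_loss(y, y_pred):
    """Fraction of labels on which the prediction disagrees with the truth."""
    return float(np.mean(y != y_pred))

def f1_score_example(y, y_pred):
    """Example-based F1 score; defined as 1 when both vectors are empty (an assumption)."""
    denom = y.sum() + y_pred.sum()
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(y, y_pred).sum() / denom)

def composition_loss(y, y_pred):
    """Composition loss = 1 + 5 x Hamming loss - F1 score."""
    return 1.0 + 5.0 * hamming_loss(y, y_pred) - f1_score_example(y, y_pred)

y = np.array([1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0])
print(composition_loss(y, y_pred))  # 1 + 5 * 0.5 - 0.5 = 3.0
```

A perfect prediction gives Hamming loss 0 and F1 score 1, so the Composition loss is 0 in that case.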

The results are shown in Table 2. Accuracy score and Composition loss for PCC are left blank since there are no efficient inference rules for them. The first finding is that cost-sensitive algorithms (CLEMS, PCC, and CFT) generally perform better than non-cost-sensitive algorithms (BR and CC) across different criteria. This validates the usefulness of cost-sensitivity for MLC algorithms.

For F1 score, Accuracy score, and Composition loss, CLEMS outperforms PCC and CFT in most cases. The reason is that these criteria evaluate all the labels jointly, and CLEMS can globally locate the hidden structure of labels to facilitate more effective learning, while PCC and CFT are chain-based algorithms and only locally discover the relation between labels. For Rank loss, PCC performs the best in most cases. One possible reason is that Rank loss can be expressed as a