
## Multi-label Classification with Principal Label Space Transformation

### Farbound Tai and Hsuan-Tien Lin

{b94901176, htlin}@ntu.edu.tw

Department of Computer Science, National Taiwan University

Keywords: multi-label classification, hypercube view, regression

Abstract

We consider a hypercube view to perceive the label space of multi-label classification problems geometrically. The view allows us not only to unify many existing multi-label classification approaches, but also to design a novel algorithm, Principal Label Space Transformation (PLST), which captures key correlations between labels before learning. The simple and efficient PLST relies only on the singular value decomposition as its key step. We derive the theoretical guarantee of PLST and evaluate its empirical performance using real-world data sets. Experimental results demonstrate that PLST is faster than the traditional binary relevance approach and is superior to the modern compressive sensing approach in terms of both accuracy and efficiency.

### 1 Introduction

*Multi-label classification* problems naturally arise in domains such as text mining, vision, or bio-informatics. For instance, a document is usually associated with more than one category; a picture often includes many objects; a gene is usually multi-functional.

The problem generalizes the traditional multi-class classification problem—the former allows a set of labels to be associated with an instance while the latter allows only one.

Because of the wide range of potential applications in genomics (Barutcuoglu et al., 2006; Vens et al., 2008), scene classification (Boutell et al., 2004), video segmentation (Snoek et al., 2006), music classification (Trohidis et al., 2008) and text categorization (Schapire and Singer, 2000), multi-label classification is attracting more and more research attention.

Existing multi-label classification approaches usually fall into one of two categories (Tsoumakas et al., 2010a): *Algorithm Adaptation* or *Problem Transformation*. As its name suggests, Algorithm Adaptation directly extends some specific algorithms to solve the multi-label classification problem. Typical members of Algorithm Adaptation include Adaboost.MH (Schapire and Singer, 2000), Multi-label C4.5 (Clare and King, 2001) and ML-KNN (Zhang and Zhou, 2007). Problem Transformation (sometimes also called reduction) approaches, on the other hand, transform the multi-label classification problem to one or more reduced tasks. Typical members of Problem Transformation include Label Power-set, Binary Relevance and Label Ranking (Fürnkranz et al., 2008; Elisseeff and Weston, 2002). Label Power-set reduces multi-label classification to multi-class classification by treating each distinct label set as a unique multi-class label. Binary Relevance, also known as one-versus-all, reduces multi-label classification to many different binary classification tasks, one for each label. Label Ranking approaches transform the multi-label classification problem to the task of ranking all the labels by relevance and the task of determining a threshold of relevance. As can be seen from the above, an advantage of Problem Transformation over Algorithm Adaptation is that any algorithm that deals with the reduced tasks can be easily extended to multi-label classification via the transformation.

In this paper, we discuss Problem Transformation approaches from a special perspective: the hypercube view. The view describes all possible label sets in the multi-label classification problem as the vertices of a high-dimensional hypercube. The view not only unifies Label Power-set, Binary Relevance and Label Ranking under the same framework, but also allows us to design better methods that make use of the geometric properties of those label-set vertices. We demonstrate the use of the hypercube view with a novel method, Principal Label Space Transformation (PLST), which captures the key correlations between labels using a flat (a low-dimensional linear subspace) in the high-dimensional space. The method only uses a simple linear encoding of the vertices and a simple linear decoding of the predictions, both easily computed from the Singular Value Decomposition (SVD) of a matrix composed of the label-set vertices. Moreover, by keeping only the key correlations, PLST can dramatically decrease the number of reduced tasks to be solved without loss of prediction accuracy. Such a computational advantage is especially important for scaling up multi-label classification algorithms to a larger number of labels (Tsoumakas et al., 2010a).

Another recent work, multi-label prediction via Compressive Sensing (Hsu et al., 2009), also seeks to perform multi-label classification with a linear encoding of the label-set vertices. Compressive Sensing operates under the assumption of sparsity in the label sets and thus can describe the label-set vertices with a small number of linear random projections as its encoding. Although the encoding component of Compressive Sensing is linear, the decoding component is not. In particular, for each incoming test instance, Compressive Sensing needs to solve an optimization problem with respect to its sparsity assumption. That is, Compressive Sensing can be time-consuming during prediction. In our experiments, we will demonstrate that PLST is not only more efficient but also more accurate than Compressive Sensing.

As mentioned by Tsoumakas et al. (2010a) and Hsu et al. (2009), large-scale multi-label classification poses a computational challenge, as even the efficient Binary Relevance can require thousands of classifiers. Other Problem Transformation methods such as Label Ranking or Label Power-set come with computational complexity that grows polynomially or exponentially with the number of labels and are thus not feasible for the challenge. PLST can be viewed as a linear dimension-reduction method in the label space for conquering both the training and the prediction parts of the challenge; Compressive Sensing solves the training part of the challenge, but not the prediction part.

For dimension reduction in the label space, there are also methods that are based on non-linear dimension reduction. A representative method is to use the topic model to group labels into a small set of topics (Law et al., 2010). The method solves the prediction part of the challenge (predicting the few topics instead of the many labels), but training a topic model is computationally non-trivial. We thus focus our attention only on linear dimension reduction methods like PLST for efficiency of both training and prediction.

When viewing multi-label classification as a special case of the structured output prediction problem, the kernel dependency estimation algorithm (KDE; Weston et al., 2002) could be applied to multi-label classification by designing appropriate kernels for the label space (Dembczynski et al., 2010a). Interestingly, PLST is equivalent to the linear form of KDE for the special case. The linear PLST avoids the computationally expensive pre-image problem (Weston et al., 2002) in the general non-linear KDE. To the best of our knowledge, neither the linear form of KDE nor its application to multi-label classification has been seriously studied. Our work provides a solid understanding of the linear form of KDE with novel theoretical and empirical results.

Some other related methods come from works on multi-task learning, which takes multi-label classification as a simple special case. Ando and Zhang (2005) propose a multi-task learning method, SVD-based alternating structure optimization (SVD-ASO), that simultaneously optimizes a loss function of all the tasks (label predictions) and performs dimension reduction to learn a compact joint representation of the feature space. A similar formulation is taken for multi-label classification by Ji et al. (2010).

While both our proposed PLST method and SVD-ASO use SVD as a core component, they differ in that the former performs dimension reduction on the label space while the latter focuses on the feature space.

The paper is organized as follows. In Section 2, we give a formal setup of the multi-label classification problem and introduce the hypercube view. Then, in Section 3, we unify Binary Relevance and Compressive Sensing (CS) under the same framework via the hypercube view and describe our proposed method: PLST. Finally, we present the experimental results in Section 4 and conclude in Section 5.

### 2 Hypercube View

In the multi-label classification problem, we seek a multi-label classifier that maps the input vector $\mathbf{x} \in \mathbb{R}^d$ to a set of labels $\mathcal{Y}$, where $\mathcal{Y} \subseteq \mathcal{L} = \{1, 2, \ldots, K\}$ with $K$ being the number of classes. Consider a training set $S$ that contains $N$ training examples of the form $(\mathbf{x}_n, \mathcal{Y}_n)$. Multi-label classification aims at using $S$ to find a multi-label classifier $g : \mathbb{R}^d \to 2^{\mathcal{L}}$ such that $g(\mathbf{x})$ predicts $\mathcal{Y}$ well on any future test example $(\mathbf{x}, \mathcal{Y})$.

Figure 1: A hypercube with $K = 3$.

The key of the *hypercube view* is to represent the label set $\mathcal{Y}$ by a vector $\mathbf{y} \in \{0, 1\}^K$, where the $k$-th component of $\mathbf{y}$ is 1 if and only if $k \in \mathcal{Y}$. Then, as shown in Fig. 1, we can visualize each $\mathcal{Y}$ as a vertex of a $K$-dimensional hypercube. The $k$-th component of $\mathbf{y}$ corresponds to an axis of the hypercube, which represents the presence or absence of a label $k$ in $\mathcal{Y}$. We will use $\mathcal{Y}$ and its corresponding $\mathbf{y}$ interchangeably in this paper. The hypercube view allows us to unify many existing Problem Transformation approaches, as discussed below.
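The correspondence between a label set and a hypercube vertex can be sketched in a few lines of Python. This is only an illustration; the function names `label_set_to_vertex` and `vertex_to_label_set` are hypothetical and not part of the original work.

```python
def label_set_to_vertex(label_set, num_labels):
    """Return y in {0,1}^K with y[k-1] = 1 iff label k (numbered 1..K) is in the set."""
    return [1 if k in label_set else 0 for k in range(1, num_labels + 1)]


def vertex_to_label_set(vertex):
    """Inverse map: collect the (1-based) indices of the components equal to 1."""
    return {k + 1 for k, bit in enumerate(vertex) if bit == 1}


# The label set {1, 3} with K = 3 is the vertex [1, 0, 1] of the cube in Fig. 1.
y = label_set_to_vertex({1, 3}, 3)
```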

*Hypercube View of Label Power-set:* One of the simplest approaches to multi-label classification is Label Power-set, as shown in Algorithm 1.

Algorithm 1 Label Power-set

1. pre-processing: map each vertex $\mathbf{y}_n$ (or each label set $\mathcal{Y}_n$) to a hyper-label $y_n \in \{1, 2, \cdots, 2^K\}$ with a bijection function $B$.
2. training: learn a multi-class classifier $g_c(\mathbf{x})$ from $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$.
3. predicting: for each $\mathbf{x}$, return $B^{-1}\bigl(g_c(\mathbf{x})\bigr)$.

In particular, Label Power-set simply treats each vertex of the hypercube as a different hyper-label and performs regular multi-class classification with the hyper-labels. That is, Label Power-set essentially breaks the structure of the hypercube and does not consider the relations (edges) between the vertices. The approach is often criticized for the large number of possible hyper-labels and the relatively small number of examples per hyper-label, which may degrade the learning performance.

*Hypercube View of Binary Relevance:* Another straightforward approach to multi-label classification is Binary Relevance. This approach decomposes the original multi-label problem into $K$ isolated relevance-learning sub-tasks, as shown in Algorithm 2.

Algorithm 2 Binary Relevance

1. training: for $k = 1$ to $K$, learn a relevance function $r_k(\mathbf{x})$ from $\{(\mathbf{x}_n, \mathbf{y}_n[k])\}_{n=1}^N$.
2. predicting: for each input vector $\mathbf{x}$, compute $\mathbf{r}(\mathbf{x}) \equiv [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_K(\mathbf{x})]$. Then, return $\mathrm{round}(\mathbf{r}(\mathbf{x}))$, where $\mathrm{round}(\cdot)$ maps each component of the vector to the closest value in $\{0, 1\}$.

Using the hypercube view, the $k$-th iteration of Binary Relevance can be thought of as projecting the vertices to the $k$-th dimension (axis) before training. In addition, the relevance vector $\mathbf{r}(\mathbf{x}) \equiv [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_K(\mathbf{x})]$ can be viewed as a point in $\mathbb{R}^K$, and the $\mathrm{round}(\cdot)$ operation maps the point to the closest vertex of the hypercube in terms of the $\ell_1$-distance.
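The claim that component-wise rounding returns the $\ell_1$-closest vertex can be checked with a small sketch. The relevance vector below is made up for illustration, the helper names are hypothetical, and ties at exactly 0.5 are sent to 1 as an assumption.

```python
from itertools import product


def round_to_vertex(r):
    """Map each component to the closest value in {0, 1} (ties at 0.5 go to 1)."""
    return [1 if value >= 0.5 else 0 for value in r]


def closest_vertex_l1(r):
    """Brute-force search over all 2^K vertices for the smallest l1 distance."""
    return list(min(product([0, 1], repeat=len(r)),
                    key=lambda v: sum(abs(a - b) for a, b in zip(v, r))))


# A made-up relevance vector r(x) for K = 3.
r = [0.9, 0.2, 0.6]
same_vertex = round_to_vertex(r) == closest_vertex_l1(r)
```

The brute-force search is exponential in $K$ and serves only as a check; the rounding itself costs $O(K)$.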

Despite its effectiveness, Binary Relevance is often criticized for neglecting the correlation between labels, which may carry useful information in multi-label classification tasks. Furthermore, the training complexity of Binary Relevance is linear in the number of labels $K$, which can still be expensive if $K$ is large. Recently, Hsu et al. (2009) attempted to address this problem through Compressive Sensing, which will be discussed later in this section.

*Hypercube View of Label Ranking:* As shown in Algorithm 3, Label Ranking approaches learn two components (jointly or separately) from the multi-label classification data set: the order of label relevance, which is often represented by a scoring function on the labels, and the threshold for label presence. Note that Binary Relevance is a special case of Label Ranking when taking the relevance function $\mathbf{r}(\mathbf{x})$ as the scoring function and a naïve threshold at 0.5 per label.

Using the hypercube view, the ordering component in Label Ranking can be thought of as learning a length-$K$ path from $[0, 0, \cdots, 0]$ to $[1, 1, \cdots, 1]$ using the hypercube edges. Each vertex $\mathbf{y}_n$ in the training examples then represents multiple length-$K$ edge-paths that go through $\mathbf{y}_n$, whose $\ell_1$ norm indicates the desired thresholding level. An early representative of Label Ranking is Rank-SVM (Elisseeff and Weston, 2002), in which the scores are obtained via the relevance function in Binary Relevance and the thresholding function comes from estimating the number of relevant labels. Another popular approach is Calibrated Label Ranking (Fürnkranz et al., 2008), in which the scoring function is learned from a pairwise comparison of the labels and the thresholding function comes from the score of an additional virtual label that is added during training.

Algorithm 3 Label Ranking

1. training: learn a scoring function $\mathbf{s}(\mathbf{x}) \equiv [s_1, s_2, \cdots, s_K]$ from $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^N$ that gives a score $s_i$ to each label $l_i$ in $\mathcal{L}$, and a threshold function $t$ that converts scores above a certain threshold to 1 and the rest to 0.
2. predicting: for each $\mathbf{x}$, return $t(\mathbf{s}(\mathbf{x}))$.

*Hypercube View of Compressive Sensing:* Under the assumption that the label sets $\mathcal{Y}$ are sparse (i.e., contain only a few elements), it is possible to compress the label sets and learn to predict the compressed labels instead. Such a possibility allows Compressive Sensing (CS; Hsu et al., 2009) to reduce the number of sub-tasks in Binary Relevance to be computationally feasible for data sets with a large $K$. In particular, each label set $\mathcal{Y}$ (vertex $\mathbf{y}$) can be taken as a $K$-dimensional signal. The theory of compressive sensing states that when the signals are sparse, one does not need to sample at the Nyquist rate in order to accurately recover the original signals. A vector is said to be *$s$-sparse* if it contains at most $s$ nonzero entries. Thus, as the sketch of CS in Algorithm 4 shows, when all $\mathbf{y}$ contain only a few 1's, CS only needs to solve $M \ll K$ sub-tasks instead of $K$ for multi-label classification.

Using the hypercube view, the $m$-th iteration of CS can be thought of as projecting the vertices to a random direction before training. Because $M \ll K$, the subspace explored by CS is much smaller than the space that the hypercube resides in. CS is able to work on such a small subspace because of the label-set sparsity assumption, which implies that only a limited number of vertices in the hypercube are relevant for the multi-label classification task.

Algorithm 4 Compressive Sensing

1. pre-processing: compress $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{h}_n)\}$, where $\mathbf{h} = \mathbf{P}_s \cdot \mathbf{y}$ using an $M$ by $K$ random projection matrix $\mathbf{P}_s$ with $M$ determined by the assumed sparsity level $s$. Each label set $\mathbf{y}_n$ is assumed to be $s$-sparse.
2. training: for $m = 1$ to $M$, learn a function $r_m(\mathbf{x})$ from $\{(\mathbf{x}_n, \mathbf{h}_n[m])\}_{n=1}^N$.
3. prediction: for each input vector $\mathbf{x}$, compute $\mathbf{r}(\mathbf{x}) = [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_M(\mathbf{x})]$. Then, obtain a sparse vector $\hat{\mathbf{y}}$ such that $\mathbf{P}_s \cdot \hat{\mathbf{y}}$ is "closest" to $\mathbf{r}(\mathbf{x})$ using an optimization algorithm. Finally, return $\hat{\mathbf{y}}$.

Although the random projection in the pre-processing step of CS is efficient, the prediction step requires solving an optimization problem for every incoming input vector $\mathbf{x}$. Such a prediction step is very time-consuming. In addition, the assumption on label-set sparsity restricts the practical use of the CS approach.

*Hypercube View of Topic Modeling:* Under the assumption that $P(\mathbf{y}|\mathbf{x})$, the probability of getting a particular label-set vector $\mathbf{y}$ given $\mathbf{x}$, can be modeled through a hidden random variable $z \in \{1, 2, \cdots, M\}$ called the "topic", topic modeling decomposes $P(\mathbf{y}|\mathbf{x})$ to

$$P(\mathbf{y}|\mathbf{x}) = \sum_{m=1}^{M} P(\mathbf{y}|z = m)\, \underbrace{P(z = m|\mathbf{x})}_{r_m(\mathbf{x})}.$$

In the original work of topic modeling (Law et al., 2010), the former term $P(\mathbf{y}|z = m)$ is learned with latent Dirichlet allocation (Blei et al., 2003); the latter term $P(z = m|\mathbf{x})$ is learned with the maximum entropy classifier (Csiszár, 1995). Note that a topic is essentially a cluster of label-set vertices $\mathbf{y}$. Thus, more generally, any probabilistic clustering algorithm can be used to get the former term $P(\mathbf{y}|z = m)$ and any probabilistic classification algorithm can be used to get the latter term $r_m(\mathbf{x}) = P(z = m|\mathbf{x})$, as shown in Algorithm 5.

The original work of topic modeling (Law et al., 2010) treats $P(\mathbf{y}|\mathbf{x})$ as a standalone probabilistic classifier and does not discuss much about its deterministic decoding. One simple procedure of making a deterministic prediction, as illustrated in Algorithm 5, is to round from the expected value of $\mathbf{y}$ given $\mathbf{x}$. The procedure equivalently finds the best deterministic prediction $\hat{\mathbf{y}}$ subject to $P(\mathbf{y}|\mathbf{x})$ in terms of the Hamming loss, a popular performance measure that will be discussed later in this section.

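Algorithm 5 Generalized Topic Modeling

1. clustering: cluster $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{h}_n)\}$, where $\mathbf{h}_n[m]$ represents the probability of $\mathbf{y}_n$ residing in the $m$-th cluster characterized by $P(\mathbf{y}|z = m)$. Let $\mathbf{p}_m$ be the center of the cluster, i.e., the expected value of $P(\mathbf{y}|z = m)$.
2. training: for $m = 1$ to $M$, learn a probabilistic classifier $\mathbf{r}(\mathbf{x})$ from $\{(\mathbf{x}_n, \mathbf{h}_n[m])\}_{n=1}^N$, where the $m$-th component of $\mathbf{r}(\mathbf{x})$ indicates the probability of $\mathbf{x}$ being in cluster $m$.
3. prediction: for each input vector $\mathbf{x}$, compute $\mathbf{r}(\mathbf{x}) = [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_M(\mathbf{x})]$. Then, compute

$$\tilde{\mathbf{y}} = \sum_{m=1}^{M} r_m(\mathbf{x}) \cdot \mathbf{p}_m$$

and return $\hat{\mathbf{y}} = \mathrm{round}(\tilde{\mathbf{y}})$.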

Let the vector $\mathbf{p}_m$ be the expected value of $\mathbf{y}$ given $z = m$, which is the center of the cluster. The vector is inside the hypercube and can be viewed as the representative point of the cluster. From the hypercube view, generalized topic modeling identifies the representative point of each cluster (which is able to capture the nearby vertices of the hypercube), and then adopts a probabilistic multi-class classifier to map the input vector $\mathbf{x}$ to a distribution over those representative points. An extreme case of topic modeling is thus Label Power-set (Algorithm 1), which takes each vertex as its own cluster and a deterministic classifier $g$ as the probabilistic classifier $\mathbf{r}$. If the label-set vectors $\mathbf{y}$ form a small number of meaningful clusters in $\mathbb{R}^K$, topic modeling can use the property to predict efficiently and effectively. From the geometric perspective, however, clustering in a $K$-dimensional space is a non-trivial and time-consuming task when $K$ is large because of the curse of dimensionality.
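The prediction step of generalized topic modeling, a weighted average of cluster centers followed by rounding, might look as follows in Python. The cluster centers and topic probabilities here are invented for illustration, and `topic_decode` is a hypothetical helper name.

```python
def topic_decode(r, centers):
    """Round the expected label vector sum_m r[m] * centers[m] to a hypercube vertex."""
    num_labels = len(centers[0])
    y_tilde = [sum(r_m * p_m[k] for r_m, p_m in zip(r, centers))
               for k in range(num_labels)]
    return [1 if value >= 0.5 else 0 for value in y_tilde]


# Two made-up cluster centers p_1, p_2 inside the 3-dimensional hypercube.
centers = [[0.9, 0.1, 0.8],
           [0.1, 0.9, 0.7]]
# Made-up topic probabilities r(x) = [P(z=1|x), P(z=2|x)].
r = [0.75, 0.25]
y_hat = topic_decode(r, centers)  # y_tilde is roughly [0.7, 0.3, 0.775]
```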

*Hypercube View of Kernel Dependency Estimation:* Consider a (high-dimensional) transform function $\phi : \mathbb{R}^K \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space; let the kernel $\mathcal{K}(\mathbf{y}, \mathbf{y}')$ embed the inner product $\langle \phi(\mathbf{y}), \phi(\mathbf{y}') \rangle$ that represents the similarity between $\mathbf{y}$ and $\mathbf{y}'$. Under the assumption that $\phi(\mathbf{y})$ approximately resides in an $M$-dimensional flat (a linear subspace) within $\mathcal{H}$, the kernel dependency estimation approach (Weston et al., 2002) implicitly locates the reference point $\mathbf{o}$ and the basis vectors $\{\mathbf{u}_m\}_{m=1}^M$ of the flat using the kernel $\mathcal{K}$. Then, the approach transforms $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{h}_n)\}$ by $\mathbf{h}_n[m] = \langle \phi(\mathbf{y}_n) - \mathbf{o}, \mathbf{u}_m \rangle$. For each $m = 1, 2, \cdots, M$, the approach then performs kernel ridge regression (Saunders et al., 1998) from $\mathbf{x}$ to $\mathbf{h}[m]$ to get a regressor $r_m(\mathbf{x})$. During prediction, the approach returns the best $\mathbf{y}$ such that each $\langle \phi(\mathbf{y}) - \mathbf{o}, \mathbf{u}_m \rangle \approx r_m(\mathbf{x})$, as shown in Algorithm 6.

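Algorithm 6 Kernel Dependency Estimation

1. decomposition of output space: perform kernel principal component analysis on $\mathbf{y}$ with some kernel function that embeds the transformation $\phi$; transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{h}_n)\}$, where $\mathbf{h}_n[m] = \langle \phi(\mathbf{y}_n) - \mathbf{o}, \mathbf{u}_m \rangle$ with $\mathbf{o}$ being the mean of $\phi(\mathbf{y}_n)$ and $\mathbf{u}_m$ being the $m$-th principal component.
2. training: for $m = 1$ to $M$, learn a function $r_m(\mathbf{x})$ from $\{(\mathbf{x}_n, \mathbf{h}_n[m])\}_{n=1}^N$ with kernel ridge regression (or more generally, any regression algorithm).
3. prediction: for each input vector $\mathbf{x}$, compute $\mathbf{r}(\mathbf{x}) = [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_M(\mathbf{x})]$. Then, return

$$\hat{\mathbf{y}} = \operatorname*{argmin}_{\mathbf{y} \in \{0,1\}^K} \Bigl\| \bigl[ \langle \phi(\mathbf{y}) - \mathbf{o}, \mathbf{u}_1 \rangle, \langle \phi(\mathbf{y}) - \mathbf{o}, \mathbf{u}_2 \rangle, \cdots, \langle \phi(\mathbf{y}) - \mathbf{o}, \mathbf{u}_M \rangle \bigr] - \mathbf{r}(\mathbf{x}) \Bigr\|. \qquad (1)$$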

Consider a point $\mathbf{y}' \in \mathbb{R}^K$. If $\phi(\mathbf{y}')$ falls in the $M$-dimensional flat in $\mathcal{H}$,

$$\langle \phi(\mathbf{y}') - \mathbf{o}, \mathbf{u}_m \rangle = 0 \quad \text{for } m = 1, 2, \cdots, M.$$

That is, the points $\mathbf{y}'$ reside in a hypersurface defined by the intersection of $M$ hypersurfaces $\langle \phi(\mathbf{y}') - \mathbf{o}, \mathbf{u}_m \rangle = 0$ in $\mathbb{R}^K$. From the hypercube view, kernel dependency estimation assumes that the vertices $\mathbf{y}$ are close to a hypersurface that corresponds to some $M$-dimensional flat in $\mathcal{H}$, and then performs learning on the flat instead of in the original space. Because the hypersurface is usually nonlinear, the prediction procedure (1) in Algorithm 6 is a challenging optimization task and can be time-consuming.

The key geometric objects used for modeling multi-label classification in the representative Problem Transformation (PT) approaches above are summarized in Table 1.

Table 1: Geometric Objects behind PT Approaches

| approach | geometric objects |
|---|---|
| label power-set | vertices |
| binary relevance | axes |
| label ranking | edge-paths |
| compressive sensing | flat that approximates close-to-origin vertices |
| topic modeling | cluster of vertices |
| kernel dependency estimation | hypersurface that approximates vertices |

The hypercube view not only unifies the PT approaches above, but also offers a geometric interpretation for the Hamming loss, which is commonly used to evaluate multi-label classifiers (Dembczynski et al., 2010b). Assume that the target label-set vertex is $\mathbf{y}$ and the predicted vertex is $\hat{\mathbf{y}}$. Hamming loss is defined as

$$\Delta(\hat{\mathbf{y}}, \mathbf{y}) \equiv \frac{1}{K} \sum_{k=1}^{K} \hat{\mathbf{y}}[k] \oplus \mathbf{y}[k].$$

An alternative way to look at Hamming loss is

$$\Delta(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{K} \|\hat{\mathbf{y}} - \mathbf{y}\|_1.$$

That is, Hamming loss is simply a scaled $\ell_1$-distance between $\hat{\mathbf{y}}$ and $\mathbf{y}$. The distance also corresponds to the shortest edge-path to walk from $\mathbf{y}$ to $\hat{\mathbf{y}}$ on the hypercube. The hypercube view justifies that Hamming loss can be a suitable error measure for algorithms that operate with respect to the space or the structure of the hypercube, such as Binary Relevance, CS (decoding to the closest vertex) or Label Ranking (thresholding edge-paths).
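The two expressions for the Hamming loss can be compared directly in code. The following is a minimal sketch with made-up vertices; the function names are hypothetical.

```python
def hamming_loss(y_hat, y):
    """Average of the component-wise XOR between two vertices."""
    return sum(a ^ b for a, b in zip(y_hat, y)) / len(y)


def scaled_l1_distance(y_hat, y):
    """l1 distance between two vertices, scaled by 1/K."""
    return sum(abs(a - b) for a, b in zip(y_hat, y)) / len(y)


# Made-up prediction and target with K = 4; they differ in two components.
y_hat = [1, 0, 1, 0]
y = [1, 1, 0, 0]
loss = hamming_loss(y_hat, y)
```

On binary vectors the XOR of two components equals their absolute difference, which is why the two functions agree.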

### 3 Proposed Approach

From the hypercube view, CS relies on label-set sparsity to consider a small number of vertices of the hypercube. Our proposed approach stems from the same consideration, but without requiring the assumption on label-set sparsity. From the hypercube view, there are $2^K$ vertices of the hypercube, and each training example $(\mathbf{x}_n, \mathbf{y}_n)$ occupies only one vertex $\mathbf{y}_n$. In large multi-label classification data sets, it is typical for $K$ to be in the hundreds or even thousands. Then, usually the number of training examples $N \ll 2^K$. In addition, not all $2^K$ vertices are needed for the multi-label classification problem because of the possible hierarchy, correlation or hidden relationship between the different labels. For instance, if classes labeled 1 and 2 are disjoint sub-classes of class 3, only 3 vertices out of the 8 candidates are needed: $[0, 0, 0]$, $[1, 0, 1]$, $[0, 1, 1]$. Thus, during training, relatively few vertices will be occupied by a decent number of examples. We call this phenomenon *hypercube sparsity* to distinguish it from the *label-set sparsity* that CS uses.

Note that label-set sparsity implies hypercube sparsity, but not vice versa. By definition, for a data set with label-set sparsity at $s$, all the hypercube vertices with more than $s$ labels are unoccupied by training examples, which is the phenomenon of hypercube sparsity. For instance, if a data set is label-set sparse at $s = 2$, then such a data set is also hypercube sparse because the number of occupied vertices is at most $\binom{K}{2} + K + 1 \ll 2^K$. On the other hand, hypercube sparsity does not necessarily imply label-set sparsity, because the few occupied label-set vertices may contain many labels. For instance, a data set with all label sets containing at least $(K - 1)$ labels is hypercube sparse with the number of occupied vertices being at most $K + 1 \ll 2^K$, but is by no means label-set sparse.
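To get a feel for the gap between these counts and $2^K$, the bound on occupied vertices can be evaluated numerically. The sketch below uses $K = 20$ as an arbitrary illustrative value; `max_occupied_vertices` is a hypothetical helper name.

```python
from math import comb


def max_occupied_vertices(num_labels, sparsity):
    """Number of vertices of the K-cube with at most `sparsity` ones."""
    return sum(comb(num_labels, i) for i in range(sparsity + 1))


K = 20
sparse_count = max_occupied_vertices(K, 2)  # C(20,2) + 20 + 1 = 211
all_vertices = 2 ** K                       # 1048576
# Vertices with at least K-1 ones: the all-ones vertex plus K with a single zero.
dense_count = comb(K, K) + comb(K, K - 1)   # K + 1 = 21
```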

Because of the hypercube sparsity, multi-label classification algorithms do not need to learn with the entire hypercube in $\mathbb{R}^K$ and can focus on some vertices of the hypercube (and their neighborhood area) instead. For instance, the Pruned Label Power-set approach (Read et al., 2008), which is a variant of the usual Label Power-set, only considers vertices occupied by enough examples during training; topic modeling (Law et al., 2010) groups the occupied vertices as clusters; kernel dependency estimation (Weston et al., 2002) describes the occupied vertices by a (possibly non-linear) hypersurface. In other words, hypercube sparsity allows dimensionality reduction in the label space without loss of prediction performance.

### 3.1 Linear Label Space Transformation

As shown in Algorithm 4, under the assumption of label-set sparsity, CS is able to compress (reduce) the label space using an $M$ by $K$ random projection matrix $\mathbf{P}_s$. The random projection matrix defines a flat, which is a linear subspace of $\mathbb{R}^K$ with at most $M$ dimensions. When taking the hypercube sparsity into account, could a flat also be helpful in modeling the occupied vertices in lower dimensions?

Let us first consider the case when there are two labels $\mathbf{y}[1]$ and $\mathbf{y}[2]$ and they are always the same for every example, that is, a fully-correlated and equivalent relationship. The equivalence causes hypercube sparsity and hence only vertices $[0, 0]$ and $[1, 1]$ are needed to model the multi-label classification problem. Intuitively, for the particular problem, it suffices to predict only $\mathbf{y}[1]$ and then replicate the prediction for $\mathbf{y}[2]$. The intuition corresponds to making predictions by projecting $\mathbf{y}$ to the line (a 1-dimensional flat) $[0, 0] + \alpha [1, 1]$ and back, which is a form of linear dimension reduction.

We can extend the case to a multi-label classification problem that occupies three vertices: $\{[0, 0], [0, 1], [1, 1]\}$. In other words, the examples with $\mathbf{y}[1] = 1$ are a subset of the examples with $\mathbf{y}[2] = 1$, i.e., an inclusive relationship. Consider the line

$$\left[0, \tfrac{1}{2}\right] + \alpha [1, 1].$$

The vertex $[0, 0]$ projects to the point $[-\tfrac{1}{4}, \tfrac{1}{4}]$, which is of $\alpha = -\tfrac{1}{4}$; the vertex $[0, 1]$ projects to the point $[\tfrac{1}{4}, \tfrac{3}{4}]$, of $\alpha = \tfrac{1}{4}$; the vertex $[1, 1]$ projects to the point $[\tfrac{3}{4}, \tfrac{5}{4}]$, of $\alpha = \tfrac{3}{4}$. Then, a simple dimension reduction procedure that performs regression from $\mathbf{x}$ to $\alpha$ on the flat with a low-error regressor $r(\mathbf{x})$ and decodes the regression result by

$$\mathrm{round}\left(\left[0, \tfrac{1}{2}\right] + r(\mathbf{x}) [1, 1]\right)$$

would not incur any loss of information (the detailed procedure will be discussed later). That is, when the 2-dimensional hypercube is occupied by 3 out of the 4 vertices, there exists a 1-dimensional flat that describes the occupied vertices well.

Other types of vertex relations, which cause different patterns of hypercube sparsity, can also be captured by a flat. For instance, if two labels $\mathbf{y}[1]$ and $\mathbf{y}[2]$ are not fully correlated but just highly correlated with a positive correlation, the line

$$[0, 0] + \alpha [1, 1]$$

can be used to capture the key correlation and hence reach satisfactory performance.

Another type of vertex relation that can be captured is hierarchical. For instance, we have discussed that when classes labeled 1 and 2 are disjoint sub-classes of class 3, only 3 vertices of the hypercube would be occupied: $[0, 0, 0]$, $[1, 0, 1]$, $[0, 1, 1]$. Intuitively, the 2-dimensional flat

$$[0, 0, 0] + \alpha [1, 0, 1] + \beta [0, 1, 1]$$

perfectly describes the three vertices in the 3-dimensional hypercube. In fact, even when sub-classes 1 and 2 are not disjoint, for which 4 vertices on the hypercube would be occupied: $[0, 0, 0]$, $[1, 0, 1]$, $[0, 1, 1]$, $[1, 1, 1]$, it can be shown that encoding the 4 vertices by the 2-dimensional flat

$$[0.5, 0.5, 0.5] + \alpha [0.55, 0.4, 0.8] + \beta [-0.6, 0.8, 0]$$

and decoding by rounding would not incur any loss of information when using low-error regressors.
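The last claim can be checked numerically. The sketch below projects each of the four occupied vertices onto the 2-dimensional flat and rounds the result. Because the two listed direction vectors are only approximately orthonormal, the sketch uses a least-squares projection (solving the 2 by 2 normal equations); that choice, and the helper names, are assumptions made here for illustration rather than the procedure of the original work.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_flat(y, o, u, v):
    """Least-squares projection of y onto the flat o + alpha*u + beta*v."""
    z = [yi - oi for yi, oi in zip(y, o)]
    # Solve the normal equations G [alpha, beta]^T = [z.u, z.v] by Cramer's rule.
    g_uu, g_uv, g_vv = dot(u, u), dot(u, v), dot(v, v)
    b_u, b_v = dot(z, u), dot(z, v)
    det = g_uu * g_vv - g_uv * g_uv
    alpha = (b_u * g_vv - b_v * g_uv) / det
    beta = (g_uu * b_v - g_uv * b_u) / det
    return [oi + alpha * ui + beta * vi for oi, ui, vi in zip(o, u, v)]


def round_to_vertex(point):
    return [1 if value >= 0.5 else 0 for value in point]


o = [0.5, 0.5, 0.5]
u = [0.55, 0.4, 0.8]
v = [-0.6, 0.8, 0.0]
vertices = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
recovered = [round_to_vertex(project_onto_flat(y, o, u, v)) for y in vertices]
all_recovered = recovered == vertices
```

Each vertex is mapped back to itself after projection and rounding, so a perfect regressor on the two flat coordinates would lose no information for these four vertices.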

Next, we study a simple framework that focuses on a linear subspace instead of the whole hypercube in $\mathbb{R}^K$. The framework takes an $M$-flat as the subspace and encodes each vertex $\mathbf{y}$ of the hypercube to a vector $\mathbf{h}$ under the coordinate system of the $M$-flat by projection. Then, the original multi-label classification problem with $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^N$ becomes a multi-dimensional regression problem with $\{(\mathbf{x}_n, \mathbf{h}_n)\}_{n=1}^N$. After obtaining a multi-dimensional regressor $\mathbf{r}(\mathbf{x})$ that predicts $\mathbf{h}$ well, the framework will then map $\mathbf{r}(\mathbf{x})$ back to a vertex of the hypercube in $\mathbb{R}^K$ using some decoder $D$. As discussed earlier, Hamming loss is effectively the scaled $\ell_1$ distance in the hypercube. The new regression problem minimizes the $\ell_2$ distance in the hypercube, which upper-bounds the scaled Hamming loss. The framework will be named *Linear Label Space Transformation*, as shown in Algorithm 7.

Algorithm 7 Linear Label Space Transformation

1. pre-processing: consider an $M$-flat described by a reference point $\mathbf{o}$ and an $M$ by $K$ projection matrix $\mathbf{P}$. Then, encode $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{h}_n)\}$, where $\mathbf{h}_n = \mathbf{P}(\mathbf{y}_n - \mathbf{o})$ corresponds to a vector under the coordinate system of the flat.
2. training: for $m = 1$ to $M$, learn a function $r_m(\mathbf{x})$ from $\{(\mathbf{x}_n, \mathbf{h}_n[m])\}_{n=1}^N$.
3. prediction: for each input vector $\mathbf{x}$, compute $\mathbf{r}(\mathbf{x}) = [r_1(\mathbf{x}), r_2(\mathbf{x}), \cdots, r_M(\mathbf{x})]$. Then, return $D(\mathbf{r}(\mathbf{x}))$, where $D : \mathbb{R}^M \to \{0, 1\}^K$ is a decoding function from the $M$-flat to the vertices of the hypercube.

As discussed, CS seeks to reduce the number of regressors by considering a flat with $M \ll K$. Its projection matrix $\mathbf{P}$ is chosen randomly from an appropriate distribution (such as Gaussian, Bernoulli, or Hadamard) and the reference point $\mathbf{o}$ of the flat is simply $\mathbf{0}$, the origin of $\mathbb{R}^K$ as well as the most label-set-sparse vertex. The decoding algorithm $D$ corresponds to the reconstruction algorithm in the terminology of CS, and requires solving an optimization problem for each different $\mathbf{x}$.

### 3.2 Linear Label Space Transformation with Round-based Decoding

As discussed above, CS may suffer from its slow decoding algorithm, while the round-based decoding in Binary Relevance can be more efficient. Next, we study a special form of Linear Label Space Transformation that is coupled with an efficient round-based decoding scheme. In particular, the decoding scheme first maps a prediction vector $\mathbf{r}(\mathbf{x})$ under the coordinate system of the $M$-flat back to a corresponding point $\tilde{\mathbf{y}}$ in $\mathbb{R}^K$. Then, the scheme rounds $\tilde{\mathbf{y}}$ to the closest vertex $\hat{\mathbf{y}}$ of the hypercube in terms of the $\ell_1$ distance. The resulting approach, as shown in Algorithm 8, is called *Linear Label Space Transformation with Round-based Decoding*, which works directly with the geometry between the $M$-flat and the hypercube, and can be viewed as a simple and efficient form of the general Linear Label Space Transformation.

Algorithm 8 Linear Label Space Transformation with Round-based Decoding

1. pre-processing: consider an $M$-flat described by a reference point $\mathbf{o}$ and an orthonormal basis $\{\mathbf{p}_m\}_{m=1}^M$. Use $\mathbf{p}_m^T$ as the rows of an $M$ by $K$ projection matrix $\mathbf{P}$ in Linear Label Space Transformation.
2. training: simply run Linear Label Space Transformation.
3. prediction: after getting $\mathbf{r}(\mathbf{x})$ from Linear Label Space Transformation, compute

$$\tilde{\mathbf{y}} = \mathbf{o} + \sum_{m=1}^{M} r_m(\mathbf{x}) \cdot \mathbf{p}_m = \mathbf{o} + \mathbf{P}^T \cdot \mathbf{r}(\mathbf{x}).$$

Then, return $\hat{\mathbf{y}} = \mathrm{round}(\tilde{\mathbf{y}})$.

Note that Binary Relevance is a special case of Linear Label Space Transformation. For Binary Relevance, we can set $\mathbf{P} = \mathbf{I}$ with an arbitrary reference point $\mathbf{o}$. The usual Binary Relevance operates with $M = K$, which means that many regressors $r_m$ are needed when $K$ is large. To reduce the number of regressors, we can also use Binary Relevance with $M < K$ by taking only $M$ rows of $\mathbf{I}$ as the projection matrix $\mathbf{P}$.

One simple heuristic that exploits the label-set sparsity is to choose the $M$ rows that correspond to the $M$ most frequent labels (i.e., those with the most 1's) in the training data; other rows would simply be decoded by the corresponding components of $\mathbf{o}$ without using any information from the regressor. The approach with this heuristic will be called Partial Binary Relevance (PBR). PBR equivalently discards some of the labels when learning the regressors and hence the test performance may not be satisfactory. We will use PBR as a baseline approach that respects the label-set sparsity within the simple heuristic, and compare it with PLST and CS experimentally in Section 4.
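The encoding $\mathbf{h} = \mathbf{P}(\mathbf{y} - \mathbf{o})$ and the round-based decoding of Algorithm 8 can be sketched as follows. The example uses the fully-correlated two-label case with a unit-length basis vector and reference point $[0.5, 0.5]$, both chosen here for illustration; the regressor is replaced by the exact code $\mathbf{h}$, which is an idealizing assumption, and the function names are hypothetical.

```python
import math


def encode(y, o, P):
    """h = P (y - o): coordinates of the vertex y on the flat."""
    z = [yk - ok for yk, ok in zip(y, o)]
    return [sum(p_k * z_k for p_k, z_k in zip(row, z)) for row in P]


def decode(r, o, P):
    """y_hat = round(o + P^T r), with the rounding threshold at 1/2."""
    y_tilde = [ok + sum(P[m][k] * r[m] for m in range(len(P)))
               for k, ok in enumerate(o)]
    return [1 if value >= 0.5 else 0 for value in y_tilde]


# K = 2, M = 1: a single orthonormal basis vector along [1, 1].
o = [0.5, 0.5]
P = [[1 / math.sqrt(2), 1 / math.sqrt(2)]]
# Stand-in for a perfect regressor: feed the exact code h to the decoder.
recovered = [decode(encode(y, o, P), o, P) for y in ([0, 0], [1, 1])]
```

Decoding costs one matrix-vector product and $K$ comparisons per test instance, in contrast to the per-instance optimization needed by CS.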

Next, we analyze the performance of Algorithm 8. Note that the round-based decoder equivalently works by^{2}

$$\hat{y}[k] = [\![\, \tilde{y}[k] \ge \tfrac{1}{2} \,]\!]. \tag{2}$$

^{2} $[\![\cdot]\!]$ is 1 if the inner condition is true and 0 otherwise.

If round-based decoding is used, we can easily prove that the Hamming loss between $\hat{y}$ and the desired y is upper-bounded by a scaled squared distance between $\tilde{y}$ and y, as formalized below.

*Lemma 1. For the round-based decoder in (2),*

$$\Delta(\hat{y}, y) \le \frac{4}{K} \|\tilde{y} - y\|^{2}.$$

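*Proof.* For any given k,

$$\begin{aligned}
\hat{y}[k] \oplus y[k]
&= [\![\, \tilde{y}[k] \ge \tfrac{1}{2} \,]\!] \, [\![\, y[k] = 0 \,]\!] + [\![\, \tilde{y}[k] < \tfrac{1}{2} \,]\!] \, [\![\, y[k] = 1 \,]\!] \\
&\le 4 (\tilde{y}[k] - y[k])^{2} \, [\![\, y[k] = 0 \,]\!] + 4 (\tilde{y}[k] - y[k])^{2} \, [\![\, y[k] = 1 \,]\!] \\
&= 4 (\tilde{y}[k] - y[k])^{2}.
\end{aligned}$$

The inequality holds because $|\tilde{y}[k] - y[k]| \ge \tfrac{1}{2}$ in each of the two mismatch cases. The desired result can be proved by averaging over all $k = 1, 2, \cdots, K$.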
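As a sanity check on Lemma 1, the bound can be tested numerically on random inputs. The sketch below assumes that the Hamming loss is the fraction of mismatched labels, which is what averaging over k in the proof implies; the sampling ranges and the value K = 8 are arbitrary choices made here.

```python
import random

def hamming_loss(y_hat, y):
    """Fraction of label positions where y_hat and y disagree."""
    return sum(a != b for a, b in zip(y_hat, y)) / len(y)

def round_decode(y_tilde):
    """Round-based decoder of equation (2)."""
    return [1 if v >= 0.5 else 0 for v in y_tilde]

def lemma1_bound(y_tilde, y):
    """Right-hand side of Lemma 1: (4 / K) * squared Euclidean distance."""
    return 4.0 / len(y) * sum((a - b) ** 2 for a, b in zip(y_tilde, y))

rng = random.Random(0)
K = 8
violations = 0
for _ in range(1000):
    y = [rng.randint(0, 1) for _ in range(K)]
    y_tilde = [rng.uniform(-1.0, 2.0) for _ in range(K)]
    # Small tolerance only guards against floating-point round-off.
    if hamming_loss(round_decode(y_tilde), y) > lemma1_bound(y_tilde, y) + 1e-12:
        violations += 1
print(violations)
```

The bound is tight when a soft prediction sits exactly at 1/2 for a label whose true value is 0, which is the case that fixes the constant 4.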

Thus, if the error $\|\tilde{y} - y\|^{2}$ is small, the corresponding Hamming loss $\Delta(\hat{y}, y)$ would also be small. That is, we could replace $\Delta(\hat{y}, y)$ with a proxy error function $\|\tilde{y} - y\|^{2}$ when using the round-based decoder. Then, we can prove an upper bound on the per-example Hamming loss of Algorithm 8.

*Theorem 1. Consider any example (x, y) given at the prediction step of Algorithm 8. Then,*

$$\Delta(\hat{y}, y) \le \frac{4}{K} \left( \|r(x) - h\|^{2} + \|y - o - P^{T}h\|^{2} \right), \tag{3}$$

*where $h \equiv P(y - o)$.*

*Proof.* Using the fact that $\{p_m\}_{m=1}^{M}$ forms an orthonormal basis, we can uniquely decompose $y = o + P^{T}h + p_{\perp}$, where

$$p_{\perp} = y - o - P^{T}h = (I - P^{T}P)(y - o)$$

is orthogonal to every $p_m$.

Then, from Algorithm 8, consider the point $\tilde{y} = o + P^{T}r(x)$. From Lemma 1,

$$\begin{aligned}
\Delta(\hat{y}, y) &\le \frac{4}{K} \|\tilde{y} - y\|^{2} \\
&= \frac{4}{K} \left\| o + P^{T}r(x) - y \right\|^{2} \\
&= \frac{4}{K} \left( \left\| P^{T}(r(x) - h) \right\|^{2} + \|p_{\perp}\|^{2} \right) && (4) \\
&= \frac{4}{K} \left( \|r(x) - h\|^{2} + \|p_{\perp}\|^{2} \right). && (5)
\end{aligned}$$

Here (4) comes from the fact that $p_{\perp}$ is orthogonal to every $p_m$; (5) is because $\{p_m\}_{m=1}^{M}$ forms an orthonormal basis.
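The decomposition used in the proof can be checked numerically. The following sketch draws an arbitrary orthonormal P from a QR factorization and random values for o, y and r(x); all of these inputs are assumptions made here for illustration and have nothing to do with the data sets in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 6, 3

# Orthonormal rows: Q from a reduced QR factorization of a random K x M matrix.
Q, _ = np.linalg.qr(rng.standard_normal((K, M)))
P = Q.T                                     # M x K, so P @ P.T is the identity
o = rng.random(K)                           # arbitrary reference point
y = rng.integers(0, 2, size=K).astype(float)
r = rng.standard_normal(M)                  # hypothetical regressor output r(x)

h = P @ (y - o)                             # coordinates of y on the M-flat
p_perp = (np.eye(K) - P.T @ P) @ (y - o)    # component orthogonal to the flat
y_tilde = o + P.T @ r

lhs = np.sum((y_tilde - y) ** 2)                    # ||y_tilde - y||^2
rhs = np.sum((r - h) ** 2) + np.sum(p_perp ** 2)    # prediction + encoding error
hamming = np.mean((y_tilde >= 0.5).astype(float) != y)
bound = 4.0 / K * rhs                               # right-hand side of (3)
print(lhs, rhs, hamming, bound)
```

The two printed squared errors agree up to round-off, matching steps (4) and (5), and the Hamming loss stays below the bound in (3).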

We can take a closer look at the two terms on the right-hand side of the bound (3).

The first term describes the squared prediction error between h and r(x), two vectors represented under the coordinate system of the M-flat. The second term describes an encoding error for projecting y to the closest point on the M-flat. The training step of Linear Label Space Transformation aims at reducing the first term by learning from $\{(x_n, h_n)\}_{n=1}^{N}$.

The second term, on the other hand, does not depend on x and denotes a trade-off on the choice of M. In particular, the second term generally decreases when M increases, at the expense of more computational cost for learning the functions $\{r_m\}_{m=1}^{M}$. For instance, in PBR, if we take the origin as o, the second term is upper bounded by $\frac{K-M}{K}$ (while the actual value depends on how sparse y is). When using the full Binary Relevance, there is no encoding error but many regressors are needed; when using PBR with M ≪ K, we can use fewer regressors but the resulting Hamming loss may be large because of the large encoding error.

### 3.3 Principal Label Space Transformation

For a fixed value of M, the analysis of Algorithm 8 indicates that it is important to use an M-flat that makes the encoding error as small as possible. Next, we propose an approach that focuses on finding such an M-flat. In particular, the proposed *Principal Label Space Transformation* (PLST) approach is a special case of Algorithm 8 that seeks a reference point $o \in \mathbb{R}^{K}$ and an M by K matrix P by solving

$$\min_{o,\,P} \; \frac{1}{N} \sum_{n=1}^{N} \left\| y_n - o - P^{T}P(y_n - o) \right\|^{2} \quad \text{such that } PP^{T} = I. \tag{6}$$

The objective function of (6) is the empirical average of the encoding error on the training set S. Because PLST makes optimal use of the budget of M ≪ K basis functions, we can take advantage of the hypercube sparsity to reduce the computational cost in multi-label classification.

Similar to the traditional analysis of Principal Component Analysis (Hastie et al., 2001), it can be proved that one optimal solution of (6) satisfies

$$o = \frac{1}{N} \sum_{n=1}^{N} y_n.$$

Then, the corresponding optimal P can be computed from the Singular Value Decomposition (SVD), as described below.

Consider a matrix Z with each column being $y_n - o$, a shifted version of the occupied vertices. Then, we perform SVD on the K by N matrix Z to obtain three matrices (Datta, 1995)

$$Z = U \Sigma V^{T}. \tag{7}$$

Here U is a K by K unitary matrix, Σ is a K by N diagonal matrix, and V is an N by N unitary matrix. Through SVD, each $y_n - o$ can be represented as a linear combination of the singular vectors $u_m$ in the columns of U. The vectors form a basis of a flat that passes through o as well as all the $y_n$. The diagonal of Σ contains the singular value $\sigma_m$ of each singular vector $u_m$. We shall assume that the singular values are ordered such that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_K$.

Note that (7) can be rewritten as

$$U^{T} Z = \Sigma V^{T},$$

where the orthogonal basis $U^{T}$ can be seen as a projection matrix of Z that maps each $y_n - o$ to a different coordinate system. Since the largest M singular values correspond to the principal directions for reconstructing Z, we could discard the rest of the singular values and their associated basis vectors in $U^{T}$ to obtain a smaller projection matrix $U_M^{T} = [u_1, u_2, \cdots, u_M]^{T}$. The optimal P that solves (6) is indeed $U_M^{T}$ (Hastie et al., 2001), which leads to the total empirical encoding error of $\sum_{m=M+1}^{K} \sigma_m^{2}$.
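The SVD step and the resulting encoding error can be sketched with NumPy as follows. The helper name `plst_basis` and the small label matrix are assumptions made for illustration; `numpy.linalg.svd` returns the singular values in descending order, which matches the ordering assumed in the text.

```python
import numpy as np

def plst_basis(Y, M):
    """Return (o, P, sigma) for an N x K label matrix Y, following (6) and (7)."""
    o = Y.mean(axis=0)                       # optimal reference point
    Z = (Y - o).T                            # K x N, columns are y_n - o
    U, sigma, _ = np.linalg.svd(Z, full_matrices=False)
    P = U[:, :M].T                           # U_M^T: top-M left singular vectors as rows
    return o, P, sigma

# Toy label matrix (hypothetical), N = 6 examples and K = 4 labels.
Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]], dtype=float)
M = 2
o, P, sigma = plst_basis(Y, M)

Z = (Y - o).T
residual = Z - P.T @ (P @ Z)                 # what the M-flat cannot encode
total_encoding_error = np.sum(residual ** 2)
discarded = np.sum(sigma[M:] ** 2)           # sum of the discarded sigma_m^2
print(total_encoding_error, discarded)
```

The two printed numbers coincide up to round-off, which is the total (not averaged) empirical encoding error stated above.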

In summary, PLST solves an SVD problem to minimize an empirical version of the encoding error. Then, PLST calls for a good regression algorithm to reduce the squared error between r(x) and h. According to Theorem 1, when both terms are small, the resulting Hamming loss would also be small.

Unlike PBR, for which P corresponds to the original axes, or CS, for which P is formed randomly, the PLST projection matrix uses the principal directions $u_m$ and thus captures the correlations in multi-label classification. Thus, PLST is able to exploit the hypercube sparsity to make effective use of the M ≪ K basis functions while keeping the encoding error small.

Algorithm 9 Principal Label Space Transformation

1. With a given parameter M, perform SVD on Z and obtain $U_M^{T} = [u_1, u_2, \cdots, u_M]^{T}$.

2. Run Algorithm 8 using $o = \frac{1}{N} \sum_{n=1}^{N} y_n$ and $P = U_M^{T}$.
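Putting Algorithms 8 and 9 together, a minimal end-to-end sketch might look like the code below. It is not the authors' MATLAB implementation: the closed-form ridge solver, the constant bias feature, the function names and the synthetic data are all assumptions made here; only λ = 0.01 is taken from the experimental setup described later.

```python
import numpy as np

def ridge_fit(X, H, lam=0.01):
    """Closed-form ridge regression: W = (X^T X + lam I)^{-1} X^T H."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ H)

def plst_fit(X, Y, M, lam=0.01):
    """Algorithm 9: SVD for (o, P), then one regressor per reduced dimension."""
    o = Y.mean(axis=0)
    U, _, _ = np.linalg.svd((Y - o).T, full_matrices=False)
    P = U[:, :M].T                       # M x K, rows u_1^T, ..., u_M^T
    H = (Y - o) @ P.T                    # row n is h_n = P (y_n - o)
    W = ridge_fit(X, H, lam)             # d x M weight matrix
    return o, P, W

def plst_predict(X, o, P, W):
    """Algorithm 8 prediction: y_tilde = o + P^T r(x), then round at 1/2."""
    R = X @ W                            # row n is r(x_n)
    Y_tilde = o + R @ P
    return (Y_tilde >= 0.5).astype(int)

# Synthetic data (hypothetical): labels 0 and 1 are identical, as are labels 2 and 3,
# so two principal directions are enough to encode all four labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X = np.hstack([X, np.ones((200, 1))])    # constant feature as a bias term (assumption)
a = (X[:, 0] > 0).astype(float)
b = (X[:, 1] > 0).astype(float)
Y = np.column_stack([a, a, b, b])

o, P, W = plst_fit(X, Y, M=2)
Y_hat = plst_predict(X, o, P, W)
train_hamming = np.mean(Y_hat != Y)
print(train_hamming)
```

Only M = 2 regressors are trained for K = 4 labels here. In this constructed case the encoding error is zero, so any remaining Hamming loss comes from the regression term of Theorem 1.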

### 4 Experiments

Next, we conduct experiments on five real-world data sets to compare the three algorithms within Linear Label Space Transformation: PBR, CS and our proposed PLST.

The data sets are downloaded from Mulan (Tsoumakas et al., 2010b) and cover a variety of domains, sizes and characteristics, as shown in Table 2. We include data sets with a particularly large number of labels such as delicious, corel5k and mediamill to test the effectiveness of CS and PLST in reducing the dimension of the label space.

The *cardinality* column of Table 2 is defined as the average number of labels per example. The *distinct* column of Table 2 shows the number of distinct label sets or, using the hypercube view, the number of vertices occupied by examples. Dividing the value of distinct by $2^{K}$ in Table 2, we see that hypercube sparsity indeed exists in every data set.

On the other hand, the *nonzero* column of Table 2 shows the maximum number of non-zero entries in $y_n$. Comparing the value of nonzero to K in Table 2, we see that most data sets come with a strong label-set sparsity except yeast and emotions.
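The statistics described above are straightforward to compute from a label matrix. The sketch below uses a hypothetical helper `label_statistics` and a tiny made-up label matrix; it is meant only to pin down the definitions of the columns.

```python
def label_statistics(Y):
    """Y is a list of 0/1 label vectors, each of length K."""
    N, K = len(Y), len(Y[0])
    return {
        "cardinality": sum(sum(y) for y in Y) / N,      # average labels per example
        "distinct": len({tuple(y) for y in Y}),         # occupied hypercube vertices
        "hypercube sparsity": len({tuple(y) for y in Y}) / 2 ** K,
        "nonzero": max(sum(y) for y in Y),              # largest label-set size
    }

# Made-up example with N = 4 and K = 3.
Y = [[1, 0, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 1, 1]]
stats = label_statistics(Y)
print(stats)

# Python integers are unbounded, so the ratio for delicious (15806 / 2^983) is safe.
delicious_sparsity = 15806 / 2 ** 983
print(delicious_sparsity)
```

The last line reproduces the order of magnitude of the hypercube sparsity reported for delicious, using only the distinct and K values from Table 2.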

In all experiments, we randomly partition each data set into 90% for training and 10% for testing. We record the mean and the standard error of the test Hamming loss over 20 different random partitions.
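The evaluation protocol can be sketched as follows. The function names, the rounding of the test-set size, and the `evaluate` callback (which would train on the first index list and return the test Hamming loss on the second) are assumptions made here; the source only specifies the 90%/10% split, the 20 repetitions, and the reported mean and standard error.

```python
import math
import random

def random_split(n, test_fraction=0.1, rng=None):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    rng = rng or random.Random()
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)    # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]

def mean_and_standard_error(values):
    """Sample mean and standard error (sample std divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

def repeated_evaluation(n, evaluate, repeats=20, seed=0):
    """Run evaluate(train_idx, test_idx) on `repeats` random partitions."""
    rng = random.Random(seed)
    losses = [evaluate(*random_split(n, rng=rng)) for _ in range(repeats)]
    return mean_and_standard_error(losses)

train_idx, test_idx = random_split(100, rng=random.Random(0))
mean, se = mean_and_standard_error([1.0, 2.0, 3.0, 4.0])
print(len(train_idx), len(test_idx), mean, se)
```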

We test CS, PBR and PLST with Ridge Linear Regression (RLR; Hastie et al., 2001) and M5P Decision Tree (M5P; Wang and Witten, 1997) as the underlying regression algorithm. We implement Ridge Linear Regression with λ = 0.01 in MATLAB, and take the M5P Decision Tree from WEKA (Hall et al., 2009) with its default settings. For CS, we follow the recommendation from Hsu et al. (2009) to use the Hadamard matrix as the projection matrix P. Then, we take the best-performing reconstruction algorithm in their work, CoSaMP, as the decoding function D and set the sparsity parameter for the reconstruction algorithm to the *nonzero* column in Table 2. For PBR, we simply take the origin (the most label-set-sparse vertex of the hypercube) as the reference point o. In other words, the discarded labels in PBR would be reconstructed with 0 (see Subsection 3.2). Other variants of CS and PBR will be explored in Subsection 4.4.

Table 2: Data Set Statistics

| data set | domain | N | K | cardinality | distinct | hypercube sparsity | nonzero |
|---|---|---|---|---|---|---|---|
| delicious | text | 16105 | 983 | 19.02 | 15806 | 1.93×10^{−292} | 25 |
| corel5k | text | 5000 | 374 | 3.52 | 3175 | 8.25×10^{−110} | 5 |
| mediamill | video | 43507 | 101 | 4.38 | 6555 | 2.59×10^{−27} | 18 |
| yeast | biology | 2417 | 14 | 4.24 | 198 | 1.21×10^{−2} | 11 |
| emotions | music | 593 | 6 | 1.87 | 27 | 4.22×10^{−1} | 3 |

### 4.1 Comparison on Hamming Loss

Fig. 2 and Fig. 3 show the test Hamming loss of PBR, PLST and CS at different sizes of the reduced sub-tasks. In Fig. 2, the approaches are coupled with a linear regressor, RLR; in Fig. 3, the approaches are coupled with a non-linear regressor, M5P. First of all, we see that regardless of the type of the regressor used, PLST is always able to reach reasonable performance while reducing the label space to a lower-dimensional M-flat. In particular, PLST with M ≪ K regressors is capable of achieving the same or better Hamming loss than the full Binary Relevance (without dimension reduction) on all data sets.

When using RLR as the regressor, the Hamming loss curve of PLST is always below the curve of PBR across all M in all data sets, which demonstrates that PLST is the more effective choice in the family of Linear Label Space Transformation with round-based decoding (Algorithm 8). In addition, for data sets without a strong label-set sparsity (yeast and emotions), as expected, PLST outperforms CS for both small and large M because CS cannot rely on label-set sparsity to make any compressions. On the other hand, for data sets with a strong label-set sparsity (delicious, corel5k and mediamill), the Hamming loss curve of PLST is below the curve of CS for small M, and comparable to the curve of CS for large M. That is, CS is able to decrease Hamming loss significantly once M is large enough by exploiting label-set sparsity. Nevertheless, PLST can always do even better by achieving the same Hamming loss with a much smaller M using hypercube sparsity. Thus, for data sets with or without label-set sparsity, PLST should be preferred over CS when RLR is taken as the regressor. One thing to notice is that on emotions, there is a decrease in Hamming loss for PLST with a small M, which demonstrates an additional potential advantage of focusing on the principal directions.

Figure 2: Test Hamming Loss of Linear Label Space Transformation Algorithms with RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Curves: Full-BR (no reduction), CS, PBR, PLST.

Figure 3: Test Hamming Loss of Linear Label Space Transformation Algorithms with M5P. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Curves: Full-BR (no reduction), CS, PBR, PLST.

As shown in Fig. 3, the relationship between PLST, CS and PBR stays mostly unchanged when the non-linear regressor M5P is employed instead of RLR, especially when M is small. In addition, PLST is able to perform significantly better than a full BR in most of the data sets. The results justify PLST as the leading approach for Linear Label Space Transformation, better than CS in particular, across the two different regressors tested.

Table 3 records the Hamming loss of the full BR, PLST and CS at the optimal reduced sub-task size $M^{*}$ and Table 4 records their respective training time. The results are obtained with RLR since it generally outperforms M5P when coupled with Linear Label Space Transformation algorithms. Here $M^{*}$ is defined as the minimum dimension at which the Hamming loss difference between Binary Relevance and PLST is within their respective standard errors. In other words, this can be seen as the reduced dimension at which no performance loss is incurred. All the timing experiments were performed on an AMD Opteron Quad Core 2378 2.4 GHz processor with 512 KB of cache. The programming environment was MATLAB version 7.5.0.338 (R2007b). For most of the data sets with a large number of labels, PLST is able to drastically reduce the learning and inference time compared to the full Binary Relevance. This is less obvious in small data sets like yeast and emotions since their number of labels is already small before the transformation. In addition, PLST usually outperforms CS at $M^{*}$ in terms of Hamming loss, training time in regression and training time in encoding/decoding.
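One possible reading of the definition of $M^{*}$ is sketched below. The interpretation is an assumption made here: the gap between the PLST and BR mean Hamming losses counts as "within their respective standard errors" when it does not exceed the sum of the two standard errors. The function name and all numbers in the example curve are hypothetical.

```python
def select_m_star(plst_curve, br_mean, br_se):
    """Smallest M whose PLST loss is within the combined standard errors of BR.

    plst_curve maps each candidate M to a (mean, standard_error) pair.
    Returns None if no candidate qualifies.
    """
    for M in sorted(plst_curve):
        mean, se = plst_curve[M]
        # Assumed criterion: PLST is at most (se + br_se) worse than full BR.
        if mean - br_mean <= se + br_se:
            return M
    return None

# Hypothetical Hamming-loss curve for PLST at several reduced dimensions.
plst_curve = {
    1: (0.300, 0.010),
    2: (0.250, 0.010),
    4: (0.205, 0.004),
    8: (0.200, 0.004),
}
m_star = select_m_star(plst_curve, br_mean=0.200, br_se=0.004)
print(m_star)
```

With these made-up numbers the function selects M = 4, the first dimension at which the gap to full BR is no larger than the combined standard errors.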

Table 3: Test Hamming Loss of Full Binary Relevance (BR) versus PLST and CS at the Optimal Reduction Size of PLST

| data set | M^{∗}/K | BR(K) | PLST(M^{∗}) | CS(M^{∗}) |
|---|---|---|---|---|
| deli. | 129/983 = 13% | 0.01813 ± 0.00003 | **0.01819 ± 0.00003** | 0.01954 ± 0.00003 |
| core. | 16/374 = 4% | 0.00940 ± 0.00002 | **0.00944 ± 0.00002** | 0.01021 ± 0.00002 |
| medi. | 11/101 = 11% | 0.03003 ± 0.00006 | **0.03015 ± 0.00006** | 0.04332 ± 0.00007 |
| yeas. | 4/14 = 29% | 0.19916 ± 0.00211 | **0.20320 ± 0.00204** | 0.29390 ± 0.00189 |
| emot. | 2/6 = 33% | 0.20653 ± 0.00412 | **0.20542 ± 0.00467** | 0.31042 ± 0.00393 |

(the more accurate result between PLST and CS is marked in bold)

Table 4: Time of Full Binary Relevance (BR) versus PLST and CS at the Optimal Reduction Size of PLST

| data set | BR(K) regression (sec) | PLST(M^{∗}) regression (sec) | PLST(M^{∗}) encode+decode (sec) | CS(M^{∗}) regression (sec) | CS(M^{∗}) encode+decode (sec) |
|---|---|---|---|---|---|
| deli. | 4417.90 | **577.38** | **154.38** | 579.71 | 886.41 |
| core. | 560.11 | **23.96** | 7.76 | 24.13 | **2.97** |
| medi. | 105.96 | 11.55 | **8.39** | **11.43** | 57.14 |
| yeas. | 0.70 | 0.20 | **0.02** | **0.18** | 1.03 |
| emot. | 0.06 | 0.02 | **0.00** | 0.02 | 0.15 |

(the faster result between the corresponding columns of PLST/CS is marked in bold)

From Fig. 2, Fig. 3 and Table 3, it is clear that PLST is highly effective at reducing the number of sub-tasks solved for multi-label classification. Large data sets like delicious, corel5k and mediamill can be reduced to only 13%, 4%, and 11% of their original computational effort respectively with no sacrifice in performance. Note that we can further reduce the computational effort by tolerating a slight increase in Hamming loss, as can be seen in Fig. 2 and Fig. 3. These results demonstrate that PLST can take advantage of the hypercube sparsity to efficiently solve multi-label classification problems.

### 4.2 Comparison on Other Performance Measures

To further understand the benefits of PLST, we conduct more comparisons on three other popular performance measures: the average per-example ranking loss, the macro-averaged (per-label-averaged) area-under-the-ROC-curve (AUC), and the macro-averaged F1-score (Tsoumakas et al., 2010a). Although PLST is not particularly designed with respect to those measures, we shall demonstrate that PLST remains the most effective choice over CS and PBR in terms of the ranking loss and AUC. We only list the results with RLR here as it generally outperforms M5P, while similar findings have also been observed across most of the data sets when using M5P.

Fig. 4 shows the comparison of ranking loss using RLR. For each example, the ranking loss takes the soft prediction $\tilde{y}$ as an ordering of the labels, and compares the predicted order to the desired order y:

$$\text{ranking loss}(\tilde{y}, y) = \operatorname*{average}_{y[k] < y[\ell]} \left( [\![\, \tilde{y}[k] > \tilde{y}[\ell] \,]\!] + \tfrac{1}{2} [\![\, \tilde{y}[k] = \tilde{y}[\ell] \,]\!] \right).$$

CS cannot perform well on the ranking loss because it only outputs hard predictions that contain 0 or 1, which could introduce more loss on ties $[\![\, \tilde{y}[k] = \tilde{y}[\ell] \,]\!]$. Similarly, PBR cannot perform well on the ranking loss because many entries of its predictions $\tilde{y}$ are the constant 0 introduced by the reference point o, which also leads to loss on ties. On the other hand, PLST is highly effective in capturing the ranking preferences with the principal directions in the label space. On larger data sets like delicious, PLST is able to achieve decent ranking loss using only 10% of the original dimensions. The results justify the usefulness of PLST on the ranking loss.
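The per-example ranking loss can be written directly from its definition. In the sketch below, the function name, the example vectors, and the convention of returning 0 when an example has no (absent, present) label pair are assumptions made here.

```python
def ranking_loss(y_tilde, y):
    """Average over pairs (k, l) with y[k] < y[l] of mis-ordering (1) and ties (1/2)."""
    K = len(y)
    pairs = [(k, l) for k in range(K) for l in range(K) if y[k] < y[l]]
    if not pairs:
        return 0.0                      # no comparable pair (assumed convention)
    total = 0.0
    for k, l in pairs:
        if y_tilde[k] > y_tilde[l]:
            total += 1.0                # absent label ranked above a present one
        elif y_tilde[k] == y_tilde[l]:
            total += 0.5                # tie counts as half a mistake
    return total / len(pairs)

y = [1, 0, 0, 1]
soft = [0.9, 0.2, 0.6, 0.6]             # hypothetical soft prediction
hard = [1, 0, 1, 1]                     # the same prediction after rounding at 1/2
soft_loss = ranking_loss(soft, y)
hard_loss = ranking_loss(hard, y)
print(soft_loss, hard_loss)
```

In this small example the rounded prediction has twice the ranking loss of the soft one because rounding creates additional ties, which mirrors the explanation given for CS and PBR.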

Figure 4: Test Ranking Loss of Linear Label Space Transformation Algorithms Using RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Curves: CS, PBR, PLST.

The promising ranking performance of PLST makes it interesting to compare PLST with the label ranking approach, which is shown in Fig. 5. In particular, we take the state-of-the-art calibrated label ranking (Fürnkranz et al., 2008) approach and couple it with RLR for a fair comparison. Because label ranking takes pairwise comparison of the labels, we can only afford to run the experiments on emotions, yeast and mediamill. In terms of the ranking loss (the left-hand side), PLST can be much worse than calibrated label ranking, which is expected because PLST does not include any pairwise information in its design for dimension reduction. In terms of the Hamming loss (the right-hand side), however, PLST is quite competitive with calibrated label ranking. Note that calibrated label ranking pays for the pairwise comparisons and can hardly be scaled up for data sets with lots of labels, while PLST is much more efficient. The results reveal an important future research direction: designing a dimension reduction approach that is more efficient than calibrated label ranking while maintaining the same level of ranking performance.

Fig. 6 shows the comparison of the macro AUC when using RLR. For each label $k = 1, 2, \cdots, K$, AUC takes the soft predictions $\tilde{y}[k]$ of all the test examples to construct the ROC curve, and computes the area under the curve. The ROC curve reveals the trade-off between the true positive rate (recall) and the false positive rate of the soft predictions, and larger AUC indicates better performance. One simple view of the macro AUC is that it measures the ranking performance *per label*, while the ranking loss discussed above measures the ranking performance *per example*. Similar to the ranking loss, CS cannot perform well on AUC because it is only able to output a hard prediction; PBR also cannot perform well because of its constant predictions in some labels. Thus, PLST remains the most effective choice for achieving decent AUC while performing linear dimension reduction.
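The per-label view of AUC can be made concrete with the standard pairwise formulation: the AUC of a label equals the fraction of (positive, negative) example pairs that the soft predictions order correctly, with ties counted as one half. The sketch below uses that formulation; the function names, the toy data, and the choice to skip labels that have only one class among the test examples are assumptions made here.

```python
def auc(scores, labels):
    """Pairwise AUC for one label; returns None if only one class is present."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_tilde, Y):
    """Average the per-label AUC over all labels where it is defined."""
    K = len(Y[0])
    per_label = []
    for k in range(K):
        value = auc([row[k] for row in Y_tilde], [row[k] for row in Y])
        if value is not None:
            per_label.append(value)
    return sum(per_label) / len(per_label)

# Four hypothetical test examples with K = 2 labels.
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_tilde = [[0.9, 0.4], [0.2, 0.6], [0.7, 0.3], [0.4, 0.1]]
result = macro_auc(Y_tilde, Y)
print(result)
```

The tie term shows why hard 0/1 outputs or constant predictions hurt: every tied pair contributes only one half, pulling the AUC of that label toward 0.5.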

Fig. 7 compares the macro F1 score using RLR. Instead of exploring the full trade-off across thresholds with the macro AUC, the macro F1 score is computed, for each label $k = 1, 2, \cdots, K$, from the hard predictions $\hat{y}[k]$. One interesting finding is that PLST is not always better than PBR or CS when using the macro F1 score. That is, while PLST achieves a decent trade-off between precision and recall using the soft predictions, its hard predictions (using rounding at 0.5) leave some room for improvement. The results echo earlier findings by Fan and Lin (2007) on improving the F1 score by tuning the rounding thresholds.

To understand the cause of the different performance in the F1 score, we show the macro precision and the macro recall on delicious in Fig. 8. We see that CS is more aggressive in finding the present labels, which leads to better recall when M is around 300 and explains its better F1 score. On the other hand, the precision of CS is not satisfactory. PLST is less aggressive, which results in a much better precision but worse recall when compared with CS. PBR is even less aggressive and thus achieves the worst recall of all three algorithms.
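For reference, the macro-averaged precision, recall and F1 score over hard predictions can be sketched as below. The function names, the toy predictions, and the convention of scoring 0 when a denominator is zero are assumptions made here; the source does not state how undefined per-label scores were handled.

```python
def precision_recall_f1(y_hat, y):
    """Precision, recall and F1 for one label from hard 0/1 predictions."""
    tp = sum(1 for a, b in zip(y_hat, y) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y_hat, y) if a == 1 and b == 0)
    fn = sum(1 for a, b in zip(y_hat, y) if a == 0 and b == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0     # zero convention is assumed
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1

def macro_scores(Y_hat, Y):
    """Average the per-label precision, recall and F1 over all K labels."""
    K = len(Y[0])
    per_label = [precision_recall_f1([row[k] for row in Y_hat],
                                     [row[k] for row in Y]) for k in range(K)]
    return tuple(sum(s[i] for s in per_label) / K for i in range(3))

# Four hypothetical test examples with K = 2 labels.
Y = [[1, 0], [1, 1], [0, 1], [0, 0]]
Y_hat = [[1, 0], [0, 1], [0, 1], [1, 0]]
macro_precision, macro_recall, macro_f1 = macro_scores(Y_hat, Y)
print(macro_precision, macro_recall, macro_f1)
```

A more aggressive predictor (more 1s) tends to raise recall at the cost of precision, which is the pattern described for CS relative to PLST and PBR.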

Figure 5: Comparison between Linear Label Space Transformation Algorithms and Calibrated Label Ranking Using RLR. Panels: (a) mediamill (ranking), (b) mediamill (Hamming), (c) yeast (ranking), (d) yeast (Hamming), (e) emotions (ranking), (f) emotions (Hamming). Curves: Calibrated Label Ranking (no reduction), PLST.

Figure 6: Test AUC of Linear Label Space Transformation Algorithms Using RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Curves: CS, PBR, PLST.

Figure 7: Test F1 Score of Linear Label Space Transformation Algorithms Using RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Curves: CS, PBR, PLST.