## A Deep Model with Local Surrogate Loss for General Cost-sensitive Multi-label Learning

### Cheng-Yu Hsieh Yi-An Lin Hsuan-Tien Lin

Department of Computer Science and Information Engineering National Taiwan University

{r05922048, r02922163}@ntu.edu.tw htlin@csie.ntu.edu.tw

Abstract

Multi-label learning is an important machine learning prob- lem with a wide range of applications. The variety of crite- ria for satisfying different application needs calls for cost- sensitive algorithms, which can adapt to different criteria eas- ily. Nevertheless, because of the sophisticated nature of the criteria for multi-label learning, cost-sensitive algorithms for general criteria are hard to design, and current cost-sensitive algorithms can at most deal with some special types of crite- ria. In this work, we propose a novel cost-sensitive multi-label learning model for any general criteria. Our key idea within the model is to iteratively estimate a surrogate loss that ap- proximates the sophisticated criterion of interest near some local neighborhood, and use the estimate to decide a descent direction for optimization. The key idea is then coupled with deep learning to form our proposed model. Experimental re- sults validate that our proposed model is superior to existing cost-sensitive algorithms and existing deep learning models across different criteria.

### 1 Introduction

Multi-label learning (MLL) addresses the problem of asso- ciating each data point with a set of relevant labels. It has recently attracted much research attention since the problem setting meets the needs of various real-world applications.

For instance, in image classification, an image may contain multiple objects simultaneously (Boutell et al. 2004). Other MLL applications include text categorization (Schapire and Singer 2000), music tag annotation (Lo et al. 2011), and video classification (Qi et al. 2007). Different MLL appli- cations often aim for different goals, and thus a variety of criteria have been proposed to measure the performance of MLL algorithms from different angles. Some popular crite- ria include Hamming loss, Rank loss, Example-F1, Micro- F1, Macro-F1, and Precision-at-k (Tsoumakas, Katakis, and Vlahavas 2010; Madjarov et al. 2012).

Classical MLL algorithms such as binary relevance (Tsoumakas, Katakis, and Vlahavas 2010), classifier chain (Read et al. 2011), and label powerset (Tsoumakas, Katakis, and Vlahavas 2010) are designed to optimize some specific criterion. Nevertheless, because of the different behaviors of different criteria, an algorithm that optimizes one criterion Copyright c 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

well may not be a good choice for other criteria, and it is difficult to modify those classical algorithms towards other criteria. The demands from real-world applications call for algorithms that can adapt to optimize different evaluation criteria. Such algorithms allows applications to not only con- duct goal-specific optimization but also change their goals more easily if needed. As the evaluation criterion defines the costfor misclassifications made by the learning algorithms, MLL algorithms that adapt to optimize different criteria is generally referred to as cost-sensitive multi-label learning (CSMLL) algorithms (Li and Lin 2014).

Many CSMLL algorithms have been proposed in re- cent years (Dembczynski, Cheng, and H¨ullermeier 2010;

Lo et al. 2011; Li and Lin 2014; Wu and Lin 2017; Huang and Lin 2017). For example, probabilistic classifier chain (Dembczynski, Cheng, and H¨ullermeier 2010) makes cost- sensitive prediction with inference steps towards Bayes- optimal decisions, often with the help of an efficient infer- ence rule that corresponds to the criterion of interest; con- densed filter tree (Li and Lin 2014) adapts to optimize differ- ent criteria by transforming the criterion into sample weights when training the underlying classifiers; progressive random k-labelsets (Wu and Lin 2017) reduces the original CSMLL problem into multiple cost-sensitive multi-class classifica- tion subproblems. Nonetheless, current CSMLL algorithms are generally restricted to a certain class of criteria that can be decomposed to per-instance measures, and cannot deal with other criteria such as Precision-at-k.

In this work, we loosen the restriction and study a more general CSMLL setting that requires the algorithms to adapt to optimize virtually any common MLL criterion. How- ever, given the complicated nature of MLL criteria, design- ing such general cost-sensitive algorithm is challenging. In particular, most of the criteria are highly non-convex and even discontinuous, and it is thus generally impossible to optimize the criterion directly through numerical optimiza- tion. A common practice to tackle the difficulty is to de- velop appropriate surrogate loss function of the criterion of interest to make the optimization procedure tractable. (Pet- terson and Caetano 2010; 2011; Zhang and Zhou 2006;

Gong et al. 2013; Nam et al. 2014; Gao and Zhou 2011;

Dembczynski, Kotlowski, and H¨ullermeier 2012). The sur- rogate loss function serves as a smooth proxy of the crite- rion and carries better optimization properties during train-

ing. Nevertheless, current surrogate loss functions rely on human designs with respect to one or a few criteria, and can- not be systematically applied to solve the general CSMLL problem for any criteria of interest.

In this work, we approach the general CSMLL problem by letting the machine learn a surrogate loss function for the criterion of interest, therefore escaping from the restric- tions of human designs. Nevertheless, given the complicated nature of the MLL criteria, learning a global surrogate loss function turns out to be computationally demanding and conceptually difficult. We thus propose to learn the surrogate loss function locally. That is, we plug the surrogate-learning step into the iterative numerical optimization procedure of training CSMLL classifiers. In each surrogate-learning step, the proposed locally-learned surrogate loss (LLSL) is only required to approximate a given criterion of interest near some local neighborhood. The approximation captures the local behavior of the criterion’s cost surface, and carries suf- ficient information to guide the numerical optimization pro- cedure towards a descent direction for optimizing the crite- rion. We further combine the idea of LLSL with gradient- based optimization in deep learning to propose a novel deep model for multi-label learning that can automatically adapt to optimize general criteria.

The main contributions of this paper are highlighted as follows:

• We present a novel methodology that systematically and automatically learns to optimize any given criterion for multi-label learning. The methodology solves a broader range of CSMLL problems than existing CSMLL algo- rithms.

• The methodology is used to design the world’s first cost- sensitive deep learning model for multi-label learning.

• The proposed deep learning model enjoys superior per- formances against existing methods across various real- world datasets and evaluation criteria.

### 2 Background

2.1 Problem SetupIn a multi-label learning (MLL) problem, we denote an in-
stance by a feature vector x ∈ R^{d}, and its relevant labels by
a bit vector y ∈ {0, 1}^{K}, where K is the number of labels
and y[k] = 1 if and only if the k-th label is relevant. Given a
training set D = {(xn, yn)}^{N}_{n=1}, the goal of the multi-label
learning problem is to learn a hypothesis h : R^{d} → R^{K} to
make predictions on unseen instances accurately. The flexi-
ble definition of h above covers two typical cases: a classi-
fierh which is only allowed to output bit vectors that directly
decides the relevance of each label, or a ranker h whose real-
valued outputs can be used to rank the labels by the predicted
relevance level.

More specifically, given a test dataset D^{0} = {x^{0}_{m}}^{M}_{m=1},
the goal is to make the prediction vectors {ˆy_{m}^{0} =
h(x^{0}_{m})}^{M}_{m=1} close to the hidden ground-truth vectors
{y^{0}_{m}}^{M}_{m=1}. Denote a matrix Y^{0} that contains {y^{0}_{m}}^{M}_{m=1}as
its rows, and another matrix ˆY^{0} that contains {ˆy_{m}^{0} }^{M}_{m=1}
as its rows, the goal can be expressed as minimizing a

criterion Ψ(Y^{0}, ˆY^{0}) that measures the difference between
the two matrices of vectors. A special family of criteria,
called example-based criteria, measures the average differ-
ence vector by vector (row by row). That is, Ψ(Y^{0}, ˆY^{0}) =

1 M

PM

m=1ψ(y^{0}_{m}, ˆy^{0}_{m}). For instance, when h is a classifier,
one simplest choice is ψH(y, ˆy) = _{K}^{1} PK

k=1Jy[k] 6= ˆy[k]K,
called the Hamming loss. Other popular choices include the
Example-F1 loss, where ψF(y, ˆy) = 1−_{kyk}^{2y•ˆ}^{y}

1+kˆyk_{1}; and the
Rank loss with ψR(y, ˆy) = _{R(y)}^{1} P

(k,l):y[k]<y[l]J ˆy[k] >

ˆ y[l]K +

1

2J ˆy[k] = ˆy[l]K, where R(y) = |{(k, l)|y[k] < y[l]}|

is a normalizer. When h is a ranker, a common example-
based criterion is called (negative) Precision-at-k, where
ψP(y, ˆy) = 1 −^{1}_{k}P

`∈top_{k}(ˆy)y[`] and top_{k}(ˆy) returns the
indices that correspond to the k largest values in ˆy. As min-
imizing these instance-averaging criteria can be reduced to
minimizing ψ on each instance, ψ is used in this paper to
replace Ψ for example-based criteria.

In addition, the definition of Ψ covers the more general case of measuring the difference for the entire test set (pre- diction matrix). For example, when h is a classifier, the pop- ular Macro-F1 loss can be expressed as

Ψma(Y, ˆY) = 1 − 1 K

K

X

k=1

2PM

m=1Y_{mk}Yˆ_{mk}
PM

m=1Ymk+PM m=1Yˆmk

, which is physically the mean F1 loss per label. The Micro- F1 loss can be similarly expressed as

Ψ_{mi}(Y, ˆY) = 1− 2PK
k=1

PM

m=1YmkYˆmk

PK k=1

PM

m=1Y_{mk}+PK
k=1

PM

m=1Yˆ_{mk},
which is the F1 loss on all matrix components. Other crite-
rion such as (negative) Macro-averaged Precision-at-k can
also be defined when h is a ranker. Note that we consider
all criteria to be the lower the better for the simplicity of
comparison.

Notice that for a given criterion Ψ and a ground-truth la- bel matrix Y, a cost function that maps any predicted label matrix ˆY to a scalar cost can be defined as CΨ|Y( ˆY) = Ψ(Y, ˆY). In the rest of the paper, we refer to CΨ|Y as CΨ

when Y is clear in the context.

2.2 Related Work

Given such great variety of evaluation metrics for MLL, tra- ditional MLL algorithms are however designed to optimize only a certain or few specific metrics. For example, algo- rithms such as binary relevance (Tsoumakas, Katakis, and Vlahavas 2010) and classifier chain (Read et al. 2011) that decompose MLL into K binary classification problems can arguably only focus on optimizing Hamming loss. On the other hand, label powerset approach (Tsoumakas, Katakis, and Vlahavas 2010) can merely focus on optimizing 0/1 loss since it transforms the original MLL problem into multi- class classification problem.

Nevertheless, it should be noted that even for a single MLL criterion of interest, optimizing this criterion is in fact difficult owing to the highly non-smooth nature of MLL

criteria. As a result, there are currently two main fami- lies of methods that attempt to overcome the challenge.

The first common paradigm is to approach the problem by designing surrogate losses that can be optimized by effi- cient algorithms. For example, (Petterson and Caetano 2010;

2011) derived surrogates for the F-measure which can be op- timized efficiently by SVM-style models; (Zhang and Zhou 2006; Gong et al. 2013; Nam et al. 2014) introduced differ- ent loss functions for neural networks targeting at different criteria; and (Gao and Zhou 2011; Dembczynski, Kotlowski, and H¨ullermeier 2012) proposed consistent surrogates for the Rank loss. While an appropriate surrogate loss can in- deed improve model performance on its corresponding cri- terion, deriving a surrogate for every criterion is nonetheless unsatisfactory for practical use.

Another major family of algorithms, generally termed cost-sensitive multi-label learning algorithms, tackles the problem by considering the cost (criterion) information in the model’s training or prediction phase (Dembczynski, Cheng, and H¨ullermeier 2010; Lo et al. 2011; Li and Lin 2014; Wu and Lin 2017; Huang and Lin 2017). Although these methods can adapt to different criteria more easily, current algorithms can still only deal with example-based criteria due to their restricted problem setting.

### 3 Proposed Method

Inspired by the rich literatures on cost-sensitive classifi- cation (Elkan 2001; Zadrozny, Langford, and Abe 2003;

Li and Lin 2014; Wu and Lin 2017), we first propose a sample-weighting CSMLL framework which is able to deal with example-based criteria. We then highlight a prelimi- nary cost-sensitive multi-label deep learning model which can be derived based on the framework. Last, to overcome the drawbacks of such simple model, we present a novel technique which can be used to optimize any given MLL criterion. The idea is coupled with deep learning to form a deep model for general cost-sensitive MLL.

3.1 Sample-weighting CSMLL Framework In the literatures of both cost-sensitive multi-class classifi- cation as well as CSMLL, re-weighting the training samples has been a simple yet effective approach (Zadrozny, Lang- ford, and Abe 2003; Beygelzimer, Langford, and Raviku- mar 2009; Li and Lin 2014). Motivated by these work, we propose a sample-weighting framework which can be easily used to develop CSMLL algorithms.

Assume that there are K classifiers fk : R^{d} → {0, 1}

each responsible for predicting a corresponding label ˆy_{n}[k]

of a given instance xn. The main concept of the framework is to iteratively train these K classifiers on weighted exam- ples, where the sample weights act as the connection to the evaluation criterion ψ. In particular, when training the k-th classifier fk, each example xnis weighted by a correspond- ing sample weight wn,k. The sample weight is decided by how much cost it would incur for misclassifying the k-th la- bel of xn. To estimate this misclassification cost for ˆyn[k], one can assume that the other K − 1 classifiers are fixed, and obtain their current predictions via {ˆyn[i] = fi(xn)}i6=k. By

Algorithm 1 Sample-weighting framework for CSMLL 1: Let fkbe a single-label classifier that predicts ˆy[k]

2: for m = 1 to M iterations do 3: for k = 1 to K do

4: for each instance (xn, y_{n}) do

5: Assume the other classifiers {fi}i6=kfixed
6: Calculate c^{0}_{n,k}by Equation 1

7: Calculate c^{1}_{n,k}by Equation 2
8: wn,k← |c^{0}_{n,k}− c^{1}_{n,k}|

9: Assign sample weight wn,kto (xn, yn)

10: end for

11: Train fkwith the weighted examples 12: end for

13: end for

having these predictions in hand, the misclassification cost
can then be calculated as |c^{0}_{n,k}− c^{1}_{n,k}|, where c^{0}_{n,k}is the cost
for predicting ˆyn[k] as zero:

c^{0}_{n,k}= ψ(y_{n}, (ˆy_{n}[1, ..., k − 1], 0, ˆy_{n}[k + 1, ..., K])), (1)
and c^{1}_{n,k}is the cost for predicting ˆyn[k] as one:

c^{1}_{n,k}= ψ(yn, (ˆyn[1, ..., k − 1], 1, ˆyn[k + 1, ..., K])). (2)
By assigning wn,k = |c^{0}_{n,k}− c^{1}_{n,k}|, the sample weights can
guide fkto focus on the examples that have greater influence
on the final cost and optimize the criterion in interest. Based
on the proposed framework, various CSMLL algorithms can
be designed. In fact, it can also be showed that a previ-
ous CSMLL work, condensed filter tree (Li and Lin 2014),
is merely a special case that utilizes this sample-weighting
technique. We present this general framework for CSMLL
in Algorithm 1.

In short, the key idea behind the framework is that
if we are able to refer to all the other label predictions
{fi(xn)}i6=k while training a single-label classifier fk, the
cost for wrongly predicting ˆy_{n}[k] can then be calculated
and embedded within the sample weights to achieve cost-
sensitiveness.

3.2 A Simple Cost-sensitive Multi-label Deep Learning Model

Having the sample-weighting framework for CSMLL dis- cussed, we now turn our attention to how a simple cost- sensitive multi-label deep learning model can be devel- oped from it. Following previous work on multi-label neu- ral networks (Zhang and Zhou 2006; Gong et al. 2013;

Nam et al. 2014), we also consider the architectures with K
output nodes, where each output node o_{k}can be viewed as
a single-label classifier that predicts the k-th label. To lever-
age the sample-weighting technique, one should be able to
access the current predictions of the other K −1 labels while
training the k-th label classifier. In fact, this turns out to be
rather intuitive for deep learning models.

Let h_{θ}_{t} : R^{d} → [0, 1]^{K}denotes a multi-label neural net-
work, where θtis the network weights at timestep t. At any

t, the complete label prediction ˆyn for an instance xn can
be simply obtained by feeding the example as the network
input, i.e., ˆy_{n} = h_{θ}_{t}(x_{n}). With ˆy_{n}available, for each out-
put node ok(or the k-th single-label classifier), one can then
follow the framework and associate each example xn with
the calculated sample weight wn,k. As each ok is consid-
ered a binary classifier for ˆy[k], an intuitive choice of the
loss function for these output nodes is the commonly known
logloss:

Llog(y[k], ˆy[k]) = −(y[k] log(ˆy[k])+(1−y[k]) log(1−ˆy[k])) By coupling the logloss with the sample weights and con- sidering training all K label classifiers jointly, the final loss function to be optimized by the neural network at time t then becomes:

LWBCE= 1 N

N

X

n=1 K

X

k=1

wn,kLlog(yn[k], ˆyn[k]) (3) We note that if the sample weights are not used, i.e., wn,k= 1 for all n and k. Eq.3 degenerates to the conventional bi- nary cross entropy(BCE), as proposed in (Nam et al. 2014).

Thus, we term the loss in Eq.3 as weighted binary cross en-
tropy(WBCE). It is also worthwhile to note that the sample
weights wn,kchange as the network updates. Hence, the loss
function L_{WBCE} is in fact changing according to the sam-
ple weights at different timestep t. We present this simple
method to train a cost-sensitive multi-label deep learning
model in Algorithm 2.

Weighted BCE versus BCE By decomposing Eq.3, the weighted BCE loss for an instance (xn, yn) is:

K

X

k=1

wn,kLlog(yn[k], ˆyn[k]) (4)
From the perspective of a single instance, the original sam-
ple weights {wn,k}^{K}_{k=1}can also be viewed as the weights for
each label. These weights encode the information about the
relative importance of each label. That is, if wn,k > w_{n,l}, the
network should probably focus more on making the predic-
tion on ˆy_{n}[k] correct even at the cost of wrongly predicting
yˆn[l].

To compare the proposed WBCE with the ordinary BCE, we visualize the contours and gradients computed w.r.t. both losses in an illustrative two-dimensional scenario in Figure 1. It can be seen that the first label (dimension) has more influence on the cost than the second label (dimension) as the cost differences on the first axis is much greater than those on the second axis. Nonetheless, as the ordinary BCE does not take any cost information into account, the gradient it provides is unaware of the relative importance of labels.

In contrast, gradient computed w.r.t. the weighted BCE loss is inclined much toward the first dimension, suggesting a direction to a relatively low-cost region.

While the weighted loss is able to take the cost informa-
tion into account and provides a trajectory along the low-
cost regions, the gradient direction it suggests is however
relatively naive. The weights {wn,k}^{K}_{k=1} for an instance

Algorithm 2 Weighted binary cross entropy for deep learn- ing models

Input: Training set D = {(xn, yn)}^{N}_{n=1} and an example-
based MLL criterion ψ

1: Randomly initialize the neural network hθ_{0}

2: repeat

3: Split D into M random mini-batches {Dm}^{M}_{m=1}
4: for m = 1 to M do

5: for each instance (xn, yn) ∈ DM do 6: yˆn← hθ(xn)

7: for k = 1 to K do

8: Calculate the sample weight wn,k

9: end for

10: end for

11: Update the network weights with gradients computed w.r.t LWBCE

12: end for 13: until converge

(x_{n}, y_{n}) are in fact calculated as merely the cost differences
between the current prediction ˆy and its one-bit neighbors,
i.e., label vectors y ∈ {y|ky − ˆyk_{1} = 1}. In other words,
the weights are only the first-order approximation to the cost
surface, and thereby the gradient suggested by the weighted
loss leverages only limited cost information. Most impor-
tantly, this preliminary model can still only handle example-
based criteria, as with current cost-sensitive methods.

3.3 Locally-learned Surrogate Loss for General Cost-sensitive Multi-label Deep Learning From Figure 1, it can be seen that the key to designing a cost- sensitive model is that the loss surface for which the model is optimized should sufficiently reflect the curvature of the criterion in interest. This in fact matches the main concept behind the previous literatures that work on developing sur- rogates for MLL criteria, as their main goal is also to come up with smooth approximates that preserve the characteris- tics of their corresponding criteria.

However, when designing a model for general criteria, it is inefficient, or even impossible, to manually derive surro- gate losses for every criterion. Therefore, we call for a sur- rogate that can by itself learns to adapt to different criteria.

In essence, we ask the question: can a surrogate loss be au- tomatically learned to approximate a target criterion rather than explicitly designed by human? Nevertheless, learning to approximate the complicated MLL criteria is undoubtedly difficult, and it is certainly not preferable ending up with an- other complex surrogate which actually does not decrease the problem complexity. Therefore, what we aim for is an optimization-friendlysurrogate that can however provide de- cent approximation to the target criterion.

Among various optimization strategies, gradient descent based algorithms are simple but powerful, and are nowadays the most prevalent practices for training modern models.

One key characteristic of gradient descent based algorithms is they leverage only local information of the error surface to decide to descent direction for optimization. Therefore, if

Figure 1: The contour and gradient direction of ordinary BCE (left) and WBCE (right). Note that each vertex on the square corresponds to a predicted label vector, and C(ˆy) is the cost function defined by the evaluation criterion.

we consider using gradient descent based algorithms for the optimization of the surrogate loss, instead of a global ap- proximation of the MLL criterion, it is arguably sufficient for the approximation to be locally faithful. In addition, a di- rect advantage gained from considering local approximation is that simpler (smoother) approximator is perhaps enough to learn an accurate local estimation.

To this end, we answer the previously posed question by a novel surrogate called locally-learned surrogate loss (LLSL), which is a (a) smooth surrogate (b) learned auto- matically (c) to provide locally faithful approximation to the criterion of interest (d) that guides the descent direction for optimization. In particular, for a given criterion of interest, we consider an iterative procedure for optimization. In each iteration, LLSL is first updated to approximate the local be- havior of the criterion and is then used to determine the de- scent direction for model optimization in that specific itera- tion. Such routine is carried out repeatedly until the underly- ing model converges. Since LLSL is automatically learned, the key idea is in fact applicable to any MLL criterion. To the best of our knowledge, this is the first surrogate proposed that can adapt to general MLL criteria by itself. We note that as LLSL can essentially be viewed as a loss function, it can be coupled with any descent based optimization model such as deep learning to form a CSMLL model, as we shall show shortly.

Formally, let hθ_{t} : R^{d} → R^{K} be a MLL model param-
eterized by the weights θtat time t, Y be the ground truth
label matrix where its n-th row is yn, and ˆY be the pre-
dicted label matrix where its n-th row is ˆyn = hθ_{t}(xn). To
optimize a MLL criterion Ψ, we wish to provide a smooth
surrogate to the cost surface CΨ of the MLL criterion near
the local neighborhood of ˆY. Specifically, for an instance
(xn, yn), we like to estimate how the perturbation in ˆynaf-
fects the behavior of CΨ, and use this estimate to decide the
descent direction for the instance. Let G be a class of models
considered for the local approximation, such as linear mod-
els, and {zl}^{L}_{l=1}be the local neighbors around ˆyn. We first
form a dataset Z^{(t)} = {zl, CΨ(Zl)}^{L}_{l=1}, where Zlis the la-
bel matrix obtained by replacing the n-th row of ˆY with zl.
As the dataset consists of a set of local neighbors around ˆY

and their corresponding cost, LLSL can then be learned as:

L^{(t)}_{LLSL}(.) = arg min

g∈G

L(g, Z^{(t)}) (5)
where L is a measurement for the closeness of the learned
surrogate g and the cost surface CΨ. We note that the learned
surrogate is a regressor g : R^{K} → R who takes as input a
label vector and predicts its corresponding cost. With the
above formulation, the surrogate loss can be learned with
different sets of local neighbors Z^{(t)}, closeness measure-
ment L and approximation models G. For example, when
L is the square loss, and G is assumed to be linear. The sur-
rogate loss is actually learned as a linear regression:

L^{(t)}_{LLSL}(o) = (arg min

w

X

(z,c)∈Z^{(t)}

(c − w^{T}z)^{2})^{T}o (6)

where o ∈ R^{K} is the predicted label vector space. Other
types of regressors (approximators) such as polynomial re-
gression can also be considered by proper choices on G and
L. After the surrogate loss is learned, its gradient can now be
easily computed to lead the update of any gradient descent
based optimization model. For instance, if the underlying
MLL model is a neural network with K output nodes, con-
tinuing from Eq.6, the partial derivative of the surrogate loss
computed w.r.t. each output node okis obtained by:

∂L^{(t)}_{LLSL}(o)

∂ok

= w[k] (7)

As the gradient for the output layer is computed, the whole network weights can then be updated by backpropagation to optimize the surrogate loss, as well as the target criterion.

We present this novel deep learning model for general MLL criteria in Algorithm 3.

For the selection of the local neighbors {zl}^{L}_{l=1} from
which the surrogate loss is learned, a natural choice for a
classifier h is {z|kz − ˆynk1 ≤ n}, i.e., the label vectors
whose Hamming distance to the current prediction is less
than n, or, the n-bit neighbors. As for a ranker h, a natural
choice would be {z|ˆy_{n}+ p} where p is a random perturba-
tion. We also note that more advanced methods for defining

Algorithm 3 Locally-learned surrogate loss for deep learn- ing models

Input: Training set D = {(xn, yn)}^{N}_{n=1}, criterion in inter-
est Ψ, a class of approximators G and L

1: Randomly initialize the neural network hθ_{0}

2: repeat

3: Split D into M random mini-batches {Dm}^{M}_{m=1}
4: for m = 1 to M do

5: for each instance (xn, yn) ∈ DM do
6: yˆn← hθ_{t}(xn)

7: Collect a set local neighbors and their
corresponding cost Z^{(t)}= {z_{l}, C_{Ψ}(Z_{l})}^{L}_{l=1}
8: Learn the local surrogate loss LLLSLon Z

9: end for

10: Update the network with gradients computed w.r.t LLLSL

11: end for 12: until converge

the neighborhood can be considered. For example, one can select the local neighbors using the cost function as the dis- tance measurement.

While the general definition of Eq.5 and the non-smooth nature of the criteria make it hard to derive rigorous theo- retical results for the proposed method, analyses on simple cases are still available. For instance, when the local learner is linear, it is rather straightforward to show that the time complexity of fitting a LLSL is polynomial in K, and the resulting learned surrogate indeed points to a descending di- rection under mild conditions. We conjecture that the de- scending direction, when coupled with careful line-search step, can then prove optimization convergence to a local minimum. We leave the work as our future direction, and first demonstrate the superiority of our method by the em- pirical results as we shall show in the next section.

Connection to Weighted BCE It is worthwhile to note
that the proposed LLSL can be viewed as a generalization of
the weighted BCE. For WBCE, the weights for an instance
(xn, yn) are calculated simply as wn,k = |ψ(yn, zn,k) −
ψ(yn, ˆyn)|, where zn,k is the label vector obtained from
flipping the k-th bit of ˆyn. This in fact corresponds to the
LLSL framework with Z = {z|kz − ˆy_{n}k1 = 1}, L and G
being linear least squares, which utilizes the cost informa-
tion along each axis separately. Nonetheless, given the com-
plicated behavior of MLL criteria, a more sophisticated ap-
proximation that leverages the cost information around the
current prediction jointly is perhaps necessary to capture the
curvature of the cost surface.

### 4 Experiments

4.1 Experiment setupTo evaluate the effectiveness of the proposed models, we conduct experiments on a total of eleven datasets across different evaluation criteria. First, on seven benchmark

datasets^{1} (Tsoumakas et al. 2011), we compare our meth-
ods with: the state-of-the-art cost-sensitive MLL algorithm,
condensed filter tree (CFT) (Li and Lin 2014), and exist-
ing deep learning models including BP-MLL (Zhang and
Zhou 2006), WARP (Gong et al. 2013), and BCE (Nam et
al. 2014). In the experiments, we consider two main classes
of evaluation criteria: (a) example-based criteria: Hamming
loss, Rank loss, and Example-F1 loss; (b) set-based crite-
ria: Micro-F1 and Macro-F1. Since both CFT and the deep
model coupled with WBCE can only optimize example-
based criteria, their results on Micro-F1 and Macro-F1 are
not available. In addition, as WBCE essentially degenerates
to BCE on Hamming loss^{2}, the results for it are also omitted.

For fair comparison, all deep learning models are de- ployed with a fixed architecture. The architecture is com- posed of two fully-connected layers, where the number of hidden units for each layer is set to min(d, 1024) with d be- ing the input dimension. Each fully-connected layer is fol- lowed by a dropout layer with dropout ratio of 0.5. For the hidden units, Leaky ReLU is considered as the activation function.

For the proposed LLSL, we utilize several different set-
tings to approximate the criterion of interest. Specifically,
we consider three types of the underlying learners: (a) an
ordinary least square regressor that learns from the one-bit
neighbors; (b) an ordinary least square regressor that learns
from the two-bit neighbors; (c) a second-degree polyno-
mial regressor that learns from the two-bit neighbors. The
three different settings will be referred to as LLSLlinear-1,
LLSL_{linear-2}, and LLSL_{poly-2}respectively.

In each run of the experiment, we randomly split 50%, 25%, and 25% of the dataset for training, validation, and testing. Finally, the results are averaged over 10 different random runs. Due to space limit, the relative ranking for all algorithms are shown in Figure 2, and the detailed numerical results are provided in Appendix.

To demonstrate the scalability of our model, we further
compare our method to the state-of-the-art algorithm, sparse
local embeddings for extreme classification (SLEEC) (Bha-
tia et al. 2015), which is designed specially for handling
datasets with many labels. This set of experiments are con-
ducted on four benchmark datasets^{3} with many labels. We
follow (Bhatia et al. 2015) to use Precision-at-k as the eval-
uation criterion.

4.2 Comparisons with Cost-sensitive Algorithm From Figure 2, we see cases where even cost-insensitive deep learning models can outperform the traditional non- deep cost-sensitive algorithm CFT. This somewhat stresses the importance of studying deep learning models for MLL.

In addition, when comparing to CFT, our model almost con- stantly reaches better performances against CFT. Most im- portantly, while CFT can only deal with example-based cri- teria, our proposed model can adapt easily toward optimiz- ing any general criteria.

1birds, emotions, enron, medical, scene, tmc2007, and yeast.

2All sample weights wn,kare equal to _{K}^{1} under Hamming loss.

3Bibtex, Delicious, EURLex-4K, and Wiki10-31K.

Figure 2: The average rank of different models on different criteria. The lower (left) the rank, the better the performance.

4.3 Comparisons between Deep Learning Models Cost-insensitive versus Cost-sensitive To validate our proposed methods, we begin with the comparison between WBCE and BCE. Given that BCE is designed as the soft counterpart for Hamming loss, it shows its competence on Hamming loss in Figure 2. Nonetheless, since BCE can- not adapt to different criteria, WBCE outperforms BCE on Example-F1 and Rank loss, as shown in Figure 2. The re- sults demonstrate that the proposed WBCE is indeed a (sim- ple) way to make deep learning models cost-sensitive.

Although WBCE can reach better performances than BCE on Example-F1 and Rank loss, when comparing it to the proposed LLSL, LLSL wins by a large margin in Figure 2.

The results justify the need for studying a more sophisti- cated method to make deep learning models cost-sensitive.

Furthermore, we shall note that LLSL is able to optimize any given criterion while WBCE is restricted to cope with example-based criteria.

In Figure 2, when comparing the proposed LLSL to other deep learning models, our model steadily shows superior performances across different criteria, while the other mod- els can only sometimes reach the best result on the criteria for which they are designed to optimize. The results again demonstrate the ability of LLSL to adapt to general criteria.

The Approximators for LLSL In order to gain more in-
sights on the proposed LLSL, we further compare the perfor-
mances of LLSL_{linear-1}, LLSL_{linear-2}, and LLSL_{poly-2} to see
how different underlying approximators behave on differ-
ent criteria. In Figure 2, we see that LLSLlinear-1 performs
the best against the others on Hamming loss and Rank loss.

On the other hand, LLSLpoly-2outperforms the other two on
Example-F1, Micro-F1 and Macro-F1. LLSL_{linear-2}can only
reach the best result on few cases.

To explain the results, we take a step further to investi- gate the reason behind so. In particular, we investigate the goodness of the estimations learned by different approxima- tors. The goodness is measured by the RMSE between the true cost and the estimated cost on a set of points sampled from the local neighborhood where the estimation is learned.

The results are reported in Table 1. From the table, it can be seen that the performance of a model is strongly correlated

Table 1: Approximation error in RMSE

Approximators

Datasets Criterion linear-1 linear-2 poly-2 Consistent

birds Rank 0.073 ± 0.079 2.150 ± 3.148 1.274 ± 2.849 3 Example-F1 0.034 ± 0.048 0.053 ± 0.044 0.023 ± 0.032 3 Micro-F1 0.146 ± 0.046 0.116 ± 0.021 0.132 ± 0.033 3 Macro-F1 0.485 ± 0.025 0.551 ± 0.061 0.450 ± 0.064 3 emotions Ranking 0.170 ± 0.072 0.613 ± 0.329 0.990 ± 0.560 3 Example-F1 0.159 ± 0.070 0.173 ± 0.033 0.093 ± 0.029 3 Micro-F1 0.227 ± 0.054 0.242 ± 0.088 0.152 ± 0.026 3 Macro-F1 0.209 ± 0.047 0.187 ± 0.114 0.172 ± 0.016 3 scene Ranking 0.110 ± 0.058 0.603 ± 0.291 0.478 ± 0.495 3 Example-F1 0.154 ± 0.088 0.133 ± 0.072 0.123 ± 0.067 7 Micro-F1 0.390 ± 0.137 0.355 ± 0.140 0.225 ± 0.175 3 Macro-F1 0.200 ± 0.153 0.184 ± 0.090 0.172 ± 0.089 3

Table 2: Results on datasets with many labels

Algorithms

Datasets (n) P@k SLEEC BCE loss Locally-learned loss

Bibtex (n) P@1 0.3492 0.4482 0.3647

(n) P@3 0.6036 0.6698 0.6157 (n) P@5 0.7113 0.7612 0.7270 Delicious (n) P@1 0.3241 0.3194 0.2980 (n) P@3 0.3862 0.3778 0.3581 (n) P@5 0.4344 0.4282 0.4081 EURLex-4K (n) P@1 0.2074 0.2345 0.2287 (n) P@3 0.3570 0.3654 0.3579 (n) P@5 0.4767 0.4731 0.4677 Wiki10-31K (n) P@1 0.1412 0.1442 0.1396 (n) P@3 0.2702 0.2665 0.2563 (n) P@5 0.3730 0.3631 0.3586

to how well the underlying local approximation to the crite- rion is. That is, the lower the approximation error, the better the model performs. We mark every row in the table as con- sistentif the above holds. The results also suggest a general guideline to choose suitable approximator for LLSL when it comes to different criteria. While the bias of the estimation varies with the choice of local learner, we observe that the estimation error decreases as the optimization proceeds.

Interestingly, while we generally believe that more so- phisticated approximator may provide more faithful estima- tion to the criterion of interest, it is shown that the relatively simple linear-1 approximator gives the best estimation to Hamming loss and Rank loss. The interesting finding can be explained by the inherent nature of the two criteria. In other words, as the minimization of Hamming loss and Rank loss can actually be decomposed by labels (Dembczynski, Kot- lowski, and H¨ullermeier 2012), a linear approximator that treats each label independently might be enough good for estimating these criteria. Nevertheless, for the other more complicated criteria such as Macro-F1, a more sophisticated approximation is required for better performance.

4.4 Scaling Up to Datasets with Many Labels On large scale datasets, we demonstrate the scalability and flexibility of the proposed LLSL by using it to fine-tune deep models that are originally pre-trained on BCE loss. In table 2, BCE stands for models that are trained with the conven- tional BCE loss, and LLSL stands for models that are fine- tuned with the proposed locally-learned surrogate loss. In

the experiments, we find the gradients of the surrogate loss learned for Precision-at-k appear to be very sparse, resulting in slower convergence. An useful practical finding to tackle such issue is to optimize the mixture loss between Ham- ming loss, represented by BCE, and the learned surrogate loss. In a high-level sense, as Hamming loss treats each label equally, it in fact encodes the global information about the target metric. Thus, it can be viewed as a regularizer to the locally-learned surrogate loss which exploits the local infor- mation. As long as we could find the sweet spot between the two losses, optimizing the mixture loss between them works well in practice. Furthermore, following the idea, LLSL can actually be mixed with other objectives such as the proposed WBCE. A joint optimization between LLSL and other loss functions might also lead to an interesting future direction.

In Table 2, it is shown that our proposed LLSL success- fully improves the performances of cost-insensitive models on large scale datasets. In addition, our model reaches com- petitive performance to the state-of-the-art. This not only demonstrates the scalability of our proposed method, but also shows the capability for LLSL to cope with criterion like Precision-at-k when the deep model is a ranker.

### 5 Conclusion

We propose a novel locally-learned surrogate loss (LLSL) that can adapt toward optimizing general MLL criteria by learning local approximation to the criterion of interest. The learned surrogate loss is then coupled with deep learning model to optimize the target criterion. The proposed LLSL can successfully capture the local behavior of the target MLL criterion and in turn provides cost-aware gradients guiding the network updates. Extensive experimental results show that our proposed deep model achieves outstanding performances against the state-of-the-art methods.

### Acknowledgements

We thank the anonymous reviewers and the members of NTU CLLab for valuable suggestions. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Devel- opment (AOARD) under award number FA2386-15-1-4012, and by the Ministry of Science and Technology of Taiwan under number MOST 103-2221-E- 002-149-MY3

### References

Beygelzimer, A.; Langford, J.; and Ravikumar, P. 2009.

Error-correcting tournaments. CoRR abs/0902.3176.

Bhatia, K.; Jain, H.; Kar, P.; Varma, M.; and Jain, P. 2015.

Sparse local embeddings for extreme multi-label classifica- tion. In NIPS 2015.

Boutell, M. R.; Luo, J.; Shen, X.; and Brown, C. M. 2004.

Learning multi-label scene classification. Pattern Recogni- tion37(9):1757–1771.

Dembczynski, K.; Cheng, W.; and H¨ullermeier, E. 2010.

Bayes optimal multilabel classification via probabilistic classifier chains. In ICML 2010.

Dembczynski, K.; Kotlowski, W.; and H¨ullermeier, E. 2012.

Consistent multilabel ranking through univariate losses. In ICML 2012.

Elkan, C. 2001. The foundations of cost-sensitive learning.

In IJCAI 2001.

Gao, W., and Zhou, Z. 2011. On the consistency of multi- label learning. In COLT 2011.

Gong, Y.; Jia, Y.; Leung, T.; Toshev, A.; and Ioffe, S. 2013.

Deep convolutional ranking for multilabel image annotation.

CoRRabs/1312.4894.

Huang, K.-H., and Lin, H.-T. 2017. Cost-sensitive label embedding for multi-label classification. Machine Learning 106(9–10):1725–1746.

Li, C.-L., and Lin, H.-T. 2014. Condensed filter tree for cost-sensitive multi-label classification. In ICML 2014.

Lo, H.; Wang, J.; Wang, H.; and Lin, S. 2011. Cost-sensitive multi-label learning for audio tag annotation and retrieval.

IEEE Trans. Multimedia13(3):518–529.

Madjarov, G.; Kocev, D.; Gjorgjevikj, D.; and Dzeroski, S.

2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9):3084–3104.

Nam, J.; Kim, J.; Loza Menc´ıa, E.; Gurevych, I.; and F¨urnkranz, J. 2014. Large-scale multi-label text classifi- cation - revisiting neural networks. In ECML PKDD 2014.

Petterson, J., and Caetano, T. S. 2010. Reverse multi-label learning. In NIPS 2010.

Petterson, J., and Caetano, T. S. 2011. Submodular multi- label learning. In NIPS 2011.

Qi, G.; Hua, X.; Rui, Y.; Tang, J.; Mei, T.; and Zhang, H.

2007. Correlative multi-label video annotation. In Pro- ceedings of the 15th International Conference on Multime- dia 2007.

Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011.

Classifier chains for multi-label classification. Machine Learning85(3):333–359.

Schapire, R. E., and Singer, Y. 2000. Boostexter: A boosting-based system for text categorization. Machine Learning39(2/3):135–168.

Tsoumakas, G.; Xioufis, E. S.; Vilcek, J.; and Vlahavas, I. P.

2011. MULAN: A java library for multi-label learning.

Journal of Machine Learning Research12:2411–2414.

Tsoumakas, G.; Katakis, I.; and Vlahavas, I. P. 2010. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 2nd ed.667–685.

Wu, Y.-P., and Lin, H.-T. 2017. Progressive k-labelsets for cost-sensitive multi-label classification. Machine Learning 106(5):671–694.

Zadrozny, B.; Langford, J.; and Abe, N. 2003. Cost-sensitive learning by cost-proportionate example weighting. In ICDM 2003, 435.

Zhang, M., and Zhou, Z. 2006. Multi-label neural networks with applications to functional genomics and text catego- rization. IEEE Trans. Knowl. Data Eng. 18(10):1338–1351.