
Dynamic Principal Projection for Cost-Sensitive Online Multi-Label Classification

Hong-Min Chu · Kuan-Hao Huang · Hsuan-Tien Lin

CSIE Department, National Taiwan University, Taiwan
E-mail: {r04922031, r03922062, htlin}@csie.ntu.edu.tw

Abstract We study multi-label classification (MLC) with three important real-world issues:

online updating, label space dimension reduction (LSDR), and cost-sensitivity. Current MLC algorithms have not been designed to address these three issues simultaneously. In this paper, we propose a novel algorithm, cost-sensitive dynamic principal projection (CS-DPP), that resolves all three issues. The foundation of CS-DPP is an online LSDR framework derived from a leading LSDR algorithm. In particular, CS-DPP is equipped with an efficient online dimension reducer motivated by matrix stochastic gradient, and establishes its theoretical backbone when coupled with a carefully-designed online regression learner. In addition, CS-DPP embeds the cost information into label weights to achieve cost-sensitivity along with theoretical guarantees. Experimental results verify that CS-DPP achieves better practical performance than current MLC algorithms across different evaluation criteria, and demonstrate the importance of resolving the three issues simultaneously.

Keywords Multi-label classification · Cost-sensitive · Label space dimension reduction

1 Introduction

The multi-label classification (MLC) problem allows each instance to be associated with a set of labels and reflects the nature of a wide spectrum of real-world applications [8, 4, 12].

Traditional MLC algorithms mainly tackle the batch MLC problem, where the input data are presented in a batch [24, 28]. Nevertheless, in many MLC applications such as e-mail categorization [22], multi-label examples arrive as a stream. Online analysis is therefore required as batch MLC algorithms may not meet the needs to make a prediction and update the predictor on the fly. The needs of such applications can be formalized as the online MLC (OMLC) problem.

The OMLC problem is generally more challenging than the batch one, and many mature algorithms for the batch problem have not yet been carefully extended to OMLC. Label space dimension reduction (LSDR) is a family of mature algorithms for the batch MLC problem [7, 13, 17, 26, 14, 25, 33, 6, 2, 5]. By viewing the label set of each instance as a high-dimensional label vector in a label space, LSDR encodes each label vector as a code vector in a lower-dimensional code space, and learns a predictor within the code space. An unseen instance is predicted by coupling the predictor with a decoder from the code space to the label space.

For example, compressed sensing (CS) [13] encodes using random projections, and decodes with sparse vector reconstruction; principal label space transformation (PLST) [26] encodes by projecting to the key eigenvectors of the known label vectors obtained from principal component analysis (PCA), and decodes by reconstruction with the same eigenvectors.

This low-dimensional encoding allows LSDR algorithms to exploit the key joint information between labels, making them more robust to noise and more effective in learning [26]. Nevertheless, to the best of our knowledge, all the LSDR algorithms mentioned above are designed only for the batch MLC problem.

Another family of MLC algorithms that have not been carefully extended for OMLC contains the cost-sensitive MLC algorithms. In particular, different MLC applications usually come with different evaluation criteria (costs) that reflect their realistic needs. It is important to design MLC algorithms that are cost-sensitive to systematically cope with different costs, because an MLC algorithm that targets one specific cost may not always perform well under other costs [15]. Two representative cost-sensitive MLC algorithms are the probabilistic classifier chain (PCC) [10] and the condensed filter tree (CFT) [15]. PCC estimates the conditional probability with the classifier chain (CC) method [24] and makes Bayes-optimal predictions with respect to the given cost; CFT decomposes the cost into instance weights when training the classifiers in CC. Both algorithms, again, target the batch MLC problem rather than the OMLC one.

From the discussions above, there is currently no algorithm that considers the three realistic needs of online updating, label space dimension reduction, and cost-sensitivity at the same time. The goal of this work is to study such algorithms. We first formalize the OMLC and cost-sensitive OMLC (CSOMLC) problems in Section 2 and discuss related work. We then extend LSDR for the OMLC problem and propose a novel online LSDR algorithm, dynamic principal projection (DPP), by connecting PLST with online PCA. In particular, we derive the DPP algorithm in Section 3 along with its theoretical guarantees, and resolve the issue of possible basis drifting caused by online PCA.

In Section 4, we further generalize DPP to cost-sensitive DPP (CS-DPP) to fully match the needs of CSOMLC with a theoretically-backed label-weighting scheme inspired by CFT.

Extensive empirical studies demonstrate the strength of CS-DPP in addressing the three realistic needs in Section 5. In particular, we justify the necessity to consider LSDR, basis drifting and cost-sensitivity. The results show that CS-DPP significantly outperforms other OMLC competitors across different CSOMLC problems, which validates the robustness and effectiveness of CS-DPP, as concluded in Section 6.

2 Preliminaries and Related Work

For the MLC problem, we denote the feature vector of an instance as x ∈ R^d and its corresponding label vector as y ∈ Y ≡ {+1, −1}^K, where y[k] = +1 iff the instance is associated with the k-th label out of a total of K possible labels. We let y[k] ∈ {+1, −1} to conform with the common setting of online binary classification [9], which is equivalent to another scheme, y[k] ∈ {1, 0}, used in other MLC works [15, 24].

Traditional MLC methods consider the batch setting, where a training dataset D = {(x_n, y_n)}_{n=1}^N is given at once, and the objective is to learn a classifier g: R^d → {+1, −1}^K from D with the hope that ŷ = g(x) accurately predicts the ground truth y with respect to an unseen x. In this work, we focus on the OMLC setting, which assumes that instances (x_t, y_t) arrive in sequence from a data stream. Whenever an x_t arrives at iteration t, the OMLC algorithm is required to make a prediction ŷ_t = g_t(x_t) based on the current classifier g_t and feature vector x_t. The ground truth y_t with respect to x_t is then revealed, and the penalty of ŷ_t is evaluated against y_t.

Many evaluation criteria for comparing y and ŷ have been considered in the literature to satisfy different application needs. A simple criterion [28] is the Hamming loss

c_{\mathrm{HAM}}(y, \hat{y}) = \frac{1}{K} \sum_{k=1}^{K} [\![\, y[k] \neq \hat{y}[k] \,]\!].

The Hamming loss separately considers each label during evaluation. There are other criteria that jointly evaluate all labels, such as the F1 loss [28]

c_{F}(y, \hat{y}) = 1 - \frac{2 \sum_{k=1}^{K} [\![\, y[k] = +1 \text{ and } \hat{y}[k] = +1 \,]\!]}{\sum_{k=1}^{K} \big( [\![\, y[k] = +1 \,]\!] + [\![\, \hat{y}[k] = +1 \,]\!] \big)}.

In this work, we follow existing cost-sensitive MLC approaches [15] to extend OMLC to the cost-sensitive OMLC (CSOMLC) setting, which further takes the evaluation criterion as an additional input to the learning algorithm. We call the criterion a cost function and overload c: {+1, −1}^K × {+1, −1}^K → R as its notation. The cost function evaluates the penalty of ŷ against y by c(y, ŷ). We naturally assume that c(·, ·) satisfies c(y, y) = 0 and max_ŷ c(y, ŷ) ≤ 1. The objective of a CSOMLC algorithm is to adaptively learn a classifier g_t: R^d → {+1, −1}^K based on not only the data stream but also the input cost function c, such that the cumulative cost Σ_{t=1}^{T} c(y_t, ŷ_t) with respect to the input c, where ŷ_t = g_t(x_t), can be minimized.

Note that the cost function within the CSOMLC setting above corresponds to the example-based evaluation criteria for MLC, named because the prediction ŷ_t of each example is evaluated against the ground truth y_t independently. More sophisticated evaluation criteria such as micro-based and macro-based criteria [27, 20] can also be found in the literature.

The following equations highlight the difference between example-F1 (what our CSOMLC setting can handle), micro-F1 and macro-F1 when calculated on T predictions:

\text{example-F1 loss} = 1 - \frac{2}{T} \sum_{t=1}^{T} \frac{\sum_{k=1}^{K} [\![\, y_t[k] = +1 \text{ and } \hat{y}_t[k] = +1 \,]\!]}{\sum_{k=1}^{K} \big( [\![\, y_t[k] = +1 \,]\!] + [\![\, \hat{y}_t[k] = +1 \,]\!] \big)};

\text{micro-F1 loss} = 1 - \frac{2}{K} \sum_{k=1}^{K} \frac{\sum_{t=1}^{T} [\![\, y_t[k] = +1 \text{ and } \hat{y}_t[k] = +1 \,]\!]}{\sum_{t=1}^{T} \big( [\![\, y_t[k] = +1 \,]\!] + [\![\, \hat{y}_t[k] = +1 \,]\!] \big)};

\text{macro-F1 loss} = 1 - \frac{2 \sum_{t=1}^{T} \sum_{k=1}^{K} [\![\, y_t[k] = +1 \text{ and } \hat{y}_t[k] = +1 \,]\!]}{\sum_{t=1}^{T} \sum_{k=1}^{K} \big( [\![\, y_t[k] = +1 \,]\!] + [\![\, \hat{y}_t[k] = +1 \,]\!] \big)}.

In particular, the three criteria differ by the averaging process. The example-F1 loss computes the harmonic mean of precision and recall (F1) per example and then takes the arithmetic mean over all examples; the micro-F1 loss computes the harmonic mean of precision and recall per label and then takes the arithmetic mean over all labels; the macro-F1 loss computes the harmonic mean of precision and recall over the set of all example-label predictions. The more sophisticated criteria are known to be more difficult to optimize. Thus, similar to many existing cost-sensitive MLC algorithms for the batch setting [15], we consider only example-based criteria in this work, and leave the investigation of achieving cost-sensitivity for micro- and macro-based criteria to the future.
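To make the difference between the three averaging schemes concrete, the following minimal NumPy sketch computes the three losses exactly as defined above (note that the micro/macro naming here follows this paper's definitions). It assumes the labels are stored as {+1, −1} matrices Y and Yhat of shape (T, K); the small constant eps, which guards empty denominators, is our own addition and not part of the definitions.

    import numpy as np

    def f1_losses(Y, Yhat, eps=1e-12):
        # Y, Yhat: (T, K) arrays with entries in {+1, -1}
        both = ((Y == 1) & (Yhat == 1)).astype(float)              # [[y=+1 and yhat=+1]]
        either = (Y == 1).astype(float) + (Yhat == 1).astype(float)
        # example-F1 loss: average the per-example F1 over the T examples
        example_f1 = 1.0 - np.mean(2.0 * both.sum(axis=1) / (either.sum(axis=1) + eps))
        # micro-F1 loss (as defined above): average the per-label F1 over the K labels
        micro_f1 = 1.0 - np.mean(2.0 * both.sum(axis=0) / (either.sum(axis=0) + eps))
        # macro-F1 loss (as defined above): pool all example-label predictions
        macro_f1 = 1.0 - 2.0 * both.sum() / (either.sum() + eps)
        return example_f1, micro_f1, macro_f1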

Several OMLC algorithms have been studied in the literature, including online binary relevance [23], the Bayesian OMLC framework [34], and the multi-window approach using k nearest neighbors [32]. However, none of them are cost-sensitive. That is, they cannot take the cost function into account to improve learning performance.

Cost-sensitive MLC algorithms have also been studied in the literature. Cost-sensitive RAkEL [19] and progressive RAkEL [31] are two algorithms that generalize a famous batch MLC algorithm called RAkEL [29] to cost-sensitive learning. The former achieves cost-sensitivity for any weighted Hamming loss, and the latter achieves this for any cost function. Probabilistic classifier chain (PCC) [10] and condensed filter tree (CFT) [15] are two other algorithms that generalize another famous batch MLC algorithm called classifier chain (CC) [24] to cost-sensitive learning. PCC estimates the conditional probability of the label vector via CC, and makes a Bayes-optimal prediction with respect to the cost function and the estimation. PCC in principle achieves cost-sensitivity for any cost function, but the prediction can be time-consuming unless an efficient Bayes inference rule is designed for the cost function (e.g. the F1 loss [11]). CFT embeds the cost information into CC by an O(K^2)-time step that re-weights the training instances for each classifier. All four algorithms above are designed for the batch cost-sensitive MLC problem, and it is not clear how they can be modified for the CSOMLC problem. CC-family algorithms typically suffer from the problem of ordering the labels properly to achieve decent performance. Some works have started to address the ordering problem for the original CC algorithm, such as the easy-to-hard paradigm [18], but whether those works can be well-coupled with CFT or PCC has yet to be studied.

Label space dimension reduction (LSDR) is another family of MLC algorithms. LSDR encodes each label vector as a code vector in a lower-dimensional code space, and learns a predictor from the feature vectors to the corresponding code vectors. The prediction of LSDR consists of the predictor followed by a decoder from the code space to the label space. For example, compressed sensing (CS) [13] uses random projection for encoding, takes a regressor as the predictor, and decodes by sparse vector reconstruction. Instead of random projection, principal label space transformation (PLST) [26] encodes the label vectors {y_n}_{n=1}^N to their top principal components for the batch MLC problem. Some other LSDR algorithms, including conditional principal label space transformation (CPLST) [7], feature-aware implicit label space encoding (FaIE) [17], the canonical-correlation-analysis method [25], and low-rank empirical risk minimization for multi-label learning [33], jointly take the feature and the label vectors into account during encoding [7, 17, 25, 33] to further improve the performance.

The physical intuition behind LSDR algorithms is to capture the key joint information between labels before learning. By encoding to a more concise code space, LSDR algo- rithms enjoy the advantage of learning the predictor more effectively to improve the MLC performance. Moreover, compared with non-LSDR algorithms like RAkEL and CFT, LSDR algorithms are generally more efficient, which in turn makes them favorable candidates to be extended to online learning.

Motivated by the possible applications of online updating, the realistic needs of cost- sensitivity, and the potential effectiveness of label space dimension reduction, we take an initiative to study LSDR algorithms for the CSOMLC setting. In particular, we first adapt PLST to the OMLC setting in Section 3, and further generalize it to the CSOMLC setting in Section 4.


Table 1: Summary of common notations

  d                                    number of features
  K                                    number of labels
  M                                    dimension of the code space
  x ∈ R^d                              feature vector
  y ∈ {+1, −1}^K                       ground truth label vector
  ŷ ∈ {+1, −1}^K                       predicted label vector
  c(y, ŷ)                              cost for predicting y as ŷ
  z ∈ R^M                              code vector
  P ∈ R^{M×K}                          encoding matrix from the label space to the code space
  W ∈ R^{d×M}                          linear predictor matrix from the input space to the code space
  U ∈ R^{K×K}                          (roughly) rank-M matrix within matrix stochastic gradient (MSG)
  (Q ∈ R^{(M+1)×K}, σ ∈ R^{M+1})       decomposition of U such that U = Q diag(σ) Qᵀ
  Γ ∈ [0, 1]^{M+1}                     discrete probability distribution for sampling the rows of Q to get P
  δ^(k) ∈ R                            weight of the k-th label for representing the cost in CS-DPP
  C ∈ R^{K×K}                          diagonal matrix that stores {√δ^(k)}_{k=1}^{K} in CS-DPP

3 Dynamic Principal Projection

In this section, we first propose an online LSDR algorithm, dynamic principal projection (DPP), that optimizes the Hamming loss. DPP is motivated by the connection between PLST, which encodes the label vectors to their top principal components, and the rich literature of online PCA algorithms [1, 21, 16]. We shall first introduce the details of PLST. Then, we discuss the potential difficulties along with our solutions to advance PLST to our proposed DPP. To facilitate reading, the common notations that will be used in the coming sections are summarized in Table 1.

3.1 Principal Label Space Transformation

Given the dimension M ≤ K of the code space and a batch training dataset D = {(x_n, y_n)}_{n=1}^N, PLST, as a batch LSDR algorithm, encodes each y_n ∈ {+1, −1}^K into a code vector z_n = P(y_n − o), where o is a fixed reference point for shifting y_n, and P contains the top M eigenvectors of Σ_{n=1}^{N} (y_n − o)(y_n − o)ᵀ. While PLST works with any fixed o, it is worth noting that when o is taken as (1/N) Σ_{n=1}^{N} y_n, the code vector z_n contains the top M principal components of y_n. A multi-target regressor r is then learned on {(x_n, z_n)}_{n=1}^N, and the prediction of an unseen instance x is made by

\hat{y} = \mathrm{round}\big( P^\top r(x) + o \big)    (1)

where^1 round(v) = (sign(v[1]), . . . , sign(v[K]))ᵀ.
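As a concrete illustration, the following NumPy sketch implements the batch PLST pipeline described above, using the mean label vector as the reference point o and a ridge regressor as r. It is a minimal sketch, not the original implementation; the ridge parameter lam is an arbitrary choice for the example.

    import numpy as np

    def plst_fit(X, Y, M, lam=1.0):
        # X: (N, d) features; Y: (N, K) labels in {+1, -1}
        o = Y.mean(axis=0)                         # reference point o
        Yc = Y - o
        # rows of P are the top-M eigenvectors of sum_n (y_n - o)(y_n - o)^T
        _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
        P = Vt[:M]                                 # (M, K), left orthogonal
        Z = Yc @ P.T                               # code vectors z_n = P(y_n - o)
        d = X.shape[1]
        W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)   # ridge regressor r(x) = W^T x
        return P, W, o

    def plst_predict(x, P, W, o):
        # y_hat = round(P^T r(x) + o), rounding elementwise to {+1, -1}
        v = P.T @ (W.T @ x) + o
        return np.where(v >= 0, 1.0, -1.0)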

By projecting to the top principal components, PLST preserves the maximum amount of information within the observed label vectors. In addition, PLST is backed by the following theoretical guarantee:

1 The naming of the round(·) operator follows directly from the original paper of PLST [26], which represents y ∈ {0, 1}^K instead of {−1, +1}^K. Our use of sign is thus equivalent to the rounding steps used in the original PLST.


Theorem 1 [26] When making a prediction ŷ from x by ŷ = round(Pᵀr(x) + o) with any left orthogonal matrix P, the Hamming loss

c_{\mathrm{HAM}}(y, \hat{y}) \leq \frac{1}{K} \Big( \underbrace{\| r(x) - z \|_2^2}_{\text{pred. error}} + \underbrace{\| (I - P^\top P)\, y' \|_2^2}_{\text{reconstruction error}} \Big)    (2)

where z ≡ P y′ and y′ ≡ y − o with respect to any fixed reference point o.

Theorem 1 bounds the Hamming loss by the prediction and reconstruction errors. Based on the results of singular value decomposition, the P in PLST is the optimal solution for minimizing the total reconstruction error of the observed label vectors with respect to any fixed o, and the particular reference point (1/N) Σ_{n=1}^{N} y_n minimizes the reconstruction error over all possible o. Then, by minimizing the prediction error with the regressor r, PLST is able to minimize the Hamming loss approximately.

3.2 General Online LSDR Framework for DPP

The upper bound in Theorem 1 works for any regressor r and any left orthogonal encoding matrix P. Based on the bound, we propose an online LSDR framework that approximately minimizes the Hamming loss with an online regressor r_t and an online encoding matrix P_t in each iteration t. Similar to PLST, the proposed framework works with any fixed reference point o, but for simplicity of illustration we assume that o = 0 to remove o from the derivations below. The steps of the framework are:

For t = 1, . . . , T
  Receive x_t and predict ŷ_t = round(P_tᵀ r_t(x_t))
  Receive y_t and incur error ℓ^(t)(r_t, P_t)
  Update P_t and r_t

In each iteration t of the framework, an online prediction ŷ_t is made with the updated r_t and P_t. We take the online error function ℓ^(t)(r, P) to be ‖r(x_t) − P y_t‖_2^2 + ‖(I − PᵀP) y_t‖_2^2, which upper bounds the Hamming loss c_HAM(y_t, ŷ_t) of the online prediction. Then, by updating r_t and P_t with online learning algorithms that minimize the cumulative online error Σ_{t=1}^{T} ℓ^(t)(r_t, P_t), we can approximately minimize the cumulative Hamming loss.
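The framework can be summarized by the Python skeleton below. Here update_P and update_r are placeholders for the two online learners developed in Sections 3.3 and 3.4, and the ordering of the two updates inside one iteration is an illustrative choice; the sketch is not the paper's reference code.

    import numpy as np

    def online_lsdr(stream, P, r, update_P, update_r):
        """Generic online LSDR loop; `stream` yields (x_t, y_t) pairs."""
        total_hamming = 0.0
        for t, (x, y) in enumerate(stream, start=1):
            y_hat = np.where(P.T @ r(x) >= 0, 1.0, -1.0)       # round(P_t^T r_t(x_t))
            K = y.shape[0]
            total_hamming += np.mean(y != y_hat)               # c_HAM(y_t, y_hat_t)
            # online error l^(t)(r_t, P_t), which upper-bounds the Hamming loss (Theorem 1)
            err = np.sum((r(x) - P @ y) ** 2) + np.sum(((np.eye(K) - P.T @ P) @ y) ** 2)
            P = update_P(P, y)                                 # minimize the reconstruction error
            r = update_r(r, x, P @ y)                          # minimize the prediction error
            yield y_hat, err, total_hamming / t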

The simple framework above transforms the OMLC problem into an online learning problem with an error function composed of two terms. Ideally, the online learning algorithm should update P_t and r_t to jointly minimize the total error from both terms. Optimizing the two terms jointly has been studied in batch LSDR algorithms like CPLST [7], which is a successor of PLST [26] that also operates with the upper bound in Theorem 1. Nevertheless, it is very challenging to extend CPLST to the online setting efficiently. In particular, a naïve online extension would require computing the hat matrix of the ridge regression part (from x to z) within CPLST in order to obtain P_t, and the hat matrix grows quadratically with the number of examples. That is, in an online setting, computing and storing the hat matrix needs at least Ω(T^2) complexity up to iteration T, which is practically infeasible.

Thus, we resort to PLST [26], the predecessor of CPLST, to make an initial attempt towards tackling OMLC problems. PLST minimizes the two terms separately in the batch setting, and our proposed extension of PLST similarly contains two online learning algorithms, one for minimizing each term. That is, we further decompose the online learning problem into two sub-problems, one for minimizing the cumulative reconstruction error (by updating P_t), and one for minimizing the cumulative prediction error (by updating r_t). Designing efficient and effective algorithms for the two sub-problems turns out to be non-trivial, and will be discussed in Sections 3.3 and 3.4.

3.3 Online Minimization of Reconstruction Error

Next, we discuss the design of our first online learning algorithm to tackle the sub-problem of minimizing the cumulative reconstruction error Σ_{t=1}^{T} ‖(I − P_tᵀ P_t) y_t‖_2^2, which corresponds to the second term in (2). The goal is to generate a left-orthogonal matrix P_t ∈ R^{M×K} in each iteration with a theoretical guarantee on minimizing the cumulative reconstruction error.

Our design is motivated by a simple but promising online PCA algorithm, matrix stochastic gradient (MSG) [1]. MSG does not directly solve the sub-problem of our interest because the problem is non-convex over P_t. Instead, MSG substitutes P_tᵀ P_t with a rank-M matrix U_t ∈ R^{K×K} and rewrites the cumulative reconstruction error as Σ_{t=1}^{T} y_tᵀ (I − U_t) y_t. By further assuming that ‖y_t‖_2 ≤ 1, MSG loosens the constraint rank(U_t) = M to tr(U_t) = M, and updates U_t with online projected gradient descent upon receiving a new y_t as

U_{t+1} = P_{\mathrm{tr}}(U_t + \eta\, y_t y_t^\top)    (3)

where η is the learning rate and P_tr(·) is the projection operator onto the feasible set of U. The less-constrained U_t in MSG carries the theoretical guarantee of minimizing the cumulative reconstruction error (subject to U_t), but decomposing U_t into a left-orthogonal P_t ∈ R^{M×K} with a theoretical guarantee on P_t is not only non-trivial but also time-consuming.

Capped MSG [1] is an extension of MSG with the hope of lightening the computational burden of decomposing U_t. In particular, Capped MSG introduces an additional (non-convex) constraint of rank(U_t) ≤ M + 1, and indirectly maintains the decomposition of U_t as (Q_t, σ_t), where the left-orthogonal matrix Q_t ∈ R^{(M+1)×K} and the vector of singular values σ_t ∈ R^{M+1} are such that U_t = Q_t diag(σ_t) Q_tᵀ. The decomposed (Q_t, σ_t) in Capped MSG enjoys the same theoretical guarantee of minimizing the reconstruction error as the U_t in MSG, while the maintenance step of Capped MSG is more efficient than that of MSG.

Nevertheless, because we want P_t to be M by K while Q_t is (M + 1) by K, the generated Q_t in Capped MSG cannot be directly used to solve our sub-problem. A naïve idea is to generate P_t by truncating the least important row of Q_t, but this naïve idea is no longer backed by the theoretical guarantee of Capped MSG.

Aiming to address the above difficulties, we propose an efficient and effective algorithm to stochastically generate P_t from the (Q_t, σ_t) maintained by Capped MSG in each iteration. To elaborate, let Q_t^{−i} be Q_t with its i-th row removed, and let σ_t[i] be the eigenvalue corresponding to the i-th row of Q_t. We generate P_t by sampling from a discrete probability distribution Γ_t, which consists of the M + 1 events {Q_t^{−i}}_{i=1}^{M+1}, with the probability of Q_t^{−i} being 1 − σ_t[i]. As the projection operator P_tr(·) ensures 0 ≤ σ_t[i] ≤ 1 for each σ_t[i], one can easily verify that Γ_t is a valid distribution with the additional fact that Σ_i σ_t[i] = tr(U_t) = M. The following lemma shows that the online encoding matrix generated by our simple stochastic algorithm is truly effective.

Lemma 2 Suppose (Q_t, σ_t) is obtained after an update of Capped MSG such that U_t = Q_t diag(σ_t) Q_tᵀ. If Γ_t is a discrete probability distribution over the events {Q_t^{−i}}_{i=1}^{M+1} with the probability of Q_t^{−i} being 1 − σ_t[i], then for any y

E_{P_t \sim \Gamma_t}\big[\, y^\top (I - P_t^\top P_t)\, y \,\big] = y^\top (I - U_t)\, y.    (4)


The proof of the lemma can be found in Appendix A.1. Moreover, our sampling algorithm is highly efficient with its O(M) time complexity. Note that an earlier work contains another algorithm of a similar spirit [21]; however, that algorithm's time complexity is O(K^2), which is less efficient than ours.
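A minimal sketch of this stochastic generation step is shown below, assuming Q has shape (M+1, K) with orthonormal rows and sigma holds the corresponding values in [0, 1] summing to M; it mirrors the distribution Γ_t of Lemma 2 and is an illustration rather than code taken from the paper.

    import numpy as np

    def sample_P(Q, sigma, rng=np.random.default_rng()):
        """Sample P_t from Gamma_t: drop row i of Q with probability 1 - sigma[i]."""
        probs = 1.0 - sigma              # valid since 0 <= sigma[i] <= 1 and sum(sigma) = M
        probs = probs / probs.sum()      # guard against round-off; the sum is 1 in exact arithmetic
        i = rng.choice(len(sigma), p=probs)
        return np.delete(Q, i, axis=0)   # Q^{-i}: a left-orthogonal (M x K) encoding matrix

Choosing which row to drop costs only O(M); the O(MK) cost of materializing the new matrix with np.delete is a convenience of the sketch, not of the sampling decision itself.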

To sum up, our online learning algorithm that minimizes the cumulative reconstruction error for DPP takes Capped MSG as its building block to maintain U_t through Q_t and σ_t, and then samples the online encoding matrix P_t from the Γ_t derived from Q_t in each iteration using our proposed sampling algorithm. Note that to fulfill the assumption of ‖y_t‖_2 ≤ 1 required by Capped MSG, we apply a simple trick to scale each y_t ∈ {+1, −1}^K by a factor of 1/K. The predictions given by our online LSDR framework remain unchanged after the constant scaling due to the use of the round(·) operator.

3.4 Online Minimization of Prediction Error

Next, we discuss our second proposed online learning algorithm, which solves the sub-problem of minimizing the cumulative prediction error Σ_{t=1}^{T} ‖r_t(x_t) − P_t y_t‖_2^2, corresponding to the first term in (2). The proposed online learning algorithm is based on the well-known online ridge regression, and incorporates two different carefully designed techniques to remedy the negative effect caused by the variation of P_t across iterations.

The naïve online ridge regression parameterizes r_t(x) as an online linear regressor W_tᵀ x with W_t ∈ R^{d×M}, and updates W_t by

W_t = \arg\min_{W} \; \frac{\lambda}{2}\, \mathrm{tr}(W W^\top) + \sum_{i=1}^{t-1} \| W^\top x_i - z_i \|_2^2    (5)

where z_i = P_i y_i is the code vector of y_i with respect to P_i, and λ is the regularization parameter.

However, the naïve online ridge regression suffers from the drifting of the projection basis caused by varying the online encoding matrix P_t as t advances. To elaborate, recall that the online regressor W_t aims to predict z_t = P_t y_t from x_t, where the code vector z_t can essentially be viewed as the set of combination coefficients with respect to the reference projection basis formed by P_t. However, W_t is learned from {(x_i, z_i)}_{i=1}^{t−1}, where the learning targets {z_i}_{i=1}^{t−1} are mixed up with coefficients z_i induced from different projection bases P_i. As a consequence, expecting W_tᵀ x_t to give an accurate prediction of z_t for any specific P_t is unrealistic. For a very extreme case, if P_1 = P_3 = . . . = P_{2τ−1} = P and P_2 = P_4 = . . . = P_{2τ} = −P, the z_i's in the odd and even iterations are of totally opposite meanings, although the projection matrices P and −P are mathematically equivalent in quality. The totally opposite meanings make it impossible for W_t to predict z_t accurately.

To remedy the problem of basis drifting, we propose two different techniques, principal basis correction (PBC) and principal basis transform (PBT), to improve the online regressor W_t. Each of them enjoys different advantages.

3.4.1 Principal Basis Correction

The ideal solution to handle basis drifting is to "correct" the reference basis of each z_i to be the latest P_t used for prediction. More specifically, we want W_t to be the ridge regression solution obtained from {(x_i, P_t y_i)}_{i=1}^{t−1} instead of {(x_i, P_i y_i)}_{i=1}^{t−1}. Such a correction step ensures that the reference basis for generating the previous z_i's is the same as the basis that will be used for predicting z_t and decoding ŷ_t from z_t. Denote by W_t^{PBC} the ridge regression solution of {(x_i, P_t y_i)}_{i=1}^{t−1}. The closed-form solution of W_t^{PBC} is

W_t^{PBC} = \underbrace{\Big( \lambda I + \sum_{i=1}^{t-1} x_i x_i^\top \Big)^{-1}}_{A_t^{-1}} \; \underbrace{\Big( \sum_{i=1}^{t-1} x_i y_i^\top \Big)}_{B_t} \; P_t^\top.    (6)

The part A_t^{-1} B_t is independent of the projection matrix P_t. Thus, by maintaining another d by K matrix

H_t = A_t^{-1} B_t

throughout the iterations, W_t^{PBC} can be easily obtained as H_t P_tᵀ for any P_t. The update of H_t to H_{t+1}, on the other hand, requires the calculation of H_{t+1} = (A_t + x_t x_tᵀ)^{-1}(B_t + x_t y_tᵀ), which at first glance has a time complexity of O(d^3 + Kd^2). Fortunately, we can speed up the calculation by applying the Sherman-Morrison formula, which states that

(A_t + x_t x_t^\top)^{-1} = A_t^{-1} - \frac{A_t^{-1} x_t x_t^\top A_t^{-1}}{1 + \gamma}

with γ = x_tᵀ A_t^{-1} x_t. Then, the calculation can be rewritten as

H_{t+1} = \Big( A_t^{-1} - \frac{A_t^{-1} x_t x_t^\top A_t^{-1}}{1 + \gamma} \Big) \big( B_t + x_t y_t^\top \big)
        = A_t^{-1} B_t - \frac{A_t^{-1} x_t x_t^\top A_t^{-1} B_t}{1 + \gamma} + A_t^{-1} x_t y_t^\top - \frac{A_t^{-1} x_t x_t^\top A_t^{-1} x_t y_t^\top}{1 + \gamma}
        = H_t - \frac{A_t^{-1} x_t \tilde{y}_t^\top}{1 + \gamma} + A_t^{-1} x_t y_t^\top - \frac{\gamma A_t^{-1} x_t y_t^\top}{1 + \gamma}
        = H_t - \frac{A_t^{-1} x_t (\tilde{y}_t - y_t)^\top}{1 + \gamma},

where ỹ_t = H_tᵀ x_t. The third line follows from the fact that H_t = A_t^{-1} B_t. Thus, the d by K matrix H_t can be efficiently updated online by

H_{t+1} = H_t - \frac{A_t^{-1} x_t (\tilde{y}_t - y_t)^\top}{1 + x_t^\top A_t^{-1} x_t}    (7)

which requires only a time complexity of O(d^2 + Kd).
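A minimal NumPy sketch of the PBC bookkeeping is given below: it maintains A_t^{-1} and H_t with the Sherman-Morrison update (7) and forms W_t^{PBC} = H_t P_tᵀ on demand. Variable names follow the text; the sketch is illustrative and omits the Capped MSG part that produces P_t.

    import numpy as np

    class PBCRegressor:
        def __init__(self, d, K, lam=1.0):
            self.A_inv = np.eye(d) / lam          # A_0^{-1} = (1/lambda) I
            self.H = np.zeros((d, K))             # H_t = A_t^{-1} B_t

        def predictor(self, P):
            # W_t^{PBC} = H_t P_t^T, the ridge solution on {(x_i, P_t y_i)}
            return self.H @ P.T

        def update(self, x, y):
            # Sherman-Morrison update of A_t^{-1} and the O(d^2 + Kd) update (7) of H_t
            Ax = self.A_inv @ x                   # A_t^{-1} x_t
            gamma = float(x @ Ax)                 # x_t^T A_t^{-1} x_t
            y_tilde = self.H.T @ x                # current regression output for y_t
            self.H = self.H - np.outer(Ax, y_tilde - y) / (1.0 + gamma)
            self.A_inv = self.A_inv - np.outer(Ax, Ax) / (1.0 + gamma)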

It is worth noting that H_t actually stores the online ridge regression solution from x to y. Based on the definition of H_t, we can then theoretically analyze the performance of our online ridge regression solution W_t^{PBC} from x to z with respect to the error ℓ^(t)(·, ·) in our proposed online LSDR framework. Following the convention of online learning, we analyze the expected average regret R̄_T, defined as

\bar{R}_T = \frac{1}{T} \sum_{t=1}^{T} E_{P_t \sim \Gamma_t}\big[ \ell^{(t)}(W_t^{PBC}, P_t) - \ell^{(t)}(W^{\#}, P^{*}) \big],    (8)

for any given sequence of {(P_t, Γ_t)}_{t=1}^{T}, where each P_t is sampled from the distribution Γ_t. Here (W^{#}, P^{*}) denotes the offline reference algorithm that is allowed to peek at the whole data stream {(x_t, y_t)}_{t=1}^{T}. As our algorithm aims to minimize the online error function by a similar decomposition into sub-problems as PLST, we particularly consider (W^{#}, P^{*}) to be the solution of PLST when treating {(x_t, y_t)}_{t=1}^{T} as the input batch data. That is, P^{*} is the minimizer of Σ_{t=1}^{T} y_tᵀ(I − PᵀP) y_t, which corresponds to the second term of ℓ^(t)(·, ·), and W^{#} is the minimizer of Σ_{t=1}^{T} ‖Wᵀ x_t − P^{*} y_t‖_2^2, which corresponds to the first term of ℓ^(t)(·, ·) given P^{*}. It can be easily proved that W^{#} = H^{*}(P^{*})ᵀ, where H^{*} is the optimal linear regression solution of {(x_t, y_t)}_{t=1}^{T}. That is,

H^{*} = \arg\min_{H} \sum_{t=1}^{T} \| H^\top x_t - y_t \|_2^2.    (9)

With the expected average regret defined, we can prove its convergence by assuming the convergence of the subspace spanned by P_t to the subspace spanned by P^{*}. The assumption generally holds when the M-th and (M + 1)-th eigenvalues of Σ_{t=1}^{T} (y_t − o)(y_t − o)ᵀ are different, as the subspace spanned by P^{*} to reach the minimum reconstruction error is consequently unique. In particular, define the expected subspace difference

\Delta_t = \big\| E_{P_t \sim \Gamma_t}[P_t^\top P_t] - (P^{*})^\top P^{*} \big\|_2.    (10)

Theorem 3 With the definitions of H_t in (7), H^{*} in (9), R̄_T in (8) and ∆_t in (10), assume that ‖x_t‖ ≤ 1, ‖y_t‖ ≤ 1 and ‖H_tᵀ x_t − y_t‖_2^2 ≤ ε.

1. For any given T, the expected cumulative regret T · R̄_T is upper-bounded by

(1 + \varepsilon) \sum_{t=1}^{T} \Delta_t + \frac{M}{2} \| H^{*} \|_F^2 + 2 M d \log\Big( 1 + \frac{T}{d} \Big).

2. If lim_{T→∞} ∆_T = 0 and ‖H^{*}‖_F ≤ h across all iterations,² then lim_{T→∞} R̄_T = 0.

The third assumption requires the residual errors of online ridge regression without projection to be bounded, which generally holds when there is some linear relationship between x_t and y_t. The details of the proof of the theorem can be found in Appendix A.2.

Theorem 3 guarantees that the performance of PBC is competitive with a reasonable offline baseline in the long run, given the convergence of the subspace spanned by P_t. Such a guarantee makes the online linear regressor with PBC a solid option for DPP to tackle the sub-problem of minimizing the cumulative prediction error.

3.4.2 Principal Basis Transform

While PBC always gives a W_t^{PBC} learned on the correct code vectors with respect to the basis formed by P_t, the time and space complexity of PBC has an Ω(Kd) dependency due to the cost of maintaining H_t ∈ R^{d×K}. The Ω(Kd) dependency can make PBC computationally inefficient when both K and d are large.

2 The technicality of requiring ‖H^{*}‖_F to be bounded arises because we defined the regret (up to the T-th iteration) with respect to the optimal offline solution upon receiving T examples, and hence H^{*} depends on T. Standard regret proofs in online learning alternatively define the regret with respect to any fixed H. Our proof could also go through with the alternative definition, which changes ‖H^{*}‖_F to a constant ‖H‖_F (that is trivially bounded).


Table 2: Time and space complexity for the two DPP variants

             time complexity                space complexity
DPP-PBC      O(d^2 + MK + Kd + M^2 K)       O(d^2 + MK + Kd)
DPP-PBT      O(d^2 + M^2 d + M^2 K)         O(d^2 + MK + Md)

To address the issue, we propose another technique, principal basis transform (PBT).

Different from PBC, when a new online encoding matrix P_{t+1} is presented, PBT aims at a direct basis transform of the online linear regressor from P_t to P_{t+1}. To be more specific, PBT assumes the regressor W_t^{PBT} to be the low-rank coefficient matrix of some unknown H'_t ∈ R^{d×K} with reference projection basis formed by P_t, which can equivalently be described as W_t^{PBT} = H'_t P_tᵀ. The goal of PBT is to update W_t^{PBT} to W_{t+1}^{PBT} with (x_t, y_t) such that the reference projection basis of W_{t+1}^{PBT} is now induced from P_{t+1}. PBT achieves the goal by a two-step procedure. The first step is to find the low-rank coefficient matrix W'_t of H'_t based on the new reference basis formed by P_{t+1}. However, as only the low-rank coefficient matrix W_t^{PBT} rather than H'_t itself is known, we approximate W'_t by

W'_t = \arg\min_{W} \| W P_{t+1} - W_t^{PBT} P_t \|_F^2.    (11)

Solving (11) analytically gives

W'_t = W_t^{PBT} P_t P_{t+1}^\top.    (12)

The second step is to update W'_t with (x_t, y_t) to obtain W_{t+1}^{PBT} by

W_{t+1}^{PBT} = W'_t - \frac{A_t^{-1} x_t (\tilde{z}'_t - P_{t+1} y_t)^\top}{1 + x_t^\top A_t^{-1} x_t}    (13)

where z̃'_t = (W'_t)ᵀ x_t. Equation (13) can be derived with a similar use of the Sherman-Morrison formula as that for (7), by replacing (ỹ_t, y_t) with (z̃'_t, P_{t+1} y_t) respectively. One can easily verify that the W_{t+1}^{PBT} obtained by (13) still keeps its reference basis as P_{t+1}.
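The two-step PBT procedure translates directly into the short sketch below, which assumes the same A_t^{-1} bookkeeping as in the PBC sketch (the matrix is shared by both variants) and is only an illustration of equations (12) and (13), not the authors' code.

    import numpy as np

    def pbt_update(W, A_inv, x, y, P_old, P_new):
        """One PBT step: transform W from basis P_old to P_new, then learn on (x, y)."""
        # Step 1, eq. (12): re-express the regressor in the new basis
        W_prime = W @ P_old @ P_new.T             # W'_t = W_t^{PBT} P_t P_{t+1}^T
        # Step 2, eq. (13): rank-one ridge update toward the new code vector P_{t+1} y_t
        Ax = A_inv @ x
        gamma = float(x @ Ax)
        z_tilde = W_prime.T @ x                   # current prediction of the code vector
        W_next = W_prime - np.outer(Ax, z_tilde - P_new @ y) / (1.0 + gamma)
        A_inv_next = A_inv - np.outer(Ax, Ax) / (1.0 + gamma)
        return W_next, A_inv_next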

Compared to PBC, PBT only has an Ω(M^2(K + d)) dependency, which is particularly useful when M^2 ≪ min(K, d). The appealing time complexity makes PBT a highly practical option for DPP to minimize the cumulative prediction error. The time and space complexities of the two variants of DPP are listed in Table 2.

4 Generalization to Cost-Sensitive Learning

In this section, we generalize DPP to cost-sensitive DPP (CS-DPP), which meets the requirements of the CSOMLC setting. The key ingredient of the generalization is a carefully designed label-weighting scheme that transforms the cost c(y, ŷ) into a corresponding weighted Hamming loss. With the help of the label-weighting scheme, we subsequently derive an optimization objective similar to Theorem 1 for general cost functions, which allows us to derive CS-DPP by reusing the building blocks of DPP.

We start from the details of our label-weighting scheme, which is based on a label-wise and order-dependent decomposition of c(y, ŷ) and is motivated by a similar concept in [15]. The label-weighting scheme works as follows. Defining ŷ_real^(k) and ŷ_pred^(k) as

\hat{y}_{\mathrm{real}}^{(k)}[i] =
\begin{cases}
y[i] & \text{if } i < k \\
y[i] & \text{if } i = k \\
\hat{y}[i] & \text{if } i > k
\end{cases}
\qquad \text{and} \qquad
\hat{y}_{\mathrm{pred}}^{(k)}[i] =
\begin{cases}
y[i] & \text{if } i < k \\
-y[i] & \text{if } i = k \\
\hat{y}[i] & \text{if } i > k
\end{cases}

we decompose c(y, ŷ) into δ^(1), . . . , δ^(K) such that

\delta^{(k)} = \big|\, c(y, \hat{y}_{\mathrm{pred}}^{(k)}) - c(y, \hat{y}_{\mathrm{real}}^{(k)}) \,\big|.    (14)

Recall that y is the ground truth vector and ŷ is the prediction vector from the algorithm. The two newly constructed vectors, ŷ_real^(k) and ŷ_pred^(k), can both be viewed as pseudo prediction vectors that are "better" than ŷ, as they are both perfectly correct up to the (k − 1)-th label. The two vectors only differ on the k-th prediction, which is correct for ŷ_real^(k) and incorrect for ŷ_pred^(k). The difference allows the term δ^(k) in (14) to quantify the price that the algorithm needs to pay if the k-th prediction is wrong. Then, the price δ^(k) can be viewed as an indicator of the importance of predicting the k-th label correctly. Our label-weighting scheme follows this intuition by simply setting the weight of the k-th label to δ^(k). The label-weighting scheme with (14) is not only intuitive, but also enjoys a nice theoretical guarantee under a mild condition on c(·, ·), as shown in the following lemma.

y(k)pred. The difference allows the term δ(k)in (14) to quantify the price that the algorithm needs to pay if the k-th prediction is wrong. Then, the price δ(k)can be viewed as an indicator of importance for predicting the k-th label correctly. Our label-weighting scheme follows such intuition by simply setting the weight of k-th label as δ(k). The label-weighting scheme with (14) is not only intuitive, but also enjoys nice theoretical guarantee under a mild condition of c(·, ·), as shown in the following lemma.

Lemma 4 If c(y, ŷ_pred^(k)) − c(y, ŷ_real^(k)) ≥ 0 holds for any k, y and ŷ, then for any given y and ŷ, we have

c(y, \hat{y}) = \sum_{k=1}^{K} \delta^{(k)} \, [\![\, y[k] \neq \hat{y}[k] \,]\!].

The condition of the lemma, which generally holds for reasonable cost functions, simply says that for any label, a correct prediction should incur a cost no higher than an incorrect prediction.

The proof of the lemma can be found in Appendix A.3. Lemma 4 transforms c(y, ŷ) into a corresponding weighted Hamming loss, and thus enables the optimization over general cost functions. Note that the condition merely states that correcting a wrongly-predicted label never increases the cost, and is considered mild because common cost functions for MLC satisfy it.
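The decomposition (14) can be computed with a single pass over the labels, as in the sketch below; here cost stands for any example-based cost function taking two {+1, −1} label vectors, and the sketch is an illustration rather than code from the paper.

    import numpy as np

    def label_weights(y, y_hat, cost):
        """delta[k] = |c(y, y_pred^(k)) - c(y, y_real^(k))| as in (14)."""
        K = len(y)
        delta = np.zeros(K)
        for k in range(K):
            y_real = np.concatenate([y[:k + 1], y_hat[k + 1:]])    # correct up to label k
            y_pred = y_real.copy()
            y_pred[k] = -y[k]                                      # flip only the k-th label
            delta[k] = abs(cost(y, y_pred) - cost(y, y_real))
        return delta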

Next, we propose CS-DPP, which extends DPP based on our proposed label-weighting scheme. Define C as

C = \mathrm{diag}\big( \sqrt{\delta^{(1)}}, \ldots, \sqrt{\delta^{(K)}} \big).    (15)

With C, which carries the cost information, we establish a theorem similar to Theorem 1 to upper-bound c(y, ŷ).


Theorem 5 When making a prediction ŷ from x by ŷ = round(Pᵀr(x) + o) with any left orthogonal matrix P, if c(·, ·) satisfies the condition of Lemma 4, the prediction cost

c(y, \hat{y}) \leq \| r(x) - z_C \|_2^2 + \| (I - P^\top P)\, y'_C \|_2^2

where z_C = P y'_C and y'_C = C y − o with respect to any fixed reference point o.

Theorem 5 generalizes Theorem 1 to upper-bound the general cost c(y, ŷ) instead of the original Hamming loss c_HAM(y, ŷ). With Theorem 5, extending DPP to CS-DPP is a straightforward task by reusing the online updating algorithms of DPP with y_t replaced by C_t y_t. The full details of CS-DPP using PBT are given in Algorithm 1, and we can easily write down similar steps for CS-DPP using PBC. Note that we simplify W_t^{PBT} to W_t in Algorithm 1 to make a cleaner presentation.

Algorithm 1 Cost-Sensitive Dynamic Principal Projection with Principal Basis Transform

Parameters: λ, η, M
1: P_0 ← O_{M×K}, U_0 ← O_{K×K}, A_0^{-1} ← (1/λ) I_{d×d}, W_0 ← O_{d×M}   (O is the zero matrix)
2: while receive (x_t, y_t) do
3:   ŷ_t ← round(P_{t−1}ᵀ W_{t−1}ᵀ x_t)
4:   Obtain C_t by (15)
5:   Update U_{t−1} to U_t by Capped MSG (with C_t y_t) and sample P_t from Γ_t as defined in Lemma 2
6:   W'_{t−1} ← W_{t−1} P_{t−1} P_tᵀ   (PBT)
7:   Update W'_{t−1}, A_{t−1}^{-1} to W_t, A_t^{-1} by (13) (with C_t y_t)
8: end while
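For concreteness, a minimal Python rendering of one CS-DPP (PBT) iteration is sketched below. The helpers label_weights, sample_P and pbt_update refer to the earlier sketches, capped_msg_update stands for the Capped MSG step of Section 3.3 and is left abstract, and the whole block is an illustrative assembly under those assumptions, not the authors' reference implementation.

    import numpy as np

    def cs_dpp_step(x, y, P_prev, W_prev, A_inv, U_prev, cost, capped_msg_update):
        """One iteration of Algorithm 1 (CS-DPP with PBT)."""
        # Step 3: predict with the current basis and regressor
        y_hat = np.where(P_prev.T @ (W_prev.T @ x) >= 0, 1.0, -1.0)
        # Step 4: cost information as label weights, C_t = diag(sqrt(delta))
        delta = label_weights(y, y_hat, cost)
        y_weighted = np.sqrt(delta) * y                      # C_t y_t
        # Step 5: Capped MSG update and stochastic sampling of P_t (Lemma 2)
        U, Q, sigma = capped_msg_update(U_prev, y_weighted)
        P = sample_P(Q, sigma)
        # Steps 6-7: PBT basis transform followed by the rank-one update (13)
        W, A_inv = pbt_update(W_prev, A_inv, x, y_weighted, P_prev, P)
        return y_hat, P, W, A_inv, U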

5 Experiments

To empirically evaluate the performance, and also to study the effectiveness and necessity of design components of CS-DPP, we conduct three sets of experiments: (1) necessity justification of online LSDR, (2) experiments on basis drifting, and (3) experiments on cost-sensitivity. Furthermore, recall that the label weighting scheme of CS-DPP depends on the label order. We therefore conduct an additional set of experiments to study how different label orders affect the performance of CS-DPP. To assist the readers in understanding the experiments, we list the full names and acronyms of the algorithms to be compared along with their key differences in Table 3. The details of the algorithms will be illustrated as needed.

5.1 Experiments Setup

We conduct our experiments on eleven real-world datasets³ downloaded from Mulan [30].

Statistics of the datasets can be found in Table 4. In particular, the datasets eurlex-eurovec and delicious are used only in the experiment to justify the necessity of online LSDR, and only 7500 sub-sampled instances are used on these two datasets to reduce the computational burden of the competitors in the experiment. In addition, only 50000 sub-sampled instances are used for nuswide because a competitor in the cost-sensitivity experiment is rather computationally inefficient. Data streams are generated by permuting the datasets into different random orders. We perform the sub-sampling on eurlex-eurovec, delicious and nuswide after computing the permutation, so that each stream contains a different set of original instances for the three datasets.

All LSDR algorithms, except for the competitors run on delicious and eurlex-eurovec, are coupled with online ridge regression, and three different code space dimensions, M = 10%, 25%, and 50% of K, are considered. For DPP we fix λ = 1 and follow [1] to use the time-decreasing learning rate η = M/(2tK), and the parameters of the other algorithms will be elaborated along with their details in the corresponding sections. For the two larger datasets delicious and eurlex-eurovec, we implement both DPP and O-BR using gradient descent instead of online ridge regression for calculating W_t, where O-BR is the competitor that will be elaborated in Section 5.2. In particular, for the PBC variant of DPP we replace the update of the online ridge regressor (6) with online gradient descent, while for PBT we replace (13), the update after the basis transform, with a gradient descent update as well. Note that even with online ridge regression replaced by gradient descent, the ability of DPP with PBT or PBC to handle the basis drifting problem remains unchanged. We use the time-decreasing step size 1/t for gradient descent on delicious, and 0.001/t on eurlex-eurovec.

3 CAL500, emotions, scene, yeast, enron, Corel5k, mediamill, nuswide, medical, delicious and eurlex-eurovec


Table 3: Algorithms being compared in the experiments

acronym  | full name                                                                | dimension reduction | encode               | basis transform | decode             | cost-sensitivity
O-BR     | Online Binary Relevance                                                  | no                  | -                    | -               | -                  | no
O-CS     | Online Compressed Sensing                                                | yes                 | random (static)      | -               | compressed sensing | no
O-RAND   | Online Random Projection                                                 | yes                 | random (static)      | -               | pseudo inverse     | no
DPP-PBC  | Dynamic Principal Projection (DPP) with Principal Basis Correction       | yes                 | online PCA (dynamic) | exact           | PCA                | no
DPP-PBT  | Dynamic Principal Projection (DPP) with Principal Basis Transform (PBT)  | yes                 | online PCA (dynamic) | approximate     | PCA                | no
CS-DPP   | Cost-Sensitive DPP (with PBT)                                            | yes                 | online PCA           | approximate     | PCA                | yes

Table 4: Statistics of datasets (*: sub-sampled)

dataset          # of features   # of labels   # of instances   cardinality
CAL500           68              174           502              26.044
Corel5k          499             374           5000             3.522
emotions         72              6             593              1.869
enron            1001            53            1702             3.378
mediamill        120             101           43907            4.376
medical          1449            45            978              1.245
scene            294             6             2407             1.074
yeast            103             14            2417             4.237
nuswide          128             81            50000*           1.869
delicious        500             983           7500*            19.020
eurlex-eurovec   5000            3993          7500*            5.310




We consider four different cost functions: Hamming loss, Normalized rank loss, F1 loss and Accuracy loss.

c_{\mathrm{HAM}}(y, \hat{y}) = \frac{1}{K} \sum_{k=1}^{K} [\![\, y[k] \neq \hat{y}[k] \,]\!]

c_{\mathrm{NR}}(y, \hat{y}) = \operatorname*{average}_{y[i] > y[j]} \Big( [\![\, \hat{y}[i] < \hat{y}[j] \,]\!] + \tfrac{1}{2} [\![\, \hat{y}[i] = \hat{y}[j] \,]\!] \Big)

c_{\mathrm{F1}}(y, \hat{y}) = 1 - 2 \left( \sum_{k=1}^{K} [\![\, y[k] = +1 \text{ and } \hat{y}[k] = +1 \,]\!] \right) \Big/ \left( \sum_{k=1}^{K} \big( [\![\, y[k] = +1 \,]\!] + [\![\, \hat{y}[k] = +1 \,]\!] \big) \right)

c_{\mathrm{ACC}}(y, \hat{y}) = 1 - \left( \sum_{k=1}^{K} [\![\, y[k] = +1 \text{ and } \hat{y}[k] = +1 \,]\!] \right) \Big/ \left( \sum_{k=1}^{K} [\![\, y[k] = +1 \text{ or } \hat{y}[k] = +1 \,]\!] \right)

The performances of different algorithms are compared using the average cumulative cost (1/t) Σ_{i=1}^{t} c(y_i, ŷ_i) at each iteration t. We remark that a lower average cumulative cost implies better performance. We report the average results of each experiment over 15 repetitions.

5.2 Necessity of Online LSDR

In this experiment, we aim to justify the necessity of addressing LSDR for OMLC problems. We demonstrate that the ability of LSDR to preserve the key joint correlations between labels can be helpful when facing (1) data with noisy labels or (2) data with a large possible set of labels, both of which are often encountered in real-world OMLC problems. We compare DPP with online Binary Relevance (O-BR), which is a naïve extension of binary relevance [28] with an online ridge regressor. The only difference between DPP and O-BR is whether the algorithm incorporates LSDR.

We first compare DPP and O-BR on data with noisy labels. We generate noisy data streams by randomly flipping each positive label y[i] = 1 to negative with probability p ∈ {0.3, 0.5, 0.7}, which simulates the real-world scenario in which human annotators fail to tag the existing labels. We plot the results of O-BR and DPP with M = 10%, 25% and 50% of K on the datasets emotions and enron with respect to the Hamming loss and F1 loss in Figure 1, which contains error bars that represent the standard error of the average results. The standard errors are naturally larger when M is larger or when t (the number of iterations) is small, but in general for M ≥ 25% · K and for t ≥ 400 the standard errors are small enough to justify the differences. The complete results are listed in Appendix B.1.
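The noisy streams can be generated as in the small sketch below, which flips each positive label to negative independently with probability p; this mirrors the description above and is not the authors' exact script.

    import numpy as np

    def add_label_noise(Y, p, rng=np.random.default_rng(0)):
        """Flip each +1 entry of Y (shape (N, K), values in {+1, -1}) to -1 with probability p."""
        flip = (Y == 1) & (rng.random(Y.shape) < p)
        Y_noisy = Y.copy()
        Y_noisy[flip] = -1
        return Y_noisy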

Fig. 1: DPP vs. O-BR on noisy labels

The results from the first two rows of Figure 1 show that DPP with M = 10% of K performs competitively with, and even better than, O-BR as p increases on the dataset emotions. The results from the last two rows of Figure 1 show that DPP always performs better on enron. We can also observe from Figure 1 that DPP with smaller M tends to perform better as p increases. The above results clearly demonstrate that DPP better resists the effect of noisy labels with its incorporation of LSDR as the noise level p increases. The observation that DPP with smaller M tends to perform better further suggests that DPP gains its robustness to noise by preserving the key joint correlations between labels with LSDR.

Next, we demonstrate that LSDR is also helpful for handling data with a large label set.

We compare O-BR with DPP coupled with either PBC or PBT on the datasets delicious and eurlex-eurovec.⁴ DPP uses M = 10 for delicious and M = 25 for eurlex-eurovec. We summarize the results and average run-time in Table 5. Table 5 indicates that DPP coupled with either PBT or PBC performs competitively with O-BR, while DPP with PBT enjoys a significantly cheaper computational cost. The results demonstrate that DPP enjoys more effective and efficient learning than O-BR for data with a large label set, and also justify the advantage of PBT over PBC in terms of efficiency when K and d are large while M is relatively small, as previously highlighted in Section 3.

4 delicious: d=500, K=983; eurlex-eurovec: d=5000, K=3993.

Table 5: DPP vs. O-BR on large datasets

Dataset            delicious                          eurlex-eurovec
Algorithm          PBT      PBC      O-BR            PBT      PBC        O-BR
cHAM               0.1136   0.1153   0.1245          0.4917   0.5011     0.4993
cNR                0.5636   0.5641   0.5756          0.7435   0.7467     0.7433
cF1                0.9143   0.9138   0.9076          0.9972   0.9928     0.9921
cACC               0.9512   0.9517   0.9494          0.9980   0.9964     0.9958
Avg. time (sec)    21.49    140.77   105.18          60.81    10522.25   4841.35

Fig. 2: PBC vs. PBT vs. None, M = 10% of K

5.3 Experiments on Basis Drifting

To empirically justify the necessity of handling basis drifting, we compare variants of DPP that (a) incorporate PBC via (6), (b) incorporate PBT via (13), and (c) neglect basis drifting as in (5). We plot the results for the Hamming loss with M = 10% of K in Figure 2 on six datasets, and report the complete results in Appendix B.2. The results on all datasets in Figure 2 show that DPP with either PBC or PBT significantly improves the performance over the variant that neglects basis drifting, which clearly demonstrates the necessity of handling the drifting of the projection basis.

Further comparison of PBC and PBT based on Figure 2 reveals that PBC in general performs slightly better than PBT, reflecting its advantage of exact projection basis correction.

Nevertheless, as discussed in Section 5.2, PBT enjoys a nice computational speedup when K and d are large and M is relatively small, making PBT more suitable to handle data with a large label set.

5.4 Experiments on Cost-Sensitivity

To empirically justify the necessity of cost-sensitivity, we compare CS-DPP using PBT with DPP using PBT and other online LSDR algorithms. To the best of our knowledge, no online LSDR algorithm has yet been proposed in the literature. We therefore design two simple online LSDR algorithms, online compressed sensing (O-CS) and online random projection (O-RAND), to compare with CS-DPP. O-CS is a straightforward extension of CS [13] with an online ridge regressor, and we follow [13] to determine the parameter of O-CS. O-RAND encodes using a random matrix P_R and simply decodes with the corresponding pseudo-inverse of P_R.

Fig. 3: CS-DPP vs. Others, M = 10% of K

We plot the results with respect to all evaluation criteria except for the Hamming loss with M = 10% of K in Figure 3 on three datasets, and report the complete results in Appendix B.3. Note that the results for CS-DPP here are obtained by using the original label order from each dataset.

5.4.1 CS-DPP versus DPP.

The results of Figure 3 clearly indicate that CS-DPP performs significantly better than DPP on all evaluation criteria other than the Hamming loss, while CS-DPP reduces to DPP when c_HAM(·, ·) is used as the cost function. These observations demonstrate that CS-DPP, by optimizing the given cost function instead of the Hamming loss, indeed achieves cost-sensitivity and is superior to its cost-insensitive counterpart, DPP.


Table 6: Results of CS-DPP on CAL500 with 50 random label orders

        M = 10% of K        M = 25% of K        M = 50% of K
cHAM    0.1458 ± 0.00019    0.1489 ± 0.00012    0.1503 ± 0.00008
cNR     0.1247 ± 0.00224    0.1321 ± 0.00210    0.1371 ± 0.00222
cF1     0.5914 ± 0.00108    0.5956 ± 0.00110    0.5949 ± 0.00101
cACC    0.7388 ± 0.00105    0.7428 ± 0.00131    0.7426 ± 0.00126

Table 7: Results of CS-DPP on yeast with 50 random label orders

        M = 10% of K        M = 25% of K        M = 50% of K
cHAM    0.2296 ± 0.00010    0.2162 ± 0.00009    0.2092 ± 0.00001
cNR     0.0064 ± 0.00081    0.0170 ± 0.00242    0.0232 ± 0.00158
cF1     0.4518 ± 0.00919    0.3841 ± 0.00199    0.3784 ± 0.00107
cACC    0.5448 ± 0.02252    0.4971 ± 0.00379    0.4901 ± 0.00124

Table 8: Results of CS-DPP on enron with 50 random label orders

        M = 10% of K        M = 25% of K        M = 50% of K
cHAM    0.0562 ± 0.00020    0.0600 ± 0.00011    0.0632 ± 0.00009
cNR     0.1432 ± 0.00333    0.1364 ± 0.00244    0.1305 ± 0.00216
cF1     0.5421 ± 0.00334    0.5392 ± 0.00291    0.5428 ± 0.00293
cACC    0.6573 ± 0.00360    0.6561 ± 0.00331    0.6627 ± 0.00315

5.4.2 CS-DPP versus Other Online LSDR Algorithms.

As shown in Figure 3, while DPP generally performs better than O-CS and O-RAND because of its advantage of preserving the key label correlations rather than random ones, it can nevertheless be inferior on some datasets with respect to specific cost functions due to its cost-insensitivity. For example, DPP loses to O-RAND on the dataset Corel5k with respect to the Normalized rank loss, as shown in the third row of Figure 3. CS-DPP overcomes this weakness of DPP with its cost-sensitivity, and significantly outperforms O-CS and O-RAND on all three datasets with respect to all three evaluation criteria, as demonstrated in Figure 3. The superiority of CS-DPP justifies the necessity of taking cost-sensitivity into account.

5.5 Experiment on Effect of Label Order for CS-DPP

The goal of this experiment is to study how different label orders affect the performance of CS-DPP, as our proposed label-weighting scheme with (14) is label-order-dependent. To evaluate the impact of label orders, we run CS-DPP with 50 randomly generated label orders and M = 10%, 25% and 50% of K on each dataset. The permutation of each dataset is fixed to the original one given in Mulan [30], which allows the variance of the performance to better indicate the effect of different orders.

We summarize the results for all four cost functions with means and standard deviations on the datasets CAL500, yeast and enron in Tables 6, 7 and 8 respectively, and report the complete results in Appendix B.4. Note that the results for the Hamming loss are unaffected by the order of labels, and the reported deviation is due to the randomness from P_t. From the results of Tables 6, 7 and 8, we see that the standard deviation is generally on a relatively small scale of 10^{-3}, indicating that the performance of CS-DPP is not that sensitive to the order of labels. Closer inspection of Table 7 reveals that the standard deviation of c_ACC on yeast with
