Multi-label Classification Using Error-correcting Codes of Hard or Soft Bits

Chun-Sung Ferng, and Hsuan-Tien Lin, Member, IEEE

Abstract—We formulate a framework for applying error-correcting codes (ECC) to multi-label classification problems. The framework treats some base learners as noisy channels and uses the ECC to correct the prediction errors made by the learners. An immediate use of the framework is a novel ECC-based explanation of the popular random k-label-sets (RAKEL) algorithm using a simple repetition ECC. Using the framework, we empirically compare a broad spectrum of off-the-shelf ECC designs for multi-label classification. The results not only demonstrate that RAKEL can be improved by applying some stronger ECC, but also show that the traditional Binary Relevance approach can be enhanced by learning more parity-checking labels. Our study of different ECC designs also helps in understanding the trade-off between the strength of the ECC and the hardness of the base learning tasks. Furthermore, we extend our study to ECC with either hard (binary) or soft (real-valued) bits by designing a novel decoder. We demonstrate that the decoder improves the performance of our framework.

Index Terms—Multi-label classification, error-correcting codes.

I. INTRODUCTION

MULTI-LABEL classification is an extension of traditional multi-class classification. In particular, the latter aims at accurately associating one single label with an instance, while the former aims at associating a label set. Because of increasing application needs in domains like music categorization [1] and scene analysis [2], multi-label classification has attracted much research attention in recent years.

The error-correcting code (ECC) has its roots in the information-theoretic pursuit of communication [3]. In particular, the ECC studies how to accurately recover a desired signal block after transmitting the block's encoding through a noisy communication channel. When the desired signal block is the single label (of some instance) and the noisy channel consists of some binary classifiers, it has been shown that a suitable use of the ECC can improve the association (prediction) accuracy of multi-class classification [4]. Several designs, including some classic ECC [4] and some adaptively constructed ECC [5], [6], have reached promising empirical performance for multi-class classification.

While the benefits of the ECC are well established for multi-class classification, the corresponding use for multi-label classification remains an ongoing research direction. [7] takes the first step in this direction by proposing a multi-label classification approach that applies a classic ECC, the Bose-Chaudhuri-Hocquenghem (BCH) code. The work is followed by some extensions to the convolutional code [8]. Although the approach shows some good experimental results over existing multi-label classification approaches, a more rigorous study is still needed to understand the advantages and disadvantages of different ECC designs for multi-label classification; such a study is the main focus of this work.

C.-S. Ferng and H.-T. Lin are with the Department of Computer Science and Information Engineering, National Taiwan University, Taiwan, e-mail: {r99922054, htlin}@csie.ntu.edu.tw.

Manuscript received August ??, 2012; revised ??.

In this work, we formalize the framework for applying the ECC on multi-label classification. The framework is more general than both existing ECC studies for multi-class classification [4] and for multi-label classification [7]. Then, we conduct a thorough study with a broad spectrum of classic ECC designs: the repetition code, the Hamming code, the BCH code, and the low-density parity-check code. The four designs cover a spectrum from the simplest ECC idea to the state-of-the-art ECC in communication systems. Interestingly, such a framework allows us to give a novel ECC-based explanation of the random k-label-sets (RAKEL) algorithm, which is popular for multi-label classification. In particular, RAKEL can be viewed as a special type of repetition code coupled with a batch of simple and internal multi-label classifiers.

We empirically demonstrate that RAKEL can be improved by replacing its repetition code with the Hamming code, a slightly stronger ECC. Furthermore, even better performance can be achieved when replacing the repetition code with the BCH code. When compared with the traditional Binary Relevance (BR) approach without the ECC, multi-label classification with the ECC can perform significantly better. The empirical results justify the validity of the ECC framework.

In addition, we design a new decoder for linear ECC by using multiplications to approximate exclusive-OR operations.

This decoder is able to handle not only ordinary binary bits from the channels, called hard inputs, but also real-valued bits, called soft inputs. For multi-label classification using the ECC, the soft inputs can be used to represent the confidence of the internal classifiers. Our newly designed decoder allows a proper use of the detailed confidence information to produce more accurate predictions. The experimental results show that this decoder indeed improves the performance of the ECC framework with soft inputs.

The paper is organized as follows. First, we introduce the multi-label classification problem in Section I-A, and present related works in Section I-B. Section II illustrates the framework and demonstrates its effectiveness. Section III presents a new decoder for hard or soft inputs. We empirically study the performance of the framework and the proposed decoder in Section IV. Finally we conclude in Section V.


A short version of the paper appeared in the 2011 Asian Conference on Machine Learning [9]. The paper was then enriched by the novel decoder for dealing with soft bits, the comparison with other ECC designs, and broader experiments.

The paper is also the core of the first author’s M.S. thesis [10].

A. Problem Setup

Multi-label classification aims at mapping an instance x ∈ R^d to a label set Y ⊆ L = {1, 2, . . . , K}, where K is the number of classes. Following the hypercube view of [11], the label set Y can be represented as a binary vector y of length K, where y[i] is 1 if the i-th label is in Y, and 0 otherwise. Consider a training dataset D = {(x_n, y_n)}_{n=1}^N. A multi-label classification algorithm uses D to locate a multi-label classifier h : R^d → {0, 1}^K such that h(x) predicts y well on future test examples (x, y).
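As a small worked example (with made-up numbers, not from the datasets used later): for K = 5 classes and the label set Y = {1, 3}, the hypercube representation is

$$y = (1, 0, 1, 0, 0) \in \{0, 1\}^5 .$$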

There are several loss functions for evaluating whether ỹ = h(x) predicts y well. Two common ones are:

• subset 0/1 loss: $\Delta_{0/1}(\tilde{y}, y) = [\![\, \tilde{y} \neq y \,]\!]$.

• Hamming loss: $\Delta_{HL}(\tilde{y}, y) = \frac{1}{K} \sum_{i=1}^{K} [\![\, \tilde{y}[i] \neq y[i] \,]\!]$.

[12] show that the two loss functions focus on different statistics of the underlying probability distribution from a Bayesian perspective. While a wide range of other loss functions exist [13], in this paper we only focus on the 0/1 and Hamming losses because they connect tightly with the ECC framework that will be discussed.¹
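As a concrete illustration of the two loss functions (a minimal sketch, not code from the paper; the toy label matrices below are made up):

```python
import numpy as np

def subset_01_loss(Y_pred, Y_true):
    """Fraction of examples whose predicted label vector differs from the
    true one in at least one position (subset 0/1 loss)."""
    return np.mean(np.any(Y_pred != Y_true, axis=1))

def hamming_loss(Y_pred, Y_true):
    """Average fraction of label positions that are predicted incorrectly."""
    return np.mean(Y_pred != Y_true)

# Toy example with K = 4 labels and 3 instances.
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]])
print(subset_01_loss(Y_pred, Y_true))  # 2/3: two instances are not exact matches
print(hamming_loss(Y_pred, Y_true))    # 2/12 = 1/6: two wrong bits out of twelve
```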

B. Related Works

The hypercube view [11] unifies many existing problem transformation approaches [13], which transform multi-label classification into one or more reduced learning tasks. For instance, one simple problem transformation approach is called Binary Relevance (BR), which learns one binary classifier per individual label. Another simple problem transformation approach is called Label Powerset (LP), which transforms multi-label classification into one multi-class classification task with a huge number of extended labels. One popular problem transformation approach that lies between BR and LP is called random k-label sets (RAKEL) [13], which transforms multi-label classification into many multi-class classification tasks with a smaller number of extended labels.

Some existing problem transformation approaches focus on compressing the label-set vector y [11], [14], i.e., removing the redundancy within the binary signals (label sets) to form shorter codewords, which follows a classic task in information theory based on Shannon's first theorem [3]. Another classic task in information theory aims at expansion: adding redundancy to the (longer) codewords to ensure robust decoding against noise contamination. The power of expansion is characterized by Shannon's second theorem [3]. The error-correcting code (ECC) targets using the power of expansion systematically. In particular, the ECC works by encoding a block of signals into a longer codeword b before passing it through the noisy channel and then decoding the received codeword b̃ back to the block appropriately. Then, under some assumptions [15], the block can be perfectly recovered, resulting in zero block-decoding error; in some cases, the block can only be almost perfectly recovered, resulting in a few bit-decoding errors.

¹We follow the final remark of [12] and only focus on the loss functions that are related to our algorithmic goals.

If we take the "block" as the label set y of every example (x, y) and a batch of base learners as a channel that outputs the contaminated block b̃, then the block-decoding error corresponds to ∆_0/1 while the bit-decoding error corresponds to a scaled version of ∆_HL. Such a correspondence motivates us to study whether suitable ECC designs can be used to improve multi-label classification, which will be formalized in Section II.

Most of the commonly used ECC in communication systems are binary ECC. That is, the codeword b is a binary vector. We are going to apply this kind of ECC on multi-label classification, and will review some of them in Section II-A.

Another kind of ECC is the real-valued ECC, which uses real-valued vectors as the codewords. [16] and [17] take this direction and design special encoding and decoding functions for multi-label classification. [16] uses canonical correlation analysis to find the most linearly-predictable transform of the original labels; [17] uses metric learning to locate a good encoding function. Both works take approximate Bayesian inference for decoding.

II. ML-ECC FRAMEWORK

We now describe the proposed ECC framework in detail.

The main idea is to use an ECC encoder enc(·) : {0, 1}^K → {0, 1}^M to expand the original label set y ∈ {0, 1}^K to a codeword b ∈ {0, 1}^M that contains redundant information. Then, instead of learning a multi-label classifier h(x) between x and y, we learn a multi-label classifier h̃(x) between x and the corresponding b. In other words, we transform the original multi-label classification problem into another (larger) multi-label classification task. During prediction, we use h(x) = dec ∘ h̃(x), where dec(·) : {0, 1}^M → {0, 1}^K is the corresponding ECC decoder, to get a multi-label prediction ỹ ∈ {0, 1}^K. The simple steps of the framework are as follows:

Parameter: an ECC with encoder enc(·) and decoder dec(·); a base multi-label learner Ab

Training: Given D = {(x_n, y_n)}_{n=1}^N,
1) ECC-encode each y_n to b_n = enc(y_n);
2) Return h̃ = Ab({(x_n, b_n)}).

Prediction: Given any x,
1) Predict a codeword b̃ = h̃(x);
2) Return h(x) = dec(b̃) by ECC-decoding.
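These two steps can be sketched as a thin wrapper (a minimal illustration only, not the authors' implementation; the encode, decode, and base_learner callables stand in for a concrete ECC and the base learner Ab):

```python
import numpy as np

class MLECC:
    """Sketch of the ML-ECC framework: encode label sets into codewords,
    train a base multi-label learner on the codewords, and decode the
    predicted codewords back to label sets."""

    def __init__(self, encode, decode, base_learner):
        self.encode = encode              # enc: {0,1}^K -> {0,1}^M
        self.decode = decode              # dec: {0,1}^M -> {0,1}^K
        self.base_learner = base_learner  # Ab, assumed to expose fit/predict

    def fit(self, X, Y):
        B = np.array([self.encode(y) for y in Y])    # b_n = enc(y_n)
        self.h_tilde = self.base_learner.fit(X, B)   # learn h~ on the codewords
        return self

    def predict(self, X):
        B_pred = self.h_tilde.predict(X)             # b~ = h~(x)
        return np.array([self.decode(b) for b in B_pred])  # h(x) = dec(b~)
```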

This algorithm is simple and general. It can be coupled with any block-coding ECC and any base learner Ab to form a new multi-label classification algorithm. For instance, the ML-BCHRF method [7] uses the BCH code (see Subsection II-A3) as the ECC and BR on Random Forest as the base learner Ab. Note that [7] did not describe why ML-BCHRF may lead to improvements in multi-label classification. Next, we show a simple theorem that connects the ECC framework with ∆_0/1.


Many ECC designs guarantee the correction of up to m bit-flipping errors in a codeword of length M. We will introduce some of those ECC in Section II-A. Then, if ∆_HL of h̃ is low, the ECC framework guarantees that ∆_0/1 of h is low. The guarantee is formalized as follows.

Theorem 1: Consider an ECC that can correct up to m bit errors in a codeword of length M. Then, for any T test examples {(x_t, y_t)}_{t=1}^T, let b_t = enc(y_t). If

$$\Delta_{HL}(\tilde{h}) = \frac{1}{T} \sum_{t=1}^{T} \Delta_{HL}\big(\tilde{h}(x_t), b_t\big) \le \epsilon,$$

then h = dec ∘ h̃ satisfies

$$\Delta_{0/1}(h) = \frac{1}{T} \sum_{t=1}^{T} \Delta_{0/1}\big(h(x_t), y_t\big) \le \frac{M\epsilon}{m+1}.$$

Proof: When the average Hamming loss of h̃ is at most ε, h̃ makes at most εTM bits of error over all b_t. Since the ECC corrects up to m bits of error in one b_t, an adversary has to make at least m + 1 bits of error on b_t to make h(x_t) different from y_t. The number of such b_t can thus be at most εTM/(m + 1), and so ∆_0/1(h) is at most εTM / (T(m + 1)) = Mε/(m + 1).
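To make the bound concrete (the numbers here are purely illustrative and not taken from the experiments): with M = 127 codeword bits, a code that corrects m = 15 bit errors, and an average codeword Hamming loss of ε = 0.02, Theorem 1 gives

$$\Delta_{0/1}(h) \;\le\; \frac{M\epsilon}{m+1} \;=\; \frac{127 \times 0.02}{16} \;\approx\; 0.159 .$$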

From Theorem 1, it appears that we should simply use some stronger ECC, for which m is larger. Nevertheless, note that we are applying the ECC in a learning scenario. Thus, ε is not a fixed value, but depends on whether Ab can learn well from the transformed dataset D̃. A stronger ECC usually contains redundant bits that come from complicated compositions of the original bits in y, and the compositions may not be easy to learn. This trade-off has been revealed when applying the ECC to multi-class classification [6]. Next, we study ECC of different strength and empirically verify the trade-off in Section IV.

A. Review of Classic ECC

Next, we review four ECC designs that will be used in the empirical study. The four designs cover a broad spectrum of practical choices in terms of strength.

1) Repetition Code: One of the simplest ECC designs is the repetition code (REP) [15], for which every bit in y is repeated ⌊M/K⌋ or ⌊M/K⌋ + 1 times in b during encoding. The decoding takes a majority vote using the received copies of each bit. Because of the majority vote, the repetition code corrects up to $m_{REP} = \lfloor \frac{M}{2K} - \frac{1}{2} \rfloor$ bit errors in b. We will discuss the connection between REP and the RAKEL algorithm in Section II-B.

2) Hamming on Repetition Code: A slightly more complicated ECC than REP is the Hamming code (HAM) [18], which can correct m_HAM = 1 bit error in b by adding some parity-check bits (exclusive-OR operations of some bits in y). One typical choice of HAM is HAM(7, 4), which encodes any y with K = 4 to a b with M = 7. Note that m_HAM = 1 is worse than m_REP when M is large. Thus, we consider applying HAM(7, 4) on every 4 (permuted) bits of REP. That is, to form a codeword b of M bits from a block y of K bits, we first construct an REP of 4⌊M/7⌋ + (M mod 7) bits from y; then, for every 4 bits in the REP, we add 3 parity bits to b using HAM(7, 4). The resulting code will be named Hamming on Repetition (HAMR). During decoding, the decoder of HAM(7, 4) is first used to recover the 4-bit sub-blocks of the REP; then the decoder of REP (majority vote) takes place.

It is not hard to compute m_HAMR by analyzing the REP and HAM parts separately. When M is a multiple of 7, $m_{HAMR} = 2 \cdot \lfloor \frac{2M}{7K} - \frac{1}{2} \rfloor$, which is generally better than m_REP, especially when M/K is large. Thus, HAMR is slightly stronger than REP for ECC purposes. We include HAMR in our study to verify whether a simple inclusion of some parity bits for the ECC can readily improve the performance for multi-label classification.
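The HAMR construction above can be sketched as follows (an illustration only; the HAM(7, 4) parity equations, the bit ordering, and the handling of the leftover M mod 7 bits are our own choices, since the text does not pin them down):

```python
import numpy as np

def rep_encode(y, length):
    """Repeat the K label bits round-robin until `length` bits are produced."""
    K = len(y)
    return np.array([y[i % K] for i in range(length)], dtype=int)

def hamr_encode(y, M):
    """HAMR: build a repetition code of 4*floor(M/7) + (M mod 7) bits, then
    append 3 HAM(7,4) parity bits for every 4 repetition bits."""
    rep_len = 4 * (M // 7) + (M % 7)
    rep = rep_encode(y, rep_len)
    out = list(rep[: M % 7])            # leftover bits kept as plain repetitions
    for i in range(M % 7, rep_len, 4):
        d1, d2, d3, d4 = rep[i:i + 4]
        p1 = d1 ^ d2 ^ d4               # one common choice of HAM(7,4) parities
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        out += [d1, d2, d3, d4, p1, p2, p3]
    return np.array(out, dtype=int)

# Toy example: K = 4 labels encoded into an M = 14-bit HAMR codeword.
print(hamr_encode(np.array([1, 0, 1, 1]), M=14))
```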

3) Bose-Chaudhuri-Hocquenghem Code: The BCH code [19], [20] can be viewed as a sophisticated extension of HAM that allows correcting multiple bit errors. BCH with length M = 2^p − 1 has (M − K) parity bits, and it can correct $m_{BCH} = \frac{M - K}{p}$ bits of error [15], which is in general stronger than REP and HAMR. The caveat is that the decoder of BCH is more complicated than those of REP and HAMR.

We include BCH in our study because it is one of the most popular ECCs in real-world communication systems. In addition, we compare BCH with HAMR to see if a strong ECC can do better for multi-label classification.

4) Low-density Parity-check Code: The low-density parity-check code (LDPC) [15] has recently drawn much research attention in communications. LDPC reveals an interesting connection between the ECC and Bayesian learning [15]. While it is difficult to state the strength of LDPC in terms of a single m_LDPC, LDPC has been shown to approach the theoretical limit in some special channels [21], which makes it a state-of-the-art ECC. We include LDPC in our study to see whether it is worthwhile to go beyond BCH with more sophisticated encoders and decoders.

B. ECC View of RAKEL

RAKEL is a multi-label classification algorithm proposed by [13]. Define a k-label set as a size-k subset of L. Each iteration of RAKEL randomly selects a (different) k-label set and builds a multi-label classifier on the k labels with a Label Powerset (LP). After running for R iterations, RAKEL obtains a size-R ensemble of LP classifiers. The prediction on each label is made by a majority vote of the classifiers associated with that label.

Equivalently, we can draw (with replacement) M = Rk labels first before building the LP classifiers. Then, selecting k-label sets is equivalent to encoding y by a variant of REP, which will be called the RAKEL repetition code (RREP). Similar to REP, each bit y[i] is repeated several times in b, since label i is involved in several k-label sets. After encoding y to b, each LP classifier, called a k-powerset, acts as a sub-channel that transmits a size-k sub-block of the codeword b. The prediction procedure follows the decoder of the usual REP.
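The RREP view of RAKEL can be sketched as follows (a simplified illustration, not the authors' code: here the k-label sets may repeat across iterations, ties in the majority vote go toward 1, and labels that are never drawn default to 0):

```python
import numpy as np

def rrep_positions(K, R, k, rng):
    """Draw R random k-label sets and record which original label each of
    the M = R*k codeword bits copies."""
    return np.concatenate([rng.choice(K, size=k, replace=False) for _ in range(R)])

def rrep_encode(y, positions):
    """Each codeword bit is a copy of one label bit."""
    return y[positions]

def rrep_decode(b_pred, positions, K):
    """Majority vote over all copies of each label."""
    votes_for = np.bincount(positions, weights=b_pred, minlength=K)
    counts = np.bincount(positions, minlength=K)
    return np.where(counts > 0, votes_for * 2 >= counts, 0).astype(int)

rng = np.random.default_rng(0)
pos = rrep_positions(K=6, R=10, k=3, rng=rng)   # M = 30 codeword bits
y = np.array([1, 0, 0, 1, 0, 1])
b = rrep_encode(y, pos)                         # noiseless "channel"
print(rrep_decode(b, pos, K=6))                 # recovers y when every label is covered
```

In the actual framework, b would instead be predicted bit by bit by the k-powerset sub-channels, and the same majority-vote decoding applies.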

The ECC view above decomposes the original RAKEL into two parts: the ECC and the base learner Ab. We will empirically study how the two parts affect the performance of multi-label classification in Section IV.

C. Comparison to Real-valued Codewords

As mentioned in Section I-B, [16] and [17] propose using real-valued ECC for multi-label classification. There are two major differences between the real-valued ECC and the binary ECC. In terms of codewords, the real-valued codewords of [16] and [17] are linear combinations of labels. On the other hand, the binary codewords introduced in Section II-A are parity checks (exclusive-OR) of labels, which are linear combinations of labels under the Galois field GF(2).

In terms of base learners, the transformed learning tasks of real-valued codewords are regression instead of classification.

In [16] and [17], the regression task of each codeword bit is learned separately, which is like applying the BR base learner.

On the other hand, the transformed learning task of binary codewords can be learned with other base learners such as k-powerset.

III. GEOMETRIC DECODER FOR LINEAR ECC

The real-valued codewords motivate us to design a new decoder for binary ECC in the ML-ECC framework. The goal of the new decoder is to utilize the channel measurement information, i.e., the real-valued confidence that each bit is 1. Such information is available to the decoder when using the BR base learner (channel) with probabilistic outputs. The real-valued confidence may help improve the performance of the ML-ECC framework by focusing more on the highly confident bits.

The off-the-shelf ECC decoders usually do not exploit the channel measurement information, but instead take advantage of the algebraic structure of the ECC to locate possible bit errors. From the hypercube view, they decode a vertex of {0, 1}^M (a binary prediction on codewords) to a vertex of {0, 1}^K (a binary prediction on labels). The proposed decoder, on the contrary, utilizes the information and takes the geometry of the hypercube into account to perform interior-to-interior decoding from [0, 1]^M to [0, 1]^K. That is, the proposed decoder is a soft-input soft-output decoder. The soft input bits contain the channel measurement information, and the value of each bit represents the confidence in the bit being 1. The soft predictions of labels are the confidence in whether each label is present. For evaluation, the soft predictions are then rounded to {0, 1}^K as in [11]. We call the proposed decoder the geometric decoder, and the off-the-shelf decoders algebraic decoders.

Here, we focus on linear codes, whose encoding function can be written as a matrix-vector multiplication under the Galois field GF(2). The repetition code, Hamming code, BCH code, and LDPC code are all linear codes. Let G be the generating matrix of a linear code, with g_ij ∈ {0, 1}. The encoding is done by b = enc(y) = G · y (mod 2), or equivalently we may write the formula in terms of exclusive-OR (XOR) operations:

$$b_i = \bigoplus_{j : g_{ij} = 1} y_j .$$

That is, the codeword bit b_i is the result of the XOR of some label bits y_j. The XOR operations are equivalent to multiplications if we map 1 → −1 and 0 → 1. By defining b̂_i = 1 − 2b_i and ŷ_j = 1 − 2y_j, the encoding can also be written as

$$\hat{b}_i = \prod_{j : g_{ij} = 1} \hat{y}_j .$$

We denote this form as multiplication encoding.
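The equivalence between XOR and multiplication under the mapping 1 → −1, 0 → 1 can be checked numerically (a toy generator matrix of our own, not one from the paper):

```python
import numpy as np

# Toy generator matrix of a linear code: M = 4 codeword bits, K = 3 labels.
G = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 1, 0],
              [1, 1, 1]])
y = np.array([1, 0, 1])

b_xor = G.dot(y) % 2                                      # b_i = XOR of labels with g_ij = 1
b_hat = np.prod(np.where(G == 1, 1 - 2 * y, 1), axis=1)   # multiplication encoding
print(b_xor, (1 - b_hat) // 2)                            # both give [1 1 1 0]
```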

It is difficult to generalize the XOR operation from binary to real values, but multiplication by itself can be defined on real values. We take this advantage and use it to form our geometric decoder. The geometric decoder finds the ỹ that minimizes the L2 distance between b̃ and the multiplication encoding of ỹ:

$$\mathrm{dec}_{geo}(\tilde{b}) = \operatorname*{argmin}_{\tilde{y} \in [0,1]^K} \; \sum_{i=1}^{M} \Big( (1 - 2\tilde{b}_i) - \prod_{j : g_{ij} = 1} (1 - 2\tilde{y}_j) \Big)^2 .$$

Note that the squared L2 distance between codewords is an approximation of the Hamming distance in the binary space {0, 1}^M.
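For two binary codewords the relation is in fact exact up to a constant factor (a quick check, not spelled out in this form in the text): each coordinate of (1 − 2b̃) − (1 − 2b) is 0 where the bits agree and ±2 where they differ, so

$$\sum_{i=1}^{M} \big( (1 - 2\tilde{b}_i) - (1 - 2b_i) \big)^2 \;=\; 4 \sum_{i=1}^{M} [\![\, \tilde{b}_i \neq b_i \,]\!],$$

i.e., four times the Hamming distance; the approximation only enters once the soft bits or the product terms move into the interior of [0, 1].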

For the repetition code, since only one y_j is considered for each b_i, the optimal solution is simply the average of the predictions over the copies of each label. However, for general linear codes it is difficult to find the global optimum, since the optimization problem may not be convex. Instead, we may apply a variant of coordinate descent to find a local minimum. That is, in each step we optimize only one ỹ_j while fixing the other coordinates. To optimize one ỹ_j, we only have to solve a second-order single-variable optimization problem, which enjoys an efficient analytic solution.
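A minimal sketch of this coordinate-descent decoder is given below (the random initialization, the number of sweeps, and the update order are our own choices; the text does not specify them):

```python
import numpy as np

def geometric_decode(b_soft, G, n_sweeps=20, seed=0):
    """Coordinate-descent sketch of the geometric decoder for a linear code.

    b_soft : soft received codeword in [0,1]^M (confidence that each bit is 1)
    G      : M x K binary generator matrix (b = G y mod 2)
    Returns a soft label vector in [0,1]^K.
    """
    M, K = G.shape
    c = 1.0 - 2.0 * np.asarray(b_soft, dtype=float)   # targets in [-1, 1]
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0, size=K)                # z_j = 1 - 2*y_j
    for _ in range(n_sweeps):
        for j in range(K):
            rows = np.flatnonzero(G[:, j])            # codeword bits that use label j
            if rows.size == 0:
                continue
            mask = np.arange(K) != j
            # a_i: product of the other (1 - 2*y_l) factors in codeword bit i.
            a = np.array([np.prod(np.where(G[i, mask] == 1, z[mask], 1.0))
                          for i in rows])
            denom = float(np.sum(a * a))
            if denom > 0:
                # Closed-form minimizer of sum_i (c_i - a_i * z_j)^2, clipped to [-1, 1].
                z[j] = np.clip(np.sum(a * c[rows]) / denom, -1.0, 1.0)
    return (1.0 - z) / 2.0
```

For a repetition code every affected row has a_i = 1, so the update reduces to averaging the soft copies of the label, matching the special case noted above.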

The benefit of using the soft-output geometric decoder is that the multiplication-approximated XOR preserves some geometric information: close points in [0, 1]^K remain close after multiplication encoding. Moreover, approximating XOR by multiplication allows us to take soft input, i.e., confidence information, into account during decoding.

IV. EXPERIMENTS

First, we compare RREP, HAMR, BCH, and LDPC within the ML-ECC framework on seven real-world datasets from different domains: scene, emotions, yeast, tmc2007, genbase, medical, and enron [22], using the algebraic decoder. The number of classes (K) of each dataset is shown in Table I. All results are reported as the mean and standard error over 30 runs with random training/testing splits. The sizes of the training and testing sets follow those of the original datasets. Note that for the tmc2007 dataset, which contains 28596 instances in total, we randomly sample 5% for training and another 5% for testing in each run.

We set RREP with k = 3. Then, for each ECC, we first consider a 3-powerset with either Random Forest, non-linear support vector machine (SVM), or logistic regression as the multi-class classifier inside the 3-powerset. Note that we randomly permute the bits of b and apply the inverse permutation on b̃ for the ECC other than RREP, to ensure that each 3-powerset works on diverse sub-blocks. In addition to the 3-powerset base learners, we also consider BR base learners in Subsection IV-D.
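For concreteness, the 3-powerset base learner with the bit permutation can be sketched as follows (an illustration only; scikit-learn's RandomForestClassifier is used as a stand-in for the Weka Random Forest, and the class and parameter names are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class KPowersetChannel:
    """Learn a permuted M-bit codeword with one multi-class classifier per
    k-bit sub-block (each sub-block has at most 2^k classes)."""

    def __init__(self, k=3, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)

    def fit(self, X, B):
        M = B.shape[1]
        self.perm = self.rng.permutation(M)              # random bit permutation
        self.models = []
        for start in range(0, M, self.k):
            cols = self.perm[start:start + self.k]
            codes = B[:, cols].dot(1 << np.arange(len(cols)))  # k bits -> one class id
            clf = RandomForestClassifier(n_estimators=60).fit(X, codes)
            self.models.append((cols, clf))
        return self

    def predict(self, X):
        B_pred = np.zeros((X.shape[0], len(self.perm)), dtype=int)
        for cols, clf in self.models:
            codes = clf.predict(X)
            for bit, col in enumerate(cols):             # class id -> k bits, written
                B_pred[:, col] = (codes >> bit) & 1      # back to original positions
        return B_pred
```

Writing each predicted bit back to its original position makes the inverse permutation implicit, and such a channel can serve directly as the base learner Ab in the ML-ECC sketch of Section II.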

We take the default Random Forest from Weka [23] with 60 trees. For the non-linear SVM, we use LIBSVM [24] with the Gaussian kernel and choose (C, γ) by cross-validation on the training data from {2^−5, 2^−3, · · · , 2^7} × {2^−9, 2^−7, · · · , 2^1}.


In addition, we use LIBLINEAR [25] for the logistic regression and choose the parameter C by cross-validation from {2^−5, 2^−3, · · · , 2^7}.

A. Validity of ML-ECC Framework

First, we demonstrate the validity of the ML-ECC framework. We fix the codeword length M to about 20 times the number of labels K. The lengths are of the form 2^p − 1 for integer p because the BCH code only works with such lengths. More experiments on different codeword lengths are presented in Section IV-B. Here the base multi-label learner is the 3-powerset with Random Forests. Following the description in Section II-B, RREP with the 3-powerset is exactly the same as RAKEL with k = 3.

The results on the 0/1 loss are shown in Figure 1(a). HAMR achieves lower ∆_0/1 than RREP on 5 out of the 7 datasets (scene, emotions, yeast, tmc2007, and medical) and achieves similar ∆_0/1 to RREP on the other two. This verifies that using some parity bits instead of repetitions improves the strength of the ECC, which in turn improves the 0/1 loss.

Along the same direction, BCH performs even better than both HAMR and RREP, especially on the medical dataset.

The superior performance of BCH justifies that the ECC is useful for multi-label classification. On the other hand, another sophisticated code, LDPC, gets higher 0/1 loss than BCH on every dataset, and even higher 0/1 loss than RREP on the emotions and yeast datasets, which suggests that LDPC may not be a good choice for the ECC framework.

Next we look at ∆_HL, shown in Figure 1(b). The Hamming loss of HAMR is comparable to that of RREP, where each wins on two datasets. BCH beats both HAMR and RREP on the tmc2007, genbase, and medical datasets but loses on the other four datasets. LDPC has the highest Hamming loss among the codes on all datasets. Thus, simpler codes like RREP and HAMR perform better in terms of ∆_HL. A stronger code like BCH may guard ∆_0/1 better, but it can pay more in terms of ∆_HL.

Similar results show up when using the Gaussian SVM or logistic regression as the base learner instead of Random Forest, as shown in Tables I and II. The boldface entries are the lowest-loss ones for the given dataset and base learner. The results validate that the performance of multi-label classification can be improved by applying the ECC. More specifically, we may improve the RAKEL algorithm by learning some parity bits instead of repetitions. Based on this experiment, we suggest that using HAMR for multi-label classification improves ∆_0/1 while maintaining ∆_HL comparable to that of RAKEL. If we use BCH instead, we improve ∆_0/1 further but may pay for it in ∆_HL.

B. Comparison of Codeword Length

Now we compare different codeword lengths M. With larger M, the codes can correct more errors, but the base learners take longer to train. By experimenting with different M, we may find a better trade-off between performance and efficiency.

The performance of the ECC framework with different codeword lengths on the scene dataset is shown in Figure 2. Here, the base learner is again the 3-powerset with Random Forests. The codeword length M varies from 31 to 127, which is about 5 to 20 times the number of labels K. We do not include shorter codewords because their performance is not stable. Note that BCH only allows M = 2^p − 1, and thus we conduct the BCH experiments only on those codeword lengths.

We first look at the 0/1 loss in Figure 2(a). The horizontal axis indicates the codeword length M and the vertical axis is the 0/1 loss on the test set. We see that ∆_0/1 of RREP stays around 0.335 no matter how long the codewords are. This implies that the power of repetition bits reaches its limit very soon. For example, once all the 3-powerset combinations of labels have been learned, additional repetitions give very limited improvement. Therefore, methods using repetition bits only, such as RAKEL, cannot take advantage of the extra bits in the codewords.

The ∆_0/1 of HAMR and BCH decreases slightly with M, but the differences between M = 63 and M = 127 are generally small (smaller than the differences between M = 31 and M = 63, in particular). This indicates that learning some parity bits provides additional information for prediction, which cannot be learned easily from repetition bits, and that such information remains beneficial for longer codewords, compared to repetition bits. One reason is that the number of 3-powerset combinations of parity bits is exponentially larger than that of combinations of labels. The performance of LDPC is not as stable as that of the other codes, possibly because of its sophisticated decoding step. Nevertheless, we still see that its ∆_0/1 decreases slightly with M.

Figure 2(b) shows ∆_HL versus M for each ECC. The ∆_HL of RREP is the lowest among the codes when M is small, but it remains almost constant when M ≥ 63, while the ∆_HL of HAMR and BCH keeps decreasing. This matches our finding that extra repetition bits give limited information. When M = 127, BCH is comparable to RREP in terms of ∆_HL. HAMR is even better than RREP at that codeword length and becomes the best code regarding ∆_HL. Thus, while a stronger code like BCH may guard ∆_0/1 better, it can pay more in terms of ∆_HL.

As stated in Sections I-A and II, the base learners serve as the channels in the ECC framework, and the performance of the base learners may be affected by the codes. Therefore, using a strong ECC does not always improve multi-label classification performance. Next, we verify the trade-off by measuring the bit error rate ∆_BER of h̃, which is defined as the Hamming loss between the predicted codeword h̃(x) and the actual codeword b. A higher bit error rate implies that the transformed task is harder.

Figure 2(c) shows ∆_BER versus M for each ECC. RREP has an almost constant bit error rate. HAMR also has a nearly constant bit error rate, but at a higher value. The bit error rate of BCH is similar to that of HAMR when the codeword is short, but it increases with M. One explanation is that some of the parity bits are harder to learn than repetition bits. The ratio between repetition bits and parity bits is constant regardless of M for both RREP and HAMR (RREP has no parity bits, and HAMR has 3 parity bits for every 4 repetition bits), while BCH has more parity bits with larger M.

Fig. 1. Performance of ML-ECC using the 3-powerset with Random Forests: (a) 0/1 loss (b) Hamming loss

TABLE I
0/1 LOSS OF ML-ECC USING 3-POWERSET BASE LEARNERS
Columns: scene (K=6, M=127), emotions (K=6, M=127), yeast (K=14, M=255), tmc2007 (K=22, M=511), genbase (K=27, M=511), medical (K=45, M=1023), enron (K=53, M=1023)

Random Forests
  RREP (RAKEL)  .3394±.0025  .6472±.0060  .7939±.0022  .7738±.0025  .0295±.0021  .6508±.0024  .8866±.0038
  HAMR          .2849±.0020  .6381±.0060  .7789±.0021  .7693±.0024  .0276±.0021  .6420±.0029  .8855±.0036
  BCH           .2669±.0020  .6361±.0059  .7764±.0021  .7273±.0018  .0263±.0020  .4598±.0036  .8659±.0039
  LDPC          .3057±.0023  .6616±.0048  .8080±.0024  .7728±.0022  .0288±.0021  .5238±.0032  .8830±.0036
Gaussian SVM
  RREP (RAKEL)  .2856±.0016  .7759±.0055  .7601±.0023  .7196±.0024  .0295±.0025  .3679±.0036  .8725±.0041
  HAMR          .2639±.0017  .7736±.0050  .7530±.0021  .7162±.0023  .0303±.0026  .3641±.0031  .8693±.0042
  BCH           .2576±.0017  .7744±.0053  .7429±.0017  .7095±.0020  .0255±.0019  .3394±.0027  .8477±.0045
  LDPC          .2780±.0020  .8040±.0044  .7574±.0021  .7403±.0019  .0285±.0021  .3856±.0031  .8666±.0041
Logistic Regression
  RREP (RAKEL)  .3601±.0019  .6949±.0070  .8161±.0017  .7408±.0024  .3593±.0078  .5507±.0254  .8762±.0035
  HAMR          .3293±.0017  .6955±.0058  .8061±.0019  .7383±.0025  .2275±.0099  .5268±.0230  .8754±.0035
  BCH           .3148±.0018  .7068±.0046  .7899±.0020  .7233±.0024  .0250±.0018  .3797±.0044  .8504±.0042
  LDPC          .3655±.0028  .7295±.0056  .8082±.0024  .7562±.0027  .0325±.0018  .4516±.0083  .8653±.0038

TABLE II
HAMMING LOSS OF ML-ECC USING 3-POWERSET BASE LEARNERS
Columns: scene (M=127), emotions (M=127), yeast (M=255), tmc2007 (M=511), genbase (M=511), medical (M=1023), enron (M=1023)

Random Forests
  RREP (RAKEL)  .0755±.0006  .1770±.0018  .1884±.0007  .0674±.0003  .0012±.0001  .0182±.0001  .0477±.0004
  HAMR          .0746±.0006  .1795±.0020  .1894±.0008  .0671±.0003  .0012±.0001  .0180±.0001  .0479±.0004
  BCH           .0753±.0007  .1855±.0021  .1928±.0008  .0662±.0003  .0011±.0001  .0159±.0001  .0506±.0004
  LDPC          .0819±.0007  .1912±.0019  .2012±.0007  .0734±.0003  .0013±.0001  .0192±.0002  .0538±.0005
Gaussian SVM
  RREP (RAKEL)  .0719±.0005  .2432±.0021  .1853±.0007  .0613±.0003  .0013±.0001  .0112±.0001  .0449±.0004
  HAMR          .0724±.0005  .2490±.0023  .1868±.0006  .0610±.0003  .0013±.0001  .0111±.0001  .0449±.0004
  BCH           .0739±.0006  .2644±.0019  .1898±.0008  .0629±.0003  .0010±.0001  .0114±.0001  .0516±.0006
  LDPC          .0755±.0006  .2634±.0027  .1917±.0007  .0679±.0003  .0014±.0001  .0140±.0001  .0530±.0005
Logistic Regression
  RREP (RAKEL)  .0915±.0005  .2026±.0025  .1993±.0007  .0634±.0003  .0179±.0006  .0190±.0011  .0453±.0003
  HAMR          .0910±.0005  .2070±.0024  .2003±.0007  .0634±.0003  .0102±.0005  .0176±.0009  .0454±.0003
  BCH           .0920±.0005  .2233±.0022  .2051±.0008  .0653±.0003  .0013±.0001  .0137±.0003  .0505±.0004
  LDPC          .0989±.0007  .2202±.0021  .2054±.0007  .0701±.0003  .0024±.0002  .0187±.0006  .0528±.0004

Fig. 2. Performance on scene vs. codeword length (ML-ECC using the 3-powerset with Random Forests): (a) 0/1 loss (b) Hamming loss (c) bit-error rate

The different bit error rates justify the trade-off between the strength of the ECC and the hardness of the base learning tasks. With more parity bits, one can correct more bit errors but may face harder learning tasks; with fewer parity bits or even no parity bits, one cannot correct many errors but enjoys simpler learning tasks.

Similar results show up on the other datasets with all three base learners. The performance on the yeast dataset with the 3-powerset and Random Forests is shown in Figure 3. Because the number of labels in the yeast dataset is about twice that in the scene dataset, the codeword length here ranges from 63 to 255, also about twice as long as in the experiments on the scene dataset. Again, we see that the benefit of parity bits remains valid for longer codewords than that of repetition bits, and that more parity bits make the transformed task harder to learn. This result points out the trade-off between the strength of the ECC and the hardness of the base learning tasks.

C. Bit Error Analysis

To further analyze the difference between the ECC designs, we zoom in to M = 127 of Figure 2. The instances are divided into groups according to the number of bit errors on each instance. The relative frequency of each group, i.e., the ratio of the group size to the total number of instances, is plotted in Figure 4(a). The average ∆_0/1 and ∆_HL of each group are also plotted in Figures 4(b) and 4(c). The curve of each ECC forms two peak regions in Figure 4(a). Besides the peak at 0, which corresponds to instances with no bit errors, the other peak varies from one code to another. The positions of the peaks suggest the hardness of the transformed learning task, similar to our findings in Figure 2(c).

We can clearly see the difference in strength of the ECC designs from Figure 4(b). BCH can tolerate up to 31-bit errors, but its ∆_0/1 sharply increases beyond 0.8 for 32-bit errors. HAMR can correct 13-bit errors perfectly, and its ∆_0/1 increases slowly when more errors occur. Both RREP and LDPC can perfectly correct only 9-bit errors, but LDPC is able to sustain a low ∆_0/1 even when there are 32-bit errors.

It would be interesting to study the reason behind this long tail from a Bayesian network perspective.

We can also look at the relation between the number of bit errors and ∆_HL, as shown in Figure 4(c). The BCH curve grows sharply when the number of bit errors exceeds 31, which links to the inferior performance of BCH relative to RREP in terms of ∆_HL. The LDPC curve grows much more slowly, but its right-side peak in Figure 4(a) still leads to higher overall ∆_HL. On the other hand, RREP and HAMR enjoy a better balance between the peak position in Figure 4(a) and the growth in Figure 4(c), and thus lower overall ∆_HL.

Figure 4(a) suggests that the transformed learning tasks of more sophisticated ECC are harder. The reason is that sophisticated ECC contains many parity bits, which are the exclusive-OR of labels, and the parity bits are harder to learn for the base learners. We demonstrate this in Figure 5 using the scene dataset (6 labels) with M = 127 fixed. The codeword bits are divided into groups according to the number of labels XOR'ed to form the bit. The relative frequency of each group is plotted in Figure 5(a). We can see that all codeword bits of RREP are formed by 1 label, and the bits of HAMR are formed by 1 or 3 labels. For BCH and LDPC, the number of labels XOR'ed in a bit ranges from none (0) to all (6), while most of the bits are the XOR of half of the labels (3 labels).

Next we show how well the base learners learn each group in Figure 5(b). Here the base learner is the 3-powerset with Random Forests. The figure suggests that the parity bits (XOR of 2 or more labels) result in harder learning tasks and higher bit error rates than the original labels (XOR of 1 label). One exception is the bits XOR'ed from all (6) labels, which are easier to learn than the original labels. The reason is that the bit XOR'ed from all labels is equivalent to the indicator of an odd number of labels, and a constant predictor works well for this bit because about 92% of all instances in the scene dataset have 1 or 3 labels. Since BCH and LDPC have many bits XOR'ed from 2-4 labels, their bit error rates are higher than those of RREP and HAMR, as shown in Figure 2(c).

These findings also appear on other datasets and with other base learners, such as the medical dataset (45 labels, M = 1023) shown in Figure 6. BCH and LDPC have many bits XOR'ed from about half of the labels, and the transformed learning tasks of such bits are harder to learn than those of the original labels.

D. Comparison with Binary Relevance

In addition to the 3-powerset base learners, we also consider BR base learners, which simply build one classifier for each bit in the codeword space. Note that if we couple the ECC framework with RREP and BR, the resulting algorithm is almost the same as the original BR. For example, using RREP and BR with SVM is equivalent to using BR with bootstrap aggregated SVM.

We first compare the performance of the ECC designs using the BR base learner with Random Forests. The results on the 0/1 loss are shown in Figure 7(a). From the figure, we can see that BCH and HAMR reach performance superior to the other ECC, with BCH being the better choice. RREP (BR), on the other hand, leads to the worst 0/1 loss. The result again justifies the usefulness of coupling BR with the ECC instead of only the original y. Note that LDPC also performs better than BR on two datasets, but is not as good as HAMR and BCH. Thus, an over-sophisticated ECC like LDPC may not be necessary for multi-label classification.

In Figure 7(b), we present the results on ∆_HL. In contrast to the case of the 3-powerset base learner, here both HAMR and BCH achieve better ∆_HL than RREP (BR) on most of the datasets. HAMR wins on three datasets, while BCH wins on four. Thus, coupling a stronger ECC with the BR base learner can improve both ∆_0/1 and ∆_HL. However, LDPC performs worse than BR in terms of ∆_HL, which again shows that LDPC may not be suitable for multi-label classification.

Fig. 3. Performance on yeast vs. codeword length (ML-ECC using the 3-powerset with Random Forests): (a) 0/1 loss (b) Hamming loss (c) bit-error rate

Fig. 4. Bit errors and losses on the scene dataset with M = 127: (a) relative frequency (b) 0/1 loss (c) Hamming loss vs. number of bit errors

Fig. 5. Parity bits on the scene dataset, 6 labels, 127-bit codeword: (a) relative frequency (b) bit error rate vs. number of labels XOR'ed

Fig. 6. Parity bits on the medical dataset, 45 labels, 1023-bit codeword: (a) relative frequency (b) bit error rate vs. number of labels XOR'ed

Fig. 7. Performance of ML-ECC using Binary Relevance with Random Forests: (a) 0/1 loss (b) Hamming loss

Experiments with other base learners also support similar findings, as shown in Tables III and IV. Notice that HAMR performs better than BCH when using the Gaussian SVM base learners. Thus, extending BR by learning some more parity bits and decoding them suitably with the ECC is superior to the original BR.

Comparing Tables I and III, we see that using the 3-powerset achieves lower 0/1 loss than using BR in most of the cases. However, in terms of ∆_HL, as shown in Tables II and IV, there is no clear winner between the 3-powerset and BR.

E. Validity of the Geometric Decoder

The previous experiments show that using either the HAMR or the BCH code improves the performance of Binary Relevance learners. Now we examine whether the proposed geometric decoder can further improve the results. We empirically compare the off-the-shelf algebraic decoder with hard inputs, the proposed geometric decoder with hard inputs, and the proposed geometric decoder with soft inputs. Note that hard inputs are the direct binary bits in the codeword, and soft inputs contain the confidence that the bits are 1. We include the hard-input geometric decoder in the comparison to see whether the geometric decoder is competitive with the algebraic one when the base learner does not provide soft inputs.

When using BR base learners, Random Forests from WEKA, Gaussian SVM from LIBSVM, and logistic regression from LIBLINEAR all support outputting the confidence for binary classification, which is naturally taken as the soft input.

The results on the BCH code using the Gaussian SVM base learner are shown in Figure 8. We can see from Figure 8(a) that in terms of ∆_0/1, both geometric decoders are significantly better than the algebraic one, especially on the emotions and yeast datasets, and the soft-input geometric decoder is slightly better than or similar to the hard-input one.

From Figure 8(b), we can see that the soft-input geometric decoder is much better than the hard-input one in terms of ∆_HL, but the algebraic decoder is usually the best here. The reason may be that the geometric decoder minimizes the distance between the approximated enc(ỹ) and b̃ in the codeword space. However, the BCH code does not preserve the Hamming distance between {0, 1}^K and {0, 1}^M during encoding and decoding, so the geometric decoder, which minimizes the distance in [0, 1]^M (and approximately in {0, 1}^M), may not be suitable for the Hamming loss (the Hamming distance in {0, 1}^K). But when taking confidence information as soft input, the geometric decoder can perform better on the Hamming loss and sometimes becomes comparable to the algebraic decoder.

The results show that the proposed geometric decoder on the BCH code is better than the ordinary algebraic decoder for ∆_0/1, but not for ∆_HL. Also, the soft input is useful for the geometric decoder in terms of both ∆_0/1 and ∆_HL.

Next we look at the HAMR code. On the 0/1 loss shown in Figure 9(a), the geometric decoders are better than the algebraic one except on the genbase and enron datasets, where all decoders have similar 0/1 loss. Between the hard-input and soft-input geometric decoders, there is no clear winner. In terms of the Hamming loss in Figure 9(b), the soft-input geometric decoder performs the best. Its Hamming loss is significantly lower than that of the hard-input geometric decoder and the algebraic decoder on the scene, emotions, and tmc2007 datasets, and similar to them on the other datasets. Comparing the hard-input geometric and algebraic decoders, there is no significant difference.

The results again show that the proposed geometric decoder on the HAMR code is better than the ordinary algebraic decoder for ∆_0/1, but not for ∆_HL. With soft input, however, the geometric decoder can give lower ∆_HL.

Similar results show up when using other base learners, as shown in Tables V and VI. The boldface entries are the best entries for each dataset given the ECC and base learner. From this experiment we see that using the hard-input geometric decoder instead of the off-the-shelf algebraic one leads to improvement on ∆_0/1, but pays for it in ∆_HL. With the soft-input geometric decoder, the harm to ∆_HL is mitigated and the improvement on ∆_0/1 is strengthened. Therefore, the soft-input geometric decoder is a better choice for the ML-ECC framework.

F. Comparison with Real-valued ECC

Our ML-ECC framework only considers binary ECC. Here, we compare the ML-ECC framework with real-valued ECC methods: coding with canonical correlation analysis (CCA-OC) [16] and max-margin output coding (MaxMargin) [17], as discussed in Section II-C.


TABLE III
0/1 LOSS OF ML-ECC USING BR BASE LEARNERS
Columns: scene (M=127), emotions (M=127), yeast (M=255), tmc2007 (M=511), genbase (M=511), medical (M=1023), enron (M=1023)

Random Forest
  RREP (BR)  .4398±.0023  .6822±.0055  .8332±.0016  .7715±.0023  .0303±.0020  .6546±.0025  .8872±.0036
  HAMR       .3212±.0021  .6574±.0050  .7910±.0020  .7578±.0025  .0288±.0019  .6387±.0025  .8851±.0036
  BCH        .2562±.0020  .6404±.0060  .7792±.0019  .7149±.0022  .0250±.0019  .4567±.0034  .8737±.0038
  LDPC       .3995±.0026  .6903±.0058  .8338±.0015  .7735±.0024  .0312±.0022  .5601±.0032  .8876±.0035
Gaussian SVM
  RREP (BR)  .3376±.0023  .8414±.0052  .7955±.0017  .7281±.0025  .0273±.0021  .3721±.0037  .8720±.0041
  HAMR       .2876±.0018  .8073±.0043  .7681±.0022  .7215±.0025  .0243±.0022  .3675±.0036  .8718±.0042
  BCH        .2552±.0018  .7809±.0050  .7515±.0015  .7053±.0024  .0255±.0019  .3499±.0030  .8561±.0043
  LDPC       .3161±.0022  .8523±.0042  .7963±.0018  .7515±.0023  .0243±.0017  .4226±.0034  .8782±.0037
Logistic Regression
  RREP (BR)  .4821±.0024  .7396±.0049  .8531±.0016  .7458±.0026  .5084±.0068  .5784±.0282  .8759±.0035
  HAMR       .4050±.0020  .7175±.0054  .8282±.0015  .7405±.0024  .3509±.0089  .5499±.0247  .8740±.0036
  BCH        .3291±.0020  .6982±.0048  .8094±.0020  .7205±.0022  .0295±.0018  .4022±.0076  .8579±.0038
  LDPC       .4659±.0028  .7507±.0056  .8565±.0019  .7694±.0027  .0528±.0031  .5396±.0149  .8795±.0036

TABLE IV
HAMMING LOSS OF ML-ECC USING BR BASE LEARNERS
Columns: scene (M=127), emotions (M=127), yeast (M=255), tmc2007 (M=511), genbase (M=511), medical (M=1023), enron (M=1023)

Random Forest
  RREP (BR)  .0858±.0005  .1809±.0018  .1903±.0006  .0662±.0003  .0013±.0001  .0183±.0001  .0474±.0003
  HAMR       .0726±.0005  .1781±.0017  .1878±.0007  .0652±.0002  .0012±.0001  .0179±.0001  .0474±.0003
  BCH        .0717±.0006  .1826±.0018  .1898±.0008  .0638±.0003  .0010±.0001  .0152±.0001  .0494±.0004
  LDPC       .0832±.0006  .1877±.0020  .1963±.0006  .0721±.0003  .0014±.0001  .0203±.0001  .0529±.0004
Gaussian SVM
  RREP (BR)  .0743±.0005  .2461±.0020  .1866±.0006  .0621±.0003  .0012±.0001  .0113±.0001  .0450±.0004
  HAMR       .0717±.0004  .2472±.0023  .1861±.0007  .0616±.0003  .0010±.0001  .0112±.0001  .0451±.0004
  BCH        .0739±.0006  .2569±.0030  .1880±.0007  .0619±.0003  .0010±.0001  .0117±.0001  .0487±.0005
  LDPC       .0742±.0005  .2538±.0019  .1908±.0006  .0688±.0003  .0011±.0001  .0153±.0001  .0517±.0005
Logistic Regression
  RREP (BR)  .1024±.0006  .2062±.0018  .2000±.0007  .0641±.0003  .0347±.0007  .0212±.0014  .0455±.0003
  HAMR       .0959±.0006  .2047±.0022  .2003±.0007  .0635±.0003  .0186±.0005  .0191±.0011  .0452±.0003
  BCH        .0955±.0006  .2172±.0023  .2037±.0008  .0638±.0003  .0024±.0002  .0161±.0006  .0472±.0004
  LDPC       .1038±.0006  .2133±.0019  .2044±.0008  .0705±.0004  .0050±.0004  .0234±.0011  .0517±.0004

Fig. 8. Performance of ML-ECC with hard-/soft-input geometric decoders and algebraic decoder on the BCH code using BR and SVM: (a) 0/1 loss (b) Hamming loss

Fig. 9. Performance of ML-ECC with hard-/soft-input geometric decoders and algebraic decoder on the HAMR code using BR and SVM: (a) 0/1 loss (b) Hamming loss
