
2.1 ML-ECC Framework

We now describe the ECC framework in detail. The main idea is to use an ECC encoder enc(·): {0,1}^K → {0,1}^M to expand the original label set y ∈ {0,1}^K into a codeword b ∈ {0,1}^M that contains redundant information. Then, instead of learning a multi-label classifier h(x) between x and y, we learn a multi-label classifier h̃(x) between x and the corresponding b. In other words, we transform the original multi-label classification problem into another (larger) multi-label classification task. During prediction, we use h(x) = dec ◦ h̃(x), where dec(·): {0,1}^M → {0,1}^K is the corresponding ECC decoder, to obtain a multi-label prediction ỹ ∈ {0,1}^K. The simple steps of the framework are shown in Algorithm 1.
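To make the procedure concrete, the following is a minimal Python sketch of the framework just described; enc, dec, and Ab are placeholders for any block-coding ECC and any base multi-label learner, and all names are illustrative rather than taken from a particular library.

    import numpy as np

    def ml_ecc_train(X, Y, enc, Ab):
        # Encode every label vector y (length K) into a codeword b (length M),
        # then learn a multi-label classifier h_tilde between x and b.
        B = np.array([enc(y) for y in Y])      # transformed labels, shape (N, M)
        h_tilde = Ab(X, B)                     # Ab returns a predictor: x -> codeword
        return h_tilde

    def ml_ecc_predict(X, h_tilde, dec):
        # Predict codewords with h_tilde, then ECC-decode back to label vectors,
        # i.e., h(x) = dec(h_tilde(x)).
        B_hat = h_tilde(X)                     # predicted codewords, shape (T, M)
        return np.array([dec(b) for b in B_hat])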

Algorithm 1 is simple and general. It can be coupled with any block-coding ECC and any base learner Ab to form a new multi-label classification algorithm. For instance, the ML-BCHRF method [Kouzani and Nasireding, 2009] uses the BCH code (see Subsection 2.2.3) as the ECC and BR on Random Forest as the base learner Ab. Note that Kouzani and Nasireding [2009] did not explain why ML-BCHRF may lead to improvements in multi-label classification. Next, we show a simple theorem that connects the ECC framework with ∆_0/1.

Algorithm 1 Error-Correcting Framework

• Parameter: an ECC with encoder enc(·) and decoder dec(·); a base multi-label learner Ab

• Training: given the data set D = {(x_n, y_n)}_{n=1}^N,
  1. Encode each y_n to the codeword b_n = enc(y_n), forming the transformed data set D̃ = {(x_n, b_n)}_{n=1}^N.
  2. Return h̃ = Ab(D̃).

• Prediction: given any instance x,
  1. Obtain the codeword prediction b̃ = h̃(x).
  2. Return h(x) = dec(b̃) by ECC-decoding.

Many ECCs can guarantee to correct up to m bit-flipping errors in a codeword of length M. We will introduce some of those ECCs in Section 2.2. Then, if ∆_HL of h̃ is low, the ECC framework guarantees that ∆_0/1 of h is low. The guarantee is formalized as follows.

Theorem 1 Consider an ECC that can correct up to m bit errors in a codeword of length M. Then, for any T test examples {(x_t, y_t)}_{t=1}^T, let b_t = enc(y_t). If ∆_HL of h̃ on {(x_t, b_t)}_{t=1}^T is at most ε, then ∆_0/1 of h on {(x_t, y_t)}_{t=1}^T is at most Mε / (m + 1).
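The bound can be seen through a simple counting argument; a sketch, assuming that ∆_HL measures the fraction of mismatched bits among the M codeword bits and that the decoder succeeds whenever at most m bits are wrong:

    \Delta_{0/1}\bigl(h(x_t), y_t\bigr) \;\le\; \frac{M}{m+1}\,\Delta_{HL}\bigl(\tilde{h}(x_t), b_t\bigr), \qquad t = 1, \dots, T,

since a decoding failure on example t forces h̃(x_t) and b_t to disagree on at least m + 1 of the M bits; averaging the inequality over the T examples gives the statement of Theorem 1.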

From Theorem 1, it appears that we should simply use some stronger ECC, for which m is larger. Nevertheless, note that we are applying the ECC in a learning scenario. Thus, ∆_HL of h̃ is not a fixed value, but depends on whether Ab can learn well from D̃. A stronger ECC usually contains redundant bits that come from complicated compositions of the original bits in y, and the compositions may not be easy to learn. This trade-off has been revealed when applying the ECC to multi-class classification [Li, 2006]. In the next section, we study ECCs of different strengths, and we empirically verify the trade-off in Section 2.4.

2.2 Review of Classic ECC

Next, we review four ECC designs that will be used in the empirical study. The four designs cover a broad spectrum of practical choices in terms of strength: the repetition code, the Hamming on repetition code, the Bose-Chaudhuri-Hocquenghem code, and the low-density parity-check code.

2.2.1 Repetition Code

One of the simplest ECCs is the repetition code (REP) [Mackay, 2003], for which every bit in y is repeated ⌊M/K⌋ times in b during encoding. If M is not a multiple of K, then (M mod K) bits are repeated one more time. The decoding takes a majority vote using the received copies of each bit. Because of the majority vote, the repetition code corrects up to m_REP = ⌈⌊M/K⌋/2⌉ − 1 bit errors in b. We will discuss the connection between REP and the RAKEL algorithm in Section 2.3.
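A minimal Python sketch of REP encoding and majority-vote decoding follows; the convention that the first (M mod K) bits receive the extra copy, and the tie-breaking in the majority vote, are assumptions for illustration.

    import numpy as np

    def rep_encode(y, M):
        # Repeat each of the K bits floor(M/K) times; the first (M mod K) bits
        # receive one extra copy so that the codeword has exactly M bits.
        K = len(y)
        reps = [M // K + (1 if i < M % K else 0) for i in range(K)]
        return np.concatenate([np.full(r, y[i]) for i, r in enumerate(reps)])

    def rep_decode(b, K):
        # Majority vote over the received copies of each original bit.
        M = len(b)
        reps = [M // K + (1 if i < M % K else 0) for i in range(K)]
        out, pos = [], 0
        for r in reps:
            out.append(int(2 * b[pos:pos + r].sum() > r))   # ties broken to 0
            pos += r
        return np.array(out)

    # Example: K = 4 labels encoded into M = 15 bits; a single flipped copy
    # of a bit is still recovered by the majority vote.
    y = np.array([1, 0, 1, 1])
    b = rep_encode(y, 15)
    b[0] ^= 1
    assert np.array_equal(rep_decode(b, 4), y)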

2.2.2 Hamming on Repetition Code

A slightly more complicated ECC than REP is the Hamming code (HAM) [Hamming, 1950], which can correct m_HAM = 1 bit error in b by adding some parity check bits (exclusive-OR operations of some bits in y). One typical choice of HAM is HAM(7, 4), which encodes any y with K = 4 to b with M = 7. Note that m_HAM = 1 is worse than m_REP = ⌈⌊M/K⌋/2⌉ − 1 when M is large. Thus, we consider applying HAM(7, 4) on every 4 (permuted) bits of a REP. That is, to form a codeword b of M bits from a block y of K bits, we first construct a REP of 4⌊M/7⌋ + (M mod 7) bits from y; then, for every 4 bits in the REP, we add 3 parity bits to b using HAM(7, 4). The resulting code will be named Hamming on Repetition (HAMR). During decoding, the decoder of HAM(7, 4) is first used to recover the 4-bit sub-blocks of the REP; then, the decoder of REP (majority vote) takes place.
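The HAM(7, 4) building block can be sketched as follows; the parity-bit placement (positions 1, 2, and 4 of the 7-bit block) is the textbook construction and is assumed here, so it need not match the exact bit ordering or the interleaving with the REP part used in the experiments.

    import numpy as np

    def ham74_encode(d):
        # Encode 4 data bits into a 7-bit Hamming codeword
        # [p1, p2, d1, p3, d2, d3, d4] (positions 1..7).
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return np.array([p1, p2, d1, p3, d2, d3, d4])

    def ham74_decode(c):
        # The 3-bit syndrome gives the position (1..7) of a single bit error;
        # 0 means no detected error. Returns the 4 data bits.
        c = c.copy()
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = 4 * s3 + 2 * s2 + s1
        if pos:                      # flip the single erroneous bit back
            c[pos - 1] ^= 1
        return c[[2, 4, 5, 6]]       # extract d1, d2, d3, d4

    # Any single flipped bit within the 7-bit block is corrected:
    d = np.array([1, 0, 1, 1])
    c = ham74_encode(d)
    c[5] ^= 1
    assert np.array_equal(ham74_decode(c), d)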

It is not hard to compute m_HAMR by analyzing the REP and HAM parts separately. When M is a multiple of 7 and K is a multiple of 4, it can be proved that m_HAMR = 4M/(7K), which is generally better than m_REP = ⌈⌊M/K⌋/2⌉ − 1. Thus, HAMR is slightly stronger than REP for ECC purposes. We include HAMR in our study to verify whether a simple inclusion of some parity bits in the ECC can readily improve the performance of multi-label classification.

2.2.3 Bose-Chaudhuri-Hocquenghem Code

BCH was invented by Bose and Ray-Chaudhuri [1960], and independently by Hocquenghem [1959]. It can be viewed as a sophisticated extension of HAM and allows correcting multiple bit errors. BCH with length M = 2^p − 1 has (M − K) parity bits, and it can correct m_BCH = (M − K)/p bits of error [Mackay, 2003], which makes BCH in general stronger than REP and HAMR. The caveat is that the decoder of BCH is more complicated than those of REP and HAMR.
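To get a rough sense of the relative strengths, one can plug illustrative numbers into the formulas above; K = 14 and p = 8 (so M = 2^8 − 1 = 255, roughly 20 times K) are hypothetical values of the same order as those used in the experiments later, not the exact experimental settings.

    # Nominal error-correcting power under the formulas quoted in the text.
    K, p = 14, 8
    M = 2**p - 1                        # 255
    m_rep = -(-(M // K) // 2) - 1       # ceil(floor(M/K) / 2) - 1  -> 8
    m_bch = (M - K) // p                # (M - K) / p               -> 30
    print(m_rep, m_bch)                 # BCH corrects many more bit errors than REP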

We include BCH in our study because it is one of the most popular ECCs in real-world communication systems. In addition, we compare BCH with HAMR to see if a strong ECC can do better for multi-label classification.

2.2.4 Low-density Parity-check Code

Low-density parity-check code (LDPC) [Mackay, 2003] has recently drawn much research attention in communications. LDPC exhibits an interesting connection between ECC and Bayesian learning [Mackay, 2003]. While it is difficult to state the strength of LDPC in terms of a single m_LDPC, LDPC has been shown to approach the theoretical limit in some special channels [Gallager, 1963], which makes it a state-of-the-art ECC.

We choose to include LDPC in our study to see whether it is worthwhile to go beyond BCH with more sophisticated encoders and decoders.

2.3 ECC View of RAKEL

RAKEL is a multi-label classification algorithm proposed by Tsoumakas and Vlahavas [2007]. Define a k-label set as a size-k subset of L. Each iteration of RAKEL randomly selects a (different) k-label set and builds a multi-label classifier on the k labels with Label Powerset (LP). After running for R iterations, RAKEL obtains a size-R ensemble of LP classifiers. The prediction on each label is made by a majority vote among the classifiers associated with that label.

Equivalently, we can draw (with replacement) M = Rk labels first, before building the LP classifiers. Then, selecting the k-label sets is equivalent to encoding y by a variant of REP, which we call the RAKEL repetition code (RREP). Similar to REP, each bit y[i] is repeated several times in b, since label i is involved in several k-label sets. After encoding y to b, each LP classifier, called a k-powerset, acts as a sub-channel that transmits a size-k sub-block of the codeword b. The prediction procedure follows the decoder of the usual REP.
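A Python sketch of the RREP view follows: R distinct k-label sets are drawn, each contributing a size-k sub-block of the codeword, and decoding is a per-label majority vote over all copies. The helper names, the sampling scheme, and the tie-breaking rule are illustrative assumptions, and the k-powerset sub-channel (one LP classifier per sub-block) is omitted.

    import numpy as np
    import random

    def draw_k_label_sets(K, R, k, rng=random):
        # Draw R distinct k-label sets (size-k subsets of the K labels);
        # assumes R does not exceed the number of available subsets.
        groups = set()
        while len(groups) < R:
            groups.add(tuple(sorted(rng.sample(range(K), k))))
        return [list(g) for g in groups]

    def rrep_encode(y, groups):
        # Label i is copied once for every k-label set containing it (M = R*k bits).
        return np.concatenate([y[g] for g in groups])

    def rrep_decode(b, groups, K):
        # Majority vote over all received copies of each label (ties broken to 0).
        votes, counts = np.zeros(K), np.zeros(K)
        pos = 0
        for g in groups:
            for i in g:
                votes[i] += b[pos]
                counts[i] += 1
                pos += 1
        return (2 * votes > counts).astype(int)

    # Example: K = 6 labels, R = 10 three-label sets (M = 30 codeword bits).
    groups = draw_k_label_sets(K=6, R=10, k=3)
    y = np.array([1, 0, 0, 1, 1, 0])
    y_hat = rrep_decode(rrep_encode(y, groups), groups, K=6)   # equals y when every label is drawn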

The ECC view above decomposes the original RAKEL into two parts: the ECC and the base learner Ab. Next, we empirically study how the two parts affect the performance of multi-label classification.

2.4 Experimental Results

We compare RREP, HAMR, BCH, and LDPC under the ECC framework on seven real-world datasets from different domains: scene, emotions, yeast, tmc2007, genbase, medical, and enron [Tsoumakas et al., 2010]. The statistics of these datasets are shown in Table 2.1. All results are reported as the mean and standard error over 30 runs on randomly split training and testing sets. The sizes of the training and testing sets follow those of the original datasets. Note that for the tmc2007 dataset, which has 28596 instances in total, we randomly sample 5% of the instances for training and another 5% for testing in each run.
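The evaluation protocol can be sketched as follows; the function name and the use of NumPy arrays and random generators are assumptions for illustration.

    import numpy as np

    def random_split(X, Y, n_train, n_test, rng):
        # One run: shuffle the examples, then take n_train for training and the
        # next n_test for testing (n_train/n_test follow the original dataset
        # sizes, or 5%/5% of the 28596 examples for tmc2007).
        idx = rng.permutation(len(X))
        tr, te = idx[:n_train], idx[n_train:n_train + n_test]
        return X[tr], Y[tr], X[te], Y[te]

    # Means and standard errors are computed over 30 such runs, e.g.:
    # for seed in range(30):
    #     X_tr, Y_tr, X_te, Y_te = random_split(X, Y, n_train, n_test,
    #                                           np.random.default_rng(seed))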

Table 2.1: Dataset characteristics

DATASET     K    TRAINING    TESTING    FEATURES
SCENE        6       1211       1196         294
EMOTIONS     6        391        202          72
YEAST       14       1500        917         103
TMC2007     22       1430       1430         500
GENBASE     27        463        199        1186
MEDICAL     45        333        645        1449
ENRON       53       1123        579        1001

We set RREP with k = 3. Then, for each ECC, we first consider a 3-powerset with either Random Forest, a non-linear support vector machine (SVM), or logistic regression as the multi-class classifier inside the 3-powerset. Note that, for the ECCs other than RREP, we randomly permute the bits of b and apply the inverse permutation on b̃ to ensure that each 3-powerset works on diverse sub-blocks. In addition to the 3-powerset base learners, we also consider BR base learners in Subsection 2.4.4.

We take the default Random Forest from Weka [Hall et al., 2009] with 60 trees. For the non-linear SVM, we use LIBSVM [Chang and Lin, 2001] with the Gaussian kernel and choose (C, γ) by cross-validation on the training data from {2^-5, 2^-3, ..., 2^7} × {2^-9, 2^-7, ..., 2^1}. In addition, we use LIBLINEAR [Fan et al., 2008] for logistic regression and choose the parameter C by cross-validation from {2^-5, 2^-3, ..., 2^7}.
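For reference, the parameter grids above can be written down as a scikit-learn sketch; the thesis uses Weka, LIBSVM, and LIBLINEAR directly, so the scikit-learn estimators and the 5-fold cross-validation here are stand-ins and assumptions.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression

    # Gaussian-kernel SVM: C in {2^-5, 2^-3, ..., 2^7}, gamma in {2^-9, 2^-7, ..., 2^1}.
    svm_grid = {"C": [2.0**e for e in range(-5, 8, 2)],
                "gamma": [2.0**e for e in range(-9, 2, 2)]}
    svm_search = GridSearchCV(SVC(kernel="rbf"), svm_grid, cv=5)

    # Logistic regression: C in {2^-5, 2^-3, ..., 2^7}.
    lr_grid = {"C": [2.0**e for e in range(-5, 8, 2)]}
    lr_search = GridSearchCV(LogisticRegression(max_iter=1000), lr_grid, cv=5)

    # svm_search.fit(X_train, y_train)  # one search per multi-class sub-problem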

Note that the experiments conducted in this work are broader than those in existing works on multi-label classification with the ECC, in terms of the datasets, the codes, the "channels," and the base learners, as shown in Table 2.2. The goal of the experiments is not only to justify that the framework is promising, but also to rigorously identify the best codes, channels, and base learners for solving general multi-label classification tasks via the ECC.

2.4.1 Validity of ML-ECC Framework

First, we demonstrate the validity of the ML-ECC framework. We fix the codeword length M to roughly 20 times the number of labels K. The lengths are of the form 2^p − 1 for an integer p because the BCH code only works on such lengths. More

Table 2.2: Focus of existing works under the ML-ECC framework

work                                       # datasets   codes                      channels            base learners
RAKEL [Tsoumakas and Vlahavas, 2007]            3        RREP                       k-powerset          linear SVM
ML-BCHRF [Kouzani and Nasireding, 2009]         3        BCH                        BR                  Random Forest
ML-BCHRF & ML-CRF [Kouzani, 2010]               1        convolution and BCH        BR                  Random Forest
this work                                       7        RREP, HAMR, BCH, and LDPC  3-powerset and BR   Random Forest, Gaussian SVM, and logistic regression
