
Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Master Thesis

剛性與柔性解碼之錯誤更正碼於多標籤分類學習之應用
Multi-label Classification with Hard-/soft-decoded Error-correcting Codes

Ferng, Chun-Sung (馮俊菘)

Advisor: Hsuan-Tien Lin, Ph.D.

June, 2012


Acknowledgements

For being able to complete this thesis and earn my master's degree, I would first like to thank my advisor, Professor Hsuan-Tien Lin. Thank you for your continuous guidance and for the weekly one-on-one discussions. From leading me into the field at the very beginning, discussing topics, ideas, and experiments with me, to polishing the thesis and my presentation skills at the end, these results were only possible under your teaching and direction. I would also like to thank my oral examination committee members, Professor Yuh-Jye Lee (李育杰) and Professor Shou-De Lin (林守德), who not only took time out of their busy schedules to attend my defense, but also gave me many useful suggestions that made this thesis more complete.

I also want to thank my labmates in Labs 217 and 536, who were willing to discuss with me from time to time over these two years, deepening my understanding of machine learning and other related fields, and who listened to my practice runs when I was preparing oral presentations and the defense, helping me perform better. In addition, I want to thank my girlfriend, 琇文, who worked hard alongside me all the way and comforted and encouraged me whenever I ran into difficulties, so that I could keep moving forward.

Most importantly, I want to thank my mom and dad; it is because of your devotion and support that I could concentrate on my studies without distraction.

I hereby offer my most sincere gratitude to everyone who has helped me. Thank you all.

Chun-Sung Ferng, July 2012


Chinese Abstract

We propose a framework for applying error-correcting codes (ECC) to multi-label classification. In this framework, we treat a set of base learners as a noisy transmission channel and use the ECC to correct the prediction errors made by these base learners. Through this framework, the existing random k-label-sets (RAKEL) algorithm can be explained with a simple repetition code. We also experiment with various ECC for multi-label classification. The experimental results show that stronger ECC can improve the performance of RAKEL; moreover, letting the traditional binary relevance (BR) algorithm learn some parity-checking labels also leads to better performance. Furthermore, the results across different ECC show that the strength of the ECC affects the difficulty of the base learning tasks, and properly balancing the two yields better results. Finally, we design a new decoder that handles both hard (binary-valued) and soft (real-valued) linear ECC, and the experimental results confirm that the new decoder improves the performance of the framework.

Keywords: machine learning, multi-label classification, error-correcting codes, soft decoding, geometric decoding


Abstract

We formulate a framework for applying error-correcting codes (ECC) on multi-label classification problems. The framework treats some base learners as noisy channels and uses ECC to correct the prediction errors made by the learners. An immediate use of the framework is a novel ECC-based explanation of the popular random k-label-sets (RAKEL) algorithm using a simple repetition ECC. Using the framework, we empirically compare a broad spectrum of ECC designs for multi-label classification. The results not only demonstrate that RAKEL can be improved by applying some stronger ECC, but also show that the traditional Binary Relevance approach can be enhanced by learning more parity-checking labels. Our study on different ECC also helps understand the trade-off between the strength of ECC and the hardness of the base learning tasks. Furthermore, we extend our study to linear ECC for either hard (binary) or soft (real-valued) bits, and design a novel decoder for the ECC. We demonstrate that the decoder improves the performance of our framework.

Keywords : Machine Learning, Multi-label Classification, Error-correcting Codes, Soft Decoding, Geometric Decoding.


Contents

Acknowledgements iii

Chinese Abstract v

Abstract vii

1 Introduction 1

1.1 Problem Setup . . . 3

1.2 Related Works . . . 4

2 ECC for Multi-label Classification 7

2.1 ML-ECC Framework . . . 7

2.2 Review of Classic ECC . . . 9

2.2.1 Repetition Code . . . 9

2.2.2 Hamming on Repetition Code . . . 9

2.2.3 Bose-Chaudhuri-Hocquenghem Code . . . 10

2.2.4 Low-density Parity-check Code . . . 10

2.3 ECC View of RAKEL . . . 11

2.4 Experimental Results . . . 11

2.4.1 Validity of ML-ECC Framework . . . 12

2.4.2 Comparison of Codeword Length . . . 15

2.4.3 Bit Error Analysis . . . 19

2.4.4 Comparison with Binary Relevance . . . 21

3 New Decoder for Hard and Soft Decoding of Error-correcting Codes 25

3.1 Geometric Decoder for Linear Codes . . . 26

3.2 Experimental Results of Geometric Decoder . . . 28

3.2.1 Bit Error Analysis . . . 30

3.3 Soft-input Decoding and Bitwise Confidence Estimation for k-powerset Learners . . . 32

3.4 Experimental Results of Soft-input Geometric Decoder . . . 35

3.4.1 Soft-input Decoding for Binary Relevance Learners . . . 35

3.4.2 Soft-input Decoding for k-powerset Learners . . . 37

3.5 Comparison with Real-valued ECC . . . 41

4 Conclusion 43

Bibliography 44


A Additional Experiment Results 49

A.1 ML-ECC Using 3-powerset Base Learners . . . 49

A.2 ML-ECC With Different Codeword Lengths . . . 51

A.3 ML-ECC Using Binary Relevance Base Learners . . . 54

A.4 ML-ECC With Geometric Decoders . . . 56


List of Figures

2.1 Performance of ML-ECC using the 3-powerset with Random Forests . . . 13

2.2 Varying codeword length on scene: ML-ECC using the 3-powerset with Random Forests . . . 17

2.3 Varying codeword length on yeast: ML-ECC using the 3-powerset with Random Forests . . . 18

2.4 Bit errors and losses: the scene dataset, M = 127 . . . 19

2.5 Parity bits: the scene dataset, 6 labels, M = 127 . . . 20

2.6 Parity bits: the medical dataset, 45 labels, M = 1023 . . . 21

2.7 Performance of ML-ECC using Binary Relevance with Random Forests . . . 22

3.1 Hard-input soft-output geometric decoding results for ML-ECC using BR with Random Forests . . . 28

3.2 Bit error distribution of BR with Random Forests on the scene dataset . . . 31

3.3 Strength of HAMR on the scene dataset and BR with Random Forests . . . 31

3.4 Strength of BCH on the scene dataset and BR with Random Forests . . . 32

3.5 Comparison between hard-/soft-input geometric decoders in the ML-ECC with the BCH code using BR learners . . . 35

3.6 Comparison between hard-/soft-input geometric decoders in the ML-ECC with the HAMR code using BR learners . . . 36

3.7 Comparison between hard-/soft-input geometric decoders in the ML-ECC with the BCH code using 3-powerset learners . . . 39

3.8 Comparison between hard-/soft-input geometric decoders in the ML-ECC with the HAMR code using 3-powerset learners . . . 40


List of Tables

2.1 Dataset characteristics . . . 12

2.2 Focus of existing works under the ML-ECC framework . . . 13

2.3 0/1 loss of ML-ECC using 3-powerset base learners . . . 14

2.4 Hamming loss of ML-ECC using 3-powerset base learners . . . 15

2.5 0/1 loss of ML-ECC using BR base learners . . . 23

2.6 Hamming loss of ML-ECC using BR base learners . . . 24

3.1 0/1 loss changes when applying the proposed soft-output decoder . . . . 29

3.2 Hamming loss changes when applying the proposed soft-output decoder . . . 30

3.3 0/1 loss of ML-ECC with hard-/soft-input geometric decoders and BR . . . 37

3.4 Hamming loss of ML-ECC with hard-/soft-input geometric decoders and BR . . . 38

3.5 Comparison of soft-input geometric decoders using different confidence estimation methods on k-powerset learners . . . 40

3.6 Comparison between ML-ECC and real-valued ECC methods using Ran- dom Forests base learners . . . 42

3.7 Comparison between ML-ECC and real-valued ECC methods using lo- gistic regression base learners . . . 42

A.1 Micro-F1 of ML-ECC using 3-powerset base learners . . . 49

A.2 Macro-F1 of ML-ECC using 3-powerset base learners . . . 50

A.3 Ranking loss of ML-ECC using 3-powerset base learners . . . 50

A.4 0/1 loss of ML-ECC on scene dataset (3-powerset learners) . . . 51

A.5 Hamming loss of ML-ECC on scene dataset (3-powerset learners) . . . 51

A.6 0/1 loss of ML-ECC on yeast dataset (3-powerset learners) . . . 52

A.7 Hamming loss of ML-ECC on yeast dataset (3-powerset learners) . . . 52

A.8 0/1 loss of ML-ECC on emotions dataset (3-powerset learners) . . . . 52

A.9 Hamming loss of ML-ECC on emotions dataset (3-powerset learners) . . . 53

A.10 0/1 loss of ML-ECC on medical dataset (3-powerset learners) . . . 53

A.11 Hamming loss of ML-ECC on medical dataset (3-powerset learners) . . 53


A.12 Micro-F1 of ML-ECC using BR base learners . . . 54

A.13 Macro-F1 of ML-ECC using BR base learners . . . 55

A.14 Ranking loss of ML-ECC using BR base learners . . . 55

A.15 Micro-F1 of ML-ECC with hard-/soft-input geometric decoders and BR . . . 56

A.16 Macro-F1 of ML-ECC with hard-/soft-input geometric decoders and BR . . . 57

A.17 Ranking Loss of ML-ECC with hard-/soft-input geometric decoders and BR . . . 58


Chapter 1 Introduction

Multi-label classification is an extension of traditional multi-class classification. In particular, the latter aims at accurately associating one single label with an instance, while the former aims at associating a label set. Because of the increasing application needs in domains like text and music categorization [Pestian et al., 2007, Trohidis et al., 2008], scene analysis [Boutell et al., 2004], and genomics [Elisseeff and Weston, 2002, Diplaris et al., 2005], multi-label classification has been attracting much research attention in recent years.

Error-correcting codes (ECC) are rooted in the information-theoretic pursuit of communication [Shannon, 1948]. In particular, the ECC studies how to accurately recover a desired signal block after transmitting the block's encoding through a noisy communication channel. When the desired signal block is the single label (of some instances) and the noisy channel consists of some binary classifiers, it has been shown that a suitable use of the ECC could improve the association (prediction) accuracy of multi-class classification [Dietterich and Bakiri, 1995]. In particular, with the help of the ECC, we can reduce multi-class classification to several binary classification tasks. Then, following the foundation of the ECC in information theory [Shannon, 1948, Mackay, 2003], a suitable ECC can correct a small portion of binary classification errors during the prediction stage and thus improve the prediction accuracy. Several designs, including some classic ECC [Dietterich and Bakiri, 1995] and some adaptively constructed ECC [Schapire, 1997, Li, 2006], have reached promising empirical performance for multi-class classification.

While the benefits of the ECC are well established for multi-class classification, the corresponding use for multi-label classification remains an ongoing research direction. Kouzani and Nasireding [2009] take the first step in this direction by proposing a multi-label classification approach that applies a classic ECC, the Bose-Chaudhuri-Hocquenghem (BCH) code, using a batch of binary classifiers as the noisy channel. The work is followed by some extensions to the convolution code [Kouzani, 2010]. Although the approach shows some good experimental results over existing multi-label classification approaches, a more rigorous study is still needed to understand the advantages and disadvantages of different ECC designs for multi-label classification; such a study is the main focus of this work.

In this work, we formalize the framework for applying the ECC on multi-label classification. The framework is more general than both existing ECC studies for multi-class classification [Dietterich and Bakiri, 1995] and for multi-label classification [Kouzani and Nasireding, 2009]. Then, we conduct a thorough study with a broad spectrum of classic ECC designs: repetition code, Hamming code, BCH code, and low-density parity-check code. The four designs cover the simplest ECC idea to the state-of-the-art ECC in communication systems. Interestingly, such a framework allows us to give a novel ECC-based explanation to the random k-label sets (RAKEL) algorithm, which is popular for multi-label classification. In particular, RAKEL can be viewed as a special type of repetition code coupled with a batch of simple and internal multi-label classifiers.

We empirically demonstrate that RAKEL can be improved by replacing its repetition code with the Hamming code, a slightly stronger ECC. Furthermore, even better performance can be achieved when replacing the repetition code with the BCH code. When compared with the traditional Binary Relevance (BR) approach without the ECC, multi-label classification with the ECC can perform significantly better. The empirical results justify the validity of the ECC framework.

In addition, we design a new decoder for linear ECC by using multiplications to approximate exclusive-OR operations. This decoder is able to handle not only ordinary binary bits from the channels, called hard inputs, but also real-valued bits, called soft inputs. For multi-label classification using the ECC, the soft inputs can be used to represent the confidence of the internal classifiers. Our newly designed decoder allows a proper use of the detailed confidence information to produce more accurate predictions. The experimental results show that this decoder indeed improves the performance of the ECC framework with either hard or soft inputs.

The thesis is organized as follows. First, we introduce the multi-label classification problem in Section 1.1 and present related works in Section 1.2. Chapter 2 illustrates the framework and demonstrates its effectiveness. Chapter 3 presents a new decoder for hard or soft inputs. Finally, we conclude in Chapter 4.

1.1 Problem Setup

Multi-label classification aims at mapping an instance x ∈ R^d to a label set Y ⊆ L = {1, 2, . . . , K}, where K is the number of classes. Following the hypercube view of Tai and Lin [2012], the label set Y can be represented as a binary vector y of length K, where y[i] is 1 if the i-th label is in Y, and 0 otherwise; for example, with K = 4, the label set Y = {1, 3} corresponds to y = (1, 0, 1, 0). Consider a training dataset D = {(x_n, y_n)}_{n=1}^N. A multi-label classification algorithm uses D to locate a multi-label classifier h: R^d → {0, 1}^K such that h(x) predicts y well on future test examples (x, y).

There are several loss functions for evaluating whether h(x) predicts y well. Two common ones are:

• subset 0/1 loss: this loss function is arguably one of the most challenging loss functions because zero (small) loss occurs only when every bit of the prediction is correct.

$$\Delta_{0/1}(\tilde{y}, y) = [\![\, \tilde{y} \neq y \,]\!]$$

• Hamming loss: this loss function considers individual bit differences.

$$\Delta_{HL}(\tilde{y}, y) = \frac{1}{K}\sum_{i=1}^{K} [\![\, \tilde{y}[i] \neq y[i] \,]\!]$$
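As a concrete illustration (a minimal sketch in NumPy, not code from the thesis), both losses can be computed directly from the binary label vectors:

```python
import numpy as np

def subset_01_loss(y_pred, y_true):
    """Subset 0/1 loss: 1 unless every bit of the prediction is correct."""
    return float(not np.array_equal(y_pred, y_true))

def hamming_loss(y_pred, y_true):
    """Hamming loss: fraction of the K label bits that differ."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

# K = 4 labels: the label set Y = {1, 3} corresponds to y = [1, 0, 1, 0]
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]                  # one bit wrong
print(subset_01_loss(y_pred, y_true))  # 1.0
print(hamming_loss(y_pred, y_true))    # 0.25
```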

Dembczyński et al. [2010] show that the two loss functions focus on different statistics of the underlying probability distribution from a Bayesian perspective. While a wide range of other loss functions exist [Tsoumakas and Vlahavas, 2007], in this thesis we only focus on 0/1 and Hamming because they connect tightly with the ECC framework that will be discussed.¹

¹We follow the final remark of Dembczyński et al. [2010] and only focus on the loss functions that are related to our algorithmic goals.

1.2 Related Works

The hypercube view [Tai and Lin, 2012] unifies many existing problem transformation approaches [Tsoumakas and Vlahavas, 2007] for multi-label classification. Problem transformation approaches transform multi-label classification into one or more reduced learning tasks. For instance, one simple problem transformation approach for multi-label classification is called Binary Relevance (BR), which learns one binary classifier per individual label. Another simple problem transformation approach is called label powerset (LP), which transforms multi-label classification to one multi-class classification task with a huge number of extended labels. One popular problem transformation approach that lies between BR and LP is called random k-label sets (RAKEL) [Tsoumakas and Vlahavas, 2007], which transforms multi-label classification into many multi-class classification tasks with a smaller number of extended labels.

Multi-label classification with compressive sensing [Hsu et al., 2009] is a problem transformation approach that encodes the training label set y_n to a shorter, real-valued codeword vector using compressive sensing. Tai and Lin [2012] study some different encoding schemes from label sets to real-valued codewords. Note that those encoding schemes focus on compression: removing the redundancy within the binary signals (label sets) to form shorter codewords. The compression perspective can lead to not only more efficient training and testing but also more meaningful codewords.

Compression is a classic task in information theory, based on Shannon's first theorem [Shannon, 1948]. Another classic task in information theory aims at expansion: adding redundancy in the (longer) codewords to ensure robust decoding against noise contamination. The power of expansion is characterized by Shannon's second theorem [Shannon, 1948]. The ECC targets using the power of expansion systematically. In particular, the ECC works by encoding a block of signal to a longer codeword b before passing it through the noisy channel, and then decoding the received codeword ˜b back to the block appropriately. Then, under some assumptions [Mackay, 2003], the block can be perfectly recovered, resulting in zero block-decoding error; in some cases, the block can only be almost perfectly recovered, resulting in a few bit-decoding errors.

If we take the “block” as the label set y for every example (x, y) and a batch of base learners as a channel that outputs the contaminated block ˜b, the block-decoding error corresponds to ∆0/1 while the bit-decoding error corresponds to a scaled version of ∆HL. Such a correspondence motivates us to study whether suitable ECC designs can be used to improve multi-label classification, which will be formalized in the next chapter.


Chapter 2

ECC for Multi-label Classification

2.1 ML-ECC Framework

We now describe the ECC framework in detail. The main idea is to use an ECC encoder enc(·): {0, 1}^K → {0, 1}^M to expand the original label set y ∈ {0, 1}^K to a codeword b ∈ {0, 1}^M that contains redundant information. Then, instead of learning a multi-label classifier h(x) between x and y, we learn a multi-label classifier ˜h(x) between x and the corresponding b. In other words, we transform the original multi-label classification problem into another (larger) multi-label classification task. During prediction, we use h(x) = dec ◦ ˜h(x), where dec(·): {0, 1}^M → {0, 1}^K is the corresponding ECC decoder, to get a multi-label prediction ˜y ∈ {0, 1}^K. The simple steps of the framework are shown in Algorithm 1.

Algorithm 1 is simple and general. It can be coupled with any block-coding ECC and any base learner A_b to form a new multi-label classification algorithm. For instance, the ML-BCHRF method [Kouzani and Nasireding, 2009] uses the BCH code (see Subsection 2.2.3) as the ECC and BR on Random Forest as the base learner A_b. Note that Kouzani and Nasireding [2009] did not describe why ML-BCHRF may lead to improvements in multi-label classification. Next, we show a simple theorem that connects the ECC framework with ∆0/1.

Many ECCs can guarantee to correct up to m bit-flipping errors in a codeword of length M; we will introduce some of those ECC in Section 2.2.

Algorithm 1 Error-Correcting Framework

• Parameter: an ECC with encoder enc(·) and decoder dec(·); a base multi-label learner A_b

• Training: Given D = {(x_n, y_n)}_{n=1}^N,

  1. ECC-encode each y_n to b_n = enc(y_n);
  2. Return ˜h = A_b({(x_n, b_n)}).

• Prediction: Given any x,

  1. Predict a codeword ˜b = ˜h(x);
  2. Return h(x) = dec(˜b) by ECC-decoding.
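A minimal Python sketch of Algorithm 1 follows. It assumes a scikit-learn-style base learner whose fit returns a predictor; enc, dec, and base_learner are placeholders for any block-coding ECC and any base multi-label learner, so this illustrates the framework rather than the exact implementation used in the experiments.

```python
import numpy as np

class MLECC:
    """Sketch of Algorithm 1: encode the labels, learn the codeword bits, decode the predictions."""

    def __init__(self, enc, dec, base_learner):
        self.enc = enc                      # enc: {0,1}^K -> {0,1}^M
        self.dec = dec                      # dec: {0,1}^M -> {0,1}^K
        self.base_learner = base_learner    # A_b: any multi-label learner on the codeword bits

    def fit(self, X, Y):
        # 1. ECC-encode each label vector y_n into a codeword b_n = enc(y_n).
        B = np.array([self.enc(y) for y in Y])
        # 2. Train the base learner on the transformed dataset {(x_n, b_n)}.
        self.h_tilde = self.base_learner.fit(X, B)
        return self

    def predict(self, X):
        # 1. Predict a codeword b_tilde = h_tilde(x) for each instance.
        B_tilde = self.h_tilde.predict(X)
        # 2. ECC-decode each predicted codeword back to a label vector.
        return np.array([self.dec(b) for b in B_tilde])
```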

Then, if ∆HL of ˜h is low, the ECC framework guarantees that ∆0/1 of h is also low. The guarantee is formalized as follows.

Theorem 1. Consider an ECC that can correct up to m bit errors in a codeword of length M. Then, for any T test examples {(x_t, y_t)}_{t=1}^T, let b_t = enc(y_t). If

$$\Delta_{HL}(\tilde{h}) = \frac{1}{T}\sum_{t=1}^{T} \Delta_{HL}\bigl(\tilde{h}(x_t), b_t\bigr) \le \epsilon,$$

then h = dec ◦ ˜h satisfies

$$\Delta_{0/1}(h) = \frac{1}{T}\sum_{t=1}^{T} \Delta_{0/1}\bigl(h(x_t), y_t\bigr) \le \frac{M\epsilon}{m+1}.$$

Proof. When the average Hamming loss of ˜h is at most ε, ˜h makes at most εTM bits of error over all the b_t. Since the ECC corrects up to m bits of error in one b_t, an adversary has to make at least m + 1 bits of error on b_t to make h(x_t) different from y_t. The number of such b_t can therefore be at most εTM/(m + 1). Thus, ∆0/1(h) is at most εTM/(T(m + 1)) = Mε/(m + 1).
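For a rough feel of the bound, plug in illustrative numbers (chosen only for illustration, not tied to any particular code in the experiments): with M = 127, m = 15, and ε = 0.01,

$$\Delta_{0/1}(h) \;\le\; \frac{M\epsilon}{m+1} \;=\; \frac{127 \times 0.01}{16} \;\approx\; 0.08.$$

That is, even a 1% bit error rate only guarantees a subset 0/1 loss of at most about 8%, which is why the attainable ε on the transformed task matters as much as the strength of the code.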

From Theorem 1, it appears that we should simply use some stronger ECC, for which m is larger. Nevertheless, note that we are applying the ECC in a learning scenario. Thus, ε is not a fixed value, but depends on whether A_b can learn well from the transformed dataset ˜D = {(x_n, b_n)}. Stronger ECC usually contains redundant bits that come from complicated compositions of the original bits in y, and the compositions may not be easy to learn. The trade-off has been revealed when applying the ECC to multi-class classification [Li, 2006]. In the next section, we study ECC of different strengths and empirically verify the trade-off in Section 2.4.

2.2 Review of Classic ECC

Next, we review four ECC designs that will be used in the empirical study. The four designs cover a broad spectrum of practical choices in terms of strength: the repetition code, the Hamming on repetition code, the Bose-Chaudhuri-Hocquenghem code, and the low-density parity-check code.

2.2.1 Repetition Code

One of the simplest ECCs is the repetition code (REP) [Mackay, 2003], for which every bit in y is repeated ⌊M/K⌋ times in b during encoding. If M is not a multiple of K, then (M mod K) bits are repeated one more time. The decoding takes a majority vote using the received copies of each bit. Because of the majority vote, the repetition code corrects up to m_REP = (1/2)⌊M/K⌋ − 1 bit errors in b. We will discuss the connection between REP and the RAKEL algorithm in Section 2.3.
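A minimal sketch of REP encoding and majority-vote decoding (illustrative only; the exact assignment of the leftover M mod K copies and tie handling may differ from the implementation used here):

```python
import numpy as np

def rep_encode(y, M):
    """Repeat each of the K label bits floor(M/K) times; the first (M mod K) bits get one extra copy."""
    K = len(y)
    reps = [M // K + (1 if i < M % K else 0) for i in range(K)]
    return np.concatenate([np.full(r, y[i]) for i, r in enumerate(reps)])

def rep_decode(b, K):
    """Majority vote over the received copies of each label bit (ties decoded as 0)."""
    M = len(b)
    reps = [M // K + (1 if i < M % K else 0) for i in range(K)]
    out, pos = [], 0
    for r in reps:
        out.append(int(np.sum(b[pos:pos + r]) * 2 > r))
        pos += r
    return np.array(out)

y = np.array([1, 0, 1])
b = rep_encode(y, M=8)               # [1 1 1 0 0 0 1 1]
b_noisy = b.copy(); b_noisy[0] ^= 1  # flip one bit
print(rep_decode(b_noisy, K=3))      # still [1 0 1]
```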

2.2.2 Hamming on Repetition Code

A slightly more complicated ECC than REP is called the Hamming code (HAM) [Hamming, 1950], which can correct m_HAM = 1 bit error in b by adding some parity check bits (exclusive-OR operations of some bits in y). One typical choice of HAM is HAM(7, 4), which encodes any y with K = 4 to b with M = 7. Note that m_HAM = 1 is worse than m_REP = (1/2)⌊M/K⌋ − 1 when M is large. Thus, we consider applying HAM(7, 4) on every 4 (permuted) bits of REP. That is, to form a codeword b of M bits from a block y of K bits, we first construct an REP of 4⌊M/7⌋ + (M mod 7) bits from y; then for every 4 bits in the REP, we add 3 parity bits to b using HAM(7, 4). The resulting code will be named Hamming on Repetition (HAMR). During decoding, the decoder of HAM(7, 4) is first used to recover the 4-bit sub-blocks in the REP. Then, the decoder of REP (majority vote) takes place.

It is not hard to compute m_HAMR by analyzing the REP and HAM parts separately. When M is a multiple of 7 and K is a multiple of 4, it can be proved that m_HAMR = 4M/(7K), which is generally better than m_REP = (1/2)⌊M/K⌋ − 1. Thus, HAMR is slightly stronger than REP for ECC purposes. We include HAMR in our study to verify whether a simple inclusion of some parity bits for the ECC can readily improve the performance for multi-label classification.
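Plugging illustrative numbers into the two formulas above (M = 56 and K = 4, values chosen only to satisfy the divisibility assumptions):

$$m_{REP} = \tfrac{1}{2}\Bigl\lfloor\tfrac{56}{4}\Bigr\rfloor - 1 = 6, \qquad m_{HAMR} = \frac{4 \cdot 56}{7 \cdot 4} = 8,$$

so at the same codeword length, HAMR corrects two more bit errors than REP.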

2.2.3 Bose-Chaudhuri-Hocquenghem Code

BCH was invented by Bose and Ray-Chaudhuri [1960], and independently by Hocquenghem [1959]. It can be viewed as a sophisticated extension of HAM and allows correcting multiple bit errors. BCH with length M = 2^p − 1 has (M − K) parity bits, and it can correct m_BCH = (M − K)/p bits of error [Mackay, 2003], which is in general stronger than REP and HAMR. The caveat is that the decoder of BCH is more complicated than the ones of REP and HAMR.

We include BCH in our study because it is one of the most popular ECCs in real-world communication systems. In addition, we compare BCH with HAMR to see if a strong ECC can do better for multi-label classification.

2.2.4 Low-density Parity-check Code

Low-density parity-check code (LDPC) [Mackay, 2003] has recently been drawing much research attention in communications. LDPC shares an interesting connection between ECC and Bayesian learning [Mackay, 2003]. While it is difficult to state the strength of LDPC in terms of a single m_LDPC, LDPC has been shown to approach the theoretical limit in some special channels [Gallager, 1963], which makes it a state-of-the-art ECC.

We choose to include LDPC in our study to see whether it is worthwhile to go beyond BCH with more sophisticated encoder/decoders.


2.3 ECC View of RAKEL

RAKEL is a multi-label classification algorithm proposed by Tsoumakas and Vlahavas [2007]. Define a k-label set as a size-k subset of L. Each iteration of RAKEL randomly selects a (different) k-label set and builds a multi-label classifier on the k labels with a Label Powerset (LP). After running for R iterations, RAKEL obtains a size-R ensemble of LP classifiers. The prediction on each label is done by a majority vote from classifiers associated with the label.

Equivalently, we can draw (with replacement) M = Rk labels first before building the LP classifiers. Then, selecting k-label sets is equivalent to encoding y by a variant of REP, which will be called RAKEL repetition code (RREP). Similar to REP, each bit y[i] is repeated several times in b since label i is involved in several k-label sets. After encoding y to b, each LP classifier, called k-powerset, acts as a sub-channel that transmits a size-k sub-block of the codeword b. The prediction procedure follows the decoder of the usual REP.
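A sketch of this RREP view in Python (hypothetical helper names; the LP/k-powerset learners themselves are abstracted away): drawing M = Rk label indices with replacement defines the RREP encoder, and each size-k chunk becomes one LP sub-task, while decoding is exactly the REP majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def rrep_design(K, R, k):
    """Draw M = R*k label indices with replacement; row i is the i-th k-label set."""
    return rng.integers(0, K, size=R * k).reshape(R, k)

def rrep_encode(y, chunks):
    """RREP encoding: each codeword bit simply repeats the label bit it was drawn from."""
    return y[chunks.ravel()]

def rrep_decode(b_tilde, chunks, K):
    """REP decoding: majority vote over all copies of each label (ties decoded as 0)."""
    votes, counts = np.zeros(K), np.zeros(K)
    for pos, j in enumerate(chunks.ravel()):
        votes[j] += b_tilde[pos]
        counts[j] += 1
    return (votes * 2 > counts).astype(int)

K, R, k = 6, 10, 3
chunks = rrep_design(K, R, k)      # 10 random 3-label sets, M = 30 codeword bits
y = np.array([1, 0, 0, 1, 0, 1])
b = rrep_encode(y, chunks)         # the codeword seen by the k-powerset sub-channels
print(rrep_decode(b, chunks, K))   # majority-vote decoding of the codeword
```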

The ECC view above decomposes the original RAKEL into two parts: the ECC and the base learner Ab. Next, we empirically study how the two parts affect the performance of multi-label classification.

2.4 Experimental Results

We compare RREP, HAMR, BCH, and LDPC with the ECC framework on seven real-world datasets in different domains: scene, emotions, yeast, tmc2007, genbase, medical, and enron [Tsoumakas et al., 2010]. The statistics of these datasets are shown in Table 2.1. All results are reported as the mean and standard error over 30 runs with randomly split test sets. The sizes of the training and testing sets are set according to the sizes in the original datasets. Note that for the tmc2007 dataset, which has 28596 instances in total, we randomly sample 5% for training and another 5% for testing in each run.

We set RREP with k = 3.


Table 2.1: Dataset characteristics

Dataset    K    Training  Testing  Features
scene      6    1211      1196     294
emotions   6    391       202      72
yeast      14   1500      917      103
tmc2007    22   1430      1430     500
genbase    27   463       199      1186
medical    45   333       645      1449
enron      53   1123      579      1001

Then, for each ECC, we first consider a 3-powerset with either Random Forest, a non-linear support vector machine (SVM), or logistic regression as the multi-class classifier inside the 3-powerset. Note that we randomly permute the bits of b and apply the inverse permutation on ˜b for the ECC other than RREP, to ensure that each 3-powerset works on diverse sub-blocks. In addition to the 3-powerset base learners, we also consider BR base learners in Subsection 2.4.4.
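The random permutation and its inverse can be realized as follows (a small sketch assuming codewords stored as NumPy arrays; the permutation is drawn once per run, before training):

```python
import numpy as np

rng = np.random.default_rng(0)

M = 127
perm = rng.permutation(M)          # fixed random bit order for this run
inv_perm = np.argsort(perm)        # inverse permutation

def permute_codeword(b):
    return b[perm]                 # applied to b = enc(y) before splitting into 3-powerset sub-blocks

def unpermute_prediction(b_tilde_perm):
    return b_tilde_perm[inv_perm]  # applied to the predicted codeword before ECC decoding
```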

We take the default Random Forest from Weka [Hall et al., 2009] with 60 trees. For the non-linear SVM, we use LIBSVM [Chang and Lin, 2001] with the Gaussian kernel and choose (C, γ) by cross validation on the training data from {2^−5, 2^−3, ..., 2^7} × {2^−9, 2^−7, ..., 2^1}. In addition, we use LIBLINEAR [Fan et al., 2008] for logistic regression and choose the parameter C by cross validation from {2^−5, 2^−3, ..., 2^7}.

Note that the experiments in this work are generally broader than existing works related to multi-label classification with the ECC in terms of the datasets, the codes, the “channels,” and the base learners, as shown in Table 2.2. The goal of the experiments is not only to justify that the framework is promising but also to rigorously identify the best codes, channels, and base learners for solving general multi-label classification tasks via the ECC.

2.4.1 Validity of ML-ECC Framework

First, we demonstrate the validity of the ML-ECC framework. We fix the codeword length M to be about 20 times the number of labels K. The lengths are of the form 2^p − 1 for an integer p because the BCH code only works on such lengths.


Table 2.2: Focus of existing works under the ML-ECC framework

work                                       # datasets   codes                       channels            base learners
RAKEL [Tsoumakas and Vlahavas, 2007]       3            RREP                        k-powerset          linear SVM
ML-BCHRF [Kouzani and Nasireding, 2009]    3            BCH                         BR                  Random Forest
ML-BCHRF & ML-CRF [Kouzani, 2010]          1            convolution and BCH         BR                  Random Forest
this work                                  7            RREP, HAMR, BCH, and LDPC   3-powerset and BR   Random Forest, Gaussian SVM, logistic regression

Figure 2.1: Performance of ML-ECC using the 3-powerset with Random Forests. (a) 0/1 loss; (b) Hamming loss.

More experiments on different codeword lengths are presented in Section 2.4.2. Here the base multi-label learner is the 3-powerset with Random Forests. Following the description in Section 2.3, RREP with the 3-powerset is exactly the same as RAKEL with k = 3.

The results on the 0/1 loss are shown in Figure 2.1(a). HAMR achieves lower ∆0/1 than RREP on 5 out of the 7 datasets (scene, emotions, yeast, tmc2007, and medical) and achieves ∆0/1 similar to RREP on the other 2. This verifies that using some parity bits instead of repetition improves the strength of the ECC, which in turn improves the 0/1 loss. Along the same direction, BCH performs even better than both HAMR and RREP, especially on the medical dataset. The superior performance of BCH justifies that the ECC is useful for multi-label classification.


Table 2.3: 0/1 loss of ML-ECC using 3-powerset base learners

base learner          ECC             scene (M = 127)  emotions (M = 127)  yeast (M = 255)  tmc2007 (M = 511)
Random Forest         RREP (RAKEL)    .3390 ± .0022    .6475 ± .0057       .7939 ± .0022    .7738 ± .0025
Random Forest         HAMR            .2855 ± .0022    .6393 ± .0055       .7789 ± .0021    .7693 ± .0024
Random Forest         BCH             .2671 ± .0020    .6366 ± .0061       .7764 ± .0021    .7273 ± .0018
Random Forest         LDPC            .3058 ± .0024    .6606 ± .0050       .8080 ± .0024    .7728 ± .0022
Gaussian SVM          RREP (RAKEL)    .2856 ± .0016    .7759 ± .0055       .7601 ± .0023    .7196 ± .0024
Gaussian SVM          HAMR            .2635 ± .0017    .7729 ± .0052       .7530 ± .0021    .7162 ± .0023
Gaussian SVM          BCH             .2576 ± .0017    .7744 ± .0053       .7429 ± .0017    .7095 ± .0020
Gaussian SVM          LDPC            .2780 ± .0020    .8040 ± .0044       .7574 ± .0021    .7403 ± .0019
Logistic Regression   RREP (RAKEL)    .3601 ± .0019    .6949 ± .0070       .8161 ± .0017    .7408 ± .0024
Logistic Regression   HAMR            .3299 ± .0018    .6954 ± .0057       .8061 ± .0019    .7383 ± .0025
Logistic Regression   BCH             .3148 ± .0018    .7068 ± .0046       .7899 ± .0020    .7233 ± .0024
Logistic Regression   LDPC            .3655 ± .0028    .7295 ± .0056       .8082 ± .0024    .7562 ± .0027

base learner          ECC             genbase (M = 511)  medical (M = 1023)  enron (M = 1023)
Random Forest         RREP (RAKEL)    .0295 ± .0021      .6508 ± .0024       .8866 ± .0038
Random Forest         HAMR            .0276 ± .0021      .6420 ± .0029       .8855 ± .0036
Random Forest         BCH             .0263 ± .0020      .4598 ± .0036       .8659 ± .0039
Random Forest         LDPC            .0288 ± .0021      .5238 ± .0032       .8830 ± .0036
Gaussian SVM          RREP (RAKEL)    .0295 ± .0025      .3679 ± .0036       .8725 ± .0041
Gaussian SVM          HAMR            .0303 ± .0026      .3641 ± .0031       .8693 ± .0042
Gaussian SVM          BCH             .0255 ± .0019      .3394 ± .0027       .8477 ± .0045
Gaussian SVM          LDPC            .0285 ± .0021      .3856 ± .0031       .8666 ± .0041
Logistic Regression   RREP (RAKEL)    .3593 ± .0078      .5507 ± .0254       .8762 ± .0035
Logistic Regression   HAMR            .2275 ± .0099      .5268 ± .0230       .8754 ± .0035
Logistic Regression   BCH             .0250 ± .0018      .3797 ± .0044       .8504 ± .0042
Logistic Regression   LDPC            .0325 ± .0018      .4516 ± .0083       .8653 ± .0038

On the other hand, another sophisticated code, LDPC, gets higher 0/1 loss than BCH on every dataset, and even higher 0/1 loss than RREP on the emotions and yeast datasets, which suggests that LDPC may not be a good choice for the ECC framework.

Next, we look at ∆HL, shown in Figure 2.1(b). The Hamming loss of HAMR is comparable to that of RREP, where each wins on two datasets. BCH beats both HAMR and RREP on the tmc2007, genbase, and medical datasets but loses on the other four datasets. LDPC has the highest Hamming loss among the codes on all datasets. Thus, simpler codes like RREP and HAMR perform better in terms of ∆HL. A stronger code like BCH may guard ∆0/1 better, but it can pay more in terms of ∆HL.

Similar results show up when using the Gaussian SVM or logistic regression as the base learner instead of Random Forest, as shown in Tables 2.3 and 2.4. The boldface entries are the lowest-loss ones for the given dataset and base learner. The results validate that the performance of multi-label classification can be improved by applying the ECC.


Table 2.4: Hamming loss of ML-ECC using 3-powerset base learners

base learner          ECC             scene (M = 127)  emotions (M = 127)  yeast (M = 255)  tmc2007 (M = 511)
Random Forest         RREP (RAKEL)    .0755 ± .0006    .1778 ± .0018       .1884 ± .0007    .0674 ± .0003
Random Forest         HAMR            .0748 ± .0006    .1798 ± .0019       .1894 ± .0008    .0671 ± .0003
Random Forest         BCH             .0753 ± .0007    .1858 ± .0021       .1928 ± .0008    .0662 ± .0003
Random Forest         LDPC            .0817 ± .0007    .1907 ± .0021       .2012 ± .0007    .0734 ± .0003
Gaussian SVM          RREP (RAKEL)    .0719 ± .0005    .2432 ± .0021       .1853 ± .0007    .0613 ± .0003
Gaussian SVM          HAMR            .0723 ± .0005    .2492 ± .0023       .1868 ± .0006    .0610 ± .0003
Gaussian SVM          BCH             .0739 ± .0006    .2644 ± .0019       .1898 ± .0008    .0629 ± .0003
Gaussian SVM          LDPC            .0755 ± .0006    .2634 ± .0027       .1917 ± .0007    .0679 ± .0003
Logistic Regression   RREP (RAKEL)    .0915 ± .0005    .2026 ± .0025       .1993 ± .0007    .0634 ± .0003
Logistic Regression   HAMR            .0911 ± .0005    .2064 ± .0024       .2003 ± .0007    .0634 ± .0003
Logistic Regression   BCH             .0920 ± .0005    .2233 ± .0022       .2051 ± .0008    .0653 ± .0003
Logistic Regression   LDPC            .0989 ± .0007    .2202 ± .0021       .2054 ± .0007    .0701 ± .0003

base learner          ECC             genbase (M = 511)  medical (M = 1023)  enron (M = 1023)
Random Forest         RREP (RAKEL)    .0012 ± .0001      .0182 ± .0001       .0477 ± .0004
Random Forest         HAMR            .0012 ± .0001      .0180 ± .0001       .0479 ± .0004
Random Forest         BCH             .0011 ± .0001      .0159 ± .0001       .0506 ± .0004
Random Forest         LDPC            .0013 ± .0001      .0192 ± .0002       .0538 ± .0005
Gaussian SVM          RREP (RAKEL)    .0013 ± .0001      .0112 ± .0001       .0449 ± .0004
Gaussian SVM          HAMR            .0013 ± .0001      .0111 ± .0001       .0449 ± .0004
Gaussian SVM          BCH             .0010 ± .0001      .0114 ± .0001       .0516 ± .0006
Gaussian SVM          LDPC            .0014 ± .0001      .0140 ± .0001       .0530 ± .0005
Logistic Regression   RREP (RAKEL)    .0179 ± .0006      .0190 ± .0011       .0453 ± .0003
Logistic Regression   HAMR            .0102 ± .0005      .0176 ± .0009       .0454 ± .0003
Logistic Regression   BCH             .0013 ± .0001      .0137 ± .0003       .0505 ± .0004
Logistic Regression   LDPC            .0024 ± .0002      .0187 ± .0006       .0528 ± .0004

More specifically, we may improve the RAKEL algorithm by learning some parity bits instead of repetitions. Based on this experiment, we suggest that using HAMR for multi-label classification will improve ∆0/1 while maintaining ∆HL comparable to RAKEL. If we use BCH instead, we will improve ∆0/1 further but may pay for it in ∆HL. We also report the micro and macro F1 scores, as well as the pairwise label ranking loss, in Tables A.1, A.2, and A.3, respectively.

2.4.2 Comparison of Codeword Length

Now, we compare different lengths of codewords M. With a larger M, the codes can correct more errors, but the base learners take a longer time to train. By experimenting with different M, we may find a better trade-off between performance and efficiency.

The performance of the ECC framework with different codeword lengths on the scene dataset is shown in Figure 2.2. Here, the base learner is again the 3-powerset with Random Forests. The codeword length M varies from 31 to 127, which is about 5 to 20 times the number of labels K. We do not include shorter codewords because their performance is not stable. Note that BCH only allows M = 2^p − 1, and thus we conduct the experiments of BCH on those codeword lengths.

We first look at the 0/1 loss in Figure 2.2(a). The horizontal axis indicates the codeword length M and the vertical axis is the 0/1 loss on the test set. We see that ∆0/1 of RREP stays around 0.335 no matter how long the codewords are. This implies that the power of repetition bits reaches its limit very soon: for example, when all the 3-powerset combinations of labels are learned, additional repetitions give very limited improvements. Therefore, methods using repetition bits only, such as RAKEL, cannot take advantage of the extra bits in the codewords.

The ∆0/1 of HAMR and BCH decreases slightly with M, but the differences between M = 63 and M = 127 are generally small (smaller than the differences between M = 31 and M = 63, in particular). This indicates that learning some parity bits provides additional information for prediction, which cannot be learned easily from repetition bits, and such information remains beneficial for longer codewords, compared with repetition bits. One reason is that the number of 3-powerset combinations of parity bits is exponentially larger than that of combinations of labels. The performance of LDPC is not as stable as the other codes, possibly because of its sophisticated decoding step. Nevertheless, we still see that its ∆0/1 decreases slightly with M.

Figure 2.2(b) shows ∆HL versus M for each ECC. The ∆HL of RREP is the lowest among the codes when M is small, but it remains almost constant when M ≥ 63, while ∆HL of HAMR and BCH keeps decreasing. This matches our finding that extra repetition bits give limited information. When M = 127, BCH is comparable to RREP in terms of ∆HL. HAMR is even better than RREP at that codeword length, and becomes the best code regarding ∆HL. Thus, while a stronger code like BCH may guard ∆0/1 better, it can pay more in terms of ∆HL.

As stated in Sections 1.1 and 2.1, the base learners serve as the channels in the ECC framework, and the performance of the base learners may be affected by the codes. Therefore, using a strong ECC does not always improve multi-label classification performance. Next, we verify the trade-off by measuring the bit error rate ∆BER of ˜h, defined as the Hamming loss between the predicted codeword ˜h(x) and the actual codeword b.


Figure 2.2: Varying codeword length on scene: ML-ECC using the 3-powerset with Random Forests. (a) 0/1 loss; (b) Hamming loss; (c) bit-error rate.

A higher bit error rate implies that the transformed task is harder.

Figure 2.2(c) shows ∆BER versus M for each ECC. RREP has an almost constant bit error rate. HAMR also has a nearly constant bit error rate, but at a higher value. The bit error rate of BCH is similar to that of HAMR when the codeword is short, but it increases with M. One explanation is that some of the parity bits are harder to learn than repetition bits. The ratio between repetition bits and parity bits is constant in M for both RREP and HAMR (RREP has no parity bits, and HAMR has 3 parity bits for every 4 repetition bits), while BCH has more parity bits with larger M. The different bit error rates justify the trade-off between the strength of the ECC and the hardness of the base learning tasks.


Figure 2.3: Varying codeword length on yeast: ML-ECC using the 3-powerset with Random Forests. (a) 0/1 loss; (b) Hamming loss; (c) bit-error rate.

With more parity bits, one can correct more bit errors but may have harder tasks to learn; with fewer or even no parity bits, one cannot correct many errors but enjoys simpler learning tasks.

Similar results show up on other datasets with all three base learners. The performance on the yeast dataset with the 3-powerset and Random Forests is shown in Figure 2.3. Because the number of labels in the yeast dataset is about twice that of the scene dataset, the codeword length here ranges from 63 to 255, which is also about twice as long as in the experiments on the scene dataset. Again, we see that the benefits of parity bits remain valid for longer codewords than those of repetition bits, and that more parity bits make the transformed task harder to learn. This result points out the trade-off between the strength of the ECC and the hardness of the base learning tasks.


Figure 2.4: Bit errors and losses on the scene dataset, M = 127. (a) relative frequency vs. number of bit errors; (b) 0/1 loss vs. number of bit errors; (c) Hamming loss vs. number of bit errors.

2.4.3 Bit Error Analysis

To further analyze the difference between different ECC designs, we zoom in to M = 127 of Figure 2.2. The instances are divided into groups according to the number of bit errors at that instance. The relative frequency of each group, i.e., the ratio of the group size to the total number of instances, is plotted in Figure 2.4(a). The average ∆0/1 and ∆HL of each group are also plotted in Figure 2.4(b) and 2.4(c). The curve of each ECC forms two peak regions in Figure 2.4(a). Besides the peak at 0, which means no bit error happens on the instances, the other peak varies from one code to another. The positions of the peaks suggest the hardness of the transformed learning task, similar to our findings in Figure 2.2(c).

We can clearly see the difference in the strength of the different ECC from Figure 2.4(b). BCH can tolerate up to 31-bit errors, but its ∆0/1 sharply increases over 0.8 for 32-bit errors.


Figure 2.5: Parity bits on the scene dataset (6 labels, M = 127). (a) relative frequency vs. number of labels XOR'ed; (b) bit error rate vs. number of labels XOR'ed.

HAMR can correct 13-bit errors perfectly, and its ∆0/1 increases slowly when more errors occur. Both RREP and LDPC can perfectly correct only 9-bit errors, but LDPC is able to sustain a low ∆0/1 even when there are 32-bit errors. It would be interesting to study the reason behind this long tail from a Bayesian network perspective.

We can also look at the relation between the number of bit errors and ∆HL, as shown in Figure 2.4(c). The BCH curve grows sharply when the number of bit errors is larger than 31, which explains the inferior performance of BCH relative to RREP in terms of ∆HL. The LDPC curve grows much more slowly, but its right-sided peak in Figure 2.4(a) still leads to a higher overall ∆HL. On the other hand, RREP and HAMR enjoy a better balance between the peak position in Figure 2.4(a) and the growth in Figure 2.4(c), and thus a lower overall ∆HL.

Figure 2.4(a) suggests that the transformed learning task of a more sophisticated ECC is harder. The reason is that a sophisticated ECC contains many parity bits, which are the exclusive-OR of labels, and the parity bits are harder for the base learners to learn. We demonstrate this in Figure 2.5 using the scene dataset (6 labels) and fixing M = 127. The codeword bits are divided into groups according to the number of labels XOR'ed to form the bit. The relative frequency of each group is plotted in Figure 2.5(a). We can see that all codeword bits of RREP are formed by 1 label, and the bits of HAMR are formed by 1 or 3 labels. For BCH and LDPC, the number of labels XOR'ed in a bit ranges from none (0) to all (6) labels, while most of the bits are the XOR of half of the labels (3 labels).

Figure 2.6: Parity bits on the medical dataset (45 labels, M = 1023). (a) relative frequency vs. number of labels XOR'ed; (b) bit error rate vs. number of labels XOR'ed.

Next, we show how well the base learners learn each group of bits in Figure 2.5(b). Here the base learner is the 3-powerset with Random Forests. The figure suggests that the parity bits (XOR'ing 2 or more labels) result in harder learning tasks and higher bit error rates than the original labels (XOR'ing 1 label). One exception is the bit XOR'ed from all (6) labels, which is easier to learn than the original labels. The reason is that the bit XOR'ed from all labels is equivalent to an indicator of whether the number of labels is odd, and a constant predictor works well for this because about 92% of all instances in the scene dataset have 1 or 3 labels. Since BCH and LDPC have many bits XOR'ed from 2-4 labels, their bit error rates are higher than those of RREP and HAMR, as shown in Figure 2.2(c).

These findings also appear on other datasets and with other base learners, such as the medical dataset (45 labels, M = 1023) shown in Figure 2.6. BCH and LDPC have many bits XOR'ed from about half of the labels, and the transformed learning tasks for such bits are harder than those for the original labels.

2.4.4 Comparison with Binary Relevance

In addition to the 3-powerset base learners, we also consider BR base learners, which simply build a classifier for each bit in the codeword space. Note that if we couple the ECC framework with RREP and BR, the resulting algorithm is almost the same as the original BR; for example, using RREP and BR with SVM is equivalent to using BR with bootstrap-aggregated SVM.

Figure 2.7: Performance of ML-ECC using Binary Relevance with Random Forests. (a) 0/1 loss; (b) Hamming loss.

We first compare the performance of the ECC designs using the BR base learner with Random Forests. The results on 0/1 loss are shown in Figure 2.7(a). From the figure, we can see that BCH and HAMR have superior performance to the other ECC, with BCH being the better choice. RREP (BR), on the other hand, leads to the worst 0/1 loss. The result again justifies the usefulness of coupling BR with the ECC instead of only the original y.

Note that LDPC also performs better than BR on two datasets, but is not as good as HAMR and BCH. Thus, over-sophisticated ECC like LDPC may not be necessary for multi-label classification.

In Figure 2.7(b), we present the results on ∆HL. In contrast to the case when using the 3-powerset base learner, here both HAMR and BCH can achieve better ∆HL than RREP (BR) in most of the datasets. HAMR wins on three datasets, while BCH wins on four.

Thus, coupling stronger ECC with the BR base learner can improve both ∆0/1 and ∆HL. However, LDPC performs worse than BR in terms of ∆HL, which again shows that LDPC may not be suitable for multi-label classification.

Experiments with other base learners also support similar findings, as shown in Tables 2.5 and 2.6. Notice that HAMR performs better than BCH when using Gaussian SVM base learners.


Table 2.5: 0/1 loss of ML-ECC using BR base learners

base learner          ECC          scene (M = 127)  emotions (M = 127)  yeast (M = 255)  tmc2007 (M = 511)
Random Forest         RREP (BR)    .4396 ± .0022    .6825 ± .0053       .8332 ± .0016    .7715 ± .0023
Random Forest         HAMR         .3213 ± .0020    .6573 ± .0051       .7910 ± .0020    .7578 ± .0025
Random Forest         BCH          .2570 ± .0022    .6386 ± .0062       .7792 ± .0019    .7149 ± .0022
Random Forest         LDPC         .3996 ± .0028    .6939 ± .0049       .8338 ± .0015    .7735 ± .0024
Gaussian SVM          RREP (BR)    .3378 ± .0023    .8414 ± .0051       .7955 ± .0017    .7281 ± .0025
Gaussian SVM          HAMR         .2873 ± .0017    .8084 ± .0047       .7681 ± .0022    .7215 ± .0025
Gaussian SVM          BCH          .2550 ± .0018    .7787 ± .0048       .7515 ± .0015    .7053 ± .0024
Gaussian SVM          LDPC         .3161 ± .0023    .8530 ± .0041       .7963 ± .0018    .7515 ± .0023
Logistic Regression   RREP (BR)    .4821 ± .0024    .7396 ± .0049       .8531 ± .0016    .7458 ± .0026
Logistic Regression   HAMR         .4048 ± .0020    .7170 ± .0056       .8282 ± .0015    .7405 ± .0024
Logistic Regression   BCH          .3291 ± .0020    .6982 ± .0048       .8094 ± .0020    .7205 ± .0022
Logistic Regression   LDPC         .4659 ± .0028    .7507 ± .0056       .8565 ± .0019    .7694 ± .0027

base learner          ECC          genbase (M = 511)  medical (M = 1023)  enron (M = 1023)
Random Forest         RREP (BR)    .0303 ± .0020      .6546 ± .0025       .8872 ± .0036
Random Forest         HAMR         .0288 ± .0019      .6387 ± .0025       .8851 ± .0036
Random Forest         BCH          .0250 ± .0019      .4567 ± .0034       .8737 ± .0038
Random Forest         LDPC         .0312 ± .0022      .5601 ± .0032       .8876 ± .0035
Gaussian SVM          RREP (BR)    .0273 ± .0021      .3721 ± .0037       .8720 ± .0041
Gaussian SVM          HAMR         .0243 ± .0022      .3675 ± .0036       .8718 ± .0042
Gaussian SVM          BCH          .0255 ± .0019      .3499 ± .0030       .8561 ± .0043
Gaussian SVM          LDPC         .0243 ± .0017      .4226 ± .0034       .8782 ± .0037
Logistic Regression   RREP (BR)    .5084 ± .0068      .5784 ± .0282       .8759 ± .0035
Logistic Regression   HAMR         .3509 ± .0089      .5499 ± .0247       .8740 ± .0036
Logistic Regression   BCH          .0295 ± .0018      .4022 ± .0076       .8579 ± .0038
Logistic Regression   LDPC         .0528 ± .0031      .5396 ± .0149       .8795 ± .0036

Thus, extending BR by learning some more parity bits and decoding them suitably with the ECC yields a superior algorithm over the original BR. The micro and macro F1 scores, and the pairwise label ranking loss, are reported in Tables A.12, A.13, and A.14, respectively.

Comparing Tables 2.3 and 2.5, we see that using 3-powerset achieves lower 0/1 loss than using BR in most of the cases. However, in terms of ∆HL, as shown in Tables 2.4 and 2.6, there is no clear winner between the 3-powerset and BR.


Table 2.6: Hamming loss of ML-ECC using BR base learners

base learner          ECC          scene (M = 127)  emotions (M = 127)  yeast (M = 255)  tmc2007 (M = 511)
Random Forest         RREP (BR)    .0858 ± .0005    .1811 ± .0016       .1903 ± .0006    .0662 ± .0003
Random Forest         HAMR         .0728 ± .0005    .1779 ± .0017       .1878 ± .0007    .0652 ± .0002
Random Forest         BCH          .0720 ± .0007    .1828 ± .0018       .1898 ± .0008    .0638 ± .0003
Random Forest         LDPC         .0832 ± .0006    .1882 ± .0017       .1963 ± .0006    .0721 ± .0003
Gaussian SVM          RREP (BR)    .0743 ± .0005    .2460 ± .0021       .1866 ± .0006    .0621 ± .0003
Gaussian SVM          HAMR         .0717 ± .0004    .2480 ± .0023       .1861 ± .0007    .0616 ± .0003
Gaussian SVM          BCH          .0738 ± .0005    .2565 ± .0031       .1880 ± .0007    .0619 ± .0003
Gaussian SVM          LDPC         .0742 ± .0006    .2532 ± .0019       .1908 ± .0006    .0688 ± .0003
Logistic Regression   RREP (BR)    .1024 ± .0006    .2062 ± .0018       .2000 ± .0007    .0641 ± .0003
Logistic Regression   HAMR         .0959 ± .0006    .2049 ± .0022       .2003 ± .0007    .0635 ± .0003
Logistic Regression   BCH          .0955 ± .0006    .2172 ± .0023       .2037 ± .0008    .0638 ± .0003
Logistic Regression   LDPC         .1038 ± .0006    .2133 ± .0019       .2044 ± .0008    .0705 ± .0004

base learner          ECC          genbase (M = 511)  medical (M = 1023)  enron (M = 1023)
Random Forest         RREP (BR)    .0013 ± .0001      .0183 ± .0001       .0474 ± .0003
Random Forest         HAMR         .0012 ± .0001      .0179 ± .0001       .0474 ± .0003
Random Forest         BCH          .0010 ± .0001      .0152 ± .0001       .0494 ± .0004
Random Forest         LDPC         .0014 ± .0001      .0203 ± .0001       .0529 ± .0004
Gaussian SVM          RREP (BR)    .0012 ± .0001      .0113 ± .0001       .0450 ± .0004
Gaussian SVM          HAMR         .0010 ± .0001      .0112 ± .0001       .0451 ± .0004
Gaussian SVM          BCH          .0010 ± .0001      .0117 ± .0001       .0487 ± .0005
Gaussian SVM          LDPC         .0011 ± .0001      .0153 ± .0001       .0517 ± .0005
Logistic Regression   RREP (BR)    .0347 ± .0007      .0212 ± .0014       .0455 ± .0003
Logistic Regression   HAMR         .0186 ± .0005      .0191 ± .0011       .0452 ± .0003
Logistic Regression   BCH          .0024 ± .0002      .0161 ± .0006       .0472 ± .0004
Logistic Regression   LDPC         .0050 ± .0004      .0234 ± .0011       .0517 ± .0004


Chapter 3

New Decoder for Hard and Soft Decoding of Error-correcting Codes

In Chapter 2 we demonstrated the effectiveness of applying ECC to multi-label classification. In addition, we showed that ∆0/1 is upper bounded by a function of ∆HL on the codewords and the strength of the ECC, that is, the number of bit errors that the ECC is able to correct. However, in multi-label classification the bit error rate is sometimes high. If the codeword prediction of an instance has more bit errors than the ECC is able to correct, there is no guarantee on the decoding outcome.

The reason is that the off-the-shelf decoder, e.g., the decoder for the BCH code used here, takes advantage of the algebraic structure of the ECC to locate possible bit errors. The decoder decodes ˜b ∈ {0, 1}^M to ˜y ∈ {0, 1}^K, where the encoding of ˜y is approximately the valid codeword closest to ˜b in {0, 1}^M. However, when M is large, many possible values of ˜b are too far away from any valid codeword. If the decoder is able to correct m bit errors, any vertex of the hypercube {0, 1}^M within an m-bit difference from a valid codeword can be perfectly mapped back to a vertex of {0, 1}^K. The number of such vertices is

$$2^K \cdot \sum_{i=0}^{m} \binom{M}{i},$$

which is generally smaller than 2^M. For the other vertices, since they are too far away from any valid codeword, the off-the-shelf decoder cannot utilize the full power of all the parity bits but uses only K of the M bits, resulting in suboptimal decoding performance. This also explains the sharp increase of ∆0/1 for the BCH code in Figure 2.4(b), in contrast to the smooth slope for the LDPC code, which is decoded using the belief propagation algorithm.

We try to overcome this deficiency by proposing a new decoder in Section 3.1 and experimenting with it in Section 3.2. This decoder decodes a vertex of the hypercube {0, 1}^M to the interior of the hypercube [0, 1]^K, and then rounds to the nearest vertex of {0, 1}^K. Rounding-based methods have been studied by Tai and Lin [2012]. Because this decoder takes some geometric information into account, we call it the geometric decoder, and call the off-the-shelf decoder the algebraic decoder.

Another benefit of the geometric decoder is that it can perform interior-to-interior decoding, from [0, 1]^M to [0, 1]^K. In other words, it is a soft-in soft-out decoder [Wolf, 1978, Hagenauer et al., 1996]. The soft input bits contain the channel measurement information, and the value of each bit represents the confidence in the bit being 1. We discuss how to gather such information from our channels, the base learners, in Section 3.3 and present experimental results in Section 3.4.

3.1 Geometric Decoder for Linear Codes

Here we describe our proposed geometric decoder in detail. The geometric decoder maps a vertex ˜b in {0, 1}^M to a point ˜y in the interior of [0, 1]^K. Since the outputs of this decoder are real values, we call them soft outputs, in contrast to the binary values, which are called hard outputs. As mentioned above, we may convert the soft output of the geometric decoder to a hard output by rounding.

Here, we focus on linear codes, whose encoding function can be written as a matrix-vector multiplication over the Galois field GF(2). The repetition code, Hamming code, BCH code, and LDPC code are all linear codes. Let G be the generating matrix of a linear code, with g_ij ∈ {0, 1}. The encoding is done by b = enc(y) = G · y (mod 2), or equivalently we may write the formula in terms of exclusive-OR (XOR) operations:

$$b_i = \bigoplus_{j:\, g_{ij}=1} y_j$$

That is, the codeword bit b_i is the result of the XOR of some label bits y_j. The XOR operations are equivalent to multiplications if we map 1 → −1 and 0 → +1. By defining ˆb_i = 1 − 2b_i and ˆy_j = 1 − 2y_j, the encoding can also be written as

$$\hat{b}_i = \prod_{j:\, g_{ij}=1} \hat{y}_j$$

We denote this form as multiplication encoding.
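As a small numerical sanity check of this equivalence (the generator matrix below is an arbitrary toy example, not one of the codes used in the experiments):

```python
import numpy as np

G = np.array([[1, 0, 1],        # toy generator matrix of a linear code: b = G y (mod 2)
              [0, 1, 1],
              [1, 1, 1]])
y = np.array([1, 0, 1])

b = G.dot(y) % 2                           # XOR (GF(2)) encoding
b_hat_direct = 1 - 2 * b                   # map 0 -> +1, 1 -> -1

y_hat = 1 - 2 * y
b_hat_mult = np.array([np.prod(y_hat[G[i] == 1]) for i in range(len(G))])  # multiplication encoding

print(b)              # [0 1 0]
print(b_hat_direct)   # [ 1 -1  1]
print(b_hat_mult)     # [ 1 -1  1]  -- identical, as claimed
```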

It is difficult to generalize the XOR operation from binary to real values, but multiplication by itself is well defined on real values. We take advantage of this to form our geometric decoder, which finds the ˜y that minimizes the L2 distance between ˜b and the multiplication encoding of ˜y:

$$\mathrm{dec}_{geometric}(\tilde{b}) = \operatorname*{argmin}_{\tilde{y} \in [0,1]^K} \sum_{i=1}^{M} \Bigl( (1 - 2\tilde{b}_i) - \prod_{j:\, g_{ij}=1} (1 - 2\tilde{y}_j) \Bigr)^{2}$$

Note that the squared L2 distance between codewords is an approximation of the Hamming distance in the binary space {0, 1}^M.

For the repetition code, since only one y_j is involved in each b_i, the optimal solution is simply the average of the predictions received for each label. However, for general linear codes, there is no efficient way to find the global optimum since the optimization problem may not be convex. Instead, we may apply a variant of coordinate descent optimization to find a local minimum. That is, in each step we optimize only one ˜y_j while fixing the others. To optimize one ˜y_j, we only have to solve a second-order single-variable optimization problem, which has an efficient analytic solution.
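A sketch of such a coordinate-descent geometric decoder follows (an illustrative implementation of the objective above, with an arbitrary random initialization and a fixed number of passes; the exact optimizer used in the experiments may differ). Once the other coordinates are fixed, the objective is quadratic in (1 − 2ỹ_j), so each coordinate update is a clipped one-dimensional least-squares step.

```python
import numpy as np

def geometric_decode(b_tilde, G, n_passes=20, seed=0):
    """Soft-output geometric decoding of a hard or soft codeword b_tilde in [0,1]^M.

    Coordinate descent on  sum_i ((1-2*b_i) - prod_{j:G[i,j]=1} (1-2*y_j))^2  over y in [0,1]^K.
    """
    G = np.asarray(G)
    M, K = G.shape
    t = 1.0 - 2.0 * np.asarray(b_tilde, dtype=float)   # codeword mapped to [-1, 1]
    rng = np.random.default_rng(seed)
    z = rng.uniform(-0.1, 0.1, size=K)                 # z_j = 1 - 2*y_j, random start near y_j = 0.5
    cols = np.arange(K)
    for _ in range(n_passes):
        for j in range(K):
            rows = np.where(G[:, j] == 1)[0]
            if rows.size == 0:
                continue
            # c_i = product of the other factors in each codeword bit that involves y_j
            c = np.array([np.prod(z[(G[i] == 1) & (cols != j)]) for i in rows])
            denom = float(np.sum(c * c))
            if denom > 1e-12:
                # analytic minimizer of sum_i (t_i - c_i * z_j)^2, clipped to the feasible range
                z[j] = np.clip(np.sum(c * t[rows]) / denom, -1.0, 1.0)
    return (1.0 - z) / 2.0    # soft predictions in [0,1]^K; round to obtain hard labels
```

For the repetition code, every codeword bit involves a single label, so each coordinate step reduces to averaging the received copies of that label, matching the observation above.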

The benefit of using the soft-output geometric decoder is that the multiplication-approximated XOR preserves some geometric information: close points in [0, 1]^K remain close after multiplication encoding. Moreover, the soft outputs condense the space of valid codewords, so it is easier to find one close to ˜b.


Figure 3.1: Hard-input soft-output geometric decoding results for ML-ECC using BR with Random Forests. (a) 0/1 loss; (b) Hamming loss.

3.2 Experimental Results of Geometric Decoder

The experiments on the proposed geometric decoder are done under the same setting as those on the off-the-shelf algebraic decoder in Section 2.4. Here, we focus on comparing the new decoder with the algebraic one on the HAMR and BCH codes, since the previous experiments have already shown that these codes are the better choices for multi-label classification.

We first demonstrate the advantage of the proposed geometric decoder over the algebraic one using the same codeword predictions as in Section 2.4. The results are shown in Figure 3.1. Here the base learner is Binary Relevance with Random Forests. In the figures, alg stands for the algebraic decoder, and geo stands for the proposed geometric decoder. The soft decoding output of the geometric decoder is rounded back to {0, 1} for evaluation and comparison.

Figure 3.1(a) shows the results on 0/1 loss. For the BCH code, the proposed geometric decoder outperforms the algebraic one significantly on almost all datasets, with particularly large improvements on the yeast and medical datasets. For the HAMR code, the geometric decoder is better than the algebraic one except on the genbase and enron datasets, where both decoders have similar 0/1 loss.

Next, we look at the Hamming loss in Figure 3.1(b).


Table 3.1: 0/1 loss changes when applying the proposed soft-output decoder

ECC    base learner                    scene (M = 127)  emotions (M = 127)  yeast (M = 255)  tmc2007 (M = 511)
HAMR   BR, Random Forest               −.0101 ± .0010   −.0094 ± .0022      −.0077 ± .0010   −.0012 ± .0006
HAMR   BR, Gaussian SVM                −.0047 ± .0007   −.0145 ± .0030      −.0031 ± .0008   −.0009 ± .0005
HAMR   BR, Logistic Regression         −.0081 ± .0010   −.0078 ± .0028      −.0012 ± .0008   −.0006 ± .0005
HAMR   3-powerset, Random Forest       −.0099 ± .0008   −.0071 ± .0023      −.0101 ± .0011   −.0014 ± .0006
HAMR   3-powerset, Gaussian SVM        −.0042 ± .0006   −.0239 ± .0029      −.0082 ± .0006   −.0010 ± .0007
HAMR   3-powerset, Logistic Regression −.0064 ± .0009   −.0041 ± .0029      −.0051 ± .0009   −.0004 ± .0007
BCH    BR, Random Forest               −.0100 ± .0007   −.0101 ± .0030      −.0575 ± .0015   −.0231 ± .0014
BCH    BR, Gaussian SVM                −.0048 ± .0005   −.0437 ± .0039      −.0312 ± .0012   −.0078 ± .0014
BCH    BR, Logistic Regression         −.0127 ± .0007   −.0096 ± .0034      −.0396 ± .0017   −.0103 ± .0018
BCH    3-powerset, Random Forest       −.0114 ± .0007   −.0087 ± .0032      −.0529 ± .0016   −.0210 ± .0013
BCH    3-powerset, Gaussian SVM        −.0048 ± .0004   −.0343 ± .0044      −.0250 ± .0011   −.0090 ± .0013
BCH    3-powerset, Logistic Regression −.0070 ± .0006   −.0114 ± .0030      −.0256 ± .0014   −.0122 ± .0013

ECC    base learner                    genbase (M = 511)  medical (M = 1023)  enron (M = 1023)
HAMR   BR, Random Forest               −.0008 ± .0005     −.0010 ± .0010      −.0006 ± .0006
HAMR   BR, Gaussian SVM                −.0012 ± .0008     −.0004 ± .0007      .0002 ± .0005
HAMR   BR, Logistic Regression         .0032 ± .0057      .0015 ± .0010       .0004 ± .0004
HAMR   3-powerset, Random Forest       .0010 ± .0005      −.0006 ± .0008      −.0001 ± .0004
HAMR   3-powerset, Gaussian SVM        −.0005 ± .0004     −.0004 ± .0007      .0002 ± .0005
HAMR   3-powerset, Logistic Regression −.0030 ± .0061     .0004 ± .0010       −.0004 ± .0005
BCH    BR, Random Forest               .0005 ± .0003      −.0438 ± .0022      −.0308 ± .0016
BCH    BR, Gaussian SVM                −.0000 ± .0003     −.0068 ± .0025      −.0133 ± .0016
BCH    BR, Logistic Regression         .0127 ± .0018      −.0318 ± .0045      −.0217 ± .0013
BCH    3-powerset, Random Forest       .0003 ± .0004      −.0280 ± .0018      −.0238 ± .0018
BCH    3-powerset, Gaussian SVM        −.0007 ± .0003     −.0150 ± .0018      −.0055 ± .0016
BCH    3-powerset, Logistic Regression .0003 ± .0006      −.0216 ± .0022      −.0139 ± .0013

For the HAMR code, the proposed method yields a small improvement on the scene, emotions, and yeast datasets, and reaches Hamming loss similar to that of the algebraic decoding method on the other datasets. However, for the BCH code, the proposed method suffers worse Hamming loss on the yeast, emotions, and enron datasets. The reason may be that the geometric decoder minimizes the distance between the approximated enc(˜y) and ˜b in the codeword space. The BCH code, however, does not preserve the Hamming distance when encoding and decoding between {0, 1}^K and {0, 1}^M, so the geometric decoder, which minimizes the distance in [0, 1]^M (and approximately in {0, 1}^M), may not be suitable for the Hamming loss (the Hamming distance in {0, 1}^K).
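To see why codeword-space distance can diverge from label-space Hamming distance, recall that for any linear code with minimum distance $d$, two distinct label vectors satisfy

$$ d_H\bigl(\mathrm{enc}(\mathbf{y}_1),\, \mathrm{enc}(\mathbf{y}_2)\bigr) \;=\; \mathrm{wt}\bigl(\mathrm{enc}(\mathbf{y}_1 \oplus \mathbf{y}_2)\bigr) \;\geq\; d, $$

even when $\mathbf{y}_1$ and $\mathbf{y}_2$ differ in only one label. Hence being slightly closer to ˜b in the codeword space does not guarantee fewer label errors, which is consistent with the observation above. (This is a standard property of linear codes, stated here only as an illustration.)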

Similar results show up with other base learners, as shown in Tables 3.1 and 3.2. In the tables, each entry reports the difference between the results of the geometric decoder and the algebraic decoder, and the bold entries indicate that the geometric decoder is significantly better than the algebraic one.


Table 3.2: Hamming loss changes when applying the proposed soft-output decoder

ECC | base learner | scene (M = 127) | emotions (M = 127) | yeast (M = 255) | tmc2007 (M = 511)
HAMR | BR, Random Forest | −.0008 ± .0002 | −.0012 ± .0007 | −.0006 ± .0002 | −.0000 ± .0001
HAMR | BR, Gaussian SVM | −.0001 ± .0001 | .0035 ± .0013 | −.0003 ± .0001 | −.0001 ± .0000
HAMR | BR, Logistic Regression | −.0003 ± .0002 | .0018 ± .0009 | −.0001 ± .0002 | −.0001 ± .0001
HAMR | 3-powerset, Random Forest | −.0002 ± .0002 | .0005 ± .0006 | −.0004 ± .0002 | −.0001 ± .0000
HAMR | 3-powerset, Gaussian SVM | −.0001 ± .0001 | .0046 ± .0009 | .0001 ± .0002 | .0000 ± .0001
HAMR | 3-powerset, Logistic Regression | .0001 ± .0002 | .0029 ± .0009 | .0005 ± .0002 | −.0002 ± .0001
BCH | BR, Random Forest | .0011 ± .0002 | .0068 ± .0009 | .0070 ± .0005 | .0005 ± .0002
BCH | BR, Gaussian SVM | −.0005 ± .0002 | .0192 ± .0022 | .0073 ± .0004 | .0030 ± .0001
BCH | BR, Logistic Regression | .0002 ± .0002 | .0141 ± .0012 | .0079 ± .0006 | .0036 ± .0002
BCH | 3-powerset, Random Forest | .0009 ± .0002 | .0035 ± .0013 | .0062 ± .0005 | .0006 ± .0002
BCH | 3-powerset, Gaussian SVM | .0004 ± .0001 | .0122 ± .0023 | .0055 ± .0004 | .0025 ± .0002
BCH | 3-powerset, Logistic Regression | .0008 ± .0002 | .0089 ± .0016 | .0072 ± .0005 | .0022 ± .0002

ECC | base learner | genbase (M = 511) | medical (M = 1023) | enron (M = 1023)
HAMR | BR, Random Forest | −.0000 ± .0000 | −.0000 ± .0000 | −.0000 ± .0000
HAMR | BR, Gaussian SVM | −.0001 ± .0000 | −.0000 ± .0000 | −.0000 ± .0000
HAMR | BR, Logistic Regression | .0001 ± .0004 | .0001 ± .0000 | .0000 ± .0000
HAMR | 3-powerset, Random Forest | .0000 ± .0000 | −.0000 ± .0000 | −.0001 ± .0000
HAMR | 3-powerset, Gaussian SVM | −.0000 ± .0000 | −.0000 ± .0000 | .0000 ± .0000
HAMR | 3-powerset, Logistic Regression | −.0002 ± .0002 | .0000 ± .0000 | −.0001 ± .0000
BCH | BR, Random Forest | .0000 ± .0000 | .0002 ± .0001 | .0073 ± .0003
BCH | BR, Gaussian SVM | .0000 ± .0000 | .0009 ± .0001 | .0090 ± .0002
BCH | BR, Logistic Regression | .0030 ± .0004 | −.0008 ± .0003 | .0086 ± .0002
BCH | 3-powerset, Random Forest | .0000 ± .0000 | .0005 ± .0001 | .0083 ± .0003
BCH | 3-powerset, Gaussian SVM | .0000 ± .0000 | .0007 ± .0001 | .0082 ± .0004
BCH | 3-powerset, Logistic Regression | .0001 ± .0001 | −.0001 ± .0001 | .0077 ± .0003

The results validate that the proposed geometric decoder can decode more accurately (lower 0/1 loss) while keeping a similar Hamming loss, compared with the algebraic decoder.

3.2.1 Bit Error Analysis

Next, we look deeper into the scene dataset and fix the base learner to BR with Random Forests. The test instances are grouped by the number of bit errors made on each instance. First, Figure 3.2 plots, for the HAMR and BCH codes, the ratio of each group size to the total number of instances. Besides the highest peak at 0 bit errors, another peak appears at 63 bit errors for the BCH code, farther out than the corresponding peak at 38 bit errors for HAMR. This suggests that the BCH code is harder to learn, which is consistent with our finding in Section 2.4.3.
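A minimal sketch of how such a per-group analysis could be computed is given below; the matrix names and the use of NumPy are illustrative assumptions rather than the exact evaluation scripts of the thesis.

    import numpy as np

    def per_group_losses(B_true, B_pred, Y_true, Y_decoded):
        # B_true, B_pred: (N, M) binary codeword matrices (true vs. predicted by the base learners).
        # Y_true, Y_decoded: (N, K) binary label matrices (true vs. decoded).
        bit_errors = (B_true != B_pred).sum(axis=1)      # number of bit errors per instance
        zero_one = (Y_true != Y_decoded).any(axis=1)     # per-instance 0/1 loss
        hamming = (Y_true != Y_decoded).mean(axis=1)     # per-instance Hamming loss
        stats = {}
        for e in np.unique(bit_errors):
            idx = (bit_errors == e)                      # the group with exactly e bit errors
            stats[int(e)] = {
                "ratio": float(idx.mean()),              # group size / total number of instances
                "0/1 loss": float(zero_one[idx].mean()),
                "Hamming loss": float(hamming[idx].mean()),
            }
        return stats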

Then, we plot the 0/1 loss and Hamming loss of each group for HAMR, as shown in Figure 3.3. From Figure 3.3(a), we can see that the geometric decoder corrects errors more accurately than the algebraic decoding when there are 16 to 24 bit errors.


Figure 3.2: Bit error distribution of BR with Random Forests on the scene dataset: (a) HAMR; (b) BCH

Figure 3.3: Strength of HAMR on the scene dataset and BR with Random Forests: (a) 0/1 loss vs. number of bit errors; (b) Hamming loss vs. number of bit errors

The ordinary decoding method of HAMR has two stages, one for HAM(7, 4) and one for the repetition code, and each HAM(7, 4) block is decoded independently. In the proposed geometric decoding method, the two stages are combined into one, which enables joint decoding of the HAM(7, 4) blocks and thus keeps the decoding of each block consistent with the others. This leads to the superior performance of the proposed method on 0/1 loss. For the Hamming loss, as shown in Figure 3.3(b), the gain of the geometric decoder within that bit-error range is small, which explains the small overall improvement on Hamming loss.
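For illustration, a minimal sketch of the conventional two-stage decoding is shown below: each HAM(7, 4) block is syndrome-decoded on its own, and the recovered data bits are then combined by a majority vote over the repeated copies of each label. The systematic bit layout, the mapping label_of_bit from data bits to labels, and the function names are assumptions made for this sketch; the actual HAMR construction is the one described in Section 2.2.2.

    import numpy as np

    # Systematic Hamming(7,4): codeword = [d1 d2 d3 d4 p1 p2 p3] with
    # p1 = d1^d2^d4, p2 = d1^d3^d4, p3 = d2^d3^d4 (one common convention).
    H = np.array([[1, 1, 0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 1]])

    def decode_ham74(block):
        # Syndrome-decode one 7-bit block, correcting at most one bit error.
        r = np.array(block) % 2
        syndrome = H.dot(r) % 2
        if syndrome.any():
            # the erroneous position is the column of H that matches the syndrome
            for pos in range(7):
                if np.array_equal(H[:, pos], syndrome):
                    r[pos] ^= 1
                    break
        return r[:4]                       # the four data bits

    def two_stage_hamr_decode(b, label_of_bit, K):
        # Stage 1: decode each HAM(7,4) block independently (len(b) is a multiple of 7).
        # Stage 2: majority-vote the repeated data bits of each label (ties round up to 1).
        # label_of_bit[t] tells which label the t-th recovered data bit repeats.
        data_bits = np.concatenate([decode_ham74(b[i:i + 7])
                                    for i in range(0, len(b), 7)])
        votes = [[] for _ in range(K)]
        for t, bit in enumerate(data_bits):
            votes[label_of_bit[t]].append(bit)
        return np.array([int(sum(v) * 2 >= len(v)) for v in votes])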

We also plot the 0/1 loss and Hamming loss of each group for the BCH code in Figure 3.4.


Figure 3.4: Strength of BCH on the scene dataset and BR with Random Forests: (a) 0/1 loss vs. number of bit errors; (b) Hamming loss vs. number of bit errors

The algebraic decoder can correctly recover the label vector with no 0/1 loss for instances with at most 31 bit errors, but for instances with 32 bit errors the 0/1 loss sharply jumps to 0.97. The proposed geometric decoder does a better job for instances with 32–39 bit errors, so its 0/1 loss goes up more smoothly. This is exactly the issue we set out to address at the beginning of this chapter. On the other hand, in terms of the Hamming loss shown in Figure 3.4(b), the proposed geometric decoder suffers 0.01–0.025 higher Hamming loss than the algebraic one for instances with 37–45 bit errors, which explains the slightly worse result of the geometric decoder on Hamming loss.

From this analysis, we may conclude that the geometric decoder improves the 0/1 loss because it indeed does a better job on the instances that are far from any valid codeword. Regarding the Hamming loss, however, the geometric decoder brings improvements for HAMR but not for BCH.

3.3 Soft-input Decoding and Bitwise Confidence Estimation for k-powerset Learners

In Section 3.1, we proposed the geometric decoder based on approximating XOR by multiplication. Since the L2 distance in the [0, 1]^M space is used as the optimization criterion, the input codeword prediction ˜b need not be in {0, 1}^M but can also lie in [0, 1]^M. That is, this decoding method supports not only soft outputs but also soft inputs.
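To illustrate how soft inputs could be obtained in the simplest Binary Relevance case, each per-bit learner may report its estimated probability that the corresponding codeword bit is 1, and these probabilities can be fed to the geometric decoder directly. The snippet below is only a sketch under that assumption (predict_proba follows the scikit-learn convention); how to estimate bitwise confidences from k-powerset learners, which predict several bits jointly, is the topic of this section.

    import numpy as np

    def soft_codeword_prediction(bit_learners, X):
        # bit_learners: a list of M trained binary classifiers, one per codeword bit,
        # each exposing predict_proba as in scikit-learn.
        N = X.shape[0]
        B_soft = np.zeros((N, len(bit_learners)))
        for i, clf in enumerate(bit_learners):
            B_soft[:, i] = clf.predict_proba(X)[:, 1]    # estimated P[bit i = 1 | x]
        return B_soft

    # Each row of B_soft is a soft codeword in [0, 1]^M and can be passed as
    # b_tilde to the geometric decoder sketched in Section 3.1.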
