
(a) 0/1 loss (b) Hamming loss

Figure 2.1: Performance of ML-ECC using the 3-powerset with Random Forests

experiments on different codeword lengths are presented in Section 2.4.2. Here the base multi-label learner is the 3-powerset with Random Forests. Following the description in Section 2.3, RREP with the 3-powerset is exactly the same as RAKEL with k = 3.
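To make the setup concrete, below is a minimal sketch of the ML-ECC pipeline with the repetition code (RREP) and a 3-powerset base learner on Random Forests, as used in this experiment. The helper names (rrep_encode, train_3powerset, and so on) and the cyclic bit ordering and contiguous bit grouping are illustrative assumptions, not the thesis's exact procedure; HAMR, BCH, and LDPC would replace some repetition bits with parity bits and use their own decoders instead of a majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rrep_encode(Y, M):
    """RREP encoding: repeat the L label bits cyclically until the
    codeword has M bits (hypothetical ordering; the thesis may permute)."""
    L = Y.shape[1]
    idx = np.arange(M) % L                # codeword bit j copies label idx[j]
    return Y[:, idx], idx

def rrep_decode(B_hat, idx, L):
    """RREP decoding: majority vote over the bits that copy each label."""
    Y_hat = np.zeros((B_hat.shape[0], L), dtype=int)
    for l in range(L):
        Y_hat[:, l] = (B_hat[:, idx == l].mean(axis=1) >= 0.5).astype(int)
    return Y_hat

def train_3powerset(X, B, k=3):
    """3-powerset base learner: split the M codeword bits into groups of k
    and train one multi-class Random Forest per group, treating each joint
    bit pattern within a group as one class."""
    models = []
    for start in range(0, B.shape[1], k):
        bits = np.arange(start, min(start + k, B.shape[1]))
        patterns = [''.join(map(str, row)) for row in B[:, bits]]
        models.append((bits, RandomForestClassifier().fit(X, patterns)))
    return models

def predict_codewords(models, X, M):
    """Predict each group's bit pattern and stitch the codeword back together."""
    B_hat = np.zeros((X.shape[0], M), dtype=int)
    for bits, clf in models:
        for i, pattern in enumerate(clf.predict(X)):
            B_hat[i, bits] = [int(c) for c in pattern]
    return B_hat

# Usage sketch: B, idx = rrep_encode(Y_train, M); models = train_3powerset(X_train, B)
# Y_pred = rrep_decode(predict_codewords(models, X_test, M), idx, Y_train.shape[1])
```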

The results on 0/1 loss are shown in Figure 2.1(a). HAMR achieves lower ∆0/1 than RREP on 5 out of the 7 datasets (scene, emotions, yeast, tmc2007, and medical) and achieves similar ∆0/1 to RREP on the other 2. This verifies that using some parity bits instead of repetition improves the strength of the ECC, which in turn improves the 0/1 loss. Along the same direction, BCH performs even better than both HAMR and RREP, especially on the medical dataset. The superior performance of BCH demonstrates that ECC is useful for multi-label classification.

Table 2.3: 0/1 loss of ML-ECC using 3-powerset base learners

base learner          ECC            scene (M = 127)   emotions (M = 127)   yeast (M = 255)   tmc2007 (M = 511)
Random Forest         RREP (RAKEL)   .3390 ± .0022     .6475 ± .0057        .7939 ± .0022     .7738 ± .0025
Random Forest         HAMR           .2855 ± .0022     .6393 ± .0055        .7789 ± .0021     .7693 ± .0024
Random Forest         BCH            .2671 ± .0020     .6366 ± .0061        .7764 ± .0021     .7273 ± .0018
Random Forest         LDPC           .3058 ± .0024     .6606 ± .0050        .8080 ± .0024     .7728 ± .0022
Gaussian SVM          RREP (RAKEL)   .2856 ± .0016     .7759 ± .0055        .7601 ± .0023     .7196 ± .0024
Gaussian SVM          HAMR           .2635 ± .0017     .7729 ± .0052        .7530 ± .0021     .7162 ± .0023
Gaussian SVM          BCH            .2576 ± .0017     .7744 ± .0053        .7429 ± .0017     .7095 ± .0020
Gaussian SVM          LDPC           .2780 ± .0020     .8040 ± .0044        .7574 ± .0021     .7403 ± .0019
Logistic Regression   RREP (RAKEL)   .3601 ± .0019     .6949 ± .0070        .8161 ± .0017     .7408 ± .0024
Logistic Regression   HAMR           .3299 ± .0018     .6954 ± .0057        .8061 ± .0019     .7383 ± .0025
Logistic Regression   BCH            .3148 ± .0018     .7068 ± .0046        .7899 ± .0020     .7233 ± .0024
Logistic Regression   LDPC           .3655 ± .0028     .7295 ± .0056        .8082 ± .0024     .7562 ± .0027

base learner          ECC            genbase (M = 511)   medical (M = 1023)   enron (M = 1023)
Random Forest         RREP (RAKEL)   .0295 ± .0021       .6508 ± .0024        .8866 ± .0038
Random Forest         HAMR           .0276 ± .0021       .6420 ± .0029        .8855 ± .0036
Random Forest         BCH            .0263 ± .0020       .4598 ± .0036        .8659 ± .0039
Random Forest         LDPC           .0288 ± .0021       .5238 ± .0032        .8830 ± .0036
Gaussian SVM          RREP (RAKEL)   .0295 ± .0025       .3679 ± .0036        .8725 ± .0041
Gaussian SVM          HAMR           .0303 ± .0026       .3641 ± .0031        .8693 ± .0042
Gaussian SVM          BCH            .0255 ± .0019       .3394 ± .0027        .8477 ± .0045
Gaussian SVM          LDPC           .0285 ± .0021       .3856 ± .0031        .8666 ± .0041
Logistic Regression   RREP (RAKEL)   .3593 ± .0078       .5507 ± .0254        .8762 ± .0035
Logistic Regression   HAMR           .2275 ± .0099       .5268 ± .0230        .8754 ± .0035
Logistic Regression   BCH            .0250 ± .0018       .3797 ± .0044        .8504 ± .0042
Logistic Regression   LDPC           .0325 ± .0018       .4516 ± .0083        .8653 ± .0038

On the other hand, another sophisticated code, LDPC, gets higher 0/1 loss than BCH on every dataset, and even higher 0/1 loss than RREP on the emotions and yeast datasets, which suggests that LDPC may not be a good choice for the ECC framework.

Next we look at ∆HL, shown in Figure 2.1(b). The Hamming loss of HAMR is comparable to that of RREP, where each wins on two datasets. BCH beats both HAMR and RREP on the tmc2007, genbase, and medical datasets but loses on the other four datasets. LDPC has the highest Hamming loss among the codes on all datasets. Thus, simpler codes like RREP and HAMR perform better in terms of ∆HL. A stronger code like BCH may guard ∆0/1 better, but it can pay more in terms of ∆HL.
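For reference, the two criteria being traded off are the standard per-example losses, written below in common notation (a restatement for completeness; the symbols may differ slightly from the thesis's earlier definitions):

```latex
% Per-example losses; both are averaged over the test set.
% 0/1 loss: the entire predicted label vector must match exactly.
\Delta_{0/1}(\tilde{\mathbf{y}}, \mathbf{y}) \;=\; \mathbf{1}\!\left[\,\tilde{\mathbf{y}} \neq \mathbf{y}\,\right],
\qquad
% Hamming loss: fraction of the L individual labels predicted wrongly.
\Delta_{\mathrm{HL}}(\tilde{\mathbf{y}}, \mathbf{y}) \;=\; \frac{1}{L}\sum_{\ell=1}^{L} \mathbf{1}\!\left[\,\tilde{y}_{\ell} \neq y_{\ell}\,\right]
```

A single wrongly predicted label already makes the whole example count as an error under ∆0/1, so a code that corrects a few bit errors helps ∆0/1 directly; ∆HL, in contrast, is charged per label and can deteriorate when the parity bits make the individual bit-learning problems harder.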

Similar results show up when using the Gaussian SVM or logistic regression as the base learner instead of Random Forests, as shown in Tables 2.3 and 2.4. The boldface entries are the lowest-loss ones for the given dataset and base learner. The results validate that the performance of multi-label classification can be improved by applying the ECC.

Table 2.4: Hamming loss of ML-ECC using 3-powerset base learners

base learner          ECC            scene (M = 127)   emotions (M = 127)   yeast (M = 255)   tmc2007 (M = 511)
Random Forest         RREP (RAKEL)   .0755 ± .0006     .1778 ± .0018        .1884 ± .0007     .0674 ± .0003
Random Forest         HAMR           .0748 ± .0006     .1798 ± .0019        .1894 ± .0008     .0671 ± .0003
Random Forest         BCH            .0753 ± .0007     .1858 ± .0021        .1928 ± .0008     .0662 ± .0003
Random Forest         LDPC           .0817 ± .0007     .1907 ± .0021        .2012 ± .0007     .0734 ± .0003
Gaussian SVM          RREP (RAKEL)   .0719 ± .0005     .2432 ± .0021        .1853 ± .0007     .0613 ± .0003
Gaussian SVM          HAMR           .0723 ± .0005     .2492 ± .0023        .1868 ± .0006     .0610 ± .0003
Gaussian SVM          BCH            .0739 ± .0006     .2644 ± .0019        .1898 ± .0008     .0629 ± .0003
Gaussian SVM          LDPC           .0755 ± .0006     .2634 ± .0027        .1917 ± .0007     .0679 ± .0003
Logistic Regression   RREP (RAKEL)   .0915 ± .0005     .2026 ± .0025        .1993 ± .0007     .0634 ± .0003
Logistic Regression   HAMR           .0911 ± .0005     .2064 ± .0024        .2003 ± .0007     .0634 ± .0003
Logistic Regression   BCH            .0920 ± .0005     .2233 ± .0022        .2051 ± .0008     .0653 ± .0003
Logistic Regression   LDPC           .0989 ± .0007     .2202 ± .0021        .2054 ± .0007     .0701 ± .0003

base learner          ECC            genbase (M = 511)   medical (M = 1023)   enron (M = 1023)
Random Forest         RREP (RAKEL)   .0012 ± .0001       .0182 ± .0001        .0477 ± .0004
Random Forest         HAMR           .0012 ± .0001       .0180 ± .0001        .0479 ± .0004
Random Forest         BCH            .0011 ± .0001       .0159 ± .0001        .0506 ± .0004
Random Forest         LDPC           .0013 ± .0001       .0192 ± .0002        .0538 ± .0005
Gaussian SVM          RREP (RAKEL)   .0013 ± .0001       .0112 ± .0001        .0449 ± .0004
Gaussian SVM          HAMR           .0013 ± .0001       .0111 ± .0001        .0449 ± .0004
Gaussian SVM          BCH            .0010 ± .0001       .0114 ± .0001        .0516 ± .0006
Gaussian SVM          LDPC           .0014 ± .0001       .0140 ± .0001        .0530 ± .0005
Logistic Regression   RREP (RAKEL)   .0179 ± .0006       .0190 ± .0011        .0453 ± .0003
Logistic Regression   HAMR           .0102 ± .0005       .0176 ± .0009        .0454 ± .0003
Logistic Regression   BCH            .0013 ± .0001       .0137 ± .0003        .0505 ± .0004
Logistic Regression   LDPC           .0024 ± .0002       .0187 ± .0006        .0528 ± .0004

More specifically, we may improve the RAKEL algorithm by learning some parity bits instead of repetitions. Based on this experiment, we suggest using HAMR for multi-label classification to improve ∆0/1 while maintaining a ∆HL comparable to that of RAKEL. Using BCH instead improves ∆0/1 further but may pay for it in ∆HL. We also report the micro and macro F1 scores, as well as the pairwise label ranking loss, in Tables A.1, A.2, and A.3, respectively.

2.4.2 Comparison of Codeword Length

Now we compare different lengths of codewords M. With larger M, the codes can correct more errors, but the base learners take longer to train. By experimenting with different M, we may find a better trade-off between performance and efficiency.

The performance of the ECC framework with different codeword lengths on the scene dataset is shown in Figure 2.2. Here, the base learner is again the 3-powerset with Random Forests. The codeword length M varies from 31 to 127, which is about 5 to 20 times the number of labels L. We do not include shorter codewords because their performance is not stable. Note that BCH only allows M = 2^p − 1, and thus we conduct the BCH experiments only at those codeword lengths.
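As a small illustration of this constraint (a trivial check, not taken from the thesis), the admissible BCH lengths covering the range used in this chapter can be enumerated directly:

```python
# BCH codewords exist only at lengths M = 2**p - 1; enumerate those
# spanning the lengths used in this chapter (31 up to 1023).
bch_lengths = [2 ** p - 1 for p in range(5, 11)]
print(bch_lengths)  # [31, 63, 127, 255, 511, 1023]
```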

We first look at the 0/1 loss in Figure 2.2(a). The horizontal axis indicates the codeword length M and the vertical axis is the 0/1 loss on the test set. We see that ∆0/1 of RREP stays around 0.335 no matter how long the codewords are. This implies that the power of repetition bits reaches its limit very soon: once all the 3-powerset combinations of labels have been learned, additional repetitions give very limited improvement. Therefore, methods that use only repetition bits, such as RAKEL, cannot take advantage of the extra bits in the codewords.

The ∆0/1 of HAMR and BCH decreases slightly with M, but the differences between M = 63 and M = 127 are generally small (smaller, in particular, than the differences between M = 31 and M = 63). This indicates that learning some parity bits provides additional information for prediction that cannot be learned easily from repetition bits, and such information remains beneficial for longer codewords, in contrast to repetition bits. One reason is that the number of 3-powerset combinations of parity bits is exponentially larger than that of combinations of labels. The performance of LDPC is not as stable as that of the other codes, possibly because of its sophisticated decoding step. Nevertheless, we still see that its ∆0/1 decreases slightly with M.
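A rough back-of-the-envelope check of this argument, assuming the scene dataset's L = 6 labels and the longest codeword M = 127 used here (a simplification, since it counts 3-subsets of all codeword bits rather than of the parity bits alone):

```python
from math import comb

# Distinct 3-bit groups available to a 3-powerset learner: from the
# L original labels versus from an M-bit codeword (scene: L = 6, M = 127).
L, M = 6, 127
print(comb(L, 3), comb(M, 3))  # 20 vs. 333375
```

Repetition bits can only revisit the 20 label combinations, while parity bits keep introducing new joint targets as M grows, which is consistent with the curves flattening for RREP but not for HAMR and BCH.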

Figure 2.2(b) shows ∆HL versus M for each ECC. The ∆HL of RREP is the lowest among the codes when M is small, but it remains almost constant when M ≥ 63, while the ∆HL of HAMR and BCH keeps decreasing. This matches our finding that extra repetition bits give limited information. When M = 127, BCH is comparable to RREP in terms of ∆HL, and HAMR is even better than RREP at that codeword length, becoming the best code regarding ∆HL. Thus, while a stronger code like BCH may guard ∆0/1 better, it can pay more in terms of ∆HL.

As stated in Sections 1.1 and 2.1, the base learners serve as the channels in the ECC framework, and their performance may be affected by the codes. Therefore, using a strong ECC does not always improve multi-label classification performance. Next, we verify the trade-off by measuring the bit error rate ∆BER of h̃, which is defined as the
