
One particular advantage of PLST appears to be its use of the average of the label vectors yn as the reference point o, which readily captures an estimate of the presence ratio of each label. PBR uses the origin as the reference point, so the advantage of changing the reference point has not been explored. The flexibility of CS in choosing the reference point (to make the signal vectors sparser) has not been explored, either.

To understand whether the choice of reference point is an important factor behind the success of PLST, we conduct additional experiments using three more algorithms:

1. PLST-Origin: a crippled PLST that takes the origin 0 as o and extracts a different set of principal directions that pass through the origin. The usual PLST is then renamed PLST-Mean for clarity.

2. PBR-Mean: an improved PBR that takes the average of the yn as o. The usual PBR is renamed PBR-Origin.

3. CS-BestVertex: an improved CS that transforms y[k] to

y[k] ⊕ majority{yn[k] : n = 1, 2, . . . , N}

before (and after) training. In other words, the improved CS computes the best vertex to strengthen label-set sparsity prior to training. The usual CS is renamed CS-Origin.
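The three reference-point choices above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function name `shift_labels` and the string keys are our own:

```python
import numpy as np

def shift_labels(Y, reference="mean"):
    """Shift a binary label matrix Y (N x K) relative to a reference point o.

    "origin"     : o = 0, no shift (PLST-Origin / PBR-Origin / CS-Origin).
    "mean"       : o = average label vector, which captures the per-label
                   presence ratio (PLST-Mean / PBR-Mean).
    "bestvertex" : o = per-label majority vote; XOR-ing each label with it
                   strengthens label-set sparsity (CS-BestVertex).
    Returns the shifted matrix and o, which is needed to undo the shift
    after decoding.
    """
    Y = np.asarray(Y, dtype=float)
    if reference == "origin":
        return Y, np.zeros(Y.shape[1])
    if reference == "mean":
        o = Y.mean(axis=0)
        return Y - o, o
    if reference == "bestvertex":
        o = (Y.mean(axis=0) >= 0.5).astype(float)
        return np.abs(Y - o), o  # |y - o| equals y XOR o on {0, 1} entries
    raise ValueError(f"unknown reference: {reference}")
```

For instance, on a label matrix where the first label is always present, the best-vertex shift flips that label off for every example, leaving a sparser matrix to compress.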

Fig. 11 shows the Hamming loss comparison between the usual PLST, PBR, and CS and their variants on delicious and yeast, using RLR as the regressor. For delicious, which exhibits strong label-set sparsity, we see that changing the reference point does not make much difference for any pair of algorithms. In particular, the origin is readily the best vertex for CS, and is quite close to the mean for PBR and PLST. On the other

[Figure 10: Test Encoding Error of PBR and PLST. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Each panel plots the encoding error of PBR and PLST as M varies.]

hand, for yeast, which has no strong label-set sparsity, the change of reference point is not only important for PBR and PLST when M is small, but also leads to significant improvements in CS.

Fig. 12 compares PLST-Mean and PLST-Origin with PBR-Mean and CS-BestVertex on all data sets. Even against the improved PBR-Mean and CS-BestVertex, the findings remain the same: PLST, with or without the mean shift, is significantly better than the other two algorithms, with PLST-Mean being the better choice. The results suggest that the principal directions (obtained by SVD), rather than the mean shift, are the key factor behind the success of PLST.
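The PLST-Mean pipeline whose principal directions are credited above can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation: fitting takes an SVD of the mean-shifted label matrix, encoding projects labels onto the top M right singular directions, and decoding projects back, adds the mean, and rounds each coordinate to {0, 1}:

```python
import numpy as np

def plst_fit(Y, M):
    """Learn M principal directions of the mean-shifted label matrix Y (N x K)."""
    o = Y.mean(axis=0)                        # reference point: the average yn
    _, _, Vt = np.linalg.svd(Y - o, full_matrices=False)
    return Vt[:M].T, o                        # V: K x M orthonormal directions

def plst_encode(Y, V, o):
    """Project shifted labels onto the directions: the M regression targets."""
    return (Y - o) @ V

def plst_decode(T, V, o):
    """Map predicted codes back to label space; round-based decoding."""
    return (T @ V.T + o >= 0.5).astype(int)
```

With M equal to the label-space dimension the round trip is exact; smaller M trades reconstruction error for fewer regression subtasks, which is the trade-off the experiments explore.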

5 Conclusion

We presented the hypercube view for problem transformation approaches to multi-label classification. The view offers geometric interpretations of many existing algorithms, including Binary Relevance, Label Power-set, Label Ranking, Compressive Sensing (CS), Topic Modeling, and Kernel Dependency Estimation. Inspired by this view, we introduced the notion of hypercube sparsity and took it into account through Principal Label Space Transformation (PLST). We derived a theoretical guarantee for PLST and conducted experiments comparing PLST with Binary Relevance and CS. The experimental results verified that PLST successfully reduces the computational effort of multi-label classification, especially for data sets with a large number of labels. Most importantly, when compared with CS, PLST not only enjoys a faster decoding scheme, but also reduces the multi-label classification problem to fewer and simpler regression subtasks. These advantages and the empirical superiority suggest that PLST should be the preferred choice over CS in practice.

As demonstrated through the experiments, PLST is able to achieve similar performance with substantially fewer dimensions than the original label space. An immediate direction for future work is to determine how to automatically and efficiently choose a reasonable parameter M for PLST.
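One plausible heuristic for choosing M, which we sketch here as an assumption rather than a method from this work, is to keep the smallest M whose top singular values of the mean-shifted label matrix retain a given fraction of the total squared singular-value energy:

```python
import numpy as np

def choose_M(Y, energy=0.90):
    """Heuristic sketch (our assumption): smallest M whose top-M singular
    values of the mean-shifted label matrix keep `energy` of the total
    squared singular-value energy."""
    s = np.linalg.svd(Y - Y.mean(axis=0), compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)
```

Such a rule is cheap (one SVD, which PLST computes anyway), but whether an energy threshold correlates well with downstream Hamming loss would need the kind of validation this paper leaves to future work.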

We discussed in Section 1 that PLST can be viewed as a special case of the kernel dependency estimation (KDE) algorithm (Weston et al., 2002). To the best of our knowledge, we are the first to focus on KDE's linear form for multi-label classification,

[Figure 11: Change of Reference Points for Linear Label Space Transformation Algorithms. Panels: (a) delicious (CS), (b) yeast (CS), (c) delicious (PBR), (d) yeast (PBR), (e) delicious (PLST), (f) yeast (PLST). Each panel plots the test Hamming loss of the origin-based variant against the improved variant as M varies.]

[Figure 12: Test Hamming Loss of Improved PBR and CS versus PLST Variants with RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Each panel plots CS-BestVertex, PBR-Mean, PLST-Origin, and PLST-Mean as M varies.]

PLST, which readily leads to promising performance. A natural direction for future work is to carefully evaluate the usefulness of KDE's non-linear form for multi-label classification.

Acknowledgments

A preliminary version of this paper appeared in the Second International Workshop on Learning from Multi-label Data. We thank the reviewers of the workshop, as well as the reviewers of all versions of this paper, for their many useful suggestions. We also thank Krzysztof Dembczynski, Weiwei Cheng, Eyke Hüllermeier, and Willem Waegeman for valuable discussions. This research has been supported by the National Science Council of Taiwan via NSC 98-2221-E-002-192 and 100-2628-E-002-010.

References

Ando, R. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.

Barutcuoglu, Z., Schapire, R. E., and Troyanskaya, O. G. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836.

Blei, D. M., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771.

Clare, A. and King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53.

Csiszár, I. (1995). Maxent, mathematics, and information theory. In Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods, pages 35–50.

Datta, B. N. (1995). Numerical Linear Algebra and Applications. Brooks/Cole Publishing.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. (2010a). On label dependencies in multi-label classification. In Proceedings of the 2nd Workshop on Learning from Multi-label Data.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. (2010b). Regret analysis for performance metrics in multi-label classification: The case of Hamming and subset zero-one loss. In European Conference on Machine Learning.

Elisseeff, A. and Weston, J. (2002). A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, pages 681–688.

Fan, R.-E. and Lin, C.-J. (2007). A study on threshold selection for multi-label classification. Technical report, National Taiwan University.

Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., and Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, volume 11.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Hsu, D., Kakade, S. M., Langford, J., and Zhang, T. (2009). Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22, pages 772–780.

Ji, S., Tang, L., Yu, S., and Ye, J. (2010). A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data, 4(2):1–29.

Law, E., Settles, B., and Mitchell, T. (2010). Learning to tag using noisy labels. In European Conference on Machine Learning.
