
One particular advantage of PLST appears to be its use of the average of the label vectors yn as the reference point o, which readily captures an estimate of the presence ratio of each label. PBR uses the origin as the reference point, so the advantage of changing the reference point has not been explored. The flexibility of CS in choosing the reference point (to make the signal vectors sparser) has not been explored, either.

To understand whether the choice of reference point is an important factor behind the success of PLST, we conduct additional experiments using three more algorithms:

1. PLST-Origin: a crippled PLST that takes the origin 0 as o and extracts a different set of principal directions that pass through the origin. The usual PLST is then renamed PLST-Mean for clarity.

2. PBR-Mean: an improved PBR that takes the average of the yn as o. The usual PBR is renamed PBR-Origin.

3. CS-BestVertex: an improved CS that transforms y[k] to

y[k] ⊕ majority{yn[k] : n = 1, 2, . . . , N}

before (and after) training. In other words, the improved CS computes the best vertex to strengthen label-set sparsity prior to training. The usual CS is renamed CS-Origin.
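The three reference-point choices above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function name `shift_labels` and the string keys are our own:

```python
import numpy as np

def shift_labels(Y, reference="mean"):
    """Shift a binary label matrix Y (N x K) relative to a reference point o.

    "origin"     : o = 0, no shift (PLST-Origin / PBR-Origin / CS-Origin).
    "mean"       : o = average label vector, which captures the per-label
                   presence ratio (PLST-Mean / PBR-Mean).
    "bestvertex" : o = per-label majority vote; XOR-ing each label with it
                   strengthens label-set sparsity (CS-BestVertex).
    Returns the shifted matrix and o, which is needed to undo the shift
    after decoding.
    """
    Y = np.asarray(Y, dtype=float)
    if reference == "origin":
        return Y, np.zeros(Y.shape[1])
    if reference == "mean":
        o = Y.mean(axis=0)
        return Y - o, o
    if reference == "bestvertex":
        o = (Y.mean(axis=0) >= 0.5).astype(float)
        return np.abs(Y - o), o  # |y - o| equals y XOR o on {0, 1} entries
    raise ValueError(f"unknown reference: {reference}")
```

For instance, on a label matrix where the first label is always present, the best-vertex shift flips that label off for every example, leaving a sparser matrix to compress.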

Fig. 11 shows the Hamming loss comparison between the usual PLST, PBR, and CS and their variants on delicious and yeast, using RLR as the regressor. For delicious, which exhibits strong label-set sparsity, we see that changing the reference point does not make much difference for any pair of algorithms. In particular, the origin is readily the best vertex for CS, and is quite close to the mean for PBR and PLST. On the other

[Figure 10: Test Encoding Error of PBR and PLST. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Each panel plots the encoding error of PBR and PLST as M varies.]

hand, for yeast, which has no strong label-set sparsity, the change of reference point is not only important for PBR and PLST when M is small, but also leads to significant improvements in CS.

Fig. 12 compares PLST-Mean and PLST-Origin with PBR-Mean and CS-BestVertex on all data sets. Even against the improved PBR-Mean and CS-BestVertex, the findings remain the same: PLST, with or without the mean shift, is significantly better than the other two algorithms, with PLST-Mean being the better choice. The results suggest that the principal directions (obtained by SVD), rather than the mean shift, are the key factor behind the success of PLST.
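The PLST-Mean pipeline whose principal directions are credited above can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation: fitting takes an SVD of the mean-shifted label matrix, encoding projects labels onto the top M right singular directions, and decoding projects back, adds the mean, and rounds each coordinate to {0, 1}:

```python
import numpy as np

def plst_fit(Y, M):
    """Learn M principal directions of the mean-shifted label matrix Y (N x K)."""
    o = Y.mean(axis=0)                        # reference point: the average yn
    _, _, Vt = np.linalg.svd(Y - o, full_matrices=False)
    return Vt[:M].T, o                        # V: K x M orthonormal directions

def plst_encode(Y, V, o):
    """Project shifted labels onto the directions: the M regression targets."""
    return (Y - o) @ V

def plst_decode(T, V, o):
    """Map predicted codes back to label space; round-based decoding."""
    return (T @ V.T + o >= 0.5).astype(int)
```

With M equal to the label-space dimension the round trip is exact; smaller M trades reconstruction error for fewer regression subtasks, which is the trade-off the experiments explore.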

5 Conclusion

We presented the hypercube view for problem transformation approaches to multi-label classification. The view offers geometric interpretations of many existing algorithms, including Binary Relevance, Label Power-set, Label Ranking, Compressive Sensing (CS), Topic Modeling, and Kernel Dependency Estimation. Inspired by this view, we introduced the notion of hypercube sparsity and took it into account through Principal Label Space Transformation (PLST). We derived a theoretical guarantee for PLST and conducted experiments comparing PLST with Binary Relevance and CS. The experimental results verified that PLST successfully reduces the computational effort of multi-label classification, especially for data sets with a large number of labels. Most importantly, when compared with CS, PLST not only enjoys a faster decoding scheme, but also reduces the multi-label classification problem to fewer and simpler regression subtasks. These advantages and the empirical superiority suggest that PLST should be the preferred choice over CS in practice.

As demonstrated through the experiments, PLST is able to achieve similar performance with substantially fewer dimensions than the original label space. An immediate direction for future work is to determine how to automatically and efficiently choose a reasonable parameter M for PLST.
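One plausible heuristic for choosing M, which we sketch here as an assumption rather than a method from this work, is to keep the smallest M whose top singular values of the mean-shifted label matrix retain a given fraction of the total squared singular-value energy:

```python
import numpy as np

def choose_M(Y, energy=0.90):
    """Heuristic sketch (our assumption): smallest M whose top-M singular
    values of the mean-shifted label matrix keep `energy` of the total
    squared singular-value energy."""
    s = np.linalg.svd(Y - Y.mean(axis=0), compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)
```

Such a rule is cheap (one SVD, which PLST computes anyway), but whether an energy threshold correlates well with downstream Hamming loss would need the kind of validation this paper leaves to future work.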

We discussed in Section 1 that PLST can be viewed as a special case of the kernel dependency estimation (KDE) algorithm (Weston et al., 2002). To the best of our knowledge, we are the first to focus on KDE's linear form for multi-label classification,

[Figure 11: Change of Reference Points for Linear Label Space Transformation Algorithms. Panels: (a) delicious (CS), (b) yeast (CS), (c) delicious (PBR), (d) yeast (PBR), (e) delicious (PLST), (f) yeast (PLST). Each panel plots the test Hamming loss of the origin-based variant against the improved variant as M varies.]

[Figure 12: Test Hamming Loss of Improved PBR and CS versus PLST Variants with RLR. Panels: (a) delicious, (b) corel5k, (c) mediamill, (d) yeast, (e) emotions. Each panel plots CS-BestVertex, PBR-Mean, PLST-Origin, and PLST-Mean as M varies.]

PLST, which readily leads to promising performance. A natural direction for future work is to carefully evaluate the usefulness of KDE's non-linear form for multi-label classification.

Acknowledgments

A preliminary version of this paper appeared in the Second International Workshop on Learning from Multi-label Data. We thank the reviewers of the workshop, as well as the reviewers of all versions of this paper, for their many useful suggestions. We also thank Krzysztof Dembczynski, Weiwei Cheng, Eyke Hüllermeier, and Willem Waegeman for valuable discussions. This research has been supported by the National Science Council of Taiwan via NSC 98-2221-E-002-192 and 100-2628-E-002-010.

References

Ando, R. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.

Barutcuoglu, Z., Schapire, R. E., and Troyanskaya, O. G. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836.

Blei, D. M., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771.

Clare, A. and King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53.

Csiszár, I. (1995). Maxent, mathematics, and information theory. In Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods, pages 35–50.

Datta, B. N. (1995). Numerical Linear Algebra and Applications. Brooks/Cole Publishing.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. (2010a). On label dependencies in multi-label classification. In Proceedings of the 2nd Workshop on Learning from Multi-label Data.

Dembczynski, K., Waegeman, W., Cheng, W., and Hüllermeier, E. (2010b). Regret analysis for performance metrics in multi-label classification: The case of Hamming and subset zero-one loss. In European Conference on Machine Learning.

Elisseeff, A. and Weston, J. (2002). A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, pages 681–688.

Fan, R.-E. and Lin, C.-J. (2007). A study on threshold selection for multi-label classification. Technical report, National Taiwan University.

Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., and Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, volume 11.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Hsu, D., Kakade, S. M., Langford, J., and Zhang, T. (2009). Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22, pages 772–780.

Ji, S., Tang, L., Yu, S., and Ye, J. (2010). A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data, 4(2):1–29.

Law, E., Settles, B., and Mitchell, T. (2010). Learning to tag using noisy labels. In European Conference on Machine Learning.
