4.6 Extended Experiments
5.1.3 Experimental Setup
For the cost-less multi-label classification part, we compare the performance of GLE with that of six state-of-the-art multi-label learning algorithms: RAkEL, BR (Binary Relevance) [55], CC (Classifier Chains) [49], MLKNN [67], IBLR [10], and BPMLL [66]. We implement GLE in MATLAB. For the LP multi-class classifiers in GLE, we use the linear multi-class SVM implemented in LIBSVM, which follows the one-against-one approach. The parameter C of the SVM is set to 1.
Based on our observation, the best selected parameters k and M are usually the same for both RAkEL and GLE. We therefore select k and M for RAkEL using cross-validation, with Hamming loss as the model selection criterion, and apply the selected k and M to GLE; the parameters are listed in Table 5.2. The parameters γ and ν in GLE are then selected using cross-validation separately for each of the five evaluation metrics. As mentioned in Section 2.3.1, the computational cost of obtaining β* is very small. After training the LP classifiers, we can obtain β for thousands of different parameter pairs (γ, ν) within one second. Furthermore, we show the experimental results with different γ and ν in Section 5.1.5. The results suggest that we can obtain good enough results by tuning γ and setting ν to 1, which alleviates the burden of parameter selection.
Table 5.2: Selected Parameters k and M of GLE and RAkEL for the Multi-Label Datasets
Dataset       k     M
scene         4     15
enron         16    250
cal500        10    250
majorminer    14    250
medical       14    250
bibtex        24    250
dlc1          26    250
dlc2          32    250
dlc3          32    250
dlc4          32    250
BR, CC, MLKNN, and IBLR are implemented in the MULAN package. For BR, we use linear logistic regression with a probabilistic output score as the base classifier. For CC, we use the linear SVM as the base classifier, and the parameter C is set to 1. For MLKNN and IBLR, the size of the neighborhood is set to 10 [10, 67]. We use the MATLAB implementation of BPMLL provided by its authors. For BPMLL, the number of hidden neurons is set to 20% of the dimensionality, and the number of training epochs is selected using cross-validation. The implementation and settings of RAkEL are similar to those of GLE. The parameters k and M in RAkEL are selected using cross-validation. We perform three-fold cross-validation sixty times for the five medium-scale datasets and three times for the five large-scale datasets, and calculate the mean and standard deviation of the results.
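The repeated three-fold protocol described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used for the experiments. The function names `repeated_kfold` and `summarize` are hypothetical, and two details are assumptions not stated in the text: each repetition reshuffles the data, and the standard deviation is the sample standard deviation taken over all per-fold scores.

```python
import random
import statistics


def repeated_kfold(n_samples, n_folds=3, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Assumption: the data are reshuffled before each repetition.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices round-robin into n_folds disjoint folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

With `n_repeats=60` this produces 180 train/test splits for a medium-scale dataset and with `n_repeats=3` it produces 9 for a large-scale one; the score of each split would come from training and evaluating a classifier, which is outside the scope of this sketch.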
5.1.4 Experimental Results
The experimental results of multi-label classification are summarized in Table 5.3. The numbers in parentheses represent the rank of the algorithm among the compared algorithms. We do not report the performance on cal500 in terms of subset 0/1 loss, since none of the methods achieves an error rate better than 1.0. The average ranks of our method GLE on the ten datasets for the five metrics are 1.8, 2.4, 1.5, 1.4, and 1.9, respectively. On four of the five metrics, GLE achieves the best performance. GLE performs slightly worse than MLKNN only in terms of ranking loss, and the difference is very small. We observe that RAkEL performs close to GLE in terms of Hamming loss, but in terms of the other four metrics GLE performs much better than RAkEL. Among the five metrics, the improvement of GLE is more significant on Hamming loss, subset 0/1 loss, and one error. Generally speaking, GLE has better or competitive performance compared with the other state-of-the-art methods. We have run the pairwise t-test at the 5% significance level on the experimental results. In Table 5.3, we use •/◦ to indicate whether GLE is statistically superior/inferior to the compared algorithm. When the difference is not significant, no marker is given. There are 246 cases in which GLE performs significantly better than the compared method and only 34 cases in which GLE performs significantly worse.
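As an illustration of how the parenthesized ranks and the average ranks can be computed, the following Python sketch ranks the algorithms on each dataset and averages the ranks across datasets. The function names are hypothetical. The tie rule (tied scores share the lowest rank) is an assumption, chosen because it matches the cal500 row of the Table 5.3 excerpt, where all seven algorithms are tied at rank 1.

```python
def competition_ranks(scores, lower_is_better=True):
    """Rank 1 is best; tied scores share the lowest rank (assumed tie rule)."""
    keyed = list(scores) if lower_is_better else [-s for s in scores]
    # An algorithm's rank is one plus the number of strictly better scores.
    return [1 + sum(other < s for other in keyed) for s in keyed]


def average_ranks(rows, lower_is_better=True):
    """Average each algorithm's rank over the datasets (one row per dataset)."""
    ranks = [competition_ranks(row, lower_is_better) for row in rows]
    return [sum(column) / len(ranks) for column in zip(*ranks)]


# Column order: GLE, RAkEL, BR, CC, MLKNN, IBLR, BPMLL.
# The two rows below are copied from the Table 5.3 excerpt.
cal500 = [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]
majorminer = [0.9081, 0.9113, 0.9602, 0.9434, 0.9358, 0.9701, 0.9910]

print(competition_ranks(majorminer))  # [1, 2, 5, 4, 3, 6, 7]
print(average_ranks([cal500, majorminer]))
```

The ranks printed for majorminer agree with those in parentheses in the excerpt. For a metric where larger values are better, such as average precision, pass `lower_is_better=False`.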
Since both GLE and RAkEL are LP-based methods, we calculate the relative improvement of GLE over RAkEL on each dataset and report the average relative improvement over all datasets for each of the five evaluation metrics in Table 5.4. In each iteration of the training phase, GLE and RAkEL use the same randomly selected k-labelsets for the LP classifiers. Table 5.4 also shows the relative improvement over RAkEL of two simplified versions of GLE, that is, GLE without two-norm regularization (γ = 0) and GLE without hypergraph regularization (ν = 0). We observe that the relative improvement of GLE over RAkEL is more significant in terms of ranking loss than in terms of the other metrics. GLE achieves around 10% relative improvement over RAkEL in terms of both one error and average precision, but the improvement is small in terms of Hamming loss and subset 0/1 loss. The simplified version of GLE without two-norm regularization performs even worse than RAkEL in terms of Hamming loss, subset 0/1 loss, and one error. The simplified version of GLE without hypergraph regularization performs better than RAkEL but worse than GLE.
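The text does not spell out how the relative improvement is computed. One plausible reading, given here only as a hedged Python sketch, is the percentage reduction of a loss relative to the RAkEL value, with the sign flipped for metrics where larger is better (average precision). The function names and the exact formula are assumptions, not the authors' definition.

```python
def relative_improvement(baseline, proposed, higher_is_better=False):
    """Relative improvement of `proposed` over `baseline`, in percent.

    Assumed definition: (baseline - proposed) / baseline for losses,
    (proposed - baseline) / baseline for scores where larger is better.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    gain = proposed - baseline if higher_is_better else baseline - proposed
    return 100.0 * gain / abs(baseline)


def average_relative_improvement(pairs, higher_is_better=False):
    """Average the per-dataset improvements; `pairs` holds (baseline, proposed)."""
    values = [relative_improvement(b, p, higher_is_better) for b, p in pairs]
    return sum(values) / len(values)


# Example with the majorminer values from the Table 5.3 excerpt
# (RAkEL = 0.9113, GLE = 0.9081, lower is better).
print(relative_improvement(0.9113, 0.9081))
```

Under this definition a negative value means the method is worse than RAkEL, which is consistent with the negative entries for γ = 0 in Table 5.4.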
Table 5.3: Experimental Results in Terms of Five Different Evaluation Metrics. The Numbers in Parentheses Represent the Rank of the Algorithm Among the Compared Algorithms. The Average Rank is the Average of the Ranks Across All Datasets. •/◦ indicates whether GLE is statistically superior/inferior to the compared algorithm (the pairwise t-test at the 5% significance level).
Hamming Loss
Dataset       GLE           RAkEL          BR             CC             MLKNN          IBLR           BPMLL
cal500        1.0000 (1)    1.0000 (1)     1.0000 (1)     1.0000 (1)     1.0000 (1)     1.0000 (1)     1.0000 (1)
majorminer    0.9081 (1)    0.9113 (2)•    0.9602 (5)•    0.9434 (4)•    0.9358 (3)•    0.9701 (6)•    0.9910 (7)•
Table 5.4: Relative Improvement of GLE and Its Two Simplified Versions Over RAkEL in Terms of Five Different Evaluation Metrics (In %)
          Hamming Loss    Ranking Loss    Subset 0/1 Loss    One Error    Average Precision
γ = 0     -9.11           26.44           -5.19              -28.29       2.16
ν = 0     0.20            51.26           1.30               3.87         10.25
GLE       0.22            55.34           1.83               8.15         11.80
We further compare GLE and RAkEL by varying the parameters k and M. Both methods use the same randomly selected k-labelsets for the LP classifiers. We show the average relative improvement of GLE over RAkEL in terms of the five evaluation metrics in Figure 5.1. The results for the selected parameter k in Table 5.2 are indicated by the blue curves in Figure 5.1. We have also tested larger and smaller values of k, and the results are indicated by the red and green curves, respectively. We incrementally increase the number of models M, as indicated by the horizontal axis.
Since the number of instances and the number of labels for the scene dataset are much smaller than those of the other datasets, the ten steps of M for scene are from 6 to 15. For the other nine datasets, the ten steps of M are from 25 to 250 with an increment of 25. From the results, we observe that GLE performs better than RAkEL in most parameter settings, especially as the number of models increases. A likely reason is that, when the number of models is small, it is less likely to select a significantly better or worse k-labelset, so it is reasonable to assume that all LP classifiers are equally important.