Evaluation Criteria for Multi-label Classification
Rong-En Fan
Department of Computer Science National Taiwan University
Outline
Introduction
Binary Method and Evaluation Criteria Supervised Threshold Setting
Discussions and Conclusions
Classification
Two-class problem: each instance associated with either a positive or a negative label
Multi-class problem: each instance associated with exactly one label
Multi-label problem: each instance associated with several labels; e.g., a news story may belong to both sports and politics
Representation for Multiple Labels
Given instances x_1, ..., x_l ∈ R^n and d labels 1, ..., d
The binary vector for the label subset of x_i is y_i = [y_{1i}, ..., y_{di}] ∈ {0, 1}^d,
where y_{ji} = 1 ⇔ x_i is associated with label j.
y1 = [0, 0, 0, 0, 1]
y2 = [0, 0, 1, 0, 1]
y3 = [1, 0, 1, 1, 0]
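As a concrete sketch (plain Python; the helper name `to_indicator` is mine, not from the slides), the label subsets above map to binary vectors like this:

```python
def to_indicator(label_sets, d):
    """Map each instance's set of labels (numbered 1..d) to a
    binary indicator vector y_i in {0, 1}^d."""
    return [[1 if j + 1 in labels else 0 for j in range(d)]
            for labels in label_sets]

# The three instances shown on the slide:
Y = to_indicator([{5}, {3, 5}, {1, 3, 4}], d=5)
# Y == [[0, 0, 0, 0, 1], [0, 0, 1, 0, 1], [1, 0, 1, 1, 0]]
```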
Binary Method and Evaluation Criteria
The Binary Method
An extension of the “one-against-the-rest” method for multi-class classification:
one label positive, all the others negative
Example:
y1 = [1, 0, 0, 0, 1] ⇒ x1 positive for labels 1 and 5, negative for the rest
y2 = [0, 0, 1, 0, 1] ⇒ x2 positive for labels 3 and 5
y3 = [0, 0, 1, 0, 1] ⇒ x3 positive for labels 3 and 5
y4 = [1, 1, 0, 1, 0] ⇒ x4 positive for labels 1, 2, and 4
y5 = [0, 0, 0, 0, 1] ⇒ x5 positive for label 5
y6 = [1, 1, 0, 1, 0] ⇒ x6 positive for labels 1, 2, and 4
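The same decomposition in code (a minimal sketch, plain Python; `one_vs_rest_targets` is a name of my own): each label j yields one two-class problem over all instances.

```python
def one_vs_rest_targets(Y):
    """For each of the d labels, build the +1/-1 target vector of the
    corresponding two-class problem: label j positive, the rest negative."""
    d = len(Y[0])
    return [[+1 if y[j] == 1 else -1 for y in Y] for j in range(d)]

Y = [[1, 0, 0, 0, 1],
     [0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1]]
targets = one_vs_rest_targets(Y)
# targets[0] == [1, -1, -1]: the binary problem for label 1
```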
The Classifier
Need d two-class classifiers; for example, the Support Vector Machine (SVM)
SVM decision function for label j: f_j(x) = w_j^T x + b_j
  f_j(x) > 0: x is relevant to label j
  f_j(x) < 0: otherwise
Can be written as w_j^T x + b_j > T_j with T_j = 0
Output positive ⇔ the inequality holds, negative otherwise
Evaluation Measures
Exact Match Ratio [Kazawa et al., 2005]
direct extension of accuracy in classification:

    \frac{1}{\bar{l}} \sum_{i=1}^{\bar{l}} I[\hat{y}_i = y_i]

\bar{l}: number of testing instances
y_i: true label vector of the ith instance
\hat{y}_i: predicted label vector of the ith instance
I(s) is 1 if s is true, otherwise 0

I\left[ [1, 0, 1]^T = [0, 0, 1]^T \right] = 0: partial matches are not counted
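The exact match ratio is straightforward to compute; a minimal sketch (function name mine):

```python
def exact_match_ratio(Y_true, Y_pred):
    """Fraction of test instances whose predicted label vector equals
    the true one exactly; partial matches count as 0."""
    l = len(Y_true)
    return sum(yt == yp for yt, yp in zip(Y_true, Y_pred)) / l

r = exact_match_ratio([[1, 0, 1], [0, 0, 1]],
                      [[0, 0, 1], [0, 0, 1]])
# r == 0.5: the first vector differs in one position, so it scores 0
```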
Evaluation Measures
Consider partial matches
Macro-average F-measure [Lewis, 1991]
mean of per-label F-measure, d independent parts:

    \frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}

\sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}: true positives of label j
\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}: avoids too many false positives
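A direct implementation of the macro-average F-measure (my own sketch; labels with an empty denominator are scored 0 here, one common convention):

```python
def macro_f(Y_true, Y_pred):
    """Mean of the per-label F-measures over the d labels."""
    l, d = len(Y_true), len(Y_true[0])
    total = 0.0
    for j in range(d):
        tp = sum(Y_pred[i][j] * Y_true[i][j] for i in range(l))
        denom = (sum(Y_pred[i][j] for i in range(l))
                 + sum(Y_true[i][j] for i in range(l)))
        total += 2 * tp / denom if denom > 0 else 0.0  # empty label: F = 0
    return total / d
```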
Evaluation Measures
Micro-average F-measure [Lewis, 1991]
F-measure across all labels, non-separable:

    \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}
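For contrast, the micro-average pools the counts over all labels before forming one F-measure (sketch; function name mine):

```python
def micro_f(Y_true, Y_pred):
    """F-measure computed from counts pooled over all labels;
    unlike the macro-average it does not decompose per label."""
    tp = sum(p * t for yt, yp in zip(Y_true, Y_pred)
             for t, p in zip(yt, yp))
    pred = sum(p for yp in Y_pred for p in yp)   # total predicted positives
    pos = sum(t for yt in Y_true for t in yt)    # total true positives
    return 2 * tp / (pred + pos) if pred + pos else 0.0
```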
Optimize Measures
Predictions \hat{y}: a function of the d decision functions
Measure: a function of the predictions

    m\left( \hat{y}(f_1, f_2, \ldots, f_d) \right)

Maximizing m directly is a difficult global optimization problem
Instead, adjust one function and fix the others, sequentially:

    \max_{f_k} m\left( \hat{y}(f_1, \ldots, f_k, \ldots, f_d) \right)

obtain f_k such that the measure value is non-decreasing
Optimize Measures: Example
Optimize macro-average F-measure
Adjust one function and fix others sequentially
    \max_{f_k} \frac{1}{d} \left( \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ki} y_{ki}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ki} + \sum_{i=1}^{\bar{l}} y_{ki}} + \text{terms not related to } f_k \right)

obtain f_k to improve the macro-average
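The sequential scheme could be sketched as a generic coordinate-ascent loop (all names are mine; `adjust` stands in for whatever per-function update is used, e.g. threshold tuning):

```python
def coordinate_ascent(fs, measure, adjust, sweeps=3):
    """Adjust one decision function at a time while fixing the others,
    accepting a move only if the measure does not decrease."""
    for _ in range(sweeps):
        for k in range(len(fs)):
            candidate = adjust(fs, k)
            if measure(candidate) >= measure(fs):
                fs = candidate
    return fs

# Toy check: "functions" are numbers, the measure rewards small magnitude,
# and adjust halves component k; the measure never decreases along the way.
best = coordinate_ascent(
    fs=[4.0, -2.0],
    measure=lambda fs: -sum(x * x for x in fs),
    adjust=lambda fs, k: fs[:k] + [fs[k] / 2] + fs[k + 1:],
)
# best == [0.5, -0.25]
```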
Supervised Threshold Setting
Motivations from [Lewis et al., 2004]
Unbalanced data distribution among labels
RCV1-V2: 101 labels, 96 labels with ≤ 10% instances
SVM.1 predicts 4,000 more positives (average) per label
           Micro-avg. F   Macro-avg. F
Binary     0.809          0.534
SVM.1      0.816          0.619

           Sum of TP    Sum of FP   Sum of FN   Sum of Predicted
Binary     1,877,239    229,687     655,939     2,106,926
SVM.1      2,034,167    420,826     499,011     2,454,993

Adjust the decision threshold T_j in the SVM prediction: f_j(x) = w_j^T x + b_j > T_j
Simple Threshold Tuning
Result of binary method
Macro-average F-measure: 0.333
label 1              label 2
y1    dec. value     y2    dec. value
1      1.2           0      1.2
1      1.1           0      1.1
0     -1.1           0     -1.1
1     -1.2           0     -1.2
0     -1.3           0     -1.4
1     -1.4           0     -1.5
0     -1.5           0     -1.6
0     -1.6           1     -1.7
0     -1.7           0     -1.8
0     -1.8           0     -1.9

T1 = 0, T2 = 0
Simple Threshold Tuning
Adjust threshold T1 of label 1
Macro-average F-measure: 0.4
T1 = −1.45 T2 = 0
Simple Threshold Tuning
Adjust threshold T2 of label 2
Macro-average F-measure: 0.511
T1 = −1.45 T2 = −1.75
Simple Threshold Tuning
Improve macro-average F-measure: 0.333 → 0.511
Five-fold cross validation for better thresholds: threshold T_j = average of the five thresholds
Tuning thresholds significantly improves the macro-average, but a few cases are still worse than SVM.1
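The per-label tuning just illustrated can be sketched as a scan over candidate thresholds, the midpoints between consecutive sorted decision values (a minimal sketch; function names are mine):

```python
def per_label_f(y, dec, T):
    """F-measure of label predictions 'dec > T' against true labels y."""
    tp = sum(t for v, t in zip(dec, y) if v > T)
    pred = sum(1 for v in dec if v > T)
    pos = sum(y)
    return 2 * tp / (pred + pos) if pred + pos else 0.0

def tune_threshold(y, dec):
    """Pick the threshold maximizing the per-label F-measure among
    midpoints of consecutive sorted decision values (plus the two ends)."""
    vals = sorted(dec, reverse=True)
    cands = ([vals[0] + 1.0]
             + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
             + [vals[-1] - 1.0])
    return max(cands, key=lambda T: per_label_f(y, dec, T))

# Label 1 from the slides: tuning moves the threshold to (essentially)
# -1.45, between decision values -1.4 and -1.5, giving F = 0.8.
y1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
dec1 = [1.2, 1.1, -1.1, -1.2, -1.3, -1.4, -1.5, -1.6, -1.7, -1.8]
T1 = tune_threshold(y1, dec1)
```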
Example: Tuning Threshold is Not Enough
OHSUMED: 94 labels, 5,000 training / 5,000 testing instances
Macro-average / micro-average F-measure:
  SVM.1: 0.234 / 0.496
  Simple Threshold Tuning: 0.224 / 0.469
Improperly tuned thresholds when positives are few (label 52: 12 positives; label 57: 4)

                          Sum of TP   Sum of FP   Sum of FN
SVM.1                     7,343       9,141       5,804
Simple Threshold Tuning   7,365       10,895      5,782

                          label 52            label 57
                          TP   FP      FN     TP   FP   FN
SVM.1                     0    3       12     2    3    3
Simple Threshold Tuning   9    1,084   3      0    0    5
The fbr heuristic [Yang, 2001]
With few positive examples, the tuned threshold is easily biased and generalizes poorly: too low (many FP) or too high (many FN)
Remedy: if F-measure < fbr value, set threshold to the highest decision value
Purpose: adjust the way of predicting positive instances
Limited experiments in [Yang, 2001], a comprehensive study in this work
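The heuristic itself is one line on top of threshold tuning (a sketch; names are mine, and the "highest decision value" fallback follows the remedy stated above):

```python
def fbr_adjust(T, best_f, dec, fbr):
    """If the best validation F-measure for this label is below the fbr
    value, distrust the tuned threshold T and fall back to the highest
    decision value, so the label predicts positives only conservatively."""
    return max(dec) if best_f < fbr else T

# Label 2 from the slides: F = 0.4 < fbr = 0.5, so T2 moves from -1.75
# to the highest decision value 1.2.
dec2 = [1.2, 1.1, -1.1, -1.2, -1.4, -1.5, -1.6, -1.7, -1.8, -1.9]
T2 = fbr_adjust(-1.75, 0.4, dec2, fbr=0.5)
# T2 == 1.2
```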
Simple Threshold Tuning with fbr = 0.5
Label 1 F-measure = 0.8 > 0.5
T1 = −1.45 T2 = 0
Simple Threshold Tuning with fbr = 0.5
Label 2 F-measure = 0.4 < 0.5
T1 = −1.45 T2 = −1.75
Simple Threshold Tuning with fbr = 0.5
Label 2 F-measure = 0.4 < 0.5
Set threshold T2 to the highest decision value
T1 = −1.45 T2 = 1.2
Simple Threshold Tuning with fbr = 0.5
Testing macro-average: 0.633 (without fbr: 0.4)
Testing micro-average: 0.615 (without fbr: 0.4)
label 1              label 2
y1    dec. value     y2    dec. value
0      1.2           0      1.4
1      1.1           1      1.3
0     -1.1           0      1.2
0     -1.2           0      1.1
1     -1.3           0     -1.1
1     -1.4           0     -1.2
1     -1.5           0     -1.4
0     -1.6           0     -1.6
0     -1.7           0     -1.7
0     -1.8           0     -1.8

T1 = -1.45; T2 = 1.2 (with fbr) vs. T2 = -1.75 (without)
Simple Threshold Tuning with fbr = 0.5
Testing performance
macro-average F-measure: 0.633 (was 0.4)
micro-average F-measure: 0.615 (was 0.4)
Without fbr, the thresholds overfit the measure on the validation set; fbr gives better generalization
The SVM.1 Method [Lewis et al., 2004]
Simple Threshold Tuning with the fbr heuristic; the fbr value is selected by five-fold cross validation
eight fbr candidates: 0.1 to 0.8
Two-level cross validation
Outer: select the best fbr value
Inner: choose the threshold
Train 5 × 5 = 25 more classifiers
Time consuming
Experiments
Three parts, each targeting one measure:
Macro-average F-measure
Micro-average F-measure
Exact match ratio
Optimize measure with SVM.1-type methods
Algo. 1: Simple Threshold Tuning
Algo. 2: Simple Threshold Tuning with fbr = 0.1
Algo. 3: the SVM.1 Method
Examine effects of fbr heuristic
Experiments: Data Sets
scene: semantic scene classification
yeast: yeast gene microarray data
OHSUMED: medical literature documents
Yahoo!: web pages from yahoo.com; six sets: Arts (Ar), Business (Bu), Computers (Co), Education (Ed), Entertainment (En), Health (He)
RCV1-V2: newswire stories from Reuters
Optimize Macro-average F-measure
macro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.653    0.766     0.766     0.759
yeast      0.405    0.504     0.504     0.506
OHSUMED    0.117    0.221     0.244     0.236
Ar         0.304    0.401     0.408     0.403
Bu         0.281    0.384     0.383     0.383
Co         0.292    0.382     0.390     0.382
Ed         0.255    0.311     0.341     0.335
En         0.410    0.493     0.492     0.492
He         0.337    0.391     0.418     0.410
RCV1-V2    0.378    0.516     0.536     0.528

micro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.652    0.756     0.756     0.750
yeast      0.651    0.651     0.643     0.633
OHSUMED    0.452    0.481     0.487     0.494
Ar         0.484    0.528     0.529     0.532
Bu         0.774    0.768     0.771     0.772
Co         0.588    0.606     0.608     0.609
Ed         0.507    0.532     0.546     0.543
En         0.624    0.659     0.658     0.658
He         0.697    0.719     0.715     0.714
RCV1-V2    0.744    0.756     0.752     0.761
Tuning threshold significantly helps:
Algo. 1 > Binary
Macro-average further improved with fbr : Algo. 2 and 3 > Algo. 1
Why better macro-average with fbr ?
Threshold too high with Algo. 1
Macro-average in one run of OHSUMED: 0.224 → 0.242 (Algo. 1 vs. Algo. 2)

                           Algo. 2                  threshold
label   training pos. #    TP   FP   FN    F     Algo. 1   Algo. 2
16      2                  1    2    1    0.4    0.70      0.28
32      1                  1    8    0    0.2    0.84      0.11
35      5                  1    0    1    0.6    0.87      0.32
57      4                  2    2    3    0.4    0.78      0.31

Algo. 1 predicts no positives for these labels; the thresholds selected by Algo. 2 are lower
Why better macro-average with fbr ?
(cont’d) With few positives (≤ 5) in training, most decision values are below the threshold
The validation set may contain no positives, so Algo. 1 cannot find a better threshold and the threshold remains high
With fbr, the threshold is decreased to the highest decision value
Why better micro-average with fbr ?
Threshold too low with Algo. 1
Micro-average in one run of OHSUMED: 0.469 → 0.496 (Algo. 1 vs. Algo. 2)
Algo. 1 overfits the macro-average on the validation set

          Sum of TP   Sum of FP   Sum of FN
Algo. 1   7,365       10,895      5,782
Algo. 2   7,360       19,405      5,787

label 65: 28/27 positive instances
Best F-measure in training: 0.0107 < 0.1
Algo. 1 predicts 358 positives, only 3 of them correct; Algo. 2 predicts no positives
Optimize Micro-avg. & Exact Match Ratio
Details not shown here
Tuning threshold generally helps, but improvements not as big as when we optimize macro-average
fbr does not further help the target measure
Discussions and Conclusions
Adjusting decision thresholds is useful (for all measures)
fbr heuristic increases the macro-average F-measure, which is strongly related to the per-label F-measure
fbr does not help when we optimize the other two measures
Macro-average F-measure: easier to improve
Measures calculated across all labels: harder to optimize
macro-avg. F-measure (Binary value; Algo. 1 / Algo. 2 improvement in %)
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.653    17.4          17.4
yeast      0.405    24.3          24.2
OHSUMED    0.117    89.0          109.4
Ar         0.304    32.2          34.5
Bu         0.281    36.5          36.1
Co         0.292    30.6          33.3
Ed         0.255    22.3          33.9
En         0.410    20.2          19.9
He         0.337    15.9          24.0
RCV1-V2    0.378    36.6          41.8

micro-avg. F-measure
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.652    16.6          16.4
yeast      0.651    4.6           4.7
OHSUMED    0.452    14.6          13.9
Ar         0.484    11.0          11.0
Bu         0.774    1.2           0.9
Co         0.588    6.3           5.9
Ed         0.507    8.7           8.4
En         0.624    5.8           5.7
He         0.697    3.5           3.0
RCV1-V2    0.744    3.9           3.5

exact match ratio
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.490    22.7          22.4
yeast      0.206    4.4           5.7
OHSUMED    0.160    11.0          8.8
Ar         0.320    7.9           7.7
Bu         0.617    0.4           -0.2
Co         0.443    4.5           3.9
Ed         0.335    8.6           7.4
En         0.464    8.3           7.9
He         0.517    2.0           0.1
RCV1-V2    0.469    4.2           3.2
Macro-average F-measure: easier to improve
mean of per-label F-measure, d independent parts:

    \frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}

Measures calculated across all labels: harder to optimize
e.g., the micro-average F-measure, computed across all labels and non-separable:

    \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}