Evaluation Criteria for Multi-label Classification
Rong-En Fan
Department of Computer Science National Taiwan University
Outline
Introduction
Binary Method and Evaluation Criteria Supervised Threshold Setting
Discussions and Conclusions
Classification
Two-class problem: each instance associated with either a positive or a negative label
Multi-class problem: each instance associated with exactly one label
Multi-label problem: each instance associated with several labels; e.g., a news story may belong to both sports and politics
Representation for Multiple Labels
Given instances x_1, ..., x_l ∈ R^n and d labels 1, ..., d
The binary vector for the label subset of x_i is y_i = [y_{1i}, ..., y_{di}] ∈ {0, 1}^d,
where y_{ji} = 1 ⇔ x_i is associated with label j.
y1 = [0, 0, 0, 0, 1]
y2 = [0, 0, 1, 0, 1]
y3 = [1, 0, 1, 1, 0]
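As a concrete sketch (plain Python; the helper name `to_indicator` is mine, not from the slides), the label subsets above map to binary vectors like this:

```python
def to_indicator(label_sets, d):
    """Map each instance's set of labels (numbered 1..d) to a
    binary indicator vector y_i in {0, 1}^d."""
    return [[1 if j + 1 in labels else 0 for j in range(d)]
            for labels in label_sets]

# The three instances shown on the slide:
Y = to_indicator([{5}, {3, 5}, {1, 3, 4}], d=5)
# Y == [[0, 0, 0, 0, 1], [0, 0, 1, 0, 1], [1, 0, 1, 1, 0]]
```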
Binary Method and Evaluation Criteria
The Binary Method
An extension of the “one-against-the-rest” method for multi-class classification:
one label positive, all the others negative
Example:
y1 = [1, 0, 0, 0, 1] ⇒ x1 positive for labels 1 and 5, negative for the rest
y2 = [0, 0, 1, 0, 1] ⇒ x2 positive for labels 3 and 5
y3 = [0, 0, 1, 0, 1] ⇒ x3 positive for labels 3 and 5
y4 = [1, 1, 0, 1, 0] ⇒ x4 positive for labels 1, 2, and 4
y5 = [0, 0, 0, 0, 1] ⇒ x5 positive for label 5
y6 = [1, 1, 0, 1, 0] ⇒ x6 positive for labels 1, 2, and 4
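The same decomposition in code (a minimal sketch, plain Python; `one_vs_rest_targets` is a name of my own): each label j yields one two-class problem over all instances.

```python
def one_vs_rest_targets(Y):
    """For each of the d labels, build the +1/-1 target vector of the
    corresponding two-class problem: label j positive, the rest negative."""
    d = len(Y[0])
    return [[+1 if y[j] == 1 else -1 for y in Y] for j in range(d)]

Y = [[1, 0, 0, 0, 1],
     [0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1]]
targets = one_vs_rest_targets(Y)
# targets[0] == [1, -1, -1]: the binary problem for label 1
```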
The Classifier
Need d two-class classifiers; for example, the Support Vector Machine (SVM)
SVM decision function for label j: f_j(x) = w_j^T x + b_j
  f_j(x) > 0: x is relevant to label j
  f_j(x) < 0: otherwise
Can be written as w_j^T x + b_j > T_j with T_j = 0
Output positive ⇔ the inequality holds, negative otherwise
Evaluation Measures
Exact Match Ratio [Kazawa et al., 2005]
direct extension of accuracy in classification:

    \frac{1}{\bar{l}} \sum_{i=1}^{\bar{l}} I[\hat{y}_i = y_i]

\bar{l}: number of testing instances
y_i: true label vector of the ith instance
\hat{y}_i: predicted label vector of the ith instance
I(s) is 1 if s is true, otherwise 0

I\left[ [1, 0, 1]^T = [0, 0, 1]^T \right] = 0: partial matches are not counted
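The exact match ratio is straightforward to compute; a minimal sketch (function name mine):

```python
def exact_match_ratio(Y_true, Y_pred):
    """Fraction of test instances whose predicted label vector equals
    the true one exactly; partial matches count as 0."""
    l = len(Y_true)
    return sum(yt == yp for yt, yp in zip(Y_true, Y_pred)) / l

r = exact_match_ratio([[1, 0, 1], [0, 0, 1]],
                      [[0, 0, 1], [0, 0, 1]])
# r == 0.5: the first vector differs in one position, so it scores 0
```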
Evaluation Measures
Consider partial matches
Macro-average F-measure [Lewis, 1991]
mean of per-label F-measure, d independent parts:

    \frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}

\sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}: true positives of label j
\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}: avoids too many false positives
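A direct implementation of the macro-average F-measure (my own sketch; labels with an empty denominator are scored 0 here, one common convention):

```python
def macro_f(Y_true, Y_pred):
    """Mean of the per-label F-measures over the d labels."""
    l, d = len(Y_true), len(Y_true[0])
    total = 0.0
    for j in range(d):
        tp = sum(Y_pred[i][j] * Y_true[i][j] for i in range(l))
        denom = (sum(Y_pred[i][j] for i in range(l))
                 + sum(Y_true[i][j] for i in range(l)))
        total += 2 * tp / denom if denom > 0 else 0.0  # empty label: F = 0
    return total / d
```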
Evaluation Measures
Micro-average F-measure [Lewis, 1991]
F-measure across all labels, non-separable:

    \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}
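For contrast, the micro-average pools the counts over all labels before forming one F-measure (sketch; function name mine):

```python
def micro_f(Y_true, Y_pred):
    """F-measure computed from counts pooled over all labels;
    unlike the macro-average it does not decompose per label."""
    tp = sum(p * t for yt, yp in zip(Y_true, Y_pred)
             for t, p in zip(yt, yp))
    pred = sum(p for yp in Y_pred for p in yp)   # total predicted positives
    pos = sum(t for yt in Y_true for t in yt)    # total true positives
    return 2 * tp / (pred + pos) if pred + pos else 0.0
```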
Optimize Measures
Predictions \hat{y}: a function of the d decision functions
Measure: a function of the predictions

    m\left( \hat{y}(f_1, f_2, \ldots, f_d) \right)

Maximizing m directly is a difficult global optimization problem
Instead, adjust one function and fix the others, sequentially:

    \max_{f_k} m\left( \hat{y}(f_1, \ldots, f_k, \ldots, f_d) \right)

obtain f_k such that the measure value is non-decreasing
Optimize Measures: Example
Optimize macro-average F-measure
Adjust one function and fix others sequentially
    \max_{f_k} \frac{1}{d} \left( \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ki} y_{ki}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ki} + \sum_{i=1}^{\bar{l}} y_{ki}} + \text{terms not related to } f_k \right)

obtain f_k to improve the macro-average
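The sequential scheme could be sketched as a generic coordinate-ascent loop (all names are mine; `adjust` stands in for whatever per-function update is used, e.g. threshold tuning):

```python
def coordinate_ascent(fs, measure, adjust, sweeps=3):
    """Adjust one decision function at a time while fixing the others,
    accepting a move only if the measure does not decrease."""
    for _ in range(sweeps):
        for k in range(len(fs)):
            candidate = adjust(fs, k)
            if measure(candidate) >= measure(fs):
                fs = candidate
    return fs

# Toy check: "functions" are numbers, the measure rewards small magnitude,
# and adjust halves component k; the measure never decreases along the way.
best = coordinate_ascent(
    fs=[4.0, -2.0],
    measure=lambda fs: -sum(x * x for x in fs),
    adjust=lambda fs, k: fs[:k] + [fs[k] / 2] + fs[k + 1:],
)
# best == [0.5, -0.25]
```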
Supervised Threshold Setting
Motivations from [Lewis et al., 2004]
Unbalanced data distribution among labels
RCV1-V2: 101 labels, 96 labels with ≤ 10% instances
SVM.1 predicts 4,000 more positives (average) per label
           Micro-avg. F   Macro-avg. F
Binary     0.809          0.534
SVM.1      0.816          0.619

           Sum of TP    Sum of FP   Sum of FN   Sum of Predicted
Binary     1,877,239    229,687     655,939     2,106,926
SVM.1      2,034,167    420,826     499,011     2,454,993

Adjust the decision threshold T_j in the SVM prediction: f_j(x) = w_j^T x + b_j > T_j
Simple Threshold Tuning
Result of binary method
Macro-average F-measure: 0.333
label 1              label 2
y1    dec. value     y2    dec. value
1      1.2           0      1.2
1      1.1           0      1.1
0     -1.1           0     -1.1
1     -1.2           0     -1.2
0     -1.3           0     -1.4
1     -1.4           0     -1.5
0     -1.5           0     -1.6
0     -1.6           1     -1.7
0     -1.7           0     -1.8
0     -1.8           0     -1.9

T1 = 0, T2 = 0
Simple Threshold Tuning
Adjust threshold T1 of label 1
Macro-average F-measure: 0.4
T1 = −1.45 T2 = 0
Simple Threshold Tuning
Adjust threshold T2 of label 2
Macro-average F-measure: 0.511
T1 = −1.45 T2 = −1.75
Simple Threshold Tuning
Improve macro-average F-measure: 0.333 → 0.511
Five-fold cross validation for better thresholds: threshold T_j = average of the five thresholds
Tuning thresholds significantly improves the macro-average, but a few cases are still worse than SVM.1
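The per-label tuning just illustrated can be sketched as a scan over candidate thresholds, the midpoints between consecutive sorted decision values (a minimal sketch; function names are mine):

```python
def per_label_f(y, dec, T):
    """F-measure of label predictions 'dec > T' against true labels y."""
    tp = sum(t for v, t in zip(dec, y) if v > T)
    pred = sum(1 for v in dec if v > T)
    pos = sum(y)
    return 2 * tp / (pred + pos) if pred + pos else 0.0

def tune_threshold(y, dec):
    """Pick the threshold maximizing the per-label F-measure among
    midpoints of consecutive sorted decision values (plus the two ends)."""
    vals = sorted(dec, reverse=True)
    cands = ([vals[0] + 1.0]
             + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
             + [vals[-1] - 1.0])
    return max(cands, key=lambda T: per_label_f(y, dec, T))

# Label 1 from the slides: tuning moves the threshold to (essentially)
# -1.45, between decision values -1.4 and -1.5, giving F = 0.8.
y1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
dec1 = [1.2, 1.1, -1.1, -1.2, -1.3, -1.4, -1.5, -1.6, -1.7, -1.8]
T1 = tune_threshold(y1, dec1)
```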
Example: Tuning Threshold is Not Enough
OHSUMED: 94 labels, 5,000 training / 5,000 testing instances
Macro-average / micro-average F-measure:
  SVM.1: 0.234 / 0.496
  Simple Threshold Tuning: 0.224 / 0.469
Improperly tuned thresholds when positives are few (label 52: 12 positives; label 57: 4)

                          Sum of TP   Sum of FP   Sum of FN
SVM.1                     7,343       9,141       5,804
Simple Threshold Tuning   7,365       10,895      5,782

                          label 52            label 57
                          TP   FP      FN     TP   FP   FN
SVM.1                     0    3       12     2    3    3
Simple Threshold Tuning   9    1,084   3      0    0    5
The fbr heuristic [Yang, 2001]
With few positive examples, the tuned threshold is easily biased and generalizes poorly: too low (many FP) or too high (many FN)
Remedy: if F-measure < fbr value, set threshold to the highest decision value
Purpose: adjust the way of predicting positive instances
Limited experiments in [Yang, 2001], a comprehensive study in this work
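The heuristic itself is one line on top of threshold tuning (a sketch; names are mine, and the "highest decision value" fallback follows the remedy stated above):

```python
def fbr_adjust(T, best_f, dec, fbr):
    """If the best validation F-measure for this label is below the fbr
    value, distrust the tuned threshold T and fall back to the highest
    decision value, so the label predicts positives only conservatively."""
    return max(dec) if best_f < fbr else T

# Label 2 from the slides: F = 0.4 < fbr = 0.5, so T2 moves from -1.75
# to the highest decision value 1.2.
dec2 = [1.2, 1.1, -1.1, -1.2, -1.4, -1.5, -1.6, -1.7, -1.8, -1.9]
T2 = fbr_adjust(-1.75, 0.4, dec2, fbr=0.5)
# T2 == 1.2
```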
Simple Threshold Tuning with fbr = 0.5
Label 1 F-measure = 0.8 > 0.5
T1 = −1.45 T2 = 0
Simple Threshold Tuning with fbr = 0.5
Label 2 F-measure = 0.4 < 0.5
T1 = −1.45 T2 = −1.75
Simple Threshold Tuning with fbr = 0.5
Label 2 F-measure = 0.4 < 0.5
Set threshold T2 to the highest decision value
T1 = −1.45 T2 = 1.2
Simple Threshold Tuning with fbr = 0.5
Testing macro-average: 0.633 (without fbr: 0.4)
Testing micro-average: 0.615 (without fbr: 0.4)
label 1              label 2
y1    dec. value     y2    dec. value
0      1.2           0      1.4
1      1.1           1      1.3
0     -1.1           0      1.2
0     -1.2           0      1.1
1     -1.3           0     -1.1
1     -1.4           0     -1.2
1     -1.5           0     -1.4
0     -1.6           0     -1.6
0     -1.7           0     -1.7
0     -1.8           0     -1.8

T1 = -1.45; T2 = 1.2 (with fbr) vs. T2 = -1.75 (without)
Simple Threshold Tuning with fbr = 0.5
Testing performance
macro-average F-measure: 0.633 (was 0.4)
micro-average F-measure: 0.615 (was 0.4)
Without fbr, the thresholds overfit the measure on the validation set; fbr gives better generalization
The SVM.1 Method [Lewis et al., 2004]
Simple Threshold Tuning with the fbr heuristic; the fbr value is selected by five-fold cross validation
eight fbr candidates: 0.1 to 0.8
Two-level cross validation
Outer: select the best fbr value
Inner: choose the threshold
Train 5 × 5 = 25 more classifiers
Time consuming
Experiments
Three parts, each targeting one measure:
Macro-average F-measure
Micro-average F-measure
Exact match ratio
Optimize measure with SVM.1-type methods
Algo. 1: Simple Threshold Tuning
Algo. 2: Simple Threshold Tuning with fbr = 0.1
Algo. 3: the SVM.1 Method
Examine effects of fbr heuristic
Experiments: Data Sets
scene: semantic scene classification
yeast: yeast gene microarray data
OHSUMED: medical literature documents
Yahoo!: web pages from yahoo.com; six sets: Arts (Ar), Business (Bu), Computers (Co), Education (Ed), Entertainment (En), Health (He)
RCV1-V2: newswire stories from Reuters
Optimize Macro-average F-measure
macro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.653    0.766     0.766     0.759
yeast      0.405    0.504     0.504     0.506
OHSUMED    0.117    0.221     0.244     0.236
Ar         0.304    0.401     0.408     0.403
Bu         0.281    0.384     0.383     0.383
Co         0.292    0.382     0.390     0.382
Ed         0.255    0.311     0.341     0.335
En         0.410    0.493     0.492     0.492
He         0.337    0.391     0.418     0.410
RCV1-V2    0.378    0.516     0.536     0.528

micro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.652    0.756     0.756     0.750
yeast      0.651    0.651     0.643     0.633
OHSUMED    0.452    0.481     0.487     0.494
Ar         0.484    0.528     0.529     0.532
Bu         0.774    0.768     0.771     0.772
Co         0.588    0.606     0.608     0.609
Ed         0.507    0.532     0.546     0.543
En         0.624    0.659     0.658     0.658
He         0.697    0.719     0.715     0.714
RCV1-V2    0.744    0.756     0.752     0.761
Tuning threshold significantly helps:
Algo. 1 > Binary
Macro-average further improved with fbr : Algo. 2 and 3 > Algo. 1
Why better macro-average with fbr ?
Threshold too high with Algo. 1
Macro-average in one run of OHSUMED: 0.224 → 0.242 (Algo. 1 vs. Algo. 2)

                           Algo. 2                  threshold
label   training pos. #    TP   FP   FN    F     Algo. 1   Algo. 2
16      2                  1    2    1    0.4    0.70      0.28
32      1                  1    8    0    0.2    0.84      0.11
35      5                  1    0    1    0.6    0.87      0.32
57      4                  2    2    3    0.4    0.78      0.31

Algo. 1 predicts no positives for these labels; the thresholds selected by Algo. 2 are lower
Why better macro-average with fbr ?
(cont’d) With few positives (≤ 5) in training, most decision values are below the threshold
The validation set may contain no positives, so Algo. 1 cannot find a better threshold and the threshold remains high
With fbr, the threshold is decreased to the highest decision value
Why better micro-average with fbr ?
Threshold too low with Algo. 1
Micro-average in one run of OHSUMED: 0.469 → 0.496 (Algo. 1 vs. Algo. 2)
Algo. 1 overfits the macro-average on the validation set

          Sum of TP   Sum of FP   Sum of FN
Algo. 1   7,365       10,895      5,782
Algo. 2   7,360       19,405      5,787

label 65: 28/27 positive instances
Best F-measure in training: 0.0107 < 0.1
Algo. 1 predicts 358 positives, only 3 of them correct; Algo. 2 predicts no positives
Optimize Micro-avg. & Exact Match Ratio
Details not shown here
Tuning threshold generally helps, but improvements not as big as when we optimize macro-average
fbr does not further help the target measure
Discussions and Conclusions
Adjusting decision thresholds is useful (for all measures)
fbr heuristic increases the macro-average F-measure, which is strongly related to the per-label F-measure
fbr does not help when we optimize the other two measures
Macro-average F-measure: easier to improve
Measures calculated across all labels: harder to optimize
macro-avg. F-measure (Binary value; Algo. 1 / Algo. 2 improvement in %)
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.653    17.4          17.4
yeast      0.405    24.3          24.2
OHSUMED    0.117    89.0          109.4
Ar         0.304    32.2          34.5
Bu         0.281    36.5          36.1
Co         0.292    30.6          33.3
Ed         0.255    22.3          33.9
En         0.410    20.2          19.9
He         0.337    15.9          24.0
RCV1-V2    0.378    36.6          41.8

micro-avg. F-measure
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.652    16.6          16.4
yeast      0.651    4.6           4.7
OHSUMED    0.452    14.6          13.9
Ar         0.484    11.0          11.0
Bu         0.774    1.2           0.9
Co         0.588    6.3           5.9
Ed         0.507    8.7           8.4
En         0.624    5.8           5.7
He         0.697    3.5           3.0
RCV1-V2    0.744    3.9           3.5

exact match ratio
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.490    22.7          22.4
yeast      0.206    4.4           5.7
OHSUMED    0.160    11.0          8.8
Ar         0.320    7.9           7.7
Bu         0.617    0.4           -0.2
Co         0.443    4.5           3.9
Ed         0.335    8.6           7.4
En         0.464    8.3           7.9
He         0.517    2.0           0.1
RCV1-V2    0.469    4.2           3.2
Macro-average F-measure: easier to improve
mean of per-label F-measure, d independent parts:

    \frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}

Measures calculated across all labels: harder to optimize
e.g., the micro-average F-measure, computed across all labels and non-separable:

    \frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}