(1)

Evaluation Criteria for Multi-label Classification

Rong-En Fan

Department of Computer Science, National Taiwan University

(2)

Outline

Introduction

Binary Method and Evaluation Criteria

Supervised Threshold Setting

Discussions and Conclusions

(3)

Introduction

Outline

Introduction

Binary Method and Evaluation Criteria

Supervised Threshold Setting

Discussions and Conclusions

(4)

Introduction

Classification

Two-class problem

Each instance is associated with either a positive or a negative label

Multi-class problem

Each instance is associated with exactly one label

Multi-label problem

Each instance is associated with several labels; e.g., a news story may belong to both sports and politics

(5)

Introduction

Representation for Multiple Labels

Given instances $x_1, \ldots, x_l \in \mathbb{R}^n$ and $d$ labels $1, \ldots, d$

The binary vector for the label subset of $x_i$ is

$$y_i = [y_{1i}, \ldots, y_{di}] \in \{0, 1\}^d,$$

where $y_{ji} = 1 \Leftrightarrow x_i$ is associated with label $j$.

y1 = [0, 0, 0, 0, 1]

y2 = [0, 0, 1, 0, 1]

y3 = [1, 0, 1, 1, 0]
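A minimal companion sketch in Python (not part of the original deck): build this binary label matrix from per-instance label subsets; the subsets here are hypothetical, chosen to reproduce y1, y2, y3 above.

```python
import numpy as np

d = 5  # number of labels
# Hypothetical label subsets for x1, x2, x3 (labels indexed 1..d).
label_subsets = [{5}, {3, 5}, {1, 3, 4}]

# Y[i, j-1] = 1  <=>  instance i is associated with label j
Y = np.zeros((len(label_subsets), d), dtype=int)
for i, subset in enumerate(label_subsets):
    for j in subset:
        Y[i, j - 1] = 1

print(Y)
# [[0 0 0 0 1]
#  [0 0 1 0 1]
#  [1 0 1 1 0]]
```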

(6)

Binary Method and Evaluation Criteria

Outline

Introduction

Binary Method and Evaluation Criteria

Supervised Threshold Setting

Discussions and Conclusions

(7)

Binary Method and Evaluation Criteria

The Binary Method

An extension of the “one-against-the-rest” method for multi-class classification

One label positive, the others negative

Example (six instances, five binary problems; xi is positive in problem j iff yji = 1):

y1 = [1, 0, 0, 0, 1] ⇒ x1 positive in problems 1, 5
y2 = [0, 0, 1, 0, 1] ⇒ x2 positive in problems 3, 5
y3 = [0, 0, 1, 0, 1] ⇒ x3 positive in problems 3, 5
y4 = [1, 1, 0, 1, 0] ⇒ x4 positive in problems 1, 2, 4
y5 = [0, 0, 0, 0, 1] ⇒ x5 positive in problem 5
y6 = [1, 1, 0, 1, 0] ⇒ x6 positive in problems 1, 2, 4

(8)

Binary Method and Evaluation Criteria

The Classifier

Need d two-class classifiers, for example, Support Vector Machines (SVMs)

SVM decision function for label j:

$$w_j^T x + b_j \begin{cases} > 0, & x \text{ is relevant to label } j \\ < 0, & \text{otherwise} \end{cases}$$

This can be written as

$$w_j^T x + b_j > T_j$$

with $T_j = 0$. Output positive ⇔ the inequality holds, negative otherwise.
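As an illustration, not the deck's actual implementation, here is a hedged sketch of the binary method with one linear SVM per label via scikit-learn; the helper names and the choice of LinearSVC are assumptions, and each label column is assumed to contain both positive and negative training instances.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_binary_method(X, Y):
    """One two-class SVM per label; column j of Y is the 0/1 target."""
    return [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def predict_labels(models, X_test, T=None):
    """Output label j positive iff its decision value exceeds T[j] (default 0)."""
    dec = np.column_stack([m.decision_function(X_test) for m in models])
    if T is None:
        T = np.zeros(dec.shape[1])
    return (dec > T).astype(int)  # thresholds broadcast across test instances
```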

(9)

Binary Method and Evaluation Criteria

Evaluation Measures

Exact Match Ratio [Kazawa et al., 2005]

A direct extension of accuracy in classification:

$$\frac{1}{\bar{l}} \sum_{i=1}^{\bar{l}} I[\hat{y}_i = y_i]$$

$\bar{l}$: number of testing instances
$y_i$: true label vector of the $i$th instance
$\hat{y}_i$: predicted label vector of the $i$th instance
$I(s)$: 1 if $s$ is true, otherwise 0

$I\big[[1, 0, 1]^T = [0, 0, 1]^T\big] = 0$: partial matches are not counted
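A minimal sketch (not from the slides) of this criterion on binary matrices Y_true and Y_pred of shape ($\bar{l}$, d):

```python
import numpy as np

def exact_match_ratio(Y_true, Y_pred):
    """Fraction of instances whose whole predicted vector equals the truth."""
    # I[y_hat_i = y_i] is 1 only when every label of instance i matches.
    return np.mean(np.all(Y_pred == Y_true, axis=1))

# Partial matches are not counted, as in the example above:
print(exact_match_ratio(np.array([[0, 0, 1]]), np.array([[1, 0, 1]])))  # 0.0
```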

(10)

Binary Method and Evaluation Criteria

Evaluation Measures

Consider partial matches

Macro-average F-measure [Lewis, 1991]

Mean of per-label F-measures; d independent parts:

$$\frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}$$

$\sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}$: true positives of label $j$

$\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}$ in the denominator: avoids too many false positives

(11)

Binary Method and Evaluation Criteria

Evaluation Measures

Micro-average F-measure [Lewis, 1991]

F-measure computed across all labels; not separable by label:

$$\frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}$$
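A minimal sketch (not from the slides) computing both F-measure variants from the binary matrices, matching the two formulas above; treating a label with an empty denominator as F = 0 is an assumed convention.

```python
import numpy as np

def macro_f(Y_true, Y_pred):
    """Mean of per-label F-measures (d independent parts)."""
    tp = (Y_pred * Y_true).sum(axis=0)              # true positives per label
    denom = Y_pred.sum(axis=0) + Y_true.sum(axis=0)
    # Assumed convention: F = 0 for a label with an empty denominator.
    f = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f.mean()

def micro_f(Y_true, Y_pred):
    """F-measure with sums over all labels at once (not separable by label)."""
    tp = (Y_pred * Y_true).sum()
    return 2 * tp / (Y_pred.sum() + Y_true.sum())
```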

(12)

Binary Method and Evaluation Criteria

Optimize Measures

Predictions $\hat{y}$: a function of the $d$ decision functions

A measure: a function of the predictions,

$$m\big(\hat{y}(f_1, f_2, \cdots, f_d)\big)$$

Maximizing $m$ directly is a difficult global optimization problem

Instead, adjust one function and fix the others, sequentially:

$$\max_{f_k} \; m\big(\hat{y}(f_1, \cdots, f_k, \cdots, f_d)\big)$$

Obtain an $f_k$ such that the measure value is non-decreasing
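A hedged sketch of this sequential scheme, specialized to adjusting per-label thresholds (the concrete instance developed in the next section). The function name and the candidate-threshold rule are assumptions; `measure` stands for any criterion above, e.g. macro_f or micro_f from the previous sketch.

```python
import numpy as np

def sequential_tune(dec, Y_true, measure, T_init):
    """dec: decision values on validation data, shape (l_bar, d)."""
    T = T_init.copy()
    for k in range(dec.shape[1]):                  # adjust f_k, fix the others
        vals = np.sort(np.unique(dec[:, k]))
        # candidate thresholds: midpoints, plus one above the maximum
        cands = np.concatenate([(vals[:-1] + vals[1:]) / 2, [vals[-1] + 1.0]])
        best = measure(Y_true, (dec > T).astype(int))
        for t in cands:
            T_try = T.copy()
            T_try[k] = t
            v = measure(Y_true, (dec > T_try).astype(int))
            if v > best:                           # measure value never decreases
                best, T = v, T_try
    return T
```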

(13)

Binary Method and Evaluation Criteria

Optimize Measures: Example

Optimize macro-average F-measure

Adjust one function and fix others sequentially

$$\max_{f_k} \; \frac{1}{d} \left( \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ki} y_{ki}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ki} + \sum_{i=1}^{\bar{l}} y_{ki}} + \text{terms not related to } f_k \right)$$

Obtain an $f_k$ that improves the macro-average

(14)

Supervised Threshold Setting

Outline

Introduction

Binary Method and Evaluation Criteria

Supervised Threshold Setting

Discussions and Conclusions

(15)

Supervised Threshold Setting

Motivations from [Lewis et al., 2004]

Unbalanced data distribution among labels

RCV1-V2: 101 labels; 96 labels each cover at most 10% of the instances

SVM.1 predicts around 4,000 more positives per label on average

          Micro-avg. F   Macro-avg. F
Binary    0.809          0.534
SVM.1     0.816          0.619

          Sum of TP    Sum of FP   Sum of FN   Sum of Predicted
Binary    1,877,239    229,687     655,939     2,106,926
SVM.1     2,034,167    420,826     499,011     2,454,993

Remedy: adjust the decision threshold $T_j$ in the SVM prediction $f_j(x) = w_j^T x + b_j > T_j$

(16)

Supervised Threshold Setting

Simple Threshold Tuning

Result of binary method

Macro-average F-measure: 0.333

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = 0   T2 = 0

(17)

Supervised Threshold Setting

Simple Threshold Tuning

Adjust threshold T1 of label 1

Macro-average F-measure: 0.4

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = −1.45   T2 = 0

(18)

Supervised Threshold Setting

Simple Threshold Tuning

Adjust threshold T2 of label 2

Macro-average F-measure: 0.511

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = −1.45   T2 = −1.75

(19)

Supervised Threshold Setting

Simple Threshold Tuning

Improves the macro-average F-measure: 0.333 → 0.511

Five-fold cross validation for better thresholds: the final threshold Tj is the average of the five per-fold thresholds

Tuning thresholds significantly improves the macro-average, but a few cases are still worse than SVM.1
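Because the macro-average decomposes over labels, each threshold can be tuned independently on held-out decision values. A minimal sketch (the helper name is hypothetical) that reproduces T1 = −1.45 from the slides:

```python
import numpy as np

def tune_one_threshold(dec_k, y_k):
    """Sweep candidate thresholds for one label; return the F-optimal one."""
    vals = np.sort(np.unique(dec_k))
    # midpoints between consecutive decision values, plus "predict nothing"
    cands = np.concatenate([(vals[:-1] + vals[1:]) / 2, [vals[-1] + 1.0]])
    def f_at(t):
        pred = dec_k > t
        denom = pred.sum() + y_k.sum()
        return 2 * (pred & (y_k == 1)).sum() / denom if denom else 0.0
    best_f, best_t = max((f_at(t), t) for t in cands)
    return best_t, best_f

# Label 1 from the slides:
dec1 = np.array([1.2, 1.1, -1.1, -1.2, -1.3, -1.4, -1.5, -1.6, -1.7, -1.8])
y1 = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
print(tune_one_threshold(dec1, y1))  # ≈ (-1.45, 0.8), as on the slide
```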

(20)

Supervised Threshold Setting

Example: Tuning Threshold is Not Enough

OHSUMED: 94 labels, 5,000 training / 5,000 testing instances

Macro-average / micro-average F-measure:
SVM.1: 0.234 / 0.496
Simple Threshold Tuning: 0.224 / 0.469

Improperly tuned thresholds when positives are few (training positives: label 52 has 12, label 57 has 4)

                          Sum of TP   Sum of FP   Sum of FN
SVM.1                     7,343       9,141       5,804
Simple Threshold Tuning   7,365       10,895      5,782

                          label 52            label 57
                          TP   FP      FN     TP   FP   FN
SVM.1                     0    3       12     2    3    3
Simple Threshold Tuning   9    1,084   3      0    0    5

(21)

Supervised Threshold Setting

The fbr heuristic [Yang, 2001]

Tuning thresholds with few positive examples easily gets biased, hurting generalization: the threshold ends up too low (many FP) or too high (many FN)

Remedy: if a label’s F-measure < the fbr value, set its threshold to the highest decision value

Purpose: adjust the way positive instances are predicted

Only limited experiments in [Yang, 2001]; a comprehensive study in this work
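A minimal sketch of the heuristic as described on this slide, reusing the hypothetical tune_one_threshold helper from the earlier sketch:

```python
def tune_with_fbr(dec_k, y_k, fbr=0.5):
    """Per-label threshold tuning plus the fbr back-off of [Yang, 2001]."""
    t, best_f = tune_one_threshold(dec_k, y_k)
    if best_f < fbr:
        # Back off to the highest decision value: with the strict '>' test,
        # no validation instance is predicted positive, while test instances
        # with even higher decision values still can be.
        t = dec_k.max()
    return t
```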

(22)

Supervised Threshold Setting

Simple Threshold Tuning with fbr = 0.5

Label 1 F-measure = 0.8 > 0.5

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = −1.45   T2 = 0

(23)

Supervised Threshold Setting

Simple Threshold Tuning with fbr = 0.5

Label 2 F-measure = 0.4 < 0.5

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = −1.45   T2 = −1.75

(24)

Supervised Threshold Setting

Simple Threshold Tuning with fbr = 0.5

Label 2 F-measure = 0.4 < 0.5

Set threshold T2 to the highest decision value

label 1            label 2
y1   dec. value    y2   dec. value
1     1.2          0     1.2
1     1.1          0     1.1
0    -1.1          0    -1.1
1    -1.2          0    -1.2
0    -1.3          0    -1.4
1    -1.4          0    -1.5
0    -1.5          0    -1.6
0    -1.6          1    -1.7
0    -1.7          0    -1.8
0    -1.8          0    -1.9

T1 = −1.45   T2 = 1.2

(25)

Supervised Threshold Setting

Simple Threshold Tuning with fbr = 0.5

Testing macro-average: 0.633 (without fbr: 0.4)
Testing micro-average: 0.615 (without fbr: 0.4)

Testing data:

label 1            label 2
y1   dec. value    y2   dec. value
0     1.2          0     1.4
1     1.1          1     1.3
0    -1.1          0     1.2
0    -1.2          0     1.1
1    -1.3          0    -1.1
1    -1.4          0    -1.2
1    -1.5          0    -1.4
0    -1.6          0    -1.6
0    -1.7          0    -1.7
0    -1.8          0    -1.8

T1 = −1.45   T2 = 1.2 (with fbr); T2 = −1.75 (without fbr)

(26)

Supervised Threshold Setting

Simple Threshold Tuning with fbr = 0.5

Testing performance:

macro-average F-measure: 0.633 (was 0.4)
micro-average F-measure: 0.615 (was 0.4)

Without fbr, the threshold overfits the measure on the validation set; with fbr, generalization is better

(27)

Supervised Threshold Setting

The SVM.1 Method [Lewis et al., 2004]

Simple Threshold Tuning combined with the fbr heuristic; the fbr value is selected by five-fold cross validation

Eight fbr candidates: 0.1 to 0.8

Two-level cross validation:

Outer: select the best fbr value
Inner: choose the thresholds

Trains 5 × 5 = 25 more classifiers: time consuming
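A hedged sketch of the two-level selection structure, reusing the hypothetical tune_with_fbr helper and a measure such as macro_f from the earlier sketches. For brevity it folds pre-computed validation decision values instead of retraining the d SVMs in every outer fold; the actual procedure retrains them, which is where the 5 × 5 = 25 extra trainings come from.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_fbr(dec, Y, measure, candidates=np.arange(1, 9) / 10):
    """Outer 5-fold CV over decision values to pick the fbr value."""
    scores = {c: [] for c in candidates}
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(dec):
        for c in candidates:
            # Inner step: tune every label's threshold on the training part.
            T = np.array([tune_with_fbr(dec[tr, j], Y[tr, j], fbr=c)
                          for j in range(Y.shape[1])])
            scores[c].append(measure(Y[va], (dec[va] > T).astype(int)))
    return max(candidates, key=lambda c: np.mean(scores[c]))
```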

(28)

Supervised Threshold Setting

Experiments

Three parts, each targeting one measure:

Macro-average F-measure
Micro-average F-measure
Exact match ratio

(29)

Supervised Threshold Setting

Experiments

Three parts, each targeting one measure:

Macro-average F-measure
Micro-average F-measure
Exact match ratio

Optimize each measure with SVM.1-type methods:

Algo. 1: Simple Threshold Tuning
Algo. 2: Simple Threshold Tuning with fbr = 0.1
Algo. 3: the SVM.1 method

(30)

Supervised Threshold Setting

Experiments

Three parts, each targeting one measure:

Macro-average F-measure
Micro-average F-measure
Exact match ratio

Optimize each measure with SVM.1-type methods:

Algo. 1: Simple Threshold Tuning
Algo. 2: Simple Threshold Tuning with fbr = 0.1
Algo. 3: the SVM.1 method

Examine the effects of the fbr heuristic

(31)

Supervised Threshold Setting

Experiments: Data Sets

scene: semantic scene classification

yeast: yeast gene microarray data

OHSUMED: medical literature documents

Yahoo!: web pages from yahoo.com; six sets: Arts (Ar), Business (Bu), Computers (Co), Education (Ed), Entertainment (En), Health (He)

RCV1-V2: newswire stories from Reuters

(32)

Supervised Threshold Setting

Optimize Macro-average F-measure

macro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.653    0.766     0.766     0.759
yeast      0.405    0.504     0.504     0.506
OHSUMED    0.117    0.221     0.244     0.236
Ar         0.304    0.401     0.408     0.403
Bu         0.281    0.384     0.383     0.383
Co         0.292    0.382     0.390     0.382
Ed         0.255    0.311     0.341     0.335
En         0.410    0.493     0.492     0.492
He         0.337    0.391     0.418     0.410
RCV1-V2    0.378    0.516     0.536     0.528

micro-average F-measure
Data set   Binary   Algo. 1   Algo. 2   Algo. 3
scene      0.652    0.756     0.756     0.750
yeast      0.651    0.651     0.643     0.633
OHSUMED    0.452    0.481     0.487     0.494
Ar         0.484    0.528     0.529     0.532
Bu         0.774    0.768     0.771     0.772
Co         0.588    0.606     0.608     0.609
Ed         0.507    0.532     0.546     0.543
En         0.624    0.659     0.658     0.658
He         0.697    0.719     0.715     0.714
RCV1-V2    0.744    0.756     0.752     0.761

Tuning thresholds significantly helps: Algo. 1 > Binary

Macro-average is further improved with fbr: Algo. 2 and Algo. 3 > Algo. 1

(33)

Supervised Threshold Setting

Why better macro-average with fbr?

Thresholds are too high with Algo. 1

Macro-average in one run of OHSUMED: 0.224 → 0.242 (Algo. 1 → Algo. 2)

                          Algo. 2                  threshold
label   training pos. #   TP   FP   FN   F      Algo. 1   Algo. 2
16      2                 1    2    1    0.4    0.70      0.28
32      1                 1    8    0    0.2    0.84      0.11
35      5                 1    0    1    0.6    0.87      0.32
57      4                 2    2    3    0.4    0.78      0.31

Algo. 1 predicts no positives for these labels; the thresholds selected by Algo. 2 are lower

(34)

Supervised Threshold Setting

Why better macro-average with fbr? (cont’d)

With few positives (≤ 5) in training, most decision values fall below the threshold

The validation set may then contain no positives, so Algo. 1 cannot find a better threshold; the threshold remains high

With fbr, the threshold is decreased to the highest decision value

(35)

Supervised Threshold Setting

Why better micro-average with fbr?

Thresholds are too low with Algo. 1

Micro-average in one run of OHSUMED: 0.469 → 0.496 (Algo. 1 → Algo. 2)

Algo. 1 overfits the macro-average on the validation set

          Sum of TP   Sum of FP   Sum of FN
Algo. 1   7,365       10,895      5,782
Algo. 2   7,360       19,405      5,787

label 65: 28/27 positive instances

Best F-measure in training: 0.0107 < 0.1

Algo. 1 predicts 358 positives, only 3 of them correct; Algo. 2 predicts no positives

(36)

Supervised Threshold Setting

Optimize Micro-avg. & Exact Match Ratio

Details not shown here

Tuning thresholds generally helps, but the improvements are not as big as when we optimize the macro-average

fbr does not further help the target measure

(37)

Discussions and Conclusions

Outline

Introduction

Binary Method and Evaluation Criteria

Supervised Threshold Setting

Discussions and Conclusions

(38)

Discussions and Conclusions

Discussions and Conclusions

Adjusting decision thresholds is useful, for all three measures

The fbr heuristic increases the macro-average F-measure, which is strongly related to the per-label F-measures

fbr does not help when we optimize the other two measures

(39)

Discussions and Conclusions

Discussions and Conclusions

Macro-average F-measure: easier to improve

Measures calculated across all labels: harder to optimize

Algo. 1 and Algo. 2 columns show relative improvement over Binary (%):

macro-avg. F-measure
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.653    17.4          17.4
yeast      0.405    24.3          24.2
OHSUMED    0.117    89.0          109.4
Ar         0.304    32.2          34.5
Bu         0.281    36.5          36.1
Co         0.292    30.6          33.3
Ed         0.255    22.3          33.9
En         0.410    20.2          19.9
He         0.337    15.9          24.0
RCV1-V2    0.378    36.6          41.8

micro-avg. F-measure
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.652    16.6          16.4
yeast      0.651    4.6           4.7
OHSUMED    0.452    14.6          13.9
Ar         0.484    11.0          11.0
Bu         0.774    1.2           0.9
Co         0.588    6.3           5.9
Ed         0.507    8.7           8.4
En         0.624    5.8           5.7
He         0.697    3.5           3.0
RCV1-V2    0.744    3.9           3.5

exact match ratio
Data set   Binary   Algo. 1 (%)   Algo. 2 (%)
scene      0.490    22.7          22.4
yeast      0.206    4.4           5.7
OHSUMED    0.160    11.0          8.8
Ar         0.320    7.9           7.7
Bu         0.617    0.4           -0.2
Co         0.443    4.5           3.9
Ed         0.335    8.6           7.4
En         0.464    8.3           7.9
He         0.517    2.0           0.1
RCV1-V2    0.469    4.2           3.2

(40)

Discussions and Conclusions

Discussions and Conclusions

Macro-average F-measure: easier to improve; it is the mean of per-label F-measures, $d$ independent parts:

$$\frac{1}{d} \sum_{j=1}^{d} \frac{2 \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{i=1}^{\bar{l}} y_{ji}}$$

Measures calculated across all labels are harder to optimize, e.g., the micro-average F-measure, which is not separable by label:

$$\frac{2 \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} y_{ji}}{\sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} \hat{y}_{ji} + \sum_{j=1}^{d} \sum_{i=1}^{\bar{l}} y_{ji}}$$
