
(1)

Label Space Coding for Multi-label Classification

Hsuan-Tien Lin, National Taiwan University

RIKEN AIP Center, 08/30/2019

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Yao-Nan Chen (NeurIPS Conference 2012) &

Kuan-Hao Huang (ECML Conference ML Journal Track 2017)

(2)

Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification:

classify input to multiple (or no) categories

(3)

What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

another multi-label classification problem:

tagging input to multiple categories

(4)

Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach:

transformation to multiple isolated binary classification tasks

disadvantages:

isolation: hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)

unbalanced: few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
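As a concrete baseline, a minimal Binary Relevance sketch, assuming scikit-learn, an N x d feature matrix X, and an N x L 0/1 label matrix Y; names and hyper-parameters are illustrative:

```python
# Minimal Binary Relevance sketch: one isolated binary classifier per label.
# Assumes each label column of Y contains both 0s and 1s.
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    models = []
    for l in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, l])        # relations between label columns are ignored
        models.append(clf)
    return models

def binary_relevance_predict(models, X):
    return np.column_stack([m.predict(X) for m in models])
```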

(5)

Multi-label Classification Setup

Given

N examples (input xn, label-set Yn) ∈ X × 2^{1,2,··· ,L}

fruits: X = encoding(pictures), Yn ⊆ {1, 2, · · · , 4}

tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen inputs x (by exploiting hidden relations/combinations between labels)

Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

multi-label classification: hot and important

(6)

From Label-set to Coding View

label set   | apple | orange | strawberry | binary code
Y1 = {o}    | 0 (N) | 1 (Y)  | 0 (N)      | y1 = [0, 1, 0]
Y2 = {a, o} | 1 (Y) | 1 (Y)  | 0 (N)      | y2 = [1, 1, 0]
Y3 = {a, s} | 1 (Y) | 0 (N)  | 1 (Y)      | y3 = [1, 0, 1]
Y4 = {o}    | 0 (N) | 1 (Y)  | 0 (N)      | y4 = [0, 1, 0]
Y5 = {}     | 0 (N) | 0 (N)  | 0 (N)      | y5 = [0, 0, 0]

subset Y of 2^{1,2,··· ,L} ⇔ length-L binary code y
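A tiny helper mirroring the table above (illustrative names, not from the slides):

```python
# Coding view: a label set becomes a length-L 0/1 vector.
def to_code(label_set, L):
    return [1 if label in label_set else 0 for label in range(1, L + 1)]

# with labels 1 = apple, 2 = orange, 3 = strawberry:
# to_code({2}, 3)    -> [0, 1, 0]   (Y1 = {o})
# to_code({1, 3}, 3) -> [1, 0, 1]   (Y3 = {a, s})
```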

(7)

Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p1, p2, · · · , pM}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M by L random matrix P = [p1, p2, · · · , pM]^T

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:

efficient in training: random projection w/ M ≪ L

inefficient in testing: time-consuming decoding

better projection? faster decoding?
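A rough sketch of the compress and learn steps above, assuming scikit-learn; the slow decode step (closest sparse binary vector, an ℓ1-style search) is deliberately left out, and names are illustrative:

```python
# Compressive-sensing style compress + learn: random projection, then ridge regression.
import numpy as np
from sklearn.linear_model import Ridge

def cs_train(X, Y, M, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((M, Y.shape[1]))   # M x L random matrix P
    C = Y @ P.T                                # codewords c_n = P y_n (as rows)
    reg = Ridge(alpha=1.0).fit(X, C)           # regression r(x): features -> codewords
    return P, reg
```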

(8)

Our Contributions

Compression Coding &

Learnable-Compression Coding

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding

theoretical: justification for the best learnable projection

practical: consistently better performance than compressive sensing (& binary relevance)

will now introduce the key ideas behind the approach

(9)

Faster Decoding: Round-based

Compressive Sensing Revisited

decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given “intermediate prediction” (real-valued vector) ỹ:

find closest sparse binary vector to ỹ: slow optimization of an ℓ1-regularized objective

find closest any binary vector to ỹ: fast, g(x) = round(ỹ)

round-based decoding: simple & faster alternative
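A minimal round-based decoding sketch, reusing the P and reg produced by the compress/learn sketch earlier (illustrative names):

```python
# Round-based decoding: reconstruct P^T r(x), then threshold each coordinate at 0.5.
import numpy as np

def round_decode(P, reg, X):
    Y_tilde = reg.predict(X) @ P               # rows are P^T r(x) for each test instance
    return (Y_tilde >= 0.5).astype(int)
```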

(10)

Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M by L random matrix P

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to desired output yn during compression (why?)

(11)

Novel Theoretical Guarantee

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(P^T r(x)), then with codeword c = P y,

Hamming loss (1/L) |g(x) △ Y| ≤ const · ( ‖r(x) − c‖² [learn] + ‖y − P^T c‖² [compress] )

‖r(x) − c‖²: prediction error from input to codeword

‖y − P^T c‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output yn during compression (indeed)

(12)

Proposed Approach 1:

Principal Label Space Transform

From Compressive Sensing to PLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M by L principal matrix P

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = round(P^T r(x))

principal directions: via Principal Component Analysis on {yn}, n = 1, . . . , N

physical meaning behind pm: key (linear) label correlations

PLST: improving CS by projecting to key correlations
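A minimal PLST sketch, assuming scikit-learn (PCA for the principal directions, ridge regression for learning, rounding for decoding); hyper-parameters are illustrative:

```python
# PLST sketch: PCA on the label vectors (PCA handles the mean shift), regression, round decoding.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

def plst_train(X, Y, M):
    pca = PCA(n_components=M).fit(Y)           # principal directions of {y_n}
    C = pca.transform(Y)                       # compressed codewords c_n
    reg = Ridge(alpha=1.0).fit(X, C)           # learn r(x) from x_n to c_n
    return pca, reg

def plst_predict(pca, reg, X):
    # decode: map back with the principal directions, then round
    return (pca.inverse_transform(reg.predict(X)) >= 0.5).astype(int)
```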

(13)

Theoretical Guarantee of PLST Revisited

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(P^T r(x)), then with codeword c = P y,

Hamming loss (1/L) |g(x) △ Y| ≤ const · ( ‖r(x) − c‖² [learn] + ‖y − P^T c‖² [compress] )

‖y − P^T c‖²: encoding error, minimized during encoding

‖r(x) − c‖²: prediction error, minimized during learning

but good encoding may not be easy to learn; vice versa

PLST: minimize the two errors separately (sub-optimal)

(can we do better by minimizing jointly?)

(14)

Proposed Approach 2:

Conditional Principal Label Space Transform

can we do better by minimizing jointly?

Yes, and easy for ridge regression (closed-form solution)

From PLST to CPLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M by L conditional principal matrix P

2 learn: get regression function r(x) from xn to cn, ideally using ridge regression

3 decode: g(x) = round(P^T r(x))

conditional principal directions: top eigenvectors of Y^T X X^† Y (X^† denotes the pseudo-inverse), the key (linear) label correlations that are “easy to learn”

CPLST: project to key learnable correlations
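A simplified CPLST sketch under the eigen-formulation above, assuming scikit-learn and ridge regression for the learning step; details of the original paper (e.g. regularized hat matrices) are omitted, and names are illustrative:

```python
# CPLST sketch: conditional principal directions are the top-M eigenvectors of Z^T (X X^+) Z,
# where Z is the mean-shifted label matrix and X X^+ is the hat matrix of linear regression.
import numpy as np
from sklearn.linear_model import Ridge

def cplst_train(X, Y, M):
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean
    H = X @ np.linalg.pinv(X)                  # hat matrix X X^+ (N x N)
    vals, vecs = np.linalg.eigh(Z.T @ H @ Z)   # symmetric eigen-decomposition, ascending order
    P = vecs[:, -M:].T                         # top-M conditional principal directions (M x L)
    reg = Ridge(alpha=1.0).fit(X, Z @ P.T)     # learn r(x) toward codewords c_n = P z_n
    return P, y_mean, reg

def cplst_predict(P, y_mean, reg, X):
    return (reg.predict(X) @ P + y_mean >= 0.5).astype(int)
```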

(15)

Hamming Loss Comparison: Full-BR, PLST & CS

[Plots: Hamming loss vs. number of dimensions on mediamill, with Linear Regression and Decision Tree as base learners; curves for Full-BR (no reduction), CS, and PLST]

PLST better than Full-BR: fewer dimensions, similar (or better) performance

PLST better than CS: faster, better performance

similar findings across data sets and regression algorithms

(16)

Hamming Loss Comparison: PLST & CPLST

[Plot: Hamming loss vs. number of dimensions on yeast with Linear Regression; curves for PBR, CPA, PLST, and CPLST]

CPLST better than PLST: better performance across all dimensions

similar findings across data sets and regression algorithms

(17)

Multi-label Classification Setup Revisited

Given

N examples (input xn, label-set Yn) ∈ X × 2^{1,2,··· ,L}

fruits: X = encoding(pictures), Yn ⊆ {1, 2, · · · , 4}

tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen inputs x

Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

next: going beyond Hamming loss

(18)

Cost Functions for Multi-label Classification

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen inputs x

Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

Other Evaluation of Closeness

cost function c(y, ỹ): the penalty of predicting y as ỹ

e.g. 0/1 loss: strict match of ỹ to y

e.g. F1 cost: 1 − F1 score, i.e. 1 minus the harmonic mean of precision & recall of ỹ w.r.t. y
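For concreteness, hedged implementations of the three criteria on 0/1 label vectors (illustrative helper names, not from the slides):

```python
# y: ground-truth 0/1 label vector; y_tilde: predicted 0/1 label vector.
import numpy as np

def hamming_loss(y, y_tilde):
    return np.mean(y != y_tilde)               # (1/L) |g(x) triangle Y|

def zero_one_loss(y, y_tilde):
    return float(not np.array_equal(y, y_tilde))

def f1_cost(y, y_tilde):
    inter = np.sum((y == 1) & (y_tilde == 1))
    denom = np.sum(y == 1) + np.sum(y_tilde == 1)
    f1 = 2.0 * inter / denom if denom > 0 else 1.0   # F1 = 1 when both label sets are empty
    return 1.0 - f1
```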

(19)

Cost-Sensitive Multi-Label Classification (CSMLC)

Given

N examples (input xn, label-set Yn) ∈ X × 2^{1,2,··· ,L}

fruits: X = encoding(pictures), Yn ⊆ {1, 2, · · · , 4}

tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

and a desired cost function c(y, ỹ)

Goal

a multi-label classifier g(x) that closely predicts the label-set vector y associated with some unseen inputs x, i.e. low c(y, g(x))

next: label space coding for CSMLC

(20)

Label Embedding

[Diagram: embedding Φ maps label space Y to embedded space Z, decoding Ψ maps Z back to Y, and regressor r maps feature space X to Z]

Training Stage

embedding function Φ: label vector y → embedded vector z

learn a regressor r from {(xn, zn)}, n = 1, . . . , N

Predicting Stage

for testing instance x, predicted embedded vector z̃ = r(x)

decoding function Ψ: z̃ → predicted label vector ỹ

(C)PLST: linear projection embedding

(21)

Cost-Sensitive Label Embedding

[Diagram: the same label embedding framework (Φ, Ψ, r among label space Y, embedded space Z, feature space X)]

Existing Works

label embedding: PLST, CPLST, FaIE, RAkEL, ECC-based [Tai et al., 2012; Chen et al., 2012; Lin et al., 2014; Tsoumakas et al., 2011; Ferng et al., 2013]

cost-sensitivity: CFT, PCC [Li et al., 2014; Dembczynski et al., 2010]

cost-sensitivity + label embedding: no existing works

Cost-Sensitive Label Embedding

consider the cost function c when designing the embedding function Φ and the decoding function Ψ (cost-sensitive embedded vectors z)

(22)

Our Contributions

Cost-sensitive Coding

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for cost-sensitive dimension reduction

theoretical: justification for cost-sensitive label embedding

practical: consistently better performance than CPLST across different costs

will now introduce the key ideas behind the approach

(23)

Cost-Sensitive Embedding

[Diagram: embedded space Z with vectors z1, z2, z3 whose pairwise distances encode √c(y1, y2), √c(y2, y3), √c(y3, y1); regressor r maps feature space X into Z]

Training Stage

distances between embedded vectors ⇔ cost information

larger (smaller) distance d(zi, zj) ⇔ higher (lower) cost c(yi, yj)

d(zi, zj) ≈ √c(yi, yj) by multidimensional scaling (manifold learning)
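A sketch of the cost-to-distance embedding with scikit-learn's metric MDS, for a symmetric (or symmetrized) cost; asymmetric costs are the subject of the mirroring trick two slides later, and all names here are illustrative:

```python
# Embed candidate label vectors so that pairwise distances approximate sqrt(c(y_i, y_j)).
import numpy as np
from sklearn.manifold import MDS

def embed_by_cost(candidates, cost, M, seed=0):
    K = len(candidates)
    D = np.array([[np.sqrt(cost(candidates[i], candidates[j])) for j in range(K)]
                  for i in range(K)])
    D = (D + D.T) / 2.0                        # MDS needs a symmetric dissimilarity matrix
    mds = MDS(n_components=M, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)                # rows are the embedded vectors z_1, ..., z_K
```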

(24)

Cost-Sensitive Decoding

[Diagram: embedded space Z with candidate vectors z1, z2, z3 and the predicted embedded vector z̃ = r(x) from feature space X]

Predicting Stage

for testing instance x, predicted embedded vector z̃ = r(x)

find the nearest embedded vector zq to z̃

cost-sensitive prediction ỹ = yq

(25)

Theoretical Explanation

Theorem (Huang and Lin, 2017)

c(y, ỹ) ≤ 5 · ( ( d(z, zq) − √c(y, yq) )² [embedding error] + ‖z − r(x)‖² [regression error] )

Optimization

embedding error → multidimensional scaling

regression error → regression r

Challenge

asymmetric cost function vs. symmetric distance?

i.e. c(yi, yj) ≠ c(yj, yi) vs. d(zi, zj)

(26)

Mirroring Trick

[Diagram: embedded space Z with prediction-role vectors z(p)1, z(p)2, z(p)3 and ground-truth-role vectors z(t)1, z(t)2, z(t)3; cross-role distances encode √c(y1, y2) and √c(y2, y1); regressor r maps feature space X to z̃]

two roles of yi: ground truth role y(t)i and prediction role y(p)i

√c(yi, yj) ⇒ predict yi as yj ⇒ distance for z(t)i and z(p)j

√c(yj, yi) ⇒ predict yj as yi ⇒ distance for z(p)i and z(t)j

learn regression function r from z(p)1, z(p)2, . . . , z(p)L
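A rough, unweighted stand-in for the mirroring trick, assuming scikit-learn's MDS and c(y, y) = 0; the paper's weighted MDS puts zero weight on within-role pairs, which plain MDS cannot express, so the within-role blocks below are crudely filled with symmetrized sqrt-costs:

```python
# Each candidate y_i gets a ground-truth-role point and a prediction-role point;
# cross-role dissimilarities carry sqrt(c(y_i, y_j)).
import numpy as np
from sklearn.manifold import MDS

def mirrored_embedding(candidates, cost, M, seed=0):
    K = len(candidates)
    S = np.array([[np.sqrt(cost(candidates[i], candidates[j])) for j in range(K)]
                  for i in range(K)])          # S[i, j] ~ ground truth y_i predicted as y_j
    W = (S + S.T) / 2.0                        # placeholder for the unconstrained within-role blocks
    D = np.block([[W, S], [S.T, W]])           # order: K ground-truth roles, then K prediction roles
    Z = MDS(n_components=M, dissimilarity="precomputed", random_state=seed).fit_transform(D)
    return Z[:K], Z[K:]                        # (z(t)_1..z(t)_K, z(p)_1..z(p)_K)
```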

(27)

Cost-Sensitive Label Embedding with Multidimensional Scaling

Training Stage of CLEMS

given training instances D = {(xn, yn)}, n = 1, . . . , N, and the cost function c

determine the two roles of embedded vectors, z(t)n and z(p)n, for each label vector yn

embedding function Φ: yn → z(p)n

learn a regression function r from {(xn, Φ(yn))}, n = 1, . . . , N

Predicting Stage of CLEMS

given the testing instance x

obtain the predicted embedded vector by z̃ = r(x)

decoding Ψ(·) = Φ−1(nearest neighbor) = Φ−1( argmin_n d(z(t)n, ·) )

prediction ỹ = Ψ(z̃)
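A minimal cost-sensitive decoding sketch matching the predicting stage above; Z_t holds the ground-truth-role embedded vectors and Y_candidates the aligned K x L candidate label matrix (names are illustrative):

```python
# Predict the embedded vector, then return the candidate whose embedded vector is nearest.
import numpy as np

def clems_decode(reg, Z_t, Y_candidates, X):
    Z_pred = reg.predict(X)                                 # predicted embedded vectors (rows)
    dists = ((Z_pred[:, None, :] - Z_t[None, :, :]) ** 2).sum(axis=2)
    return Y_candidates[np.argmin(dists, axis=1)]           # nearest candidate per test instance
```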

(28)

Comparison with Label Embedding Algorithms

[Plots: F1 score (↑), Accuracy score (↑), and Rank loss (↓) versus M (% of K) on yeast and birds; curves for CLEMS, FaIE, SLEEC, PLST, and CPLST]

(29)

Comparison with Cost-Sensitive Algorithms

data  | F1 score (↑)         | Accuracy score (↑)   | Rank loss (↓)
      | CLEMS  CFT    PCC    | CLEMS  CFT    PCC    | CLEMS  CFT    PCC
emot. | 0.676  0.640  0.643  | 0.589  0.557  -      | 1.484  1.563  1.467
scene | 0.770  0.703  0.745  | 0.760  0.656  -      | 0.672  0.723  0.645
yeast | 0.671  0.649  0.614  | 0.568  0.543  -      | 8.302  8.566  8.469
birds | 0.677  0.601  0.636  | 0.642  0.586  -      | 4.886  4.908  3.660
med.  | 0.814  0.635  0.573  | 0.786  0.613  -      | 5.170  5.811  4.234
enron | 0.606  0.557  0.542  | 0.491  0.448  -      | 29.40  26.64  25.11
lang. | 0.375  0.168  0.247  | 0.327  0.164  -      | 31.03  34.16  19.11
flag  | 0.731  0.692  0.706  | 0.615  0.588  -      | 2.930  3.075  2.857
slash | 0.568  0.429  0.503  | 0.538  0.402  -      | 4.986  5.677  4.472
CAL.  | 0.419  0.371  0.391  | 0.273  0.237  -      | 1247   1120   993
arts  | 0.492  0.334  0.349  | 0.451  0.281  -      | 9.865  10.07  8.467
EUR.  | 0.670  0.456  0.483  | 0.650  0.450  -      | 89.52  129.5  43.28

generality for CSMLC: CLEMS = CFT > PCC

PCC requires an efficient inference rule

performance: CLEMS ≈ PCC > CFT

(30)

Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012 with 172 citations)

—condense for efficiency: better (than BR) approach PLST

— key tool: PCA from Statistics/Signal Processing

2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012 with 114 citations)

—condense learnably for better efficiency: better (than PLST) approach CPLST

— key tool: Ridge Regression from Statistics (+ PCA)

3 Cost-sensitive Coding (Huang & Lin, ECML Conference ML Journal Track 2017)

—condense cost-sensitively towards application needs: better (than CPLST) approach CLEMS

— key tool: Multidimensional Scaling from Statistics
