Label Space Coding for Multi-label Classification
Hsuan-Tien Lin, National Taiwan University
RIKEN AIP Center, 08/30/2019
joint works with
Farbound Tai (MLD Workshop 2010, NC Journal 2012) &
Yao-Nan Chen (NeurIPS Conference 2012) &
Kuan-Hao Huang (ECML Conference ML Journal Track 2017)
Which Fruits?
?: {orange, strawberry, kiwi}
[Pictures of four fruit classes: apple, orange, strawberry, kiwi]
multi-label classification:
classify input to multiple (or no) categories
What Tags?
?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}
another multi-label classification problem:
tagging input to multiple categories
Binary Relevance: Multi-label Classification via Yes/No
Binary classification: {yes, no}
Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.
• Binary Relevance approach:
transformation to multiple isolated binary classification problems
• disadvantages:
• isolation: hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)
• unbalanced: few yes, many no
Binary Relevance: simple (& good) benchmark with known disadvantages
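As a concrete illustration of the transformation, here is a minimal binary relevance sketch in Python, assuming scikit-learn and a toy random dataset (all variable names are illustrative): one independent logistic-regression classifier answers each of the L yes/no questions.

```python
# Binary relevance (sketch): one isolated yes/no classifier per label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # N = 100 toy inputs with 20 features
Y = (rng.random((100, 5)) < 0.3).astype(int)   # N x L binary label matrix, L = 5

# train L isolated binary classifiers (hidden label relations are ignored)
classifiers = [LogisticRegression().fit(X, Y[:, l]) for l in range(Y.shape[1])]

# predict by asking the L yes/no questions independently
Y_pred = np.column_stack([clf.predict(X) for clf in classifiers])
```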
Multi-label Classification Setup
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|
multi-label classification: hot and important
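The Hamming loss above is simple to compute; a short NumPy sketch (with illustrative names) of the averaged symmetric difference:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """(1/L) * |g(x) symmetric-difference Y|, averaged over examples."""
    return np.mean(np.asarray(Y_true) != np.asarray(Y_pred))

# truth {orange} vs. prediction {orange, strawberry} over L = 3 labels -> 1/3
print(hamming_loss([[0, 1, 0]], [[0, 1, 1]]))
```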
From Label-set to Coding View
label set    | apple | orange | strawberry | binary code
Y1 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y1 = [0, 1, 0]
Y2 = {a, o}  | 1 (Y) | 1 (Y)  | 0 (N)      | y2 = [1, 1, 0]
Y3 = {a, s}  | 1 (Y) | 0 (N)  | 1 (Y)      | y3 = [1, 0, 1]
Y4 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y4 = [0, 1, 0]
Y5 = {}      | 0 (N) | 0 (N)  | 0 (N)      | y5 = [0, 0, 0]
subset Y of 2^{1,2,···,L} ⇔ length-L binary code y
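A tiny sketch of this coding view, converting between a label set Y ⊆ {1, ..., L} and its length-L binary code y (helper names are illustrative):

```python
import numpy as np

def set_to_code(Y, L):
    """Label set Y (1-indexed labels) -> length-L binary code y."""
    y = np.zeros(L, dtype=int)
    for label in Y:
        y[label - 1] = 1
    return y

def code_to_set(y):
    """Length-L binary code y -> label set Y (1-indexed labels)."""
    return {int(i) + 1 for i in np.flatnonzero(y)}

print(set_to_code({1, 2}, 3))   # [1 1 0]  (e.g. Y2 = {apple, orange})
print(code_to_set([0, 1, 0]))   # {2}      (e.g. Y1 = {orange})
```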
Existing Approach: Compressive Sensing
General Compressive Sensing
sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting onto M ≪ L basis vectors {p_1, p_2, ..., p_M}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P = [p_1, p_2, ..., p_M]^T
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:
• efficient in training: random projection with M ≪ L
• inefficient in testing: time-consuming decoding

better projection? faster decoding?
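A rough sketch of the compressive-sensing pipeline (Hsu et al., 2009) on toy data, assuming scikit-learn: a random Gaussian projection for compression, ridge regression for learning, and orthogonal matching pursuit as a simplified stand-in for the sparse (ℓ1-style) decoder; sizes and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
N, d, L, M = 200, 30, 40, 10                       # toy sizes with M << L
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.1).astype(float)       # sparse label vectors

# 1) compress: c_n = P y_n with a random M-by-L matrix P
P = rng.normal(size=(M, L)) / np.sqrt(M)
C = Y @ P.T

# 2) learn: regression r(x) from x_n to c_n
r = Ridge(alpha=1.0).fit(X, C)

# 3) decode: find a sparse binary vector y with P y close to r(x)
def decode(x, sparsity=4):
    c_hat = r.predict(x.reshape(1, -1)).ravel()
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity,
                                    fit_intercept=False).fit(P, c_hat)
    return (omp.coef_ > 0.5).astype(int)

print(decode(X[0]))                                # the time-consuming testing step
```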
Our Contributions
Compression Coding &
Learnable-Compression Coding
A Novel Approach for Label Space Compression
• algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding
• theoretical: justification for the best learnable projection
• practical: consistently better performance than compressive sensing (& binary relevance)
will now introduce the key ideas behind the approach
Faster Decoding: Round-based
Compressive Sensing Revisited
• decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given “intermediate prediction” (real-valued vector) ỹ:
• find closest sparse binary vector to ỹ: slow optimization of an ℓ1-regularized objective
• find closest any binary vector to ỹ: fast, g(x) = round(ỹ)
round-based decoding: simple & faster alternative
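Round-based decoding is essentially one line; a sketch assuming the intermediate prediction ỹ is already available as a NumPy vector:

```python
import numpy as np

def round_decode(y_tilde):
    """g(x) = round(y_tilde): snap each real value to the closer of {0, 1}."""
    return (np.asarray(y_tilde) >= 0.5).astype(int)

print(round_decode([0.8, 0.1, 0.4, 0.6]))   # [1 0 0 1]
```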
Better Projection: Principal Directions
Compressive Sensing Revisited
• compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P
• random projection: arbitrary directions
• best projection: principal directions

principal directions: best approximation to desired output y_n during compression (why?)
Novel Theoretical Guarantee
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = P y,

    (1/L) |g(x) △ Y|   ≤   const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² )
      [Hamming loss]              [learn]          [compress]
• ‖r(x) − c‖²: prediction error from input to codeword
• ‖y − P^T c‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output y_n during compression (indeed)
Proposed Approach 1:
Principal Label Space Transform
From Compressive Sensing to PLST
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M-by-L principal matrix P
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = round(P^T r(x))
• principal directions: via Principal Component Analysis on {y_n}, n = 1, ..., N
• physical meaning behind p_m: key (linear) label correlations

PLST: improving CS by projecting to key correlations
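A minimal PLST sketch on toy data, assuming NumPy and scikit-learn; as in standard PCA, the sketch centers the label vectors by their mean and adds the mean back before rounding, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N, d, L, M = 200, 30, 12, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.2).astype(float)

# principal directions of {y_n}: top-M right singular vectors of the centered labels (PCA)
y_bar = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - y_bar, full_matrices=False)
P = Vt[:M]                                   # M-by-L principal matrix

C = (Y - y_bar) @ P.T                        # 1) compress: c_n = P (y_n - y_bar)
r = Ridge(alpha=1.0).fit(X, C)               # 2) learn: regression from x_n to c_n
Y_hat = (r.predict(X) @ P + y_bar >= 0.5).astype(int)   # 3) decode: round(P^T r(x) + y_bar)
```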
Theoretical Guarantee of PLST Revisited
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = P y,

    (1/L) |g(x) △ Y|   ≤   const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² )
      [Hamming loss]              [learn]          [compress]
• ‖y − P^T c‖²: encoding error, minimized during encoding
• ‖r(x) − c‖²: prediction error, minimized during learning
• but a good encoding may not be easy to learn, and vice versa

PLST: minimizes the two errors separately (sub-optimal)
(can we do better by minimizing them jointly?)
Proposed Approach 2:
Conditional Principal Label Space Transform
can we do better by minimizing jointly?
Yes, and it is easy for ridge regression (closed-form solution)
From PLST to CPLST
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M-by-L conditional principal matrix P
2 learn: get regression function r(x) from x_n to c_n, ideally using ridge regression
3 decode: g(x) = round(P^T r(x))
• conditional principal directions: top eigenvectors of Y^T X X^† Y, the key (linear) label correlations that are “easy to learn”

CPLST: project to key learnable correlations
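A CPLST sketch in the same style, again assuming NumPy/scikit-learn and toy data; following the slide, the conditional principal directions are taken as top eigenvectors of Y^T X X^† Y (computed here on centered labels with the hat matrix X X^†), and ridge regression plus round-based decoding complete the pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
N, d, L, M = 200, 30, 12, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.2).astype(float)

Z = Y - Y.mean(axis=0)                       # centered label vectors
H = X @ np.linalg.pinv(X)                    # hat matrix X X^dagger
w, V = np.linalg.eigh(Z.T @ H @ Z)           # eigen-decomposition of Y^T X X^dagger Y
P = V[:, np.argsort(w)[::-1][:M]].T          # top-M conditional principal directions

C = Z @ P.T                                  # compress with the conditional principal matrix
r = Ridge(alpha=1.0).fit(X, C)               # learn (ridge regression)
Y_hat = (r.predict(X) @ P + Y.mean(axis=0) >= 0.5).astype(int)   # round-based decoding
```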
Hamming Loss Comparison: Full-BR, PLST & CS
[Figures: Hamming loss (0.03 to 0.05) versus number of dimensions (0 to 100) on mediamill, with Linear Regression and Decision Tree as base learners; curves: Full-BR (no reduction), CS, PLST.]
• PLST better than Full-BR: fewer dimensions, similar (or better) performance
• PLST better than CS: faster, better performance
• similar findings across data sets and regression algorithms
Hamming Loss Comparison: PLST & CPLST
[Figure: Hamming loss (about 0.20 to 0.245) versus number of dimensions (0 to 15) on yeast with Linear Regression; curves: pbr, cpa, plst, cplst.]
• CPLST better than PLST: better performance across all dimensions
• similar findings across data sets and regression algorithms
Multi-label Classification Setup Revisited
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

next: going beyond Hamming loss
Cost Functions for Multi-label Classification
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|
Other Evaluation of Closeness
• cost function c(y, ỹ): the penalty of predicting y as ỹ
• e.g. 0/1 loss: 0 only if ỹ strictly matches y, 1 otherwise
• e.g. F1 cost: 1 − harmonic mean of precision & recall of ỹ w.r.t. y (both sketched below)
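A short NumPy sketch of these two cost functions, following the definitions above (helper names are illustrative):

```python
import numpy as np

def zero_one_loss(y, y_tilde):
    """0/1 loss: 0 only if y_tilde strictly matches y, else 1."""
    return float(not np.array_equal(y, y_tilde))

def f1_cost(y, y_tilde):
    """F1 cost: 1 - F1 score, where F1 = 2 |y AND y_tilde| / (|y| + |y_tilde|)."""
    y, y_tilde = np.asarray(y), np.asarray(y_tilde)
    denom = y.sum() + y_tilde.sum()
    if denom == 0:                  # both label sets empty: treat as a perfect prediction
        return 0.0
    return 1.0 - 2.0 * np.sum(y * y_tilde) / denom

print(zero_one_loss([1, 0, 1], [1, 0, 0]))   # 1.0
print(f1_cost([1, 0, 1], [1, 0, 0]))         # 1 - 2/3 = 0.333...
```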
Cost-Sensitive Multi-Label Classification (CSMLC)
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
and a desired cost function c(y, ỹ)
Goal
a multi-label classifier g(x) that closely predicts the label vector y associated with some unseen input x, i.e. achieves low c(y, g(x))
next: label space coding for CSMLC
Label Embedding
[Diagram: embedding function Φ maps label space Y to embedded space Z; decoding function Ψ maps Z back to Y; regression function r maps feature space X to Z.]

Training Stage
• embedding function Φ: label vector y → embedded vector z
• learn a regressor r from {(x_n, z_n)}, n = 1, ..., N

Predicting Stage
• for a testing instance x, predicted embedded vector z̃ = r(x)
• decoding function Ψ: z̃ → predicted label vector ỹ

(C)PLST: linear projection embedding
Cost-Sensitive Label Embedding
[Diagram: label space Y, embedded space Z, and feature space X connected by Φ, Ψ, and r, as before.]

Existing Works
• label embedding: PLST, CPLST, FaIE, RAkEL, ECC-based [Tai et al., 2012; Chen et al., 2012; Lin et al., 2014; Tsoumakas et al., 2011; Ferng et al., 2013]
• cost-sensitivity: CFT, PCC [Li et al., 2014; Dembczynski et al., 2010]
• cost-sensitivity + label embedding: no existing works
Cost-Sensitive Label Embedding
• consider the cost function c when designing the embedding function Φ and the decoding function Ψ (cost-sensitive embedded vectors z)
Our Contributions
Cost-sensitive Coding
A Novel Approach for Label Space Compression
• algorithmic: first known algorithm for cost-sensitive dimension reduction
• theoretical: justification for cost-sensitive label embedding
• practical: consistently better performance than CPLST across different costs
will now introduce the key ideas behind the approach
Cost-Sensitive Embedding
[Diagram: embedded vectors z_1, z_2, z_3 in embedded space Z with pairwise distances √c(y_1, y_2), √c(y_2, y_3), √c(y_3, y_1); regression function r maps feature space X to Z.]

Training Stage
• distances between embedded vectors ⇔ cost information
• larger (smaller) distance d(z_i, z_j) ⇔ higher (lower) cost c(y_i, y_j)
• d(z_i, z_j) ≈ √c(y_i, y_j) by multidimensional scaling (manifold learning)
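A sketch of this step with scikit-learn's MDS on a precomputed dissimilarity matrix: a few candidate label vectors, a symmetric cost (normalized Hamming distance here, since plain MDS needs symmetry), and embedded vectors whose pairwise distances approximate √c(y_i, y_j); the asymmetric case is handled by the mirroring trick later.

```python
import numpy as np
from sklearn.manifold import MDS

# a few candidate label vectors and a symmetric cost (normalized Hamming distance)
Ys = np.array([[0, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]], dtype=float)
cost = np.array([[np.mean(a != b) for b in Ys] for a in Ys])

# multidimensional scaling: d(z_i, z_j) approximates sqrt(c(y_i, y_j))
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Z = mds.fit_transform(np.sqrt(cost))        # embedded vectors z_1, ..., z_4
print(Z)
```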
Cost-Sensitive Decoding
[Diagram: predicted embedded vector z̃ = r(x) in embedded space Z, near the candidate embedded vectors z_1, z_2, z_3.]
Predicting Stage
• for a testing instance x, predicted embedded vector z̃ = r(x)
• find the nearest embedded vector z_q to z̃
• cost-sensitive prediction ˜y = yq
Theoretical Explanation
Theorem (Huang and Lin, 2017)

    c(y, ỹ)   ≤   5 · ( ( d(z, z_q) − √c(y, y_q) )²  +  ‖z − r(x)‖² )
                        [embedding error]               [regression error]
Optimization
• embedding error → multidimensional scaling
• regression error → regression function r

Challenge
• asymmetric cost function vs. symmetric distance?
i.e. c(y_i, y_j) ≠ c(y_j, y_i) vs. d(z_i, z_j) = d(z_j, z_i)
Mirroring Trick
[Diagram: each y_i has a prediction-role vector z(p)_i and a ground-truth-role vector z(t)_i in embedded space Z; cross-role distances match √c(y_1, y_2) and √c(y_2, y_1); regression function r maps feature space X to z̃.]
• two roles of y_i: ground-truth role y(t)_i and prediction role y(p)_i
• √c(y_i, y_j) ⇒ predict y_i as y_j ⇒ distance between z(t)_i and z(p)_j
• √c(y_j, y_i) ⇒ predict y_j as y_i ⇒ distance between z(t)_j and z(p)_i
• learn the regression function r using the ground-truth-role vectors z(t)_n as targets
Cost-Sensitive Label Embedding with Multidimensional Scaling
Training Stage of CLEMS
• given training instances D = {(x_n, y_n)}, n = 1, ..., N, and the cost function c
• determine the two roles of embedded vectors, z(t)_n and z(p)_n, for each label vector y_n
• embedding function Φ: y_n → z(t)_n
• learn a regression function r from {(x_n, Φ(y_n))}, n = 1, ..., N

Predicting Stage of CLEMS
• given a testing instance x
• obtain the predicted embedded vector z̃ = r(x)
• decoding Ψ(z̃): find the nearest prediction-role vector, q = argmin_n d(z(p)_n, z̃), and output y_q
• prediction ỹ = Ψ(z̃)
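Putting the pieces together, a simplified end-to-end CLEMS-style sketch on toy data, assuming NumPy/scikit-learn: an illustrative asymmetric cost, the mirroring trick to build a symmetric 2K-by-2K dissimilarity matrix (plain MDS has no per-pair weights, so the within-role blocks are filled with a symmetrized cost rather than ignored as in the paper's weighted MDS), ground-truth-role vectors as regression targets, and nearest prediction-role vectors for decoding.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import Ridge

def asym_cost(y, y_tilde):
    """Illustrative asymmetric cost: a missed label (FN) costs twice a spurious one (FP)."""
    fn = np.sum((y == 1) & (y_tilde == 0))
    fp = np.sum((y == 0) & (y_tilde == 1))
    return (2.0 * fn + 1.0 * fp) / len(y)

rng = np.random.default_rng(3)
N, d, L, M = 120, 10, 6, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.3).astype(int)

S = np.unique(Y, axis=0)                     # candidate label vectors (distinct y_n)
K = len(S)
C = np.array([[asym_cost(a, b) for b in S] for a in S])

# mirroring trick: rows/cols 0..K-1 are ground-truth roles z^(t), K..2K-1 prediction roles z^(p)
D = np.zeros((2 * K, 2 * K))
D[:K, K:] = np.sqrt(C)                       # d(z^(t)_i, z^(p)_j) ~ sqrt(c(y_i, y_j))
D[K:, :K] = np.sqrt(C).T
D[:K, :K] = np.sqrt((C + C.T) / 2)           # within-role filler (simplification)
D[K:, K:] = np.sqrt((C + C.T) / 2)

Z = MDS(n_components=M, dissimilarity='precomputed',
        random_state=0).fit_transform(D)
Z_t, Z_p = Z[:K], Z[K:]                      # ground-truth-role / prediction-role embeddings

# training: regress from x_n to the ground-truth-role embedding of y_n
idx = np.array([np.flatnonzero((S == y).all(axis=1))[0] for y in Y])
r = Ridge(alpha=1.0).fit(X, Z_t[idx])

# predicting: nearest prediction-role vector gives the cost-sensitive prediction
z_tilde = r.predict(X[:1])
q = int(np.argmin(np.linalg.norm(Z_p - z_tilde, axis=1)))
print(S[q])
```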
Comparison with Label Embedding Algorithms
[Figures: F1 score (↑), Accuracy score (↑), and Rank loss (↓) versus embedding dimension M (% of K) on yeast and birds; methods compared: CLEMS, FaIE, SLEEC, PLST, CPLST.]
Comparison with Cost-Sensitive Algorithms
data   |      F1 score (↑)      |   Accuracy score (↑)    |     Rank loss (↓)
       |  CLEMS    CFT    PCC   |  CLEMS    CFT    PCC    |  CLEMS    CFT    PCC
emot.  |  0.676   0.640  0.643  |  0.589   0.557    –     |  1.484   1.563  1.467
scene  |  0.770   0.703  0.745  |  0.760   0.656    –     |  0.672   0.723  0.645
yeast  |  0.671   0.649  0.614  |  0.568   0.543    –     |  8.302   8.566  8.469
birds  |  0.677   0.601  0.636  |  0.642   0.586    –     |  4.886   4.908  3.660
med.   |  0.814   0.635  0.573  |  0.786   0.613    –     |  5.170   5.811  4.234
enron  |  0.606   0.557  0.542  |  0.491   0.448    –     |  29.40   26.64  25.11
lang.  |  0.375   0.168  0.247  |  0.327   0.164    –     |  31.03   34.16  19.11
flag   |  0.731   0.692  0.706  |  0.615   0.588    –     |  2.930   3.075  2.857
slash  |  0.568   0.429  0.503  |  0.538   0.402    –     |  4.986   5.677  4.472
CAL.   |  0.419   0.371  0.391  |  0.273   0.237    –     |  1247    1120    993
arts   |  0.492   0.334  0.349  |  0.451   0.281    –     |  9.865   10.07  8.467
EUR.   |  0.670   0.456  0.483  |  0.650   0.450    –     |  89.52   129.5  43.28
• generality for CSMLC: CLEMS = CFT > PCC
• PCC requires an efficient inference rule
• performance: CLEMS ≈ PCC > CFT
Conclusion
1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012, with 172 citations)
— condense for efficiency: better (than BR) approach PLST
— key tool: PCA from Statistics/Signal Processing
2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012, with 114 citations)
— condense learnably for better efficiency: better (than PLST) approach CPLST
— key tool: Ridge Regression from Statistics (+ PCA)
3 Cost-Sensitive Coding (Huang & Lin, ECML Conference ML Journal Track 2017)
— condense cost-sensitively towards application needs: better (than CPLST) approach CLEMS
— key tool: Multidimensional Scaling from Statistics