Label Space Coding for Multi-label Classification
Hsuan-Tien Lin, National Taiwan University
RIKEN AIP Center, 08/30/2019
joint works with
Farbound Tai (MLD Workshop 2010, NC Journal 2012) &
Yao-Nan Chen (NeurIPS Conference 2012) &
Kuan-Hao Huang (ECML Conference ML Journal Track 2017)
Which Fruits?
?: {orange, strawberry, kiwi}
[Pictures of four fruit classes: apple, orange, strawberry, kiwi]
multi-label classification:
classify input to multiple (or no) categories
What Tags?
?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}
another multi-label classification problem:
tagging input to multiple categories
Binary Relevance: Multi-label Classification via Yes/No
Binary classification: {yes, no}
Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.
• Binary Relevance approach:
transformation to multiple isolated binary classification problems
• disadvantages:
• isolation: hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)
• unbalanced: few yes, many no
Binary Relevance: simple (& good) benchmark with known disadvantages
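As a concrete illustration of the transformation, here is a minimal binary relevance sketch in Python, assuming scikit-learn and a toy random dataset (all variable names are illustrative): one independent logistic-regression classifier answers each of the L yes/no questions.

```python
# Binary relevance (sketch): one isolated yes/no classifier per label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # N = 100 toy inputs with 20 features
Y = (rng.random((100, 5)) < 0.3).astype(int)   # N x L binary label matrix, L = 5

# train L isolated binary classifiers (hidden label relations are ignored)
classifiers = [LogisticRegression().fit(X, Y[:, l]) for l in range(Y.shape[1])]

# predict by asking the L yes/no questions independently
Y_pred = np.column_stack([clf.predict(X) for clf in classifiers])
```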
Multi-label Classification Setup
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|
multi-label classification: hot and important
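The Hamming loss above is simple to compute; a short NumPy sketch (with illustrative names) of the averaged symmetric difference:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """(1/L) * |g(x) symmetric-difference Y|, averaged over examples."""
    return np.mean(np.asarray(Y_true) != np.asarray(Y_pred))

# truth {orange} vs. prediction {orange, strawberry} over L = 3 labels -> 1/3
print(hamming_loss([[0, 1, 0]], [[0, 1, 1]]))
```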
From Label-set to Coding View
label set    | apple | orange | strawberry | binary code
Y1 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y1 = [0, 1, 0]
Y2 = {a, o}  | 1 (Y) | 1 (Y)  | 0 (N)      | y2 = [1, 1, 0]
Y3 = {a, s}  | 1 (Y) | 0 (N)  | 1 (Y)      | y3 = [1, 0, 1]
Y4 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y4 = [0, 1, 0]
Y5 = {}      | 0 (N) | 0 (N)  | 0 (N)      | y5 = [0, 0, 0]
subset Y of 2^{1,2,···,L} ⇔ length-L binary code y
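A tiny sketch of this coding view, converting between a label set Y ⊆ {1, ..., L} and its length-L binary code y (helper names are illustrative):

```python
import numpy as np

def set_to_code(Y, L):
    """Label set Y (1-indexed labels) -> length-L binary code y."""
    y = np.zeros(L, dtype=int)
    for label in Y:
        y[label - 1] = 1
    return y

def code_to_set(y):
    """Length-L binary code y -> label set Y (1-indexed labels)."""
    return {int(i) + 1 for i in np.flatnonzero(y)}

print(set_to_code({1, 2}, 3))   # [1 1 0]  (e.g. Y2 = {apple, orange})
print(code_to_set([0, 1, 0]))   # {2}      (e.g. Y1 = {orange})
```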
Existing Approach: Compressive Sensing
General Compressive Sensing
sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting onto M ≪ L basis vectors {p_1, p_2, ..., p_M}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P = [p_1, p_2, ..., p_M]^T
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:
• efficient in training: random projection with M ≪ L
• inefficient in testing: time-consuming decoding

better projection? faster decoding?
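A rough sketch of the compressive-sensing pipeline (Hsu et al., 2009) on toy data, assuming scikit-learn: a random Gaussian projection for compression, ridge regression for learning, and orthogonal matching pursuit as a simplified stand-in for the sparse (ℓ1-style) decoder; sizes and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
N, d, L, M = 200, 30, 40, 10                       # toy sizes with M << L
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.1).astype(float)       # sparse label vectors

# 1) compress: c_n = P y_n with a random M-by-L matrix P
P = rng.normal(size=(M, L)) / np.sqrt(M)
C = Y @ P.T

# 2) learn: regression r(x) from x_n to c_n
r = Ridge(alpha=1.0).fit(X, C)

# 3) decode: find a sparse binary vector y with P y close to r(x)
def decode(x, sparsity=4):
    c_hat = r.predict(x.reshape(1, -1)).ravel()
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity,
                                    fit_intercept=False).fit(P, c_hat)
    return (omp.coef_ > 0.5).astype(int)

print(decode(X[0]))                                # the time-consuming testing step
```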
Our Contributions
Compression Coding &
Learnable-Compression Coding
A Novel Approach for Label Space Compression
• algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding
• theoretical: justification for the best learnable projection
• practical: consistently better performance than compressive sensing (& binary relevance)
will now introduce the key ideas behind the approach
Faster Decoding: Round-based
Compressive Sensing Revisited
• decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given “intermediate prediction” (real-valued vector) ỹ:
• find closest sparse binary vector to ỹ: slow optimization of an ℓ1-regularized objective
• find closest any binary vector to ỹ: fast, g(x) = round(ỹ)
round-based decoding: simple & faster alternative
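Round-based decoding is essentially one line; a sketch assuming the intermediate prediction ỹ is already available as a NumPy vector:

```python
import numpy as np

def round_decode(y_tilde):
    """g(x) = round(y_tilde): snap each real value to the closer of {0, 1}."""
    return (np.asarray(y_tilde) >= 0.5).astype(int)

print(round_decode([0.8, 0.1, 0.4, 0.6]))   # [1 0 0 1]
```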
Better Projection: Principal Directions
Compressive Sensing Revisited
• compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P
• random projection: arbitrary directions
• best projection: principal directions

principal directions: best approximation to desired output y_n during compression (why?)
Novel Theoretical Guarantee
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = P y,

    (1/L) |g(x) △ Y|   ≤   const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² )
      [Hamming loss]              [learn]          [compress]
• ‖r(x) − c‖²: prediction error from input to codeword
• ‖y − P^T c‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output y_n during compression (indeed)
Proposed Approach 1:
Principal Label Space Transform
From Compressive Sensing to PLST
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M-by-L principal matrix P
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = round(P^T r(x))
• principal directions: via Principal Component Analysis on {y_n}, n = 1, ..., N
• physical meaning behind p_m: key (linear) label correlations

PLST: improving CS by projecting to key correlations
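A minimal PLST sketch on toy data, assuming NumPy and scikit-learn; as in standard PCA, the sketch centers the label vectors by their mean and adds the mean back before rounding, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
N, d, L, M = 200, 30, 12, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.2).astype(float)

# principal directions of {y_n}: top-M right singular vectors of the centered labels (PCA)
y_bar = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - y_bar, full_matrices=False)
P = Vt[:M]                                   # M-by-L principal matrix

C = (Y - y_bar) @ P.T                        # 1) compress: c_n = P (y_n - y_bar)
r = Ridge(alpha=1.0).fit(X, C)               # 2) learn: regression from x_n to c_n
Y_hat = (r.predict(X) @ P + y_bar >= 0.5).astype(int)   # 3) decode: round(P^T r(x) + y_bar)
```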
Theoretical Guarantee of PLST Revisited
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = P y,

    (1/L) |g(x) △ Y|   ≤   const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² )
      [Hamming loss]              [learn]          [compress]
• ‖y − P^T c‖²: encoding error, minimized during encoding
• ‖r(x) − c‖²: prediction error, minimized during learning
• but a good encoding may not be easy to learn, and vice versa

PLST: minimizes the two errors separately (sub-optimal)
(can we do better by minimizing them jointly?)
Proposed Approach 2:
Conditional Principal Label Space Transform
can we do better by minimizing jointly?
Yes, and it is easy for ridge regression (closed-form solution)
From PLST to CPLST
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M-by-L conditional principal matrix P
2 learn: get regression function r(x) from x_n to c_n, ideally using ridge regression
3 decode: g(x) = round(P^T r(x))
• conditional principal directions: top eigenvectors of Y^T X X^† Y, the key (linear) label correlations that are “easy to learn”

CPLST: project to key learnable correlations
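A CPLST sketch in the same style, again assuming NumPy/scikit-learn and toy data; following the slide, the conditional principal directions are taken as top eigenvectors of Y^T X X^† Y (computed here on centered labels with the hat matrix X X^†), and ridge regression plus round-based decoding complete the pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
N, d, L, M = 200, 30, 12, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.2).astype(float)

Z = Y - Y.mean(axis=0)                       # centered label vectors
H = X @ np.linalg.pinv(X)                    # hat matrix X X^dagger
w, V = np.linalg.eigh(Z.T @ H @ Z)           # eigen-decomposition of Y^T X X^dagger Y
P = V[:, np.argsort(w)[::-1][:M]].T          # top-M conditional principal directions

C = Z @ P.T                                  # compress with the conditional principal matrix
r = Ridge(alpha=1.0).fit(X, C)               # learn (ridge regression)
Y_hat = (r.predict(X) @ P + Y.mean(axis=0) >= 0.5).astype(int)   # round-based decoding
```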
Hamming Loss Comparison: Full-BR, PLST & CS
[Figures: Hamming loss (0.03 to 0.05) versus number of dimensions (0 to 100) on mediamill, with Linear Regression and Decision Tree as base learners; curves: Full-BR (no reduction), CS, PLST.]
• PLST better than Full-BR: fewer dimensions, similar (or better) performance
• PLST better than CS: faster, better performance
• similar findings across data sets and regression algorithms
Hamming Loss Comparison: PLST & CPLST
[Figure: Hamming loss (about 0.20 to 0.245) versus number of dimensions (0 to 15) on yeast with Linear Regression; curves: pbr, cpa, plst, cplst.]
• CPLST better than PLST: better performance across all dimensions
• similar findings across data sets and regression algorithms
Multi-label Classification Setup Revisited
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

next: going beyond Hamming loss
Cost Functions for Multi-label Classification
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x
• Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|
Other Evaluation of Closeness
• cost function c(y, ỹ): the penalty of predicting y as ỹ
• e.g. 0/1 loss: 0 only if ỹ strictly matches y, 1 otherwise
• e.g. F1 cost: 1 − harmonic mean of precision & recall of ỹ w.r.t. y (both sketched below)
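A short NumPy sketch of these two cost functions, following the definitions above (helper names are illustrative):

```python
import numpy as np

def zero_one_loss(y, y_tilde):
    """0/1 loss: 0 only if y_tilde strictly matches y, else 1."""
    return float(not np.array_equal(y, y_tilde))

def f1_cost(y, y_tilde):
    """F1 cost: 1 - F1 score, where F1 = 2 |y AND y_tilde| / (|y| + |y_tilde|)."""
    y, y_tilde = np.asarray(y), np.asarray(y_tilde)
    denom = y.sum() + y_tilde.sum()
    if denom == 0:                  # both label sets empty: treat as a perfect prediction
        return 0.0
    return 1.0 - 2.0 * np.sum(y * y_tilde) / denom

print(zero_one_loss([1, 0, 1], [1, 0, 0]))   # 1.0
print(f1_cost([1, 0, 1], [1, 0, 0]))         # 1 - 2/3 = 0.333...
```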
Cost-Sensitive Multi-Label Classification (CSMLC)
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,···,L}
• fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
• tags: X = encoding(merchandise), Y_n ⊆ {1, 2, ···, L}
and a desired cost function c(y, ỹ)
Goal
a multi-label classifier g(x) that closely predicts the label vector y associated with some unseen input x, i.e. achieves low c(y, g(x))
next: label space coding for CSMLC
Label Embedding
[Diagram: embedding function Φ maps label space Y to embedded space Z; decoding function Ψ maps Z back to Y; regression function r maps feature space X to Z.]

Training Stage
• embedding function Φ: label vector y → embedded vector z
• learn a regressor r from {(x_n, z_n)}, n = 1, ..., N

Predicting Stage
• for a testing instance x, predicted embedded vector z̃ = r(x)
• decoding function Ψ: z̃ → predicted label vector ỹ

(C)PLST: linear projection embedding
Cost-Sensitive Label Embedding
[Diagram: label space Y, embedded space Z, and feature space X connected by Φ, Ψ, and r, as before.]

Existing Works
• label embedding: PLST, CPLST, FaIE, RAkEL, ECC-based [Tai et al., 2012; Chen et al., 2012; Lin et al., 2014; Tsoumakas et al., 2011; Ferng et al., 2013]
• cost-sensitivity: CFT, PCC [Li et al., 2014; Dembczynski et al., 2010]
• cost-sensitivity + label embedding: no existing works
Cost-Sensitive Label Embedding
• consider the cost function c when designing the embedding function Φ and the decoding function Ψ (cost-sensitive embedded vectors z)
Our Contributions
Cost-sensitive Coding
A Novel Approach for Label Space Compression
• algorithmic: first known algorithm for cost-sensitive dimension reduction
• theoretical: justification for cost-sensitive label embedding
• practical: consistently better performance than CPLST across different costs
will now introduce the key ideas behind the approach
Cost-Sensitive Embedding
[Diagram: embedded vectors z_1, z_2, z_3 in embedded space Z with pairwise distances √c(y_1, y_2), √c(y_2, y_3), √c(y_3, y_1); regression function r maps feature space X to Z.]

Training Stage
• distances between embedded vectors ⇔ cost information
• larger (smaller) distance d(z_i, z_j) ⇔ higher (lower) cost c(y_i, y_j)
• d(z_i, z_j) ≈ √c(y_i, y_j) by multidimensional scaling (manifold learning)
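A sketch of this step with scikit-learn's MDS on a precomputed dissimilarity matrix: a few candidate label vectors, a symmetric cost (normalized Hamming distance here, since plain MDS needs symmetry), and embedded vectors whose pairwise distances approximate √c(y_i, y_j); the asymmetric case is handled by the mirroring trick later.

```python
import numpy as np
from sklearn.manifold import MDS

# a few candidate label vectors and a symmetric cost (normalized Hamming distance)
Ys = np.array([[0, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]], dtype=float)
cost = np.array([[np.mean(a != b) for b in Ys] for a in Ys])

# multidimensional scaling: d(z_i, z_j) approximates sqrt(c(y_i, y_j))
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Z = mds.fit_transform(np.sqrt(cost))        # embedded vectors z_1, ..., z_4
print(Z)
```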
Cost-Sensitive Decoding
[Diagram: predicted embedded vector z̃ = r(x) in embedded space Z, near the candidate embedded vectors z_1, z_2, z_3.]
Predicting Stage
• for a testing instance x, predicted embedded vector z̃ = r(x)
• find the nearest embedded vector z_q to z̃
• cost-sensitive prediction ˜y = yq
Theoretical Explanation
Theorem (Huang and Lin, 2017)

    c(y, ỹ)   ≤   5 · ( ( d(z, z_q) − √c(y, y_q) )²  +  ‖z − r(x)‖² )
                        [embedding error]               [regression error]
Optimization
• embedding error → multidimensional scaling
• regression error → regression function r

Challenge
• asymmetric cost function vs. symmetric distance?
i.e. c(y_i, y_j) ≠ c(y_j, y_i) vs. d(z_i, z_j) = d(z_j, z_i)
Mirroring Trick
[Diagram: each y_i has a prediction-role vector z(p)_i and a ground-truth-role vector z(t)_i in embedded space Z; cross-role distances match √c(y_1, y_2) and √c(y_2, y_1); regression function r maps feature space X to z̃.]
• two roles of y_i: ground-truth role y(t)_i and prediction role y(p)_i
• √c(y_i, y_j) ⇒ predict y_i as y_j ⇒ distance between z(t)_i and z(p)_j
• √c(y_j, y_i) ⇒ predict y_j as y_i ⇒ distance between z(t)_j and z(p)_i
• learn the regression function r using the ground-truth-role vectors z(t)_n as targets
Cost-Sensitive Label Embedding with Multidimensional Scaling
Training Stage of CLEMS
• given training instances D = {(x_n, y_n)}, n = 1, ..., N, and the cost function c
• determine the two roles of embedded vectors, z(t)_n and z(p)_n, for each label vector y_n
• embedding function Φ: y_n → z(t)_n
• learn a regression function r from {(x_n, Φ(y_n))}, n = 1, ..., N

Predicting Stage of CLEMS
• given a testing instance x
• obtain the predicted embedded vector z̃ = r(x)
• decoding Ψ(z̃): find the nearest prediction-role vector, q = argmin_n d(z(p)_n, z̃), and output y_q
• prediction ỹ = Ψ(z̃)
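Putting the pieces together, a simplified end-to-end CLEMS-style sketch on toy data, assuming NumPy/scikit-learn: an illustrative asymmetric cost, the mirroring trick to build a symmetric 2K-by-2K dissimilarity matrix (plain MDS has no per-pair weights, so the within-role blocks are filled with a symmetrized cost rather than ignored as in the paper's weighted MDS), ground-truth-role vectors as regression targets, and nearest prediction-role vectors for decoding.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import Ridge

def asym_cost(y, y_tilde):
    """Illustrative asymmetric cost: a missed label (FN) costs twice a spurious one (FP)."""
    fn = np.sum((y == 1) & (y_tilde == 0))
    fp = np.sum((y == 0) & (y_tilde == 1))
    return (2.0 * fn + 1.0 * fp) / len(y)

rng = np.random.default_rng(3)
N, d, L, M = 120, 10, 6, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, L)) < 0.3).astype(int)

S = np.unique(Y, axis=0)                     # candidate label vectors (distinct y_n)
K = len(S)
C = np.array([[asym_cost(a, b) for b in S] for a in S])

# mirroring trick: rows/cols 0..K-1 are ground-truth roles z^(t), K..2K-1 prediction roles z^(p)
D = np.zeros((2 * K, 2 * K))
D[:K, K:] = np.sqrt(C)                       # d(z^(t)_i, z^(p)_j) ~ sqrt(c(y_i, y_j))
D[K:, :K] = np.sqrt(C).T
D[:K, :K] = np.sqrt((C + C.T) / 2)           # within-role filler (simplification)
D[K:, K:] = np.sqrt((C + C.T) / 2)

Z = MDS(n_components=M, dissimilarity='precomputed',
        random_state=0).fit_transform(D)
Z_t, Z_p = Z[:K], Z[K:]                      # ground-truth-role / prediction-role embeddings

# training: regress from x_n to the ground-truth-role embedding of y_n
idx = np.array([np.flatnonzero((S == y).all(axis=1))[0] for y in Y])
r = Ridge(alpha=1.0).fit(X, Z_t[idx])

# predicting: nearest prediction-role vector gives the cost-sensitive prediction
z_tilde = r.predict(X[:1])
q = int(np.argmin(np.linalg.norm(Z_p - z_tilde, axis=1)))
print(S[q])
```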
Comparison with Label Embedding Algorithms
[Figures: F1 score (↑), Accuracy score (↑), and Rank loss (↓) versus embedding dimension M (% of K) on yeast and birds; methods compared: CLEMS, FaIE, SLEEC, PLST, CPLST.]
Comparison with Cost-Sensitive Algorithms
data   |      F1 score (↑)      |   Accuracy score (↑)    |     Rank loss (↓)
       |  CLEMS    CFT    PCC   |  CLEMS    CFT    PCC    |  CLEMS    CFT    PCC
emot.  |  0.676   0.640  0.643  |  0.589   0.557    –     |  1.484   1.563  1.467
scene  |  0.770   0.703  0.745  |  0.760   0.656    –     |  0.672   0.723  0.645
yeast  |  0.671   0.649  0.614  |  0.568   0.543    –     |  8.302   8.566  8.469
birds  |  0.677   0.601  0.636  |  0.642   0.586    –     |  4.886   4.908  3.660
med.   |  0.814   0.635  0.573  |  0.786   0.613    –     |  5.170   5.811  4.234
enron  |  0.606   0.557  0.542  |  0.491   0.448    –     |  29.40   26.64  25.11
lang.  |  0.375   0.168  0.247  |  0.327   0.164    –     |  31.03   34.16  19.11
flag   |  0.731   0.692  0.706  |  0.615   0.588    –     |  2.930   3.075  2.857
slash  |  0.568   0.429  0.503  |  0.538   0.402    –     |  4.986   5.677  4.472
CAL.   |  0.419   0.371  0.391  |  0.273   0.237    –     |  1247    1120    993
arts   |  0.492   0.334  0.349  |  0.451   0.281    –     |  9.865   10.07  8.467
EUR.   |  0.670   0.456  0.483  |  0.650   0.450    –     |  89.52   129.5  43.28
• generality for CSMLC: CLEMS = CFT > PCC
• PCC requires an efficient inference rule
• performance: CLEMS ≈ PCC > CFT
Conclusion
1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012, with 172 citations)
— condense for efficiency: better (than BR) approach PLST
— key tool: PCA from Statistics/Signal Processing
2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012, with 114 citations)
— condense learnably for better efficiency: better (than PLST) approach CPLST
— key tool: Ridge Regression from Statistics (+ PCA)
3 Cost-Sensitive Coding (Huang & Lin, ECML Conference ML Journal Track 2017)
— condense cost-sensitively towards application needs: better (than CPLST) approach CLEMS
— key tool: Multidimensional Scaling from Statistics