Label Space Coding for Multi-label Classification
Hsuan-Tien Lin, National Taiwan University
Talk at NTHU, 04/11/2014

joint work with
Farbound Tai (MLD Workshop 2010, NC Journal 2012),
Chun-Sung Ferng (ACML Conference 2011, IEEE TNNLS Journal 2013), and
Yao-Nan Chen (NIPS Conference 2012)
Multi-label Classification
Which Fruit?
[figure: an unknown picture "?" to be classified; candidate labels: apple, orange, strawberry, kiwi]

multi-class classification:
classify input (picture) to one category (label)
—How?
Supervised Machine Learning
[diagram: a parent teaches a kid with (picture, category) pairs, and the kid's brain forms a good decision function]

[diagram: truth f(x) + noise e(x) generates examples (picture xn, category yn); a learning algorithm searches the learning model {gα(x)} for a good decision function g(x) ≈ f(x)]

challenge:
see only {(xn, yn)} without knowing f(x) or e(x)
=⇒ generalize to unseen (x, y) w.r.t. f(x)?
Which Fruits?
?: {orange, strawberry, kiwi}
[figure: apple, orange, strawberry, kiwi]

multi-label classification:
classify input to multiple (or no) categories
Powerset: Multi-label Classification via Multi-class
Multi-class w/ L = 4 classes: 4 possible outcomes
{a, o, s, k}

Multi-label w/ L = 4 classes: 2^4 = 16 possible outcomes
2^{a, o, s, k} ⇔ { φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk }

Powerset approach: transformation to multi-class classification

difficulties for large L:
computation (super-large 2^L)
—hard to construct classifier
sparsity (no example for some of 2^L)
—hard to discover hidden combination

Powerset: feasible only for small L with enough examples for every combination
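As a concrete illustration (not from the talk), a minimal powerset transformation in Python; the function name and the scikit-learn-style usage are assumptions of this sketch.

import numpy as np

def powerset_transform(Y):
    """Map each label set (binary row of length L) to one multi-class label.

    Y: (N, L) binary matrix; returns (y_multiclass, codebook), where codebook
    recovers the original label combination from a predicted class index.
    """
    rows = [tuple(row) for row in Y]
    codebook = sorted(set(rows))                        # one class per observed combination
    class_of = {combo: k for k, combo in enumerate(codebook)}
    y_multiclass = np.array([class_of[r] for r in rows])
    return y_multiclass, np.array(codebook)

# usage: train any multi-class classifier on (X, y_multiclass);
# a predicted class k is decoded back to the label set codebook[k]

A learner trained this way can only ever predict combinations seen in training, which is exactly the sparsity difficulty above.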
What Tags?
?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, ... etc.}

another multi-label classification problem:
tagging input to multiple categories
Binary Relevance: Multi-label Classification via Yes/No
Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions
machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach:
transformation to multiple isolated binary classification tasks

disadvantages:
isolation—hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)
unbalanced—few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
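A minimal binary relevance sketch (illustration only, assuming a scikit-learn-style base learner and that every label column has both positive and negative examples):

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y, base=None):
    """Train one independent yes/no classifier per label column of Y (N x L)."""
    base = base if base is not None else LogisticRegression(max_iter=1000)
    return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Stack the L per-label yes/no predictions into an (N, L) binary matrix."""
    return np.column_stack([m.predict(X) for m in models])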
Multi-label Classification Setup
Given
N examples (input xn, label-set Yn) ∈ X × 2^{1,2,··· ,L}
fruits: X = encoding(pictures), Yn ⊆ {1, 2, · · · , 4}
tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)

0/1 loss: any discrepancy [[g(x) ≠ Y]]
Hamming loss: averaged symmetric difference (1/L)|g(x) △ Y|

multi-label classification: hot and important
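Both criteria are easy to state in code; a small sketch for binary label matrices (helper names are illustrative, not from the talk):

import numpy as np

def zero_one_loss(Y_pred, Y_true):
    """0/1 loss: 1 whenever the predicted label set differs from the truth at all."""
    return np.mean(np.any(Y_pred != Y_true, axis=1))

def hamming_loss(Y_pred, Y_true):
    """Hamming loss: fraction of the L label positions that disagree (|g(x) △ Y| / L)."""
    return np.mean(Y_pred != Y_true)

# example with L = 3 labels and N = 2 predictions
Y_true = np.array([[0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[0, 1, 1], [1, 1, 0]])
print(zero_one_loss(Y_pred, Y_true))  # 0.5   (the first example is not perfectly predicted)
print(hamming_loss(Y_pred, Y_true))   # 0.167 (one of the six label positions is wrong)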
Topics in this Talk
1 Compression Coding
—condense for efficiency
—capture hidden correlation

2 Error-correction Coding
—expand for accuracy
—capture hidden combination

3 Learnable-Compression Coding
—condense-by-learnability for better efficiency
—capture hidden & learnable correlation
Compression Coding
From Label-set to Coding View
label set       apple   orange   strawberry   binary code
Y1 = {o}        0 (N)   1 (Y)    0 (N)        y1 = [0, 1, 0]
Y2 = {a, o}     1 (Y)   1 (Y)    0 (N)        y2 = [1, 1, 0]
Y3 = {a, s}     1 (Y)   0 (N)    1 (Y)        y3 = [1, 0, 1]
Y4 = {o}        0 (N)   1 (Y)    0 (N)        y4 = [0, 1, 0]
Y5 = {}         0 (N)   0 (N)    0 (N)        y5 = [0, 0, 0]

subset Y of 2^{1,2,··· ,L} ⇔ length-L binary code y
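A tiny sketch of this label-set ⇔ binary-code view, with labels indexed 0, ..., L−1 (the indexing convention is an assumption of the sketch):

import numpy as np

def labelset_to_code(Y_set, L):
    """Encode a label subset (e.g. {1} for {orange}) as a length-L binary vector."""
    y = np.zeros(L, dtype=int)
    y[list(Y_set)] = 1
    return y

def code_to_labelset(y):
    """Decode a length-L binary vector back into the label subset."""
    return set(map(int, np.flatnonzero(y)))

print(labelset_to_code({1}, 3))               # [0 1 0]  (Y1 = {o} above)
print(code_to_labelset(np.array([1, 0, 1])))  # {0, 2}   (Y3 = {a, s} above)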
Existing Approach: Compressive Sensing
General Compressive Sensing
sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p1, p2, · · · , pM}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)
1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M by L random matrix P = [p1, p2, · · · , pM]^T
2 learn: get regression function r(x) from xn to cn
3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:
efficient in training: random projection w/ M ≪ L (any better projection scheme?)
inefficient in testing: time-consuming decoding (any faster decoding method?)
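A rough sketch of this pipeline under simplifying assumptions (not the exact algorithm of Hsu et al.): ridge regression as r(x), orthogonal matching pursuit as a stand-in sparse decoder, and a known sparsity level k per example.

import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

def cs_fit(X, Y, M, seed=0):
    L = Y.shape[1]
    P = np.random.default_rng(seed).standard_normal((M, L)) / np.sqrt(M)  # random projection
    C = Y @ P.T                       # compress: c_n = P y_n
    r = Ridge(alpha=1.0).fit(X, C)    # learn: regression from x_n to c_n
    return P, r

def cs_predict(P, r, X, k):
    C_hat = r.predict(X)              # intermediate real-valued codewords
    Y_hat = np.zeros((X.shape[0], P.shape[1]), dtype=int)
    for i, c in enumerate(C_hat):
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(P, c)
        Y_hat[i, np.abs(omp.coef_) > 0.5] = 1   # threshold the recovered coefficients to bits
    return Y_hat

The per-example sparse recovery in cs_predict is exactly the decoding cost the talk calls "time-consuming".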
Our Contributions (First Part)
Compression Coding: A Novel Approach for Label Space Compression
algorithmic: scheme for fast decoding
theoretical: justification for best projection
practical: significantly better performance than compressive sensing (& binary relevance)
Faster Decoding: Round-based
Compressive Sensing Revisited
decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given "intermediate prediction" (real-valued vector) ỹ:
find closest sparse binary vector to ỹ: slow
—optimization of an ℓ1-regularized objective
find closest (any) binary vector to ỹ: fast
—g(x) = round(ỹ)

round-based decoding: simple & faster alternative
Better Projection: Principal Directions
Compressive Sensing Revisited
compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M by L random matrix P

random projection: arbitrary directions
best projection: principal directions

principal directions: best approximation to desired output yn during compression (why?)
Novel Theoretical Guarantee
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then

  (1/L)|g(x) △ Y|  [Hamming loss]
    ≤ const · ( ‖r(x) − Py‖²  [learn]  +  ‖y − P^T Py‖²  [compress] )

with codeword c = Py:
‖r(x) − c‖²: prediction error from input to codeword
‖y − P^T c‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output yn during compression (indeed)
Proposed Approach: Principal Label Space Transform
From Compressive Sensing to PLST
1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M by L principal matrix P
2 learn: get regression function r(x) from xn to cn
3 decode: g(x) = round(P^T r(x))

principal directions: via Principal Component Analysis on {yn}, n = 1, . . . , N
physical meaning behind pm: key (linear) label correlations

PLST: improving CS by projecting to key correlations
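A minimal PLST-style sketch (illustration only): PCA on {yn} gives the M x L principal matrix P, regression is learned in the compressed code space, and decoding is round-based; label-mean centering and the ridge regressor are assumptions of this sketch.

import numpy as np
from sklearn.linear_model import Ridge

def plst_fit(X, Y, M):
    y_mean = Y.mean(axis=0)
    U, S, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    P = Vt[:M]                              # top-M principal directions (M x L)
    C = (Y - y_mean) @ P.T                  # compress: c_n = P (y_n - mean)
    r = Ridge(alpha=1.0).fit(X, C)          # learn r(x) in the M-dimensional code space
    return P, y_mean, r

def plst_predict(P, y_mean, r, X):
    return (r.predict(X) @ P + y_mean > 0.5).astype(int)   # round-based decoding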
Hamming Loss Comparison: Full-BR, PLST & CS
[figure: Hamming loss vs. number of dimensions on mediamill, with Linear Regression and Decision Tree as the regressors; curves for Full-BR (no reduction), CS, and PLST]

PLST better than Full-BR: fewer dimensions, similar (or better) performance
PLST better than CS: faster, better performance
similar findings across data sets and regression algorithms
Semi-summary on PLST
project to principal directions and capture key correlations
efficient learning (after label space compression)
efficient decoding (round-based)
sound theoretical guarantee + good practical performance (better than CS & BR)

expansion (channel coding) instead of compression ("lossy" source coding)? YES!
Error-correction Coding
Our Contributions (Second Part)
Error-correction Coding: A Novel Framework for Label Space Error-correction
algorithmic: generalize a popular existing algorithm (RAkEL; Tsoumakas & Vlahavas, 2007) and explain it through the coding view
theoretical: link learning performance to error-correcting ability
practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)
Key Idea: Redundant Information
General Error-correcting Codes (ECC)
commonly used in communication systems:
detect & correct errors after transmitting data over a noisy channel
encode data redundantly

ECC for Machine Learning (successful for multi-class classification):
learn redundant bits b =⇒ use predictions of b to correct prediction errors
Proposed Framework: Multi-labeling with ECC
encode to add redundant information: enc(·) : {0, 1}^L → {0, 1}^M
decode to locate the most probable binary vector: dec(·) : {0, 1}^M → {0, 1}^L

transformation to a larger multi-label classification task with labels b

PLST: M ≪ L (works for large L); ML-ECC: M > L (works for small L)
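A skeleton of this encode–learn–decode view (illustrative; enc, dec, and base_learner are placeholders for a concrete code such as REP/HAMR/BCH and a concrete multi-label learner):

import numpy as np

def mlecc_fit(X, Y, enc, base_learner):
    B = np.array([enc(y) for y in Y])          # N x M matrix of redundant codeword bits
    return base_learner.fit(X, B)              # the larger multi-label task on b

def mlecc_predict(model, X, dec):
    B_hat = model.predict(X)                   # predicted (possibly noisy) codewords
    return np.array([dec(b) for b in B_hat])   # error-correcting decode back to L bits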
Simple Theoretical Guarantee
ECC encode + Larger Multi-label Learning + ECC decode

Theorem
Let g(x) = dec(b̃) with b̃ = h(x). Then,

  [[g(x) ≠ Y]]  [0/1 loss]  ≤  const · (Hamming loss of h(x)) / (ECC strength + 1).

PLST: principal directions + decent regression
ML-ECC: which ECC balances strength & difficulty?
Simplest ECC: Repetition Code
encoding: y ∈ {0, 1}^L → b ∈ {0, 1}^M
repeat each bit M/L times
L = 4, M = 28: 1010 −→ 1111111 0000000 1111111 0000000 (each bit repeated 28/4 = 7 times)
then permute the bits randomly

decoding: b̃ ∈ {0, 1}^M → ỹ ∈ {0, 1}^L
majority vote on each original bit

L = 4, M = 28: strength of repetition code (REP) = 3
RAkEL = REP (code) + a special powerset (channel)
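A small repetition-code sketch (the random bit permutation is omitted for readability, so encode and decode here use the grouped bit order):

import numpy as np

def rep_encode(y, M):
    """Repeat each of the L bits M/L times."""
    return np.repeat(y, M // len(y))

def rep_decode(b, L):
    """Majority vote over each original bit's copies."""
    reps = len(b) // L
    return (b.reshape(L, reps).mean(axis=1) > 0.5).astype(int)

y = np.array([1, 0, 1, 0])
b = rep_encode(y, 28)                  # 28-bit codeword: 7 ones, 7 zeros, 7 ones, 7 zeros
b_noisy = b.copy(); b_noisy[:3] ^= 1   # flip 3 bits; within strength 3, so still decodable
print(rep_decode(b_noisy, 4))          # [1 0 1 0]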
Slightly More Sophisticated: Hamming Code
HAM(7, 4) Code
{0, 1}^4 → {0, 1}^7 via adding 3 parity bits
—physical meaning: label combinations
b4 = y0 ⊕ y1 ⊕ y3, b5 = y0 ⊕ y2 ⊕ y3, b6 = y1 ⊕ y2 ⊕ y3
e.g. 1011 −→ 1011010
strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)
{0, 1}^L --REP--> {0, 1}^{4M/7} --HAM(7, 4) on each 4-bit block--> {0, 1}^M
L = 4, M = 28: strength of HAMR = 4, better than REP!

HAMR + the special powerset:
improve RAkEL on code strength
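A sketch of the HAM(7, 4) encoder using exactly the parity equations above; syndrome-based decoding (single-error correction) is omitted for brevity:

import numpy as np

def ham74_encode(y):
    """Append the 3 parity bits b4, b5, b6 to a 4-bit message."""
    y0, y1, y2, y3 = y
    b4 = y0 ^ y1 ^ y3
    b5 = y0 ^ y2 ^ y3
    b6 = y1 ^ y2 ^ y3
    return np.array([y0, y1, y2, y3, b4, b5, b6])

print(ham74_encode([1, 0, 1, 1]))   # [1 0 1 1 0 1 0], matching the 1011 -> 1011010 example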
Even More Sophisticated Codes
Bose-Chaudhuri-Hocquenghem Code (BCH)
modern code in CD players
sophisticated extension of Hamming, with more parity bits
codeword length M = 2^p − 1 for p ∈ N
L = 4, M = 31: strength of BCH = 5

Low-density Parity-check Code (LDPC)
modern code for satellite communication
connects ECC and Bayesian learning
approaches the theoretical limit in some cases

let's compare!
Different ECCs on 3-label Powerset
(scene data set w/ L = 6) learner: special powerset with Random Forests; REP + special powerset ≈ RAkEL

[figure: 0/1 loss and Hamming loss of the different ECCs]

Compared to RAkEL (on most data sets), HAMR: better 0/1 loss, similar Hamming loss
Semi-summary on MLECC
transformation to larger multi-label classification
encode via error-correcting code and capture label combinations (parity bits)
effective decoding (error-correcting)
simple theoretical guarantee + good practical performance

to improve RAkEL, replace REP by
HAMR =⇒ lower 0/1 loss, similar Hamming loss
BCH =⇒ even lower 0/1 loss, but higher Hamming loss
to improve Binary Relevance, · · ·
Learnable-Compression Coding
Theoretical Guarantee of PLST Revisited
Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then

  (1/L)|g(x) △ Y|  [Hamming loss]
    ≤ const · ( ‖r(x) − Py‖²  [learn]  +  ‖y − P^T Py‖²  [compress] )

with codeword c = Py:
‖y − P^T c‖²: encoding error, minimized during encoding
‖r(x) − c‖²: prediction error, minimized during learning
but good encoding may not be easy to learn; vice versa

PLST: minimize the two errors separately (sub-optimal)
(can we do better by minimizing them jointly?)
Our Contributions (Third Part)
Learnable-Compression Coding: A Novel Approach for Label Space Compression
algorithmic: first known algorithm for feature-aware dimension reduction
theoretical: justification for best learnable projection
practical: consistently better performance than PLST
The In-Sample Optimization Problem
  min_{r, P}  ‖r(X) − PY‖²  [learn]  +  ‖Y − P^T PY‖²  [compress]

start from a well-known tool: linear regression as r, i.e. r(X) = XW
for fixed P: a closed-form solution for learn is W* = X†PY

optimal P:
for learn: top eigenvectors of Y^T(I − XX†)Y
for compress: top eigenvectors of Y^T Y
for both: top eigenvectors of Y^T XX† Y
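A minimal CPLST-style sketch of this closed-form recipe (illustration only; label-mean centering and the pseudo-inverse-based hat matrix XX† are assumptions of the sketch, and plain linear regression plays the role of r):

import numpy as np

def cplst_fit(X, Y, M):
    z_mean = Y.mean(axis=0)
    Z = Y - z_mean
    A = Z.T @ (X @ (np.linalg.pinv(X) @ Z))         # Z^T X X† Z  (L x L)
    eigval, eigvec = np.linalg.eigh(A)
    P = eigvec[:, np.argsort(eigval)[::-1][:M]].T   # top-M eigenvectors as rows (M x L)
    C = Z @ P.T                                     # compress: c_n = P (y_n - mean)
    W, *_ = np.linalg.lstsq(X, C, rcond=None)       # learn: linear regression r(X) = XW
    return P, z_mean, W

def cplst_predict(P, z_mean, W, X):
    return ((X @ W) @ P + z_mean > 0.5).astype(int)  # round-based decoding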
Proposed Approach: Conditional Principal Label Space Transform
From PLST to CPLST
1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M by L conditional principal matrix P
2 learn: get regression function r(x) from xn to cn, ideally using linear regression
3 decode: g(x) = round(P^T r(x))

conditional principal directions: top eigenvectors of Y^T XX† Y
physical meaning behind pm: key (linear) label correlations that are "easy to learn"

CPLST: project to key learnable correlations
—can also pair with kernel regression (non-linear)
Hamming Loss Comparison: PLST & CPLST
[figure: Hamming loss vs. # of dimensions on yeast (Linear Regression); curves for pbr, cpa, plst, cplst]

CPLST better than PLST: better performance across all dimensions
similar findings across data sets and regression algorithms
Semi-summary on CPLST
project to conditional principal directions and capture key learnable correlations
more efficient
sound theoretical guarantee (via PLST) + good practical performance (better than PLST)

CPLST: state-of-the-art for label space compression
Conclusion
1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)
—condense for efficiency: better (than BR) approach PLST
—key tool: PCA from Statistics/Signal Processing

2 Error-correction Coding (Ferng & Lin, ACML Conference 2011)
—expand for accuracy: better (than REP) codes HAMR or BCH
—key tool: ECC from Information Theory

3 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)
—condense for efficiency: better (than PLST) approach CPLST
—key tool: Linear Regression from Statistics (+ PCA)

More...
beyond standard ECC-decoding (Ferng and Lin, IEEE TNNLS 2013)
coupling CPLST with other regressors (Chen and Lin, NIPS 2012)
multi-label classification with arbitrary loss (Li and Lin, ICML 2014)
dynamic instead of static coding; combining ML-ECC & PLST/CPLST (...)
Thank you! Questions?