
Label Space Coding for Multi-label Classification


(1)

Label Space Coding for Multi-label Classification

Hsuan-Tien Lin, National Taiwan University

NUK Seminar, 12/22/2017

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Yao-Nan Chen (NIPS Conference 2012) &

Chun-Sung Ferng (ACML Conference 2011, TNNLS Journal 2013)

(2)

Which Fruit?

?

apple orange strawberry kiwi

multi-class classification:

classify input (picture) to one category (label)

(3)

Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification:

classify input to multiple (or no) categories

(4)

What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, ... etc.}

another multi-label classification problem:

tagging input to multiple categories

(5)

Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach:

transformation to multiple isolated binary classification

disadvantages:

isolation—hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)

unbalanced—few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
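
A minimal sketch of the Binary Relevance transformation just described, assuming a scikit-learn-style setting; the helper names (train_binary_relevance, predict_binary_relevance) and the choice of LogisticRegression as base learner are illustrative only, and each label is assumed to appear with both yes and no values in the training set.

```python
# Binary Relevance sketch: one independent binary classifier per label.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y):
    """X: (N, d) feature matrix; Y: (N, L) binary label matrix."""
    classifiers = []
    for l in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, l])              # each label learned in isolation
        classifiers.append(clf)
    return classifiers

def predict_binary_relevance(classifiers, X):
    # stack the L independent yes/no answers into an (N, L) prediction matrix
    return np.column_stack([clf.predict(X) for clf in classifiers])
```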

(6)

Multi-label Classification Setup

Given

N examples (input xn, label-set Yn) ∈ X × 2^{1, 2, ..., L}

fruits: X = encoding(pictures), Yn ⊆ {1, 2, ..., 4}

tags: X = encoding(merchandise), Yn ⊆ {1, 2, ..., L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)

0/1 loss: any discrepancy ⟦g(x) ≠ Y⟧

Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

multi-label classification: hot and important
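
For concreteness, the two evaluation criteria above can be computed as follows for binary (N, L) matrices of true and predicted label-sets; this is a minimal sketch with illustrative function names.

```python
import numpy as np

def zero_one_loss(Y_true, Y_pred):
    # 1 whenever the predicted label-set differs from the true one in any position
    return np.mean(np.any(Y_true != Y_pred, axis=1))

def hamming_loss(Y_true, Y_pred):
    # size of the symmetric difference g(x) triangle Y, averaged over the L labels (and over N)
    return np.mean(Y_true != Y_pred)
```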

(7)

From Label-set to Coding View

label set      apple   orange   strawberry   binary code
Y1 = {o}       0 (N)   1 (Y)    0 (N)        y1 = [0, 1, 0]
Y2 = {a, o}    1 (Y)   1 (Y)    0 (N)        y2 = [1, 1, 0]
Y3 = {a, s}    1 (Y)   0 (N)    1 (Y)        y3 = [1, 0, 1]
Y4 = {o}       0 (N)   1 (Y)    0 (N)        y4 = [0, 1, 0]
Y5 = {}        0 (N)   0 (N)    0 (N)        y5 = [0, 0, 0]

subset Y of 2^{1, 2, ..., L} ⇔ length-L binary code y
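
A tiny sketch of this label-set-to-code view, using the label names and ordering from the table above (the helper name is illustrative):

```python
import numpy as np

LABELS = ["apple", "orange", "strawberry"]        # L = 3

def labelset_to_code(label_set, labels=LABELS):
    # one bit per label: 1 (Y) if the label is in the set, 0 (N) otherwise
    return np.array([1 if lbl in label_set else 0 for lbl in labels])

print(labelset_to_code({"orange"}))               # [0 1 0]  (= y1, y4)
print(labelset_to_code({"apple", "orange"}))      # [1 1 0]  (= y2)
print(labelset_to_code(set()))                    # [0 0 0]  (= y5)
```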

(8)

Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p1, p2, ..., pM}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M-by-L random matrix P = [p1, p2, ..., pM]^T

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:

efficient in training: random projection w/ M ≪ L

inefficient in testing: time-consuming decoding
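
Below is a rough sketch of the three steps above, assuming scikit-learn is available; the ridge regressor and the use of orthogonal matching pursuit for the sparse decoding step are simplifying choices for illustration, not the exact algorithm of Hsu et al. (2009).

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

def cs_train(X, Y, M, seed=0):
    rng = np.random.RandomState(seed)
    P = rng.randn(M, Y.shape[1])          # M-by-L random projection matrix
    C = Y @ P.T                           # compress: c_n = P y_n
    reg = Ridge().fit(X, C)               # learn: regression r(x) from x_n to c_n
    return P, reg

def cs_decode_one(P, c_hat, sparsity):
    # decode: find a sparse vector y with P y close to c_hat, then binarize
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity).fit(P, c_hat)
    return (omp.coef_ > 0.5).astype(int)
```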

(9)

From Coding View to Geometric View

label set      binary code
Y1 = {o}       y1 = [0, 1, 0]
Y2 = {a, o}    y2 = [1, 1, 0]
Y3 = {a, s}    y3 = [1, 0, 1]
Y4 = {o}       y4 = [0, 1, 0]
Y5 = {}        y5 = [0, 0, 0]

[figure: y1, y4, y2, y3, y5 plotted as vertices of the cube {0, 1}^3]

length-L binary code ⇔ vertex of hypercube {0, 1}^L

(10)

Geometric Interpretation of Binary Relevance

[figure: the label-set vertices y1, y4, y2, y3, y5 on the hypercube, projected onto the natural axes]

Binary Relevance: project to the natural axes & classify

(11)

Geometric Interpretation of Compressive Sensing

[figure: the label-set vertices y1, y4, y2, y3, y5 on the hypercube, projected onto a random flat]

Compressive Sensing:

project to random flat (linear subspace)

learn "on" the flat; decode to closest sparse vertex

other (better) flat? other (faster) decoding?

(12)

Our Contributions

Compression Coding &

Learnable-Compression Coding

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding

theoretical: justification for best learnable projection

practical: consistently better performance than compressive sensing (& binary relevance)

will now introduce the key ideas behind the approach

(13)

Faster Decoding: Round-based

Compressive Sensing Revisited

decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given "intermediate prediction" (real-valued vector) ỹ,

find closest sparse binary vector to ỹ: slow optimization of ℓ1-regularized objective

find closest any binary vector to ỹ: fast g(x) = round(ỹ)

round-based decoding: simple & faster alternative
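
As a sketch, round-based decoding is literally one thresholding step (the 0.5 cut assumes labels coded as 0/1):

```python
import numpy as np

def round_decode(y_tilde):
    """y_tilde = P^T r(x): real-valued intermediate prediction of length L."""
    return (y_tilde >= 0.5).astype(int)   # closest binary vector, coordinate-wise
```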

(14)

Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with some M-by-L random matrix P

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to desired output yn during compression (why?)

(15)

Novel Theoretical Guarantee

Linear Transform+Learn+Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then

(1/L) |g(x) △ Y|  ≤  const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² ),   where c = Py
   (Hamming loss)               ("learn" term)   ("compress" term)

‖r(x) − c‖²: prediction error from input to codeword

‖y − P^T c‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output yn during compression (indeed)

(16)

Proposed Approach 1:

Principal Label Space Transform

From Compressive Sensing to PLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M-by-L principal matrix P

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = round(P^T r(x))

principal directions: via Principal Component Analysis on {yn} (n = 1, ..., N)

physical meaning behind pm: key (linear) label correlations

PLST: improving CS by projecting to key correlations
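
A compact PLST sketch following these three steps, assuming ridge regression as the learner; the mean-shift of the label vectors is the usual PCA centering, and the helper names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def plst_train(X, Y, M):
    y_bar = Y.mean(axis=0)                        # center labels for PCA
    _, _, Vt = np.linalg.svd(Y - y_bar, full_matrices=False)
    P = Vt[:M]                                    # top-M principal directions (M x L)
    C = (Y - y_bar) @ P.T                         # compress: c_n = P (y_n - y_bar)
    reg = Ridge().fit(X, C)                       # learn r(x)
    return P, y_bar, reg

def plst_predict(P, y_bar, reg, X):
    Y_tilde = reg.predict(X) @ P + y_bar          # P^T r(x), shifted back
    return (Y_tilde >= 0.5).astype(int)           # round-based decoding
```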

(17)

Theoretical Guarantee of PLST Revisited

Linear Transform+Learn+Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then

(1/L) |g(x) △ Y|  ≤  const · ( ‖r(x) − c‖²  +  ‖y − P^T c‖² ),   where c = Py
   (Hamming loss)               ("learn" term)   ("compress" term)

‖y − P^T c‖²: encoding error, minimized during encoding

‖r(x) − c‖²: prediction error, minimized during learning

but good encoding may not be easy to learn; vice versa

PLST: minimizes the two errors separately (sub-optimal)

(can we do better by minimizing jointly?)

(18)

Proposed Approach 2:

Conditional Principal Label Space Transform

can we do better by minimizing jointly?

Yes, and easily so for ridge regression (closed-form solution)

From PLST to CPLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = P yn with the M-by-L conditional principal matrix P

2 learn: get regression function r(x) from xn to cn, ideally using ridge regression

3 decode: g(x) = round(P^T r(x))

conditional principal directions: top eigenvectors of Y^T XX^† Y, key (linear) label correlations that are "easy to learn"

CPLST: project to key learnable correlations

—can also pair with kernel regression (non-linear)
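
A rough sketch of the conditional directions, assuming plain least squares for the hat matrix H = XX^+ (Chen & Lin (2012) pair this with ridge regression and a regularized H); computing H explicitly is only viable for small N and is used here purely for illustration. Prediction then reuses the same round-based decoding as in the PLST sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge

def cplst_train(X, Y, M):
    y_bar = Y.mean(axis=0)
    Z = X @ np.linalg.pinv(X) @ (Y - y_bar)       # H Y: the part of Y that is "easy to learn"
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:M]                                    # conditional principal directions (M x L)
    C = (Y - y_bar) @ P.T                         # compress with the learnable directions
    reg = Ridge().fit(X, C)                       # learn, ideally ridge regression
    return P, y_bar, reg
```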

(19)

Hamming Loss Comparison: Full-BR, PLST & CS

[figure: Hamming loss vs. number of reduced dimensions on the mediamill data set, two panels (Linear Regression and Decision Tree), comparing Full-BR (no reduction), CS, and PLST]

PLST better than Full-BR: fewer dimensions, similar (or better) performance

PLST better than CS: faster, better performance

similar findings across data sets and regression algorithms

(20)

Hamming Loss Comparison: PLST & CPLST

[figure: Hamming loss vs. number of dimensions on the yeast data set (Linear Regression), comparing pbr, cpa, plst, and cplst]

CPLST better than PLST: better performance across all dimensions

similar findings across data sets and regression algorithms

(21)

Topics in this Talk

1 Compression Coding

—condense for efficiency

—capture hidden correlation

2 Learnable-Compression Coding

—condense-by-learnability for better efficiency

—capture hidden & learnable correlation

3 Error-Correction Coding

—expand for accuracy

—capture hidden combination

(22)

Our Contributions (Second Part)

Error-correction Coding

A Novel Framework for Label Space Error-correction

algorithmic: generalize a popular existing algorithm

(RAkEL; Tsoumakas & Vlahavas, 2007) and explain it through the coding view

theoretical: link learning performance to error-correcting ability

practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)

(23)

Key Idea: Redundant Information

General Error-correcting Codes (ECC)

noisy channel

commonly used in communication systems

detect & correct errors after transmitting data over a noisy channel

encode data redundantly

ECC for Machine Learning (successful for multi-class classification)

learn redundant bits b =⇒ correct prediction errors

(24)

Proposed Framework: Multi-labeling with ECC

encode to add redundant information: enc(·) : {0, 1}^L → {0, 1}^M

decode to locate the most probable binary vector: dec(·) : {0, 1}^M → {0, 1}^L

transformation to a larger multi-label classification task with labels b

PLST: M ≪ L (works for large L); MLECC: M > L (works for small L)
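
The framework can be sketched generically: enc/dec are any pair of maps between {0, 1}^L and {0, 1}^M, and any multi-label learner can be trained on the redundant bits. The function names here are illustrative, not part of the talk.

```python
import numpy as np

def mlecc_train(X, Y, enc, fit_multilabel):
    """enc: {0,1}^L -> {0,1}^M; fit_multilabel: (X, B) -> bit predictor h."""
    B = np.array([enc(y) for y in Y])            # encode each label-set into M redundant bits
    return fit_multilabel(X, B)                  # learn the larger multi-label task on b

def mlecc_predict(h, X, dec):
    B_tilde = h(X)                               # (possibly erroneous) predicted bits
    return np.array([dec(b) for b in B_tilde])   # ECC decoding corrects some bit errors
```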

(25)

Simple Theoretical Guarantee

ECC encode+Larger Multi-label Learning+ECC decode

Theorem

Let g(x) = dec(b̃) with b̃ = h(x). Then,

⟦g(x) ≠ Y⟧  ≤  const · (Hamming loss of h(x)) / (ECC strength + 1)
  (0/1 loss)

PLST: principal directions + decent regression

MLECC: which ECC balances strength & difficulty?

(26)

Simplest ECC: Repetition Code

encoding: y ∈ {0, 1}^L → b ∈ {0, 1}^M

repeat each bit M/L times

L = 4, M = 28: 1010 −→ 1111111 0000000 1111111 0000000 (each bit repeated 28/4 = 7 times)

permute the bits randomly

decoding: b̃ ∈ {0, 1}^M → ỹ ∈ {0, 1}^L

majority vote on each original bit

L = 4, M = 28: strength of repetition code (REP) = 3

RAkEL = REP (code) + a special powerset (channel)
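
A minimal repetition-code sketch matching the example above; the random bit permutation is omitted for clarity, and M is assumed to be a multiple of L.

```python
import numpy as np

def rep_encode(y, M):
    return np.repeat(y, M // len(y))                   # e.g. 1010 -> 1111111 0000000 1111111 0000000

def rep_decode(b, L):
    blocks = b.reshape(L, -1)                          # one block of M/L copies per original bit
    return (blocks.mean(axis=1) > 0.5).astype(int)     # majority vote per original bit

y = np.array([1, 0, 1, 0])                             # L = 4
b = rep_encode(y, 28)                                  # M = 28: each bit repeated 7 times
assert np.array_equal(rep_decode(b, 4), y)
```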

(27)

Slightly More Sophisticated: Hamming Code

HAM(7, 4) Code

{0, 1}^4 → {0, 1}^7 via adding 3 parity bits

—physical meaning: label combinations

b4 = y0 ⊕ y1 ⊕ y3, b5 = y0 ⊕ y2 ⊕ y3, b6 = y1 ⊕ y2 ⊕ y3

e.g. 1011 −→ 1011010

strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)

{0, 1}^L −−REP−→ {0, 1}^{4M/7} −−HAM(7, 4) on each 4-bit block−→ {0, 1}^M

L = 4, M = 28: strength of HAMR = 4, better than REP!

HAMR + the special powerset:

improve RAkEL on code strength
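
The parity equations above translate directly into code. This sketch shows only the encoding side (syndrome decoding of HAM(7, 4) is omitted) and reproduces the 1011 → 1011010 example.

```python
import numpy as np

def ham74_encode(y):
    # b4 = y0 xor y1 xor y3, b5 = y0 xor y2 xor y3, b6 = y1 xor y2 xor y3
    y0, y1, y2, y3 = y
    return np.array([y0, y1, y2, y3,
                     y0 ^ y1 ^ y3,
                     y0 ^ y2 ^ y3,
                     y1 ^ y2 ^ y3])

print(ham74_encode([1, 0, 1, 1]))   # [1 0 1 1 0 1 0], i.e. 1011 -> 1011010
```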

(28)

Even More Sophisticated Codes

Bose-Chaudhuri-Hocquenghem Code (BCH)

modern code in CD players

sophisticated extension of Hamming, with more parity bits

codeword length M = 2^p − 1 for p ∈ N

L = 4, M = 31, strength of BCH = 5

Low-density Parity-check Code (LDPC)

modern code for satellite communication

connect ECC and Bayesian learning

approach the theoretical limit in some cases

let’s compare!

(29)

Different ECCs on 3-label Powerset

(scene data set w/ L = 6)

learner: special powerset with Random Forests

REP + special powerset ≈ RAkEL

[figure: 0/1 loss and Hamming loss of the different ECCs on scene]

Compared to RAkEL (on most data sets),

HAMR: better 0/1 loss, similar Hamming loss

BCH: even better 0/1 loss, at the cost of Hamming loss

(30)

Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)

—condense for efficiency: better (than BR) approach PLST

— key tool: PCA from Statistics/Signal Processing

2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)

—condense learnably for better efficiency: better (than PLST) approach CPLST

— key tool: Ridge Regression from Statistics (+ PCA)

3 Error-correction Coding (Ferng & Lin, ACML Conference 2011; TNNLS Journal 2013)

—expand for accuracy: better (than REP) code HAMR or BCH

— key tool: ECC from Information Theory

Thank you! Questions?
