Label Space Coding for Multi-label Classification
Hsuan-Tien Lin, National Taiwan University
Talk at IIS Sinica, April 26, 2012
joint works with
Farbound Tai (MLD Workshop 2011, NC Journal 2012) &
Chun-Sung Ferng (ACML Conference 2011, NTU Thesis 2012)
Which Fruit?
?
apple orange strawberry kiwi
multi-class classification:
classify input (picture) to one category (label)
Which Fruits?
?: {orange, strawberry, kiwi}
apple orange strawberry kiwi
multi-label classification:
classify input to multiple (or no) categories
Powerset: Multi-label Classification via Multi-class
Multi-class w/ L = 4 classes: 4 possible outcomes
{a, o, s, k}
Multi-label w/ L = 4 classes: 2^4 = 16 possible outcomes
2^{a, o, s, k}
{ φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk }
Powerset approach: reduction to multi-class classification
difficulties for large L:
computation (super-large 2^L)
—hard to construct classifier
sparsity (no example for some of the 2^L combinations)
—hard to discover hidden combinations
Powerset: feasible only for small L with enough examples for every combination
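A minimal sketch of this reduction (illustrative, not from the talk; the scikit-learn learner and helper names are assumptions):

```python
# Powerset reduction sketch (illustrative): each distinct label-set
# observed in training becomes one multi-class category.
from sklearn.tree import DecisionTreeClassifier  # any multi-class learner

def powerset_fit(X, Y_sets):
    # map each observed label-set to a class index
    classes = sorted({frozenset(Y) for Y in Y_sets}, key=sorted)
    to_idx = {c: i for i, c in enumerate(classes)}
    clf = DecisionTreeClassifier().fit(X, [to_idx[frozenset(Y)] for Y in Y_sets])
    return clf, classes

def powerset_predict(clf, classes, X):
    return [set(classes[i]) for i in clf.predict(X)]
```

The sparsity difficulty shows up directly: a label combination absent from training can never be predicted.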
What Tags?
?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}
another multi-label classification problem:
tagging input to multiple categories
Binary Relevance: Multi-label Classification via Yes/No
Binary Classification: {yes, no}
Multi-label w/ L classes: L yes/no questions
machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N),
architecture (N), chemistry (N), textbook (Y), children book (N), etc.
Binary Relevance approach:
reduction to multiple isolated binary classification tasks
disadvantages:
isolation—hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)
unbalancedness—few yes, many no
Binary Relevance: simple (& good) benchmark with known disadvantages
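A minimal sketch of Binary Relevance (illustrative; the (N, L) binary label matrix and the scikit-learn learner are assumptions):

```python
# Binary Relevance sketch (illustrative): one independent yes/no
# classifier per label, trained in isolation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def br_fit(X, Y):          # Y: (N, L) binary label matrix
    return [LogisticRegression().fit(X, Y[:, l]) for l in range(Y.shape[1])]

def br_predict(models, X):
    return np.column_stack([m.predict(X) for m in models])
```

The isolation disadvantage is visible in the code: each classifier sees only its own column of Y, so relations such as ML being a subset of AI are never exploited.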
Multi-label Classification Setup
Given
N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,...,L}
fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}
tags: X = encoding(merchandise), Y_n ⊆ {1, 2, . . . , L}
Goal
a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)
0/1 loss: any discrepancy ⟦g(x) ≠ Y⟧
Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|
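A small worked example of the two losses (illustrative; label-sets are written as binary vectors, anticipating the coding view introduced next):

```python
# 0/1 loss vs. Hamming loss on one prediction (illustrative).
import numpy as np

y_true = np.array([1, 0, 1, 0])   # Y = {1, 3}, L = 4
y_pred = np.array([1, 1, 1, 0])   # g(x) = {1, 2, 3}

zero_one = int(not np.array_equal(y_pred, y_true))   # any mismatch -> 1
hamming = np.mean(y_pred != y_true)                  # |g(x) triangle Y| / L
print(zero_one, hamming)          # -> 1 0.25
```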
multi-label classification: hot and important
Topics in this Talk
1 Coding/Geometric View of Multi-label Classification
—unify existing algorithms w/ intuitive explanations
2 Compression Coding
—condense for efficiency
—capture hidden correlation
3 Error-correction Coding
—expand for accuracy
—capture hidden combination
Coding/Geometric View of Multi-label
Classification
From Label-set to Coding View
label set      apple  orange  strawberry  binary code
Y1 = {o}       0 (N)  1 (Y)   0 (N)       y1 = [0, 1, 0]
Y2 = {a, o}    1 (Y)  1 (Y)   0 (N)       y2 = [1, 1, 0]
Y3 = {a, s}    1 (Y)  0 (N)   1 (Y)       y3 = [1, 0, 1]
Y4 = {o}       0 (N)  1 (Y)   0 (N)       y4 = [0, 1, 0]
Y5 = {}        0 (N)  0 (N)   0 (N)       y5 = [0, 0, 0]
subset Y of {1, 2, . . . , L} ⇔ length-L binary code y
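A minimal sketch of this correspondence (illustrative; the three-label alphabet mirrors the table above):

```python
# Label-set <-> binary code (illustrative).
import numpy as np

LABELS = ["a", "o", "s"]    # apple, orange, strawberry

def encode(Y):              # e.g. {"a", "o"} -> [1, 1, 0]
    return np.array([int(l in Y) for l in LABELS])

def decode(y):              # e.g. [1, 1, 0] -> {"a", "o"}
    return {l for l, bit in zip(LABELS, y) if bit}
```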
Existing Approach: Compressive Sensing
General Compressive Sensing
sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p1, p2, . . . , pM}
Compressive Sensing for Multi-label Classification (Hsu et al., 2009)
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P = [p1, p2, . . . , pM]^T
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = find closest sparse binary vector to P^T r(x)
Compressive Sensing:
reduction to multi-output regression w/ codewords c
efficient in training: random projection w/ M ≪ L
inefficient in testing: time-consuming decoding
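A minimal sketch of the compress-and-learn steps (illustrative; Ridge regression and the helper name are assumptions):

```python
# Compressive-sensing reduction, steps 1-2 (illustrative).
import numpy as np
from sklearn.linear_model import Ridge

def cs_fit(X, Y, M, seed=0):                   # Y: (N, L) sparse binary labels
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((M, Y.shape[1]))   # random M-by-L matrix
    reg = Ridge().fit(X, Y @ P.T)              # learn r(x) -> codeword c = P y
    return P, reg
```

Step 3, the search for the closest sparse binary vector, is the time-consuming part; the round-based decoding introduced below replaces it.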
From Coding View to Geometric View
label set      binary code
Y1 = {o}       y1 = [0, 1, 0]
Y2 = {a, o}    y2 = [1, 1, 0]
Y3 = {a, s}    y3 = [1, 0, 1]
Y4 = {o}       y4 = [0, 1, 0]
Y5 = {}        y5 = [0, 0, 0]
[figure: y1, . . . , y5 drawn as vertices of the cube {0, 1}^3]
length-L binary code ⇔ vertex of hypercube {0, 1}^L
Geometric Interpretation of Powerset
[figure: hypercube with the training codes y1, . . . , y5 as vertices]
Powerset: directly classify to the vertices of the hypercube
Geometric Interpretation of Binary Relevance
[figure: codes projected onto each natural axis of the hypercube]
Binary Relevance: project to the natural axes & classify
Geometric Interpretation of Compressive Sensing
[figure: codes projected onto a random flat (linear subspace) inside the hypercube]
Compressive Sensing:
project to a random flat (linear subspace)
learn “on” the flat; decode to the closest sparse vertex
other (better) flat? other (faster) decoding?
Our Contributions (First Part)
Compression Coding (Using Geometry)
A Novel Approach for Label Space Compression
algorithmic: scheme for fast decoding
theoretical: justification for the best flat
practical: significantly better performance than compressive sensing (& binary relevance)
will now introduce the key ideas behind the approach
Faster Decoding: Round-based
Compressive Sensing Revisited
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)
find closest sparse binary vector to ỹ: slow (optimization of an ℓ1-regularized objective)
find closest any binary vector to ỹ: fast
g(x) = round(ỹ)
round-based decoding: simple & faster alternative
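A minimal sketch of round-based decoding (illustrative; thresholding at 1/2 is coordinate-wise rounding):

```python
# Round-based decoding (illustrative): threshold each coordinate
# instead of searching for the closest sparse vertex.
import numpy as np

def round_decode(P, r_x):       # r_x: predicted codeword r(x), shape (M,)
    y_tilde = P.T @ r_x         # back-project from the flat to label space
    return (y_tilde >= 0.5).astype(int)
```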
Better Flat: Principal Directions
Compressive Sensing Revisited
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M-by-L random matrix P
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = find closest sparse binary vector to P^T r(x)
random flat: arbitrary directions
best flat: principal directions
principal directions/flat: best approximation to the vertices y_n during compression (why?)
Novel Theoretical Guarantee
Linear Transform + Regress + Round-based Decoding
Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = P y,

(1/L) |g(x) △ Y|  ≤  const · ( ‖r(x) − c‖^2 + ‖y − P^T c‖^2 )
  Hamming loss                     learn          compress

‖r(x) − c‖^2: prediction error from input to codeword
‖y − P^T c‖^2: encoding error from vertex to codeword
principal directions/flat: best approximation to the vertices y_n during compression (indeed)
Proposed Approach: Principal Label Space Transform
From Compressive Sensing to PLST
1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P(y_n − o) with the M-by-L principal matrix P and some reference point o
2 learn: get regression function r(x) from x_n to c_n
3 decode: g(x) = round(P^T r(x) + o)
reference point o: allows a flat not passing through the origin
best o and P:

min over o, p1, p2, . . . , pM of  Σ_{n=1}^{N} ‖ y_n − o − P^T P (y_n − o) ‖^2
subject to orthonormal vectors p_m
Solving for Principal Directions
min over o, p1, p2, . . . , pM of  Σ_{n=1}^{N} ‖ y_n − o − P^T P (y_n − o) ‖^2
subject to orthonormal vectors p_m
solution: Principal Component Analysis on {y_n}_{n=1}^{N}
best o: (1/N) Σ_{n=1}^{N} y_n
best p_m: top eigenvectors of Σ_{n=1}^{N} (y_n − o)(y_n − o)^T
physical meaning behind p_m:
key (linear) label correlations (e.g. like eigenfaces in face recognition)
PLST: reduction to multi-output regression by projecting to key correlations
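A minimal sketch of PLST (illustrative; the SVD stands in for the explicit eigendecomposition, and Ridge regression is an assumption):

```python
# PLST sketch (illustrative): PCA on the label vectors, regression in
# the compressed space, round-based decoding with reference point o.
import numpy as np
from sklearn.linear_model import Ridge

def plst_fit(X, Y, M):                      # Y: (N, L) binary label matrix
    o = Y.mean(axis=0)                      # best reference point: label mean
    _, _, Vt = np.linalg.svd(Y - o, full_matrices=False)
    P = Vt[:M]                              # top-M principal directions (M-by-L)
    reg = Ridge().fit(X, (Y - o) @ P.T)     # learn r(x) -> c = P(y - o)
    return o, P, reg

def plst_predict(o, P, reg, X):
    return ((reg.predict(X) @ P) + o >= 0.5).astype(int)  # round(P^T r(x) + o)
```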
Hamming Loss Comparison: Full-BR, PLST & CS
[figure: Hamming loss vs. number of dimensions M on mediamill, two panels (Linear Regression, Decision Tree), comparing Full-BR (no reduction), CS, and PLST]
PLST better than Full-BR: fewer dimensions, similar (or better) performance
PLST better than CS: faster, better performance
similar findings across data sets and regression algorithms
Semi-summary on PLST
reduction to multi-output regression
project to principal directions and capture key correlations
efficient learning (label space compression)
efficient decoding (round-based)
sound theoretical guarantee + good practical performance (better than CS & BR)
expansion (channel coding) instead of compression (“lossy” source coding)? YES!
will start by reviewing an existing algorithm
Random k-labelsets
RAndom k-labELsets (Tsoumakas & Vlahavas, 2007)
more efficient than Powerset: many 2^k instead of 2^L
more robust than Powerset: hidden combinations detected
RAkEL:
reduction to many 2^k-category classification tasks
Random k-labelsets from Coding View
RAndom k-labELsets (Tsoumakas & Vlahavas, 2007)
encode (y → b): repetition & permutation
learn: k-label powerset, i.e. run Powerset on size-k chunks of bits
decode (b̃ → ỹ): majority vote
RAkEL: encode + learn + decode
Our Contributions (Second Part)
Error-correction Coding
A Novel Framework for Label Space Error-correction
algorithmic: generalize RAkEL and explain it through the coding view
theoretical: link learning performance to error-correcting ability
practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)
will now introduce the framework
Key Idea: Redundant Information
General Error-correcting Codes (ECC)
noisy channel
commonly used in communication systems
detect & correct errors after transmitting data over a noisy channel
encode data redundantly
ECC for Machine Learning (successful for multi-class classification):
learn redundant bits =⇒ correct errors in the predictions of b
Proposed Framework: Multi-labeling with ECC
encode to add redundant information: enc(·) : {0, 1}^L → {0, 1}^M
decode to locate the most probable binary vector: dec(·) : {0, 1}^M → {0, 1}^L
reduction to larger multi-label classification with labels b
PLST: M ≪ L (works for large L); ML-ECC: M > L (works for small L)
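A skeleton of the reduction (illustrative; enc, dec, and the base-learner callbacks are hypothetical placeholders for any concrete choices):

```python
# ML-ECC reduction skeleton (illustrative): any ECC enc/dec pair plus
# any multi-label learner h for the longer codewords b.
import numpy as np

def mlecc_fit(X, Y, enc, base_fit):        # Y: (N, L) -> B: (N, M), M > L
    B = np.array([enc(y) for y in Y])      # encode label vectors redundantly
    return base_fit(X, B)                  # e.g. Binary Relevance on M bits

def mlecc_predict(h, X, dec, base_predict):
    B_pred = base_predict(h, X)            # noisy codeword predictions b~
    return np.array([dec(b) for b in B_pred])  # ECC decoding corrects errors
```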
Simple Theoretical Guarantee
ECC encode + Larger Multi-label Learning + ECC decode
Theorem
Let g(x) = dec(b̃) with b̃ = h(x). Then,

⟦g(x) ≠ Y⟧  ≤  const · (Hamming loss of h(x)) / (ECC strength + 1)
  0/1 loss

PLST: principal directions + decent regression
ML-ECC: which ECC balances strength & difficulty?
Simplest ECC: Repetition Code
encoding: y ∈ {0, 1}^L → b ∈ {0, 1}^M, repeat each bit M/L times
L = 4, M = 28 : 1010 −→ 1111111 0000000 1111111 0000000 (each bit repeated 28/4 = 7 times)
permute the bits randomly
decoding: b̃ ∈ {0, 1}^M → ỹ ∈ {0, 1}^L, majority vote on each original bit
L = 4, M = 28: strength of repetition code (REP) = 3
RAkEL = REP + k-label powerset
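A minimal sketch of REP (illustrative; the random bit permutation is omitted for clarity):

```python
# Repetition code REP (illustrative): repeat each bit M/L times,
# decode by per-bit majority vote.
import numpy as np

def rep_encode(y, M):        # y: (L,) binary vector, L divides M
    return np.repeat(y, M // len(y))

def rep_decode(b, L):        # majority vote over each length-(M/L) block
    return (b.reshape(L, -1).mean(axis=1) >= 0.5).astype(int)

b = rep_encode(np.array([1, 0, 1, 0]), 28)
b[[0, 5, 9]] ^= 1                        # flip any 3 bits (strength = 3)
print(rep_decode(b, 4))                  # -> [1 0 1 0], errors corrected
```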
Slightly More Sophisticated: Hamming Code
HAM(7, 4) Code
{0, 1}^4 → {0, 1}^7 via adding 3 parity bits
—physical meaning: label combinations
b4 = y0 ⊕ y1 ⊕ y3,  b5 = y0 ⊕ y2 ⊕ y3,  b6 = y1 ⊕ y2 ⊕ y3
e.g. 1011 −→ 1011010
strength = 1 (weak)
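The encoding half of HAM(7,4), following the slide's parity formulas (illustrative; decoding is omitted):

```python
# HAM(7,4) encoding (illustrative): append three parity bits, each a
# XOR of three message bits, matching the formulas above.
import numpy as np

def ham74_encode(y):                 # y: 4 message bits -> 7-bit codeword
    y0, y1, y2, y3 = y
    return np.array([y0, y1, y2, y3,
                     y0 ^ y1 ^ y3,   # b4
                     y0 ^ y2 ^ y3,   # b5
                     y1 ^ y2 ^ y3])  # b6

print(ham74_encode(np.array([1, 0, 1, 1])))   # -> [1 0 1 1 0 1 0]
```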
Our Proposed Code: Hamming on Repetition (HAMR)
{0, 1}^L −−REP−→ {0, 1}^{4M/7} −−HAM(7, 4) on each 4-bit block−→ {0, 1}^{7·(M/7)} = {0, 1}^M
L = 4, M = 28: strength of HAMR = 4, better than REP!
HAMR + k-label powerset:
improvement of RAkEL on code strength
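A sketch of HAMR encoding (illustrative; it composes rep_encode and ham74_encode from the two sketches above, and omits permutation):

```python
# HAMR encoding (illustrative): REP down to 4M/7 bits, then HAM(7,4)
# on each 4-bit block, giving M bits total.
import numpy as np

def hamr_encode(y, M):                   # assumes len(y) divides 4M/7
    inner = rep_encode(y, 4 * M // 7)    # REP stage: L bits -> 4M/7 bits
    blocks = inner.reshape(-1, 4)        # M/7 blocks of 4 bits each
    return np.concatenate([ham74_encode(b) for b in blocks])

print(len(hamr_encode(np.array([1, 0, 1, 0]), 28)))   # -> 28
```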
Even More Sophisticated Codes
Bose-Chaudhuri-Hocquenghem Code (BCH)
modern code in CD players
sophisticated extension of Hamming, with more parity bits
codeword length M = 2^p − 1 for p ∈ N
L = 4, M = 31: strength of BCH = 5
Low-density Parity-check Code (LDPC)
modern code for satellite communication
connects ECC and Bayesian learning
approaches the theoretical limit in some cases
let’s compare!
Different ECCs on 3-label Powerset
(scene data set w/ L = 6) learner: 3-label powerset with Random Forests; REP + 3-label powerset ≈ RAkEL
[figure: 0/1 loss and Hamming loss comparisons across the ECCs]
Comparing to RAkEL (on most data sets),
HAMR: better 0/1 loss, similar Hamming loss
BCH: even better 0/1 loss, pay for it in Hamming loss
Different ECCs on Binary Relevance
(scene data set w/ L = 6) Binary Relevance: simply 1-label powerset
REP + Binary Relevance ≈ Binary Relevance (with aggregation)
[figure: 0/1 loss and Hamming loss comparisons across the ECCs]
Comparing to BR (on most data sets),
BCH/HAMR + BR: better 0/1 loss, better Hamming loss
Semi-summary on ML-ECC
reduction to larger multi-label classification
encode via error-correcting code and capture label combinations (parity bits)
effective decoding (error-correcting)
simple theoretical guarantee + good practical performance
to improve RAkEL, replace REP by
HAMR =⇒ lower 0/1 loss, similar Hamming loss
BCH =⇒ even lower 0/1 loss, but higher Hamming loss
to improve Binary Relevance, use
HAMR or BCH =⇒ lower 0/1 loss, lower Hamming loss
Conclusion
1 Coding/Geometric View of Multi-label Classification
—useful in linking to Information Theory & visualizing
2 Compression Coding
—condense for efficiency: better approach PLST
3 Error-correction Coding
—expand for accuracy: better code HAMR or BCH
More...
more geometric explanations (Tai & Lin, NC Journal 2012)
beyond standard ECC-decoding (Ferng, NTU Thesis 2012)
improved PLST (Chen, NTU Thesis 2012)
dynamic instead of static coding (...), combine ML-ECC & PLST (...)
Thank you. Questions?