## Label Space Coding for Multi-label Classification

Hsuan-Tien Lin National Taiwan University

Talk at NTHU, 04/11/2014

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Chun-Sung Ferng (ACML Conference 2011, IEEE TNNLS Journal 2013) &

Yao-Nan Chen (NIPS Conference 2012)

H.-T. Lin (NTU) Label Space Coding 04/11/2014 0 / 34

Multi-label Classification

## Which Fruit?

**?**

apple orange strawberry kiwi

multi-class classification:

classify input (picture) to **one category (label)**

—How?


## Supervised Machine Learning

Human learning: parent → (picture, category) pairs → kid's brain (choosing among possibilities) → good decision function

Machine learning: truth $f(\mathbf{x})$ + noise $e(\mathbf{x})$ → examples (picture $\mathbf{x}_n$, category $y_n$) → learning algorithm (choosing from the learning model $\{g_\alpha(\mathbf{x})\}$) → good decision function $g(\mathbf{x}) \approx f(\mathbf{x})$

challenge:

see only $\{(\mathbf{x}_n, y_n)\}$ without knowing $f(\mathbf{x})$ or $e(\mathbf{x})$

$\overset{?}{\Longrightarrow}$ **generalize to unseen $(\mathbf{x}, y)$ w.r.t. $f(\mathbf{x})$**


## Which Fruits?

**?: {orange, strawberry, kiwi}**

apple orange strawberry kiwi

multi-label classification:

classify input to **multiple (or no)** categories


## Powerset: Multi-label Classification via Multi-class

Multi-class w/ L = 4 classes: 4 possible outcomes

{a, o, s, k}

Multi-label w/ L = 4 classes: $2^4 = 16$ **possible outcomes**

$2^{\{a, o, s, k\}}$ ⇕ { φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk }

**Powerset approach**: transformation to multi-class classification

difficulties for large L:

**computation** (super-large $2^L$): hard to construct classifier

**sparsity** (no example for some of the $2^L$ combinations): hard to discover hidden combination

**Powerset**: feasible only for small L with enough examples for every combination
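
The growth of the powerset can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper name `label_powerset` is hypothetical, and the label abbreviations are the ones used on the slide.

```python
from itertools import combinations

def label_powerset(labels):
    """Return every subset of `labels` as a string, ordered by size."""
    subsets = []
    for size in range(len(labels) + 1):
        for combo in combinations(labels, size):
            subsets.append("".join(combo))
    return subsets

# The four fruit labels from the slide: apple, orange, strawberry, kiwi.
fruit_classes = label_powerset(["a", "o", "s", "k"])
print(len(fruit_classes))   # number of powerset classes for L = 4
print(fruit_classes)

# The number of powerset classes doubles with every added label.
for L in (4, 10, 20):
    print(L, 2 ** L)
```

Each string is one class of the transformed multi-class problem, and the empty string plays the role of φ.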


## What Tags?

**?**: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, etc.}

another **multi-label classification** problem:

**tagging** input to multiple categories


## Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L **yes/no questions**

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

**Binary Relevance approach**: transformation to **multiple isolated binary classification**

disadvantages:

**isolation**: hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)

**unbalanced**: few yes, many no

**Binary Relevance**: simple (& good) benchmark with known disadvantages
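
A minimal sketch of Binary Relevance, assuming a least-squares linear scorer as the binary learner (the slide does not fix a particular learner). The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_binary_relevance(X, Y):
    """Fit one least-squares linear scorer per label column of Y (N x L, 0/1)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append a bias feature
    weights = []
    for j in range(Y.shape[1]):                       # each label is learned in isolation
        target = 2.0 * Y[:, j] - 1.0                  # map {0, 1} to {-1, +1}
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.column_stack(weights)                   # shape (d + 1, L)

def predict_binary_relevance(W, X):
    """Answer the L yes/no questions independently."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W > 0).astype(int)

# Toy data: label 0 is "feature 0 positive", label 1 is "feature 1 positive".
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
Y = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
W = fit_binary_relevance(X, Y)
print(predict_binary_relevance(W, X))
```

The loop never looks at two label columns together, which is exactly the isolation disadvantage listed on the slide.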


## Multi-label Classification Setup

Given

N examples (input $\mathbf{x}_n$, label-set $Y_n$) $\in \mathcal{X} \times 2^{\{1, 2, \cdots, L\}}$

fruits: $\mathcal{X}$ = encoding(pictures), $Y_n \subseteq \{1, 2, \cdots, 4\}$

tags: $\mathcal{X}$ = encoding(merchandise), $Y_n \subseteq \{1, 2, \cdots, L\}$

Goal

a multi-label classifier $g(\mathbf{x})$ that **closely predicts** the label-set $Y$ associated with some **unseen** input $\mathbf{x}$ (by exploiting hidden relations/combinations between labels)

**0/1 loss**: any discrepancy $[\![\, g(\mathbf{x}) \neq Y \,]\!]$

**Hamming loss**: averaged symmetric difference $\frac{1}{L} |g(\mathbf{x}) \,\triangle\, Y|$
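
The two losses can be written directly on Python sets. This is a small sketch with hypothetical function names; labels are numbered 1 to L as in the setup.

```python
def zero_one_loss(predicted, truth):
    """1 if the predicted label-set differs from the truth in any way, else 0."""
    return int(set(predicted) != set(truth))

def hamming_loss(predicted, truth, num_labels):
    """Size of the symmetric difference, averaged over the L labels."""
    return len(set(predicted) ^ set(truth)) / num_labels

# Fruit example with L = 4: 1 = apple, 2 = orange, 3 = strawberry, 4 = kiwi.
truth = {2, 3, 4}
guess = {1, 3, 4}                        # apple predicted instead of orange
print(zero_one_loss(guess, truth))       # 1: the sets are not identical
print(hamming_loss(guess, truth, 4))     # 0.5: two of the four labels are wrong
```

The 0/1 loss treats a near miss like a complete miss, while the Hamming loss gives partial credit per label.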

**multi-label classification:** **hot and important**


## Topics in this Talk

1 Compression Coding: condense for efficiency; capture hidden correlation

2 Error-correction Coding: expand for accuracy; capture hidden combination

3 Learnable-Compression Coding: condense-by-learnability for **better** efficiency; capture hidden & **learnable** correlation


Compression Coding

## From Label-set to Coding View

| label set | apple | orange | strawberry | **binary code** |
|---|---|---|---|---|
| $Y_1 = \{o\}$ | 0 (N) | 1 (Y) | 0 (N) | $\mathbf{y}_1 = [0, 1, 0]$ |
| $Y_2 = \{a, o\}$ | 1 (Y) | 1 (Y) | 0 (N) | $\mathbf{y}_2 = [1, 1, 0]$ |
| $Y_3 = \{a, s\}$ | 1 (Y) | 0 (N) | 1 (Y) | $\mathbf{y}_3 = [1, 0, 1]$ |
| $Y_4 = \{o\}$ | 0 (N) | 1 (Y) | 0 (N) | $\mathbf{y}_4 = [0, 1, 0]$ |
| $Y_5 = \{\}$ | 0 (N) | 0 (N) | 0 (N) | $\mathbf{y}_5 = [0, 0, 0]$ |

label set $Y \subseteq \{1, 2, \cdots, L\}$ (an element of $2^{\{1, 2, \cdots, L\}}$) **⇔ length-$L$ binary code $\mathbf{y}$**
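
The correspondence between label sets and binary codes is a simple two-way mapping. The sketch below uses hypothetical helper names and the three fruit labels from the table.

```python
def encode_label_set(label_set, labels):
    """Turn a label set into a length-L list of 0/1 indicators."""
    return [1 if label in label_set else 0 for label in labels]

def decode_binary_code(code, labels):
    """Turn a length-L binary code back into the label set it represents."""
    return {label for label, bit in zip(labels, code) if bit == 1}

labels = ["a", "o", "s"]                      # apple, orange, strawberry
y3 = encode_label_set({"a", "s"}, labels)
print(y3)                                     # [1, 0, 1], the code of Y_3 in the table
print(decode_binary_code(y3, labels))         # the set containing 'a' and 's'
print(encode_label_set(set(), labels))        # [0, 0, 0], the empty label set
```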


## Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors $\mathbf{y} \in \{0, 1\}^L$ can be **robustly compressed** by projecting to $M \ll L$ basis vectors $\{\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_M\}$

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 **compress**: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with some $M$ by $L$ random matrix $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_M]^T$

2 **learn**: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$

3 **decode**: $g(\mathbf{x})$ = find closest sparse binary vector to $\mathbf{P}^T \mathbf{r}(\mathbf{x})$

**Compressive Sensing**:

efficient in training: random projection w/ $M \ll L$ (any better projection scheme?)

inefficient in testing: time-consuming decoding (any faster decoding method?)


## Our Contributions (First Part)

**Compression Coding**

A Novel Approach for Label Space Compression

algorithmic: scheme for **fast decoding**

theoretical: justification for **best projection**

practical: **significantly better performance** than compressive sensing (& binary relevance)


## Faster Decoding: Round-based

Compressive Sensing Revisited

**decode**: $g(\mathbf{x})$ = find closest sparse binary vector to $\tilde{\mathbf{y}} = \mathbf{P}^T \mathbf{r}(\mathbf{x})$

For any given "intermediate prediction" (real-valued vector) $\tilde{\mathbf{y}}$,

find closest sparse binary vector to $\tilde{\mathbf{y}}$: slow (optimization of an $\ell_1$-regularized objective)

find closest binary vector of any kind to $\tilde{\mathbf{y}}$: fast ($g(\mathbf{x}) = \text{round}(\tilde{\mathbf{y}})$)

round-based decoding: simple & faster alternative
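
Round-based decoding amounts to thresholding each entry at 0.5. The sketch below (function name and numbers are illustrative assumptions) also checks by brute force that the rounded vector is the closest binary vector in Euclidean distance.

```python
from itertools import product

import numpy as np

def round_decode(y_tilde):
    """Closest binary vector to y_tilde: threshold every entry at 0.5."""
    return (np.asarray(y_tilde) >= 0.5).astype(int)

y_tilde = np.array([0.9, 0.2, 0.6, -0.1])     # a made-up intermediate prediction
g = round_decode(y_tilde)

# Brute-force check over all 2^L binary vectors (only feasible for tiny L).
candidates = np.array(list(product([0, 1], repeat=len(y_tilde))))
distances = ((candidates - y_tilde) ** 2).sum(axis=1)
nearest = candidates[np.argmin(distances)]

print(g)
print(nearest)
```

The thresholding costs one pass over the L entries, whereas the brute-force search grows as $2^L$ and is shown only as a sanity check.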


## Better Projection: Principal Directions

Compressive Sensing Revisited

**compress**: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with some $M$ by $L$ random matrix $\mathbf{P}$

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to desired output $\mathbf{y}_n$ during compression (why?)


## Novel Theoretical Guarantee

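Linear Transform + Learn + Round-based Decoding Theorem (Tai and Lin, 2012)

If $g(\mathbf{x}) = \text{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$,

$$\underbrace{\frac{1}{L} |g(\mathbf{x}) \,\triangle\, Y|}_{\text{Hamming loss}} \le \text{const} \cdot \Big( \underbrace{\|\mathbf{r}(\mathbf{x}) - \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}}\|^2}_{\text{learn}} + \underbrace{\|\mathbf{y} - \mathbf{P}^T \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}}\|^2}_{\text{compress}} \Big)$$

$\|\mathbf{r}(\mathbf{x}) - \mathbf{c}\|^2$: prediction error from input to codeword

$\|\mathbf{y} - \mathbf{P}^T \mathbf{c}\|^2$: encoding error from desired output to codeword

principal directions: best approximation to desired output $\mathbf{y}_n$ during compression (indeed)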


## Proposed Approach: Principal Label Space Transform

From Compressive Sensing to PLST

1 **compress**: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with the $M$ by $L$ principal matrix $\mathbf{P}$

2 **learn**: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$

3 **decode**: $g(\mathbf{x}) = \text{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$

principal directions: via Principal Component Analysis on $\{\mathbf{y}_n\}_{n=1}^N$

physical meaning behind $\mathbf{p}_m$: key (linear) label correlations

PLST: improving CS by projecting to **key correlations**
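
The three PLST steps can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: centering the labels by their mean before PCA, the bias feature, the least-squares regressor, and all function names are assumptions of this sketch.

```python
import numpy as np

def plst_fit(X, Y, M):
    """compress + learn: PCA on the label matrix, then linear regression to the codes."""
    y_mean = Y.mean(axis=0)                        # assumption: center labels before PCA
    Z = Y - y_mean
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:M]                                     # M x L principal matrix, orthonormal rows
    C = Z @ P.T                                    # codes c_n = P (y_n - mean), one per row
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias feature
    W, *_ = np.linalg.lstsq(Xb, C, rcond=None)     # regression r(x) from x_n to c_n
    return P, W, y_mean

def plst_predict(X, P, W, y_mean):
    """decode: map predicted codes back with P^T, undo the centering, and round."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    y_tilde = (Xb @ W) @ P + y_mean
    return (y_tilde >= 0.5).astype(int)

# Toy data: labels 0 and 1 are perfectly correlated, so 3 labels fit in M = 2 dimensions.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
Y = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]])
P, W, y_mean = plst_fit(X, Y, M=2)
pred = plst_predict(X, P, W, y_mean)
print(P.shape)
print(pred)
```

In this toy case one principal direction captures the duplicated pair of labels, which is the "key correlation" the slide refers to, and the remaining direction carries the third label.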


## Hamming Loss Comparison: Full-BR, PLST & CS

[Figure: Hamming loss of Full-BR (no reduction), CS, and PLST on mediamill, in two panels (Linear Regression and Decision Tree); horizontal-axis ticks from 0 to 100, vertical-axis ticks from 0.03 to 0.05]

PLST better than Full-BR: fewer dimensions, similar (or **better**) performance

PLST better than CS: faster, **better performance**

similar findings across **data sets and regression algorithms**


## Semi-summary on PLST

project to **principal directions** and capture key correlations

efficient learning (after **label space compression**)

efficient decoding (round-based)

sound theoretical guarantee + **good practical performance** (better than CS & BR)

**expansion** (channel coding) instead of compression ("lossy" source coding)? YES!

Error-correction Coding

## Our Contributions (Second Part)

**Error-correction Coding**

A Novel Framework for Label Space Error-correction

algorithmic: generalize a popular existing algorithm (RAkEL; Tsoumakas & Vlahavas, 2007) and explain it through the **coding view**

theoretical: link learning performance to **error-correcting ability**

practical: explore **choices of error-correcting code** and obtain **better performance** than RAkEL (& binary relevance)


## Key Idea: Redundant Information

General Error-correcting Codes (ECC)

commonly used in communication systems

detect & correct errors after transmitting data over a noisy channel

encode data redundantly

ECC for Machine Learning (successful for multi-class classification)

in place of the noisy channel: predictions of $\mathbf{b}$

learn redundant bits $\Longrightarrow$ **correct prediction errors**


## Proposed Framework: Multi-labeling with ECC

**encode** to add redundant information: $\text{enc}(\cdot) : \{0, 1\}^L \to \{0, 1\}^M$

**decode** to locate most possible binary vector: $\text{dec}(\cdot) : \{0, 1\}^M \to \{0, 1\}^L$

transformation to **larger multi-label classification** with labels $\mathbf{b}$

PLST: $M \ll L$ (works for large $L$);

**MLECC**: $M > L$ (works for small $L$)


## Simple Theoretical Guarantee

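ECC encode + Larger Multi-label Learning + ECC decode Theorem

Let $g(\mathbf{x}) = \text{dec}(\tilde{\mathbf{b}})$ with $\tilde{\mathbf{b}} = h(\mathbf{x})$. Then,

$$\underbrace{[\![\, g(\mathbf{x}) \neq Y \,]\!]}_{\text{0/1 loss}} \le \text{const.} \cdot \frac{\text{Hamming loss of } h(\mathbf{x})}{\text{ECC strength} + 1}.$$

PLST: principal directions + decent regression

**MLECC**: which ECC balances strength & difficulty?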


## Simplest ECC: Repetition Code

encoding: $\mathbf{y} \in \{0, 1\}^L \to \mathbf{b} \in \{0, 1\}^M$

**repeat each bit** $\frac{M}{L}$ times

$L = 4, M = 28$: $1010 \longrightarrow \underbrace{1111111}_{28/4 = 7}\,0000000\,1111111\,0000000$

permute the bits randomly

decoding: $\tilde{\mathbf{b}} \in \{0, 1\}^M \to \tilde{\mathbf{y}} \in \{0, 1\}^L$

**majority vote on each original bit**

$L = 4, M = 28$: strength of repetition code (REP) = 3

RAkEL = REP (code) + a special powerset (channel)
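
The repetition code and its majority-vote decoder fit in a few lines of NumPy. This is a sketch under stated assumptions: the function names are hypothetical, and the random permutation is drawn once with a fixed seed and shared by encoder and decoder.

```python
import numpy as np

def rep_encode(y, M, perm):
    """Repeat each of the L bits M/L times, then shuffle positions with `perm`."""
    y = np.asarray(y, dtype=int)
    repeated = np.repeat(y, M // len(y))
    return repeated[perm]

def rep_decode(b, L, perm):
    """Undo the shuffle, then take a majority vote inside each block of copies."""
    b = np.asarray(b, dtype=int)
    repeated = np.empty_like(b)
    repeated[perm] = b
    blocks = repeated.reshape(L, -1)
    return (2 * blocks.sum(axis=1) > blocks.shape[1]).astype(int)

L, M = 4, 28
perm = np.random.default_rng(0).permutation(M)
y = np.array([1, 0, 1, 0])
b = rep_encode(y, M, perm)

noisy = b.copy()
noisy[np.flatnonzero(perm < 3)] ^= 1        # flip 3 of the 7 copies of the first bit
recovered = rep_decode(noisy, L, perm)

too_noisy = b.copy()
too_noisy[np.flatnonzero(perm < 4)] ^= 1    # flip 4 of the 7 copies: the majority is lost
broken = rep_decode(too_noisy, L, perm)

print(recovered)
print(broken)
```

Three flipped copies of one bit are still outvoted by the four intact ones, while a fourth flip breaks the vote, which matches the strength of 3 stated for $L = 4, M = 28$.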


## Slightly More Sophisticated: Hamming Code

HAM(7, 4) Code

$\{0, 1\}^4 \to \{0, 1\}^7$ via adding 3 **parity bits** (physical meaning: label combinations)

$b_4 = y_0 \oplus y_1 \oplus y_3$, $b_5 = y_0 \oplus y_2 \oplus y_3$, $b_6 = y_1 \oplus y_2 \oplus y_3$

e.g. $1011 \longrightarrow 1011010$

strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)

$$\{0, 1\}^L \xrightarrow{\text{REP}} \{0, 1\}^{\frac{4M}{7}} \xrightarrow{\text{HAM}(7, 4)\text{ on each 4-bit block}} \{0, 1\}^{\frac{7M}{7}}$$

$L = 4, M = 28$: strength of HAMR = 4, better than REP!

HAMR + the special powerset: improve RAkEL on code strength
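
The HAM(7, 4) parity equations above translate directly into code. The sketch below is illustrative: the function names are hypothetical, and decoding is done by brute-force nearest-codeword search over the 16 messages rather than by the usual syndrome method. How HAMR arranges the repeated bits into 4-bit blocks is not specified on the slide, so HAMR itself is not implemented here.

```python
def ham74_encode(y):
    """Append the three parity bits b4, b5, b6 to the 4 message bits."""
    y0, y1, y2, y3 = y
    return [y0, y1, y2, y3, y0 ^ y1 ^ y3, y0 ^ y2 ^ y3, y1 ^ y2 ^ y3]

# All 16 possible 4-bit messages, most significant bit first.
ALL_MESSAGES = [[(m >> k) & 1 for k in (3, 2, 1, 0)] for m in range(16)]

def ham74_decode(b):
    """Return the message whose codeword is closest to b in Hamming distance."""
    def distance(y):
        return sum(u != v for u, v in zip(ham74_encode(y), b))
    return min(ALL_MESSAGES, key=distance)

codeword = ham74_encode([1, 0, 1, 1])
print(codeword)                      # matches the slide's example 1011 -> 1011010

corrupted = list(codeword)
corrupted[2] ^= 1                    # one wrong bit, e.g. one mispredicted label
print(ham74_decode(corrupted))       # the original message is recovered
```

Any single flipped bit among the seven is corrected, which is the "strength = 1" on the slide.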


## Even More Sophisticated Codes

Bose-Chaudhuri-Hocquenghem Code (BCH)

modern code in **CD players**

sophisticated extension of Hamming, with **more parity bits**

codeword length $M = 2^p - 1$ for $p \in \mathbb{N}$

$L = 4, M = 31$, strength of BCH = 5

Low-density Parity-check Code (LDPC)

modern code for **satellite communication**

connect ECC and Bayesian learning

approach the theoretical limit in some cases

**let’s compare!**


## Different ECCs on 3-label Powerset

[Figure: 0/1 loss and Hamming loss of different ECCs on the scene data set (L = 6); learner: special powerset with Random Forests; REP + special powerset ≈ RAkEL]

Comparing to RAkEL (on most of data sets), HAMR: **better 0/1 loss**, similar Hamming loss


## Semi-summary on MLECC

transformation to **larger multi-label classification**

encode via **error-correcting code** and capture label combinations (parity bits)

effective decoding (error-correcting)

simple theoretical guarantee + **good practical performance**

to **improve RAkEL**, replace REP by

HAMR $\Longrightarrow$ lower 0/1 loss, similar Hamming loss

BCH $\Longrightarrow$ even lower 0/1 loss, but higher Hamming loss

to **improve Binary Relevance**, · · ·


Learnable-Compression Coding

## Theoretical Guarantee of PLST Revisited

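Linear Transform + Learn + Round-based Decoding Theorem (Tai and Lin, 2012)

If $g(\mathbf{x}) = \text{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$,

$$\underbrace{\frac{1}{L} |g(\mathbf{x}) \,\triangle\, Y|}_{\text{Hamming loss}} \le \text{const} \cdot \Big( \underbrace{\|\mathbf{r}(\mathbf{x}) - \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}}\|^2}_{\text{learn}} + \underbrace{\|\mathbf{y} - \mathbf{P}^T \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}}\|^2}_{\text{compress}} \Big)$$

$\|\mathbf{y} - \mathbf{P}^T \mathbf{c}\|^2$: encoding error, minimized during encoding

$\|\mathbf{r}(\mathbf{x}) - \mathbf{c}\|^2$: prediction error, minimized during learning

but good encoding may not be easy to learn; vice versa

PLST: minimize two errors separately (sub-optimal)

(can we do better by minimizing jointly?)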


## Our Contributions (Third Part)

**Learnable-Compression Coding**

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for **feature-aware dimension reduction**

theoretical: justification for **best learnable projection**

practical: **consistently better performance** than PLST


## The In-Sample Optimization Problem

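$$\min_{\mathbf{r}, \mathbf{P}} \; \underbrace{\|\mathbf{r}(\mathbf{X}) - \mathbf{P}\mathbf{Y}\|^2}_{\text{learn}} + \underbrace{\|\mathbf{Y} - \mathbf{P}^T \mathbf{P}\mathbf{Y}\|^2}_{\text{compress}}$$

start from a well-known tool: linear regression as $\mathbf{r}$, i.e. $\mathbf{r}(\mathbf{X}) = \mathbf{X}\mathbf{W}$

for fixed $\mathbf{P}$: a closed-form solution for learn is $\mathbf{W}^* = \mathbf{X}^\dagger \mathbf{P}\mathbf{Y}$

optimal $\mathbf{P}$:

for learn: top eigenvectors of $\mathbf{Y}^T (\mathbf{X}\mathbf{X}^\dagger - \mathbf{I}) \mathbf{Y}$ (minimizing the learn error, whose matrix is $\mathbf{Y}^T (\mathbf{I} - \mathbf{X}\mathbf{X}^\dagger) \mathbf{Y}$)

for compress: top eigenvectors of $\mathbf{Y}^T \mathbf{Y}$

for both: top eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$ (the sum of the two matrices above)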
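
A short derivation of the joint result may help. It is a sketch under stated assumptions: $\mathbf{Y}$ is taken as the $N \times L$ matrix with rows $\mathbf{y}_n^T$, so the codes are the rows of $\mathbf{Y}\mathbf{P}^T$ (the slide writes the same quantity loosely as $\mathbf{P}\mathbf{Y}$), $\mathbf{P}$ has orthonormal rows ($\mathbf{P}\mathbf{P}^T = \mathbf{I}$), and $\mathbf{H} = \mathbf{X}\mathbf{X}^\dagger$ denotes the hat matrix, a symmetric projection.

$$\min_{\mathbf{W}} \|\mathbf{X}\mathbf{W} - \mathbf{Y}\mathbf{P}^T\|_F^2 = \|(\mathbf{I} - \mathbf{H}) \mathbf{Y}\mathbf{P}^T\|_F^2 = \mathrm{tr}\big(\mathbf{P}\mathbf{Y}^T (\mathbf{I} - \mathbf{H}) \mathbf{Y}\mathbf{P}^T\big)$$

$$\|\mathbf{Y} - \mathbf{Y}\mathbf{P}^T\mathbf{P}\|_F^2 = \mathrm{tr}(\mathbf{Y}^T\mathbf{Y}) - \mathrm{tr}\big(\mathbf{P}\mathbf{Y}^T\mathbf{Y}\mathbf{P}^T\big)$$

$$\text{learn} + \text{compress} = \mathrm{tr}(\mathbf{Y}^T\mathbf{Y}) - \mathrm{tr}\big(\mathbf{P}\mathbf{Y}^T \mathbf{H} \mathbf{Y}\mathbf{P}^T\big)$$

The first term does not depend on $\mathbf{P}$, so minimizing the sum is the same as maximizing $\mathrm{tr}(\mathbf{P}\mathbf{Y}^T \mathbf{H} \mathbf{Y}\mathbf{P}^T)$ over matrices with orthonormal rows, which is achieved by the top $M$ eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$.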


## Proposed Approach: Conditional Principal Label Space Transform

From PLST to CPLST

1 **compress**: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with the $M$ by $L$ conditional principal matrix $\mathbf{P}$

2 **learn**: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$, ideally using linear regression

3 **decode**: $g(\mathbf{x}) = \text{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$

conditional principal directions: top eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$

physical meaning behind $\mathbf{p}_m$: key (linear) label correlations that are "easy to learn"

CPLST: project to **key learnable correlations** (can also pair with **kernel regression**, which is non-linear)
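
The CPLST steps can be sketched in NumPy and contrasted with the PLST direction on a toy problem. This is an illustration under stated assumptions, not the authors' code: labels are centered by their mean, the bias is supplied as a constant column of `X`, and all function names and data are hypothetical.

```python
import numpy as np

def top_directions(S, M):
    """Top-M eigenvectors of a symmetric matrix S, returned as rows."""
    S = (S + S.T) / 2.0                     # guard against round-off asymmetry
    _, eigvecs = np.linalg.eigh(S)          # eigenvalues come in ascending order
    return eigvecs[:, ::-1][:, :M].T

def cplst_fit(X, Y, M):
    y_mean = Y.mean(axis=0)                 # assumption: center the labels
    Z = Y - y_mean
    X_pinv = np.linalg.pinv(X)
    P = top_directions(Z.T @ (X @ X_pinv) @ Z, M)   # conditional principal directions
    W = X_pinv @ (Z @ P.T)                  # closed-form linear regression to the codes
    return P, W, y_mean

def cplst_predict(X, P, W, y_mean):
    return ((X @ W) @ P + y_mean >= 0.5).astype(int)

# Two features plus a constant column. Labels 0 and 1 mark "both features agree in sign"
# (not linearly learnable); label 2 marks "feature 0 is positive" (linearly learnable).
X = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [-1.0, 1.0, 1.0], [-1.0, -1.0, 1.0]])
Y = np.array([[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 1, 0]])

P_cplst, W, y_mean = cplst_fit(X, Y, M=1)
Z = Y - Y.mean(axis=0)
P_plst = top_directions(Z.T @ Z, 1)         # PLST direction, which ignores X

third_label = cplst_predict(X, P_cplst, W, y_mean)[:, 2]
print(np.round(np.abs(P_plst), 3))
print(np.round(np.abs(P_cplst), 3))
print(third_label)
```

With a single dimension, the PLST direction follows the duplicated but unlearnable pair of labels, while the CPLST direction follows the label that linear regression can actually predict.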


## Hamming Loss Comparison: PLST & CPLST

[Figure: Hamming loss (vertical axis, ticks from 0.2 to 0.245) vs. # of dimensions (horizontal axis, 0 to 15) for pbr, cpa, plst, and cplst on yeast (Linear Regression)]

CPLST better than PLST: better performance across all dimensions

similar findings across **data sets and regression algorithms**


## Semi-summary on CPLST

project to **conditional principal directions** and capture key **learnable correlations**

more efficient

sound theoretical guarantee (via PLST) + **good practical performance** (better than PLST)

CPLST: state-of-the-art for label space compression


## Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)

condense for efficiency: better (than BR) approach PLST

key tool: PCA from Statistics/Signal Processing

2 Error-correction Coding (Ferng & Lin, ACML Conference 2011)

expand for accuracy: better (than REP) code HAMR or BCH

key tool: ECC from Information Theory

3 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)

condense for efficiency: better (than PLST) approach CPLST

key tool: Linear Regression from Statistics (+ PCA)

More...

beyond standard ECC-decoding (Ferng and Lin, IEEE TNNLS 2013)

coupling CPLST with other regressor (Chen and Lin, NIPS 2012)

multi-label classification with arbitrary loss (Li and Lin, ICML 2014)

dynamic instead of static coding, combine ML-ECC & PLST/CPLST (...)

**Thank you! Questions?**
