
# Label Space Coding for Multi-label Classification



Hsuan-Tien Lin National Taiwan University

Talk at NTHU, 04/11/2014

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Chun-Sung Ferng (ACML Conference 2011, IEEE TNNLS Journal 2013) &

Yao-Nan Chen (NIPS Conference 2012)

H.-T. Lin (NTU) Label Space Coding 04/11/2014 0 / 34


Multi-label Classification

## Which Fruit?

?

apple orange strawberry kiwi

multi-class classification: classify input (picture) to one category (label)

How?

## Supervised Machine Learning

Parent → (picture, category) pairs → Kid's brain, drawing on possibilities → good decision function

Truth $f(\mathbf{x})$ + noise $e(\mathbf{x})$ → examples (picture $\mathbf{x}_n$, category $y_n$) → learning algorithm, drawing on learning model $\{g_\alpha(\mathbf{x})\}$ → good decision function $g(\mathbf{x}) \approx f(\mathbf{x})$

challenge:

- see only $\{(\mathbf{x}_n, y_n)\}$ without knowing $f(\mathbf{x})$ or $e(\mathbf{x})$
- $\overset{?}{\Longrightarrow}$ generalize to unseen $(\mathbf{x}, y)$ w.r.t. $f(\mathbf{x})$

## Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification: classify input to multiple (or no) categories

## Powerset: Multi-label Classification via Multi-class

Multi-class w/ $L = 4$ classes: 4 possible outcomes

{a, o, s, k}

Multi-label w/ $L = 4$ classes: $2^4 = 16$ possible outcomes

$2^{\{a, o, s, k\}} \Updownarrow$ { φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk }

Powerset approach: transformation to multi-class classification

difficulties for large $L$:

- computation (super-large $2^L$): hard to construct classifier
- sparsity (no example for some of $2^L$): hard to discover hidden combination

Powerset: feasible only for small $L$ with enough examples for every combination
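The powerset transformation above can be sketched in a few lines of Python. This is only an illustration: the helper names `powerset` and `to_multiclass` are hypothetical, and the class ordering (by subset size, then position) is an assumption chosen to match the listing on the slide.

```python
from itertools import combinations

def powerset(labels):
    """Return every subset of `labels`, ordered by size and then by position."""
    items = list(labels)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def to_multiclass(label_sets, labels):
    """Map each label set to the index of its powerset class."""
    index = {s: i for i, s in enumerate(powerset(labels))}
    return [index[frozenset(s)] for s in label_sets]

fruits = ["a", "o", "s", "k"]
classes = powerset(fruits)
print(len(classes))                                      # 16 = 2**4 classes
print(to_multiclass([{"o", "s", "k"}, set()], fruits))   # [14, 0]
```

The number of classes doubles with every added label, which is the computation and sparsity difficulty noted for large $L$.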

## What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, etc.}

another multi-label classification problem: tagging input to multiple categories

## Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ $L$ classes: $L$ yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach: transformation to multiple isolated binary classification

disadvantages:

- isolation: hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)
- unbalanced: few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
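A minimal sketch of Binary Relevance, assuming NumPy and a least-squares linear scorer as the per-label binary learner (the slides do not fix a base learner; the function names and the toy data are hypothetical). The loop makes the isolation explicit: each label is fitted without looking at the others.

```python
import numpy as np

def fit_binary_relevance(X, Y):
    """Fit one least-squares yes/no scorer per label, each in isolation."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    weights = []
    for l in range(Y.shape[1]):                     # L isolated binary problems
        target = 2.0 * Y[:, l] - 1.0                # map {0, 1} to {-1, +1}
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.column_stack(weights)

def predict_binary_relevance(W, X):
    """Answer each yes/no question by the sign of its own scorer."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W > 0).astype(int)

# Toy data (an assumption): label l simply copies feature l.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = fit_binary_relevance(X, Y)
print(predict_binary_relevance(W, X))
```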

## Multi-label Classification Setup

Given

$N$ examples (input $\mathbf{x}_n$, label-set $\mathcal{Y}_n$) $\in \mathcal{X} \times 2^{\{1, 2, \cdots, L\}}$

- fruits: $\mathcal{X}$ = encoding(pictures), $\mathcal{Y}_n \subseteq \{1, 2, \cdots, 4\}$
- tags: $\mathcal{X}$ = encoding(merchandise), $\mathcal{Y}_n \subseteq \{1, 2, \cdots, L\}$

Goal

a multi-label classifier $g(\mathbf{x})$ that closely predicts the label-set $\mathcal{Y}$ associated with some unseen inputs $\mathbf{x}$ (by exploiting hidden relations/combinations between labels)

- 0/1 loss: any discrepancy $[\![ g(\mathbf{x}) \neq \mathcal{Y} ]\!]$
- Hamming loss: averaged symmetric difference $\frac{1}{L} |g(\mathbf{x}) \,\triangle\, \mathcal{Y}|$

multi-label classification: hot and important
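The two losses in the setup can be written directly on Python sets. This is a small sketch; the function names are hypothetical and the fruit example reuses the label numbering 1 to 4 as an assumption.

```python
def zero_one_loss(predicted, truth):
    """1 if the predicted label set differs from the truth at all, else 0."""
    return int(set(predicted) != set(truth))

def hamming_loss(predicted, truth, num_labels):
    """Size of the symmetric difference, averaged over the L labels."""
    return len(set(predicted) ^ set(truth)) / num_labels

truth = {2, 3, 4}   # e.g. {orange, strawberry, kiwi} with L = 4
guess = {2, 3}      # kiwi was missed
print(zero_one_loss(guess, truth))     # 1
print(hamming_loss(guess, truth, 4))   # 0.25
```

The 0/1 loss penalizes any discrepancy equally, while the Hamming loss gives partial credit for the labels that were predicted correctly.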

## Topics in this Talk

1. Compression Coding
   - condense for efficiency
   - capture hidden correlation
2. Error-correction Coding
   - expand for accuracy
   - capture hidden combination
3. Learnable-Compression Coding
   - condense-by-learnability for better efficiency
   - capture hidden & learnable correlation

Compression Coding

## From Label-set to Coding View

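| label set | apple | orange | strawberry | binary code |
|---|---|---|---|---|
| $\mathcal{Y}_1 = \{o\}$ | 0 (N) | 1 (Y) | 0 (N) | $\mathbf{y}_1 = [0, 1, 0]$ |
| $\mathcal{Y}_2 = \{a, o\}$ | 1 (Y) | 1 (Y) | 0 (N) | $\mathbf{y}_2 = [1, 1, 0]$ |
| $\mathcal{Y}_3 = \{a, s\}$ | 1 (Y) | 0 (N) | 1 (Y) | $\mathbf{y}_3 = [1, 0, 1]$ |
| $\mathcal{Y}_4 = \{o\}$ | 0 (N) | 1 (Y) | 0 (N) | $\mathbf{y}_4 = [0, 1, 0]$ |
| $\mathcal{Y}_5 = \{\}$ | 0 (N) | 0 (N) | 0 (N) | $\mathbf{y}_5 = [0, 0, 0]$ |

label set $\mathcal{Y} \in 2^{\{1, 2, \cdots, L\}}$ ⇔ length-$L$ binary code $\mathbf{y}$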
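The correspondence between label sets and binary codes is a simple indicator encoding. A minimal sketch, with hypothetical helper names `set_to_code` and `code_to_set`:

```python
def set_to_code(label_set, labels):
    """Length-L binary code: bit l is 1 exactly when label l is in the set."""
    return [1 if name in label_set else 0 for name in labels]

def code_to_set(code, labels):
    """Inverse mapping: collect the labels whose bit is 1."""
    return {name for name, bit in zip(labels, code) if bit == 1}

labels = ["a", "o", "s"]               # apple, orange, strawberry
print(set_to_code({"a", "s"}, labels))  # [1, 0, 1]
print(code_to_set([1, 1, 0], labels) == {"a", "o"})  # True
```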

## Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors $\mathbf{y} \in \{0, 1\}^L$ can be robustly compressed by projecting to $M \ll L$ basis vectors $\{\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_M\}$

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1. compress: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with some $M$ by $L$ random matrix $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_M]^T$
2. learn: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$
3. decode: $g(\mathbf{x})$ = find closest sparse binary vector to $\mathbf{P}^T \mathbf{r}(\mathbf{x})$

Compressive Sensing:

- efficient in training: random projection w/ $M \ll L$ (any better projection scheme?)
- inefficient in testing: time-consuming decoding (any faster decoding method?)

## Compression Coding

A Novel Approach for Label Space Compression

- algorithmic: scheme for fast decoding
- theoretical: justification for best projection
- practical: significantly better performance than compressive sensing (& binary relevance)

## Faster Decoding: Round-based

Compressive Sensing Revisited

decode: $g(\mathbf{x})$ = find closest sparse binary vector to $\tilde{\mathbf{y}} = \mathbf{P}^T \mathbf{r}(\mathbf{x})$

For any given "intermediate prediction" (real-valued vector) $\tilde{\mathbf{y}}$,

- find closest sparse binary vector to $\tilde{\mathbf{y}}$: slow (optimization of $\ell_1$-regularized objective)
- find closest binary vector (any, not only sparse) to $\tilde{\mathbf{y}}$: fast, $g(\mathbf{x}) = \mathrm{round}(\tilde{\mathbf{y}})$

round-based decoding: simple & faster alternative

## Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with some $M$ by $L$ random matrix $\mathbf{P}$

- random projection: arbitrary directions
- best projection: principal directions

principal directions: best approximation to desired output $\mathbf{y}_n$ during compression (why?)

## Novel Theoretical Guarantee

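Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If $g(\mathbf{x}) = \mathrm{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$,

$$
\underbrace{\frac{1}{L} \left| g(\mathbf{x}) \,\triangle\, \mathcal{Y} \right|}_{\text{Hamming loss}}
\le \text{const} \cdot \Big(
\underbrace{\big\| \mathbf{r}(\mathbf{x}) - \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}} \big\|^2}_{\text{learn}}
+ \underbrace{\big\| \mathbf{y} - \mathbf{P}^T \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}} \big\|^2}_{\text{compress}}
\Big)
$$

- $\|\mathbf{r}(\mathbf{x}) - \mathbf{c}\|^2$: prediction error from input to codeword
- $\|\mathbf{y} - \mathbf{P}^T \mathbf{c}\|^2$: encoding error from desired output to codeword

principal directions: best approximation to desired output $\mathbf{y}_n$ during compression (indeed)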

## Proposed Approach: Principal Label Space Transform

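From Compressive Sensing to PLST

1. compress: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with the $M$ by $L$ principal matrix $\mathbf{P}$
2. learn: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$
3. decode: $g(\mathbf{x}) = \mathrm{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$

- principal directions: via Principal Component Analysis on $\{\mathbf{y}_n\}_{n=1}^N$
- physical meaning behind $\mathbf{p}_m$: key (linear) label correlations

PLST: improving CS by projecting to key correlations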
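The three PLST steps can be sketched with NumPy as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, plain least-squares linear regression stands in for the regression function $\mathbf{r}$, and the label vectors are centered before PCA (the slide writes $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ without a shift, so the centering is an assumption following the usual PCA convention).

```python
import numpy as np

def plst_fit(X, Y, M):
    """Compress labels along principal directions, then regress onto the codes."""
    y_mean = Y.mean(axis=0)
    # Rows of Vt are the principal directions of the centered label matrix.
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    P = Vt[:M]                                    # M x L projection matrix
    C = (Y - y_mean) @ P.T                        # codewords, one row per example
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, C, rcond=None)    # linear regression r(x)
    return P, W, y_mean

def plst_predict(X, P, W, y_mean):
    """Round-based decoding of P^T r(x), shifted back by the label mean."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    y_tilde = (Xb @ W) @ P + y_mean
    return (y_tilde >= 0.5).astype(int)

# Toy data (an assumption): four labels driven by one hidden yes/no factor.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
first = (X[:, 0] > 0).astype(int)
Y = np.column_stack([first, first, 1 - first, first])
P, W, y_mean = plst_fit(X, Y, M=1)
pred = plst_predict(X, P, W, y_mean)
print(P.shape, pred.shape)
```

In this toy case the labels are perfectly (anti-)correlated, so a single principal direction already reproduces every label vector exactly and the encoding error is zero.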

## Hamming Loss Comparison: Full-BR, PLST & CS

[Figure: Hamming loss of Full-BR (no reduction), CS, and PLST on mediamill, in two panels (Linear Regression and Decision Tree); horizontal axis from 0 to 100, vertical axis from 0.03 to 0.05]

- PLST better than Full-BR: fewer dimensions, similar (or better) performance
- PLST better than CS: faster, better performance
- similar findings across data sets and regression algorithms

## Semi-summary on PLST

- project to principal directions and capture key correlations
- efficient learning (after label space compression)
- efficient decoding (round-based)
- sound theoretical guarantee + good practical performance (better than CS & BR)

expansion (channel coding) instead of compression ("lossy" source coding)? YES!

Error-correction Coding

## Error-correction Coding

A Novel Framework for Label Space Error-correction

- algorithmic: generalize a popular existing algorithm (RAkEL; Tsoumakas & Vlahavas, 2007) and explain through coding view
- theoretical: link learning performance to error-correcting ability
- practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)

## Key Idea: Redundant Information

General Error-correcting Codes (ECC)

- commonly used in communication systems
- detect & correct errors after transmitting data over a noisy channel
- encode data redundantly

ECC for Machine Learning (successful for multi-class classification)

- noisy channel: predictions of $\mathbf{b}$
- learn redundant bits $\Longrightarrow$ correct prediction errors

## Proposed Framework: Multi-labeling with ECC

- encode to add redundant information: $\mathrm{enc}(\cdot) : \{0, 1\}^L \to \{0, 1\}^M$
- decode to locate most possible binary vector: $\mathrm{dec}(\cdot) : \{0, 1\}^M \to \{0, 1\}^L$

transformation to larger multi-label classification with labels $\mathbf{b}$

- PLST: $M \ll L$ (works for large $L$)
- MLECC: $M > L$ (works for small $L$)

## Simple Theoretical Guarantee

ECC encode + Larger Multi-label Learning + ECC decode

Theorem

Let $g(\mathbf{x}) = \mathrm{dec}(\tilde{\mathbf{b}})$ with $\tilde{\mathbf{b}} = h(\mathbf{x})$. Then,

$$
\underbrace{[\![ g(\mathbf{x}) \neq \mathcal{Y} ]\!]}_{\text{0/1 loss}}
\le \text{const.} \cdot \frac{\text{Hamming loss of } h(\mathbf{x})}{\text{ECC strength} + 1}.
$$

- PLST: principal directions + decent regression
- MLECC: which ECC balances strength & difficulty?

## Simplest ECC: Repetition Code

encoding: $\mathbf{y} \in \{0, 1\}^L \to \mathbf{b} \in \{0, 1\}^M$

- repeat each bit $\frac{M}{L}$ times, e.g. $L = 4$, $M = 28$: $1010 \longrightarrow \underbrace{1111111}_{\frac{28}{4} = 7}\,0000000\,1111111\,0000000$
- permute the bits randomly

decoding: $\tilde{\mathbf{b}} \in \{0, 1\}^M \to \tilde{\mathbf{y}} \in \{0, 1\}^L$

- majority vote on each original bit

$L = 4$, $M = 28$: strength of repetition code (REP) = 3

RAkEL = REP (code) + a special powerset (channel)
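The repetition code can be sketched in plain Python. The function names and the fixed random seed are hypothetical choices for illustration; the permutation is stored so that decoding can undo it before the majority vote.

```python
import random

def rep_encode(y, M, perm):
    """Repeat each of the L bits M/L times, then permute the bits with `perm`."""
    reps = M // len(y)
    repeated = [bit for bit in y for _ in range(reps)]
    return [repeated[i] for i in perm]

def rep_decode(b, L, perm):
    """Undo the permutation, then majority-vote each block of M/L bits."""
    repeated = [0] * len(b)
    for pos, i in enumerate(perm):
        repeated[i] = b[pos]
    reps = len(b) // L
    return [int(2 * sum(repeated[k * reps:(k + 1) * reps]) > reps)
            for k in range(L)]

L, M = 4, 28
perm = list(range(M))
random.Random(0).shuffle(perm)         # fixed seed: an assumption for repeatability
b = rep_encode([1, 0, 1, 0], M, perm)
noisy = list(b)
for pos in (0, 5, 9):                  # flip 3 bits, within the code strength
    noisy[pos] ^= 1
print(rep_decode(noisy, L, perm))      # [1, 0, 1, 0]
```

With 7 copies per bit, any 3 flipped bits leave at least 4 correct copies in every block, which is why the strength is 3 for $L = 4$, $M = 28$.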

## Slightly More Sophisticated: Hamming Code

HAM(7, 4) Code

- $\{0, 1\}^4 \to \{0, 1\}^7$ via adding 3 parity bits (physical meaning: label combinations)
- $b_4 = y_0 \oplus y_1 \oplus y_3$, $b_5 = y_0 \oplus y_2 \oplus y_3$, $b_6 = y_1 \oplus y_2 \oplus y_3$
- e.g. $1011 \longrightarrow 1011010$
- strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)

$$
\{0, 1\}^L \xrightarrow{\;\text{REP}\;} \{0, 1\}^{\frac{4M}{7}} \xrightarrow{\;\text{HAM(7, 4) on each 4-bit block}\;} \{0, 1\}^{\frac{7M}{7}}
$$

$L = 4$, $M = 28$: strength of HAMR = 4, better than REP!

HAMR + the special powerset: improve RAkEL on code strength
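The HAM(7, 4) parity equations above translate directly into code. The sketch below uses hypothetical function names and, as an assumption made for brevity, decodes by brute-force search for the nearest of the 16 codewords rather than by syndrome decoding. It covers only the HAM(7, 4) block code, not the full HAMR construction.

```python
from itertools import product

def ham74_encode(y):
    """Append the three parity bits b4, b5, b6 to a 4-bit block."""
    y0, y1, y2, y3 = y
    return (y0, y1, y2, y3, y0 ^ y1 ^ y3, y0 ^ y2 ^ y3, y1 ^ y2 ^ y3)

# All 16 codewords, mapped back to the 4 bits that produced them.
CODEBOOK = {ham74_encode(y): y for y in product((0, 1), repeat=4)}

def ham74_decode(b):
    """Return the 4 bits whose codeword is nearest to `b` in Hamming distance."""
    nearest = min(CODEBOOK, key=lambda c: sum(x != z for x, z in zip(c, b)))
    return CODEBOOK[nearest]

code = ham74_encode((1, 0, 1, 1))
print("".join(map(str, code)))     # 1011010
corrupted = list(code)
corrupted[2] ^= 1                  # one flipped bit, within strength = 1
print(ham74_decode(corrupted))     # (1, 0, 1, 1)
```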

## Even More Sophisticated Codes

Bose-Chaudhuri-Hocquenghem Code (BCH)

- modern code in CD players
- sophisticated extension of Hamming, with more parity bits
- codeword length $M = 2^p - 1$ for $p \in \mathbb{N}$
- $L = 4$, $M = 31$, strength of BCH = 5

Low-density Parity-check Code (LDPC)

- modern code for satellite communication
- connect ECC and Bayesian learning
- approach the theoretical limit in some cases

let's compare!

## Different ECCs on 3-label Powerset

(scene data set w/ $L = 6$)

- learner: special powerset with Random Forests
- REP + special powerset ≈ RAkEL

[Figure: comparison of the ECCs in two panels, 0/1 loss and Hamming loss]

Comparing to RAkEL (on most of data sets), HAMR: better 0/1 loss, similar Hamming loss

## Semi-summary on MLECC

- transformation to larger multi-label classification
- encode via error-correcting code and capture label combinations (parity bits)
- effective decoding (error-correcting)
- simple theoretical guarantee + good practical performance

to improve RAkEL, replace REP by

- HAMR $\Longrightarrow$ lower 0/1 loss, similar Hamming loss
- BCH $\Longrightarrow$ even lower 0/1 loss, but higher Hamming loss

to improve Binary Relevance, · · ·

Learnable-Compression Coding

## Theoretical Guarantee of PLST Revisited

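Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If $g(\mathbf{x}) = \mathrm{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$,

$$
\underbrace{\frac{1}{L} \left| g(\mathbf{x}) \,\triangle\, \mathcal{Y} \right|}_{\text{Hamming loss}}
\le \text{const} \cdot \Big(
\underbrace{\big\| \mathbf{r}(\mathbf{x}) - \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}} \big\|^2}_{\text{learn}}
+ \underbrace{\big\| \mathbf{y} - \mathbf{P}^T \overbrace{\mathbf{P}\mathbf{y}}^{\mathbf{c}} \big\|^2}_{\text{compress}}
\Big)
$$

- $\|\mathbf{y} - \mathbf{P}^T \mathbf{c}\|^2$: encoding error, minimized during encoding
- $\|\mathbf{r}(\mathbf{x}) - \mathbf{c}\|^2$: prediction error, minimized during learning
- but good encoding may not be easy to learn; vice versa

PLST: minimize two errors separately (sub-optimal) (can we do better by minimizing jointly?)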

## Learnable-Compression Coding

A Novel Approach for Label Space Compression

- algorithmic: first known algorithm for feature-aware dimension reduction
- theoretical: justification for best learnable projection
- practical: consistently better performance than PLST

## The In-Sample Optimization Problem

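$$
\min_{\mathbf{r}, \mathbf{P}} \;
\underbrace{\| \mathbf{r}(\mathbf{X}) - \mathbf{P}\mathbf{Y} \|^2}_{\text{learn}}
+ \underbrace{\| \mathbf{Y} - \mathbf{P}^T \mathbf{P}\mathbf{Y} \|^2}_{\text{compress}}
$$

(here $\mathbf{P}\mathbf{Y}$ denotes the codewords $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ of all examples, and $\mathbf{X}^\dagger$ is the pseudo-inverse of $\mathbf{X}$)

- start from a well-known tool: linear regression as $\mathbf{r}$, $\mathbf{r}(\mathbf{X}) = \mathbf{X}\mathbf{W}$
- for fixed $\mathbf{P}$: a closed-form solution for learn is $\mathbf{W} = \mathbf{X}^\dagger \mathbf{P}\mathbf{Y}$

optimal $\mathbf{P}$:

- for learn: top eigenvectors of $\mathbf{Y}^T (\mathbf{X}\mathbf{X}^\dagger - \mathbf{I}) \mathbf{Y}$, i.e. the directions with the smallest eigenvalues of $\mathbf{Y}^T (\mathbf{I} - \mathbf{X}\mathbf{X}^\dagger) \mathbf{Y}$
- for compress: top eigenvectors of $\mathbf{Y}^T \mathbf{Y}$
- for both: top eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$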
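A short derivation of why the joint objective leads to the top eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$ may help. It is a sketch under assumptions not spelled out on the slide: the rows of $\mathbf{X}$ and $\mathbf{Y}$ are the examples (so the codeword matrix is $\mathbf{Y}\mathbf{P}^T$), $\mathbf{P}$ has orthonormal rows ($\mathbf{P}\mathbf{P}^T = \mathbf{I}$), norms are Frobenius norms, and $\mathbf{H} = \mathbf{X}\mathbf{X}^\dagger$ is the hat matrix, which is symmetric and idempotent.

```latex
\begin{align*}
\text{learn:}\quad
  \min_{\mathbf{W}} \|\mathbf{X}\mathbf{W} - \mathbf{Y}\mathbf{P}^T\|^2
  &= \|(\mathbf{H} - \mathbf{I})\,\mathbf{Y}\mathbf{P}^T\|^2
   = \operatorname{tr}\!\big(\mathbf{P}\mathbf{Y}^T(\mathbf{I} - \mathbf{H})\mathbf{Y}\mathbf{P}^T\big),
  \qquad \mathbf{W} = \mathbf{X}^\dagger \mathbf{Y}\mathbf{P}^T \\
\text{compress:}\quad
  \|\mathbf{Y} - \mathbf{Y}\mathbf{P}^T\mathbf{P}\|^2
  &= \operatorname{tr}(\mathbf{Y}^T\mathbf{Y})
   - \operatorname{tr}\!\big(\mathbf{P}\mathbf{Y}^T\mathbf{Y}\mathbf{P}^T\big) \\
\text{sum:}\quad
  \text{learn} + \text{compress}
  &= \operatorname{tr}(\mathbf{Y}^T\mathbf{Y})
   - \operatorname{tr}\!\big(\mathbf{P}\mathbf{Y}^T\mathbf{H}\mathbf{Y}\mathbf{P}^T\big)
\end{align*}
```

Since $\operatorname{tr}(\mathbf{Y}^T\mathbf{Y})$ does not depend on $\mathbf{P}$, minimizing the sum amounts to maximizing $\operatorname{tr}(\mathbf{P}\mathbf{Y}^T\mathbf{H}\mathbf{Y}\mathbf{P}^T)$ over orthonormal rows, which is achieved by the eigenvectors of $\mathbf{Y}^T\mathbf{H}\mathbf{Y}$ with the $M$ largest eigenvalues.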

## Proposed Approach: Conditional Principal Label Space Transform

From PLST to CPLST

1. compress: transform $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ to $\{(\mathbf{x}_n, \mathbf{c}_n)\}$ by $\mathbf{c}_n = \mathbf{P}\mathbf{y}_n$ with the $M$ by $L$ conditional principal matrix $\mathbf{P}$
2. learn: get regression function $\mathbf{r}(\mathbf{x})$ from $\mathbf{x}_n$ to $\mathbf{c}_n$, ideally using linear regression
3. decode: $g(\mathbf{x}) = \mathrm{round}(\mathbf{P}^T \mathbf{r}(\mathbf{x}))$

- conditional principal directions: top eigenvectors of $\mathbf{Y}^T \mathbf{X}\mathbf{X}^\dagger \mathbf{Y}$
- physical meaning behind $\mathbf{p}_m$: key (linear) label correlations that are "easy to learn"

CPLST: project to key learnable correlations (can also pair with kernel regression (non-linear))
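The CPLST steps can be sketched with NumPy along the same lines as PLST. Again this is an illustration under assumptions rather than the authors' code: the function names and toy data are hypothetical, the label matrix is centered before the eigen-decomposition, no bias term is used in the regression, and rows of `X` and `Y` are examples.

```python
import numpy as np

def cplst_fit(X, Y, M):
    """Pick the top-M eigenvectors of Z^T X X^+ Z, then regress in closed form."""
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean
    X_pinv = np.linalg.pinv(X)                  # pseudo-inverse of X
    H = X @ X_pinv                              # hat matrix (N x N)
    A = Z.T @ H @ Z                             # L x L, symmetric up to rounding
    _, eigvecs = np.linalg.eigh((A + A.T) / 2)  # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :M].T               # top-M eigenvectors as rows
    W = X_pinv @ Z @ P.T                        # closed-form linear regression
    return P, W, y_mean

def cplst_predict(X, P, W, y_mean):
    """Round-based decoding of P^T r(x), shifted back by the label mean."""
    return ((X @ W) @ P + y_mean >= 0.5).astype(int)

# Toy data (an assumption): 5 labels generated by thresholding linear scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Y = (X @ rng.normal(size=(3, 5)) > 0).astype(int)
P, W, y_mean = cplst_fit(X, Y, M=2)
pred = cplst_predict(X, P, W, y_mean)
print(P.shape, W.shape, pred.shape)
```

The only change from the PLST sketch is the matrix whose eigenvectors are taken: the hat matrix weights the label correlations by how well the features can explain them.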

## Hamming Loss Comparison: PLST & CPLST

[Figure: Hamming loss versus # of dimension on yeast (Linear Regression) for pbr, cpa, plst, and cplst; horizontal axis from 0 to 15, vertical axis from 0.2 to 0.245]

- CPLST better than PLST: better performance across all dimensions
- similar findings across data sets and regression algorithms

## Semi-summary on CPLST

- project to conditional principal directions and capture key learnable correlations
- more efficient
- sound theoretical guarantee (via PLST) + good practical performance (better than PLST)

CPLST: state-of-the-art for label space compression

## Conclusion

1. Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)
   - condense for efficiency: better (than BR) approach PLST
   - key tool: PCA from Statistics/Signal Processing
2. Error-correction Coding (Ferng & Lin, ACML Conference 2011)
   - expand for accuracy: better (than REP) code HAMR or BCH
   - key tool: ECC from Information Theory
3. Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)
   - condense for efficiency: better (than PLST) approach CPLST
   - key tool: Linear Regression from Statistics (+ PCA)

More...

- beyond standard ECC-decoding (Ferng and Lin, IEEE TNNLS 2013)
- coupling CPLST with other regressor (Chen and Lin, NIPS 2012)
- multi-label classification with arbitrary loss (Li and Lin, ICML 2014)
- dynamic instead of static coding, combine ML-ECC & PLST/CPLST (...)

## Thank you! Questions?
