(1)

Label Space Coding for Multi-label Classification

Hsuan-Tien Lin National Taiwan University 3rd TWSIAM Annual Meeting, 05/30/2015

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Yao-Nan Chen (NIPS Conference 2012)

(2)

Multi-label Classification

Which Fruit?

?

apple orange strawberry kiwi

multi-class classification:

classify input (picture) to one category (label)

(3)

Multi-label Classification

Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification:

classify input to multiple (or no) categories

(4)

Multi-label Classification

What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

another multi-label classification problem:

tagging input to multiple categories

(5)

Multi-label Classification

Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y),

children book (N), etc.

Binary Relevance approach:

transformation to multiple isolated binary classification tasks

disadvantages:

isolation: hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)

unbalanced: few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
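
As a concrete illustration, here is a minimal Binary Relevance sketch in Python; using scikit-learn's LogisticRegression as the per-label yes/no learner is an assumption of the sketch, since any binary classifier fits the transformation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    """Train one isolated binary classifier per label (column of the N x L matrix Y)."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, l]) for l in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Answer the L yes/no questions independently and stack them into an N x L prediction."""
    return np.column_stack([clf.predict(X) for clf in models])
```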

(6)

Multi-label Classification

Multi-label Classification Setup

Given

N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,··· ,L}

fruits: X = encoding(pictures), Y_n ⊆ {1, 2, · · · , 4}

tags: X = encoding(merchandise), Y_n ⊆ {1, 2, · · · , L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen inputs x (by exploiting hidden relations/combinations between labels)

Hamming loss: averaged symmetric difference (1/L) |g(x) △ Y|

multi-label classification: hot and important
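
For instance, the Hamming loss of a single prediction can be computed directly from the two length-L binary codes; the fruit example below (labels ordered as apple, orange, strawberry, kiwi) is only illustrative.

```python
import numpy as np

def hamming_loss(g_x, y):
    """Averaged symmetric difference (1/L) * |g(x) triangle Y| on length-L binary codes."""
    g_x, y = np.asarray(g_x), np.asarray(y)
    return np.mean(g_x != y)

# true label-set {orange, strawberry, kiwi}, predicted {orange, kiwi}: one of L = 4 labels differs
print(hamming_loss([0, 1, 0, 1], [0, 1, 1, 1]))  # 0.25
```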

(7)

Compression Coding

From Label-set to Coding View

label set      apple  orange  strawberry  binary code
Y1 = {o}       0 (N)  1 (Y)   0 (N)       y1 = [0, 1, 0]
Y2 = {a, o}    1 (Y)  1 (Y)   0 (N)       y2 = [1, 1, 0]
Y3 = {a, s}    1 (Y)  0 (N)   1 (Y)       y3 = [1, 0, 1]
Y4 = {o}       0 (N)  1 (Y)   0 (N)       y4 = [0, 1, 0]
Y5 = {}        0 (N)  0 (N)   0 (N)       y5 = [0, 0, 0]

subset Y of {1, 2, · · · , L} ⇔ length-L binary code y
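
A tiny sketch of this label-set-to-code mapping (the label names and ordering are just those of the fruit example):

```python
import numpy as np

LABELS = ["apple", "orange", "strawberry"]  # L = 3, fixed label order

def to_code(label_set, labels=LABELS):
    """Map a label-set Y, a subset of the labels, to its length-L binary code y."""
    return np.array([int(name in label_set) for name in labels])

print(to_code({"orange"}))           # [0 1 0]  (= y1 and y4)
print(to_code({"apple", "orange"}))  # [1 1 0]  (= y2)
print(to_code(set()))                # [0 0 0]  (= y5)
```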

(8)

Compression Coding

Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting onto M ≪ L basis vectors {p_1, p_2, · · · , p_M}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M by L random matrix P = [p_1, p_2, · · · , p_M]^T

2 learn: get regression function r(x) from x_n to c_n

3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:

efficient in training: random projection w/ M ≪ L

inefficient in testing: time-consuming decoding
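
A rough sketch of the compress and learn steps follows; the Gaussian random projection and ridge regression as r(x) are assumptions of this sketch, and the costly sparse decoding step is omitted here (the round-based alternative appears later).

```python
import numpy as np
from sklearn.linear_model import Ridge

def cs_compress_and_learn(X, Y, M, seed=0):
    """Compressive-sensing-style training: random M x L projection P, then regression to the codewords."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((M, Y.shape[1]))  # random matrix P = [p_1, ..., p_M]^T
    C = Y @ P.T                               # codewords c_n = P y_n
    r = Ridge().fit(X, C)                     # regression function r(x) from x_n to c_n
    return P, r
```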

(9)

Compression Coding

From Coding View to Geometric View

label set      binary code
Y1 = {o}       y1 = [0, 1, 0]
Y2 = {a, o}    y2 = [1, 1, 0]
Y3 = {a, s}    y3 = [1, 0, 1]
Y4 = {o}       y4 = [0, 1, 0]
Y5 = {}        y5 = [0, 0, 0]

[figure: y1 through y5 drawn as vertices of the cube {0, 1}^3]

length-L binary code ⇔ vertex of the hypercube {0, 1}^L

(10)

Compression Coding

Geometric Interpretation of Binary Relevance


Binary Relevance: project to the natural axes & classify

(11)

Compression Coding

Geometric Interpretation of Compressive Sensing


Compressive Sensing:

project to a random flat (linear subspace)

learn “on” the flat; decode to the closest sparse vertex

other (better) flat? other (faster) decoding?

(12)

Compression Coding

Our Contributions

Compression Coding &

Learnable-Compression Coding

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding

theoretical: justification for the best learnable projection

practical: consistently better performance than compressive sensing (& binary relevance)

will now introduce the key ideas behind the approach

(13)

Compression Coding

Faster Decoding: Round-based

Compressive Sensing Revisited

decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given “intermediate prediction” (real-valued vector) ỹ,

find closest sparse binary vector to ỹ: slow optimization of an ℓ1-regularized objective

find closest binary vector (not necessarily sparse) to ỹ: fast, g(x) = round(ỹ)

round-based decoding: simple & faster alternative
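
A sketch of round-based decoding, assuming {0, 1} codes so that rounding amounts to thresholding each back-projected coordinate at 0.5:

```python
import numpy as np

def round_decode(P, r_x):
    """g(x) = round(P^T r(x)): take the closest binary vertex, coordinate by coordinate."""
    y_tilde = P.T @ r_x                  # back-project the M-dim prediction to L dimensions
    return (y_tilde >= 0.5).astype(int)  # no sparse optimization needed
```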

(14)

Compression Coding

Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M by L random matrix P

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to the desired output y_n during compression (why?)

(15)

Compression Coding

Novel Theoretical Guarantee

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(P^T r(x)), then

$$\underbrace{\frac{1}{L}\bigl|g(x)\,\triangle\,Y\bigr|}_{\text{Hamming loss}} \;\le\; \text{const}\cdot\Bigl(\underbrace{\bigl\|r(x)-\overbrace{Py}^{c}\bigr\|^{2}}_{\text{learn}} \;+\; \underbrace{\bigl\|y-P^{T}\overbrace{Py}^{c}\bigr\|^{2}}_{\text{compress}}\Bigr)$$

‖r(x) − c‖²: prediction error from the input to the codeword

‖y − P^T c‖²: encoding error from the desired output to the codeword

principal directions: best approximation to the desired output y_n during compression (indeed)

(16)

Compression Coding

Proposed Approach 1:

Principal Label Space Transform

From Compressive Sensing to PLST

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M by L principal matrix P

2 learn: get regression function r(x) from x_n to c_n

3 decode: g(x) = round(P^T r(x))

principal directions: via Principal Component Analysis on {y_n}, n = 1, . . . , N

physical meaning behind p_m: key (linear) label correlations

PLST: improving CS by projecting to key correlations
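
A compact PLST sketch, under some assumptions of this sketch: the label matrix is mean-shifted before PCA, ridge regression plays the role of r(x), and rounding thresholds at 0.5.

```python
import numpy as np
from sklearn.linear_model import Ridge

def plst_fit(X, Y, M):
    """PLST sketch: principal directions of the (shifted) label matrix, then regression to the codewords."""
    y_mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    P = Vt[:M]                              # M x L principal matrix from the top right singular vectors
    r = Ridge().fit(X, (Y - y_mean) @ P.T)  # regression to codewords c_n = P (y_n - mean)
    return P, r, y_mean

def plst_predict(P, r, y_mean, X):
    """Round-based decoding: g(x) = round(P^T r(x) + shift)."""
    return ((r.predict(X) @ P + y_mean) >= 0.5).astype(int)
```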

(17)

Compression Coding

Theoretical Guarantee of PLST Revisited

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(P^T r(x)), then

$$\underbrace{\frac{1}{L}\bigl|g(x)\,\triangle\,Y\bigr|}_{\text{Hamming loss}} \;\le\; \text{const}\cdot\Bigl(\underbrace{\bigl\|r(x)-\overbrace{Py}^{c}\bigr\|^{2}}_{\text{learn}} \;+\; \underbrace{\bigl\|y-P^{T}\overbrace{Py}^{c}\bigr\|^{2}}_{\text{compress}}\Bigr)$$

‖y − P^T c‖²: encoding error, minimized during encoding

‖r(x) − c‖²: prediction error, minimized during learning

but a good encoding may not be easy to learn, and vice versa

PLST: minimizes the two errors separately (sub-optimal)

(can we do better by minimizing them jointly?)

(18)

Compression Coding

Proposed Approach 2:

Conditional Principal Label Space Transform

can we do better by minimizing jointly?

Yes, and it is easy for ridge regression (closed-form solution)

From PLST to CPLST

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M by L conditional principal matrix P

2 learn: get regression function r(x) from x_n to c_n, ideally using ridge regression

3 decode: g(x) = round(P^T r(x))

conditional principal directions: top eigenvectors of Y^T X X^† Y, the key (linear) label correlations that are “easy to learn”

CPLST: project to key learnable correlations

can also pair with kernel regression (non-linear)
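
A sketch of the CPLST training step, reading the conditional principal directions as the top eigenvectors of Z^T (X X^†) Z for a mean-shifted label matrix Z; the mean shift, the pseudo-inverse formulation, and ridge regression as r(x) are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import Ridge

def cplst_fit(X, Y, M):
    """CPLST sketch: eigendecompose Z^T (X X^+) Z to get label directions that are also easy to learn."""
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean
    HZ = X @ np.linalg.lstsq(X, Z, rcond=None)[0]  # (X X^+) Z without forming the hat matrix explicitly
    _, V = np.linalg.eigh(Z.T @ HZ)                # eigenvectors of the symmetric matrix, ascending order
    P = V[:, -M:][:, ::-1].T                       # top-M eigenvectors as the M x L conditional principal matrix
    r = Ridge().fit(X, Z @ P.T)                    # regression to codewords c_n = P z_n
    return P, r, y_mean
```

Decoding then proceeds exactly as in the PLST sketch above (round-based).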

(19)

Compression Coding

Hamming Loss Comparison: Full-BR, PLST & CS

[figure: Hamming loss (0.03 to 0.05) vs. number of dimensions (0 to 100) on mediamill, for Linear Regression (left) and Decision Tree (right); curves: Full-BR (no reduction), CS, PLST]

PLST better than Full-BR: fewer dimensions, similar (or better) performance

PLST better than CS: faster, better performance

similar findings across data sets and regression algorithms

(20)

Compression Coding

Hamming Loss Comparison: PLST & CPLST

[figure: Hamming loss vs. number of dimensions (0 to 15) on yeast with Linear Regression; curves: pbr, cpa, plst, cplst]

CPLST better than PLST: better performance across all dimensions

similar findings across data sets and regression algorithms

(21)

Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)

—condense for efficiency: better (than BR) approach PLST

— key tool: PCA from Statistics/Signal Processing

2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)

—condense learnably for better efficiency: better (than PLST) approach CPLST

— key tool: Ridge Regression from Statistics (+ PCA)

More...

error-correcting code instead of compression, with improved decoding (Ferng and Lin, IEEE TNNLS 2013)

multi-label classification with arbitrary loss (Li and Lin, ICML 2014)

dynamic instead of static coding, binary instead of real coding, (...)

Thank you! Questions?
