(1)

Label Space Coding for Multi-label Classification

Hsuan-Tien Lin, National Taiwan University

NUK Seminar, 12/22/2017

joint works with

Farbound Tai (MLD Workshop 2010, NC Journal 2012) &

Yao-Nan Chen (NIPS Conference 2012) &

Chun-Sung Ferng (ACML Conference 2011, TNNLS Journal 2013)

(2)

Which Fruit?

?

apple orange strawberry kiwi

multi-class classification:

classify input (picture) to one category (label)

(3)

Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification:

classify input to multiple (or no) categories

(4)

What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

another multi-label classification problem:

tagging input to multiple categories

(5)

Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach:

transformation to multiple isolated binary classification tasks

disadvantages:

isolation — hidden relations not exploited (e.g. ML and DM highly correlated, ML subset of AI, textbook & children book disjoint)

unbalanced — few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
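As a concrete reference point, here is a minimal Binary Relevance sketch, assuming a feature matrix X of shape (N, d) and a 0/1 label matrix Y of shape (N, L); LogisticRegression is just one possible per-label yes/no learner, and the helper names are made up for illustration.

```python
# Binary Relevance sketch: one isolated yes/no classifier per label column.
import numpy as np
from sklearn.linear_model import LogisticRegression

def br_train(X, Y):
    """Fit L independent binary classifiers, one per label."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, k])
            for k in range(Y.shape[1])]

def br_predict(models, X):
    """Stack the L per-label yes/no predictions back into label-set codes."""
    return np.column_stack([m.predict(X) for m in models])
```

Note that the labels are handled in isolation, which is exactly the disadvantage listed above.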

(6)

Multi-label Classification Setup

Given

N examples (input x_n, label-set Y_n) ∈ X × 2^{1,2,...,L}

fruits: X = encoding(pictures), Y_n ⊆ {1, 2, 3, 4}

tags: X = encoding(merchandise), Y_n ⊆ {1, 2, . . . , L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)

0/1 loss: any discrepancy ⟦g(x) ≠ Y⟧

Hamming loss: averaged symmetric difference (1/L)|g(x) △ Y|

multi-label classification: hot and important
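Both criteria are straightforward to compute once the label-set is viewed as a length-L 0/1 vector; a small sketch with a hypothetical 4-label fruit example:

```python
# 0/1 loss and Hamming loss for a single example, with y_pred and y_true
# given as length-L 0/1 vectors.
import numpy as np

def zero_one_loss(y_pred, y_true):
    """1 iff the predicted label-set differs from the true one anywhere."""
    return float(not np.array_equal(y_pred, y_true))

def hamming_loss(y_pred, y_true):
    """Fraction of the L labels predicted incorrectly (|symmetric difference| / L)."""
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))

# hypothetical example: truth {orange, strawberry, kiwi}, prediction misses kiwi
y_true = np.array([0, 1, 1, 1])   # [apple, orange, strawberry, kiwi]
y_pred = np.array([0, 1, 1, 0])
print(zero_one_loss(y_pred, y_true), hamming_loss(y_pred, y_true))  # 1.0 0.25
```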

(7)

From Label-set to Coding View

label set    | apple | orange | strawberry | binary code
Y1 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y1 = [0, 1, 0]
Y2 = {a, o}  | 1 (Y) | 1 (Y)  | 0 (N)      | y2 = [1, 1, 0]
Y3 = {a, s}  | 1 (Y) | 0 (N)  | 1 (Y)      | y3 = [1, 0, 1]
Y4 = {o}     | 0 (N) | 1 (Y)  | 0 (N)      | y4 = [0, 1, 0]
Y5 = {}      | 0 (N) | 0 (N)  | 0 (N)      | y5 = [0, 0, 0]

subset Y of 2^{1,2,...,L} ⇔ length-L binary code y
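A tiny sketch of this label-set-to-code conversion, using the three fruit labels from the table (the helper name is made up):

```python
# Convert a label-set into its length-L binary code.
import numpy as np

LABELS = ["apple", "orange", "strawberry"]   # L = 3, fixed label order

def to_code(label_set, labels=LABELS):
    return np.array([int(label in label_set) for label in labels])

print(to_code({"orange"}))            # [0 1 0]  = y1, y4
print(to_code({"apple", "orange"}))   # [1 1 0]  = y2
print(to_code(set()))                 # [0 0 0]  = y5
```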

(8)

Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors y ∈ {0, 1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p_1, p_2, . . . , p_M}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M by L random matrix P = [p_1, p_2, . . . , p_M]^T

2 learn: get regression function r(x) from x_n to c_n

3 decode: g(x) = find closest sparse binary vector to P^T r(x)

Compressive Sensing:

efficient in training: random projection w/ M ≪ L

inefficient in testing: time-consuming decoding
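A rough numpy sketch of the compress / learn / decode pipeline, for illustration only: it uses plain least squares as the learner and replaces the slow ℓ1-based decoding with simple rounding, so it shows the data flow rather than the exact procedure of Hsu et al. (2009); all names are made up.

```python
# Compressive-sensing-style label space compression (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def cs_train(X, Y, M):
    L = Y.shape[1]
    P = rng.standard_normal((M, L))             # random M x L projection matrix
    C = Y @ P.T                                 # compress: c_n = P y_n
    W, *_ = np.linalg.lstsq(X, C, rcond=None)   # learn: linear regression X -> C
    return P, W

def cs_predict(P, W, X):
    C_hat = X @ W                               # intermediate prediction r(x)
    Y_tilde = C_hat @ P                         # back-project: P^T r(x)
    return (Y_tilde > 0.5).astype(int)          # stand-in for the sparse l1 decoding
```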

(9)

From Coding View to Geometric View

label set    | binary code
Y1 = {o}     | y1 = [0, 1, 0]
Y2 = {a, o}  | y2 = [1, 1, 0]
Y3 = {a, s}  | y3 = [1, 0, 1]
Y4 = {o}     | y4 = [0, 1, 0]
Y5 = {}      | y5 = [0, 0, 0]

[figure: the five codes y1, . . . , y5 drawn as vertices of the {0, 1}^3 hypercube]

length-L binary code ⇔ vertex of hypercube {0, 1}^L

(10)

Geometric Interpretation of Binary Relevance

[figure: the same hypercube, with Binary Relevance projecting the codes onto the natural axes]

Binary Relevance: project to thenatural axes& classify

(11)

Geometric Interpretation of Compressive Sensing

[figure: the same hypercube, with Compressive Sensing projecting the codes onto a random flat]

Compressive Sensing:

project to random flat (linear subspace)

learn "on" the flat; decode to closest sparse vertex

other (better) flat? other (faster) decoding?

(12)

Our Contributions

Compression Coding &

Learnable-Compression Coding

A Novel Approach for Label Space Compression

algorithmic: first known algorithm for feature-aware dimension reduction with fast decoding

theoretical: justification for best learnable projection

practical: consistently better performance than compressive sensing (& binary relevance)

will now introduce the key ideas behind the approach

(13)

Faster Decoding: Round-based

Compressive Sensing Revisited

decode: g(x) = find closest sparse binary vector to ỹ = P^T r(x)

For any given "intermediate prediction" (real-valued vector) ỹ,

find closest sparse binary vector to ỹ: slow optimization of ℓ1-regularized objective

find closest (any) binary vector to ỹ: fast, g(x) = round(ỹ)

round-based decoding: simple & faster alternative
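Round-based decoding is essentially a one-liner (thresholding at 0.5 for values in [0, 1]):

```python
# Round-based decoding: snap the real-valued back-projection to the nearest
# binary vertex, without enforcing sparsity.
import numpy as np

def round_decode(y_tilde):
    return (np.asarray(y_tilde) > 0.5).astype(int)

print(round_decode([0.9, 0.2, 0.6]))  # [1 0 1]
```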

(14)

Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with some M by L random matrix P

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to desired output y_n during compression (why?)

(15)

Novel Theoretical Guarantee

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = Py,

(1/L)|g(x) △ Y| ≤ const · ( ‖r(x) − c‖^2 + ‖y − P^T c‖^2 )

where the left-hand side is the Hamming loss, the first term on the right is the "learn" error, and the second is the "compress" error.

‖r(x) − c‖^2: prediction error from input to codeword

‖y − P^T c‖^2: encoding error from desired output to codeword

principal directions: best approximation to desired output y_n during compression (indeed)

(16)

Proposed Approach 1:

Principal Label Space Transform

From Compressive Sensing to PLST

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M by L principal matrix P

2 learn: get regression function r(x) from x_n to c_n

3 decode: g(x) = round(P^T r(x))

principal directions: via Principal Component Analysis on {y_n}_{n=1}^N

physical meaning behind p_m: key (linear) label correlations

PLST: improving CS by projecting to key correlations
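A minimal numpy sketch of the three PLST steps, assuming mean-centered labels and plain linear regression as the learner; the names and small details (such as thresholding at 0.5) are illustrative rather than the exact published procedure.

```python
# PLST sketch: PCA on the label vectors -> linear regression -> round decoding.
import numpy as np

def plst_train(X, Y, M):
    y_mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    P = Vt[:M]                                   # top-M principal directions (M x L)
    C = (Y - y_mean) @ P.T                       # compress: c_n = P (y_n - mean)
    W, *_ = np.linalg.lstsq(X, C, rcond=None)    # learn: linear regression X -> C
    return P, W, y_mean

def plst_predict(P, W, y_mean, X):
    Y_tilde = X @ W @ P + y_mean                 # P^T r(x), shifted back
    return (Y_tilde > 0.5).astype(int)           # round-based decoding
```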

(17)

Theoretical Guarantee of PLST Revisited

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)
If g(x) = round(P^T r(x)), then with codeword c = Py,

(1/L)|g(x) △ Y| ≤ const · ( ‖r(x) − c‖^2 + ‖y − P^T c‖^2 )

where the left-hand side is the Hamming loss, the first term on the right is the "learn" error, and the second is the "compress" error.

‖y − P^T c‖^2: encoding error, minimized during encoding

‖r(x) − c‖^2: prediction error, minimized during learning

but good encoding may not be easy to learn; vice versa

PLST: minimize two errors separately (sub-optimal)

(can we do better by minimizing jointly?)

(18)

Proposed Approach 2:

Conditional Principal Label Space Transform

can we do better by minimizing jointly?

Yes, and it is easy for ridge regression (closed-form solution)

From PLST to CPLST

1 compress: transform {(x_n, y_n)} to {(x_n, c_n)} by c_n = P y_n with the M by L conditional principal matrix P

2 learn: get regression function r(x) from x_n to c_n, ideally using ridge regression

3 decode: g(x) = round(P^T r(x))

conditional principal directions: top eigenvectors of Y^T X X† Y, the key (linear) label correlations that are "easy to learn"

CPLST: project to key learnable correlations

—can also pair with kernel regression (non-linear)
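A simplified numpy reading of CPLST under ridge regression: the conditional principal directions are taken as the top eigenvectors of Z^T Z with Z = H Y_c, where H is the ridge hat matrix of X and Y_c the mean-centered label matrix. This is a sketch of the idea rather than the exact formulation of Chen & Lin (2012); all names are illustrative.

```python
# CPLST sketch: feature-aware ("learnable") principal directions + ridge regression.
import numpy as np

def cplst_train(X, Y, M, lam=1.0):
    y_mean = Y.mean(axis=0)
    Yc = Y - y_mean
    d = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # ridge hat matrix
    Z = H @ Yc                                                # "predictable" part of labels
    _, eigvec = np.linalg.eigh(Z.T @ Z)                       # eigenvalues ascending
    P = eigvec[:, ::-1][:, :M].T                              # top-M directions (M x L)
    C = Yc @ P.T                                              # compress
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ C)   # ridge regression X -> C
    return P, W, y_mean

def cplst_predict(P, W, y_mean, X):
    return ((X @ W @ P + y_mean) > 0.5).astype(int)           # round decoding
```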

(19)

Hamming Loss Comparison: Full-BR, PLST & CS

[figure: Hamming loss vs. number of dimensions on mediamill, with Linear Regression (left) and Decision Tree (right) as the regressor; curves for Full-BR (no reduction), CS, and PLST]

PLST better than Full-BR: fewer dimensions, similar (or better) performance

PLST better than CS: faster, better performance

similar findings across data sets and regression algorithms

(20)

Hamming Loss Comparison: PLST & CPLST

[figure: Hamming loss vs. number of dimensions on yeast with Linear Regression; curves for pbr, cpa, plst, and cplst]

CPLST better than PLST: better performance across all dimensions

similar findings across data sets and regression algorithms

(21)

Topics in this Talk

1 Compression Coding

—condense for efficiency

—capture hidden correlation

2 Learnable-Compression Coding

—condense-by-learnability for better efficiency

—capture hidden & learnable correlation

3 Error-Correction Coding

—expand for accuracy

—capture hidden combination

(22)

Our Contributions (Second Part)

Error-correction Coding

A Novel Framework for Label Space Error-correction

algorithmic: generalize a popular existing algorithm (RAkEL; Tsoumakas & Vlahavas, 2007) and explain it through the coding view

theoretical: link learning performance to error-correcting ability

practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)

(23)

Key Idea: Redundant Information

General Error-correcting Codes (ECC)

[figure: data transmitted over a noisy channel]

commonly used in communication systems

detect & correct errors after transmitting data over a noisy channel

encode data redundantly

ECC for Machine Learning (successful for multi-class classification):

learn redundant bits b =⇒ use predictions of b to correct prediction errors

(24)

Proposed Framework: Multi-labeling with ECC

encode to add redundant information: enc(·) : {0, 1}^L → {0, 1}^M

decode to locate the most probable binary vector: dec(·) : {0, 1}^M → {0, 1}^L

transformation to a larger multi-label classification task with labels b

PLST: M ≪ L (works for large L); MLECC: M > L (works for small L)
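A generic sketch of this framework: enc/dec can be any ECC pair (such as the REP and HAM(7,4) sketches further below), and base_train/base_predict stand for any multi-label learner on the M codeword bits (e.g., Binary Relevance); all names are made up for illustration.

```python
# MLECC pipeline sketch: encode label vectors redundantly, learn on the M
# codeword bits, decode predictions back to L labels.
import numpy as np

def mlecc_train(X, Y, enc, base_train):
    B = np.array([enc(y) for y in Y])          # N x M redundant codewords
    return base_train(X, B)                    # learn h: x -> b

def mlecc_predict(model, X, dec, base_predict):
    B_hat = base_predict(model, X)             # predicted codeword bits
    return np.array([dec(b) for b in B_hat])   # ECC decode: recover the L labels
```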

(25)

Simple Theoretical Guarantee

ECC encode + Larger Multi-label Learning + ECC decode

Theorem
Let g(x) = dec(b̃) with b̃ = h(x). Then,

⟦g(x) ≠ Y⟧ (0/1 loss) ≤ const · (Hamming loss of h(x)) / (ECC strength + 1).

PLST: principal directions + decent regression

MLECC: which ECC balances strength & difficulty?

(26)

Simplest ECC: Repetition Code

encoding: y ∈ {0, 1}^L → b ∈ {0, 1}^M

repeat each bit M/L times

L = 4, M = 28: 1010 −→ 1111111 0000000 1111111 0000000 (each bit repeated 28/4 = 7 times)

permute the bits randomly

decoding: b̃ ∈ {0, 1}^M → ỹ ∈ {0, 1}^L

majority vote on each original bit

L = 4, M = 28: strength of repetition code (REP) = 3

RAkEL = REP (code) + a special powerset (channel)
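A small sketch of the repetition code with the L = 4, M = 28 example above; the random bit permutation is omitted for clarity and the helper names are made up.

```python
# Repetition code: repeat each of the L bits M/L times; decode by majority vote.
import numpy as np

def rep_encode(y, M):
    y = np.asarray(y)
    return np.repeat(y, M // len(y))

def rep_decode(b, L):
    blocks = np.asarray(b).reshape(L, -1)            # one row per original bit
    return (blocks.mean(axis=1) > 0.5).astype(int)   # majority vote

b = rep_encode([1, 0, 1, 0], M=28)   # 1010 -> seven copies of each bit
print(rep_decode(b, L=4))            # [1 0 1 0]
```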

(27)

Slightly More Sophisticated: Hamming Code

HAM(7, 4) Code

{0, 1}^4 → {0, 1}^7 via adding 3 parity bits

—physical meaning: label combinations

b4 = y0 ⊕ y1 ⊕ y3,  b5 = y0 ⊕ y2 ⊕ y3,  b6 = y1 ⊕ y2 ⊕ y3

e.g. 1011 −→ 1011010

strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)

{0, 1}^L −−REP−→ {0, 1}^{4M/7} −−HAM(7,4) on each 4-bit block−→ {0, 1}^M

L = 4, M = 28: strength of HAMR = 4, better than REP!

HAMR + the special powerset: improve RAkEL on code strength
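A small sketch of the HAM(7,4) encoder using the three parity-bit formulas above; it reproduces the example 1011 −→ 1011010. The syndrome-based decoding and the REP-then-HAM composition that yields HAMR are omitted for brevity.

```python
# HAM(7,4) encoder: append three parity (label-combination) bits to y0..y3.
import numpy as np

def ham74_encode(y):
    y0, y1, y2, y3 = y
    return np.array([y0, y1, y2, y3,
                     y0 ^ y1 ^ y3,    # b4
                     y0 ^ y2 ^ y3,    # b5
                     y1 ^ y2 ^ y3])   # b6

print(ham74_encode([1, 0, 1, 1]))     # [1 0 1 1 0 1 0]
```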

(28)

Even More Sophisticated Codes

Bose-Chaudhuri-Hocquenghem Code (BCH)

modern code in CD players

sophisticated extension of Hamming, with more parity bits

codeword length M = 2^p − 1 for p ∈ N

L = 4, M = 31, strength of BCH = 5

Low-density Parity-check Code (LDPC)

modern code for satellite communication

connect ECC and Bayesian learning

approach the theoretical limit in some cases

let’s compare!

(29)

Different ECCs on 3-label Powerset

(scene data set w/ L = 6)

learner: special powerset with Random Forests

REP + special powerset ≈ RAkEL

[figure: 0/1 loss (left) and Hamming loss (right) of the different ECCs on scene]

Compared to RAkEL (on most of the data sets),

HAMR: better 0/1 loss, similar Hamming loss

BCH: even better 0/1 loss, at the price of higher Hamming loss

(30)

Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)

—condense for efficiency: better (than BR) approach PLST

— key tool: PCA from Statistics/Signal Processing

2 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)

—condense learnably for better efficiency: better (than PLST) approach CPLST

— key tool: Ridge Regression from Statistics (+ PCA)

3 Error-correction Coding (Ferng & Lin, ACML Conference 2011, TNNLS Journal 2013)

—expand for accuracy: better (than REP) code HAMR or BCH

— key tool: ECC from Information Theory

Thank you! Questions?
