
# Label Space Coding for Multi-label Classification


(1)

## Label Space Coding for Multi-label Classification

Hsuan-Tien Lin, National Taiwan University

Talk at NTHU, 04/11/2014

joint work with

Farbound Tai (MLD Workshop 2010, NC Journal 2012),

Chun-Sung Ferng (ACML Conference 2011, IEEE TNNLS Journal 2013), and

Yao-Nan Chen (NIPS Conference 2012)

H.-T. Lin (NTU) Label Space Coding 04/11/2014 0 / 34

(2)

Multi-label Classification

## Which Fruit?

?

apple orange strawberry kiwi

multi-class classification:

classify input (picture) to one category (label)

— How?

(3)

Multi-label Classification

## Supervised Machine Learning

Parent teaching: (picture, category) pairs ⟹ kid's good decision function (brain), chosen from many possibilities

Machine learning: truth f(x) + noise e(x) ⟹ examples (picture xn, category yn) ⟹ learning algorithm ⟹ good decision function g(x) ≈ f(x), chosen from the learning model {gα(x)}

challenge:

see only {(xn, yn)} without knowing f(x) or e(x)

⟹ generalize to unseen (x, y) w.r.t. f(x)

(4)

Multi-label Classification

## Which Fruits?

?: {orange, strawberry, kiwi}

apple orange strawberry kiwi

multi-label classification:

classify input to multiple (or no) categories

(5)

Multi-label Classification

## Powerset: Multi-label Classification via Multi-class

Multi-class w/ L = 4 classes: 4 possible outcomes, {a, o, s, k}

Multi-label w/ L = 4 classes: 2⁴ = 16 possible outcomes,

2^{a, o, s, k} = { φ, a, o, s, k, ao, as, ak, os, ok, sk, aos, aok, ask, osk, aosk }

Powerset approach: transformation to multi-class classification

difficulties for large L:

- computation (super-large 2^L) — hard to construct classifier
- sparsity (no examples for some of the 2^L combinations) — hard to discover hidden combinations

Powerset: feasible only for small L with enough examples for every combination
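The powerset transformation can be sketched in a few lines (helper names are illustrative, not from the talk); note how only combinations seen in training ever get a class id, which is exactly the sparsity problem:

```python
def powerset_transform(label_sets):
    """Label powerset sketch: map each distinct label set to one
    multi-class id.  A combination absent from the training data
    can never be predicted by the resulting multi-class classifier."""
    mapping = {}                      # frozenset -> class id
    classes = []
    for Y in label_sets:
        key = frozenset(Y)
        mapping.setdefault(key, len(mapping))
        classes.append(mapping[key])
    return classes, mapping

# L = 4 fruits: 0=apple, 1=orange, 2=strawberry, 3=kiwi
train = [{1}, {0, 1}, {0, 2}, {1}, set()]
classes, mapping = powerset_transform(train)
print(classes)   # [0, 1, 2, 0, 3]
```

Here only 4 of the 2⁴ = 16 possible outcomes appear, so 12 combinations are unlearnable for this training set.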


(6)

Multi-label Classification

## What Tags?

?: {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

another multi-label classification problem:

tagging input with multiple categories

(7)

Multi-label Classification

## Binary Relevance: Multi-label Classification via Yes/No

Binary Classification: {yes, no}

Multi-label w/ L classes: L yes/no questions

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

Binary Relevance approach: transformation to multiple isolated binary classification tasks

disadvantages:

- isolation — hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)
- unbalanced — few yes, many no

Binary Relevance: simple (& good) benchmark with known disadvantages
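A minimal Binary Relevance sketch, assuming a linear least-squares scorer per label (NumPy only; function names are illustrative). Each label is fit in isolation, which is exactly the stated disadvantage:

```python
import numpy as np

def br_fit(X, Y):
    """Binary Relevance sketch: one independent least-squares scorer
    per label (column of Y).  Correlations between labels are ignored."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                           # one weight column per label

def br_predict(W, X):
    return (X @ W >= 0.5).astype(int)  # L independent yes/no decisions

X = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])  # L = 2 labels
W = br_fit(X, Y)
assert (br_predict(W, X) == Y).all()
```

Any binary classifier could replace the least-squares scorer; the point is only that the L problems are solved separately.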


(8)

Multi-label Classification

## Multi-label Classification Setup

Given

N examples (input xn, label-set Yn) ∈ X × 2^{1,2,...,L}

- fruits: X = encoding(pictures), Yn ⊆ {1, 2, 3, 4}
- tags: X = encoding(merchandise), Yn ⊆ {1, 2, · · · , L}

Goal

a multi-label classifier g(x) that closely predicts the label-set Y associated with some unseen input x (by exploiting hidden relations/combinations between labels)

- 0/1 loss: any discrepancy ⟦g(x) ≠ Y⟧
- Hamming loss: averaged symmetric difference (1/L)·|g(x) △ Y|

multi-label classification: hot and important

(9)

Multi-label Classification

## Topics in this Talk

1 Compression Coding

—condense for efficiency

—capture hidden correlation

2 Error-correction Coding

—expand for accuracy

—capture hidden combination

3 Learnable-Compression Coding

—condense-by-learnability for better efficiency

—capture hidden & learnable correlation

(10)

Compression Coding

## From Label-set to Coding View

| label set | apple | orange | strawberry | binary code |
| --- | --- | --- | --- | --- |
| Y1 = {o} | 0 (N) | 1 (Y) | 0 (N) | y1 = [0, 1, 0] |
| Y2 = {a, o} | 1 (Y) | 1 (Y) | 0 (N) | y2 = [1, 1, 0] |
| Y3 = {a, s} | 1 (Y) | 0 (N) | 1 (Y) | y3 = [1, 0, 1] |
| Y4 = {o} | 0 (N) | 1 (Y) | 0 (N) | y4 = [0, 1, 0] |
| Y5 = {} | 0 (N) | 0 (N) | 0 (N) | y5 = [0, 0, 0] |

label set Y ∈ 2^{1,2,...,L} (i.e. Y ⊆ {1, 2, · · · , L}) ⇔ length-L binary code y
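The table's correspondence is a one-line encoding in either direction (a sketch with illustrative names):

```python
def set_to_code(Y, L):
    """Label set -> length-L binary code (the coding view)."""
    return [1 if j in Y else 0 for j in range(L)]

def code_to_set(y):
    """Length-L binary code -> label set."""
    return {j for j, bit in enumerate(y) if bit == 1}

# apple=0, orange=1, strawberry=2
assert set_to_code({1}, 3) == [0, 1, 0]      # Y1 = {o}  -> y1
assert set_to_code(set(), 3) == [0, 0, 0]    # Y5 = {}   -> y5
assert code_to_set([1, 0, 1]) == {0, 2}      # y3 -> {a, s}
```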

(11)

Compression Coding

## Existing Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors y ∈ {0,1}^L can be robustly compressed by projecting to M ≪ L basis vectors {p1, p2, · · · , pM}

Compressive Sensing for Multi-label Classification (Hsu et al., 2009)

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = Pyn with some M by L random matrix P = [p1, p2, · · · , pM]ᵀ

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = find closest sparse binary vector to Pᵀr(x)

Compressive Sensing:

- efficient in training: random projection w/ M ≪ L (any better projection scheme?)
- inefficient in testing: time-consuming decoding (any faster decoding method?)
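A toy sketch of the three CS steps (random P; the exact codeword stands in for the regression output, and decoding is done by brute force over sparse candidates to show why this step is slow — real decoders solve an ℓ1-regularized problem instead; all names are illustrative):

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
L, M = 8, 4                          # compress length-L labels to M < L
P = rng.standard_normal((M, L))      # step 1: random M-by-L matrix

y = np.zeros(L)
y[[1, 5]] = 1                        # a sparse label vector (2 of 8 labels)
c = P @ y                            # codeword c = P y

# step 3 (decoding): search for the closest 2-sparse binary vector,
# brute force over all C(8,2) = 28 candidates
best = min((np.eye(L)[list(idx)].sum(axis=0)
            for idx in combinations(range(L), 2)),
           key=lambda cand: np.linalg.norm(P @ cand - c))
assert (best == y).all()
```

Even this toy search grows combinatorially in L; with regression noise on c the decoding problem only gets harder, motivating the round-based alternative below.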


(12)

Compression Coding

## Compression Coding

A Novel Approach for Label Space Compression

- algorithmic: scheme for fast decoding
- theoretical: justification for the best projection
- practical: significantly better performance than compressive sensing (& binary relevance)

(13)

Compression Coding

## Faster Decoding: Round-based

Compressive Sensing Revisited

decode: g(x) = find closest sparse binary vector to ỹ = Pᵀr(x)

For a given "intermediate prediction" (real-valued vector) ỹ:

- finding the closest sparse binary vector to ỹ is slow: optimization of an ℓ1-regularized objective
- finding the closest (any) binary vector to ỹ is fast: g(x) = round(ỹ)

round-based decoding: simple & faster alternative

(14)

Compression Coding

## Better Projection: Principal Directions

Compressive Sensing Revisited

compress: transform {(xn, yn)} to {(xn, cn)} by cn = Pyn with some M by L random matrix P

random projection: arbitrary directions

best projection: principal directions

principal directions: best approximation to desired output yn during compression (why?)

(15)

Compression Coding

## Novel Theoretical Guarantee

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(Pᵀr(x)), then with codeword c = Py,

(1/L)·|g(x) △ Y|  ≤  const · ( ‖r(x) − c‖² + ‖y − Pᵀc‖² )

(Hamming loss on the left; "learn" term + "compress" term on the right)

- ‖r(x) − c‖²: prediction error from input to codeword
- ‖y − Pᵀc‖²: encoding error from desired output to codeword

principal directions: best approximation to desired output yn during compression (indeed)

(16)

Compression Coding

## Proposed Approach: Principal Label Space Transform

From Compressive Sensing to PLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = Pyn with the M by L principal matrix P

2 learn: get regression function r(x) from xn to cn

3 decode: g(x) = round(Pᵀr(x))

principal directions: via Principal Component Analysis on {yn}, n = 1, ..., N

physical meaning behind pm: key (linear) label correlations

PLST: improving CS by projecting to key correlations
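A compact PLST sketch with NumPy (SVD in place of a PCA library; a sketch, not the paper's implementation). The toy label matrix makes labels 0 and 1 perfectly correlated, so M = 2 principal directions capture all three labels:

```python
import numpy as np

def plst_fit(Y, M):
    """PLST sketch: top-M principal directions of the shifted label matrix."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    return Vt[:M], mean                # P is M x L

def plst_encode(P, mean, Y):
    return (Y - mean) @ P.T            # c_n = P (y_n - mean)

def plst_decode(P, mean, C):
    return (C @ P + mean >= 0.5).astype(int)   # round-based decoding

# labels 0 and 1 always co-occur -> a key correlation PCA can exploit
Y = np.array([[1,1,0],[1,1,0],[0,0,1],[0,0,0],[1,1,1]], dtype=float)
P, mu = plst_fit(Y, M=2)
Yhat = plst_decode(P, mu, plst_encode(P, mu, Y))
assert (Yhat == Y).all()               # M = 2 < L = 3 loses nothing here
```

With uncorrelated labels the same M would be lossy; the theorem's "compress" term measures exactly that loss.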

(17)

Compression Coding

## Hamming Loss Comparison: Full-BR, PLST & CS

[Figure: Hamming loss vs. number of dimensions (0–100) on mediamill, for Linear Regression and Decision Tree regressors; curves for Full-BR (no reduction), CS, and PLST.]

PLST better than Full-BR: fewer dimensions, similar (or better) performance

PLST better than CS: faster, better performance

similar findings across data sets and regression algorithms

(18)

Compression Coding

## Semi-summary on PLST

project to principal directions and capture key correlations

efficient learning (after label space compression)

efficient decoding (round-based)

sound theoretical guarantee + good practical performance (better than CS & BR)

expansion (channel coding) instead of compression ("lossy" source coding)? YES!

(19)

Error-correction Coding

## Error-correction Coding

A Novel Framework for Label Space Error-correction

- algorithmic: generalize a popular existing algorithm (RAkEL; Tsoumakas & Vlahavas, 2007) and explain it through the coding view
- theoretical: link learning performance to error-correcting ability
- practical: explore choices of error-correcting code and obtain better performance than RAkEL (& binary relevance)

(20)

Error-correction Coding

## Key Idea: Redundant Information

General Error-correcting Codes (ECC)

commonly used in communication systems: encode data redundantly, then detect & correct errors after transmitting over a noisy channel

ECC for Machine Learning (successful for multi-class classification):

learn redundant bits b ⟹ correct prediction errors

(21)

Error-correction Coding

## Proposed Framework: Multi-labeling with ECC

encode to add redundant information: enc(·): {0,1}^L → {0,1}^M

decode to locate the most probable binary vector: dec(·): {0,1}^M → {0,1}^L

transformation to a larger multi-label classification problem with labels b

PLST: M ≪ L (works for large L); ML-ECC: M > L (works for small L)

(22)

Error-correction Coding

## Simple Theoretical Guarantee

ECC encode + Larger Multi-label Learning + ECC decode

Theorem

Let g(x) = dec(b̃) with b̃ = h(x). Then

⟦g(x) ≠ Y⟧ (0/1 loss)  ≤  const · (Hamming loss of h(x)) / (ECC strength + 1)

PLST: principal directions + decent regression

ML-ECC: which ECC balances strength & difficulty?

(23)

Error-correction Coding

## Simplest ECC: Repetition Code

encoding: y ∈ {0,1}^L → b ∈ {0,1}^M

repeat each bit M/L times, then permute the bits randomly

L = 4, M = 28: 1010 → 1111111 0000000 1111111 0000000 (each bit repeated 28/4 = 7 times)

decoding: b̃ ∈ {0,1}^M → ỹ ∈ {0,1}^L

majority vote on each original bit

L = 4, M = 28: strength of repetition code (REP) = 3

RAkEL = REP (code) + a special powerset (channel)
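The repetition code's encode/decode pair fits in a few lines (stdlib only; the fixed seed stands in for the shared random permutation, all names illustrative). With M/L = 7 copies per bit, majority vote survives any 3 flipped bits, matching strength = 3:

```python
import random

def rep_encode(y, M, seed=7):
    """Repetition code: repeat each of the L bits M/L times, then permute."""
    L = len(y)
    b = [bit for bit in y for _ in range(M // L)]
    rng = random.Random(seed)
    perm = list(range(M))
    rng.shuffle(perm)
    return [b[p] for p in perm]

def rep_decode(b, L, seed=7):
    """Majority vote on each original bit (undo the permutation first)."""
    M = len(b)
    rng = random.Random(seed)
    perm = list(range(M))
    rng.shuffle(perm)
    unshuffled = [0] * M
    for i, p in enumerate(perm):
        unshuffled[p] = b[i]
    k = M // L
    return [1 if sum(unshuffled[j*k:(j+1)*k]) * 2 > k else 0
            for j in range(L)]

y = [1, 0, 1, 0]                 # L = 4, M = 28: each bit 7 times
b = rep_encode(y, 28)
for i in (0, 1, 2):              # flip any 3 bits ...
    b[i] ^= 1
assert rep_decode(b, 4) == y     # ... majority vote still recovers y
```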


(24)

Error-correction Coding

## Slightly More Sophisticated: Hamming Code

HAM(7, 4) Code

{0,1}^4 → {0,1}^7 via adding 3 parity bits

—physical meaning: label combinations

b4 = y0 ⊕ y1 ⊕ y3,  b5 = y0 ⊕ y2 ⊕ y3,  b6 = y1 ⊕ y2 ⊕ y3

e.g. 1011 → 1011010; strength = 1 (weak)

Our Proposed Code: Hamming on Repetition (HAMR)

{0,1}^L → (REP) → {0,1}^{4M/7} → (HAM(7,4) on each 4-bit block) → {0,1}^M

L = 4, M = 28: strength of HAMR = 4, better than REP!

HAMR + the special powerset: improve RAkEL on code strength
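The HAM(7,4) parity bits above can be written out directly; nearest-codeword decoding is shown for completeness (a sketch, not the paper's implementation):

```python
from itertools import product

def ham74_encode(y):
    """HAM(7,4): append the 3 parity bits to the 4 data bits."""
    y0, y1, y2, y3 = y
    return [y0, y1, y2, y3,
            y0 ^ y1 ^ y3,    # b4
            y0 ^ y2 ^ y3,    # b5
            y1 ^ y2 ^ y3]    # b6

def ham74_decode(b):
    """Brute-force nearest-codeword decoding over all 16 codewords;
    corrects any single flipped bit (minimum distance 3)."""
    return list(min(product([0, 1], repeat=4),
                    key=lambda y: sum(u != v for u, v in
                                      zip(ham74_encode(list(y)), b))))

assert ham74_encode([1, 0, 1, 1]) == [1, 0, 1, 1, 0, 1, 0]  # 1011 -> 1011010
noisy = ham74_encode([1, 0, 1, 1])
noisy[2] ^= 1                        # one transmission error ...
assert ham74_decode(noisy) == [1, 0, 1, 1]   # ... corrected
```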

(25)

Error-correction Coding

## Even More Sophisticated Codes

Bose–Chaudhuri–Hocquenghem Code (BCH)

modern code in CD players; a sophisticated extension of Hamming with more parity bits

codeword length M = 2^p − 1 for p ∈ ℕ

L = 4, M = 31: strength of BCH = 5

Low-density Parity-check Code (LDPC)

modern code for satellite communication; connects ECC and Bayesian learning

approaches the theoretical limit in some cases

let's compare!

(26)

Error-correction Coding

## Different ECCs on 3-label Powerset

(scene data set w/ L = 6)

learner: special powerset with Random Forests; REP + special powerset ≈ RAkEL

[Figure: 0/1 loss and Hamming loss of the different ECCs.]

Comparing to RAkEL (on most data sets), HAMR: better 0/1 loss, similar Hamming loss

(27)

Error-correction Coding

## Semi-summary on MLECC

transformation to a larger multi-label classification problem

encode via an error-correcting code and capture label combinations (parity bits)

effective decoding (error-correcting)

simple theoretical guarantee + good practical performance

to improve RAkEL, replace REP by

- HAMR ⟹ lower 0/1 loss, similar Hamming loss
- BCH ⟹ even lower 0/1 loss, but higher Hamming loss

to improve Binary Relevance, · · ·

(28)

Learnable-Compression Coding

## Theoretical Guarantee of PLST Revisited

Linear Transform + Learn + Round-based Decoding

Theorem (Tai and Lin, 2012)

If g(x) = round(Pᵀr(x)), then with codeword c = Py,

(1/L)·|g(x) △ Y|  ≤  const · ( ‖r(x) − c‖² + ‖y − Pᵀc‖² )

(Hamming loss on the left; "learn" term + "compress" term on the right)

- ‖y − Pᵀc‖²: encoding error, minimized during encoding
- ‖r(x) − c‖²: prediction error, minimized during learning

but a good encoding may not be easy to learn, and vice versa

PLST: minimizes the two errors separately (sub-optimal)

(can we do better by minimizing them jointly?)

(29)

Learnable-Compression Coding

## Learnable-Compression Coding

A Novel Approach for Label Space Compression

- algorithmic: first known algorithm for feature-aware dimension reduction
- theoretical: justification for the best learnable projection
- practical: consistently better performance than PLST

(30)

Learnable-Compression Coding

## The In-Sample Optimization Problem

min over r, P:  ‖r(X) − PY‖² (learn) + ‖Y − PᵀPY‖² (compress)

start from a well-known tool: linear regression as r, i.e. r(X) = XW

for a fixed P: the closed-form solution for learn is W = X†PY (X†: pseudo-inverse of X)

optimal P:

- for learn: top eigenvectors of Yᵀ(I − XX†)Y
- for compress: top eigenvectors of YᵀY
- for both: top eigenvectors of YᵀXX†Y
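The "both" case can be checked numerically with a few lines of NumPy (X† via `pinv`; function and variable names are illustrative):

```python
import numpy as np

def cplst_directions(X, Y, M):
    """CPLST sketch: conditional principal directions = top-M eigenvectors
    of Y^T X X^+ Y, i.e. the label directions best explained by linear
    regression on X (the 'both' case on the slide)."""
    H = X @ np.linalg.pinv(X)          # hat matrix X X^+
    G = Y.T @ H @ Y                    # symmetric L x L
    _, vecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    return vecs[:, ::-1][:, :M].T      # rows: top-M directions (M x L)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
Y = (rng.standard_normal((20, 3)) > 0).astype(float)
P = cplst_directions(X, Y, M=2)
assert P.shape == (2, 3)
assert np.allclose(P @ P.T, np.eye(2))   # orthonormal projection rows
```

Replacing `G` with `Y.T @ Y` recovers the PLST (compress-only) directions, which makes the two approaches easy to compare side by side.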

(31)

Learnable-Compression Coding

## Proposed Approach: Conditional Principal Label Space Transform

From PLST to CPLST

1 compress: transform {(xn, yn)} to {(xn, cn)} by cn = Pyn with the M by L conditional principal matrix P

2 learn: get regression function r(x) from xn to cn, ideally using linear regression

3 decode: g(x) = round(Pᵀr(x))

conditional principal directions: top eigenvectors of YᵀXX†Y

physical meaning behind pm: key (linear) label correlations that are "easy to learn"

CPLST: project to key learnable correlations

—can also pair with kernel regression (non-linear)

(32)

Learnable-Compression Coding

## Hamming Loss Comparison: PLST & CPLST

[Figure: Hamming loss vs. number of dimensions on yeast (Linear Regression); curves for PBR, CPA, PLST, and CPLST.]

CPLST better than PLST: better performance across all dimensions

similar findings across data sets and regression algorithms

(33)

Learnable-Compression Coding

## Semi-summary on CPLST

project to conditional principal directions and capture key learnable correlations

more efficient

sound theoretical guarantee (via PLST) + good practical performance (better than PLST)

CPLST: state-of-the-art for label space compression

(34)

## Conclusion

1 Compression Coding (Tai & Lin, MLD Workshop 2010; NC Journal 2012)

—condense for efficiency: better (than BR) approach PLST

—key tool: PCA from Statistics/Signal Processing

2 Error-correction Coding (Ferng & Lin, ACML Conference 2011)

—expand for accuracy: better (than REP) code HAMR or BCH

—key tool: ECC from Information Theory

3 Learnable-Compression Coding (Chen & Lin, NIPS Conference 2012)

—condense for efficiency: better (than PLST) approach CPLST

—key tool: Linear Regression from Statistics (+ PCA)

More...

- beyond standard ECC-decoding (Ferng and Lin, IEEE TNNLS 2013)
- coupling CPLST with other regressors (Chen and Lin, NIPS 2012)
- multi-label classification with arbitrary loss (Li and Lin, ICML 2014)
- dynamic instead of static coding; combining ML-ECC & PLST/CPLST (...)

(35)

## Thank you! Questions?

