
Feature-aware Label Space Dimension Reduction for Multi-label Classification

Yao-Nan Chen (r99922008@csie.ntu.edu.tw) and Hsuan-Tien Lin (htlin@csie.ntu.edu.tw)

Department of Computer Science and Information Engineering, National Taiwan University

Multi-label Classification Setup

Which tags Y are associated with this picture x?

Y = { building, taipei 101, day view, night view, skyscraper, fireworks, new york, car, face, taipei world financial center, university, etc.}

(CC BY-SA SElefant from Wikimedia Commons)

• Given: N examples {(x_n, Y_n)}_{n=1}^N with x_n ∈ R^d and Y_n ⊆ {1, 2, · · · , K}

• Goal: a classifier g(x) that closely predicts the label set Y associated with an unseen input x, presumably by exploiting hidden relations between labels, e.g.

– taipei 101 & taipei world financial center highly correlated
– skyscraper subset of building
– day view & night view disjoint

Label Space Dimension Reduction

Y ⊆ {1, 2, · · · , K} is equivalent to y ∈ {0, 1}^K
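A minimal Python sketch of this equivalence (the K = 4 labels below are illustrative, not from the paper): each label set Y_n ⊆ {1, · · · , K} becomes a binary indicator vector y_n ∈ {0, 1}^K, and the N examples stack into an N × K label matrix Y.

```python
import numpy as np

# Illustrative only: K = 4 hypothetical labels, e.g.
# 1 = building, 2 = skyscraper, 3 = day view, 4 = night view
K = 4
label_sets = [{1, 2}, {1, 3}, {3, 4}]          # three example label sets Y_n

Y = np.zeros((len(label_sets), K), dtype=int)  # N x K indicator matrix
for n, labels in enumerate(label_sets):
    Y[n, [k - 1 for k in labels]] = 1          # label k -> column k-1

print(Y)
# [[1 1 0 0]
#  [1 0 1 0]
#  [0 0 1 1]]
```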

• feature space dimension reduction: compress x to remove irrelevant, redundant (possibly related), or noisy information, and achieve better efficiency & performance
– principal component analysis (PCA): linearly project x to w_m^T x with minimum projection error
– canonical correlation analysis (CCA): linearly project x to w_m^T x in order to maximize correlation with some v_m^T y

• label space dimension reduction: analogously, but compress y instead (a sketch of the three-step scheme follows this list)
1. compress: transform {(x_n, y_n)} to {(x_n, t_n)} with t_n = compress(y_n) ∈ R^M and M ≪ K
2. learn: train some r(x) from {(x_n, t_n)}
3. decompress: g(x) = decompress(r(x))
– compressive sensing (Hsu et al., NIPS 2009): linearly project y to t[m] = v_m^T y with random v_m's (for incoherence)
– principal label space transformation (PLST; Tai and Lin, NC 2012): linearly project y to t[m] = v_m^T y with minimum projection error (sibling of PCA)
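The three-step scheme can be sketched in a few lines of Python. The sketch below is only an illustration under simplifying assumptions: a PLST-like compression (top principal directions of the mean-shifted label matrix), plain least squares as the learner r, and rounding at 0.5 as the decoder; it is not the authors' implementation.

```python
import numpy as np

def lsdr_fit(X, Y, M):
    """Compress / learn on an N x d feature matrix X and an N x K binary label matrix Y."""
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean                                    # mean-shifted labels
    # 1. compress: t_n = V z_n with V = top-M right singular vectors of Z (PLST-like)
    V = np.linalg.svd(Z, full_matrices=False)[2][:M]  # M x K, orthonormal rows
    T = Z @ V.T                                       # N x M codes t_n
    # 2. learn: train a regressor r(x) on {(x_n, t_n)}; plain least squares here
    W = np.linalg.lstsq(X, T, rcond=None)[0]          # d x M
    return V, W, y_mean

def lsdr_predict(X, V, W, y_mean):
    # 3. decompress: g(x) = round(V^T r(x) + mean), back to {0, 1}^K
    T_hat = X @ W                                     # r(x) for each row of X
    return (T_hat @ V + y_mean >= 0.5).astype(int)
```

Compressive sensing differs from this sketch in its two design choices: the rows of V would be drawn at random (for incoherence) and the decoder would use sparse reconstruction rather than rounding.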

Feature-Aware Label Space Dimension Reduction

• feature space dimension reduction

unsupervised (not using y): PCA, locally linear embedding, etc.
supervised (using y): CCA, sliced inverse regression, etc.

—supervised generally better for learning from compress(x) to y

• label space dimension reduction

feature-unaware (not using x): PLST, compressive sensing, etc.
feature-aware (using x): ???

—can we improve PLST by feature-aware label space dimension reduction?

Conditional Principal Label Space Transformation

• idea 1: exploit dual role of CCA to be feature-aware

project x to w_m^T x in order to maximize correlation with some v_m^T y
≡ project y to v_m^T y in order to maximize correlation with some w_m^T x
≈ project y to v_m^T y in order to minimize difference to some w_m^T x

proposed OCCA: min_{W,V} ‖XW^T − YV^T‖_F^2, s.t. VV^T = I
– project to easiest-by-linear-regression directions

• idea 2: keep benefits of PLST for compression

existing PLST: min_V ‖Y − YV^T V‖_F^2, s.t. VV^T = I
– project to most representative directions

• proposed algorithm: conditional principal label space transformation (CPLST)

min_{W,V} ‖XW^T − YV^T‖_F^2 (learning error) + ‖Y − YV^T V‖_F^2 (compression error), s.t. VV^T = I

– theoretical guarantee (Tai and Lin, NC 2012): when using linear regression as r, Hamming loss ≤ learning error + compression error

– algorithmic simplicity: closed-form optimal V contains the top eigenvectors of

OCCA: Y^T (XX^† − I) Y        PLST: Y^T Y        CPLST: Y^T XX^† Y

where XX^† is the hat matrix of linear regression (see the code sketch after this list)
(note: Z, i.e. the mean-shifted Y, is actually used for better projection)
– physical meaning: exploit conditional (feature-aware) correlations

– kernelization: replace linear regression with kernel ridge regression as r
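A small Python sketch of the closed form above, under stated assumptions: V is taken as the top-M eigenvectors of Z^T (XX^†) Z, with Z the mean-shifted label matrix and the hat matrix XX^† computed through a plain pseudoinverse (no regularization, no kernel). It illustrates the stated eigenvector form, not the authors' exact procedure.

```python
import numpy as np

def cplst_directions(X, Y, M):
    """Top-M conditional principal directions: eigenvectors of Z^T (X X^+) Z."""
    Z = Y - Y.mean(axis=0)                # mean-shifted label matrix
    H = X @ np.linalg.pinv(X)             # N x N hat matrix X X^+ of linear regression
    A = Z.T @ H @ Z                       # K x K symmetric matrix listed above for CPLST
    eigval, eigvec = np.linalg.eigh(A)    # eigenvalues in ascending order
    V = eigvec[:, -M:][:, ::-1].T         # top-M eigenvectors as the M x K rows of V
    return V
```

Given V, the rest of the pipeline (encode t_n = V z_n, train r, decompress and round) is the same as in the earlier sketch; the kernel variant mentioned above would replace linear regression, and hence this hat matrix, with a kernel ridge regression counterpart, which is omitted here.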

Experimental Results

on yeast data set:

[figure: learning error ‖XW^T − ZV^T‖_F^2, compression error ‖Z − ZV^T V‖_F^2, learning+compression error, and Hamming loss versus # of dimension, comparing PBR, OCCA, PLST, and CPLST]

• PBR: baseline, with standard basis as v_m

• OCCA: optimize learning error, but worst in compression error

• PLST: optimize compression error, but worst in learning error

• CPLST: optimize learning+compression error, and hence best Hamming loss

on 8 benchmark data sets:

algorithms    CPLST vs. PLST          CPLST vs. PLST        kernel-CPLST vs. PLST
              + linear regression     + decision tree       + kernel ridge regression
M = 20%K      3 win, 5 similar        2 win, 6 similar      5 win, 1 lose, 2 similar

CPLST consistently better than or similar to PLST across data & algorithms

Summary

Conditional Principal Label Space Transformation, which

• projects to conditional principal directions by combining ideas behind CCA (feature-aware) and PLST (optimal compression)

• can be kernelized for exploiting feature power

• achieves better or similar practical performance consistently when compared with the already-strong PLST
