Feature-aware Label Space Dimension Reduction for Multi-label Classification
Yao-Nan Chen (r99922008@csie.ntu.edu.tw) and Hsuan-Tien Lin (htlin@csie.ntu.edu.tw)
Department of Computer Science and Information Engineering, National Taiwan University
Multi-label Classification Setup
Which tags Y are associated with this picture x?
Y = {building, taipei 101, day view, night view, skyscraper, fireworks, new york, car, face, taipei world financial center, university, etc.}
(CC BY-SA SElefant from Wikimedia Commons)
• Given: N examples $\{(x_n, \mathcal{Y}_n)\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^d$ and $\mathcal{Y}_n \subseteq \{1, 2, \cdots, K\}$
• Goal: a classifier g(x) that closely predicts the label set Y associated with an unseen input x, presumably by exploiting hidden relations between labels, e.g.
– taipei 101 & taipei world financial center: highly correlated
– skyscraper: subset of building
– day view & night view: disjoint
Label Space Dimension Reduction
$\mathcal{Y} \subseteq \{1, 2, \cdots, K\}$ is equivalent to $y \in \{0, 1\}^K$
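For concreteness, a minimal Python sketch of this set-to-vector equivalence (the helper name `to_binary` is ours, not from the poster):

```python
import numpy as np

def to_binary(Y_set, K):
    """Map a label set Y ⊆ {1, ..., K} to its indicator vector y ∈ {0, 1}^K."""
    y = np.zeros(K, dtype=int)
    y[[k - 1 for k in Y_set]] = 1  # labels are 1-indexed in the poster's notation
    return y

to_binary({1, 3}, K=5)  # -> array([1, 0, 1, 0, 0])
```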
• feature space dimension reduction: compress x to remove irrelevant, redundant (possibly related), or noisy information, and achieve better efficiency & performance
– principal component analysis (PCA): linearly project x to $w_m^T x$ with minimum projection error
– canonical correlation analysis (CCA): linearly project x to $w_m^T x$ in order to maximize correlation with some $v_m^T y$
• label space dimension reduction: analogously, but compress y instead (see the sketch after this list)
1. compress: transform $\{(x_n, y_n)\}$ to $\{(x_n, t_n)\}$ with $t_n = \text{compress}(y_n) \in \mathbb{R}^M$ and $M \ll K$
2. learn: train some r(x) from $\{(x_n, t_n)\}$
3. decompress: g(x) = decompress(r(x))
– compressive sensing (Hsu et al., NIPS 2009): linearly project y to $t[m] = v_m^T y$ with random $v_m$'s (for incoherence)
– principal label space transformation (PLST; Tai and Lin, NC 2012): linearly project y to $t[m] = v_m^T y$ with minimum projection error (sibling of PCA)
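The compress/learn/decompress scheme fits in a few lines; below is a minimal Python sketch with PLST-style compression (top right-singular vectors of the mean-shifted label matrix) and plain least-squares linear regression as r — the function names and the 0.5 rounding threshold are our choices, not prescribed by the poster:

```python
import numpy as np

def plst_fit(X, Y, M):
    """Compress labels PLST-style, then learn linear regression; X: N x d, Y: N x K in {0,1}."""
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean                                 # mean-shifted label matrix
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:M]                                     # M x K projection, rows orthonormal (VV^T = I)
    T = Z @ V.T                                    # compress: t_n = V z_n, N x M
    W, *_ = np.linalg.lstsq(X, T, rcond=None)      # learn: linear regression r(x) = W^T x, W: d x M
    return W, V, y_mean

def plst_predict(W, V, y_mean, X):
    """Decompress: g(x) = round(V^T r(x) + label mean)."""
    return ((X @ W @ V + y_mean) > 0.5).astype(int)
```

Decompression inverts the linear projection and rounds back to a binary label vector.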
Feature-Aware Label Space Dimension Reduction
• feature space dimension reduction
unsupervised (not using y): PCA, locally linear embedding, etc.
supervised (using y): CCA, sliced inverse regression, etc.
—supervised generally better for learning from compress(x) to y
• label space dimension reduction
feature-unaware (not using x): PLST, compressive sensing, etc.
feature-aware (using x): ???
—can we improve PLST by feature-aware label space dimension reduction?
Conditional Principal Label Space Transformation
• idea 1: exploit dual role of CCA to be feature-aware
project x to $w_m^T x$ in order to maximize correlation with some $v_m^T y$
≡ project y to $v_m^T y$ in order to maximize correlation with some $w_m^T x$
≈ project y to $v_m^T y$ in order to minimize difference to some $w_m^T x$
proposed OCCA: $\min_{W, V} \|XW^T - YV^T\|_F^2$, s.t. $VV^T = I$
– project to easiest-by-linear-regression directions
• idea 2: keep benefits of PLST for compression
existing PLST: $\min_{V} \|Y - YV^T V\|_F^2$, s.t. $VV^T = I$
– project to most representative directions
• proposed algorithm: conditional principal label space transformation (CPLST)
$$\min_{W, V} \underbrace{\|XW^T - YV^T\|_F^2}_{\text{learning error}} + \underbrace{\|Y - YV^T V\|_F^2}_{\text{compression error}}, \quad \text{s.t. } VV^T = I$$
– theoretical guarantee (Tai and Lin, NC 2012): when using linear regression as r, Hamming loss ≤ learning error + compression error
– algorithmic simplicity: closed-form optimal V contains the top eigenvectors of
OCCA: $Y^T(XX^\dagger - I)Y$    PLST: $Y^T Y$    CPLST: $Y^T XX^\dagger Y$
(where $XX^\dagger$ is the hat matrix of linear regression; see the sketch after this list)
(note: Z, i.e. the mean-shifted Y, is actually used for better projection)
– physical meaning: exploit conditional (feature-aware) correlations
– kernelization: replace linear regression with kernel ridge regression as r
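Under the note above (using the mean-shifted Z in place of Y), the closed-form step of CPLST can be sketched as follows; forming the hat matrix explicitly costs O(N²) memory, so this is a small-scale illustration of the formula, not the authors' implementation:

```python
import numpy as np

def cplst_directions(X, Y, M):
    """Rows of V = top-M eigenvectors of Z^T (X X†) Z, so that VV^T = I."""
    Z = Y - Y.mean(axis=0)                  # mean-shifted labels, as noted above
    H = X @ np.linalg.pinv(X)               # hat matrix X X†, N x N
    A = Z.T @ H @ Z                         # K x K, symmetric
    _, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :M].T        # top-M eigenvectors as rows, M x K
```

Training and decoding then proceed exactly as in the earlier compress/learn/decompress sketch, with V replaced by these conditional principal directions.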
Experimental Results
on yeast data set:
[Four plots on the yeast data set, each versus the number of dimensions M (0 to 15), comparing PBR, OCCA, PLST, and CPLST: learning error $\|XW^T - ZV^T\|^2$, compression error $\|Z - ZV^T V\|^2$, learning+compression error, and Hamming loss]
• PBR: baseline, with standard basis as $v_m$
• OCCA: optimize learning error, but worst in compression error
• PLST: optimize compression error, but worst in learning error
• CPLST: optimize learning+compression error, and hence best Hamming loss

on 8 benchmark data sets:
algorithms    CPLST vs. PLST        CPLST vs. PLST      kernel-CPLST vs. PLST
              + linear regression   + decision tree     + kernel ridge regression
M = 20%K      3 win, 5 similar      2 win, 6 similar    5 win, 1 lose, 2 similar
CPLST consistently better than or similar to PLST across data & algorithms
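For reference, the Hamming loss reported above is the fraction of wrongly predicted (example, label) entries; a one-line Python sketch:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average per-entry disagreement between binary label matrices."""
    return np.mean(Y_true != Y_pred)
```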
Summary
Conditional Principal Label Space Transformation, which
• projects to conditional principal directions by combining ideas behind CCA (feature-aware) and PLST (optimal compression)
• can be kernelized for exploiting feature power
• achieves consistently better or similar practical performance when compared with the already-strong PLST