Feature-aware Label Space Dimension Reduction for Multi-label Classification
Yao-Nan Chen (r99922008@csie.ntu.edu.tw) and Hsuan-Tien Lin (htlin@csie.ntu.edu.tw)
Department of Computer Science and Information Engineering, National Taiwan University
Multi-label Classification Setup
Which tags Y are associated with this picture x?
Y = {building, taipei 101, day view, night view, skyscraper, fireworks, new york, car, face, taipei world financial center, university, etc.}
(CC BY-SA SElefant from Wikimedia Commons)
• Given: N examples $\{(x_n, \mathcal{Y}_n)\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^d$ and $\mathcal{Y}_n \subseteq \{1, 2, \cdots, K\}$
• Goal: a classifier g(x) that closely predicts the label set Y associated with an unseen input x, presumably by exploiting hidden relations between labels, e.g.
– taipei 101 & taipei world financial center: highly correlated
– skyscraper: subset of building
– day view & night view: disjoint
Label Space Dimension Reduction
$\mathcal{Y} \subseteq \{1, 2, \cdots, K\}$ is equivalent to $y \in \{0, 1\}^K$
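For concreteness, a minimal Python sketch of this set-to-vector equivalence (the helper name `to_binary` is ours, not from the poster):

```python
import numpy as np

def to_binary(Y_set, K):
    """Map a label set Y ⊆ {1, ..., K} to its indicator vector y ∈ {0, 1}^K."""
    y = np.zeros(K, dtype=int)
    y[[k - 1 for k in Y_set]] = 1  # labels are 1-indexed in the poster's notation
    return y

to_binary({1, 3}, K=5)  # -> array([1, 0, 1, 0, 0])
```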
• feature space dimension reduction: compress x to remove irrelevant, redundant (possibly related), or noisy information, and achieve better efficiency & performance
– principal component analysis (PCA): linearly project x to $w_m^T x$ with minimum projection error
– canonical correlation analysis (CCA): linearly project x to $w_m^T x$ in order to maximize correlation with some $v_m^T y$
• label space dimension reduction: analogously, but compress y instead (see the sketch after this list)
1. compress: transform $\{(x_n, y_n)\}$ to $\{(x_n, t_n)\}$ with $t_n = \text{compress}(y_n) \in \mathbb{R}^M$ and $M \ll K$
2. learn: train some r(x) from $\{(x_n, t_n)\}$
3. decompress: g(x) = decompress(r(x))
– compressive sensing (Hsu et al., NIPS 2009): linearly project y to $t[m] = v_m^T y$ with random $v_m$'s (for incoherence)
– principal label space transformation (PLST; Tai and Lin, NC 2012): linearly project y to $t[m] = v_m^T y$ with minimum projection error (sibling of PCA)
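The compress/learn/decompress scheme fits in a few lines; below is a minimal Python sketch with PLST-style compression (top right-singular vectors of the mean-shifted label matrix) and plain least-squares linear regression as r — the function names and the 0.5 rounding threshold are our choices, not prescribed by the poster:

```python
import numpy as np

def plst_fit(X, Y, M):
    """Compress labels PLST-style, then learn linear regression; X: N x d, Y: N x K in {0,1}."""
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean                                 # mean-shifted label matrix
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:M]                                     # M x K projection, rows orthonormal (VV^T = I)
    T = Z @ V.T                                    # compress: t_n = V z_n, N x M
    W, *_ = np.linalg.lstsq(X, T, rcond=None)      # learn: linear regression r(x) = W^T x, W: d x M
    return W, V, y_mean

def plst_predict(W, V, y_mean, X):
    """Decompress: g(x) = round(V^T r(x) + label mean)."""
    return ((X @ W @ V + y_mean) > 0.5).astype(int)
```

Decompression inverts the linear projection and rounds back to a binary label vector.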
Feature-Aware Label Space Dimension Reduction
• feature space dimension reduction
unsupervised (not using y): PCA, locally linear embedding, etc.
supervised (using y): CCA, sliced inverse regression, etc.
—supervised generally better for learning from compress(x) to y
• label space dimension reduction
feature-unaware (not using x): PLST, compressive sensing, etc.
feature-aware (using x): ???
—can we improve PLST by feature-aware label space dimension reduction?
Conditional Principal Label Space Transformation
• idea 1: exploit dual role of CCA to be feature-aware
project x to $w_m^T x$ in order to maximize correlation with some $v_m^T y$
≡ project y to $v_m^T y$ in order to maximize correlation with some $w_m^T x$
≈ project y to $v_m^T y$ in order to minimize difference to some $w_m^T x$
proposed OCCA: $\min_{W, V} \|XW^T - YV^T\|_F^2$, s.t. $VV^T = I$
– project to easiest-by-linear-regression directions
• idea 2: keep benefits of PLST for compression
existing PLST: $\min_{V} \|Y - YV^T V\|_F^2$, s.t. $VV^T = I$
– project to most representative directions
• proposed algorithm: conditional principal label space transformation (CPLST)
$$\min_{W, V} \underbrace{\|XW^T - YV^T\|_F^2}_{\text{learning error}} + \underbrace{\|Y - YV^T V\|_F^2}_{\text{compression error}}, \quad \text{s.t. } VV^T = I$$
– theoretical guarantee (Tai and Lin, NC 2012): when using linear regression as r, Hamming loss ≤ learning error + compression error
– algorithmic simplicity: closed-form optimal V contains the top eigenvectors of
OCCA: $Y^T(XX^\dagger - I)Y$    PLST: $Y^T Y$    CPLST: $Y^T XX^\dagger Y$
(where $XX^\dagger$ is the hat matrix of linear regression; see the sketch after this list)
(note: Z, i.e. the mean-shifted Y, is actually used for better projection)
– physical meaning: exploit conditional (feature-aware) correlations
– kernelization: replace linear regression with kernel ridge regression as r
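Under the note above (using the mean-shifted Z in place of Y), the closed-form step of CPLST can be sketched as follows; forming the hat matrix explicitly costs O(N²) memory, so this is a small-scale illustration of the formula, not the authors' implementation:

```python
import numpy as np

def cplst_directions(X, Y, M):
    """Rows of V = top-M eigenvectors of Z^T (X X†) Z, so that VV^T = I."""
    Z = Y - Y.mean(axis=0)                  # mean-shifted labels, as noted above
    H = X @ np.linalg.pinv(X)               # hat matrix X X†, N x N
    A = Z.T @ H @ Z                         # K x K, symmetric
    _, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :M].T        # top-M eigenvectors as rows, M x K
```

Training and decoding then proceed exactly as in the earlier compress/learn/decompress sketch, with V replaced by these conditional principal directions.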
Experimental Results
on yeast data set:
[Four plots on the yeast data set, each versus the number of dimensions M (0 to 15), comparing PBR, OCCA, PLST, and CPLST: learning error $\|XW^T - ZV^T\|^2$, compression error $\|Z - ZV^T V\|^2$, learning+compression error, and Hamming loss]
• PBR: baseline, with standard basis as $v_m$
• OCCA: optimize learning error, but worst in compression error
• PLST: optimize compression error, but worst in learning error
• CPLST: optimize learning+compression error, and hence best Hamming loss

on 8 benchmark data sets:
algorithms    CPLST vs. PLST        CPLST vs. PLST      kernel-CPLST vs. PLST
              + linear regression   + decision tree     + kernel ridge regression
M = 20%K      3 win, 5 similar      2 win, 6 similar    5 win, 1 lose, 2 similar
CPLST consistently better than or similar to PLST across data & algorithms
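For reference, the Hamming loss reported above is the fraction of wrongly predicted (example, label) entries; a one-line Python sketch:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average per-entry disagreement between binary label matrices."""
    return np.mean(Y_true != Y_pred)
```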
Summary
Conditional Principal Label Space Transformation, which
• projects to conditional principal directions by combining ideas behind CCA (feature-aware) and PLST (optimal compression)
• can be kernelized for exploiting feature power
• achieves consistently better or similar practical performance when compared with the already-strong PLST