### Mathematical Machine Learning for Modern Artificial Intelligence

Hsuan-Tien Lin, National Taiwan University

TWSIAM Annual Meeting, 05/25/2019

### From Intelligence to Artificial Intelligence

intelligence: thinking and acting **smartly**

• humanly

• rationally

**artificial** intelligence: **computers** thinking and acting **smartly**

• humanly

• rationally

humanly ≈ **smartly** ≈ rationally

**—are humans rational? :-)**

### Traditional vs. Modern [My] Definition of AI

Traditional Definition

humanly ≈ intelligently ≈ rationally

My Definition

intelligently ≈ easily

**is your smart phone ‘smart’? :-)**

modern artificial intelligence
= **application** intelligence

### Examples of Application Intelligence

Siri

By Bernard Goldbach [CC BY 2.0]

Amazon Recommendations

By Kelly Sims [CC BY 2.0]

iRobot

By Yuan-Chou Lo [CC BY-NC-ND 2.0]

Vivino

from nordic.businessinsider.com

### Machine Learning and AI

(diagram: machine learning at the intersection of Easy-to-Use, Acting Humanly, and Acting Rationally)

## Machine Learning

**machine learning: the core behind**
modern (data-driven) AI

### ML Connects Big Data and AI

From Big Data to Artificial Intelligence

big data ⇒ **ML** ⇒ artificial intelligence
(ingredient ⇒ tools/steps ⇒ dish)

(Photos licensed under CC BY 2.0 from Andrea Goh on Flickr)

“cooking” needs many possible
**tools & procedures**

### Bigger Data Towards Better AI

best route by shortest path

best route by current traffic

best route by predicted travel time

big data **can** make machines look smarter

### ML for Modern AI

big data ⇒ **ML** ⇒ AI

human learning/analysis ⇒ domain knowledge (HumanI) ⇒ method/model ⇒ expert system

• humans are sometimes **faster learners** on **initial (smaller) data**

• industry: **black plum is as sweet as white**

often important to leverage human learning,
especially **in the beginning**

### Application: Tropical Cyclone Intensity Estimation

meteorologists can ‘feel’ & estimate TC intensity from images

TC images ⇒ **ML** ⇒ intensity estimation

human learning/analysis ⇒ domain knowledge (HumanI): **rotation invariance** ⇒ **CNN** on **polar**-coordinate images, compared against the current weather system

better than the current system & ‘trial-ready’

(Chen et al., KDD 2018) (Chen et al., Weather & Forecasting 2019)
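The rotation-invariance idea above can be sketched as a polar-coordinate remap (a minimal illustration only; the grid sizes and nearest-neighbor sampling in `to_polar` are assumptions, not the published model's preprocessing):

```python
import numpy as np

def to_polar(img, n_r=32, n_theta=32):
    """Resample a square image onto an (r, theta) grid (nearest neighbor).

    Rotating the input then shifts the theta axis cyclically, so a CNN
    with circular padding along theta sees rotation as mere translation.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0            # image center
    rs = np.linspace(0, min(cy, cx), n_r)            # sampled radii
    thetas = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    out = np.empty((n_r, n_theta))
    for i, r in enumerate(rs):
        ys = np.clip(np.round(cy + r * np.sin(thetas)).astype(int), 0, h - 1)
        xs = np.clip(np.round(cx + r * np.cos(thetas)).astype(int), 0, w - 1)
        out[i] = img[ys, xs]
    return out
```

After this remap, a 90° rotation of the input shows up as a cyclic shift along the theta axis, which circular convolutions can absorb.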

### Cost-Sensitive Multiclass Classification

### Patient Status Prediction

**?** H7N9-infected, cold-infected, or healthy?

error measure = society cost

| actual \ predicted | H7N9 | cold | healthy |
|---|---|---|---|
| H7N9 | 0 | 1000 | **100000** |
| cold | 100 | 0 | 3000 |
| healthy | 100 | 30 | 0 |

• H7N9 mis-predicted as healthy: **very high cost**

• cold mis-predicted as healthy: high cost

• cold correctly predicted as cold: no cost

human doctors consider costs of decision;

**how about computer-aided diagnosis?**

Setup: Cost-Sensitive Classification

Given

N classification examples (input **x**_n, label y_n) ∈ X × {1, 2, . . . , K} and a ‘proper’ cost matrix C ∈ R^{K×K}

Goal

a classifier g(x) that pays a small cost C(y, g(x)) on a future unseen example (x, y)

cost-sensitive classification:

more **application-realistic**
than traditional classification

### Key Idea: Cost Estimator

(Tu and Lin, ICML 2010)

Goal

a classifier g(x) that pays a small cost C(y, g(x)) on a future unseen example (x, y)

consider the expected conditional costs

$$\mathbf{c}_x[k] = \sum_{y=1}^{K} C(y, k)\, P(y \mid \mathbf{x})$$

if **c**_x is known, the optimal classifier is

$$g^{*}(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} \mathbf{c}_x[k]$$

if r_k(x) ≈ **c**_x[k] well, an approximately good classifier is

$$g_r(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} r_k(\mathbf{x})$$

how to get the cost estimator r_k? **regression**

### Cost Estimator by Per-class Regression

Given

N examples, each (input **x**_n, label y_n) ∈ X × {1, 2, . . . , K}

• take **c**_n as the y_n-th row of C: **c**_n[k] = C(y_n, k)

| input | **c**_n[1] | **c**_n[2] | · · · | **c**_n[K] |
|---|---|---|---|---|
| **x**_1 | 0 | 2 | | 1 |
| **x**_2 | 1 | 3 | | 5 |
| · · · | | | | |
| **x**_N | 6 | 1 | | 0 |

—one regression task r_k per column

**want:** r_k(x) ≈ **c**_x[k] for all future **x** and k

### The Reduction Framework

cost-sensitive examples (x_n, y_n, **c**_n) ⇒ regression examples (X_{n,k}, Y_{n,k}), k = 1, · · · , K ⇒ regression algorithm ⇒ regressors r_k(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier g_r(x)

1 transform classification examples (x_n, y_n) to regression examples (X_{n,k}, Y_{n,k}) = (x_n, C(y_n, k))

2 use your favorite algorithm on the regression examples to get estimators r_k(x)

3 for each new input **x**, predict its class using g_r(x) = argmin_{1≤k≤K} r_k(x)

the reduction-to-regression framework:

**systematic & easy to implement**
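The three steps above can be sketched directly, assuming closed-form ridge regression as the ‘favorite algorithm’ (the function names and toy setup are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_cost_sensitive(X, y, C, lam=1e-3):
    """Reduce cost-sensitive classification to K regression tasks.

    X: (N, d) inputs; y: (N,) labels in 0..K-1; C: (K, K) cost matrix.
    Fits one ridge regressor r_k(x) with targets C(y_n, k) per class.
    """
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # append bias term
    targets = C[y]                            # row y_n gives c_n[k] = C(y_n, k)
    # closed-form ridge: W = (Xb^T Xb + lam I)^-1 Xb^T targets
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ targets)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmin(Xb @ W, axis=1)          # g_r(x) = argmin_k r_k(x)
```

Any regression method can replace the ridge step; only the transform-then-argmin wrapper is the framework itself.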

### A Simple Theoretical Guarantee

$$g_r(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} r_k(\mathbf{x})$$

Theorem (Absolute Loss Bound)

For any set of cost estimators {r_k}_{k=1}^{K} and for any tuple (x, y, c) with c[y] = 0 = min_{1≤k≤K} c[k],

$$c[g_r(\mathbf{x})] \le \sum_{k=1}^{K} \bigl|\, r_k(\mathbf{x}) - c[k] \,\bigr|.$$

**low-cost classifier ⇐= accurate estimator**
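The bound follows in two lines from the argmin definition; a proof sketch in the theorem's notation:

```latex
% let k^* = g_r(x); the argmin gives r_{k^*}(x) \le r_y(x), and c[y] = 0
c[k^*] = c[k^*] - c[y]
       \le \bigl(c[k^*] - r_{k^*}(x)\bigr) + \bigl(r_y(x) - c[y]\bigr)
       % (the dropped middle term r_{k^*}(x) - r_y(x) is \le 0)
       \le \bigl|r_{k^*}(x) - c[k^*]\bigr| + \bigl|r_y(x) - c[y]\bigr|
       \le \sum_{k=1}^{K} \bigl|r_k(x) - c[k]\bigr|.
```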

### Our Contributions

In 2010 (Tu and Lin, ICML 2010)

• **tighten** the simple guarantee (+math)

• **propose** a loss function (+math) from the tighter bound

• **derive** an SVM-based model (+math) from the loss function

—eventually reaching **superior experimental results**

Six Years Later (Chung et al., IJCAI 2016)

• **propose a smoother** loss function (+math) from the tighter bound

• **derive** the world’s first cost-sensitive deep learning model (+math) from the loss function

—eventually reaching **even better experimental results**

why are people **not**
using those **cool ML works for their AI? :-)**

### Issue 1: Where Do Costs Come From?

A Real Medical Application: Classifying Bacteria

• by human doctors: **different treatments** ⇐⇒ serious costs

• cost matrix averaged from two doctors:

| actual \ predicted | Ab | Ecoli | HI | KP | LM | Nm | Psa | Spn | Sa | GBS |
|---|---|---|---|---|---|---|---|---|---|---|
| Ab | 0 | 1 | 10 | 7 | 9 | 9 | 5 | 8 | 9 | 1 |
| Ecoli | 3 | 0 | 10 | 8 | 10 | 10 | 5 | 10 | 10 | 2 |
| HI | 10 | 10 | 0 | 3 | 2 | 2 | 10 | 1 | 2 | 10 |
| KP | 7 | 7 | 3 | 0 | 4 | 4 | 6 | 3 | 3 | 8 |
| LM | 8 | 8 | 2 | 4 | 0 | 5 | 8 | 2 | 1 | 8 |
| Nm | 3 | 10 | 9 | 8 | 6 | 0 | 8 | 3 | 6 | 7 |
| Psa | 7 | 8 | 10 | 9 | 9 | 7 | 0 | 8 | 9 | 5 |
| Spn | 6 | 10 | 7 | 7 | 4 | 4 | 9 | 0 | 4 | 7 |
| Sa | 7 | 10 | 6 | 5 | 1 | 3 | 9 | 2 | 0 | 7 |
| GBS | 2 | 5 | 10 | 9 | 8 | 6 | 5 | 6 | 8 | 0 |

issue 2: is cost-sensitive classification
**really useful?**

## ML Research for Modern AI

### Cost-Sensitive vs. Traditional on Bacteria Data

Are cost-sensitive algorithms great?

(figure: cost comparison with RBF kernel; the cost-sensitive csOSRSVM, csOVOSVM, and csFTSVM all reach lower cost than the traditional OVOSVM)

(Jan et al., BIBM 2011)

**cost-sensitive** better than **traditional;**

but why are people **still not**
using those cool ML works for their AI? :-)

### Issue 3: Error Rate of Cost-Sensitive Classifiers

The Problem

(figure: error rate (%) vs. cost scatter of classifiers)

• cost-sensitive classifier: low cost but high error rate

• traditional classifier: low error rate but high cost

• how can we get the blue classifiers: low error rate **and** low cost?

—math++ on **multi-objective** optimization (Jan et al., KDD 2012)

now, **are people using those cool ML works**
**for their AI? :-)**

### Lessons Learned from Research on Cost-Sensitive Multiclass Classification

**?** H7N9-infected, cold-infected, or healthy?

1 more realistic (generic) in academia
≠ **more realistic (feasible) in application**

e.g. the ‘cost’ of **inputting a cost matrix? :-)**

2 **cross-domain collaboration** important

e.g. getting the ‘cost matrix’ from **domain experts**

3 not easy to win **human trust**

—humans are somewhat **multi-objective**

4 many battlefields for **math** towards application intelligence

e.g. abstraction of **goals and needs**

### Label Space Coding for Multilabel Classification

### What Tags?

**?:** {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

a **multilabel classification problem:**
**tagging** input to multiple categories

### Binary Relevance: Multilabel Classification via Yes/No

Binary Classification: {yes, no}

multilabel w/ L classes: L **Y/N questions**

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

• **Binary Relevance approach:** transformation to **multiple isolated binary classification** tasks

• disadvantages:

• **isolation**—hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)

• **unbalanced**—few yes, many no

**Binary Relevance: simple (& good)**
benchmark with known disadvantages
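A minimal Binary Relevance sketch, using one independent least-squares scorer per label (the linear model and the 1/2 threshold are illustrative assumptions, not a prescribed base learner):

```python
import numpy as np

def train_binary_relevance(X, Y, lam=1e-3):
    """Binary Relevance: one independent linear scorer per label.

    X: (N, d) inputs; Y: (N, L) 0/1 label matrix. Returns a weight
    matrix whose k-th column answers the k-th Y/N question.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias term
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ Y)
    return W

def predict_br(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W >= 0.5).astype(int)          # L isolated yes/no decisions
```

Each column of `Y` is fit in isolation, which is exactly the "hidden relations not exploited" disadvantage named above.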

### From Label-set to Coding View

| label set | apple | orange | strawberry | **binary code** |
|---|---|---|---|---|
| {o} | 0 (N) | 1 (Y) | 0 (N) | [0, 1, 0] |
| {a, o} | 1 (Y) | 1 (Y) | 0 (N) | [1, 1, 0] |
| {a, s} | 1 (Y) | 0 (N) | 1 (Y) | [1, 0, 1] |
| {o} | 0 (N) | 1 (Y) | 0 (N) | [0, 1, 0] |
| {} | 0 (N) | 0 (N) | 0 (N) | [0, 0, 0] |

subset of 2^{{1,2,··· ,L}} **⇔ length-L binary code**
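The bijection can be written down directly (labels assumed to be numbered 1..L for illustration):

```python
def labelset_to_code(labels, L):
    """Subset of {1,..,L} -> length-L 0/1 code, e.g. {1,3}, L=3 -> [1,0,1]."""
    return [1 if j in labels else 0 for j in range(1, L + 1)]

def code_to_labelset(code):
    """Length-L 0/1 code -> subset of {1,..,L} (the inverse map)."""
    return {j + 1 for j, b in enumerate(code) if b}
```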

### A NeurIPS 2009 Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors **y** ∈ {0, 1}^L can be **robustly** compressed by projecting to M ≪ L basis vectors {**p**_1, **p**_2, · · · , **p**_M}

Comp. Sensing for Multilabel Classification (Hsu et al., NeurIPS 2009)

1 **compress:** encode original data by **compressive sensing**

2 **learn:** get regression function from compressed data

3 **decode:** decode regression predictions to sparse vector by **compressive sensing**

**Compressive Sensing: seemingly strong**
competitor **from related theoretical analysis**

### Our Proposed Approach: Compressive Sensing ⇒ PCA

Principal Label Space Transformation (PLST),
i.e. PCA for Multilabel Classification (Tai and Lin, NC Journal 2012)

1 **compress:** encode original data by **PCA**

2 **learn:** get regression function from compressed data

3 **decode:** decode regression predictions to label vector by **reverse PCA + quantization**
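A minimal PLST-style sketch of the compress/decode steps via SVD (a sketch of the idea only; the middle regression step and the paper's exact algorithm are omitted):

```python
import numpy as np

def plst_encode(Y, M):
    """Project L-dim 0/1 label vectors onto the top-M principal
    directions of the shifted label matrix (PCA on label space)."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    P = Vt[:M]                   # (M, L) principal directions
    Z = (Y - mean) @ P.T         # compressed targets for regression
    return Z, P, mean

def plst_decode(Zhat, P, mean):
    """Reverse projection plus quantization (round at 1/2) to 0/1 labels."""
    return ((Zhat @ P + mean) >= 0.5).astype(int)
```

In the full pipeline, a regressor is trained to predict `Z` from inputs, and its predictions are passed through `plst_decode`.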

**does PLST perform better than CS?**

### Hamming Loss Comparison: PLST vs. CS

(figures: Hamming loss vs. number of compressed dimensions on mediamill, with Linear Regression and with Decision Tree as the base regressor; Full−BR (no reduction) shown as reference)

• **PLST** better than CS: faster, **better performance**

• similar findings across **data sets and regression algorithms**

Why? CS creates
**harder-to-learn** regression tasks

### Our Works Continued from PLST

1 **Compression** Coding (Tai & Lin, NC Journal 2012, 216 citations)

—condense for efficiency: better (than CS) approach PLST

—key tool: PCA from Statistics/Signal Processing

2 **Learnable-Compression** Coding (Chen & Lin, NeurIPS 2012, 157 citations)

—condense learnably for **better** efficiency: better (than PLST) approach CPLST

—key tool: Ridge Regression from Statistics (+ PCA)

3 **Cost-Sensitive** Coding (Huang & Lin, ECML Journal Track 2017)

—condense cost-sensitively towards application needs: better (than CPLST) approach CLEMS

—key tool: Multidimensional Scaling from Statistics

cannot thank **statisticians**
enough for those tools!

### Lessons Learned from Label Space Coding for Multilabel Classification

**?:** {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

1 Is Statistics the same as ML? Is Statistics the same as AI?

• **does it really matter?**

• modern AI should embrace **every useful tool from every field** & any necessary **math**

2 ‘application intelligence’ tools are **not necessarily the most sophisticated ones**

e.g. PCA possibly more useful than CS for label space coding

3 more-cited paper ≠ more-useful AI solution

—citation count **not the only impact measure**

4 **are people using those cool ML works for their AI?**

—we wish!

### Active Learning by Learning

### Active Learning: Learning by ‘Asking’

labeling is **expensive:**

**active learning = ‘question asking’**
—query y_n of a **chosen x**_n

(diagram: unknown target function f : X → Y; labeled training examples; learning algorithm A; final hypothesis g ≈ f, with the learner allowed to ask for the label of a chosen example)

active: improve hypothesis with fewer labels
(hopefully) by asking questions **strategically**

### Pool-Based Active Learning Problem

Given

• labeled pool D_l = {(feature **x**_n, **label** y_n (e.g. IsApple?))}_{n=1}^{N}

• unlabeled pool D_u = {**x̃**_s}_{s=1}^{S}

Goal

design an algorithm that iteratively

1 **strategically queries** some **x̃**_s to get the associated ỹ_s

2 moves (**x̃**_s, ỹ_s) from D_u to D_l

3 learns **classifier g**^{(t)} from D_l

and improves the **test accuracy of g**^{(t)} w.r.t. **#queries**

how to **query strategically?**
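One round of the iterative loop can be sketched with a hypothetical linear scorer and uncertainty ("most confused") sampling as the querying rule (all names, the ridge scorer, and the 1/2 boundary are illustrative assumptions):

```python
import numpy as np

def uncertainty_sampling_round(Xl, yl, Xu, lam=1e-3):
    """One round of pool-based active learning.

    Fits a ridge scorer on the labeled pool (targets in {0, 1}) and
    returns the index of the unlabeled point whose score lies closest
    to the 1/2 decision boundary, i.e. the most confusing x~_s.
    """
    Xb = np.hstack([Xl, np.ones((len(Xl), 1))])
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ yl)
    Ub = np.hstack([Xu, np.ones((len(Xu), 1))])
    return int(np.argmin(np.abs(Ub @ w - 0.5)))
```

The chosen point would then be labeled, moved from D_u to D_l, and the classifier retrained, as in steps 1 to 3 above.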

### How to Query Strategically?

Strategy 1: ask the **most confused** question

Strategy 2: ask the **most frequent** question

Strategy 3: ask the **most debateful** question

• **choosing** one single strategy is **non-trivial:**

(figures: accuracy vs. % of unlabelled data queried for RAND, UNCERTAIN, PSDS, and QUIRE on three data sets)

application intelligence: how to
**choose strategy smartly?**

### Idea: Trial-and-Reward Like Human

when do humans **trial-and-reward? gambling**

| K strategies: A_1, A_2, · · · , A_K | K bandit machines: B_1, B_2, · · · , B_K |
|---|---|
| **try** one strategy | **try** one bandit machine |
| “goodness” of strategy as **reward** | “luckiness” of machine as **reward** |

intelligent choice of strategy
=⇒ intelligent choice of **bandit machine**

### Active Learning by Learning

(Hsu and Lin, AAAI 2015)

Given: K existing active learning strategies A_1, A_2, · · · , A_K; for t = 1, 2, . . . , T

1 let some bandit model **decide strategy** A_k **to try**

2 **query the x̃**_s suggested by A_k, and compute g^{(t)}

3 evaluate the **goodness of g**^{(t)} as the **reward** of the **trial** to update the model

only remaining problem: **what reward?**

### Design of Reward

ideal reward after updating classifier g^{(t)} by the query (x_{n_t}, y_{n_t}):

accuracy of g^{(t)} on the **test set** {(x′_m, y′_m)}_{m=1}^{M}

—test accuracy **infeasible** in practice because labeling is **expensive**

more feasible reward: training accuracy on the fly

accuracy of g^{(t)} on the **labeled pool** {(x_{n_τ}, y_{n_τ})}_{τ=1}^{t}

—but **biased** towards **easier** queries

weighted training accuracy as a better reward:

accuracy of g^{(t)} on the inverse-probability-weighted **labeled pool** {(x_{n_τ}, y_{n_τ}, 1/p_τ)}_{τ=1}^{t}

—‘bias correction’ from the querying probability within the bandit model

Active Learning by Learning (ALBL):
bandit + **weighted training acc. as reward**
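The weighted reward can be sketched directly from the formula above (a simplified stand-in for illustration, not the paper's exact estimator):

```python
import numpy as np

def weighted_training_accuracy(correct, probs):
    """ALBL-style reward: inverse-propensity-weighted training accuracy.

    correct[tau]: 1 if g^(t) classifies query tau correctly, else 0.
    probs[tau]: probability the bandit assigned to making that query.
    Weighting by 1/p_tau corrects the bias toward 'easier' queries:
    reward = (1/t) * sum_tau (1/p_tau) * [correct_tau].
    """
    correct = np.asarray(correct, dtype=float)
    weights = 1.0 / np.asarray(probs, dtype=float)
    return float((correct * weights).sum() / len(correct))
```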

### Comparison with Single Strategies

(figures: accuracy vs. % of unlabelled data queried; UNCERTAIN is best on vehicle, PSDS on sonar, and QUIRE on diabetes, while ALBL tracks the best curve on each)

• **no single best strategy** for every data set

—choosing needed

• proposed **ALBL** consistently **matches the best**

—similar findings across other data sets

**ALBL: effective in making intelligent choices**

### Discussion for Statisticians

**weighted training accuracy**

$$\frac{1}{t}\sum_{\tau=1}^{t} \frac{1}{p_\tau}\, [\![\, y_{n_\tau} = g^{(t)}(\mathbf{x}_{n_\tau}) \,]\!]$$

as reward

• is the reward an unbiased estimator of test performance?

**no for a learned g**^{(t)} **(yes for a fixed g)**

• is the reward fixed before playing?

**no, because g**^{(t)} **is learned from (x**_{n_t}, y_{n_t})

• are the rewards independent of each other?

**no, because the past history is all in the reward**

—ALBL: tools from statistics + **wild/unintended usage**

‘application intelligence’ outcome:

**open-source tool** released
(https://github.com/ntucllab/libact)
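A tiny Monte Carlo check of the first point, on assumed toy numbers: for a *fixed* g, the inverse-probability weighting is an unbiased estimate of plain pool accuracy; the bias only appears once g^{(t)} is learned from the very queries being weighted.

```python
import numpy as np

rng = np.random.default_rng(0)
pool_correct = np.array([1, 1, 0, 1, 0, 0, 1, 1])        # fixed g's correctness per pool point
p = np.array([0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])   # querying probabilities (sum to 1)

# E_{i~p}[(1/p_i) * correct_i] / S equals the plain pool accuracy (here 5/8)
draws = rng.choice(len(p), size=200_000, p=p)
est = np.mean(pool_correct[draws] / p[draws]) / len(p)
assert abs(est - pool_correct.mean()) < 0.01
```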

### Lessons Learned from Research on Active Learning by Learning

by DFID - UK Department for International Development;

licensed under CC BY-SA 2.0 via Wikimedia Commons

1 **scalability bottleneck** of ‘application intelligence’:
**choice** of methods/models/parameters/. . .

2 think outside of the **math** box:
‘unintended’ usage may be **good enough**

3 important to be **brave** yet **patient**

—idea: 2012; paper (Hsu and Lin, AAAI 2015); software (Yang et al., 2017)

### Summary

• ML for (Modern) AI:

tools + human knowledge ⇒ **easy-to-use application intelligence**

• ML Research for Modern AI:

need to be **more open-minded**
—in methodology, in collaboration, in KPI

• **Math** in ML Research for Modern AI:

new setup/need/goal & wider usage of tools

**Thank you! Questions?**