### Mathematical Machine Learning for Modern Artificial Intelligence

Hsuan-Tien Lin, National Taiwan University

TWSIAM Annual Meeting, 05/25/2019

### From Intelligence to Artificial Intelligence

intelligence: thinking and acting **smartly**

• humanly

• rationally

**artificial** intelligence: **computers** thinking and acting **smartly**

• humanly

• rationally

humanly ≈ **smartly** ≈ rationally

**—are humans rational? :-)**

### Traditional vs. Modern [My] Definition of AI

Traditional Definition

humanly ≈ intelligently ≈ rationally

My Definition

intelligently ≈ easily

**is your smart phone ‘smart’? :-)**

modern artificial intelligence
= **application** intelligence

### Examples of Application Intelligence

Siri

By Bernard Goldbach [CC BY 2.0]

Amazon Recommendations

By Kelly Sims [CC BY 2.0]

iRobot

By Yuan-Chou Lo [CC BY-NC-ND 2.0]

Vivino

from nordic.businessinsider.com

### Machine Learning and AI

(diagram: machine learning at the intersection of Easy-to-Use, Acting Humanly, and Acting Rationally)

## Machine Learning

**machine learning: the core behind**
modern (data-driven) AI

### ML Connects Big Data and AI

From Big Data to Artificial Intelligence

big data ⇒ **ML** ⇒ artificial intelligence
(ingredient ⇒ tools/steps ⇒ dish)

(Photos licensed under CC BY 2.0 from Andrea Goh on Flickr)

“cooking” needs many possible
**tools & procedures**

### Bigger Data Towards Better AI

best route by shortest path

best route by current traffic

best route by predicted travel time

big data **can** make machines look smarter

### ML for Modern AI

big data ⇒ **ML** ⇒ AI

human learning/analysis ⇒ domain knowledge (HumanI) ⇒ method/model ⇒ expert system

• humans are sometimes **faster learners** on **initial (smaller) data**

• industry: **black plum is as sweet as white**

often important to leverage human learning,
especially **in the beginning**

### Application: Tropical Cyclone Intensity Estimation

meteorologists can ‘feel’ & estimate TC intensity from images

TC images ⇒ **ML** ⇒ intensity estimation

human learning/analysis ⇒ domain knowledge (HumanI): **rotation invariance** ⇒ **CNN** on **polar**-coordinate images, compared against the current weather system

better than the current system & ‘trial-ready’

(Chen et al., KDD 2018) (Chen et al., Weather & Forecasting 2019)
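The rotation-invariance idea above can be sketched as a polar-coordinate remap (a minimal illustration only; the grid sizes and nearest-neighbor sampling in `to_polar` are assumptions, not the published model's preprocessing):

```python
import numpy as np

def to_polar(img, n_r=32, n_theta=32):
    """Resample a square image onto an (r, theta) grid (nearest neighbor).

    Rotating the input then shifts the theta axis cyclically, so a CNN
    with circular padding along theta sees rotation as mere translation.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0            # image center
    rs = np.linspace(0, min(cy, cx), n_r)            # sampled radii
    thetas = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    out = np.empty((n_r, n_theta))
    for i, r in enumerate(rs):
        ys = np.clip(np.round(cy + r * np.sin(thetas)).astype(int), 0, h - 1)
        xs = np.clip(np.round(cx + r * np.cos(thetas)).astype(int), 0, w - 1)
        out[i] = img[ys, xs]
    return out
```

After this remap, a 90° rotation of the input shows up as a cyclic shift along the theta axis, which circular convolutions can absorb.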

### Cost-Sensitive Multiclass Classification

### Patient Status Prediction

**?** H7N9-infected, cold-infected, or healthy?

error measure = society cost

| actual \ predicted | H7N9 | cold | healthy |
|---|---|---|---|
| H7N9 | 0 | 1000 | **100000** |
| cold | 100 | 0 | 3000 |
| healthy | 100 | 30 | 0 |

• H7N9 mis-predicted as healthy: **very high cost**

• cold mis-predicted as healthy: high cost

• cold correctly predicted as cold: no cost

human doctors consider costs of decision;

**how about computer-aided diagnosis?**

Setup: Cost-Sensitive Classification

Given

N classification examples (input **x**_n, label y_n) ∈ X × {1, 2, . . . , K} and a ‘proper’ cost matrix C ∈ R^{K×K}

Goal

a classifier g(x) that pays a small cost C(y, g(x)) on a future unseen example (x, y)

cost-sensitive classification:

more **application-realistic**
than traditional classification

### Key Idea: Cost Estimator

(Tu and Lin, ICML 2010)

Goal

a classifier g(x) that pays a small cost C(y, g(x)) on a future unseen example (x, y)

consider the expected conditional costs

$$\mathbf{c}_x[k] = \sum_{y=1}^{K} C(y, k)\, P(y \mid \mathbf{x})$$

if **c**_x is known, the optimal classifier is

$$g^{*}(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} \mathbf{c}_x[k]$$

if r_k(x) ≈ **c**_x[k] well, an approximately good classifier is

$$g_r(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} r_k(\mathbf{x})$$

how to get the cost estimator r_k? **regression**

### Cost Estimator by Per-class Regression

Given

N examples, each (input **x**_n, label y_n) ∈ X × {1, 2, . . . , K}

• take **c**_n as the y_n-th row of C: **c**_n[k] = C(y_n, k)

| input | **c**_n[1] | **c**_n[2] | · · · | **c**_n[K] |
|---|---|---|---|---|
| **x**_1 | 0 | 2 | | 1 |
| **x**_2 | 1 | 3 | | 5 |
| · · · | | | | |
| **x**_N | 6 | 1 | | 0 |

—one regression task r_k per column

**want:** r_k(x) ≈ **c**_x[k] for all future **x** and k

### The Reduction Framework

cost-sensitive examples (x_n, y_n, **c**_n) ⇒ regression examples (X_{n,k}, Y_{n,k}), k = 1, · · · , K ⇒ regression algorithm ⇒ regressors r_k(x), k ∈ 1, · · · , K ⇒ cost-sensitive classifier g_r(x)

1 transform classification examples (x_n, y_n) to regression examples (X_{n,k}, Y_{n,k}) = (x_n, C(y_n, k))

2 use your favorite algorithm on the regression examples to get estimators r_k(x)

3 for each new input **x**, predict its class using g_r(x) = argmin_{1≤k≤K} r_k(x)

the reduction-to-regression framework:

**systematic & easy to implement**
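The three steps above can be sketched directly, assuming closed-form ridge regression as the ‘favorite algorithm’ (the function names and toy setup are illustrative, not the paper's implementation):

```python
import numpy as np

def fit_cost_sensitive(X, y, C, lam=1e-3):
    """Reduce cost-sensitive classification to K regression tasks.

    X: (N, d) inputs; y: (N,) labels in 0..K-1; C: (K, K) cost matrix.
    Fits one ridge regressor r_k(x) with targets C(y_n, k) per class.
    """
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # append bias term
    targets = C[y]                            # row y_n gives c_n[k] = C(y_n, k)
    # closed-form ridge: W = (Xb^T Xb + lam I)^-1 Xb^T targets
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ targets)
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmin(Xb @ W, axis=1)          # g_r(x) = argmin_k r_k(x)
```

Any regression method can replace the ridge step; only the transform-then-argmin wrapper is the framework itself.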

### A Simple Theoretical Guarantee

$$g_r(\mathbf{x}) = \operatorname*{argmin}_{1 \le k \le K} r_k(\mathbf{x})$$

Theorem (Absolute Loss Bound)

For any set of cost estimators {r_k}_{k=1}^{K} and for any tuple (x, y, c) with c[y] = 0 = min_{1≤k≤K} c[k],

$$c[g_r(\mathbf{x})] \le \sum_{k=1}^{K} \bigl|\, r_k(\mathbf{x}) - c[k] \,\bigr|.$$

**low-cost classifier ⇐= accurate estimator**
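The bound follows in two lines from the argmin definition; a proof sketch in the theorem's notation:

```latex
% let k^* = g_r(x); the argmin gives r_{k^*}(x) \le r_y(x), and c[y] = 0
c[k^*] = c[k^*] - c[y]
       \le \bigl(c[k^*] - r_{k^*}(x)\bigr) + \bigl(r_y(x) - c[y]\bigr)
       % (the dropped middle term r_{k^*}(x) - r_y(x) is \le 0)
       \le \bigl|r_{k^*}(x) - c[k^*]\bigr| + \bigl|r_y(x) - c[y]\bigr|
       \le \sum_{k=1}^{K} \bigl|r_k(x) - c[k]\bigr|.
```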

### Our Contributions

In 2010 (Tu and Lin, ICML 2010)

• **tighten** the simple guarantee (+math)

• **propose** a loss function (+math) from the tighter bound

• **derive** an SVM-based model (+math) from the loss function

—eventually reaching **superior experimental results**

Six Years Later (Chung et al., IJCAI 2016)

• **propose a smoother** loss function (+math) from the tighter bound

• **derive** the world’s first cost-sensitive deep learning model (+math) from the loss function

—eventually reaching **even better experimental results**

why are people **not**
using those **cool ML works for their AI? :-)**

### Issue 1: Where Do Costs Come From?

A Real Medical Application: Classifying Bacteria

• by human doctors: **different treatments** ⇐⇒ serious costs

• cost matrix averaged from two doctors:

| actual \ predicted | Ab | Ecoli | HI | KP | LM | Nm | Psa | Spn | Sa | GBS |
|---|---|---|---|---|---|---|---|---|---|---|
| Ab | 0 | 1 | 10 | 7 | 9 | 9 | 5 | 8 | 9 | 1 |
| Ecoli | 3 | 0 | 10 | 8 | 10 | 10 | 5 | 10 | 10 | 2 |
| HI | 10 | 10 | 0 | 3 | 2 | 2 | 10 | 1 | 2 | 10 |
| KP | 7 | 7 | 3 | 0 | 4 | 4 | 6 | 3 | 3 | 8 |
| LM | 8 | 8 | 2 | 4 | 0 | 5 | 8 | 2 | 1 | 8 |
| Nm | 3 | 10 | 9 | 8 | 6 | 0 | 8 | 3 | 6 | 7 |
| Psa | 7 | 8 | 10 | 9 | 9 | 7 | 0 | 8 | 9 | 5 |
| Spn | 6 | 10 | 7 | 7 | 4 | 4 | 9 | 0 | 4 | 7 |
| Sa | 7 | 10 | 6 | 5 | 1 | 3 | 9 | 2 | 0 | 7 |
| GBS | 2 | 5 | 10 | 9 | 8 | 6 | 5 | 6 | 8 | 0 |

issue 2: is cost-sensitive classification
**really useful?**

## ML Research for Modern AI

### Cost-Sensitive vs. Traditional on Bacteria Data

Are cost-sensitive algorithms great?

(figure: cost comparison with RBF kernel; the cost-sensitive csOSRSVM, csOVOSVM, and csFTSVM all reach lower cost than the traditional OVOSVM)

(Jan et al., BIBM 2011)

**cost-sensitive** better than **traditional;**

but why are people **still not**
using those cool ML works for their AI? :-)

### Issue 3: Error Rate of Cost-Sensitive Classifiers

The Problem

(figure: error rate (%) vs. cost scatter of classifiers)

• cost-sensitive classifier: low cost but high error rate

• traditional classifier: low error rate but high cost

• how can we get the blue classifiers: low error rate **and** low cost?

—math++ on **multi-objective** optimization (Jan et al., KDD 2012)

now, **are people using those cool ML works**
**for their AI? :-)**

### Lessons Learned from Research on Cost-Sensitive Multiclass Classification

**?** H7N9-infected, cold-infected, or healthy?

1 more realistic (generic) in academia
≠ **more realistic (feasible) in application**

e.g. the ‘cost’ of **inputting a cost matrix? :-)**

2 **cross-domain collaboration** important

e.g. getting the ‘cost matrix’ from **domain experts**

3 not easy to win **human trust**

—humans are somewhat **multi-objective**

4 many battlefields for **math** towards application intelligence

e.g. abstraction of **goals and needs**

### Label Space Coding for Multilabel Classification

### What Tags?

**?:** {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

a **multilabel classification problem:**
**tagging** input to multiple categories

### Binary Relevance: Multilabel Classification via Yes/No

Binary Classification: {yes, no}

multilabel w/ L classes: L **Y/N questions**

machine learning (Y), data structure (N), data mining (Y), OOP (N), AI (Y), compiler (N), architecture (N), chemistry (N), textbook (Y), children book (N), etc.

• **Binary Relevance approach:** transformation to **multiple isolated binary classification** tasks

• disadvantages:

• **isolation**—hidden relations not exploited (e.g. ML and DM highly correlated, ML a subset of AI, textbook & children book disjoint)

• **unbalanced**—few yes, many no

**Binary Relevance: simple (& good)**
benchmark with known disadvantages
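A minimal Binary Relevance sketch, using one independent least-squares scorer per label (the linear model and the 1/2 threshold are illustrative assumptions, not a prescribed base learner):

```python
import numpy as np

def train_binary_relevance(X, Y, lam=1e-3):
    """Binary Relevance: one independent linear scorer per label.

    X: (N, d) inputs; Y: (N, L) 0/1 label matrix. Returns a weight
    matrix whose k-th column answers the k-th Y/N question.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias term
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ Y)
    return W

def predict_br(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W >= 0.5).astype(int)          # L isolated yes/no decisions
```

Each column of `Y` is fit in isolation, which is exactly the "hidden relations not exploited" disadvantage named above.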

### From Label-set to Coding View

| label set | apple | orange | strawberry | **binary code** |
|---|---|---|---|---|
| {o} | 0 (N) | 1 (Y) | 0 (N) | [0, 1, 0] |
| {a, o} | 1 (Y) | 1 (Y) | 0 (N) | [1, 1, 0] |
| {a, s} | 1 (Y) | 0 (N) | 1 (Y) | [1, 0, 1] |
| {o} | 0 (N) | 1 (Y) | 0 (N) | [0, 1, 0] |
| {} | 0 (N) | 0 (N) | 0 (N) | [0, 0, 0] |

subset of 2^{{1,2,··· ,L}} **⇔ length-L binary code**
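The bijection can be written down directly (labels assumed to be numbered 1..L for illustration):

```python
def labelset_to_code(labels, L):
    """Subset of {1,..,L} -> length-L 0/1 code, e.g. {1,3}, L=3 -> [1,0,1]."""
    return [1 if j in labels else 0 for j in range(1, L + 1)]

def code_to_labelset(code):
    """Length-L 0/1 code -> subset of {1,..,L} (the inverse map)."""
    return {j + 1 for j, b in enumerate(code) if b}
```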

### A NeurIPS 2009 Approach: Compressive Sensing

General Compressive Sensing

sparse (many 0) binary vectors **y** ∈ {0, 1}^L can be **robustly** compressed by projecting to M ≪ L basis vectors {**p**_1, **p**_2, · · · , **p**_M}

Comp. Sensing for Multilabel Classification (Hsu et al., NeurIPS 2009)

1 **compress:** encode original data by **compressive sensing**

2 **learn:** get regression function from compressed data

3 **decode:** decode regression predictions to sparse vector by **compressive sensing**

**Compressive Sensing: seemingly strong**
competitor **from related theoretical analysis**

### Our Proposed Approach: Compressive Sensing ⇒ PCA

Principal Label Space Transformation (PLST),
i.e. PCA for Multilabel Classification (Tai and Lin, NC Journal 2012)

1 **compress:** encode original data by **PCA**

2 **learn:** get regression function from compressed data

3 **decode:** decode regression predictions to label vector by **reverse PCA + quantization**
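A minimal PLST-style sketch of the compress/decode steps via SVD (a sketch of the idea only; the middle regression step and the paper's exact algorithm are omitted):

```python
import numpy as np

def plst_encode(Y, M):
    """Project L-dim 0/1 label vectors onto the top-M principal
    directions of the shifted label matrix (PCA on label space)."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    P = Vt[:M]                   # (M, L) principal directions
    Z = (Y - mean) @ P.T         # compressed targets for regression
    return Z, P, mean

def plst_decode(Zhat, P, mean):
    """Reverse projection plus quantization (round at 1/2) to 0/1 labels."""
    return ((Zhat @ P + mean) >= 0.5).astype(int)
```

In the full pipeline, a regressor is trained to predict `Z` from inputs, and its predictions are passed through `plst_decode`.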

**does PLST perform better than CS?**

### Hamming Loss Comparison: PLST vs. CS

(figures: Hamming loss vs. number of compressed dimensions on mediamill, with Linear Regression and with Decision Tree as the base regressor; Full−BR (no reduction) shown as reference)

• **PLST** better than CS: faster, **better performance**

• similar findings across **data sets and regression algorithms**

Why? CS creates
**harder-to-learn** regression tasks

### Our Works Continued from PLST

1 **Compression** Coding (Tai & Lin, NC Journal 2012, 216 citations)

—condense for efficiency: better (than CS) approach PLST

—key tool: PCA from Statistics/Signal Processing

2 **Learnable-Compression** Coding (Chen & Lin, NeurIPS 2012, 157 citations)

—condense learnably for **better** efficiency: better (than PLST) approach CPLST

—key tool: Ridge Regression from Statistics (+ PCA)

3 **Cost-Sensitive** Coding (Huang & Lin, ECML Journal Track 2017)

—condense cost-sensitively towards application needs: better (than CPLST) approach CLEMS

—key tool: Multidimensional Scaling from Statistics

cannot thank **statisticians**
enough for those tools!

### Lessons Learned from Label Space Coding for Multilabel Classification

**?:** {machine learning, data structure, data mining, object oriented programming, artificial intelligence, compiler, architecture, chemistry, textbook, children book, . . . etc.}

1 Is Statistics the same as ML? Is Statistics the same as AI?

• **does it really matter?**

• modern AI should embrace **every useful tool from every field** & any necessary **math**

2 ‘application intelligence’ tools are **not necessarily the most sophisticated ones**

e.g. PCA possibly more useful than CS for label space coding

3 more-cited paper ≠ more-useful AI solution

—citation count **not the only impact measure**

4 **are people using those cool ML works for their AI?**

—we wish!

### Active Learning by Learning

### Active Learning: Learning by ‘Asking’

labeling is **expensive:**

**active learning = ‘question asking’**
—query y_n of a **chosen x**_n

(diagram: unknown target function f : X → Y; labeled training examples; learning algorithm A; final hypothesis g ≈ f, with the learner allowed to ask for the label of a chosen example)

active: improve hypothesis with fewer labels
(hopefully) by asking questions **strategically**

### Pool-Based Active Learning Problem

Given

• labeled pool D_l = {(feature **x**_n, **label** y_n (e.g. IsApple?))}_{n=1}^{N}

• unlabeled pool D_u = {**x̃**_s}_{s=1}^{S}

Goal

design an algorithm that iteratively

1 **strategically queries** some **x̃**_s to get the associated ỹ_s

2 moves (**x̃**_s, ỹ_s) from D_u to D_l

3 learns **classifier g**^{(t)} from D_l

and improves the **test accuracy of g**^{(t)} w.r.t. **#queries**

how to **query strategically?**
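One round of the iterative loop can be sketched with a hypothetical linear scorer and uncertainty ("most confused") sampling as the querying rule (all names, the ridge scorer, and the 1/2 boundary are illustrative assumptions):

```python
import numpy as np

def uncertainty_sampling_round(Xl, yl, Xu, lam=1e-3):
    """One round of pool-based active learning.

    Fits a ridge scorer on the labeled pool (targets in {0, 1}) and
    returns the index of the unlabeled point whose score lies closest
    to the 1/2 decision boundary, i.e. the most confusing x~_s.
    """
    Xb = np.hstack([Xl, np.ones((len(Xl), 1))])
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ yl)
    Ub = np.hstack([Xu, np.ones((len(Xu), 1))])
    return int(np.argmin(np.abs(Ub @ w - 0.5)))
```

The chosen point would then be labeled, moved from D_u to D_l, and the classifier retrained, as in steps 1 to 3 above.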

### How to Query Strategically?

Strategy 1: ask the **most confused** question

Strategy 2: ask the **most frequent** question

Strategy 3: ask the **most debateful** question

• **choosing** one single strategy is **non-trivial:**

(figures: accuracy vs. % of unlabelled data queried for RAND, UNCERTAIN, PSDS, and QUIRE on three data sets)

application intelligence: how to
**choose strategy smartly?**

### Idea: Trial-and-Reward Like Human

when do humans **trial-and-reward? gambling**

| K strategies: A_1, A_2, · · · , A_K | K bandit machines: B_1, B_2, · · · , B_K |
|---|---|
| **try** one strategy | **try** one bandit machine |
| “goodness” of strategy as **reward** | “luckiness” of machine as **reward** |

intelligent choice of strategy
=⇒ intelligent choice of **bandit machine**

### Active Learning by Learning

(Hsu and Lin, AAAI 2015)

Given: K existing active learning strategies A_1, A_2, · · · , A_K; for t = 1, 2, . . . , T

1 let some bandit model **decide strategy** A_k **to try**

2 **query the x̃**_s suggested by A_k, and compute g^{(t)}

3 evaluate the **goodness of g**^{(t)} as the **reward** of the **trial** to update the model

only remaining problem: **what reward?**

### Design of Reward

ideal reward after updating classifier g^{(t)} by the query (x_{n_t}, y_{n_t}):

accuracy of g^{(t)} on the **test set** {(x′_m, y′_m)}_{m=1}^{M}

—test accuracy **infeasible** in practice because labeling is **expensive**

more feasible reward: training accuracy on the fly

accuracy of g^{(t)} on the **labeled pool** {(x_{n_τ}, y_{n_τ})}_{τ=1}^{t}

—but **biased** towards **easier** queries

weighted training accuracy as a better reward:

accuracy of g^{(t)} on the inverse-probability-weighted **labeled pool** {(x_{n_τ}, y_{n_τ}, 1/p_τ)}_{τ=1}^{t}

—‘bias correction’ from the querying probability within the bandit model

Active Learning by Learning (ALBL):
bandit + **weighted training acc. as reward**
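The weighted reward can be sketched directly from the formula above (a simplified stand-in for illustration, not the paper's exact estimator):

```python
import numpy as np

def weighted_training_accuracy(correct, probs):
    """ALBL-style reward: inverse-propensity-weighted training accuracy.

    correct[tau]: 1 if g^(t) classifies query tau correctly, else 0.
    probs[tau]: probability the bandit assigned to making that query.
    Weighting by 1/p_tau corrects the bias toward 'easier' queries:
    reward = (1/t) * sum_tau (1/p_tau) * [correct_tau].
    """
    correct = np.asarray(correct, dtype=float)
    weights = 1.0 / np.asarray(probs, dtype=float)
    return float((correct * weights).sum() / len(correct))
```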

### Comparison with Single Strategies

(figures: accuracy vs. % of unlabelled data queried; UNCERTAIN is best on vehicle, PSDS on sonar, and QUIRE on diabetes, while ALBL tracks the best curve on each)

• **no single best strategy** for every data set

—choosing needed

• proposed **ALBL** consistently **matches the best**

—similar findings across other data sets

**ALBL: effective in making intelligent choices**

### Discussion for Statisticians

**weighted training accuracy**

$$\frac{1}{t}\sum_{\tau=1}^{t} \frac{1}{p_\tau}\, [\![\, y_{n_\tau} = g^{(t)}(\mathbf{x}_{n_\tau}) \,]\!]$$

as reward

• is the reward an unbiased estimator of test performance?

**no for a learned g**^{(t)} **(yes for a fixed g)**

• is the reward fixed before playing?

**no, because g**^{(t)} **is learned from (x**_{n_t}, y_{n_t})

• are the rewards independent of each other?

**no, because the past history is all in the reward**

—ALBL: tools from statistics + **wild/unintended usage**

‘application intelligence’ outcome:

**open-source tool** released
(https://github.com/ntucllab/libact)
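A tiny Monte Carlo check of the first point, on assumed toy numbers: for a *fixed* g, the inverse-probability weighting is an unbiased estimate of plain pool accuracy; the bias only appears once g^{(t)} is learned from the very queries being weighted.

```python
import numpy as np

rng = np.random.default_rng(0)
pool_correct = np.array([1, 1, 0, 1, 0, 0, 1, 1])        # fixed g's correctness per pool point
p = np.array([0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])   # querying probabilities (sum to 1)

# E_{i~p}[(1/p_i) * correct_i] / S equals the plain pool accuracy (here 5/8)
draws = rng.choice(len(p), size=200_000, p=p)
est = np.mean(pool_correct[draws] / p[draws]) / len(p)
assert abs(est - pool_correct.mean()) < 0.01
```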

### Lessons Learned from Research on Active Learning by Learning

by DFID - UK Department for International Development;

licensed under CC BY-SA 2.0 via Wikimedia Commons

1 **scalability bottleneck** of ‘application intelligence’:
**choice** of methods/models/parameters/. . .

2 think outside of the **math** box:
‘unintended’ usage may be **good enough**

3 important to be **brave** yet **patient**

—idea: 2012; paper (Hsu and Lin, AAAI 2015); software (Yang et al., 2017)

### Summary

• ML for (Modern) AI:

tools + human knowledge ⇒ **easy-to-use application intelligence**

• ML Research for Modern AI:

need to be **more open-minded**
—in methodology, in collaboration, in KPI

• **Math** in ML Research for Modern AI:

new setup/need/goal & wider usage of tools

**Thank you! Questions?**