A Deep Model with Local Surrogate Loss for General Cost-sensitive Multi-label Learning
Cheng-Yu Hsieh
Department of Computer Science and Information Engineering National Taiwan University
NTU Machine Learning Symposium, December 23, 2017
Joint work with Yi-An Lin and Hsuan-Tien Lin
About Me
Cheng-Yu Hsieh
Current Status
- Second-year MS student, Dept. of CSIE, National Taiwan University
- Member of the Computational Learning Lab, led by Prof. Hsuan-Tien Lin
Research Interests
- Reinforcement Learning Applications
  - Apply deep reinforcement learning models to develop a bridge-bidding AI
  - Design techniques to incorporate human knowledge into the learning algorithm
- Cost-sensitive Multi-label Learning
  - Design algorithms that can automatically learn to optimize different evaluation criteria (AAAI 2018)
Outline
1 Background
2 Proposed methods
3 Experiments
4 Conclusion
Multi-label Learning (MLL)
Ordered label set: {cat, dog, rabbit, bird}
[Figure: two example images with their true label vectors, e.g., an image with a cat and a dog → (1, 1, 0, 0); an image with a cat and a rabbit → (1, 0, 1, 0)]
Multi-label Learning Setup
- Given training set $D = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where
  - feature vector $\mathbf{x} \in \mathbb{R}^d$
  - label vector $\mathbf{y} \in \{0, 1\}^K$, where $K$ is the total number of labels and $\mathbf{y}[k] = 1$ iff the $k$-th label is relevant
- Learn a hypothesis $h(\mathbf{x})$ to make accurate predictions on unseen instances
Evaluation of Multi-label Learning Algorithms
- A variety of criteria are used to measure algorithm performance, e.g., Hamming loss, Rank loss, Example-F1 loss, Macro-F1 loss, etc.
- Different applications aim for different criteria (goals), e.g.:

  Application             Criterion
  Information Retrieval   Example-F1 loss
  Image Annotation        Rank loss
An algorithm performing well under one criterion does not necessarily do well under another
Table: Costs when the ground truth is (1, 1, 0)

              Predicted   Rank loss   Example-F1 loss
  Algorithm 1 (0, 1, 0)   0.25        0.33
  Algorithm 2 (1, 1, 1)   0.5         0.2
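To see how the Example-F1 entries arise from the definition given later (the Rank loss values follow the common convention that tied ranks count as 0.5):

$\psi_F\big((1,1,0),(0,1,0)\big) = 1 - \frac{2 \cdot 1}{2 + 1} \approx 0.33, \qquad \psi_F\big((1,1,0),(1,1,1)\big) = 1 - \frac{2 \cdot 2}{2 + 3} = 0.2$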
Problem Setup: MLL with Target Criterion
Notation
- feature vector $\mathbf{x} \in \mathbb{R}^d$
- label vector $\mathbf{y} \in \{0, 1\}^K$, where $K$ is the total number of labels and $\mathbf{y}[k] = 1$ iff the $k$-th label is relevant
Given
- Training set $D = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$
- Testing set $D' = \{(\mathbf{x}'_m, \mathbf{y}'_m)\}_{m=1}^{M}$
- Target criterion $\Psi$ that measures the difference between two label matrices
Goal
Learn a good hypothesis $h$ that minimizes $\Psi(Y', \hat{Y}')$, where
- $Y'$ contains $\{\mathbf{y}'_m\}_{m=1}^{M}$ as its rows
- $\hat{Y}'$ contains $\{\hat{\mathbf{y}}'_m = h(\mathbf{x}'_m)\}_{m=1}^{M}$ as its rows
More on Ψ
All common MLL criteria can be expressed as $\Psi(Y', \hat{Y}')$
- Example-based criteria: $\Psi(Y', \hat{Y}') = \frac{1}{M} \sum_{m=1}^{M} \psi(\mathbf{y}'_m, \hat{\mathbf{y}}'_m)$
  - Hamming loss: $\psi_H(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{K} \sum_{k=1}^{K} [\![\, \mathbf{y}[k] \neq \hat{\mathbf{y}}[k] \,]\!]$
  - Example-F1 loss: $\psi_F(\mathbf{y}, \hat{\mathbf{y}}) = 1 - \frac{2\, \mathbf{y} \cdot \hat{\mathbf{y}}}{\|\mathbf{y}\|_1 + \|\hat{\mathbf{y}}\|_1}$
- Set-based criteria (see the sketch below)
  - Macro-F1 loss: $\Psi_{ma}(Y, \hat{Y}) = 1 - \frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{m=1}^{M} Y_{mk} \hat{Y}_{mk}}{\sum_{m=1}^{M} Y_{mk} + \sum_{m=1}^{M} \hat{Y}_{mk}}$
  - Micro-F1 loss: $\Psi_{mi}(Y, \hat{Y}) = 1 - \frac{2 \sum_{k=1}^{K} \sum_{m=1}^{M} Y_{mk} \hat{Y}_{mk}}{\sum_{k=1}^{K} \sum_{m=1}^{M} Y_{mk} + \sum_{k=1}^{K} \sum_{m=1}^{M} \hat{Y}_{mk}}$
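As a concrete reference, here is a minimal NumPy sketch of these criteria (our own illustration, not code from the talk), assuming binary 0/1 matrices with no all-zero rows or columns so every denominator stays positive; the function names are ours:

```python
import numpy as np

def hamming_loss(Y, Y_hat):
    # Fraction of (example, label) positions where prediction and truth disagree
    return np.mean(Y != Y_hat)

def example_f1_loss(Y, Y_hat):
    # 1 - F1 per example (row), averaged over examples
    inter = (Y * Y_hat).sum(axis=1)
    return np.mean(1.0 - 2.0 * inter / (Y.sum(axis=1) + Y_hat.sum(axis=1)))

def macro_f1_loss(Y, Y_hat):
    # 1 - F1 per label (column), averaged over labels
    inter = (Y * Y_hat).sum(axis=0)
    return np.mean(1.0 - 2.0 * inter / (Y.sum(axis=0) + Y_hat.sum(axis=0)))

def micro_f1_loss(Y, Y_hat):
    # 1 - F1 from counts pooled over all examples and labels
    return 1.0 - 2.0 * (Y * Y_hat).sum() / (Y.sum() + Y_hat.sum())

# The two predictions from the earlier table, each with ground truth (1, 1, 0)
Y     = np.array([[1, 1, 0], [1, 1, 0]])
Y_hat = np.array([[0, 1, 0], [1, 1, 1]])
print(example_f1_loss(Y, Y_hat))  # mean of 0.33 and 0.2, i.e. about 0.267
```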
Optimization of Target Criterion
How do we optimize a given criterion? The non-smoothness of MLL criteria makes direct optimization hard.
Surrogate Loss
- Design a surrogate loss that can be easily optimized
- Criterion → Surrogate Loss → Optimization → Good Model
Example
- Hamming loss → Binary Cross Entropy (Nam et al. 2014)
- Rank loss → BP-MLL (Zhang and Zhou 2006)
Unfortunately, designing such surrogates is inefficient and impractical for many criteria.
Automatic Adaptation to Different Criteria
Cost-sensitive Multi-label Learning (CSMLL)
- Cope with different criteria automatically and systematically
- Consider the evaluation criterion in either the training or the testing phase
Example
- Make cost-sensitive predictions with Bayes-optimal inference (Dembczynski et al. 2010)
- Embed cost information in sample weights (Li and Lin 2014)
- Transform cost information into distances in an embedded space (Huang and Lin 2017)
Current algorithms can only handle certain special types of criteria.
Challenges and Opportunities
Challenges
- Existing surrogate losses rely on human design
- Current CSMLL methods are restricted
Opportunities
- The idea of surrogate loss is general
  - in principle, a surrogate loss exists even for criteria like Macro-Fβ
- The spirit of CSMLL methods should be kept: digest cost information in the training/testing phase
Can we combine the advantages?
Goal: a surrogate that can automatically adapt to general criteria
Sample-weighting Framework for CSMLL
Main Idea
higher misclassification cost → more weight
Estimation of Misclassification Cost
- Consider $K$ classifiers, each responsible for predicting $\hat{\mathbf{y}}[k]$
- Assume the other classifiers are fixed when training the $k$-th classifier
- The misclassification cost for $\hat{\mathbf{y}}_n[k]$ is $|c^0_{n,k} - c^1_{n,k}|$, where
  - $c^0_{n,k} = \psi(\mathbf{y}_n, (\hat{\mathbf{y}}_n[1, \ldots, k-1], 0, \hat{\mathbf{y}}_n[k+1, \ldots, K]))$
  - $c^1_{n,k} = \psi(\mathbf{y}_n, (\hat{\mathbf{y}}_n[1, \ldots, k-1], 1, \hat{\mathbf{y}}_n[k+1, \ldots, K]))$
- Weight = misclassification cost (a sketch of this computation follows below)
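A minimal sketch of this weighting step, assuming an example-based criterion ψ and binary prediction matrices; `sample_weights` and `example_f1` are illustrative names, not from the talk:

```python
import numpy as np

def example_f1(y, y_hat):
    # Example-F1 loss for a single (truth, prediction) pair
    return 1.0 - 2.0 * (y * y_hat).sum() / (y.sum() + y_hat.sum())

def sample_weights(psi, Y, Y_hat):
    """w[n, k] = |c0 - c1|: the change in cost psi when label k of example n
    is flipped between 0 and 1, with all other predicted labels held fixed."""
    N, K = Y.shape
    W = np.zeros((N, K))
    for n in range(N):
        for k in range(K):
            y0, y1 = Y_hat[n].copy(), Y_hat[n].copy()
            y0[k], y1[k] = 0, 1
            W[n, k] = abs(psi(Y[n], y0) - psi(Y[n], y1))
    return W

W = sample_weights(example_f1, np.array([[1, 1, 0]]), np.array([[0, 1, 0]]))
print(W)  # roughly [[0.333, 0.667, 0.167]]
```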
General Framework
- Embeds target criterion information in sample weights
- Contains previous methods as special cases
A Simple CSMLL Deep Model
- Each output node corresponds to a label prediction
- The current prediction is easily obtained as $\hat{\mathbf{y}}_n = DNN(\mathbf{x}_n)$
- Calculate a weight for each sample at each output node
Weighted Binary Cross Entropy (WBCE)
- Consider optimizing all $K$ classifiers (output nodes) together:
  $L_{WBCE} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,k}\, L_{log}(\mathbf{y}_n[k], \hat{\mathbf{y}}_n[k])$,
  where $L_{log}(y[k], \hat{y}[k]) = -(y[k] \log \hat{y}[k] + (1 - y[k]) \log(1 - \hat{y}[k]))$
- Degenerates to the conventional Binary Cross Entropy if all $w_{n,k} = 1$ (see the sketch below)
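A minimal NumPy sketch of this loss (forward computation only; in practice it would be written as a differentiable loss inside an autodiff framework), with `wbce` an illustrative name:

```python
import numpy as np

def wbce(Y, Y_prob, W, eps=1e-7):
    """Weighted binary cross entropy: per-label log loss scaled by the
    per-sample, per-label weights W; W = 1 everywhere recovers plain BCE."""
    Y_prob = np.clip(Y_prob, eps, 1.0 - eps)      # guard against log(0)
    log_loss = -(Y * np.log(Y_prob) + (1 - Y) * np.log(1 - Y_prob))
    return np.mean(np.sum(W * log_loss, axis=1))  # sum over labels, mean over examples

# With unit weights this equals the conventional BCE
L = wbce(np.array([[1, 1, 0]]), np.array([[0.2, 0.9, 0.1]]), np.ones((1, 3)))
```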
BCE versus WBCE
- Conventional BCE: ignores cost information
- Weighted BCE: utilizes cost information
Drawbacks
- Can only deal with example-based criteria
- Utilizes only limited cost information
Learning Surrogate for General Criteria
How can we leverage cost information to generate a surrogate?
Properties of the desired surrogate
Towards Tackling General CSMLL
Our solution
- Iteratively fit a smooth local approximation to the criterion
- Use the local estimate to decide a descent direction

Locally-learned Surrogate Loss (LLSL)
1. Collect local neighbors $Z = \{(0, 1), (0, 0), (1, 1)\}$
2. Obtain the corresponding costs $C = \{1, 0.875, 0.375\}$
3. Learn $f: Z \to C$ as the surrogate, e.g., $L_{LLSL} = -0.625\, y_1 + 0.125\, y_2 + 0.875$
4. Optimize with respect to $L_{LLSL}$, e.g., $\nabla L_{LLSL} = (-0.625, 0.125)$

Simple but powerful! (A code sketch of these steps follows below.)
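A minimal sketch of steps 1-4, reproducing the numbers above with ordinary least squares as the local approximator (variable names are ours):

```python
import numpy as np

# Steps 1-2: local neighbors of the current prediction and their true costs
Z = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
C = np.array([1.0, 0.875, 0.375])

# Step 3: fit the surrogate f(z) = a . z + b by least squares
A = np.hstack([Z, np.ones((len(Z), 1))])       # columns: [y1, y2, 1]
coef, *_ = np.linalg.lstsq(A, C, rcond=None)
a, b = coef[:2], coef[2]
print(a, b)  # [-0.625  0.125] 0.875, matching the surrogate above

# Step 4: the gradient of the linear surrogate is simply `a`; it serves as
# the local descent direction backpropagated through the deep model
```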
Locally-learned Surrogate Loss for General CSMLL
Locally-learned Surrogate Loss
- A smooth surrogate
- Learned automatically
- Provides a locally-faithful approximation to the criterion
- Guides the descent direction for optimization
Advantages
- Easy to optimize
- Able to adapt to general criteria
- Flexible in the choice of local approximation methods
- Can be coupled with models based on descent optimization
Naturally Coupled with Deep Learning!
Experimental Results
Algorithms
- State-of-the-art CSMLL algorithm: CFT
- Deep learning models: BP-MLL, WARP, BCE
- Ours: WBCE, LLSL(linear-1), LLSL(linear-2), LLSL(poly-2)
Differences among Local Approximators
Lower approximation error → Better model performance
Table: Approximation error in RMSE

Dataset    Criterion    linear-1         linear-2         poly-2           Consistent
birds      Rank         0.073 ± 0.079    2.150 ± 3.148    1.274 ± 2.849    ✓
birds      Example-F1   0.034 ± 0.048    0.053 ± 0.044    0.023 ± 0.032    ✓
birds      Micro-F1     0.146 ± 0.046    0.116 ± 0.021    0.132 ± 0.033    ✓
birds      Macro-F1     0.485 ± 0.025    0.551 ± 0.061    0.450 ± 0.064    ✓
emotions   Rank         0.170 ± 0.072    0.613 ± 0.329    0.990 ± 0.560    ✓
emotions   Example-F1   0.159 ± 0.070    0.173 ± 0.033    0.093 ± 0.029    ✓
emotions   Micro-F1     0.227 ± 0.054    0.242 ± 0.088    0.152 ± 0.026    ✓
emotions   Macro-F1     0.209 ± 0.047    0.187 ± 0.114    0.172 ± 0.016    ✓
scene      Rank         0.110 ± 0.058    0.603 ± 0.291    0.478 ± 0.495    ✓
scene      Example-F1   0.154 ± 0.088    0.133 ± 0.072    0.123 ± 0.067    ✗
scene      Micro-F1     0.390 ± 0.137    0.355 ± 0.140    0.225 ± 0.175    ✓
scene      Macro-F1     0.200 ± 0.153    0.184 ± 0.090    0.172 ± 0.089    ✓
A more sophisticated approximator works better for complex criteria!
Conclusion
We propose
- A novel surrogate able to adapt to any given MLL criterion
- The first cost-sensitive multi-label learning deep model
The proposed model successfully
- Tackles the general CSMLL problem
- Achieves superior performance against existing methods
Thank you for listening!
Any questions?