### A Deep Model with Local Surrogate Loss for General Cost-sensitive Multi-label Learning

Cheng-Yu Hsieh

Department of Computer Science and Information Engineering National Taiwan University

NTU Machine Learning Symposium, December 23, 2017

Joint work with Yi-An Lin and Hsuan-Tien Lin

### About Me

Cheng-Yu Hsieh

Current Status

Second Year MS Student, Dept. of CSIE, National Taiwan University

Member, Computational Learning Lab led by Prof. Hsuan-Tien Lin

Research Interest

Reinforcement Learning Application

Apply deep reinforcement learning models to develop bridge bidding AI

Design different techniques to incorporate human knowledge into the learning algorithm

Cost-sensitive Multi-label Learning

Design algorithms that can automatically learn to optimize different evaluation criteria (AAAI 2018)

### Outline

1 Background

2 Proposed methods

3 Experiments

4 Conclusion

### Multi-label Learning (MLL)

Ordered label set: {cat, dog, rabbit, bird}

Example: two inputs (figures omitted) with true labels (1, 1, 0, 0) and (1, 0, 1, 0)

Multi-label Learning Setup

Given training set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where
feature vector $x \in \mathbb{R}^{d}$

label vector $y \in \{0, 1\}^{K}$, where $K$ is the total number of labels and
$y[k] = 1$ iff the $k$-th label is relevant

Learn a hypothesis $h(x)$ to make accurate predictions on unseen instances
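The label-vector encoding above can be illustrated with a minimal sketch. The helper names `encode` and `decode` are hypothetical and only serve this example; the ordered label set is the one from the slide.

```python
import numpy as np

# Ordered label set from the slide; the position in this list is the index k.
LABELS = ["cat", "dog", "rabbit", "bird"]

def encode(relevant):
    """Map a set of relevant label names to a binary label vector y in {0, 1}^K."""
    return np.array([1 if name in relevant else 0 for name in LABELS])

def decode(y):
    """Map a binary label vector back to the set of relevant label names."""
    return {name for name, bit in zip(LABELS, y) if bit == 1}

y_a = encode({"cat", "dog"})      # first example on the slide
y_b = encode({"cat", "rabbit"})   # second example on the slide
```

Under this encoding, the two true labels on the slide correspond to {cat, dog} and {cat, rabbit}.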

### Evaluation of Multi-label Learning Algorithms

A variety of criteria are used to measure algorithm performance

E.g., Hamming loss, Rank loss, Example-F1 loss, Macro-F1 loss, etc.

Different applications aim for different criteria (goals)

E.g.

| Application | Criterion |
|---|---|
| Information Retrieval | Example-F1 loss |
| Image Annotation | Rank loss |

An algorithm performing well under one criterion does not necessarily do well under another

Table: Costs when the ground truth is (1, 1, 0)

| | Predicted | Rank loss | Example-F1 loss |
|---|---|---|---|
| Algorithm 1 | (0, 1, 0) | 0.25 | 0.3 |
| Algorithm 2 | (1, 1, 1) | 0.5 | 0.2 |
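The numbers in the table can be reproduced with a short sketch. The rank loss definition used here is an assumption: the fraction of (relevant, irrelevant) label pairs that are ordered wrongly, with a tie counted as half an error. The slides do not spell this out, but the definition matches the values shown. Example-F1 loss follows the usual $1 - 2\,y \cdot \hat{y} / (\|y\|_1 + \|\hat{y}\|_1)$ form.

```python
import numpy as np

def example_f1_loss(y, y_hat):
    """Example-F1 loss: 1 - 2 * (y . y_hat) / (|y|_1 + |y_hat|_1)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    denom = y.sum() + y_hat.sum()
    if denom == 0:
        return 0.0  # assumption: two empty label sets agree perfectly
    return 1.0 - 2.0 * float(np.dot(y, y_hat)) / denom

def rank_loss(y, y_hat):
    """Assumed definition: share of (relevant, irrelevant) pairs ordered wrongly; ties cost 0.5."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    relevant = np.flatnonzero(y == 1)
    irrelevant = np.flatnonzero(y == 0)
    if len(relevant) == 0 or len(irrelevant) == 0:
        return 0.0
    total = 0.0
    for i in relevant:
        for j in irrelevant:
            if y_hat[i] < y_hat[j]:
                total += 1.0
            elif y_hat[i] == y_hat[j]:
                total += 0.5
    return total / (len(relevant) * len(irrelevant))

truth = (1, 1, 0)
alg1, alg2 = (0, 1, 0), (1, 1, 1)
costs = {
    "Algorithm 1": (rank_loss(truth, alg1), example_f1_loss(truth, alg1)),
    "Algorithm 2": (rank_loss(truth, alg2), example_f1_loss(truth, alg2)),
}
```

Algorithm 1 is better under rank loss while Algorithm 2 is better under Example-F1 loss. The Example-F1 loss of Algorithm 1 comes out as 1/3, which the table shows rounded to 0.3.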

### Problem Setup: MLL with Target Criterion

Notation

feature vector $x \in \mathbb{R}^{d}$

label vector $y \in \{0, 1\}^{K}$, where $K$ is the total number of labels and
$y[k] = 1$ iff the $k$-th label is relevant

Given

Training set $D = \{(x_n, y_n)\}_{n=1}^{N}$
Testing set $D' = \{(x'_m, y'_m)\}_{m=1}^{M}$

Target criterion $\Psi$ that measures the difference between two matrices

Goal

Learn a good hypothesis $h$ that minimizes $\Psi(Y', \hat{Y}')$, where
$Y'$ contains $\{y'_m\}_{m=1}^{M}$ as its rows

$\hat{Y}'$ contains $\{\hat{y}'_m = h(x'_m)\}_{m=1}^{M}$ as its rows

### More on Ψ

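All common MLL criteria can be expressed as $\Psi(Y', \hat{Y}')$

Example-based criteria

$$\Psi(Y', \hat{Y}') = \frac{1}{M} \sum_{m=1}^{M} \psi(y'_m, \hat{y}'_m)$$

Hamming loss: $\psi_H(y, \hat{y}) = \frac{1}{K} \sum_{k=1}^{K} [\![\, y[k] \neq \hat{y}[k] \,]\!]$

Example-F1 loss: $\psi_F(y, \hat{y}) = 1 - \frac{2\, y \cdot \hat{y}}{\|y\|_1 + \|\hat{y}\|_1}$

Set-based criteria

Macro-F1 loss:

$$\Psi_{ma}(Y, \hat{Y}) = 1 - \frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{m=1}^{M} Y_{mk} \hat{Y}_{mk}}{\sum_{m=1}^{M} Y_{mk} + \sum_{m=1}^{M} \hat{Y}_{mk}}$$

Micro-F1 loss:

$$\Psi_{mi}(Y, \hat{Y}) = 1 - \frac{2 \sum_{k=1}^{K} \sum_{m=1}^{M} Y_{mk} \hat{Y}_{mk}}{\sum_{k=1}^{K} \sum_{m=1}^{M} Y_{mk} + \sum_{k=1}^{K} \sum_{m=1}^{M} \hat{Y}_{mk}}$$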

### Optimization of Target Criterion

How to optimize a given criterion?

Non-smoothness of MLL criteria

Surrogate Loss

Design surrogate loss which can be easily optimized

Criterion → Surrogate Loss → Optimization → Good Model

Example

Hamming loss → Binary Cross Entropy (Nam et al. 2014)

Rank loss → BP-MLL (Zhang and Zhou 2006)

Unfortunately, hand-designing a surrogate is inefficient and impractical for many criteria

### Automatic Adaptation to Different Criteria

Cost-sensitive Multi-label Learning (CSMLL)

Cope with different criteria automatically and systematically

Consider the evaluation criterion in either the training or the testing phase

Example

Make cost-sensitive prediction with Bayes-optimal inference (Dembczynski et al. 2010)

Embed cost information in sample weights (Li and Lin 2014)

Transform cost information into distance in an embedded space (Huang and Lin 2017)

Current algorithms can only handle some special types of criteria

### Challenges and Opportunities

Challenges

Existing surrogate losses rely on human designs

Current CSMLL methods are restricted

Opportunities

The idea of surrogate loss is general

There exist surrogate losses for criteria like Macro-Fβ

The spirit of CSMLL methods should be kept

Digest cost information in the training/testing phase

Can we combine the advantages?

Goal: a surrogate that can automatically adapt to general criteria

### Sample-weighting Framework for CSMLL

Main Idea

higher misclassification cost → more weight

Estimation of Misclassification Cost

Consider $K$ classifiers, each responsible for predicting $\hat{y}[k]$

Assume the other classifiers are fixed when training the $k$-th classifier
Misclassification cost for $\hat{y}_n[k]$ is $|c^{0}_{n,k} - c^{1}_{n,k}|$, where

$c^{0}_{n,k} = \psi(y_n, (\hat{y}_n[1, \ldots, k-1], 0, \hat{y}_n[k+1, \ldots, K]))$
$c^{1}_{n,k} = \psi(y_n, (\hat{y}_n[1, \ldots, k-1], 1, \hat{y}_n[k+1, \ldots, K]))$
Weight = Misclassification cost

General Framework

Embed target criterion information in sample weights

Contains previous methods as special cases
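The weight computation described above can be sketched for one example as follows. The function name `cost_weights` is hypothetical, and Example-F1 loss is used only as one possible choice of the per-example criterion $\psi$.

```python
import numpy as np

def example_f1_loss(y, y_hat):
    den = y.sum() + y_hat.sum()
    return 0.0 if den == 0 else 1.0 - 2.0 * float(np.dot(y, y_hat)) / den

def hamming_loss(y, y_hat):
    return float(np.mean(y != y_hat))

def cost_weights(y, y_hat, psi):
    """Return w[k] = |c0 - c1|, the cost gap between predicting 0 and 1 at label k.

    All other entries of the current prediction y_hat are held fixed,
    mirroring the "other classifiers fixed" assumption on the slide.
    """
    K = len(y)
    w = np.empty(K)
    for k in range(K):
        as_zero = y_hat.copy()
        as_zero[k] = 0
        as_one = y_hat.copy()
        as_one[k] = 1
        w[k] = abs(psi(y, as_zero) - psi(y, as_one))
    return w

y = np.array([1, 1, 0])        # ground truth
y_hat = np.array([0, 1, 0])    # current prediction
w_f1 = cost_weights(y, y_hat, example_f1_loss)
w_hamming = cost_weights(y, y_hat, hamming_loss)
```

With Example-F1 loss the weights differ across labels, so a mistake on the second label matters most here. With Hamming loss every weight equals 1/K, so the weighting is uniform.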

### A Simple CSMLL Deep Model

Simple CSMLL Deep Model

Each output node corresponds to a label prediction
Current prediction easily obtained by $\hat{y}_n = \mathrm{DNN}(x_n)$
Calculate weight for each sample at each output node

Weighted Binary Cross Entropy (WBCE)

Consider optimizing all K classifiers (output nodes) together
$$L_{WBCE} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,k}\, L_{log}(y_n[k], \hat{y}_n[k])$$

where $L_{log}(y[k], \hat{y}[k]) = -\left(y[k] \log(\hat{y}[k]) + (1 - y[k]) \log(1 - \hat{y}[k])\right)$

Degenerates to conventional Binary Cross Entropy if all $w_{n,k} = 1$
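A minimal NumPy sketch of the WBCE formula is given below. The function name `wbce`, the clipping constant `eps`, and the toy inputs are assumptions for illustration. In a deep model the same expression would be written in the framework's differentiable operations.

```python
import numpy as np

def wbce(Y, P, W, eps=1e-12):
    """Weighted binary cross entropy: (1/N) sum_n sum_k w[n,k] * L_log(y[n,k], p[n,k]).

    Y: (N, K) binary ground truth, P: (N, K) predicted probabilities,
    W: (N, K) non-negative weights (e.g. misclassification costs).
    """
    P = np.clip(P, eps, 1.0 - eps)  # avoid log(0); eps is an illustrative choice
    log_loss = -(Y * np.log(P) + (1 - Y) * np.log(1 - P))
    return float((W * log_loss).sum(axis=1).mean())

# Made-up toy batch: N = 2 examples, K = 2 labels.
Y = np.array([[1, 0], [0, 1]])
P = np.array([[0.9, 0.2], [0.4, 0.7]])

bce_value = wbce(Y, P, np.ones_like(P))                          # all weights 1: plain BCE
focused_value = wbce(Y, P, np.array([[2.0, 0.0], [0.0, 0.0]]))   # only one entry counts
```

Setting every weight to 1 recovers the conventional (summed over labels) binary cross entropy, which matches the degenerate case noted on the slide.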

### BCE versus WBCE

Conventional BCE (ignores cost information)

Weighted BCE (utilizes cost information)

Drawbacks

Can only deal with example-based criteria

Utilizes limited cost information

### Learning Surrogate for General Criteria

How to leverage cost information to generate surrogate?

Initial Properties of desired surrogate

### Towards Tackling General CSMLL

Our solution

Iteratively fit a smooth local approximation to the criterion

Use the local estimation to decide a descent direction

Locally-learned Surrogate Loss (LLSL)

1 Local Neighbors Z = {(0, 1), (0, 0), (1, 1)}

2 Corresponding Cost C = {1, 0.875, 0.375}

3 Learn f : Z → C as the surrogate

E.g., $L_{LLSL} = -0.625\, y_1 + 0.125\, y_2 + 0.875$

4 Optimize with respect to $L_{LLSL}$

E.g., gradient($L_{LLSL}$) = (−0.625, 0.125)

Simple but Powerful!
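The worked example in steps 1 to 4 can be checked numerically. The sketch below fits a linear function with an intercept to the three neighbor/cost pairs from the slide. Using least squares via `np.linalg.lstsq` is one possible way to learn $f$, and is an assumption here.

```python
import numpy as np

# Step 1 and 2: local neighbors Z and their costs C, taken from the slide.
Z = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
C = np.array([1.0, 0.875, 0.375])

# Step 3: learn f(y) = w . y + b by least squares (append a column of ones for b).
A = np.hstack([Z, np.ones((len(Z), 1))])
coef, *_ = np.linalg.lstsq(A, C, rcond=None)
w, b = coef[:2], coef[2]

# Step 4: for a linear surrogate the gradient with respect to y is simply w.
gradient = w
```

The fit recovers the coefficients on the slide, and the negative first component of the gradient indicates that increasing $y_1$ lowers the estimated cost.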

### Locally-learned Surrogate Loss for General CSMLL

Locally-learned Surrogate Loss

A smooth surrogate

Learned automatically

Provides a locally faithful approximation to the criterion

Guides the descent direction for optimization

Advantages

Easy to optimize

Able to adapt to general criteria

Flexible in the choice of local approximation methods

Can be coupled with models based on descent optimization

Naturally Coupled with Deep Learning!
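How a locally learned surrogate might drive a descent loop can be sketched on a toy problem. This is not the authors' deep model: the "model" here is just a probability vector `p` optimized directly, and the neighbor choice (the current rounded prediction plus all single-label flips), the linear approximator, the learning rate, and the step count are all assumptions made for illustration.

```python
import numpy as np

def example_f1_loss(y, y_hat):
    den = y.sum() + y_hat.sum()
    return 0.0 if den == 0 else 1.0 - 2.0 * float(np.dot(y, y_hat)) / den

def local_linear_gradient(y, y_hat, criterion):
    """Fit a linear surrogate on y_hat and its single-flip neighbors; return its gradient."""
    K = len(y_hat)
    neighbors = [y_hat.copy()]
    for k in range(K):
        z = y_hat.copy()
        z[k] = 1 - z[k]              # flip one label to get a neighbor
        neighbors.append(z)
    Z = np.array(neighbors, dtype=float)
    C = np.array([criterion(y, z) for z in neighbors])
    A = np.hstack([Z, np.ones((K + 1, 1))])
    coef, *_ = np.linalg.lstsq(A, C, rcond=None)
    return coef[:K]                  # gradient of the linear surrogate

def descend(y, p, criterion, lr=0.5, steps=10):
    """Repeat: round p, learn the local surrogate, step against its gradient."""
    p = p.astype(float).copy()
    for _ in range(steps):
        y_hat = (p >= 0.5).astype(int)
        grad = local_linear_gradient(y, y_hat, criterion)
        p = np.clip(p - lr * grad, 0.0, 1.0)
    return p

y = np.array([1, 1, 0])
p0 = np.array([0.4, 0.6, 0.6])
initial_cost = example_f1_loss(y, (p0 >= 0.5).astype(int))
p_final = descend(y, p0, example_f1_loss)
y_final = (p_final >= 0.5).astype(int)
final_cost = example_f1_loss(y, y_final)
```

In a deep model, the gradient of the surrogate with respect to the outputs would presumably be back-propagated into the network parameters instead of being applied to `p` directly. The criterion is only ever queried as a black box, which is what allows the same loop to be reused for other criteria.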

### Experimental Results

Algorithms

State-of-the-art CSMLL algorithm: CFT

Deep Learning Models: BP-MLL, WARP, BCE

Ours: WBCE, LLSL(linear-1), LLSL(linear-2), LLSL(poly-2)

### Difference of Local Approximators

Lower approximation error → Better model performance

Table: Approximation error (RMSE) of each local approximator

| Dataset | Criterion | linear-1 | linear-2 | poly-2 | Consistent |
|---|---|---|---|---|---|
| birds | Rank | 0.073 ± 0.079 | 2.150 ± 3.148 | 1.274 ± 2.849 | ✓ |
| birds | Example-F1 | 0.034 ± 0.048 | 0.053 ± 0.044 | 0.023 ± 0.032 | ✓ |
| birds | Micro-F1 | 0.146 ± 0.046 | 0.116 ± 0.021 | 0.132 ± 0.033 | ✓ |
| birds | Macro-F1 | 0.485 ± 0.025 | 0.551 ± 0.061 | 0.450 ± 0.064 | ✓ |
| emotions | Rank | 0.170 ± 0.072 | 0.613 ± 0.329 | 0.990 ± 0.560 | ✓ |
| emotions | Example-F1 | 0.159 ± 0.070 | 0.173 ± 0.033 | 0.093 ± 0.029 | ✓ |
| emotions | Micro-F1 | 0.227 ± 0.054 | 0.242 ± 0.088 | 0.152 ± 0.026 | ✓ |
| emotions | Macro-F1 | 0.209 ± 0.047 | 0.187 ± 0.114 | 0.172 ± 0.016 | ✓ |
| scene | Rank | 0.110 ± 0.058 | 0.603 ± 0.291 | 0.478 ± 0.495 | ✓ |
| scene | Example-F1 | 0.154 ± 0.088 | 0.133 ± 0.072 | 0.123 ± 0.067 | ✗ |
| scene | Micro-F1 | 0.390 ± 0.137 | 0.355 ± 0.140 | 0.225 ± 0.175 | ✓ |
| scene | Macro-F1 | 0.200 ± 0.153 | 0.184 ± 0.090 | 0.172 ± 0.089 | ✓ |

A more sophisticated approximator is better for complex criteria!

### Conclusion

We propose

A novel surrogate able to adapt to any given MLL criterion

The first cost-sensitive multi-label learning deep model

The proposed model successfully

Tackles the general CSMLL problem

Achieves superior performance against existing methods

Thank you for listening!

Any questions?