Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels

Academic year: 2022

(1)

Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama

ICML 2020 work done during Chou’s internship at RIKEN AIP, Japan;

resulting M.S. thesis of Chou won the 2020 thesis award of TAAI

October 8, 2021, AI Forum, Taipei, Taiwan

(2)

Introduction

Supervised Learning

(Slide Modified from My ML Foundations MOOC)

[Figure: scatter plot with axes Size and Mass; regions labeled 1, 5, 10, 25]

unknown target function $f : \mathcal{X} \to \mathcal{Y}$

training examples $\mathcal{D} : (x_1, y_1), \cdots, (x_N, y_N)$

learning algorithm $\mathcal{A}$

final hypothesis $g \approx f$

hypothesis set $\mathcal{H}$

supervised learning: every input vector $x_n$ with its (possibly expensive) label $y_n$

(3)

Weakly-supervised: Learning without True Labels $y_n$

(a) positive-unlabeled learning [EN08] (b) learning with complementary labels [Ish+17] (c) learning with noisy labels [Nat+13]

• positive-unlabeled: some of true $y_n = +1$ revealed

• complementary: 'not label' $\bar{y}_n$ instead of true $y_n$

• noisy: noisy label $y'_n$ instead of true $y_n$

weakly-supervised: a realistic and hot research direction to reduce labeling burden

[EN08] Learning classifiers from only positive and unlabeled data, KDD’08.

[Ish+17] Learning from complementary labels, NeurIPS’17.

(4)

Motivation

popular weakly-supervised models [DNS15; Ish+19; Pat+17]

• derive Unbiased Risk Estimators (URE) as new loss

• theoretically, nice properties (unbiased, consistent, etc.) [Ish+17]

• practically, sometimes bad performance (overfitting)

our contributions: on Learning with Complementary Labels (LCL)

• analysis: identify weakness of URE framework

• algorithm: propose an improved framework

• experiment: demonstrate stronger performance

next: introduction to LCL

[DNS15] Convex formulation for learning from positive and unlabeled data, ICML’15.

[Ish+19] Complementary-Label Learning for Arbitrary Losses and Models, ICML’19.

[Pat+17] Making deep neural networks robust to label noise: A loss correction approach, CVPR’17.

(5)

Motivation behind Learning with Complementary Labels

complementary label $\bar{y}_n$ instead of true $y_n$

Figure 1 of [Yu+18]

complementary label: easier/cheaper to obtain for some applications

(6)

Fruit Labeling Task (Image from AICup in 2020)

hard: true label

• orange?

• mango?

• cherry

• banana

easy: complementary label

• orange

• mango

• cherry

• banana ✗

complementary: less labeling cost/expertise required

(7)

Comparison

Ordinary (Supervised) Learning

training: {($x_n$ = [image], $y_n$ = mango)} → classifier

Complementary Learning

training: {($x_n$ = [image], $\bar{y}_n$ = banana)} → classifier

testing goal: classifier([image]) → cherry

ordinary versus complementary: same goal via different training data

(8)

Learning with Complementary Labels Setup

Given

$N$ examples (input $x_n$, complementary label $\bar{y}_n$) $\in \mathcal{X} \times \{1, 2, \cdots, K\}$ in data set $\mathcal{D}$ such that $\bar{y}_n \neq y_n$ for some hidden ordinary label $y_n \in \{1, 2, \cdots, K\}$.

Goal

a multi-class classifier $g(x)$ that closely predicts (0/1 error) the ordinary label $y$ associated with some unseen inputs $x$

LCL model design: connecting complementary & ordinary
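
To make the setup concrete, here is a minimal sketch of how a complementary-label data set could be simulated from ordinary labels. It assumes the uniform setting (each wrong class is equally likely to be picked) and uses 0-based class indices rather than the slide's $\{1, \cdots, K\}$; the helper name `make_complementary_labels` is hypothetical and not taken from the talk.

```python
import numpy as np

def make_complementary_labels(y, num_classes, rng):
    """Draw one complementary label per example, uniformly from the K-1 wrong classes."""
    y = np.asarray(y)
    # An offset in {1, ..., K-1} added modulo K can never land back on y itself,
    # and each of the K-1 wrong classes is reached by exactly one offset.
    offsets = rng.integers(1, num_classes, size=y.shape)
    return (y + offsets) % num_classes

rng = np.random.default_rng(0)
K = 4
y = rng.integers(0, K, size=1000)             # hidden ordinary labels (0-based)
y_bar = make_complementary_labels(y, K, rng)  # what the learner actually observes
```

In this simulation the learner would only be given the inputs and `y_bar`; `y` stays hidden and is used for evaluation only.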

(9)

Unbiased Risk Estimation for LCL

Ordinary Learning

• empirical risk minimization (ERM) on training data

risk: $\mathbb{E}_{(x,y)}[\ell(y, g(x))]$; empirical risk: $\mathbb{E}_{(x_n,y_n)\in\mathcal{D}}[\ell(y_n, g(x_n))]$

• loss $\ell$: usually surrogate of 0/1 error

LCL [Ish+19]

• rewrite the loss $\ell$ to $\bar{\ell}$, such that

unbiased risk estimator: $\mathbb{E}_{(x,\bar{y})}[\bar{\ell}(\bar{y}, g(x))] = \mathbb{E}_{(x,y)}[\ell(y, g(x))]$

• LCL by minimizing URE

URE: pioneer models for LCL

(10)

Example of URE

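Cross Entropy Loss

for $g(x) = \operatorname{argmax}_{k \in \{1, 2, \ldots, K\}} p(k \mid x)$,

• $\ell_{\mathrm{CE}}$: derived by maximum likelihood as surrogate of 0/1

risk: $R(g; \ell_{\mathrm{CE}}) = \mathbb{E}_{(x,y)}\big[\underbrace{-\log p(y \mid x)}_{\ell_{\mathrm{CE}}}\big]$

Complementary Learning [Ish+19]

URE: $\bar{R}(g; \bar{\ell}) = \mathbb{E}_{(x,\bar{y})}\Big[\overbrace{\underbrace{(K-1)\log p(\bar{y} \mid x)}_{\text{negative}} - \sum_{k=1}^{K} \log p(k \mid x)}^{\bar{\ell}}\Big]$

under uniform $\bar{y}$ assumption

ERM with URE: $\min_p \bar{R}$ with $\mathbb{E}$ taken on $\mathcal{D}$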
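
The unbiasedness claim can be checked numerically for a single input. The sketch below assumes the uniform setting (the complementary label is drawn uniformly from the $K-1$ wrong classes) and an arbitrary, made-up probability vector `p`; the function names are illustrative only.

```python
import numpy as np

def ce_loss(p, y):
    """Ordinary cross-entropy loss: -log p(y | x)."""
    return -np.log(p[y])

def ure_ce_loss(p, y_bar):
    """Rewritten loss: (K - 1) log p(y_bar | x) - sum_k log p(k | x)."""
    K = p.shape[0]
    return (K - 1) * np.log(p[y_bar]) - np.sum(np.log(p))

p = np.array([0.6, 0.25, 0.1, 0.05])   # made-up softmax output for one input
y = 0                                  # hidden ordinary label
K = len(p)

# Average the rewritten loss over all complementary labels that are possible
# for this y; under the uniform assumption each has probability 1 / (K - 1).
wrong_classes = [k for k in range(K) if k != y]
expected_ure = np.mean([ure_ce_loss(p, k) for k in wrong_classes])
ordinary = ce_loss(p, y)
```

Here `expected_ure` and `ordinary` agree, which is the unbiasedness property. The individual values `ure_ce_loss(p, k)` differ from each other and from `ordinary`; only their average matches.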

(11)

Problems of URE

URE overfits on single label

$\ell = -\log p(y \mid x)$

$\bar{\ell} = (K-1)\log p(\bar{y} \mid x) - \sum_{k=1}^{K} \log p(k \mid x)$

ordinary risk and URE very different

• $\ell \ge 0$ → ordinary risk non-negative

• small $p(\bar{y} \mid x)$ (often) → possibly very negative $\bar{\ell}$

empirical URE can be negative: only observing some but not all $\bar{y}$

• negative empirical URE drags minimization towards overfitting

practical remedy: [Ish+19]

NN-URE: constrain empirical URE to be non-negative

how can we avoid negative empirical URE?
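
A small numerical sketch of why the rewritten loss can go negative: as the model pushes $p(\bar{y} \mid x)$ towards zero for the one observed complementary label, $\bar{\ell}$ keeps decreasing without a lower bound (for $K > 2$). The probability vectors below are synthetic, with the remaining mass spread evenly over the other classes; this is an illustration, not the experiment from the talk.

```python
import numpy as np

def ure_ce_loss(p, y_bar):
    """Rewritten loss: (K - 1) log p(y_bar | x) - sum_k log p(k | x)."""
    K = p.shape[0]
    return (K - 1) * np.log(p[y_bar]) - np.sum(np.log(p))

def probs_with_small_complementary(K, y_bar, eps):
    """Probability vector with p[y_bar] = eps and the rest spread evenly."""
    p = np.full(K, (1.0 - eps) / (K - 1))
    p[y_bar] = eps
    return p

K, y_bar = 10, 3
eps_values = [1e-1, 1e-3, 1e-6]   # eps = 0.1 is the uniform prediction for K = 10
ure_values = [
    ure_ce_loss(probs_with_small_complementary(K, y_bar, eps), y_bar)
    for eps in eps_values
]
```

At the uniform prediction the loss is positive, and it becomes increasingly negative as `eps` shrinks, so a minimizer that only sees this single complementary label is rewarded for driving $p(\bar{y} \mid x)$ to zero.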

(12)

Proposed Framework

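Minimize Complementary 0/1

• Recall the goal: we minimize the 0-1 loss instead of $\ell$

• The unbiased estimator of $R_{01}$

$\bar{R}_{01}$: $\mathbb{E}_{\bar{y}}[\bar{\ell}_{01}(\bar{y}, g(x))] = \ell_{01}(y, g(x))$, up to the positive constant factor $K-1$ under the uniform $\bar{y}$ assumption

• We denote $\bar{\ell}_{01}$ as the complementary 0-1 loss:

$\bar{\ell}_{01}(\bar{y}, g(x)) = [\![\bar{y} = g(x)]\!]$

Surrogate Complementary Loss (SCL)

• Surrogate loss to optimize $\bar{\ell}_{01}$

• Unify previous work as surrogates of $\bar{\ell}_{01}$ [Yu+18; Kim+19]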

[Yu+18] Learning with biased complementary labels, ECCV’18.

[Kim+19] NLNL: Negative learning for noisy labels, ICCV’19.
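
The link between the complementary 0-1 loss and the ordinary 0-1 loss can be verified by brute force. The sketch below assumes uniform complementary labels and 0-based class indices, and checks every (ordinary label, prediction) pair for a small $K$; all names are illustrative.

```python
import numpy as np

def zero_one(y, pred):
    """Ordinary 0-1 loss: 1 if the prediction misses the true label."""
    return float(y != pred)

def comp_zero_one(y_bar, pred):
    """Complementary 0-1 loss: 1 if the prediction hits the complementary label."""
    return float(y_bar == pred)

K = 5
all_match = True
for y in range(K):
    for pred in range(K):
        wrong_classes = [k for k in range(K) if k != y]
        # Expectation over a uniformly drawn complementary label.
        expected_comp = np.mean([comp_zero_one(k, pred) for k in wrong_classes])
        # The two losses agree once the constant factor K - 1 is restored.
        if not np.isclose((K - 1) * expected_comp, zero_one(y, pred)):
            all_match = False
```

Because $K - 1$ is a positive constant, minimizing the expected complementary 0-1 loss and minimizing the ordinary 0-1 loss select the same classifier in this setting.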

(13)

Negative Risk Avoided

Unbiased Risk Estimator (URE)

URE loss $\bar{\ell}_{\mathrm{CE}}$ [Ish+19] from cross-entropy $\ell_{\mathrm{CE}}$,

$$\bar{\ell}_{\mathrm{CE}}(\bar{y}, g(x)) = \underbrace{(K-1)\log p(\bar{y} \mid x)}_{\text{negative loss term}} - \sum_{j=1}^{K} \log p(j \mid x)$$

can go negative.

Surrogate Complementary Loss (SCL)

another loss [Kim+19], a surrogate of $\bar{\ell}_{01}$,

$$\phi_{\mathrm{NL}}(\bar{y}, g(x)) = -\log(1 - p(\bar{y} \mid x))$$

remains non-negative.
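
The two losses can be compared side by side on hand-picked probability vectors. This is only a sketch with made-up numbers: one prediction that already avoids the complementary label and one that wrongly concentrates on it.

```python
import numpy as np

def ure_ce_loss(p, y_bar):
    """URE loss rewritten from cross-entropy."""
    K = p.shape[0]
    return (K - 1) * np.log(p[y_bar]) - np.sum(np.log(p))

def scl_nl_loss(p, y_bar):
    """Negative-learning surrogate: -log(1 - p(y_bar | x))."""
    return -np.log(1.0 - p[y_bar])

y_bar = 0
p_avoids = np.array([0.001, 0.333, 0.333, 0.333])  # almost no mass on y_bar
p_hits = np.array([0.97, 0.01, 0.01, 0.01])        # most mass on y_bar

ure_avoids, ure_hits = ure_ce_loss(p_avoids, y_bar), ure_ce_loss(p_hits, y_bar)
nl_avoids, nl_hits = scl_nl_loss(p_avoids, y_bar), scl_nl_loss(p_hits, y_bar)
```

Both losses penalize `p_hits` more than `p_avoids`, but only the URE loss drops below zero on `p_avoids`; the surrogate stays non-negative and approaches zero as the mass on the complementary label vanishes.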

(14)

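Illustrative Difference between URE and SCL

[Diagram: URE chain $R_{01} \to R_{\ell} \to \bar{R}_{\bar{\ell}} \to \hat{\bar{R}}_{\bar{\ell}}$; SCL chain $R_{01} \to \bar{R}_{\phi} \to \hat{\bar{R}}_{\phi}$; arrows labeled A (approximation) and E (estimation)]

URE: Ripple effect of errors

• Theoretical motivation [Ish+17]

• Estimation step (E) amplifies approximation error (A) in $\ell$

SCL: 'Directly' minimize complementary likelihood

• Non-negative loss $\phi$

• Practically prevents ripple effect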

(15)

Classification Accuracy

Methods

1. Unbiased risk estimator (URE) [Ish+19]

2. Non-negative correction methods on URE (NN) [Ish+19]

3. Surrogate complementary loss (SCL)

Table: URE and NN are based on $\bar{\ell}$ rewritten from cross-entropy loss, while SCL is based on exponential loss $\phi_{\mathrm{EXP}}(\bar{y}, g(x)) = \exp(p_{\bar{y}})$.

Data set + Model      URE    NN     SCL
MNIST + Linear        0.850  0.818  0.902
MNIST + MLP           0.801  0.867  0.925
CIFAR10 + ResNet      0.109  0.308  0.492
CIFAR10 + DenseNet    0.291  0.338  0.544

(16)

Gradient Analysis

Gradient Direction of URE

• Very diverged directions on each $\bar{y}$ to maintain unbiasedness

• Low correlation to the target $\bar{\ell}_{01}$

[Figure: Illustration of URE, showing $\nabla \ell(y, g(x))$ and $\nabla \bar{\ell}(\bar{y}, g(x))$]

Gradient Direction of SCL

• Targets the minimum likelihood objective

• High correlation to the target $\bar{\ell}_{01}$

(17)

Gradient Estimation Error

Bias-Variance Decomposition

$$\mathrm{MSE} = \mathbb{E}(f - c)^2 = \underbrace{\mathbb{E}(f - h)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}(h - c)^2}_{\text{Variance}}$$

Gradient Estimation

1. Ordinary gradient $f = \nabla \ell(y, g(x))$

2. Complementary gradient $c = \nabla \bar{\ell}(\bar{y}, g(x))$

3. Expected complementary gradient $h$
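
The decomposition can be reproduced on a toy case. The sketch below makes several assumptions that are not from the talk: gradients are taken with respect to the softmax logits (closed forms derived by hand), squared Euclidean norms stand in for the scalar squares on the slide, the complementary label is uniform over the wrong classes, and the model output `p` is the uniform prediction for $K = 10$.

```python
import numpy as np

def one_hot(k, K):
    e = np.zeros(K)
    e[k] = 1.0
    return e

def grad_ce(p, y):
    """Ordinary gradient f of -log p_y with respect to the logits."""
    return p - one_hot(y, len(p))

def grad_ure(p, y_bar):
    """Gradient of (K-1) log p_ybar - sum_k log p_k with respect to the logits."""
    K = len(p)
    return (K - 1) * one_hot(y_bar, K) + p - 1.0

def grad_nl(p, y_bar):
    """Gradient of -log(1 - p_ybar) with respect to the logits."""
    return p[y_bar] * (one_hot(y_bar, len(p)) - p) / (1.0 - p[y_bar])

def decompose(f, grads):
    """Return (MSE, Bias^2, Variance) of complementary gradients around f."""
    h = grads.mean(axis=0)                       # expected complementary gradient
    mse = np.mean(np.sum((f - grads) ** 2, axis=1))
    bias2 = np.sum((f - h) ** 2)
    var = np.mean(np.sum((h - grads) ** 2, axis=1))
    return mse, bias2, var

K, y = 10, 0
p = np.full(K, 1.0 / K)                          # uniform prediction
f = grad_ce(p, y)
wrong_classes = [k for k in range(K) if k != y]
mse_ure, bias2_ure, var_ure = decompose(f, np.array([grad_ure(p, k) for k in wrong_classes]))
mse_nl, bias2_nl, var_nl = decompose(f, np.array([grad_nl(p, k) for k in wrong_classes]))
```

In this toy case the URE gradient has zero bias but a large variance, while the negative-learning gradient has non-zero bias and a much smaller variance. How the totals compare on real models is an empirical question that this sketch does not settle.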

(18)

Bias-Variance Tradeoff

[Figure panels: (a) MSE, (b) Bias², (c) Variance]

Findings

• SCL reduces variance by introducing small bias (towards y)

        Bias    Variance  MSE
URE     0       Big       Big
SCL     Small   Small     Small

(19)

Conclusion

Explain Overfitting of URE

• Unbiased methods only do well in expectation

• A single fixed complementary label causes overfitting

Surrogate Complementary Loss (SCL)

• Minimum likelihood approach

• Avoids negative risk problem

Experiment Results

• SCL significantly outperforms other methods

• Introduces small bias for lower gradient variance

(20)

References

[DNS15] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. “Convex formulation for learning from positive and unlabeled data”. In: International Conference on Machine Learning. 2015, pp. 1386–1394.

[EN08] Charles Elkan and Keith Noto. “Learning classifiers from only positive and unlabeled data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 213–220.

[Ish+17] Takashi Ishida et al. “Learning from complementary labels”. In: Advances in Neural Information Processing Systems. 2017, pp. 5639–5649.

[Ish+19] Takashi Ishida et al. “Complementary-Label Learning for Arbitrary Losses and Models”. In: International Conference on Machine Learning. 2019, pp. 2971–2980.

[Kim+19] Youngdong Kim et al. “NLNL: Negative learning for noisy labels”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 101–110.

[Nat+13] Nagarajan Natarajan et al. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.

[Pat+17] Giorgio Patrini et al. “Making deep neural networks robust to label noise: A loss correction approach”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1944–1952.

[Yu+18] Xiyu Yu et al. “Learning with biased complementary labels”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
