Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels
Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama
ICML 2020; work done during Chou's internship at RIKEN AIP, Japan.
Chou's resulting M.S. thesis won the 2020 TAAI thesis award.
October 8, 2021, AI Forum, Taipei, Taiwan
Introduction
Supervised Learning
(Slide Modified from My ML Foundations MOOC)
[figure: toy training examples plotted by Size and Mass]
unknown target function f : X → Y
training examples D : (x1, y1), · · · , (xN, yN)
learning algorithm A
final hypothesis g ≈ f
hypothesis set H
supervised learning: every input vector xn comes with its (possibly expensive) label yn
Introduction
Weakly-supervised: Learning without True Labels yn
(a) positive-unlabeled learning [EN08]  (b) learning with complementary labels [Ish+17]  (c) learning with noisy labels [Nat+13]
• positive-unlabeled: some of the true yn = +1 are revealed
• complementary: a 'not-this-label' ȳn instead of the true yn
• noisy: a noisy label y′n instead of the true yn
weakly-supervised: a realistic and hot research direction to reduce the labeling burden
[EN08] Learning classifiers from only positive and unlabeled data, KDD’08.
[Ish+17] Learning from complementary labels, NeurIPS’17.
Introduction
Motivation
popular weakly-supervised models [DNS15; Ish+19; Pat+17]
• derive Unbiased Risk Estimators (URE) as a new loss
• theoretically, nice properties (unbiased, consistent, etc.) [Ish+17]
• practically, sometimes bad performance (overfitting)
our contributions on Learning with Complementary Labels (LCL):
• analysis: identify a weakness of the URE framework
• algorithm: propose an improved framework
• experiment: demonstrate stronger performance
next: introduction to LCL
[DNS15] Convex formulation for learning from positive and unlabeled data, ICML’15.
[Ish+19] Complementary-Label Learning for Arbitrary Losses and Models, ICML’19.
[Pat+17] Making deep neural networks robust to label noise: A loss correction approach, CVPR’17.
Introduction
Motivation behind Learning with Complementary Labels
complementary label ȳn instead of the true label yn
Figure 1 of [Yu+18]
complementary label: easier/cheaper to obtain for some applications
Introduction
Fruit Labeling Task (Image from AICup in 2020)
hard: true label
• orange?
• mango?
• cherry
• banana
easy: complementary label
• orange
• mango
• cherry
• banana ✗
complementary: less labeling cost/expertise required
Introduction
Comparison
Ordinary (Supervised) Learning
training: {(xn = [fruit image], yn = mango)} → classifier
Complementary Learning
training: {(xn = [fruit image], ȳn = banana)} → classifier
testing goal: classifier([fruit image]) → cherry
ordinary versus complementary: same goal via different training data
Introduction
Learning with Complementary Labels Setup
Given
N examples (input xn, complementary label ȳn) ∈ X × {1, 2, · · · , K} in a data set D, such that ȳn ≠ yn for the hidden ordinary label
yn ∈ {1, 2, · · · , K}.
Goal
a multi-class classifier g(x) that closely predicts (in 0/1 error) the ordinary label y associated with unseen inputs x
LCL model design: connecting complementary & ordinary
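To make the setup concrete, here is a minimal sketch (ours, not from the talk) of how complementary labels are commonly simulated from ordinary labels under the uniform assumption used later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_complementary(y, K):
    """Draw one complementary label per example, uniformly among the
    K - 1 classes that are NOT the ordinary label."""
    y = np.asarray(y)
    offsets = rng.integers(1, K, size=y.shape)  # each offset in {1, ..., K-1}
    return (y + offsets) % K                    # shifted label never equals y

y = rng.integers(0, 10, size=5)                 # hidden ordinary labels, K = 10
print(y)
print(sample_complementary(y, K=10))            # guaranteed != y elementwise
```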
Introduction
Unbiased Risk Estimation for LCL
Ordinary Learning
• empirical risk minimization (ERM) on training data
risk: E(x,y)[ℓ(y, g(x))]
empirical risk: E(xn,yn)∈D[ℓ(yn, g(xn))]
• loss ℓ: usually a surrogate of the 0/1 error
LCL [Ish+19]
• rewrite the loss ℓ into a complementary loss ℓ̄, such that
unbiased risk estimator: E(x,y)[ℓ(y, g(x))] = E(x,ȳ)[ℓ̄(ȳ, g(x))]
• LCL by minimizing the URE
URE: the pioneering models for LCL
Introduction
Example of URE
Cross Entropy Loss
for g(x) = argmax_{k ∈ {1,2,...,K}} p(k | x),
• ℓCE: derived by maximum likelihood as a surrogate of the 0/1 risk:
R(g; ℓCE) = E(x,y)[ ℓCE(y, g(x)) ],  with  ℓCE(y, g(x)) = − log p(y | x)
Complementary Learning [Ish+19]
URE: R(g; ℓCE) = E(x,ȳ)[ ℓ̄CE(ȳ, g(x)) ],  with
ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
(first term: negative), under the uniform ȳ assumption
ERM with URE: min over p of R, with the expectation E taken over D
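As a concrete illustration, a minimal PyTorch sketch (function name and toy tensors are ours) of this rewritten cross-entropy loss, computed from network logits under the uniform ȳ assumption:

```python
import torch
import torch.nn.functional as F

def ure_ce_loss(logits, comp_labels, K):
    """Empirical URE of the cross-entropy risk: batch mean of
    (K - 1) * log p(ybar | x) - sum_k log p(k | x)."""
    log_p = F.log_softmax(logits, dim=1)                             # log p(k | x)
    log_p_bar = log_p.gather(1, comp_labels.view(-1, 1)).squeeze(1)  # log p(ybar | x)
    return ((K - 1) * log_p_bar - log_p.sum(dim=1)).mean()

# toy batch: 4 examples, K = 10 classes
logits = torch.randn(4, 10, requires_grad=True)
comp_labels = torch.randint(0, 10, (4,))
loss = ure_ce_loss(logits, comp_labels, K=10)
print(loss)        # note: this empirical estimate can be negative
loss.backward()    # usable directly as a training loss
```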
Problems of URE
URE overfits given only a single complementary label per example
ℓ(y, g(x)) = − log p(y | x)
ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
ordinary risk and URE behave very differently:
• ℓ > 0 → the ordinary risk is non-negative
• small p(ȳ | x) (which is common) → possibly very negative ℓ̄
the empirical URE can be negative: we observe only some, but not all, ȳ
• a negative empirical URE drags minimization towards overfitting
practical remedy [Ish+19]:
NN-URE: constrain the empirical URE to be non-negative
how can we avoid negative empirical URE in the first place?
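A crude code-level illustration of the NN remedy just mentioned (this clamps the whole mini-batch estimate; the actual correction in [Ish+19] is finer-grained, applied to per-class partial risks, so treat it only as a sketch):

```python
import torch

def nn_ure_loss(ure_batch_loss):
    """Illustration only: keep the empirical URE from going below zero,
    so minimization cannot be driven by a negative risk estimate."""
    return torch.clamp(ure_batch_loss, min=0.0)

# e.g. with the earlier sketch:
# loss = nn_ure_loss(ure_ce_loss(logits, comp_labels, K=10))
```

The proposed framework below takes a different route: avoid the negative-valued rewrite altogether.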
Proposed Framework
Proposed Framework
Minimize Complementary 0/1
• Recall the goal: we care about the 0/1 loss, not any particular ℓ
• The unbiased estimator of R01: under the uniform ȳ assumption,
(K − 1) · Eȳ[ℓ̄01(ȳ, g(x))] = ℓ01(y, g(x)),
so minimizing the complementary 0/1 risk also minimizes the ordinary R01
• We denote ℓ̄01 as the complementary 0/1 loss:
ℓ̄01(ȳ, g(x)) = ⟦ȳ = g(x)⟧
Surrogate Complementary Loss (SCL)
• Surrogate losses to optimize ℓ̄01
• Unify previous work [Yu+18; Kim+19] as surrogates of ℓ̄01
[Yu+18] Learning with biased complementary labels, ECCV’18.
[Kim+19] Nlnl: Negative learning for noisy labels, ICCV’19.
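For concreteness, a tiny sketch (ours) of the complementary 0/1 loss itself; being a step function it has no useful gradient, which is exactly why SCL replaces it with a surrogate:

```python
import torch

def comp_zero_one_loss(logits, comp_labels):
    """Complementary 0/1 loss: 1 iff the predicted class equals the complementary label."""
    pred = logits.argmax(dim=1)
    return (pred == comp_labels).float().mean()   # not differentiable w.r.t. logits
```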
Proposed Framework
Negative Risk Avoided
Unbiased Risk Estimator (URE)
URE loss ℓ̄CE [Ish+19], rewritten from the cross-entropy ℓCE,
ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)
(the first term is the negative loss term),
can go negative.
Surrogate Complementary Loss (SCL)
Another loss [Kim+19], a surrogate of ℓ̄01,
φNL(ȳ, g(x)) = − log(1 − p(ȳ | x)),
remains non-negative.
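A minimal sketch of this surrogate on top of softmax outputs (the function name and the small epsilon for numerical stability are ours):

```python
import torch
import torch.nn.functional as F

def scl_nl_loss(logits, comp_labels):
    """Negative-learning surrogate of the complementary 0/1 loss:
    -log(1 - p(ybar | x)), non-negative by construction."""
    p_bar = F.softmax(logits, dim=1).gather(1, comp_labels.view(-1, 1)).squeeze(1)
    return -torch.log(1.0 - p_bar + 1e-12).mean()   # eps guards against p_bar == 1
```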
Proposed Framework
Illustrative Difference between URE and SCL
[diagram: the chain of risks from R01 to the empirical objective for URE vs. SCL, with approximation (A) and estimation (E) steps marked]
URE: Ripple effect of errors
• Theoretical motivation [Ish+17]
• Estimation step (E) amplifies the approximation error (A) in ℓ̄
SCL: 'Directly' minimize the complementary likelihood
• Non-negative loss φ
• Practically prevents ripple effect
Proposed Framework
Classification Accuracy
Methods
1 Unbiased risk estimator (URE) [Ish+19]
2 Non-negative correction methods on URE (NN) [Ish+19]
3 Surrogate complementary loss (SCL)
Table: URE and NN are based on ℓ̄ rewritten from the cross-entropy loss, while SCL is based on the exponential loss φEXP(ȳ, g(x)) = exp(p(ȳ | x)).
Data set + Model     URE    NN     SCL
MNIST + Linear       0.850  0.818  0.902
MNIST + MLP          0.801  0.867  0.925
CIFAR10 + ResNet     0.109  0.308  0.492
CIFAR10 + DenseNet   0.291  0.338  0.544
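For reference, a sketch (our naming) of the exponential surrogate used for SCL in the table; minimizing it pushes p(ȳ | x) towards 0, which indirectly raises the probabilities of the remaining classes:

```python
import torch
import torch.nn.functional as F

def scl_exp_loss(logits, comp_labels):
    """Exponential surrogate of the complementary 0/1 loss: exp(p(ybar | x))."""
    p_bar = F.softmax(logits, dim=1).gather(1, comp_labels.view(-1, 1)).squeeze(1)
    return torch.exp(p_bar).mean()
```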
Gradient Analysis
Gradient Analysis
Gradient Direction of URE
• Very divergent directions across the possible ȳ, to maintain unbiasedness
• Low correlation with the target ℓ̄01
[figure: illustration of URE gradient directions ∇ℓ̄(ȳ, g(x)) versus the ordinary gradient ∇ℓ(y, g(x))]
Gradient Direction of SCL
• Targets the minimum-likelihood objective directly
• High correlation with the target ℓ̄01
Gradient Analysis
Gradient Estimation Error
Bias-Variance Decomposition
MSE = E(f − c)² = E(f − h)² + E(h − c)²
(Bias² = E(f − h)², Variance = E(h − c)²)
Gradient Estimation
1 Ordinary gradient f = ∇ℓ(y, g(x))
2 Complementary gradient c = ∇ℓ̄(ȳ, g(x))
3 Expected complementary gradient h = Eȳ[c] (compared numerically in the sketch below)
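A small numerical sketch of this decomposition for a toy linear model (the model, seed, and all names here are our own illustration): under the uniform ȳ assumption the expected URE gradient matches the ordinary gradient (zero bias), while the SCL gradient is biased but typically less variable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, d = 5, 8
W = torch.randn(K, d, requires_grad=True)     # toy linear model: logits = W x
x = torch.randn(d)
y = 2                                         # hidden ordinary label

def grad_of(loss):
    """Gradient of a scalar loss w.r.t. the parameters W, flattened."""
    (g,) = torch.autograd.grad(loss, W)
    return g.flatten()

f = grad_of(-F.log_softmax(W @ x, dim=0)[y])  # ordinary gradient f (cross-entropy)

# complementary gradients c for every possible ybar != y (uniform assumption)
ure_grads, scl_grads = [], []
for ybar in range(K):
    if ybar == y:
        continue
    log_p = F.log_softmax(W @ x, dim=0)
    ure_grads.append(grad_of((K - 1) * log_p[ybar] - log_p.sum()))   # URE-rewritten CE
    log_p = F.log_softmax(W @ x, dim=0)                              # rebuild freed graph
    scl_grads.append(grad_of(-torch.log(1 - log_p[ybar].exp())))     # SCL-NL surrogate

def bias2_and_var(grads, target):
    G = torch.stack(grads)                  # one row per possible ybar
    h = G.mean(dim=0)                       # expected complementary gradient
    bias2 = ((h - target) ** 2).sum()       # ||h - f||^2
    var = ((G - h) ** 2).sum(dim=1).mean()  # E ||c - h||^2
    return bias2.item(), var.item()

print("URE bias^2, variance:", bias2_and_var(ure_grads, f))  # bias^2 ~ 0 by construction
print("SCL bias^2, variance:", bias2_and_var(scl_grads, f))  # small bias, typically smaller variance
```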
Gradient Analysis
Bias-Variance Tradeoff
[figure: three panels, (a) MSE, (b) Bias², (c) Variance]
Findings
• SCL reduces variance by introducing a small bias (towards the observed ȳ)

Method  Bias   Variance  MSE
URE     0      Big       Big
SCL     Small  Small     Small
Conclusion
Conclusion
Explain Overfitting of URE
• Unbiased methods only do well in expectation
• A single, fixed complementary label per example causes overfitting
Surrogate Complementary Loss (SCL)
• Minimum-likelihood approach on the complementary label
• Avoids the negative-risk problem
Experiment Results
• SCL significantly outperforms the other methods
• Introduces a small bias in exchange for lower gradient variance
Conclusion
References
[DNS15] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. “Convex formulation for learning from positive and unlabeled data”. In: International Conference on Machine Learning. 2015, pp. 1386–1394.
[EN08] Charles Elkan and Keith Noto. “Learning classifiers from only positive and unlabeled data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 213–220.
[Ish+17] Takashi Ishida et al. “Learning from complementary labels”. In: Advances in Neural Information Processing Systems. 2017, pp. 5639–5649.
[Ish+19] Takashi Ishida et al. “Complementary-Label Learning for Arbitrary Losses and Models”. In: International Conference on Machine Learning. 2019, pp. 2971–2980.
[Kim+19] Youngdong Kim et al. “NLNL: Negative learning for noisy labels”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 101–110.
[Nat+13] Nagarajan Natarajan et al. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.
[Pat+17] Giorgio Patrini et al. “Making deep neural networks robust to label noise: A loss correction approach”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1944–1952.
[Yu+18] Xiyu Yu et al. “Learning with biased complementary labels”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.