## Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels

Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama

ICML 2020 work, done during Chou's internship at RIKEN AIP, Japan;

Chou's resulting M.S. thesis won the 2020 thesis award of TAAI

October 8, 2021, AI Forum, Taipei, Taiwan

Introduction

## Supervised Learning (Slide Modified from My ML Foundations MOOC)

(Figure: training examples plotted by size and mass)
unknown target function f : X → Y

training examples D : (x_1, y_1), · · · , (x_N, y_N)

learning algorithm A

hypothesis set H

final hypothesis g ≈ f

supervised learning:

every input vector x_n with its (possibly expensive) label y_n


## Weakly-supervised: Learning without True Labels y_n

(a) positive-unlabeled learning [EN08] (b) learning with complementary labels [Ish+17] (c) learning with noisy labels [Nat+13]

• positive-unlabeled: some of the true y_n = +1 revealed

• complementary: a 'not-this-label' ȳ_n instead of the true y_n

• noisy: a noisy label y_n′ instead of the true y_n

weakly-supervised: a **realistic** and **hot**
research direction to reduce the labeling burden

[EN08] Learning classifiers from only positive and unlabeled data, KDD’08.

[Ish+17] Learning from complementary labels, NeurIPS’17.


## Motivation

popular weakly-supervised models [DNS15; Ish+19; Pat+17]

• derive **Unbiased Risk Estimators (URE)** as a new loss

• theoretically, nice properties (unbiased, consistent, etc.) [Ish+17]

• practically, **sometimes bad performance** (overfitting)

our contributions, on Learning with Complementary Labels (LCL):

• analysis: **identify the weakness** of the URE framework

• algorithm: propose an **improved framework**

• experiment: demonstrate **stronger performance**

next: introduction to LCL

[DNS15] Convex formulation for learning from positive and unlabeled data, ICML’15.

[Ish+19] Complementary-Label Learning for Arbitrary Losses and Models, ICML’19.

[Pat+17] Making deep neural networks robust to label noise: A loss correction approach, CVPR’17.


## Motivation behind Learning with Complementary Labels

complementary label ȳ_n instead of the true label y_n

Figure 1 of [Yu+18]

complementary label: **easier/cheaper** to
obtain for some applications


## Fruit Labeling Task (Image from AICup in 2020)

hard: true label

• orange **?**

• mango **?**
• mango**?**

• cherry

• banana

easy: complementary label

• orange

• mango

• cherry

• banana ✗

complementary: **less labeling cost/expertise** required


## Comparison

Ordinary (Supervised) Learning

training: {(x_n = [fruit image], y_n = mango)} → **classifier**

Complementary Learning

training: {(x_n = [fruit image], ȳ_n = banana)} → **classifier**

testing goal: **classifier**([fruit image]) → cherry

ordinary versus complementary: same goal via
**different training data**


## Learning with Complementary Labels Setup

Given

N examples (input x_n, complementary label ȳ_n) ∈ X × {1, 2, · · · , K} in a
data set D, such that ȳ_n ≠ y_n for some hidden ordinary label
y_n ∈ {1, 2, · · · , K}.

Goal

a multi-class classifier g(x) that **closely predicts** (w.r.t. the 0/1 error) the
ordinary label y associated with **unseen** inputs x

LCL model design: connecting
**complementary & ordinary** labels
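To make the setup concrete, here is a minimal Python sketch of how complementary labels could be simulated under the uniform-ȳ assumption used later in the talk; the function name `sample_complementary` is illustrative, not from the paper.

```python
import numpy as np

def sample_complementary(y, K, rng):
    """Draw a complementary label uniformly from the K-1 classes other
    than the hidden ordinary label y (the uniform-ybar assumption)."""
    candidates = [k for k in range(K) if k != y]
    return rng.choice(candidates)

# toy usage: K = 4 fruit classes, hidden ordinary label is class 2
rng = np.random.default_rng(0)
ybar = [sample_complementary(2, 4, rng) for _ in range(5)]
print(ybar)  # complementary labels, never equal to 2
```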


## Unbiased Risk Estimation for LCL

Ordinary Learning

• empirical risk minimization (ERM) on training data

**risk:** E_{(x,y)}[ℓ(y, g(x))]    **empirical risk:** E_{(x_n,y_n)∈D}[ℓ(y_n, g(x_n))]

• loss ℓ: usually a **surrogate** of the 0/1 error

LCL [Ish+19]

• rewrite the loss ℓ into a complementary loss ℓ̄, such that

**unbiased risk estimator:** E_{(x,ȳ)}[ℓ̄(ȳ, g(x))] = E_{(x,y)}[ℓ(y, g(x))]

• LCL by minimizing the **URE**

URE: the **pioneering models** for LCL


## Example of URE

Cross Entropy Loss

for g(x) = argmax_{k∈{1,2,...,K}} p(k | x),

• ℓ_CE: derived by maximum likelihood as a surrogate of the 0/1 error

**risk:** R(g; ℓ_CE) = E_{(x,y)}[− log p(y | x)], where the bracketed term is ℓ_CE(y, g(x))

Complementary Learning [Ish+19]

**URE:** R(g; ℓ_CE) = E_{(x,ȳ)}[ℓ̄_CE(ȳ, g(x))] under the uniform-ȳ assumption, where

ℓ̄_CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)

(the first term is negative)

ERM with URE: min_p R, with the expectation E taken over D
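As a sanity check of the rewriting above, the following sketch (function names are mine) evaluates ℓ_CE and ℓ̄_CE on one softmax output: averaging ℓ̄_CE over all possible ȳ ≠ y recovers the ordinary loss, yet a single observed ȳ with small p(ȳ | x) already yields a negative value.

```python
import numpy as np

def ce_loss(p, y):
    """Ordinary cross-entropy loss: -log p(y | x)."""
    return -np.log(p[y])

def ure_ce_loss(p, ybar):
    """URE loss rewritten from cross-entropy [Ish+19]:
    (K - 1) * log p(ybar | x) - sum_k log p(k | x)."""
    K = len(p)
    return (K - 1) * np.log(p[ybar]) - np.log(p).sum()

p = np.array([0.9, 0.001, 0.05, 0.049])  # softmax output, K = 4
y = 0                                     # hidden ordinary label

# averaging over all ybar != y recovers the ordinary loss (unbiasedness)
avg = np.mean([ure_ce_loss(p, k) for k in range(4) if k != y])
print(ce_loss(p, y), avg)        # both ~0.105

# a single observed ybar with tiny p(ybar | x) gives a very negative loss
print(ure_ce_loss(p, 1))         # ~ -7.7
```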

Problems of URE

## URE overfits on single label

ℓ_CE(y, g(x)) = − log p(y | x)

ℓ̄_CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)

ordinary risk and URE are very different:

• ℓ_CE > 0 → ordinary risk non-negative

• small p(ȳ | x) (often) → possibly very negative ℓ̄_CE

**empirical** URE can be negative: we only observe **some but not all** ȳ

• a negative empirical URE **drags minimization** towards overfitting

practical remedy [Ish+19]:

NN-URE: constrain the empirical URE to be non-negative (sketched below); but how can we avoid a negative empirical URE in the first place?
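For reference, here is one crude way to realize the non-negativity idea, clamping the mini-batch estimate at zero; the actual NN correction of [Ish+19] is finer-grained (it clamps per-class partial risks), so treat this only as an approximation of the idea, with helper names of my own choosing.

```python
import numpy as np

def ure_ce_loss(p, ybar):
    # rewritten URE loss: (K - 1) * log p(ybar | x) - sum_k log p(k | x)
    return (len(p) - 1) * np.log(p[ybar]) - np.log(p).sum()

def nn_empirical_ure(p_batch, ybar_batch):
    """Clamp the empirical URE of a mini-batch at zero so that training
    never chases a negative objective (simplified stand-in for NN-URE)."""
    r_hat = np.mean([ure_ce_loss(p, yb) for p, yb in zip(p_batch, ybar_batch)])
    return max(0.0, r_hat)

# a batch where the observed ybar's make the plain empirical URE negative
p_batch = np.array([[0.9, 0.001, 0.05, 0.049],
                    [0.85, 0.002, 0.1, 0.048]])
ybar_batch = [1, 1]
print(nn_empirical_ure(p_batch, ybar_batch))  # 0.0 instead of ~ -7.3
```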

Proposed Framework

## Proposed Framework

Minimize the Complementary 0/1 Loss

• Recall the goal: we minimize the 0-1 loss instead of ℓ

• The unbiased estimator of R_01: the complementary risk R̄_01, since E_ȳ[ℓ̄_01(ȳ, g(x))] recovers ℓ_01(y, g(x)) (up to a constant factor under the uniform-ȳ assumption)

• We denote by ℓ̄_01 the complementary 0-1 loss: ℓ̄_01(ȳ, g(x)) = ⟦ȳ = g(x)⟧ (sketched after the references below)

Surrogate Complementary Loss (SCL)

• Surrogate loss φ to optimize ℓ̄_01

• Unify previous work [Yu+18; Kim+19] as surrogates of ℓ̄_01

[Yu+18] Learning with biased complementary labels, ECCV’18.

[Kim+19] Nlnl: Negative learning for noisy labels, ICCV’19.
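A tiny sketch of the complementary 0/1 loss and its relation to the ordinary 0/1 loss under the uniform-ȳ assumption (helper names are mine): averaging ℓ̄_01 over the possible ȳ reproduces ℓ_01 up to the factor 1/(K − 1).

```python
import numpy as np

def zero_one(y, pred):
    """Ordinary 0/1 loss: 1 if the prediction misses the true label."""
    return float(pred != y)

def comp_zero_one(ybar, pred):
    """Complementary 0/1 loss: 1 if the prediction hits the complementary
    label, i.e. the class the example is known NOT to be."""
    return float(pred == ybar)

K, y, pred = 4, 0, 2
avg = np.mean([comp_zero_one(ybar, pred) for ybar in range(K) if ybar != y])
print(zero_one(y, pred), (K - 1) * avg)   # both 1.0: same target up to scaling
```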


## Negative Risk Avoided

Unbiased Risk Estimator (URE)

the URE loss ℓ̄_CE [Ish+19], rewritten from the cross-entropy ℓ_CE,

ℓ̄_CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)

can go negative (the first term is a negative loss term).

Surrogate Complementary Loss (SCL)

another loss [Kim+19], a surrogate of ℓ̄_01,

φ_NL(ȳ, g(x)) = − log(1 − p(ȳ | x))

remains non-negative.
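A minimal sketch of the SCL-NL surrogate above (helper name is mine); unlike the URE sketch earlier, the value stays non-negative no matter which ȳ is observed.

```python
import numpy as np

def scl_nl_loss(p, ybar):
    """Negative-learning style surrogate of the complementary 0/1 loss
    [Kim+19]: -log(1 - p(ybar | x)), non-negative since 1 - p <= 1."""
    return -np.log(1.0 - p[ybar])

p = np.array([0.9, 0.001, 0.05, 0.049])
print(scl_nl_loss(p, 1))  # ~0.001: 'not class 1' is already satisfied
print(scl_nl_loss(p, 0))  # ~2.303: the model still puts mass on ybar
```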


## Illustrative Difference between URE and SCL

(Figure: for URE, the ordinary risk R_01, the rewritten risk R_ℓ̄, and its empirical estimate R̂_ℓ̄; for SCL, the complementary 0/1 risk R̄_01, the surrogate risk R_φ, and its empirical estimate R̂_φ; A marks approximation steps and E estimation steps.)

URE: ripple effect of errors

• Theoretical motivation [Ish+17]

• The estimation step (E) amplifies the approximation error (A) in ℓ̄

SCL: 'directly' minimize the complementary likelihood

• Non-negative loss φ

• Practically prevents the ripple effect


## Classification Accuracy

Methods

1 Unbiased risk estimator (URE) [Ish+19]

2 Non-negative correction methods on URE (NN) [Ish+19]

3 Surrogate complementary loss (SCL)

Table: URE and NN are based on ℓ̄ rewritten from the cross-entropy loss, while
SCL is based on the exponential loss φ_EXP(ȳ, g(x)) = exp(p_ȳ).

| Data set + Model | URE | NN | SCL |
|---|---|---|---|
| MNIST + Linear | 0.850 | 0.818 | **0.902** |
| MNIST + MLP | 0.801 | 0.867 | **0.925** |
| CIFAR10 + ResNet | 0.109 | 0.308 | **0.492** |
| CIFAR10 + DenseNet | 0.291 | 0.338 | **0.544** |

Gradient Analysis

## Gradient Analysis

Gradient Direction of URE

• Highly divergent directions across different ȳ, needed to maintain unbiasedness

• Low correlation with the target ℓ_01

(Figure: illustration of the URE gradients ∇ℓ(y, g(x)) and ∇ℓ̄(ȳ, g(x)))

Gradient Direction of SCL

• Directly targets the minimum-(complementary-)likelihood objective

• High correlation with the target ℓ_01


## Gradient Estimation Error

Bias-Variance Decomposition

MSE = E[(f − c)²] = E[(f − h)²] + E[(h − c)²], where E[(f − h)²] is the Bias² and E[(h − c)²] is the Variance

Gradient Estimation

1 Ordinary gradient **f** = ∇ℓ(y, g(x))

2 Complementary gradient **c** = ∇ℓ̄(ȳ, g(x))

3 Expected complementary gradient **h** = E_ȳ[**c**]
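The toy sketch below (treating the probability vector itself as the parameter for simplicity, with the uniform-ȳ expectation computed exactly; helper names are mine) checks the decomposition numerically for the URE-CE gradients: the bias term is essentially zero, so the whole MSE is variance.

```python
import numpy as np

def grad_ce(p, y):
    """Gradient of -log p[y] w.r.t. the probability vector p."""
    g = np.zeros_like(p)
    g[y] = -1.0 / p[y]
    return g

def grad_ure_ce(p, ybar):
    """Gradient of (K-1) log p[ybar] - sum_k log p[k] w.r.t. p."""
    g = -1.0 / p
    g[ybar] += (len(p) - 1) / p[ybar]
    return g

p = np.array([0.9, 0.001, 0.05, 0.049])
y = 0
f = grad_ce(p, y)                                          # ordinary gradient
cs = [grad_ure_ce(p, yb) for yb in range(4) if yb != y]    # one c per possible ybar
h = np.mean(cs, axis=0)                                    # expected complementary gradient

mse   = np.mean([np.sum((f - c) ** 2) for c in cs])
bias2 = np.sum((f - h) ** 2)
var   = np.mean([np.sum((h - c) ** 2) for c in cs])
print(np.isclose(mse, bias2 + var))  # True: MSE = Bias^2 + Variance
print(bias2, var)                    # bias ~0 (unbiased), variance is huge
```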


## Bias-Variance Tradeoff

(Figure: panels (a) MSE, (b) Bias², (c) Variance)

Findings

• SCL reduces variance by introducing a small bias (towards ȳ)

| | Bias | Variance | MSE |
|---|---|---|---|
| URE | 0 | Big | Big |
| SCL | Small | Small | Small |

Conclusion

## Conclusion

Explaining the Overfitting of URE

• Unbiased methods only do well in expectation

• A single fixed complementary label causes overfitting

Surrogate Complementary Loss (SCL)

• A minimum-(complementary-)likelihood approach

• Avoids the negative-risk problem

Experiment Results

• SCL significantly outperforms the other methods

• Introduces a small bias in exchange for lower gradient variance


## References

Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. “Convex formulation for learning from positive and unlabeled data”. In: International Conference on Machine Learning. 2015, pp. 1386–1394.

Charles Elkan and Keith Noto. “Learning classifiers from only positive and unlabeled data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 213–220.

Takashi Ishida et al. “Learning from complementary labels”. In: Advances in Neural Information Processing Systems. 2017, pp. 5639–5649.

Takashi Ishida et al. “Complementary-Label Learning for Arbitrary Losses and Models”. In: International Conference on Machine Learning. 2019, pp. 2971–2980.

Youngdong Kim et al. “NLNL: Negative learning for noisy labels”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 101–110.

Nagarajan Natarajan et al. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.

Giorgio Patrini et al. “Making deep neural networks robust to label noise: A loss correction approach”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1944–1952.

Xiyu Yu et al. “Learning with biased complementary labels”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.