
(1)

Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels

Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama

ICML 2020 work done during Chou’s internship at RIKEN AIP, Japan;

the resulting M.S. thesis of Chou won the 2020 TAAI thesis award

October 8, 2021, AI Forum, Taipei, Taiwan

(2)

Introduction

Supervised Learning

(Slide Modified from My ML Foundations MOOC)

[Figure: illustrative data plot with axes Size and Mass]

unknown target function f : X → Y

training examples D : (x1, y1), · · · , (xN, yN)

learning algorithm A

final hypothesis g ≈ f

hypothesis set H

supervised learning:

every input vector xn comes with its (possibly expensive) label yn

(3)

Introduction

Weakly-supervised: Learning without True Labels yn

(a) positive-unlabeled learning [EN08] (b) learning with complementary labels [Ish+17] (c) learning with noisy labels [Nat+13]

• positive-unlabeled: some of the true yn = +1 revealed

• complementary: ‘not label’ ȳn instead of true yn

• noisy: noisy label y′n instead of true yn

weakly-supervised: a realistic and hot research direction to reduce labeling burden

[EN08] Learning classifiers from only positive and unlabeled data, KDD’08.

[Ish+17] Learning from complementary labels, NeurIPS’17.

(4)

Introduction

Motivation

popular weakly-supervised models [DNS15; Ish+19; Pat+17]

• derive Unbiased Risk Estimators (URE) as a new loss

• theoretically, nice properties (unbiased, consistent, etc.) [Ish+17]

• practically, sometimes bad performance (overfitting)

our contributions: on Learning with Complementary Labels (LCL)

• analysis: identify weakness of the URE framework

• algorithm: propose an improved framework

• experiment: demonstrate stronger performance

next: introduction to LCL

[DNS15] Convex formulation for learning from positive and unlabeled data, ICML’15.

[Ish+19] Complementary-Label Learning for Arbitrary Losses and Models, ICML’19.

[Pat+17] Making deep neural networks robust to label noise: A loss correction approach, CVPR’17.

(5)

Introduction

Motivation behind Learning with Complementary Label

complementary label ȳn instead of true yn

Figure 1 of [Yu+18]

complementary label: easier/cheaper to obtain for some applications

(6)

Introduction

Fruit Labeling Task (Image from AICup in 2020)

hard: true label

• orange?

• mango?

• cherry

• banana

easy: complementary label

• orange

• mango

• cherry

• banana ✗

complementary: less labeling cost/expertise required

(7)

Introduction

Comparison

Ordinary (Supervised) Learning

training: {(xn = [image], yn = mango)} → classifier

Complementary Learning

training: {(xn = [image], ȳn = banana)} → classifier

testing goal: classifier([image]) → cherry

ordinary versus complementary: same goal via different training data

(8)

Introduction

Learning with Complementary Labels Setup

Given

N examples (input xn, complementary label ȳn) ∈ X × {1, 2, · · · , K} in data set D̄ such that ȳn ≠ yn for some hidden ordinary label

yn ∈ {1, 2, · · · , K}.

Goal

a multi-class classifier g(x) that closely predicts (0/1 error) the ordinary label y associated with some unseen inputs x

LCL model design: connecting complementary & ordinary
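To make the data generation in this setup concrete, here is a minimal NumPy sketch (ours, not from the paper) of drawing complementary labels under the uniform assumption: each ȳn is drawn uniformly from the K − 1 labels other than the hidden yn. The helper name complementary_labels is hypothetical.

    import numpy as np

    def complementary_labels(y, K, rng=None):
        # y: array of ordinary labels in {0, ..., K-1}; returns ybar with ybar[n] != y[n],
        # uniform over the K-1 wrong labels (the "uniform ybar" assumption used later).
        rng = np.random.default_rng() if rng is None else rng
        y = np.asarray(y)
        offset = rng.integers(1, K, size=y.shape)  # offset in {1, ..., K-1}
        return (y + offset) % K

    # Example with K = 10 classes (e.g., MNIST):
    y = np.array([3, 0, 7, 7, 9])
    print(complementary_labels(y, K=10))  # every entry differs from the corresponding y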

(9)

Introduction

Unbiased Risk Estimation for LCL

Ordinary Learning

• empirical risk minimization (ERM) on training data

risk: E(x,y)[ℓ(y, g(x))]; empirical risk: E(xn,yn)∈D[ℓ(yn, g(xn))]

• loss ℓ: usually a surrogate of the 0/1 error

LCL [Ish+19]

• rewrite the loss ℓ into a complementary loss ℓ̄, such that

unbiased risk estimator: E(x,y)[ℓ(y, g(x))] = E(x,ȳ)[ℓ̄(ȳ, g(x))]

• LCL by minimizing URE

URE: pioneer models for LCL

(10)

Introduction

Example of URE

Cross Entropy Loss

for g(x) = argmax_{k∈{1,2,...,K}} p(k | x),

• ℓCE: derived by maximum likelihood as a surrogate of the 0/1 risk:

R(g; ℓCE) = E(x,y)[−log p(y | x)], where ℓCE(y, g(x)) = −log p(y | x)

Complementary Learning [Ish+19]

URE: R(g; ℓCE) = E(x,ȳ)[ℓ̄CE(ȳ, g(x))], where

ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)

(the (K − 1) log p(ȳ | x) term is negative), under the uniform-ȳ assumption

ERM with URE: min_p R̂, with the expectation taken over D̄ (a small sketch follows)
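As a sanity check of the rewrite above, a minimal NumPy sketch (ours, not the authors' code) of the empirical URE with the cross-entropy loss; probs stands for the softmax outputs p(k | x) and ure_ce_loss is a hypothetical helper name.

    import numpy as np

    def ure_ce_loss(probs, ybar):
        # probs: (N, K) predicted p(k | x); ybar: (N,) complementary labels in {0, ..., K-1}.
        N, K = probs.shape
        log_p = np.log(probs + 1e-12)                        # small offset for numerical stability
        per_example = (K - 1) * log_p[np.arange(N), ybar] - log_p.sum(axis=1)
        return per_example.mean()                            # empirical URE; can be negative

    # Toy batch with K = 4 classes:
    probs = np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.05, 0.05, 0.05, 0.85]])
    print(ure_ce_loss(probs, np.array([1, 2, 3])))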

(11)

Problems of URE

URE overfits on a single complementary label

ℓ(y, g(x)) = −log p(y | x)

ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)

ordinary risk and URE very different

• ℓ > 0 → ordinary risk non-negative

• small p(ȳ | x) (often) → possibly very negative ℓ̄

empirical URE can be negative: we only observe some but not all ȳ

• negative empirical URE drags minimization towards overfitting

practical remedy [Ish+19]: NN-URE constrains the empirical URE to be non-negative (a simplified sketch follows)

how can we avoid negative empirical URE?
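A toy sketch of the non-negative remedy just mentioned: clip the (possibly negative) empirical URE at zero before minimizing. The actual correction in [Ish+19] is applied class-wise; this simplified version only illustrates why clipping removes the incentive to drive the estimate below zero.

    import numpy as np

    def nn_ure_ce_loss(probs, ybar):
        # Same empirical URE as before, floored at 0 (simplified non-negative correction).
        N, K = probs.shape
        log_p = np.log(probs + 1e-12)
        per_example = (K - 1) * log_p[np.arange(N), ybar] - log_p.sum(axis=1)
        return max(per_example.mean(), 0.0)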

(12)

Proposed Framework

Proposed Framework

Minimize Complementary 0/1

• Recall the goal: we minimize the 0-1 loss ℓ01, not the surrogate ℓ

• The unbiased estimator of R01 (under the uniform-ȳ assumption):

R01: (K − 1) · Eȳ[ℓ̄01(ȳ, g(x))] = ℓ01(y, g(x))

• We denote ℓ̄01 as the complementary 0-1 loss:

ℓ̄01(ȳ, g(x)) = ⟦ȳ = g(x)⟧

Surrogate Complementary Loss (SCL)

• Surrogate loss to optimize ℓ̄01

• Unifies previous work as surrogates of ℓ̄01 [Yu+18; Kim+19]

[Yu+18] Learning with biased complementary labels, ECCV’18.

[Kim+19] Nlnl: Negative learning for noisy labels, ICCV’19.

(13)

Proposed Framework

Negative Risk Avoided

Unbiased Risk Estimator (URE)

The URE loss ℓ̄CE [Ish+19], rewritten from cross-entropy ℓCE,

ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)

contains a negative loss term and can go negative.

Surrogate Complementary Loss (SCL)

Another loss [Kim+19], a surrogate of ℓ̄01,

φNL(ȳ, g(x)) = −log(1 − p(ȳ | x))

remains non-negative.
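A minimal NumPy sketch (ours) of the negative-learning surrogate φNL above; by construction the value is always ≥ 0. The names probs and scl_nl_loss are ours.

    import numpy as np

    def scl_nl_loss(probs, ybar):
        # probs: (N, K) softmax outputs; ybar: (N,) complementary labels.
        p_ybar = probs[np.arange(probs.shape[0]), ybar]      # p(ybar | x) per example
        return -np.log(1.0 - p_ybar + 1e-12).mean()          # -log(1 - p(ybar|x)) >= 0

    # The loss grows as p(ybar | x) grows, so minimizing it pushes p(ybar | x) down.
    probs = np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.05, 0.05, 0.05, 0.85]])
    print(scl_nl_loss(probs, np.array([1, 2, 3])))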

(14)

Proposed Framework

Illustrative Difference between URE and SCL

[Figure: risk diagram contrasting the URE path (R01, Rℓ, R̂ℓ̄) with the SCL path (R̄01, R̄φ, R̂φ), annotated with approximation (A) and estimation (E) steps]

URE: Ripple effect of errors

• Theoretical motivation [Ish+17]

• Estimation step (E) amplifies the approximation error (A) in ℓ̄

SCL: ‘directly’ minimize the complementary likelihood

• Non-negative loss φ

• Practically prevents the ripple effect

(15)

Proposed Framework

Classification Accuracy

Methods

1 Unbiased risk estimator (URE) [Ish+19]

2 Non-negative correction methods on URE (NN) [Ish+19]

3 Surrogate complementary loss (SCL)

Table: URE and NN are based on ℓ̄ rewritten from the cross-entropy loss, while SCL is based on the exponential loss φEXP(ȳ, g(x)) = exp(pȳ).

Data set + Model     URE    NN     SCL

MNIST + Linear       0.850  0.818  0.902

MNIST + MLP          0.801  0.867  0.925

CIFAR10 + ResNet     0.109  0.308  0.492

CIFAR10 + DenseNet   0.291  0.338  0.544
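For completeness, a one-line sketch (ours) of the exponential surrogate φEXP named in the table caption, assuming pȳ is the softmax output p(ȳ | x):

    import numpy as np

    def scl_exp_loss(probs, ybar):
        # phi_EXP(ybar, g(x)) = exp(p(ybar | x)); bounded below, so never negative.
        p_ybar = probs[np.arange(probs.shape[0]), ybar]
        return np.exp(p_ybar).mean()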

(16)

Gradient Analysis

Gradient Analysis

Gradient Direction of URE

• Highly divergent directions on each ȳ to maintain unbiasedness

• Low correlation to the target ℓ01

[Figure: illustration of URE gradient directions ∇ℓ̄(ȳ, g(x)) versus the ordinary gradient ∇ℓ(y, g(x))]

Gradient Direction of SCL

• Targets the minimum-likelihood objective

• High correlation to the target ℓ01

(17)

Gradient Analysis

Gradient Estimation Error

Bias-Variance Decomposition

MSE = E[(f − c)²] = E[(f − h)²] (Bias²) + E[(h − c)²] (Variance)

Gradient Estimation

1 Ordinary gradient f = ∇ℓ(y, g(x))

2 Complementary gradient c = ∇ℓ̄(ȳ, g(x))

3 Expected complementary gradient h = Eȳ[c]
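A quick numeric check (made-up numbers, not from the experiments) that the decomposition above holds in this setting: treat f as fixed, sample many complementary gradients c, and compare the MSE against Bias² + Variance with h = E[c].

    import numpy as np

    rng = np.random.default_rng(0)
    f = 1.0                                                # fixed "ordinary" gradient
    c = rng.normal(loc=0.7, scale=0.5, size=100_000)       # sampled "complementary" gradients
    h = c.mean()                                           # expected complementary gradient

    mse   = np.mean((f - c) ** 2)
    bias2 = (f - h) ** 2
    var   = np.mean((h - c) ** 2)
    print(mse, bias2 + var)                                # the two agree up to sampling noise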

(18)

Gradient Analysis

Bias-Variance Tradeoff

[Figure: panels (a) MSE, (b) Bias², (c) Variance]

Findings

• SCL reduces variance by introducing a small bias (towards y)

Method  Bias   Variance  MSE

URE     0      Big       Big

SCL     Small  Small     Small

(19)

Conclusion

Conclusion

Explain Overfitting of URE

• Unbiased methods only do well in expectation

• A single fixed complementary label causes overfitting

Surrogate Complementary Loss (SCL)

• Minimum-likelihood approach

• Avoids the negative risk problem

Experiment Results

• SCL significantly outperforms the other methods

• Introduces a small bias for lower gradient variance

(20)

Conclusion

References

Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. “Convex formulation for learning from positive and unlabeled data”. In: International Conference on Machine Learning. 2015, pp. 1386–1394.

Charles Elkan and Keith Noto. “Learning classifiers from only positive and unlabeled data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 213–220.

Takashi Ishida et al. “Learning from complementary labels”. In: Advances in Neural Information Processing Systems. 2017, pp. 5639–5649.

Takashi Ishida et al. “Complementary-Label Learning for Arbitrary Losses and Models”. In: International Conference on Machine Learning. 2019, pp. 2971–2980.

Youngdong Kim et al. “NLNL: Negative learning for noisy labels”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 101–110.

Nagarajan Natarajan et al. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.

Giorgio Patrini et al. “Making deep neural networks robust to label noise: A loss correction approach”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1944–1952.

Xiyu Yu et al. “Learning with biased complementary labels”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.
