Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels
Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama
ICML 2020; work done during Chou's internship at RIKEN AIP, Japan.
Chou's resulting M.S. thesis won the 2020 TAAI thesis award.
October 8, 2021, AI Forum, Taipei, Taiwan
Introduction
Supervised Learning
(Slide Modified from My ML Foundations MOOC)
[figure: toy training examples plotted by Size and Mass]
unknown target function f : X → Y
training examples D : (x1, y1), · · · , (xN, yN)
learning algorithm A
final hypothesis g ≈ f
hypothesis set H
supervised learning: every input vector xn comes with its (possibly expensive) label yn
Introduction
Weakly-supervised: Learning without True Labels yn
(a) positive-unlabeled learning [EN08]  (b) learning with complementary labels [Ish+17]  (c) learning with noisy labels [Nat+13]
• positive-unlabeled: some of the true yn = +1 are revealed
• complementary: a 'not-this-label' ȳn instead of the true yn
• noisy: a noisy label y′n instead of the true yn
weakly-supervised: a realistic and hot research direction to reduce the labeling burden
[EN08] Learning classifiers from only positive and unlabeled data, KDD’08.
[Ish+17] Learning from complementary labels, NeurIPS’17.
Introduction
Motivation
popular weakly-supervised models [DNS15; Ish+19; Pat+17]
• derive Unbiased Risk Estimators (URE) as a new loss
• theoretically, nice properties (unbiased, consistent, etc.) [Ish+17]
• practically, sometimes bad performance (overfitting)
our contributions on Learning with Complementary Labels (LCL):
• analysis: identify a weakness of the URE framework
• algorithm: propose an improved framework
• experiment: demonstrate stronger performance
next: introduction to LCL
[DNS15] Convex formulation for learning from positive and unlabeled data, ICML’15.
[Ish+19] Complementary-Label Learning for Arbitrary Losses and Models, ICML’19.
[Pat+17] Making deep neural networks robust to label noise: A loss correction approach, CVPR’17.
Introduction
Motivation behind Learning with Complementary Labels
complementary label ȳn instead of the true label yn
Figure 1 of [Yu+18]
complementary label: easier/cheaper to obtain for some applications
Introduction
Fruit Labeling Task (Image from AICup in 2020)
hard: true label
• orange?
• mango?
• cherry
• banana
easy: complementary label
• orange
• mango
• cherry
• banana ✗
complementary: less labeling cost/expertise required
Introduction
Comparison
Ordinary (Supervised) Learning
training: {(xn = [fruit image], yn = mango)} → classifier
Complementary Learning
training: {(xn = [fruit image], ȳn = banana)} → classifier
testing goal: classifier([fruit image]) → cherry
ordinary versus complementary: same goal via different training data
Introduction
Learning with Complementary Labels Setup
Given
N examples (input xn, complementary label ȳn) ∈ X × {1, 2, · · · , K} in a data set D, such that ȳn ≠ yn for the hidden ordinary label
yn ∈ {1, 2, · · · , K}.
Goal
a multi-class classifier g(x) that closely predicts (in 0/1 error) the ordinary label y associated with unseen inputs x
LCL model design: connecting complementary & ordinary
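To make the setup concrete, here is a minimal sketch (ours, not from the talk) of how complementary labels are commonly simulated from ordinary labels under the uniform assumption used later:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_complementary(y, K):
    """Draw one complementary label per example, uniformly among the
    K - 1 classes that are NOT the ordinary label."""
    y = np.asarray(y)
    offsets = rng.integers(1, K, size=y.shape)  # each offset in {1, ..., K-1}
    return (y + offsets) % K                    # shifted label never equals y

y = rng.integers(0, 10, size=5)                 # hidden ordinary labels, K = 10
print(y)
print(sample_complementary(y, K=10))            # guaranteed != y elementwise
```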
Introduction
Unbiased Risk Estimation for LCL
Ordinary Learning
• empirical risk minimization (ERM) on training data
risk: E(x,y)[ℓ(y, g(x))]
empirical risk: E(xn,yn)∈D[ℓ(yn, g(xn))]
• loss ℓ: usually a surrogate of the 0/1 error
LCL [Ish+19]
• rewrite the loss ℓ into a complementary loss ℓ̄, such that
unbiased risk estimator: E(x,y)[ℓ(y, g(x))] = E(x,ȳ)[ℓ̄(ȳ, g(x))]
• LCL by minimizing the URE
URE: the pioneering models for LCL
Introduction
Example of URE
Cross Entropy Loss
for g(x) = argmax_{k ∈ {1,2,...,K}} p(k | x),
• ℓCE: derived by maximum likelihood as a surrogate of the 0/1 risk:
R(g; ℓCE) = E(x,y)[ ℓCE(y, g(x)) ],  with  ℓCE(y, g(x)) = − log p(y | x)
Complementary Learning [Ish+19]
URE: R(g; ℓCE) = E(x,ȳ)[ ℓ̄CE(ȳ, g(x)) ],  with
ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
(first term: negative), under the uniform ȳ assumption
ERM with URE: min over p of R, with the expectation E taken over D
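As a concrete illustration, a minimal PyTorch sketch (function name and toy tensors are ours) of this rewritten cross-entropy loss, computed from network logits under the uniform ȳ assumption:

```python
import torch
import torch.nn.functional as F

def ure_ce_loss(logits, comp_labels, K):
    """Empirical URE of the cross-entropy risk: batch mean of
    (K - 1) * log p(ybar | x) - sum_k log p(k | x)."""
    log_p = F.log_softmax(logits, dim=1)                             # log p(k | x)
    log_p_bar = log_p.gather(1, comp_labels.view(-1, 1)).squeeze(1)  # log p(ybar | x)
    return ((K - 1) * log_p_bar - log_p.sum(dim=1)).mean()

# toy batch: 4 examples, K = 10 classes
logits = torch.randn(4, 10, requires_grad=True)
comp_labels = torch.randint(0, 10, (4,))
loss = ure_ce_loss(logits, comp_labels, K=10)
print(loss)        # note: this empirical estimate can be negative
loss.backward()    # usable directly as a training loss
```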
Problems of URE
URE overfits given only a single complementary label per example
ℓ(y, g(x)) = − log p(y | x)
ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
ordinary risk and URE behave very differently:
• ℓ > 0 → the ordinary risk is non-negative
• small p(ȳ | x) (which is common) → possibly very negative ℓ̄
the empirical URE can be negative: we observe only some, but not all, ȳ
• a negative empirical URE drags minimization towards overfitting
practical remedy [Ish+19]:
NN-URE: constrain the empirical URE to be non-negative
how can we avoid negative empirical URE in the first place?
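A crude code-level illustration of the NN remedy just mentioned (this clamps the whole mini-batch estimate; the actual correction in [Ish+19] is finer-grained, applied to per-class partial risks, so treat it only as a sketch):

```python
import torch

def nn_ure_loss(ure_batch_loss):
    """Illustration only: keep the empirical URE from going below zero,
    so minimization cannot be driven by a negative risk estimate."""
    return torch.clamp(ure_batch_loss, min=0.0)

# e.g. with the earlier sketch:
# loss = nn_ure_loss(ure_ce_loss(logits, comp_labels, K=10))
```

The proposed framework below takes a different route: avoid the negative-valued rewrite altogether.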
Proposed Framework
Proposed Framework
Minimize Complementary 0/1
• Recall the goal: we care about the 0/1 loss, not any particular ℓ
• The unbiased estimator of R01: under the uniform ȳ assumption,
(K − 1) · Eȳ[ℓ̄01(ȳ, g(x))] = ℓ01(y, g(x)),
so minimizing the complementary 0/1 risk also minimizes the ordinary R01
• We denote ℓ̄01 as the complementary 0/1 loss:
ℓ̄01(ȳ, g(x)) = ⟦ȳ = g(x)⟧
Surrogate Complementary Loss (SCL)
• Surrogate losses to optimize ℓ̄01
• Unify previous work [Yu+18; Kim+19] as surrogates of ℓ̄01
[Yu+18] Learning with biased complementary labels, ECCV’18.
[Kim+19] Nlnl: Negative learning for noisy labels, ICCV’19.
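For concreteness, a tiny sketch (ours) of the complementary 0/1 loss itself; being a step function it has no useful gradient, which is exactly why SCL replaces it with a surrogate:

```python
import torch

def comp_zero_one_loss(logits, comp_labels):
    """Complementary 0/1 loss: 1 iff the predicted class equals the complementary label."""
    pred = logits.argmax(dim=1)
    return (pred == comp_labels).float().mean()   # not differentiable w.r.t. logits
```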
Proposed Framework
Negative Risk Avoided
Unbiased Risk Estimator (URE)
URE loss ℓ̄CE [Ish+19], rewritten from the cross-entropy ℓCE,
ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)
(the first term is the negative loss term),
can go negative.
Surrogate Complementary Loss (SCL)
Another loss [Kim+19], a surrogate of ℓ̄01,
φNL(ȳ, g(x)) = − log(1 − p(ȳ | x)),
remains non-negative.
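A minimal sketch of this surrogate on top of softmax outputs (the function name and the small epsilon for numerical stability are ours):

```python
import torch
import torch.nn.functional as F

def scl_nl_loss(logits, comp_labels):
    """Negative-learning surrogate of the complementary 0/1 loss:
    -log(1 - p(ybar | x)), non-negative by construction."""
    p_bar = F.softmax(logits, dim=1).gather(1, comp_labels.view(-1, 1)).squeeze(1)
    return -torch.log(1.0 - p_bar + 1e-12).mean()   # eps guards against p_bar == 1
```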
Proposed Framework
Illustrative Difference between URE and SCL
[diagram: the chain of risks from R01 to the empirical objective for URE vs. SCL, with approximation (A) and estimation (E) steps marked]
URE: Ripple effect of errors
• Theoretical motivation [Ish+17]
• Estimation step (E) amplifies the approximation error (A) in ℓ̄
SCL: 'Directly' minimize the complementary likelihood
• Non-negative loss φ
• Practically prevents ripple effect
Proposed Framework
Classification Accuracy
Methods
1 Unbiased risk estimator (URE) [Ish+19]
2 Non-negative correction methods on URE (NN) [Ish+19]
3 Surrogate complementary loss (SCL)
Table: URE and NN are based on ℓ̄ rewritten from the cross-entropy loss, while SCL is based on the exponential loss φEXP(ȳ, g(x)) = exp(p(ȳ | x)).
Data set + Model     URE    NN     SCL
MNIST + Linear       0.850  0.818  0.902
MNIST + MLP          0.801  0.867  0.925
CIFAR10 + ResNet     0.109  0.308  0.492
CIFAR10 + DenseNet   0.291  0.338  0.544
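For reference, a sketch (our naming) of the exponential surrogate used for SCL in the table; minimizing it pushes p(ȳ | x) towards 0, which indirectly raises the probabilities of the remaining classes:

```python
import torch
import torch.nn.functional as F

def scl_exp_loss(logits, comp_labels):
    """Exponential surrogate of the complementary 0/1 loss: exp(p(ybar | x))."""
    p_bar = F.softmax(logits, dim=1).gather(1, comp_labels.view(-1, 1)).squeeze(1)
    return torch.exp(p_bar).mean()
```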
Gradient Analysis
Gradient Analysis
Gradient Direction of URE
• Very divergent directions across the possible ȳ, to maintain unbiasedness
• Low correlation with the target ℓ̄01
[figure: illustration of URE gradient directions ∇ℓ̄(ȳ, g(x)) versus the ordinary gradient ∇ℓ(y, g(x))]
Gradient Direction of SCL
• Targets the minimum-likelihood objective directly
• High correlation with the target ℓ̄01
Gradient Analysis
Gradient Estimation Error
Bias-Variance Decomposition
MSE = E(f − c)² = E(f − h)² + E(h − c)²
(Bias² = E(f − h)², Variance = E(h − c)²)
Gradient Estimation
1 Ordinary gradient f = ∇ℓ(y, g(x))
2 Complementary gradient c = ∇ℓ̄(ȳ, g(x))
3 Expected complementary gradient h = Eȳ[c] (compared numerically in the sketch below)
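A small numerical sketch of this decomposition for a toy linear model (the model, seed, and all names here are our own illustration): under the uniform ȳ assumption the expected URE gradient matches the ordinary gradient (zero bias), while the SCL gradient is biased but typically less variable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, d = 5, 8
W = torch.randn(K, d, requires_grad=True)     # toy linear model: logits = W x
x = torch.randn(d)
y = 2                                         # hidden ordinary label

def grad_of(loss):
    """Gradient of a scalar loss w.r.t. the parameters W, flattened."""
    (g,) = torch.autograd.grad(loss, W)
    return g.flatten()

f = grad_of(-F.log_softmax(W @ x, dim=0)[y])  # ordinary gradient f (cross-entropy)

# complementary gradients c for every possible ybar != y (uniform assumption)
ure_grads, scl_grads = [], []
for ybar in range(K):
    if ybar == y:
        continue
    log_p = F.log_softmax(W @ x, dim=0)
    ure_grads.append(grad_of((K - 1) * log_p[ybar] - log_p.sum()))   # URE-rewritten CE
    log_p = F.log_softmax(W @ x, dim=0)                              # rebuild freed graph
    scl_grads.append(grad_of(-torch.log(1 - log_p[ybar].exp())))     # SCL-NL surrogate

def bias2_and_var(grads, target):
    G = torch.stack(grads)                  # one row per possible ybar
    h = G.mean(dim=0)                       # expected complementary gradient
    bias2 = ((h - target) ** 2).sum()       # ||h - f||^2
    var = ((G - h) ** 2).sum(dim=1).mean()  # E ||c - h||^2
    return bias2.item(), var.item()

print("URE bias^2, variance:", bias2_and_var(ure_grads, f))  # bias^2 ~ 0 by construction
print("SCL bias^2, variance:", bias2_and_var(scl_grads, f))  # small bias, typically smaller variance
```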
Gradient Analysis
Bias-Variance Tradeoff
[figure: three panels, (a) MSE, (b) Bias², (c) Variance]
Findings
• SCL reduces variance by introducing a small bias (towards the observed ȳ)

Method  Bias   Variance  MSE
URE     0      Big       Big
SCL     Small  Small     Small
Conclusion
Conclusion
Explain Overfitting of URE
• Unbiased methods only do well in expectation
• A single, fixed complementary label per example causes overfitting
Surrogate Complementary Loss (SCL)
• Minimum-likelihood approach on the complementary label
• Avoids the negative-risk problem
Experiment Results
• SCL significantly outperforms the other methods
• Introduces a small bias in exchange for lower gradient variance
Conclusion
References
[DNS15] Marthinus Du Plessis, Gang Niu, and Masashi Sugiyama. “Convex formulation for learning from positive and unlabeled data”. In: International Conference on Machine Learning. 2015, pp. 1386–1394.
[EN08] Charles Elkan and Keith Noto. “Learning classifiers from only positive and unlabeled data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 213–220.
[Ish+17] Takashi Ishida et al. “Learning from complementary labels”. In: Advances in Neural Information Processing Systems. 2017, pp. 5639–5649.
[Ish+19] Takashi Ishida et al. “Complementary-Label Learning for Arbitrary Losses and Models”. In: International Conference on Machine Learning. 2019, pp. 2971–2980.
[Kim+19] Youngdong Kim et al. “NLNL: Negative learning for noisy labels”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 101–110.
[Nat+13] Nagarajan Natarajan et al. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.
[Pat+17] Giorgio Patrini et al. “Making deep neural networks robust to label noise: A loss correction approach”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 1944–1952.
[Yu+18] Xiyu Yu et al. “Learning with biased complementary labels”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.