### A Case Study of Learning with Complementary Labels

Yu-Ting Chou^{1 *} Gang Niu^{2} Hsuan-Tien Lin^{1} Masashi Sugiyama^{2 3}

### Abstract

In weakly supervised learning, the unbiased risk estimator (URE) is a powerful tool for training classifiers when training and test data are drawn from different distributions. Nevertheless, UREs lead to overfitting in many problem settings when the models are complex, like deep networks. In this paper, we investigate reasons for such overfitting by studying a weakly supervised problem called learning with complementary labels. We argue that the quality of gradient estimation matters more than that of risk estimation in risk minimization. Theoretically, we show that a URE gives an unbiased gradient estimator (UGE). Practically, however, UGEs may suffer from huge variance, which causes empirical gradients to be usually far away from true gradients during minimization. To this end, we propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance and makes empirical gradients more aligned with true gradients in direction. Thanks to this characteristic, SCL successfully mitigates the overfitting issue and improves on URE-based methods.

### 1. Introduction

In weakly supervised learning (WSL), learning algorithms have to train classifiers under incomplete, inexact, or inaccurate supervision (Zhou, 2017), including but not limited to semi-supervised learning (Chapelle et al., 2009), partial labels (Jin & Ghahramani, 2002), noisy labels (Natarajan et al., 2013; Patrini et al., 2017; Han et al., 2018a;b; Yu et al., 2019; Xia et al., 2019), complementary labels (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Xu et al., 2020; Feng et al., 2020), where the label distribution changes, and positive-unlabeled data (Elkan & Noto, 2008; du Plessis et al., 2014; 2015; Niu et al., 2016; Sakai et al., 2017; 2018), unlabeled-unlabeled data (Lu et al., 2019; 2020), and other similar settings (Bao et al., 2018; Ishida et al., 2018; Hsieh et al., 2019), where the data distribution changes. Among WSL methods, the unbiased risk estimator (URE) is a powerful tool: it evaluates the classification risk from training data drawn from a distribution different from the test one, and thus empirical risk minimization (Vapnik, 1992) is possible.

*Work done during an internship at RIKEN. ^{1}National Taiwan University ^{2}RIKEN ^{3}The University of Tokyo. Correspondence to: Yu-Ting Chou <r07922042@csie.ntu.edu.tw>.

Proceedings of the 37^{th} International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

The success of URE is due to two orthogonal demands in WSL for handling big data and complex data: URE poses unconstrained optimization problems, so it can handle very big data with stochastic optimizers; and URE is model-independent, so it can handle complex data where the model is chosen according to the data (e.g., image, text, or speech).

An important motivation for employing URE in WSL is that URE enables estimation error bounds that guarantee statistical consistency. However, consistency in the asymptotic case is not very meaningful in the finite-sample case, especially in deep learning (Zhang et al., 2017; Nagarajan & Kolter, 2019). Despite its popularity and nice properties, URE in du Plessis et al. (2015), Ishida et al. (2017), or Lu et al. (2019) has inferior test performance to the recent biased methods in Kiryo et al. (2017), Ishida et al. (2019), and Lu et al. (2020). When complex models like deep networks are chosen as the classifiers, UREs suffer from severely negative empirical risks during training, which is a sign of overfitting.

Even though the overfitting issue can be relatively mitigated by keeping UREs non-negative, the mechanism behind how UREs cause overfitting is still unknown. Thus, instead of a theoretical motivation, this paper has a practical motivation and focuses on understanding how UREs cause overfitting and how to avoid such overfitting in algorithm design.

Learning with complementary labels (Ishida et al., 2017) is a WSL problem of multi-class classification where classifiers are trained from data with complementary labels (CLs). A CL specifies a class that an instance does not belong to, but the trained classifier should still predict the correct labels.

Although CLs are less informative than ordinary labels, they provide an alternative when ordinary labels are inaccessible or costly to acquire. In this paper, we choose learning with CLs to study the overfitting issue of UREs, as it combines several practical advantages: first, CLs are easy to generate compared with partial labels and noisy labels; second, negative empirical risks occur easily; and third, it is easy to experimentally analyze the bias and variance of empirical gradients. With the help of such a case study, we can gain a deep insight into UREs and lay the foundation for further studies of UREs in other WSL problem settings.

Our contributions are twofold. First, we conduct a series of analyses to investigate reasons for the overfitting issue. We show that, due to the linearity of the differential operator, any URE must give an unbiased gradient estimator (UGE); however, a UGE is not necessarily good at gradient estimation even though it is unbiased. During training, only a single fixed CL can be acquired for each instance, which causes empirical gradients given by a UGE to be usually far away from true gradients. This illustrates the difference between validation and training:

• In validation, the classifier is fixed and the data is repeatedly sampled, and then UGE is good at gradient estimation (which can be theoretically guaranteed by concentration inequalities).

• In training, the data are fixed and based on these data the classifier is iteratively updated, and then UGE might be really bad at gradient estimation.

• Theoretically speaking, good validation can imply good training if the model is simple, while good validation may still result in poor training if the model is complex (Zhang et al.,2017;Nagarajan & Kolter,2019).

Unfortunately, UGEs in training suffer from huge variance in learning with CLs. Here, the root cause of overfitting is that only one fixed CL is available for each instance, and the direct cause is the huge variance of UGEs and the distance from empirical to true gradients. The root cause also exists in other WSL problem settings, e.g., partial or noisy labels.

Notice that the quality of gradient estimation matters more than risk estimation in risk minimization, since stochastic optimizers mainly rely on empirical gradients.

Next, we propose a novel framework named surrogate complementary loss (SCL) to improve gradient estimation. Recall that the classification error is defined as the expected zero-one loss over the test distribution. Existing URE-based methods first replace the zero-one loss with a surrogate loss to obtain the risk, and then rewrite the risk into an expectation over the training distribution. We call the result a complementary surrogate loss, since replacing comes before rewriting. Our framework, on the other hand, first rewrites the error into an expectation over the training distribution and then replaces the zero-one loss with a surrogate loss, namely, rewriting before replacing. Rewriting the error is preferable since the zero-one loss has many nice properties while the surrogate loss is arbitrary. In our experiments, SCL-based methods outperform URE-based methods: SCL successfully reduces the variance of empirical gradients and makes them more aligned with true gradients in direction.

The rest of the paper is organized as follows. We introduce WSL problem settings and the overfitting issue in Section 2. In Section 3, we propose the SCL framework. In Section 4, we analyze empirical gradients to justify our claims.

### 2. The Use of Unbiased Risk Estimators

In this section, we introduce the use of unbiased risk estimators in several weakly supervised learning settings. Then we zoom in on the problem of learning with complementary labels, and show the relationship between the negative risk problem and overfitting.

2.1. Related WSL Settings

The following problems are typical examples where UREs fail under weak supervision. Negative empirical risks can occur when the loss functions are not specifically restricted, causing overfitting. Biased loss functions or non-negative correction methods have been introduced in the related literature to mitigate such issues.

Noisy Label Learning: Noisy label learning studies learning when training labels flip according to some underlying distribution. A common assumption is the class-conditional noise setting, where the noisy label depends on its ordinary label. Natarajan et al. (2013) first provided a URE for an arbitrary loss in the binary case, together with a performance guarantee. To ensure the convexity of the rewritten loss function, they require the original surrogate loss to satisfy a symmetric property. Patrini et al. (2017) extended this to multiclass classification and proposed two loss correction methods: backward correction and forward correction. Backward correction involves a matrix inversion and gives an unbiased estimator of the original loss. Forward correction corrects the prediction with a matrix multiplication and can be added as an additional layer to neural networks. The authors showed that forward correction performs better than backward correction, and hinted that the reason is optimization-related.

Positive-Unlabeled (PU) Learning: In binary classification, the labeled data consists of two sets, the positive (P) class and the negative (N) class. PU learning studies the case where labeled data consists only of positive examples, while unlabeled (U) data consists of both positive and negative examples. Elkan & Noto (2008) proposed to learn by assigning weights to unlabeled examples. du Plessis et al. (2014) proposed a URE for non-convex losses, and du Plessis et al. (2015) extended it to a more general framework with a convex formulation. Kiryo et al. (2017) observed the overfitting issue of unbiased PU learning and proposed a non-negative risk estimator to fix the problem.

Unlabeled-Unlabeled (UU) Learning: In binary classification, UU learning considers the setting where all labels are unknown. Lu et al. (2019) discovered that if two sets of unlabeled data have different class priors, a URE can be derived to learn from such data. However, unbiased UU learning also encounters severe overfitting due to negative empirical risk. Lu et al. (2020) proposed a non-negative corrected risk estimator to fix the problem.

2.2. Learning with Complementary Labels

In the following part, we first introduce related work of learning with complementary labels, then formally define the URE formulation and the negative risk effect.

In Ishida et al. (2017), the first work to introduce the setting of complementary labels, a URE can be obtained when the loss function satisfies the symmetric property, under the uniform assumption. Yu et al. (2018) provide a loss correction method for the softmax cross-entropy loss, and show that non-uniform complementary labels can also be learned from if the complementary transition matrix is known. Continuing with the uniform complementary assumption of Ishida et al. (2017), Ishida et al. (2019) generalize the URE to arbitrary loss functions and models, and propose a non-negative correction and a gradient ascent method to account for overfitting. Several studies have also extended the setting to learning with multiple complementary labels (Feng et al., 2020) and to its combination with unlabeled data (Cao & Xu, 2020). The flexibility of CLs makes them easy to use in settings such as online learning (Kaneko et al., 2019), generative-discriminative learning (Xu et al., 2020), and noisy label learning (Kim et al., 2019).

Ordinary Learning: We start by reviewing the setting and introducing notation for ordinary learning. Consider the problem of $K$-class classification ($K > 2$), where $[K] = \{1, 2, \ldots, K\}$ is the label set. Let $D$ be a joint distribution over the feature set $X$ and label set $Y$, from which we sample an input feature $x \in \mathbb{R}^{d}$ and a label $y \in [K]$. Given training samples $\{(x_i, y_i)\}_{i=1}^{n}$, the goal of the learning algorithm is to learn a classifier $f : \mathbb{R}^{d} \to [K]$ that predicts the correct label for a given input $x$. The classifier $f$ is implemented with a decision function $g : \mathbb{R}^{d} \to \mathbb{R}^{K}$ by taking the argmax, $f(x) = \arg\max_{i} g(x)_{i}$. For a label $y$ and a decision function output $g(x)$, the loss function is defined as a nonnegative function $\ell : [K] \times \mathbb{R}^{K} \to \mathbb{R}^{+}$. Finally, we define the risk as the expected loss of $g$ over the distribution $D$:

$$R(g; \ell) = \mathbb{E}_{(X, Y) \sim D}[\ell(Y, g(X))]. \tag{1}$$

Complementary Learning: In complementary learning, the data distribution is switched to $\bar{D}$ over $X \times \bar{Y}$, and the training samples given to the learner become $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$. For instance $x_i$, the complementary label (CL) $\bar{y}_i$ is a class in $[K]$ that $x_i$ does not belong to, satisfying $\bar{y}_i \neq y_i$. In this case, the loss function $\ell$ cannot be used directly since the ordinary target $y_i$ is not given. In the following part, we review the derivation of the URE using the backward loss rewriting process (Patrini et al., 2017; Ishida et al., 2019).

Unbiased Risk Estimator: In this part, we follow the assumption of class-conditional complementary transition as in related work, assuming that the transition matrix $T$ is invertible, where $T_{ij} = P(\bar{Y} = j \mid Y = i)$ and $T_{ii} = 0$ for all $i$. We borrow the following notation from Ishida et al. (2019). The loss vector is $\ell(g(x)) = [\ell(1, g(x)), \ell(2, g(x)), \ldots, \ell(K, g(x))]$, and $e_i \in \{0, 1\}^{K}$ denotes the one-hot vector in which the $i$-th entry is one.

Proposition 1. The ordinary risk can be transformed as

$$R(g; \ell) = \mathbb{E}_{(X, \bar{Y}) \sim \bar{D}}\left[e_{\bar{y}}^{\top} (T^{-1})\, \ell(g(x))\right]. \tag{2}$$

That is, we obtain an unbiased risk estimator (URE):

$$\bar{R}(g; \bar{\ell}) = \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}[\bar{\ell}(\bar{y}, g(x))] = R(g; \ell), \tag{3}$$

where $\bar{\ell}$ is the following rewritten loss:

$$\bar{\ell}(\bar{y}, g(x)) = e_{\bar{y}}^{\top} (T^{-1})\, \ell(g(x)). \tag{4}$$

This proposition implies that the expectation of $\bar{\ell}(\bar{y}, g(x))$ under the distribution $\bar{D}$ is equal to the ordinary risk $R(g; \ell)$.

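Uniform Assumption: In the rest of this paper, we assume that CLs are sampled uniformly from $[K] \setminus \{y\}$, for a better comparison with Ishida et al. (2019). Plugging in the uniform transition matrix $T = \frac{1}{K-1}(\mathbf{1}_{K} - I_{K})$, where $\mathbf{1}_{K}$ is the $K \times K$ all-ones matrix and $I_{K}$ is the identity matrix, we obtain the following formulation of $\bar{\ell}$:

$$\bar{\ell}(\bar{y}, g(x)) = -(K - 1)\,\ell(\bar{y}, g(x)) + \sum_{j=1}^{K} \ell(j, g(x)). \tag{5}$$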

This URE approach minimizes $\bar{\ell}$ over the training distribution, and theoretical results from Ishida et al. (2017) prove the consistency of the risk estimator under specific losses.
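The rewriting in Equations 4 and 5 can be checked numerically. The following is a minimal NumPy sketch, assuming softmax cross-entropy as $\ell$ and 0-indexed labels; the function names are illustrative and not taken from the paper. It builds the uniform transition matrix, compares the matrix-inverse form of the rewritten loss with the closed form, and averages the rewritten loss over the CLs of one ordinary label.

```python
import numpy as np

def cross_entropy_vector(logits):
    """Loss vector [l(1, g(x)), ..., l(K, g(x))] for softmax cross-entropy."""
    shifted = logits - logits.max()
    log_p = shifted - np.log(np.exp(shifted).sum())
    return -log_p

def uniform_transition(K):
    """T = (all-ones - identity) / (K - 1): zero diagonal, uniform elsewhere."""
    return (np.ones((K, K)) - np.eye(K)) / (K - 1)

def ure_loss_via_inverse(loss_vec, T):
    """Entry cl is e_cl^T T^{-1} l(g(x)), the rewritten loss of Equation 4."""
    return np.linalg.inv(T) @ loss_vec

def ure_loss_closed_form(loss_vec):
    """Entry cl is -(K - 1) * l(cl, g(x)) + sum_j l(j, g(x)), as in Equation 5."""
    K = loss_vec.shape[0]
    return -(K - 1) * loss_vec + loss_vec.sum()

K = 5
rng = np.random.default_rng(0)
logits = rng.normal(size=K)          # arbitrary decision function output g(x)
loss_vec = cross_entropy_vector(logits)
T = uniform_transition(K)

via_inverse = ure_loss_via_inverse(loss_vec, T)
via_eq5 = ure_loss_closed_form(loss_vec)

y = 2                                # hypothetical ordinary label (0-indexed)
expected_over_cl = T[y] @ via_eq5    # average of the rewritten loss over CLs of y

print(np.allclose(via_inverse, via_eq5))
print(np.isclose(expected_over_cl, loss_vec[y]))
```

Row `T[y]` holds the probabilities of each CL given the ordinary label, so the last quantity is the conditional expectation of the rewritten loss, which recovers the ordinary loss as Proposition 1 states.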

2.3. Negative Risk and Overfitting

However, the URE tends to have poor empirical performance. Ishida et al. (2019) reported that minimizing the URE causes the empirical risk to go negative, which is a sign of overfitting. It is clear that the negative loss term $-(K - 1)\,\ell(\bar{y}, g(x))$ in $\bar{\ell}$ (Equation 5) is the source of negativity. Such a negative term occurs under any common class-conditional complementary transition as long as all diagonal elements of $T$ are zero.

Recall that the URE in Equation 3 has, in expectation, a minimum value of 0 when the classifier has no error. However, when minimizing the URE empirically, this non-negative lower bound no longer holds. We claim that the main difference between the expectation and its empirical realization is the label distribution: only a single $\bar{y}$ is given for each instance in practice, while the expectation is calculated over all possible $\bar{y}$. The URE stays non-negative only in expectation, which is not realistic.

Figure 1. Empirical risk minimization comparison. Panels: (a) MNIST, Linear; (b) MNIST, MLP; (c) CIFAR-10, DenseNet.
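A small numerical illustration of how the negative term takes over when only one CL is available, assuming softmax cross-entropy as $\ell$ and the uniform rewritten loss of Equation 5. The setup is a hypothetical single instance whose logit for the given CL is pushed down by a growing margin while all other logits stay at zero; names and values are for illustration only.

```python
import numpy as np

def ure_cross_entropy(logits, cl):
    """Rewritten loss of Equation 5 with softmax cross-entropy as the base loss."""
    shifted = logits - logits.max()
    log_p = shifted - np.log(np.exp(shifted).sum())
    loss_vec = -log_p
    K = logits.shape[0]
    return -(K - 1) * loss_vec[cl] + loss_vec.sum()

K = 10
cl = 3                                # the single fixed CL (0-indexed)
margins = [0.0, 2.0, 5.0, 10.0]       # how far the CL logit is pushed down
risks = []
for margin in margins:
    logits = np.zeros(K)
    logits[cl] = -margin              # the model becomes more confident against the CL
    risks.append(ure_cross_entropy(logits, cl))
    print(f"margin={margin:4.1f}  rewritten loss={risks[-1]:9.3f}")
```

In this toy setting the rewritten loss equals $-(K-2)m + \log(K - 1 + e^{-m})$ for margin $m$, so it decreases without bound as the model fits the single CL more confidently. No finite minimizer exists on this instance, which is consistent with the negative empirical risk described above.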

Negative Risk Experiment: To show the difference between theory (expectation) and practice, we use an experiment to demonstrate how the empirical distribution of CLs leads to negative empirical risk during training. Three different label distributions are given:

1. Ordinary Learning (ORD): The supervised learning baseline, in which the ordinary label $y$ is given. This is also the case where the complementary label is marginalized out by taking the expectation.

2. Fixed Complementary Learning (FIXED): The realistic complementary learning scenario, in which only a fixed CL $\bar{y}$ is given for each instance $x$.

3. Random Complementary Learning (RAND): The $\bar{y}$ of each instance is randomly sampled from $[K] \setminus \{y\}$ in each epoch. This setting acts as a stochastic version of ORD on $\bar{y}$.
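The FIXED and RAND label distributions differ only in when the CLs are drawn. A minimal sketch of the sampling, assuming 0-indexed labels and the uniform assumption; the helper name `sample_complementary_labels` and the synthetic labels are hypothetical and used only for illustration.

```python
import numpy as np

def sample_complementary_labels(y, K, rng):
    """Draw one CL per instance uniformly from the K - 1 classes other than y."""
    offsets = rng.integers(1, K, size=y.shape[0])   # values in {1, ..., K - 1}
    return (y + offsets) % K                        # never equal to y

rng = np.random.default_rng(0)
K = 10
y = rng.integers(0, K, size=1000)                   # synthetic ordinary labels

# FIXED: CLs are drawn once and reused in every epoch.
fixed_cl = sample_complementary_labels(y, K, rng)

# RAND: CLs are redrawn at the start of every epoch.
rand_cls = [sample_complementary_labels(y, K, rng) for _ in range(3)]

print(np.all(fixed_cl != y))
print([float(np.mean(cl == fixed_cl)) for cl in rand_cls])
```

Adding a nonzero offset modulo $K$ reaches every class except the ordinary label exactly once, so each incorrect class is drawn with probability $1/(K-1)$.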

In this experiment, we used the cross-entropy loss as $\ell$ for ordinary learning (ORD) and its rewritten counterpart $\bar{\ell}$ for complementary learning (FIXED, RAND). For MNIST, we used a linear model and a single-hidden-layer MLP ($d$-500-10) as learning models; for CIFAR-10, we used ResNet-34 (He et al., 2016) and DenseNet (Huang et al., 2017). The models were trained with the Adam optimizer (Kingma & Ba, 2015) at a fixed learning rate of $10^{-5}$ for 300 epochs.

Results are shown in Figure 1. FIXED suffers from severe negative risk in comparison to ORD and RAND, which is a clear sign of overfitting to the given CL. The problem worsens as more flexible models are used, matching the results of Ishida et al. (2019). Note, however, that RAND yields a significantly different result from FIXED even though they are trained on the same objective. Though the risk of RAND fluctuates considerably due to the changes in each epoch, it does not stay negative, as we can view RAND as a randomized approximation of ORD. The results also show that the estimated risk diverges far from the ordinary risk, and the gap increases with the number of training epochs. In this case, consistency guarantees become ineffective since the risk estimation error keeps increasing as training goes on. That is, the behavior of the URE differs greatly from that of the ordinary risk in the empirical setting, even if statistical properties such as unbiasedness and consistency can be proven.

Risk Correction Methods: Ishida et al. (2019) proposed two correction methods to mitigate the problem. The first is the non-negative loss correction (NN), which enforces non-negativity of the decomposed risk of each class. The second is the gradient ascent correction (GA), which enforces a reverse gradient update of the model parameters when the decomposed risk goes negative or falls under a certain threshold. GA can be viewed as a more aggressive correction than NN. The correction methods show improvements in various experiments, and similar techniques have also been applied in other WSL problems (Kiryo et al., 2017; Lu et al., 2020). However, such correction methods are still based on URE and lack theoretical motivation; the fundamental difference between the risk and the URE remains unresolved. We include experimental results for these methods in the following sections.

### 3. Proposed Framework

In this section, we propose a complementary learning framework that avoids the negative risk problem of URE. To clearly distinguish between complementary learning and ordinary learning, we rethink the relationship between input features and labels: an ordinary label provides positive feedback for the given class, while a CL provides negative feedback for the given class. The maximum likelihood approach is commonly used in ordinary learning when the model provides probability estimates, by maximizing the conditional likelihood given the training data. The softmax cross-entropy loss commonly used in deep learning is a typical example, combining the softmax activation function with the maximum likelihood approach. In complementary learning, given only CLs as training data, we propose to apply a minimum complementary likelihood approach through a proxy loss. In the rest of this section, we propose a new framework that consists of the complementary 0-1 loss and its corresponding surrogate complementary loss (SCL).

3.1. Complementary 0-1 Loss

From the classification error perspective: In ordinary learning, zero error is obtained when the classifier predicts the correct class as the label, and an error is incurred otherwise. In complementary learning, given only limited information, we can only be sure that a prediction error occurs when the CL is predicted by the classifier. With the rules above, we formally define the ordinary classification error and a novel complementary classification error:

Definition 1. (Multiclass) classification error, or 0-1 loss:

$$\ell_{01}(y, f(x)) = \llbracket y \neq f(x) \rrbracket. \tag{6}$$

Definition 2. Complementary classification error, or complementary 0-1 loss:

$$\bar{\ell}_{01}(\bar{y}, f(x)) = \llbracket \bar{y} = f(x) \rrbracket. \tag{7}$$

$\bar{\ell}_{01}$ is 1 when the predicted class matches the CL, which indicates a classification error. By minimizing $\bar{\ell}_{01}$, we can minimize the conditional probability output of CLs.

Proposition 2. The complementary 0-1 loss is a constant multiple of the URE of the classification error:

$$R(g; \ell_{01}) = (K - 1)\,\bar{R}(g; \bar{\ell}_{01}). \tag{8}$$

In other words, the URE of the classification error has the same minimizer as the risk of the complementary 0-1 loss:

$$\bar{R}(g; \bar{\ell}_{01}) = \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}[\bar{\ell}_{01}(\bar{y}, g(x))]. \tag{9}$$

Thus, existing guarantees show that we can learn with CLs via empirical risk minimization of $\bar{R}(g; \bar{\ell}_{01})$.
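One way to see the constant factor in Proposition 2, sketched here under the uniform assumption rather than quoted from the authors' proof, is to plug the 0-1 loss $\ell_{01}$ into the rewritten loss of Equation 5. Here $\bar{\ell}_{01}(\bar{y}, f(x)) = \llbracket \bar{y} = f(x) \rrbracket$ denotes the complementary 0-1 loss, and the sum over all classes counts the $K - 1$ classes that differ from the prediction.

```latex
\begin{align*}
-(K-1)\,\ell_{01}(\bar{y}, f(x)) + \sum_{j=1}^{K} \ell_{01}(j, f(x))
  &= -(K-1)\,\llbracket \bar{y} \neq f(x) \rrbracket + (K-1) \\
  &= (K-1)\,\bigl(1 - \llbracket \bar{y} \neq f(x) \rrbracket\bigr) \\
  &= (K-1)\,\llbracket \bar{y} = f(x) \rrbracket \\
  &= (K-1)\,\bar{\ell}_{01}(\bar{y}, f(x)).
\end{align*}
```

Taking the expectation over the complementary distribution on both sides gives the URE of the classification error on the left and $K - 1$ times the complementary 0-1 risk on the right, so the two share the same minimizer.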

3.2. Surrogate Complementary Loss

To minimize the non-convex $\bar{\ell}_{01}$, a common approach in statistical learning is to select a convex surrogate loss to approximate the target loss. Because the goal here is to minimize the output for the label class, which is the opposite of most common surrogate functions, we require a new type of surrogate complementary loss (SCL) for this problem setting. Unlike ordinary surrogate losses, which are non-increasing functions of the label class output, SCLs are non-decreasing functions of the CL class output.

Baseline Methods: To better distinguish them from URE-based methods, we use $\phi$ to denote the SCL loss functions. Here we denote by $p \in \Delta^{K-1}$ the probability output obtained when $g$ passes through a softmax layer, where $\Delta^{K-1}$ is the probability simplex over the $K$ classes. Existing work on complementary learning has arrived at similar patterns that minimize the prediction output for the given label class. We include these methods as baselines in our experiments.

1. Forward correction (SCL-FWD) in Yu et al. (2018): a forward loss correction method given the transition matrix $T$:

$$\phi_{\mathrm{FWD}}(\bar{y}, g(x)) = \ell(\bar{y}, T^{\top} p). \tag{10}$$

2. Negative learning loss (SCL-NL) in Kim et al. (2019): a modified log loss for negative learning with CLs:

$$\phi_{\mathrm{NL}}(\bar{y}, g(x)) = -\log(1 - p_{\bar{y}}). \tag{11}$$

3. Exponential loss (SCL-EXP):

$$\phi_{\mathrm{EXP}}(\bar{y}, g(x)) = \exp(p_{\bar{y}}). \tag{12}$$

We unify the above-mentioned losses into the surrogate complementary loss $\phi$ framework. These loss functions all accomplish the same purpose: minimizing the complementary 0-1 loss by using $\phi$ as its surrogate:

$$\min \bar{\ell}_{01}(\bar{y}, f(x)) \to \min \phi(\bar{y}, g(x)). \tag{13}$$
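The three surrogate complementary losses can be written in a few lines. The following NumPy sketch assumes batched logits of shape `(n, K)`, 0-indexed CLs, and the log loss as $\ell$ in the forward correction; the function names and the small `eps` added for numerical safety are illustrative choices, not part of the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax giving the probability output p for each instance."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scl_fwd(logits, cl, T, eps=1e-12):
    """Forward correction: log loss of the CL under the corrected output T^T p."""
    q = softmax(logits) @ T                     # row n holds (T^T p_n)
    return -np.log(q[np.arange(len(cl)), cl] + eps)

def scl_nl(logits, cl, eps=1e-12):
    """Negative learning loss: -log(1 - p_cl)."""
    p = softmax(logits)
    return -np.log(1.0 - p[np.arange(len(cl)), cl] + eps)

def scl_exp(logits, cl):
    """Exponential loss: exp(p_cl)."""
    p = softmax(logits)
    return np.exp(p[np.arange(len(cl)), cl])

K = 4
T = (np.ones((K, K)) - np.eye(K)) / (K - 1)     # uniform transition matrix
logits = np.array([[2.0, 0.5, -1.0, 0.0],
                   [0.0, 0.0, 3.0, 1.0]])
cl = np.array([1, 0])                           # one CL per instance

fwd, nl, ex = scl_fwd(logits, cl, T), scl_nl(logits, cl), scl_exp(logits, cl)
print(fwd, nl, ex)

# Raising the logit of the CL class should increase every surrogate loss.
bumped = logits.copy()
bumped[np.arange(len(cl)), cl] += 1.0
print(scl_fwd(bumped, cl, T) > fwd, scl_nl(bumped, cl) > nl, scl_exp(bumped, cl) > ex)
```

Under the uniform transition matrix, $(T^{\top} p)_{\bar{y}} = (1 - p_{\bar{y}})/(K - 1)$, so with the log loss the forward correction and the negative learning loss differ only by the constant $\log(K - 1)$ in this sketch. All three are non-decreasing in the CL class output, as required of an SCL.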

Here we compare the proposed SCL learning process with the URE learning process, as shown in Figure 2. We use the approximation step to denote the process of replacing the 0-1 loss with its surrogate loss, and the estimation step to denote rewriting the risk from the ordinary distribution to the complementary distribution. Given the same goal of minimizing the true classification risk $R_{01}$, the two frameworks order the learning steps differently. The URE framework follows the traditional statistical learning framework by first approximating $R_{01}$ with $R_{\ell}$, and then performs the estimation step by rewriting the risk into $\bar{R}_{\bar{\ell}}$ for the complementary distribution. The SCL framework, on the other hand, performs the approximation step after the estimation step: it first rewrites the classification risk $R_{01}$ into the complementary classification risk $\bar{R}_{01}$, and then performs the approximation step with the SCL loss $\phi$, resulting in the objective $\bar{R}_{\phi}$.

The ordinary surrogate loss $\ell$ in URE is designed for ordinary labels, where it serves as an upper-bound proxy for minimizing the 0-1 classification error. However, when the training data distribution is changed to CLs, the loss is rewritten and the non-negativity of $\ell$ no longer holds, causing the negative risk term. That is, a ripple effect of error happens when the approximation error of the surrogate loss is amplified by the estimation step. In the proposed SCL framework, we sidestep this problem by placing the surrogate step after the risk rewriting. In this way, the surrogate loss $\phi$ is applied directly to its target $\bar{\ell}_{01}$, and the negative loss problem is avoided. Furthermore, the statistical properties of a surrogate loss are not all that matters; optimization properties such as smoothness and curvature are also important to consider. Whereas the estimation step of URE damages the original properties of $\ell$, the optimization properties of $\phi$ are preserved.

Figure 2. Comparison of URE learning process with the SCL framework. URE: $R_{01}$ → (approximation) → $R_{\ell}$ → (estimation) → $\bar{R}_{\bar{\ell}}$ → (empirical) → $\hat{\bar{R}}_{\bar{\ell}}$. SCL: $R_{01}$ → (estimation) → $\bar{R}_{01}$ → (approximation) → $\bar{R}_{\phi}$ → (empirical) → $\hat{\bar{R}}_{\phi}$.

Table 1. Classification accuracies

| Data set + Model | URE | NN | GA | SCL-FWD | SCL-NL | SCL-EXP |
|---|---|---|---|---|---|---|
| MNIST + Linear | 0.8503 | 0.8182 | 0.8193 | 0.9 | 0.9 | 0.9019 |
| MNIST + MLP | 0.8012 | 0.8665 | 0.9088 | 0.8965 | 0.9469 | 0.9251 |
| Kuzushiji-MNIST + Linear | 0.5613 | 0.5331 | 0.4992 | 0.6056 | 0.6056 | 0.6132 |
| Kuzushiji-MNIST + MLP | 0.5433 | 0.5683 | 0.6567 | 0.6445 | 0.7644 | 0.7184 |
| Fashion-MNIST + Linear | 0.7675 | 0.7755 | 0.7672 | 0.8274 | 0.8274 | 0.8282 |
| Fashion-MNIST + MLP | 0.7401 | 0.7829 | 0.8019 | 0.8372 | 0.8456 | 0.835 |
| CIFAR-10 + ResNet | 0.1091 | 0.3078 | 0.3738 | 0.5058 | 0.4713 | 0.492 |
| CIFAR-10 + DenseNet | 0.2909 | 0.3379 | 0.4108 | 0.5457 | 0.5394 | 0.5435 |

3.3. Classification Accuracy

In this section, we use an experiment to compare the performance of each method. Specifically, the methods fall into two categories: URE-based methods and SCL-based methods. The URE-based methods are URE, URE with non-negative risk correction (NN), and URE with gradient ascent (GA). The SCL-based methods are SCL-FWD, SCL-NL, and SCL-EXP. We used the Adam optimizer with the learning rate selected from $\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$ and trained the models for 300 epochs.

The testing accuracy is shown in Table 1. The URE performs poorly compared to other methods, especially with more flexible models. Even though NN and GA improve on URE in most tasks, the SCL methods still outperform them by a significant gap. These results support our claims. Although URE is an estimator of the risk $R_{\ell}$ with statistical guarantees, in practice it does not yield a good classifier. On the other hand, although the proposed SCL framework is biased with respect to the risk $R_{\ell}$, introducing such a bias towards minimizing the CL output yields superior results compared to URE and avoids the negative risk issue. In the next section, we discuss why the difference between the two frameworks results in such a performance gap by analyzing the loss gradient during training.

### 4. Gradient Analysis

In this section, we discuss how the proposed SCL framework outperforms URE through two gradient analysis experiments. As mentioned in Section 2, the URE diverges widely from the risk itself when only a single CL is used to estimate the risk. Here we further discuss how the SCL framework achieves its improvement by rearranging the learning process. The discussion focuses on the loss gradient; in the experiments, this is specifically the stochastic gradient used in mini-batch optimization. The analysis has two parts: the estimation of the gradient direction, and the bias-variance tradeoff of the gradient estimation error.

4.1. Directional Similarity

Since the URE is an estimator of the risk function, we expect its optimization to be similar to the risk function. Here we prove that the gradient of URE is also an unbiased gradient estimator (UGE) of the ordinary gradient.

Proposition 3. The gradient of an unbiased risk estimator is unbiased with respect to the ordinary risk gradient. That is, for an instance $(x, y)$ we have

$$\mathbb{E}_{\bar{y} \mid y}\left[\nabla_{\theta} \bar{\ell}(\bar{y}, g(x))\right] = \nabla_{\theta} \ell(y, g(x)). \tag{14}$$

Thus, the gradient of the complementary loss $\bar{\ell}$ is unbiased with respect to the gradient of the ordinary loss, in our case the gradient of the cross-entropy loss. However, does that lead to performance similar to ordinary learning? Our experimental results show that URE methods learn poorly through unbiased gradient estimation.

Figure 3. Cosine similarity comparison. Panels: (a-1) Expected: MNIST, Linear; (b-1) Expected: MNIST, MLP; (c-1) Expected: CIFAR-10, DenseNet; (a-2) Fixed: MNIST, Linear; (b-2) Fixed: MNIST, MLP; (c-2) Fixed: CIFAR-10, DenseNet.
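Proposition 3 can be checked numerically on a single instance. The sketch below is a simplification that takes gradients with respect to the logits $g(x)$ rather than the parameters $\theta$ (the chain rule carries the identity over to $\theta$), and assumes softmax cross-entropy with uniformly sampled, 0-indexed CLs. Function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(logits, label):
    """Gradient of -log p_label with respect to the logits: p - e_label."""
    g = softmax(logits)
    g[label] -= 1.0
    return g

def grad_ure(logits, cl):
    """Gradient of the rewritten loss in Equation 5 with respect to the logits."""
    K = logits.shape[0]
    total = sum(grad_ce(logits, j) for j in range(K))
    return -(K - 1) * grad_ce(logits, cl) + total

K, y = 5, 1                                     # hypothetical class count and label
logits = np.random.default_rng(0).normal(size=K)

ordinary = grad_ce(logits, y)
# Average the URE gradient over the K - 1 equally likely CLs of this instance.
expected = np.mean([grad_ure(logits, cl) for cl in range(K) if cl != y], axis=0)

print(np.allclose(expected, ordinary))
print([float(np.linalg.norm(grad_ure(logits, cl) - ordinary)) for cl in range(K) if cl != y])
```

The average over CLs matches the ordinary gradient, while the second printed list shows how far each individual fixed-CL gradient lies from it.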

In this section, we use an experiment to compare the gradient direction of ordinary learning and its complementary learning counterparts. We compare the complementary loss gradient directions with the ordinary gradient direction of the cross-entropy loss, $\nabla_{\theta} \ell(y, g(x)) = -\nabla_{\theta} \log(p_{y})$.

The quality of the complementary gradient depends on its similarity to the ordinary gradient direction, where the similarity of two gradient directions is measured by the cosine similarity $S$ of two gradient vectors $a$ and $b$, $S(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$. For the gradient directions, a reasonable assumption is that $S$ should be as large as possible, indicating a direction more similar to the ordinary gradient.

In this experiment, we compare two gradient settings:

1. Expected: The averaged gradient computed over all possible CLs of an instance $x$.

2. Fixed: The gradient computed on a single CL of an instance $x$.
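The two gradient settings and the cosine similarity can be sketched for a single instance as follows. This is a toy illustration, assuming gradients taken with respect to the logits, softmax cross-entropy as the ordinary loss, and the uniform URE; it is not the experimental protocol, which measures gradients of real models on real data.

```python
import numpy as np

def cosine_similarity(a, b):
    """S(a, b) = (a . b) / (|a| |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_ce(logits, label):
    """Ordinary cross-entropy gradient with respect to the logits."""
    g = softmax(logits)
    g[label] -= 1.0
    return g

def grad_ure(logits, cl):
    """Gradient of the uniform rewritten loss with respect to the logits."""
    K = logits.shape[0]
    total = sum(grad_ce(logits, j) for j in range(K))
    return -(K - 1) * grad_ce(logits, cl) + total

K, y = 10, 0                                    # hypothetical class count and label
logits = np.random.default_rng(1).normal(size=K)
ordinary = grad_ce(logits, y)

# Fixed: one similarity value per possible single CL.
fixed_sims = [cosine_similarity(grad_ure(logits, cl), ordinary)
              for cl in range(K) if cl != y]

# Expected: average the URE gradient over all CLs first, then compare.
expected_grad = np.mean([grad_ure(logits, cl) for cl in range(K) if cl != y], axis=0)
expected_sim = cosine_similarity(expected_grad, ordinary)

print("expected:", expected_sim)
print("fixed   :", [round(s, 3) for s in fixed_sims])
```

Because the expected URE gradient equals the ordinary gradient, its cosine similarity is 1 up to floating-point error, whereas each fixed-CL similarity is computed from a single, much noisier direction.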

We compared three complementary learning methods on their approximation of the direction of the ordinary gradient: URE, NN, and SCL-EXP. To ensure a fair comparison, the model is updated only with ordinary labels in each epoch to avoid accumulating gradient error; the complementary gradients are computed only for comparison and are not used to update the model. The SGD optimizer was used with a learning rate fixed at $10^{-2}$, and the models were trained for 300 epochs.

As the results in Figure 3 show, URE achieves an ideal gradient direction only in the case of expected CLs. In the fixed case, UGE results in gradient directions that are very different from the ordinary gradient direction. This shows that when each $x$ is fixed to a single $\bar{y}$, UGE does not estimate a reliable direction. The UGEs of the individual $\bar{y}$ point in divergent directions in order to maintain unbiasedness. Unsurprisingly, the SCL methods provide better approximations of the ordinary gradient, since their gradients do not diverge and instead focus on the CL direction given by $\bar{\ell}_{01}$.

4.2. Bias-Variance Tradeoff

In this part, we further analyze the estimation error of the complementary gradient versus the ordinary gradient, using the bias-variance decomposition technique. Bias-variance decomposition is a common approach in statistical learning used to evaluate the complexity of a learning algorithm; instead of analyzing the error of a prediction problem, we extend this framework to evaluate the estimation error of the gradient, setting the ordinary gradient as the target. We will show that URE has a much larger $L_2$ loss than SCL, caused by its large variance, despite having no bias.

We denote by $f$ the gradient step determined by ordinary labeled data $(x, y)$ and the ordinary loss $\ell$, and by $c$ the complementary gradient step determined by complementary labeled data $(x, \bar{y})$ and the complementary loss $\bar{\ell}$ (or $\phi$). $h$ denotes the expected gradient step over $[K] \setminus \{y\}$, which is the average of $c$ over every possible CL. Formally:

$$f = \nabla \ell(y, g(x)), \tag{15}$$

$$c = \nabla \bar{\ell}(\bar{y}, g(x)), \tag{16}$$

$$h = \frac{1}{K - 1} \sum_{y' \neq y} \nabla \bar{\ell}(y', g(x)). \tag{17}$$

Figure 4. Error decomposition of gradient estimators. Panels: (a-1) MSE: MNIST, Linear; (b-1) MSE: MNIST, MLP; (c-1) MSE: CIFAR-10, DenseNet; (a-2) Bias$^{2}$: MNIST, Linear; (b-2) Bias$^{2}$: MNIST, MLP; (c-2) Bias$^{2}$: CIFAR-10, DenseNet; (a-3) Variance: MNIST, Linear; (b-3) Variance: MNIST, MLP; (c-3) Variance: CIFAR-10, DenseNet.

In this setting, we set $f$ as the ground truth, which is the target for the complementary estimator $c$. We want the mean squared error (MSE) of the gradient estimation to be small:

$$\mathrm{MSE} = \mathbb{E}_{x, y, \bar{y}}\left[(f - c)^{2}\right]. \tag{18}$$

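Here we can derive the bias-variance decomposition by introducing $h$ and eliminating the remaining terms:

$$\mathbb{E}\left[(f - c)^{2}\right] = \mathbb{E}\left[(f - h + h - c)^{2}\right] \tag{19}$$

$$= \underbrace{\mathbb{E}\left[(f - h)^{2}\right]}_{\text{Bias}^{2}} + \underbrace{\mathbb{E}\left[(h - c)^{2}\right]}_{\text{Variance}} \tag{20}$$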
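The step from Equation 19 to Equation 20 drops a cross term. A short sketch of why it vanishes, assuming (as in the definition of $h$) that $h$ is the conditional mean of $c$ over the CLs given $(x, y)$, and reading products of gradients as inner products:

$$
\begin{aligned}
\mathbb{E}\big[(f - h + h - c)^{2}\big]
  &= \mathbb{E}\big[(f - h)^{2}\big]
   + 2\,\mathbb{E}\big[(f - h)(h - c)\big]
   + \mathbb{E}\big[(h - c)^{2}\big], \\
\mathbb{E}\big[(f - h)(h - c)\big]
  &= \mathbb{E}_{x, y}\Big[(f - h)\big(h - \mathbb{E}_{\bar{y} \mid x, y}[c]\big)\Big]
   = 0 .
\end{aligned}
$$

Both $f$ and $h$ are fixed once $(x, y)$ is given, so only $c$ is averaged in the inner expectation, and that average is $h$.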

Since the UGE has no bias, all of its estimation error comes from the variance term.

We run experiments to check how well the complementary gradient $c$ approximates the ordinary gradient $f$, and compare with baseline methods. The training works as follows. In each epoch, we compute three gradients: the ordinary gradient $f$, the complementary gradient $c$ of the current method, and $h$. We measure the mean squared error (MSE), the squared bias term, and the variance term according to Equation 18 and Equation 20. In each epoch, we update the model only with $f$ to maintain a fair comparison of the gradients. The optimizer is SGD with a learning rate fixed at $10^{-2}$, trained for 300 epochs.
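The measurement described above can be sketched on a toy problem. The following is a minimal NumPy illustration, not the authors' experimental code: it assumes a linear softmax model with cross-entropy loss, uniformly sampled CLs, random synthetic data, and the URE-rewritten loss $-(K-1)\ell(\bar{y}, g(x)) + \sum_j \ell(j, g(x))$ that appears in Appendix A. All function names (`ce_grad`, `ure_grad`, `decompose`) are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def ce_grad(W, x, labels):
    """Per-example cross-entropy gradient w.r.t. W for g(x) = x @ W.

    Returns an array of shape (n, d, K).
    """
    p = softmax(x @ W)
    onehot = np.eye(W.shape[1])[labels]
    return x[:, :, None] * (p - onehot)[:, None, :]


def ure_grad(W, x, comp_labels):
    """Gradient of the assumed URE loss -(K-1) * l(ybar) + sum_j l(j)."""
    n, K = x.shape[0], W.shape[1]
    total = sum(ce_grad(W, x, np.full(n, j)) for j in range(K))
    return -(K - 1) * ce_grad(W, x, comp_labels) + total


def mean_sq_norm(a):
    """Mean over examples of the squared L2 norm of each gradient."""
    return (a ** 2).sum(axis=(1, 2)).mean()


def decompose(W, x, y, ybar, comp_grad):
    """Return (MSE, Bias^2, Variance) of comp_grad against the ordinary gradient."""
    K = W.shape[1]
    f = ce_grad(W, x, y)          # ordinary gradient
    c = comp_grad(W, x, ybar)     # complementary gradient on the sampled CL
    h = np.zeros_like(f)          # average of c over all K-1 candidate CLs
    for shift in range(1, K):
        h += comp_grad(W, x, (y + shift) % K)
    h /= K - 1
    return mean_sq_norm(f - c), mean_sq_norm(f - h), mean_sq_norm(h - c)


rng = np.random.default_rng(0)
n, d, K = 512, 5, 4
x = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
ybar = (y + rng.integers(1, K, size=n)) % K   # uniform over [K] \ {y}
W = 0.1 * rng.normal(size=(d, K))

mse, bias2, var = decompose(W, x, y, ybar, ure_grad)
print(f"MSE={mse:.4e}  Bias^2={bias2:.4e}  Variance={var:.4e}")
```

Under these assumptions the squared bias of URE comes out at floating-point zero and the MSE coincides with the variance term. Passing a different complementary gradient function to `decompose` would give the corresponding decomposition for another method.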

Results are shown in Figure 4 (mean statistics are shown in Table 2 and Table 3); GA is omitted for visualization reasons. It is clear that although URE has no bias, it has a very large MSE due to its large variance. On the other hand, the SCL methods, though slightly biased, have much smaller variance than URE. This supports our claim in Section 4.1: URE creates highly divergent gradients in order to maintain overall unbiasedness, resulting in high gradient variance. In contrast, SCL introduces an inductive bias towards minimizing the CL likelihood, trading zero bias for reduced variance.

### 5. Conclusion

In this paper, we show that the unbiased risk estimator (URE) does not serve as a desirable optimization objective in weakly supervised learning problems such as learning with complementary labels. From the empirical risk aspect, the URE encounters the negative risk issue, which leads to severe overfitting under weak supervision. From the gradient aspect, the effort to maintain an unbiased gradient estimator (UGE) causes misleading directions and large variance in the loss gradient. We propose a new SCL learning framework based on the minimum likelihood principle and surrogate complementary loss functions. Though biased towards the CL, the SCL framework avoids the extremely noisy gradient problem encountered in URE. Empirical results show that SCL outperforms URE in classification accuracy and in gradient quality metrics.

Table 2. Gradient error decomposition of MNIST on linear model (Averaged over 300 epochs)

| Method | MSE | Bias^{2} | Variance |
|---|---|---|---|
| URE | 1.9692E-03 | 8.0643E-07 | 1.3907E-02 |
| NN | 1.7268E-03 | 3.7248E-03 | 1.2272E-02 |
| GA | 1.0436E+00 | 2.5829E+00 | 8.3596E+00 |
| SCL-FWD | 7.7511E-06 | 7.5037E-06 | 6.9942E-07 |
| SCL-NL | 7.7511E-06 | 7.5038E-06 | 6.9931E-07 |
| SCL-EXP | 7.9152E-06 | 7.7895E-06 | 4.3945E-07 |

Table 3. Gradient error decomposition of CIFAR-10 on DenseNet (Averaged over 300 epochs)

| Method | MSE | Bias^{2} | Variance |
|---|---|---|---|
| URE | 5.0196E-02 | 6.6855E-06 | 1.0101E-01 |
| NN | 5.2152E-02 | 2.1846E-02 | 7.0500E-02 |
| GA | 3.1350E+01 | 1.2985E+01 | 3.8302E+01 |
| SCL-FWD | 2.0237E-04 | 1.9225E-04 | 1.1051E-05 |
| SCL-NL | 2.0237E-04 | 1.9225E-04 | 1.1050E-05 |
| SCL-EXP | 2.0455E-04 | 1.9810E-04 | 7.0735E-06 |

### Acknowledgements

GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. YC and HL were partially supported by MOST 107-2628-E-002-008-MY3 and 108-2119-M-007-010.

### References

Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. In ICML, 2018.

Cao, Y. and Xu, Y. Multi-complementary and unlabeled learning for arbitrary losses and models. CoRR, abs/2001.04243, 2020. URL https://arxiv.org/abs/2001.04243.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394, 2015.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In KDD, 2008.

Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning with multiple complementary labels. In ICML, 2020.

Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NeurIPS, 2018a.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hsieh, Y.-G., Niu, G., and Sugiyama, M. Classification from positive, unlabeled and biased negative data. In ICML, 2019.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In NeurIPS, 2017.

Ishida, T., Niu, G., and Sugiyama, M. Binary classification from positive-confidence data. In NeurIPS, 2018.

Ishida, T., Niu, G., Menon, A., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, 2019.

Jin, R. and Ghahramani, Z. Learning with multiple labels. In NeurIPS, 2002.

Kaneko, T., Sato, I., and Sugiyama, M. Online multiclass classification based on prediction margin for partial feedback. arXiv preprint arXiv:1902.01056, 2019.

Kim, Y., Yim, J., Yun, J., and Kim, J. NLNL: Negative learning for noisy labels. In ICCV, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.

Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, 2019.

Lu, N., Zhang, T., Niu, G., and Sugiyama, M. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

Nagarajan, V. and Kolter, J. Z. Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS, 2019.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NeurIPS, 2013.

Niu, G., du Plessis, M. C., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NeurIPS, 2016.

Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In ICCV, 2017.

Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, 2017.

Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised AUC optimization based on positive-unlabeled learning. Machine Learning, 107(4):767–794, 2018.

Vapnik, V. Principles of risk minimization for learning theory. In NeurIPS, 1992.

Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, 2019.

Xu, Y., Gong, M., Chen, J., Liu, T., Zhang, K., and Batmanghelich, K. Generative-discriminative complementary learning. 2020.

Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, 2018.

Yu, X., Han, B., Yao, J., Niu, G., Tsang, I. W., and Sugiyama, M. How does disagreement help generalization against label corruption? In ICML, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.


### A. Proofs

A.1. Proof of Proposition 1

Proof. Let $\eta$ and $\bar{\eta}$ denote the conditional distributions $P(Y \mid X)$ and $P(\bar{Y} \mid X)$, respectively, where $\eta_k(x) = P(Y = k \mid x)$ and $\bar{\eta}_k(x) = P(\bar{Y} = k \mid x)$. Since $\bar{y}$ depends only on $y$, we have $\bar{\eta}(x) = T^{\top} \eta(x)$. The unbiased risk estimator can be derived as follows:

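$$
\begin{aligned}
R(g; \ell) &= \mathbb{E}_{(x, y) \sim D}[\ell(y, g(x))] = \mathbb{E}_{X} \mathbb{E}_{Y \sim \eta(X)}[\ell(Y, g(X))] \\
&= \mathbb{E}_{X}[\eta(X)^{\top} \ell(g(X))] = \mathbb{E}_{X}[\bar{\eta}(X)^{\top} (T^{-1}) \ell(g(X))] \\
&= \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}[e_{\bar{y}}^{\top} (T^{-1}) \ell(g(x))]
\end{aligned}
$$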
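As a quick numerical sanity check of the identity $\eta^{\top} \ell = \bar{\eta}^{\top} T^{-1} \ell$ used in the derivation, the sketch below assumes a uniform transition matrix $T$ (zero diagonal, $1/(K-1)$ elsewhere); the proof itself only needs $T$ to be invertible. The class-posterior and loss vectors are random stand-ins, not values from the paper.

```python
import numpy as np

K = 5
rng = np.random.default_rng(0)

# Assumed uniform transition matrix: T[i, j] = P(ybar = j | y = i).
T = (np.ones((K, K)) - np.eye(K)) / (K - 1)
T_inv = np.linalg.inv(T)

eta = rng.dirichlet(np.ones(K))   # stand-in for P(Y | x)
eta_bar = T.T @ eta               # P(Ybar | x) = T^T eta
loss_vec = rng.uniform(size=K)    # stand-in for the loss vector l(g(x))

ordinary = eta @ loss_vec                # eta^T l
rewritten = eta_bar @ T_inv @ loss_vec   # eta_bar^T T^{-1} l

print(ordinary, rewritten)
```

For this uniform $T$, the inverse has the closed form $-(K-1) I + \mathbf{1}\mathbf{1}^{\top}$, which is where the $-(K-1)\ell(\bar{y}, g(x)) + \sum_j \ell(j, g(x))$ form in the later proofs comes from.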

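A.2. Proof of Proposition 2

Proof. Given the following two properties of $\ell_{01}$:

$$\sum_{i=1}^{K} \ell_{01}(i, g(x)) = K - 1 \quad \text{and} \quad \ell_{01}(\bar{y}, g(x)) + \bar{\ell}_{01}(\bar{y}, g(x)) = 1,$$

an unbiased risk estimator of the classification error can be obtained by:

$$
\begin{aligned}
R(g; \ell_{01}) &= \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}\Big[-(K - 1)\,\ell_{01}(\bar{y}, g(x)) + \sum_{j=1}^{K} \ell_{01}(j, g(x))\Big] \\
&= \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}\Big[(K - 1)\big(1 - \ell_{01}(\bar{y}, g(x))\big)\Big] \\
&= (K - 1)\,\mathbb{E}_{(x, \bar{y}) \sim \bar{D}}\Big[\bar{\ell}_{01}(\bar{y}, g(x))\Big] \\
&= (K - 1)\,\bar{R}(g; \bar{\ell}_{01})
\end{aligned}
$$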
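The two properties and the pointwise identity behind this derivation can be checked by brute force. The sketch below assumes the usual definitions, with the 0-1 loss equal to 1 when the predicted class differs from the label and the complementary 0-1 loss equal to 1 when the predicted class equals the CL; the helper names are hypothetical.

```python
K = 4


def zero_one(label, pred):
    """Ordinary 0-1 loss: 1 if the prediction misses the label."""
    return float(pred != label)


def comp_zero_one(comp_label, pred):
    """Complementary 0-1 loss: 1 if the prediction hits the CL."""
    return float(pred == comp_label)


checked = 0
for pred in range(K):          # every possible predicted class
    for ybar in range(K):      # every possible complementary label
        total = sum(zero_one(j, pred) for j in range(K))
        ure_term = -(K - 1) * zero_one(ybar, pred) + total
        assert total == K - 1
        assert zero_one(ybar, pred) + comp_zero_one(ybar, pred) == 1
        assert ure_term == (K - 1) * comp_zero_one(ybar, pred)
        checked += 1

print(f"identity holds for all {checked} (prediction, CL) pairs")
```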

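A.3. Proof of Proposition 3

Proof. The proposition can be derived by using the linearity of the gradient operator:

$$
\begin{aligned}
\mathbb{E}_{\bar{y} \mid y} \nabla_{\theta} \bar{\ell}(\bar{y}, g(x)) &= \nabla_{\theta} \mathbb{E}_{\bar{y} \mid y} \bar{\ell}(\bar{y}, g(x)) \\
&= \nabla_{\theta} \frac{1}{K - 1} \sum_{y' \neq y} \Big[-(K - 1)\,\ell(y', g(x)) + \sum_{j=1}^{K} \ell(j, g(x))\Big] \\
&= \nabla_{\theta} \Big[-\sum_{y' \neq y} \ell(y', g(x)) + \sum_{j=1}^{K} \ell(j, g(x))\Big] \\
&= \nabla_{\theta} \ell(y, g(x))
\end{aligned}
$$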
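For a concrete instance, take $K = 3$ and $y = 1$, so the possible CLs are $2$ and $3$, each with probability $1/2$ under the uniform assumption. Here $\bar{\ell}$ denotes the URE-rewritten complementary loss $-(K-1)\,\ell(\bar{y}, g(x)) + \sum_{j} \ell(j, g(x))$, and $\ell_k$ is a shorthand for $\ell(k, g(x))$ used only in this example:

$$
\begin{aligned}
\mathbb{E}_{\bar{y} \mid y}\, \bar{\ell}(\bar{y}, g(x))
  &= \tfrac{1}{2}\big[(-2\ell_2 + \ell_1 + \ell_2 + \ell_3) + (-2\ell_3 + \ell_1 + \ell_2 + \ell_3)\big] \\
  &= \tfrac{1}{2}\big[2\ell_1\big] = \ell_1 ,
\end{aligned}
$$

so taking $\nabla_{\theta}$ of both sides recovers the ordinary gradient $\nabla_{\theta} \ell(1, g(x))$.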