### Learning with Limited Labeled Data

Hsuan-Tien Lin 林軒田

Dept. of Computer Science and Information Engineering, National Taiwan University (國立臺灣大學資訊工程學系)

January 26, 2022, AI & Data Science Workshop

### Outline

Learning with Limited Labeled Data

Learning from Label Proportions

Learning from Complementary Labels


### Supervised Learning

(Slide modified from my ML Foundations MOOC)

unknown target function f : X → Y

training examples $\mathcal{D} : (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$

(Figure: training examples plotted by Size and Mass, with class labels 1, 5, 10, and 25)

learning algorithm $\mathcal{A}$

final hypothesis g≈f

hypothesis set H

**supervised learning**: every input vector (picture) $\mathbf{x}_n$ comes with its label (category) $y_n$

### Semi-Supervised Learning

unknown target function f : X → Y

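training examples $\mathcal{D} : (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_M, y_M), \mathbf{x}_{M+1}, \ldots, \mathbf{x}_N$

(Figure: labeled and unlabeled training examples plotted by Size and Mass, with class labels 1, 5, 10, and 25)

learning algorithm $\mathcal{A}$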

final hypothesis g≈f

hypothesis set H

**semi-supervised learning**: a few labeled examples + many unlabeled examples

### Active Learning: Learning by ‘Asking’

Protocol ⇔ Learning Philosophy

• batch: ‘duck feeding’

• **active: ‘question asking’** (iteratively), that is, query $y_n$ of **chosen** $\mathbf{x}_n$

unknown target function f : X → Y

labeled training examples: (picture, +1), (picture, +1), (picture, +1), (picture, -1), (picture, -1), (picture, -1)

unlabeled training examples: (picture), (picture), (picture), (picture)

learning algorithm $\mathcal{A}$

final hypothesis $g \approx f$

**active learning** (on top of semi-supervised): a few labeled examples + unlabeled pool + a few strategically-queried labels
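The query protocol can be sketched as a pool-based loop. This is a minimal illustration, not the method behind the slides: the 1-D threshold learner, the `oracle` function standing in for the unknown target $f$, the threshold value 0.305, and the use of uncertainty sampling (query the pool point nearest the current boundary) are all assumptions made for the example.

```python
def oracle(x, threshold=0.305):
    """Hypothetical labeler standing in for the unknown target f."""
    return 1 if x >= threshold else -1


def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: midpoint between the two classes."""
    negatives = [x for x, y in labeled if y == -1]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learn(pool, labeled, n_queries):
    """Iteratively query the label of the pool point nearest the boundary."""
    pool = list(pool)
    labeled = list(labeled)
    for _ in range(n_queries):
        if not pool:
            break
        boundary = fit_threshold(labeled)
        # Uncertainty sampling (an assumed strategy): the least certain
        # point for a threshold model is the one closest to the boundary.
        x_query = min(pool, key=lambda x: abs(x - boundary))
        pool.remove(x_query)
        labeled.append((x_query, oracle(x_query)))
    return fit_threshold(labeled), labeled


# Unlabeled pool: a grid of 199 points; two labeled seeds at the extremes.
pool = [i / 100 for i in range(-99, 100)]
seeds = [(-1.0, oracle(-1.0)), (1.0, oracle(1.0))]
boundary, labeled = active_learn(pool, seeds, n_queries=8)
print(f"boundary after 8 queries: {boundary:.3f}")
```

With a threshold model, each strategically chosen query roughly halves the interval that can contain the boundary, which is why a handful of queried labels goes a long way here.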


### Weakly-Supervised Learning: Learning without True Labels

(Figure panels: (a) positive-unlabeled learning, (b) learning with noisy labels, (c) learning with complementary labels)

• positive-unlabeled: some of the true $y_n = +1$ revealed

• noisy: (cheaper) noisy label $y'_n$ instead of true $y_n$

• complementary: ‘not label’ $\bar{y}_n$ instead of true $y_n$

**weakly-supervised**: a few (or no) labeled examples + many ‘related’ and easier-to-get labels

### Our Ongoing Research Quests

Learning from Limited Labeled Data (L³D)

• in **supervised learning**
  • e.g. **uneven-margin augmentation** for imbalanced learning?

• in **interactive learning**
  • e.g. can **strategically** obtained labels push L³D to the extreme?

• in **generative learning**
  • e.g. **development with cloned data** first, validate with limited labeled data later?

• in **weakly-supervised learning**
  • e.g. **sketch with weak labels** first, refine with limited labeled data later, or maybe **learn from many weak labels** only?

### Some of Our Selected Work

(Figure: learning flow connecting the target, the data, the learning algorithm, the learning model, and the good learned hypothesis; markers (1) to (5) locate the works listed below)

1. zero-shot learning (ICLR 2021): **no labeled data but only descriptions for new classes**
2. learning from complementary labels (ICML 2020): **cheaper weakly labeled data**
3. robust estimation (gaze: BMVC 2020, typhoon: KDD 2018): **domain-driven data augmentation**
4. robust generation (NeurIPS 2021): **math-driven objective augmentation**
5. active learning (EMNLP 2020): **a few actively labeled data**

### Quick Stories about Augmentation (1/3)

(Ashesh, 2021)

### Quick Stories about Augmentation (2/3)

(Chen, 2018)

### Quick Stories about Augmentation (3/3)

(Chen, 2021)


### Learning from Label Proportions

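Training: each bag comes with class proportions over [a, o, s, k]

• bag { 4 pictures } with proportions [1/2, 1/4, 1/4, 0]

• bag { 4 pictures } with proportions [1/4, 1/2, 0, 1/4]

Test: a single picture with label **?**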

motivations

• expensive labeling

• privacy issues

LLP: learn an instance-level classifier with **proportion labels**

### LLP Setting

input

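Given $M$ bags $B_1, \ldots, B_M$, where the $m$-th bag contains a set of instances $\mathcal{X}_m$ and a proportion label $\mathbf{p}_m$, defined by

$$\mathbf{p}_m = \frac{1}{|\mathcal{X}_m|} \sum_{n : \mathbf{x}_n \in \mathcal{X}_m} \mathbf{e}^{(y_n)}, \qquad \bigcup_{m=1}^{M} \mathcal{X}_m = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\},$$

where $\mathbf{e}^{(y_n)}$ is the one-hot vector of the label $y_n$.

output

learn a usual instance-level classifier $g_\theta : \mathbb{R}^D \to$ estimated class probabilities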
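As a small worked example of the proportion label $\mathbf{p}_m$, the sketch below averages one-hot label vectors within a bag. The function name `proportion_label` and the concrete label assignments are hypothetical; the labels are chosen only so that the two bags reproduce the proportions [1/2, 1/4, 1/4, 0] and [1/4, 1/2, 0, 1/4] from the training illustration, with class indices 0 to 3 standing for [a, o, s, k].

```python
from collections import Counter


def proportion_label(labels, num_classes):
    """Average of one-hot label vectors, i.e. per-class label frequencies."""
    counts = Counter(labels)
    bag_size = len(labels)
    return [counts.get(k, 0) / bag_size for k in range(num_classes)]


# Hypothetical hidden instance labels for two bags of four instances each.
bag_1 = [0, 0, 1, 2]
bag_2 = [0, 1, 1, 3]

p_1 = proportion_label(bag_1, num_classes=4)
p_2 = proportion_label(bag_2, num_classes=4)
print(p_1, p_2)
```

Only `p_1` and `p_2` would be visible to the learner in the LLP setting; the per-instance labels inside each bag stay hidden.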


### Our Sol.: LLP w/ Consistency Regularization

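(Tsai, 2020)

vanilla: bag-level proportion loss

$$L_{prop} = \mathrm{KL}(\mathbf{p} \,\|\, \hat{\mathbf{p}})$$

• keep the ‘distance’ between target $\mathbf{p}$ and estimated $\hat{\mathbf{p}} = \frac{1}{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{X}} g_\theta(\mathbf{x})$ small

• extension of **standard cross-entropy loss**

instance-level regularization

$$L_{cons} = \frac{1}{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{X}} \mathrm{KL}(g_\theta(\mathbf{x}) \,\|\, g_\theta(\hat{\mathbf{x}}))$$

• keep the ‘difference’ between $\mathbf{x}$ and perturbed $\hat{\mathbf{x}}$ small

• mature technique for **semi-supervised learning**

**LLP with consistency regularization**:

$$L = L_{prop} + \alpha L_{cons}$$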
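The combined objective can be sketched on plain Python lists. This is an illustrative sketch rather than the authors' implementation: the function names, the small `eps` stabilizer, and the example predictions are assumptions, and the classifier outputs $g_\theta(\mathbf{x})$ and $g_\theta(\hat{\mathbf{x}})$ are passed in as precomputed probability vectors.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def proportion_loss(p_target, preds):
    """L_prop: KL between the bag's proportion label and the mean prediction."""
    n = len(preds)
    p_hat = [sum(column) / n for column in zip(*preds)]
    return kl(p_target, p_hat)


def consistency_loss(preds, preds_perturbed):
    """L_cons: mean KL between predictions on x and on perturbed x."""
    pairs = zip(preds, preds_perturbed)
    return sum(kl(a, b) for a, b in pairs) / len(preds)


def llp_loss(p_target, preds, preds_perturbed, alpha=1.0):
    """L = L_prop + alpha * L_cons."""
    return (proportion_loss(p_target, preds)
            + alpha * consistency_loss(preds, preds_perturbed))


# One bag with two instances and three classes (made-up numbers).
p_target = [0.5, 0.5, 0.0]
preds = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
preds_perturbed = [[0.8, 0.2, 0.0], [0.2, 0.8, 0.0]]
print(llp_loss(p_target, preds, preds_perturbed, alpha=0.5))
```

For a bag of size one, the proportion label is a one-hot vector and `proportion_loss` reduces to $-\log$ of the predicted probability of the true class, which is the sense in which the proportion loss extends the standard cross-entropy loss.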


### Consistency Loss by Virtual Adversarial Training

smoothness assumption

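if $\mathbf{x}_i \approx \mathbf{x}_j$, then $y_i \approx y_j$

goal

encourage the classifier to produce consistent outputs for neighbors

Virtual Adversarial Training (Miyato, 2018)

generate a perturbed example $\hat{\mathbf{x}}$ that most likely causes the model to misclassify

$$\hat{\mathbf{x}} = \operatorname*{argmax}_{\|\hat{\mathbf{x}} - \mathbf{x}\| \le r} \mathrm{KL}(g_\theta(\mathbf{x}) \,\|\, g_\theta(\hat{\mathbf{x}}))$$

consistency loss w/ VAT:

$$L_{cons}(\theta) = \mathrm{KL}(g_\theta(\mathbf{x}) \,\|\, g_\theta(\hat{\mathbf{x}}))$$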
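To make the argmax definition concrete, the sketch below searches for $\hat{\mathbf{x}}$ on the sphere of radius $r$ around $\mathbf{x}$ by trying random directions and keeping the one with the largest KL divergence. This random search is a crude stand-in chosen for readability; it is not the gradient-based approximation used in practical VAT implementations. The linear-softmax classifier, the weight matrix `W`, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict(W, x):
    """Hypothetical linear-softmax classifier standing in for g_theta."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def adversarial_example(W, x, r, rng, n_candidates=64):
    """Random-search approximation of argmax KL(g(x) || g(x_hat)), ||x_hat - x|| = r."""
    p = predict(W, x)
    best_x, best_kl = list(x), 0.0
    for _ in range(n_candidates):
        d = random_unit_vector(len(x), rng)
        x_hat = [xi + r * di for xi, di in zip(x, d)]
        value = kl(p, predict(W, x_hat))
        if value > best_kl:
            best_x, best_kl = x_hat, value
    return best_x, best_kl


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # made-up weights, 3 classes
x = [0.5, -0.2]
r = 0.1
x_hat, l_cons = adversarial_example(W, x, r, random.Random(0))
print(x_hat, l_cons)
```

The returned KL value is the per-example consistency loss at the selected $\hat{\mathbf{x}}$; averaging it over the instances in a bag gives the regularizer added to the proportion loss.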


### Experimental Results

Accuracy (%) for each bag size:

| Dataset | Method | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|
| SVHN | vanilla | 95.28 | 95.20 | 94.41 | 88.93 | 12.64 |
| SVHN | LLP-VAT | 95.66 | 95.73 | 94.60 | 91.24 | 11.18 |
| CIFAR10 | vanilla | 88.77 | 85.02 | 70.68 | 47.48 | 38.69 |
| CIFAR10 | LLP-VAT | 89.30 | 85.41 | 72.49 | 50.78 | 41.62 |
| CIFAR100 | vanilla | 58.58 | 48.09 | 20.66 | 5.82 | 2.82 |
| CIFAR100 | LLP-VAT | 59.47 | 48.98 | 22.84 | 9.40 | 3.29 |

**consistency regularization** (VAT) helps!

### Take-Home Message

• LLP: a typical **weakly-supervised** learning problem

• **consistency regularization** helps
  • can other regularization help?

• anyone using?
  • 50% **accuracy on 10 classes for big bags?!**
  • **no real-world data yet**


### Fruit Labeling Task (Image from AICup in 2020)

hard: true label

• orange **?**

• mango **?**

• cherry

• banana

easy: complementary label

• orange

• mango

• cherry

• banana ✗

complementary: **less labeling cost/expertise** required

### Comparison

Ordinary (Supervised) Learning

training: $\{(\mathbf{x}_n = \text{picture}, y_n = \text{mango})\}$ → **classifier**

Complementary Learning

training: $\{(\mathbf{x}_n = \text{picture}, \bar{y}_n = \text{banana})\}$ → **classifier**

testing goal: **classifier**(picture) → cherry

ordinary versus complementary: same goal via **different training data**

### Learning with Complementary Labels Setup

Given

$N$ examples (input $\mathbf{x}_n$, complementary label $\bar{y}_n$) $\in \mathcal{X} \times \{1, 2, \cdots, K\}$ in data set $\mathcal{D}$ such that $\bar{y}_n \neq y_n$ for some hidden ordinary label $y_n \in \{1, 2, \cdots, K\}$.

Goal

a multi-class classifier $g(\mathbf{x})$ that **closely predicts** (0/1 error) the ordinary label $y$ associated with some **unseen** inputs $\mathbf{x}$

LCL model design: connecting **complementary & ordinary**
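A complementary data set of this form can be simulated from ordinary labels. The sketch below draws each complementary label uniformly from the $K - 1$ classes other than the hidden ordinary label; the uniform choice is an assumption of the example, and the function name is hypothetical. Classes are indexed $0, \ldots, K-1$ here rather than $1, \ldots, K$.

```python
import random


def complementary_label(y, num_classes, rng):
    """Draw a label uniformly from the num_classes - 1 classes other than y."""
    y_bar = rng.randrange(num_classes - 1)
    # Skip over the ordinary label so that y itself is never returned.
    return y_bar if y_bar < y else y_bar + 1


rng = random.Random(0)
ordinary_labels = [0, 2, 1, 3, 2]  # hidden labels, made up for illustration
complementary_labels = [complementary_label(y, 4, rng) for y in ordinary_labels]
print(complementary_labels)
```

Only the complementary labels would be handed to the learner; the ordinary labels stay hidden and are used solely to guarantee $\bar{y}_n \neq y_n$.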


### Unbiased Risk Estimation for LCL

Ordinary Learning

• empirical risk minimization (ERM) on training data

**risk:** $\mathbb{E}_{(\mathbf{x}, y)}[\ell(y, g(\mathbf{x}))]$

**empirical risk:** $\mathbb{E}_{(\mathbf{x}_n, y_n) \in \mathcal{D}}[\ell(y_n, g(\mathbf{x}_n))]$

• loss $\ell$: usually **surrogate** of 0/1 error

LCL (Ishida, 2019)

• rewrite the loss $\ell$ to $\bar{\ell}$, such that

**unbiased risk estimator:** $\mathbb{E}_{(\mathbf{x}, \bar{y})}[\bar{\ell}(\bar{y}, g(\mathbf{x}))] = \mathbb{E}_{(\mathbf{x}, y)}[\ell(y, g(\mathbf{x}))]$

under assumptions (e.g. uniform complementary labels)

• LCL by minimizing **URE**

### URE Overfits Easily

$$\ell = -\log(p(y \mid \mathbf{x}))$$

$$\bar{\ell} = (K - 1) \log(p(\bar{y} \mid \mathbf{x})) - \sum_{k=1}^{K} \log(p(k \mid \mathbf{x}))$$

ordinary risk and URE very different

• $\ell > 0$ → ordinary risk non-negative

• small $p(\bar{y} \mid \mathbf{x})$ (often) → possibly very negative $\bar{\ell}$, so the **empirical** URE can be negative on **some observed** $\bar{y}$

• negative empirical URE **drags minimization** towards overfitting
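Both properties can be checked numerically on a single prediction. The sketch below assumes uniform complementary labels and a made-up probability vector `p`; the function names are hypothetical. Averaging the rewritten loss over the $K - 1$ possible complementary labels recovers the ordinary cross-entropy, while individual complementary labels can still yield negative values.

```python
import math


def ce_loss(y, p):
    """Ordinary cross-entropy loss: -log p(y | x)."""
    return -math.log(p[y])


def ure_loss(y_bar, p):
    """Rewritten loss: (K - 1) log p(y_bar | x) - sum_k log p(k | x)."""
    num_classes = len(p)
    return (num_classes - 1) * math.log(p[y_bar]) - sum(math.log(pk) for pk in p)


p = [0.7, 0.2, 0.05, 0.05]  # made-up predicted probabilities, K = 4
y = 0                        # hidden ordinary label

# Under the uniform assumption, each other class is equally likely as y_bar.
others = [k for k in range(len(p)) if k != y]
per_label = {k: ure_loss(k, p) for k in others}
expected_ure = sum(per_label.values()) / len(others)

print("per complementary label:", per_label)
print("expectation:", expected_ure, "ordinary loss:", ce_loss(y, p))
```

The expectation matches the ordinary loss, yet the losses for complementary labels 2 and 3, where the predicted probability is small, are negative. An empirical average over a finite sample sees only some of these labels, which is how the empirical URE can dip below zero.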

how can we avoid negative empirical URE?


### Proposed Framework

(Chou, 2021)

Minimize Complementary 0/1

• our goal: minimize 0/1 loss instead of $\ell$

• unbiased estimator of $R_{01}$ is **simple**

$$\bar{R}_{01} : \mathbb{E}_{\bar{y}}[\bar{\ell}_{01}(\bar{y}, g(\mathbf{x}))] = \ell_{01}(y, g(\mathbf{x}))$$

• $\bar{\ell}_{01}$ as the complementary 0/1 loss:

$$\bar{\ell}_{01}(\bar{y}, g(\mathbf{x})) = [\![\bar{y} = g(\mathbf{x})]\!]$$

Surrogate Complementary Loss (SCL): surrogate **after** complementary 0/1
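The link between the complementary 0/1 loss and the ordinary 0/1 loss can be verified by enumeration. The sketch assumes uniform complementary labels; function names are hypothetical. Under that assumption the expectation of the indicator $[\![\bar{y} = g(\mathbf{x})]\!]$ works out to $\ell_{01}(y, g(\mathbf{x})) / (K - 1)$, so the two agree once the indicator is scaled by the constant $K - 1$, which does not change the minimizer.

```python
def zero_one_loss(y, prediction):
    """Ordinary 0/1 loss: 1 when the prediction misses the ordinary label."""
    return 1 if prediction != y else 0


def complementary_zero_one_loss(y_bar, prediction):
    """Complementary 0/1 loss: 1 when the prediction hits the 'not' label."""
    return 1 if prediction == y_bar else 0


def expected_complementary_loss(y, prediction, num_classes):
    """Expectation over y_bar drawn uniformly from the classes other than y."""
    others = [k for k in range(num_classes) if k != y]
    total = sum(complementary_zero_one_loss(k, prediction) for k in others)
    return total / len(others)


K = 4
y = 1
for prediction in range(K):
    scaled = (K - 1) * expected_complementary_loss(y, prediction, K)
    print(prediction, scaled, zero_one_loss(y, prediction))
```

A correct prediction can never coincide with a complementary label, so its expected complementary loss is zero; a wrong prediction coincides with probability $1 / (K - 1)$.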


### Illustrative Difference between URE and SCL

(Figure: the URE path runs through $R_{01}$, $R_{\ell}$, $R_{\bar{\ell}}$, and the empirical $\hat{R}_{\bar{\ell}}$; the SCL path runs through $R_{01}$, $R_{\phi}$, and the empirical $\hat{R}_{\phi}$; steps are marked A (approximation) and E (estimation))

URE: Ripple effect of errors

• Theoretical motivation (Ishida, 2017)

• Estimation step (E) amplifies approximation error (A) in $\ell$

SCL: ‘Directly’ minimize complementary likelihood

• Non-negative loss $\phi$

• Practically prevents ripple effect

### Negative Risk Avoided

Unbiased Risk Estimator (URE)

URE loss $\bar{\ell}_{CE}$ from cross-entropy $\ell_{CE}$,

$$\bar{\ell}_{CE}(\bar{y}, g(\mathbf{x})) = \underbrace{(K - 1) \log(p(\bar{y} \mid \mathbf{x}))}_{\text{negative loss term}} - \sum_{j=1}^{K} \log(p(j \mid \mathbf{x}))$$

can go negative.

Surrogate Complementary Loss (SCL)

a surrogate of $\bar{\ell}_{01}$ (Kim, 2019)

$$\phi_{NL}(\bar{y}, g(\mathbf{x})) = -\log(1 - p(\bar{y} \mid \mathbf{x}))$$

remains non-negative.
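The contrast shows up when the predicted probability of the complementary label shrinks, which is the direction training pushes in. The sketch below evaluates both losses for a few made-up values of $p(\bar{y} \mid \mathbf{x})$ with $K = 3$, spreading the remaining mass evenly over the other classes; the helper names and the probability values are assumptions for illustration.

```python
import math


def ure_ce_loss(y_bar, p):
    """URE loss from cross-entropy: (K - 1) log p(y_bar|x) - sum_j log p(j|x)."""
    num_classes = len(p)
    return (num_classes - 1) * math.log(p[y_bar]) - sum(math.log(pj) for pj in p)


def scl_nl_loss(y_bar, p):
    """SCL negative-learning loss: -log(1 - p(y_bar|x))."""
    return -math.log(1.0 - p[y_bar])


def prediction_with_mass(q, num_classes=3):
    """Probability q on class 0, the rest spread evenly over other classes."""
    rest = (1.0 - q) / (num_classes - 1)
    return [q] + [rest] * (num_classes - 1)


masses = [0.5, 0.1, 1e-3, 1e-6]
ure_values = [ure_ce_loss(0, prediction_with_mass(q)) for q in masses]
scl_values = [scl_nl_loss(0, prediction_with_mass(q)) for q in masses]

for q, u, s in zip(masses, ure_values, scl_values):
    print(f"p(y_bar|x)={q:g}  URE={u:.4f}  SCL={s:.6f}")
```

As $p(\bar{y} \mid \mathbf{x})$ goes to zero, the URE loss keeps decreasing without a lower bound, whereas the SCL loss approaches its floor of zero.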


### Classification Accuracy

Methods

1. Unbiased risk estimator (URE) (Ishida, 2019)
2. Surrogate complementary loss (SCL)

Table: URE and NN are based on $\bar{\ell}$ rewritten from cross-entropy loss, while SCL is based on exponential loss $\phi_{EXP}(\bar{y}, g(\mathbf{x})) = \exp(p_{\bar{y}})$.

| Data set + Model | URE | SCL |
|---|---|---|
| MNIST + Linear | 0.850 | **0.902** |
| MNIST + MLP | 0.801 | **0.925** |
| CIFAR10 + ResNet | 0.109 | **0.492** |
| CIFAR10 + DenseNet | 0.291 | **0.544** |

### Gradient Analysis

Gradient Direction of URE

• Very divergent directions on each $\bar{y}$ to maintain unbiasedness

• Low correlation to the target $\ell_{01}$

(Figure: Illustration of URE, with the ordinary gradient $\nabla \ell(y, g(\mathbf{x}))$ and the complementary gradient $\nabla \bar{\ell}(\bar{y}, g(\mathbf{x}))$)

Gradient Direction of SCL

• Targets the minimum likelihood objective

• High correlation to the target $\ell_{01}$

### Gradient Estimation Error

Bias-Variance Decomposition

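$$\mathrm{MSE} = \mathbb{E}(\mathbf{f} - \mathbf{c})^2 = \underbrace{\mathbb{E}(\mathbf{f} - \mathbf{h})^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}(\mathbf{h} - \mathbf{c})^2}_{\text{Variance}}$$

Gradient Estimation

1. Ordinary gradient $\mathbf{f} = \nabla \ell(y, g(\mathbf{x}))$
2. Complementary gradient $\mathbf{c} = \nabla \bar{\ell}(\bar{y}, g(\mathbf{x}))$
3. Expected complementary gradient $\mathbf{h}$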
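The decomposition can be checked on a toy case. In the sketch below, the ordinary gradient `f` is fixed, the complementary gradient takes one of three equally likely values, and `h` is their mean; all vectors are made up for illustration and are not gradients of any actual model.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


f = [1.0, -0.5]  # ordinary gradient (made up)
# Complementary gradients, one per equally likely complementary label.
candidates = [[2.0, 0.0], [0.5, -1.5], [0.8, 0.3]]
h = mean_vector(candidates)  # expected complementary gradient

mse = sum(squared_distance(f, c) for c in candidates) / len(candidates)
bias_squared = squared_distance(f, h)
variance = sum(squared_distance(h, c) for c in candidates) / len(candidates)

print(mse, bias_squared + variance)
```

The identity holds because `h` is the mean of the candidates, so the cross term between `f - h` and `h - c` averages to zero. An unbiased estimator corresponds to `h` equal to `f`, which leaves the variance as the whole error.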


### Bias-Variance Tradeoff

(Figure panels: (a) MSE, (b) Bias², (c) Variance)

Findings

• SCL reduces variance by introducing small bias (towards $y$)

| | Bias | Variance | MSE |
|---|---|---|---|
| URE | 0 | Big | Big |
| SCL | Small | Small | Small |

### Take-Home Message

• LCL: another popular **weakly-supervised** learning problem

• **surrogate on complementary** helps
  • avoid negative loss
  • lower gradient variance (with trade-off in bias)

• anyone using?
  • **uniform complementary generation unrealistic** (ongoing)
  • **need stronger theoretical guarantee** (ongoing)