Learning with Limited Labeled Data
Hsuan-Tien Lin 林軒田
Dept. of Computer Science and Information Engineering, National Taiwan University
January 26, 2022, AI & Data Science Workshop
Outline
Learning with Limited Labeled Data
Learning from Label Proportions
Learning from Complementary Labels
Supervised Learning
(Slide Modified from My ML Foundations MOOC)
unknown target function f : X → Y
training examples D : (x1,y1), · · · , (xN,yN)
[figure: training examples plotted by Size vs. Mass]
learning algorithm
A
final hypothesis g≈f
hypothesis set H
supervised learning: every input vector (picture) x_n with its label (category) y_n
Semi-Supervised Learning
unknown target function f : X → Y
training examples D : (x_1, y_1), · · · , (x_M, y_M),
x_{M+1}, . . . , x_N
[figure: the same Size vs. Mass plot, now including unlabeled points]
learning algorithm
A
final hypothesis g≈f
hypothesis set H
semi-supervised learning:
a few labeled examples + many unlabeled examples
Active Learning: Learning by ‘Asking’
Protocol ⇔ Learning Philosophy
• batch: ‘duck feeding’
• active: ‘question asking’ (iteratively)
— query y_n of chosen x_n
unknown target function f : X → Y
[figure: a few labeled training examples (+1 / -1) and a pool of unlabeled training examples]
learning algorithm
A
final hypothesis g≈f
active learning (on top of semi-supervised):
a few labeled examples + unlabeled pool + a few strategically-queried labels
Weakly-Supervised Learning:
Learning without True Labels
(a) positive-unlabeled learning (b) learning with noisy labels (c) learning with complementary labels
• positive-unlabeled: some of the true y_n = +1 revealed
• noisy: (cheaper) noisy label y'_n instead of true y_n
• complementary: ‘not label’ ȳ_n instead of true y_n
weakly-supervised:
a few (or no) labeled examples + many ‘related’ and easier-to-get labels
Our Ongoing Research Quests
Learning from Limited Labeled Data (L3D)
• in supervised learning
• e.g. uneven-margin augmentation for imbalanced learning?
• in interactive learning
• e.g. can strategically obtained labels push L3D to the extreme?
• in generative learning
• e.g. develop with cloned data first, validate with limited labeled data later?
• in weakly-supervised learning
• e.g. sketch with weak labels first, refine with limited labeled data later — or maybe learn from many weak labels only?
Some of Our Selected Work
[figure: learning pipeline (target → data → learning algorithm + learning model → good learned hypothesis), annotated with spots (1)–(5) for the works below]
1 zero-shot learning (ICLR 2021):
no labeled data but only descriptions for new classes
2 learning from complementary labels (ICML 2020): cheaper weakly labeled data
3 robust estimation (gaze: BMVC 2020, typhoon: KDD 2018):
domain-driven data augmentation
4 robust generation (NeurIPS 2021): math-driven objective augmentation
5 active learning (EMNLP 2020): a few actively labeled data
Quick Stories about Augmentation (1/3)
(Ashesh, 2021)
Quick Stories about Augmentation (2/3)
(Chen, 2018)
Quick Stories about Augmentation (3/3)
(Chen, 2021)
Learning from Label Proportions
Training
bag of fruit images → proportion label over classes [a, o, s, k]
{ , , , } → [1/2, 1/4, 1/4, 0]
{ , , , } → [1/4, 1/2, 0, 1/4]
Test
single image → ? (instance-level prediction)
motivations
• expensive labeling
• privacy issues
LLP: learn an instance-level classifier with proportion labels
LLP Setting
input
Given M bags B_1, . . . , B_M, where the m-th bag contains a set of instances X_m and a proportion label p_m, defined by

p_m = (1/|X_m|) Σ_{n: x_n ∈ X_m} e(y_n),    with ∪_{m=1}^{M} X_m = {x_1, . . . , x_N}

(e(y) denotes the one-hot vector of label y)
output
learn a usual instance classifier g_θ: R^D → estimated probability
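The proportion label above is just an average of one-hot vectors; a minimal sketch (the helper name is ours, not from the talk):

```python
import numpy as np

def proportion_label(bag_labels, num_classes):
    """p_m = (1/|X_m|) * sum of one-hot vectors e(y_n) over the bag.

    Illustrative helper; name and signature are our own.
    """
    one_hot = np.eye(num_classes)[bag_labels]  # e(y_n) for each instance
    return one_hot.mean(axis=0)                # average over the bag

# a bag of 4 instances with hidden labels over K = 4 classes
p_m = proportion_label(np.array([0, 0, 1, 2]), num_classes=4)
# -> proportions [1/2, 1/4, 1/4, 0], matching the example bags above
```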
Our Sol.: LLP w/ Consistency Regularization (Tsai, 2020)
vanilla: bag-level proportion loss L_prop = KL(p ‖ p̂)
• ‘distance’ between target p and estimate p̂ = (1/|X|) Σ_{x∈X} g_θ(x) kept small
• extension of the standard cross-entropy loss
instance-level regularization L_cons = (1/|X|) Σ_{x∈X} KL(g_θ(x) ‖ g_θ(x̂))
• ‘difference’ between x and perturbed x̂ kept small
• a mature technique from semi-supervised learning
LLP with consistency regularization:
L = L_prop + α L_cons
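The combined objective can be sketched directly from the two terms above (a toy version; the function names and the default value of alpha are our assumptions, not the paper's):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def llp_vat_loss(p_target, probs, probs_perturbed, alpha=1.0):
    """L = L_prop + alpha * L_cons, following the slide's decomposition.

    probs[i] plays g_theta(x_i); probs_perturbed[i] plays g_theta(x_hat_i).
    """
    p_hat = probs.mean(axis=0)                 # estimated bag proportion
    l_prop = kl(p_target, p_hat)               # bag-level proportion loss
    l_cons = float(np.mean([kl(a, b) for a, b in zip(probs, probs_perturbed)]))
    return l_prop + alpha * l_cons

probs = np.array([[0.7, 0.3], [0.3, 0.7]])     # predictions on a 2-instance bag
loss = llp_vat_loss([0.5, 0.5], probs, probs)  # no perturbation -> L_cons = 0
```

With the bag average matching the target proportion and unperturbed inputs, both terms vanish, so the loss is zero.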
Consistency Loss by Virtual Adversarial Training
smoothness assumption
if x_i ≈ x_j, then y_i ≈ y_j
goal
encourage the classifier to produce consistent outputs for neighbors
Virtual Adversarial Training (Miyato, 2018)
generate a perturbed example x̂ that most likely causes the model to misclassify:

x̂ = argmax_{‖x̂ − x‖ ≤ r} KL(g_θ(x) ‖ g_θ(x̂))

consistency loss w/ VAT:
L_cons(θ) = KL(g_θ(x) ‖ g_θ(x̂))
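Miyato et al. approximate the inner argmax with gradients (power iteration); purely to illustrate the objective, here is a crude random-search stand-in on a toy softmax model (all names and the toy model are our own):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def worst_perturbation(x, g, r=0.5, num_candidates=64, seed=0):
    """Stand-in for VAT's inner maximization: sample random directions of
    radius r and keep the one maximizing KL(g(x) || g(x_hat)).

    Crude, but it makes the argmax objective concrete.
    """
    rng = np.random.default_rng(seed)
    base = g(x)
    best_x, best_kl = x, -1.0
    for _ in range(num_candidates):
        d = rng.normal(size=x.shape)
        x_hat = x + r * d / np.linalg.norm(d)   # enforce ||x_hat - x|| = r
        score = kl(base, g(x_hat))
        if score > best_kl:
            best_x, best_kl = x_hat, score
    return best_x

W = np.array([[2.0, -1.0], [-1.0, 2.0]])       # toy 2-class model g(x) = softmax(Wx)
g = lambda x: softmax(W @ x)
x = np.array([0.2, 0.1])
x_hat = worst_perturbation(x, g)               # most-KL-increasing neighbor found
```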
Experimental Results
Accuracy (%) under different bag sizes:

Dataset    Method     16     32     64     128    256
SVHN       vanilla    95.28  95.20  94.41  88.93  12.64
           LLP-VAT    95.66  95.73  94.60  91.24  11.18
CIFAR10    vanilla    88.77  85.02  70.68  47.48  38.69
           LLP-VAT    89.30  85.41  72.49  50.78  41.62
CIFAR100   vanilla    58.58  48.09  20.66   5.82   2.82
           LLP-VAT    59.47  48.98  22.84   9.40   3.29
consistency regularization (VAT) helps!
Take-Home Message
• LLP: a typical weakly-supervised learning problem
• consistency regularization helps
— can other regularization help?
• anyone using?
• only ~50% accuracy on 10 classes for big bags?!
• no real-world data yet
Learning from Complementary Labels
Fruit Labeling Task (Image from AICup in 2020)
hard: true label
• orange?
• mango?
• cherry
• banana
easy: complementary label
• orange
• mango
• cherry
• banana ✗
complementary: less labeling cost/expertise required
Comparison
Ordinary (Supervised) Learning
training: {(x_n = , y_n = mango)} → classifier
Complementary Learning
training: {(x_n = , ȳ_n = banana)} → classifier
testing goal: classifier( ) → cherry
ordinary versus complementary:
same goal via different training data
Learning with Complementary Labels Setup
Given
N examples (input x_n, complementary label ȳ_n) ∈ X × {1, 2, · · · K} in data set D, such that ȳ_n ≠ y_n for some hidden ordinary label y_n ∈ {1, 2, · · · K}.
Goal
a multi-class classifier g(x) that closely predicts (0/1 error) the ordinary label y associated with some unseen input x
LCL model design: connecting complementary & ordinary
Unbiased Risk Estimation for LCL
Ordinary Learning
• empirical risk minimization (ERM) on training data
risk: E_{(x,y)}[ℓ(y, g(x))]    empirical risk: E_{(x_n,y_n)∈D}[ℓ(y_n, g(x_n))]
• loss ℓ: usually a surrogate of the 0/1 error
LCL (Ishida, 2019)
• rewrite the loss ℓ to ℓ̄, such that
unbiased risk estimator: E_{(x,y)}[ℓ(y, g(x))] = E_{(x,ȳ)}[ℓ̄(ȳ, g(x))]
under assumptions (e.g. uniform complementary labels)
• LCL by minimizing the URE
URE Overfits Easily
ℓ(y, g(x)) = − log p(y | x)

ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
ordinary risk and URE behave very differently:
• ℓ > 0 → ordinary risk is non-negative
• small p(ȳ | x) (as often desired) → possibly very negative ℓ̄, so the empirical URE can be negative on some observed ȳ
• a negative empirical URE drags minimization towards overfitting
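The sign problem is easy to reproduce numerically; this sketch also checks unbiasedness by averaging ℓ̄ over the K − 1 possible complementary labels (the uniform assumption):

```python
import math

def ell(p, y):
    """Ordinary cross-entropy loss: -log p(y|x)."""
    return -math.log(p[y])

def ell_bar(p, y_bar):
    """URE rewrite: (K-1) * log p(y_bar|x) - sum_k log p(k|x)."""
    K = len(p)
    return (K - 1) * math.log(p[y_bar]) - sum(math.log(pk) for pk in p)

p = [0.3, 0.1, 0.3, 0.3]       # model output over K = 4 classes
neg = ell_bar(p, y_bar=1)      # about -0.993: empirical URE goes negative

# unbiasedness: averaging over the K-1 complementary labels of y = 0
# recovers the ordinary loss -log p(0|x)
y = 0
avg = sum(ell_bar(p, yb) for yb in range(4) if yb != y) / 3
```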
how can we avoid negative empirical URE?
Proposed Framework
(Chou, 2021) Minimize Complementary 0/1
• our goal: minimize the 0/1 loss instead of ℓ̄
• an unbiased estimator of the complementary 0/1 risk is simple:
R̄_01 = E_ȳ[ℓ̄_01(ȳ, g(x))], estimated by ℓ̄_01(ȳ, g(x)) itself
• with ℓ̄_01 the complementary 0/1 loss:
ℓ̄_01(ȳ, g(x)) = ⟦ȳ = g(x)⟧
Surrogate Complementary Loss (SCL):
surrogate after complementary 0/1
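A tiny check of why the complementary 0/1 loss is a sensible target under the uniform assumption: averaging ℓ̄_01 over the K − 1 complementary labels of the true y gives ⟦g(x) ≠ y⟧ / (K − 1), i.e., it is proportional to the ordinary 0/1 error (function names are ours):

```python
K = 4  # number of classes in this toy check

def ell01_bar(y_bar, pred):
    """Complementary 0/1 loss: 1 if the prediction hits the 'not' label."""
    return 1.0 if pred == y_bar else 0.0

def expected_over_complements(y, pred):
    """Expectation over uniform y_bar != y of ell01_bar(y_bar, pred)."""
    return sum(ell01_bar(yb, pred) for yb in range(K) if yb != y) / (K - 1)

# correct prediction -> expected complementary loss 0;
# wrong prediction   -> 1/(K-1), proportional to the ordinary 0/1 error
right = expected_over_complements(y=2, pred=2)
wrong = expected_over_complements(y=2, pred=3)
```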
Illustrative Difference between URE and SCL
[figure: URE reaches its empirical objective by approximation first (A: 0/1 → surrogate ℓ) and estimation after (E: rewrite to ℓ̄); SCL swaps the order — estimation first (complementary 0/1), surrogate φ after]

URE: ripple effect of errors
• theoretical motivation (Ishida, 2017)
• the estimation step (E) amplifies the approximation error (A) in ℓ
SCL: ‘directly’ minimize the complementary likelihood
• non-negative loss φ
• practically prevents the ripple effect
Negative Risk Avoided
Unbiased Risk Estimator (URE): the URE loss ℓ̄_CE rewritten from cross-entropy,

ℓ̄_CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)

can go negative (the first term is a negative loss term).

Surrogate Complementary Loss (SCL): a surrogate of ℓ̄_01 (Kim, 2019),

φ_NL(ȳ, g(x)) = − log(1 − p(ȳ | x)),

remains non-negative.
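The non-negativity is immediate from 1 − p(ȳ | x) ≤ 1; a one-function sketch (the name mirrors the slide's φ_NL):

```python
import math

def phi_nl(p, y_bar):
    """SCL negative-log surrogate: -log(1 - p(y_bar|x)).

    Non-negative, since 1 - p(y_bar|x) <= 1 for any distribution p.
    """
    return -math.log(1.0 - p[y_bar])

# a prediction that puts small mass on the complementary label
p = [0.3, 0.1, 0.3, 0.3]
loss = phi_nl(p, y_bar=1)   # -log(0.9): small, but never negative
```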
Classification Accuracy
Methods
1 unbiased risk estimator (URE) (Ishida, 2019)
2 surrogate complementary loss (SCL)

Table: URE (and the NN variant) are based on ℓ̄ rewritten from the cross-entropy loss, while SCL is based on the exponential loss φ_EXP(ȳ, g(x)) = exp(p_ȳ).

Data set + Model     URE    SCL
MNIST + Linear       0.850  0.902
MNIST + MLP          0.801  0.925
CIFAR10 + ResNet     0.109  0.492
CIFAR10 + DenseNet   0.291  0.544
Gradient Analysis
Gradient Direction of URE
• very diverged directions ∇ℓ̄(ȳ, g(x)) across different ȳ, to maintain unbiasedness
• low correlation to the target ℓ̄_01
[figure: illustration of URE gradient directions ∇ℓ̄(ȳ, g(x)) vs. ∇ℓ(y, g(x))]
Gradient Direction of SCL
• targets the minimum-likelihood objective
• high correlation to the target ℓ̄_01
Gradient Estimation Error
Bias-Variance Decomposition

MSE = E(f − c)² = E(f − h)² [Bias²] + E(h − c)² [Variance]

Gradient Estimation
1 ordinary gradient f = ∇ℓ(y, g(x))
2 complementary gradient c = ∇ℓ̄(ȳ, g(x))
3 expected complementary gradient h = E_ȳ[c]
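The decomposition holds exactly when h is taken as the empirical mean of the samples (the cross term vanishes); a quick numerical check on toy scalar "gradients" of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0                                   # ordinary gradient (scalar stand-in)
c = 1.2 + 0.5 * rng.normal(size=10_000)   # noisy complementary-gradient samples
h = c.mean()                              # expected complementary gradient

mse = np.mean((f - c) ** 2)
bias2 = (f - h) ** 2
var = np.mean((h - c) ** 2)
# MSE = Bias^2 + Variance, exactly, since h is the sample mean of c
```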
Bias-Variance Tradeoff
[figure: (a) MSE, (b) Bias², (c) Variance of the gradient estimates]
Findings
• SCL reduces variance by introducing a small bias (towards ȳ)

        Bias    Variance  MSE
URE     0       big       big
SCL     small   small     small
Take-Home Message
• LCL: another popular weakly-supervised learning problem
• surrogate on the complementary 0/1 helps
• avoids negative loss
• lower gradient variance (with a trade-off in bias)
• anyone using?
• uniform complementary-label generation is unrealistic (ongoing)
• need stronger theoretical guarantees (ongoing)