(1)

Learning with Limited Labeled Data

Hsuan-Tien Lin 林軒田

Dept. of Computer Science and Information Engineering, National Taiwan University

國立臺灣大學資訊工程學系 January 26, 2022 AI & Data Science Workshop

(2)

Learning with Limited Labeled Data

Outline

Learning with Limited Labeled Data

Learning from Label Proportions

Learning from Complementary Labels

(3)

Learning with Limited Labeled Data

Supervised Learning

(Slide Modified from My ML Foundations MOOC)

unknown target function f : X → Y

training examples D : (x1,y1), · · · , (xN,yN)

[figure: training examples plotted by Size (up to 10) and Mass (up to 25)]

learning algorithm

A

final hypothesis g ≈ f

hypothesis set H

supervised learning: every input vector (picture) xn with its label (category) yn

(4)

Learning with Limited Labeled Data

Semi-Supervised Learning

unknown target function f : X → Y

training examples D : (x1, y1), · · · , (xM, yM), xM+1, . . . , xN

[figure: training examples plotted by Size (up to 10) and Mass (up to 25)]

learning algorithm

A

final hypothesis g ≈ f

hypothesis set H

semi-supervised learning:

a few labeled examples + many unlabeled examples

(5)

Learning with Limited Labeled Data

Active Learning: Learning by ‘Asking’

Protocol ⇔ Learning Philosophy

• batch: ‘duck feeding’

• active: ‘question asking’ (iteratively)

—query yn of chosen xn

unknown target function f : X → Y

labeled training examples: ([img], +1), ([img], +1), ([img], +1), ([img], -1), ([img], -1), ([img], -1)

unlabeled training examples: ([img]), ([img]), ([img]), ([img])

learning algorithm

A

final hypothesis g ≈ f

active learning (on top of semi-supervised):

a few labeled examples + unlabeled pool + a few strategically-queried labels

(6)

Learning with Limited Labeled Data

Weakly-Supervised Learning:

Learning without True Labels

(a) positive-unlabeled learning (b) learning with noisy labels (c) learning with complementary labels

• positive-unlabeled: some of true yn = +1 revealed

• noisy: (cheaper) noisy label y′n instead of true yn

• complementary: ‘not label’ ȳn instead of true yn

weakly-supervised:

a few (no) labeled examples + many ‘related’ and easier-to-get labels

(7)

Learning with Limited Labeled Data

Our Ongoing Research Quests

Learning from Limited Labeled Data (L3D)

• in supervised learning

e.g. uneven-margin augmentation for imbalanced learning?

• in interactive learning

e.g. can strategically obtained labels push L3D to the extreme?

• in generative learning

e.g. development with cloned data first, validate with limited labeled data later?

• in weakly-supervised learning

e.g. sketch with weak labels first, refine with limited labeled data later, or maybe learn from many weak labels only?

(8)

Learning with Limited Labeled Data

Some of Our Selected Work

[figure: the standard learning flow (data → learning algorithm + learning model → good learned hypothesis), annotated with where works (1)-(5) below intervene]

1 zero-shot learning (ICLR 2021):

no labeled data but only descriptions for new classes

2 learning from complementary labels (ICML 2020): cheaper weakly labeled data

3 robust estimation (gaze: BMVC 2020, typhoon: KDD 2018):

domain-driven data augmentation

4 robust generation (NeurIPS 2021): math-driven objective augmentation

5 active learning (EMNLP 2020): a few actively labeled data

(9)

Learning with Limited Labeled Data

Quick Stories about Augmentation (1/3)

(Ashesh, 2021)

(10)

Learning with Limited Labeled Data

Quick Stories about Augmentation (2/3)

(Chen, 2018)

(11)

Learning with Limited Labeled Data

Quick Stories about Augmentation (3/3)

(Chen, 2021)

(12)

Learning from Label Proportions

Outline

Learning with Limited Labeled Data

Learning from Label Proportions

Learning from Complementary Labels

(13)

Learning from Label Proportions

Learning from Label Proportions

Training

bag [a, o, s, k]

{ [img], [img], [img], [img] } → [1/2, 1/4, 1/4, 0]

{ [img], [img], [img], [img] } → [1/4, 1/2, 0, 1/4]

Test

[img] → ?

motivations

• expensive labeling

• privacy issues

LLP: learn an instance-level classifier with proportion labels

(14)

Learning from Label Proportions

LLP Setting

input

Given M bags B1, . . . , BM, where the m-th bag contains a set of instances Xm and a proportion label pm, defined by

pm = (1 / |Xm|) Σ_{n : xn ∈ Xm} e(yn),   with   ∪_{m=1}^{M} Xm = {x1, . . . , xN}

output

learn a usual instance classifier gθ : R^D → estimated probability
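To make the proportion label concrete: pm averages the one-hot vectors e(yn) over the bag. A minimal sketch, where `proportion_label` is a hypothetical helper (not from the slides):

```python
from collections import Counter

def proportion_label(bag_labels, num_classes):
    """p_m: the average one-hot vector e(y_n) over the bag,
    i.e. the fraction of instances belonging to each class."""
    counts = Counter(bag_labels)
    return [counts.get(k, 0) / len(bag_labels) for k in range(num_classes)]

# a bag of four instances over four classes
print(proportion_label([0, 1, 1, 2], num_classes=4))  # [0.25, 0.5, 0.25, 0.0]
```

In LLP the learner only sees these per-bag averages, never the per-instance `bag_labels`.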

(15)

Learning from Label Proportions

Our Sol.: LLP w/ Consistency Regularization

(Tsai, 2020)

vanilla: bag-level proportion loss Lprop =KL(pkˆp)

• ‘distance’ between targetp and estimated ˆp = |X |1 P

x∈X gθ(x) small

• extension ofstandard cross-entropy loss

instance-level regularization Lcons = 1

|X | X

x∈X

KL(gθ(x)kgθx))

• ‘difference’ betweenx and perturbed ˆx small

• mature technique for semi-supervised learning

LLPwithconsistency regularization:

L =Lprop+ αLcons
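A sketch of how the combined objective might look for one bag in plain Python (hypothetical helper names; the actual method trains deep networks on mini-batched bags):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def llp_vat_loss(p_target, probs, perturbed_probs, alpha=1.0):
    """Combined objective L = Lprop + alpha * Lcons for a single bag.
    probs[i] = g_theta(x_i); perturbed_probs[i] = g_theta(x_hat_i)."""
    n, num_classes = len(probs), len(p_target)
    # bag-level proportion loss: KL between target p and mean prediction p_hat
    p_hat = [sum(pr[k] for pr in probs) / n for k in range(num_classes)]
    l_prop = kl(p_target, p_hat)
    # instance-level consistency loss, averaged over the bag
    l_cons = sum(kl(g, g_hat) for g, g_hat in zip(probs, perturbed_probs)) / n
    return l_prop + alpha * l_cons

# a bag whose predictions already match the proportion label and are
# unchanged under perturbation incurs (near) zero loss
probs = [[1.0, 0.0], [0.0, 1.0]]
print(llp_vat_loss([0.5, 0.5], probs, probs))
```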

(16)

Learning from Label Proportions

Consistency Loss by Virtual Adversarial Training

smoothness assumption

if xi ≈ xj, then yi ≈ yj

goal

encourage the classifier to produce consistent outputs for neighbors

Virtual Adversarial Training (Miyato, 2018)

generate a perturbed example ˆx that most likely causes the model to misclassify

x̂ = argmax_{‖x̂ − x‖ ≤ r} KL(gθ(x) ‖ gθ(x̂))

consistency loss w/ VAT:

Lcons(θ) = KL(gθ(x) ‖ gθ(x̂))
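VAT proper finds x̂ with a gradient-based power iteration; the gradient-free sketch below (hypothetical, for illustration only) conveys the same idea by sampling random directions within radius r and keeping the most output-changing one:

```python
import math
import random

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def adversarial_perturbation(x, model, radius=0.5, trials=50, seed=0):
    """Random-search stand-in for VAT's argmax: sample directions of
    norm <= radius and keep the one changing the output the most."""
    rng = random.Random(seed)
    base = model(x)
    best_x, best_div = list(x), 0.0
    for _ in range(trials):
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(di * di for di in d)) or 1.0
        x_hat = [xi + radius * di / norm for xi, di in zip(x, d)]
        div = kl(base, model(x_hat))
        if div > best_div:
            best_x, best_div = x_hat, div
    return best_x

# toy 2-class model: logistic on the first input coordinate
model = lambda x: (lambda s: [s, 1.0 - s])(1.0 / (1.0 + math.exp(-x[0])))
x = [0.0, 0.0]
x_hat = adversarial_perturbation(x, model)
```

The returned x̂ then feeds the consistency term KL(gθ(x) ‖ gθ(x̂)) above.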

(17)

Learning from Label Proportions

Experimental Results

Bag Size

Dataset    Method    16     32     64     128    256
SVHN       vanilla   95.28  95.20  94.41  88.93  12.64
SVHN       LLP-VAT   95.66  95.73  94.60  91.24  11.18
CIFAR10    vanilla   88.77  85.02  70.68  47.48  38.69
CIFAR10    LLP-VAT   89.30  85.41  72.49  50.78  41.62
CIFAR100   vanilla   58.58  48.09  20.66  5.82   2.82
CIFAR100   LLP-VAT   59.47  48.98  22.84  9.40   3.29

consistency regularization (VAT) helps!

(18)

Learning from Label Proportions

Take-Home Message

• LLP: a typical weakly-supervised learning problem

consistency regularization helps

—can other regularization help?

• anyone using?

50% accuracy on 10 classes for big bags?!

no real-world data yet

(19)

Learning from Complementary Labels

Outline

Learning with Limited Labeled Data

Learning from Label Proportions

Learning from Complementary Labels

(20)

Learning from Complementary Labels

Fruit Labeling Task (Image from AICup in 2020)

hard: true label

• orange?

• mango?

• cherry

• banana

easy: complementary label

• orange

• mango

• cherry

• banana ✗

complementary: less labeling cost/expertise required

(21)

Learning from Complementary Labels

Comparison

Ordinary (Supervised) Learning

training: {(xn = [img], yn = mango)} → classifier

Complementary Learning

training: {(xn = [img], ȳn = banana)} → classifier

testing goal: classifier([img]) → cherry

ordinary versus complementary:

same goal via different training data

(22)

Learning from Complementary Labels

Learning with Complementary Labels Setup

Given

N examples (input xn, complementary label ȳn) ∈ X × {1, 2, · · · , K} in data set D such that ȳn ≠ yn for some hidden ordinary label yn ∈ {1, 2, · · · , K}.

Goal

a multi-class classifier g(x) that closely predicts (0/1 error) the ordinary label y associated with some unseen inputs x

LCL model design: connecting complementary & ordinary
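The setup's condition ȳn ≠ yn is often instantiated with uniform complementary labels; a minimal sketch of that generation process (hypothetical `draw_complementary` helper, not from the slides):

```python
import random

def draw_complementary(y, num_classes, rng):
    """Uniform complementary label: any class except the true y,
    each with probability 1/(K-1)."""
    ybar = rng.randrange(num_classes - 1)
    return ybar if ybar < y else ybar + 1  # skip over the true class

rng = random.Random(0)
labels = [draw_complementary(2, 4, rng) for _ in range(1000)]
# the true class never appears among the complementary labels
print(2 in labels)  # False
```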

(23)

Learning from Complementary Labels

Unbiased Risk Estimation for LCL

Ordinary Learning

• empirical risk minimization (ERM) on training data

risk: E(x,y)[ℓ(y, g(x))]   empirical risk: average of ℓ(yn, g(xn)) over (xn, yn) ∈ D

• loss ℓ: usually surrogate of 0/1 error

LCL (Ishida, 2019)

• rewrite the loss ℓ to ℓ̄, such that

unbiased risk estimator: E(x,y)[ℓ(y, g(x))] = E(x,ȳ)[ℓ̄(ȳ, g(x))]

under assumptions (e.g. uniform complementary labels)

• LCL by minimizing URE
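The rewriting can be checked numerically: under uniform complementary labels each ȳ ≠ y occurs with probability 1/(K−1), so averaging Ishida's cross-entropy rewriting (shown on the next slide) over all ȳ recovers the ordinary loss exactly. A sketch with hypothetical helper names:

```python
import math

def ordinary_ce(p, y):
    """ordinary cross-entropy loss: -log p(y|x)"""
    return -math.log(p[y])

def rewritten_loss(p, ybar, K):
    """cross-entropy rewritten for a complementary label (Ishida, 2019):
    (K-1) * log p(ybar|x) - sum_k log p(k|x)"""
    return (K - 1) * math.log(p[ybar]) - sum(math.log(pk) for pk in p)

p = [0.6, 0.2, 0.15, 0.05]   # model's predicted distribution
y, K = 0, 4
# expectation over uniform complementary labels ybar != y
expected = sum(rewritten_loss(p, ybar, K) for ybar in range(K) if ybar != y) / (K - 1)
print(expected, ordinary_ce(p, y))  # the two agree
```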

(24)

Learning from Complementary Labels

URE Overfits Easily

ℓ(y, g(x)) = − log p(y | x)

ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)

ordinary risk and URE very different

• ℓ > 0 → ordinary risk non-negative

• small p(ȳ | x) (often) → possibly very negative ℓ̄

empirical URE can be negative on some observed ȳ

• negative empirical URE drags minimization towards overfitting

how can we avoid negative empirical URE?

(25)

Learning from Complementary Labels

Proposed Framework

(Chou, 2021)

Minimize Complementary 0/1

• our goal: minimize 0/1 loss instead of ℓ

• unbiased estimator of R̄01 is simple

R̄01 = E(x,ȳ)[ℓ̄01(ȳ, g(x))], estimated simply by ℓ̄01(ȳ, g(x))

• ℓ̄01 as the complementary 0/1 loss:

ℓ̄01(ȳ, g(x)) = ⟦ȳ = g(x)⟧

Surrogate Complementary Loss (SCL):

surrogate after complementary 0/1

(26)

Learning from Complementary Labels

Illustrative Difference between URE and SCL

[figure: error-decomposition diagrams; URE goes from the target R01 through Rℓ to the estimate R̂ℓ̄ (approximation step A, then estimation step E), while SCL goes through R̄01 and R̄φ to R̂φ (estimation step E, then approximation step A)]

URE: Ripple effect of errors

• Theoretical motivation (Ishida, 2017)

• Estimation step (E) amplifies approximation error (A) in ℓ̄

SCL: ‘Directly’ minimize complementary likelihood

• Non-negative loss φ

• Practically prevents ripple effect

(27)

Learning from Complementary Labels

Negative Risk Avoided

Unbiased Risk Estimator (URE)

URE loss ℓ̄CE rewritten from cross-entropy ℓCE:

ℓ̄CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)
                [negative loss term]

can go negative.

Surrogate Complementary Loss (SCL)

a surrogate of ℓ̄01 (Kim, 2019):

φNL(ȳ, g(x)) = − log(1 − p(ȳ | x))

remains non-negative.
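A quick numeric contrast (a sketch; `ure_ce` and `scl_nl` are hypothetical names for the two losses above): when the model assigns little probability to the complementary label and spreads the rest over several classes, the URE loss dips below zero while the SCL surrogate cannot.

```python
import math

def ure_ce(p, ybar, K):
    """URE loss rewritten from cross-entropy:
    (K-1) * log p(ybar|x) - sum_j log p(j|x); can go negative."""
    return (K - 1) * math.log(p[ybar]) - sum(math.log(pj) for pj in p)

def scl_nl(p, ybar):
    """SCL negative-log surrogate: -log(1 - p(ybar|x)); non-negative."""
    return -math.log(1.0 - p[ybar])

# small p(ybar|x), moderate probability elsewhere
p = [0.333, 0.001, 0.333, 0.333]
print(ure_ce(p, ybar=1, K=4))  # negative
print(scl_nl(p, ybar=1))       # non-negative
```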

(28)

Learning from Complementary Labels

Classification Accuracy

Methods

1 Unbiased risk estimator (URE) (Ishida, 2019)

2 Surrogate complementary loss (SCL)

Table: URE and NN are based on ℓ̄ rewritten from cross-entropy loss, while SCL is based on the exponential loss φEXP(ȳ, g(x)) = exp(pȳ).

Data set + Model     URE    SCL
MNIST + Linear       0.850  0.902
MNIST + MLP          0.801  0.925
CIFAR10 + ResNet     0.109  0.492
CIFAR10 + DenseNet   0.291  0.544

(29)

Learning from Complementary Labels

Gradient Analysis

Gradient Direction of URE

• Very diverged directions on each ȳ to maintain unbiasedness

• Low correlation to the target ℓ01

[figure: illustration of URE gradient directions ∇ℓ(y, g(x)) versus ∇ℓ̄(ȳ, g(x))]

Gradient Direction of SCL

• Targets the minimum-likelihood objective

• High correlation to the target ℓ01

(30)

Learning from Complementary Labels

Gradient Estimation Error

Bias-Variance Decomposition

MSE = E(f − c)²
    = E(f − h)²  +  E(h − c)²
       [Bias²]       [Variance]

Gradient Estimation

1 Ordinary gradient f = ∇ℓ(y, g(x))

2 Complementary gradient c = ∇ℓ̄(ȳ, g(x))

3 Expected complementary gradient h = E[c]
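When h is taken as the empirical mean of the drawn gradients, the decomposition holds exactly on the sample; a scalar sanity check (hypothetical `decompose` helper):

```python
import random

def decompose(f, draws):
    """Split the empirical MSE = mean((f - c)^2) into Bias^2 + Variance,
    with h estimated as the empirical mean of the draws c."""
    n = len(draws)
    h = sum(draws) / n
    mse = sum((f - c) ** 2 for c in draws) / n
    bias2 = (f - h) ** 2
    var = sum((h - c) ** 2 for c in draws) / n
    return mse, bias2, var

rng = random.Random(1)
# a biased, noisy estimator of the target gradient f = 1.0
draws = [rng.gauss(1.2, 0.3) for _ in range(5000)]
mse, bias2, var = decompose(1.0, draws)
print(mse, bias2, var)
```

The cross term vanishes because the draws average exactly to h, mirroring how URE (zero bias, big variance) and SCL (small bias, small variance) trade off in the table below.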

(31)

Learning from Complementary Labels

Bias-Variance Tradeoff

[figure panels: (a) MSE, (b) Bias², (c) Variance]

Findings

• SCL reduces variance by introducing small bias (towards ȳ)

        Bias   Variance  MSE
URE     0      Big       Big
SCL     Small  Small     Small

(32)

Learning from Complementary Labels

Take-Home Message

• LCL: another popular weakly-supervised learning problem

surrogate on complementary helps

avoid negative loss

lower gradient variance (with trade-off in bias)

• anyone using?

uniform complementary generation unrealistic (ongoing)

need stronger theoretical guarantee (ongoing)

(33)

Learning from Complementary Labels

Thank you! Questions?
