Learning with Limited Labeled Data
Hsuan-Tien Lin 林軒田
Dept. of Computer Science and Information Engineering, National Taiwan University
January 26, 2022, AI & Data Science Workshop
Outline
Learning with Limited Labeled Data
Learning from Label Proportions
Learning from Complementary Labels
Supervised Learning
(Slide Modified from My ML Foundations MOOC)
unknown target function f : X → Y
training examples D : (x1,y1), · · · , (xN,yN)
[figure: training examples plotted by Size vs. Mass]
learning algorithm
A
final hypothesis g≈f
hypothesis set H
supervised learning: every input vector (picture) x_n with its label (category) y_n
Semi-Supervised Learning
unknown target function f : X → Y
training examples D : (x_1, y_1), · · · , (x_M, y_M),
x_{M+1}, . . . , x_N
[figure: the same Size vs. Mass plot, now including unlabeled points]
learning algorithm
A
final hypothesis g≈f
hypothesis set H
semi-supervised learning:
a few labeled examples + many unlabeled examples
Active Learning: Learning by ‘Asking’
Protocol ⇔ Learning Philosophy
• batch: ‘duck feeding’
• active: ‘question asking’ (iteratively)
— query y_n of chosen x_n
unknown target function f : X → Y
[figure: a few labeled training examples (+1 / -1) and a pool of unlabeled training examples]
learning algorithm
A
final hypothesis g≈f
active learning (on top of semi-supervised):
a few labeled examples + unlabeled pool + a few strategically-queried labels
Weakly-Supervised Learning:
Learning without True Labels
(a) positive-unlabeled learning (b) learning with noisy labels (c) learning with complementary labels
• positive-unlabeled: some of the true y_n = +1 revealed
• noisy: (cheaper) noisy label y'_n instead of true y_n
• complementary: ‘not label’ ȳ_n instead of true y_n
weakly-supervised:
a few (or no) labeled examples + many ‘related’ and easier-to-get labels
Our Ongoing Research Quests
Learning from Limited Labeled Data (L3D)
• in supervised learning
• e.g. uneven-margin augmentation for imbalanced learning?
• in interactive learning
• e.g. can strategically obtained labels push L3D to the extreme?
• in generative learning
• e.g. develop with cloned data first, validate with limited labeled data later?
• in weakly-supervised learning
• e.g. sketch with weak labels first, refine with limited labeled data later — or maybe learn from many weak labels only?
Some of Our Selected Work
[figure: learning pipeline (target → data → learning algorithm + learning model → good learned hypothesis), annotated with spots (1)–(5) for the works below]
1 zero-shot learning (ICLR 2021):
no labeled data but only descriptions for new classes
2 learning from complementary labels (ICML 2020): cheaper weakly labeled data
3 robust estimation (gaze: BMVC 2020, typhoon: KDD 2018):
domain-driven data augmentation
4 robust generation (NeurIPS 2021): math-driven objective augmentation
5 active learning (EMNLP 2020): a few actively labeled data
Quick Stories about Augmentation (1/3)
(Ashesh, 2021)
Quick Stories about Augmentation (2/3)
(Chen, 2018)
Quick Stories about Augmentation (3/3)
(Chen, 2021)
Learning from Label Proportions
Training
bag of fruit images → proportion label over classes [a, o, s, k]
{ , , , } → [1/2, 1/4, 1/4, 0]
{ , , , } → [1/4, 1/2, 0, 1/4]
Test
single image → ? (instance-level prediction)
motivations
• expensive labeling
• privacy issues
LLP: learn an instance-level classifier with proportion labels
LLP Setting
input
Given M bags B_1, . . . , B_M, where the m-th bag contains a set of instances X_m and a proportion label p_m, defined by

p_m = (1/|X_m|) Σ_{n: x_n ∈ X_m} e(y_n),    with ∪_{m=1}^{M} X_m = {x_1, . . . , x_N}

(e(y) denotes the one-hot vector of label y)
output
learn a usual instance classifier g_θ: R^D → estimated probability
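The proportion label above is just an average of one-hot vectors; a minimal sketch (the helper name is ours, not from the talk):

```python
import numpy as np

def proportion_label(bag_labels, num_classes):
    """p_m = (1/|X_m|) * sum of one-hot vectors e(y_n) over the bag.

    Illustrative helper; name and signature are our own.
    """
    one_hot = np.eye(num_classes)[bag_labels]  # e(y_n) for each instance
    return one_hot.mean(axis=0)                # average over the bag

# a bag of 4 instances with hidden labels over K = 4 classes
p_m = proportion_label(np.array([0, 0, 1, 2]), num_classes=4)
# -> proportions [1/2, 1/4, 1/4, 0], matching the example bags above
```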
Our Sol.: LLP w/ Consistency Regularization (Tsai, 2020)
vanilla: bag-level proportion loss L_prop = KL(p ‖ p̂)
• ‘distance’ between target p and estimate p̂ = (1/|X|) Σ_{x∈X} g_θ(x) kept small
• extension of the standard cross-entropy loss
instance-level regularization L_cons = (1/|X|) Σ_{x∈X} KL(g_θ(x) ‖ g_θ(x̂))
• ‘difference’ between x and perturbed x̂ kept small
• a mature technique from semi-supervised learning
LLP with consistency regularization:
L = L_prop + α L_cons
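The combined objective can be sketched directly from the two terms above (a toy version; the function names and the default value of alpha are our assumptions, not the paper's):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def llp_vat_loss(p_target, probs, probs_perturbed, alpha=1.0):
    """L = L_prop + alpha * L_cons, following the slide's decomposition.

    probs[i] plays g_theta(x_i); probs_perturbed[i] plays g_theta(x_hat_i).
    """
    p_hat = probs.mean(axis=0)                 # estimated bag proportion
    l_prop = kl(p_target, p_hat)               # bag-level proportion loss
    l_cons = float(np.mean([kl(a, b) for a, b in zip(probs, probs_perturbed)]))
    return l_prop + alpha * l_cons

probs = np.array([[0.7, 0.3], [0.3, 0.7]])     # predictions on a 2-instance bag
loss = llp_vat_loss([0.5, 0.5], probs, probs)  # no perturbation -> L_cons = 0
```

With the bag average matching the target proportion and unperturbed inputs, both terms vanish, so the loss is zero.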
Consistency Loss by Virtual Adversarial Training
smoothness assumption
if x_i ≈ x_j, then y_i ≈ y_j
goal
encourage the classifier to produce consistent outputs for neighbors
Virtual Adversarial Training (Miyato, 2018)
generate a perturbed example x̂ that most likely causes the model to misclassify:

x̂ = argmax_{‖x̂ − x‖ ≤ r} KL(g_θ(x) ‖ g_θ(x̂))

consistency loss w/ VAT:
L_cons(θ) = KL(g_θ(x) ‖ g_θ(x̂))
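Miyato et al. approximate the inner argmax with gradients (power iteration); purely to illustrate the objective, here is a crude random-search stand-in on a toy softmax model (all names and the toy model are our own):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def worst_perturbation(x, g, r=0.5, num_candidates=64, seed=0):
    """Stand-in for VAT's inner maximization: sample random directions of
    radius r and keep the one maximizing KL(g(x) || g(x_hat)).

    Crude, but it makes the argmax objective concrete.
    """
    rng = np.random.default_rng(seed)
    base = g(x)
    best_x, best_kl = x, -1.0
    for _ in range(num_candidates):
        d = rng.normal(size=x.shape)
        x_hat = x + r * d / np.linalg.norm(d)   # enforce ||x_hat - x|| = r
        score = kl(base, g(x_hat))
        if score > best_kl:
            best_x, best_kl = x_hat, score
    return best_x

W = np.array([[2.0, -1.0], [-1.0, 2.0]])       # toy 2-class model g(x) = softmax(Wx)
g = lambda x: softmax(W @ x)
x = np.array([0.2, 0.1])
x_hat = worst_perturbation(x, g)               # most-KL-increasing neighbor found
```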
Experimental Results
Accuracy (%) under different bag sizes:

Dataset    Method     16     32     64     128    256
SVHN       vanilla    95.28  95.20  94.41  88.93  12.64
           LLP-VAT    95.66  95.73  94.60  91.24  11.18
CIFAR10    vanilla    88.77  85.02  70.68  47.48  38.69
           LLP-VAT    89.30  85.41  72.49  50.78  41.62
CIFAR100   vanilla    58.58  48.09  20.66   5.82   2.82
           LLP-VAT    59.47  48.98  22.84   9.40   3.29
consistency regularization (VAT) helps!
Take-Home Message
• LLP: a typical weakly-supervised learning problem
• consistency regularization helps
— can other regularization help?
• anyone using?
• only ~50% accuracy on 10 classes for big bags?!
• no real-world data yet
Learning from Complementary Labels
Fruit Labeling Task (Image from AICup in 2020)
hard: true label
• orange?
• mango?
• cherry
• banana
easy: complementary label
• orange
• mango
• cherry
• banana ✗
complementary: less labeling cost/expertise required
Comparison
Ordinary (Supervised) Learning
training: {(x_n = , y_n = mango)} → classifier
Complementary Learning
training: {(x_n = , ȳ_n = banana)} → classifier
testing goal: classifier( ) → cherry
ordinary versus complementary:
same goal via different training data
Learning with Complementary Labels Setup
Given
N examples (input x_n, complementary label ȳ_n) ∈ X × {1, 2, · · · K} in data set D, such that ȳ_n ≠ y_n for some hidden ordinary label y_n ∈ {1, 2, · · · K}.
Goal
a multi-class classifier g(x) that closely predicts (0/1 error) the ordinary label y associated with some unseen input x
LCL model design: connecting complementary & ordinary
Unbiased Risk Estimation for LCL
Ordinary Learning
• empirical risk minimization (ERM) on training data
risk: E_{(x,y)}[ℓ(y, g(x))]    empirical risk: E_{(x_n,y_n)∈D}[ℓ(y_n, g(x_n))]
• loss ℓ: usually a surrogate of the 0/1 error
LCL (Ishida, 2019)
• rewrite the loss ℓ to ℓ̄, such that
unbiased risk estimator: E_{(x,y)}[ℓ(y, g(x))] = E_{(x,ȳ)}[ℓ̄(ȳ, g(x))]
under assumptions (e.g. uniform complementary labels)
• LCL by minimizing the URE
URE Overfits Easily
ℓ(y, g(x)) = − log p(y | x)

ℓ̄(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{k=1}^{K} log p(k | x)
ordinary risk and URE behave very differently:
• ℓ > 0 → ordinary risk is non-negative
• small p(ȳ | x) (as often desired) → possibly very negative ℓ̄, so the empirical URE can be negative on some observed ȳ
• a negative empirical URE drags minimization towards overfitting
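The sign problem is easy to reproduce numerically; this sketch also checks unbiasedness by averaging ℓ̄ over the K − 1 possible complementary labels (the uniform assumption):

```python
import math

def ell(p, y):
    """Ordinary cross-entropy loss: -log p(y|x)."""
    return -math.log(p[y])

def ell_bar(p, y_bar):
    """URE rewrite: (K-1) * log p(y_bar|x) - sum_k log p(k|x)."""
    K = len(p)
    return (K - 1) * math.log(p[y_bar]) - sum(math.log(pk) for pk in p)

p = [0.3, 0.1, 0.3, 0.3]       # model output over K = 4 classes
neg = ell_bar(p, y_bar=1)      # about -0.993: empirical URE goes negative

# unbiasedness: averaging over the K-1 complementary labels of y = 0
# recovers the ordinary loss -log p(0|x)
y = 0
avg = sum(ell_bar(p, yb) for yb in range(4) if yb != y) / 3
```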
how can we avoid negative empirical URE?
Proposed Framework
(Chou, 2021) Minimize Complementary 0/1
• our goal: minimize the 0/1 loss instead of ℓ̄
• an unbiased estimator of the complementary 0/1 risk is simple:
R̄_01 = E_ȳ[ℓ̄_01(ȳ, g(x))], estimated by ℓ̄_01(ȳ, g(x)) itself
• with ℓ̄_01 the complementary 0/1 loss:
ℓ̄_01(ȳ, g(x)) = ⟦ȳ = g(x)⟧
Surrogate Complementary Loss (SCL):
surrogate after complementary 0/1
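A tiny check of why the complementary 0/1 loss is a sensible target under the uniform assumption: averaging ℓ̄_01 over the K − 1 complementary labels of the true y gives ⟦g(x) ≠ y⟧ / (K − 1), i.e., it is proportional to the ordinary 0/1 error (function names are ours):

```python
K = 4  # number of classes in this toy check

def ell01_bar(y_bar, pred):
    """Complementary 0/1 loss: 1 if the prediction hits the 'not' label."""
    return 1.0 if pred == y_bar else 0.0

def expected_over_complements(y, pred):
    """Expectation over uniform y_bar != y of ell01_bar(y_bar, pred)."""
    return sum(ell01_bar(yb, pred) for yb in range(K) if yb != y) / (K - 1)

# correct prediction -> expected complementary loss 0;
# wrong prediction   -> 1/(K-1), proportional to the ordinary 0/1 error
right = expected_over_complements(y=2, pred=2)
wrong = expected_over_complements(y=2, pred=3)
```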
Illustrative Difference between URE and SCL
[figure: URE reaches its empirical objective by approximation first (A: 0/1 → surrogate ℓ) and estimation after (E: rewrite to ℓ̄); SCL swaps the order — estimation first (complementary 0/1), surrogate φ after]

URE: ripple effect of errors
• theoretical motivation (Ishida, 2017)
• the estimation step (E) amplifies the approximation error (A) in ℓ
SCL: ‘directly’ minimize the complementary likelihood
• non-negative loss φ
• practically prevents the ripple effect
Negative Risk Avoided
Unbiased Risk Estimator (URE): the URE loss ℓ̄_CE rewritten from cross-entropy,

ℓ̄_CE(ȳ, g(x)) = (K − 1) log p(ȳ | x) − Σ_{j=1}^{K} log p(j | x)

can go negative (the first term is a negative loss term).

Surrogate Complementary Loss (SCL): a surrogate of ℓ̄_01 (Kim, 2019),

φ_NL(ȳ, g(x)) = − log(1 − p(ȳ | x)),

remains non-negative.
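The non-negativity is immediate from 1 − p(ȳ | x) ≤ 1; a one-function sketch (the name mirrors the slide's φ_NL):

```python
import math

def phi_nl(p, y_bar):
    """SCL negative-log surrogate: -log(1 - p(y_bar|x)).

    Non-negative, since 1 - p(y_bar|x) <= 1 for any distribution p.
    """
    return -math.log(1.0 - p[y_bar])

# a prediction that puts small mass on the complementary label
p = [0.3, 0.1, 0.3, 0.3]
loss = phi_nl(p, y_bar=1)   # -log(0.9): small, but never negative
```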
Classification Accuracy
Methods
1 unbiased risk estimator (URE) (Ishida, 2019)
2 surrogate complementary loss (SCL)

Table: URE (and the NN variant) are based on ℓ̄ rewritten from the cross-entropy loss, while SCL is based on the exponential loss φ_EXP(ȳ, g(x)) = exp(p_ȳ).

Data set + Model     URE    SCL
MNIST + Linear       0.850  0.902
MNIST + MLP          0.801  0.925
CIFAR10 + ResNet     0.109  0.492
CIFAR10 + DenseNet   0.291  0.544
Gradient Analysis
Gradient Direction of URE
• very diverged directions ∇ℓ̄(ȳ, g(x)) across different ȳ, to maintain unbiasedness
• low correlation to the target ℓ̄_01
[figure: illustration of URE gradient directions ∇ℓ̄(ȳ, g(x)) vs. ∇ℓ(y, g(x))]
Gradient Direction of SCL
• targets the minimum-likelihood objective
• high correlation to the target ℓ̄_01
Gradient Estimation Error
Bias-Variance Decomposition

MSE = E(f − c)² = E(f − h)² [Bias²] + E(h − c)² [Variance]

Gradient Estimation
1 ordinary gradient f = ∇ℓ(y, g(x))
2 complementary gradient c = ∇ℓ̄(ȳ, g(x))
3 expected complementary gradient h = E_ȳ[c]
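The decomposition holds exactly when h is taken as the empirical mean of the samples (the cross term vanishes); a quick numerical check on toy scalar "gradients" of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 1.0                                   # ordinary gradient (scalar stand-in)
c = 1.2 + 0.5 * rng.normal(size=10_000)   # noisy complementary-gradient samples
h = c.mean()                              # expected complementary gradient

mse = np.mean((f - c) ** 2)
bias2 = (f - h) ** 2
var = np.mean((h - c) ** 2)
# MSE = Bias^2 + Variance, exactly, since h is the sample mean of c
```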
Bias-Variance Tradeoff
[figure: (a) MSE, (b) Bias², (c) Variance of the gradient estimates]
Findings
• SCL reduces variance by introducing a small bias (towards ȳ)

        Bias    Variance  MSE
URE     0       big       big
SCL     small   small     small
Take-Home Message
• LCL: another popular weakly-supervised learning problem
• surrogate on the complementary 0/1 helps
• avoids negative loss
• lower gradient variance (with a trade-off in bias)
• anyone using?
• uniform complementary-label generation is unrealistic (ongoing)
• need stronger theoretical guarantees (ongoing)