Efficient Multi-Task Auxiliary Learning:
Selecting Auxiliary Data by Feature Similarity
Po-Nien Kung, Yi-Cheng Chen, Sheng-Siang Yin, Tse-Hsuan Yang, Yun-Nung (Vivian) Chen
http://github.com/MiuLab/FastMTL
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Outline
◉ Background
○ What is Multi-Task Auxiliary Learning
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Background: Multi-Task Auxiliary Learning
◉ Multi-task learning: all tasks are important!
◉ Multi-task auxiliary learning: one primary task and multiple auxiliary tasks.
Background: Multi-Task Auxiliary Learning
◉ Goal: to achieve better performance on the primary task
◉ Most useful when the primary task's dataset is small
Background: Multi-Task Learning Chronology
More tasks, more power…
◉ 2019 Jan: MT-DNN (MSR), 8 tasks, 960k examples
◉ 2019 Oct: T5 (Google), 18 tasks
◉ 2021 Jan: Muppet (Facebook), 50 tasks, 4.8M examples
◉ Near future: new NLP tasks, multi-modal tasks; there will be more…
(Benchmarks: GLUE, SuperGLUE, WMT, …)
Background: Auxiliary Data Size
More tasks, more power… but also more computing!
Compare single-task fine-tuning with multi-task auxiliary learning when the primary task is RTE (quick arithmetic below):
◉ MT-DNN setting: RTE (2.4k) vs. auxiliary (960k) data → 400x computing cost!
◉ Muppet setting: RTE (2.4k) vs. auxiliary (4.8M) data → 2000x computing cost!
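The multipliers follow directly from the dataset sizes, assuming training cost scales linearly with the number of examples seen per epoch:

```latex
\frac{|D_{\text{aux}}|}{|D_{\text{RTE}}|} = \frac{960{,}000}{2{,}400} = 400,
\qquad
\frac{4{,}800{,}000}{2{,}400} = 2{,}000
```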
Outline
◉ Background
○ What is Multi-Task Auxiliary Learning
○ Why we need an efficient Multi-Task Auxiliary Learning method
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Inefficient Multi-Task Auxiliary Learning
[Diagram: multi-task auxiliary learning trains task-oriented predictors on the primary data together with all auxiliary data, then fine-tunes on the primary data alone. The auxiliary data is too large!]
Is Less Auxiliary Data Possible?
The first question: is all auxiliary data beneficial?
◉ Single-task vs. multi-task
○ GLUE dataset (similar to MT-DNN)
○ 10 random seeds
○ One standard deviation as the threshold for IMPROVED vs. DROPPED; otherwise NEUTRAL
[Chart: per-task effect of MTL on MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
Some auxiliary datasets may be unhelpful or even harmful: negative transfer!
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
○ We need a data-sampling method to shrink the size of the auxiliary data
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Two-Stage Multi-Task Auxiliary Learning
[Diagram: a sampling method selects the most beneficial auxiliary data into an auxiliary sub-set; multi-task auxiliary learning then trains the task-oriented predictors on the primary data plus this sub-set.]
Goal: reduce the cost of training on auxiliary data.
Prior Work: AutoSeM (Guo et al., 2019)
◉ Idea: automatically select the most beneficial (related) auxiliary tasks
○ Beta-Bernoulli multi-armed bandit with Thompson sampling (sketch below)
◉ Decide the mixing ratio of the auxiliary tasks
○ Gaussian process
○ Trial and error
Avoids negative transfer and further improves performance!
Target 1: reduce the auxiliary dataset size?
Target 2: reduce the total computing cost?
Not quite: the sampling method itself is computationally expensive!
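For intuition, a minimal sketch of the Beta-Bernoulli bandit with Thompson sampling that AutoSeM builds on. `reward_fn` (does a training step on the chosen auxiliary task improve the primary dev metric?) is a hypothetical hook, and the sketch omits AutoSeM's Gaussian-process stage:

```python
import numpy as np

def thompson_task_selection(n_tasks, n_rounds, reward_fn, seed=0):
    """Beta-Bernoulli multi-armed bandit with Thompson sampling:
    each arm is an auxiliary task; the reward is a 0/1 signal of
    whether training on that task helped the primary task."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_tasks)  # Beta prior: observed successes + 1
    beta = np.ones(n_tasks)   # Beta prior: observed failures + 1
    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)  # sample a utility per task
        arm = int(np.argmax(theta))    # train on the most promising task
        r = reward_fn(arm)             # 1 if the primary metric improved
        alpha[arm] += r
        beta[arm] += 1 - r
    return alpha / (alpha + beta)      # posterior mean utility per task
```

Every reward observation requires actual training and evaluation, so the selection loop itself makes many passes over the data; this is exactly the computing cost the slide points out.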
Challenges of the Sampling Method
◉ Prior work: the sampling method itself must train through all auxiliary data for multiple turns before it can output the auxiliary sub-dataset.
◉ Our target: produce the auxiliary sub-dataset with a single predicting pass through all auxiliary data (one turn).
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
○ Select the most beneficial auxiliary data by feature similarity
◉ Experiments
◉ Conclusion
Feature Similarity Assumption
Assumption: the more similar an auxiliary sample is to the primary task, the more benefit it can bring (sketched in code below).
[Illustration: visualizing the data; legend: primary task data, useful auxiliary data, neutral/harmful auxiliary data, and the pseudo centroid of the primary task data]
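To make the picture concrete, a minimal sketch of the assumption; cosine similarity to a mean-vector centroid is an illustration only, not the paper's actual ranking method (that uses a proxy model, described below):

```python
import numpy as np

def centroid_similarity(primary_feats, aux_feats):
    """Score auxiliary samples by cosine similarity to the pseudo
    centroid (mean feature vector) of the primary-task data."""
    c = primary_feats.mean(axis=0)
    c = c / np.linalg.norm(c)
    a = aux_feats / np.linalg.norm(aux_feats, axis=1, keepdims=True)
    return a @ c  # higher score = closer to the primary task
```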
Feature Similarity Assumption
◉ Toy experiment
○ MT-DNN setting: multi-task training on 500 examples from each GLUE task
◉ t-SNE visualization (sketch below)
○ Last-hidden-state features of BERT
SST-2 and CoLA are more separated from the other tasks.
[Plot: t-SNE of the features, colored by task: MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
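A minimal sketch of such a visualization, using bert-base via Hugging Face transformers and scikit-learn's t-SNE; the [CLS] pooling and the `texts_by_task` input format are assumptions, not the paper's exact setup:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode(sentences):
    """[CLS] vector from BERT's last hidden state for each sentence."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    return bert(**batch).last_hidden_state[:, 0].numpy()  # (n, 768)

def plot_tsne(texts_by_task):
    """texts_by_task: {"SST-2": [500 sentences], "CoLA": [...], ...}"""
    feats = np.vstack([encode(texts) for texts in texts_by_task.values()])
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    start = 0
    for task, texts in texts_by_task.items():
        end = start + len(texts)
        plt.scatter(xy[start:end, 0], xy[start:end, 1], s=4, label=task)
        start = end
    plt.legend()
    plt.show()
```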
Feature Similarity Assumption
◉ Adding a task discriminator (sketch below)
○ Forces the model to encode more task information into the features
STS-B, RTE, MRPC, and MNLI overlap more with each other in feature space.
The tasks with more similar auxiliary data improve the most!
[Plot: t-SNE with the task discriminator, colored by task: MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
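A minimal sketch of what adding such a discriminator can look like; the class and its interface are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDiscriminativeEncoder(nn.Module):
    """Shared encoder with a task-discriminator head: the head is
    trained to predict which task each example came from, pushing
    task identity into the shared features."""
    def __init__(self, encoder, hidden_size, n_tasks):
        super().__init__()
        self.encoder = encoder  # any module: inputs -> (batch, hidden)
        self.discriminator = nn.Linear(hidden_size, n_tasks)

    def forward(self, inputs, task_ids=None):
        feats = self.encoder(inputs)
        task_logits = self.discriminator(feats)
        # Auxiliary loss, added to the per-task prediction losses.
        disc_loss = (F.cross_entropy(task_logits, task_ids)
                     if task_ids is not None else None)
        return feats, task_logits, disc_loss
```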
Usefulness of Auxiliary Data
◉ Feature similarity may indicate how useful an auxiliary sample is to a primary task.
How do we rank the auxiliary data by feature similarity?
Train a small proxy model with a task discriminator to predict the similarity.
Data Selection: Similarity Ranking
1. Train a small proxy model (task-oriented predictors plus a task discriminator) on a small mixed set.
2. Predicting: run the proxy over all auxiliary data; for every sample the discriminator outputs a probability for each auxiliary task and for the primary task (e.g., sample 1: 0.8 / 0.4 / … / 0.4).
3. Top-ranked data selection: rank the auxiliary samples by their predicted primary-task probability and keep the top-ranked ones as the auxiliary sub-set (selection sketch below).
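A minimal sketch of this one-pass selection, reusing the discriminator sketch above; the loader format (batches with precomputed dataset `indices`) and the known primary-task index are assumptions:

```python
import torch

@torch.no_grad()
def select_auxiliary_subset(proxy_model, aux_loader, primary_id, k):
    """Single predicting pass: score every auxiliary sample by the
    discriminator's primary-task probability, then keep the top-k."""
    proxy_model.eval()
    scores, idxs = [], []
    for batch in aux_loader:  # yields {"inputs": ..., "indices": tensor}
        _, task_logits, _ = proxy_model(batch["inputs"])
        scores.append(task_logits.softmax(-1)[:, primary_id])
        idxs.append(batch["indices"])
    scores, idxs = torch.cat(scores), torch.cat(idxs)
    top = scores.argsort(descending=True)[:k]
    return idxs[top]  # dataset indices of the auxiliary sub-set
```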
Two-Stage Multi-Task Auxiliary Learning
Stage 1: Similarity Ranking
◉ (A) Train a task-discriminative MT-DNN (the small proxy model with its task discriminator) on a small mixed set.
◉ (B) Similarity measuring: predict task probabilities for all auxiliary data and apply top-ranked data selection to obtain the auxiliary sub-set.
Stage 2: Multi-Task Auxiliary Learning & Fine-Tuning
◉ (C) Train a task-oriented MT-DNN (task-oriented predictors) on the primary data plus the auxiliary sub-set.
◉ (D) Fine-tune on the primary data.
Goal: multi-task auxiliary learning on less auxiliary data but with comparable (or even improved) performance. Efficient! (End-to-end sketch below.)
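Putting both stages together in one hypothetical sketch; every helper name here (`small_mixed_set`, `train_proxy_with_discriminator`, `loader`, `train_mtdnn`, `finetune`) is a stand-in for the corresponding step above, not the released FastMTL code:

```python
def two_stage_auxiliary_learning(primary_data, aux_data, primary_id, k):
    """Hypothetical end-to-end wiring of the two stages."""
    # Stage 1: similarity ranking with a small proxy model.
    mixed = small_mixed_set(primary_data, aux_data)   # e.g. 500 per task
    proxy = train_proxy_with_discriminator(mixed)     # cheap to train
    keep = select_auxiliary_subset(proxy, loader(aux_data), primary_id, k)
    subset = [aux_data[i] for i in keep]

    # Stage 2: multi-task auxiliary learning on primary + sub-set,
    # then fine-tuning on the primary data alone.
    model = train_mtdnn(primary_data, subset)
    return finetune(model, primary_data)
```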
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Setting
◉ Similar to MT-DNN
◉ Data: GLUE (960k examples)
◉ Model: BERT-base
◉ Baselines:
○ No-MTL (weak)
○ Random sampling (surprisingly strong)
○ Fully-trained (strong)
◉ Primary tasks: RTE, MRPC, and STS-B; these tasks are improved by MTL, so useful data exists in the auxiliary tasks.
Three evaluated pipelines:
◉ Fully-trained: multi-task auxiliary learning on the primary data plus all auxiliary data, then fine-tune and evaluate.
◉ Random sampling: randomly sample an auxiliary sub-set from all auxiliary data, run multi-task auxiliary learning on the primary data plus the sub-set, then fine-tune and evaluate.
◉ Ours: build the auxiliary sub-set by similarity sampling, run multi-task auxiliary learning on the primary data plus the sub-set, then fine-tune and evaluate.
Results
[Charts: RTE, MRPC (F1), and STS-B scores for No-MTL, Random, Fully-trained, and Ours]
Findings:
◉ Ours > Random
◉ Ours > Fully-trained
◉ Random > Fully-trained (on STS-B)
Our method can use less data to achieve better performance!
Efficiency Evaluation
◉ How much auxiliary data is needed to surpass the fully-trained model?
○ Ours: 50% (RTE), 60% (MRPC), 0.05% (STS-B)
○ Random: 100% (RTE), 100% (MRPC), 1% (STS-B)
◉ STS-B runtime (s); similarity sampling = training the small proxy model + predicting similarity:

Method         Train proxy  Predict similarity  MTL    Fine-tuning  Total  How much faster?
Fully-trained  --           --                  15801  190          15991  --
Random         --           --                  260    190          450    35x
Ours           95           775                 …      …            670    23x
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Conclusion
◉ Address the importance of efficiency in multi-task auxiliary learning
◉ Propose a data sampling method that shrinks the auxiliary data → reduced computing cost
◉ First to use feature similarity to determine data usefulness
◉ Our method outperforms random sampling and even surpasses the fully-trained model while using less data
First work on time-efficient multi-task auxiliary learning:
http://github.com/MiuLab/FastMTL