Efficient Multi-Task Auxiliary Learning:
Selecting Auxiliary Data by Feature Similarity
Po-Nien Kung, Yi-Cheng Chen, Sheng-Siang Yin, Tse-Hsuan Yang, Yun-Nung (Vivian) Chen
http://github.com/MiuLab/FastMTL
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Outline
◉ Background
○ What is Multi-Task Auxiliary Learning
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Background: Multi-Task Auxiliary Learning
◉ Multi-task learning: all tasks are important!
◉ Multi-task auxiliary learning: one primary task and multiple auxiliary tasks.
Background: Multi-Task Auxiliary Learning
◉ Goal: to achieve better performance on the primary task
◉ Most useful when the primary task's dataset is small
Background: Multi-Task Learning Chronology
More tasks, more power…
◉ 2019 Jan: MT-DNN (MSR), 8 tasks, 960k examples
◉ 2019 Oct: T5 (Google), 18 tasks
◉ 2021 Jan: Muppet (Facebook), 50 tasks, 4.8M examples
◉ Near future: new NLP tasks, multi-modal tasks; there will be more…
(Benchmarks: GLUE, SuperGLUE, WMT, …)
Background: Auxiliary Data Size
More tasks, more power… but also more computing!
Compare single-task fine-tuning with multi-task auxiliary learning when the primary task is RTE (quick arithmetic below):
◉ MT-DNN setting: RTE (2.4k) vs. auxiliary (960k) data → 400x computing cost!
◉ Muppet setting: RTE (2.4k) vs. auxiliary (4.8M) data → 2000x computing cost!
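The multipliers follow directly from the dataset sizes, assuming training cost scales linearly with the number of examples seen per epoch:

```latex
\frac{|D_{\text{aux}}|}{|D_{\text{RTE}}|} = \frac{960{,}000}{2{,}400} = 400,
\qquad
\frac{4{,}800{,}000}{2{,}400} = 2{,}000
```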
Outline
◉ Background
○ What is Multi-Task Auxiliary Learning
○ Why we need an efficient Multi-Task Auxiliary Learning method
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Inefficient Multi-Task Auxiliary Learning
[Diagram: multi-task auxiliary learning trains task-oriented predictors on the primary data together with all auxiliary data, then fine-tunes on the primary data alone. The auxiliary data is too large!]
Is Less Auxiliary Data Possible?
The first question: is all auxiliary data beneficial?
◉ Single-task vs. multi-task
○ GLUE dataset (similar to MT-DNN)
○ 10 random seeds
○ One standard deviation as the threshold for IMPROVED vs. DROPPED; otherwise NEUTRAL
[Chart: per-task effect of MTL on MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
Some auxiliary datasets may be unhelpful or even harmful: negative transfer!
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
○ We need a data-sampling method to shrink the size of the auxiliary data
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Two-Stage Multi-Task Auxiliary Learning
[Diagram: a sampling method selects the most beneficial auxiliary data into an auxiliary sub-set; multi-task auxiliary learning then trains the task-oriented predictors on the primary data plus this sub-set.]
Goal: reduce the cost of training on auxiliary data.
Prior Work: AutoSeM (Guo et al., 2019)
◉ Idea: automatically select the most beneficial (related) auxiliary tasks
○ Beta-Bernoulli multi-armed bandit with Thompson sampling (sketch below)
◉ Decide the mixing ratio of the auxiliary tasks
○ Gaussian process
○ Trial and error
Avoids negative transfer and further improves performance!
Target 1: reduce the auxiliary dataset size?
Target 2: reduce the total computing cost?
Not quite: the sampling method itself is computationally expensive!
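For intuition, a minimal sketch of the Beta-Bernoulli bandit with Thompson sampling that AutoSeM builds on. `reward_fn` (does a training step on the chosen auxiliary task improve the primary dev metric?) is a hypothetical hook, and the sketch omits AutoSeM's Gaussian-process stage:

```python
import numpy as np

def thompson_task_selection(n_tasks, n_rounds, reward_fn, seed=0):
    """Beta-Bernoulli multi-armed bandit with Thompson sampling:
    each arm is an auxiliary task; the reward is a 0/1 signal of
    whether training on that task helped the primary task."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_tasks)  # Beta prior: observed successes + 1
    beta = np.ones(n_tasks)   # Beta prior: observed failures + 1
    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)  # sample a utility per task
        arm = int(np.argmax(theta))    # train on the most promising task
        r = reward_fn(arm)             # 1 if the primary metric improved
        alpha[arm] += r
        beta[arm] += 1 - r
    return alpha / (alpha + beta)      # posterior mean utility per task
```

Every reward observation requires actual training and evaluation, so the selection loop itself makes many passes over the data; this is exactly the computing cost the slide points out.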
Challenges of the Sampling Method
◉ Prior work: the sampling method itself must train through all auxiliary data for multiple turns before it can output the auxiliary sub-dataset.
◉ Our target: produce the auxiliary sub-dataset with a single predicting pass through all auxiliary data (one turn).
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
○ Select the most beneficial auxiliary data by feature similarity
◉ Experiments
◉ Conclusion
Feature Similarity Assumption
Assumption: the more similar an auxiliary sample is to the primary task, the more benefit it can bring (sketched in code below).
[Illustration: visualizing the data; legend: primary task data, useful auxiliary data, neutral/harmful auxiliary data, and the pseudo centroid of the primary task data]
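To make the picture concrete, a minimal sketch of the assumption; cosine similarity to a mean-vector centroid is an illustration only, not the paper's actual ranking method (that uses a proxy model, described below):

```python
import numpy as np

def centroid_similarity(primary_feats, aux_feats):
    """Score auxiliary samples by cosine similarity to the pseudo
    centroid (mean feature vector) of the primary-task data."""
    c = primary_feats.mean(axis=0)
    c = c / np.linalg.norm(c)
    a = aux_feats / np.linalg.norm(aux_feats, axis=1, keepdims=True)
    return a @ c  # higher score = closer to the primary task
```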
Feature Similarity Assumption
◉ Toy experiment
○ MT-DNN setting: multi-task training on 500 examples from each GLUE task
◉ t-SNE visualization (sketch below)
○ Last-hidden-state features of BERT
SST-2 and CoLA are more separated from the other tasks.
[Plot: t-SNE of the features, colored by task: MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
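A minimal sketch of such a visualization, using bert-base via Hugging Face transformers and scikit-learn's t-SNE; the [CLS] pooling and the `texts_by_task` input format are assumptions, not the paper's exact setup:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode(sentences):
    """[CLS] vector from BERT's last hidden state for each sentence."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    return bert(**batch).last_hidden_state[:, 0].numpy()  # (n, 768)

def plot_tsne(texts_by_task):
    """texts_by_task: {"SST-2": [500 sentences], "CoLA": [...], ...}"""
    feats = np.vstack([encode(texts) for texts in texts_by_task.values()])
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    start = 0
    for task, texts in texts_by_task.items():
        end = start + len(texts)
        plt.scatter(xy[start:end, 0], xy[start:end, 1], s=4, label=task)
        start = end
    plt.legend()
    plt.show()
```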
Feature Similarity Assumption
◉ Adding a task discriminator (sketch below)
○ Forces the model to encode more task information into the features
STS-B, RTE, MRPC, and MNLI overlap more with each other in feature space.
The tasks with more similar auxiliary data improve the most!
[Plot: t-SNE with the task discriminator, colored by task: MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]
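A minimal sketch of what adding such a discriminator can look like; the class and its interface are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDiscriminativeEncoder(nn.Module):
    """Shared encoder with a task-discriminator head: the head is
    trained to predict which task each example came from, pushing
    task identity into the shared features."""
    def __init__(self, encoder, hidden_size, n_tasks):
        super().__init__()
        self.encoder = encoder  # any module: inputs -> (batch, hidden)
        self.discriminator = nn.Linear(hidden_size, n_tasks)

    def forward(self, inputs, task_ids=None):
        feats = self.encoder(inputs)
        task_logits = self.discriminator(feats)
        # Auxiliary loss, added to the per-task prediction losses.
        disc_loss = (F.cross_entropy(task_logits, task_ids)
                     if task_ids is not None else None)
        return feats, task_logits, disc_loss
```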
Usefulness of Auxiliary Data
◉ Feature similarity may indicate how useful an auxiliary sample is to a primary task.
How do we rank the auxiliary data by feature similarity?
Train a small proxy model with a task discriminator to predict the similarity.
Data Selection: Similarity Ranking
1. Train a small proxy model (task-oriented predictors plus a task discriminator) on a small mixed set.
2. Predicting: run the proxy over all auxiliary data; for every sample the discriminator outputs a probability for each auxiliary task and for the primary task (e.g., sample 1: 0.8 / 0.4 / … / 0.4).
3. Top-ranked data selection: rank the auxiliary samples by their predicted primary-task probability and keep the top-ranked ones as the auxiliary sub-set (selection sketch below).
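A minimal sketch of this one-pass selection, reusing the discriminator sketch above; the loader format (batches with precomputed dataset `indices`) and the known primary-task index are assumptions:

```python
import torch

@torch.no_grad()
def select_auxiliary_subset(proxy_model, aux_loader, primary_id, k):
    """Single predicting pass: score every auxiliary sample by the
    discriminator's primary-task probability, then keep the top-k."""
    proxy_model.eval()
    scores, idxs = [], []
    for batch in aux_loader:  # yields {"inputs": ..., "indices": tensor}
        _, task_logits, _ = proxy_model(batch["inputs"])
        scores.append(task_logits.softmax(-1)[:, primary_id])
        idxs.append(batch["indices"])
    scores, idxs = torch.cat(scores), torch.cat(idxs)
    top = scores.argsort(descending=True)[:k]
    return idxs[top]  # dataset indices of the auxiliary sub-set
```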
Two-Stage Multi-Task Auxiliary Learning
Stage 1: Similarity Ranking
◉ (A) Train a task-discriminative MT-DNN (the small proxy model with its task discriminator) on a small mixed set.
◉ (B) Similarity measuring: predict task probabilities for all auxiliary data and apply top-ranked data selection to obtain the auxiliary sub-set.
Stage 2: Multi-Task Auxiliary Learning & Fine-Tuning
◉ (C) Train a task-oriented MT-DNN (task-oriented predictors) on the primary data plus the auxiliary sub-set.
◉ (D) Fine-tune on the primary data.
Goal: multi-task auxiliary learning on less auxiliary data but with comparable (or even improved) performance. Efficient! (End-to-end sketch below.)
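Putting both stages together in one hypothetical sketch; every helper name here (`small_mixed_set`, `train_proxy_with_discriminator`, `loader`, `train_mtdnn`, `finetune`) is a stand-in for the corresponding step above, not the released FastMTL code:

```python
def two_stage_auxiliary_learning(primary_data, aux_data, primary_id, k):
    """Hypothetical end-to-end wiring of the two stages."""
    # Stage 1: similarity ranking with a small proxy model.
    mixed = small_mixed_set(primary_data, aux_data)   # e.g. 500 per task
    proxy = train_proxy_with_discriminator(mixed)     # cheap to train
    keep = select_auxiliary_subset(proxy, loader(aux_data), primary_id, k)
    subset = [aux_data[i] for i in keep]

    # Stage 2: multi-task auxiliary learning on primary + sub-set,
    # then fine-tuning on the primary data alone.
    model = train_mtdnn(primary_data, subset)
    return finetune(model, primary_data)
```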
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Setting
◉ Similar to MT-DNN
◉ Data: GLUE (960k examples)
◉ Model: BERT-base
◉ Baselines:
○ No-MTL (weak)
○ Random sampling (surprisingly strong)
○ Fully-trained (strong)
◉ Primary tasks: RTE, MRPC, and STS-B; these tasks are improved by MTL, so useful data exists in the auxiliary tasks.
Three evaluated pipelines:
◉ Fully-trained: multi-task auxiliary learning on the primary data plus all auxiliary data, then fine-tune and evaluate.
◉ Random sampling: randomly sample an auxiliary sub-set from all auxiliary data, run multi-task auxiliary learning on the primary data plus the sub-set, then fine-tune and evaluate.
◉ Ours: build the auxiliary sub-set by similarity sampling, run multi-task auxiliary learning on the primary data plus the sub-set, then fine-tune and evaluate.
Results
[Charts: RTE, MRPC (F1), and STS-B scores for No-MTL, Random, Fully-trained, and Ours]
Findings:
◉ Ours > Random
◉ Ours > Fully-trained
◉ Random > Fully-trained (on STS-B)
Our method can use less data to achieve better performance!
Efficiency Evaluation
◉ How much auxiliary data is needed to surpass the fully-trained model?
○ Ours: 50% (RTE), 60% (MRPC), 0.05% (STS-B)
○ Random: 100% (RTE), 100% (MRPC), 1% (STS-B)
◉ STS-B runtime (s); similarity sampling = training the small proxy model + predicting similarity:

Method         Train proxy  Predict similarity  MTL    Fine-tuning  Total  How much faster?
Fully-trained  --           --                  15801  190          15991  --
Random         --           --                  260    190          450    35x
Ours           95           775                 …      …            670    23x
Outline
◉ Background
◉ Two-Stage Multi-Task Auxiliary Learning
◉ Efficient Data Selection
◉ Experiments
◉ Conclusion
Conclusion
◉ Address the importance of efficiency in multi-task auxiliary learning
◉ Propose a data sampling method that shrinks the auxiliary data → reduced computing cost
◉ First to use feature similarity to determine data usefulness
◉ Our method outperforms random sampling and even surpasses the fully-trained model while using less data
First work on time-efficient multi-task auxiliary learning:
http://github.com/MiuLab/FastMTL