(1)

Efficient Multi-Task Auxiliary Learning:

Selecting Auxiliary Data by Feature Similarity

Po-Nien Kung, Yi-Cheng Chen, Sheng-Siang Yin, Tse-Hsuan Yang, Yun-Nung (Vivian) Chen

http://github.com/MiuLab/FastMTL

(2)

Outline

Background

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Experiments

Conclusion


(3)

Outline

Background

What is Multi-Task Auxiliary Learning

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Experiments

Conclusion


(4)

Background: Multi-Task Auxiliary Learning

Multi-task learning: all tasks are important!

Multi-task auxiliary learning: one primary task and multiple auxiliary tasks.


(5)

Background: Multi-Task Auxiliary Learning

Goal: achieve better performance on the primary task.

More useful when the primary task's dataset is small.


(6)

Background: Multi-Task Learning Chronology

More tasks, more power…

Jan 2019: MT-DNN (MSR), 8 tasks, 960k examples

Oct 2019: T5 (Google), 18 tasks (GLUE, SuperGLUE, WMT, …)

Jan 2021: Muppet (Facebook), 50 tasks, 4.8M examples

Near future: new NLP tasks, multi-modal tasks… there will be more!

(7)

Background: Auxiliary Data Size

More tasks, more power… but also more computing!

Comparing single-task fine-tuning with multi-task auxiliary learning when the primary task is RTE:

In the MT-DNN setting: RTE (2.4k) vs. auxiliary data (960k), roughly 400x the computing cost!

In the Muppet setting: RTE (2.4k) vs. auxiliary data (4.8M), roughly 2000x the computing cost!
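As a quick sanity check on those ratios, a back-of-the-envelope sketch in Python, assuming training cost scales roughly linearly with the number of examples seen:

```python
# Rough compute-ratio estimate: assumes the cost of one pass over the data
# scales with the number of training examples, so the ratio of dataset sizes
# approximates the ratio of computing cost.
rte = 2_400             # RTE training examples (~2.4k)
mtdnn_aux = 960_000     # auxiliary data in the MT-DNN-style GLUE setting
muppet_aux = 4_800_000  # auxiliary data in the Muppet setting

print(f"MT-DNN setting: ~{mtdnn_aux / rte:.0f}x")   # ~400x
print(f"Muppet setting: ~{muppet_aux / rte:.0f}x")  # ~2000x
```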

(8)

Outline

Background

What is Multi-Task Auxiliary Learning

Why we need an efficient Multi-Task Auxiliary Learning method

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Experiments

Conclusion


(9)

Inefficient Multi-Task Auxiliary Learning

Multi-task auxiliary learning first trains task-oriented predictors on the auxiliary data together with the primary data, then fine-tunes on the primary data alone.

Problem: the auxiliary data is too large!

(10)

Is Less Auxiliary Data Possible?

Single-task vs. multi-task on the GLUE dataset (setting similar to MT-DNN), over 10 random seeds.

A primary task counts as IMPROVED or DROPPED only when the change exceeds 1 standard deviation; otherwise it is NEUTRAL.

[Per-task comparison for MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA]

The first question: is all auxiliary data beneficial? Some auxiliary datasets might be unhelpful or even harmful: negative transfer!
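A minimal sketch of that labeling rule (a hypothetical helper, not the paper's code): a task is IMPROVED or DROPPED only when the multi-task score moves more than one standard deviation away from the single-task mean.

```python
from statistics import mean, stdev

def label_effect(single_task_scores, multi_task_score):
    """Label a primary task as IMPROVED / DROPPED / NEUTRAL.

    single_task_scores: scores from single-task fine-tuning runs (one per seed).
    multi_task_score:   score after multi-task auxiliary learning.
    Uses 1 standard deviation of the single-task runs as the threshold.
    """
    mu, sigma = mean(single_task_scores), stdev(single_task_scores)
    if multi_task_score > mu + sigma:
        return "IMPROVED"
    if multi_task_score < mu - sigma:
        return "DROPPED"
    return "NEUTRAL"

# Illustrative numbers only, not results from the paper.
print(label_effect([66.4, 67.1, 65.8, 66.9, 67.3, 66.0, 66.7, 65.5, 67.0, 66.2], 71.2))
```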

(11)

Is Less Auxiliary Data Possible?

Single-task vs. multi-task on the GLUE dataset (setting similar to MT-DNN), per task: MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA.

The first question: is all auxiliary data beneficial? Some auxiliary datasets might be unhelpful or even harmful: negative transfer!

(12)

Outline

Background

Two-Stage Multi-Task Auxiliary Learning

We need a data-sampling method to shrink the size of the auxiliary data

Efficient Data Selection

Experiments

Conclusion


(13)

Two-Stage Multi-Task Auxiliary Learning

Idea: a sampling method selects the most beneficial auxiliary data into an auxiliary sub-set; multi-task auxiliary learning with task-oriented predictors then runs on this sub-set together with the primary data.

Goal: reduce the cost of training on auxiliary data.

(14)

Prior Work: AutoSeM (Guo et al., 2019)

Idea: automatically select the most beneficial (related) auxiliary tasks.

A Beta-Bernoulli multi-armed bandit with Thompson sampling selects the useful auxiliary tasks, and a Gaussian Process decides their mixing ratio, both by trial and error.

This avoids negative transfer and further improves performance!

Target 1: reduce the auxiliary dataset size?

Target 2: reduce the total computing cost?

Why not? The sampling method itself is computationally expensive!
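For intuition, a minimal, self-contained sketch of the Beta-Bernoulli Thompson-sampling idea that AutoSeM's task-selection stage builds on. It is illustrative only: the task names and reward probabilities below are made up, and in practice each "reward" observation means actually training on the primary task, which is exactly what makes this loop expensive.

```python
import random

# Beta-Bernoulli Thompson sampling over auxiliary tasks (illustrative sketch).
# Each arm = one auxiliary task; reward = 1 if training on a batch from that
# task improved the primary-task dev metric, else 0.
tasks = ["MNLI", "QQP", "QNLI", "SST-2"]
alpha = {t: 1.0 for t in tasks}  # Beta prior: observed successes + 1
beta = {t: 1.0 for t in tasks}   # Beta prior: observed failures + 1

def observe_reward(task):
    # Placeholder for "train on a batch of `task`, then check the primary dev metric".
    return random.random() < {"MNLI": 0.7, "QQP": 0.5, "QNLI": 0.4, "SST-2": 0.2}[task]

for step in range(200):
    # Sample a plausible usefulness value for each task and pull the best arm.
    sampled = {t: random.betavariate(alpha[t], beta[t]) for t in tasks}
    chosen = max(sampled, key=sampled.get)
    if observe_reward(chosen):
        alpha[chosen] += 1
    else:
        beta[chosen] += 1

# Posterior mean usefulness per task after 200 trial-and-error pulls.
print({t: round(alpha[t] / (alpha[t] + beta[t]), 2) for t in tasks})
```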

(15)

Sampling Method

Challenges of the sampling method:

Prior work: the sampling method selects the auxiliary sub-dataset by training through all auxiliary data, over multiple turns.

Our target: select the auxiliary sub-dataset by predicting through all auxiliary data in one turn.

(16)

Outline

Background

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Select the most beneficial auxiliary data by feature similarity

Experiments

Conclusion


(17)

Feature Similarity Assumption

Assumption: the more similar an auxiliary example is to the primary task, the more benefit it can bring.

Visualizing the data: primary-task data cluster around a pseudo centroid; useful auxiliary data lie close to it, while neutral or harmful auxiliary data lie farther away.

(18)

Feature Similarity Assumption

Toy experiment (MT-DNN setting): multi-task training with 500 examples from each GLUE task.

t-SNE visualization of BERT's last-hidden-state features over MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, and CoLA.

SST-2 and CoLA are more separated from the others.
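A rough sketch of how such a visualization can be produced (assumptions: bert-base-uncased via Hugging Face transformers and mean-pooled last hidden states as the sentence feature; the exact feature extraction in the experiment may differ):

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def encode(sentences):
    """Mean-pool BERT's last hidden states into one feature vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (B, 768)

def plot_tsne(texts_by_task):
    """texts_by_task: dict mapping task name -> list of raw sentences (e.g. 500 each)."""
    feats = np.concatenate([encode(texts) for texts in texts_by_task.values()])
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    start = 0
    for task, texts in texts_by_task.items():
        pts = coords[start:start + len(texts)]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=task)
        start += len(texts)
    plt.legend()
    plt.show()
```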

(19)

Feature Similarity Assumption

Adding a task discriminator forces the model to encode more task information into the features.

In the resulting plot (MNLI, RTE, MRPC, STS-B, QQP, QNLI, SST-2, CoLA), STS-B, RTE, MRPC, and MNLI overlap with each other the most.

The tasks with more similar auxiliary data improve the most!

(20)

Usefulness of Auxiliary Data

Feature similarity may indicate the usefulness of auxiliary data to a primary task.

How do we rank the feature similarity?

Train a small proxy model with a task discriminator to predict similarity.

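A minimal sketch of what such a proxy model could look like (assumptions: a small pretrained encoder, here distilbert-base-uncased, with a linear task-discriminator head trained on a small mixed set; the class name and hyperparameters are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ProxyTaskDiscriminator(nn.Module):
    """Small encoder + task-discriminator head.

    The head is trained to predict which task an example came from; its
    softmax probability for the primary task is later reused as a
    similarity score for auxiliary examples.
    """

    def __init__(self, num_tasks, encoder_name="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.task_head = nn.Linear(hidden, num_tasks)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # first-token pooled feature
        return self.task_head(cls)          # (B, num_tasks) task logits

# Training the proxy: cross-entropy against each example's task label,
# using a small mixed set of primary + auxiliary data.
model = ProxyTaskDiscriminator(num_tasks=8)
loss_fn = nn.CrossEntropyLoss()
```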

(21)

Small Proxy Model

Data selection by similarity ranking:

Train a small proxy model with task-oriented predictors and a task discriminator on a small mixed set of primary and auxiliary data.

Run the proxy model over all auxiliary data: the task discriminator scores every sample (sample 1, sample 2, ..., sample N) by how likely it is to belong to the primary task rather than the auxiliary tasks (e.g. scores like 0.8, 0.1, 0.2).

Top-ranked data selection: keep the highest-scoring samples as the auxiliary sub-set.
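A sketch of the selection step under the same assumptions, where the discriminator's softmax probability for the primary task serves as the similarity score (select_top_auxiliary and keep_ratio are illustrative names, and the model is expected to return task logits as in the proxy sketch above):

```python
import torch
import torch.nn.functional as F

def select_top_auxiliary(model, aux_batches, primary_task_id, keep_ratio=0.5):
    """Score every auxiliary example by its predicted probability of belonging
    to the primary task, then keep the top `keep_ratio` fraction."""
    scores, examples = [], []
    model.eval()
    with torch.no_grad():
        for batch in aux_batches:               # batches of tokenized auxiliary data
            logits = model(batch["input_ids"], batch["attention_mask"])
            probs = F.softmax(logits, dim=-1)[:, primary_task_id]
            scores.append(probs)
            examples.extend(batch["examples"])  # keep a handle to the raw examples
    scores = torch.cat(scores)
    k = int(len(examples) * keep_ratio)
    top = torch.topk(scores, k).indices.tolist()
    return [examples[i] for i in top]           # the auxiliary sub-set
```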

(22)

Two-Stage Multi-Task Auxiliary Learning

Stage 1: Similarity Ranking. Train a task-discriminative MT-DNN on a small mixed set, measure similarity over all auxiliary data with the task discriminator, and keep the top-ranked samples (sample 1, sample 2, ..., sample N) as the auxiliary sub-set.

Stage 2: Multi-task Auxiliary Learning & Fine-tuning. Train a task-oriented MT-DNN with task-oriented predictors on the auxiliary sub-set together with the primary data, then fine-tune on the primary data.

Goal: multi-task auxiliary learning on less auxiliary data, but with comparable (or even improved) performance. Efficient!
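For completeness, a compact sketch of what Stage 2's multi-task step could look like: an MT-DNN-style shared encoder with one task-oriented head per task, trained on batches mixed from the primary task and the selected auxiliary sub-set (all names here are illustrative, not the authors' code):

```python
import random
import torch
import torch.nn as nn
from transformers import AutoModel

class SharedEncoderMTL(nn.Module):
    """Shared BERT encoder with one task-oriented head per task (MT-DNN style)."""

    def __init__(self, task_num_labels, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        h = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({t: nn.Linear(h, n) for t, n in task_num_labels.items()})

    def forward(self, task, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.heads[task](cls)

def mtl_step(model, optimizer, loaders, loss_fn=nn.CrossEntropyLoss()):
    """One multi-task step: pick a task at random (primary or selected auxiliary),
    draw a batch from it, and update the shared encoder plus that task's head.
    `loaders` maps task name -> an iterator yielding tokenized batches."""
    task = random.choice(list(loaders))
    batch = next(loaders[task])
    logits = model(task, batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```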

(23)

Outline

Background

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Experiments

Conclusion


(24)

Setting

Setting similar to MT-DNN.

Data: GLUE (960K examples). Model: BERT-base.

Baselines:

No-MTL (weak)

Random sampling (surprisingly strong)

Fully-trained (strong)

Primary tasks: GLUE tasks that are improved by MTL, so useful data exists among the auxiliary tasks.

(25)

Fully-Trained vs. Random Sampling vs. Ours (Similarity Sampling)

Fully-trained: multi-task auxiliary learning on the primary data plus all auxiliary data, then fine-tuning and evaluation.

Random sampling: randomly sample an auxiliary sub-set from all auxiliary data, run multi-task auxiliary learning on it plus the primary data, then fine-tune and evaluate.

Ours (similarity sampling): select the auxiliary sub-set by similarity sampling, run multi-task auxiliary learning on it plus the primary data, then fine-tune and evaluate.

(26)

Results

[Bar charts: RTE accuracy and MRPC (F1) for No-MTL, Random, Fully-trained, and Ours]

(27)

Results

[Bar chart: STS-B for No-MTL, Random, Fully-trained, and Ours]

Findings:

Ours > Random

Ours > Fully-trained

Random > Fully-trained (STS-B)

Our method can use less data to achieve better performance!

(28)

Efficiency Evaluation

How much auxiliary data is needed to surpass the fully-trained model (RTE, MRPC, STS-B)?

Ours: 50%, 60%, 0.05% of the auxiliary data. Random: 100%, 100%, 1%.

STS-B runtime (s):

Method          Train small proxy   Predict similarity   MTL + fine-tuning (total)   How much faster?
Fully-trained   --                  --                   15801 + 190 = 15991         --
Random          --                  --                   260 + 190 = 450             35x
Ours            95                  775                  670                         23x

(29)

Outline

Background

Two-Stage Multi-Task Auxiliary Learning

Efficient Data Selection

Experiments

Conclusion


(30)

Conclusion

We address the importance of efficiency in multi-task auxiliary learning.

We propose a data-sampling method to shrink the size of the auxiliary data, reducing the computing cost.

We are the first to use feature similarity to determine the usefulness of auxiliary data.

Our method outperforms random sampling and even surpasses the fully-trained model while using less data.

First work on the time efficiency of multi-task auxiliary learning:

http://github.com/MiuLab/FastMTL
