(1)

Model Pre-Training

Applied Deep Learning

April 25th, 2022 http://adl.miulab.tw

(2)

Three Types of Model Pre-Training

◉ Encoder

○ Bidirectional context

○ Examples: BERT and its variants

◉ Decoder

○ Language modeling; better for generation

○ Examples: GPT-2, GPT-3, LaMDA

◉ Encoder-Decoder

○ Sequence-to-sequence model

○ Examples: Transformer, BART, T5

2

(3)

Three Types of Model Pre-Training

◉ Encoder

○ Bidirectional context

○ Examples: BERT and its variants

◉ Decoder

○ Language modeling; better for generation

○ Examples: GPT-2, GPT-3, LaMDA

◉ Encoder-Decoder

○ Sequence-to-sequence model

○ Examples: Transformer, BART, T5

3

(4)

BERT Variants

◉ Improvements to the BERT pretraining:

○ RoBERTa: mainly trains BERT on more data and for longer

○ SpanBERT: masking contiguous spans of words makes a harder, more useful pre-training task (see the sketch below)

4
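To make the span-masking idea concrete, here is a minimal sketch of contiguous span masking in the spirit of SpanBERT; the single fixed-length span is a simplification (SpanBERT samples span lengths from a geometric distribution and masks about 15% of tokens).

```python
import random

MASK = "[MASK]"

def mask_contiguous_span(tokens, span_len=3, seed=0):
    """Replace one contiguous span of tokens with [MASK] tokens.

    Simplified illustration: a single fixed-length span instead of
    SpanBERT's geometric span lengths and 15% masking budget.
    """
    rng = random.Random(seed)
    start = rng.randrange(0, max(1, len(tokens) - span_len))
    masked = list(tokens)
    targets = {}
    for i in range(start, start + span_len):
        targets[i] = masked[i]      # positions the model must predict
        masked[i] = MASK
    return masked, targets

tokens = "Vivian goes to make tasty tea every afternoon".split()
masked, targets = mask_contiguous_span(tokens)
print(masked)    # contiguous [MASK] tokens replacing a whole span
print(targets)   # original tokens to recover at the masked positions
```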

(5)

Need for a Decoder

◉ Generation tasks

○ BERT and other pre-trained encoders don't naturally lead to autoregressive (one-word-at-a-time) generation methods

5

[Figure] Pre-trained encoder: given "Vivian goes to [MASK] tasty tea", the model fills the masked position with candidates such as make / brew / craft. Pre-trained decoder: given the prefix "Vivian goes to", the model autoregressively generates "make tasty tea".
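The contrast above can be reproduced with off-the-shelf checkpoints. A minimal sketch using Hugging Face pipelines; the model names and generation settings are illustrative choices, not part of the slides.

```python
from transformers import pipeline

# Pre-trained encoder: predicts words for a masked position using
# bidirectional context, but has no natural left-to-right generation.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Vivian goes to [MASK] tasty tea.", top_k=3))
# top candidates for the masked position (the slide suggests make / brew / craft)

# Pre-trained decoder: continues a prefix one token at a time (autoregressive).
generate = pipeline("text-generation", model="gpt2")
print(generate("Vivian goes to", max_new_tokens=5, do_sample=False))
```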

(6)

Three Types of Model Pre-Training

◉ Encoder

○ Bidirectional context

○ Examples: BERT and its variants

◉ Decoder

○ Language modeling; better for generation

○ Examples: GPT-2, GPT-3, LaMDA

◉ Encoder-Decoder

○ Sequence-to-sequence model

○ Examples: Transformer, BART, T5

6

(7)

GPT: Generative Pretrained Transformer

◉ Transformer decoder

○ Pre-trained on BooksCorpus (~7,000 books)

○ Transformer decoder with 12 layers

○ 768-dim hidden states, 3072-dim feed-forward hidden layers

○ Byte-pair encoding with 40,000 merges

○ Supervised fine-tuning for the target tasks

7
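As a sanity check on the listed architecture, the same shape can be instantiated with a GPT-2-style config as a stand-in for the original GPT decoder; the vocabulary size below is an assumption derived from the 40,000 BPE merges.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Decoder-only Transformer with the shape listed on the slide:
# 12 layers, 768-dim hidden states, 3072-dim feed-forward layers.
config = GPT2Config(
    vocab_size=40_478,   # ~40K BPE merges plus special tokens (assumption)
    n_layer=12,
    n_head=12,
    n_embd=768,
    n_inner=3072,
)
model = GPT2LMHeadModel(config)   # randomly initialized, no pre-training
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 117M, in line with GPT
```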

(8)

GPT-2

◉ Transformer decoder

○ Pre-trained on more data

○ Good for NLG

8

(9)

More Powerful Pre-Trained Model – GPT-3

9

[Figure] Pre-training & fine-tuning: a model is pre-trained on unannotated data, then fine-tuned on task-specific annotated data, producing one fine-tuned model per task. Pre-training & in-context learning: the pre-trained model is used directly on new tasks with no further learning.

(10)

GPT-3 “In-Context” Learning

10

(11)

GPT-3 “In-Context” Learning

◉ Zero-Shot

◉ One-Shot

◉ Few-Shot

11

[Figure] Zero-shot, one-shot, and few-shot in-context learning contrasted with traditional fine-tuning.
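In-context learning is implemented entirely in the prompt: a task description plus zero, one, or a few demonstrations, with no gradient updates. A sketch of how such prompts are typically assembled (the translation demonstrations are illustrative).

```python
def build_prompt(task, examples, query):
    """Assemble an in-context learning prompt: task description,
    k demonstrations (k = 0, 1, or a few), then the query to complete."""
    lines = [task]
    for src, tgt in examples:          # k demonstrations; empty for zero-shot
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")        # the model continues from here
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "peppermint")
one_shot  = build_prompt("Translate English to French:", demos[:1], "peppermint")
few_shot  = build_prompt("Translate English to French:", demos, "peppermint")
print(few_shot)
```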

(12)

Benchmark: 42 NLU Tasks

12

(13)

NLU Performance in SuperGLUE

13

(14)

NLG Performance

◉ Humans identify whether an article is machine-generated

14

(15)

NLG Performance

◉ Using a new word in a sentence (few-shot)

15

(16)

GPT-J

◉ GPT-3, released by OpenAI, has 175 billion parameters and is not openly available.

◉ GPT-J is a 6-billion-parameter model released by EleutherAI, a group whose goal is to democratize huge language models; GPT-J is publicly available.

○ Better at code generation tasks (see the sketch below)

16

Demo
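A sketch of loading the public GPT-J weights with Hugging Face transformers; the checkpoint id "EleutherAI/gpt-j-6B" is the one published by EleutherAI, and the prompt and generation settings are illustrative (the 6B float32 weights need roughly 24 GB of memory).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"          # public 6B-parameter checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# GPT-J is often used for code generation; prompt with a function signature.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```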

(17)

LaMDA: Language Models for Dialog Applications

◉ Pre-training: multiple public dialogue data (1.56T words)

◉ Fine-tuning: Quality and Safety scores

○ Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure.

17

“<context><sentinel><response><attribute-name><rating>”

• “What’s up? RESPONSE not much. SENSIBLE 1”

• “What’s up? RESPONSE not much. INTERESTING 0”

• “What’s up? RESPONSE not much. UNSAFE 0”
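A sketch of how such discriminator training strings could be assembled; the RESPONSE sentinel and attribute names follow the examples above, while the helper function itself is only an illustration.

```python
def discriminator_example(context, response, attribute, rating):
    """Serialize one (context, response, attribute, rating) tuple into the
    "<context><sentinel><response><attribute-name><rating>" string format."""
    return f"{context} RESPONSE {response} {attribute} {rating}"

context, response = "What's up?", "not much."
examples = [
    discriminator_example(context, response, "SENSIBLE", 1),
    discriminator_example(context, response, "INTERESTING", 0),
    discriminator_example(context, response, "UNSAFE", 0),
]
for e in examples:
    print(e)
# What's up? RESPONSE not much. SENSIBLE 1
# ...
```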

(18)

LaMDA: Language Models for Dialog Applications

◉ Fine-tuning for external knowledge via a tool set (TS)

○ Calculator: “135+7721” → “7856”

○ Translator: “hello in French” → “Bonjour”

○ IR system: “How old is Rafael Nadal?” → “Rafael Nadal / Age / 35”

◉ 40K dialog turns (generative data) are labeled ‘correct’ or ‘incorrect’ for the ranking task (discriminative data)

18

context + base → “TS, Rafael Nadal’s age”

• snippet: “He is 31 years old right now” + “Rafael Nadal / Age / 35”

context + base + query + snippet → “User, He is 35 years old right now”

context + base + query + snippet → “TS, Rafael Nadal’s favorite song”
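A high-level sketch of the generate-and-ground loop illustrated above; `lamda_generate` and `toolset_lookup` are hypothetical stand-ins for the fine-tuned model and the tool set, so this is a schematic of the control flow, not LaMDA's implementation.

```python
def grounded_response(context, lamda_generate, toolset_lookup, max_queries=4):
    """Generate a base response, then let the model issue tool-set (TS) queries
    and revise the answer until it is addressed to the user.

    `context` is the list of prior dialog turns; `lamda_generate` and
    `toolset_lookup` are hypothetical stand-ins for the model and the TS."""
    base = lamda_generate(context)                  # e.g. "He is 31 years old right now"
    state = list(context) + [("base", base)]
    for _ in range(max_queries):
        turn = lamda_generate(state)                # "TS, <query>" or "User, <reply>"
        recipient, text = turn.split(", ", 1)
        if recipient == "User":
            return text                             # grounded reply for the user
        snippet = toolset_lookup(text)              # e.g. "Rafael Nadal / Age / 35"
        state = state + [("TS", snippet)]
    return base                                     # fall back to the ungrounded answer
```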

(19)

LaMDA Groundedness

19

[Figure] Grounding the answer to “When was the Eiffel Tower built?”:

• LaMDA to user: Hi, how can I help you today?

• user to LaMDA: When was the Eiffel Tower built?

• LaMDA-Base to LaMDA-Research: It was constructed in 1887.

• LaMDA-Research to TS: Eiffel Tower construction date

• TS to LaMDA-Research: Eiffel Tower / construction started: 28 January 1887

• LaMDA-Research to TS: Eiffel Tower complete when

• TS to LaMDA-Research: Eiffel Tower / date opened: 31 March 1889

• LaMDA-Research to user: Work started on it in January 1887, and it was opened in March 1889.

(20)

Three Types of Model Pre-Training

◉ Encoder

○ Bidirectional context

○ Examples: BERT and its variants

◉ Decoder

○ Language modeling; better for generation

○ Examples: GPT-2, GPT-3, LaMDA

◉ Encoder-Decoder

○ Sequence-to-sequence model

○ Examples: Transformer, BART, T5

20

(21)

Encoder-Decoder Pre-Training

◉ The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

◉ Pre-training objective: span corruption (denoising)

○ implemented in preprocessing

○ similar to language modeling at the decoder side

21

(Running example: “Thank you for inviting me to your party last week”)

(22)

Denoising for Pre-Training

◉ BART: output the whole sentence

◉ T5: output the missing parts

22

[Figure] BART denoising: the bidirectional encoder reads the corrupted input “Thank you ___ me to your party ___ week”, and the autoregressive decoder reconstructs the whole sentence “Thank you for inviting me to your party last week </s>” (decoder input: “<s> Thank you for inviting me to your party last week”). T5 denoising: the encoder reads “Thank you <X> me to your party <Y> week”, and the decoder outputs only the missing spans “<X> for inviting <Y> last <Z> </s>” (decoder input: “<s> <X> for inviting <Y> last <Z>”).
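A minimal sketch of the span-corruption preprocessing shown above: selected spans are replaced by sentinel tokens in the encoder input, and the T5-style target contains only the dropped spans. The spans here are hard-coded to reproduce the running example; T5 samples them randomly (about 15% of tokens) and also appends an end-of-sequence token.

```python
def span_corrupt(tokens, spans):
    """Replace the given (start, end) spans with sentinels <X>, <Y>, ...
    and build the T5-style target containing only the removed spans."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    source, target = [], []
    prev = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        source += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    source += tokens[prev:]
    target += [sentinels[len(spans)]]          # final sentinel closes the target
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
src, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(src)   # Thank you <X> me to your party <Y> week
print(tgt)   # <X> for inviting <Y> last <Z>
```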

(23)

Fine-Tuning for Classification

◉ BART: repeat input in decoder

◉ T5: treat it as a seq2seq task

23

[Figure] BART classification: the encoder reads “A B C D E” and the decoder input repeats it as “<s> A B C D E”; the label is predicted from the final decoder state. T5 classification: the encoder reads “A B C D E”, the decoder starts from “<s>” and generates the label as text.
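A schematic of the two fine-tuning input formats in the figure; the field names are illustrative, not library APIs.

```python
def bart_classification_example(tokens, label):
    """BART fine-tuning format: the input sequence is repeated as the decoder
    input, and the label is predicted from the final decoder hidden state."""
    return {"encoder_input": tokens, "decoder_input": ["<s>"] + tokens, "label": label}

def t5_classification_example(tokens, label):
    """T5 fine-tuning format: pure seq2seq; the decoder generates the label text."""
    return {"encoder_input": tokens, "decoder_input": ["<s>"], "target_text": label}

tokens = ["A", "B", "C", "D", "E"]
print(bart_classification_example(tokens, "label"))
print(t5_classification_example(tokens, "label"))
```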

(24)

Diverse Noises in BART

24

(25)

Effectiveness of Denoising in T5

25

(26)

T5: Text-to-Text Transfer Transformer

◉ Multi-task pre-training: learning multiple tasks via seq2seq

26
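Multi-task training works because every task is cast into the same text-in/text-out format and distinguished only by its prefix. A sketch using a released T5 checkpoint; the prefixes follow the T5 paper, and the example inputs are illustrative.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Every task shares one seq2seq model; only the text prefix changes.
prompts = [
    "translate English to German: That is good.",
    "summarize: state authorities dispatched emergency crews tuesday to survey the damage ...",
    "sst2 sentence: this movie was surprisingly delightful",
]
for prompt in prompts:
    ids = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=32)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))
```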

(27)

BART vs. T5

◉ Differences

○ Training data size: BART > T5 (about 2x)

○ Model size:

  BART-large: 12 encoder layers, 12 decoder layers, 1024 hidden

  T5-base: 12 encoder layers, 12 decoder layers, 768 hidden, 220M parameters (2x BERT-base)

  T5-large: 24 encoder layers, 24 decoder layers, 1024 hidden, 770M parameters

○ Position encoding: learnable absolute position (BART) & relative position (T5)

◉ Understanding performance

◉ Generation performance (summarization)

27

        SQuAD      MNLI       SST   QQP   QNLI  STS-B  RTE   MRPC  CoLA
BART    88.8/94.6  89.9/90.1  96.6  92.5  94.9  91.2   87.2  90.4  62.8
T5      86.7/93.8  89.9/89.6  96.3  89.9  94.8  89.9   87.0  89.9  61.2

CNN/DailyMail  ROUGE-1  ROUGE-2  ROUGE-L
BART           45.14    21.28    37.25
T5             42.50    20.68    39.75

(28)

mBART: Multilingual BART

28

(29)

mT5: Multilingual T5

29

(30)

Concluding Remarks

◉ Encoder

○ Bidirectional context

○ Examples: BERT and its variants

◉ Decoder

○ Language modeling; better for generation

○ Examples: GPT-2, GPT-3, LaMDA

◉ Encoder-Decoder

○ Sequence-to-sequence model

○ Examples: Transformer, BART, T5

30
