Model Pre-Training
Applied Deep Learning
April 25th, 2022 http://adl.miulab.tw
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
BERT Variants
◉ Improvements to BERT pre-training:
○ RoBERTa: mainly trains BERT on more data and for longer
○ SpanBERT: masking contiguous spans of words makes a harder, more useful pre-training task (see the masking sketch below)
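A toy sketch of span masking in plain Python (the sentence, span position, and helper name are illustrative; SpanBERT actually samples span lengths from a geometric distribution):

    # Mask a contiguous span rather than scattered single tokens,
    # so the model must predict the whole span from its boundaries.
    def mask_span(tokens, start, length, mask_token="[MASK]"):
        """Replace tokens[start:start+length] with mask tokens."""
        masked = list(tokens)
        for i in range(start, start + length):
            masked[i] = mask_token
        return masked

    tokens = "an American football game".split()
    print(mask_span(tokens, 1, 2))
    # ['an', '[MASK]', '[MASK]', 'game'] -> predict "American football" jointly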
Need for a Decoder
◉ Generation tasks
○ BERT and other pre-trained encoders don't naturally lead to autoregressive (one-word-at-a-time) generation methods
◉ Pre-trained encoder: "Vivian goes to [MASK] tasty tea" → fills in the masked word (make / brew / craft)
◉ Pre-trained decoder: reads "Vivian goes to make tasty tea" and is trained to predict the sequence shifted by one word ("goes to make tasty tea …")
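A minimal sketch of this contrast, assuming the Hugging Face transformers package is installed (the model choices are illustrative):

    from transformers import pipeline

    # Pre-trained encoder (BERT): fill in the mask using context from both sides.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("Vivian goes to [MASK] tasty tea.")[0]["token_str"])

    # Pre-trained decoder (GPT-2): generate autoregressively, one token at a time.
    gen = pipeline("text-generation", model="gpt2")
    print(gen("Vivian goes to", max_new_tokens=5)[0]["generated_text"])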
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
GPT: Generative Pretrained Transformer
◉ Transformer decoder
○ Pre-trained on BooksCorpus (~7000 books)
■ Transformer decoder with 12 layers
■ 768-dim hidden states, 3072-dim feed-forward hidden layers
■ Byte-pair encoding with 40,000 merges
○ Supervised fine-tuning for the target tasks
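The listed hyperparameters map onto the following configuration; a minimal sketch assuming the Hugging Face transformers package (whose OpenAIGPTConfig defaults already match the original GPT):

    from transformers import OpenAIGPTConfig, OpenAIGPTLMHeadModel

    config = OpenAIGPTConfig(
        n_layer=12,        # 12 Transformer decoder layers
        n_embd=768,        # 768-dim hidden states; feed-forward is 4x = 3072
        vocab_size=40478,  # byte-pair encoding vocabulary (40,000 merges)
    )
    model = OpenAIGPTLMHeadModel(config)
    print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~117M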
GPT-2
◉ Transformer decoder
○ Pre-trained on more data
○ Good for NLG
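A minimal generation sketch (assumes transformers and torch are installed; the prompt and sampling hyperparameters are illustrative):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tok("The tea ceremony began when", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=40,
            do_sample=True,   # sample instead of greedy decoding
            top_k=50,         # consider only the 50 most likely next tokens
            temperature=0.9,  # soften the next-token distribution
            pad_token_id=tok.eos_token_id,
        )
    print(tok.decode(out[0], skip_special_tokens=True))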
More Powerful Pre-Trained Model – GPT-3
GPT-3 “In-Context” Learning
◉ Pre-training & fine-tuning: pre-train the model on unannotated data, then update its weights on task-specific annotated data (one fine-tuned model per task)
◉ Pre-training & in-context learning: pre-train the model once; it solves new tasks directly from the prompt with no learning (no weight updates)
GPT-3 “In-Context” Learning
◉ Zero-Shot: task description only, no demonstrations
◉ One-Shot: task description plus a single demonstration
◉ Few-Shot: task description plus a few demonstrations (a prompt-building sketch follows below)
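A minimal sketch of how such prompts are assembled (the task, separator, and demonstrations are illustrative; the model's weights are never updated):

    def make_prompt(task, demonstrations, query):
        """Build a zero-/one-/few-shot prompt: k = len(demonstrations)."""
        lines = [task]
        for x, y in demonstrations:
            lines.append(f"{x} => {y}")
        lines.append(f"{query} =>")   # the model completes this line
        return "\n".join(lines)

    # Few-shot (k = 2); pass [] for zero-shot or a single pair for one-shot.
    print(make_prompt("Translate English to French:",
                      [("cheese", "fromage"), ("tea", "thé")],
                      "water"))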
Benchmark: 42 NLU Tasks
◉ Aggregate performance of zero-/one-/few-shot GPT-3 compared with traditional fine-tuning
NLU Performance in SuperGLUE
NLG Performance
◉ Humans judge whether an article is model-generated
NLG Performance
◉ Using a new word in a sentence (few-shot)
GPT-J
◉ GPT-3, released by OpenAI, has 175 billion parameters and is not openly available.
◉ GPT-J is a 6-billion-parameter model released by EleutherAI, whose goal is to democratize huge language models; GPT-J is publicly available (see the loading sketch below).
○ Better at code generation tasks
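A minimal loading sketch using the Hugging Face transformers package (the checkpoint name is the public EleutherAI release; the full fp32 weights need roughly 24 GB of memory):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tok("def fibonacci(n):", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0]))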
Demo
LaMDA: Language Models for Dialog Applications
◉ Pre-training: public dialogue data and other web text (1.56T words)
◉ Fine-tuning: quality and safety scores
○ Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure
◉ Discriminator training format (a toy serializer is sketched below): “<context><sentinel><response><attribute-name><rating>”
• “What’s up? RESPONSE not much. SENSIBLE 1”
• “What’s up? RESPONSE not much. INTERESTING 0”
• “What’s up? RESPONSE not much. UNSAFE 0”
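A toy sketch of serializing these rated examples in plain Python (the helper name is illustrative, with "RESPONSE" standing in for the sentinel):

    def discriminator_example(context, response, attribute, rating):
        """Format one rated turn as <context><sentinel><response><attribute><rating>."""
        return f"{context} RESPONSE {response} {attribute} {rating}"

    print(discriminator_example("What's up?", "not much.", "SENSIBLE", 1))
    # What's up? RESPONSE not much. SENSIBLE 1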
LaMDA: Language Models for Dialog Applications
◉ Fine-tuning for external knowledge via a tool set (TS)
○ Calculator: “135+7721” → “7856”
○ Translator: “hello in French” → “Bonjour”
○ IR system: “How old is Rafael Nadal?” → “Rafael Nadal / Age / 35”
◉ 40K dialogue turns (generative data) are labeled ‘correct’ or ‘incorrect’ for the ranking task (discriminative data)
◉ Example of a research loop (a toy dispatch sketch follows below):
• context + base → “TS, Rafael Nadal’s age”
• snippet: “He is 31 years old right now” + “Rafael Nadal / Age / 35”
• context + base + query + snippet → “User, He is 35 years old right now”
• context + base + query + snippet → “TS, Rafael Nadal’s favorite song” (the model may instead issue another TS query)
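A toy sketch of that dispatch loop (plain Python; the canned tool results and routing rule are illustrative stand-ins, not LaMDA's actual implementation):

    def toolset(query):
        """Stand-in for the TS (calculator / translator / IR system)."""
        canned = {"Rafael Nadal's age": "Rafael Nadal / Age / 35"}
        return canned.get(query, "no result")

    def research_step(model_output):
        """Turns addressed to 'TS' go to the tool set; 'User' turns end the loop."""
        recipient, text = model_output.split(", ", 1)
        if recipient == "TS":
            return "snippet: " + toolset(text)  # appended to the context next turn
        return "reply: " + text

    print(research_step("TS, Rafael Nadal's age"))               # snippet: Rafael Nadal / Age / 35
    print(research_step("User, He is 35 years old right now"))   # reply: ...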
LaMDA Groundedness
◉ Example: user asks “When was the Eiffel Tower built?”
○ LaMDA-Base (ungrounded draft): “It was constructed in 1887.” → passed to LaMDA-Research
○ LaMDA-Research → TS: “Eiffel Tower construction date”
○ TS → LaMDA-Research: “Eiffel Tower / construction started: 28 January 1887”
○ LaMDA-Research → TS: “Eiffel Tower complete when”
○ TS → LaMDA-Research: “Eiffel Tower / date opened: 31 March 1889”
○ LaMDA-Research → User: “Work started on it in January 1887, and it was opened in March 1889.”
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
Encoder-Decoder Pre-Training
◉ The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.
◉ Pre-training objective: span corruption (denoising)
○ implemented in preprocessing
○ similar to language modeling on the decoder side
◉ Running example: “Thank you for inviting me to your party last week”
Denoising for Pre-Training
◉ Both use a bidirectional encoder and an autoregressive decoder
◉ BART: output the whole sentence
○ Encoder input: “Thank you ___ me to your party ___ week”
○ Decoder target: “<s> Thank you for inviting me to your party last week </s>”
◉ T5: output only the missing parts
○ Encoder input: “Thank you <X> me to your party <Y> week”
○ Decoder target: “<s> <X> for inviting <Y> last <Z> </s>”
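A minimal sketch of T5-style span corruption as a preprocessing step (span positions are fixed here for clarity; T5 samples them randomly):

    def corrupt_spans(tokens, spans):
        """Replace each (start, end) span with a sentinel; the target lists the removed spans."""
        sentinels = ["<X>", "<Y>", "<Z>"]
        source, target, prev = [], [], 0
        for sentinel, (start, end) in zip(sentinels, spans):
            source += tokens[prev:start] + [sentinel]
            target += [sentinel] + tokens[start:end]
            prev = end
        source += tokens[prev:]
        target += [sentinels[len(spans)], "</s>"]  # closing sentinel + end marker
        return " ".join(source), " ".join(target)

    toks = "Thank you for inviting me to your party last week".split()
    src, tgt = corrupt_spans(toks, [(2, 4), (8, 9)])
    print(src)  # Thank you <X> me to your party <Y> week
    print(tgt)  # <X> for inviting <Y> last <Z> </s>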
Fine-Tuning for Classification
◉ BART: repeat the input in the decoder; the label is predicted from the final decoder state
○ Encoder: “A B C D E”; decoder input: “<s> A B C D E” → label
◉ T5: treat classification as a seq2seq task; the decoder generates the label as text
○ Encoder: “A B C D E”; decoder: “<s>” → label
Diverse Noises in BART
◉ Noising transformations: token masking, token deletion, text infilling, sentence permutation, document rotation
Effectiveness of Denoising in T5
T5: Text-to-Text Transfer Transformer
◉ Multi-task pre-training: learning multiple tasks via seq2seq, each marked by a task prefix (sketch below)
26
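A minimal sketch of using task prefixes at inference time (assumes transformers and sentencepiece are installed; t5-small was pre-trained in this multi-task fashion, and the prefixes follow the T5 paper):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    for prompt in [
        "translate English to German: That is good.",  # translation
        "cola sentence: The course is jumping well.",  # acceptability judgment
    ]:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=20)
        print(tok.decode(out[0], skip_special_tokens=True))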
BART vs. T5
◉ Differences
○ Training data size: BART > T5 (about 2x)
○ Model size:
■ BART-large: 12-layer encoder, 12-layer decoder, 1024-dim hidden
■ T5-base: 12-layer encoder, 12-layer decoder, 768-dim hidden, 220M parameters (2x BERT-base)
■ T5-large: 24-layer encoder, 24-layer decoder, 1024-dim hidden, 770M parameters
○ Position encoding: learnable absolute positions (BART) vs. relative positions (T5)
◉ Understanding performance (SQuAD reported as EM / F1, MNLI as matched / mismatched):
      SQuAD        MNLI         SST   QQP   QNLI  STS-B  RTE   MRPC  CoLA
BART  88.8 / 94.6  89.9 / 90.1  96.6  92.5  94.9  91.2   87.2  90.4  62.8
T5    86.7 / 93.8  89.9 / 89.6  96.3  89.9  94.8  89.9   87.0  89.9  61.2
◉ Generation performance (summarization on CNN/DailyMail):
      ROUGE-1  ROUGE-2  ROUGE-L
BART  45.14    21.28    37.25
T5    42.50    20.68    39.75
mBART: Multilingual BART
◉ BART-style denoising pre-training on monolingual corpora across 25 languages
mT5: Multilingual T5
◉ T5 pre-training on the multilingual mC4 corpus covering 101 languages
Concluding Remarks
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5