Model Pre-Training
Applied Deep Learning
April 25th, 2022 http://adl.miulab.tw
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
BERT Variants
◉ Improvements to BERT pre-training:
○ RoBERTa: mainly trains BERT on more data and for longer
○ SpanBERT: masking contiguous spans of words makes a harder, more useful pre-training task (see the masking sketch below)
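A toy sketch of span masking in plain Python (the sentence, span position, and helper name are illustrative; SpanBERT actually samples span lengths from a geometric distribution):

    # Mask a contiguous span rather than scattered single tokens,
    # so the model must predict the whole span from its boundaries.
    def mask_span(tokens, start, length, mask_token="[MASK]"):
        """Replace tokens[start:start+length] with mask tokens."""
        masked = list(tokens)
        for i in range(start, start + length):
            masked[i] = mask_token
        return masked

    tokens = "an American football game".split()
    print(mask_span(tokens, 1, 2))
    # ['an', '[MASK]', '[MASK]', 'game'] -> predict "American football" jointly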
Need for a Decoder
◉ Generation tasks
○ BERT and other pre-trained encoders don't naturally lead to autoregressive (one-word-at-a-time) generation methods
◉ Pre-trained encoder: "Vivian goes to [MASK] tasty tea" → fills in the masked word (make / brew / craft)
◉ Pre-trained decoder: reads "Vivian goes to make tasty tea" and is trained to predict the sequence shifted by one word ("goes to make tasty tea …")
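A minimal sketch of this contrast, assuming the Hugging Face transformers package is installed (the model choices are illustrative):

    from transformers import pipeline

    # Pre-trained encoder (BERT): fill in the mask using context from both sides.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("Vivian goes to [MASK] tasty tea.")[0]["token_str"])

    # Pre-trained decoder (GPT-2): generate autoregressively, one token at a time.
    gen = pipeline("text-generation", model="gpt2")
    print(gen("Vivian goes to", max_new_tokens=5)[0]["generated_text"])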
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
GPT: Generative Pretrained Transformer
◉ Transformer decoder
○ Pre-trained on BooksCorpus (~7000 books)
■ Transformer decoder with 12 layers
■ 768-dim hidden states, 3072-dim feed-forward hidden layers
■ Byte-pair encoding with 40,000 merges
○ Supervised fine-tuning for the target tasks
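The listed hyperparameters map onto the following configuration; a minimal sketch assuming the Hugging Face transformers package (whose OpenAIGPTConfig defaults already match the original GPT):

    from transformers import OpenAIGPTConfig, OpenAIGPTLMHeadModel

    config = OpenAIGPTConfig(
        n_layer=12,        # 12 Transformer decoder layers
        n_embd=768,        # 768-dim hidden states; feed-forward is 4x = 3072
        vocab_size=40478,  # byte-pair encoding vocabulary (40,000 merges)
    )
    model = OpenAIGPTLMHeadModel(config)
    print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~117M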
GPT-2
◉ Transformer decoder
○ Pre-trained on more data
○ Good for NLG
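A minimal generation sketch (assumes transformers and torch are installed; the prompt and sampling hyperparameters are illustrative):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tok("The tea ceremony began when", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=40,
            do_sample=True,   # sample instead of greedy decoding
            top_k=50,         # consider only the 50 most likely next tokens
            temperature=0.9,  # soften the next-token distribution
            pad_token_id=tok.eos_token_id,
        )
    print(tok.decode(out[0], skip_special_tokens=True))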
More Powerful Pre-Trained Model – GPT-3
GPT-3 “In-Context” Learning
◉ Pre-training & fine-tuning: pre-train the model on unannotated data, then update its weights on task-specific annotated data (one fine-tuned model per task)
◉ Pre-training & in-context learning: pre-train the model once; it solves new tasks directly from the prompt with no learning (no weight updates)
GPT-3 “In-Context” Learning
◉ Zero-Shot: task description only, no demonstrations
◉ One-Shot: task description plus a single demonstration
◉ Few-Shot: task description plus a few demonstrations (a prompt-building sketch follows below)
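A minimal sketch of how such prompts are assembled (the task, separator, and demonstrations are illustrative; the model's weights are never updated):

    def make_prompt(task, demonstrations, query):
        """Build a zero-/one-/few-shot prompt: k = len(demonstrations)."""
        lines = [task]
        for x, y in demonstrations:
            lines.append(f"{x} => {y}")
        lines.append(f"{query} =>")   # the model completes this line
        return "\n".join(lines)

    # Few-shot (k = 2); pass [] for zero-shot or a single pair for one-shot.
    print(make_prompt("Translate English to French:",
                      [("cheese", "fromage"), ("tea", "thé")],
                      "water"))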
Benchmark: 42 NLU Tasks
◉ Aggregate performance of zero-/one-/few-shot GPT-3 compared with traditional fine-tuning
NLU Performance in SuperGLUE
NLG Performance
◉ Humans judge whether an article is model-generated
NLG Performance
◉ Using a new word in a sentence (few-shot)
GPT-J
◉ GPT-3, released by OpenAI, has 175 billion parameters and is not openly available.
◉ GPT-J is a 6-billion-parameter model released by EleutherAI, whose goal is to democratize huge language models; GPT-J is publicly available (see the loading sketch below).
○ Better at code generation tasks
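A minimal loading sketch using the Hugging Face transformers package (the checkpoint name is the public EleutherAI release; the full fp32 weights need roughly 24 GB of memory):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tok("def fibonacci(n):", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0]))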
Demo
LaMDA: Language Models for Dialog Applications
◉ Pre-training: public dialogue data and other web text (1.56T words)
◉ Fine-tuning: quality and safety scores
○ Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure
◉ Discriminator training format (a toy serializer is sketched below): “<context><sentinel><response><attribute-name><rating>”
• “What’s up? RESPONSE not much. SENSIBLE 1”
• “What’s up? RESPONSE not much. INTERESTING 0”
• “What’s up? RESPONSE not much. UNSAFE 0”
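A toy sketch of serializing these rated examples in plain Python (the helper name is illustrative, with "RESPONSE" standing in for the sentinel):

    def discriminator_example(context, response, attribute, rating):
        """Format one rated turn as <context><sentinel><response><attribute><rating>."""
        return f"{context} RESPONSE {response} {attribute} {rating}"

    print(discriminator_example("What's up?", "not much.", "SENSIBLE", 1))
    # What's up? RESPONSE not much. SENSIBLE 1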
LaMDA: Language Models for Dialog Applications
◉ Fine-tuning for external knowledge via a tool set (TS)
○ Calculator: “135+7721” → “7856”
○ Translator: “hello in French” → “Bonjour”
○ IR system: “How old is Rafael Nadal?” → “Rafael Nadal / Age / 35”
◉ 40K dialogue turns (generative data) are labeled ‘correct’ or ‘incorrect’ for the ranking task (discriminative data)
◉ Example of a research loop (a toy dispatch sketch follows below):
• context + base → “TS, Rafael Nadal’s age”
• snippet: “He is 31 years old right now” + “Rafael Nadal / Age / 35”
• context + base + query + snippet → “User, He is 35 years old right now”
• context + base + query + snippet → “TS, Rafael Nadal’s favorite song” (the model may instead issue another TS query)
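A toy sketch of that dispatch loop (plain Python; the canned tool results and routing rule are illustrative stand-ins, not LaMDA's actual implementation):

    def toolset(query):
        """Stand-in for the TS (calculator / translator / IR system)."""
        canned = {"Rafael Nadal's age": "Rafael Nadal / Age / 35"}
        return canned.get(query, "no result")

    def research_step(model_output):
        """Turns addressed to 'TS' go to the tool set; 'User' turns end the loop."""
        recipient, text = model_output.split(", ", 1)
        if recipient == "TS":
            return "snippet: " + toolset(text)  # appended to the context next turn
        return "reply: " + text

    print(research_step("TS, Rafael Nadal's age"))               # snippet: Rafael Nadal / Age / 35
    print(research_step("User, He is 35 years old right now"))   # reply: ...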
LaMDA Groundedness
◉ Example: user asks “When was the Eiffel Tower built?”
○ LaMDA-Base (ungrounded draft): “It was constructed in 1887.” → passed to LaMDA-Research
○ LaMDA-Research → TS: “Eiffel Tower construction date”
○ TS → LaMDA-Research: “Eiffel Tower / construction started: 28 January 1887”
○ LaMDA-Research → TS: “Eiffel Tower complete when”
○ TS → LaMDA-Research: “Eiffel Tower / date opened: 31 March 1889”
○ LaMDA-Research → User: “Work started on it in January 1887, and it was opened in March 1889.”
Three Types of Model Pre-Training
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5
Encoder-Decoder Pre-Training
◉ The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.
◉ Pre-training objective: span corruption (denoising)
○ implemented in preprocessing
○ similar to language modeling on the decoder side
◉ Running example: “Thank you for inviting me to your party last week”
Denoising for Pre-Training
◉ Both use a bidirectional encoder and an autoregressive decoder
◉ BART: output the whole sentence
○ Encoder input: “Thank you ___ me to your party ___ week”
○ Decoder target: “<s> Thank you for inviting me to your party last week </s>”
◉ T5: output only the missing parts
○ Encoder input: “Thank you <X> me to your party <Y> week”
○ Decoder target: “<s> <X> for inviting <Y> last <Z> </s>”
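A minimal sketch of T5-style span corruption as a preprocessing step (span positions are fixed here for clarity; T5 samples them randomly):

    def corrupt_spans(tokens, spans):
        """Replace each (start, end) span with a sentinel; the target lists the removed spans."""
        sentinels = ["<X>", "<Y>", "<Z>"]
        source, target, prev = [], [], 0
        for sentinel, (start, end) in zip(sentinels, spans):
            source += tokens[prev:start] + [sentinel]
            target += [sentinel] + tokens[start:end]
            prev = end
        source += tokens[prev:]
        target += [sentinels[len(spans)], "</s>"]  # closing sentinel + end marker
        return " ".join(source), " ".join(target)

    toks = "Thank you for inviting me to your party last week".split()
    src, tgt = corrupt_spans(toks, [(2, 4), (8, 9)])
    print(src)  # Thank you <X> me to your party <Y> week
    print(tgt)  # <X> for inviting <Y> last <Z> </s>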
Fine-Tuning for Classification
◉ BART: repeat the input in the decoder; the label is predicted from the final decoder state
○ Encoder: “A B C D E”; decoder input: “<s> A B C D E” → label
◉ T5: treat classification as a seq2seq task; the decoder generates the label as text
○ Encoder: “A B C D E”; decoder: “<s>” → label
Diverse Noises in BART
◉ Noising transformations: token masking, token deletion, text infilling, sentence permutation, document rotation
Effectiveness of Denoising in T5
T5: Text-to-Text Transfer Transformer
◉ Multi-task pre-training: learning multiple tasks via seq2seq, each marked by a task prefix (sketch below)
26
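A minimal sketch of using task prefixes at inference time (assumes transformers and sentencepiece are installed; t5-small was pre-trained in this multi-task fashion, and the prefixes follow the T5 paper):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    for prompt in [
        "translate English to German: That is good.",  # translation
        "cola sentence: The course is jumping well.",  # acceptability judgment
    ]:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=20)
        print(tok.decode(out[0], skip_special_tokens=True))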
BART vs. T5
◉ Differences
○ Training data size: BART > T5 (about 2x)
○ Model size:
■ BART-large: 12-layer encoder, 12-layer decoder, 1024-dim hidden
■ T5-base: 12-layer encoder, 12-layer decoder, 768-dim hidden, 220M parameters (2x BERT-base)
■ T5-large: 24-layer encoder, 24-layer decoder, 1024-dim hidden, 770M parameters
○ Position encoding: learnable absolute positions (BART) vs. relative positions (T5)
◉ Understanding performance (SQuAD reported as EM / F1, MNLI as matched / mismatched):
      SQuAD        MNLI         SST   QQP   QNLI  STS-B  RTE   MRPC  CoLA
BART  88.8 / 94.6  89.9 / 90.1  96.6  92.5  94.9  91.2   87.2  90.4  62.8
T5    86.7 / 93.8  89.9 / 89.6  96.3  89.9  94.8  89.9   87.0  89.9  61.2
◉ Generation performance (summarization on CNN/DailyMail):
      ROUGE-1  ROUGE-2  ROUGE-L
BART  45.14    21.28    37.25
T5    42.50    20.68    39.75
mBART: Multilingual BART
◉ BART-style denoising pre-training on monolingual corpora across 25 languages
mT5: Multilingual T5
◉ T5 pre-training on the multilingual mC4 corpus covering 101 languages
Concluding Remarks
◉ Encoder
○ Bidirectional context
○ Examples: BERT and its variants
◉ Decoder
○ Language modeling; better for generation
○ Examples: GPT-2, GPT-3, LaMDA
◉ Encoder-Decoder
○ Sequence-to-sequence model
○ Examples: Transformer, BART, T5