More BERT
Applied Deep Learning
April 14th, 2020 http://adl.miulab.tw
Beyond BERT
◉ Better performance: XLNet, RoBERTa, SpanBERT
◉ Wide applications: XLM
(Dai et al., 2019)
Transformer-XL
Transformer
◉ Issue: context fragmentation from fixed-length segments
○ Long dependency: unable to model dependencies longer than the fixed segment length
○ Inefficient optimization: segments are chunked without respecting sentence boundaries, so the first tokens of each segment lack the context needed to predict them well
■ particularly troublesome even for short sequences
Transformer-XL (extra-long)
◉ Idea: segment-level recurrence
○ Hidden states of the previous segment are fixed (cached, no gradient) and reused as additional context when training the next segment
○ → increases the largest possible dependency length by N times (N: network depth)
resolves the context fragmentation issue and allows longer dependencies
State Reuse for Segment-Level Recurrence
◉ Vanilla: each segment is trained in isolation, so no information flows across segment boundaries
◉ State Reuse: the cached hidden states of the previous segment serve as extended context for the current segment
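A minimal sketch of the state-reuse idea, assuming a generic `layer(hidden, context)` interface (an illustrative abstraction, not the paper's or any library's actual API): the previous segment's hidden states are detached and concatenated as extra keys/values.

```python
import torch

def forward_with_memory(layers, segment_emb, memories):
    """Minimal sketch of segment-level recurrence (not the official implementation).

    layers      : list of callables; each takes (hidden, context) and returns new hidden states
    segment_emb : (seg_len, d) embeddings of the current segment
    memories    : cached hidden states from the previous segment, one tensor per layer

    Cached states are detached, so they provide extra context without
    back-propagating into the previous segment.
    """
    hidden = segment_emb
    new_memories = []
    for layer, mem in zip(layers, memories):
        new_memories.append(hidden.detach())        # cache this layer's input for the next segment
        context = torch.cat([mem, hidden], dim=0)   # previous + current segment as keys/values
        hidden = layer(hidden, context)             # attend over the extended context
    return hidden, new_memories
```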
Incoherent Positional Encoding
◉ Issue: naively applying segment-level recurrence does not work
○ absolute positional encodings are incoherent when states are reused
○ e.g., the cached segment and the current segment both use positions [0, 1, 2, 3], so the concatenated context looks like [0, 1, 2, 3, 0, 1, 2, 3] and the model cannot tell the two segments apart
Relative Positional Encoding
◉ Idea: relative positional encoding
○ learnable absolute embeddings → fixed (sinusoidal) relative embeddings with learnable transformations
■ the query vector for the position terms is the same for all query positions (trainable biases)
■ the attentive bias towards different words should remain the same regardless of the query position
→ better generalizability to longer sequences
◉ Attention score decomposition (content vs. position terms)
○ Absolute (vanilla), with trainable position embeddings $U$:
$$A^{\mathrm{abs}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_k E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_k U_j + U_i^{\top} W_q^{\top} W_k E_{x_j} + U_i^{\top} W_q^{\top} W_k U_j$$
○ Relative (Transformer-XL): absolute positions $U_j$ become relative encodings $R_{i-j}$, and the position-dependent query terms $U_i^{\top} W_q^{\top}$ become trainable parameters $u, v$:
$$A^{\mathrm{rel}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$$
→ much longer effective contexts than a vanilla model during evaluation
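A minimal sketch of the relative score for a single query position, assuming all projections have already been applied (the function and argument names are illustrative, not the paper's code):

```python
import torch

def rel_attn_scores(q_i, keys, rels, u, v):
    """Transformer-XL style attention scores for one query position i (minimal sketch).

    q_i  : (d,)       content projection of the query, W_q E_{x_i}
    keys : (klen, d)  content projections of the keys, W_{k,E} E_{x_j}
    rels : (klen, d)  projected relative encodings, rels[j] = W_{k,R} R_{i-j}
    u, v : (d,)       trainable biases replacing the absolute-position query terms
    """
    content = keys @ (q_i + u)    # content-based addressing + global content bias
    position = rels @ (q_i + v)   # relative positional bias + global position bias
    return content + position     # (klen,) unnormalized scores over the extended context
```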
Segment-Level Recurrence in Inference
◉ Vanilla: the whole context is recomputed from scratch for every new segment (sliding window), which is slow
◉ State Reuse: cached states from previous segments are reused, so only the new segment needs to be computed
Contributions
◉ Longer context dependency
○ Learns dependencies about 80% longer than RNNs and 450% longer than vanilla Transformers
○ Better perplexity on long sequences
○ Better perplexity on short sequences by addressing the fragmentation issue
◉ Speed increase
○ Process new segments without recomputation
○ Achieve up to 1,800+ times faster than a vanilla Transformer during evaluation on LM tasks
(Yang et al., 2019)
XLNet
Auto-Regressive (AR)
◉ Objective: factorize the sequence probability in a fixed order (forward or backward), so each token is predicted from either its previous or its following context, but not both
Auto-Encoding (AE)
◉ Objective: reconstruct the original tokens $\bar{x}$ from the corrupted input $\hat{x}$
○ dimension reduction or denoising (masked LM)
○ BERT randomly masks 15% of the tokens
Auto-Encoding (AE)
◉ Issues
○ Independence assumption: masked tokens are predicted independently of each other, ignoring the dependencies between them
○ Input noise: [MASK] tokens appear during pre-training (w/ [MASK]) but never during fine-tuning (w/o [MASK]) → pretrain-finetune discrepancy
Permutation Language Model
◉ Goal: keep an AR objective while using bidirectional context for prediction
◉ Idea: maximize the expected likelihood over all factorization orders, with parameters shared across orders
○ a sequence of length T has T! valid AR factorization orders
○ pre-train on factorization orders sampled from all possible permutations
Permutation Language Model
◉ Implementation: only the factorization order is permuted, not the sequence itself
○ keep the original positional encodings (tokens stay in their original positions)
○ rely on proper attention masks in Transformers to realize the sampled order
→ resolves the independence assumption and pretrain-finetune discrepancy issues
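A minimal sketch of what "permuting only the factorization order" means in practice; the helper and its output format are illustrative assumptions, not XLNet's implementation:

```python
import random

def sample_factorization(tokens, rng=random):
    """Sample one factorization order for a permutation LM (minimal sketch).

    The tokens stay in their original positions; only the *order* in which they
    are predicted is permuted. Returns (target_position, context_positions)
    pairs, i.e. which already-factorized positions each prediction may condition on.
    """
    order = list(range(len(tokens)))
    rng.shuffle(order)                    # a random factorization order z
    steps = []
    for t, pos in enumerate(order):
        context = sorted(order[:t])       # positions earlier in the order, z_{<t}
        steps.append((pos, context))
    return steps

# e.g. for ["x1", "x2", "x3", "x4"] and order [2, 0, 3, 1]:
# predict position 2 from {}, position 0 from {2}, position 3 from {0, 2}, position 1 from {0, 2, 3}
```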
Reparameterizing the Formulation
◉ Issue: naively applying the permutation LM does not work
◉ Original formulation
○ in BERT, [MASK] indicates the target position; here the representation $h_\theta(x_{z_{<t}})$ does not depend on which position $z_t$ is being predicted
◉ Reparameterization
○ $g_\theta(x_{z_{<t}}, z_t)$ is a new representation that additionally conditions on the target position $z_t$
○ Example: for the factorization orders $(x_1, x_2, x_3, x_4)$ and $(x_1, x_2, x_4, x_3)$, the naive formulation computes $P(x_3 \mid x_1, x_2)$ and $P(x_4 \mid x_1, x_2)$ from the same $h_\theta(x_1, x_2)$, so two different target positions receive the same predicted distribution
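Written out (as in the XLNet paper), the two softmax parameterizations differ only in whether the representation sees the target position:
$$p_\theta(X_{z_t}=x \mid x_{z_{<t}}) = \frac{\exp\big(e(x)^{\top} h_\theta(x_{z_{<t}})\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_\theta(x_{z_{<t}})\big)} \;\;\longrightarrow\;\; \frac{\exp\big(e(x)^{\top} g_\theta(x_{z_{<t}}, z_t)\big)}{\sum_{x'} \exp\big(e(x')^{\top} g_\theta(x_{z_{<t}}, z_t)\big)}$$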
Two-Stream Self-Attention
◉ Requirements for $g_\theta(x_{z_{<t}}, z_t)$
○ 1) predicting the token $x_{z_t}$ should use only the position $z_t$, not the content $x_{z_t}$
○ 2) predicting later tokens $x_{z_j}$ ($j > t$) should encode the content $x_{z_t}$
◉ Idea: two sets of hidden representations
○ Content stream: can see self
○ Query stream: cannot see self
Two-Stream Self-Attention
◉ Content stream
○ encodes the token itself; serves as context when predicting other tokens
◉ Query stream
○ used to predict the current token, so it sees the target position but not its content
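A minimal sketch of the two attention masks implied by a sampled factorization order (illustrative NumPy code; XLNet builds these masks inside the Transformer):

```python
import numpy as np

def two_stream_masks(order):
    """Attention masks for a factorization order `order` (token positions in prediction order).

    mask[i, j] = True means position i may attend to position j.
    The content stream sees everything up to and including itself in the order;
    the query stream sees only strictly earlier tokens, never its own content.
    """
    order = np.asarray(order)
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))               # rank[pos] = step at which pos is predicted
    content_mask = rank[:, None] >= rank[None, :]     # can see self
    query_mask = rank[:, None] > rank[None, :]        # cannot see self
    return content_mask, query_mask
```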
GLUE Results
Contributions
◉ AR formulation avoids AE's independence assumption and pretrain-finetune discrepancy
◉ Permutation brings AE-style bidirectional context into the AR objective
(Liu et al., 2019)
RoBERTa
RoBERTa
◉ Dynamic masking (see the sketch below)
○ the masking pattern is generated anew every time a sequence is fed to the model
■ Original BERT masks once during data preprocessing (static masking); even duplicating the data only gives each sequence 10 different masks over the 40 training epochs
◉ Optimization hyperparameters
○ peak learning rate and number of warmup steps tuned separately for each setting
■ Training is very sensitive to the Adam epsilon term
■ Setting β2 = 0.98 improves stability when training with large batch sizes
◉ Data
○ do not randomly inject short sequences
○ train only with full-length (512-token) sequences
■ Original BERT trains with a reduced sequence length for the first 90% of updates
○ BookCorpus, CC-News, OpenWebText, Stories
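A minimal sketch of dynamic masking, with illustrative names and the 80/10/10 replacement rule omitted; not RoBERTa's actual preprocessing code:

```python
import random

def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=random):
    """Resample the mask on every call, i.e. every time the sequence is fed to
    the model, instead of fixing the mask once during preprocessing."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # corrupt the input
            labels.append(tok)          # predict the original token here
        else:
            masked.append(tok)
            labels.append(None)         # no loss at unmasked positions
    return masked, labels
```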
GLUE Results
(Joshi et al., 2019)
SpanBERT
SpanBERT
◉ Span masking (see the sketch below)
○ mask contiguous spans of tokens (span lengths drawn from a geometric distribution) rather than individual random tokens
◉ Single-sequence training
○ a single contiguous segment of text for each training sample (instead of two segments with NSP)
◉ Span boundary objective (SBO)
○ predict each token of a masked span using only the representations of the span's boundary tokens (plus the target position)
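A minimal sketch of span masking under assumed hyperparameters (geometric p = 0.2, spans clipped at 10, ~15% masking budget); parameter names are illustrative, not SpanBERT's code:

```python
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=np.random):
    """Sample contiguous spans to mask until roughly `mask_budget` of the
    sequence is covered (minimal sketch)."""
    to_mask = int(round(seq_len * mask_budget))
    masked = set()
    while len(masked) < to_mask:
        length = min(rng.geometric(p), max_span)       # span length ~ Geo(p), clipped
        start = rng.randint(0, seq_len)                # random span start
        for i in range(start, min(start + length, seq_len)):
            masked.add(i)                              # mask the whole contiguous span
    return sorted(masked)
```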
Results
◉ Ablation of the masking scheme
◉ Ablation of the auxiliary objective (SBO)
(Lample & Conneau, 2019)
XLM
XLM
◉ Objectives: masked LM (monolingual data) + translation LM (TLM, parallel data)
○ TLM concatenates a parallel sentence pair and masks tokens in both languages, so a masked word can be predicted from either language
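A minimal sketch of how a TLM training example could be assembled; the special-token layout, language ids, and position reset shown here are simplified assumptions, not XLM's exact format:

```python
def make_tlm_example(src_tokens, tgt_tokens, sep="</s>"):
    """Concatenate a parallel sentence pair so masked tokens in one language
    can be predicted from the other (minimal sketch)."""
    tokens = src_tokens + [sep] + tgt_tokens
    langs = ["src"] * (len(src_tokens) + 1) + ["tgt"] * len(tgt_tokens)          # language embeddings
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))  # positions reset for the target sentence
    return tokens, langs, positions
```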
(Lan et al., 2020)
ALBERT
Beyond BERT
◉ Better performance: XLNet, RoBERTa, SpanBERT
◉ Wide applications: XLM
◉ Compact model: ALBERT
ALBERT: A Lite BERT
1. Factorized embedding parameterization (see the sketch below)
○ In BERT, the WordPiece embedding size E is tied to the hidden size H (E ≡ H)
○ WordPiece embeddings are context-independent, while hidden-layer representations are context-dependent → more efficient to set E << H
○ Decompose the V × H embedding matrix into a V × E matrix and an E × H projection
2. Cross-layer parameter sharing
○ Share parameters across all Transformer layers
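A minimal sketch of the factorized embedding (point 1), with illustrative sizes; not ALBERT's actual code:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """V x H embedding parameters become V x E + E x H, much smaller when E << H."""
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_size)  # V x E
        self.proj = nn.Linear(embed_size, hidden_size)         # E x H projection up to the hidden size

    def forward(self, token_ids):
        return self.proj(self.word_emb(token_ids))

# e.g. V = 30k, H = 768: a tied embedding has ~23.0M parameters,
# while the factorized version with E = 128 has 30k*128 + 128*768 ≈ 3.9M.
```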
3. Inter-sentence coherence loss (see the sketch below)
○ NSP (next sentence prediction) mixes topic prediction with coherence (ordering) prediction
○ Topical cues are easier, so the model relies mostly on them
○ SOP (sentence order prediction) uses two consecutive segments, in order or swapped, so it focuses on ordering rather than topical cues
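A minimal sketch of how SOP examples can be constructed (illustrative helper, not ALBERT's data pipeline):

```python
import random

def make_sop_example(seg_a, seg_b, rng=random):
    """seg_a and seg_b are two consecutive segments from the same document:
    kept in order (label 1) or swapped (label 0). Negatives never come from a
    different document, so topic cues alone cannot solve the task."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1   # original order
    return (seg_b, seg_a), 0       # swapped order
```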
4. Additional data and removing dropout
GLUE Results
Concluding Remarks
◉ Transformer-XL (https://github.com/kimiyoung/transformer-xl)
○ Longer context dependency
◉ XLNet (https://github.com/zihangdai/xlnet)
○ AR + AE
○ No pretrain-finetune discrepancy
◉ RoBERTa (http://github.com/pytorch/fairseq)
○ Optimization details & data
◉ SpanBERT
○ Better for QA, NLI, coreference
◉ XLM (https://github.com/facebookresearch/XLM)
○ Zero-shot scenarios
◉ ALBERT (https://github.com/google-research/google-research/tree/master/albert / https://github.com/brightmart/albert_zh)
○ Compact model, faster training/fine-tuning