More BERT
Applied Deep Learning
April 14th, 2020 http://adl.miulab.tw
Beyond BERT
◉ Better performance: XLNet, RoBERTa, SpanBERT
◉ Wide applications: XLM
(Dai et al., 2019)
Transformer-XL
Transformer
◉ Issue: context fragmentation from fixed-length segments
○ Long dependency: unable to model dependencies longer than the fixed segment length
○ Inefficient optimization: segments are chunked without respecting sentence boundaries, so the first tokens of each segment lack the context needed to predict them well
■ particularly troublesome even for short sequences
Transformer-XL (extra-long)
◉ Idea: segment-level recurrence
○ Hidden states of the previous segment are fixed (cached, no gradient) and reused as additional context when training the next segment
○ → increases the largest possible dependency length by N times (N: network depth)
resolves the context fragmentation issue and allows longer dependencies
State Reuse for Segment-Level Recurrence
◉ Vanilla: each segment is trained in isolation, so no information flows across segment boundaries
◉ State Reuse: the cached hidden states of the previous segment serve as extended context for the current segment
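A minimal sketch of the state-reuse idea, assuming a generic `layer(hidden, context)` interface (an illustrative abstraction, not the paper's or any library's actual API): the previous segment's hidden states are detached and concatenated as extra keys/values.

```python
import torch

def forward_with_memory(layers, segment_emb, memories):
    """Minimal sketch of segment-level recurrence (not the official implementation).

    layers      : list of callables; each takes (hidden, context) and returns new hidden states
    segment_emb : (seg_len, d) embeddings of the current segment
    memories    : cached hidden states from the previous segment, one tensor per layer

    Cached states are detached, so they provide extra context without
    back-propagating into the previous segment.
    """
    hidden = segment_emb
    new_memories = []
    for layer, mem in zip(layers, memories):
        new_memories.append(hidden.detach())        # cache this layer's input for the next segment
        context = torch.cat([mem, hidden], dim=0)   # previous + current segment as keys/values
        hidden = layer(hidden, context)             # attend over the extended context
    return hidden, new_memories
```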
Incoherent Positional Encoding
◉ Issue: naively applying segment-level recurrence does not work
○ absolute positional encodings are incoherent when states are reused
○ e.g., the cached segment and the current segment both use positions [0, 1, 2, 3], so the concatenated context looks like [0, 1, 2, 3, 0, 1, 2, 3] and the model cannot tell the two segments apart
Relative Positional Encoding
◉ Idea: relative positional encoding
○ learnable absolute embeddings → fixed (sinusoidal) relative embeddings with learnable transformations
■ the query vector for the position terms is the same for all query positions (trainable biases)
■ the attentive bias towards different words should remain the same regardless of the query position
→ better generalizability to longer sequences
◉ Attention score decomposition (content vs. position terms)
○ Absolute (vanilla), with trainable position embeddings $U$:
$$A^{\mathrm{abs}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_k E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_k U_j + U_i^{\top} W_q^{\top} W_k E_{x_j} + U_i^{\top} W_q^{\top} W_k U_j$$
○ Relative (Transformer-XL): absolute positions $U_j$ become relative encodings $R_{i-j}$, and the position-dependent query terms $U_i^{\top} W_q^{\top}$ become trainable parameters $u, v$:
$$A^{\mathrm{rel}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$$
→ much longer effective contexts than a vanilla model during evaluation
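A minimal sketch of the relative score for a single query position, assuming all projections have already been applied (the function and argument names are illustrative, not the paper's code):

```python
import torch

def rel_attn_scores(q_i, keys, rels, u, v):
    """Transformer-XL style attention scores for one query position i (minimal sketch).

    q_i  : (d,)       content projection of the query, W_q E_{x_i}
    keys : (klen, d)  content projections of the keys, W_{k,E} E_{x_j}
    rels : (klen, d)  projected relative encodings, rels[j] = W_{k,R} R_{i-j}
    u, v : (d,)       trainable biases replacing the absolute-position query terms
    """
    content = keys @ (q_i + u)    # content-based addressing + global content bias
    position = rels @ (q_i + v)   # relative positional bias + global position bias
    return content + position     # (klen,) unnormalized scores over the extended context
```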
Segment-Level Recurrence in Inference
◉ Vanilla: the whole context is recomputed from scratch for every new segment (sliding window), which is slow
◉ State Reuse: cached states from previous segments are reused, so only the new segment needs to be computed
Contributions
◉ Longer context dependency
○ Learns dependencies about 80% longer than RNNs and 450% longer than vanilla Transformers
○ Better perplexity on long sequences
○ Better perplexity on short sequences by addressing the fragmentation issue
◉ Speed increase
○ Process new segments without recomputation
○ Achieve up to 1,800+ times faster than a vanilla Transformer during evaluation on LM tasks
(Yang et al., 2019)
XLNet
Auto-Regressive (AR)
◉ Objective: factorize the sequence probability in a fixed order (forward or backward), so each token is predicted from either its previous or its following context, but not both
Auto-Encoding (AE)
◉ Objective: reconstruct the original tokens $\bar{x}$ from the corrupted input $\hat{x}$
○ dimension reduction or denoising (masked LM)
○ BERT randomly masks 15% of the tokens
Auto-Encoding (AE)
◉ Issues
○ Independence assumption: masked tokens are predicted independently of each other, ignoring the dependencies between them
○ Input noise: [MASK] tokens appear during pre-training (w/ [MASK]) but never during fine-tuning (w/o [MASK]) → pretrain-finetune discrepancy
Permutation Language Model
◉ Goal: keep an AR objective while using bidirectional context for prediction
◉ Idea: maximize the expected likelihood over all factorization orders, with parameters shared across orders
○ a sequence of length T has T! valid AR factorization orders
○ pre-train on factorization orders sampled from all possible permutations
Permutation Language Model
◉ Implementation: only the factorization order is permuted, not the sequence itself
○ keep the original positional encodings (tokens stay in their original positions)
○ rely on proper attention masks in Transformers to realize the sampled order
→ resolves the independence assumption and pretrain-finetune discrepancy issues
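A minimal sketch of what "permuting only the factorization order" means in practice; the helper and its output format are illustrative assumptions, not XLNet's implementation:

```python
import random

def sample_factorization(tokens, rng=random):
    """Sample one factorization order for a permutation LM (minimal sketch).

    The tokens stay in their original positions; only the *order* in which they
    are predicted is permuted. Returns (target_position, context_positions)
    pairs, i.e. which already-factorized positions each prediction may condition on.
    """
    order = list(range(len(tokens)))
    rng.shuffle(order)                    # a random factorization order z
    steps = []
    for t, pos in enumerate(order):
        context = sorted(order[:t])       # positions earlier in the order, z_{<t}
        steps.append((pos, context))
    return steps

# e.g. for ["x1", "x2", "x3", "x4"] and order [2, 0, 3, 1]:
# predict position 2 from {}, position 0 from {2}, position 3 from {0, 2}, position 1 from {0, 2, 3}
```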
Reparameterizing the Formulation
◉ Issue: naively applying the permutation LM does not work
◉ Original formulation
○ in BERT, [MASK] indicates the target position; here the representation $h_\theta(x_{z_{<t}})$ does not depend on which position $z_t$ is being predicted
◉ Reparameterization
○ $g_\theta(x_{z_{<t}}, z_t)$ is a new representation that additionally conditions on the target position $z_t$
○ Example: for the factorization orders $(x_1, x_2, x_3, x_4)$ and $(x_1, x_2, x_4, x_3)$, the naive formulation computes $P(x_3 \mid x_1, x_2)$ and $P(x_4 \mid x_1, x_2)$ from the same $h_\theta(x_1, x_2)$, so two different target positions receive the same predicted distribution
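Written out (as in the XLNet paper), the two softmax parameterizations differ only in whether the representation sees the target position:
$$p_\theta(X_{z_t}=x \mid x_{z_{<t}}) = \frac{\exp\big(e(x)^{\top} h_\theta(x_{z_{<t}})\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_\theta(x_{z_{<t}})\big)} \;\;\longrightarrow\;\; \frac{\exp\big(e(x)^{\top} g_\theta(x_{z_{<t}}, z_t)\big)}{\sum_{x'} \exp\big(e(x')^{\top} g_\theta(x_{z_{<t}}, z_t)\big)}$$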
Two-Stream Self-Attention
◉ Requirements for $g_\theta(x_{z_{<t}}, z_t)$
○ 1) predicting the token $x_{z_t}$ should use only the position $z_t$, not the content $x_{z_t}$
○ 2) predicting later tokens $x_{z_j}$ ($j > t$) should encode the content $x_{z_t}$
◉ Idea: two sets of hidden representations
○ Content stream: can see self
○ Query stream: cannot see self
Two-Stream Self-Attention
◉ Content stream
○ encodes the token itself; serves as context when predicting other tokens
◉ Query stream
○ used to predict the current token, so it sees the target position but not its content
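A minimal sketch of the two attention masks implied by a sampled factorization order (illustrative NumPy code; XLNet builds these masks inside the Transformer):

```python
import numpy as np

def two_stream_masks(order):
    """Attention masks for a factorization order `order` (token positions in prediction order).

    mask[i, j] = True means position i may attend to position j.
    The content stream sees everything up to and including itself in the order;
    the query stream sees only strictly earlier tokens, never its own content.
    """
    order = np.asarray(order)
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))               # rank[pos] = step at which pos is predicted
    content_mask = rank[:, None] >= rank[None, :]     # can see self
    query_mask = rank[:, None] > rank[None, :]        # cannot see self
    return content_mask, query_mask
```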
GLUE Results
Contributions
◉ AR formulation avoids AE's independence assumption and pretrain-finetune discrepancy
◉ Permutation brings AE-style bidirectional context into the AR objective
(Liu et al., 2019)
RoBERTa
RoBERTa
◉ Dynamic masking (see the sketch below)
○ the masking pattern is generated anew every time a sequence is fed to the model
■ Original BERT masks once during data preprocessing (static masking); even duplicating the data only gives each sequence 10 different masks over the 40 training epochs
◉ Optimization hyperparameters
○ peak learning rate and number of warmup steps tuned separately for each setting
■ Training is very sensitive to the Adam epsilon term
■ Setting β2 = 0.98 improves stability when training with large batch sizes
◉ Data
○ do not randomly inject short sequences
○ train only with full-length (512-token) sequences
■ Original BERT trains with a reduced sequence length for the first 90% of updates
○ BookCorpus, CC-News, OpenWebText, Stories
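A minimal sketch of dynamic masking, with illustrative names and the 80/10/10 replacement rule omitted; not RoBERTa's actual preprocessing code:

```python
import random

def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=random):
    """Resample the mask on every call, i.e. every time the sequence is fed to
    the model, instead of fixing the mask once during preprocessing."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # corrupt the input
            labels.append(tok)          # predict the original token here
        else:
            masked.append(tok)
            labels.append(None)         # no loss at unmasked positions
    return masked, labels
```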
GLUE Results
(Joshi et al., 2019)
SpanBERT
SpanBERT
◉ Span masking (see the sketch below)
○ mask contiguous spans of tokens (span lengths drawn from a geometric distribution) rather than individual random tokens
◉ Single-sequence training
○ a single contiguous segment of text for each training sample (instead of two segments with NSP)
◉ Span boundary objective (SBO)
○ predict each token of a masked span using only the representations of the span's boundary tokens (plus the target position)
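A minimal sketch of span masking under assumed hyperparameters (geometric p = 0.2, spans clipped at 10, ~15% masking budget); parameter names are illustrative, not SpanBERT's code:

```python
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=np.random):
    """Sample contiguous spans to mask until roughly `mask_budget` of the
    sequence is covered (minimal sketch)."""
    to_mask = int(round(seq_len * mask_budget))
    masked = set()
    while len(masked) < to_mask:
        length = min(rng.geometric(p), max_span)       # span length ~ Geo(p), clipped
        start = rng.randint(0, seq_len)                # random span start
        for i in range(start, min(start + length, seq_len)):
            masked.add(i)                              # mask the whole contiguous span
    return sorted(masked)
```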
Results
◉ Ablation of the masking scheme
◉ Ablation of the auxiliary objective (SBO)
(Lample & Conneau, 2019)
XLM
XLM
◉ Objectives: masked LM (monolingual data) + translation LM (TLM, parallel data)
○ TLM concatenates a parallel sentence pair and masks tokens in both languages, so a masked word can be predicted from either language
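A minimal sketch of how a TLM training example could be assembled; the special-token layout, language ids, and position reset shown here are simplified assumptions, not XLM's exact format:

```python
def make_tlm_example(src_tokens, tgt_tokens, sep="</s>"):
    """Concatenate a parallel sentence pair so masked tokens in one language
    can be predicted from the other (minimal sketch)."""
    tokens = src_tokens + [sep] + tgt_tokens
    langs = ["src"] * (len(src_tokens) + 1) + ["tgt"] * len(tgt_tokens)          # language embeddings
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))  # positions reset for the target sentence
    return tokens, langs, positions
```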
(Lan et al., 2020)
ALBERT
Beyond BERT
◉ Better performance: XLNet, RoBERTa, SpanBERT
◉ Wide applications: XLM
◉ Compact model: ALBERT
ALBERT: A Lite BERT
1. Factorized embedding parameterization (see the sketch below)
○ In BERT, the WordPiece embedding size E is tied to the hidden size H (E ≡ H)
○ WordPiece embeddings are context-independent, while hidden-layer representations are context-dependent → more efficient to set E << H
○ Decompose the V × H embedding matrix into a V × E matrix and an E × H projection
2. Cross-layer parameter sharing
○ Share parameters across all Transformer layers
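A minimal sketch of the factorized embedding (point 1), with illustrative sizes; not ALBERT's actual code:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """V x H embedding parameters become V x E + E x H, much smaller when E << H."""
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_size)  # V x E
        self.proj = nn.Linear(embed_size, hidden_size)         # E x H projection up to the hidden size

    def forward(self, token_ids):
        return self.proj(self.word_emb(token_ids))

# e.g. V = 30k, H = 768: a tied embedding has ~23.0M parameters,
# while the factorized version with E = 128 has 30k*128 + 128*768 ≈ 3.9M.
```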
3. Inter-sentence coherence loss (see the sketch below)
○ NSP (next sentence prediction) mixes topic prediction with coherence (ordering) prediction
○ Topical cues are easier, so the model relies mostly on them
○ SOP (sentence order prediction) uses two consecutive segments, in order or swapped, so it focuses on ordering rather than topical cues
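A minimal sketch of how SOP examples can be constructed (illustrative helper, not ALBERT's data pipeline):

```python
import random

def make_sop_example(seg_a, seg_b, rng=random):
    """seg_a and seg_b are two consecutive segments from the same document:
    kept in order (label 1) or swapped (label 0). Negatives never come from a
    different document, so topic cues alone cannot solve the task."""
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1   # original order
    return (seg_b, seg_a), 0       # swapped order
```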
4. Additional data and removing dropout
GLUE Results
Concluding Remarks
◉ Transformer-XL (https://github.com/kimiyoung/transformer-xl)
○ Longer context dependency
◉ XLNet (https://github.com/zihangdai/xlnet)
○ AR + AE
○ No pretrain-finetune discrepancy
◉ RoBERTa (http://github.com/pytorch/fairseq)
○ Optimization details & data
◉ SpanBERT
○ Better for QA, NLI, coreference
◉ XLM (https://github.com/facebookresearch/XLM)
○ Zero-shot scenarios
◉ ALBERT (https://github.com/google-research/google-research/tree/master/albert / https://github.com/brightmart/albert_zh)
○ Compact model, faster training/fine-tuning