(1)

More BERT

Applied Deep Learning

April 14th, 2020 http://adl.miulab.tw

(2)

Beyond BERT

Better Performance

Wide Applications

XLNet

RoBERTa

SpanBERT

XLM

2

(3)

(Dai et al, 2019)

Transformer-XL

3

(4)

Transformer

◉ Issue: context fragmentation

○ Long dependency: cannot model dependencies longer than the fixed segment length

○ Inefficient optimization: fixed-length segments ignore sentence boundaries

■ particularly troublesome even for short sequences

4

(5)

Transformer-XL (extra-long)

◉ Idea: segment-level recurrence

○ Hidden states of the previous segment are fixed (no gradient) and cached, then reused as extra context when training the next segment

○ → increases the largest dependency length by N times (N: network depth)

○ → resolves the context fragmentation issue and extends the dependency length (see the sketch after this slide)

5
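As an illustration of the cached-state idea, here is a minimal PyTorch sketch (not the official Transformer-XL code): the function name `attend_with_memory` and the plain weight matrices are assumptions, and relative positional encoding plus the causal mask are omitted for brevity.

```python
import torch

def attend_with_memory(W_q, W_k, W_v, current_h, cached_h):
    """current_h: [seg_len, d] hidden states of the current segment;
    cached_h: [mem_len, d] hidden states cached from the previous segment."""
    memory = cached_h.detach()                       # fixed: no gradient flows into the cache
    context = torch.cat([memory, current_h], dim=0)  # keys/values span memory + current segment
    q = current_h @ W_q                              # queries come only from the current segment
    k, v = context @ W_k, context @ W_v
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    out = attn @ v
    return out  # cache `out` (per layer) as the memory for the next segment
```

Because each layer attends to the cached states of the layer below, the reachable context grows with depth, which is where the "N times" dependency-length argument comes from.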

(6)

State Reuse for Segment-Level Recurrence

◉ Vanilla

◉ State Reuse

6

(7)

Incoherent Positional Encoding

◉ Issue: naively applying segment-level recurrence does not work

○ absolute positional encodings become incoherent when states are reused

○ e.g. each segment is encoded with positions [0, 1, 2, 3], so the reused context sees [0, 1, 2, 3, 0, 1, 2, 3] and cannot distinguish positions across segments

7

(8)

Relative Positional Encoding

◉ Idea: relative positional encoding

○ learnable absolute position embeddings → fixed (sinusoidal) relative embeddings with learnable transformations

○ the query vector is the same for all query positions, so the attentive bias towards different words should remain the same

○ better generalizability to longer sequences

[Figure: attention score terms under absolute vs. relative encoding, separating content and position terms and marking which are trainable parameters; the decomposition is reproduced after this slide]

○ → much longer effective contexts than a vanilla model during evaluation

8
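For reference, the relative attention score between query position i and key position j decomposes into four terms in the Transformer-XL paper, where the relative embeddings $R_{i-j}$ are fixed sinusoids and $u$, $v$ are the trainable bias vectors mentioned above:

```latex
A^{\mathrm{rel}}_{i,j} =
\underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content-based addressing}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content-dependent positional bias}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{global positional bias}}
```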

(9)

Segment-Level Recurrence in Inference

◉ Vanilla

◉ State Reuse

9

(10)

Contributions

◉ Longer context dependency

○ Learns dependency about 80% longer than RNNs and 450% longer than vanilla Transformers

○ Better perplexity on long sequences

○ Better perplexity on short sequences by addressing the fragmentation issue

◉ Speed increase

○ Processes new segments without recomputation

○ Up to 1,800+ times faster than a vanilla Transformer during evaluation on LM tasks

10

(11)

(Yang et al., 2019)

XLNet

11

(12)

Auto-Regressive (AR)

◉ Objective: model the sequence conditioned on either the preceding or the following context (unidirectional)

12

(13)

Auto-Encoding (AE)

◉ Objective: reconstructing the original $\bar{x}$ from the corrupted input $\hat{x}$

○ dimension reduction or denoising (masked LM)

○ e.g. randomly mask 15% of tokens and predict them (both objectives are written out after this slide)

13
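To make the contrast concrete, the two objectives can be written as follows (notation follows the XLNet paper: $\hat{\mathbf{x}}$ is the corrupted input, $\bar{\mathbf{x}}$ the set of masked tokens, and $m_t = 1$ iff $x_t$ is masked); the AE objective is only approximate because of the independence assumption discussed on the next slide:

```latex
\text{AR: } \max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})
\qquad
\text{AE: } \max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})
```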

(14)

Auto-Encoding (AE)

◉ Issues

○ Independence assumption: masked tokens are predicted independently, ignoring dependencies between them

○ Input noise: [MASK] appears in pre-training (w/ [MASK]) but not in fine-tuning (w/o [MASK]) → pretrain-finetune discrepancy

14

(15)

Permutation Language Model

◉ Goal: keep the AR objective while using bidirectional contexts for prediction

◉ Idea: share parameters across all factorization orders in expectation

○ a sequence of length T has T! valid AR factorization orders

○ pre-train on factorization orders sampled from all possible permutations

15

(16)

Permutation Language Model

◉ Implementation: only permute the factorization order, not the input sequence

○ keep the original positional encodings

○ rely on proper attention masks in the Transformer

○ → resolves the independence assumption and pretrain-finetune discrepancy issues

16

(17)

Reparameterizing the Formulation

◉ Issue: naively applying permutation LM does not work

◉ Original formulation

○ in BERT's masked LM, the [MASK] token indicates the target position

○ here $h_\theta(x_{z_{<t}})$ does not depend on the predicted position $z_t$

○ e.g. for the factorization orders $(x_1, x_2, x_3, x_4)$ and $(x_1, x_2, x_4, x_3)$, predicting the third token yields the same distribution for $P(x_3 \mid x_1, x_2)$ and $P(x_4 \mid x_1, x_2)$, since the context $(x_1, x_2)$ is identical

◉ Reparameterization (shown below)

○ $g_\theta(x_{z_{<t}}, z_t)$ is a new embedding that also conditions on the target position $z_t$

17
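Concretely, the next-token distribution changes only in the hidden state it uses (formulation as in the XLNet paper; $e(x)$ is the embedding of token $x$):

```latex
\text{original: } \;
p_\theta(X_{z_t} = x \mid \mathbf{x}_{z_{<t}})
  = \frac{\exp\!\big(e(x)^{\top} h_\theta(\mathbf{x}_{z_{<t}})\big)}
         {\sum_{x'} \exp\!\big(e(x')^{\top} h_\theta(\mathbf{x}_{z_{<t}})\big)}
\qquad
\text{reparameterized: } \;
p_\theta(X_{z_t} = x \mid \mathbf{x}_{z_{<t}})
  = \frac{\exp\!\big(e(x)^{\top} g_\theta(\mathbf{x}_{z_{<t}}, z_t)\big)}
         {\sum_{x'} \exp\!\big(e(x')^{\top} g_\theta(\mathbf{x}_{z_{<t}}, z_t)\big)}
```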

(18)

Two-Stream Self-Attention

◉ Formulation of $g_\theta(x_{z_{<t}}, z_t)$: two conflicting requirements

○ 1) predicting the token $x_{z_t}$ should use only the position $z_t$, not the content $x_{z_t}$

○ 2) predicting the other tokens $x_{z_j}$ ($j > t$) should encode the content $x_{z_t}$

◉ Idea: two sets of hidden representations

○ Content stream: can see self

○ Query stream: cannot see self

18

(19)

Two-Stream Self-Attention

◉ Content stream

○ Predict other tokens

◉ Query stream

○ Predict the current token

19
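The two streams differ only in their attention masks. Below is a toy Python sketch of how such masks could be derived from a sampled factorization order; `permutation_masks` is a hypothetical helper, and the real XLNet implementation (with partial prediction and memory) is considerably more involved.

```python
import torch

def permutation_masks(perm):
    """perm: 1-D LongTensor giving a sampled factorization order, e.g. tensor([2, 0, 3, 1]).
    Returns two [T, T] boolean masks where entry (i, j) = True means position i may attend to j."""
    T = perm.shape[0]
    step = torch.empty(T, dtype=torch.long)
    step[perm] = torch.arange(T)                            # step[i] = when original position i is predicted
    content_mask = step.unsqueeze(1) >= step.unsqueeze(0)   # content stream: earlier steps and itself
    query_mask = step.unsqueeze(1) > step.unsqueeze(0)      # query stream: earlier steps only, not itself
    return content_mask, query_mask
```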

(20)

GLUE Results

20

(21)

Contributions

◉ AR factorization addresses the independence assumption of masked LMs

◉ No [MASK] corruption, so no pretrain-finetune discrepancy, while permutation still provides bidirectional (AE-like) contexts

21

(22)

(Liu et al., 2019)

RoBERTa

22

(23)

RoBERTa

◉ Dynamic masking

○ each sequence is masked in 10 different ways over the 40 epochs of training (see the sketch after this slide)

○ original BERT performs masking once during data preprocessing (static masking)

◉ Optimization hyperparameters

○ peak learning rate and number of warmup steps tuned separately for each setting

○ training is very sensitive to the Adam epsilon term

○ setting β2 = 0.98 improves stability when training with large batch sizes

◉ Data

○ do not randomly inject short sequences; train only with full-length sequences

○ original BERT trains with a reduced sequence length for the first 90% of updates

○ BookCorpus, CC-News, OpenWebText, Stories

23
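A minimal sketch of the dynamic-masking idea, assuming plain token lists rather than the actual fairseq pipeline; `dynamic_mask` is an illustrative name, and the 80/10/10 replacement scheme follows the original BERT recipe.

```python
import random

def dynamic_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Re-sample the masked positions each time the sequence is drawn (per epoch),
    instead of masking once in data preprocessing as in the original BERT."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = mask_token               # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)     # 10%: replace with a random vocabulary token
            # remaining 10%: keep the original token unchanged
    return masked, labels
```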

(24)

GLUE Results

24

(25)

(Joshi et al., 2019)

SpanBERT

25

(26)

SpanBERT

◉ Span masking

○ contiguous spans of tokens are masked at random instead of individual tokens (see the sketch after this slide)

◉ Single sentence training

○ a single contiguous segment of text for each training sample (instead of two)

◉ Span boundary objective (SBO)

○ predict the entire masked span using only the span’s boundary

26
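A rough sketch of the span-masking process; `sample_masked_spans` is illustrative, and the geometric span-length distribution with p = 0.2 clipped at 10 follows the parameters reported in the SpanBERT paper.

```python
import random

def sample_masked_spans(seq_len, mask_budget=0.15, p=0.2, max_span=10):
    """Draw span lengths from Geo(p) clipped at max_span and add spans
    until roughly 15% of the tokens are covered."""
    masked = set()
    budget = int(seq_len * mask_budget)
    while len(masked) < budget:
        length = 1
        while random.random() > p and length < max_span:   # geometric sample, clipped
            length += 1
        start = random.randrange(seq_len)
        masked.update(range(start, min(start + length, seq_len)))
    return sorted(masked)
```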

(27)

Results

◉ Masking scheme

◉ Auxiliary objective

27

(28)

(Lample & Conneau, 2019)

XLM

28

(29)

XLM

◉ Masked LM (monolingual data) + Translation LM (parallel data; see the sketch after this slide)

29
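A minimal sketch of how a Translation LM input could be assembled from a parallel sentence pair, following the description in the XLM paper (positions restart for the target sentence and each token carries a language id); `tlm_input` is an illustrative helper, and the masking step itself is omitted.

```python
def tlm_input(src_tokens, tgt_tokens, src_lang_id, tgt_lang_id):
    """Concatenate a parallel pair so that a masked token in one language
    can attend to its translation in the other."""
    tokens = list(src_tokens) + list(tgt_tokens)
    positions = list(range(len(src_tokens))) + list(range(len(tgt_tokens)))  # positions restart for the target
    lang_ids = [src_lang_id] * len(src_tokens) + [tgt_lang_id] * len(tgt_tokens)
    return tokens, positions, lang_ids
```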

(30)

(Lan et al., 2020)

ALBERT

30

(31)

Beyond BERT

Better Performance

Wide Applications

Compact Model

ALBERT

31

(32)

ALBERT: A Lite BERT

1. Factorized embedding parameterization

○ in BERT, the WordPiece embedding size E is tied to the hidden layer size H

○ WordPiece embeddings are context-independent while hidden representations are context-dependent → choose E << H

○ factorize the V × H embedding matrix into V × E and E × H (see the parameter-count sketch after this slide)

2. Cross-layer parameter sharing

○ share parameters across all Transformer layers

32
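A quick back-of-the-envelope comparison of the embedding parameters, assuming BERT-base-like sizes (V = 30k WordPiece vocabulary, H = 768) and ALBERT's E = 128:

```python
# Illustrative parameter counts for the factorized embedding.
V, H, E = 30000, 768, 128

tied = V * H                  # BERT: embedding size tied to the hidden size (V x H)
factorized = V * E + E * H    # ALBERT: V x E lookup followed by an E x H projection

print(f"tied:       {tied:,} parameters")        # 23,040,000
print(f"factorized: {factorized:,} parameters")  # 3,938,304
```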

(33)

ALBERT: A Lite BERT

33

(34)

ALBERT: A Lite BERT

3. Inter-sentence coherence loss

○ NSP (next sentence prediction) mixes topic prediction and coherence prediction

○ topical cues are easier, so the model relies on them rather than learning coherence

○ SOP (sentence order prediction) focuses on ordering, not topical cues (see the sketch after this slide)

34
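A toy sketch contrasting how SOP and NSP training pairs could be constructed; both helper functions are illustrative, not the ALBERT/BERT preprocessing code.

```python
import random

def sop_example(seg_a, seg_b):
    """SOP (ALBERT): both segments are consecutive text from the same document;
    the negative example just swaps their order, so topic alone gives no clue."""
    if random.random() < 0.5:
        return (seg_a, seg_b), 1     # correct order
    return (seg_b, seg_a), 0         # swapped order

def nsp_example(seg_a, next_seg, other_doc_seg):
    """NSP (BERT): the negative pairs a segment with text from a different document,
    so a topic change alone often reveals the label."""
    if random.random() < 0.5:
        return (seg_a, next_seg), 1
    return (seg_a, other_doc_seg), 0
```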

(35)

ALBERT: A Lite BERT

4. Additional data and removing dropout

35

(36)

GLUE Results

36

(37)


Concluding Remarks

◉ Transformer-XL (https://github.com/kimiyoung/transformer-xl)

○ Longer context dependency

◉ XLNet (https://github.com/zihangdai/xlnet)

○ AR + AE

○ No pretrain-finetune discrepancy

◉ RoBERTa (http://github.com/pytorch/fairseq)

○ Optimization details & data

◉ SpanBERT

○ Better for QA, NLI, coreference

◉ XLM (https://github.com/facebookresearch/XLM)

○ Zero-shot scenarios

◉ ALBERT (https://github.com/google-research/google-research/tree/master/albert / https://github.com/brightmart/albert_zh)

○ Compact model, faster training/fine-tuning

[Diagram: XLNet, RoBERTa, SpanBERT, XLM, ALBERT grouped under Better Performance, Wide Applications, and Compact Model]

37
