Ernie and Bert

(1)

(2)

(3)

Ernie and Bert


(4)

ERNIE: Enhanced Representation through kNowledge IntEgration

• Developed by Baidu Research

• No paper published yet, only an article on their website (3/16)

• Baidu claims that it outperforms BERT on Chinese-language tasks, including natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching.

• Methods like CoVe, ELMo, GPT, and BERT mainly build models on the original language signals rather than on semantic units in the text.

• Unlike BERT, ERNIE features knowledge integration enhancement: it learns real-world semantic relations from massive data.

(5)

ERNIE: Enhanced Representation through kNowledge IntEgration

• BERT randomly masks 15% of the tokens to train its masked LM
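
A minimal sketch of token-level random masking, assuming a whitespace-tokenized English sentence and a fixed seed. The function name `random_mask` and the example sentence are hypothetical, and the sketch leaves out BERT's extra rule of sometimes substituting a random token or keeping the original token instead of writing [MASK].

```python
import random

MASK = "[MASK]"

def random_mask(tokens, mask_prob=0.15, seed=0):
    """Mask roughly mask_prob of the tokens, chosen uniformly at random.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None everywhere else.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# Hypothetical 15-token example sentence.
tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
masked, labels = random_mask(tokens)
print(masked)
```

Each masked position is picked independently of word or entity boundaries, which is the property the following slides contrast with ERNIE.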


(6)

ERNIE: Enhanced Representation through kNowledge IntEgration

• ERNIE directly models prior semantic knowledge units, which enhances its ability to learn semantic representations.

• It learns the semantic representation of complete concepts by masking whole semantic units, such as words and entities, rather than individual tokens.

• Entities: in information extraction, a named entity is a real-world object, such as a person, location, organization, or product.

(7)

ERNIE: Enhanced Representation through kNowledge IntEgration

• BERT can identify the character “尔(er)” through the local co-occurring characters 哈(ha) and 滨(bin), but fails to learn any knowledge related to the word “Harbin (哈尔滨)”.

• ERNIE can extrapolate the relationship between Harbin (哈尔滨) and Heilongjiang (黑龙江) by analyzing implicit knowledge of words and entities.
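
The contrast between the two masking styles can be sketched as follows. This is an illustration only: the sentence, its hand-written segmentation into units, and the helper names `mask_characters` and `mask_whole_units` are assumptions, not Baidu's implementation, and the masked positions are fixed by hand so the output is predictable.

```python
MASK = "[MASK]"

# Hand-segmented for illustration (assumption):
# "Harbin / is / Heilongjiang / 's / provincial capital"
units = ["哈尔滨", "是", "黑龙江", "的", "省会"]

def mask_characters(units, char_positions):
    """Character-level masking: hide single characters, ignoring unit boundaries."""
    chars = [ch for unit in units for ch in unit]
    return [MASK if i in char_positions else ch for i, ch in enumerate(chars)]

def mask_whole_units(units, unit_positions):
    """Unit-level masking: hide every character of each selected word or entity."""
    out = []
    for i, unit in enumerate(units):
        out.extend([MASK] * len(unit) if i in unit_positions else list(unit))
    return out

char_level = mask_characters(units, {1})   # hides only 尔
unit_level = mask_whole_units(units, {0})  # hides all of 哈尔滨

print(" ".join(char_level))
print(" ".join(unit_level))
```

In the first case the missing character can be filled in from its two neighbours alone. In the second, the model has to use the rest of the sentence (Heilongjiang, provincial capital) to recover the entity.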


(8)

Closer Look…

• Maybe it resembles the leading model “BERT+N-Gram Masking” on SQuAD2.0?

• Granularity: masking short sentences?

• Keep the Transformer structure untouched and change masking?

• The datasets are different.

• “For every plus there is a minus.” More factors come into play: granularity, the quality of segmentation systems, etc.

(9)

By The Way…


(10)

ERNIE: Enhanced Representation through kNowledge IntEgration

• Pretrained models/code available:

https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE

(11)

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context


• from Google

• The third generation (Transformer-XL): recurrence in length

• The second generation (Universal Transformer): recurrence in depth

• The original Transformer: no recurrence

(12)

Transformer for LM

• In language modeling, Transformers are currently implemented with a fixed-length context: a long text sequence is truncated into fixed-length segments of a few hundred characters, and each segment is processed separately.

(13)

Transformer for LM


• This introduces two critical limitations:

• The algorithm is not able to model dependencies that are longer than a fixed length.

• The segments usually do not respect the sentence boundaries, resulting in context fragmentation which leads to inefficient optimization.
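
A small sketch of fixed-length truncation, assuming word-level tokens and a tiny segment length so the effect is visible (the slides speak of a few hundred characters). The helper name `split_into_segments` and the example text are hypothetical.

```python
def split_into_segments(tokens, segment_len):
    """Cut a token sequence into consecutive chunks of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

# Hypothetical example text, already split on whitespace.
text = "Transformer-XL extends the context . Each segment is processed separately ."
tokens = text.split()
segments = split_into_segments(tokens, segment_len=4)

for seg in segments:
    print(seg)
```

The second segment starts with the full stop of the first sentence and ends in the middle of the second one. Because each segment is processed separately, the last segment is encoded without the subject "Each segment", and no dependency longer than four tokens can be modeled.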

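Transformer-XL is not only a model that can handle variable-length sequences; it also sets new state-of-the-art results on multiple tasks.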

(14)

Transformer-XL: Segment-level Recurrence

• During training, the representations computed for the previous segment are fixed and cached to be reused as an extended context when the model processes the next new segment.

(15)

Transformer-XL: Segment-level Recurrence


• Moreover, this recurrence mechanism also resolves the context fragmentation issue, providing the necessary context for tokens at the front of a new segment.
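
The caching idea can be sketched in plain Python. This is a heavily simplified, single-layer illustration, not the authors' code: it assumes no learned projections, no positional information, and no real gradients (the cached states are simply stored values), and the names `attend` and `process_segments` are hypothetical.

```python
import math

def attend(queries, context, n_mem):
    """Causal dot-product attention.

    Query t (in the current segment) sees all n_mem cached states plus the
    current-segment states at positions <= t.
    """
    outputs = []
    for t, q in enumerate(queries):
        visible = context[: n_mem + t + 1]
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(q))
        ])
    return outputs

def process_segments(segments, mem_len):
    """Process segments in order, reusing cached states as extended context."""
    memory, all_outputs, context_sizes = [], [], []
    for seg in segments:
        context = memory + seg  # cached states come first
        all_outputs.append(attend(seg, context, n_mem=len(memory)))
        context_sizes.append(len(context))
        # Cache the most recent states for the next segment. They are copied
        # and treated as constants: nothing is recomputed or updated later.
        memory = [list(h) for h in context[-mem_len:]] if mem_len > 0 else []
    return all_outputs, context_sizes

# Two hypothetical segments of two 2-dimensional token states each.
segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]

with_mem, sizes_with = process_segments(segments, mem_len=2)
no_mem, sizes_without = process_segments(segments, mem_len=0)
print(sizes_with, sizes_without)
print(with_mem[1][0], no_mem[1][0])
```

Without memory, the first token of the second segment can attend only to itself, so its output is just its own state. With memory, it also draws on the two cached states from the previous segment, which is the extra context at the front of a new segment.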

(16)

Transformer-XL: Relative Positional Encodings

• Naively applying segment-level recurrence does not work, however, because the positional encodings are not coherent when we reuse the previous segments.

• For example, consider an old segment with contextual positions [0, 1, 2, 3]. When a new segment is processed, we have positions [0, 1, 2, 3, 0, 1, 2, 3] for the two segments combined, where the semantics of each position id is incoherent throughout the sequence.

• Fix: a new parameterization that encodes only relative positional information (the distance between two tokens), combined with content-based terms in the attention score.
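
The position-id clash, and why distances avoid it, can be shown with a few lines. This sketch only computes the indices; it does not implement the attention parameterization itself, and the helper names are hypothetical.

```python
def absolute_ids(n_mem, n_cur):
    """Position ids when each segment numbers its own tokens from 0."""
    return list(range(n_mem)) + list(range(n_cur))

def relative_distances(n_mem, n_cur):
    """For each query in the current segment, the distance i - j to every visible key j."""
    rows = []
    for t in range(n_cur):
        i = n_mem + t  # index of the query in the concatenated context
        rows.append([i - j for j in range(i + 1)])
    return rows

ids = absolute_ids(4, 4)
dist = relative_distances(4, 4)

print(ids)      # [0, 1, 2, 3, 0, 1, 2, 3]
print(dist[0])  # [4, 3, 2, 1, 0]
```

With absolute ids, a cached token and a current token can share the same id, so the model cannot tell them apart by position. With distances, every visible key gets a distinct offset from the query, no matter which segment it came from.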

(17)

Transformer-XL: Overview


• segment-level recurrence + relative positional encoding

(18)

Transformer-XL: Results

• Transformer-XL learns dependencies that are about 80% longer than RNNs and 450% longer than vanilla Transformers. Vanilla Transformers generally perform better than RNNs but are not the best at long-range dependency modeling because of their fixed-length contexts.

• Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation on language modeling tasks, because no recomputation is needed.

• Transformer-XL achieves better perplexity (it is more accurate at predicting a sample) on long sequences because of long-term dependency modeling, and also on short sequences because it resolves the context fragmentation problem.
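
A rough way to see where the evaluation speedup comes from is to count how many times a token has to be encoded. The cost model below is an assumption made for illustration: it supposes a vanilla Transformer re-encodes its whole context window for every predicted token, while a cached model encodes each token once. It ignores the cost of attention itself and does not reproduce the 1,800+ figure, which depends on the model and attention length used in the paper.

```python
def vanilla_eval_cost(n_tokens, context_len):
    """Sliding window: for each predicted token, re-encode up to context_len tokens."""
    return sum(min(t + 1, context_len) for t in range(n_tokens))

def cached_eval_cost(n_tokens):
    """With cached states, each token is encoded exactly once."""
    return n_tokens

n_tokens, context_len = 1000, 100  # hypothetical sizes
vanilla = vanilla_eval_cost(n_tokens, context_len)
cached = cached_eval_cost(n_tokens)
print(vanilla, cached, vanilla / cached)
```

Under this simplified count the gap grows roughly in proportion to the context length, so longer attention spans make the recomputation penalty larger.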

(19)

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context


• Code available (TensorFlow and PyTorch):

https://github.com/kimiyoung/transformer-xl

(20)

References

• http://research.baidu.com/Blog/index-view?id=113

• http://fortune.com/2016/05/06/sesame-street-bert-ernie-std/

• https://www.usatoday.com/story/life/tv/2018/09/18/sesame-street-denies-writers-claim-bert-and-ernie-gay/1348017002/

• https://en.wikipedia.org/wiki/Named_entity

• http://jalammar.github.io/illustrated-bert/

• https://arxiv.org/pdf/1901.02860.pdf