
Ernie and Bert


(4)

ERNIE: Enhanced Representation through kNowledge IntEgration

• Developed by Baidu Research

• No paper published yet, only an article on their website (3/16)

• Baidu claims that it outperforms BERT on Chinese language tasks, including natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching.

• Methods like CoVe, ELMo, GPT, and BERT mainly focus on building models that work from the raw language signal rather than from the semantic units in the text.

• Unlike BERT, ERNIE features knowledge integration: it learns real-world semantic relations from massive data.

(5)

ERNIE: Enhanced Representation through kNowledge IntEgration

• In BERT, 15% of the tokens are randomly masked to train the masked language model.


(6)

ERNIE: Enhanced Representation through kNowledge IntEgration

• ERNIE learns the semantic representation of complete concepts by masking whole semantic units such as words and entities (see the sketch below). By directly modeling prior semantic knowledge units, it enhances the model's ability to learn semantic representations.

• Entities: in information extraction, a named entity is a real-world object, such as a person, location, organization, or product.
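
To make the masking difference concrete, here is a minimal Python sketch (an illustration only, not ERNIE's actual implementation; the word/entity boundaries are hand-labeled here, whereas ERNIE obtains them from segmentation and entity-recognition tools). BERT-style masking hits individual tokens, while ERNIE-style knowledge masking hides whole semantic units:

import random

tokens = ["哈", "尔", "滨", "是", "黑", "龙", "江", "的", "省", "会"]
# Hand-labeled semantic units: [哈尔滨] [是] [黑龙江] [的] [省会]
units = [(0, 3), (3, 4), (4, 7), (7, 8), (8, 10)]

def bert_style_mask(tokens, p=0.15):
    # Token-level masking: each token is masked independently with probability p.
    return [("[MASK]" if random.random() < p else t) for t in tokens]

def ernie_style_mask(tokens, units, p=0.15):
    # Unit-level (knowledge) masking: a whole word/entity is masked at once.
    out = list(tokens)
    for start, end in units:
        if random.random() < p:
            out[start:end] = ["[MASK]"] * (end - start)
    return out

print(bert_style_mask(tokens))          # may mask 尔 alone, leaving 哈 _ 滨 as local clues
print(ernie_style_mask(tokens, units))  # masks 哈尔滨 either entirely or not at all

With unit-level masking the model can no longer recover 尔 from its neighbors alone; it has to learn what the entity 哈尔滨 means in context, which is the behavior described on the next slide.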

(7)

ERNIE: Enhanced Representation through kNowledge IntEgration

• BERT can identify the character “尔(er)” through the local co-occurring characters 哈(ha) and 滨(bin), but fails to learn any knowledge related to the word “Harbin (哈尔滨)”.

• ERNIE can extrapolate the relationship between Harbin (哈尔滨) and Heilongjiang (黑龙江) by analyzing implicit knowledge of words and entities.


(8)

Closer Look…

Maybe it resembles the leading model “BERT+N-Gram Masking” on SQuAD2.0?

Granularity: masking short sentences?

Keep the Transformer structure untouched and change masking?

The datasets are different.

“For every plus there is a minus”: there are more factors at play, such as granularity and the quality of the segmentation system.

(9)

By The Way…


(10)

ERNIE: Enhanced Representation through kNowledge IntEgration

• Pretrained models/code available:

https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE

(11)

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context


• From Google

• The original Transformer: no recurrence

• The second generation (Universal Transformer): recurrence in depth

• The third generation (Transformer-XL): recurrence in length

(12)

Transformer for LM

• In language modeling, Transformers are currently implemented with a fixed-length context, i.e. a long text sequence is truncated into fixed-length segments of a few hundred characters, and each segment is processed separately.
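
As a rough sketch of this setup (a generic illustration, not tied to any particular implementation), the corpus is simply chopped into fixed-length chunks that the model sees one at a time:

def split_into_segments(token_ids, seg_len=512):
    # Truncate a long sequence into fixed-length segments; a vanilla
    # Transformer LM processes each segment on its own, so the split
    # ignores sentence boundaries and no context crosses segment edges.
    return [token_ids[i:i + seg_len] for i in range(0, len(token_ids), seg_len)]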

(13)

Transformer for LM


• This introduces two critical limitations:

• The algorithm is not able to model dependencies that are longer than a fixed length.

• The segments usually do not respect the sentence boundaries, resulting in context fragmentation which leads to inefficient optimization.

It is not only a model that can handle variable-length sequences; it also sets new state-of-the-art performance on multiple tasks.

(14)

Transformer-XL: Segment-level Recurrence

• During training, the representations computed for the previous segment are fixed and cached to be reused as an extended context when the model processes the next new segment.

(15)

Transformer-XL: Segment-level Recurrence


• Moreover, this recurrence mechanism also resolves the context fragmentation issue by providing the necessary context for tokens at the front of a new segment.
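
A minimal sketch of the mechanism (assumptions: transformer_layer is a hypothetical layer that attends from query over context; this is not the authors' code from the repository linked later):

import torch

def forward_with_memory(transformer_layer, segment_hidden, memory):
    # Segment-level recurrence: reuse cached hidden states of the previous
    # segment as extra, fixed context for the current segment.
    #   segment_hidden: [seg_len, batch, d_model] states of the new segment
    #   memory:         [mem_len, batch, d_model] cached states of the previous
    #                   segment (detached, so no gradients flow into it)
    context = torch.cat([memory, segment_hidden], dim=0)  # extended context
    new_hidden = transformer_layer(query=segment_hidden, context=context)
    new_memory = new_hidden.detach()  # cache for the next segment, kept fixed
    return new_hidden, new_memory

Because the memory is cached and never recomputed, the same trick is also what makes evaluation much faster (see the results slide).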

(16)

Transformer-XL: Relative Positional Encodings

• Naively applying segment-level recurrence does not work, however, because the positional encodings are not coherent when we reuse the previous segments.

• For example, consider an old segment with contextual positions [0, 1, 2, 3]. When a new segment is processed, the two segments combined have positions [0, 1, 2, 3, 0, 1, 2, 3], so the meaning of each position id is incoherent throughout the sequence.

• Transformer-XL instead uses a new parameterization that encodes only the relative positional information between the attending position and the content it attends to.
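
One simple way to realize this idea (a sketch of the general principle; the paper's actual parameterization decomposes the attention score with sinusoidal relative encodings and learned global bias terms) is to embed the relative distance between each attending position and each position in the extended context:

import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    # Sketch: look up an embedding for the clipped relative distance
    # (query position minus key position) instead of an absolute position id,
    # so position information stays coherent when a cached segment is reused.
    def __init__(self, max_distance, d_model):
        super().__init__()
        self.max_distance = max_distance
        self.emb = nn.Embedding(max_distance + 1, d_model)

    def forward(self, qlen, klen):
        q_pos = torch.arange(klen - qlen, klen).unsqueeze(1)  # current segment positions
        k_pos = torch.arange(klen).unsqueeze(0)               # memory + current segment
        rel = (q_pos - k_pos).clamp(0, self.max_distance)     # [qlen, klen] distances
        return self.emb(rel)                                   # [qlen, klen, d_model]

With this formulation the model only ever sees "how far back is this token", so the [0, 1, 2, 3, 0, 1, 2, 3] clash above never arises.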

(17)

Transformer-XL: Overview


• segment-level recurrence + relative positional encoding

(18)

Transformer-XL: Results

• Transformer-XL learns dependencies that are about 80% longer than RNNs and 450% longer than vanilla Transformers, which generally perform better than RNNs but are not the best for long-range dependency modeling because of their fixed-length contexts.

• Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation on language modeling tasks, because no re-computation is needed.

• Transformer-XL achieves better perplexity (it is more accurate at predicting a sample) on long sequences thanks to long-term dependency modeling, and also on short sequences by resolving the context fragmentation problem.

(19)

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context


• Code available (TensorFlow and PyTorch):

https://github.com/kimiyoung/transformer-xl

(20)

References

• http://research.baidu.com/Blog/index-view?id=113

• http://fortune.com/2016/05/06/sesame-street-bert-ernie-std/

• https://www.usatoday.com/story/life/tv/2018/09/18/sesame-street-denies-writers-claim-bert-and-ernie-gay/1348017002/

• https://en.wikipedia.org/wiki/Named_entity

• http://jalammar.github.io/illustrated-bert/

• https://arxiv.org/pdf/1901.02860.pdf
