(1)

More on Embeddings

Applied Deep Learning

April 7th, 2020 http://adl.miulab.tw

(2)

Handling Out-of-Vocabulary

• One of the main problems with using pre-trained word embeddings is that they cannot deal with out-of-vocabulary (OOV) words, i.e. words that were not seen during training.

Typically, such words are mapped to the UNK token and all assigned the same vector, which is an ineffective choice when the number of OOV words is large (see the sketch below).

2
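As a minimal illustration of the problem (the words and vectors below are made up), a standard lookup table collapses every unseen word onto the same UNK vector:

```python
import numpy as np

# Toy pretrained embedding table; words and vectors are made up for illustration.
emb = {
    "apple": np.array([0.1, 0.3, -0.2]),
    "care":  np.array([0.4, -0.1, 0.5]),
    "<unk>": np.zeros(3),               # shared vector for every OOV word
}

def lookup(word):
    # Any word not seen during training collapses to the same <unk> vector.
    return emb.get(word.lower(), emb["<unk>"])

print(lookup("apple"))      # known word -> its own vector
print(lookup("AppleCare"))  # OOV word   -> <unk> vector
print(lookup("iPhone11"))   # OOV word   -> same <unk> vector
```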

(3)

Subwords and characters

Below Words

3

(4)

Subword Embeddings

• Separating unseen or rare words into common subwords can potentially address the OOV issue (see the sketch below).

• “AppleCare” = “Apple” + “Care”, “iPhone11” = “iPhone” + “11”

4
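One simple way to realize this idea, sketched below under assumed conditions (a made-up subword vocabulary with made-up vectors, and greedy longest-match segmentation rather than any particular published segmenter), is to split a word into known subwords and combine their vectors:

```python
import numpy as np

# Hypothetical subword vocabulary with placeholder vectors.
subword_emb = {
    "apple":  np.array([0.1, 0.3, -0.2]),
    "care":   np.array([0.4, -0.1, 0.5]),
    "iphone": np.array([-0.2, 0.2, 0.1]),
    "11":     np.array([0.0, 0.1, 0.0]),
}

def segment(word, vocab):
    """Greedy longest-match segmentation of a word into known subwords."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                  # no piece matched: fall back to one character
            pieces.append(word[i])
            i += 1
    return pieces

def embed(word):
    # Average the vectors of the known pieces (one of several ways to combine them).
    vecs = [subword_emb[p] for p in segment(word, subword_emb) if p in subword_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

print(segment("AppleCare", subword_emb))   # ['apple', 'care']
print(segment("iPhone11", subword_emb))    # ['iphone', '11']
print(embed("AppleCare"))
```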

(5)

Why Subwords?

• “台灣大學生喜歡深度學習” (“Taiwanese university students like deep learning”)

• Word segmentation systems are often suboptimal.

• Ambiguity in word segmentation: “深度學習” (deep learning) vs. “深度” “學習” (depth / learning)

• Informal spelling: “So goooooooood.”, “lollllllllll”

5

(6)

Subword Embeddings

Possibility of leveraging morphological information

◉ In speech, we have phonemes; in language, we have morphemes.

◉ Morphemes (語素): smallest semantic units

-s: noun plural, -ed: verb simple past tense, pre-, un-…

6

(7)

Subword Embeddings

◉ Morphological Recursive Neural Network

7
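A rough sketch of the recursive composition idea (following Luong et al., W13-3512): a word vector is built by repeatedly combining a stem vector with an affix vector. The morpheme vectors and weights below are random placeholders, not trained values.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

# Toy morpheme vectors (placeholders for learned morpheme embeddings).
morpheme_emb = {"un": rng.normal(size=d),
                "fortunate": rng.normal(size=d),
                "ly": rng.normal(size=d)}

# Composition parameters of the morphological RNN (learned in the real model).
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)

def compose(stem_vec, affix_vec):
    # Parent representation p = tanh(W [stem; affix] + b), applied recursively.
    return np.tanh(W @ np.concatenate([stem_vec, affix_vec]) + b)

# "unfortunately" = (("un" + "fortunate") + "ly"), composed step by step.
v = compose(morpheme_emb["un"], morpheme_emb["fortunate"])
v = compose(v, morpheme_emb["ly"])
print(v.shape)  # (4,) -- a word vector built from morpheme vectors
```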

(8)

How to Decide Subwords?

◉ By simple character n-grams: Apple = [App, ppl, ple] (see the sketch below)

◉ Byte Pair Encoding: an algorithm to build the vocabulary

8
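A two-line sketch of the character n-gram split used in the example above (the function name is just for illustration):

```python
def char_ngrams(word, n=3):
    # All contiguous character n-grams of the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("Apple"))  # ['App', 'ppl', 'ple']
```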

(9)

Byte Pair Encoding

◉ Originally a compression algorithm: most frequent byte pair ↦ a new byte

◉ Used as a word segmentation algorithm

◉ Start with a unigram vocabulary of all (Unicode) characters in data

◉ Most frequent ngram pairs ↦ a new ngram

9


(13)

Byte Pair Encoding

◉ Have a target vocabulary size and stop when you reach it

◉ Automatically decides the vocabulary for the system (see the sketch below)

13
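A compact sketch of the merge loop described on these slides, in the style of Sennrich et al. (arXiv:1508.07909); the toy word-frequency corpus is made up. Start from characters, repeatedly merge the most frequent adjacent pair, and stop when the target vocabulary size is reached.

```python
import re
from collections import Counter

# Toy word-frequency "corpus"; each word starts as a sequence of characters plus an end marker.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {" ".join(list(w)) + " </w>": f for w, f in corpus.items()}

def pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a new merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

num_merges = 10   # in practice: stop when the target vocabulary size is reached
for _ in range(num_merges):
    pairs = pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent pair ↦ a new symbol
    vocab = merge_pair(best, vocab)
    print("merged:", best)

print(list(vocab.keys()))
```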

(14)

Character-Level Embeddings

◉ Model word-level representations from character-level information

◉ Completely solves the OOV problem

◉ Representations can be inferred dynamically for any word

14

(15)

Character-Level Embeddings

◉ Compositional character-to-word (C2W) model

15
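A minimal PyTorch sketch of the compositional idea, as a simplification of the C2W model of Ling et al. (D15-1176), not a faithful reimplementation: run a bidirectional LSTM over the characters of a word and combine the final states into a word vector. Class and variable names are illustrative, and characters are mapped to ids via their byte values.

```python
import torch
import torch.nn as nn

class C2W(nn.Module):
    """Simplified character-to-word model: char BiLSTM -> word embedding."""
    def __init__(self, n_chars, char_dim=16, hidden_dim=32, word_dim=50):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, word_dim)

    def forward(self, char_ids):              # char_ids: (batch, word_length)
        h, _ = self.lstm(self.char_emb(char_ids))
        # Concatenate the last forward state and the first backward state.
        fwd = h[:, -1, :h.size(-1) // 2]
        bwd = h[:, 0, h.size(-1) // 2:]
        return self.proj(torch.cat([fwd, bwd], dim=-1))

# Usage: map characters to ids (here: raw byte values) and embed an unseen word.
model = C2W(n_chars=256)
word = torch.tensor([[c for c in "iphone11".encode()]])   # (1, 8) character ids
print(model(word).shape)                                   # torch.Size([1, 50])
```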

(16)

MIMICK

◉ Optimizes a character-level model towards pretrained embeddings

◉ No need to access the original training corpus

16
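A rough sketch of the MIMICK training idea (Pinter et al., arXiv:1707.06961): train a character model to regress onto existing pretrained word vectors, then use it to produce vectors for OOV words. It assumes PyTorch and uses made-up "pretrained" vectors; the model and loop are simplified for illustration.

```python
import torch
import torch.nn as nn

class Mimick(nn.Module):
    """Character model that learns to mimic pretrained word vectors."""
    def __init__(self, n_chars=256, char_dim=16, hidden_dim=64, word_dim=50):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, word_dim)

    def forward(self, char_ids):                 # (batch, word_length)
        _, h = self.gru(self.char_emb(char_ids))
        return self.out(h[-1])                   # (batch, word_dim)

def to_ids(word):
    return torch.tensor([[c for c in word.encode()]])

model = Mimick()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Made-up "pretrained" vectors; only (word, vector) pairs are needed -- no corpus.
pretrained = {"apple": torch.randn(50), "care": torch.randn(50), "phone": torch.randn(50)}

for _ in range(100):                             # toy training loop
    for word, target in pretrained.items():
        pred = model(to_ids(word)).squeeze(0)
        loss = nn.functional.mse_loss(pred, target)   # push prediction toward the pretrained vector
        opt.zero_grad(); loss.backward(); opt.step()

# After training, any OOV word gets an embedding from its spelling alone.
print(model(to_ids("applecare")).shape)          # torch.Size([1, 50])
```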

(17)

FastText

◉ An extension of the word2vec skip-gram model with character n-grams

◉ Represent word as char n-grams augmented with boundary symbols and as whole word: Apple = [<Ap, App, ppl, ple, le>, Apple]

◉ Prefixes, suffixes, and whole words are special

◉ supervised objective: text classification

17
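A small sketch of the representation described above: a word vector is the sum of its character n-gram vectors plus a whole-word vector. The hashing/bucketing used by the real fastText implementation is omitted, and the n-gram vectors below are random placeholders rather than trained parameters.

```python
import numpy as np

def fasttext_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary symbols, plus the whole word."""
    wrapped = "<" + word + ">"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

print(fasttext_ngrams("Apple"))   # ['<Ap', 'App', 'ppl', 'ple', 'le>', '<Apple>']

dim = 8
rng = np.random.default_rng(0)
ngram_emb = {}                    # n-gram -> vector, filled lazily with random placeholders

def word_vector(word):
    # The word vector is the sum of the vectors of its n-grams and of the whole word.
    vecs = [ngram_emb.setdefault(g, rng.normal(size=dim)) for g in fasttext_ngrams(word)]
    return np.sum(vecs, axis=0)

print(word_vector("Apple").shape)      # (8,)
print(word_vector("AppleCare").shape)  # an OOV word still gets a vector from shared n-grams
```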

(18)

Sentences and documents

Beyond Words

18

(19)

Sentence/Document Embedding

◉ How to extend to sentence/document-level?

◉ Simply averaging word embeddings, inferring with trained models, etc. (see the sketch below)

◉ What training objective should be used?

19
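The simplest of the options above is just averaging word vectors; a sketch with made-up vectors:

```python
import numpy as np

# Toy word vectors (placeholders for pretrained embeddings).
emb = {"deep": np.array([0.2, 0.1]), "learning": np.array([0.4, -0.3]),
       "is": np.array([0.0, 0.1]), "fun": np.array([0.3, 0.5])}

def sentence_embedding(sentence):
    # Average the vectors of the words we have; ignore unknown words.
    vecs = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

print(sentence_embedding("Deep learning is fun"))
```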

(20)

Skip-Thought

◉ Extends the skip-gram concept to the sentence level: use the current sentence to predict its neighboring sentences

◉ inspired by the distributional hypothesis: sentences that have similar surrounding context are likely to be both semantically and syntactically similar

20

(21)

Quick-Thought

◉ Changes the objective from generation to a classification problem: pick the true context sentence from a set of candidates (see the sketch below)

◉ the model can choose to ignore aspects of the sentence that are irrelevant in constructing a semantic embedding space

21
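A rough sketch of that classification objective, assuming PyTorch. The encoders here are plain bags of embeddings rather than the RNN encoders used in the paper, and the batch is made up: score each candidate by its inner product with the current sentence's encoding and train with cross-entropy to pick the true next sentence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagEncoder(nn.Module):
    """Toy sentence encoder: mean of word embeddings (stand-in for an RNN encoder)."""
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
    def forward(self, ids):               # ids: (batch, length)
        return self.emb(ids).mean(dim=1)  # (batch, dim)

enc_sent, enc_ctx = BagEncoder(), BagEncoder()   # separate encoders for sentence and candidates
opt = torch.optim.Adam(list(enc_sent.parameters()) + list(enc_ctx.parameters()), lr=1e-3)

# Made-up batch: one current sentence and 4 candidate "next" sentences (index 0 is the true one).
sentence   = torch.randint(0, 1000, (1, 6))
candidates = torch.randint(0, 1000, (4, 6))
target     = torch.tensor([0])

u = enc_sent(sentence)                   # (1, dim)
v = enc_ctx(candidates)                  # (4, dim)
scores = u @ v.t()                       # (1, 4): inner-product score for each candidate
loss = F.cross_entropy(scores, target)   # classification: pick the true context sentence
opt.zero_grad(); loss.backward(); opt.step()
print(scores)
```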

(22)

InferSent

◉ Trained on the natural language inference (NLI) task

◉ NLI is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.

22
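A sketch of the NLI classifier on top of the two sentence encodings, assuming PyTorch. It keeps InferSent's feature combination of [u; v; |u - v|; u * v], but the premise/hypothesis vectors below are random placeholders standing in for the output of the paper's BiLSTM-with-max-pooling encoder.

```python
import torch
import torch.nn as nn

dim = 64

class NLIClassifier(nn.Module):
    """3-way entailment / contradiction / neutral classifier over two sentence vectors."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * dim, 128), nn.ReLU(), nn.Linear(128, 3))
    def forward(self, u, v):
        # InferSent feature combination: [u; v; |u - v|; u * v]
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(feats)

# Toy premise/hypothesis encodings (placeholders for a real sentence encoder's output).
premise, hypothesis = torch.randn(1, dim), torch.randn(1, dim)
logits = NLIClassifier(dim)(premise, hypothesis)
print(logits.shape)   # torch.Size([1, 3]) -- scores for entailment / contradiction / neutral
```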

(23)

InferSent

23

(24)

References

◉ https://www.aclweb.org/anthology/W13-3512.pdf

◉ http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture12-subwords.pdf

◉ http://www.aclweb.org/anthology/D15-1176

◉ https://arxiv.org/pdf/1508.07909.pdf

◉ https://arxiv.org/pdf/1707.06961.pdf

◉ https://github.com/Separius/awesome-sentence-embedding

◉ https://openreview.net/pdf?id=rJvJXZb0W

◉ https://arxiv.org/pdf/1607.01759.pdf

24

(25)

References

◉ https://arxiv.org/pdf/1705.02364.pdf

25
