More on Embeddings
Applied Deep Learning
April 7th, 2020 http://adl.miulab.tw
Handling Out-of-Vocabulary Words
◉ One of the main problems with pre-trained word embeddings is that they cannot deal with out-of-vocabulary (OOV) words, i.e. words that were not seen during training.
◉ Typically, such words are all mapped to a single UNK token and assigned the same vector, which is an ineffective choice when the number of OOV words is large.
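A minimal sketch (not from the slides) of why this hurts: with a fixed vocabulary, every unseen word collapses onto the same UNK vector. The names and dimensions below are illustrative.

```python
import numpy as np

# Toy pre-trained embedding table; "<unk>" holds the single shared OOV vector.
vocab = {"<unk>": 0, "apple": 1, "care": 2}
emb = np.random.randn(len(vocab), 4)  # 4-dim vectors for illustration

def lookup(word):
    # Any word outside the vocabulary falls back to the one <unk> row.
    return emb[vocab.get(word, vocab["<unk>"])]

print(np.allclose(lookup("AppleCare"), lookup("iPhone11")))  # True: both are just UNK
```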
Below Words
Subwords and characters
Subword Embeddings
◉ Separating unseen or rare words into common subwords can potentially address the OOV issue
◉ “AppleCare” = “Apple” + “Care”, “iPhone11” = “iPhone” + “11”
Why Subwords?
◉ “台灣大學生喜歡深度學習” (“Taiwanese university students like deep learning”)
◉ Word segmentation systems are often suboptimal
◉ Ambiguity in word segmentation: “深度學習” (“deep learning”) as one word, or “深度” (“depth”) + “學習” (“learning”)
◉ Informal spellings: “So goooooooood.”, “lollllllllll”
Subword Embeddings
◉ Possibility of leveraging morphological information
◉ In speech, we have phonemes; in language, we have morphemes.
◉ Morphemes (語素): smallest semantic units
◉ -s: noun plural, -ed: verb simple past tense, pre-, un-…
Subword Embeddings
◉ Morphological Recursive Neural Network: composes a word's morpheme vectors (e.g., un- + fortunate + -ly) into a word vector with a recursive neural network (Luong et al., 2013)
How to Decide Subwords?
◉ By simple character n-grams: Apple = [App, ppl, ple]
◉ Byte Pair Encoding (BPE): an algorithm for building the vocabulary
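A one-line sketch of the simple n-gram option, matching the Apple example above (trigrams only; real systems typically mix several n-gram lengths):

```python
def char_ngrams(word, n=3):
    # Contiguous character n-grams: "Apple" -> ['App', 'ppl', 'ple']
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("Apple"))  # ['App', 'ppl', 'ple']
```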
Byte Pair Encoding
◉ Originally a compression algorithm: most frequent byte pair ↦ a new byte
◉ Used as a word segmentation algorithm:
◉ Start with a unigram vocabulary of all (Unicode) characters in the data
◉ Repeat: most frequent n-gram pair ↦ a new n-gram
◉ Have a target vocabulary size and stop when you reach it
◉ Automatically decides the vocabulary for the system
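A minimal sketch of the BPE merge loop, in the spirit of the reference implementation in Sennrich et al. (2016); the toy vocabulary and the merge count of 5 are illustrative:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Fuse the chosen pair into one new symbol everywhere it occurs.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated symbol (initially character) sequences with frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # in practice: stop once the target vocabulary size is reached
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))
```

On this toy data the first merges are es, est, lo, matching the running example in the BPE paper.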
Character-Level Embeddings
◉ Model word-level representations from character-level information
◉ Completely solves the OOV problem: any string can be composed from characters
◉ Representations for unseen words can be inferred dynamically
Character-Level Embeddings
◉ Compositional character-to-word (C2W) model: composes character embeddings into a word embedding with a bidirectional LSTM (Ling et al., 2015)
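A minimal PyTorch sketch of the C2W idea, assuming a BiLSTM over character embeddings whose final states are concatenated into the word vector (Ling et al. additionally apply a learned projection; dimensions and the byte-level character ids are illustrative):

```python
import torch
import torch.nn as nn

class C2W(nn.Module):
    """Compose a word embedding from its characters with a BiLSTM."""
    def __init__(self, n_chars=128, char_dim=16, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # (batch, word_len)
        x = self.char_emb(char_ids)              # (batch, word_len, char_dim)
        _, (h, _) = self.lstm(x)                 # h: (2, batch, word_dim // 2)
        # Concatenate final forward and backward states into one word vector.
        return torch.cat([h[0], h[1]], dim=-1)   # (batch, word_dim)

model = C2W()
ids = torch.tensor([[ord(c) for c in "apple"]])  # naive byte-level char ids
print(model(ids).shape)  # torch.Size([1, 64]) -- works for any spelling, no OOV
```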
MIMICK
◉ Trains a character-level model to reproduce (mimic) pre-trained word embeddings
◉ No need to access the originating corpus; OOV vectors are predicted from spelling alone
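A minimal sketch of the MIMICK-style objective, reusing the C2W encoder from the previous sketch; the stand-in pre-trained vectors and helper names are illustrative (the paper minimizes squared Euclidean distance, which MSE matches up to a constant factor):

```python
import torch
import torch.nn as nn

def to_ids(word):
    # Naive byte-level character ids, as in the C2W sketch above.
    return torch.tensor([[ord(c) % 128 for c in word]])

char_model = C2W()  # the character encoder defined in the previous sketch
pretrained = {"apple": torch.randn(64), "care": torch.randn(64)}  # stand-in vectors

words = list(pretrained)
preds = torch.cat([char_model(to_ids(w)) for w in words])  # (n_words, 64)
targets = torch.stack([pretrained[w] for w in words])      # (n_words, 64)
loss = nn.functional.mse_loss(preds, targets)  # fit the spelling model to the vectors
loss.backward()  # only the embedding table is needed, never the original corpus
```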
FastText
◉ An extension of the word2vec skip-gram model with character n-grams
◉ Represents a word as its character n-grams augmented with boundary symbols, plus the whole word: Apple = [<Ap, App, ppl, ple, le>, Apple]
◉ The boundary symbols make prefixes, suffixes, and whole words distinguishable from word-internal n-grams
◉ Supervised variant: trained with a text classification objective
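A minimal sketch of the FastText representation, assuming trigrams only and random stand-in vectors (the real model uses n-grams of length 3 to 6 hashed into buckets, with vectors trained by the skip-gram objective):

```python
import numpy as np

def fasttext_ngrams(word, n=3):
    # Boundary-marked character n-grams plus the whole word:
    # "Apple" -> ['<Ap', 'App', 'ppl', 'ple', 'le>', '<Apple>']
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

table = {}  # n-gram -> vector; random stand-ins for trained vectors

def vec(gram):
    return table.setdefault(gram, np.random.randn(4))

def word_vector(word):
    # The word vector is the sum of its subword vectors, so an unseen
    # word like "AppleCare" still gets a representation.
    return sum(vec(g) for g in fasttext_ngrams(word))

print(word_vector("AppleCare"))  # no OOV failure: built entirely from subwords
```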
Beyond Words
Sentences and documents
Sentence/Document Embedding
◉ How do we extend embeddings to the sentence/document level?
◉ E.g., simply averaging word embeddings (see the sketch below), or inferring with trained models
◉ What should the training objective be?
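A minimal sketch of the simplest option, averaging word embeddings; random stand-in vectors, and note that word order is ignored entirely:

```python
import numpy as np

# Stand-in word vectors; in practice these are pre-trained embeddings.
emb = {w: np.random.randn(4) for w in ["the", "cat", "sat", "on", "mat"]}

def sentence_embedding(sentence):
    # Mean of the in-vocabulary word vectors; a simple but strong baseline.
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0)

print(sentence_embedding("The cat sat on the mat"))
```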
Skip-Thought
◉ Extends the skip-gram concept to the sentence level: encode a sentence, then decode (generate) its neighboring sentences
◉ Inspired by the distributional hypothesis: sentences with similar surrounding contexts are likely to be both semantically and syntactically similar
Quick-Thought
◉ Changes the objective from generation to a classification problem: pick the true context sentence from a set of candidates
◉ The model can choose to ignore aspects of the sentence that are irrelevant to constructing a semantic embedding space
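A minimal sketch of the Quick-Thought objective with random stand-in encodings: instead of generating the neighboring sentence, the model classifies which candidate it is by inner product (the paper uses two RNN sentence encoders, f and g).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
f_s = torch.randn(1, 8)      # encoding of the current sentence (encoder f)
g_cands = torch.randn(5, 8)  # encodings of 5 candidate sentences (encoder g)
target = torch.tensor([2])   # index of the true neighboring sentence

logits = f_s @ g_cands.T                # (1, 5) similarity scores
loss = F.cross_entropy(logits, target)  # classification instead of generation
print(logits.softmax(dim=-1), loss.item())
```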
InferSent
◉ Trained on the natural language inference (NLI) task
◉ NLI is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”
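The paper combines the premise encoding u and hypothesis encoding v as [u; v; |u − v|; u ∗ v] before the 3-way softmax; a minimal sketch with random stand-in encodings:

```python
import torch

u = torch.randn(1, 8)  # premise encoding (BiLSTM + max-pooling in the paper)
v = torch.randn(1, 8)  # hypothesis encoding
features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)  # (1, 32)
print(features.shape)  # fed to a classifier over {entailment, contradiction, neutral}
```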
References
◉ https://www.aclweb.org/anthology/W13-3512.pdf
◉ http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture12-subwords.pdf
◉ http://www.aclweb.org/anthology/D15-1176
◉ https://arxiv.org/pdf/1508.07909.pdf
◉ https://arxiv.org/pdf/1707.06961.pdf
◉ https://github.com/Separius/awesome-sentence-embedding
◉ https://openreview.net/pdf?id=rJvJXZb0W
◉ https://arxiv.org/pdf/1607.01759.pdf
◉ https://arxiv.org/pdf/1705.02364.pdf