(1)

More on Embeddings

Applied Deep Learning

April 7th, 2020 http://adl.miulab.tw

(2)

Handling Out-of-Vocabulary

• One of the main problems with using pre-trained word embeddings is that they cannot handle out-of-vocabulary (OOV) words, i.e. words that were not seen during training.

Typically, such words are mapped to the UNK token and all assigned the same vector, which is an ineffective choice when the number of OOV words is large (see the sketch below).

2
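A minimal sketch of the UNK fallback described above; the vocabulary, embedding table, and lookup helper are hypothetical stand-ins for a real pre-trained model.

```python
import numpy as np

# Hypothetical pre-trained vocabulary and embedding matrix (names are illustrative).
vocab = {"<unk>": 0, "apple": 1, "care": 2, "iphone": 3}
embeddings = np.random.randn(len(vocab), 300)   # one 300-d vector per known word

def lookup(word):
    # Every OOV word falls back to the single <unk> row, so "AppleCare" and
    # "iPhone11" receive exactly the same (uninformative) vector.
    return embeddings[vocab.get(word.lower(), vocab["<unk>"])]

print(np.allclose(lookup("AppleCare"), lookup("iPhone11")))  # True
```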

(3)

Subwords and characters

Below Words

3

(4)

Subword Embeddings

• Separating unseen or rare words into common subwords can potentially address the OOV issue

• “AppleCare” = “Apple” + “Care”, “iPhone11” = “iPhone” + “11”

4

(5)

Why Subwords?

• “台灣大學生喜歡深度學習” (“Taiwanese university students like deep learning”)

• Word segmentation systems are often suboptimal

• Ambiguity in word segmentation: “深度學習” (“deep learning”) as one word, or “深度” (“depth”) + “學習” (“learning”)

• Informal spelling: “So goooooooood.”, “lollllllllll”

5

(6)

Subword Embeddings

Possibility of leveraging morphological information

◉ In speech, we have phonemes; in language, we have morphemes.

◉ Morphemes (語素): smallest semantic units

-s: noun plural, -ed: verb simple past tense, pre-, un-…

6

(7)

Subword Embeddings

◉ Morphological Recursive Neural Network

7
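A morphological recursive neural network composes a word vector bottom-up from its morpheme vectors. A minimal sketch of that composition step under assumed sizes; the morpheme vectors, the split, and the initialization here are purely illustrative.

```python
import numpy as np

d = 50                                       # embedding size (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.01   # shared composition matrix
b = np.zeros(d)
morpheme_vec = {m: rng.standard_normal(d) for m in ["un", "fortunate", "ly"]}

def compose(left, right):
    # One recursive step: parent = tanh(W [left; right] + b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

# "unfortunately" = (("un" + "fortunate") + "ly"), composed morpheme by morpheme
stem = compose(morpheme_vec["un"], morpheme_vec["fortunate"])
word = compose(stem, morpheme_vec["ly"])
print(word.shape)                            # (50,)
```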

(8)

How to Decide Subwords?

◉ By simple character n-grams: Apple = [App, ppl, ple]

◉ By Byte Pair Encoding (BPE): an algorithm that builds the subword vocabulary from data

8
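A minimal sketch of the simple n-gram splitting above; the function name and the choice of n = 3 are just for illustration.

```python
def char_ngrams(word, n=3):
    # All contiguous character n-grams, matching the slide's "Apple" example.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("Apple"))   # ['App', 'ppl', 'ple']
```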

(9)

Byte Pair Encoding

◉ Originally a compression algorithm: most frequent byte pair ↦ a new byte

◉ Used as a word segmentation algorithm

◉ Start with a unigram vocabulary of all (Unicode) characters in data

◉ Most frequent ngram pairs ↦ a new ngram

9


(13)

Byte Pair Encoding

◉ Have a target vocabulary size and stop merging when you reach it

◉ The subword vocabulary is decided automatically for the system (see the sketch below)

13
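A minimal sketch of the BPE vocabulary-building loop described on these slides, in the spirit of the Sennrich et al. (2016) reference: count symbol pairs over a word-frequency dictionary, merge the most frequent pair into a new symbol, and repeat until a target number of merges (i.e. vocabulary size) is reached. The toy corpus is illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (one word) to its frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair "a b" with the merged symbol "ab"
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Start from a unigram (character-level) vocabulary with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10                        # in practice: stop at the target vocab size
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent n-gram pair ...
    vocab = merge_pair(best, vocab)    # ... becomes a new n-gram
    print(best)                        # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```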

(14)

Character-Level Embeddings

◉ Models word-level representations from character-level information

◉ Completely solves the OOV problem, since any word can be spelled out from its characters

◉ Representations can be inferred dynamically, even for unseen words

14

(15)

Character-Level Embeddings

◉ Compositional character-to-word (C2W) model: a bidirectional LSTM reads the characters of a word and produces its embedding

15
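A minimal PyTorch-style sketch of a compositional character-to-word model in the spirit of C2W: a character BiLSTM whose final states are projected to a word embedding. All layer sizes and names are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharToWord(nn.Module):
    def __init__(self, num_chars, char_dim=25, hidden=50, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, word_dim)   # combine forward/backward states

    def forward(self, char_ids):              # char_ids: (batch, word_length)
        chars = self.char_emb(char_ids)       # (batch, word_length, char_dim)
        _, (h, _) = self.bilstm(chars)        # h: (2, batch, hidden)
        return self.proj(torch.cat([h[0], h[1]], dim=-1))   # (batch, word_dim)

# Any word, seen or unseen, gets an embedding from its character sequence.
model = CharToWord(num_chars=128)
word = torch.tensor([[ord(c) for c in "iPhone11"]])   # toy character ids (ASCII codes)
print(model(word).shape)                              # torch.Size([1, 100])
```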

(16)

MIMICK

◉ Trains a character-level model whose outputs are optimized towards pretrained embeddings

◉ No need to access the originating corpus; only the pretrained vectors are required

16
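A minimal sketch of a MIMICK-style objective: run a character-level model over each training word and regress its output onto that word's pretrained vector, so the model can later produce vectors for OOV words from spelling alone. A deliberately tiny character model stands in for the paper's character BiLSTM; data and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# Pretrained word vectors to mimic (random stand-ins for real embeddings).
pretrained = {"apple": torch.randn(100), "care": torch.randn(100), "deep": torch.randn(100)}

class TinyCharModel(nn.Module):
    # Stand-in for the paper's character BiLSTM: mean of char embeddings + linear layer.
    def __init__(self, num_chars=128, char_dim=25, word_dim=100):
        super().__init__()
        self.emb = nn.Embedding(num_chars, char_dim)
        self.out = nn.Linear(char_dim, word_dim)

    def forward(self, char_ids):              # (batch, word_length)
        return self.out(self.emb(char_ids).mean(dim=1))

model = TinyCharModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):                          # regress onto the pretrained vectors
    for word, target in pretrained.items():
        char_ids = torch.tensor([[ord(c) for c in word]])
        loss = loss_fn(model(char_ids).squeeze(0), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The original training corpus was never touched; OOV words now get a vector
# purely from their spelling.
oov_vec = model(torch.tensor([[ord(c) for c in "applecare"]]))
```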

(17)

FastText

◉ An extension of the word2vec skip-gram model with character n-grams

◉ Represent a word as its character n-grams augmented with boundary symbols, plus the whole word: Apple = [<Ap, App, ppl, ple, le>, <Apple>]

◉ Prefixes, suffixes and whole words are distinguished as special units thanks to the boundary symbols

◉ Supervised objective: text classification

17
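A minimal sketch of how a fastText-style word vector is assembled: wrap the word in boundary symbols, extract its character n-grams, and sum the vectors of those n-grams plus the whole-word unit. Hashing n-grams into a fixed number of buckets mirrors fastText's trick, but the table, sizes, and hash here are illustrative, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, buckets = 100, 2_000_000                 # n-grams are hashed into shared buckets
ngram_table = rng.standard_normal((buckets, dim)) * 0.01

def subword_units(word, n_min=3, n_max=6):
    w = f"<{word}>"                           # boundary symbols mark prefixes/suffixes
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]                        # the whole word is its own special unit

def word_vector(word):
    # Sum the vectors of all subword units of the word.
    return sum(ngram_table[hash(g) % buckets] for g in subword_units(word))

print(subword_units("Apple", 3, 3))   # ['<Ap', 'App', 'ppl', 'ple', 'le>', '<Apple>']
print(word_vector("iPhone11").shape)  # (100,) -- works even for unseen words
```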

(18)

Sentences and documents

Beyond Words

18

(19)

Sentence/Document Embedding

◉ How do we extend embeddings to the sentence/document level?

◉ Simple options: averaging word embeddings (sketched below), inferring representations with trained models, etc.

◉ What training objective should be used?

19
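A minimal sketch of the simplest option above, averaging word embeddings; the toy embedding table is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(300) for w in
       ["deep", "learning", "is", "fun", "students", "like"]}

def sentence_embedding(tokens):
    # Mean of the word vectors; note that word order is ignored entirely.
    vectors = [emb[t] for t in tokens if t in emb]
    return np.mean(vectors, axis=0)

s1 = sentence_embedding("students like deep learning".split())
s2 = sentence_embedding("deep learning is fun".split())
cosine = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(s1.shape, round(float(cosine), 3))   # (300,) and a similarity score
```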

(20)

Skip-Thought

◉ Extends the skip-gram idea to the sentence level: encode a sentence and decode its neighboring sentences

◉ inspired by the distributional hypothesis: sentences that have similar surrounding context are likely to be both semantically and syntactically similar

20
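A minimal PyTorch-style sketch of the Skip-Thought setup as described above: one encoder produces the sentence embedding, and two decoders are trained to generate the previous and next sentences from it. Architecture details, sizes, and the training loop are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

V, D, H = 10000, 300, 600        # vocab, embedding, hidden sizes (illustrative)

class SkipThought(nn.Module):
    # Encode sentence i; two decoders predict sentences i-1 and i+1 from its encoding.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.encoder = nn.GRU(D, H, batch_first=True)
        self.dec_prev = nn.GRU(D, H, batch_first=True)
        self.dec_next = nn.GRU(D, H, batch_first=True)
        self.out = nn.Linear(H, V)

    def forward(self, cur, prev, nxt):           # token-id tensors: (batch, length)
        _, h = self.encoder(self.emb(cur))       # h: (1, batch, H) = sentence embedding
        out_prev, _ = self.dec_prev(self.emb(prev), h)
        out_next, _ = self.dec_next(self.emb(nxt), h)
        return self.out(out_prev), self.out(out_next), h.squeeze(0)

model = SkipThought()
cur, prev, nxt = (torch.randint(0, V, (2, 7)) for _ in range(3))
p, n, sent_emb = model(cur, prev, nxt)   # train p and n with cross-entropy on shifted targets
print(sent_emb.shape)                    # torch.Size([2, 600]) -- the sentence embedding
```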

(21)

Quick-Thought

◉ Changes the objective to a classification problem: pick the true context sentence from a set of candidates

◉ the model can choose to ignore aspects of the sentence that are irrelevant in constructing a semantic embedding space

21
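A minimal sketch of that classification objective: score each sentence against candidate context sentences from the same batch with a dot product, and train with cross-entropy so the true neighbor wins. Random tensors stand in for the two sentence encoders, so only the objective itself is shown.

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 300
f_enc = torch.randn(batch, dim, requires_grad=True)   # f(s_i): encoding of sentence i
g_enc = torch.randn(batch, dim, requires_grad=True)   # g(.): encoding of its true context sentence

# Score every candidate context sentence in the batch; the correct "class"
# for sentence i is its own neighbor, i.e. row i of the score matrix.
scores = f_enc @ g_enc.t()                 # (batch, batch) pairwise dot products
targets = torch.arange(batch)
loss = F.cross_entropy(scores, targets)
loss.backward()                            # gradients flow into both encoders
print(float(loss))
```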

(22)

InferSent

◉ Trained on the natural language inference (NLI) task

◉ NLI is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”

22
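A minimal sketch of the InferSent training setup: encode the premise and hypothesis into vectors u and v, combine them as [u, v, |u − v|, u * v], and classify into entailment / contradiction / neutral. Random tensors stand in for the paper's shared BiLSTM-with-max-pooling encoder, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

dim, hidden, classes = 300, 512, 3     # classes: entailment / contradiction / neutral

u = torch.randn(16, dim)               # encoded premises (batch of 16)
v = torch.randn(16, dim)               # encoded hypotheses

features = torch.cat([u, v, (u - v).abs(), u * v], dim=1)        # (16, 4 * dim)
classifier = nn.Sequential(nn.Linear(4 * dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, classes))
logits = classifier(features)
loss = nn.functional.cross_entropy(logits, torch.randint(0, classes, (16,)))
print(logits.shape, float(loss))       # torch.Size([16, 3]) and the NLI loss
```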

(23)

InferSent

23

(24)

References

◉ https://www.aclweb.org/anthology/W13-3512.pdf

◉ http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture12-subwords.pdf

◉ http://www.aclweb.org/anthology/D15-1176

◉ https://arxiv.org/pdf/1508.07909.pdf

◉ https://arxiv.org/pdf/1707.06961.pdf

◉ https://github.com/Separius/awesome-sentence-embedding

◉ https://openreview.net/pdf?id=rJvJXZb0W

◉ https://arxiv.org/pdf/1607.01759.pdf

24

(25)

References

◉ https://arxiv.org/pdf/1705.02364.pdf

25
