More on Embeddings
Applied Deep Learning
April 7th, 2020 http://adl.miulab.tw
Handling Out-of-Vocabulary Words
◉ One of the main problems with pre-trained word embeddings is that they cannot deal with out-of-vocabulary (OOV) words, i.e. words that were not seen during training.
◉ Typically, such words are all mapped to a single UNK token and assigned the same vector, which is an ineffective choice when the number of OOV words is large.
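A minimal sketch (not from the slides) of why this hurts: with a fixed vocabulary, every unseen word collapses onto the same UNK vector. The names and dimensions below are illustrative.

```python
import numpy as np

# Toy pre-trained embedding table; "<unk>" holds the single shared OOV vector.
vocab = {"<unk>": 0, "apple": 1, "care": 2}
emb = np.random.randn(len(vocab), 4)  # 4-dim vectors for illustration

def lookup(word):
    # Any word outside the vocabulary falls back to the one <unk> row.
    return emb[vocab.get(word, vocab["<unk>"])]

print(np.allclose(lookup("AppleCare"), lookup("iPhone11")))  # True: both are just UNK
```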
Below Words
Subwords and characters
Subword Embeddings
◉ Separating unseen or rare words into common subwords can potentially address the OOV issue
◉ “AppleCare” = “Apple” + “Care”, “iPhone11” = “iPhone” + “11”
Why Subwords?
◉ “台灣大學生喜歡深度學習” (“Taiwanese university students like deep learning”)
◉ Word segmentation systems are often suboptimal
◉ Ambiguity in word segmentation: “深度學習” (“deep learning”) as one word, or “深度” (“depth”) + “學習” (“learning”)
◉ Informal spellings: “So goooooooood.”, “lollllllllll”
Subword Embeddings
◉ Possibility of leveraging morphological information
◉ In speech, we have phonemes; in language, we have morphemes.
◉ Morphemes (語素): smallest semantic units
◉ -s: noun plural, -ed: verb simple past tense, pre-, un-…
Subword Embeddings
◉ Morphological Recursive Neural Network: composes a word's morpheme vectors (e.g., un- + fortunate + -ly) into a word vector with a recursive neural network (Luong et al., 2013)
How to Decide Subwords?
◉ By simple character n-grams: Apple = [App, ppl, ple]
◉ Byte Pair Encoding (BPE): an algorithm for building the vocabulary
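A one-line sketch of the simple n-gram option, matching the Apple example above (trigrams only; real systems typically mix several n-gram lengths):

```python
def char_ngrams(word, n=3):
    # Contiguous character n-grams: "Apple" -> ['App', 'ppl', 'ple']
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("Apple"))  # ['App', 'ppl', 'ple']
```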
Byte Pair Encoding
◉ Originally a compression algorithm: most frequent byte pair ↦ a new byte
◉ Used as a word segmentation algorithm:
◉ Start with a unigram vocabulary of all (Unicode) characters in the data
◉ Repeat: most frequent n-gram pair ↦ a new n-gram
◉ Have a target vocabulary size and stop when you reach it
◉ Automatically decides the vocabulary for the system
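A minimal sketch of the BPE merge loop, in the spirit of the reference implementation in Sennrich et al. (2016); the toy vocabulary and the merge count of 5 are illustrative:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Fuse the chosen pair into one new symbol everywhere it occurs.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated symbol (initially character) sequences with frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(5):  # in practice: stop once the target vocabulary size is reached
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))
```

On this toy data the first merges are es, est, lo, matching the running example in the BPE paper.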
Character-Level Embeddings
◉ Model word-level representations from character-level information
◉ Completely solves the OOV problem: any string can be composed from characters
◉ Representations for unseen words can be inferred dynamically
Character-Level Embeddings
◉ Compositional character-to-word (C2W) model: composes character embeddings into a word embedding with a bidirectional LSTM (Ling et al., 2015)
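A minimal PyTorch sketch of the C2W idea, assuming a BiLSTM over character embeddings whose final states are concatenated into the word vector (Ling et al. additionally apply a learned projection; dimensions and the byte-level character ids are illustrative):

```python
import torch
import torch.nn as nn

class C2W(nn.Module):
    """Compose a word embedding from its characters with a BiLSTM."""
    def __init__(self, n_chars=128, char_dim=16, word_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # (batch, word_len)
        x = self.char_emb(char_ids)              # (batch, word_len, char_dim)
        _, (h, _) = self.lstm(x)                 # h: (2, batch, word_dim // 2)
        # Concatenate final forward and backward states into one word vector.
        return torch.cat([h[0], h[1]], dim=-1)   # (batch, word_dim)

model = C2W()
ids = torch.tensor([[ord(c) for c in "apple"]])  # naive byte-level char ids
print(model(ids).shape)  # torch.Size([1, 64]) -- works for any spelling, no OOV
```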
MIMICK
◉ Trains a character-level model to reproduce (mimic) pre-trained word embeddings
◉ No need to access the originating corpus; OOV vectors are predicted from spelling alone
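A minimal sketch of the MIMICK-style objective, reusing the C2W encoder from the previous sketch; the stand-in pre-trained vectors and helper names are illustrative (the paper minimizes squared Euclidean distance, which MSE matches up to a constant factor):

```python
import torch
import torch.nn as nn

def to_ids(word):
    # Naive byte-level character ids, as in the C2W sketch above.
    return torch.tensor([[ord(c) % 128 for c in word]])

char_model = C2W()  # the character encoder defined in the previous sketch
pretrained = {"apple": torch.randn(64), "care": torch.randn(64)}  # stand-in vectors

words = list(pretrained)
preds = torch.cat([char_model(to_ids(w)) for w in words])  # (n_words, 64)
targets = torch.stack([pretrained[w] for w in words])      # (n_words, 64)
loss = nn.functional.mse_loss(preds, targets)  # fit the spelling model to the vectors
loss.backward()  # only the embedding table is needed, never the original corpus
```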
FastText
◉ An extension of the word2vec skip-gram model with character n-grams
◉ Represents a word as its character n-grams augmented with boundary symbols, plus the whole word: Apple = [<Ap, App, ppl, ple, le>, Apple]
◉ The boundary symbols make prefixes, suffixes, and whole words distinguishable from word-internal n-grams
◉ Supervised variant: trained with a text classification objective
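A minimal sketch of the FastText representation, assuming trigrams only and random stand-in vectors (the real model uses n-grams of length 3 to 6 hashed into buckets, with vectors trained by the skip-gram objective):

```python
import numpy as np

def fasttext_ngrams(word, n=3):
    # Boundary-marked character n-grams plus the whole word:
    # "Apple" -> ['<Ap', 'App', 'ppl', 'ple', 'le>', '<Apple>']
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

table = {}  # n-gram -> vector; random stand-ins for trained vectors

def vec(gram):
    return table.setdefault(gram, np.random.randn(4))

def word_vector(word):
    # The word vector is the sum of its subword vectors, so an unseen
    # word like "AppleCare" still gets a representation.
    return sum(vec(g) for g in fasttext_ngrams(word))

print(word_vector("AppleCare"))  # no OOV failure: built entirely from subwords
```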
Beyond Words
Sentences and documents
Sentence/Document Embedding
◉ How do we extend embeddings to the sentence/document level?
◉ E.g., simply averaging word embeddings (see the sketch below), or inferring with trained models
◉ What should the training objective be?
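A minimal sketch of the simplest option, averaging word embeddings; random stand-in vectors, and note that word order is ignored entirely:

```python
import numpy as np

# Stand-in word vectors; in practice these are pre-trained embeddings.
emb = {w: np.random.randn(4) for w in ["the", "cat", "sat", "on", "mat"]}

def sentence_embedding(sentence):
    # Mean of the in-vocabulary word vectors; a simple but strong baseline.
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0)

print(sentence_embedding("The cat sat on the mat"))
```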
Skip-Thought
◉ Extends the skip-gram concept to the sentence level: encode a sentence, then decode (generate) its neighboring sentences
◉ Inspired by the distributional hypothesis: sentences with similar surrounding contexts are likely to be both semantically and syntactically similar
Quick-Thought
◉ Changes the objective from generation to a classification problem: pick the true context sentence from a set of candidates
◉ The model can choose to ignore aspects of the sentence that are irrelevant to constructing a semantic embedding space
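A minimal sketch of the Quick-Thought objective with random stand-in encodings: instead of generating the neighboring sentence, the model classifies which candidate it is by inner product (the paper uses two RNN sentence encoders, f and g).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
f_s = torch.randn(1, 8)      # encoding of the current sentence (encoder f)
g_cands = torch.randn(5, 8)  # encodings of 5 candidate sentences (encoder g)
target = torch.tensor([2])   # index of the true neighboring sentence

logits = f_s @ g_cands.T                # (1, 5) similarity scores
loss = F.cross_entropy(logits, target)  # classification instead of generation
print(logits.softmax(dim=-1), loss.item())
```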
InferSent
◉ Trained on the natural language inference (NLI) task
◉ NLI is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”
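The paper combines the premise encoding u and hypothesis encoding v as [u; v; |u − v|; u ∗ v] before the 3-way softmax; a minimal sketch with random stand-in encodings:

```python
import torch

u = torch.randn(1, 8)  # premise encoding (BiLSTM + max-pooling in the paper)
v = torch.randn(1, 8)  # hypothesis encoding
features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)  # (1, 32)
print(features.shape)  # fed to a classifier over {entailment, contradiction, neutral}
```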
References
◉ https://www.aclweb.org/anthology/W13-3512.pdf
◉ http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture12-subwords.pdf
◉ http://www.aclweb.org/anthology/D15-1176
◉ https://arxiv.org/pdf/1508.07909.pdf
◉ https://arxiv.org/pdf/1707.06961.pdf
◉ https://github.com/Separius/awesome-sentence-embedding
◉ https://openreview.net/pdf?id=rJvJXZb0W
◉ https://arxiv.org/pdf/1607.01759.pdf
◉ https://arxiv.org/pdf/1705.02364.pdf