
Word Embeddings


(1)

Word Embeddings

Applied Deep Learning

March 31st, 2020 http://adl.miulab.tw

(2)

Review

Word Representation

2

(3)

Meaning Representations in Computers

◉ Knowledge-based representation

◉ Corpus-based representation

✓ Atomic symbol

✓ Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

o Method 1 – dimension reduction

o Method 2 – direct learning

3

(4)

Meaning Representations in Computers

◉ Knowledge-based representation

◉ Corpus-based representation

✓ Atomic symbol

✓ Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

o Method 1 – dimension reduction

o Method 2 – direct learning

4

(5)

Corpus-based representation

Atomic symbols: one-hot representation

5

car         [0 0 0 0 0 0 1 0 0 … 0]
motorcycle  [0 0 1 0 0 0 0 0 0 … 0]

car AND motorcycle = 0

Idea: words with similar meanings often have similar neighbors

Issues: difficult to compute word similarity (e.g., the one-hot vectors of “car” and “motorcycle” share no non-zero dimension)
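A minimal sketch (numpy, with a made-up toy vocabulary) of why atomic one-hot symbols cannot express similarity: any two distinct words have zero overlap.

```python
import numpy as np

vocab = ["the", "a", "motorcycle", "dog", "cat", "run", "car", "blue"]  # toy vocabulary (hypothetical)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Atomic-symbol representation: a |V|-dim vector with a single 1."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
print(car @ motorcycle)   # 0.0 -> one-hot vectors of different words never overlap
print(car @ car)          # 1.0 -> only identical words match
```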

(6)

Meaning Representations in Computers

◉ Knowledge-based representation

◉ Corpus-based representation

✓ Atomic symbol

✓ Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

o Method 1 – dimension reduction

o Method 2 – direct learning

6

(7)

Window-based Co-occurrence Matrix

◉ Example

Window length=1

Left or right context

Corpus:

7

I love NTU.

I love deep learning.

I enjoy learning.

Counts     I    love  enjoy  NTU   deep  learning
I          0    2     1      0     0     0
love       2    0     0      1     1     0
enjoy      1    0     0      0     0     1
NTU        0    1     0      0     0     0
deep       0    1     0      0     0     1
learning   0    0     1      0     1     0

similarity > 0

Issues:

▪ matrix size increases with vocabulary size

▪ high dimensionality

▪ sparsity → poor robustness

Idea: learn low-dimensional word vectors
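A short sketch (numpy, window length 1, punctuation ignored) that rebuilds the co-occurrence counts above from the three-sentence corpus.

```python
import numpy as np

corpus = ["I love NTU", "I love deep learning", "I enjoy learning"]
words = ["I", "love", "enjoy", "NTU", "deep", "learning"]
idx = {w: i for i, w in enumerate(words)}

X = np.zeros((len(words), len(words)), dtype=int)
for sentence in corpus:
    tokens = sentence.split()
    for t, w in enumerate(tokens):
        for n in (t - 1, t + 1):                 # left and right neighbor (window = 1)
            if 0 <= n < len(tokens):
                X[idx[w], idx[tokens[n]]] += 1

print(X)  # matches the window-based co-occurrence counts in the table above
```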

(8)

Meaning Representations in Computers

◉ Knowledge-based representation

◉ Corpus-based representation

✓ Atomic symbol

✓ Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

o Method 1 – dimension reduction

o Method 2 – direct learning

8

(9)

Low-Dimensional Dense Word Vector

◉ Method 1: dimension reduction on the matrix

◉ Singular Value Decomposition (SVD) of co-occurrence matrix X

9

X ≈ U_k Σ_k V_kᵀ (rank-k approximation keeping only the top-k singular values)

(10)

Low-Dimensional Dense Word Vector

◉ Method 1: dimension reduction on the matrix

◉ Singular Value Decomposition (SVD) of co-occurrence matrix X

10

The resulting vectors capture both semantic relations and syntactic relations.

Rohde et al., “An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence,” 2005.

Issues:

▪ computationally expensive: O(mn²) for an n × m matrix with n < m

▪ difficult to add new words

Idea: directly learn low-dimensional word vectors
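A sketch of Method 1 using numpy's SVD on the toy co-occurrence matrix from slide (7); keeping the top-k singular values gives k-dimensional dense word vectors.

```python
import numpy as np

# X: the 6 x 6 window-based co-occurrence matrix from the earlier slide
X = np.array([[0, 2, 1, 0, 0, 0],
              [2, 0, 0, 1, 1, 0],
              [1, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 1],
              [0, 0, 1, 0, 1, 0]], dtype=float)

U, S, Vt = np.linalg.svd(X)      # X = U @ diag(S) @ Vt
k = 2                            # keep the top-k singular values
word_vectors = U[:, :k] * S[:k]  # k-dimensional dense vectors, one row per word
print(word_vectors)
```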

(11)

Word Representation

◉ Knowledge-based representation

◉ Corpus-based representation

✓ Atomic symbol

✓ Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

o Method 1 – dimension reduction

o Method 2 – direct learning → word embedding

11

(12)

Word Embedding

◉ Method 2: directly learn low-dimensional word vectors

Learning representations by back-propagating errors (Rumelhart et al., 1986)

A neural probabilistic language model (Bengio et al., 2003)

Natural language processing (almost) from scratch (Collobert et al., 2011)

Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)

12

(13)

Word Embedding Benefit

◉ Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:

1) semantic similarity between two words can be computed as the cosine similarity of their word vectors

2) the vectors serve as powerful features for many supervised NLP tasks, because they already carry semantic information

3) the vectors can be fed into neural networks and updated (fine-tuned) during training, so task-specific information propagates into them

13

(14)

Word Embeddings

Word2Vec

14

(15)

Word2Vec – Skip-Gram Model

◉ Goal: predict surrounding words within a window of each word

◉ Objective function: maximize the probability of each context word given the current center word

J(θ) = (1/T) Σ_{t=1..T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)

where m is the context window size, w_{t+j} is an outside (context) word, and w_t is the target word (represented by the target word vector)

Benefit: faster training; easy to incorporate a new sentence/document or add a new word to the vocabulary

15
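A minimal sketch (plain Python, window size m = 2 chosen for illustration) of how skip-gram turns a sentence into (target, context) training pairs.

```python
def skipgram_pairs(tokens, m=2):
    """Yield (target word, context word) pairs within a window of size m."""
    for t, target in enumerate(tokens):
        for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if j != t:
                yield target, tokens[j]

sentence = "I love deep learning".split()
print(list(skipgram_pairs(sentence, m=2)))
# [('I', 'love'), ('I', 'deep'), ('love', 'I'), ('love', 'deep'), ('love', 'learning'), ...]
```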

(16)

Word2Vec Skip-Gram Illustration

◉ Goal: predict surrounding words within a window of each word

one-hot input x (dimension V) → hidden layer h (dimension N) → output scores s (dimension V)

16

(17)

Hidden Layer Matrix → Word Embedding Matrix

17

(18)

Weight Matrix Relation

◉ Hidden layer weight matrix = word vector lookup

Each vocabulary entry has two vectors: as a target word and as a context word

18

(19)

Weight Matrix Relation

◉ Output layer weight matrix = a score for every vocabulary word; a softmax turns the scores into probabilities of the words within the context window

Each vocabulary entry has two vectors: as a target word and as a context word

19
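A small numpy sketch of the two weight matrices above: the hidden layer is a row lookup in W (the word-embedding matrix), and the output layer scores every vocabulary word through W′ before a softmax. The sizes V = 6 and N = 3 are illustrative.

```python
import numpy as np

V, N = 6, 3                       # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input -> hidden: target (center) word vectors
W_out = rng.normal(size=(N, V))   # hidden -> output: context word vectors

target = 1                        # index of the center word
h = W[target]                     # hidden layer = embedding lookup (one-hot x picks a row of W)
s = h @ W_out                     # score for every vocabulary word
p = np.exp(s - s.max()); p /= p.sum()   # softmax: P(context word | target word)
print(p)
```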

(20)

Word2Vec Skip-Gram Illustration

one-hot input x (dimension V) → hidden layer h (dimension N) → output scores s (dimension V)

20

(21)

Word Embeddings

Word2Vec Training

21

(22)

Word2Vec Skip-Gram Illustration

one-hot input x (dimension V) → hidden layer h (dimension N) → output scores s (dimension V)

22

(23)

Loss Function

Given a target word (w_I), minimize the negative log-likelihood of its context words: E = − log P(w_{O,1}, …, w_{O,C} | w_I)

23

(24)

SGD Update for W’

Given a target word (w_I):

t_{jc} = 1 when w_{jc} is within the context window; t_{jc} = 0 otherwise

error term: e_j = difference between the softmax output and the indicator above, used to update each output vector in W′

24

(25)

SGD Update for W

The error is back-propagated through W′ to update the input (target) word vector in W.

25
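A hedged numpy sketch of one full-softmax SGD step in the spirit of slides (24)–(25): the error y − t updates the output vectors in W′, and the back-propagated error updates the target word's row of W. The single-context simplification and variable names are assumptions, not the slides' notation.

```python
import numpy as np

V, N, eta = 6, 3, 0.1
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))        # target (input) word vectors
W_out = rng.normal(size=(N, V))    # context (output) word vectors

target, context = 1, 3             # one (target, context) training pair
h = W[target]
s = h @ W_out
y = np.exp(s - s.max()); y /= y.sum()

t = np.zeros(V); t[context] = 1.0  # 1 for the word inside the context window
e = y - t                          # error term

EH = W_out @ e                     # error back-propagated to the hidden layer
W_out -= eta * np.outer(h, e)      # SGD update for W' (all output vectors)
W[target] -= eta * EH              # SGD update for W (only the target word's row)
```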

(26)

SGD Update

Large vocabularies or large training corpora → expensive computation (the softmax normalizes over the entire vocabulary)

Solution: limit the number of output vectors that must be updated per training instance

→ hierarchical softmax, negative sampling

26

(27)

Word Embeddings

Negative Sampling

27

(28)

Hierarchical Softmax

◉ Idea: compute the probability of each word (a leaf node) from the path from the root to that leaf

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.

28

(29)

Negative Sampling

◉ Idea: only update a sample of output vectors

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.

29

(30)

Negative Sampling

◉ Sampling methods

o Random sampling

o Distribution sampling: wj is sampled from P(w)

◉ Empirical setting: unigram model raised to the power of 3/4

What is a good P(w)?

Idea: less frequent words are sampled more often than their raw unigram frequency would suggest

Word           P(w)    P(w)^(3/4) = probability of being sampled for “neg”
is             0.9     0.9^(3/4) ≈ 0.92
constitution   0.09    0.09^(3/4) ≈ 0.16
bombastic      0.01    0.01^(3/4) ≈ 0.032

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.

30
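A small sketch of the empirical noise distribution: unigram probabilities raised to the 3/4 power and renormalized, then used to draw negative samples. The three probabilities are taken from the table above.

```python
import numpy as np

words = ["is", "constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])         # unigram probabilities (from the slide)
weights = unigram ** 0.75                     # 3/4 power: boosts rare words
print(dict(zip(words, weights.round(3))))     # {'is': 0.924, 'constitution': 0.164, 'bombastic': 0.032}

P_neg = weights / weights.sum()               # renormalize into a sampling distribution
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=P_neg)  # draw 5 negative samples
print(negatives)
```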

(31)

Word Embeddings

Word2Vec Variants

31

(32)

Word2Vec Skip-Gram Visualization

https://ronxin.github.io/wevi/

Skip-gram training data:

apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water

32

(33)

33

(34)

Word2Vec Variants

Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)

CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)

LM (language modeling): predicting the next word given the preceding context (Mikolov+, 2013)

Practice the derivation by yourself!!

Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.

Mikolov et al., “Linguistic regularities in continuous space word representations,” in NAACL HLT, 2013.

34

(35)

Word2Vec CBOW

◉ Goal: predicting the target word given the surrounding words

35
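A minimal numpy sketch of the usual CBOW formulation: the hidden vector is the average of the surrounding words' input vectors, and the softmax over the output matrix predicts the target word. Shapes are illustrative.

```python
import numpy as np

V, N = 6, 3
rng = np.random.default_rng(2)
W, W_out = rng.normal(size=(V, N)), rng.normal(size=(N, V))

context = [0, 2, 4, 5]            # indices of the surrounding words
h = W[context].mean(axis=0)       # CBOW: average the context word vectors
s = h @ W_out
p = np.exp(s - s.max()); p /= p.sum()
print(p.argmax())                 # predicted target word index
```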

(36)

Word2Vec LM

◉ Goal: predicting the next word given the preceding context

36

(37)

Word Embeddings

GloVe

37

(38)

Comparison

◉ Count-based

o LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)

o Pros

Fast training

Efficient usage of statistics

o Cons

Primarily capture word similarity

Disproportionate importance given to large counts

◉ Direct prediction

o NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)

o Pros

Improved performance on other tasks

Capture complex patterns beyond word similarity

o Cons

Benefits mainly from large corpora

Inefficient usage of statistics

Combining the benefits of both worlds → GloVe

38

(39)

GloVe

◉ Idea: ratio of co-occurrence probability can encode meaning

Pij is the probability that word wj appears in the context of word wi

Relationship between the words wi and wj

                          x = solid     x = gas       x = water     x = fashion
P(x | ice)                1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(x | steam)              2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(x | ice) / P(x | steam) 8.9           8.5 × 10⁻²    1.36          0.96

                          x = solid     x = gas       x = water     x = random
P(x | ice)                large         small         large         small
P(x | steam)              small         large         large         small
P(x | ice) / P(x | steam) large         small         ~ 1           ~ 1

Pennington et al., ”GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

39

(40)

GloVe

The relationship between the words wi and wj can be characterized by the ratio of their co-occurrence probabilities with various probe words wk: F(wi, wj, wk) ≈ Pik / Pjk

Pennington et al., ”GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

40

(41)

GloVe

Pennington et al., ”GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

41

(42)

GloVe Weighted Least Squares Regression Model

◉ Weighting function f(x) should obey

o f(0) = 0

o f should be non-decreasing, so that rare co-occurrences are not overweighted

o f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted

fast training, scalable, good performance even with a small corpus and small vectors

Pennington et al., ”GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

42
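A sketch of a weighting function satisfying the three conditions above, with the settings reported in the GloVe paper (x_max = 100, α = 3/4), together with the weighted squared-error term for a single word pair; the helper names are mine.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(0)=0, non-decreasing, and capped at 1 for large counts."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 for one co-occurring pair."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

w_i, w_j = np.ones(5) * 0.1, np.ones(5) * 0.2
print(glove_pair_loss(w_i, w_j, 0.0, 0.0, x_ij=50.0))
```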

(43)

Word Vector Evaluation

43

(44)

Intrinsic Evaluation – Word Analogies

◉ Word linear relationship: answer a : b = c : ? with the word d whose vector is closest (by cosine similarity) to w_b − w_a + w_c

Syntactic and semantic example questions [link]

44

Issue: what if the information is there but not linear
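A sketch of the analogy test under the linear-offset assumption: answer a : b = c : ? with the vocabulary word whose vector is most cosine-similar to w_b − w_a + w_c, excluding the query words. The embedding matrix here is random, purely to show the mechanics.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "paris", "france"]
rng = np.random.default_rng(3)
E = rng.normal(size=(len(vocab), 50))           # stand-in embeddings (random, illustrative)
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize so dot product = cosine
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """a : b = c : ?  ->  argmax cosine(x, w_b - w_a + w_c), excluding a, b, c."""
    query = E[idx[b]] - E[idx[a]] + E[idx[c]]
    scores = E @ query
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return vocab[int(scores.argmax())]

print(analogy("man", "king", "woman"))          # "queen" with real word2vec/GloVe vectors
```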

(45)

Intrinsic Evaluation – Word Analogies

◉ Word linear relationship

Syntactic and Semantic example questions [link]

45

Issue: different cities may have same name

city---in---state

Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts

Issue: can change with time

capital---country

Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan

(46)

Intrinsic Evaluation – Word Analogies

◉ Word linear relationship

Syntactic and Semantic example questions [link]

46

superlative

bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest

past tense

dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
dancing : danced = hitting : hit

(47)

Intrinsic Evaluation – Word Correlation

◉ Comparing word correlation with human-judged scores

◉ Human-judged word correlation [link]

47

Word 1       Word 2      Human-Judged Score
tiger        cat          7.35
tiger        tiger       10.00
book         paper        7.46
computer     internet     7.58
plane        car          5.77
professor    doctor       6.62
stock        phone        1.62

Ambiguity: synonym or same word with different POSs
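A sketch of the correlation evaluation: compute cosine similarity for each word pair and report Spearman's ρ against the human-judged scores. The word vectors are random stand-ins, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import spearmanr

pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("computer", "internet", 7.58),
         ("plane", "car", 5.77), ("professor", "doctor", 6.62), ("stock", "phone", 1.62)]

rng = np.random.default_rng(4)
vec = {w: rng.normal(size=50) for p in pairs for w in p[:2]}   # stand-in word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vec[a], vec[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho, _ = spearmanr(model_scores, human_scores)   # higher rho = closer to human judgments
print(rho)
```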

(48)

Extrinsic Evaluation – Subsequent Task

◉ Goal: use word vectors in neural net models built for subsequent tasks

◉ Benefit

○ Ability to also classify words accurately

Ex. countries cluster together, so classifying location words should be possible with word vectors

○ Incorporate any information into them for other tasks

Ex. project sentiment into the words to find the most positive/negative words in a corpus

48

(49)

Concluding Remarks

◉ Low dimensional word vector

○ word2vec

○ GloVe: combining count-based and direct learning

◉ Word vector evaluation

○ Intrinsic: word analogy, word correlation

○ Extrinsic: subsequent task

49

word2vec variants: Skip-gram, CBOW, LM
