Slides adapted from Dr. Richard Socher
Review
Meaning Representations in Computers
Knowledge-based representation
Corpus-based representation
✓ Atomic symbol
✓ Neighbors
  ◦ High-dimensional sparse word vector
  ◦ Low-dimensional dense word vector
    ▪ Method 1 – dimension reduction
    ▪ Method 2 – direct learning
Corpus-based representation
Atomic symbols: one-hot representation

car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = [0 0 0 0 0 0 1 0 0 … 0] AND [0 0 1 0 0 0 0 0 0 … 0] = 0

Idea: words with similar meanings often have similar neighbors.
Issue: similarity is difficult to compute between one-hot vectors (e.g., comparing "car" and "motorcycle" always gives 0).
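A minimal sketch of this problem (the toy vocabulary and index positions are made up for illustration): distinct one-hot vectors are always orthogonal, so their dot product, and hence cosine similarity, is zero.

```python
import numpy as np

# Toy vocabulary: each word is an atomic symbol -> one-hot vector
vocab = ["the", "a", "motorcycle", "drives", "fast", "red", "car", "slow"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
print(car @ motorcycle)  # 0.0 -- "car" and "motorcycle" look unrelated
print(car @ car)         # 1.0 -- only identical words match
```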
Window-based Co-occurrence Matrix
Example
◦ Window length = 1
◦ Left or right context
◦ Corpus:
  I love NTU.
  I love deep learning.
  I enjoy learning.

Counts      I   love  enjoy  NTU  deep  learning
I           0    2     1      0    0     0
love        2    0     0      1    1     0
enjoy       1    0     0      0    0     1
NTU         0    1     0      0    0     0
deep        0    1     0      0    0     1
learning    0    0     1      0    1     0

→ Vectors of words with similar neighbors now have similarity > 0.

Issues:
▪ matrix size increases with vocabulary
▪ high-dimensional
▪ sparsity → poor robustness
Idea: low-dimensional word vectors
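A minimal sketch that reproduces the count table above (punctuation is ignored for simplicity):

```python
import numpy as np

# Corpus from the slide
corpus = ["I love NTU", "I love deep learning", "I enjoy learning"]
vocab = ["I", "love", "enjoy", "NTU", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # window length = 1: count the immediate left and right neighbors
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

print(X)  # reproduces the count table above
```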
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X:

$$X \approx U_k \Sigma_k V_k^{\top}$$

the rank-k approximation of X that keeps only the k largest singular values.
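A minimal sketch of Method 1, assuming the matrix X built in the previous snippet; the choice k = 2 is arbitrary here:

```python
import numpy as np

# X: the window-based co-occurrence matrix built above (V x V)
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                      # keep only the k largest singular values
W = U[:, :k] * S[:k]       # each row of W is a k-dimensional dense word vector
X_approx = W @ Vt[:k, :]   # the rank-k approximation of X
```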
The projected word vectors capture both semantic relations and syntactic relations (see the figures in Rohde et al., "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence," 2005).

Issues:
▪ computationally expensive: O(mn²) for an n × m matrix when n < m
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Word Representation
Knowledge-based representation
Corpus-based representation
✓ Atomic symbol
✓ Neighbors
  ◦ High-dimensional sparse word vector
  ◦ Low-dimensional dense word vector
    ▪ Method 1 – dimension reduction
    ▪ Method 2 – direct learning → word embedding
Word Embedding
Method 2: directly learn low-dimensional word vectors
◦ Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦ A neural probabilistic language model (Bengio et al., 2003)
◦ Natural Language Processing (Almost) from Scratch (Collobert et al., 2011)
◦ Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word Embedding Benefits
Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:
◦ semantic similarity between two words can be calculated as the cosine similarity between their corresponding word vectors
◦ word vectors are powerful features for various supervised NLP tasks, since they carry semantic information
◦ information can be propagated into them via neural networks and updated during training
Word2Vec Skip-Gram
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
Word2Vec – Skip-Gram Model
Goal: predict surrounding words within a window of each word.
Objective function: maximize the probability of any context word given the current center word:

$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\;\sum_{-m \le j \le m,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where m is the size of the context window, and the probability of an outside word o given the target word c is a softmax over the vocabulary:

$$p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$

with $v_c$ the target word vector and $u_o$ the outside (context) word vector.

Benefits: faster than dimension reduction, and a new sentence/document or a new vocabulary word is easy to incorporate.
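Concretely, the training data is just (target, context) pairs extracted with a sliding window. A minimal sketch (the helper name skipgram_pairs is made up for illustration):

```python
def skipgram_pairs(words, window=1):
    """Yield (target, context) pairs within the given window."""
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                yield target, words[j]

print(list(skipgram_pairs("I love deep learning".split())))
# [('I', 'love'), ('love', 'I'), ('love', 'deep'),
#  ('deep', 'love'), ('deep', 'learning'), ('learning', 'deep')]
```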
Word2Vec Skip-Gram Illustration
Goal: predict surrounding words within a window of each word.
(Figure: a network with one-hot input x of dimension V, hidden layer h of dimension N, and output score vector s of dimension V.)
Hidden layer weight matrix → word embedding matrix
Weight Matrix Relation
Hidden layer weight matrix = word vector lookup: multiplying the one-hot input x by W selects a single row of W, so h = Wᵀx is exactly the target word's vector.
Each vocabulary entry has two vectors: one as a target word (a row of W) and one as a context word (a column of W′).
Weight Matrix Relation
Output layer weight matrix = weighted sum as the final score: s = W′ᵀh gives one score per vocabulary word, and a softmax converts the scores into probabilities of appearing within the context window.
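A minimal numpy sketch of this forward pass, with randomly initialized weights; the dimensions V and N are toy values:

```python
import numpy as np

V, N = 6, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))      # input->hidden: target-word vectors (one per row)
Wp = rng.normal(size=(N, V))      # hidden->output: context-word vectors (one per column)

target = 1                        # index of the target word w_I
x = np.zeros(V)
x[target] = 1.0                   # one-hot input

h = W.T @ x                       # multiplying by a one-hot is a row lookup
assert np.allclose(h, W[target])

s = Wp.T @ h                      # one score per vocabulary word
p = np.exp(s) / np.exp(s).sum()   # softmax: predicted context-word distribution
```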
Loss Function
Given a target word w_I, the loss is the negative log-likelihood of the observed context words; the update equations are sketched after the next two slides.
SGD Update for W′
Given a target word w_I, each output word w_j carries a label t_jc (= 1 when w_j is the context word at position c within the window, = 0 otherwise). The gap between the softmax output and this label is the error term that drives the update of the context-word vectors in W′.
SGD Update for W
The error terms are then propagated back through W′ to update the target word's row of W.
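One standard way to write these equations out, in the x, h, s notation used above (the slides' exact notation may differ); here y_j is the softmax output, t_jc the context label, C the number of context positions, and η the learning rate:

```latex
% Softmax output for word w_j
y_j = \frac{\exp(s_j)}{\sum_{k=1}^{V} \exp(s_k)}

% Loss: negative log-likelihood of the C observed context words
E = -\sum_{c=1}^{C} \log y_{j_c^{*}}

% Error term for output word w_j, summed over context positions
e_j = \sum_{c=1}^{C} \bigl( y_j - t_{jc} \bigr), \qquad
t_{jc} = \begin{cases} 1 & \text{if } w_j \text{ is the context word at position } c \\
                       0 & \text{otherwise} \end{cases}

% SGD updates
\mathbf{v}'_{w_j} \leftarrow \mathbf{v}'_{w_j} - \eta \, e_j \, \mathbf{h}
  \qquad \text{(update for } W' \text{)}

\mathbf{v}_{w_I} \leftarrow \mathbf{v}_{w_I} - \eta \sum_{j=1}^{V} e_j \, \mathbf{v}'_{w_j}
  \qquad \text{(update for } W \text{)}
```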
SGD Update
Issue: with a large vocabulary or a large training corpus, updating all V output vectors for every training instance is expensive.
Solution: limit the number of output vectors that must be updated per training instance → hierarchical softmax, sampling.
Hierarchical Softmax
Idea: arrange the vocabulary as the leaves of a binary tree and compute the probability of each leaf node (word) from its path.
Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
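In the common formulation, the probability of a word is the product of sigmoid branch decisions along the root-to-leaf path, so each word costs O(log V) instead of O(V):

```latex
p(w \mid w_I) = \prod_{k=1}^{L(w)-1}
  \sigma\!\bigl( [\![\, n_{k+1} = \mathrm{left}(n_k) \,]\!] \cdot {\mathbf{v}'_{n_k}}^{\top} \mathbf{h} \bigr)
```

where n_1, …, n_L(w) is the path from the root to the leaf for w, each inner node n_k has its own vector v′_n, and [[·]] is +1 for a left child and −1 otherwise.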
Negative Sampling
Idea: only update a sample of the output vectors.
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
Negative Sampling
Sampling methods
◦ Random sampling
◦ Distribution sampling: w_j is sampled from P(w)
Empirical setting: the unigram distribution raised to the power of 3/4
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
What is a good P(w)?
Idea: sample less frequent words more often (relative to their raw frequency).

Word          P(w)    P(w)^(3/4) → probability of being sampled for "neg"
is            0.9     0.9^(3/4)  = 0.92
constitution  0.09    0.09^(3/4) = 0.16
bombastic     0.01    0.01^(3/4) = 0.032
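A minimal sketch of drawing negative samples from the 3/4-power unigram distribution, using the table's values (note the table's powered values are unnormalized, so the code renormalizes them):

```python
import numpy as np

words = ["is", "constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])   # unigram probabilities from the table

weights = unigram ** 0.75               # empirical 3/4-power reweighting
P = weights / weights.sum()             # renormalize into a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=P)  # draw 5 negative samples
print(dict(zip(words, P.round(3))))
```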
Word2Vec Skip-Gram Visualization
https://ronxin.github.io/wevi/
Skip-gram training data:
apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water
Word2Vec Variants
◦ Skip-gram: predict surrounding words given the target word (Mikolov+, 2013)
◦ CBOW (continuous bag-of-words): predict the target word given the surrounding words (Mikolov+, 2013)
◦ LM (language modeling): predict the next word given the preceding contexts (Mikolov+, 2013)
LM-style prediction came first; skip-gram and CBOW tend to work better for learning embeddings. Practice the derivation by yourself!!
Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
Mikolov et al., "Linguistic regularities in continuous space word representations," in NAACL HLT, 2013.
Word2Vec CBOW
Goal: predict the target word given the surrounding words.
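As a sketch in the same notation as the skip-gram slides, CBOW averages the C context word vectors to form the hidden layer and predicts the target word with a softmax:

```latex
\mathbf{h} = \frac{1}{C} \sum_{c=1}^{C} \mathbf{v}_{w_c},
\qquad
p(w_t \mid w_{t-m}, \dots, w_{t+m}) =
  \frac{\exp\!\bigl( {\mathbf{v}'_{w_t}}^{\top} \mathbf{h} \bigr)}
       {\sum_{j=1}^{V} \exp\!\bigl( {\mathbf{v}'_{w_j}}^{\top} \mathbf{h} \bigr)}
```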
Word2Vec LM
Goal: predict the next word given the preceding contexts.
Comparison
Count-based
◦ Examples: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
◦ Pros
  ✓ Fast training
  ✓ Efficient usage of statistics
◦ Cons
  ▪ Primarily capture word similarity
  ▪ Disproportionate importance given to large counts

Direct prediction
◦ Examples: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
◦ Pros
  ✓ Improved performance on other tasks
  ✓ Capture complex patterns beyond word similarity
◦ Cons
  ▪ Benefit mainly from a large corpus
  ▪ Inefficient usage of statistics

Combining the benefits of both worlds → GloVe
GloVe
Pennington et al., "GloVe: Global Vectors for Word Representation," in EMNLP, 2014.
GloVe
Idea: ratios of co-occurrence probabilities can encode meaning.
P_ij is the probability that word w_j appears in the context of word w_i.
Relationship between words w_i and w_j, probed with various words x:

                           x = solid    x = gas      x = water    x = fashion
P(x | ice)                 1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9          8.5 × 10⁻²   1.36         0.96

Qualitatively:

                           x = solid    x = gas      x = water    x = random
P(x | ice)                 large        small        large        small
P(x | steam)               small        large        large        small
P(x | ice) / P(x | steam)  large        small        ~ 1          ~ 1
GloVe
The relationship between w_i and w_j is captured by the ratio of their co-occurrence probabilities with various probe words w_k.
Benefits: fast training, scalable, good performance even with a small corpus and small vectors.
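Behind these properties is the GloVe training objective from Pennington et al. (2014): a weighted least-squares regression on log co-occurrence counts,

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \bigl( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j
    + b_i + \tilde{b}_j - \log X_{ij} \bigr)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
                     1 & \text{otherwise} \end{cases}
```

where the weighting function f caps the influence of very large counts; the paper uses α = 3/4 and x_max = 100.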
Word Vector Evaluation
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]
Issue: what if the information is there but not linear?
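A minimal sketch of answering an analogy a : b = c : ? by nearest cosine neighbor of v_b − v_a + v_c; here emb is a hypothetical word → vector dictionary:

```python
import numpy as np

def analogy(a, b, c, emb):
    """Solve a : b = c : ? by nearest cosine neighbor of v_b - v_a + v_c."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):          # exclude the question words themselves
            continue
        sim = (v / np.linalg.norm(v)) @ query
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. analogy("Chicago", "Illinois", "Houston", emb) should return "Texas"
```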
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]

city-in-state (Issue: different cities may have the same name)
Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts

capital-country (Issue: can change with time)
Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]

superlative
bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest

past tense
dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
dancing : danced = hitting : hit
Intrinsic Evaluation – Word Correlation
Compare word-vector similarity against human-judged correlation scores [link]

Word 1      Word 2     Human-judged score
tiger       cat        7.35
tiger       tiger      10.00
book        paper      7.46
computer    internet   7.58
plane       car        5.77
professor   doctor     6.62
stock       phone      1.62

Ambiguity: synonyms, or the same word with different POS tags.
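A minimal sketch of this evaluation, reporting Spearman's rank correlation between model cosine similarities and human scores; pairs and emb are hypothetical inputs:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def correlation_eval(pairs, emb):
    """pairs: (word1, word2, human_score) triples; emb: word -> vector."""
    human = [score for _, _, score in pairs]
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    rho, _ = spearmanr(human, model)  # rank correlation with human judgments
    return rho
```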
Extrinsic Evaluation – Subsequent Task
Goal: use word vectors in neural network models built for subsequent tasks.
Benefits
◦ Ability to also classify words accurately
  ◦ Ex. countries cluster together, so classifying location words should be possible with word vectors
◦ Ability to incorporate any information into them for other tasks
  ◦ Ex. project sentiment into the words to find the most positive/negative words in a corpus
Concluding Remarks
Low-dimensional word vectors
◦ word2vec: skip-gram, CBOW, LM
◦ GloVe: combining count-based and direct-learning approaches
Word vector evaluation
◦ Intrinsic: word analogy, word correlation
◦ Extrinsic: subsequent tasks