Review
Meaning Representations in Computers
Knowledge-based representation
Corpus-based representation
  Atomic symbol
  Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Corpus-based representation
Atomic symbols: one-hot representation
  car        = [0 0 0 0 0 0 1 0 0 … 0]
  motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
  car AND motorcycle = 0  (one-hot vectors never overlap)
Issue: difficult to compute word similarity (e.g., comparing “car” and “motorcycle”)
Idea: words with similar meanings often have similar neighbors
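A minimal sketch of the point above (the toy vocabulary and word indices are assumptions): distinct one-hot vectors never overlap, so any similarity computed from them is 0.

```python
import numpy as np

# Toy vocabulary; the words and their indices are illustrative assumptions.
vocab = ["the", "a", "motorcycle", "is", "on", "road", "car", "fast"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=len(vocab)):
    """Return the one-hot (atomic symbol) vector for a word."""
    v = np.zeros(size)
    v[word2idx[word]] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
# Distinct one-hot vectors never share a non-zero position, so the dot product is 0:
print(car @ motorcycle)  # 0.0 -> "car" and "motorcycle" look completely unrelated
```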
Window-based Co-occurrence Matrix
Example
◦Window length=1
◦Left or right context
◦Corpus:
I love NTU.
I love deep learning.
I enjoy learning.
Counts I love enjoy NTU deep learning
I 0 2 1 0 0 0
love 2 0 0 1 1 0
enjoy 1 0 0 0 0 1
NTU 0 1 0 0 0 0
deep 0 1 0 0 0 1
learning 0 0 1 0 1 0
similarity > 0 (e.g., “love” and “enjoy” now share the neighbor “I”, so their count vectors overlap)
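The matrix above can be reproduced with a short script; a sketch follows, assuming simple whitespace tokenization with the sentence-final periods dropped (row/column order may differ from the table).

```python
import numpy as np

# Toy corpus from the slide; tokenization (strip '.', keep the original casing) is assumed.
corpus = ["I love NTU .", "I love deep learning .", "I enjoy learning ."]
sentences = [[tok for tok in s.split() if tok != "."] for s in corpus]

vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, w in enumerate(sent):
        # count neighbors within `window` positions to the left and right
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X)  # reproduces the counts shown in the table above
```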
Issues:
▪ matrix size increases with vocabulary
▪ high dimensional
▪ sparsity → poor robustness
Idea: low dimensional word vector
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X:
  X ≈ U_k Σ_k V_k^T  (keep only the k largest singular values)
The rows of U_k (scaled by Σ_k) serve as k-dimensional dense word vectors.
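A sketch of this reduction using numpy's SVD, applied to the co-occurrence matrix X built earlier; the target dimensionality k is an assumption.

```python
import numpy as np

# X: the (vocab_size x vocab_size) window-based co-occurrence matrix built above.
def svd_word_vectors(X, k=2):
    """Method 1 sketch: truncated SVD as dimension reduction."""
    U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    # Keep the top-k singular components; each row is a k-dimensional dense word vector.
    return U[:, :k] * S[:k]

# vectors = svd_word_vectors(X, k=2)   # e.g. for plotting words in 2-D
```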
Issues:
▪ computationally expensive: O(mn²) for an n × m matrix (when n < m)
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Word Embedding
Method 2: directly learn low-dimensional word vectors
◦Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦A neural probabilistic language model (Bengio et al., 2003)
◦NLP (almost) from scratch (Collobert & Weston, 2008)
◦Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word Embedding Benefit
Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:
◦ semantic similarity between two words can be computed as the cosine similarity of their word vectors (see the sketch below)
◦ word vectors serve as powerful features for various supervised NLP tasks, since they already contain semantic information
◦ any task information can be propagated into them via neural networks, updating the vectors during training
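For the first point, a minimal cosine-similarity sketch; the 4-dimensional embedding values are made up purely for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = u·v / (||u|| ||v||); in [-1, 1] for real-valued vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-d embeddings, just to illustrate the call:
car        = np.array([0.8, 0.1, 0.3, 0.5])
motorcycle = np.array([0.7, 0.2, 0.2, 0.6])
banana     = np.array([-0.3, 0.9, 0.1, -0.2])

print(cosine_similarity(car, motorcycle))  # high  -> semantically close
print(cosine_similarity(car, banana))      # lower -> semantically distant
```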
Word2Vec Skip-Gram
Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.
Word2Vec – Skip-Gram Model
Goal: predict the surrounding words within a window of each word
Objective function: maximize the probability of any context word given the current target (center) word
  J(θ) = (1/T) Σ_{t=1…T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
  P(o | c) = exp(u_o · v_c) / Σ_{w=1…V} exp(u_w · v_c), where v_c is the target word vector, u_o is the outside (context) word vector, and m is the context window size
Benefit: fast to train; easily incorporates a new sentence/document or adds a word to the vocabulary
Word2Vec Skip-Gram Illustration
Goal: predict the surrounding words within a window of each word
Architecture: one-hot input x (length V = vocabulary size) → hidden layer h (length N = embedding dimension) → output scores s (length V)
Weight Matrix Relation
◦ Hidden layer weight matrix W (V × N) = word embedding matrix: h = W^T x simply looks up the row of W corresponding to the target word
◦ Output layer weight matrix W′ (N × V): each score in s = W′^T h is the weighted sum (inner product) of h with one output word vector; a softmax over s gives the probability of each word appearing within the context window
Loss Function
Given a target word w_I and C context words, the loss is
  E = −log p(w_{O,1}, …, w_{O,C} | w_I) = −Σ_{c=1…C} u_{j*_c} + C · log Σ_{j′=1…V} exp(u_{j′})
where u_j = v′_{w_j} · h is the score of word j and j*_c is the index of the c-th actual context word.
SGD Update for W′ (output vectors)
  v′_{w_j} ← v′_{w_j} − η · EI_j · h, with error term EI_j = Σ_{c=1…C} (y_{c,j} − t_{c,j}),
  where t_{c,j} = 1 when w_j is within the context window (it is the c-th context word) and t_{c,j} = 0 otherwise.
SGD Update for W (input vector of the target word)
  v_{w_I} ← v_{w_I} − η · EH, where EH = Σ_{j=1…V} EI_j · v′_{w_j} back-propagates the error from the output layer to the hidden layer.
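A minimal numpy sketch of one full-softmax skip-gram SGD step implementing the updates above (no hierarchical softmax or negative sampling); the matrix shapes, learning rate, and toy usage values are assumptions.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def skipgram_sgd_step(W, W_out, target, context, lr=0.025):
    """One SGD step for the full-softmax skip-gram model (sketch only).

    W       : (V, N) input weight matrix  -> row i is the word vector v_{w_i}
    W_out   : (N, V) output weight matrix -> column j is the output vector v'_{w_j}
    target  : index of the target (center) word w_I
    context : indices of the C context words within the window
    """
    h = W[target]                       # hidden layer = embedding lookup, shape (N,)
    u = W_out.T @ h                     # scores u_j for every vocabulary word, shape (V,)
    y = softmax(u)                      # predicted probabilities over context words

    # Summed prediction error EI_j = sum_c (y_j - t_{c,j})
    EI = len(context) * y
    for c in context:
        EI[c] -= 1.0

    EH = W_out @ EI                     # error back-propagated to the hidden layer, shape (N,)
    W_out -= lr * np.outer(h, EI)       # update output vectors W'
    W[target] -= lr * EH                # update the input vector of the target word

    return float(-np.log(y[context] + 1e-12).sum())   # loss, for monitoring only

# Usage with a tiny random model (V = 6 words, N = 3 dimensions), just to show the shapes:
rng = np.random.default_rng(0)
W, W_out = rng.normal(0, 0.1, (6, 3)), rng.normal(0, 0.1, (3, 6))
print(skipgram_sgd_step(W, W_out, target=1, context=[0, 3]))
```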
Hierarchical Softmax
Idea: compute the probability of each word (a leaf node of a binary tree) as the product of binary decisions along the path from the root, reducing the per-word cost from O(V) to O(log V)
Negative Sampling
Idea: only update a sample of output vectors (the positive word plus a few sampled “negative” words)
Sampling methods
◦Random sampling
◦Distribution sampling: w_j is sampled from P(w)
What is a good P(w)?
◦Empirical setting: the unigram distribution raised to the power of 3/4
◦Idea: frequent words are damped, so less frequent words are sampled relatively more often

Word          Unigram prob.   Weight to be sampled as “neg”
is            0.9             0.9^(3/4)  = 0.92
constitution  0.09            0.09^(3/4) = 0.16
bombastic     0.01            0.01^(3/4) = 0.032
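A short sketch of drawing negatives from the 3/4-power unigram distribution, treating the three example words above as if they were the whole vocabulary (an assumption) and with an assumed number of negatives per positive.

```python
import numpy as np

# Unigram probabilities from the slide's example (assumed to be the full vocabulary here).
words   = ["is", "constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])

# Negative-sampling distribution: unigram raised to the 3/4 power, then renormalized.
weights = unigram ** 0.75          # 0.92, 0.16, 0.032  (matches the table)
P_neg = weights / weights.sum()

# Draw k negative samples per positive example (k = 5 is an assumption):
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=P_neg)
print(dict(zip(words, P_neg.round(3))), negatives)
```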
Word2Vec Skip-Gram Visualization
https://ronxin.github.io/wevi/
Skip-gram training data:
apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk, milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink ,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water
Word2Vec Variants
Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)
CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)
LM (Language modeling): predicting the next words given the preceding contexts (Mikolov+, 2013)
Practice the derivation by yourself!!
Word2Vec CBOW
Goal: predicting the target word given the surrounding words
Word2Vec LM
Goal: predicting the next words given the preceding contexts
Comparison
Count-based
◦Example: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
◦Pros
  ▪ Fast training
  ▪ Efficient usage of statistics
◦Cons
  ▪ Primarily used to capture word similarity
  ▪ Disproportionate importance given to large counts
Direct prediction
◦Example: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
◦Pros
  ▪ Generate improved performance on other tasks
  ▪ Capture complex patterns beyond word similarity
◦Cons
  ▪ Benefits mainly from a large corpus
GloVe
Pennington et al., “GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.
GloVe
Idea: ratios of co-occurrence probabilities can encode meaning
P_ij is the probability that word w_j appears in the context of word w_i; the ratio P(x | w_i) / P(x | w_j) over probe words x characterizes the relationship between w_i and w_j.

                         x = solid    x = gas      x = water    x = fashion
P(x | ice)               1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)             2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice)/P(x | steam)  8.9          8.5 × 10⁻²   1.36         0.96

                         x = solid    x = gas      x = water    x = random
P(x | ice)               large        small        large        small
P(x | steam)             small        large        large        small
P(x | ice)/P(x | steam)  large        small        ~1           ~1
GloVe
The relationship between w_i and w_j is modeled through the ratio of their co-occurrence probabilities with various probe words w_k:
  F(w_i, w_j, w̃_k) = P_ik / P_jk
which leads to the weighted least-squares objective
  J = Σ_{i,j=1…V} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²
where X_ij is the co-occurrence count and f is a weighting function that limits the influence of very frequent pairs.
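A sketch of the weighting function and weighted least-squares loss above; x_max = 100 and α = 3/4 follow the settings reported in the GloVe paper, while the variable names and looping style are illustrative assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs and caps very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares GloVe objective over non-zero co-occurrence counts X_ij.
    W, W_ctx: (V, d) word and context-word vectors; b, b_ctx: (V,) biases. Sketch only."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return loss
```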
Word Vector Evaluation
Intrinsic Evaluation – Word Analogies
Word linear relationship: for an analogy a : b = c : ?, find the word d maximizing cos(w_d, w_b − w_a + w_c)
Syntactic and Semantic example questions [link]
Issue: different cities may have same name
city-in-state
Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts
Issue: can change with time
capital-country
Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan
superlative
bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest
past tense
dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
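All of these analogy questions are scored the same way; a sketch of answering a : b = c : ? by vector arithmetic over an assumed embedding dictionary `emb`.

```python
import numpy as np

def answer_analogy(a, b, c, emb):
    """Solve a : b = c : ?  via  argmax_d cos(v_b - v_a + v_c, v_d)."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):          # exclude the question words themselves
            continue
        sim = query @ (vec / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. answer_analogy("Chicago", "Illinois", "Houston", emb) should return "Texas"
# for a well-trained embedding dict `emb` mapping words to numpy vectors (assumed).
```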
Intrinsic Evaluation – Word Correlation
Comparing word correlation with human-judged scores (a short sketch follows the example table)
Human-judged word correlation dataset [link]
Word 1 Word 2 Human-Judged Score
tiger cat 7.35
tiger tiger 10.00
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
Issue: ambiguity, e.g., synonyms or the same word with different POS tags
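A sketch of the evaluation itself: rank-correlating cosine similarities with the human scores (here via scipy's Spearman correlation; the embedding dictionary `emb` is assumed).

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_with_humans(pairs, human_scores, emb):
    """Spearman correlation between cosine similarities and human-judged scores.
    pairs: list of (word1, word2); emb: dict word -> vector (assumed to cover all words)."""
    cos = [
        emb[w1] @ emb[w2] / (np.linalg.norm(emb[w1]) * np.linalg.norm(emb[w2]))
        for w1, w2 in pairs
    ]
    rho, _pvalue = spearmanr(cos, human_scores)
    return rho

# pairs = [("tiger", "cat"), ("book", "paper"), ("plane", "car")]
# human = [7.35, 7.46, 5.77]           # scores from the table above
# print(correlation_with_humans(pairs, human, emb))
```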
Extrinsic Evaluation – Subsequent Task
Goal: use word vectors in neural net models built for subsequent tasks
Benefit
◦Ability to also classify words accurately
◦ Ex. countries cluster together, so classifying location words should be possible with word vectors
◦Ability to incorporate any information into them for other tasks
◦ Ex. project sentiment into words to find the most positive/negative words in a corpus
Softmax & Cross-Entropy
Revisit Word Embedding Training
Goal: estimating vector representations that maximize the probability of the observed context words given each target word
Softmax Classification
Softmax classification on x to obtain the probability of class y
◦Definition: P(y = c | x) = exp(w_c · x) / Σ_{c′=1…C} exp(w_{c′} · x), where W ∈ R^{C×d} is the weight matrix (row w_c scores class c), x ∈ R^d is the input, and usually C > 2 (multi-class classification)
Loss of Softmax
Objective function: maximize the log probability of the correct class; equivalently, minimize the loss function −log P(y | x) = −w_y · x + log Σ_{c=1…C} exp(w_c · x)
◦If the correct answer already has the largest input to the softmax, then the first term and the second term will roughly cancel
◦so a correctly classified sample contributes little to the overall cost, which is instead dominated by incorrectly classified samples
Cross Entropy Loss
Cross entropy between the target and predicted probability distributions
◦Definition: H(p, q) = −Σ_c p(c) log q(c), where p is the target one-hot vector and q is the predicted probability distribution
◦Re-written as entropy plus Kullback-Leibler divergence: H(p, q) = H(p) + D_KL(p ∥ q)
◦KL divergence D_KL(p ∥ q) = Σ_c p(c) log (p(c) / q(c)) is not a distance but a non-symmetric measure of the difference between p and q
◦Since p is one-hot, H(p) = 0 and H(p, q) = −log q(y), so the cross-entropy loss equals the loss for softmax
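A small numpy sketch tying the two together: softmax classification followed by the cross-entropy loss, with an assumed random weight matrix and input.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_c p(c) log q(c); equals -log q(y) when p is one-hot."""
    return float(-(p * np.log(q + eps)).sum())

# Softmax classification with an assumed weight matrix W (C x d) and input x (d,):
rng = np.random.default_rng(0)
C, d = 3, 4
W, x = rng.normal(size=(C, d)), rng.normal(size=d)

q = softmax(W @ x)                 # predicted distribution over C classes
p = np.array([0.0, 1.0, 0.0])      # target one-hot vector (correct class = 1)

print(cross_entropy(p, q))         # same value as -np.log(q[1]): the softmax loss
```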
Concluding Remarks
Low dimensional word vector
◦word2vec: skip-gram, CBOW, LM
◦GloVe: combining count-based and direct learning
Word vector evaluation
◦Intrinsic: word analogy, word correlation
◦Extrinsic: subsequent task