Slides adapted from Dr. Richard Socher
Review
Meaning Representations in Computers
Knowledge-based representation
Corpus-based representation
✓ Atomic symbol
✓ Neighbors
  ◦ High-dimensional sparse word vector
  ◦ Low-dimensional dense word vector
    ▪ Method 1 – dimension reduction
    ▪ Method 2 – direct learning
Corpus-based representation
Atomic symbols: one-hot representation

car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = [0 0 0 0 0 0 1 0 0 … 0] AND [0 0 1 0 0 0 0 0 0 … 0] = 0

Idea: words with similar meanings often have similar neighbors.
Issue: similarity is difficult to compute between one-hot vectors (e.g., comparing "car" and "motorcycle" always gives 0).
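A minimal sketch of this problem (the toy vocabulary and index positions are made up for illustration): distinct one-hot vectors are always orthogonal, so their dot product, and hence cosine similarity, is zero.

```python
import numpy as np

# Toy vocabulary: each word is an atomic symbol -> one-hot vector
vocab = ["the", "a", "motorcycle", "drives", "fast", "red", "car", "slow"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
print(car @ motorcycle)  # 0.0 -- "car" and "motorcycle" look unrelated
print(car @ car)         # 1.0 -- only identical words match
```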
Window-based Co-occurrence Matrix
Example
◦ Window length = 1
◦ Left or right context
◦ Corpus:
  I love NTU.
  I love deep learning.
  I enjoy learning.

Counts      I   love  enjoy  NTU  deep  learning
I           0    2     1      0    0     0
love        2    0     0      1    1     0
enjoy       1    0     0      0    0     1
NTU         0    1     0      0    0     0
deep        0    1     0      0    0     1
learning    0    0     1      0    1     0

→ Vectors of words with similar neighbors now have similarity > 0.

Issues:
▪ matrix size increases with vocabulary
▪ high-dimensional
▪ sparsity → poor robustness
Idea: low-dimensional word vectors
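A minimal sketch that reproduces the count table above (punctuation is ignored for simplicity):

```python
import numpy as np

# Corpus from the slide
corpus = ["I love NTU", "I love deep learning", "I enjoy learning"]
vocab = ["I", "love", "enjoy", "NTU", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # window length = 1: count the immediate left and right neighbors
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

print(X)  # reproduces the count table above
```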
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X:

$$X \approx U_k \Sigma_k V_k^{\top}$$

the rank-k approximation of X that keeps only the k largest singular values.
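A minimal sketch of Method 1, assuming the matrix X built in the previous snippet; the choice k = 2 is arbitrary here:

```python
import numpy as np

# X: the window-based co-occurrence matrix built above (V x V)
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                      # keep only the k largest singular values
W = U[:, :k] * S[:k]       # each row of W is a k-dimensional dense word vector
X_approx = W @ Vt[:k, :]   # the rank-k approximation of X
```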
The projected word vectors capture both semantic relations and syntactic relations (see the figures in Rohde et al., "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence," 2005).

Issues:
▪ computationally expensive: O(mn²) for an n × m matrix when n < m
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Word Representation
Knowledge-based representation
Corpus-based representation
✓ Atomic symbol
✓ Neighbors
  ◦ High-dimensional sparse word vector
  ◦ Low-dimensional dense word vector
    ▪ Method 1 – dimension reduction
    ▪ Method 2 – direct learning → word embedding
Word Embedding
Method 2: directly learn low-dimensional word vectors
◦ Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦ A neural probabilistic language model (Bengio et al., 2003)
◦ Natural Language Processing (Almost) from Scratch (Collobert et al., 2011)
◦ Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word Embedding Benefits
Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:
◦ semantic similarity between two words can be calculated as the cosine similarity between their corresponding word vectors
◦ word vectors are powerful features for various supervised NLP tasks, since they carry semantic information
◦ information can be propagated into them via neural networks and updated during training
Word2Vec Skip-Gram
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
Word2Vec – Skip-Gram Model
Goal: predict surrounding words within a window of each word.
Objective function: maximize the probability of any context word given the current center word:

$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\;\sum_{-m \le j \le m,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where m is the size of the context window, and the probability of an outside word o given the target word c is a softmax over the vocabulary:

$$p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$

with $v_c$ the target word vector and $u_o$ the outside (context) word vector.

Benefits: faster than dimension reduction, and a new sentence/document or a new vocabulary word is easy to incorporate.
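Concretely, the training data is just (target, context) pairs extracted with a sliding window. A minimal sketch (the helper name skipgram_pairs is made up for illustration):

```python
def skipgram_pairs(words, window=1):
    """Yield (target, context) pairs within the given window."""
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                yield target, words[j]

print(list(skipgram_pairs("I love deep learning".split())))
# [('I', 'love'), ('love', 'I'), ('love', 'deep'),
#  ('deep', 'love'), ('deep', 'learning'), ('learning', 'deep')]
```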
Word2Vec Skip-Gram Illustration
Goal: predict surrounding words within a window of each word.
(Figure: a network with one-hot input x of dimension V, hidden layer h of dimension N, and output score vector s of dimension V.)
Hidden layer weight matrix → word embedding matrix
Weight Matrix Relation
Hidden layer weight matrix = word vector lookup: multiplying the one-hot input x by W selects a single row of W, so h = Wᵀx is exactly the target word's vector.
Each vocabulary entry has two vectors: one as a target word (a row of W) and one as a context word (a column of W′).
Weight Matrix Relation
Output layer weight matrix = weighted sum as the final score: s = W′ᵀh gives one score per vocabulary word, and a softmax converts the scores into probabilities of appearing within the context window.
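A minimal numpy sketch of this forward pass, with randomly initialized weights; the dimensions V and N are toy values:

```python
import numpy as np

V, N = 6, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))      # input->hidden: target-word vectors (one per row)
Wp = rng.normal(size=(N, V))      # hidden->output: context-word vectors (one per column)

target = 1                        # index of the target word w_I
x = np.zeros(V)
x[target] = 1.0                   # one-hot input

h = W.T @ x                       # multiplying by a one-hot is a row lookup
assert np.allclose(h, W[target])

s = Wp.T @ h                      # one score per vocabulary word
p = np.exp(s) / np.exp(s).sum()   # softmax: predicted context-word distribution
```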
Loss Function
Given a target word w_I, the loss is the negative log-likelihood of the observed context words; the update equations are sketched after the next two slides.
SGD Update for W′
Given a target word w_I, each output word w_j carries a label t_jc (= 1 when w_j is the context word at position c within the window, = 0 otherwise). The gap between the softmax output and this label is the error term that drives the update of the context-word vectors in W′.
SGD Update for W
The error terms are then propagated back through W′ to update the target word's row of W.
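One standard way to write these equations out, in the x, h, s notation used above (the slides' exact notation may differ); here y_j is the softmax output, t_jc the context label, C the number of context positions, and η the learning rate:

```latex
% Softmax output for word w_j
y_j = \frac{\exp(s_j)}{\sum_{k=1}^{V} \exp(s_k)}

% Loss: negative log-likelihood of the C observed context words
E = -\sum_{c=1}^{C} \log y_{j_c^{*}}

% Error term for output word w_j, summed over context positions
e_j = \sum_{c=1}^{C} \bigl( y_j - t_{jc} \bigr), \qquad
t_{jc} = \begin{cases} 1 & \text{if } w_j \text{ is the context word at position } c \\
                       0 & \text{otherwise} \end{cases}

% SGD updates
\mathbf{v}'_{w_j} \leftarrow \mathbf{v}'_{w_j} - \eta \, e_j \, \mathbf{h}
  \qquad \text{(update for } W' \text{)}

\mathbf{v}_{w_I} \leftarrow \mathbf{v}_{w_I} - \eta \sum_{j=1}^{V} e_j \, \mathbf{v}'_{w_j}
  \qquad \text{(update for } W \text{)}
```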
SGD Update
Issue: with a large vocabulary or a large training corpus, updating all V output vectors for every training instance is expensive.
Solution: limit the number of output vectors that must be updated per training instance → hierarchical softmax, sampling.
Hierarchical Softmax
Idea: arrange the vocabulary as the leaves of a binary tree and compute the probability of each leaf node (word) from its path.
Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
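In the common formulation, the probability of a word is the product of sigmoid branch decisions along the root-to-leaf path, so each word costs O(log V) instead of O(V):

```latex
p(w \mid w_I) = \prod_{k=1}^{L(w)-1}
  \sigma\!\bigl( [\![\, n_{k+1} = \mathrm{left}(n_k) \,]\!] \cdot {\mathbf{v}'_{n_k}}^{\top} \mathbf{h} \bigr)
```

where n_1, …, n_L(w) is the path from the root to the leaf for w, each inner node n_k has its own vector v′_n, and [[·]] is +1 for a left child and −1 otherwise.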
Negative Sampling
Idea: only update a sample of the output vectors.
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
Negative Sampling
Sampling methods
◦ Random sampling
◦ Distribution sampling: w_j is sampled from P(w)
Empirical setting: the unigram distribution raised to the power of 3/4
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
What is a good P(w)?
Idea: sample less frequent words more often (relative to their raw frequency).

Word          P(w)    P(w)^(3/4) → probability of being sampled for "neg"
is            0.9     0.9^(3/4)  = 0.92
constitution  0.09    0.09^(3/4) = 0.16
bombastic     0.01    0.01^(3/4) = 0.032
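A minimal sketch of drawing negative samples from the 3/4-power unigram distribution, using the table's values (note the table's powered values are unnormalized, so the code renormalizes them):

```python
import numpy as np

words = ["is", "constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])   # unigram probabilities from the table

weights = unigram ** 0.75               # empirical 3/4-power reweighting
P = weights / weights.sum()             # renormalize into a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=P)  # draw 5 negative samples
print(dict(zip(words, P.round(3))))
```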
Word2Vec Skip-Gram Visualization
https://ronxin.github.io/wevi/
Skip-gram training data:
apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water
Word2Vec Variants
◦ Skip-gram: predict surrounding words given the target word (Mikolov+, 2013)
◦ CBOW (continuous bag-of-words): predict the target word given the surrounding words (Mikolov+, 2013)
◦ LM (language modeling): predict the next word given the preceding contexts (Mikolov+, 2013)
LM-style prediction came first; skip-gram and CBOW tend to work better for learning embeddings. Practice the derivation by yourself!!
Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
Mikolov et al., "Linguistic regularities in continuous space word representations," in NAACL HLT, 2013.
Word2Vec CBOW
Goal: predict the target word given the surrounding words.
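As a sketch in the same notation as the skip-gram slides, CBOW averages the C context word vectors to form the hidden layer and predicts the target word with a softmax:

```latex
\mathbf{h} = \frac{1}{C} \sum_{c=1}^{C} \mathbf{v}_{w_c},
\qquad
p(w_t \mid w_{t-m}, \dots, w_{t+m}) =
  \frac{\exp\!\bigl( {\mathbf{v}'_{w_t}}^{\top} \mathbf{h} \bigr)}
       {\sum_{j=1}^{V} \exp\!\bigl( {\mathbf{v}'_{w_j}}^{\top} \mathbf{h} \bigr)}
```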
Word2Vec LM
Goal: predict the next word given the preceding contexts.
Comparison
Count-based
◦ Examples: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
◦ Pros
  ✓ Fast training
  ✓ Efficient usage of statistics
◦ Cons
  ▪ Primarily capture word similarity
  ▪ Disproportionate importance given to large counts

Direct prediction
◦ Examples: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
◦ Pros
  ✓ Improved performance on other tasks
  ✓ Capture complex patterns beyond word similarity
◦ Cons
  ▪ Benefit mainly from a large corpus
  ▪ Inefficient usage of statistics

Combining the benefits of both worlds → GloVe
GloVe
Pennington et al., "GloVe: Global Vectors for Word Representation," in EMNLP, 2014.
GloVe
Idea: ratios of co-occurrence probabilities can encode meaning.
P_ij is the probability that word w_j appears in the context of word w_i.
Relationship between words w_i and w_j, probed with various words x:

                           x = solid    x = gas      x = water    x = fashion
P(x | ice)                 1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)               2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice) / P(x | steam)  8.9          8.5 × 10⁻²   1.36         0.96

Qualitatively:

                           x = solid    x = gas      x = water    x = random
P(x | ice)                 large        small        large        small
P(x | steam)               small        large        large        small
P(x | ice) / P(x | steam)  large        small        ~ 1          ~ 1
GloVe
The relationship between w_i and w_j is captured by the ratio of their co-occurrence probabilities with various probe words w_k.
Benefits: fast training, scalable, good performance even with a small corpus and small vectors.
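Behind these properties is the GloVe training objective from Pennington et al. (2014): a weighted least-squares regression on log co-occurrence counts,

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \bigl( \mathbf{w}_i^{\top} \tilde{\mathbf{w}}_j
    + b_i + \tilde{b}_j - \log X_{ij} \bigr)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
                     1 & \text{otherwise} \end{cases}
```

where the weighting function f caps the influence of very large counts; the paper uses α = 3/4 and x_max = 100.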
Word Vector Evaluation
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]
Issue: what if the information is there but not linear?
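A minimal sketch of answering an analogy a : b = c : ? by nearest cosine neighbor of v_b − v_a + v_c; here emb is a hypothetical word → vector dictionary:

```python
import numpy as np

def analogy(a, b, c, emb):
    """Solve a : b = c : ? by nearest cosine neighbor of v_b - v_a + v_c."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):          # exclude the question words themselves
            continue
        sim = (v / np.linalg.norm(v)) @ query
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. analogy("Chicago", "Illinois", "Houston", emb) should return "Texas"
```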
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]

city-in-state (Issue: different cities may have the same name)
Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts

capital-country (Issue: can change with time)
Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan
Intrinsic Evaluation – Word Analogies
Word linear relationships
Syntactic and semantic example questions [link]

superlative
bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest

past tense
dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
dancing : danced = hitting : hit
Intrinsic Evaluation – Word Correlation
Compare word-vector similarity against human-judged correlation scores [link]

Word 1      Word 2     Human-judged score
tiger       cat        7.35
tiger       tiger      10.00
book        paper      7.46
computer    internet   7.58
plane       car        5.77
professor   doctor     6.62
stock       phone      1.62

Ambiguity: synonyms, or the same word with different POS tags.
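A minimal sketch of this evaluation, reporting Spearman's rank correlation between model cosine similarities and human scores; pairs and emb are hypothetical inputs:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def correlation_eval(pairs, emb):
    """pairs: (word1, word2, human_score) triples; emb: word -> vector."""
    human = [score for _, _, score in pairs]
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    rho, _ = spearmanr(human, model)  # rank correlation with human judgments
    return rho
```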
Extrinsic Evaluation – Subsequent Task
Goal: use word vectors in neural network models built for subsequent tasks.
Benefits
◦ Ability to also classify words accurately
  ◦ Ex. countries cluster together, so classifying location words should be possible with word vectors
◦ Ability to incorporate any information into them for other tasks
  ◦ Ex. project sentiment into the words to find the most positive/negative words in a corpus
Concluding Remarks
Low-dimensional word vectors
◦ word2vec: skip-gram, CBOW, LM
◦ GloVe: combining count-based and direct-learning approaches
Word vector evaluation
◦ Intrinsic: word analogy, word correlation
◦ Extrinsic: subsequent tasks