Review
Meaning Representations in Computers
Knowledge-based representation
Corpus-based representation
  Atomic symbol
  Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Corpus-based representation
Atomic symbols: one-hot representation
  car        = [0 0 0 0 0 0 1 0 0 … 0]
  motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
  car AND motorcycle = 0  (one-hot vectors never overlap)
Issue: difficult to compute word similarity (e.g., comparing “car” and “motorcycle”)
Idea: words with similar meanings often have similar neighbors
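A minimal sketch of the point above (the toy vocabulary and word indices are assumptions): distinct one-hot vectors never overlap, so any similarity computed from them is 0.

```python
import numpy as np

# Toy vocabulary; the words and their indices are illustrative assumptions.
vocab = ["the", "a", "motorcycle", "is", "on", "road", "car", "fast"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=len(vocab)):
    """Return the one-hot (atomic symbol) vector for a word."""
    v = np.zeros(size)
    v[word2idx[word]] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
# Distinct one-hot vectors never share a non-zero position, so the dot product is 0:
print(car @ motorcycle)  # 0.0 -> "car" and "motorcycle" look completely unrelated
```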
Window-based Co-occurrence Matrix
Example
◦Window length=1
◦Left or right context
◦Corpus:
I love NTU.
I love deep learning.
I enjoy learning.
Counts I love enjoy NTU deep learning
I 0 2 1 0 0 0
love 2 0 0 1 1 0
enjoy 1 0 0 0 0 1
NTU 0 1 0 0 0 0
deep 0 1 0 0 0 1
learning 0 0 1 0 1 0
similarity > 0 (e.g., “love” and “enjoy” now share the neighbor “I”, so their count vectors overlap)
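The matrix above can be reproduced with a short script; a sketch follows, assuming simple whitespace tokenization with the sentence-final periods dropped (row/column order may differ from the table).

```python
import numpy as np

# Toy corpus from the slide; tokenization (strip '.', keep the original casing) is assumed.
corpus = ["I love NTU .", "I love deep learning .", "I enjoy learning ."]
sentences = [[tok for tok in s.split() if tok != "."] for s in corpus]

vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, w in enumerate(sent):
        # count neighbors within `window` positions to the left and right
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(vocab)
print(X)  # reproduces the counts shown in the table above
```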
Issues:
▪ matrix size increases with vocabulary
▪ high dimensional
▪ sparsity → poor robustness
Idea: low dimensional word vector
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X:
  X ≈ U_k Σ_k V_k^T  (keep only the k largest singular values)
The rows of U_k (scaled by Σ_k) serve as k-dimensional dense word vectors.
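A sketch of this reduction using numpy's SVD, applied to the co-occurrence matrix X built earlier; the target dimensionality k is an assumption.

```python
import numpy as np

# X: the (vocab_size x vocab_size) window-based co-occurrence matrix built above.
def svd_word_vectors(X, k=2):
    """Method 1 sketch: truncated SVD as dimension reduction."""
    U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    # Keep the top-k singular components; each row is a k-dimensional dense word vector.
    return U[:, :k] * S[:k]

# vectors = svd_word_vectors(X, k=2)   # e.g. for plotting words in 2-D
```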
Issues:
▪ computationally expensive: O(mn²) for an n × m matrix (when n < m)
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Word Embedding
Method 2: directly learn low-dimensional word vectors
◦Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦A neural probabilistic language model (Bengio et al., 2003)
◦NLP (almost) from scratch (Collobert & Weston, 2008)
◦Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word Embedding Benefit
Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:
◦ semantic similarity between two words can be computed as the cosine similarity of their word vectors (see the sketch below)
◦ word vectors serve as powerful features for various supervised NLP tasks, since they already contain semantic information
◦ any task information can be propagated into them via neural networks, updating the vectors during training
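For the first point, a minimal cosine-similarity sketch; the 4-dimensional embedding values are made up purely for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = u·v / (||u|| ||v||); in [-1, 1] for real-valued vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-d embeddings, just to illustrate the call:
car        = np.array([0.8, 0.1, 0.3, 0.5])
motorcycle = np.array([0.7, 0.2, 0.2, 0.6])
banana     = np.array([-0.3, 0.9, 0.1, -0.2])

print(cosine_similarity(car, motorcycle))  # high  -> semantically close
print(cosine_similarity(car, banana))      # lower -> semantically distant
```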
Word2Vec Skip-Gram
Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.
Word2Vec – Skip-Gram Model
Goal: predict the surrounding words within a window of each word
Objective function: maximize the probability of any context word given the current target (center) word
  J(θ) = (1/T) Σ_{t=1…T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
  P(o | c) = exp(u_o · v_c) / Σ_{w=1…V} exp(u_w · v_c), where v_c is the target word vector, u_o is the outside (context) word vector, and m is the context window size
Benefit: fast to train; easily incorporates a new sentence/document or adds a word to the vocabulary
Word2Vec Skip-Gram Illustration
Goal: predict the surrounding words within a window of each word
Architecture: one-hot input x (length V = vocabulary size) → hidden layer h (length N = embedding dimension) → output scores s (length V)
Weight Matrix Relation
◦ Hidden layer weight matrix W (V × N) = word embedding matrix: h = W^T x simply looks up the row of W corresponding to the target word
◦ Output layer weight matrix W′ (N × V): each score in s = W′^T h is the weighted sum (inner product) of h with one output word vector; a softmax over s gives the probability of each word appearing within the context window
Loss Function
Given a target word w_I and C context words, the loss is
  E = −log p(w_{O,1}, …, w_{O,C} | w_I) = −Σ_{c=1…C} u_{j*_c} + C · log Σ_{j′=1…V} exp(u_{j′})
where u_j = v′_{w_j} · h is the score of word j and j*_c is the index of the c-th actual context word.
SGD Update for W′ (output vectors)
  v′_{w_j} ← v′_{w_j} − η · EI_j · h, with error term EI_j = Σ_{c=1…C} (y_{c,j} − t_{c,j}),
  where t_{c,j} = 1 when w_j is within the context window (it is the c-th context word) and t_{c,j} = 0 otherwise.
SGD Update for W (input vector of the target word)
  v_{w_I} ← v_{w_I} − η · EH, where EH = Σ_{j=1…V} EI_j · v′_{w_j} back-propagates the error from the output layer to the hidden layer.
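A minimal numpy sketch of one full-softmax skip-gram SGD step implementing the updates above (no hierarchical softmax or negative sampling); the matrix shapes, learning rate, and toy usage values are assumptions.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def skipgram_sgd_step(W, W_out, target, context, lr=0.025):
    """One SGD step for the full-softmax skip-gram model (sketch only).

    W       : (V, N) input weight matrix  -> row i is the word vector v_{w_i}
    W_out   : (N, V) output weight matrix -> column j is the output vector v'_{w_j}
    target  : index of the target (center) word w_I
    context : indices of the C context words within the window
    """
    h = W[target]                       # hidden layer = embedding lookup, shape (N,)
    u = W_out.T @ h                     # scores u_j for every vocabulary word, shape (V,)
    y = softmax(u)                      # predicted probabilities over context words

    # Summed prediction error EI_j = sum_c (y_j - t_{c,j})
    EI = len(context) * y
    for c in context:
        EI[c] -= 1.0

    EH = W_out @ EI                     # error back-propagated to the hidden layer, shape (N,)
    W_out -= lr * np.outer(h, EI)       # update output vectors W'
    W[target] -= lr * EH                # update the input vector of the target word

    return float(-np.log(y[context] + 1e-12).sum())   # loss, for monitoring only

# Usage with a tiny random model (V = 6 words, N = 3 dimensions), just to show the shapes:
rng = np.random.default_rng(0)
W, W_out = rng.normal(0, 0.1, (6, 3)), rng.normal(0, 0.1, (3, 6))
print(skipgram_sgd_step(W, W_out, target=1, context=[0, 3]))
```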
Hierarchical Softmax
Idea: compute the probability of each word (a leaf node of a binary tree) as the product of binary decisions along the path from the root, reducing the per-word cost from O(V) to O(log V)
Negative Sampling
Idea: only update a sample of output vectors (the positive word plus a few sampled “negative” words)
Sampling methods
◦Random sampling
◦Distribution sampling: w_j is sampled from P(w)
What is a good P(w)?
◦Empirical setting: the unigram distribution raised to the power of 3/4
◦Idea: frequent words are damped, so less frequent words are sampled relatively more often

Word          Unigram prob.   Weight to be sampled as “neg”
is            0.9             0.9^(3/4)  = 0.92
constitution  0.09            0.09^(3/4) = 0.16
bombastic     0.01            0.01^(3/4) = 0.032
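A short sketch of drawing negatives from the 3/4-power unigram distribution, treating the three example words above as if they were the whole vocabulary (an assumption) and with an assumed number of negatives per positive.

```python
import numpy as np

# Unigram probabilities from the slide's example (assumed to be the full vocabulary here).
words   = ["is", "constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])

# Negative-sampling distribution: unigram raised to the 3/4 power, then renormalized.
weights = unigram ** 0.75          # 0.92, 0.16, 0.032  (matches the table)
P_neg = weights / weights.sum()

# Draw k negative samples per positive example (k = 5 is an assumption):
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=P_neg)
print(dict(zip(words, P_neg.round(3))), negatives)
```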
Word2Vec Skip-Gram Visualization
https://ronxin.github.io/wevi/
Skip-gram training data:
apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk, milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink ,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water
Word2Vec Variants
Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)
CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)
LM (Language modeling): predicting the next words given the preceding contexts (Mikolov+, 2013)
Practice the derivation by yourself!!
Word2Vec CBOW
Goal: predicting the target word given the surrounding words
Word2Vec LM
Goal: predicting the next words given the preceding contexts
Comparison
Count-based
◦Example: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
◦Pros
  ▪ Fast training
  ▪ Efficient usage of statistics
◦Cons
  ▪ Primarily used to capture word similarity
  ▪ Disproportionate importance given to large counts
Direct prediction
◦Example: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
◦Pros
  ▪ Generate improved performance on other tasks
  ▪ Capture complex patterns beyond word similarity
◦Cons
  ▪ Benefits mainly from a large corpus
GloVe
Pennington et al., “GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.
GloVe
Idea: ratios of co-occurrence probabilities can encode meaning
P_ij is the probability that word w_j appears in the context of word w_i; the ratio P(x | w_i) / P(x | w_j) over probe words x characterizes the relationship between w_i and w_j.

                         x = solid    x = gas      x = water    x = fashion
P(x | ice)               1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)             2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice)/P(x | steam)  8.9          8.5 × 10⁻²   1.36         0.96

                         x = solid    x = gas      x = water    x = random
P(x | ice)               large        small        large        small
P(x | steam)             small        large        large        small
P(x | ice)/P(x | steam)  large        small        ~1           ~1
GloVe
The relationship between w_i and w_j is modeled through the ratio of their co-occurrence probabilities with various probe words w_k:
  F(w_i, w_j, w̃_k) = P_ik / P_jk
which leads to the weighted least-squares objective
  J = Σ_{i,j=1…V} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²
where X_ij is the co-occurrence count and f is a weighting function that limits the influence of very frequent pairs.
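A sketch of the weighting function and weighted least-squares loss above; x_max = 100 and α = 3/4 follow the settings reported in the GloVe paper, while the variable names and looping style are illustrative assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs and caps very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares GloVe objective over non-zero co-occurrence counts X_ij.
    W, W_ctx: (V, d) word and context-word vectors; b, b_ctx: (V,) biases. Sketch only."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return loss
```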
Word Vector Evaluation
Intrinsic Evaluation – Word Analogies
Word linear relationship: for an analogy a : b = c : ?, find the word d maximizing cos(w_d, w_b − w_a + w_c)
Syntactic and Semantic example questions [link]
Issue: different cities may have same name
city-in-state
Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts
Issue: can change with time
capital-country
Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan
superlative
bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest
past tense
dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
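All of these analogy questions are scored the same way; a sketch of answering a : b = c : ? by vector arithmetic over an assumed embedding dictionary `emb`.

```python
import numpy as np

def answer_analogy(a, b, c, emb):
    """Solve a : b = c : ?  via  argmax_d cos(v_b - v_a + v_c, v_d)."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):          # exclude the question words themselves
            continue
        sim = query @ (vec / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. answer_analogy("Chicago", "Illinois", "Houston", emb) should return "Texas"
# for a well-trained embedding dict `emb` mapping words to numpy vectors (assumed).
```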
Intrinsic Evaluation – Word Correlation
Comparing word correlation with human-judged scores (a short sketch follows the example table)
Human-judged word correlation dataset [link]
Word 1 Word 2 Human-Judged Score
tiger cat 7.35
tiger tiger 10.00
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
Issue: ambiguity, e.g., synonyms or the same word with different POS tags
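A sketch of the evaluation itself: rank-correlating cosine similarities with the human scores (here via scipy's Spearman correlation; the embedding dictionary `emb` is assumed).

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_with_humans(pairs, human_scores, emb):
    """Spearman correlation between cosine similarities and human-judged scores.
    pairs: list of (word1, word2); emb: dict word -> vector (assumed to cover all words)."""
    cos = [
        emb[w1] @ emb[w2] / (np.linalg.norm(emb[w1]) * np.linalg.norm(emb[w2]))
        for w1, w2 in pairs
    ]
    rho, _pvalue = spearmanr(cos, human_scores)
    return rho

# pairs = [("tiger", "cat"), ("book", "paper"), ("plane", "car")]
# human = [7.35, 7.46, 5.77]           # scores from the table above
# print(correlation_with_humans(pairs, human, emb))
```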
Extrinsic Evaluation – Subsequent Task
Goal: use word vectors in neural net models built for subsequent tasks
Benefit
◦Ability to also classify words accurately
◦ Ex. countries cluster together, so classifying location words should be possible with word vectors
◦Ability to incorporate any information into them for other tasks
◦ Ex. project sentiment into words to find the most positive/negative words in a corpus
Softmax & Cross-Entropy
Revisit Word Embedding Training
Goal: estimating vector representations that maximize the probability of the observed context words given each target word
Softmax Classification
Softmax classification on x to obtain the probability of class y
◦Definition: P(y = c | x) = exp(w_c · x) / Σ_{c′=1…C} exp(w_{c′} · x), where W ∈ R^{C×d} is the weight matrix (row w_c scores class c), x ∈ R^d is the input, and usually C > 2 (multi-class classification)
Loss of Softmax
Objective function: maximize the log probability of the correct class; equivalently, minimize the loss function −log P(y | x) = −w_y · x + log Σ_{c=1…C} exp(w_c · x)
◦If the correct answer already has the largest input to the softmax, then the first term and the second term will roughly cancel
◦so a correctly classified sample contributes little to the overall cost, which is instead dominated by incorrectly classified samples
Cross Entropy Loss
Cross entropy between the target and predicted probability distributions
◦Definition: H(p, q) = −Σ_c p(c) log q(c), where p is the target one-hot vector and q is the predicted probability distribution
◦Re-written as entropy plus Kullback-Leibler divergence: H(p, q) = H(p) + D_KL(p ∥ q)
◦KL divergence D_KL(p ∥ q) = Σ_c p(c) log (p(c) / q(c)) is not a distance but a non-symmetric measure of the difference between p and q
◦Since p is one-hot, H(p) = 0 and H(p, q) = −log q(y), so the cross-entropy loss equals the loss for softmax
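A small numpy sketch tying the two together: softmax classification followed by the cross-entropy loss, with an assumed random weight matrix and input.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_c p(c) log q(c); equals -log q(y) when p is one-hot."""
    return float(-(p * np.log(q + eps)).sum())

# Softmax classification with an assumed weight matrix W (C x d) and input x (d,):
rng = np.random.default_rng(0)
C, d = 3, 4
W, x = rng.normal(size=(C, d)), rng.normal(size=d)

q = softmax(W @ x)                 # predicted distribution over C classes
p = np.array([0.0, 1.0, 0.0])      # target one-hot vector (correct class = 1)

print(cross_entropy(p, q))         # same value as -np.log(q[1]): the softmax loss
```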
Concluding Remarks
Low dimensional word vector
◦word2vec: skip-gram, CBOW, LM
◦GloVe: combining count-based and direct learning
Word vector evaluation
◦Intrinsic: word analogy, word correlation
◦Extrinsic: subsequent task