Slide credit: Dr. Richard Socher
Classification Task
◦x: input object to be classified (an N-dim vector)
◦y: class/label (an M-dim vector)
◦target function f : R^N → R^M, mapping x to y
Assume both x and y can be represented as fixed-size vectors.
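A minimal sketch of this setup, assuming a simple linear classifier stands in for f (dimensions and names are illustrative, not from the slides):

```python
import numpy as np

# Illustrative dimensions: N-dim input vector, M classes.
N, M = 300, 2

rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))    # learned weight matrix
b = np.zeros(M)                # learned bias

def f(x):
    """Map an N-dim input x to a probability distribution over M classes."""
    scores = W @ x + b
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

x = rng.normal(size=N)   # e.g. a sentence encoded as a fixed-size vector
print(f(x))              # probabilities for the M classes (e.g. + / -)
```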
Learning Target Function
“這規格有誠意!” (“These specs show real effort!”) +
“太爛了吧~” (“This is just terrible!”) -
How do we represent the meaning of a word?
Meaning Representations
Definition of “Meaning”
◦the idea that is represented by a word, phrase, etc.
◦the idea that a person wants to express by using words, signs, etc.
◦the idea that is expressed in a work of writing, art, etc.
Goal: word representations that capture the relationships between words
Meaning Representations in Computers
◦ Knowledge-based representation
◦ Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Knowledge-based representation
Hypernym (is-a) relationships from WordNet
Issues:
▪ newly-invented words
▪ subjective
▪ annotation effort
▪ difficult to compute word similarity
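As a small illustration of this knowledge-based approach, a WordNet lookup via NLTK (assuming nltk is installed and the wordnet corpus has been downloaded; the words chosen are only examples):

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]               # first sense of "car"
print(car.definition())                  # dictionary-style gloss
print(car.hypernyms())                   # is-a parents (e.g. a motor-vehicle synset)
print(car.hypernym_paths()[0])           # one is-a path up to the root synset

motorcycle = wn.synsets('motorcycle')[0]
print(car.path_similarity(motorcycle))   # graph-based similarity score in (0, 1]
```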
Corpus-based representation
Atomic symbols: one-hot representation
car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = 0
Issue: difficult to compute word similarity (e.g. comparing “car” and “motorcycle” gives 0)
Idea: words with similar meanings often have similar neighbors
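A quick numerical check of the orthogonality issue above, with made-up vocabulary indices:

```python
import numpy as np

V = 10                         # toy vocabulary size
car_id, motorcycle_id = 6, 2   # made-up vocabulary indices

car = np.zeros(V);        car[car_id] = 1
motorcycle = np.zeros(V); motorcycle[motorcycle_id] = 1

# Distinct one-hot vectors are orthogonal, so their dot product
# (the natural similarity measure) is always 0.
print(car @ motorcycle)   # 0.0
```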
Corpus-based representation
Co-occurrence matrix
◦Neighbor definition: full document vs. window
  ▪ full document: a word-document co-occurrence matrix gives general topics (“Latent Semantic Analysis”)
  ▪ window: a context window around each word captures both syntactic (e.g. POS) and semantic information
Window-based Co-occurrence Matrix
Example
◦Window length=1
◦Left or right context
◦Corpus:
I love NTU.
I love deep learning.
I enjoy learning.
Counts   | I | love | enjoy | NTU | deep | learning
---------+---+------+-------+-----+------+---------
I        | 0 |  2   |   1   |  0  |  0   |    0
love     | 2 |  0   |   0   |  1  |  1   |    0
enjoy    | 1 |  0   |   0   |  0  |  0   |    1
NTU      | 0 |  1   |   0   |  0  |  0   |    0
deep     | 0 |  1   |   0   |  0  |  0   |    1
learning | 0 |  0   |   1   |  0  |  1   |    0
Words that appear in similar contexts (e.g. “NTU” and “deep”, both neighbors of “love”) now have similarity > 0
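A short sketch that builds this window-1 co-occurrence matrix from the toy corpus (variable names are illustrative):

```python
import numpy as np

corpus = ["I love NTU", "I love deep learning", "I enjoy learning"]
sentences = [s.split() for s in corpus]

vocab = ["I", "love", "enjoy", "NTU", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for words in sentences:
    for i, w in enumerate(words):
        # count every neighbor within `window` positions to the left or right
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

print(X)   # reproduces the counts table above
```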
Issues:
▪ matrix size increases with vocabulary
▪ high dimensional
▪ sparsity → poor robustness
Idea: low-dimensional word vectors
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X: X ≈ U Σ V^T, keeping only the top-k singular values; the (scaled) rows of U then serve as low-dimensional word vectors
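A minimal sketch, reusing the matrix X and vocab from the co-occurrence sketch above; keeping the top-k singular values and scaling U's columns is one common LSA-style recipe, and k is chosen arbitrarily here:

```python
import numpy as np

# X: the 6x6 window-based co-occurrence matrix built earlier.
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                              # keep only the top-k singular values
word_vectors = U[:, :k] * S[:k]    # one k-dim dense vector per word

for w, vec in zip(vocab, word_vectors):
    print(f"{w:10s} {vec}")
```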
Issues:
▪ computationally expensive: O(mn²) for an n × m matrix when n < m
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Low-Dimensional Dense Word Vector
Method 2: directly learn low-dimensional word vectors
◦Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦A neural probabilistic language model (Bengio et al., 2003)
◦NLP (almost) from Scratch (Collobert et al., 2011)
◦Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word2Vec
Idea/Goal: predict the surrounding words within a context window (size = m) around each center word
Objective function: maximize the log probability of any context word given the current center word
Benefits: faster than SVD-based dimension reduction; a new sentence/document or a new vocabulary word is easy to incorporate
u: outside (context) word vector
v: center (target) word vector
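Written out, the usual skip-gram objective over a corpus of T words, with the softmax that defines the context-word probability, is:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\!\left(w_{t+j} \mid w_{t}\right),
\qquad
P(o \mid c) = \frac{\exp\!\left(u_{o}^{\top} v_{c}\right)}{\sum_{w=1}^{V} \exp\!\left(u_{w}^{\top} v_{c}\right)}
```

In practice these vectors are usually trained with an off-the-shelf implementation; a minimal sketch with gensim (assuming gensim ≥ 4 is installed; the corpus and hyperparameters are only illustrative):

```python
# Minimal skip-gram word2vec training sketch with gensim >= 4.
from gensim.models import Word2Vec

sentences = [["I", "love", "NTU"],
             ["I", "love", "deep", "learning"],
             ["I", "enjoy", "learning"]]

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)   # sg=1 selects skip-gram

print(model.wv["NTU"])                 # 50-dim dense vector for "NTU"
print(model.wv.most_similar("deep"))   # nearest neighbors by cosine similarity
```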
Major Advantages of Word Embeddings
Any information can be propagated into them via neural networks
◦(deep-)learned word embeddings form the basis for all language-related tasks
◦the networks, R and Ws, can be updated during model training
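A small sketch of this idea in PyTorch (sizes and names are made up, and the embedding/classifier pair only loosely plays the roles of R and Ws): pre-trained word vectors are loaded into an embedding layer and then updated together with the task network.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10000, 300, 2
pretrained = torch.randn(vocab_size, emb_dim)   # stand-in for word2vec/GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # roughly "R"
classifier = nn.Linear(emb_dim, num_classes)                        # roughly "Ws"

token_ids = torch.tensor([3, 17, 42])             # a toy sentence as word indices
sentence_vec = embedding(token_ids).mean(dim=0)   # average of the word embeddings
logits = classifier(sentence_vec)

# Both the embeddings and the classifier receive gradients, so the
# word vectors themselves keep improving during task training.
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
loss.backward()
```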
Concluding Remarks
◦ Knowledge-based representation
◦ Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning