Slide credit: Dr. Richard Socher
Classification Task
◦x: input object to be classified (an N-dim vector)
◦y: class/label (an M-dim vector)
◦target function f : R^N → R^M, mapping x to y
Assume both x and y can be represented as fixed-size vectors.
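A minimal sketch of this setup, assuming a simple linear classifier stands in for f (dimensions and names are illustrative, not from the slides):

```python
import numpy as np

# Illustrative dimensions: N-dim input vector, M classes.
N, M = 300, 2

rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))    # learned weight matrix
b = np.zeros(M)                # learned bias

def f(x):
    """Map an N-dim input x to a probability distribution over M classes."""
    scores = W @ x + b
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

x = rng.normal(size=N)   # e.g. a sentence encoded as a fixed-size vector
print(f(x))              # probabilities for the M classes (e.g. + / -)
```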
Learning Target Function
“這規格有誠意!” (“These specs show real effort!”) +
“太爛了吧~” (“This is just terrible!”) -
How do we represent the meaning of a word?
Meaning Representations
Definition of “Meaning”
◦the idea that is represented by a word, phrase, etc.
◦the idea that a person wants to express by using words, signs, etc.
◦the idea that is expressed in a work of writing, art, etc.
Goal: word representations that capture the relationships between words
Meaning Representations in Computers
◦ Knowledge-based representation
◦ Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Knowledge-based representation
Hypernym (is-a) relationships from WordNet
Issues:
▪ newly-invented words
▪ subjective
▪ annotation effort
▪ difficult to compute word similarity
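As a small illustration of this knowledge-based approach, a WordNet lookup via NLTK (assuming nltk is installed and the wordnet corpus has been downloaded; the words chosen are only examples):

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]               # first sense of "car"
print(car.definition())                  # dictionary-style gloss
print(car.hypernyms())                   # is-a parents (e.g. a motor-vehicle synset)
print(car.hypernym_paths()[0])           # one is-a path up to the root synset

motorcycle = wn.synsets('motorcycle')[0]
print(car.path_similarity(motorcycle))   # graph-based similarity score in (0, 1]
```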
Corpus-based representation
Atomic symbols: one-hot representation
car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = 0
Issue: difficult to compute word similarity (e.g. comparing “car” and “motorcycle” gives 0)
Idea: words with similar meanings often have similar neighbors
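A quick numerical check of the orthogonality issue above, with made-up vocabulary indices:

```python
import numpy as np

V = 10                         # toy vocabulary size
car_id, motorcycle_id = 6, 2   # made-up vocabulary indices

car = np.zeros(V);        car[car_id] = 1
motorcycle = np.zeros(V); motorcycle[motorcycle_id] = 1

# Distinct one-hot vectors are orthogonal, so their dot product
# (the natural similarity measure) is always 0.
print(car @ motorcycle)   # 0.0
```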
Corpus-based representation
Co-occurrence matrix
◦Neighbor definition: full document vs. window
  ▪ full document: a word-document co-occurrence matrix gives general topics (“Latent Semantic Analysis”)
  ▪ window: a context window around each word captures both syntactic (e.g. POS) and semantic information
Window-based Co-occurrence Matrix
Example
◦Window length=1
◦Left or right context
◦Corpus:
I love NTU.
I love deep learning.
I enjoy learning.
Counts   | I | love | enjoy | NTU | deep | learning
---------+---+------+-------+-----+------+---------
I        | 0 |  2   |   1   |  0  |  0   |    0
love     | 2 |  0   |   0   |  1  |  1   |    0
enjoy    | 1 |  0   |   0   |  0  |  0   |    1
NTU      | 0 |  1   |   0   |  0  |  0   |    0
deep     | 0 |  1   |   0   |  0  |  0   |    1
learning | 0 |  0   |   1   |  0  |  1   |    0
Words that appear in similar contexts (e.g. “NTU” and “deep”, both neighbors of “love”) now have similarity > 0
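A short sketch that builds this window-1 co-occurrence matrix from the toy corpus (variable names are illustrative):

```python
import numpy as np

corpus = ["I love NTU", "I love deep learning", "I enjoy learning"]
sentences = [s.split() for s in corpus]

vocab = ["I", "love", "enjoy", "NTU", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for words in sentences:
    for i, w in enumerate(words):
        # count every neighbor within `window` positions to the left or right
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

print(X)   # reproduces the counts table above
```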
Issues:
▪ matrix size increases with vocabulary
▪ high dimensional
▪ sparsity → poor robustness
Idea: low-dimensional word vectors
Low-Dimensional Dense Word Vector
Method 1: dimension reduction on the matrix
Singular Value Decomposition (SVD) of the co-occurrence matrix X: X ≈ U Σ V^T, keeping only the top-k singular values; the (scaled) rows of U then serve as low-dimensional word vectors
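A minimal sketch, reusing the matrix X and vocab from the co-occurrence sketch above; keeping the top-k singular values and scaling U's columns is one common LSA-style recipe, and k is chosen arbitrarily here:

```python
import numpy as np

# X: the 6x6 window-based co-occurrence matrix built earlier.
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                              # keep only the top-k singular values
word_vectors = U[:, :k] * S[:k]    # one k-dim dense vector per word

for w, vec in zip(vocab, word_vectors):
    print(f"{w:10s} {vec}")
```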
Issues:
▪ computationally expensive: O(mn²) for an n × m matrix when n < m
▪ difficult to add new words
Idea: directly learn low-dimensional word vectors
Low-Dimensional Dense Word Vector
Method 2: directly learn low-dimensional word vectors
◦Learning representations by back-propagating errors (Rumelhart et al., 1986)
◦A neural probabilistic language model (Bengio et al., 2003)
◦NLP (almost) from Scratch (Collobert et al., 2011)
◦Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
Word2Vec
Idea/Goal: predict the surrounding words within a context window (size = m) around each center word
Objective function: maximize the log probability of any context word given the current center word
Benefits: faster than SVD-based dimension reduction; a new sentence/document or a new vocabulary word is easy to incorporate
u: outside (context) word vector
v: center (target) word vector
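Written out, the usual skip-gram objective over a corpus of T words, with the softmax that defines the context-word probability, is:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\!\left(w_{t+j} \mid w_{t}\right),
\qquad
P(o \mid c) = \frac{\exp\!\left(u_{o}^{\top} v_{c}\right)}{\sum_{w=1}^{V} \exp\!\left(u_{w}^{\top} v_{c}\right)}
```

In practice these vectors are usually trained with an off-the-shelf implementation; a minimal sketch with gensim (assuming gensim ≥ 4 is installed; the corpus and hyperparameters are only illustrative):

```python
# Minimal skip-gram word2vec training sketch with gensim >= 4.
from gensim.models import Word2Vec

sentences = [["I", "love", "NTU"],
             ["I", "love", "deep", "learning"],
             ["I", "enjoy", "learning"]]

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)   # sg=1 selects skip-gram

print(model.wv["NTU"])                 # 50-dim dense vector for "NTU"
print(model.wv.most_similar("deep"))   # nearest neighbors by cosine similarity
```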
Major Advantages of Word Embeddings
Any information can be propagated into them via neural networks
◦(deep-)learned word embeddings form the basis for all language-related tasks
◦the networks, R and Ws, can be updated during model training
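A small sketch of this idea in PyTorch (sizes and names are made up, and the embedding/classifier pair only loosely plays the roles of R and Ws): pre-trained word vectors are loaded into an embedding layer and then updated together with the task network.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, num_classes = 10000, 300, 2
pretrained = torch.randn(vocab_size, emb_dim)   # stand-in for word2vec/GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # roughly "R"
classifier = nn.Linear(emb_dim, num_classes)                        # roughly "Ws"

token_ids = torch.tensor([3, 17, 42])             # a toy sentence as word indices
sentence_vec = embedding(token_ids).mean(dim=0)   # average of the word embeddings
logits = classifier(sentence_vec)

# Both the embeddings and the classifier receive gradients, so the
# word vectors themselves keep improving during task training.
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
loss.backward()
```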
Concluding Remarks
◦ Knowledge-based representation
◦ Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ◦ High-dimensional sparse word vector
    ◦ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning