
Slide credit from Dr. Richard Socher


(1)
(2)

Review

(3)

Meaning Representations in Computers

Knowledge-based representation
Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ▪ High-dimensional sparse word vector
    ▪ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning

(4)

Meaning Representations in Computers

Knowledge-based representation
Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ▪ High-dimensional sparse word vector
    ▪ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning

(5)

Corpus-based representation

Atomic symbols: one-hot representation

car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = [0 0 0 0 0 0 0 0 0 … 0] → similarity = 0

Issue: difficult to compute word similarity (e.g. comparing "car" and "motorcycle"), since any two distinct one-hot vectors have zero overlap

Idea: words with similar meanings often have similar neighbors
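A minimal sketch of the zero-similarity problem with one-hot vectors (the vocabulary indices are made up for illustration):

```python
# One-hot vectors of two different words are orthogonal, so their similarity is always 0.
import numpy as np

vocab_size = 10
car, motorcycle = 6, 2          # hypothetical vocabulary indices

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

print(one_hot(car, vocab_size) @ one_hot(motorcycle, vocab_size))  # 0.0
```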

(6)

Meaning Representations in Computers

Knowledge-based representation
Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ▪ High-dimensional sparse word vector
    ▪ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning

(7)

Window-based Co-occurrence Matrix

Example: window length = 1, left or right context

Corpus:
I love NTU.
I love deep learning.
I enjoy learning.

Counts     I    love  enjoy  NTU  deep  learning
I          0    2     1      0    0     0
love       2    0     0      1    1     0
enjoy      1    0     0      0    0     1
NTU        0    1     0      0    0     0
deep       0    1     0      0    0     1
learning   0    0     1      0    1     0

With these co-occurrence vectors, words with similar neighbors now have similarity > 0 (unlike one-hot vectors).

Issues:

▪ matrix size increases with vocabulary

▪ high dimensional

▪ sparsity → poor robustness

Idea: low-dimensional word vector
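A minimal sketch that builds this window-based co-occurrence matrix for the toy corpus above (window length 1, punctuation dropped):

```python
# Count left/right neighbors (window length = 1) for each word in the toy corpus.
import numpy as np

corpus = ["i love ntu", "i love deep learning", "i enjoy learning"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):              # left and right neighbor positions
            if 0 <= j < len(s):
                X[idx[w], idx[s[j]]] += 1

print(vocab)
print(X)
```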

(8)

Meaning Representations in Computers

Knowledge-based representation
Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ▪ High-dimensional sparse word vector
    ▪ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning

(9)

Low-Dimensional Dense Word Vector

Method 1: dimension reduction on the matrix

Singular Value Decomposition (SVD) of co-occurrence matrix X

Approximate X with its rank-k SVD, X ≈ U_k Σ_k V_kᵀ, keeping only the k largest singular values

(10)

Low-Dimensional Dense Word Vector

Method 1: dimension reduction on the matrix

Singular Value Decomposition (SVD) of co-occurrence matrix X

Issues:

▪ computationally expensive: O(mn²) for an n × m matrix when n < m

▪ difficult to add new words

Idea: directly learn low-dimensional word vectors
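A minimal sketch of Method 1, applying a truncated SVD to the toy co-occurrence matrix from the earlier slide (k is the target dimensionality):

```python
# Reduce the co-occurrence matrix to k-dimensional dense word vectors via SVD.
import numpy as np

# Co-occurrence matrix from the slide (rows/cols: I, love, enjoy, NTU, deep, learning)
X = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
], dtype=float)

k = 2                                              # target dimensionality
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]                    # one dense k-dim vector per word
print(word_vectors)
```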

(11)

Word Representation

Knowledge-based representation
Corpus-based representation
  ▪ Atomic symbol
  ▪ Neighbors
    ▪ High-dimensional sparse word vector
    ▪ Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning → word embedding

(12)

Word Embedding

Method 2: directly learn low-dimensional word vectors

Learning representations by back-propagation. (Rumelhart et al., 1986)

A neural probabilistic language model (Bengio et al., 2003)

NLP (almost) from Scratch (Collobert & Weston, 2008)

Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)

(13)

Word Embedding Benefit

Given an unlabeled training corpus, produce a vector for each word that encodes its semantic information. These vectors are useful because:

▪ semantic similarity between two words can be computed as the cosine similarity of their word vectors

▪ word vectors serve as powerful features for various supervised NLP tasks, since they contain semantic information

▪ any information can be propagated into them via neural networks and updated during training
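A minimal sketch of the cosine-similarity use case (the vectors here are made up; in practice they come from a trained embedding matrix):

```python
# Cosine similarity between word vectors as a proxy for semantic similarity.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

car        = np.array([0.8, 0.1, 0.3])   # hypothetical word vectors
motorcycle = np.array([0.7, 0.2, 0.4])
banana     = np.array([-0.5, 0.9, 0.0])

print(cosine(car, motorcycle))   # higher: related meanings
print(cosine(car, banana))       # lower: unrelated meanings
```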

(14)

Word2Vec Skip-Gram

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.

Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.

(15)

Word2Vec – Skip-Gram Model

Goal: predict surrounding words within a window of each word

Objective function: maximize the probability of any context word (within the context window) given the current target word, computed from the target word vector and the outside word vectors (see the sketch below)

Benefit: faster; easy to incorporate a new sentence/document or add a word to the vocabulary
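A hedged sketch of the objective as usually written for skip-gram (Mikolov et al., 2013), with window size m, target (center) word vectors v and outside (context) word vectors u assumed as notation:

```latex
% Maximize the (log-)probability of context words around each target word
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
```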

(16)

Word2Vec Skip-Gram Illustration

Goal: predict surrounding words within a window of each word

[Figure: skip-gram network — one-hot input x → hidden layer h → output scores s]

(17)

Hidden Layer Weight Matrix

→ Word Embedding Matrix

(18)

Weight Matrix Relation

Hidden layer weight matrix = word vector lookup

(19)

Weight Matrix Relation

Output layer weight matrix = weighted sum of the hidden representation gives the final score for each word; a softmax over these scores yields the probability of each word appearing within the context window
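A minimal runnable sketch of this forward pass (toy sizes; W and W′ here are random stand-ins for learned weights):

```python
# Skip-gram forward pass: one-hot lookup -> hidden vector -> scores -> softmax.
import numpy as np

V, N = 10, 4                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # hidden-layer weights = word embedding matrix
W_prime = rng.normal(size=(N, V))  # output-layer weights

target = 3                         # index of the target (center) word
h = W[target]                      # one-hot input times W is just a row lookup
s = h @ W_prime                    # score for every word in the vocabulary
p = np.exp(s - s.max()); p /= p.sum()   # softmax over the vocabulary
print(p)                           # probability of each word being a context word
```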

(20)

Word2Vec Skip-Gram Illustration

[Figure: skip-gram network with V = vocabulary size and N = embedding dimension — one-hot input x (V-dim) → hidden layer h (N-dim) → output scores s (V-dim)]

(21)

Loss Function

Given a target word (wI)
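The loss expression did not survive extraction; below is a hedged reconstruction for a target word w_I with C context words, writing s_j for the output score of word j (as in the network sketch x → h → s):

```latex
% Hedged sketch: skip-gram loss for one training example,
% where j^*_c is the vocabulary index of the c-th actual context word
E = -\log p(w_{O,1}, \dots, w_{O,C} \mid w_I)
  = -\sum_{c=1}^{C} s_{j^*_c} + C \cdot \log \sum_{j'=1}^{V} \exp(s_{j'})
```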

(22)

SGD Update for W’

Given a target word (wI)

t_j = 1 when w_j is within the context window of w_I, and t_j = 0 otherwise

error term: the difference between the output probability and t_j
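The update equation itself is missing from the extraction; a hedged reconstruction in the notation commonly used for this derivation (Rong, “word2vec Parameter Learning Explained”), where y_j is the predicted probability of word j and e_j = y_j − t_j is the error term mentioned above:

```latex
% Hedged sketch: SGD update of the output vectors (columns of W'),
% with learning rate \eta and hidden representation h
e_j = y_j - t_j, \qquad
v'_{w_j} \leftarrow v'_{w_j} - \eta \, e_j \, h \quad \text{for } j = 1, \dots, V
```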

(23)

SGD Update for W

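Likewise, a hedged reconstruction of the corresponding update for W: only the row of W belonging to the target word w_I changes, driven by the errors propagated back through W′:

```latex
% Hedged sketch: SGD update of the input vector (row of W) for the target word
v_{w_I} \leftarrow v_{w_I} - \eta \sum_{j=1}^{V} e_j \, v'_{w_j}
```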

(24)

SGD Update

(25)

Hierarchical Softmax

Idea: compute the probability of each leaf node (word) as the product of the binary decisions along its path from the root of a tree

(26)

Negative Sampling

Idea: only update a sample of output vectors

(27)

Negative Sampling

Sampling methods

▪ Random sampling

▪ Distribution sampling: w_j is sampled from P(w)

What is a good P(w)?

Empirical setting: unigram distribution raised to the power of 3/4

Idea: less frequent words are sampled more often (relative to their raw frequency)

Word           Probability weight to be sampled as a negative
is             0.9^(3/4)  = 0.92
constitution   0.09^(3/4) = 0.16
bombastic      0.01^(3/4) = 0.032
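A minimal sketch of sampling negatives from the unigram distribution raised to the 3/4 power (the unigram probabilities are the illustrative ones from the table above):

```python
# Negative-sampling distribution: unigram probability raised to 3/4, then normalized.
import numpy as np

counts = {"is": 0.90, "constitution": 0.09, "bombastic": 0.01}   # unigram probabilities
words = list(counts)
weights = np.array([counts[w] ** 0.75 for w in words])           # raise to the 3/4 power
p = weights / weights.sum()                                      # normalize to a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p)                       # draw 5 negative samples
print(dict(zip(words, p.round(3))), negatives)
```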

(28)

Word2Vec Skip-Gram Visualization

https://ronxin.github.io/wevi/

Skip-gram training data:

apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water

(29)

Word2Vec Variants

Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)

CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)

LM (language modeling): predicting the next word given the preceding context (Mikolov+, 2013)

Practice the derivation by yourself!!

(30)

Word2Vec CBOW

Goal: predicting the target word given the surrounding words

(31)

Word2Vec LM

Goal: predicting the next word given the preceding context

(32)

Comparison

Count-based
  Example: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
  Pros:
    ▪ fast training
    ▪ efficient usage of statistics
  Cons:
    ▪ primarily used to capture word similarity
    ▪ disproportionate importance given to large counts

Direct prediction
  Example: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
  Pros:
    ▪ generate improved performance on other tasks
    ▪ capture complex patterns beyond word similarity
  Cons:
    ▪ benefits mainly from a large corpus

(33)

GloVe

Pennington et al., “GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

(34)

GloVe

Idea: the ratio of co-occurrence probabilities can encode meaning

P_ij is the probability that word w_j appears in the context of word w_i; the ratio P(x | w_i) / P(x | w_j) characterizes the relationship between the words w_i and w_j

                            x = solid      x = gas        x = water      x = fashion
P(x | ice)                  1.9 × 10^-4    6.6 × 10^-5    3.0 × 10^-3    1.7 × 10^-5
P(x | steam)                2.2 × 10^-5    7.8 × 10^-4    2.2 × 10^-3    1.8 × 10^-5
P(x | ice) / P(x | steam)   8.9            8.5 × 10^-2    1.36           0.96

                            x = solid      x = gas        x = water      x = random
P(x | ice)                  large          small          large          small
P(x | steam)                small          large          large          small
P(x | ice) / P(x | steam)   large          small          ~1             ~1

(35)

GloVe

The relationship between w_i and w_j is approximated by the ratio of their co-occurrence probabilities with various probe words w_k
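The objective is not shown in the extracted text; a hedged reconstruction of the GloVe weighted least-squares objective from Pennington et al. (2014), with word vectors w_i, context vectors w̃_j, biases b_i, b̃_j, and weighting function f:

```latex
% GloVe: fit log co-occurrence counts, down-weighting rare pairs via f(X_ij)
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```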

(36)

GloVe

(37)

Word Vector Evaluation

(38)

Intrinsic Evaluation – Word Analogies

Word linear relationship

Syntactic and Semantic example questions [link]
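A minimal sketch of answering an analogy a : b = c : ? by vector arithmetic; the embedding dictionary here is hypothetical, and a trained word2vec/GloVe model would supply the real vectors:

```python
# Solve a : b = c : ?  by finding the word whose vector is closest (cosine) to v_b - v_a + v_c.
import numpy as np

emb = {                                      # made-up embeddings for illustration
    "chicago": np.array([0.9, 0.1, 0.0]),
    "illinois": np.array([0.8, 0.6, 0.1]),
    "houston": np.array([0.7, 0.0, 0.3]),
    "texas": np.array([0.6, 0.5, 0.4]),
}

def analogy(a, b, c):
    query = emb[b] - emb[a] + emb[c]
    return max(
        (w for w in emb if w not in (a, b, c)),
        key=lambda w: emb[w] @ query / (np.linalg.norm(emb[w]) * np.linalg.norm(query)),
    )

print(analogy("chicago", "illinois", "houston"))  # ideally "texas"
```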

(39)

Intrinsic Evaluation – Word Analogies

Word linear relationship

Syntactic and Semantic example questions [link]

Issue: different cities may have the same name

city-in-state
Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts

Issue: can change with time

capital-country
Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan

(40)

Intrinsic Evaluation – Word Analogies

Word linear relationship

Syntactic and Semantic example questions [link]

superlative
bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest

past tense
dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid

(41)

Intrinsic Evaluation – Word Correlation

Comparing word-vector similarity with human-judged similarity scores

Human-judged word correlation [link]

Word 1      Word 2     Human-Judged Score
tiger       cat        7.35
tiger       tiger      10.00
book        paper      7.46
computer    internet   7.58
plane       car        5.77
professor   doctor     6.62
stock       phone      1.62

Ambiguity: synonyms, or the same word with different POS tags
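A minimal sketch of the correlation evaluation: Spearman rank correlation between the human judgments above and hypothetical model cosine similarities for the same pairs:

```python
# Spearman rank correlation between model similarities and human scores.
from scipy.stats import spearmanr

human_scores = [7.35, 10.00, 7.46, 7.58, 5.77, 6.62, 1.62]   # from the table above
model_scores = [0.62, 1.00, 0.58, 0.66, 0.41, 0.49, 0.08]    # hypothetical cosine similarities

rho, p_value = spearmanr(human_scores, model_scores)
print(rho)   # closer to 1.0 means the embedding matches human judgments better
```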

(42)

Extrinsic Evaluation – Subsequent Task

Goal: use word vectors in neural net models built for subsequent tasks

Benefit

▪ ability to also classify words accurately
Ex. countries cluster together, so classifying location words should be possible with word vectors

▪ ability to incorporate any information into them for other tasks
Ex. project sentiment into the words to find the most positive/negative words in a corpus

(43)

Softmax & Cross-Entropy

(44)

Revisit Word Embedding Training

Goal: estimating vector representations s.t.

Softmax classification on x to obtain the probability for class y

Definition

(45)

Softmax Classification

Softmax classification on x to obtain the probability for class y

Definition

usually C > 2

(multi-class classification)

[Figure: softmax classifier — a C × d weight matrix W maps the d-dimensional input x (x1, x2, x3, x4) to C class outputs (y1, y2, y3)]
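The definition itself is missing from the extraction; a hedged reconstruction of the standard softmax classifier, with W_c denoting row c of the C × d weight matrix W:

```latex
% Softmax: probability of class c for input x
p(y = c \mid x) = \frac{\exp(W_c \, x)}{\sum_{j=1}^{C} \exp(W_j \, x)}
```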

(46)

Loss of Softmax

Objective function / loss function: the negative log probability of the correct class

If the correct answer already has the largest input to the softmax, then the first term and the second term will roughly cancel, so the correct sample contributes little to the overall cost, which is instead dominated by the incorrectly classified samples
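A hedged reconstruction of the two terms referred to above, writing f_c = W_c x for the input to the softmax for class c and y for the correct class:

```latex
% Negative log probability of the correct class:
% "first term" -f_y, "second term" the log-sum-exp over all classes
J = -\log \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)} = -f_y + \log \sum_{c=1}^{C} \exp(f_c)
```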

(47)

Cross Entropy Loss

Cross entropy of target and predicted probability distribution

Definition

Re-written as the entropy and Kullback-Leibler divergence

KL divergence is not a distance but a non-symmetric measure of the difference between p and q

p: target one-hot vector

q: predicted probability distribution

Because p is a one-hot target vector, the cross-entropy loss is exactly the loss for softmax (only the correct class contributes to the sum)
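The definitions are missing from the extraction; hedged reconstructions of the cross entropy, its entropy/KL decomposition, and the one-hot special case:

```latex
% Cross entropy of target p and prediction q, and its decomposition
H(p, q) = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)
% With a one-hot target p (correct class y), cross entropy reduces to the softmax loss
H(p, q) = -\log q_y
```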

(48)

Concluding Remarks

Low-dimensional word vector
▪ word2vec: skip-gram, CBOW, LM
▪ GloVe: combining count-based and direct learning

Word vector evaluation
▪ Intrinsic: word analogy, word correlation
▪ Extrinsic: subsequent task
