
# Slide credit from Dr. Richard Socher


(1)

Slide credit from Dr. Richard Socher

(2)

x: input object to be classified

y: class/label

$f: x \to y$, with $f: \mathbb{R}^N \to \mathbb{R}^M$

Assume both x and y can be represented as fixed-size vectors: x is an N-dim vector and y is an M-dim vector.

### Learning Target Function

“These specs are really generous!” +

“This is just terrible~” -

How do we represent the meaning of the word?

(3)

### Meaning Representations

Definition of “Meaning”

◦ the idea that is represented by a word, phrase, etc.

◦ the idea that a person wants to express by using words, signs, etc.

◦ the idea that is expressed in a work of writing, art, etc.


Goal: word representations that capture the relationships between words

(4)

### Meaning Representations in Computers

Knowledge-based representation

Corpus-based representation

Atomic symbol

Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

Method 1 – dimension reduction

Method 2 – direct learning


(6)

### Knowledge-based representation

Hypernym (is-a) relationships in WordNet

Issues:

▪ newly-invented words

▪ subjective

▪ annotation effort

▪ difficult to compute word similarity
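A toy example can make the is-a hierarchy and its limits concrete. The sketch below does not use WordNet itself: the `HYPERNYM` dictionary and all of its entries are hypothetical stand-ins, and counting is-a edges is just one simple way to score relatedness.

```python
# Toy is-a hierarchy. Hypothetical data for illustration, NOT taken from WordNet.
HYPERNYM = {
    "car": "motor_vehicle",
    "motorcycle": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
    "panda": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def hypernym_chain(word):
    """Return [word, parent, grandparent, ..., root] by following is-a links."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_distance(a, b):
    """Count is-a edges between a and b via their lowest common ancestor.

    Returns None when the two words share no ancestor, which is what happens
    for any word the hand-built hierarchy does not cover.
    """
    chain_a = hypernym_chain(a)
    chain_b = hypernym_chain(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return steps_a + chain_b.index(node)
    return None


print(hypernym_chain("car"))
print(path_distance("car", "motorcycle"))  # close in the hierarchy
print(path_distance("car", "panda"))       # far apart
print(path_distance("selfie", "car"))      # newly-invented word: not covered
```

The last call shows the "newly-invented words" issue: a word missing from the hand-annotated hierarchy cannot be compared with anything.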


(8)

### Corpus-based representation

Atomic symbols: one-hot representation, e.g. [0 0 1 0 0 0 0 0 0 … 0]

Issues: difficult to compute the similarity (e.g. comparing “car” and “motorcycle”)

car AND motorcycle = 0, because the one-hot vectors of two different words never share a nonzero position

Idea: words with similar meanings often have similar neighbors
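The similarity problem with one-hot vectors can be checked in a few lines. This is a minimal sketch; the five-word `VOCAB` is a made-up vocabulary, and the dot product is used as the similarity measure.

```python
# Hypothetical toy vocabulary; each word's index determines its one-hot vector.
VOCAB = ["hotel", "motel", "car", "motorcycle", "walk"]


def one_hot(word, vocab=VOCAB):
    """Return a list with a single 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


car = one_hot("car")
motorcycle = one_hot("motorcycle")

print(car)
print(motorcycle)
print(dot(car, motorcycle))  # different words never overlap
print(dot(car, car))         # a word only matches itself
```

Every pair of distinct words scores the same, so the representation carries no information about which words are related.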

(9)

### Corpus-based representation

Co-occurrence matrix

Neighbor definition: full document vs. windows

full document

word-document co-occurrence matrix gives general topics

→ “Latent Semantic Analysis”

windows

context window for each word

→ captures syntactic (e.g. POS) and semantic information


(11)

### Window-based Co-occurrence Matrix

Example

Window length=1

Left or right context

Corpus:


I love NTU.

I love deep learning.

I enjoy learning.

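| Counts | I | love | enjoy | NTU | deep | learning |
|---|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 | 0 |
| love | 2 | 0 | 0 | 1 | 1 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 1 |
| NTU | 0 | 1 | 0 | 0 | 0 | 0 |
| deep | 0 | 1 | 0 | 0 | 0 | 1 |
| learning | 0 | 0 | 1 | 0 | 1 | 0 |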

similarity > 0 (e.g. the rows for “love” and “enjoy” both count “I” as a neighbor)

Issues:

▪ matrix size increases with vocabulary

▪ high dimensional

▪ sparsity → poor robustness

Idea: low-dimensional word vectors
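The counts in this example can be reproduced with a short script. The sketch below assumes a simple whitespace tokenizer that drops the final period; the function names are illustrative, not from the lecture.

```python
CORPUS = ["I love NTU.", "I love deep learning.", "I enjoy learning."]
VOCAB = ["I", "love", "enjoy", "NTU", "deep", "learning"]


def tokenize(sentence):
    """Split on whitespace after dropping the trailing period (an assumption)."""
    return sentence.rstrip(".").split()


def cooccurrence(corpus, vocab, window=1):
    """Count how often each word appears within `window` positions of another,
    looking at both the left and the right context."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for sentence in corpus:
        tokens = tokenize(sentence)
        for pos, center in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    counts[index[center]][index[tokens[ctx]]] += 1
    return counts


def dot(u, v):
    """Dot product, used here as a crude similarity between two count rows."""
    return sum(a * b for a, b in zip(u, v))


X = cooccurrence(CORPUS, VOCAB, window=1)
for word, row in zip(VOCAB, X):
    print(f"{word:>8} {row}")

# "love" and "enjoy" share the neighbor "I", so their rows overlap.
print(dot(X[VOCAB.index("love")], X[VOCAB.index("enjoy")]))
```

Unlike one-hot vectors, two different words can now have a nonzero similarity, but each row still has one entry per vocabulary word.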

(12)

### Low-Dimensional Dense Word Vector

Method 1: dimension reduction on the matrix

Singular Value Decomposition (SVD) of co-occurrence matrix X

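Keeping only the k largest singular values gives a rank-k approximation $\hat{X} = U_k \Sigma_k V_k^\top \approx X$.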
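As a rough illustration of Method 1, the sketch below applies a truncated SVD to the co-occurrence counts from the window-length-1 example. It assumes NumPy is available; the choice `k = 2` and the decision to scale the left singular vectors by the singular values are illustrative choices, not something the slides specify.

```python
import numpy as np

# Co-occurrence counts from the window-length-1 example corpus.
WORDS = ["I", "love", "enjoy", "NTU", "deep", "learning"]
X = np.array(
    [
        [0, 2, 1, 0, 0, 0],
        [2, 0, 0, 1, 1, 0],
        [1, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0],
    ],
    dtype=float,
)


def truncated_svd(matrix, k):
    """Return the top-k factors (U_k, s_k, Vt_k) of the SVD of `matrix`."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


k = 2  # illustrative target dimension
U_k, s_k, Vt_k = truncated_svd(X, k)

# One dense k-dimensional vector per word (each column scaled by its singular value).
word_vectors = U_k * s_k

# Rank-k reconstruction of the original matrix.
X_hat = U_k @ np.diag(s_k) @ Vt_k

for word, vec in zip(WORDS, word_vectors):
    print(f"{word:>8} {np.round(vec, 2)}")
print("reconstruction error:", np.linalg.norm(X - X_hat))
```

Each word now has a 2-dimensional dense vector instead of a 6-dimensional sparse row; the reconstruction error shows how much of X is lost by dropping the smaller singular values.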


(14)

### Low-Dimensional Dense Word Vector

Method 1: dimension reduction on the matrix

Singular Value Decomposition (SVD) of co-occurrence matrix X

Issues:

▪ computationally expensive: $O(mn^2)$ when n < m for an n × m matrix

▪ difficult to add new words

Idea: directly learn low-dimensional word vectors


(16)

### Low-Dimensional Dense Word Vector

Method 2: directly learn low-dimensional word vectors

Learning representations by back-propagation. (Rumelhart et al., 1986)

A neural probabilistic language model (Bengio et al., 2003)

NLP (almost) from Scratch (Collobert & Weston, 2008)

Recent and most popular models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)

(17)

### Word2Vec

Benefit: faster; a new sentence/document or a new vocabulary word is easy to incorporate

Goal: predict surrounding words within a window of each word

Objective function: maximize the log probability of any context word given the current center word

Idea: predict surrounding words of each word

context window (size=m)
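The "predict surrounding words" idea can be made concrete by listing the (center word, context word) pairs a window produces. This is a minimal sketch; `skipgram_pairs` is an illustrative helper name, not part of word2vec's own code, and the example sentence is only for demonstration.

```python
def skipgram_pairs(tokens, m):
    """Return every (center, context) pair where the context word lies
    within m positions to the left or right of the center word."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "I love deep learning".split()
for center, context in skipgram_pairs(tokens, m=1):
    print(center, "->", context)
```

Each pair is one training example: given the center word, the model should assign high probability to the context word.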

(18)

### Word2Vec

Goal: predict surrounding words within a window of each word

Objective function: maximize the log probability of any context word given the current center word

context window (size=m)

u: outside word vector

v: center word vector

target word vector
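Written out, the objective described on this slide is usually stated as below. This is the standard skip-gram form, given as a sketch: T (the number of word positions in the corpus), V (the vocabulary size), and θ (all trainable vectors) are notational assumptions, while m, u, and v are as defined on the slide, with o indexing the outside word and c the center word.

```latex
% Average log probability of every context word within the window of size m
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}}
            \log p(w_{t+j} \mid w_t)

% Softmax over the vocabulary, using outside vectors u and center vectors v
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}
                   {\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```

Maximizing $J(\theta)$ pushes $u_o^{\top} v_c$ up for words that actually co-occur, so words with similar neighbors end up with similar vectors.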

(19)

### Major Advantages of Word Embeddings

Propagate any information into them via neural networks

form the basis for all language-related tasks


deep learned word embeddings

The networks, R and Ws, can be updated during model training

(20)

### Concluding Remarks

Knowledge-based representation

Corpus-based representation

Atomic symbol

Neighbors

High-dimensional sparse word vector

Low-dimensional dense word vector

Method 1 – dimension reduction

Method 2 – direct learning
