
(1)

NTUMIULAB

http://seamls.miulab.tw/

4th Coffee

www.4thcoffee.com

Yun-Nung (Vivian) Chen

http://vivianchen.idv.tw

Sequence Modeling & Embeddings

Southeast Asia Machine Learning School 2019 http://seamls.miulab.tw/

(2)

NTUMIULAB

http://seamls.miulab.tw/

4th Coffee

www.4thcoffee.com

• Word Representation Basics

• Word Embeddings

• Recurrent Neural Network

Yun-Nung (Vivian) Chen

http://vivianchen.idv.tw

Sequence Modeling & Embeddings

(3)

NTUMIULAB

http://seamls.miulab.tw/

4th Coffee

www.4thcoffee.com

• Word Representation Basics

• Word Embeddings

• Recurrent Neural Network

Yun-Nung (Vivian) Chen

http://vivianchen.idv.tw

Sequence Modeling & Embeddings

(4)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Classification Task

x: input object to be classified

y: class/label

Learning Target Function

y = f(x), f : R^N → R^M

Assume both x and y can be represented as fixed-size vectors:

x → an N-dim vector

y → an M-dim vector

“This is awesome!” +

“It sucks.” -

How do we represent the meaning of the word?

(5)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Definition of “Meaning”

• the idea that is represented by a word, phrase, etc.

• the idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

Meaning Representations

Goal: word representations that capture the relationships between words

(6)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(7)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(8)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Hypernyms (is-a) relationships of WordNet

Knowledge-Based Representation

Issues:

▪ newly-invented words

▪ subjective

▪ annotation effort

▪ difficult to compute word similarity

(9)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(10)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Atomic symbols: one-hot representation

Corpus-Based Representation

car        = [0 0 0 0 0 0 1 0 0 … 0]

motorcycle = [0 0 1 0 0 0 0 0 0 … 0]

car AND motorcycle = 0

Idea: words with similar meanings often have similar neighbors

Issues: difficult to compute the similarity (e.g., comparing "car" and "motorcycle")
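A minimal sketch of this issue (the toy vocabulary and NumPy usage are illustrative assumptions, not part of the slides): with one-hot vectors, every pair of distinct words has dot product 0, so "car" looks no more similar to "motorcycle" than to any other word.

import numpy as np

vocab = ["the", "a", "car", "motorcycle", "road"]   # hypothetical toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

car, motorcycle = one_hot("car"), one_hot("motorcycle")
print(np.dot(car, motorcycle))   # 0.0 -- atomic symbols carry no notion of similarity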

(11)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Co-occurrence matrix

• Neighbor definition: full document vs. windows

• full document: word-document co-occurrence matrix gives general topics

→ “Latent Semantic Analysis”, “Latent Dirichlet Allocation”

• windows: context window for each word

→ capture syntactic (e.g. POS) and semantic information

Corpus-Based Representation

(12)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(13)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Example

• Window length=1

• Left or right context

• Corpus:

Window-Based Co-occurrence Matrix

I love AI.

I love deep learning.

I enjoy learning.

Counts   | I | love | enjoy | AI | deep | learning
I        | 0 | 2    | 1     | 0  | 0    | 0
love     | 2 | 0    | 0     | 1  | 1    | 0
enjoy    | 1 | 0    | 0     | 0  | 0    | 1
AI       | 0 | 1    | 0     | 0  | 0    | 0
deep     | 0 | 1    | 0     | 0  | 0    | 1
learning | 0 | 0    | 1     | 0  | 1    | 0

similarity > 0

Issues:

▪ matrix size increases with vocabulary

▪ high dimensional

▪ sparsity

Idea: low-dimensional dense word vector
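A small sketch of how the window-1 co-occurrence matrix above could be computed (the whitespace tokenization and the NumPy layout are simplifying assumptions):

import numpy as np

corpus = ["I love AI", "I love deep learning", "I enjoy learning"]
vocab = ["I", "love", "enjoy", "AI", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):          # window length = 1, left and right context
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

print(X)   # reproduces the counts table above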

(14)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(15)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Method 1: dimension reduction on the matrix

• Singular Value Decomposition (SVD) of co-occurrence matrix X

Low-Dimensional Dense Word Vector

approximate
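A minimal sketch of this dimension-reduction step with NumPy's SVD, where X is the co-occurrence matrix from the previous slides and the number of retained dimensions k is an illustrative choice:

import numpy as np

# X: |V| x |V| window-based co-occurrence matrix
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                                # number of dimensions to keep (hypothetical)
word_vectors = U[:, :k] * S[:k]      # low-dimensional dense word vectors
X_approx = word_vectors @ Vt[:k, :]  # best rank-k approximation of X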

(16)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Method 1: dimension reduction on the matrix

• Singular Value Decomposition (SVD) of co-occurrence matrix X

Low-Dimensional Dense Word Vector

semantic relations

Rohde et al., “An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence,” 2005.

syntactic relations

Issues:

computationally expensive; difficult to add new words

Idea: directly learn low-dimensional word vectors

(17)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Meaning Representations in Computers

(18)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Method 2: directly learn low-dimensional word vectors

• Learning representations by back-propagation. (Rumelhart et al., 1986)

• A neural probabilistic language model (Bengio et al., 2003)

• NLP (almost) from Scratch (Collobert & Weston, 2008)

• Widely-used models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)

• Also known as "Word Embeddings"

Low-Dimensional Dense Word Vector

(19)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Propagate any information into them via neural networks

• form the basis for all language-related tasks

Major Advantages of Word Embeddings

deep learned word embeddings

The networks, R and Ws, can be updated during model training

(20)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Knowledge-based representation

• Corpus-based representation

✓Atomic symbol

✓Neighbors

• High-dimensional sparse word vector

• Low-dimensional dense word vector

▪ Method 1 – dimension reduction

▪ Method 2 – direct learning

Concluding Remarks

(21)

NTUMIULAB

http://seamls.miulab.tw/

4th Coffee

www.4thcoffee.com

• Word Representation Basics

• Word Embeddings

• Recurrent Neural Network

Yun-Nung (Vivian) Chen

http://vivianchen.idv.tw

Sequence Modeling & Embeddings

(22)

NTUMIULAB

http://seamls.miulab.tw/

Word2Vec Skip-Gram

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.

Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.

(23)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: predict surrounding words within a window of each word

• Objective function: maximize the probability of any context word given the current center word

Word2Vec – Skip-Gram Model

context window

outside target word

target word vector

Benefit: faster; can easily incorporate a new sentence/document or add a word to the vocabulary
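A hedged reconstruction of the objective described above, in the notation of the Mikolov et al. papers cited earlier (corpus length T, window size m, target vectors v, context vectors v'):

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{|V|} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}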

(24)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: predict surrounding words within a window of each word

Word2Vec Skip-Gram Illustration

[Figure: skip-gram network — input x is a V-dim one-hot vector, hidden layer h is N-dim, output scores s are V-dim]

(25)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Hidden Layer Matrix → Word Embedding Matrix

(26)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Hidden layer weight matrix = word vector lookup

Weight Matrix Relation

The 300-dim feature representation has the ability to predict the contexts

(27)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Output layer weight matrix = weighted sum as final score

Weight Matrix Relation

within the context window

softmax

Each vocabulary entry has two vectors: as a target word and as a context word

(28)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Word2Vec Skip-Gram Illustration


(29)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Given a target word (wI)

Loss Function

(30)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Given a target word (wI)

SGD Update for W’

The target label equals 1 when w_{j,c} is within the context window, and 0 otherwise

error term


(31)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

SGD Update for W


(32)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

SGD Update

large vocabularies or large training corpora → expensive computations

limit the number of output vectors that must be updated per training instance

(33)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Idea: only update a sample of output vectors

Negative Sampling

Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
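For reference, the negative-sampling objective from the cited Mikolov et al. (2013) paper: for each (target, context) pair, only the observed context vector and K sampled "negative" vectors drawn from a noise distribution P_n(w) are updated:

\log \sigma\left( {v'_{w_O}}^{\top} v_{w_I} \right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma\left( -{v'_{w_k}}^{\top} v_{w_I} \right) \right]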

(34)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Skip-gram training data:

apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water

Word2Vec Skip-Gram Visualization

https://ronxin.github.io/wevi/

(35)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)

• CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)

• LM (Language modeling): predicting the next words given the preceding contexts (Mikolov+, 2013)

Word2Vec Variants

Practice the derivation by yourself!

Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.

Mikolov et al., "Linguistic regularities in continuous space word representations," in NAACL HLT, 2013.

(36)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: predicting the target word given the surrounding words

Word2Vec CBOW

(37)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: predicting the next words given the preceding contexts

Word2Vec LM

(38)

NTUMIULAB

http://seamls.miulab.tw/

Comparison

• Count-based

• Example

• LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)

• Pros

✓ Fast training

✓ Efficient usage of statistics

• Cons

✓ Primarily used to capture word similarity

✓ Disproportionate importance given to large counts

• Direct prediction

• Example

• NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)

• Pros

✓ Generate improved performance on other tasks

✓ Capture complex patterns beyond word similarity

• Cons

✓ Benefits mainly from large corpus

✓ Inefficient usage of statistics

Combining the benefits from both worlds → GloVe

(39)

NTUMIULAB

http://seamls.miulab.tw/

GloVe

Pennington et al., ”GloVe: Global Vectors for Word Representation,” in EMNLP, 2014.

(40)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Idea: ratio of co-occurrence probability can encode meaning

• Pij is the probability that word wj appears in the context of word wi

• Relationship between the words wi and wj

GloVe

                      | x = solid  | x = gas    | x = water  | x = fashion
P(x|ice)              | 1.9 × 10⁻⁴ | 6.6 × 10⁻⁵ | 3.0 × 10⁻³ | 1.7 × 10⁻⁵
P(x|steam)            | 2.2 × 10⁻⁵ | 7.8 × 10⁻⁴ | 2.2 × 10⁻³ | 1.8 × 10⁻⁵
P(x|ice) / P(x|steam) | 8.9        | 8.5 × 10⁻² | 1.36       | 0.96

                      | x = solid | x = gas | x = water | x = random
P(x|ice)              | large     | small   | large     | small
P(x|steam)            | small     | large   | large     | small
P(x|ice) / P(x|steam) | large     | small   | ~ 1       | ~ 1

(41)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• The relationship of wi and wj approximates the ratio of their co-occurrence probabilities with various wk

GloVe

(42)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

GloVe

fast training, scalable, good performance even with small corpus, and small vectors
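For reference, the weighted least-squares objective from the cited Pennington et al. (2014) paper, with word vectors w_i, context vectors \tilde{w}_j, biases b_i, \tilde{b}_j, co-occurrence counts X_{ij}, and a weighting function f that caps the influence of very large counts:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2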

(43)

NTUMIULAB

http://seamls.miulab.tw/

Word Vector Evaluation

(44)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Word linear relationship

• Syntactic and Semantic example questions [link]

Intrinsic Evaluation – Word Analogies

Issue: what if the information is there but not linear
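A minimal sketch of how such analogy questions are typically scored with vector arithmetic and cosine similarity (the embeddings dictionary here is a hypothetical stand-in for trained word vectors):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """Answer "a : b = c : ?" by finding the word whose vector is closest to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# e.g. analogy("Chicago", "Illinois", "Houston", embeddings) is expected to return "Texas"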

(45)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Word linear relationship

• Syntactic and Semantic example questions [link]

Intrinsic Evaluation – Word Analogies

Issue: different cities may have the same name

city-in-state

Chicago : Illinois = Houston : Texas
Chicago : Illinois = Philadelphia : Pennsylvania
Chicago : Illinois = Phoenix : Arizona
Chicago : Illinois = Dallas : Texas
Chicago : Illinois = Jacksonville : Florida
Chicago : Illinois = Indianapolis : Indiana
Chicago : Illinois = Austin : Texas
Chicago : Illinois = Detroit : Michigan
Chicago : Illinois = Memphis : Tennessee
Chicago : Illinois = Boston : Massachusetts

Issue: capital cities can change over time

capital-country

Abuja : Nigeria = Accra : Ghana
Abuja : Nigeria = Algiers : Algeria
Abuja : Nigeria = Amman : Jordan
Abuja : Nigeria = Ankara : Turkey
Abuja : Nigeria = Antananarivo : Madagascar
Abuja : Nigeria = Apia : Samoa
Abuja : Nigeria = Ashgabat : Turkmenistan
Abuja : Nigeria = Asmara : Eritrea
Abuja : Nigeria = Astana : Kazakhstan

(46)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Word linear relationship

• Syntactic and Semantic example questions [link]

Intrinsic Evaluation – Word Analogies

superlative

bad : worst = big : biggest
bad : worst = bright : brightest
bad : worst = cold : coldest
bad : worst = cool : coolest
bad : worst = dark : darkest
bad : worst = easy : easiest
bad : worst = fast : fastest
bad : worst = good : best
bad : worst = great : greatest

past tense

dancing : danced = decreasing : decreased
dancing : danced = describing : described
dancing : danced = enhancing : enhanced
dancing : danced = falling : fell
dancing : danced = feeding : fed
dancing : danced = flying : flew
dancing : danced = generating : generated
dancing : danced = going : went
dancing : danced = hiding : hid
dancing : danced = hitting : hit

(47)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Comparing word correlation with human-judged scores

• Human-judged word correlation [link]

Intrinsic Evaluation – Word Correlation

Word 1 Word 2 Human-Judged Score

tiger cat 7.35

tiger tiger 10.00

book paper 7.46

computer internet 7.58

plane car 5.77

professor doctor 6.62

stock phone 1.62

Ambiguity: synonym or same word with different POSs
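A small sketch of this evaluation, assuming a dict of trained embeddings and using SciPy's Spearman rank correlation between model similarities and the human-judged scores:

import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation_score(pairs, embeddings):
    """pairs: list of (word1, word2, human_score), e.g. ("tiger", "cat", 7.35)."""
    human = [score for _, _, score in pairs]
    model = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
    return spearmanr(human, model).correlation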

(48)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: use word vectors in neural net models built for subsequent tasks

• Benefit

• Ability to also classify words accurately

• Ex: countries cluster together → classifying location words should be possible with word vectors

• Incorporate any information into them for other tasks

• Ex. project sentiment into words to find most positive/negative words in corpus

Extrinsic Evaluation – Subsequent Task

(49)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Low dimensional word vector

• word2vec

• GloVe: combining count-based and direct learning

• Word vector evaluation

• Intrinsic: word analogy, word correlation

• Extrinsic: subsequent task

Concluding Remarks

Skip-gram CBOW LM

(50)

NTUMIULAB

http://seamls.miulab.tw/

4th Coffee

www.4thcoffee.com

• Word Representation Basics

• Word Embeddings

• Recurrent Neural Network

Yun-Nung (Vivian) Chen

http://vivianchen.idv.tw

Sequence Modeling & Embeddings

(51)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(52)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(53)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: estimate the probability of a word sequence

• Example task

• determine whether a sequence is grammatical or makes more sense

Language Modeling

"recognize speech" or "wreck a nice beach"

Output = "recognize speech" if P(recognize speech) > P(wreck a nice beach)

(54)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(55)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Goal: estimate the probability of a word sequence

• N-gram language model

• Probability is conditioned on a window of (n-1) previous words

• Estimate the probability based on the training data

N-Gram Language Modeling

P(beach | nice) = C(nice beach) / C(nice)

where C(nice beach) is the count of "nice beach" in the training data and C(nice) is the count of "nice"

Issue: some sequences may not appear in the training data

(56)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Training data:

• The dog ran ……

• The cat jumped ……

N-Gram Language Modeling

P(jumped | dog) = 0, P(ran | cat) = 0

→ give them some small probability (e.g., 0.0001) → smoothing

➢ The probability is not accurate.

➢ The phenomenon happens because we cannot collect all the possible text in the world as training data.
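A minimal sketch of a bigram model with add-k smoothing, which assigns the small non-zero probabilities mentioned above to unseen pairs such as "dog jumped" (the value of k and the vocabulary size are hypothetical choices):

from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def prob(w2, w1, unigrams, bigrams, k=0.0001, vocab_size=10000):
    # add-k smoothing: unseen bigrams still receive a small probability
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

unigrams, bigrams = train_bigram(["the dog ran", "the cat jumped"])
print(prob("jumped", "dog", unigrams, bigrams))   # small but non-zero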

(57)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(58)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Idea: estimate not from count, but from the NN prediction

Neural Language Modeling

Neural Network: vector of "START" → P(next word is "wreck")

Neural Network: vector of "wreck" → P(next word is "a")

Neural Network: vector of "a" → P(next word is "nice")

Neural Network: vector of "nice" → P(next word is "beach")

P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice)

(59)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Neural Language Modeling

Bengio et al., “A Neural Probabilistic Language Model,” in JMLR, 2003.

input hidden

output

context vector

Probability distribution of the next word

(60)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• The input-layer (or hidden-layer) representations of related words are close

• If P(jump|dog) is large, P(jump|cat) increases accordingly (even if "… cat jump …" does not appear in the data)

Neural Language Modeling

[Figure: "dog", "cat", and "rabbit" lie close together in the (h1, h2) hidden space]

Smoothing is automatically done

Issue: fixed context window for conditioning

(61)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(62)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Idea: condition the neural network on all previous words and tie the weights at each time step

• Assumption: temporal information matters

Recurrent Neural Network

(63)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

RNN Language Modeling

vector of "START" → P(next word is "wreck")

vector of "wreck" → P(next word is "a")

vector of "a" → P(next word is "nice")

vector of "nice" → P(next word is "beach")

(input → hidden → output: the hidden state carries the context vector; the output is the word probability distribution)

Idea: pass the information from the previous hidden layer to leverage all contexts

(64)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(65)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• At each time step,

RNNLM Formulation

input: the vector of the current word → output: the probability of the next word (see the formulation sketched below)
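The exact equations on this slide did not survive extraction; a standard RNNLM step consistent with the s/o notation used later for BPTT (an assumption, with f = tanh or ReLU) is:

s_t = f\left( U x_t + W s_{t-1} \right), \qquad o_t = \mathrm{softmax}\left( V s_t \right)

where x_t is the vector of the current word and o_t is the probability distribution of the next word.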

(66)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(67)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Recurrent Neural Network Definition

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Activation function: tanh or ReLU
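A minimal NumPy sketch of this recurrence (parameter shapes and the softmax output layer are assumptions matching the RNNLM formulation above):

import numpy as np

def rnn_forward(xs, U, W, V, s0):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s, states, outputs = s0, [], []
    for x in xs:                              # one step per input vector
        s = np.tanh(U @ x + W @ s)            # new hidden state from input + previous state
        z = V @ s
        o = np.exp(z - z.max()); o /= o.sum() # softmax over the vocabulary
        states.append(s); outputs.append(o)
    return states, outputs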

(68)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• All model parameters can be updated by backpropagating the error between the predicted outputs and the targets y_{t-1}, y_t, y_{t+1}, …

Model Training

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

(69)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(70)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Backpropagation

[Figure: neuron i in Layer l connected to neuron j in Layer l−1 through weight w_ij]

Forward pass: compute the activations a_j^(l−1) layer by layer from the input x.

Backward pass: compute the error signals δ_i^l from the output layer back toward the input.

The gradient of each weight combines the two: ∂C/∂w_ij = a_j^(l−1) · δ_i^l

(71)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Backpropagation

Backward pass (error signal):

At the output layer L: δ_n^L = σ'(z_n^L) · ∂C/∂y_n

At an earlier layer l: δ^l = σ'(z^l) ⊙ (W^(l+1))^T δ^(l+1), i.e., the error signal is propagated backwards through (W^L)^T, (W^(L−1))^T, …, (W^(l+1))^T

(72)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Backpropagation through Time (BPTT)

Forward pass: compute s1, s2, s3, s4, … in time order.

Backward pass: compute the gradients of 𝐶(1), 𝐶(2), 𝐶(3), 𝐶(4), … by propagating errors back through the unrolled network.

[Figure: unrolled RNN with inputs x1…x4, hidden states s1…s4 (from an initial state), outputs o1…o4, targets y1…y4, and per-step costs 𝐶(1)…𝐶(4)]

(73)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(74)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• The gradient is a product of Jacobian matrices, each associated with a step in the forward computation

• Multiply the same matrix at each time step during backprop

RNN Training Issue

The gradient becomes very small or very large quickly

→ vanishing or exploding gradient

Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]

(75)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Rough Error Surface

[Figure: the cost surface plotted over two parameters w1 and w2]

Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]

The error surface is either very flat or very steep

(76)

NTUMIULAB

http://seamls.miulab.tw/

Possible Solutions

Recurrent Neural Network

(77)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Exploding Gradient: Clipping

[Figure: on the cost surface over w1 and w2, the clipped gradient takes a bounded step instead of jumping off a cliff]

Idea: control the gradient value to avoid exploding

Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
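A minimal sketch of norm-based clipping in the spirit of the cited Pascanu et al. paper (the threshold here is a placeholder to be tuned as described above):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient vector if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad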

(78)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• RNN models temporal sequence information

• can handle “long-term dependencies” in theory

Vanishing Gradient: Gating Mechanism

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradient

→ apply the gating mechanism to directly encode the long-distance information

“I grew up in France…

I speak fluent French.”

(79)

NTUMIULAB

http://seamls.miulab.tw/


Long Short-Term Memory

Addressing Vanishing Gradient Problem

(80)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• LSTMs are explicitly designed to avoid the long-term dependency problem

Long Short-Term Memory (LSTM)

Vanilla RNN

(81)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

(82)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

The cell state runs straight down the chain with only minor linear interactions

→ easy for information to flow along it unchanged

Gates are a way to optionally let information through

→ composed of a sigmoid and a pointwise multiplication operation

(83)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

forget gate (a sigmoid layer): decides what information we’re going to throw away from the cell state

• 1: “completely keep this”

• 0: “completely get rid of this”

(84)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

input gate (a sigmoid layer): decides what new information we’re going to store in the cell state

Vanilla RNN

(85)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

cell state update: forgets the things we decided to forget earlier and adds the new candidate values, scaled by how much we decided to update each state value

• ft: decides which to forget

• it: decide which to update

(86)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Long Short-Term Memory (LSTM)

Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]

LSTM

output gate (a sigmoid layer):

decides what new information we’re going to output
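Putting the pieces above together, the standard LSTM step (following the notation of the colah blog post referenced earlier, with [h_{t-1}, x_t] denoting concatenation and \odot element-wise multiplication) is:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)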

(87)

NTUMIULAB

http://seamls.miulab.tw/


Variants on LSTM

Addressing Vanishing Gradient Problem

(88)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

LSTM with Peephole Connections

Gers and Schmidhuber, "Recurrent nets that time and count," in IJCNN, 2000. [link]

LSTM with Peephole Connections

Idea: allow gate layers to look at the cell state

(89)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

LSTM with Coupled Forget/Input Gates

Gers and Schmidhuber, "Recurrent nets that time and count," in IJCNN, 2000. [link]

LSTM with Coupled Forget/Input Gates

Idea: instead of separately deciding what to forget and what we should add new information to, we make those decisions together

We only forget when we’re going to input something in its place, and vice versa.

(90)

NTUMIULAB

http://seamls.miulab.tw/


Gated Recurrent Unit

Addressing Vanishing Gradient Problem

(91)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Gated Recurrent Unit (GRU)

Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014. [link]

GRU

Idea: combine the forget and input gates into a single “update gate”; merge the cell state and hidden state

GRU is simpler and has fewer parameters than LSTM

update gate:

reset gate:

rt = 0: ignore the previous memory and only store the new word information
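For reference, the GRU update in the same notation (a standard presentation of the cited Cho et al. formulation):

z_t = \sigma(W_z [h_{t-1}, x_t])
r_t = \sigma(W_r [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t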

(92)

NTUMIULAB

http://seamls.miulab.tw/

Extension

Recurrent Neural Network

(93)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Bidirectional RNN

h = [h→ ; h←] (the concatenation of the forward and backward hidden states) represents (summarizes) the past and future around a single token

(94)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

Deep Bidirectional RNN

Each memory layer passes an intermediate representation to the next

(95)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(96)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• The learning algorithm f maps the input domain X into the output domain Y

• Input domain: word, word sequence, audio signal, click logs

• Output domain: single label, sequence tags, tree structure, probability distribution

How to Frame the Learning Problem?

f : X → Y

Network design should leverage input and output domain properties

(97)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(98)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Idea: aggregate the meaning from all words into a vector

• Method:

• Basic combination: average, sum

• Neural combination:

✓ Recursive neural network (RvNN)

✓ Recurrent neural network (RNN)

✓ Convolutional neural network (CNN)

Input Domain – Sequence Modeling

[Figure: how to compute a single N-dim vector from the word vectors of "this is good"]

(99)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Encode the sequential input into a vector using RNN

Sentiment Analysis

[Figure: an RNN reads "This is very good" (x1 … x4) and produces the final hidden state h4 as the sentence vector; a classifier then maps inputs x1 … xN to outputs y1 … yM]

RNN considers temporal information to learn sentence vectors as the input of classification tasks
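A minimal sketch of this use of the RNN (reusing the vanilla recurrence from the earlier definition slide; the classifier weights V_cls and the shapes are assumptions):

import numpy as np

def encode_and_classify(xs, U, W, V_cls, s0):
    """Encode the word vectors x1..xN with an RNN and classify from the final hidden state."""
    s = s0
    for x in xs:
        s = np.tanh(U @ x + W @ s)        # sentence vector accumulates temporal information
    z = V_cls @ s                         # map the sentence vector to class scores y1..yM
    p = np.exp(z - z.max()); p /= p.sum() # softmax, e.g. P(positive), P(negative)
    return p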

(100)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(101)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• POS Tagging

• Speech Recognition

• Machine Translation

Output Domain – Sequence Prediction

POS Tagging: "I like reading papers." → I/PN like/VBP reading/VBG papers/NNS

Speech Recognition: (speech signal) → "Hello"

Machine Translation: "How are you doing today?" → "你好嗎?"

The output can be viewed as a sequence of classification

(102)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline

(103)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Tag a word at each timestamp

• Input: word sequence

• Output: corresponding POS tag sequence

POS Tagging

I love you

N V N

(104)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Tag a word at each timestamp

• Input: word sequence

• Output: IOB-format slot tag and intent tag

Natural Language Understanding (NLU)

<START> just sent email to bob about fishing this weekend <END>

just/O sent/O email/O to/O bob/B-contact_name about/O fishing/B-subject this/I-subject weekend/I-subject

Intent: send_email

→ send_email(contact_name="bob", subject="fishing this weekend")

Temporal orders for input and output are the same

(105)

NTUMIULAB

http://seamls.miulab.tw/

http://seamls.miulab.tw/

• Language Modeling

• N-gram Language Model

• Feed-Forward Neural Language Model

• Recurrent Neural Network Language Model (RNNLM)

• Recurrent Neural Network

• Definition

• Training via Backpropagation through Time (BPTT)

• Training Issue

• Applications

• Sequential Input

• Sequential Output

• Aligned Sequential Pairs (Tagging)

• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Outline
