
(1)

Slide credit from Hung-Yi Lee & Richard Socher

(2)

Review

Word Vector

(3)

Word2Vec Variants

Skip-gram: predicting surrounding words given the target word

(Mikolov+, 2013)

CBOW (continuous bag-of-words): predicting the target word given the surrounding words

(Mikolov+, 2013)

LM (Language modeling): predicting the next words given the preceding contexts

(Mikolov+, 2013)

Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.

Mikolov et al., “Linguistic regularities in continuous space word representations,” in NAACL HLT, 2013.

(4)

Word2Vec LM

Goal: predicting the next words given the preceding contexts

(5)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

◦ Aligned Sequential Pairs (Tagging)

◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(6)

Outline

Language Modeling

◦Definition

◦Sequential Input

(7)

Language Modeling

Goal: estimate the probability of a word sequence

Example task: determine which candidate sequence is grammatical or makes more sense

Candidates: “recognize speech” or “wreck a nice beach”

Output = “recognize speech” if P(recognize speech) > P(wreck a nice beach)

(8)

Outline

Language Modeling

◦N-gram Language Model

◦Definition

◦Sequential Input

(9)

N-Gram Language Modeling

Goal: estimate the probability of a word sequence

N-gram language model

◦Probability is conditioned on a window of (n-1) previous words

◦Estimate the probability based on the training data

$P(\text{beach} \mid \text{nice}) = \frac{C(\text{nice beach})}{C(\text{nice})}$

C(nice beach): count of “nice beach” in the training data

C(nice): count of “nice” in the training data

Issue: some sequences may not appear in the training data
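A minimal sketch of the count-based bigram estimate, assuming a toy tokenized corpus; the function name `bigram_mle` and the sentences are made up for illustration. It also shows the issue: a bigram that never occurs gets probability zero.

```python
from collections import Counter

def bigram_mle(corpus):
    """Build p(word | prev) = C(prev word) / C(prev) from tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        # C(prev) here counts occurrences of prev that have a following word.
        for prev, word in zip(sentence, sentence[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Hypothetical training data.
corpus = [
    ["a", "nice", "beach"],
    ["a", "nice", "day"],
    ["wreck", "a", "nice", "beach"],
]
p = bigram_mle(corpus)
p_beach = p("beach", "nice")    # C(nice beach) / C(nice) = 2 / 3
p_speech = p("speech", "nice")  # unseen bigram, so the estimate is 0
```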

(10)

N-Gram Language Modeling

Training data:

◦The dog ran ……

◦The cat jumped ……

P(jumped | dog) = 0 and P(ran | cat) = 0, because neither bigram appears in the training data

→ Smoothing: give unseen sequences some small probability instead, e.g. P(jumped | dog) = 0.0001 and P(ran | cat) = 0.0001

→ The probability is not accurate.

→ This happens because we cannot collect all the possible text in the world as training data.
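One simple way to "give some small probability" is add-k smoothing. The slides do not name a specific scheme, so the choice of add-k, the value `k=0.01`, and the function name are assumptions for this sketch.

```python
from collections import Counter

def smoothed_bigram(corpus, k=0.01):
    """Add-k smoothed bigram model: every (prev, word) pair gets k extra counts."""
    vocab = sorted({w for sentence in corpus for w in sentence})
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        for prev, word in zip(sentence, sentence[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        # k * |V| in the denominator keeps the distribution normalized.
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))

    return prob, vocab

corpus = [["the", "dog", "ran"], ["the", "cat", "jumped"]]
p, vocab = smoothed_bigram(corpus)
p_unseen = p("jumped", "dog")  # small but non-zero
p_seen = p("ran", "dog")       # still much larger than the unseen case
total = sum(p(w, "dog") for w in vocab)
```

The smoothed value is arbitrary in the sense that it depends on `k`, not on any evidence about dogs jumping, which is the inaccuracy the slide points out.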

(11)

Outline

Language Modeling

◦Feed-Forward Neural Language Model

◦Definition

◦Sequential Input


(12)

Neural Language Modeling

Idea: estimate the probability not from counts, but from the NN prediction

[Figure: the same neural network is applied at each position. Given the vector of “START” it outputs P(next word is “wreck”); given the vector of “wreck”, P(next word is “a”); given the vector of “a”, P(next word is “nice”); given the vector of “nice”, P(next word is “beach”).]

P(“wreck a nice beach”) = P(wreck|START)P(a|wreck)P(nice|a)P(beach|nice)
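The product above can be sketched as follows, assuming some function `cond_prob(word, prev)` that returns the model's conditional probability. The numbers in `table` are invented for illustration, and log-space is used only to avoid underflow on long sequences.

```python
import math

def sequence_log_prob(words, cond_prob, start="START"):
    """Sum log P(w_i | w_{i-1}) over the sequence, starting from the START symbol."""
    total, prev = 0.0, start
    for word in words:
        total += math.log(cond_prob(word, prev))
        prev = word
    return total

# Hypothetical conditional probabilities, standing in for the network outputs.
table = {
    ("START", "wreck"): 0.1,
    ("wreck", "a"): 0.5,
    ("a", "nice"): 0.2,
    ("nice", "beach"): 0.4,
}

def cond_prob(word, prev):
    return table[(prev, word)]

log_p = sequence_log_prob("wreck a nice beach".split(), cond_prob)
p_sequence = math.exp(log_p)  # 0.1 * 0.5 * 0.2 * 0.4
```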

(13)

Neural Language Modeling

Bengio et al., “A Neural Probabilistic Language Model,” in JMLR, 2003.

Issue: fixed context window for conditioning

[Figure: input (context vector) → hidden → output (probability distribution of the next word)]
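A rough sketch of the forward pass of a fixed-window feed-forward language model, assuming concatenated word embeddings, one tanh hidden layer, and a softmax output. The class name, layer sizes, and random (untrained) weights are all assumptions; biases are omitted for brevity.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class FeedForwardLM:
    def __init__(self, vocab, window=2, dim=4, hidden=8, seed=0):
        rng = random.Random(seed)
        self.vocab, self.window = vocab, window
        self.index = {w: i for i, w in enumerate(vocab)}
        self.E = rand_matrix(len(vocab), dim, rng)      # word embeddings
        self.W = rand_matrix(hidden, window * dim, rng)  # input -> hidden
        self.V = rand_matrix(len(vocab), hidden, rng)    # hidden -> output

    def next_word_dist(self, context):
        # The context must contain exactly `window` words: the fixed-window issue.
        assert len(context) == self.window
        x = [v for w in context for v in self.E[self.index[w]]]
        h = [math.tanh(a) for a in matvec(self.W, x)]
        return softmax(matvec(self.V, h))

vocab = ["START", "wreck", "a", "nice", "beach"]
lm = FeedForwardLM(vocab)
dist = lm.next_word_dist(["a", "nice"])  # P(next word | "a nice") for every vocab entry
```

Any word further back than `window` positions cannot influence `dist`, which is what motivates the recurrent model.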

(14)

Neural Language Modeling

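The input-layer (or hidden-layer) representations of related words are close

◦If P(jump|dog) is large, P(jump|cat) increases accordingly (even if “… cat jump …” does not appear in the data)

[Figure: hidden-layer space with axes h₁ and h₂, in which “dog”, “cat”, and “rabbit” lie close together]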

Smoothing is automatically done

(15)

Outline

Language Modeling

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Sequential Input


(16)

Recurrent Neural Network

Idea: condition the neural network on all previous words and tie the weights at each time step

Assumption: temporal information matters

(17)

RNN Language Modeling

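[Figure: an RNN unrolled over the inputs “START”, “wreck”, “a”, “nice”. At each step the input is the context vector of the current word, the hidden layer also receives the previous hidden layer, and the output is the word probability distribution: P(next word is “wreck”), P(next word is “a”), P(next word is “nice”), P(next word is “beach”).]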

Idea: pass the information from the previous hidden layer to leverage all contexts

(18)

Outline

Language Modeling

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Sequential Input

(19)

RNNLM Formulation

At each time step, the input is the vector of the current word and the output is the probability of the next word.

(20)

Outline

Language Modeling

◦Definition

◦Sequential Input

(21)

Recurrent Neural Network Definition

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Activation function: tanh, ReLU
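The recurrence can be sketched as s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t). The matrix names U, W, V follow a common tutorial convention and are an assumption here, as are the toy weights; the same three matrices are reused at every time step.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def rnn_step(x, s_prev, U, W, V):
    """One time step: new state from current input and previous state, then output."""
    pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s_prev))]
    s = [math.tanh(p) for p in pre]
    o = softmax(matvec(V, s))
    return s, o

def rnn_forward(xs, s0, U, W, V):
    states, outputs, s = [], [], s0
    for x in xs:
        s, o = rnn_step(x, s, U, W, V)  # weights are tied across time steps
        states.append(s)
        outputs.append(o)
    return states, outputs

# Toy weights: 2-dim input, 2-dim state, 3 output classes.
U = [[0.5, -0.3], [0.1, 0.2]]
W = [[0.0, 0.4], [-0.2, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
states, outputs = rnn_forward(xs, [0.0, 0.0], U, W, V)
```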

(22)

Model Training

All model parameters can be updated by comparing, at each time step, the predicted output with the target (y_{t-1}, y_t, y_{t+1})
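One common way to score the mismatch between predicted and target outputs is the cross-entropy summed over time steps. The slides do not state which cost is used, so cross-entropy and the function name below are assumptions.

```python
import math

def sequence_cross_entropy(outputs, targets):
    """Sum over time of -log(probability assigned to the target class)."""
    return -sum(math.log(o[y]) for o, y in zip(outputs, targets))

# Hypothetical predicted distributions for two time steps and their target indices.
outputs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
loss = sequence_cross_entropy(outputs, targets)  # -(log 0.5 + log 0.75)
```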

(23)

Outline

Language Modeling

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Sequential Input


(24)

Backpropagation

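[Figure: weight $w_{ij}^{l}$ connects neuron j in layer l−1 to neuron i in layer l]

$\frac{\partial C}{\partial w_{ij}^{l}} = \frac{\partial z_i^{l}}{\partial w_{ij}^{l}} \frac{\partial C}{\partial z_i^{l}}$

Forward Pass: $\frac{\partial z_i^{l}}{\partial w_{ij}^{l}} = a_j^{l-1}$ for $l > 1$, and $x_j$ for $l = 1$

Backward Pass: error signal $\delta_i^{l} = \frac{\partial C}{\partial z_i^{l}}$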

(25)

Backpropagation

Backward Pass: compute the error signal $\delta^{l}$ layer by layer, starting from the output layer L

$\delta^{L} = \sigma'(z^{L}) \odot \nabla_y C$

$\delta^{l} = \sigma'(z^{l}) \odot (W^{l+1})^{T} \delta^{l+1}$

[Figure: $\nabla_y C = (\partial C/\partial y_1, \dots, \partial C/\partial y_n)$ enters at layer L and is multiplied by $\sigma'(z^{L})$, then by $(W^{L})^{T}$ and $\sigma'(z^{L-1})$, and so on down to layer l]
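A small sketch of the backward pass for a fully connected network, assuming sigmoid activations, a squared-error cost, and no biases (all three are choices made here, not taken from the slides). A finite-difference check on one weight is included as a sanity test.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Return the activations of every layer, with acts[0] = x."""
    acts = [x]
    for W in weights:
        z = [sum(w * a for w, a in zip(row, acts[-1])) for row in W]
        acts.append([sigmoid(v) for v in z])
    return acts

def cost(x, y, weights):
    out = forward(x, weights)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backprop(x, y, weights):
    acts = forward(x, weights)
    # Output layer: delta = sigma'(z) * dC/dy, with sigma'(z) = a * (1 - a).
    delta = [a * (1 - a) * (a - t) for a, t in zip(acts[-1], y)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # dC/dw_ij = (input j of this layer) * (error signal i of this layer).
        grads[l] = [[d * a for a in acts[l]] for d in delta]
        if l > 0:
            W = weights[l]
            # Propagate through W^T, then through sigma' of the layer below.
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(acts[l]))]
            delta = [a * (1 - a) * b for a, b in zip(acts[l], back)]
    return grads

weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]  # 2 -> 2 -> 1
x, y = [1.0, 2.0], [1.0]
grads = backprop(x, y, weights)

# Finite-difference check on one weight of the first layer.
eps = 1e-6
weights[0][1][0] += eps
cost_up = cost(x, y, weights)
weights[0][1][0] -= 2 * eps
cost_down = cost(x, y, weights)
weights[0][1][0] += eps
numeric = (cost_up - cost_down) / (2 * eps)
```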

(26)

Backpropagation through Time (BPTT)

Unfold

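◦Input: init, x₁, x₂, …, x_t

◦Output: o_t

◦Target: y_t

[Figure: the network unfolded through time from init; each step takes x_t and the previous state s_{t-1} and produces s_t and o_t. The error signals ∂C/∂y₁, ∂C/∂y₂, …, ∂C/∂y_n enter at the output and are propagated backward.]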



(29)

Backpropagation through Time (BPTT)

Unfold

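◦Input: init, x₁, x₂, …, x_t

◦Output: o_t

◦Target: y_t

[Figure: the network unfolded through time from init; each step takes x_t and the previous state s_{t-1} and produces s_t and o_t, and the error signal ∂C/∂y is propagated backward]

Weights are tied together: the corresponding weights in every unfolded copy point to the same memory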


(31)

BPTT

Forward Pass: Compute s₁, s₂, s₃, s₄, ……

Backward Pass: for 𝐶⁽¹⁾, for 𝐶⁽²⁾, for 𝐶⁽³⁾, for 𝐶⁽⁴⁾

[Figure: RNN unrolled over four steps from init, with inputs x₁ … x₄, states s₁ … s₄, outputs o₁ … o₄, targets y₁ … y₄, and costs 𝐶⁽¹⁾ … 𝐶⁽⁴⁾]
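A minimal BPTT sketch for a scalar RNN, assuming s_t = tanh(u·x_t + w·s_{t-1}), o_t = v·s_t and a squared-error cost per step. The scalar setting, the cost, and the parameter names u, w, v are simplifications chosen here. Because the weights are tied, each parameter's gradient is accumulated over all time steps.

```python
import math

def forward(xs, u, w, v, s0=0.0):
    states, s = [s0], s0
    for x in xs:
        s = math.tanh(u * x + w * s)
        states.append(s)           # states[t] is s_t, states[0] is init
    outputs = [v * s for s in states[1:]]
    return states, outputs

def total_cost(xs, ys, u, w, v):
    _, outs = forward(xs, u, w, v)
    return sum(0.5 * (o - y) ** 2 for o, y in zip(outs, ys))

def bptt(xs, ys, u, w, v):
    states, outs = forward(xs, u, w, v)
    du = dw = dv = 0.0
    ds_next = 0.0                              # gradient reaching s_t from later steps
    for t in range(len(xs), 0, -1):
        err = outs[t - 1] - ys[t - 1]          # dC^(t)/do_t
        dv += err * states[t]
        ds = err * v + ds_next                 # total gradient with respect to s_t
        dz = ds * (1.0 - states[t] ** 2)       # back through tanh
        du += dz * xs[t - 1]                   # tied weights: accumulate over time
        dw += dz * states[t - 1]
        ds_next = dz * w                       # pass the error signal to s_{t-1}
    return du, dw, dv

xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.2, -0.1, 0.4, 0.0]
u, w, v = 0.7, -0.4, 1.2
du, dw, dv = bptt(xs, ys, u, w, v)

# Finite-difference check for the recurrent weight w.
eps = 1e-6
numeric_dw = (total_cost(xs, ys, u, w + eps, v) - total_cost(xs, ys, u, w - eps, v)) / (2 * eps)
```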

(32)

Outline

Language Modeling

◦Definition

◦Training Issue

Applications

◦Sequential Input

(33)

RNN Training Issue

The gradient is a product of Jacobian matrices, each associated with a step in the forward computation

Multiply the same matrix at each time step during backprop

The gradient becomes very small or very large quickly

→ vanishing or exploding gradient

Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994.

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013.

(34)

Rough Error Surface

[Figure: cost plotted over the weights w₁ and w₂]

The error surface is either very flat or very steep

(35)

Vanishing/Exploding Gradient Example

[Figure: six panels, for 1 step, 2 steps, 5 steps, 10 steps, 20 steps, and 50 steps]
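The effect of multiplying by the same factor at every step can be illustrated with a scalar linear recurrence s_t = w·s_{t-1}, whose gradient through n steps is wⁿ. This is a deliberate simplification of the Jacobian product (no nonlinearity, one dimension), and the values 0.9 and 1.1 are arbitrary.

```python
def gradient_through_time(w, steps):
    """Gradient of s_n with respect to s_0 for the recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w   # the same factor is multiplied in at every time step
    return grad

step_counts = [1, 2, 5, 10, 20, 50]
vanishing = [gradient_through_time(0.9, n) for n in step_counts]
exploding = [gradient_through_time(1.1, n) for n in step_counts]

for n, small, large in zip(step_counts, vanishing, exploding):
    print(f"{n:>2} steps: w=0.9 -> {small:.5f}   w=1.1 -> {large:.2f}")
```

A factor only slightly below or above 1 is enough for the gradient to shrink or grow by orders of magnitude within 50 steps.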

(36)

Outline

Language Modeling

◦Definition

◦Training Issue

Applications

◦Sequential Input

(37)

How to Frame the Learning Problem?

The learning algorithm f maps the input domain X into the output domain Y

f : X → Y

Input domain: word, word sequence, audio signal, click logs

Output domain: single label, sequence tags, tree structure, probability distribution

Network design should leverage input and output domain properties

(38)

Outline

Language Modeling

◦Definition

◦Sequential Input

(39)

Input Domain – Sequence Modeling

Idea: aggregate the meaning from all words into a vector

Method:

◦Basic combination: average, sum

◦Neural combination:

◦ Recursive neural network (RvNN)

◦ Recurrent neural network (RNN)

◦ Convolutional neural network (CNN)
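The basic combination (average) from the list above can be sketched in a few lines. The 3-dimensional word vectors are invented for illustration; the second call shows that averaging ignores word order, which is one reason to consider the neural combinations.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical 3-dim embeddings for the four words of one sentence.
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
]
sentence_vec = average_vectors(word_vectors)
reversed_vec = average_vectors(word_vectors[::-1])  # same words, opposite order
```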

[Figure: how to compute an N-dim vector for the sentence 這 (this) 規格 (specification) 有 (have) 誠意 (sincerity)]

(40)

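Sentiment Analysis

Encode the sequential input into a vector using RNN

[Figure: an RNN reads 這 規格 有 誠意 (“this specification has sincerity”) word by word; the hidden state h₄ after the last input x₄ is fed as input x₁ … x_N to a classifier with outputs y₁ … y_M]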

RNN considers temporal information to learn sentence vectors as the input of classification tasks
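A sketch of using the final hidden state as a sentence vector for classification, assuming a tanh RNN encoder followed by a logistic classifier. All weights and input vectors are arbitrary, untrained values chosen for illustration.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def encode(xs, U, W):
    """Run the RNN over the word vectors and return the last hidden state."""
    h = [0.0] * len(W)
    for x in xs:
        h = [math.tanh(a + b) for a, b in zip(matvec(U, x), matvec(W, h))]
    return h

def classify(h, weights, bias=0.0):
    """Logistic classifier on top of the sentence vector."""
    score = sum(w * v for w, v in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-score))

U = [[0.6, -0.2], [0.1, 0.5]]
W = [[0.3, 0.0], [-0.4, 0.2]]
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # four word vectors

h4 = encode(sentence, U, W)            # sentence vector after the 4th word
p_positive = classify(h4, [1.0, -1.0])

# Unlike averaging, the encoding depends on word order.
h4_reversed = encode(sentence[::-1], U, W)
```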

(41)

Outline

Language Modeling

◦Definition

◦Sequential Input

◦Sequential Output


(42)

Output Domain – Sequence Prediction

POS Tagging: “推薦我台大後門的餐廳” (“Recommend me a restaurant by the back gate of NTU”) → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN

Speech Recognition: [audio signal] → “大家好” (“Hello, everyone”)

Machine Translation: “How are you doing today?” → “你好嗎?”

The output can be viewed as a sequence of classifications

(43)

Outline

Language Modeling

◦Definition

◦Sequential Input

◦ Aligned Sequential Pairs (Tagging)


(44)

POS Tagging

Tag a word at each time step

◦Input: word sequence

◦Output: corresponding POS tag sequence

Example: 四樓/N 好/AD 專業/VA (“The fourth floor is so professional”)

(45)

Natural Language Understanding (NLU)

Tag a word at each time step

◦Input: word sequence

◦Output: IOB-format slot tag and intent tag

Word      Tag
<START>   O
just      O
sent      O
email     O
to        O
bob       B-contact_name
about     O
fishing   B-subject
this      I-subject
weekend   I-subject
<END>     send_email (intent)

→ send_email(contact_name=“bob”, subject=“fishing this weekend”)

Temporal orders for input and output are the same
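Because the tags are aligned one-to-one with the words, turning an IOB tag sequence into slot values is a single pass. The helper name `decode_slots` is hypothetical, and treating an `I-` tag that does not continue the current slot as outside is a choice made for this sketch.

```python
def decode_slots(tokens, tags):
    """Collect the words of each B-/I- span into a slot-name -> text mapping."""
    slots, current, words = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                slots[current] = " ".join(words)
            current, words = tag[2:], [token]       # start a new slot
        elif tag.startswith("I-") and current == tag[2:]:
            words.append(token)                     # continue the current slot
        else:
            if current:
                slots[current] = " ".join(words)
            current, words = None, []               # outside any slot
    if current:
        slots[current] = " ".join(words)
    return slots

tokens = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
slots = decode_slots(tokens, tags)
```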

(46)

Outline

Language Modeling

◦Definition

◦Sequential Input

◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(47)

Machine Translation

Cascade two RNNs, one for encoding and one for decoding

◦Input: word sequences in the source language

◦Output: word sequences in the target language

[Figure: encoder RNN followed by decoder RNN; example sentence 超棒的醬汁 (“awesome sauce”)]

(48)

Chit-Chat Dialogue Modeling

Cascade two RNNs, one for encoding and one for decoding

◦Input: word sequences in the question

◦Output: word sequences in the response

Temporal ordering for input and output may be different
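The cascade of two RNNs can be sketched as an encoder that compresses the input into its final hidden state and a decoder that starts from that state and emits one token per step until it produces END. The weights are random and untrained, so the output is meaningless; the parameter names, sizes, vocabulary, and greedy (argmax) decoding are all assumptions made for illustration.

```python
import math
import random

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def rnn_step(x, h, U, W):
    return [math.tanh(a + b) for a, b in zip(matvec(U, x), matvec(W, h))]

def greedy_decode(source_vecs, p, vocab, max_len=5):
    h = [0.0] * len(p["enc_W"])
    for x in source_vecs:                          # encoder RNN
        h = rnn_step(x, h, p["enc_U"], p["enc_W"])
    token, output = "START", []
    for _ in range(max_len):                       # decoder RNN, starts from encoder state
        x = p["emb"][vocab.index(token)]
        h = rnn_step(x, h, p["dec_U"], p["dec_W"])
        scores = matvec(p["out"], h)               # one score per entry of vocab[1:]
        token = vocab[1 + scores.index(max(scores))]
        if token == "END":
            break
        output.append(token)
    return output

rng = random.Random(0)
vocab = ["START", "END", "你", "好", "嗎"]
hidden, dim = 3, 2
params = {
    "enc_U": rand_matrix(hidden, dim, rng),
    "enc_W": rand_matrix(hidden, hidden, rng),
    "dec_U": rand_matrix(hidden, dim, rng),
    "dec_W": rand_matrix(hidden, hidden, rng),
    "emb": rand_matrix(len(vocab), dim, rng),
    "out": rand_matrix(len(vocab) - 1, hidden, rng),
}
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]  # 5 input vectors
result = greedy_decode(source, params, vocab)
```

The number of decoded tokens is decided by the decoder, not by the input length, which is why input and output need not be aligned.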

(49)

Concluding Remarks

Language Modeling

◦RNNLM

Recurrent Neural Networks

◦Definition

◦Backpropagation through Time (BPTT)

◦Vanishing/Exploding Gradient

Applications

◦Sequential Input: Sequence-Level Embedding

◦Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)