Slide credit: Hung-Yi Lee & Richard Socher
Review
Recurrent Neural Network
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
RNN Language Modeling
[Figure: RNN LM unrolled over "wreck a nice beach": the vector of "START" yields P(next word is "wreck"), the vector of "wreck" yields P(next word is "a"), and so on through "nice" and "beach". Each column goes input (word vector) → hidden (context vector) → output (word probability distribution).]
Idea: pass the information from the previous hidden layer to leverage all contexts
RNNLM Formulation
At each time step,
s_t = f(U x_t + W s_{t-1})
ŷ_t = softmax(V s_t)
where x_t is the vector of the current word and ŷ_t is the probability distribution of the next word.
Recurrent Neural Network Definition
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t), where f: tanh, ReLU
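To make the recurrence concrete, here is a minimal NumPy sketch of one time step, following the U/W/V notation of the cited tutorial. The dimensions, the initialization scale, and the name `rnn_step` are illustrative choices, not part of the slides.

```python
import numpy as np

# One RNN time step: s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t).
# hidden_size and vocab_size are illustrative; x_t is a one-hot word vector.
hidden_size, vocab_size = 64, 1000
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (recurrent)
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, s_prev, f=np.tanh):
    """One time step: new hidden state and next-word distribution."""
    s_t = f(U @ x_t + W @ s_prev)   # f: tanh or ReLU
    o_t = softmax(V @ s_t)          # probability of the next word
    return s_t, o_t
```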
Model Training
All model parameters can be updated by gradient descent on the training cost C
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[Figure: at each time step the predicted output o_t is compared with the target y_t (…, y_{t-1}, y_t, y_{t+1}, …); the per-step costs are summed into the total cost C.]
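A hedged sketch of that cost, assuming a cross-entropy loss per time step and reusing `rnn_step`, `U`, `W`, `V`, and `hidden_size` from the sketch above; the function name and interface are illustrative.

```python
def sequence_loss(xs, ys, f=np.tanh):
    """Cross-entropy summed over time steps: C = sum_t C^(t).
    xs: list of one-hot input word vectors, ys: list of target word indices.
    Reuses rnn_step(), U, W, V from the earlier sketch (illustrative only)."""
    s = np.zeros(hidden_size)            # the "init" state
    C = 0.0
    for x_t, y_t in zip(xs, ys):
        s, o_t = rnn_step(x_t, s, f)
        C += -np.log(o_t[y_t] + 1e-12)   # per-step cost C^(t)
    return C
```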
Outline
Language Modeling
◦ N-gram Language Model
◦ Feed-Forward Neural Language Model
◦ Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
◦ Definition
◦ Training via Backpropagation through Time (BPTT)
◦ Training Issue
Applications
◦ Sequential Input
◦ Sequential Output
◦ Aligned Sequential Pairs (Tagging)
◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Backpropagation
[Figure: a feed-forward network with neuron j in layer l−1 connected to neuron i in layer l by weight w_ij. Forward pass: compute the activations a layer by layer from the input x. Backward pass: propagate the error signal from the output back toward the input; the gradient of the cost w.r.t. w_ij combines the forward activation and the backward error signal.]
Backpropagation
[Figure: the backward pass viewed as its own "reversed" network. At the output layer L, the error signal is δ^L = σ'(z^L) ⊙ ∂C/∂y; it is then propagated backward layer by layer through the transposed weight matrices: δ^l = σ'(z^l) ⊙ (W^{l+1})^T δ^{l+1}.]
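As a concrete illustration of this backward pass, here is a short NumPy sketch of the error-signal recursion for a fully connected network; the sigmoid activation and the argument layout (`Ws`, `zs`) are assumptions for the example, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_pass(Ws, zs, dC_dy):
    """Propagate the error signal delta backward through the layers.
    Ws: weight matrices W^1..W^L, zs: pre-activations z^1..z^L
    (both saved during the forward pass), dC_dy: dC/dy at the output.
    Returns delta^1..delta^L; sigmoid is an illustrative choice for sigma."""
    sigma_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))
    L = len(Ws)
    deltas = [None] * L
    deltas[L - 1] = sigma_prime(zs[L - 1]) * dC_dy                    # delta^L
    for l in range(L - 2, -1, -1):                                    # layers L-1, ..., 1
        deltas[l] = sigma_prime(zs[l]) * (Ws[l + 1].T @ deltas[l + 1])
    return deltas
```

The weight gradients then follow as ∂C/∂w_ij^l = a_j^{l−1} δ_i^l, combining the forward activations with these error signals.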
Backpropagation through Time (BPTT)
Unfold
◦ Input: init, x_1, x_2, …, x_t
◦ Output: o_t
◦ Target: y_t
[Figure: the recurrence is unfolded through time, init → s_1 → … → s_{t-1} → s_t, with inputs x_1, …, x_t feeding each step and the final output o_t compared against the target y_t.]
Backpropagation through Time (BPTT)
Unfold
◦ Input: init, x_1, x_2, …, x_t
◦ Output: o_t
◦ Target: y_t
[Figure: the error signal from the cost at o_t is propagated backward through the unfolded network, from the output layer down through s_t, s_{t-1}, …, s_1, exactly as in ordinary backpropagation.]
Weights are tied together: every unfolded copy points to the same memory, i.e. the same U, W, V, so the gradient contributions from all time steps are accumulated into the same matrices.
BPTT
Forward Pass: compute s_1, s_2, s_3, s_4, …
Backward Pass: for each per-step cost C^(1), C^(2), C^(3), C^(4), propagate its error back through the unfolded network to step 1.
[Figure: unfolded network with inputs x_1, …, x_4, states s_1, …, s_4, outputs o_1, …, o_4, targets y_1, …, y_4, and per-step costs C^(1), …, C^(4).]
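A sketch of this forward/backward schedule for the toy RNN defined earlier (tanh hidden units and a softmax cross-entropy cost per step are assumed). It follows the slide's per-cost decomposition literally, backpropagating each C^(t) all the way to step 1 and accumulating gradients into the tied U, W, V.

```python
def bptt_grads(xs, ys):
    """Forward pass: compute s_1, s_2, ...; backward pass: for each per-step
    cost C^(t), propagate its error back to step 1. Gradients accumulate into
    the *same* dU, dW, dV because the weights are tied across time.
    Reuses U, W, V, hidden_size, rnn_step from the earlier sketch."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ss, os = [np.zeros(hidden_size)], []
    for x_t in xs:                                   # forward pass
        s_t, o_t = rnn_step(x_t, ss[-1])
        ss.append(s_t); os.append(o_t)
    for t in range(len(xs)):                         # backward pass, one C^(t) at a time
        do = os[t].copy(); do[ys[t]] -= 1.0          # dC^(t)/d(V s_t) for softmax + cross-entropy
        dV += np.outer(do, ss[t + 1])
        ds = V.T @ do                                # error flowing into s_t
        for k in range(t, -1, -1):                   # back through time to step 1
            dz = ds * (1.0 - ss[k + 1] ** 2)         # tanh derivative
            dU += np.outer(dz, xs[k])
            dW += np.outer(dz, ss[k])
            ds = W.T @ dz                            # pass the error to s_{k-1}
    return dU, dW, dV
```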
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
Multiply the same matrix at each time step during backprop
The gradient becomes very small or very large quickly
vanishing or exploding gradient
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
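A small numeric illustration (not from the slides) of why this repeated product matters: scaling the recurrent matrix slightly below or above 1 makes the backpropagated signal shrink or grow geometrically over 50 steps.

```python
import numpy as np

# Repeatedly multiplying by the same matrix, as in the product of Jacobians,
# shrinks or blows up the error signal depending on the matrix's spectrum.
np.random.seed(0)
delta = np.random.randn(64)
for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    W_rec = scale * np.eye(64) + 0.01 * np.random.randn(64, 64)
    d = delta.copy()
    for _ in range(50):                    # 50 time steps of backprop
        d = W_rec.T @ d
    print(label, np.linalg.norm(d))        # the two norms differ by several orders of magnitude
```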
Rough Error Surface
The error surface is either very flat or very steep.
[Figure: cost surface over w_1 and w_2 with flat plateaus next to steep cliffs.]
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
Possible Solutions
Recurrent Neural Network
Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding
[Figure: on the cost surface over w_1 and w_2, the clipped gradient takes a bounded step instead of a huge jump at the cliff.]
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence
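A minimal sketch of clipping by the global gradient norm, one common way to realize this idea; the function name and interface are illustrative, and the comment restates the slide's parameter-setting note.

```python
import numpy as np

def clip_gradient(grads, threshold):
    """Clip by global norm: if ||g|| exceeds the threshold, rescale g so its
    norm equals the threshold (the direction is preserved).
    Heuristic from the slide: pick the threshold between roughly 0.5x and 10x
    the average gradient norm observed during training."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```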
Vanishing Gradient: Initialization + ReLU
IRNN
◦ initialize the recurrent weight matrix W as the identity matrix I
◦ use ReLU for activation functions
Le et al., “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” arXiv, 2015. [link]
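A minimal sketch of this recipe under the toy dimensions introduced earlier (it reuses `hidden_size` and `vocab_size`); variable names are illustrative, not from the paper.

```python
# IRNN-style setup: identity-initialized recurrent weights + ReLU hidden units.
W_irnn = np.eye(hidden_size)                               # recurrent weights = I
U_irnn = np.random.randn(hidden_size, vocab_size) * 0.01   # input weights, small random
relu = lambda z: np.maximum(0.0, z)

def irnn_step(x_t, s_prev):
    """One recurrent step with ReLU activation and identity-initialized W."""
    return relu(U_irnn @ x_t + W_irnn @ s_prev)
```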
Vanishing Gradient: Gating Mechanism
RNN models temporal sequence information
◦ can handle “long-term dependencies” in theory
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradient
Solution: apply a gating mechanism to directly encode the long-distance information
Example: “I grew up in France… I speak fluent French.”
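A deliberately simplified single-gate sketch (not the full LSTM described in the linked post): a learned gate interpolates between keeping the old memory and writing new content, which is how long-distance information such as "France" can survive many steps largely unchanged. It reuses `hidden_size` and `vocab_size` from the earlier sketch; the matrices and the function name are illustrative.

```python
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Wg = np.random.randn(hidden_size, hidden_size + vocab_size) * 0.01  # gate weights
Wc = np.random.randn(hidden_size, hidden_size + vocab_size) * 0.01  # candidate weights

def gated_step(x_t, c_prev):
    """Gate in [0, 1] decides how much old memory to keep per dimension."""
    v = np.concatenate([c_prev, x_t])
    gate = sigmoid(Wg @ v)              # 1 = keep old memory, 0 = overwrite
    candidate = np.tanh(Wc @ v)         # proposed new content
    return gate * c_prev + (1.0 - gate) * candidate
```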
Extension
Recurrent Neural Network
Bidirectional RNN
h_t = [→h_t ; ←h_t]: the concatenation of the forward and backward hidden states represents (summarizes) the past and future around a single token
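A hedged sketch of this idea: run one RNN left-to-right, another right-to-left, and concatenate their states at each position. It reuses `hidden_size` and step functions shaped like `rnn_step` from the earlier sketch; all names are illustrative.

```python
def birnn(xs, step_fwd, step_bwd):
    """Returns h_t = [h_fwd_t ; h_bwd_t] for every position t.
    step_fwd / step_bwd are step functions like rnn_step (state, output)."""
    s_f = s_b = np.zeros(hidden_size)
    fwd, bwd = [], []
    for x_t in xs:                      # left-to-right: summarizes the past
        s_f, _ = step_fwd(x_t, s_f)
        fwd.append(s_f)
    for x_t in reversed(xs):            # right-to-left: summarizes the future
        s_b, _ = step_bwd(x_t, s_b)
        bwd.append(s_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```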
Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next
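Stacking follows directly from the sketch above, assuming each layer's step functions accept the previous layer's concatenated states as their inputs; a minimal sketch:

```python
def deep_birnn(xs, layers):
    """Each bidirectional layer's concatenated states form the input sequence
    of the next layer. `layers` is a list of (step_fwd, step_bwd) pairs whose
    input sizes must match the previous layer's output size (illustrative)."""
    seq = xs
    for step_fwd, step_bwd in layers:
        seq = birnn(seq, step_fwd, step_bwd)   # intermediate representation
    return seq                                 # top layer feeds the output/prediction
```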
Concluding Remarks
Recurrent Neural Networks
◦ Definition
◦ Issue: Vanishing/Exploding Gradient
◦ Solution:
• Exploding Gradient: Clipping
• Vanishing Gradient: Initialization, ReLU, Gated RNNs
Extension
◦ Bidirectional
◦ Deep RNN