(1)

Slide credit from Hung-Yi Lee & Richard Socher 1

(2)

Review

Recurrent Neural Network

2

(3)

Recurrent Neural Network

Idea: condition the neural network on all previous words and tie the weights at each time step

Assumption: temporal information matters

3

(4)

RNN Language Modeling

4

[Figure: RNN language model unrolled over "wreck a nice beach" — at each step the input is the vector of the current word ("START", "wreck", "a", "nice") and the output is P(next word) ("wreck", "a", "nice", "beach"); input → hidden (context vector) → output (word probability distribution)]

Idea: pass the information from the previous hidden layer to leverage all contexts

(5)

RNNLM Formulation

5

At each time step,
$s_t = \sigma(U x_t + W s_{t-1})$
$o_t = \mathrm{softmax}(V s_t)$
where $x_t$ is the vector of the current word and $o_t$ is the probability distribution over the next word.

(6)

Recurrent Neural Network Definition

6 http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

$s_t = f(U x_t + W s_{t-1})$, $o_t = \mathrm{softmax}(V s_t)$, where the activation $f$ is typically tanh or ReLU.
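As a minimal NumPy sketch of this definition (vocabulary size, hidden size, and the random initialization are illustrative assumptions, not values from the slides):

```python
import numpy as np

vocab_size, hidden_size = 8, 4          # illustrative sizes (assumption)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden, tied across time
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, f=np.tanh):
    """One time step: s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s_t = f(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)              # probability distribution over the next word
    return s_t, o_t

s = np.zeros(hidden_size)               # the "init" state
for word_id in [3, 1, 5]:               # toy sequence of word ids
    x = np.eye(vocab_size)[word_id]     # one-hot vector of the current word
    s, o = rnn_step(x, s)
```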

(7)

Model Training

All model parameters can be updated by gradient descent on the cost between the predicted output and the target at each time step.

7 http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

[Figure: unrolled RNN — the predicted outputs are compared with the targets y_{t-1}, y_t, y_{t+1} at each time step]
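One common way to make this concrete (an assumption here, since the slide only shows the predicted/target comparison) is to sum a per-step cross-entropy cost over the sequence:

```python
import numpy as np

def sequence_cost(predictions, targets):
    """C = sum_t C^(t), where C^(t) = -log o_t[y_t] compares the predicted
    distribution o_t with the target word id y_t at time step t."""
    return sum(-np.log(o_t[y_t]) for o_t, y_t in zip(predictions, targets))
```

Gradients of this cost with respect to the tied weights are what backpropagation through time (introduced on the following slides) computes.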

(8)

Outline

Language Modeling
◦ N-gram Language Model
◦ Feed-Forward Neural Language Model
◦ Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network
◦ Definition
◦ Training via Backpropagation through Time (BPTT)
◦ Training Issue

Applications
◦ Sequential Input
◦ Sequential Output
• Aligned Sequential Pairs (Tagging)
• Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

8

(9)

Backpropagation

9

[Figure: a feed-forward network with weight w_ij connecting neuron j in layer l−1 to neuron i in layer l — the forward pass computes the activations a^l from the layer below, while the backward pass sends an error signal in the opposite direction]

(10)

Backpropagation

10
Backward Pass: the error signal at the output layer L is $\delta^L = \sigma'(z^L) \odot \nabla_y C$, and it propagates backward layer by layer as $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$.
[Figure: the error signals δ_1^l, δ_2^l, …, δ_i^l flowing from layer L through layer L−1 back to layer l]
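A small sketch of that backward recursion for sigmoid layers (the layer indexing and shapes here are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_pass(Ws, zs, dC_dy):
    """Ws[l] maps the activations of layer l to the pre-activations zs[l] of layer l+1.
    Returns the error signals delta, where delta[l] corresponds to zs[l]."""
    L = len(zs)
    delta = [None] * L
    delta[-1] = sigmoid_prime(zs[-1]) * dC_dy                      # delta^L at the output
    for l in range(L - 2, -1, -1):                                 # back toward the input
        delta[l] = sigmoid_prime(zs[l]) * (Ws[l + 1].T @ delta[l + 1])
    return delta
```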

(11)

Backpropagation through Time (BPTT)

Unfold
Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t
11
[Figure: the RNN unfolded through time — init and x_1, …, x_t produce states s_1, …, s_t; the output o_t is compared with the target y_t to give the cost C]

(12)

Backpropagation through Time (BPTT)

Unfold
Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t
12
[Figure: the same unfolded network, showing the components 1, 2, …, n of o_t and y_t entering the cost C]

(13)

Backpropagation through Time (BPTT)

Unfold
Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t
13
[Figure: the unfolded network with the cost C(o_t, y_t) at the output]

(14)

Backpropagation through Time (BPTT)

Unfold
Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t
14
[Figure: the unfolded network highlighting a recurrent weight between units i and j at every time step]
Weights are tied together — the copies at each time step point to the same memory.

(15)

Backpropagation through Time (BPTT)

Unfold
Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t
15
[Figure: the unfolded network highlighting the input weights between units i, j, k at every time step]
Weights are tied together

(16)

BPTT

16

Forward Pass: compute s_1, s_2, s_3, s_4, …
Backward Pass: for each cost C^(1), C^(2), C^(3), C^(4), propagate the error back through the unfolded network.
[Figure: unfolded RNN with inputs x_1…x_4, states s_1…s_4, outputs o_1…o_4, targets y_1…y_4, and per-step costs C^(1)…C^(4)]
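A compact NumPy sketch of these two passes on a short sequence (the tanh recurrence and per-step softmax cross-entropy costs are assumptions consistent with the earlier formulation; shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(xs, ys, U, W, V, s_init):
    """Forward pass: compute s_1..s_T and o_1..o_T.
    Backward pass: accumulate the gradient contributed by each per-step
    cost C^(t) = -log o_t[y_t], propagating it back through time."""
    T = len(xs)
    ss, os_ = [s_init], []
    for t in range(T):                                   # forward pass
        ss.append(np.tanh(U @ xs[t] + W @ ss[-1]))
        os_.append(softmax(V @ ss[-1]))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    for t in range(T):                                   # backward pass, one cost at a time
        do = os_[t].copy()
        do[ys[t]] -= 1.0                                 # dC^(t)/d(V s_t) for softmax + cross-entropy
        dV += np.outer(do, ss[t + 1])
        ds = V.T @ do                                    # dC^(t)/d s_t
        for k in range(t, -1, -1):                       # propagate back through time
            dz = (1.0 - ss[k + 1] ** 2) * ds             # through tanh
            dU += np.outer(dz, xs[k])
            dW += np.outer(dz, ss[k])                    # tied W gathers a term from every step
            ds = W.T @ dz                                # pass the error to the previous state
    return dU, dW, dV
```

The inner loop over k is exactly the "weights are tied together" point from the previous slides: the same U and W collect gradient contributions from every unfolded copy.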

(17)

RNN Training Issue

The gradient is a product of Jacobian matrices, each associated with a step in the forward computation

Multiply the same matrix at each time step during backprop

17

The gradient becomes very small or very large quickly
→ vanishing or exploding gradient

Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
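A toy numerical illustration (not from the slides) of why repeatedly multiplying by the same matrix makes the gradient vanish or explode:

```python
import numpy as np

g0 = np.ones(4)                              # some error signal at the last time step
for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    W = scale * np.eye(4)                    # stand-in for the repeated Jacobian factor
    g = g0.copy()
    for _ in range(50):                      # backprop through 50 time steps
        g = W.T @ g
    print(label, np.linalg.norm(g))          # roughly 1e-15 vs 1e+9 after 50 steps
```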

(18)

Rough Error Surface
18
The error surface is either very flat or very steep.
[Figure: cost surface over parameters w_1 and w_2, with steep cliffs next to flat plateaus]
Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]

(19)

Possible Solutions

Recurrent Neural Network

19

(20)

Exploding Gradient: Clipping

20

[Figure: the cost surface over w_1 and w_2, where the clipped gradient keeps the update from jumping off a cliff]
Idea: control the gradient value to avoid exploding.
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence.
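A minimal norm-clipping sketch (the threshold of 5.0 is a placeholder assumption; in practice it is chosen relative to the average gradient norm as noted above):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so that its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```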

(21)

Vanishing Gradient: Initialization + ReLU

IRNN

◦ initialize all W as identity matrix I

◦ use ReLU for activation functions

Le et al., “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” arXiv, 2015. [link] 21
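A sketch of the IRNN recipe in the notation used earlier (the sizes and the small input-weight scale are illustrative assumptions):

```python
import numpy as np

hidden_size, input_size = 64, 32            # illustrative sizes (assumption)
rng = np.random.default_rng(0)
W = np.eye(hidden_size)                     # recurrent weights initialized to the identity I
U = rng.normal(scale=0.01, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def irnn_step(x_t, s_prev):
    """IRNN update: ReLU activation with identity-initialized recurrent weights."""
    return np.maximum(0.0, U @ x_t + W @ s_prev + b)
```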

(22)

Vanishing Gradient: Gating Mechanism

RNN models temporal sequence information

◦ can handle “long-term dependencies” in theory

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 22

Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradient
→ apply the gating mechanism to directly encode the long-distance information
“I grew up in France… I speak fluent French.”
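As a hedged illustration of the gating idea, here is a simplified single-gate update in the spirit of GRU/LSTM (not the full cell from the linked post; all parameters are assumed to be learned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x_t, s_prev, U, W, Uz, Wz):
    """A gate z_t in [0, 1] decides how much of the old state to keep,
    giving long-distance information a more direct path through time."""
    z_t = sigmoid(Uz @ x_t + Wz @ s_prev)        # update gate
    s_candidate = np.tanh(U @ x_t + W @ s_prev)  # candidate new state
    return z_t * s_prev + (1.0 - z_t) * s_candidate
```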

(23)

Extension

Recurrent Neural Network

23

(24)

Bidirectional RNN

24

$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ represents (summarizes) the past and future around a single token.
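A sketch of that concatenation (the step function and parameter shapes are illustrative assumptions):

```python
import numpy as np

def run_rnn(xs, U, W, s0):
    """Hidden state at every position for one direction."""
    states, s = [], s0
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states

def bidirectional_states(xs, fwd, bwd, s0):
    """h_t = [h_fwd_t ; h_bwd_t] summarizes the past and the future around token t."""
    h_fwd = run_rnn(xs, *fwd, s0)                 # left-to-right pass
    h_bwd = run_rnn(xs[::-1], *bwd, s0)[::-1]     # right-to-left pass, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```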

(25)

Deep Bidirectional RNN

25

Each memory layer passes an intermediate representation to the next

(26)

Concluding Remarks

Recurrent Neural Networks

◦ Definition

◦ Issue: Vanishing/Exploding Gradient

◦ Solution:

• Exploding Gradient: Clipping

• Vanishing Gradient: Initialization, ReLU, Gated RNNs

Extension

◦ Bidirectional

◦ Deep RNN

26
