Outline
Language Modeling
◦N-gram Language Model
◦Feed-Forward Neural Language Model
◦Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
◦Definition
◦Training via Backpropagation through Time (BPTT)
◦Training Issue
Applications
◦Sequential Input
◦Sequential Output
◦ Aligned Sequential Pairs (Tagging)
◦ Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Language Modeling
Goal: estimate the probability of a word sequence
Example task: determine whether a sequence is grammatical or makes more sense
Example: a speech recognizer should output "recognize speech" rather than "wreck a nice beach" because P(recognize speech) > P(wreck a nice beach)
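A language model factorizes the sequence probability into per-word conditional probabilities (the chain rule); the models below differ only in how each conditional is estimated:

$P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$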
N-Gram Language Modeling
Goal: estimate the probability of a word sequence
N-gram language model
◦Probability is conditioned on a window of (n-1) previous words
◦Estimate the probability based on the training data:
$P(\text{beach} \mid \text{nice}) = \dfrac{C(\text{nice beach})}{C(\text{nice})}$
where $C(\text{nice beach})$ is the count of "nice beach" in the training data and $C(\text{nice})$ is the count of "nice"
Issue: some sequences may not appear in the training data
N-Gram Language Modeling
Training data:
◦The dog ran ……
◦The cat jumped ……
P(jumped | dog) = 0 and P(ran | cat) = 0, even though both are plausible
Smoothing: give unseen n-grams some small probability (e.g., 0.0001) instead of zero
The smoothed probability is still not accurate; the zeros occur because we cannot collect all possible text in the world as training data.
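A minimal sketch of these count-based estimates, using add-one (Laplace) smoothing as the smoothing variant (the slides do not specify which smoothing method is used; the toy corpus is illustrative):

```python
from collections import Counter

corpus = ["the dog ran home", "the cat jumped high"]  # toy training data

unigram, bigram = Counter(), Counter()
vocab = set()
for sent in corpus:
    words = sent.split()
    vocab.update(words)
    unigram.update(words)
    bigram.update(zip(words, words[1:]))

def p_bigram(w, prev, alpha=1.0):
    """P(w | prev) with add-alpha smoothing: (C(prev w) + a) / (C(prev) + a*|V|)."""
    return (bigram[(prev, w)] + alpha) / (unigram[prev] + alpha * len(vocab))

print(p_bigram("ran", "dog"))     # seen bigram: relatively high probability
print(p_bigram("jumped", "dog"))  # unseen bigram: small but nonzero probability
```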
Neural Language Modeling
Idea: estimate the probability not from counts, but from a neural network's prediction
At each position, a neural network takes the vector of the current word ("START", "wreck", "a", "nice") and outputs the probability of the next word ("wreck", "a", "nice", "beach")
P("wreck a nice beach") = P(wreck|START) P(a|wreck) P(nice|a) P(beach|nice)
Neural Language Modeling
[Figure: feed-forward architecture; the input layer encodes the context words as a context vector, a hidden layer transforms it, and the output layer gives the probability distribution of the next word]
Neural Language Modeling
The input-layer (or hidden-layer) vectors of related words are close to each other
◦If P(jump | dog) is large, P(jump | cat) increases accordingly (even if "… cat jump …" never appears in the data)
[Figure: dog, cat, and rabbit lie close together in the hidden space (h1, h2)]
Smoothing is automatically done
Issue: fixed context window for conditioning
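A rough sketch of one step of such a feed-forward language model with a single-word context window (all dimensions, names, and the random initialization are illustrative assumptions, not from the slides): the current word is embedded, passed through a hidden layer, and a softmax gives the next-word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 64, 128          # vocabulary size, embedding dim, hidden dim (illustrative)

E = rng.normal(0, 0.1, (V, d))   # word embedding matrix
W1 = rng.normal(0, 0.1, (h, d))  # input -> hidden
W2 = rng.normal(0, 0.1, (V, h))  # hidden -> output (one score per vocabulary word)

def next_word_distribution(word_id):
    """P(next word | current word) from a 1-word context window."""
    x = E[word_id]                        # context vector
    hidden = np.tanh(W1 @ x)              # hidden layer
    scores = W2 @ hidden                  # unnormalized scores
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()            # softmax over the vocabulary

p = next_word_distribution(42)
print(p.shape, p.sum())                   # (1000,) 1.0
```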
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
RNN Language Modeling
At each step, the network takes the vector of the current word ("START", "wreck", "a", "nice") and outputs the probability of the next word ("wreck", "a", "nice", "beach")
Same input / hidden / output structure as before (context vector in, word probability distribution out), but the hidden layer is now recurrent
Idea: pass the information from the previous hidden layer to leverage all contexts
RNNLM Formulation
At each time step t:
$s_t = \sigma(W s_{t-1} + U x_t)$, where $x_t$ is the vector of the current word
$o_t = \mathrm{softmax}(V s_t)$ is the probability distribution of the next word
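A minimal sketch of these two equations in code (the dimensions, the random initialization, and the tanh activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 64, 128                 # vocab size, input dim, hidden dim (illustrative)
U = rng.normal(0, 0.1, (h, d))          # input -> hidden
W = rng.normal(0, 0.1, (h, h))          # hidden -> hidden (recurrence)
Vo = rng.normal(0, 0.1, (V, h))         # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev):
    """One RNNLM step: s_t = tanh(W s_{t-1} + U x_t), o_t = softmax(V s_t)."""
    s_t = np.tanh(W @ s_prev + U @ x_t)
    o_t = softmax(Vo @ s_t)
    return s_t, o_t

s = np.zeros(h)                         # initial hidden state
for x in rng.normal(size=(5, d)):       # 5 dummy word vectors
    s, o = rnn_step(x, s)
print(o.shape)                          # (1000,): next-word distribution
```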
Recurrent Neural Network Definition
$s_t = \sigma(W s_{t-1} + U x_t)$, $o_t = \mathrm{softmax}(V s_t)$; the activation $\sigma$ is typically tanh or ReLU
Model Training
All model parameters $U$, $W$, $V$ can be updated by comparing the predicted outputs $o_{t-1}, o_t, o_{t+1}$ with the targets $y_{t-1}, y_t, y_{t+1}$ at each time step
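With cross-entropy as the per-step cost (the usual choice for a softmax output; the slides do not state the cost explicitly), the training objective sums the costs over all time steps:

$C = \sum_{t} C^{(t)}, \qquad C^{(t)} = -\sum_{k} y_{t,k} \log o_{t,k}$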
Backpropagation (Review)
Forward Pass: compute the activation $a^l_j$ of every neuron $j$ in every layer $l$, from the input layer up to the output layer $L$.
Backward Pass: propagate an error signal $\delta^l_i$ from the output layer back toward the input:
◦at the output layer: $\delta^L_i = \sigma'(z^L_i)\,\dfrac{\partial C}{\partial y_i}$
◦at an earlier layer: $\delta^{l} = \sigma'(z^{l}) \odot \big(W^{l+1}\big)^{T} \delta^{l+1}$
Combining the two passes gives the gradient for each weight: $\dfrac{\partial C}{\partial w^l_{ij}} = a^{l-1}_j\,\delta^l_i$
Backpropagation through Time (BPTT)
Unfold the recurrent network through time
◦Input: init, x_1, x_2, …, x_t
◦Output: o_t
◦Target: y_t
[Figure: the unfolded network init → s_1 → s_2 → … → s_{t-2} → s_{t-1} → s_t, with each x_k feeding s_k, the output o_t computed from s_t, and the cost C comparing o_t with the target y_t]
The unfolded network is an ordinary deep feed-forward network, so the standard backward pass applies. The only difference is that the weights are tied together: every time-step copy of a recurrent weight points to the same memory, so the error signals from all time steps are accumulated into the same weight matrices.
BPTT
Forward Pass: compute s_1, s_2, s_3, s_4, … and the outputs o_1, o_2, o_3, o_4, … from init and the inputs x_1, x_2, x_3, x_4, …
Backward Pass: for each per-step cost C^(1), C^(2), C^(3), C^(4), … (comparing o_t with its target y_t), propagate the error signal back through all earlier time steps of the unfolded network and accumulate the gradients into the tied weights.
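A compact sketch of BPTT for the vanilla RNN above, assuming softmax outputs with a per-step cross-entropy cost (all data and dimensions are illustrative): the forward pass stores every hidden state, and the backward pass accumulates gradients into the tied weights U, W, and the output matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, V, T = 8, 16, 20, 5            # input dim, hidden dim, vocab size, sequence length
U = rng.normal(0, 0.1, (h, d)); W = rng.normal(0, 0.1, (h, h)); Vo = rng.normal(0, 0.1, (V, h))
xs = rng.normal(size=(T, d))         # dummy input vectors
ys = rng.integers(0, V, size=T)      # dummy target word ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward pass: unfold through time, keep every s_t and o_t
s = np.zeros((T + 1, h))             # s[0] is the initial state
os_ = np.zeros((T, V))
for t in range(T):
    s[t + 1] = np.tanh(W @ s[t] + U @ xs[t])
    os_[t] = softmax(Vo @ s[t + 1])

# Backward pass: accumulate gradients of the summed cross-entropy cost into the tied weights
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Vo)
ds_next = np.zeros(h)                # error signal flowing back from step t+1
for t in reversed(range(T)):
    do = os_[t].copy(); do[ys[t]] -= 1.0   # dC/dlogits for softmax + cross-entropy
    dV += np.outer(do, s[t + 1])
    ds = Vo.T @ do + ds_next               # error reaching s_t from its output and from s_{t+1}
    dz = (1 - s[t + 1] ** 2) * ds          # through tanh
    dU += np.outer(dz, xs[t])
    dW += np.outer(dz, s[t])
    ds_next = W.T @ dz                     # pass the error one step further back
```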
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
Because the same matrix is multiplied in at every time step during backprop, the gradient becomes very small or very large quickly: the vanishing or exploding gradient problem
Rough Error Surface
[Figure: cost surface over parameters w_1 and w_2; the error surface is either very flat or very steep]
Vanishing/Exploding Gradient Example
[Figure: gradient behavior after 1, 2, 5, 10, 20, and 50 steps of backpropagation through time]
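The effect is easy to reproduce: repeatedly multiplying an error signal by the same recurrent matrix shrinks or blows it up at a rate set by the matrix's largest singular value (a minimal illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
h = 16
g = rng.normal(size=h)                        # some initial error signal

for scale in (0.5, 1.5):                      # contracting vs. expanding recurrent matrix
    W = scale * np.linalg.qr(rng.normal(size=(h, h)))[0]   # orthogonal matrix times a scale
    for steps in (1, 2, 5, 10, 20, 50):
        v = g.copy()
        for _ in range(steps):
            v = W.T @ v                       # one backprop step through the recurrence
        print(f"scale={scale}, {steps:2d} steps: |gradient| = {np.linalg.norm(v):.3g}")
```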
Possible Solutions
Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding; rescale the gradient when its norm exceeds a threshold
[Figure: on the cost surface over w_1 and w_2, the clipped gradient takes a bounded step instead of being thrown far away by the steep cliff]
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence
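A minimal sketch of norm clipping (the threshold value is illustrative; PyTorch users would typically reach for torch.nn.utils.clip_grad_norm_ instead):

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# e.g. clip the BPTT gradients from the earlier sketch before the parameter update:
# dU, dW, dV = clip_gradient([dU, dW, dV])
```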
Vanishing Gradient: Initialization + ReLU
IRNN
◦initialize all W as the identity matrix I
◦use ReLU as the activation function
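A minimal sketch of the IRNN recipe applied to the recurrent weights of the earlier RNN sketch (variable names reuse that sketch):

```python
import numpy as np

h = 128
W = np.eye(h)                        # recurrent weights initialized to the identity
relu = lambda z: np.maximum(0.0, z)  # ReLU replaces tanh in the recurrence
# s_t = relu(W @ s_prev + U @ x_t)
```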
Vanishing Gradient: Gating Mechanism
RNN models temporal sequence information
◦can handle "long-term dependencies" in theory
Issue: RNN cannot handle such "long-term dependencies" in practice due to vanishing gradient
Solution: apply a gating mechanism to directly encode the long-distance information
Example: "I grew up in France… I speak fluent French." Predicting "French" requires remembering "France" from many steps earlier.
Extension
Bidirectional RNN
$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ represents (summarizes) the past and future around a single token
Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next
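A minimal sketch of the bidirectional combination, assuming two per-direction cell functions that each map (x_t, previous state) to the next state (stacking several such layers, each feeding its outputs to the next, gives the deep variant):

```python
import numpy as np

def birnn_encode(xs, step_fwd, step_bwd, h):
    """h_t = [forward state at t ; backward state at t] for every position t."""
    T = len(xs)
    fwd, bwd = np.zeros((T, h)), np.zeros((T, h))
    s = np.zeros(h)
    for t in range(T):                 # left-to-right RNN over the input
        s = step_fwd(xs[t], s)
        fwd[t] = s
    s = np.zeros(h)
    for t in reversed(range(T)):       # right-to-left RNN over the same input
        s = step_bwd(xs[t], s)
        bwd[t] = s
    return np.concatenate([fwd, bwd], axis=1)   # (T, 2h): past and future around each token
```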
How to Frame the Learning Problem?
The learning algorithm f maps the input domain X to the output domain Y: f : X → Y
Input domain: word, word sequence, audio signal, click logs
Output domain: single label, sequence tags, tree structure, probability distribution
Network design should leverage input and output domain properties
Input Domain – Sequence Modeling
Idea: aggregate the meaning from all words into a vector
Method:
◦Basic combination: average, sum
◦Neural combination:
 Recursive neural network (RvNN)
 Recurrent neural network (RNN)
 Convolutional neural network (CNN)
Example: how to compute an N-dim vector for "這 規格 有 誠意" (this / specification / have / sincerity, "this spec shows sincerity"), e.g., as the input of a classification task
Sentiment Analysis
Encode the sequential input into a vector using an RNN
[Figure: the input sequence x_1, x_2, …, x_N is fed through the RNN, and the resulting sentence vector feeds a classifier producing the outputs y_1, y_2, …, y_M]
RNN considers temporal information to learn sentence vectors as the input of classification tasks
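A minimal sketch of this use of the RNN as a sentence encoder (the classifier on top, the dimensions, and the random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_classes = 64, 128, 2                 # word dim, hidden dim, e.g. positive/negative
U = rng.normal(0, 0.1, (h, d)); W = rng.normal(0, 0.1, (h, h))
Wc = rng.normal(0, 0.1, (n_classes, h))      # classifier on top of the sentence vector

def encode(word_vectors):
    """Run the RNN over the whole input and use the last hidden state as the sentence vector."""
    s = np.zeros(h)
    for x in word_vectors:
        s = np.tanh(W @ s + U @ x)
    return s

def classify(word_vectors):
    z = Wc @ encode(word_vectors)
    e = np.exp(z - z.max())
    return e / e.sum()                        # distribution over sentiment classes

print(classify(rng.normal(size=(6, d))))     # probabilities for a 6-word dummy sentence
```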
Output Domain – Sequence Prediction
POS Tagging: "推薦我台大後門的餐廳" ("recommend me a restaurant near the NTU back gate") → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN
Speech Recognition: audio signal → "大家好" ("hello everyone")
Machine Translation: "How are you doing today?" → "你好嗎?"
The output can be viewed as a sequence of classifications
POS Tagging
Tag a word at each time step
◦Input: word sequence
◦Output: corresponding POS tag sequence
Example: "四樓 好 專業" ("the fourth floor is so professional") → tags N VA AD
Natural Language Understanding (NLU)
Tag a word at each time step
◦Input: word sequence
◦Output: IOB-format slot tags and an intent tag
Example: "<START> just sent email to bob about fishing this weekend <END>" → O O O O O B-contact_name O B-subject I-subject I-subject O, intent = send_email
Interpretation: send_email(contact_name="bob", subject="fishing this weekend")
Temporal orders for input and output are the same
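A minimal sketch of aligned tagging with the same kind of RNN cell: one tag distribution is emitted at every time step instead of only at the end (the tag set, dimensions, and random initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_tags = 64, 128, 5                    # word dim, hidden dim, number of POS/slot tags
U = rng.normal(0, 0.1, (h, d)); W = rng.normal(0, 0.1, (h, h))
Wt = rng.normal(0, 0.1, (n_tags, h))         # per-step tag classifier

def tag_sequence(word_vectors):
    """Return one predicted tag id per input word (aligned input/output)."""
    s, tags = np.zeros(h), []
    for x in word_vectors:
        s = np.tanh(W @ s + U @ x)
        tags.append(int(np.argmax(Wt @ s)))  # tag decision at this time step
    return tags

print(tag_sequence(rng.normal(size=(3, d)))) # e.g. one tag id for each of the 3 words
```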
Machine Translation
Cascade two RNNs, one for encoding and one for decoding
◦Input: word sequence in the source language
◦Output: word sequence in the target language
Example: the encoder reads the source sentence "超棒 的 醬汁" ("awesome sauce") and the decoder generates the target-language sentence
Chit-Chat Dialogue Modeling
Cascade two RNNs, one for encoding and one for decoding
◦Input: word sequence of the question
◦Output: word sequence of the response
Temporal ordering for input and output may be different
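A minimal sketch of the encoder-decoder cascade with greedy decoding (dimensions, special token ids, and random initialization are illustrative; the decoder conditions on the encoder's final state and on its own previous output):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, V = 64, 128, 1000                      # word dim, hidden dim, target vocab size
Ue, We = rng.normal(0, 0.1, (h, d)), rng.normal(0, 0.1, (h, h))   # encoder weights
Ud, Wd = rng.normal(0, 0.1, (h, V)), rng.normal(0, 0.1, (h, h))   # decoder weights
Vd = rng.normal(0, 0.1, (V, h))                                   # decoder output layer
START, END = 0, 1                            # special target-side token ids

def encode(src_vectors):
    s = np.zeros(h)
    for x in src_vectors:                    # encoder RNN reads the source sequence
        s = np.tanh(We @ s + Ue @ x)
    return s                                 # final state summarizes the source sentence

def decode(s, max_len=20):
    out, word = [], START
    for _ in range(max_len):                 # decoder RNN generates until <END>
        x = np.eye(V)[word]                  # one-hot of the previously generated word
        s = np.tanh(Wd @ s + Ud @ x)
        word = int(np.argmax(Vd @ s))        # greedy choice of the next target word
        if word == END:
            break
        out.append(word)
    return out

print(decode(encode(rng.normal(size=(4, d)))))
```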
Sci-Fi Short Film - SUNSPRING (a short film whose screenplay was generated by a recurrent neural network)
Concluding Remarks
Language Modeling
◦RNNLM
Recurrent Neural Networks
◦Definition
◦Backpropagation through Time (BPTT)
◦Vanishing/Exploding Gradient
Applications
◦Sequential Input: Sequence-Level Embedding
◦Sequential Output: Tagging, Seq2Seq (Encoder-Decoder)