Slides credited from Hung-Yi Lee & Richard Socher

(1)
(2)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(3)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(4)

Language Modeling

Goal: estimate the probability of a word sequence

Example task: determine whether a word sequence is grammatical or makes more sense

Speech recognition example: “recognize speech” vs. “wreck a nice beach”

Output = “recognize speech” if P(recognize speech) > P(wreck a nice beach)
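
Not on the original slide, but as a reminder of the quantity being estimated: the sequence probability factorizes by the chain rule, and the models below differ only in how they approximate each conditional term (sketch in LaTeX):

```latex
% Chain rule: exact factorization of the probability of a word sequence
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})

% N-gram approximation: condition only on the previous (n-1) words
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})
```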

(5)

Outline

Language Modeling

N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(6)

N-Gram Language Modeling

Goal: estimate the probability of a word sequence

N-gram language model

Probability is conditioned on a window of (n-1) previous words

Estimate the probability based on the training data

P(beach | nice) = C(nice beach) / C(nice)

C(nice beach): count of “nice beach” in the training data
C(nice): count of “nice” in the training data

Issue: some sequences may not appear in the training data

(7)

N-Gram Language Modeling

Training data:

◦The dog ran ……

◦The cat jumped ……

P(jumped | dog) = 0    P(ran | cat) = 0

→ give some small probability instead (smoothing), e.g. P(jumped | dog) = 0.0001, P(ran | cat) = 0.0001

→ The probability is not accurate.

→ This phenomenon happens because we cannot collect all the possible text in the world as training data.
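
A minimal sketch (mine, not from the slides) of how add-one smoothing gives unseen pairs such as P(jumped | dog) a small non-zero probability; the toy corpus is just the two training sentences above:

```python
from collections import Counter

# Toy training data (the two sentences from the slide)
corpus = [["the", "dog", "ran"], ["the", "cat", "jumped"]]

vocab = sorted({w for sent in corpus for w in sent})
unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_mle(word, prev):
    """Maximum-likelihood estimate C(prev word) / C(prev): zero for unseen pairs."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

def p_add_one(word, prev):
    """Add-one (Laplace) smoothing: every bigram gets a small probability."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + len(vocab))

print(p_mle("jumped", "dog"))      # 0.0, unseen in the training data
print(p_add_one("jumped", "dog"))  # small but non-zero
```

With add-one smoothing no estimated probability is exactly zero, at the cost of making the estimates for frequent pairs less accurate.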

(8)

Outline

Language Modeling

◦N-gram Language Model

Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(9)

Neural Language Modeling

Idea: estimate the probability not from counts, but from the neural network's prediction

[Figure: a neural network takes the vector of the current word (“START”, “wreck”, “a”, “nice”) and outputs P(next word is “wreck”), P(next word is “a”), P(next word is “nice”), P(next word is “beach”)]

P(“wreck a nice beach”) = P(wreck|START)P(a|wreck)P(nice|a)P(beach|nice)

(10)

Neural Language Modeling

[Figure: feed-forward neural LM with input, hidden, and output layers; the context vector is fed in and the output is the probability distribution of the next word]

(11)

Neural Language Modeling

The input-layer (or hidden-layer) vectors of related words are close to each other

◦If P(jump|dog) is large, P(jump|cat) increases accordingly (even if “… cat jump …” does not appear in the data)

[Figure: “dog”, “cat”, and “rabbit” lie close together in the (h1, h2) hidden space]

Smoothing is automatically done

Issue: fixed context window for conditioning
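
A rough numpy sketch (all sizes and random weights are placeholders I chose) of the fixed-window idea: the embeddings of the (n-1) previous words are concatenated into a context vector, a hidden layer transforms it, and a softmax outputs the next-word distribution; the fixed n_prev is exactly the "fixed context window" issue noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_prev = 1000, 50, 100, 2           # vocab size, embedding dim, hidden dim, window (n-1)

E = rng.normal(scale=0.1, size=(V, d))       # word embeddings (input layer)
W1 = rng.normal(scale=0.1, size=(n_prev * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """P(next word | fixed window of n_prev previous word ids)."""
    x = np.concatenate([E[i] for i in context_ids])   # context vector
    hidden = np.tanh(x @ W1)                          # hidden layer
    return softmax(hidden @ W2)                       # probability distribution of the next word

p = next_word_distribution([3, 7])   # word ids of the two previous words would be looked up here
print(p.shape, p.sum())              # (1000,) ~1.0
```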

(12)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(13)

Recurrent Neural Network

Idea: condition the neural network on all previous words and tie the weights at each time step

Assumption: temporal information matters

(14)

RNN Language Modeling

[Figure: RNN LM unrolled over time; at each step the vector of the current word (“START”, “wreck”, “a”, “nice”) enters the input layer, the hidden layer also receives the previous hidden state (context vector), and the output layer gives the probability distribution of the next word (“wreck”, “a”, “nice”, “beach”)]

Idea: pass the information from the previous hidden layer to leverage all contexts

(15)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(16)

RNNLM Formulation

At each time step t, the vector of the current word x_t is combined with the previous hidden state to produce the probability distribution of the next word:

s_t = σ(W s_{t-1} + U x_t)
o_t = softmax(V s_t)

(17)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(18)

Recurrent Neural Network Definition

s_t = σ(W s_{t-1} + U x_t),  σ: tanh, ReLU
o_t = softmax(V s_t)
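
A minimal numpy sketch of this definition (my own illustration; dimensions and random weights are placeholders), using s for the hidden state, x for the current word vector, and o for the next-word distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 50, 100                 # vocab size, input (word vector) dim, hidden dim

U  = rng.normal(scale=0.1, size=(h, d)) # input-to-hidden weights
W  = rng.normal(scale=0.1, size=(h, h)) # hidden-to-hidden (recurrent) weights, tied across time
Vo = rng.normal(scale=0.1, size=(V, h)) # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, s_init=None):
    """xs: list of word vectors x_1..x_t. Returns hidden states s_t and next-word distributions o_t."""
    s = np.zeros(h) if s_init is None else s_init
    states, outputs = [], []
    for x in xs:
        s = np.tanh(W @ s + U @ x)      # s_t = sigma(W s_{t-1} + U x_t)
        o = softmax(Vo @ s)             # o_t = softmax(V s_t)
        states.append(s)
        outputs.append(o)
    return states, outputs

xs = [rng.normal(size=d) for _ in range(4)]   # stand-ins for the vectors of "START", "wreck", "a", "nice"
states, outputs = rnn_forward(xs)
print(outputs[-1].shape)                      # (1000,): P(next word is "beach") is one entry of this
```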

(19)

Model Training

All model parameters can be updated by comparing the predicted outputs with the targets y_{t-1}, y_t, y_{t+1} at each time step and applying gradient-based updates.

[Figure: unrolled RNN; at each step the predicted output is compared against the target y_t]

(20)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(21)

Backpropagation

[Figure: layers l-1 and l of a feed-forward network; weight w_ij^l connects neuron j in layer l-1 to neuron i in layer l]

Forward pass: compute the activation a_j^{l-1} of each neuron.

Backward pass: propagate the error signal δ_i^l from the output layer back through the network.

∂C/∂w_ij^l = a_j^{l-1} δ_i^l

(22)

Backpropagation

Backward pass: the error signal at the output layer L is

δ^L = σ'(z^L) ⊙ ∇_y C

and it is propagated backwards layer by layer through the transposed weight matrices:

δ^l = σ'(z^l) ⊙ (W^{l+1})^T δ^{l+1}

[Figure: error signals δ_1^l, δ_2^l, …, δ_i^l at layer l computed from the layer above via (W^{l+1})^T, …, (W^L)^T and the derivatives σ'(z_i^l), starting from ∂C/∂y_1, …, ∂C/∂y_n at the output]

(23)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t

[Figure: the recurrent network is unfolded into a deep feed-forward network with states s_1, …, s_{t-2}, s_{t-1}, s_t; the cost C compares the output with the target, and ∂C/∂y is propagated back through the unfolded copies]

(24)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t

[Figure: the same unfolded network, zooming in on the output-layer neurons 1, 2, …, n that receive the error signal ∂C/∂y]

(25)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t

[Figure: the error signal ∂C/∂y is propagated down through the unfolded hidden layers s_t, s_{t-1}, s_{t-2}, …, s_1]

(26)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t

Weights are tied together: the unfolded copies share the same memory (the weight w_ij at every time step is a pointer to the same parameter)

[Figure: the weight w_ij between neurons i and j is the same in every unfolded layer]

(27)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t
Output: o_t
Target: y_t

Weights are tied together: the gradient of a shared weight is accumulated over all of its unfolded copies (w_ij, w_jk, …)

[Figure: the same tied weights appear between every pair of adjacent unfolded layers]

(28)

BPTT

Forward Pass: compute s_1, s_2, s_3, s_4, ……

Backward Pass: for each cost C^(1), C^(2), C^(3), C^(4) (comparing output o_1, o_2, o_3, o_4 with target y_1, y_2, y_3, y_4), propagate the error back through the unfolded network

[Figure: unrolled RNN over four time steps with inputs x_1…x_4, states s_1…s_4 (starting from init), outputs o_1…o_4, targets y_1…y_4, and per-step costs C^(1)…C^(4)]
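
A compact numpy sketch of BPTT for the unfolded network above (an illustration, not code from the slides): the forward pass computes s_1, …, s_T, the backward pass sends the error signal back through time via the transposed recurrent weights, and because the weights are tied, the gradients of U, W, V are summed over all time steps; softmax outputs with cross-entropy against one-hot targets are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, T = 20, 8, 16, 5                 # small toy sizes (vocab, input dim, hidden dim, sequence length)

U  = rng.normal(scale=0.1, size=(h, d))   # input weights (tied across time)
W  = rng.normal(scale=0.1, size=(h, h))   # recurrent weights (tied across time)
Vo = rng.normal(scale=0.1, size=(V, h))   # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(xs, ys, s_init):
    """Forward pass over the unfolded network, then backpropagation through time.
    xs: list of input vectors x_1..x_T; ys: list of target word ids y_1..y_T."""
    # Forward pass: compute s_1, s_2, ..., s_T and the output distributions o_t
    ss, os, prev = [], [], s_init
    for x in xs:
        s = np.tanh(W @ prev + U @ x)
        ss.append(s); os.append(softmax(Vo @ s)); prev = s

    dU, dW, dVo = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Vo)
    ds_next = np.zeros(h)                          # error signal arriving from step t+1
    for t in reversed(range(len(xs))):
        dz = os[t].copy(); dz[ys[t]] -= 1.0        # d(cross-entropy)/d(logits) = o_t - y_t (one-hot)
        dVo += np.outer(dz, ss[t])
        ds = Vo.T @ dz + ds_next                   # error signal reaching s_t (from output and future steps)
        da = ds * (1.0 - ss[t] ** 2)               # through the tanh nonlinearity
        s_prev = ss[t - 1] if t > 0 else s_init
        dW += np.outer(da, s_prev)                 # tied weights: gradients are summed over time
        dU += np.outer(da, xs[t])
        ds_next = W.T @ da                         # pass the error signal to step t-1
    return dU, dW, dVo

xs = [rng.normal(size=d) for _ in range(T)]
ys = rng.integers(0, V, size=T)
grads = bptt(xs, ys, s_init=np.zeros(h))
print([g.shape for g in grads])                    # [(16, 8), (16, 16), (20, 16)]
```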

(29)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(30)

RNN Training Issue

The gradient is a product of Jacobian matrices, each associated with a step in the forward computation

Multiply the same matrix at each time step during backprop

The gradient becomes very small or very large quickly

→ vanishing or exploding gradient
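
A tiny numpy illustration (assumed setup, not from the slides) of the statement above: multiplying a gradient by the same matrix at every step scales its norm roughly like the largest singular value raised to the number of steps, so it quickly vanishes or explodes:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 16
grad = rng.normal(size=h)

for scale in (0.9, 1.1):                                     # largest singular value below / above 1
    W = scale * np.linalg.svd(rng.normal(size=(h, h)))[0]    # orthogonal matrix, all singular values = scale
    for steps in (1, 2, 5, 10, 20, 50):
        g = grad.copy()
        for _ in range(steps):
            g = W.T @ g                                      # one backprop step multiplies by the same matrix
        print(f"scale={scale}  steps={steps:2d}  ||gradient|| = {np.linalg.norm(g):.3e}")
```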

(31)

Rough Error Surface

[Figure: cost surface over parameters w_1 and w_2, with steep cliffs next to flat plateaus]

The error surface is either very flat or very steep

(32)

Vanishing/Exploding Gradient Example

[Figure: six panels showing the gradient after 1, 2, 5, 10, 20, and 50 steps of backpropagation]

(33)

Possible Solutions

Recurrent Neural Network

(34)

Exploding Gradient: Clipping

Idea: control the gradient value to avoid exploding

[Figure: cost surface over w_1, w_2; the clipped gradient takes a small, stable step instead of being thrown far away by the cliff]

Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence
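
One common way to realize the clipping idea is to rescale the whole gradient whenever its norm exceeds a threshold; this is a sketch (the threshold value is a placeholder, not a recommendation from the slide):

```python
import numpy as np

def clip_by_norm(grads, threshold):
    """Rescale the list of gradient arrays so their global norm is at most `threshold`."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads

# usage sketch: clipped = clip_by_norm([dU, dW, dVo], threshold=5.0)
```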

(35)

Vanishing Gradient: Initialization + ReLU

IRNN:

◦initialize all W as the identity matrix I
◦use ReLU for the activation function
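
A minimal sketch of the IRNN recipe in the same s/x notation (sizes and the input-weight scale are placeholders I chose):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 100

W = np.eye(h)                               # recurrent weights initialized to the identity matrix I
U = rng.normal(scale=0.01, size=(h, d))     # small random input weights

def irnn_step(s_prev, x):
    return np.maximum(0.0, W @ s_prev + U @ x)   # ReLU activation
```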

(36)

Vanishing Gradient: Gating Mechanism

RNN models temporal sequence information

◦can handle “long-term dependencies” in theory

Issue: RNN cannot handle such “long-term dependencies” in practice due to vanishing gradients

→ apply the gating mechanism to directly encode the long-distance information

Example of a long-distance dependency: “I grew up in France… I speak fluent French.”
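
The gate itself is covered in the LSTM/GRU lecture; as a rough illustration of the idea (a GRU-style simplification of my own, not the slides' model), a learned update gate can interpolate between copying the old state, which carries long-distance information directly, and writing a new one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 100
Wz, Uz = rng.normal(scale=0.1, size=(h, h)), rng.normal(scale=0.1, size=(h, d))
Ws, Us = rng.normal(scale=0.1, size=(h, h)), rng.normal(scale=0.1, size=(h, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x):
    z = sigmoid(Wz @ s_prev + Uz @ x)            # update gate: how much new information to write
    s_tilde = np.tanh(Ws @ s_prev + Us @ x)      # candidate new state
    return (1.0 - z) * s_prev + z * s_tilde      # gate near 0 => old state copied across many steps
```

When the gate stays near 0, the state passes through many time steps almost unchanged, so the gradient along that path does not vanish as quickly.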

(37)

Extension

Recurrent Neural Network

(38)

Bidirectional RNN

h_t = [forward h_t ; backward h_t] represents (summarizes) the past and future around a single token
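
A small sketch of the concatenation (the step functions and weights are abstract placeholders of mine, not an interface from the slides):

```python
import numpy as np

def rnn_states(xs, step, h):
    """Run a step function s_t = step(s_{t-1}, x_t) over xs and return all states."""
    s, states = np.zeros(h), []
    for x in xs:
        s = step(s, x)
        states.append(s)
    return states

def bidirectional_encode(xs, fwd_step, bwd_step, h):
    fwd = rnn_states(xs, fwd_step, h)                 # reads the past
    bwd = rnn_states(xs[::-1], bwd_step, h)[::-1]     # reads the future
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [forward ; backward]

# usage sketch with untrained tanh steps (random placeholder weights)
rng = np.random.default_rng(0)
d, h = 50, 100
Uf, Wf = rng.normal(scale=0.1, size=(h, d)), rng.normal(scale=0.1, size=(h, h))
Ub, Wb = rng.normal(scale=0.1, size=(h, d)), rng.normal(scale=0.1, size=(h, h))
states = bidirectional_encode([rng.normal(size=d) for _ in range(6)],
                              lambda s, x: np.tanh(Wf @ s + Uf @ x),
                              lambda s, x: np.tanh(Wb @ s + Ub @ x), h)
print(len(states), states[0].shape)    # 6 (200,)
```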

(39)

Deep Bidirectional RNN

Each memory layer passes an intermediate representation to the next

(40)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(41)

How to Frame the Learning Problem?

The learning algorithm f maps the input domain X to the output domain Y

f : X → Y

Input domain: word, word sequence, audio signal, click logs
Output domain: single label, sequence of tags, tree structure, probability distribution

Network design should leverage input and output domain properties

(42)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(43)

Input Domain – Sequence Modeling

Idea: aggregate the meaning from all words into a vector

Method:

Basic combination: average, sum

Neural combination:

Recursive neural network (RvNN)

Recurrent neural network (RNN)

Convolutional neural network (CNN)

[Figure: how to compute an N-dim vector for a word sequence, e.g. 這 (this) 規格 (specification) 有 (have) 誠意 (sincerity)]
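
A toy numpy comparison (word vectors are random placeholders standing in for 這, 規格, 有, 誠意) of the two simplest combinations listed above: averaging is order-insensitive, while the RNN's final state depends on word order:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 100
word_vectors = [rng.normal(size=d) for _ in range(4)]   # stand-ins for the four words

# Basic combination: average (order-insensitive)
avg_vector = np.mean(word_vectors, axis=0)

# Recurrent combination: final hidden state (order-sensitive)
U = rng.normal(scale=0.1, size=(h, d))
W = rng.normal(scale=0.1, size=(h, h))
s = np.zeros(h)
for x in word_vectors:
    s = np.tanh(W @ s + U @ x)
sentence_vector = s

print(avg_vector.shape, sentence_vector.shape)   # (50,) (100,)
```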

(44)

Sentiment Analysis

Encode the sequential input into a vector using RNN

[Figure: an RNN reads the input words x_1, x_2, …, x_N (e.g. 規格, 誠意) into hidden states; the resulting sentence vector feeds a classifier with outputs y_1, y_2, …, y_M]

RNN considers temporal information to learn sentence vectors as the input of classification tasks

(45)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(46)

Output Domain – Sequence Prediction

POS Tagging

Speech Recognition

Machine Translation

POS Tagging: “推薦我台大後門的餐廳” (recommend a restaurant near the back gate of NTU) → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN

Speech Recognition: (audio signal) → “大家好” (hello everyone)

Machine Translation: “How are you doing today?” → “你好嗎?”

The output can be viewed as a sequence of classification decisions

(47)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(48)

POS Tagging

Tag a word at each timestamp

◦Input: word sequence

◦Output: corresponding POS tag sequence

[Example: each input word (e.g. 四樓, 專業) is tagged with its POS tag (e.g. N, VA, AD)]

(49)

Natural Language Understanding (NLU)

Tag a word at each timestamp

◦Input: word sequence

◦Output: IOB-format slot tag and intent tag

<START> just sent email to bob about fishing this weekend <END>

just/O sent/O email/O to/O bob/B-contact_name about/O fishing/B-subject this/I-subject weekend/I-subject

→ send_email(contact_name=“bob”, subject=“fishing this weekend”)

intent: send_email

Temporal orders for input and output are the same
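
A minimal numpy sketch of this aligned setting (tag-set sizes and the two linear classifiers are my own placeholders): one slot tag is predicted per time step, and the intent is read from the final hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_tags, n_intents = 50, 100, 8, 5         # toy sizes; IOB tag set and intent set are placeholders

U  = rng.normal(scale=0.1, size=(h, d))
W  = rng.normal(scale=0.1, size=(h, h))
Wt = rng.normal(scale=0.1, size=(n_tags, h))    # per-time-step slot-tag classifier
Wi = rng.normal(scale=0.1, size=(n_intents, h)) # sentence-level intent classifier

def tag_and_classify(word_vectors):
    s, tags = np.zeros(h), []
    for x in word_vectors:                      # input and output are aligned: one tag per word
        s = np.tanh(W @ s + U @ x)
        tags.append(int(np.argmax(Wt @ s)))
    intent = int(np.argmax(Wi @ s))             # intent read off the final state
    return tags, intent

tags, intent = tag_and_classify([rng.normal(size=d) for _ in range(5)])
print(tags, intent)
```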

(50)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(51)

Machine Translation

Cascade two RNNs, one for encoding and one for decoding

Input: word sequences in the source language
Output: word sequences in the target language

[Figure: an encoder RNN reads the source sentence; a decoder RNN generates the target words (e.g. 超棒 醬汁)]

(52)

Chit-Chat Dialogue Modeling

Cascade two RNNs, one for encoding and one for decoding

Input: word sequences in the question

Output: word sequences in the response

Temporal ordering for input and output may be different
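
A compact numpy sketch of the cascade (an illustration; the vocabulary, dimensions, and greedy decoding loop are my assumptions, not details from the slides): the encoder RNN compresses the input sequence into its final state, and the decoder RNN starts from that state and emits output tokens until an end symbol or a length cap:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, EOS, MAX_LEN = 20, 8, 16, 0, 10       # toy vocab size, dims, end-of-sequence id, length cap

Ue, We = rng.normal(scale=0.1, size=(h, d)), rng.normal(scale=0.1, size=(h, h))   # encoder weights
Ud, Wd = rng.normal(scale=0.1, size=(h, d)), rng.normal(scale=0.1, size=(h, h))   # decoder weights
Vo = rng.normal(scale=0.1, size=(V, h))                                           # decoder output weights
E  = rng.normal(scale=0.1, size=(V, d))                                           # shared word embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(source_ids):
    s = np.zeros(h)
    for i in source_ids:                        # encoder RNN reads the source sequence
        s = np.tanh(We @ s + Ue @ E[i])
    return s                                    # final state summarizes the whole input

def decode(s, start_id=EOS):
    out, prev = [], start_id
    for _ in range(MAX_LEN):                    # decoder RNN generates one token at a time
        s = np.tanh(Wd @ s + Ud @ E[prev])
        prev = int(np.argmax(softmax(Vo @ s)))  # greedy choice of the next output token
        if prev == EOS:
            break
        out.append(prev)
    return out

print(decode(encode([3, 7, 5])))                # untrained weights, so the token ids are arbitrary
```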

(53)

Sci-Fi Short Film - SUNSPRING

(54)

Concluding Remarks

Language Modeling

◦RNNLM

Recurrent Neural Networks

Definition

Backpropagation through Time (BPTT)

Vanishing/Exploding Gradient

Applications

Sequential Input: Sequence-Level Embedding

Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)
