(1)

Slide credit from Hung-Yi Lee & Richard Socher 1

(2)

Review

Word Vector

(3)

Word2Vec Variants

Skip-gram: predicting surrounding words given the target word

(Mikolov+, 2013)

CBOW (continuous bag-of-words): predicting the target word given the surrounding words

(Mikolov+, 2013)

LM (Language modeling): predicting the next words given the preceding contexts

(Mikolov+, 2013)

Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.
Mikolov et al., “Linguistic regularities in continuous space word representations,” in NAACL HLT, 2013.

(4)

Word2Vec LM

Goal: predicting the next words given the preceding contexts

(5)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(6)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(7)

Language Modeling

Goal: estimate the probability of a word sequence

Example task: determine whether a sequence is grammatical or makes more sense

recognize speech  or  wreck a nice beach

Output = “recognize speech” if P(recognize speech) > P(wreck a nice beach)

(8)

Outline

Language Modeling

N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(9)

N-Gram Language Modeling

Goal: estimate the probability of a word sequence

N-gram language model

Probability is conditioned on a window of (n-1) previous words

Estimate the probability based on the training data

$P(\text{beach} \mid \text{nice}) = \dfrac{C(\text{nice beach})}{C(\text{nice})}$

where $C(\text{nice beach})$ is the count of “nice beach” in the training data and $C(\text{nice})$ is the count of “nice” in the training data.

Issue: some sequences may not appear in the training data
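To make the count-based estimate concrete, here is a minimal Python sketch (the toy corpus is an illustrative assumption):

```python
from collections import Counter

# Count-based bigram estimate: P(beach | nice) = C(nice beach) / C(nice).
corpus = "wreck a nice beach you sing a nice song".split()  # toy training data

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    """Maximum-likelihood bigram probability; 0 if the pair never appears."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

print(p_bigram("nice", "beach"))  # 0.5 in this toy corpus
print(p_bigram("nice", "party"))  # 0.0 -> the data-sparsity issue noted above
```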

(10)

N-Gram Language Modeling

Training data:

◦The dog ran ……

◦The cat jumped ……

P(jumped | dog) = 0   P(ran | cat) = 0

→ Smoothing: give these unseen pairs some small probability (e.g., 0.0001) instead of 0.

→ The resulting probability is not accurate.

→ The phenomenon happens because we cannot collect all the possible text in the world as training data.
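A minimal add-k smoothing sketch, reusing the counters from the previous snippet (the constant k and the vocabulary size V are assumptions):

```python
def p_bigram_smoothed(prev, word, bigram, unigram, V, k=0.001):
    """Every pair, seen or unseen, gets a small nonzero probability."""
    return (bigram[(prev, word)] + k) / (unigram[prev] + k * V)
```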

(11)

Outline

Language Modeling

◦N-gram Language Model

Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(12)

Neural Language Modeling

Idea: estimate not from counts, but from the NN prediction

[Figure: the same neural network applied at each step — vector of “START” → P(next word is “wreck”); vector of “wreck” → P(next word is “a”); vector of “a” → P(next word is “nice”); vector of “nice” → P(next word is “beach”)]

P(“wreck a nice beach”) = P(wreck|START)P(a|wreck)P(nice|a)P(beach|nice)
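A minimal sketch of this factorization; the helper next_word_prob, standing in for the network's per-step prediction, is hypothetical:

```python
def sentence_prob(words, next_word_prob):
    """words: e.g. ["wreck", "a", "nice", "beach"]; next_word_prob(prev, w) -> P(w | prev)."""
    prob, prev = 1.0, "START"
    for w in words:
        prob *= next_word_prob(prev, w)  # chain rule, one factor per word
        prev = w
    return prob
```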

(13)

Neural Language Modeling

Bengio et al., “A Neural Probabilistic Language Model,” in JMLR, 2003.

Issue: fixed context window for conditioning

[Figure: feed-forward neural LM — input, hidden, and output layers; the context vector is fed in and the output is the probability distribution of the next word]
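A minimal PyTorch sketch of such a fixed-window feed-forward LM in the spirit of Bengio et al. (2003); the vocabulary size, dimensions, and window length are assumptions, and the window argument is exactly the fixed context window flagged as an issue above:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=256, window=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # word -> vector
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)   # fixed-size context
        self.out = nn.Linear(hidden_dim, vocab_size)            # next-word scores

    def forward(self, context_ids):                 # (batch, window) word indices
        e = self.embed(context_ids)                 # (batch, window, emb_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))   # concatenate the window
        return self.out(h)                          # unnormalized scores

model = FeedForwardLM()
logits = model(torch.randint(0, 10000, (2, 3)))     # two 3-word contexts
probs = torch.softmax(logits, dim=-1)                # P(next word | window)
```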

(14)

Neural Language Modeling

The input-layer (or hidden-layer) representations of related words are close

◦If P(jump | dog) is large, P(jump | cat) increases accordingly (even if “… cat jump …” does not appear in the data)

[Figure: dog, cat, and rabbit lie close to each other in the hidden space (h1, h2)]

Smoothing is automatically done

(15)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(16)

Recurrent Neural Network

Idea: condition the neural network on all previous words and tie the weights at each time step

Assumption: temporal information matters

(17)

RNN Language Modeling

[Figure: an RNN applied at each step — vector of “START” → P(next word is “wreck”); vector of “wreck” → P(next word is “a”); vector of “a” → P(next word is “nice”); vector of “nice” → P(next word is “beach”) — with the hidden (context) vector passed from one step to the next; legend: input, hidden, output; context vector, word probability distribution]

Idea: pass the information from the previous hidden layer to leverage all contexts

(18)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(19)

RNNLM Formulation

At each time step, the network takes the vector of the current word as input and outputs the probability distribution of the next word.
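A standard way to write this per-step computation (the symbols and the exact parameterization are an assumption, chosen to match the s_t / o_t notation used on the following slides):

```latex
s_t = f(U x_t + W s_{t-1}), \qquad f \in \{\tanh, \mathrm{ReLU}\}
o_t = \mathrm{softmax}(V s_t) \approx P(w_{t+1} \mid w_1, \dots, w_t)
```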

(20)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(21)

Recurrent Neural Network Definition

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Activation function: tanh, ReLU
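A minimal numpy sketch of one recurrent step in this definition, following the s_t / o_t / U, W, V notation of the tutorial linked above (toy sizes and random initialization are assumptions):

```python
import numpy as np

V_SIZE, H = 8, 4                               # toy vocabulary and hidden sizes
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V_SIZE))    # input  -> hidden
W = rng.normal(scale=0.1, size=(H, H))         # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(V_SIZE, H))    # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x_t, s_prev):
    """x_t: one-hot word vector; s_prev: previous hidden state."""
    s_t = np.tanh(U @ x_t + W @ s_prev)        # the activation f (tanh or ReLU)
    o_t = softmax(V @ s_t)                     # distribution over the next word
    return s_t, o_t

s = np.zeros(H)                                # the "init" state
for word_id in [1, 5, 2]:                      # a toy word-id sequence
    s, o = step(np.eye(V_SIZE)[word_id], s)
```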

(22)

Model Training

All model parameters can be updated by backpropagating, at each time step, the error between the predicted output and the target (y_{t-1}, y_t, y_{t+1}, …).

[Figure: predicted outputs vs. target outputs at consecutive time steps]
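A minimal way to write the corresponding objective (the cross-entropy form is an assumption, consistent with the per-step costs C^(t) that appear on the BPTT slides below): the total cost sums the per-step costs comparing the predicted distribution o_t with the target y_t:

```latex
C = \sum_{t} C^{(t)}, \qquad C^{(t)} = -\sum_{k} y_{t,k} \log o_{t,k}
```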

(23)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(24)

Backpropagation

[Figure: neuron i in layer l connected to neuron j in layer l−1 through the weight w_ij]

Forward pass: compute the activation $a_j^{l-1}$ (the input $x^l$ to layer $l$) for every neuron, layer by layer.

Backward pass: compute the error signal $\delta_i^{l}$ for every neuron, propagated backward layer by layer.

The gradient with respect to a weight combines the two passes: $\dfrac{\partial C}{\partial w_{ij}^{l}} = a_j^{l-1}\,\delta_i^{l}$.

(25)

Backpropagation

[Figure: backward pass — the error signals are computed layer by layer, starting from the output layer L]

At the output layer: $\delta_i^{L} = \sigma'(z_i^{L})\,\dfrac{\partial C}{\partial y_i}$

At an earlier layer $l$: $\delta^{l} = \sigma'(z^{l}) \odot (W^{l+1})^{T}\,\delta^{l+1}$, applied repeatedly through $(W^{L})^{T}, (W^{L-1})^{T}, \dots$ until layer $l$ is reached.

(26)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t; Output: o_t; Target: y_t

[Figure: the unfolded network — init → s_1 → … → s_{t-2} → s_{t-1} → s_t → o_t, with x_1, …, x_t feeding the corresponding steps; the error signal ∂C/∂y is propagated back from the output]

(27)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t; Output: o_t; Target: y_t

[Figure: the same unfolded network — the error signal ∂C/∂y from the output components 1, 2, …, n is propagated back through the unfolded steps]

(28)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t; Output: o_t; Target: y_t

[Figure: the same unfolded network with the error signal ∂C/∂y propagated back through every step]

(29)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t; Output: o_t; Target: y_t

[Figure: the same weights (indices i, j) appear at every unfolded step]

Weights are tied together — the unfolded copies point to the same memory.

(30)

Backpropagation through Time (BPTT)

Unfold the network through time

Input: init, x_1, x_2, …, x_t; Output: o_t; Target: y_t

[Figure: the recurrent weights (indices i, j, k) also appear at every unfolded step]

Weights are tied together.

(31)

BPTT

Forward pass: compute s_1, s_2, s_3, s_4, … and the outputs o_1, o_2, o_3, o_4

Backward pass: for each per-step cost C^(1), C^(2), C^(3), C^(4) (comparing output o_t with target y_t), propagate the error back through the unfolded network

[Figure: unfolded network with inputs x_1–x_4, states s_1–s_4, outputs o_1–o_4, targets y_1–y_4, and costs C^(1)–C^(4)]
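A minimal numpy sketch of the two passes for a vanilla unfolded RNN (toy sizes, random data, and a per-step cross-entropy cost are assumptions); note how the gradients of the tied weights U, W, V are accumulated over all time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, H, T = 8, 4, 5
U = rng.normal(scale=0.1, size=(H, V_SIZE))    # input  -> hidden
W = rng.normal(scale=0.1, size=(H, H))         # hidden -> hidden
V = rng.normal(scale=0.1, size=(V_SIZE, H))    # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = np.eye(V_SIZE)[rng.integers(0, V_SIZE, T)]  # one-hot inputs x_1..x_T
ys = rng.integers(0, V_SIZE, T)                  # target word ids y_1..y_T

# Forward pass: compute s_1..s_T and o_1..o_T (kept for the backward pass).
s, o = {0: np.zeros(H)}, {}
for t in range(1, T + 1):
    s[t] = np.tanh(U @ xs[t - 1] + W @ s[t - 1])
    o[t] = softmax(V @ s[t])

# Backward pass: each step has its own cost C^(t); because the weights are tied
# across time, their gradients are accumulated over every time step.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
ds_next = np.zeros(H)                            # error flowing back from step t+1
for t in range(T, 0, -1):
    do = o[t].copy(); do[ys[t - 1]] -= 1.0       # dC^(t)/dlogits for cross-entropy
    dV += np.outer(do, s[t])
    ds = V.T @ do + ds_next                      # from the output and from step t+1
    da = ds * (1.0 - s[t] ** 2)                  # back through tanh
    dU += np.outer(da, xs[t - 1])
    dW += np.outer(da, s[t - 1])
    ds_next = W.T @ da                           # error signal passed to step t-1
```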

(32)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(33)

RNN Training Issue

The gradient is a product of Jacobian matrices, each associated with a step in the forward computation

Multiply the same matrix at each time step during backprop

The gradient becomes very small or very large quickly → vanishing or exploding gradient

Bengio et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. of Neural Networks, 1994. [link]

Pascanu et al., “On the difficulty of training recurrent neural networks,” in ICML, 2013. [link]
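A common remedy for the exploding side, proposed in the Pascanu et al. (2013) paper cited above, is to rescale the gradient when its norm exceeds a threshold; a minimal sketch (the threshold value is an assumption):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```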

(34)

Rough Error Surface

[Figure: the cost surface over two parameters w_1 and w_2]

The error surface is either very flat or very steep.

(35)

Vanishing/Exploding Gradient Example

[Figure: vanishing/exploding gradient example — how the gradient behaves after 1 step, 2 steps, 5 steps, 10 steps, 20 steps, and 50 steps]

(36)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(37)

How to Frame the Learning Problem?

The learning algorithm f maps the input domain X into the output domain Y: f: X → Y

Input domain: word, word sequence, audio signal, click logs

Output domain: single label, sequence tags, tree structure, probability distribution

Network design should leverage input and output domain properties

(38)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(39)

Input Domain – Sequence Modeling

Idea: aggregate the meaning from all words into a vector

Method:

Basic combination: average, sum

Neural combination:

Recursive neural network (RvNN)

Recurrent neural network (RNN)

Convolutional neural network (CNN)

[Figure: how to compute an N-dim vector for a sentence containing 規格 (specification) and 誠意 (sincerity)]
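A minimal sketch of the basic combination above (random vectors stand in for real word embeddings):

```python
import numpy as np

word_vectors = np.random.randn(4, 50)         # 4 words, 50-dim embeddings
sentence_vec_avg = word_vectors.mean(axis=0)  # average combination
sentence_vec_sum = word_vectors.sum(axis=0)   # sum combination
```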

(40)

Sentiment Analysis

Encode the sequential input into a vector using RNN

[Figure: word vectors x_1, x_2, …, x_N (e.g., 規格 “specification”, 誠意 “sincerity”) are fed into an RNN; the resulting sentence vector is mapped to outputs y_1, y_2, …, y_M]

RNN considers temporal information to learn sentence vectors as the input of classification tasks
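A minimal PyTorch sketch of this idea, using the final RNN hidden state as the sentence vector for a classifier (sizes and the two-class output are assumptions):

```python
import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_ids):                    # (batch, seq_len) word indices
        _, h_last = self.rnn(self.embed(word_ids))  # h_last: final hidden state
        return self.classifier(h_last.squeeze(0))   # sentence vector -> class scores

model = RNNSentimentClassifier()
logits = model(torch.randint(0, 10000, (2, 7)))     # two 7-word sentences
```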

(41)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(42)

Output Domain – Sequence Prediction

POS Tagging: “推薦我台大後門的餐廳” (recommend a restaurant by NTU’s back gate) → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN

Speech Recognition: (speech audio) → “大家好” (hello everyone)

Machine Translation: “How are you doing today?” → “你好嗎?”

The output can be viewed as a sequence of classifications.

(43)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)


(44)

POS Tagging

Tag a word at each timestamp

◦Input: word sequence

◦Output: corresponding POS tag sequence

[Figure: example word sequence including 四樓 and 專業 with the POS tags N, VA, AD]

(45)

Natural Language Understanding (NLU)

Tag a word at each timestamp

◦Input: word sequence

◦Output: IOB-format slot tag and intent tag

<START> just sent email to bob about fishing this weekend <END>

Slot tags (IOB): just/O sent/O email/O to/O bob/B-contact_name about/O fishing/B-subject this/I-subject weekend/I-subject

Intent: send_email

→ send_email(contact_name=“bob”, subject=“fishing this weekend”)

Temporal orders for input and output are the same
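A minimal sketch (not from the slides) of how the IOB slot tags above can be decoded back into the send_email(...) frame; the decode_iob helper is hypothetical:

```python
def decode_iob(tokens, tags):
    """Collect B-/I- tagged spans into slot -> value pairs."""
    slots, name, value = {}, None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if name:
                slots[name] = " ".join(value)
            name, value = tag[2:], [tok]
        elif tag.startswith("I-") and name == tag[2:]:
            value.append(tok)
        else:                                   # "O" tag ends any open span
            if name:
                slots[name] = " ".join(value)
            name, value = None, []
    if name:
        slots[name] = " ".join(value)
    return slots

tokens = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
print(decode_iob(tokens, tags))
# {'contact_name': 'bob', 'subject': 'fishing this weekend'}
```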

(46)

Outline

Language Modeling

◦N-gram Language Model

◦Feed-Forward Neural Language Model

◦Recurrent Neural Network Language Model (RNNLM)

Recurrent Neural Network

◦Definition

◦Training via Backpropagation through Time (BPTT)

◦Training Issue

Applications

◦Sequential Input

◦Sequential Output

Aligned Sequential Pairs (Tagging)

Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

(47)

Machine Translation

Cascade two RNNs, one for encoding and one for decoding

Input: word sequences in the source language

Output: word sequences in the target language

[Figure: the encoder RNN reads the source sentence and the decoder RNN generates the target words, e.g., 超棒 (awesome) 醬汁 (sauce)]

(48)

Chit-Chat Dialogue Modeling

Cascade two RNNs, one for encoding and one for decoding

Input: word sequences in the question

Output: word sequences in the response

Temporal ordering for input and output may be different
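A minimal PyTorch sketch of cascading an encoder RNN and a decoder RNN as described on these two slides (vocabulary sizes, dimensions, and teacher forcing on the gold target words are assumptions):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the source sequence; its final state summarizes it.
        _, h = self.encoder(self.src_embed(src_ids))
        # The decoder starts from that state and predicts the target sequence
        # (teacher forcing: the gold target words are fed in during training).
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab)

model = Seq2Seq()
logits = model(torch.randint(0, 8000, (2, 6)),   # source word ids
               torch.randint(0, 8000, (2, 5)))   # target word ids
```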

(49)

Concluding Remarks

Language Modeling

◦RNNLM

Recurrent Neural Networks

Definition

Backpropagation through Time (BPTT)

Vanishing/Exploding Gradient

Applications

Sequential Input: Sequence-Level Embedding

Sequential Output: Tagging / Seq2Seq (Encoder-Decoder)

