Slide credits: Shawn
Review
Task-Oriented Dialogue System (Young, 2000)

Speech Recognition: speech signal → hypothesis
  "are there any action movies to see this weekend"
Language Understanding (LU)
• Domain Identification
• User Intent Detection
• Slot Filling
  → semantic frame: request_movie(genre=action, date=this weekend)
Dialogue Management (DM)
• Dialogue State Tracking (DST)
• Dialogue Policy
  → system action/policy: request_location (consulting the backend database / knowledge providers)
Natural Language Generation (NLG)
  → text response: "Where are you located?"

Text input (e.g., "Are there any action movies to see this weekend?") can enter the pipeline directly, bypassing speech recognition.

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Language Modeling
Goal: estimate the probability of a word sequence
Example task: determine whether a sequence is grammatical or makes more sense

Speech recognition example: between the acoustically similar hypotheses "recognize speech" and "wreck a nice beach", output "recognize speech" if P(recognize speech) > P(wreck a nice beach)
N-Gram Language Modeling
Goal: estimate the probability of a word sequence
N-gram language model
Probability is conditioned on a window of (n-1) previous words
Estimate the probability based on the training data
P(beach | nice) = C(nice beach) / C(nice)
(C(nice beach): count of "nice beach" in the training data; C(nice): count of "nice" in the training data)
Issue: some sequences may not appear in the training data
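A minimal sketch of this count-based estimation for bigrams (the toy corpus and function names are illustrative):

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams in a corpus given as a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """P(word | prev) = C(prev word) / C(prev); zero if the pair is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

unigrams, bigrams = train_bigram_lm(["it is a nice beach", "nice beach day"])
print(bigram_prob(unigrams, bigrams, "nice", "beach"))  # C(nice beach)/C(nice) = 1.0
```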
N-Gram Language Modeling

Training data:
  The dog ran ……
  The cat jumped ……

With pure counting: P(jumped | dog) = 0 and P(ran | cat) = 0, so the probabilities are not accurate. This happens because we cannot collect all possible text in the world as training data.

Smoothing: give unseen n-grams some small probability, e.g., P(jumped | dog) ≈ 0.0001 and P(ran | cat) ≈ 0.0001.
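A sketch of add-k (Laplace-style) smoothing on top of the bigram counts above; the value of k is an illustrative choice, not the only option:

```python
def smoothed_bigram_prob(unigrams, bigrams, prev, word, k=0.1):
    """Add-k smoothing: every bigram gets pseudo-count k, so unseen
    pairs such as ("dog", "jumped") receive a small nonzero probability."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)
```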
Neural Language Modeling
Idea: estimate the probability not from counts but from neural network predictions

At each step, a neural network reads the current word's vector and outputs a distribution over the next word:
NN(vector of "START") → P(next word is "wreck")
NN(vector of "wreck") → P(next word is "a")
NN(vector of "a") → P(next word is "nice")
NN(vector of "nice") → P(next word is "beach")

P("wreck a nice beach") = P(wreck | START) · P(a | wreck) · P(nice | a) · P(beach | nice)
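In general, the language model factorizes the sequence probability via the chain rule, with one next-word distribution per position:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$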
Neural Language Modeling

Bengio et al., "A Neural Probabilistic Language Model," JMLR, 2003.

[Figure: feed-forward NNLM — the context word vectors feed a hidden layer, and the output layer gives the probability distribution of the next word.]

Issue: fixed context window for conditioning the input
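A minimal PyTorch sketch of this fixed-window architecture (the two-word window and layer sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class FixedWindowNNLM(nn.Module):
    """Feed-forward LM in the style of Bengio et al. (2003):
    embed a fixed window of context words, concatenate them into a
    context vector, and predict a distribution over the next word."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, window=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):              # context: (batch, window) word ids
        e = self.embed(context)              # (batch, window, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return torch.log_softmax(self.out(h), dim=-1)  # log P(next word)

model = FixedWindowNNLM(vocab_size=10000)
log_probs = model(torch.tensor([[42, 7]]))   # P(next | previous two words)
```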
Neural Language Modeling

The input-layer (or hidden-layer) vectors of related words are close to each other. If P(jump | dog) is large, P(jump | cat) increases accordingly, even if "… cat jump …" never appears in the data.

[Figure: "dog", "cat", and "rabbit" lie close together in the hidden space (h1, h2).]

Smoothing is done automatically
RNNLM
Idea: condition the neural network on all previous words and tie the weights at each time step
Assumption: temporal information matters
[Figure: RNN LM unrolled over "START wreck a nice beach" — at each step the current word vector and the previous hidden layer's context vector produce the next-word distribution.]

Idea: pass information from the previous hidden layer forward so that every prediction leverages the full context
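A minimal PyTorch sketch of an RNN LM; weight tying across time steps is inherent in the recurrent cell (sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """RNN language model: the same cell is applied at every time
    step, so the hidden state can carry all previous context."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))  # (batch, seq_len, hidden_dim)
        return torch.log_softmax(self.out(h), dim=-1)  # per-step next-word log-probs

model = RNNLM(vocab_size=10000)
log_probs = model(torch.tensor([[1, 42, 7, 99]]))  # one distribution per position
```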
Natural Language Generation
Traditional Approaches
Natural Language Generation (NLG)
Mapping dialogue acts into natural language
inform(name=Seven_Days, foodtype=Chinese)
→ "Seven Days is a nice Chinese restaurant"
Template-Based NLG
Define a set of rules to map frames to NL
Pros: simple, error-free, easy to control
Cons: time-consuming, rigid, poor scalability

Semantic Frame → Natural Language
confirm() → "Please tell me more about the product you are looking for."
confirm(area=$V) → "Do you want somewhere in the $V?"
confirm(food=$V) → "Do you want a $V restaurant?"
confirm(food=$V, area=$W) → "Do you want a $V restaurant in the $W?"
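A minimal sketch of this rule-based mapping in Python (the frame encoding and template table are illustrative):

```python
# Hypothetical template table keyed by (act, sorted slot names).
TEMPLATES = {
    ("confirm", ()): "Please tell me more about the product you are looking for.",
    ("confirm", ("area",)): "Do you want somewhere in the {area}?",
    ("confirm", ("food",)): "Do you want a {food} restaurant?",
    ("confirm", ("area", "food")): "Do you want a {food} restaurant in the {area}?",
}

def template_nlg(act, **slots):
    """Look up the rule for this act/slot combination and fill in the values."""
    template = TEMPLATES[(act, tuple(sorted(slots)))]
    return template.format(**slots)

print(template_nlg("confirm", food="Chinese", area="centre"))
# -> Do you want a Chinese restaurant in the centre?
```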
Class-Based LM NLG
(Oh and Rudnicky, 2000)
Class-based language modeling
NLG by decoding
Pros: easy to implement/understand, simple rules
Cons: computationally inefficient

Classes: inform_area, inform_address, …, request_area, request_postcode
http://dl.acm.org/citation.cfm?id=1117568
Phrase-Based NLG
(Mairesse et al., 2010)

[Figure: semantic DBN and phrase DBN — semantic stacks aligned with realization phrases.]

Inform(name=Charlie Chan, food=Chinese, type=restaurant, near=Cineworld, area=centre)
→ "Charlie Chan is a Chinese restaurant near Cineworld in the centre"

Pros: efficient, good performance
Cons: requires semantic alignments
http://dl.acm.org/citation.cfm?id=1858838
Natural Language Generation
Deep Learning Approaches
RNN-Based LM NLG
(Wen et al., 2015)

Dialogue act: Inform(name=Din Tai Fung, food=Taiwanese)
→ 1-hot representation: 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, 0, 0, 0…

Delexicalisation: "<BOS> Din Tai Fung serves Taiwanese ." → "<BOS> SLOT_NAME serves SLOT_FOOD ."

Input: <BOS> SLOT_NAME serves SLOT_FOOD .
Output: SLOT_NAME serves SLOT_FOOD . <EOS>

The generator is conditioned on the dialogue act, with slot weight tying.
http://www.anthology.aclweb.org/W/W15/W15-46.pdf#page=295
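A minimal sketch of the delexicalisation/relexicalisation steps (the helper functions and slot-token naming convention are illustrative):

```python
def delexicalise(sentence, slots):
    """Replace slot values with placeholder tokens, e.g.
    'Din Tai Fung serves Taiwanese .' -> 'SLOT_NAME serves SLOT_FOOD .'"""
    for slot, value in slots.items():
        sentence = sentence.replace(value, f"SLOT_{slot.upper()}")
    return sentence

def relexicalise(sentence, slots):
    """Inverse step after generation: fill placeholders with actual values."""
    for slot, value in slots.items():
        sentence = sentence.replace(f"SLOT_{slot.upper()}", value)
    return sentence

slots = {"name": "Din Tai Fung", "food": "Taiwanese"}
template = delexicalise("Din Tai Fung serves Taiwanese .", slots)
print(template)                        # SLOT_NAME serves SLOT_FOOD .
print(relexicalise(template, slots))  # Din Tai Fung serves Taiwanese .
```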
Handling Semantic Repetition
Issue: semantic repetition
Din Tai Fung is a great Taiwanese restaurant that serves Taiwanese.
Din Tai Fung is a child friendly restaurant, and also allows kids.
Deficiency in either model or decoding (or both)
Mitigation
Post-processing rules (Oh & Rudnicky, 2000)
Gating mechanism (Wen et al., 2015)
Attention (Mei et al., 2016; Wen et al., 2015)
Visualization
Semantic Conditioned LSTM (Wen et al., 2015)

[Figure: the original LSTM cell (gates i_t, f_t, o_t; states C_t, h_t; inputs x_t, h_t-1) is augmented with a dialogue-act (DA) cell: a reading gate r_t updates the DA vector d_t from d_t-1, and d_t modifies the cell state C_t.]

The DA vector is initialised from the dialogue act, e.g. Inform(name=Seven_Days, food=Chinese) → 1-hot representation 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, … = d_0

Idea: use a gating mechanism to control the generated semantics (dialogue act/slots)
http://www.aclweb.org/anthology/D/D15/D15-1199.pdf
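A sketch of the DA-cell updates following Wen et al. (2015), where $w_t$ is the input word vector, $\hat{c}_t$ the candidate cell value, $\odot$ the element-wise product, and $\alpha$ a scaling constant; the standard LSTM gates $i_t, f_t, o_t$ are unchanged:

$$r_t = \sigma(W_{wr} w_t + \alpha W_{hr} h_{t-1})$$
$$d_t = r_t \odot d_{t-1}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t + \tanh(W_{dc} d_t)$$
$$h_t = o_t \odot \tanh(c_t)$$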
Attentive Encoder-Decoder for NLG
Slot & value embedding
Attentive meaning representation
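One common way to realize the attentive meaning representation, assuming additive attention over slot-value embeddings $z_i$ given the decoder state $h_{t-1}$ (this parameterisation is an assumption, not necessarily the slide's exact model):

$$\alpha_{t,i} = \operatorname{softmax}_i\!\left(v^\top \tanh(W z_i + U h_{t-1})\right), \qquad m_t = \sum_i \alpha_{t,i}\, z_i$$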
Attention Heat Map
Model Comparison

[Figure: BLEU (left) and ERR (right) versus the percentage of training data (1%, 10%, 100%) for the hlstm, sclstm, and encdec models.]
Structural NLG
(Dušek and Jurčíček, 2016)
Goal: NLG based on the syntax tree
Encode trees as sequences
Seq2Seq model for generation
https://www.aclweb.org/anthology/P/P16/P16-2.pdf#page=79
Contextual NLG
(Dušek and Jurčíček, 2016)
Goal: adapt to the user's way of speaking, providing context-aware responses
Context encoder
Seq2Seq model
https://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=203
Decoder Sampling Strategy
Decoding procedure
Greedy search
Beam search
Random search
Example: Inform(name=Din Tai Fung, food=Taiwanese)
→ 1-hot dialogue act: 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, 0, 0, 0…
→ decoded output: SLOT_NAME serves SLOT_FOOD . <EOS>
Greedy Search
Select the next word with the highest probability
Beam Search
Select the k best next words and keep a beam of width k for subsequent decoding
Random Search
Randomly sample the next word
Higher diversity
Can follow the model's probability distribution
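A runnable sketch of all three strategies over a toy next-word table; the table stands in for a real decoder's per-step distribution, and the probabilities are illustrative:

```python
import math
import random

# Toy per-step distribution {prev_word: {next_word: log-prob}};
# in practice this would come from the RNN/seq2seq decoder.
NEXT = {
    "<BOS>": {"SLOT_NAME": math.log(0.8), "it": math.log(0.2)},
    "SLOT_NAME": {"serves": math.log(0.9), "is": math.log(0.1)},
    "it": {"serves": math.log(1.0)},
    "is": {"<EOS>": math.log(1.0)},
    "serves": {"SLOT_FOOD": math.log(1.0)},
    "SLOT_FOOD": {".": math.log(1.0)},
    ".": {"<EOS>": math.log(1.0)},
}

def greedy_decode(max_len=10):
    """Greedy search: pick the single most probable word at each step."""
    seq = ["<BOS>"]
    while seq[-1] != "<EOS>" and len(seq) < max_len:
        probs = NEXT[seq[-1]]
        seq.append(max(probs, key=probs.get))
    return seq

def beam_decode(k=2, max_len=10):
    """Beam search: keep the k highest-scoring partial sequences per step."""
    beams = [(0.0, ["<BOS>"])]               # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "<EOS>":
                candidates.append((score, seq))
            else:
                for word, lp in NEXT[seq[-1]].items():
                    candidates.append((score + lp, seq + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams[0][1]

def sample_decode(max_len=10):
    """Random search: sample the next word following the distribution."""
    seq = ["<BOS>"]
    while seq[-1] != "<EOS>" and len(seq) < max_len:
        words = list(NEXT[seq[-1]])
        weights = [math.exp(NEXT[seq[-1]][w]) for w in words]
        seq.append(random.choices(words, weights=weights)[0])
    return seq

print(greedy_decode())  # ['<BOS>', 'SLOT_NAME', 'serves', 'SLOT_FOOD', '.', '<EOS>']
```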
Chit-Chat Generation
Chit-Chat Bot
Neural conversational model
Non-task-oriented
Many-to-Many
Both input and output are sequences → sequence-to-sequence learning
E.g., machine translation ("machine learning" → 機器學習)

[Figure: seq2seq model — the encoder reads "machine learning"; the decoder emits 機 器 學 習.]

[Ilya Sutskever, NIPS'14] [Dzmitry Bahdanau, arXiv'15]
A Neural Conversational Model
Seq2Seq
[Vinyals and Le, 2015]
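A minimal PyTorch sketch of a seq2seq response generator in this spirit (greedy decoding; layer sizes and interface are illustrative, not Vinyals and Le's exact setup):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder reads the input utterance; decoder generates the reply
    conditioned on the encoder's final hidden state."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, bos_id, eos_id, max_len=20):
        _, h = self.encoder(self.embed(src))        # h: (1, batch, hidden_dim)
        token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        reply = []
        for _ in range(max_len):
            dec_out, h = self.decoder(self.embed(token), h)
            token = self.out(dec_out).argmax(-1)    # greedy next token
            reply.append(token)
            if (token == eos_id).all():
                break
        return torch.cat(reply, dim=1)

model = Seq2Seq(vocab_size=10000)
reply_ids = model(torch.tensor([[5, 17, 42]]), bos_id=1, eos_id=2)
```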
Chit-Chat Bot
Trained on TV series (~40,000 sentences) and U.S. presidential election debates
Sci-Fi Short Film - SUNSPRING
https://www.youtube.com/watch?v=LY7x2Ihqj37
Concluding Remarks
The three pillars of deep learning for NLG
Distributed representation – generalization
Recurrent connection – long-term dependency
Conditional RNN – flexibility/creativity
Useful techniques in deep learning for NLG
Learnable gates
Attention mechanism
Generating longer/complex sentences
Phrasing dialogue as a conditional generation problem:
Conditioning on the raw input sentence → chit-chat bot
Conditioning on both structured and unstructured sources → task-completing dialogue system