Dialogue Policy Optimization
Dialogue management in a RL framework
[Figure: RL framework for dialogue. The system agent receives observation O from the user, takes action A, and obtains reward R]
Reward for RL ≅ Evaluation for System
Dialogue is a special RL task
A human is involved both in the interaction and in rating (evaluating) the dialogue
Fully human-in-the-loop framework
Rating: correctness, appropriateness, and adequacy
- Expert rating: high quality, high cost
- User rating: unreliable quality, medium cost
- Objective rating: checks desired aspects, low cost
Material: http://deepdialogue.miulab.tw
Reinforcement Learning for Dialogue Policy Optimization
[Figure: RL-based dialogue loop. Language understanding maps the user input (o) to state s; the dialogue policy a = 𝜋(s) selects system action a; language (response) generation produces the response; reward tuples (s, a, r, s') are collected to optimize Q(s, a)]
Type of Bots | State | Action | Reward
Social ChatBots | Chat history | System response | # of turns maximized; intrinsically motivated reward
InfoBots (interactive Q/A) | User current question + context | Answers to current question | Relevance of answer; # of turns minimized
Task-Completion Bots | User current input + context | System dialogue act w/ slot values (or API calls) | Task success rate; # of turns minimized
Goal: develop a generic deep RL algorithm to learn dialogue policy for all bot categories
Dialogue Reinforcement Learning Signal
Typical reward function
A per-turn penalty of -1
Large reward at completion if successful
Typically requires domain knowledge
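The typical reward function above can be sketched as follows; the numeric values are illustrative assumptions, not fixed by the tutorial.

```python
# Sketch of a typical task-completion dialogue reward: a small per-turn
# penalty encourages short dialogues, and a large terminal bonus is
# granted only on task success. Values are hypothetical.

TURN_PENALTY = -1      # assumed per-turn cost
SUCCESS_REWARD = 20    # assumed terminal bonus at successful completion

def turn_reward(is_final: bool, task_success: bool) -> int:
    """Reward emitted after one dialogue turn."""
    if is_final and task_success:
        return SUCCESS_REWARD + TURN_PENALTY
    return TURN_PENALTY

def episode_return(num_turns: int, task_success: bool) -> int:
    """Undiscounted return of a whole dialogue."""
    return sum(turn_reward(t == num_turns - 1, task_success)
               for t in range(num_turns))
```

A successful 5-turn dialogue thus yields 20 - 5 = 15, while any failed dialogue accumulates only penalties, which is the signal the policy learner optimizes.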
✔ Simulated user
✔ Paid users (Amazon Mechanical Turk)
✖ Real users
The user simulator is usually required for dialogue system training before deployment
Neural Dialogue Manager (Li et al., 2017)
Deep Q-network for training DM policy
Input: current semantic frame observation, database returned results
Output: system action
Example: the simulated user sends the semantic frame request_movie(genre=action, date=this weekend); the DQN-based dialogue management (DM), consulting the backend DB, replies with system action/policy request_location.
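As a minimal illustration of learning Q(s, a) from collected (s, a, r, s') tuples, here is a tabular Q-learning sketch. This is not the paper's neural DQN, and the action set is a hypothetical example.

```python
# Tabular Q-learning sketch for a dialogue policy: states are hashed
# observations (e.g., serialized semantic frames), actions are system
# dialogue acts, and Q(s, a) is updated from (s, a, r, s') tuples.
from collections import defaultdict

ACTIONS = ["request_location", "request_date", "inform_result"]  # assumed act set

class TabularDialoguePolicy:
    def __init__(self, alpha=0.5, gamma=0.9):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.alpha, self.gamma = alpha, gamma

    def act(self, state):
        """Greedy action a = argmax_a Q(s, a)."""
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        """One-step TD update toward r + gamma * max_a' Q(s', a')."""
        target = r + self.gamma * max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```

A DQN replaces the table with a neural network over the semantic-frame observation and DB results, but the update target has the same form.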
https://arxiv.org/abs/1703.01008
SL + RL for Sample Efficiency (Su et al., 2017)
Issues with RL for DM
slow learning speed
cold start
Solutions
Sample-efficient actor-critic
Off-policy learning with experience replay
Better gradient update
Utilizing supervised data
Pretrain the model with SL and then fine-tune with RL
Mix SL and RL data during RL learning
Combine both
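One of the strategies above, mixing SL and RL data during RL learning, can be sketched as a minibatch sampler. The transition format and demonstration fraction are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: draw minibatches that mix expert (supervised) demonstrations
# with transitions from the RL experience-replay buffer, so the policy
# keeps anchoring on good behavior while exploring.
import random

def sample_batch(demo_data, rl_buffer, batch_size=8, demo_fraction=0.25):
    """Return a minibatch mixing demonstrations and RL transitions."""
    n_demo = int(batch_size * demo_fraction)
    n_rl = batch_size - n_demo
    batch = random.sample(demo_data, min(n_demo, len(demo_data)))
    batch += random.sample(rl_buffer, min(n_rl, len(rl_buffer)))
    return batch
```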
https://arxiv.org/pdf/1707.00130.pdf (Su et al., SIGDIAL 2017)
Online Training (Su et al., 2015; Su et al., 2016)
Policy learning from real users
Infer reward directly from dialogues
(Su et al., 2015)
User rating
(Su et al., 2016)
Reward modeling on user binary success rating
[Figure: an embedding function maps the dialogue to a dialogue representation; a reward model classifies success/fail to provide the reinforcement signal, querying user ratings when needed]
http://www.anthology.aclweb.org/W/W15/W15-46.pdf; https://www.aclweb.org/anthology/P/P16/P16-1230.pdf
Interactive RL for DM (Shah et al., 2016)
Use a third agent to provide immediate, interactive feedback to the DM
https://research.google.com/pubs/pub45734.html
Interpreting Interactive Feedback (Shah et al., 2016)
https://research.google.com/pubs/pub45734.html
Dialogue Management Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward
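The dialogue-level metrics above can be computed directly from logged dialogues; the record format used here is an illustrative assumption.

```python
# Sketch: compute task success rate and average reward over a set of
# logged dialogues, assuming a turn-penalty + success-bonus reward.

def task_success_rate(dialogues):
    """Fraction of dialogues whose task was completed."""
    return sum(d["success"] for d in dialogues) / len(dialogues)

def average_reward(dialogues, turn_penalty=-1, success_reward=20):
    """Mean dialogue return under the assumed reward scheme."""
    returns = [d["turns"] * turn_penalty + (success_reward if d["success"] else 0)
               for d in dialogues]
    return sum(returns) / len(returns)
```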
Outline
Introduction
Background Knowledge
Neural Network Basics
Reinforcement Learning
Modular Dialogue System
Spoken/Natural Language Understanding (SLU/NLU)
Dialogue Management
Dialogue State Tracking (DST)
Dialogue Policy Optimization
Natural Language Generation (NLG)
Evaluation
Recent Trends and Challenges
End-to-End Neural Dialogue System
Multimodality
Dialogue Breadth
Dialogue Depth
Natural Language Generation (NLG)
Mapping semantic frame into natural language
inform(name=Seven_Days, foodtype=Chinese) → "Seven Days is a nice Chinese restaurant"
Template-Based NLG
Define a set of rules to map frames to NL
Pros: simple, error-free, easy to control
Cons: time-consuming, poor scalability

Semantic Frame | Natural Language
confirm() | "Please tell me more about the product you are looking for."
confirm(area=$V) | "Do you want somewhere in the $V?"
confirm(food=$V) | "Do you want a $V restaurant?"
confirm(food=$V, area=$W) | "Do you want a $V restaurant in the $W?"
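A template-based realizer for the table above fits in a few lines; the lookup-key scheme (act name plus sorted slot names) is an implementation assumption.

```python
# Sketch of a template-based NLG realizer: a hand-written table maps
# each dialogue act + slot combination to a natural-language pattern,
# and slot values are filled in at generation time.

TEMPLATES = {
    ("confirm",): "Please tell me more about the product you are looking for.",
    ("confirm", "area"): "Do you want somewhere in the {area}?",
    ("confirm", "food"): "Do you want a {food} restaurant?",
    ("confirm", "area", "food"): "Do you want a {food} restaurant in the {area}?",
}

def realize(act, slots):
    """Map a dialogue act and slot values to a sentence via templates."""
    key = (act, *sorted(slots))   # deterministic key: act + sorted slot names
    return TEMPLATES[key].format(**slots)
```

The cons are visible immediately: every act/slot combination needs its own hand-written entry, so the table grows with the domain.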
Plan-Based NLG (Walker et al., 2002)
Divide the problem into a pipeline
Statistical sentence plan generator (Stent et al., 2009)
Statistical surface realizer (Dethlefs et al., 2013; Cuayáhuitl et al., 2014; …)

Example: Inform(name=Z_House, price=cheap) → "Z House is a cheap restaurant."

Pros: can model complex linguistic structures
Cons: heavily engineered, requires domain knowledge

[Pipeline: Sentence Plan Generator → Sentence Plan Reranker → Surface Realizer, operating over a syntactic tree]
Class-Based LM NLG (Oh and Rudnicky, 2000)
Class-based language modeling
NLG by decoding
Pros: easy to implement/understand, simple rules
Cons: computationally inefficient

Classes: inform_area, inform_address, …, request_area, request_postcode
http://dl.acm.org/citation.cfm?id=1117568
Phrase-Based NLG (Mairesse et al., 2010)
[Figure: a dynamic Bayesian network (DBN) maps semantic stacks to phrases]
Example: Inform(name=Charlie_Chan, food=Chinese, type=restaurant, near=Cineworld, area=centre) → "Charlie Chan is a Chinese Restaurant near Cineworld in the centre"

Pros: efficient, good performance
Cons: requires semantic alignments between realization phrases and semantic stacks
http://dl.acm.org/citation.cfm?id=1858838
RNN-Based LM NLG (Wen et al., 2015)
Input: Inform(name=Din_Tai_Fung, food=Taiwanese), encoded as a dialogue-act 1-hot representation (0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, 0, 0, 0, …) that conditions the RNN.
Delexicalisation: "<BOS> Din Tai Fung serves Taiwanese ." → "<BOS> SLOT_NAME serves SLOT_FOOD ."
Output: conditioned on the dialogue act and using slot weight tying, the RNN generates "SLOT_NAME serves SLOT_FOOD . <EOS>" token by token; slot tokens are then filled back in.
http://www.anthology.aclweb.org/W/W15/W15-46.pdf#page=295
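The delexicalisation step above can be sketched as simple string substitution. This is a toy version; real systems match slot values against the database or ontology rather than raw strings.

```python
# Sketch of delexicalisation for RNN-LM NLG: slot values in training
# sentences are replaced by SLOT_<NAME> placeholders so the model learns
# slot-independent patterns; generated placeholders are filled back in
# from the dialogue act at test time.

def delexicalise(sentence, slots):
    """Replace slot values with SLOT_<NAME> placeholder tokens."""
    for name, value in slots.items():
        sentence = sentence.replace(value, f"SLOT_{name.upper()}")
    return sentence

def relexicalise(sentence, slots):
    """Fill placeholder tokens back with the act's slot values."""
    for name, value in slots.items():
        sentence = sentence.replace(f"SLOT_{name.upper()}", value)
    return sentence
```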
Handling Semantic Repetition
Issue: semantic repetition
Din Tai Fung is a great Taiwanese restaurant that serves Taiwanese.
Din Tai Fung is a child friendly restaurant, and also allows kids.
Deficiency in either model or decoding (or both)
Mitigation
Post-processing rules (Oh & Rudnicky, 2000)
Gating mechanism (Wen et al., 2015)
Attention (Mei et al., 2016; Wen et al., 2015)
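A post-processing rule in the spirit of Oh & Rudnicky (2000) can be sketched as reranking sampled candidates by their slot-mention counts; this heuristic is an illustration, not the original system.

```python
# Sketch: among candidate realizations, prefer the one that mentions
# each required slot value exactly once, penalizing repetitions and
# omissions equally.

def slot_mention_counts(sentence, slot_values):
    """Count occurrences of each slot value in the sentence."""
    return {v: sentence.count(v) for v in slot_values}

def pick_best(candidates, slot_values):
    """Choose the candidate whose slot mentions are closest to one each."""
    def penalty(s):
        return sum(abs(c - 1) for c in slot_mention_counts(s, slot_values).values())
    return min(candidates, key=penalty)
```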
Semantically Conditioned LSTM (Wen et al., 2015)
Original LSTM cell plus a dialogue act (DA) cell that modifies the cell state Ct
[Figure: an LSTM cell with input, forget, and output gates (it, ft, ot) over xt and ht-1, extended with a reading gate rt; the DA cell carries dt-1 → dt, which feeds into Ct]
Example: Inform(name=Seven_Days, food=Chinese) → dialogue-act 1-hot representation d0 = (0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, …)
Idea: using gate mechanism to control the generated semantics (dialogue act/slots)
http://www.aclweb.org/anthology/D/D15/D15-1199.pdf
Structural NLG (Dušek and Jurčíček, 2016)
Goal: NLG based on the syntax tree
Encode trees as sequences
Seq2Seq model for generation
https://www.aclweb.org/anthology/P/P16/P16-2.pdf#page=79
Contextual NLG (Dušek and Jurčíček, 2016)
Goal: adapting users’ way of speaking, providing context-aware responses
Context encoder
Seq2Seq model
https://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=203
Controlled Text Generation (Hu et al., 2017)
Idea: NLG based on generative adversarial network (GAN) framework
c: targeted sentence attributes
https://arxiv.org/pdf/1703.00955.pdf
NLG Evaluation
Metrics
Subjective: human judgement
(Stent et al., 2005)
Adequacy: correct meaning
Fluency: linguistic fluency
Readability: fluency in the dialogue context
Variation: multiple realizations for the same concept
Objective: automatic metrics
Word overlap: BLEU (Papineni et al., 2002), METEOR, ROUGE
Word embedding based: vector extrema, greedy matching, embedding average
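The embedding-average metric can be sketched as follows; the toy 2-d word vectors are invented for illustration.

```python
# Sketch of the embedding-average metric: average the word vectors of
# hypothesis and reference, then score with cosine similarity.
import math

EMB = {  # hypothetical 2-d word vectors
    "nice": [1.0, 0.0], "great": [0.9, 0.1],
    "restaurant": [0.0, 1.0], "place": [0.1, 0.9],
}

def avg_embedding(tokens):
    """Mean of the in-vocabulary word vectors."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def embedding_average_score(hyp, ref):
    return cosine(avg_embedding(hyp.split()), avg_embedding(ref.split()))
```

Unlike word overlap, this rewards "nice restaurant" vs. "great place" despite zero shared words, which is exactly the gap such metrics try to close.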
There is a gap between human perception and automatic metrics
Evaluation
Dialogue System Evaluation
Dialogue model evaluation
Crowd sourcing
User simulator
Response generator evaluation
Word overlap metrics
Embedding based metrics
Crowdsourcing for Dialogue System Evaluation (Yang et al., 2012)
http://www-scf.usc.edu/~zhaojuny/docs/SDSchapter_final.pdf
The normalized mean scores of Q2 and Q5 for approved ratings in each category; a higher score maps to a higher level of task success.
User Simulation
Goal: generate natural and reasonable conversations to enable reinforcement learning for exploring the policy space
Approach
Rule-based: crafted by experts (Li et al., 2016)
Learning-based (Schatzmann et al., 2006; El Asri et al., 2016; Crook and Marin, 2017)

[Figure: a simulated user, trained from a dialogue corpus, interacts with the dialogue management (DM) module (dialogue state tracking (DST) and dialogue policy) in place of a real user]
Elements of User Simulation
[Figure: the user simulation (user model + reward model) outputs a distribution over user dialogue acts (semantic frames); an error model injects recognition and LU errors before these acts reach the DM (dialogue state tracking and dialogue policy optimization), which consults backend action/knowledge providers; the reward model supplies the reward]
The error model enables the system to maintain robustness.
Rule-Based Simulator for RL Based System (Li et al., 2016)
rule-based simulator + collected data
starts with sets of goals, actions, KB, slot types
publicly available simulation framework
movie-booking domain: ticket booking and movie seeking
provide procedures to add and test own agent
http://arxiv.org/abs/1612.05688
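An agenda-style rule-based simulator in this spirit can be sketched as follows; the act names and goal format are assumptions for illustration, not the framework's actual API.

```python
# Sketch of a rule-based user simulator: the simulated user holds a goal
# (constraints it can inform, requests it still wants answered) and
# reacts to system acts with simple hand-written rules.

class RuleBasedUserSim:
    def __init__(self, goal):
        # goal = {"constraints": {slot: value}, "requests": [slot, ...]}
        self.goal = goal
        self.pending = list(goal["requests"])

    def respond(self, system_act, slot=None):
        """Return the simulated user's dialogue act for one turn."""
        if system_act == "request" and slot in self.goal["constraints"]:
            return ("inform", {slot: self.goal["constraints"][slot]})
        if system_act == "inform" and self.pending:
            return ("request", {self.pending.pop(0): None})
        return ("thanks", {})
```

Starting each episode from a sampled goal (sets of goals, actions, KB entries, slot types) yields endless cheap dialogues for RL exploration.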
Model-Based User Simulators
Bi-gram models (Levin et al., 2000)
Graph-based models (Scheffler and Young, 2000)
Data-driven simulator (Jung et al., 2009)
Neural models (deep encoder-decoder)
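A bi-gram user model estimates P(user act | system act) from a dialogue corpus; the corpus format here is an assumption for illustration.

```python
# Sketch of a bi-gram user model (in the style of Levin et al., 2000):
# the simulated user's next act depends only on the system's last act,
# with probabilities estimated by counting over logged dialogues.
from collections import Counter, defaultdict

def fit_bigram_model(corpus):
    """corpus: list of (system_act, user_act) pairs from logged dialogues.
    Returns {system_act: {user_act: probability}}."""
    counts = defaultdict(Counter)
    for sys_act, user_act in corpus:
        counts[sys_act][user_act] += 1
    return {s: {u: c / sum(ctr.values()) for u, c in ctr.items()}
            for s, ctr in counts.items()}
```

Sampling a user act from the fitted distribution gives a cheap, if context-blind, simulated user; the graph-based and neural models above add the missing dialogue history.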
Data-Driven Simulator (Jung et al., 2009)
Three-step process
1) User intention simulator
[Figure: from the current discourse status and the user's semantic frame at turn t-1, the simulator (*) computes all possible semantic frames for turn t given the previous turn's information and discourse features (DD+DI), then (*) randomly selects one possible semantic frame, e.g., request+search_loc]
Data-Driven Simulator (Jung et al., 2009)
Three-step process
1) User intention simulator
2) User utterance simulator
Example: given the intention request+search_loc and a list of POS tags associated with the semantic frame, the utterance simulator generates the user utterance using a language model plus rules.