Slides adapted from Gašić
Review
Task-Oriented Dialogue System
(Young, 2000)
Pipeline, illustrated with the input "Are there any action movies to see this weekend?" (as speech signal or text):
• Speech Recognition: speech signal → hypothesis ("are there any action movies to see this weekend")
• Language Understanding (LU): domain identification, user intent detection, slot filling → semantic frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): dialogue state tracking (DST) and dialogue policy, consulting the backend database / knowledge providers → system action/policy: request_location
• Natural Language Generation (NLG) → text response: "Where are you located?"
http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
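To make the LU output concrete, here is a minimal sketch of the semantic frame and system action from the example above as plain data structures (the field names are illustrative assumptions, not a fixed schema):

```python
# Hypothetical encoding of the example above; field names are illustrative.
semantic_frame = {
    "intent": "request_movie",
    "slots": {"genre": "action", "date": "this weekend"},
}

# The dialogue policy maps the tracked dialogue state to a system action.
system_action = {"act": "request_location"}

# NLG would then realize the action as text: "Where are you located?"
```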
Dialogue Management
Example Dialogue
System: greeting ()
User: request (restaurant; foodtype=Thai)
System: request (area)
User: inform (area=centre)
System: inform (restaurant=Bangkok city, area=centre of town, foodtype=Thai)
User: request (address)
System: inform (address=24 Green street)
User: bye ()
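The same dialogue can be written out as data: each turn is a speaker plus an act type with slot-value pairs (a hedged sketch; the representation is illustrative, not a fixed schema):

```python
# Illustrative act-type + slot-value encoding of the dialogue above.
dialogue = [
    ("system", "greeting", {}),
    ("user", "request", {"entity": "restaurant", "foodtype": "Thai"}),
    ("system", "request", {"entity": "area"}),
    ("user", "inform", {"area": "centre"}),
    ("system", "inform", {"restaurant": "Bangkok city",
                          "area": "centre of town", "foodtype": "Thai"}),
    ("user", "request", {"entity": "address"}),
    ("system", "inform", {"address": "24 Green street"}),
    ("user", "bye", {}),
]
```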
Elements of Dialogue Management
(Figure from Gašić)
Rule-Based Management
Dialogue Policy Optimization
Reinforcement Learning
Reinforcement Learning
RL is a general-purpose framework for decision making
RL is for an agent with the capacity to act
Each action influences the agent's future state
Success is measured by a scalar reward signal
Goal: select actions to maximize future reward
Big three: action, state, reward
Scenario of Reinforcement Learning
Agent learns to take actions to maximize expected reward.
The agent receives observation o_t from the environment, takes action a_t (e.g., the next move in a game), and receives reward r_t (e.g., if win, reward = 1; if loss, reward = -1; otherwise, reward = 0).
Supervised vs. Reinforcement
Supervised: learning from a teacher, with the correct output given for each input (e.g., "Hello" → say "Hi"; "Bye bye" → say "Good bye")
Reinforcement: learning from critics, with only overall feedback on the agent's behavior (e.g., after a whole dialogue, the agent is only told "Bad")
Dialogue as Reinforcement Learning
Problems in solving dialogue as an RL task:
1) Optimization problem size
Belief dialogue state space is large and continuous
System action space is large
Solution: learn in a reduced summary space
2) Knowledge of the environment (user)
Transition probability is unknown (user status)
How to get rewards
Solution: learn in interaction with a simulated user
3) RL takes a long time to converge
Large Belief Space and Action Space
Solution: perform optimization in a reduced summary space built according to heuristics
Belief dialogue state → (summary function) → summary dialogue state → (summary policy) → summary action → (master function) → system action
Transition Probability and Rewards
Solution: learn from a simulated user
The user simulation combines a user model and a reward model with an error model (recognition errors, LU errors). It emits a distribution over user dialogue acts (semantic frames) to dialogue state tracking (DST) and rewards to dialogue policy optimization inside the dialogue management (DM) module, which consults backend actions / knowledge providers and returns system dialogue acts.
Agent and Environment
At time step t
The agent
Executes action a_t
Receives observation o_t
Receives scalar reward r_t
The environment
Receives action a_t
Emits observation o_{t+1}
Emits scalar reward r_{t+1}
t increments at each environment step
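A minimal sketch of this interaction loop, with a toy environment standing in for the real one (everything here is illustrative, not from the slides):

```python
import random

class GuessingEnv:
    """Toy environment: the agent guesses a coin flip; win = +1, loss = -1."""
    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        reward = 1 if action == outcome else -1
        return outcome, reward  # (observation o_{t+1}, reward r_{t+1})

env = GuessingEnv()
for t in range(5):
    action = random.choice(["heads", "tails"])  # agent executes a_t
    observation, reward = env.step(action)      # env emits o_{t+1}, r_{t+1}
    print(f"t={t}  a_t={action}  o={observation}  r={reward}")
```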
State
Experience is the sequence of observations, actions, and rewards
State is the information used to determine what happens next
What happens next depends on the history of experience:
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history of experience: s_t = f(o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t)
POMDP Policy Optimization
Finding the value function associated with the optimal policy, i.e. the one that generates the maximal return
Problem: tractable only for very simple cases (Kaelbling et al., 1998)
Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are belief states
Markov Decision Process (MDP)
Belief state from tracking: b_t = s_t
System actions: a_t
Rewards: r_t
Transition probability: p(b_{t+1} | b_t, a_t)
DM as Markov Decision Process (MDP)
Data: belief dialogue states (continuous); reward, a measure of dialogue quality
Model: Markov decision process (MDP) & reinforcement learning
Prediction: system actions (dialogue policy optimization)
Dialogue Policy Optimization
Dialogue management in an RL framework
The dialogue manager is the agent; the user, reached through natural language generation and language understanding, is the environment. The agent takes action A and receives observation O and reward R from the environment.
The optimized dialogue policy selects the best action that maximizes the future reward.
Correct rewards are a crucial factor in dialogue policy training
Reward
Reinforcement learning is based on the reward hypothesis
A reward r_t is a scalar feedback signal
Indicates how well the agent is doing at step t
Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward
Reward for RL ≅ Evaluation for System
Dialogue is a special RL task
Humans are involved in the interaction and in the rating (evaluation) of a dialogue
Fully human-in-the-loop framework
Rating: correctness, appropriateness, and adequacy
• Expert rating: high quality, high cost
• User rating: unreliable quality, medium cost
• Objective rating: checks desired aspects, low cost
Reinforcement Learning for Dialogue Policy Optimization
In the loop: language understanding maps user input (o) to state s; the dialogue policy a = π(s) picks action a; language (response) generation produces the response; the system collects rewards (s, a, r, s') and optimizes Q(s, a).
Types of bots, with their state, action, and reward:
• Social ChatBots: state = chat history; action = system response; reward = # of turns maximized, intrinsically motivated reward
• InfoBots (interactive Q/A): state = user's current question + context; action = answers to the current question; reward = relevance of answer, # of turns minimized
• Task-Completion Bots: state = user's current input + context; action = system dialogue act w/ slot values (or API calls); reward = task success rate, # of turns minimized
Goal: develop a generic deep RL algorithm to learn dialogue policy for all bot categories
Dialogue Reinforcement Learning Signal
Typical reward function (see the sketch below):
• -1 per-turn penalty
• Large reward at completion if successful
Typically requires domain knowledge
✔ Simulated user
✔ Paid users (Amazon Mechanical Turk)
✖ Real users
The user simulator is usually required for dialogue system training before deployment
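A hedged sketch of such a reward signal for a task-completion dialogue; the success bonus of +20 is an illustrative constant, not from the slides:

```python
def turn_reward(dialogue_over, task_success, success_bonus=20):
    """Typical shaped reward: -1 per turn; large terminal bonus on success.
    `success_bonus` is an illustrative assumption."""
    if dialogue_over and task_success:
        return -1 + success_bonus
    return -1  # the per-turn penalty encourages shorter dialogues
```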
Sequential Decision Making
Goal: select actions to maximize total future reward
Actions may have long-term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Deep Reinforcement Learning
The agent interacts with the environment through observation, action, and reward. A deep neural network (DNN) serves as the function from observation (input) to action (output); the reward is used to pick the best function.
Reinforcement Learning Approach: Value-Based, Policy-Based, Model-Based
Major Components in an RL Agent
An RL agent may include one or more of these components
Policy: agent’s behavior function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Policy
A policy is the agent's behavior
A policy maps from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(a|s)
Value Function
A value function is a prediction of future reward (with action a in state s)
The Q-value function gives the expected total reward from state s and action a, under policy π, with discount factor γ:
Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a]
Value functions decompose into a Bellman equation:
Q^π(s, a) = E_{s', a'}[r + γ Q^π(s', a') | s, a]
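The expectation above is over sampled trajectories; for a single trajectory the discounted return is easy to compute, e.g.:

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one sampled trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Per-turn penalties followed by a success bonus, as in the dialogue reward above:
print(discounted_return([-1, -1, -1, 19]))
```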
Optimal Value Function
An optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)
Knowing Q* allows us to act optimally: π*(s) = argmax_a Q*(s, a)
It informally maximizes over all decisions
The optimal value function also decomposes into a Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]
Reinforcement Learning Approach
Policy-based RL
Search directly for the optimal policy π*, i.e. the policy achieving maximum future reward
Value-based RL
Estimate the optimal value function Q*(s, a), i.e. the maximum value achievable under any policy
Model-based RL
Build a model of the environment
Plan (e.g. by lookahead) using the model
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Maze Example: Policy
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Arrows represent the policy π(s) for each state s
Maze Example: Value Function
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Numbers represent the value v_π(s) of each state s
Categorizing RL Agents
Value-Based: value function; no policy (implicit)
Policy-Based: policy; no value function
Actor-Critic: policy and value function
Model-Free: policy and/or value function; no model
Model-Based: policy and/or value function; model
RL Agent Taxonomy
Value-Based Deep RL: Dynamic Programming, Monte-Carlo, Temporal-Difference, Q-Learning
Dynamic Programming
Model-based
Evaluate policy
Update policy
Dynamic Programming
GridWorld example
Monte-Carlo RL
Characteristics
Learn from complete episodes of experience
Model-free: no knowledge of MDP transitions / rewards
Value = mean return
MC policy evaluation
Goal: learn v_π from episodes of experience under policy π
Return is the total discounted reward: G_t = r_{t+1} + γ r_{t+2} + ... + γ^{T-1} r_T
Value function is the expected return: v_π(s) = E[G_t | s_t = s], estimated as a mean over sampled returns (see the sketch below)
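A compact sketch of every-visit Monte-Carlo policy evaluation under these definitions; episodes are assumed to be lists of (state, reward) pairs collected under π:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """Every-visit MC: estimate v_pi(s) as the mean return observed from s."""
    returns = defaultdict(list)
    for episode in episodes:          # episode = [(s_1, r_2), (s_2, r_3), ...]
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g    # accumulate discounted return backwards
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```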
Monte-Carlo
Model-free prediction
Temporal-Difference RL
Characteristics
Learn from incomplete episodes of experience
Model-free: no knowledge of MDP transitions / rewards
Update a guess toward a guess
TD policy evaluation
Goal: learn v_π online from experience under policy π
Value function is updated toward the estimated return:
V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
where r_{t+1} + γ V(s_{t+1}) is the TD target (see the update sketch below)
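The corresponding TD(0) update as code (a sketch; V is a plain dict and α the learning rate):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s_t) toward the TD target r_{t+1} + gamma * V(s_{t+1})."""
    td_target = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (td_target - V.get(s, 0.0))
```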
Temporal-Difference
Model-free prediction
Q-Learning: Value Function Approximation
When value functions are represented by a lookup table:
there are too many states and/or actions to store
it is too slow to learn the value of each entry individually
Values can instead be estimated with function approximation
Q-Networks
Q-networks represent value functions with weights θ: Q(s, a; θ) ≈ Q^π(s, a)
generalize from seen states to unseen states
update the parameters θ of the function approximation
Q-Learning
Goal: estimate the optimal Q-values Q*(s, a)
Optimal Q-values obey a Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]
Value iteration algorithms solve the Bellman equation by repeatedly moving Q(s, a) toward the learning target r + γ max_{a'} Q(s', a'), as in the tabular sketch below
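A tabular sketch of the resulting update, with Q stored as a dict keyed by (state, action):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s,a) toward the learning target r + gamma * max_a' Q(s',a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```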
Deep Q-Networks (DQN)
Represent the value function by a deep Q-network with weights θ
Objective: minimize the mean squared error (MSE) loss by SGD:
L(θ) = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]
where r + γ max_{a'} Q(s', a'; θ) is the learning target
Leading to the following Q-learning gradient:
∂L(θ)/∂θ = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)) ∂Q(s, a; θ)/∂θ]
A batched sketch of this loss follows.
Issue: naïve Q-learning oscillates or diverges using NNs due to: 1) correlations between samples, 2) non-stationary targets
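A hedged PyTorch-style sketch of the MSE objective; `q_net` and `target_net` are assumed to map a batch of states to per-action Q-values (the frozen target network anticipates the stability fixes on the next slides):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                         # batched tensors; done is 1.0 at episode end
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():                                 # target is not differentiated
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1 - done) * max_next        # learning target
    return F.mse_loss(q_sa, target)
```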
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
Policy may oscillate
Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides a stable solution to deep value-based RL
1. Use experience replay
Break correlations in data, bringing us back to the iid setting
Learn from all past policies
2. Freeze the target Q-network
Avoid oscillations
Break correlations between the Q-network and its target
3. Clip rewards or adaptively normalize the network to a sensible range
Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience (sketched below)
Take action a_t according to an ε-greedy policy (small probability ε of exploration)
Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample random mini-batches of transitions (s, a, r, s') from D
Optimize the MSE between the Q-network and the Q-learning targets
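A minimal replay-memory sketch matching these steps (capacity, batch size, and ε are illustrative values):

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions; sampling random mini-batches breaks correlations."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon=0.05):
    """Greedy action, except with small probability epsilon explore randomly."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```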
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
Compute Q-learning targets w.r.t. old, fixed parameters θ⁻: r + γ max_{a'} Q(s', a'; θ⁻)
Optimize the MSE between the Q-network and the Q-learning targets
Periodically update the fixed parameters: θ⁻ ← θ
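With PyTorch-style modules, freezing and periodically refreshing the target amounts to a parameter copy; the 1000-step interval below is an illustrative assumption:

```python
def sync_target(q_net, target_net):
    """Copy current weights into the frozen target: theta_minus <- theta."""
    target_net.load_state_dict(q_net.state_dict())

# Inside the training loop (sketch):
# if step % 1000 == 0:
#     sync_target(q_net, target_net)
```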
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
DQN clips the rewards to [−1, +1]
This prevents Q-values from becoming too large
Ensures gradients are well-conditioned
Other Improvements: Double DQN
Nature DQN target: r + γ max_{a'} Q(s', a'; θ⁻)
Double DQN removes the upward bias caused by taking the max over the same estimated Q-values:
The current Q-network (θ) is used to select actions
The older Q-network (θ⁻) is used to evaluate actions
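A sketch of the Double DQN target under the same PyTorch-style assumptions as before: the current network selects, the older network evaluates:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # select (current net)
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluate (older net)
        return r + gamma * (1 - done) * q_eval
```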
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
Store experience in priority queue according to DQN error
Other Improvements: Dueling Network
Dueling network: split the Q-network into two channels (see the sketch below)
Action-independent value function V(s): estimates how good the state is
Action-dependent advantage function A(s, a): estimates the additional benefit of taking each action
Q(s, a) = V(s) + A(s, a)
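A sketch of the two-channel head (PyTorch assumed); subtracting the mean advantage is the common identifiability trick, an assumption beyond what the slide states:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split a shared representation h into V(s) and A(s,a), then recombine."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)              # action-independent V(s)
        self.advantage = nn.Linear(hidden_dim, n_actions)  # action-dependent A(s,a)

    def forward(self, h):
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)         # Q(s, a)
```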
DQN for Dialogue Management
(Li et al., 2017)
Goal: end-to-end learning of the values Q(s, a) from interactions with a simulated user
Input: the state is the combination of the user's observation history, the previous system action, and the database-returned results
Output: Q(s, a) for all available system actions a
Reward: -1 per turn; large reward for successful task completion
Example: semantic frame request_movie (genre=action, date=this weekend) → DQN-based DM, consulting the backend DB → system action/policy request_location
E2E RL-Based System
(Zhao and Eskenazi, 2016)
Joint learning of NLU, DST, and dialogue policy
Deep RL for training: a deep Q-network with a deep recurrent network
Compared variants: baseline RL and hybrid-RL
http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
The end-to-end neural dialogue system takes text input ("Are there any action movies to see this weekend?"). Language understanding (LU) tags the input words with slot labels (e.g., B-type, O) and predicts an intent, yielding the semantic frame request_movie (genre=action, date=this weekend). Dialogue management (DM) then outputs the system action/policy request_location, which natural language generation (NLG) realizes as text. A user simulator with user-agenda modeling, driven by a user goal, answers with user dialogue acts such as inform(location=San Francisco), and the loop repeats across time steps t-2, t-1, t.
Idea: supervised learning for each component and reinforcement learning for end-to-end training of the neural dialogue system
https://arxiv.org/abs/1703.01008
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
User goal: two tickets for "the witch" tomorrow 9:30 PM at Regal Meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don't care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!
REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
The system can learn how to efficiently interact with users for task completion.
https://arxiv.org/abs/1703.01008
Policy-Based Deep RL
Estimating how good an agent's behavior is
Deep Policy Networks
Represent the policy by a deep network with weights θ:
stochastic policy a ~ π(a|s; θ), or deterministic policy a = π(s; θ)
Objective: maximize the total discounted reward by SGD:
L(θ) = E[r_1 + γ r_2 + γ² r_3 + ... | π(·; θ)]
Policy Gradient
The gradient of a stochastic policy is given by:
∂L(θ)/∂θ = E[∂ log π(a|s; θ)/∂θ · Q^π(s, a)]
The gradient of a deterministic policy is given by:
∂L(θ)/∂θ = E[∂Q^π(s, a)/∂a · ∂π(s; θ)/∂θ]
The deterministic form shows how to deal with continuous actions: differentiate Q through the action (a REINFORCE-style sketch of the stochastic case follows)
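For the stochastic case, the score-function form above is what REINFORCE-style implementations optimize; a hedged PyTorch sketch, with the sampled return standing in for Q^π(s, a):

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """log_probs: list of log pi(a_t|s_t) tensors along one episode;
    returns: the discounted return G_t from each step.
    Minimizing this loss ascends the estimate E[grad log pi(a|s) * Q]."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(torch.stack(log_probs) * returns).sum()
```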
Actor-Critic (Value-Based + Policy-Based)
Estimate the value function Q(s, a; w) (the critic)
Update the policy parameters θ by SGD (the actor)
Stochastic policy: ∂L/∂θ = E[∂ log π(a|s; θ)/∂θ · Q(s, a; w)]
Deterministic policy: ∂L/∂θ = E[∂Q(s, a; w)/∂a · ∂π(s; θ)/∂θ]
Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly (a one-step sketch follows)
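A one-step sketch of this interplay, assuming PyTorch tensors; using the TD error as the critic's score is a common variant of plugging in Q directly, an assumption beyond the slide:

```python
def actor_critic_step(log_prob, value, value_next, r, gamma=0.99):
    """log_prob = log pi(a_t|s_t; theta); value = V(s_t; w); value_next = V(s_{t+1}; w)."""
    td_target = r + gamma * value_next.detach()
    td_error = td_target - value
    actor_loss = -log_prob * td_error.detach()  # policy moves where the critic approves
    critic_loss = td_error.pow(2)               # critic moves toward the TD target
    return actor_loss, critic_loss
```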
Deep Deterministic Policy Gradient (DDPG)
Goal: end-to-end learning of a control policy from pixels
Input: the state is a stack of raw pixels from the last 4 frames
Output: two separate CNNs for Q and π
Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv, 2015.
E2E RL-Based Info-Bot
(Dhingra et al., 2016)
Example (user goal: Movie=?; Actor=Bill Murray; Release Year=1993):
User: Find me the Bill Murray's movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Knowledge base of (head, relation, tail) triples, e.g. (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)
Idea: a differentiable database for propagating the gradients
https://arxiv.org/abs/1609.00777
Dialogue Management Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward
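Both metric levels are straightforward to compute from logged dialogues; a sketch over hypothetical log structures:

```python
def system_action_accuracy(predicted, reference):
    """Turn-level: fraction of turns whose system action matches the reference."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

def task_success_rate(dialogues):
    """Dialogue-level: fraction of dialogues marked successful."""
    return sum(d["success"] for d in dialogues) / len(dialogues)
```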
Concluding Remarks
Dialogue policy optimization in DM solves an MDP via RL
Value-based
Dynamic programming
Monte-Carlo
Temporal-difference
Q-learning / DQN
Policy-based
Deep policy gradient
Actor-critic