Q-Learning
Goal: estimate optimal Q-values
Optimal Q-values obey a Bellman equation
Value iteration algorithms solve the Bellman equation
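In standard form, the Bellman optimality equation for Q-values is

$$Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]$$

where the right-hand side $r + \gamma \max_{a'} Q^*(s', a')$ plays the role of the learning target.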
Deep Q-Networks (DQN)
Represent the value function by a deep Q-network $Q(s, a; w)$ with weights $w$
Objective is to minimize the mean-squared error (MSE) loss by SGD
Leading to the following Q-learning gradient
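A standard reconstruction of the loss and its (semi-)gradient, treating the Q-learning target as fixed and absorbing the constant factor of 2:

$$L(w) = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)^{2}\right]$$

$$\frac{\partial L(w)}{\partial w} = -\,\mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)\frac{\partial Q(s, a; w)}{\partial w}\right]$$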
Issue: naïve Q-learning oscillates or diverges with neural networks due to:
1) correlations between samples, and 2) non-stationary learning targets
Stability Issues with Deep RL
Naïve Q-learning oscillates or diverges with neural nets
1. Data is sequential
Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
Policy may oscillate
Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides stable solutions to deep value-based RL
1. Use experience replay
Break correlations in data, bring us back to iid setting
Learn from all past policies
2. Freeze target Q-network
Avoid oscillation
Break correlations between Q-network and target
3. Clip rewards or adaptively normalize network outputs to a sensible range
Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience
Take action $a_t$ according to an 𝜖-greedy policy (small probability 𝜖 of random exploration)
Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
Optimize MSE between the Q-network and the Q-learning targets
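A minimal replay-memory sketch; the class and variable names are my own, not from any particular implementation:

```python
# Minimal experience-replay sketch (illustrative names).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between successive steps
        return random.sample(self.buffer, batch_size)
```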
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
Compute Q-learning targets w.r.t. old, fixed parameters $w^-$
Optimize MSE between the Q-network and the Q-learning targets
Periodically update the fixed parameters: $w^- \leftarrow w$
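With $w^-$ denoting the frozen target-network weights, the loss takes the standard form

$$L(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w)\right)^{2}\right]$$

and $w^- \leftarrow w$ is applied every fixed number of steps.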
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
DQN clips the rewards to [−1, +1]
Prevents overly large Q-values
Ensures gradients are well-conditioned
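A one-line sketch of the clipping step (the function name is my own):

```python
def clip_reward(r: float) -> float:
    # DQN-style clipping keeps per-step rewards in [-1, +1]
    return max(-1.0, min(1.0, r))
```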
Other Improvements: Double DQN
Nature DQN target: $r + \gamma \max_{a'} Q(s', a'; w^-)$
Double DQN: remove the upward bias caused by $\max_{a'} Q(s', a'; w)$
Current Q-network $w$ is used to select actions
Older Q-network $w^-$ is used to evaluate actions
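This yields the Double DQN target in its standard form:

$$r + \gamma\, Q\!\left(s',\, \arg\max_{a'} Q(s', a'; w);\; w^-\right)$$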
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
Store experience in a priority queue ordered by the magnitude of the DQN (TD) error, so surprising transitions are replayed more often
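A simplified proportional-prioritization sketch; the full method also uses a sum-tree for efficiency and importance-sampling corrections, both omitted here, and the names are my own:

```python
import random

def sample_prioritized(transitions, td_errors, batch_size=32, alpha=0.6, eps=1e-3):
    # Priority grows with |TD error|; eps keeps every transition sampleable
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    return random.choices(transitions, weights=priorities, k=batch_size)
```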
Other Improvements: Dueling Network
Dueling Network: split Q-network into two channels
Action-independent value function
Value function estimates how good the state is
Action-dependent advantage function
Advantage function estimates the additional benefit of taking each action in that state
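The two streams are recombined into Q-values; the standard aggregation from Wang et al. (2016) subtracts the mean advantage for identifiability:

$$Q(s, a) = V(s) + \Big(A(s, a) - \tfrac{1}{|\mathcal{A}|}\textstyle\sum_{a'} A(s, a')\Big)$$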
DQN for Dialogue Management
(Li et al., 2017)
Goal: end-to-end learning of values Q(s, a) from interactions
Input: the state combines the user's observation history, the previous system action, and the database query results
Output: Q(s, a) for every available system action a
Reward: −1 per turn; large reward for successfully completing the task
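A sketch of the reward signal described above; the success bonus value is illustrative, not the paper's exact constant:

```python
def turn_reward(dialogue_over: bool, task_success: bool,
                success_bonus: float = 40.0) -> float:
    r = -1.0                      # per-turn penalty encourages short dialogues
    if dialogue_over and task_success:
        r += success_bonus        # large positive reward for completing the task
    return r
```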
[Figure: DQN-based dialogue manager interacting with a simulated user and a backend DB; example semantic frame request_movie(genre=action, date=this weekend) leading to system action/policy request_location.]
E2E RL-Based System
(Zhao and Eskenazi, 2016)
Joint learning of NLU, DST, and dialogue policy
Deep RL for training
Deep Q-network
Deep recurrent network
[Figure: learning curves comparing Baseline RL and Hybrid-RL]
http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
[Figure: end-to-end neural dialogue system. A user simulator (user agenda modeling) produces text input such as "Are there any action movies to see this weekend?"; language understanding (LU) tags words (slot labels like B-type, O, EOS) and the intent into a semantic frame (request_movie, genre=action, date=this weekend); dialogue management (DM) selects a system action/policy (request_location); natural language generation (NLG) produces the response, and the user replies with a dialogue action such as Inform(location=San Francisco).]
Idea: supervised learning for each component and reinforcement learning for end-to-end training of the neural dialogue system
https://arxiv.org/abs/1703.01008
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don’t care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!
The system can learn how to efficiently interact with users for task completion
REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
https://arxiv.org/abs/1703.01008
Policy-Based Deep RL
Estimate how good an agent’s behavior is
Deep Policy Networks
Represent the policy by a deep network with weights $u$: either a stochastic policy $\pi(a \mid s; u)$ or a deterministic policy $a = \pi(s; u)$
Objective is to maximize the total discounted reward by SGD
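In standard notation (a reconstruction), the objective is

$$L(u) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \;\middle|\; \pi(\cdot;\, u) \,\right]$$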
Policy Gradient
The gradient of a stochastic policy is given by the first equation below
The gradient of a deterministic policy is given by the second, which is one way to deal with continuous actions
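Standard forms of the two gradients:

$$\nabla_u L(u) = \mathbb{E}\!\left[ \nabla_u \log \pi(a \mid s; u)\, Q^{\pi}(s, a) \right] \quad \text{(stochastic)}$$

$$\nabla_u L(u) = \mathbb{E}\!\left[ \nabla_a Q^{\pi}(s, a)\big|_{a = \pi(s; u)}\, \nabla_u \pi(s; u) \right] \quad \text{(deterministic)}$$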
Actor-Critic (Value-Based + Policy-Based)
Estimate the value function $Q(s, a; w)$
Update policy parameters by SGD
Stochastic policy
Deterministic policy
Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly
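Concretely, a standard sketch of the two updates: the critic is learned by TD updates, and the actor follows the policy gradients above with the critic in place of $Q^{\pi}$:

$$\Delta w \propto \left(r + \gamma Q(s', a'; w) - Q(s, a; w)\right)\nabla_w Q(s, a; w)$$

$$\Delta u \propto \nabla_u \log \pi(a \mid s; u)\, Q(s, a; w) \quad \text{(stochastic)}$$

$$\Delta u \propto \nabla_a Q(s, a; w)\big|_{a = \pi(s; u)}\, \nabla_u \pi(s; u) \quad \text{(deterministic)}$$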
Deep Deterministic Policy Gradient (DDPG)
Goal: end-to-end learning of control policy from pixels
Input: state is stack of raw pixels from last 4 frames
Output: two separate CNNs for Q and 𝜋
Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015.
E2E RL-Based Info-Bot
(Dhingra et al., 2016)
User goal: Movie=?; Actor=Bill Murray; Release Year=1993
User: Find me the Bill Murray’s movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Knowledge base of (head, relation, tail) triples, e.g. (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)
Idea: differentiable database for propagating the gradients
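A heavily simplified sketch of the idea: replace a hard SQL-style lookup with a soft, differentiable posterior over KB entities, so gradients can flow from the dialogue policy back into the belief tracker. The function and variable names are my own, and the actual posterior computation in Dhingra et al. differs:

```python
import torch

def soft_lookup(slot_beliefs, kb_slot_values):
    """slot_beliefs: dict slot -> 1-D tensor of probs over that slot's values.
    kb_slot_values: dict slot -> LongTensor giving each entity's value index."""
    n_entities = next(iter(kb_slot_values.values())).shape[0]
    log_scores = torch.zeros(n_entities)
    for slot, beliefs in slot_beliefs.items():
        # Probability that each entity matches the user's constraint on this slot
        log_scores = log_scores + torch.log(beliefs[kb_slot_values[slot]] + 1e-8)
    return torch.softmax(log_scores, dim=0)  # differentiable posterior over entities
```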
https://arxiv.org/abs/1609.00777
Dialogue Management Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward