Q-Learning
Goal: estimate optimal Q-values
Optimal Q-values obey a Bellman equation
Value iteration algorithms solve the Bellman equation
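In standard form, the Bellman optimality equation for Q-values is

$$Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]$$

where the right-hand side $r + \gamma \max_{a'} Q^*(s', a')$ plays the role of the learning target.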
Deep Q-Networks (DQN)
Represent the value function by a deep Q-network $Q(s, a; w)$ with weights $w$
Objective is to minimize the mean-squared error (MSE) loss by SGD
Leading to the following Q-learning gradient
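A standard reconstruction of the loss and its (semi-)gradient, treating the Q-learning target as fixed and absorbing the constant factor of 2:

$$L(w) = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)^{2}\right]$$

$$\frac{\partial L(w)}{\partial w} = -\,\mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)\frac{\partial Q(s, a; w)}{\partial w}\right]$$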
Issue: naïve Q-learning oscillates or diverges with neural networks due to:
1) correlations between samples, and 2) non-stationary learning targets
Stability Issues with Deep RL
Naïve Q-learning oscillates or diverges with neural nets
1. Data is sequential
Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
Policy may oscillate
Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides stable solutions to deep value-based RL
1. Use experience replay
Break correlations in data, bring us back to iid setting
Learn from all past policies
2. Freeze target Q-network
Avoid oscillation
Break correlations between Q-network and target
3. Clip rewards or adaptively normalize network outputs to a sensible range
Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience
Take action $a_t$ according to an 𝜖-greedy policy (small probability 𝜖 of random exploration)
Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
Optimize MSE between the Q-network and the Q-learning targets
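A minimal replay-memory sketch; the class and variable names are my own, not from any particular implementation:

```python
# Minimal experience-replay sketch (illustrative names).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between successive steps
        return random.sample(self.buffer, batch_size)
```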
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
Compute Q-learning targets w.r.t. old, fixed parameters $w^-$
Optimize MSE between the Q-network and the Q-learning targets
Periodically update the fixed parameters: $w^- \leftarrow w$
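With $w^-$ denoting the frozen target-network weights, the loss takes the standard form

$$L(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w)\right)^{2}\right]$$

and $w^- \leftarrow w$ is applied every fixed number of steps.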
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
DQN clips the rewards to [−1, +1]
Prevents overly large Q-values
Ensures gradients are well-conditioned
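A one-line sketch of the clipping step (the function name is my own):

```python
def clip_reward(r: float) -> float:
    # DQN-style clipping keeps per-step rewards in [-1, +1]
    return max(-1.0, min(1.0, r))
```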
Other Improvements: Double DQN
Nature DQN target: $r + \gamma \max_{a'} Q(s', a'; w^-)$
Double DQN: remove the upward bias caused by $\max_{a'} Q(s', a'; w)$
Current Q-network $w$ is used to select actions
Older Q-network $w^-$ is used to evaluate actions
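This yields the Double DQN target in its standard form:

$$r + \gamma\, Q\!\left(s',\, \arg\max_{a'} Q(s', a'; w);\; w^-\right)$$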
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
Store experience in a priority queue ordered by the magnitude of the DQN (TD) error, so surprising transitions are replayed more often
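A simplified proportional-prioritization sketch; the full method also uses a sum-tree for efficiency and importance-sampling corrections, both omitted here, and the names are my own:

```python
import random

def sample_prioritized(transitions, td_errors, batch_size=32, alpha=0.6, eps=1e-3):
    # Priority grows with |TD error|; eps keeps every transition sampleable
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    return random.choices(transitions, weights=priorities, k=batch_size)
```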
Other Improvements: Dueling Network
Dueling Network: split Q-network into two channels
Action-independent value function
Value function estimates how good the state is
Action-dependent advantage function
Advantage function estimates the additional benefit of taking each action in that state
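The two streams are recombined into Q-values; the standard aggregation from Wang et al. (2016) subtracts the mean advantage for identifiability:

$$Q(s, a) = V(s) + \Big(A(s, a) - \tfrac{1}{|\mathcal{A}|}\textstyle\sum_{a'} A(s, a')\Big)$$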
DQN for Dialogue Management
(Li et al., 2017)
Goal: end-to-end learning of values Q(s, a) from interactions
Input: the state combines the user's observation history, the previous system action, and the database query results
Output: Q(s, a) for every available system action a
Reward: −1 per turn; large reward for successfully completing the task
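A sketch of the reward signal described above; the success bonus value is illustrative, not the paper's exact constant:

```python
def turn_reward(dialogue_over: bool, task_success: bool,
                success_bonus: float = 40.0) -> float:
    r = -1.0                      # per-turn penalty encourages short dialogues
    if dialogue_over and task_success:
        r += success_bonus        # large positive reward for completing the task
    return r
```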
[Figure: DQN-based dialogue manager interacting with a simulated user and a backend DB; example semantic frame request_movie(genre=action, date=this weekend) leading to system action/policy request_location.]
E2E RL-Based System
(Zhao and Eskenazi, 2016)
Joint learning of NLU, DST, and dialogue policy
Deep RL for training
Deep Q-network
Deep recurrent network
[Figure: learning curves comparing Baseline RL and Hybrid-RL]
http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
[Figure: end-to-end neural dialogue system. A user simulator (user agenda modeling) produces text input such as "Are there any action movies to see this weekend?"; language understanding (LU) tags words (slot labels like B-type, O, EOS) and the intent into a semantic frame (request_movie, genre=action, date=this weekend); dialogue management (DM) selects a system action/policy (request_location); natural language generation (NLG) produces the response, and the user replies with a dialogue action such as Inform(location=San Francisco).]
Idea: supervised learning for each component and reinforcement learning for end-to-end training of the neural dialogue system
https://arxiv.org/abs/1703.01008
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don’t care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!
The system can learn how to efficiently interact with users for task completion
REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
https://arxiv.org/abs/1703.01008
Policy-Based Deep RL
Estimate how good an agent’s behavior is
Deep Policy Networks
Represent the policy by a deep network with weights $u$: either a stochastic policy $\pi(a \mid s; u)$ or a deterministic policy $a = \pi(s; u)$
Objective is to maximize the total discounted reward by SGD
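In standard notation (a reconstruction), the objective is

$$L(u) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \;\middle|\; \pi(\cdot;\, u) \,\right]$$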
Policy Gradient
The gradient of a stochastic policy is given by the first equation below
The gradient of a deterministic policy is given by the second, which is one way to deal with continuous actions
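Standard forms of the two gradients:

$$\nabla_u L(u) = \mathbb{E}\!\left[ \nabla_u \log \pi(a \mid s; u)\, Q^{\pi}(s, a) \right] \quad \text{(stochastic)}$$

$$\nabla_u L(u) = \mathbb{E}\!\left[ \nabla_a Q^{\pi}(s, a)\big|_{a = \pi(s; u)}\, \nabla_u \pi(s; u) \right] \quad \text{(deterministic)}$$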
Actor-Critic (Value-Based + Policy-Based)
Estimate the value function $Q(s, a; w)$
Update policy parameters by SGD
Stochastic policy
Deterministic policy
Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly
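Concretely, a standard sketch of the two updates: the critic is learned by TD updates, and the actor follows the policy gradients above with the critic in place of $Q^{\pi}$:

$$\Delta w \propto \left(r + \gamma Q(s', a'; w) - Q(s, a; w)\right)\nabla_w Q(s, a; w)$$

$$\Delta u \propto \nabla_u \log \pi(a \mid s; u)\, Q(s, a; w) \quad \text{(stochastic)}$$

$$\Delta u \propto \nabla_a Q(s, a; w)\big|_{a = \pi(s; u)}\, \nabla_u \pi(s; u) \quad \text{(deterministic)}$$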
Deep Deterministic Policy Gradient (DDPG)
Goal: end-to-end learning of control policy from pixels
Input: state is stack of raw pixels from last 4 frames
Output: two separate CNNs for Q and 𝜋
Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015.
E2E RL-Based Info-Bot
(Dhingra et al., 2016)
User goal: Movie=?; Actor=Bill Murray; Release Year=1993
User: Find me the Bill Murray’s movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Knowledge base of (head, relation, tail) triples, e.g. (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)
Idea: differentiable database for propagating the gradients
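A heavily simplified sketch of the idea: replace a hard SQL-style lookup with a soft, differentiable posterior over KB entities, so gradients can flow from the dialogue policy back into the belief tracker. The function and variable names are my own, and the actual posterior computation in Dhingra et al. differs:

```python
import torch

def soft_lookup(slot_beliefs, kb_slot_values):
    """slot_beliefs: dict slot -> 1-D tensor of probs over that slot's values.
    kb_slot_values: dict slot -> LongTensor giving each entity's value index."""
    n_entities = next(iter(kb_slot_values.values())).shape[0]
    log_scores = torch.zeros(n_entities)
    for slot, beliefs in slot_beliefs.items():
        # Probability that each entity matches the user's constraint on this slot
        log_scores = log_scores + torch.log(beliefs[kb_slot_values[slot]] + 1e-8)
    return torch.softmax(log_scores, dim=0)  # differentiable posterior over entities
```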
https://arxiv.org/abs/1609.00777
Dialogue Management Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward