

Slides credit from Gašić (pages 50-71)


Q-Learning

Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation
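For reference, the standard Bellman optimality equation and value-iteration update (the slide's own equations were lost in extraction) are:

\[
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]
\]

\[
Q_{i+1}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q_{i}(s', a') \;\middle|\; s, a \right],
\qquad Q_{i} \rightarrow Q^{*} \text{ as } i \rightarrow \infty .
\]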


Deep Q-Networks (DQN)

Represent value function by deep Q-network with weights $w$

Objective is to minimize mean square error (MSE) loss by SGD

Leading to the following Q-learning gradient
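Written out, the loss and gradient correspond (up to a constant factor) to the standard DQN formulation with weights $w$:

\[
L(w) \;=\; \mathbb{E}\!\left[\Big( \underbrace{r + \gamma \max_{a'} Q(s', a'; w)}_{\text{learning target}} - Q(s, a; w) \Big)^{2}\right]
\]

\[
\frac{\partial L(w)}{\partial w} \;=\; \mathbb{E}\!\left[\Big( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \Big)\, \frac{\partial Q(s, a; w)}{\partial w}\right]
\]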


Issue: naïve Q-learning oscillates or diverges when using a neural network, due to:

1) correlations between samples, and 2) non-stationary learning targets


Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

Policy may oscillate

Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

Naive Q-learning gradients can be unstable when backpropagated


Stable Solutions for DQN

DQN provides stable solutions for deep value-based RL

1. Use experience replay

Break correlations in data, bring us back to iid setting

Learn from all past policies

2. Freeze target Q-network

Avoid oscillation

Break correlations between Q-network and target

3. Clip rewards or normalize the network adaptively to a sensible range

Robust gradients


Stable Solution 1: Experience Replay

To remove correlations, build a dataset from agent’s experience

Take action $a_t$ according to an 𝜖-greedy policy (with small probability 𝜖 of random exploration)

Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory D

Sample random mini-batch of transitions from D

Optimize MSE between Q-network and Q-learning targets
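A minimal Python sketch of these data structures, assuming a discrete action set and Q-values computed elsewhere (all names below are illustrative, not from the slides):

import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size buffer D storing transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random mini-batch breaks the temporal correlations between samples.
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with a small probability epsilon of random exploration."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))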


Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix parameters used in Q-learning target

Compute Q-learning targets w.r.t. old, fixed parameters

Optimize MSE between Q-network and Q-learning targets

Periodically update fixed parameters
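In the standard notation this corresponds to a loss with frozen target parameters $w^-$ (consistent with, though not copied from, the missing slide equation):

\[
L(w) \;=\; \mathbb{E}_{(s,a,r,s') \sim D}\!\left[\Big( r + \gamma \max_{a'} Q(s', a'; w^{-}) - Q(s, a; w) \Big)^{2}\right],
\qquad w^{-} \leftarrow w \ \text{every } C \text{ steps.}
\]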


Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned
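A one-line illustration of the clipping step (numpy is used here only for convenience):

import numpy as np

def clip_reward(r):
    # Keep per-step rewards in [-1, +1] so Q-values and gradients stay bounded.
    return float(np.clip(r, -1.0, 1.0))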


Other Improvements: Double DQN

Nature DQN

Double DQN: remove the upward bias caused by $\max_{a} Q(s, a; w)$

Current Q-network $w$ is used to select actions

Older Q-network $w^-$ is used to evaluate actions
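Written out, the two targets are (the standard Double DQN formulation of van Hasselt et al., matching the selection/evaluation split above):

\[
\text{Nature DQN target:} \quad r + \gamma \max_{a'} Q(s', a'; w^{-})
\]

\[
\text{Double DQN target:} \quad r + \gamma \, Q\!\big(s',\, \arg\max_{a'} Q(s', a'; w);\, w^{-}\big)
\]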


Other Improvements: Prioritized Replay

Prioritized Replay: weight experience based on surprise

Store experience in priority queue according to DQN error
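In Schaul et al.'s prioritized experience replay (a natural reading of "weight experience based on surprise"), the priority is the absolute DQN error and sampling is proportional to it:

\[
p_i \;=\; \Big|\, r + \gamma \max_{a'} Q(s', a'; w^{-}) - Q(s, a; w) \,\Big|,
\qquad
P(i) \;=\; \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} .
\]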


Other Improvements: Dueling Network

Dueling Network: split Q-network into two channels

Action-independent value function

Value function estimates how good the state is

Action-dependent advantage function

Advantage function estimates the additional benefit of taking a particular action in that state
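One common way to recombine the two channels (the aggregation used by Wang et al.; the slide's own equation was not preserved) is:

\[
Q(s, a) \;=\; V(s) \;+\; \Big( A(s, a) \;-\; \tfrac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big).
\]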


DQN for Dialogue Management

(Li et al., 2017)

Goal: end-to-end learning of values Q(s, a) from interactions

Input: the state is the combination of the user observation history, the previous system action, and the database-returned results

Output: Q(s, a) for every available system action a

Reward: -1 per turn; large reward for successful task


(Figure: a DQN-based DM interacts with a simulated user and a backend DB; e.g., the semantic frame request_movie(genre=action, date=this weekend) is mapped to the system action/policy request_location.)
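A minimal PyTorch-style sketch of such a Q-network, assuming the dialogue state has already been encoded as a fixed-length feature vector; the class names, dimensions, and reward magnitudes below are illustrative rather than taken from Li et al.:

import torch.nn as nn

class DialogueQNetwork(nn.Module):
    """Maps a dialogue-state feature vector to Q(s, a) for every system action."""
    def __init__(self, state_dim, num_system_actions, hidden_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_system_actions),
        )

    def forward(self, state_features):
        # state_features: concatenation of the user dialogue-act history,
        # the previous system action, and database-result features.
        return self.net(state_features)

def turn_reward(done, task_success, max_turns=40):
    # -1 per turn; a large terminal reward on success, a penalty on failure
    # (the exact magnitudes are illustrative).
    if not done:
        return -1.0
    return 2.0 * max_turns if task_success else -float(max_turns)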


E2E RL-Based System

(Zhao and Eskenazi, 2016)


Joint learning: NLU, DST, dialogue policy

Deep RL for training: deep Q-network with a deep recurrent network

(The results plot compares a baseline RL agent against a hybrid-RL agent.)

http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19


E2E Task-Completion Bot (TC-Bot)

(Li et al., 2017)

(Figure: the end-to-end neural dialogue system. A language understanding (LU) module tags the words of each user utterance (w_i, w_{i+1}, w_{i+2}, ..., EOS) with slot labels such as B-type / O and predicts an <intent> at each turn (time t-2, t-1, t). The dialogue management (DM) module maps the resulting semantic frame, e.g., request_movie(genre=action, date=this weekend), to a system action/policy such as request_location; a natural language generation (NLG) module produces the system utterance. A user simulator with a goal and user agenda modeling supplies user dialogue actions, e.g., inform(location=San Francisco). Example text input: "Are there any action movies to see this weekend?")

Idea: supervised learning for each component, and reinforcement learning for end-to-end training of the neural dialogue system

https://arxiv.org/abs/1703.01008

E2E Task-Completion Bot (TC-Bot)

(Li et al., 2017)

User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.

RULE-BASED SYSTEM

User: Can I get 2 tickets for the witch?

Agent: Which theater do you want?

User: Regal meridian 16

Agent: What time would you like to see it?

User: 9:30 pm

Agent: What date would you like to see it?

User: Tomorrow!

Agent: How many people are you?

User: Just two adults.

Agent: What is your location?

User: Seattle!

Agent: What kind of video format do you like?

User: Oh, any kind is okay.

Agent: Do you have any price constraint?

User: No, I don’t care.

Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.

User: Thank you.

Agent: Thank you!

The system can learn how to efficiently interact with users for task completion

REINFORCEMENT LEARNING SYSTEM

User: Can I get 2 tickets for the witch?

Agent: What date would you like to see it?

User: Tomorrow

Agent: Which theater do you prefer?

User: Regal meridian 16

Agent: What is your location?

User: Seattle

Agent: What time would you like to see it?

User: 9:30 pm.

Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.

User: Thanks.

Agent: Thanks!

https://arxiv.org/abs/1703.01008

Estimate How Good An Agent’s Behavior is

Policy-Based Deep RL


Deep Policy Networks

Represent policy by deep network with weights $\theta$

Objective is to maximize total discounted reward by SGD


Stochastic policy: $\pi(a \mid s; \theta)$; deterministic policy: $a = \pi(s; \theta)$
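The objective being maximised is the usual expected discounted return, optimised by stochastic gradient ascent (the slide's own equation was not preserved):

\[
J(\theta) \;=\; \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^{2} r_3 + \cdots \;\middle|\; \pi(\cdot;\theta) \right],
\qquad
\theta \;\leftarrow\; \theta + \alpha\, \frac{\partial J(\theta)}{\partial \theta}.
\]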


Policy Gradient

The gradient of a stochastic policy is given by

The gradient of a deterministic policy is given by


Deterministic policy gradients give a way to deal with continuous actions (see the formulas below)
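In their standard forms, the two gradients referred to above are the score-function (stochastic) policy gradient and Silver et al.'s deterministic policy gradient:

\[
\nabla_{\theta} J(\theta) \;=\; \mathbb{E}\big[\, \nabla_{\theta} \log \pi(a \mid s; \theta)\; Q^{\pi}(s, a) \,\big]
\]

\[
\nabla_{\theta} J(\theta) \;=\; \mathbb{E}\big[\, \nabla_{\theta}\, \pi(s; \theta)\; \nabla_{a} Q^{\pi}(s, a) \big|_{a = \pi(s;\theta)} \,\big]
\]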


Actor-Critic (Value-Based + Policy-Based)

Estimate value function

Update policy parameters by SGD

Stochastic policy

Deterministic policy


Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly
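Putting the two together (standard actor-critic updates, consistent with the bullets above): the critic fits $Q_w(s,a) \approx Q^{\pi}(s,a)$, e.g. by Q-learning, and the actor is updated by

\[
\Delta\theta \;=\; \alpha\, \nabla_{\theta} \log \pi(a \mid s; \theta)\, Q_{w}(s, a)
\ \ \text{(stochastic)}
\qquad \text{or} \qquad
\Delta\theta \;=\; \alpha\, \nabla_{\theta}\, \pi(s; \theta)\, \nabla_{a} Q_{w}(s, a)\big|_{a=\pi(s;\theta)}
\ \ \text{(deterministic)}.
\]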


Deep Deterministic Policy Gradient (DDPG)

Goal: end-to-end learning of control policy from pixels

Input: state is stack of raw pixels from last 4 frames

Output: two separate CNNs for Q and 𝜋

Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015.


E2E RL-Based Info-Bot

(Dhingra et al., 2016)

User goal: Movie=?; Actor=Bill Murray; Release Year=1993

User: Find me the Bill Murray’s movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.

Knowledge Base (head, relation, tail): (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)

Idea: a differentiable database for propagating the gradients

https://arxiv.org/abs/1609.00777


Dialogue Management Evaluation

Metrics

Turn-level evaluation: system action accuracy

Dialogue-level evaluation: task success rate, reward
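A small sketch of how these metrics could be computed over logged dialogues (the data layout is illustrative):

def turn_level_accuracy(predicted_actions, reference_actions):
    """Fraction of turns where the predicted system action matches the reference."""
    correct = sum(p == r for p, r in zip(predicted_actions, reference_actions))
    return correct / len(reference_actions)

def dialogue_level_metrics(dialogues):
    """dialogues: a list of dicts, each with a boolean 'success' and a numeric 'reward'."""
    success_rate = sum(d["success"] for d in dialogues) / len(dialogues)
    avg_reward = sum(d["reward"] for d in dialogues) / len(dialogues)
    return success_rate, avg_reward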
