Slides adapted from Gašić
Review
Task-Oriented Dialogue System
(Young, 2000)
Pipeline, illustrated with the input "Are there any action movies to see this weekend?" (as speech signal or text):
• Speech Recognition: speech signal → hypothesis ("are there any action movies to see this weekend")
• Language Understanding (LU): domain identification, user intent detection, slot filling → semantic frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): dialogue state tracking (DST) and dialogue policy, consulting the backend database / knowledge providers → system action/policy: request_location
• Natural Language Generation (NLG) → text response: "Where are you located?"
http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
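To make the LU output concrete, here is a minimal sketch of the semantic frame and system action from the example above as plain data structures (the field names are illustrative assumptions, not a fixed schema):

```python
# Hypothetical encoding of the example above; field names are illustrative.
semantic_frame = {
    "intent": "request_movie",
    "slots": {"genre": "action", "date": "this weekend"},
}

# The dialogue policy maps the tracked dialogue state to a system action.
system_action = {"act": "request_location"}

# NLG would then realize the action as text: "Where are you located?"
```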
Dialogue Management
Example Dialogue
System: greeting ()
User: request (restaurant; foodtype=Thai)
System: request (area)
User: inform (area=centre)
System: inform (restaurant=Bangkok city, area=centre of town, foodtype=Thai)
User: request (address)
System: inform (address=24 Green street)
User: bye ()
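The same dialogue can be written out as data: each turn is a speaker plus an act type with slot-value pairs (a hedged sketch; the representation is illustrative, not a fixed schema):

```python
# Illustrative act-type + slot-value encoding of the dialogue above.
dialogue = [
    ("system", "greeting", {}),
    ("user", "request", {"entity": "restaurant", "foodtype": "Thai"}),
    ("system", "request", {"entity": "area"}),
    ("user", "inform", {"area": "centre"}),
    ("system", "inform", {"restaurant": "Bangkok city",
                          "area": "centre of town", "foodtype": "Thai"}),
    ("user", "request", {"entity": "address"}),
    ("system", "inform", {"address": "24 Green street"}),
    ("user", "bye", {}),
]
```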
Elements of Dialogue Management
(Figure from Gašić)
Rule-Based Management
Dialogue Policy Optimization
Reinforcement Learning
Reinforcement Learning
RL is a general-purpose framework for decision making
RL is for an agent with the capacity to act
Each action influences the agent's future state
Success is measured by a scalar reward signal
Goal: select actions to maximize future reward
Big three: action, state, reward
Scenario of Reinforcement Learning
Agent learns to take actions to maximize expected reward.
The agent receives observation o_t from the environment, takes action a_t (e.g., the next move in a game), and receives reward r_t (e.g., if win, reward = 1; if loss, reward = -1; otherwise, reward = 0).
Supervised vs. Reinforcement
Supervised: learning from a teacher, with the correct output given for each input (e.g., "Hello" → say "Hi"; "Bye bye" → say "Good bye")
Reinforcement: learning from critics, with only overall feedback on the agent's behavior (e.g., after a whole dialogue, the agent is only told "Bad")
Dialogue as Reinforcement Learning
Problems in solving dialogue as an RL task:
1) Optimization problem size
Belief dialogue state space is large and continuous
System action space is large
Solution: learn in a reduced summary space
2) Knowledge of the environment (user)
Transition probability is unknown (user status)
How to get rewards
Solution: learn in interaction with a simulated user
3) RL takes a long time to converge
Large Belief Space and Action Space
Solution: perform optimization in a reduced summary space built according to heuristics
Belief dialogue state → (summary function) → summary dialogue state → (summary policy) → summary action → (master function) → system action
Transition Probability and Rewards
Solution: learn from a simulated user
The user simulation combines a user model and a reward model with an error model (recognition errors, LU errors). It emits a distribution over user dialogue acts (semantic frames) to dialogue state tracking (DST) and rewards to dialogue policy optimization inside the dialogue management (DM) module, which consults backend actions / knowledge providers and returns system dialogue acts.
Agent and Environment
At time step t
The agent
Executes action a_t
Receives observation o_t
Receives scalar reward r_t
The environment
Receives action a_t
Emits observation o_{t+1}
Emits scalar reward r_{t+1}
t increments at each environment step
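A minimal sketch of this interaction loop, with a toy environment standing in for the real one (everything here is illustrative, not from the slides):

```python
import random

class GuessingEnv:
    """Toy environment: the agent guesses a coin flip; win = +1, loss = -1."""
    def step(self, action):
        outcome = random.choice(["heads", "tails"])
        reward = 1 if action == outcome else -1
        return outcome, reward  # (observation o_{t+1}, reward r_{t+1})

env = GuessingEnv()
for t in range(5):
    action = random.choice(["heads", "tails"])  # agent executes a_t
    observation, reward = env.step(action)      # env emits o_{t+1}, r_{t+1}
    print(f"t={t}  a_t={action}  o={observation}  r={reward}")
```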
State
Experience is the sequence of observations, actions, and rewards
State is the information used to determine what happens next
What happens next depends on the history of experience:
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history of experience: s_t = f(o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t)
POMDP Policy Optimization
Finding the value function associated with the optimal policy, i.e. the one that generates the maximal return
Problem: tractable only for very simple cases (Kaelbling et al., 1998)
Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are belief states
Markov Decision Process (MDP)
Belief state from tracking: b_t = s_t
System actions: a_t
Rewards: r_t
Transition probability: p(b_{t+1} | b_t, a_t)
DM as Markov Decision Process (MDP)
Data: belief dialogue states (continuous); reward, a measure of dialogue quality
Model: Markov decision process (MDP) & reinforcement learning
Prediction: system actions (dialogue policy optimization)
Dialogue Policy Optimization
Dialogue management in an RL framework
The dialogue manager is the agent; the user, reached through natural language generation and language understanding, is the environment. The agent takes action A and receives observation O and reward R from the environment.
The optimized dialogue policy selects the best action that maximizes the future reward.
Correct rewards are a crucial factor in dialogue policy training
Reward
Reinforcement learning is based on the reward hypothesis
A reward r_t is a scalar feedback signal
Indicates how well the agent is doing at step t
Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward
Reward for RL ≅ Evaluation for System
Dialogue is a special RL task
Humans are involved in the interaction and in the rating (evaluation) of a dialogue
Fully human-in-the-loop framework
Rating: correctness, appropriateness, and adequacy
• Expert rating: high quality, high cost
• User rating: unreliable quality, medium cost
• Objective rating: checks desired aspects, low cost
Reinforcement Learning for Dialogue Policy Optimization
In the loop: language understanding maps user input (o) to state s; the dialogue policy a = π(s) picks action a; language (response) generation produces the response; the system collects rewards (s, a, r, s') and optimizes Q(s, a).
Types of bots, with their state, action, and reward:
• Social ChatBots: state = chat history; action = system response; reward = # of turns maximized, intrinsically motivated reward
• InfoBots (interactive Q/A): state = user's current question + context; action = answers to the current question; reward = relevance of answer, # of turns minimized
• Task-Completion Bots: state = user's current input + context; action = system dialogue act w/ slot values (or API calls); reward = task success rate, # of turns minimized
Goal: develop a generic deep RL algorithm to learn dialogue policy for all bot categories
Dialogue Reinforcement Learning Signal
Typical reward function (see the sketch below):
• -1 per-turn penalty
• Large reward at completion if successful
Typically requires domain knowledge
✔ Simulated user
✔ Paid users (Amazon Mechanical Turk)
✖ Real users
The user simulator is usually required for dialogue system training before deployment
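A hedged sketch of such a reward signal for a task-completion dialogue; the success bonus of +20 is an illustrative constant, not from the slides:

```python
def turn_reward(dialogue_over, task_success, success_bonus=20):
    """Typical shaped reward: -1 per turn; large terminal bonus on success.
    `success_bonus` is an illustrative assumption."""
    if dialogue_over and task_success:
        return -1 + success_bonus
    return -1  # the per-turn penalty encourages shorter dialogues
```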
Sequential Decision Making
Goal: select actions to maximize total future reward
Actions may have long-term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Deep Reinforcement Learning
The agent interacts with the environment through observation, action, and reward. A deep neural network (DNN) serves as the function from observation (input) to action (output); the reward is used to pick the best function.
Reinforcement Learning Approach: Value-Based, Policy-Based, Model-Based
Major Components in an RL Agent
An RL agent may include one or more of these components
Policy: agent’s behavior function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Policy
A policy is the agent's behavior
A policy maps from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(a|s)
Value Function
A value function is a prediction of future reward (with action a in state s)
The Q-value function gives the expected total reward from state s and action a, under policy π, with discount factor γ:
Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a]
Value functions decompose into a Bellman equation:
Q^π(s, a) = E_{s', a'}[r + γ Q^π(s', a') | s, a]
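The expectation above is over sampled trajectories; for a single trajectory the discounted return is easy to compute, e.g.:

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one sampled trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Per-turn penalties followed by a success bonus, as in the dialogue reward above:
print(discounted_return([-1, -1, -1, 19]))
```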
Optimal Value Function
An optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)
Knowing Q* allows us to act optimally: π*(s) = argmax_a Q*(s, a)
It informally maximizes over all decisions
The optimal value function also decomposes into a Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]
Reinforcement Learning Approach
Policy-based RL
Search directly for the optimal policy π*, i.e. the policy achieving maximum future reward
Value-based RL
Estimate the optimal value function Q*(s, a), i.e. the maximum value achievable under any policy
Model-based RL
Build a model of the environment
Plan (e.g. by lookahead) using the model
Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Maze Example: Policy
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Arrows represent the policy π(s) for each state s
Maze Example: Value Function
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location
Numbers represent the value v_π(s) of each state s
Categorizing RL Agents
Value-Based: value function; no policy (implicit)
Policy-Based: policy; no value function
Actor-Critic: policy and value function
Model-Free: policy and/or value function; no model
Model-Based: policy and/or value function; model
RL Agent Taxonomy
Value-Based Deep RL: Dynamic Programming, Monte-Carlo, Temporal-Difference, Q-Learning
Dynamic Programming
Model-based
Evaluate policy
Update policy
Dynamic Programming
GridWorld example
Monte-Carlo RL
Characteristics
Learn from complete episodes of experience
Model-free: no knowledge of MDP transitions / rewards
Value = mean return
MC policy evaluation
Goal: learn v_π from episodes of experience under policy π
Return is the total discounted reward: G_t = r_{t+1} + γ r_{t+2} + ... + γ^{T-1} r_T
Value function is the expected return: v_π(s) = E[G_t | s_t = s], estimated as a mean over sampled returns (see the sketch below)
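A compact sketch of every-visit Monte-Carlo policy evaluation under these definitions; episodes are assumed to be lists of (state, reward) pairs collected under π:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """Every-visit MC: estimate v_pi(s) as the mean return observed from s."""
    returns = defaultdict(list)
    for episode in episodes:          # episode = [(s_1, r_2), (s_2, r_3), ...]
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g    # accumulate discounted return backwards
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```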
Monte-Carlo
Model-free prediction
Temporal-Difference RL
Characteristics
Learn from incomplete episodes of experience
Model-free: no knowledge of MDP transitions / rewards
Update a guess toward a guess
TD policy evaluation
Goal: learn v_π online from experience under policy π
Value function is updated toward the estimated return:
V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
where r_{t+1} + γ V(s_{t+1}) is the TD target (see the update sketch below)
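The corresponding TD(0) update as code (a sketch; V is a plain dict and α the learning rate):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s_t) toward the TD target r_{t+1} + gamma * V(s_{t+1})."""
    td_target = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (td_target - V.get(s, 0.0))
```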
Temporal-Difference
Model-free prediction
Q-Learning: Value Function Approximation
When value functions are represented by a lookup table:
there are too many states and/or actions to store
it is too slow to learn the value of each entry individually
Values can instead be estimated with function approximation
Q-Networks
Q-networks represent value functions with weights θ: Q(s, a; θ) ≈ Q^π(s, a)
generalize from seen states to unseen states
update the parameters θ of the function approximation
Q-Learning
Goal: estimate the optimal Q-values Q*(s, a)
Optimal Q-values obey a Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]
Value iteration algorithms solve the Bellman equation by repeatedly moving Q(s, a) toward the learning target r + γ max_{a'} Q(s', a'), as in the tabular sketch below
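A tabular sketch of the resulting update, with Q stored as a dict keyed by (state, action):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s,a) toward the learning target r + gamma * max_a' Q(s',a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```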
Deep Q-Networks (DQN)
Represent the value function by a deep Q-network with weights θ
Objective: minimize the mean squared error (MSE) loss by SGD:
L(θ) = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]
where r + γ max_{a'} Q(s', a'; θ) is the learning target
Leading to the following Q-learning gradient:
∂L(θ)/∂θ = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)) ∂Q(s, a; θ)/∂θ]
A batched sketch of this loss follows.
Issue: naïve Q-learning oscillates or diverges using NNs due to: 1) correlations between samples, 2) non-stationary targets
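A hedged PyTorch-style sketch of the MSE objective; `q_net` and `target_net` are assumed to map a batch of states to per-action Q-values (the frozen target network anticipates the stability fixes on the next slides):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                         # batched tensors; done is 1.0 at episode end
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():                                 # target is not differentiated
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1 - done) * max_next        # learning target
    return F.mse_loss(q_sa, target)
```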
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
Successive samples are correlated, non-iid (independent and identically distributed)
2. Policy changes rapidly with slight changes to Q-values
Policy may oscillate
Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
Naive Q-learning gradients can be unstable when backpropagated
Stable Solutions for DQN
DQN provides a stable solution to deep value-based RL
1. Use experience replay
Break correlations in data, bringing us back to the iid setting
Learn from all past policies
2. Freeze the target Q-network
Avoid oscillations
Break correlations between the Q-network and its target
3. Clip rewards or adaptively normalize the network to a sensible range
Robust gradients
Stable Solution 1: Experience Replay
To remove correlations, build a dataset from the agent's own experience (sketched below)
Take action a_t according to an ε-greedy policy (small probability ε of exploration)
Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample random mini-batches of transitions (s, a, r, s') from D
Optimize the MSE between the Q-network and the Q-learning targets
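A minimal replay-memory sketch matching these steps (capacity, batch size, and ε are illustrative values):

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions; sampling random mini-batches breaks correlations."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon=0.05):
    """Greedy action, except with small probability epsilon explore randomly."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```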
Stable Solution 2: Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
Compute Q-learning targets w.r.t. old, fixed parameters θ⁻: r + γ max_{a'} Q(s', a'; θ⁻)
Optimize the MSE between the Q-network and the Q-learning targets
Periodically update the fixed parameters: θ⁻ ← θ
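With PyTorch-style modules, freezing and periodically refreshing the target amounts to a parameter copy; the 1000-step interval below is an illustrative assumption:

```python
def sync_target(q_net, target_net):
    """Copy current weights into the frozen target: theta_minus <- theta."""
    target_net.load_state_dict(q_net.state_dict())

# Inside the training loop (sketch):
# if step % 1000 == 0:
#     sync_target(q_net, target_net)
```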
Stable Solution 3: Reward / Value Range
To avoid oscillations, control the reward / value range
DQN clips the rewards to [−1, +1]
This prevents Q-values from becoming too large
Ensures gradients are well-conditioned
Other Improvements: Double DQN
Nature DQN target: r + γ max_{a'} Q(s', a'; θ⁻)
Double DQN removes the upward bias caused by taking the max over the same estimated Q-values:
The current Q-network (θ) is used to select actions
The older Q-network (θ⁻) is used to evaluate actions
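A sketch of the Double DQN target under the same PyTorch-style assumptions as before: the current network selects, the older network evaluates:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # select (current net)
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluate (older net)
        return r + gamma * (1 - done) * q_eval
```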
Other Improvements: Prioritized Replay
Prioritized Replay: weight experience based on surprise
Store experience in priority queue according to DQN error
Other Improvements: Dueling Network
Dueling network: split the Q-network into two channels (see the sketch below)
Action-independent value function V(s): estimates how good the state is
Action-dependent advantage function A(s, a): estimates the additional benefit of taking each action
Q(s, a) = V(s) + A(s, a)
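A sketch of the two-channel head (PyTorch assumed); subtracting the mean advantage is the common identifiability trick, an assumption beyond what the slide states:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split a shared representation h into V(s) and A(s,a), then recombine."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)              # action-independent V(s)
        self.advantage = nn.Linear(hidden_dim, n_actions)  # action-dependent A(s,a)

    def forward(self, h):
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)         # Q(s, a)
```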
DQN for Dialogue Management
(Li et al., 2017)
Goal: end-to-end learning of the values Q(s, a) from interactions with a simulated user
Input: the state is the combination of the user's observation history, the previous system action, and the database-returned results
Output: Q(s, a) for all available system actions a
Reward: -1 per turn; large reward for successful task completion
Example: semantic frame request_movie (genre=action, date=this weekend) → DQN-based DM, consulting the backend DB → system action/policy request_location
E2E RL-Based System
(Zhao and Eskenazi, 2016)
Joint learning of NLU, DST, and dialogue policy
Deep RL for training: a deep Q-network with a deep recurrent network
Compared variants: baseline RL and hybrid-RL
http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
The end-to-end neural dialogue system takes text input ("Are there any action movies to see this weekend?"). Language understanding (LU) tags the input words with slot labels (e.g., B-type, O) and predicts an intent, yielding the semantic frame request_movie (genre=action, date=this weekend). Dialogue management (DM) then outputs the system action/policy request_location, which natural language generation (NLG) realizes as text. A user simulator with user-agenda modeling, driven by a user goal, answers with user dialogue acts such as inform(location=San Francisco), and the loop repeats across time steps t-2, t-1, t.
Idea: supervised learning for each component and reinforcement learning for end-to-end training of the neural dialogue system
https://arxiv.org/abs/1703.01008
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
User goal: two tickets for "the witch" tomorrow 9:30 PM at Regal Meridian 16, Seattle.
RULE-BASED SYSTEM
User: Can I get 2 tickets for the witch?
Agent: Which theater do you want?
User: Regal meridian 16
Agent: What time would you like to see it?
User: 9:30 pm
Agent: What date would you like to see it?
User: Tomorrow!
Agent: How many people are you?
User: Just two adults.
Agent: What is your location?
User: Seattle!
Agent: What kind of video format do you like?
User: Oh, any kind is okay.
Agent: Do you have any price constraint?
User: No, I don't care.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thank you.
Agent: Thank you!
REINFORCEMENT LEARNING SYSTEM
User: Can I get 2 tickets for the witch?
Agent: What date would you like to see it?
User: Tomorrow
Agent: Which theater do you prefer?
User: Regal meridian 16
Agent: What is your location?
User: Seattle
Agent: What time would you like to see it?
User: 9:30 pm.
Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.
User: Thanks.
Agent: Thanks!
The system can learn how to efficiently interact with users for task completion.
https://arxiv.org/abs/1703.01008
Policy-Based Deep RL
Estimating how good an agent's behavior is
Deep Policy Networks
Represent the policy by a deep network with weights θ:
stochastic policy a ~ π(a|s; θ), or deterministic policy a = π(s; θ)
Objective: maximize the total discounted reward by SGD:
L(θ) = E[r_1 + γ r_2 + γ² r_3 + ... | π(·; θ)]
Policy Gradient
The gradient of a stochastic policy is given by:
∂L(θ)/∂θ = E[∂ log π(a|s; θ)/∂θ · Q^π(s, a)]
The gradient of a deterministic policy is given by:
∂L(θ)/∂θ = E[∂Q^π(s, a)/∂a · ∂π(s; θ)/∂θ]
The deterministic form shows how to deal with continuous actions: differentiate Q through the action (a REINFORCE-style sketch of the stochastic case follows)
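For the stochastic case, the score-function form above is what REINFORCE-style implementations optimize; a hedged PyTorch sketch, with the sampled return standing in for Q^π(s, a):

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """log_probs: list of log pi(a_t|s_t) tensors along one episode;
    returns: the discounted return G_t from each step.
    Minimizing this loss ascends the estimate E[grad log pi(a|s) * Q]."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(torch.stack(log_probs) * returns).sum()
```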
Actor-Critic (Value-Based + Policy-Based)
Estimate the value function Q(s, a; w) (the critic)
Update the policy parameters θ by SGD (the actor)
Stochastic policy: ∂L/∂θ = E[∂ log π(a|s; θ)/∂θ · Q(s, a; w)]
Deterministic policy: ∂L/∂θ = E[∂Q(s, a; w)/∂a · ∂π(s; θ)/∂θ]
Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly (a one-step sketch follows)
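A one-step sketch of this interplay, assuming PyTorch tensors; using the TD error as the critic's score is a common variant of plugging in Q directly, an assumption beyond the slide:

```python
def actor_critic_step(log_prob, value, value_next, r, gamma=0.99):
    """log_prob = log pi(a_t|s_t; theta); value = V(s_t; w); value_next = V(s_{t+1}; w)."""
    td_target = r + gamma * value_next.detach()
    td_error = td_target - value
    actor_loss = -log_prob * td_error.detach()  # policy moves where the critic approves
    critic_loss = td_error.pow(2)               # critic moves toward the TD target
    return actor_loss, critic_loss
```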
Deep Deterministic Policy Gradient (DDPG)
Goal: end-to-end learning of a control policy from pixels
Input: the state is a stack of raw pixels from the last 4 frames
Output: two separate CNNs for Q and π
Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv, 2015.
E2E RL-Based Info-Bot
(Dhingra et al., 2016)
Example (user goal: Movie=?; Actor=Bill Murray; Release Year=1993):
User: Find me the Bill Murray's movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Knowledge base of (head, relation, tail) triples, e.g. (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)
Idea: a differentiable database for propagating the gradients
https://arxiv.org/abs/1609.00777
Dialogue Management Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward
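Both metric levels are straightforward to compute from logged dialogues; a sketch over hypothetical log structures:

```python
def system_action_accuracy(predicted, reference):
    """Turn-level: fraction of turns whose system action matches the reference."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

def task_success_rate(dialogues):
    """Dialogue-level: fraction of dialogues marked successful."""
    return sum(d["success"] for d in dialogues) / len(dialogues)
```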
Concluding Remarks
Dialogue policy optimization in DM solves an MDP via RL
Value-based
Dynamic programming
Monte-Carlo
Temporal-difference
Q-learning / DQN
Policy-based
Deep policy gradient
Actor-critic