Slide credit: Gašić
Review
Task-Oriented Dialogue System
(Young, 2000)

(Figure: the classic pipeline. A speech signal enters Speech Recognition, producing the hypothesis "are there any action movies to see this weekend"; typed text input "Are there any action movies to see this weekend?" can bypass recognition. Language Understanding (LU) performs domain identification, user intent detection, and slot filling, yielding the semantic frame request_movie(genre=action, date=this weekend). Dialogue Management (DM) performs dialogue state tracking (DST) and dialogue policy selection, consulting a backend database / knowledge providers, and outputs the system action request_location. Natural Language Generation (NLG) renders the text response "Where are you located?")

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Dialogue Management
Example Dialogue

Sys: greeting()
Usr: request(restaurant; foodtype=Thai)
Sys: request(area)
Usr: inform(area=centre)
Sys: inform(restaurant=Bangkok city, area=centre of town, foodtype=Thai)
Usr: request(address)
Sys: inform(address=24 Green street)
Usr: bye()
Elements of Dialogue Management
(Figure from Gašić)
Reinforcement Learning

RL is a general-purpose framework for decision making
RL is for an agent with the capacity to act
Each action influences the agent's future state
Success is measured by a scalar reward signal
Goal: select actions to maximize future reward
Big three: action, state, reward
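As a worked formula (standard RL notation, added here for concreteness; it is not spelled out on the slide), the goal is to pick the policy that maximizes expected discounted future reward:

```latex
% Standard RL objective: gamma in [0,1) is a discount factor,
% pi is the policy mapping states to actions.
\[
  \pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; \pi \right]
\]
```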
Reinforcement Learning

(Figure: the agent-environment loop. The agent receives an observation from the environment and takes an action; the environment returns a reward, here the negative feedback "Don't do that".)
Reinforcement Learning

(Figure: the same loop with positive feedback "Thank you." as the reward.)

The agent learns to take actions that maximize expected reward.
Scenario of Reinforcement Learning

(Figure: game playing as RL. The agent observes the board state, outputs the next move as its action, and receives a reward from the environment: +1 on a win, -1 on a loss, and 0 otherwise.)

The agent learns to take actions that maximize expected reward.
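A minimal sketch of this sparse game reward in code; the `GameOutcome` enum is a hypothetical helper introduced for illustration:

```python
from enum import Enum

class GameOutcome(Enum):
    WIN = "win"
    LOSS = "loss"
    ONGOING = "ongoing"

def game_reward(outcome: GameOutcome) -> float:
    """Sparse reward: +1 on a win, -1 on a loss, 0 for every other move."""
    if outcome is GameOutcome.WIN:
        return 1.0
    if outcome is GameOutcome.LOSS:
        return -1.0
    return 0.0
```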
Agent and Environment

At time step t:
The agent
• executes action a_t
• receives observation o_t
• receives scalar reward r_t
The environment
• receives action a_t
• emits observation o_{t+1}
• emits scalar reward r_{t+1}
t increments at the environment step.

(Figure: the loop annotated with observation o_t, action a_t, reward r_t.)
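A minimal sketch of this interaction protocol, loosely following a Gym-style `reset`/`step` convention; the `agent` and `env` objects and their methods are hypothetical:

```python
def run_episode(agent, env, max_steps=100):
    """One episode of the agent-environment loop described above."""
    observation = env.reset()   # initial observation o_0
    reward = 0.0
    for t in range(max_steps):
        # The agent executes a_t given what it has seen so far.
        action = agent.act(observation, reward)
        # The environment receives a_t and emits o_{t+1}, r_{t+1};
        # t increments at this step.
        observation, reward, done = env.step(action)
        if done:
            break
```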
State

Experience is the sequence of observations, actions, and rewards.
State is the information used to determine what happens next;
what happens next depends on the history of experience:
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history (see the formula below).
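In standard notation (added here for concreteness), the history collects everything seen up to time t, and the state is some function of it:

```latex
\[
  h_t = (o_1, r_1, a_1,\; \dots,\; a_{t-1}, o_t, r_t),
  \qquad
  s_t = f(h_t)
\]
```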
Dialogue Policy Optimization
Decision Making
Elements of Dialogue Management
(Figure from Gašić; the dialogue policy optimization component is highlighted)
Partially Observable Markov Decision Process (POMDP)

Dialogue states: s_t
Noisy observations: o_t
System actions: a_t
Rewards: r_t
Transition probability: p(s_{t+1} | s_t, a_t)
Observation probability: p(o_{t+1} | s_{t+1})
Distribution over states (belief state): b(s_t)

(Figure: the POMDP graphical model over s_t, s_{t+1}, a_t, r_t, o_t, o_{t+1}.)
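The belief state is maintained by the standard POMDP filtering update (not spelled out on the slide): a transition step followed by an observation correction.

```latex
% After taking a_t and observing o_{t+1}:
\[
  b_{t+1}(s_{t+1}) \;\propto\;
  p(o_{t+1} \mid s_{t+1})
  \sum_{s_t} p(s_{t+1} \mid s_t, a_t)\, b_t(s_t)
\]
```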
DM as Partially Observable Markov Decision Process (POMDP)

Data
• Noisy observations of dialogue states
• Reward – a measure of dialogue quality
Model
• Partially observable Markov decision process (POMDP)
Prediction
• Distribution over dialogue states (belief state)
• Optimal system actions – dialogue policy optimization
Decision Making in POMDP

Policy: a mapping from belief states to actions, $a_t = \pi(b_t)$ (belief estimation → action mapping)
Return: the accumulated reward, $R_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}$
Value function: $V^{\pi}(b_t) = \mathbb{E}[R_t \mid b_t, \pi]$ — how good the system is in a particular belief state

(Figure: the POMDP graphical model over s_t, s_{t+1}, a_t, r_t, o_t, o_{t+1}.)
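A tiny worked example of the return; the -1 per-turn penalty and +20 success bonus are illustrative values (a common convention in dialogue policy work, not fixed by the slide):

```python
def discounted_return(rewards, gamma=0.99):
    """Accumulated discounted reward: R_t = sum_k gamma^k * r_{t+k}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A dialogue that succeeds after three turns, with a -1 penalty per turn
# and a +20 success bonus on the final turn (illustrative reward shape):
print(discounted_return([-1.0, -1.0, 20.0]))  # -1 - 0.99 + 0.99**2 * 20 ≈ 17.61
```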
POMDP Policy Optimization

Goal: find the value function associated with the optimal policy, i.e. the one that generates the maximal return.
Problem: this is tractable only for very simple cases (Kaelbling et al., 1998).
Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are belief states.
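Why this view works (a standard construction, added here as the missing step): with the deterministic belief update $b' = \tau(b, a, o')$ from the filtering equation above, the belief itself evolves as a Markov process, with transitions obtained by marginalizing over the next observation:

```latex
\[
  p(b' \mid b, a) \;=\; \sum_{o'} p(o' \mid b, a)\,
  \mathbb{1}\!\left[\, b' = \tau(b, a, o') \,\right]
\]
```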
Markov Decision Process (MDP)

Belief state from tracking: b_t (serves as the MDP state s_t)
System actions: a_t
Rewards: r_t
Transition probability: p(b_{t+1} | b_t, a_t)

(Figure: the MDP graphical model over b_t, b_{t+1}, a_t, r_t.)
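A minimal sketch of RL on this belief-state MDP, using one-step Q-learning with a linear Q-function. The feature size, action count, and choice of learner are illustrative assumptions; real systems often use richer learners (e.g. GP-SARSA or deep RL):

```python
import numpy as np

N_FEATURES, N_ACTIONS = 32, 8               # illustrative sizes
theta = np.zeros((N_ACTIONS, N_FEATURES))   # one weight row per action

def q_values(belief):
    """Q(b, a) for all actions under a linear function approximator."""
    return theta @ belief

def td_update(b, a, r, b_next, alpha=0.1, gamma=0.99, terminal=False):
    """One-step Q-learning update on a continuous belief vector b."""
    target = r if terminal else r + gamma * float(np.max(q_values(b_next)))
    theta[a] += alpha * (target - q_values(b)[a]) * b
```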
DM as Markov Decision Process (MDP)

Data
• Belief dialogue states (continuous)
• Reward – a measure of dialogue quality
Model
• Markov decision process (MDP) & reinforcement learning
Prediction
• System actions – dialogue policy optimization
Policy Optimization Issues

Optimization problem size
• The belief dialogue state space is large and continuous
• The system action space is large
Knowledge of the environment (the user)
• The transition probability is unknown (it depends on the user's state)
• How to obtain rewards
Large Belief Space and Action Space

Solution: perform optimization in a reduced summary space built according to heuristics.

(Figure: a summary function maps the belief dialogue state to a summary dialogue state; the summary policy picks a summary action; a master function maps it back to a full system action.)
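A sketch of what such heuristics might look like in code; the confidence-bucketing rule and the `confirm_food` expansion are invented for illustration, standing in for the hand-crafted mappings real systems use:

```python
def summary_state(belief: dict) -> tuple:
    """Summary function: compress a full belief (slot -> {value: prob})
    into a small discrete summary state, one confidence bucket per slot."""
    def bucket(p):
        return "high" if p > 0.8 else "medium" if p > 0.4 else "low"
    return tuple(
        (slot, bucket(max(dist.values())))
        for slot, dist in sorted(belief.items())
    )

def master_action(summary_action: str, belief: dict) -> str:
    """Master function: expand a summary action back into a full system
    action by filling in the most likely value from the full belief."""
    if summary_action == "confirm_food":   # hypothetical summary action
        top = max(belief["food"], key=belief["food"].get)
        return f"confirm(food={top})"
    return summary_action
```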
Transition Probability and Rewards

Solution: learn from real users.

(Figure: the full pipeline again — Speech Recognition, Language Understanding (domain identification, user intent detection, slot filling), Dialogue Management (DST + dialogue policy), Natural Language Generation, backend database / knowledge providers — with a real user in the loop.)
Transition Probability and Rewards

Solution: learn from a simulated user.

(Figure: a user simulation — a user model plus a reward model — emits a distribution over user dialogue acts (semantic frames); an error model injects recognition and LU errors; Dialogue Management (dialogue state tracking + dialogue policy optimization) replies with system dialogue acts, consulting backend actions / knowledge providers, and receives the reward.)
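A sketch of this training loop under stated assumptions: `policy`, `user_model`, and `reward_model` are hypothetical stubs, and the error model simply swaps in a random confusion act with some probability, imitating ASR/LU noise (agenda-based simulators are a common concrete choice for the user model):

```python
import random

def corrupt(user_act: str, error_rate: float = 0.15) -> str:
    """Error model: with probability error_rate, replace the true user
    act with a confusion, as recognition/LU errors would."""
    confusions = ["inform(food=thai)", "inform(area=centre)", "null()"]
    return random.choice(confusions) if random.random() < error_rate else user_act

def run_simulated_dialogue(policy, user_model, reward_model, max_turns=20):
    total_reward, sys_act = 0.0, "greeting()"
    for _ in range(max_turns):
        user_act = user_model.respond(sys_act)   # user model emits a user act
        observed = corrupt(user_act)             # error model corrupts it
        belief = policy.track(observed)          # dialogue state tracking
        sys_act = policy.select_action(belief)   # dialogue policy
        total_reward += reward_model.reward(belief, sys_act)
        if user_act == "bye()":
            break
    return total_reward
```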
Concluding Remarks

Dialogue policy optimization can be viewed as an RL task.
A POMDP can be viewed as a continuous-space MDP over belief states.
The belief dialogue state space can be summarized to reduce computational complexity.
Transition probabilities and rewards come from
• real users
• simulated users