Slide credits: Gašić
Review
Task-Oriented Dialogue System
(Young, 2000)
Speech Recognition
Speech Signal → Hypothesis: are there any action movies to see this weekend
(Text Input: Are there any action movies to see this weekend?)
Language Understanding (LU)
• Domain Identification
• User Intent Detection
• Slot Filling
Semantic Frame: request_movie (genre=action, date=this weekend)
Dialogue Management (DM), connected to the Backend Database/Knowledge Providers
• Dialogue State Tracking (DST)
• Dialogue Policy
System Action/Policy: request_location
Natural Language Generation (NLG)
Text Response: Where are you located?
http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Dialogue Management
Example Dialogue
request (restaurant; foodtype=Thai)
inform (area=centre)
request (address)
bye ()
Elements of Dialogue Management
(Figure from Gašić)
Dialogue State Tracking (DST)
Maintain a probability distribution instead of a 1-best prediction for better robustness to recognition errors
Incorrect for both!
Dialogue State Tracking (DST)
Maintain a probability distribution instead of a 1-best prediction for better robustness to SLU errors or ambiguous input
System: How can I help you?
User: Book a table at Sumiko for 5
System: How many people?
User: 3

After "Book a table at Sumiko for 5":
Slot | Value
# people | 5 (0.5)
time | 5 (0.5)

After "3":
Slot | Value
# people | 3 (0.8)
time | 5 (0.8)
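The example above can be sketched as a tracker that keeps both interpretations of "5" instead of committing to one. This is a minimal Python sketch, assuming hypothetical likelihoods (0.2 / 0.8) for the user's answer "3"; it ignores recognition uncertainty about "3", so # people = 3 ends up with probability 1.0 here rather than the 0.8 in the table above.

```python
from collections import defaultdict

# Belief = list of (hypothesis, probability); a hypothesis maps slot -> value.

def normalize(belief):
    total = sum(p for _, p in belief)
    return [(hyp, p / total) for hyp, p in belief]

def marginals(belief):
    """Sum hypothesis probabilities into a per-slot value distribution."""
    out = defaultdict(lambda: defaultdict(float))
    for hyp, p in belief:
        for slot, value in hyp.items():
            out[slot][value] += p
    return {slot: dict(values) for slot, values in out.items()}

# Turn 1: "Book a table at Sumiko for 5" (is 5 the party size or the time?)
belief_turn1 = [({"people": "5"}, 0.5), ({"time": "5"}, 0.5)]

def answer_people(hyp, value="3"):
    """Turn 2: the user answers "How many people?" with `value`.
    Assumed likelihoods: 0.2 if the hypothesis already had a party size
    (the answer would be a correction), 0.8 otherwise."""
    likelihood = 0.2 if "people" in hyp else 0.8
    return dict(hyp, people=value), likelihood

updated = []
for hyp, p in belief_turn1:
    new_hyp, likelihood = answer_people(hyp)
    updated.append((new_hyp, p * likelihood))
belief_turn2 = normalize(updated)

print(marginals(belief_turn1))
print(marginals(belief_turn2))
```

After the second turn, time = 5 gets about 0.8 because the answer "3" is better explained by the hypothesis in which "5" was the time.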
1-Best Input w/o State Tracking
N-Best Inputs w/o State Tracking
N-Best Inputs w/ State Tracking
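The difference between these settings can be sketched with made-up SLU N-best lists for one slot. This is a minimal sketch, not the tracker in the figures: it assumes the user's goal does not change, treats each N-best confidence as a likelihood, and uses an assumed floor of 0.05 for values the SLU did not list.

```python
# Hypothetical SLU N-best lists for the "food" slot over two user turns.
# The user wants Thai both times, but Thai is never the top hypothesis.
ONTOLOGY = ["italian", "thai", "indian", "chinese"]
TURNS = [
    {"italian": 0.45, "thai": 0.40},
    {"indian": 0.50, "thai": 0.45},
]
FLOOR = 0.05  # assumed likelihood for values missing from an N-best list

def one_best(turns):
    """1-best, no tracking: trust only the top hypothesis of the latest turn."""
    latest = turns[-1]
    return max(latest, key=latest.get)

def track(turns, values=ONTOLOGY, floor=FLOOR):
    """N-best with tracking: Bayesian update, assuming the goal never changes."""
    belief = {v: 1.0 / len(values) for v in values}  # uniform prior
    for nbest in turns:
        belief = {v: belief[v] * nbest.get(v, floor) for v in values}
        total = sum(belief.values())
        belief = {v: p / total for v, p in belief.items()}
    return belief

belief = track(TURNS)
print(one_best(TURNS), max(belief, key=belief.get))  # indian thai
```

Because Thai appears in every N-best list, its accumulated evidence outweighs values that were ranked first only once.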
Dialogue State Tracking (DST)
Definition
Representation of the system's belief of the user's goal(s) at any time during the dialogue
Challenge
How to define the state space?
How to tractably maintain the dialogue state?
Which actions to take for each state?
Define dialogue as a control problem where the behavior can be automatically learned
Introduction to RL
Reinforcement Learning
Reinforcement Learning
RL is a general-purpose framework for decision making
RL is for an agent with the capacity to act
Each action influences the agent’s future state
Success is measured by a scalar reward signal
Goal: select actions to maximize future reward
Big three: action, state, reward
Reinforcement Learning
Figure: the Agent takes an action in the Environment and receives an observation and a reward
Reward: "Don’t do that"
Reinforcement Learning
Figure: the Agent takes an action in the Environment and receives an observation and a reward
Reward: "Thank you."
Agent learns to take actions to maximize expected reward.
Supervised vs. Reinforcement
Supervised: learning from a teacher
"Hello" → say "Hi"
"Bye bye" → say "Good bye"
Reinforcement: learning from critics
Hello → Agent → …… → Agent → …… → Bad
Scenario of Reinforcement Learning
Figure: the Agent receives an observation from the Environment and takes an action (the next move)
Reward: if the agent wins, reward = 1; if it loses, reward = -1; otherwise, reward = 0
Agent learns to take actions to maximize expected reward.
RL Based AI Examples
Play games: Atari, poker, Go, …
Explore worlds: 3D worlds, …
Control physical systems: manipulate, …
Interact with users: recommend, optimize, personalize, …
Agent and Environment
Figure: the Agent receives observation $o_t$ and reward $r_t$ from the Environment and takes action $a_t$ (e.g., MoveRight → or MoveLeft ←)
Agent and Environment
At time step t
The agent
Executes action $a_t$
Receives observation $o_t$
Receives scalar reward $r_t$
The environment
Receives action $a_t$
Emits observation $o_{t+1}$
Emits scalar reward $r_{t+1}$
t increments at each environment step
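A minimal sketch of this loop in Python, using a hypothetical one-dimensional corridor environment with the MoveLeft/MoveRight actions from the previous figure; the class and function names are illustrative and not taken from any RL library.

```python
class Corridor:
    """Hypothetical toy environment: positions 0..length-1, reward +1 at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        # The environment receives a_t and emits o_{t+1} and r_{t+1}.
        move = 1 if action == "MoveRight" else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    obs, total, t, done = env.reset(), 0.0, 0, False
    while not done and t < max_steps:
        action = policy(obs)                   # the agent executes a_t
        obs, reward, done = env.step(action)   # and receives the next observation/reward
        total += reward
        t += 1                                 # t increments at each environment step
    return total, t


def always_right(obs):
    return "MoveRight"


print(run_episode(Corridor(), always_right))  # (1.0, 4)
```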
State
Experience is the sequence of observations, actions, and rewards: $h_t = o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t$
State is the information used to determine what happens next
What happens next depends on the history of experience
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history of experience: $s_t = f(h_t)$
Environment State
The environment state $s_t^e$ is the environment’s private representation
whatever data the environment uses to pick the next observation/reward
may not be visible to the agent
may contain irrelevant information
Agent State
The agent state $s_t^a$ is the agent’s internal representation
whatever data the agent uses to pick the next action
information used by RL algorithms
can be any function of experience
Information State
An information state (a.k.a. Markov state) contains all useful information from the history
A state $s_t$ is Markov iff $P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \dots, s_t)$
The future is independent of the past given the present
Once the state is known, the history may be thrown away
The state is a sufficient statistic of the future
Fully Observable Environment
Full observability: agent directly observes environment state
information state = agent state = environment state
This is a Markov decision process (MDP)
Partially Observable Environment
Partial observability: the agent indirectly observes the environment
agent state ≠ environment state
The agent must construct its own state representation $s_t^a$, e.g.:
Complete history: $s_t^a = h_t$
Beliefs of environment state: $s_t^a = \left(P(s_t^e = s^1), \dots, P(s_t^e = s^n)\right)$
Hidden state (from RNN): $s_t^a = \sigma(W_s s_{t-1}^a + W_o o_t)$
This is a partially observable Markov decision process (POMDP)
Reward
Reinforcement learning is based on reward hypothesis
A reward rt is a scalar feedback signal
Indicates how well agent is doing at step t
Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward
Sequential Decision Making
Goal: select actions to maximize total future reward
Actions may have long-term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Elements of Dialogue Management
(Figure from Gašić)
Dialogue state tracking
Generative vs. Discriminative
Generative
The state generates the observation
Discriminative
The state depends on the observation
Generative Approach
Dialogue State Tracking
Markov Process
A Markov process is a memoryless random process
a sequence of random states $S_1, S_2, \dots$ with the Markov property
Student Markov chain
Sample episodes starting from $S_1$ = C1
• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub
• C1 FB FB FB C1 C2 C3 Pub C2 Sleep
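Episodes like these can be sampled directly from the chain. The transition probabilities are not recoverable from these slides; the values below are assumed, following the Student Markov chain used in David Silver's RL lectures. The sampler itself is a minimal sketch.

```python
import random

# Assumed transition probabilities; an empty dict marks the terminal state.
P = {
    "C1":    {"C2": 0.5, "FB": 0.5},
    "C2":    {"C3": 0.8, "Sleep": 0.2},
    "C3":    {"Pass": 0.6, "Pub": 0.4},
    "Pass":  {"Sleep": 1.0},
    "Pub":   {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":    {"FB": 0.9, "C1": 0.1},
    "Sleep": {},
}

def sample_episode(start="C1", rng=random):
    """Follow the Markov chain from `start` until the terminal state."""
    state, episode = start, [start]
    while P[state]:
        successors = list(P[state])
        weights = [P[state][s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]  # next state depends only on the current one
        episode.append(state)
    return episode

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample_episode(rng=rng)))
```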
Student MRP
Markov Reward Process (MRP)
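A Markov reward process is a Markov chain with values
The return $G_t$ is the total discounted reward from time step t:
$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$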
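The return can be computed backwards with $G_t = r_{t+1} + \gamma G_{t+1}$. A minimal sketch; the Student MRP rewards used in the first example (-2 per class, +10 for Pass) are assumed, since the figure did not survive, and the second pair of reward sequences is hypothetical, illustrating how a delayed reward can beat an immediate one.

```python
def discounted_return(rewards, gamma):
    """G_t for a finite episode with rewards = [r_{t+1}, r_{t+2}, ..., r_T]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
    return g

# Episode C1 C2 C3 Pass Sleep with assumed rewards -2, -2, -2, +10 and gamma = 1/2
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2.25

# Sacrificing immediate reward for a larger delayed reward (gamma = 0.9)
patient = discounted_return([0, 0, 10], gamma=0.9)
greedy = discounted_return([1, 0, 0], gamma=0.9)
print(patient > greedy)  # True
```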
Markov Decision Process (MDP)
A Markov decision process is an MRP with decisions
It is an environment in which all states are Markov
Student MDP
Markov Decision Process (MDP)
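S: finite set of states/observations
A: finite set of actions
P: transition probability, $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
R: immediate reward, $R^a_s = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$
γ: discount factor, $\gamma \in [0, 1]$
The goal is to choose a policy π at time t that maximizes the expected overall return:
$\mathbb{E}_\pi[G_t] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$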
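When P and R are known, an optimal policy can be computed by value iteration. A minimal sketch on a hypothetical three-state MDP (not the Student MDP, whose figure did not survive): staying in A pays +1 per step, while moving to B pays nothing now but allows a +20 exit afterwards.

```python
# P[s][a] = list of (probability, next_state, reward); an empty dict marks a terminal state.
MDP = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"exit": [(1.0, "T", 20.0)]},
    "T": {},
}

def q_value(V, outcomes, gamma):
    """Expected one-step return of an action: sum over s' of P(s'|s,a) [R + gamma V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():  # in-place (Gauss-Seidel) sweep
            if not actions:
                continue
            best = max(q_value(V, outs, gamma) for outs in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {
        s: max(actions, key=lambda a: q_value(V, actions[a], gamma))
        for s, actions in P.items() if actions
    }
    return V, policy

V, policy = value_iteration(MDP)
print(policy)  # {'A': 'go', 'B': 'exit'}
```

With γ = 0.9, staying forever is worth 10 while going is worth 0.9 × 20 = 18, so the optimal policy gives up the immediate reward.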
DM as Markov Decision Process (MDP)
Data: dialogue states; reward (a measure of dialogue quality)
Model: Markov decision process (MDP)
Prediction: system actions
DM as Partially Observable Markov Decision Process (POMDP)
Data: noisy observations of dialogue states; reward (a measure of dialogue quality)
Model: partially observable Markov decision process (POMDP)
Prediction: distribution over dialogue states (dialogue state tracking); optimal system actions
Markov Decision Process (MDP)
States can be fully observed
The state depends on the previous state and the action
Figure: $s_t \rightarrow s_{t+1}$ under action $a_t$ with reward $r_t$; transition probability $P(s_{t+1} \mid s_t, a_t)$
Partially Observable Markov Decision Process (POMDP)
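The state generates a noisy observation
The state is unobservable and depends on the previous state and the action
Figure: hidden states $s_t \rightarrow s_{t+1}$ under action $a_t$ with reward $r_t$, emitting observations $o_t$ and $o_{t+1}$
Transition probability: $P(s_{t+1} \mid s_t, a_t)$; observation probability: $P(o_{t+1} \mid s_{t+1})$
Belief update (η is a normalizing constant):
$b_{t+1}(s_{t+1}) = \eta \, P(o_{t+1} \mid s_{t+1}) \sum_{s_t} P(s_{t+1} \mid s_t, a_t) \, b_t(s_t)$
This requires a summation over all possible states at every dialogue turn, which is intractable!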
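A minimal sketch of exact POMDP belief tracking, $b'(s') = \eta \, P(o' \mid s') \sum_s P(s' \mid s, a) \, b(s)$, for a hypothetical two-value user goal (food = thai or indian). The models are assumptions for illustration: the goal persists between turns with probability 0.9, and the SLU reports the true value with probability 0.7.

```python
def belief_update(belief, action, obs, T, O):
    """b'(s') = eta * P(obs | s') * sum_s P(s' | s, action) * b(s)."""
    new = {}
    for s_next in belief:
        # Sum over every previous state: the step that blows up for large state spaces
        predicted = sum(T[(s, action)].get(s_next, 0.0) * b for s, b in belief.items())
        new[s_next] = O[s_next].get(obs, 0.0) * predicted
    eta = 1.0 / sum(new.values())  # normalizing constant
    return {s: eta * p for s, p in new.items()}

# Hypothetical models: goal persists w.p. 0.9; SLU reports the true value w.p. 0.7
T = {
    ("thai", "request_food"): {"thai": 0.9, "indian": 0.1},
    ("indian", "request_food"): {"indian": 0.9, "thai": 0.1},
}
O = {"thai": {"thai": 0.7, "indian": 0.3}, "indian": {"indian": 0.7, "thai": 0.3}}

belief = {"thai": 0.5, "indian": 0.5}
for obs in ["thai", "thai"]:
    belief = belief_update(belief, "request_food", obs, T, O)
    print(belief)
```

Two consistent but noisy observations raise the belief in thai from 0.5 to about 0.7 and then about 0.82.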
Dialogue State Tracking (DST)
Requirements
Dialogue history
Keep track of what happened so far in the dialogue
Normally done via the Markov property
Task-oriented dialogue
Need to know what the user wants
Modeled via the user goal
Robustness to errors
Need to know what the user says
Modeled via the user action
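Dialogue State Factorization
Decompose the dialogue state into conditionally independent elements: $s_t = (g_t, u_t, d_t)$
User goal $g_t$
User action $u_t$
Dialogue history $d_t$
Figure: graphical model in which $g_t, u_t, d_t$ evolve to $g_{t+1}, u_{t+1}, d_{t+1}$ given system action $a_t$, with reward $r_t$ and observations $o_t, o_{t+1}$
Belief update: summation over all possible goals (intractable!); summation over all possible histories and user actions (intractable!)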
Generative DST
Exact POMDPs are intractable for all but very small problems
Two approximations make POMDPs feasible for dialogue
I. Hidden Information State (HIS) system (Young et al., 2010)
II. Bayesian Update of Dialogue State (BUDS) system (Thomson and Young, 2010)
Hidden Information State (HIS)
Dialogue state: distribution over the most likely hypotheses
HIS Partitions
Pruning
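Pruning keeps the belief tractable by discarding low-probability hypotheses. A minimal, generic sketch, assuming hypotheses are keyed by hypothetical partition labels: it keeps the N most probable ones and renormalizes. The HIS system's own pruning procedure may treat the discarded probability mass differently, so this only illustrates the idea.

```python
def prune(hypotheses, max_hyps=3, min_prob=0.0):
    """Keep at most `max_hyps` hypotheses with probability >= `min_prob`, then renormalize."""
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    kept = [(h, p) for h, p in ranked[:max_hyps] if p >= min_prob]
    total = sum(p for _, p in kept)
    if total == 0:
        return {}
    return {h: p / total for h, p in kept}

# Hypothetical partitions of the user-goal space with their beliefs
partitions = {
    "food=thai, area=centre": 0.50,
    "food=thai, area!=centre": 0.30,
    "food=indian": 0.15,
    "food=other": 0.05,
}
print(prune(partitions, max_hyps=2))
```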
Bayesian Update of Dialogue State (BUDS)
Idea
Further decomposes the dialogue state
Produces a tractable state update
Transition and observation probability distributions can be parameterized
BUDS Belief Tracking
Expectation propagation
Allows parameter tying
Handles factorized hidden variables
Handles large state spaces
Example
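The benefit of factorizing can be sketched with the simplest possible factorization, in which every slot is tracked independently. This is an assumption made for illustration and is much cruder than BUDS, which further decomposes the state but tracks it with expectation propagation; the ontology, the goal-change probability (0.1), and the SLU accuracy (0.7) are hypothetical.

```python
from math import prod

# Hypothetical ontology: informable slots and their values
ONTOLOGY = {
    "food": ["thai", "indian", "italian"],
    "area": ["centre", "north", "south"],
    "price": ["cheap", "moderate", "expensive"],
}

def slot_update(belief, obs_value, p_change=0.1, p_correct=0.7):
    """Update one slot's belief on its own (fully factorized approximation)."""
    n = len(belief)
    new = {}
    for v in belief:
        # Transition: keep the value w.p. 1 - p_change, else move uniformly to another value
        predicted = sum(
            b * ((1 - p_change) if u == v else p_change / (n - 1))
            for u, b in belief.items()
        )
        if obs_value is None:
            likelihood = 1.0  # slot not mentioned this turn
        else:
            likelihood = p_correct if v == obs_value else (1 - p_correct) / (n - 1)
        new[v] = likelihood * predicted
    total = sum(new.values())
    return {v: p / total for v, p in new.items()}

beliefs = {s: {v: 1 / len(vals) for v in vals} for s, vals in ONTOLOGY.items()}
turn = {"food": "thai", "area": None, "price": "cheap"}  # hypothetical SLU output
beliefs = {s: slot_update(b, turn[s]) for s, b in beliefs.items()}

factored_size = sum(len(v) for v in ONTOLOGY.values())  # numbers stored per turn: 9
joint_size = prod(len(v) for v in ONTOLOGY.values())    # joint goal hypotheses: 27
print(beliefs["food"], factored_size, joint_size)
```

The factorized belief grows with the sum of the slot sizes, while the joint belief grows with their product.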
Discriminative Approach
Dialogue State Tracking
Directly model dialogue states given arbitrary input features
Assumption: observations at each turn are independent
DST Problem Formulation
The DST dataset consists of
Goal: for each informable slot
e.g. price=cheap
Requested: slots requested by the user
e.g. moviename
Method: search method for entities
e.g. by constraints, by name
The dialogue state is
the distribution over possible slot-value pairs for goals
the distribution over possible requested slots
the distribution over possible methods
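The three parts of this state can be held in a small container. A minimal Python sketch; the class name, field layout, example values, and method names (`by_constraints`, `by_name`) are hypothetical, and leftover probability mass in a goal distribution is assumed to mean that the slot has no value yet.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogueState:
    # goal[slot][value] = P(slot = value) for each informable slot
    goal: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # requested[slot] = P(the user requested this slot)
    requested: Dict[str, float] = field(default_factory=dict)
    # method[m] = P(the search method is m), e.g. by constraints or by name
    method: Dict[str, float] = field(default_factory=dict)

    def top_goal(self):
        """Most likely value per informable slot (a 1-best view of the goal)."""
        return {slot: max(dist, key=dist.get) for slot, dist in self.goal.items() if dist}

state = DialogueState(
    goal={"price": {"cheap": 0.8, "expensive": 0.1}, "food": {"thai": 0.6, "indian": 0.3}},
    requested={"phone": 0.9},
    method={"by_constraints": 0.85, "by_name": 0.15},
)
print(state.top_goal())  # {'price': 'cheap', 'food': 'thai'}
```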
Class-Based DST
Data: observations labeled with dialogue state
Model: neural networks; ranking models
Prediction: distribution over dialogue states (dialogue state tracking)
DNN for DST
Figure: multi-turn conversation → feature extraction → DNN → state of this turn (a slot-value distribution for each slot)
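A minimal numpy sketch of this pipeline for one slot: a hypothetical feature vector for the current turn goes through a one-hidden-layer network that outputs a distribution over that slot's values. Weights are random (untrained) and the feature extraction step is left abstract; published DNN trackers may build their features and outputs differently.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

class SlotDNN:
    """One-hidden-layer classifier for a single slot (untrained, illustrative)."""

    def __init__(self, n_features, n_hidden, values, seed=0):
        rng = np.random.default_rng(seed)
        self.values = list(values)
        self.W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, len(self.values)))
        self.b2 = np.zeros(len(self.values))

    def __call__(self, features):
        h = np.maximum(0.0, features @ self.W1 + self.b1)  # ReLU hidden layer
        probs = softmax(h @ self.W2 + self.b2)
        return dict(zip(self.values, probs.tolist()))

# Hypothetical 8-dim feature vector extracted from the multi-turn conversation
features = np.random.default_rng(1).random(8)
food_tracker = SlotDNN(n_features=8, n_hidden=16, values=["thai", "indian", "italian"])
print(food_tracker(features))  # state of this turn for the "food" slot
```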
Sequence-Based DST
Data: sequence of observations labeled with dialogue state
Model: recurrent neural networks (RNN)
Prediction: distribution over dialogue states (dialogue state tracking)
Recurrent Neural Network (RNN)
Elman-type
Jordan-type
RNN DST
Idea: internal memory for representing dialogue context
Input
most recent dialogue turn
last machine dialogue act
dialogue state
memory layer
Output
update its internal memory
distribution over slot values
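A minimal numpy sketch of such a tracker for one slot, with random (untrained) weights and hypothetical feature sizes: at each turn it concatenates the turn features, the last system act features, and the previous state distribution, updates a tanh memory (an Elman-style recurrence), and outputs a softmax distribution over slot values. Feeding the previous output back in is a Jordan-style connection; published RNN trackers may combine these differently.

```python
import numpy as np

class RNNTracker:
    """Recurrent tracker for one slot (untrained weights, illustrative)."""

    def __init__(self, n_turn, n_act, n_values, n_memory, seed=0):
        rng = np.random.default_rng(seed)
        n_in = n_turn + n_act + n_values  # turn features + last system act + previous state
        self.W_in = rng.normal(scale=0.1, size=(n_in, n_memory))
        self.W_mem = rng.normal(scale=0.1, size=(n_memory, n_memory))
        self.W_out = rng.normal(scale=0.1, size=(n_memory, n_values))
        self.n_values, self.n_memory = n_values, n_memory
        self.reset()

    def reset(self):
        self.memory = np.zeros(self.n_memory)
        self.state = np.full(self.n_values, 1.0 / self.n_values)  # uniform start

    def step(self, turn_feats, act_feats):
        x = np.concatenate([turn_feats, act_feats, self.state])
        # Update the internal memory from the inputs and the previous memory
        self.memory = np.tanh(x @ self.W_in + self.memory @ self.W_mem)
        logits = self.memory @ self.W_out
        e = np.exp(logits - logits.max())
        self.state = e / e.sum()  # distribution over slot values
        return self.state

rng = np.random.default_rng(1)
tracker = RNNTracker(n_turn=6, n_act=4, n_values=3, n_memory=8)
for _ in range(3):  # three hypothetical turns
    dist = tracker.step(rng.random(6), rng.random(4))
print(dist)
```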
RNN-CNN DST
(Figure from Wen et al., 2016)
http://www.anthology.aclweb.org/W/W13/W13-4073.pdf; https://arxiv.org/abs/1506.07190
Multichannel Tracker
(Shi et al., 2016)
Training a multichannel CNN for each slot
Chinese character CNN
Chinese word CNN
English word CNN
https://arxiv.org/abs/1701.06247
DST Evaluation
Metric
Tracked state accuracy with respect to user goal
L2 norm between the hypothesized distribution and the true label
Recall/Precision/F-measure for individual slots
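The first two metrics can be sketched for a single slot as follows, assuming each turn's prediction is a dictionary from value to probability and each label is a single value; the data are made up.

```python
import math

def accuracy(predictions, labels):
    """Fraction of turns whose top hypothesis matches the labeled value."""
    hits = sum(max(dist, key=dist.get) == gold for dist, gold in zip(predictions, labels))
    return hits / len(labels)

def l2(dist, gold):
    """L2 norm between a hypothesized distribution and the one-hot true label."""
    values = set(dist) | {gold}
    return math.sqrt(sum((dist.get(v, 0.0) - (1.0 if v == gold else 0.0)) ** 2 for v in values))

# Made-up predictions for the "food" slot over two turns
preds = [{"thai": 0.7, "indian": 0.3}, {"thai": 0.4, "indian": 0.6}]
gold = ["thai", "thai"]
print(accuracy(preds, gold), [round(l2(d, g), 3) for d, g in zip(preds, gold)])
```

Accuracy only checks the top hypothesis, while the L2 norm also penalizes how much probability is placed on wrong values.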
Dialog State Tracking Challenge (DSTC)
(Williams et al. 2013, Henderson et al. 2014, Henderson et al. 2014, Kim et al. 2016, Kim et al. 2016)
Challenge | Type | Domain | Data Provider | Main Theme
DSTC1 | Human-Machine | Bus Route | CMU | Evaluation Metrics
DSTC2 | Human-Machine | Restaurant | U. Cambridge | User Goal Changes
DSTC3 | Human-Machine | Tourist Information | U. Cambridge | Domain Adaptation
DSTC4 | Human-Human | Tourist Information | I2R | Human Conversation
DSTC5 | Human-Human | Tourist Information | I2R | Language Adaptation
DSTC1
Type: Human-Machine
Domain: Bus Route
DSTC4-5
Type: Human-Human
Domain: Tourist Information
Tourist: Can you give me some uh- tell me some cheap rate hotels, because I'm planning just to leave my bags there and go somewhere take some pictures.
Guide: Okay. I'm going to recommend firstly you want to have a backpack type of hotel, right?
Tourist: Yes. I'm just gonna bring my backpack and my buddy with me. So I'm kinda looking for a hotel that is not that expensive. Just gonna leave our things there and, you know, stay out the whole day.
Guide: Okay. Let me get you hm hm. So you don't mind if it's a bit uh not so roomy like hotel because you just back to sleep.
Tourist: Yes. Yes. As we just gonna put our things there and then go out to take some pictures.
Guide: Okay, um-
Tourist: Hm.
Guide: Let's try this one, okay?
Tourist: Okay.
Guide: It's InnCrowd Backpackers Hostel in Singapore. If you take a dorm bed per person only twenty dollars. If you take a room, it's two single beds at fifty nine dollars.
Tourist: Um. Wow, that's good.
Guide: Yah, the prices are based on per person per bed or dorm. But this one is room. So it should be fifty nine for the two room. So you're actually paying about ten dollars more per person only.
Tourist: Oh okay. That's- the price is reasonable actually. It's good.
{Topic: Accommodation; Type: Hostel; Pricerange: Cheap; GuideAct: ACK; TouristAct: REQ}
{Topic: Accommodation; NAME: InnCrowd Backpackers Hostel; GuideAct: REC; TouristAct: ACK}
Concluding Remarks
Dialogue state tracking (DST) in DM relies on the Markov assumption to model the user goal and to be robust to errors
Generative models for DST are based on POMDPs
Hidden Information State (HIS)
the state consists of the user goal, user action, and dialogue history
transitions are hand-crafted and the goals are grouped together to allow tractable belief tracking
Bayesian Update of Dialogue State (BUDS)
further factorizes the state
allows tractable belief tracking and learning of the shapes of distributions via expectation propagation
Discriminative models directly estimate dialogue states given arbitrary input features