Slides credit from Gašić

(1)

(2)

Review

2

(3)

3

Task-Oriented Dialogue System

(Young, 2000)

3

Speech Signal → Speech Recognition
• Hypothesis: are there any action movies to see this weekend

Text Input: Are there any action movies to see this weekend?

Language Understanding (LU)
• Domain Identification
• User Intent Detection
• Slot Filling
• Semantic Frame: request_movie (genre=action, date=this weekend)

Dialogue Management (DM)
• Dialogue State Tracking (DST)
• Dialogue Policy
• Backend Database / Knowledge Providers
• System Action/Policy: request_location

Natural Language Generation (NLG)
• Text response: Where are you located?

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short


(5)

Dialogue Management

5

(6)

6

Example Dialogue

6

request (restaurant; foodtype=Thai)

inform (area=centre)

request (address)

bye ()

(7)

7

Elements of Dialogue Management

(Figure from Gašić)

(8)

8

Dialogue State Tracking (DST)



Maintain a probabilistic distribution instead of a 1-best prediction for better robustness to recognition errors


Incorrect for both!

(9)

9

Dialogue State Tracking (DST)



Maintain a probabilistic distribution instead of a 1-best prediction for better robustness to SLU errors or ambiguous input

System: How can I help you?
User: Book a table at Sumiko for 5
System: How many people?
User: 3

Belief after the first user turn:
Slot | Value
# people | 5 (0.5)
time | 5 (0.5)

Belief after the second user turn:
Slot | Value
# people | 3 (0.8)
time | 5 (0.8)

(10)

10

1-Best Input w/o State Tracking

10

(11)

11

N-Best Inputs w/o State Tracking

11

(12)

12

N-Best Inputs w/ State Tracking

12

(13)

13

Dialogue State Tracking (DST)



Definition

 Representation of the system's belief of the user's goal(s) at any time during the dialogue



Challenge

 How to define the state space?

 How to tractably maintain the dialogue state?

 Which actions to take for each state?


Define dialogue as a control problem where the behavior can be automatically learned

(14)

Introduction to RL

14

Reinforcement Learning

(15)

15

Reinforcement Learning

 RL is a general purpose framework for decision making

 RL is for an agent with the capacity to act

 Each action influences the agent’s future state

 Success is measured by a scalar reward signal

 Goal: select actions to maximize future reward

Big three: action, state, reward

(16)

16

Reinforcement Learning

16

Agent

Environment

Observation Action

Reward Don’t do

that

(17)

17

Reinforcement Learning

17

Agent

Environment

Observation Action

Reward Thank you.

Agent learns to take actions to maximize expected reward.

(18)

18

Supervised vs. Reinforcement

Supervised: learning from a teacher
• “Hello” → Say “Hi”
• “Bye bye” → Say “Good bye”

Reinforcement: learning from critics
• Agent: Hello … (the conversation continues) → feedback: Bad

(19)

Scenario of Reinforcement Learning

Environment → Agent: Observation

Agent → Environment: Action (Next Move)

Reward: if win, reward = 1; if loss, reward = -1; otherwise, reward = 0

Agent learns to take actions to maximize expected reward.

(20)

20

RL Based AI Examples

 Play games: Atari, poker, Go, …

 Explore worlds: 3D worlds, …

 Control physical systems: manipulate, …

 Interact with users: recommend, optimize, personalize, …

(21)

21

Agent and Environment

→←

MoveRight MoveLeft

observation o_t action a_t

reward r_t Agent

Environment

(22)

22

Agent and Environment

 At time step t

 The agent

 Executes action a_t

 Receives observation o_t

 Receives scalar reward r_t

 The environment

 Receives action a_t

 Emits observation o_t+1

 Emits scalar reward r_t+1

 t increments at env. step

observation o_t

action a_t

reward r_t
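The interaction loop above can be sketched in a few lines of Python. This is only an illustration: the `LineWorld` environment, its five positions, and the reward of 1 at the right end are assumptions made for the example; only the action names MoveLeft/MoveRight come from the slide.

```python
class LineWorld:
    """Hypothetical environment: positions 0..4, reward 1.0 on reaching position 4."""

    def __init__(self):
        self.position = 2  # assumed start position

    def step(self, action):
        # The environment receives a_t and emits o_{t+1} and r_{t+1}.
        delta = 1 if action == "MoveRight" else -1
        self.position = max(0, min(4, self.position + delta))
        reward = 1.0 if self.position == 4 else 0.0
        return self.position, reward


def run_episode(policy, max_steps=20):
    env = LineWorld()
    observation, reward = env.position, 0.0
    history = []  # experience: (o_t, r_t, a_t) triples
    for _ in range(max_steps):
        action = policy(observation)        # the agent executes a_t
        history.append((observation, reward, action))
        observation, reward = env.step(action)
        if reward > 0:                      # stop once the goal is reached
            break
    return history, reward


def always_right(observation):
    return "MoveRight"


history, final_reward = run_episode(always_right)
```

The `history` list is the "experience" that the next slide refers to: the sequence of observations, rewards, and actions seen so far.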

(23)

23

State

 Experience is the sequence of observations, actions, and rewards

 State is the information used to determine what happens next

 What happens next depends on the history of experience

• The agent selects actions

• The environment selects observations/rewards

 The state is a function of the history of experience

(24)

24

observation o_t

action a_t

reward r_t

Environment State

 The environment state 𝑠_𝑡^𝑒 is the environment’s private representation

 whatever data the environment uses to pick the next observation/reward

 may not be visible to the agent

 may contain irrelevant information

(25)

25

observation o_t

action a_t

reward r_t

Agent State

 The agent state 𝑠_𝑡^𝑎 is the agent’s internal representation

 whatever data the agent uses to pick the next action

 information used by RL algorithms

 can be any function of experience

(26)

26

Information State

 An information state (a.k.a. Markov state) contains all useful information from history

 The future is independent of the past given the present

 Once the state is known, the history may be thrown away

 The state is a sufficient statistic of the future

A state s_t is Markov iff P(s_t+1 | s_t) = P(s_t+1 | s_1, ..., s_t)

(27)

27

Fully Observable Environment

 Full observability: agent directly observes environment state

information state = agent state = environment state

This is a Markov decision process (MDP)

(28)

28

Partially Observable Environment

 Partial observability: agent indirectly observes environment

agent state ≠ environment state

 Agent must construct its own state representation 𝑠_𝑡^𝑎

 Complete history:

 Beliefs of environment state:

 Hidden state (from RNN):

This is a partially observable Markov decision process (POMDP)

(29)

29

Reward

 Reinforcement learning is based on the reward hypothesis

 A reward r_t is a scalar feedback signal

 Indicates how well the agent is doing at step t

Reward hypothesis: all agent goals can be described by maximizing expected cumulative reward

(30)

30

Sequential Decision Making

 Goal: select actions to maximize total future reward

 Actions may have long-term consequences

 Reward may be delayed

 It may be better to sacrifice immediate reward to gain more long-term reward


(31)

31

Elements of Dialogue Management

(Figure from Gašić)

Dialogue state tracking

(32)

32

Generative vs. Discriminative



Generative



The state generates the observation



Discriminative



The state depends on the observation

32

(33)

Generative Approach

33

Dialogue State Tracking

(34)

34

Markov Process

 A Markov process is a memoryless random process

 a sequence of random states S₁, S₂, ... with the Markov property

Student Markov chain

Sample episodes from S₁=C1

• C1 C2 C3 Pass Sleep

• C1 FB FB C1 C2 Sleep

• C1 C2 C3 Pub C2 C3 Pass Sleep

• C1 FB FB C1 C2 C3 Pub

• C1 FB FB FB C1 C2 C3 Pub C2 Sleep

(35)

35

Student MRP

Markov Reward Process (MRP)

 A Markov reward process is a Markov chain with values

 The return G_t is the total discounted reward from time-step t
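As a minimal sketch of the return, assume `rewards[k]` is the reward received k steps after time-step t and γ is the discount factor, so that G_t = Σ_k γ^k · rewards[k]. The reward values below are assumptions chosen for illustration (they are not given in the text).

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: sum over k of gamma**k * rewards[k]."""
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards for one sampled episode (assumed values).
episode_rewards = [-2, -2, -2, 10]
g = discounted_return(episode_rewards, gamma=0.5)
```

With γ close to 0 the return is dominated by the immediate reward; with γ close to 1 later rewards count almost as much as early ones.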

35

(36)

36

Markov Decision Process (MDP)

 A Markov decision process is an MRP with decisions

 It is an environment in which all states are Markov

36

Student MDP

(37)

37

Markov Decision Process (MDP)

 S: finite set of states/observations

 A: finite set of actions

 P : transition probability

 R : immediate reward

 γ : discount factor

 Goal is to choose policy π at time t that maximizes expected overall return: E[ r_t + γ r_t+1 + γ² r_t+2 + ... ]
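One standard way to find such a policy when S, A, P, R, and γ are all known is value iteration. The sketch below is an illustration on a made-up two-state MDP, not the method used in the lecture; the state names, action names, and rewards are assumptions, and P and R are stored together as (probability, next state, reward) triples.

```python
def q_value(P, V, s, a, gamma):
    # Expected immediate reward plus discounted value of the next state.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(states, actions, P, gamma, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(P, V, s, a, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: q_value(P, V, s, a, gamma))
              for s in states}
    return V, policy


# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
V, policy = value_iteration(states, actions, P, gamma=0.9)
```

In this toy problem the best behavior is to move to s1 and stay there, which illustrates how a small immediate reward can be worth less than positioning for larger future reward.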

37

(38)

38

DM as Markov Decision Process (MDP)

38

Data

Model

Prediction

• Dialogue states

• Reward – a measure of dialogue quality

• System actions

• Markov decision process (MDP)

(39)

39

DM as Partially Observable Markov Decision Process (POMDP)

39

Data

Model

Prediction

• Noisy observation of dialogue states

• Reward – a measure of dialogue quality

• Distribution over dialogue states – Dialogue State Tracking

• Optimal system actions

• Partially observable Markov decision process (POMDP)

(40)

40

Markov Decision Process (MDP)



States can be fully observed



State depends on the previous state and the action: transition probability P(s_t+1 | s_t, a_t)

(Figure: s_t and a_t lead to s_t+1, with reward r_t)

(41)

41

Partially Observable Markov Decision Process (POMDP)



State is unobservable and depends on the previous state and the action (transition probability)

State generates a noisy observation (observation probability)

(Figure: s_t and a_t lead to s_t+1 with reward r_t; each state s_t emits an observation o_t)

Tracking the state requires a summation over all possible states at every dialogue turn, which is intractable!
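The summation in question is the standard POMDP belief update, b'(s') ∝ P(o' | s') · Σ_s P(s' | s, a) · b(s). The sketch below spells it out for a toy case; the state names, the action name, and all probabilities are assumptions made for illustration. The inner sum runs over every state, which is why the cost grows quickly with the size of the dialogue state space.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Return the new belief b'(s') after taking `action` and seeing `observation`.

    transition[(s, a)][s_next]   = P(s_next | s, a)
    observation_model[s_next][o] = P(o | s_next)
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Summation over all possible previous states.
        predicted = sum(transition[(s, action)][s_next] * belief[s] for s in states)
        unnormalized[s_next] = observation_model[s_next][observation] * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state example (all numbers are assumed).
belief = {"wants_thai": 0.5, "wants_chinese": 0.5}
transition = {
    ("wants_thai", "ask_food"): {"wants_thai": 0.9, "wants_chinese": 0.1},
    ("wants_chinese", "ask_food"): {"wants_thai": 0.1, "wants_chinese": 0.9},
}
observation_model = {
    "wants_thai": {"heard_thai": 0.8, "heard_chinese": 0.2},
    "wants_chinese": {"heard_thai": 0.2, "heard_chinese": 0.8},
}
new_belief = belief_update(belief, "ask_food", "heard_thai", transition, observation_model)
```

The sketch assumes the observation has non-zero probability under at least one state; otherwise the normalization would divide by zero.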

(42)

42

Dialogue State Tracking (DST)



Requirement



Dialogue history

Keep track of what has happened so far in the dialogue

Normally done via Markov property



Task-oriented dialogue

Need to know what the user wants

Modeled via the user goal



Robustness to errors

Need to know what the user says

Modeled via the user action

42

(43)

43



Dialogue State Factorization

Decompose dialogue state into conditionally independent elements

 User goal g_t

 User action u_t

 Dialogue history d_t

summation over all possible goals – intractable!

summation over all possible histories and user actions – intractable!

(Figure: dynamic Bayesian network over g_t, u_t, d_t, a_t, r_t, o_t and their successors at t+1)

(44)

44

Generative DST



POMDPs are normally intractable for everything but very simple cases



Two approximations enable POMDP for dialogues

I. Hidden Information State (HIS) system (Young et al., 2010)

II. Bayesian Update of Dialogue State (BUDS) system (Thomson and Young, 2010)

44

(45)

45

Hidden Information State (HIS)

Dialogue state: distribution over most likely hypotheses

(46)

46

HIS Partitions

46

=

(47)

47

Pruning

47

=

(48)

48

Pruning

48

=

(49)

49

Bayesian Update of Dialogue State (BUDS)



Idea



Further decomposes the dialogue state



Produce tractable state update



Transition and observation probability distributions can be parameterized

49

(50)

50

BUDS Belief Tracking



Expectation propagation

 Allows parameter tying

 Handles factorized hidden variables

 Handles large state spaces



Example

50

(51)

Discriminative Approach

51

Dialogue State Tracking

(52)

52

Generative vs. Discriminative



Generative



The state generates the observation



Discriminative



The state depends on the observation

Directly model dialogue states given arbitrary input features

Assumption: observations at each turn are independent

(53)

53

DST Problem Formulation

 The DST dataset consists of

 Goal: a value constraint for each informable slot

 e.g. price=cheap

 Requested: slots requested by the user

 e.g. moviename

 Method: search method for entities

 e.g. by constraints, by name

 The dialogue state is

 the distribution over possible slot-value pairs for goals

 the distribution over possible requested slots

 the distribution over possible methods
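The three distributions listed above can be held in one small container. The sketch below is an assumed layout for illustration, not a format prescribed by any DST dataset; the class name, field names, and probabilities are hypothetical, while the slot examples (price=cheap, moviename, genre=action) are taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    # slot -> {value: probability}, one distribution per informable slot
    goals: dict = field(default_factory=dict)
    # slot -> probability that the user has requested it
    requested: dict = field(default_factory=dict)
    # search method -> probability
    method: dict = field(default_factory=dict)

    def top_goals(self, threshold=0.5):
        """Return the most likely value per slot when it is confident enough."""
        result = {}
        for slot, dist in self.goals.items():
            value, prob = max(dist.items(), key=lambda kv: kv[1])
            if prob >= threshold:
                result[slot] = value
        return result


# Hypothetical belief after a few turns (all numbers are assumed).
state = DialogueState(
    goals={
        "price": {"cheap": 0.8, "moderate": 0.2},
        "genre": {"action": 0.4, "comedy": 0.35, "none": 0.25},
    },
    requested={"moviename": 0.9},
    method={"by_constraints": 0.7, "by_name": 0.3},
)
confident = state.top_goals()
```

Keeping full distributions rather than the 1-best values lets the dialogue policy decide, for instance, to confirm the uncertain genre slot instead of acting on it.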

53

(54)

54

Class-Based DST

54

Data

Model

Prediction

• Observations labeled w/ dialogue state

• Neural networks

• Ranking models

(55)

55

DNN for DST

55

multi-turn conversation → feature extraction → DNN → state of this turn

Output: a slot value distribution for each slot

(56)

56

Sequence-Based DST

56

Data

Model

Prediction

• Sequence of observations labeled w/ dialogue state

• Recurrent neural networks (RNN)

(57)

57

Recurrent Neural Network (RNN)



Elman-type



Jordan-type
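The two variants differ only in what is fed back: an Elman-type RNN feeds the previous hidden state into the hidden layer, while a Jordan-type RNN feeds the previous output. The sketch below shows one step of each in plain Python; the weight names, the tanh nonlinearity, the linear output, and the omission of bias terms are simplifying assumptions made for the example.

```python
import math


def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def elman_step(x, h_prev, W_xh, W_hh, W_hy):
    # Elman: h_t depends on x_t and the previous hidden state h_{t-1}.
    h = [math.tanh(z) for z in add(matvec(W_xh, x), matvec(W_hh, h_prev))]
    y = matvec(W_hy, h)
    return h, y


def jordan_step(x, y_prev, W_xh, W_yh, W_hy):
    # Jordan: h_t depends on x_t and the previous output y_{t-1}.
    h = [math.tanh(z) for z in add(matvec(W_xh, x), matvec(W_yh, y_prev))]
    y = matvec(W_hy, h)
    return h, y


# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
W_yh = [[0.4], [-0.4]]
W_hy = [[1.0, -1.0]]
x = [1.0, 2.0]

h_elman, y_elman = elman_step(x, [0.0, 0.0], W_xh, W_hh, W_hy)
h_jordan, y_jordan = jordan_step(x, [0.0], W_xh, W_yh, W_hy)
```

With zero initial feedback both variants produce the same first step; they diverge from the second step onward because different quantities are carried forward.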

57

(58)

58

RNN DST



Idea: internal memory for representing dialogue context

 Input

 most recent dialogue turn

 last machine dialogue act

 dialogue state

 memory layer

 Output

 update its internal memory

 distribution over slot values

58

(59)

59

RNN-CNN DST

(Figure from Wen et al., 2016) http://www.anthology.aclweb.org/W/W13/W13-4073.pdf; https://arxiv.org/abs/1506.07190

(60)

60

Multichannel Tracker

(Shi et al., 2016)

60



Training a multichannel CNN for each slot

 Chinese character CNN

 Chinese word CNN

 English word CNN

https://arxiv.org/abs/1701.06247

(61)

61

DST Evaluation



Metric



Tracked state accuracy with respect to user goal



L2 norm between the hypothesized distribution and the true label

Recall/Precision/F-measure for individual slots
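The first two metrics can be sketched for a single slot as follows. This is a simplified illustration that treats the true label as a one-hot vector over slot values; the function names and the example distributions are assumptions, and official challenge scoring scripts may differ in detail.

```python
import math


def l2_distance(predicted, true_value):
    """L2 norm between a predicted distribution and the one-hot true label."""
    values = set(predicted) | {true_value}
    return math.sqrt(sum(
        (predicted.get(v, 0.0) - (1.0 if v == true_value else 0.0)) ** 2
        for v in values
    ))


def accuracy(predictions, labels):
    """Fraction of turns whose most likely value matches the true label."""
    correct = sum(
        1 for dist, label in zip(predictions, labels)
        if max(dist, key=dist.get) == label
    )
    return correct / len(labels)


# Hypothetical tracker output for the "price" slot over two turns.
predictions = [
    {"cheap": 0.8, "moderate": 0.2},
    {"cheap": 0.3, "moderate": 0.7},
]
labels = ["cheap", "cheap"]

turn_l2 = l2_distance(predictions[0], labels[0])
slot_accuracy = accuracy(predictions, labels)
```

Accuracy only looks at the top hypothesis, whereas the L2 measure also rewards well-calibrated probabilities, so the two can rank trackers differently.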

61

(62)

62

Dialog State Tracking Challenge (DSTC)

(Williams et al. 2013, Henderson et al. 2014, Henderson et al. 2014, Kim et al. 2016, Kim et al. 2016)

Challenge | Type | Domain | Data Provider | Main Theme
DSTC1 | Human-Machine | Bus Route | CMU | Evaluation Metrics
DSTC2 | Human-Machine | Restaurant | U. Cambridge | User Goal Changes
DSTC3 | Human-Machine | Tourist Information | U. Cambridge | Domain Adaptation
DSTC4 | Human-Human | Tourist Information | I2R | Human Conversation
DSTC5 | Human-Human | Tourist Information | I2R | Language Adaptation

(63)

63

DSTC1

 Type: Human-Machine

 Domain: Bus Route

63

(64)

64

DSTC4-5

 Type: Human-Human

 Domain: Tourist Information

64

Tourist: Can you give me some uh- tell me some cheap rate hotels, because I'm planning just to leave my bags there and go somewhere take some pictures.

Guide: Okay. I'm going to recommend firstly you want to have a backpack type of hotel, right?

Tourist: Yes. I'm just gonna bring my backpack and my buddy with me. So I'm kinda looking for a hotel that is not that expensive. Just gonna leave our things there and, you know, stay out the whole day.

Guide: Okay. Let me get you hm hm. So you don't mind if it's a bit uh not so roomy like hotel because you just back to sleep.

Tourist: Yes. Yes. As we just gonna put our things there and then go out to take some pictures.

Guide: Okay, um-

Tourist: Hm.

Guide: Let's try this one, okay?

Tourist: Okay.

Guide: It's InnCrowd Backpackers Hostel in Singapore. If you take a dorm bed per person only twenty dollars. If you take a room, it's two single beds at fifty nine dollars.

Tourist: Um. Wow, that's good.

Guide: Yah, the prices are based on per person per bed or dorm. But this one is room. So it should be fifty nine for the two room. So you're actually paying about ten dollars more per person only.

Tourist: Oh okay. That's- the price is reasonable actually. It's good.

{Topic: Accommodation; Type: Hostel; Pricerange: Cheap; GuideAct: ACK; TouristAct: REQ}

{Topic: Accommodation; NAME: InnCrowd Backpackers Hostel; GuideAct: REC; TouristAct: ACK}

(65)

65

Concluding Remarks

 Dialogue state tracking (DST) in DM relies on the Markov assumption to model the user goal and to be robust to errors

 Generative models for DST are based on POMDP

 Hidden Information State (HIS)

 state: user goal, user action, dialogue history

 transitions are hand-crafted and the goals are grouped together to allow tractable belief tracking

 Bayesian Update of Dialogue State (BUDS)

 further factorizes the state

 allows tractable belief tracking and learning of the shapes of distributions via expectation propagation

 Discriminative models directly estimate dialogue states given arbitrary input features