Slide credit: Gašić
Review
Task-Oriented Dialogue System
(Young, 2000)

(Figure: the classic pipeline. A speech signal enters Speech Recognition, producing the hypothesis "are there any action movies to see this weekend"; typed text input "Are there any action movies to see this weekend?" can bypass recognition. Language Understanding (LU) performs domain identification, user intent detection, and slot filling, yielding the semantic frame request_movie(genre=action, date=this weekend). Dialogue Management (DM) performs dialogue state tracking (DST) and dialogue policy selection, consulting a backend database / knowledge providers, and outputs the system action request_location. Natural Language Generation (NLG) renders the text response "Where are you located?")

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Dialogue Management
Example Dialogue

Sys: greeting()
Usr: request(restaurant; foodtype=Thai)
Sys: request(area)
Usr: inform(area=centre)
Sys: inform(restaurant=Bangkok city, area=centre of town, foodtype=Thai)
Usr: request(address)
Sys: inform(address=24 Green street)
Usr: bye()
Elements of Dialogue Management
(Figure from Gašić)
Reinforcement Learning

RL is a general-purpose framework for decision making
RL is for an agent with the capacity to act
Each action influences the agent's future state
Success is measured by a scalar reward signal
Goal: select actions to maximize future reward
Big three: action, state, reward
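As a worked formula (standard RL notation, added here for concreteness; it is not spelled out on the slide), the goal is to pick the policy that maximizes expected discounted future reward:

```latex
% Standard RL objective: gamma in [0,1) is a discount factor,
% pi is the policy mapping states to actions.
\[
  \pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; \pi \right]
\]
```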
Reinforcement Learning

(Figure: the agent-environment loop. The agent receives an observation from the environment and takes an action; the environment returns a reward, here the negative feedback "Don't do that".)
Reinforcement Learning

(Figure: the same loop with positive feedback "Thank you." as the reward.)

The agent learns to take actions that maximize expected reward.
Scenario of Reinforcement Learning

(Figure: game playing as RL. The agent observes the board state, outputs the next move as its action, and receives a reward from the environment: +1 on a win, -1 on a loss, and 0 otherwise.)

The agent learns to take actions that maximize expected reward.
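A minimal sketch of this sparse game reward in code; the `GameOutcome` enum is a hypothetical helper introduced for illustration:

```python
from enum import Enum

class GameOutcome(Enum):
    WIN = "win"
    LOSS = "loss"
    ONGOING = "ongoing"

def game_reward(outcome: GameOutcome) -> float:
    """Sparse reward: +1 on a win, -1 on a loss, 0 for every other move."""
    if outcome is GameOutcome.WIN:
        return 1.0
    if outcome is GameOutcome.LOSS:
        return -1.0
    return 0.0
```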
Agent and Environment

At time step t:
The agent
• executes action a_t
• receives observation o_t
• receives scalar reward r_t
The environment
• receives action a_t
• emits observation o_{t+1}
• emits scalar reward r_{t+1}
t increments at the environment step.

(Figure: the loop annotated with observation o_t, action a_t, reward r_t.)
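A minimal sketch of this interaction protocol, loosely following a Gym-style `reset`/`step` convention; the `agent` and `env` objects and their methods are hypothetical:

```python
def run_episode(agent, env, max_steps=100):
    """One episode of the agent-environment loop described above."""
    observation = env.reset()   # initial observation o_0
    reward = 0.0
    for t in range(max_steps):
        # The agent executes a_t given what it has seen so far.
        action = agent.act(observation, reward)
        # The environment receives a_t and emits o_{t+1}, r_{t+1};
        # t increments at this step.
        observation, reward, done = env.step(action)
        if done:
            break
```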
State

Experience is the sequence of observations, actions, and rewards.
State is the information used to determine what happens next;
what happens next depends on the history of experience:
• The agent selects actions
• The environment selects observations/rewards
The state is a function of the history (see the formula below).
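In standard notation (added here for concreteness), the history collects everything seen up to time t, and the state is some function of it:

```latex
\[
  h_t = (o_1, r_1, a_1,\; \dots,\; a_{t-1}, o_t, r_t),
  \qquad
  s_t = f(h_t)
\]
```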
Dialogue Policy Optimization
Decision Making
Elements of Dialogue Management
(Figure from Gašić; the dialogue policy optimization component is highlighted)
Partially Observable Markov Decision Process (POMDP)

Dialogue states: s_t
Noisy observations: o_t
System actions: a_t
Rewards: r_t
Transition probability: p(s_{t+1} | s_t, a_t)
Observation probability: p(o_{t+1} | s_{t+1})
Distribution over states (belief state): b(s_t)

(Figure: the POMDP graphical model over s_t, s_{t+1}, a_t, r_t, o_t, o_{t+1}.)
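The belief state is maintained by the standard POMDP filtering update (not spelled out on the slide): a transition step followed by an observation correction.

```latex
% After taking a_t and observing o_{t+1}:
\[
  b_{t+1}(s_{t+1}) \;\propto\;
  p(o_{t+1} \mid s_{t+1})
  \sum_{s_t} p(s_{t+1} \mid s_t, a_t)\, b_t(s_t)
\]
```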
DM as Partially Observable Markov Decision Process (POMDP)

Data
• Noisy observations of dialogue states
• Reward – a measure of dialogue quality
Model
• Partially observable Markov decision process (POMDP)
Prediction
• Distribution over dialogue states (belief state)
• Optimal system actions – dialogue policy optimization
Decision Making in POMDP

Policy: a mapping from belief states to actions, $a_t = \pi(b_t)$ (belief estimation → action mapping)
Return: the accumulated reward, $R_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}$
Value function: $V^{\pi}(b_t) = \mathbb{E}[R_t \mid b_t, \pi]$ — how good the system is in a particular belief state

(Figure: the POMDP graphical model over s_t, s_{t+1}, a_t, r_t, o_t, o_{t+1}.)
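A tiny worked example of the return; the -1 per-turn penalty and +20 success bonus are illustrative values (a common convention in dialogue policy work, not fixed by the slide):

```python
def discounted_return(rewards, gamma=0.99):
    """Accumulated discounted reward: R_t = sum_k gamma^k * r_{t+k}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A dialogue that succeeds after three turns, with a -1 penalty per turn
# and a +20 success bonus on the final turn (illustrative reward shape):
print(discounted_return([-1.0, -1.0, 20.0]))  # -1 - 0.99 + 0.99**2 * 20 ≈ 17.61
```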
POMDP Policy Optimization

Goal: find the value function associated with the optimal policy, i.e. the one that generates the maximal return.
Problem: this is tractable only for very simple cases (Kaelbling et al., 1998).
Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are belief states.
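Why this view works (a standard construction, added here as the missing step): with the deterministic belief update $b' = \tau(b, a, o')$ from the filtering equation above, the belief itself evolves as a Markov process, with transitions obtained by marginalizing over the next observation:

```latex
\[
  p(b' \mid b, a) \;=\; \sum_{o'} p(o' \mid b, a)\,
  \mathbb{1}\!\left[\, b' = \tau(b, a, o') \,\right]
\]
```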
Markov Decision Process (MDP)

Belief state from tracking: b_t (serves as the MDP state s_t)
System actions: a_t
Rewards: r_t
Transition probability: p(b_{t+1} | b_t, a_t)

(Figure: the MDP graphical model over b_t, b_{t+1}, a_t, r_t.)
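A minimal sketch of RL on this belief-state MDP, using one-step Q-learning with a linear Q-function. The feature size, action count, and choice of learner are illustrative assumptions; real systems often use richer learners (e.g. GP-SARSA or deep RL):

```python
import numpy as np

N_FEATURES, N_ACTIONS = 32, 8               # illustrative sizes
theta = np.zeros((N_ACTIONS, N_FEATURES))   # one weight row per action

def q_values(belief):
    """Q(b, a) for all actions under a linear function approximator."""
    return theta @ belief

def td_update(b, a, r, b_next, alpha=0.1, gamma=0.99, terminal=False):
    """One-step Q-learning update on a continuous belief vector b."""
    target = r if terminal else r + gamma * float(np.max(q_values(b_next)))
    theta[a] += alpha * (target - q_values(b)[a]) * b
```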
DM as Markov Decision Process (MDP)

Data
• Belief dialogue states (continuous)
• Reward – a measure of dialogue quality
Model
• Markov decision process (MDP) & reinforcement learning
Prediction
• System actions – dialogue policy optimization
Policy Optimization Issues

Optimization problem size
• The belief dialogue state space is large and continuous
• The system action space is large
Knowledge of the environment (the user)
• The transition probability is unknown (it depends on the user's state)
• How to obtain rewards
Large Belief Space and Action Space

Solution: perform optimization in a reduced summary space built according to heuristics.

(Figure: a summary function maps the belief dialogue state to a summary dialogue state; the summary policy picks a summary action; a master function maps it back to a full system action.)
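A sketch of what such heuristics might look like in code; the confidence-bucketing rule and the `confirm_food` expansion are invented for illustration, standing in for the hand-crafted mappings real systems use:

```python
def summary_state(belief: dict) -> tuple:
    """Summary function: compress a full belief (slot -> {value: prob})
    into a small discrete summary state, one confidence bucket per slot."""
    def bucket(p):
        return "high" if p > 0.8 else "medium" if p > 0.4 else "low"
    return tuple(
        (slot, bucket(max(dist.values())))
        for slot, dist in sorted(belief.items())
    )

def master_action(summary_action: str, belief: dict) -> str:
    """Master function: expand a summary action back into a full system
    action by filling in the most likely value from the full belief."""
    if summary_action == "confirm_food":   # hypothetical summary action
        top = max(belief["food"], key=belief["food"].get)
        return f"confirm(food={top})"
    return summary_action
```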
Transition Probability and Rewards

Solution: learn from real users.

(Figure: the full pipeline again — Speech Recognition, Language Understanding (domain identification, user intent detection, slot filling), Dialogue Management (DST + dialogue policy), Natural Language Generation, backend database / knowledge providers — with a real user in the loop.)
Transition Probability and Rewards

Solution: learn from a simulated user.

(Figure: a user simulation — a user model plus a reward model — emits a distribution over user dialogue acts (semantic frames); an error model injects recognition and LU errors; Dialogue Management (dialogue state tracking + dialogue policy optimization) replies with system dialogue acts, consulting backend actions / knowledge providers, and receives the reward.)
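A sketch of this training loop under stated assumptions: `policy`, `user_model`, and `reward_model` are hypothetical stubs, and the error model simply swaps in a random confusion act with some probability, imitating ASR/LU noise (agenda-based simulators are a common concrete choice for the user model):

```python
import random

def corrupt(user_act: str, error_rate: float = 0.15) -> str:
    """Error model: with probability error_rate, replace the true user
    act with a confusion, as recognition/LU errors would."""
    confusions = ["inform(food=thai)", "inform(area=centre)", "null()"]
    return random.choice(confusions) if random.random() < error_rate else user_act

def run_simulated_dialogue(policy, user_model, reward_model, max_turns=20):
    total_reward, sys_act = 0.0, "greeting()"
    for _ in range(max_turns):
        user_act = user_model.respond(sys_act)   # user model emits a user act
        observed = corrupt(user_act)             # error model corrupts it
        belief = policy.track(observed)          # dialogue state tracking
        sys_act = policy.select_action(belief)   # dialogue policy
        total_reward += reward_model.reward(belief, sys_act)
        if user_act == "bye()":
            break
    return total_reward
```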
Concluding Remarks

Dialogue policy optimization can be viewed as an RL task.
A POMDP can be viewed as a continuous-space MDP over belief states.
The belief dialogue state space can be summarized to reduce computational complexity.
Transition probabilities and rewards come from
• real users
• simulated users