
(1)

Slides credit from Gašić

(2)

Review

2

(3)

3

Task-Oriented Dialogue System (Young, 2000)

[Pipeline diagram]
Speech Signal → Speech Recognition → Hypothesis: "are there any action movies to see this weekend"
(or typed Text Input: "Are there any action movies to see this weekend?")
→ Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
→ Semantic Frame: request_movie (genre=action, date=this weekend)
→ Dialogue Management (DM): Dialogue State Tracking (DST), Dialogue Policy (consults Backend Database / Knowledge Providers)
→ System Action/Policy: request_location
→ Natural Language Generation (NLG) → Text response: "Where are you located?"

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
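The flow above can be summarized as a chain of components. Below is a minimal, hypothetical Python sketch of that chain; the function names, the hard-coded frame, and the toy policy are illustrative stand-ins, not the actual implementation behind the slide:

```python
# Hypothetical sketch of the pipeline; each stage is a stub.

def language_understanding(text):
    """LU: map an utterance to a semantic frame (intent + slots)."""
    # A real module would do domain/intent detection and slot filling;
    # here the slide's example frame is hard-coded.
    return {"intent": "request_movie",
            "slots": {"genre": "action", "date": "this weekend"}}

def dialogue_management(frame, state):
    """DM: track the dialogue state and choose the next system action."""
    state.update(frame["slots"])
    # Toy policy: request any required slot that is still missing.
    return "request_location" if "location" not in state else "inform_movies"

def natural_language_generation(action):
    """NLG: render the chosen system action as text."""
    templates = {"request_location": "Where are you located?"}
    return templates.get(action, "Sorry, can you rephrase that?")

state = {}
frame = language_understanding("are there any action movies to see this weekend")
print(natural_language_generation(dialogue_management(frame, state)))
```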

(4)

4

Task-Oriented Dialogue System (Young, 2000)

(Same pipeline diagram as the previous slide, highlighting Dialogue Management (DM): Dialogue State Tracking (DST) and Dialogue Policy, with Backend Action / Knowledge Providers.)

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short

(5)

Dialogue Management

5

(6)

6

Example Dialogue

System: greeting ()
User: request (restaurant; foodtype=Thai)
System: request (area)
User: inform (area=centre)
System: inform (restaurant=Bangkok city, area=centre of town, foodtype=Thai)
User: request (address)
System: inform (address=24 Green street)
User: bye ()
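As a data structure, each turn is just a speaker, an act type, and slot-value arguments. A small sketch; the (speaker, act, slots) encoding is hypothetical, chosen only to make the acts above concrete:

```python
# The example dialogue as (speaker, act, slots) tuples; encoding is illustrative.
dialogue = [
    ("system", "greeting", {}),
    ("user",   "request",  {"entity": "restaurant", "foodtype": "Thai"}),
    ("system", "request",  {"slot": "area"}),
    ("user",   "inform",   {"area": "centre"}),
    ("system", "inform",   {"restaurant": "Bangkok city",
                            "area": "centre of town", "foodtype": "Thai"}),
    ("user",   "request",  {"slot": "address"}),
    ("system", "inform",   {"address": "24 Green street"}),
    ("user",   "bye",      {}),
]

for speaker, act, slots in dialogue:
    args = ", ".join(f"{k}={v}" for k, v in slots.items())
    print(f"{speaker:>6}: {act}({args})")
```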

(7)

7

Example Dialogue

(The same dialogue acts as on the previous slide.)

(8)

8

Elements of Dialogue Management

(Figure from Gašić)

(9)

9

Reinforcement Learning

RL is a general-purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent's future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

Big three: action, state, reward

(10)

10

Reinforcement Learning

[Diagram: the agent receives an observation and a reward ("Don't do that") from the environment and responds with an action.]

(11)

11

Reinforcement Learning

[Diagram: the same agent-environment loop, now with a positive reward ("Thank you.").]

Agent learns to take actions to maximize expected reward.

(12)

12

Scenario of Reinforcement Learning

[Diagram: the agent observes the environment, outputs an action (the next move in a game), and receives a reward: +1 for a win, -1 for a loss, 0 otherwise.]

Agent learns to take actions to maximize expected reward.

(13)

13

Agent and Environment

At time step t:

The agent
• Executes action a_t
• Receives observation o_t
• Receives scalar reward r_t

The environment
• Receives action a_t
• Emits observation o_{t+1}
• Emits scalar reward r_{t+1}

t increments at the environment step
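A minimal sketch of this interaction loop, assuming a toy environment and a random agent (both hypothetical stand-ins):

```python
import random

class ToyEnv:
    """Environment: receives a_t, emits o_{t+1} and scalar r_{t+1}."""
    def step(self, action):
        observation = random.random()              # o_{t+1}
        reward = 1.0 if action == "good" else 0.0  # r_{t+1}
        return observation, reward

def agent_policy(observation):
    """Agent: maps the latest observation to an action a_t."""
    return random.choice(["good", "bad"])

env = ToyEnv()
observation = random.random()      # initial observation o_0
for t in range(5):                 # t increments at each environment step
    action = agent_policy(observation)       # agent executes a_t
    observation, reward = env.step(action)   # env emits o_{t+1}, r_{t+1}
    print(f"t={t}: action={action}, reward={reward}")
```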

(14)

14

State

Experience is the sequence of observations, actions, and rewards

State is the information used to determine what happens next
• What happens next depends on the history of experience

The agent selects actions

The environment selects observations/rewards

The state is a function of the history of experience: s_t = f(h_t), where h_t = o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t

(15)

Dialogue Policy Optimization

15

Decision Making

(16)

16

Elements of Dialogue Management

(Figure from Gašić)

Dialogue policy optimization

(17)

17

Partially Observable Markov Decision Process (POMDP)

Dialogue states: s_t

Noisy observations: o_t

System actions: a_t

Rewards: r_t

Transition probability: p(s_{t+1} | s_t, a_t)

Observation probability: p(o_{t+1} | s_{t+1})

Distribution over states (belief): b(s_t)

[Graphical model: s_t transitions to s_{t+1} under action a_t; each s_t emits observation o_t; s_t and a_t determine reward r_t.]
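The belief b is maintained by Bayesian filtering over the hidden states: b'(s') ∝ p(o' | s') Σ_s p(s' | s, a) b(s). A minimal discrete-state sketch; the two-state transition and observation tables are made-up toy numbers:

```python
import numpy as np

# Toy POMDP: 2 hidden dialogue states, 2 observations (numbers are made up).
T = np.array([[0.9, 0.1],   # T[s, s'] = p(s' | s, a) for one fixed action a
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],   # O[s', o] = p(o | s')
              [0.4, 0.6]])

def belief_update(b, obs):
    """b'(s') ∝ p(obs | s') * sum_s p(s' | s, a) b(s)."""
    predicted = T.T @ b              # predict: sum_s p(s'|s,a) b(s)
    updated = O[:, obs] * predicted  # correct with the observation likelihood
    return updated / updated.sum()   # normalize to a distribution

b = np.array([0.5, 0.5])             # uniform initial belief
for obs in [0, 0, 1]:                # a stream of noisy observations
    b = belief_update(b, obs)
    print(b)
```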

(18)

18

DM as Partially Observable Markov Decision Process (POMDP)

Data
• Noisy observations of dialogue states
• Reward – a measure of dialogue quality

Model
• Partially observable Markov decision process (POMDP)

Prediction
• Distribution over dialogue states (belief)
• Optimal system actions – Dialogue Policy Optimization

(19)

19

Decision Making in POMDP

Policy: a_t = π(b_t), a mapping from belief states (belief estimation) to system actions

Return: R = Σ_t γ^t r_t, the accumulated (discounted) reward

Value function: V^π(b), the expected return from belief state b – how good the system is in a particular belief state
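Concretely, the return is the (discounted) sum of rewards, and the value of a belief state can be estimated as the average return over dialogues started from it. A sketch with illustrative numbers; the per-turn penalty of -1 and success bonus of +20 are a common convention in the dialogue-RL literature, not taken from this slide:

```python
def discounted_return(rewards, gamma=0.95):
    """R = sum_k gamma^k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequences of dialogues started from the same belief:
episodes = [
    [-1, -1, 20],      # short, successful dialogue
    [-1, -1, -1, 20],  # longer, still successful
    [-1, -1, -1, 0],   # failed dialogue
]

returns = [discounted_return(ep) for ep in episodes]
print(sum(returns) / len(returns))   # Monte-Carlo estimate of V(b)
```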

(20)

20

POMDP Policy Optimization

Goal: find the value function associated with the optimal policy, i.e., the one that generates the maximal return

Problem: exact solution is tractable only for very simple cases (Kaelbling et al., 1998)

Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are the belief states

(21)

21

Markov Decision Process (MDP)

Belief state from tracking: b_t = s_t

System actions: a_t

Rewards: r_t

Transition probability: p(b_{t+1} | b_t, a_t)

[Graphical model: b_t transitions to b_{t+1} under action a_t; b_t and a_t determine reward r_t.]

(22)

22

DM as Markov Decision Process (MDP)

Data
• Belief dialogue states (continuous)
• Reward – a measure of dialogue quality

Model
• Markov decision process (MDP) & reinforcement learning

Prediction
• System actions – Dialogue Policy Optimization
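With a (discretized) belief state, model-free RL machinery applies directly. A minimal tabular Q-learning sketch on a toy chain MDP, standing in for policy optimization over belief states; all states, rewards, and hyperparameters here are illustrative:

```python
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def env_step(s, a):
    """Toy chain environment: action 1 moves toward the goal state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 20.0 if s_next == n_states - 1 else -1.0
    return s_next, reward, s_next == n_states - 1   # (s', r, done)

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = (random.randrange(n_actions) if random.random() < epsilon
             else max(range(n_actions), key=lambda x: Q[s][x]))
        s_next, r, done = env_step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])        # TD update
        s = s_next

# Greedy policy after learning: should prefer action 1 everywhere.
print([max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)])
```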

(23)

23

Policy Optimization Issues

Size of the optimization problem
• The belief dialogue state space is large and continuous
• The system action space is large

Knowledge of the environment (the user)
• The transition probability is unknown (user status)
• How to obtain rewards

(24)

24

Large Belief Space and Action Space

Solution: perform optimization in a reduced summary space built according to heuristics

[Diagram: Belief Dialogue State → (Summary Function) → Summary Dialogue State → (Summary Policy) → Summary Action → (Master Function) → System Action]
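A hypothetical sketch of the three mappings: a summary function compresses the full belief to a few features, a policy acts in that summary space, and a master function expands the summary action back into a full system action. All names and heuristics below are invented for illustration:

```python
def summary_function(belief):
    """Compress a full belief (slot -> value -> prob) into coarse features."""
    top = {slot: max(dist.values()) for slot, dist in belief.items()}
    return tuple(round(p, 1) for p in top.values())  # discrete summary state

def summary_policy(summary_state):
    """Toy policy in summary space: confirm if uncertain, else inform."""
    return "confirm" if min(summary_state) < 0.8 else "inform"

def master_function(summary_action, belief):
    """Expand a summary action into a full system action using the belief."""
    slot, dist = min(belief.items(), key=lambda kv: max(kv[1].values()))
    value = max(dist, key=dist.get)
    return f"{summary_action}({slot}={value})"

belief = {"foodtype": {"Thai": 0.7, "Chinese": 0.3},
          "area": {"centre": 0.9, "north": 0.1}}
s = summary_function(belief)
print(master_function(summary_policy(s), belief))  # -> confirm(foodtype=Thai)
```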

(25)

25

Transition Probability and Rewards

Solution: learn from real users

[The task-oriented dialogue system pipeline from the earlier slides (Speech Recognition → LU → DM → NLG, with Backend Database / Knowledge Providers), deployed with real users.]

(26)

26

Transition Probability and Rewards

Solution: learn from a simulated user

[Diagram: a User Simulation (User Model + Reward Model) emits a distribution over user dialogue acts (semantic frames); an Error Model adds recognition and LU errors; Dialogue Management (Dialogue State Tracking (DST) + Dialogue Policy Optimization, backed by Backend Action / Knowledge Providers) returns system dialogue acts to the simulator, and the Reward Model provides the reward.]
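A minimal sketch of this setup: a hypothetical agenda-style user model, an error model that corrupts user acts, and a toy reward model (success bonus minus number of turns, a common convention; the specific numbers are illustrative):

```python
import random

class SimulatedUser:
    """User model: pops user dialogue acts from a fixed agenda."""
    def __init__(self):
        self.agenda = [("request", {"entity": "restaurant", "foodtype": "Thai"}),
                       ("inform",  {"area": "centre"}),
                       ("request", {"slot": "address"}),
                       ("bye",     {})]

    def respond(self, system_act):
        return self.agenda.pop(0) if self.agenda else ("bye", {})

def error_model(user_act, error_rate=0.2):
    """Simulate ASR/LU noise by occasionally corrupting the act type."""
    act, slots = user_act
    if random.random() < error_rate:
        act = random.choice(["request", "inform", "bye"])
    return act, slots

def reward_model(turns, success):
    """Toy dialogue-quality reward: success bonus minus dialogue length."""
    return (20.0 if success else 0.0) - turns

user = SimulatedUser()
system_act, turns = ("greeting", {}), 0
while turns < 20:
    turns += 1
    act, slots = error_model(user.respond(system_act))
    if act == "bye":
        break
    system_act = ("inform", {})   # trivial stand-in for the learned policy
print("reward:", reward_model(turns, success=True))
```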

(27)

27

Concluding Remarks

Dialogue policy optimization can be viewed as an RL task

A POMDP can be viewed as a continuous-space MDP over belief states

The belief dialogue state space can be mapped to a summary space to reduce computational complexity

The transition probability and reward come from
• Real users
• Simulated users
