
(1)

Slides credit from Gašić

(2)

Review

2

(3)

3

Task-Oriented Dialogue System

(Young, 2000)

[Figure: task-oriented dialogue system pipeline]

Speech Signal → Speech Recognition → Text Input: "Are there any action movies to see this weekend?" / Hypothesis: "are there any action movies to see this weekend"

→ Language Understanding (LU): • Domain Identification • User Intent Detection • Slot Filling

→ Semantic Frame: request_movie (genre=action, date=this weekend)

→ Dialogue Management (DM): • Dialogue State Tracking (DST) • Dialogue Policy (backed by Backend Database / Knowledge Providers)

→ System Action/Policy: request_location

→ Natural Language Generation (NLG) → Text response: "Where are you located?"

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short


(5)

Dialogue Management

5

(6)

6

Example Dialogue

System: greeting ()

User: request (restaurant; foodtype=Thai)

System: request (area)

User: inform (area=centre)

System: inform (restaurant=Bangkok city, area=centre of town, foodtype=Thai)

User: request (address)

System: inform (address=24 Green street)

User: bye ()
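For concreteness, here is a minimal sketch (not from the original slides) of how such dialogue acts can be represented as semantic frames, e.g. as input to dialogue state tracking; the dict layout and field names are illustrative assumptions.

```python
# Illustrative only: one way to encode the example dialogue acts as semantic frames.
from typing import Optional

def act(speaker: str, act_type: str, slots: Optional[dict] = None) -> dict:
    """Build a dialogue-act frame: who spoke, which act, and its slot-value pairs."""
    return {"speaker": speaker, "act": act_type, "slots": slots or {}}

dialogue = [
    act("system", "greeting"),
    act("user", "request", {"entity": "restaurant", "foodtype": "Thai"}),
    act("system", "request", {"slot": "area"}),
    act("user", "inform", {"area": "centre"}),
    act("system", "inform", {"restaurant": "Bangkok city",
                             "area": "centre of town", "foodtype": "Thai"}),
    act("user", "request", {"slot": "address"}),
    act("system", "inform", {"address": "24 Green street"}),
    act("user", "bye"),
]

for turn in dialogue:
    print(turn)
```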


(8)

8

Elements of Dialogue Management

(Figure from Gašić)

(9)

9

Rule-Based Management

9

(10)

10

Elements of Dialogue Management

(Figure from Gašić)

Dialogue policy optimization

(11)

Dialogue Policy Optimization

11

Reinforcement Learning

(12)

12

Reinforcement Learning

RL is a general-purpose framework for decision making

RL is for an agent with the capacity to act

Each action influences the agent's future state

Success is measured by a scalar reward signal

Goal: select actions to maximize future reward

Big three: action, state, reward

(13)

Scenario of Reinforcement Learning

Agent learns to take actions to maximize expected reward.

13

[Figure: agent-environment loop: the agent receives observation o_t and reward r_t from the environment and chooses an action a_t (its next move); e.g., reward = +1 if win, -1 if loss, 0 otherwise]

(14)

14

Supervised vs. Reinforcement

Supervised: learning from a teacher

e.g., "Hello" → say "Hi"; "Bye bye" → say "Good bye"

Reinforcement: learning from critics

e.g., the agent carries out a whole conversation and only receives feedback such as "Bad" afterwards

(15)

15

Dialogue as Reinforcement Learning

Problems in solving dialogue as an RL task

1) Optimization problem size

• Belief dialogue state space is large and continuous

• System action space is large

→ Solution: learn in a reduced summary space

2) Knowledge of the environment (user)

• Transition probability is unknown (user status)

• How to get rewards

→ Solution: learn in interaction with a simulated user

3) RL takes a long time to converge

(16)

16

Large Belief Space and Action Space

Solution: perform optimization in a reduced summary space built according to heuristics

[Figure: belief dialogue state → (summary function) → summary dialogue state → (summary policy) → summary action → (master function) → system action]

(17)

17

Transition Probability and Rewards

Solution: learn from a simulated user

17

[Figure: learning with a simulated user: user simulation (user model + reward model) emits a distribution over user dialogue acts (semantic frames); an error model injects recognition and LU errors; the dialogue management (DM) block, i.e. dialogue state tracking (DST) and dialogue policy optimization with backend action / knowledge providers, returns system dialogue acts; the reward model supplies the reward signal]

(18)

18

Agent and Environment

At time step t

The agent

Executes action at

Receives observation ot

Receives scalar reward rt

The environment

Receives action at

Emits observation ot+1

Emits scalar reward rt+1

t increments at env. step

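A minimal sketch of this interaction loop. The toy environment, its reward scheme, and the random agent below are hypothetical stand-ins for illustration, not the system described in the slides.

```python
import random

class ToyDialogueEnv:
    """Hypothetical environment: the episode ends after a fixed number of turns."""
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "initial observation"            # o_1

    def step(self, action):
        self.turn += 1
        done = self.turn >= self.max_turns
        reward = 1.0 if done else -1.0          # illustrative: per-turn penalty, bonus at the end
        observation = f"observation after {action}"
        return observation, reward, done        # o_{t+1}, r_{t+1}, episode end

actions = ["request", "inform", "confirm"]
env = ToyDialogueEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice(actions)             # the agent executes a_t
    obs, reward, done = env.step(action)        # and receives o_{t+1} and r_{t+1}
    total_reward += reward
print("return:", total_reward)
```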

(19)

19

State

Experience is the sequence of observations, actions, and rewards

State is the information used to determine what happens next

• what happens next depends on the history of experience

• the agent selects actions

• the environment selects observations/rewards

The state is a function of the history of experience
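In the standard formulation this material follows (reconstructed here, not taken from the slide images), the history and state can be written as:

```latex
% History of experience up to time t, and the state as a function of it
h_t = o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t, \qquad s_t = f(h_t)
```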

(20)

20

POMDP Policy Optimization

Finding value function associated with optimal policy, i.e. the one that generates maximal return

Problem: tractable only for very simple cases

(Kaelbling et al., 1998)

Alternative solution: a discrete-space POMDP can be viewed as a continuous-space MDP whose states are belief states

20

(21)

21

Markov Decision Process (MDP)

Belief state from tracking: b_t = s_t

System actions: a_t

Rewards: r_t

Transition probability: p(b_{t+1} | b_t, a_t)

(22)

22

DM as Markov Decision Process (MDP)

22

Data: belief dialogue states (continuous); reward – a measure of dialogue quality

Model: Markov decision process (MDP) & reinforcement learning

Prediction: system actions – dialogue policy optimization

(23)

23

Dialogue Policy Optimization

Dialogue management in an RL framework

[Figure: the dialogue manager as an RL agent: the user is the environment; language understanding produces observation O, the agent (dialogue manager) selects action A, natural language generation realizes it, and reward R is returned]

The optimized dialogue policy selects the best action that maximizes the future reward.

Correct rewards are a crucial factor in dialogue policy training

(24)

24

Reward

Reinforcement learning is based on the reward hypothesis

A reward r_t is a scalar feedback signal

Indicates how well the agent is doing at step t

Reward hypothesis: all agent goals can be described by the maximization of expected cumulative reward

(25)

25

Reward for RL ≅ Evaluation for System

Dialogue is a special RL task

Humans are involved in the interaction and in rating (evaluating) the dialogue

Fully human-in-the-loop framework

Rating: correctness, appropriateness, and adequacy

• Expert rating: high quality, high cost

• User rating: unreliable quality, medium cost

• Objective rating: check desired aspects, low cost

25

(26)

26

Reinforcement Learning for Dialogue Policy Optimization

26

[Figure: RL loop for dialogue: user input (o) passes through language understanding to give state s; the dialogue policy a = π(s) selects action a; language (response) generation produces the response; rewards (s, a, r, s') are collected to optimize Q(s, a)]

Type of Bots | State | Action | Reward

Social ChatBots | Chat history | System response | # of turns maximized; intrinsically motivated reward

InfoBots (interactive Q/A) | User current question + context | Answers to current question | Relevance of answer; # of turns minimized

Task-Completion Bots | User current input + context | System dialogue act w/ slot values (or API calls) | Task success rate; # of turns minimized

Goal: develop a generic deep RL algorithm to learn dialogue policy for all bot categories

(27)

27

Dialogue Reinforcement Learning Signal

Typical reward function

-1 per-turn penalty

Large reward at completion if successful

Typically requires domain knowledge

✔ Simulated user

✔ Paid users (Amazon Mechanical Turk)

✖ Real users


27

The user simulator is usually required for dialogue system training before deployment
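A minimal sketch of the typical per-turn/terminal reward scheme described above; the constants (and the success bonus tied to max_turns) are illustrative assumptions, not values from the slides.

```python
def dialogue_reward(turn: int, done: bool, task_success: bool, max_turns: int = 20) -> float:
    """Typical shaped reward: small per-turn penalty, large terminal reward on success."""
    if not done:
        return -1.0                 # -1 per-turn penalty encourages short dialogues
    # Illustrative terminal values: large bonus on success, penalty on failure.
    return 2.0 * max_turns if task_success else -1.0 * max_turns

# Example: a successful dialogue that ends on turn 7
print(dialogue_reward(turn=7, done=True, task_success=True))   # 40.0
```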

(28)

28

Sequential Decision Making

Goal: select actions to maximize total future reward

Actions may have long-term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain more long-term reward

28

(29)

29

Deep Reinforcement Learning

[Figure: deep RL: the observation from the environment is the input to a DNN (the function), whose output is the action; the reward is used to pick the best function]

(30)

Reinforcement Learning Approach

Value-Based / Policy-Based / Model-Based

30

(31)

31

Major Components in an RL Agent

An RL agent may include one or more of these components

Policy: agent’s behavior function

Value function: how good is each state and/or action

Model: agent’s representation of the environment

31

(32)

32

Policy

A policy is the agent’s behavior

A policy maps from state to action

Deterministic policy: a = π(s)

Stochastic policy: π(a | s) = P[a_t = a | s_t = s]

32

(33)

33

Value Function

A value function is a prediction of future reward (with action a in state s)

Q-value function gives the expected total reward from state s and action a, under policy π, with discount factor γ

Value functions decompose into a Bellman equation
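The standard definitions these bullets refer to, reconstructed here since the slide's equations are not in the extracted text:

```latex
% Q-value function under policy \pi with discount factor \gamma
Q^{\pi}(s,a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s,\ a_t = a,\ \pi\right]
% Bellman expectation equation
Q^{\pi}(s,a) = \mathbb{E}_{s',a'}\left[r + \gamma\, Q^{\pi}(s',a') \mid s, a\right]
```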

33

(34)

34

Optimal Value Function

An optimal value function

• is the maximum achievable value

• allows us to act optimally

• informally, maximizes over all future decisions

• also decomposes into a Bellman equation

34

(35)

35

Reinforcement Learning Approach

Policy-based RL

• Search directly for the optimal policy π*, the policy achieving maximum future reward

Value-based RL

• Estimate the optimal value function Q*(s, a), the maximum value achievable under any policy

Model-based RL

• Build a model of the environment

• Plan (e.g. by lookahead) using the model

(36)

36

Maze Example

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

36

(37)

37

Maze Example: Policy

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

37

Arrows represent policy π(s) for each state s

(38)

38

Maze Example: Value Function

Rewards: -1 per time-step

Actions: N, E, S, W

States: agent’s location

38

Numbers represent the value of each state s under policy π

(39)

Categorizing RL Agents

Value-Based: value function; no policy (implicit)

Policy-Based: policy; no value function

Actor-Critic: policy and value function

Model-Free: policy and/or value function; no model

Model-Based: policy and/or value function; model of the environment

39

(40)

40

RL Agent Taxonomy

40

(41)

Value-Based Deep RL

Dynamic Programming / Monte-Carlo / Temporal-Difference / Q-Learning

41

(42)

42

Dynamic Programming

Model-based

Evaluate policy

Update policy

42

(43)

43

Dynamic Programming

GridWorld example

43

(44)

44

Monte-Carlo RL

Characteristics

Learn from complete episodes of experience

Model-free: no knowledge of MDP transitions / rewards

Value = mean return

MC policy evaluation

Goal: learn the value function from complete episodes of experience under a policy

Return is the total discounted reward

Value function is the expected return
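The return and value function the slide refers to, in their standard forms (reconstructed, not taken verbatim from the slide images):

```latex
% Total discounted return from time t
G_t = r_{t+1} + \gamma r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
% Value function as expected return; MC estimates it by the empirical mean of observed returns
v_{\pi}(s) = \mathbb{E}\left[G_t \mid s_t = s\right]
```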

44

(45)

45

Monte-Carlo

Model-free prediction

45

(46)

46

Temporal-Difference RL

Characteristics

Learn from incomplete episodes of experience

Model-free: no knowledge of MDP transitions / rewards

Update a guess toward a guess

TD policy evaluation

Goal: learn online from experience under a policy

Value function is updated toward the estimated return (the TD target)
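A standard form of the TD(0) update toward the estimated return (the slide's equation is missing from the extraction; α denotes a learning rate):

```latex
% TD(0) update: move V(s_t) toward the TD target r_{t+1} + \gamma V(s_{t+1})
V(s_t) \leftarrow V(s_t) + \alpha\left(\underbrace{r_{t+1} + \gamma V(s_{t+1})}_{\text{TD target}} - V(s_t)\right)
```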

(47)

47

Temporal-Difference

Model-free prediction

47

(48)

48

Q-Learning – Value Function Approximation

Representing value functions with a lookup table does not scale:

• too many states and/or actions to store

• too slow to learn the value of each entry individually

Values can instead be estimated with function approximation

48

(49)

49

Q-Networks

Q-networks represent value functions with weights

generalize from seen states to unseen states

update parameters for function approximation

49

(50)

50

Q-Learning

Goal: estimate optimal Q-values

Optimal Q-values obey a Bellman equation

Value iteration algorithms solve the Bellman equation; the bootstrapped right-hand side serves as the learning target
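The standard equations behind these bullets, reconstructed here (α is a learning rate):

```latex
% Bellman optimality equation for Q*
Q^{*}(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a\right]
% Q-learning update; the bracketed learning target is the bootstrapped right-hand side
Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)
```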

(51)

51

Deep Q-Networks (DQN)

Represent value function by deep Q-network with weights

Objective is to minimize mean square error (MSE) loss by SGD

Leading to the following Q-learning gradient
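The MSE objective and gradient referred to above, in their standard DQN form (reconstructed; w are the network weights and the target is treated as a constant when differentiating):

```latex
% Mean-squared error between the Q-network and the Q-learning target
L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a';\, w) - Q(s, a;\, w)\right)^{2}\right]
% Corresponding Q-learning gradient
\frac{\partial L(w)}{\partial w} = -\,2\,\mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a';\, w) - Q(s, a;\, w)\right)\frac{\partial Q(s, a;\, w)}{\partial w}\right]
```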

51

Issue: naïve Q-learning oscillates or diverges when a neural network is used, due to: 1) correlations between samples, 2) non-stationary learning targets

(52)

52

Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential

Successive samples are correlated, non-iid (independent and identically distributed)

2. Policy changes rapidly with slight changes to Q-values

Policy may oscillate

Distribution of data can swing from one extreme to another

3. Scale of rewards and Q-values is unknown

Naive Q-learning gradients can be unstable when backpropagated

52

(53)

53

Stable Solutions for DQN

DQN provides stable solutions to deep value-based RL

1. Use experience replay

Break correlations in data, bring us back to iid setting

Learn from all past policies

2. Freeze target Q-network

Avoid oscillation

Break correlations between Q-network and target

3. Clip rewards or normalize the network adaptively to a sensible range

Robust gradients

53

(54)

54

Stable Solution 1: Experience Replay

To remove correlations, build a dataset from agent’s experience

Take action at according to 𝜖-greedy policy

Store transition in replay memory D

Sample random mini-batch of transitions from D

Optimize MSE between Q-network and Q-learning targets

54

(𝜖: small probability of exploration)
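A compact sketch of this recipe (ε-greedy behavior policy plus a replay memory) using plain NumPy and a linear Q-function in place of a deep network; the sizes, hyperparameters, and environment interface are illustrative assumptions, not the slides' setup. It also includes a frozen copy of the weights, anticipating the fixed target network of the next slide.

```python
import random
from collections import deque

import numpy as np

# Hypothetical sizes and hyperparameters, for illustration only.
STATE_DIM, NUM_ACTIONS = 8, 4
GAMMA, ALPHA, EPSILON = 0.99, 0.01, 0.1

w = np.zeros((NUM_ACTIONS, STATE_DIM))      # linear Q-function stands in for a deep Q-network
w_target = w.copy()                          # frozen copy used for the learning target
replay = deque(maxlen=10_000)                # replay memory D

def q_values(weights, s):
    return weights @ s                       # Q(s, a; w) for every action a

def select_action(s):
    if random.random() < EPSILON:            # small probability of exploration
        return random.randrange(NUM_ACTIONS)
    return int(np.argmax(q_values(w, s)))    # otherwise act greedily

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))   # store transition (s_t, a_t, r_{t+1}, s_{t+1})

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)    # random mini-batch breaks correlations
    for s, a, r, s_next, done in batch:
        target = r if done else r + GAMMA * np.max(q_values(w_target, s_next))
        td_error = target - q_values(w, s)[a]
        w[a] += ALPHA * td_error * s             # SGD step on the MSE between Q and target
```

Periodically copying w into w_target (e.g., every few thousand steps) gives the frozen-target behavior described on the next slide.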

(55)

55

Stable Solution 2: Fixed Target Q-Network

To avoid oscillations, fix parameters used in Q-learning target

Compute Q-learning targets w.r.t. old, fixed parameters

Optimize MSE between Q-network and Q-learning targets

Periodically update fixed parameters

55

(56)

56

Stable Solution 3: Reward / Value Range

To avoid oscillations, control the reward / value range

DQN clips the rewards to [−1, +1]

Prevents too large Q-values

Ensures gradients are well-conditioned

56

(57)

57

Other Improvements: Double DQN

Nature DQN

Double DQN: remove the upward bias caused by taking the max over estimated Q-values

• Current Q-network is used to select actions

• Older Q-network is used to evaluate actions
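In equation form (reconstructed; w denotes the current weights and w⁻ the older, frozen target-network weights), the Nature-DQN target versus the Double-DQN target:

```latex
% Nature DQN target: the max over actions is both selected and evaluated with w^-
y = r + \gamma \max_{a'} Q(s', a';\, w^{-})
% Double DQN: the current network selects the action, the older network evaluates it
y = r + \gamma\, Q\!\left(s',\ \arg\max_{a'} Q(s', a';\, w),\ w^{-}\right)
```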

57

(58)

58

Other Improvements: Prioritized Replay

Prioritized Replay: weight experience based on surprise

Store experience in priority queue according to DQN error

58

(59)

59

Other Improvements: Dueling Network

Dueling Network: split Q-network into two channels

Action-independent value function

Value function estimates how good the state is

Action-dependent advantage function

Advantage function estimates the additional benefit of taking a particular action
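The standard way the two channels are recombined (reconstructed; θ denotes the shared parameters, α the advantage-stream parameters, and β the value-stream parameters):

```latex
% Dueling aggregation: value stream V plus mean-centred advantage stream A
Q(s,a;\,\theta,\alpha,\beta) = V(s;\,\theta,\beta) + \left(A(s,a;\,\theta,\alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\,\theta,\alpha)\right)
```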

59

(60)

60

DQN for Dialogue Management

(Li et al., 2017)

Goal: end-to-end learning of values Q(s, a) from interactions

Input: state is the combination of user history observation, previous system action, database returned results

Output: Q(s, a) for all available system action a

Reward: -1 per turn; large reward for successful task

60

[Figure: DQN-based DM: input is the semantic frame (request_movie: genre=action, date=this weekend), output is the system action/policy (request_location); trained in interaction with a simulated user and a backend DB]

(61)

61

E2E RL-Based System

(Zhao and Eskenazi, 2016)

61

Joint learning

NLU, DST, Dialogue Policy

Deep RL for training

Deep Q-network

Deep recurrent network

[Plot: learning curves comparing the baseline RL and hybrid-RL systems]

http://www.aclweb.org/anthology/W/W16/W16-36.pdf#page=19

(62)

62

E2E Task-Completion Bot (TC-Bot)

(Li et al., 2017)

[Figure: end-to-end neural dialogue system: the LU tags each input word w_i with a slot label (e.g., B-type, O) and the EOS token with <intent>; the resulting semantic frame (request_movie: genre=action, date=this weekend) feeds the dialogue management (DM), which outputs a system action/policy (request_location); the NLG generates the response; a user simulator (user agenda modeling, driven by a user goal) replies with user dialogue acts such as inform(location=San Francisco). Text input: "Are there any action movies to see this weekend?"]

Idea: supervised learning for each component and reinforcement learning for end-to-end training the neural dialogue system

https://arxiv.org/abs/1703.01008

(63)

63

E2E Task-Completion Bot (TC-Bot)

(Li et al., 2017)

User goal: Two tickets for “the witch” tomorrow 9:30 PM at regal meridian 16, Seattle.

RULE BASED SYSTEM

User: Can I get 2 tickets for the witch?

Agent: Which theater do you want?

User: Regal meridian 16

Agent: What time would you like to see it?

User: 9:30 pm

Agent: What date would you like to see it?

User: Tomorrow!

Agent: How many people are you?

User: Just two adults.

Agent: What is your location?

User: Seattle!

Agent: What kind of video format do you like?

User: Oh, any kind is okay.

Agent: Do you have any price constraint?

User: No, I don’t care.

Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.

User: Thank you.

Agent: Thank you!

The system can learn how to efficiently interact with users for task completion

REINFORCEMENT LEARNING SYSTEM

User: Can I get 2 tickets for the witch?

Agent: What date would you like to see it?

User: Tomorrow

Agent: Which theater do you prefer?

User: Regal meridian 16

Agent: What is your location?

User: Seattle

Agent: What time would you like to see it?

User: 9:30 pm.

Agent: Great - I was able to purchase 2 tickets for you to see the witch tomorrow at regal meridian 16 theater in Seattle at 9:30 pm.

User: Thanks.

Agent: Thanks!

https://arxiv.org/abs/1703.01008

(64)

Policy-Based Deep RL

Estimate how good an agent's behavior is

64

(65)

65

Deep Policy Networks

Represent policy by deep network with weights

Objective is to maximize total discounted reward by SGD

65

(the policy can be either stochastic or deterministic)

(66)

66

Policy Gradient

The gradient of a stochastic policy is given below

The gradient of a deterministic policy is given below; deterministic policies are one way to deal with continuous actions
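The standard forms of these two gradients, reconstructed here (θ denotes the policy weights, J(θ) the expected return, and Q^π the value estimate):

```latex
% Stochastic policy gradient (score-function / REINFORCE form)
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s,a)\right]
% Deterministic policy gradient (a = \pi_{\theta}(s)), convenient for continuous actions
\nabla_{\theta} J(\theta) = \mathbb{E}\left[\nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} Q^{\pi}(s,a)\big|_{a=\pi_{\theta}(s)}\right]
```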

(67)

67

Actor-Critic (Value-Based + Policy-Based)

Estimate value function

Update policy parameters by SGD

Stochastic policy

Deterministic policy

67

Q-networks tell whether a policy is good or not; policy networks optimize the policy accordingly
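One common one-step actor-critic update pair consistent with this description, reconstructed here (w are critic weights, θ actor weights, α learning rates, δ the TD error):

```latex
% Critic: move Q_w toward the one-step target
\delta = r + \gamma\, Q_w(s',a') - Q_w(s,a), \qquad w \leftarrow w + \alpha_w\, \delta\, \nabla_w Q_w(s,a)
% Actor: follow the policy gradient using the critic's estimate
\theta \leftarrow \theta + \alpha_\theta\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)
```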

(68)

68

Deterministic Deep Policy Gradient

Goal: end-to-end learning of control policy from pixels

Input: state is stack of raw pixels from last 4 frames

Output: two separate CNNs for Q and 𝜋

Lillicrap et al., “Continuous control with deep reinforcement learning,” arXiv, 2015.

(69)

69

E2E RL-Based Info-Bot

(Dhingra et al., 2016)

User goal: Movie=?; Actor=Bill Murray; Release Year=1993

User: Find me the Bill Murray’s movie.

KB-InfoBot: When was it released?

User: I think it came out in 1993.

KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.

Knowledge Base (head, relation, tail): (Groundhog Day, actor, Bill Murray), (Groundhog Day, release year, 1993), (Australia, actor, Nicole Kidman), (Mad Max: Fury Road, release year, 2015)

Idea: differentiable database for propagating the gradients 69

https://arxiv.org/abs/1609.00777

(70)

70

Dialogue Management Evaluation

Metrics

Turn-level evaluation: system action accuracy

Dialogue-level evaluation: task success rate, reward

70

(71)

71

Concluding Remarks

Dialogue policy optimization of DM solves MDP via RL

Value-based

Dynamic programming

Monte-Carlo

Temporal-difference

Q-learning → DQN

Policy-based

Deep policy gradient

Actor-critic

71
