
Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning


Academic year: 2022


Baolin Peng¹, Xiujun Li², Jianfeng Gao², Jingjing Liu², Yun-Nung Chen³, Kam-Fai Wong¹

¹The Chinese University of Hong Kong, ²Microsoft AI & Research, ³National Taiwan University

Summary

➢ Motivation

• Exploiting reinforcement learning for dialogue policy learning

• Exploration in the large state-action space is challenging

• Reward is delayed and sparse over a long trajectory

➢ Approach

• Propose an Adversarial Advantage Actor-Critic (Adversarial A2C) algorithm

• Leverage expert-generated dialogues as priors

• Use a discriminator to tell whether a response comes from the agent or from human experts

• Use the discriminator's output as an intrinsic reward to steer exploration toward state-action regions similar to those human experts visit

➢ Results

• Significant improvement in efficiency and performance on a movie-ticket booking domain

1. Task Definition

➢ Natural Language Understanding (NLU) turns natural language into intents and slot-values

➢ Natural Language Generation (NLG) turns system actions into natural language

➢ Dialogue Manager (DM)

• tracks the dialogue state and updates it accordingly

• interacts with the database

• takes the state as input and outputs a system action → Dialogue Policy Learning

Agent             Success (%)   Avg. turns   Avg. reward
Rule              41.34         16.00        0.26
A2C               81.24         15.43        5.08
BBQN-MAP          81.56         18.75        5.00
Adversarial A2C   87.52         13.52        5.93

[Figure: Model. Input: dialogue observations (history utterances, belief state, KB results). Output: system action, e.g. inform(ticket=?).]
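A system action such as inform(ticket=?) can be read as an intent plus slot-value pairs. The following is a minimal sketch of one way to hold such an action in code; the class name `DialogueAct`, the helper `parse_act`, and the use of `None` for a "?" value are illustrative assumptions, not the representation used by the authors.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueAct:
    """Hypothetical container: an intent plus slot-value pairs."""
    intent: str
    # A value of None stands for an unfilled value, written "?" in text form.
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def __str__(self) -> str:
        args = ", ".join(
            f"{name}={'?' if value is None else value}"
            for name, value in self.slots.items()
        )
        return f"{self.intent}({args})"


def parse_act(text: str) -> DialogueAct:
    """Parse a string such as 'inform(ticket=?)' into a DialogueAct.

    Assumes slot values contain no commas or parentheses.
    """
    intent, _, rest = text.partition("(")
    slots: Dict[str, Optional[str]] = {}
    body = rest.rstrip(")").strip()
    if body:
        for pair in body.split(","):
            name, _, value = pair.partition("=")
            value = value.strip()
            slots[name.strip()] = None if value == "?" else value
    return DialogueAct(intent.strip(), slots)
```

Parsing and printing round-trip, so `str(parse_act("inform(ticket=?)"))` gives back the same string.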

2. Methodology

3. Experiments & Results

➢ Dataset: human-human conversations in the movie-ticket booking scenario

• collected via AMT and annotated by human experts

• 280 labeled dialogues, averaging 11 turns each

• 11 dialogue acts, 29 slots

• Slots are informable (used to narrow down the search) or requestable (used to ask the agent for information)

• Uses a publicly available user simulator

➢ Baselines

• Rule Agent: handcrafted rule-based policy that informs and requests a hand-picked subset of necessary slots

• A2C: trained with a pre-defined reward function using a standard advantage actor-critic algorithm

• BBQN-MAP Agent (AAAI’18): the best-performing agent among a set of BBQN variants, noted for efficient policy exploration in dialogue systems

➢ Evaluation

• Success rate

• Learning curves averaged over 10 runs

• 2000 dialogues for testing
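The evaluation bullets above can be sketched as a small aggregation routine. This is only an illustration: the record type `DialogueResult` and the functions `summarize` and `average_runs` are hypothetical names, and it assumes each test dialogue yields a success flag, a turn count, and a cumulative reward.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class DialogueResult:
    success: bool   # did the agent complete the user's goal?
    turns: int      # number of turns in the dialogue
    reward: float   # cumulative reward for the dialogue


def summarize(results: List[DialogueResult]) -> Dict[str, float]:
    """Success rate (in percent), average turns, and average reward."""
    if not results:
        raise ValueError("no dialogues to evaluate")
    return {
        "success_rate": 100.0 * mean(1.0 if r.success else 0.0 for r in results),
        "avg_turns": mean(r.turns for r in results),
        "avg_reward": mean(r.reward for r in results),
    }


def average_runs(per_run: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over independent training runs."""
    return {key: mean(run[key] for run in per_run) for key in per_run[0]}
```

In the setting described here, `summarize` would be called on the 2000 test dialogues of each run and `average_runs` on the 10 resulting summaries.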

4. Conclusion

• We propose an adversarial advantage actor-critic model with efficient exploration.

• The discriminator serves as an additional critic that guides policy exploration toward human-like behavior. It is also connected to inverse reinforcement learning, which learns a reward function.

• Our experiments in a movie-ticket booking domain show that it outperforms the Rule, A2C, and BBQN-MAP baselines: higher success rate (87.52), fewer turns (13.52), and higher reward (5.93).

➢ Advantage Actor-Critic for Dialogue Policy Learning

• Find a policy 𝜋 that maximizes the expected reward

• 𝜋 is a parameterized probabilistic mapping function:

o Update 𝜃 with the following gradients

o Baseline function for reducing variance

o TD error as an unbiased estimate of the advantage
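The poster's equations for these bullets did not survive extraction. For orientation, the textbook advantage actor-critic forms that such bullets usually refer to are sketched below. The notation (objective J, discount γ, value function V, advantage A, TD error δ) is assumed here and may differ from the authors' own.

```latex
\begin{align}
  J(\theta) &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big]
  && \text{expected (discounted) reward} \\
  \nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]
  && \text{policy gradient} \\
  A(s_t, a_t) &= Q(s_t, a_t) - V(s_t)
  && \text{baseline } V \text{ reduces variance} \\
  \delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t)
  && \text{TD error, used in place of } A(s_t, a_t)
\end{align}
```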

➢ Adversarial Training

• Actor 𝜋𝜃 acts as a generator G

• A discriminator D classifies whether a state-action pair (s, a) comes from experts or from G

• D can be viewed as a reward function extracted from experts’ trajectories

• D is trained to maximize the probability of classifying each pair correctly

• Actor 𝜋𝜃 (G) can be improved with −log(1 − D(s, a)) as the reward function
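The two objectives in the bullets above can be written out numerically. This is a minimal sketch, assuming D(s, a) is the discriminator's estimated probability that a pair comes from an expert, and that D is trained with ordinary binary cross-entropy. The function names and the `EPS` guard are illustrative, not the authors' code.

```python
import math
from typing import Sequence

EPS = 1e-8  # assumed guard against log(0); not taken from the poster


def discriminator_loss(d_expert: Sequence[float], d_agent: Sequence[float]) -> float:
    """Binary cross-entropy with expert pairs labeled 1 and agent pairs labeled 0.

    Minimizing this loss corresponds to maximizing the log-probability that
    D classifies each (s, a) pair correctly.
    """
    expert_term = sum(math.log(p + EPS) for p in d_expert) / len(d_expert)
    agent_term = sum(math.log(1.0 - p + EPS) for p in d_agent) / len(d_agent)
    return -(expert_term + agent_term)


def actor_reward(d_score: float) -> float:
    """Reward for the actor as stated on the poster: -log(1 - D(s, a))."""
    return -math.log(1.0 - d_score + EPS)
```

Under this convention the actor's reward grows as D becomes more convinced that the pair looks expert-like, which is what pushes exploration toward the regions experts visit.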

➢ Combine A2C with a reward function learned from experts’ demonstrations via adversarial training.

➢ The discriminator D guides the actor to explore state-action regions that human experts would explore.

➢ Adversarial A2C learns faster and more stably, with better exploration.
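One way to picture the combination described above is a tabular critic whose TD error is computed on a reward that mixes the task reward with the discriminator reward. The poster does not say how the two signals are combined, so the additive form and the weight `lam` below are purely assumptions for illustration, as are all function names.

```python
import math
from typing import Dict


def combined_reward(env_reward: float, d_score: float, lam: float = 1.0) -> float:
    """Task reward plus a weighted discriminator reward -log(1 - D(s, a)).

    `lam` is a hypothetical mixing weight; the additive form is an assumption.
    """
    return env_reward + lam * -math.log(1.0 - d_score + 1e-8)


def td_error(reward: float, v_s: float, v_next: float,
             gamma: float = 0.99, terminal: bool = False) -> float:
    """One-step TD error: r + gamma * V(s') - V(s), with no bootstrap at the end."""
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_s


def critic_step(values: Dict[str, float], s: str, s_next: str, reward: float,
                lr: float = 0.1, gamma: float = 0.99, terminal: bool = False) -> float:
    """Move V(s) toward the TD target and return the TD error.

    The returned value is the advantage estimate that would scale the
    actor's log-probability gradient.
    """
    delta = td_error(reward, values.get(s, 0.0), values.get(s_next, 0.0), gamma, terminal)
    values[s] = values.get(s, 0.0) + lr * delta
    return delta
```

A real agent would use neural networks for the actor and critic; the table here only keeps the arithmetic visible.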


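[Figure: task-completion dialogue system. Components: User Goal, NLU (produces the semantic frame), Dialogue Manager with a Dialogue State Tracker (observations 𝑜1, 𝑜2, …, 𝑜𝑡 to state representation 𝑠𝑡) and Dialogue Policy Learning (state features 𝑠1 … 𝑠𝑛 to action probabilities 𝜋1 … 𝜋𝑘), System Action (Policy) chosen as a = argmax_a 𝜋(a|s), and NLG.]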

[Figure: Adversarial Advantage Actor-Critic. Discriminator training: (s, a) pairs are sampled from expert demonstrations and from simulation, and the discriminator classifies expert vs. simulation. Agent: an actor and a critic interact with the user simulator through state 𝑠, action 𝑎, and reward; the update signals are labeled TD error.]

Agent: Hi, how can I help you?

User: I would like to book a ticket for deadpool.

Agent: What theater would you like?

User: AMC pacific place.

Agent: Which time do you prefer?

User: Tomorrow evening.

Agent: Sure, here it is one ticket for deadpool 8 PM at AMC pacific place is available for you.
