Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning

Baolin Peng¹, Xiujun Li², Jianfeng Gao², Jingjing Liu², Yun-Nung Chen³ , Kam-Fai Wong¹

¹The Chinese University of Hong Kong, ²Microsoft AI & Research, ³National Taiwan University

➢ Motivation

• Exploiting reinforcement learning for dialogue policy learning

• Exploration in the large state-action space is challenging

• Reward is delayed and sparse with a long trajectory

➢ Approach

• Propose an Adversarial Advantage Actor-Critic algorithm

• Leverage expert-generated dialogues as priors

• Use a discriminator to distinguish the agent's responses from those of human experts

• Use the discriminator's output as an intrinsic reward that steers exploration toward state-action regions similar to those human experts visit

➢ Results

• Significant improvement of efficiency and performance on a movie-ticket booking domain

Summary

1. Task Definition

➢ Natural Language Understanding (NLU) turns natural language into intents and slot-values

➢ Natural Language Generation (NLG) turns system actions into natural language

➢ Dialogue Manager (DM)

• tracks dialogue states and updates state accordingly

• interacts with the database

• takes the state as input and outputs a system action (Dialogue Policy Learning)

Agent            Succ.   Turn    Reward
Rule             41.34   16.00   0.26
A2C              81.24   15.43   5.08
BBQN-MAP         81.56   18.75   5.00
Adversarial A2C  87.52   13.52   5.93

Model: Dialogue Observations (History Utterances, Belief State, KB Results) → System Action, e.g. inform(ticket=?)
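The mapping from dialogue observations to a system action can be pictured with a small data-structure sketch. Everything here is hypothetical and for illustration only: the class names `DialogueAct` and `Observation`, the slot names `theater` and `starttime`, and the hand-written `toy_policy` rule are assumptions, not the learned policy from the poster.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueAct:
    """A dialogue act: an intent plus slot-value pairs; None marks an unfilled slot."""
    intent: str
    slots: dict = field(default_factory=dict)

    def __str__(self):
        parts = [f"{k}={'?' if v is None else v}" for k, v in self.slots.items()]
        return f"{self.intent}({', '.join(parts)})"


@dataclass
class Observation:
    """What the policy sees each turn (fields mirror the three inputs listed above)."""
    history: list
    belief_state: dict
    kb_results: list


def toy_policy(obs):
    """Hypothetical rule: request the first missing slot, otherwise inform about the ticket."""
    for slot in ("theater", "starttime"):  # assumed slot names
        if slot not in obs.belief_state:
            return DialogueAct("request", {slot: None})
    return DialogueAct("inform", {"ticket": None})


# Toy observations, invented for illustration.
empty_obs = Observation(history=[], belief_state={}, kb_results=[])
full_obs = Observation(
    history=[],
    belief_state={"theater": "AMC pacific place", "starttime": "8 PM"},
    kb_results=[],
)
```

A learned policy would replace `toy_policy` with a distribution over such acts; the string form is chosen so that the final action prints as `inform(ticket=?)`.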

2. Methodology

3. Experiments & Results

➢ Dataset: human-human conversations in the movie-ticket booking scenario

• collected via AMT and annotated by human experts

• 280 labeled dialogues, averaging 11 turns each

• 11 dialogue acts, 29 slots

• Slots are informable (narrow down the search) or requestable (ask the agent for information)

• Use a publicly available user simulator

➢ Baselines

• Rule Agent: handcrafted rule-based policy that informs and requests a hand-picked subset of necessary slots.

• A2C: trained with a pre-defined reward function and a standard advantage actor-critic algorithm

• BBQN-MAP Agent (AAAI’18): the best-performing agent among a set of BBQN variants, designed for efficient policy exploration in dialogue systems

➢ Evaluation

• Success rate

• Learning curves averaged over 10 runs

• 2000 dialogues for testing
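The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, assuming each test dialogue yields a boolean success flag and each run reports a learning curve of equal length; the function names and the toy numbers are invented here and are not results from the poster.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of test dialogues that ended in success (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one dialogue outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def average_learning_curve(curves):
    """Average per-run curves point by point; every run must have the same length."""
    lengths = {len(c) for c in curves}
    if len(lengths) != 1:
        raise ValueError("all runs must report the same number of points")
    return [mean(points) for points in zip(*curves)]


# Toy numbers, invented for illustration only.
toy_outcomes = [True, False, True, True]
toy_curves = [[0.0, 0.5, 1.0], [0.5, 0.5, 0.5]]
toy_rate = success_rate(toy_outcomes)
toy_avg = average_learning_curve(toy_curves)
```

In the setting described above, `outcomes` would hold 2000 entries per evaluation and `curves` would hold 10 runs.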

4. Conclusion

• We propose an adversarial advantage actor-critic model with efficient exploration.

• The discriminator serves as an additional critic that guides policy exploration toward human-like behavior.

It is also connected to inverse reinforcement learning, which learns a reward function.

• Experiments in a movie-ticket booking domain show that it outperforms the Rule, A2C, and BBQN-MAP baselines on success rate, average turns, and reward.

➢ Advantage Actor-Critic for Dialogue Policy Learning

• Find a policy 𝜋 that maximizes the expected reward

• 𝜋 is a parameterized probabilistic mapping from states to actions, 𝜋_𝜃(𝑎|𝑠)

o Update 𝜃 with the policy gradient

o Use a baseline function to reduce variance

o Use the TD error as an unbiased estimate of the advantage
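In standard notation, the quantities named in these bullets can be written as follows. This is the textbook advantage actor-critic form, given as a sketch: the discount factor γ, the value function V, and the symbols J, A, Q, and δ_t are assumed here and may differ from the poster's own notation. The last identity holds when V is the true value function of 𝜋_𝜃.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big], \\
A(s_t, a_t) &= Q(s_t, a_t) - V(s_t), \\
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\mathbb{E}\big[\delta_t \mid s_t, a_t\big] = A(s_t, a_t).
\end{align}
```

Subtracting the baseline V(s_t) leaves the gradient unbiased while reducing its variance, which is why the TD error can stand in for the advantage in the update.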

➢ Adversarial Training

• Actor 𝜋_𝜃 as a generator G

• A discriminator D classifies each state-action pair (s, a) as coming from the experts or from G

• D can be viewed as a reward function extracted from experts’ trajectories

• D is trained to maximize the probability of classifying each pair correctly

• Actor 𝜋_𝜃 (G) can be improved by using −log(1 − D(s, a)) as its reward function
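The adversarial training loop described above can be sketched with a tiny logistic-regression discriminator. This is a minimal illustration under stated assumptions, not the poster's implementation: the linear model, the hand-made feature vectors for (s, a) pairs, the learning rate, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(weights, bias, features):
    """D(s, a): estimated probability that the pair came from an expert."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def discriminator_step(weights, bias, expert_batch, agent_batch, lr=0.1):
    """One gradient-ascent step on mean log D(expert) + mean log(1 - D(agent))."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for batch, label in ((expert_batch, 1.0), (agent_batch, 0.0)):
        for features in batch:
            # For a logistic model, d/dz [y log D + (1 - y) log(1 - D)] = y - D.
            err = (label - discriminator_prob(weights, bias, features)) / len(batch)
            grad_w = [g + err * f for g, f in zip(grad_w, features)]
            grad_b += err
    new_w = [w + lr * g for w, g in zip(weights, grad_w)]
    return new_w, bias + lr * grad_b


def intrinsic_reward(d_value, eps=1e-8):
    """Actor reward -log(1 - D(s, a)); eps guards against log(0)."""
    return -math.log(max(1.0 - d_value, eps))


# Toy, hand-made feature vectors for (s, a) pairs; purely illustrative.
expert_pairs = [[1.0, 0.0]]
agent_pairs = [[0.0, 1.0]]
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    w, b = discriminator_step(w, b, expert_pairs, agent_pairs)
d_expert = discriminator_prob(w, b, expert_pairs[0])
d_agent = discriminator_prob(w, b, agent_pairs[0])
```

Because the reward grows as D(s, a) approaches 1, pairs that the discriminator judges expert-like earn the actor a larger intrinsic reward than pairs it judges agent-like.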

➢ Combine A2C with a reward function learned from experts’ demonstrations with adversarial training.

➢ The discriminator D guides the actor to explore the state-action regions that human experts visit.

➢ Adversarial A2C learns faster and more stably, with better exploration.
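A single advantage actor-critic update driven by the TD error can be sketched in tabular form. The softmax policy over per-state preferences, the learning rates, and the function names are assumptions made for illustration. How the task reward and the discriminator reward are combined is not specified here, so the additive mix with weight `mix` in the usage lines is only one plausible choice.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def td_error(reward, value_s, value_next, gamma=0.9, done=False):
    """delta = r + gamma * V(s') - V(s); the bootstrap term is dropped at episode end."""
    target = reward if done else reward + gamma * value_next
    return target - value_s


def a2c_step(prefs, values, s, a, reward, s_next, done,
             gamma=0.9, actor_lr=0.1, critic_lr=0.1):
    """One in-place tabular update: the critic moves V(s) toward the TD target,
    the actor moves along delta * grad log pi(a|s)."""
    delta = td_error(reward, values[s], values[s_next], gamma, done)
    values[s] += critic_lr * delta
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):
        # Gradient of log softmax w.r.t. preference b: 1[b == a] - pi(b|s).
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += actor_lr * delta * grad_log
    return delta


# Toy usage: two states, two actions; all numbers are invented.
prefs = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]
task_reward, disc_reward, mix = 0.5, 1.0, 0.5  # hypothetical values and weight
delta = a2c_step(prefs, values, s=0, a=1,
                 reward=task_reward + mix * disc_reward, s_next=1, done=True)
```

After this step the preference for the rewarded action rises and the critic's estimate for state 0 moves toward the observed return.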

[Figure: task-completion dialogue system. Components: User Goal, NLU, NLG, Dialogue Manager (Dialogue State Tracker, Dialogue Policy Learning), Semantic Frame, State Representation 𝑠_𝑡, System Action (Policy) 𝑎* = argmax_𝑎 𝜋(𝑎|𝑠).]

[Figure: Adversarial Advantage Actor-Critic. Components: Actor, Critic (TD error), User Simulator (State 𝑠, Reward, Action 𝑎), and Discriminator Training on (s, a) pairs sampled from Expert Demonstration vs. Simulation.]

Agent: Hi, how can I help you?

User: I would like to book a ticket for deadpool.

Agent: What theater would you like?

User: AMC pacific place.

Agent: Which time do you prefer?

User: Tomorrow evening.

Agent: Sure, here it is one ticket for deadpool 8 PM at AMC pacific place is available for you.