Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning
Baolin Peng1, Xiujun Li2, Jianfeng Gao2, Jingjing Liu2, Yun-Nung Chen3 , Kam-Fai Wong1
1The Chinese University of Hong Kong, 2Microsoft AI & Research, 3National Taiwan University
➢ Motivation
• Exploiting reinforcement learning for dialogue policy learning
• Exploration in the large state-action space is challenging
• Reward is delayed and sparse with a long trajectory
➢ Approach
• Propose an Adversarial Advantage Actor-Critic algorithm
• Leverage expert-generated dialogues as priors
• Use a discriminator to distinguish the agent's responses from those of human experts
• Use the discriminator's output as an intrinsic reward that steers exploration toward the state-action regions human experts visit
➢ Results
• Significant improvement in efficiency and performance on a movie-ticket booking domain
Summary
1. Task Definition
➢ Natural Language Understanding (NLU) turns natural language into intents and slot-values
➢ Natural Language Generation (NLG) turns system actions into natural language
➢ Dialogue Manager (DM)
• tracks dialogue states and updates state accordingly
• interacts with the database
• takes the state as input and outputs a system action (dialogue policy learning)
Agent            Success  Turns  Reward
Rule               41.34  16.00    0.26
A2C                81.24  15.43    5.08
BBQN-MAP           81.56  18.75    5.00
Adversarial A2C    87.52  13.52    5.93
[Figure: the model maps dialogue observations (history utterances, belief state, KB results) to a system action, e.g. inform(ticket=?)]
2. Methodology
3. Experiments & Results
➢ Dataset: human-human conversations in the movie-ticket booking scenario
• collected via AMT and annotated by human experts
• 280 labeled dialogues, averaging 11 turns each
• 11 dialogue acts, 29 slots
• Slots are informable (used to narrow down the search) or requestable (used to ask the agent for information)
• Use a publicly available user simulator
➢ Baselines
• Rule Agent: handcrafted rule-based policy that informs and requests a hand-picked subset of necessary slots.
• A2C: trained with a pre-defined reward function and a standard advantage actor-critic algorithm
• BBQN-MAP Agent (AAAI’18): the best agent among a set of BBQN variants, which explore efficiently during dialogue policy learning
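To make the Rule Agent baseline concrete, here is a minimal sketch of a handcrafted policy that requests a fixed list of slots and then informs. The slot names, their order, and the action format are hypothetical; the poster does not list the hand-picked subset.

```python
# Hypothetical slot names and order; the poster does not specify them.
REQUEST_ORDER = ["moviename", "theater", "starttime", "numberofpeople"]


def rule_policy(filled_slots):
    """Return the next system action given the slots the user has provided so far."""
    for slot in REQUEST_ORDER:
        if slot not in filled_slots:
            # Ask for the first required slot that is still missing.
            return {"act": "request", "slot": slot}
    # Every required slot is known, so offer the ticket.
    return {"act": "inform", "slot": "ticket"}
```

A policy like this never explores, which is one reason it is a useful lower bound for the learned agents.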
➢ Evaluation
• Success rate
• Learning curves averaged over 10 runs
• 2000 dialogues for testing
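The evaluation metrics above could be computed along the following lines. This is a sketch under assumptions: the episode record layout (`success`, `turns`, `reward`) and both function names are illustrative and not taken from the authors' code.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate test dialogues.

    episodes: list of dicts with keys 'success' (bool), 'turns' (int),
    'reward' (float). This layout is an assumption of the sketch.
    """
    return {
        "success_rate": mean(1.0 if e["success"] else 0.0 for e in episodes),
        "avg_turns": mean(e["turns"] for e in episodes),
        "avg_reward": mean(e["reward"] for e in episodes),
    }


def average_learning_curves(curves):
    """Average per-epoch success rates across runs (one equal-length list per run)."""
    return [mean(points) for points in zip(*curves)]
```

In the setting described here, `summarize` would be applied to the 2000 test dialogues and `average_learning_curves` to the 10 training runs.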
4. Conclusion
• We propose an adversarial advantage actor-critic model with efficient exploration.
• The discriminator serves as an additional critic that guides policy exploration toward human-like behavior.
It is also connected to inverse reinforcement learning, which learns a reward function.
• Our experiments in a movie-ticket booking domain show that it outperforms the Rule, A2C, and BBQN-MAP agents in success, turns, and reward.
➢ Advantage Actor-Critic for Dialogue Policy Learning
• Find a policy 𝜋 that maximizes the expected reward
• 𝜋 is a parameterized probabilistic mapping function:
o Update 𝜃 with the following gradients
o Baseline function for reducing variance
o TD error as an unbiased estimate of the advantage
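The equations for these bullets did not survive extraction. As a reference point only, the textbook advantage actor-critic formulation reads as follows. The critic $V_w$ with parameters $w$, the discount factor $\gamma$, and the TD error symbol $\delta_t$ are assumptions taken from the standard formulation, not recovered from the poster.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big],
  \qquad a_t \sim \pi_\theta(a \mid s_t) \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
  \big(Q^{\pi_\theta}(s_t, a_t) - V_w(s_t)\big)\big] \\
\delta_t &= r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \\
\nabla_\theta J(\theta) &\approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\delta_t\big]
\end{align}
```

Subtracting the baseline $V_w(s_t)$ leaves the gradient unbiased while lowering its variance, and $\delta_t$ is an unbiased estimate of the advantage when $V_w$ equals the true value function.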
➢ Adversarial Training
• Actor 𝜋𝜃 as a generator G
• A discriminator D classifies whether a state-action pair (s, a) comes from the experts or from G
• D can be viewed as a reward function extracted from experts’ trajectories
• D is trained to maximize the probability of classifying each pair correctly
• Actor 𝜋𝜃 (G) can be improved with −log(1 − D(s, a)) as the reward function
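The two training signals in this list can be sketched in a few lines. The sketch assumes that D(s, a) is the probability that a pair came from an expert (the poster does not state the labeling convention, but the reward −log(1 − D(s, a)) only favors expert-like behavior under it). The function names and the `EPS` guard are illustrative.

```python
import math

EPS = 1e-8  # guard against log(0); an assumption of this sketch


def discriminator_loss(d_expert, d_agent):
    """Binary cross-entropy for D, with expert pairs labeled 1 and agent pairs labeled 0.

    d_expert and d_agent are lists of D(s, a) outputs in (0, 1).
    """
    expert_term = sum(math.log(d + EPS) for d in d_expert) / len(d_expert)
    agent_term = sum(math.log(1.0 - d + EPS) for d in d_agent) / len(d_agent)
    return -(expert_term + agent_term)


def intrinsic_reward(d_value):
    """Reward for the actor as given on the poster: -log(1 - D(s, a))."""
    return -math.log(1.0 - d_value + EPS)
```

The reward grows as D becomes more convinced a pair is expert-like, so the actor is pushed toward state-action pairs the discriminator cannot tell apart from the demonstrations.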
➢ Combine A2C with a reward function learned from experts’ demonstrations via adversarial training.
➢ The discriminator D guides the actor to explore the state-action regions that human experts would explore.
➢ Adversarial A2C learns faster and more stably, with better exploration.
[Figure: task-completion dialogue system. Components: User Goal, NLU (produces a semantic frame), Dialogue Manager with Dialogue State Tracker (observations 𝑜1, 𝑜2, …, 𝑜𝑡 to state representation 𝑠𝑡) and Dialogue Policy Learning (system action, 𝑎∗ = max_𝑎 𝜋(𝑎|𝑠)), NLG.]
[Figure: Adversarial Advantage Actor-Critic. Components: Actor, Critic, User Simulator, Discriminator. Discriminator training samples (s, a) pairs from expert demonstrations and from simulation (expert vs. simulation). Signals: state 𝑠, action 𝑎, reward, TD error.]
Example dialogue:
Agent: Hi, how can I help you?
User: I would like to book a ticket for deadpool.
Agent: What theater would you like?
User: AMC pacific place.
Agent: Which time do you prefer?
User: Tomorrow evening.
Agent: Sure, one ticket for deadpool at 8 PM at AMC pacific place is available for you.