Task-Oriented Dialogue System (Young, 2000)
Speech Signal
Speech Recognition
Hypothesis: are there any action movies to see this weekend
Text Input: Are there any action movies to see this weekend?
Language Understanding (LU)
• Domain Identification
• User Intent Detection
• Slot Filling
Semantic Frame: request_movie (genre=action, date=this weekend)
Dialogue Management (DM)
• Dialogue State Tracking (DST)
http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Speech Recognition / Multimodality
Speech recognition
Word error rate
Word accuracy
Emotion recognition
Accuracy
Example
Hyp: A AB D C K
Ref: A C D A C
Word error rate = (#substitutions + #deletions + #insertions) / #words in the reference
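Word error rate is usually computed as the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name `word_error_rate` is illustrative, and word accuracy is taken here as 1 - WER.

```python
def word_error_rate(hyp, ref):
    """Word-level Levenshtein distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the hypothesis prefix seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,              # extra hypothesis word (insertion)
                curr[j - 1] + 1,          # missing reference word (deletion)
                prev[j - 1] + (h != r),   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# The hypothesis/reference pair from the slide.
wer = word_error_rate("A AB D C K".split(), "A C D A C".split())
print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.2f}")
```

For the slide's example the minimum edit distance is 3 over 5 reference words, so the sketch reports a WER of 0.60.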
Language Understanding Evaluation
Data
Training and test data should be kept separate
Test data should be real data collected from humans so that the evaluation results are convincing
Metrics
Sub-sentence-level: intent accuracy, slot F1
Sentence-level: whole frame accuracy
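The three metrics above can be sketched as follows. This is a simplified illustration, not a standard scorer: it assumes each frame is a dict with an `intent` string and a `slots` mapping, and it scores slots as exact (slot, value) pairs, whereas slot F1 is often computed over tagged chunks. The second test utterance is hypothetical.

```python
def evaluate_lu(predictions, references):
    """Each frame is assumed to be {"intent": str, "slots": {slot_name: value}}."""
    intent_hits = frame_hits = 0
    true_pos = n_pred = n_ref = 0
    for pred, ref in zip(predictions, references):
        intent_ok = pred["intent"] == ref["intent"]
        pred_slots = set(pred["slots"].items())
        ref_slots = set(ref["slots"].items())
        intent_hits += intent_ok
        # A frame counts as correct only if the intent and every slot match.
        frame_hits += intent_ok and pred_slots == ref_slots
        true_pos += len(pred_slots & ref_slots)
        n_pred += len(pred_slots)
        n_ref += len(ref_slots)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(references)
    return {
        "intent_accuracy": intent_hits / n,
        "slot_f1": f1,
        "frame_accuracy": frame_hits / n,
    }


references = [
    {"intent": "request_movie", "slots": {"genre": "action", "date": "this weekend"}},
    {"intent": "request_movie", "slots": {"genre": "comedy"}},  # hypothetical example
]
predictions = [
    {"intent": "request_movie", "slots": {"genre": "action", "date": "this weekend"}},
    {"intent": "request_movie", "slots": {"genre": "action"}},  # wrong slot value
]
lu_scores = evaluate_lu(predictions, references)
print(lu_scores)
```

A single wrong slot value leaves intent accuracy untouched but makes the whole frame count as an error, which is why frame accuracy is the strictest of the three.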
Dialogue State Tracking Evaluation
Metric
Tracked state accuracy with respect to user goal
Recall/precision/F-measure for individual slots
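Both state tracking metrics can be sketched over per-turn states. The code below assumes that the tracked state and the user goal at each turn are dicts mapping slot names to non-empty values. The function names and the two-turn example are hypothetical, and a slot with no predictions is given a precision of 0 by convention.

```python
from collections import defaultdict


def tracked_state_accuracy(tracked, gold):
    """Fraction of turns whose tracked state equals the user goal exactly."""
    return sum(t == g for t, g in zip(tracked, gold)) / len(gold)


def per_slot_prf(tracked, gold):
    """Precision, recall and F-measure for each slot, pooled over turns."""
    counts = defaultdict(lambda: {"tp": 0, "pred": 0, "gold": 0})
    for t, g in zip(tracked, gold):
        for slot, value in t.items():
            counts[slot]["pred"] += 1
            counts[slot]["tp"] += g.get(slot) == value
        for slot in g:
            counts[slot]["gold"] += 1
    scores = {}
    for slot, c in counts.items():
        p = c["tp"] / c["pred"] if c["pred"] else 0.0
        r = c["tp"] / c["gold"] if c["gold"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[slot] = (p, r, f)
    return scores


# Hypothetical two-turn dialogue: the tracker misses the date in turn 2.
gold_states = [{"genre": "action"}, {"genre": "action", "date": "this weekend"}]
tracked_states = [{"genre": "action"}, {"genre": "action"}]
state_acc = tracked_state_accuracy(tracked_states, gold_states)
slot_scores = per_slot_prf(tracked_states, gold_states)
print(state_acc, slot_scores)
```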
Dialogue Policy Evaluation
Metrics
Turn-level evaluation: system action accuracy
Dialogue-level evaluation: task success rate, reward, number of dialogue turns
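A minimal sketch of how these turn-level and dialogue-level numbers might be aggregated from logged dialogues. The `DialogueLog` record, its field names, the action labels, and the reward values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DialogueLog:
    predicted_actions: list   # system action chosen at each turn
    reference_actions: list   # annotated reference action for each turn
    success: bool             # whether the user's task was completed
    total_reward: float       # cumulative reward over the dialogue


def evaluate_policy(logs):
    turn_pairs = [
        (p, r)
        for log in logs
        for p, r in zip(log.predicted_actions, log.reference_actions)
    ]
    n = len(logs)
    return {
        "action_accuracy": sum(p == r for p, r in turn_pairs) / len(turn_pairs),
        "success_rate": sum(log.success for log in logs) / n,
        "avg_reward": sum(log.total_reward for log in logs) / n,
        "avg_turns": sum(len(log.predicted_actions) for log in logs) / n,
    }


# Two hypothetical logged dialogues.
logs = [
    DialogueLog(["request_date", "inform_movie"],
                ["request_date", "inform_movie"], True, 20.0),
    DialogueLog(["request_date", "request_genre", "inform_movie"],
                ["request_genre", "request_genre", "inform_movie"], False, -10.0),
]
policy_scores = evaluate_policy(logs)
print(policy_scores)
```

Turn-level accuracy needs annotated reference actions, while the dialogue-level numbers only need the outcome of each dialogue, so the two views can be collected independently.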
Reinforcement Learning Policy
Frame-level semantics
If your RL agent cannot outperform the rule-based agent, consider increasing the complexity of the system functionality and of the simulated user.
X. Li, Y.-N. Chen, L. Li, and J. Gao, “End-to-End Task-Completion Neural Dialogue Systems,” arXiv preprint arXiv:1703.01008, 2017.
Natural language
Note: check whether the interactions can be satisfied by the system’s functionality
Natural Language Generation Evaluation
Metrics
Subjective: human judgement (Stent et al., 2005)
Adequacy: correct meaning
Fluency: linguistic fluency
Readability: fluency in the dialogue context
Variation: multiple realizations for the same concept
Objective: automatic metrics
Word overlap: BLEU (Papineni et al., 2002), METEOR, ROUGE
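To make the word-overlap idea concrete, here is a simplified sentence-level BLEU sketch: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It applies no smoothing, so the score is 0 whenever any n-gram order has no match, and the function names are illustrative. Established toolkit implementations should be preferred for reported results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU for one candidate and several references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Largest count of each n-gram in any single reference (for clipping).
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
degenerate = "the the the the the the the".split()
# Clipping caps the credit for repeating "the" at its count in the reference.
print(bleu(degenerate, [reference], max_n=1))
print(bleu(reference, [reference]))
```

Clipping is what keeps a degenerate output that repeats one frequent word from scoring well on unigram precision.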
User Study
System performance from real users
1) Allow others to interact with the system
2) Record the dialogues and compute the success rate and the degree of user satisfaction
3) Analyze where the errors come from
Concluding Remarks
Evaluate all components of the system in detail
Speech recognition: word accuracy
Language understanding: frame accuracy
Dialogue state tracking: frame accuracy
Dialogue policy: success rate
Natural language generation: BLEU
User study
Subjective: satisfaction
Objective: success rate