Task-Oriented Dialogue System


(Young, 2000)

Pipeline (figure):

Speech Signal → Speech Recognition → Hypothesis: are there any action movies to see this weekend

Text Input: Are there any action movies to see this weekend?

Language Understanding (LU)

• Domain Identification

• User Intent Detection

• Slot Filling

Semantic Frame: request_movie (genre=action, date=this weekend)

Dialogue Management (DM)

• Dialogue State Tracking (DST)

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short


Speech Recognition / Multimodality



Speech recognition

• Word error rate

• Word accuracy

Emotion recognition

• Accuracy


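Hyp: A AB D C K

Ref: A C D A C

Word error rate = (#substitutions + #deletions + #insertions) / #words in the reference

Word accuracy = 1 - word error rate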
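The word error rate can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a standard scoring tool: it assumes whitespace tokenization and unit cost for every substitution, deletion and insertion, and the function names `edit_distance` and `word_error_rate` are chosen here for illustration only.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hyp/Ref pair from the slide, split on whitespace (an assumption about the tokens)
wer = word_error_rate("A C D A C", "A AB D C K")
word_accuracy = 1.0 - wer
print(f"WER = {wer:.2f}, word accuracy = {word_accuracy:.2f}")
```

With the tokens split as above, the best alignment needs 3 edits against 5 reference words, so the sketch reports a WER of 0.60.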


Language Understanding Evaluation



Data



• Training and test data should be kept separate

• Test data should be real data collected from humans, so that the evaluation results are convincing



Metrics

• Sub-sentence-level: intent accuracy, slot F1

• Sentence-level: whole frame accuracy
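The three language understanding metrics above can be sketched as follows. This is a simplified illustration under stated assumptions: each frame is a dict with an `intent` string and a `slots` dict, slot F1 is micro-averaged over (slot, value) pairs rather than over tagged token spans, and the `inform` intent and `city` slot in the toy data are hypothetical labels that do not come from the slides.

```python
def intent_accuracy(predicted, gold):
    """Fraction of utterances whose predicted intent matches the gold intent."""
    return sum(p["intent"] == g["intent"] for p, g in zip(predicted, gold)) / len(gold)


def slot_f1(predicted, gold):
    """Micro-averaged F1 over (slot, value) pairs across all utterances."""
    tp = n_pred = n_gold = 0
    for p, g in zip(predicted, gold):
        pred_pairs = set(p["slots"].items())
        gold_pairs = set(g["slots"].items())
        tp += len(pred_pairs & gold_pairs)
        n_pred += len(pred_pairs)
        n_gold += len(gold_pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def frame_accuracy(predicted, gold):
    """Fraction of utterances whose intent and all slots are exactly right."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data; only the first frame mirrors the example in the slides.
gold_frames = [
    {"intent": "request_movie", "slots": {"genre": "action", "date": "this weekend"}},
    {"intent": "request_movie", "slots": {"genre": "comedy", "date": "tomorrow"}},
    {"intent": "inform", "slots": {"city": "seattle"}},  # hypothetical labels
]
pred_frames = [
    {"intent": "request_movie", "slots": {"genre": "action", "date": "this weekend"}},
    {"intent": "request_movie", "slots": {"genre": "comedy"}},  # missed the date slot
    {"intent": "request_movie", "slots": {"city": "seattle"}},  # wrong intent
]

print(intent_accuracy(pred_frames, gold_frames))
print(slot_f1(pred_frames, gold_frames))
print(frame_accuracy(pred_frames, gold_frames))
```

Whole frame accuracy is the strictest of the three: a single missing slot or a wrong intent makes the entire frame count as an error.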


Dialogue State Tracking Evaluation



Metric

• Tracked state accuracy with respect to the user goal

• Recall/precision/F-measure for individual slots
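Both tracking metrics can be sketched over a list of per-turn states. The sketch below assumes each state is a dict mapping slot names to values, with one predicted and one gold state per turn; the function names and the toy data are illustrative assumptions, not part of any particular tracker or benchmark.

```python
def tracked_state_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold user goal."""
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states)


def slot_precision_recall_f1(predicted_states, gold_states, slot):
    """Precision, recall and F-measure for one slot across all turns."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted_states, gold_states):
        p, g = pred.get(slot), gold.get(slot)
        if p is not None and p == g:
            tp += 1
        else:
            if p is not None:
                fp += 1  # a value was predicted, but it is wrong or spurious
            if g is not None:
                fn += 1  # the gold value was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy three-turn dialogue (illustrative values)
gold_states = [
    {"genre": "action"},
    {"genre": "action", "date": "this weekend"},
    {"genre": "action", "date": "this weekend"},
]
predicted_states = [
    {"genre": "action"},
    {"genre": "action", "date": "tomorrow"},  # wrong date at turn 2
    {"genre": "action", "date": "this weekend"},
]

print(tracked_state_accuracy(predicted_states, gold_states))
print(slot_precision_recall_f1(predicted_states, gold_states, "date"))
```

The per-slot scores show which slot the tracker gets wrong, which the single state-level number cannot.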


Dialogue Policy Evaluation



Metrics

• Turn-level evaluation: system action accuracy

• Dialogue-level evaluation: task success rate, reward, number of dialogue turns
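A minimal sketch of both evaluation levels, assuming system actions are compared as plain strings and each logged dialogue is a dict with hypothetical `success`, `reward` and `turns` fields. The action strings and numbers are made up for illustration.

```python
def system_action_accuracy(predicted_actions, reference_actions):
    """Turn level: fraction of turns where the chosen action matches the reference."""
    if len(predicted_actions) != len(reference_actions):
        raise ValueError("need one predicted action per reference action")
    matches = sum(p == r for p, r in zip(predicted_actions, reference_actions))
    return matches / len(reference_actions)


def dialogue_level_metrics(dialogues):
    """Dialogue level: success rate, average reward and average number of turns."""
    n = len(dialogues)
    return {
        "success_rate": sum(d["success"] for d in dialogues) / n,
        "avg_reward": sum(d["reward"] for d in dialogues) / n,
        "avg_turns": sum(d["turns"] for d in dialogues) / n,
    }


# Illustrative data only
predicted = ["request(date)", "inform(movie)", "confirm(ticket)"]
reference = ["request(date)", "inform(movie)", "inform(theater)"]
turn_accuracy = system_action_accuracy(predicted, reference)

dialogues = [
    {"success": True, "reward": 40, "turns": 10},
    {"success": False, "reward": -20, "turns": 20},
    {"success": True, "reward": 50, "turns": 6},
    {"success": True, "reward": 10, "turns": 12},
]
summary = dialogue_level_metrics(dialogues)
print(turn_accuracy, summary)
```

Turn-level accuracy needs a reference action for every turn, whereas the dialogue-level numbers only need the outcome of each complete dialogue.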


Reinforcement Learning Policy



Frame-level semantics


If your RL agent cannot outperform the rule-based agent, consider increasing the complexity of the system functionality and of the simulated user.

X. Li, Y.-N. Chen, L. Li, and J. Gao, “End-to-End Task-Completion Neural Dialogue Systems,” arXiv preprint arXiv:1703.01008, 2017.



Natural language

Note: check whether the system’s functionality can actually satisfy the interactions


Natural Language Generation Evaluation



Metrics

• Subjective: human judgement (Stent et al., 2005)

Adequacy: correct meaning

Fluency: linguistic fluency

Readability: fluency in the dialogue context

Variation: multiple realizations for the same concept

• Objective: automatic metrics

Word overlap: BLEU (Papineni et al., 2002), METEOR, ROUGE
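To make the word-overlap idea concrete, here is a minimal sentence-level BLEU sketch: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It assumes whitespace tokenization and applies no smoothing, so any n-gram order without a match gives a score of 0. The function names and example sentences are illustrative; reported results normally use corpus-level BLEU from an established toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, references, max_n=4):
    """Unsmoothed sentence-level BLEU against one or more references."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: a zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the hypothesis length.
    closest_ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > closest_ref_len else math.exp(1 - closest_ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "there are two action movies this weekend"
score = sentence_bleu("there are action movies this weekend", [reference], max_n=2)
print(round(score, 3))
```

A high overlap score shows only that the output shares wording with the references, which is why the subjective criteria above are still collected from human judges.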


User Study



System performance from real users

1) Allow others to interact with the system

2) Record the dialogues and compute the success rate and the users' degree of satisfaction

3) Analyze where the errors come from
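Steps 2 and 3 can be sketched as a small aggregation over the recorded sessions. The record layout is an assumption made for illustration: each session carries a hypothetical `success` flag, a `satisfaction` rating (a 1 to 5 scale here) and, for failed sessions, an `error_source` label assigned by hand during analysis.

```python
from collections import Counter
from statistics import mean


def summarize_user_study(sessions):
    """Success rate, mean satisfaction and a tally of error sources for failures."""
    failures = [s for s in sessions if not s["success"]]
    return {
        "success_rate": mean(1.0 if s["success"] else 0.0 for s in sessions),
        "mean_satisfaction": mean(s["satisfaction"] for s in sessions),
        "error_sources": Counter(s["error_source"] for s in failures),
    }


# Illustrative records only; labels such as "LU" and "ASR" are hand-assigned.
sessions = [
    {"success": True, "satisfaction": 5, "error_source": None},
    {"success": False, "satisfaction": 2, "error_source": "LU"},
    {"success": True, "satisfaction": 4, "error_source": None},
    {"success": False, "satisfaction": 1, "error_source": "ASR"},
    {"success": False, "satisfaction": 3, "error_source": "LU"},
]
report = summarize_user_study(sessions)
print(report)
```

The error tally points to the component whose evaluation deserves a closer look first.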


Concluding Remarks

• Evaluate all components of the system in detail

  • Speech recognition: word accuracy

  • Language understanding: frame accuracy

  • Dialogue state tracking: frame accuracy

  • Dialogue policy: success rate

  • Natural language generation: BLEU

• User study

  • Subjective: satisfaction

  • Objective: success rate