End-to-End Memory Networks for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen, Dilek Hakkani-Tür, Gokhan Tur, Jianfeng Gao, Li Deng
Outline
• Introduction
  – Spoken Dialogue System
  – Spoken/Natural Language Understanding (SLU/NLU)
• Contextual Spoken Language Understanding
  – Model Architecture
  – End-to-End Training
• Experiments
• Conclusion & Future Work
Dialogue System Pipeline

Speech Signal → ASR → Hypothesis: "are there any action movies to see this weekend"
(or direct Text Input: "Are there any action movies to see this weekend?")

Language Understanding (LU)
• User Intent Detection
• Slot Filling
→ Semantic Frame (Intents, Slots): request_movie, genre=action, date=this weekend

Dialogue Management (DM)
• Dialogue State Tracking
• Policy Decision
→ System Action: request_location

Output Generation
→ Text response: "Where are you located?" / Screen display: location?
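The pipeline stages can be sketched end to end with toy rule-based stand-ins; all function names and keyword rules below are illustrative, not the talk's system:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    """Output of LU: a user intent plus filled slots."""
    intent: str
    slots: dict = field(default_factory=dict)

def language_understanding(hypothesis: str) -> SemanticFrame:
    """Toy LU: intent detection + slot filling via keyword rules."""
    slots = {}
    if "action" in hypothesis:
        slots["genre"] = "action"
    if "this weekend" in hypothesis:
        slots["date"] = "this weekend"
    return SemanticFrame(intent="request_movie", slots=slots)

def dialogue_management(frame: SemanticFrame) -> str:
    """Toy DM: request any required slot that is still missing."""
    if "location" not in frame.slots:
        return "request_location"
    return "inform_movies"

def output_generation(action: str) -> str:
    """Toy template-based generation."""
    templates = {"request_location": "Where are you located?"}
    return templates.get(action, "...")

hyp = "are there any action movies to see this weekend"  # ASR hypothesis
frame = language_understanding(hyp)
print(output_generation(dialogue_management(frame)))  # Where are you located?
```

The sketch also makes the error-propagation point concrete: if LU mis-fills a slot, every downstream stage acts on the wrong frame.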
LU Importance

[Figure: learning curves of system success rate over simulation epochs (0–495) for an upper bound, DQN agents, and rule-based agents, each at 0% and 5% simulated LU error rates (Upper Bound, DQN - 0.00, DQN - 0.05, Rule - 0.00, Rule - 0.05)]

Both the RL agent and the rule agent show a >5% performance drop when 5% LU errors are introduced.

The system performance is sensitive to LU errors, for both rule-based and reinforcement learning agents.
Dialogue System Pipeline (cont.)

SLU usually focuses on understanding single-turn utterances. The understanding result is usually influenced by 1) local observations and 2) global knowledge.

In the pipeline above, LU is the current bottleneck: its errors propagate to dialogue management and output generation.
Spoken Language Understanding

Three tasks: Domain Identification, Intent Prediction, Slot Filling.

Single-turn example:
U: "just sent email to bob about fishing this weekend"
   Domain: communication   Intent: send_email
   Slots:  O O O O B-contact_name O B-subject I-subject I-subject
S: send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
U1: "send email to bob"
    Slots: O O O B-contact_name
S1: send_email(contact_name="bob")
U2: "are we going to fish this weekend"
    Slots: B-message I-message I-message I-message I-message I-message I-message
S2: send_email(message="are we going to fish this weekend")
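The slot-filling output above is an IOB tag sequence that gets collapsed into a semantic frame. A minimal sketch of that decoding step (the helper name and frame syntax are illustrative, not the talk's code):

```python
def iob_to_frame(words, tags, intent):
    """Collapse an IOB slot-tag sequence into a semantic frame string.

    Contiguous B-x / I-x tags are merged into one slot value; O tags
    close any open slot.
    """
    slots, current = [], None  # current = (slot_name, [words])
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                slots.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:
            if current:
                slots.append(current)
            current = None
    if current:
        slots.append(current)
    parts = [f'{name}="{" ".join(ws)}"' for name, ws in slots]
    return f"{intent}({', '.join(parts)})"

words = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
print(iob_to_frame(words, tags, "send_email"))
# send_email(contact_name="bob", subject="fishing this weekend")
```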
MODEL ARCHITECTURE

Idea: additionally incorporate contextual knowledge during slot tagging.

[Figure: the current utterance c is encoded by a sentence encoder (RNN_in) into a vector u; each history utterance x_i is encoded by a contextual sentence encoder (RNN_mem) into a memory representation m_i. Inner products between u and the m_i give the knowledge attention distribution p_i; the attention-weighted sum of the memories is the knowledge encoding o, which is combined with u through a weight matrix W_kg into h and fed (via weights M) into an RNN tagger (weights U, V, W over inputs w_t, hidden states h_t, and outputs y_t) that produces the slot tagging sequence y.]

1. Sentence Encoding   2. Knowledge Attention   3. Knowledge Encoding

Chen, et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
MODEL ARCHITECTURE (cont.)

The sentence encoders for the current and history utterances (RNN_in, RNN_mem) can also be CNNs instead of RNNs.
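The three steps above can be sketched in numpy, with the sentence encoders abstracted away as precomputed vectors; the shapes, random stand-ins, and the exact combination h = W_kg(o + u) are one plausible reading of the diagram, not the paper's code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 150  # embedding dimension, matching the talk's setup

# 1. Sentence encoding: u for the current utterance c, memory
#    vectors m_i for the history utterances {x_i} (random stand-ins
#    for RNN/CNN encoder outputs).
u = rng.normal(size=d)
M = rng.normal(size=(5, d))   # 5 history turns, one row per m_i

# 2. Knowledge attention: p_i = softmax(u . m_i)
p = softmax(M @ u)

# 3. Knowledge encoding: attention-weighted sum of memories,
#    combined with u and projected before the RNN tagger.
o = p @ M                     # o = sum_i p_i * m_i
W_kg = rng.normal(size=(d, d)) / np.sqrt(d)
h = W_kg @ (o + u)            # fed into the RNN tagger
```

Because p is a softmax over similarities, turns whose memory vectors align with the current utterance dominate the carried-over knowledge.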
END-TO-END TRAINING

• Tagging Objective
• RNN Tagger

[Figure: the RNN tagger unrolled over time steps t−1, t, t+1; word inputs w_t enter through U, the knowledge encoding o through M, hidden states h_t recur through W, and slot tags y_t are emitted through V. Inputs are the contextual utterances and the current utterance; output is the slot tag sequence.]

Trained end to end, the model automatically figures out the attention distribution without explicit supervision.
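The tagging objective is ordinarily the sequence cross-entropy; one plausible way to write it, with y_t the slot tag at step t and o the knowledge encoding (notation assumed from the diagram, not taken from the paper):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\!\left(y_t \mid w_{1:t}, o;\ \theta\right)
```

Because o depends differentiably on the attention distribution p_i, minimizing this loss trains the attention with no attention-level labels, which is what makes the training end-to-end.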
EXPERIMENTS

• Dataset: Cortana communication session data
• Setup: GRU for all RNNs, adam optimizer, embedding dim = 150, hidden units = 100, dropout = 0.5

Model           Training Set   Knowledge Encoding         Sentence Encoder   First Turn   Other   Overall
RNN Tagger      single-turn    x                          x                  60.6         16.2    25.5
RNN Tagger      multi-turn     x                          x                  55.9         45.7    47.4
Encoder-Tagger  multi-turn     current utt (c)            RNN                57.6         56.0    56.3
Encoder-Tagger  multi-turn     history + current (x, c)   RNN                69.9         60.8    62.5
Proposed        multi-turn     history + current (x, c)   RNN                73.2         65.7    67.1
Proposed        multi-turn     history + current (x, c)   CNN                73.8         66.5    68.0

Takeaways:
• The model trained on single-turn data performs much worse on non-first turns due to mismatched training data.
• Treating multi-turn data as single-turn for training performs reasonably.
• Encoding history together with the current utterance improves performance but increases training time.
• Applying memory networks (Proposed) significantly outperforms all other approaches, with much less training time.
• A CNN sentence encoder produces comparable results with shorter training time.
Conclusion

• The proposed end-to-end memory networks store contextual knowledge, which can be exploited dynamically via an attention model to carry knowledge over for multi-turn understanding.
• The end-to-end model performs the tagging task rather than classification.
• The experiments show the feasibility and robustness of modeling knowledge carryover through memory networks.
Future Work

• Leverage not only local observations but also global knowledge for better language understanding.
  – Syntax or semantics can serve as global knowledge to guide the understanding model.
  – "Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks," arXiv preprint arXiv:1609.03286.
Q & A

Thanks for your attention!