Robust and Scalable Conversational AI
Computer Science & Information Engineering, National Taiwan University
Yun-Nung (Vivian) Chen
16th Workshop on Spoken Dialogue Systems for PhDs, PostDocs & New Researchers (YRRSDS 2020)
Language Empowering Intelligent Assistants
Apple Siri (2011) → Google Now (2012) → Microsoft Cortana (2014) → Amazon Alexa/Echo (2014) → Google Home (2016) → Google Assistant (2016) → Apple HomePod (2017) → Facebook Portal (2019)
Task-Oriented Dialogue Systems (Young, 2000)
Speech Signal → Speech Recognition → Language Understanding (LU) → Dialogue Management (DM) → Natural Language Generation (NLG) → Text Response
• Input: a speech signal passes through Speech Recognition to produce the hypothesis "are there any action movies to see this weekend"; typed text input ("Are there any action movies to see this weekend?") can enter the pipeline directly
• Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling → Semantic Frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): Dialogue State Tracking (DST) and Dialogue Policy, backed by a Database → System Action/Policy: request_location
• Natural Language Generation (NLG) → Text Response: "Where are you located?"
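To make the hand-off between components concrete, here is a minimal, hypothetical sketch in Python using the slide's running example (all class and function names are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical containers for the artifacts each pipeline stage produces.
@dataclass
class SemanticFrame:          # output of LU
    intent: str               # e.g., "request_movie"
    slots: Dict[str, str] = field(default_factory=dict)

@dataclass
class SystemAction:           # output of the dialogue policy
    act: str                  # e.g., "request_location"

def language_understanding(hypothesis: str) -> SemanticFrame:
    # A real LU module does domain/intent classification + slot filling;
    # here we hard-code the talk's running example.
    return SemanticFrame(intent="request_movie",
                         slots={"genre": "action", "date": "this weekend"})

def dialogue_policy(frame: SemanticFrame) -> SystemAction:
    # The policy consults the dialogue state (and a database) to pick an action.
    if "location" not in frame.slots:
        return SystemAction(act="request_location")
    return SystemAction(act="inform_movies")

def nlg(action: SystemAction) -> str:
    templates = {"request_location": "Where are you located?"}
    return templates.get(action.act, "...")

frame = language_understanding("are there any action movies to see this weekend")
print(nlg(dialogue_policy(frame)))   # -> Where are you located?
```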
Recent Advances in NLP
◉ Contextual Embeddings (ELMo & BERT)
○ Pre-trained language models boost performance on many understanding tasks
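As a concrete illustration of "contextual": the same word receives different vectors in different contexts. A minimal sketch, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (the slide names BERT only):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["book a flight to paris", "i read a good book"]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)

# "book" is token 1 in the first sentence (after [CLS]) and token 5 in the second.
v1, v2 = hidden[0, 1], hidden[1, 5]
print(torch.cosine_similarity(v1, v2, dim=0))   # < 1: context-dependent vectors
```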
Example: recognition errors can change the meaning entirely; the spoken query "List all flights tomorrow" may be recognized as "Lift all lights to Morocco".
Task-Oriented Dialogue Systems (Young, 2000)
(pipeline repeated from earlier; the following slides zoom in on speech recognition and language understanding)
Mismatch between Written and Spoken Languages
◉ Training: written language
◉ Testing: spoken language, which includes recognition errors
◉ Goal: ASR-robust contextualized embeddings
✓ learn spoken contextualized word embeddings
✓ achieve better performance on spoken language understanding tasks
Solution: LatticeLM
(Huang & Chen, ACL 2020)
Chao-Wei Huang and Yun-Nung Chen, "Learning Spoken Language Representations with Neural Lattice Language Modeling," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
ASR Lattices for Preserving Uncertainty
◉ Idea: lattices may include the correct words even when the 1-best hypothesis does not
[Lattice figure: <s> cheapest (1.0) → {airfare (0.4) | fair (0.3) | affair (0.3) → air (1.0)} → to (1.0) Milwaukee (1.0) </s>; arc weights are ASR posterior probabilities]
Prior work shows that LatticeRNN helps (Ladhak et al., 2016); LM pre-training helps further.
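The lattice above is just a weighted DAG. A short sketch of this representation (node names and the exact structure are my reconstruction of the figure):

```python
# The slide's lattice as a weighted DAG: each arc carries a word and its
# ASR posterior; alternative paths preserve recognition uncertainty.
lattice = {
    "<s>": [("cheapest", "n1", 1.0)],
    "n1":  [("airfare", "n2", 0.4), ("fair", "n2", 0.3), ("affair", "n3", 0.3)],
    "n3":  [("air", "n2", 1.0)],
    "n2":  [("to", "n4", 1.0)],
    "n4":  [("Milwaukee", "</s>", 1.0)],
}

def paths(node, prefix=(), prob=1.0):
    """Enumerate all hypotheses in the lattice with their path probabilities."""
    if node == "</s>":
        yield " ".join(prefix), prob
        return
    for word, nxt, p in lattice.get(node, []):
        yield from paths(nxt, prefix + (word,), prob * p)

for hyp, p in sorted(paths("<s>"), key=lambda x: -x[1]):
    print(f"{p:.2f}  {hyp}")
# 0.40  cheapest airfare to Milwaukee   <- contains the correct words
# 0.30  cheapest fair to Milwaukee
# 0.30  cheapest affair air to Milwaukee
```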
Lattice Language Modeling
1) A LatticeLSTM encodes the nodes of a lattice
2) The goal is to predict the outgoing transitions (words) given a node's representation
◉ A one-hypothesis lattice reduces to normal language modeling
[Figure: a LatticeLSTM encodes each lattice node; a linear layer over the node representation predicts the words on its outgoing arcs, weighted by arc posteriors (e.g., "the", 1.0)]
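A minimal sketch of the lattice LM objective as I read it from the slide (a paraphrase, not the paper's code): a linear layer on a node's representation scores the next word, and the loss is the negative log-likelihood of each outgoing arc's word, weighted by its posterior.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 128          # sizes are mine
proj = nn.Linear(hidden, vocab_size)    # the "Linear" box on the slide

def lattice_lm_loss(node_state, arc_words, arc_probs):
    """node_state: (hidden,); arc_words: (A,) word ids; arc_probs: (A,)."""
    logp = torch.log_softmax(proj(node_state), dim=-1)   # (vocab,)
    return -(arc_probs * logp[arc_words]).sum()          # posterior-weighted NLL

# One node with two outgoing arcs, e.g. "airfare" (0.8) and "fair" (0.2):
h = torch.randn(hidden)                                  # stand-in node state
loss = lattice_lm_loss(h, torch.tensor([17, 42]), torch.tensor([0.8, 0.2]))
loss.backward()
# With a single outgoing arc of probability 1.0, this reduces to the
# standard next-word cross-entropy, as the slide notes.
```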
Issue: LatticeLSTM runs prohibitively slowly
Efficient Two-Stage Pre-Training
Stage 1: Pre-Training on Sequential Texts
• A standard LSTM language model is trained on written text with the next-word objective (input "What a day" → targets "a day <EOS>" through a linear output layer)
Stage 2: Pre-Training on Lattices
• The pre-trained weights initialize a LatticeLSTM, which continues pre-training with the lattice LM objective
• For downstream classification, node representations are max-pooled and the whole model is fine-tuned
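A minimal sketch of the weight-transfer idea behind the two stages (class names and sizes are mine; the actual LatticeLSTM update is more involved):

```python
import torch
import torch.nn as nn

class SeqLM(nn.Module):                      # Stage 1 model
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

class LatticeEncoder(nn.Module):             # Stage 2 model (same shapes)
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.LSTMCell(dim, dim)    # a full LatticeLSTM would pool
        self.out = nn.Linear(dim, vocab)     # states over incoming arcs,
                                             # weighted by arc posteriors

seq_lm = SeqLM()
# ... Stage 1: train seq_lm with the usual next-word objective ...
lattice_enc = LatticeEncoder()
lattice_enc.load_state_dict(seq_lm.state_dict())  # warm start from Stage 1
# ... Stage 2: continue pre-training on lattices with the lattice LM loss,
# then fine-tune with max-pooled node representations plus a classifier.
```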
Spoken Language Understanding Results
◉ Intent Prediction
○ Word Error Rate: 45.6% (SNIPS); 15.6% (ATIS)
[Bar chart: intent prediction accuracy (80–100%) on ATIS and SNIPS for 1-Best, 1-Best + LatticeLSTM, and LatticeLM]
Spoken Language Understanding Results
◉ Dialogue Act Prediction
○ Word Error Rate: 32.0% (MRDA); 28.4% (SWDA)
[Bar chart: dialogue act prediction accuracy (50–75%) on SWDA and MRDA for 1-Best, 1-Best + LatticeLSTM, and LatticeLM]
What if we do not have ASR lattices?
Solution:
Learning ASR-Robust Embeddings
(Huang & Chen, ICASSP 2020)
Chao-Wei Huang and Yun-Nung Chen, "Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding," in Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
ASR-Robust Contextualized Embeddings
◉ Confusion-Aware Fine-Tuning (sketched below)
○ Supervised: confusion pairs derived with reference transcripts
○ Unsupervised: no reference transcripts required
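The slide gives no implementation detail, so this is only a hedged sketch of one plausible reading of the confusion-aware objective: pull contextualized embeddings of acoustically confusable words together. The `encode` stand-in, the pair format, and the cosine loss are all my assumptions.

```python
import torch
import torch.nn.functional as F

def confusion_loss(encode, pairs):
    """pairs: (sentence_a, index_a, sentence_b, index_b) confusable tokens."""
    losses = []
    for sent_a, i, sent_b, j in pairs:
        ha, hb = encode(sent_a)[i], encode(sent_b)[j]
        losses.append(1 - F.cosine_similarity(ha, hb, dim=0))
    return torch.stack(losses).mean()

def encode(sentence):                        # toy encoder: one vector per word
    torch.manual_seed(abs(hash(sentence)) % (2**31))
    return torch.randn(len(sentence.split()), 64)

# Supervised: pairs come from aligning ASR hypotheses against reference
# transcripts ("fare" misrecognized as "fair"). Unsupervised: pairs come
# from acoustic/phonetic similarity alone, without references.
pairs = [("the fair price", 1, "the fare price", 1)]
print(confusion_loss(encode, pairs))
```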
Spoken Language Understanding Results
◉ Air Travel Information System (ATIS)
○ Word Error Rate: 16.4%
[Bar chart: intent accuracy and slot F1 on ATIS (90–99%) for the baseline, + LM fine-tuning, + LM + Confusion (Supervised), and + LM + Confusion (Unsupervised)]
Task-Oriented Dialogue Systems (Young, 2000)
(pipeline repeated from earlier; the following slides zoom in on language understanding and natural language generation)
Natural Language Understanding (NLU)
◉ Parse natural language into structured semantics
Natural Language: "McDonald's is a cheap restaurant nearby the station."
→ NLU →
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"
Natural Language Generation (NLG)
◉ Construct natural language based on structured semantics
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"
→ NLG →
Natural Language: "McDonald's is a cheap restaurant nearby the station."
Duality between NLU and NLG
Natural Language: "McDonald's is a cheap restaurant nearby the station."
⇄ NLG / NLU ⇄
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"

How can we leverage this dual relationship?
Solution:
Dual Supervised Learning for NLU & NLG
(Su et al., ACL 2019)
Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen, “Dual Supervised Learning for Natural Language Understanding and Generation,” in Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
DSL: Dual Supervised Learning (Xia et al., 2017)
◉ Originally proposed for machine translation
◉ Consider two domains X and Y, and two tasks X → Y (model θ_{x→y}) and Y → X (model θ_{y→x})
We have P(x, y) = P(x|y) P(y) = P(y|x) P(x)
Ideally, P(x, y) = P(x|y; θ_{y→x}) P(y) = P(y|x; θ_{x→y}) P(x)
Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu, "Dual Supervised Learning," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Dual Supervised Learning
◉ Exploit the duality by forcing the models to satisfy the probabilistic constraint P(x|y; θ_{y→x}) P(y) = P(y|x; θ_{x→y}) P(x)
Objective functions:
min_{θ_{x→y}} E[ l₁(f(x; θ_{x→y}), y) ] + λ_{x→y} · l_duality
min_{θ_{y→x}} E[ l₂(g(y; θ_{y→x}), x) ] + λ_{y→x} · l_duality
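To make the regularizer concrete, here is a minimal sketch with toy numbers (in the actual method the four log-probabilities come from the two task models and two pre-trained marginal estimators):

```python
import torch

def duality_loss(logp_x, logp_y_given_x, logp_y, logp_x_given_y):
    """Penalize violation of log P(x) + log P(y|x) = log P(y) + log P(x|y)."""
    return (logp_x + logp_y_given_x - logp_y - logp_x_given_y) ** 2

# Toy values standing in for model outputs (NLG supplies log P(y|x),
# NLU supplies log P(x|y), marginal estimators supply log P(x), log P(y)):
lp_x, lp_y = torch.tensor(-12.3), torch.tensor(-4.1)
lp_y_x, lp_x_y = torch.tensor(-3.7), torch.tensor(-11.2)

l_dual = duality_loss(lp_x, lp_y_x, lp_y, lp_x_y)
print(l_dual)   # ~tensor(0.4900)
# Each task adds this as a soft penalty to its own supervised loss:
# L_xy = task_loss_xy + lambda_xy * l_dual ;  L_yx = task_loss_yx + lambda_yx * l_dual
```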
How do we model the marginal distributions of X and Y?
Dual Supervised Learning
◉ Let’s go back to NLU and NLG
Natural Language x: "McDonald's is a cheap restaurant nearby the station." (marginal: log P(x))
⇄ NLG / NLU ⇄
Semantic Frame y: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station" (marginal: log P(y))
Marginal of Natural Language: log P(x)
◉ Language modeling with a GRU: estimate P(x_d | x_1, …, x_{d-1}) at each position; log P(x) is the sum of the per-word log-probabilities
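A minimal GRU language-model sketch for log P(x) (all sizes are mine):

```python
import torch
import torch.nn as nn

class GRULM(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def log_prob(self, ids):                  # ids: (1, T) token ids
        h, _ = self.gru(self.emb(ids[:, :-1]))
        logp = torch.log_softmax(self.out(h), dim=-1)
        # log P(x) = sum_d log P(x_d | x_1..x_{d-1})
        return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum()

lm = GRULM()
print(lm.log_prob(torch.tensor([[1, 5, 9, 2]])))   # scalar log P(x)
```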
Marginal of Semantic Frame: log P(y)
◉ We treat NLU as a multi-label classification problem
◉ Each label is a slot-value pair
Example: the frame {RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"} becomes a binary label vector (0, 1, …, 0, 1)
How do we model the marginal distribution of y?
Marginal of Semantic Frame: log P(y)
◉ Naïve approach (sketched below)
○ Estimate the prior probability P(y_i) of each label on the training set
○ P(y) = ∏_i P(y_i)
Assumption: labels are independent
Example labels: Restaurant: "McDonald's", "KFC", "PizzaHut"; Price: "cheap", "expensive"; Food: "Pizza", "Hamburger", "Chinese"
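A tiny illustration of the naïve estimator on a toy training set (the frames below are invented for illustration):

```python
from collections import Counter

# Naïve marginal: treat each slot-value label as independent and multiply
# its empirical prior from the training set.
train_frames = [
    {"Restaurant=McDonald's", "Price=cheap"},
    {"Restaurant=KFC", "Price=cheap", "Food=Hamburger"},
    {"Restaurant=PizzaHut", "Food=Pizza"},
]
counts = Counter(label for frame in train_frames for label in frame)
N = len(train_frames)

def naive_prob(frame):
    p = 1.0
    for label in frame:
        p *= counts[label] / N        # P(y_i): empirical prior
    return p                          # P(y) = prod_i P(y_i), independence assumed

print(naive_prob({"Restaurant=McDonald's", "Price=cheap"}))  # (1/3)*(2/3) ~ 0.22
```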
Marginal of Semantic Frame: log P(y)
◉ Masked autoencoder for distribution estimation (MADE)
[Figure: MADE network with per-unit degrees (inputs 2 1 3; hidden 1 2 2 1; outputs 2 1 3) determining which connections are masked]
Masking certain connections introduces a sequential (autoregressive) dependency among labels, which yields the marginal distribution of y
M. Germain, K. Gregor, I. Murray, and H. Larochelle, "MADE: Masked Autoencoder for Distribution Estimation," in Proceedings of the International Conference on Machine Learning (ICML), 2015.
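A compact sketch of MADE's masking rule (sizes and initialization are mine): per-unit degrees force output d to depend only on inputs with degree less than d, so a single forward pass yields all the conditionals of P(y) = ∏_d P(y_d | y_<d).

```python
import torch

D, H = 5, 16                                   # number of labels, hidden units
m_in = torch.arange(1, D + 1)                  # input degrees 1..D
m_hid = torch.randint(1, D, (H,))              # hidden degrees in [1, D-1]

mask1 = (m_hid[:, None] >= m_in[None, :]).float()   # (H, D) input->hidden
mask2 = (m_in[:, None] > m_hid[None, :]).float()    # (D, H) hidden->output

W1, W2 = torch.randn(H, D) * 0.1, torch.randn(D, H) * 0.1

def made(y):                                   # y: (batch, D) binary labels
    h = torch.relu(y @ (W1 * mask1).T)
    return torch.sigmoid(h @ (W2 * mask2).T)   # per-dim P(y_d = 1 | y_<d)

y = torch.randint(0, 2, (1, D)).float()
p = made(y).clamp(1e-6, 1 - 1e-6)
log_py = (y * p.log() + (1 - y) * (1 - p).log()).sum()
print(log_py)                                  # log P(y) under the estimator
```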
Model architectures:
• NLU: a GRU reads the sentence "McDonald's is … station"; a linear layer on its final state outputs the binary label vector (0, 1, …, 0, 1)
• NLG: conditioned on the label vector (0, 1, …, 0, 1), a GRU decoder starts from <BOS> and generates "McDonald's is … <EOS>"
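To make the two dual models concrete, here is a minimal sketch (assuming PyTorch; all dimensions and names are mine):

```python
import torch
import torch.nn as nn

V, L, D = 1000, 30, 128                        # vocab, #labels, hidden size

class NLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb, self.gru = nn.Embedding(V, D), nn.GRU(D, D, batch_first=True)
        self.out = nn.Linear(D, L)
    def forward(self, words):                  # (B, T) -> (B, L) label logits
        _, h = self.gru(self.emb(words))
        return self.out(h[-1])

class NLG(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb, self.gru = nn.Embedding(V, D), nn.GRU(D, D, batch_first=True)
        self.init = nn.Linear(L, D)            # labels -> initial decoder state
        self.out = nn.Linear(D, V)
    def forward(self, labels, words_in):       # teacher-forced decoding
        h0 = torch.tanh(self.init(labels)).unsqueeze(0)
        o, _ = self.gru(self.emb(words_in), h0)
        return self.out(o)                     # (B, T, V) next-word logits

nlu, nlg = NLU(), NLG()
words = torch.randint(0, V, (2, 7))
labels = torch.sigmoid(nlu(words))
print(nlg((labels > 0.5).float(), words).shape)   # torch.Size([2, 7, 1000])
```

In dual supervised learning, these two models are trained jointly: each keeps its own supervised loss, and the duality regularizer sketched earlier ties them together through log P(x) from the GRU language model and log P(y) from MADE.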