Can Current Conversational Assistants Satisfy Users?
Yun-Nung Vivian Chen
http://vivianchen.idv.tw
Iron Man (2008)
NTU MiuLab
Language Empowering Intelligent Assistants
Apple Siri (2011)
Google Now (2012)
Microsoft Cortana (2014)
Amazon Alexa/Echo (2014)
Facebook M & Bot (2015)
Google Home (2016)
Google Assistant (2016)
Apple HomePod (2017)
Task-Oriented Dialogue Systems (Young, 2000)
Pipeline (speech signal → text response):
• Input: speech signal or text ("Are there any action movies to see this weekend?")
• Speech Recognition
  ‒ Hypothesis: "are there any action movies to see this weekend"
• Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
  ‒ Semantic Frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): Dialogue State Tracking (DST), Dialogue Policy (backed by a Database)
  ‒ System Action/Policy: request_location
• Natural Language Generation (NLG)
  ‒ Text Response: "Where are you located?"
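A minimal Python sketch of the LU output and the DM decision in this example (the intent name, slot names, and required-slot list are illustrative assumptions, not the original system's):

```python
# Illustrative semantic frame produced by LU for the example utterance.
semantic_frame = {
    "intent": "request_movie",
    "slots": {"genre": "action", "date": "this weekend"},
}

# Minimal slot-based dialogue policy: request the first slot the task
# still needs; once all required slots are filled, query the database.
REQUIRED_SLOTS = ["genre", "date", "location"]

def next_action(frame):
    for slot in REQUIRED_SLOTS:
        if slot not in frame["slots"]:
            return f"request_{slot}"
    return "inform_results"

action = next_action(semantic_frame)  # "request_location"
```

NLG would then realize request_location as the surface response "Where are you located?".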
Recent Advances in NLP
• Contextualized embeddings (ELMo & BERT)
• Pre-trained language models boost performance on many understanding tasks
Task-Oriented Dialogue Systems (Young, 2000)
Mismatch between Written and Spoken Languages
• Training: written language
• Testing: spoken language, including recognition errors
• Goal: ASR-robust contextualized embeddings
  ✓ learn contextualized word embeddings specifically for spoken language
  ✓ achieve better performance on spoken language understanding tasks
    ‒ better results on ASR transcripts
    ‒ similar results maintained on manual transcripts
Solution 1:
Adapting Transformer to ASR Lattices
BERT/GPT Pre-Training & Fine-Tuning
• Pre-Training: a Transformer encoder reads a word sequence w_1 … w_m and predicts words (w_2 … w_{m+1}) through a linear output layer
• Fine-Tuning: the same encoder reads <S> w_1 … w_m <E> and a linear layer on top predicts the task label y
• Idea: lattices may include correct words
• Goal: feed lattices into the Transformer
  1) Linearize  2) Binary mask  3) Probabilistic mask
(Figure: an ASR lattice "<s> cheapest {airfare | air fair | affair} to Milwaukee </s>" with competing-arc probabilities 0.4 / 0.3 / 0.3, fed into the Transformer encoder as <S> w_1 … w_m <E> with a linear classifier predicting y)
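The linearize-and-mask steps can be sketched on a toy lattice (the arc structure, probabilities, and the reachability-based compatibility rule are assumptions for illustration; the actual model's construction may differ):

```python
from collections import defaultdict, deque
from itertools import product

# Toy ASR lattice: each token is an arc (word, start_node, end_node, prob).
tokens = [
    ("cheapest",  0, 1, 1.0),
    ("airfare",   1, 2, 0.4),
    ("air",       1, 3, 0.3),
    ("fair",      3, 2, 0.3),
    ("affair",    1, 2, 0.3),
    ("to",        2, 4, 1.0),
    ("milwaukee", 4, 5, 1.0),
]

# 1) Linearize: topologically sort the lattice nodes (Kahn's algorithm),
#    then order arcs by the ranks of their start and end nodes.
indeg, succ = defaultdict(int), defaultdict(list)
nodes = {n for _, s, e, _ in tokens for n in (s, e)}
for _, s, e, _ in tokens:
    succ[s].append(e)
    indeg[e] += 1
queue = deque(n for n in sorted(nodes) if indeg[n] == 0)
rank = {}
while queue:
    n = queue.popleft()
    rank[n] = len(rank)
    for m in succ[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)
order = sorted(range(len(tokens)),
               key=lambda i: (rank[tokens[i][1]], rank[tokens[i][2]]))
linearized = [tokens[i][0] for i in order]

# Node reachability by transitive closure (fine for toy-sized lattices).
reach = {(n, n) for n in nodes} | {(s, e) for _, s, e, _ in tokens}
changed = True
while changed:
    changed = False
    for (a, b), (c, d) in product(list(reach), list(reach)):
        if b == c and (a, d) not in reach:
            reach.add((a, d))
            changed = True

# 2) Binary mask: two tokens may attend to each other only if they can
#    co-occur on a single lattice path.
def compatible(i, j):
    _, si, ei, _ = tokens[i]
    _, sj, ej, _ = tokens[j]
    return i == j or (ei, sj) in reach or (ej, si) in reach

mask = [[1 if compatible(order[r], order[c]) else 0
         for c in range(len(order))] for r in range(len(order))]
```

Here "air" and "affair" end up mutually masked, since they belong to competing hypotheses, while "air" and "fair" stay visible to each other.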
Self-Attention (Vaswani+, 2017)

Scaled dot-product attention computes, for Query Q, Key K, and Value V:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

i.e., query–key dot products are scaled, passed through a softmax, and used to take a weighted sum of the values; multiple such heads run in parallel.

Vaswani et al., "Attention Is All You Need", in NIPS, 2017.
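A minimal NumPy sketch of scaled dot-product attention, with an optional mask hook of the kind the lattice approach needs (single head, no learned projections; the log-mask trick is one common implementation choice, not necessarily the paper's):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    `mask` is an optional (m, m) array of non-negative weights: 0 blocks
    a position (binary masking) and fractional values down-weight it
    (probabilistic masking), implemented by adding log(mask) to the
    attention scores before the softmax.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + np.log(np.maximum(mask, 1e-12))
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)                   # (4, 8)
self_only = scaled_dot_product_attention(Q, K, V, np.eye(4))  # ≈ V
```

With an identity mask, each token attends only to itself, so the output reduces to the values.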
Attention Masks
• Binary masks: mask value 1 on allowed lattice positions
• Probabilistic masks: mask values follow the lattice probabilities (1 on confirmed arcs, 0.4 / 0.3 / 0.3 on the competing arcs of the example lattice)
(Figure: the lattice "<s> cheapest {airfare | air fair | affair} to Milwaukee </s>" annotated with these mask weights)
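One plausible way to derive the probabilistic mask from the binary one, assuming each allowed position is weighted by the attended token's lattice probability (an illustrative choice, not necessarily the exact formulation used here):

```python
# Tokens of the example lattice and their arc probabilities
# (0.4 / 0.3 / 0.3 on the competing arcs, 1 on confirmed arcs).
words = ["cheapest", "air", "airfare", "affair", "fair", "to", "milwaukee"]
arc_prob = {"cheapest": 1.0, "air": 0.3, "airfare": 0.4, "affair": 0.3,
            "fair": 0.3, "to": 1.0, "milwaukee": 1.0}

# binary[i][j] = 1 iff tokens i and j can co-occur on one lattice path
# (hand-derived for the example lattice; row/column order matches `words`).
binary = [
    [1, 1, 1, 1, 1, 1, 1],   # cheapest
    [1, 1, 0, 0, 1, 1, 1],   # air
    [1, 0, 1, 0, 0, 1, 1],   # airfare
    [1, 0, 0, 1, 0, 1, 1],   # affair
    [1, 1, 0, 0, 1, 1, 1],   # fair
    [1, 1, 1, 1, 1, 1, 1],   # to
    [1, 1, 1, 1, 1, 1, 1],   # milwaukee
]

# Probabilistic mask: scale each allowed position by the probability of
# the attended token, so unlikely hypotheses contribute less attention.
prob = [[binary[i][j] * arc_prob[words[j]] for j in range(len(words))]
        for i in range(len(words))]
```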
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 15.5%
(Bar chart, scores 86–98: Intent and Slot results for 1-Best, Lattice-Linear, Lattice-Binary, and Lattice-Prob)
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 26.3%
(Bar chart, scores 86–98: Intent and Slot results for 1-Best, Lattice-Linear, Lattice-Binary, and Lattice-Prob)
What if we do not have ASR lattices?
Solution 2:
Learning ASR-Robust Embeddings
ASR-Robust Contextualized Embeddings
• Confusion-Aware Fine-Tuning
  ‒ Supervised
  ‒ Unsupervised
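A toy sketch of the supervised confusion-aware idea, assuming confusable word pairs come from aligning manual and ASR transcripts and are pulled together by a cosine loss (the embeddings, pairs, and exact loss form are all assumptions for illustration):

```python
import numpy as np

# Hypothetical setup: `emb` maps words to contextual vectors, and
# `confusions` holds (reference, ASR-hypothesis) pairs obtained by
# aligning manual and ASR transcripts (the supervised variant).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["fair", "fare", "affair", "to"]}
confusions = [("fare", "fair"), ("fare", "affair")]

def confusion_loss(emb, pairs):
    """Average (1 - cosine similarity) over confusable word pairs.

    Minimizing this term during fine-tuning pulls the embeddings of
    acoustically confusable words together, so downstream SLU is less
    sensitive to which variant the recognizer emitted.
    """
    losses = []
    for a, b in pairs:
        u, v = emb[a], emb[b]
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        losses.append(1.0 - cos)
    return float(np.mean(losses))

loss = confusion_loss(emb, confusions)  # combined with the LM loss in practice
```

The unsupervised variant would construct the pairs without aligned transcripts, e.g. from acoustic similarity; the loss shape stays the same.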
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 16.4%
(Bar chart, scores 90–99: Intent and Slot results for the baseline, + LM fine-tuning, + LM + Confusion (Supervised), and + LM + Confusion (Unsupervised))
Task-Oriented Dialogue Systems (Young, 2000)
Conversational AI for Unstructured Knowledge
• A machine reads large amounts of text and serves as a teacher
• A user asks questions, in a conversational manner, and serves as a student
→ Conversational QA
FlowDelta: Information Gain in Dialogue Flow
• Idea: model the difference of hidden states in multi-turn dialogues
• The conversation flow runs over the context across time (question turns Q1, Q2, Q3, …): at turn t and context position j, the context vector c_{t,j} and hidden state h_{t,j} are computed from the previous turn's h_{t−1,j}
• FlowDelta models the flow information gain as the difference Δ = h_{t,j} − h_{t−1,j}
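The FlowDelta idea reduces to a simple operation on the per-turn hidden states; a NumPy sketch (shapes and the concatenation choice are illustrative):

```python
import numpy as np

# h[t, j] is the hidden state for context position j at question turn t.
T, J, D = 3, 5, 8          # turns, context length, hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=(T, J, D))

# Information gain between consecutive turns: delta[t] = h[t] - h[t-1];
# the first turn has no predecessor, so its delta is zero.
delta = np.zeros_like(h)
delta[1:] = h[1:] - h[:-1]

# Downstream dialogue reasoning can consume the hidden state together
# with its delta, e.g. by concatenation along the feature dimension.
flow_input = np.concatenate([h, delta], axis=-1)   # (T, J, 2 * D)
```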
FlowDelta: Information Gain in Dialogue Flow
• Idea: model the difference of hidden states in multi-turn dialogues
• The flow mechanism plugs into different readers:
  ‒ FlowQA: encode the i-th question with the context, apply dialogue reasoning, and output the i-th answer
  ‒ BERT: apply dialogue reasoning over intermediate BERT layers (l_1, …, l_{k−1}, l_k) before predicting the i-th answer
Conversational QA Results
• Data: QuAC, CoQA
(Bar chart, scores 60–80 on CoQA and QuAC: BERT and FlowQA, each with + Flow and + FlowDelta variants)