Can Current Conversational Assistants Satisfy Users?
Yun-Nung Vivian Chen
http://vivianchen.idv.tw
Iron Man (2008)
NTU MiuLab
Language Empowering Intelligent Assistants
Apple Siri (2011)
Google Now (2012)
Microsoft Cortana (2014)
Amazon Alexa/Echo (2014)
Facebook M & Bot (2015)
Google Home (2016)
Google Assistant (2016)
Apple HomePod (2017)
Task-Oriented Dialogue Systems (Young, 2000)
Pipeline (speech signal → text response):
• Input: speech signal or text ("Are there any action movies to see this weekend?")
• Speech Recognition
  ‒ Hypothesis: "are there any action movies to see this weekend"
• Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
  ‒ Semantic Frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): Dialogue State Tracking (DST), Dialogue Policy (backed by a Database)
  ‒ System Action/Policy: request_location
• Natural Language Generation (NLG)
  ‒ Text Response: "Where are you located?"
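A minimal Python sketch of the LU output and the DM decision in this example (the intent name, slot names, and required-slot list are illustrative assumptions, not the original system's):

```python
# Illustrative semantic frame produced by LU for the example utterance.
semantic_frame = {
    "intent": "request_movie",
    "slots": {"genre": "action", "date": "this weekend"},
}

# Minimal slot-based dialogue policy: request the first slot the task
# still needs; once all required slots are filled, query the database.
REQUIRED_SLOTS = ["genre", "date", "location"]

def next_action(frame):
    for slot in REQUIRED_SLOTS:
        if slot not in frame["slots"]:
            return f"request_{slot}"
    return "inform_results"

action = next_action(semantic_frame)  # "request_location"
```

NLG would then realize request_location as the surface response "Where are you located?".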
Recent Advances in NLP
• Contextualized embeddings (ELMo & BERT)
• Pre-trained language models boost performance on many understanding tasks
Task-Oriented Dialogue Systems (Young, 2000)
Mismatch between Written and Spoken Languages
• Training: written language
• Testing: spoken language, including recognition errors
• Goal: ASR-robust contextualized embeddings
  ✓ learn contextualized word embeddings specifically for spoken language
  ✓ achieve better performance on spoken language understanding tasks
    ‒ better results on ASR transcripts
    ‒ similar results maintained on manual transcripts
Solution 1:
Adapting Transformer to ASR Lattices
BERT/GPT Pre-Training & Fine-Tuning
• Pre-Training: a Transformer encoder reads a word sequence w_1 … w_m and predicts words (w_2 … w_{m+1}) through a linear output layer
• Fine-Tuning: the same encoder reads <S> w_1 … w_m <E> and a linear layer on top predicts the task label y
• Idea: lattices may include correct words
• Goal: feed lattices into the Transformer
  1) Linearize  2) Binary mask  3) Probabilistic mask
(Figure: an ASR lattice "<s> cheapest {airfare | air fair | affair} to Milwaukee </s>" with competing-arc probabilities 0.4 / 0.3 / 0.3, fed into the Transformer encoder as <S> w_1 … w_m <E> with a linear classifier predicting y)
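The linearize-and-mask steps can be sketched on a toy lattice (the arc structure, probabilities, and the reachability-based compatibility rule are assumptions for illustration; the actual model's construction may differ):

```python
from collections import defaultdict, deque
from itertools import product

# Toy ASR lattice: each token is an arc (word, start_node, end_node, prob).
tokens = [
    ("cheapest",  0, 1, 1.0),
    ("airfare",   1, 2, 0.4),
    ("air",       1, 3, 0.3),
    ("fair",      3, 2, 0.3),
    ("affair",    1, 2, 0.3),
    ("to",        2, 4, 1.0),
    ("milwaukee", 4, 5, 1.0),
]

# 1) Linearize: topologically sort the lattice nodes (Kahn's algorithm),
#    then order arcs by the ranks of their start and end nodes.
indeg, succ = defaultdict(int), defaultdict(list)
nodes = {n for _, s, e, _ in tokens for n in (s, e)}
for _, s, e, _ in tokens:
    succ[s].append(e)
    indeg[e] += 1
queue = deque(n for n in sorted(nodes) if indeg[n] == 0)
rank = {}
while queue:
    n = queue.popleft()
    rank[n] = len(rank)
    for m in succ[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)
order = sorted(range(len(tokens)),
               key=lambda i: (rank[tokens[i][1]], rank[tokens[i][2]]))
linearized = [tokens[i][0] for i in order]

# Node reachability by transitive closure (fine for toy-sized lattices).
reach = {(n, n) for n in nodes} | {(s, e) for _, s, e, _ in tokens}
changed = True
while changed:
    changed = False
    for (a, b), (c, d) in product(list(reach), list(reach)):
        if b == c and (a, d) not in reach:
            reach.add((a, d))
            changed = True

# 2) Binary mask: two tokens may attend to each other only if they can
#    co-occur on a single lattice path.
def compatible(i, j):
    _, si, ei, _ = tokens[i]
    _, sj, ej, _ = tokens[j]
    return i == j or (ei, sj) in reach or (ej, si) in reach

mask = [[1 if compatible(order[r], order[c]) else 0
         for c in range(len(order))] for r in range(len(order))]
```

Here "air" and "affair" end up mutually masked, since they belong to competing hypotheses, while "air" and "fair" stay visible to each other.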
Self-Attention (Vaswani+, 2017)

Scaled dot-product attention computes, for Query Q, Key K, and Value V:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

i.e., query–key dot products are scaled, passed through a softmax, and used to take a weighted sum of the values; multiple such heads run in parallel.

Vaswani et al., "Attention Is All You Need", in NIPS, 2017.
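A minimal NumPy sketch of scaled dot-product attention, with an optional mask hook of the kind the lattice approach needs (single head, no learned projections; the log-mask trick is one common implementation choice, not necessarily the paper's):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    `mask` is an optional (m, m) array of non-negative weights: 0 blocks
    a position (binary masking) and fractional values down-weight it
    (probabilistic masking), implemented by adding log(mask) to the
    attention scores before the softmax.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + np.log(np.maximum(mask, 1e-12))
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)                   # (4, 8)
self_only = scaled_dot_product_attention(Q, K, V, np.eye(4))  # ≈ V
```

With an identity mask, each token attends only to itself, so the output reduces to the values.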
Attention Masks
• Binary masks: mask value 1 on allowed lattice positions
• Probabilistic masks: mask values follow the lattice probabilities (1 on confirmed arcs, 0.4 / 0.3 / 0.3 on the competing arcs of the example lattice)
(Figure: the lattice "<s> cheapest {airfare | air fair | affair} to Milwaukee </s>" annotated with these mask weights)
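One plausible way to derive the probabilistic mask from the binary one, assuming each allowed position is weighted by the attended token's lattice probability (an illustrative choice, not necessarily the exact formulation used here):

```python
# Tokens of the example lattice and their arc probabilities
# (0.4 / 0.3 / 0.3 on the competing arcs, 1 on confirmed arcs).
words = ["cheapest", "air", "airfare", "affair", "fair", "to", "milwaukee"]
arc_prob = {"cheapest": 1.0, "air": 0.3, "airfare": 0.4, "affair": 0.3,
            "fair": 0.3, "to": 1.0, "milwaukee": 1.0}

# binary[i][j] = 1 iff tokens i and j can co-occur on one lattice path
# (hand-derived for the example lattice; row/column order matches `words`).
binary = [
    [1, 1, 1, 1, 1, 1, 1],   # cheapest
    [1, 1, 0, 0, 1, 1, 1],   # air
    [1, 0, 1, 0, 0, 1, 1],   # airfare
    [1, 0, 0, 1, 0, 1, 1],   # affair
    [1, 1, 0, 0, 1, 1, 1],   # fair
    [1, 1, 1, 1, 1, 1, 1],   # to
    [1, 1, 1, 1, 1, 1, 1],   # milwaukee
]

# Probabilistic mask: scale each allowed position by the probability of
# the attended token, so unlikely hypotheses contribute less attention.
prob = [[binary[i][j] * arc_prob[words[j]] for j in range(len(words))]
        for i in range(len(words))]
```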
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 15.5%
(Bar chart, scores 86–98: Intent and Slot results for 1-Best, Lattice-Linear, Lattice-Binary, and Lattice-Prob)
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 26.3%
(Bar chart, scores 86–98: Intent and Slot results for 1-Best, Lattice-Linear, Lattice-Binary, and Lattice-Prob)
What if we do not have ASR lattices?
Solution 2:
Learning ASR-Robust Embeddings
ASR-Robust Contextualized Embeddings
• Confusion-Aware Fine-Tuning
  ‒ Supervised
  ‒ Unsupervised
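A toy sketch of the supervised confusion-aware idea, assuming confusable word pairs come from aligning manual and ASR transcripts and are pulled together by a cosine loss (the embeddings, pairs, and exact loss form are all assumptions for illustration):

```python
import numpy as np

# Hypothetical setup: `emb` maps words to contextual vectors, and
# `confusions` holds (reference, ASR-hypothesis) pairs obtained by
# aligning manual and ASR transcripts (the supervised variant).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["fair", "fare", "affair", "to"]}
confusions = [("fare", "fair"), ("fare", "affair")]

def confusion_loss(emb, pairs):
    """Average (1 - cosine similarity) over confusable word pairs.

    Minimizing this term during fine-tuning pulls the embeddings of
    acoustically confusable words together, so downstream SLU is less
    sensitive to which variant the recognizer emitted.
    """
    losses = []
    for a, b in pairs:
        u, v = emb[a], emb[b]
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        losses.append(1.0 - cos)
    return float(np.mean(losses))

loss = confusion_loss(emb, confusions)  # combined with the LM loss in practice
```

The unsupervised variant would construct the pairs without aligned transcripts, e.g. from acoustic similarity; the loss shape stays the same.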
Spoken Language Understanding Results
• Airline Traveling Information System (ATIS)
• Word Error Rate: 16.4%
(Bar chart, scores 90–99: Intent and Slot results for the baseline, + LM fine-tuning, + LM + Confusion (Supervised), and + LM + Confusion (Unsupervised))
Task-Oriented Dialogue Systems (Young, 2000)
Conversational AI for Unstructured Knowledge
• A machine reads large amounts of text and serves as a teacher
• A user asks questions, in a conversational manner, and serves as a student
→ Conversational QA
FlowDelta: Information Gain in Dialogue Flow
• Idea: model the difference of hidden states in multi-turn dialogues
• The conversation flow runs over the context across time (question turns Q1, Q2, Q3, …): at turn t and context position j, the context vector c_{t,j} and hidden state h_{t,j} are computed from the previous turn's h_{t−1,j}
• FlowDelta models the flow information gain as the difference Δ = h_{t,j} − h_{t−1,j}
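The FlowDelta idea reduces to a simple operation on the per-turn hidden states; a NumPy sketch (shapes and the concatenation choice are illustrative):

```python
import numpy as np

# h[t, j] is the hidden state for context position j at question turn t.
T, J, D = 3, 5, 8          # turns, context length, hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=(T, J, D))

# Information gain between consecutive turns: delta[t] = h[t] - h[t-1];
# the first turn has no predecessor, so its delta is zero.
delta = np.zeros_like(h)
delta[1:] = h[1:] - h[:-1]

# Downstream dialogue reasoning can consume the hidden state together
# with its delta, e.g. by concatenation along the feature dimension.
flow_input = np.concatenate([h, delta], axis=-1)   # (T, J, 2 * D)
```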
FlowDelta: Information Gain in Dialogue Flow
• Idea: model the difference of hidden states in multi-turn dialogues
• The flow mechanism plugs into different readers:
  ‒ FlowQA: encode the i-th question with the context, apply dialogue reasoning, and output the i-th answer
  ‒ BERT: apply dialogue reasoning over intermediate BERT layers (l_1, …, l_{k−1}, l_k) before predicting the i-th answer
Conversational QA Results
• Data: QuAC, CoQA
(Bar chart, scores 60–80 on CoQA and QuAC: BERT and FlowQA, each with + Flow and + FlowDelta variants)