Robust and Scalable Conversational AI
Computer Science & Information Engineering, National Taiwan University
Yun-Nung (Vivian) Chen
16th Workshop on Spoken Dialogue Systems for PhDs, PostDocs & New Researchers (YRRSDS 2020)
Language Empowering Intelligent Assistants
Apple Siri (2011) → Google Now (2012) → Microsoft Cortana (2014) → Amazon Alexa/Echo (2014) → Google Home (2016) → Google Assistant (2016) → Apple HomePod (2017) → Facebook Portal (2019)
Task-Oriented Dialogue Systems (Young, 2000)
Speech Signal → Speech Recognition → Language Understanding (LU) → Dialogue Management (DM) → Natural Language Generation (NLG) → Text Response
• Input: a speech signal passes through Speech Recognition to produce the hypothesis "are there any action movies to see this weekend"; typed text input ("Are there any action movies to see this weekend?") can enter the pipeline directly
• Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling → Semantic Frame: request_movie (genre=action, date=this weekend)
• Dialogue Management (DM): Dialogue State Tracking (DST) and Dialogue Policy, backed by a Database → System Action/Policy: request_location
• Natural Language Generation (NLG) → Text Response: "Where are you located?"
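To make the hand-off between components concrete, here is a minimal, hypothetical sketch in Python using the slide's running example (all class and function names are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical containers for the artifacts each pipeline stage produces.
@dataclass
class SemanticFrame:          # output of LU
    intent: str               # e.g., "request_movie"
    slots: Dict[str, str] = field(default_factory=dict)

@dataclass
class SystemAction:           # output of the dialogue policy
    act: str                  # e.g., "request_location"

def language_understanding(hypothesis: str) -> SemanticFrame:
    # A real LU module does domain/intent classification + slot filling;
    # here we hard-code the talk's running example.
    return SemanticFrame(intent="request_movie",
                         slots={"genre": "action", "date": "this weekend"})

def dialogue_policy(frame: SemanticFrame) -> SystemAction:
    # The policy consults the dialogue state (and a database) to pick an action.
    if "location" not in frame.slots:
        return SystemAction(act="request_location")
    return SystemAction(act="inform_movies")

def nlg(action: SystemAction) -> str:
    templates = {"request_location": "Where are you located?"}
    return templates.get(action.act, "...")

frame = language_understanding("are there any action movies to see this weekend")
print(nlg(dialogue_policy(frame)))   # -> Where are you located?
```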
Recent Advances in NLP
◉ Contextual Embeddings (ELMo & BERT)
○ Pre-trained language models boost performance on many understanding tasks
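As a concrete illustration of "contextual": the same word receives different vectors in different contexts. A minimal sketch, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (the slide names BERT only):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["book a flight to paris", "i read a good book"]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)

# "book" is token 1 in the first sentence (after [CLS]) and token 5 in the second.
v1, v2 = hidden[0, 1], hidden[1, 5]
print(torch.cosine_similarity(v1, v2, dim=0))   # < 1: context-dependent vectors
```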
Example: recognition errors can change the meaning entirely; the spoken query "List all flights tomorrow" may be recognized as "Lift all lights to Morocco".
Task-Oriented Dialogue Systems (Young, 2000)
(pipeline repeated from earlier; the following slides zoom in on speech recognition and language understanding)
Mismatch between Written and Spoken Languages
◉ Training: written language
◉ Testing: spoken language, which includes recognition errors
◉ Goal: ASR-robust contextualized embeddings
✓ learn spoken contextualized word embeddings
✓ achieve better performance on spoken language understanding tasks
Solution: LatticeLM
(Huang & Chen, ACL 2020)
Chao-Wei Huang and Yun-Nung Chen, "Learning Spoken Language Representations with Neural Lattice Language Modeling," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
ASR Lattices for Preserving Uncertainty
◉ Idea: lattices may include the correct words even when the 1-best hypothesis does not
[Lattice figure: <s> cheapest (1.0) → {airfare (0.4) | fair (0.3) | affair (0.3) → air (1.0)} → to (1.0) Milwaukee (1.0) </s>; arc weights are ASR posterior probabilities]
Prior work shows that LatticeRNN helps (Ladhak et al., 2016); LM pre-training helps further.
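The lattice above is just a weighted DAG. A short sketch of this representation (node names and the exact structure are my reconstruction of the figure):

```python
# The slide's lattice as a weighted DAG: each arc carries a word and its
# ASR posterior; alternative paths preserve recognition uncertainty.
lattice = {
    "<s>": [("cheapest", "n1", 1.0)],
    "n1":  [("airfare", "n2", 0.4), ("fair", "n2", 0.3), ("affair", "n3", 0.3)],
    "n3":  [("air", "n2", 1.0)],
    "n2":  [("to", "n4", 1.0)],
    "n4":  [("Milwaukee", "</s>", 1.0)],
}

def paths(node, prefix=(), prob=1.0):
    """Enumerate all hypotheses in the lattice with their path probabilities."""
    if node == "</s>":
        yield " ".join(prefix), prob
        return
    for word, nxt, p in lattice.get(node, []):
        yield from paths(nxt, prefix + (word,), prob * p)

for hyp, p in sorted(paths("<s>"), key=lambda x: -x[1]):
    print(f"{p:.2f}  {hyp}")
# 0.40  cheapest airfare to Milwaukee   <- contains the correct words
# 0.30  cheapest fair to Milwaukee
# 0.30  cheapest affair air to Milwaukee
```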
Lattice Language Modeling
1) A LatticeLSTM encodes the nodes of a lattice
2) The goal is to predict the outgoing transitions (words) given a node's representation
◉ A one-hypothesis lattice reduces to normal language modeling
[Figure: a LatticeLSTM encodes each lattice node; a linear layer over the node representation predicts the words on its outgoing arcs, weighted by arc posteriors (e.g., "the", 1.0)]
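A minimal sketch of the lattice LM objective as I read it from the slide (a paraphrase, not the paper's code): a linear layer on a node's representation scores the next word, and the loss is the negative log-likelihood of each outgoing arc's word, weighted by its posterior.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 128          # sizes are mine
proj = nn.Linear(hidden, vocab_size)    # the "Linear" box on the slide

def lattice_lm_loss(node_state, arc_words, arc_probs):
    """node_state: (hidden,); arc_words: (A,) word ids; arc_probs: (A,)."""
    logp = torch.log_softmax(proj(node_state), dim=-1)   # (vocab,)
    return -(arc_probs * logp[arc_words]).sum()          # posterior-weighted NLL

# One node with two outgoing arcs, e.g. "airfare" (0.8) and "fair" (0.2):
h = torch.randn(hidden)                                  # stand-in node state
loss = lattice_lm_loss(h, torch.tensor([17, 42]), torch.tensor([0.8, 0.2]))
loss.backward()
# With a single outgoing arc of probability 1.0, this reduces to the
# standard next-word cross-entropy, as the slide notes.
```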
Issue: LatticeLSTM runs prohibitively slowly
Efficient Two-Stage Pre-Training
Stage 1: Pre-Training on Sequential Texts
• A standard LSTM language model is trained on written text with the next-word objective (input "What a day" → targets "a day <EOS>" through a linear output layer)
Stage 2: Pre-Training on Lattices
• The pre-trained weights initialize a LatticeLSTM, which continues pre-training with the lattice LM objective
• For downstream classification, node representations are max-pooled and the whole model is fine-tuned
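A minimal sketch of the weight-transfer idea behind the two stages (class names and sizes are mine; the actual LatticeLSTM update is more involved):

```python
import torch
import torch.nn as nn

class SeqLM(nn.Module):                      # Stage 1 model
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

class LatticeEncoder(nn.Module):             # Stage 2 model (same shapes)
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.cell = nn.LSTMCell(dim, dim)    # a full LatticeLSTM would pool
        self.out = nn.Linear(dim, vocab)     # states over incoming arcs,
                                             # weighted by arc posteriors

seq_lm = SeqLM()
# ... Stage 1: train seq_lm with the usual next-word objective ...
lattice_enc = LatticeEncoder()
lattice_enc.load_state_dict(seq_lm.state_dict())  # warm start from Stage 1
# ... Stage 2: continue pre-training on lattices with the lattice LM loss,
# then fine-tune with max-pooled node representations plus a classifier.
```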
Spoken Language Understanding Results
◉ Intent Prediction
○ Word Error Rate: 45.6% (SNIPS); 15.6% (ATIS)
[Bar chart: intent prediction accuracy (80–100%) on ATIS and SNIPS for 1-Best, 1-Best + LatticeLSTM, and LatticeLM]
Spoken Language Understanding Results
◉ Dialogue Act Prediction
○ Word Error Rate: 32.0% (MRDA); 28.4% (SWDA)
[Bar chart: dialogue act prediction accuracy (50–75%) on SWDA and MRDA for 1-Best, 1-Best + LatticeLSTM, and LatticeLM]
What if we do not have ASR lattices?
Solution:
Learning ASR-Robust Embeddings
(Huang & Chen, ICASSP 2020)
Chao-Wei Huang and Yun-Nung Chen, "Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding," in Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
ASR-Robust Contextualized Embeddings
◉ Confusion-Aware Fine-Tuning (sketched below)
○ Supervised: confusion pairs derived with reference transcripts
○ Unsupervised: no reference transcripts required
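The slide gives no implementation detail, so this is only a hedged sketch of one plausible reading of the confusion-aware objective: pull contextualized embeddings of acoustically confusable words together. The `encode` stand-in, the pair format, and the cosine loss are all my assumptions.

```python
import torch
import torch.nn.functional as F

def confusion_loss(encode, pairs):
    """pairs: (sentence_a, index_a, sentence_b, index_b) confusable tokens."""
    losses = []
    for sent_a, i, sent_b, j in pairs:
        ha, hb = encode(sent_a)[i], encode(sent_b)[j]
        losses.append(1 - F.cosine_similarity(ha, hb, dim=0))
    return torch.stack(losses).mean()

def encode(sentence):                        # toy encoder: one vector per word
    torch.manual_seed(abs(hash(sentence)) % (2**31))
    return torch.randn(len(sentence.split()), 64)

# Supervised: pairs come from aligning ASR hypotheses against reference
# transcripts ("fare" misrecognized as "fair"). Unsupervised: pairs come
# from acoustic/phonetic similarity alone, without references.
pairs = [("the fair price", 1, "the fare price", 1)]
print(confusion_loss(encode, pairs))
```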
Spoken Language Understanding Results
◉ Air Travel Information System (ATIS)
○ Word Error Rate: 16.4%
[Bar chart: intent accuracy and slot F1 on ATIS (90–99%) for the baseline, + LM fine-tuning, + LM + Confusion (Supervised), and + LM + Confusion (Unsupervised)]
Task-Oriented Dialogue Systems (Young, 2000)
(pipeline repeated from earlier; the following slides zoom in on language understanding and natural language generation)
Natural Language Understanding (NLU)
◉ Parse natural language into structured semantics
Natural Language: "McDonald's is a cheap restaurant nearby the station."
→ NLU →
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"
Natural Language Generation (NLG)
◉ Construct natural language based on structured semantics
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"
→ NLG →
Natural Language: "McDonald's is a cheap restaurant nearby the station."
Duality between NLU and NLG
Natural Language: "McDonald's is a cheap restaurant nearby the station."
⇄ NLG / NLU ⇄
Semantic Frame: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"

How can we leverage this dual relationship?
Solution:
Dual Supervised Learning for NLU & NLG
(Su et al., ACL 2019)
Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen, “Dual Supervised Learning for Natural Language Understanding and Generation,” in Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
DSL: Dual Supervised Learning (Xia et al., 2017)
◉ Originally proposed for machine translation
◉ Consider two domains X and Y, and two tasks X → Y (model θ_{x→y}) and Y → X (model θ_{y→x})
We have P(x, y) = P(x|y) P(y) = P(y|x) P(x)
Ideally, P(x, y) = P(x|y; θ_{y→x}) P(y) = P(y|x; θ_{x→y}) P(x)
Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu, "Dual Supervised Learning," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Dual Supervised Learning
◉ Exploit the duality by forcing the models to satisfy the probabilistic constraint P(x|y; θ_{y→x}) P(y) = P(y|x; θ_{x→y}) P(x)
Objective functions:
min_{θ_{x→y}} E[ l₁(f(x; θ_{x→y}), y) ] + λ_{x→y} · l_duality
min_{θ_{y→x}} E[ l₂(g(y; θ_{y→x}), x) ] + λ_{y→x} · l_duality
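To make the regularizer concrete, here is a minimal sketch with toy numbers (in the actual method the four log-probabilities come from the two task models and two pre-trained marginal estimators):

```python
import torch

def duality_loss(logp_x, logp_y_given_x, logp_y, logp_x_given_y):
    """Penalize violation of log P(x) + log P(y|x) = log P(y) + log P(x|y)."""
    return (logp_x + logp_y_given_x - logp_y - logp_x_given_y) ** 2

# Toy values standing in for model outputs (NLG supplies log P(y|x),
# NLU supplies log P(x|y), marginal estimators supply log P(x), log P(y)):
lp_x, lp_y = torch.tensor(-12.3), torch.tensor(-4.1)
lp_y_x, lp_x_y = torch.tensor(-3.7), torch.tensor(-11.2)

l_dual = duality_loss(lp_x, lp_y_x, lp_y, lp_x_y)
print(l_dual)   # ~tensor(0.4900)
# Each task adds this as a soft penalty to its own supervised loss:
# L_xy = task_loss_xy + lambda_xy * l_dual ;  L_yx = task_loss_yx + lambda_yx * l_dual
```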
How do we model the marginal distributions of X and Y?
Dual Supervised Learning
◉ Let’s go back to NLU and NLG
Natural Language x: "McDonald's is a cheap restaurant nearby the station." (marginal: log P(x))
⇄ NLG / NLU ⇄
Semantic Frame y: RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station" (marginal: log P(y))
Marginal of Natural Language: log P(x)
◉ Language modeling with a GRU: estimate P(x_d | x_1, …, x_{d-1}) at each position; log P(x) is the sum of the per-word log-probabilities
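A minimal GRU language-model sketch for log P(x) (all sizes are mine):

```python
import torch
import torch.nn as nn

class GRULM(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def log_prob(self, ids):                  # ids: (1, T) token ids
        h, _ = self.gru(self.emb(ids[:, :-1]))
        logp = torch.log_softmax(self.out(h), dim=-1)
        # log P(x) = sum_d log P(x_d | x_1..x_{d-1})
        return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum()

lm = GRULM()
print(lm.log_prob(torch.tensor([[1, 5, 9, 2]])))   # scalar log P(x)
```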
Marginal of Semantic Frame: log P(y)
◉ We treat NLU as a multi-label classification problem
◉ Each label is a slot-value pair
Example: the frame {RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"} becomes a binary label vector (0, 1, …, 0, 1)
How do we model the marginal distribution of y?
Marginal of Semantic Frame: log P(y)
◉ Naïve approach (sketched below)
○ Estimate the prior probability P(y_i) of each label on the training set
○ P(y) = ∏_i P(y_i)
Assumption: labels are independent
Example labels: Restaurant: "McDonald's", "KFC", "PizzaHut"; Price: "cheap", "expensive"; Food: "Pizza", "Hamburger", "Chinese"
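A tiny illustration of the naïve estimator on a toy training set (the frames below are invented for illustration):

```python
from collections import Counter

# Naïve marginal: treat each slot-value label as independent and multiply
# its empirical prior from the training set.
train_frames = [
    {"Restaurant=McDonald's", "Price=cheap"},
    {"Restaurant=KFC", "Price=cheap", "Food=Hamburger"},
    {"Restaurant=PizzaHut", "Food=Pizza"},
]
counts = Counter(label for frame in train_frames for label in frame)
N = len(train_frames)

def naive_prob(frame):
    p = 1.0
    for label in frame:
        p *= counts[label] / N        # P(y_i): empirical prior
    return p                          # P(y) = prod_i P(y_i), independence assumed

print(naive_prob({"Restaurant=McDonald's", "Price=cheap"}))  # (1/3)*(2/3) ~ 0.22
```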
Marginal of Semantic Frame: log P(y)
◉ Masked autoencoder for distribution estimation (MADE)
[Figure: MADE network with per-unit degrees (inputs 2 1 3; hidden 1 2 2 1; outputs 2 1 3) determining which connections are masked]
Masking certain connections introduces a sequential (autoregressive) dependency among labels, which yields the marginal distribution of y
M. Germain, K. Gregor, I. Murray, and H. Larochelle, "MADE: Masked Autoencoder for Distribution Estimation," in Proceedings of the International Conference on Machine Learning (ICML), 2015.
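A compact sketch of MADE's masking rule (sizes and initialization are mine): per-unit degrees force output d to depend only on inputs with degree less than d, so a single forward pass yields all the conditionals of P(y) = ∏_d P(y_d | y_<d).

```python
import torch

D, H = 5, 16                                   # number of labels, hidden units
m_in = torch.arange(1, D + 1)                  # input degrees 1..D
m_hid = torch.randint(1, D, (H,))              # hidden degrees in [1, D-1]

mask1 = (m_hid[:, None] >= m_in[None, :]).float()   # (H, D) input->hidden
mask2 = (m_in[:, None] > m_hid[None, :]).float()    # (D, H) hidden->output

W1, W2 = torch.randn(H, D) * 0.1, torch.randn(D, H) * 0.1

def made(y):                                   # y: (batch, D) binary labels
    h = torch.relu(y @ (W1 * mask1).T)
    return torch.sigmoid(h @ (W2 * mask2).T)   # per-dim P(y_d = 1 | y_<d)

y = torch.randint(0, 2, (1, D)).float()
p = made(y).clamp(1e-6, 1 - 1e-6)
log_py = (y * p.log() + (1 - y) * (1 - p).log()).sum()
print(log_py)                                  # log P(y) under the estimator
```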
Model architectures:
• NLU: a GRU reads the sentence "McDonald's is … station"; a linear layer on its final state outputs the binary label vector (0, 1, …, 0, 1)
• NLG: conditioned on the label vector (0, 1, …, 0, 1), a GRU decoder starts from <BOS> and generates "McDonald's is … <EOS>"
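To make the two dual models concrete, here is a minimal sketch (assuming PyTorch; all dimensions and names are mine):

```python
import torch
import torch.nn as nn

V, L, D = 1000, 30, 128                        # vocab, #labels, hidden size

class NLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb, self.gru = nn.Embedding(V, D), nn.GRU(D, D, batch_first=True)
        self.out = nn.Linear(D, L)
    def forward(self, words):                  # (B, T) -> (B, L) label logits
        _, h = self.gru(self.emb(words))
        return self.out(h[-1])

class NLG(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb, self.gru = nn.Embedding(V, D), nn.GRU(D, D, batch_first=True)
        self.init = nn.Linear(L, D)            # labels -> initial decoder state
        self.out = nn.Linear(D, V)
    def forward(self, labels, words_in):       # teacher-forced decoding
        h0 = torch.tanh(self.init(labels)).unsqueeze(0)
        o, _ = self.gru(self.emb(words_in), h0)
        return self.out(o)                     # (B, T, V) next-word logits

nlu, nlg = NLU(), NLG()
words = torch.randint(0, V, (2, 7))
labels = torch.sigmoid(nlu(words))
print(nlg((labels > 0.5).float(), words).shape)   # torch.Size([2, 7, 1000])
```

In dual supervised learning, these two models are trained jointly: each keeps its own supervised loss, and the duality regularizer sketched earlier ties them together through log P(x) from the GRU language model and log P(y) from MADE.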