End-to-End Conversational AI

(1)

Slido: #ADL2021

End-to-End Conversational AI

Applied Deep Learning

June 7th, 2021 http://adl.miulab.tw

Slides credited from NeurIPS 2020 Tutorial

(2)


Why and When Do We Need Conversational AI?

“I want to chat” ⇒ Turing Test (talk like a human)

“I have a question” ⇒ Information consumption

“I need to get this done” ⇒ Task completion

“What should I do?” ⇒ Decision support

Social Chit-Chat vs. Task-Oriented Dialogues

• Is this course good to take?

• Book me the train ticket from Kaohsiung to Taipei

• Reserve a table at Din Tai Fung for 5 people, 7PM tonight

• Schedule a meeting with Vivian at 10:00 tomorrow

• What is today’s agenda?

• What does NLP stand for?

(3)

Two Branches of Conversational AI

Chit-Chat

Task-Oriented

(4)

Vanilla Seq2Seq ConvAI: How

A simple 4-step recipe:

1. Choose the data: Human-to-human conversations

2. Choose the model: Large pre-trained language models are preferable

3. Train the model with the data: Supervised learning

4. Evaluate your model: Automatic or human evaluation

(5)

Human1: Ok, I’ll try that.

Human2: Is there anything else bothering you?

Human1: Just one more thing. A school called me this morning to see if I could teach a few classes this weekend and I don’t know what to do.

Human2: Do you have any other plan this weekend?

Human1: I’m supposed to work on a paper that’s due on Monday.

Human-to-Human Conversations:

● Daily Dialog

● Ubuntu Dialogue Corpus

● Twitter Conversations

● Reddit Conversational Data

● OpenSubtitles

These datasets are pre-processed to have only 2 speakers ⇒ usually no more than 2 turns

Vanilla Seq2Seq ConvAI: Datasets

(6)

Vanilla Seq2Seq conversational model (Vinyals and Le, 2015; Shang et al., 2015)

Causal Decoder (Wolf et al., 2019; Radford et al., 2018)

Vanilla Seq2Seq ConvAI: Models

(7)

Vanilla Seq2Seq ConvAI: Supervised Learning

Maximum Likelihood Estimation (MLE)

⇒ maximizing the conditional probability of the response given the history

⇒ The model output is a probability distribution over the vocab
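As a minimal sketch of the MLE objective above (not taken from the original slides), the loss for one training pair is the negative log-likelihood of each target token under the decoder's softmax output. The toy vocabulary, the logits, and the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Turn raw decoder scores into a probability distribution over the vocab."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def response_nll(step_logits, target_ids):
    """Negative log-likelihood of the target response, summed over decoding steps."""
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[target])
    return nll

# Hypothetical 4-word vocabulary and made-up decoder logits for 4 steps.
VOCAB = ["<eos>", "i", "am", "fine"]
step_logits = [
    [0.1, 2.0, 0.3, 0.2],
    [0.0, 0.1, 1.5, 0.4],
    [0.2, 0.0, 0.1, 1.8],
    [2.2, 0.1, 0.0, 0.3],
]
target_ids = [1, 2, 3, 0]  # "i am fine <eos>"
loss = response_nll(step_logits, target_ids)
```

Minimizing this quantity over a dataset of (history, response) pairs is equivalent to maximizing the conditional probability of the response given the history.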

[Figure: at each step the decoder output goes through a softmax; the resulting probabilities are compared with the target sequence]

(8)

[Figure: at each step the decoder output goes through a softmax, and argmax picks the next token]

Vanilla Seq2Seq ConvAI: Greedy Decoding
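Greedy decoding can be sketched as follows. The "model" here is a hypothetical lookup table that returns next-token probabilities given only the previous token; a real decoder would condition on the full history.

```python
def greedy_decode(next_token_probs, bos_id, eos_id, max_len=20):
    """Repeatedly append the argmax token until <eos> or max_len is reached."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]  # drop the <bos> token

VOCAB = ["<bos>", "<eos>", "i", "am", "fine"]

# Hypothetical bigram table: key = previous token id, value = next-token probs.
TABLE = {
    0: [0.0, 0.1, 0.7, 0.1, 0.1],
    2: [0.0, 0.1, 0.1, 0.6, 0.2],
    3: [0.0, 0.2, 0.1, 0.1, 0.6],
    4: [0.0, 0.8, 0.1, 0.05, 0.05],
}

def toy_model(prefix):
    return TABLE[prefix[-1]]

decoded = greedy_decode(toy_model, bos_id=0, eos_id=1)
words = [VOCAB[i] for i in decoded]
```

Because the argmax is deterministic, the same history always yields the same response.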

(10)

Sci-Fi Short Film - SUNSPRING

https://www.youtube.com/watch?v=LY7x2Ihqj

(11)

Limitation: Lack of Diversity

Human: What are you doing today?

System: I don’t know

Human: What is your name?

System: I don’t know

Human: What is your favourite food?

System: I don’t know

(12)

Limitation: Lack of Consistency

Human: Where were you born?

System: I was born in Canada.

Human: Where are you from?

System: England, you?

Human: Where did you grow up?

System: I grew up in Texas.

(13)

Limitation: Lack of Knowledge

Human: What is the weather like today?

System: I don’t know, maybe sunny?

Human: I am going out to Taylor Swift's concert today, I hope the weather is good.

System: Who is Taylor Swift?

Human: Okay, never mind.

(14)

Limitation: Lack of Empathy

Human: I messed up my project today.

System: What is your project?

Human: I think I might get fired...

System: Then you need to find another job!

Human: 😩

(15)

Limitation: Lack of Controllability

Human: Hi, how are you?

Model1: I am good thanks

Model2: I had really a bad day

Model3: I am okay, how was your day?

Model4: I am okay, I just finished my training session in the swimming pool

→ no control over style, topics, engagement, or toxic and inappropriate responses

(16)

Limitation: Lack of Versatility

NLU: Natural Language Understanding

DST: Dialogue State Tracking

DP: Dialogue Policy Learning

NLG: Natural Language Generation

H: Can you help me book a 5-star hotel on Sunday?

System: For how many people?

→ cannot handle task-oriented conversations due to API requirement

(17)

Limitation: Lack of Global Optimization

◉ Turn-level optimization

[Figure: encoder-decoder. Source (encoder input): conversation history “… because of your game?”. Target (decoder output): response “Yeah I’m on my way EOS”.]

(18)

Limitations of Vanilla Seq2Seq: Summary

1. Lack of diversity

2. Lack of consistency

3. Lack of knowledge

4. Lack of empathy

5. Lack of controllability

6. Lack of versatility

7. Lack of global optimization

◉ These limitations of vanilla seq2seq make human-machine conversations boring and shallow. How can we overcome these limitations and move towards deeper conversational AI?

(20)

Limitation 1: Lack of Diversity

‘tis a fine brew on a day like this! Strong though, how many is sensible?

I'm not sure yet, I'll let you know !

Milan apparently selling Zlatan to balance the books... Where next, Madrid?

I don’t know.

Wow sour starbursts really do make your mouth water... mm drool.

Can I have one?

Of course!

Well he was on in Bromley a while ago... still touring.

I don't even know what he's talking about.

32% of responses are generic and meaningless

“I don’t know”

“I don’t know what you are talking about”

“I don’t think that is a good idea”

“Oh my god”

(21)

Solution: Diversify Responses

1. Training and Decoding strategy ⇒ Maximum Mutual Information (MMI)

2. Model architecture ⇒ Conditional Variational Autoencoder (CVAE)

3. More data & Larger models ⇒ Large scale pre-training

4. Decoding strategy ⇒ Top-k sampling, Nucleus Sampling

(22)

MMI for Response Diversity

(Li et al., 2016)

Depends on how much you drink!

I think he'd be a good signing.

Can I have one?

Of course you can! They’re delicious!

I’ve never seen him live.
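A rough sketch of MMI-style reranking, assuming the anti-language-model form of the objective in Li et al. (2016): score(T) = log p(T|S) − λ log p(T). The candidate probabilities below are made-up numbers for illustration only, not model outputs.

```python
import math

def mmi_antilm_score(log_p_t_given_s, log_p_t, lam=0.5):
    """Penalize responses that are likely regardless of the source message."""
    return log_p_t_given_s - lam * log_p_t

# Hypothetical candidates: (log p(T|S), log p(T)).
candidates = {
    "I don't know.": (math.log(0.30), math.log(0.20)),
    "Depends on how much you drink!": (math.log(0.20), math.log(0.001)),
}

def rerank(cands, lam):
    """Return the candidate with the highest MMI score."""
    return max(cands, key=lambda t: mmi_antilm_score(*cands[t], lam=lam))

best_mle = rerank(candidates, lam=0.0)  # plain likelihood favors the generic reply
best_mmi = rerank(candidates, lam=0.5)  # MMI favors the specific reply
```

With λ = 0 the score reduces to plain likelihood, which is why generic replies win under standard MLE decoding.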

(24)

Diversify by Large-Scale Pretraining

Text Pre-trained: BART, T5

Dialogue Pre-trained: Meena, BlenderBot, BST

[Figure: the pre-trained weights initialize an encoder-decoder that maps dialogue history to response]

(25)

Diversify by Large-Scale Pretraining

Text Pre-trained: GPT-1/2/3

Dialogue Pre-trained: DialoGPT

[Figure: the pre-trained weights initialize a causal decoder that maps dialogue history to response]

(26)

● Compared to beam search, humans are more likely to sample “low probability” tokens.

● Nucleus Sampling tries to recover the human sampling process by sampling only from the smallest set of top-ranked tokens whose cumulative probability exceeds a threshold p (top-p).

Ref: The Curious Case of Neural Text Degeneration

Diversify by Nucleus Sampling
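The two decoding strategies can be sketched in plain Python as below. Top-k keeps a fixed number of tokens, while nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches p. The probabilities and function names are illustrative assumptions.

```python
import random

def renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token id from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]          # made-up next-token distribution
top2 = top_k_filter(probs, k=2)         # keeps tokens 0 and 1
nucleus = nucleus_filter(probs, p=0.9)  # keeps tokens 0, 1 and 2
token = sample(nucleus, random.Random(0))
```

Unlike top-k, the size of the nucleus changes from step to step: it is small when the model is confident and large when the distribution is flat.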

(27)

Diversify by Nucleus Sampling

Figure from: https://huggingface.co/blog/how-to-generate

Time step 1 Time step 2

(29)

Limitation 2: Lack of Consistency

(30)

1. Learning speaker embedding:

■ Speaker Model

2. Conditioning on persona descriptions:

■ PersonaChat Dataset

■ TransferTransfo Model

Solution: Personalization

(31)

Personalization via Speaker Model

[Figure: Speaker Model. The input “where do you live EOS” is decoded into “in england . EOS”, with the speaker embedding of “Rob” fed to the decoder at each step. Word embeddings (50k): england, london, u.s., great, good, stay, live, okay, monday, tuesday. Speaker embeddings (70k): Rob_712, skinnyoflynny2, Tomcoatez, Kush_322, D_Gomes25, Dreamswalls, kierongillen5, TheCharlieZ, The_Football_Bar, This_Is_Artful, DigitalDan285, Jinnmeow3, Bob_Kelly2.]

(32)

Persona Model for Consistency

(Li et al., 2016)

Baseline model → inconsistency

Persona model using speaker embedding → consistency

(33)

Personalization Datasets

Persona Info Human2:

- I like to ski.

- I am 25 years old

Human1: Hi, what do you do in your free time?

Human2: I enjoy going to the mountain and skiing

Human1: That’s cool, you should be young and strong for this activity!

Human2: oh yeah, I am 25 🤗

Human-to-Human Conversations + Persona Features

● Persona Chat

● Twitter Personalized

● Learning Personalized End-to-End Goal-Oriented Dialog

(34)

Personalization via TransferTransfo Model

- Fine-tune GPT with conversational data (Persona-Chat)

- Formulate persona, history and reply as a single sequence
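The single-sequence formulation can be sketched as below. This is an assumption-laden illustration: the special token names (`<bos>`, `<eos>`, `<speaker1>`, `<speaker2>`) and whitespace tokenization are placeholders, not necessarily what the TransferTransfo implementation uses.

```python
def build_input(persona, history, reply,
                bos="<bos>", eos="<eos>", spk1="<speaker1>", spk2="<speaker2>"):
    """Concatenate persona, dialogue history and reply into one token sequence."""
    # Segment 0: all persona sentences, prefixed by the beginning-of-sequence token.
    segments = [[bos] + [w for sentence in persona for w in sentence.split()]]
    turns = history + [reply]
    for i, utterance in enumerate(turns):
        # The reply (last turn) belongs to speaker 2; earlier turns alternate backwards.
        speaker = spk2 if (len(turns) - i) % 2 == 1 else spk1
        segments.append([speaker] + utterance.split())
    segments[-1].append(eos)
    return [token for segment in segments for token in segment]

tokens = build_input(
    persona=["I like to ski.", "I am 25 years old"],
    history=["Hi, what do you do in your free time?"],
    reply="I enjoy going to the mountain and skiing",
)
```

A decoder-only model can then be fine-tuned on such sequences with the usual next-token objective.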

[Figure: decoder-only model. Input: persona description + dialogue history. Output: response.]

(36)

Limitation 3: Lack of Knowledge

Any recommendation?

The weather is so depressing these days.

I know, I dislike rain too.

What about a day trip to eastern Washington?

Try Dry Falls, it’s spectacular!

Social Chat

Engaging, Human-Like Interaction (Ungrounded)

Task-Oriented

Task Completion, Decision Support (Grounded)

(37)

Conversation and Non-Conversation Data

You know any good Japanese restaurant in Seattle?

Try Kisaku, one of the best sushi restaurants in the city.

You know any good A restaurant in B?

Try C, one of the best D in the city.

Conversation Data

Knowledge Resource

(38)

Solution: Knowledge

1. Textual Knowledge

⇒ Retrieving knowledge from Wikipedia, news, etc.

2. Graph Knowledge

⇒ Retrieving subgraph from knowledge graphs

3. Tabular Knowledge

⇒ Incorporate tabular information

4. Service API Interaction

⇒ Generate an API query and incorporate the API results into the response

(39)

Textual Knowledge

Human: My favorite color is blue.

Wizard: Same! Blue is one of the three primary colours.

Human: I am trying to recall, where does blue fall on the spectrum of visible light?

Textual Knowledge:

Blue is one of the three primary colours in the RGB colour model. It lies between violet and green on the spectrum of visible light.

Wizard: It is right between violet and green.

Human-to-Human Conversations + Textual Knowledge

● Wizard of Wikipedia

● CoQA

● Topical-Chat

● CMUDoG

● Holl-E

● ConversingByReading

(40)

Models with Textual Knowledge

[Figure: the dialogue history is used to retrieve from the textual knowledge; an encoder-decoder takes the dialogue history and the retrieved knowledge and generates the response]

Retrieval Methods:

- IR Systems: TF-IDF, BM25

- Neural Retriever: DPR

(41)

1. Use TF-IDF to retrieve documents related to the dialogue context

2. Encode the retrieved documents independently

3. Use the dialogue history as a query to assign different weights to the documents

4. The decoder generates the response

Generative Transformer Memory Network

Knowledge: IR Systems + Model
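The retrieval step of such a pipeline can be sketched with a small pure-Python TF-IDF ranker. This only illustrates TF-IDF retrieval with cosine similarity, not the Generative Transformer Memory Network itself; the tokenizer, the smoothed IDF formula, and the tiny knowledge pool are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}  # smoothed IDF
    vecs = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, top_k=1):
    """Rank documents by cosine similarity to the query in TF-IDF space."""
    vecs, idf = tfidf_vectors(docs)
    q = {w: tf * idf.get(w, 0.0) for w, tf in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return [docs[i] for i in ranked[:top_k]]

knowledge = [
    "Blue is one of the three primary colours in the RGB colour model.",
    "It lies between violet and green on the spectrum of visible light.",
    "Kisaku is a sushi restaurant in Seattle.",
]
top = retrieve("Where does blue fall on the spectrum of visible light?", knowledge)
```

The retrieved sentences would then be encoded and passed to the decoder together with the dialogue history.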

(42)

Human-to-Human Conversations + Knowledge Graph

● OpenDialKG

● DyKgChat

● KdConv

● Commonsense Knowledge Aware Conversation Generation with Graph Attention

● Enhancing Dialog Coherence with Event Graph Grounded Content Planning

Graph Knowledge

(43)

Dialogue History

Subgraph

Subgraph Retrieval:

● All knowledge triples mentioned in a dialogue (1 hop reasoning)

● Neural Retriever (multihop reasoning)

Knowledge graph in triple format:

(entity1, relation, entity2)

Models with Graph Knowledge
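A minimal sketch of rule-based subgraph retrieval over triples, assuming the entities mentioned in the dialogue have already been extracted. The toy graph and relation names are hypothetical; a neural retriever would instead learn which edges to follow.

```python
def n_hop_subgraph(triples, entities, hops=1):
    """Collect all triples reachable within `hops` steps of the mentioned entities."""
    frontier = set(entities)
    selected = []
    for _ in range(hops):
        new = [
            triple for triple in triples
            if (triple[0] in frontier or triple[2] in frontier) and triple not in selected
        ]
        selected.extend(new)
        # Entities touched in this hop become starting points for the next hop.
        for head, _relation, tail in new:
            frontier.add(head)
            frontier.add(tail)
    return selected

# Hypothetical knowledge graph in (entity1, relation, entity2) format.
KG = [
    ("D", "friend", "C"),
    ("C", "lover", "E"),
    ("B", "enemy", "F"),
]

one_hop = n_hop_subgraph(KG, ["D"], hops=1)
two_hop = n_hop_subgraph(KG, ["D"], hops=2)
```

The retrieved triples can then be linearized or embedded and given to the response generator alongside the dialogue history.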

(44)

DyKgChat: Quick Adaptive Model (Qadpt)

[Figure: Qadpt has three components: 1. a Seq2Seq model (encoder; decoder states d_1, d_2, …, d_t; output projection giving the generic words distribution), 2. a controller that mixes the generic words distribution with the graph entity distribution, and 3. a reasoning model that predicts a transition matrix T_t and performs N-hop reasoning from the extracted seed entities over the graph's adjacency matrix (entities B, C, D, E, F; relations enemy, lover, friend). Example dialogue: “D talked to C: Afterwards, don't wait for him at the door. It is cool in autumn. You may get a cold.” “C: I have promised to wait for E.”]

(45)

- Take all the entities mentioned in the dialogue as starting nodes

- Learn the reasoning path over the graph with supervision, using graph attention

OpenDialKG Walker : Subgraph Retrieval

(46)

Tabular Knowledge

Human-to-Human Conversations + Table Knowledge

● SMD

● Camrest

● bAbI-Dialogues

(47)

Models with Tabular Knowledge

Input: Dialogue History

Examples: Mem2Seq, Neural Assistant, KVR

(48)

External Service API Interaction

Human-to-Human Conversations + Table Knowledge

● bAbI

● Camrest

● MultiWoz

● CrossWoz

● Schema Guided Dialogue

● TaskMaster 1-2-3

● STAR

(49)

Models with Service API

Dialogue History → Language Model → API query (dialogue state) → Service API → Results (KB) → Language Model → Response

Example 1

H: I want to book a cheap restaurant

API query: Query(restaurant_price: cheap)

API results: 1050 matches, {name: pizza hut, price: cheap}, …...

System: There are 1050 cheap restaurants, which location do you prefer?

Example 2

H: I want to book a pizza hut for 3 people

API query: Book(restaurant_name: pizza hut, restaurant_people: 3)

API results: Book: success, Reference Number: 32bhj32n

System: Your booking is successful, the reference number is 32bhj32n
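The query, lookup and response loop can be sketched as below. In the models discussed here a language model generates both the API query and the final response; in this sketch a hypothetical in-memory KB and string templates stand in for those parts.

```python
# Hypothetical in-memory knowledge base standing in for a real service API.
KB = [
    {"name": "pizza hut", "price": "cheap"},
    {"name": "noodle bar", "price": "cheap"},
    {"name": "steak house", "price": "expensive"},
]

def query_api(kb, state):
    """Return every KB row that matches all slot-value pairs in the dialogue state."""
    return [row for row in kb
            if all(row.get(slot) == value for slot, value in state.items())]

def render_response(state, results):
    """Template-based stand-in for the response a language model would generate."""
    if not results:
        return "Sorry, I found no matching restaurant."
    if len(results) > 1:
        price = state.get("price", "matching")
        return f"There are {len(results)} {price} restaurants, which location do you prefer?"
    return f"How about {results[0]['name']}?"

# Dialogue state for "I want to book a cheap restaurant".
state = {"price": "cheap"}
results = query_api(KB, state)
reply = render_response(state, results)
```

The key design point is that the response is conditioned on the API results, so the model cannot answer correctly without issuing the query first.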

(50)

Models with Service API

Causal Decoder examples: End-to-End GPT2 Neural Pipeline, SimpleToD, SOLOIST

[Figure: a causal decoder reads the dialogue history, interacts with the service API, and uses the KB results to generate the response]

(51)

Models with Service API

Encoder-Decoder examples: Sequicity, DAMD, MinTL

[Figure: an encoder-decoder maps the dialogue history to an API query (dialogue state), sends it to the service API, and uses the KB results to generate the response]

(53)

Limitation 4: Lack of Empathy

Human: I messed up my project today.

System: What is your project?

Human: I think I might get fired...

System: Then you need to find another job!

Human: 😩

(54)

Solution: Empathic Generation

1. Emotional response generation:

■ MojiTalk

■ Emotional Chatting Machine

2. Understand the user’s emotion and respond accordingly:

■ Empathetic Dialogues

■ MoEL

■ CaireBot

(55)

Empathy Dataset

Empathy: understanding the feelings of the conversation partner and replying accordingly.

Dataset: Empathetic Dialogues

(56)

Models with Empathy

Examples: MoEL, EmoPrepend-1, CaireBot

[Figure: an encoder-decoder maps dialogue history to response, with an emotion recognition component]

(57)

https://demo.caire.ust.hk/chatbot

(59)

Limitation 5: Lack of Controllability

Existing large pre-trained models offer no control over:

- Response style

- Topics

- Repetition and specificity

- Response-relatedness

- Engagement by proactively asking questions

[Figure: dialogue pre-trained models (Meena, BlenderBot, DialoGPT) map dialogue history to response]

(60)

1. Controlling low-level attributes

2. Controlling by fine-tuning

3. Controlling by perturbation

4. Controlling by conditioned generation

Solution: Controllability

(61)

Conditional Training + Weighted Decoding

What makes a good conversation? How controllable attributes affect human judgments (See et al., 2019)

Controlling Low-Level Attribute
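Weighted decoding can be sketched as adding weighted feature scores to the model's token log-probabilities at each decoding step. The feature (preferring a question mark to encourage question-asking), its weight, and the toy distribution are assumptions chosen for illustration.

```python
import math

def weighted_decoding_step(log_probs, features, weights):
    """Pick the token maximizing log P(token) + sum_i w_i * f_i(token)."""
    def score(token):
        return log_probs[token] + sum(w * f(token) for f, w in zip(features, weights))
    return max(log_probs, key=score)

def is_question_mark(token):
    """Hypothetical decoding feature that rewards ending with a question."""
    return 1.0 if token == "?" else 0.0

# Made-up next-token distribution over three punctuation tokens.
log_probs = {".": math.log(0.6), "?": math.log(0.3), "!": math.log(0.1)}

plain = weighted_decoding_step(log_probs, [is_question_mark], [0.0])
asking = weighted_decoding_step(log_probs, [is_question_mark], [2.0])
```

The model itself is unchanged; only the decoding scores are adjusted, so the weight can be tuned at inference time without retraining.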

(62)

Multitask conversation data with style data

⇒ No control codes

STYLEDGPT: Stylized Response Generation with Pre-trained Language Models (Yang et al., 2020)

[Figure: DialoGPT maps dialogue history to response and is trained on conversational data with a word-level style loss and a sentence-level style loss]

Controlling by Fine-Tuning

(63)

● Control the generated style with Plug-and-Play LM (PPLM) (Dathathri et al., 2020)

● Distill the generated responses from PPLM into residual adapters (Houlsby et al., 2019)

⇒ Plug-and-Play for 3 styles and 3 topics

Plug-and-Play Conversational Models (Madotto et al., 2020)

[Figure: DialoGPT maps dialogue history to response]

Controlling by Perturbation

(64)

Controlling by Conditioned Generation

Controllable generation architectures in open-domain dialogues:

▪ retrieval + style-controlled generation (Weston et al., 2018)

▪ PPLM (Dathathri et al., 2020)

▪ CTRL (Keskar et al., 2019)

200 style labels (in ConvAI2, EmpatheticDialogues, Wizard of Wikipedia, and BlendedSkillTalk) generated by a classifier trained on Image-Chat

Controlling Style in Generated Dialogue (Smith & Gonzalez-Rico et al., 2020)

(66)

Limitation 6: Lack of Versatility

[Figure: four separate dialogue models, each mapping dialogue history to response: a plain dialogue model, one with textual knowledge, one with a service API and KB results, and one with emotion recognition]

(67)

Solution: ToDs + Chit-Chat

[Figure: a single dialogue model maps dialogue history to response, combining textual knowledge, emotion recognition, and API-query components]

(68)

ToDs + Chit-Chat Datasets

1. Mixing multiple dialogue datasets

2. Multiple dialogue skills ⇒ Collecting datasets that mix skills

3. Mixing Chit-Chat and ToDs ⇒ Collecting data that mixes the two

(69)

Attention over Parameters

[Figure: an encoder reads the dialogue history; a composer combines domain/skill-specific decoders (Chit-Chat, Knowledge Base, Persona, …, API, Restaurant, Hotel, SQL, BOOK) to produce the system response or an API call]

(70)

Adapter-Bot: All-In-One Controllable Model

● Uses a fixed backbone: DialoGPT

● Encodes each dialogue skill with an independently trained adapter

● Able to process multiple knowledge types and styles (8 goal-oriented skills + personalized and empathetic responses)

● A skill manager (BERT) is trained to select which adapter to use

(71)

Putting It All Together

● Attention over Parameters for Dialogue Systems (Madotto et al., 2019)

● Blender-bot: Recipes for building an open-domain chatbot (Roller et al., 2020)

● The Adapter-Bot: All-In-One Controllable Conversational Model (Lin & Madotto et al., 2020)

(73)

Limitation 7: Lack of Global Optimization

Application: State / Action / Reward

● Task Completion Bots (Movies, Restaurants, …): User input + Context / Dialog act + slot-value / Task success rate, # of turns

● Info Bots (Q&A bot over KB, Web etc.): Question + Context / Clarification questions, Answers / Relevance of answer, # of turns

● Social Bot (XiaoIce): Conversation history / Response / Engagement(?)

[Figure: User input (o) → Language understanding → state 𝑠 → Dialogue Manager, 𝑎 = 𝜋(𝑠) → Language (response) generation → Response. Collect rewards (𝑠, 𝑎, 𝑟, 𝑠’) and optimize 𝑄(𝑠, 𝑎).]
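The loop of collecting (𝑠, 𝑎, 𝑟, 𝑠’) tuples and optimizing 𝑄(𝑠, 𝑎) can be illustrated with a tabular Q-learning update. This is a toy sketch: the states, actions, rewards, and hyperparameters are hypothetical, and practical dialogue systems typically use neural function approximators instead of a table.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def greedy_policy(Q, s, actions):
    """a = pi(s): pick the action with the highest estimated value."""
    return max(actions, key=lambda a: Q[(s, a)])

ACTIONS = ["ask_slot", "chitchat"]
Q = defaultdict(float)  # unseen (state, action) pairs start at 0

# Hypothetical collected experience: (s, a, r, s').
experience = [
    ("need_slot", "ask_slot", 1.0, "done"),        # asking completes the task
    ("need_slot", "chitchat", -0.1, "need_slot"),  # chatting costs a turn
]
for s, a, r, s_next in experience:
    q_update(Q, s, a, r, s_next, ACTIONS)

best_action = greedy_policy(Q, "need_slot", ACTIONS)
```

Because the reward is defined over the whole dialogue (task success, number of turns), the learned policy optimizes globally rather than turn by turn.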

(74)

[Table: input messages with the responses of the Supervised Learning Agent and the Reinforcement Learning Agent]

Solution: Deep RL for Optimization

(Li et al., 2016)

◉ RL agent generates more interactive responses

◉ RL agent tends to end a sentence with a question and hand the conversation over to the user

(75)

Concluding Remarks

◉ Limitations of vanilla seq2seq models

1. Lack of diversity

2. Lack of consistency

3. Lack of knowledge

4. Lack of empathy

5. Lack of controllability

6. Lack of versatility

7. Lack of global optimization

◉ Recent trends for addressing the above limitations

(76)

Her (2013)

What can machines achieve now or in the future?
