Towards Open-Domain Conversational AI
Yun-Nung (Vivian) Chen 陳縕儂
http://vivianchen.idv.tw
What can machines achieve now or in the future?
Iron Man (2008)
Language Empowering Intelligent Assistants
Apple Siri (2011) · Google Now (2012) · Microsoft Cortana (2014) · Amazon Alexa/Echo (2014) · Facebook M & Bot (2015) · Google Home (2016) · Google Assistant (2016) · Apple HomePod (2017)
Why and When Do We Need Conversational AI?

“I want to chat” → Turing Test (talk like a human)
“I have a question” → Information consumption
“I need to get this done” → Task completion
“What should I do?” → Decision support

Information consumption:
• What is today’s agenda?
• Which room is the SCAI workshop in?
• What does SCAI stand for?

Task completion:
• Book me a train ticket from Amsterdam to Brussels
• Reserve a table at Din Tai Fung for 5 people, 7PM tonight
• Schedule a meeting with Vivian at 10:00 tomorrow

Decision support:
• Is this Brussels waffle worth trying?
• Is the SCAI workshop good to attend?

The first category is social chit-chat; the other three are task-oriented dialogues.
Intelligent Assistants: Task-Oriented
Conversational Agents: Chit-Chat & Task-Oriented
Task-Oriented Dialogue Systems
JARVIS – Iron Man’s Personal Assistant; Baymax – Personal Healthcare Companion
Task-Oriented Dialogue Systems
(Young, 2000)

Speech Signal → Speech Recognition → Text Input / Hypothesis: “are there any action movies to see this weekend”
→ Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
→ Semantic Frame: request_movie(genre=action, date=this weekend)
→ Dialogue Management (DM): Dialogue State Tracking (DST) and Dialogue Policy, backed by a Backend Database / Knowledge Providers
→ System Action/Policy: request_location
→ Natural Language Generation (NLG) → Text Response: “Where are you located?”
Semantic Frame Representation
Requires a domain ontology: an early connection to the backend
Contains the core concepts: an intent and a set of slots with fillers

Restaurant domain (slots: price, type, location):
“find me a cheap taiwanese restaurant in oakland”
→ find_restaurant(price=“cheap”, type=“taiwanese”, location=“oakland”)

Movie domain (slots: year, genre, director):
“show me action movies directed by james cameron”
→ find_movie(genre=“action”, director=“james cameron”)
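To make the frame structure concrete, here is a toy slot extractor in Python; the keyword lists and the `parse` helper are hypothetical illustrations of the output shape, not a real LU model:

```python
# Toy semantic-frame extraction for the restaurant example above.
# The keyword lists are hypothetical; real systems learn these mappings.
ONTOLOGY = {
    "price": ["cheap", "moderate", "expensive"],
    "type": ["taiwanese", "chinese", "italian"],
    "location": ["oakland", "seattle", "taipei"],
}

def parse(utterance):
    """Return a semantic frame: an intent plus a set of slot fillers."""
    tokens = utterance.lower().split()
    slots = {slot: t for slot, values in ONTOLOGY.items()
             for t in tokens if t in values}
    return {"intent": "find_restaurant", "slots": slots}

frame = parse("find me a cheap taiwanese restaurant in oakland")
print(frame)
# {'intent': 'find_restaurant',
#  'slots': {'price': 'cheap', 'type': 'taiwanese', 'location': 'oakland'}}
```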
Backend Database / Ontology

Movie Name   Theater          Rating  Date        Time
Iron Man     Last Taipei A1   8.5     2018/10/31  09:00
Iron Man     Last Taipei A1   8.5     2018/10/31  09:25
Iron Man     Last Taipei A1   8.5     2018/10/31  10:15
Iron Man     Last Taipei A1   8.5     2018/10/31  10:40

Domain-specific table: a target entity and its attributes (movie name, theater, rating, date, time)
Functionality:
• Information access: find specific entries
• Task completion: find the row that satisfies the constraints
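Task completion over such a table reduces to finding the rows that satisfy all constraints; a minimal sketch, where the `lookup` helper and field names are my own shorthand for the table above:

```python
# Rows from the backend table above (movie, theater, rating, date, time).
ROWS = [
    {"movie": "Iron Man", "theater": "Last Taipei A1", "rating": 8.5,
     "date": "2018/10/31", "time": t}
    for t in ["09:00", "09:25", "10:15", "10:40"]
]

def lookup(constraints):
    """Information access / task completion: rows matching all constraints."""
    return [r for r in ROWS
            if all(r.get(k) == v for k, v in constraints.items())]

hits = lookup({"movie": "Iron Man", "time": "10:15"})
print(len(hits))  # 1
```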
Language Understanding (LU)
Pipelined:
1. Domain Classification
2. Intent Classification
3. Slot Filling
RNN for Slot Tagging – I
(Yao et al., 2013; Mesnil et al., 2015)
Variations:
a. RNNs with LSTM cells
b. Input: sliding window of n-grams (LSTM-LA)
c. Bi-directional LSTMs (bLSTM)
[Figure: (a) LSTM, (b) LSTM-LA, (c) bLSTM – each reads inputs w0…wn through hidden states h0…hn (forward and backward for bLSTM) and emits a tag y0…yn per word]
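A minimal forward pass of such a bidirectional tagger can be sketched in NumPy; this toy model uses random, untrained weights, so it only illustrates the data flow, not real tagging quality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: vocabulary of 5 words, hidden size 4, 3 slot tags (IOB).
V, H, T = 5, 4, 3
E  = rng.normal(size=(V, H))                                # word embeddings
Wf = rng.normal(size=(H, H)); Uf = rng.normal(size=(H, H))  # forward RNN
Wb = rng.normal(size=(H, H)); Ub = rng.normal(size=(H, H))  # backward RNN
Wy = rng.normal(size=(2 * H, T))                            # output projection

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bi_rnn_tag(word_ids):
    """Forward pass of a bidirectional Elman RNN slot tagger."""
    x = E[word_ids]                              # (n, H) embedded inputs
    n = len(word_ids)
    hf = np.zeros((n, H)); hb = np.zeros((n, H))
    h = np.zeros(H)
    for t in range(n):                           # left-to-right pass
        h = np.tanh(x[t] @ Wf + h @ Uf)
        hf[t] = h
    h = np.zeros(H)
    for t in reversed(range(n)):                 # right-to-left pass
        h = np.tanh(x[t] @ Wb + h @ Ub)
        hb[t] = h
    # Concatenate both directions, project, and pick one tag per word.
    probs = softmax(np.concatenate([hf, hb], axis=1) @ Wy)
    return probs.argmax(axis=1)

tags = bi_rnn_tag([0, 3, 1, 4])  # one tag id per input word (arbitrary here)
```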
RNN for Slot Tagging – II
(Kurata et al., 2016; Simonnet et al., 2015)
Encoder-decoder networks: leverage sentence-level information (the encoder reads the input reversed, wn…w0, before the decoder tags w0…wn)
Attention-based encoder-decoder: use attention (as in MT) in the encoder-decoder network (decoder states s0…sn attend over encoder states h0…hn via context vectors ci)
Joint Semantic Frame Parsing
Sequence-based (Hakkani-Tür et al., 2016): slot filling and intent prediction in the same output sequence, e.g. “taiwanese food please” → B-type O O, with the intent FIND_REST predicted at the EOS position
Parallel (Liu and Lane, 2016): intent prediction and slot filling are performed in two branches
Joint Model Comparison

Model                         Attention Mechanism   Intent-Slot Relationship
Joint bi-LSTM                 X                     Δ (implicit)
Attentional Encoder-Decoder   √                     Δ (implicit)
Slot-Gated Joint Model        √                     √ (explicit)
Slot-Gated Joint SLU (Goo et al., 2018)
[Figure: a BLSTM reads the word sequence x1…x4; slot attention produces per-word slot context vectors ciS, and intent attention produces an intent context vector cI; a slot gate combines the two before slot prediction y1S…y4S and intent prediction yI]

Slot gate: g = Σ v · tanh(ciS + W · cI)
where ciS is the slot context vector, cI the intent context vector, W a trainable matrix, v a trainable vector, and g a scalar gate value.

Without the gate: yiS = softmax(WS(hi + ciS) + bS)
With the gate: yiS = softmax(WS(hi + g · ciS) + bS)
where WS and bS are the output-layer matrix and bias.

g will be larger if the slot and intent are more strongly related.
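The gate and the gated slot distribution can be written down directly from the equations above; the dimensions and random vectors below are illustrative stand-ins for real BLSTM states and attention contexts:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 6                          # hidden size of the BLSTM (illustrative)

W = rng.normal(size=(H, H))    # trainable matrix applied to the intent context
v = rng.normal(size=H)         # trainable vector
W_S = rng.normal(size=(3, H))  # output layer for 3 slot tags
b_S = np.zeros(3)              # output-layer bias

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def slot_gate(c_slot, c_intent):
    """g = sum(v * tanh(c_i^S + W @ c^I)): a scalar slot/intent agreement."""
    return float(np.sum(v * np.tanh(c_slot + W @ c_intent)))

def slot_distribution(h_i, c_slot, c_intent):
    """y_i^S = softmax(W^S (h_i + g * c_i^S) + b^S), the gated variant."""
    g = slot_gate(c_slot, c_intent)
    return softmax(W_S @ (h_i + g * c_slot) + b_S)

p = slot_distribution(rng.normal(size=H), rng.normal(size=H), rng.normal(size=H))
print(p.sum())  # 1.0: a valid distribution over slot tags
```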
Contextual Language Understanding

Single-turn:
U: “just sent email to bob about fishing this weekend”
→ O O O O B-contact_name O B-subject I-subject I-subject
→ send_email(contact_name=“bob”, subject=“fishing this weekend”)

Multi-turn:
U1: “are we going to fish this weekend”
→ B-message I-message I-message I-message I-message I-message I-message
→ send_email(message=“are we going to fish this weekend”)
U2: “send email to bob”
→ send_email(contact_name=“bob”)
E2E MemNN for Contextual LU
(Chen et al., 2016)
U: “i’d like to purchase tickets to see deepwater horizon”
S: “for which theatre”
U: “angelika”
S: “you want them for angelika theatre?”
U: “yes angelika”
S: “how many tickets would you like?”
U: “3 tickets for saturday”
S: “what time would you like?”
U: “any time on saturday is fine”
S: “okay, there is 4:10 pm, 5:40 pm and 9:20 pm”
U: “let’s do 5:40”
[Figure: the end-to-end memory network attends over history turns with weights such as 0.69, 0.13, 0.16 when interpreting the current utterance]
Role-Based & Time-Aware Attention
(Su et al., 2018)
[Figure: SLU of the current utterance (wt…wT) attends over the dialogue history, tourist turns u1, u3, u6 and guide turns u2, u4, u5, with sentence-level time-decay attention (weights αu) and role-level time-decay attention (weights αr); the weighted history summary feeds a dense layer]
Time-decay attention function (αu and αr): the attention weight α decreases with the distance d from the current turn, following a convex, linear, or concave curve.
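The three curve shapes can be illustrated as follows; the exact parametric forms in Su et al. (2018) may differ, so these decay functions are hypothetical examples of convex, linear, and concave behavior:

```python
import numpy as np

# Illustrative time-decay attention curves over history distance d.
def convex_decay(d, a=1.0):
    return np.exp(-a * d)                         # drops fast, then flattens

def linear_decay(d, end=10.0):
    return np.maximum(0.0, 1.0 - d / end)         # constant-rate drop

def concave_decay(d, end=10.0):
    return np.maximum(0.0, 1.0 - (d / end) ** 2)  # flat at first, then drops

def attention_weights(distances, decay):
    """Normalize decayed scores into attention weights over history turns."""
    w = decay(np.asarray(distances, dtype=float))
    return w / w.sum()

# More recent turns (smaller d) get larger weight under every curve.
w = attention_weights([1, 2, 3], convex_decay)
print(w)  # decreasing weights summing to 1
```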
Context-Sensitive Time-Decay Attention
(Su et al., 2018)
[Figure: the same role-based architecture, but the convex, linear, and concave decay curves are combined by a learned attention model instead of a fixed hand-crafted function]
Time-decay attention significantly improves the understanding results.
Dialogue State Tracking
Tracks the user’s goal as dialogue acts accumulate over turns, e.g.:
request(restaurant; foodtype=Thai)
inform(area=centre)
request(address)
bye()
DNN for DST
[Figure: features extracted from the multi-turn conversation feed a DNN, which outputs a value distribution for each slot: the state of the current turn]
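For contrast with the DNN tracker above, a rule-based state tracker is just an accumulator over turns; this toy `track` helper is illustrative only and ignores the value distributions a DNN tracker would output:

```python
# A minimal rule-based state tracker (illustrative only; the slide's DNN
# tracker instead outputs a distribution over values for each slot).
def track(turns):
    """Accumulate slot values from a sequence of (act, slots) user turns."""
    state = {}
    for act, slots in turns:
        if act in ("inform", "request"):
            state.update(slots)      # later turns overwrite earlier values
    return state

turns = [
    ("request", {"foodtype": "Thai"}),
    ("inform",  {"area": "centre"}),
    ("inform",  {"area": "north"}),  # user changes their mind
]
print(track(turns))  # {'foodtype': 'Thai', 'area': 'north'}
```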
RNN-CNN DST
(Wen et al., 2016; figure from Wen et al., 2016)
Dialogue Policy Optimization
The policy maps each user act to a system act, e.g.:
System: greeting()
User: request(restaurant; foodtype=Thai) → System: request(area)
User: inform(area=centre) → System: inform(restaurant=Bangkok city, area=centre of town, foodtype=Thai)
User: request(address) → System: inform(address=24 Green street)
User: bye()
Dialogue Policy Optimization
Dialogue management in an RL framework: the user is the environment and the dialogue manager is the agent. Language understanding provides the observation O, natural language generation realizes the action A, and the user provides the reward R.
Goal: select the best action that maximizes the future reward.
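The agent-environment loop above can be sketched with tabular Q-learning; the two states, two system acts, and the hand-coded environment are hypothetical stand-ins for a real user and state tracker:

```python
import random
random.seed(0)

# Toy tabular Q-learning for a dialogue policy (illustrative states/acts).
STATES = ["have_food", "have_food_area"]
ACTS = ["request(area)", "inform(restaurant)"]
Q = {(s, a): 0.0 for s in STATES for a in ACTS}

def step(state, act):
    """Hand-coded environment: (next state, reward, done) per state/act."""
    if state == "have_food" and act == "request(area)":
        return "have_food_area", 0.0, False   # useful question
    if state == "have_food_area" and act == "inform(restaurant)":
        return state, 1.0, True               # task success
    return state, -0.1, False                 # wasted turn

alpha, gamma, eps = 0.5, 0.9, 0.2
for _ in range(300):
    s, done = "have_food", False
    while not done:
        if random.random() < eps:             # epsilon-greedy exploration
            a = random.choice(ACTS)
        else:
            a = max(ACTS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

best = max(ACTS, key=lambda act: Q[("have_food", act)])
print(best)  # request(area): ask for the missing info before informing
```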
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
Idea: supervised learning for each component, and RL for end-to-end training of the neural dialogue system.
[Figure: a user simulator (user goal → user model → NLG) converses with the neural dialogue system (LU → DST state st → dialogue policy learning over actions a1…ak), grounded in a knowledge database; e.g. “Are there any action movies to see this weekend?” → request_location]
User goal – Two tickets for “Deadpool” tomorrow 9PM at AMC Pacific Place 11 theater, Seattle.
RULE BASED SYSTEM
User: Do you have any ticket for movie ‘Deadpool’ at 9:00 pm?
Agent: Which theater do you want?
User: AMC pacific place 11 theater, please!
Agent: How many tickets do you want?
User: Two!
Agent: What date?
User: Tomorrow!
Agent: Which city do you want?
User: Seattle!
Agent: Do you have any specification for zip code?
User: No, I do not care.
Agent: Do you have any distance constraint?
User: No, I do not care.
Agent: How about video format? 2d? Imax 3d?
User: Any kind is okay.
Agent: Do you have any constraint to price?
User: No!
Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’.
User: Okay, thanks.
Agent: Thanks!
REINFORCEMENT LEARNING SYSTEM
User: I want two tickets for movie ‘Deadpool’.
Agent: What is the starttime you prefer?
User: 9:00 PM
Agent: Which day?
User: Tomorrow
Agent: Which city?
User: Seattle
Agent: Which theater do you prefer?
User: AMC Pacific Place 11 theater.
Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’.
User: Okay, thanks.
Agent: Thanks!

Reinforcement Learning Agent (Li et al., 2017): skips the requests the user may not care about, improving efficiency.
Issue 1: the rule-based agent has no notion of which requests can be skipped.
User goal – Two tickets for “Hail Caesar” tomorrow 7:15PM at AMC Lowes Oak Tree 6, Seattle.
Reinforcement Learning Agent
(Li et al., 2017)
Automatically learn when and what to confirm for better robustness.
REINFORCEMENT LEARNING SYSTEM
User: Do you have any ticket for movie ‘Hail Caesar’ at theater AMC Lowes Oak Tree 6?
Agent: What start time do you prefer?
User: 7:15 pm Tomorrow
Agent: Which city you are looking for?
User: Seattle
Agent: Okay, I will issue you tickets for movie ‘Hail Caeser’
tomorrow 7:50 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Oh, no.
Agent: It is 7:50 pm, right?
User: Two tickets for 7:15 pm, please.
Agent: Okay, I will issue you two tickets for movie ‘Hail Caeser’
tomorrow 7:15 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Thanks!
Agent: Thanks!
RULE BASED SYSTEM
User: Do you have any ticket for movie ‘Hail Caesar’ at theater AMC Lowes Oak Tree 6?
Agent: What start time do you prefer?
User: 7:15 pm Tomorrow
Agent: Which city you are looking for?
User: Seattle
Agent: How many tickets do you want?
User: Two
Agent: Okay, I have issued two tickets for movie ‘Hail Caeser’
tomorrow 7:50 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Oh, no.
Agent: Thanks!
Issue 2: no principled way to deal with LU errors/uncertainty
RL in Dialogue Systems
1. Sample-inefficient; hard to design the reward function; prone to local optima…
2. Real users are expensive
3. Discrepancy between real users and simulators
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018)
Idea: learn with real users with planning, and add a discriminator to filter out bad experiences.
[Figure: the policy model performs direct reinforcement learning from real experience with the human, and controlled planning from simulated experience produced by a world model; the world model is learned from real experience; the system itself is the NLU → DST → policy → NLG pipeline; a discriminator, trained to distinguish real from simulated experience, filters low-quality simulated episodes; imitation/supervised learning from human conversational data initializes the components]
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018)
S.-Y. Su, X. Li, J. Gao, J. Liu, and Y.-N. Chen, “Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning,” (to appear) in Proc. of EMNLP, 2018.
Policy learning is more robust and shows improvement in human evaluation.
Natural Language Generation (NLG)
Mapping dialogue acts into natural language:
inform(name=Seven_Days, foodtype=Chinese)
→ “Seven Days is a nice Chinese restaurant”
Template-Based NLG
Define a set of rules to map frames to natural language.
Pros: simple, error-free, easy to control. Cons: time-consuming, rigid, poor scalability.

Semantic Frame → Natural Language
confirm() → “Please tell me more about the product you are looking for.”
confirm(area=$V) → “Do you want somewhere in the $V?”
confirm(food=$V) → “Do you want a $V restaurant?”
confirm(food=$V, area=$W) → “Do you want a $V restaurant in the $W?”
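The rule table above maps directly onto a template lookup; the `generate` helper and its key scheme are an illustrative sketch of this approach:

```python
# Templates keyed by (act, sorted slot names), mirroring the confirm() rules.
TEMPLATES = {
    ("confirm",): "Please tell me more about the product you are looking for.",
    ("confirm", "area"): "Do you want somewhere in the {area}?",
    ("confirm", "food"): "Do you want a {food} restaurant?",
    ("confirm", "area", "food"): "Do you want a {food} restaurant in the {area}?",
}

def generate(act, **slots):
    """Select the rule matching the act and slot names, then fill in values."""
    key = (act, *sorted(slots))
    return TEMPLATES[key].format(**slots)

print(generate("confirm", food="Thai", area="centre"))
# Do you want a Thai restaurant in the centre?
```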
RNN-Based LM NLG
(Wen et al., 2015)
Input: the dialogue act as a 1-hot representation, e.g. inform(name=Din Tai Fung, food=Taiwanese) → 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, 0, 0, 0…
An RNN language model, conditioned on the dialogue act, generates the delexicalised output:
<BOS> SLOT_NAME serves SLOT_FOOD . <EOS>
which is relexicalised into:
<BOS> Din Tai Fung serves Taiwanese . <EOS>
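Delexicalisation and its inverse are simple string substitutions; a minimal sketch using naive `str.replace`, which a real system would make more careful about overlapping values:

```python
# Delexicalisation replaces slot values with placeholders so the NLG model
# generalizes across values; relexicalisation fills them back in.
def delexicalise(sentence, slots):
    for slot, value in slots.items():
        sentence = sentence.replace(value, f"SLOT_{slot.upper()}")
    return sentence

def relexicalise(template, slots):
    for slot, value in slots.items():
        template = template.replace(f"SLOT_{slot.upper()}", value)
    return template

slots = {"name": "Din Tai Fung", "food": "Taiwanese"}
d = delexicalise("Din Tai Fung serves Taiwanese .", slots)
print(d)                       # SLOT_NAME serves SLOT_FOOD .
print(relexicalise(d, slots))  # Din Tai Fung serves Taiwanese .
```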
Semantic Conditioned LSTM
(Wen et al., 2015)
Issue: semantic repetition
• “Din Tai Fung is a great Taiwanese restaurant that serves Taiwanese.”
• “Din Tai Fung is a child friendly restaurant, and also allows kids.”
Idea: use a gating mechanism to control the generated semantics (dialogue act/slots).
[Figure: a standard LSTM cell (gates it, ft, ot over xt and ht−1, updating Ct and ht) is augmented with a DA cell: a reading gate rt updates the dialogue-act vector dt from dt−1, initialized with the 1-hot act representation d0, e.g. inform(name=Seven_Days, food=Chinese) → 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, …]
Issues in NLG
Issues:
• NLG tends to generate shorter sentences
• NLG may generate grammatically incorrect sentences
Solutions:
• Generate word patterns in an order
• Consider linguistic patterns
Hierarchical NLG w/ Linguistic Patterns
(Su et al., 2018)
Input semantics: name[Midsummer House], food[Italian], priceRange[moderate], near[All Bar One], encoded as a semantic 1-hot representation by a bidirectional GRU encoder.
The hierarchical GRU decoder generates in four layers, each conditioned on the previous layer’s output:
1. NOUN + PROPN + PRON: “All Bar One place it Midsummer House”
2. + VERB: “All Bar One is priced place it is called Midsummer House”
3. + ADJ + ADV: “All Bar One is moderately priced Italian place it is called Midsummer House”
4. + Others: “Near All Bar One is a moderately priced Italian place it is called Midsummer House”
Training techniques: 1. repeat-input, 2. inner-layer teacher forcing, 3. inter-layer teacher forcing, 4. curriculum learning.
Evolution Roadmap
Dialogue breadth (coverage) vs. dialogue depth (complexity):
single-domain systems → extended systems → multi-domain systems → open-domain systems
“What is influenza?” / “I’ve got a cold, what do I do?” / “Tell me a joke.” / “I feel sad…”
Dialogue Systems
Task-Oriented Dialogue: input x → understanding (NLU) → state tracker → dialogue policy → generation (NLG) → output y, grounded in a database / memory / external knowledge.
Fully Data-Driven Chit-Chat Social Bots: the same input-to-output loop, but learned end-to-end from data without an explicit backend.
Neural Response Generation
(Sordoni et al., 2015; Vinyals & Le, 2015)
[Figure: an encoder reads the source conversation history (“… because of your game?”) and a decoder generates the target response (“Yeah, I’m on my way”)]
Learns to generate dialogues from offline data (no state, action, intent, slot, etc.)
Issue 1: Blandness Problem

“’tis a fine brew on a day like this! Strong though, how many is sensible?”
→ “I’m not sure yet, I’ll let you know!”
“Milan apparently selling Zlatan to balance the books... Where next, Madrid?”
→ “I don’t know.”
“Wow sour starbursts really do make your mouth water... mm drool. Can I have one?”
→ “Of course!”
“Well he was on in Bromley a while ago... still touring.”
→ “I don’t even know what he’s talking about.”

32% of responses are general and meaningless:
“I don’t know” / “I don’t know what you are talking about” / “I don’t think that is a good idea” / “Oh my god”
MMI for Response Diversity
(Li et al., 2016)

“’tis a fine brew on a day like this! Strong though, how many is sensible?”
→ “Depends on how much you drink!”
“Milan apparently selling Zlatan to balance the books... Where next, Madrid?”
→ “I think he’d be a good signing.”
“Wow sour starbursts really do make your mouth water... mm drool. Can I have one?”
→ “Of course you can! They’re delicious!”
“Well he was on in Bromley a while ago... still touring.”
→ “I’ve never seen him live.”
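Li et al. (2016) promote diversity with a Maximum Mutual Information objective; in the anti-LM variant, candidates are scored by log p(T|S) − λ·log p(T), penalizing generically probable responses. The probabilities below are made-up numbers purely to illustrate the reranking:

```python
import math

# Toy MMI-antiLM reranking: score(T) = log p(T|S) - lam * log p(T).
# Real systems get p(T|S) from a seq2seq model and p(T) from a language model.
candidates = {
    "I don't know.":                  {"p_t_given_s": 0.20, "p_t": 0.150},
    "Depends on how much you drink!": {"p_t_given_s": 0.10, "p_t": 0.001},
}

def mmi_score(c, lam=0.5):
    return math.log(c["p_t_given_s"]) - lam * math.log(c["p_t"])

best = max(candidates, key=lambda t: mmi_score(candidates[t]))
print(best)  # the specific response wins despite lower p(T|S)
```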
Real-World Conversations
Signals beyond the text:
• Multimodality
• Conversation history
• Persona
• User profile data (bio, social graph, etc.)
• Visual signal (camera, picture, etc.)
• Knowledge base
• Mood
• Geolocation
• Time
Issue 2: Response Inconsistency

Personalized Response Generation
(Li et al., 2016)
[Figure: a seq2seq model answers “where do you live” with “in england .”, conditioning every decoder step on the speaker embedding (Rob); word embeddings (50k) cluster related words (england/london/u.s., great/good, stay/live, monday/tuesday), and speaker embeddings (70k) cluster users (Rob_712, skinnyoflynny2, Tomcoatez, Kush_322, D_Gomes25, Dreamswalls, kierongillen5, TheCharlieZ, The_Football_Bar, This_Is_Artful, DigitalDan285, Jinnmeow3, Bob_Kelly2)]
Persona Model for Speaker Consistency
(Li et al., 2016)
Baseline model → inconsistency; persona model using speaker embeddings → consistency
Issue 3: Dialogue-Level Optimization via RL

• Task-completion bots (movies, restaurants, …): state = user input + context; action = dialogue act + slot-value; reward = task success rate, # of turns
• Info bots (Q&A over KB, Web, etc.): state = question + context; action = clarification questions, answers; reward = relevance of answer, # of turns
• Social bot (XiaoIce): state = conversation history; action = response; reward = engagement(?)

[Figure: language understanding maps user input o to state s; the dialogue manager selects a = π(s), collects rewards (s, a, r, s’), and optimizes Q(s, a); language (response) generation produces the response]
Deep RL for Response Generation
(Li et al., 2016)
[Figure: for the same input message, the supervised learning agent and the reinforcement learning agent produce different responses]
The RL agent generates more interactive responses and tends to end a sentence with a question, handing the conversation over to the user.
Issue 4: No Grounding
(Sordoni et al., 2015; Li et al., 2016)
H: hi
M: how are you?
H: not bad
M: what's wrong?
H: nothing really
M: wanna talk about it? i 'm here if you wanna talk
H: well, i'm just tired
M: me too, i'm here if you wanna talk
Neural model learns the general shape of conversations, and the system output is situationally appropriate and coherent.
H: would thursday afternoon work sometime?
M: yeah, sure. just let me know when you’re free.
H: after lunch is probably the best time
M: okay, sounds good. just let me know when you’re free.
H: would 2 pm work for you?
M: works for me.
H: well let’s say 2 pm then i’ll see you there
M: sounds good.
No grounding into a real calendar, but the “shape” of the conversation is fluent and plausible.
Chit-Chat vs. Task-Oriented

“The weather is so depressing these days.”
“I know, I dislike rain too.” ← Social chat: engaging, human-like interaction (ungrounded)
“Any recommendation?”
“What about a day trip to eastern Washington?”
“Try Dry Falls, it’s spectacular!” ← Task-oriented: task completion, decision support (grounded)
Image-Grounded Dialogues
(Mostafazadeh et al., 2017)
Neural model trained on 250K Twitter conversations: turn 1 is a tweet + accompanying photo; turn 2 is a tweet that is a question.

Tweet: my poor baby moms saying it’s probably totaled so rip Bessie
NeuroCon: How did you manage to do that?
Tweet: as penitence I offer this
NeuroCon: can I have one of those?
Tweet: some of the damage done
NeuroCon: Is that at your house?

Given data, other signals can be exploited in the same way: program state, database info, geolocation, etc.
Knowledge-Grounded Responses
(Ghazvininejad et al., 2017)
[Figure: for “Going to Kusakabe tonight”, a dialogue encoder reads the conversation history while a fact encoder reads contextually-relevant “facts” retrieved from world facts (“Consistently the best omakase”, “Amazing sushi tasting […]”, “They were out of kaisui […]”); the decoder combines both (Σ) to produce the response “Try omakase, the best in town”]
Conversation and Non-Conversation Data
Conversation data supplies the response pattern; a knowledge resource fills in the specifics:
“You know any good Japanese restaurant in Seattle?” → “Try Kisaku, one of the best sushi restaurants in the city.”
Pattern: “You know any good A restaurant in B?” → “Try C, one of the best D in the city.”
Evolution Roadmap
Dialogue breadth (coverage) vs. dialogue depth (complexity):
knowledge-based systems → common-sense systems → empathetic systems
“What is influenza?” / “I’ve got a cold, what do I do?” / “Tell me a joke.” / “I feel sad…”
Multimodality & Personalization
(Chen et al., 2018)
Task: user intent prediction, e.g. “send to vivian” – Email? Message? (Communication)
Challenge: language ambiguity
• User preference: some people prefer “Message” to “Email”; some prefer “Ping” to “Text”
• App-level contexts: “Message” is more likely to follow “Camera”; “Email” is more likely to follow “Excel”
Behavioral patterns in the history help intent prediction.
High-Level Intention Learning
(Sun et al., 2016; Sun et al., 2016)
A high-level intention may span several domains:
“Schedule a lunch with Vivian.” → find restaurant, check location, contact, play music
“What kind of restaurants do you prefer?” “The distance is …” “Should I send the restaurant information to Vivian?”
Users interact via high-level descriptions, and the system learns how to plan the dialogues.
Empathy in Dialogue Systems
(Fung et al., 2016)
Embed an empathy module:
• Recognize emotion using multimodality (vision, speech, text)
• Generate emotion-aware responses
Cognitive Behavioral Therapy (CBT)
Mood tracking, pattern mining, content providing
Goals: depression reduction; always be there; know you well
Challenges and Conclusions

Challenge Summary
The human-machine interface is a hot topic, but several components must be integrated!
Most state-of-the-art technologies are based on DNNs:
• They require huge amounts of labeled data
• Several frameworks/models are available
Remaining challenges:
• Fast domain adaptation with scarce data + re-use of rules/knowledge
• Handling reasoning and personalization
• Data collection and analysis from unstructured data
• Complex cascaded systems require high accuracy in every component to work well as a whole
Her (2013)
What can machines achieve now or in the future?
Q & A
Thanks for Your Attention!
Yun-Nung (Vivian) Chen Assistant Professor
National Taiwan University
y.v.chen@ieee.org / http://vivianchen.idv.tw