Towards Open-Domain Conversational AI
Yun-Nung (Vivian) Chen 陳縕儂
http://vivianchen.idv.tw
What can machines achieve now or in the future?
Iron Man (2008)
Language Empowering Intelligent Assistants
Apple Siri (2011) · Google Now (2012) · Microsoft Cortana (2014) · Amazon Alexa/Echo (2014) · Facebook M & Bot (2015) · Google Home (2016) · Google Assistant (2016) · Apple HomePod (2017)
Why and When Do We Need Conversational AI?

“I want to chat” → Turing Test (talk like a human)
“I have a question” → Information consumption
“I need to get this done” → Task completion
“What should I do?” → Decision support

Information consumption:
• What is today’s agenda?
• Which room is the SCAI workshop in?
• What does SCAI stand for?

Task completion:
• Book me a train ticket from Amsterdam to Brussels
• Reserve a table at Din Tai Fung for 5 people, 7PM tonight
• Schedule a meeting with Vivian at 10:00 tomorrow

Decision support:
• Is this Brussels waffle worth trying?
• Is the SCAI workshop good to attend?

The first category is social chit-chat; the other three are task-oriented dialogues.
Intelligent Assistants: Task-Oriented
Conversational Agents: Chit-Chat & Task-Oriented
Task-Oriented Dialogue Systems
JARVIS – Iron Man’s Personal Assistant; Baymax – Personal Healthcare Companion
Task-Oriented Dialogue Systems
(Young, 2000)

Speech Signal → Speech Recognition → Text Input / Hypothesis: “are there any action movies to see this weekend”
→ Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
→ Semantic Frame: request_movie(genre=action, date=this weekend)
→ Dialogue Management (DM): Dialogue State Tracking (DST) and Dialogue Policy, backed by a Backend Database / Knowledge Providers
→ System Action/Policy: request_location
→ Natural Language Generation (NLG) → Text Response: “Where are you located?”
Semantic Frame Representation
Requires a domain ontology: an early connection to the backend
Contains the core concepts: an intent and a set of slots with fillers

Restaurant domain (slots: price, type, location):
“find me a cheap taiwanese restaurant in oakland”
→ find_restaurant(price=“cheap”, type=“taiwanese”, location=“oakland”)

Movie domain (slots: year, genre, director):
“show me action movies directed by james cameron”
→ find_movie(genre=“action”, director=“james cameron”)
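To make the frame structure concrete, here is a toy slot extractor in Python; the keyword lists and the `parse` helper are hypothetical illustrations of the output shape, not a real LU model:

```python
# Toy semantic-frame extraction for the restaurant example above.
# The keyword lists are hypothetical; real systems learn these mappings.
ONTOLOGY = {
    "price": ["cheap", "moderate", "expensive"],
    "type": ["taiwanese", "chinese", "italian"],
    "location": ["oakland", "seattle", "taipei"],
}

def parse(utterance):
    """Return a semantic frame: an intent plus a set of slot fillers."""
    tokens = utterance.lower().split()
    slots = {slot: t for slot, values in ONTOLOGY.items()
             for t in tokens if t in values}
    return {"intent": "find_restaurant", "slots": slots}

frame = parse("find me a cheap taiwanese restaurant in oakland")
print(frame)
# {'intent': 'find_restaurant',
#  'slots': {'price': 'cheap', 'type': 'taiwanese', 'location': 'oakland'}}
```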
Backend Database / Ontology

Movie Name   Theater          Rating  Date        Time
Iron Man     Last Taipei A1   8.5     2018/10/31  09:00
Iron Man     Last Taipei A1   8.5     2018/10/31  09:25
Iron Man     Last Taipei A1   8.5     2018/10/31  10:15
Iron Man     Last Taipei A1   8.5     2018/10/31  10:40

Domain-specific table: a target entity and its attributes (movie name, theater, rating, date, time)
Functionality:
• Information access: find specific entries
• Task completion: find the row that satisfies the constraints
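Task completion over such a table reduces to finding the rows that satisfy all constraints; a minimal sketch, where the `lookup` helper and field names are my own shorthand for the table above:

```python
# Rows from the backend table above (movie, theater, rating, date, time).
ROWS = [
    {"movie": "Iron Man", "theater": "Last Taipei A1", "rating": 8.5,
     "date": "2018/10/31", "time": t}
    for t in ["09:00", "09:25", "10:15", "10:40"]
]

def lookup(constraints):
    """Information access / task completion: rows matching all constraints."""
    return [r for r in ROWS
            if all(r.get(k) == v for k, v in constraints.items())]

hits = lookup({"movie": "Iron Man", "time": "10:15"})
print(len(hits))  # 1
```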
Language Understanding (LU)
Pipelined:
1. Domain Classification
2. Intent Classification
3. Slot Filling
RNN for Slot Tagging – I
(Yao et al., 2013; Mesnil et al., 2015)
Variations:
a. RNNs with LSTM cells
b. Input: sliding window of n-grams (LSTM-LA)
c. Bi-directional LSTMs (bLSTM)
[Figure: (a) LSTM, (b) LSTM-LA, (c) bLSTM – each reads inputs w0…wn through hidden states h0…hn (forward and backward for bLSTM) and emits a tag y0…yn per word]
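A minimal forward pass of such a bidirectional tagger can be sketched in NumPy; this toy model uses random, untrained weights, so it only illustrates the data flow, not real tagging quality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: vocabulary of 5 words, hidden size 4, 3 slot tags (IOB).
V, H, T = 5, 4, 3
E  = rng.normal(size=(V, H))                                # word embeddings
Wf = rng.normal(size=(H, H)); Uf = rng.normal(size=(H, H))  # forward RNN
Wb = rng.normal(size=(H, H)); Ub = rng.normal(size=(H, H))  # backward RNN
Wy = rng.normal(size=(2 * H, T))                            # output projection

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bi_rnn_tag(word_ids):
    """Forward pass of a bidirectional Elman RNN slot tagger."""
    x = E[word_ids]                              # (n, H) embedded inputs
    n = len(word_ids)
    hf = np.zeros((n, H)); hb = np.zeros((n, H))
    h = np.zeros(H)
    for t in range(n):                           # left-to-right pass
        h = np.tanh(x[t] @ Wf + h @ Uf)
        hf[t] = h
    h = np.zeros(H)
    for t in reversed(range(n)):                 # right-to-left pass
        h = np.tanh(x[t] @ Wb + h @ Ub)
        hb[t] = h
    # Concatenate both directions, project, and pick one tag per word.
    probs = softmax(np.concatenate([hf, hb], axis=1) @ Wy)
    return probs.argmax(axis=1)

tags = bi_rnn_tag([0, 3, 1, 4])  # one tag id per input word (arbitrary here)
```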
RNN for Slot Tagging – II
(Kurata et al., 2016; Simonnet et al., 2015)
Encoder-decoder networks: leverage sentence-level information (the encoder reads the input reversed, wn…w0, before the decoder tags w0…wn)
Attention-based encoder-decoder: use attention (as in MT) in the encoder-decoder network (decoder states s0…sn attend over encoder states h0…hn via context vectors ci)
Joint Semantic Frame Parsing
Sequence-based (Hakkani-Tür et al., 2016): slot filling and intent prediction in the same output sequence, e.g. “taiwanese food please” → B-type O O, with the intent FIND_REST predicted at the EOS position
Parallel (Liu and Lane, 2016): intent prediction and slot filling are performed in two branches
Joint Model Comparison

Model                         Attention Mechanism   Intent-Slot Relationship
Joint bi-LSTM                 X                     Δ (implicit)
Attentional Encoder-Decoder   √                     Δ (implicit)
Slot-Gated Joint Model        √                     √ (explicit)
Slot-Gated Joint SLU (Goo et al., 2018)
[Figure: a BLSTM reads the word sequence x1…x4; slot attention produces per-word slot context vectors ciS, and intent attention produces an intent context vector cI; a slot gate combines the two before slot prediction y1S…y4S and intent prediction yI]

Slot gate: g = Σ v · tanh(ciS + W · cI)
where ciS is the slot context vector, cI the intent context vector, W a trainable matrix, v a trainable vector, and g a scalar gate value.

Without the gate: yiS = softmax(WS(hi + ciS) + bS)
With the gate: yiS = softmax(WS(hi + g · ciS) + bS)
where WS and bS are the output-layer matrix and bias.

g will be larger if the slot and intent are more strongly related.
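The gate and the gated slot distribution can be written down directly from the equations above; the dimensions and random vectors below are illustrative stand-ins for real BLSTM states and attention contexts:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 6                          # hidden size of the BLSTM (illustrative)

W = rng.normal(size=(H, H))    # trainable matrix applied to the intent context
v = rng.normal(size=H)         # trainable vector
W_S = rng.normal(size=(3, H))  # output layer for 3 slot tags
b_S = np.zeros(3)              # output-layer bias

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def slot_gate(c_slot, c_intent):
    """g = sum(v * tanh(c_i^S + W @ c^I)): a scalar slot/intent agreement."""
    return float(np.sum(v * np.tanh(c_slot + W @ c_intent)))

def slot_distribution(h_i, c_slot, c_intent):
    """y_i^S = softmax(W^S (h_i + g * c_i^S) + b^S), the gated variant."""
    g = slot_gate(c_slot, c_intent)
    return softmax(W_S @ (h_i + g * c_slot) + b_S)

p = slot_distribution(rng.normal(size=H), rng.normal(size=H), rng.normal(size=H))
print(p.sum())  # 1.0: a valid distribution over slot tags
```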
Contextual Language Understanding

Single-turn:
U: “just sent email to bob about fishing this weekend”
→ O O O O B-contact_name O B-subject I-subject I-subject
→ send_email(contact_name=“bob”, subject=“fishing this weekend”)

Multi-turn:
U1: “are we going to fish this weekend”
→ B-message I-message I-message I-message I-message I-message I-message
→ send_email(message=“are we going to fish this weekend”)
U2: “send email to bob”
→ send_email(contact_name=“bob”)
E2E MemNN for Contextual LU
(Chen et al., 2016)
U: “i’d like to purchase tickets to see deepwater horizon”
S: “for which theatre”
U: “angelika”
S: “you want them for angelika theatre?”
U: “yes angelika”
S: “how many tickets would you like?”
U: “3 tickets for saturday”
S: “what time would you like?”
U: “any time on saturday is fine”
S: “okay, there is 4:10 pm, 5:40 pm and 9:20 pm”
U: “let’s do 5:40”
[Figure: the end-to-end memory network attends over history turns with weights such as 0.69, 0.13, 0.16 when interpreting the current utterance]
Role-Based & Time-Aware Attention
(Su et al., 2018)
[Figure: SLU of the current utterance (wt…wT) attends over the dialogue history, tourist turns u1, u3, u6 and guide turns u2, u4, u5, with sentence-level time-decay attention (weights αu) and role-level time-decay attention (weights αr); the weighted history summary feeds a dense layer]
Time-decay attention function (αu and αr): the attention weight α decreases with the distance d from the current turn, following a convex, linear, or concave curve.
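The three curve shapes can be illustrated as follows; the exact parametric forms in Su et al. (2018) may differ, so these decay functions are hypothetical examples of convex, linear, and concave behavior:

```python
import numpy as np

# Illustrative time-decay attention curves over history distance d.
def convex_decay(d, a=1.0):
    return np.exp(-a * d)                         # drops fast, then flattens

def linear_decay(d, end=10.0):
    return np.maximum(0.0, 1.0 - d / end)         # constant-rate drop

def concave_decay(d, end=10.0):
    return np.maximum(0.0, 1.0 - (d / end) ** 2)  # flat at first, then drops

def attention_weights(distances, decay):
    """Normalize decayed scores into attention weights over history turns."""
    w = decay(np.asarray(distances, dtype=float))
    return w / w.sum()

# More recent turns (smaller d) get larger weight under every curve.
w = attention_weights([1, 2, 3], convex_decay)
print(w)  # decreasing weights summing to 1
```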
Context-Sensitive Time-Decay Attention
(Su et al., 2018)
[Figure: the same role-based architecture, but the convex, linear, and concave decay curves are combined by a learned attention model instead of a fixed hand-crafted function]
Time-decay attention significantly improves the understanding results.
Dialogue State Tracking
Tracks the user’s goal as dialogue acts accumulate over turns, e.g.:
request(restaurant; foodtype=Thai)
inform(area=centre)
request(address)
bye()
DNN for DST
[Figure: features extracted from the multi-turn conversation feed a DNN, which outputs a value distribution for each slot: the state of the current turn]
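For contrast with the DNN tracker above, a rule-based state tracker is just an accumulator over turns; this toy `track` helper is illustrative only and ignores the value distributions a DNN tracker would output:

```python
# A minimal rule-based state tracker (illustrative only; the slide's DNN
# tracker instead outputs a distribution over values for each slot).
def track(turns):
    """Accumulate slot values from a sequence of (act, slots) user turns."""
    state = {}
    for act, slots in turns:
        if act in ("inform", "request"):
            state.update(slots)      # later turns overwrite earlier values
    return state

turns = [
    ("request", {"foodtype": "Thai"}),
    ("inform",  {"area": "centre"}),
    ("inform",  {"area": "north"}),  # user changes their mind
]
print(track(turns))  # {'foodtype': 'Thai', 'area': 'north'}
```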
RNN-CNN DST
(Wen et al., 2016; figure from Wen et al., 2016)
Dialogue Policy Optimization
The policy maps each user act to a system act, e.g.:
System: greeting()
User: request(restaurant; foodtype=Thai) → System: request(area)
User: inform(area=centre) → System: inform(restaurant=Bangkok city, area=centre of town, foodtype=Thai)
User: request(address) → System: inform(address=24 Green street)
User: bye()
Dialogue Policy Optimization
Dialogue management in an RL framework: the user is the environment and the dialogue manager is the agent. Language understanding provides the observation O, natural language generation realizes the action A, and the user provides the reward R.
Goal: select the best action that maximizes the future reward.
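The agent-environment loop above can be sketched with tabular Q-learning; the two states, two system acts, and the hand-coded environment are hypothetical stand-ins for a real user and state tracker:

```python
import random
random.seed(0)

# Toy tabular Q-learning for a dialogue policy (illustrative states/acts).
STATES = ["have_food", "have_food_area"]
ACTS = ["request(area)", "inform(restaurant)"]
Q = {(s, a): 0.0 for s in STATES for a in ACTS}

def step(state, act):
    """Hand-coded environment: (next state, reward, done) per state/act."""
    if state == "have_food" and act == "request(area)":
        return "have_food_area", 0.0, False   # useful question
    if state == "have_food_area" and act == "inform(restaurant)":
        return state, 1.0, True               # task success
    return state, -0.1, False                 # wasted turn

alpha, gamma, eps = 0.5, 0.9, 0.2
for _ in range(300):
    s, done = "have_food", False
    while not done:
        if random.random() < eps:             # epsilon-greedy exploration
            a = random.choice(ACTS)
        else:
            a = max(ACTS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

best = max(ACTS, key=lambda act: Q[("have_food", act)])
print(best)  # request(area): ask for the missing info before informing
```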
E2E Task-Completion Bot (TC-Bot)
(Li et al., 2017)
Idea: supervised learning for each component, and RL for end-to-end training of the neural dialogue system.
[Figure: a user simulator (user goal → user model → NLG) converses with the neural dialogue system (LU → DST state st → dialogue policy learning over actions a1…ak), grounded in a knowledge database; e.g. “Are there any action movies to see this weekend?” → request_location]
User goal – Two tickets for “Deadpool” tomorrow 9PM at AMC Pacific Place 11 theater, Seattle.
RULE BASED SYSTEM
User: Do you have any ticket for movie ‘Deadpool’ at 9:00 pm?
Agent: Which theater do you want?
User: AMC pacific place 11 theater, please!
Agent: How many tickets do you want?
User: Two!
Agent: What date?
User: Tomorrow!
Agent: Which city do you want?
User: Seattle!
Agent: Do you have any specification for zip code?
User: No, I do not care.
Agent: Do you have any distance constraint?
User: No, I do not care.
Agent: How about video format? 2d? Imax 3d?
User: Any kind is okay.
Agent: Do you have any constraint to price?
User: No!
Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’.
User: Okay, thanks.
Agent: Thanks!
REINFORCEMENT LEARNING SYSTEM
User: I want two tickets for movie ‘Deadpool’.
Agent: What is the starttime you prefer?
User: 9:00 PM
Agent: Which day?
User: Tomorrow
Agent: Which city?
User: Seattle
Agent: Which theater do you prefer?
User: AMC Pacific Place 11 theater.
Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’.
User: Okay, thanks.
Agent: Thanks!

Reinforcement Learning Agent (Li et al., 2017): skips the requests the user may not care about, improving efficiency.
Issue 1: the rule-based agent has no notion of which requests can be skipped.
User goal – Two tickets for “Hail Caesar” tomorrow 7:15PM at AMC Lowes Oak Tree 6, Seattle.
Reinforcement Learning Agent
(Li et al., 2017)
Automatically learn when and what to confirm for better robustness.
REINFORCEMENT LEARNING SYSTEM
User: Do you have any ticket for movie ‘Hail Caesar’ at theater AMC Lowes Oak Tree 6?
Agent: What start time do you prefer?
User: 7:15 pm Tomorrow
Agent: Which city you are looking for?
User: Seattle
Agent: Okay, I will issue you tickets for movie ‘Hail Caeser’
tomorrow 7:50 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Oh, no.
Agent: It is 7:50 pm, right?
User: Two tickets for 7:15 pm, please.
Agent: Okay, I will issue you two tickets for movie ‘Hail Caeser’
tomorrow 7:15 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Thanks!
Agent: Thanks!
RULE BASED SYSTEM
User: Do you have any ticket for movie ‘Hail Caesar’ at theater AMC Lowes Oak Tree 6?
Agent: What start time do you prefer?
User: 7:15 pm Tomorrow
Agent: Which city you are looking for?
User: Seattle
Agent: How many tickets do you want?
User: Two
Agent: Okay, I have issued two tickets for movie ‘Hail Caeser’
tomorrow 7:50 pm at theater AMC Lowes Oak Tree 6, Seattle.
User: Oh, no.
Agent: Thanks!
Issue 2: no principled way to deal with LU errors/uncertainty
RL in Dialogue Systems
1. Sample-inefficient; hard to design the reward function; prone to local optima…
2. Real users are expensive
3. Discrepancy between real users and simulators
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018)
Idea: learn with real users with planning, and add a discriminator to filter out bad experiences.
[Figure: the policy model performs direct reinforcement learning from real experience with the human, and controlled planning from simulated experience produced by a world model; the world model is learned from real experience; the system itself is the NLU → DST → policy → NLG pipeline; a discriminator, trained to distinguish real from simulated experience, filters low-quality simulated episodes; imitation/supervised learning from human conversational data initializes the components]
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018)
S.-Y. Su, X. Li, J. Gao, J. Liu, and Y.-N. Chen, “Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning,” (to appear) in Proc. of EMNLP, 2018.
Policy learning is more robust and shows improvement in human evaluation.
Natural Language Generation (NLG)
Mapping dialogue acts into natural language:
inform(name=Seven_Days, foodtype=Chinese)
→ “Seven Days is a nice Chinese restaurant”
Template-Based NLG
Define a set of rules to map frames to natural language.
Pros: simple, error-free, easy to control. Cons: time-consuming, rigid, poor scalability.

Semantic Frame → Natural Language
confirm() → “Please tell me more about the product you are looking for.”
confirm(area=$V) → “Do you want somewhere in the $V?”
confirm(food=$V) → “Do you want a $V restaurant?”
confirm(food=$V, area=$W) → “Do you want a $V restaurant in the $W?”
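The rule table above maps directly onto a template lookup; the `generate` helper and its key scheme are an illustrative sketch of this approach:

```python
# Templates keyed by (act, sorted slot names), mirroring the confirm() rules.
TEMPLATES = {
    ("confirm",): "Please tell me more about the product you are looking for.",
    ("confirm", "area"): "Do you want somewhere in the {area}?",
    ("confirm", "food"): "Do you want a {food} restaurant?",
    ("confirm", "area", "food"): "Do you want a {food} restaurant in the {area}?",
}

def generate(act, **slots):
    """Select the rule matching the act and slot names, then fill in values."""
    key = (act, *sorted(slots))
    return TEMPLATES[key].format(**slots)

print(generate("confirm", food="Thai", area="centre"))
# Do you want a Thai restaurant in the centre?
```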
RNN-Based LM NLG
(Wen et al., 2015)
Input: the dialogue act as a 1-hot representation, e.g. inform(name=Din Tai Fung, food=Taiwanese) → 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, 0, 0, 0…
An RNN language model, conditioned on the dialogue act, generates the delexicalised output:
<BOS> SLOT_NAME serves SLOT_FOOD . <EOS>
which is relexicalised into:
<BOS> Din Tai Fung serves Taiwanese . <EOS>
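Delexicalisation and its inverse are simple string substitutions; a minimal sketch using naive `str.replace`, which a real system would make more careful about overlapping values:

```python
# Delexicalisation replaces slot values with placeholders so the NLG model
# generalizes across values; relexicalisation fills them back in.
def delexicalise(sentence, slots):
    for slot, value in slots.items():
        sentence = sentence.replace(value, f"SLOT_{slot.upper()}")
    return sentence

def relexicalise(template, slots):
    for slot, value in slots.items():
        template = template.replace(f"SLOT_{slot.upper()}", value)
    return template

slots = {"name": "Din Tai Fung", "food": "Taiwanese"}
d = delexicalise("Din Tai Fung serves Taiwanese .", slots)
print(d)                       # SLOT_NAME serves SLOT_FOOD .
print(relexicalise(d, slots))  # Din Tai Fung serves Taiwanese .
```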
Semantic Conditioned LSTM
(Wen et al., 2015)
Issue: semantic repetition
• “Din Tai Fung is a great Taiwanese restaurant that serves Taiwanese.”
• “Din Tai Fung is a child friendly restaurant, and also allows kids.”
Idea: use a gating mechanism to control the generated semantics (dialogue act/slots).
[Figure: a standard LSTM cell (gates it, ft, ot over xt and ht−1, updating Ct and ht) is augmented with a DA cell: a reading gate rt updates the dialogue-act vector dt from dt−1, initialized with the 1-hot act representation d0, e.g. inform(name=Seven_Days, food=Chinese) → 0, 0, 1, 0, 0, …, 1, 0, 0, …, 1, 0, 0, …]
Issues in NLG
Issues:
• NLG tends to generate shorter sentences
• NLG may generate grammatically incorrect sentences
Solutions:
• Generate word patterns in an order
• Consider linguistic patterns
Hierarchical NLG w/ Linguistic Patterns
(Su et al., 2018)
Input semantics: name[Midsummer House], food[Italian], priceRange[moderate], near[All Bar One], encoded as a semantic 1-hot representation by a bidirectional GRU encoder.
The hierarchical GRU decoder generates in four layers, each conditioned on the previous layer’s output:
1. NOUN + PROPN + PRON: “All Bar One place it Midsummer House”
2. + VERB: “All Bar One is priced place it is called Midsummer House”
3. + ADJ + ADV: “All Bar One is moderately priced Italian place it is called Midsummer House”
4. + Others: “Near All Bar One is a moderately priced Italian place it is called Midsummer House”
Training techniques: 1. repeat-input, 2. inner-layer teacher forcing, 3. inter-layer teacher forcing, 4. curriculum learning.
Evolution Roadmap
Dialogue breadth (coverage) vs. dialogue depth (complexity):
single-domain systems → extended systems → multi-domain systems → open-domain systems
“What is influenza?” / “I’ve got a cold, what do I do?” / “Tell me a joke.” / “I feel sad…”
Dialogue Systems
Task-Oriented Dialogue: input x → understanding (NLU) → state tracker → dialogue policy → generation (NLG) → output y, grounded in a database / memory / external knowledge.
Fully Data-Driven Chit-Chat Social Bots: the same input-to-output loop, but learned end-to-end from data without an explicit backend.
Neural Response Generation
(Sordoni et al., 2015; Vinyals & Le, 2015)
[Figure: an encoder reads the source conversation history (“… because of your game?”) and a decoder generates the target response (“Yeah, I’m on my way”)]
Learns to generate dialogues from offline data (no state, action, intent, slot, etc.)
Issue 1: Blandness Problem

“’tis a fine brew on a day like this! Strong though, how many is sensible?”
→ “I’m not sure yet, I’ll let you know!”
“Milan apparently selling Zlatan to balance the books... Where next, Madrid?”
→ “I don’t know.”
“Wow sour starbursts really do make your mouth water... mm drool. Can I have one?”
→ “Of course!”
“Well he was on in Bromley a while ago... still touring.”
→ “I don’t even know what he’s talking about.”

32% of responses are general and meaningless:
“I don’t know” / “I don’t know what you are talking about” / “I don’t think that is a good idea” / “Oh my god”
MMI for Response Diversity
(Li et al., 2016)

“’tis a fine brew on a day like this! Strong though, how many is sensible?”
→ “Depends on how much you drink!”
“Milan apparently selling Zlatan to balance the books... Where next, Madrid?”
→ “I think he’d be a good signing.”
“Wow sour starbursts really do make your mouth water... mm drool. Can I have one?”
→ “Of course you can! They’re delicious!”
“Well he was on in Bromley a while ago... still touring.”
→ “I’ve never seen him live.”
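Li et al. (2016) promote diversity with a Maximum Mutual Information objective; in the anti-LM variant, candidates are scored by log p(T|S) − λ·log p(T), penalizing generically probable responses. The probabilities below are made-up numbers purely to illustrate the reranking:

```python
import math

# Toy MMI-antiLM reranking: score(T) = log p(T|S) - lam * log p(T).
# Real systems get p(T|S) from a seq2seq model and p(T) from a language model.
candidates = {
    "I don't know.":                  {"p_t_given_s": 0.20, "p_t": 0.150},
    "Depends on how much you drink!": {"p_t_given_s": 0.10, "p_t": 0.001},
}

def mmi_score(c, lam=0.5):
    return math.log(c["p_t_given_s"]) - lam * math.log(c["p_t"])

best = max(candidates, key=lambda t: mmi_score(candidates[t]))
print(best)  # the specific response wins despite lower p(T|S)
```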
Real-World Conversations
Signals beyond the text:
• Multimodality
• Conversation history
• Persona
• User profile data (bio, social graph, etc.)
• Visual signal (camera, picture, etc.)
• Knowledge base
• Mood
• Geolocation
• Time
Issue 2: Response Inconsistency

Personalized Response Generation
(Li et al., 2016)
[Figure: a seq2seq model answers “where do you live” with “in england .”, conditioning every decoder step on the speaker embedding (Rob); word embeddings (50k) cluster related words (england/london/u.s., great/good, stay/live, monday/tuesday), and speaker embeddings (70k) cluster users (Rob_712, skinnyoflynny2, Tomcoatez, Kush_322, D_Gomes25, Dreamswalls, kierongillen5, TheCharlieZ, The_Football_Bar, This_Is_Artful, DigitalDan285, Jinnmeow3, Bob_Kelly2)]
Persona Model for Speaker Consistency
(Li et al., 2016)
Baseline model → inconsistency; persona model using speaker embeddings → consistency
Issue 3: Dialogue-Level Optimization via RL

• Task-completion bots (movies, restaurants, …): state = user input + context; action = dialogue act + slot-value; reward = task success rate, # of turns
• Info bots (Q&A over KB, Web, etc.): state = question + context; action = clarification questions, answers; reward = relevance of answer, # of turns
• Social bot (XiaoIce): state = conversation history; action = response; reward = engagement(?)

[Figure: language understanding maps user input o to state s; the dialogue manager selects a = π(s), collects rewards (s, a, r, s’), and optimizes Q(s, a); language (response) generation produces the response]
Deep RL for Response Generation
(Li et al., 2016)
[Figure: for the same input message, the supervised learning agent and the reinforcement learning agent produce different responses]
The RL agent generates more interactive responses and tends to end a sentence with a question, handing the conversation over to the user.
Issue 4: No Grounding
(Sordoni et al., 2015; Li et al., 2016)
H: hi
M: how are you?
H: not bad
M: what's wrong?
H: nothing really
M: wanna talk about it? i 'm here if you wanna talk
H: well, i'm just tired
M: me too, i'm here if you wanna talk
Neural model learns the general shape of conversations, and the system output is situationally appropriate and coherent.
H: would thursday afternoon work sometime?
M: yeah, sure. just let me know when you’re free.
H: after lunch is probably the best time
M: okay, sounds good. just let me know when you’re free.
H: would 2 pm work for you?
M: works for me.
H: well let’s say 2 pm then i’ll see you there
M: sounds good.
No grounding into a real calendar, but the “shape” of the conversation is fluent and plausible.
Chit-Chat vs. Task-Oriented

“The weather is so depressing these days.”
“I know, I dislike rain too.” ← Social chat: engaging, human-like interaction (ungrounded)
“Any recommendation?”
“What about a day trip to eastern Washington?”
“Try Dry Falls, it’s spectacular!” ← Task-oriented: task completion, decision support (grounded)
Image-Grounded Dialogues
(Mostafazadeh et al., 2017)
Neural model trained on 250K Twitter conversations: turn 1 is a tweet + accompanying photo; turn 2 is a tweet that is a question.

Tweet: my poor baby moms saying it’s probably totaled so rip Bessie
NeuroCon: How did you manage to do that?
Tweet: as penitence I offer this
NeuroCon: can I have one of those?
Tweet: some of the damage done
NeuroCon: Is that at your house?

Given data, other signals can be exploited in the same way: program state, database info, geolocation, etc.
Knowledge-Grounded Responses
(Ghazvininejad et al., 2017)
[Figure: for “Going to Kusakabe tonight”, a dialogue encoder reads the conversation history while a fact encoder reads contextually-relevant “facts” retrieved from world facts (“Consistently the best omakase”, “Amazing sushi tasting […]”, “They were out of kaisui […]”); the decoder combines both (Σ) to produce the response “Try omakase, the best in town”]
Conversation and Non-Conversation Data
Conversation data supplies the response pattern; a knowledge resource fills in the specifics:
“You know any good Japanese restaurant in Seattle?” → “Try Kisaku, one of the best sushi restaurants in the city.”
Pattern: “You know any good A restaurant in B?” → “Try C, one of the best D in the city.”
Evolution Roadmap
Dialogue breadth (coverage) vs. dialogue depth (complexity):
knowledge-based systems → common-sense systems → empathetic systems
“What is influenza?” / “I’ve got a cold, what do I do?” / “Tell me a joke.” / “I feel sad…”
Multimodality & Personalization
(Chen et al., 2018)
Task: user intent prediction, e.g. “send to vivian” – Email? Message? (Communication)
Challenge: language ambiguity
• User preference: some people prefer “Message” to “Email”; some prefer “Ping” to “Text”
• App-level contexts: “Message” is more likely to follow “Camera”; “Email” is more likely to follow “Excel”
Behavioral patterns in the history help intent prediction.
High-Level Intention Learning
(Sun et al., 2016; Sun et al., 2016)
A high-level intention may span several domains:
“Schedule a lunch with Vivian.” → find restaurant, check location, contact, play music
“What kind of restaurants do you prefer?” “The distance is …” “Should I send the restaurant information to Vivian?”
Users interact via high-level descriptions, and the system learns how to plan the dialogues.
Empathy in Dialogue Systems
(Fung et al., 2016)
Embed an empathy module:
• Recognize emotion using multimodality (vision, speech, text)
• Generate emotion-aware responses
Cognitive Behavioral Therapy (CBT)
Mood tracking, pattern mining, content providing
Goals: depression reduction; always be there; know you well
Challenges and Conclusions

Challenge Summary
The human-machine interface is a hot topic, but several components must be integrated!
Most state-of-the-art technologies are based on DNNs:
• They require huge amounts of labeled data
• Several frameworks/models are available
Remaining challenges:
• Fast domain adaptation with scarce data + re-use of rules/knowledge
• Handling reasoning and personalization
• Data collection and analysis from unstructured data
• Complex cascaded systems require high accuracy in every component to work well as a whole
Her (2013)
What can machines achieve now or in the future?
Q & A
Thanks for Your Attention!
Yun-Nung (Vivian) Chen Assistant Professor
National Taiwan University
y.v.chen@ieee.org / http://vivianchen.idv.tw