Unsupervised Learning and Modeling of
Knowledge and Intent for Spoken Dialogue Systems
YUN-NUNG (VIVIAN) CHEN
HTTP://VIVIANCHEN.IDV.TW
APRIL 16TH, 2015 @ NEW YORK UNIVERSITY
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
A Popular Robot - Baymax
Big Hero 6 -- Video content owned and licensed by Disney Entertainment, Marvel Entertainment, LLC, etc
A Popular Robot - Baymax
Baymax is capable of maintaining good spoken dialogues with people and learning new knowledge for better understanding and interaction.
The goal is to automate learning and understanding
procedures in system development.
Spoken Dialogue System (SDS)
Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via speech interaction.
Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
Apple’s Siri
Microsoft’s Cortana
Amazon’s Echo
Samsung’s SMART TV
Google Now
https://www.apple.com/ios/siri/
http://www.windowsphone.com/en-us/how-to/wp8/cortana/meet-cortana http://www.amazon.com/oc/echo/
http://www.samsung.com/us/experience/smart-tv/
Large Smart Device Population
The number of global smartphone users will surpass 2 billion in 2016.
As of 2012, there were 1.1 billion automobiles on Earth.
The input of these devices is evolving towards speech, which is more natural and convenient.
Knowledge Representation/Ontology
Traditional SDSs require manual annotations for specific domains to represent domain knowledge.
(Figure) Restaurant Domain: restaurant with slots type, price, location (relation: located_in)
(Figure) Movie Domain: movie with slots year, genre, director (relations: directed_by, released_in)
Node: semantic concept/slot; Edge: relation between concepts
Utterance Semantic Representation
A spoken language understanding (SLU) component requires the domain ontology to decode utterances into semantic forms, which contain the core content of the utterance (a set of slots and slot-fillers).
find a cheap taiwanese restaurant in oakland
show me action movies directed by james cameron
target=“restaurant”, price=“cheap”, type=“taiwanese”, location=“oakland”
target=“movie”, genre=“action”, director=“james cameron”
(Figure: the restaurant-domain and movie-domain ontologies shown earlier)
Challenges for SDS
An SDS in a new domain requires
1) A hand-crafted domain ontology
2) Utterances labelled with semantic representations
3) An SLU component for mapping utterances into semantic representations
As spoken interactions increase, building domain ontologies and annotating utterances cost a lot, so the approach does not scale.
The goal is to enable an SDS to automatically learn this
knowledge so that open domain requests can be handled.
Questions to Address
1) Given unlabelled raw audio recordings, how can a system automatically induce and organize domain-specific concepts?
2) With the automatically acquired knowledge, how can a system
understand individual utterances?
Interaction Example
find a cheap restaurant for asian food
User
Intelligent Agent
Q: How does a dialogue system process this request?
Cheap Asian restaurants include Kelly & Ping, Saigon Shack, etc.
What do you want to choose?
SDS Process – Available Domain Ontology
find a cheap restaurant for asian food
User
(Organized domain knowledge graph: seeking, target, price, food; relations: NN, AMOD, PREP_FOR)
Intelligent Agent
Ontology Induction (semantic slot)
Structure Learning (inter-slot relation)
SDS Process – Spoken Language Understanding (SLU)
find a cheap restaurant for asian food
User
(Organized domain knowledge graph: seeking, target, price, food; relations: NN, AMOD, PREP_FOR)
Intelligent Agent
seeking=“find”
target=“restaurant”
price=“cheap”
food=“asian food”
Semantic Decoding
SDS Process – Dialogue Management (DM)
find a cheap restaurant for asian food
User
(Organized domain knowledge graph: seeking, target, price, food; relations: NN, AMOD, PREP_FOR)
SELECT restaurant {
  restaurant.price=“cheap”
  restaurant.food=“asian food”
}
Intelligent Agent
SDS Process – Dialogue Management (DM)
find a cheap restaurant for asian food
User
SELECT restaurant {
  restaurant.price=“cheap”
  restaurant.food=“asian food”
}
Results: Kelly & Ping, Saigon Shack, …
Intelligent Agent
SDS Process – Natural Language Generation (NLG)
find a cheap restaurant for asian food
User
Intelligent Agent
Cheap Asian restaurants include Kelly & Ping, Saigon Shack, etc.
What do you want to choose?
Goals
find a cheap eating place for asian food
User
Required Domain-Specific Information:
(Domain ontology graph: seeking, target, price, food; relations: NN, AMOD, PREP_FOR)
SELECT restaurant {
  restaurant.price=“cheap”
  restaurant.food=“asian food”
}
Goals
find a cheap restaurant for asian food
User
(Domain ontology graph: seeking, target, price, food; relations: NN, AMOD, PREP_FOR)
SELECT restaurant {
  restaurant.price=“cheap”
  restaurant.food=“asian food”
}
Ontology Induction (semantic slot)
Structure Learning (inter-slot relation)
Semantic Decoding
Goals
find a cheap restaurant for asian food
User
Ontology Induction
Structure Learning
Semantic Decoding
Knowledge Acquisition
1) Given unlabelled raw audio recordings, how can a system automatically induce and organize domain-specific concepts?
Restaurant Asking Conversations (unlabelled collection)
(Organized domain knowledge graph: seeking, target, price, food, quantity; relations: NN, AMOD, PREP_FOR)
Knowledge Acquisition = Ontology Induction + Structure Learning
SLU Modeling
2) With the automatically acquired knowledge, how can a system understand individual utterances?
“can i have a cheap restaurant” → SLU Component (with the organized domain knowledge) → price=“cheap”, target=“restaurant”
SLU Modeling = Semantic Decoding
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
Ontology Induction [ASRU’13, SLT’14a]
Input: unlabelled user utterances
Output: slots that are useful for a domain-specific SDS
Y.-N. Chen et al., "Unsupervised Induction and Filling of Semantic Slots for Spoken Dialogue Systems Using Frame-Semantic Parsing," in Proc. of ASRU, 2013. (Best Student Paper Award)
Restaurant Asking Conversations (unlabelled collection) → domain-specific slots (seeking, target, price, food, quantity)
Step 1: Frame-semantic parsing on all utterances for creating slot candidates
Step 2: Slot ranking model for differentiating domain-specific concepts from generic concepts
Step 3: Slot selection
Probabilistic Frame-Semantic Parsing
FrameNet [Baker et al., 1998]
◦ a linguistic semantic resource based on the frame-semantics theory
◦ e.g., in “low fat milk”, “milk” evokes the “food” frame and “low fat” fills the descriptor frame element
SEMAFOR [Das et al., 2014]
◦ a state-of-the-art frame-semantics parser, trained on manually annotated FrameNet sentences
Baker et al., " The berkeley framenet project," in Proc. of International Conference on Computational linguistics, 1998.
Step 1: Frame-Semantic Parsing for Utterances
can i have a cheap restaurant
Frame: capability (FT LU: can; FE LU: i) → ?
Frame: expensiveness (FT LU: cheap) → good!
Frame: locale_by_use (FT/FE LU: restaurant) → good!
Task: adapting generic frames to domain-specific settings for SDSs
FT: Frame Target; FE: Frame Element; LU: Lexical Unit
Step 2: Slot Ranking Model
Main Idea: rank domain-specific concepts higher than generic semantic concepts
can i have a cheap restaurant
Frame: capability (FT LU: can; FE LU: i)
Frame: expensiveness (FT LU: cheap)
Frame: locale_by_use (FT/FE LU: restaurant)
The frame name is the slot candidate; the corresponding lexical unit is the slot filler.
Step 2: Slot Ranking Model
Rank a slot candidate s by integrating two scores:
◦ slot frequency in the domain-specific conversations (slots with higher frequency are more important)
◦ semantic coherence of the slot fillers (domain-specific concepts cover fewer topics, hence more coherent fillers)
Step 2: Slot Ranking Model
h(s): semantic coherence of the slot-filler set corresponding to s, measured by the cosine similarity between the fillers' word embeddings
Example: the slot “quantity” (fillers: a, one, all, three) has lower coherence in the topic space; the slot “expensiveness” (fillers: cheap, expensive, inexpensive) has higher coherence
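A minimal sketch of this ranking idea in Python, assuming h(s) is the average pairwise cosine similarity of the filler embeddings and that frequency and coherence are combined by simple linear interpolation (the exact weighting used in the thesis may differ):

```python
import numpy as np

def coherence(filler_vecs):
    """h(s): average pairwise cosine similarity among the slot-filler embeddings."""
    vecs = [v / np.linalg.norm(v) for v in filler_vecs]
    sims = [float(np.dot(a, b)) for i, a in enumerate(vecs) for b in vecs[i + 1:]]
    return float(np.mean(sims)) if sims else 0.0

def slot_score(freq, filler_vecs, total_count, alpha=0.5):
    """Integrate normalized slot frequency with semantic coherence (assumed interpolation)."""
    f = freq / total_count          # how often the slot candidate appears in the domain data
    h = coherence(filler_vecs)      # how semantically coherent its fillers are
    return (1 - alpha) * f + alpha * h
```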
Step 3: Slot Selection
Rank all slot candidates by their importance scores
Output slot candidates with higher scores based on a threshold
Example scores (frequency + semantic coherence): locale_by_use 0.89, capability 0.68, food 0.64, expensiveness 0.49, quantity 0.30, seeking 0.22, …
Selected slots: locale_by_use, food, expensiveness, capability, quantity
Experiments of Ontology Induction
Dataset
◦ Cambridge University SLU corpus [Henderson, 2012]
◦ Restaurant recommendation in an in-car setting in Cambridge
◦ WER = 37%
◦ vocabulary size = 1868
◦ 2,166 dialogues
◦ 15,453 utterances
◦ dialogue slots: addr, area, food, name, phone, postcode, price range, task, type
The mapping table between induced and reference slots
Henderson et al., "Discriminative spoken language understanding using word confusion networks," in Proc. of SLT, 2012.
Experiments of Ontology Induction
◦ Slot Induction Evaluation: Average Precision (AP) and Area Under the Precision-Recall Curve (AUC) of the slot ranking model, measuring the quality of induced slots via the mapping table

Approach                 | ASR AP (%)    | ASR AUC (%)   | Manual AP (%) | Manual AUC (%)
Baseline: MLE            | 56.7          | 54.7          | 53.0          | 50.8
MLE + Semantic Coherence | 71.7 (+26.5%) | 70.4 (+28.7%) | 74.4 (+40.4%) | 73.6 (+44.9%)

Induced slots reach about 70% AP and align well with the human-annotated slots for the SDS.
Semantic relations help decide domain-specific knowledge.
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
Structure Learning [NAACL-HLT’15]
Input: unlabelled user utterances
Output: slots with relations
Step 1: Construct a graph to represent slots, words, and relations
Step 2: Compute scores for edges (relations) and nodes (slots)
Step 3: Identify important relations connecting important slot pairs
Restaurant Asking Conversations (unlabelled collection) → domain-specific ontology
(Ontology graph: locale_by_use, food, expensiveness, seeking, relational_quantity, desiring; relations: NN, AMOD, DOBJ, PREP_FOR)
Step 1: Knowledge Graph Construction
Syntactic dependency parsing on utterances
Example: can i have a cheap restaurant (dependencies: nsubj, ccomp, det, amod, dobj; induced slots: capability, expensiveness, locale_by_use)
Word-based lexical knowledge graph: word nodes w (can, i, have, a, cheap, restaurant)
Slot-based semantic knowledge graph: slot nodes s (capability, expensiveness, locale_by_use)
Step 1: Knowledge Graph Construction
The edge between a node pair is weighted by relation importance.
(Word-based lexical knowledge graph over the word nodes; slot-based semantic knowledge graph over the slot nodes)
How are the weights that represent relation importance decided?
→ dependency-based word embeddings and dependency-based slot embeddings
Step 2: Weight Measurement
Slot/Word Embeddings Training
Word embeddings: can = [0.8 … 0.24], have = [0.3 … 0.21], …
Slot embeddings: expensiveness = [0.12 … 0.7], capability = [0.3 … 0.6], …
Both are trained on the dependency-parsed utterances, at the word level and at the slot level (with frame-evoking words replaced by their slots), as illustrated by the parsed example can i have a cheap restaurant (nsubj, ccomp, det, amod, dobj).
Step 2: Weight Measurement
Compute edge weights to represent relation importance:
◦ Slot-to-slot relation L_ss: similarity between slot embeddings
◦ Word-to-slot relation L_ws or L_sw: frequency of the slot-word pair
◦ Word-to-word relation L_ww: similarity between word embeddings
(Figure: word-layer nodes w1–w7 and slot-layer nodes s1–s3, connected by L_ww, L_ws/L_sw, and L_ss edges)
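A small sketch of how the three weight matrices could be assembled, assuming cosine similarity for the embedding-based edges and raw co-occurrence counts for the word-slot edges (the exact normalization in the thesis may differ):

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarities between the row-vector embeddings in E."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

def build_edge_weights(word_emb, slot_emb, word_slot_counts):
    L_ww = cosine_matrix(word_emb)   # word-to-word: similarity between word embeddings
    L_ss = cosine_matrix(slot_emb)   # slot-to-slot: similarity between slot embeddings
    L_ws = word_slot_counts          # word-to-slot: frequency of the slot-word pair
    return L_ww, L_ws, L_ss
```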
Step 2: Slot Importance by Random Walk
Scores are propagated from the word layer and then propagated within the slot layer.
Assumption: slots with more dependencies to more important slots should themselves be more important.
The random walk algorithm computes an importance score for each slot, starting from its original frequency score.
(Figure: the same two-layer graph; scores propagate over the L_ww, L_ws/L_sw, and L_ss edges)
Converged scores can be obtained by a closed-form solution.
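As a sketch, the converged importance vector can also be computed by power iteration once a propagation matrix M has been assembled from the L matrices above; the slide does not show how M is built, so its construction is left abstract here:

```python
import numpy as np

def dominant_eigenvector(M, iters=200, tol=1e-9):
    """Power iteration: returns the converged (dominant-eigenvector) score vector of M."""
    r = np.ones(M.shape[0]) / M.shape[0]   # start from uniform slot scores
    for _ in range(iters):
        r_new = M @ r                      # one propagation step over the graph
        r_new /= np.abs(r_new).sum()       # keep the scores normalized
        if np.abs(r_new - r).sum() < tol:  # stop once the scores have converged
            return r_new
        r = r_new
    return r
```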
The converged slot importance indicates whether a slot is important.
Rank slot pairs by summing their converged slot importance, and select the slot pairs whose scores exceed a threshold.
Step 3: Identify Domain Slots w/ Relations
Slot pairs are scored by summing converged importance, e.g., (s1, s2): r_s(1) + r_s(2); (s3, s4): r_s(3) + r_s(4); …
(Resulting ontology graph: locale_by_use, food, expensiveness, seeking, relational_quantity, desiring; relations: NN, AMOD, DOBJ, PREP_FOR)
Experiment 1 evaluates the slot importance (nodes); Experiment 2 evaluates the discovered relations (edges).
Experiments for Structure Learning
Experiment 1: Quality of Slot Importance
Dataset: Cambridge University SLU Corpus
Approach                               | ASR AP (%)    | ASR AUC (%)   | Manual AP (%) | Manual AUC (%)
Baseline: MLE                          | 56.7          | 54.7          | 53.0          | 50.8
Random Walk: MLE + Dependent Relations | 69.0 (+21.8%) | 68.5 (+24.8%) | 75.2 (+41.8%) | 74.5 (+46.7%)
Dependent relations help decide domain-specific knowledge.
Experiments for Structure Learning
Experiment 2: Relation Discovery Evaluation
Discover inter-slot relations connecting important slot pairs, and compare them with the reference ontology annotated with the most frequent syntactic dependencies.
(Learned ontology: locale_by_use, food, expensiveness, seeking, relational_quantity, desiring; relations: NN, AMOD, DOBJ, PREP_FOR)
(Reference ontology: type, food, pricerange, task, area; relations: AMOD, DOBJ, PREP_IN)
The automatically learned domain ontology aligns well with the reference one.
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
Knowledge Acquisition
Restaurant Asking Conversations (unlabelled collection) → Organized Domain Knowledge
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
SLU Modeling by Matrix Factorization
(Architecture figure: the unlabeled collection goes through frame-semantic parsing for ontology induction and through structure learning to build the lexical and semantic knowledge graphs; a feature model (Fw, Fs) and a knowledge graph propagation model (Rw, Rs, combining a word relation model and a slot relation model) feed the SLU model, which maps “can I have a cheap restaurant” to its semantic representation.)
Semantic Decoding
Input: user utterances, automatically learned knowledge
Output: the semantic concepts included in each individual utterance
(Ontology induction and structure learning feed the SLU model with the features Fw and Fs.)
Matrix Factorization (MF)
Feature Model
(Matrix figure: rows are utterances, columns are word observations (e.g., cheap, restaurant) and slot candidates (e.g., expensiveness, food, locale_by_use). Training utterances such as “i would like a cheap restaurant” (Utterance 1) and “find a restaurant with chinese food” (Utterance 2) contribute observed 1s; for the test utterance “show me a list of cheap restaurants”, the missing cells, including the hidden semantics from slot induction, are filled with estimated probabilities such as .90, .97, .85, .95.)
Matrix Factorization (MF)
Knowledge Graph Propagation Model
Reasoning with Matrix Factorization
Word Relation Model and Slot Relation Model: the word relation matrix and the slot relation matrix propagate relational knowledge through the feature matrix.
(Matrix figure: the same word-observation / slot-candidate matrix as above, with observed 1s for the training utterances and estimated probabilities for the test utterance.)
The MF method completes a partially-missing matrix based on the latent semantics by decomposing it into the product of two matrices.
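A minimal, illustrative sketch of matrix completion by factorization; it uses a squared loss on the observed cells for brevity, whereas the model in this work optimizes the BPR objective described next:

```python
import numpy as np

def complete_matrix(M, mask, rank=8, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Approximate M with U @ V.T using only the observed cells (mask == 1)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))   # latent vectors for utterances
    V = 0.1 * rng.standard_normal((m, rank))   # latent vectors for words/slots
    for _ in range(epochs):
        E = mask * (M - U @ V.T)               # reconstruction error on observed cells only
        U += lr * (E @ V - reg * U)
        V += lr * (E.T @ U - reg * V)
    return U @ V.T                             # completed matrix: scores for every cell
```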
Bayesian Personalized Ranking for MF
Model implicit feedback
◦ do not treat unobserved facts as negative samples (their truth value is unknown)
◦ give observed facts higher scores than unobserved facts
Objective: for each utterance u, rank every observed fact f+ above every unobserved fact f-, i.e., maximize Σ_u Σ_{f+} Σ_{f-} ln σ( x_{u,f+} - x_{u,f-} )
The objective is to learn a set of well-ranked semantic slots per utterance.
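A sketch of a single stochastic BPR update for the factorized model; the learning rate, regularization, and sampling scheme here are assumptions for illustration, not the settings used in the thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_step(U, V, u, f_pos, f_neg, lr=0.05, reg=0.01):
    """Push the observed fact f_pos above the unobserved fact f_neg for utterance u."""
    u_vec = U[u].copy()
    diff = V[f_pos] - V[f_neg]
    g = sigmoid(-(u_vec @ diff))               # gradient weight of ln sigma(x_u,f+ - x_u,f-)
    U[u]     += lr * (g * diff  - reg * U[u])
    V[f_pos] += lr * (g * u_vec - reg * V[f_pos])
    V[f_neg] += lr * (-g * u_vec - reg * V[f_neg])
```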
Experiments of Semantic Decoding
Experiment 1: Quality of Semantics Estimation
Dataset: Cambridge University SLU Corpus
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance
Approach                                        | ASR           | Manual
Baseline: Logistic Regression                   | 34.0          | 38.8
Random                                          | 22.5          | 25.1
Majority Class                                  | 32.9          | 38.4
MF: Feature Model                               | 37.6          | 45.3
MF: Feature Model + Knowledge Graph Propagation | 43.5 (+27.9%) | 53.4 (+37.6%)
The MF approach effectively models hidden semantics to improve SLU.
Adding a knowledge graph propagation model further improves the results.
Experiments of Semantic Decoding
Experiment 2: Effectiveness of Relations
Dataset: Cambridge University SLU Corpus
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance
Both semantic and dependent relations are useful to infer hidden semantics.
Approach                                     | ASR           | Manual
Feature Model                                | 37.6          | 45.3
Feature + KG Propagation: Semantic Relation  | 41.4 (+10.1%) | 51.6 (+13.9%)
Feature + KG Propagation: Dependent Relation | 41.6 (+10.6%) | 49.0 (+8.2%)
Feature + KG Propagation: Both               | 43.5 (+15.7%) | 53.4 (+17.9%)
Combining both types of relations further improves the performance.
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
SLU Modeling
“can i have a cheap restaurant” → SLU Model (with the organized domain knowledge) → price=“cheap”, target=“restaurant”
Outline
Introduction
Ontology Induction [ASRU’13, SLT’14a]
Structure Learning [NAACL-HLT’15]
Semantic Decoding (submitted)
Conclusions
Summary
Knowledge Acquisition
◦ Ontology Induction: semantic relations are useful
◦ Structure Learning: dependent relations are useful
SLU Modeling
◦ Semantic Decoding: the MF approach builds an SLU model to decode semantics
Conclusions
The knowledge acquisition procedure enables systems to automatically learn open domain knowledge and produce domain-specific ontologies.
The MF technique for SLU modeling provides a principled model that unifies the automatically acquired knowledge and allows systems to consider implicit semantics for better understanding.
The work shows the feasibility and the potential of improving
generalization, maintenance, efficiency, and scalability of SDSs.
Q & A
THANKS FOR YOUR ATTENTION!!
Word Embeddings
Training Process
◦ Each word w is associated with a vector
◦ The contexts within the window size c are considered as the training data D
◦ Objective function:
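The formula itself is not reproduced on the slide; for the CBOW model of Mikolov et al. (2013) shown below, the training objective is to maximize the average log-probability of each word given its context window:

```latex
\frac{1}{T}\sum_{t=1}^{T} \log p\big(w_t \mid w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}\big)
```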
[back]
(Figure: CBOW model; the projection layer sums the context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} to predict the output word w_t)
Mikolov et al., " Efficient Estimation of Word Representations in Vector Space," in Proc. of ICLR, 2013.
Mikolov et al., " Distributed Representations of Words and Phrases and their Compositionality," in Proc. of NIPS, 2013.
Dependency-Based Embeddings
Word & Context Extraction
Example: can i have a cheap restaurant (dependencies: nsubj, ccomp, det, amod, dobj)

Word       | Contexts
can        | have/ccomp
i          | have/nsubj^-1
have       | can/ccomp^-1, i/nsubj, restaurant/dobj
a          | restaurant/det^-1
cheap      | restaurant/amod^-1
restaurant | have/dobj^-1, a/det, cheap/amod
Dependency-Based Embeddings
Training Process
◦ Each word w is associated with a vector v_w and each context c is represented as a vector v_c
◦ Learn vector representations for both words and contexts such that the dot product v_w · v_c associated with good word-context pairs belonging to the training data D is maximized
◦ Objective function:
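The slide leaves the formula out; assuming the standard negative-sampling objective used for dependency-based embeddings (Levy and Goldberg, 2014), training maximizes

```latex
\sum_{(w,c)\in D} \Big( \log \sigma(v_w \cdot v_c) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_N \sim P_D}\big[ \log \sigma(-\,v_w \cdot v_{c_N}) \big] \Big)
```

where k negative contexts c_N are sampled from the context distribution P_D.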
[back]
Evaluation Metrics
◦ Slot Induction Evaluation: Average Precision (AP) and Area Under the Precision-Recall Curve (AUC) of the slot ranking model, measuring the quality of induced slots via the mapping table
[back]
Example ranking: 1. locale_by_use 0.89, 2. capability 0.68, 3. food 0.64, 4. expensiveness 0.49, 5. quantity 0.30
Precision at each rank: 1, 0, 2/3, 3/4, 0
AP = 80.56%
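Here only locale_by_use, food, and expensiveness map to reference slots (the zeros mark capability and quantity), so AP averages the precision at the three relevant ranks:

```latex
\mathrm{AP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{4}\right) \approx 80.56\%
```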
Slot Induction on ASR & Manual Results
The slot importance by frequency:
ASR:    1. locale_by_use 0.88  2. expensiveness 0.78  3. food 0.64  4. capability 0.59  5. quantity 0.44
Manual: 1. locale_by_use 0.89  2. capability 0.68  3. food 0.64  4. expensiveness 0.49  5. quantity 0.30
Users tend to speak important information more clearly, so misrecognition of less important slots may slightly benefit the slot induction performance.
[back]
Slot Mapping Table
(Table: induced slots, e.g., “origin” and “food”, with their fillers (asian, japan, beer, noodle, …) across utterances u_1 … u_n, mapped to the reference slot “food”)
Create the mapping if the slot fillers of an induced slot are included by the reference slot.
induced slots → reference slot
[back]
Random Walk Algorithm
The converged algorithm satisfies
The derived closed form solution is the dominant eigenvector of M
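The equation is omitted on the slide; a formulation consistent with the statement (an assumption here) is that the converged score vector r* is a fixed point of the propagation matrix, which is exactly the dominant-eigenvector condition:

```latex
r^{*} = M r^{*}
```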
[back]
SEMAFOR Performance
The SEMAFOR evaluation
[back]
Matrix Factorization
The decomposed matrices represent latent semantics for utterances and words/slots respectively
The product of two matrices fills the probability of hidden semantics
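In symbols (the notation here is assumed, not taken from the slide): with utterance latent vectors as rows of U and word/slot latent vectors as rows of V, the matrix is completed as

```latex
M \approx U V^{\top}, \qquad \hat{p}(y \mid x) = \sigma\big(u_x \cdot v_y\big)
```

where x indexes an utterance and y a word or slot.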
[back]
(Matrix figure: the word-observation / slot-candidate matrix, with observed 1s for the training utterances and estimated probabilities such as .90, .97, .85, .95 for the test utterance)