Review
Task-Oriented Dialogue System (Young, 2000)

Pipeline: Speech Signal → Speech Recognition → Text Input → Language Understanding (LU) → Semantic Frame → Dialogue Management (DM) → System Action/Policy → Natural Language Generation (NLG) → Text Response, with the DM consulting a Backend Database / Knowledge Providers.
• Language Understanding (LU): Domain Identification, User Intent Detection, Slot Filling
• Dialogue Management (DM): Dialogue State Tracking (DST), Dialogue Policy

Example flow:
• Speech/Text Input: "Are there any action movies to see this weekend?"
• Recognition Hypothesis: "are there any action movies to see this weekend"
• Semantic Frame: request_movie (genre=action, date=this weekend)
• System Action/Policy: request_location
• Text Response: "Where are you located?"

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short
Conventional LU
Language Understanding (LU)
Pipelined:
1. Domain Classification
2. Intent Classification
3. Slot Filling
LU – Domain/Intent Classification
As an utterance classification task:
• Given a collection of utterances ui with labels ci, D = {(u1,c1), …, (un,cn)} where ci ∊ C, train a model to estimate labels for new utterances uk.

Example: "find me a cheap taiwanese restaurant in oakland"
• Domain: Movies / Restaurants / Music / Sports / …
• Intent: find_movie, buy_tickets / find_restaurant, find_price, book_table / find_lyrics, find_singer / …
Conventional Approach
• Data: dialogue utterances annotated with domains/intents
• Model: machine learning classification model, e.g. support vector machine (SVM)
• Prediction: domains/intents
Theory: Support Vector Machine
• SVM is a maximum-margin classifier
• Input data points are mapped into a high-dimensional feature space where the data is linearly separable
• Support vectors are the input data points that lie on the margin
http://www.csie.ntu.edu.tw/~htlin/mooc/
Theory: Support Vector Machine
Multiclass SVM:
• Extended using the one-versus-rest approach: one binary classifier per class, SVM1 … SVMk, producing a score for each class, s1 … sk
• The scores are then transformed into probabilities, p1 … pk
• The domain/intent can be decided based on the estimated scores
http://www.csie.ntu.edu.tw/~htlin/mooc/
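As a concrete illustration, here is a minimal sketch of one-versus-rest intent classification with scikit-learn; LinearSVC is one-versus-rest by default, and CalibratedClassifierCV converts its margin scores into class probabilities. The toy utterances, intent labels, and bag-of-words features are assumptions for illustration only.

```python
# A minimal sketch, assuming scikit-learn and a toy annotated corpus.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

utterances = [
    "find me a cheap taiwanese restaurant in oakland",
    "any good cheap restaurants near me",
    "are there any action movies to see this weekend",
    "show me the movies playing tonight",
    "find the lyrics of this song",
    "what are the lyrics to this track",
]
intents = ["find_restaurant", "find_restaurant",
           "find_movie", "find_movie",
           "find_lyrics", "find_lyrics"]

# One-vs-rest linear SVMs over bag-of-words features; calibration
# turns the per-class margin scores s_k into probabilities p_k.
clf = make_pipeline(
    CountVectorizer(),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)
clf.fit(utterances, intents)
print(clf.predict(["cheap restaurants in oakland"]))        # best class
print(clf.predict_proba(["cheap restaurants in oakland"]))  # p_1 ... p_k
```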
LU – Slot Filling
As a sequence tagging task:
• Given a collection of tagged word sequences, S = {((w1,1, w1,2, …, w1,n1), (t1,1, t1,2, …, t1,n1)), ((w2,1, w2,2, …, w2,n2), (t2,1, t2,2, …, t2,n2)), …} where ti ∊ M, the goal is to estimate tags for a new word sequence.

Example: "flights from Boston to New York today"
• Entity Tag: O O B-city O B-city I-city O
• Slot Tag: O O B-dept O B-arrival I-arrival B-date
Conventional Approach
• Data: dialogue utterances annotated with slots
• Model: machine learning tagging model, e.g. conditional random fields (CRF)
• Prediction: slots and their values
Theory: Conditional Random Fields
• A (linear-chain) CRF assumes that the label at time step t depends on the label at the previous time step t−1
• The model defines p(y | x) ∝ exp(Σt Σk λk fk(yt−1, yt, x, t)) over feature functions fk; training maximizes the log probability log p(y | x) with respect to the parameters λ
• Slots can be tagged based on the y that maximizes p(y | x)
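Below is a minimal sketch using the sklearn-crfsuite package; the handcrafted token features and the one-sentence training set are illustrative assumptions, not the slide's own setup.

```python
# A minimal sketch, assuming sklearn-crfsuite (pip install sklearn-crfsuite).
import sklearn_crfsuite

def word2features(sent, i):
    # Simple per-token features: the word and its immediate neighbors.
    return {
        "word": sent[i].lower(),
        "is_title": sent[i].istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

sents = [["flights", "from", "Boston", "to", "New", "York", "today"]]
tags = [["O", "O", "B-dept", "O", "B-arrival", "I-arrival", "B-date"]]

X = [[word2features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, tags)  # maximizes log p(y | x) w.r.t. the parameters

test = ["flights", "from", "Seattle", "to", "Boston", "tomorrow"]
print(crf.predict([[word2features(test, i) for i in range(len(test))]]))
```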
Neural Network Based LU
A Single Neuron
• Computes a weighted sum of its inputs plus a bias: z = w1x1 + w2x2 + … + wNxN + b
• Applies an activation function to z; with the sigmoid activation, y = σ(z) = 1 / (1 + e^(−z))
• w and b are the parameters of this neuron
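A minimal NumPy sketch of the neuron above; the input, weight, and bias values are arbitrary illustrations.

```python
# A minimal sketch of a single sigmoid neuron, assuming NumPy.
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b              # z = w1*x1 + ... + wN*xN + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation: y in (0, 1)

x = np.array([0.5, -1.2, 3.0])        # illustrative input
w = np.array([0.8, 0.1, -0.4])        # illustrative weights
print(neuron(x, w, b=0.2))
```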
A Single Neuron
• Example (handwriting digits): if y ≥ 0.5, output "is 2"; if y < 0.5, output "not 2"
• A single neuron can only handle binary classification
A Layer of Neurons
• Handwriting digit classification: f : R^N → R^M
• A layer of neurons can handle multiple possible outputs, and the result depends on the max one
• Each neuron answers one binary question ("1" or not, "2" or not, "3" or not, …): 10 neurons for 10 classes
• Which one is max? That class is the prediction
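A minimal sketch of such a layer with an argmax decision; the random weights and the fake image vector are assumptions for illustration.

```python
# A minimal sketch of a layer of sigmoid neurons (f: R^N -> R^M), NumPy only.
import numpy as np

rng = np.random.default_rng(0)
N, M = 784, 10                          # e.g. 28x28 pixels, 10 digit classes
W = rng.normal(size=(M, N)) * 0.01      # one weight vector per neuron
b = np.zeros(M)

x = rng.random(N)                       # stand-in for a flattened image
y = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # one output per class
print(int(np.argmax(y)))                # prediction: whichever output is max
```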
Deep Neural Networks (DNN)
• Fully connected feedforward network f : R^N → R^M
• Input vector x = (x1, …, xN) → Layer 1 → Layer 2 → … → Layer L → output vector y = (y1, …, yM)
• Deep NN: multiple hidden layers
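A minimal NumPy sketch of a fully connected feedforward pass; the layer sizes, ReLU hidden activation, and random weights are assumptions for illustration.

```python
# A minimal sketch of a fully connected feedforward network in NumPy.
import numpy as np

rng = np.random.default_rng(0)

def feedforward(x, layers):
    # Each layer is a (W, b) pair; hidden layers use a ReLU nonlinearity.
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = layers[-1]
    return W @ x + b                   # final layer left linear

sizes = [300, 128, 64, 10]             # N = 300 inputs, M = 10 outputs
layers = [(rng.normal(size=(o, i)) * 0.1, np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
print(feedforward(rng.random(300), layers).shape)  # (10,)
```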
Recurrent Neural Network (RNN)
• The hidden state is carried across time: st = f(Uxt + Wst−1), with f a nonlinearity such as tanh or ReLU, and the output ot computed from st
• RNNs can learn accumulated sequential information (time series)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
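A minimal NumPy sketch of the recurrence st = tanh(Uxt + Wst−1), ot = Vst from the tutorial linked above; the dimensions and random inputs are illustrative assumptions.

```python
# A minimal sketch of a vanilla RNN unrolled over time, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 50, 32, 10, 7
U = rng.normal(size=(d_hid, d_in)) * 0.1   # input-to-hidden weights
W = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden-to-hidden (recurrent)
V = rng.normal(size=(d_out, d_hid)) * 0.1  # hidden-to-output weights

s = np.zeros(d_hid)                        # initial hidden state
for t in range(T):
    x_t = rng.random(d_in)                 # stand-in for the word vector at t
    s = np.tanh(U @ x_t + W @ s)           # state accumulates the history
    o_t = V @ s                            # per-step output (e.g. tag scores)
print(o_t.shape)                           # (10,)
```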
Model Training
• All model parameters can be updated by SGD
• At each time step, the predicted output is compared against the target (…, yt−1, yt, yt+1, …)
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Backpropagation Through Time (BPTT)
• Forward pass: compute the hidden states s1, s2, s3, s4, … and the outputs o1, o2, o3, o4, …
• A per-step cost C(t) compares each prediction ot against the correct tag yt, giving C(1), C(2), C(3), C(4), …
• Backward pass: the gradient of each C(t) is propagated back through the unrolled time steps
• The model is trained by comparing the correct sequence tags and the predicted ones
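A minimal PyTorch sketch of this training loop: autograd performs BPTT when loss.backward() is called on the summed per-step costs, and SGD updates all parameters. The model, random data, and hyperparameters are assumptions for illustration.

```python
# A minimal sketch of SGD training with BPTT, assuming PyTorch.
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, vocab=100, n_tags=5, d_emb=32, d_hid=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_tags)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))   # forward pass: states s_1..s_T
        return self.out(h)             # outputs o_1..o_T (tag scores)

model = Tagger()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randint(0, 100, (8, 10))     # 8 random "sentences" of length 10
y = torch.randint(0, 5, (8, 10))       # a gold tag at every time step

for _ in range(3):
    logits = model(x)
    # Sum of per-step costs C(t): predicted tags vs. correct tags.
    loss = nn.functional.cross_entropy(logits.reshape(-1, 5), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                    # backward pass through time (BPTT)
    opt.step()                         # SGD parameter update
```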
Deep Learning Approach
• Data: dialogue utterances annotated with semantic frames (user intents & slots)
• Model: deep learning model (classification/tagging), e.g. recurrent neural networks (RNN)
• Prediction: user intents, slots and their values
Classification Model
As an utterance classification task:
• Given a collection of utterances ui with labels ci, D = {(u1,c1), …, (un,cn)} where ci ∊ C, train a model to estimate labels for new utterances uk.
• Input: each utterance ui is represented as a feature vector fi
• Output: a domain/intent label ci for each input utterance
→ How can we represent a sentence using a feature vector?
Sequence Tagging Model
As a sequence tagging task:
• Given a collection of tagged word sequences, S = {((w1,1, w1,2, …, w1,n1), (t1,1, t1,2, …, t1,n1)), ((w2,1, w2,2, …, w2,n2), (t2,1, t2,2, …, t2,n2)), …} where ti ∊ M, the goal is to estimate tags for a new word sequence.
• Input: each word wi,j is represented as a feature vector fi,j
• Output: a slot label ti for each word in the utterance
→ How can we represent a word using a feature vector?
Word Representation
Atomic symbols: one-hot representation
• car = [0 0 0 0 0 0 1 0 0 … 0]
• motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
• car AND motorcycle = 0: the two vectors share no active bit
• Issue: difficult to compute similarity (e.g. comparing "car" and "motorcycle")
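A minimal sketch showing the problem: under one-hot vectors, the dot product between any two distinct words is zero. The tiny vocabulary is an illustrative assumption.

```python
# A minimal sketch of one-hot word vectors and their similarity, in NumPy.
import numpy as np

vocab = {"car": 0, "motorcycle": 1, "restaurant": 2}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("car") @ one_hot("car"))         # 1.0: identical word
print(one_hot("car") @ one_hot("motorcycle"))  # 0.0: no notion of similarity
```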
Word Representation
Neighbor-based: low-dimensional dense word embedding
• Idea: words with similar meanings often have similar neighbors
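A minimal sketch of the payoff: with dense embeddings, cosine similarity becomes meaningful. The 3-dimensional vectors below are fabricated so that related words point in similar directions; real embeddings would be learned from neighbor statistics.

```python
# A minimal sketch of similarity under dense embeddings, in NumPy.
import numpy as np

emb = {  # fabricated 3-d embeddings, for illustration only
    "car":        np.array([0.9, 0.1, 0.0]),
    "motorcycle": np.array([0.8, 0.2, 0.1]),
    "restaurant": np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["car"], emb["motorcycle"]))  # high: similar meanings
print(cosine(emb["car"], emb["restaurant"]))  # low: unrelated
```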
Chinese Input Unit of Representation
• Character: feed each character to each time step
• Word: word segmentation required
Example: 你知道美女與野獸電影的評價如何嗎? ("Do you know how the movie Beauty and the Beast is rated?")
Segmented: 你/知道/美女與野獸/電影/的/評價/如何/嗎
Can the two types of information be fused together for better performance? (A segmentation sketch follows below.)
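A minimal sketch contrasting the two input units, using the jieba segmenter as one possible tool; this is an assumption (the slide does not prescribe a segmenter), and jieba's output on traditional Chinese may differ from the segmentation shown above.

```python
# A minimal sketch of character vs. word input units for Chinese,
# assuming the jieba segmenter (pip install jieba).
import jieba

sent = "你知道美女與野獸電影的評價如何嗎"
chars = list(sent)             # character units: no segmenter needed
words = list(jieba.cut(sent))  # word units: segmentation required
print(chars)
print(words)  # ideally: 你/知道/美女與野獸/電影/的/評價/如何/嗎
```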
LU – Domain/Intent Classification
(Recap: the utterance classification task defined above.)
Deep Neural Networks for Domain/Intent Classification – I
(Sarikaya et al., 2011)
• Deep belief nets (DBN)
• Unsupervised training of weights
• Fine-tuning by back-propagation
• Compared to MaxEnt, SVM, and boosting
http://ieeexplore.ieee.org/abstract/document/5947649/
Deep Neural Networks for Domain/Intent Classification – II
(Tur et al., 2012; Deng et al., 2012)
• Deep convex networks (DCN)
• Simple classifiers are stacked to learn complex functions
• Feature selection of salient n-grams
• Extension to kernel-DCN
http://ieeexplore.ieee.org/abstract/document/6289054/; http://ieeexplore.ieee.org/abstract/document/6424224/
Deep Neural Networks for Domain/Intent Classification – III
(Ravuri and Stolcke, 2015)
• RNNs and LSTMs for utterance classification
• Word hashing to deal with the large number of singletons, e.g. Kat → #Ka, Kat, at#
• Each character n-gram is associated with a bit in the input encoding (see the sketch below)
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/RNNLM_addressee.pdf
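A minimal sketch of the word-hashing idea: decompose a word into boundary-marked character trigrams and set one bit per trigram, so unseen singleton words still share bits with known ones. The tiny trigram vocabulary here is a toy assumption.

```python
# A minimal sketch of word hashing via character trigrams, plain Python.
def char_trigrams(word):
    padded = "#" + word + "#"                   # boundary markers
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(sorted(char_trigrams("Kat")))             # ['#Ka', 'Kat', 'at#']

# Each trigram flips one bit in the input encoding:
trigram_vocab = sorted(char_trigrams("Kat") | char_trigrams("Cat"))
bits = [1 if t in char_trigrams("Kat") else 0 for t in trigram_vocab]
print(list(zip(trigram_vocab, bits)))
```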
LU – Slot Filling
(Recap: the sequence tagging task defined above.)
Recurrent Neural Nets for Slot Tagging – I
(Yao et al., 2013; Mesnil et al., 2015)
Variations:
a. RNNs with LSTM cells
b. Input with a sliding window of n-grams (LSTM-LA)
c. Bi-directional LSTMs (bLSTM)
[Figure: taggers (a) LSTM, (b) LSTM-LA, (c) bLSTM, each mapping words w0 … wn through hidden states (forward and backward in (c)) to tags y0 … yn; a sketch of (c) follows below.]
http://131.107.65.14/en-us/um/people/gzweig/Pubs/Interspeech2013RNNLU.pdf; http://dl.acm.org/citation.cfm?id=2876380
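A minimal PyTorch sketch of variant (c), a bidirectional LSTM tagger; the vocabulary size, tag set size, and dimensions are illustrative assumptions.

```python
# A minimal sketch of a bidirectional LSTM slot tagger, assuming PyTorch.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab=1000, n_tags=7, d_emb=64, d_hid=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.lstm = nn.LSTM(d_emb, d_hid, bidirectional=True,
                            batch_first=True)
        # Forward and backward states are concatenated per word.
        self.out = nn.Linear(2 * d_hid, n_tags)

    def forward(self, words):
        h, _ = self.lstm(self.emb(words))
        return self.out(h)                 # one tag score vector per word

tagger = BiLSTMTagger()
words = torch.randint(0, 1000, (1, 7))     # e.g. "flights from Boston ..."
print(tagger(words).shape)                 # (1, 7, 7): 7 words, 7 tag scores
```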
Recurrent Neural Nets for Slot Tagging – II
(Kurata et al., 2016; Simonnet et al., 2015)
• Encoder-decoder networks: leverage sentence-level information
• Attention-based encoder-decoder: uses attention (as in machine translation) in the encoder-decoder network; the attention over the encoder states h0 … hn is estimated by a feed-forward network taking ht and the decoder state st at time t, producing a context vector ci
http://www.aclweb.org/anthology/D16-1223
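A minimal PyTorch sketch of that additive attention step: a feed-forward score over each encoder state and the current decoder state, softmax-normalized into weights for a context vector. The dimensions and random states are illustrative assumptions.

```python
# A minimal sketch of feed-forward (additive) attention, assuming PyTorch.
import torch
import torch.nn as nn

d = 64
W_h = nn.Linear(d, d, bias=False)   # transforms each encoder state h_t
W_s = nn.Linear(d, d, bias=False)   # transforms the decoder state s
v = nn.Linear(d, 1, bias=False)     # scores the combined representation

H = torch.randn(10, d)              # encoder states h_0 .. h_9 (random)
s = torch.randn(d)                  # current decoder state (random)

scores = v(torch.tanh(W_h(H) + W_s(s))).squeeze(-1)  # one score per h_t
alpha = torch.softmax(scores, dim=0)                 # attention weights
c = alpha @ H                                        # context vector
print(c.shape)                                       # torch.Size([64])
```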
Joint Semantic Frame Parsing
• Sequence-based (Hakkani-Tur et al., 2016): slot filling and intent prediction in the same output sequence
  Example: "taiwanese food please" → B-type O O, with the intent (e.g. FIND_RESTAURANT) emitted at the final EOS step
• Parallel (Liu and Lane, 2016): intent prediction and slot filling are performed in two branches (see the sketch below)
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/IS16_MultiJoint.pdf; https://arxiv.org/abs/1609.01454
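A minimal PyTorch sketch in the spirit of the parallel variant: a shared encoder with two heads, a per-word slot branch and a per-utterance intent branch. Reading the intent off the final encoder state is a simplifying assumption (Liu and Lane use attention); all sizes are illustrative.

```python
# A minimal sketch of a two-branch joint slot/intent model, assuming PyTorch.
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    def __init__(self, vocab=1000, n_slots=7, n_intents=4, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.enc = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        self.slot_head = nn.Linear(2 * d, n_slots)      # per-word branch
        self.intent_head = nn.Linear(2 * d, n_intents)  # per-utterance branch

    def forward(self, words):
        h, _ = self.enc(self.emb(words))
        # Slot scores for every word; intent from the last encoder state
        # (a simplification of the attention used in the paper).
        return self.slot_head(h), self.intent_head(h[:, -1])

model = JointNLU()
slots, intent = model(torch.randint(0, 1000, (1, 6)))
print(slots.shape, intent.shape)  # (1, 6, 7) and (1, 4)
```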
Milestone 1 – Language Understanding
3) Collect and annotate data
4) Use a machine learning method to train your system
   • Conventional: SVM for domain/intent classification, CRF for slot filling
   • Deep learning: LSTM for domain/intent classification and slot filling
5) Test your system performance
Concluding Remarks