(1)

(2)

Review 2

(3)

Task-Oriented Dialogue System (Young, 2000)

[Architecture figure] Speech Signal → Speech Recognition → Hypothesis ("are there any action movies to see this weekend"); typed Text Input ("Are there any action movies to see this weekend?") enters the pipeline the same way. Hypothesis → Language Understanding (LU) → Semantic Frame (request_movie, genre=action, date=this weekend) → Dialogue Management (DM) → System Action/Policy (request_location) → Natural Language Generation (NLG) → Text response ("Where are you located?"). The DM consults a Backend Database / Knowledge Providers.

• LU: Domain Identification, User Intent Detection, Slot Filling
• DM: Dialogue State Tracking (DST), Dialogue Policy

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short

(4)

Task-Oriented Dialogue System (Young, 2000) — same architecture figure as the previous slide.

http://rsta.royalsocietypublishing.org/content/358/1769/1389.short

(5)

Conventional LU


(6)

Language Understanding (LU)

Pipelined:

1. Domain Classification
2. Intent Classification
3. Slot Filling

(7)

LU – Domain/Intent Classification

As an utterance classification task:

• Given a collection of utterances ui with labels ci, D = {(u1, c1), …, (un, cn)} where ci ∈ C, train a model to estimate labels for new utterances uk.

Example utterance: "find me a cheap taiwanese restaurant in oakland"

Domains and intents:
• Movies: find_movie, buy_tickets
• Restaurants: find_restaurant, find_price, book_table
• Music: find_lyrics, find_singer
• Sports

(8)

Conventional Approach

• Data: dialogue utterances annotated with domains/intents
• Model: machine learning classification model, e.g. support vector machine (SVM)
• Prediction: domains/intents

(9)

Theory: Support Vector Machine

• SVM is a maximum-margin classifier.
• Input data points are mapped into a high-dimensional feature space where the data are linearly separable.
• Support vectors are the input data points that lie on the margin.

http://www.csie.ntu.edu.tw/~htlin/mooc/
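For reference, the textbook hard-margin objective behind this slide (not in the extracted text itself): maximizing the margin is equivalent to

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad\text{subject to}\quad y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1,\quad i = 1,\dots,n
```

The support vectors are exactly the points where the constraint is tight, i.e. those lying on the margin.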

(10)

Theory: Support Vector Machine

Multiclass SVM:
• Extended using the one-versus-rest approach: one binary SVM per class (SVM1, SVM2, …, SVMk) produces a score S1, S2, …, Sk for each class.
• The scores are then transformed into probabilities P1, P2, …, Pk.
• The domain/intent can be decided based on the estimated scores.

http://www.csie.ntu.edu.tw/~htlin/mooc/
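A minimal sketch of this pipeline; the training data is illustrative, and the softmax at the end is a simple stand-in for proper probability calibration (e.g. Platt scaling):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

utterances = [
    "find me a cheap taiwanese restaurant in oakland",
    "book a table for two tonight",
    "are there any action movies to see this weekend",
    "buy two tickets for the late show",
    "play some jazz music",
    "who sings this song",
]
domains = ["Restaurants", "Restaurants", "Movies", "Movies", "Music", "Music"]

vec = TfidfVectorizer()
X = vec.fit_transform(utterances)
clf = LinearSVC().fit(X, domains)                 # one-versus-rest by default

# One score S_k per class, then turn scores into probabilities P_k
scores = clf.decision_function(vec.transform(["find a sushi place nearby"]))
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(dict(zip(clf.classes_, probs[0])))          # decide by the max
```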

(11)

LU – Slot Filling

As a sequence tagging task:

• Given a collection of tagged word sequences, S = {((w1,1, w1,2, …, w1,n1), (t1,1, t1,2, …, t1,n1)), ((w2,1, w2,2, …, w2,n2), (t2,1, t2,2, …, t2,n2)), …} where ti ∈ M, the goal is to estimate tags for a new word sequence.

Example:

Word:       flights  from  Boston  to  New        York       today
Entity Tag: O        O     B-city  O   B-city     I-city     O
Slot Tag:   O        O     B-dept  O   B-arrival  I-arrival  B-date

(12)

Conventional Approach

• Data: dialogue utterances annotated with slots
• Model: machine learning tagging model, e.g. conditional random fields (CRF)
• Prediction: slots and their values

(13)

Theory: Conditional Random Fields

• CRF assumes that the label at time step t depends on the label at the previous time step t−1.
• Training maximizes the log probability $\log p(y \mid x)$ with respect to the parameters $\lambda$, where for a linear-chain CRF over an input word sequence $x$ and output tag sequence $y$:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\Big)$$

• Slots can be tagged based on the $y$ that maximizes $p(y \mid x)$.
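A minimal sketch using sklearn-crfsuite, one common CRF implementation; the handcrafted features are illustrative, and the training data is just the slide's single example:

```python
import sklearn_crfsuite

def word_features(sent, i):
    """Simple per-token features: current, previous, and next word."""
    feats = {"bias": 1.0, "word.lower": sent[i].lower()}
    feats["prev.word"] = sent[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.word"] = sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>"
    return feats

sents = [["flights", "from", "Boston", "to", "New", "York", "today"]]
tags  = [["O", "O", "B-dept", "O", "B-arrival", "I-arrival", "B-date"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)              # maximizes log p(y|x) w.r.t. the feature weights
print(crf.predict(X)[0])      # Viterbi decoding: the y that maximizes p(y|x)
```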

(14)

Neural Network Based LU


(15)

A Single Neuron

[Figure] Inputs $x_1, x_2, \dots, x_N$ enter with weights $w_1, w_2, \dots, w_N$ and a bias $b$:

$$z = \sum_{i=1}^{N} w_i x_i + b, \qquad y = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid $\sigma(z)$ is the activation function; $w$ and $b$ are the parameters of this neuron.
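A minimal NumPy sketch of the neuron defined above; the numbers are illustrative, not from the slides:

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: weighted sum plus bias, squashed by a sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))    # sigma(z) = 1 / (1 + e^-z)

x = np.array([0.5, -1.0, 2.0])         # inputs x1..xN (illustrative)
w = np.array([1.0, 0.2, -0.3])         # weights w1..wN (illustrative)
print(neuron(x, w, b=0.1))             # output y in (0, 1)
```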

(16)

A Single Neuron

A single neuron implements a function $f: \mathbb{R}^N \rightarrow \mathbb{R}$. Thresholding its output gives a binary decision, e.g. for recognizing the digit "2":

• $y \ge 0.5$ → is "2"
• $y < 0.5$ → not "2"

A single neuron can only handle binary classification.

(17)

A Layer of Neurons

Handwriting digit classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

A layer of neurons can handle multiple possible outputs, and the result depends on the max one: each neuron scores one class ("1" or not, "2" or not, "3" or not, …) — 10 neurons for the 10 digit classes — and the prediction is whichever output $y_1, y_2, y_3, \dots$ is the max.

(18)

Deep Neural Networks (DNN)

Fully connected feedforward network: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

[Figure] Input vector x = (x1, x2, …, xN) → Layer 1 → Layer 2 → … → Layer L → output vector y = (y1, y2, …, yM).

Deep NN: multiple hidden layers.
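A minimal sketch of the forward pass of such a network; the weights here are random placeholders (a trained network would learn them), and the layer sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers: list of (W, b) pairs, applied in order."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)     # each layer: affine transform + nonlinearity
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]               # N=4 inputs, two hidden layers, M=3 outputs
layers = [(rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]
y = forward(rng.standard_normal(4), layers)
print(y, y.argmax())               # predicted class = the max output
```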

(19)

Recurrent Neural Network (RNN)

[Figure: an RNN cell unrolled over time; the recurrent activation is typically tanh or ReLU.]

An RNN can learn accumulated sequential information (time series).

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
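A minimal sketch in the linked tutorial's notation, s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t); all sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}); o_t = softmax(V s_t)."""
    s = np.zeros(W.shape[0])            # initial hidden state
    outputs = []
    for x_t in xs:                      # one iteration per time step
        s = np.tanh(U @ x_t + W @ s)    # the state accumulates the sequence so far
        outputs.append(softmax(V @ s))
    return outputs

# Three time steps of a 4-dim input, hidden size 5, 3 output classes (illustrative)
rng = np.random.default_rng(0)
U, W, V = (rng.standard_normal((5, 4)),
           rng.standard_normal((5, 5)),
           rng.standard_normal((3, 5)))
print(rnn_forward([rng.standard_normal(4) for _ in range(3)], U, W, V))
```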

(20)

Model Training

All model parameters can be updated by SGD: at each time step, the predicted outputs yt−1, yt, yt+1, … are compared against their targets, and the error is back-propagated.

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

(21)

BPTT (Back-Propagation Through Time)

[Figure] Forward pass: compute the hidden states s1, s2, s3, s4, … from inputs x1, x2, x3, x4 and emit outputs o1, o2, o3, o4, giving per-step costs C(1), C(2), C(3), C(4). Backward pass: back-propagate the error of each C(t) through the unrolled network.

The model is trained by comparing the correct sequence tags and the predicted ones.
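In symbols (standard BPTT, matching the slide's per-step costs C(1)…C(4)): the total cost sums the per-step costs, and since the same recurrent weights W are reused at every step, their gradient accumulates over the unrolled steps:

```latex
C = \sum_{t=1}^{T} C^{(t)}, \qquad
\frac{\partial C}{\partial W} = \sum_{t=1}^{T} \frac{\partial C^{(t)}}{\partial W}
```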

(22)

Deep Learning Approach

• Data: dialogue utterances annotated with semantic frames (user intents & slots)
• Model: deep learning model (classification/tagging), e.g. recurrent neural networks (RNN)
• Prediction: user intents, slots and their values

(23)

Classification Model

As an utterance classification task:

• Given a collection of utterances ui with labels ci, D = {(u1, c1), …, (un, cn)} where ci ∈ C, train a model to estimate labels for new utterances uk.

Input: each utterance ui is represented as a feature vector fi
Output: a domain/intent label ci for each input utterance

Key question: how to represent a sentence using a feature vector?

Sequence Tagging Model

As a sequence tagging task:

• Given a collection of tagged word sequences, S = {((w1,1, w1,2, …, w1,n1), (t1,1, t1,2, …, t1,n1)), ((w2,1, w2,2, …, w2,n2), (t2,1, t2,2, …, t2,n2)), …} where ti ∈ M, the goal is to estimate tags for a new word sequence.

Input: each word wi,j is represented as a feature vector fi,j
Output: a slot label ti for each word in the utterance

Key question: how to represent a word using a feature vector?

(25)

Word Representation

Atomic symbols: one-hot representation

car        = [0 0 0 0 0 0 1 0 0 … 0]
motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
car AND motorcycle = 0

Issue: difficult to compute word similarity (e.g. comparing "car" and "motorcycle") — any two distinct one-hot vectors have zero overlap.
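The issue, concretely (toy vocabulary for illustration):

```python
import numpy as np

vocab = {"car": 0, "motorcycle": 1, "weekend": 2}   # toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# Distinct one-hot vectors are orthogonal, so their similarity is always 0:
print(one_hot("car") @ one_hot("motorcycle"))       # 0.0
```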

(26)

Word Representation

Neighbor-based: low-dimensional dense word embeddings

Idea: words with similar meanings often have similar neighbors.
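A minimal sketch with gensim's word2vec (one common embedding toolkit; this assumes the gensim 4.x API, and the two-sentence corpus is far too small for meaningful embeddings — it only shows the mechanics):

```python
from gensim.models import Word2Vec

corpus = [
    ["i", "drove", "my", "car", "to", "work"],
    ["i", "rode", "my", "motorcycle", "to", "work"],
]
# "car" and "motorcycle" share neighbors, so their vectors move together.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=200)
print(model.wv.similarity("car", "motorcycle"))   # nonzero, unlike one-hot vectors
```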

(27)

Chinese Input: Unit of Representation

• Character: feed each character to each time step.
  你知道美女與野獸電影的評價如何嗎? ("Do you know how the movie Beauty and the Beast is reviewed?")
• Word: word segmentation required.
  你/知道/美女與野獸/電影/的/評價/如何/嗎

Can the two types of information be fused together for better performance?
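A sketch of the two input units, using jieba as one widely used segmenter (its output on this traditional-Chinese sentence may differ from the slide's segmentation):

```python
import jieba  # a common Chinese word segmenter

utt = "你知道美女與野獸電影的評價如何嗎"
chars = list(utt)          # character units, fed one per time step
words = jieba.lcut(utt)    # word units: segmentation required first
print(chars)
print(words)
```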

(28)

LU – Domain/Intent Classification (recap) — same task definition and movie/restaurant/music example as slide 7.

(29)

Deep Neural Networks for Domain/Intent Classification – I (Sarikaya et al., 2011)

• Deep belief nets (DBN)
• Unsupervised training of weights
• Fine-tuning by back-propagation
• Compared to MaxEnt, SVM, and boosting

http://ieeexplore.ieee.org/abstract/document/5947649/

(30)

Deep Neural Networks for Domain/Intent Classification – II (Tur et al., 2012; Deng et al., 2012)

• Deep convex networks (DCN): simple classifiers are stacked to learn complex functions
• Feature selection of salient n-grams
• Extension to kernel-DCN

http://ieeexplore.ieee.org/abstract/document/6289054/; http://ieeexplore.ieee.org/abstract/document/6424224/

(31)

Deep Neural Networks for Domain/Intent Classification – III (Ravuri and Stolcke, 2015)

• RNNs and LSTMs for utterance classification
• Word hashing to deal with the large number of singletons: each character n-gram is associated with a bit in the input encoding, e.g. Kat → #Ka, Kat, at#

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/RNNLM_addressee.pdf
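The word-hashing idea from this slide, sketched for the trigram case:

```python
def char_trigrams(word):
    """Word hashing: boundary-marked character trigrams, e.g. Kat -> #Ka, Kat, at#."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(char_trigrams("Kat"))   # ['#Ka', 'Kat', 'at#']
# Each distinct trigram gets one bit/index in the input encoding, so unseen
# singleton words still share features with words observed in training.
```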

(32)

LU – Slot Filling (recap) — same task definition and "flights from Boston to New York today" tagging example as slide 11.

(33)

Recurrent Neural Nets for Slot Tagging – I (Yao et al., 2013; Mesnil et al., 2015)

Variations:
a. RNNs with LSTM cells
b. Input: sliding window of n-grams
c. Bi-directional LSTMs

[Figure: (a) LSTM, (b) LSTM-LA, (c) bLSTM — each reads the words w0 … wn and emits tags y0 … yn; the bLSTM combines forward and backward hidden states.]

http://131.107.65.14/en-us/um/people/gzweig/Pubs/Interspeech2013RNNLU.pdf; http://dl.acm.org/citation.cfm?id=2876380

(34)

Recurrent Neural Nets for Slot Tagging – II (Kurata et al., 2016; Simonnet et al., 2015)

• Encoder-decoder networks: leverage sentence-level information.
• Attention-based encoder-decoder: uses attention (as in machine translation) in the encoder-decoder network; the attention at time t is estimated using a feed-forward network with inputs ht and st.

[Figure: the encoder reads the word sequence (wn … w0) into hidden states; the decoder emits the tags y0 … yn, in the attention variant weighting the encoder states h0 … hn via a context vector ci.]

http://www.aclweb.org/anthology/D16-1223

(35)

Joint Semantic Frame Parsing

[Figure: one RNN reads "taiwanese food please"; it outputs the slot tags B-type, O, O word by word, and at the EOS step outputs the intent FIND_REST — slot filling and intent prediction share one model.]

• Sequence-based (Hakkani-Tur et al., 2016): slot filling and intent prediction in the same output sequence.
• Parallel (Liu and Lane, 2016): intent prediction and slot filling are performed in two branches.

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/IS16_MultiJoint.pdf; https://arxiv.org/abs/1609.01454
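A minimal PyTorch sketch of the parallel ("two branches") variant under simplifying assumptions: the class name JointLU and all sizes are illustrative, and mean pooling stands in for the papers' final-state/attention utterance summarizers.

```python
import torch
import torch.nn as nn

class JointLU(nn.Module):
    """Shared BiLSTM encoder; one branch tags slots, one predicts the intent."""
    def __init__(self, vocab, n_slots, n_intents, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.slot_head = nn.Linear(2 * dim, n_slots)      # per-token slot tags
        self.intent_head = nn.Linear(2 * dim, n_intents)  # one label per utterance

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        h, _ = self.lstm(self.emb(tokens))     # (batch, seq_len, 2*dim)
        return self.slot_head(h), self.intent_head(h.mean(dim=1))

model = JointLU(vocab=1000, n_slots=10, n_intents=5)
slot_logits, intent_logits = model(torch.randint(0, 1000, (2, 7)))
print(slot_logits.shape, intent_logits.shape)  # (2, 7, 10), (2, 5)
```

Training would sum a per-token slot loss and a per-utterance intent loss, so the shared encoder learns both tasks jointly.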

(36)

Milestone 1 – Language Understanding

3) Collect and annotate data.
4) Use a machine learning method to train your system:
   • Conventional: SVM for domain/intent classification; CRF for slot filling
   • Deep learning: LSTM for domain/intent classification and slot filling
5) Test your system performance.

(37)

Concluding Remarks

[Same task-oriented dialogue system architecture figure as slide 3: Speech Recognition → Language Understanding (LU: Domain Identification, User Intent Detection, Slot Filling) → Dialogue Management (DM: Dialogue State Tracking, Dialogue Policy) → Natural Language Generation (NLG), with a Backend Database / Knowledge Providers behind the DM.]
