End-to-End Memory Networks for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen, Dilek Hakkani-Tür, Gokhan Tur, Jianfeng Gao, Li Deng
Outline
• Introduction
  – Spoken Dialogue System
  – Spoken/Natural Language Understanding (SLU/NLU)
• Contextual Spoken Language Understanding
  – Model Architecture
  – End-to-End Training
• Experiments
• Conclusion & Future Work
Dialogue System Pipeline

Speech Signal → ASR → Hypothesis: "are there any action movies to see this weekend"
(or direct Text Input: "Are there any action movies to see this weekend?")

Language Understanding (LU)
• User Intent Detection
• Slot Filling
→ Semantic Frame (Intents, Slots): request_movie, genre=action, date=this weekend

Dialogue Management (DM)
• Dialogue State Tracking
• Policy Decision
→ System Action: request_location

Output Generation
→ Text response: "Where are you located?" / Screen display: location?
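The pipeline stages can be sketched end to end with toy rule-based stand-ins; all function names and keyword rules below are illustrative, not the talk's system:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    """Output of LU: a user intent plus filled slots."""
    intent: str
    slots: dict = field(default_factory=dict)

def language_understanding(hypothesis: str) -> SemanticFrame:
    """Toy LU: intent detection + slot filling via keyword rules."""
    slots = {}
    if "action" in hypothesis:
        slots["genre"] = "action"
    if "this weekend" in hypothesis:
        slots["date"] = "this weekend"
    return SemanticFrame(intent="request_movie", slots=slots)

def dialogue_management(frame: SemanticFrame) -> str:
    """Toy DM: request any required slot that is still missing."""
    if "location" not in frame.slots:
        return "request_location"
    return "inform_movies"

def output_generation(action: str) -> str:
    """Toy template-based generation."""
    templates = {"request_location": "Where are you located?"}
    return templates.get(action, "...")

hyp = "are there any action movies to see this weekend"  # ASR hypothesis
frame = language_understanding(hyp)
print(output_generation(dialogue_management(frame)))  # Where are you located?
```

The sketch also makes the error-propagation point concrete: if LU mis-fills a slot, every downstream stage acts on the wrong frame.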
LU Importance

[Figure: learning curves of system success rate over simulation epochs (0–495) for an upper bound, DQN agents, and rule-based agents, each at 0% and 5% simulated LU error rates (Upper Bound, DQN - 0.00, DQN - 0.05, Rule - 0.00, Rule - 0.05)]

Both the RL agent and the rule agent show a >5% performance drop when 5% LU errors are introduced.

The system performance is sensitive to LU errors, for both rule-based and reinforcement learning agents.
Dialogue System Pipeline (cont.)

SLU usually focuses on understanding single-turn utterances. The understanding result is usually influenced by 1) local observations and 2) global knowledge.

In the pipeline above, LU is the current bottleneck: its errors propagate to dialogue management and output generation.
Spoken Language Understanding

Three tasks: Domain Identification, Intent Prediction, Slot Filling.

Single-turn example:
U: "just sent email to bob about fishing this weekend"
   Domain: communication   Intent: send_email
   Slots:  O O O O B-contact_name O B-subject I-subject I-subject
S: send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
U1: "send email to bob"
    Slots: O O O B-contact_name
S1: send_email(contact_name="bob")
U2: "are we going to fish this weekend"
    Slots: B-message I-message I-message I-message I-message I-message I-message
S2: send_email(message="are we going to fish this weekend")
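The slot-filling output above is an IOB tag sequence that gets collapsed into a semantic frame. A minimal sketch of that decoding step (the helper name and frame syntax are illustrative, not the talk's code):

```python
def iob_to_frame(words, tags, intent):
    """Collapse an IOB slot-tag sequence into a semantic frame string.

    Contiguous B-x / I-x tags are merged into one slot value; O tags
    close any open slot.
    """
    slots, current = [], None  # current = (slot_name, [words])
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                slots.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:
            if current:
                slots.append(current)
            current = None
    if current:
        slots.append(current)
    parts = [f'{name}="{" ".join(ws)}"' for name, ws in slots]
    return f"{intent}({', '.join(parts)})"

words = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
print(iob_to_frame(words, tags, "send_email"))
# send_email(contact_name="bob", subject="fishing this weekend")
```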
MODEL ARCHITECTURE

Idea: additionally incorporate contextual knowledge during slot tagging.

[Figure: the current utterance c is encoded by a sentence encoder (RNN_in) into a vector u; each history utterance x_i is encoded by a contextual sentence encoder (RNN_mem) into a memory representation m_i. Inner products between u and the m_i give the knowledge attention distribution p_i; the attention-weighted sum of the memories is the knowledge encoding o, which is combined with u through a weight matrix W_kg into h and fed (via weights M) into an RNN tagger (weights U, V, W over inputs w_t, hidden states h_t, and outputs y_t) that produces the slot tagging sequence y.]

1. Sentence Encoding   2. Knowledge Attention   3. Knowledge Encoding

Chen, et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
MODEL ARCHITECTURE (cont.)

The sentence encoders for the current and history utterances (RNN_in, RNN_mem) can also be CNNs instead of RNNs.
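The three steps above can be sketched in numpy, with the sentence encoders abstracted away as precomputed vectors; the shapes, random stand-ins, and the exact combination h = W_kg(o + u) are one plausible reading of the diagram, not the paper's code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 150  # embedding dimension, matching the talk's setup

# 1. Sentence encoding: u for the current utterance c, memory
#    vectors m_i for the history utterances {x_i} (random stand-ins
#    for RNN/CNN encoder outputs).
u = rng.normal(size=d)
M = rng.normal(size=(5, d))   # 5 history turns, one row per m_i

# 2. Knowledge attention: p_i = softmax(u . m_i)
p = softmax(M @ u)

# 3. Knowledge encoding: attention-weighted sum of memories,
#    combined with u and projected before the RNN tagger.
o = p @ M                     # o = sum_i p_i * m_i
W_kg = rng.normal(size=(d, d)) / np.sqrt(d)
h = W_kg @ (o + u)            # fed into the RNN tagger
```

Because p is a softmax over similarities, turns whose memory vectors align with the current utterance dominate the carried-over knowledge.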
END-TO-END TRAINING

• Tagging Objective
• RNN Tagger

[Figure: the RNN tagger unrolled over time steps t−1, t, t+1; word inputs w_t enter through U, the knowledge encoding o through M, hidden states h_t recur through W, and slot tags y_t are emitted through V. Inputs are the contextual utterances and the current utterance; output is the slot tag sequence.]

Trained end to end, the model automatically figures out the attention distribution without explicit supervision.
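The tagging objective is ordinarily the sequence cross-entropy; one plausible way to write it, with y_t the slot tag at step t and o the knowledge encoding (notation assumed from the diagram, not taken from the paper):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\!\left(y_t \mid w_{1:t}, o;\ \theta\right)
```

Because o depends differentiably on the attention distribution p_i, minimizing this loss trains the attention with no attention-level labels, which is what makes the training end-to-end.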
EXPERIMENTS

• Dataset: Cortana communication session data
• Setup: GRU for all RNNs, adam optimizer, embedding dim = 150, hidden units = 100, dropout = 0.5

Model           Training Set   Knowledge Encoding         Sentence Encoder   First Turn   Other   Overall
RNN Tagger      single-turn    x                          x                  60.6         16.2    25.5
RNN Tagger      multi-turn     x                          x                  55.9         45.7    47.4
Encoder-Tagger  multi-turn     current utt (c)            RNN                57.6         56.0    56.3
Encoder-Tagger  multi-turn     history + current (x, c)   RNN                69.9         60.8    62.5
Proposed        multi-turn     history + current (x, c)   RNN                73.2         65.7    67.1
Proposed        multi-turn     history + current (x, c)   CNN                73.8         66.5    68.0

Takeaways:
• The model trained on single-turn data performs much worse on non-first turns due to mismatched training data.
• Treating multi-turn data as single-turn for training performs reasonably.
• Encoding history together with the current utterance improves performance but increases training time.
• Applying memory networks (Proposed) significantly outperforms all other approaches, with much less training time.
• A CNN sentence encoder produces comparable results with shorter training time.
Conclusion

• The proposed end-to-end memory networks store contextual knowledge, which can be exploited dynamically via an attention model to carry knowledge over for multi-turn understanding.
• The end-to-end model performs the tagging task rather than classification.
• The experiments show the feasibility and robustness of modeling knowledge carryover through memory networks.
Future Work

• Leverage not only local observations but also global knowledge for better language understanding.
  – Syntax or semantics can serve as global knowledge to guide the understanding model.
  – "Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks," arXiv preprint arXiv:1609.03286.
Q & A

Thanks for your attention!