
(1)

(2)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge

◦ How to utilize the current observations

Unlabeled Data

◦ How to re-use the trained dialogue acts

◦ How to share knowledge across languages

◦ How to utilize parallel data

Conclusions


(3)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations

Unlabeled Data

◦ How to re-use the trained dialogue acts

◦ How to share knowledge across languages

◦ How to utilize parallel data

Conclusions


(4)

Prior Structural Knowledge

Syntax (Dependency Tree)

Semantics (AMR Graph)

Sentence s: show me the flights from seattle to san francisco

[Figure: the dependency tree of s (from ROOT) and the AMR graph of s (show → you, flight, I; flight → city Seattle, city San Francisco), each marked with four numbered substructures, 1. to 4.]

Prior knowledge about syntax or semantics may guide understanding

Y.-N. Chen, D. Hakkani-Tur, G. Tur, A. Celikyilmaz, J. Gao, and L. Deng, “Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks,” preprint arXiv: 1609.00777, 2016.

(5)

K-SAN: Knowledge-Guided Structural Attention Networks

Prior knowledge as a teacher

[Figure: K-SAN architecture. Knowledge Encoding Module: the knowledge-guided structure {x_i} passes through knowledge encoding (CNN_kg) to give m_i, and the input sentence s passes through sentence encoding (CNN_in) to give u; the inner product of u and m_i gives the knowledge attention distribution p_i; the weighted sum ∑ over the encoded knowledge representation gives h; NN_out gives the knowledge-guided representation o. RNN Tagger: inputs w_t, hidden states h_t, and outputs y_t, with weights U, W, V, and M; the output is the slot tagging sequence y. Example input: ROOT show me the flights from seattle to san francisco.]

(6)

Sentence Structural Knowledge

Sentence s: show me the flights from seattle to san francisco

Syntax (Dependency Tree)

Knowledge-Guided Substructure x_i

1. show me
2. show flights the
3. show flights from seattle
4. show flights to francisco san

Semantics (AMR Graph)

(s / show
  :ARG0 (y / you)
  :ARG1 (f / flight
    :source (c / city
      :name (d / name :op1 Seattle))
    :destination (c2 / city
      :name (s2 / name :op1 San :op2 Francisco)))
  :ARG2 (i / I)
  :mode imperative)

Knowledge-Guided Substructure x_i

1. show you
2. show flight seattle
3. show flight san francisco
4. show i
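The four syntactic substructures on this slide read like root-to-leaf paths of the dependency tree. The sketch below illustrates that reading only; the head indices in `heads` are an assumption chosen to reproduce the slide's substructures, not the output of a particular parser, and `root_to_leaf_paths` is a hypothetical helper name.

```python
from collections import defaultdict

def root_to_leaf_paths(tokens, heads):
    """Return every root-to-leaf path of a dependency tree as a string.

    heads[i] is the index of the head of tokens[i]; the root has head -1.
    """
    children = defaultdict(list)
    root = None
    for idx, head in enumerate(heads):
        if head == -1:
            root = idx
        else:
            children[head].append(idx)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [tokens[node]]
        if not children[node]:          # leaf: one substructure is complete
            paths.append(" ".join(prefix))
            return
        for child in children[node]:
            walk(child, prefix)

    walk(root, [])
    return paths

tokens = ["show", "me", "the", "flights", "from", "seattle", "to", "san", "francisco"]
# Assumed heads (hypothetical): show is the root; me and flights attach to show;
# the, from, to attach to flights; seattle to from; francisco to to; san to francisco.
heads = [-1, 0, 3, 0, 3, 4, 3, 8, 6]

paths = root_to_leaf_paths(tokens, heads)
for number, path in enumerate(paths, start=1):
    print(f"{number}. {path}")
```

With a different parser or dependency scheme the tree, and therefore the substructures, may differ.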

(7)


Knowledge-Guided Structures

[Figure: the K-SAN architecture from slide (5), repeated]

The model will pay more attention to more important substructures that may be crucial for slot tagging.
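A minimal sketch of the attention step named in the figure labels (inner product, attention distribution p_i, weighted sum h) may help. It assumes the distribution is a softmax over the inner products, which the slide does not state, and it uses small hand-written vectors in place of the CNN encoders; `knowledge_attention` is a hypothetical name.

```python
import math

def softmax(scores):
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knowledge_attention(u, memory):
    """u: sentence encoding; memory: list of encoded substructures m_i."""
    scores = [sum(ui * mi for ui, mi in zip(u, m)) for m in memory]  # inner products
    p = softmax(scores)                     # assumed normalization into p_i
    h = [sum(p_i * m[d] for p_i, m in zip(p, memory)) for d in range(len(u))]
    return p, h

# Toy encodings (assumptions, standing in for CNN_in and CNN_kg outputs).
u = [1.0, 0.0, 1.0]
memory = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 0.0],
]
p, h = knowledge_attention(u, memory)
print([round(x, 3) for x in p])
```

The substructure whose encoding aligns best with the sentence encoding gets the largest weight, so it dominates h.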

(8)

K-SAN Experiments

ATIS Dataset (F1 slot filling)

Model                   Small (1/40)   Medium (1/10)   Large
Tagger (GRU)            73.83          85.55           93.11
Encoder-Tagger (GRU)    72.79          88.26           94.75

(9)

K-SAN Experiments

ATIS Dataset (F1 slot filling)

Model                   Small (1/40)   Medium (1/10)   Large
Tagger (GRU)            73.83          85.55           93.11
Encoder-Tagger (GRU)    72.79          88.26           94.75
K-SAN (Stanford dep)    74.60⁺         87.99           94.86⁺
K-SAN (Syntaxnet dep)   74.35⁺         88.40⁺          95.00⁺

Syntax provides richer knowledge and more general guidance when less training data is available.

(10)

K-SAN Experiments

ATIS Dataset (F1 slot filling)

Model                   Small (1/40)   Medium (1/10)   Large
Tagger (GRU)            73.83          85.55           93.11
Encoder-Tagger (GRU)    72.79          88.26           94.75
K-SAN (Stanford dep)    74.60⁺         87.99           94.86⁺
K-SAN (Syntaxnet dep)   74.35⁺         88.40⁺          95.00⁺
K-SAN (AMR)             74.32⁺         88.14           94.85⁺
K-SAN (JAMR)            74.27⁺         88.27⁺          94.89⁺

Syntax provides richer knowledge and more general guidance when less training data is available.

Semantics captures the most salient information, so it achieves similar performance with far fewer substructures.

(11)

Attention Analysis

Darker blocks and lines correspond to higher attention weights


(12)

Attention Analysis

Darker blocks and lines correspond to higher attention weights

Even with less training data, K-SAN lets the model pay similar attention to the salient substructures that are important for tagging.

(13)

EHR Data

Predicting diagnosis codes for clinical reports

◦ Present illness text

◦ “fever up to 39.4C intermittent in recent 3 days, cough/sputum(+), shortness of breath tonight”

◦ ICD-9 diagnosis codes

◦ 486: Pneumonia, organism unspecified; 780.6: Fever


(14)

CNN for Diagnosis Code Prediction

(Li et al., 2017)

Convolutional neural network (CNN) for multi-label code prediction

◦ Multiple convolutional filters for extracting different patterns

[Figure: Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Multi-Label Code Prediction]

C. Li, et al., “Convolutional Neural Networks for Medical Diagnosis from Admission Notes,” in arXiv, 2017.
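The pipeline on this slide (embedding, convolution, max pooling, fully-connected layer, multi-label output) can be sketched as a forward pass. This is an illustration under assumptions, not the model of Li et al.: the weights are random, the ReLU and the per-code sigmoid are assumed choices, and all sizes and names (`predict_codes`, `VOCAB`, and so on) are hypothetical.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conv_max_pool(embedded, kernel):
    """Apply one filter (width x dim) at every position, then keep the max."""
    width = len(kernel)
    activations = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        activations.append(relu(sum(dot(k, x) for k, x in zip(kernel, window))))
    return max(activations)  # needs at least `width` tokens

def predict_codes(token_ids, embeddings, filters, fc_weights, fc_bias):
    embedded = [embeddings[t] for t in token_ids]            # embedding layer
    pooled = [conv_max_pool(embedded, f) for f in filters]   # one feature per filter
    # One independent sigmoid per code: several codes can be active at once.
    return [sigmoid(dot(row, pooled) + b) for row, b in zip(fc_weights, fc_bias)]

random.seed(0)
VOCAB, DIM, N_FILTERS, WIDTH, N_CODES = 10, 4, 3, 2, 5

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embeddings = rand_matrix(VOCAB, DIM)
filters = [rand_matrix(WIDTH, DIM) for _ in range(N_FILTERS)]
fc_weights = rand_matrix(N_CODES, N_FILTERS)
fc_bias = [0.0] * N_CODES

token_ids = [1, 4, 1, 7]  # a toy tokenized note
probs = predict_codes(token_ids, embeddings, filters, fc_weights, fc_bias)
print([round(p, 3) for p in probs])
```

Each filter acts as one pattern detector, which is the sense in which multiple filters extract different patterns.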

(15)

Hierarchy Category Knowledge

Low-level code

◦ 301.0: Paranoid personality disorder

◦ 301.1: Affective personality disorder

◦ 301.2: Schizoid personality disorder

High-level category

◦ All belong to the “personality disorders”

Idea: category knowledge provides additional cues about how codes are related

[Figure: Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Multi-Label Code Prediction]

(16)

Hierarchy Category Knowledge

(Cluster Penalty)

Low-level code

◦ 301.0: Paranoid personality disorder

◦ 301.1: Affective personality disorder

◦ 301.2: Schizoid personality disorder

High-level category

◦ All belong to the “personality disorders”

Category constrained loss

[Figure: Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Multi-Label Code Prediction]

A. Nie, et al., “DeepTag: inferring all-cause diagnoses from clinical notes in under-resourced medical domain,” in arXiv, 2018.

(17)

Hierarchy Category Knowledge (Multi-Task)

Low-level code

◦ 301.0: Paranoid personality disorder

◦ 301.1: Affective personality disorder

◦ 301.2: Schizoid personality disorder

High-level category

◦ All belong to the “personality disorders”

Low-level code infers the high-level category

Category integrated loss via multi-task

[Figure: Multi-Task Category Knowledge Integration. Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Low-Level Code Prediction and High-Level Category Prediction]

$L = L_{\text{low}} + \gamma \cdot L_{\text{high}}$

$y_{\text{high}} = 1 \text{ if } y_{\text{low}} = 1$
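The multi-task loss can be sketched as follows. The slide does not say which loss L_low and L_high are, so binary cross-entropy is an assumption here, as are the code-to-category mapping, the value of gamma, and every function name.

```python
import math

def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy (an assumed choice for L_low and L_high)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total += t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return -total / len(y_true)

def high_level_targets(y_low, code_to_category, n_categories):
    """y_high = 1 for a category if any of its low-level codes has y_low = 1."""
    y_high = [0] * n_categories
    for code_idx, label in enumerate(y_low):
        if label == 1:
            y_high[code_to_category[code_idx]] = 1
    return y_high

def multitask_loss(y_low, p_low, p_high, code_to_category, gamma):
    y_high = high_level_targets(y_low, code_to_category, len(p_high))
    return bce(y_low, p_low) + gamma * bce(y_high, p_high)

# Codes: 301.0, 301.1, 301.2, 486, 780.6 (category indices are hypothetical).
code_to_category = [0, 0, 0, 1, 2]
y_low = [0, 1, 0, 1, 0]                  # gold low-level labels
p_low = [0.1, 0.7, 0.2, 0.8, 0.3]        # low-level head predictions
p_high = [0.6, 0.9, 0.2]                 # high-level head predictions

loss = multitask_loss(y_low, p_low, p_high, code_to_category, gamma=0.5)
print(high_level_targets(y_low, code_to_category, 3), round(loss, 4))
```

Here the high-level targets come for free from the low-level annotations, so no extra labeling is needed for the second task.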

(18)

Hierarchy Category Knowledge (Avg Meta-Label)

Low-level code

◦ 301.0: Paranoid personality disorder

◦ 301.1: Affective personality disorder

◦ 301.2: Schizoid personality disorder

High-level category

◦ All belong to the “personality disorders”

High-level prob can be approximated by the average of low-level code prob

Category integrated loss

[Figure: Avg Meta-Label Category Knowledge Integration. Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Low-Level Code Prediction → High-Level Category Prediction]

$L = L_{\text{low}} + \gamma \cdot L_{\text{high}}$

$y_{\text{high}} = \frac{1}{K} \sum_{k} y_{\text{low}}^{k}$ (K: number of low-level codes in the category)
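As a small worked example of the averaging rule, the sketch below derives one high-level probability per category from the low-level code probabilities. The mapping from codes to categories and the probabilities are made-up values, and `avg_meta_label` is a hypothetical name.

```python
def avg_meta_label(p_low, code_to_category, n_categories):
    """High-level probability = mean of the low-level probabilities in the category."""
    sums = [0.0] * n_categories
    counts = [0] * n_categories
    for idx, p in enumerate(p_low):
        category = code_to_category[idx]
        sums[category] += p
        counts[category] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Codes: 301.0, 301.1, 301.2, 486, 780.6 (category indices are hypothetical).
code_to_category = [0, 0, 0, 1, 2]
p_low = [0.2, 0.8, 0.5, 0.9, 0.1]

p_high_avg = avg_meta_label(p_low, code_to_category, 3)
print(p_high_avg)
```

One property worth noting: a single confident code in a large category is diluted by its low-probability siblings.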

(19)

Hierarchy Category Knowledge (At-Least-One Meta-Label)

Low-level code

◦ 301.0: Paranoid personality disorder

◦ 301.1: Affective personality disorder

◦ 301.2: Schizoid personality disorder

High-level category

◦ All belong to the “personality disorders”

High-level prob can be approximated by the at-least-one of low-level code prob

Category integrated loss

[Figure: At-Least-One Meta-Label Category Knowledge Integration. Clinic Text (“No dizziness No fever …”) → Embedding Layer → Conv Layer → Max Pooling → Fully-Connected → Low-Level Code Prediction → High-Level Category Prediction]

$L = L_{\text{low}} + \gamma \cdot L_{\text{high}}$

$y_{\text{high}} = 1 - \prod_{k} \left(1 - y_{\text{low}}^{k}\right)$
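The at-least-one rule can be worked through the same way: the category is present unless every one of its codes is absent. The mapping and probabilities below are made-up values, and `at_least_one_meta_label` is a hypothetical name.

```python
def at_least_one_meta_label(p_low, code_to_category, n_categories):
    """High-level probability = 1 - product over the category of (1 - p_low)."""
    none_present = [1.0] * n_categories
    for idx, p in enumerate(p_low):
        none_present[code_to_category[idx]] *= (1.0 - p)
    return [1.0 - q for q in none_present]

# Codes: 301.0, 301.1, 301.2, 486, 780.6 (category indices are hypothetical).
code_to_category = [0, 0, 0, 1, 2]
p_low = [0.5, 0.5, 0.0, 0.9, 0.1]

p_high_alo = at_least_one_meta_label(p_low, code_to_category, 3)
print([round(p, 3) for p in p_high_alo])
```

For the first category this gives 1 - 0.5 · 0.5 · 1.0 = 0.75, higher than any single code's probability, because the rule treats the codes as independent pieces of evidence.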

(20)

State-of-the-Art Performance


(21)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations: Semi-Supervised Multi-Task SLU

Unlabeled Data

◦ How to re-use the trained dialogue acts

◦ How to share knowledge across languages

◦ How to utilize parallel data

Conclusions


(22)

Semi-Supervised Multi-Task SLU (Lan et al., 2018)

Idea: language understanding objective can enhance other tasks

[Figure: Slot Tagging Model]

The BLM exploits the unsupervised knowledge, while the shared-private framework and adversarial training make the slot tagging model more generalized

O. Lan, S. Zhu, and K. Yu, “Semi-supervised Training using Adversarial Multi-task Learning for Spoken Language Understanding,” in Proceedings of ICASSP, 2018.

(23)

Semi-Supervised Multi-Task SLU (Lan et al., 2018)

STM – BLSTM for slot tagging

MTL – multi-task learning for STM and LM, where they share the embedding layer

PSEUDO – train an STM with labeled data, generate labels for unlabeled data, and retrain STM


The model is more effective when labeled data is limited and more data is available for the LM.
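The PSEUDO baseline described on this slide (train, label the unlabeled data, retrain) can be sketched as a loop. To keep the example self-contained, a word-to-most-frequent-tag lookup stands in for the BLSTM slot tagging model; that stand-in, the toy data, and all names are assumptions.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sentences):
    """Toy stand-in for the STM: remember the most frequent tag of each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(model, words, default="O"):
    return [(w, model.get(w, default)) for w in words]

def pseudo_label_training(labeled, unlabeled):
    model = train_tagger(labeled)                        # 1. train on labeled data
    pseudo = [tag(model, words) for words in unlabeled]  # 2. label the unlabeled data
    return train_tagger(labeled + pseudo)                # 3. retrain on both

labeled = [[("from", "O"), ("seattle", "B-fromloc")]]
unlabeled = [["flights", "from", "seattle"]]

model = pseudo_label_training(labeled, unlabeled)
print(tag(model, ["flights", "from", "seattle"]))
```

With a lookup tagger the retraining step adds little; the loop is the point, and a real STM would generalize from the pseudo-labeled sentences.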

(24)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations: Semi-Supervised Multi-Task SLU

Unlabeled Data

◦ How to re-use the trained dialogue acts: Zero-Shot Intent Expansion

◦ How to share knowledge across languages

◦ How to utilize parallel data

Conclusions


(25)

Zero-Shot Intent Expansion (Chen et al., 2016)

Goal: resolve domain constraint and enable flexible intent expansion for unlabeled domains

[Figure: a CDSSM is trained on the training data of the original intents 1, 2, …, K (e.g. <change_note>: “adjust my note”; <change_setting>: “volume turn down”) to give intent embeddings; embedding generation then expands the set with new intents K+1, K+2, … (e.g. <change_calender>: “postpone my meeting to five pm”)]

Y.-N. Chen, D. Hakkani-Tur, and X. He, “Zero-Shot Learning of Intent Embeddings for Expansion by Convolutional Deep Structured Semantic Models,” in Proceedings of ICASSP, 2016.

Same dialogue acts can be shared across domains

(26)

CDSSM: Convolutional Deep Structured Semantic Models

[Figure: CDSSM architecture. Word Sequence: x (w_1, w_2, w_3, …, w_d) → Word Hashing Matrix: W_h → Word Hashing Layer: l_h → Convolution Matrix: W_c → Convolutional Layer: l_c → Max Pooling Operation → Max Pooling Layer: l_m → Semantic Projection Matrix: W_s → Semantic Layer: y. Layer sizes shown: 20K, 1000, 300. The utterance U (e.g. “I want to adjust …”) and the intents I_1, I_2, …, I_n are embedded this way; CosSim(U, I_i) gives P(I_1 | U), P(I_2 | U), …, P(I_n | U).]

$P(I \mid U) = \frac{\exp(\mathrm{CosSim}(U, I))}{\sum_{I'} \exp(\mathrm{CosSim}(U, I'))}$

CDSSM maps language usage for the same dialogue acts together
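The intent posterior, a softmax over cosine similarities between the utterance and each intent embedding, can be sketched as below. The three-dimensional vectors are hand-written stand-ins for CDSSM semantic-layer outputs, and `intent_posteriors` is a hypothetical name. The last step adds a new intent embedding without retraining anything, which is one way to picture the expansion idea.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intent_posteriors(utterance_vec, intent_vecs):
    """P(I | U) = exp(CosSim(U, I)) / sum over I' of exp(CosSim(U, I'))."""
    sims = {name: cos_sim(utterance_vec, vec) for name, vec in intent_vecs.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {name: math.exp(s) / z for name, s in sims.items()}

# Toy embeddings (assumptions, not real CDSSM outputs).
intents = {
    "<change_note>": [0.9, 0.1, 0.0],
    "<change_setting>": [0.0, 1.0, 0.1],
}
utterance = [1.0, 0.2, 0.0]   # stands in for "adjust my note"

before = intent_posteriors(utterance, intents)

# Expansion: add an embedding for a new intent; no utterance data is needed here.
intents["<change_calender>"] = [0.1, 0.0, 1.0]
after = intent_posteriors(utterance, intents)
print(max(after, key=after.get))
```

Because cosine similarity lies in [-1, 1], the raw posteriors are fairly flat; this sketch adds no scaling factor since the slide shows none.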

(27)

Zero-Shot Intent Expansion (Chen et al., 2016)

Intent Classification Performance: MAP@K (%)

        K=1            K=3            K=5            K=10           K=30
        Seen  Unseen   Seen  Unseen   Seen  Unseen   Seen  Unseen   Seen  Unseen
Ori     58.6   0.0     66.1   0.0     67.3   0.0     68.2   0.0     68.6   0.0
Exp     58.3   9.1     65.6  31.0     66.8  34.5     67.7  36.0     68.2  36.6

The expanded models consider new intents without training samples, and produce better understanding for unseen domains with comparable results for seen domains.

(28)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations: Semi-Supervised Multi-Task SLU

Unlabeled Data

◦ How to re-use the trained dialogue acts: Zero-Shot Intent Expansion

◦ How to share knowledge across languages: Zero-Shot Crosslingual SLU

◦ How to utilize parallel data

Conclusions


(29)

Zero-Shot Crosslingual SLU (Upadhyay et al., 2018)

Source language: English (full annotations)

Target language: Hindi (limited annotations)

RT: round trip, FC: from city, TC: to city, DDN: departure day name

S. Upadhyay, M. Faruqui, G. Tur, D. Hakkani-Tur, and L. Heck, “(Almost) Zero-Shot Cross-Lingual Spoken Language Understanding,” in Proceedings of ICASSP, 2018.

(30)

Zero-Shot Crosslingual SLU (Upadhyay et al., 2018)

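[Figure: three setups for producing SLU results on Hindi test data.
TRAIN ON TARGET: English Train → MT → Hindi Train → Hindi Tagger; Hindi Test → Hindi Tagger → SLU Results.
TEST ON SOURCE: Hindi Test → MT → English Test → English Tagger → SLU Results.
PROPOSED: English Train (Large) and Hindi Train (Small) → Joint Training → Bilingual Tagger; Hindi Test → Bilingual Tagger → SLU Results.]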

MT system is not required and both languages can be processed by a single model

(31)

Joint Model for Crosslingual SLU

[Figure: English Train (Large) and Hindi Train (Small) → Joint Training → Bilingual Tagger, with a language indicator; Hindi Test → SLU Results. Results are shown given 100 examples in the target language.]

For rare slots (like meal, airline code), there is a huge difference between the bilingual model and the naive model when the target training data is limited

(32)

Bilingual Model SLU Experiments

The bilingual model outperforms others and does not suffer from latency introduced by MT

(33)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations: Semi-Supervised Multi-Task SLU

Unlabeled Data

◦ How to re-use the trained dialogue acts: Zero-Shot Intent Expansion

◦ How to share knowledge across languages: Zero-Shot Crosslingual SLU

◦ How to utilize parallel data: Crosslingual Sense Embeddings

Conclusions


(34)

Crosslingual Embeddings

Tokens in source language shall be mapped to tokens in target language

◦ This assumption only holds for sense-level tokens

◦ Sets of crosslingual sense embeddings are therefore important

◦ uniform and 制服 are both polysemous words

[Figure: senses uniform_1, uniform_2, subdue_1, 制服_1, 制服_2, and 均勻_1, with one mapping marked “wrong”]

(35)

Embeddings in a Unified Space (Conneau et al., 2017; Lample et al., 2017)

May largely benefit tasks such as unsupervised machine translation

A. Conneau, G. Lample, L. Denoyer, MA. Ranzato, and H. Jégou, “Word Translation Without Parallel Data,” preprint arXiv:1710.04087, 2017.

G. Lample, A. Conneau, L. Denoyer, and MA. Ranzato, “Unsupervised Machine Translation Using Monolingual Corpora Only,” preprint arXiv:1711.00043, 2017.

(36)

Modular Framework

Our method can be separated into two steps (Lee & Chen, 2017):

1. Select the most probable (argmax) sense given the context

2. Use skip-gram to train the representation of the selected senses

➢ Reinforcement learning is used to connect the two modules

[Figure: a parallel sentence pair without word alignment, “Apple company designs the best cellphone in the world.” and “蘋果公司設計世界一流的手機。”, with candidate senses apple-1, apple-2, cellphone-1, cellphone-2, 公司-1, 公司-2]

Lee and Chen, "MUSE: Modularizing Unsupervised Sense Embeddings," in EMNLP, pages 327-337, 2017.

(37)

Sense Selection Module

Input:

◦ Chinese text context $C_t = C_{t-m}, \ldots, C_t = w_i, \ldots, C_{t+m}$

◦ English text context $C'_t = C'_{t-m}, \ldots, C'_t = w'_i, \ldots, C'_{t+m}$

Output: the fitness for each sense $z_{i1}, \ldots, z_{i3}$

Model architecture: Continuous Bag-of-Words (CBOW) for efficiency

[Figure: Sense Selection Module. An English context (“… like apple companies and …”) is embedded with matrix $P^{en}$ and a Chinese context (“… 製造商蘋果手機公司與 …”) with matrix $P^{zh}$; the two context representations $C_t$ and $C'_t$ are combined with weights $\alpha$ and $1 - \alpha$, and matrix $Q_i^{en}$ gives $q(z_{i1} \mid \bar{C}_t)$, $q(z_{i2} \mid \bar{C}_t)$, $q(z_{i3} \mid \bar{C}_t)$.]
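A rough sketch of CBOW-style sense selection follows: average the context embeddings in each language, mix the two averages with α and 1 − α, score each candidate sense, and take the argmax. Which language receives α, the softmax over scores, the word segmentation of the Chinese context, and all vectors and names are assumptions made for illustration.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def select_sense(en_context, zh_context, P_en, P_zh, Q_i, alpha=0.5):
    """Return (index of the most probable sense, distribution over senses)."""
    c_en = mean_vector([P_en[w] for w in en_context])   # CBOW: average the context
    c_zh = mean_vector([P_zh[w] for w in zh_context])
    c_bar = [alpha * a + (1 - alpha) * b for a, b in zip(c_en, c_zh)]
    scores = [sum(q * c for q, c in zip(sense, c_bar)) for sense in Q_i]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    q = [e / total for e in exps]
    return q.index(max(q)), q

# Toy 2-d embeddings: dimension 0 ~ "company", dimension 1 ~ "fruit" (made up).
P_en = {"like": [0.0, 0.2], "companies": [1.0, 0.0], "and": [0.0, 0.0]}
P_zh = {"製造商": [1.0, 0.0], "手機": [1.0, 0.0], "公司": [1.0, 0.0], "與": [0.0, 0.0]}
Q_apple = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]   # three candidate senses of "apple"

best, q = select_sense(["like", "companies", "and"],
                       ["製造商", "手機", "公司", "與"],
                       P_en, P_zh, Q_apple, alpha=0.5)
print(best, [round(x, 3) for x in q])
```

The Chinese context contributes evidence for the company sense even though the English context alone is weaker, which is the motivation for using parallel data.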

(38)

Sense Representation Module

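Input: sense collocation $s_i, s_j, s'_l$

Output: collocation likelihood estimation

Model architecture: skip-gram architecture

Sense selection (optimized by negative sampling)

[Figure: Sense Representation Module. The selected sense $z_{i1}$ is embedded with matrix $U^{en}$; matrix $V^{en}$ gives the monolingual collocation likelihoods $P(z_{j2} \mid z_{i1})$, $P(z_{uv} \mid z_{i1})$, …, and matrix $V^{zh}$ gives the crosslingual collocation likelihoods $P(z'_{j2} \mid z_{i1})$, $P(z'_{uv} \mid z_{i1})$, …]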
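The skip-gram objective with negative sampling can be sketched for one sense collocation. This is the standard negative-sampling loss written out under assumptions: the vectors are toy values standing in for rows of U and V, and `sgns_loss` is a hypothetical name.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(target_vec, positive_vec, negative_vecs):
    """-log sigma(u . v_pos) - sum over negatives of log sigma(-u . v_neg)."""
    loss = -log_sigmoid(dot(target_vec, positive_vec))
    for negative in negative_vecs:
        loss -= log_sigmoid(-dot(target_vec, negative))
    return loss

u_sense = [1.0, 0.0]          # row of U for the selected sense (toy value)
v_collocate = [1.0, 0.0]      # row of V (same or other language) for an observed collocate
v_noise = [[-1.0, 0.0]]       # rows of V for sampled negatives

good = sgns_loss(u_sense, v_collocate, v_noise)
bad = sgns_loss(u_sense, v_noise[0], [v_collocate])   # roles swapped
print(round(good, 4), round(bad, 4))
```

The loss is small when the observed collocate is close to the sense vector and the sampled negatives are far from it, and large in the swapped case.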

(39)

Crosslingual Model Architecture

Enabling bilingual sense embedding learning with parallel data

(40)

Qualitative Analysis

The words with similar senses from both languages have similar embeddings in a unified space

(41)

New Dataset – BCWS

(Bilingual Contextual Word Similarity)

A newly collected dataset for evaluating bilingual sense embeddings

(42)

Contextual Word Similarity Experiment

The crosslingual sense embeddings learned in an unsupervised way produce better results on BCWS (bilingual) and comparable performance on SCWS (monolingual)

(43)

Outline

Limited Labeled Data

◦ How to incorporate the prior knowledge: Knowledge-Guided Model

◦ How to utilize the current observations: Semi-Supervised Multi-Task SLU

Unlabeled Data

◦ How to re-use the trained dialogue acts: Zero-Shot Intent Expansion

◦ How to share knowledge across languages: Zero-Shot Crosslingual SLU

◦ How to utilize parallel data: Crosslingual Sense Embeddings

Conclusions


(44)