UNSUPERVISED LEARNING AND MODELING OF KNOWLEDGE AND INTENT FOR SPOKEN DIALOGUE SYSTEMS
Yun-Nung (Vivian) Chen | http://vivianchen.idv.tw
OUTLINE
Introduction
Semantic Decoding
Ontology Induction
Knowledge Graph Propagation
Matrix Factorization
Experiments
Future Work
Conclusions
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Matrix Factorization
Experiments
Future Work
Conclusions
A POPULAR ROBOT - BAYMAX
Big Hero 6 -- Video content owned and licensed by Disney Entertainment, Marvel Entertainment, LLC, etc
A POPULAR ROBOT - BAYMAX
Baymax can maintain a good spoken dialogue system and learn new knowledge to better understand and interact with people.
The goal is to automate the learning and understanding procedures in system development.
SPOKEN DIALOGUE SYSTEM (SDS)
Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via speech interaction.
Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
Apple’s Siri (https://www.apple.com/ios/siri/)
Microsoft’s Cortana (http://www.windowsphone.com/en-us/how-to/wp8/cortana/meet-cortana)
Microsoft’s XBOX Kinect (http://www.xbox.com/en-US/)
Amazon’s Echo (http://www.amazon.com/oc/echo/)
Samsung’s SMART TV (http://www.samsung.com/us/experience/smart-tv/)
Google Now (https://www.google.com/landing/now/)
LARGE SMART DEVICE POPULATION
The number of global smartphone users will surpass 2 billion in 2016.
As of 2012, there were 1.1 billion automobiles on Earth.
Speech is evolving into the more natural and convenient input modality for these devices.
KNOWLEDGE REPRESENTATION/ONTOLOGY
Traditional SDSs require manual annotations for specific domains to represent domain knowledge.
[Figure: example domain ontologies. Restaurant domain: restaurant with slots type, price, location and relation located_in. Movie domain: movie with slots genre, year, director and relations directed_by, released_in. Node: semantic concept/slot; edge: relation between concepts.]
UTTERANCE SEMANTIC REPRESENTATION
A spoken language understanding (SLU) component requires the domain ontology to decode utterances into semantic forms, which contain core content (a set of slots and slot-fillers) of the utterance.
find a cheap taiwanese restaurant in seattle → target=“restaurant”, price=“cheap”, type=“taiwanese”, location=“seattle”
show me action movies directed by james cameron → target=“movie”, genre=“action”, director=“james cameron”
[Figure: the restaurant-domain and movie-domain ontologies, as above.]
CHALLENGES FOR SDS
An SDS in a new domain requires
1) A hand-crafted domain ontology
2) Utterances labelled with semantic representations
3) An SLU component for mapping utterances into semantic representations
As spoken interactions increase, hand-crafting domain ontologies and annotating utterances is costly, so the labelled data does not scale.
The goal is to enable an SDS to automatically learn this knowledge so that open domain requests can be handled.
INTERACTION EXAMPLE
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
Q: How does a dialogue system process this request?
Inexpensive Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose?
I can help you go there.
SDS PROCESS – AVAILABLE DOMAIN ONTOLOGY
[Figure: organized domain knowledge — ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SDS PROCESS – AVAILABLE DOMAIN ONTOLOGY
find a cheap eating place for asian food
[Figure: organized domain knowledge — ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
Ontology Induction (semantic slot)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SDS PROCESS – AVAILABLE DOMAIN ONTOLOGY
[Figure: organized domain knowledge — ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
Structure Learning (inter-slot relation)
Ontology Induction (semantic slot)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
seeking=“find”
target=“eating place”
price=“inexpensive”
food=“taiwanese food”
SDS PROCESS – SPOKEN LANGUAGE UNDERSTANDING (SLU)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
seeking=“find”
target=“eating place”
price=“inexpensive”
food=“taiwanese food”
SDS PROCESS – SPOKEN LANGUAGE UNDERSTANDING (SLU)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
Semantic Decoding
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
SDS PROCESS – DIALOGUE MANAGEMENT (DM)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“Taiwanese food”
}
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
SDS PROCESS – DIALOGUE MANAGEMENT (DM)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“Taiwanese food”
}
Surface Form Derivation
(natural language)
SDS PROCESS – DIALOGUE MANAGEMENT (DM)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“Taiwanese food”
}
Din Tai Fung, Boiling Point, …
Predicted behavior: navigation
SDS PROCESS – DIALOGUE MANAGEMENT (DM)
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“Taiwanese food”
}
Din Tai Fung, Boiling Point, …
Predicted behavior: navigation
Behavior Prediction
find an inexpensive eating place for taiwanese food
User
Intelligent Agent
Inexpensive Taiwanese eating places include Din Tai Fung, Boiling Point, etc. What do you want to choose?
I can help you go there. (navigation)
SDS PROCESS – NATURAL LANGUAGE GENERATION (NLG)
GOALS
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“taiwanese food”
}
Predicted behavior: navigation
Required Domain-Specific Information
find an inexpensive eating place for taiwanese food
User
FIVE GOALS
[Figure: ontology graph with slots seeking, target, price, food and relations AMOD, NN, PREP_FOR]
SELECT restaurant {
restaurant.price=“inexpensive”
restaurant.food=“taiwanese food”
}
Predicted behavior: navigation
Required Domain-Specific Information
find an inexpensive eating place for taiwanese food
User
1. Ontology Induction (semantic slot)
2. Structure Learning (inter-slot relation)
3. Surface Form Derivation (natural language)
4. Semantic Decoding
5. Behavior Prediction
FIVE GOALS
find an inexpensive eating place for taiwanese food
User
1. Ontology Induction (semantic slot)
2. Structure Learning (inter-slot relation)
3. Surface Form Derivation (natural language)
4. Semantic Decoding
5. Behavior Prediction
FIVE GOALS
Knowledge Acquisition: 1. Ontology Induction, 2. Structure Learning, 3. Surface Form Derivation
SLU Modeling: 4. Semantic Decoding, 5. Behavior Prediction
find an inexpensive eating place for taiwanese food
User
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Matrix Factorization
Experiments
Future Work
Conclusions
SLU Model
target=“restaurant”
price=“cheap”
“can I have a cheap restaurant”
[Framework diagram: an unlabeled collection is frame-semantically parsed for ontology induction, yielding the feature model (word/slot feature matrices Fw, Fs); structure learning over a lexical knowledge graph and a semantic knowledge graph yields the knowledge graph propagation model (word relation model Rw, slot relation model Rs); SLU modeling by matrix factorization combines both to produce the semantic representation.]
Input: user utterances
Output: the domain-specific semantic concepts included in each individual utterance
SEMANTIC DECODING
Y.-N. Chen et al., "Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding," (to appear) in Proc. of ACL-IJCNLP, 2015.
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Matrix Factorization
Experiments
Future Work
Conclusions
PROBABILISTIC FRAME-SEMANTIC PARSING
FrameNet
[Baker et al., 1998] a linguistically-principled semantic resource based on the frame semantics theory
“low fat milk”: “milk” evokes the “food” frame; “low fat” fills the descriptor frame element
SEMAFOR
[Das et al., 2014] a state-of-the-art frame-semantics parser, trained on manually annotated FrameNet sentences
Baker et al., "The Berkeley FrameNet Project," in Proc. of International Conference on Computational Linguistics, 1998.
Das et al., "Frame-semantic parsing," in Computational Linguistics, 2014.
FRAME-SEMANTIC PARSING FOR UTTERANCES
can i have a cheap restaurant
Frame: capability (FT LU: can; FE LU: i) → ?
Frame: expensiveness (FT LU: cheap) → Good!
Frame: locale_by_use (FT/FE LU: restaurant) → Good!
1st Issue: adapting generic frames to domain-specific settings for SDSs
FT: Frame Target; FE: Frame Element; LU: Lexical Unit
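For illustration, a minimal sketch (Python; the frame parse below is hard-coded to mirror the example above rather than produced by SEMAFOR) of how evoked frames can be collected as slot candidates:

    from collections import namedtuple

    # Hard-coded frame annotations mirroring the example utterance above.
    FrameAnnotation = namedtuple("FrameAnnotation", ["frame", "target_lu", "element_lus"])

    parse = [
        FrameAnnotation("capability", "can", ["i"]),
        FrameAnnotation("expensiveness", "cheap", []),
        FrameAnnotation("locale_by_use", "restaurant", []),
    ]

    def slot_candidates(annotations):
        # Each evoked frame becomes a slot candidate; generic frames such as
        # "capability" still need to be adapted/filtered for the domain (1st issue).
        return {a.frame for a in annotations}

    print(slot_candidates(parse))  # {'capability', 'expensiveness', 'locale_by_use'}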
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation (for 1st issue)
Matrix Factorization
Experiments
Future Work
Conclusions
SLU Model
target=“restaurant”
price=“cheap”
“can I have a cheap restaurant”
[Framework diagram, as above: frame-semantic parsing of the unlabeled collection for ontology induction (feature model Fw, Fs), knowledge graph propagation model from structure learning (Rw, Rs), and SLU modeling by matrix factorization producing the semantic representation.]
Input: user utterances
Output: the domain-specific semantic concepts included in each individual utterance
SEMANTIC DECODING
Y.-N. Chen et al., "Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding," (to appear) in Proc. of ACL-IJCNLP, 2015.
Assumption: domain-specific words/slots have stronger dependencies with each other.
1ST ISSUE: HOW TO ADAPT GENERIC SLOTS TO DOMAIN-SPECIFIC?
KNOWLEDGE GRAPH PROPAGATION MODEL
[Figure: word relation model and slot relation model — the word observation / slot candidate matrix (training and test utterances × words such as “cheap”, “restaurant” and slot candidates such as expensiveness, locale_by_use, food) is multiplied by a word relation matrix and a slot relation matrix; the slot-candidate part corresponds to slot induction.]
The relation matrices allow each node to propagate its score to its neighbors in the knowledge graph, so that domain-specific words/slots obtain higher scores during training.
[Figure: matrix rows are Utterance 1 “i would like a cheap restaurant”, Utterance 2 “find a restaurant with chinese food”, …, and the test utterance “show me a list of cheap restaurants”; columns are observed words (i, like, …) and slot candidates (capability, locale_by_use, food, expensiveness, seeking, relational_quantity, desiring). Slot candidates and relations come from syntactic dependency parsing on utterances, e.g. “can i have a cheap restaurant” → capability, expensiveness, locale_by_use.]
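A minimal sketch of the propagation idea, using toy NumPy matrices (the toy values, the normalization, and the exact way the propagated scores enter the model are simplified relative to the paper):

    import numpy as np

    # Toy word-observation matrix F_w (utterances x words) and a toy
    # word-relation matrix R_w (words x words) taken from the knowledge graph.
    F_w = np.array([[1., 0., 1.],    # e.g. "cheap ... restaurant"
                    [0., 1., 1.]])   # e.g. "... food ... restaurant"
    R_w = np.array([[1.0, 0.2, 0.8],
                    [0.2, 1.0, 0.6],
                    [0.8, 0.6, 1.0]])

    def row_normalize(R):
        return R / R.sum(axis=1, keepdims=True)

    # Propagation: each observed word passes part of its score to its graph
    # neighbours, so words related to domain-specific words gain higher scores.
    F_w_propagated = F_w @ row_normalize(R_w)
    print(np.round(F_w_propagated, 2))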
KNOWLEDGE GRAPH CONSTRUCTION
Word-based lexical knowledge graph
Slot-based semantic knowledge graph
[Figure: word-based lexical knowledge graph over the words can, i, have, a, cheap, restaurant; slot-based semantic knowledge graph over the slots capability, locale_by_use, expensiveness.]
The edge between each node pair is weighted by relation importance in order to build the matrices. How do we decide the weights that represent relation importance?
KNOWLEDGE GRAPH CONSTRUCTION
Word-based lexical knowledge graph
Slot-based semantic knowledge graph
[Figure: word-based lexical knowledge graph over the words can, i, have, a, cheap, restaurant; slot-based semantic knowledge graph over the slots capability, locale_by_use, expensiveness.]
Dependency-based word embeddings, e.g. can = [0.8 … 0.24], have = [0.3 … 0.21], …
Dependency-based slot embeddings, e.g. expensiveness = [0.12 … 0.7], capability = [0.3 … 0.6], …
[Figure: dependency parses (ccomp, nsubj, dobj, det, amod) of “can i have a cheap restaurant” at the word level and of the corresponding slot sequence (capability, expensiveness, locale_by_use) at the slot level.]
Levy and Goldberg, "Dependency-Based Word Embeddings," in Proc. of ACL, 2014.
WEIGHT MEASUREMENT BY EMBEDDINGS
Compute edge weights to represent relation importance
Slot-to-slot semantic relation R_s^S: similarity between slot embeddings
Slot-to-slot dependency relation R_s^D: dependency score between slot embeddings
Word-to-word semantic relation R_w^S: similarity between word embeddings
Word-to-word dependency relation R_w^D: dependency score between word embeddings
R_w^SD = R_w^S + R_w^D
R_s^SD = R_s^S + R_s^D
[Figure: word knowledge graph over w1–w7 and slot knowledge graph over s1–s3, with weighted edges.]
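A minimal sketch of how the relation weights could be derived from embeddings; cosine similarity stands in for the semantic relation, while the dependency-based score is only a placeholder matrix here, since its exact definition follows the paper:

    import numpy as np

    def cosine_matrix(E):
        """Pairwise cosine similarity between the rows of an embedding matrix E."""
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        unit = E / np.clip(norms, 1e-12, None)
        return unit @ unit.T

    rng = np.random.default_rng(0)
    E_w = rng.random((5, 16))            # toy word embeddings, one row per word

    R_w_S = cosine_matrix(E_w)           # semantic relation R_w^S
    R_w_D = rng.random((5, 5))           # placeholder for the dependency relation R_w^D
    R_w_SD = R_w_S + R_w_D               # combined relation R_w^SD = R_w^S + R_w^D
    print(np.round(R_w_SD, 2))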
WEIGHT MEASUREMENT BY EMBEDDINGS
[Figure: the knowledge graph propagation model — the word observation / slot candidate matrix is multiplied by the word relation matrix R_w^SD and the slot relation matrix R_s^SD.]
KNOWLEDGE GRAPH PROPAGATION MODEL
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Matrix Factorization (for 2nd issue)
Experiments
Future Work
Conclusions
[Framework diagram: the matrix factorization step takes the feature matrices Fw, Fs from ontology induction and the relation models from structure learning to perform SLU.]
MATRIX FACTORIZATION (MF)
FEATURE MODEL
[Figure: feature model — rows are training utterances (Utterance 1 “i would like a cheap restaurant”, Utterance 2 “find a restaurant with chinese food”, …) and the test utterance “show me a list of cheap restaurants”; columns are word observations (cheap, restaurant, food, …) and induced slot candidates (expensiveness, locale_by_use, food, …). Observed cells are 1s; the model estimates probabilities (e.g. .90, .97, .85, .95, .05) for the remaining cells, including the hidden semantics of the test utterance.]
2nd Issue: hidden semantics cannot be observed but may benefit the understanding performance
Reasoning with Matrix Factorization
[Figure: the feature matrix augmented with the word relation model and slot relation model; observed 1s plus estimated probabilities (.90, .97, .85, .95, .05) for slot induction.]
The MF method completes a partially-missing matrix based on the latent semantics by decomposing it into the product of two matrices.
2ND ISSUE: HOW TO LEARN THE IMPLICIT SEMANTICS?
MATRIX FACTORIZATION (MF)
MATRIX FACTORIZATION (MF)
The decomposed matrices represent the latent semantics of utterances and of words/slots, respectively. The product of the two matrices fills in the probabilities of the hidden semantics.
[Figure: the word observation / slot candidate matrix with observed 1s and estimated probabilities, as above.]
The feature matrix M (utterances × words/slots) is approximated as M ≈ U_{|U|×d} × (W + S)_{d×(|W|+|S|)}, where U holds the utterance latent vectors and (W + S) holds the word/slot latent vectors.
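A minimal sketch of the low-rank completion idea, using a truncated SVD as a stand-in for the learned factorization (the actual model is trained with the ranking objective on the next slide):

    import numpy as np

    # Toy (utterance x word/slot) matrix: 1 = observed, 0 = unobserved.
    M = np.array([[1., 1., 1., 0., 0.],
                  [1., 0., 0., 1., 1.],
                  [1., 1., 0., 0., 0.]])

    d = 2                                            # latent dimension
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_hat = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]    # rank-d reconstruction

    # M_hat assigns non-zero scores to unobserved cells, i.e. hidden semantics.
    print(np.round(M_hat, 2))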
BAYESIAN PERSONALIZED RANKING FOR MF
Model implicit feedback:
do not treat unobserved facts as negative samples (true or false)
give observed facts higher scores than unobserved facts
Objective: maximize Σ_u Σ_{f+ ∈ O_u} Σ_{f− ∉ O_u} ln σ( M_{u,f+} − M_{u,f−} ), i.e. each observed fact f+ of utterance u should be ranked above each unobserved fact f−.
The objective is to learn a set of well-ranked semantic slots per utterance.
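A minimal sketch of a single stochastic update for this ranking objective (plain NumPy; sampling, hyper-parameters, and the relation-model terms are omitted):

    import numpy as np

    def bpr_update(U, V, u, f_pos, f_neg, lr=0.05, reg=0.01):
        """One BPR step: push the score U[u].V[f_pos] above U[u].V[f_neg]."""
        x = U[u] @ V[f_pos] - U[u] @ V[f_neg]
        g = 1.0 / (1.0 + np.exp(x))          # sigma(-x), the gradient of ln sigma(x)
        u_old = U[u].copy()
        U[u]     += lr * (g * (V[f_pos] - V[f_neg]) - reg * U[u])
        V[f_pos] += lr * (g * u_old - reg * V[f_pos])
        V[f_neg] += lr * (-g * u_old - reg * V[f_neg])

    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(3, 8))   # utterance latent factors
    V = rng.normal(scale=0.1, size=(5, 8))   # word/slot latent factors
    bpr_update(U, V, u=0, f_pos=2, f_neg=4)  # observed slot 2 vs. unobserved slot 4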
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Matrix Factorization
Experiments
Future Work
Conclusions
EXPERIMENTAL SETUP
Dataset
Cambridge University SLU corpus [Henderson, 2012]
Restaurant recommendation in an in-car setting in Cambridge
WER = 37%
vocabulary size = 1868
2,166 dialogues
15,453 utterances
dialogue slots: addr, area, food, name, phone, postcode, price range, task, type
A mapping table between induced and reference slots is used for evaluation (see appendix).
Henderson et al., "Discriminative spoken language understanding using word confusion networks," in Proc. of SLT, 2012.
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance

Approach | ASR | Manual
Explicit: Support Vector Machine | 32.5 | 36.6
Explicit: Multinomial Logistic Regression | 34.0 | 38.8
EXPERIMENT 1: QUALITY OF SEMANTICS ESTIMATION
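A minimal sketch of how MAP could be computed from per-utterance slot rankings (standard average precision; not the evaluation script used in the paper):

    def average_precision(ranked_slots, gold_slots):
        """Average precision for one utterance; ranked_slots is sorted by probability."""
        hits, precisions = 0, []
        for k, slot in enumerate(ranked_slots, start=1):
            if slot in gold_slots:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / max(len(gold_slots), 1)

    def mean_average_precision(all_rankings, all_references):
        return sum(average_precision(r, g)
                   for r, g in zip(all_rankings, all_references)) / len(all_rankings)

    ranking = ["expensiveness", "capability", "locale_by_use"]   # predicted order
    gold = {"expensiveness", "locale_by_use"}                    # reference slots
    print(mean_average_precision([ranking], [gold]))             # 0.8333...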
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance

Approach | ASR | Manual
Explicit: Support Vector Machine | 32.5 | 36.6
Explicit: Multinomial Logistic Regression | 34.0 | 38.8
Implicit approaches (modeling implicit semantics): Baseline Random, Baseline Majority, MF Feature Model, MF Feature Model + Knowledge Graph Propagation
EXPERIMENT 1: QUALITY OF SEMANTICS ESTIMATION
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance

Approach | ASR | Manual
Explicit: Support Vector Machine | 32.5 | 36.6
Explicit: Multinomial Logistic Regression | 34.0 | 38.8
Implicit: Baseline Random | 3.4 | 2.6
Implicit: Baseline Majority | 15.4 | 16.4
Implicit: MF Feature Model | 24.2 | 22.6
Implicit: MF Feature Model + Knowledge Graph Propagation | 40.5* (+19.1%) | 52.1* (+34.3%)
(Implicit results shown w/o combining the explicit approaches; the implicit approaches model implicit semantics.)
EXPERIMENT 1: QUALITY OF SEMANTICS ESTIMATION
Metric: Mean Average Precision (MAP) of all estimated slot probabilities for each utterance

Approach | ASR w/o | ASR w/ Explicit | Manual w/o | Manual w/ Explicit
Explicit: Support Vector Machine | 32.5 | | 36.6 |
Explicit: Multinomial Logistic Regression | 34.0 | | 38.8 |
Implicit: Baseline Random | 3.4 | 22.5 | 2.6 | 25.1
Implicit: Baseline Majority | 15.4 | 32.9 | 16.4 | 38.4
Implicit: MF Feature Model | 24.2 | 37.6* | 22.6 | 45.3*
Implicit: MF Feature Model + Knowledge Graph Propagation | 40.5* (+19.1%) | 43.5* (+27.9%) | 52.1* (+34.3%) | 53.4* (+37.6%)
(The implicit approaches model implicit semantics.)
The MF approach effectively models hidden semantics to improve SLU.
Adding a knowledge graph propagation model further improves the results.
EXPERIMENT 1: QUALITY OF SEMANTICS ESTIMATION
All types of relations are useful to infer hidden semantics.
Approach | ASR | Manual
Feature Model | 37.6 | 45.3
Feature + Knowledge Graph Propagation:
  Semantic relations, diag(R_w^S, R_s^S) | 41.4* | 51.6*
  Dependency relations, diag(R_w^D, R_s^D) | 41.6* | 49.0*
  Word relations only, diag(R_w^SD, 0) | 39.2* | 45.2
  Slot relations only, diag(0, R_s^SD) | 42.1* | 49.9*
  Both, diag(R_w^SD, R_s^SD) | |
EXPERIMENT 2: EFFECTIVENESS OF RELATIONS
All types of relations are useful to infer hidden semantics.
Approach | ASR | Manual
Feature Model | 37.6 | 45.3
Feature + Knowledge Graph Propagation:
  Semantic relations, diag(R_w^S, R_s^S) | 41.4* | 51.6*
  Dependency relations, diag(R_w^D, R_s^D) | 41.6* | 49.0*
  Word relations only, diag(R_w^SD, 0) | 39.2* | 45.2
  Slot relations only, diag(0, R_s^SD) | 42.1* | 49.9*
  Both, diag(R_w^SD, R_s^SD) | 43.5* (+15.7%) | 53.4* (+17.9%)
Combining different relations further improves the performance.
EXPERIMENT 2: EFFECTIVENESS OF RELATIONS
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Bayesian Personalized Ranking for Matrix Factorization
Experiments
Future Work
Conclusions
LOW- AND HIGH-LEVEL UNDERSTANDING
Semantic concepts for individual utterances do not capture high-level semantics (user intents).
Follow-up behaviors are observable and usually correspond to user intents.
price=“cheap”
target=“restaurant”
SLU Component
“can i have a cheap restaurant”
behavior=navigation
restaurant=“din tai fung”
time=“tonight”
SLU Component
“i plan to dine in din tai fung tonight”
behavior=reservation
BEHAVIOR PREDICTION
[Figure: feature observation / behavior matrix — rows are Utterance 1 “play lady gaga’s song bad romance”, Utterance 2 “i’d like to listen to lady gaga’s bad romance”, …, and a test utterance; columns are observed features (play, song, listen, …) and behaviors (pandora, youtube, maps, …). Observed cells are 1s; prediction with matrix factorization, using feature relation and behavior relation models, fills in the remaining probabilities (e.g. .90, .85, .97, .05).]
[Framework diagram: behavior identification over an unlabeled collection yields feature and behavior matrices Ff, Fb (feature model) and relation matrices Rf, Rb (feature relation model, behavior relation model); SLU modeling for behavior prediction combines them, so that for “play lady gaga’s bad romance” the SLU model outputs the predicted behavior.]
OUTLINE
Introduction
Semantic Decoding
[ACL-IJCNLP’15] Ontology Induction
Knowledge Graph Propagation
Bayesian Personalized Ranking for Matrix Factorization
Experiments
Future Work
Conclusions
CONCLUSIONS
The ontology induction and knowledge graph construction enable systems to automatically acquire open domain knowledge.
The MF technique for SLU modeling provides a principled model that unifies the automatically acquired knowledge and allows systems to consider implicit semantics for better understanding.
Better semantic representations for individual utterances
Better follow-up behavior prediction
The work shows the feasibility and the potential of improving generalization, maintenance, efficiency, and scalability of SDSs.
Q & A
Thanks for your attention!
CAMBRIDGE UNIVERSITY SLU CORPUS
hi i'd like a restaurant in the cheap price range in the centre part of town → type=restaurant, pricerange=cheap, area=centre
um i'd like chinese food please how much is the main cost → food=chinese, pricerange
okay and uh what's the address → addr
great uh and if i wanted to uh go to an italian restaurant instead italian please → food=italian, type=restaurant; food=italian
what's the address → addr
i would like a cheap chinese restaurant something in the riverside → pricerange=cheap, food=chinese, type=restaurant, area=centre
[back]
WORD EMBEDDINGS
Training Process
Each word w is associated with a vector
The contexts within the window size c are considered as the training data D
Objective function: (see below)
[back]
[Figure: CBOW model — context words w(t-2), w(t-1), w(t+1), w(t+2) are summed at the projection layer (INPUT → PROJECTION → OUTPUT) to predict w(t).]
Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," in Proc. of ICLR, 2013.
Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality," in Proc. of NIPS, 2013.
Mikolov et al., "Linguistic Regularities in Continuous Space Word Representations," in Proc. of NAACL-HLT, 2013.
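In the standard CBOW formulation of Mikolov et al. (2013), the objective function referenced above is the average log-probability of each word given its surrounding context within the window:

    \frac{1}{T} \sum_{t=1}^{T} \log p\big(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\big)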
Word & Context Extraction
Word        Contexts
can         have/ccomp
i           have/nsubj-1
have        can/ccomp-1, i/nsubj, restaurant/dobj
a           restaurant/det-1
cheap       restaurant/amod-1
restaurant  have/dobj-1, a/det, cheap/amod
Levy and Goldberg, "Dependency-Based Word Embeddings," in Proc. of ACL, 2014.
[Figure: dependency parse of "can i have a cheap restaurant" with arcs ccomp, nsubj, dobj, det, amod.]
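A minimal sketch of this word/context extraction step, with the dependency parse of the example hard-coded as (head, relation, dependent) triples:

    # Hard-coded dependency parse of "can i have a cheap restaurant"
    # as (head, relation, dependent) triples, matching the table above.
    parse = [("can", "ccomp", "have"), ("have", "nsubj", "i"),
             ("have", "dobj", "restaurant"), ("restaurant", "det", "a"),
             ("restaurant", "amod", "cheap")]

    def dependency_contexts(triples):
        """Each arc yields two (word, context) pairs, one per direction;
        the inverse direction is marked with '-1' (Levy and Goldberg, 2014)."""
        pairs = []
        for head, rel, dep in triples:
            pairs.append((head, f"{dep}/{rel}"))
            pairs.append((dep, f"{head}/{rel}-1"))
        return pairs

    for word, context in dependency_contexts(parse):
        print(word, context)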
DEPENDENCY-BASED EMBEDDINGS
Training Process
Each word w is associated with a vector v_w and each context c is represented as a vector v_c
Learn vector representations for both words and contexts such that the dot product v_w · v_c associated with good word-context pairs belonging to the training data D is maximized
Objective function: (see below)
[back]
Levy and Goldberg, "Dependency-Based Word Embeddings," in Proc. of ACL, 2014.
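In the skip-gram-with-negative-sampling formulation adopted by Levy and Goldberg (2014), the objective referenced above can be written as (with k negative contexts c' sampled per observed pair):

    \sum_{(w,c) \in D} \Big( \log \sigma(v_w \cdot v_c) + k \cdot \mathbb{E}_{c' \sim P_D}\big[\log \sigma(-v_w \cdot v_{c'})\big] \Big)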
DEPENDENCY-BASED EMBEDDINGS
SLOT MAPPING TABLE
[Figure: slot mapping example — fillers of induced slots (e.g. asian, japan for an origin-like slot; beer, noodle for others) observed across utterances u1 … uk … un are compared against the fillers of the reference slot "food" (asian, japan, beer, noodle).]
Create a mapping from an induced slot to a reference slot if the slot fillers of the induced slot are covered by the reference slot's fillers (induced slots → reference slot).
[back]
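A minimal sketch of this mapping rule, with hypothetical filler sets used purely for illustration:

    # Hypothetical filler sets used purely for illustration.
    induced_slots = {"food_origin": {"asian", "japan"},
                     "beverage": {"beer"}}
    reference_slots = {"food": {"asian", "japan", "noodle", "beer"}}

    def map_induced_to_reference(induced, reference):
        """Map an induced slot to a reference slot when its observed fillers
        are all covered by the reference slot's fillers."""
        mapping = {}
        for ind_slot, fillers in induced.items():
            for ref_slot, ref_fillers in reference.items():
                if fillers <= ref_fillers:
                    mapping[ind_slot] = ref_slot
        return mapping

    print(map_induced_to_reference(induced_slots, reference_slots))
    # {'food_origin': 'food', 'beverage': 'food'}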
SEMAFOR PERFORMANCE
The SEMAFOR evaluation
[back]