
Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky School of Computer Science, Carnegie Mellon University

5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA {yvchen, yww, anatoleg, air}@cs.cmu.edu

Abstract

Spoken dialogue systems (SDS) typically require a predefined semantic ontology to train a spoken language understanding (SLU) module. In addition to the annotation cost, a key challenge for designing such an ontology is to define a coherent slot set while considering their complex relations. This paper introduces a novel matrix factorization (MF) approach to learn latent feature vectors for utterances and semantic elements without the need of corpus annotations. Specifically, our model learns the semantic slots for a domain-specific SDS in an unsupervised fashion, and carries out semantic parsing using latent MF techniques. To further consider the global semantic structure, such as inter-word and inter-slot relations, we augment the latent MF-based model with a knowledge graph propagation model based on a slot-based semantic graph and a word-based lexical graph. Our experiments show that the proposed MF approaches produce better SLU models that are able to predict semantic slots and word patterns taking into account their relations and domain-specificity in a joint manner.

1 Introduction

A key component of a spoken dialogue system (SDS) is the spoken language understanding (SLU) module, which parses the users' utterances into semantic representations; for example, the utterance "find a cheap restaurant" can be parsed into (price=cheap, target=restaurant) (Pieraccini et al., 1992). To design the SLU module of an SDS, most previous studies relied on predefined slots¹ for training the decoder (Seneff, 1992; Dowding et al., 1993; Gupta et al., 2006; Bohus and Rudnicky, 2009). However, these predefined semantic slots may bias the subsequent data collection process, and the cost of manually labeling utterances for updating the ontology is expensive (Wang et al., 2012).

¹A slot is defined as a basic semantic unit in SLU, such as "price" and "target" in the example.

In recent years, this problem has led to the development of unsupervised SLU techniques (Heck and Hakkani-Tür, 2012; Heck et al., 2013; Chen et al., 2013b; Chen et al., 2014b). In particular, Chen et al. (2013b) proposed a frame-semantics based framework for automatically inducing semantic slots given raw audio. However, these approaches generally do not explicitly learn the latent factor representations to model the measurement errors (Skrondal and Rabe-Hesketh, 2004), nor do they jointly consider the complex lexical, syntactic, and semantic relations among words, slots, and utterances.

Another challenge of SLU is the inference of hidden semantics. Considering the user utterance "can i have a cheap restaurant", from its surface patterns we can see that it includes explicit semantic information about "price (cheap)" and "target (restaurant)"; however, it also includes hidden semantic information, such as "food" and "seeking", since the SDS needs to infer that the user wants to "find" some cheap "food", even though these concepts are not directly observed in the surface patterns. Nonetheless, such implicit semantics are important semantic concepts for domain-specific SDSs. Traditional SLU models use discriminative classifiers (Henderson et al., 2012) to predict whether the predefined slots occur in the utterances or not, ignoring the unobserved concepts and the hidden semantic information.

In this paper, we take a rather radical approach: we propose a novel matrix factorization (MF) model for learning latent features for SLU, taking account of additional information such as the word relations, the induced slots, and the slot relations. To further consider the global coherence of induced slots, we combine the MF model with a knowledge graph propagation based model, fusing both a word-based lexical knowledge graph and a slot-based semantic graph. In fact, as shown in the Netflix challenge, MF is credited as the most useful technique for recommendation systems (Koren et al., 2009). Also, the MF model considers the unobserved patterns and estimates their probabilities instead of viewing them as negative examples. However, to the best of our knowledge, the MF technique is not yet well understood in the SLU and SDS communities, and it is not straightforward to use MF methods to learn latent feature representations for semantic parsing in SLU. To evaluate the performance of our model, we compare it to standard discriminative SLU baselines, and show that our MF-based model produces strong results in semantic decoding, and that the knowledge graph propagation model further improves the performance. Our contributions are three-fold:

• We are among the first to study matrix factorization techniques for unsupervised SLU, taking account of additional information;

• We augment the MF model with a knowledge graph propagation model, increasing the global coherence of semantic decoding using induced slots;

• Our experimental results show that the MF-based unsupervised SLU outperforms strong discriminative baselines, obtaining promising results.

In the next section, we outline the related work in unsupervised SLU and latent variable modeling for spoken language processing. Section 3 introduces our framework. The detailed MF approach is explained in Section 4. We then introduce the global knowledge graphs for MF in Section 5. Section 6 shows the experimental results, and Section 7 concludes.

2 Related Work

Unsupervised SLU Tur et al. (2011; 2012) were among the first to consider unsupervised approaches for SLU, where they exploited query logs for slot-filling. In a subsequent study, Heck and Hakkani-Tür (2012) studied the Semantic Web for an unsupervised intent detection problem in SLU, showing that results obtained from the unsupervised training process align well with the performance of traditional supervised learning. Following their success in unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection (Hakkani-Tür et al., 2013; Chen et al., 2014a), entity extraction (Wang et al., 2014), and extending domain coverage (El-Kahky et al., 2014; Chen and Rudnicky, 2014).

However, most of the studies above do not explicitly learn latent factor representations from the data, whereas we hypothesize that better robustness to noisy data can be achieved by explicitly modeling the measurement errors (usually produced by automatic speech recognizers (ASR)) using latent variable models and by taking additional local and global semantic constraints into account.

Latent Variable Modeling in SLU Early studies on latent variable modeling in speech included the classic hidden Markov model for statistical speech recognition (Jelinek, 1997). Recently, Celikyilmaz et al. (2011) were the first to study the intent detection problem using query logs and a discrete Bayesian latent variable model. In the field of dialogue modeling, the partially observable Markov decision process (POMDP) (Young et al., 2013) model is a popular technique for dialogue management, reducing the cost of hand-crafted dialogue managers while producing robustness against speech recognition errors. More recently, Tur et al. (2013) used a semi-supervised LDA model to show improvement on the slot filling task. Also, Zhai and Williams (2014) proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results.

However, for unsupervised learning in SLU, it is not obvious how to incorporate additional information in the HMMs. To the best of our knowledge, this paper is the first to consider MF techniques for learning latent feature representations in unsupervised SLU, taking various local and global lexical, syntactic, and semantic information into account.

3 The Proposed Framework

This paper introduces a matrix factorization technique for unsupervised SLU. The proposed framework is shown in Figure 1(a). Given the utterances, the task of the SLU model is to decode their surface patterns into semantic forms and, simultaneously, to differentiate the target semantic concepts from the generic semantic space for task-oriented SDSs. Note that our model does not require any human-defined slots or domain-specific semantic representations for utterances.

In the proposed model, we first build a feature matrix to represent the training utterances, where each row represents an utterance, and each column refers to an observed surface pattern or an induced slot candidate. Figure 1(b) illustrates an example of the matrix.


Figure 1: (a) The proposed framework. (b) Our matrix factorization method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, shaded circles are inferred facts. The slot induction (yellow arrow) maps observed surface patterns to semantic slot candidates. The word relation model (blue arrow) constructs correlations between surface patterns. The slot relation model (pink arrow) learns the slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization (gray arrow) incorporates these models jointly and produces a coherent, domain-specific SLU model.

Given a testing utterance, we convert it into a vector based on the observed surface patterns, and then fill in the missing values of the slots. In the first utterance in the figure, although the semantic slot food is not observed, the utterance implies the meaning facet food. The MF approach is able to learn the latent feature vectors for utterances and semantic elements, inferring implicit semantic concepts to improve the decoding process, namely by filling the matrix with probabilities (lower part of the matrix).

The feature model is built on the observed word patterns and slot candidates, where the slot candidates are obtained from the slot induction component through frame-semantic parsing (the yellow block in Figure 1(a)) (Chen et al., 2013b). Section 4.1 explains the details of the feature model.

In order to consider the additional inter-word and inter-slot relations, we propose a knowledge graph propagation model based on two knowledge graphs, which includes a word relation model (blue block) and a slot relation model (pink block), described in Section 4.2. The method of automatic knowledge graph construction is introduced in Section 5, where we leverage distributed word embeddings associated with typed syntactic dependencies to model the relations (Mikolov et al., 2013b; Mikolov et al., 2013c; Levy and Goldberg, 2014; Chen et al., 2015).

Finally, we train the SLU model by learning latent feature vectors for utterances and slot candidates through MF techniques. Combined with a knowledge graph propagation model based on word/slot relations, the trained SLU model estimates the probability that each semantic slot occurs in the testing utterance, and how likely each slot is to be domain-specific, simultaneously. In other words, the SLU model is able to transform the testing utterances into domain-specific semantic representations without human involvement.

4 The Matrix Factorization Approach

Considering the benefits brought by MF techniques, including 1) modeling the noisy data, 2) modeling hidden semantics, and 3) modeling the long-range dependencies between observations, in this work we apply an MF approach to SLU modeling for SDSs. In our model, we use U to denote the set of input utterances, W the set of word patterns, and S the set of semantic slots that we would like to predict. The pair of an utterance u ∈ U and a word pattern/semantic slot x ∈ {W + S}, ⟨u, x⟩, is a fact. The input to our model is a set of observed facts O, and the observed facts for a given utterance are denoted by {⟨u, x⟩ ∈ O}. The goal of our model is to estimate, for a given utterance u and a given word pattern/semantic slot x, the probability p(M_{u,x} = 1), where M_{u,x} is a binary random variable that is true if and only if x is the word pattern/domain-specific semantic slot in the utterance u. We introduce a series of exponential family models that estimate the probability using a natural parameter θ_{u,x} and the logistic sigmoid function:

p(M_{u,x} = 1 | θ_{u,x}) = σ(θ_{u,x}) = 1 / (1 + exp(−θ_{u,x}))    (1)

We construct a matrix M of size |U| × (|W| + |S|) as observed facts for MF by integrating a feature model and a knowledge graph propagation model below.

Figure 2: An example of probabilistic frame-semantic parsing on the ASR output "can i have a cheap restaurant", producing the frames capability (FT, LU: can; FE filler: i), expensiveness (FT, LU: cheap), and locale_by_use (FT/FE, LU: restaurant). FT: frame target. FE: frame element. LU: lexical unit.

4.1 Feature Model

First, we build a word pattern matrix F_w with binary values based on observations, where each row represents an utterance and each column refers to an observed unigram. In other words, F_w carries the basic word vectors for the utterances, illustrated as the left part of the matrix in Figure 1(b).

To induce the semantic elements, we parse all ASR-decoded utterances in our corpus using SEMAFOR², a state-of-the-art semantic parser for frame-semantic parsing (Das et al., 2010; Das et al., 2013), and extract all frames from the semantic parsing results as slot candidates (Chen et al., 2013b; Dinarelli et al., 2009). Figure 2 shows an example of an ASR-decoded output parsed by SEMAFOR. Three FrameNet-defined frames (capability, expensiveness, and locale_by_use) are generated for the utterance, which we consider as slot candidates for a domain-specific dialogue system (Baker et al., 1998). Then we build a slot matrix F_s with binary values based on the induced slots, which also denotes the slot features for the utterances (right part of the matrix in Figure 1(b)).

²http://www.ark.cs.cmu.edu/SEMAFOR/

To build the feature model M_F, we concatenate the two matrices:

M_F = [ F_w  F_s ],    (2)

which is the upper part of the matrix in Figure 1(b) for training utterances. Note that we do not use any annotations, so all slot candidates are included.
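As a concrete illustration of the feature model, the following sketch (ours, not the authors' code) builds the binary word-pattern matrix F_w and slot matrix F_s from tokenized utterances and parser-induced slot candidates, and concatenates them into M_F as in (2); the input format and function names are assumptions made for illustration.

```python
import numpy as np

def build_feature_matrix(utterances, frames_per_utt):
    """Build M_F = [F_w  F_s]: binary word-pattern and slot-candidate features.

    utterances:     list of token lists (e.g., ASR 1-best hypotheses)
    frames_per_utt: list of frame lists induced by a frame-semantic parser
    """
    vocab = sorted({w for utt in utterances for w in utt})
    slots = sorted({f for fs in frames_per_utt for f in fs})
    w_idx = {w: j for j, w in enumerate(vocab)}
    s_idx = {s: j for j, s in enumerate(slots)}

    F_w = np.zeros((len(utterances), len(vocab)))
    F_s = np.zeros((len(utterances), len(slots)))
    for i, (utt, fs) in enumerate(zip(utterances, frames_per_utt)):
        for w in utt:
            F_w[i, w_idx[w]] = 1.0      # observed unigram
        for f in fs:
            F_s[i, s_idx[f]] = 1.0      # induced slot candidate
    return np.hstack([F_w, F_s]), vocab, slots

# toy usage with the utterances of Figure 1(b)
utts = [["i", "would", "like", "a", "cheap", "restaurant"],
        ["find", "a", "restaurant", "with", "chinese", "food"]]
frames = [["expensiveness", "locale_by_use"], ["locale_by_use", "food"]]
M_F, vocab, slots = build_feature_matrix(utts, frames)
```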

4.2 Knowledge Graph Propagation Model

Since SEMAFOR was trained on FrameNet annotations, which have a more generic frame-semantic context, not all the frames from the parsing results can be used as actual slots in a domain-specific dialogue system. For instance, in Figure 2, we see that the frames "expensiveness" and "locale_by_use" are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the "capability" frame does not convey particularly valuable information for SLU.

Assuming that domain-specific concepts are usually related to each other, considering global relations between semantic slots induces a more coherent slot set. It has been shown that the relations on knowledge graphs help make decisions on domain-specific slots (Chen et al., 2015). We consider two directed graphs, a semantic and a lexical knowledge graph, where each node in the semantic knowledge graph is a slot candidate s_i generated by the frame-semantic parser, and each node in the lexical knowledge graph is a word w_j.

• The slot-based semantic knowledge graph is built as G_s = ⟨V_s, E_ss⟩, where V_s = {s_i ∈ S} and E_ss = {e_ij | s_i, s_j ∈ V_s}.

• The word-based lexical knowledge graph is built as G_w = ⟨V_w, E_ww⟩, where V_w = {w_i ∈ W} and E_ww = {e_ij | w_i, w_j ∈ V_w}.

The edges connect two nodes in the graphs if there is a typed dependency between them. Figure 3 is a simplified example of a slot-based semantic knowledge graph. The structured graph helps define a coherent slot set. To model the relations between words/slots based on the knowledge graphs, we define two relation models below.

Figure 3: A simplified example of the automatically derived knowledge graph, where slot nodes (e.g., locale_by_use, food, expensiveness, seeking, relational_quantity) are connected by typed dependency edges (e.g., AMOD, NN, PREP_FOR).

• Semantic Relation. For modeling word semantic relations, we compute a matrix R^S_w = [Sim(w_i, w_j)]_{|W|×|W|}, where Sim(w_i, w_j) is the cosine similarity between the dependency embeddings of the word patterns w_i and w_j after normalization. For slot semantic relations, we compute R^S_s = [Sim(s_i, s_j)]_{|S|×|S|} similarly³. The matrices R^S_w and R^S_s model not only semantic but also functional similarity, since we use dependency-based embeddings (Levy and Goldberg, 2014).

• Dependency Relation. Assuming that important semantic slots are usually mutually related to each other, that is, connected by syntactic dependencies, our automatically derived knowledge graphs are able to help model the dependency relations. For word dependency relations, we compute a matrix R^D_w = [r̂(w_i, w_j)]_{|W|×|W|}, where r̂(w_i, w_j) measures the dependency between two word patterns w_i and w_j based on the word-based lexical knowledge graph; the details are described in Section 5. For slot dependency relations, we similarly compute R^D_s = [r̂(s_i, s_j)]_{|S|×|S|} based on the slot-based semantic knowledge graph.

With the built word relation models (R^S_w and R^D_w) and slot relation models (R^S_s and R^D_s), we combine them into a knowledge graph propagation matrix M_R⁴:

M_R = [ R^{SD}_w      0
        0             R^{SD}_s ],    (3)

where R^{SD}_w = R^S_w + R^D_w and R^{SD}_s = R^S_s + R^D_s integrate semantic and dependency relations. The goal of this matrix is to propagate scores between nodes according to different types of relations in the knowledge graphs (Chen and Metze, 2012).

³For each column in R^S_w and R^S_s, we only keep the top 10 highest values, which correspond to the top 10 semantically similar nodes.

⁴The values in the diagonal of M_R are 0 in order to model the propagation from other entries.
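A minimal sketch of how M_R in (3) could be assembled is given below, assuming the dependency-based embeddings and the r̂-based matrices from Section 5 are already available; the helper names and the way embeddings are passed in are our assumptions, while the top-10 column sparsification and the zero diagonal follow footnotes 3 and 4.

```python
import numpy as np

def similarity_matrix(emb, top_k=10):
    """R^S: pairwise cosine similarities of dependency-based embeddings,
    keeping only the top-k values per column (footnote 3)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    R = unit @ unit.T
    np.fill_diagonal(R, 0.0)
    for j in range(R.shape[1]):                 # sparsify each column
        col = R[:, j]
        cutoff = np.sort(col)[-top_k] if len(col) > top_k else -np.inf
        col[col < cutoff] = 0.0
    return R

def propagation_matrix(RS_w, RD_w, RS_s, RD_s):
    """M_R = blockdiag(R^SD_w, R^SD_s) with R^SD = R^S + R^D and zero diagonal."""
    RSD_w, RSD_s = RS_w + RD_w, RS_s + RD_s
    np.fill_diagonal(RSD_w, 0.0)                # footnote 4: no self-propagation
    np.fill_diagonal(RSD_s, 0.0)
    n_w, n_s = RSD_w.shape[0], RSD_s.shape[0]
    M_R = np.zeros((n_w + n_s, n_w + n_s))
    M_R[:n_w, :n_w] = RSD_w
    M_R[n_w:, n_w:] = RSD_s
    return M_R
```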

4.3 Integrated Model

With the feature model M_F and the knowledge graph propagation model M_R, we integrate them into a single matrix:

M = M_F · (αI + βM_R)    (4)
  = [ αF_w + βF_wR_w      0
      0                   αF_s + βF_sR_s ],

where M is the final matrix and I is the identity matrix. α and β are the weights for balancing original values and propagated values, where α + β = 1. The matrix M is similar to M_F, but some weights are enhanced through the knowledge graph propagation model M_R. The word relations are built by F_wR_w, which is the matrix with internal weight propagation on the lexical knowledge graph (the blue arrow in Figure 1(b)). Similarly, F_sR_s models the slot correlations, and can be treated as the matrix with internal weight propagation on the semantic knowledge graph (the pink arrow in Figure 1(b)). The propagation models can be treated as running a random walk algorithm on the graphs.
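Equation (4) itself reduces to a single matrix product; a short sketch (with α and β supplied by the caller, e.g., α = β = 0.5 as in the experimental setting) might look like the following.

```python
import numpy as np

def integrate(M_F, M_R, alpha=0.5, beta=0.5):
    """M = M_F . (alpha*I + beta*M_R), as in equation (4); alpha + beta = 1."""
    n = M_F.shape[1]                  # |W| + |S| columns
    return M_F @ (alpha * np.eye(n) + beta * M_R)
```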

F_s contains all slot candidates generated by SEMAFOR, which may include some generic slots (such as capability), so the original feature model cannot differentiate the domain-specific concepts from the generic ones. By integrating with R_s, the semantic and dependency relations can be propagated via the knowledge graph, and the domain-specific concepts may receive higher weights based on the assumption that the slots for dialogue systems are often mutually related (Chen et al., 2015). Hence, the structural information is automatically incorporated into the matrix. The word relation model plays the same role at the word level. In conclusion, for each utterance, the integrated model not only predicts the probability that semantic slots occur but also considers whether the slots are domain-specific. The following sections describe the learning process.

4.4 Parameter Estimation

The proposed model is parameterized through weights and latent component vectors, where the parameters are estimated by maximizing the log-likelihood of the observed data (Collins et al., 2001):

θ = arg max_θ ∏_{u∈U} p(θ | M_u)    (5)
  = arg max_θ ∏_{u∈U} p(M_u | θ) p(θ)
  = arg max_θ ∑_{u∈U} ln p(M_u | θ) − λ_θ,

where M_u is the vector corresponding to the utterance u from M_{u,x} in (1), because we assume that each utterance is independent of the others.

To avoid treating unobserved facts as designed negative facts, we consider our positive-only data as implicit feedback. Bayesian Personalized Ranking (BPR) is an optimization criterion that learns from implicit feedback for MF, using a variant of ranking: observed true facts should receive higher scores than unobserved (true or false) facts (Rendle et al., 2009). Riedel et al. (2013) also showed that BPR learns the implicit relations, improving the relation extraction task.

4.4.1 Objective Function

To estimate the parameters in (5), we create a dataset of ranked pairs from M in (4): for each utterance u and each observed fact f⁺ = ⟨u, x⁺⟩, where M_{u,x⁺} ≥ δ, we choose each word pattern/slot x⁻ such that f⁻ = ⟨u, x⁻⟩, where M_{u,x⁻} < δ, which refers to a word pattern/slot we have not observed to be in utterance u. That is, we construct the observed data O from M. Then for each pair of facts f⁺ and f⁻, we want to model p(f⁺) > p(f⁻) and hence θ_{f⁺} > θ_{f⁻} according to (1). BPR maximizes the summation over ranked pairs, with the objective

∑_{u∈U} ln p(M_u | θ) = ∑_{f⁺∈O} ∑_{f⁻∉O} ln σ(θ_{f⁺} − θ_{f⁻}).    (6)

The BPR objective is an approximation of the per-utterance AUC (area under the ROC curve), which directly correlates with what we want to achieve: well-ranked semantic slots per utterance.

4.4.2 Optimization

To maximize the objective in (6), we employ a stochastic gradient descent (SGD) algorithm (Rendle et al., 2009). For each randomly sampled observed fact ⟨u, x⁺⟩, we sample an unobserved fact ⟨u, x⁻⟩, which results in |O| fact pairs ⟨f⁻, f⁺⟩. For each pair, we perform an SGD update using the gradient of the corresponding objective function for matrix factorization (Gantner et al., 2011).
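A rough BPR-style training loop consistent with (6) is sketched below. It assumes the natural parameter θ_{u,x} is modeled as a dot product of latent utterance and word/slot vectors; the factorization rank, learning rate, regularization, and sampling scheme are illustrative choices, not the paper's reported configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_train(M, delta=0.5, rank=100, lr=0.05, reg=0.01, epochs=10, seed=0):
    """BPR-style SGD: rank observed facts (M[u, x] >= delta) above unobserved
    ones, with theta[u, x] modeled as a dot product of latent vectors.
    The rank, learning rate, and regularizer are illustrative choices."""
    rng = np.random.default_rng(seed)
    n_utt, n_col = M.shape
    P = 0.1 * rng.standard_normal((n_utt, rank))   # latent utterance factors
    Q = 0.1 * rng.standard_normal((n_col, rank))   # latent word/slot factors
    observed = [np.flatnonzero(M[u] >= delta) for u in range(n_utt)]

    for _ in range(epochs):
        for u in range(n_utt):
            pos = observed[u]
            if len(pos) == 0 or len(pos) == n_col:
                continue                            # nothing to rank for this utterance
            for x_pos in pos:
                x_neg = rng.integers(n_col)         # sample an unobserved fact
                while M[u, x_neg] >= delta:
                    x_neg = rng.integers(n_col)
                diff = P[u] @ (Q[x_pos] - Q[x_neg])
                g = sigmoid(-diff)                  # gradient scale of ln sigma(diff)
                p_u = P[u].copy()
                P[u]     += lr * (g * (Q[x_pos] - Q[x_neg]) - reg * p_u)
                Q[x_pos] += lr * (g * p_u - reg * Q[x_pos])
                Q[x_neg] += lr * (-g * p_u - reg * Q[x_neg])
    return P, Q   # predict p(M[u, x] = 1) as sigmoid(P[u] @ Q[x])
```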

Figure 4: The dependency parsing result of "can i have a cheap restaurant" (typed dependencies: nsubj, ccomp, det, amod, dobj), with the frames capability, expensiveness, and locale_by_use aligned to their slot-fillers.

5 Knowledge Graph Construction

This section introduces the procedure for constructing the knowledge graphs, in order to estimate r̂(w_i, w_j) for building R^D_w and r̂(s_i, s_j) for building R^D_s in Section 4.2. Considering the relations in the knowledge graphs, the edge weights for E_ww and E_ss are measured as r̂(w_i, w_j) and r̂(s_i, s_j) based on the dependency parsing results, respectively.

The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4. The arrows denote the dependency relations from headwords to their dependents, and the labels on the arcs denote the types of the dependencies. All typed dependencies between two words are encoded as triples and form a word-based dependency set T_w = {⟨w_i, t, w_j⟩}, where t is the typed dependency between the headword w_i and the dependent w_j. For example, Figure 4 generates ⟨restaurant, AMOD, cheap⟩, ⟨restaurant, DOBJ, have⟩, etc. for T_w. Similarly, we build a slot-based dependency set T_s = {⟨s_i, t, s_j⟩} by transforming dependencies between slot-fillers into ones between slots. For example, ⟨restaurant, AMOD, cheap⟩ from T_w is transformed into ⟨locale_by_use, AMOD, expensiveness⟩ for building T_s, because both sides of this dependency are parsed as slot-fillers by SEMAFOR.
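The bookkeeping for turning word-level dependencies into T_w and T_s can be sketched as follows; the triple format and the word-to-slot mapping (taken from the frame-semantic parse of the running example) are hypothetical inputs rather than the authors' data structures.

```python
def build_dependency_sets(word_triples, word_to_slot):
    """Build T_w and T_s from typed dependency triples <w_i, t, w_j>.

    word_triples: list of (head_word, dep_type, dependent_word) tuples
    word_to_slot: mapping from slot-filler words to their induced slots,
                  e.g., {"cheap": "expensiveness", "restaurant": "locale_by_use"}
    """
    T_w = list(word_triples)
    T_s = [(word_to_slot[h], t, word_to_slot[d])
           for (h, t, d) in word_triples
           if h in word_to_slot and d in word_to_slot]   # both sides must be slot-fillers
    return T_w, T_s

# running example "can i have a cheap restaurant"
triples = [("restaurant", "AMOD", "cheap"), ("restaurant", "DOBJ", "have")]
fillers = {"cheap": "expensiveness", "restaurant": "locale_by_use", "can": "capability"}
T_w, T_s = build_dependency_sets(triples, fillers)
# T_s == [("locale_by_use", "AMOD", "expensiveness")], since "have" is not a slot-filler
```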

5.1 Relation Weight Estimation

For the edges in the knowledge graphs, we model the relations between two connected nodes x_i and x_j as r̂(x_i, x_j), where x is either a slot s or a word pattern w. Since the weights are measured based on the relations between nodes regardless of direction, we combine the scores of the two directional dependencies:

r̂(x_i, x_j) = r(x_i → x_j) + r(x_j → x_i),    (7)

where r(x_i → x_j) is the score estimating the dependency with x_i as a head and x_j as a dependent. We propose a scoring function for r(·) using dependency-based embeddings.

Table 1: The example contexts extracted for training dependency-based word/slot embeddings.

       Typed Dependency Relation               Target Word      Contexts
Word   ⟨restaurant, AMOD, cheap⟩               restaurant       cheap/AMOD
                                               cheap            restaurant/AMOD⁻¹
Slot   ⟨locale_by_use, AMOD, expensiveness⟩    locale_by_use    expensiveness/AMOD
                                               expensiveness    locale_by_use/AMOD⁻¹

5.1.1 Dependency-Based Embeddings

Most neural embeddings use linear bag-of-words contexts, where a window size is defined to produce contexts of the target words (Mikolov et al., 2013c; Mikolov et al., 2013b; Mikolov et al., 2013a). However, some important contexts may be missing with smaller windows, while larger windows capture broad topical content. A dependency-based embedding approach was proposed to derive contexts based on the syntactic relations the word participates in for training embeddings, where the embeddings are less topical but offer more functional similarity compared to the original embeddings (Levy and Goldberg, 2014).

Table 1 shows the extracted dependency-based contexts for each target word from the example in Figure 4, where headwords and their dependents form each other's contexts by following the arcs in the dependency tree, and −1 denotes the directionality of the dependency. After replacing the original bag-of-words contexts with dependency-based contexts, we can train dependency-based embeddings for all target words (Yih et al., 2014; Bordes et al., 2011; Bordes et al., 2013).

For training dependency-based word embeddings, each target x is associated with a vector v_x ∈ R^d and each context c is represented as a context vector v_c ∈ R^d, where d is the embedding dimensionality. We learn vector representations for both targets and contexts such that the dot product v_x · v_c associated with "good" target-context pairs belonging to the training data D is maximized, leading to the objective function

arg max_{v_x, v_c} ∑_{(x,c)∈D} log [ 1 / (1 + exp(−v_c · v_x)) ],    (8)

which can be trained using stochastic gradient updates (Levy and Goldberg, 2014). Then we can obtain the dependency-based slot and word embeddings using T_s and T_w respectively.
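A toy gradient-ascent sketch of objective (8) is shown below. Practical systems train such embeddings with word2vec-style toolkits and negative sampling, so the plain positive-pair objective, the dimensionality, and the learning rate here are simplifying assumptions rather than the paper's training setup.

```python
import numpy as np

def train_dep_embeddings(pairs, dim=50, lr=0.025, epochs=5, seed=0):
    """Maximize sum log sigmoid(v_c . v_x) over (target, context) pairs,
    i.e., objective (8) without negative sampling (simplified sketch)."""
    rng = np.random.default_rng(seed)
    V_x = {x: 0.1 * rng.standard_normal(dim) for x, _ in pairs}
    V_c = {c: 0.1 * rng.standard_normal(dim) for _, c in pairs}

    for _ in range(epochs):
        for x, c in pairs:
            score = V_x[x] @ V_c[c]
            g = 1.0 / (1.0 + np.exp(score))      # derivative of log sigmoid(score)
            dx, dc = lr * g * V_c[c], lr * g * V_x[x]
            V_x[x] = V_x[x] + dx
            V_c[c] = V_c[c] + dc
    return V_x, V_c

# contexts as in Table 1, e.g., ("restaurant", "cheap/AMOD"), ("cheap", "restaurant/AMOD-1")
```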

5.1.2 Embedding-Based Scoring Function

With the trained dependency-based embeddings, we estimate the probability that x_i is the headword and x_j is its dependent via the typed dependency t as

P(x_i →_t x_j) = [ Sim(x_i, x_j/t) + Sim(x_j, x_i/t⁻¹) ] / 2,    (9)

where Sim(x_i, x_j/t) is the cosine similarity between the word/slot embedding v_{x_i} and the context embedding v_{x_j/t} after normalizing to [0, 1].

Based on the dependency set T_x, we use t_{x_i→x_j} to denote the most probable typed dependency with x_i as a head and x_j as a dependent:

t_{x_i→x_j} = arg max_t C(x_i →_t x_j),    (10)

where C(x_i →_t x_j) counts how many times the dependency ⟨x_i, t, x_j⟩ occurs in the dependency set T_x. Then the scoring function r(·) in (7) that estimates the dependency x_i → x_j is measured as

r(x_i → x_j) = C(x_i →_{t_{x_i→x_j}} x_j) · P(x_i →_{t_{x_i→x_j}} x_j),    (11)

which is equal to the highest observed frequency of the dependency x_i → x_j among all types from T_x, additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting on the smaller dataset. Figure 3 is a simplified example of an automatically derived semantic knowledge graph with the most probable typed dependencies as edges based on the estimated weights.

The relation weights r̂(x_i, x_j) can then be obtained by (7) in order to build the R^D_w and R^D_s matrices.
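Putting (9)-(11) and (7) together, the weight computation could be sketched as below, assuming the target and context embeddings of Section 5.1.1 are stored in dictionaries keyed as in Table 1 (e.g., "cheap/AMOD", "restaurant/AMOD-1"); the specific [0, 1] rescaling of the cosine similarity and the helper names are our assumptions.

```python
import numpy as np
from collections import Counter

def cos01(u, v):
    """Cosine similarity rescaled to [0, 1] (one plausible normalization)."""
    c = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (c + 1.0) / 2.0

def p_dep(xi, xj, t, V_x, V_c):
    """Equation (9): estimated probability that xi heads xj via typed dependency t."""
    return (cos01(V_x[xi], V_c[f"{xj}/{t}"]) +
            cos01(V_x[xj], V_c[f"{xi}/{t}-1"])) / 2.0

def r_score(xi, xj, T_x, V_x, V_c):
    """Equations (10)-(11): frequency of the most frequent typed dependency
    xi -> xj in T_x, weighted by its estimated probability."""
    counts = Counter(t for (head, t, dep) in T_x if head == xi and dep == xj)
    if not counts:
        return 0.0
    t_best, freq = counts.most_common(1)[0]
    return freq * p_dep(xi, xj, t_best, V_x, V_c)

def r_hat(xi, xj, T_x, V_x, V_c):
    """Equation (7): direction-independent relation weight for the R^D matrices."""
    return r_score(xi, xj, T_x, V_x, V_c) + r_score(xj, xi, T_x, V_x, V_c)
```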

6 Experiments

6.1 Experimental Setup

In this experiment, we used the Cambridge University SLU corpus, previously used in several other SLU tasks (Henderson et al., 2012; Chen et al., 2013a). The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple SDSs in an in-car setting. The corpus contains a total of 2,166 dialogues, including 15,453 utterances (10,571 for self-training and 4,882 for testing).

Table 2: The MAP of predicted slots (%). Marked results are significantly better than Multinomial Logistic Regression (row (b)) with p < 0.05 in a paired t-test.

Approach                              ASR                     Manual
                                  w/o      w/ Explicit    w/o      w/ Explicit
Explicit  SVM (a)                  -        32.48          -        36.62
          MLR (b)                  -        33.96          -        38.78
Implicit  Baseline  Random (c)    3.43      22.45         2.63      25.09
                    Majority (d)  15.37     32.88         16.43     38.41
          MF  Feature (e)         24.24     37.61         22.55     45.34
              Feature + KGP (f)   40.46     43.51         52.14     53.40


Figure 5: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1,868. An ASR system was used to transcribe the speech; the word error rate was reported as 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type.

For the parameter setting, the weights for balancing the feature model and the propagation model, α and β, are both set to 0.5 to give them the same influence, and the threshold δ for defining unobserved facts is set to 0.5 for all experiments. We use the Stanford Parser⁵ to obtain the collapsed typed syntactic dependencies (Socher et al., 2013) and set the dimensionality of the embeddings to d = 300 in all experiments.

6.2 Evaluation Metrics

To evaluate the accuracy of the automatically decoded slots, we measure their quality as the proximity between predicted slots and reference slots. Figure 5 shows the mappings that indicate semantically related induced slots and reference slots (Chen et al., 2013b).

To eliminate the influence of threshold selection when predicting semantic slots, in the following metrics we take the whole ranking list into account and evaluate the performance with metrics that are independent of the selected threshold.

⁵http://nlp.stanford.edu/software/lex-parser.shtml

For each utterance, given the predicted probabilities of all slot candidates, we compute an average precision (AP) to evaluate the SLU performance by treating the slots with mappings as positive. AP scores a ranking result higher if the correct slots are ranked higher, and it also approximates the area under the precision-recall curve (Boyd et al., 2012). Mean average precision (MAP) is the metric for evaluating all utterances. For all experiments, we perform a paired t-test on the AP scores of the results to test for significance.
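For reference, MAP over per-utterance rankings can be computed, for example, with scikit-learn's average_precision_score; the list-of-arrays input format below is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """MAP over utterances: y_true[i] marks the slots with reference mappings (0/1),
    y_score[i] holds the predicted slot probabilities for utterance i."""
    aps = [average_precision_score(t, s)
           for t, s in zip(y_true, y_score)
           if np.any(t)]                      # skip utterances without positive slots
    return float(np.mean(aps))
```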

6.3 Evaluation Results

Table 2 shows the MAP performance of the predicted slots for all experiments on ASR and manual transcripts. For the first baseline using explicit semantics, we use the observed data to self-train models for predicting the probability of each semantic slot with a support vector machine (SVM) with a linear kernel and multinomial logistic regression (MLR) (rows (a)-(b)) (Pedregosa et al., 2011; Henderson et al., 2012). SVM and MLR perform similarly, with MLR slightly better than SVM because it has a better capability of estimating probabilities. For modeling implicit semantics, two baselines are included as references, Random (row (c)) and Majority (row (d)), where the former assigns random probabilities to all slots, and the latter assigns probabilities to the slots based on their frequency distribution. To improve probability estimation, we further integrate the results from implicit semantics with the better of the explicit approaches, MLR (row (b)), by averaging the probability distributions of the two results.

The two baselines, Random and Majority, cannot model the implicit semantics and produce poor results. The results of Random integrated with MLR significantly degrade the performance of MLR for both ASR and manual transcripts.

Table 3: The MAP of predicted slots using different types of relation models in M_R (%). Marked results are significantly better than the feature model (column (a)) with p < 0.05 in a paired t-test.

Model    Feature    Knowledge Graph Propagation Model
Rel.     (a) None   (b) Semantic          (c) Dependency        (d) Word             (e) Slot             (f) All
M_R      -          [R^S_w 0; 0 R^S_s]    [R^D_w 0; 0 R^D_s]    [R^{SD}_w 0; 0 0]    [0 0; 0 R^{SD}_s]    [R^{SD}_w 0; 0 R^{SD}_s]
ASR      37.61      41.39                 41.63                 39.19                42.10                43.51
Manual   45.34      51.55                 49.04                 45.18                49.91                53.40

Also, the results of Majority integrated with MLR do not produce any difference compared to the MLR baseline. Among the proposed MF approaches, using only the feature model to build the matrix (row (e)) achieves 24.2% and 22.6% MAP for ASR and manual results respectively, which is worse than the two baselines using explicit semantics. However, in combination with explicit semantics, using only the feature model significantly outperforms the baselines, where the performance improves from about 34.0% to 37.6% and from 38.8% to 45.3% for ASR and manual results respectively. Additionally integrating the knowledge graph propagation (KGP) model (row (f)) outperforms the baselines for both ASR and manual transcripts, and the performance is further improved by combining with explicit semantics (achieving MAP of 43.5% and 53.4%). The experiments show that the proposed MF models successfully learn the implicit semantics and consider the relations and domain-specificity simultaneously.

6.4 Discussion and Analysis

With the promising results obtained by the proposed models, we analyze in detail the differences between the relation models in Table 3.

6.4.1 Effectiveness of Semantic and Dependency Relation Models

To evaluate the effectiveness of the semantic and dependency relations, we consider each of them individually in M_R of (3) (columns (b) and (c) in Table 3). Compared to the original feature model (column (a)), both modeling semantic relations and modeling dependency relations significantly improve the performance for ASR and manual results. This shows that semantic relations help the SLU model infer implicit meaning, making the prediction more accurate. Also, dependency relations successfully differentiate the generic concepts from the domain-specific concepts, so that the SLU model is able to predict a more coherent set of semantic slots (Chen et al., 2015). Integrating the two types of relations (column (f)) further improves the performance.

6.4.2 Comparing Word/Slot Relation Models

To analyze the contributions of the inter-word and inter-slot relations, columns (d) and (e) show the results of considering only word relations and only slot relations respectively. It can be seen that the inter-slot relation model significantly improves the performance for both ASR and manual results. However, the inter-word relation model yields only slightly better results for ASR output (from 37.6% to 39.2%), and there is no difference after applying the inter-word relation model on manual transcripts. The reason may be that inter-slot relations carry high-level semantics that align well with the structure of SDSs, while inter-word relations do not. Nevertheless, combining the two types of relations (column (f)) outperforms both results for ASR and manual transcripts, showing that different types of relations can compensate for each other and benefit the SLU performance.

7 Conclusions

This paper presents an MF approach to self-train the SLU model for semantic decoding in an unsupervised way. The purpose of the proposed model is not only to predict the probability of each semantic slot but also to distinguish between generic semantic concepts and domain-specific concepts that are related to an SDS. The experiments show that the MF-based model obtains promising results, outperforming strong discriminative baselines.

Acknowledgments

We thank the anonymous reviewers for their useful comments and Prof. Manfred Stede for his mentoring. We are also grateful for MetLife's support. Any opinions, findings, and conclusions expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies.


References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING, pages 86–90.

Dan Bohus and Alexander I. Rudnicky. 2009. The RavenClaw dialog management framework: Architecture and systems. Computer Speech & Language, 23(3):332–361.

Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. 2011. Learning structured embeddings of knowledge bases. In Proceedings of AAAI.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of Advances in Neural Information Processing Systems, pages 2787–2795.

Kendrick Boyd, Vitor Santos Costa, Jesse Davis, and C. David Page. 2012. Unachievable region in precision-recall space and its effect on empirical evaluation. In Proceedings of the International Conference on Machine Learning, volume 2012, page 349. NIH Public Access.

Asli Celikyilmaz, Dilek Hakkani-Tür, and Gokhan Tür. 2011. Leveraging web query logs to learn user intent via Bayesian discrete latent variable model. In Proceedings of ICML.

Yun-Nung Chen and Florian Metze. 2012. Two-layer mutually reinforced random walk for improved multi-party meeting summarization. In Proceedings of the 4th IEEE Workshop on Spoken Language Technology, pages 461–466.

Yun-Nung Chen and Alexander I. Rudnicky. 2014. Dynamically supporting unexplored domains in conversational interactions by enriching semantics with neural word embeddings. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 590–595. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013a. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Proceedings of ICASSP, pages 8317–8321.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013b. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 120–125. IEEE.

Yun-Nung Chen, Dilek Hakkani-Tür, and Gokhan Tur. 2014a. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 242–247. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2014b. Leveraging frame semantics and distributional semantics for unsupervised semantic slot induction in spoken dialogue systems. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 584–589. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2015. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL.

Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. 2001. A generalization of principal components analysis to the exponential family. In Proceedings of Advances in Neural Information Processing Systems, pages 617–624.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 948–956.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2013. Frame-semantic parsing. Computational Linguistics.

Marco Dinarelli, Silvia Quarteroni, Sara Tonelli, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Annotating spoken dialogs: From speech segments to dialog acts and frame semantics. In Proceedings of the 2nd Workshop on Semantic Representation of Spoken Language, pages 34–41. ACL.

John Dowding, Jean Mark Gawron, Doug Appelt, John Bear, Lynn Cherny, Robert Moore, and Douglas Moran. 1993. Gemini: A natural language system for spoken-language understanding. In Proceedings of ACL, pages 54–61.

Ali El-Kahky, Derek Liu, Ruhi Sarikaya, Gökhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2014. Extending domain coverage of language understanding systems via intent transfer between domains using knowledge graphs and search query click logs. In Proceedings of ICASSP.

Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. MyMediaLite: A free recommender system library. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 305–308. ACM.

Narendra Gupta, Gökhan Tür, Dilek Hakkani-Tür, Srinivas Bangalore, Giuseppe Riccardi, and Mazin Gilbert. 2006. The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):213–222.

Dilek Hakkani-Tür, Larry Heck, and Gokhan Tur. 2013. Using a knowledge graph and query click logs for unsupervised learning of relation detection. In Proceedings of ICASSP, pages 8327–8331.

Larry Heck and Dilek Hakkani-Tür. 2012. Exploiting the semantic web for unsupervised spoken language understanding. In Proceedings of SLT, pages 228–233.

Larry P. Heck, Dilek Hakkani-Tür, and Gokhan Tur. 2013. Leveraging knowledge graphs for web-scale unsupervised semantic parsing. In Proceedings of INTERSPEECH, pages 1594–1598.

Matthew Henderson, Milica Gasic, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young. 2012. Discriminative spoken language understanding using word confusion networks. In Proceedings of SLT, pages 176–181.

Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, (8):30–37.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, pages 746–751. Citeseer.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, J. Gauvain, Esther Levin, Chin-Hui Lee, and Jay G. Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 193–196. IEEE.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74–84.

Stephanie Seneff. 1992. TINA: A natural language system for spoken language applications. Computational Linguistics, 18(1):61–86.

Anders Skrondal and Sophia Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. CRC Press.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference. Citeseer.

Gokhan Tur, Dilek Z. Hakkani-Tür, Dustin Hillard, and Asli Celikyilmaz. 2011. Towards unsupervised spoken language understanding: Exploiting query click logs for slot filling. In Proceedings of INTERSPEECH, pages 1293–1296.

Gokhan Tur, Minwoo Jeong, Ye-Yi Wang, Dilek Hakkani-Tür, and Larry P. Heck. 2012. Exploiting the semantic web for unsupervised natural language semantic parsing. In Proceedings of INTERSPEECH.

Gokhan Tur, Asli Celikyilmaz, and Dilek Hakkani-Tür. 2013. Latent semantic modeling for slot filling in conversational understanding. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8307–8311. IEEE.

William Yang Wang, Dan Bohus, Ece Kamar, and Eric Horvitz. 2012. Crowdsourcing the acquisition of natural language corpora: Methods and observations. In Proceedings of SLT, pages 73–78.

Lu Wang, Dilek Hakkani-Tür, and Larry Heck. 2014. Leveraging semantic web search and browse sessions for multi-turn spoken dialog systems. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4082–4086. IEEE.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of ACL.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Ke Zhai and Jason D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Proceedings of the Association for Computational Linguistics.
