Unsupervised Learning and Modeling of Knowledge and Intent for Spoken Dialogue Systems


Yun-Nung (Vivian) Chen

Ph.D. Thesis Proposal

Thesis Committee:

Dr. Alexander I. Rudnicky, Carnegie Mellon University
Dr. Anatole Gershman, Carnegie Mellon University
Dr. Alan W Black, Carnegie Mellon University
Dr. Dilek Hakkani-Tür, Microsoft Research


Abstract

Spoken language interfaces are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.). The role of spoken language understanding is of great significance to a successful spoken dialogue system: in order to capture the language variation from dialogue participants, the spoken language understanding component must create a mapping between natural language inputs and semantic representations that capture users’ intentions.

The semantic representation must include “concepts”, “structure”, etc., where concepts are the important semantics capturing the domain-specific topics, and the structure describes the relations between concepts and conveys high-level intention. However, spoken dialogue systems typically use manually predefined semantic elements to parse users’ utterances into unified semantic representations. Defining the knowledge and the structure often involves domain experts and professional annotators, and the cost of development can be expensive. Therefore, current technology usually limits conversational interactions to a few narrow predefined domains/topics. With increasing conversational interactions, this dissertation focuses on improving the generalization and scalability of building dialogue systems by automating knowledge inference and structure learning from unlabelled conversations.

In order to achieve this goal, two questions need to be addressed: 1) Given unlabelled raw audio recordings, how can a system automatically induce and organize the domain-specific concepts? 2) With the automatically acquired knowledge, how can a system understand individual utterances and user intents? To tackle these problems, we propose to acquire the domain knowledge that can be used in specific applications in order to capture human semantics, intents, and behaviors, and then to build an SLU component based on the learned knowledge to offer better interactions in dialogues.

The dissertation mainly focuses on five important stages: ontology induction, structure learning, surface form derivation, semantic decoding, and behavior prediction. To solve the first problem, ontology induction automatically extracts the domain-specific concepts by leveraging available ontologies and distributional semantics. Then an unsupervised machine learning approach is proposed to learn the structure and infer a meaningful organization for the dialogue system design. Surface form derivation learns the natural language expressions that describe elements from the ontology and convey the domain knowledge, enabling better understanding.


For the second problem, the learned knowledge can be utilized to decode users’ semantics and predict the follow-up behaviors in order to provide better interactions.

In conclusion, the dissertation shows that it is feasible to build a dialogue learning system that is able to understand how particular domains work based on unlabelled conversations.

As a result, an initial spoken dialogue system can be automatically built according to the learned knowledge, and its performance can be quickly improved through user interactions in practical usage, presenting the potential of reducing human effort for spoken dialogue system development.


Keywords

Distributional Semantics
Knowledge Graph
Domain Ontology
Matrix Factorization (MF)
Random Walk
Spoken Language Understanding (SLU)
Spoken Dialogue System (SDS)
Word Embeddings


List of Abbreviations

AF Average F-Measure is an evaluation metric that measures the performance of a ranking list by averaging the F-measure over all positions in the ranking list.

AMR Abstract Meaning Representation is a simple and readable semantic representation in AMR Bank.

AP Average Precision is an evaluation metric that measures the performance of a ranking list by averaging the precision over all positions in the ranking list.

ASR Automatic Speech Recognition, also known as computer speech recognition, is the process of converting the speech signal into written text.

AUC Area Under the Precision-Recall Curve is an evaluation metric that measures the performance of a ranking list by averaging the precision over a set of evenly spaced recall levels in the ranking list.

BPR Bayesian Personalized Ranking is an approach to learn with implicit feedback for matrix factorization, especially used in item recommendation.

CBOW Continuous Bag-of-Words is an architecture for learning distributed word representations, which is similar to the feedforward neural net language model but uses continuous distributed representations of the context.

CMU Carnegie Mellon University is a private research university in Pittsburgh.

FE Frame Element is a descriptive vocabulary for the components of each frame.

ISCA International Speech Communication Association is a non-profit organization that aims to promote, in an international world-wide context, activities and exchanges in all fields related to speech communication science and technology.

LTI Language Technologies Institute is a research department in the School of Computer Science at Carnegie Mellon University.

LU Lexical Unit is a word with a sense.

MF Matrix Factorization is a decomposition of a matrix into a product of matrices in the discipline of linear algebra.


NLP Natural Language Processing is a field of artificial intelligence and linguistics that studies the problems intrinsic to the processing and manipulation of natural language.

POS Part of Speech tags, also known as word classes or lexical categories, are traditional categories of words intended to reflect their functions within a sentence.

SDS Spoken Dialogue System is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently.

SGD Stochastic Gradient Descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

SLU Spoken Language Understanding is a component of a spoken dialogue system, which parses natural language into semantic forms that benefit the system’s understanding.

SVM Support Vector Machine is a supervised learning method used for classification and regression based on the Structural Risk Minimization inductive principle.

WAP Weighted Average Precision is an evaluation metric that measures the performance of a ranking list by weighting the precision over all positions in the ranking list.
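As a concrete illustration of the ranking metrics above, the following is a minimal sketch of standard Average Precision, assuming the common formulation that averages precision over the positions of relevant items:

```python
def average_precision(ranking):
    """Average Precision of a binary-relevance ranking list.

    `ranking` lists items top-to-bottom as 1 (relevant) or 0 (not).
    Precision is computed at each position holding a relevant item,
    then averaged over the relevant items.
    """
    hits, precisions = 0, []
    for position, relevant in enumerate(ranking, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6
print(round(average_precision([1, 0, 1]), 4))  # → 0.8333
```

Weighted variants such as WAP follow the same shape but reweight the per-position precision terms before averaging.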


Contents

1 Introduction 1

1.1 Spoken Dialogue System . . . 1

1.2 Thesis Statement . . . 3

1.3 Thesis Structure . . . 4

2 Background and Related Work 7

2.1 Semantic Representation . . . 7

2.2 Spoken Language Understanding (SLU) Component . . . 7

2.3 Ontology and Knowledge Base . . . 8

2.3.1 Generic Concept Knowledge . . . 8

2.3.2 Entity-Based Knowledge . . . 9

2.4 Knowledge-Based Semantic Analyzer . . . 11

2.4.1 Generic Concept Knowledge . . . 11

2.4.2 Entity-Based Knowledge . . . 11

2.5 Distributional Semantics . . . 12

2.5.1 Linear Word Embeddings . . . 13

2.5.1.1 Continuous Bag-of-Words (CBOW) Model . . . 14

2.5.1.2 Continuous Skip-Gram Model . . . 14

2.5.2 Dependency-Based Word Embeddings . . . 14


3 Ontology Induction for Knowledge Acquisition 17

3.1 Introduction . . . 17

3.2 Related Work . . . 18

3.3 The Proposed Framework . . . 19

3.3.1 Probabilistic Semantic Parsing . . . 19

3.3.2 Independent Semantic Decoder . . . 20

3.3.3 Adaptation Process and SLU Model . . . 20

3.4 Slot Ranking Model . . . 21

3.5 Word Representations for Similarity Measure . . . 22

3.5.1 In-Domain Clustering Vectors . . . 22

3.5.2 External Word Vectors . . . 22

3.6 Experiments . . . 23

3.6.1 Experimental Setup . . . 23

3.6.2 Evaluation Metrics . . . 24

3.6.2.1 Slot Induction . . . 24

3.6.2.2 SLU Model . . . 25

3.6.3 Evaluation Results . . . 26

3.7 Summary . . . 26

4 Structure Learning for Knowledge Acquisition 29

4.1 Introduction . . . 29

4.2 Related Work . . . 30

4.3 The Proposed Framework . . . 31

4.4 Slot Ranking Model . . . 31

4.4.1 Knowledge Graphs . . . 32

4.4.2 Edge Weight Estimation . . . 33


4.4.2.1 Frequency-Based Measurement . . . 34

4.4.2.2 Embedding-Based Measurement . . . 34

4.4.3 Random Walk Algorithm . . . 35

4.4.3.1 Single-Graph Random Walk . . . 35

4.4.3.2 Double-Graph Random Walk . . . 36

4.5 Experiments . . . 37

4.5.1 Experimental Setup . . . 37

4.5.2 Evaluation Metrics . . . 37

4.5.3 Evaluation Results . . . 37

4.5.3.1 Slot Induction . . . 38

4.5.3.2 SLU Model . . . 38

4.5.4 Discussion and Analysis . . . 38

4.5.4.1 Comparing Frequency- and Embedding-Based Measurements . . . 38

4.5.4.2 Comparing Single- and Double-Graph Approaches . . . 39

4.5.4.3 Relation Discovery Analysis . . . 39

4.6 Summary . . . 40

5 Surface Form Derivation for Knowledge Acquisition 41

5.1 Introduction . . . 41

5.2 Knowledge Graph Relations . . . 42

5.3 Proposed Framework . . . 43

5.4 Relation Inference from Gazetteers . . . 43

5.5 Relational Surface Form Derivation . . . 44

5.5.1 Web Resource Mining . . . 44

5.5.2 Dependency-Based Entity Embeddings . . . 44

5.5.3 Surface Form Derivation . . . 45


5.5.3.1 Entity Surface Forms . . . 45

5.5.3.2 Entity Syntactic Contexts . . . 46

5.6 Probabilistic Enrichment and Bootstrapping . . . 47

5.7 Experiments . . . 48

5.7.1 Dataset . . . 48

5.7.2 Results . . . 50

5.8 Discussion . . . 51

5.8.1 Effectiveness of Entity Surface Forms . . . 51

5.8.2 Effectiveness of Entity Contexts . . . 51

5.8.3 Comparison of Probabilistic Enrichment Methods . . . 51

5.8.4 Effectiveness of Bootstrapping . . . 52

5.8.5 Overall Results . . . 52

5.9 Summary . . . 53

6 Semantic Decoding in SLU Modeling 55

6.1 Introduction . . . 55

6.2 Related Work . . . 57

6.3 The Proposed Framework . . . 58

6.4 The Matrix Factorization Approach . . . 59

6.4.1 Feature Model . . . 59

6.4.2 Knowledge Graph Propagation Model . . . 60

6.4.3 Integrated Model . . . 61

6.4.4 Parameter Estimation . . . 62

6.4.4.1 Objective Function . . . 62

6.4.4.2 Optimization . . . 63

6.5 Experiments . . . 63


6.5.1 Experimental Setup . . . 63

6.5.2 Evaluation Metrics . . . 64

6.5.3 Evaluation Results . . . 64

6.6 Discussion and Analysis . . . 65

6.6.1 Effectiveness of Semantic and Dependent Relation Models . . . 65

6.6.2 Comparing Word/ Slot Relation Models . . . 65

6.7 Summary . . . 66

7 Behavior Prediction for SLU Modeling 67

7.1 Introduction . . . 67

7.2 Proposed Framework . . . 67

7.3 Data Description . . . 68

7.3.1 Mobile Application Interaction . . . 68

7.3.2 Insurance-Related Interaction . . . 70

7.4 Behavior Identification . . . 70

7.4.1 Mobile Application . . . 70

7.4.2 Insurance . . . 70

7.5 Feature Relation Model . . . 70

7.6 Behavior Relation Model . . . 71

7.7 Summary . . . 71

8 Conclusions and Future Work 73

8.1 Conclusions . . . 73

8.2 Future Work . . . 73

8.2.1 Chapter 2 . . . 73

8.2.2 Chapter 7 . . . 73

8.3 Timeline . . . 74


List of Figures

1.1 An example output of the proposed knowledge acquisition approach. . . 2

1.2 An example output of the proposed SLU modeling approach. . . 2

2.1 A sentence example in AMR Bank. . . 9

2.2 Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”. . . 10

2.3 A portion of the Freebase knowledge graph related to the movie domain. . . 10

2.4 An example of FrameNet categories for ASR output labelled by probabilistic frame-semantic parsing. . . 11

2.5 An example of AMR parsed by JAMR on ASR output. . . 12

2.6 An example of Wikification. . . 12

2.7 The CBOW and Skip-gram architectures. The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words given the target word [70]. . . 13

2.8 The target words and associated dependency-based contexts extracted from the parsed sentence for training dependency-based word embeddings. . . 15

3.1 The proposed framework for ontology induction . . . 19

3.2 An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit. . . 20

3.3 The mappings from induced slots (within blocks) to reference slots (right sides of arrows). . . 24

4.1 The proposed framework . . . 30


4.2 A simplified example of the two knowledge graphs, where a slot candidate si is represented as a node in a semantic knowledge graph and a word wj is represented as a node in a lexical knowledge graph. . . 32

4.3 The dependency parsing result on an utterance. . . 33

4.4 A simplified example of the automatically derived knowledge graph. . . 40

5.1 The relation detection examples. . . 42

5.2 The proposed framework. . . 43

5.3 An example of dependency-based contexts. . . 44

5.4 Learning curves over incremental iterations of bootstrapping. . . 52

6.1 (a): the proposed framework. (b): our matrix factorization method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, shaded circles are inferred facts. Slot induction maps observed surface patterns to semantic slot candidates. Word relation model constructs correlations between surface patterns. Slot relation model learns the slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization incorporates these models jointly, and produces a coherent, domain-specific SLU model. . . 57

7.1 (a): the proposed framework. (b): our matrix factorization method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, shaded circles are inferred facts. Behavior identification maps observed features to behaviors. Feature relation model constructs correlations between features. Behavior relation model trains the correlations between behaviors or their transitions. Predicting with matrix factorization incorporates these models jointly, and produces an SLU model that understands the users better by predicting the follow-up behaviors. . . 68

7.2 Total 13 tasks in the corpus (only pictures are shown to subjects for making requests). . . 69


List of Tables

2.1 The frame example defined in FrameNet. . . 9

3.1 The statistics of training and testing corpora . . . 24

3.2 The performance of slot induction and SLU modeling (%) . . . 25

4.1 The contexts extracted for training dependency-based word/slot embeddings from the utterance of Fig. 3.2. . . 34

4.2 The performance of induced slots and corresponding SLU models (%) . . . 38

4.3 The top inter-slot relations learned from the training set of ASR outputs. . . 40

5.1 The contexts extracted for training dependency entity embeddings in the example of Figure 5.3. . . 45

5.2 An example of three different methods in probabilistic enrichment (w = “pitt ”). . . 48

5.3 Relation detection datasets used in the experiments. . . 49

5.4 The SLU performance of all proposed approaches without bootstrapping (N = 15). . . 49

5.5 The SLU performance of all proposed approaches with bootstrapping (N = 15). . . 50

5.6 The examples of derived entity surface forms based on dependency-based entity embeddings. . . 50

6.1 The MAP of predicted slots (%); and § mean that the result is better and worse with p < 0.05, respectively. . . 64

6.2 The MAP of predicted slots using different types of relation models in MF (%); means that the result is better with p < 0.05. . . 65

7.1 The recording examples collected from some subjects. . . 69


1

Introduction

1.1 Spoken Dialogue System

A spoken dialogue system (SDS) is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently. With current technologies, a dialogue system is one of many spoken language applications that operate on limited, specific domains. For instance, the CMU Communicator system is a dialogue system for the air travel domain that provides information about flight, car, and hotel reservations [81]. Another example, the JUPITER system [104], is a dialogue system for the weather domain, which provides forecast information for a requested city.

More recently, a number of efforts in industry (e.g. Google Now1, Apple’s Siri2, Microsoft’s Cortana3, and Amazon’s Echo4) and academia [1, 8, 34, 42, 61, 63, 73, 77, 78, 82, 97, 98] have focused on developing semantic understanding techniques for building better SDSs.

An SDS is typically composed of the following components: an automatic speech recognizer (ASR), a spoken language understanding (SLU) module, a dialogue manager, a natural language generation module, and a speech synthesizer. When developing a dialogue system in a new domain, we may be able to reuse some components that are designed independently of domain-specific information, for example, the speech recognizer and the speech synthesizer.
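The component chain just described can be sketched as a minimal pipeline; every function body below is an illustrative stub (the names and the canned outputs are hypothetical, not from any actual system):

```python
# Illustrative SDS pipeline: each stage is a stub standing in for a real component.
def asr(audio):                 # automatic speech recognition: audio -> text
    return "can i have a cheap restaurant"

def slu(text):                  # spoken language understanding: text -> semantic frame
    return {"target": "restaurant", "price": "cheap"}

def dialogue_manager(frame):    # decides the next system action from the frame
    return {"action": "inform", "slots": frame}

def nlg(action):                # natural language generation: action -> text
    return "Here are some cheap restaurants."

def tts(text):                  # speech synthesis (stub: passes text through)
    return text

def run_turn(audio):
    """One dialogue turn: chain the five components in order."""
    return tts(nlg(dialogue_manager(slu(asr(audio)))))

print(run_turn(None))  # returns "Here are some cheap restaurants."
```

Only the ASR and TTS stages are largely domain-independent; the SLU, dialogue manager, and NLG stubs are exactly the parts that conventionally must be rebuilt per domain.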

However, the components that are integrated with domain-specific information have to be reconstructed for each new domain, and the cost of development is expensive. Participants usually engage in a conversation in order to achieve a specific goal, such as accomplishing a task or getting answers to their questions, for example, obtaining the list of restaurants in a specific location. Therefore, in the context of this dissertation, domain-specific information refers to the knowledge specific to a task that an SDS has to support, rather than the knowledge about general dialogue mechanisms. The dissertation mainly focuses on two parts:

• Knowledge acquisition is to learn the domain-specific knowledge that is used by a spoken language understanding (SLU) component. The domain-specific information used by an SLU component includes the organized ontology that a dialogue system has to support in order to successfully understand the actual meaning. An example of the necessary domain knowledge about restaurant recommendation is shown in Figure 1.1, where the learned domain knowledge contains the semantic slots and their relations5.

1http://www.google.com/landing/now/

2http://www.apple.com/ios/siri/

3http://www.microsoft.com/en-us/mobile/campaign-cortana/

4http://www.amazon.com/oc/echo

Figure 1.1: An example output of the proposed knowledge acquisition approach.

Figure 1.2: An example output of the proposed SLU modeling approach.

• SLU modeling is to build an SLU module that is able to understand the actual meaning of domain-specific utterances based on the domain-specific knowledge and then further provide better responses. An example of the corresponding understanding procedure in the restaurant domain is shown in Figure 1.2, where the decoded output by the SLU component is the semantic form of the input utterance.
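The decoding step in the SLU modeling example (Figure 1.2) can be sketched with a toy keyword-matching decoder; the lexicon below is a hypothetical hand-written illustration, whereas in the proposed approach such mappings would be induced automatically:

```python
# Toy SLU decoder: map surface words to slot-value pairs via a lexicon.
# This lexicon is hand-written for illustration only; the dissertation's
# goal is to learn such knowledge from unlabelled conversations.
LEXICON = {
    "cheap": ("price", "cheap"),
    "restaurant": ("target", "restaurant"),
}

def decode(utterance):
    """Return the semantic frame for an input utterance."""
    frame = {}
    for word in utterance.lower().split():
        if word in LEXICON:
            slot, value = LEXICON[word]
            frame[slot] = value
    return frame

print(decode("can i have a cheap restaurant"))
# → {'price': 'cheap', 'target': 'restaurant'}
```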

Conventionally, the domain-specific knowledge is defined manually by domain experts or developers who are familiar with the specific domains. For common domains like a weather domain or a bus domain, system developers are usually able to identify such information.

5The slot is defined as a semantic unit usually used in dialogue systems.


However, some domains, such as a military domain [7], require domain experts, which makes the knowledge engineering process more difficult. Furthermore, the experts’ decisions may be subjective and may not cover all possible real-world users’ cases [99]. The limited predefined information therefore generalizes poorly and may even bias the subsequent data collection and annotation. Another issue is efficiency: the manual definition and annotation process for domain-specific tasks can be very time-consuming and financially expensive. Finally, the maintenance cost is also non-trivial: when new conversational data comes in, developers, domain experts, and annotators have to manually analyze the audio or the transcriptions to update and expand the ontologies. With more conversational data available, recent approaches acquire the domain knowledge in a data-driven manner to improve generalization and scalability.

In the past decade, the computational linguistics community has focused on developing language processing algorithms that can leverage the vast quantities of available raw data. Chotimongkol proposed a machine learning technique to acquire domain-specific knowledge, showing the potential of reducing human effort in SDS development [24]. However, that work mainly focused on low-level semantic units like word-level concepts. With increasing high-level knowledge resources, such as knowledge bases, this dissertation moves forward to investigate the possibility of developing a high-level semantic conversation analyzer for a certain domain using an unsupervised machine learning approach. Human semantics, intents, and behaviors can be captured from a collection of unlabelled raw conversational data, and then be modeled for building a good SDS.

Considering practical usage, the acquired knowledge may be revised manually to improve the system performance. Even though some revision might be required, the cost of revision is significantly lower than the cost of analysis. Also, the automatically learned information may cover real-world users’ cases and avoid biasing the subsequent annotation. This thesis focuses on acquiring domain knowledge from the dialogues using the available ontology, and on modeling the SLU module using the automatically learned information. The proposed approach, combining both data-driven and knowledge-driven perspectives, shows the potential of improving the generalization, maintenance, efficiency, and scalability of dialogue system development.

1.2 Thesis Statement

The main purpose of this work is to automatically develop SLU components for SDSs using domain knowledge learned in an unsupervised fashion. This dissertation mainly focuses on acquiring the domain knowledge that is useful for better understanding and designing the system framework, and further on modeling the semantic meaning of the spoken language. For knowledge acquisition, there are two important stages: ontology induction and structure learning. After applying them, an organized domain knowledge is inferred from the unlabelled conversations. For SLU modeling, there are two aspects: semantic decoding and behavior prediction. Based on the learned information, semantic decoding analyzes the meaning of each individual utterance, while behavior prediction models user intents and predicts possible follow-up behaviors. In conclusion, the thesis demonstrates the feasibility of building a dialogue learning system that automatically learns the important knowledge and understands how the domains work based on unlabelled raw audio. With the domain knowledge, the initial dialogue system can then be constructed and improved quickly by interacting with users. The main contribution of the dissertation is presenting the potential of reducing human work and showing the feasibility of improving efficiency for dialogue system development by automating the knowledge learning process.

1.3 Thesis Structure

The thesis proposal is organized as follows.

• Chapter 2 - Background and Related Work

This chapter reviews background knowledge and summarizes related work. The chapter also discusses current challenges of the task, describes several structured knowledge resources, and presents distributional semantics that may benefit understanding problems.

• Chapter 3 - Ontology Induction for Knowledge Acquisition

This chapter focuses on inducing the ontology that is useful for developing SLU modules of SDSs based on the available structured knowledge resources in an unsupervised way. Some of the contributions were published [20, 22]:

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Unsupervised Induction and Filling of Semantic Slots for Spoken Dialogue Systems Using Frame-Semantic Parsing,” in Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’13), Olomouc, Czech Republic, 2013.

(Student Best Paper Award)

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction for Spoken Dialogue Systems,” in Proceedings of 2014 IEEE Workshop on Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.


• Chapter 4 - Structure Learning for Knowledge Acquisition

This chapter focuses on learning the structures, such as the inter-slot relations, for helping SLU development. Some of the contributions were published [23]:

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Jointly Modeling Inter-Slot Relations by Random Walk on Knowledge Graphs for Unsupervised Spoken Language Understanding,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’15), Denver, Colorado, USA, 2015.

• Chapter 5 - Surface Form Derivation for Knowledge Acquisition

This chapter focuses on deriving the surface forms conveying semantics for the entities from the ontology, where the derived information helps predict the probability of semantics given the observation more accurately. Some of the contributions were published [21]:

– Yun-Nung Chen, Dilek Hakkani-Tür, and Gokhan Tur, “Deriving Local Relational Surface Forms from Dependency-Based Entity Embeddings for Unsupervised Spoken Language Understanding,” in Proceedings of 2014 IEEE Workshop of Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.

• Chapter 6 - Semantic Decoding in SLU Modeling

This chapter focuses on decoding users’ spoken language into corresponding semantic forms, which is the task of SLU models. Some of the contributions are under review:

– Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky, “Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding,” under submission.

• Chapter 7 - Behavior Prediction in SLU Modeling

This chapter focuses on modeling the behaviors in the SLU component, so that the SDS is able to predict the users’ behaviors and further provide better interactions. Some of the contributions were published [18]:

– Yun-Nung Chen and Alexander I. Rudnicky, “Dynamically Supporting Unexplored Domains in Conversational Interactions by Enriching Semantics with Neural Word Embeddings,” in Proceedings of 2014 IEEE Workshop of Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.

• Chapter 8 - Conclusions and Future Work

This chapter draws conclusions and presents the proposed work and the timeline.


2

Background and Related Work

2.1 Semantic Representation

For machines to understand natural language, a semantic representation is introduced. A semantic representation for an utterance carries its core content so that the actual meaning behind the utterance can be inferred through the representation alone. For example, an utterance “show me action movies directed by james cameron” can be represented as target=“movie”, genre=“action”, director=“james cameron”. Another utterance, “find a cheap taiwanese restaurant in oakland ”, can be formed as target=“restaurant ”, price=“cheap”, type=“taiwanese”, location=“oakland ”. The semantic representations are able to convey the whole meaning of the utterances and can be more easily processed by machines. The semantic representation is not unique; there are several forms for representing the meaning.

Below we describe two types of semantic forms:

• Slot-Based Semantic Representation

The slot-based representation includes flat semantic concepts, which are usually used in simpler tasks. The above examples belong to the slot-based semantic representation, where the semantic concepts are target, location, price, etc.

• Relation-Based Semantic Representation

The relation-based representation includes structured concepts, which are usually used in tasks that have more complicated dependency relations. For instance, “show me action movies directed by james cameron” can be represented as movie.directed by, movie.genre, director.name=“james cameron”, genre.name=“action”. A semantic slot in the slot-based representation is formed as relations, which can involve either two concepts or a concept and its name.
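The two representation styles above can be sketched as plain data structures; this is a hypothetical illustration of the movie examples, not a representation from any specific system:

```python
# Slot-based: a flat mapping of semantic concepts to values.
slot_based = {
    "target": "movie",
    "genre": "action",
    "director": "james cameron",
}

# Relation-based: structured (subject, relation, object) triples,
# making the dependencies between concepts explicit.
relation_based = [
    ("movie", "movie.directed_by", "director"),
    ("movie", "movie.genre", "genre"),
    ("director", "director.name", "james cameron"),
    ("genre", "genre.name", "action"),
]

# Flat slot values can be read off the relation-based form via the
# concept-name relations:
names = {s: o for s, r, o in relation_based if r.endswith(".name")}
print(names)  # → {'director': 'james cameron', 'genre': 'action'}
```

The comprehension at the end illustrates the remark above: each flat slot corresponds to relations connecting a concept to its name.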

2.2 Spoken Language Understanding (SLU) Component

The main purpose of an SLU component is to convert natural language into semantic forms, a task also called semantic decoding or semantic parsing. Building a state-of-the-art semantic parsing system requires large amounts of annotated training data. For example, Berant et al. proposed SEMPRE1, which used web-scale knowledge bases to train the semantic parser [6]. Das et al. proposed SEMAFOR2, which utilized a lexicon developed based on the linguistic theory of Frame Semantics to train the semantic parser [29].

The SLU module involves the following challenges:

• How to define the semantic elements from the unlabelled data?

• How to understand the structure between defined elements?

• How to detect the semantics for the testing data?

• How to use the learned information to predict user behaviors for improving the system performance?

2.3 Ontology and Knowledge Base

There are two main types of knowledge resources available, generic concept and entity-based, both of which may benefit SLU modules for SDSs. The generic concept knowledge bases cover concepts that are more common, such as a food domain and a weather domain. The entity-based knowledge bases usually contain many named entities that are specific to certain domains, for example, a movie domain and a music domain. The following describes several knowledge resources, which contain rich semantics and may benefit the understanding task.

2.3.1 Generic Concept Knowledge

There are two semantic knowledge resources, FrameNet and Abstract Meaning Representation (AMR).

• FrameNet3 is a linguistically semantic resource that offers annotations of predicate- argument semantics, and associated lexical units for English [3]. FrameNet is developed based on semantic theory, Frame Semantics [38]. The theory holds that the meaning of most words can be expressed on the basis of semantic frames, which encompass three major components: frame (F), frame elements (FE), and lexical units (LU). For example, the frame “food” contains words referring to items of food. A descriptor frame element within the food frame indicates the characteristic of the food. For example, the phrase

1http://www-nlp.stanford.edu/software/sempre/

2http://www.ark.cs.cmu.edu/SEMAFOR/

3http://framenet.icsi.berkeley.edu

(29)

Frame: Revenge
Noun: revenge, vengeance, reprisal, retaliation
Verb: avenge, revenge, retaliate (against), get back (at), get even (with), pay back
Adjective: vengeful, vindictive
FE: avenger, offender, injury, injured party, punishment

Table 2.1: The frame example defined in FrameNet.

[Figure: the AMR graph of “The boy wants to go”, with instance nodes boy, want-01, and go-01; boy is the ARG0 of both want-01 and go-01, and go-01 is the ARG1 of want-01]

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
      :ARG0 b))

Figure 2.1: A sentence example in AMR Bank.

“low fat milk” should be analyzed with “milk” evoking the food frame, where “low fat” fills the descriptor FE of that frame and the word “milk” is the actual LU. A defined frame example is shown in Table 2.1.

• Abstract Meaning Representation (AMR) is a semantic representation language covering the meanings of thousands of English sentences. Each AMR is a single rooted, directed graph. AMRs include PropBank semantic roles, within-sentence coreference, named entities and types, modality, negation, questions, quantities, etc. [4]. The AMR feature structure graph of an example sentence is illustrated in Figure 2.1, where “boy” appears twice, once as the ARG0 of want-01 and once as the ARG0 of go-01.

2.3.2 Entity-Based Knowledge

• Semantic Knowledge Graph is a knowledge base that provides structured and detailed information about a topic with a list of related links. Three different knowledge graph examples, Google’s Knowledge Graph4, Microsoft’s Bing Satori, and Freebase, are shown in Figure 2.2. A semantic knowledge graph is defined by a schema and composed of nodes and edges connecting the nodes, where each node represents an entity-type and the edge between each node pair describes their relation, called a property.

An example from Freebase is shown in Figure 2.3, where nodes represent core entity-

4http://www.google.com/insidesearch/features/search/knowledge.html


Figure 2.2: Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”.

[Figure: graph fragment with entity nodes such as Avatar, Titanic, James Cameron, Kate Winslet, Drama, Canada, 1997, and “Oscar, best director”, connected by edges labelled Genre, Cast, Director, Release Year, Award, and Nationality]

Figure 2.3: A portion of the Freebase knowledge graph related to the movie domain.

types for the movie domain. The domains in the knowledge graphs span the web, from “American Football” to “Zoos and Aquariums”.

• Wikipedia5 is a free-access, free-content Internet encyclopedia that contains a large number of pages/articles, each related to a specific entity [74]. It can provide basic background knowledge that helps understanding tasks in the natural language processing (NLP) field.

5http://en.wikipedia.org/wiki/Wikipedia


I want to find some inexpensive and very fancy bars in north.
(evoked frames: desiring, becoming_aware, relational_quantity, expensiveness, degree, building, part_orientational)

Figure 2.4: An example of FrameNet categories for ASR output labelled by probabilistic frame-semantic parsing.

2.4 Knowledge-Based Semantic Analyzer

2.4.1 Generic Concept Knowledge

• FrameNet

SEMAFOR6 is a state-of-the-art semantic parser for frame-semantic parsing [27, 28].

Trained on manually annotated sentences in FrameNet, SEMAFOR is relatively accurate in predicting semantic frames, FEs, and LUs from raw text. Augmented by dual decomposition techniques in decoding, SEMAFOR also produces the semantically-labeled output in a timely manner. Note that SEMAFOR does not consider the relations between frames but treats each frame independently. Figure 3.2 shows the output of probabilistic frame-semantic parsing.

• Abstract Meaning Representation (AMR)

JAMR7 is the first semantic parser that parses sentences into AMRs [39]. Trained on the manually annotated AMR Bank, JAMR applies an algorithm for finding the maximum spanning connected subgraph and shows how to incorporate extra constraints with Lagrangian relaxation. Figure 2.5 shows the output of JAMR on an example sentence.

2.4.2 Entity-Based Knowledge

• Semantic Knowledge Graph

Freebase API8 provides programmatic access to the data, and the data can also be dumped directly.

• Wikipedia

Wikifier9 is an entity linking (a.k.a. Wikification, or Disambiguation to Wikipedia (D2W)) tool. The task is to identify concepts and entities in texts and disambiguate them into

6http://www.ark.cs.cmu.edu/SEMAFOR/

7http://github.com/jflanigan/jamr

8https://developers.google.com/freebase/

9http://cogcomp.cs.illinois.edu/page/software_view/Wikifier


show me what richard lester directed

(s / show-01
   :ARG1 (d / direct-01
      :ARG0 (p / person
         :name (n / name
            :op1 "lester"
            :op2 "richard"))))

Figure 2.5: An example of AMR parsed by JAMR on ASR output.

Michael Jordan is a machine learning expert.

Michael Jordan is my favorite player.

Figure 2.6: An example of Wikification.

the corresponding Wikipedia pages. An example is shown in Figure 2.6, where the entities “Michael Jordan” in the two sentences refer to different people, pointing to different Wikipedia pages.
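A minimal sketch of the disambiguation idea: score each candidate page by word overlap between the mention's sentence and the page's description. This is a simplified Lesk-style heuristic with invented toy descriptions, not Wikifier's actual ranking model, which uses much richer features.

```python
def disambiguate(sentence, candidate_pages):
    """Return the candidate page whose description shares the most words
    with the sentence (bag-of-words overlap; punctuation stripped)."""
    ctx = {w.strip(".,").lower() for w in sentence.split()}

    def overlap(page):
        desc = {w.strip(".,").lower() for w in candidate_pages[page].split()}
        return len(ctx & desc)

    return max(candidate_pages, key=overlap)

# Toy candidate descriptions (assumed for illustration, not real page text).
pages = {
    "Michael_Jordan_(basketball)": "American basketball player",
    "Michael_I._Jordan": "machine learning researcher and professor",
}
```

On the two example sentences, the overlap with "machine learning" versus "basketball" resolves the ambiguous mention to different pages.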

2.5 Distributional Semantics

The distributional view of semantics hypothesizes that words occurring in the same contexts may have similar meanings [47]. As the foundation for modern statistical semantics [40], an early success that implements this distributional theory is Latent Semantic Analysis [32].

Recently, with the advance of deep learning techniques, the continuous representations as word embeddings have further boosted the state-of-the-art results in many applications, such as sentiment analysis [86], language modeling [68], sentence completion [70], and relation detection [21].

In NLP, Brown et al. proposed an early hierarchical clustering algorithm that extracts word clusters from large corpora [14], which has been used successfully in many NLP applications [67]. Compared to standard bag-of-words n-gram language models, in recent years, continuous word embeddings (a.k.a. word representations, or neural embeddings) have been shown


[Figure: both architectures have input, projection, and output layers; CBOW sums the context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} in the projection layer to predict the target w_t, while Skip-gram projects w_t to predict each surrounding word]

Figure 2.7: The CBOW and Skip-gram architectures. The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words given the target word [70].

to be the state-of-the-art in many NLP tasks, due to their rich continuous representations (e.g., vectors, or sometimes matrices and tensors) that capture the context of the target semantic unit [93, 5].

The continuous word vectors are derived from a recurrent neural network architecture [69]. The recurrent neural network language models use the context history to include long-distance information. Interestingly, the vector-space word representations learned from the language models were shown to capture syntactic and semantic regularities [72, 71]. The word relationships are characterized by vector offsets, where, in the embedded space, all pairs of words sharing a particular relation are related by the same constant offset.

The word embeddings are trained on the contexts of the target word, where the considered contexts can be linear or dependency-based described as follows.

2.5.1 Linear Word Embeddings

Typical neural embeddings use linear word contexts, where a window size is defined to produce contexts of the target words [72, 71, 70]. There are two model architectures for learning distributed word representations: the continuous bag-of-words (CBOW) model and the continuous skip-gram model, where the former predicts the current word based on the context and the latter predicts surrounding words given the current word.


2.5.1.1 Continuous Bag-of-Words (CBOW) Model

The word representations are learned by a neural network language model, as illustrated in Figure 2.7 [70]. The architecture contains an input layer, a projection layer, and an output layer, with the corresponding weight matrices. Given a word sequence w_1, ..., w_T, the objective of the model is to maximize the probability of observing the target word w_t given its contexts w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c}, where c is the window size:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq i \leq c,\, i \neq 0} \log p(w_t \mid w_{t+i}). \qquad (2.1)$$

2.5.1.2 Continuous Skip-Gram Model

The training objective of the skip-gram model is to find word representations that are useful for predicting the surrounding words, which is similar to the CBOW architecture. Given a word sequence as the training data w1, ..., wT, the objective function of the model is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq i \leq c,\, i \neq 0} \log p(w_{t+i} \mid w_t). \qquad (2.2)$$

The objective can be trained using stochastic-gradient updates over the observed corpus.
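To make Eqs. (2.1) and (2.2) concrete, the sketch below transcribes both objectives directly. The uniform probability model is a stand-in for illustration only; in practice p(· | ·) is a softmax over the learned embeddings.

```python
import math

def cbow_objective(corpus, prob, c=2):
    """Eq. (2.1): average log-likelihood of each target word given its
    context words, (1/T) * sum_t sum_{-c<=i<=c, i!=0} log p(w_t | w_{t+i})."""
    T = len(corpus)
    total = 0.0
    for t in range(T):
        for i in range(-c, c + 1):
            if i == 0 or not (0 <= t + i < T):
                continue  # skip the target itself and out-of-range positions
            total += math.log(prob(corpus[t], corpus[t + i]))
    return total / T

def skipgram_objective(corpus, prob, c=2):
    """Eq. (2.2): the mirror image, predicting each surrounding word
    from the target, (1/T) * sum_t sum_i log p(w_{t+i} | w_t)."""
    T = len(corpus)
    total = 0.0
    for t in range(T):
        for i in range(-c, c + 1):
            if i == 0 or not (0 <= t + i < T):
                continue
            total += math.log(prob(corpus[t + i], corpus[t]))
    return total / T

# Toy stand-in for p(w | w'): uniform over a small vocabulary (assumption).
vocab = {"can", "i", "have", "a", "cheap", "restaurant"}
uniform = lambda w, ctx: 1.0 / len(vocab)
sentence = ["can", "i", "have", "a", "cheap", "restaurant"]
```

Under a uniform model the two objectives coincide exactly, since both sum the same number of context-target pairs; a trained model would score them differently.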

2.5.2 Dependency-Based Word Embeddings

Most neural embeddings use linear bag-of-words contexts, where a window size is defined to produce contexts of the target words [72, 71, 70]. However, some important contexts may be missing with smaller windows, while larger windows capture broad topical content. A dependency-based embedding approach was proposed to derive contexts from the syntactic relations a word participates in, where the resulting embeddings are less topical but offer more functional similarity than the original window-based embeddings [64].

Figure 2.8 shows the extracted dependency-based contexts for each target word from the dependency-parsed sentence, where headwords and their dependents can form the contexts by following the arc on a word in the dependency tree, and −1 denotes the directionality of the dependency. After replacing original bag-of-words contexts with dependency-based contexts, we can train dependency-based embeddings for all target words [100, 10, 11].
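The context extraction described above can be sketched as follows. The arcs are hand-written for the example sentence and reflect our reading of the parse in Figure 2.8 (a stand-in for an actual dependency parser).

```python
def dependency_contexts(triples):
    """Given (head, relation, dependent) arcs, emit the dependency-based
    contexts of each word: the head receives dependent/relation, and the
    dependent receives head/relation-1, where -1 marks the inverted arc."""
    contexts = {}
    for head, rel, dep in triples:
        contexts.setdefault(head, []).append(f"{dep}/{rel}")
        contexts.setdefault(dep, []).append(f"{head}/{rel}-1")
    return contexts

# Arcs for "can i have a cheap restaurant" as we read Figure 2.8.
arcs = [
    ("can", "ccomp", "have"),
    ("have", "nsubj", "i"),
    ("have", "dobj", "restaurant"),
    ("restaurant", "amod", "cheap"),
    ("restaurant", "det", "a"),
]
ctx = dependency_contexts(arcs)
```

For example, "have" collects can/ccomp-1, i/nsubj, and restaurant/dobj, matching the table in Figure 2.8; these (word, context) pairs then replace bag-of-words contexts when training the embeddings.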

For training dependency-based word embeddings, each target x is associated with a vector


Word        Contexts
can         have/ccomp
i           have/nsubj-1
have        can/ccomp-1, i/nsubj, restaurant/dobj
a           restaurant/det-1
cheap       restaurant/amod-1
restaurant  have/dobj-1, cheap/amod

[Dependency parse of “can i have a cheap restaurant” with arcs ccomp, nsubj, dobj, amod, and det]

Figure 2.8: The target words and associated dependency-based contexts extracted from the parsed sentence for training dependency-based word embeddings.

v_x ∈ R^d and each context c is represented as a context vector v_c ∈ R^d, where d is the embedding dimensionality. We learn vector representations for both targets and contexts such that the dot product v_x · v_c associated with “good” target-context pairs (x, c) belonging to the training data D is maximized, leading to the objective function:

$$\arg\max_{v_x, v_c} \sum_{(x, c) \in D} \log \frac{1}{1 + \exp(-v_c \cdot v_x)}, \qquad (2.3)$$

which can be trained using stochastic-gradient updates [64]. We thus expect the syntactic contexts to yield more focused embeddings, capturing more functional and less topical similarity.
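A bare-bones sketch of optimizing Eq. (2.3) with stochastic-gradient ascent on observed target-context pairs. Negative sampling over corrupted pairs, which the full method also uses, is omitted for brevity, and the toy vectors are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_objective(pairs, vec_x, vec_c):
    """Eq. (2.3): sum of log(1 / (1 + exp(-v_c . v_x))) over observed
    (target, context) pairs."""
    total = 0.0
    for x, c in pairs:
        dot = sum(a * b for a, b in zip(vec_x[x], vec_c[c]))
        total += math.log(sigmoid(dot))
    return total

def sgd_step(pairs, vec_x, vec_c, lr=0.1):
    """One stochastic-gradient pass; the gradient of log sigmoid(v_c . v_x)
    with respect to each vector is (1 - sigmoid(v_c . v_x)) times the
    other vector."""
    for x, c in pairs:
        dot = sum(a * b for a, b in zip(vec_x[x], vec_c[c]))
        g = 1.0 - sigmoid(dot)
        vx, vc = vec_x[x], vec_c[c]
        for d in range(len(vx)):
            # simultaneous update: RHS uses the pre-update values
            vx[d], vc[d] = vx[d] + lr * g * vc[d], vc[d] + lr * g * vx[d]

# Toy data: one target-context pair with small 2-d vectors (assumption).
pairs = [("cheap", "restaurant/amod-1")]
vec_x = {"cheap": [0.1, -0.2]}
vec_c = {"restaurant/amod-1": [0.3, 0.1]}
```

Each gradient step pushes the target and context vectors toward a larger dot product, so the objective increases monotonically for a small enough learning rate.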


3 Ontology Induction for Knowledge Acquisition

When building a dialogue system, a domain-specific knowledge base is required. To acquire the domain knowledge, this chapter focuses on automatically extracting the domain-specific concepts that can be used for SDSs.

3.1 Introduction

The distributional view of semantics hypothesizes that words occurring in the same contexts may have similar meanings [47]. Recently, with the advance of deep learning techniques, continuous representations such as word embeddings have further boosted state-of-the-art results in many applications. Frame semantics, on the other hand, is a linguistic theory that defines meaning as a coherent structure of related concepts [37]. Although there have been some successful applications in natural language processing (NLP) [51, 26, 48], frame semantics theory has not been explored in the speech community.

This section focuses on using probabilistic frame-semantic parsing to automatically induce and adapt the semantic ontology for designing SDSs in an unsupervised fashion [20], alleviating some of the challenging problems in developing and maintaining SLU-based interactive systems [96]. Compared to the traditional approach, where domain experts and developers manually define the semantic ontology for an SDS, the proposed approach reduces annotation costs, avoids human-induced bias, and lowers maintenance costs [20].

Given unlabeled raw audio files, we investigate an unsupervised approach for automatic induction of semantic slots, the basic semantic units used in SDSs. To do this, we use a state-of-the-art probabilistic frame-semantic parsing approach [27], and apply an unsupervised process to adapt, rerank, and map the generic FrameNet-style semantic parses to the target semantic space suitable for domain-specific conversation settings [3]. We utilize continuous word embeddings trained on very large external corpora (e.g. Google News) to improve the adaptation process. To evaluate the performance of our approach, we compare the automatically induced semantic slots with the reference slots created by domain experts.

Empirical experiments show that the slot creation results generated by our approach align


well with those of domain experts. The main contributions of this work are three-fold:

• We exploit continuous-valued word embeddings for unsupervised SLU;

• We propose the first approach of combining distributional and frame semantics for inducing semantic ontology from unlabeled speech data;

• We show that this synergized method yields state-of-the-art performance.

3.2 Related Work

The idea of leveraging external semantic resources for unsupervised SLU was popularized by the work of Heck and Hakkani-Tür, and Tur et al. [49, 91]. The former exploited the Semantic Web for the intent detection problem in SLU, showing that the results obtained from the unsupervised training process align well with the performance of traditional supervised learning [49]. The latter used search queries and obtained promising results on the slot filling task in the movie domain [91]. Following the success of the above applications, recent studies have also obtained interesting results on the tasks of relation detection [45], entity extraction [95], and extending domain coverage [35]. The major difference between our work and previous studies is that, instead of leveraging the discrete representations of Bing search queries or the Semantic Web, we build our model on top of the recent success of deep learning: we utilize continuous-valued word embeddings trained on Google News to induce the semantic ontology for task-oriented SDSs.

Our approach is clearly relevant to recent studies on deep learning for SLU. Tur et al. have shown that deep convex networks are effective for building better semantic utterance classification systems [90]. Following their success, Deng et al. have further demonstrated the effectiveness of applying the kernel trick to build better deep convex networks for SLU [33].

To the best of our knowledge, our work is the first study that combines the distributional view of meaning from the deep learning community, and the linguistic frame semantic view for improved SLU.


[Figure: the proposed pipeline: an unlabeled collection is processed by frame-semantic parsing into slot candidates (knowledge acquisition); a slot ranking model produces the induced slots, and semantic decoder training yields an SLU model that maps an utterance such as “can I have a cheap restaurant” to its semantic representation]

Figure 3.1: The proposed framework for ontology induction.

3.3 The Proposed Framework

The main motivation of this work is to use a FrameNet-trained statistical probabilistic semantic parser to generate initial frame-semantic parses from ASR decodings of the raw audio conversation files, and then to adapt the FrameNet-style frame-semantic parses to the semantic slots in the target semantic space, so that they can be used practically in SDSs. The semantic mapping and adaptation are formulated as a ranking problem, where the domain-specific slots should be ranked higher than the generic ones. This thesis proposes the use of unsupervised clustering methods to differentiate the generic semantic concepts from the target semantic space for task-oriented dialogue systems [20]. Also, considering that a model trained only on the small set of in-domain conversations may not be robust enough and its performance would be easily influenced by the data, this thesis proposes a radical extension: we aim to improve the semantic adaptation process by leveraging distributed word representations trained on very large external datasets [72, 71].

3.3.1 Probabilistic Semantic Parsing

In our approach, we parse all ASR-decoded utterances in our corpus using SEMAFOR, introduced in Sections 2.3 and 2.4 of Chapter 2, a state-of-the-art semantic parser for frame-semantic parsing [27, 28], and extract all frames from the semantic parsing results as slot candidates, where the LUs that correspond to the frames are extracted for slot filling. For example, Figure 3.2 shows an ASR-decoded text output parsed by SEMAFOR. SEMAFOR generates three frames (capability, expensiveness, and locale_by_use) for the utterance, which we consider as slot candidates. Note that for each slot candidate, SEMAFOR also includes the corresponding lexical unit (can i, cheap, and restaurant), which we consider as possible slot-fillers.

Since SEMAFOR was trained on FrameNet annotation, which has a more generic frame-semantic context, not all of the frames from the parsing results can be used as actual


can i have a cheap restaurant
(Frame: capability, FT LU: can, FE filler: i | Frame: expensiveness, FT LU: cheap | Frame: locale_by_use, FT/FE LU: restaurant)

Figure 3.2: An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit.

slots in domain-specific dialogue systems. For instance, in Figure 3.2, we see that the frames “expensiveness” and “locale_by_use” are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the “capability” frame does not convey particularly valuable information for SLU. To address this issue, we compute the prominence of these slot candidates, use a slot ranking model to rank the most important slots, and then generate a list of induced slots for use in domain-specific dialogue systems.

3.3.2 Independent Semantic Decoder

With the output semantic parses, we extract the 50 most frequent frames as our slot candidates for training the SLU model. The training features are generated from word confusion networks, whose features are shown to be useful for developing more robust SLU systems [43]. We build a vector representation of an utterance as u = [x_1, ..., x_j, ...].

$$x_j = E[C_u(\text{n-gram}_j)]^{1/|\text{n-gram}_j|}, \qquad (3.1)$$

where C_u(n-gram_j) counts how many times n-gram_j occurs in the utterance u, E[C_u(n-gram_j)] is the expected frequency of n-gram_j in u, and |n-gram_j| is the number of words in n-gram_j.

For each slot candidate s_i, we generate pseudo training data D_i to train a binary classifier M_i for predicting the existence of s_i given an utterance: D_i = {(u_k, l_{ik})}_{k=1}^{K}, where u_k is the vector representation of the k-th utterance, l_{ik} = +1 when the utterance u_k contains the slot candidate s_i in its semantic parse, l_{ik} = −1 otherwise, and K is the number of utterances.
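The feature construction of Eq. (3.1) and the pseudo-label generation can be sketched as follows. For a 1-best transcript with no confusion network, the expected count reduces to a plain count, an assumption made here for simplicity.

```python
def ngram_features(words, n_max=2):
    """Eq. (3.1) for a 1-best hypothesis: x_j = count(ngram_j) ** (1/|ngram_j|),
    using the plain count in place of the confusion-network expectation."""
    counts = {}
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return {g: c ** (1.0 / len(g.split())) for g, c in counts.items()}

def pseudo_labels(parsed_utts, slot):
    """Label an utterance +1 when the slot candidate appears in its
    frame-semantic parse, -1 otherwise."""
    return [(u, 1 if slot in frames else -1) for u, frames in parsed_utts]
```

Running both over parsed utterances yields, for each slot candidate, a self-labeled training set that any off-the-shelf binary classifier can consume.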

3.3.3 Adaptation Process and SLU Model

The generic concepts should be distinguished from the domain-specific concepts in the adaptation process. With the trained independent semantic decoders for all slot candidates, the adaptation process computes the prominence of slot candidates for ranking and then selects a list of induced slots, together with their corresponding semantic decoders, for use in domain-specific


dialogue systems. Then, with each induced slot s_i and its corresponding trained semantic decoder M_i, an SLU model can be built to predict whether the semantic slot occurs in a given utterance in a fully unsupervised way. In other words, the SLU model is able to transform a testing utterance into semantic representations without human involvement. The details of the adaptation are described in the following section.
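Once the induced slots and their decoders are fixed, the prediction step is simple: an utterance is tagged with every slot whose decoder fires. The decoder interface below, a callable returning +1/-1, is a stand-in; the thesis does not fix a particular classifier type, and the example decoders are hypothetical.

```python
def slu_predict(utterance_feats, decoders):
    """Return the set of induced slots predicted present in the utterance.
    `decoders` maps slot name -> trained binary classifier M_i."""
    return {slot for slot, m in decoders.items() if m(utterance_feats) == 1}

# Hypothetical decoders keyed on a single n-gram feature, for illustration.
decoders = {
    "expensiveness": lambda feats: 1 if feats.get("cheap", 0) > 0 else -1,
    "locale_by_use": lambda feats: 1 if feats.get("restaurant", 0) > 0 else -1,
}
```

Applying this to the feature vector of "can i have a cheap restaurant" would yield the semantic representation {expensiveness, locale_by_use} with no human-annotated training data involved.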

3.4 Slot Ranking Model

The purpose of the ranking model is to distinguish between generic semantic concepts and domain-specific concepts that are relevant to an SDS. To induce meaningful slots for the purpose of SDS, we compute the prominence of the slot candidates using a slot ranking model described below.

With the semantic parses from SEMAFOR, the model ranks the slot candidates by integrating two scores [20, 22]: (1) the normalized frequency of each slot candidate in the corpus, since slots with higher frequency may be more important; and (2) the coherence of the slot-fillers corresponding to the slot. Assuming that domain-specific concepts focus on fewer topics and their slot-fillers are similar to each other, the coherence of the corresponding slot-fillers can help measure the prominence of the slots.

$$w(s) = (1 - \alpha) \cdot \log f(s) + \alpha \cdot \log h(s), \qquad (3.2)$$

where w(s) is the ranking weight for the slot candidate s, f(s) is its normalized frequency from semantic parsing, h(s) is its coherence measure, and α is a weighting parameter in the interval [0, 1] that balances frequency and coherence.

For each slot s, we have the set of corresponding slot-fillers, V (s), constructed from the utterances including the slot s in the parsing results. The coherence measure of the slot s, h(s), is computed as the average pair-wise similarity of slot-fillers to evaluate if slot s corresponds to centralized or scattered topics.

$$h(s) = \frac{\sum_{x_a, x_b \in V(s)} \text{Sim}(x_a, x_b)}{|V(s)|^2}, \qquad (3.3)$$

where V(s) is the set of slot-fillers corresponding to slot s, |V(s)| is the size of the set, and Sim(x_a, x_b) is the similarity between the slot-filler pair x_a and x_b. A slot s with higher h(s) usually focuses on fewer topics, which is more specific and thus more likely to be a slot for the dialogue system. The details of the similarity measure are introduced in the following section.
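Eqs. (3.2) and (3.3) translate directly into a ranking routine. The similarity function is passed in as a parameter, since Section 3.5 defines it via word representations; the first-letter similarity used in the usage note below is a toy stand-in, not the actual measure.

```python
import math

def coherence(fillers, sim):
    """Eq. (3.3): average pairwise similarity over all |V(s)|^2 ordered
    filler pairs (self-pairs included)."""
    return sum(sim(a, b) for a in fillers for b in fillers) / len(fillers) ** 2

def rank_slots(slot_fillers, freq, sim, alpha=0.2):
    """Eq. (3.2): w(s) = (1 - alpha) * log f(s) + alpha * log h(s);
    returns slot candidates sorted by descending ranking weight."""
    scores = {
        s: (1 - alpha) * math.log(freq[s]) + alpha * math.log(coherence(v, sim))
        for s, v in slot_fillers.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With two equally frequent candidates, the one whose fillers are mutually similar (higher coherence) is ranked first, which is exactly the behavior the weighting in Eq. (3.2) is meant to produce.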


3.5 Word Representations for Similarity Measure

To capture the semantics of each word, we transform each token x into a corresponding vector x by the methods described below. Given that word representations can capture semantic meanings, the topical similarity between each slot-filler pair x_a and x_b can be computed as

$$\text{Sim}(x_a, x_b) = \frac{x_a \cdot x_b}{\lVert x_a \rVert \, \lVert x_b \rVert}. \qquad (3.4)$$

We assume that words occurring in similar domains have similar word representations, and thus Sim(xa, xb) will be larger when xa and xb are semantically related. To build the word representations, we consider two techniques, in-domain clustering vectors and external word vectors.

3.5.1 In-Domain Clustering Vectors

The in-domain data is used to cluster words using the Brown hierarchical word clustering algorithm [14, 67]. For each word x, we construct a vector x = [c_1, c_2, ..., c_K], where c_i = 1 when the word x is clustered into the i-th cluster, c_i = 0 otherwise, and K is the number of clusters. The assumption is that topically similar words may be clustered together since they occur in the same contexts more frequently. Therefore, cluster-based vectors that carry such information can help measure similarity between words.
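With one-hot cluster vectors, the cosine similarity of Eq. (3.4) collapses to exact cluster match: 1 if two words share a Brown cluster, 0 otherwise. A small sketch, with cluster assignments invented for illustration:

```python
import math

def cluster_vector(word, word2cluster, K):
    """One-hot indicator vector over K Brown clusters."""
    v = [0.0] * K
    v[word2cluster[word]] = 1.0
    return v

def cosine(a, b):
    """Eq. (3.4): Sim(x_a, x_b) = (x_a . x_b) / (||x_a|| * ||x_b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical cluster assignments (not from a real Brown clustering run).
word2cluster = {"cheap": 0, "inexpensive": 0, "restaurant": 1}
```

This binary behavior is why the dense external word vectors of the next subsection can provide a finer-grained similarity signal than in-domain cluster vectors.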

3.5.2 External Word Vectors

Section 2.5 of Chapter 2 introduces distributional semantics, and many studies have utilized semantically rich continuous word representations to benefit NLP tasks.

Considering that this distributional semantic theory may benefit our SLU task, we leverage word representations trained on large external data to differentiate semantic concepts. The rationale behind applying distributional semantic theory to our task is straightforward: because spoken language is a very distinct genre compared to the written language on which FrameNet is constructed, it is necessary to borrow external word representations to help bridge these two data sources for the unsupervised adaptation process. More specifically, to better adapt the FrameNet-style parses to the target task-oriented SDS domain, we make use of continuous word vectors derived from a recurrent neural network architecture [69]. The learned word embeddings are able to capture both syntactic and semantic relations [72, 71], which provide more robust relatedness information between words and may help distinguish domain-specific information from generic concepts.


Considering that continuous-space word representations may capture more robust topical information [72], we leverage word embeddings trained on a large external dataset to incorporate the distributional semantics of slot-fillers. That is, the word vectors are built from their word embeddings, and the learning process is introduced in Section 2.5 of Chapter 2. The external word vectors rely on the quality of the pre-trained word representations, and higher embedding dimensionality results in more accurate performance but greater complexity.

3.6 Experiments

To evaluate the effectiveness of our induced slots, we performed two evaluations. First, we examine the slot induction accuracy by comparing the ranked list of slots induced by frame-semantic parsing with the reference slots created by developers of the corresponding system [101]. Second, based on the ranked list of induced slots, we train a semantic decoder for each slot to build an SLU component, and then evaluate the performance of our SLU model by comparing against the human-annotated semantic forms. For the experiments, we evaluate both on ASR transcripts of the raw audio and on the manual transcripts.

3.6.1 Experimental Setup

In this experiment, we used the Cambridge University SLU corpus, previously used in several other SLU tasks [52, 19]. The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple SDSs in an in-car setting. There were multiple recording settings: 1) a stopped car with the air conditioning on and off; 2) a driving condition; and 3) a car simulator. The distribution of each condition in this corpus is uniform. The corpus contains a total of 2,166 dialogues and 15,453 utterances, separated into training and testing parts as shown in Table 3.1. The training part is used for self-training the SLU model.

The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1,868. An ASR system was used to transcribe the speech; the word error rate was reported as 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type. The parameter α in (3.2) can be set empirically; we use α = 0.2, N = 100 for all experiments.

To include distributional semantics information, we use the distributed vectors trained on 10^9 words from Google News1. Training was performed using the continuous bag-of-words architecture, which predicts the current word based on the context, with sub-sampling using

1https://code.google.com/p/word2vec/


                     Train     Test      Total
Dialogue             1,522     644       2,166
Utterance            10,571    4,882     15,453
Male : Female        28 : 31   15 : 15   43 : 46
Native : Non-Native  33 : 26   21 : 9    54 : 47
Avg. #Slot           0.959     0.952     0.957

Table 3.1: The statistics of training and testing corpora

[Figure: induced slots such as speak_on_topic, part_orientational, direction, locale, part_inner_outer, food, origin, contacting, sending, commerce scenario, expensiveness, range, seeking, desiring, locating, locale_by_use, and building, mapped to the reference slots addr, area, food, phone, postcode, price range, task, and type]

Figure 3.3: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

threshold 1 × 10^−5, and with negative sampling using 3 negative examples per positive one. The resulting vectors have dimensionality 300 and a vocabulary size of 3 × 10^6; the entities contain both words and automatically derived phrases. This dataset provides a larger vocabulary and better coverage.

3.6.2 Evaluation Metrics

To eliminate the influence of threshold selection when choosing induced slots, the following metrics take the whole ranking list into account and evaluate the performance in a way that is independent of the selected threshold.

3.6.2.1 Slot Induction

To evaluate the accuracy of the induced slots, we measure their quality as the proximity between induced slots and reference slots. Figure 3.3 shows the mappings that indicate semantically related induced slots and reference slots [20]. For example, “expensiveness → price”, “food → food”, and “direction → area” show that these induced slots can be mapped to the reference slots defined by experts and carry important semantics in the target domain for developing the task-oriented SDS. Note that two slots, name and signature, do not have proper mappings, because they are too specific to the restaurant domain, where name records the name of a restaurant and signature refers to signature dishes. This means that


                           ASR                                Manual
                   Slot Induction   SLU Model       Slot Induction   SLU Model
Approach           AP      AUC      WAP     AF      AP      AUC      WAP     AF
Frequency (α = 0)  56.69   54.67    35.82   43.28   53.01   50.80    36.78   44.20
In-Domain          60.06   58.02    34.39   43.28   59.96   57.89    39.84   44.99
External           71.70   70.35    44.51   45.24   74.41   73.57    50.48   73.57
Max RI (%)         +26.5   +28.7    +24.3   +4.5    +40.4   +44.8    +37.2   +66.4

Table 3.2: The performance of slot induction and SLU modeling (%)

an 80% recall is achieved by our approach, because we consider all output frames as slot candidates.

Since we define the adaptation task as a ranking problem, with a ranked list of induced slots and their associated scores, we can use the standard average precision (AP) as our metric, where an induced slot is counted as correct when it has a mapping to a reference slot. For a ranked list of induced slots l = s_1, ..., s_k, ..., where s_k is the induced slot ranked at the k-th position, the average precision is

$$\text{AP}(l) = \frac{\sum_{k=1}^{n} P(k) \times \mathbb{1}[s_k \text{ has a mapping to a reference slot}]}{\text{number of induced slots with a mapping}}, \qquad (3.5)$$

where P(k) is the precision at cut-off k in the list and 1[·] is an indicator function equaling 1 if the k-th ranked induced slot s_k has a mapping to a reference slot, and 0 otherwise. Since the slots generated by our method cover only 80% of the reference slots, the oracle recall is 80%.

Therefore, average precision is a proper metric for the slot ranking problem; it is also an approximation of the area under the precision-recall curve (AUC) [12].
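Eq. (3.5) can be sketched directly. Note that the denominator counts the mappable slots in the full ranked list, so the score is computed after ranking all candidates.

```python
def average_precision(ranked, has_mapping):
    """Eq. (3.5): precision at each rank that holds a mappable slot,
    averaged over the total number of mappable induced slots."""
    hits, total = 0, 0.0
    for k, slot in enumerate(ranked, start=1):
        if has_mapping(slot):
            hits += 1
            total += hits / k  # P(k) = mappable slots so far / rank
    return total / hits if hits else 0.0
```

For a ranked list where the 1st and 3rd slots are mappable, AP = (1/1 + 2/3) / 2, rewarding rankings that place mappable slots near the top.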

3.6.2.2 SLU Model

While semantic slot induction is essential for providing semantic categories and imposing semantic constraints, we are also interested in understanding the performance of our unsupervised SLU models. For each induced slot with a mapping to a reference slot, we can compute the F-measure of the corresponding semantic decoder, and weight the average precision with the corresponding F-measure as the weighted average precision (WAP) to evaluate the performance of the slot induction and SLU tasks together. This metric scores a ranking result higher if the induced slots corresponding to better semantic decoders are ranked higher. Another metric is the average F-measure (AF), the average micro-F of the SLU models at all cut-off positions in the ranked list. Compared to WAP, AF additionally considers the slot popularity in the dataset.
