Unsupervised Learning and Modeling of Knowledge and Intent for Spoken Dialogue Systems


Yun-Nung (Vivian) Chen

CMU-LTI-15-018

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213

www.lti.cs.cmu.edu

Thesis Committee:

Dr. Alexander I. Rudnicky (chair), Carnegie Mellon University
Dr. Anatole Gershman (co-chair), Carnegie Mellon University
Dr. Alan W Black, Carnegie Mellon University
Dr. Dilek Hakkani-Tür, Microsoft Research

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies


To my beloved parents and family,


Acknowledgements

“

If you have someone in your life that you are grateful for - someone to whom you want to write another heartfelt, slanted, misspelled thank you note - do it. Tell them they made you feel loved and supported. That they made you feel like you belonged somewhere and that you were not a freak.

Tell them all of that. Tell them today.

”

Lisa Jakub, Canadian-American writer and actress

There is a long list of exceptional people to whom I wish to express my gratitude, who have supported and assisted me during my Ph.D. journey. I want to start with my excellent thesis advisor, Prof. Alexander I. Rudnicky, who gave me the freedom to investigate and explore this field, guided me toward future directions, and not only appreciated but also supported my ideas. Without Alex I might not have qualified as a good researcher. I also thank all committee members: Prof. Anatole Gershman, Prof. Alan W Black, and Dr. Dilek Hakkani-Tür. Thanks to their diverse areas of expertise and insightful advice, I was able to apply my experiments to real-world data, which makes my research work more valuable and more practical.

During my stay at CMU, I was very fortunate to work on different projects and collaborate with several faculty, including Prof. Jack Mostow, Prof. Florian Metze, Dr. Kai-Min Chang, Dr. Sunjing Lee, etc. As a student at CMU, I had chances to work and collaborate with people from the same group: Ming Sun, Matthew Marge, Aasish Pappu, Seshadri Sridharan, and Justin Chiu, and different amazing colleagues in LTI: William Y. Wang, Ting-Hao (Kenneth) Huang, Lingpeng Kong, Swabha Swayamdipta, Sujay Kumar Jauhar, Sunayana Sitaram, Sukhada Palkar, Alok Parlikar. I was very happy to know and interact with a lot of great people from CMU LTI: Shoou-I Yu, Yanchuan Sim, Miaomiao Wen, Yi-Chia Wang, Troy Hua, Duo Ding, Zi Yang, Chenyan Xiong, Diyi Yang, Di Wang, Hyeju Jang, Yanbo Xu, Wei Chen, Yajie Miao, Ran Zhou, Chu-Cheng Lin, Yu-Hsin Kuo, Sz-Rung Shiang, Po Yao Huang, Ting-Yao Hu, Han Lu, Yiu-Chang Lin, Wei-Cheng Chang, Joseph Chang, and Zhou Yu.

I am thankful to my roommates and great friends whom I met in Pittsburgh: Yu-Ying Huang, Jackie Yang, Yi-Tsen Pan, Pei-Hsuan Lee, Yin-Chen Chang, Wan-Ru Yu, Hsien-Tang Kao, Ching-Heng Lu, Ting-Hao Chen, Chao-Lien Chen, Yu-Ting Lai, Yun-Hsuan Chu, Shannen Liu, Hung-Jing Huang, Kuang-Ching Cheng, Wei-Chun Lin, Po-Wei Chou, Chun-Liang Li, etc. I am also grateful that my great undergraduate fellow, Kerry Shih-Ping Chang, also studies at CMU, so that we are able to chat about our undergraduate life. Thanks to my best friend, Yu-Ying Lee, who stayed in Pittsburgh with me for an awesome summer.

During the years of my Ph.D., I have benefited from conferences, workshops, and internships. I am thankful to have had opportunities to discuss and interact with a lot of smart people, such as Gokhan Tur, Asli Celikyilmaz, Andreas Stolcke, Geoffrey Zweig, Larry Heck, Jason Williams, Dan Bohus, Malcolm Slaney, Omer Levy, Yun-Cheng Ju, Li Deng, Xiaodong He, Wan-Tau Yih, Xiang Li, Qi Li, Pei-Hao Su, Tsung-Hsien Wen, David Vandyke, Dipanjan Das, Matthew Henderson, etc., who helped me find my research direction during my Ph.D. life.

I also want to thank my undergraduate fellows from NTU CSIE and colleagues from NTU Speech Lab. I started my research journey there and found what I am really interested in. Everyone helped me a lot, especially my master's advisor, Prof. Lin-Shan Lee, who brought me into the area and motivated me to pursue my doctoral degree in this field.

In addition, I want to thank Luis Marujo for sharing his beautiful LaTeX template, Jung-Yu Lin for her professional English revision, and my friends for their annotation work. I am so lucky to have all of your assistance, which sped up my research and made it much better and more beautiful.

Last but not least, to my dearest Mom and Dad, to my boyfriend and also my husband, Che-An Lu, and to my big family: thank all of you for your unconditional love and support. Without all of you, none of my success would be possible.

Yun-Nung (Vivian) Chen, Pittsburgh


Abstract

Various smart devices (smartphones, smart TVs, in-car navigation systems, etc.) are incorporating spoken language interfaces, also known as spoken dialogue systems (SDS), to help users finish tasks more efficiently. The key component of a successful SDS is spoken language understanding (SLU); in order to capture language variation from dialogue participants, the SLU component must create a mapping between natural language inputs and semantic representations that correspond to users’ intentions. The semantic representation must include “concepts” and a “structure”: concepts are domain-specific topics, and the structure describes relations between concepts and conveys intention.

Most knowledge-based approaches originated from the field of artificial intelligence (AI). These methods leveraged deep semantics and relied heavily on rules and symbolic interpretations, which mapped sentences into logical forms: a context-independent representation of a sentence covering its predicates and arguments. However, most prior work focused on learning a mapping between utterances and semantic representations, where such organized concepts still remain predefined. The need for predefined structures and annotated semantic concepts results in extremely high cost and poor scalability in system development. Thus, current technology usually limits conversational interactions to a few narrow predefined domains/topics. Because the number of domains used in various devices is increasing, this dissertation aims to fill the gap by improving the generalization and scalability of building SDSs with little human effort.

In order to achieve the goal, two questions need to be addressed: 1) Given unlabeled conversations, how can a system automatically induce and organize the domain-specific concepts? 2) With the automatically acquired knowledge, how can a system understand user utterances and intents? To tackle the above problems, we propose to acquire domain knowledge that captures humans’ salient semantics, intents, and behaviors. Then, based on the acquired knowledge, we build an SLU component to understand users.

The dissertation focuses on several important aspects of the above two problems: Ontology Induction, Structure Learning, Surface Form Derivation, Semantic Decoding, and Intent Prediction. To solve the first problem of automating knowledge learning, ontology induction extracts domain-specific concepts, and then structure learning infers a meaningful organization of these concepts for SDS design. With the structured ontology, surface form derivation learns natural language variation to enrich its understanding cues. For the second problem of how to effectively understand users based on the acquired knowledge, we propose to decode users’ semantics and to predict intents about follow-up behaviors through a matrix factorization model, which outperforms other SLU models.

Furthermore, the dissertation investigates the performance of SLU modeling for human-human conversations, where two tasks are discussed: actionable item detection and iterative ontology refinement. For actionable item detection, human-machine conversations are utilized to learn intent embeddings through convolutional deep structured semantic models for estimating the probability that actionable items appear in human-human dialogues. For iterative ontology refinement, ontology induction is first performed on human-human conversations and achieves performance similar to that on human-machine conversations. The integration of actionable item estimation and ontology induction induces an improved ontology for manual transcripts. Also, the oracle estimation shows the feasibility of iterative ontology refinement and the room for further improvement.

In conclusion, the dissertation shows the feasibility of building a dialogue learning system that is able to understand how particular domains work based on unlabeled human-machine and human-human conversations. As a result, an initial SDS can be built automatically according to the learned knowledge, and its performance can be iteratively improved by interacting with users for practical usage, presenting a great potential for reducing human effort during SDS development.


Keywords

Convolutional Deep Structured Semantic Model (CDSSM)
Deep Structured Semantic Model (DSSM)
Distributional Semantics
Domain Ontology
Embeddings
Intent Modeling
Knowledge Graph
Matrix Factorization (MF)
Multimodality
Random Walk
Spoken Language Understanding (SLU)
Spoken Dialogue System (SDS)
Semantic Representation
Unsupervised Learning


List of Abbreviations

AF Average F-Measure is an evaluation metric that measures the performance of a ranking list by averaging the F-measure over all positions in the ranking list.

AMR Abstract Meaning Representation is a simple and readable semantic representation in AMR Bank.

AP Average Precision is an evaluation metric that measures the performance of a ranking list by averaging the precision over all positions in the ranking list.

ASR Automatic Speech Recognition, also known as computer speech recognition, is the process of converting the speech signal into written text.

AUC Area Under the Precision-Recall Curve is an evaluation metric that measures the performance of a ranking list by averaging the precision over a set of evenly spaced recall levels in the ranking list.

CBOW Continuous Bag-of-Words is an architecture for learning distributed word representations, which is similar to the feedforward neural net language model but uses continuous distributed representation of the context.

CDSSM Convolutional Deep Structured Semantic Model is a deep neural net model with a convolutional layer, where the objective is to maximize the similarity between semantic vectors of two associated elements.

CMU Carnegie Mellon University is a private research university in Pittsburgh.

DSSM Deep Structured Semantic Model is a deep neural net model, where the objective is to maximize the similarity between semantic vectors of two associated elements.

FE Frame Element is a descriptive vocabulary for the components of each frame.

IA Intelligent Assistant is a software agent that can perform tasks or services for an individual. These tasks or services are based on user input, location awareness, and the ability to access information from a variety of online sources.

ICSI International Computer Science Institute is an independent, non-profit research organization located in Berkeley, California, USA.


IR Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.

ISCA International Speech Communication Association is a non-profit organization that aims to promote, in an international world-wide context, activities and exchanges in all fields related to speech communication science and technology.

LTI Language Technologies Institute is a research department in the School of Computer Science at Carnegie Mellon University.

LU Lexical Unit is a word with a sense.

MAP Mean Average Precision is an evaluation metric that measures the performance of a set of ranking lists by taking the mean of the average precision over all ranking lists.

MF Matrix Factorization is a decomposition of a matrix into a product of matrices in the discipline of linear algebra.

MLR Multinomial Logistic Regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.

NLP Natural Language Processing is a field of artificial intelligence and linguistics that studies the problems intrinsic to the processing and manipulation of natural language.

POS Part of Speech tag, also known as word class, lexical class, or lexical category, is a traditional category of words intended to reflect their function within a sentence.

RDF Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.

SDS Spoken Dialogue System is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently.

SGD Stochastic Gradient Descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

SLU Spoken Language Understanding is a component of a spoken dialogue system, which parses natural language into semantic forms that benefit the system’s understanding.

SPARQL SPARQL Protocol and RDF Query Language is a semantic query language for databases, able to retrieve and manipulate data stored in RDF format.


SVM Support Vector Machine is a supervised learning method used for classification and regression based on the Structural Risk Minimization inductive principle.

WAP Weighted Average Precision is an evaluation metric that measures the performance of a ranking list by weighting the precision over all positions in the ranking list.

WOZ Wizard-of-Oz is a research experiment in which subjects interact with a computer system that subjects believe to be autonomous, but which is actually being operated or partially operated by an unseen human being.


7.3.1.2 Enriched Semantics Matrix
7.3.1.3 Intent Matrix
7.3.1.4 Integrated Model
7.3.2 Optimization Procedure
7.4 User Intent Prediction for Mobile App
7.4.1 Baseline Model for Single-Turn Requests
7.4.2 Baseline Model for Multi-Turn Interactions
7.5 Experimental Setup
7.5.1 Word Embedding
7.5.2 Retrieval Setup
7.6 Results
7.6.1 Results for Single-Turn Requests
7.6.2 Results for Multi-Turn Interactions
7.6.3 Comparing between User-Dependent and User-Independent Models
7.7 Summary

8 SLU in Human-Human Conversations

8.1 Introduction
8.2 Convolutional Deep Structured Semantic Models (CDSSM)
8.2.1 Architecture
8.2.2 Training Procedure
8.2.2.1 Predictive Model
8.2.2.2 Generative Model
8.3 Adaptation
8.3.1 Adapting CDSSM
8.3.2 Adapting Action Embeddings
8.4 Actionable Item Detection
8.4.1 Unidirectional Estimation
8.4.2 Bidirectional Estimation
8.5 Iterative Ontology Refinement
8.6 Experiments
8.6.1 Experimental Setup
8.6.2 CDSSM Training
8.6.3 Implementation Details
8.7 Evaluation Results
8.7.1 Comparing Different CDSSM Training Data
8.7.2 Effectiveness of Bidirectional Estimation
8.7.3 Effectiveness of Adaptation Techniques
8.7.4 Effectiveness of CDSSM
8.7.5 Discussion
8.8 Extensive Experiments
8.8.1 Dataset
8.8.2 Ontology Induction
8.8.3 Iterative Ontology Refinement
8.8.4 Influence of Recognition Errors
8.8.5 Balance between Frequency and Coherence
8.8.6 Effectiveness of Actionable Item Information
8.9 Summary

9 Conclusions and Future Work

9.1 Conclusions
9.2 Future Work
9.2.1 Domain Discovery
9.2.2 Large-Scaled Active Learning of SLU
9.2.3 Iterative Learning through Dialogues
9.2.4 Error Recovery for System Robustness

Appendices

A AIMU: Actionable Items in Meeting Understanding

A.1 AIMU Dataset
A.2 Semantic Intent Schema
A.2.1 Domain, Intent, and Argument Definition
A.3 Annotation Agreement
A.4 Statistical Analysis

B Insurance Dataset

B.1 Dataset
B.2 Annotation Procedure
B.3 Reference Ontology
B.3.1 Ontology Slots
B.3.2 Ontology Structure

List of Figures

1.1 An example output of the proposed knowledge acquisition approach.
1.2 An example output of the proposed SLU modeling approach.
2.1 The typical pipeline in a dialogue system.
2.2 A sentence example in AMR Bank.
2.3 Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”.
2.4 A portion of the Freebase knowledge graph related to the movie domain.
2.5 An example of FrameNet categories for an ASR output labelled by probabilistic frame-semantic parsing.
2.6 An example of AMR parsed by JAMR on an ASR output.
2.7 An example of Wikification.
2.8 The CBOW and Skip-gram architectures. The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words given the target word [117].
2.9 The target words and associated dependency-based contexts extracted from the parsed sentence for training dependency-based word embeddings.
2.10 The framework for learning paragraph vectors.
2.11 Illustration of the CDSSM architecture for the IR task.
3.1 The proposed framework for ontology induction.
3.2 An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit.

3.3 The mappings from induced slots (within blocks) to reference slots (right sides of arrows).
3.4 The performance of slot induction learned with different α values.
3.5 The performance of SLU modeling with different α values.
3.6 The performance of slot induction learned from different amounts of training data.
4.1 The proposed framework of structure learning.
4.2 A simplified example of the integration of two knowledge graphs, where a slot candidate s_i is represented as a node in a semantic knowledge graph and a word w_j is represented as a node in a lexical knowledge graph.
4.3 The dependency parsing result on an utterance.
4.4 The automatically and manually created knowledge graphs for a restaurant domain.
5.1 The relation detection examples.
5.2 The proposed framework of surface form derivation.
5.3 An example of dependency-based contexts.
5.4 Learning curves over incremental iterations of bootstrapping.
6.1 (a): The proposed framework of semantic decoding. (b): Our MF method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, and shaded circles are inferred facts. Ontology induction maps observed surface patterns to semantic slot candidates. Word relation model constructs correlations between surface patterns. Slot relation model learns slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization incorporates these models jointly, and produces a coherent and domain-specific SLU model.
7.1 Total 13 tasks in the corpus (only pictures are shown to subjects for making requests).
7.2 The dialogue example for multi-turn interaction with multiple apps.

7.3 The feature-enriched MF method completes a partially-missing matrix to factorize the low-rank matrix for implicit information modeling. Dark circles are observed facts, and shaded circles are latent and inferred facts. Reasoning with MF considers latent semantics to predict intents based on rich features corresponding to the current user utterance.
8.1 The ICSI meeting segments annotated with actionable items. The triggered intents are at the right part along with descriptions. The intent-associated arguments are labeled within texts.
8.2 The genre mismatched examples with the same action.
8.3 Illustration of the CDSSM architecture for the predictive model.
8.4 Action distribution for different types of meetings.
8.5 The average AUC distribution over all actions in the training set before and after action embedding adaptation using Match-CDSSM.
8.6 The AUC trend of actionable item detection with different thresholds δ.
8.7 The performance of the induced slots with different α values for ASR transcripts; the right one shows the detailed trends about the baseline and the proposed ones.
8.8 The performance of the induced slots with different α values for manual transcripts; the right one shows the detailed trends about the baseline and the proposed ones.
8.9 The performance of the learned structure with different α values for ASR transcripts; the right one shows the detailed trends about the baseline and the proposed ones.
8.10 The performance of the learned structure with different α values for manual transcripts; the right one shows the detailed trends about the baseline and the proposed ones.
A.1 Action distribution for different types of meetings.
B.1 The annotation interface.

List of Tables

2.1 The frame example defined in FrameNet.
3.1 The statistics of training and testing corpora.
3.2 The performance with different α tuned on a development set (%).
4.1 The contexts extracted for training dependency-based word/slot embeddings from the utterance of Figure 3.2.
4.2 The performance of induced slots and corresponding SLU models (%).
4.3 The top inter-slot relations learned from the training set of ASR outputs.
5.1 The contexts extracted for training dependency entity embeddings in the example of Figure 5.3.
5.2 An example of three different methods in the probabilistic enrichment (w = “pitt”).
5.3 Relation detection datasets used in the experiments.
5.4 The micro F-measure of the first-pass SLU performance before bootstrapping (N = 15) (%).
5.5 The micro F-measure of SLU performance with bootstrapping (N = 15) (%).
5.6 The examples of derived entity surface forms based on dependency-based entity embeddings.
6.1 The MAP of predicted slots (%); † indicates that the result is significantly better than MLR (row (b)) with p < 0.05 in t-test.
6.2 The MAP of predicted slots using different types of relation models in M_R (%); † indicates that the result is significantly better than the feature model (column (a)) with p < 0.05 in t-test.

7.1 The recording examples collected from some subjects for single-turn requests.
7.2 User intent prediction for single-turn requests on MAP using different training features (%). LM is a baseline language modeling approach that models explicit semantics.
7.3 User intent prediction for single-turn requests on P@10 using different training features (%). LM is a baseline language modeling approach that models explicit semantics.
7.4 User intent prediction for multi-turn interactions on MAP (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating using MF can significantly improve the MLR model (t-test with p < 0.05).
7.5 User intent prediction for multi-turn interactions on turn accuracy (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating using MF can significantly improve the MLR model (t-test with p < 0.05).
7.6 User intent prediction for multi-turn interactions on MAP for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05).
7.7 User intent prediction for multi-turn interaction on ACC for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05).
8.1 Actionable item detection performance on the average AUC using bidirectional estimation (%).
8.2 Actionable item detection performance on AUC (%).
8.3 Actionable item detection performance on AUC (%).
8.4 The learned ontology performance with α = 0.8 tuned from Chapter 3 (%).
A.1 The data set description.

A.2 The description of the semantic intent schema for meetings.
A.3 Annotation agreement during different settings.
A.4 The annotation statistics.
B.1 The data statistics.
B.2 The inter-rater agreements for different annotated types (%).
B.3 The detail of inter-rater agreements for actionable utterances (%).
B.4 The reference FrameNet slots associated with fillers and corresponding frequency.
B.5 The labeled slots uncovered by FrameNet.
B.6 The reference ontology structure composed of inter-slot relations (part 1).
B.7 The reference ontology structure composed of inter-slot relations (part 2).
B.8 The reference ontology structure composed of inter-slot relations (part 3).
B.9 The reference ontology structure composed of inter-slot relations (part 4).
B.10 The reference ontology structure composed of inter-slot relations (part 5).
B.11 The reference ontology structure composed of inter-slot relations (part 6).

1

Introduction

“

Computing is not about computers any more. It is about living.

”

Nicholas Negroponte, Massachusetts Institute of Technology Media Lab founder and chairman emeritus

A spoken dialogue system (SDS) is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently. Despite recent successful personal intelligent assistants (e.g. Google Now¹, Apple’s Siri², Microsoft’s Cortana³, and Amazon’s Echo⁴), spoken dialogue systems are still very brittle when confronted with out-of-domain information. The biggest challenge therefore results from limited domain knowledge. In this introductory chapter, we first introduce how an SDS is developed and articulate a research problem existing in the current development procedure. This chapter outlines the contributions this dissertation brings to address the challenges, and provides a roadmap for the rest of this document.

1.1 Introduction

Spoken language understanding (SLU) has also seen considerable advancements over the past two decades [150]. However, while language understanding remains unsolved, a variety of practical task-oriented dialogue systems have been built to operate on limited specific domains. For instance, the CMU Communicator system is a dialogue system for an air travel domain that provides information about flight, car, and hotel reservations [137]. Another example, the JUPITER system, is a dialogue system for a weather domain, which provides forecast information for the requested city [175]. More recently, a number of efforts in industry (e.g. Google Now, Apple’s Siri, Microsoft’s Cortana, Amazon’s Echo, and Facebook’s M) and academia have focused on developing semantic understanding techniques for building better SDSs [1, 14, 56, 72, 103, 107, 120, 129, 131, 138, 160, 166].

1 http://www.google.com/landing/now/

2 http://www.apple.com/ios/siri/

3 http://www.microsoft.com/en-us/mobile/campaign-cortana/

4 http://www.amazon.com/oc/echo

These systems aim to automatically identify user intents as expressed in natural language, extract associated arguments or slots, and take actions accordingly to fulfill the user’s requests. Typically, an SDS architecture is composed of the following components: an automatic speech recognizer (ASR), a spoken language understanding (SLU) module, a dialogue manager (DM), and an output manager. When developing a dialogue system in a new domain, we may be able to reuse some components that are designed independently of domain-specific information, for example, the speech recognizer. However, the components that are integrated with domain-specific information have to be reconstructed for each new domain, which makes development expensive. With a rapidly increasing number of domains, the current bottleneck of the SDS is SLU.
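The four-component architecture described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this dissertation: every name here (`SemanticFrame`, `DialogueSystem`, and the `toy_*` functions) is hypothetical, the "audio" input is simply a string standing in for a speech signal, and the keyword lexicon is invented for the restaurant example. The point is only to show which pieces can be reused across domains and which must be rebuilt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SemanticFrame:
    """Output of SLU: an intent plus slot-value pairs."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def toy_asr(audio: str) -> str:
    # Domain-independent stub: pretend the "audio" is already its transcript.
    return audio.lower().strip()


def toy_restaurant_slu(text: str) -> SemanticFrame:
    # Domain-specific stub: a hypothetical keyword lexicon for one domain.
    lexicon = {"cheap": ("price", "cheap"),
               "restaurant": ("target", "restaurant")}
    slots = dict(lexicon[w] for w in text.split() if w in lexicon)
    return SemanticFrame(intent="navigation" if slots else "unknown", slots=slots)


def toy_dm(frame: SemanticFrame) -> str:
    # Domain-specific stub: choose a system action from the semantic frame.
    return "show_results" if frame.intent == "navigation" else "ask_rephrase"


def toy_output(action: str) -> str:
    # Output manager stub: render the chosen action as a response.
    return {"show_results": "Here are some options.",
            "ask_rephrase": "Sorry, could you rephrase that?"}[action]


@dataclass
class DialogueSystem:
    asr: Callable[[str], str]
    slu: Callable[[str], SemanticFrame]
    dm: Callable[[SemanticFrame], str]
    output: Callable[[str], str]

    def respond(self, audio: str) -> str:
        # ASR -> SLU -> DM -> output manager, in that order.
        return self.output(self.dm(self.slu(self.asr(audio))))


sds = DialogueSystem(toy_asr, toy_restaurant_slu, toy_dm, toy_output)
print(sds.respond("Can I have a cheap restaurant"))
print(sds.respond("What is the weather in Pittsburgh"))
```

In this sketch, moving to a new domain would leave `toy_asr` untouched but require new SLU and DM functions, which mirrors the reuse argument made in the paragraph above.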

1.2 The Problem

The classic development process of a dialogue system involves 1) specifying system requirements, 2) designing and implementing each module in a dialogue system to meet all requirements, and 3) evaluating the implemented system. In the first step, dialogue system developers need to specify the scope of a target dialogue system (i.e. the domain that the system can support and operate in) in order to identify domain-specific concepts, a.k.a. slots, and arguments for slot filling; determine the structure of each task to specify the potential intents and associated slots for intent classification; and indicate the desired interaction between the system and a user, such as the dialogue flow for DM usage. Most prior studies focused on the implementation of each component in the SDS pipeline under the assumption that the domain-specific schema is given. However, because the number of possible domains is unbounded, identifying domain-specific information becomes a major issue during SDS development.

Conventionally, the domain-specific knowledge is manually defined by domain experts or developers. For common domains like a weather domain or a bus domain, system developers are usually able to identify such information. However, information about some niche domains (e.g. a military domain) is withheld by experts, making the knowledge engineering process more difficult [13]. Furthermore, the experts’ decisions may be subjective and may not cover all possible real-world users’ cases [168].

In the second step, implementing each component usually suffers from a common issue: data scarcity. Specifically, training the intent detectors and slot taggers of SLU requires a set of utterances labeled with task-specific intents and arguments. One simple solution is to create hand-crafted grammars so that the component can be built without any annotated data. Another solution is to simulate an environment to collect utterances, such as a Wizard-of-Oz (WOZ) method and crowd-sourcing, so that the collected data can be used to train the models [6, 76]. However, the collected data may be biased by developers’ subjectivity, because users’ perspectives of a task might not be foreseen by dialogue system developers [168].
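To make the hand-crafted grammar option concrete, the following is a minimal sketch of a rule-based slot tagger written with Python regular expressions. The slot names, the vocabulary in `SLOT_RULES`, and the function `tag_slots` are all assumptions made for illustration; they are not taken from any system discussed in this dissertation.

```python
import re
from typing import Dict

# Hypothetical hand-written slot rules for a restaurant domain.
SLOT_RULES = {
    "price": re.compile(r"\b(cheap|moderate|expensive)\b"),
    "food": re.compile(r"\b(chinese|italian|thai)\b"),
    "target": re.compile(r"\b(restaurant|bar|cafe)\b"),
}


def tag_slots(utterance: str) -> Dict[str, str]:
    """Return the first match of each slot rule found in the utterance."""
    text = utterance.lower()
    found = {}
    for slot, pattern in SLOT_RULES.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found


print(tag_slots("Can I have a cheap restaurant"))
print(tag_slots("somewhere inexpensive to eat"))
```

The first call fills the `price` and `target` slots, while the second returns an empty dictionary because the developer did not anticipate the wording "inexpensive" or "to eat". No annotated data is needed, but coverage is limited to what the rule writer foresaw, which is the kind of limitation this section describes.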

Furthermore, the poor generalization and scalability of current systems result in limited predefined information, which can even bias the subsequent data collection and annotation. Another issue is efficiency: the manual definition and annotation process for domain-specific tasks can be very time-consuming and financially costly. Finally, the maintenance cost is also non-trivial: when new conversational data comes in, developers, domain experts, and annotators have to manually analyze the audio or the transcripts to update and expand the ontologies. Identifying domain knowledge and collecting training data as well as annotations require domain experts and manual labor, resulting in high cost, long duration, and poor scalability of SDS development. These challenges (generalization, scalability, and efficiency) are the main bottlenecks in current dialogue systems.

1.3 Towards Improved Scalability, Generalization & Efficiency for SDS

Usually participants engage in a conversation in order to achieve a specific goal such as accomplishing a task or acquiring answers to questions, for example, to obtain a list of restaurants in a specific location. Therefore in the context of this dissertation, domain-specific information refers to the knowledge specific to an SDS-supported task rather than the knowledge about general dialogue mechanisms. To tackle the above problems, we aim to mine the domain-specific knowledge from unlabeled dialogues (e.g. conversations collected by a call center, recorded utterances that cannot be handled by existing systems, etc.) to construct a domain ontology, and then model the SLU component based on the acquired knowledge and unlabeled data in an unsupervised manner.

The dissertation mainly focuses on two parts:

• Knowledge acquisition is to learn the domain-specific knowledge that is used by an SLU component. The domain-specific knowledge is represented by a structured ontology, which allows SDS to support the target domain, and thus comprehend meanings.

An example of the necessary domain knowledge about restaurant recommendation is shown in Figure 1.1, where the learned domain knowledge contains semantic slots and their relations⁵. The acquired domain ontology provides an overview of a domain or multiple domains, which can guide developers for designing the schema or be directly utilized by the SLU module.

5The slot is defined as a semantic unit usually used in dialogue systems.

[Figure 1.1 diagram: an unlabelled collection of restaurant-asking conversations is turned by knowledge acquisition into organized domain knowledge; node labels: target, food, price, seeking, quantity; edge labels: PREP_FOR, NN, AMOD.]

Figure 1.1: An example output of the proposed knowledge acquisition approach.

[Figure 1.2 diagram: the SLU component, built by SLU modeling from the organized domain knowledge, maps the utterance “can i have a cheap restaurant” to price=“cheap”, target=“restaurant”, intent=navigation.]

Figure 1.2: An example output of the proposed SLU modeling approach.

• SLU modeling is to build an SLU module that is able to understand the actual meaning of domain-specific utterances based on the domain-specific knowledge and then further provide better responses. An example of the corresponding understanding procedure in a restaurant domain is shown in Figure 1.2, where the SLU component analyzes an utterance “can i have a cheap restaurant” and outputs a semantic representation including low-level slots price=“cheap” and target=“restaurant” and a high-level intent navigation.
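As a minimal sketch of the kind of output shown in Figure 1.2, the semantic representation of “can i have a cheap restaurant” can be held in a small data structure. The class name SemanticFrame and its fields are hypothetical and chosen only for illustration; the frame is written down by hand and no learned model is involved.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SemanticFrame:
    """Hypothetical container for an SLU output: one intent plus slot-value pairs."""
    utterance: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        # Render as intent(slot="value", ...) with slots in insertion order.
        args = ", ".join(f'{name}="{value}"' for name, value in self.slots.items())
        return f"{self.intent}({args})"


# The example from Figure 1.2, entered manually for illustration.
frame = SemanticFrame(
    utterance="can i have a cheap restaurant",
    intent="navigation",
    slots={"price": "cheap", "target": "restaurant"},
)
print(frame.render())  # navigation(price="cheap", target="restaurant")
```

Keeping the intent separate from the slots mirrors the distinction drawn above between the high-level intent and the low-level slots.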

With more conversational data available, recent approaches to acquiring domain knowledge are data-driven, which benefits generalization and scalability. In the past decade, the computational linguistics community has focused on developing language processing algorithms that can leverage the vast quantities of available raw data. Chotimongkol et al. proposed a machine learning technique to acquire domain-specific knowledge, showing the potential for reducing human effort in the SDS development [44]. However, the work mainly focused on the low-level semantic units like word-level concepts. With increasing high-level knowledge resources, such as FrameNet, Freebase and Wikipedia, this dissertation moves forward to investigate the possibility of developing a high-level semantic conversation analyzer for a certain domain using an unsupervised machine learning approach. Human semantics, intent, and behavior can be captured from a collection of unlabelled raw conversational data, and then be modeled for building a good SDS.

In terms of practical usage, the acquired knowledge may be manually revised to improve system performance. Even though some revision might be required, the cost of revision is significantly lower than the cost of analysis. Also, the automatically learned information may reflect real-world users’ cases and avoid biasing subsequent annotations. This thesis focuses on the highlighted parts: inducing and acquiring domain knowledge from the dialogues using available resources, and modeling an SLU module using the automatically acquired information.

The proposed approach combining both data-driven and knowledge-driven perspectives shows the potential for improving generalization, maintenance, efficiency, and scalability of dialogue system development.

1.4 Thesis Statement

The main purpose of this work is to automatically develop an SLU module for SDS by utilizing the automatically learned domain knowledge in an unsupervised fashion. This dissertation mainly focuses on acquiring the domain knowledge that is useful for better understanding and designing the system framework and further modeling the semantic meaning of the spoken language. For knowledge acquisition, there are two important stages: ontology induction and structure learning. After applying them, organized domain knowledge is inferred from unlabeled conversations. For SLU modeling, there are two aspects: semantic decoding and intent prediction. Based on the acquired ontology, semantic decoding analyzes the semantic meaning in each individual utterance and intent prediction models user intents to predict possible follow-up behaviors. In conclusion, the thesis demonstrates the feasibility of building a dialogue learning system that is able to automatically learn salient knowledge and understand how the domains work based on unlabeled raw conversations. With the acquired domain knowledge, the initial dialogue system can be constructed and improved quickly by continuously interacting with users. The main contribution of the dissertation is presenting the potential for reducing human work and showing the feasibility of improving scalability and efficiency for dialogue system development by automating the knowledge learning process.


1.5 Thesis Structure

The dissertation is organized as follows.

• Chapter 2 - Background and Related Work

This chapter reviews background knowledge and summarizes related works. The chapter also discusses current challenges of the task, describes several structured knowledge resources and presents distributional semantics that may benefit understanding problems.

• Chapter 3 - Ontology Induction for Knowledge Acquisition

This chapter focuses on inducing a domain ontology that is useful for developing SLU in SDS based on the available structured knowledge resources in an unsupervised way.

Part of this research work has been presented in the following publications [31, 33]:

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Unsupervised Induction and Filling of Semantic Slots for Spoken Dialogue Systems Using Frame-Semantic Parsing,” in Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’13), Olomouc, Czech Republic, 2013.

(Student Best Paper Award)

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction for Spoken Dialogue Systems,” in Proceedings of 2014 IEEE Workshop on Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.

• Chapter 4 - Structure Learning for Knowledge Acquisition

This chapter focuses on learning the structures, such as the inter-slot relations, for helping SLU development. Some of the contributions have been presented in the following publications [39, 40]:

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Jointly Modeling Inter-Slot Relations by Random Walk on Knowledge Graphs for Unsupervised Spoken Language Understanding,” in Proceedings of The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’15), Denver, Colorado, USA, 2015.

– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Learning Semantic Hierarchy for Unsupervised Slot Induction and Spoken Language Understanding,” in Proceedings of The 16th Annual Conference of the International Speech Communication Association (INTERSPEECH’15), Dresden, Germany, 2015.

• Chapter 5 - Surface Form Derivation for Knowledge Acquisition

This chapter focuses on deriving the surface forms conveying semantics for entities from the given ontology, where the derived information contributes to better understanding.

Some of the work has been published [32]:

– Yun-Nung Chen, Dilek Hakkani-Tür, and Gokhan Tur, “Deriving Local Relational Surface Forms from Dependency-Based Entity Embeddings for Unsupervised Spoken Language Understanding,” in Proceedings of 2014 IEEE Workshop of Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.

• Chapter 6 - Semantic Decoding in SLU Modeling

This chapter focuses on decoding users’ spoken languages into corresponding semantic forms, which corresponds to the goal of SLU. Some of these contributions have been presented in the following publication [38]:

– Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky, “Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding,” in Proceedings of The 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), Beijing, China, 2015.

• Chapter 7 - Intent Prediction in SLU Modeling

This chapter focuses on modeling user intents in SLU, so that the SDS is able to predict the users’ follow-up actions and further provide better interactions. Some of the contributions have been presented in the following publications [28, 36, 37, 42]:

– Yun-Nung Chen and Alexander I. Rudnicky, “Dynamically Supporting Unexplored Domains in Conversational Interactions by Enriching Semantics with Neural Word Embeddings,” in Proceedings of 2014 IEEE Workshop of Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.

– Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman, “Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding,” in Proceedings of The 17th ACM International Conference on Multimodal Interaction (ICMI’15), Seattle, Washington, USA, 2015.

– Yun-Nung Chen, Ming Sun, and Alexander I. Rudnicky, “Matrix Factorization with Domain Knowledge and Behavioral Patterns for Intent Modeling,” in Extended Abstract of The 29th Annual Conference on Neural Information Processing Systems – Machine Learning for Spoken Language Understanding and Interactions Workshop (NIPS-SLU’15), Montreal, Canada, 2015.

– Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman, “Unsupervised User Intent Modeling by Feature-Enriched Matrix Factorization,” in Proceedings of The 41st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’16), Shanghai, China, 2016.

• Chapter 8 - SLU in Human-Human Conversations

This chapter investigates the feasibility of applying the technologies developed for human-machine interactions to human-human interactions, expanding the application usage to more practical and broader genres. Part of the research work has been presented in the following publications [34, 35]:

– Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He, “Detecting Actionable Items in Meetings by Convolutional Deep Structured Semantic Models,” in Proceedings of 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’15), Scottsdale, Arizona, 2015.

– Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He, “Learning Bidirectional Intent Embeddings by Convolutional Deep Structured Semantic Models for Spoken Language Understanding,” in Extended Abstract of The 29th Annual Conference on Neural Information Processing Systems – Machine Learning for Spoken Language Understanding and Interactions Workshop (NIPS-SLU’15), Montreal, Canada, 2015.

• Chapter 9 - Conclusions and Future Work

This chapter concludes the main contributions and discusses a number of interesting directions that can be explored in the future.


2 Background and Related Work

“

Everything that needs to be said has already been said. But since no one was listening, everything must be said again.

”

André Gide, Nobel Prize in Literature winner

With an emerging trend of using mobile devices, spoken dialogue systems (SDS) are being incorporated into several devices (e.g. smartphone, smart-TV, navigation system). In the architecture of SDSs, spoken language understanding (SLU) plays an important role and there are many unsolved challenges. The next section first introduces a typical pipeline of an SDS and elaborates the functionality of each individual component. Section 2.2 details how SLU works with different examples, reviews the related literature and discusses their pros and cons; following the literature review, we briefly sketch the idea of the proposed approaches and how they are related to prior studies. Then semantic resources that are used for benefiting language understanding are introduced, where the resources with explicit semantics, Ontology and Knowledge Base, are presented in Section 2.3, and implicit semantics based on the theory of Distributional Semantics are presented in Section 2.4.

2.1 Spoken Dialogue System (SDS)

A typical SDS is composed of a recognizer, a spoken language understanding (SLU) module, a dialogue manager (DM), and an output manager. Figure 2.1 illustrates the system pipeline.

The functionality of each component is summarized below.

• Automatic Speech Recognizer (ASR)

The ASR component takes raw audio signals and transcribes them into word hypotheses with confidence scores. The top hypothesis is then passed to the next component.

• Spoken Language Understanding (SLU)

The goal of SLU is to capture the core semantics given the input word hypothesis, and the extracted information can be populated into task-specific arguments in a given semantic frame [82]. Therefore the task of an SLU module is to identify user intents and fill associated slots based on the word hypotheses. This procedure is also called semantic parsing, semantic decoding, etc. The SLU component typically includes an intent detector and slot taggers. An example utterance “I want to fly to Taiwan from Pittsburgh next week” can be parsed into find_flight(origin=“Pittsburgh”, destination=“Taiwan”, departure_date=“next week”), where find_flight is classified by the intent detector, and the associated slots are later filled by the slot taggers based on the detected intent. This component also estimates confidence scores of the decoded semantic representations for use by the next component.

[Figure 2.1 diagram; component labels: Automatic Speech Recognizer, Spoken Language Understanding (Intent Detector, Slot Tagger), Dialogue Manager, Domain Reasoner, Knowledge Base, Output Manager / Output Generator, Natural Language Generator, Speech Synthesizer, Multimedia Response; output channels: textual, spoken, visual, etc.]

Figure 2.1: The typical pipeline in a dialogue system.

• Dialogue Manager (DM) / Task Manager

Subsequent to the SLU processing, the DM interacts with users to assist them in achieving their goals. Given the above example, the DM should check whether the required slots are properly assigned (departure_date may not be properly specified) and then decide the system’s action such as ask_date or return_flight(origin=“Pittsburgh”, destination=“Taiwan”). This procedure should access knowledge bases as a retrieval database to acquire the desired information. Due to possible misrecognition and misunderstanding errors, this procedure involves dialogue state tracking and policy selection to make more robust decisions [85, 164].

• Output Manager / Output Generator

Traditional dialogue systems are mostly used through phone calls, so the output manager mainly interacts with two modules, a natural language generation (NLG) module and a speech synthesizer. However, with increasing usage of various multimedia devices (e.g. smartphone, smartwatch, and smart-TV), the output manager does not need to focus on generating spoken responses. Instead, the recent trend is moving toward displaying responses via different channels; for example, the utterance “Play Lady Gaga’s Bad Romance.” should correspond to an output action that launches a music player and then plays the specified song. Hence an additional component, multimedia response, is introduced in the infrastructure in order to handle diverse multimedia outputs.

– Multimedia Response

Given the decided action, a multimedia response considers which channel is more suitable to present the returned information based on environmental contexts, user preference, and the devices in use. For example, return_flight(origin=“Pittsburgh”, destination=“Taiwan”) can be presented through visual responses by listing the flights that satisfy the requirement on desktops, laptops, etc., and through spoken responses by uttering “There are seven flights from Pittsburgh to Taiwan. First is ...” on smartwatches.

– Natural Language Generation (NLG)

Given the current dialogue strategy, the NLG component generates the corresponding natural language responses that humans can understand for the purpose of natural dialogues. For example, an action from the DM, ask_date, can generate a response “Which date will you plan to fly?”. Here the responses can be template-based or generated by statistical models [29, 163].

– Speech Synthesizer / Text-to-Speech (TTS)

In order to communicate with users via speech, a speech synthesizer simulates human speech based on the natural language responses generated by the NLG component.

All basic components in a dialogue system should interact with each other, so errors may propagate and then result in poor performance. In addition, several components (e.g. the SLU module) need to incorporate the domain knowledge in order to handle task-specific dialogues. Because domain knowledge is usually predefined by experts or developers, when there are more and more domains, making SLU scalable has been a main challenge of SDS development.
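The hand-off from SLU to the DM described in this section can be sketched as follows, assuming a rule-based policy that asks for the first missing required slot. The tables REQUIRED_SLOTS and ASK_ACTION and the function decide_action are hypothetical names introduced for illustration; a real DM would additionally track dialogue state and confidence scores, which this sketch ignores.

```python
from typing import Dict, Optional

# Hypothetical schema: which slots each intent needs before the system can act.
REQUIRED_SLOTS = {"find_flight": ["origin", "destination", "departure_date"]}

# Hypothetical mapping from a missing slot to the system action that requests it.
ASK_ACTION = {
    "origin": "ask_origin",
    "destination": "ask_destination",
    "departure_date": "ask_date",
}


def decide_action(intent: str, slots: Dict[str, Optional[str]]) -> str:
    """Ask for the first missing required slot; otherwise return the query action."""
    for name in REQUIRED_SLOTS.get(intent, []):
        if not slots.get(name):
            return ASK_ACTION[name]
    args = ", ".join(f'{k}="{v}"' for k, v in slots.items() if v)
    return f"return_flight({args})"


# SLU output for "I want to fly to Taiwan from Pittsburgh next week".
complete = {"origin": "Pittsburgh", "destination": "Taiwan", "departure_date": "next week"}
# Same request, but the date was not recognized.
partial = {"origin": "Pittsburgh", "destination": "Taiwan", "departure_date": None}

print(decide_action("find_flight", partial))   # ask_date
print(decide_action("find_flight", complete))
```

Because the policy depends entirely on the predefined slot schema, every new domain needs its own tables, which illustrates the scalability issue raised above.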

2.2 Spoken Language Understanding (SLU)

In order to allow machines to understand natural language, a semantic representation¹ is introduced. A semantic representation of an utterance carries its core content, so that the actual meaning behind the utterance can be inferred through the representation alone. For example, an utterance “show me action movies directed by james cameron” can be represented as action=“show”, target=“movie”, genre=“action”, director=“james cameron”. Another utterance “find a cheap taiwanese restaurant in oakland” can be formed as action=“find”, target=“restaurant”, price=“cheap”, type=“taiwanese”, location=“oakland”. The semantic representations are able to convey the core meaning of the utterances, which can be more easily processed by machines. The semantic representation is not unique, and there are several forms for representing meanings. Below we describe two types of semantic forms:

1In this document, we use the terms “semantic representation” and “semantic form” interchangeably.

• Slot-Based Semantic Representation

The slot-based representation is a flat structure of semantic concepts, which is usually used in simpler tasks. The above examples belong to slot-based semantic representations, where the semantic concepts are action, target, location, price, etc.

• Relation-Based Semantic Representation

The relation-based representation includes structured concepts, which are usually used in tasks that have more complicated dependency relations. For instance, “show me action movies directed by james cameron” can be represented as movie.directed_by, movie.genre, director.name=“james cameron”, genre.name=“action”. This representation is the same as movie.directed_by(?, “james cameron”) ∧ movie.genre(?, “action”), which originated from the logic form in the artificial intelligence field. The semantic slots in the slot-based representation are formed as relations here.
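The difference between the two semantic forms above can be illustrated with a small sketch that rewrites a flat slot dictionary as relations about the target entity. The mapping SLOT_TO_RELATION and both helper functions are assumptions made for this example; they are not a schema taken from any particular knowledge base.

```python
from typing import Dict, List, Tuple

# Slot-based form: a flat mapping from slot names to values.
slot_based: Dict[str, str] = {
    "action": "show",
    "target": "movie",
    "director": "james cameron",
    "genre": "action",
}

# Hypothetical mapping from flat slot names to relation names on the target type.
SLOT_TO_RELATION = {"director": "directed_by", "genre": "genre"}


def to_relations(slots: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn flat slots into (relation, variable, value) triples about the target."""
    target = slots["target"]
    return [
        (f"{target}.{SLOT_TO_RELATION[name]}", "?", value)
        for name, value in slots.items()
        if name in SLOT_TO_RELATION
    ]


def to_logic_form(relations: List[Tuple[str, str, str]]) -> str:
    """Join the triples into a conjunction in the notation used in the text."""
    return " ∧ ".join(f'{rel}({var}, "{val}")' for rel, var, val in relations)


relations = to_relations(slot_based)
print(to_logic_form(relations))
# movie.directed_by(?, "james cameron") ∧ movie.genre(?, "action")
```

The shared variable “?” stands for the unknown movie, which is what makes the relation-based form structured rather than flat.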

The main purpose of an SLU component is to convert the natural language into semantic forms. In the natural language processing (NLP) field, natural language understanding (NLU) also refers to semantic decoding or semantic parsing. Therefore, this section reviews related literature and studies how they approach the problems for language understanding. After that, the following chapters focus on addressing the challenges that building an SDS suffers from, namely:

• How can we define semantic elements from unlabeled data to form a semantic schema?

• How can we organize semantic elements and then form a meaningful structure?

• How can we decode semantics for test data while accounting for noise at the same time?

• How can we utilize the acquired information to predict user intents for improving system performance?

2.2.1 Leveraging External Resources

Building semantic parsing systems requires large training data with detailed annotations. With rich web-scale resources, a lot of NLP research has therefore leveraged external human knowledge resources for semantic parsing. For example, Berant et al. proposed SEMPRE², which used web-scale knowledge bases to train the semantic parser [10]. Das et al. proposed SEMAFOR³, which utilized a lexicon developed based on a linguistic theory, Frame Semantics, to train the semantic parser [50]. However, such NLP tasks deal with individual and focused problems, ignoring how parsing results are used by applications.

Tur et al. were among the first to consider unsupervised approaches for SLU, where they exploited query logs for slot-filling [152, 154]. In a subsequent study, Heck and Hakkani-Tür studied the Semantic Web for an unsupervised intent detection problem in SLU, showing that results obtained from the unsupervised training process align well with the performance of traditional supervised learning [82]. Following their success of unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection [32, 125, 75], entity extraction [159], and extending domain coverage [41, 28, 57]. Section 2.3 will introduce the exploited knowledge resources and the corresponding analyzers in detail. However, most of the prior studies considered semantic elements independently or only considered the relations appearing in the external resources, where the structure of concepts used by real users might be ignored.

2.2.2 Structure Learning and Inference

From a knowledge management perspective, empowering dialogue systems with large knowledge bases is of crucial significance to modern SDSs. While leveraging external knowledge is the trend, efficient inference algorithms, such as random walks, are still less studied for direct inference on knowledge graphs of the spoken contents. In the NLP literature, Lao et al. used a random walk algorithm to construct inference rules on large entity-based knowledge bases, and leveraged syntactic information for reading the web [101, 102]. Even though this work has important contributions, the proposed algorithm cannot learn mutually-recursive relations, and does not consider lexical items. In fact, more and more studies show that, in addition to semantic knowledge graphs, lexical knowledge graphs that model surface-level natural language realization, multiword expressions, and context, are also critical for short text understanding [91, 109, 110, 145, 158].
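To make the idea of inference by random walk on a graph concrete, the following is a minimal sketch of a generic random walk with restart computed by power iteration. It is not the algorithm of Lao et al. and not the method proposed in this dissertation; the toy graph reuses the slot names from Figure 1.1, but its edges, the restart probability, and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def random_walk_with_restart(
    graph: Dict[str, List[str]], start: str, restart: float = 0.15, iters: int = 50
) -> Dict[str, float]:
    """Approximate the visiting distribution of a walker that jumps back to `start`
    with probability `restart` and otherwise follows a uniformly chosen out-edge."""
    nodes = list(graph)
    prob = {n: (1.0 if n == start else 0.0) for n in nodes}
    for _ in range(iters):
        # Restart mass goes to the start node (prob always sums to 1).
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: send its remaining mass back to the start node.
                nxt[start] += (1.0 - restart) * prob[n]
                continue
            share = (1.0 - restart) * prob[n] / len(out)
            for m in out:
                nxt[m] += share
        prob = nxt
    return prob


# Hypothetical undirected slot graph, stored as adjacency lists in both directions.
slot_graph = {
    "seeking": ["target"],
    "target": ["seeking", "food", "price", "quantity"],
    "food": ["target"],
    "price": ["target"],
    "quantity": ["target"],
}

scores = random_walk_with_restart(slot_graph, start="seeking")
for slot, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{slot}: {score:.3f}")
```

In this toy graph the hub node target receives the highest score, which shows how graph structure, rather than any single edge, determines which nodes the walk considers important.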

2.2.3 Neural Model and Representation

With the recently emerging trend of neural systems, a lot of work has shown the success of applying neural-based models in SLU. Tur et al. have shown that deep convex networks are effective for building better semantic utterance classification systems [153]. Following their success, Deng et al. have further demonstrated the effectiveness of applying the kernel trick to build better deep convex networks for SLU [54]. Nevertheless, most work used neural-based representations for supervised tasks, so there is a gap between approaches used for supervised and unsupervised tasks.

2http://www-nlp.stanford.edu/software/sempre/

3http://www.ark.cs.cmu.edu/SEMAFOR/

In addition, Mikolov recently proposed recurrent neural network based language models to capture long-range dependencies and achieved state-of-the-art performance in recognition [114, 116]. The continuous representations learned as word embeddings have further boosted the state-of-the-art results in many applications, such as sentiment analysis, sentence completion, and relation detection [32, 117, 144]. The details of distributional representations will be described in Section 2.4. Despite the advances in several NLP tasks, how unsupervised SLU can incorporate neural representations remains unknown.

2.2.4 Latent Variable Modeling

Most of the studies above did not explicitly learn latent factor representations from data, so they may neglect errors (e.g. misrecognition) and thus produce unreliable results of SLU [12].

Early studies on latent variable modeling in speech included the classic hidden Markov model for statistical speech recognition [94]. Recently, Celikyilmaz et al. were the first to study the intent detection problem using query logs and a discrete Bayesian latent variable model [23].

In the field of dialogue modeling, the partially observable Markov decision process (POMDP) model is a popular technique for dialogue management [164, 172], reducing the cost of hand-crafted dialogue managers while producing robustness against speech recognition errors. More recently, Tur et al. used a semi-supervised LDA model to show improvement on the slot filling task [155]. Also, Zhai and Williams proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results [174]. However, for unsupervised SLU, it is unclear how to take latent semantics into account.

2.2.5 The Proposed Method

Towards unsupervised SLU, this dissertation proposes an SLU model to integrate the advantages of prior studies and overcome the disadvantages mentioned above. The model leverages the external knowledge while combining frame semantics and distributional semantics, and learns latent feature representations while taking various local and global lexical, syntactic and semantic relations into account in an unsupervised manner. The details will be presented in the following chapters.


Table 2.1: The frame example defined in FrameNet.

Frame: Food
Semantics: physical object
noun: almond, apple, banana, basil, beef, beer, berry, ...
Frame Element: constituent parts, descriptor, type

2.3 Domain Ontology and Knowledge Base

There are two main types of knowledge resources available, generic concept and entity-based, both of which may benefit SLU modules for SDSs. Below we first introduce the detailed definition of each resource and discuss the corresponding work and tools that are useful for leveraging such resources.

2.3.1 Definition

Of the generic concept and entity-based knowledge bases, the former covers concepts that are more common, such as those in a food domain or a weather domain. The latter usually contains a lot of named entities that are specific to certain domains, for example, a movie domain or a music domain. The following describes examples of these knowledge resources, which contain rich semantics and may be beneficial for understanding tasks.

2.3.1.1 Generic Concept Knowledge

There are two semantic knowledge resources for generic concepts, FrameNet and Abstract Meaning Representation (AMR).

• FrameNet⁴ is a linguistic semantic resource that offers annotations of predicate-argument semantics and associated lexical units for English [4]. FrameNet is developed based on a semantic theory, Frame Semantics [64]. For example, the phrase “low fat milk” should be analyzed with “milk” evoking the food frame, where “low fat” fills the descriptor frame element (FE) of that frame and the word “milk” is the actual lexical unit (LU). A defined frame example is shown in Table 2.1.

• Abstract Meaning Representation (AMR) is a semantic representation language that has been used to annotate the meanings of thousands of English sentences. Each AMR is a single rooted, directed graph. AMRs include PropBank semantic roles, within-sentence coreference, named entities and types, modality, negation, questions, quantities, etc. [5]. The AMR feature structure graph of an example sentence is illustrated in Figure 2.2, where the “boy” appears twice, once as the ARG0 of “want-01”, and once as the ARG0 of “go-01”.

4http://framenet.icsi.berkeley.edu

[Figure 2.2 diagram: the sentence “The boy wants to go” drawn as a graph with instances want-01, boy, and go-01 and edges ARG0 and ARG1, alongside its textual AMR:]

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
      :ARG0 b))

Figure 2.2: A sentence example in AMR Bank.

[Figure 2.3 images; panel titles: Google Knowledge Graph, Bing Satori, Freebase.]

Figure 2.3: Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”.
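Returning to the AMR of “The boy wants to go” in Figure 2.2, the graph can be written as a small set of triples to show how one variable fills two argument roles. The variable names w, b, and g follow the figure; the triple encoding and the helper roles_of are assumptions made for this sketch and do not correspond to any AMR toolkit API.

```python
from typing import Dict, List, Tuple

# Each variable is an instance of a concept.
instances: Dict[str, str] = {"w": "want-01", "b": "boy", "g": "go-01"}

# Labeled edges between variables, as (source, role, target).
edges: List[Tuple[str, str, str]] = [
    ("w", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("g", "ARG0", "b"),  # re-entrancy: the boy is also the one who goes
]


def roles_of(var: str) -> List[Tuple[str, str]]:
    """List the (predicate, role) pairs in which the variable fills an argument."""
    return [(instances[src], role) for src, role, tgt in edges if tgt == var]


print(roles_of("b"))  # [('want-01', 'ARG0'), ('go-01', 'ARG0')]
```

The re-entrant edge is the reason an AMR is a graph rather than a tree: the single node for “boy” is shared by both predicates.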

2.3.1.2 Entity-Based Knowledge

• Semantic Knowledge Graph is a knowledge base that provides structured and detailed information about a topic with a list of related links. Three different knowledge graph examples, Google’s knowledge graph⁵, Microsoft’s Bing Satori, and Freebase, are shown in Figure 2.3. The semantic knowledge graph is defined by a schema and composed of nodes and edges connecting the nodes, where each node represents an entity-type and the edge between each node pair describes their relation, also called a property. An example from Freebase is shown in Figure 2.4, where nodes represent core entity-types for the movie domain. The domains in the knowledge graphs span the web, from “American Football” to “Zoos and Aquariums”.

5http://www.google.com/insidesearch/features/search/knowledge.html


[Figure 2.4 diagram; node labels: Avatar, Titanic, Drama, Kate Winslet, James Cameron, Canada, 1997, “Oscar, best director”; edge labels: Genre, Cast, Award, Release Year, Director, Nationality.]

Figure 2.4: A portion of the Freebase knowledge graph related to the movie domain.

[Figure 2.5 diagram: the ASR output “I want to find some inexpensive and very fancy bars in north.” is labelled with the frames desiring, becoming_aware, relational_quantity, expensiveness, degree, building, and part_orientational.]

Figure 2.5: An example of FrameNet categories for an ASR output labelled by probabilistic frame-semantic parsing.

• Wikipedia⁶ is a free-access, free-content Internet encyclopedia, which contains a large number of pages/articles, each related to a specific entity [124]. It provides basic background knowledge that helps understanding tasks in the natural language processing (NLP) field.
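The node-and-edge structure of a semantic knowledge graph described above can be sketched as a list of (subject, property, object) triples with a simple lookup. The fragment below is hand-written for illustration using the labels visible in Figure 2.4; which property attaches to which node is an assumption based on common knowledge about these films, and the property names are not actual Freebase identifiers.

```python
from collections import defaultdict
from typing import List

# Illustrative triples; not an export from Freebase.
TRIPLES = [
    ("Titanic", "director", "James Cameron"),
    ("Avatar", "director", "James Cameron"),
    ("Titanic", "genre", "Drama"),
    ("Titanic", "cast", "Kate Winslet"),
    ("Titanic", "release_year", "1997"),
    ("Titanic", "award", "Oscar, best director"),
    ("James Cameron", "nationality", "Canada"),
]

# Index subjects by (property, object) so that queries with an unknown subject are cheap.
index = defaultdict(list)
for subj, prop, obj in TRIPLES:
    index[(prop, obj)].append(subj)


def find_subjects(prop: str, obj: str) -> List[str]:
    """Return all subjects s such that (s, prop, obj) is in the graph, sorted by name."""
    return sorted(index.get((prop, obj), []))


print(find_subjects("director", "James Cameron"))  # ['Avatar', 'Titanic']
```

A query of this shape corresponds to a relation with an unbound variable, such as movie.directed_by(?, “james cameron”), which is why such graphs are a natural back end for relation-based semantic forms.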

2.3.2 Knowledge-Based Semantic Analyzer

With the available knowledge resources mentioned above, many studies have utilized such knowledge for different tasks. The prior approaches or tools can serve as analyzers and facilitate the target task of this dissertation, unsupervised SLU for dialogue systems.

2.3.2.1 Generic Concept Knowledge

• FrameNet

SEMAFOR⁷ is a state-of-the-art semantic parser for frame-semantic parsing [48, 49].

Trained on manually annotated sentences in FrameNet, SEMAFOR is relatively accurate in predicting semantic frames, FEs, and LUs from raw text. SEMAFOR is augmented by the dual decomposition techniques in decoding, and thus produces semantically-

6http://en.wikipedia.org/wiki/Wikipedia

7http://www.ark.cs.cmu.edu/SEMAFOR/
