Unsupervised Learning and Modeling of Knowledge and Intent for Spoken Dialogue Systems
Yun-Nung (Vivian) Chen CMU-LTI-15-018
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
Dr. Alexander I. Rudnicky (chair), Carnegie Mellon University
Dr. Anatole Gershman (co-chair), Carnegie Mellon University
Dr. Alan W Black, Carnegie Mellon University
Dr. Dilek Hakkani-Tür, Microsoft Research
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies
To my beloved parents and family,
“If you have someone in your life that you are grateful for - someone to whom you want to write another heartfelt, slanted, misspelled thank you note - do it. Tell them they made you feel loved and supported. That they made you feel like you belonged somewhere and that you were not a freak.
Tell them all of that. Tell them today.”
Lisa Jakub, Canadian-American writer and actress
There is a long list of exceptional people to whom I wish to express my gratitude, people who have supported and assisted me during my Ph.D. journey. I want to start with my excellent thesis advisor, Prof. Alexander I. Rudnicky, who gave me the freedom to investigate and explore this field, guided me toward future directions, and not only appreciated but also supported my ideas. Without Alex I might not have become a qualified researcher. I also thank all my committee members – Prof. Anatole Gershman, Prof. Alan W Black, and Dr. Dilek Hakkani-Tür. Thanks to their diverse areas of expertise and insightful advice, I was able to apply my experiments to real-world data, which makes my research work more valuable and more practical.
During my stay at CMU, I was very fortunate to work on different projects and collaborate with several faculty members, including Prof. Jack Mostow, Prof. Florian Metze, Dr. Kai-Min Chang, Dr. Sunjing Lee, etc. As a student at CMU, I had the chance to work and collaborate with people from my own group: Ming Sun, Matthew Marge, Aasish Pappu, Seshadri Sridharan, and Justin Chiu, and with amazing colleagues in LTI: William Y. Wang, Ting-Hao (Kenneth) Huang, Lingpeng Kong, Swabha Swayamdipta, Sujay Kumar Jauhar, Sunayana Sitaram, Sukhada Palkar, and Alok Parlikar. I was very happy to know and interact with a lot of great people from CMU LTI: Shoou-I Yu, Yanchuan Sim, Miaomiao Wen, Yi-Chia Wang, Troy Hua, Duo Ding, Zi Yang, Chenyan Xiong, Diyi Yang, Di Wang, Hyeju Jang, Yanbo Xu, Wei Chen, Yajie Miao, Ran Zhou, Chu-Cheng Lin, Yu-Hsin Kuo, Sz-Rung Shiang, Po Yao Huang, Ting-Yao Hu, Han Lu, Yiu-Chang Lin, Wei-Cheng Chang, Joseph Chang, and Zhou Yu.
I am thankful to my roommates and great friends whom I met in Pittsburgh: Yu-Ying Huang, Jackie Yang, Yi-Tsen Pan, Pei-Hsuan Lee, Yin-Chen Chang, Wan-Ru Yu, Hsien-Tang Kao,
Ching-Heng Lu, Ting-Hao Chen, Chao-Lien Chen, Yu-Ting Lai, Yun-Hsuan Chu, Shannen Liu, Hung-Jing Huang, Kuang-Ching Cheng, Wei-Chun Lin, Po-Wei Chou, Chun-Liang Li, etc. Also, I am grateful that my great undergraduate fellow, Kerry Shih-Ping Chang, also studied at CMU, which let us chat about our undergraduate life. Thanks to my best friend, Yu-Ying Lee, who stayed in Pittsburgh with me for an awesome summer.
During the years of my Ph.D., I have benefited from conferences, workshops, and internships.
I am thankful for the opportunities to discuss and interact with many smart people, such as Gokhan Tur, Asli Celikyilmaz, Andreas Stolcke, Geoffrey Zweig, Larry Heck, Jason Williams, Dan Bohus, Malcolm Slaney, Omer Levy, Yun-Cheng Ju, Li Deng, Xiaodong He, Wan-Tau Yih, Xiang Li, Qi Li, Pei-Hao Su, Tsung-Hsien Wen, David Vandyke, Dipanjan Das, Matthew Henderson, etc., who helped me find my research direction during my Ph.D.
I also want to thank my undergraduate fellows from NTU CSIE and colleagues from NTU Speech Lab. I started my research journey there and found what I am really interested in.
Everyone there helped me a lot, especially my master's advisor – Prof. Lin-Shan Lee, who brought me into the area and motivated me to pursue my doctoral degree in this field.
In addition, I want to thank Luis Marujo for sharing his beautiful LaTeX template, Jung-Yu Lin for her professional English revision, and my friends for their annotation work. I am so lucky to have all of your assistance, which sped up my research and made it much better and more beautiful.
Last but not least, to my dearest Mom and Dad, to my boyfriend and now husband, Che-An Lu, and to my big family: thank all of you for your unconditional love and support. Without all of you, none of my success would be possible.
Yun-Nung (Vivian) Chen, Pittsburgh
Various smart devices (smartphones, smart TVs, in-car navigation systems, etc.) are incorporating spoken language interfaces, also known as spoken dialogue systems (SDS), to help users finish tasks more efficiently. A key component of a successful SDS is spoken language understanding (SLU); in order to capture language variation from dialogue participants, the SLU component must create a mapping between natural language inputs and semantic representations that correspond to users’ intentions.
The semantic representation must include “concepts” and a “structure”: concepts are domain- specific topics, and the structure describes relations between concepts and conveys intention.
Most knowledge-based approaches originated in the field of artificial intelligence (AI).
These methods leveraged deep semantics and relied heavily on rules and symbolic interpretations, which mapped sentences into logical forms: a context-independent representation of a sentence covering its predicates and arguments. However, most prior work focused on learning a mapping between utterances and semantic representations, where such organized concepts still remain predefined. The need for predefined structures and annotated semantic concepts results in extremely high cost and poor scalability in system development. Thus, current technology usually limits conversational interactions to a few narrow predefined domains/topics. As the number of domains supported by various devices keeps increasing, to fill this gap, this dissertation focuses on improving the generalization and scalability of building SDSs with little human effort.
In order to achieve this goal, two questions need to be addressed: 1) Given unlabeled conversations, how can a system automatically induce and organize the domain-specific concepts? 2) With the automatically acquired knowledge, how can a system understand user utterances and intents? To tackle the above problems, we propose to acquire domain knowledge that captures humans’ salient semantics, intents, and behaviors. Then, based on the acquired knowledge, we build an SLU component to understand users.
The dissertation focuses on several important aspects of the above two problems: Ontology Induction, Structure Learning, Surface Form Derivation, Semantic Decoding, and Intent Prediction. To solve the first problem about automating knowledge learning, ontology induction extracts domain-specific concepts, and then structure learning infers a meaningful organization of these concepts for SDS design. With the structured ontology, surface form derivation learns natural language variation to enrich its understanding cues. For the second problem about how to effectively understand users based on the acquired knowledge, we propose to decode users’ semantics and to predict intents about follow-up behaviors through a matrix factorization model, which outperforms other SLU models.
Furthermore, the dissertation investigates the performance of SLU modeling for human-human conversations, where two tasks are discussed: actionable item detection and iterative ontology refinement. For actionable item detection, human-machine conversations are utilized to learn intent embeddings through convolutional deep structured semantic models in order to estimate the probability that actionable items appear in human-human dialogues. For iterative ontology refinement, ontology induction is first performed on human-human conversations and achieves performance similar to that on human-machine conversations. The integration of actionable item estimation and ontology induction yields an improved ontology for manual transcripts. Also, the oracle estimation shows the feasibility of iterative ontology refinement and the room for further improvement.
In conclusion, the dissertation shows the feasibility of building a dialogue learning system that is able to understand how particular domains work based on unlabeled human-machine and human-human conversations. As a result, an initial SDS can be built automatically from the learned knowledge, and its performance can be iteratively improved by interacting with users in practical usage, presenting great potential for reducing human effort during SDS development.
Convolutional Deep Structured Semantic Model (CDSSM)
Deep Structured Semantic Model (DSSM)
Distributional Semantics
Domain Ontology
Embeddings
Intent Modeling
Knowledge Graph
Matrix Factorization (MF)
Multimodality
Spoken Language Understanding (SLU)
Spoken Dialogue System (SDS)
Semantic Representation
Unsupervised Learning
List of Abbreviations
AF Average F-Measure is an evaluation metric that measures the performance of a ranking list by averaging the F-measure over all positions in the ranking list.
AMR Abstract Meaning Representation is a simple and readable semantic representation in AMR Bank.
AP Average Precision is an evaluation metric that measures the performance of a ranking list by averaging the precision over all positions in the ranking list.
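As an illustrative sketch (not code from the dissertation), one common formulation of average precision for a single ranking with binary relevance labels accumulates precision at each relevant position; the function name and example data below are invented:

```python
def average_precision(ranking):
    """Average precision for one ranked list of binary relevance labels
    (1 = relevant, 0 = not), ordered from the top of the list down."""
    hits, precisions = 0, []
    for pos, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / pos)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision([1, 0, 1, 0]))
```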
ASR Automatic Speech Recognition, also known as computer speech recognition, is the process of converting the speech signal into written text.
AUC Area Under the Precision-Recall Curve is an evaluation metric that measures the performance of a ranking list by averaging the precision over a set of evenly spaced recall levels in the ranking list.
CBOW Continuous Bag-of-Words is an architecture for learning distributed word representations, which is similar to the feedforward neural net language model but uses a continuous distributed representation of the context.
CDSSM Convolutional Deep Structured Semantic Model is a deep neural net model with a convolutional layer, where the objective is to maximize the similarity between semantic vectors of two associated elements.
CMU Carnegie Mellon University is a private research university in Pittsburgh.
DSSM Deep Structured Semantic Model is a deep neural net model, where the objective is to maximize the similarity between semantic vectors of two associated elements.
FE Frame Element is a descriptive vocabulary for the components of each frame.
IA Intelligent Assistant is a software agent that can perform tasks or services for an individual. These tasks or services are based on user input, location awareness, and the ability to access information from a variety of online sources.
ICSI International Computer Science Institute is an independent, non-profit research organization located in Berkeley, California, USA.
IR Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.
ISCA International Speech Communication Association is a non-profit organization that aims to promote, in an international world-wide context, activities and exchanges in all fields related to speech communication science and technology.
LTI Language Technologies Institute is a research department in the School of Computer Science at Carnegie Mellon University.
LU Lexical Unit is a word with a sense.
MAP Mean Average Precision is an evaluation metric that measures the performance of a set of ranking lists by averaging the average precision over all queries.
MF Matrix Factorization is a decomposition of a matrix into a product of matrices in the discipline of linear algebra.
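As a hedged illustration of the general idea (not the MF-SLU model developed later in the dissertation), a matrix can be factorized into low-rank factors, for example via truncated SVD; the toy matrix below is invented:

```python
import numpy as np

# Toy example (invented data): factorize a 4x3 matrix R into rank-2
# factors U (4x2) and V (2x3) via truncated SVD, one standard way to
# obtain a matrix factorization R ≈ U V.
R = np.array([[5., 3., 0.],
              [4., 0., 0.],
              [1., 1., 5.],
              [0., 1., 4.]])
U_full, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                              # target rank
U = U_full[:, :k] * s[:k]          # absorb singular values into U
V = Vt[:k, :]
R_hat = U @ V                      # best rank-k approximation of R
print("reconstruction error:", np.linalg.norm(R - R_hat))
```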
MLR Multinomial Logistic Regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.
NLP Natural Language Processing is a field of artificial intelligence and linguistics that studies the problems intrinsic to the processing and manipulation of natural language.
POS Part of Speech tag, also known as word class or lexical category, refers to traditional categories of words intended to reflect their functions within a sentence.
RDF Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.
SDS Spoken Dialogue System is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently.
SGD Stochastic Gradient Descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
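A minimal sketch of this idea, assuming a toy least-squares objective (the data, learning rate, and iteration count below are invented for illustration):

```python
import random

# Toy example (invented): use SGD to fit y = w * x by minimizing the
# sum of squared errors, taking a gradient step on one randomly
# sampled example at a time.
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]  # underlying truth: w = 2
w, lr = 0.0, 0.001
for _ in range(1000):
    x, y = random.choice(data)
    grad = 2 * (w * x - y) * x               # d/dw of (w*x - y)^2
    w -= lr * grad                           # one stochastic step
print(round(w, 3))  # converges toward 2.0
```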
SLU Spoken Language Understanding is a component of a spoken dialogue system, which parses natural language into semantic forms that benefit the system’s understanding.
SPARQL SPARQL Protocol and RDF Query Language is a semantic query language for databases, able to retrieve and manipulate data stored in RDF format.
SVM Support Vector Machine is a supervised learning method used for classification and regression based on the Structural Risk Minimization inductive principle.
WAP Weighted Average Precision is an evaluation metric that measures the performance of a ranking list by weighting the precision over all positions in the ranking list.
WOZ Wizard-of-Oz is a research experiment in which subjects interact with a computer system that subjects believe to be autonomous, but which is actually being operated or partially operated by an unseen human being.
1 Introduction 1
1.1 Introduction . . . 1
1.2 The Problem . . . 2
1.3 Towards Improved Scalability, Generalization & Efficiency for SDS . . . 3
1.4 Thesis Statement . . . 5
1.5 Thesis Structure . . . 6
2 Background and Related Work 9
2.1 Spoken Dialogue System (SDS) . . . 9
2.2 Spoken Language Understanding (SLU) . . . 11
2.2.1 Leveraging External Resources . . . 12
2.2.2 Structure Learning and Inference . . . 13
2.2.3 Neural Model and Representation . . . 13
2.2.4 Latent Variable Modeling . . . 14
2.2.5 The Proposed Method . . . 14
2.3 Domain Ontology and Knowledge Base . . . 15
2.3.1 Definition . . . 15
2.3.1.1 Generic Concept Knowledge . . . 15
2.3.1.2 Entity-Based Knowledge . . . 16
2.3.2 Knowledge-Based Semantic Analyzer . . . 17
2.3.2.1 Generic Concept Knowledge . . . 17
2.3.2.2 Entity-Based Knowledge . . . 18
2.4 Distributional Semantics . . . 19
2.4.1 Linear Word Embedding . . . 19
2.4.2 Dependency-Based Word Embedding . . . 21
2.4.3 Paragraph Embedding . . . 22
2.4.4 Sentence Embedding . . . 22
2.4.4.1 Architecture . . . 23
2.4.4.2 Training Procedure . . . 24
2.5 Evaluation Metrics . . . 25
3 Ontology Induction for Knowledge Acquisition 29
3.1 Introduction . . . 29
3.2 Proposed Framework . . . 30
3.2.1 Probabilistic Semantic Parsing . . . 31
3.2.2 Independent Semantic Decoder . . . 32
3.2.3 Adaptation Process and SLU Model . . . 32
3.3 Slot Ranking Model . . . 32
3.4 Word Representations for Similarity Measure . . . 33
3.4.1 In-Domain Clustering Vector . . . 34
3.4.2 In-Domain Embedding Vector . . . 34
3.4.3 External Embedding Vector . . . 34
3.5 Experiments . . . 35
3.5.1 Experimental Setup . . . 35
3.5.2 Implementation Detail . . . 36
3.5.3 Evaluation Metrics . . . 36
3.5.3.1 Slot Induction . . . 36
3.5.3.2 SLU Model . . . 37
3.5.4 Evaluation Results . . . 37
3.6 Discussion . . . 38
3.6.1 Balance between Frequency and Coherence . . . 38
3.6.2 Sensitivity to Amount of Training Data . . . 39
3.7 Summary . . . 40
4 Structure Learning for Knowledge Acquisition 41
4.1 Introduction . . . 41
4.2 The Proposed Framework . . . 42
4.3 Slot Ranking Model . . . 43
4.3.1 Knowledge Graphs . . . 44
4.3.2 Edge Weight Estimation . . . 44
4.3.2.1 Frequency-Based Measurement . . . 45
4.3.2.2 Embedding-Based Measurement . . . 46
4.3.3 Random Walk Algorithm . . . 47
4.3.3.1 Single-Graph Random Walk . . . 47
4.3.3.2 Double-Graph Random Walk . . . 48
4.4 Experiments . . . 48
4.4.1 Experimental Setup . . . 49
4.4.2 Evaluation Results . . . 49
4.4.2.1 Slot Induction . . . 49
4.4.2.2 SLU Model . . . 50
4.4.3 Discussion and Analysis . . . 50
4.4.3.1 Comparing Frequency- and Embedding-Based Measurements . . . 50
4.4.3.2 Comparing Single- and Double-Graph Approaches . . . 50
4.4.3.3 Relation Discovery Analysis . . . 51
4.5 Summary . . . 52
5 Surface Form Derivation for Knowledge Acquisition 53
5.1 Introduction . . . 53
5.2 Knowledge Graph Relation . . . 54
5.3 Proposed Framework . . . 55
5.4 Relation Inference from Gazetteers . . . 55
5.5 Relational Surface Form Derivation . . . 56
5.5.1 Web Resource Mining . . . 56
5.5.2 Dependency-Based Entity Embedding . . . 56
5.5.3 Surface Form Derivation . . . 57
5.5.3.1 Entity Surface Forms . . . 57
5.5.3.2 Entity Syntactic Contexts . . . 58
5.6 Probabilistic Enrichment and Bootstrapping . . . 59
5.7 Experiments . . . 60
5.7.1 Experimental Setup . . . 60
5.7.2 Results . . . 61
5.8 Discussion . . . 62
5.8.1 Effectiveness of Entity Surface Forms . . . 62
5.8.2 Effectiveness of Entity Contexts . . . 63
5.8.3 Comparison of Probabilistic Enrichment Methods . . . 63
5.8.4 Effectiveness of Bootstrapping . . . 64
5.8.5 Overall Results . . . 64
5.9 Summary . . . 65
6 Semantic Decoding in SLU Modeling 67
6.1 Introduction . . . 67
6.2 Proposed Framework . . . 69
6.3 Matrix Factorization for Spoken Language Understanding (MF-SLU) . . . 70
6.3.1 Feature Model . . . 71
6.3.2 Knowledge Graph Propagation Model . . . 71
6.3.3 Integrated Model . . . 73
6.3.4 Parameter Estimation . . . 74
6.3.4.1 Objective Function . . . 74
6.3.4.2 Optimization . . . 75
6.4 Experiments . . . 75
6.4.1 Experimental Setup . . . 75
6.4.2 Evaluation Results . . . 75
6.4.3 Discussion and Analysis . . . 76
6.4.3.1 Effectiveness of Semantic and Dependency Relation Models . . . 77
6.4.3.2 Comparing Word/Slot Relation Models . . . 77
6.5 Summary . . . 77
7 Intent Prediction in SLU Modeling 79
7.1 Introduction . . . 79
7.2 Data Description . . . 82
7.2.1 Single-Turn Request for Mobile Apps . . . 82
7.2.2 Multi-Turn Interaction for Mobile Apps . . . 83
7.3 Feature-Enriched MF-SLU . . . 85
7.3.1 Feature-Enriched Matrix Construction . . . 87
7.3.1.1 Word Observation Matrix . . . 87
7.3.1.2 Enriched Semantics Matrix . . . 87
7.3.1.3 Intent Matrix . . . 89
7.3.1.4 Integrated Model . . . 89
7.3.2 Optimization Procedure . . . 90
7.4 User Intent Prediction for Mobile App . . . 91
7.4.1 Baseline Model for Single-Turn Requests . . . 91
7.4.2 Baseline Model for Multi-Turn Interactions . . . 92
7.5 Experimental Setup . . . 92
7.5.1 Word Embedding . . . 92
7.5.2 Retrieval Setup . . . 92
7.6 Results . . . 93
7.6.1 Results for Single-Turn Requests . . . 93
7.6.2 Results for Multi-Turn Interactions . . . 94
7.6.3 Comparing between User-Dependent and User-Independent Models . . 96
7.7 Summary . . . 97
8 SLU in Human-Human Conversations 99
8.1 Introduction . . . 99
8.2 Convolutional Deep Structured Semantic Models (CDSSM) . . . 101
8.2.1 Architecture . . . 102
8.2.2 Training Procedure . . . 102
8.2.2.1 Predictive Model . . . 103
8.2.2.2 Generative Model . . . 103
8.3 Adaptation . . . 103
8.3.1 Adapting CDSSM . . . 104
8.3.2 Adapting Action Embeddings . . . 104
8.4 Actionable Item Detection . . . 105
8.4.1 Unidirectional Estimation . . . 106
8.4.2 Bidirectional Estimation . . . 106
8.5 Iterative Ontology Refinement . . . 106
8.6 Experiments . . . 107
8.6.1 Experimental Setup . . . 107
8.6.2 CDSSM Training . . . 108
8.6.3 Implementation Details . . . 109
8.7 Evaluation Results . . . 109
8.7.1 Comparing Different CDSSM Training Data . . . 110
8.7.2 Effectiveness of Bidirectional Estimation . . . 111
8.7.3 Effectiveness of Adaptation Techniques . . . 111
8.7.4 Effectiveness of CDSSM . . . 112
8.7.5 Discussion . . . 113
8.8 Extensive Experiments . . . 113
8.8.1 Dataset . . . 113
8.8.2 Ontology Induction . . . 114
8.8.3 Iterative Ontology Refinement . . . 115
8.8.4 Influence of Recognition Errors . . . 116
8.8.5 Balance between Frequency and Coherence . . . 116
8.8.6 Effectiveness of Actionable Item Information . . . 117
8.9 Summary . . . 119
9 Conclusions and Future Work 121
9.1 Conclusions . . . 121
9.2 Future Work . . . 122
9.2.1 Domain Discovery . . . 122
9.2.2 Large-Scaled Active Learning of SLU . . . 122
9.2.3 Iterative Learning through Dialogues . . . 123
9.2.4 Error Recovery for System Robustness . . . 123
A AIMU: Actionable Items in Meeting Understanding 143
A.1 AIMU Dataset . . . 143
A.2 Semantic Intent Schema . . . 143
A.2.1 Domain, Intent, and Argument Definition . . . 144
A.3 Annotation Agreement . . . 146
A.4 Statistical Analysis . . . 147
B Insurance Dataset 149
B.1 Dataset . . . 149
B.2 Annotation Procedure . . . 149
B.3 Reference Ontology . . . 151
B.3.1 Ontology Slots . . . 151
B.3.2 Ontology Structure . . . 153
List of Figures
1.1 An example output of the proposed knowledge acquisition approach. . . 4
1.2 An example output of the proposed SLU modeling approach. . . 4
2.1 The typical pipeline in a dialogue system. . . 10
2.2 A sentence example in AMR Bank. . . 16
2.3 Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”. . . 16
2.4 A portion of the Freebase knowledge graph related to the movie domain. . . 17
2.5 An example of FrameNet categories for an ASR output labelled by probabilistic frame-semantic parsing. . . 17
2.6 An example of AMR parsed by JAMR on an ASR output. . . 18
2.7 An example of Wikification. . . 18
2.8 The CBOW and Skip-gram architectures. The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words given the target word. . . 20
2.9 The target words and associated dependency-based contexts extracted from the parsed sentence for training dependency-based word embeddings. . . 21
2.10 The framework for learning paragraph vectors. . . 22
2.11 Illustration of the CDSSM architecture for the IR task. . . 23
3.1 The proposed framework for ontology induction. . . 31
3.2 An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit. . . 31
3.3 The mappings from induced slots (within blocks) to reference slots (right sides of arrows). . . 36
3.4 The performance of slot induction learned with different α values. . . 39
3.5 The performance of SLU modeling with different α values. . . 39
3.6 The performance of slot induction learned from different amounts of training data. . . 40
4.1 The proposed framework of structure learning. . . 43
4.2 A simplified example of the integration of two knowledge graphs, where a slot candidate si is represented as a node in a semantic knowledge graph and a word wj is represented as a node in a lexical knowledge graph. . . 44
4.3 The dependency parsing result on an utterance. . . 45
4.4 The automatically and manually created knowledge graphs for a restaurant domain. . . 52
5.1 The relation detection examples. . . 54
5.2 The proposed framework of surface form derivation. . . 55
5.3 An example of dependency-based contexts. . . 56
5.4 Learning curves over incremental iterations of bootstrapping. . . 64
6.1 (a): The proposed framework of semantic decoding. (b): Our MF method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, and shaded circles are inferred facts. Ontology induction maps observed surface patterns to semantic slot candidates. Word relation model constructs correlations between surface patterns. Slot relation model learns slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization incorporates these models jointly, and produces a coherent and domain-specific SLU model. . . 69
7.1 Total 13 tasks in the corpus (only pictures are shown to subjects for making requests). . . 82
7.2 The dialogue example for multi-turn interaction with multiple apps. . . 84
7.3 The feature-enriched MF method completes a partially-missing matrix to factorize the low-rank matrix for implicit information modeling. Dark circles are observed facts, and shaded circles are latent and inferred facts. Reasoning with MF considers latent semantics to predict intents based on rich features corresponding to the current user utterance. . . 86
8.1 The ICSI meeting segments annotated with actionable items. The triggered intents are at the right part along with descriptions. The intent-associated arguments are labeled within texts. . . 100
8.2 The genre mismatched examples with the same action. . . 101
8.3 Illustration of the CDSSM architecture for the predictive model. . . 102
8.4 Action distribution for different types of meetings. . . 104
8.5 The average AUC distribution over all actions in the training set before and after action embedding adaptation using Match-CDSSM. . . 111
8.6 The AUC trend of actionable item detection with different thresholds δ. . . 115
8.7 The performance of the induced slots with different α values for ASR transcripts; the right one shows the detailed trends about the baseline and the proposed ones. . . 117
8.8 The performance of the induced slots with different α values for manual transcripts; the right one shows the detailed trends about the baseline and the proposed ones. . . 117
8.9 The performance of the learned structure with different α values for ASR transcripts; the right one shows the detailed trends about the baseline and the proposed ones. . . 118
8.10 The performance of the learned structure with different α values for manual transcripts; the right one shows the detailed trends about the baseline and the proposed ones. . . 118
A.1 Action distribution for different types of meetings. . . 146
B.1 The annotation interface. . . 150
List of Tables
2.1 The frame example defined in FrameNet. . . 15
3.1 The statistics of training and testing corpora. . . 35
3.2 The performance with different α tuned on a development set (%). . . 37
4.1 The contexts extracted for training dependency-based word/slot embeddings from the utterance of Figure 3.2. . . 46
4.2 The performance of induced slots and corresponding SLU models (%). . . 49
4.3 The top inter-slot relations learned from the training set of ASR outputs. . . 51
5.1 The contexts extracted for training dependency entity embeddings in the example of Figure 5.3. . . 57
5.2 An example of three different methods in the probabilistic enrichment (w = “pitt ”). . . 60
5.3 Relation detection datasets used in the experiments. . . 61
5.4 The micro F-measure of the first-pass SLU performance before bootstrapping (N = 15) (%). . . 62
5.5 The micro F-measure of SLU performance with bootstrapping (N = 15) (%). . . 62
5.6 The examples of derived entity surface forms based on dependency-based entity embeddings. . . 63
6.1 The MAP of predicted slots (%); † indicates that the result is significantly better than MLR (row (b)) with p < 0.05 in t-test. . . 76
6.2 The MAP of predicted slots using different types of relation models in MF (%); † indicates that the result is significantly better than the feature model (column (a)) with p < 0.05 in t-test. . . 77
7.1 The recording examples collected from some subjects for single-turn requests. . . 83
7.2 User intent prediction for single-turn requests on MAP using different training features (%). LM is a baseline language modeling approach that models explicit semantics. . . 94
7.3 User intent prediction for single-turn requests on P@10 using different training features (%). LM is a baseline language modeling approach that models explicit semantics. . . 94
7.4 User intent prediction for multi-turn interactions on MAP (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating using MF can significantly improve the MLR model (t-test with p < 0.05). . . 95
7.5 User intent prediction for multi-turn interactions on turn accuracy (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating using MF can significantly improve the MLR model (t-test with p < 0.05). . . 95
7.6 User intent prediction for multi-turn interactions on MAP for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05). . . 97
7.7 User intent prediction for multi-turn interactions on ACC for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05). . . 97
8.1 Actionable item detection performance on the average AUC using bidirectional estimation (%). . . 108
8.2 Actionable item detection performance on AUC (%). . . 110
8.3 Actionable item detection performance on AUC (%). . . 113
8.4 The learned ontology performance with α = 0.8 tuned from Chapter 3 (%). . . 114
A.1 The data set description. . . 143
A.2 The description of the semantic intent schema for meetings. . . 145
A.3 Annotation agreement during different settings. . . 146
A.4 The annotation statistics. . . 147
B.1 The data statistics. . . 149
B.2 The inter-rater agreements for different annotated types (%). . . 151
B.3 The detail of inter-rater agreements for actionable utterances (%). . . 151
B.4 The reference FrameNet slots associated with fillers and corresponding frequency. . . 152
B.5 The labeled slots uncovered by FrameNet. . . 153
B.6 The reference ontology structure composed of inter-slot relations (part 1). . . 154
B.7 The reference ontology structure composed of inter-slot relations (part 2). . . 155
B.8 The reference ontology structure composed of inter-slot relations (part 3). . . 156
B.9 The reference ontology structure composed of inter-slot relations (part 4). . . 157
B.10 The reference ontology structure composed of inter-slot relations (part 5). . . 158
B.11 The reference ontology structure composed of inter-slot relations (part 6). . . 159
“Computing is not about computers any more. It is about living.”
Nicholas Negroponte, Massachusetts Institute of Technology Media Lab founder and chairman emeritus
A spoken dialogue system (SDS) is an intelligent agent that interacts with a user via natural spoken language in order to help the user obtain desired information or solve a problem more efficiently. Despite the recent success of personal intelligent assistants (e.g. Google Now, Apple’s Siri, Microsoft’s Cortana, and Amazon’s Echo), spoken dialogue systems are still very brittle when confronted with out-of-domain information. The biggest challenge therefore results from limited domain knowledge. In this introductory chapter, we first introduce how an SDS is developed and articulate a research problem in the current development procedure. This chapter then outlines the contributions this dissertation brings to address the challenges and provides a roadmap for the rest of the document.
Spoken language understanding (SLU) has seen considerable advancements over the past two decades. However, while language understanding remains unsolved, a variety of practical task-oriented dialogue systems have been built to operate in limited specific domains. For instance, the CMU Communicator system is a dialogue system for an air travel domain that provides information about flight, car, and hotel reservations. Another example, the JUPITER system, is a dialogue system for a weather domain, which provides forecast information for the requested city. More recently, a number of efforts in industry (e.g. Google Now, Apple’s Siri, Microsoft’s Cortana, Amazon’s Echo, and Facebook’s M) and academia have focused on developing semantic understanding techniques for building better SDSs [1, 14, 56, 72, 103, 107, 120, 129, 131, 138, 160, 166].
These systems aim to automatically identify user intents expressed in natural language, extract associated arguments or slots, and take actions accordingly to fulfill the user’s requests. Typically, an SDS architecture is composed of the following components: an automatic speech recognizer (ASR), a spoken language understanding (SLU) module, a dialogue manager (DM), and an output manager. When developing a dialogue system for a new domain, we may be able to reuse components that are designed independently of domain-specific information, for example, the speech recognizer. However, the components that are integrated with domain-specific information have to be reconstructed for each new domain, and the cost of development is high. With a rapidly increasing number of domains, the current bottleneck of the SDS is SLU.
1.2 The Problem
The classic development process of a dialogue system involves 1) specifying system requirements, 2) designing and implementing each module in the dialogue system to meet all requirements, and 3) evaluating the implemented system. In the first step, dialogue system developers need to specify the scope of the target dialogue system (i.e. the domain that the system can support and operate in) in order to identify domain-specific concepts, a.k.a. slots, and their arguments for slot filling; determine the structure of each task to specify the potential intents and associated slots for intent classification; and indicate the desired interaction between the system and a user, such as the dialogue flow used by the DM. Most prior studies focused on the implementation of each component in the SDS pipeline under the assumption that the domain-specific schema is given. However, because the number of possible domains is unlimited, identifying domain-specific information becomes a major issue during SDS development.
Conventionally, the domain-specific knowledge is manually defined by domain experts or developers. For common domains like a weather domain or a bus domain, system developers are usually able to identify such information themselves. However, the knowledge of some niche domains (e.g. a military domain) is held only by experts, making the knowledge engineering process more difficult. Furthermore, the experts’ decisions may be subjective and may not cover all possible real-world users’ cases.
In the second step, implementing each component usually suffers from a common issue, data scarcity. Specifically, training intent detectors and slot taggers of SLU requires a set of utterances labeled with task-specific intents and arguments. One simple solution is to create hand-crafted grammars so that the component can be built without any annotated data.
Another solution is to simulate an environment to collect utterances, such as the Wizard-of-Oz (WOZ) method and crowd-sourcing, so that the collected data can be used to train the models [6, 76]. However, the collected data may be biased by the developers’ subjectivity, because users’ perspectives on a task might not be foreseen by dialogue system developers.
Furthermore, the poor generalization and scalability of current systems result in limited predefined information, and can even bias the subsequent data collection and annotation. Another issue is efficiency: the manual definition and annotation process for domain-specific tasks can be very time-consuming and financially costly. Finally, the maintenance cost is also non-trivial: when new conversational data comes in, developers, domain experts, and annotators have to manually analyze the audio recordings or the transcripts to update and expand the ontologies. Identifying domain knowledge and collecting training data and annotations require domain experts and manual labor, resulting in high cost, long duration, and poor scalability of SDS development. These challenges (generalization, scalability, and efficiency) are the main bottlenecks in current dialogue systems.
1.3 Towards Improved Scalability, Generalization & Efficiency for SDS
Usually participants engage in a conversation in order to achieve a specific goal, such as accomplishing a task or acquiring answers to questions, for example, obtaining a list of restaurants in a specific location. Therefore, in the context of this dissertation, domain-specific information refers to the knowledge specific to an SDS-supported task rather than the knowledge about general dialogue mechanisms. To tackle the above problems, we aim to mine the domain-specific knowledge from unlabeled dialogues (e.g. conversations collected by a call center, recorded utterances that cannot be handled by existing systems, etc.) to construct a domain ontology, and then model the SLU component based on the acquired knowledge and unlabeled data in an unsupervised manner.
The dissertation mainly focuses on two parts:
• Knowledge acquisition learns the domain-specific knowledge that is used by an SLU component. The domain-specific knowledge is represented by a structured ontology, which allows the SDS to support the target domain and thus comprehend meanings.
An example of the necessary domain knowledge about restaurant recommendation is shown in Figure 1.1, where the learned domain knowledge contains semantic slots and their relations5. The acquired domain ontology provides an overview of a domain or
5A slot is defined as a semantic unit commonly used in dialogue systems.
Figure 1.1: An example output of the proposed knowledge acquisition approach.
Figure 1.2: An example output of the proposed SLU modeling approach.
multiple domains, which can guide developers for designing the schema or be directly utilized by the SLU module.
• SLU modeling builds an SLU module that is able to understand the actual meaning of domain-specific utterances based on the domain-specific knowledge and then provide better responses. An example of the corresponding understanding procedure in the restaurant domain is shown in Figure 1.2, where the SLU component analyzes the utterance “can i have a cheap restaurant” and outputs a semantic representation including the low-level slots price=“cheap” and target=“restaurant” and a high-level intent navigation.
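To make the expected input/output contract of SLU modeling concrete, the following sketch implements a deliberately minimal keyword-matching decoder for the restaurant example. The lexicon and the intent rule are hypothetical simplifications introduced only for illustration; the actual models in this thesis are learned without such hand-crafted rules.

```python
# A toy spoken language understanding (SLU) decoder: it maps keywords to
# slots and derives a coarse intent from the filled slots. Real SLU models
# are statistical; this sketch only illustrates the input/output contract.

SLOT_LEXICON = {
    "cheap": ("price", "cheap"),
    "expensive": ("price", "expensive"),
    "restaurant": ("target", "restaurant"),
    "bar": ("target", "bar"),
}

def understand(utterance):
    slots = {}
    for token in utterance.lower().split():
        if token in SLOT_LEXICON:
            slot, value = SLOT_LEXICON[token]
            slots[slot] = value
    # Hypothetical rule: asking for a place implies a navigation intent.
    intent = "navigation" if slots.get("target") else "unknown"
    return {"intent": intent, "slots": slots}

print(understand("can i have a cheap restaurant"))
# {'intent': 'navigation', 'slots': {'price': 'cheap', 'target': 'restaurant'}}
```

The returned dictionary mirrors the semantic representation in Figure 1.2: low-level slots plus a high-level intent.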
To acquire domain knowledge from the growing amount of available conversational data, recent approaches are data-driven, which improves generalization and scalability. In the past decade, the computational linguistics community has focused on developing language processing algorithms that can leverage the vast quantities of available raw data. Chotimongkol et al. proposed a machine learning technique to acquire domain-specific knowledge, showing the potential for reducing human effort in SDS development. However, that work mainly focused on
the low-level semantic units such as word-level concepts. With increasingly available high-level knowledge resources, such as FrameNet, Freebase, and Wikipedia, this dissertation moves forward to investigate the possibility of developing a high-level semantic conversation analyzer for a given domain using an unsupervised machine learning approach. Human semantics, intents, and behaviors can be captured from a collection of unlabeled raw conversational data, and then be modeled for building a good SDS.
In terms of practical usage, the acquired knowledge may be manually revised to improve system performance. Even though some revision might be required, the cost of revision is significantly lower than the cost of analysis. Also, the automatically learned information may reflect real-world users’ cases and avoid biasing subsequent annotations. This thesis focuses on the highlighted parts: acquiring domain knowledge from the dialogues using available resources, and modeling an SLU module using the automatically acquired information.
The proposed approach combining both data-driven and knowledge-driven perspectives shows the potential for improving generalization, maintenance, efficiency, and scalability of dialogue system development.
1.4 Thesis Statement
The main purpose of this work is to automatically develop an SLU module for an SDS by utilizing automatically learned domain knowledge in an unsupervised fashion. This dissertation mainly focuses on acquiring the domain knowledge that is useful for better understanding and for designing the system framework, and further on modeling the semantic meaning of the spoken language. For knowledge acquisition, there are two important stages – ontology induction and structure learning. After applying them, organized domain knowledge is inferred from unlabeled conversations. For SLU modeling, there are two aspects – semantic decoding and intent prediction. Based on the acquired ontology, semantic decoding analyzes the semantic meaning of each individual utterance, and intent prediction models user intents to predict possible follow-up behaviors. In conclusion, the thesis demonstrates the feasibility of building a dialogue learning system that is able to automatically learn salient knowledge and understand how the domains work based on unlabeled raw conversations. With the acquired domain knowledge, an initial dialogue system can be constructed and then improved quickly by continuously interacting with users. The main contribution of the dissertation is demonstrating the potential for reducing human work and showing the feasibility of improving the scalability and efficiency of dialogue system development by automating the knowledge learning process.
1.5 Thesis Structure
The dissertation is organized as below.
• Chapter 2 - Background and Related Work
This chapter reviews background knowledge and summarizes related work. The chapter also discusses current challenges of the task, describes several structured knowledge resources, and presents distributional semantics that may benefit understanding problems.
• Chapter 3 - Ontology Induction for Knowledge Acquisition
This chapter focuses on inducing a domain ontology that is useful for developing SLU in an SDS based on available structured knowledge resources in an unsupervised way.
Part of this research work has been presented in the following publications [31, 33]:
– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Unsupervised Induction and Filling of Semantic Slots for Spoken Dialogue Systems Using Frame-Semantic Parsing,” in Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’13), Olomouc, Czech Republic, 2013.
(Student Best Paper Award)
– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Leveraging Frame Semantics and Distributional Semantics for Unsupervised Semantic Slot Induction for Spoken Dialogue Systems,” in Proceedings of the 2014 IEEE Workshop on Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.
• Chapter 4 - Structure Learning for Knowledge Acquisition
This chapter focuses on learning structures, such as inter-slot relations, for helping SLU development. Some of the contributions have been presented in the following publications [39, 40]:
– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Jointly Modeling Inter-Slot Relations by Random Walk on Knowledge Graphs for Unsupervised Spoken Language Understanding,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’15), Denver, Colorado, USA, 2015.
– Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky, “Learning Semantic Hierarchy for Unsupervised Slot Induction and Spoken Language Understanding,” in Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH’15), Dresden, Germany, 2015.
• Chapter 5 - Surface Form Derivation for Knowledge Acquisition
This chapter focuses on deriving the surface forms conveying semantics for entities from the given ontology, where the derived information contributes to better understanding.
Some of this work has been published in:
– Yun-Nung Chen, Dilek Hakkani-Tür, and Gokhan Tur, “Deriving Local Relational Surface Forms from Dependency-Based Entity Embeddings for Unsupervised Spoken Language Understanding,” in Proceedings of the 2014 IEEE Workshop on Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.
• Chapter 6 - Semantic Decoding in SLU Modeling
This chapter focuses on decoding users’ spoken language into corresponding semantic forms, which corresponds to the goal of SLU. Some of these contributions have been presented in the following publication:
– Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky, “Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), Beijing, China, 2015.
• Chapter 7 - Intent Prediction in SLU Modeling
This chapter focuses on modeling user intents in SLU, so that the SDS is able to predict the users’ follow-up actions and further provide better interactions. Some of the contributions have been presented in the following publications [28, 36, 37, 42]:
– Yun-Nung Chen and Alexander I. Rudnicky, “Dynamically Supporting Unexplored Domains in Conversational Interactions by Enriching Semantics with Neural Word Embeddings,” in Proceedings of the 2014 IEEE Workshop on Spoken Language Technology (SLT’14), South Lake Tahoe, Nevada, USA, 2014.
– Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman,
“Leveraging Behavioral Patterns of Mobile Applications for Personalized Spoken Language Understanding,” in Proceedings of the 17th ACM International Conference on Multimodal Interaction (ICMI’15), Seattle, Washington, USA, 2015.
– Yun-Nung Chen, Ming Sun, and Alexander I. Rudnicky, “Matrix Factorization with Domain Knowledge and Behavioral Patterns for Intent Modeling,” in Extended Abstract of the 29th Annual Conference on Neural Information Processing Systems – Machine Learning for Spoken Language Understanding and Interactions Workshop (NIPS-SLU’15), Montreal, Canada, 2015.
– Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman, “Unsupervised User Intent Modeling by Feature-Enriched Matrix Factorization,” in Proceedings of the 41st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’16), Shanghai, China, 2016.
• Chapter 8 - SLU in Human-Human Conversations
This chapter investigates the feasibility of applying the technologies developed for human-machine interactions to human-human interactions, expanding the application usage to more practical and broader genres. Part of the research work has been presented in the following publications [34, 35]:
– Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He, “Detecting Actionable Items in Meetings by Convolutional Deep Structured Semantic Models,” in Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’15), Scottsdale, Arizona, 2015.
– Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He, “Learning Bidirectional Intent Embeddings by Convolutional Deep Structured Semantic Models for Spoken Language Understanding,” in Extended Abstract of the 29th Annual Conference on Neural Information Processing Systems – Machine Learning for Spoken Language Understanding and Interactions Workshop (NIPS-SLU’15), Montreal, Canada, 2015.
• Chapter 9 - Conclusions and Future Work
This chapter summarizes the main contributions and discusses a number of interesting directions that can be explored in the future.
Background and Related Work
“Everything that needs to be said has already been said. But since no one was listening, everything must be said again.”
André Gide, Nobel Prize in Literature winner
With the emerging trend of using mobile devices, spoken dialogue systems (SDSs) are being incorporated into many kinds of devices (e.g. smartphones, smart TVs, navigation systems). In the architecture of SDSs, spoken language understanding (SLU) plays an important role, and there remain many unsolved challenges. The next section first introduces a typical pipeline of an SDS and elaborates the functionality of each individual component. Section 2.2 details how SLU works with different examples, reviews the related literature, and discusses its pros and cons; following the literature review, we briefly sketch the idea of the proposed approaches and how they relate to prior studies. Then the semantic resources used to benefit language understanding are introduced: resources with explicit semantics, Ontology and Knowledge Base, are presented in Section 2.3, and implicit semantics based on the theory of Distributional Semantics is presented in Section 2.4.
2.1 Spoken Dialogue System (SDS)
A typical SDS is composed of a recognizer, a spoken language understanding (SLU) module, a dialogue manager (DM), and an output manager. Figure 2.1 illustrates the system pipeline.
The functionality of each component is summarized below.
• Automatic Speech Recognizer (ASR)
The ASR component takes raw audio signals and transcribes them into word hypotheses with confidence scores. The top hypothesis is then passed to the next component.
• Spoken Language Understanding (SLU)
The goal of SLU is to capture the core semantics given the input word hypotheses, and
Figure 2.1: The typical pipeline in a dialogue system.
the extracted information can be populated into task-specific arguments in a given semantic frame. Therefore, the task of an SLU module is to identify user intents and fill associated slots based on the word hypotheses. This procedure is also called semantic parsing or semantic decoding. The SLU component typically includes an intent detector and slot taggers. An example utterance “I want to fly to Taiwan from Pittsburgh next week” can be parsed into find_flight(origin=“Pittsburgh”, destination=“Taiwan”, departure_date=“next week”), where find_flight is classified by the intent detector, and the associated slots are later filled by the slot taggers based on the detected intent. This component also estimates confidence scores of the decoded semantic representations for use by the next component.
• Dialogue Manager (DM) / Task Manager
Subsequent to the SLU processing, the DM interacts with users to assist them in achieving their goals. Given the above example, the DM should check whether the required slots are properly assigned (departure_date may not be properly specified) and then decide the system’s action, such as ask_date or return_flight(origin=“Pittsburgh”, destination=“Taiwan”). This procedure may access knowledge bases as retrieval databases to acquire the desired information. Because of possible misrecognition and misunderstanding errors, this procedure involves dialogue state tracking and policy selection to make more robust decisions [85, 164].
• Output Manager / Output Generator
Traditional dialogue systems were mostly used through phone calls, so the output manager mainly interacted with two modules, a natural language generation (NLG) module and a speech synthesizer. However, with the increasing usage of various multimedia devices (e.g. smartphones, smartwatches, and smart TVs), the output manager no longer needs to focus only on generating spoken responses. Instead, the recent trend is moving toward displaying responses via different channels; for example, the utterance “Play Lady Gaga’s Bad Romance.” should correspond to an output action that launches a music player and
then plays the specified song. Hence an additional component, multimedia response, is introduced in the infrastructure in order to handle diverse multimedia outputs.
– Multimedia Response
Given the decided action, a multimedia response module considers which channel is most suitable to present the returned information based on environmental contexts, user preferences, and the devices in use. For example, return_flight(origin=“Pittsburgh”, destination=“Taiwan”) can be presented through visual responses by listing the flights that satisfy the requirement on desktops, laptops, etc., and through spoken responses by uttering “There are seven flights from Pittsburgh to Taiwan. First is ...” on smartwatches.
– Natural Language Generation (NLG)
Given the current dialogue strategy, the NLG component generates the corresponding natural language responses that humans can understand, for the purpose of natural dialogues. For example, an action from the DM, ask_date, can generate the response “Which date do you plan to fly?”. The responses can be template-based or generated by statistical models [29, 163].
– Speech Synthesizer / Text-to-Speech (TTS)
In order to communicate with users via speech, a speech synthesizer simulates human speech based on the natural language responses generated by the NLG component.
All basic components in a dialogue system interact with each other, so errors may propagate and result in poor performance. In addition, several components (e.g. the SLU module) need to incorporate domain knowledge in order to handle task-specific dialogues. Because domain knowledge is usually predefined by experts or developers, as more and more domains emerge, making SLU scalable has been a main challenge of SDS development.
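To make the interaction between the SLU output and the dialogue manager concrete, the sketch below checks required slots for the running flight example and decides the next system action. The names (find_flight, ask_departure_date) follow the examples above; the logic is a deliberate simplification that omits confidence scores, state tracking, and policy learning.

```python
# Minimal dialogue-manager logic: given an SLU semantic frame, either ask
# for a missing required slot or issue the task action. Error handling and
# dialogue state tracking are omitted for brevity.

REQUIRED_SLOTS = {"find_flight": ["origin", "destination", "departure_date"]}

def dialogue_manager(frame):
    intent, slots = frame["intent"], frame["slots"]
    for slot in REQUIRED_SLOTS.get(intent, []):
        if slot not in slots:
            return "ask_" + slot  # e.g. ask_departure_date
    return "return_" + intent.split("_", 1)[1]  # e.g. return_flight

frame = {"intent": "find_flight",
         "slots": {"origin": "Pittsburgh", "destination": "Taiwan"}}
print(dialogue_manager(frame))  # ask_departure_date
```

Once the user supplies the missing departure date, the same logic yields the return_flight action, which the output manager would then render through an appropriate channel.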
2.2 Spoken Language Understanding (SLU)
In order to allow machines to understand natural language, a semantic representation1 is introduced. A semantic representation of an utterance carries its core content, so that the actual meaning behind the utterance can be inferred through the representation alone. For example, an utterance “show me action movies directed by james cameron” can be represented
1In this document, we use the terms “semantic representation” and “semantic form” interchangeably.
as action=“show”, target=“movie”, genre=“action”, director=“james cameron”. Another utterance “find a cheap taiwanese restaurant in oakland” can be formed as action=“find”, target=“restaurant”, price=“cheap”, type=“taiwanese”, location=“oakland”. The semantic representations convey the core meaning of the utterances, which can be more easily processed by machines. The semantic representation is not unique, and there are several forms for representing meanings. Below we describe two types of semantic forms:
• Slot-Based Semantic Representation
The slot-based representation is a flat structure of semantic concepts, usually used in simpler tasks. The above examples are slot-based semantic representations, where the semantic concepts are action, target, location, price, etc.
• Relation-Based Semantic Representation
The relation-based representation includes structured concepts, which are usually used in tasks that have more complicated dependency relations. For instance, “show me action movies directed by james cameron” can be represented as movie.directed_by, movie.genre, director.name=“james cameron”, genre.name=“action”. This representation is equivalent to movie.directed_by(?, “james cameron”) ∧ movie.genre(?, “action”), which originates from the logic forms used in the artificial intelligence field. The semantic slots of the slot-based representation are formed as relations here.
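The two semantic forms can be related programmatically. The sketch below converts the flat slot-based frame of the movie example into relation-style pairs; the slot-to-relation mapping table is a hypothetical fragment introduced for illustration, not a schema from any real knowledge base.

```python
# Convert a slot-based representation into relation-based pairs.
# Slots without a relation counterpart (e.g. the surface action) are dropped.

SLOT_TO_RELATION = {
    "director": "movie.directed_by",
    "genre": "movie.genre",
}

def to_relations(slots):
    return [(SLOT_TO_RELATION[s], v) for s, v in slots.items()
            if s in SLOT_TO_RELATION]

slots = {"action": "show", "target": "movie",
         "genre": "action", "director": "james cameron"}
print(to_relations(slots))
# [('movie.genre', 'action'), ('movie.directed_by', 'james cameron')]
```

The output mirrors the relation-based form above, with each semantic slot recast as a relation between the queried movie and a value.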
The main purpose of an SLU component is to convert natural language into semantic forms. In the natural language processing (NLP) field, natural language understanding (NLU) also refers to semantic decoding or semantic parsing. Therefore, this section reviews the related literature and how prior studies approach the language understanding problem. After that, the following chapters focus on addressing the challenges that arise when building an SDS, namely:
• How can we define semantic elements from unlabeled data to form a semantic schema?
• How can we organize semantic elements and then form a meaningful structure?
• How can we decode the semantics of test data while also accounting for noise?
• How can we utilize the acquired information to predict user intents for improving system performance?
2.2.1 Leveraging External Resources
Building semantic parsing systems requires large training data with detailed annotations.
With rich web-scale resources, much NLP research has therefore leveraged external human knowledge resources for semantic parsing. For example, Berant et al. proposed SEMPRE, which used web-scale knowledge bases to train a semantic parser. Das et al. proposed SEMAFOR, which utilized a lexicon developed based on a linguistic theory, Frame Semantics, to train a semantic parser. However, such NLP tasks deal with individual and focused problems, ignoring how the parsing results are used by applications.
Tur et al. were among the first to consider unsupervised approaches for SLU, exploiting query logs for slot filling [152, 154]. In a subsequent study, Heck and Hakkani-Tür studied the Semantic Web for the unsupervised intent detection problem in SLU, showing that results obtained from an unsupervised training process align well with the performance of traditional supervised learning. Following this success of unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection [32, 125, 75], entity extraction, and extending domain coverage [41, 28, 57]. Section 2.3 will introduce the exploited knowledge resources and the corresponding analyzers in detail. However, most of the prior studies considered semantic elements independently or only considered the relations appearing in the external resources, so the structure of concepts used by real users might be ignored.
2.2.2 Structure Learning and Inference
From a knowledge management perspective, empowering dialogue systems with large knowledge bases is of crucial significance to modern SDSs. While leveraging external knowledge is the trend, efficient inference algorithms, such as random walks, are still less studied for direct inference on knowledge graphs of spoken contents. In the NLP literature, Lao et al. used a random walk algorithm to construct inference rules on large entity-based knowledge bases, and leveraged syntactic information for reading the web [101, 102]. Even though this work makes important contributions, the proposed algorithm cannot learn mutually-recursive relations and does not consider lexical items. In fact, more and more studies show that, in addition to semantic knowledge graphs, lexical knowledge graphs that model surface-level natural language realization, multiword expressions, and context are also critical for short text understanding [91, 109, 110, 145, 158].
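The random-walk inference discussed above can be sketched as a random walk with restart over a tiny graph: repeatedly distributing probability mass along edges while restarting at a seed node yields relevance scores for the other nodes. The three-node slot graph below is invented purely for illustration and is far smaller than the knowledge graphs used in practice.

```python
# Random walk with restart on a tiny graph: scores measure how strongly
# each node relates to the seed. Real systems run this on large knowledge
# graphs with typed edges; here edges are plain adjacency lists.

def random_walk_with_restart(adj, seed, alpha=0.85, iters=50):
    nodes = sorted(adj)
    scores = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iters):
        # Restart mass goes back to the seed; the rest flows along edges.
        new = {n: (1 - alpha) * (1.0 if n == seed else 0.0) for n in nodes}
        for n in nodes:
            neighbors = adj[n]
            for m in neighbors:
                new[m] += alpha * scores[n] / len(neighbors)
        scores = new
    return scores

# Hypothetical slot graph: price and area both relate to restaurant.
adj = {"restaurant": ["price", "area"],
       "price": ["restaurant"],
       "area": ["restaurant"]}
scores = random_walk_with_restart(adj, seed="restaurant")
print(max(scores, key=scores.get))  # restaurant
```

Because price and area occupy symmetric positions relative to the seed, they receive equal scores; the seed itself scores highest, as expected for a restart-based walk.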
2.2.3 Neural Model and Representation
With the recently emerging trend of neural systems, a lot of work has shown the success of applying neural-based models to SLU. Tur et al. showed that deep convex networks are effective for building better semantic utterance classification systems. Following this success, Deng et al. further demonstrated the effectiveness of applying the kernel trick to build better deep convex networks for SLU. Nevertheless, most of this work used neural-based representations for supervised tasks, so there is a gap between the approaches used for supervised and unsupervised tasks.
In addition, Mikolov et al. recently proposed recurrent neural network based language models to capture long-distance dependencies, achieving state-of-the-art performance in speech recognition [114, 116]. The proposed continuous representations, known as word embeddings, have further boosted state-of-the-art results in many applications, such as sentiment analysis, sentence completion, and relation detection [32, 117, 144]. The details of distributional representations will be described in Section 2.4. Despite these advances on several NLP tasks, how unsupervised SLU can incorporate neural representations remains unknown.
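Distributional similarity between word embeddings is typically measured with the cosine of the angle between vectors. The sketch below shows only the computation; the 3-dimensional vectors are invented for illustration, whereas trained embeddings have hundreds of dimensions.

```python
import math

# Cosine similarity between word vectors: semantically related words should
# score higher than unrelated ones. The vectors here are hand-made toys,
# not embeddings learned from data.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "restaurant": [0.1, 0.9, 0.3],
}
print(cosine(vectors["cheap"], vectors["inexpensive"]) >
      cosine(vectors["cheap"], vectors["restaurant"]))  # True
```

This near-synonym behavior is exactly what later chapters exploit when enriching induced slots with distributional semantics.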
2.2.4 Latent Variable Modeling
Most of the studies above did not explicitly learn latent factor representations from data, so they may neglect errors (e.g. misrecognition) and thus produce unreliable SLU results.
Early studies on latent variable modeling in speech include the classic hidden Markov model for statistical speech recognition. Recently, Celikyilmaz et al. were the first to study the intent detection problem using query logs and a discrete Bayesian latent variable model.
In the field of dialogue modeling, the partially observable Markov decision process (POMDP) model is a popular technique for dialogue management [164, 172], reducing the cost of handcrafted dialogue managers while providing robustness against speech recognition errors. More recently, Tur et al. used a semi-supervised LDA model to show improvement on the slot filling task. Also, Zhai and Williams proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results. However, for unsupervised SLU, it remains unclear how to take latent semantics into account.
2.2.5 The Proposed Method
Towards unsupervised SLU, this dissertation proposes an SLU model to integrate the advantages of prior studies and overcome the disadvantages mentioned above. The model leverages external knowledge while combining frame semantics and distributional semantics, and learns latent feature representations while taking various local and global lexical, syntactic, and semantic relations into account in an unsupervised manner. The details will be presented in the following chapters.
Table 2.1: The frame example defined in FrameNet.
Frame: Food
Semantic type: physical object
Lexical units (LU): noun: almond, apple, banana, basil, beef, beer, berry, ...
Frame elements (FE): constituent parts, descriptor, type
2.3 Domain Ontology and Knowledge Base
There are two main types of knowledge resources available, generic concept and entity-based, both of which may benefit SLU modules for SDSs. Below we first introduce the detailed definition of each resource and discuss the corresponding work and tools that are useful for leveraging such resources.
For generic concept and entity-based knowledge bases, the former covers concepts that are more common, such as a food domain and a weather domain. The latter usually contains many named entities that are specific to certain domains, for example, a movie domain and a music domain. The following describes examples of these knowledge resources, which contain rich semantics and may be beneficial for understanding tasks.
2.3.1.1 Generic Concept Knowledge
There are two semantic knowledge resources for generic concepts, FrameNet and Abstract Meaning Representation (AMR).
• FrameNet is a linguistically semantic resource that offers annotations of predicate-argument semantics and associated lexical units for English. FrameNet is developed based on the semantic theory of Frame Semantics. For example, the phrase “low fat milk” should be analyzed with “milk” evoking the food frame, where “low fat” fills the descriptor frame element (FE) of that frame and the word “milk” is the actual lexical unit (LU). A defined frame example is shown in Table 2.1.
• Abstract Meaning Representation (AMR) is a semantic representation language covering the meanings of thousands of annotated English sentences. Each AMR is a single-rooted, directed graph. AMRs include PropBank semantic roles, within-sentence coreference, named entities and types, modality, negation, questions, quantities, etc. The AMR
The boy wants to go.

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
      :ARG0 b))
Figure 2.2: A sentence example in AMR Bank.
Google Knowledge Graph Bing Satori Freebase
Figure 2.3: Three famous semantic knowledge graph examples (Google’s Knowledge Graph, Bing Satori, and Freebase) corresponding to the entity “Lady Gaga”.
feature structure graph of an example sentence is illustrated in Figure 2.2, where the
“boy” appears twice, once as the ARG0 of “want-01 ”, and once as the ARG0 of “go-01 ”.
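The reentrancy described above (the variable for “boy” being reused) can be sketched by encoding the AMR as a set of triples. The triple encoding here is an assumption for illustration; the variable and role names follow Figure 2.2.

```python
# Illustrative sketch: the AMR of "The boy wants to go" encoded as
# (source, role, target) triples over a rooted, directed graph.

triples = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("w", ":ARG0", "b"),  # the boy is the wanter
    ("w", ":ARG1", "g"),  # the going event is what is wanted
    ("g", ":ARG0", "b"),  # the boy is also the goer (reentrancy)
]

# AMR captures within-sentence coreference by reusing one node:
# the variable "b" appears as an ARG0 target twice.
reentrant = [t for (_, r, t) in triples if r != ":instance" and t == "b"]
print(len(reentrant))  # 2
```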
2.3.1.2 Entity-Based Knowledge
• Semantic Knowledge Graph is a knowledge base that provides structured and detailed information about a topic, along with lists of related links. Three different knowledge graph examples, Google’s Knowledge Graph5, Microsoft’s Bing Satori, and Freebase, are shown in Figure 2.3. A semantic knowledge graph is defined by a schema and composed of nodes and edges connecting the nodes, where each node represents an entity type and the edge between each node pair describes their relation, called a property. An example from Freebase is shown in Figure 2.4, where nodes represent core entity types for the movie domain. The domains in the knowledge graphs span the web, from “American Football” to “Zoos and Aquariums”.
Figure 2.4: A portion of the Freebase knowledge graph related to the movie domain.
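The schema structure described above (entity-type nodes connected by named properties) can be sketched in a few lines. The node and property names below are assumptions loosely modeled on the movie-domain example in Figure 2.4, not the actual Freebase schema.

```python
# Illustrative sketch of a semantic knowledge graph: entity-type nodes
# connected by named properties (edges). Names are hypothetical.
from collections import defaultdict

class KnowledgeGraph:
    """Entity-type nodes connected by named properties (edges)."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(property, node)]

    def add(self, subj, prop, obj):
        self.edges[subj].append((prop, obj))

    def related(self, subj, prop):
        """Return all nodes linked from `subj` via the property `prop`."""
        return [o for (p, o) in self.edges[subj] if p == prop]

kg = KnowledgeGraph()
kg.add("movie", "directed_by", "director")
kg.add("movie", "starring", "actor")
kg.add("director", "won", "award")

print(kg.related("movie", "starring"))  # ['actor']
```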
I want to find some inexpensive and very fancy bars in north.
(evoked frames: expensiveness, degree, building, part_orientational)
Figure 2.5: An example of FrameNet categories for an ASR output labelled by probabilistic frame-semantic parsing.
• Wikipedia6 is a free-access, free-content Internet encyclopedia that contains a large number of pages/articles related to specific entities. It provides basic background knowledge that helps understanding tasks in the natural language processing (NLP) field.
2.3.2 Knowledge-Based Semantic Analyzer
With the available knowledge resources mentioned above, there is much work that utilizes such knowledge for different tasks. These prior approaches and tools can serve as analyzers and facilitate the target task of this dissertation, unsupervised SLU for dialogue systems.
2.3.2.1 Generic Concept Knowledge
SEMAFOR7 is a state-of-the-art semantic parser for frame-semantic parsing [48, 49].
Trained on manually annotated sentences in FrameNet, SEMAFOR is relatively accurate in predicting semantic frames, FEs, and LUs from raw text. SEMAFOR is augmented with dual decomposition techniques in decoding, and thus produces semantically-