
then plays the specified song. Hence an additional component, the multimedia response, is introduced into the infrastructure to handle diverse multimedia outputs.

– Multimedia Response

Given the decided action, the multimedia response component determines which channel is more suitable for presenting the returned information, based on environmental contexts, user preferences, and the devices in use. For example, return flight(origin=“Pittsburgh”, destination=“Taiwan”) can be presented as a visual response by listing the flights that satisfy the requirement on desktops, laptops, etc., or as a spoken response by uttering “There are seven flights from Pittsburgh to Taiwan. First is ...” on smartwatches.
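A minimal sketch of such channel selection is given below; the device categories, function names, and rendering logic are illustrative assumptions rather than part of the described system.

```python
# Illustrative sketch: choosing a presentation channel for a system action.
# Device names and the screen-size heuristic are assumptions for clarity.

def choose_channel(device):
    """Return 'visual' or 'spoken' depending on the device in use."""
    large_screen_devices = {"desktop", "laptop", "tablet"}
    if device in large_screen_devices:
        return "visual"   # e.g., render a list of matching flights
    return "spoken"       # e.g., read a short summary aloud on a smartwatch

def render(action_result, channel):
    flights = action_result["flights"]
    if channel == "visual":
        return "\n".join(f"{i + 1}. {f}" for i, f in enumerate(flights))
    summary = (f"There are {len(flights)} flights from "
               f"{action_result['origin']} to {action_result['destination']}.")
    return summary + " First is " + flights[0] + " ..."

result = {"origin": "Pittsburgh", "destination": "Taiwan",
          "flights": ["UA 123", "BR 28", "CI 34"]}
print(render(result, choose_channel("smartwatch")))
```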

– Natural Language Generation (NLG)

Given the current dialogue strategy, the NLG component generates the corresponding natural language responses that humans can understand, for the purpose of natural dialogues. For example, an action from the DM, ask date, can generate the response “Which date do you plan to fly?”. The responses can be template-based or generated by statistical models [29, 163].
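As a rough illustration, template-based NLG can be as simple as mapping each DM action to a fixed sentence pattern; the action names and templates below are assumed for the example.

```python
# Minimal template-based NLG sketch: map dialogue-manager actions to surface text.
# The action inventory and templates are illustrative assumptions.

TEMPLATES = {
    "ask_date": "Which date do you plan to fly?",
    "ask_destination": "Where would you like to fly to?",
    "confirm_flight": "You want a flight from {origin} to {destination}, correct?",
}

def generate(action, **slots):
    """Fill the template associated with the given action."""
    return TEMPLATES[action].format(**slots)

print(generate("ask_date"))
print(generate("confirm_flight", origin="Pittsburgh", destination="Taiwan"))
```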

– Speech Synthesizer / Text-to-Speech (TTS)

In order to communicate with users via speech, a speech synthesizer simulates human speech based on the natural language responses generated by the NLG component.

All basic components in a dialogue system interact with each other, so errors may propagate and result in poor overall performance. In addition, several components (e.g., the SLU module) need to incorporate domain knowledge in order to handle task-specific dialogues. Because domain knowledge is usually predefined by experts or developers, making SLU scalable to an increasing number of domains has been a main challenge of SDS development.
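The sketch below shows how such a component chain can be wired together, with each stage consuming the previous stage's output, so an early error (e.g., a misrecognized word) propagates downstream; all component implementations are stand-in stubs, and the exact chain assumed here follows the usual ASR, SLU, DM, NLG ordering.

```python
# Minimal pipeline sketch: each stage consumes the previous stage's output,
# so an error introduced early (e.g., misrecognition) propagates downstream.
# Every component below is a stub used only to show the data flow.

def asr(audio):   # speech recognition (stub)
    return "find flights to taiwan"

def slu(text):    # spoken language understanding (stub)
    return {"action": "find", "target": "flight", "destination": "taiwan"}

def dm(frame):    # dialogue management (stub)
    return "ask_date" if "date" not in frame else "return_flights"

def nlg(action):  # natural language generation (stub)
    return {"ask_date": "Which date do you plan to fly?"}.get(action, "...")

def pipeline(audio):
    return nlg(dm(slu(asr(audio))))

print(pipeline(b"raw audio"))
```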

as action=“show”, target=“movie”, genre=“action”, director=“james cameron”. Another utterance, “find a cheap taiwanese restaurant in oakland”, can be formed as action=“find”, target=“restaurant”, price=“cheap”, type=“taiwanese”, location=“oakland”. The semantic representations convey the core meaning of the utterances, which can be more easily processed by machines. The semantic representation is not unique, and there are several forms for representing meanings. Below we describe two types of semantic forms:

• Slot-Based Semantic Representation

The slot-based representation is a flat structure of semantic concepts and is usually used in simpler tasks. The above examples belong to slot-based semantic representations, where the semantic concepts are action, target, location, price, etc.

• Relation-Based Semantic Representation

The relation-based representation includes structured concepts and is usually used in tasks with more complicated dependency relations. For instance, “show me action movies directed by james cameron” can be represented as movie.directed by, movie.genre, director.name=“james cameron”, genre.name=“action”. This representation is equivalent to movie.directed by(?, “james cameron”) ∧ movie.genre(?, “action”), which originates from logical forms in the artificial intelligence field. The semantic slots in the slot-based representation are formed as relations here. A small sketch contrasting the two forms follows this list.
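The following sketch contrasts the two forms for the movie example; the dictionary and tuple encodings (including the underscored relation names) are one possible machine-readable rendering, not a formalism used later in this dissertation.

```python
# "show me action movies directed by james cameron"

# Slot-based representation: a flat set of concept-value pairs.
slot_based = {
    "action": "show",
    "target": "movie",
    "genre": "action",
    "director": "james cameron",
}

# Relation-based representation: structured relations over entities,
# i.e., movie.directed_by(?, "james cameron") AND movie.genre(?, "action").
relation_based = [
    ("movie.directed_by", "?movie", "james cameron"),
    ("movie.genre", "?movie", "action"),
]
```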

The main purpose of an SLU component is to convert natural language into semantic forms. In the natural language processing (NLP) field, natural language understanding (NLU) also refers to semantic decoding or semantic parsing. Therefore, this section reviews the related literature and how prior work approaches the language understanding problem. After that, the following chapters focus on addressing the challenges of building an SDS, namely:

• How can we define semantic elements from unlabeled data to form a semantic schema?

• How can we organize semantic elements and then form a meaningful structure?

• How can we decode semantics for test data while taking noise into account at the same time?

• How can we utilize the acquired information to predict user intents for improving system performance?

2.2.1 Leveraging External Resources

Building semantic parsing systems requires large training data with detailed annotations. With rich web-scale resources, a lot of NLP research has therefore leveraged external human knowledge resources for semantic parsing. For example, Berant et al. proposed SEMPRE (http://www-nlp.stanford.edu/software/sempre/), which used web-scale knowledge bases to train the semantic parser [10]. Das et al. proposed SEMAFOR (http://www.ark.cs.cmu.edu/SEMAFOR/), which utilized a lexicon developed based on a linguistic theory, Frame Semantics, to train the semantic parser [50]. However, such NLP tasks deal with individual and focused problems, ignoring how the parsing results are used by applications.

Tur et al. were among the first to consider unsupervised approaches for SLU, exploiting query logs for slot filling [152, 154]. In a subsequent study, Heck and Hakkani-Tür studied the Semantic Web for an unsupervised intent detection problem in SLU, showing that results obtained from the unsupervised training process align well with the performance of traditional supervised learning [82]. Following the success of unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection [32, 125, 75], entity extraction [159], and extending domain coverage [41, 28, 57]. Section 2.3 will introduce the exploited knowledge resources and the corresponding analyzers in detail. However, most of the prior studies considered semantic elements independently or only considered the relations appearing in the external resources, so the structure of concepts used by real users might be ignored.

2.2.2 Structure Learning and Inference

From a knowledge management perspective, empowering dialogue systems with large knowledge bases is of crucial significance to modern SDSs. While leveraging external knowledge is the trend, efficient inference algorithms, such as random walks, are still less studied for direct inference on knowledge graphs of spoken contents. In the NLP literature, Lao et al. used a random walk algorithm to construct inference rules on large entity-based knowledge bases, and leveraged syntactic information for reading the web [101, 102]. Even though this work has important contributions, the proposed algorithm cannot learn mutually-recursive relations and does not consider lexical items; in fact, more and more studies show that, in addition to semantic knowledge graphs, lexical knowledge graphs that model surface-level natural language realization, multiword expressions, and context are also critical for short text understanding [91, 109, 110, 145, 158].
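To make the idea concrete, the sketch below performs plain uniform random walks over a toy knowledge graph and estimates how often each node is reached from a seed entity; it only illustrates the general flavor of random-walk inference, not the specific path-ranking algorithm of Lao et al., and the graph itself is made up.

```python
import random
from collections import Counter

# Toy knowledge graph as an adjacency list; entities and edges are made up.
graph = {
    "james_cameron": ["avatar", "titanic"],
    "avatar": ["james_cameron", "action"],
    "titanic": ["james_cameron", "romance"],
    "action": ["avatar"],
    "romance": ["titanic"],
}

def random_walk_scores(seed, steps=5, walks=1000):
    """Estimate how often each node is visited by short walks from the seed."""
    visits = Counter()
    for _ in range(walks):
        node = seed
        for _ in range(steps):
            node = random.choice(graph[node])
            visits[node] += 1
    total = sum(visits.values())
    return {n: c / total for n, c in visits.most_common()}

print(random_walk_scores("james_cameron"))
```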

2.2.3 Neural Model and Representation

With the recently emerging trend of neural systems, a lot of work has shown the success of applying neural-based models in SLU. Tur et al. have shown that deep convex networks are effective for building better semantic utterance classification systems [153]. Following their success, Deng et al. have further demonstrated the effectiveness of applying the kernel trick to build better deep convex networks for SLU [54]. Nevertheless, most of this work used neural-based representations for supervised tasks, so there is a gap between approaches used for supervised and unsupervised tasks.

In addition, Mikolov et al. recently proposed recurrent neural network based language models to capture long-distance dependencies and achieved state-of-the-art performance in speech recognition [114, 116]. The resulting continuous representations, known as word embeddings, have further boosted state-of-the-art results in many applications, such as sentiment analysis, sentence completion, and relation detection [32, 117, 144]. The details of distributional representations will be described in Section 2.4. Despite these advances on several NLP tasks, how unsupervised SLU can incorporate neural representations remains unknown.
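For intuition, the sketch below computes cosine similarity between word vectors; the three-dimensional vectors are toy values, whereas embeddings learned by the models above typically have hundreds of dimensions.

```python
import math

# Toy word embeddings; real vectors are learned from large corpora.
embeddings = {
    "flight": [0.9, 0.1, 0.3],
    "airfare": [0.8, 0.2, 0.4],
    "restaurant": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["flight"], embeddings["airfare"]))     # high similarity
print(cosine(embeddings["flight"], embeddings["restaurant"]))  # low similarity
```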

2.2.4 Latent Variable Modeling

Most of the studies above did not explicitly learn latent factor representations from data, so they may neglect errors (e.g., misrecognition) and thus produce unreliable SLU results [12]. Early studies on latent variable modeling in speech included the classic hidden Markov model for statistical speech recognition [94]. Recently, Celikyilmaz et al. were the first to study the intent detection problem using query logs and a discrete Bayesian latent variable model [23]. In the field of dialogue modeling, the partially observable Markov decision process (POMDP) is a popular technique for dialogue management [164, 172], reducing the cost of hand-crafted dialogue managers while providing robustness against speech recognition errors. More recently, Tur et al. used a semi-supervised LDA model to show improvement on the slot filling task [155]. Also, Zhai and Williams proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results [174]. However, for unsupervised SLU, it is unclear how to take latent semantics into account.
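As a small illustration of latent variable modeling, the sketch below runs the forward algorithm of a two-state HMM whose hidden states could stand for latent semantic labels; the states, words, and probabilities are all made-up toy values.

```python
# Forward algorithm for a tiny HMM: hidden states act as latent labels.
# Transition and emission probabilities are toy values for illustration only.

states = ["FROM_CITY", "TO_CITY"]
start = {"FROM_CITY": 0.6, "TO_CITY": 0.4}
trans = {"FROM_CITY": {"FROM_CITY": 0.3, "TO_CITY": 0.7},
         "TO_CITY":   {"FROM_CITY": 0.2, "TO_CITY": 0.8}}
emit = {"FROM_CITY": {"pittsburgh": 0.7, "taiwan": 0.3},
        "TO_CITY":   {"pittsburgh": 0.2, "taiwan": 0.8}}

def likelihood(words):
    """P(word sequence), marginalized over all hidden state sequences."""
    alpha = {s: start[s] * emit[s][words[0]] for s in states}
    for w in words[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][w]
                 for s in states}
    return sum(alpha.values())

print(likelihood(["pittsburgh", "taiwan"]))
```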

2.2.5 The Proposed Method

Towards unsupervised SLU, this dissertation proposes an SLU model to integrate the advantages of prior studies and overcome the disadvantages mentioned above. The model leverages external knowledge while combining frame semantics and distributional semantics, and learns latent feature representations while taking various local and global lexical, syntactic, and semantic relations into account in an unsupervised manner. The details will be presented in the following chapters.

Table 2.1: The frame example defined in FrameNet.

Frame: Food
Semantics: physical object
Lexical units (noun): almond, apple, banana, basil, beef, beer, berry, ...
Frame elements: constituent parts, descriptor, type
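One way to read this table programmatically is as a simple record; the field names below mirror the table rows rather than the official FrameNet schema.

```python
# A minimal record for the Food frame from Table 2.1; field names are assumptions.
food_frame = {
    "frame": "Food",
    "semantics": "physical object",
    "lexical_units": ["almond", "apple", "banana", "basil", "beef", "beer", "berry"],
    "frame_elements": ["constituent parts", "descriptor", "type"],
}
```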