
This chapter presents an MF approach to self-train the SLU model for semantic decoding with consideration of a well-organized ontology in an unsupervised way. The purpose of the proposed model is not only to predict the probability of each semantic slot but also to distinguish between generic semantic concepts and domain-specific concepts that are related to an SDS. The experiments show that the MF-SLU model obtains promising results on semantic decoding, outperforming strong discriminative baselines.

7 Intent Prediction in SLU Modeling

It’s really interesting to me how all of us can experience the exact same event, and yet come away with wildly disparate interpretations of what happened. We each have totally different ideas of what was said, what was intended, and what really took place.

Marya Hornbacher, Pulitzer Prize nominee

SLU modeling has two aspects: shallow understanding and deep understanding. In addition to the low-level semantic concepts obtained from semantic decoding, deeply understanding users involves higher-level intentions. Since users usually take observable actions motivated by their intents, predicting intents along with follow-up actions can be viewed as another aspect of SLU. A good SLU module is able to accurately understand both the low-level semantic meanings of utterances and high-level user intentions, which allows SDSs to further predict users’ follow-up actions in order to offer better interactions. This chapter focuses on predicting user intents, which correspond to observable actions, using a feature-enriched MF technique, in order to deeply understand users in unsupervised and semi-supervised manners.

7.1 Introduction

In a dialogue system, the SLU and DM modules play important roles because they first map utterances into semantics and then into intent-triggered actions. The semantic representations of user utterances usually refer to the specific information that directly occurs in the utterances. Deeper understanding requires considering high-level intentions as well. For example, in a restaurant domain, “find me a taiwanese restaurant” has the semantic form action=“find”, type=“taiwanese”, target=“restaurant”, and a follow-up intended action might be asking for its location or for navigation, which can be viewed as the high-level intention. Assuming an SDS is able to predict user intents (e.g. navigation), the system not only returns the user a restaurant list but also asks whether the user needs the corresponding location or navigating instructions, and then automatically launches the corresponding application (e.g. Maps), providing better conversational interactions and a friendlier user experience [147].
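As a purely illustrative sketch, the semantic form above can be represented as a flat frame, with a toy rule table standing in for the intent predictor. The function and rule names below are assumptions for illustration, not the system’s actual implementation:

```python
# Hypothetical sketch: a semantic frame for "find me a taiwanese restaurant"
# and a toy mapping from frames to high-level follow-up intents.
frame = {"action": "find", "type": "taiwanese", "target": "restaurant"}

# Toy rule table: place-finding frames suggest location-seeking follow-ups.
FOLLOW_UP_INTENTS = {
    ("find", "restaurant"): ["ask_location", "navigation"],
}

def predict_follow_up(frame):
    """Return plausible high-level intents for a decoded semantic frame."""
    return FOLLOW_UP_INTENTS.get((frame["action"], frame["target"]), [])

print(predict_follow_up(frame))  # ['ask_location', 'navigation']
```

In a real system this mapping is of course learned rather than hand-written; the table only makes the frame-to-intent relationship concrete.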

To design the SLU module of an SDS, most previous studies relied on a predefined ontology and schema to bridge intents and semantic slots [14, 41, 56, 72, 138].

Recently, dialogue systems have been appearing on smartphones, allowing users to launch applications1 via spontaneous speech. Typically, an SDS needs predefined task domains to understand the corresponding functions, such as setting an alarm clock or querying words via a browser, and each app supports a single-turn request task. However, traditional SDSs are unable to dynamically support functions provided by newly installed or not-yet-installed apps, so open-domain requests cannot be handled due to the lack of predefined ontologies. We address the following question: given an open-domain single-turn request, how can a system dynamically and effectively provide the corresponding functions to fulfill the user’s request? This chapter first focuses on understanding a user’s intent and identifying apps that can support such open-domain requests.

In addition to the difficulty caused by language ambiguity, behavioral patterns also influence user intents. Typical intelligent assistants (IAs) treat each domain (e.g. restaurant search, messaging, etc.) as independent of the others, where only the current user utterance is considered when deciding the desired apps in SLU [28]. Some IAs model user intents by using contexts from previous utterances, but they do not take into account the behavioral patterns of individual users [11]. This work improves intent prediction based on our observation that the intended apps usually depend on 1) individual preference (some people prefer Message to Email) and 2) behavioral patterns at the app level (Message is more likely to follow Camera, and Email is more likely to follow Excel). Since behavioral contexts from previous turns affect intent prediction, we refer to it as a multi-turn interaction task.

To improve understanding, some studies utilized non-verbal contexts such as eye gaze and head nods as cues to resolve referring-expression ambiguity and to improve driving performance [77, 99]. Because human users often interact with their phones to carry out complicated tasks that span multiple domains and applications, user behavioral patterns, as additional non-verbal signals, may provide deeper insights into user intents [142, 24]. For example, if a user always texts his friend via Message instead of Email right after finding a good restaurant via Yelp, this behavioral pattern helps disambiguate the apps corresponding to the communication utterance “send to alex”.

Another challenge of SLU is the inference of hidden semantics, mentioned in the previous chapter. Considering a user utterance “i would like to contact alex”, we can see that its surface patterns include explicit semantic information about “contact”; however, it also carries hidden semantic information such as “message” and “email”, since the user is likely

1In the rest of the document, we use the word “app” instead of the word “application” for simplicity.

to launch apps such as Messenger (message) or Outlook (email) even though these cues are not directly observed in the surface patterns. Traditional SLU models use discriminative classifiers to predict whether predefined slots occur in the utterances or not, ignoring hidden semantic information. However, in order to provide better interactions with users, modeling hidden intents helps predict user-desired apps. Therefore, this chapter proposes a feature-enriched MF model to learn low-rank latent features for SLU [98]. Specifically, an MF-SLU model is able to learn relations between observed and unobserved features and to estimate the probabilities of all unobserved patterns, instead of viewing them as negative instances as described in Chapter 6. The feature-enriched MF-SLU thus incorporates rich features, including semantic and behavioral cues, to infer high-level intents.
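The core idea, recovering graded scores for unobserved feature–app cells from a low-rank reconstruction rather than treating them as hard negatives, can be sketched with a truncated SVD on a toy utterance-by-feature matrix. The matrix, vocabulary, and rank below are fabricated for illustration; the actual MF-SLU model is trained differently (e.g. with implicit-feedback objectives):

```python
import numpy as np

# Toy utterance-by-feature matrix: rows are utterances, columns are
# observed cues (words) plus candidate apps; 1 = observed, 0 = unknown
# (not necessarily negative). All column names are illustrative.
cols = ["contact", "send", "restaurant", "Message", "Email", "Yelp"]
M = np.array([
    [1, 0, 0, 1, 0, 0],   # "contact alex"          -> launched Message
    [0, 1, 0, 1, 1, 0],   # "send to alex"          -> Message, Email
    [0, 0, 1, 0, 0, 1],   # "taiwanese restaurant"  -> Yelp
    [1, 1, 0, 0, 0, 0],   # "contact ... send ..."  -> app unobserved
], dtype=float)

# Rank-2 reconstruction: unobserved cells receive graded scores.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat = (U[:, :2] * s[:2]) @ Vt[:2, :]

# The last utterance shares "contact"/"send" with Message utterances, so
# its reconstructed Message score exceeds its Yelp score.
msg, yelp = cols.index("Message"), cols.index("Yelp")
print(M_hat[3, msg] > M_hat[3, yelp])
```

The reconstruction gives the fourth utterance a substantial Message score even though no app was observed for it, which is exactly the behavior the discriminative baselines lack.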

For the single-turn request task, the model takes into account app descriptions and spoken utterances along with enriched semantic knowledge in a joint fashion. More specifically, we use entity linking methods based on structured knowledge resources to locate slot fillers in a given utterance, and the types of the identified fillers are then extracted as semantic seeds to enrich the features of the utterance. In addition to slot types, the low-level semantics of the utterance are further enriched with related knowledge that is automatically extracted through neural word embeddings. Applying feature-enriched MF-SLU then enables an SDS to dynamically support non-predefined domains based on the semantics-enriched models. We evaluate the performance by examining whether the predicted apps are capable of fulfilling users’ requests.
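The embedding-based enrichment step can be sketched as follows: any vocabulary word whose vector lies close to an observed feature is added to the utterance’s feature set. The tiny hand-made vectors and the similarity threshold are assumptions for illustration, not trained embeddings:

```python
import numpy as np

# Fabricated 3-d "embeddings" standing in for trained word vectors.
emb = {
    "contact":    np.array([0.9, 0.1, 0.0]),
    "message":    np.array([0.8, 0.2, 0.1]),
    "email":      np.array([0.7, 0.3, 0.0]),
    "restaurant": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def enrich(features, threshold=0.9):
    """Add vocabulary words whose embeddings are close to observed features."""
    enriched = set(features)
    for f in features:
        for w, v in emb.items():
            if w not in enriched and cosine(emb[f], v) >= threshold:
                enriched.add(w)
    return enriched

print(sorted(enrich({"contact"})))  # ['contact', 'email', 'message']
```

Here the observed word “contact” pulls in the hidden cues “message” and “email”, mirroring the example utterance above.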

For the multi-turn interaction task, the model additionally incorporates contextual behavior history to improve intent prediction. Here we take personal app usage history into account, where the behavioral patterns are used to enrich utterance features in order to better infer user intents. Finally, the system is able to provide behavior- and context-aware personalized prediction via feature-enriched MF techniques. To evaluate the personalized performance, we examine whether the predicted apps are those the users actually launch.
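A minimal sketch of this behavior-enriched feature construction, assuming a simple `prev_app=<App>` naming scheme (our invention for illustration) for the app-history features:

```python
# Hedged sketch: enrich an utterance's feature set with the user's recent
# app launches so MF can exploit behavioral patterns such as "Message is
# more likely to follow Camera". The feature-name scheme is an assumption.
def build_features(utterance_words, app_history, history_size=2):
    """Union of lexical features and 'prev_app=<App>' behavior features."""
    feats = {f"word={w}" for w in utterance_words}
    for app in app_history[-history_size:]:
        feats.add(f"prev_app={app}")
    return feats

feats = build_features(["send", "to", "alex"], ["Yelp", "Camera"])
print(sorted(feats))
# ['prev_app=Camera', 'prev_app=Yelp', 'word=alex', 'word=send', 'word=to']
```

The resulting feature set feeds the same MF model as before; the behavior features simply occupy additional columns of the matrix.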

We evaluate the performance by examining whether the predicted applications can satisfy users’ requests. The experiments show that our MF-based approach can model user intents and allow an SDS to provide better responses for both unsupervised single-turn requests and supervised multi-turn interactions. Our contributions include:

• This is among the first attempts to apply feature-enriched MF techniques for intent modeling, incorporating different sources of rich information (app description, semantic knowledge, behavioral patterns);

• The feature-enriched MF-SLU approach jointly models spoken observations, available text information, and structured knowledge to infer user intents for single-turn requests, taking hidden semantics into account;

[Figure 7.1: task illustrations appear here; image labels omitted.]

Figure 7.1: Total 13 tasks in the corpus (only pictures are shown to subjects for making requests): 1. music listening, 2. video watching, 3. make a phone call, 4. video chat, 5. send an email, 6. text, 7. post to social websites, 8. share the photo, 9. share the video, 10. navigation, 11. address request, 12. translation, 13. read the book.

• The behavioral patterns can be incorporated into the feature-enriched MF-SLU approach to model user preference for personalized understanding in multi-turn interactions;

• Our experimental results indicate that feature-enriched MF-SLU approaches outperform most of the strong baselines and achieve better intent prediction performance.