
7.3 Feature-Enriched MF-SLU

7.3.1 Feature-Enriched Matrix Construction

7.3.1.1 Word Observation Matrix

A word observation matrix contains binary features based on n-gram word patterns. For single-turn requests, two word observation matrices are built: FwA for textual app descriptions and FwU for spoken utterances. Each row in a matrix represents an app/utterance and each column refers to an observed word pattern. In other words, FwA and FwU carry basic word vectors for all apps and all utterances respectively. Similarly, for multi-turn interactions, a word observation matrix FwU is constructed for spoken utterances. The left-most column set in Figure 7.3 illustrates lexical features for the given utterances.
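As a minimal illustrative sketch (not the exact implementation used in this work), the binary word observation matrix can be constructed by marking which vocabulary patterns occur in each app description or utterance; the unigram-only vocabulary and example utterances below are hypothetical.

```python
import numpy as np

def build_word_observation_matrix(texts, vocabulary):
    """Build a binary word observation matrix: rows are apps/utterances,
    columns are observed word patterns (unigrams here for simplicity)."""
    vocab_index = {w: j for j, w in enumerate(vocabulary)}
    F = np.zeros((len(texts), len(vocabulary)), dtype=np.int8)
    for i, text in enumerate(texts):
        for token in text.lower().split():
            j = vocab_index.get(token)
            if j is not None:
                F[i, j] = 1  # binary feature: the pattern is observed
    return F

# Hypothetical utterances and vocabulary for illustration.
utterances = ["compose an email to alex", "take this photo"]
vocabulary = ["compose", "email", "photo", "take"]
F_w_U = build_word_observation_matrix(utterances, vocabulary)  # shape (2, 4)
```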

7.3.1.2 Enriched Semantics Matrix

For single-turn requests, to include open domain knowledge based on user utterances, we utilize distributed word representations that capture syntactic and semantic relationships for acquiring domain knowledge [28, 33].

• Embedding-based semantics

We enrich original utterances with semantically similar words, where similarity is measured by word embeddings trained on all app descriptions [28, 118]. Algorithm 2 shows the procedure for acquiring semantically related knowledge that is used to enrich the original utterances.

For example, “compose an email to alex” focuses on the email writing domain, and the generated semantic seeds may include the tokens “text” and “contacting”. Word embeddings can then provide additional words with similar concepts; for instance, the nearest vectors of “text” in the continuous space include “message”, “msg”, etc. This procedure leverages rich semantics by incorporating conceptually related knowledge, so that the system can suggest appropriate apps for supporting open domain requests.

• Type-embedding-based semantics

In addition to semantically similar words, types of concepts are included to further expand semantic information. For example, “play lady gaga's bad romance” may contain the types “singer” and “song”, which improve semantic inference (domain-related cues about music playing). To obtain these types, we detect all entity mention candidates in the given utterances and use entity linking with Freebase and Wikipedia to mine entity types [28].

– Wikipedia page linking

For the set of entity mentions in a given utterance, we output a set of linked Wikipedia pages, where an integer linear programming (ILP) formulation is used to generate the mapping from mentions to Wikipedia pages [43, 133]. For each entity, we extract the definition sentence from the linked page, and then all words parsed as adjectives or nouns in the noun phrase immediately following the part-of-speech pattern (VBZ) (DT), such as “is a/an/the”, are extracted as semantic concepts. For example, the definition sentence for the entity “lady gaga” is “Stefani Joanne Angelina Germanotta, better known by her stage name Lady Gaga, is an American singer and songwriter.”, and the entity types “American singer” and “songwriter” are extracted.

– Freebase list linking

Each mention can be linked to a ranked list of Freebase nodes by the Freebase API6, and we extract the top K notable types for each entity as the acquired knowledge.

Then an enriched semantics matrix FsU can be built, where each row is an utterance and each column corresponds to a semantic element, as shown in Figure 7.3.
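A hedged sketch of the enrichment step and the construction of FsU is shown below, assuming word embeddings are available as gensim KeyedVectors trained on app descriptions; the model path, neighbor count, and semantic vocabulary are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np
from gensim.models import KeyedVectors

def enrich_utterance(tokens, embeddings, top_n=3):
    """Expand an utterance with semantically similar words found by
    nearest-neighbor search in the embedding space."""
    enriched = set(tokens)
    for token in tokens:
        if token in embeddings:
            for neighbor, _score in embeddings.most_similar(token, topn=top_n):
                enriched.add(neighbor)
    return enriched

def build_semantics_matrix(utterances, embeddings, semantic_vocab):
    """Binary matrix FsU: rows are utterances, columns are semantic elements."""
    index = {w: j for j, w in enumerate(semantic_vocab)}
    F = np.zeros((len(utterances), len(semantic_vocab)), dtype=np.int8)
    for i, utt in enumerate(utterances):
        for w in enrich_utterance(utt.lower().split(), embeddings):
            if w in index:
                F[i, index[w]] = 1
    return F

# embeddings = KeyedVectors.load("app_description_vectors.kv")  # hypothetical path
```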

For multi-turn interactions, we enrich utterances with contextual behaviors to incorporate behavioral information into personalized and context-aware intent modeling. Figure 7.3b

6https://developers.google.com/freebase/

illustrates the enriched behavioral features as FbU, where the second utterance “tell vivian this is me in the lab” involves “Camera” acquired from the previous turn “take this photo”.

The behavioral history at turn t, h_t = {a_1, ..., a_{t-1}}, is the set of apps that were previously launched in the ongoing dialogue.

7.3.1.3 Intent Matrix

To link word patterns with corresponding intents, an intended app matrix FaA is constructed, where each column corresponds to launching a specific app. Hence, an entry is 1 when the app and the intent correspond to each other, and 0 otherwise.

For unsupervised single-turn requests, to induce user intents, we use a basic retrieval model to return the top K relevant apps for each utterance u, and treat them as pseudo intended apps [28], as detailed in Section 7.4.1. Figure 7.3a includes an example utterance, “i would like to contact alex”, which is treated as a request to search for relevant apps such as “Outlook” and “Skype”. We then build an app matrix FaU with binary values based on the top relevant apps, which also denotes intent features for utterances. Note that we do not use any annotations; the app-related intents are returned by a retrieval model and may contain noise.

For personalized intent prediction in multi-turn interactions, the intent matrix can be acquired directly from users' app usage logs; FaU is illustrated in the right part of the matrix in Figure 7.3b.
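The sketch below illustrates one way the binary intent matrix could be assembled, whether the intended apps come from a retrieval model's top-K pseudo labels (single-turn) or from app usage logs (multi-turn); the helper name retrieve_top_k is hypothetical.

```python
import numpy as np

def build_intent_matrix(num_utterances, app_ids, intended_apps_per_utterance):
    """Binary intent matrix FaU: entry (i, j) is 1 if app j is an intended
    (or pseudo-intended) app for utterance i."""
    app_index = {a: j for j, a in enumerate(app_ids)}
    F = np.zeros((num_utterances, len(app_ids)), dtype=np.int8)
    for i, apps in enumerate(intended_apps_per_utterance):
        for a in apps:
            if a in app_index:
                F[i, app_index[a]] = 1
    return F

# Single-turn: pseudo intents come from the top-K apps returned by a retrieval
# model (hypothetical helper `retrieve_top_k`); multi-turn: they come from logs.
# pseudo_intents = [retrieve_top_k(u, k=10) for u in utterances]
# F_a_U = build_intent_matrix(len(utterances), app_ids, pseudo_intents)
```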

7.3.1.4 Integrated Model

As shown in Figure 7.3, we integrate word matrices, an enriched semantics matrix, and intent matrices from both apps and utterances for training an MF-SLU model. The integrated model for single-turn requests can be formulated as

\[
M = \begin{bmatrix} F_w^A & 0 & F_a^A \\ F_w^U & F_s^U & F_a^U \end{bmatrix}. \tag{7.2}
\]

Similarly, the integrated matrix for multi-turn interactions can be built as

\[
M = \begin{bmatrix} F_w^U & F_s^U & F_a^U \end{bmatrix}. \tag{7.3}
\]

Hence, the relations among word patterns, domain knowledge, and behaviors can be automatically inferred from the integrated model in a joint fashion. The goal of the feature-enriched MF-SLU model is, for a given user utterance, to predict the probability that the user intends to launch each app.
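Under the assumption that the feature blocks are dense numpy arrays, the integrated matrices of Eqs. (7.2) and (7.3) can be assembled as in the sketch below; a real implementation would likely use sparse matrices, and the enriched block may hold semantic or behavioral features depending on the setting.

```python
import numpy as np

def build_single_turn_matrix(F_w_A, F_a_A, F_w_U, F_s_U, F_a_U):
    """Assemble the integrated matrix M of Eq. (7.2): the app rows have no
    enriched-semantics block (zeros), while utterance rows carry all features."""
    zeros = np.zeros((F_w_A.shape[0], F_s_U.shape[1]), dtype=F_w_A.dtype)
    top = np.hstack([F_w_A, zeros, F_a_A])
    bottom = np.hstack([F_w_U, F_s_U, F_a_U])
    return np.vstack([top, bottom])

def build_multi_turn_matrix(F_w_U, F_s_U, F_a_U):
    """Assemble the integrated matrix M of Eq. (7.3) for multi-turn dialogues."""
    return np.hstack([F_w_U, F_s_U, F_a_U])
```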

7.3.2 Optimization Procedure

With the built matrix M, we can learn a model θ that best estimates the observed patterns by parametrizing the matrix through weights and latent component vectors, where the parameters are estimated by maximizing the log likelihood of the observed data, similar to (6.5) [46]:

\[
\begin{aligned}
\theta &= \arg\max_{\theta} \prod_{x \in X} p(\theta \mid M_x) \\
&= \arg\max_{\theta} \prod_{x \in X} p(M_x \mid \theta) \cdot p(\theta) \\
&= \arg\max_{\theta} \sum_{x \in X} \ln p(M_x \mid \theta) - \lambda_{\theta},
\end{aligned} \tag{7.4}
\]

where X denotes the set of rows. For single-turn requests, M_x is a row vector corresponding to either an app or an utterance; for multi-turn interactions, M_x corresponds to an utterance. Here the assumption is that each row (app/utterance) is independent of the others.

Similar to Chapter 6, we apply BPR to parameterize the integrated model, and create a dataset by sampling from M. In single-turn requests, for each app/utterance x and each observed fact f+ = ⟨x, y+⟩, we choose each feature y− referring to a word/semantic element that does not correspond to x, or an app that is not returned as relevant by the basic retrieval model for the utterance x. Also, in multi-turn interactions, for each utterance x (e.g. “take this photo” in Figure 7.3b) we create pairs of observed and unobserved facts, f+ = ⟨x, y+⟩ and f− = ⟨x, y−⟩, where y+ corresponds to an observed lexical/behavioral/intent feature (e.g., “photo” for lexical, “null” for behavior, “camera” for intended app) and y− corresponds to an unobserved feature (e.g. “tell”, “camera”, “email”). Then for each pair of facts f+ and f−, we maximize the margin between p(f+) and p(f−) with the objective:

\[
\sum_{x \in X} \ln p(M_x \mid \theta) = \sum_{f^+ \in O} \sum_{f^- \notin O} \ln \sigma\!\left(\theta_{f^+} - \theta_{f^-}\right). \tag{7.5}
\]

The BPR objective is an approximation to the per-utterance AUC, which correlates with well-ranked apps per utterance. The fact pairs constructed from sampled observed facts ⟨x, y+⟩ and unobserved facts ⟨x, y−⟩ form the set O, and an SGD update is applied to maximize the objective for MF [68, 134].
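A minimal sketch of such a BPR-style SGD update is given below, assuming the model scores a (row, feature) fact as the inner product of rank-k latent vectors; the learning rate, regularization, and initialization are illustrative and not taken from the original work.

```python
import numpy as np

def bpr_sgd_step(P, Q, x, y_pos, y_neg, lr=0.05, reg=0.01):
    """One BPR update: increase the margin between the scores of an observed
    feature y_pos and an unobserved feature y_neg for row x.
    P: (num_rows, k) row latent vectors; Q: (num_features, k) feature vectors."""
    p = P[x].copy()
    diff = Q[y_pos] - Q[y_neg]
    g = 1.0 / (1.0 + np.exp(p @ diff))   # = 1 - sigmoid(theta_f+ - theta_f-)
    P[x]     += lr * (g * diff - reg * P[x])
    Q[y_pos] += lr * (g * p - reg * Q[y_pos])
    Q[y_neg] += lr * (-g * p - reg * Q[y_neg])

# Hypothetical training loop over sampled fact pairs (x, y_pos, y_neg):
# rng = np.random.default_rng(0)
# P = rng.normal(scale=0.1, size=(num_rows, 64))
# Q = rng.normal(scale=0.1, size=(num_features, 64))
# for x, y_pos, y_neg in sampled_pairs:
#     bpr_sgd_step(P, Q, x, y_pos, y_neg)
```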

Finally, we can obtain estimated probabilities of various features given the current utterance, which correspond to probabilities of intended apps given the utterance, P(a | u). For single-turn requests, Figure 7.3a shows that hidden semantics, “message” and “email”, are inferred from “i would like to contact alex”, because relations between features are captured by the model based on app descriptions and previous user utterances. For multi-turn interactions, as shown in Figure 7.3b, hidden semantics (e.g. “tell”, “email”) can also be inferred from “send it to alice” in this model based on both lexical and behavioral features.

7.4 User Intent Prediction for Mobile App

For each test utterance u, with the trained MF model, we can predict the probability of each intended app a based on the observed features corresponding to the current utterance by taking into account two models, a baseline model for explicit semantics and a feature-enriched MF-SLU model for implicit semantics:

\[
\begin{aligned}
P(a \mid u) &= P_{\mathrm{exp}}(a \mid u) \times P_{\mathrm{imp}}(a \mid u) \\
&= P_{\mathrm{exp}}(a \mid u) \times P(M_{u,a} = 1 \mid \theta),
\end{aligned} \tag{7.6}
\]

where P(a | u) is the integrated probability for ranking apps, Pexp(a | u) is the probability output by the baseline model that considers explicit semantics, and Pimp(a | u) is the probability estimated by the MF-based model. The fused probabilities are able to consider hidden intents by learning latent semantics from the enriched features. The baselines for modeling explicit semantics Pexp(a | u) for single-turn requests and multi-turn interactions are described below.
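A small sketch of the fusion step in Eq. (7.6) follows, assuming the explicit and implicit probabilities have already been computed per app; the function and variable names are hypothetical.

```python
def fuse_and_rank(app_ids, p_explicit, p_implicit):
    """Rank apps for one utterance by the fused score of Eq. (7.6),
    P(a|u) = P_exp(a|u) * P_imp(a|u)."""
    scores = {a: p_explicit[a] * p_implicit[a] for a in app_ids}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage:
# ranked = fuse_and_rank(["Gmail", "Maps"],
#                        {"Gmail": 0.6, "Maps": 0.1},   # P_exp(a|u)
#                        {"Gmail": 0.7, "Maps": 0.3})   # P_imp(a|u)
```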

7.4.1 Baseline Model for Single-Turn Requests

Considering an unsupervised task of ranking apps based on user spoken requests, the baseline for modeling explicit semantics Pexp(a | u) in single-turn requests applies a language modeling retrieval technique for query likelihood estimation, and app-related intents are ranked by

\[
P_{\mathrm{exp}}(a \mid u) = \frac{P(u \mid a) P(a)}{P(u)} \propto P(u \mid a) = \frac{1}{|u|} \sum_{w \in u} \log P(w \mid a), \tag{7.7}
\]

where u is a user's query, a is an app-related intent, w represents a token in the query, and P(u | a) represents the probability that the user speaks the utterance u to make a request for launching the app a7 [28, 130]. For example, in order to use the app Gmail, a user is more likely to say “compose an email to alex”, while the same utterance should correspond to a lower probability of launching the app Maps. To estimate the likelihood with the language modeling approach, we use the description content of each app, with the assumption that it carries

7Here we assume that the priors for apps/utterances are the same.

semantically related information. For example, the description of Gmail includes the text segment “read and respond to your conversations”, and the description of Maps includes the words “navigating” and “spots”.

7.4.2 Baseline Model for Multi-Turn Interactions

In multi-turn interactions, the goal is to predict the apps that are most likely to handle user requests given input utterances and behavioral contexts, considering not only the desired functionality but also user preference. Since multi-app tasks are often user-specific, we build a personalized SLU component for each user to model his/her intents. We frame the task as a multi-class classification problem and apply a standard MLR to model explicit semantics, where the model parameters are estimated by maximum likelihood via gradient ascent to obtain Pexp(a | u).
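Assuming MLR denotes a multi-class (multinomial) logistic regression over binary lexical/behavioral features, a minimal scikit-learn sketch of this baseline might look as follows; the feature encoding and toy data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix: rows are utterances, columns are lexical +
# behavioral binary features; labels are intended app indices.
X_train = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]])
y_train = np.array([0, 1, 0])   # e.g. 0 = "Camera", 1 = "Email"

mlr = LogisticRegression(max_iter=1000)  # multi-class logistic regression baseline
mlr.fit(X_train, y_train)
p_exp = mlr.predict_proba(np.array([[1, 0, 0, 1]]))  # P_exp(a | u) per app
```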

7.5 Experimental Setup

In single-turn requests, we train a word embedding model to enrich utterances with domain knowledge, as described in Section 7.5.1. To build the intent matrix, a retrieval model is applied to acquire pseudo intended apps; this procedure is detailed in Section 7.5.2. In multi-turn interactions, we group 70% of each user's multi-app dialogues (chronologically ordered) into a training set, and use the remaining 30% as a testing set.

In the experiments, we then build user-independent and user-dependent (personalized) SLU models to estimate the probability distribution over all apps. All experiments are performed on both ASR output and manual transcripts as follows.

7.5.1 Word Embedding

To include distributional semantics for SLU, we use the description contents of all apps to train word embeddings using CBOW, which predicts the current word based on its context8. The resulting vectors have dimensionality 300, and the vocabulary size is 8 × 10^5 [118].
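A hedged sketch of this training step using gensim's Word2Vec (as a stand-in for the word2vec toolkit referenced in the footnote) is shown below; the toy corpus is hypothetical, while the 300-dimensional CBOW setting follows the text.

```python
from gensim.models import Word2Vec

# Hypothetical corpus: one tokenized app description per element.
app_description_corpus = [
    "read and respond to your conversations".split(),
    "navigating to spots around you".split(),
]

model = Word2Vec(
    sentences=app_description_corpus,
    vector_size=300,   # dimensionality reported in the text
    sg=0,              # sg=0 selects CBOW (predict current word from context)
    window=5,
    min_count=1,
    workers=4,
)
# model.wv.most_similar("email") would return semantically similar words,
# assuming "email" appears in the training vocabulary.
```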

7.5.2 Retrieval Setup

The Lemur toolkit9 is used to perform our retrieval model for ranking apps. For the retrieval setting, word stemming10 and stopword removal11 are applied, and we assign an equal weight

8https://code.google.com/p/word2vec/

9http://www.lemurproject.org/

10http://tartarus.org/martin/PorterStemmer/

11http://www.lextek.com/manuals/onix/stopwords1.html

to each term in the query to eliminate the influence of weighting [130]. To further consider app popularity, for each returned list we rerank “popular” apps to the top, where “popular” means apps with more than ten million downloads. We do so because we assume that users are more willing to use or install popular apps, and also because our ground truth is based on subjects' annotations, where most reference apps belong to the set of popular apps we define.
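The reranking step can be sketched as a stable partition of the retrieved list, moving apps above the ten-million-download threshold to the front while preserving retrieval order within each group; the download-count source and names below are assumptions.

```python
def rerank_by_popularity(ranked_apps, download_counts, threshold=10_000_000):
    """Move 'popular' apps (more than ten million downloads) to the top of the
    retrieved list, keeping the original retrieval order within each group."""
    popular = [a for a in ranked_apps if download_counts.get(a, 0) > threshold]
    others = [a for a in ranked_apps if download_counts.get(a, 0) <= threshold]
    return popular + others

# Example with hypothetical counts:
# rerank_by_popularity(["NicheMail", "Gmail"],
#                      {"Gmail": 5_000_000_000, "NicheMail": 2_000})
# -> ["Gmail", "NicheMail"]
```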

7.6 Results

Under the app-oriented SDS, the main idea is to model users' intents; therefore, we evaluate model performance by measuring whether the predicted apps satisfy users' needs. For single-turn requests, we use subject-labeled apps as the ground truth for evaluating the returned apps, using the standard information retrieval metrics MAP and P@10 in the experiments. Similarly, for multi-turn interactions, we also evaluate performance by MAP to consider the whole ranking list of all apps for each utterance.

Considering that the multi-turn interaction task is supervised, we additionally report turn accuracy (ACC) as the percentage of our top-1 predictions that match the correct apps, which is exactly the same as P@1.
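For reference, a standard formulation of the MAP and P@K metrics used here can be sketched as follows; this is a generic implementation, not necessarily the exact evaluation script of this work.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked apps that are in the relevant set."""
    return sum(1 for a in ranked[:k] if a in relevant) / k

def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant app appears."""
    hits, score = 0, 0.0
    for i, a in enumerate(ranked, start=1):
        if a in relevant:
            hits += 1
            score += hits / i
    return score / max(len(relevant), 1)

def mean_average_precision(all_ranked, all_relevant):
    """MAP over a collection of utterances."""
    aps = [average_precision(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps)
```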

7.6.1 Results for Single-Turn Requests

We evaluate the proposed feature-enriched MF approach for single-turn requests in Table 7.2 and Table 7.3, which present the results using different features before and after integration with the feature-enriched MF-SLU model, for both ASR output and manual transcripts.

Table 7.2 shows that almost all results improve after combining with the MF model, indicating that the hidden semantics modeled by MF techniques help estimate intent probabilities.

For ASR results, enriching semantics with embedding-based (row (b)) and type-embedding-based semantics (row (c)) significantly improves the performance of the basic retrieval baseline (row (a)), where MAP increases from 25.1% to 31.5%. The performance can be further improved by integrating MF to additionally model hidden semantics, where row (b) achieves 34.2% MAP. The reason why type-embedding-based semantics (row (c)) does not outperform embedding-based semantics (row (b)) is that the automatically acquired type information appears to introduce noise, so row (c) is slightly worse than row (b) on ASR results.

For manually transcribed speech in Table 7.2, the semantic enrichment procedure and MF-SLU models also improve performance. Different from the ASR results, the best result for user intent prediction is based on the features enriched with type-embedding-based semantics (row (c)), achieving 34.0% MAP. The reason may be that manual transcripts are more likely to

Table 7.2: User intent prediction for single-turn requests on MAP using different training features (%). LM is a baseline language modeling approach that models explicit semantics.

  Feature for Single-Turn Request        |        ASR           |     Transcripts
                                         |  LM   w/ MF-SLU      |  LM   w/ MF-SLU
  (a) Word Observation                   | 25.1  29.2 (+16.2%)  | 26.1  30.4 (+16.4%)
  (b) Word + Embedding Semantics         | 32.0  34.2 (+6.8%)   | 33.3  33.3 (-0.2%)
  (c) Word + Type-Embedding Semantics    | 31.5  32.2 (+2.1%)   | 32.9  34.0 (+3.4%)

Table 7.3: User intent prediction for single-turn requests on P@10 using different training features (%). LM is a baseline language modeling approach that models explicit semantics.

  Feature for Single-Turn Request        |        ASR           |     Transcripts
                                         |  LM   w/ MF-SLU      |  LM   w/ MF-SLU
  (d) Word Observation                   | 28.6  29.5 (+3.4%)   | 29.2  30.1 (+2.8%)
  (e) Word + Embedding Semantics         | 31.2  32.5 (+4.3%)   | 32.0  33.0 (+3.4%)
  (f) Word + Type-Embedding Semantics    | 31.3  30.6 (-2.3%)   | 32.5  34.7 (+6.8%)

capture the correct semantic information through word embeddings and have more consistent type information, allowing the MF technique to model user intents better and more accurately.

Also, Table 7.3 shows similar trends for P@10, where row (e) with MF is the best result for ASR, and row (f) is the best result for manual transcripts. In sum, the results show that the rich features carried by app descriptions and utterance-related contents help intent prediction in single-turn requests using the proposed MF-SLU model in most cases. The evaluation results also demonstrate the effectiveness of our feature-enriched MF-SLU models, which incorporate enriched semantics and model implicit semantics along with explicit semantics in a joint fashion, showing promising performance.

7.6.2 Results for Multi-Turn Interactions

To analyze whether contextual behaviors convey informative cues for personalized intent prediction, Table 7.4 and Table 7.5 show the personalized SLU performance using lexical and behavioral features individually on both ASR and manual transcripts, where we build an SLU model for each subject and then predict intents of given utterances using the corresponding user-dependent models.

For the baseline MLR that models explicit semantics, we can see that lexical features alone (row (a)) achieve better performance than behavioral features alone (row (b)), indicating that the majority of utterances contain explicit expressions that are predictable. In addition, combining behavioral history patterns with lexical features (row (c)) performs best in terms of both measures. This implies that users' personal app usage patterns can improve prediction

Table 7.4: User intent prediction for multi-turn interactions on MAP (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF can significantly improve the MLR model (t-test with p < 0.05).

  Feature for Multi-Turn Interaction     |        ASR           |     Transcripts
                                         | MLR   w/ MF-SLU      | MLR   w/ MF-SLU
  (a) Word Observation                   | 52.1  52.7 (+1.2%)   | 55.5  55.4 (-0.2%)
  (b) Behavioral Pattern                 | 25.2  26.7 (+6.0%§)  | 25.2  26.7 (+6.0%§)
  (c) Word + Behavioral Features         | 53.9  55.7 (+3.3%§)  | 56.6  57.7 (+1.9%§)

Table 7.5: User intent prediction for multi-turn interactions on turn accuracy (%). MLR is a multi-class baseline for modeling explicit semantics. † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF can significantly improve the MLR model (t-test with p < 0.05).

  Feature for Multi-Turn Interaction     |        ASR           |     Transcripts
                                         | MLR   w/ MF-SLU      | MLR   w/ MF-SLU
  (d) Word Observation                   | 48.2  48.3 (+0.2%)   | 51.6  51.3 (-0.6%)
  (e) Behavioral Pattern                 | 19.3  20.6 (+6.7%)   | 19.3  20.6 (+6.7%)
  (f) Word + Behavioral Features         | 50.1  51.9 (+3.6%§)  | 52.8  54.0 (+2.3%)

by allowing the model to capture dependencies between lexical and behavioral features.

To evaluate the effectiveness of modeling latent semantics via MF, we examine the performance of the MLR model integrated with MF-SLU estimation. For ASR transcripts, combining MLR with MF-SLU outperforms all MLR-only models (significant improvement with p < 0.05 in a t-test), because MF is able to model latent semantics underlying lexical and behavioral patterns. For manual transcripts, integrating with the MF-SLU model does not perform better when using only lexical features (row (a)), possibly because manually transcribed utterances contain clearer and more explicit semantics, so that learning latent semantics does not improve the performance. However, using behavioral features or all features shows significantly better performance when combined with the proposed MF-based method.

Table 7.5 shows a similar trend to Table 7.4, where the improvement in ACC is larger than the improvement in MAP, showing that the model produces accurate predictions for the top returned intent. Comparing results for ASR and manual transcripts, we observe that performance is worse when user utterances contain recognition errors, possibly because the explicit expressions contain more noise and thus result in worse prediction. The benefit of MF techniques on ASR results is more pronounced (3.3% and 3.6% relative improvement of MAP and ACC respectively) than on manual transcripts, showing the effectiveness of MF in modeling latent semantics in noisy data.

Finally, our experiments show the feasibility of disambiguating spoken language inputs for better intent prediction using behavioral patterns. The best prediction on ASR reaches 55.7% MAP and 51.9% ACC.

7.6.3 Comparison between User-Dependent and User-Independent Models

To further investigate the performance of personalized models, we conduct experiments comparing user-independent models and user-dependent models in Table 7.6 and Table 7.7.

The first two rows are baseline results using maximum likelihood estimation (MLE), which predicts the intended apps based on the observed app distribution. It can be seen that the user-dependent model (row (b)) performs better than the user-independent model (row (a)). Note that MLE only considers the observed frequency of app usage, so there is no difference between ASR and manual results. Rows (c) and (d) use MLR to model explicit semantics, where the performance is significantly better than MLE, and user-dependent models (row (d)) still outperform user-independent models (row (c)) using lexical features alone, behavioral features alone, or combined features for both ASR and manual transcripts.

To analyze the capability of modeling hidden semantics, rows (e) and (f) apply MF to model implicit semantics before integrating with the models of explicit semantics. Using an MF model alone does not perform better than MLR, because MF takes latent information into account and has a weaker capability of modeling explicit observations. The user-independent model using behavioral patterns alone achieves 39.9% for both ASR and manual transcripts, but combining with lexical features yields only 21.2% and 19.8% for ASR and manual results respectively; these results may be unreliable due to data sparsity in MF. Personalized results (row (f)) perform much better than user-independent models (row (e)), indicating that inference relations between apps are more evident for individual users.

We then investigate the performance of different combinations in rows (g) and (h), where row (g) combines the user-independent MLR (row (c)) with the personalized MF (row (f)), and row (h) combines the personalized MLR (row (d)) with the personalized MF (row (f)). By integrating personalized MF, both models are improved for ASR and manual transcripts. Between these two combinations, utilizing user-specific data to model both explicit and implicit semantics (row (h)) achieves the best performance, with 55.7% and 57.7% MAP on ASR and manual transcripts respectively.

Table 7.7 shows the performance in terms of turn accuracy, and almost all trends are the same, except for the user-independent MF model using only behavioral patterns (row (e)), which produces a more reasonable result of 7.6% for both ASR and manual transcripts. The poor performance

Table 7.6: User intent prediction for multi-turn interactions on MAP for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05).

  Approach                               |       ASR            |    Transcripts
                                         | Lex.  Behav.  All    | Lex.  Behav.  All
  (a) MLE     User-Independent           |        19.6          |       19.6
  (b)         Personalized               |        27.9          |       27.9
  (c) MLR     User-Independent           | 46.4  18.7   50.1    | 51.3  18.7   53.1
  (d)         Personalized               | 52.1  25.2   53.9    | 55.5  25.2   56.6
  (e) MF-SLU  User-Independent           | 19.4  39.9   21.2    | 16.8  39.9   19.8
  (f)         Personalized               | 29.8  43.8   29.8    | 30.8  43.8   31.5
  (g) (c) + Personalized MF-SLU          | 51.1  20.3   54.2†§  | 53.3  20.3   57.6†§
  (h) (d) + Personalized MF-SLU          | 52.7  26.7   55.7†§  | 55.4  26.7   57.7†§

Table 7.7: User intent prediction for multi-turn interactions on ACC for ASR and manual transcripts (%). † means that all features perform significantly better than lexical/behavioral features alone; § means that integrating with MF significantly improves the MLR model (t-test with p < 0.05).

  Approach                               |       ASR            |    Transcripts
                                         | Lex.  Behav.  All    | Lex.  Behav.  All
  (a) MLE     User-Independent           |        13.5          |       13.5
  (b)         Personalized               |        20.2          |       20.2
  (c) MLR     User-Independent           | 42.8  14.9   46.2    | 47.7  14.9   48.8
  (d)         Personalized               | 48.2  19.3   50.1    | 51.6  19.3   52.8
  (e) MF-SLU  User-Independent           | 13.4   7.6   14.8    | 10.1   7.6   13.6
  (f)         Personalized               | 21.7  16.5   21.8    | 23.3  16.5   24.2
  (g) (c) + Personalized MF-SLU          | 47.6  16.4   50.3†§  | 49.1  16.4   53.5†§
  (h) (d) + Personalized MF-SLU          | 48.3  20.6   51.9†§  | 51.3  20.6   54.0

of the user-independent model is expected, because different users usually have different preferences and usage histories. In terms of ACC, the best performance is 51.9% and 54.0% on ASR and manual results respectively, confirming that applying personalized SLU to both explicit and implicit semantics is better.

7.7 Summary

This chapter presents a feature-enriched MF-SLU approach to learning user intents based on automatically acquired rich features, in one case taking domain knowledge into account and in the other incorporating behavioral patterns along with user utterances. In a smartphone intelligent assistant setting (e.g. requesting an app launch), the proposed model considers implicit semantics to enhance intent inference given noisy ASR inputs for single-turn request dialogues. The model is also able to incorporate users' behavioral patterns and their app preferences to better predict user intents in multi-turn interactions. We believe that the proposed approach allows systems to handle users' open domain intents, retrieving relevant apps that provide the desired functionality either locally or by suggesting the installation of suitable apps, and doing so in an unsupervised way. The framework can be extended to incorporate personal behavior history and use it to improve a system's ability to assist users in multi-app activities. In sum, the effectiveness of the feature-enriched MF model is shown in different domains, indicating good generality and providing a reasonable direction for future work.