
6.3 Matrix Factorization for Spoken Language Understanding (MF-SLU)

Considering the benefits brought by MF techniques, including 1) modeling noisy data, 2) modeling hidden semantics, and 3) modeling long-range dependencies between observations, in this work we apply an MF approach to SLU modeling for SDS. In our model, we use U to denote a set of input utterances, W a set of word patterns, and S a set of semantic slots that we would like to predict. The pair of an utterance u ∈ U and a word pattern/semantic slot x ∈ {W ∪ S}, ⟨u, x⟩, is a fact. The input to our model is a set of observed facts O, and the observed facts for a given utterance are denoted by {⟨u, x⟩ ∈ O}. The goal of our model is to estimate, for a given utterance u and a given word pattern/semantic slot x, the probability p(M_{u,x} = 1), where M_{u,x} is a binary random variable that is true if and only if x is a word pattern/domain-specific semantic slot in the utterance u. We introduce a series of exponential family models that estimate the probability using a natural parameter θ_{u,x} and the logistic sigmoid function:

$$p(M_{u,x} = 1 \mid \theta_{u,x}) = \sigma(\theta_{u,x}) = \frac{1}{1 + \exp(-\theta_{u,x})} \tag{6.1}$$

We construct a matrix M of dimension |U| × (|W| + |S|) as observed facts for MF by integrating a feature model and a knowledge graph propagation model below.
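Although the parameterization is detailed later (Section 6.3.4), a minimal sketch may help make the scoring concrete: assuming the natural parameter θ_{u,x} is the inner product of low-rank latent vectors for the utterance and the word pattern/slot (plus a per-feature bias), the probability in (6.1) can be computed as below. The dimensions and factor matrices are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: |U| utterances, |W| + |S| word patterns/slots,
# and a hypothetical low-rank latent dimension k.
num_utts, num_feats, k = 1000, 5000, 100
rng = np.random.default_rng(0)

# Latent component vectors: one per utterance (P) and one per word
# pattern/slot (Q), plus a per-feature bias term b.
P = rng.normal(scale=0.1, size=(num_utts, k))
Q = rng.normal(scale=0.1, size=(num_feats, k))
b = np.zeros(num_feats)

def theta(u, x):
    """Natural parameter theta_{u,x} under the low-rank assumption."""
    return P[u] @ Q[x] + b[x]

def prob(u, x):
    """Estimated p(M_{u,x} = 1 | theta_{u,x}) from Eq. (6.1)."""
    return sigmoid(theta(u, x))
```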

6.3.1 Feature Model

First, we build a word pattern matrix F_w with binary values based on observations, where each row represents an utterance and each column refers to an observed unigram. In other words, F_w carries basic word vectors for the utterances, which is illustrated as the left part of the matrix in Figure 6.1b.
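For illustration, such a binary word-pattern matrix can be built as follows; the toy utterances are hypothetical stand-ins for ASR-decoded input.

```python
import numpy as np

# Toy ASR-decoded utterances (hypothetical examples).
utterances = [
    "can i have a cheap restaurant",
    "show me an expensive hotel",
]

# Vocabulary of observed unigrams (word patterns).
vocab = sorted({w for u in utterances for w in u.split()})
word_index = {w: j for j, w in enumerate(vocab)}

# Binary word-pattern matrix F_w: one row per utterance, one column per unigram.
F_w = np.zeros((len(utterances), len(vocab)), dtype=np.int8)
for i, u in enumerate(utterances):
    for w in u.split():
        F_w[i, word_index[w]] = 1
```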

To induce the semantic elements, we parse all ASR-decoded utterances in our corpus using SEMAFOR [48, 49], and extract all frames from the semantic parsing results as slot candidates [31, 55]. Figure 3.2 shows an ASR-decoded utterance example “can i have a cheap restaurant” parsed by SEMAFOR. Three FrameNet-defined frames, capability, expensiveness, and locale by use, are generated for the utterance, which we consider as slot candidates for a domain-specific dialogue system [4]. Then we build a slot matrix F_s with binary values based on the induced slots, and this matrix also denotes the slot features for all utterances (right part of the matrix in Figure 6.1b).

To build a feature model M_F, we concatenate the two matrices:

$$M_F = [\, F_w \quad F_s \,], \tag{6.2}$$

which is the upper part of the matrix in Figure 6.1b for training utterances. Note that we do not use any annotations, so all slot candidates are included.
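Continuing the sketch above (reusing numpy and F_w), the slot matrix F_s can be built from the frames returned by the parser and concatenated with F_w as in (6.2); the per-utterance frame lists below are hypothetical stand-ins for SEMAFOR output.

```python
# Hypothetical SEMAFOR output: induced frames (slot candidates) per utterance.
induced_frames = [
    ["capability", "expensiveness", "locale_by_use"],
    ["expensiveness", "locale_by_use"],
]

slot_list = sorted({f for frames in induced_frames for f in frames})
slot_index = {s: j for j, s in enumerate(slot_list)}

# Binary slot matrix F_s: one row per utterance, one column per slot candidate.
F_s = np.zeros((len(induced_frames), len(slot_list)), dtype=np.int8)
for i, frames in enumerate(induced_frames):
    for f in frames:
        F_s[i, slot_index[f]] = 1

# Feature model M_F = [F_w  F_s] from Eq. (6.2): no annotations are used,
# so every slot candidate is kept as a column.
M_F = np.hstack([F_w, F_s])
```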

6.3.2 Knowledge Graph Propagation Model

As mentioned in Chapter 3, because SEMAFOR was trained on FrameNet annotation, which has a more generic frame-semantic context, not all the frames from the parsing results can be used as the actual slots in the domain-specific dialogue systems. For instance, we see that the frames expensiveness and locale by use are essentially key slots for the purpose of understanding in the restaurant query domain, whereas the capability frame does not convey particularly valuable information for SLU.

Assuming that domain-specific concepts are usually related to each other, considering the global relations between semantic slots induces a more coherent slot set. It has been shown that relations encoded in knowledge graphs help make decisions on domain-specific slots [39]. Similar to Chapter 4, we consider two directed graphs, a semantic and a lexical knowledge graph, where each node in the semantic knowledge graph is a slot candidate s_i generated by the frame-semantic parser, and each node in the lexical knowledge graph is a word w_j.

• Slot-based semantic knowledge graph is built as G_s = ⟨V_s, E_ss⟩, where V_s = {s_i ∈ S} and E_ss = {e_ij | s_i, s_j ∈ V_s}.

• Word-based lexical knowledge graph is built as G_w = ⟨V_w, E_ww⟩, where V_w = {w_i ∈ W} and E_ww = {e_ij | w_i, w_j ∈ V_w}.

The edges connect two nodes in the graphs if there is a typed dependency between them. The structured graphs help define a coherent slot set. To model the relations between words/slots based on the knowledge graphs, we define two relation models below.

• Semantic Relation

To model word semantic relations, we compute a matrix R^S_w = [Sim(w_i, w_j)]_{|W|×|W|}, where Sim(w_i, w_j) is the cosine similarity between the dependency embeddings of word patterns w_i and w_j after normalization. For slot semantic relations, we compute R^S_s = [Sim(s_i, s_j)]_{|S|×|S|} similarly¹. The matrices R^S_w and R^S_s model not only semantic similarity but also functional similarity, since we use dependency-based embeddings [108]; a small sketch of this computation follows the list.

• Dependency Relation

Assuming that important semantic slots are usually mutually related to each other, that is, connected by syntactic dependencies, our automatically derived knowledge graphs are able to help model the dependency relations. For word dependency relations, we compute a matrix R^D_w = [r̂(w_i, w_j)]_{|W|×|W|}, where r̂(w_i, w_j) measures a dependency relation between two word patterns w_i and w_j based on the word-based lexical knowledge graph; the details are described in Section 4.3.1. For slot dependency relations, we similarly compute R^D_s = [r̂(s_i, s_j)]_{|S|×|S|} based on the slot-based semantic knowledge graph.
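The semantic relation matrices above reduce to a normalized cosine-similarity computation over embeddings; a minimal sketch is given below, assuming the dependency-based embeddings are already available as rows of a matrix (the top-10 truncation corresponds to footnote 1). The dependency relation scores r̂ from Section 4.3.1 are taken as given here.

```python
import numpy as np

def semantic_relation_matrix(embeddings, top_k=10):
    """Cosine-similarity relation matrix with only the top-k values kept per
    column, as in R^S_w and R^S_s (see footnote 1).

    embeddings: array of shape (n, d), one dependency-based embedding per
    word pattern or slot candidate (assumed to be precomputed).
    """
    # Normalize rows so that dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Keep only the top-k highest values in each column; zero out the rest.
    pruned = np.zeros_like(sim)
    for j in range(sim.shape[1]):
        top = np.argsort(sim[:, j])[-top_k:]
        pruned[top, j] = sim[top, j]
    return pruned
```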

With the built word relation models (R^S_w and R^D_w) and slot relation models (R^S_s and R^D_s), we combine them into a knowledge graph propagation matrix M_R²:

$$M_R = \begin{bmatrix} R^{SD}_w & 0 \\ 0 & R^{SD}_s \end{bmatrix}, \tag{6.3}$$

¹For each column in R^S_w and R^S_s, we only keep the top 10 highest values, which correspond to the 10 most semantically similar nodes.

²The values on the diagonal of M_R are set to 0 so as to model only the propagation from other entries.

where R^{SD}_w = R^S_w + R^D_w and R^{SD}_s = R^S_s + R^D_s integrate the semantic and dependency relations.

The goal of this matrix is to propagate scores between nodes according to different types of relations in the knowledge graphs [26].
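Putting these pieces together, the propagation matrix in (6.3) can be assembled as sketched below; the relation matrices are assumed to come from the sketch above and from the Section 4.3.1 scores.

```python
import numpy as np

def propagation_matrix(R_S_w, R_D_w, R_S_s, R_D_s):
    """Knowledge graph propagation matrix M_R from Eq. (6.3).

    Inputs are the word/slot semantic and dependency relation matrices;
    R^{SD} = R^S + R^D on each side, arranged block-diagonally, with the
    diagonal zeroed as in footnote 2.
    """
    R_SD_w = R_S_w + R_D_w
    R_SD_s = R_S_s + R_D_s
    n_w, n_s = R_SD_w.shape[0], R_SD_s.shape[0]

    M_R = np.zeros((n_w + n_s, n_w + n_s))
    M_R[:n_w, :n_w] = R_SD_w
    M_R[n_w:, n_w:] = R_SD_s
    np.fill_diagonal(M_R, 0.0)  # zero diagonal: propagate only from other entries
    return M_R
```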

6.3.3 Integrated Model

With a feature model M_F and a knowledge graph propagation model M_R, we integrate them into a single matrix:

$$
\begin{aligned}
M &= M_F \cdot (\alpha I + \beta M_R) \\
  &= [\, F_w \quad F_s \,] \cdot \begin{bmatrix} \alpha I + \beta R_w & 0 \\ 0 & \alpha I + \beta R_s \end{bmatrix} \\
  &= [\, \alpha F_w + \beta F_w R_w \quad \alpha F_s + \beta F_s R_s \,],
\end{aligned} \tag{6.4}
$$

where M is the final matrix and I is an identity matrix. α and β are weights for balancing the original values and the propagated values, where α + β = 1. The matrix M is similar to M_F, but some weights are enhanced through the knowledge graph propagation model M_R. The word relations are built by F_w R_w, which is the matrix with internal weight propagation on the lexical knowledge graph (the blue arrow in Figure 6.1b). Similarly, F_s R_s models the slot correlations, and can be treated as the matrix with internal weight propagation on the semantic knowledge graph (the pink arrow in Figure 6.1b). The propagation models can be viewed as running a random walk algorithm on the graphs.
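As a concrete illustration, the integrated matrix in (6.4) can be computed directly from the feature blocks and the relation blocks; the weight values below are hypothetical.

```python
import numpy as np

def integrated_matrix(F_w, F_s, R_w, R_s, alpha=0.7, beta=0.3):
    """Integrated matrix M from Eq. (6.4), with alpha + beta = 1.

    F_w, F_s: binary feature blocks of the feature model M_F.
    R_w, R_s: word and slot relation blocks of the propagation model M_R.
    """
    assert abs(alpha + beta - 1.0) < 1e-9
    M_w = alpha * F_w + beta * F_w @ R_w  # propagation on the lexical graph
    M_s = alpha * F_s + beta * F_s @ R_s  # propagation on the semantic graph
    return np.hstack([M_w, M_s])
```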

F_s contains all slot candidates generated by SEMAFOR, which may include some generic slots (such as capability), so the original feature model cannot differentiate between domain-specific and generic concepts. By integrating with R_s, the semantic and dependency relations can be propagated via the edges in the knowledge graph, and domain-specific concepts may receive higher weights based on the assumption that slots for dialogue systems are often mutually related [39]. Hence, the structural information is automatically incorporated into the matrix.

The word relation model plays the same role at the word level. In conclusion, for each utterance, the integrated model not only predicts the probabilities that semantic slots occur but also considers whether the slots are domain-specific. The following sections describe the learning process.

6.3.4 Parameter Estimation

The proposed model is parameterized through weights and latent component vectors, where the parameters are estimated by maximizing the log likelihood of observed data [46].

$$
\begin{aligned}
\theta &= \arg\max_{\theta} \prod_{u \in U} p(\theta \mid M_u) \\
       &= \arg\max_{\theta} \prod_{u \in U} p(M_u \mid \theta)\, p(\theta) \\
       &= \arg\max_{\theta} \sum_{u \in U} \ln p(M_u \mid \theta) - \lambda_{\theta},
\end{aligned} \tag{6.5}
$$

where M_u is the vector corresponding to the utterance u, because we assume that each utterance is independent of the others.

To avoid treating unobserved facts as designed negative facts and to complete the missing entries of the matrix, our model can be factorized by a matrix completion technique with a low-rank latent semantics assumption [97, 134]. Bayesian personalized ranking (BPR) is an optimization criterion for learning MF models from implicit feedback via matrix completion; it uses a ranking-based variant of the objective, giving observed true facts higher scores than unobserved (true or false) facts when factorizing the given matrix [134]. BPR was shown to be useful in learning implicit relations for improving semantic parsing [38, 135].

6.3.4.1 Objective Function

To estimate the parameters in (6.5), we create a dataset of ranked pairs from M in (6.4):

for each utterance u and each observed fact f+ = ⟨u, x+⟩, where M_{u,x+} ≥ δ, we choose each word pattern/slot x− such that f− = ⟨u, x−⟩, where M_{u,x−} < δ, which refers to a word pattern/slot we have not observed to be in utterance u. That is, we construct the observed data O from M. Then for each pair of facts f+ and f−, we want to model p(f+) > p(f−) and hence θ_{f+} > θ_{f−} according to (6.1). BPR maximizes the summation over these ranked pairs, where the objective is

$$
\sum_{u \in U} \ln p(M_u \mid \theta) = \sum_{f^{+} \in O} \; \sum_{f^{-} \notin O} \ln \sigma(\theta_{f^{+}} - \theta_{f^{-}}). \tag{6.6}
$$

The BPR objective is an approximation of the per-utterance AUC (area under the ROC curve), which directly correlates with what we want to achieve: well-ranked semantic slots for each utterance.
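As an illustration of the pair construction described above, a minimal sketch is given below; the threshold δ and the exhaustive enumeration of pairs are illustrative only.

```python
import numpy as np

def ranked_pairs(M, delta=0.5):
    """Enumerate ranked fact pairs (f+, f-) per utterance from matrix M.

    f+ = (u, x+) with M[u, x+] >= delta (observed fact),
    f- = (u, x-) with M[u, x-] <  delta (unobserved fact).
    """
    pairs = []
    for u in range(M.shape[0]):
        positives = np.where(M[u] >= delta)[0]
        negatives = np.where(M[u] < delta)[0]
        for x_pos in positives:
            for x_neg in negatives:
                pairs.append((u, x_pos, x_neg))
    return pairs
```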

6.3.4.2 Optimization

To maximize the objective in (6.6), we employ a stochastic gradient descent (SGD) algorithm [134]. For each randomly sampled observed fact ⟨u, x+⟩, we sample an unobserved fact ⟨u, x−⟩, which results in |O| fact pairs ⟨f+, f−⟩. For each pair, we perform an SGD update using the gradient of the corresponding objective function for MF [68].
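A compact sketch of this BPR-style SGD update under the low-rank parameterization θ_{u,x} = P[u] · Q[x] (the same assumption as in the first sketch) is shown below; the learning rate and regularization values are hypothetical.

```python
import numpy as np

def bpr_sgd_epoch(M, P, Q, delta=0.5, lr=0.05, reg=0.01, rng=None):
    """One epoch of BPR-style SGD updates on latent factors P (utterances)
    and Q (word patterns/slots), assuming theta_{u,x} = P[u] @ Q[x].

    For each observed fact (u, x+) with M[u, x+] >= delta, sample an
    unobserved x- with M[u, x-] < delta and take a gradient step on
    ln sigmoid(theta_{u,x+} - theta_{u,x-}).
    """
    rng = rng or np.random.default_rng()
    users, pos_items = np.where(M >= delta)
    for u, x_pos in zip(users, pos_items):
        # Rejection-sample an unobserved fact for the same utterance.
        x_neg = rng.integers(M.shape[1])
        while M[u, x_neg] >= delta:
            x_neg = rng.integers(M.shape[1])

        diff = P[u] @ (Q[x_pos] - Q[x_neg])
        g = 1.0 / (1.0 + np.exp(diff))  # sigma(-diff): gradient factor of ln sigmoid(diff)

        p_u = P[u].copy()  # use pre-update values for all three gradients
        P[u] += lr * (g * (Q[x_pos] - Q[x_neg]) - reg * p_u)
        Q[x_pos] += lr * (g * p_u - reg * Q[x_pos])
        Q[x_neg] += lr * (-g * p_u - reg * Q[x_neg])
```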