
Considering the benefits brought by MF techniques, including 1) modeling noisy data, 2) modeling hidden semantics, and 3) modeling the dependencies between observations, this thesis applies an MF approach to SLU model building for SDSs. In our model, we use U to denote the set of input utterances, W the set of word patterns, and S the set of semantic slots we would like to predict. The pair of an utterance u ∈ U and a word pattern/semantic slot x ∈ W ∪ S, ⟨u, x⟩, is a fact. The input to our model is a set of observed facts O, and the observed facts for a given utterance are denoted by {⟨u, x⟩ ∈ O}. The goal of our model is to estimate, for a given utterance u and a given word pattern/semantic slot x, the probability p(Mu,x = 1), where Mu,x is a binary random variable that is true if and only if x is a word pattern or domain-specific semantic slot occurring in the utterance u. We introduce a series of exponential family models that estimate the probability using a natural parameter θu,x and the logistic sigmoid function:

p(Mu,x = 1 | θu,x) = σ(θu,x) = 1 / (1 + exp(−θu,x)),   (6.1)
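For intuition, a minimal sketch of (6.1) in Python: the natural parameter θu,x is mapped through the logistic sigmoid to a probability, so a larger θu,x corresponds to higher confidence that x occurs in u. The concrete parameterization of θu,x is deferred to Section 6.4.4, and the values below are purely illustrative.

```python
import numpy as np

def sigmoid(theta):
    """p(M_{u,x} = 1 | theta_{u,x}) from (6.1)."""
    return 1.0 / (1.0 + np.exp(-theta))

# Illustrative natural parameters for three (utterance, word pattern/slot) pairs.
print(sigmoid(np.array([-2.0, 0.0, 3.0])))  # approx. [0.12, 0.50, 0.95]
```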

We construct a matrix M of dimension |U| × (|W| + |S|) as the observed facts for MF by integrating a feature model and a knowledge graph propagation model, described below.

6.4.1 Feature Model

First, we build a binary word pattern matrix Fw based on observations, where each row refers to an utterance and each column refers to an observed word pattern. In other words, Fw carries the basic word vectors for the utterances, which is illustrated as the left part of the matrix in Figure 4.1(b).

To induce the semantic elements, we parse all ASR-decoded utterances in our corpus using SEMAFOR (http://www.ark.cs.cmu.edu/SEMAFOR/), a state-of-the-art semantic parser for frame-semantic parsing [27, 28], and extract all frames from the semantic parsing results as slot candidates [20]. Figure 3.2 shows an example of an ASR-decoded output parsed by SEMAFOR. Three FrameNet-defined frames (capability, expensiveness, and locale by use) are generated for the utterance, which we consider as slot candidates for a domain-specific dialogue system [3]. Then we build a binary slot matrix Fs based on the output slots, which also denotes the slot features for the utterances (the right part of the matrix in Figure 4.1(b)).

For building the feature model MF, we concatenate two matrices:

MF = [ Fw Fs ], (6.2)

which refers to the upper part of the matrix in Fig. 4.1(b) for training utterances. Note that we do not use any annotations, so all slot candidates are included.
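As a concrete illustration, the sketch below builds Fw, Fs, and MF from toy observations; the utterances, word patterns, and slot candidates are hypothetical, and in the real pipeline the slot candidates come from the SEMAFOR parses described above.

```python
import numpy as np

# Hypothetical observations: word patterns and frame-based slot candidates per utterance.
utterances = [
    {"words": ["cheap", "restaurant"], "slots": ["expensiveness", "locale_by_use"]},
    {"words": ["can", "i", "have", "cheap", "food"], "slots": ["capability", "expensiveness", "food"]},
]
word_vocab = sorted({w for u in utterances for w in u["words"]})
slot_vocab = sorted({s for u in utterances for s in u["slots"]})

# Binary word pattern matrix F_w and binary slot matrix F_s (one row per utterance).
F_w = np.zeros((len(utterances), len(word_vocab)))
F_s = np.zeros((len(utterances), len(slot_vocab)))
for i, u in enumerate(utterances):
    for w in u["words"]:
        F_w[i, word_vocab.index(w)] = 1.0
    for s in u["slots"]:
        F_s[i, slot_vocab.index(s)] = 1.0

# Feature model (6.2): concatenate the word block and the slot block.
M_F = np.hstack([F_w, F_s])
print(M_F.shape)  # (|U|, |W| + |S|)
```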

6.4.2 Knowledge Graph Propagation Model

Since SEMAFOR was trained on FrameNet annotations, which cover a more generic frame-semantic context, not all frames in the parsing results can be used as actual slots in a domain-specific dialogue system. For instance, in Figure 3.2, the frames “expensiveness” and “locale by use” are essentially the key slots for understanding in the restaurant query domain, whereas the “capability” frame does not convey particularly valuable information for SLU.

Assuming that domain-specific concepts are usually related to each other, considering the relations between semantic slots globally induces a more coherent slot set. It has been shown that the relations encoded in knowledge graphs help determine domain-specific slots [23]. We consider the two directed graphs built in Section 4.4.1 of Chapter 4, a semantic knowledge graph and a lexical knowledge graph, where each node in the semantic knowledge graph is a slot candidate si output by the frame-semantic parser, and each node in the lexical knowledge graph is a word wj. The structured graphs help define a coherent slot set. To model the relations between words/slots based on the knowledge graphs, we define two relation models below.

• Semantic Relation

For modeling word semantic relations, we compute a matrix RSw = [Sim(wi, wj)]|W|×|W|, where Sim(wi, wj) is the cosine similarity between the dependency embeddings of the word patterns wi and wj after normalization. For slot semantic relations, we compute RSs = [Sim(si, sj)]|S|×|S| similarly; for each column in RSw and RSs, we keep only the 10 highest values, which correspond to the 10 most semantically similar nodes. The matrices RSw and RSs model not only semantic similarity but also functional similarity, since we use dependency-based embeddings [64].

• Dependent Relation

Assuming that important semantic slots are usually mutually related, that is, connected by syntactic dependencies, our automatically derived knowledge graphs are able to help model these dependent relations. For word dependent relations, we compute a matrix RDw = [r̂(wi, wj)]|W|×|W|, where r̂(wi, wj) measures the dependency between two word patterns wi and wj based on the word-based lexical knowledge graph and can be computed by (4.1) in Section 4.4.2.2 of Chapter 4. For slot dependent relations, we similarly compute RDs = [r̂(si, sj)]|S|×|S| based on the slot-based semantic knowledge graph. A sketch of computing both types of relation matrices is given below.
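The following sketch illustrates one way to compute the word-level relation matrices under stated assumptions; the slot-level matrices RSs and RDs are computed analogously. The dependency-based embeddings and the graph-derived scores r̂ are taken as given inputs here, since their construction belongs to Chapter 4, and the shapes are illustrative.

```python
import numpy as np

def semantic_relations(embeddings, top_k=10):
    """R^S: cosine similarity between dependency-based embeddings,
    keeping only the top_k values per column (the rest are zeroed)."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    R = E @ E.T
    for j in range(R.shape[1]):
        if R.shape[0] > top_k:
            cutoff = np.sort(R[:, j])[-top_k]
            R[R[:, j] < cutoff, j] = 0.0
    return R

def dependent_relations(r_hat):
    """R^D: pairwise dependency scores r_hat(i, j) derived from the
    knowledge graph, supplied here as a precomputed array."""
    return np.asarray(r_hat, dtype=float)

# Hypothetical inputs: dependency embeddings and graph scores for |W| = 20 word patterns.
rng = np.random.default_rng(0)
R_S_w = semantic_relations(rng.normal(size=(20, 50)))
R_D_w = dependent_relations(rng.random((20, 20)))
```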

With the built word relation models (RSw and RDw) and slot relation models (RSs and RDs), we combine them into a single knowledge graph propagation matrix MR:

MR = [ RSDw    0
        0    RSDs ],   (6.3)

where RSDw = RSw + RDw and RSDs = RSs + RDs integrate the semantic and dependent relations.

The goal of this matrix is to propagate scores between nodes according to the different types of relations in the knowledge graphs [16]. The values on the diagonal of MR are set to 0 so that only the propagation from other entries is modeled.
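Continuing the sketch above, MR in (6.3) can be assembled as a block-diagonal matrix of the combined relation models, with the diagonal zeroed so that each node only receives propagation from other nodes; the slot-level arguments here are placeholders standing in for RSs and RDs.

```python
import numpy as np

def propagation_matrix(R_S_w, R_D_w, R_S_s, R_D_s):
    """M_R from (6.3): block-diagonal combination of R^SD_w and R^SD_s."""
    R_SD_w = R_S_w + R_D_w  # integrate semantic and dependent relations (words)
    R_SD_s = R_S_s + R_D_s  # integrate semantic and dependent relations (slots)
    n_w, n_s = R_SD_w.shape[0], R_SD_s.shape[0]
    M_R = np.zeros((n_w + n_s, n_w + n_s))
    M_R[:n_w, :n_w] = R_SD_w
    M_R[n_w:, n_w:] = R_SD_s
    np.fill_diagonal(M_R, 0.0)  # model only propagation from other entries
    return M_R
```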

6.4.3 Integrated Model

We integrate the feature model MF and the knowledge graph propagation model MR into a single matrix:

M = MF · (MR + I)
  = [ Fw  Fs ] [ Rw + I    0
                  0      Rs + I ]
  = [ FwRw + Fw   FsRs + Fs ],   (6.4)

where M is the final matrix and I is the identity matrix, included in order to preserve the original values.

The matrix M is similar to MF, but some weights are enhanced through the knowledge graph propagation model MR. The word relations are built by FwRw, which is the matrix with internal weight propagation on the lexical knowledge graph (the blue arrow in Figure 4.1(b)). Similarly, FsRs models the slot correlations and can be treated as the matrix with internal weight propagation on the semantic knowledge graph (the pink arrow in Figure 4.1(b)).

Fs contains all slot candidates output by SEMAFOR, which may include some generic slots (such as capability), so the original feature model cannot differentiate between domain-specific and generic concepts. By integrating with Rs, the semantic and dependent relations can be propagated via the knowledge graph, and the domain-specific concepts may receive higher weights based on the assumption that the slots for dialogue systems are often mutually related [23].

Hence, the structure information is automatically incorporated into the matrix. The word relation model provides the same function at the word level. In conclusion, for each utterance, the integrated model not only predicts the probability that semantic slots occur but also considers whether the slots are domain-specific. The following sections describe the learning process.
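A sketch of the integration step in (6.4), reusing the illustrative variables from the earlier sketches: because MR is block diagonal, the word block and the slot block are propagated independently, and adding the identity preserves the original feature values.

```python
import numpy as np

def integrated_matrix(F_w, F_s, R_w, R_s):
    """M from (6.4): M = M_F (M_R + I) = [F_w R_w + F_w,  F_s R_s + F_s]."""
    word_block = F_w @ R_w + F_w  # propagation on the lexical knowledge graph
    slot_block = F_s @ R_s + F_s  # propagation on the semantic knowledge graph
    return np.hstack([word_block, slot_block])
```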

6.4.4 Parameter Estimation

The proposed model is parameterized through weights and latent component vectors, where the parameters are estimated by maximizing the log likelihood of the observed data [25]:

θ* = arg maxθ ∏u∈U p(θ | Mu)
   = arg maxθ ∏u∈U p(Mu | θ) p(θ)
   = arg maxθ ∑u∈U ln p(Mu | θ) − λθ,   (6.5)

where Mu is the vector corresponding to the utterance u, composed of the Mu,x in (6.1), and we assume that each utterance is independent of the others.

To avoid treating unobserved facts as designed negative facts, we consider our positive-only data as implicit feedback. Bayesian Personalized Ranking (BPR) is an approach for learning with implicit feedback, which uses a variant of ranking: observed true facts are given higher scores than unobserved (true or false) facts [79]. Riedel et al. also showed that BPR learns implicit relations, improving the relation extraction task [80].

6.4.4.1 Objective Function

To estimate the parameters in (6.5), we create a dataset of ranked pairs from M in (6.4): for each utterance u and each observed fact f+ = ⟨u, x+⟩, where Mu,x+ ≥ δ, we choose each word pattern/slot x− such that f− = ⟨u, x−⟩, where Mu,x− < δ, which refers to a word pattern/slot we have not observed to be in utterance u. That is, we construct the observed data O from M. Then for each pair of facts f+ and f−, we want to model p(f+) > p(f−) and hence θf+ > θf− according to (6.1). BPR maximizes the summation over all ranked pairs, where the objective is

∑u∈U ln p(Mu | θ) = ∑f+∈O ∑f−∉O ln σ(θf+ − θf−).   (6.6)

The BPR objective is an approximation of the per-utterance AUC (area under the ROC curve), which directly correlates with what we want to achieve: well-ranked semantic slots per utterance.

6.4.4.2 Optimization

To maximize the objective in (6.6), we employ a stochastic gradient descent (SGD) algorithm [79]. For each randomly sampled observed fact ⟨u, x+⟩, we sample an unobserved fact ⟨u, x−⟩, which results in |O| fact pairs ⟨f+, f−⟩. For each pair, we perform an SGD update using the gradient of the corresponding objective function for matrix factorization [41].
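A sketch of the BPR-style SGD learning described above, under two explicit assumptions that go beyond what this section specifies: θu,x is parameterized by low-rank latent vectors (θu,x = pu · qx, in line with the latent component vectors mentioned in Section 6.4.4), and the hyperparameters (rank, learning rate, regularization weight, threshold δ) are illustrative values, not ones reported in this thesis.

```python
import numpy as np

def bpr_sgd(M, rank=50, delta=0.5, lr=0.05, reg=0.01, epochs=10, seed=0):
    """For each observed fact (u, x+), sample an unobserved fact (u, x-) and
    take a gradient step that increases ln sigma(theta_f+ - theta_f-), cf. (6.6)."""
    rng = np.random.default_rng(seed)
    n_u, n_x = M.shape
    P = 0.1 * rng.standard_normal((n_u, rank))  # utterance latent vectors
    Q = 0.1 * rng.standard_normal((n_x, rank))  # word pattern/slot latent vectors
    observed = np.array([(u, x) for u in range(n_u) for x in range(n_x) if M[u, x] >= delta])
    for _ in range(epochs):
        rng.shuffle(observed)
        for u, x_pos in observed:
            # Sample an unobserved word pattern/slot for the same utterance
            # (assumes every utterance has at least one entry below delta).
            x_neg = rng.integers(n_x)
            while M[u, x_neg] >= delta:
                x_neg = rng.integers(n_x)
            diff_vec = Q[x_pos] - Q[x_neg]
            g = 1.0 / (1.0 + np.exp(P[u] @ diff_vec))  # sigma(-(theta_f+ - theta_f-))
            p_u = P[u].copy()
            P[u]     += lr * (g * diff_vec - reg * P[u])
            Q[x_pos] += lr * (g * p_u - reg * Q[x_pos])
            Q[x_neg] += lr * (-g * p_u - reg * Q[x_neg])
    return P, Q

# After training, p(M_{u,x} = 1) is estimated by 1 / (1 + exp(-(P @ Q.T))).
```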
