
The goal of the ranking model is to extract domain-specific concepts from all fine-grained frames output by frame-semantic parsing. To induce meaningful slots for SDS, we compute the prominence of slot candidates by additionally considering their structural information.

With the semantic parses from SEMAFOR, where each frame is viewed independently and inter-slot relations are not included, our model ranks slot candidates by integrating two types of information: (1) the frequency of each slot candidate in the corpus, and (2) the relations between slot candidates. Assuming that domain-specific concepts are usually related to each other, globally considering inter-slot relations induces a more coherent slot set. As the baseline, following Chapter 3, we consider only the frequency of each slot candidate as its prominence, without the structure information.

Syntactic dependency relations between fillers may help measure the prominence of the corresponding slots. We therefore first construct two knowledge graphs, a slot-based semantic knowledge graph and a word-based lexical knowledge graph, both of which encode typed dependency relations in their edge weights. We also connect the two graphs to model the relations between slot-filler pairs. On the integrated graph, which incorporates dependency relations between semantic and lexical elements, the prominence of slot candidates can be computed by a random walk algorithm. The details are described as follows.

Figure 4.2: A simplified example of the integration of the two knowledge graphs, where a slot candidate $s_i$ is represented as a node in the semantic knowledge graph and a word $w_j$ is represented as a node in the lexical knowledge graph.

4.3.1 Knowledge Graphs

We construct two undirected graphs, a semantic knowledge graph and a lexical knowledge graph. Each node in the semantic knowledge graph is a slot candidate $s_i$ output by the frame-semantic parser, and each node in the lexical knowledge graph is a word $w_j$.

• Slot-based semantic knowledge graph is built as $G_s = \langle V_s, E_{ss} \rangle$, where $V_s = \{s_i\}$ and $E_{ss} = \{e_{ij} \mid s_i, s_j \in V_s\}$.

• Word-based lexical knowledge graph is built as $G_w = \langle V_w, E_{ww} \rangle$, where $V_w = \{w_i\}$ and $E_{ww} = \{e_{ij} \mid w_i, w_j \in V_w\}$.

With the two knowledge graphs, we build edges between slots and slot-fillers to integrate them, as shown in Figure 4.2. The integrated graph can thus be formulated as $G = \langle V_s, V_w, E_{ss}, E_{ww}, E_{ws} \rangle$, where $E_{ws} = \{e_{ij} \mid w_i \in V_w, s_j \in V_s\}$. $E_{ss}$, $E_{ww}$, and $E_{ws}$ correspond to slot-to-slot relations, word-to-word relations, and word-to-slot relations, respectively [26, 27].
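For concreteness, the integrated graph $G$ can be stored as plain node sets and edge-weight maps. The following minimal Python sketch is illustrative only; the class and method names are not from the original implementation.

```python
from collections import defaultdict

class IntegratedKnowledgeGraph:
    """Illustrative container for G = <V_s, V_w, E_ss, E_ww, E_ws>."""

    def __init__(self):
        self.slot_nodes = set()               # V_s
        self.word_nodes = set()               # V_w
        self.slot_slot = defaultdict(float)   # E_ss: (s_i, s_j) -> weight
        self.word_word = defaultdict(float)   # E_ww: (w_i, w_j) -> weight
        self.word_slot = defaultdict(float)   # E_ws: (w_i, s_j) -> weight

    def add_slot_edge(self, s_i, s_j, weight=1.0):
        self.slot_nodes.update((s_i, s_j))
        self.slot_slot[(s_i, s_j)] += weight

    def add_word_edge(self, w_i, w_j, weight=1.0):
        self.word_nodes.update((w_i, w_j))
        self.word_word[(w_i, w_j)] += weight

    def add_word_slot_edge(self, w_i, s_j, weight=1.0):
        self.word_nodes.add(w_i)
        self.slot_nodes.add(s_j)
        self.word_slot[(w_i, s_j)] += weight
```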

4.3.2 Edge Weight Estimation

To incorporate the different strengths of dependency relations in the knowledge graphs, we assign weights to the edges. The edge weights for $E_{ww}$ and $E_{ss}$ are measured based on dependency parsing results. The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4.3. The arrows denote the dependency relations from headwords to their dependents, and the labels on the arcs denote the types of dependencies.

Figure 4.3: The dependency parsing result of the utterance "can i have a cheap restaurant", where the frames capability, expensiveness, and locale_by_use are attached to their fillers and the arcs are labeled with the dependency types (ccomp, nsubj, det, amod, dobj).

All typed dependencies between two words are encoded as triples and form a word-based dependency set $T_w = \{\langle w_i, t, w_j \rangle\}$, where $t$ is the typed dependency between the headword $w_i$ and the dependent $w_j$. For example, Figure 4.3 generates $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$, $\langle \text{have}, \text{dobj}, \text{restaurant} \rangle$, etc. for $T_w$. Similarly, we build a slot-based dependency set $T_s = \{\langle s_i, t, s_j \rangle\}$ by transforming dependencies between slot-fillers into dependencies between slots. For example, $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$ from $T_w$ is transformed into $\langle \text{locale\_by\_use}, \text{amod}, \text{expensiveness} \rangle$ for building $T_s$, because both the headword and the dependent are parsed as slot-fillers by SEMAFOR.
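For concreteness, the construction of $T_w$ and its transformation into $T_s$ can be sketched as follows, assuming the dependency triples and the SEMAFOR filler-to-slot mapping of Figure 4.3 are given; the variable names are illustrative.

```python
# Word-based dependency set T_w: triples <head, type, dependent> from the parser.
T_w = [
    ("have", "ccomp", "can"),
    ("have", "nsubj", "i"),
    ("have", "dobj", "restaurant"),
    ("restaurant", "det", "a"),
    ("restaurant", "amod", "cheap"),
]

# Slot-filler mapping from the frame-semantic parse: filler word -> slot candidate.
filler_to_slot = {
    "can": "capability",
    "cheap": "expensiveness",
    "restaurant": "locale_by_use",
}

# Slot-based dependency set T_s: keep a triple only when both the head and the
# dependent are parsed as slot-fillers, and replace each word with its slot.
T_s = [
    (filler_to_slot[h], t, filler_to_slot[d])
    for (h, t, d) in T_w
    if h in filler_to_slot and d in filler_to_slot
]
# -> [("locale_by_use", "amod", "expensiveness")]
```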

For all edges within a single knowledge graph, we assign the weight of the edge connecting nodes $x_i$ and $x_j$ as $\hat{r}(x_i, x_j)$, where $x$ is either $s$ (slot) or $w$ (word). Since weights are measured based on relations between nodes regardless of direction, we combine the scores of the two directional dependencies:

$$\hat{r}(x_i, x_j) = r(x_i \rightarrow x_j) + r(x_j \rightarrow x_i), \qquad (4.1)$$

where $r(x_i \rightarrow x_j)$ is the score that estimates the dependency with $x_i$ as a head and $x_j$ as a dependent. In Sections 4.3.2.1 and 4.3.2.2, we propose two scoring functions for $r(\cdot)$: a frequency-based one, $r_1(\cdot)$, and an embedding-based one, $r_2(\cdot)$.

For edges in $E_{ws}$, we estimate edge weights based on the frequency with which slot candidates and words are parsed as slot-filler pairs. In other words, the edge weight between a slot-filler $w_i$ and a slot candidate $s_j$, $\hat{r}(w_i, s_j)$, equals the number of times the filler $w_i$ corresponds to the slot candidate $s_j$ in the parsing results.

4.3.2.1 Frequency-Based Measurement

Based on the parsed dependency set $T_x$, we use $t_{x_i \rightarrow x_j}$ to denote the most frequent typed dependency with $x_i$ as a head and $x_j$ as a dependent:

$$t_{x_i \rightarrow x_j} = \arg\max_{t} \; C(x_i \xrightarrow{t} x_j), \qquad (4.2)$$

Table 4.1: The contexts extracted for training dependency-based word/slot embeddings from the utterance of Figure 3.2.

         | Typed Dependency Relation              | Target Word    | Contexts
Word     | ⟨restaurant, amod, cheap⟩              | restaurant     | cheap/amod
         |                                        | cheap          | restaurant/amod⁻¹
Slot     | ⟨locale_by_use, amod, expensiveness⟩   | locale_by_use  | expensiveness/amod
         |                                        | expensiveness  | locale_by_use/amod⁻¹

where $C(x_i \xrightarrow{t} x_j)$ counts how many times the dependency $\langle x_i, t, x_j \rangle$ occurs in the dependency set $T_x$. Then the scoring function that estimates the dependency $x_i \rightarrow x_j$ is measured as

$$r_1(x_i \rightarrow x_j) = C(x_i \xrightarrow{t_{x_i \rightarrow x_j}} x_j), \qquad (4.3)$$

which equals the highest observed frequency of the dependency $x_i \rightarrow x_j$ among all types in $T_x$.
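A minimal sketch of the frequency-based measurement, assuming $T_x$ is stored as a list of ⟨head, type, dependent⟩ triples as above; the function names are illustrative.

```python
from collections import Counter

def r1(T_x, x_i, x_j):
    """Frequency-based score r1(x_i -> x_j): the count of the most frequent
    typed dependency with x_i as head and x_j as dependent (Eqs. 4.2-4.3)."""
    counts = Counter(t for (h, t, d) in T_x if h == x_i and d == x_j)
    if not counts:
        return 0
    _, c = counts.most_common(1)[0]   # C(x_i --t_{x_i->x_j}--> x_j)
    return c

def r_hat(T_x, x_i, x_j):
    """Undirected edge weight (Eq. 4.1): sum of both directional scores."""
    return r1(T_x, x_i, x_j) + r1(T_x, x_j, x_i)
```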

4.3.2.2 Embedding-Based Measurement

It has been shown that the dependency-based embedding approach introduced in Section 2.4.2 is able to capture more functional similarity, because it uses dependency-based syntactic contexts for training word embeddings [108]. Table 4.1 shows some extracted dependency-based contexts for each target word from the example in Figure 4.3, where headwords and their dependents form each other's contexts by following the arcs in the dependency tree, and $^{-1}$ denotes the direction of the dependency. We learn vector representations for both words and contexts such that the dot product $v_w \cdot v_c$ is maximized for "good" word-context pairs belonging to the training data.

We can then obtain dependency-based slot and word embeddings using $T_s$ and $T_w$, respectively.
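For illustration, the (target word, context) training pairs of Table 4.1 can be generated from the dependency sets as sketched below; the resulting pairs would then be fed to a dependency-based embedding trainer in the style of [108]. The helper function is hypothetical.

```python
def dependency_contexts(triples):
    """Yield (target, context) pairs for dependency-based embeddings:
    for <head, t, dep>, the head sees "dep/t" and the dependent sees
    "head/t^-1" (the inverse-direction context)."""
    for head, t, dep in triples:
        yield head, f"{dep}/{t}"
        yield dep, f"{head}/{t}^-1"

# Example from Table 4.1:
# list(dependency_contexts([("restaurant", "amod", "cheap")]))
# -> [('restaurant', 'cheap/amod'), ('cheap', 'restaurant/amod^-1')]
```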

With the trained dependency-based embeddings, we estimate the probability that $x_i$ is a headword and $x_j$ is its dependent via a typed dependency $t$ as

$$P(x_i \xrightarrow{t} x_j) = \frac{\mathrm{Sim}(x_i, x_j/t) + \mathrm{Sim}(x_j, x_i/t^{-1})}{2}, \qquad (4.4)$$

where $\mathrm{Sim}(x_i, x_j/t)$ is the cosine similarity between the word/slot embedding $v_{x_i}$ and the context embedding $v_{x_j/t}$ after normalization to $[0, 1]$. Then we measure the scoring function $r_2(\cdot)$ as

$$r_2(x_i \rightarrow x_j) = C(x_i \xrightarrow{t_{x_i \rightarrow x_j}} x_j) \cdot P(x_i \xrightarrow{t_{x_i \rightarrow x_j}} x_j), \qquad (4.5)$$

which is similar to (4.3) but additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting on the relatively small dataset.
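Assuming the trained target and context embeddings are available as dictionaries of numpy vectors keyed as in Table 4.1, the embedding-based score of (4.4)-(4.5) could be computed as in the following sketch; all names are illustrative.

```python
import numpy as np
from collections import Counter

def cos01(u, v):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return (sim + 1.0) / 2.0

def r2(T_x, target_emb, context_emb, x_i, x_j):
    """Embedding-based score r2(x_i -> x_j) = C(...) * P(...) as in Eq. (4.5)."""
    counts = Counter(t for (h, t, d) in T_x if h == x_i and d == x_j)
    if not counts:
        return 0.0
    t, c = counts.most_common(1)[0]   # most frequent typed dependency and its count
    # Eq. (4.4): average the similarities over both context directions.
    p = (cos01(target_emb[x_i], context_emb[f"{x_j}/{t}"]) +
         cos01(target_emb[x_j], context_emb[f"{x_i}/{t}^-1"])) / 2.0
    return c * p
```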

4.3.3 Random Walk Algorithm

We first compute $L_{ww} = [\hat{r}(w_i, w_j)]_{|V_w| \times |V_w|}$ and $L_{ss} = [\hat{r}(s_i, s_j)]_{|V_s| \times |V_s|}$, where $\hat{r}(w_i, w_j)$ and $\hat{r}(s_i, s_j)$ come from either the frequency-based measurement ($r_1(\cdot)$) or the embedding-based measurement ($r_2(\cdot)$). Similarly, $L_{ws} = [\hat{r}(w_i, s_j)]_{|V_w| \times |V_s|}$ and $L_{sw} = [\hat{r}(w_i, s_j)]^{T}_{|V_w| \times |V_s|}$ are computed, where $\hat{r}(w_i, s_j)$ is the frequency with which $s_j$ and $w_i$ form a slot-filler pair, as described in Section 4.3.2. Then we keep only the top $N$ highest weights in each row of $L_{ww}$ and $L_{ss}$ ($N = 10$), which means that we filter out edges with smaller weights within a single knowledge graph. Column normalization is performed on $L_{ww}$, $L_{ss}$, $L_{ws}$, and $L_{sw}$ [141]. They can be viewed as word-to-word, slot-to-slot, word-to-slot, and slot-to-word relation matrices.

4.3.3.1 Single-Graph Random Walk

Here we perform a random walk algorithm only on the semantic knowledge graph to propagate scores based on inter-slot relations through the edges $E_{ss}$:

$$R_s^{(t+1)} = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} R_s^{(t)}, \qquad (4.6)$$

where $R_s^{(t)}$ denotes the importance scores of the slot candidates $V_s$ in the $t$-th iteration. In the algorithm, the score is the interpolation of two scores: the normalized baseline importance of the slot candidates ($R_s^{(0)}$), and the scores propagated from neighboring nodes in the semantic knowledge graph based on the slot-to-slot relations $L_{ss}$. The algorithm converges when $R_s = R_s^{(t+1)} \approx R_s^{(t)}$ and $R_s$ satisfies the equation

$$R_s = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} R_s. \qquad (4.7)$$

We can solve for $R_s$ as

$$R_s = \left( (1 - \alpha) R_s^{(0)} e^{T} + \alpha L_{ss} \right) R_s = M_1 R_s, \qquad (4.8)$$

where $e = [1, 1, \dots, 1]^{T}$, so that $e^{T} R_s = 1$ for the normalized $R_s$. It has been shown that the closed-form solution $R_s$ of (4.8) is the dominant eigenvector of $M_1$, i.e., the eigenvector corresponding to the largest absolute eigenvalue of $M_1$ [100]. The solution $R_s$ gives the updated importance scores of all slot candidates. As in the PageRank algorithm, the solution can also be obtained by iteratively updating $R_s^{(t)}$ [19].

4.3.3.2 Double-Graph Random Walk

We borrow the idea of the two-layer mutually reinforced random walk to propagate scores based not only on internal importance propagation within the same graph but also on external mutual reinforcement between the two knowledge graphs [26, 27, 93]:

$$\begin{cases} R_s^{(t+1)} = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} R_w^{(t)} \\ R_w^{(t+1)} = (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s^{(t)} \end{cases} \qquad (4.9)$$

In the algorithm, the updated scores are interpolations of two scores: the normalized baseline importance ($R_s^{(0)}$ and $R_w^{(0)}$) and the scores propagated from the other graph. For the semantic knowledge graph, $L_{sw} R_w^{(t)}$ is the score vector from the word set weighted by slot-to-word relations, and these scores are then propagated based on the slot-to-slot relations $L_{ss}$. Similarly, nodes in the lexical knowledge graph receive scores propagated from the semantic knowledge graph. $R_s^{(t+1)}$ and $R_w^{(t+1)}$ can thus be mutually updated by iterating (4.9). When the algorithm converges, $R_s$ and $R_w$ can be derived similarly:

$$\begin{cases} R_s = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} R_w \\ R_w = (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s \end{cases} \qquad (4.10)$$

Substituting the second equation of (4.10) into the first gives

$$\begin{aligned} R_s &= (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} \left( (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s \right) \\ &= (1 - \alpha) R_s^{(0)} + \alpha (1 - \alpha) L_{ss} L_{sw} R_w^{(0)} + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} R_s \\ &= \left( (1 - \alpha) R_s^{(0)} e^{T} + \alpha (1 - \alpha) L_{ss} L_{sw} R_w^{(0)} e^{T} + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} \right) R_s \\ &= M_2 R_s. \end{aligned} \qquad (4.11)$$

The closed-form solution $R_s$ of (4.11) can be obtained as the dominant eigenvector of $M_2$.
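Similarly, a minimal numpy sketch of the mutually reinforced updates in (4.9) is given below; the converged $R_s$ could equivalently be obtained as the dominant eigenvector of $M_2$, e.g., via `numpy.linalg.eig`. All names and the default `alpha` are illustrative.

```python
import numpy as np

def double_graph_random_walk(R0_s, R0_w, L_ss, L_sw, L_ww, L_ws,
                             alpha=0.9, tol=1e-8, max_iter=1000):
    """Mutually reinforced random walk over both knowledge graphs (Eq. 4.9)."""
    R_s, R_w = R0_s.copy(), R0_w.copy()
    for _ in range(max_iter):
        R_s_new = (1 - alpha) * R0_s + alpha * (L_ss @ (L_sw @ R_w))
        R_w_new = (1 - alpha) * R0_w + alpha * (L_ww @ (L_ws @ R_s))
        if np.abs(R_s_new - R_s).sum() + np.abs(R_w_new - R_w).sum() < tol:
            return R_s_new, R_w_new
        R_s, R_w = R_s_new, R_w_new
    return R_s, R_w
```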