The purpose of the ranking model is to distinguish between generic semantic concepts and domain-specific concepts that are relevant to an SDS. To induce meaningful slots for an SDS, we compute the prominence of the slot candidates by additionally considering structure information.
With the semantic parses from SEMAFOR, where each frame is viewed independently so that inter-slot relations are not captured, the model ranks the slot candidates by integrating two types of information: (1) the frequency of each slot candidate in the corpus, since slots with higher frequency may be more important, and (2) the relations between slot candidates. Assuming that domain-specific concepts are usually related to each other, globally considering inter-slot relations induces a more coherent slot set. As the baseline scores, we use only the frequency of each slot candidate as its prominence, without the structure information.

Figure 4.2: A simplified example of the two knowledge graphs, where a slot candidate $s_i$ is represented as a node in the semantic knowledge graph and a word $w_j$ is represented as a node in the lexical knowledge graph.
First we construct two knowledge graphs: a slot-based semantic knowledge graph and a word-based lexical knowledge graph, both of which encode the typed dependency relations in their edge weights. We also connect the two graphs to model the relations between slot-filler pairs.
4.4.1 Knowledge Graphs
We construct two undirected graphs: a semantic knowledge graph and a lexical knowledge graph. Each node in the semantic knowledge graph is a slot candidate $s_i$ output by the frame-semantic parser, and each node in the lexical knowledge graph is a word $w_j$.
• The slot-based semantic knowledge graph is built as $G_s = \langle V_s, E_{ss} \rangle$, where $V_s = \{s_i\}$ and $E_{ss} = \{e_{ij} \mid s_i, s_j \in V_s\}$.
• The word-based lexical knowledge graph is built as $G_w = \langle V_w, E_{ww} \rangle$, where $V_w = \{w_i\}$ and $E_{ww} = \{e_{ij} \mid w_i, w_j \in V_w\}$.
With the two knowledge graphs, we build edges between slots and slot-fillers to integrate them, as shown in Figure 4.2. The combined graph can thus be formulated as $G = \langle V_s, V_w, E_{ss}, E_{ww}, E_{ws} \rangle$, where $E_{ws} = \{e_{ij} \mid w_i \in V_w, s_j \in V_s\}$. $E_{ss}$, $E_{ww}$, and $E_{ws}$ correspond to the slot-to-slot relations, the word-to-word relations, and the word-to-slot relations, respectively [16, 17].
Figure 4.3: The dependency parsing result of the utterance "can i have a cheap restaurant", where the arcs are labeled with the typed dependencies (ccomp, nsubj, dobj, det, amod) and the slot-fillers are annotated with the induced slots (capability, expensiveness, locale_by_use).
4.4.2 Edge Weight Estimation
Considering the relations in the knowledge graphs, the edge weights for $E_{ww}$ and $E_{ss}$ are measured based on the dependency parsing results. The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4.3. The arrows denote the dependency relations from headwords to their dependents, and the labels on the arcs denote the types of the dependencies. All typed dependencies between two words are encoded as triples and form a word-based dependency set $T_w = \{\langle w_i, t, w_j \rangle\}$, where $t$ is the typed dependency between the headword $w_i$ and the dependent $w_j$. For example, Figure 4.3 generates $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$, $\langle \text{have}, \text{dobj}, \text{restaurant} \rangle$, etc. for $T_w$. Similarly, we build a slot-based dependency set $T_s = \{\langle s_i, t, s_j \rangle\}$ by transforming dependencies between slot-fillers into dependencies between the corresponding slots. For example, $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$ from $T_w$ is transformed into $\langle \text{locale\_by\_use}, \text{amod}, \text{expensiveness} \rangle$ for building $T_s$, because both words are parsed as slot-fillers by SEMAFOR.
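The construction of $T_w$ and $T_s$ described above can be sketched as follows. This is a minimal illustration, not the original implementation; the `filler2slot` mapping stands in for SEMAFOR's filler-to-slot output, and the triples are taken from the Figure 4.3 example.

```python
from collections import Counter

def build_dependency_sets(parses, filler2slot):
    """Build the word-based dependency set T_w and the slot-based set T_s
    as multisets of typed-dependency triples (head, type, dependent)."""
    T_w, T_s = Counter(), Counter()
    for triples in parses:  # one utterance = a list of dependency triples
        for head, dep_type, dependent in triples:
            T_w[(head, dep_type, dependent)] += 1
            # keep a slot-level triple only when both words are slot-fillers
            if head in filler2slot and dependent in filler2slot:
                T_s[(filler2slot[head], dep_type, filler2slot[dependent])] += 1
    return T_w, T_s

# The utterance of Figure 4.3; the filler-to-slot mapping is assumed here.
parses = [[("have", "ccomp", "can"), ("have", "nsubj", "i"),
           ("have", "dobj", "restaurant"), ("restaurant", "det", "a"),
           ("restaurant", "amod", "cheap")]]
filler2slot = {"cheap": "expensiveness", "restaurant": "locale_by_use"}
T_w, T_s = build_dependency_sets(parses, filler2slot)
```

Using `Counter` keeps the occurrence counts that the frequency-based measurement in Section 4.4.2.1 needs.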
For the edges within a single knowledge graph, we assign the weight of the edge connecting nodes $x_i$ and $x_j$ as $\hat{r}(x_i, x_j)$, where $x$ refers to either $s$ or $w$. Since the weights are measured based on the relations between nodes regardless of direction, we combine the scores of the two directional dependencies:

$$\hat{r}(x_i, x_j) = r(x_i \to x_j) + r(x_j \to x_i), \qquad (4.1)$$

where $r(x_i \to x_j)$ is the score estimating the dependency with $x_i$ as the head and $x_j$ as the dependent. In Sections 4.4.2.1 and 4.4.2.2, we propose two scoring functions for $r(\cdot)$: a frequency-based $r_1(\cdot)$ and an embedding-based $r_2(\cdot)$, respectively.
For the edges in $E_{ws}$, we estimate the edge weights based on the frequency with which the slot candidates and the words are parsed as slot-filler pairs. In other words, the edge weight between the slot-filler $w_i$ and the slot candidate $s_j$, $\hat{r}(w_i, s_j)$, equals the number of times the filler $w_i$ corresponds to the slot candidate $s_j$ in the parsing results.
       Typed Dependency Relation                     Target Word      Contexts
Word   ⟨restaurant, amod, cheap⟩                     restaurant       cheap/amod
                                                     cheap            restaurant/amod⁻¹
Slot   ⟨locale_by_use, amod, expensiveness⟩          locale_by_use    expensiveness/amod
                                                     expensiveness    locale_by_use/amod⁻¹

Table 4.1: The contexts extracted for training dependency-based word/slot embeddings from the utterance of Figure 4.3.
4.4.2.1 Frequency-Based Measurement
Based on the dependency set $T_x$, we use $t^*_{x_i \to x_j}$ to denote the most frequent typed dependency with $x_i$ as the head and $x_j$ as the dependent:

$$t^*_{x_i \to x_j} = \arg\max_t C(x_i \xrightarrow{t} x_j), \qquad (4.2)$$

where $C(x_i \xrightarrow{t} x_j)$ counts how many times the dependency $\langle x_i, t, x_j \rangle$ occurs in the dependency set $T_x$. Then the scoring function that estimates the dependency $x_i \to x_j$ is measured as

$$r_1(x_i \to x_j) = C(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j), \qquad (4.3)$$

which equals the highest observed frequency of the dependency $x_i \to x_j$ among all types in $T_x$.
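A minimal sketch of the frequency-based measurement, assuming $T_x$ is stored as a `Counter` over triples (the toy counts below are illustrative, not from the corpus):

```python
from collections import Counter

def r1(T_x, x_i, x_j):
    """Frequency-based score (4.3): the count of the most frequent typed
    dependency with x_i as head and x_j as dependent, or 0 if none exists."""
    counts = [c for (head, t, dep), c in T_x.items()
              if head == x_i and dep == x_j]
    return max(counts, default=0)

# Toy counts: amod is the dominant type for restaurant -> cheap.
T_w = Counter({("restaurant", "amod", "cheap"): 3,
               ("restaurant", "nmod", "cheap"): 1})
score = r1(T_w, "restaurant", "cheap")  # frequency of the dominant type, amod
# Undirected edge weight (4.1): sum of the two directional scores.
r_hat = r1(T_w, "restaurant", "cheap") + r1(T_w, "cheap", "restaurant")
```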
4.4.2.2 Embedding-Based Measurement
It has been shown that the dependency-based embedding approach introduced in Section 2.5.2 of Chapter 2 captures more functional similarity, because it uses dependency-based syntactic contexts for training word embeddings [64]. Table 4.1 shows the dependency-based contexts extracted for each target word from the example in Figure 4.3, where headwords and their dependents form each other's contexts by following the arcs in the dependency tree, and $^{-1}$ denotes the reversed direction of the dependency. We learn vector representations for both words and contexts such that the dot product $v_w \cdot v_c$ is maximized for the "good" word-context pairs in the training data. We can then obtain the dependency-based slot and word embeddings using $T_s$ and $T_w$ respectively.
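The context extraction of Table 4.1 can be sketched as below; this is a simplified illustration of the pair-generation step only (the actual embedding training of [64] is not reproduced here), and the `-1` suffix encodes the reversed dependency direction.

```python
def dependency_contexts(triples):
    """Extract (target, context) pairs as in Table 4.1: for <head, t, dep>,
    the head gets the context dep/t and the dependent gets head/t^-1."""
    pairs = []
    for head, t, dep in triples:
        pairs.append((head, f"{dep}/{t}"))      # head sees its dependent
        pairs.append((dep, f"{head}/{t}-1"))    # dependent sees its head
    return pairs

pairs = dependency_contexts([("locale_by_use", "amod", "expensiveness")])
```

These (target, context) pairs are what a dependency-based skip-gram model would be trained on.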
With the trained dependency-based embeddings, we estimate the probability that $x_i$ is the headword and $x_j$ is its dependent via the typed dependency $t$ as

$$P(x_i \xrightarrow{t} x_j) = \frac{\text{Sim}(x_i, x_j/t) + \text{Sim}(x_j, x_i/t^{-1})}{2}, \qquad (4.4)$$

where $\text{Sim}(x_i, x_j/t)$ is the cosine similarity between the word/slot embedding $v_{x_i}$ and the context embedding $v_{x_j/t}$, normalized to $[0, 1]$. Then we measure the scoring function $r_2(\cdot)$ as

$$r_2(x_i \to x_j) = C(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j) \cdot P(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j), \qquad (4.5)$$

which is similar to (4.3) but additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting on the smaller dataset.
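A sketch of the embedding-based measurement, under the assumption that trained word/slot and context embeddings are available as dictionaries of vectors (the 2-d vectors below are toy values chosen to make the arithmetic transparent):

```python
import numpy as np

def sim(v, c):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    cos = float(np.dot(v, c)) / (np.linalg.norm(v) * np.linalg.norm(c))
    return (cos + 1.0) / 2.0

def r2(count, emb, ctx_emb, x_i, x_j, t):
    """Embedding-based score (4.5): the observed frequency of the dominant
    typed dependency, weighted by the estimated probability (4.4)."""
    p = (sim(emb[x_i], ctx_emb[f"{x_j}/{t}"]) +
         sim(emb[x_j], ctx_emb[f"{x_i}/{t}-1"])) / 2.0
    return count * p

# Toy 2-d embeddings; real ones would be trained on T_w / T_s.
emb = {"restaurant": np.array([1.0, 0.0]), "cheap": np.array([1.0, 0.0])}
ctx_emb = {"cheap/amod": np.array([1.0, 0.0]),
           "restaurant/amod-1": np.array([0.0, 1.0])}
score = r2(4, emb, ctx_emb, "restaurant", "cheap", "amod")
```

Here the two similarities are 1.0 and 0.5, so the probability in (4.4) is 0.75 and the frequency 4 is smoothed down to 3.0.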
4.4.3 Random Walk Algorithm
We first compute $L_{ww} = [\hat{r}(w_i, w_j)]_{|V_w| \times |V_w|}$ and $L_{ss} = [\hat{r}(s_i, s_j)]_{|V_s| \times |V_s|}$, where $\hat{r}(w_i, w_j)$ and $\hat{r}(s_i, s_j)$ come from either the frequency-based measurement ($r_1(\cdot)$) or the embedding-based measurement ($r_2(\cdot)$). Similarly, $L_{ws} = [\hat{r}(w_i, s_j)]_{|V_w| \times |V_s|}$ and $L_{sw} = L_{ws}^T$, where $\hat{r}(w_i, s_j)$ is the frequency with which $s_j$ and $w_i$ form a slot-filler pair, as computed in Section 4.4.2. We then keep only the top $N$ highest weights in each row of $L_{ww}$ and $L_{ss}$ ($N = 10$); that is, we filter out the edges with smaller weights within each single knowledge graph. Column normalization is performed on $L_{ww}$, $L_{ss}$, $L_{ws}$, and $L_{sw}$ [83]. These can be viewed as the word-to-word, slot-to-slot, word-to-slot, and slot-to-word relation matrices.
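The matrix preprocessing above (row-wise top-$N$ filtering followed by column normalization) can be sketched as follows; this is a minimal NumPy illustration with a toy matrix, not the original code.

```python
import numpy as np

def topn_filter(L, n):
    """Keep only the n largest weights in each row; zero out the rest."""
    out = np.zeros_like(L)
    for i, row in enumerate(L):
        keep = np.argsort(row)[-n:]   # indices of the n largest entries
        out[i, keep] = row[keep]
    return out

def column_normalize(L):
    """Scale every column to sum to 1 (all-zero columns are left unchanged)."""
    sums = L.sum(axis=0, keepdims=True).copy()
    sums[sums == 0] = 1.0
    return L / sums

# Toy slot-to-slot weight matrix; N = 2 here instead of the paper's N = 10.
L_ss = np.array([[0.2, 0.5, 0.1],
                 [0.4, 0.1, 0.3],
                 [0.1, 0.2, 0.6]])
L_ss = column_normalize(topn_filter(L_ss, 2))
```

Column normalization makes each column a probability distribution, which is what keeps the random-walk scores in the next section normalized.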
4.4.3.1 Single-Graph Random Walk
Here we run a random walk only on the semantic knowledge graph to propagate the scores through the edges $E_{ss}$ based on inter-slot relations:

$$R^{(t+1)}_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} R^{(t)}_s, \qquad (4.6)$$

where $R^{(t)}_s$ denotes the importance scores of the slot candidates $V_s$ in the $t$-th iteration. In the algorithm, each score is an interpolation of two scores: the normalized baseline importance of the slot candidates ($R^{(0)}_s$) and the scores propagated from the neighboring nodes in the semantic knowledge graph via the slot-to-slot relations $L_{ss}$. The algorithm converges when $R^{(t+1)}_s = R^{(t)}_s = R^*_s$, at which point (4.7) is satisfied:

$$R^*_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} R^*_s. \qquad (4.7)$$

We can solve for $R^*_s$ as

$$R^*_s = \left((1-\alpha) R^{(0)}_s e^T + \alpha L_{ss}\right) R^*_s = M_1 R^*_s, \qquad (4.8)$$

where $e = [1, 1, \dots, 1]^T$, so that $e^T R^*_s = 1$ for the normalized scores. It has been shown that the closed-form solution $R^*_s$ of (4.8) is the dominant eigenvector of $M_1$ [58], i.e., the eigenvector corresponding to the largest absolute eigenvalue of $M_1$. The solution $R^*_s$ gives the updated importance scores of all slot candidates. Similar to the PageRank algorithm [13], the solution can also be obtained by iteratively updating $R^{(t)}_s$.
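The iterative solution of (4.6) can be sketched as a short power-iteration loop; the toy $L_{ss}$ and baseline vector below are illustrative, and $\alpha = 0.9$ is an assumed damping value, not necessarily the one used in the experiments.

```python
import numpy as np

def single_graph_walk(L_ss, R0, alpha=0.9, tol=1e-10, max_iter=1000):
    """Iterate (4.6), R <- (1-a) R0 + a L_ss R, until convergence."""
    R = R0.copy()
    for _ in range(max_iter):
        R_next = (1 - alpha) * R0 + alpha * (L_ss @ R)
        if np.abs(R_next - R).sum() < tol:
            break
        R = R_next
    return R_next

L_ss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # column-normalized slot-to-slot relations
R0 = np.array([0.7, 0.3])       # normalized baseline prominence
R_star = single_graph_walk(L_ss, R0)
```

Because $L_{ss}$ is column-normalized and $R^{(0)}_s$ sums to one, every iterate stays normalized, matching the interpretation of $R^*_s$ as a score distribution over slot candidates.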
4.4.3.2 Double-Graph Random Walk
Here we borrow the idea of the two-layer mutually reinforced random walk to propagate the scores based not only on internal importance propagation within the same graph but also on external mutual reinforcement between the two knowledge graphs [16, 17].
$$\begin{cases}
R^{(t+1)}_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} R^{(t)}_w \\
R^{(t+1)}_w = (1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^{(t)}_s
\end{cases} \qquad (4.9)$$

In the algorithm, the scores are interpolations of two components: the normalized baseline importance ($R^{(0)}_s$ and $R^{(0)}_w$) and the scores propagated from the other graph. For the semantic knowledge graph, $L_{sw} R^{(t)}_w$ is the score vector from the word set weighted by the slot-to-word relations, and these scores are then propagated via the slot-to-slot relations $L_{ss}$. Similarly, the nodes of the lexical knowledge graph receive scores propagated from the semantic knowledge graph. $R^{(t+1)}_s$ and $R^{(t+1)}_w$ are thus mutually updated by the latter terms of (4.9) in each iteration. When the algorithm converges, the stationary scores $R^*_s$ and $R^*_w$ can be derived similarly.
$$\begin{cases}
R^*_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} R^*_w \\
R^*_w = (1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^*_s
\end{cases} \qquad (4.10)$$

Substituting the second equation of (4.10) into the first, we have

$$\begin{aligned}
R^*_s &= (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} \left((1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^*_s\right) \\
&= (1-\alpha) R^{(0)}_s + \alpha(1-\alpha) L_{ss} L_{sw} R^{(0)}_w + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} R^*_s \\
&= \left((1-\alpha) R^{(0)}_s e^T + \alpha(1-\alpha) L_{ss} L_{sw} R^{(0)}_w e^T + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws}\right) R^*_s \\
&= M_2 R^*_s. \qquad (4.11)
\end{aligned}$$
The closed-form solution $R^*_s$ of (4.11) is the dominant eigenvector of $M_2$.
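The mutual updates of (4.9) can be sketched in the same style as the single-graph walk; the small column-normalized matrices below are toy placeholders for the real relation matrices, and $\alpha = 0.9$ is again an assumed value.

```python
import numpy as np

def double_graph_walk(L_ss, L_sw, L_ww, L_ws, R0_s, R0_w,
                      alpha=0.9, tol=1e-10, max_iter=1000):
    """Mutually update R_s and R_w as in (4.9) until both converge."""
    R_s, R_w = R0_s.copy(), R0_w.copy()
    for _ in range(max_iter):
        # both updates use the scores from the previous iteration
        R_s_next = (1 - alpha) * R0_s + alpha * (L_ss @ (L_sw @ R_w))
        R_w_next = (1 - alpha) * R0_w + alpha * (L_ww @ (L_ws @ R_s))
        if (np.abs(R_s_next - R_s).sum() +
                np.abs(R_w_next - R_w).sum()) < tol:
            break
        R_s, R_w = R_s_next, R_w_next
    return R_s_next, R_w_next

# Toy column-normalized relation matrices for two slots and two words.
L_ss = np.array([[0.0, 1.0], [1.0, 0.0]])
L_ww = np.array([[0.0, 1.0], [1.0, 0.0]])
L_sw = np.array([[0.5, 0.5], [0.5, 0.5]])
L_ws = np.array([[0.5, 0.5], [0.5, 0.5]])
R_s, R_w = double_graph_walk(L_ss, L_sw, L_ww, L_ws,
                             np.array([0.6, 0.4]), np.array([0.5, 0.5]))
```

As in the single-graph case, column normalization of all four matrices keeps both score vectors normalized across iterations.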