
The purpose of the ranking model is to distinguish between generic semantic concepts and domain-specific concepts that are relevant to an SDS. To induce meaningful slots for an SDS, we compute the prominence of the slot candidates by additionally considering structural information.

Given the semantic parses from SEMAFOR, in which each frame is viewed independently so that inter-slot relations are not captured, the model ranks the slot candidates by integrating two sources of information: (1) the frequency of each slot candidate in the corpus, since slots with higher frequency may be more important, and (2) the relations between slot candidates. Assuming that domain-specific concepts are usually related to each other, globally considering inter-slot relations induces a more coherent slot set.

Figure 4.2: A simplified example of the two knowledge graphs, where a slot candidate $s_i$ is represented as a node in the semantic knowledge graph and a word $w_j$ is represented as a node in the lexical knowledge graph.

Here, for the baseline scores, we use only the frequency of each slot candidate as its prominence, without the structural information.

First, we construct two knowledge graphs: a slot-based semantic knowledge graph and a word-based lexical knowledge graph, both of which encode the typed dependency relations in their edge weights. We also connect the two graphs to model the relations between slot-filler pairs.

4.4.1 Knowledge Graphs

We construct two undirected graphs: a semantic knowledge graph and a lexical knowledge graph. Each node in the semantic knowledge graph is a slot candidate $s_i$ output by the frame-semantic parser, and each node in the lexical knowledge graph is a word $w_j$.

• Slot-based semantic knowledge graph is built as $G_s = \langle V_s, E_{ss} \rangle$, where $V_s = \{s_i\}$ and $E_{ss} = \{e_{ij} \mid s_i, s_j \in V_s\}$.

• Word-based lexical knowledge graph is built as $G_w = \langle V_w, E_{ww} \rangle$, where $V_w = \{w_i\}$ and $E_{ww} = \{e_{ij} \mid w_i, w_j \in V_w\}$.

With the two knowledge graphs, we build edges between slots and slot-fillers to integrate them, as shown in Figure 4.2. The combined graph can thus be formulated as $G = \langle V_s, V_w, E_{ss}, E_{ww}, E_{ws} \rangle$, where $E_{ws} = \{e_{ij} \mid w_i \in V_w, s_j \in V_s\}$. $E_{ss}$, $E_{ww}$, and $E_{ws}$ correspond to the slot-to-slot relations, word-to-word relations, and word-to-slot relations, respectively [16, 17].
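For concreteness, the combined graph can be kept as two node sets plus three weighted edge maps. The following is a minimal sketch; the class and method names are illustrative, not taken from the original implementation:

```python
from collections import defaultdict

# A minimal sketch (illustrative names): the combined graph
# G = <V_s, V_w, E_ss, E_ww, E_ws> kept as two node sets plus three
# weighted edge maps.
class CombinedKnowledgeGraph:
    def __init__(self, slot_candidates, words):
        self.V_s = set(slot_candidates)     # nodes of the semantic knowledge graph
        self.V_w = set(words)               # nodes of the lexical knowledge graph
        self.E_ss = defaultdict(float)      # slot-to-slot edge weights
        self.E_ww = defaultdict(float)      # word-to-word edge weights
        self.E_ws = defaultdict(float)      # word-to-slot (slot-filler) edge weights

    def add_slot_edge(self, s_i, s_j, weight):
        self.E_ss[(s_i, s_j)] += weight

    def add_word_edge(self, w_i, w_j, weight):
        self.E_ww[(w_i, w_j)] += weight

    def add_filler_edge(self, w_i, s_j, weight=1.0):
        # Each observed slot-filler pair increments the word-to-slot weight.
        self.E_ws[(w_i, s_j)] += weight
```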

Figure 4.3: The dependency parsing result of the utterance "can i have a cheap restaurant"; the parsed slot candidates are capability, expensiveness, and locale_by_use, and the arcs carry the typed dependencies ccomp, amod, dobj, nsubj, and det.

4.4.2 Edge Weight Estimation

Considering the relations in the knowledge graphs, the edge weights for $E_{ww}$ and $E_{ss}$ are measured based on the dependency parsing results. The example utterance "can i have a cheap restaurant" and its dependency parse are illustrated in Figure 4.3. The arrows denote the dependency relations from headwords to their dependents, and the labels on the arcs denote the types of the dependencies. All typed dependencies between two words are encoded as triples and form a word-based dependency set $T_w = \{\langle w_i, t, w_j \rangle\}$, where $t$ is the typed dependency between the headword $w_i$ and the dependent $w_j$. For example, Figure 4.3 contributes $\langle$restaurant, amod, cheap$\rangle$, $\langle$have, dobj, restaurant$\rangle$, etc. to $T_w$. Similarly, we build a slot-based dependency set $T_s = \{\langle s_i, t, s_j \rangle\}$ by transforming dependencies between slot-fillers into dependencies between slots. For example, $\langle$restaurant, amod, cheap$\rangle$ from $T_w$ is transformed into $\langle$locale_by_use, amod, expensiveness$\rangle$ for building $T_s$, because both words are parsed as slot-fillers by SEMAFOR.
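A minimal sketch of this transformation is shown below, assuming the dependency parse is available as (head, type, dependent) triples and that a mapping from slot-fillers to their slot candidates has been extracted from the SEMAFOR output; the function and variable names are illustrative:

```python
# Minimal sketch of building the word-based and slot-based dependency sets
# T_w and T_s from one parsed utterance. The input formats and the
# filler_to_slot mapping are illustrative assumptions, not the actual
# parser/SEMAFOR interfaces.
def build_dependency_sets(dependencies, filler_to_slot):
    """dependencies: iterable of (head, dep_type, dependent) word triples;
    filler_to_slot: dict mapping a slot-filler word to its slot candidate."""
    T_w, T_s = [], []
    for head, dep_type, dependent in dependencies:
        T_w.append((head, dep_type, dependent))
        # Transform a dependency between two slot-fillers into one between slots.
        if head in filler_to_slot and dependent in filler_to_slot:
            T_s.append((filler_to_slot[head], dep_type, filler_to_slot[dependent]))
    return T_w, T_s

# Example triples from Figure 4.3: "can i have a cheap restaurant"
deps = [("have", "dobj", "restaurant"), ("restaurant", "amod", "cheap")]
fillers = {"cheap": "expensiveness", "restaurant": "locale_by_use"}
T_w, T_s = build_dependency_sets(deps, fillers)
# T_s now contains ("locale_by_use", "amod", "expensiveness")
```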

For the edges within a single knowledge graph, we assign the weight of the edge connecting nodes $x_i$ and $x_j$ as $\hat{r}(x_i, x_j)$, where $x$ is either $s$ or $w$. Since the weights are measured based on the relations between nodes regardless of direction, we combine the scores of the two directional dependencies:

$$\hat{r}(x_i, x_j) = r(x_i \to x_j) + r(x_j \to x_i), \tag{4.1}$$

where $r(x_i \to x_j)$ is the score estimating the dependency with $x_i$ as a head and $x_j$ as a dependent. In Sections 4.4.2.1 and 4.4.2.2, we propose two scoring functions for $r(\cdot)$: a frequency-based one $r_1(\cdot)$ and an embedding-based one $r_2(\cdot)$, respectively.

For the edges in $E_{ws}$, we estimate the edge weights based on the frequency with which the slot candidates and the words are parsed as slot-filler pairs. In other words, the edge weight between the slot-filler $w_i$ and the slot candidate $s_j$, $\hat{r}(w_i, s_j)$, is the number of times the filler $w_i$ corresponds to the slot candidate $s_j$ in the parsing results.

       Typed Dependency Relation                   Target Word      Contexts
Word   ⟨restaurant, amod, cheap⟩                   restaurant       cheap/amod
                                                   cheap            restaurant/amod⁻¹
Slot   ⟨locale_by_use, amod, expensiveness⟩        locale_by_use    expensiveness/amod
                                                   expensiveness    locale_by_use/amod⁻¹

Table 4.1: The contexts extracted for training dependency-based word/slot embeddings from the utterance in Figure 4.3.

4.4.2.1 Frequency-Based Measurement

Based on the dependency set $T_x$, we use $t_{x_i \to x_j}$ to denote the most frequent typed dependency with $x_i$ as a head and $x_j$ as a dependent:

$$t_{x_i \to x_j} = \arg\max_{t} C(x_i \xrightarrow{t} x_j), \tag{4.2}$$

where $C(x_i \xrightarrow{t} x_j)$ counts how many times the dependency $\langle x_i, t, x_j \rangle$ occurs in the dependency set $T_x$. Then the scoring function that estimates the dependency $x_i \to x_j$ is measured as

$$r_1(x_i \to x_j) = C(x_i \xrightarrow{t_{x_i \to x_j}} x_j), \tag{4.3}$$

which equals the highest observed frequency of the dependency $x_i \to x_j$ among all types in $T_x$.
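The frequency-based measurement can be implemented directly from the dependency sets. The sketch below uses illustrative names and computes (4.2), (4.3), and the combined weight (4.1):

```python
from collections import Counter

# Minimal sketch of the frequency-based measurement; names are illustrative.
# T_x is a dependency set such as T_w or T_s, i.e. a list of
# (head, dep_type, dependent) triples.
def r1(T_x, x_i, x_j):
    # Count every typed dependency with x_i as head and x_j as dependent.
    counts = Counter(t for head, t, dep in T_x if head == x_i and dep == x_j)
    if not counts:
        return 0
    # (4.2): choose the most frequent type; (4.3): return its observed frequency.
    _, freq = counts.most_common(1)[0]
    return freq

def r_hat(T_x, x_i, x_j):
    # (4.1): combine the scores of the two directional dependencies.
    return r1(T_x, x_i, x_j) + r1(T_x, x_j, x_i)
```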

4.4.2.2 Embedding-Based Measurement

As introduced in Section 2.5.2 of Chapter 2, a dependency-based embedding approach is able to capture more functional similarity because it uses dependency-based syntactic contexts for training word embeddings [64]. Table 4.1 shows the dependency-based contexts extracted for each target word from the example in Figure 4.3, where a headword and its dependent form each other's contexts by following the arcs in the dependency tree, and $^{-1}$ denotes the direction of the dependency. We learn vector representations for both words and contexts such that the dot product $v_w \cdot v_c$ is maximized for the word-context pairs observed in the training data. Then we obtain the dependency-based slot and word embeddings using $T_s$ and $T_w$, respectively.

With the trained dependency-based embeddings, we estimate the probability that $x_i$ is the headword and $x_j$ is its dependent via the typed dependency $t$ as

$$P(x_i \xrightarrow{t} x_j) = \frac{\mathrm{Sim}(x_i, x_j/t) + \mathrm{Sim}(x_j, x_i/t^{-1})}{2}, \tag{4.4}$$

where $\mathrm{Sim}(x_i, x_j/t)$ is the cosine similarity between the word/slot embedding $v_{x_i}$ and the context embedding $v_{x_j/t}$, normalized to $[0, 1]$. Then we measure the scoring function $r_2(\cdot)$ as

$$r_2(x_i \to x_j) = C(x_i \xrightarrow{t_{x_i \to x_j}} x_j) \cdot P(x_i \xrightarrow{t_{x_i \to x_j}} x_j), \tag{4.5}$$

which is similar to (4.3) but additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting on the relatively small dataset.
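A minimal sketch of the embedding-based measurement is given below. It assumes the dependency-based target and context embeddings are already trained and stored in dictionaries keyed by strings such as "cheap", "restaurant/amod", and "restaurant/amod-1"; these key formats and all names are illustrative assumptions:

```python
import numpy as np
from collections import Counter

# Minimal sketch of the embedding-based measurement. The dependency-based
# target embeddings `word_vec` (v_x) and context embeddings `ctx_vec`
# (v_{x/t}) are assumed to be trained already; the key formats are illustrative.
def cos_norm(u, v):
    # Cosine similarity rescaled from [-1, 1] to [0, 1].
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (sim + 1.0) / 2.0

def dependency_prob(word_vec, ctx_vec, x_i, x_j, t):
    # (4.4): average the head -> dependent and dependent -> head^{-1} similarities.
    return 0.5 * (cos_norm(word_vec[x_i], ctx_vec[f"{x_j}/{t}"]) +
                  cos_norm(word_vec[x_j], ctx_vec[f"{x_i}/{t}-1"]))

def r2(T_x, word_vec, ctx_vec, x_i, x_j):
    # (4.5): the highest observed frequency weighted by the estimated probability.
    counts = Counter(t for head, t, dep in T_x if head == x_i and dep == x_j)
    if not counts:
        return 0.0
    t_best, freq = counts.most_common(1)[0]
    return freq * dependency_prob(word_vec, ctx_vec, x_i, x_j, t_best)
```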

4.4.3 Random Walk Algorithm

We first compute $L_{ww} = [\hat{r}(w_i, w_j)]_{|V_w| \times |V_w|}$ and $L_{ss} = [\hat{r}(s_i, s_j)]_{|V_s| \times |V_s|}$, where $\hat{r}(w_i, w_j)$ and $\hat{r}(s_i, s_j)$ come from either the frequency-based measurement ($r_1(\cdot)$) or the embedding-based measurement ($r_2(\cdot)$). Similarly, $L_{ws} = [\hat{r}(w_i, s_j)]_{|V_w| \times |V_s|}$ and $L_{sw} = [\hat{r}(w_i, s_j)]^T_{|V_w| \times |V_s|}$, where $\hat{r}(w_i, s_j)$ is the frequency with which $s_j$ and $w_i$ are parsed as a slot-filler pair, computed as in Section 4.4.2. We then keep only the top $N$ highest weights in each row of $L_{ww}$ and $L_{ss}$ ($N = 10$), filtering out the edges with smaller weights within each single knowledge graph. Column normalization is performed on $L_{ww}$, $L_{ss}$, $L_{ws}$, and $L_{sw}$ [83]. These can be viewed as the word-to-word, slot-to-slot, word-to-slot, and slot-to-word relation matrices.
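A minimal sketch of this preprocessing step is given below, assuming the raw edge weights are available through a weight function; the helper names are illustrative:

```python
import numpy as np

# Minimal sketch of preparing the relation matrices; helper names and the
# weight functions stand in for the estimates of Section 4.4.2.
def build_matrix(n_rows, n_cols, weight):
    # weight(i, j) returns the estimated edge weight between node i and node j.
    return np.array([[weight(i, j) for j in range(n_cols)] for i in range(n_rows)],
                    dtype=float)

def keep_top_n(L, n=10):
    # Keep only the N largest weights in each row; zero out the rest.
    out = np.zeros_like(L)
    for i, row in enumerate(L):
        top = np.argsort(row)[-n:]
        out[i, top] = row[top]
    return out

def column_normalize(L):
    col_sums = L.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0   # avoid division by zero for empty columns
    return L / col_sums

# Usage sketch:
#   L_ww = column_normalize(keep_top_n(build_matrix(n_w, n_w, r_hat_ww)))
#   L_ss = column_normalize(keep_top_n(build_matrix(n_s, n_s, r_hat_ss)))
#   raw_ws = build_matrix(n_w, n_s, slot_filler_count)
#   L_ws = column_normalize(raw_ws)
#   L_sw = column_normalize(raw_ws.T)
```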

4.4.3.1 Single-Graph Random Walk

Here we run the random walk only on the semantic knowledge graph, propagating the scores through the edges $E_{ss}$ based on inter-slot relations.

$$R_s^{(t+1)} = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} R_s^{(t)}, \tag{4.6}$$

where $R_s^{(t)}$ denotes the importance scores of the slot candidates $V_s$ in the $t$-th iteration. In the algorithm, the score is the interpolation of two components: the normalized baseline importance of the slot candidates ($R_s^{(0)}$) and the scores propagated from the neighboring nodes in the semantic knowledge graph based on the slot-to-slot relations $L_{ss}$. The algorithm converges when $R_s^{(t+1)} = R_s^{(t)} = R_s$, i.e. when (4.7) is satisfied:

$$R_s = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} R_s. \tag{4.7}$$

Since the importance scores are normalized such that $e^T R_s = 1$, we can solve for $R_s$ as

$$R_s = \left( (1 - \alpha) R_s^{(0)} e^T + \alpha L_{ss} \right) R_s = M_1 R_s, \tag{4.8}$$

where $e = [1, 1, \dots, 1]^T$. It has been shown that the closed-form solution $R_s$ of (4.8) is the dominant eigenvector of $M_1$ [58], i.e. the eigenvector corresponding to the largest absolute eigenvalue of $M_1$. The solution $R_s$ gives the updated importance scores of all slot candidates. Similar to the PageRank algorithm [13], the solution can also be obtained by iteratively updating $R_s^{(t)}$.
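The iterative update of (4.6) can be sketched as follows; the parameter values (the interpolation weight $\alpha$, the tolerance, and the iteration cap) are illustrative assumptions rather than the settings used in the experiments:

```python
import numpy as np

# Minimal sketch of the single-graph random walk (4.6). L_ss is the
# column-normalized slot-to-slot matrix and R0_s the normalized baseline
# importance vector; alpha and tol are illustrative settings.
def single_graph_random_walk(L_ss, R0_s, alpha=0.9, tol=1e-8, max_iter=1000):
    R_s = R0_s.copy()
    for _ in range(max_iter):
        R_next = (1 - alpha) * R0_s + alpha * L_ss.dot(R_s)
        if np.linalg.norm(R_next - R_s, 1) < tol:   # converged: R^(t+1) ≈ R^(t)
            return R_next
        R_s = R_next
    return R_s
```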

4.4.3.2 Double-Graph Random Walk

Here we borrow the idea of the two-layer mutually reinforced random walk, propagating scores based not only on internal importance propagation within the same graph but also on external mutual reinforcement between the two knowledge graphs [16, 17].

$$
\begin{cases}
R_s^{(t+1)} = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} R_w^{(t)} \\
R_w^{(t+1)} = (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s^{(t)}
\end{cases}
\tag{4.9}
$$

In the algorithm, the scores are interpolations of two components: the normalized baseline importance ($R_s^{(0)}$ and $R_w^{(0)}$) and the scores propagated from the other graph. For the semantic knowledge graph, $L_{sw} R_w^{(t)}$ is the score contribution from the word set weighted by the slot-to-word relations, and these scores are then propagated along the slot-to-slot relations $L_{ss}$. Similarly, the nodes of the lexical knowledge graph receive scores propagated from the semantic knowledge graph. $R_s^{(t+1)}$ and $R_w^{(t+1)}$ are thus mutually updated by (4.9) iteratively. When the algorithm converges, $R_s$ and $R_w$ satisfy

$$
\begin{cases}
R_s = (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} R_w \\
R_w = (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s
\end{cases}
\tag{4.10}
$$

Substituting the expression for $R_w$ into the equation for $R_s$ yields

$$
\begin{aligned}
R_s &= (1 - \alpha) R_s^{(0)} + \alpha L_{ss} L_{sw} \left( (1 - \alpha) R_w^{(0)} + \alpha L_{ww} L_{ws} R_s \right) \\
    &= (1 - \alpha) R_s^{(0)} + \alpha (1 - \alpha) L_{ss} L_{sw} R_w^{(0)} + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} R_s \\
    &= \left( (1 - \alpha) R_s^{(0)} e^T + \alpha (1 - \alpha) L_{ss} L_{sw} R_w^{(0)} e^T + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} \right) R_s \\
    &= M_2 R_s.
\end{aligned}
\tag{4.11}
$$

The closed-form solution $R_s$ of (4.11) is the dominant eigenvector of $M_2$.
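A sketch of the iterative form of (4.9) is given below; as before, the parameter values are illustrative assumptions rather than the settings used in the experiments:

```python
import numpy as np

# Minimal sketch of the double-graph (mutually reinforced) random walk (4.9).
# The matrices are the column-normalized relation matrices of Section 4.4.3;
# alpha and tol are illustrative settings.
def double_graph_random_walk(L_ss, L_sw, L_ww, L_ws, R0_s, R0_w,
                             alpha=0.9, tol=1e-8, max_iter=1000):
    R_s, R_w = R0_s.copy(), R0_w.copy()
    for _ in range(max_iter):
        # Each graph is updated from the previous-iteration scores of the other graph.
        R_s_next = (1 - alpha) * R0_s + alpha * L_ss.dot(L_sw.dot(R_w))
        R_w_next = (1 - alpha) * R0_w + alpha * L_ww.dot(L_ws.dot(R_s))
        delta = (np.linalg.norm(R_s_next - R_s, 1) +
                 np.linalg.norm(R_w_next - R_w, 1))
        R_s, R_w = R_s_next, R_w_next
        if delta < tol:
            break
    return R_s, R_w
```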
