The purpose of the ranking model is to distinguish between generic semantic concepts and domain-specific concepts that are relevant to an SDS. To induce meaningful slots for an SDS, we compute the prominence of the slot candidates by additionally considering structure information.
With the semantic parses from SEMAFOR, where each frame is viewed independently so that inter-slot relations are not captured, the model ranks the slot candidates by integrating two types of information: (1) the frequency of each slot candidate in the corpus, since slots with higher frequency may be more important, and (2) the relations between slot candidates. Assuming that domain-specific concepts are usually related to each other, globally considering inter-slot relations induces a more coherent slot set. As the baseline scores, we use only the frequency of each slot candidate as its prominence, without the structure information.

Figure 4.2: A simplified example of the two knowledge graphs, where a slot candidate $s_i$ is represented as a node in the semantic knowledge graph and a word $w_j$ is represented as a node in the lexical knowledge graph.
First we construct two knowledge graphs: a slot-based semantic knowledge graph and a word-based lexical knowledge graph, both of which encode the typed dependency relations in their edge weights. We also connect the two graphs to model the relations between slot-filler pairs.
4.4.1 Knowledge Graphs
We construct two undirected graphs: a semantic knowledge graph and a lexical knowledge graph. Each node in the semantic knowledge graph is a slot candidate $s_i$ output by the frame-semantic parser, and each node in the lexical knowledge graph is a word $w_j$.
• The slot-based semantic knowledge graph is built as $G_s = \langle V_s, E_{ss} \rangle$, where $V_s = \{s_i\}$ and $E_{ss} = \{e_{ij} \mid s_i, s_j \in V_s\}$.
• The word-based lexical knowledge graph is built as $G_w = \langle V_w, E_{ww} \rangle$, where $V_w = \{w_i\}$ and $E_{ww} = \{e_{ij} \mid w_i, w_j \in V_w\}$.
With the two knowledge graphs, we build edges between slots and slot-fillers to integrate them, as shown in Figure 4.2. The combined graph can thus be formulated as $G = \langle V_s, V_w, E_{ss}, E_{ww}, E_{ws} \rangle$, where $E_{ws} = \{e_{ij} \mid w_i \in V_w, s_j \in V_s\}$. $E_{ss}$, $E_{ww}$, and $E_{ws}$ correspond to the slot-to-slot relations, the word-to-word relations, and the word-to-slot relations, respectively [16, 17].
Figure 4.3: The dependency parsing result of the utterance "can i have a cheap restaurant", where the arcs are labeled with the typed dependencies (ccomp, nsubj, dobj, det, amod) and the slot-fillers are annotated with the induced slots (capability, expensiveness, locale_by_use).
4.4.2 Edge Weight Estimation
Considering the relations in the knowledge graphs, the edge weights for $E_{ww}$ and $E_{ss}$ are measured based on the dependency parsing results. The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4.3. The arrows denote the dependency relations from headwords to their dependents, and the labels on the arcs denote the types of the dependencies. All typed dependencies between two words are encoded as triples and form a word-based dependency set $T_w = \{\langle w_i, t, w_j \rangle\}$, where $t$ is the typed dependency between the headword $w_i$ and the dependent $w_j$. For example, Figure 4.3 generates $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$, $\langle \text{have}, \text{dobj}, \text{restaurant} \rangle$, etc. for $T_w$. Similarly, we build a slot-based dependency set $T_s = \{\langle s_i, t, s_j \rangle\}$ by transforming dependencies between slot-fillers into dependencies between the corresponding slots. For example, $\langle \text{restaurant}, \text{amod}, \text{cheap} \rangle$ from $T_w$ is transformed into $\langle \text{locale\_by\_use}, \text{amod}, \text{expensiveness} \rangle$ for building $T_s$, because both words are parsed as slot-fillers by SEMAFOR.
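The construction of $T_w$ and $T_s$ described above can be sketched as follows. This is a minimal illustration, not the original implementation; the `filler2slot` mapping stands in for SEMAFOR's filler-to-slot output, and the triples are taken from the Figure 4.3 example.

```python
from collections import Counter

def build_dependency_sets(parses, filler2slot):
    """Build the word-based dependency set T_w and the slot-based set T_s
    as multisets of typed-dependency triples (head, type, dependent)."""
    T_w, T_s = Counter(), Counter()
    for triples in parses:  # one utterance = a list of dependency triples
        for head, dep_type, dependent in triples:
            T_w[(head, dep_type, dependent)] += 1
            # keep a slot-level triple only when both words are slot-fillers
            if head in filler2slot and dependent in filler2slot:
                T_s[(filler2slot[head], dep_type, filler2slot[dependent])] += 1
    return T_w, T_s

# The utterance of Figure 4.3; the filler-to-slot mapping is assumed here.
parses = [[("have", "ccomp", "can"), ("have", "nsubj", "i"),
           ("have", "dobj", "restaurant"), ("restaurant", "det", "a"),
           ("restaurant", "amod", "cheap")]]
filler2slot = {"cheap": "expensiveness", "restaurant": "locale_by_use"}
T_w, T_s = build_dependency_sets(parses, filler2slot)
```

Using `Counter` keeps the occurrence counts that the frequency-based measurement in Section 4.4.2.1 needs.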
For the edges within a single knowledge graph, we assign the weight of the edge connecting nodes $x_i$ and $x_j$ as $\hat{r}(x_i, x_j)$, where $x$ refers to either $s$ or $w$. Since the weights are measured based on the relations between nodes regardless of direction, we combine the scores of the two directional dependencies:

$$\hat{r}(x_i, x_j) = r(x_i \to x_j) + r(x_j \to x_i), \qquad (4.1)$$

where $r(x_i \to x_j)$ is the score estimating the dependency with $x_i$ as the head and $x_j$ as the dependent. In Sections 4.4.2.1 and 4.4.2.2, we propose two scoring functions for $r(\cdot)$: a frequency-based $r_1(\cdot)$ and an embedding-based $r_2(\cdot)$, respectively.
For the edges in $E_{ws}$, we estimate the edge weights based on the frequency with which the slot candidates and the words are parsed as slot-filler pairs. In other words, the edge weight between the slot-filler $w_i$ and the slot candidate $s_j$, $\hat{r}(w_i, s_j)$, equals the number of times the filler $w_i$ corresponds to the slot candidate $s_j$ in the parsing results.
       Typed Dependency Relation                     Target Word      Contexts
Word   ⟨restaurant, amod, cheap⟩                     restaurant       cheap/amod
                                                     cheap            restaurant/amod⁻¹
Slot   ⟨locale_by_use, amod, expensiveness⟩          locale_by_use    expensiveness/amod
                                                     expensiveness    locale_by_use/amod⁻¹

Table 4.1: The contexts extracted for training dependency-based word/slot embeddings from the utterance of Figure 4.3.
4.4.2.1 Frequency-Based Measurement
Based on the dependency set $T_x$, we use $t^*_{x_i \to x_j}$ to denote the most frequent typed dependency with $x_i$ as the head and $x_j$ as the dependent:

$$t^*_{x_i \to x_j} = \arg\max_t C(x_i \xrightarrow{t} x_j), \qquad (4.2)$$

where $C(x_i \xrightarrow{t} x_j)$ counts how many times the dependency $\langle x_i, t, x_j \rangle$ occurs in the dependency set $T_x$. Then the scoring function that estimates the dependency $x_i \to x_j$ is measured as

$$r_1(x_i \to x_j) = C(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j), \qquad (4.3)$$

which equals the highest observed frequency of the dependency $x_i \to x_j$ among all types in $T_x$.
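A minimal sketch of the frequency-based measurement, assuming $T_x$ is stored as a `Counter` over triples (the toy counts below are illustrative, not from the corpus):

```python
from collections import Counter

def r1(T_x, x_i, x_j):
    """Frequency-based score (4.3): the count of the most frequent typed
    dependency with x_i as head and x_j as dependent, or 0 if none exists."""
    counts = [c for (head, t, dep), c in T_x.items()
              if head == x_i and dep == x_j]
    return max(counts, default=0)

# Toy counts: amod is the dominant type for restaurant -> cheap.
T_w = Counter({("restaurant", "amod", "cheap"): 3,
               ("restaurant", "nmod", "cheap"): 1})
score = r1(T_w, "restaurant", "cheap")  # frequency of the dominant type, amod
# Undirected edge weight (4.1): sum of the two directional scores.
r_hat = r1(T_w, "restaurant", "cheap") + r1(T_w, "cheap", "restaurant")
```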
4.4.2.2 Embedding-Based Measurement
It has been shown that the dependency-based embedding approach introduced in Section 2.5.2 of Chapter 2 captures more functional similarity, because it uses dependency-based syntactic contexts for training word embeddings [64]. Table 4.1 shows the dependency-based contexts extracted for each target word from the example in Figure 4.3, where headwords and their dependents form each other's contexts by following the arcs in the dependency tree, and $^{-1}$ denotes the reversed direction of the dependency. We learn vector representations for both words and contexts such that the dot product $v_w \cdot v_c$ is maximized for the "good" word-context pairs in the training data. We can then obtain the dependency-based slot and word embeddings using $T_s$ and $T_w$ respectively.
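The context extraction of Table 4.1 can be sketched as below; this is a simplified illustration of the pair-generation step only (the actual embedding training of [64] is not reproduced here), and the `-1` suffix encodes the reversed dependency direction.

```python
def dependency_contexts(triples):
    """Extract (target, context) pairs as in Table 4.1: for <head, t, dep>,
    the head gets the context dep/t and the dependent gets head/t^-1."""
    pairs = []
    for head, t, dep in triples:
        pairs.append((head, f"{dep}/{t}"))      # head sees its dependent
        pairs.append((dep, f"{head}/{t}-1"))    # dependent sees its head
    return pairs

pairs = dependency_contexts([("locale_by_use", "amod", "expensiveness")])
```

These (target, context) pairs are what a dependency-based skip-gram model would be trained on.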
With the trained dependency-based embeddings, we estimate the probability that $x_i$ is the headword and $x_j$ is its dependent via the typed dependency $t$ as

$$P(x_i \xrightarrow{t} x_j) = \frac{\text{Sim}(x_i, x_j/t) + \text{Sim}(x_j, x_i/t^{-1})}{2}, \qquad (4.4)$$

where $\text{Sim}(x_i, x_j/t)$ is the cosine similarity between the word/slot embedding $v_{x_i}$ and the context embedding $v_{x_j/t}$, normalized to $[0, 1]$. Then we measure the scoring function $r_2(\cdot)$ as

$$r_2(x_i \to x_j) = C(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j) \cdot P(x_i \xrightarrow{t^*_{x_i \to x_j}} x_j), \qquad (4.5)$$

which is similar to (4.3) but additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting on the smaller dataset.
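A sketch of the embedding-based measurement, under the assumption that trained word/slot and context embeddings are available as dictionaries of vectors (the 2-d vectors below are toy values chosen to make the arithmetic transparent):

```python
import numpy as np

def sim(v, c):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    cos = float(np.dot(v, c)) / (np.linalg.norm(v) * np.linalg.norm(c))
    return (cos + 1.0) / 2.0

def r2(count, emb, ctx_emb, x_i, x_j, t):
    """Embedding-based score (4.5): the observed frequency of the dominant
    typed dependency, weighted by the estimated probability (4.4)."""
    p = (sim(emb[x_i], ctx_emb[f"{x_j}/{t}"]) +
         sim(emb[x_j], ctx_emb[f"{x_i}/{t}-1"])) / 2.0
    return count * p

# Toy 2-d embeddings; real ones would be trained on T_w / T_s.
emb = {"restaurant": np.array([1.0, 0.0]), "cheap": np.array([1.0, 0.0])}
ctx_emb = {"cheap/amod": np.array([1.0, 0.0]),
           "restaurant/amod-1": np.array([0.0, 1.0])}
score = r2(4, emb, ctx_emb, "restaurant", "cheap", "amod")
```

Here the two similarities are 1.0 and 0.5, so the probability in (4.4) is 0.75 and the frequency 4 is smoothed down to 3.0.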
4.4.3 Random Walk Algorithm
We first compute $L_{ww} = [\hat{r}(w_i, w_j)]_{|V_w| \times |V_w|}$ and $L_{ss} = [\hat{r}(s_i, s_j)]_{|V_s| \times |V_s|}$, where $\hat{r}(w_i, w_j)$ and $\hat{r}(s_i, s_j)$ come from either the frequency-based measurement ($r_1(\cdot)$) or the embedding-based measurement ($r_2(\cdot)$). Similarly, $L_{ws} = [\hat{r}(w_i, s_j)]_{|V_w| \times |V_s|}$ and $L_{sw} = L_{ws}^T$, where $\hat{r}(w_i, s_j)$ is the frequency with which $s_j$ and $w_i$ form a slot-filler pair, as computed in Section 4.4.2. We then keep only the top $N$ highest weights in each row of $L_{ww}$ and $L_{ss}$ ($N = 10$); that is, we filter out the edges with smaller weights within each single knowledge graph. Column normalization is performed on $L_{ww}$, $L_{ss}$, $L_{ws}$, and $L_{sw}$ [83]. These can be viewed as the word-to-word, slot-to-slot, word-to-slot, and slot-to-word relation matrices.
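The matrix preprocessing above (row-wise top-$N$ filtering followed by column normalization) can be sketched as follows; this is a minimal NumPy illustration with a toy matrix, not the original code.

```python
import numpy as np

def topn_filter(L, n):
    """Keep only the n largest weights in each row; zero out the rest."""
    out = np.zeros_like(L)
    for i, row in enumerate(L):
        keep = np.argsort(row)[-n:]   # indices of the n largest entries
        out[i, keep] = row[keep]
    return out

def column_normalize(L):
    """Scale every column to sum to 1 (all-zero columns are left unchanged)."""
    sums = L.sum(axis=0, keepdims=True).copy()
    sums[sums == 0] = 1.0
    return L / sums

# Toy slot-to-slot weight matrix; N = 2 here instead of the paper's N = 10.
L_ss = np.array([[0.2, 0.5, 0.1],
                 [0.4, 0.1, 0.3],
                 [0.1, 0.2, 0.6]])
L_ss = column_normalize(topn_filter(L_ss, 2))
```

Column normalization makes each column a probability distribution, which is what keeps the random-walk scores in the next section normalized.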
4.4.3.1 Single-Graph Random Walk
Here we run a random walk only on the semantic knowledge graph to propagate the scores through the edges $E_{ss}$ based on inter-slot relations:

$$R^{(t+1)}_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} R^{(t)}_s, \qquad (4.6)$$

where $R^{(t)}_s$ denotes the importance scores of the slot candidates $V_s$ in the $t$-th iteration. In the algorithm, each score is an interpolation of two scores: the normalized baseline importance of the slot candidates ($R^{(0)}_s$) and the scores propagated from the neighboring nodes in the semantic knowledge graph via the slot-to-slot relations $L_{ss}$. The algorithm converges when $R^{(t+1)}_s = R^{(t)}_s = R^*_s$, at which point (4.7) is satisfied:

$$R^*_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} R^*_s. \qquad (4.7)$$

We can solve for $R^*_s$ as

$$R^*_s = \left((1-\alpha) R^{(0)}_s e^T + \alpha L_{ss}\right) R^*_s = M_1 R^*_s, \qquad (4.8)$$

where $e = [1, 1, \dots, 1]^T$, so that $e^T R^*_s = 1$ for the normalized scores. It has been shown that the closed-form solution $R^*_s$ of (4.8) is the dominant eigenvector of $M_1$ [58], i.e., the eigenvector corresponding to the largest absolute eigenvalue of $M_1$. The solution $R^*_s$ gives the updated importance scores of all slot candidates. Similar to the PageRank algorithm [13], the solution can also be obtained by iteratively updating $R^{(t)}_s$.
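The iterative solution of (4.6) can be sketched as a short power-iteration loop; the toy $L_{ss}$ and baseline vector below are illustrative, and $\alpha = 0.9$ is an assumed damping value, not necessarily the one used in the experiments.

```python
import numpy as np

def single_graph_walk(L_ss, R0, alpha=0.9, tol=1e-10, max_iter=1000):
    """Iterate (4.6), R <- (1-a) R0 + a L_ss R, until convergence."""
    R = R0.copy()
    for _ in range(max_iter):
        R_next = (1 - alpha) * R0 + alpha * (L_ss @ R)
        if np.abs(R_next - R).sum() < tol:
            break
        R = R_next
    return R_next

L_ss = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # column-normalized slot-to-slot relations
R0 = np.array([0.7, 0.3])       # normalized baseline prominence
R_star = single_graph_walk(L_ss, R0)
```

Because $L_{ss}$ is column-normalized and $R^{(0)}_s$ sums to one, every iterate stays normalized, matching the interpretation of $R^*_s$ as a score distribution over slot candidates.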
4.4.3.2 Double-Graph Random Walk
Here we borrow the idea of the two-layer mutually reinforced random walk to propagate the scores based not only on internal importance propagation within the same graph but also on external mutual reinforcement between the two knowledge graphs [16, 17].
$$\begin{cases}
R^{(t+1)}_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} R^{(t)}_w \\
R^{(t+1)}_w = (1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^{(t)}_s
\end{cases} \qquad (4.9)$$

In the algorithm, the scores are interpolations of two components: the normalized baseline importance ($R^{(0)}_s$ and $R^{(0)}_w$) and the scores propagated from the other graph. For the semantic knowledge graph, $L_{sw} R^{(t)}_w$ is the score vector from the word set weighted by the slot-to-word relations, and these scores are then propagated via the slot-to-slot relations $L_{ss}$. Similarly, the nodes of the lexical knowledge graph receive scores propagated from the semantic knowledge graph. $R^{(t+1)}_s$ and $R^{(t+1)}_w$ are thus mutually updated by the latter terms of (4.9) in each iteration. When the algorithm converges, the stationary scores $R^*_s$ and $R^*_w$ can be derived similarly.
$$\begin{cases}
R^*_s = (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} R^*_w \\
R^*_w = (1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^*_s
\end{cases} \qquad (4.10)$$

Substituting the second equation of (4.10) into the first, we have

$$\begin{aligned}
R^*_s &= (1-\alpha) R^{(0)}_s + \alpha L_{ss} L_{sw} \left((1-\alpha) R^{(0)}_w + \alpha L_{ww} L_{ws} R^*_s\right) \\
&= (1-\alpha) R^{(0)}_s + \alpha(1-\alpha) L_{ss} L_{sw} R^{(0)}_w + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} R^*_s \\
&= \left((1-\alpha) R^{(0)}_s e^T + \alpha(1-\alpha) L_{ss} L_{sw} R^{(0)}_w e^T + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws}\right) R^*_s \\
&= M_2 R^*_s. \qquad (4.11)
\end{aligned}$$
The closed-form solution $R^*_s$ of (4.11) is the dominant eigenvector of $M_2$.
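The mutual updates of (4.9) can be sketched in the same style as the single-graph walk; the small column-normalized matrices below are toy placeholders for the real relation matrices, and $\alpha = 0.9$ is again an assumed value.

```python
import numpy as np

def double_graph_walk(L_ss, L_sw, L_ww, L_ws, R0_s, R0_w,
                      alpha=0.9, tol=1e-10, max_iter=1000):
    """Mutually update R_s and R_w as in (4.9) until both converge."""
    R_s, R_w = R0_s.copy(), R0_w.copy()
    for _ in range(max_iter):
        # both updates use the scores from the previous iteration
        R_s_next = (1 - alpha) * R0_s + alpha * (L_ss @ (L_sw @ R_w))
        R_w_next = (1 - alpha) * R0_w + alpha * (L_ww @ (L_ws @ R_s))
        if (np.abs(R_s_next - R_s).sum() +
                np.abs(R_w_next - R_w).sum()) < tol:
            break
        R_s, R_w = R_s_next, R_w_next
    return R_s_next, R_w_next

# Toy column-normalized relation matrices for two slots and two words.
L_ss = np.array([[0.0, 1.0], [1.0, 0.0]])
L_ww = np.array([[0.0, 1.0], [1.0, 0.0]])
L_sw = np.array([[0.5, 0.5], [0.5, 0.5]])
L_ws = np.array([[0.5, 0.5], [0.5, 0.5]])
R_s, R_w = double_graph_walk(L_ss, L_sw, L_ww, L_ws,
                             np.array([0.6, 0.4]), np.array([0.5, 0.5]))
```

As in the single-graph case, column normalization of all four matrices keeps both score vectors normalized across iterations.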