
4.3.3.2 Double-Graph Random Walk

We borrow the idea of the two-layer mutually reinforced random walk to propagate scores based not only on internal importance propagation within the same graph but also on external mutual reinforcement between the two knowledge graphs [26, 27, 93].

$$
\begin{cases}
R_s^{(t+1)} = (1 - \alpha)\, R_s^{(0)} + \alpha\, L_{ss} L_{sw} R_w^{(t)} \\
R_w^{(t+1)} = (1 - \alpha)\, R_w^{(0)} + \alpha\, L_{ww} L_{ws} R_s^{(t)}
\end{cases}
\tag{4.9}
$$

In the algorithm, both updates interpolate two scores: the normalized baseline importance ($R_s^{(0)}$ and $R_w^{(0)}$) and the score propagated from the other graph. For the semantic knowledge graph, $L_{sw} R_w^{(t)}$ is the score contributed by the word set, weighted by slot-to-word relations, and these scores are then propagated along slot-to-slot relations $L_{ss}$. Similarly, nodes in the lexical knowledge graph receive scores propagated from the semantic knowledge graph. $R_s^{(t+1)}$ and $R_w^{(t+1)}$ are thus mutually updated by the second terms of (4.9) iteratively. When the algorithm converges, the stationary scores $R_s$ and $R_w$ satisfy the fixed-point system (4.10).
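As a concrete illustration, the coupled updates of (4.9) can be sketched in a few lines of NumPy. This is a minimal sketch, not the thesis's implementation: the function name, the convergence test, and the assumption that all relation matrices are row-normalized are illustrative choices.

```python
import numpy as np

def double_graph_random_walk(R_s0, R_w0, L_ss, L_sw, L_ww, L_ws,
                             alpha=0.9, tol=1e-8, max_iter=1000):
    """Iterate the coupled updates of (4.9) until both score vectors converge.

    R_s0, R_w0 : normalized baseline importance of slots / words
    L_ss, L_ww : row-normalized slot-to-slot / word-to-word relation matrices
    L_sw, L_ws : slot-to-word / word-to-slot relation matrices
    """
    R_s, R_w = R_s0.copy(), R_w0.copy()
    for _ in range(max_iter):
        # Interpolate the baseline score with the score propagated
        # from the other graph, as in (4.9).
        R_s_new = (1 - alpha) * R_s0 + alpha * L_ss @ (L_sw @ R_w)
        R_w_new = (1 - alpha) * R_w0 + alpha * L_ww @ (L_ws @ R_s)
        done = (np.abs(R_s_new - R_s).max() < tol
                and np.abs(R_w_new - R_w).max() < tol)
        R_s, R_w = R_s_new, R_w_new
        if done:
            break
    return R_s, R_w
```

With row-stochastic relation matrices the linear part of each update contracts by a factor $\alpha < 1$, so the iteration converges to a unique fixed point.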

$$
\begin{cases}
R_s = (1 - \alpha)\, R_s^{(0)} + \alpha\, L_{ss} L_{sw} R_w \\
R_w = (1 - \alpha)\, R_w^{(0)} + \alpha\, L_{ww} L_{ws} R_s
\end{cases}
\tag{4.10}
$$

Substituting the second equation into the first, and using $e^T R_s = 1$ (where $e$ is the all-ones vector, since $R_s$ is normalized), gives

$$
\begin{aligned}
R_s &= (1 - \alpha)\, R_s^{(0)} + \alpha\, L_{ss} L_{sw} \left( (1 - \alpha)\, R_w^{(0)} + \alpha\, L_{ww} L_{ws} R_s \right) \\
    &= (1 - \alpha)\, R_s^{(0)} + \alpha (1 - \alpha)\, L_{ss} L_{sw} R_w^{(0)} + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} R_s \\
    &= \left( (1 - \alpha)\, R_s^{(0)} e^T + \alpha (1 - \alpha)\, L_{ss} L_{sw} R_w^{(0)} e^T + \alpha^2 L_{ss} L_{sw} L_{ww} L_{ws} \right) R_s \\
    &= M_2 R_s.
\end{aligned}
\tag{4.11}
$$

The closed-form solution Rs of (4.11) can be obtained from the dominant eigenvector of M2.
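A hypothetical sketch of that closed form: build $M_2$ from the three terms in (4.11) and extract its dominant eigenvector by power iteration, assuming $R_s$ is normalized so that $e^T R_s = 1$. The function name and iteration count below are illustrative.

```python
import numpy as np

def closed_form_slot_scores(R_s0, R_w0, L_ss, L_sw, L_ww, L_ws,
                            alpha=0.9, n_iter=500):
    """Return the dominant eigenvector of M2 from (4.11),
    normalized so that its entries sum to 1."""
    e = np.ones_like(R_s0)
    # M2 = (1-a) Rs0 e^T + a(1-a) Lss Lsw Rw0 e^T + a^2 Lss Lsw Lww Lws
    M2 = ((1 - alpha) * np.outer(R_s0, e)
          + alpha * (1 - alpha) * np.outer(L_ss @ (L_sw @ R_w0), e)
          + alpha ** 2 * L_ss @ L_sw @ L_ww @ L_ws)
    # Power iteration: repeatedly apply M2 and renormalize.
    R_s = e / e.size
    for _ in range(n_iter):
        R_s = M2 @ R_s
        R_s /= R_s.sum()
    return R_s
```

Since $M_2$ has strictly positive entries for positive inputs, Perron-Frobenius guarantees a unique dominant eigenvector, so power iteration converges.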

Table 4.2: The performance of induced slots and corresponding SLU models (%). The first four columns report results on ASR output and the last four on manual transcripts; AP and AUC evaluate slot induction, while WAP and AF evaluate the SLU model.

| Approach                 | AP    | AUC   | WAP   | AF    | AP    | AUC   | WAP   | AF    |
|--------------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| (a) Baseline (α = 0)     | 56.69 | 54.67 | 35.82 | 43.28 | 53.01 | 50.80 | 36.78 | 44.20 |
| (b) Single-graph: Freq.  | 63.88 | 62.05 | 41.67 | 47.38 | 63.02 | 61.10 | 43.76 | 48.53 |
| (c) Single-graph: Embed. | 69.04 | 68.25 | 46.29 | 48.89 | 75.15 | 74.50 | 54.50 | 50.86 |
| (d) Double-graph: Freq.  | 56.83 | 55.31 | 32.64 | 44.91 | 52.12 | 50.54 | 34.01 | 45.05 |
| (e) Double-graph: Embed. | 71.48 | 70.84 | 44.06 | 47.91 | 76.42 | 75.94 | 52.89 | 50.40 |

4.4.1 Experimental Setup

The data is the Cambridge University SLU corpus described in the previous chapter. For the parameter setting, the damping factor α for the random walk is empirically set to 0.9 for all experiments¹. For training semantic decoders, we use an SVM with a linear kernel to predict the probability of each semantic slot. We use the Stanford Parser to obtain collapsed typed syntactic dependencies and set the dimensionality of embeddings to d = 300 in all experiments [143].
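To make the decoder setup concrete, here is a minimal sketch of training one binary semantic decoder with scikit-learn. The bag-of-words features, function names, and toy data are placeholders, not the thesis's actual feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def train_slot_decoder(utterances, labels):
    """Train a linear-kernel SVM that scores whether an utterance carries a
    given semantic slot; probability=True enables Platt-scaled estimates."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(utterances)
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, labels)
    return vectorizer, clf

def slot_probability(vectorizer, clf, utterance):
    """Probability of the positive class (slot present) for one utterance."""
    X = vectorizer.transform([utterance])
    pos = list(clf.classes_).index(1)
    return clf.predict_proba(X)[0, pos]
```

One such decoder would be trained per induced slot, with its probability output feeding the SLU evaluation.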

For evaluation, we measure the quality of induced slots as their proximity to the reference slots. Figure 3.3 shows the mappings between induced slots and reference slots [31]. As in Chapter 3, we use AP and AUC to evaluate slot induction, and WAP and AF to evaluate slot induction and SLU jointly.

4.4.2 Evaluation Results

Table 4.2 shows the results on both ASR output and manual transcripts. Row (a) is the baseline, which ranks slot candidates by frequency alone. Rows (b) and (c) show the performance after leveraging the semantic knowledge graph through a single-graph random walk. Rows (d) and (e) show the results after combining the two knowledge graphs. Almost all results improve when inter-slot relations are additionally considered, via either single- or double-graph random walk, for both ASR and manual transcripts.

4.4.2.1 Slot Induction

For both ASR and manual transcripts, almost all results outperform the baseline, which shows that inter-slot relations significantly influence slot-induction performance. The best performance comes from the double-graph random walk with the embedding-based measurement, which integrates the semantic and lexical knowledge graphs and jointly considers slot-to-slot, word-to-word, and word-to-slot relations when scoring the prominence of slot candidates, producing a coherent slot set.

¹The performance differs from the results in Chapter 3 because these experiments do not require a development set.

4.4.2.2 SLU Model

For both ASR and manual transcripts, almost all results outperform the baseline, which demonstrates the practical value for training dialogue systems. The best performance comes from the single-graph random walk with the embedding-based measurement, which uses only the semantic knowledge graph to incorporate inter-slot relations. The semantic knowledge graph is not as precise as the lexical one and may be influenced more by the performance of the semantic parser. Although row (e) does not outperform row (c), the double-graph random walk may be more robust because it additionally includes word relations, avoiding reliance only on relations tied to slot candidates.

4.4.3 Discussion and Analysis

4.4.3.1 Comparing Frequency- and Embedding-Based Measurements

Table 4.2 shows that all results with the embedding-based measurement outperform those with the frequency-based measurement. The frequency-based measurement also brings a large improvement for the single-graph approach, but not for the double-graph one. The reason is probably that using observed frequencies in the lexical knowledge graph causes overfitting on the smaller dataset. Additionally incorporating embedding information smooths the edge weights and mitigates data sparsity, improving performance, especially for the lexical knowledge graph.
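One plausible reading of this smoothing can be sketched as follows: interpolate the normalized observed frequencies with a non-negative cosine-similarity term computed from node embeddings. The interpolation weight β and the exact form below are assumptions for illustration, not the thesis's formula.

```python
import numpy as np

def smoothed_edge_weights(counts, embeddings, beta=0.5):
    """Blend observed co-occurrence frequencies with embedding similarity.

    counts     : (n, n) observed edge frequencies (often sparse / zero)
    embeddings : (n, d) node embedding matrix
    beta       : assumed weight on the embedding-similarity term
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = np.clip(unit @ unit.T, 0.0, None)     # truncated cosine similarity
    freq = counts / max(counts.sum(), 1e-12)    # normalize to a distribution
    sim = sim / max(sim.sum(), 1e-12)
    # Edges with zero observed counts still receive mass from the
    # similarity term, which is the smoothing effect described above.
    return (1 - beta) * freq + beta * sim
```

Under this scheme, an edge never observed in the small corpus still gets a weight proportional to how similar its endpoints' embeddings are.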

4.4.3.2 Comparing Single- and Double-Graph Approaches

Considering that the embedding-based measurement performs better, we compare only the single- and double-graph random-walk results that use it (rows (c) and (e)). The difference between them is not consistent across slot induction and SLU modeling.

For evaluating slot induction (AP and AUC), the double-graph random walk (row (e)) performs better on both ASR and manual results, which implies that additionally integrating the lexical knowledge graph helps decide a more coherent and complete slot set, because the score propagation can be modeled more precisely (with word-level as well as slot-level information).

Table 4.3: The top inter-slot relations learned from the training set of ASR outputs.

| Rank | Relation |
|------|----------|
| 1 | ⟨locale_by_use, nn, food⟩ |
| 2 | ⟨food, amod, expensiveness⟩ |
| 3 | ⟨locale_by_use, amod, expensiveness⟩ |
| 4 | ⟨seeking, prep_for, food⟩ |
| 5 | ⟨food, amod, relational_quantity⟩ |
| 6 | ⟨desiring, dobj, food⟩ |
| 7 | ⟨seeking, prep_for, locale_by_use⟩ |
| 8 | ⟨food, det, quantity⟩ |

However, for the SLU evaluation (WAP and AF), the single-graph random walk (row (c)) performs better, which may imply that the slots carrying coherent relations from row (e) do not have well-performing semantic decoders, so the overall performance decreases slightly.

For example, the double-graph random walk scores the slots locale_by_use and expensiveness higher than the slot contacting, while the single-graph method ranks the latter higher. The slots locale_by_use and expensiveness are more important in this domain, but contacting has a very accurate semantic decoder, so the double-graph approach does not show an improvement when evaluating SLU. This suggests a future direction: jointly optimizing slot coherence and SLU performance.

4.4.3.3 Relation Discovery Analysis

To interpret the inter-slot relations, Table 4.3 lists the highest-scoring relations connecting slots from the best results (row (e)). The output inter-slot relations are reasonable and usually connect two important semantic slots. The automatically learned structure can be used to construct a corresponding slot-based semantic knowledge graph, as in Figure 4.4a.

To evaluate the automatically learned knowledge graph, we also construct a semantic knowledge graph from human-annotated data, where the shown relations are the most frequent typed dependencies between two slots in the domain-expert annotations. This reference knowledge graph is shown in Figure 4.4b. We can clearly see similar structures between the generated graph and the reference graph, where nodes with the same colors represent semantically similar concepts. This demonstrates that inter-slot relations help decide a coherent and complete slot set and enhance the interpretability of semantic slots. Thus, from a practical perspective, developers can design the framework of dialogue systems more easily.

[Figure 4.4a depicts a graph over the slots locale_by_use, food, expensiveness, seeking, relational_quantity, and desiring, connected by NN, AMOD, PREP_FOR, and DOBJ edges; Figure 4.4b depicts a graph over the slots type, food, pricerange, task, and area, connected by AMOD, DOBJ, and PREP_IN edges.]

(a) A simplified example of the automatically derived knowledge graph.

(b) The reference knowledge graph.

Figure 4.4: The automatically and manually created knowledge graphs for a restaurant domain.