
To evaluate the effectiveness of our induced slots, we perform two evaluations. First, we examine slot induction accuracy by comparing the ranked list of slots induced by frame-semantic parsing against the reference slots created by developers of the corresponding system [101].

Second, based on the ranked list of induced slots, we can train a semantic decoder for each slot to build an SLU component, and then evaluate the performance of our SLU model against the human-annotated semantic forms. For the experiments, we evaluate both on ASR transcripts of the raw audio and on the manual transcripts.

3.6.1 Experimental Setup

In this experiment, we used the Cambridge University SLU corpus, previously used for several other SLU tasks [52, 19]. The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple SDSs in an in-car setting. There were multiple recording conditions: 1) a stopped car with the air conditioning on and off; 2) a driving condition; and 3) a car simulator. The conditions are uniformly distributed in the corpus. The corpus contains a total of 2,166 dialogues and 15,453 utterances, separated into training and testing parts as shown in Table 3.1. The training part is used for self-training the SLU model.

The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1,868. An ASR system was used to transcribe the speech; the word error rate was reported as 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type. The parameter α in (3.2) is set empirically; we use α = 0.2 and N = 100 for all experiments.

To include distributional semantics information, we use the distributed vectors trained on 10^9 words from Google News1. Training was performed using the continuous bag-of-words architecture, which predicts the current word based on its context, with sub-sampling using

1https://code.google.com/p/word2vec/

                     Train     Test     Total
Dialogue              1522      644      2166
Utterance            10571     4882     15453
Male : Female        28 : 31   15 : 15  43 : 46
Native : Non-Native  33 : 26   21 : 9   54 : 47
Avg. #Slot            0.959    0.952     0.957

Table 3.1: The statistics of training and testing corpora

[Figure 3.3: a diagram mapping induced slots (e.g., speak_on_topic, part_orientational, part_inner_outer, direction, locale, expensiveness, food, origin, contacting, sending, commerce scenario, range, seeking, desiring, locating, locale_by_use, building) to the reference slots addr, area, food, phone, postcode, price range, task, and type.]

Figure 3.3: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

a threshold of 1 × 10^-5, and with negative sampling using 3 negative examples per positive one.

The resulting vectors have dimensionality 300, and the vocabulary size is 3 × 10^6; the entries contain both words and automatically derived phrases. This dataset provides a larger vocabulary and better coverage.
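As a rough illustration of the sub-sampling step above, one common form of the word2vec keep-probability (a sketch only; the exact variant differs across implementations) retains frequent words with a probability that shrinks as their corpus frequency grows:

```python
import math

# word2vec-style sub-sampling: with threshold t, a word with relative
# corpus frequency f(w) is kept with probability roughly sqrt(t / f(w)).
def keep_probability(word_freq, t=1e-5):
    """word_freq: relative frequency of the word in the corpus."""
    if word_freq <= t:
        return 1.0  # rare words are always kept
    return min(1.0, math.sqrt(t / word_freq))

# A very frequent word (~5% of tokens) is kept rarely; a rare word always.
print(keep_probability(0.05))
print(keep_probability(1e-6))  # 1.0
```

This shows why sub-sampling helps: generic high-frequency words contribute less to training, so the vectors better capture content words such as slot fillers.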

3.6.2 Evaluation Metrics

To eliminate the influence of threshold selection when choosing induced slots, the following metrics take the whole ranked list into account, so the evaluation is independent of any selected threshold.

3.6.2.1 Slot Induction

To evaluate the accuracy of the induced slots, we measure their quality as the proximity between induced slots and reference slots. Figure 3.3 shows the mappings that indicate semantically related induced slots and reference slots [20]. For example, "expensiveness → price", "food → food", and "direction → area" show that these induced slots can be mapped to the reference slots defined by experts and carry important semantics in the target domain for developing the task-oriented SDS. Note that two slots, name and signature, do not have proper mappings because they are too specific to the restaurant domain: name records the name of a restaurant, and signature refers to its signature dishes.

                               ASR                           Manual
                    Slot Induction  SLU Model     Slot Induction  SLU Model
Approach             AP     AUC     WAP    AF      AP     AUC     WAP    AF
Frequency (α = 0)   56.69  54.67   35.82  43.28   53.01  50.80   36.78  44.20
In-Domain           60.06  58.02   34.39  43.28   59.96  57.89   39.84  44.99
External            71.70  70.35   44.51  45.24   74.41  73.57   50.48  73.57
Max RI (%)          +26.5  +28.7   +24.3  +4.5    +40.4  +44.8   +37.2  +66.4

Table 3.2: The performance of slot induction and SLU modeling (%)
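The Max RI row can be reproduced directly from the table: it is the best approach's relative gain over the frequency baseline. A minimal check:

```python
# Max relative improvement (Max RI) over the frequency baseline in Table 3.2.
def relative_improvement(best, baseline):
    """Relative gain, in percent, of the best approach over the baseline."""
    return (best - baseline) / baseline * 100

print(round(relative_improvement(71.70, 56.69), 1))  # AP on ASR: 26.5
print(round(relative_improvement(73.57, 50.80), 1))  # AUC on manual: 44.8
```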

Our approach achieves 80% recall because we consider all output frames as slot candidates.

Since we define the adaptation task as a ranking problem, with a ranked list of induced slots and their associated scores, we can use the standard average precision (AP) as our metric, where an induced slot is counted as correct when it has a mapping to a reference slot. For a ranked list of induced slots l = (s_1, ..., s_k, ...), where s_k is the induced slot ranked at the k-th position, the average precision is

AP(l) = \frac{\sum_{k=1}^{n} P(k) \times \mathbb{1}[s_k \text{ has a mapping to a reference slot}]}{\text{number of induced slots with a mapping}},   (3.5)

where P(k) is the precision at cut-off k in the list and \mathbb{1}[\cdot] is an indicator function equal to 1 if the induced slot s_k ranked at position k has a mapping to a reference slot, and 0 otherwise. Since the slots generated by our method cover only 80% of the reference slots, the oracle recall is 80%.

Therefore, average precision is an appropriate metric for the slot ranking problem; it is also an approximation of the area under the precision-recall curve (AUC) [12].
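Equation (3.5) can be sketched in a few lines; the ranking below is a hypothetical example, not drawn from the corpus:

```python
# Average precision (Eq. 3.5) over a ranked list of induced slots; each
# entry records whether that slot has a mapping to a reference slot.
def average_precision(has_mapping):
    hits, precision_sum = 0, 0.0
    for k, correct in enumerate(has_mapping, start=1):
        if correct:
            hits += 1
            precision_sum += hits / k  # P(k) at each correct position
    return precision_sum / hits if hits else 0.0

# Slots at ranks 1, 2, and 4 map to reference slots:
print(average_precision([True, True, False, True]))  # (1/1 + 2/2 + 3/4) / 3
```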

3.6.2.2 SLU Model

While semantic slot induction is essential for providing semantic categories and imposing semantic constraints, we are also interested in understanding the performance of our unsupervised SLU models. For each induced slot with a mapping to a reference slot, we can compute an F-measure of the corresponding semantic decoder, and weight the average precision by the corresponding F-measure as weighted average precision (WAP) to evaluate the performance of slot induction and SLU tasks together. The metric scores the ranking result higher if the induced slots corresponding to better semantic decoders are ranked higher. Another metric is the average F-measure (AF), which is the average micro-F of SLU models at all cut-off positions in the ranked list. Compared to WAP, AF additionally considers slot popularity in the dataset.
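One way to read the WAP definition above is that each correct position's precision-at-k contribution is weighted by that slot's decoder F-measure; the sketch below follows this reading, with purely hypothetical F-scores:

```python
# Sketch of weighted average precision (WAP): like AP, but each correct
# position contributes its precision-at-k weighted by the slot decoder's
# F-measure. Induced slots without a reference mapping carry F = 0.
def weighted_average_precision(f_scores):
    """f_scores: per-rank F-measure of each slot's decoder (best rank first)."""
    hits, weighted_sum = 0, 0.0
    for k, f in enumerate(f_scores, start=1):
        if f > 0:  # slot has a mapping and a trained decoder
            hits += 1
            weighted_sum += (hits / k) * f
    return weighted_sum / hits if hits else 0.0

# Hypothetical decoders at ranks 1 and 3 (rank 2 has no mapping):
print(weighted_average_precision([0.8, 0.0, 0.5]))
```

As intended, a ranking that places the rank-3 decoder first would score higher only if its F-measure were larger, so better decoders ranked earlier raise WAP.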

3.6.3 Evaluation Results

Table 3.2 shows the results. The first row is the baseline, which considers only the frequency of slot candidates for ranking. We find that slot induction performs better on ASR output than on manual transcripts. The better AP and AUC scores on ASR may arise because users tend to articulate keywords more clearly than generic words; the higher word error rate on generic words lowers their frequency and thus ranks those slot candidates lower.

In-Domain and External are the results of the proposed word vector models leveraging distributional word representations: in-domain clustering vectors and external word vectors, respectively. In terms of both slot induction and SLU modeling, most results improve when distributed word information is included. With only in-domain data, slot induction improves significantly, from 57% to 60% AP and from 55% to 58% AUC. However, for SLU models, the in-domain clustering approach shows no improvement on ASR transcripts and only a small improvement on manual transcripts. With the external word vector approach, performance improves significantly on both ASR and manual transcripts, which shows the effectiveness of leveraging external data for the similarity measurement.

To compare different similarity measures, we evaluate two approaches to computing distributional semantic similarity: in-domain clustering vectors and external word vectors. For both ASR and manual transcripts, the similarity derived from external word vectors significantly outperforms that from in-domain clustering vectors. The reason may be that external word vectors, trained on large data, provide more accurate semantic representations for measuring similarity, while in-domain clustering vectors rely on a small training set, which may be biased by the data and degraded by recognition errors.
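The similarity measurement underlying both approaches is typically cosine similarity between word vectors; a toy sketch with made-up 3-dimensional vectors (the actual vectors are 300-dimensional, from word2vec training):

```python
import math

# Cosine similarity between word vectors: semantically related slot
# candidates should score higher than unrelated ones.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, illustrative only:
food, cuisine, phone = [0.9, 0.1, 0.2], [0.8, 0.2, 0.3], [0.1, 0.9, 0.1]
print(cosine(food, cuisine) > cosine(food, phone))  # True
```

Better-trained vectors tighten this contrast, which is why the externally trained embeddings yield a more reliable similarity signal than the small in-domain set.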

We see that leveraging distributional semantics with frame-semantic parsing produces promising slot ranking performance; this demonstrates the effectiveness of our proposed approaches for slot induction. The AP of 72% indicates that our proposed approach achieves good coverage of domain-specific slots in a real-world SDS, reducing the labor cost of system development.
