Experiments - Unsupervised Learning and Modeling of Knowledge and Intent for Spoken Dialogue Systems

To evaluate the effectiveness of our induced slots, we perform two evaluations. First, we examine slot induction accuracy by comparing the ranked list of slots induced by frame-semantic parsing with the reference slots created by the developers of the corresponding system [171].

Second, based on the ranked list of induced slots, we train a semantic decoder for each slot to build an SLU component, and then evaluate the SLU model by comparing its output against the human-annotated semantic forms. We evaluate on both the ASR results of the raw speech and the manual transcripts.

3.5.1 Experimental Setup

We used the Cambridge In-Car SLU corpus¹, which has been used for several other SLU tasks such as supervised slot filling and dialogue act classification [30, 86]. The dialogue corpus was collected via a restaurant information system for Cambridge. Users can constrain restaurant suggestions by area, price range, and food type, and can then query the system for additional restaurant-specific information such as phone number, post code, signature dish, and address [71, 148].

Subjects were asked to interact with multiple SDSs in an in-car setting [70, 148]. Recordings were made under different noise conditions: 1) a stopped car with the air conditioning on and off; 2) a driving condition; and 3) a car simulator. The conditions are uniformly distributed in the corpus. The corpus contains 2,166 dialogues and 15,453 utterances in total, which are split into training and testing parts as shown in Table 3.1. The training part is used for self-training the SLU model.

The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1,868. An ASR system was used to transcribe the speech; the reported word error rate (WER) was 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type.
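To make the reported WER concrete, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example utterances are illustrative assumptions, not taken from the corpus or its tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical utterance pair, made up for illustration.
    ref = "i want a cheap restaurant in the north"
    hyp = "i want cheap restaurants in north"
    print(f"WER = {word_error_rate(ref, hyp):.1%}")
```

In this toy pair, two reference words are dropped and one is substituted, so three errors over eight reference words give a WER of 37.5%.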

¹ http://www.repository.cam.ac.uk/handle/1810/248271

[Figure 3.3 (diagram not reproduced). Induced slots shown: speak_on_topic, part_orientational, direction, locale, part_inner_outer, food, origin, contacting, sending, commerce scenario, expensiveness, range, seeking, desiring, locating, locale_by_use, building. Reference slots shown: addr, area, food, phone, postcode, price range, task, type.]

Figure 3.3: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

3.5.2 Implementation Detail

For clustering, we perform K-means clustering. The number of clusters K can be set empirically; we use K = 50 for all experiments. We tune the parameter α in (2.7) on a development set, which contains the first 302 dialogues with a total of 2,515 utterances. For training independent semantic decoders, we apply a support vector machine (SVM) with a linear kernel to classify whether each utterance contains a semantic concept.
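As a rough illustration of the clustering step, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm). This section does not specify which vectors are clustered or which implementation was used, so the function `kmeans`, the random initialization, and the toy 2-D points are all assumptions made for the example; the experiments use K = 50, while the toy data uses K = 2.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct data points (an assumed choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignments = kmeans(toy_points, k=2)
    print(centroids, assignments)
```

In practice a library implementation would normally be preferred; the sketch only shows the alternating assignment and update steps that K-means consists of.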

To include distributional semantics information from external data, we use the distributed vectors trained on 10⁹ words from Google News². Training was performed using the CBOW architecture, which predicts the current word based on its context, with sub-sampling using a threshold of 1 × 10⁻⁵ and negative sampling using 3 negative examples per positive example.

The resulting vectors have 300 dimensions and the vocabulary size is 3 × 10⁶; the vocabulary contains both words and automatically derived phrases. Compared with the in-domain data, this dataset provides a larger vocabulary and better coverage.
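The embedding vectors are used to measure similarity between words. A minimal sketch of one common choice, cosine similarity, is given below. Whether cosine is the exact measure used in the experiments is an assumption here, and the 3-dimensional `toy_vectors` are made up for illustration (the real vectors have 300 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


# Hypothetical 3-d embeddings; not taken from the Google News vectors.
toy_vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "expensive": [0.8, 0.2, 0.1],
    "north": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(cosine_similarity(toy_vectors["cheap"], toy_vectors["expensive"]))
    print(cosine_similarity(toy_vectors["cheap"], toy_vectors["north"]))
```

With these toy values, the two price-related words score much closer to each other than either does to the location word, which is the kind of signal a similarity-based ranking can exploit.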

3.5.3 Evaluation Metrics

Our metrics take the entire ranked list into account and are independent of the selected threshold, which eliminates the influence of different thresholds when producing induced slots.

3.5.3.1 Slot Induction

To evaluate the accuracy of the induced slots, we measure their quality as the proximity between induced slots and reference slots. Figure 3.3 shows the mappings that indicate semantically related induced slots and reference slots [31]. For example, expensiveness → price, food → food, and direction → area show that these induced slots can be mapped to the reference slots defined by experts and carry important semantics in the target domain for developing the task-oriented SDS. Note that two slots, name and signature, do not have proper mappings because they are too specific to the restaurant domain: name records the name of a restaurant and signature refers to signature dishes. Since the remaining 8 of the 10 reference slots are covered, our approach achieves 80% recall, because we consider all output frames as slot candidates. Since we define the adaptation task as a ranking problem, with a ranked list of induced slots and their associated scores, we can use the standard AP defined in (2.11) and AUC as our metrics, where an induced slot is counted as correct when it has a mapping to a reference slot.

² https://code.google.com/p/word2vec/

Table 3.2: The performance with different α tuned on a development set (%).

                   ASR                                  Transcripts
                        Slot Induction  SLU Model            Slot Induction  SLU Model
Approach           α    AP      AUC     WAP     AF      α    AP      AUC     WAP     AF
(a) Baseline       .0   58.17   56.16   35.39   36.76   .0   55.03   53.52   36.99   35.96
(b) In. Cluster.   .4   64.76   63.55   40.54   37.28   .5   63.59   62.89   42.25   36.95
(c) In. Embed.     .6   66.98   65.82   42.74   37.50   .4   57.96   56.51   39.78   36.61
(d) Ex. Embed.     .8   74.51   73.51   46.04   37.88   .8   64.99   64.17   43.28   38.57
Max RI (%)         -    +39.9   +44.1   +32.9   +3.0    -    +18.1   +19.9   +18.0   +7.3
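The AP metric used for slot induction can be sketched as follows, assuming the standard definition of average precision over a ranked list. Equation (2.11) is not reproduced in this section, so the exact form used in the experiments may differ. The function name and the example ranking are illustrative assumptions; AUC is left out because its precise definition is not given here.

```python
def average_precision(is_correct):
    """Standard AP over a ranked list.

    is_correct[i] is True when the induced slot at rank i + 1 has a
    mapping to a reference slot.
    """
    hits = 0
    precision_sum = 0.0
    for rank, correct in enumerate(is_correct, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Hypothetical ranked list: slots at ranks 1, 3, and 4 are mapped.
    ranking = [True, False, True, True, False]
    print(f"AP = {average_precision(ranking):.4f}")
```

For the toy ranking, the precisions at the three correct positions are 1/1, 2/3, and 3/4, so AP is their mean, roughly 0.806. The score rises when mapped slots move toward the top of the list, which is why it suits a threshold-free ranking evaluation.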

3.5.3.2 SLU Model

Semantic slot induction is essential for providing semantic categories and imposing semantic constraints for SLU modeling. Therefore, we are also interested in the performance of our unsupervised SLU models. For each induced slot with a mapping to a reference slot, we can compute the F-measure of the corresponding semantic decoder, and weight AP with the corresponding F-measure as WAP, defined in (2.13), to evaluate the performance of the slot induction and SLU tasks together. This metric scores a ranking higher if the induced slots corresponding to better semantic decoders are ranked higher. Another metric is the average F-measure (AF), which is the average micro F-measure of the SLU models over all cut-off positions in the ranked list. Compared to WAP, AF additionally considers slot popularity in the dataset.
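The AF metric described above can be sketched as below, assuming that each ranked slot decoder is summarized by its true-positive, false-positive, and false-negative counts, and that the micro F-measure at cut-off k pools the counts of the top-k decoders. How unmapped induced slots are handled is not stated in this section, so the sketch simply takes whatever counts it is given. All names and numbers are hypothetical.

```python
def micro_f_measure(counts):
    """Micro-averaged F-measure from pooled (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f_measure(ranked_counts):
    """Mean micro F-measure over every cut-off k = 1..n of the ranked list."""
    scores = [
        micro_f_measure(ranked_counts[:k])
        for k in range(1, len(ranked_counts) + 1)
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical (tp, fp, fn) counts for three decoders, in ranked order.
    ranked = [(8, 2, 2), (3, 1, 3), (1, 4, 5)]
    print(f"AF = {average_f_measure(ranked):.4f}")
```

Because the counts are pooled before computing the F-measure, decoders for slots that occur often in the data contribute more, which matches the remark that AF takes slot popularity into account.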

3.5.4 Evaluation Results

Table 3.2 shows all results. Row (a) is the baseline, which considers only the frequency of slot candidates for ranking. Slot induction performs better on ASR results than on manual transcripts. The better AP and AUC scores on ASR results stem from biased recognition output, since the recognizer is optimized to recognize domain-specific words better. The ASR results therefore contain more accurate slot-fillers, and slot induction performance on them is biased upward relative to the manual transcripts.

Rows (b)-(d) show the performance after leveraging distributional word representations: the in-domain clustering vector, the in-domain embedding vector, and the external embedding vector. In terms of both slot induction and SLU modeling, most results improve when distributed word information is included. With in-domain data (rows (b) and (c)), the performance of slot induction improves significantly, from 58% to 67% on AP and from 56% to 65% on AUC for ASR results, and from 55% to 64% on AP and from 54% to 63% on AUC for manual transcripts. For SLU modeling, the in-domain clustering and in-domain embedding approaches also outperform the baseline. Using external data to train word embeddings (row (d)) significantly improves both slot induction and SLU modeling for ASR and manual transcripts, which shows the effectiveness of involving external data in the similarity measurement. A likely reason is that external word embeddings provide more accurate vector representations for measuring similarity because they are trained on large data, while in-domain approaches rely on a small in-domain training set, which may be biased and sensitive to recognition errors.

We see that leveraging distributional semantics with frame-semantic parsing produces promising slot ranking performance; this demonstrates the effectiveness of our proposed approaches for slot induction. The 72% AP indicates that our proposed approach can generate good coverage of domain-specific slots in a real-world SDS, reducing the labor cost of system development.
