
Speaker Role Contextual Modeling for Language Understanding and Dialogue Policy Learning

Ta-Chung Chi, Po-Chun Chen, Shang-Yu Su, Yun-Nung Chen

Department of Computer Science and Information Engineering
Graduate Institute of Electrical Engineering
National Taiwan University

{b02902019,r06922028,r05921117}@ntu.edu.tw y.v.chen@ieee.org

Abstract

Language understanding (LU) and dialogue policy learning are two essential components in conversational systems. Human-human dialogues are not well-controlled and often random and unpredictable due to the speakers' own goals and speaking habits. This paper proposes a role-based contextual model to consider different speaker roles independently based on the various speaking patterns in multi-turn dialogues. The experiments on the benchmark dataset show that the proposed role-based model successfully learns role-specific behavioral patterns for contextual encoding and then significantly improves language understanding and dialogue policy learning tasks¹.

1 Introduction

Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in the artificial intelligence and natural language processing areas. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Today, there are several virtual intelligent assistants, such as Apple's Siri, Google's Home, Microsoft's Cortana, and Amazon's Echo. Recent advances in deep learning have inspired many applications of neural models to dialogue systems. Wen et al. (2017), Bordes et al. (2017), and Li et al. (2017) introduced network-based, end-to-end trainable task-oriented dialogue systems.

¹The source code is available at: https://github.com/MiuLab/Spk-Dialogue

A key component of the understanding system is a language understanding (LU) module: it parses user utterances into semantic frames that capture the core meaning, where the three main tasks of LU are domain classification, intent determination, and slot filling (Tur and De Mori, 2011). A typical pipeline of LU is to first decide the domain given the input utterance, and based on the domain, to predict the intent and to fill associated slots corresponding to a domain-specific semantic template. Recent advances in deep learning have inspired many applications of neural models to natural language processing tasks. With the power of deep learning, better approaches to LU are emerging (Hakkani-Tür et al., 2016; Chen et al., 2016b,a; Wang et al., 2016). However, most of the above work focused on single-turn interactions, where each utterance is treated independently.

Contextual information has been shown useful for LU (Bhargava et al., 2013; Xu and Sarikaya, 2014; Chen et al., 2015; Sun et al., 2016). For example, Figure 1 shows conversational utterances where the intent of the highlighted tourist utterance is to ask about location information, but it is difficult to understand without contexts. Hence, it is more likely to estimate the location-related intent given the contextual utterance about location recommendation. Contextual information has been incorporated into recurrent neural networks (RNN) for improved domain classification, intent prediction, and slot filling (Xu and Sarikaya, 2014; Shi et al., 2015; Weston et al., 2015; Chen et al., 2016c). The LU output is a semantic representation of users' behaviors, which then flows to the downstream dialogue management component in order to decide which action the system should take next, which is called the dialogue policy. It is intuitive that better understanding could improve dialogue policy learning, so that the dialogue management can be further

boosted through interactions (Li et al., 2017).

[Figure 1 — Task 1: Language Understanding (User Intents); Task 2: Dialogue Policy Learning (System Actions)]
Guide: so you of course %uh you can have dinner there and %uh of course you also can do sentosa, if you want to for the song of the sea, right? → FOL_RECOMMEND:FOOD; QST_CONFIRM:LOC; QST_RECOMMEND:LOC
Tourist: yah. → RES_CONFIRM
Tourist: what 's the song in the sea ? → QST_WHAT:LOC (Task 1)
Guide: a song of the sea in fact is %uh laser show inside sentosa → FOL_EXPLAIN:LOC (Task 2)
Figure 1: The human-human conversational utterances and their associated semantics from DSTC4.

Most previous dialogue systems did not take speaker roles into consideration. However, we discover that different speaker roles can cause notable variance in speaking habits and thereby affect the system performance differently (Chen et al., 2017). As shown in Figure 1, the benchmark dialogue dataset, Dialogue State Tracking Challenge 4 (DSTC4) (Kim et al., 2016)², contains two specific roles, a tourist and a guide. Under the scenario of dialogue systems and the communication patterns, we take the tourist as the user and the guide as the dialogue agent (system). During conversations, the user may focus on not only reasoning (user history) but also listening (agent history), so different speaker roles could provide various cues for better understanding and policy learning.

This paper focuses on LU and dialogue policy learning, which target the understanding of the tourist's natural language (LU; language understanding) and the prediction of how the system should respond (SAP; system action prediction), respectively. In order to comprehend what the tourist is talking about and predict how the guide reacts to the user, this work proposes a role-based contextual model that models role-specific contexts differently for improving system performance.

2 Proposed Approach

The model architecture is illustrated in Figure 2. First, the previous utterances are fed into the contextual model to be encoded into a history summary, and then the summary vector and the current utterance are integrated for helping LU and dialogue policy learning. The whole model is trained in an end-to-end fashion, where the history summary vector is automatically learned based on the two downstream tasks. The objective of the proposed model is to optimize the conditional probability p(ŷ | x), so that the difference between the predicted distribution p(ŷ_k = z | x) and the target distribution q(y_k = z | x) can be minimized:

L = -\sum_{k=1}^{K} \sum_{z=1}^{N} q(y_k = z | x) \log p(\hat{y}_k = z | x),   (1)

where the labels y can be either intent tags for understanding or system actions for dialogue policy learning.

²http://www.colips.org/workshop/dstc4/
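Since both tasks below use sigmoid outputs over multiple labels, one common way to realize the cross-entropy objective in (1) is a per-label binary cross entropy. A minimal, hypothetical sketch in PyTorch (shapes and names are illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a batch of 4 utterances, 30 candidate intent labels.
logits = torch.randn(4, 30)                     # unnormalized model scores
targets = torch.randint(0, 2, (4, 30)).float()  # multi-hot gold labels y

# Per-label binary cross entropy, one realization of the objective in (1)
# under sigmoid outputs.
loss = F.binary_cross_entropy_with_logits(logits, targets)
```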

Language Understanding (LU)  Given the current utterance x = {w_t}_1^T, the goal is to predict the user intents of x, which include the speech acts and associated attributes shown in Figure 1; for example, QST_WHAT is composed of the speech act QST and the associated attribute WHAT. Note that we do not process the slot filling task for extracting LOC. We apply a bidirectional long short-term memory (BLSTM) model (Schuster and Paliwal, 1997) to integrate preceding and following words to learn the probability distribution of the user intents:

v_cur = BLSTM(x, W_his · v_his),   (2)
o = sigmoid(W_LU · v_cur),   (3)

where W_his is a dense matrix and v_his is the history summary vector, v_cur is the context-aware vector of the current utterance encoded by the BLSTM, and o is the intent distribution. Note that this is a multi-label and multi-class classification, so the sigmoid function is employed for modeling the distribution after a dense layer. The user intent labels y are decided based on whether the value is higher than a threshold θ.
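A minimal sketch of (2)–(3) may make the wiring concrete. The paper does not pin down how W_his · v_his enters the BLSTM, so injecting it as the initial hidden state is an assumption here; the class name, dimensions, and threshold are all illustrative:

```python
import torch
import torch.nn as nn

class ContextualLU(nn.Module):
    """Sketch of (2)-(3): a BLSTM over the current utterance,
    conditioned on the history summary v_his via a dense matrix W_his."""
    def __init__(self, vocab_size, emb_dim=200, hidden=128,
                 n_intents=30, his_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.w_his = nn.Linear(his_dim, 2 * hidden)   # W_his
        self.blstm = nn.LSTM(emb_dim, hidden,
                             bidirectional=True, batch_first=True)
        self.w_lu = nn.Linear(2 * hidden, n_intents)  # W_LU

    def forward(self, word_ids, v_his):
        # Assumption: W_his * v_his becomes the BLSTM's initial hidden state.
        b = word_ids.size(0)
        h0 = self.w_his(v_his).view(b, 2, -1).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        out, _ = self.blstm(self.emb(word_ids), (h0, c0))
        v_cur = out[:, -1]                       # context-aware vector, as in (2)
        return torch.sigmoid(self.w_lu(v_cur))   # intent distribution o, as in (3)

# Intents whose probability exceeds the threshold theta are predicted.
model = ContextualLU(vocab_size=10000)
o = model(torch.randint(0, 10000, (4, 12)), torch.randn(4, 256))
intents = o > 0.5                                # theta = 0.5 (assumed value)
```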

Dialogue Policy Learning  For system action prediction, we also perform a similar multi-label, multi-class classification on the context-aware vector v_cur from (2) using the sigmoid function:

o = sigmoid(W_π · v_cur),   (4)

and then the system actions can be decided based on a threshold θ.

[Figure 2: the current utterance w_t ... w_T is encoded by a BLSTM into v_cur; for each speaker role (user/tourist and agent/guide), the history utterances Utter_{h-3}, Utter_{h-2}, Utter_{h-1} pass through tied sentence encoders, optionally with intermediate guidance from the intent labels Intent_{h-3}, Intent_{h-2}, Intent_{h-1}; the role-based contextual modules (over natural language or semantic labels) are summed into the history summary, which feeds the dense layers for language understanding (intent prediction) and dialogue policy learning (system action prediction).]
Figure 2: Illustration of the proposed role-based contextual model.
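The policy head in (4) reuses the same context-aware vector as LU; a short, hedged sketch (the action inventory size, W_pi dimensions, and threshold value are illustrative):

```python
import torch
import torch.nn as nn

n_actions, theta = 20, 0.5        # illustrative action count and threshold
w_pi = nn.Linear(256, n_actions)  # W_pi over v_cur (here 2 * 128 dims)

v_cur = torch.randn(4, 256)       # context-aware vectors from (2)
o = torch.sigmoid(w_pi(v_cur))    # system action distribution, as in (4)
actions = o > theta               # multi-label system action prediction
```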

2.1 Contextual Module

In order to leverage the contextual information, we utilize two types of contexts, 1) semantic labels and 2) natural language, to learn the history summary representation v_his in (2). The illustration is shown in the top-right part of Figure 2.

Semantic Label  Given a sequence of annotated intent tags and associated attributes for each history utterance, we employ a BLSTM to model the explicit semantics:

v_his = BLSTM(intent_t),   (5)

where intent_t is the vector after one-hot encoding for representing the annotated intent and attribute features. Note that this model requires the ground truth annotations of history utterances for training and testing.
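A sketch of (5), assuming the history intents are given as (multi-)hot vectors over the intent-attribute inventory, since an utterance can carry several labels; the class name and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class SemanticHistoryEncoder(nn.Module):
    """Sketch of (5): a BLSTM over the (multi-)hot intent vectors of the
    history utterances; the final output is taken as v_his."""
    def __init__(self, n_intents=30, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_intents, hidden,
                             bidirectional=True, batch_first=True)

    def forward(self, intent_seq):    # (batch, history_len, n_intents)
        out, _ = self.blstm(intent_seq)
        return out[:, -1]             # v_his, shape (batch, 2 * hidden)
```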

Natural Language (NL)  Given the natural language history, a sentence encoder is applied to learn a vector representation for each prior utterance. After encoding, the feature vectors are fed into a BLSTM to capture temporal information:

v_his = BLSTM(CNN(utt_t)),   (6)

where the CNN is good at extracting the most salient features that can represent the given natural language utterances, as illustrated in Figure 3. Here the sentence encoder can be replaced with different encoders³, and the weights of all encoders are tied together.

[Figure 3: word embeddings of "what 's the song in the sea" → convolutional layer with multiple filter widths → max-over-time pooling → fully-connected layer.]
Figure 3: Illustration of the CNN sentence encoder for the example sentence "what's the song in the sea".
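A minimal sketch of the Figure 3 encoder, using the filter widths [2, 3, 4] and 128 filters per width reported in Section 3.1; the output dimension and module name are assumptions:

```python
import torch
import torch.nn as nn

class CNNSentenceEncoder(nn.Module):
    """Kim (2014)-style encoder from Figure 3: convolutions with multiple
    filter widths, max-over-time pooling, then a fully-connected layer."""
    def __init__(self, emb_dim=200, n_filters=128, widths=(2, 3, 4),
                 out_dim=256):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, w) for w in widths])
        self.fc = nn.Linear(n_filters * len(widths), out_dim)

    def forward(self, emb):              # (batch, T, emb_dim) word embeddings
        x = emb.transpose(1, 2)          # Conv1d expects (batch, channels, T)
        # Max over time keeps the most salient feature per feature map,
        # naturally handling variable sentence lengths.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))
```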

NL with Intermediate Guidance  Considering that the semantic labels may provide rich cues, a middle supervision signal is utilized as intermediate guidance for the sentence encoding module, in order to guide it to project input utterances into a more meaningful feature space. Specifically, for each utterance, we compute the cross entropy loss between the encoder outputs and the corresponding intent-attributes shown in Figure 2. Assuming that l_t is the encoding loss for utt_t in the history, the final objective is to minimize (L + \sum_t l_t). This model does not require the ground truth semantics of the history when testing, so it is more practical than the above model using semantic labels.

³In the experiments, the CNN achieved slightly better performance with fewer parameters compared with the BLSTM.
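The combined objective L + Σ_t l_t can be sketched as below; the auxiliary intent logits per history utterance and all names are assumptions rather than the released code:

```python
import torch.nn.functional as F

def guided_loss(task_logits, task_targets, encoder_logits, intent_targets):
    """Sketch of L + sum_t l_t: the task loss L plus a cross-entropy
    guidance term l_t per history utterance utt_t."""
    L = F.binary_cross_entropy_with_logits(task_logits, task_targets)
    guidance = sum(F.binary_cross_entropy_with_logits(h, y)   # l_t for utt_t
                   for h, y in zip(encoder_logits, intent_targets))
    return L + guidance
```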

2.2 Speaker Role Modeling

In a dialogue, there are at least two roles communicating with each other, and each individual has his/her own goals and speaking habits. For example, the tourists have their own desired touring goals, while the guides try to provide sufficient touring information for suggestions and assistance.

Prior work usually ignored the speaker role information or only modeled a single speaker's history for various tasks (Chen et al., 2016c; Yang et al., 2017). The performance may be degraded due to the possibly unstable and noisy input feature space. To address this issue, this work proposes the role-based contextual model: instead of using only a single BLSTM model for the history, we construct one individual contextual module for each speaker role. Each role-dependent recurrent unit BLSTM_{role_x} receives corresponding inputs x_{i,role_x} (i = 1, ..., N) that have been processed by an encoder model, so we can rewrite (5) and (6) into (7) and (8) respectively:

v_his = BLSTM_{role_a}(intent_{t,role_a}) + BLSTM_{role_b}(intent_{t,role_b}),   (7)

v_his = BLSTM_{role_a}(CNN(utt_{t,role_a})) + BLSTM_{role_b}(CNN(utt_{t,role_b})).   (8)

Therefore, each role-based contextual module focuses on modeling the role-dependent goal and speaking style, and v_cur from (2) is able to carry role-based contextual information.
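A sketch of the role-split summary in (8): each role's encoded utterances run through their own BLSTM, and the two summaries are summed. How utterances are routed by speaker label and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RoleBasedHistoryEncoder(nn.Module):
    """Sketch of (8): one contextual BLSTM per speaker role; v_his is the
    sum of the two role-dependent summaries."""
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.blstm_user = nn.LSTM(in_dim, hidden,
                                  bidirectional=True, batch_first=True)
        self.blstm_agent = nn.LSTM(in_dim, hidden,
                                   bidirectional=True, batch_first=True)

    def forward(self, user_feats, agent_feats):
        # Each input: (batch, len, in_dim) sentence-encoder features for the
        # history utterances of the tourist (user) / guide (agent) stream.
        u, _ = self.blstm_user(user_feats)
        a, _ = self.blstm_agent(agent_feats)
        return u[:, -1] + a[:, -1]    # v_his as the sum in (8)
```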

3 Experiments

To evaluate the effectiveness of the proposed model, we conduct the LU and dialogue policy learning experiments on human-human conversational data.

3.1 Setup

The experiments are conducted on DSTC4, which consists of 35 dialogue sessions on touristic information for Singapore, collected from Skype calls between 3 tour guides and 35 tourists (Kim et al., 2016). All recorded dialogues, with a total length of 21 hours, have been manually transcribed and annotated with speech acts and semantic labels at each turn level. The speaker labels are also annotated. Human-human dialogues contain rich and complex human behaviors and bring much difficulty to all dialogue-related tasks. Given the fact that different speaker roles behave differently, DSTC4 is a suitable benchmark dataset for evaluation.

We choose mini-batch Adam as the optimizer with a batch size of 128 examples (Kingma and Ba, 2014). The size of each hidden recurrent layer is 128. We use pre-trained 200-dimensional GloVe word embeddings (Pennington et al., 2014). We apply 30 training epochs without any early stopping. The sentence encoder is implemented as a CNN with filters of sizes [2, 3, 4], 128 filters of each size, and max pooling over time. The idea is to capture the most important feature (the highest value) for each feature map. This pooling scheme naturally deals with variable sentence lengths. Please refer to Kim (2014) for more details.

For both tasks, we focus on predicting multiple labels including speech acts and attributes, so the evaluation metric is the average F1 score, balancing recall and precision in each utterance. Note that the final prediction may contain multiple labels.
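As a concrete reading of this metric, per-utterance F1 between the predicted and gold label sets, averaged over utterances, can be computed as below; this is a sketch of one reasonable implementation, not the official scorer:

```python
def average_f1(predicted, gold):
    """Average per-utterance F1 over predicted/gold label sets."""
    scores = []
    for p, g in zip(predicted, gold):
        p, g = set(p), set(g)
        tp = len(p & g)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# e.g. predicting one of two gold labels gives precision 1.0, recall 0.5,
# and a per-utterance F1 of about 0.67.
average_f1([["QST_WHAT:LOC"]], [["QST_WHAT:LOC", "RES_CONFIRM"]])
```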

3.2 Results

The experiments are shown in Table 1, where we report the average number over five runs. The first baseline (row (a)) is the best participant of DSTC4 in IWSDS 2016 (Kim et al., 2016); the poor performance is probably because tourist intents are much more difficult than guide intents (most systems achieved higher than 60% F1 for guide intents but lower than 50% for tourist intents). The second baseline (row (b)) models the current utterance without contexts, performing at 62.6% for understanding and 63.4% for policy learning.

3.2.1 Language Understanding Results

With contextual history, using ground truth semantic labels for learning history summary vectors greatly improves the performance to 68.2% (row (c)), while using natural language slightly improves the performance to 64.2% (row (e)). The reason may be that NL utterances contain more noise, making the contextual vectors more difficult to model for LU. The proposed role-based contextual models applied to semantic labels and NL achieve 69.2% (row (d)) and 65.1% (row (f)) F1 respectively, showing significant improvement over all models without role modeling.

Model | Language Understanding | Policy Learning
(a) Baseline: DSTC4-Best | 52.1 | -
(b) Baseline: BLSTM | 62.6 | 63.4
(c) Contextual-Sem: BLSTM | 68.2 | 66.8
(d) Contextual-Sem: + Role-Based† | 69.2 | 70.1
(e) Contextual-NL: BLSTM | 64.2 | 66.3
(f) Contextual-NL: + Role-Based† | 65.1 | 66.9
(g) Contextual-NL: + Role-Based w/ Intermediate Guidance† | 65.8 | 67.4

Table 1: Language understanding and dialogue policy learning performance on DSTC4 (F-measure, %). † indicates the significant improvement compared to all methods without speaker role modeling.

Furthermore, adding the intermediate guidance acquires additional improvement (65.8% in row (g)). It is shown that the semantic labels successfully guide the sentence encoder to obtain better sentence-level representations, and then the history summary vector, carrying more accurate semantics, gives better performance for understanding.

3.2.2 Dialogue Policy Learning Results

To predict the guide's next actions, the baseline utilizes the intent tags of the current utterance without contexts (row (b)). Table 1 shows a similar trend as the LU results, where applying either role-based contextual models or intermediate guidance brings advantages for both semantics-encoded and NL-encoded history.

3.3 Discussion

In contrast to NL, semantic labels (intent-attribute pairs) can be seen as more explicit and concise information for modeling the history, which indeed gains more in our experiments for both LU and dialogue policy learning. The results of Contextual-Sem can be treated as the upper bound performance, because they utilize the ground truth semantics of contexts. Among the experiments of Contextual-NL, which are more practical because annotated semantics are not required during testing, the proposed approaches achieve 5.1% and 6.3% relative improvement over the baseline for LU and dialogue policy learning respectively.

Between the LU and dialogue policy learning tasks, most LU results are worse than the dialogue policy learning results. The reason is probably that the guide has similar behavior patterns, such as providing information and confirming questions, while the user can have more diverse interactions. Therefore, understanding the user intents is slightly harder than predicting the guide policy in the DSTC4 dataset.

With the promising improvement for both LU and dialogue policy learning, the idea of modeling speaker role information can be further extended to various research topics in the future.

4 Conclusion

This paper proposes an end-to-end role-based contextual model that automatically learns speaker-specific contextual encoding. Experiments on a benchmark multi-domain human-human dialogue dataset show that our role-based model achieves impressive improvement in language understanding and dialogue policy learning, demonstrating that different speaker roles behave differently and focus on different goals.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Ministry of Science and Technology of Taiwan, Google Research, Microsoft Research, and MediaTek Inc.

References

Anshuman Bhargava, Asli Celikyilmaz, Dilek Hakkani-Tür, and Ruhi Sarikaya. 2013. Easy contextual intent prediction and slot detection. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 8337–8341.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In ICLR.

Po-Chun Chen, Ta-Chung Chi, Shang-Yu Su, and Yun-Nung Chen. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding.

Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016a. Syntax or semantics? Knowledge-guided joint semantic frame parsing. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop. IEEE, pages 348–355.

Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016b. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, and Li Deng. 2016c. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of the 17th Annual Meeting of the International Speech Communication Association. pages 3245–3249.

Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman. 2015. Leveraging behavioral patterns of mobile applications for personalized spoken language understanding. In Proceedings of the 17th ACM International Conference on Multimodal Interaction. ACM, pages 83–86.

Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of the 17th Annual Meeting of the International Speech Communication Association. pages 715–719.

Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, and Matthew Henderson. 2016. The fourth dialog state tracking challenge. In Proceedings of the 7th International Workshop on Spoken Dialogue Systems.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the 8th International Joint Conference on Natural Language Processing.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP. volume 14, pages 1532–1543.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Yangyang Shi, Kaisheng Yao, Hu Chen, Yi-Cheng Pan, Mei-Yuh Hwang, and Baolin Peng. 2015. Contextual spoken language understanding using recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 5271–5275.

Ming Sun, Yun-Nung Chen, and Alexander I. Rudnicky. 2016. An intelligent assistant for high-level task understanding. In Proceedings of the 21st Annual Meeting of the Intelligent Interfaces Community. pages 169–174.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Qing Ling, and Thomas S. Huang. 2016. Learning a deep ℓ∞ encoder for hashing. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI. pages 2174–2180.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of the International Conference on Learning Representations.

Puyang Xu and Ruhi Sarikaya. 2014. Contextual domain classification in spoken language understanding systems using recurrent neural network. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 136–140.

Xuesong Yang, Yun-Nung Chen, Dilek Hakkani-Tür, Paul Crook, Xiujun Li, Jianfeng Gao, and Li Deng. 2017. End-to-end joint learning of natural language understanding and dialogue manager. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 5690–5694.
