
Natural Language Generation by Hierarchical Decoding with Linguistic Patterns

Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, Yun-Nung Chen

Department of Computer Science and Information Engineering
Department of Electrical Engineering
National Taiwan University

{r05921117,b04902010,b03902071}@ntu.edu.tw, y.v.chen@ieee.org

Abstract

Natural language generation (NLG) is a critical component in spoken dialogue systems. Classic NLG can be divided into two phases: (1) sentence planning: deciding on the overall sentence structure, and (2) surface realization: determining specific word forms and flattening the sentence structure into a string. Many simple NLG models are based on recurrent neural networks (RNN) and the sequence-to-sequence (seq2seq) model, which basically contains an encoder-decoder structure; these NLG models generate sentences from scratch by jointly optimizing sentence planning and surface realization using a simple cross-entropy training criterion. However, the simple encoder-decoder architecture usually suffers when generating complex and long sentences, because the decoder has to learn all grammar and diction knowledge. This paper introduces a hierarchical decoding NLG model based on linguistic patterns at different levels, and shows that the proposed method outperforms the traditional one with a smaller model size. Furthermore, the design of the hierarchical decoding is flexible and easily extensible to various NLG systems.¹

1 Introduction

Spoken dialogue systems that can help users solve complex tasks have become an emerging research topic in the artificial intelligence and natural language processing areas (Wen et al., 2017; Bordes et al., 2017; Dhingra et al., 2017; Li et al., 2017). A typical dialogue system pipeline contains a speech recognizer, a natural language understanding component, a dialogue manager, and a natural language generator (NLG).

The first two authors contributed equally.

¹The source code is available at https://github.com/MiuLab/HNLG.

NLG is a critical component in a dialogue system: its goal is to generate natural language given the semantics provided by the dialogue manager. As the endpoint of interaction with users, the quality of the generated sentences is crucial for user experience. The most commonly adopted method is the rule-based (or template-based) method (Mirkovic and Cavedon, 2011), which can ensure natural language quality and fluency. However, because designing templates is time-consuming and scales poorly, data-driven approaches have been investigated for open-domain NLG tasks.

Recent advances in recurrent neural network-based language models (RNNLM) (Mikolov et al., 2010, 2011) have demonstrated the capability of modeling long-term dependencies by leveraging the RNN structure. Previous work proposed an RNNLM-based NLG (Wen et al., 2015) that can be trained on any corpus of dialogue act-utterance pairs without any semantic alignment or hand-crafted features. Sequence-to-sequence (seq2seq) generators (Cho et al., 2014; Sutskever et al., 2014) further offer better results by leveraging an encoder-decoder structure: a previous model encoded syntax trees and dialogue acts into sequences (Dušek and Jurčíček, 2016) as inputs to an attentional seq2seq model (Bahdanau et al., 2015).

However, it is challenging to generate long and complex sentences with the simple encoder-decoder structure due to grammar complexity and the lack of diction knowledge.

This paper proposes a hierarchical decoder leveraging linguistic patterns, where the decoding hierarchy is constructed in terms of part-of-speech (POS) tags. The original single decoding process is separated into a multi-level decoding hierarchy, where each decoding layer generates words associated with a specific POS set. The experiments show that our proposed method outperforms the classic seq2seq model with fewer parameters.

[Figure 1: The framework of the proposed semantically conditioned NLG model. A bidirectional GRU encoder reads the semantic 1-hot representation of the input semantics x = {w_1, ..., w_T} (e.g., name[Midsummer House], food[Italian], priceRange[moderate], near[All Bar One]) to produce h_enc; the hierarchical GRU decoder then generates the utterance over four decoding layers: 1. NOUN + PROPN + PRON, 2. VERB, 3. ADJ + ADV, 4. others, using (1) inner-layer teacher forcing, (2) inter-layer teacher forcing, (3) repeat-input, and (4) curriculum learning.]

In addition, our proposed model allows other word-level or sentence-level characteristics to be further leveraged for better generalization.

2 The Proposed Approach

The framework of the proposed semantically conditioned NLG model is illustrated in Figure 1, where the model architecture is based on an encoder-decoder (seq2seq) design (Cho et al., 2014; Sutskever et al., 2014). In the seq2seq architecture, a typical generation process includes encoding and decoding phases. First, the given semantic representation sequence $x = \{w_t\}_{t=1}^{T}$ is fed into an RNN-based encoder to capture the temporal dependency and project the input into a latent feature space; the semantics are also encoded into a 1-hot representation used as the initial state of the encoder, in order to maintain the temporal-independent condition, as shown at the bottom left of Figure 1. The recurrent unit of the encoder is a bidirectional gated recurrent unit (GRU) (Cho et al., 2014),

$h_{\text{enc}} = \text{BiGRU}(x).$  (1)

Then the encoded semantic vector $h_{\text{enc}}$ flows into an RNN-based decoder as the initial state to generate word sequences, as shown in the top-left component of the figure.
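For concreteness, a minimal PyTorch sketch of such an encoder is given below; the class name, dimensions, and embedding layer are our own illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Bidirectional GRU encoder over the semantic sequence (a sketch)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # Embedding of the semantic tokens; an assumption for illustration.
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.bigru = nn.GRU(hidden_size, hidden_size,
                            bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, T) token ids of the semantic sequence {w_1, ..., w_T}
        outputs, h_n = self.bigru(self.embedding(x))
        # Merge the forward and backward final states into h_enc (Eq. 1).
        h_enc = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden)
        return outputs, h_enc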

2.1 Hierarchical Decoder

Despite the intuitive and elegant design of the seq2seq model, it is difficult to generate long, complex, and decent sequences with such an encoder-decoder structure, because a single decoder is not capable of learning all diction, grammar, and other related linguistic knowledge. Some prior work applied additional techniques, such as a reranker, to select a better result among multiple generated sequences (Wen et al., 2015; Dušek and Jurčíček, 2016). However, the issue remains unsolved in the NLG community.

Therefore, we propose a hierarchical decoder to address the above issue, where the core idea is to separate the decoding process and learn different types of patterns instead of learning all relevant knowledge together. The hierarchical decoder is composed of several decoding layers, each of which is responsible for learning only a portion of the related knowledge. Namely, the linguistic knowledge can be divided into several subsets and incorporated into the decoding process layer by layer.

In this paper, we use part-of-speech (POS) tags as the additional linguistic features to construct the hierarchy: the POS tags of the words in the target sentence are separated into several subsets, and each layer is responsible for decoding the words associated with a specific set of POS patterns. An example is shown in the right part of Figure 1, where the first layer at the bottom is in charge of learning to decode nouns, pronouns, and proper nouns, the second layer is in charge of verbs, and so on. Our approach is also intuitive from the viewpoint of how humans learn to speak; for example, infants first learn to say keywords, which are often nouns. When an infant says "Daddy, toilet.", it actually means "Daddy, I want to go to the toilet.". As they grow up, children learn more grammar and vocabulary; they start adding verbs to their sentences, then adverbs, and so on. This process of how humans learn to speak is the core motivation of our proposed method.

In the hierarchical decoder, the initial state of each GRU-based decoding layer $i$ is the extracted feature $h_{\text{enc}}$ from the encoder, and the input at every step is the last predicted token $y^i_{t-1}$ concatenated with the output from the previous layer $y^{i-1}_t$:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(y^i_{t-1}, y^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}),$  (2)

$y^i_t = \operatorname{argmax}(o^i_t),$  (3)

where $h^i_t$ is the $t$-th hidden state of the $i$-th GRU decoding layer and $y^i_t$ is the $t$-th output word in the $i$-th layer. The cross-entropy loss is used for optimization.
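One possible PyTorch realization of Eqs. (2)-(3) is sketched below; concatenating the embeddings of $y^i_{t-1}$ and $y^{i-1}_t$ as the GRU input is our reading of the conditioning notation, and all names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DecodingLayer(nn.Module):
    """One layer i of the hierarchical decoder (a sketch of Eqs. 2-3)."""
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # Input: embedding of y^i_{t-1} concatenated with that of y^{i-1}_t.
        self.gru_cell = nn.GRUCell(2 * emb_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, y_prev_t, y_lower_t, h_prev):
        # y_prev_t:  y^i_{t-1}, this layer's previously generated token ids
        # y_lower_t: y^{i-1}_t, the lower layer's token at step t
        # h_prev:    h^i_{t-1}, initialized from h_enc at t = 0
        inp = torch.cat([self.embedding(y_prev_t),
                         self.embedding(y_lower_t)], dim=-1)
        h_t = self.gru_cell(inp, h_prev)   # h^i_t
        o_t = self.out(h_t)                # o^i_t (logits over the vocabulary)
        y_t = o_t.argmax(dim=-1)           # Eq. (3) at inference time
        return h_t, o_t, y_t

At training time the logits $o^i_t$ would feed the cross-entropy loss, and the initial state $h^i_0$ would be set from $h_{\text{enc}}$ (projected if the dimensions differ).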

2.2 Inner- and Inter-Layer Teacher Forcing

Teacher forcing (Williams and Zipser, 1989) is a strategy for training RNNs: rather than feeding the output generated by the network at a prior time step back as input, it uses the expected (ground-truth) output at the current time step, $\hat{y}_t$, as the input at the next time step. In our proposed framework, an input to a decoder contains not only the output from the last step but also the output from the last decoding layer. Therefore, we design two types of teacher forcing techniques: inner-layer and inter-layer.

Inner-layer teacher forcing is the classic teacher forcing strategy:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(\hat{y}^i_{t-1}, y^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}).$  (4)

Inter-layer teacher forcing uses the labels instead of the actual output tokens of the last layer:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(y^i_{t-1}, \hat{y}^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}).$  (5)

The teacher forcing techniques can also be triggered only with a certain probability, which is known as the scheduled sampling approach (Bengio et al., 2015); we also adopt scheduled sampling in our experiments.
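The two variants differ only in which tokens are fed at each step; a hedged sketch of the input selection, including the probabilistic trigger used for scheduled sampling, follows (the function name and signature are our own):

import random

def choose_step_inputs(y_prev_pred, y_prev_gold,
                       y_lower_pred, y_lower_gold,
                       p_inner=0.5, p_inter=0.5):
    """Pick the decoder inputs for one step (a sketch).

    Inner-layer teacher forcing (Eq. 4) substitutes the gold token for the
    layer's own previous prediction; inter-layer teacher forcing (Eq. 5)
    substitutes the gold token for the lower layer's prediction. Each is
    triggered independently with some probability, as in scheduled sampling.
    """
    y_prev = y_prev_gold if random.random() < p_inner else y_prev_pred
    y_lower = y_lower_gold if random.random() < p_inter else y_lower_pred
    return y_prev, y_lower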

2.3 Repeat-Input Mechanism

The concept of our proposed method is to generate the sequence hierarchically, gradually adding words associated with different linguistic patterns. Therefore, the generated sequences become longer as the generating process proceeds to the higher decoding layers, and the sequence generated by an upper layer should contain the words predicted by the lower layers. To ensure that the output sequences satisfy this constraint, we design a strategy that repeats the outputs from the last layer as inputs until the current decoding layer outputs the same token, the so-called repeat-input mechanism. This approach offers at least two merits: (1) repeating the inputs tells the decoder that the repeated tokens are important, encouraging the decoder to generate them; (2) if the expected output sequence of a layer is much shorter than that of the next layer, the large difference in length becomes a critical issue for the hierarchical decoder, because the output sequence of a layer is fed into the next layer; with the repeat-input mechanism, the impact of the length difference can be mitigated.
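As we read it, the mechanism keeps feeding the same lower-layer token until the current layer emits it; a minimal sketch of this input-advancing rule follows (the function and variable names are our own):

def next_lower_index(lower_seq, idx, y_t):
    """Advance along the lower layer's outputs only after the current layer
    has reproduced the pending token (repeat-input mechanism, as a sketch).

    lower_seq: token sequence produced by layer i-1
    idx:       index of the lower-layer token currently being repeated
    y_t:       token just emitted by layer i at step t
    """
    if idx < len(lower_seq) and y_t == lower_seq[idx]:
        return idx + 1   # the pending token was generated; move on
    return idx           # otherwise keep repeating lower_seq[idx]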

2.4 Curriculum Learning

The proposed hierarchical decoder consists of several decoding layers, and the expected output sequences of upper layers are longer than those of the lower layers. The framework is therefore suitable for curriculum learning (Elman, 1993), whose core concept is that a curriculum of progressively harder tasks can significantly accelerate a network's training. The training procedure is to train each decoding layer for some epochs, from the bottommost layer to the topmost one.
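A sketch of such a bottom-up schedule, using the five-epoch interval described later in Section 3.1 (the helper name is our own):

def active_layers(epoch, num_layers=4, epochs_per_layer=5):
    """Return how many decoding layers are trained at a given epoch (sketch):
    only the bottommost layer at first, one more every epochs_per_layer epochs."""
    return min(num_layers, epoch // epochs_per_layer + 1)

# Example: epochs 0-4 train layer 1 only; epoch 5 adds layer 2, and so on.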

3 Experiments

3.1 Setup

The experiments are conducted on the E2E NLG challenge dataset (Novikova et al., 2017)², a crowd-sourced dataset of 50k instances in the restaurant domain. The input is a semantic frame containing specific slots and corresponding values, and the output is natural language containing the given semantics, as shown in Figure 1.

²http://www.macs.hw.ac.uk/InteractionLab/E2E/

NLG Model                                           BLEU  ROUGE-1  ROUGE-2  ROUGE-L
(a) Sequence-to-Sequence Model                      28.9   40.7     12.5     32.1
(b) + Hierarchical Decoder                          43.1   53.0     24.6     40.4
(c) + Hierarchical Decoder, Repeat-Input            42.3   52.9     24.0     40.1
(d) + Hierarchical Decoder, Curriculum Learning     58.4   60.4     30.6     44.6
(e) + All                                           58.7   62.3     31.6     45.4
(f) (e) with High Inner-Layer TF Prob.              62.1   64.0     32.8     46.0
(g) (e) with High Inter-Layer TF Prob.              56.7   61.3     30.9     44.6
(h) (e) with High Inner- and Inter-Layer TF Prob.   60.0   63.0     31.8     45.2

Table 1: The NLG performance of the models (%), reported on BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.

To prepare the labels of each layer within the hierarchical structure of the proposed method, we utilize the spaCy toolkit to perform POS tagging on the target word sequences. Some properties such as restaurant names are delexicalized (for example, replaced with symbols like "RESTAURANT NAME") to avoid data sparsity. We assign the words with specific POS tags to each decoding layer: nouns, proper nouns, and pronouns to the first layer; verbs to the second layer; adjectives and adverbs to the third layer; and all other words to the fourth layer. Note that hierarchies with more than four levels are also applicable; the proposed hierarchical decoder is a general and easily extensible concept.
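To illustrate the label preparation, a hedged spaCy sketch that splits a (delexicalized) target sentence into the four per-layer word sequences is given below; following Figure 1, we assume each layer's target cumulatively includes the lower layers' words, and all other details are our own.

import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy model with a POS tagger

# POS sets per decoding layer, following the assignment described above.
LAYER_POS = [
    {"NOUN", "PROPN", "PRON"},  # layer 1
    {"VERB"},                   # layer 2
    {"ADJ", "ADV"},             # layer 3
]                               # layer 4 covers everything else

def layer_targets(sentence):
    """Split a target sentence into cumulative per-layer word sequences (sketch)."""
    doc = nlp(sentence)
    targets, allowed = [], set()
    for pos_set in LAYER_POS:
        allowed |= pos_set
        targets.append([t.text for t in doc if t.pos_ in allowed])
    targets.append([t.text for t in doc])  # layer 4: the full sentence
    return targets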

The experimental results are shown in Table 1; every reported number is averaged over the results of three different models on the official testing set. Row (a) is the simple seq2seq model serving as the baseline. The probability of activating inter-layer and inner-layer teacher forcing is set to 0.5 in rows (a)-(e); to evaluate the impact of teacher forcing, the probability is set to 0.9 in rows (f)-(h). The teacher forcing probability is attenuated every epoch, with a decay ratio of 0.9. We perform 20 training epochs without early stopping; when the curriculum learning approach is applied, only the first layer is trained during the first five epochs, the second decoding layer starts to be trained at the sixth epoch, and so on. To evaluate the quality of the generated sequences in terms of both precision and recall, the evaluation metrics include BLEU and ROUGE (1, 2, L) scores with multiple references.

3.2 Results and Analysis

To fairly examine the effectiveness of our proposed approaches, we control the proposed model to be smaller: the baseline seq2seq decoder has a 400-dim hidden layer, while the models with the proposed hierarchical decoder (rows (b)-(h)) have four 100-dim decoding layers. Table 1 shows that simply introducing the hierarchical decoding technique to separate the generation process into several phases, without adding parameters (row (b)), achieves significant improvement in both BLEU and ROUGE scores: 49.1% in BLEU, 30.2% in ROUGE-1, 96.8% in ROUGE-2, and 25.9% in ROUGE-L. Applying the proposed repeat-input mechanism (row (c)) and the curriculum learning strategy (row (d)) both offer considerable improvement. Combining all proposed techniques (row (e)) yields the best performance in both BLEU and ROUGE scores, achieving 103.1%, 53.1%, 152.8%, and 41.4% relative improvement in BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, respectively. The results demonstrate the effectiveness of the proposed approach.

To further verify the impact of teacher forcing, the integrated model (row (e)) is also evaluated with high inter-layer and inner-layer teacher forcing probabilities (rows (f)-(h)). Note that when teacher forcing is activated probabilistically, the strategy is also known as scheduled sampling (Bengio et al., 2015). Row (g) shows that a high probability of triggering inter-layer teacher forcing results in slight performance degradation, while a high inner-layer teacher forcing probability (rows (f) and (h)) can further benefit the model.

Note that the decoding process is a single-path forward generation without any heuristics or other mechanisms (such as beam search and reranking), so the effectiveness of the proposed methods can be fairly verified. The experiments show that by considering linguistic patterns in hierarchical decoding, the proposed approaches can significantly improve NLG results with smaller models.

4 Conclusion

This paper proposes a seq2seq-based model with a hierarchical decoder that leverages various linguistic patterns, and further designs several corresponding training and inference techniques. The experimental results show that models applying the proposed methods achieve significant improvement over the classic seq2seq model. By introducing additional word-level or sentence-level labels as features, the hierarchy of the decoder can be designed arbitrarily. Namely, the proposed hierarchical decoding concept is general and easily extensible, with the flexibility to be applied to many NLG systems.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Institute for Information Industry, the Ministry of Science and Technology of Taiwan, Google Research, Microsoft Research, and MediaTek Inc.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of ICLR.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724-1734.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL, pages 484-495.

Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of ACL, pages 45-51.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48(1):71-99.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of IJCNLP, pages 733-743.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Proceedings of ICASSP, pages 5528-5531.

Danilo Mirkovic and Lawrence Cavedon. 2011. Dialogue management using scripts. US Patent 8,041,570.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of SIGDIAL, pages 201-206.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104-3112.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of SIGDIAL, pages 275-284.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438-449.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270-280.

A Dataset Details

The experiments are conducted on the E2E NLG challenge dataset, a crowd-sourced dataset in the restaurant domain; the training set contains 42,064 instances, and there are 4,673 instances in the validation (development) set. In our experiments, we use the validation set to test our models. In the E2E NLG challenge dataset, the input is the semantics containing slots and their values, and the output is the corresponding natural language. For example, the slot-value pairs "name[Bibimbap House], food[English], priceRange[moderate], area[riverside], near[Clare Hall]" correspond to the target sentence "Bibimbap House is a moderately priced restaurant who's main cuisine is English food. You will find this local gem near Clare Hall in the Riverside area.".

B Parameter Setting

We use mini-batch Adam as the optimizer with a batch size of 32 examples. The baseline seq2seq model (row (a)) sets the encoder's hidden layer size to 200 and the decoder's to 400. For the models based on the proposed hierarchical decoder (rows (b)-(h)), the hidden layer sizes of the encoder and each decoding layer are 200 and 100, respectively. Note that in this setting, the models applying the proposed methods have fewer parameters than the baseline seq2seq model: for the models using the basic RNN cell, the baseline seq2seq model (row (a)) has 640k parameters, whereas the proposed models (rows (b)-(h)) have only 520k parameters.
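As a rough sanity check of this comparison, one can count the recurrent parameters directly; the sketch below omits embeddings and output projections and guesses the input sizes, so it only illustrates the quadratic growth of recurrent parameters with hidden size rather than reproducing the 640k/520k totals.

import torch.nn as nn

def rnn_params(input_size, hidden_size):
    """Parameter count of a single basic RNN layer."""
    return sum(p.numel() for p in nn.RNN(input_size, hidden_size).parameters())

print(rnn_params(400, 400))      # ~321k: one wide 400-dim decoder
print(4 * rnn_params(200, 100))  # ~121k: four narrow 100-dim decoding layers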
