
Natural Language Generation by Hierarchical Decoding with Linguistic Patterns

Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, Yun-Nung Chen

Department of Computer Science and Information Engineering
Department of Electrical Engineering
National Taiwan University

{r05921117,b04902010,b03902071}@ntu.edu.tw, y.v.chen@ieee.org

Abstract

Natural language generation (NLG) is a critical component in spoken dialogue systems. Classic NLG can be divided into two phases: (1) sentence planning: deciding on the overall sentence structure, and (2) surface realization: determining specific word forms and flattening the sentence structure into a string. Many simple NLG models are based on recurrent neural networks (RNN) and the sequence-to-sequence (seq2seq) model, which basically contains an encoder-decoder structure; these NLG models generate sentences from scratch by jointly optimizing sentence planning and surface realization using a simple cross-entropy training criterion. However, the simple encoder-decoder architecture usually suffers when generating complex and long sentences, because the decoder has to learn all grammar and diction knowledge. This paper introduces a hierarchical decoding NLG model based on linguistic patterns at different levels, and shows that the proposed method outperforms the traditional one with a smaller model size. Furthermore, the design of the hierarchical decoding is flexible and easily extensible to various NLG systems.¹

1 Introduction

Spoken dialogue systems that can help users solve complex tasks have become an emerging research topic in the artificial intelligence and natural language processing areas (Wen et al., 2017; Bordes et al., 2017; Dhingra et al., 2017; Li et al., 2017). A typical dialogue system pipeline contains a speech recognizer, a natural language understanding component, a dialogue manager, and a natural language generator (NLG).

The first two authors contributed equally.

¹The source code is available at https://github.com/MiuLab/HNLG.

NLG is a critical component in a dialogue system: its goal is to generate natural language given the semantics provided by the dialogue manager. As the endpoint of interaction with users, the quality of the generated sentences is crucial for user experience. The most commonly adopted method is the rule-based (or template-based) method (Mirkovic and Cavedon, 2011), which can ensure natural language quality and fluency. However, because designing templates is time-consuming and scales poorly, data-driven approaches have been investigated for open-domain NLG tasks.

Recent advances in recurrent neural network-based language models (RNNLM) (Mikolov et al., 2010, 2011) have demonstrated the capability of modeling long-term dependencies by leveraging the RNN structure. Previous work proposed an RNNLM-based NLG (Wen et al., 2015) that can be trained on any corpus of dialogue act-utterance pairs without any semantic alignment or hand-crafted features. Sequence-to-sequence (seq2seq) generators (Cho et al., 2014; Sutskever et al., 2014) further offer better results by leveraging an encoder-decoder structure: a previous model encoded syntax trees and dialogue acts into sequences (Dušek and Jurčíček, 2016) as inputs to an attentional seq2seq model (Bahdanau et al., 2015).

However, it is challenging to generate long and complex sentences with the simple encoder-decoder structure due to grammar complexity and the lack of diction knowledge.

This paper proposes a hierarchical decoder leveraging linguistic patterns, where the decoding hierarchy is constructed in terms of part-of-speech (POS) tags. The original single decoding process is separated into a multi-level decoding hierarchy, where each decoding layer generates words associated with a specific POS set. The experiments show that our proposed method outperforms the classic seq2seq model with fewer parameters.

[Figure 1: The framework of the proposed semantically conditioned NLG model. A bidirectional GRU encoder reads the semantic 1-hot representation of the input semantics x = {w_1, ..., w_T} (e.g., name[Midsummer House], food[Italian], priceRange[moderate], near[All Bar One]) to produce h_enc; the hierarchical GRU decoder then generates the utterance over four decoding layers: 1. NOUN + PROPN + PRON, 2. VERB, 3. ADJ + ADV, 4. others, using (1) inner-layer teacher forcing, (2) inter-layer teacher forcing, (3) repeat-input, and (4) curriculum learning.]

In addition, our proposed model allows other word-level or sentence-level characteristics to be further leveraged for better generalization.

2 The Proposed Approach

The framework of the proposed semantically conditioned NLG model is illustrated in Figure 1, where the model architecture is based on an encoder-decoder (seq2seq) design (Cho et al., 2014; Sutskever et al., 2014). In the seq2seq architecture, a typical generation process includes encoding and decoding phases. First, the given semantic representation sequence $x = \{w_t\}_{t=1}^{T}$ is fed into an RNN-based encoder to capture the temporal dependency and project the input into a latent feature space; the semantics are also encoded into a 1-hot representation used as the initial state of the encoder, in order to maintain the temporal-independent condition, as shown at the bottom left of Figure 1. The recurrent unit of the encoder is a bidirectional gated recurrent unit (GRU) (Cho et al., 2014),

$h_{\text{enc}} = \text{BiGRU}(x).$  (1)

Then the encoded semantic vector $h_{\text{enc}}$ flows into an RNN-based decoder as the initial state to generate word sequences, as shown in the top-left component of the figure.
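For concreteness, a minimal PyTorch sketch of such an encoder is given below; the class name, dimensions, and embedding layer are our own illustrative assumptions rather than the authors' released implementation.

import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Bidirectional GRU encoder over the semantic sequence (a sketch)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # Embedding of the semantic tokens; an assumption for illustration.
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.bigru = nn.GRU(hidden_size, hidden_size,
                            bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, T) token ids of the semantic sequence {w_1, ..., w_T}
        outputs, h_n = self.bigru(self.embedding(x))
        # Merge the forward and backward final states into h_enc (Eq. 1).
        h_enc = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden)
        return outputs, h_enc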

2.1 Hierarchical Decoder

Despite the intuitive and elegant design of the seq2seq model, it is difficult to generate long, complex, and decent sequences with such an encoder-decoder structure, because a single decoder is not capable of learning all diction, grammar, and other related linguistic knowledge. Some prior work applied additional techniques, such as a reranker, to select a better result among multiple generated sequences (Wen et al., 2015; Dušek and Jurčíček, 2016). However, the issue remains unsolved in the NLG community.

Therefore, we propose a hierarchical decoder to address the above issue, where the core idea is to separate the decoding process and learn different types of patterns instead of learning all relevant knowledge together. The hierarchical decoder is composed of several decoding layers, each of which is responsible for learning only a portion of the related knowledge. Namely, the linguistic knowledge can be divided into several subsets and incorporated into the decoding process layer by layer.

In this paper, we use part-of-speech (POS) tags as the additional linguistic features to construct the hierarchy: the POS tags of the words in the target sentence are separated into several subsets, and each layer is responsible for decoding the words associated with a specific set of POS patterns. An example is shown in the right part of Figure 1, where the first layer at the bottom is in charge of learning to decode nouns, pronouns, and proper nouns, the second layer is in charge of verbs, and so on. Our approach is also intuitive from the viewpoint of how humans learn to speak; for example, infants first learn to say keywords, which are often nouns. When an infant says "Daddy, toilet.", it actually means "Daddy, I want to go to the toilet.". As they grow up, children learn more grammar and vocabulary; they start adding verbs to their sentences, then adverbs, and so on. This process of how humans learn to speak is the core motivation of our proposed method.

In the hierarchical decoder, the initial state of each GRU-based decoding layer $i$ is the extracted feature $h_{\text{enc}}$ from the encoder, and the input at every step is the last predicted token $y^i_{t-1}$ concatenated with the output from the previous layer $y^{i-1}_t$:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(y^i_{t-1}, y^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}),$  (2)

$y^i_t = \operatorname{argmax}(o^i_t),$  (3)

where $h^i_t$ is the $t$-th hidden state of the $i$-th GRU decoding layer and $y^i_t$ is the $t$-th output word in the $i$-th layer. The cross-entropy loss is used for optimization.
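One possible PyTorch realization of Eqs. (2)-(3) is sketched below; concatenating the embeddings of $y^i_{t-1}$ and $y^{i-1}_t$ as the GRU input is our reading of the conditioning notation, and all names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DecodingLayer(nn.Module):
    """One layer i of the hierarchical decoder (a sketch of Eqs. 2-3)."""
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # Input: embedding of y^i_{t-1} concatenated with that of y^{i-1}_t.
        self.gru_cell = nn.GRUCell(2 * emb_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def step(self, y_prev_t, y_lower_t, h_prev):
        # y_prev_t:  y^i_{t-1}, this layer's previously generated token ids
        # y_lower_t: y^{i-1}_t, the lower layer's token at step t
        # h_prev:    h^i_{t-1}, initialized from h_enc at t = 0
        inp = torch.cat([self.embedding(y_prev_t),
                         self.embedding(y_lower_t)], dim=-1)
        h_t = self.gru_cell(inp, h_prev)   # h^i_t
        o_t = self.out(h_t)                # o^i_t (logits over the vocabulary)
        y_t = o_t.argmax(dim=-1)           # Eq. (3) at inference time
        return h_t, o_t, y_t

At training time the logits $o^i_t$ would feed the cross-entropy loss, and the initial state $h^i_0$ would be set from $h_{\text{enc}}$ (projected if the dimensions differ).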

2.2 Inner- and Inter-Layer Teacher Forcing

Teacher forcing (Williams and Zipser, 1989) is a strategy for training RNNs: rather than feeding the output generated by the network at a prior time step back as input, it uses the expected (ground-truth) output at the current time step, $\hat{y}_t$, as the input at the next time step. In our proposed framework, an input to a decoder contains not only the output from the last step but also the output from the last decoding layer. Therefore, we design two types of teacher forcing techniques: inner-layer and inter-layer.

Inner-layer teacher forcing is the classic teacher forcing strategy:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(\hat{y}^i_{t-1}, y^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}).$  (4)

Inter-layer teacher forcing uses the labels instead of the actual output tokens of the last layer:

$h^i_t, o^i_t = \text{GRU}^i_{\text{dec}}(y^i_{t-1}, \hat{y}^{i-1}_t \mid h_{\text{enc}}, h^i_{t-1}).$  (5)

The teacher forcing techniques can also be triggered only with a certain probability, which is known as the scheduled sampling approach (Bengio et al., 2015); we also adopt scheduled sampling in our experiments.
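The two variants differ only in which tokens are fed at each step; a hedged sketch of the input selection, including the probabilistic trigger used for scheduled sampling, follows (the function name and signature are our own):

import random

def choose_step_inputs(y_prev_pred, y_prev_gold,
                       y_lower_pred, y_lower_gold,
                       p_inner=0.5, p_inter=0.5):
    """Pick the decoder inputs for one step (a sketch).

    Inner-layer teacher forcing (Eq. 4) substitutes the gold token for the
    layer's own previous prediction; inter-layer teacher forcing (Eq. 5)
    substitutes the gold token for the lower layer's prediction. Each is
    triggered independently with some probability, as in scheduled sampling.
    """
    y_prev = y_prev_gold if random.random() < p_inner else y_prev_pred
    y_lower = y_lower_gold if random.random() < p_inter else y_lower_pred
    return y_prev, y_lower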

2.3 Repeat-Input Mechanism

The concept of our proposed method is to generate the sequence hierarchically, gradually adding words associated with different linguistic patterns. Therefore, the generated sequences become longer as the generating process proceeds to the higher decoding layers, and the sequence generated by an upper layer should contain the words predicted by the lower layers. To ensure that the output sequences satisfy this constraint, we design a strategy that repeats the outputs from the last layer as inputs until the current decoding layer outputs the same token, the so-called repeat-input mechanism. This approach offers at least two merits: (1) repeating the inputs tells the decoder that the repeated tokens are important, encouraging the decoder to generate them; (2) if the expected output sequence of a layer is much shorter than that of the next layer, the large difference in length becomes a critical issue for the hierarchical decoder, because the output sequence of a layer is fed into the next layer; with the repeat-input mechanism, the impact of the length difference can be mitigated.
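As we read it, the mechanism keeps feeding the same lower-layer token until the current layer emits it; a minimal sketch of this input-advancing rule follows (the function and variable names are our own):

def next_lower_index(lower_seq, idx, y_t):
    """Advance along the lower layer's outputs only after the current layer
    has reproduced the pending token (repeat-input mechanism, as a sketch).

    lower_seq: token sequence produced by layer i-1
    idx:       index of the lower-layer token currently being repeated
    y_t:       token just emitted by layer i at step t
    """
    if idx < len(lower_seq) and y_t == lower_seq[idx]:
        return idx + 1   # the pending token was generated; move on
    return idx           # otherwise keep repeating lower_seq[idx]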

2.4 Curriculum Learning

The proposed hierarchical decoder consists of several decoding layers, and the expected output sequences of upper layers are longer than those of the lower layers. The framework is therefore suitable for curriculum learning (Elman, 1993), whose core concept is that a curriculum of progressively harder tasks can significantly accelerate a network's training. The training procedure is to train each decoding layer for some epochs, from the bottommost layer to the topmost one.
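A sketch of such a bottom-up schedule, using the five-epoch interval described later in Section 3.1 (the helper name is our own):

def active_layers(epoch, num_layers=4, epochs_per_layer=5):
    """Return how many decoding layers are trained at a given epoch (sketch):
    only the bottommost layer at first, one more every epochs_per_layer epochs."""
    return min(num_layers, epoch // epochs_per_layer + 1)

# Example: epochs 0-4 train layer 1 only; epoch 5 adds layer 2, and so on.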

3 Experiments

3.1 Setup

The experiments are conducted on the E2E NLG challenge dataset (Novikova et al., 2017)², a crowd-sourced dataset of 50k instances in the restaurant domain. The input is a semantic frame containing specific slots and corresponding values, and the output is natural language containing the given semantics, as shown in Figure 1.

²http://www.macs.hw.ac.uk/InteractionLab/E2E/

NLG Model                                           BLEU  ROUGE-1  ROUGE-2  ROUGE-L
(a) Sequence-to-Sequence Model                      28.9   40.7     12.5     32.1
(b) + Hierarchical Decoder                          43.1   53.0     24.6     40.4
(c) + Hierarchical Decoder, Repeat-Input            42.3   52.9     24.0     40.1
(d) + Hierarchical Decoder, Curriculum Learning     58.4   60.4     30.6     44.6
(e) + All                                           58.7   62.3     31.6     45.4
(f) (e) with High Inner-Layer TF Prob.              62.1   64.0     32.8     46.0
(g) (e) with High Inter-Layer TF Prob.              56.7   61.3     30.9     44.6
(h) (e) with High Inner- and Inter-Layer TF Prob.   60.0   63.0     31.8     45.2

Table 1: The NLG performance of the models (%), reported on BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.

To prepare the labels of each layer within the hierarchical structure of the proposed method, we utilize the spaCy toolkit to perform POS tagging on the target word sequences. Some properties such as restaurant names are delexicalized (for example, replaced with symbols like "RESTAURANT NAME") to avoid data sparsity. We assign the words with specific POS tags to each decoding layer: nouns, proper nouns, and pronouns to the first layer; verbs to the second layer; adjectives and adverbs to the third layer; and all other words to the fourth layer. Note that hierarchies with more than four levels are also applicable; the proposed hierarchical decoder is a general and easily extensible concept.
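To illustrate the label preparation, a hedged spaCy sketch that splits a (delexicalized) target sentence into the four per-layer word sequences is given below; following Figure 1, we assume each layer's target cumulatively includes the lower layers' words, and all other details are our own.

import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy model with a POS tagger

# POS sets per decoding layer, following the assignment described above.
LAYER_POS = [
    {"NOUN", "PROPN", "PRON"},  # layer 1
    {"VERB"},                   # layer 2
    {"ADJ", "ADV"},             # layer 3
]                               # layer 4 covers everything else

def layer_targets(sentence):
    """Split a target sentence into cumulative per-layer word sequences (sketch)."""
    doc = nlp(sentence)
    targets, allowed = [], set()
    for pos_set in LAYER_POS:
        allowed |= pos_set
        targets.append([t.text for t in doc if t.pos_ in allowed])
    targets.append([t.text for t in doc])  # layer 4: the full sentence
    return targets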

The experimental results are shown in Table 1; every reported number is averaged over the results of three different models on the official testing set. Row (a) is the simple seq2seq model serving as the baseline. The probability of activating inter-layer and inner-layer teacher forcing is set to 0.5 in rows (a)-(e); to evaluate the impact of teacher forcing, the probability is set to 0.9 in rows (f)-(h). The teacher forcing probability is attenuated every epoch, with a decay ratio of 0.9. We perform 20 training epochs without early stopping; when the curriculum learning approach is applied, only the first layer is trained during the first five epochs, the second decoding layer starts to be trained at the sixth epoch, and so on. To evaluate the quality of the generated sequences in terms of both precision and recall, the evaluation metrics include BLEU and ROUGE (1, 2, L) scores with multiple references.

3.2 Results and Analysis

To fairly examine the effectiveness of our proposed approaches, we control the proposed model to be smaller: the baseline seq2seq decoder has a 400-dim hidden layer, while the models with the proposed hierarchical decoder (rows (b)-(h)) have four 100-dim decoding layers. Table 1 shows that simply introducing the hierarchical decoding technique to separate the generation process into several phases, without adding parameters (row (b)), achieves significant improvement in both BLEU and ROUGE scores: 49.1% in BLEU, 30.2% in ROUGE-1, 96.8% in ROUGE-2, and 25.9% in ROUGE-L. Applying the proposed repeat-input mechanism (row (c)) and the curriculum learning strategy (row (d)) both offer considerable improvement. Combining all proposed techniques (row (e)) yields the best performance in both BLEU and ROUGE scores, achieving 103.1%, 53.1%, 152.8%, and 41.4% relative improvement in BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, respectively. The results demonstrate the effectiveness of the proposed approach.

To further verify the impact of teacher forcing, the integrated model (row (e)) is also evaluated with high inter-layer and inner-layer teacher forcing probabilities (rows (f)-(h)). Note that when teacher forcing is activated probabilistically, the strategy is also known as scheduled sampling (Bengio et al., 2015). Row (g) shows that a high probability of triggering inter-layer teacher forcing results in slight performance degradation, while a high inner-layer teacher forcing probability (rows (f) and (h)) can further benefit the model.

Note that the decoding process is a single-path forward generation without any heuristics or other mechanisms (such as beam search and reranking), so the effectiveness of the proposed methods can be fairly verified. The experiments show that by considering linguistic patterns in hierarchical decoding, the proposed approaches can significantly improve NLG results with smaller models.

4 Conclusion

This paper proposes a seq2seq-based model with a hierarchical decoder that leverages various linguistic patterns, and further designs several corresponding training and inference techniques. The experimental results show that models applying the proposed methods achieve significant improvement over the classic seq2seq model. By introducing additional word-level or sentence-level labels as features, the hierarchy of the decoder can be designed arbitrarily. Namely, the proposed hierarchical decoding concept is general and easily extensible, with the flexibility to be applied to many NLG systems.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Institute for Information Industry, the Ministry of Science and Technology of Taiwan, Google Research, Microsoft Research, and MediaTek Inc.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of ICLR.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724-1734.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL, pages 484-495.

Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of ACL, pages 45-51.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48(1):71-99.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of IJCNLP, pages 733-743.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Proceedings of ICASSP, pages 5528-5531.

Danilo Mirkovic and Lawrence Cavedon. 2011. Dialogue management using scripts. US Patent 8,041,570.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of SIGDIAL, pages 201-206.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104-3112.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of SIGDIAL, pages 275-284.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438-449.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270-280.

A Dataset Details

The experiments are conducted on the E2E NLG challenge dataset, a crowd-sourced dataset in the restaurant domain; the training set contains 42,064 instances, and there are 4,673 instances in the validation (development) set. In our experiments, we use the validation set to test our models. In the E2E NLG challenge dataset, the input is the semantics containing slots and their values, and the output is the corresponding natural language. For example, the slot-value pairs "name[Bibimbap House], food[English], priceRange[moderate], area[riverside], near[Clare Hall]" correspond to the target sentence "Bibimbap House is a moderately priced restaurant who's main cuisine is English food. You will find this local gem near Clare Hall in the Riverside area.".

B Parameter Setting

We use mini-batch Adam as the optimizer with a batch size of 32 examples. The baseline seq2seq model (row (a)) sets the encoder's hidden layer size to 200 and the decoder's to 400. For the models based on the proposed hierarchical decoder (rows (b)-(h)), the hidden layer sizes of the encoder and each decoding layer are 200 and 100, respectively. Note that in this setting, the models applying the proposed methods have fewer parameters than the baseline seq2seq model: for the models using the basic RNN cell, the baseline seq2seq model (row (a)) has 640k parameters, whereas the proposed models (rows (b)-(h)) have only 520k parameters.
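As a rough sanity check of this comparison, one can count the recurrent parameters directly; the sketch below omits embeddings and output projections and guesses the input sizes, so it only illustrates the quadratic growth of recurrent parameters with hidden size rather than reproducing the 640k/520k totals.

import torch.nn as nn

def rnn_params(input_size, hidden_size):
    """Parameter count of a single basic RNN layer."""
    return sum(p.numel() for p in nn.RNN(input_size, hidden_size).parameters())

print(rnn_params(400, 400))      # ~321k: one wide 400-dim decoder
print(4 * rnn_params(200, 100))  # ~121k: four narrow 100-dim decoding layers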
