Question Generation through Transfer Learning


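National Taiwan Normal University, Graduate Institute of Computer Science and Information Engineering. Advisor: Dr. Jia-Ling Koh. Question Generation through Transfer Learning. Graduate student: Yin-Hsiang Liao. July 2020 (Republic of China year 109).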

Abstract

Question Generation through Transfer Learning

By Yin-Hsiang Liao

An automatic question generation (QG) system aims to produce questions from a text, such as a sentence or a paragraph. Such a system can be useful on the front line of education, because writing questions is time-consuming and requires expert participation. Traditional approaches are mainly based on heuristic, hand-crafted rules that transduce a declarative sentence into a related interrogative sentence. In this work, we propose a data-driven approach that leverages a neural sequence-to-sequence framework with various transfer learning strategies to capture the underlying information needed to form a question in a target domain with few training pairs. Our experiments show that the modified model is capable of generating satisfactory results to some extent.

Keywords: Question generation, sequence-to-sequence model, transfer learning

Acknowledgement

I would like to take this opportunity to express my deepest gratitude to my advisor, Prof. Jia-Ling Koh, for her excellent guidance and continuous encouragement. I acquired a great deal of knowledge from her, and we held countless discussions during the writing process. She not only pointed out the direction for this research, but also cultivated my ability to think critically, to plan practically, and, most importantly, to persist. Without her illuminating advice and incredible patience, it would have been impossible for me to finish this thesis.

I would like to thank the oral examination committee, Arbee L. P. Chen, Jia-Lien Hsu, and Ming-Feng Tsai. They gave me precious advice and corrections, so that I could produce a better version of my thesis.

I wish to show my appreciation to my partners in the KDD lab. Ming Chieh, Avon, and YC gave me a lot of ideas when I was stuck in my research. Pei-Hao, Jin-An, Yi-huei, JiaYi, Chen-wei, Chin-Hsuan, Po-Wen, Hsiu-Yi, and Shih-Han helped me a great deal in academic work and in life when I was new there. Ya-Fang, Yo-Hsiang, Shao-Wei, and Pei-Hsuan did great preparation for the oral exam. It is my honor to be a member of the lab. Special thanks to Shih-Min and Fang Li for their selfless sharing; I learned many useful skills from them. Our friendship will be long-lasting.

Finally, I would like to thank my beloved family and my girlfriend Holly. Their unconditional love is the strongest support I have. Thank you.

Yin-Hsiang Liao

Contents

Abstract
Acknowledgement
Chapter 1 Introduction
1.1 Motivation
1.2 Challenge
1.3 Method
Chapter 2 Related Works
2.1 Question Generation
2.2 Seq2seq Models
2.3 Domain Adaptation
Chapter 3 Methods
3.1 Problem definition
3.2 Data Preparation
3.2.1 Source Domain Data
3.2.2 Target Domain Data

3.3 Baseline Model
3.3.1 Pointer Network Model
3.3.2 Pointer Network with Reinforcement Module
3.4 Domain Adaptation
3.4.1 Supervised Domain Adaptation
3.4.2 Unsupervised Domain Adaptation
Chapter 4 Performance Evaluation
4.1 Experiments Setup
4.2 Evaluation Measurements
4.3 Supervised Transfer Learning
Experiment 1
Experiment 2
Experiment 3
4.4 Unsupervised Transfer Learning
Experiment 4
Chapter 5 Discussion

Chapter 6 Conclusion
Reference
Appendix

Chapter 1 Introduction

1.1 Motivation

Nowadays, school teachers are far from the only source from which students can gain knowledge. Besides traditional classroom education, there are plenty of sources to choose from, such as massive open online courses (MOOCs) or open educational materials, e.g., OpenStax. Sufficient and frequent quizzes help students achieve better learning outcomes than just studying textbooks or notes (Karpicke, 2012; Kovacs, 2016). However, the number of related quizzes has not kept pace with the growing amount of online educational material, since creating quizzes associated with learning sources requires the participation of domain experts. Creating reasonable and meaningful questions hence becomes a costly task in both time and money. Accordingly, it is worthwhile to build a reliable automatic question generation system for educational purposes.

Meanwhile, previous works are generally based on English (Du et al., 2017; Wang et al., 2017; M. Heilman et al., 2010; Wang et al., 2018; Zhou et al., 2017). In this thesis, we aim to create a system that utilizes educational resources in traditional Chinese for middle-school students.

1.2 Challenge

In computational linguistics, there are two mainstream approaches to question generation: rule-based approaches and data-driven approaches. Rule-based approaches basically utilize grammars to generate texts. The simplest example, from the viewpoint of rule design, is transducing a declarative sentence “Dry distillation is the process of heating material without air.” to an interrogative one “What is the process of heating organics without air?” Ideally, these rule-based approaches can generate good questions if researchers design the rules well. However, creating such a set of rules requires deep linguistic knowledge, and most of these rules are language-specific. To the best of our knowledge, previous works are mostly based on English. To reduce the participation of experts, a data-driven approach is required, which alleviates the need for a strong linguistic background in a particular language. In addition, the proposed methods should fit our application scenario in traditional Chinese.

A data-driven question generation system for educational purposes faces two obvious difficulties:

(1) Properly labeled data is insufficient. In supervised approaches, training question generation models usually requires both input texts and corresponding questions. Quiz questions are relatively few compared with the educational content. Even

if quizzes are abundant because of an existing question bank, annotated content-question pairs remain rare.

(2) The absence of an effective evaluation metric. In previous works, the most popular evaluation metrics include BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). However, there has been no explicit and correct way to objectively evaluate whether the questions generated by a model are of good quality for the given content.

1.3 Method

To realize the aforementioned system, this study applies the following three kinds of strategies.

(1) Seq2seq model:

Seq2seq models provide a prevalent framework for dealing with end-to-end, sequential problems in natural language processing. They are frequently used in various applications such as machine translation (Sutskever et al., 2014), text summarization (See et al., 2017), news headline generation (Keneshloo et al., 2018) and question generation (Du et al., 2017). A Seq2seq model usually consists of an encoder and a decoder. The encoder reads the input text and converts it to a context vector carrying textual information. The vector is then

translated by the decoder, often with the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), to generate a meaningful question corresponding to the input text. Our encoder and decoder are implemented with a bidirectional Gated Recurrent Unit (bi-GRU; Bahdanau et al., 2015) and a GRU (Cho et al., 2014), respectively.

(2) Domain adaptation:

Domain adaptation, also known as transfer learning or fine-tuning, is a machine learning technique that seeks to apply the knowledge learnt on one task to a different but related task of interest. This technique may help us reduce the necessary amount of training data by tuning a pre-trained model rather than building a model from scratch. Thus, in this thesis, we first train a baseline model on the Delta Reading Comprehension Dataset (DRCD; Shao et al., 2018), a general, publicly available dataset with paragraph-question pairs. We then further tune the baseline model with data from the target domain of interest. The data of the target domain comprises either paragraph-question pairs or questions only. The experiments show that transfer learning improves the generated questions as measured by BLEU and ROUGE.

(3) Reinforcement learning:

Reinforcement learning is about the interaction between an agent and its environment. The agent takes actions in the environment, and the environment gives rewards back to the agent. According to the received rewards, the agent modifies its policy, i.e., the way it takes an action, and seeks the maximum reward. The form of the reward varies from task to task. Quantitative measures like BLEU and ROUGE are often used in the QG task, but, as noted earlier, these measures are not entirely suitable for the task: sometimes questions with higher BLEU/ROUGE scores may be considered worse than those with lower scores. We find that interrogatives are crucial to a question, so we use the related information as a kind of reward in the training process.

In this study, we mainly study how to apply a supervised transfer learning approach to a Seq2seq model for solving the QG task with few content-question training pairs. Inspired by Chung et al. (2018), we also consider an unsupervised approach combined with domain adaptation.

The framework of this research is shown in Fig. 1.1.

Fig. 1.1 Framework of the proposed methods.

In this thesis, our main contributions are as follows:

(1) We demonstrate the viability of domain adaptation in the QG task.

(2) We propose a Seq2seq, RNN-based model with a reinforcement module.

(3) We apply several model fine-tuning strategies to construct a Seq2seq pointer network (PTN) model for QG with better performance.

(4) We investigate the feasibility of unsupervised fine-tuning for Seq2seq models.

(5) In addition to BLEU and ROUGE, we use a simple additional measure, the proportion of generated questions with interrogatives, to evaluate the generated questions.

This thesis is organized as follows. Chapter 2 introduces the related works. Chapter 3 describes the proposed approaches in detail. Chapter 4 presents the performance evaluation of the methods. Chapter 5 gives a detailed discussion of the experimental results. Chapter 6 concludes this research and points out future work.

Chapter 2 Related Works

2.1 Question Generation

Learning to ask questions, also known as question generation, aims to create natural questions automatically from a given text, such as a sentence or a paragraph. Question generation systems can be used in the domain of education. A typical scenario is producing quiz questions, which test students’ understanding of the content of learning materials. Such a task, if done manually, is time-consuming and expensive because of the involvement of people, especially domain experts. A question generation system can also serve as part of a chatbot to start or continue a conversation, since asking questions is natural in human dialogue.

As indicated by Vanderwende (2008), deciding what text is worth asking a question about remains a challenge for computational systems. Z. Wang et al. (2018) pointed out that a good question must be fluent and related to the input. Moreover, as in the scenarios mentioned above, question generation is not just a modification of the original declarative sentence or paragraph. Synonyms and world knowledge, i.e., non-syntactic information that helps readers understand, should be deployed to generate good questions.

Researchers have dealt with the task of question generation by rule-based approaches in the past (e.g., Rus et al., 2010). These solutions depended on well-designed rules, based on profound linguistic knowledge, to transform declarative sentences into their syntactic representations and then generate interrogative sentences. Heilman and Smith (2010) took an “overgenerate-and-rank” strategy, which used a set of rules to generate more questions than needed and leveraged a supervised learning-based method to rank the produced questions. The ranking algorithm improved the outcome, and these rule-based approaches performed well on well-structured input text; however, because of the limitations of hand-crafted rules, the systems failed to deal with subtle or complicated text. In addition, these heuristic rule-based approaches focus on the syntactic information of input words, and most of them ignore semantic information.

Du et al. (2017) first proposed using a Seq2seq framework with the attention mechanism to model the question generation task for reading comprehension. Their work also considered context information from both a sentence and a paragraph. Y. Wang (2018) presented another QG model with two kinds of decoders. By considering the type of each word (interrogatives, topic words, and ordinary words), their model aimed to generate questions for an open-domain conversational system.

To leverage more data with potentially useful information, the answer to a question can also be considered (D. Tang et al., 2017). A question and its answer are highly related in both QG and question answering (QA). It is common to assume that the answer to a question is a part of the given content. Under this premise, the sentence containing an answer can be the input of the QG task and the output of the QA task. The work of Tang et al. showed that the QA and QG tasks could enhance each other at the same time in their training framework. Thus, it seems feasible to create a QA/QG machine that trains itself through an “inner conversation”. Sachan and Xing (2018) proposed a self-training model for jointly learning to ask and answer questions, and at the same time explored several heuristics to evaluate questions. We therefore started to investigate whether there exists a quantitative measurement that tells us whether a model can generate a good question.

The aforementioned question generation works were remarkable; nevertheless, their successes were inseparable from SQuAD, a relatively large, publicly available, general-purpose dataset. If there is a shortage of data in a domain of interest, e.g., teaching materials in middle school, it is difficult to train the models to an acceptable level. For educational purposes, Z. Wang et al. (2018) thus proposed a Seq2seq-based model, QG-net, that captures the way humans ask questions from a general-purpose dataset, SQuAD, and directly applied the built model to the

learning material, OpenStax textbooks. Similarly, this research aims to build a system for middle-school education, for which no large-scale dataset with manually reviewed context-question pairs exists. However, our work differs from QG-net mainly in the following aspects. First, we focus on the effectiveness of domain adaptation: we tune the proposed model with hundreds of labeled pairs in our target domain, i.e., middle-school textbooks. Second, since we have labeled data in the target domain, we can perform quantitative evaluations, which were not reported for the questions generated in the target domain of QG-net. Our experiments show that applying transfer learning improves performance on the quantitative metrics BLEU and ROUGE. Furthermore, our study utilizes a reinforcement learning technique in the Seq2seq model and proposes an unsupervised approach that leverages the generated questions as training pairs to fine-tune the baseline model. Finally, as mentioned before, the way of measuring the goodness of a generated question is important. We designed an evaluation that considers the proportion of generated questions with interrogatives.

2.2 Seq2seq Models

Recently, in the natural language processing (NLP) community, there has been a paradigm shift from rule-based approaches to data-driven approaches. For instance, RNN-based models have been widely used on sequential tasks. These models, especially Seq2seq, are capable of learning the syntactic structures or even the semantics of input and output texts from a large amount of training data, without manually assigned rules or heavy feature engineering. The Seq2seq model has various applications, ranging from machine translation and text summarization to news headline generation.

A Seq2seq model typically consists of an encoder and a decoder. The architecture of one or several encoders and one or several decoders is also called the encoder-decoder framework. Most such frameworks are implemented with RNNs, as in the Seq2seq model. The encoder reads through the input text as a context reader and converts it to a context vector carrying textual information. The vector is then decoded by the decoder, which acts as a question generator. The decoding procedure often uses the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) to generate a meaningful question corresponding to the input text. O. Vinyals et al. (2015) proposed the pointer network, a modification of Seq2seq, to deal with words absent from the training set. Their work was later used by Z. Wang et al. to point out which part of

the input content is more likely to appear in the output question. Inspired by this research, we modify the pointer network framework by adding a reinforcement module. This module leverages the proportion of generated questions with interrogatives as the reward in the training or tuning process. We describe in detail how we apply the Seq2seq model and its variations in chapter 3.

2.3 Domain Adaptation

Deep neural networks (DNNs) often benefit from transfer learning. In computer vision (CV), convolutional neural networks (CNNs) trained on a large-scale dataset such as ImageNet have proved to be a useful initialization for other CV tasks (M. Oquab et al., 2014). The key idea of that work is to reuse the pre-trained weights of the internal layers of the CNNs. In NLP, transfer learning has also been successfully applied to tasks like QA (Y. Chung et al., 2018), among others.

Although transfer learning has been successfully employed as a generic feature extractor in many applications, its feasibility for QG has yet to be well studied. Domain adaptation with or without annotated data in the target domain is referred to as supervised or unsupervised fine-tuning, respectively. As mentioned above, QG-net directly uses the pre-trained model on its target domain without fine-tuning. We propose various fine-tuning strategies on the pre-trained model by limited training

pairs of the target domain and investigate the effectiveness of transfer learning on QG.

Chapter 3 Methods

The proposed method consists of three processing steps: data preparation, training the baseline model, and domain adaptation.

3.1 Problem definition

The input is a sentence $S = \{w_i\}_{i=1}^{L_S}$, where $w_i$ is a word in the input sentence $S$ and $L_S$ is the length of $S$. The goal of question generation is to generate a natural-language question $Q$ that maximizes $P(Q \mid S, \theta)$ (Du et al., 2017):

$P(Q \mid S, \theta) = \prod_{t=1}^{L_Q} P(q_t \mid S, \{q_\tau\}_{\tau=1}^{t-1}, \theta)$,   (1)

where $L_Q$, $q_t$, and $\theta$ denote the length of $Q$, each word within $Q$, and the set of model parameters, respectively. In addition, a basic assumption is that the answer to the generated question is a consecutive segment of $S$. Accordingly, we mainly consider how to generate factual questions.

3.2 Data Preparation

The source domain data and the target domain data are prepared by the following processing steps.

3.2.1 Source Domain Data

The data format stored in the DRCD dataset is as follows:

Version:
Data:
- title  #the title of an article
- id  #the identification of an article
- paragraphs
- - id  #the identification of a paragraph
- - context  #the content of the paragraph
- - qas
- - - question  #the content of a question
- - - id  #the identification of a question
- - - answers
- - - - answer_start  #the starting position of an answer in the paragraph
- - - - id  #"1" for human-labeled, "2" or greater for human-answered
- - - - text  #the content of an answer

In other words, a given article can contain more than one paragraph, and a paragraph can come with several (context, question, answer) triples. The goal of the following processing is to extract, for each triple, the sentence from the paragraph that is most related to the answer to the question, and to use it as a shortened context to form the (sentence, question) pair. Moreover, Chinese word segmentation is performed on the sentence and the question, each of which is a sequence of Chinese characters, to obtain two sequences of semantic words.

(1) Extract the sentence S containing the answer A from the content of the paragraph for the corresponding question Q.

Fig. 3.1 shows an example of the sentence extracted from the context for a question.

Fig. 3.1 The extracted sentence for a question. Note that the answer is ‘1913’.

(2) Make the sentence-question pairs (S, Q).

(3) Segment the texts in each sentence-question pair via Jieba (J. Sun, 2012).

Following the example shown in Fig. 3.1, the result is shown in Fig. 3.2.

Fig. 3.2 An example of a segmented sentence-question pair in DRCD.

After the above processing, the constructed dataset is denoted as DB_DRCD.

3.2.2 Target Domain Data

To the best of our knowledge, there are no existing sentence-question pairs in the target domain. Accordingly, we collected a dataset of sentences from junior high school science textbooks and a dataset of multi-selection questions from a question bank.
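Returning to steps (1) to (3) of Section 3.2.1, the sketch below shows one way the (sentence, question) pairs could be built from a DRCD-style paragraph record. It is an illustration under stated assumptions, not the thesis implementation: splitting sentences on end punctuation, the helper names `split_sentences`, `sentence_containing_answer` and `make_pairs`, and the use of only the first listed answer are all hypothetical choices. The `segment` argument stands in for the word segmenter; passing `jieba.lcut` would match the Jieba step, while the default simply splits into characters so the sketch has no external dependency.

```python
import re

def split_sentences(context):
    """Split a paragraph into (start_offset, sentence) pairs.

    Assumption: sentences end at Chinese or Western end punctuation.
    """
    spans, start = [], 0
    for match in re.finditer(r"[。！？!?]", context):
        end = match.end()
        spans.append((start, context[start:end]))
        start = end
    if start < len(context):          # trailing text without end punctuation
        spans.append((start, context[start:]))
    return spans

def sentence_containing_answer(context, answer_text, answer_start):
    """Step (1): return the sentence that covers the answer span, or None."""
    for start, sentence in split_sentences(context):
        inside = start <= answer_start < start + len(sentence)
        if inside and answer_text in sentence:
            return sentence
    return None

def make_pairs(paragraph, segment=list):
    """Steps (2) and (3): build segmented (sentence, question) pairs."""
    pairs = []
    for qa in paragraph["qas"]:
        answer = qa["answers"][0]     # assumption: use the first listed answer
        sentence = sentence_containing_answer(
            paragraph["context"], answer["text"], answer["answer_start"])
        if sentence is not None:
            pairs.append((segment(sentence), segment(qa["question"])))
    return pairs
```

Answers that straddle a sentence boundary are skipped in this sketch; how the thesis handled such cases is not stated.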

The basic idea is that a sentence serving as the context of a factual question should have some degree of word overlap with the question. Accordingly, we selected factual questions from the database and matched them with the sentences in the textbooks semi-automatically as follows:

(1) Exclude the questions with one or more figures.

(2) Exclude the questions of the form “which of the following is correct/wrong”.

(3) Match each remaining question Q and its answer A with each sentence S in the textbook as follows:

(3-1) Make sure the answer A appears within the sentence S.

(3-2) Compute the closeness between the sentence S and the question Q by the BLEU-4 metric (Papineni et al., 2002).

(4) Sort the sentence-question pairs by their BLEU-4 scores in descending order and manually retain the proper pairs.

(5) Exclude duplicate sentence-question pairs.

(6) Segment the texts of the pairs via Jieba.

Fig. 3.3 shows the result of a sentence-question pair in the target domain.
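The automatic part of the matching procedure above (steps 3 to 5) can be sketched as follows. This is a minimal illustration, not the thesis code: the sentence-level BLEU-4 here uses uniform n-gram weights with a small floor `eps` to avoid log(0), the sentence is treated as the reference and the question as the hypothesis, and `tokenize` defaults to whitespace splitting. All of these are assumptions; the thesis does not state its BLEU configuration, and for Chinese text the tokenizer would be a word segmenter such as Jieba. The manual filtering in step (4) is not modeled.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis, eps=1e-9):
    """Sentence-level BLEU-4 with uniform weights and a brevity penalty."""
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        log_precision += 0.25 * math.log(max(overlap, eps) / total)
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity * math.exp(log_precision)

def match_pairs(questions, sentences, tokenize=str.split):
    """Rank candidate (sentence, question) pairs by BLEU-4.

    questions: list of (question_text, answer_text); sentences: list of str.
    """
    scored, seen = [], set()
    for question, answer in questions:
        for sentence in sentences:
            if answer not in sentence:            # step (3-1)
                continue
            if (sentence, question) in seen:      # step (5)
                continue
            seen.add((sentence, question))
            score = bleu4(tokenize(sentence), tokenize(question))  # step (3-2)
            scored.append((score, sentence, question))
    scored.sort(key=lambda item: item[0], reverse=True)            # step (4)
    return scored
```

The sorted list would then be reviewed by hand, as in step (4), before segmentation.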

Fig. 3.3 An example of a sentence-question pair in the target domain, i.e. the textbook.

After the above processing, the constructed dataset is denoted by DB_textbook.

3.3 Baseline Model

Two baseline neural models are used in this thesis: the pointer network model and the pointer network with reinforcement model. We describe them in detail in the following two subsections.

3.3.1 Pointer Network Model

Seq2seq model

Fig. 3.4 The Seq2seq framework of neural network.

I. Sutskever et al. (2014) first used the Seq2seq model to solve the machine translation task. This method used a Long Short-Term Memory (LSTM) network to map the input sequence into a vector of fixed dimension, shown as the context vector in Fig. 3.4. The method then used another LSTM to decode the output sequence from the vector and generate words. This model can effectively deal with a common problem in machine translation: input and output sequences have variable lengths, because the words $w_i$ and $q_t$ in the input sentence and the question do not necessarily correspond one-to-one. In the training phase, after the decoding process, the model parameters in $\theta$ are updated by stochastic gradient descent (SGD) to minimize the loss function. The internal LSTM cells have many alternatives. We implemented our baseline model with a bi-GRU as the encoder and a GRU as the decoder, to reduce the number of parameters in $\theta$ relative to an LSTM.

A GRU has two gates, a reset gate $r = \sigma(W_r x_t + U_r s_{t-1} + b_r)$ and an update gate $z = \sigma(W_z x_t + U_z s_{t-1} + b_z)$, where $x_t$ is the current input, i.e. the embedding of the input word $w_t$, $s_{t-1}$ is the previous hidden state, and $\sigma$ is the sigmoid function. $W_r$, $U_r$, $W_z$ and $U_z$ are trainable parameter matrices of the GRU, and $b_r$ and $b_z$ are biases. Intuitively, the update gate defines how much of the previous memory to keep, and the reset gate defines how to combine the new input with the previous hidden state. Therefore, at each time step t > 0, the current hidden state is:

$s_t = z \odot s_{t-1} + (\mathbf{1} - z) \odot \tilde{h}_t$, and

$\tilde{h}_t = \tanh(W_h x_t + U_h (r \odot s_{t-1}))$,

where $\odot$ is element-wise multiplication and $\mathbf{1}$ denotes an all-ones vector. $W_h$ and $U_h$ are also trainable parameter matrices of the GRU. The Bi-GRU contains an additional backward layer that calculates hidden states in decreasing order by reversing the input sentence. Let $\overrightarrow{s}_t$ denote the hidden state of the forward GRU layer and $\overleftarrow{s}_t$ the hidden state of the backward GRU layer. Then the hidden state of the Bi-GRU at time step t, denoted as $h_t$, is the concatenation of $\overrightarrow{s}_t$ and $\overleftarrow{s}_t$: $h_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$.
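The GRU update and the Bi-GRU concatenation described above can be written out as a small pure-Python sketch. It is meant only to make the equations concrete, assuming the convention that the `W_*` matrices act on the input and the `U_*` matrices act on the previous hidden state; the parameter dictionary layout and the function names are hypothetical, and a real model would use a deep learning framework rather than Python lists.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def vec_add(*vectors):
    """Element-wise sum of several vectors."""
    return [sum(values) for values in zip(*vectors)]

def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]

def gru_step(x, s_prev, p):
    """One GRU step: returns s_t given input x_t and previous state s_{t-1}."""
    r = sigmoid(vec_add(matvec(p["W_r"], x), matvec(p["U_r"], s_prev), p["b_r"]))
    z = sigmoid(vec_add(matvec(p["W_z"], x), matvec(p["U_z"], s_prev), p["b_z"]))
    gated = [ri * si for ri, si in zip(r, s_prev)]            # r ⊙ s_{t-1}
    h_cand = [math.tanh(v) for v in
              vec_add(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    # s_t = z ⊙ s_{t-1} + (1 - z) ⊙ h_cand
    return [zi * si + (1.0 - zi) * hi for zi, si, hi in zip(z, s_prev, h_cand)]

def bi_gru(inputs, p_forward, p_backward, hidden_size):
    """Run forward and backward GRU layers and concatenate their states."""
    forward, state = [], [0.0] * hidden_size
    for x in inputs:
        state = gru_step(x, state, p_forward)
        forward.append(state)
    backward, state = [], [0.0] * hidden_size
    for x in reversed(inputs):
        state = gru_step(x, state, p_backward)
        backward.append(state)
    backward.reverse()                 # align backward states with positions
    return [f + b for f, b in zip(forward, backward)]
```

Each returned vector has twice the hidden size, matching the concatenation of the forward and backward states.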

Attention mechanism

Fig. 3.5 Seq2seq model with attention mechanism.

The attention mechanism has lately been used to improve numerous machine learning tasks, especially in deep learning approaches, such as object detection in CV or machine translation in NLP. The key idea of the attention mechanism is to find which parts of the input should be focused on. In a Seq2seq model, this amounts to deciding which time steps i among the hidden states of the GRU encoder cells should receive a higher weight when predicting the decoder output at time step t, as shown by the attention distribution in Fig. 3.5.

Let $h_t^d$ and $h_i^e$ denote the hidden state of the decoder at time step t and the hidden state of the encoder at time step i, respectively. $e_{ti}$ denotes the attention score of the encoder’s hidden state $h_i^e$ with respect to the decoding time step t:

$e_{ti} = (h_t^d)^\top W_{attn}^e h_i^e + b_{attn}^e$,

where $W_{attn}^e$ is the parameter matrix and $b_{attn}^e$ is the bias to learn.

$a_{ti}^e = \mathrm{softmax}(e_{ti}) = \frac{\exp(e_{ti})}{\sum_{j=1}^{L_S} \exp(e_{tj})}$

is the attention weight for each input embedding at decoding time step t, used to obtain the context vector of the encoder $c_t^e$:

$c_t^e = \sum_{i=1}^{L_S} a_{ti}^e h_i^e$.

Finally, the context vector of the encoder, i.e. $c_t^e$, and the hidden state of the decoder at time step t, i.e. $h_t^d$, are used to infer a word as output at time step t. The output layer generates a token through a probability distribution:

$P_{vocab} = \mathrm{softmax}(W_{out} [h_t^d; c_t^e] + b_{out})$.
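As a concrete illustration of the attention computation above, the following sketch scores each encoder state against the current decoder state with a bilinear form, normalizes the scores with a softmax, and forms the context vector as the weighted sum of encoder states. It is a plain-Python sketch under the assumption of a scalar bias; the function names are hypothetical and the weight matrix would be learned in a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_score(h_dec, weight, h_enc, bias=0.0):
    """Attention score e_ti = h_dec^T W h_enc + b."""
    projected = [sum(w * x for w, x in zip(row, h_enc)) for row in weight]
    return sum(d * v for d, v in zip(h_dec, projected)) + bias

def attend(h_dec, encoder_states, weight, bias=0.0):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [bilinear_score(h_dec, weight, h, bias) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

The context vector returned here would then be concatenated with the decoder state before the output projection.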

Pointer Network

Fig. 3.6 The framework of pointer-network.

The Pointer Network (See et al., 2017), PTN, is a variation of the Seq2seq model. It calculates the output word probabilities as a weighted sum of two distributions, one of which comes from the output of the Seq2seq model and the other from the attention weights over the input sentence. In our constructed model, an intra-decoder attention, $a_{tt'}^d$, is used in the model as:

$a_{tt'}^d = \mathrm{softmax}(e_{tt'}^d)$, where $e_{tt'}^d = (h_t^d)^\top W_{attn}^d h_{t'}^d + b_{attn}^d$.

Then another context vector $c_t^d$ is generated by $c_t^d = \sum_{t'=1}^{t-1} a_{tt'}^d h_{t'}^d$ to provide context information about the previously generated sequence in the decoder.

Accordingly, $P_{vocab}$ is modified into $P_{vocab} = \mathrm{softmax}(W_{out} [h_t^d; c_t^e; c_t^d] + b_{out})$.

Moreover, as shown in Fig. 3.6, the proportion of the two probabilities is controlled by the tunable parameter $P_{gen}$, which is learned from the concatenation of the hidden state vector of the decoder, $h_t^d$, and the two context vectors, $c_t^e$ and $c_t^d$, obtained from the encoder and the decoder, respectively:

$P_{gen} = \sigma(W_{gen} [h_t^d; c_t^e; c_t^d] + b_{gen})$.

If unseen words occur in the input sentence, the PTN creates an extra dictionary for those unseen words. For each word $w_i$ in the input sentence S, if $w_i$ is an unseen word, $P_{copy}(w_i) = a_{ti}^e$.

Finally, the probability of a word w being predicted as the next generated word is:

$P(q_t = w \mid S, \{q_\tau\}_{\tau=1}^{t-1}) = P_{vocab}(w) P_{gen} + P_{copy}(w)(1 - P_{gen})$.
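The final mixture can be illustrated with a few lines of Python. The sketch below represents distributions as dictionaries and mixes the vocabulary distribution with the copy distribution induced by the encoder attention. One labeled assumption: following See et al. (2017), copy mass is added for every source token (summed over repeated positions), not only for unseen words; the function name `final_distribution` and the example tokens are hypothetical.

```python
def final_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix generation and copy probabilities for one decoder step.

    p_vocab: dict token -> probability from the output layer
    attention: encoder attention weights, one per source position
    source_tokens: the input sentence tokens, aligned with `attention`
    p_gen: scalar in [0, 1], weight of the generation distribution
    """
    mixed = {token: p_gen * prob for token, prob in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        # Out-of-vocabulary source tokens enter the extended dictionary here.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed

# Usage: "xylem" is not in the vocabulary but can still be produced by copying.
p_final = final_distribution(
    p_vocab={"what": 0.5, "is": 0.3, "the": 0.2},
    attention=[0.1, 0.7, 0.2],
    source_tokens=["the", "xylem", "is"],
    p_gen=0.6,
)
```

Because both input distributions sum to one, the mixed distribution also sums to one for any value of `p_gen` between 0 and 1.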

During training, the loss at each time step t is the negative log-likelihood of the predicted probability of the target word $q_t$, denoted as $loss_t = -\log P(q_t)$. The overall loss for a predicted sentence of length $l$ is:

$Loss(\{q_t\}_{t=1}^{l}) = \sum_{t=1}^{l} loss_t$.

Unseen words are likely to be chosen as output words when their attention weights in the input sentence are large. By using the attention weights over the input sentence, this model can effectively deal with the out-of-vocabulary problem.

Note that we did not feed Chinese characters into the model directly. Instead, we first built a dictionary of frequent words from the segmentation results of the training set. Each word in the dictionary has a unique identity and a corresponding pre-trained word embedding [1] (E. Grave and P. Bojanowski, 2018).

[1] https://fasttext.cc/docs/en/crawl-vectors.html

Coverage mechanism

The coverage mechanism is leveraged to solve the word repetition problem in sequence-to-sequence models (See et al., 2017). In the models of this thesis, we keep a coverage vector $cov_t$, which is defined as the sum of the encoder attention distributions at all previous decoder time steps $t'$, i.e. $a_{t'}^e$:

$$cov_{t} = \sum_{t'=0}^{t-1} a_{t'}$$

Note that $cov_{0}$ is a zero vector, since at the first time step none of the words in the given sentence has been covered. Attention elements that are attended to repeatedly are penalized by a composite loss function at each time step:

$$loss_{t} = -\log P(q_{t}) + \lambda \sum_{i} \min(a_{t,i}, cov_{t,i}).$$

The coverage term is weighted by a hyperparameter $\lambda$. Intuitively, the loss is smaller when the model pays attention to words that have not been focused on so far.

Teacher forcing.

In the training process of a Seq2seq model, the inference of a new token is based on the current hidden state and the previously predicted token. A bad inference then makes the next inference worse. This phenomenon is a kind of error propagation. D. Bahdanau et al. (2015) thus proposed a learning strategy to ease the problem. Instead of always using the generated tokens, the strategy gradually shifts the training process from fully using the true tokens toward mostly using the generated tokens. This method can yield performance improvements for sequence prediction tasks such as QG. In the proposed model, we set the teacher forcing ratio to 0.75 at the beginning and decay it by a factor of 0.9999 after each epoch.
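The coverage bookkeeping described above can be sketched for a single decoder step as follows. This is only an illustration under stated assumptions: the function name `coverage_step` is hypothetical, the coverage penalty is summed over all source positions as in See et al. (2017), and the default `lam=1.0` is a placeholder, since the value of $\lambda$ is not stated here.

```python
import math
from typing import List, Tuple


def coverage_step(attention: List[float],
                  coverage: List[float],
                  target_prob: float,
                  lam: float = 1.0) -> Tuple[float, List[float]]:
    """Return (loss_t, cov_{t+1}) for one decoder time step.

    attention   : encoder attention distribution a_t
    coverage    : coverage vector cov_t (all zeros at the first step)
    target_prob : probability P(q_t) assigned to the target word
    lam         : weight of the coverage penalty (assumed value)
    """
    # Overlap between the current attention and what was already covered.
    penalty = sum(min(a, c) for a, c in zip(attention, coverage))
    loss = -math.log(target_prob) + lam * penalty
    # Accumulate the attention for the next step: cov_{t+1} = cov_t + a_t.
    new_coverage = [c + a for a, c in zip(attention, coverage)]
    return loss, new_coverage
```

Calling the function twice with the same attention distribution shows the effect: the first call adds no penalty because the coverage vector is zero, while the second call is penalized by the full overlap.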

3.3.2 Pointer Network with Reinforcement Module

One of the main issues with current Seq2seq models is that minimizing the loss function, such as the cross-entropy between the predicted sequence and the true sequence, does not always result in the best performance on evaluation measurements such as BLEU or ROUGE. Reinforcement learning techniques can help reduce this mismatch (Keneshloo et al., 2018). The goal of training is to find the parameters of the agent that maximize the reward defined in reinforcement learning. In our scenario, the policy is a language model $P(Q \mid C, \theta)$ that tries to generate the question $Q$, and the reward is the evaluation measurement. That is, we can optimize the model by BLEU and/or ROUGE, which are our quantitative measurements.

However, from our observations, we notice that BLEU and ROUGE are not totally suitable for the QG task. Although some output sequences have higher BLEU/ROUGE scores than others, from a human perspective, at least ours, those sequences are not good questions; sometimes they are not even recognized as questions. Therefore, given the importance of the interrogatives in a question, we replaced BLEU/ROUGE by the proportion of output sequences in a batch that contain any pre-defined interrogative, as illustrated in Fig. 3.7.

Accordingly, the loss function for the pointer network with reinforcement module is defined to be $Final_{loss}$:

$$Final_{loss} = Loss(\{q_{t}\}^{target}) \times (1 - RL_{ratio}) + Loss_{RL} \times RL_{ratio},$$

$$\text{where } Loss_{RL} = Loss(\{q_{t}\}^{pred}) \times (1 - Iratio).$$

Here $\{q_{t}\}^{pred}$ denotes the prediction output of our model by beam search, $\{q_{t}\}^{target}$ denotes the target output of the model, $RL_{ratio}$ weights the reinforcement term, and $Iratio$ denotes the proportion of output sequences in the batch that contain interrogative words. The process is depicted in the following figure.

Fig. 3.7: Pointer network with RL module
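To make the reward concrete, the following sketch computes the interrogative ratio of a batch and combines it with the two loss terms. It rests on several assumptions that are not confirmed by the text: sequences are lists of segmented words, an interrogative counts only when it appears as a whole token, the short `INTERROGATIVES` list is a subset of the list in the appendix, and the names `interrogative_ratio`, `final_loss`, and `rl_weight` are hypothetical. The combination reads the final loss as a weighted sum of the target-sequence loss and the prediction loss scaled by (1 - Iratio).

```python
from typing import List

# A small subset of the interrogative list given in the appendix.
INTERROGATIVES = ["什麼", "誰", "多少", "如何", "哪"]


def interrogative_ratio(batch: List[List[str]]) -> float:
    """Proportion of output sequences that contain any interrogative token."""
    if not batch:
        return 0.0
    hits = sum(1 for seq in batch
               if any(token in INTERROGATIVES for token in seq))
    return hits / len(batch)


def final_loss(loss_target: float, loss_pred: float,
               i_ratio: float, rl_weight: float) -> float:
    """Weighted sum of the supervised loss and the reward-scaled RL loss.

    loss_target : NLL loss of the target question
    loss_pred   : NLL loss of the beam-search prediction
    i_ratio     : interrogative ratio of the batch (the reward)
    rl_weight   : weight of the reinforcement term (assumed to lie in [0, 1])
    """
    # A batch full of interrogatives (i_ratio = 1) removes the RL penalty.
    loss_rl = loss_pred * (1.0 - i_ratio)
    return loss_target * (1.0 - rl_weight) + loss_rl * rl_weight
```

With this reading, a batch in which every output contains an interrogative contributes no reinforcement loss, so the gradient signal comes only from the supervised term.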

3.4 Domain Adaptation

3.4.1 Supervised Domain Adaptation

Domain adaptation is a machine learning technique that aims to utilize the knowledge learnt from other fields. This is a significant technique because the problem of insufficient training data is sometimes unavoidable. The collection of data can be complicated, for example because of compliance with the GDPR, and it also consumes time and money. Additionally, most of the time we need a considerable amount of data to build an acceptable DNN-based model. To the best of our knowledge, there has been no preceding annotated dataset of sentence-question pairs in our target domain. Therefore, we take the transfer learning approach as follows:

(1) Given an epoch number epo_num, train the PTN model with/without the RL module on the source domain and save the model of each epoch until the number epo_num is reached.

(2) Select a model according to a selection strategy as the base model Mb. In this work, the strategy is to choose the model with the highest average BLEU-4 on the validation set of the target domain.

(3) Fine-tune the selected model Mb with/without the RL module. That is, we initialize another training process with the learnt parameters of the model Mb on the dataset of the target domain.

We also observe the outcome of freezing some layers in the model. The details of the various fine-tuning strategies are described in the experiments.

3.4.2 Unsupervised Domain Adaptation

The following Algorithm 1 shows our unsupervised approach.

Algorithm 1 Unsupervised QG Domain Adaptation
Input: Source dataset DBsource with (sentence, question) pairs; target dataset DBtarget with sentences but without any question labeling; number of training epochs epo_num.
Output: QG model M*
1: Pre-train QG models on the source dataset DBsource.
2: Select a pre-trained model Mb.
3: Repeat
4:     For each sentence S in the target dataset DBtarget
5:         Use Mb to predict its question Q.
6:         Insert (S, Q) as a training pair of DBtarget.
7:     Fine-tune Mb by the training pairs of the target dataset DBtarget.
8: Until the number of training epochs == epo_num.
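The loop of Algorithm 1 can be sketched in Python as below. This is an illustration rather than the implementation used in the experiments: `predict` and `fine_tune` are hypothetical callables standing in for PTN inference and one epoch of fine-tuning, and the sketch assumes that the fine-tuned model replaces Mb in the next round and that the pseudo-labeled pairs are regenerated in every round instead of being accumulated.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (sentence, generated question)


def unsupervised_adaptation(model: object,
                            target_sentences: List[str],
                            predict: Callable[[object, str], str],
                            fine_tune: Callable[[object, List[Pair]], object],
                            epo_num: int) -> object:
    """Self-training on an unlabeled target domain (steps 3 to 8).

    model            : the selected pre-trained base model Mb
    target_sentences : sentences of the target dataset without questions
    predict          : hypothetical function (model, sentence) -> question
    fine_tune        : hypothetical function (model, pairs) -> updated model
    epo_num          : number of training epochs (assumed to be >= 1)
    """
    for _ in range(epo_num):
        # Steps 4-6: label every target sentence with the current model.
        pairs = [(s, predict(model, s)) for s in target_sentences]
        # Step 7: fine-tune the model on its own generated pairs.
        model = fine_tune(model, pairs)
    return model
```

Because the model is trained on its own outputs, the quality of the selected base model determines what the loop reinforces.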

Chapter 4 Performance Evaluation

4.1 Experiments Setup

For training the base model, we used DBDRCD, which is an open-domain traditional Chinese machine reading comprehension (MRC) dataset. The dataset contains 10,014 paragraphs from 2,108 Wikipedia articles and more than 30,000 questions generated by annotators. We excluded the sentence-question pairs whose sentence or question exceeds 80 or 50 words, respectively, so 26,175 pairs in total are extracted from the open-source training set, with a vocabulary size of 28,981.

In the target domain, from DBtextbook, we annotated 480 data pairs and applied stratified random sampling with respect to BLEU to separate the data into train/test/validation sets in a proportion of roughly 7/2/1. This gives 336 pairs (vocabulary size 776) for training, 96 pairs for testing, and 48 pairs for validation.

For all the experiments, the dimensionality of the hidden states and of the pre-trained word embeddings in our models is set to 300. Moreover, we set the learning rate to 0.0001, the teacher forcing rate to 0.75, and the batch size to 32, and use Adam as the optimizer.

4.2 Evaluation Measurements

BLEU-4.

BLEU measures the average n-gram precision on a set of reference sentences. The key idea of BLEU is how similar the sentence predicted by a machine is to the reference sentence(s) made by humans. There are two terms used in BLEU: candidate and reference. Candidates stand for the sentences produced by the model, and references stand for the sentences given by humans, which are the questions in the testing pairs.

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \log p_{n}\right),$$

Here $p_{n}$ is the modified precision for n-grams and $w_{n}$ is the corresponding weight.

$$p_{n} = \frac{\sum_{C \in \{Candidates\}} \sum_{ngram \in C} Count_{clip}(ngram)}{\sum_{C' \in \{Candidates\}} \sum_{ngram' \in C'} Count(ngram')},$$

where $Count_{clip} = \min(Count, Max\_Ref\_Count)$. Besides, $Count$ denotes the number of times an n-gram appears in a candidate; $Max\_Ref\_Count$ denotes the largest count observed in any single reference for the n-gram. $BP$ is the penalty for short sentences,

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases},$$

where $c$ is the length of the candidate and $r$ is the length of the reference. In our experiments, $n \in \{1, 2, 3, 4\}$, $|Candidates| = 1$, the number of references is 1, and $w_{n} = 1/4$ for each n-gram order.

ROUGE.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is another similarity measurement between a human-written reference and a machine-predicted hypothesis. Here we use ROUGE-L, where L stands for the Longest Common Subsequence (LCS). Given a reference $X$ and a hypothesis $Y$ (similar to a candidate in BLEU-4), with sizes $m$ and $n$ respectively, ROUGE-L is formulated as:

$$ROUGE\text{-}L = \frac{(1 + \alpha^{2})\, R_{lcs}\, P_{lcs}}{R_{lcs} + \alpha^{2} P_{lcs}},$$

where $R_{lcs} = \frac{LCS(X, Y)}{m}$, $P_{lcs} = \frac{LCS(X, Y)}{n}$, $LCS(X, Y)$ denotes the length of the longest common subsequence of $X$ and $Y$, and $\alpha = 1$ in our evaluation.

4.3 Supervised Transfer Learning

In the following experiments, we select as the representative base model Mb the pre-trained model with the highest average BLEU-4 score on the validation set. In the following tables, the notations BLEU-4, ROUGE-R, and '?%' stand for the average BLEU-4, the average $R_{lcs}$ in ROUGE-L, and the ratio of interrogatives on the corresponding tested dataset, respectively.
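The two measurements of Section 4.2 can be sketched in plain Python for a single candidate and a single reference, both given as lists of segmented words. The function names are hypothetical, the BLEU sketch applies no smoothing (so any n-gram order without a match yields 0, which may differ from the toolkit used in the experiments), and the combined ROUGE-L score assumes the F-measure form of Lin (2004) with the weighting parameter named `alpha`.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu4(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Sentence-level BLEU-4 with one reference and weights w_n = 1/4."""
    log_sum = 0.0
    for n in range(1, 5):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Count_clip = min(Count, Max_Ref_Count) for every candidate n-gram.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: a zero precision gives BLEU = 0
        log_sum += 0.25 * math.log(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


def lcs_length(x: Sequence[str], y: Sequence[str]) -> int:
    """Length of the longest common subsequence by dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l(reference: Sequence[str], hypothesis: Sequence[str],
            alpha: float = 1.0) -> Tuple[float, float, float]:
    """Return (R_lcs, P_lcs, F_lcs) for one reference/hypothesis pair."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)
    precision = lcs / len(hypothesis)
    f = (1 + alpha ** 2) * recall * precision / (recall + alpha ** 2 * precision)
    return recall, precision, f
```

The first value returned by `rouge_l` corresponds to the ROUGE-R column reported in the tables, and averaging `bleu4` over the test pairs corresponds to the BLEU-4 column.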

Experiment 1:

In this experiment, three strategies for constructing models, denoted as M1, M2, and M3, are described as follows:

M1: The PTN models trained on the training set of DBDRCD.

M2: The PTN models trained on the training set of DBtextbook.

M3: The PTN models trained on the combination of the training sets of DBDRCD and DBtextbook.

As listed in Table 1, each model also has a version in which the pre-trained embedding is not fixed. The I/O vocabulary denotes the input/output vocabulary of the model.

The results of Experiment 1 are in Table 2, which shows that fixing the pre-trained embedding layer in M1 and M2 gives better results on the testing data of DBtextbook than fine-tuning the embedding layer. Besides, the model M1 trained on the larger dataset DBDRCD outperforms the model M2 trained directly on the smaller target dataset DBtextbook. The limited size of the target dataset cannot provide enough data for the complex PTN model. Moreover, simply mixing the two datasets does not lead to better performance than using DBDRCD alone.

| Model | I/O vocabulary | Training dataset | Fixed embedding |
|---|---|---|---|
| M1 | DBDRCD | DBDRCD | Yes/No |
| M2 | DBtextbook | DBtextbook | Yes/No |
| M3 | DBDRCD ∪ DBtextbook | DBDRCD ∪ DBtextbook | Yes/No |

Table 1: Descriptions of models in Experiment 1.

Fix embedding layer:

| Model | BLEU-4 | ROUGE-R | ?% |
|---|---|---|---|
| M1 | 0.352 | 0.438 | 0.073 |
| M2 | 0.259 | 0.388 | 0.208 |
| M3 | 0.299 | 0.414 | 0.052 |

Fine-tune embedding layer:

| Model | BLEU-4 | ROUGE-R | ?% |
|---|---|---|---|
| M1 | 0.295 | 0.401 | 0.115 |
| M2 | 0.193 | 0.336 | 0.083 |
| M3 | 0.314 | 0.418 | 0.063 |

Table 2: The evaluation results of Experiment 1.

Experiment 2:

The purpose of Experiment 2 is to demonstrate the effectiveness of supervised domain adaptation. In Experiment 2, the model with the highest average BLEU-4 on the validation set of DBtextbook in M1's training process is selected as Mb. The training set of DBtextbook is used to fine-tune Mb. Moreover, because DBDRCD and DBtextbook differ in their vocabulary, five strategies of transfer learning are proposed according to which layers of Mb related to the input/output vocabulary are retrained.

The five strategies of transfer learning models, denoted as M4, M5, M6, M7, and M8, are described as follows:

M4: The initial state of M4 is loaded from Mb except for the embedding layers in both the encoder and the decoder and the output layer. Moreover, we keep the vocabulary of DBDRCD as the input and output vocabulary.

M5: Similar to M4, but we load the state of Mb from M1 except for the output layer.

M6: Similar to M4, but we load all the layers of Mb from M1.

M7: The initial state of M7 is loaded from model Mb, except for the embedding layer in the decoder and the output layer. The output layer is resized to match the vocabulary in the training set of DBtextbook.

M8: The initial state of M8 is loaded from model Mb, except for the embedding layers in both the encoder and the decoder and the output layer. Both the input and output layers are resized to match the vocabulary in the training set of DBtextbook.

Table 3 compares the models used in Experiment 2. Moreover, in order to let the decoder learn more interrogatives for generating questions when the vocabulary of DBtextbook is used as the output vocabulary, models M7a and M8a take the vocabulary in the training set of DBtextbook plus a set of self-defined interrogatives as their output vocabulary instead.
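The difference between the strategies M4 to M8 lies in which layers are initialized from Mb and which are retrained from a fresh initialization. The sketch below illustrates that selection with plain dictionaries standing in for model states. All layer names (`encoder.embedding`, `decoder.embedding`, `output`, and so on) and the function `init_from_base` are hypothetical, and resizing layers to a new vocabulary (needed for M7 and M8) is assumed to have happened already when the fresh state was created.

```python
from typing import Dict, List, Tuple

State = Dict[str, List[float]]  # parameter name -> parameter values

# Layers that are NOT loaded from Mb and are therefore retrained.
RETRAINED: Dict[str, Tuple[str, ...]] = {
    "M4": ("encoder.embedding", "decoder.embedding", "output"),
    "M5": ("output",),
    "M6": (),
    "M7": ("decoder.embedding", "output"),
    "M8": ("encoder.embedding", "decoder.embedding", "output"),
}


def init_from_base(base: State, fresh: State, strategy: str) -> State:
    """Build the initial state of a fine-tuned model.

    base     : parameters of the selected base model Mb
    fresh    : freshly initialized parameters of the new model
    strategy : one of the keys of RETRAINED
    """
    skipped = RETRAINED[strategy]
    state = dict(fresh)
    for name, params in base.items():
        # Copy every base parameter unless its layer is to be retrained.
        if name in state and not name.startswith(skipped):
            state[name] = list(params)
    return state
```

In a deep learning framework the same idea is usually expressed by filtering the saved parameter dictionary before loading it into the new model.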

| Model | I/O vocabulary | Retraining layers |
|---|---|---|
| M4 | DBDRCD | Embedding in both encoder and decoder, Output |
| M5 | DBDRCD | Output |
| M6 | DBDRCD | None |
| M7 | DBDRCD / DBtextbook | Embedding in decoder, Output |
| M7a | DBDRCD / DBtextbook ∪ '?%' | Embedding in decoder, Output |
| M8 | DBtextbook | Embedding in both encoder and decoder, Output |
| M8a | DBtextbook ∪ '?%' | Embedding in both encoder and decoder, Output |

Table 3: Descriptions of models in Experiment 2.

Experimental Results

| Model | BLEU-4 | ROUGE-L | ?% |
|---|---|---|---|
| M1 | 0.352 | 0.438 | 0.073 |
| M4 | 0.352 | 0.438 | 0.010 |
| M5 | 0.366 | 0.443 | 0.010 |
| M6 | 0.420 | 0.523 | 0.365 |
| M7 | 0.340 | 0.456 | 0.365 |
| M7a | 0.355 | 0.470 | 0.177 |
| M8 | 0.438 | 0.535 | 0.302 |
| M8a | 0.452 | 0.552 | 0.208 |

Table 4: Results of transfer learning with various types of input/output sizes. Among the models that keep the DBDRCD vocabulary (M4 to M6), M6 shows the best results in all the measurements we tested.

The evaluations of Experiment 2 are shown in Table 4. M6 shows that transfer learning on DBtextbook by loading the pre-trained parameters from M1 significantly improves the performance in all the evaluations. Moreover, changing the I/O vocabulary to DBtextbook and retraining both the embedding and output layers

(M8) also achieves significant improvement. Adding the set of interrogatives to the output vocabulary (M7a, M8a) is useful to improve the results on BLEU-4 and ROUGE-L.

Experiment 3:

The purpose of Experiment 3 is to test the effectiveness of the reinforcement learning module. Accordingly, we add the reinforcement module to the previous training processes of M1, M6, and M8a to get $M_{1}^{RL}$, $M_{6}^{RL}$, and $M_{8a}^{RL}$.

$M_{1}^{RL}$: The PTN model with RL module trained on the training set of DBDRCD.

$M_{6}^{RL}$: The tuning method of $M_{6}^{RL}$ is the same as M6, except that the base model is fine-tuned by the PTN with RL module.

$M_{8a}^{RL}$: The tuning method of $M_{8a}^{RL}$ is the same as M8a, except that the base model is fine-tuned by the PTN with RL module.

Experimental Results

| Model | BLEU-4 | ROUGE-R | ?% |
|---|---|---|---|
| M1 | 0.352 | 0.438 | 0.073 |
| M6 | 0.420 | 0.523 | 0.365 |
| M8a | 0.452 | 0.552 | 0.208 |
| $M_{1}^{RL}$ | 0.296 | 0.388 | 0.021 |
| $M_{6}^{RL}$ | 0.348 | 0.458 | 0.135 |
| $M_{8a}^{RL}$ | 0.182 | 0.333 | 0.375 |

Table 5: Comparison between PTN with/without RL module.

As demonstrated in Table 5, the reinforcement learning (RL) module that targets the ratio of interrogative words does improve the '?%' of the transfer method M8a (from 0.208 to 0.375). However, due to the general drops in BLEU and ROUGE, we cannot confirm the effectiveness of an RL module designed in this way.

4.4 Unsupervised Transfer Learning

Experiment 4:

In Experiment 4, we choose several pre-trained models in M1 as Mb in Algorithm 1. Then we follow Algorithm 1, and the following figures record the trials of the unsupervised approach.

Experimental Results

The results show that this approach helps the PTN model on measurements such as BLEU-4, but decreases the ratio of interrogatives. This makes the output sequences unlike questions. We discuss this phenomenon further in the next chapter.

Fig. 4.4.1: Trials on the unsupervised approach.

Fig. 4.4.2: Trials on the unsupervised approach.

Fig. 4.4.3: Trials on the unsupervised approach.

Chapter 5 Discussion

For PTN, BLEU-4 is not good enough as a measurement of the QG task.

Since the central idea of BLEU is to measure the similarity between the references and the prediction, when our reference questions are similar to the input declaratives, the loss function, NLL, may decrease due to the copy mechanism rather than because the model has learnt the implicit information of asking questions. For example, in the testing pair “這種將物質隔絕空氣加熱分解的過程，稱為乾餾。” → “這種將物質隔絕空氣加熱分解的過程，稱為什麼？” (“Dry distillation is the process of heating material without air.” → “What is the process of heating material without air?”), the discrepancy between the input declarative and the output interrogative is small: only the final “乾餾。” is replaced by “什麼？”. In other words, copying almost the same content as the given context into the predicted question gets a high BLEU score. However, a high BLEU score is not sufficient evidence that a model can generate suitable questions.

Fig. 5.1: Example in M1, #6

Fig. 5.2: Example in M2, #138

Fig. 5.1 and Fig. 5.2 support our idea. The prediction in Fig. 5.1 has a higher BLEU score than the one in Fig. 5.2, but for native speakers of Chinese, the latter is clearly the better question with respect to its input. In addition, as illustrated in Fig. 5.3, evaluations such as BLEU climb to their peak within the first few epochs, whereas the ratio of interrogatives still shows a rising tendency. That is why we also use the ratio of interrogatives to measure the quality of the generated questions. The list of self-defined interrogatives is given in the appendix.

Fig. 5.3: The trend of evaluations in the training process.

From the results of the unsupervised approach, although we can get an incrementally higher BLEU score, the ratio of interrogatives in the self-generated questions decreases. The reason is that the generated questions are not representative of the target domain. Training on them gradually yields a model that tends to copy the input content in order to reduce the NLL loss. Therefore, it is important to start tuning from a model that is suitable for the target domain.

Chapter 6 Conclusion

In this work, in order to construct a question generation system for educational purposes, we propose a neural sequence-to-sequence framework based on a pointer network with various transfer learning strategies to construct a suitable model on a target domain with rare training pairs. The results of the experiments show that one of the transfer learning strategies works well and generates satisfactory results to some extent.

Furthermore, though performance measurements such as BLEU or ROUGE are widely used in previous research on question generation, we find their deficits for the PTN model in a task with highly similar pairs of content and question.

Accordingly, we raise another way of measuring the performance of the QG task and hope it can inspire more research on the QG task.

Fig. 6.1: Heat map of attention weights, M6, #6

Future works.

We will explore QG further in four possible directions. First, as we see in Fig. 6.1, PTN is actually capable of predicting proper interrogatives if it has learnt when to decrease the probability of copying from the input content. We could collect more data in the target domain or apply other updating criteria for the parameters. Second, the potential of our proposed measurement has not been exhausted. Although we did not get a convincing result that the ratio of interrogatives helps as a reward for reinforcement learning, there are various possibly helpful methods of reinforcement learning that we can study. Third, for the unsupervised approach, we will try a semi-supervised approach that generates more training pairs to fine-tune the best supervised transfer learning approach discovered in our study. Finally, the best evaluation for the QG task is still not well understood, and we will continue to work on it.

References

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 oral presentation. https://arxiv.org/abs/1409.0473v7
G. Chen, J. Yang, C. Hauff, and G. Houben. 2018. LearningQ: A Large-scale Dataset for Educational Question Generation. ICWSM-18.
K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. SSST-8, 103–111.
Y. Chung, H. Lee, and J. Glass. 2018. Supervised and Unsupervised Transfer Learning for Question Answering. NAACL-HLT 2018, 1585–1594.
X. Du, J. Shao, and C. Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1342–1352.
E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov. 2018. Learning Word Vectors for 157 Languages. arXiv:1802.06893v2 [cs.CL]. https://arxiv.org/abs/1802.06893v2

M. Heilman and N. A. Smith. 2010. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Los Angeles, California, 609–617.
J. Karpicke. 2012. Retrieval-Based Learning: Active Retrieval Promotes Meaningful Learning. Current Directions in Psychological Science 21, 3 (May 2012), 157–163.
Y. Keneshloo, T. Shi, N. Ramakrishnan, and C. K. Reddy. 2018. Deep Reinforcement Learning for Sequence to Sequence Models. arXiv:1805.09461. https://arxiv.org/abs/1805.09461
G. Kovacs. 2016. Effects of In-Video Quizzes on MOOC Lecture Viewing. In Proc. Conference on Learning at Scale, 31–40.
C. Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
M. Luong, H. Pham, and C. D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015, 1412–1421.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1717–1724. IEEE.
K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
V. Rus, B. Wyse, P. Piwek, M. Lintean, S. Stoyanchev, and C. Moldovan. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics, Stroudsburg, PA, USA, 251–257.
M. Sachan and E. P. Xing. 2018. Self-Training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, 629–640.
A. See, P. J. Liu, and C. D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1073–1083.

C. C. Shao, T. Liu, Y. Lai, Y. Tseng, and S. Tsai. 2018. DRCD: A Chinese machine reading comprehension dataset. ArXiv preprint. https://arxiv.org/abs/1806.00920
J. Sun. 2012. "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 3104–3112.
D. Tang, N. Duan, T. Qin, Z. Yan, and M. Zhou. 2017. Question answering and question generation as dual tasks. ArXiv e-prints. https://arxiv.org/abs/1706.02027
L. Vanderwende. 2008. The importance of being important: Question generation. In Proceedings of the 1st Workshop on the Question Generation Shared Task Evaluation Challenge, Arlington, VA.
O. Vinyals, M. Fortunato, and N. Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems, 2674–2682.
T. Wang, X. Yuan, and A. Trischler. 2017. A joint model for question answering and question generation. 1st Workshop on Learning to Generate Natural Language.

Y. Wang, C. Liu, M. Huang, and L. Nie. 2018. Learning to ask questions in open-domain conversational systems with typed decoders. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2193–2203.
Z. Wang, A. S. Lan, W. Nie, A. E. Waters, P. J. Grimaldi, and R. G. Baraniuk. 2018. QG-net: A data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, 7:1–7:10.
Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou. 2017. Neural question generation from text: A preliminary study. ArXiv e-prints. https://arxiv.org/abs/1704.01792

Appendix

List of interrogatives: ['哪', '哪個', '哪一', '哪種', '哪裡', '哪邊', '哪些', '哪位', '何', '何種', '何人', '何時', '何年', '為何', '何處', '何地', '如何', '何者', '有何', '有何意義', '誰', '什麼', '多少', '多久', '幾', '多遠', '多長']
