Pointer Network Model - Baseline Model - Question Generation through Transfer Learning

Chapter 3 Methods

3.3 Baseline Model

3.3.1 Pointer Network Model

Seq2seq model

Fig. 3.4 The Seq2seq framework of neural network.


I. Sutskever et al. (2014) first used the Seq2seq model to solve the machine translation task. Their method used a Long Short-Term Memory (LSTM) network to map the input sequence into a vector of fixed dimension, shown as the context vector in Fig. 3.4, and then used another LSTM to decode the output sequence from that vector and generate words. The model handles a common problem in machine translation: the input and output sequences may differ in length, because the words $w_i$ of the text sentence and the words $q_t$ of the question do not necessarily correspond one-to-one. In the training phase, after the decoding process, the model parameters $\theta$ were updated by stochastic gradient descent (SGD) to minimize the loss function. The internal LSTM cells can be replaced by several alternatives. We implemented our baseline model with a bi-GRU as the encoder and a GRU as the decoder, which reduces the number of parameters in $\theta$ relative to LSTM.

A GRU has two gates, a reset gate $r_t$ and an update gate $z_t$:

$$r_t = \sigma(W_r x_t + U_r s_{t-1} + b_r), \qquad z_t = \sigma(W_z x_t + U_z s_{t-1} + b_z),$$

where $x_t$ is the current input, i.e. the embedding of the input word $w_t$, $s_{t-1}$ is the previous hidden state and $\sigma$ is the sigmoid function. $W_r$, $U_r$, $W_z$ and $U_z$ are trainable parameter matrices of the GRU, and $b_r$ and $b_z$ are biases. Intuitively, the update gate defines how much of the previous memory to keep, and the reset gate defines how to combine the new input with the previous hidden state. Therefore, at each time step $t > 0$, the current hidden state is

$$s_t = z_t \odot s_{t-1} + (\mathbf{1} - z_t) \odot \tilde{h}_t, \qquad \tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot s_{t-1})\big),$$

where $\odot$ is element-wise multiplication and $\mathbf{1}$ denotes an all-ones vector. $W_h$ and $U_h$ are also trainable parameter matrices of the GRU.

A Bi-GRU contains an additional backward layer that computes hidden states in decreasing order of $t$ by reading the input sentence in reverse. Let $\overrightarrow{s}_t$ denote the hidden state of the forward GRU layer and $\overleftarrow{s}_t$ the hidden state of the backward GRU layer. Then the hidden state of the Bi-GRU at $t$, denoted $h_t$, is the concatenation of the two: $h_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$.
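The gate equations and the Bi-GRU concatenation can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the thesis: the helper names (`init_gru_params`, `gru_step`, `run_gru`, `bi_gru`), the random initialisation, the zero initial state and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_dim, hidden_dim, rng):
    """Random GRU parameters (hypothetical helper, small Gaussian init)."""
    params = {}
    for gate in ("r", "z", "h"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params["U_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    for gate in ("r", "z"):  # only the two gates carry a bias in the text
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def gru_step(x_t, s_prev, p):
    """One GRU update: returns s_t given x_t and s_{t-1}."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ s_prev + p["b_r"])   # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ s_prev + p["b_z"])   # update gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * s_prev))  # candidate state
    return z * s_prev + (1.0 - z) * h_tilde

def run_gru(xs, p, hidden_dim):
    """Run a GRU over a sequence of embeddings, starting from a zero state (assumed)."""
    s = np.zeros(hidden_dim)
    states = []
    for x_t in xs:
        s = gru_step(x_t, s, p)
        states.append(s)
    return np.stack(states)

def bi_gru(xs, p_fwd, p_bwd, hidden_dim):
    """Concatenate forward states with backward states re-aligned to input order."""
    fwd = run_gru(xs, p_fwd, hidden_dim)
    bwd = run_gru(xs[::-1], p_bwd, hidden_dim)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 4))          # 5 toy word embeddings of dimension 4
p_fwd = init_gru_params(4, 3, rng)
p_bwd = init_gru_params(4, 3, rng)
H = bi_gru(xs, p_fwd, p_bwd, 3)
print(H.shape)                        # (5, 6): one concatenated state per input word
```

Each row of `H` corresponds to $h_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$, so the encoder state is twice as wide as a single GRU state.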


Attention mechanism

Fig. 3.5 Seq2seq model with attention mechanism.

Attention mechanisms have recently been used to improve numerous machine learning tasks, especially deep learning approaches such as object detection in computer vision (CV) and machine translation in natural language processing (NLP). The key idea of an attention mechanism is to determine which parts of the input to focus on. In a Seq2seq model, this amounts to deciding which encoder time steps $i$ should have their GRU hidden states weighted more heavily when predicting the decoder output at time step $t$, as shown by the attention distribution in Fig. 3.5.


Let $h^d_t$ and $h^e_i$ denote the hidden state of the decoder at time step $t$ and the hidden state of the encoder at time step $i$, respectively. The attention score of the encoder's hidden state $h^e_i$ with respect to decoding time step $t$ is

$$e^t_i = (h^d_t)^\top W_a h^e_i + b_a,$$

where $W_a$ is the parameter matrix and $b_a$ is the bias to learn. The attention weight of each input position at decoding time step $t$ is

$$a^t_i = \mathrm{softmax}(e^t_i) = \frac{\exp(e^t_i)}{\sum_j \exp(e^t_j)},$$

and these weights give the context vector of the encoder, $c^e_t$:

$$c^e_t = \sum_i a^t_i h^e_i.$$

Finally, the context vector of the encoder, $c^e_t$, and the hidden state of the decoder at time step $t$, $h^d_t$, are used to infer the output word at time step $t$. The output layer generates a token through a probability distribution:

$$P_{vocab} = \mathrm{softmax}(W_v [h^d_t; c^e_t] + b_v).$$
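As a rough illustration of one attention step, the following NumPy sketch computes the bilinear scores, the softmax weights, the encoder context vector and the output distribution. The function names (`softmax`, `attend`), the parameter names (`W_a`, `b_a`, `W_v`, `b_v`), the random values and the toy sizes are assumptions for the example and are not taken from the thesis code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(h_dec, H_enc, W_a, b_a):
    """Return attention weights over encoder positions and the context vector.

    h_dec: decoder state at step t, shape (d_dec,)
    H_enc: encoder states, one row per input position, shape (n, d_enc)
    W_a:   bilinear score matrix, shape (d_dec, d_enc); b_a: scalar bias
    """
    scores = H_enc @ (W_a.T @ h_dec) + b_a   # score_i = h_dec^T W_a h_enc_i + b_a
    weights = softmax(scores)                # attention distribution over positions
    context = weights @ H_enc                # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
n, d_enc, d_dec, vocab_size = 5, 6, 3, 10    # toy sizes (assumed)
H_enc = rng.normal(size=(n, d_enc))
h_dec = rng.normal(size=d_dec)
W_a = rng.normal(scale=0.1, size=(d_dec, d_enc))
b_a = 0.0

weights, context = attend(h_dec, H_enc, W_a, b_a)

# Output layer: softmax over the vocabulary from [h_dec; context].
W_v = rng.normal(scale=0.1, size=(vocab_size, d_dec + d_enc))
b_v = np.zeros(vocab_size)
p_vocab = softmax(W_v @ np.concatenate([h_dec, context]) + b_v)
print(weights.round(3), p_vocab.argmax())
```

Because the bias `b_a` is shared by all positions, it shifts every score equally and does not change the softmax weights; it is kept here only to mirror the score equation.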


Pointer Network

Fig. 3.6 The framework of pointer-network.

Pointer Network (See et al., 2017), PTN, is a variation of the Seq2seq model. It calculates the output word probabilities as a weighted sum of two distributions: one comes from the output layer of the Seq2seq model, and the other comes from the attention weights over the input sentence. In our constructed model, an intra-decoder attention $a^{d,t}_j$ over the previous decoder states $h^d_j$, $j < t$, is also used:

$$a^{d,t}_j = \mathrm{softmax}(e^{d,t}_j), \quad \text{where} \quad e^{d,t}_j = (h^d_t)^\top W_d h^d_j + b_d.$$

Then another context vector $c^d_t$ is generated by $c^d_t = \sum_j a^{d,t}_j h^d_j$ to provide context information about the previously generated sequence in the decoder. Accordingly, $P_{vocab}$ is modified into

$$P_{vocab} = \mathrm{softmax}(W_v [h^d_t; c^e_t; c^d_t] + b_v).$$

Moreover, as shown in Fig. 3.6, the proportion of the two probabilities is controlled by the gate $P_{gen}$, which is learned from the concatenation of the hidden state vector of the decoder, $h^d_t$, and the two context vectors, $c^e_t$ and $c^d_t$, obtained from the encoder and the decoder, respectively:

$$P_{gen} = \sigma(W_g [h^d_t; c^e_t; c^d_t] + b_g).$$

If unseen words occur in the input sentence, the PTN creates an extra dictionary for those unseen words. For each word $w_i$ in the input sentence $S$, if $w_i$ is an unseen word, $P_{copy}(w_i) = a^t_i$.

Finally, the probability of a word $w$ being predicted as the next generated word is:

$$P(q_t = w \mid S, \{q_k\}_{k<t}) = P_{vocab}(w)\,P_{gen} + P_{copy}(w)\,(1 - P_{gen}).$$
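The mixture of the two distributions can be illustrated with a small NumPy sketch over an extended vocabulary. Several things here are assumptions rather than details from the thesis: the function name `final_distribution`, the toy numbers, and in particular the choice to follow See et al. (2017) in adding attention mass for every source token (not only unseen ones), which is what makes the mixed distribution sum to one.

```python
import numpy as np

def final_distribution(p_vocab, attention, src_ids, p_gen, n_oov):
    """Mix the generation and copy distributions over an extended vocabulary.

    p_vocab:   generation distribution over the fixed vocabulary, shape (V,)
    attention: encoder attention weights at this step, shape (n,)
    src_ids:   extended-vocabulary id of each source token, shape (n,);
               unseen words receive ids V, V+1, ... (the "extra dictionary")
    p_gen:     scalar gate in [0, 1]
    n_oov:     number of distinct unseen words in the source sentence
    """
    extended = np.concatenate([p_gen * p_vocab, np.zeros(n_oov)])
    # Scatter-add copy mass; np.add.at accumulates correctly for repeated ids.
    np.add.at(extended, src_ids, (1.0 - p_gen) * attention)
    return extended

# Toy example: vocabulary of 4 words, one unseen word with extended id 4
# that appears twice in the source sentence.
p_vocab = np.array([0.1, 0.5, 0.3, 0.1])
attention = np.array([0.1, 0.4, 0.2, 0.3])
src_ids = np.array([1, 4, 2, 4])
P = final_distribution(p_vocab, attention, src_ids, p_gen=0.6, n_oov=1)
print(P)  # the unseen word (id 4) gets (1 - 0.6) * (0.4 + 0.3) = 0.28
```

The unseen word has zero generation probability, so all of its mass comes from the attention weights of the positions where it occurs.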


During training, the loss at each time step $t$ is the negative log-likelihood of the predicted probability for the target word $q_t$, denoted $loss_t = -\log P(q_t)$. The overall loss for a predicted sentence of length $l$ is:

$$Loss(\{q_t\}_{t=1}^{l}) = \sum_{t=1}^{l} loss_t.$$

Unseen words are likely to be chosen as output words when their attention weights in the input sentence are large. By using the attention weights over the input sentence, this model can effectively deal with the out-of-vocabulary problem.

Note that we did not feed Chinese characters into the model directly. Instead, we first built a dictionary of frequent words from the segmentation results of the training set. Each word in the dictionary had a unique identity and a corresponding pre-trained word embedding¹ (E. Grave and P. Bojanowski, 2018).

¹ https://fasttext.cc/docs/en/crawl-vectors.html

Coverage mechanism

The coverage mechanism is used to address the word repetition problem in sequence-to-sequence models (See et al., 2017). In the models of this thesis, we keep a coverage vector $cov^t$, defined as the sum of the encoder attention distributions $a^{t'}$ over all previous decoder time steps $t'$:

$$cov^t = \sum_{t'=0}^{t-1} a^{t'}.$$

Note that $cov^0$ is a zero vector, since at the first time step none of the words in the given sentence has been covered. Attention elements that recur are penalized by a composite loss function at each time step:

$$loss_t = -\log P(q_t) + \lambda \sum_i \min(a^t_i, cov^t_i).$$

The coverage term is weighted by a hyperparameter $\lambda$. Intuitively, the loss is smaller when attention is paid to words that have not been attended to so far.
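To make the coverage penalty concrete, here is a minimal NumPy sketch that accumulates the coverage vector and evaluates the composite loss step by step. The function name `coverage_losses`, the value `lam=1.0` and the toy attention matrix are assumptions for illustration; the thesis does not state the value of the weighting hyperparameter in this section.

```python
import numpy as np

def coverage_losses(attn, target_probs, lam=1.0):
    """Per-step composite loss: negative log-likelihood plus coverage penalty.

    attn:         encoder attention per decoder step, shape (T, n)
    target_probs: model probability of each target word, shape (T,)
    lam:          weight of the coverage term (value assumed for the example)
    """
    cov = np.zeros(attn.shape[1])      # coverage starts as a zero vector
    losses = []
    for a_t, p_t in zip(attn, target_probs):
        penalty = np.minimum(a_t, cov).sum()   # overlap with past attention
        losses.append(-np.log(p_t) + lam * penalty)
        cov = cov + a_t                        # accumulate for the next step
    return np.array(losses)

# Two decoder steps that attend to almost the same input words.
attn = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1]])
target_probs = np.array([0.5, 0.5])
losses = coverage_losses(attn, target_probs)
print(losses)        # step 0 has no penalty; step 1 adds 0.6 + 0.2 + 0.1 = 0.9
print(losses.sum())  # sentence-level loss is the sum over time steps
```

In this toy case both steps have the same likelihood term, so the difference between the two losses is exactly the coverage penalty incurred by re-attending to the same positions.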

Teacher forcing

In the training process of a Seq2seq model, the inference of a new token is based on the current hidden state and the previously predicted token. A bad inference then makes the next inference worse, which is a kind of error propagation. D. Bahdanau et al. (2015) thus proposed a learning strategy to ease the problem. Instead of always using the generated tokens, the strategy gradually shifts the training process from fully using the true tokens toward mostly using the generated tokens. This method can improve performance on sequence prediction tasks such as QG. In the proposed model, we set the ratio of true tokens used to guide training to 0.75 at the beginning, and decay the ratio by multiplying it by 0.9999 after each epoch.
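The decay schedule can be sketched as follows. This is a minimal sketch under stated assumptions: the ratio is interpreted as the probability of feeding the true token to the decoder, the choice is made independently per token (the text does not say whether it is per token or per sequence), and the function names `teacher_forcing_ratio` and `choose_decoder_input` are hypothetical.

```python
import random

def teacher_forcing_ratio(epoch, initial=0.75, decay=0.9999):
    """Ratio after `epoch` completed epochs: initial * decay ** epoch."""
    return initial * decay ** epoch

def choose_decoder_input(true_token, generated_token, ratio, rng):
    """Feed the true token with probability `ratio`, else the generated one."""
    return true_token if rng.random() < ratio else generated_token

rng = random.Random(0)
for epoch in (0, 1, 100):
    ratio = teacher_forcing_ratio(epoch)
    token = choose_decoder_input("true", "generated", ratio, rng)
    print(epoch, ratio, token)
```

With a decay factor this close to one, the ratio changes very slowly, so under this reading training remains dominated by true tokens for a large number of epochs.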

