
Given a document D composed of the sentences [s0, s1, …, sn] in order, our goal is to obtain a vector representation vD for the document. Throughout this paper, […] denotes an ordered list.

3.1 Overview

Figure 3.1 gives an overview of our model, whose purpose is to obtain a vector representation for a document D in an unsupervised manner. We update the model's variables by training it to pick the target sentence out of a list of candidate sentences, given the target's context sentences. The context sentences are the k sentences on each side of the target sentence st, namely Scntx = [st-k, …, st-1, st+1, …, st+k].

In addition to the target sentence, r negative samples are paired with each target sentence st. The model computes a probability distribution over these r+1 candidate sentences to make its prediction. We refer to the list of candidate sentences as Scdd = [st, sneg1, …, snegr].
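The construction of Scntx and Scdd for one target position can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `build_training_example`, the argument `other_sentences` (a pool of sentences from other documents), and sampling without replacement are assumptions made here.

```python
import random

def build_training_example(doc, t, k, other_sentences, r, rng=None):
    """Return (S_cntx, S_cdd) for the target sentence doc[t].

    doc             : list of sentences of one document, in order
    k               : number of context sentences on each side
    other_sentences : pool to draw negatives from (assumed: other documents)
    r               : number of negative samples
    """
    rng = rng or random.Random(0)
    if t - k < 0 or t + k >= len(doc):
        raise ValueError("target needs k sentences on each side")
    # k sentences before and k sentences after the target, target excluded
    s_cntx = doc[t - k:t] + doc[t + 1:t + k + 1]
    # assumption: negatives are drawn without replacement
    negatives = rng.sample(other_sentences, r)
    # the target comes first, followed by the r negatives
    s_cdd = [doc[t]] + negatives
    return s_cntx, s_cdd

doc = ["s0", "s1", "s2", "s3", "s4"]
pool = ["x0", "x1", "x2", "x3"]
s_cntx, s_cdd = build_training_example(doc, t=2, k=1, other_sentences=pool, r=2)
print(s_cntx)  # ['s1', 's3']
print(s_cdd)   # 's2' followed by two sentences from the pool
```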

The model outputs r+1 scalars, one for each sentence in Scdd. These scalars are referred to as the logits of the sentences. A higher logit indicates that the model assigns a higher probability to the sentence. The logit of the target sentence st is denoted as lt, and the logits of the negative samples sneg1, …, snegr are denoted as lneg1, …, lnegr.

According to Mikolov et al., given these logits, minimizing the following loss function approximates optimizing the probability distribution over all possible sentences:
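Under the standard negative-sampling objective of Mikolov et al., which is assumed here to be the form Equation (1) takes, the loss over the logits reads as follows, with σ denoting the logistic sigmoid:

```latex
L = -\log \sigma(l_t) \;-\; \sum_{i=1}^{r} \log \sigma\!\left(-l_{neg_i}\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```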

doi:10.6342/NTU201902410


With negative sampling, no softmax is actually computed, yet a distribution over the unbounded set of all possible sentences is still approximately optimized.
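A minimal NumPy sketch of such a loss is given below. It assumes the standard negative-sampling form from Mikolov et al. (a log-sigmoid term for the target and one for each negated negative logit); the function names are illustrative and not taken from the original code.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigmoid(x)) = -log(1 + exp(-x)); logaddexp keeps this numerically stable
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(l_t, l_neg):
    """Assumed form: -log sigmoid(l_t) - sum_i log sigmoid(-l_neg_i)."""
    l_neg = np.asarray(l_neg, dtype=float)
    return float(-log_sigmoid(l_t) - np.sum(log_sigmoid(-l_neg)))

# A high target logit with low negative logits gives a small loss ...
good = negative_sampling_loss(5.0, [-5.0, -5.0])
# ... and the reverse situation gives a large one.
bad = negative_sampling_loss(-5.0, [5.0, 5.0])
print(good, bad)
```

Only the r+1 candidate logits enter the computation, which is why no normalization over all possible sentences is required.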

After the model is trained this way, it can be used to calculate a vector representation for a document.

Fig. 3.1 Overview of our model.

In the figure, the number of context sentences on each side is 1 and the number of negative samples r is 2. The context sentences st-1 and st+1 are fed to the model from the bottom. The target sentence st and the negative samples sneg1 and sneg2 are fed from the top. The logits of the target sentence, lt, and of the negative samples, lneg1 and lneg2, are obtained in the middle. These are used to calculate the loss.

3.2 Architecture

3.2.1 Model

As illustrated in Figure 3.1, we use sentence encoders to encode each sentence into a fixed-length sentence vector. The model uses two sentence encoders: the context encoder Ecntx and the candidate encoder Ecdd. Sentences in Scntx are encoded by Ecntx into sentence vectors Vcntx = [vt-k, …, vt-1, vt+1, …, vt+k]. Those in Scdd are encoded by Ecdd into a target sentence vector vt and negative sample vectors Vneg = [vneg1, …, vnegr]. To merge the information captured by the sentence vectors in Vcntx into a single context vector, the vectors in Vcntx are averaged element-wise. The resulting context vector is called vcntx.

Except when calculating Lcntx (Section 3.3.1), vcntx goes through a process called length adjustment. This process normalizes vcntx and then rescales it to the average length of the sentence vectors from which vcntx itself was obtained. The process is as follows:
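Written out from the verbal description above, and taking Vcntx to be the list of sentence vectors that were averaged (an assumption about the notation), the adjustment can be expressed as:

```latex
v_{cntx} \;\leftarrow\; \frac{v_{cntx}}{\mathrm{length}(v_{cntx})}
\cdot \frac{\sum_{v \in V_{cntx}} \mathrm{length}(v)}{\mathrm{size}(V_{cntx})}
```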

where length(x) denotes the l2 norm of x and size(y) denotes the number of elements in y. This process solves the length vanishing problem that arises when many vectors are averaged element-wise.
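The length vanishing effect and its correction can be illustrated numerically. The sketch below is an assumption-laden reading of the description: `length_adjust` rescales the averaged vector to the mean l2 norm of its inputs, random Gaussian vectors stand in for real sentence vectors, and the zero-norm guard is an addition made here for safety.

```python
import numpy as np

def length_adjust(v_cntx, vectors):
    """Normalize v_cntx, then scale it to the average l2 norm of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    avg_len = np.linalg.norm(vectors, axis=1).mean()
    norm = np.linalg.norm(v_cntx)
    if norm == 0.0:  # guard added in this sketch; not described in the text
        return v_cntx
    return v_cntx / norm * avg_len

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 64))   # stand-ins for 200 sentence vectors
v_mean = vectors.mean(axis=0)          # element-wise average
v_adj = length_adjust(v_mean, vectors)

avg_len = np.linalg.norm(vectors, axis=1).mean()
print(avg_len)                  # typical length of one sentence vector
print(np.linalg.norm(v_mean))   # the plain average is much shorter
print(np.linalg.norm(v_adj))    # the adjusted vector is back at avg_len
```

The direction of the averaged vector is unchanged; only its length is restored.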

We now have a single vector vcntx containing the unified information from the context sentences. If the sentence vector of a candidate is similar to vcntx, that candidate is probably the sentence to be predicted. Similarity is measured with the inner product. Accordingly, we take the inner product of vcntx with the target sentence vector vt and with each negative sample vector in Vneg to obtain a logit for each of them. The logit of the target sentence is lt = dot(vcntx, vt), and the logits of the negative samples are lneg1, …, lnegr, where lnegi = dot(vcntx, vnegi).
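The logit computation amounts to one dot product for the target and one matrix-vector product for the negatives. The following sketch uses small hand-picked vectors; the function name `candidate_logits` and the example values are illustrative only.

```python
import numpy as np

def candidate_logits(v_cntx, v_t, V_neg):
    """Return (l_t, l_neg) where each logit is an inner product with v_cntx."""
    l_t = float(np.dot(v_cntx, v_t))                 # l_t = dot(v_cntx, v_t)
    l_neg = np.asarray(V_neg, dtype=float) @ v_cntx  # l_neg_i = dot(v_cntx, v_neg_i)
    return l_t, l_neg

v_cntx = np.array([1.0, 0.0, 2.0])
v_t = np.array([0.5, 1.0, 1.0])
V_neg = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
l_t, l_neg = candidate_logits(v_cntx, v_t, V_neg)
print(l_t)    # 2.5
print(l_neg)  # [ 0. -1.]
```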

With these logits, the loss can be calculated with Equation (1).

Table 3.1 Structure of sentence encoders.

For consistency with Figure 3.1, the first layer is placed at the bottom and the last layer at the top.

3.2.2 Sentence encoders

Ecntx and Ecdd have the same structure, as detailed in Table 3.1. Nevertheless, they do not share variables except for the word embedding table. This allows a sentence to be represented differently depending on the role it plays. We choose convolutional networks as sentence encoders for their simplicity and training efficiency. Note that a global average pooling layer is placed on top of the convolutional layers to form a fixed-length vector for sentences of variable length.
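The role of global average pooling can be seen in a small sketch: whatever the number of words, averaging the convolutional features over the time axis yields a vector of the same size. The single convolutional layer, filter width 3, zero padding, ReLU activation, and dimensions used below are all assumptions chosen for illustration and are not the structure listed in Table 3.1.

```python
import numpy as np

def conv1d_relu(x, w, b):
    """x: (T, d_in) word vectors; w: (width, d_in, d_out); zero 'same' padding."""
    width = w.shape[0]
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))  # pad along time only
    out = np.stack([
        np.tensordot(xp[i:i + width], w, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0])
    ])                                               # shape (T, d_out)
    return np.maximum(out + b, 0.0)

def encode(x, w, b):
    # global average pooling over the time axis gives a (d_out,) vector
    return conv1d_relu(x, w, b).mean(axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 8, 16))         # hypothetical filter bank
b = np.zeros(16)
short_sent = rng.normal(size=(5, 8))    # 5 words, 8-dim embeddings
long_sent = rng.normal(size=(12, 8))    # 12 words
v_short = encode(short_sent, w, b)
v_long = encode(long_sent, w, b)
print(v_short.shape, v_long.shape)      # both (16,)
```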

3.3 Training

During training, the list of sentences SD = [s0, s1, …, sn] from a single document D is fed to the model as a single training sample. The total loss to be minimized, Ltotal, is the weighted sum of two terms: the context loss Lcntx and the document loss Ldoc. The model is trained end to end by minimizing Ltotal.

3.3.1 Context loss

For each target sentence in SD, the k sentences before it and the k sentences after it are given in Scntx as context sentences. In addition, the negative samples sneg1, …, snegr are randomly selected from sentences in other documents in the dataset. The length adjustment process is not applied when calculating the context loss. The target sentence logit lt and the negative sample logits lneg1, lneg2, …, lnegr are obtained and used to calculate the per-sentence loss Lcntx(t) with Equation (1). The context loss Lcntx is defined as the average of the per-sentence losses over every sentence in SD except the first k and the last k sentences, whose context windows are incomplete:
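With sentences indexed from s0 to sn, the targets with a complete context window are those with index k through n−k. Written out under that reading, and using a superscript (t) for the per-sentence loss (a notational choice made here), the average is:

```latex
L_{cntx} = \frac{1}{n - 2k + 1} \sum_{t=k}^{n-k} L_{cntx}^{(t)}
```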

where Lcntx(t) is the context loss of the single target sentence st.

3.3.2 Document loss

The document loss differs from the context loss in only two ways: 1) the length adjustment process is applied to vcntx; 2) all the sentences in SD, including the target sentence st itself, are regarded as context sentences for each target sentence.

Consequently, every sentence in SD can be used as a target sentence.

The document loss Ldoc is defined by averaging losses from all the sentences in the document:
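Since every one of the n+1 sentences s0, …, sn serves as a target, the average can be written as follows, where the superscript (t) marks the loss for target st computed with Equation (1) (notation introduced here for illustration):

```latex
L_{doc} = \frac{1}{n + 1} \sum_{t=0}^{n} L_{doc}^{(t)}
```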

3.3.3 Total loss

The total loss is the weighted sum of context loss and document loss. A hyper-parameter α is used to assign weights. Total loss Ltotal is obtained by:
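The text states only that α assigns the weights. Assuming a convex combination, which is one common way to weight two loss terms and may differ from the exact form used, the total loss would be:

```latex
L_{total} = \alpha \, L_{cntx} + (1 - \alpha) \, L_{doc}, \qquad 0 \le \alpha \le 1
```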

Ltotal is then minimized to update the model variables. In particular, Lcntx and Ldoc are responsible for capturing local and global relations among sentences, respectively. Ldoc also ensures that the sentence vectors can be aggregated effectively.

Fig. 3.2 Composition of the total loss function.

Each loss term is obtained from the same model structure but different input. The weights of the terms are assigned by a hyper-parameter α.

3.4 Inference of document representation

For a document D, its representation is the length-adjusted average of the sentence vectors of all its sentences. No extra training is needed for previously unseen documents. Note that this is exactly the context vector vcntx used for calculating Ldoc; it is used explicitly during training on purpose. This leads the model to learn sentence vectors that can be aggregated effectively by averaging. The aggregated representation is also expected to be informative, since it is itself optimized during training.
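Inference therefore reduces to encoding, averaging, and rescaling. In the sketch below, `encode` is any callable mapping a sentence to a vector; a trained encoder would be used in practice (presumably Ecntx, since vcntx is built from its outputs), and the `toy_encode` function here is a hypothetical stand-in so that the example runs on its own.

```python
import numpy as np

def document_vector(sentences, encode):
    """Length-adjusted average of the sentence vectors of one document."""
    V = np.stack([np.asarray(encode(s), dtype=float) for s in sentences])
    mean = V.mean(axis=0)                       # element-wise average
    avg_len = np.linalg.norm(V, axis=1).mean()  # average sentence-vector length
    norm = np.linalg.norm(mean)
    # zero-norm guard is an addition of this sketch
    return mean if norm == 0.0 else mean / norm * avg_len

def toy_encode(sentence):
    # hypothetical stand-in for a trained encoder: (characters, words)
    return np.array([len(sentence), len(sentence.split())], dtype=float)

sentences = ["a b c", "d e"]
v_doc = document_vector(sentences, toy_encode)
print(v_doc)
```

No parameters are updated in this step, which is why a new document needs no additional training.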
