Positive/Negative instances - A Unified RvNN Framework for End-to-End Chinese Discourse Parsing

Figure 4.5: Training Instances Example (legend: positive instance, negative instance, EDU)

To train the RvNN, we take as positive instances the tree nodes extracted from the discourse parse trees in CDTB. Figure 4.5 illustrates a parse tree constructed from five segments (the black nodes). The black nodes, black lines, and red nodes form the parse tree itself. Taking each red node as a root yields four subtrees, which are the four positive instances we can derive from this parse tree for training. For negative instances, we select any two neighboring subtrees and merge them into a new tree; the new tree is regarded as a negative instance if it is inconsistent with the ground truth. As Figure 4.5 shows, there are four possible negative instances, rooted at the blue nodes. The losses of the merge scorer, the sense classifier, and the center classifier, L_m, L_s, and L_c respectively, are measured with cross-entropy. When training on a negative instance, we do not need to evaluate the sense and center classification, so only L_m is used. For a positive instance, we sum the three losses and optimize them jointly. More formally, our loss function L is defined as:

\[
L =
\begin{cases}
L_m, & \text{if the instance is negative} \\
L_m + L_s + L_c, & \text{otherwise}
\end{cases}
\tag{4.7}
\]
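The instance construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding, the `(start, split, end)` representation of a merge, and the example tree are all assumptions made for the sketch, and the example tree is hypothetical rather than the one drawn in Figure 4.5.

```python
from typing import List, Set, Tuple, Union

# Assumed encoding: a leaf is a segment index, an internal node is a (left, right) pair.
Tree = Union[int, Tuple["Tree", "Tree"]]
Span = Tuple[int, int]        # (start, end), end exclusive
Merge = Tuple[int, int, int]  # (start, split, end): merges [start, split) with [split, end)


def collect(tree: Tree, spans: Set[Span], merges: Set[Merge]) -> Span:
    """Record the span of every node and every gold merge; return this node's span."""
    if isinstance(tree, int):
        span = (tree, tree + 1)
    else:
        left_start, split = collect(tree[0], spans, merges)
        _, right_end = collect(tree[1], spans, merges)
        span = (left_start, right_end)
        merges.add((left_start, split, right_end))
    spans.add(span)
    return span


def training_instances(tree: Tree) -> Tuple[List[Merge], List[Merge]]:
    """Positive instances are the gold merges; negative instances are merges of
    two neighboring gold subtrees that do not appear in the gold tree."""
    spans: Set[Span] = set()
    gold: Set[Merge] = set()
    collect(tree, spans, gold)
    negatives = sorted(
        (a, b, d)
        for (a, b) in spans
        for (c, d) in spans
        if b == c and (a, b, d) not in gold
    )
    return sorted(gold), negatives


# Hypothetical five-segment tree, not necessarily the one in Figure 4.5.
example_tree: Tree = ((0, 1), (2, (3, 4)))
positives, negatives = training_instances(example_tree)
print(len(positives), len(negatives))
```

For this particular hypothetical tree the sketch yields four positive and four negative instances; other tree shapes over five segments may give a different number of negatives.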

We use stochastic gradient descent (SGD) with a learning rate of 0.1 for parameter optimization.
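As a rough illustration of Equation 4.7 and the update rule, the sketch below computes the per-instance loss from predicted class distributions and applies one plain SGD step. The function names, the convention that class 1 of the merge scorer means "merge", and the list-of-floats parameter layout are assumptions made for this example; the actual model would compute gradients through the network.

```python
import math
from typing import List, Optional, Sequence


def cross_entropy(probs: Sequence[float], gold: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])


def instance_loss(
    is_positive: bool,
    merge_probs: Sequence[float],
    sense_probs: Optional[Sequence[float]] = None,
    sense_gold: Optional[int] = None,
    center_probs: Optional[Sequence[float]] = None,
    center_gold: Optional[int] = None,
) -> float:
    """Equation 4.7: L_m for a negative instance, L_m + L_s + L_c otherwise."""
    # Assumption: the merge scorer is binary and class 1 means "merge".
    l_m = cross_entropy(merge_probs, 1 if is_positive else 0)
    if not is_positive:
        return l_m
    l_s = cross_entropy(sense_probs, sense_gold)
    l_c = cross_entropy(center_probs, center_gold)
    return l_m + l_s + l_c


def sgd_step(params: List[float], grads: List[float], lr: float = 0.1) -> List[float]:
    """One plain SGD update with the learning rate stated in the text."""
    return [p - lr * g for p, g in zip(params, grads)]


negative_loss = instance_loss(False, [0.8, 0.2])
positive_loss = instance_loss(True, [0.1, 0.9], [0.25, 0.25, 0.5], 2, [0.5, 0.5], 0)
updated = sgd_step([1.0, -2.0], [0.5, -1.0])
```

The negative instance contributes only the merge loss, while the positive instance adds the sense and center terms, so the two classifiers receive no gradient from negative instances.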

4.6 Parse Tree Construction

In the prediction stage, we construct the discourse parse tree based on the predictions made by Tree-LSTM. We modify the Cocke–Younger–Kasami (CKY) algorithm (Younger [1967]) to maximize the probability of the whole parse tree. The CKY-like dynamic programming algorithm simulates the recursive parsing procedure, considering local and global information jointly. In each step of the dynamic programming procedure, we consider several combinations of two neighboring trees L and R, merge each pair into a new tree N, and keep the two trees N with the highest probability Pr(N) as candidates for future steps. Pr(N) is formulated as follows:

\[
\mathrm{Pr}(N) = \mathrm{Pr}_{\mathrm{Merge}}(L, R) \times \mathrm{Pr}(L) \times \mathrm{Pr}(R)
\tag{4.8}
\]

Pr_{Merge}(L, R) above is the output of the merge scorer in our model. Because we always store the top two trees N in each entry of the dynamic programming table, we refer to our CKY-like algorithm as CKY2. See Algorithm 1 for details.

Algorithm 1 Discourse Parse Tree Construction with Dynamic Programming

Probs ← table[][]
for level from 0 to n − 1 do                                   ▷ The span range
    for col from 0 to n − level do                             ▷ The start column
        Probs[level][col] ← 0
        Candidates ← list[]
        for k from 1 to level do
            LeftTree ← GetTree(k − 1, col)                     ▷ Get the candidate tree for left span
            RightTree ← GetTree(level − k, col + k)            ▷ Get the candidate tree for right span
            NewTree, MergeProb ← Merge(LeftTree, RightTree)    ▷ Apply RvNN unit
            Prob ← MergeProb × Probs[k − 1][col] × Probs[level − k][col + k]
            Candidates.Append(NewTree, Prob)                   ▷ Keep the merged tree as a candidate
        Tree, Prob ← Candidates.MaxProb()                      ▷ Get the maximum probability and the tree
        SaveTree(level, col, Tree)                             ▷ Save as the candidate tree
        Probs[level][col] ← Prob
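The procedure can also be sketched in runnable form. The version below is one possible reading rather than the authors' code: it keeps the top two candidates per table entry as the text describes (Algorithm 1 shows only the maximum), indexes the table by span length instead of `level`, assumes a single segment has probability 1, and takes the merge scorer as a hypothetical callable `merge_prob(left, right)` standing in for the RvNN unit.

```python
from typing import Callable, List, Tuple

Tree = object                            # here: nested tuples of segment indices
Candidate = Tuple[float, Tree]           # (probability, tree)
MergeFn = Callable[[Tree, Tree], float]  # hypothetical stand-in for the merge scorer


def cky2(n: int, merge_prob: MergeFn, beam: int = 2) -> Candidate:
    """Return the most probable binary tree over n segments (n >= 1)."""
    # table[length][start] holds up to `beam` candidates for the span that
    # covers segments start .. start + length - 1.
    table: List[List[List[Candidate]]] = [[[] for _ in range(n)] for _ in range(n + 1)]
    for start in range(n):
        table[1][start] = [(1.0, start)]  # assumption: a lone segment has probability 1
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            candidates: List[Candidate] = []
            for k in range(1, length):  # number of segments in the left span
                for p_left, left in table[k][start]:
                    for p_right, right in table[length - k][start + k]:
                        # Equation 4.8: Pr(N) = Pr_Merge(L, R) * Pr(L) * Pr(R)
                        p = merge_prob(left, right) * p_left * p_right
                        candidates.append((p, (left, right)))
            # Sort on probability only, so trees are never compared with each other.
            candidates.sort(key=lambda c: c[0], reverse=True)
            table[length][start] = candidates[:beam]
    return table[n][0][0]


def toy_merge_prob(left: Tree, right: Tree) -> float:
    """Toy scorer that prefers the left-branching tree ((0, 1), 2)."""
    return 0.9 if (left, right) in {(0, 1), ((0, 1), 2)} else 0.1


best_prob, best_tree = cky2(3, toy_merge_prob)
print(best_tree)
```

Keeping two candidates per entry lets a locally second-best subtree survive when it leads to a better whole-tree probability later; setting `beam=1` reduces the sketch to the single-maximum behavior written in Algorithm 1.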

4.7 Model Variation

Variations of our RvNN framework are also tested for comparison. In the version shown in Figure 4.6, instead of running our CKY algorithm throughout the whole construction process from the segment level, CKY is adopted only after EDU detection. For EDU detection, we process the segment representations s₁, s₂, s₃, ..., s_n from left to right, using the merge scorer to judge whether to merge the next segment into the current EDU or to start a new EDU with it. The intention is to match how we construct our training instances, as described in Section 4.5. We abbreviate our original model as RvNN-CKY2 and this modified version as RvNN-CKY2+Seq-EDU.
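The left-to-right EDU detection can be sketched as a simple greedy loop. This is an illustrative sketch only: the callable `merge_prob`, the 0.5 decision threshold, and the use of strings in place of segment representations are assumptions, since the text does not specify how the merge score is turned into a decision.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: probability that `segment` continues the current EDU.
Scorer = Callable[[List[str], str], float]


def detect_edus(
    segments: Sequence[str],
    merge_prob: Scorer,
    threshold: float = 0.5,  # assumed decision threshold
) -> List[List[str]]:
    """Greedily group segments into EDUs while scanning from left to right."""
    edus: List[List[str]] = []
    for segment in segments:
        if edus and merge_prob(edus[-1], segment) >= threshold:
            edus[-1].append(segment)  # merge into the current EDU
        else:
            edus.append([segment])    # start a new EDU
    return edus


def toy_scorer(current_edu: List[str], segment: str) -> float:
    """Toy rule: segments marked with '+' continue the current EDU."""
    return 0.9 if segment.startswith("+") else 0.1


edus = detect_edus(["a", "+b", "c", "+d", "+e"], toy_scorer)
print(edus)
```

The resulting EDUs would then serve as the leaves over which CKY2 builds the discourse tree.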

Each segment representation s_i output by the LSTM layer contains information about the segment itself only. We therefore add a bi-LSTM layer between the original LSTM and the CKY part to integrate context information into each segment representation, as shown in Figure 4.7. In this version, s₁, s₂, s₃, ..., s_n are first fed into the bi-LSTM, and the outputs f₁, f₂, f₃, ..., f_n are then fed into the RvNN. We abbreviate this version of our model as RvNN-CKY2+bi-LSTM.

Figure 4.6: The RvNN-CKY2+Seq-EDU model. (Figure labels: segment level, left-to-right merge detecting EDU boundaries; RvNN.)

Figure 4.7: The RvNN-CKY2+bi-LSTM model.

In a further attempt to integrate syntactic information into our model while still avoiding the need for an external syntactic parser, we generalize our model to construct not only the discourse structure but also the syntactic structure of each text segment automatically. This variation of our model is shown in Figure 4.8. Instead of feeding the character embeddings eⁱ_j of the i-th segment into an LSTM layer to get the segment embedding s_i, the model processes eⁱ_j through another RvNN with the CKY2 algorithm, similar to the process of the original RvNN, resulting in a two-staged RvNN framework. We use the same training set in CDTB, and we obtain the gold syntactic parse of each article from CTB, which is the source corpus of CDTB. Note that we use this external resource beyond CDTB only during the training stage. After training, our model can construct the syntactic structure automatically. We abbreviate this version of our model as 2-Staged-RvNN-CKY2.

Figure 4.8: The 2-Staged-RvNN-CKY2 model. (Character level: the characters cⁱ_j of each segment pass through an embedding layer to give eⁱ_j, and an RvNN (CKY) over eⁱ_j yields the segment representations s₁, s₂, s₃. Segment/DU level: an RvNN (CKY) runs over s₁, s₂, s₃.)

In this version of our model, the RvNN processes the text from characters up to the whole paragraph through a deep tree structure, as shown in Figure 4.9. We therefore adopt a sampling mechanism to reduce the huge number of training instances. We sample the character-level training instances (the subtrees rooted at a red or blue node in the character-level area of Figure 4.9) so that their number equals the number of instances above the segment level. In addition, we do not merge a node above the segment level with a neighboring character-level node to construct an instance.
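The sampling step can be illustrated with a short sketch. It assumes uniform sampling without replacement and a fixed random seed for reproducibility; the text does not state how the character-level instances are drawn, so both choices, as well as the function name and the string instance labels, are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def balance_instances(
    char_level: Sequence[T],
    above_segment: Sequence[T],
    seed: int = 0,  # assumed: fixed seed so the sample is reproducible
) -> List[T]:
    """Down-sample character-level instances to the number of instances
    above the segment level, then return the combined training set."""
    rng = random.Random(seed)
    k = min(len(char_level), len(above_segment))
    sampled = rng.sample(list(char_level), k)  # uniform, without replacement
    return sampled + list(above_segment)


char_instances = [f"char-{i}" for i in range(100)]
upper_instances = [f"upper-{i}" for i in range(8)]
training_set = balance_instances(char_instances, upper_instances)
print(len(training_set))
```

With 100 character-level and 8 higher-level instances, the combined set holds 16 instances, half from each group.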

[Figure 4.9. Example text: 海关统计表明，("Customs statistics show that,"). Levels: DU level, segment level, character level. Legend: positive instance, negative instance, character.]
