Figure 4.5: Training Instances Example (legend: positive instance, negative instance, EDU).
To train the RvNN, we take as positive instances the tree nodes extracted from the discourse parse trees in CDTB. Figure 4.5 illustrates a parse tree constructed from five segments (the black nodes). The black nodes, black lines, and red nodes form the parse tree itself. Taking each red node as a root yields four subtrees, and these are the four positive instances we can derive from this parse tree for training. For negative instances, we take any two neighboring subtrees and merge them into a new tree; the new tree is regarded as a negative instance if it is inconsistent with the ground truth. As Figure 4.5 shows, there are four such trees, rooted at the blue nodes. The losses of the merge scorer, the sense classifier, and the center classifier, Lm, Ls, and Lc respectively, are measured with cross-entropy. When training on a negative instance, we do not need to evaluate the sense and center classifications, so only Lm applies. For a positive instance, in contrast, we sum the three losses and optimize them jointly. More formally, our loss function L is defined as:
\[
L =
\begin{cases}
L_m, & \text{if the instance is negative} \\
L_m + L_s + L_c, & \text{otherwise}
\end{cases}
\tag{4.7}
\]
We use stochastic gradient descent (SGD) with a learning rate of 0.1 for parameter optimization.
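The loss in Equation 4.7 can be sketched in a few lines of Python. This is a minimal illustration rather than our training code: the function names, the convention that merge label 1 means positive and 0 means negative, and the representation of each classifier output as a list of class probabilities are all assumptions made for the example.

```python
import math

def cross_entropy(probs, gold_index):
    """Cross-entropy for one prediction: -log of the probability of the gold class."""
    return -math.log(probs[gold_index])

def instance_loss(merge_probs, gold_merge,
                  sense_probs=None, gold_sense=None,
                  center_probs=None, gold_center=None):
    """Loss of Equation 4.7 for a single training instance.

    Assumption: gold_merge == 1 marks a positive instance and 0 a negative one.
    """
    l_m = cross_entropy(merge_probs, gold_merge)
    if gold_merge == 0:
        # Negative instance: sense and center are undefined, so only L_m is used.
        return l_m
    l_s = cross_entropy(sense_probs, gold_sense)
    l_c = cross_entropy(center_probs, gold_center)
    # Positive instance: the three losses are summed and optimized jointly.
    return l_m + l_s + l_c
```

For a negative instance only the merge scorer's output is needed, so the sense and center arguments can be left out.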
4.6 Parse Tree Construction
In the prediction stage, we construct the discourse parse tree based on the predictions made by Tree-LSTM. We modify the Cocke–Younger–Kasami (CKY) algorithm of Younger [1967] to maximize the probability of the whole parse tree. This CKY-like dynamic programming algorithm simulates the recursive parsing procedure, considering local and global information jointly. In each step of the dynamic programming procedure, we consider the combinations of two neighboring trees L and R, merge each pair into a new tree N, and keep the two Ns with the highest probability Pr(N) as candidates for future steps. Pr(N) is formulated as follows:
\[
\Pr(N) = \Pr_{\text{Merge}}(L, R) \times \Pr(L) \times \Pr(R)
\tag{4.8}
\]
Here, PrMerge(L, R) is the output of the merge scorer in our model. Since we always store the top two Ns in each entry of the dynamic programming table, we refer to our CKY-like algorithm as CKY2. See Algorithm 1 for details.
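As a short worked example of Equation 4.8, consider the tree N = ((a b) c) built over three segments a, b, and c. The base case is not stated above, so we assume here that a single segment has probability 1, i.e. Pr(a) = Pr(b) = Pr(c) = 1. Unrolling the recursion then gives:

\[
\begin{aligned}
\Pr(N) &= \Pr_{\text{Merge}}((a\,b), c) \times \Pr((a\,b)) \times \Pr(c) \\
       &= \Pr_{\text{Merge}}((a\,b), c) \times \Pr_{\text{Merge}}(a, b) \times \Pr(a) \times \Pr(b) \times \Pr(c) \\
       &= \Pr_{\text{Merge}}((a\,b), c) \times \Pr_{\text{Merge}}(a, b)
\end{aligned}
\]

Under this assumption, the probability of a tree is the product of the merge scores of all its internal nodes.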
Algorithm 1 Discourse Parse Tree Construction with Dynamic Programming
  Probs ← table[][]
  for level from 0 to n − 1 do                            ▷ The span range
    for col from 0 to n − level do                        ▷ The start column
      Probs[level][col] ← 0
      Candidates ← list[]
      for k from 1 to level do
        LeftTree ← GetTree(k − 1, col)                    ▷ Get the candidate tree for the left span
        RightTree ← GetTree(level − k, col + k)           ▷ Get the candidate tree for the right span
        NewTree, MergeProb ← Merge(LeftTree, RightTree)   ▷ Apply RvNN unit
        Prob ← MergeProb × Probs[k − 1][col] × Probs[level − k][col + k]
        Candidates.Append(NewTree, Prob)                  ▷ Record the candidate
      Tree, Prob ← Candidates.MaxProb()                   ▷ Get the maximum probability and the tree
      SaveTree(level, col, Tree)                          ▷ Save as the candidate tree
      Probs[level][col] ← Prob
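The following Python sketch illustrates the CKY2 idea in runnable form. It is not our implementation: the function `cky_top_k`, the `merge_prob` callback standing in for the merge scorer, the representation of trees as nested tuples, and the choice to give each single segment probability 1 are assumptions made for the example. Unlike the listing above, which records one tree per cell, the sketch keeps the top k = 2 candidates per span, as described in the text.

```python
import heapq

def cky_top_k(leaves, merge_prob, k=2):
    """CKY-like search keeping the k most probable trees for every span.

    leaves     -- list of segments (any hashable objects)
    merge_prob -- hypothetical scorer: merge_prob(left_tree, right_tree) -> float
    Returns a list of (probability, tree) pairs for the full span, best first.
    """
    n = len(leaves)
    # table[(start, end)] holds candidates for the span leaves[start:end].
    table = {}
    for i, leaf in enumerate(leaves):
        table[(i, i + 1)] = [(1.0, leaf)]  # assumption: a single segment has probability 1
    for length in range(2, n + 1):              # span length
        for start in range(0, n - length + 1):  # start column
            end = start + length
            candidates = []
            for split in range(start + 1, end):
                for p_left, left in table[(start, split)]:
                    for p_right, right in table[(split, end)]:
                        # Equation 4.8: Pr(N) = PrMerge(L, R) * Pr(L) * Pr(R)
                        p = merge_prob(left, right) * p_left * p_right
                        candidates.append((p, (left, right)))
            table[(start, end)] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return table[(0, n)]

# Toy scorer for three segments; the scores are made up for illustration.
toy_scores = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.4,
    (("a", "b"), "c"): 0.8,
    ("a", ("b", "c")): 0.5,
}
result = cky_top_k(["a", "b", "c"], lambda l, r: toy_scores[(l, r)])
best_prob, best_tree = result[0]  # best_tree is (("a", "b"), "c")
```

Keeping more than one candidate per span lets a locally weaker subtree survive if it leads to a more probable tree higher up.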
4.7 Model Variation
Variations of our RvNN framework are also tested for comparison. In the version shown in Figure 4.6, instead of running our CKY algorithm throughout the whole construction process from the segment level, CKY is only adopted after EDU detection. For EDU detection, we process the segment representations s1, s2, s3, ..., sn from left to right, judging with the merge scorer whether to merge the next segment into the current EDU or to treat it as the start of a new EDU. The intention is to match how we construct our training instances, as described in Section 4.5. We abbreviate our original model as RvNN-CKY2, and this modified version as RvNN-CKY2+Seq-EDU.
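The left-to-right EDU detection can be sketched as a single pass over the segments. This is an illustration under stated assumptions: `merge_prob` is a hypothetical stand-in for the merge scorer, the 0.5 decision threshold is chosen for the example only, and an EDU is represented simply as a list of its segments.

```python
def detect_edus(segments, merge_prob, threshold=0.5):
    """Greedy left-to-right grouping of segments into EDUs.

    merge_prob -- hypothetical scorer: merge_prob(current_edu, next_segment) -> float
    threshold  -- assumed cut-off; at or above it the next segment joins the current EDU
    """
    if not segments:
        return []
    edus = [[segments[0]]]
    for seg in segments[1:]:
        if merge_prob(edus[-1], seg) >= threshold:
            edus[-1].append(seg)  # the segment continues the current EDU
        else:
            edus.append([seg])    # the segment starts a new EDU
    return edus
```

The resulting EDUs would then serve as the leaves over which the CKY2 procedure builds the discourse tree.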
Each segment representation si output by the LSTM layer contains information about the segment itself only. We therefore add one bi-LSTM layer between the original LSTM and the CKY part to integrate context information into each segment representation, as shown in Figure 4.7. In this version, s1, s2, s3, ..., sn are first fed into the bi-LSTM, and the outputs f1, f2, f3, ..., fn are then fed into the RvNN. We abbreviate this version of our model as RvNN-CKY2+bi-LSTM.
Figure 4.6: The RvNN-CKY2+Seq-EDU model. (Panel labels: segment level, left-to-right merge detecting EDU boundaries; RvNN.)
Figure 4.7: The RvNN-CKY2+bi-LSTM model. (Panel label: segment level.)
In a further attempt to integrate syntactic information into our model while still avoiding the need for an external syntactic parser, we generalize our model to construct not only the discourse structure but also the syntactic structure of each text segment automatically. This variation of our model is shown in Figure 4.8. Instead of feeding the character embeddings eij of the i-th segment into an LSTM layer to get the segment embedding si, the model processes eij through another RvNN with the CKY2 algorithm, similar to the process of the original RvNN, resulting in a two-staged RvNN framework. We use the same training set in CDTB, and we obtain the gold syntactic parse of each article from CTB, which is the source corpus of CDTB. Note that this external resource, beyond CDTB, is used only during the training stage. After training, our model can construct the syntactic structure automatically. We abbreviate this version of our model as 2-Staged-RvNN-CKY2.
Figure 4.8: The 2-Staged-RvNN-CKY2 model. (Character level: the characters ci1, ci2, ci3, ... of each segment pass through an embedding layer to give ei1, ei2, ei3, ..., and an RvNN(CKY) over them yields si. Segment/DU level: an RvNN(CKY) over s1, s2, s3.)
In this version of our model, the RvNN proceeds from characters to the whole paragraph through a deep tree structure, as shown in Figure 4.9. We therefore adopt a sampling mechanism to reduce the huge number of training instances. We sample the character-level training instances (the subtrees rooted at a red or blue node in the character-level area of Figure 4.9) so that their number equals the number of instances above the segment level. Also, we do not merge a node above the segment level with a neighboring character-level node to construct an instance.
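The sampling step can be sketched as follows. The text above fixes only the target size, so uniform sampling without replacement is an assumption of this example, as are the function name and the `seed` parameter, which is included for reproducibility.

```python
import random

def sample_character_instances(char_instances, upper_instances, seed=None):
    """Downsample character-level instances to the number of higher-level instances.

    char_instances  -- training instances from the character-level area
    upper_instances -- training instances above the segment level
    Assumption: uniform sampling without replacement.
    """
    rng = random.Random(seed)
    # Never request more samples than there are character-level instances.
    target = min(len(char_instances), len(upper_instances))
    return rng.sample(list(char_instances), target)
```

The training set would then consist of the sampled character-level instances together with all instances above the segment level.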
(Figure labels: example text 海关统计表明, "Customs statistics show,"; DU level, segment level, character level; legend: positive instance, negative instance, character.)