Tree Transformer: Integrating Tree Structures into Self-Attention
Yau-Shian Wang Hung-Yi Lee Yun-Nung Chen National Taiwan University, Taipei, Taiwan
[email protected] [email protected] [email protected]
Abstract
Pre-training Transformer from large-scale raw texts and fine-tuning on the desired task have achieved state-of-the-art results on diverse NLP tasks. However, it is unclear what the learned attention captures. The attention com- puted by attention heads seems not to match human intuitions about hierarchical structures.
This paper proposes Tree Transformer, which adds an extra constraint to attention heads of the bidirectional Transformer encoder in order to encourage the attention heads to follow tree structures. The tree structures can be automat- ically induced from raw texts by our proposed
“Constituent Attention” module, which is sim- ply implemented by self-attention between two adjacent words. With the same training procedure identical to BERT, the experiments demonstrate the effectiveness of Tree Trans- former in terms of inducing tree structures, better language modeling, and further learning more explainable attention scores1.
1 Introduction
Human languages exhibit a rich hierarchical struc- ture which is currently not exploited nor mir- rored by the self-attention mechanism that is the core of the now popular Transformer archi- tecture. Prior work that integrated hierarchi- cal structure into neural networks either used re- cursive neural networks (Tree-RNNs) (C.Goller and A.Kuchler, 1996; Socher et al., 2011; Tai et al., 2015) or simultaneously generated a syn- tax tree and language in RNN (Dyer et al., 2016), which have shown beneficial for many downstream tasks (Aharoni and Goldberg, 2017;
Eriguchi et al., 2017;Strubell et al., 2018;Zare- moodi and Haffari, 2018). Considering the re- quirement of the annotated parse trees and the
1The source code is publicly available at https://
github.com/yaushian/Tree-Transformer.
costly annotation effort, most prior work relied on the supervised syntactic parser. However, a su- pervised parser may be unavailable when the lan- guage is low-resourced or the target data has dif- ferent distribution from the source domain.
Therefore, the task of learning latent tree struc- tures without human-annotated data, called gram- mar induction(Carroll and Charniak,1992;Klein and Manning,2002;Smith and Eisner,2005), has become an important problem and attractd more attention from researchers recently. Prior work mainly focused on inducing tree structures from recurrent neural networks (Shen et al.,2018a,b) or recursive neural networks (Yogatama et al.,2017;
Drozdov et al.,2019), while integrating tree struc- tures into Transformer remains an unexplored di- rection.
Pre-training Transformer from large-scale raw texts successfully learns high-quality language representations. By further fine-tuning pre-trained Transformer on desired tasks, wide range of NLP tasks obtain the state-of-the-art results (Radford et al., 2019; Devlin et al., 2018; Dong et al., 2019). However, what pre-trained Transformer self-attention heads capture remains unknown. Al- though an attention can be easily explained by ob- serving how words attend to each other, only some distinct patterns such as attending previous words or named entities can be found informative (Vig, 2019). The attention matrices do not match our intuitions about hierarchical structures.
In order to make the attention learned by Transformer more interpretable and allow Trans- former to comprehend language hierarchically, we propose Tree Transformer, which integrates tree structures into bidirectional Transformer encoder.
At each layer, words are constrained to attend to other words in the same constituents. This con- straint has been proven to be effective in prior work (Wu et al., 2018). Different from the prior
work that required a supervised parser, in Tree Transformer, the constituency tree structures is au- tomatically induced from raw texts by our pro- posed “Constituent Attention” module, which is simply implemented by self-attention. Motivated by Tree-RNNs, which compose each phrase and the sentence representation from its constituent sub-phrases, Tree Transformer gradually attaches several smaller constituents into larger ones from lower layers to higher layers.
The contributions of this paper are 3-fold:
• Our proposed Tree Transformer is easy to implement, which simply inserts an addi- tional “Constituent Attention” module im- plemented by self-attention to the original Transformer encoder, and achieves good per- formance on the unsupervised parsing task.
• As the induced tree structures guide words to compose the meaning of longer phrases hierarchically, Tree Transformer improves the perplexity on masked language modeling compared to the original Transformer.
• The behavior of attention heads learned by Tree Transformer expresses better inter- pretability, because they are constrained to follow the induced tree structures. By visu- alizing the self-attention matrices, our model provides the information that better matchs the human intuition about hierarchical struc- tures than the original Transformer.
2 Related Work
This section reviews the recent progress about grammar induction. Grammar induction is the task of inducing latent tree structures from raw texts without human-annotated data. The models for grammar induction are usually trained on other target tasks such as language modeling. To obtain better performance on the target tasks, the mod- els have to induce reasonable tree structures and utilize the induced tree structures to guide text en- coding in a hierarchical order. One prior attempt formulated this problem as a reinforcement learn- ing (RL) problem (Yogatama et al.,2017), where the unsupervised parser is an actor in RL and the parsing operations are regarded as its actions. The actor manages to maximize total rewards, which are the performance of downstream tasks.
PRPN (Shen et al.,2018a) and On-LSTM (Shen et al.,2018b) induce tree structures by introducing
a bias to recurrent neural networks. PRPN pro- poses a parsing network to compute the syntactic distance of all word pairs, and a reading network utilizes the syntactic structure to attend relevant memories. On-LSTM allows hidden neurons to learn long-term or short-term information by the proposed new gating mechanism and new activa- tion function. In URNNG (Kim et al.,2019b), they applied amortized variational inference between a recurrent neural network grammar (RNNG) (Dyer et al.,2016) decoder and a tree structures inference network, which encourages the decoder to gen- erate reasonable tree structures. DIORA (Droz- dov et al.,2019) proposed using inside-outside dy- namic programming to compose latent represen- tations from all possible binary trees. The repre- sentations of inside and outside passes from same sentences are optimized to be close to each other.
Compound PCFG (Kim et al., 2019a) achieves grammar induction by maximizing the marginal likelihood of the sentences which are generated by a probabilistic context-free grammar (PCFG) in a corpus.
3 Tree Transformer
Given a sentence as input, Tree Transformer in- duces a tree structure. A 3-layer Tree Trans- former is illustrated in Figure 1(A). The build- ing blocks of Tree Transformer is shown in Fig- ure 1(B), which is the same as those used in bidirectional Transformer encoder, except the pro- posed Constituent Attention module. The blocks in Figure 1(A) are constituents induced from the input sentence. The red arrows indicate the self- attention. The words in different constituents are constrained to not attend to each other. In the 0- th layer, some neighboring words are merged into constituents; for example, given the sentence “the cute dog is wagging its tail”, the tree Transformer automatically determines that “cute” and “dog”
form a constituent, while “its” and “tail” also form one. The two neighboring constituents may merge together in the next layer, so the sizes of con- stituents gradually grow from layer to layer. In the top layer, the layer 2, all words are grouped into the same constituent. Because all words are into the same constituent, the attention heads freely at- tend to any other words, in this layer, Tree Trans- former behaves the same as the typical Trans- former encoder. Tree Transformer can be trained in an end-to-end fashion by using “masked LM”,
Layer 0 Layer 1 Layer 2
the cute dog is wagging its tail
Multi-Head Attention Add & Norm
Feed Forward Add & Norm
Constituent Attention Constituent
Priors
the cute
dog is wagging its tail
the cute dog is wagging its tail
(A) (B) (C)
Figure 1: (A) A 3-layer Tree Transformer, where the blocks are constituents induced from the input sentence. The two neighboring constituents may merge together in the next layer, so the sizes of constituents gradually grow from layer to layer. The red arrows indicate the self-attention. (B) The building blocks of Tree Transformer. (C) Constituent prior C for the layer 1.
which is one of the unsupervised training task used for BERT training.
Whether two words belonging to the same con- stituent is determined by “Constituent Prior” that guides the self-attention. Constituent Prior is de- tailed in Section 4, which is computed by the proposed Constituent Attention module in Sec- tion5. By using BERT masked language model as training, latent tree structures emerge from Con- stituent Prior and unsupervised parsing is thereby achieved. The method for extracting the con- stituency parse trees from Tree Transformer is de- scribed in Section6.
4 Constituent Prior
In each layer of Transformer, there are a query ma- trix Q consisting of query vectors with dimension dk and a key matrix K consisting of key vectors with dimension dk. The attention probability ma- trix is denoted as E, which is an N by N matrix, where N is the number of words in an input sen- tence. Ei,j is the probability that the position i attends to the position j. The Scaled Dot-Product Attention computes the E as:
E = softmax(QKT
d ), (1)
where the dot-product is scaled by 1/d. In Trans- former, the scaling factor d is set to be√
dk. In Tree Transformer, the E is not only deter- mined by the query matrix Q and key matrix K, but also guided by Constituent Prior C generat- ing from Constituent Attention module. Same as E, the constituent prior C is also a N by N ma- trix, where Ci,jis the probability that word wiand
word wj belong to the same constituency. This matrix is symmetric that Ci,jis same as Cj,i. Each layer has its own Constituent Prior C. An exam- ple of Constituent Prior C is illustrated in Fig- ure1(C), which indicates that in layer 1, “the cute dog” and “is wagging its tail” are two constituents.
To make each position not attend to the position in different constituents, Tree Transformer con- strains the attention probability matrix E by con- stituent prior C as below,
E = C softmax(QKT
d ), (2)
where is the element-wise multiplication.
Therefore, if Ci,j has small value, it indicates that the positions i and j belong to different con- stituents, where the attention weight Ei,jwould be small. As Transformer uses multi-head attention with h different heads, there are h different query matrices Q and key matrices K at each position, but here in the same layer, all attention heads in multi-head attention share the same C. The multi- head attention module produces the output of di- mension dmodel= h × dk.
5 Constituent Attention
The proposed Constituent Attention module is to generate the constituent prior C. Instead of directly generating C, we decompose the prob- lem into estimating the breakpoints between con- stituents, or the probability that two adjacent words belong to the same constituent. In each layer, the Constituent Attention module generates a sequence a = {a1, ..., ai, ..., aN}, where ai is the probability that the word wi and its neighbor
is wagging its tail The
𝑃1,2
0.1 0.9
𝑃2,1 𝑃2,3
0.8 0.2
𝑃1,0
cute dog
Figure 2: The example illustration about how neigh- boring attention works.
word wi+1are in the same constituent. The small value of ai implies that there is a breakpoint be- tween wi and wi+1, so the constituent prior C is obtained from the sequence a as follows. Ci,j is the multiplication of all ai6k<j between word wi and word wj:
Ci,j =
j−1
Y
k=i
ak. (3)
In (3), we choose to use multiplication instead of summation, because if one of ai6k<jbetween two words wi and wj is small, the value of Ci,j with multiplication also becomes small. In implemen- tation, to avoid probability vanishing, we use log- sum instead of directly multiplying all a:
Ci,j = ePj−1k=ilog(ak). (4) The sequence a is obtained based on the follow- ing two mechanisms: Neighboring Attention and Hierarchical Constraint.
5.1 Neighboring Attention
We compute the score si,i+1 indicating that wi
links to wi+1by scaled dot-product attention:
si,i+1 = qi· ki+1
d , (5)
where qi is a link query vector of wi with dmodel
dimensions, and ki+1is a link key vector of wi+1
with dmodeldimensions. We use qi· ki+1to rep- resent the tendency that wiand wi+1belong to the same constituent. Here, we set the scaling factor d to be dmodel2 . The query and key vectors in (5) are different from (1). They are computed by the same network architecture, but with different sets of network parameters.
For each word, we constrain it to either link to its right neighbor or left neighbor as illustrated in Figure 2. This constraint is implemented by ap- plying a softmax function to two attention links of wi:
pi,i+1, pi,i−1= softmax(si,i+1, si,i−1), (6)
where pi,i+1 is the probability that wi attends to wi+1, and (pi,i+1+ pi,i−1) = 1. We find that with- out the constraint of the softmax operation in (6) the model prefers to link all words together and assign all words to the same constituency. That is, giving both si,i+1and si,i−1large values, so the at- tention head freely attends to any position without restriction of constituent prior, which is the same as the original Transformer. Therefore, the soft- max function is to constraint the attention to be sparse.
As pi,i+1 and pi+1,i may have different values, we average its two attention links:
ˆ
ai=ppi,i+1× pi+1,i. (7) The ˆailinks two adjacent words only if two words attend to each other. ˆai is used in the next subsec- tion to obtain ai.
5.2 Hierarchical Constraint
As mentioned in Section 3, constituents in the lower layer merge into larger one in the higher layer. That is, once two words belong to the same constituent in the lower layer, they would still be- long to the same constituent in the higher layer. To apply the hierarchical constraint to the tree Trans- former, we restrict alkto be always larger than al−1k for the layer l and word index k. Hence, at the layer l, the link probability alkis set as:
alk= al−1k + (1 − al−1k )ˆalk, (8) where al−1k is the link probability from the previ- ous layer l − 1, and ˆalkis obtained from Neighbor- ing Attention (Section5.1) of the current layer l.
Finally, at the layer l, we apply (4) for computing Clfrom al. Initially, different words are regarded as different constituents, and thus we initialize a−1k as zero.
6 Unsupervised Parsing from Tree Transformer
After training, the neighbor link probability a can be used for unsupervised parsing. The small value of a suggests this link be the breakpoint of two constituents. By top-down greedy parsing (Shen et al.,2018a), which recursively splits the sentence into two constituents with minimum a, a parse tree can be formed.
However, because each layer has a set of al, we have to decide to use which layer for parsing. In- stead of using a from a specific layer for parsing
Algorithm 1 Unsupervised Parsing with Multiple Layers
1: a ← link probabilities
2: m ← minimum layer id . Discard the a from layers below minimum layer
3: thres ← 0.8 . Threshold of breakpoint
4: procedure BUILDTREE(l, s, e) . l: layer index, s: start index, e: end index
5: if e − s < 2 then. The constituent cannot be split
6: return (s, e)
7: span ← als≤i<e
8: b ← argmin(span) . Get breakpoint
9: last ← max(l − 1, m) . Get index of last layer
10: if alb > thres then
11: if l = m then
12: return (s, e)
13: return BuildTree(last, s, e)
14: tree1 ← BuildTree(last, s, b)
15: tree2 ← BuildTree(last, b + 1, e)
16: return (tree1, tree2) . Return tree
(Shen et al.,2018b), we propose a new parsing al- gorithm, which utilizes a from all layers for un- supervised parsing. As mentioned in Section5.2, the values of a are strictly increasing, which indi- cates that a directly learns the hierarchical struc- tures from layer to layer. Algorithm1details how we utilize hierarchical information of a for unsu- pervised parsing.
The unsupervised parsing starts from the top layer, and recursively moves down to the last layer after finding a breakpoint until reaching the bot- tom layer m. The bottom layer m is a hyperparam- eter needed to be tuned, and is usually set to 2 or 3. We discard a from layers below m, because we find the lowest few layers do not learn good rep- resentations (Liu et al.,2019) and thus the parsing results are poor (Shen et al.,2018b). All values of a on top few layers are very close to 1, suggesting that those are not good breakpoints. Therefore, we set a threshold for deciding a breakpoint, where a minimum a will be viewed as a valid breakpoint only if its value is below the threshold. As we find that our model is not very sensitive to the threshold value, we set it to be 0.8 for all experiments.
7 Experiments
In order to evaluate the performance of our pro- posed model, we conduct the experiments detailed below.
7.1 Model Architecture
Our model is built upon a bidirectional Trans- former encoder. The implementation of our Trans- former encoder is identical to the original Trans- former encoder. For all experiments, we set the hidden size dmodel of Constituent Attention and Transformer as 512, the number of self-attention heads h as 8, the feed-forward size as 2048 and the dropout rate as 0.1. We analyze and discuss the sensitivity of the number of layers, denoted as L, in the following experiments.
7.2 Grammar Induction
In this section, we evaluate the performance of our model on unsupervised constituency parsing.
We train our model on WSJ training set and WSJ- all (i.e. including testing and validation sets) and set the vocabulary size as 16k. We choose BERT Masked LM (Devlin et al.,2018) as our unsuper- vised training task. Our best result is optimized by adamwith a learning rate of 0.0001, β1 = 0.9 and β2 = 0.98. Following the evaluation settings of prior work (Htut et al.,2018;Shen et al.,2018b)2, we evaluate F1 scores of our model on WSJ-test and WSJ-10 of Penn Treebank (PTB) (Marcus et al., 1993). The WSJ-10 has 7422 sentences from whole PTB with sentence length restricted to 10 after punctuation removal, while WSJ-test has 2416 sentences from the PTB testing set with unrestricted sentence length.
The results on WSJ-test are in Table 1. We mainly compare our model to PRPN (Shen et al., 2018a), On-lstm (Shen et al., 2018b) and Com- pound PCFG(C-PCFG) (Kim et al., 2019a), in which the evaluation settings and the training data are identical to our model. DIORA (Drozdov et al.,2019) and URNNG (Kim et al.,2019b) use a relative larger training data and the evaluation set- tings are slightly different from our model. Our model performs much better than trivial trees (i.e.
right and left-branching trees) and random trees, which suggests that our proposed model success- fully learns meaningful trees. We also find that
2https://github.com/yikangshen/
Ordered-Neurons
Model Data F1median F1max
PRPN WSJ-train 35.0 42.8
On-lstm WSJ-train 47.7 49.4
C-PCFG WSJ-train 55.2 60.1
Tree-T,L=12 WSJ-train 48.4 50.2 Tree-T,L=10 WSJ-train 49.5 51.1 Tree-T,L=8 WSJ-train 48.3 49.6 Tree-T,L=6 WSJ-train 47.4 48.8
DIORA NLI 55.7 56.2
URNNG Billion - 52.4
Tree-T,L=10 WSJ-all 50.5 52.0
Random - 21.6 21.8
LB - 9.0 9.0
RB - 39.8 39.8
Table 1: The F1 scores on WSJ-test. Tree Trans- former is abbreviated as Tree-T, and L is the num- ber of layers(blocks). DIORA is trained on multi-NLI dataset (Williams et al.,2018). URNNG is trained on the subset of one billion words (Chelba et al., 2013) with 1M training data. LB and RB are the left and right-brancing baselines.
Figure 3: A parse tree induced by Tree Transformer.
As shown in the figure, because we set a threshold in Algorithm1, the leaf nodes are not strictly binary.
increasing the layer number results in better per- formance, because it allows the Tree Transformer to model deeper trees. However, the performance stops growing when the depth is above 10. The words in the layers above the certain layer are all grouped into the same constituent, and there- fore increasing the layer number will no longer help model discover useful tree structures. In Ta- ble2, we report the results on WSJ-10. Some of the baselines including CCM (Klein and Manning, 2002), DMV+CCM (Klein and Manning, 2005) and UML-DOP (Bod,2006) are not directly com- parable to our model, because they are trained us- ing POS tags our model does not consider.
In addition, we further investigate what kinds of trees are induced by our model. Following URNNG, we evaluate the performance of con-
Model Data F1median F1max
PRPN WSJ-train 70.5 71.3
On-lstm WSJ-train 65.1 66.8
C-PCFG WSJ-train 70.5 -
Tree-T,L=10 WSJ-train 66.2 67.9
DIORA NLI 67.7 68.5
Tree-T,L=10 WSJ-all 66.2 68.0
CCM WSJ-10 - 71.9
DMV+CCM WSJ-10 - 77.6
UML-DOP WSJ-10 - 82.9
Random - 31.9 32.6
LB - 19.6 19.6
RB - 56.6 56.6
Table 2: The F1 scores on WSJ-10. Tree Transformer is abbreviated as Tree-T, and L is the number of layers (blocks).
Label Tree-T URNNG PRPN
NP 67.6 39.5 63.9
VP 38.5 76.6 27.3
PP 52.3 55.8 55.1
ADJP 24.7 33.9 42.5
SBAR 36.4 74.8 28.9
ADVP 55.1 50.4 45.1
Table 3: Recall of constituents by labels. The re- sults of URNNG and PRPN are taken fromKim et al.
(2019b).
stituents by its label in Table 3. The trees in- duced by different methods are quite different.
Our model is inclined to discover noun phrases (NP) and adverb phrases (ADVP), but not easy to discover verb phrases (VP) or adjective phrases (ADJP). We show an induced parse tree in Fig- ure8 and more induced parse trees can be found in Appendix.
7.3 Analysis of Induced Structures
In this section, we study whether Tree Trans- former learns hierarchical structures from layers to layers. First, we analyze the influence of the hy- perparameter minimum layer m in Algorithm. 1 given the model trained on WSJ-all in Table 1.
As illustrated in Figure 4(a), setting m to be 3 yields the best performance. Prior work discov- ered that the representations from the lower lay- ers of Transformer are not informative (Liu et al., 2019). Therefore, using syntactic structures from lower layers decreases the quality of parse trees.
On the other hand, most syntactic information is
15 20 25 30 35 40 45 50 55
0 1 2 3 4 5 6 7 8 9
F1
Minimum Layer m
(a) F1 in terms of different minimum layers.
15 20 25 30 35 40 45 50 55
0 1 2 3 4 5 6 7 8 9
F1
Specific Layer for Parsing (b) F1 of parsing via a specific layer.
Figure 4: Performance of unsupervised parsing.
missing when a from top few layers are close to 1, so too large m also decreases the performance.
To further analyze which layer contains richer information of syntactic structures, we evaluate the performance on obtaining parse trees from a specific layer. We use al from the layer l for parsing with the top-down greedy parsing algo- rithm (Shen et al., 2018a). As shown in Fig- ure 4(b), using a3 from the layer 3 for parsing yields the best F1 score, which is 49.07. The re- sult is consistent to the best value of m. However, compared to our best result (52.0) obtained by Al- gorithm 1, the F1-score decreases by 3 (52.0 → 49.07). This demonstrates the effectiveness of Tree Transformer in terms of learning hierarchi- cal structures. The higher layers indeed capture the higher-level syntactic structures such as clause patterns.
7.4 Interpretable Self-Attention
This section discusses whether the attention heads in Tree Transformer learn hierarchical structures.
Considering that the most straightforward way of interpreting what attention heads learn is to visu- alize the attention scores, we plot the heat maps of Constituent Attention prior C from each layer in Figure5.
In the heat map of constituent prior from first layer (Figure 5(a)), as the size of constituent is small, the words only attend to its adjacent words.
We can observe that the model captures some sub- phrase structures, such as the noun phrase “delta air line” or “american airlines unit”. In Fig- ure5(b)-(d), the constituents attach to each other and become larger. In the layer 6, the words from
“involved” to last word “lines” form a high-level
Model L Parameters Perplexity
Transformer 8 40M 48.8
Transformer 10 46M 48.5
Transformer 10-B 67M 49.2
Transformer 12 52M 48.1
Tree-T 8 44M 46.1
Tree-T 10 51M 45.7
Tree-T 12 58M 45.6
Table 4: The perplexity of masked words. We denote the number of layers as L. In Transformer L = 10 − B, the increased hidden size results in more parameters.
adjective phrase (ADJP). In the layer 9, all words are grouped into a large constituent except the first word “but”. By visualizing the heat maps of the constituent prior from each layer, we can easily know what types of syntactic structures are learned in each layer. The parse tree of this exam- ple can be found in Figure8of Appendix. We also visualize the heat maps of self-attention from the original Transformer layers and one from the Tree Transformer layers in Appendix A. As the self- attention heads from our model are constrained by the constituent prior, compared to the original Transformer, we can discover hierarchical struc- tures more easily.
Those attention heat maps demonstrate that: (1) the size of constituents gradually grows from layer to layer, and (2) at each layer, the attention heads tend to attend to other words within constituents posited in that layer. Those two evidences support the success of the proposed Tree Transformer in terms of learning tree-like structures.
(a) The layer 0. (b) The layer 3.
(c) The layer 6. (d) The layer 9.
Figure 5: The constituent prior heat maps.
7.5 Masked Language Modeling
To investigate the capability of Tree Transformer in terms of capturing abstract concepts and syn- tactic knowledge, we evaluate the performance on language modeling. As our model is a bidirec- tional encoder, in which the model can see its subsequent words, we cannot evaluate the lan- guage model in a left-to-right manner. We evaluate the performance on masked language modeling by measuring the perplexity on masked words3. To perform the inference without randomness, for each sentence in the testing set, we mask all words in the sentence, but not at once. In each masked testing data, only one word is replaced with a
“[MASK]” token. Therefore, each sentence cre- ates the number of testing samples equal to its
3The perplexity of masked words is e
−P log(p)
nmask , where p is the probability of correct masked word to be predicted and nmaskis the total number of masked words.
length.
In Table4, the models are trained on WSJ-train with BERT masked LM and evaluated on WSJ- test. All hyperparameters except the number of layers in Tree Transformer and Transformer are set to be the same and optimized by the same opti- mizer. We use adam as our optimizer with learn- ing rate of 0.0001, β1 = 0.9 and β2 = 0.999. Our proposed Constituent Attention module increases about 10% hyperparameters to the original Trans- former encoder and the computational speed is 1.2 times slower. The results with best performance on validation set are reported. Compared to the original Transformer, Tree Transformer achieves better performance on masked language modeling.
As the performance gain is possibly due to more parameters, we adjust the number of layers or in- crease the number of hidden layers in Transformer L = 10 − B. Even with fewer parameters than Transformer, Tree Transformer still performs bet-
ter.
The performance gain is because the induced tree structures guide the self-attention processes language in a more straightforward and human- like manner, and thus the knowledge can be bet- ter generalized from training data to testing data.
Also, Tree Transformer acquires positional infor- mation not only from positional encoding but also from the induced tree structures, where the words attend to other words from near to distant (lower layers to higher layers)4.
7.6 Limitations and Discussion
It is worth mentioning that we have tried to initial- ize our Transformer model with pre-trained BERT, and then fine-tuning on WSJ-train. However, in this setting, even when the training loss becomes lower than the loss of training from scratch, the parsing result is still far from our best results. This suggests that the attention heads in pre-trained BERT learn quite different structures from the tree-like structures in Tree Transformer. In ad- dition, with a well-trained Transformer, it is not necessary for the Constituency Attention module to induce reasonable tree structures, because the training loss decreases anyway.
8 Conclusion
This paper proposes Tree Transformer, a first at- tempt of integrating tree structures into Trans- former by constraining the attention heads to at- tend within constituents. The tree structures are automatically induced from the raw texts by our proposed Constituent Attention module, which attaches the constituents to each other by self- attention. The performance on unsupervised pars- ing demonstrates the effectiveness of our model in terms of inducing tree structures coherent to hu- man expert annotations. We believe that incorpo- rating tree structures into Transformer is an im- portant and worth exploring direction, because it allows Transformer to learn more interpretable at- tention heads and achieve better language model- ing. The interpretable attention can better explain how the model processes the natural language and guide the future improvement.
4We do not remove the positional encoding from the Transformer and we find that without positional encoding the quality of induced parse trees drops.
Acknowledgements
We would like to thank reviewers for their insight- ful comments. This work was financially sup- ported from the Young Scholar Fellowship Pro- gram by Ministry of Science and Technology (MOST) in Taiwan, under Grant 108-2636-E002- 003.
References
Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Pro- ceedings of the 55th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 2: Short Papers).
Rens Bod. 2006. An all-subtrees approach to unsuper- vised parsing. In Proceedings of the 21st Interna- tional Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Com- putational Linguistics.
Glenn Carroll and Eugene Charniak. 1992. Two exper- iments on learning probabilistic dependency gram- mars from corpora. In WORKING NOTES OF THE WORKSHOP STATISTICALLY-BASED NLP TECH- NIQUES.
C.Goller and A.Kuchler. 1996. Learning task- dependent distributed representations by backpropa- gation through structure. In Proceedings of Interna- tional Conference on Neural Networks (ICNN’96).
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robin- son. 2013. One billion word benchmark for measur- ing progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understand- ing. arXiv preprint arXiv:1810.04805.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
Andrew Drozdov, Pat Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recur- sive autoencoders. In North American Association for Computational Linguistics.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proc. of NAACL.
Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers).
Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman.
2018. Grammar induction with neural language models: An unusual replication. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyz- ing and Interpreting Neural Networks for NLP. As- sociation for Computational Linguistics.
Yoon Kim, Chris Dyer, and Alexander Rush. 2019a.
Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th An- nual Meeting of the Association for Computational Linguistics, pages 2369–2385.
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and G´abor Melis. 2019b.
Unsupervised recurrent neural network grammars.
arXiv preprint arXiv:1904.03746.
Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of 40th Annual Meeting of the Association for Computational Lin- guistics.
Dan Klein and Christopher D. Manning. 2005. Nat- ural language grammar induction with a generative constituent-context model. Pattern Recogn.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Lin- guistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: The penn treebank. Comput. Lin- guist.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In Interna- tional Conference on Learning Representations.
Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018b. Ordered neurons: Integrat- ing tree structures into recurrent neural networks.
arXiv preprint arXiv:1810.09536.
Noah A. Smith and Jason Eisner. 2005. Guiding un- supervised grammar induction using contrastive es- timation. In In Proc. of IJCAI Workshop on Gram- matical Inference Applications.
Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natu- ral scenes and natural language with recursive neu- ral networks. In Proceedings of the 28th Interna- tional Conference on International Conference on Machine Learning.
Emma Strubell, Patrick Verga, Daniel Andor, David I Weiss, and Andrew McCallum. 2018.
Linguistically-informed self-attention for semantic role labeling. In EMNLP.
Kai Sheng Tai, Richard Socher, and Christopher D.
Manning. 2015. Improved semantic representations from tree-structured long short-term memory net- works. arXiv preprint arXiv:1503.00075.
Jesse Vig. 2019. Visualizing attention in transformer- based language representation models. arXiv preprint arXiv:1904.02679.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen- tence understanding through inference. In Proceed- ings of the 2018 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 (Long Papers).
Wei Wu, Houfeng Wang, Tianyu Liu, and Shuming Ma. 2018. Phrase-level self-attention networks for universal sentence encoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3729–3738.
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to compose words into sentences with reinforcement learning. In International Conference on Learning Representations.
Poorya Zaremoodi and Gholamreza Haffari. 2018. In- corporating syntactic uncertainty in neural machine translation with a forest-to-sequence model. In Pro- ceedings of the 27th International Conference on Computational Linguistics.
A Visualization of Self-Attention
In this section, we plot the heat maps of attention probability E from the original Transformer and one from Tree Transformer. In order to investi- gate whether models learn layer-wise tree struc- tures, we average the scores of the attention heads within the same layer.
In the heat map from first layer of Tree Trans- former (Figure6(a)), as the size of constituent in the layer is small, the words only attend to its adja- cent words. We also observe that the words within the same phrase attend to each other, such as the phrasal verb “involve in” and the noun phrase
“delta air line”. With the constituents attaching to each other and becoming larger, in Figure6(b)-(d), the words attend to more distant words.
On the other hard, we average the scores of attention heads from each layer of the original Transformer and visualize them. In the attention heat maps of the original Transformer, we also ob- serve that it learns hierarchical structures in the first layer (Figure7(a)) that the words mostly at- tend to its neighboring words. However, this phe- nomenon is only observed in the first layer that in the third layer (Figure 7(b)), the attention heads attend to other words at will.
(a) The layer 0. (b) The layer 3.
(c) The layer 6. (d) The layer 9.
Figure 6: The attention heat map from the Tree Transformer.
(a) The layer 0. (b) The layer 3.
Figure 7: The attention heat map from the original Transformer.
Figure 8: The parse trees induced by the proposed Tree Transformer.
Figure 9: The parse trees induced by the proposed Tree Transformer.