What Does This Word Mean?
Explaining Contextualized Embeddings with Natural Language Definition
Ting-Yun Chang Yun-Nung Chen
Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
Abstract
Contextualized word embeddings have boosted many NLP tasks compared with traditional static word embeddings. However, a word with a specific sense may have different contextualized embeddings due to its various contexts. To further investigate what contextualized word embeddings capture, this paper analyzes whether they can indicate the corresponding sense definitions and proposes a general framework that is capable of explaining word meanings given contextualized word embeddings for better interpretation.
The experiments show that both ELMo and BERT embeddings can be well interpreted via a readable textual form, and the findings may benefit the research community for a better understanding of what the embeddings capture.¹
1 Introduction
Contextualized word embeddings, such as ELMo, BERT, OpenAI GPT, and GPT-2 (Peters et al., 2018; Devlin et al., 2018; Radford et al., a,b), have been shown to yield richer representations of meaning and to boost many NLP tasks. To understand what contextualized word embeddings capture, Schuster et al. (2019) recently visualized the representations of ELMo and showed that 1) embeddings of the same word in different contexts can form a cluster, and 2) when a word has multiple senses, the embeddings can be separated into multiple distinct groups, one for each meaning.
To further investigate the meanings contextualized word embeddings indicate, this paper focuses on analyzing whether a contextualized embedding is sense-informative enough to indicate the corresponding sense definition given a (target word, context) pair. We train and evaluate our model on the online Oxford dictionary dataset released by Chang et al. (2018).
¹ The source code and the trained models are publicly available at https://github.com/MiuLab/GenDef.
To analyze if the embeddings are sense-informative, our work focuses on learning a mapping between the semantic space of contextualized word embeddings and the space of definition embeddings. Specifically, a better mapping indicates richer sense-specific cues in the contextualized word embedding.
Different from the definition modeling in prior work (Noraset et al., 2017; Gadetsky et al., 2018; Chang et al., 2018), we reformulate the task from natural language generation (NLG) to classification, i.e., selecting the most reasonable definition according to the target word and its contexts. As recent work has shown great success in encoding lexical resources into consistent representations (Tissier et al., 2017; Bahdanau et al., 2017; Bosc and Vincent, 2018), in this paper we leverage a pretrained sentence encoder (Cer et al., 2018) to project all definitions in the Oxford dictionary into a consistent embedding space. This supports our reformulation, which requires learning a mapping from the contextualized word embedding space to the definition embedding space. We can therefore avoid some predicaments of NLG, such as the difficulty of generating fluent sequences, the exposure bias problem (Ranzato et al., 2015), and the difficulty of evaluation (Stent et al., 2005).
2 Methodology
The goal of this paper is to analyze whether we can distill sense-specific information from the pretrained contextualized word embeddings such as ELMo and BERT for better interpretation. Specifically, given the embedding of a (word, context) pair, our model learns a non-linear mapping network f : X → Y to project it into the desired definition space. Note that the mapping is many-to-one intrinsically because there exist some examples sharing the same definition, and their target words or contexts differ. This paper assumes that contextualized word embeddings can be easily translated into their corresponding definitions if their semantics is well captured in the representations. To validate the assumption, the following models are proposed for mapping the embeddings.
[Figure 1 diagram. Example input: "Hannah took off into a fairyland of snowy pines.", with the target word split into the word pieces fairy, ##land, <PAD>. Labels in the diagram: Pre-Trained BERT; Context-Dependent Embedding Transformation Network (conv1d, Max-Pooling, Weighted Sum); Target Word Embedding (Context-Independent); Concat; MLP; y′; Definition Embeddings; Top 3 NNs (1st Def.: a beautiful place; 2nd Def.: an imagined ideal place; 3rd Def.: a holy place).]
Figure 1: Illustration of the proposed model using contextualized embeddings from BERT as the context-dependent component. We use the one-dimensional convolution with its kernel sizes being 1 and 3.
2.1 Model Objective
The whole framework of our proposed model is illustrated in Figure 1. Our goal is to learn the mapping that transforms contextualized word embeddings into their corresponding definition, which is to solve:
$$f^{*} = \arg\min_{f} \lVert f(x) - y \rVert_{2}, \tag{1}$$
where x = (u, v) ∈ X consists of the target word embedding u, which is context-independent and serves as the target word identity, and a context-dependent component v, which could be either the context embedding or the contextualized target word embedding; y ∈ Y is the embedding of the corresponding definition. Both the definition embedding and the context embedding are encoded by the pretrained Transformer-based universal sentence encoder (Cer et al., 2018), and we utilize different types of contextualized word embeddings, such as ELMo, BERT-base, and BERT-large, by swapping the context-dependent component shown in the left part of Figure 1.
Note that although we model this task as a classification problem, we do not train a classifier that regards all definitions as discrete labels. Instead, because different definitions may be semantically similar, we learn a translation between two representation spaces, motivated by Lample et al. (2017, 2018). We first encode all ground truth definitions into a consistent embedding space and learn a mapping function f. During the inference stage, given a target word and its context, we retrieve the human-readable definitions of the top-k nearest neighbors of our predicted embedding in the definition embedding space, considering all 79,030 definition candidates in the Oxford dictionary. Note that the candidates for each word are not restricted to its existing definitions in the dictionary, considering that words may go through semantic shift; for example, the word gay has shifted its meaning from happy to homosexual.
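The inference step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the nearest-neighbor criterion (the text does not state the exact metric), and the function name `top_k_definitions`, the 3-dimensional vectors, and the four candidate definitions are hypothetical toy values rather than real encoder outputs.

```python
import numpy as np

def top_k_definitions(pred, def_embs, definitions, k=3):
    """Return the k definitions whose embeddings are most cosine-similar to pred."""
    # Normalize both sides so that a dot product equals cosine similarity.
    pred = pred / np.linalg.norm(pred)
    normed = def_embs / np.linalg.norm(def_embs, axis=1, keepdims=True)
    sims = normed @ pred
    order = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [(definitions[i], float(sims[i])) for i in order]

# Toy definition space (hypothetical embeddings, not real encoder outputs).
definitions = [
    "a beautiful place",
    "an imagined ideal place",
    "a holy place",
    "a military attack in force",
]
def_embs = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.3, 0.0],
    [0.6, 0.6, 0.1],
    [0.0, 0.1, 1.0],
])
pred = np.array([1.0, 0.12, 0.0])  # stands in for the predicted embedding f(x)

top3 = top_k_definitions(pred, def_embs, definitions, k=3)
for text, score in top3:
    print(f"{score:.3f}  {text}")
```

In the full setting the candidate matrix would hold all 79,030 definition embeddings, so the same lookup is a single matrix-vector product per query.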
2.2 Mapping Architecture
Our mapping architecture consists of a transformation network followed by a 7-layer MLP with batch normalization (Ioffe and Szegedy, 2015) and ReLU. In order to incorporate diverse context-dependent embeddings (such as ELMo and BERT), different transformation nets are proposed, whose common goal is to map the input features to a fixed-dimension representation. Three variants are described in detail.
• Context Embedding: the target word embedding and its context embedding encoded by a pretrained sentence encoder are concatenated as the input to the transformation net, which is simply implemented as a fully-connected layer with the ReLU activation.
• ELMo: we apply a weighted sum over the 3 extracted contextualized word embeddings, i.e., the outputs of the character CNN and the two LSTMs, obtaining a single context-dependent vector that is concatenated with the target word vector.
• BERT: the target word is tokenized into word pieces, and we use one-dimensional convolution (conv1d) (Kim, 2014) and max-pooling to handle the variable-length features. We extract the last 4 layers from BERT and jointly learn softmax-normalized weights corresponding to each layer, similar to ELMo.² Figure 1 illustrates the mapping model leveraging features from BERT, which shows the best capability of carrying sense-specific explanation among all variants.
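To make the BERT variant more concrete, the following NumPy sketch shows one plausible reading of the transformation network: per-layer convolution features over the word pieces, max-pooling, and a softmax-weighted sum over the extracted layers. Everything here is an assumption for illustration: the function name `transform_bert`, the sharing of filters across layers, the treatment of the size-3 kernel as a single window over three padded word pieces, the absence of padding masks, and the toy dimensions. It is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def transform_bert(layers, layer_logits, w1, w3):
    """Map per-layer word-piece features to one fixed-size vector.

    layers:       (L, T, D) array; L extracted layers, T word pieces, hidden size D
    layer_logits: (L,) learnable scores, softmax-normalized into layer weights
    w1:           (D, F) filters of a kernel-size-1 convolution
    w3:           (T, D, F) filters of a convolution spanning all T word pieces
    returns:      (2F,) context-dependent feature vector
    """
    per_layer = []
    for h in layers:
        k1 = (h @ w1).max(axis=0)           # kernel size 1, then max over word pieces
        k3 = np.einsum("td,tdf->f", h, w3)  # one window covering every word piece
        per_layer.append(np.concatenate([k1, k3]))
    weights = softmax(layer_logits)         # non-negative, sums to 1
    return weights @ np.stack(per_layer)    # weighted sum over layers

rng = np.random.default_rng(0)
layers = rng.normal(size=(4, 3, 8))  # toy sizes: 4 layers, 3 word pieces, D = 8
w1 = rng.normal(size=(8, 5))
w3 = rng.normal(size=(3, 8, 5))
features = transform_bert(layers, np.zeros(4), w1, w3)
print(features.shape)  # (10,)
```

The resulting vector would then be concatenated with the context-independent target word embedding and passed to the MLP.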
2.3 Reverse Mapping
In order to analyze what the mapping captures for better interpretation, we examine the reverse direction of our mapping after training, motivated by Yuan et al. (2016). Given a context-dependent embedding v and its ground truth definition embedding y, the word embedding $u_w$ for each target word w in the vocabulary V, and a pretrained mapping $\bar{f}$, the word whose mapped vector is closest to the target definition is formulated as:
$$\arg\max_{w \in V} \cos\big(\bar{f}(u_w, v),\, y\big). \tag{2}$$
In our experiments, the word set corresponding to the top-k highest cosine scores often contains the actual target word, also overlapping with a few synonyms provided by the Oxford dictionary.
When applied to contextualized word embeddings from BERT-base, the model achieves an average recall of 23.7%, even though we do not incorporate any synonym information during training. This demonstrates that our models are capable of automatically capturing potential similarities. Examples of the generated synonyms can be found in the supplementary material.
² Note that our work focuses on analyzing the sense information encoded in the contextualized embeddings; thus, our model is stacked upon the frozen representations extracted from the pretrained BERT instead of fine-tuning them. More training details are given in the supplementary material.
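The reverse lookup of Equation (2) can be sketched as a ranking over the vocabulary. In this minimal example, `toy_mapping` is a hypothetical stand-in for the trained mapping, and the 2-dimensional word vectors are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reverse_lookup(f, word_embs, v, y, k=5):
    """Rank vocabulary words w by cos(f(u_w, v), y), highest first."""
    scores = {w: cosine(f(u, v), y) for w, u in word_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def toy_mapping(u, v):
    # Hypothetical stand-in for the frozen, trained mapping.
    return u + 0.5 * v

# Toy static word embeddings (hypothetical values).
word_embs = {
    "expertise": np.array([1.0, 0.0]),
    "prowess": np.array([0.9, 0.2]),
    "tuber": np.array([-1.0, 0.3]),
}
v = np.array([0.0, 0.2])                    # fixed context-dependent embedding
y = toy_mapping(word_embs["expertise"], v)  # pretend ground truth definition embedding

ranked = reverse_lookup(toy_mapping, word_embs, v, y, k=2)
print(ranked)  # ['expertise', 'prowess']
```

Only the word embedding varies during the search; the mapping and the context-dependent component stay frozen.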
3 Experiments
In order to examine whether the sense-specific information captured by contextualized word embeddings can be well disentangled, the following experiments are conducted.
3.1 Definition Retrieval
This is to analyze whether our proposed mapping indeed interprets the sense-specific definitions from contextualized word embeddings. Considering that the words can be seen and unseen, our experiments contain two levels of tasks (Chang et al., 2018).
• Seen tests instances of the form (seen word, unseen context, seen definition), including 151,306 pairs of instances containing 9,276 target words.
• Unseen tests instances of the form (unseen word, unseen context), including 15,959 pairs of instances corresponding to the 1,000 randomly selected target words held out from training. Such a zero-shot setting tests whether the input feature is informative enough and whether the mapping can generalize to unseen but semantically consistent embeddings. It is also a practical and appealing task, as many new words are coined every year.
We ensure that both tasks are polysemous by sampling only instances whose target words have at least 3 definitions when building these two test sets.
3.1.1 Results
We measure the performance of various proposed architectures with average precision (@1, @5, @10) as well as the average cosine distance between the predicted definition embedding and the ground truth embedding (lower is better). Two baselines without using contextualized embeddings as context-dependent input features are proposed, 1) using the target word embedding only, which entirely ignores the contexts and thus being a lower bound of this task and 2) leveraging the context embedding from the pretrained Transformer-based universal sentence encoder as described in Section 2.2. Note that the naive baseline is randomly guessing among the whole 79,030 definitions, with P@1 lower than 0.0013%, showing the difficulty of this task.

| Task | Methods | P@1 | P@5 | P@10 | Cosine Dist |
|---|---|---|---|---|---|
| Seen | target word embedding | 33.27 / 18.29 | 50.77 / 31.92 | 56.06 / 36.69 | 0.251 |
| Seen | + context embedding | 59.36 / 45.19 | 71.43 / 58.17 | 74.95 / 62.42 | 0.178 |
| Seen | + ELMo | 67.00 / 53.91 | 77.32 / 65.69 | 80.35 / 69.58 | 0.149 |
| Seen | + BERT-base | 74.83 / 63.34 | 83.28 / 73.97 | 85.46 / 77.06 | 0.123 |
| Seen | + BERT-large | 73.89 / 62.36 | 82.61 / 73.24 | 84.92 / 76.28 | 0.126 |
| Zero-Shot | target word embedding | 1.84 / 1.06 | 6.54 / 4.22 | 9.67 / 6.44 | 0.388 |
| Zero-Shot | + context embedding | 1.97 / 1.29 | 7.00 / 4.77 | 10.78 / 7.50 | 0.383 |
| Zero-Shot | + ELMo | 2.04 / 1.38 | 6.79 / 4.65 | 10.21 / 7.06 | 0.387 |
| Zero-Shot | + BERT-base | 3.27 / 2.28 | 9.59 / 7.41 | 14.44 / 11.44 | 0.344 |
| Zero-Shot | + BERT-large | 3.50 / 2.52 | 10.47 / 8.17 | 15.58 / 12.35 | 0.339 |

Table 1: Precision@K (average within examples sharing the same target words / average within examples sharing the same (target word, definition)) and cosine distance for models using various input features.

| Methods | Seen | Unseen |
|---|---|---|
| Noraset et al. (2017) | 21.6 / 36.7 | 1.7 / 15.8 |
| Chang et al. (2018) | 24.9 / 41.0 | 2.0 / 15.9 |
| target word embedding | 28.4 / 36.9 | 4.6 / 17.2 |
| + context embedding | 58.5 / 62.8 | 5.1 / 16.8 |
| + ELMo | 66.5 / 71.6 | 4.8 / 17.2 |
| + BERT-base | 74.7 / 78.3 | 7.1 / 19.3 |

Table 2: BLEU@4 / ROUGE-L:F scores of NLG-based models and various proposed architectures.
For the Seen experiments, Table 1 shows that the context-dependent component contains abundant sense-informative cues: contextualized word embeddings, especially BERT, show a strong capability of producing the corresponding definitions, with an absolute gain of about 15% in P@1 compared to the second baseline. For the Unseen results, the trend is similar: all models with context-dependent input features outperform the first baseline, and the variants with BERT reach the best scores on all metrics. The above results demonstrate that rich sense-informative cues are captured by the contextualized word embeddings.
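The two Precision@K figures reported per cell of Table 1 can be illustrated with a short sketch. It assumes a macro-average reading of the caption, i.e., hits are first averaged within each group (same target word, or same (target word, definition)) and then across groups; the function name, the record layout, and the toy examples are hypothetical.

```python
from collections import defaultdict

def precision_at_k(examples, k, group_key):
    """Macro-averaged precision@k in percent.

    examples:  list of dicts with keys 'word', 'gold', 'ranked'
    group_key: function mapping an example to its averaging group
    """
    hits = defaultdict(list)
    for ex in examples:
        # A hit means the ground truth definition is among the top-k candidates.
        hits[group_key(ex)].append(1.0 if ex["gold"] in ex["ranked"][:k] else 0.0)
    per_group = [sum(h) / len(h) for h in hits.values()]
    return 100.0 * sum(per_group) / len(per_group)

# Toy predictions; letters stand in for definition strings.
examples = [
    {"word": "draw", "gold": "A", "ranked": ["A", "B", "C"]},
    {"word": "draw", "gold": "A", "ranked": ["B", "A", "C"]},
    {"word": "draw", "gold": "D", "ranked": ["X", "Y", "Z"]},
    {"word": "push", "gold": "E", "ranked": ["E", "F", "G"]},
]

by_word = precision_at_k(examples, 1, lambda ex: ex["word"])
by_sense = precision_at_k(examples, 1, lambda ex: (ex["word"], ex["gold"]))
print(f"P@1 by word: {by_word:.2f} / by (word, definition): {by_sense:.2f}")
```

Grouping by (target word, definition) prevents frequent senses from dominating the score, which is consistent with the second number in each cell of Table 1 being lower.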
Furthermore, we evaluate the definitions by their natural language surfaces using BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) scores. The results are in Table 2, where both Noraset et al. (2017) and the first proposed baseline generate or select definitions depending merely on the static target word embedding, and all other architectures are context-dependent. For a fair comparison, we initialize the target word embeddings of all models with fasttext embeddings (Joulin et al., 2016) pretrained on Wikipedia 2017 and the UMBC webbase corpus. The results demonstrate that our mapping model can better explain the word representations than the prior work.
3.1.2 Analysis
An ablation study is conducted in which only the contextualized word embedding is used. While the scores of all 3 related variants drop dramatically due to the lack of the explicit, context-independent signal of the target word identity, they still outperform the first baseline by 7% (P@1) on the Seen task, showing the superiority of contextualized representations over their static counterparts. In addition, the models with contextualized embeddings perform better than the one with context embeddings (the second baseline) because two context embeddings sharing the same target word sense may differ a lot due to the various words in their contexts, whereas their contextualized word embeddings produced by ELMo or BERT may be similar. This allows our proposed model to better interpret the sense information.
Despite the overall low performance under the zero-shot setting, it is found that all proposed models with context-dependent components are still able to disambiguate different senses. Table 3 shows a randomly-sampled example from the output of the BERT-base model. Although the model can only correctly answer the first definition of draw, which may be the most common usage so that it can be easily generated from the input embedding, we show that our model is still able to capture the other two very different word senses according to the selected definitions. More output samples on both Seen and Unseen tasks can be found in the supplementary material.

Target word: draw

Context: The embodied capacity to write and draw seems to rule over the languid group of objects underneath.
1st Definition: produce an image of someone or something by making lines and marks on paper
2nd Definition: produce a picture or diagram by making lines and marks on paper with a pencil pen etc
3rd Definition: compose or draw up something written or abstract
Ground Truth: produce an image of someone or something by making lines and marks on paper

Context: When it came to the end of the day, though, I was more than happy to draw the curtains and shut the day out.
1st Definition: arrange something carefully into a particular shape or position
2nd Definition: arrange objects or parts in a zigzag formation or so that they are not in line
3rd Definition: draw a circle round something especially to focus attention on it
Ground Truth: pull curtains shut or open

Context: ... hinders your ability to impart spin on the ball, reducing your ability to draw and fade the shot on command.
1st Definition: put the ball in play by throwing it up between two opponents
2nd Definition: strike the ball in the direction of ones followthrough so that it travels to the left ...
3rd Definition: propel a ball with a bat racket stick etc to score runs or points in a game
Ground Truth: hit the ball so that it deviates slightly usually as a result of spin

Table 3: The analysis of the top 3 selected definitions on the Unseen task.

| Model | Accuracy (%) |
|---|---|
| Lee and Chen (2017) | 52.14 |
| Neelakantan et al. (2015) | 54.00 |
| Mancini et al. (2016) | 54.56 |
| Guo et al. (2019) | 55.27 |
| Chang et al. (2018) | 57.00 |
| Pilehvar and Collier (2016) | 58.55 |
| Proposed (BERT-base) | 68.64 |

Table 4: The results on Word-in-Context (WiC) data.
Moreover, unlike the prior work that required discrete token generation to interpret the pretrained word embeddings (Noraset et al., 2017; Gadetsky et al., 2018; Chang et al., 2018), we reformulate the definition modeling task from an NLG problem to a classification problem via learning a mapping between two semantically continuous spaces. This greatly reduces the difficulty of the task and yields a significant improvement, as shown in Table 2. Specifically, as the input representations of Noraset et al. (2017) and the first proposed baseline are the same, i.e., they both are context-agnostic and utilize the same pretrained static word embeddings, the better performance of our model demonstrates the direct benefits of not requiring sequence generation.
3.2 Word Sense Selection in Context
We further examine whether the captured sense-specific cues help word sense disambiguation via the Word-in-Context (WiC) data (Pilehvar and Camacho-Collados, 2018), in which each instance contains a pair of contexts sharing a target word, and the task is to decide whether their word senses are the same.³ To verify that the models are capable of selecting senses encoded in the embeddings, for each context in a pair our model outputs 10 candidate definitions (top-10 nearest neighbors), and we output TRUE if any definition occurs in both candidate sets, otherwise FALSE. Table 4 shows that the proposed model with contextualized word embeddings outperforms all previous models. We conclude that contextualized word embeddings indeed capture sense-informative cues and that our proposed model is capable of interpreting the corresponding senses via definitions.
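The decision rule above reduces to a set intersection test. The sketch below uses a hypothetical function name and short, illustrative candidate lists (three definitions per context instead of ten).

```python
def same_sense(candidates_a, candidates_b):
    """Return True if the two top-k definition lists share at least one definition."""
    return not set(candidates_a).isdisjoint(candidates_b)

# Illustrative candidate definitions for two contexts of the same target word.
context_1 = [
    "a military attack in force",
    "the action of returning a military attack counterattack",
    "a disorderly retreat of defeated troops",
]
context_2 = [
    "an act of pressing a part of a machine or device",
    "a device which prevents or stops a specified thing",
    "a military attack in force",
]
context_3 = [
    "an act of pushing or shoving something",
    "an act of pressing a part of a machine or device",
]

print(same_sense(context_1, context_2))  # True: one definition is shared
print(same_sense(context_1, context_3))  # False: no overlap
```

Because no parameters are fit on WiC, the rule only tests how well the retrieved definitions separate senses.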
4 Conclusion
This paper proposes a framework that can well interpret the contextualized word embeddings by human-readable sense definitions. The experiments demonstrate that contextualized word embeddings capture the sense-informative cues and the proposed model can better explain the semantics encoded in the representations.
Acknowledgements
We would like to thank Ta-Chung Chi for in-depth discussions and the anonymous reviewers for their insightful comments on the paper. This work was financially supported by the Young Scholar Fellowship Program of the Ministry of Science and Technology (MOST) in Taiwan, under Grant 108-2636-E-002-003.
³ To analyze the generalizability, we follow the experimental setting of Guo et al. (2019), where all baselines and the proposed model are pretrained and evaluated directly on the large WiC training set without fine-tuning.
References
Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzębski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.
Tom Bosc and Pascal Vincent. 2018. Auto-encoding dictionary definitions into consistent word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1522–1532.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Ting-Yun Chang, Ta-Chung Chi, Shang-Chi Tsai, and Yun-Nung Chen. 2018. xSense: Learning sense-separated sparse representations and textual definitions for explainable word sense networks. arXiv preprint arXiv:1809.03348.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 266–271.
Fenfei Guo, Mohit Iyyer, Leah Findlater, and Jordan Boyd-Graber. 2019. A differentiable self-disambiguated sense embedding model via scaled Gumbel softmax.
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
Guang-He Lee and Yun-Nung Chen. 2017. MUSE: Modularizing unsupervised sense embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 327–337.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
Massimiliano Mancini, Jose Camacho-Collados, Ignacio Iacobacci, and Roberto Navigli. 2016. Embedding words and senses together via joint knowledge-enhanced training. arXiv preprint arXiv:1612.02703.
Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.
Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In Proceedings of AAAI.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121.
Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. arXiv preprint arXiv:1608.01961.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. a. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. b. Language models are unsupervised multitask learners.
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. arXiv preprint arXiv:1902.09492.
Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 341–351. Springer.
Julien Tissier, Christopher Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254–263.
Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012.
A Experimental Details
Hyperparameter Tuning. We have empirically found that simply using a linear layer as our mapping f fails to fit the training set well. In addition, for the model that utilizes contextualized embeddings extracted from BERT, replacing the conv1d and max-pooling with the simple average of all embeddings of the corresponding word pieces leads to worse results. The maximum number of word pieces is clipped to 3, considering not only statistics but also computational cost. For the one-dimensional convolution, we tried different kernel sizes and eventually chose 256 filters with kernel size 3, which cover all input word pieces, together with another 256 filters with kernel size 1, which are combined with max-pooling and thus only consider the most activated word piece.
With Adam as our optimizer, we reduce the learning rate by half if the loss on the validation set has not improved within 5 epochs. All hyperparameters are tuned on the held-out validation set.
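The learning-rate schedule described above can be sketched in a few lines of plain Python. The class name, the initial learning rate of 1e-3, and the exact plateau rule (halve after 5 consecutive epochs without a new best validation loss, then restart the count) are assumptions for illustration; the source does not specify these details.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update with this epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # restart the count after each reduction
        return self.lr

# Hypothetical validation losses: two improvements, a five-epoch plateau, one improvement.
scheduler = HalveOnPlateau(lr=1e-3)  # the initial rate is an assumed value
losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.8]
history = [scheduler.step(loss) for loss in losses]
print(history)
```

Deep learning frameworks ship comparable schedulers, which may count the patience window slightly differently.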
Fine-Tuning Contextualized Embeddings. In order to further investigate the definition mapping task, we also conduct experiments that allow the pretrained BERT model to be fine-tuned jointly when training our mapping. It is found that the results are much worse than those obtained with the frozen BERT representations shown in Table 1, no matter which architecture is used in our mapping. The relatively worse performance on the training set may indicate that BERT suffers from catastrophic forgetting during fine-tuning in our task, even though our training set is large. We posit that this is because our task, in which the model is required to disambiguate among nearly 80,000 definitions, is much harder than the tasks in the GLUE benchmark. More investigation is needed to address this issue.
Details in Pretrained Representations. Both the context embeddings and the definition embeddings are encoded by the Transformer-based universal sentence encoder⁴, with vectors in $\mathbb{R}^{512}$. The representations from BERT-base⁵ are in $\mathbb{R}^{768}$, while the ones from BERT-large and ELMo⁶ are in $\mathbb{R}^{1024}$. The context-independent embeddings of all proposed variants are initialized with pretrained fasttext embeddings⁷ in $\mathbb{R}^{300}$.
Evaluation Metrics. The BLEU score has many variants. In our experiments, we apply the commonly-used corpus-level BLEU@4 without smoothing from the NLTK package. We also measure the ROUGE-L:F score, the longest common subsequence based statistics, with the Python ROUGE package⁸.
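As an illustration of the longest-common-subsequence statistic behind ROUGE-L, the sketch below computes a plain F1 score over whitespace tokens. It is a simplified stand-in written for this explanation: the ROUGE package referenced above may tokenize or weight precision and recall differently, so the numbers are not guaranteed to match it.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f(hypothesis, reference):
    """LCS-based F1 between two whitespace-tokenized strings (assumes beta = 1)."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

selected = "an act of pushing or shoving something"
gold = "an act of pushing someone or something in order to move them away from oneself"
score = rouge_l_f(selected, gold)
print(f"ROUGE-L F (sketch): {score:.4f}")
```

A selected definition that exactly matches the ground truth scores 1.0, which is why retrieval-based models can reach high surface-level scores on the Seen task.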
B Generated Synonyms
Examples of the generated synonyms retrieved by reversing the mapping, as described in Section 2.3, are shown in Table 5. In more detail, we freeze the pretrained mapping and the context-dependent component, i.e., the contextualized representations from BERT or ELMo, or the context embedding from the universal sentence encoder, and then probe all target word embeddings in the training set to obtain the words that maximize the cosine similarity between the predicted definition embedding and the ground truth one.
C Sample Outputs
The randomly-sampled examples from the output of the proposed model with contextualized embed-
⁴ https://tfhub.dev/google/universal-sentence-encoder-large/3
⁵ https://pypi.org/project/pytorch-pretrained-bert/
⁶ https://pypi.org/project/allennlp/0.8.4/
⁷ https://fasttext.cc/docs/en/english-vectors.html
⁸ https://pypi.org/project/rouge/0.3.1/
Contexts, Generated Synonyms, Ground Truth

Context: Dressed all in black.
Generated Synonyms: all, altogether, strictly, totes, ferociously, bang exceedingly, completely, wholly, terribly, definitely, easily
Ground Truth: all, totally, outright, altogether, completely, absolutely, quite, wholly, fully, thoroughly, utterly, entirely

Context: Worse yet concert goers with floor tickets had to remain outside in front of gate.
Generated Synonyms: yet, entirely, anyone, anywhere, still, ever
Ground Truth: yet, besides, even, additionally, further, still

Context: If you have green eyes, brighten your eyes with a plum hue.
Generated Synonyms: brighten, flamboyant, embellish, lurid
Ground Truth: brighten, enhance, embellish, enrich

Context: As the disease progresses the cone becomes more pronounced causing vision to become blurred and distorted.
Generated Synonyms: distorted, crooked, compressed, strained, stiffen slippage, sore, tuber, convoluted, recurrent
Ground Truth: distorted, crooked, twisted, awry, misshapen, deformed, bent, wry, malformed, irregular

Context: The hospital trust shares its medical expertise in the field of blood diseases with the facility in India.
Generated Synonyms: expertise, prowess, knowhow, ustad, literacy
Ground Truth: expertise, skill, prowess, competence, proficiency

Table 5: Samples of generated synonyms obtained by reversing the trained mapping.
Target Word Contexts, Selected Definitions, Ground Truth
living
The aim of the scheme is to improve the city environment and make better use of living space.
1st Definition: of a place used for living rather than working in 2nd Definition: a place regarded as giving access to another place
3rd Definition: denoting or relating to accommodation designed for occupation by more than one family Ground Truth: of a place used for living rather than working in
A living language both accumulates new words of value and preserves what is old and of value.
1st Definition: of a language still spoken and used
2nd Definition: of a language tending to have each element as an independent word without inflections 3rd Definition: of a form of a language as used in former or earliest times
Ground Truth: of a language still spoken and used
But as I said I’m aware of the problems earning enough money to make a living.
1st Definition: an income sufficient to live on or the means of earning it 2nd Definition: give or bequeath an income or property to a person or institution 3rd Definition: of income or resources making it unnecessary to earn ones living Ground Truth: an income sufficient to live on or the means of earning it
push
He patted me on the back and gave me a slight push to the door as if I should do it right now.
1st Definition: an act of pushing someone or something in order to move them away from oneself 2nd Definition: an act of moving on ones hands and knees or dragging ones body along the ground 3rd Definition: an act of pushing or shoving something
Ground Truth: an act of pushing someone or something in order to move them away from oneself The electric beds which can be raised and lowered at the push of a button will help give patients ....
1st Definition: an act of pressing a part of a machine or device 2nd Definition: a device which prevents or stops a specified thing
3rd Definition: a device used to prevent the operation or movement of a vehicle or other machine Ground Truth: an act of pressing a part of a machine or device
American Forces then embarked on the long push to Tokyo.
1st Definition: a military attack in force
2nd Definition: the action of returning a military attack counterattack
3rd Definition: a disorderly retreat of defeated troops
Ground Truth: a military attack in force
view
... I only have to lift my eyes by ten degrees and I have a sumptuous panoramic view of a building site.
1st Definition: a sight or prospect typically of attractive natural scenery ... by the eye from a particular place
2nd Definition: the ability to see something or to be seen from a particular place
3rd Definition: an object or feature of a landscape or town that is easily seen and recognized from a distance ...
Ground Truth: a sight or prospect typically of attractive natural scenery ... by the eye from a particular place
Matisse’s view of Collioure.
1st Definition: the distinctive nature or qualities of something
2nd Definition: an attribute quality or characteristic of something
3rd Definition: a tangible or visible form of an idea quality or feeling
Ground Truth: a work of art depicting a sight of natural scenery
Members of the public can view these and other documents at the national archive.
1st Definition: inspect a house or other property with the intention of possibly buying or renting it
2nd Definition: show or provide something for consideration inspection or use
3rd Definition: examine and report on the condition of a building especially for a prospective buyer
Ground Truth: look at or inspect
Table 6: Randomly sampled examples of the top 3 selected definitions on the Seen task.
Target Word Contexts, Selected Definitions, Ground Truth
construction
There is, so far as I am aware, no authority on the true construction of this clause.
1st Definition: a general statement or concept obtained by inference from specific cases
2nd Definition: the precise terms of a statement or requirement the strict verbal interpretation
3rd Definition: the action of making a statement or situation less confused and more comprehensible
Ground Truth: an interpretation or explanation
... intellectual predilection for stressing the active role of individuals in the social construction of social reality.
1st Definition: the action of creating or preparing something
2nd Definition: the process of analysing and developing an idea or principle in detail
3rd Definition: the process of deciding or planning something
Ground Truth: the creation of an abstract entity
That tells us that the construction is an interrogative complement clause in each case.
1st Definition: the arrangement of words and phrases to create wellformed sentences in a language
2nd Definition: in systemic grammar a level of structure between clause and word ...
3rd Definition: a single distinct meaningful element of speech or writing used with others ...
Ground Truth: the arrangement of words according to syntactical rules
inflate
Unfortunately, the balloon refused to inflate properly and just dragged along on the surface of the sea.
1st Definition: extend outwards beyond something else protrude
2nd Definition: extending upwards over
3rd Definition: become or make less wide
Ground Truth: become distended with air or gas
If there is one thing the Fed can do, they say, it is inflate the currency.
1st Definition: make a slight reduction in the amount rate or price of
2nd Definition: bring about a general reduction of price levels in an economy
3rd Definition: exclude a nonnet amount such as tax when making a calculation in order to reduce the amount ...
Ground Truth: bring about inflation of a currency or in an economy
On his bicycle fitted with a luggage box ... a vacuum pump to inflate the cycle tyres whenever necessary.
1st Definition: become or make less wide
2nd Definition: squeeze or force into a small or restricted space
3rd Definition: extend across or through
Ground Truth: fill a balloon tyre or other expandable structure with air or gas so that it becomes distended
Table 7: More randomly sampled examples of the top 3 selected definitions on the Unseen task.
dings from BERT-base on the Seen task are shown in Table6, and the ones on the Unseen task are in Table7.
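The tables list, for each context, the three highest-ranked candidate definitions. As a rough illustration of such a top-k selection, the sketch below ranks candidates by cosine similarity between a query vector and one vector per definition. The similarity measure, the function name `top_k_definitions`, and the toy 3-dimensional vectors are assumptions made for illustration only; they are not taken from the released code or from real BERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_definitions(query, candidates, k=3):
    """Return the k definition strings whose vectors are closest to `query`.

    `candidates` maps definition text -> vector (hypothetical toy data below).
    """
    ranked = sorted(
        candidates,
        key=lambda definition: cosine(query, candidates[definition]),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]


# Toy vectors invented purely for illustration (assumption, not real data).
candidates = {
    "a military attack in force": [0.9, 0.1, 0.0],
    "an act of pressing a part of a machine or device": [0.1, 0.9, 0.1],
    "a disorderly retreat of defeated troops": [0.7, 0.0, 0.6],
    "look at or inspect": [0.0, 0.2, 0.9],
}

# Stands in for a vector derived from the embedding of "push" in its context.
query = [1.0, 0.0, 0.1]

top3 = top_k_definitions(query, candidates, k=3)
for rank, definition in enumerate(top3, start=1):
    print(f"{rank}: {definition}")
```

A selection counts as correct at rank k when the ground-truth definition appears among the first k returned entries, which is how the 1st, 2nd, and 3rd definitions in the tables can be read against the Ground Truth line.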