Contextualized Word Embeddings
Applied Deep Learning
March 14th, 2022 http://adl.miulab.tw
Word Embedding Polysemy Issue
◉ Words are polysemous
✓ An apple a day, keeps the doctor away.
✓ Smartphone companies including apple, …
◉ However, their embeddings are NOT polysemous (one vector per word type)
◉ Issue
✓ Multi-senses (polysemy)
✓ Multi-aspects (semantics, syntax)
(Figure: static embedding space with single points for “tree”, “trees”, “rock”, “rocks”, regardless of context.)
RNNLM
◉ Idea: condition the neural network on all previous words and tie the weights at each time step
(Figure: RNN LM unrolled over “START wreck a nice beach”: at each step the input word vector updates a hidden context vector, which outputs a word probability distribution for the next word, e.g. P(next w=“wreck”) given “START”, P(next w=“a”) given “wreck”, and so on.)
This LM produces context-specific word representations at each position.
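As a minimal PyTorch sketch (illustrative code, not from the slides): the recurrent weights are shared across time steps, and the hidden state at each position is a context-specific representation of the prefix read so far.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: weights are tied (shared) across time steps,
    and the hidden state at each position is a context-specific representation
    of all previous words."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))    # h: (batch, seq_len, hidden_dim)
        logits = self.out(h)                   # next-word scores at every position
        return logits, h                       # h doubles as contextual embeddings
```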
TagLM – “Pre-ELMo”
◉ Idea: train an NLM on large unannotated data and provide its context-specific embeddings to the target task → semi-supervised learning
(Figure: TagLM pipeline: the input sequence “New York is located …” is fed both to a word embedding model and to a pre-trained recurrent language model; the word embeddings and LM embeddings are concatenated and passed to a sequence tagging model, which outputs “B-LOC E-LOC O O …”.)
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
TagLM Model Detail
◉ Leverage pre-trained LM information in the downstream tagger (see the sketch below)
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
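The model-detail figure is not reproduced here. As a rough sketch of the TagLM idea (class and argument names are mine, and the character CNN and CRF layer of the actual model are omitted), the frozen pre-trained LM embeddings are concatenated with ordinary word embeddings before the BiLSTM tagger:

```python
import torch
import torch.nn as nn

class TagLMTagger(nn.Module):
    """Sketch of the TagLM idea: concatenate frozen, pre-trained LM embeddings
    with ordinary word embeddings and feed the result to a BiLSTM tagger
    (a plain softmax layer replaces the paper's CRF for brevity)."""
    def __init__(self, vocab_size, num_tags, word_dim=100, lm_dim=1024, hidden_dim=200):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.encoder = nn.LSTM(word_dim + lm_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, tokens, lm_embeddings):
        # lm_embeddings: (batch, seq_len, lm_dim), produced by a pre-trained biLM
        # and kept frozen (no gradient flows back into the LM).
        x = torch.cat([self.word_embed(tokens), lm_embeddings.detach()], dim=-1)
        h, _ = self.encoder(x)
        return self.classifier(h)              # per-token tag scores
```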
TagLM on Named Entity Recognition
The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When,
after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Model                   Description                  CONLL 2003 F1
Klein+, 2003            MEMM softmax markov model    86.07
Florian+, 2003          Linear/softmax/TBL/HMM       88.76
Finkel+, 2005           Categorical feature CRF      86.86
Ratinov and Roth, 2009  CRF+Wiki+Word cls            90.80
Peters+, 2017           BLSTM + char CNN + CRF       90.87
Ma and Hovy, 2016       BLSTM + char CNN + CRF       91.21
TagLM (Peters+, 2017)   LSTM BiLM in BLSTM Tagger    91.93
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
CoVe
◉ Idea: use a trained sequence-to-sequence model to provide contexts to other NLP tasks
a) The NMT model is trained to capture the meaning of a sequence; b) the trained NMT encoder then provides the context for target tasks (sketched below)
McCann et al., “Learned in Translation: Contextualized Word Vectors”, in NIPS, 2017.
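As a rough sketch of this idea (class and argument names are hypothetical, not the CoVe authors' code): the encoder of a trained NMT model is frozen and its outputs are concatenated with GloVe vectors before the downstream task-specific model.

```python
import torch
import torch.nn as nn

class CoVeFeaturizer(nn.Module):
    """Sketch of the CoVe idea: reuse the encoder of a pre-trained NMT model
    as a feature extractor and append its outputs to GloVe vectors."""
    def __init__(self, glove_embedding: nn.Embedding, nmt_encoder: nn.Module):
        super().__init__()
        self.glove = glove_embedding            # frozen GloVe lookup (e.g. 300-dim)
        self.encoder = nmt_encoder              # assumed BiLSTM encoder from a trained MT model
        for p in self.encoder.parameters():     # keep the MT encoder frozen
            p.requires_grad = False

    def forward(self, tokens):
        g = self.glove(tokens)                  # (batch, seq_len, glove_dim)
        cove, _ = self.encoder(g)               # contextualized vectors from the MT encoder
        return torch.cat([g, cove], dim=-1)     # [GloVe; CoVe] fed to the downstream task
```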
CoVe vectors outperform GloVe vectors on various tasks.
However, the results are not as strong as those from the simpler NLM pre-training.
Contextualized Word Embeddings: ELMo
ELMo: Embeddings from Language Models
◉ Idea: contextualized word representations
✓ Learn word vectors using long contexts instead of a context window
✓ Learn a deep Bi-NLM and use all its layers in prediction
(Figure: a language model predicting the next word at each position: have → a, a → nice, nice → day.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo: Embeddings from Language Models
1) Bidirectional LM
(Figure: the forward LM reads the sentence left to right, predicting the next word at each position: have → a, a → nice, nice → day.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo: Embeddings from Language Models
1) Bidirectional LM
○ Character CNN for initial word embeddings
■ 2048 character n-gram filters, 2 highway layers, 512-dim projection
○ 2 BLSTM layers
○ Parameter tying for the input (token representation) and output (softmax) layers, shared between the two directions (joint objective below)
(Figure: forward and backward LMs over the same sentence, with tied input and output parameters.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
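For reference, the biLM jointly maximizes the forward and backward log-likelihoods, sharing the token-representation parameters Θx and the softmax parameters Θs between directions (as in the ELMo paper):

```latex
\sum_{k=1}^{N}\Big(
  \log p\big(t_k \mid t_1,\dots,t_{k-1};\ \Theta_x,\ \overrightarrow{\Theta}_{\mathrm{LSTM}},\ \Theta_s\big)
+ \log p\big(t_k \mid t_{k+1},\dots,t_N;\ \Theta_x,\ \overleftarrow{\Theta}_{\mathrm{LSTM}},\ \Theta_s\big)
\Big)
```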
ELMo: Embeddings from Language Models
2) ELMo
○ Learn a task-specific linear combination of the LM layer embeddings
○ Use multiple LSTM layers instead of only the top one
■ 𝛾task scales overall usefulness of ELMo to task
■ 𝑠task are softmax-normalized weights
■ optional layer normalization
ELMo is a task-specific embedding whose combining weights are learned from the downstream task (see the formula below)
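The combination from the paper, where h(LM)_{k,j} is the layer-j biLM representation of token k (j = 0 is the character-CNN token layer):

```latex
\mathrm{ELMo}_k^{task}
  = E\big(R_k;\ \Theta^{task}\big)
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```

A minimal PyTorch sketch of this scalar mix (illustrative code, not the AllenNLP implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted combination of biLM layers (sketch of ELMo mixing)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s^task before softmax
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_outputs))
        return self.gamma * mixed
```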
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
(Figure: ELMo combines the representations from all layers of the forward and backward LMs with task-specific weights.)
ELMo: Embeddings from Language Models
3) Use ELMo in Supervised NLP Tasks
○ Get LM embedding for each word
○ Freeze the LM weights and form the ELMo-enhanced embeddings
○ Concatenate ELMo into the input layer and/or an intermediate layer; the appropriate place depends on the task
○ Tricks: dropout, regularization
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Illustration
(Figures not reproduced here.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo on Named Entity Recognition
Model                   Description                  CONLL 2003 F1
Klein+, 2003            MEMM softmax markov model    86.07
Florian+, 2003          Linear/softmax/TBL/HMM       88.76
Finkel+, 2005           Categorical feature CRF      86.86
Ratinov and Roth, 2009  CRF+Wiki+Word cls            90.80
Peters+, 2017           BLSTM + char CNN + CRF       90.87
Ma and Hovy, 2016       BLSTM + char CNN + CRF       91.21
TagLM (Peters+, 2017)   LSTM BiLM in BLSTM Tagger    91.93
ELMo (Peters+, 2018)    ELMo in BLSTM                92.22
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Results
◉ Improvement on various NLP tasks
✓ Machine comprehension, textual entailment, semantic role labeling, coreference resolution, named entity recognition, sentiment analysis
Good transfer learning in NLP (similar to pre-training in computer vision)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Analysis
◉ Word embeddings vs. contextualized embeddings
The biLM is able to disambiguate both the PoS and word sense in the source sentence
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Analysis
◉ The two biLM layers capture different kinds of information
✓ The lower layer is better for lower-level syntax (e.g. part-of-speech tagging, syntactic dependencies, NER)
✓ The higher layer is better for higher-level semantics (e.g. sentiment, semantic role labeling, question answering, SNLI)
(Figure: per-layer results for word sense disambiguation and PoS tagging.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
Concluding Remarks
◉ Contextualized embeddings learned from LM provide informative cues
◉ ELMo – a general approach for learning high-quality deep context-dependent representations from biLMs
✓ Pre-trained ELMo is available: https://allennlp.org/elmo (usage sketch below)
✓ ELMo can process character-level inputs
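A usage sketch with the AllenNLP Elmo module (API from the allennlp 0.x/1.x era; the option/weight file names below are illustrative placeholders for the standard pre-trained model, so check the page above for the actual download links):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder file names for a pre-trained ELMo model (see https://allennlp.org/elmo)
options_file = "elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# num_output_representations = how many independently weighted ELMo mixes to return
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

sentences = [["An", "apple", "a", "day"],
             ["Smartphone", "companies", "including", "Apple"]]
character_ids = batch_to_ids(sentences)               # character-level inputs, no fixed vocab
output = elmo(character_ids)
elmo_embeddings = output["elmo_representations"][0]   # (batch, seq_len, 1024)
```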