Contextualized Word Embeddings
Applied Deep Learning
March 14th, 2022 http://adl.miulab.tw
Word Embedding Polysemy Issue
◉ Words are polysemous
✓ An apple a day, keeps the doctor away.
✓ Smartphone companies including apple, …
◉ However, their embeddings are NOT polysemous (one vector per word type)
◉ Issue
✓ Multi-senses (polysemy)
✓ Multi-aspects (semantics, syntax)
(Figure: static embedding space with single points for “tree”, “trees”, “rock”, “rocks”, regardless of context.)
RNNLM
◉ Idea: condition the neural network on all previous words and tie the weights at each time step
(Figure: RNN LM unrolled over “START wreck a nice beach”: at each step the input word vector updates a hidden context vector, which outputs a word probability distribution for the next word, e.g. P(next w=“wreck”) given “START”, P(next w=“a”) given “wreck”, and so on.)
This LM produces context-specific word representations at each position.
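As a minimal PyTorch sketch (illustrative code, not from the slides): the recurrent weights are shared across time steps, and the hidden state at each position is a context-specific representation of the prefix read so far.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: weights are tied (shared) across time steps,
    and the hidden state at each position is a context-specific representation
    of all previous words."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))    # h: (batch, seq_len, hidden_dim)
        logits = self.out(h)                   # next-word scores at every position
        return logits, h                       # h doubles as contextual embeddings
```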
TagLM – “Pre-ELMo”
◉ Idea: train an NLM on large unannotated data and provide its context-specific embeddings to the target task → semi-supervised learning
(Figure: TagLM pipeline: the input sequence “New York is located …” is fed both to a word embedding model and to a pre-trained recurrent language model; the word embeddings and LM embeddings are concatenated and passed to a sequence tagging model, which outputs “B-LOC E-LOC O O …”.)
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
TagLM Model Detail
◉ Leverage pre-trained LM information in the downstream tagger (see the sketch below)
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
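The model-detail figure is not reproduced here. As a rough sketch of the TagLM idea (class and argument names are mine, and the character CNN and CRF layer of the actual model are omitted), the frozen pre-trained LM embeddings are concatenated with ordinary word embeddings before the BiLSTM tagger:

```python
import torch
import torch.nn as nn

class TagLMTagger(nn.Module):
    """Sketch of the TagLM idea: concatenate frozen, pre-trained LM embeddings
    with ordinary word embeddings and feed the result to a BiLSTM tagger
    (a plain softmax layer replaces the paper's CRF for brevity)."""
    def __init__(self, vocab_size, num_tags, word_dim=100, lm_dim=1024, hidden_dim=200):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.encoder = nn.LSTM(word_dim + lm_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, tokens, lm_embeddings):
        # lm_embeddings: (batch, seq_len, lm_dim), produced by a pre-trained biLM
        # and kept frozen (no gradient flows back into the LM).
        x = torch.cat([self.word_embed(tokens), lm_embeddings.detach()], dim=-1)
        h, _ = self.encoder(x)
        return self.classifier(h)              # per-token tag scores
```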
TagLM on Named Entity Recognition
The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When,
after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Model                   Description                  CONLL 2003 F1
Klein+, 2003            MEMM softmax markov model    86.07
Florian+, 2003          Linear/softmax/TBL/HMM       88.76
Finkel+, 2005           Categorical feature CRF      86.86
Ratinov and Roth, 2009  CRF+Wiki+Word cls            90.80
Peters+, 2017           BLSTM + char CNN + CRF       90.87
Ma and Hovy, 2016       BLSTM + char CNN + CRF       91.21
TagLM (Peters+, 2017)   LSTM BiLM in BLSTM Tagger    91.93
Peters et al., “Semi-supervised sequence tagging with bidirectional language models,” in ACL, 2017.
CoVe
◉ Idea: use a trained sequence-to-sequence model to provide contexts to other NLP tasks
a) The NMT model is trained to capture the meaning of a sequence; b) the trained NMT encoder then provides the context for target tasks (sketched below)
McCann et al., “Learned in Translation: Contextualized Word Vectors”, in NIPS, 2017.
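As a rough sketch of this idea (class and argument names are hypothetical, not the CoVe authors' code): the encoder of a trained NMT model is frozen and its outputs are concatenated with GloVe vectors before the downstream task-specific model.

```python
import torch
import torch.nn as nn

class CoVeFeaturizer(nn.Module):
    """Sketch of the CoVe idea: reuse the encoder of a pre-trained NMT model
    as a feature extractor and append its outputs to GloVe vectors."""
    def __init__(self, glove_embedding: nn.Embedding, nmt_encoder: nn.Module):
        super().__init__()
        self.glove = glove_embedding            # frozen GloVe lookup (e.g. 300-dim)
        self.encoder = nmt_encoder              # assumed BiLSTM encoder from a trained MT model
        for p in self.encoder.parameters():     # keep the MT encoder frozen
            p.requires_grad = False

    def forward(self, tokens):
        g = self.glove(tokens)                  # (batch, seq_len, glove_dim)
        cove, _ = self.encoder(g)               # contextualized vectors from the MT encoder
        return torch.cat([g, cove], dim=-1)     # [GloVe; CoVe] fed to the downstream task
```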
CoVe vectors outperform GloVe vectors on various tasks.
However, the results are not as strong as those from the simpler NLM pre-training.
Contextualized Word Embeddings: ELMo
ELMo: Embeddings from Language Models
◉ Idea: contextualized word representations
✓ Learn word vectors using long contexts instead of a context window
✓ Learn a deep Bi-NLM and use all its layers in prediction
(Figure: a language model predicting the next word at each position: have → a, a → nice, nice → day.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo: Embeddings from Language Models
1) Bidirectional LM
(Figure: the forward LM reads the sentence left to right, predicting the next word at each position: have → a, a → nice, nice → day.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo: Embeddings from Language Models
1) Bidirectional LM
○ Character CNN for initial word embeddings
■ 2048 character n-gram filters, 2 highway layers, 512-dim projection
○ 2 BLSTM layers
○ Parameter tying for the input (token representation) and output (softmax) layers, shared between the two directions (joint objective below)
(Figure: forward and backward LMs over the same sentence, with tied input and output parameters.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
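For reference, the biLM jointly maximizes the forward and backward log-likelihoods, sharing the token-representation parameters Θx and the softmax parameters Θs between directions (as in the ELMo paper):

```latex
\sum_{k=1}^{N}\Big(
  \log p\big(t_k \mid t_1,\dots,t_{k-1};\ \Theta_x,\ \overrightarrow{\Theta}_{\mathrm{LSTM}},\ \Theta_s\big)
+ \log p\big(t_k \mid t_{k+1},\dots,t_N;\ \Theta_x,\ \overleftarrow{\Theta}_{\mathrm{LSTM}},\ \Theta_s\big)
\Big)
```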
ELMo: Embeddings from Language Models
2) ELMo
○ Learn a task-specific linear combination of the LM layer embeddings
○ Use multiple LSTM layers instead of only the top one
■ 𝛾task scales overall usefulness of ELMo to task
■ 𝑠task are softmax-normalized weights
■ optional layer normalization
ELMo is a task-specific embedding whose combining weights are learned from the downstream task (see the formula below)
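The combination from the paper, where h(LM)_{k,j} is the layer-j biLM representation of token k (j = 0 is the character-CNN token layer):

```latex
\mathrm{ELMo}_k^{task}
  = E\big(R_k;\ \Theta^{task}\big)
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```

A minimal PyTorch sketch of this scalar mix (illustrative code, not the AllenNLP implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted combination of biLM layers (sketch of ELMo mixing)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s^task before softmax
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^task

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        s = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_outputs))
        return self.gamma * mixed
```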
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
(Figure: ELMo combines the representations from all layers of the forward and backward LMs with task-specific weights.)
ELMo: Embeddings from Language Models
3) Use ELMo in Supervised NLP Tasks
○ Get LM embedding for each word
○ Freeze the LM weights and form the ELMo-enhanced embeddings
○ Concatenate ELMo into the input layer and/or an intermediate layer; the appropriate place depends on the task
○ Tricks: dropout, regularization
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Illustration
(Figures not reproduced here.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo on Named Entity Recognition
Model                   Description                  CONLL 2003 F1
Klein+, 2003            MEMM softmax markov model    86.07
Florian+, 2003          Linear/softmax/TBL/HMM       88.76
Finkel+, 2005           Categorical feature CRF      86.86
Ratinov and Roth, 2009  CRF+Wiki+Word cls            90.80
Peters+, 2017           BLSTM + char CNN + CRF       90.87
Ma and Hovy, 2016       BLSTM + char CNN + CRF       91.21
TagLM (Peters+, 2017)   LSTM BiLM in BLSTM Tagger    91.93
ELMo (Peters+, 2018)    ELMo in BLSTM                92.22
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Results
◉ Improvement on various NLP tasks
✓ Machine comprehension, textual entailment, semantic role labeling, coreference resolution, named entity recognition, sentiment analysis
Good transfer learning in NLP (similar to pre-training in computer vision)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Analysis
◉ Word embeddings vs. contextualized embeddings
The biLM is able to disambiguate both the PoS and word sense in the source sentence
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
ELMo Analysis
◉ The two biLM layers capture different kinds of information
✓ The lower layer is better for lower-level syntax (e.g. part-of-speech tagging, syntactic dependencies, NER)
✓ The higher layer is better for higher-level semantics (e.g. sentiment, semantic role labeling, question answering, SNLI)
(Figure: per-layer results for word sense disambiguation and PoS tagging.)
Peters et al., “Deep Contextualized Word Representations”, in NAACL-HLT, 2018.
Concluding Remarks
◉ Contextualized embeddings learned from LM provide informative cues
◉ ELMo – a general approach for learning high-quality deep context-dependent representations from biLMs
✓ Pre-trained ELMo is available: https://allennlp.org/elmo (usage sketch below)
✓ ELMo can process character-level inputs
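A usage sketch with the AllenNLP Elmo module (API from the allennlp 0.x/1.x era; the option/weight file names below are illustrative placeholders for the standard pre-trained model, so check the page above for the actual download links):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder file names for a pre-trained ELMo model (see https://allennlp.org/elmo)
options_file = "elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# num_output_representations = how many independently weighted ELMo mixes to return
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

sentences = [["An", "apple", "a", "day"],
             ["Smartphone", "companies", "including", "Apple"]]
character_ids = batch_to_ids(sentences)               # character-level inputs, no fixed vocab
output = elmo(character_ids)
elmo_embeddings = output["elmo_representations"][0]   # (batch, seq_len, 1024)
```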