Sequence Modeling & Embeddings
Yun-Nung (Vivian) Chen (http://vivianchen.idv.tw)
Southeast Asia Machine Learning School 2019 (http://seamls.miulab.tw/)
Outline
• Word Representation Basics
• Word Embeddings
• Recurrent Neural Network
Learning Target Function
• Classification task
  • x: input object to be classified → an N-dim vector
  • y: class/label → an M-dim vector
• Target function: y = f(x), where f: R^N → R^M
• Assume both x and y can be represented as fixed-size vectors
• Example: "This is awesome!" → +, "It sucks." → −
How do we represent the meaning of the word?
Meaning Representations
• Definition of "meaning":
  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.
• Goal: word representations that capture the relationships between words
Meaning Representations in Computers
• Knowledge-based representation
• Corpus-based representation
  ✓ Atomic symbol
  ✓ Neighbors
    • High-dimensional sparse word vector
    • Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Knowledge-Based Representation
• Hypernym (is-a) relationships of WordNet
• Issues:
  ▪ newly-invented words
  ▪ subjective
  ▪ annotation effort
  ▪ difficult to compute word similarity
Corpus-Based Representation
• Atomic symbols: one-hot representation
  car        = [0 0 0 0 0 0 1 0 0 … 0]
  motorcycle = [0 0 1 0 0 0 0 0 0 … 0]
  car AND motorcycle = [0 0 0 0 0 0 1 0 0 … 0] AND [0 0 1 0 0 0 0 0 0 … 0] = 0
• Issue: difficult to compute similarity (e.g., comparing "car" and "motorcycle")
• Idea: words with similar meanings often have similar neighbors
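To make the issue concrete, here is a minimal Python sketch (over an assumed toy vocabulary) of one-hot vectors: any two distinct words have dot product 0, so atomic symbols carry no notion of similarity.

import numpy as np

vocab = ["car", "motorcycle", "banana"]          # assumed toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

print(one_hot("car") @ one_hot("motorcycle"))    # 0.0 -- "car" looks unrelated to "motorcycle"
print(one_hot("car") @ one_hot("car"))           # 1.0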
Corpus-Based Representation
• Co-occurrence matrix
• Neighbor definition: full document vs. window
  • full document: a word-document co-occurrence matrix gives general topics
    → "Latent Semantic Analysis", "Latent Dirichlet Allocation"
  • window: a context window around each word
    → captures syntactic (e.g., POS) and semantic information
Window-Based Co-occurrence Matrix
• Example
  • Window length = 1
  • Left or right context
  • Corpus:
      I love AI.
      I love deep learning.
      I enjoy learning.

  Counts    | I  love  enjoy  AI  deep  learning
  ----------+-----------------------------------
  I         | 0   2     1     0    0     0
  love      | 2   0     0     1    1     0
  enjoy     | 1   0     0     0    0     1
  AI        | 0   1     0     0    0     0
  deep      | 0   1     0     0    0     1
  learning  | 0   0     1     0    1     0

  → related words now have similarity > 0
• Issues:
  ▪ matrix size increases with vocabulary
  ▪ high dimensional
  ▪ sparsity
• Idea: low-dimensional dense word vector
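A minimal Python sketch that reproduces the window-1 co-occurrence counts above from the three-sentence toy corpus (punctuation dropped):

import numpy as np

corpus = ["I love AI", "I love deep learning", "I enjoy learning"]
vocab = ["I", "love", "enjoy", "AI", "deep", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):              # window length = 1, left and right context
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

print(X)                                      # matches the table, e.g. X[idx["I"], idx["love"]] == 2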
Low-Dimensional Dense Word Vector
• Method 1: dimension reduction on the matrix
• Singular Value Decomposition (SVD) of the co-occurrence matrix X
  X ≈ U_k Σ_k V_k^T (keep only the top-k singular values)
• The resulting low-dimensional vectors capture semantic and syntactic relations (Rohde et al., "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence," 2005)
• Issues:
  ▪ computationally expensive
  ▪ difficult to add new words
• Idea: directly learn low-dimensional word vectors
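A minimal sketch of Method 1: apply a truncated SVD to the toy co-occurrence matrix from the previous slide and keep k = 2 dimensions per word (real systems such as LSA do this on much larger matrices).

import numpy as np

# the 6x6 window-1 co-occurrence matrix built earlier (rows/cols: I, love, enjoy, AI, deep, learning)
X = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]               # one dense 2-dim vector per word
print(word_vectors)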
Low-Dimensional Dense Word Vector
• Method 2: directly learn low-dimensional word vectors
  • Learning representations by back-propagation (Rumelhart et al., 1986)
  • A neural probabilistic language model (Bengio et al., 2003)
  • NLP (almost) from scratch (Collobert & Weston, 2008)
  • Widely-used models: word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
• Also known as "word embeddings"
Major Advantages of Word Embeddings
• Propagate any information into them via neural networks
• Form the basis for all language-related tasks
• Deep-learned word embeddings: the networks, R and Ws, can be updated during model training
Concluding Remarks
• Knowledge-based representation
• Corpus-based representation
  ✓ Atomic symbol
  ✓ Neighbors
    • High-dimensional sparse word vector
    • Low-dimensional dense word vector
      ▪ Method 1 – dimension reduction
      ▪ Method 2 – direct learning
Word Embeddings
Word2Vec Skip-Gram
Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
Mikolov et al., “Efficient estimation of word representations in vector space,” in ICLR Workshop, 2013.
Word2Vec – Skip-Gram Model
• Goal: predict surrounding words within a window of each word
• Objective function: maximize the probability of any context word given the current target (center) word
• Notation (from the figure): context window, outside (context) word vectors, target word vector
• Benefit: fast; easy to incorporate a new sentence/document or add a word to the vocabulary
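Written out (following Mikolov et al., 2013), the skip-gram objective averages the log probability of each context word within a window of size m around the target word, with the probability given by a softmax over the target vector v_c and the outside (context) vectors u_o:

J(\theta) = \frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}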
Word2Vec Skip-Gram Illustration
• Goal: predict surrounding words within a window of each word
• [Figure: one-hot input x (V-dim) → hidden layer h (N-dim) → output scores s (V-dim)]
Hidden Layer Matrix → Word Embedding Matrix
Weight Matrix Relation
• Hidden layer weight matrix = word vector lookup table
• The 300-dim feature representation has the ability to predict the contexts
Weight Matrix Relation
• Output layer weight matrix → weighted sum gives the final score for each word within the context window, normalized by a softmax
• Each vocabulary entry has two vectors: one as a target word and one as a context word
Loss Function
• Given a target word (w_I), the loss is the negative log likelihood of the observed context words under the softmax output (cross-entropy)
SGD Update for W′
• Given a target word (w_I), each output word w_j contributes an error term e_j = y_j − t_j, where the target t_j = 1 when w_j is within the context window and t_j = 0 otherwise; this error is used to update the output vectors in W′
SGD Update for W
• The same error terms are back-propagated through the hidden layer to update the input (target) word vectors in W
SGD Update
• Large vocabularies or large training corpora → expensive computations (the softmax touches every output vector)
• Solution: limit the number of output vectors that must be updated per training instance
Negative Sampling
• Idea: only update a sample of output vectors (the true context word plus a few sampled "negative" words)
Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
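A minimal numpy sketch (not the reference word2vec implementation) of the negative-sampling loss for one (target, context) pair with K sampled negative words; only the true context vector and the K negatives receive gradient updates.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    # v_c: target word vector, u_o: observed context word vector,
    # U_neg: (K, d) matrix of vectors for K sampled negative words
    pos = np.log(sigmoid(u_o @ v_c))                 # push the true pair together
    neg = np.sum(np.log(sigmoid(-(U_neg @ v_c))))    # push the sampled pairs apart
    return -(pos + neg)

d, K = 5, 3
print(neg_sampling_loss(np.random.randn(d), np.random.randn(d), np.random.randn(K, d)))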
Word2Vec Skip-Gram Visualization
• Skip-gram training data (from the wevi demo):
  apple|drink^juice,orange|eat^apple,rice|drink^juice,juice|drink^milk,milk|drink^rice,water|drink^milk,juice|orange^apple,juice|apple^drink,milk|rice^drink,drink|milk^water,drink|water^juice,drink|juice^water
• Interactive demo: https://ronxin.github.io/wevi/
Word2Vec Variants
• Skip-gram: predicting surrounding words given the target word (Mikolov+, 2013)
• CBOW (continuous bag-of-words): predicting the target word given the surrounding words (Mikolov+, 2013)
• LM (language modeling): predicting the next word given the preceding contexts (Mikolov+, 2013)
• Practice the derivation by yourself!
Mikolov et al., "Efficient estimation of word representations in vector space," in ICLR Workshop, 2013.
Mikolov et al., "Linguistic regularities in continuous space word representations," in NAACL HLT, 2013.
Word2Vec CBOW
• Goal: predict the target word given the surrounding words
Word2Vec LM
• Goal: predict the next word given the preceding contexts
Comparison
• Count-based
  • Examples: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
  • Pros
    ✓ Fast training
    ✓ Efficient usage of statistics
  • Cons
    ✗ Primarily used to capture word similarity
    ✗ Disproportionate importance given to large counts
• Direct prediction
  • Examples: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
  • Pros
    ✓ Generate improved performance on other tasks
    ✓ Capture complex patterns beyond word similarity
  • Cons
    ✗ Benefits mainly from a large corpus
    ✗ Inefficient usage of statistics
Combining the benefits from both worlds → GloVe
GloVe
Pennington et al., "GloVe: Global Vectors for Word Representation," in EMNLP, 2014.
GloVe
• Idea: ratios of co-occurrence probabilities can encode meaning
• P_ij is the probability that word w_j appears in the context of word w_i
• Relationship between the words w_i and w_j (example from the GloVe paper, ice vs. steam):

                            x = solid     x = gas       x = water     x = fashion
  P(x | ice)                1.9 × 10^−4   6.6 × 10^−5   3.0 × 10^−3   1.7 × 10^−5
  P(x | steam)              2.2 × 10^−5   7.8 × 10^−4   2.2 × 10^−3   1.8 × 10^−5
  P(x | ice) / P(x | steam)     8.9       8.5 × 10^−2       1.36          0.96

                            x = solid     x = gas       x = water     x = random
  P(x | ice)                large         small         large         small
  P(x | steam)              small         large         large         small
  P(x | ice) / P(x | steam) large         small         ~ 1           ~ 1
GloVe
• The relationship of w_i and w_j is approximated by the ratio of their co-occurrence probabilities with various probe words w_k: F(w_i, w_j, w̃_k) ≈ P_ik / P_jk
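For reference, the GloVe training objective from Pennington et al. (2014) is a weighted least-squares fit of the dot products to the log co-occurrence counts, where f(X_ij) is a weighting function that caps the influence of very large counts:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2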
GloVe
• Fast training, scalable, good performance even with a small corpus and small vectors
Word Vector Evaluation
Intrinsic Evaluation – Word Analogies
• Word vector linear relationships: a : b = c : ? → find the word closest to v_b − v_a + v_c
• Syntactic and semantic example questions [link]
• Issue: what if the information is there but not linear?
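A hedged Python sketch of the analogy test: answer "a : b = c : ?" by returning the vocabulary word whose vector is closest (by cosine similarity) to v_b − v_a + v_c; emb is an assumed dict mapping words to numpy vectors.

import numpy as np

def analogy(a, b, c, emb):
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):                      # exclude the question words themselves
            continue
        sim = (v @ target) / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. analogy("Chicago", "Illinois", "Houston", emb) should return "Texas"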
city-in-state (Issue: different cities may have the same name)
  Chicago : Illinois = Houston : Texas
  Chicago : Illinois = Philadelphia : Pennsylvania
  Chicago : Illinois = Phoenix : Arizona
  Chicago : Illinois = Dallas : Texas
  Chicago : Illinois = Jacksonville : Florida
  Chicago : Illinois = Indianapolis : Indiana
  Chicago : Illinois = Austin : Texas
  Chicago : Illinois = Detroit : Michigan
  Chicago : Illinois = Memphis : Tennessee
  Chicago : Illinois = Boston : Massachusetts

capital-country (Issue: can change with time)
  Abuja : Nigeria = Accra : Ghana
  Abuja : Nigeria = Algiers : Algeria
  Abuja : Nigeria = Amman : Jordan
  Abuja : Nigeria = Ankara : Turkey
  Abuja : Nigeria = Antananarivo : Madagascar
  Abuja : Nigeria = Apia : Samoa
  Abuja : Nigeria = Ashgabat : Turkmenistan
  Abuja : Nigeria = Asmara : Eritrea
  Abuja : Nigeria = Astana : Kazakhstan
superlative
  bad : worst = big : biggest
  bad : worst = bright : brightest
  bad : worst = cold : coldest
  bad : worst = cool : coolest
  bad : worst = dark : darkest
  bad : worst = easy : easiest
  bad : worst = fast : fastest
  bad : worst = good : best
  bad : worst = great : greatest

past tense
  dancing : danced = decreasing : decreased
  dancing : danced = describing : described
  dancing : danced = enhancing : enhanced
  dancing : danced = falling : fell
  dancing : danced = feeding : fed
  dancing : danced = flying : flew
  dancing : danced = generating : generated
  dancing : danced = going : went
  dancing : danced = hiding : hid
  dancing : danced = hitting : hit
Intrinsic Evaluation – Word Correlation
• Compare word-vector similarity with human-judged scores
• Human-judged word correlation [link]

  Word 1      Word 2      Human-Judged Score
  tiger       cat          7.35
  tiger       tiger       10.00
  book        paper        7.46
  computer    internet     7.58
  plane       car          5.77
  professor   doctor       6.62
  stock       phone        1.62

• Ambiguity: synonyms, or the same word with different POS tags
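A minimal sketch of the correlation benchmark: score each pair by cosine similarity in the embedding space and report Spearman's rank correlation against the human judgments; emb (word → vector) is again assumed to be given.

import numpy as np
from scipy.stats import spearmanr

pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("computer", "internet", 7.58),
         ("plane", "car", 5.77), ("professor", "doctor", 6.62), ("stock", "phone", 1.62)]

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(emb):
    model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human_scores = [s for _, _, s in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho                                   # higher rho = closer to human judgments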
Extrinsic Evaluation – Subsequent Task
• Goal: use word vectors in neural network models built for subsequent tasks
• Benefits
  • Ability to also classify words accurately
    • Ex: countries cluster together, so classifying location words should be possible with word vectors
  • Ability to incorporate any information into them for other tasks
    • Ex: project sentiment into words to find the most positive/negative words in a corpus
Concluding Remarks
• Low-dimensional word vectors
  • word2vec: Skip-gram, CBOW, LM
  • GloVe: combining count-based and direct learning
• Word vector evaluation
  • Intrinsic: word analogy, word correlation
  • Extrinsic: subsequent task
Recurrent Neural Network
Outline
• Language Modeling
  • N-gram Language Model
  • Feed-Forward Neural Language Model
  • Recurrent Neural Network Language Model (RNNLM)
• Recurrent Neural Network
  • Definition
  • Training via Backpropagation through Time (BPTT)
  • Training Issue
• Applications
  • Sequential Input
  • Sequential Output
    • Aligned Sequential Pairs (Tagging)
    • Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Language Modeling
• Goal: estimate the probability of a word sequence
• Example task: determine whether a sequence is grammatical or makes more sense
  • "recognize speech" or "wreck a nice beach"?
  • Output = "recognize speech" if P(recognize speech) > P(wreck a nice beach)
N-Gram Language Modeling
• Goal: estimate the probability of a word sequence
• N-gram language model
  • The probability is conditioned on a window of (n−1) previous words
  • Estimate the probability from counts in the training data, e.g., for a bigram:
    P(beach | nice) = C(nice beach) / C(nice)
    where C(nice beach) is the count of "nice beach" and C(nice) the count of "nice" in the training data
• Issue: some sequences may not appear in the training data
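A minimal Python sketch of the count-based estimate P(w2 | w1) = C(w1 w2) / C(w1) on a toy training corpus; the same zero-probability issue as on the next slide shows up for unseen bigrams.

from collections import Counter

corpus = ["the dog ran away", "the cat jumped high"]      # toy training data
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p("ran", "dog"))        # 1.0
print(p("jumped", "dog"))     # 0.0 -> unseen bigram, motivating smoothing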
N-Gram Language Modeling
• Training data:
  • The dog ran ……
  • The cat jumped ……
• P(jumped | dog) = 0 and P(ran | cat) = 0
  → give them some small probability (e.g., 0.0001): smoothing
➢ The zero probability is not accurate.
➢ This happens because we cannot collect all possible text in the world as training data.
Neural Language Modeling
• Idea: estimate the probability not from counts, but from a neural network's prediction
• At each step, a neural network takes the vector of the current word and outputs the probability of the next word:
  vector of "START" → P(next word is "wreck")
  vector of "wreck" → P(next word is "a")
  vector of "a"     → P(next word is "nice")
  vector of "nice"  → P(next word is "beach")
• P("wreck a nice beach") = P(wreck|START) · P(a|wreck) · P(nice|a) · P(beach|nice)
Neural Language Modeling
Bengio et al., "A Neural Probabilistic Language Model," in JMLR, 2003.
• [Figure: input (context word vectors) → hidden layer → output: probability distribution of the next word]
Neural Language Modeling
• The input-layer (or hidden-layer) representations of related words are close (e.g., dog, cat, rabbit)
• If P(jump | dog) is large, P(jump | cat) increases accordingly, even if "… cat jump …" never appears in the data
  → smoothing is done automatically
• Issue: fixed context window for conditioning
Recurrent Neural Network
• Idea: condition the neural network on all previous words and tie the weights at each time step
• Assumption: temporal information matters
RNN Language Modeling
• As before, each step maps the current word vector to the probability of the next word:
  vector of "START" → P(next word is "wreck")
  vector of "wreck" → P(next word is "a")
  vector of "a"     → P(next word is "nice")
  vector of "nice"  → P(next word is "beach")
• [Figure: input → hidden (context vector) → output (word probability distribution)]
• Idea: pass the information from the previous hidden layer to leverage all contexts
RNNLM Formulation
• At each time step t:
  s_t = f(U x_t + W s_{t−1})   (hidden state; f is the nonlinearity)
  y_t = softmax(V s_t)         (probability of the next word)
  where x_t is the vector of the current word and y_t is the probability distribution over the next word
Recurrent Neural Network Definition
• s_t = f(U x_t + W s_{t−1}), o_t = softmax(V s_t), where the nonlinearity f is typically tanh or ReLU
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
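A minimal numpy sketch of this definition for one time step, s_t = tanh(U x_t + W s_{t−1}) and o_t = softmax(V s_t), with toy dimensions:

import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    s_t = np.tanh(U @ x_t + W @ s_prev)          # new hidden state
    z = V @ s_t
    o_t = np.exp(z - z.max()); o_t /= o_t.sum()  # softmax over the vocabulary
    return s_t, o_t

d_in, d_h, vocab_size = 4, 3, 5                  # toy sizes
U = np.random.randn(d_h, d_in)
W = np.random.randn(d_h, d_h)
V = np.random.randn(vocab_size, d_h)

s = np.zeros(d_h)
for x in np.eye(d_in):                           # a toy sequence of one-hot inputs
    s, o = rnn_step(x, s, U, W, V)
print(o)                                         # probability distribution over the next word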
Model Training
• All model parameters (U, V, W) can be updated by comparing the predicted outputs with the targets y_{t−1}, y_t, y_{t+1} at every time step and applying gradient descent
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Backpropagation
• [Figure: forward pass computes activations a^l = σ(z^l) layer by layer through the weights w_ij between layer l−1 and layer l; the backward pass propagates error signals back toward the input]
Backpropagation
• [Figure: the error signal at output layer L, δ^L = ∂C/∂y ⊙ σ′(z^L), is propagated backward through the transposed weight matrices, δ^l = σ′(z^l) ⊙ (W^{l+1})^T δ^{l+1}, layer by layer down to layer l]
Backpropagation through Time (BPTT)
• Unfold the RNN over the input sequence x_1 … x_4:
  • Forward pass: compute s_1, s_2, s_3, s_4 (and outputs o_1 … o_4 against targets y_1 … y_4)
  • Per-step costs C(1), C(2), C(3), C(4)
  • Backward pass: for each C(t), backpropagate through all earlier time steps (the unfolded network shares the same weights)
RNN Training Issue
• The gradient is a product of Jacobian matrices, each associated with a step in the forward computation
• The same matrix is multiplied at each time step during backprop
• The gradient becomes very small or very large quickly → vanishing or exploding gradient
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
Rough Error Surface
• The error surface is either very flat or very steep
• [Figure: cost surface over parameters w1, w2 with a sharp cliff]
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. of Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
Possible Solutions
Recurrent Neural Network
Exploding Gradient: Clipping
• Idea: control the gradient value (clip its norm) to avoid exploding
• Parameter setting: thresholds from half to ten times the average gradient norm can still yield convergence
• [Figure: cost surface over w1, w2; the clipped gradient takes a smaller step off the cliff]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
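A minimal sketch of gradient-norm clipping in the spirit of Pascanu et al. (2013): if the overall gradient norm exceeds a threshold, rescale all gradients so the norm equals the threshold.

import numpy as np

def clip_gradients(grads, threshold):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

grads = [np.random.randn(3, 3) * 100, np.random.randn(3) * 100]   # exploding toy gradients
clipped = clip_gradients(grads, threshold=5.0)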
Vanishing Gradient: Gating Mechanism
• An RNN models temporal sequence information
• It can handle "long-term dependencies" in theory, e.g., "I grew up in France… I speak fluent French."
• Issue: in practice, RNNs cannot handle such long-term dependencies due to the vanishing gradient
  → apply a gating mechanism to directly encode the long-distance information
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory
Addressing Vanishing Gradient Problem
Long Short-Term Memory (LSTM)
• LSTMs are explicitly designed to avoid the long-term dependency problem
• [Figure: a vanilla RNN cell (single tanh layer) vs. the LSTM cell]
Long Short-Term Memory (LSTM)
• [Figure: the LSTM cell]
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
Long Short-Term Memory (LSTM)
• The cell state runs straight down the chain with only minor linear interactions
  → easy for information to flow along it unchanged
• Gates are a way to optionally let information through
  → composed of a sigmoid layer and a pointwise multiplication operation
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
Long Short-Term Memory (LSTM)
• Forget gate (a sigmoid layer): decides what information to throw away from the cell state
  • 1: "completely keep this"
  • 0: "completely get rid of this"
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
Long Short-Term Memory (LSTM)
• Input gate (a sigmoid layer): decides what new information to store in the cell state
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
Long Short-Term Memory (LSTM)
• Cell state update: forget the things we decided to forget earlier and add the new candidate values, scaled by how much we decided to update each state value
  • f_t: decides what to forget
  • i_t: decides what to update
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
Long Short-Term Memory (LSTM)
• Output gate (a sigmoid layer): decides what information to output from the cell state
Hochreiter and Schmidhuber, "Long short-term memory," in Neural Computation, 1997. [link]
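Putting the four gates together, here is a minimal numpy sketch (one possible parameterization, with weights acting on the concatenation [h_{t−1}, x_t]) of a single LSTM step:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to drop from c_{t-1}
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what new content to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell values
    c = f * c_prev + i * c_tilde              # cell state update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose
    h = o * np.tanh(c)
    return h, c

d_x, d_h = 4, 3
W = {k: np.random.randn(d_h, d_h + d_x) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(np.random.randn(d_x), np.zeros(d_h), np.zeros(d_h), W, b)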
Variants on LSTM
Addressing Vanishing Gradient Problem
LSTM with Peephole Connections
• Idea: allow the gate layers to look at the cell state
Gers and Schmidhuber, "Recurrent nets that time and count," in IJCNN, 2000. [link]
LSTM with Coupled Forget/Input Gates
• Idea: instead of separately deciding what to forget and what new information to add, make those decisions together
• We only forget when we are going to input something in its place, and vice versa
Gers and Schmidhuber, "Recurrent nets that time and count," in IJCNN, 2000. [link]
Gated Recurrent Unit
Addressing Vanishing Gradient Problem
Gated Recurrent Unit (GRU)
• Idea: combine the forget and input gates into a single "update gate", and merge the cell state and hidden state
• GRU is simpler and has fewer parameters than LSTM
• Update gate z_t and reset gate r_t
  • r_t = 0: ignore the previous memory and only store the new word information
Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014. [link]
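A minimal numpy sketch of one GRU step with update gate z_t and reset gate r_t (following the common convention from the Understanding-LSTMs post; biases omitted for brevity):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    xin = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ xin)                                   # update gate
    r = sigmoid(W["r"] @ xin)                                   # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde                       # merged hidden/cell state

d_x, d_h = 4, 3
W = {k: np.random.randn(d_h, d_h + d_x) for k in "zrh"}
h = gru_step(np.random.randn(d_x), np.zeros(d_h), W)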
Extension
Recurrent Neural Network
Bidirectional RNN
• h_t = [h_t(forward) ; h_t(backward)], the concatenation of the forward and backward hidden states, represents (summarizes) the past and future around a single token
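A hedged sketch of the bidirectional idea: run one RNN left-to-right and another right-to-left and concatenate their states at each position; step_fwd and step_bwd are assumed hidden-state update functions (e.g., the rnn_step above without the softmax).

import numpy as np

def bidirectional_states(xs, step_fwd, step_bwd, d_h):
    fwd, s = [], np.zeros(d_h)
    for x in xs:                                  # left-to-right pass
        s = step_fwd(x, s)
        fwd.append(s)
    bwd, s = [], np.zeros(d_h)
    for x in reversed(xs):                        # right-to-left pass
        s = step_bwd(x, s)
        bwd.append(s)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [h_fwd_t ; h_bwd_t]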
Deep Bidirectional RNN
• Each memory layer passes an intermediate representation to the next
How to Frame the Learning Problem?
• The learning algorithm f maps the input domain X to the output domain Y: f : X → Y
• Input domain: word, word sequence, audio signal, click logs
• Output domain: single label, sequence of tags, tree structure, probability distribution
• Network design should leverage the properties of the input and output domains
Input Domain – Sequence Modeling
• Idea: aggregate the meaning of all words into one vector (how to compute an N-dim vector for a whole sentence?)
• Methods:
  • Basic combination: average, sum
  • Neural combination:
    ✓ Recursive neural network (RvNN)
    ✓ Recurrent neural network (RNN)
    ✓ Convolutional neural network (CNN)
Sentiment Analysis
• Encode the sequential input ("This is very good") into a vector using an RNN
• [Figure: inputs x_1 … x_N feed the RNN; the resulting sentence vector feeds a classifier producing outputs y_1 … y_M]
• RNN considers temporal information to learn sentence vectors as the input of classification tasks
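A hedged sketch of the encoder-as-feature idea: the final RNN hidden state serves as the sentence vector and feeds a small classifier (a single logistic unit here); rnn_step_fn is an assumed hidden-state update function.

import numpy as np

def encode(word_vectors, rnn_step_fn, d_h):
    s = np.zeros(d_h)
    for x in word_vectors:            # e.g. vectors for "This", "is", "very", "good"
        s = rnn_step_fn(x, s)
    return s                          # sentence vector summarizing the whole sequence

def classify(sentence_vec, w, b):
    score = w @ sentence_vec + b
    return 1.0 / (1.0 + np.exp(-score))   # P(positive sentiment)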
Output Domain – Sequence Prediction
• POS tagging: "I like reading papers." → I/PN like/VBP reading/VBG papers/NNS
• Speech recognition: audio → "Hello"
• Machine translation: "How are you doing today?" → "你好嗎?"
• The output can be viewed as a sequence of classification decisions
POS Tagging
• Tag a word at each time step
  • Input: word sequence, e.g., "I love you"
  • Output: corresponding POS tag sequence, e.g., N V N
Natural Language Understanding (NLU)
• Tag a word at each time step
  • Input: word sequence
  • Output: IOB-format slot tags and an intent tag
• Example:
  <START> just sent email to bob about fishing this weekend <END>
           O    O    O    O  B-contact_name  O  B-subject  I-subject  I-subject
  intent: send_email
  → send_email(contact_name="bob", subject="fishing this weekend")
• Temporal orders for the input and output are the same