Unsupervised Learning
Applied Deep Learning
May 25th, 2020 http://adl.miulab.tw
Introduction
◉ Big data ≠ Big annotated data
◉ Machine learning techniques include:
○ Supervised learning (if we have labelled data)
○ Reinforcement learning (if we have an environment for reward)
○ Unsupervised learning (if we do not have labelled data)
What can we do if we do not have sufficient training data?
2
Semi-Supervised Learning
Labelled Data
Unlabeled Data
cat dog
(Image of cats and dogs without labeling)
3
Semi-Supervised Learning
◉ Why does semi-supervised learning help?
The distribution of the unlabeled data provides some cues
4
Transfer Learning
Source Data
Target Data
cat dog
Not related to the task considered
elephant elephant tiger tiger
5
Transfer Learning
◉ Widely used in image processing
○ Using sufficient labeled data to learn a CNN
○ Using this CNN as a feature extractor
(Diagram: input pixels x1, x2, …, xN pass through Layer 1, Layer 2, …, Layer L to output the label "elephant")
6
Transfer Learning Example
(Analogy between the manga 爆漫王 (Bakuman) and a graduate student's survival guide)
manga artist ↔ graduate student
editor (責編) ↔ advisor (指導教授)
submit to Jump ↔ submit to journals
draw storyboards ↔ run experiments
"manga artist online" ↔ "graduate student online"
7
Self-Taught Learning
◉ The unlabeled data is sometimes not related to the task
Unlabeled Data (just crawl millions of images from the Internet)
Labelled Data: cat, dog
8
Self-Taught Learning
◉ The unlabeled data is sometimes not related to the task
Labelled Data → Unlabeled Data
○ Digit recognition: digits → characters
○ Speech recognition: Taiwanese → English, Chinese
○ Document classification: news → webpages
……
Why can we use unlabeled and unrelated data to help our tasks?
9
Self-Taught Learning
◉ How does self-taught learning work?
◉ Why does unlabeled and unrelated data help the tasks?
Finding latent factors that control the observations
10
Latent Factors for Handwritten Digits
11
Latent Factors for Documents
http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
12
Latent Factors for Recommendation System
(Diagram: A, B, C placed along two latent factors: 單純呆 (innocent/naive) vs. 傲嬌 (tsundere))
13
Latent Factor Exploitation
◉ Handwritten digits
The handwritten images are composed of strokes
Strokes (Latent Factors)
…….
No. 1 No. 2 No. 3 No. 4 No. 5
14
Latent Factor Exploitation
Each digit image is 28 × 28, i.e., represented by 28 × 28 = 784 pixels.
A digit image = stroke No. 1 + stroke No. 3 + stroke No. 5
→ code over the strokes (latent factors) No. 1 … No. 5: [1 0 1 0 1 0 …….] (a simpler representation)
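As a toy illustration of this idea, the sketch below composes a digit image from a binary code over five strokes; the stroke images here are random numpy placeholders, not real strokes.

```python
import numpy as np

# Five stroke images acting as latent factors (random placeholders, 28x28 each)
rng = np.random.default_rng(0)
strokes = rng.random((5, 28, 28))

# Code [1 0 1 0 1]: the digit uses strokes No. 1, No. 3, and No. 5
code = np.array([1, 0, 1, 0, 1])
image = np.tensordot(code, strokes, axes=1)   # sum of the selected strokes

print(image.shape)   # (28, 28) -> 784 pixels
print(code)          # the 5-dim code is a much simpler representation
```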
15
Representation Learning
Autoencoder
16
Autoencoder
◉ Represent a digit using 28 × 28 = 784 dimensions
◉ Not all 28 × 28 images are digits
Idea: represent the images of digits in a more compact way
NN Encoder: input → code, a compact representation of the input object (usually < 784 dims)
NN Decoder: code → reconstruct the original object
The encoder and decoder are learned together.
17
Autoencoder
Input layer 𝑥 → hidden (bottleneck) layer 𝑎 → output layer 𝑦
encode: 𝑎 = 𝜎(𝑊𝑥 + 𝑏)
decode: 𝑦 = 𝜎(𝑊′𝑎 + 𝑏′)
The output of the hidden layer is the code.
Make 𝑦 as close as possible to 𝑥: minimize ‖𝑥 − 𝑦‖²
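A minimal sketch of this single-hidden-layer autoencoder in PyTorch; the 64-dim code size, the optimizer, and the random input batch are illustrative choices, not part of the slide.

```python
import torch
import torch.nn as nn

# a = sigma(W x + b): encoder;  y = sigma(W' a + b'): decoder
encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())   # bottleneck: 64 < 784 (arbitrary)
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)          # a batch of flattened 28x28 images (placeholder data)
a = encoder(x)                   # the code: output of the hidden layer
y = decoder(a)                   # reconstruction
loss = ((x - y) ** 2).mean()     # minimize ||x - y||^2
opt.zero_grad(); loss.backward(); opt.step()
```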
18
Autoencoder
◉ De-noising auto-encoder
Add noise to the input: 𝑥 → 𝑥′
encode: 𝑥′ → 𝑊 → 𝑎; decode: 𝑎 → 𝑊′ → 𝑦
Make 𝑦 as close as possible to the original (clean) 𝑥
Rifai, et al. "Contractive auto-encoders: Explicit invariance during feature extraction,“ in ICML, 2011.
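A short sketch of the de-noising variant: corrupt the input, but still reconstruct the clean image. The Gaussian noise and the 0.2 scale are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(32, 784)                   # clean images (placeholder data)
x_noisy = x + 0.2 * torch.randn_like(x)   # x': input with added noise
y = decoder(encoder(x_noisy))
loss = ((x - y) ** 2).mean()              # compare against the clean x, not x'
```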
19
Deep Autoencoder
Input layer 𝑥 → Layer → Layer → … → bottleneck layer (Code) → … → Layer → Layer → Output layer
The output should be as close as possible to the input 𝑥.
Hinton and Salakhutdinov. “Reducing the dimensionality of data with neural networks,” Science, 2006.
20
Deep Autoencoder
Original image (784 pixels)
PCA: 784 → 30 → 784
Deep auto-encoder: 784 → 1000 → 500 → 250 → 30 → 250 → 500 → 1000 → 784
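The deep auto-encoder above, sketched in PyTorch with the slide's layer sizes; the ReLU/Sigmoid activations are an assumption (Hinton & Salakhutdinov trained it with layer-wise pre-training rather than this plain setup).

```python
import torch.nn as nn

# 784 - 1000 - 500 - 250 - 30 - 250 - 500 - 1000 - 784
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),                   # 30-dim code at the bottleneck
)
decoder = nn.Sequential(
    nn.Linear(30, 250), nn.ReLU(),
    nn.Linear(250, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),   # pixel intensities in [0, 1]
)
```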
21
Feature Representation
Use a 2-dim code as the feature representation: 784 → 1000 → 500 → 250 → 2 → 250 → 500 → 1000 → 784
22
Auto-encoder – Text Retrieval
word string: "This is an apple"
Bag-of-word (vector space model): this=1, is=1, a=0, an=1, apple=1, pen=0, …
Each document and query is represented as a bag-of-word vector; semantics are not considered.
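A toy sketch of the bag-of-word vector for the slide's example, using binary occurrence over a six-word vocabulary:

```python
vocab = ["this", "is", "a", "an", "apple", "pen"]
text = "This is an apple"

tokens = text.lower().split()
bow = [1 if word in tokens else 0 for word in vocab]
print(bow)   # [1, 1, 0, 1, 1, 0]
```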
23
Autoencoder – Text Retrieval
Bag-of-word vector of a document or query: 2000 → 500 → 250 → 125 → 2 (code)
Documents talking about the same thing will have close codes.
24
Auto-Encoding (AE)
◉ Objective: reconstructing 𝑥̄ from 𝑥̂
○ dimension reduction or denoising (masked LM)
Randomly mask 15% of tokens
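A minimal sketch of this masked-LM style corruption in PyTorch; the vocabulary size, the mask token ID, and applying the 15% rate independently per token are assumptions made for illustration.

```python
import torch

tokens = torch.randint(1, 30000, (8, 128))      # a batch of token IDs (placeholder data)
MASK_ID = 0                                     # assumed ID of the [MASK] token

mask = torch.rand(tokens.shape) < 0.15          # mask each token with probability 15%
corrupted = tokens.masked_fill(mask, MASK_ID)   # the corrupted input
targets = tokens.masked_fill(~mask, -100)       # reconstruct only the masked positions
                                                # (-100 = ignore_index for cross-entropy)
```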
25
Autoencoder – Similar Image Retrieval
◉ Retrieved using Euclidean distance in pixel intensity space
Krizhevsky et al., "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
26
Autoencoder – Similar Image Retrieval
32 × 32 image → 8192 → 4096 → 2048 → 1024 → 512 → 256 (code)
(trained on millions of images crawled from the Internet)
27
Autoencoder – Similar Image Retrieval
◉ Images retrieved using Euclidean distance in pixel intensity space
◉ Images retrieved using 256 codes
Learning the useful latent factors
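A sketch of the retrieval step: Euclidean nearest neighbours, either directly in pixel-intensity space or in the 256-dim code space produced by a trained encoder. The data and the encoder here are placeholders.

```python
import torch

def nearest(query, database, k=5):
    """Indices of the k nearest neighbours under Euclidean distance."""
    dists = torch.cdist(query.unsqueeze(0), database).squeeze(0)
    return dists.topk(k, largest=False).indices

images = torch.rand(10000, 32 * 32 * 3)   # database in pixel-intensity space (placeholder)
query = torch.rand(32 * 32 * 3)
pixel_hits = nearest(query, images)       # retrieval in pixel space

# With a trained encoder mapping images to 256-dim codes, retrieve in code space instead:
#   codes = encoder(images)
#   code_hits = nearest(encoder(query.unsqueeze(0)).squeeze(0), codes)
```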
28
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 1: train an autoencoder on the input 𝑥: 784 → W1 → 1000 → W1’ → 784 (reconstruct 𝑥)
29
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 2: fix W1; compute 𝑎1 (1000) from 𝑥; train an autoencoder on 𝑎1: 1000 → W2 → 1000 → W2’ → 1000 (reconstruct 𝑎1)
30
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 3: fix W1 and W2; compute 𝑎2 (1000) from 𝑥; train an autoencoder on 𝑎2: 1000 → W3 → 500 → W3’ → 1000 (reconstruct 𝑎2)
31
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → W1 → 1000 → W2 → 1000 → W3 → 500 → W4 → output (10)
Step 4: randomly initialize W4 for the output layer, then fine-tune the whole network via backprop
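A compact sketch of the greedy layer-wise procedure in PyTorch: each stage trains one autoencoder, keeps its encoder weights, fixes them, and feeds the resulting codes to the next stage. The optimizer, activations, and step counts are illustrative.

```python
import torch
import torch.nn as nn

sizes = [784, 1000, 1000, 500]                       # hidden structure of the target network
layers = [nn.Sequential(nn.Linear(i, o), nn.Sigmoid())
          for i, o in zip(sizes[:-1], sizes[1:])]    # W1, W2, W3 (with activations)

x = torch.rand(256, 784)                             # unlabeled images (placeholder data)

inputs = x
for layer, (d_in, d_out) in zip(layers, zip(sizes[:-1], sizes[1:])):
    decoder = nn.Linear(d_out, d_in)                 # Wi' (only used during pre-training)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(100):                             # a few reconstruction steps per stage
        recon = decoder(layer(inputs))
        loss = ((recon - inputs) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    inputs = layer(inputs).detach()                  # fix Wi; its codes feed the next stage

# Add a randomly initialized output layer (W4) and fine-tune everything via backprop.
model = nn.Sequential(*layers, nn.Linear(500, 10))
```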
32
Representation Learning and Generation
Variational Autoencoder
33
Generation from Latent Codes
𝑥 → encode (𝑊) → 𝑎 → decode (𝑊′) → 𝑦
How can we set a latent code for generation?
34
Latent Code Distribution Constraints
◉ Constrain the distribution of the learned latent codes
◉ Generate the latent code by sampling from a prior distribution
𝑥 → encode → 𝑎 (sampling) → decode → 𝑦
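A minimal VAE-style sketch of these two ideas: the encoder outputs a mean and log-variance, the code is sampled via the reparameterization trick, and a KL term pulls the code distribution toward a standard normal prior, so new objects can be generated by sampling codes from that prior. The 16-dim code and plain linear layers are placeholders.

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 16)                # predicts mean and log-variance of a 16-dim code
dec = nn.Linear(16, 784)

x = torch.rand(32, 784)                     # placeholder data
mu, logvar = enc(x).chunk(2, dim=-1)
a = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # sampled code (reparameterization)

recon = ((dec(a) - x) ** 2).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean() # match the N(0, I) prior
loss = recon + kl

# Generation: sample a latent code from the prior and decode it
new_x = dec(torch.randn(1, 16))
```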
35
Reconstruction
(Figure: reconstruction results of the AE vs. the VAE)
36
Representation Learning by Weak Labels
Distant Supervision
37
Convolutional Deep Structured Semantic Models (CDSSM/DSSM)
Architecture (per query/document): word sequence x = w1 w2 … wd → word hashing layer lh (word hashing matrix Wh, 20K dims per word) → convolutional layer lc (convolution matrix Wc, 1000 dims) → max pooling operation → max pooling layer lm → semantic layer y (semantic projection matrix Ws, 300 dims)
The query Q and documents D1 … Dn are each encoded into semantic vectors; relevance is scored with CosSim(Q, Di) and normalized into P(Di | Q).
Huang et al., "Learning deep structured semantic models for web search using clickthrough data," in Proc. of CIKM, 2013.
Shen et al., "Learning semantic representations using convolutional neural networks for web search," in Proc. of WWW, 2014.
maximizes the likelihood of clicked documents given queries
Semantically related documents are close to the query in the encoded space
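A sketch of this training signal: cosine similarities between the query code and the candidate document codes are softmax-normalized into P(Di | Q), and the negative log-likelihood of the clicked document is minimized. The encoder outputs and the smoothing factor are placeholders.

```python
import torch
import torch.nn.functional as F

q = torch.rand(300)              # semantic vector y of the query Q (placeholder encoder output)
docs = torch.rand(5, 300)        # semantic vectors of D1 ... Dn
clicked = 2                      # index of the clicked document

cos = F.cosine_similarity(q.unsqueeze(0), docs, dim=-1)   # CosSim(Q, Di)
log_p = F.log_softmax(10.0 * cos, dim=-1)                 # P(Di | Q); 10.0 is a smoothing factor
loss = -log_p[clicked]                                    # maximize likelihood of the clicked doc
```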
38
Representation Learning by Different Tasks
Multi-Task Learning
39
Task-Shared Representation
Task 1 Task 2
The latent factors can be learned by different tasks
40
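A sketch of a task-shared representation in PyTorch: one shared encoder learns the latent factors, and each task adds its own head. The sizes, losses, and random batches are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(784, 256), nn.ReLU())    # latent factors shared across tasks
head1 = nn.Linear(256, 10)                                 # Task 1 head
head2 = nn.Linear(256, 5)                                  # Task 2 head

x1, y1 = torch.rand(32, 784), torch.randint(0, 10, (32,)) # Task 1 batch (placeholder)
x2, y2 = torch.rand(32, 784), torch.randint(0, 5, (32,))  # Task 2 batch (placeholder)

loss = (F.cross_entropy(head1(shared(x1)), y1) +
        F.cross_entropy(head2(shared(x2)), y2))
loss.backward()   # gradients from both tasks update the shared encoder
```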
Semi-Supervised Multi-Task SLU
(Lan et al., 2018)
◉ Idea: the language understanding objective can enhance other tasks
41
O. Lan, S. Zhu, and K. Yu, “Semi-supervised Training using Adversarial Multi-task Learning for Spoken Language Understanding,” in Proceedings of ICASSP, 2018.
Slot Tagging
Model
BLM exploits the unsupervised knowledge, while the shared-private framework and adversarial training make the slot tagging model more generalizable
MT-DNN
(Liu et al., 2019)
42
https://github.com/namisan/mt-dnn
Concluding Remarks
◉ Labeling data is expensive, but we have large unlabeled data
◉ Autoencoder
○ exploits unlabeled data to learn latent factors as representations
○ learned representations can be transferred to other tasks
◉ Distant Labels / Labels from Other Tasks
○ learn the representations that are useful for other tasks
○ learned representations may also be useful for the target task
43