Introduction
Big data ≠ Big annotated data
Machine learning techniques include:
◦Supervised learning (if we have labelled data)
◦Reinforcement learning (if we have an environment for reward)
◦Unsupervised learning (if we do not have labelled data)
What can we do if there is not sufficient labeled training data?
Semi-Supervised Learning
(Figure: a small amount of labelled data — cat and dog images — together with a large amount of unlabeled data)
Why does semi-supervised learning help?
The distribution of the unlabeled data provides some cues
Transfer Learning
(Figure: a small amount of labelled data for the target task — cats and dogs — plus labeled data from a different but related task — elephants and tigers)
Widely used in image processing:
◦Use sufficient labeled data to learn a CNN
◦Use this CNN as a feature extractor
(Figure: a deep neural network — the input pixels x1 … xN pass through Layer 1, Layer 2, …, Layer L to produce the output label "elephant")
Transfer Learning Example
(Figure: an analogy from the manga Bakuman (爆漫王) — the survival rules of a manga artist transfer to those of a graduate student:
◦manga artist ↔ graduate student
◦editor-in-charge ↔ advisor
◦submitting to Jump ↔ submitting to journals
◦drawing storyboards ↔ running experiments
◦manga artist online ↔ graduate student online)
Self-Taught Learning
The unlabeled data is sometimes not related to the task
(Figure: labelled cat/dog data plus unlabeled data — just millions of images crawled from the Internet)
The unlabeled data is sometimes not related to the task, for example:
◦Digit recognition: unlabeled handwritten characters
◦Speech recognition: unlabeled English speech
◦Document classification: unlabeled news webpages
How does self-taught learning work?
Why do unlabeled and unrelated data help the task?
Finding latent factors that control the observations
Latent Factors for Handwritten Digits
Latent Factors for Documents
http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
Latent Factors for Recommendation System
(Figure: users A, B, and C placed in a 2-D latent space for recommendation; the axes are character attributes such as "naive" (單純呆) and "tsundere" (傲嬌))
Latent Factor Exploitation
Handwritten digits
The handwritten images are composed of strokes
(Figure: a dictionary of stroke templates — latent factors No. 1, No. 2, No. 3, No. 4, No. 5, …)
Latent Factor Exploitation
(Figure: a 28 × 28 digit image decomposed as the sum of strokes No. 1, No. 3, and No. 5 from the stroke dictionary)
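The decomposition above can be sketched numerically: if a digit image is (approximately) a sum of a few stroke templates, then a 5-dim selection vector over the stroke dictionary is a far more compact code than the raw 784 pixels. The stroke patterns below are random toy stand-ins, not real learned factors:

```python
import numpy as np

# Toy stroke dictionary: 5 latent factors, each a 28x28 image flattened to 784.
rng = np.random.default_rng(0)
strokes = (rng.random((5, 784)) > 0.9).astype(float)  # sparse binary "strokes"

# A digit image composed of strokes No. 1, No. 3, and No. 5 (indices 0, 2, 4).
code = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # compact 5-dim latent representation
image = code @ strokes                       # the 784-dim observed image
print(image.shape, code.shape)               # (784,) (5,)
```

In this toy setting the 5-dim code determines the 784-dim image exactly; a real model would only reconstruct it approximately.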
Autoencoder
Representation Learning
Autoencoder
Represent a digit using 28 × 28 = 784 dimensions, but not all 28 × 28 images are digits
Idea: represent the images of digits in a more compact way
(Figure: an NN encoder and an NN decoder are learned together; the code between them — usually < 784 dimensions — is a compact representation of the input object from which it can be reconstructed)
(Figure: input layer x → hidden (bottleneck) layer a → output layer y)
encode: a = σ(Wx + b)
decode: y = σ(W′a + b′)
Minimize ‖x − y‖² so the output is as close as possible to the input
The output of the hidden layer is the code
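A minimal sketch of the encode/decode computation above, with randomly initialized weights; a real autoencoder would train W, W′, b, b′ to minimize ‖x − y‖², and the 30-dim code size here is illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d_in, d_code = 784, 30                      # 28x28 input, 30-dim bottleneck
W = rng.normal(0, 0.01, (d_code, d_in))     # encoder weights W
b = np.zeros(d_code)
Wp = rng.normal(0, 0.01, (d_in, d_code))    # decoder weights W'
bp = np.zeros(d_in)

x = rng.random(d_in)                        # a stand-in "image"
a = sigmoid(W @ x + b)                      # encode: a = sigma(W x + b)
y = sigmoid(Wp @ a + bp)                    # decode: y = sigma(W' a + b')
loss = np.sum((x - y) ** 2)                 # reconstruction error ||x - y||^2
print(a.shape, y.shape)                     # (30,) (784,)
```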
De-noising autoencoder
(Figure: add noise to x to obtain x′, encode x′ into a, decode a into y, and make y as close as possible to the original clean x)
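The de-noising variant only changes the input: corrupt x into x′, feed x′ through the same encoder/decoder, but still compare the output against the clean x. A sketch of the corruption step, assuming an additive-Gaussian corruption scheme (masking pixels to zero is another common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                        # clean input image

# Corrupt the input with additive Gaussian noise.
x_noisy = x + rng.normal(0.0, 0.2, x.shape)

# Training pairs for a de-noising autoencoder: feed x_noisy, reconstruct x.
# The loss is still ||x - y||^2 against the CLEAN input, not the noisy one.
assert x_noisy.shape == x.shape
assert not np.allclose(x_noisy, x)         # the corruption actually changed x
```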
Deep Autoencoder
Initialize by RBM layer-by-layer
(Figure: a deep encoder — input layer through successive layers down to a bottleneck Code layer — mirrored by a deep decoder back to the output layer; the reconstruction x̂ is trained to be as close as possible to the input x)
Hinton and Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006.
(Figure: reconstructions of the original image — PCA with a 30-dim code (784–30–784) vs. a deep autoencoder (784–1000–500–250–30–250–500–1000–784))
Feature Representation
(Figure: a deep autoencoder 784–1000–500–250–2–250–500–1000–784 maps each digit image to a 2-dim code for visualization)
Auto-encoder – Text Retrieval
Bag-of-words / vector space model: represent each document and query as a vector over the vocabulary
word string: "This is an apple"
this: 1, is: 1, a: 0, an: 1, apple: 1, pen: 0
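The bag-of-words encoding above can be written in a few lines; the six-word vocabulary matches the slide:

```python
# Bag-of-words encoding over a fixed vocabulary (as on the slide).
vocab = ["this", "is", "a", "an", "apple", "pen"]

def bag_of_words(text):
    words = text.lower().split()
    return [1 if w in words else 0 for w in vocab]

vec = bag_of_words("This is an apple")
print(vec)  # [1, 1, 0, 1, 1, 0]
```

Documents and queries encoded this way live in the same vector space and can be compared directly, which is what the retrieval slides that follow exploit.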
(Figure: a 2000-dim bag-of-words vector for a document or query is compressed through layers 500–250–125 down to a 2-dim code)
Documents talking about the same thing will have close codes
Autoencoder – Similar Image Retrieval
Retrieved using Euclidean distance in pixel intensity space
(Figure: a deep autoencoder compresses a 32×32 image through layers 8192–4096–2048–1024–512–256 into a 256-dim code; trained on millions of images crawled from the Internet)
Images retrieved using Euclidean distance in pixel intensity space
Images retrieved using 256 codes
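Retrieval in code space reduces to nearest-neighbour search on the 256-dim codes instead of the raw pixels; a sketch with random stand-in codes in place of real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((1000, 256))                # 256-dim codes for 1000 images
query = codes[42] + rng.normal(0, 0.01, 256)   # a slightly perturbed copy

# Euclidean distance in code space (vs. pixel space on the previous slide).
dists = np.linalg.norm(codes - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest)  # 42 -- the near-duplicate is retrieved
```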
Autoencoder for DNN Pre-Training
Greedy layer-wise pre-training again
(Figure: the target network is input 784 → 1000 → 1000 → 500 → output 10; first train an autoencoder 784 → 1000 → 784 on the input x and keep the encoder weights W1, discarding W1′)
(Figure: keep W1 fixed, compute the first-layer activations a1, then train a second autoencoder 1000 → 1000 → 1000 on a1 and keep its encoder weights W2, discarding W2′)
(Figure: keep W1 and W2 fixed, compute the second-layer activations a2, then train a third autoencoder 1000 → 500 → 1000 on a2 and keep its encoder weights W3, discarding W3′)
(Figure: stack W1, W2, and W3, add a randomly initialized output layer W4 (500 → 10), and fine-tune the whole network via backprop)
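The steps above form a loop: for each hidden layer, train a small autoencoder on the previous layer's fixed activations, keep the encoder weights, discard the decoder; then add a randomly initialized output layer and fine-tune everything. This sketch only wires up the forward passes (no actual autoencoder training) to show the weight shapes and stacking order:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
sizes = [784, 1000, 1000, 500]       # layer sizes from the slides
X = rng.random((32, 784))            # a mini-batch of stand-in images

weights, h = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    # Pretraining step (sketched): an autoencoder d_in -> d_out -> d_in
    # would be trained on h; here we just initialize its encoder W.
    W = rng.normal(0, 0.01, (d_in, d_out))
    weights.append(W)
    h = sigmoid(h @ W)               # fix W, pass activations to the next step

W_out = rng.normal(0, 0.01, (500, 10))  # random init for the output layer
logits = h @ W_out                   # then fine-tune the whole stack via backprop
print(logits.shape)                  # (32, 10)
```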
Variational Autoencoder
Representation Learning and Generation
Generation from Latent Codes
(Figure: the trained decoder maps a latent code a to an output y)
How can we set a latent code for generation?
Latent Code Distribution Constraints
Constrain the distribution of the learned latent codes, then generate by sampling a latent code from a prior distribution
(Figure: AE — x is encoded into a and decoded into y, trained for reconstruction; VAE — the code is sampled from a distribution output by the encoder, and the decoder reconstructs from that sample)
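In a VAE the encoder outputs a distribution (a mean and a log-variance) rather than a single code; the code is sampled via the reparameterization trick, and a KL term keeps the code distribution close to the standard-normal prior, so that sampling z ~ N(0, I) at generation time yields valid codes. A sketch of the sampling and KL computation, with stand-in encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_code = 8

# Pretend the encoder produced these for one input x.
mu = rng.normal(0, 0.1, d_code)          # predicted mean
logvar = rng.normal(0, 0.1, d_code)      # predicted log-variance

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.standard_normal(d_code)
z = mu + np.exp(0.5 * logvar) * eps      # sampled latent code (fed to decoder)

# KL divergence between N(mu, sigma^2) and the prior N(0, I), always >= 0.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
print(z.shape)  # (8,)
```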
VAE for Music Generation
Each note is represented via a note dictionary as noteₜ = (Pₜ, Tₜ, dTₜ)
(Figure: a modularized recurrent encoder reads the note sequence note₀, note₁, …, noteₜ into a latent code z via variational inference; a modularized note-unrolling decoder, conditioned on z at every step, predicts Pₜ, Tₜ, and dTₜ through fully-connected layers and a reverse note dictionary)
http://mvae.miulab.tw
Distant Supervision
Representation Learning by Weak Labels
(Figure: the CDSSM architecture — each word in the word sequence x (e.g. "how about we discuss this later") is mapped by the word hashing matrix Wh from a 20K vocabulary to a 1000-dim word hashing layer lh; the convolution matrix Wc produces a 300-dim convolutional layer lc; a max pooling operation yields the max pooling layer lm; the semantic projection matrix Ws produces the semantic layer y; the query Q and documents D1 … Dn are compared by CosSim(Q, Di), which defines P(Di | Q))
Convolutional Deep Structured Semantic Models (CDSSM/DSSM)
Huang et al., "Learning deep structured semantic models for web search using clickthrough data," in Proc. of CIKM, 2013.
Shen et al., "Learning semantic representations using convolutional neural networks for web search," in Proc. of WWW, 2014.
Training maximizes the likelihood of clicked documents given the queries
Semantically related documents are close to the query in the encoded space
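The scoring step can be sketched directly: given encoded semantic vectors for the query and the documents, compute cosine similarities and softmax them into P(Di | Q); training then maximizes this probability for clicked documents. The vectors below are random stand-ins for the network outputs, and a real DSSM additionally scales the similarities by a smoothing factor before the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.random(300)                   # encoded query (semantic layer y)
docs = rng.random((4, 300))           # encoded documents D1..D4

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([cos_sim(q, d) for d in docs])   # CosSim(Q, Di)

# Softmax over the similarities gives P(Di | Q).
p = np.exp(sims) / np.sum(np.exp(sims))
print(p.shape)  # (4,)
```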
Multi-Tasking
Representation Learning by Different Tasks
Task-Shared Representation
(Figure: Task 1 and Task 2 share the lower network layers and each has its own task-specific output layers on top of the shared representation)
The latent factors can be learned by different tasks
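A minimal sketch of the shared-representation idea: both tasks pass through the same first layer, and each task has its own head, so gradients from either task would update the shared weights. The sizes and the example tasks are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
W_shared = rng.normal(0, 0.1, (784, 256))  # layers learned by BOTH tasks
W_task1 = rng.normal(0, 0.1, (256, 10))    # head for task 1 (e.g. digits)
W_task2 = rng.normal(0, 0.1, (256, 2))     # head for task 2 (e.g. cat vs. dog)

x = rng.random(784)
h = sigmoid(x @ W_shared)      # shared latent representation
out1 = h @ W_task1             # task-1 prediction
out2 = h @ W_task2             # task-2 prediction
print(out1.shape, out2.shape)  # (10,) (2,)
```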
Concluding Remarks
Labeling data is expensive, but we have large amounts of unlabeled data
Autoencoder / VAE
◦exploit unlabeled data to learn latent factors as representations
◦the learned representations can be transferred to other tasks
Distant Labels / Labels from Other Tasks
◦learn representations that are useful for other tasks
◦the learned representations may also be useful for the target task