Introduction
Big data ≠ Big annotated data
Machine learning techniques include:
◦Supervised learning (if we have labelled data)
◦Reinforcement learning (if we have an environment for reward)
◦Unsupervised learning (if we do not have labelled data)
What can we do if there is not sufficient labeled training data?
Semi-Supervised Learning
(Figure: a small amount of labelled data — cat and dog images — together with a large amount of unlabeled data)
Why does semi-supervised learning help?
The distribution of the unlabeled data provides some cues
Transfer Learning
(Figure: a small amount of labelled data for the target task — cats and dogs — plus labeled data from a different but related task — elephants and tigers)
Widely used in image processing:
◦Use sufficient labeled data to learn a CNN
◦Use this CNN as a feature extractor
(Figure: a deep neural network — the input pixels x1 … xN pass through Layer 1, Layer 2, …, Layer L to produce the output label "elephant")
Transfer Learning Example
(Figure: an analogy from the manga Bakuman (爆漫王) — the survival rules of a manga artist transfer to those of a graduate student:
◦manga artist ↔ graduate student
◦editor-in-charge ↔ advisor
◦submitting to Jump ↔ submitting to journals
◦drawing storyboards ↔ running experiments
◦manga artist online ↔ graduate student online)
Self-Taught Learning
The unlabeled data is sometimes not related to the task
(Figure: labelled cat/dog data plus unlabeled data — just millions of images crawled from the Internet)
The unlabeled data is sometimes not related to the task, for example:
◦Digit recognition: unlabeled handwritten characters
◦Speech recognition: unlabeled English speech
◦Document classification: unlabeled news webpages
How does self-taught learning work?
Why do unlabeled and unrelated data help the task?
Finding latent factors that control the observations
Latent Factors for Handwritten Digits
Latent Factors for Documents
http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
Latent Factors for Recommendation System
(Figure: users A, B, and C placed in a 2-D latent space for recommendation; the axes are character attributes such as "naive" (單純呆) and "tsundere" (傲嬌))
Latent Factor Exploitation
Handwritten digits
The handwritten images are composed of strokes
(Figure: a dictionary of stroke templates — latent factors No. 1, No. 2, No. 3, No. 4, No. 5, …)
Latent Factor Exploitation
(Figure: a 28 × 28 digit image decomposed as the sum of strokes No. 1, No. 3, and No. 5 from the stroke dictionary)
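The decomposition above can be sketched numerically: if a digit image is (approximately) a sum of a few stroke templates, then a 5-dim selection vector over the stroke dictionary is a far more compact code than the raw 784 pixels. The stroke patterns below are random toy stand-ins, not real learned factors:

```python
import numpy as np

# Toy stroke dictionary: 5 latent factors, each a 28x28 image flattened to 784.
rng = np.random.default_rng(0)
strokes = (rng.random((5, 784)) > 0.9).astype(float)  # sparse binary "strokes"

# A digit image composed of strokes No. 1, No. 3, and No. 5 (indices 0, 2, 4).
code = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # compact 5-dim latent representation
image = code @ strokes                       # the 784-dim observed image
print(image.shape, code.shape)               # (784,) (5,)
```

In this toy setting the 5-dim code determines the 784-dim image exactly; a real model would only reconstruct it approximately.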
Autoencoder
Representation Learning
Autoencoder
Represent a digit using 28 × 28 = 784 dimensions, but not all 28 × 28 images are digits
Idea: represent the images of digits in a more compact way
(Figure: an NN encoder and an NN decoder are learned together; the code between them — usually < 784 dimensions — is a compact representation of the input object from which it can be reconstructed)
(Figure: input layer x → hidden (bottleneck) layer a → output layer y)
encode: a = σ(Wx + b)
decode: y = σ(W′a + b′)
Minimize ‖x − y‖² so the output is as close as possible to the input
The output of the hidden layer is the code
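A minimal sketch of the encode/decode computation above, with randomly initialized weights; a real autoencoder would train W, W′, b, b′ to minimize ‖x − y‖², and the 30-dim code size here is illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d_in, d_code = 784, 30                      # 28x28 input, 30-dim bottleneck
W = rng.normal(0, 0.01, (d_code, d_in))     # encoder weights W
b = np.zeros(d_code)
Wp = rng.normal(0, 0.01, (d_in, d_code))    # decoder weights W'
bp = np.zeros(d_in)

x = rng.random(d_in)                        # a stand-in "image"
a = sigmoid(W @ x + b)                      # encode: a = sigma(W x + b)
y = sigmoid(Wp @ a + bp)                    # decode: y = sigma(W' a + b')
loss = np.sum((x - y) ** 2)                 # reconstruction error ||x - y||^2
print(a.shape, y.shape)                     # (30,) (784,)
```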
De-noising autoencoder
(Figure: add noise to x to obtain x′, encode x′ into a, decode a into y, and make y as close as possible to the original clean x)
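The de-noising variant only changes the input: corrupt x into x′, feed x′ through the same encoder/decoder, but still compare the output against the clean x. A sketch of the corruption step, assuming an additive-Gaussian corruption scheme (masking pixels to zero is another common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                        # clean input image

# Corrupt the input with additive Gaussian noise.
x_noisy = x + rng.normal(0.0, 0.2, x.shape)

# Training pairs for a de-noising autoencoder: feed x_noisy, reconstruct x.
# The loss is still ||x - y||^2 against the CLEAN input, not the noisy one.
assert x_noisy.shape == x.shape
assert not np.allclose(x_noisy, x)         # the corruption actually changed x
```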
Deep Autoencoder
Initialize by RBM layer-by-layer
(Figure: a deep encoder — input layer through successive layers down to a bottleneck Code layer — mirrored by a deep decoder back to the output layer; the reconstruction x̂ is trained to be as close as possible to the input x)
Hinton and Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006.
(Figure: reconstructions of the original image — PCA with a 30-dim code (784–30–784) vs. a deep autoencoder (784–1000–500–250–30–250–500–1000–784))
Feature Representation
(Figure: a deep autoencoder 784–1000–500–250–2–250–500–1000–784 maps each digit image to a 2-dim code for visualization)
Auto-encoder – Text Retrieval
Bag-of-words / vector space model: represent each document and query as a vector over the vocabulary
word string: "This is an apple"
this: 1, is: 1, a: 0, an: 1, apple: 1, pen: 0
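The bag-of-words encoding above can be written in a few lines; the six-word vocabulary matches the slide:

```python
# Bag-of-words encoding over a fixed vocabulary (as on the slide).
vocab = ["this", "is", "a", "an", "apple", "pen"]

def bag_of_words(text):
    words = text.lower().split()
    return [1 if w in words else 0 for w in vocab]

vec = bag_of_words("This is an apple")
print(vec)  # [1, 1, 0, 1, 1, 0]
```

Documents and queries encoded this way live in the same vector space and can be compared directly, which is what the retrieval slides that follow exploit.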
(Figure: a 2000-dim bag-of-words vector for a document or query is compressed through layers 500–250–125 down to a 2-dim code)
Documents talking about the same thing will have close codes
Autoencoder – Similar Image Retrieval
Retrieved using Euclidean distance in pixel intensity space
(Figure: a deep autoencoder compresses a 32×32 image through layers 8192–4096–2048–1024–512–256 into a 256-dim code; trained on millions of images crawled from the Internet)
Images retrieved using Euclidean distance in pixel intensity space
Images retrieved using 256 codes
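Retrieval in code space reduces to nearest-neighbour search on the 256-dim codes instead of the raw pixels; a sketch with random stand-in codes in place of real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((1000, 256))                # 256-dim codes for 1000 images
query = codes[42] + rng.normal(0, 0.01, 256)   # a slightly perturbed copy

# Euclidean distance in code space (vs. pixel space on the previous slide).
dists = np.linalg.norm(codes - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest)  # 42 -- the near-duplicate is retrieved
```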
Autoencoder for DNN Pre-Training
Greedy layer-wise pre-training again
(Figure: the target network is input 784 → 1000 → 1000 → 500 → output 10; first train an autoencoder 784 → 1000 → 784 on the input x and keep the encoder weights W1, discarding W1′)
(Figure: keep W1 fixed, compute the first-layer activations a1, then train a second autoencoder 1000 → 1000 → 1000 on a1 and keep its encoder weights W2, discarding W2′)
(Figure: keep W1 and W2 fixed, compute the second-layer activations a2, then train a third autoencoder 1000 → 500 → 1000 on a2 and keep its encoder weights W3, discarding W3′)
(Figure: stack W1, W2, and W3, add a randomly initialized output layer W4 (500 → 10), and fine-tune the whole network via backprop)
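The steps above form a loop: for each hidden layer, train a small autoencoder on the previous layer's fixed activations, keep the encoder weights, discard the decoder; then add a randomly initialized output layer and fine-tune everything. This sketch only wires up the forward passes (no actual autoencoder training) to show the weight shapes and stacking order:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
sizes = [784, 1000, 1000, 500]       # layer sizes from the slides
X = rng.random((32, 784))            # a mini-batch of stand-in images

weights, h = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    # Pretraining step (sketched): an autoencoder d_in -> d_out -> d_in
    # would be trained on h; here we just initialize its encoder W.
    W = rng.normal(0, 0.01, (d_in, d_out))
    weights.append(W)
    h = sigmoid(h @ W)               # fix W, pass activations to the next step

W_out = rng.normal(0, 0.01, (500, 10))  # random init for the output layer
logits = h @ W_out                   # then fine-tune the whole stack via backprop
print(logits.shape)                  # (32, 10)
```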
Variational Autoencoder
Representation Learning and Generation
Generation from Latent Codes
(Figure: the trained decoder maps a latent code a to an output y)
How can we set a latent code for generation?
Latent Code Distribution Constraints
Constrain the distribution of the learned latent codes, then generate by sampling a latent code from a prior distribution
(Figure: AE — x is encoded into a and decoded into y, trained for reconstruction; VAE — the code is sampled from a distribution output by the encoder, and the decoder reconstructs from that sample)
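In a VAE the encoder outputs a distribution (a mean and a log-variance) rather than a single code; the code is sampled via the reparameterization trick, and a KL term keeps the code distribution close to the standard-normal prior, so that sampling z ~ N(0, I) at generation time yields valid codes. A sketch of the sampling and KL computation, with stand-in encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d_code = 8

# Pretend the encoder produced these for one input x.
mu = rng.normal(0, 0.1, d_code)          # predicted mean
logvar = rng.normal(0, 0.1, d_code)      # predicted log-variance

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.standard_normal(d_code)
z = mu + np.exp(0.5 * logvar) * eps      # sampled latent code (fed to decoder)

# KL divergence between N(mu, sigma^2) and the prior N(0, I), always >= 0.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
print(z.shape)  # (8,)
```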
VAE for Music Generation
Each note is represented via a note dictionary as noteₜ = (Pₜ, Tₜ, dTₜ)
(Figure: a modularized recurrent encoder reads the note sequence note₀, note₁, …, noteₜ into a latent code z via variational inference; a modularized note-unrolling decoder, conditioned on z at every step, predicts Pₜ, Tₜ, and dTₜ through fully-connected layers and a reverse note dictionary)
http://mvae.miulab.tw
Distant Supervision
Representation Learning by Weak Labels
(Figure: the CDSSM architecture — each word in the word sequence x (e.g. "how about we discuss this later") is mapped by the word hashing matrix Wh from a 20K vocabulary to a 1000-dim word hashing layer lh; the convolution matrix Wc produces a 300-dim convolutional layer lc; a max pooling operation yields the max pooling layer lm; the semantic projection matrix Ws produces the semantic layer y; the query Q and documents D1 … Dn are compared by CosSim(Q, Di), which defines P(Di | Q))
Convolutional Deep Structured Semantic Models (CDSSM/DSSM)
Huang et al., "Learning deep structured semantic models for web search using clickthrough data," in Proc. of CIKM, 2013.
Shen et al., "Learning semantic representations using convolutional neural networks for web search," in Proc. of WWW, 2014.
Training maximizes the likelihood of clicked documents given the queries
Semantically related documents are close to the query in the encoded space
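The scoring step can be sketched directly: given encoded semantic vectors for the query and the documents, compute cosine similarities and softmax them into P(Di | Q); training then maximizes this probability for clicked documents. The vectors below are random stand-ins for the network outputs, and a real DSSM additionally scales the similarities by a smoothing factor before the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.random(300)                   # encoded query (semantic layer y)
docs = rng.random((4, 300))           # encoded documents D1..D4

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([cos_sim(q, d) for d in docs])   # CosSim(Q, Di)

# Softmax over the similarities gives P(Di | Q).
p = np.exp(sims) / np.sum(np.exp(sims))
print(p.shape)  # (4,)
```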
Multi-Tasking
Representation Learning by Different Tasks
Task-Shared Representation
(Figure: Task 1 and Task 2 share the lower network layers and each has its own task-specific output layers on top of the shared representation)
The latent factors can be learned by different tasks
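A minimal sketch of the shared-representation idea: both tasks pass through the same first layer, and each task has its own head, so gradients from either task would update the shared weights. The sizes and the example tasks are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
W_shared = rng.normal(0, 0.1, (784, 256))  # layers learned by BOTH tasks
W_task1 = rng.normal(0, 0.1, (256, 10))    # head for task 1 (e.g. digits)
W_task2 = rng.normal(0, 0.1, (256, 2))     # head for task 2 (e.g. cat vs. dog)

x = rng.random(784)
h = sigmoid(x @ W_shared)      # shared latent representation
out1 = h @ W_task1             # task-1 prediction
out2 = h @ W_task2             # task-2 prediction
print(out1.shape, out2.shape)  # (10,) (2,)
```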
Concluding Remarks
Labeling data is expensive, but we have large amounts of unlabeled data
Autoencoder / VAE
◦exploit unlabeled data to learn latent factors as representations
◦the learned representations can be transferred to other tasks
Distant Labels / Labels from Other Tasks
◦learn representations that are useful for other tasks
◦the learned representations may also be useful for the target task