Unsupervised Learning
Applied Deep Learning
May 25th, 2020 http://adl.miulab.tw
Introduction
◉ Big data ≠ Big annotated data
◉ Machine learning techniques include:
○ Supervised learning (if we have labelled data)
○ Reinforcement learning (if we have an environment for reward)
○ Unsupervised learning (if we do not have labelled data)
What can we do if we do not have sufficient training data?
2
Semi-Supervised Learning
Labelled Data
Unlabeled Data
cat dog
(Image of cats and dogs without labeling)
3
Semi-Supervised Learning
◉ Why does semi-supervised learning help?
The distribution of the unlabeled data provides some cues
4
Transfer Learning
Source Data
Target Data
cat dog
Not related to the task considered
elephant elephant tiger tiger
5
Transfer Learning
◉ Widely used in image processing
○ Using sufficient labeled data to learn a CNN
○ Using this CNN as a feature extractor
(Diagram: input pixels x1, x2, …, xN pass through Layer 1, Layer 2, …, Layer L to output the label "elephant")
6
Transfer Learning Example
(Analogy between the manga 爆漫王 (Bakuman) and a graduate student's survival guide)
manga artist ↔ graduate student
editor (責編) ↔ advisor (指導教授)
submit to Jump ↔ submit to journals
draw storyboards ↔ run experiments
"manga artist online" ↔ "graduate student online"
7
Self-Taught Learning
◉ The unlabeled data is sometimes not related to the task
Unlabeled Data (just crawl millions of images from the Internet)
Labelled Data: cat, dog
8
Self-Taught Learning
◉ The unlabeled data is sometimes not related to the task
Labelled Data → Unlabeled Data
○ Digit recognition: digits → characters
○ Speech recognition: Taiwanese → English, Chinese
○ Document classification: news → webpages
……
Why can we use unlabeled and unrelated data to help our tasks?
9
Self-Taught Learning
◉ How does self-taught learning work?
◉ Why does unlabeled and unrelated data help the tasks?
Finding latent factors that control the observations
10
Latent Factors for Handwritten Digits
11
Latent Factors for Documents
http://deliveryimages.acm.org/10.1145/2140000/2133826/figs/f1.jpg
12
Latent Factors for Recommendation System
(Diagram: A, B, C placed along two latent factors: 單純呆 (innocent/naive) vs. 傲嬌 (tsundere))
13
Latent Factor Exploitation
◉ Handwritten digits
The handwritten images are composed of strokes
Strokes (Latent Factors)
…….
No. 1 No. 2 No. 3 No. 4 No. 5
14
Latent Factor Exploitation
Each digit image is 28 × 28, i.e., represented by 28 × 28 = 784 pixels.
A digit image = stroke No. 1 + stroke No. 3 + stroke No. 5
→ code over the strokes (latent factors) No. 1 … No. 5: [1 0 1 0 1 0 …….] (a simpler representation)
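As a toy illustration of this idea, the sketch below composes a digit image from a binary code over five strokes; the stroke images here are random numpy placeholders, not real strokes.

```python
import numpy as np

# Five stroke images acting as latent factors (random placeholders, 28x28 each)
rng = np.random.default_rng(0)
strokes = rng.random((5, 28, 28))

# Code [1 0 1 0 1]: the digit uses strokes No. 1, No. 3, and No. 5
code = np.array([1, 0, 1, 0, 1])
image = np.tensordot(code, strokes, axes=1)   # sum of the selected strokes

print(image.shape)   # (28, 28) -> 784 pixels
print(code)          # the 5-dim code is a much simpler representation
```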
15
Representation Learning
Autoencoder
16
Autoencoder
◉ Represent a digit using 28 × 28 = 784 dimensions
◉ Not all 28 × 28 images are digits
Idea: represent the images of digits in a more compact way
NN Encoder: input → code, a compact representation of the input object (usually < 784 dims)
NN Decoder: code → reconstruct the original object
The encoder and decoder are learned together.
17
Autoencoder
Input layer 𝑥 → hidden (bottleneck) layer 𝑎 → output layer 𝑦
encode: 𝑎 = 𝜎(𝑊𝑥 + 𝑏)
decode: 𝑦 = 𝜎(𝑊′𝑎 + 𝑏′)
The output of the hidden layer is the code.
Make 𝑦 as close as possible to 𝑥: minimize ‖𝑥 − 𝑦‖²
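A minimal sketch of this single-hidden-layer autoencoder in PyTorch; the 64-dim code size, the optimizer, and the random input batch are illustrative choices, not part of the slide.

```python
import torch
import torch.nn as nn

# a = sigma(W x + b): encoder;  y = sigma(W' a + b'): decoder
encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())   # bottleneck: 64 < 784 (arbitrary)
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)          # a batch of flattened 28x28 images (placeholder data)
a = encoder(x)                   # the code: output of the hidden layer
y = decoder(a)                   # reconstruction
loss = ((x - y) ** 2).mean()     # minimize ||x - y||^2
opt.zero_grad(); loss.backward(); opt.step()
```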
18
Autoencoder
◉ De-noising auto-encoder
Add noise to the input: 𝑥 → 𝑥′
encode: 𝑥′ → 𝑊 → 𝑎; decode: 𝑎 → 𝑊′ → 𝑦
Make 𝑦 as close as possible to the original (clean) 𝑥
Rifai, et al. "Contractive auto-encoders: Explicit invariance during feature extraction,“ in ICML, 2011.
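A short sketch of the de-noising variant: corrupt the input, but still reconstruct the clean image. The Gaussian noise and the 0.2 scale are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(32, 784)                   # clean images (placeholder data)
x_noisy = x + 0.2 * torch.randn_like(x)   # x': input with added noise
y = decoder(encoder(x_noisy))
loss = ((x - y) ** 2).mean()              # compare against the clean x, not x'
```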
19
Deep Autoencoder
Input layer 𝑥 → Layer → Layer → … → bottleneck layer (Code) → … → Layer → Layer → Output layer
The output should be as close as possible to the input 𝑥.
Hinton and Salakhutdinov. “Reducing the dimensionality of data with neural networks,” Science, 2006.
20
Deep Autoencoder
Original image (784 pixels)
PCA: 784 → 30 → 784
Deep auto-encoder: 784 → 1000 → 500 → 250 → 30 → 250 → 500 → 1000 → 784
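The deep auto-encoder above, sketched in PyTorch with the slide's layer sizes; the ReLU/Sigmoid activations are an assumption (Hinton & Salakhutdinov trained it with layer-wise pre-training rather than this plain setup).

```python
import torch.nn as nn

# 784 - 1000 - 500 - 250 - 30 - 250 - 500 - 1000 - 784
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),                   # 30-dim code at the bottleneck
)
decoder = nn.Sequential(
    nn.Linear(30, 250), nn.ReLU(),
    nn.Linear(250, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),   # pixel intensities in [0, 1]
)
```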
21
Feature Representation
Use a 2-dim code as the feature representation: 784 → 1000 → 500 → 250 → 2 → 250 → 500 → 1000 → 784
22
Auto-encoder – Text Retrieval
word string: "This is an apple"
Bag-of-word (vector space model): this=1, is=1, a=0, an=1, apple=1, pen=0, …
Each document and query is represented as a bag-of-word vector; semantics are not considered.
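A toy sketch of the bag-of-word vector for the slide's example, using binary occurrence over a six-word vocabulary:

```python
vocab = ["this", "is", "a", "an", "apple", "pen"]
text = "This is an apple"

tokens = text.lower().split()
bow = [1 if word in tokens else 0 for word in vocab]
print(bow)   # [1, 1, 0, 1, 1, 0]
```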
23
Autoencoder – Text Retrieval
Bag-of-word vector of a document or query: 2000 → 500 → 250 → 125 → 2 (code)
Documents talking about the same thing will have close codes.
24
Auto-Encoding (AE)
◉ Objective: reconstructing 𝑥̄ from 𝑥̂
○ dimension reduction or denoising (masked LM)
Randomly mask 15% of tokens
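A minimal sketch of this masked-LM style corruption in PyTorch; the vocabulary size, the mask token ID, and applying the 15% rate independently per token are assumptions made for illustration.

```python
import torch

tokens = torch.randint(1, 30000, (8, 128))      # a batch of token IDs (placeholder data)
MASK_ID = 0                                     # assumed ID of the [MASK] token

mask = torch.rand(tokens.shape) < 0.15          # mask each token with probability 15%
corrupted = tokens.masked_fill(mask, MASK_ID)   # the corrupted input
targets = tokens.masked_fill(~mask, -100)       # reconstruct only the masked positions
                                                # (-100 = ignore_index for cross-entropy)
```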
25
Autoencoder – Similar Image Retrieval
◉ Retrieved using Euclidean distance in pixel intensity space
Krizhevsky et al., "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
26
Autoencoder – Similar Image Retrieval
32 × 32 image → 8192 → 4096 → 2048 → 1024 → 512 → 256 (code)
(trained on millions of images crawled from the Internet)
27
Autoencoder – Similar Image Retrieval
◉ Images retrieved using Euclidean distance in pixel intensity space
◉ Images retrieved using 256 codes
Learning the useful latent factors
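A sketch of the retrieval step: Euclidean nearest neighbours, either directly in pixel-intensity space or in the 256-dim code space produced by a trained encoder. The data and the encoder here are placeholders.

```python
import torch

def nearest(query, database, k=5):
    """Indices of the k nearest neighbours under Euclidean distance."""
    dists = torch.cdist(query.unsqueeze(0), database).squeeze(0)
    return dists.topk(k, largest=False).indices

images = torch.rand(10000, 32 * 32 * 3)   # database in pixel-intensity space (placeholder)
query = torch.rand(32 * 32 * 3)
pixel_hits = nearest(query, images)       # retrieval in pixel space

# With a trained encoder mapping images to 256-dim codes, retrieve in code space instead:
#   codes = encoder(images)
#   code_hits = nearest(encoder(query.unsqueeze(0)).squeeze(0), codes)
```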
28
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 1: train an autoencoder on the input 𝑥: 784 → W1 → 1000 → W1’ → 784 (reconstruct 𝑥)
29
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 2: fix W1; compute 𝑎1 (1000) from 𝑥; train an autoencoder on 𝑎1: 1000 → W2 → 1000 → W2’ → 1000 (reconstruct 𝑎1)
30
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → 1000 → 1000 → 500 → output (10)
Step 3: fix W1 and W2; compute 𝑎2 (1000) from 𝑥; train an autoencoder on 𝑎2: 1000 → W3 → 500 → W3’ → 1000 (reconstruct 𝑎2)
31
Autoencoder for DNN Pre-Training
◉ Greedy layer-wise pre-training again
Target network: input (784) → W1 → 1000 → W2 → 1000 → W3 → 500 → W4 → output (10)
Step 4: randomly initialize W4 for the output layer, then fine-tune the whole network via backprop
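A compact sketch of the greedy layer-wise procedure in PyTorch: each stage trains one autoencoder, keeps its encoder weights, fixes them, and feeds the resulting codes to the next stage. The optimizer, activations, and step counts are illustrative.

```python
import torch
import torch.nn as nn

sizes = [784, 1000, 1000, 500]                       # hidden structure of the target network
layers = [nn.Sequential(nn.Linear(i, o), nn.Sigmoid())
          for i, o in zip(sizes[:-1], sizes[1:])]    # W1, W2, W3 (with activations)

x = torch.rand(256, 784)                             # unlabeled images (placeholder data)

inputs = x
for layer, (d_in, d_out) in zip(layers, zip(sizes[:-1], sizes[1:])):
    decoder = nn.Linear(d_out, d_in)                 # Wi' (only used during pre-training)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(100):                             # a few reconstruction steps per stage
        recon = decoder(layer(inputs))
        loss = ((recon - inputs) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    inputs = layer(inputs).detach()                  # fix Wi; its codes feed the next stage

# Add a randomly initialized output layer (W4) and fine-tune everything via backprop.
model = nn.Sequential(*layers, nn.Linear(500, 10))
```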
32
Representation Learning and Generation
Variational Autoencoder
33
Generation from Latent Codes
𝑥 → encode (𝑊) → 𝑎 → decode (𝑊′) → 𝑦
How can we set a latent code for generation?
34
Latent Code Distribution Constraints
◉ Constrain the distribution of the learned latent codes
◉ Generate the latent code by sampling from a prior distribution
𝑥 → encode → 𝑎 (sampling) → decode → 𝑦
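A minimal VAE-style sketch of these two ideas: the encoder outputs a mean and log-variance, the code is sampled via the reparameterization trick, and a KL term pulls the code distribution toward a standard normal prior, so new objects can be generated by sampling codes from that prior. The 16-dim code and plain linear layers are placeholders.

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 16)                # predicts mean and log-variance of a 16-dim code
dec = nn.Linear(16, 784)

x = torch.rand(32, 784)                     # placeholder data
mu, logvar = enc(x).chunk(2, dim=-1)
a = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # sampled code (reparameterization)

recon = ((dec(a) - x) ** 2).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean() # match the N(0, I) prior
loss = recon + kl

# Generation: sample a latent code from the prior and decode it
new_x = dec(torch.randn(1, 16))
```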
35
Reconstruction
(Figure: reconstruction results of the AE vs. the VAE)
36
Representation Learning by Weak Labels
Distant Supervision
37
Convolutional Deep Structured Semantic Models (CDSSM/DSSM)
Architecture (per query/document): word sequence x = w1 w2 … wd → word hashing layer lh (word hashing matrix Wh, 20K dims per word) → convolutional layer lc (convolution matrix Wc, 1000 dims) → max pooling operation → max pooling layer lm → semantic layer y (semantic projection matrix Ws, 300 dims)
The query Q and documents D1 … Dn are each encoded into semantic vectors; relevance is scored with CosSim(Q, Di) and normalized into P(Di | Q).
Huang et al., "Learning deep structured semantic models for web search using clickthrough data," in Proc. of CIKM, 2013.
Shen et al., "Learning semantic representations using convolutional neural networks for web search," in Proc. of WWW, 2014.
maximizes the likelihood of clicked documents given queries
Semantically related documents are close to the query in the encoded space
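A sketch of this training signal: cosine similarities between the query code and the candidate document codes are softmax-normalized into P(Di | Q), and the negative log-likelihood of the clicked document is minimized. The encoder outputs and the smoothing factor are placeholders.

```python
import torch
import torch.nn.functional as F

q = torch.rand(300)              # semantic vector y of the query Q (placeholder encoder output)
docs = torch.rand(5, 300)        # semantic vectors of D1 ... Dn
clicked = 2                      # index of the clicked document

cos = F.cosine_similarity(q.unsqueeze(0), docs, dim=-1)   # CosSim(Q, Di)
log_p = F.log_softmax(10.0 * cos, dim=-1)                 # P(Di | Q); 10.0 is a smoothing factor
loss = -log_p[clicked]                                    # maximize likelihood of the clicked doc
```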
38
Representation Learning by Different Tasks
Multi-Task Learning
39
Task-Shared Representation
Task 1 Task 2
The latent factors can be learned by different tasks
40
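A sketch of a task-shared representation in PyTorch: one shared encoder learns the latent factors, and each task adds its own head. The sizes, losses, and random batches are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Linear(784, 256), nn.ReLU())    # latent factors shared across tasks
head1 = nn.Linear(256, 10)                                 # Task 1 head
head2 = nn.Linear(256, 5)                                  # Task 2 head

x1, y1 = torch.rand(32, 784), torch.randint(0, 10, (32,)) # Task 1 batch (placeholder)
x2, y2 = torch.rand(32, 784), torch.randint(0, 5, (32,))  # Task 2 batch (placeholder)

loss = (F.cross_entropy(head1(shared(x1)), y1) +
        F.cross_entropy(head2(shared(x2)), y2))
loss.backward()   # gradients from both tasks update the shared encoder
```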
Semi-Supervised Multi-Task SLU
(Lan et al., 2018)
◉ Idea: the language understanding objective can enhance other tasks
41
O. Lan, S. Zhu, and K. Yu, “Semi-supervised Training using Adversarial Multi-task Learning for Spoken Language Understanding,” in Proceedings of ICASSP, 2018.
Slot Tagging
Model
BLM exploits the unsupervised knowledge, while the shared-private framework and adversarial training make the slot tagging model more generalizable
MT-DNN
(Liu et al., 2019)
42
https://github.com/namisan/mt-dnn
Concluding Remarks
◉ Labeling data is expensive, but we have large unlabeled data
◉ Autoencoder
○ exploits unlabeled data to learn latent factors as representations
○ learned representations can be transferred to other tasks
◉ Distant Labels / Labels from Other Tasks
○ learn the representations that are useful for other tasks
○ learned representations may also be useful for the target task
43