Slides credited to Prof. Hung-Yi Lee

### What is Machine Learning?


### What Can Computers Do?

Programs can do the things you ask them to do.

### Program for Solving Tasks

### Task: predicting positive or negative given a product review

- "I love this product!" → **+**
- "It claims too much." → **-**
- "It's a little expensive." → **?**
- "台灣第一波上市!" ("First launch in Taiwan!") → **推** (upvote)
- "規格好雞肋…" ("The specs are pretty underwhelming…") → **噓** (downvote)
- "樓下買了我才考慮" ("I'll consider it only after someone else buys one") → **?**

A hand-written program.py would encode rules such as:

- if the input contains "love", "like", etc., output = positive
- if the input contains "too much", "bad", etc., output = negative

Some tasks are complex, and we don't know how to write a program to solve them.
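A minimal runnable sketch of such a rule-based program.py; the keyword lists and the rule order are illustrative assumptions, not from the original slide:

```python
# program.py: a hand-written, rule-based sentiment classifier.
# The keyword lists are illustrative; real reviews quickly break such rules.

POSITIVE_WORDS = ["love", "like"]
NEGATIVE_WORDS = ["too much", "bad"]

def classify(review: str) -> str:
    text = review.lower()
    if any(word in text for word in POSITIVE_WORDS):
        return "positive"
    if any(word in text for word in NEGATIVE_WORDS):
        return "negative"
    return "unknown"  # many reviews match no rule at all

print(classify("I love this product!"))      # positive
print(classify("It claims too much."))       # negative
print(classify("It's a little expensive."))  # unknown: the rules cannot decide
```

The third review falls through every rule, which is exactly why complex tasks resist hand-written programs.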

### Learning ≈ Looking for a Function

### Task: predicting positive or negative given a product review

- f("I love this product!") = **+**
- f("It claims too much.") = **-**
- f("It's a little expensive.") = **?**
- f("台灣第一波上市!") = **推** (upvote)
- f("規格好雞肋…") = **噓** (downvote)
- f("樓下買了我才考慮") = **?**

**Given a large amount of data, the machine learns what the function f should be.**

### Learning ≈ Looking for a Function

- Speech recognition: f(audio signal) = "你好" ("Hello")
- Handwriting recognition: f(image of a handwritten character) = "2"
- Weather forecast: f(weather data on Thursday) = forecast for "Saturday"
- Playing video games: f(game screen) = "move left"

### Machine Learning Framework

Training is to pick the best function given the observed data; testing is to predict the label using the learned function.

- Model (hypothesis function set): {f1, f2, …}
- Training data: pairs of function input x and function output (label) ŷ, i.e., {(x1, ŷ1), (x2, ŷ2), …}; for example, x = "It claims too much." with ŷ = **-** (negative)
- Training: pick the "best" function f* from the hypothesis set
- Testing: for new data (x, ?), predict the label y = f*(x)
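A toy sketch of this train/test loop in Python; the hypothesis set and the error-counting rule are illustrative assumptions:

```python
# A toy instance of the framework: the "model" is a small hypothesis set,
# training picks the function with the fewest errors on the training data,
# and testing applies the picked function f* to unseen inputs.

training_data = [
    ("I love this product!", "+"),
    ("It claims too much.", "-"),
]

# Hypothesis function set {f1, f2}: each maps a review to a label.
def f1(x): return "+" if "love" in x else "-"
def f2(x): return "+" if "!" in x else "-"
hypotheses = [f1, f2]

def errors(f):
    return sum(f(x) != y for x, y in training_data)

# Training: pick the best function f* (fewest training errors).
f_star = min(hypotheses, key=errors)

# Testing: predict the label of new data x using f*.
print(f_star("I love the design"))  # -> "+"
```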

### What is Deep Learning?

A subfield of machine learning.

### Stacked Functions Learned by Machine

A deep model works like a production line:

"台灣第一波上市!" → Simple Function f1 → Simple Function f2 → Simple Function f3 → **推** (upvote)

The whole deep learning model is **f**, a very complex function built by stacking simple ones. End-to-end training: what each simple function should do is learned automatically.

Deep learning usually refers to neural-network-based models.
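A tiny sketch of this stacking as function composition in Python; the three stage functions are hand-written placeholders, not the actual learned ones:

```python
# Stacking simple functions into one complex function f = f3(f2(f1(x))).
# Each stage here is a hand-written placeholder; in deep learning, what
# each stage computes is learned automatically from data (end to end).

def f1(text):            # stage 1: turn text into crude features
    return [len(text), text.count("!")]

def f2(features):        # stage 2: combine features into a score
    return features[0] * 0.1 + features[1] * 2.0

def f3(score):           # stage 3: turn the score into a label
    return "推 (upvote)" if score > 1.0 else "噓 (downvote)"

def f(text):             # the whole model: a very complex function
    return f3(f2(f1(text)))

print(f("台灣第一波上市!"))  # -> 推 (upvote)
```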

### Stacked Functions Learned by Machine

A deep network stacks layers between the input layer and the output layer:

- Input layer: vector x = (x1, x2, …, xN), e.g., an encoding of "台灣第一波上市!"
- Hidden layers: Layer 1, Layer 2, …, Layer L, producing features / representations
- Output layer: label y, e.g., 推 (upvote)

Representation learning attempts to learn good features/representations; deep learning attempts to learn (multiple levels of) representations plus an output.

### Deep vs. Shallow – Speech Recognition

Shallow model: each box is a simple function in the production line:

waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → GMM → "Hello"

Every stage up to MFCC is hand-crafted; only the final GMM is learned from data.
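A rough numpy/scipy sketch of those hand-crafted stages, simplified to the DFT → filter bank → log → DCT skeleton; the frame length, band averaging, and coefficient count are illustrative assumptions:

```python
# Skeleton of the hand-crafted MFCC pipeline: each stage is a simple,
# fixed (not learned) function. Real systems add framing, windowing,
# and triangular mel-scale filters; this is a simplified sketch.

import numpy as np
from scipy.fftpack import dct

def spectrogram(frame):
    return np.abs(np.fft.rfft(frame)) ** 2     # DFT -> power spectrum

def filter_bank(spec, n_filters=26):
    # Placeholder: average the spectrum into n_filters bands.
    bands = np.array_split(spec, n_filters)
    return np.array([b.mean() for b in bands])

def mfcc(frame, n_coeffs=13):
    energies = np.log(filter_bank(spectrogram(frame)) + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_coeffs]  # DCT -> MFCC

frame = np.random.randn(400)        # one 25 ms frame at 16 kHz (fake audio)
print(mfcc(frame).shape)            # (13,)
```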

### Deep vs. Shallow – Speech Recognition

Deep model: all functions are learned from data:

waveform → f1 → f2 → f3 → f4 → f5 → "Hello"

Less engineering labor, but the machine learns more. "Bye bye, MFCC" (Li Deng, Interspeech 2014)

### Deep vs. Shallow – Image Recognition

Shallow model: the feature-extraction stages are hand-crafted; only the final classifier is learned from data.

http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/

### Deep vs. Shallow – Image Recognition

Deep model: all functions are learned from data:

image → f1 → f2 → f3 → f4 → "monkey"

The intermediate layers learn features / representations of increasing abstraction.

*Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833).*

### Machine Learning vs. Deep Learning

Machine learning pipeline:

1. Describe your data with features a computer can understand (hand-crafted, domain-specific knowledge).
2. Run a model / learning algorithm (optimizing the weights on features).

(Credit: Dr. Socher)

### Machine Learning vs. Deep Learning

Deep learning pipeline:

1. Representations are learned by the machine (automatically learned internal knowledge).
2. Run a model / learning algorithm (optimizing the weights on features).

### Inspired by the Human Brain

### A Single Neuron

A neuron computes a weighted sum of its inputs plus a bias, then applies an activation function:

$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$

$y = \sigma(z) = \frac{1}{1 + e^{-z}}$

Here $b$ is the bias and $\sigma$ is the sigmoid activation function. Each neuron is a very simple function.
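A minimal numpy sketch of a single sigmoid neuron; the weight, bias, and input values are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # A single neuron: weighted sum of inputs plus bias, then activation
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([1.0, -2.0, 0.5])   # inputs x1..xN (illustrative values)
w = np.array([0.8, 0.2, -0.5])   # weights w1..wN
b = 0.1                          # bias
print(neuron(x, w, b))           # a single scalar output y in (0, 1)
```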

### Deep Neural Network

Cascading neurons forms a neural network: the input vector x = (x1, …, xN) passes through Layer 1, Layer 2, …, Layer L to produce the output vector y = (y1, …, yM).

A neural network is a complex function $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$; each layer is a simple function in the production line.
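A compact numpy sketch of such a cascade, where each layer is a simple function (matrix multiply, bias, sigmoid) and the network composes them; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # One layer: every neuron computes a weighted sum of x plus its bias
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
N, H, M = 4, 8, 3                               # input/hidden/output sizes
W1, b1 = rng.normal(size=(H, N)), np.zeros(H)   # Layer 1
W2, b2 = rng.normal(size=(M, H)), np.zeros(M)   # Layer 2 (output)

def f(x):
    # The whole network f: R^N -> R^M, a composition of simple layers
    return layer(layer(x, W1, b1), W2, b2)

x = rng.normal(size=N)
print(f(x))        # output vector y in R^M
```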

### History of Deep Learning

- 1960s: perceptron (single-layer neural network)
- 1969: the perceptron's limitations are shown
- 1980s: multi-layer perceptron
- 1986: backpropagation
- 1989: one hidden layer is "good enough", so why deep?
- 2006: RBM initialization (breakthrough)
- 2009: GPU computing
- 2010: breakthrough in speech recognition (Dahl et al., 2010)
- 2012: breakthrough on ImageNet (Krizhevsky et al., 2012)
- 2015: "superhuman" results in image and speech recognition

### Deep Learning Breakthrough

- First: speech recognition
- Second: computer vision

| Acoustic Model | WER on RT03S FSH | WER on Hub5 SWB |
| --- | --- | --- |
| Traditional features | 27.4% | 23.6% |
| Deep learning | 18.5% (-33%) | 16.1% (-32%) |


Why did deep learning achieve breakthroughs in applications only after 2010?


### Reasons Why Deep Learning Works

- Big data
- GPU computing

### Why Adopt GPUs for Deep Learning?

The GPU is like a brain: human brains create graphical imagination for mental thinking, e.g., picturing 台灣好吃牛肉麵 ("delicious Taiwanese beef noodles").

### Why Does Speed Matter?

Training time
◦ Big data increases the training time.
◦ An excessively long training time is not practical.

Inference time
◦ Users are not patient enough to wait for responses.

[Chart: inference time in ms, comparing a CPU against NVIDIA P4 and P40 GPUs]

GPUs provide the computational power that enables real-world applications.

### Why Is Deeper Better?

Deeper → more parameters.

[Diagram: a shallow network and a deep network over the same inputs x1, x2, …, xN; the deep network stacks additional hidden layers]

### Universality Theorem

Any continuous function $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$ can be realized by a network with only one hidden layer, given enough hidden neurons.

http://neuralnetworksanddeeplearning.com/chap4.html

So why "deep" and not "fat"?

### Fat + Shallow vs. Thin + Deep

Compare two networks with the same number of parameters: a fat, shallow network (one wide hidden layer) and a thin, deep network (several narrow hidden layers) over the same inputs x1, x2, …, xN.

### Fat + Shallow vs. Thin + Deep

On hand-written digit classification, the deeper model needs fewer parameters to achieve the same performance.

### Fat + Shallow vs. Thin + Deep

With the same parameter budget, compare a shallow network whose single hidden layer has width 2d against a deep network with two hidden layers of width d each. The shallow network's capacity grows roughly as O(2d), while the deep network's layers compose, giving roughly O(d²): depth multiplies expressiveness where width only adds it.
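A small sketch comparing parameter counts under this setup; the input/output sizes are illustrative assumptions:

```python
# Compare parameters of a fat+shallow net (one hidden layer of width 2d)
# against a thin+deep net (two hidden layers of width d). With similar
# parameter budgets, the deep net composes layers: intuitively O(d^2)
# distinguishable regions versus O(2d) for the shallow one.

def dense_params(n_in, n_out):
    return n_in * n_out + n_out        # weights + biases

def shallow_params(n_in, d, n_out):
    return dense_params(n_in, 2 * d) + dense_params(2 * d, n_out)

def deep_params(n_in, d, n_out):
    return (dense_params(n_in, d) + dense_params(d, d)
            + dense_params(d, n_out))

n_in, d, n_out = 100, 64, 10           # illustrative sizes
print("shallow:", shallow_params(n_in, d, n_out))
print("deep:   ", deep_params(n_in, d, n_out))
```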

## How to Apply?


### How to Frame the Learning Problem?

The learning algorithm f maps the input domain X into the output domain Y, i.e., $f: X \rightarrow Y$.

- Input domain: word, word sequence, audio signal, click logs
- Output domain: single label, sequence of tags, tree structure, probability distribution

### Output Domain – Classification

- Sentiment analysis: "這規格有誠意!" ("These specs show real effort!") → **+**; "太爛了吧~" ("This is way too bad~") → **-**
- Speech phoneme recognition: audio segment → **/h/**
- Handwriting recognition: image of a handwritten digit → **2**

### Output Domain – Sequence Prediction

- POS tagging: "推薦我台大後門的餐廳" ("Recommend me a restaurant near the back gate of NTU") → 推薦/VV 我/PN 台大/NR 後門/NN 的/DEG 餐廳/NN
- Speech recognition: audio → "大家好" ("Hello everyone")
- Machine translation: "How are you doing today?" → "你好嗎?" ("How are you?")

Learning tasks are decided by the output domains.

### Input Domain – How to Aggregate Information

- Input: word sequence, image pixels, audio signal, click logs
- Properties: continuity, temporal structure, importance distribution

Examples (see the sketch below):
◦ CNN (convolutional neural network): local connections, shared weights, pooling (e.g., AlexNet, VGGNet)
◦ RNN (recurrent neural network): temporal information

Network architectures should consider the properties of the input domain.
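A minimal numpy sketch of the CNN idea, local connections with shared weights plus pooling, as a 1-D convolution; the signal and filter values are illustrative:

```python
import numpy as np

def conv1d(x, kernel):
    # Local connections + shared weights: the same small kernel slides
    # over the input, so each output looks at a local window only.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    # Pooling: keep the strongest response in each window.
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

signal = np.array([0., 1., 2., 1., 0., -1., -2., -1.])
kernel = np.array([-1., 0., 1.])      # illustrative edge-like filter
print(max_pool(conv1d(signal, kernel)))
```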


Network design should leverage input and output domain properties

### “Applied” Deep Learning


The "applied" part: how to frame a task into a learning problem and design the corresponding model.

### Core Factors for Applied Deep Learning

1. Data: big data
2. Hardware: GPU computing
3. Talent: designing algorithms that make networks work for the specific problems

### Concluding Remarks

- Training: pick the best function given the observed data.
- Inference: predict the label using the learned function.

Main focus: how to apply deep learning to real-world problems.

### References

Reading materials
◦ Academic papers will be posted on the course website.

Deep learning
◦ Goodfellow, Bengio, and Courville, "Deep Learning," 2016. http://www.deeplearningbook.org
◦ Michael Nielsen, "Neural Networks and Deep Learning." http://neuralnetworksanddeeplearning.com