Machine Learning Techniques (機器學習技法)


(1)

Machine Learning Techniques (機器學習技法)

Lecture 11: Neural Network

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

(2)

Agenda

Lecture 11: Neural Network

Random Forest: Theory and Practice

Neural Network Motivation

Neural Network Hypothesis

Neural Network Training

Deep Neural Networks

(3)

strength-correlation decomposition (classification):

lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²

• strength s: average voting margin within G

• correlation ρ: similarity between the g_t

similar for regression (bias-variance decomposition)

RF good if diverse and strong
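
To get a feel for the bound, plug in illustrative numbers (not from the lecture): with strength s = 0.8 and correlation ρ = 0.5, the bound is 0.5 · (1 − 0.64)/0.64 ≈ 0.28; it tightens as the trees get stronger (larger s) or less correlated (smaller ρ).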

(4)

Practice: How Many Trees Needed?

theory: the more, the 'better'

NTU KDDCup 2013 Track 1: predicting author-paper relation

• 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed

• 1 − E_out of top 20 teams: [0.98130, 0.98554]

decision: take 12000 trees with seed 1

cons of RF: may need lots of trees if random process too unstable
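
A minimal sketch of the kind of stability check behind such a decision, assuming scikit-learn and a pre-split dataset (X_train, y_train, X_val, y_val are placeholders, not the KDDCup data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def val_scores(X_train, y_train, X_val, y_val,
               tree_counts=(1000, 4000, 12000), seeds=(1, 2, 3)):
    """Estimate 1 - E_val for each (number of trees, seed) pair."""
    scores = {}
    for seed in seeds:
        for n in tree_counts:
            rf = RandomForestClassifier(n_estimators=n, random_state=seed,
                                        n_jobs=-1)
            rf.fit(X_train, y_train)
            scores[(n, seed)] = accuracy_score(y_val, rf.predict(X_val))
    return scores

# a wide spread of scores across seeds at a fixed tree count signals
# that the random process is still too unstable: add more trees
```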

(5)
(6)

Disclaimer

Many parts of this lecture borrow from Prof. Yaser S. Abu-Mostafa's slides with permission.

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 10: Neural Networks

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 3, 2012

(7)

Biological inspiration

biological function → biological structure

(figure: biological neurons and synapses, alongside the network structure they inspire)

(8)

Combining perceptrons

(figure: perceptrons h_1 and h_2 as linear separators on the x_1-x_2 plane, combined into a more complex classifier f)

With the ±1 encoding, OR and AND are themselves perceptrons:

OR(x_1, x_2) = sign(1.5 + x_1 + x_2)

AND(x_1, x_2) = sign(−1.5 + x_1 + x_2)
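
These two gates are easy to verify in code; a minimal sketch using the slide's ±1 encoding for TRUE/FALSE:

```python
def sign(s):
    """Hard threshold: +1 for a positive signal, -1 otherwise."""
    return 1 if s > 0 else -1

def OR(x1, x2):    # bias +1.5, weights 1 and 1
    return sign(1.5 + x1 + x2)

def AND(x1, x2):   # bias -1.5, weights 1 and 1
    return sign(-1.5 + x1 + x2)

for x1 in (-1, +1):
    for x2 in (-1, +1):
        print(x1, x2, "OR:", OR(x1, x2), "AND:", AND(x1, x2))
```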

(9)

Creating layers

f = h_1 h̄_2 + h̄_1 h_2, i.e. f = OR( AND(h_1, h̄_2), AND(h̄_1, h_2) )

(figure: the same f drawn as a layered network of perceptrons, with weights ±1 and biases 1.5 and −1.5)
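
Continuing the sketch above, one extra layer realizes f; negating a ±1 signal is just multiplication by −1 inside the AND:

```python
def sign(s):
    return 1 if s > 0 else -1

def XOR(h1, h2):
    """f = OR(AND(h1, NOT h2), AND(NOT h1, h2)) as a two-layer network."""
    u1 = sign(-1.5 + h1 - h2)   # AND(h1, NOT h2)
    u2 = sign(-1.5 - h1 + h2)   # AND(NOT h1, h2)
    return sign(1.5 + u1 + u2)  # OR(u1, u2)

for h1 in (-1, +1):
    for h2 in (-1, +1):
        print(h1, h2, XOR(h1, h2))
```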

(10)

The multilayer perceptron

(figure: inputs 1, x_1, x_2 feed perceptrons w_1·x and w_2·x; their outputs pass through the ±1 / ±1.5 combining layers to produce f)

3 layers, feedforward

(11)

A powerful model

(figure: a target boundary, approximated with 8 perceptrons and with 16 perceptrons)

2 red flags: for generalization and for optimization

(12)

Fun Time

(13)

The neural network

(figure: a layered network of units; each unit turns its incoming signal s into the output θ(s), and the final unit produces h(x))

input: x

hidden layers: 1 ≤ l < L

output layer: l = L

(14)

How the network operates

(plot: θ(s) for the linear, tanh, and hard-threshold choices; tanh is the soft threshold between the two extremes)

θ(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s))

w_ij^(l):   1 ≤ l ≤ L layers,   0 ≤ i ≤ d^(l−1) inputs,   1 ≤ j ≤ d^(l) outputs

x_j^(l) = θ( s_j^(l) ) = θ( Σ_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )

Apply x to x_1^(0) · · · x_{d^(0)}^(0) → · · · → x_1^(L) = h(x)
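
A minimal forward-pass sketch of this recursion (the weight-matrix layout is an assumption: weights[l] has shape (d^(l−1)+1, d^(l)), with row 0 holding the bias weights for the constant x_0 = 1):

```python
import numpy as np

def forward(x, weights):
    """Forward propagation: x_j^(l) = tanh(sum_i w_ij^(l) * x_i^(l-1))."""
    for W in weights:
        x = np.concatenate(([1.0], x))  # prepend the constant x_0 = 1
        x = np.tanh(x @ W)              # s^(l), then theta(s^(l))
    return x                            # x^(L), i.e. h(x)

# example: a random 2-3-1 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward(np.array([0.5, -0.2]), weights))  # h(x), a single value
```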

(15)
(16)

Applying SGD

All the weights w = { w_ij^(l) } determine h(x)

Error on example (x_n, y_n) is e( h(x_n), y_n ) = e(w)

To implement SGD, we need the gradient ∇e(w):   ∂e(w) / ∂w_ij^(l)   for all i, j, l
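
Once some routine returns the per-example gradient, the SGD step itself is one line; a minimal sketch (grad_e is a placeholder for an analytic or numerical gradient routine, not part of the lecture):

```python
import numpy as np

def sgd_epoch(weights, X, Y, grad_e, eta=0.1, rng=None):
    """One SGD pass: visit examples in random order, step along -gradient."""
    rng = rng or np.random.default_rng()
    for n in rng.permutation(len(X)):
        grads = grad_e(weights, X[n], Y[n])  # d e(w)/d w_ij^(l), per layer
        for W, G in zip(weights, grads):
            W -= eta * G                     # w <- w - eta * gradient
    return weights
```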

(17)

Computing ∂e(w) / ∂w_ij^(l)

(figure: x_i^(l−1) feeds through weight w_ij^(l) into the signal s_j^(l), which θ turns into x_j^(l))

We can evaluate ∂e(w) / ∂w_ij^(l) one by one: analytically or numerically

A trick for efficient computation:

∂e(w) / ∂w_ij^(l) = ( ∂e(w) / ∂s_j^(l) ) × ( ∂s_j^(l) / ∂w_ij^(l) )

We have ∂s_j^(l) / ∂w_ij^(l) = x_i^(l−1)

We only need:   ∂e(w) / ∂s_j^(l) = δ_j^(l)
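
The "numerically" option doubles as a sanity check for any analytic gradient; a minimal finite-difference sketch, assuming the forward function from earlier and the squared error of the next slide:

```python
import numpy as np

def numerical_grad(weights, x, y, forward, eps=1e-6):
    """Finite-difference estimate of d e(w)/d w_ij^(l) for every weight."""
    grads = [np.zeros_like(W) for W in weights]
    for W, G in zip(weights, grads):
        for idx in np.ndindex(*W.shape):
            old = W[idx]
            W[idx] = old + eps
            e_plus = (forward(x, weights)[0] - y) ** 2
            W[idx] = old - eps
            e_minus = (forward(x, weights)[0] - y) ** 2
            W[idx] = old                 # restore the weight
            G[idx] = (e_plus - e_minus) / (2 * eps)
    return grads
```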

(18)

δ for the final layer

δ_j^(l) = ∂e(w) / ∂s_j^(l)

For the final layer l = L and j = 1:

δ_1^(L) = ∂e(w) / ∂s_1^(L)

e(w) = ( x_1^(L) − y_n )²

x_1^(L) = θ( s_1^(L) )

θ′(s) = 1 − θ²(s) for the tanh

so by the chain rule, δ_1^(L) = 2 ( x_1^(L) − y_n ) ( 1 − (x_1^(L))² )
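
Combining the three ingredients in code, for the single-output, tanh, squared-error case on the slide:

```python
def delta_final(x_L, y):
    """delta_1^(L) = d e / d s_1^(L) for e = (x_1^(L) - y)^2, theta = tanh.

    Chain rule: (d e / d x) * (d x / d s) = 2 (x - y) * (1 - x^2),
    using x = theta(s) and theta'(s) = 1 - theta(s)^2.
    """
    return 2.0 * (x_L - y) * (1.0 - x_L ** 2)

print(delta_final(0.8, 1.0))  # output 0.8, target +1 -> delta ~ -0.144
```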
