Machine Learning Techniques (機器學習技巧)
Lecture 11: Neural Network
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 11: Neural Network
• Random Forest: Theory and Practice
• Neural Network Motivation
• Neural Network Hypothesis
• Neural Network Training
• Deep Neural Networks
Strength-Correlation Decomposition
decomposition (classification):
lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²
• strength s: average voting margin within G
• correlation ρ: similarity between the g_t
• similar result for regression (bias-variance decomposition)
RF good if the trees are diverse and strong
Practice: How Many Trees Needed?
theory: the more, the ‘better’
• NTU KDDCup 2013 Track 1: predicting author-paper relation
• 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed;
  1 − E_out of top 20 teams: [0.98130, 0.98554]
• decision: take 12000 trees with seed 1
cons of RF: may need lots of trees if the random process is too unstable
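The “the more, the better” behavior can be illustrated with a toy simulation (not from the lecture): majority-voting T independent classifiers, each correct with probability p. All numbers and the helper `noisy_vote` are hypothetical; real random-forest trees are correlated, which is exactly why many more trees may be needed before the vote stabilizes.

```python
import random

random.seed(1)

def noisy_vote(T, p=0.6, n_trials=2000):
    """Fraction of trials in which a majority vote of T independent
    classifiers (each correct with probability p) is correct."""
    wins = 0
    for _ in range(n_trials):
        correct = sum(1 for _ in range(T) if random.random() < p)
        if correct > T / 2:
            wins += 1
    return wins / n_trials

# ensemble accuracy grows with T, then flattens out
for T in (1, 11, 101, 1001):
    print(T, noisy_vote(T))
```

With correlated voters the curve flattens more slowly, matching the “lots of trees if the random process is too unstable” caveat.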
Disclaimer
Many parts of this lecture borrow from
Prof. Yaser S. Abu-Mostafa’s slides with permission.
Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology
Lecture 10: Neural Networks
Sponsored by Caltech’s Provost Office, E&AS Division, and IST
Thursday, May 3, 2012
Biological Inspiration
biological function −→ biological structure
[figure: a biological neuron alongside its artificial counterpart]
Learning From Data - Lecture 10, 8/21
Combining Perceptrons
[figure: perceptrons h1 and h2, each splitting the (x1, x2) plane into + and − regions]
OR(x1, x2) = sign(1.5 + x1 + x2)
AND(x1, x2) = sign(−1.5 + x1 + x2)
(weights on (1, x1, x2); inputs in {−1, +1})
Learning From Data - Lecture 10, 9/21
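The OR and AND constructions can be checked with a small sketch, assuming ±1-valued inputs and the sign activation as on the slide; the helper `perceptron` is illustrative:

```python
import numpy as np

def perceptron(w, x):
    """sign(w · (1, x1, x2)), with the constant 1 prepended for the bias."""
    return int(np.sign(w @ np.concatenate(([1.0], x))))

OR_w  = np.array([ 1.5, 1.0, 1.0])   # sign(1.5 + x1 + x2)
AND_w = np.array([-1.5, 1.0, 1.0])   # sign(-1.5 + x1 + x2)

# enumerate all four ±1 input combinations
for x1 in (-1.0, 1.0):
    for x2 in (-1.0, 1.0):
        x = np.array([x1, x2])
        print(x1, x2, perceptron(OR_w, x), perceptron(AND_w, x))
```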
Creating Layers
f = OR( AND(h1, h̄2), AND(h̄1, h2) )
[figure: two-layer construction; each AND uses bias −1.5 with weights ±1 on h1, h2, and the top OR uses bias 1.5 with weights 1, 1]
Learning From Data - Lecture 10, 10/21
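A minimal sketch of the layered construction, assuming the slide’s weights (bias −1.5 with weights ±1 for the two ANDs, bias 1.5 with weights 1, 1 for the OR); the resulting f is XOR of h1 and h2, which no single perceptron can express:

```python
def sign(s):
    return 1.0 if s > 0 else -1.0

def layered_f(h1, h2):
    """f = OR( AND(h1, not-h2), AND(not-h1, h2) ), wired with the slide's weights."""
    a1 = sign(-1.5 + 1.0 * h1 - 1.0 * h2)   # AND(h1, h̄2)
    a2 = sign(-1.5 - 1.0 * h1 + 1.0 * h2)   # AND(h̄1, h2)
    return sign(1.5 + a1 + a2)              # OR of the two ANDs

for h1 in (-1.0, 1.0):
    for h2 in (-1.0, 1.0):
        print(h1, h2, layered_f(h1, h2))   # +1 exactly when h1 != h2
```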
The Multilayer Perceptron
[figure: the layered network redrawn as a feedforward graph on inputs 1, x1, x2, with first-layer units computing w1 · x and w2 · x]
3 layers, feedforward
Learning From Data - Lecture 10, 11/21
A Powerful Model
[figure: a target region of + and − points, approximated with 8 perceptrons and with 16 perceptrons]
2 red flags: for generalization and for optimization
Learning From Data - Lecture 10, 12/21
Fun Time
The Neural Network
[figure: feedforward network on inputs 1, x_1, x_2, …, x_d; each unit applies θ to its incoming signal s, and the final output is h(x)]
• input: x
• hidden layers: 1 ≤ l < L
• output layer: l = L
Learning From Data - Lecture 10, 13/21
How the Network Operates
[plot: the tanh transition function, sitting between the linear function and the hard threshold and saturating at ±1]

θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

w_{ij}^{(l)}:  1 ≤ l ≤ L layers,  0 ≤ i ≤ d^{(l−1)} inputs,  1 ≤ j ≤ d^{(l)} outputs

x_j^{(l)} = θ(s_j^{(l)}) = θ( Σ_{i=0}^{d^{(l−1)}} w_{ij}^{(l)} x_i^{(l−1)} )

Apply x to x_1^{(0)} · · · x_{d^{(0)}}^{(0)} → · · · → x_1^{(L)} = h(x)
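The forward computation x_j^{(l)} = θ(Σ_i w_{ij}^{(l)} x_i^{(l−1)}) can be sketched as follows. The layer sizes and weights are hypothetical; row 0 of each weight matrix plays the role of the bias term x_0 = 1:

```python
import numpy as np

def forward(weights, x):
    """Forward pass: weights[l] has shape (d^(l-1)+1, d^(l)); row 0 is the bias.
    Each layer computes x^(l) = tanh(W^T x^(l-1)) with a constant 1 prepended."""
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.tanh(np.concatenate(([1.0], a)) @ W)
    return a

# hypothetical 2-3-1 network with small random weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)) * 0.5,   # layer 1: 2 inputs (+bias) -> 3 units
           rng.standard_normal((4, 1)) * 0.5]   # layer 2: 3 units (+bias) -> 1 output
h = forward(weights, [0.2, -0.4])
print(h)   # x_1^(L): a single value in (-1, 1), since tanh saturates at ±1
```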
Applying SGD
All the weights w = { w_{ij}^{(l)} } determine h(x)
Error on example (x_n, y_n) is e( h(x_n), y_n ) = e(w)
To implement SGD, we need the gradient
∇e(w):  ∂e(w) / ∂w_{ij}^{(l)}  for all i, j, l
Learning From Data - Lecture 10, 16/21
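Before the efficient trick of the next slide, the “numerically, one by one” option is easy to sketch with central differences. The single-neuron error function below is an illustrative stand-in for the network’s e(w), not the lecture’s full model:

```python
import numpy as np

def numeric_grad(e, w, eps=1e-6):
    """Estimate each partial derivative de/dw_k by a central difference --
    the 'one by one, numerically' option; backprop gets the same answer
    far more cheaply."""
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for k in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[k] += eps
        wm[k] -= eps
        g[k] = (e(wp) - e(wm)) / (2 * eps)
    return g

# hypothetical example: squared error of a single tanh unit on (x_n, y_n);
# the first entry of x_n is the constant 1 for the bias
x_n, y_n = np.array([1.0, 0.5, -0.3]), 0.8
e = lambda w: (np.tanh(w @ x_n) - y_n) ** 2
print(numeric_grad(e, [0.1, 0.2, 0.3]))
```

Each gradient entry costs two evaluations of e(w), so a network with many weights makes this per-weight loop expensive; that cost is what motivates the chain-rule trick.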
Computing ∂e(w) / ∂w_{ij}^{(l)}
[figure: the weight w_{ij}^{(l)} carries x_i^{(l−1)} into the signal s_j^{(l)}, which θ turns into x_j^{(l)}]
We can evaluate ∂e(w) / ∂w_{ij}^{(l)} one by one: analytically or numerically.
A trick for efficient computation:

∂e(w) / ∂w_{ij}^{(l)} = ( ∂e(w) / ∂s_j^{(l)} ) × ( ∂s_j^{(l)} / ∂w_{ij}^{(l)} )

We have ∂s_j^{(l)} / ∂w_{ij}^{(l)} = x_i^{(l−1)}

We only need: ∂e(w) / ∂s_j^{(l)} = δ_j^{(l)}
Learning From Data - Lecture 10, 17/21
δ for the Final Layer
δ_j^{(l)} = ∂e(w) / ∂s_j^{(l)}
For the final layer, l = L and j = 1:

δ_1^{(L)} = ∂e(w) / ∂s_1^{(L)}

e(w) = ( x_1^{(L)} − y_n )²
x_1^{(L)} = θ( s_1^{(L)} )
θ′(s) = 1 − θ²(s) for the tanh
Learning From Data - Lecture 10, 18/21
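Putting the three facts together, the chain rule gives δ_1^{(L)} = 2 ( x_1^{(L)} − y_n ) ( 1 − ( x_1^{(L)} )² ). A quick numeric sanity check, with hypothetical values of the signal and target:

```python
import numpy as np

def delta_L(s, y):
    """δ_1^(L) = ∂e/∂s_1^(L) for e = (θ(s) - y)^2 with θ = tanh:
    chain rule gives 2 (θ(s) - y) θ'(s), and θ'(s) = 1 - θ(s)^2."""
    x = np.tanh(s)                       # x_1^(L) = θ(s_1^(L))
    return 2.0 * (x - y) * (1.0 - x ** 2)

s, y = 0.7, 1.0                          # hypothetical signal and target
print(delta_L(s, y))

# compare against a central difference of e(s) = (tanh(s) - y)^2
eps = 1e-6
fd = ((np.tanh(s + eps) - y) ** 2 - (np.tanh(s - eps) - y) ** 2) / (2 * eps)
print(fd)   # closely matches delta_L(s, y)
```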