Machine Learning Techniques (機器學習技法)
Lecture 12: Neural Network
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
  Lecture 11: Gradient Boosted Decision Tree (aggregating trees from functional gradient and steepest descent, subject to any error measure)
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
Motivation
Neural Network Hypothesis
Neural Network Learning
Optimization and Regularization
Linear Aggregation of Perceptrons: Pictorial View
[figure: inputs x_0 = 1, x_1, ..., x_d feed T perceptrons g_1, ..., g_T (each with weight vector w_t); their outputs are aggregated with weights α_1, ..., α_T into G]

G(x) = \text{sign}\Big( \sum_{t=1}^{T} α_t \underbrace{\text{sign}(w_t^T x)}_{g_t(x)} \Big)

• two layers of weights: w_t and α
• two layers of sign functions: in g_t and in G

what boundary can G implement?
Logic Operations with Aggregation
[figure: g_1 and g_2 each split the plane into +1/−1 regions; the network x_0 = 1, x_1, ..., x_d feeds g_1 (weights w_1) and g_2 (weights w_2), aggregated with α_0 = −1, α_1 = +1, α_2 = +1 into G(x)]

G(x) = \text{sign}\big( −1 + g_1(x) + g_2(x) \big)

• g_1(x) = g_2(x) = +1 (TRUE): G(x) = +1 (TRUE)
• otherwise: G(x) = −1 (FALSE)
• G ≡ AND(g_1, g_2); OR and NOT can be implemented similarly
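A minimal Python sketch (not part of the original slides) that verifies the aggregation above: with α = (−1, +1, +1) the sign of the weighted vote reproduces AND(g_1, g_2), and with α = (+1, +1, +1) it reproduces OR(g_1, g_2). The helper name `aggregate` is made up for illustration.

```python
import numpy as np

def aggregate(alphas, g_values):
    """Sign of a weighted vote: sign(alpha_0 * 1 + alpha_1 * g1 + alpha_2 * g2)."""
    return np.sign(alphas[0] + np.dot(alphas[1:], g_values))

# enumerate all four (g1, g2) combinations of +1/-1
for g1 in (+1, -1):
    for g2 in (+1, -1):
        g = np.array([g1, g2])
        and_out = aggregate(np.array([-1.0, 1.0, 1.0]), g)  # alpha_0 = -1 gives AND
        or_out = aggregate(np.array([+1.0, 1.0, 1.0]), g)   # alpha_0 = +1 gives OR
        print(g1, g2, "AND:", int(and_out), "OR:", int(or_out))
```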
Powerfulness and Limitation
[figure: a target boundary separating + and − points, approximated by aggregations of 8 perceptrons and of 16 perceptrons]

• 'convex set' hypotheses implemented: d_VC → ∞, remember? :-)
• powerfulness: enough perceptrons ≈ smooth boundary

[figure: XOR(g_1, g_2) labels on the four quadrants formed by g_1 and g_2]
• limitation: XOR not 'linearly separable' under φ(x) = (g_1(x), g_2(x))

how to implement XOR(g_1, g_2)?
Multi-Layer Perceptrons: Basic Neural Network
• non-separable data: can use more transform
• how about one more layer of AND transform?

XOR(g_1, g_2) = OR(AND(−g_1, g_2), AND(g_1, −g_2))

[figure: x_0 = 1, x_1, ..., x_d feed g_1 (weights w_1) and g_2 (weights w_2); a second layer of AND units (with constant inputs +1 and weights −1) feeds an OR unit, so that G ≡ XOR(g_1, g_2)]

perceptron (simple)
=⇒ aggregation of perceptrons (powerful)
=⇒ multi-layer perceptrons (more powerful)
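A small Python sketch (assumed, not from the slides) of the two-layer construction: the hidden layer computes AND(−g_1, g_2) and AND(g_1, −g_2) as perceptrons, and the output layer ORs them, reproducing XOR(g_1, g_2) on all four input combinations.

```python
import numpy as np

def perceptron(bias, weights, inputs):
    """A single perceptron: sign(bias + weights . inputs)."""
    return np.sign(bias + np.dot(weights, inputs))

def xor_net(g1, g2):
    # hidden layer: AND(-g1, g2) and AND(g1, -g2), each an AND-style perceptron
    h1 = perceptron(-1.0, np.array([-1.0, +1.0]), np.array([g1, g2]))
    h2 = perceptron(-1.0, np.array([+1.0, -1.0]), np.array([g1, g2]))
    # output layer: OR(h1, h2)
    return perceptron(+1.0, np.array([+1.0, +1.0]), np.array([h1, h2]))

for g1 in (+1, -1):
    for g2 in (+1, -1):
        print(g1, g2, "->", int(xor_net(g1, g2)))  # +1 exactly when g1 != g2
```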
Connection to Biological Neurons
[figures: photographs of biological neurons alongside a neural network diagram; image credits: UC Regents Davis campus (brainmaps.org), CC BY 3.0 via Wikimedia Commons; Lauris Rubenis, CC BY 2.0, https://flic.kr/p/fkVuZX; Pedro Ribeiro Simões, CC BY 2.0, https://flic.kr/p/adiV7b]

neural network: bio-inspired model
Fun Time
Let g_0(x) = +1. Which of the following (α_0, α_1, α_2) allows G(x) = \text{sign}\big( \sum_{t=0}^{2} α_t g_t(x) \big) to implement OR(g_1, g_2)?
1. (−3, +1, +1)
2. (−1, +1, +1)
3. (+1, +1, +1)
4. (+3, +1, +1)

Reference Answer: 3
You can easily verify with all four possibilities of (g_1(x), g_2(x)).
Neural Network Hypothesis: Output
[figure: x_0 = 1, x_1, ..., x_d pass through two tanh hidden layers; the OUTPUT node combines the last hidden layer with weights w]

• OUTPUT: simply a linear model with s = w^T φ^{(2)}(φ^{(1)}(x))
• any linear model can be used (remember? :-)):
  - linear classification: h(x) = sign(s), err = 0/1
  - linear regression: h(x) = s, err = squared
  - logistic regression: h(x) = θ(s), err = cross-entropy

will discuss 'regression' with squared error
Neural Network Hypothesis: Transformation
• transformation function of the score (signal) s: any transformation?
  - linear: whole network stays linear & thus less useful
  - sign: discrete & thus hard to optimize for w
• popular choice of transformation: tanh(s)
  - 'analog' approximation of sign: easier to optimize
  - somewhat closer to biological neurons
  - not that new! :-)

tanh(s) = \frac{\exp(s) − \exp(−s)}{\exp(s) + \exp(−s)} = 2θ(2s) − 1

will discuss with tanh as transformation function
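A quick numeric check (my addition, not on the slide) of the identity tanh(s) = 2θ(2s) − 1, where θ is the logistic function θ(s) = 1/(1 + exp(−s)):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(s), 2 * theta(2 * s) - 1))  # True
```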
Neural Network Hypothesis
[figure: a d^{(0)}-d^{(1)}-...-d^{(L)} network; x_0 = 1, x_1, ..., x_d feed tanh hidden layers through weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}; e.g. s_3^{(2)} is transformed by tanh into x_3^{(2)}]

d^{(0)}-d^{(1)}-...-d^{(L)} Neural Network (NNet), with weights w_{ij}^{(ℓ)}:
  1 ≤ ℓ ≤ L layers, 0 ≤ i ≤ d^{(ℓ−1)} inputs, 1 ≤ j ≤ d^{(ℓ)} outputs

score: s_j^{(ℓ)} = \sum_{i=0}^{d^{(ℓ−1)}} w_{ij}^{(ℓ)} x_i^{(ℓ−1)}

transformed: x_j^{(ℓ)} = \begin{cases} \tanh\big(s_j^{(ℓ)}\big) & \text{if } ℓ < L \\ s_j^{(ℓ)} & \text{if } ℓ = L \end{cases}

apply x as input layer x^{(0)}, go through hidden layers to get x^{(ℓ)}, predict at output layer x_1^{(L)}
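A minimal NumPy sketch of this forward computation (an assumed illustration, not the course's reference code): each layer forms the scores s^{(ℓ)} from x^{(ℓ−1)} including the constant x_0^{(ℓ−1)} = 1, applies tanh on hidden layers, and leaves the final score linear.

```python
import numpy as np

def forward(x, weights):
    """Forward pass of a d(0)-d(1)-...-d(L) NNet.

    weights[l] has shape (d(l-1) + 1, d(l)); row 0 multiplies the constant +1.
    Returns the list of layer outputs x(0), ..., x(L).
    """
    xs = [np.asarray(x, dtype=float)]
    for l, W in enumerate(weights, start=1):
        x_prev = np.concatenate(([1.0], xs[-1]))            # prepend x_0 = 1
        s = W.T @ x_prev                                     # scores s(l)
        xs.append(np.tanh(s) if l < len(weights) else s)     # tanh except at output
    return xs

# example: a 3-5-1 network with small random weights
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 5)), 0.1 * rng.standard_normal((6, 1))]
print(forward([0.5, -1.0, 2.0], weights)[-1])  # the single output x_1(L)
```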
Physical Interpretation
[figure: the same multi-layer network, with one layer's weights highlighted]

• each layer: a transformation to be learned from data
• φ^{(ℓ)}(x) = \Big( \tanh\big( \sum_{i=0}^{d^{(ℓ−1)}} w_{i1}^{(ℓ)} x_i^{(ℓ−1)} \big), \ldots \Big): whether x 'matches' the weight vectors in pattern

NNet: pattern extraction with layers of connection weights
Fun Time
How many weights {w_{ij}^{(ℓ)}} are there in a 3-5-1 NNet?
1. 9
2. 15
3. 20
4. 26

Reference Answer: 4
There are (3 + 1) × 5 = 20 weights in w_{ij}^{(1)}, and (5 + 1) × 1 = 6 weights in w_{jk}^{(2)}, so 26 in total.
How to Learn the Weights?
[figure: the multi-layer network with weights w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}]

• goal: learn all {w_{ij}^{(ℓ)}} to minimize E_in({w_{ij}^{(ℓ)}})
• one hidden layer: simply an aggregation of perceptrons; gradient boosting can determine the hidden neurons one by one
• multiple hidden layers? not easy
• let e_n = (y_n − NNet(x_n))^2: can apply (stochastic) GD after computing ∂e_n/∂w_{ij}^{(ℓ)}!

next: efficient computation of ∂e_n/∂w_{ij}^{(ℓ)}
Computing ∂e_n/∂w_{i1}^{(L)} (Output Layer)

e_n = (y_n − NNet(x_n))^2 = \big(y_n − s_1^{(L)}\big)^2 = \Big(y_n − \sum_{i=0}^{d^{(L−1)}} w_{i1}^{(L)} x_i^{(L−1)}\Big)^2

specially (output layer, 0 ≤ i ≤ d^{(L−1)}):

\frac{∂e_n}{∂w_{i1}^{(L)}} = \frac{∂e_n}{∂s_1^{(L)}} \cdot \frac{∂s_1^{(L)}}{∂w_{i1}^{(L)}} = −2\big(y_n − s_1^{(L)}\big) \cdot x_i^{(L−1)}

generally (1 ≤ ℓ < L, 0 ≤ i ≤ d^{(ℓ−1)}, 1 ≤ j ≤ d^{(ℓ)}):

\frac{∂e_n}{∂w_{ij}^{(ℓ)}} = \frac{∂e_n}{∂s_j^{(ℓ)}} \cdot \frac{∂s_j^{(ℓ)}}{∂w_{ij}^{(ℓ)}} = δ_j^{(ℓ)} \cdot x_i^{(ℓ−1)}

δ_1^{(L)} = −2\big(y_n − s_1^{(L)}\big); how about the others?
Computing δ_j^{(ℓ)} = ∂e_n/∂s_j^{(ℓ)}

s_j^{(ℓ)} \;\xrightarrow{\tanh}\; x_j^{(ℓ)} \;\xrightarrow{w_{jk}^{(ℓ+1)}}\; \big(s_1^{(ℓ+1)}, \ldots, s_k^{(ℓ+1)}, \ldots\big) \;\Longrightarrow\; \cdots \;\Longrightarrow\; e_n

δ_j^{(ℓ)} = \frac{∂e_n}{∂s_j^{(ℓ)}} = \sum_{k=1}^{d^{(ℓ+1)}} \frac{∂e_n}{∂s_k^{(ℓ+1)}} \cdot \frac{∂s_k^{(ℓ+1)}}{∂x_j^{(ℓ)}} \cdot \frac{∂x_j^{(ℓ)}}{∂s_j^{(ℓ)}} = \sum_{k} δ_k^{(ℓ+1)} \, w_{jk}^{(ℓ+1)} \, \tanh'\big(s_j^{(ℓ)}\big)

δ_j^{(ℓ)} can be computed backwards from δ_k^{(ℓ+1)}
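One detail the slide leaves implicit (standard, but worth writing out): for the hidden layers ℓ < L, the tanh derivative can be computed from the x_j^{(ℓ)} already stored during the forward pass,

\tanh'\big(s_j^{(ℓ)}\big) = 1 − \tanh^2\big(s_j^{(ℓ)}\big) = 1 − \big(x_j^{(ℓ)}\big)^2,

so the backward pass needs no extra function evaluations.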
Backpropagation (Backprop) Algorithm

Backprop on NNet
initialize all weights w_{ij}^{(ℓ)}
for t = 0, 1, ..., T
  1 stochastic: randomly pick n ∈ {1, 2, ..., N}
  2 forward: compute all x_i^{(ℓ)} with x^{(0)} = x_n
  3 backward: compute all δ_j^{(ℓ)} subject to x^{(0)} = x_n
  4 gradient descent: w_{ij}^{(ℓ)} ← w_{ij}^{(ℓ)} − η x_i^{(ℓ−1)} δ_j^{(ℓ)}
return g_NNet(x) = \Big( \cdots \tanh\big( \sum_j w_{jk}^{(2)} \cdot \tanh\big( \sum_i w_{ij}^{(1)} x_i \big) \big) \Big)

sometimes steps 1 to 3 are done (in parallel) many times, and the average of x_i^{(ℓ−1)} δ_j^{(ℓ)} is taken for the update in step 4; this is called a mini-batch

basic NNet algorithm: backprop to compute the gradient efficiently
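A compact NumPy sketch of one backprop/SGD step under the squared error e_n = (y_n − NNet(x_n))^2 (an illustrative implementation under the slide's notation, assuming a single linear output; the helper names `forward` and `backprop_step` are my own, not the course's code).

```python
import numpy as np

def forward(x, weights):
    """Return layer outputs x(0), ..., x(L); hidden layers use tanh, output is linear."""
    xs = [np.asarray(x, dtype=float)]
    for l, W in enumerate(weights, start=1):
        s = W.T @ np.concatenate(([1.0], xs[-1]))           # s(l), with x_0 = 1
        xs.append(np.tanh(s) if l < len(weights) else s)
    return xs

def backprop_step(x_n, y_n, weights, eta=0.1):
    """One stochastic gradient descent update on e_n = (y_n - NNet(x_n))^2."""
    xs = forward(x_n, weights)                               # step 2: forward
    L = len(weights)
    delta = -2.0 * (y_n - xs[L])                             # step 3: delta(L) at linear output
    for l in range(L, 0, -1):
        x_prev = np.concatenate(([1.0], xs[l - 1]))
        grad = np.outer(x_prev, delta)                       # x_i(l-1) * delta_j(l)
        if l > 1:
            # delta(l-1)_j = sum_k delta(l)_k w_jk(l) * tanh'(s(l-1)_j), with tanh' = 1 - x^2
            delta = (weights[l - 1][1:, :] @ delta) * (1.0 - xs[l - 1] ** 2)
        weights[l - 1] -= eta * grad                         # step 4: gradient descent
    return weights

# tiny usage example: a 3-5-1 NNet with small random initial weights
rng = np.random.default_rng(1)
weights = [0.1 * rng.standard_normal((4, 5)), 0.1 * rng.standard_normal((6, 1))]
weights = backprop_step(np.array([0.5, -1.0, 2.0]), 0.3, weights)
```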
Fun Time
According to \frac{∂e_n}{∂w_{i1}^{(L)}} = −2\big(y_n − s_1^{(L)}\big) \cdot x_i^{(L−1)}, when would \frac{∂e_n}{∂w_{i1}^{(L)}} = 0?
1. y_n = s_1^{(L)}
2. x_i^{(L−1)} = 0
3. s_i^{(L−1)} = 0
4. all of the above

Reference Answer: 4
Note that x_i^{(L−1)} = tanh(s_i^{(L−1)}) = 0 if and only if s_i^{(L−1)} = 0.
Neural Network Optimization

E_in(w) = \frac{1}{N} \sum_{n=1}^{N} \text{err}\Big( \cdots \tanh\big( \sum_j w_{jk}^{(2)} \cdot \tanh\big( \sum_i w_{ij}^{(1)} x_{n,i} \big) \big), \; y_n \Big)

• generally non-convex when there are multiple hidden layers
  - not easy to reach the global minimum
  - GD/SGD with backprop only gives a local minimum
• different initial w_{ij}^{(ℓ)} =⇒ different local minimum
  - somewhat 'sensitive' to initial weights
  - large weights =⇒ saturate (small gradient)
  - advice: try some random & small ones

NNet: difficult to optimize, but practically works
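A tiny illustration (my own, not from the lecture) of why large initial weights saturate: with tanh the gradient factor 1 − tanh²(s) is nearly zero once |s| is large, so small random initial weights keep the neurons in the responsive region.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                       # some input pattern

for scale in (0.1, 1.0, 10.0):
    w = scale * rng.standard_normal(10)           # initial weights of one neuron
    s = w @ x                                     # the neuron's score
    print(f"scale {scale:5.1f}: |s| = {abs(s):7.2f}, tanh'(s) = {1 - np.tanh(s)**2:.2e}")
```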
VC Dimension of Neural Network Model

roughly, with tanh-like transfer functions: d_VC = O(V D), where V = # of neurons, D = # of weights

• pros: can approximate 'anything' if there are enough neurons (V large)
• cons: can overfit if there are too many neurons

NNet: watch out for overfitting!
Regularization for Neural Network

basic choice: our old friend, the weight-decay (L2) regularizer Ω(w) = \sum \big(w_{ij}^{(ℓ)}\big)^2
• 'shrink' weights: large weight → large shrink; small weight → small shrink
• want w_{ij}^{(ℓ)} = 0 (sparse) to effectively decrease d_VC
  - L1 regularizer: \sum \big|w_{ij}^{(ℓ)}\big|, but not differentiable
  - weight-elimination ('scaled' L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer: \sum \frac{\big(w_{ij}^{(ℓ)}\big)^2}{1 + \big(w_{ij}^{(ℓ)}\big)^2}
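A short sketch (my own illustration, assuming the formulas above) of the weight-elimination regularizer and its gradient, which can simply be added to the backprop gradient when the regularized error is minimized:

```python
import numpy as np

def weight_elimination(W, lam):
    """Regularization term lam * sum w^2 / (1 + w^2) and its gradient w.r.t. W."""
    penalty = lam * np.sum(W**2 / (1.0 + W**2))
    grad = lam * 2.0 * W / (1.0 + W**2) ** 2       # d/dw [w^2/(1+w^2)] = 2w/(1+w^2)^2
    return penalty, grad

W = np.array([[0.01, 3.0], [-0.5, 10.0]])
penalty, grad = weight_elimination(W, lam=0.001)
print(penalty)
print(grad)   # large weights get a nearly vanishing gradient, i.e. little extra shrink
```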
Yet Another Regularization: Early Stopping

• GD/SGD (backprop) visits more weight combinations as t increases
• smaller t effectively decreases d_VC
• better to 'stop in the middle': early stopping

[figure: in-sample error, model complexity, and out-of-sample error versus VC dimension, with the best d_VC* in the middle (remember? :-)); E_in and E_test versus iteration t on a log scale, with the best t* well before E_in bottoms out]

when to stop? validation!
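A minimal early-stopping loop as a sketch of how validation picks t* (an assumed illustration; it reuses the hypothetical `forward` and `backprop_step` helpers from the earlier sketch and a held-out validation set):

```python
import numpy as np

def early_stopping_train(X, y, X_val, y_val, weights, epochs=2000, eta=0.1):
    """Run SGD with backprop, track validation error, and keep the best-t weights."""
    rng = np.random.default_rng(0)
    best_err, best_weights, best_t = np.inf, [W.copy() for W in weights], 0
    for t in range(epochs):
        n = rng.integers(len(X))                             # stochastic: pick one example
        weights = backprop_step(X[n], y[n], weights, eta)    # one forward/backward/update
        val_err = np.mean([(y_val[m] - forward(X_val[m], weights)[-1][0]) ** 2
                           for m in range(len(X_val))])
        if val_err < best_err:                               # remember the best iteration t*
            best_err, best_weights, best_t = val_err, [W.copy() for W in weights], t
    return best_weights, best_t
```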
Fun Time
For the weight-elimination regularizer \sum \frac{\big(w_{ij}^{(ℓ)}\big)^2}{1 + \big(w_{ij}^{(ℓ)}\big)^2}, what is \frac{∂\,\text{regularizer}}{∂w_{ij}^{(ℓ)}}?
1. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^1
2. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^2
3. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^3
4. 2 w_{ij}^{(ℓ)} \big/ \big(1 + (w_{ij}^{(ℓ)})^2\big)^4

Reference Answer: 2
Too much calculus in this class, huh? :-)
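For completeness, the quotient-rule calculation behind the answer (my addition, not on the slide):

\frac{\partial}{\partial w}\,\frac{w^2}{1+w^2} = \frac{2w\,(1+w^2) − w^2 \cdot 2w}{(1+w^2)^2} = \frac{2w}{(1+w^2)^2}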
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
Motivation: multi-layer for power, with biological inspirations
Neural Network Hypothesis: layered pattern extraction until a linear hypothesis
Neural Network Learning: backprop to compute the gradient efficiently
Optimization and Regularization: tricks on initialization, regularizer, early stopping

• next: making the neural network 'deeper'