(1)

Machine Learning Techniques (機器學習技法)

Lecture 12: Neural Network

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

(2)

Neural Network

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 11: Gradient Boosted Decision Tree
aggregating trees from functional gradient and steepest descent, subject to any error measure

3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network
Motivation
Neural Network Hypothesis
Neural Network Learning
Optimization and Regularization

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/23

(3)

Neural Network Motivation

Linear Aggregation of Perceptrons: Pictorial View

[Figure: inputs $x_0 = 1, x_1, x_2, \ldots, x_d$ feed $T$ perceptrons $g_1, \ldots, g_T$ (with weight vectors $w_1, \ldots, w_T$), whose $\pm 1$ outputs are linearly aggregated with votes $\alpha_1, \ldots, \alpha_T$ into $G$.]

$$G(x) = \mathrm{sign}\Bigl(\sum_{t=1}^{T} \alpha_t \underbrace{\mathrm{sign}\bigl(w_t^T x\bigr)}_{g_t(x)}\Bigr)$$

two layers of weights: $w_t$ and $\alpha$
two layers of sign functions: in $g_t$ and in $G$

what boundary can $G$ implement?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/23

(4)

Neural Network Motivation

Logic Operations with Aggregation

[Figure: the $(g_1, g_2)$ plane with the four sign combinations, and a network where inputs $x_0 = 1, x_1, x_2, \ldots, x_d$ feed $g_1$ (weights $w_1$) and $g_2$ (weights $w_2$); together with a constant $+1$ unit they are aggregated with $\alpha_0 = -1$, $\alpha_1 = +1$, $\alpha_2 = +1$ into $G(x)$.]

$$G(x) = \mathrm{sign}\bigl(-1 + g_1(x) + g_2(x)\bigr)$$

• $g_1(x) = g_2(x) = +1$ (TRUE): $G(x) = +1$ (TRUE)
• otherwise: $G(x) = -1$ (FALSE)
• $G \equiv \mathrm{AND}(g_1, g_2)$

OR, NOT can be similarly implemented

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/23
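The AND construction above can be checked by brute force. Below is a minimal Python sketch (not from the slides; `sign` follows the slide's $\pm 1$ convention) that enumerates the four possible $(g_1(x), g_2(x))$ values and confirms $G(x) = \mathrm{sign}(-1 + g_1(x) + g_2(x))$ behaves as AND:

```python
# Enumerate the four (g1, g2) output combinations in {-1, +1} and check that
# G = sign(-1 + g1 + g2) equals AND(g1, g2).
def sign(v):
    return 1 if v > 0 else -1

for g1 in (-1, +1):
    for g2 in (-1, +1):
        G = sign(-1 + g1 + g2)
        assert G == (+1 if (g1 == +1 and g2 == +1) else -1)
        print(f"g1={g1:+d}  g2={g2:+d}  G={G:+d}")
```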

(5)

Neural Network Motivation

Powerfulness and Limitation

[Figure: decision boundaries from aggregating 8 perceptrons and 16 perceptrons versus a smooth target boundary; and the $(g_1, g_2)$ plane for $\mathrm{XOR}(g_1, g_2)$, whose $+1$ and $-1$ regions sit on opposite diagonals.]

'convex set' hypotheses implemented: $d_{VC} \to \infty$, remember? :-)
powerfulness: enough perceptrons ≈ smooth boundary

limitation: XOR not 'linear separable' under $\phi(x) = (g_1(x), g_2(x))$

how to implement $\mathrm{XOR}(g_1, g_2)$?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/23

(6)

Neural Network Motivation

Multi-Layer Perceptrons: Basic Neural Network

non-separable data: can use more transform
how about one more layer of AND transform?

$$\mathrm{XOR}(g_1, g_2) = \mathrm{OR}\bigl(\mathrm{AND}(-g_1, g_2),\ \mathrm{AND}(g_1, -g_2)\bigr)$$

[Figure: inputs $x_0 = 1, x_1, \ldots, x_d$ feed $g_1$ (weights $w_1$) and $g_2$ (weights $w_2$); a second layer of AND units (with $+1$, $-1$, $-1$ connections) and an OR unit produce $G = \mathrm{XOR}(g_1, g_2)$.]

perceptron (simple)
⟹ aggregation of perceptrons (powerful)
⟹ multi-layer perceptrons (more powerful)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/23
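Continuing the sketch style above (again, illustrative code not from the course), the two-layer construction $\mathrm{XOR}(g_1, g_2) = \mathrm{OR}(\mathrm{AND}(-g_1, g_2), \mathrm{AND}(g_1, -g_2))$ can be written out with the same $\pm 1$ convention, using $\mathrm{AND}(a,b) = \mathrm{sign}(-1+a+b)$ and $\mathrm{OR}(a,b) = \mathrm{sign}(+1+a+b)$:

```python
# XOR from one extra layer of logic transforms on {-1, +1} outputs.
def sign(v):
    return 1 if v > 0 else -1

def AND(a, b):
    return sign(-1 + a + b)

def OR(a, b):
    return sign(+1 + a + b)

def XOR(g1, g2):
    return OR(AND(-g1, g2), AND(g1, -g2))

for g1 in (-1, +1):
    for g2 in (-1, +1):
        print(g1, g2, XOR(g1, g2))   # +1 exactly when g1 != g2
```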

(7)

Neural Network Motivation

Connection to Biological Neurons

[Figure: photos illustrating biological neurons and the brain — image credits: UC Regents Davis campus / brainmaps.org (CC BY 3.0 via Wikimedia Commons); Lauris Rubenis (CC BY 2.0 via https://flic.kr/p/fkVuZX); Pedro Ribeiro Simões (CC BY 2.0 via https://flic.kr/p/adiV7b) — alongside the NNet diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$.]

neural network: bio-inspired model

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/23

(8)

Neural Network Motivation

Fun Time

Let $g_0(x) = +1$. Which of the following $(\alpha_0, \alpha_1, \alpha_2)$ allows $G(x) = \mathrm{sign}\Bigl(\sum_{t=0}^{2} \alpha_t\, g_t(x)\Bigr)$ to implement $\mathrm{OR}(g_1, g_2)$?

1 $(-3, +1, +1)$
2 $(-1, +1, +1)$
3 $(+1, +1, +1)$
4 $(+3, +1, +1)$

Reference Answer: 3

You can easily verify with all four possibilities of $(g_1(x), g_2(x))$.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/23
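A quick worked check of the reference answer (not on the slide itself): with $(\alpha_0, \alpha_1, \alpha_2) = (+1, +1, +1)$,

$$G(x) = \mathrm{sign}\bigl(+1 + g_1(x) + g_2(x)\bigr),$$

which is $-1$ only when $g_1(x) = g_2(x) = -1$ (the sum is $-1$) and $+1$ for the other three combinations (the sum is at least $+1$), i.e. exactly $\mathrm{OR}(g_1, g_2)$; choice 2, $(-1, +1, +1)$, would instead give $\mathrm{AND}(g_1, g_2)$ as on the earlier slide.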


(10)

Neural Network Neural Network Hypothesis

Neural Network Hypothesis: Output

[Figure: the NNet diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$, $+1$ bias units in the hidden layers, and an OUTPUT node with weights $w$.]

OUTPUT: simply a linear model with $s = w^T \phi^{(2)}\bigl(\phi^{(1)}(x)\bigr)$

any linear model can be used—remember? :-)

linear classification: $h(x) = \mathrm{sign}(s)$, err = 0/1
linear regression: $h(x) = s$, err = squared
logistic regression: $h(x) = \theta(s)$, err = cross-entropy

will discuss 'regression' with squared error

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/23

(11)

Neural Network Neural Network Hypothesis

Neural Network Hypothesis: Transformation

transformation: function of score (signal) $s$
• any transformation?
• linear: whole network remains linear & thus less useful
• sign: discrete & thus hard to optimize for $w$

popular choice of transformation: $\tanh(s)$
• 'analog' approximation of sign: easier to optimize
• somewhat closer to biological neuron
not that new! :-)

[Figure: the NNet diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$ and $+1$ bias units, comparing linear, sign, and tanh transformations at the neurons.]

$$\tanh(s) = \frac{\exp(s) - \exp(-s)}{\exp(s) + \exp(-s)} = 2\theta(2s) - 1$$

will discuss with tanh as transformation function

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/23
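The identity $\tanh(s) = 2\theta(2s) - 1$ (with $\theta$ the logistic function) can be verified numerically; below is a minimal Python sketch, not part of the course material:

```python
# Check tanh(s) = (exp(s) - exp(-s)) / (exp(s) + exp(-s)) = 2 * theta(2s) - 1,
# where theta(u) = 1 / (1 + exp(-u)) is the logistic function.
import math

def theta(u):
    return 1.0 / (1.0 + math.exp(-u))

for s in (-2.0, -0.5, 0.0, 0.5, 2.0):
    lhs = math.tanh(s)
    rhs = 2.0 * theta(2.0 * s) - 1.0
    assert abs(lhs - rhs) < 1e-12
    print(f"s={s:+.1f}  tanh(s)={lhs:+.6f}  2*theta(2s)-1={rhs:+.6f}")
```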

(12)

Neural Network Neural Network Hypothesis

Neural Network Hypothesis

[Figure: the NNet diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$, $+1$ bias units, tanh hidden neurons, and weights $w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}$; one highlighted neuron computes the score $s_3^{(2)}$ and outputs $x_3^{(2)} = \tanh\bigl(s_3^{(2)}\bigr)$.]

$d^{(0)}$-$d^{(1)}$-$d^{(2)}$-$\cdots$-$d^{(L)}$ Neural Network (NNet)

weights $w_{ij}^{(\ell)}$: $1 \le \ell \le L$ layers, $0 \le i \le d^{(\ell-1)}$ inputs, $1 \le j \le d^{(\ell)}$ outputs

score $\displaystyle s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)}\, x_i^{(\ell-1)}$, transformed $x_j^{(\ell)} = \begin{cases} \tanh\bigl(s_j^{(\ell)}\bigr) & \text{if } \ell < L \\ s_j^{(\ell)} & \text{if } \ell = L \end{cases}$

apply $x$ as input layer $x^{(0)}$, go through hidden layers to get $x^{(\ell)}$, predict at output layer $x_1^{(L)}$

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/23
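To make the layered definition concrete, here is a minimal NumPy sketch of the forward pass (illustrative code, not the course's implementation; the bias handling and weight shapes are my own conventions): each layer computes $s^{(\ell)}$ from $x^{(\ell-1)}$, applies $\tanh$ on hidden layers, and leaves the output layer linear.

```python
import numpy as np

def forward(x, weights):
    """Forward pass of a d(0)-d(1)-...-d(L) NNet.
    x: raw input of shape (d0,); weights[l-1]: array of shape (d_{l-1}+1, d_l),
    whose first row holds the bias weights for the constant x_0 = +1."""
    xs = [np.concatenate(([1.0], x))]                  # x^(0), with x_0 = 1 prepended
    L = len(weights)
    for l, W in enumerate(weights, start=1):
        s = xs[-1] @ W                                 # s_j^(l) = sum_i w_ij^(l) x_i^(l-1)
        if l < L:
            xs.append(np.concatenate(([1.0], np.tanh(s))))  # tanh + bias for next layer
        else:
            xs.append(s)                               # output layer stays linear
    return xs                                          # xs[-1] is the prediction x_1^(L)

# example: a 3-5-1 NNet with small random weights
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 5)), rng.normal(scale=0.1, size=(6, 1))]
print(forward(np.array([0.5, -1.0, 2.0]), weights)[-1])
```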

(13)

Neural Network Neural Network Hypothesis

Physical Interpretation

[Figure: the same NNet diagram with weights $w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}$, highlighting the neuron that computes $s_3^{(2)}$ and $x_3^{(2)}$.]

each layer: transformation to be learned from data

• $\displaystyle \phi^{(\ell)}(x) = \tanh\Bigl(\sum_{i=0}^{d^{(\ell-1)}} w_{i1}^{(\ell)}\, x_i^{(\ell-1)}\Bigr), \ldots$
—whether $x$ 'matches' the weight vectors in pattern

NNet: pattern extraction with layers of connection weights

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/23

(14)

Neural Network Neural Network Hypothesis

Fun Time

How many weights $\{w_{ij}^{(\ell)}\}$ are there in a 3-5-1 NNet?

1 9
2 15
3 20
4 26

Reference Answer: 4

There are $(3 + 1) \times 5$ weights in $w_{ij}^{(1)}$, and $(5 + 1) \times 1$ weights in $w_{jk}^{(2)}$.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/23
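The same counting works for any architecture; a short helper (an illustrative sketch, not course code) makes the $(d^{(\ell-1)} + 1) \times d^{(\ell)}$ bookkeeping explicit:

```python
# Each layer l contributes (d^(l-1) + 1) * d^(l) weights; the +1 accounts for x_0.
def count_weights(dims):
    return sum((dims[l - 1] + 1) * dims[l] for l in range(1, len(dims)))

print(count_weights([3, 5, 1]))   # (3+1)*5 + (5+1)*1 = 26
```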


(16)

Neural Network Neural Network Learning

How to Learn the Weights?

[Figure: the NNet diagram with weights $w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}$.]

goal: learning all $\{w_{ij}^{(\ell)}\}$ to minimize $E_{in}\bigl(\{w_{ij}^{(\ell)}\}\bigr)$

• one hidden layer: simply aggregation of perceptrons
—gradient boosting to determine hidden neurons one by one
• multiple hidden layers? not easy

let $e_n = \bigl(y_n - \mathrm{NNet}(x_n)\bigr)^2$: can apply (stochastic) GD after computing $\dfrac{\partial e_n}{\partial w_{ij}^{(\ell)}}$!

next: efficient computation of $\dfrac{\partial e_n}{\partial w_{ij}^{(\ell)}}$

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/23

(17)

Neural Network Neural Network Learning

Computing $\dfrac{\partial e_n}{\partial w_{i1}^{(L)}}$ (Output Layer)

$$e_n = \bigl(y_n - \mathrm{NNet}(x_n)\bigr)^2 = \Bigl(y_n - s_1^{(L)}\Bigr)^2 = \Bigl(y_n - \sum_{i=0}^{d^{(L-1)}} w_{i1}^{(L)}\, x_i^{(L-1)}\Bigr)^2$$

specially (output layer, $0 \le i \le d^{(L-1)}$):
$$\frac{\partial e_n}{\partial w_{i1}^{(L)}} = \frac{\partial e_n}{\partial s_1^{(L)}} \cdot \frac{\partial s_1^{(L)}}{\partial w_{i1}^{(L)}} = -2\Bigl(y_n - s_1^{(L)}\Bigr) \cdot \Bigl(x_i^{(L-1)}\Bigr)$$

generally ($1 \le \ell < L$, $0 \le i \le d^{(\ell-1)}$, $1 \le j \le d^{(\ell)}$):
$$\frac{\partial e_n}{\partial w_{ij}^{(\ell)}} = \frac{\partial e_n}{\partial s_j^{(\ell)}} \cdot \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = \delta_j^{(\ell)} \cdot \Bigl(x_i^{(\ell-1)}\Bigr)$$

$\delta_1^{(L)} = -2\bigl(y_n - s_1^{(L)}\bigr)$; how about the others?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/23

(18)

Neural Network Neural Network Learning

Computing $\delta_j^{(\ell)} = \dfrac{\partial e_n}{\partial s_j^{(\ell)}}$

$$s_j^{(\ell)} \;\overset{\tanh}{\Longrightarrow}\; x_j^{(\ell)} \;\overset{w_{jk}^{(\ell+1)}}{\Longrightarrow}\; \begin{bmatrix} s_1^{(\ell+1)} \\ \vdots \\ s_k^{(\ell+1)} \\ \vdots \end{bmatrix} \Longrightarrow \cdots \Longrightarrow e_n$$

$$\delta_j^{(\ell)} = \frac{\partial e_n}{\partial s_j^{(\ell)}} = \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial e_n}{\partial s_k^{(\ell+1)}} \cdot \frac{\partial s_k^{(\ell+1)}}{\partial x_j^{(\ell)}} \cdot \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} = \sum_{k} \Bigl(\delta_k^{(\ell+1)}\Bigr) \Bigl(w_{jk}^{(\ell+1)}\Bigr) \Bigl(\tanh'\bigl(s_j^{(\ell)}\bigr)\Bigr)$$

$\delta_j^{(\ell)}$ can be computed backwards from $\delta_k^{(\ell+1)}$

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/23

(19)

Neural Network Neural Network Learning

Backpropagation (Backprop) Algorithm

Backprop on NNet

initialize all weights $w_{ij}^{(\ell)}$
for $t = 0, 1, \ldots, T$
1 stochastic: randomly pick $n \in \{1, 2, \cdots, N\}$
2 forward: compute all $x_i^{(\ell)}$ with $x^{(0)} = x_n$
3 backward: compute all $\delta_j^{(\ell)}$ subject to $x^{(0)} = x_n$
4 gradient descent: $w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta\, x_i^{(\ell-1)} \delta_j^{(\ell)}$

return $g_{\mathrm{NNET}}(x) = \Bigl(\cdots \tanh\Bigl(\sum_j w_{jk}^{(2)} \cdot \tanh\Bigl(\sum_i w_{ij}^{(1)} x_i\Bigr)\Bigr)\Bigr)$

sometimes steps 1 to 3 are done (in parallel) many times, and the average of the resulting $x_i^{(\ell-1)} \delta_j^{(\ell)}$ is taken for the update in step 4; this is called mini-batch

basic NNet algorithm: backprop to compute the gradient efficiently

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/23
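Putting the four steps together, here is a minimal NumPy sketch of stochastic backprop for squared error (illustrative code under my own conventions, reusing the hypothetical `forward` helper sketched earlier; not the course's implementation):

```python
import numpy as np

def backprop_sgd(X, y, weights, eta=0.01, T=10000, seed=0):
    """SGD with backprop: squared error, tanh hidden units, linear output.
    weights[l-1] has shape (d_{l-1}+1, d_l), first row = bias weights."""
    rng = np.random.default_rng(seed)
    L = len(weights)
    for _ in range(T):
        n = rng.integers(len(X))                 # 1. stochastic: pick n
        xs = forward(X[n], weights)              # 2. forward: all x^(l)
        delta = -2.0 * (y[n] - xs[-1])           # 3. backward: delta^(L) = -2(y_n - s^(L))
        for l in range(L, 0, -1):
            grad = np.outer(xs[l - 1], delta)    # de_n/dw^(l) = x^(l-1) * delta^(l)
            if l > 1:                            # propagate delta before updating w^(l)
                # delta^(l-1)_j = sum_k delta^(l)_k w^(l)_jk * tanh'(s^(l-1)_j),
                # with tanh'(s) = 1 - tanh(s)^2 = 1 - (x^(l-1)_j)^2; skip the bias row
                delta = (weights[l - 1][1:] @ delta) * (1.0 - xs[l - 1][1:] ** 2)
            weights[l - 1] -= eta * grad         # 4. gradient descent update
    return weights
```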

(20)

Neural Network Neural Network Learning

Fun Time

According to $\dfrac{\partial e_n}{\partial w_{i1}^{(L)}} = -2\bigl(y_n - s_1^{(L)}\bigr) \cdot \bigl(x_i^{(L-1)}\bigr)$, when would $\dfrac{\partial e_n}{\partial w_{i1}^{(L)}} = 0$?

1 $y_n = s_1^{(L)}$
2 $x_i^{(L-1)} = 0$
3 $s_i^{(L-1)} = 0$
4 all of the above

Reference Answer: 4

Note that $x_i^{(L-1)} = \tanh\bigl(s_i^{(L-1)}\bigr) = 0$ if and only if $s_i^{(L-1)} = 0$.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/23


(22)

Neural Network Optimization and Regularization

Neural Network Optimization

$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\Bigl(\cdots \tanh\Bigl(\sum_{j} w_{jk}^{(2)} \cdot \tanh\Bigl(\sum_{i} w_{ij}^{(1)} x_{n,i}\Bigr)\Bigr),\ y_n\Bigr)$$

generally non-convex when multiple hidden layers
• not easy to reach global minimum
• GD/SGD with backprop only gives local minimum

different initial $w_{ij}^{(\ell)}$ ⟹ different local minimum
• somewhat 'sensitive' to initial weights
• large weights ⟹ saturate (small gradient)
• advice: try some random & small ones

NNet: difficult to optimize, but practically works

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/23
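Following the slide's advice of 'random & small' initial weights (to keep the tanh units away from their saturated, small-gradient region), a minimal sketch might look like this; the scale 0.1 is an arbitrary illustrative choice, not a recommendation from the course:

```python
import numpy as np

def init_weights(dims, scale=0.1, seed=0):
    """Small uniform random weights for a d(0)-d(1)-...-d(L) NNet,
    with one extra bias row per layer for the constant +1 input."""
    rng = np.random.default_rng(seed)
    return [rng.uniform(-scale, scale, size=(dims[l - 1] + 1, dims[l]))
            for l in range(1, len(dims))]

weights = init_weights([3, 5, 1])   # ready for the backprop sketch above
```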

(23)

Neural Network Optimization and Regularization

VC Dimension of Neural Network Model

roughly, with tanh-like transfer functions: $d_{VC} = O(VD)$, where $V$ = # of neurons, $D$ = # of weights

[Figure: the NNet diagram with inputs $x_0 = 1, x_1, x_2, \ldots, x_d$, tanh hidden neurons, and weights $w_{ij}^{(1)}, w_{jk}^{(2)}, w_{kq}^{(3)}$.]

pros: can approximate 'anything' if enough neurons ($V$ large)
cons: can overfit if too many neurons

NNet: watch out for overfitting!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/23

(24)

Neural Network Optimization and Regularization

Regularization for Neural Network

basic choice: old friend weight-decay (L2) regularizer $\Omega(w) = \sum \bigl(w_{ij}^{(\ell)}\bigr)^2$
• 'shrink' weights: large weight → large shrink; small weight → small shrink
• want $w_{ij}^{(\ell)} = 0$ (sparse) to effectively decrease $d_{VC}$
• L1 regularizer: $\sum \bigl|w_{ij}^{(\ell)}\bigr|$, but not differentiable
• weight-elimination ('scaled' L2) regularizer: large weight → median shrink; small weight → median shrink

weight-elimination regularizer: $\displaystyle\sum \frac{\bigl(w_{ij}^{(\ell)}\bigr)^2}{1 + \bigl(w_{ij}^{(\ell)}\bigr)^2}$

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/23
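For concreteness, here is a small sketch (illustrative names, not course code) of the two differentiable regularizers and their gradients, which would simply be added, scaled by a regularization parameter $\lambda$, to the backprop gradient of each weight matrix:

```python
import numpy as np

def l2_regularizer(weights):
    """Weight-decay (L2): sum of squared weights."""
    return sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W):
    return 2.0 * W

def weight_elimination_regularizer(weights):
    """'Scaled' L2: sum of w^2 / (1 + w^2), which pushes small weights toward 0."""
    return sum(np.sum(W ** 2 / (1.0 + W ** 2)) for W in weights)

def weight_elimination_gradient(W):
    # d/dw [ w^2 / (1 + w^2) ] = 2w / (1 + w^2)^2
    return 2.0 * W / (1.0 + W ** 2) ** 2
```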

(25)

Neural Network Optimization and Regularization

Yet Another Regularization: Early Stopping

GD/SGD (backprop) visits more weight combinations as $t$ increases
• smaller $t$ effectively decreases $d_{VC}$

[Figure: nested regions of weight space reachable from $w_0$ through $w_1, w_2, w_3$, the largest labelled $H_3$.]

better 'stop in middle': early stopping

[Figure: the usual error-versus-$d_{VC}$ curves (in-sample error, model complexity, out-of-sample error) with the best $d_{VC}$ in the middle, remember? :-); and a plot of $E_{in}$ and $E_{test}$ (on a $\log_{10}$ error scale) versus iteration $t$ from $10^2$ to $10^4$, where $E_{test}$ is lowest at an intermediate $t$.]

when to stop? validation!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/23
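A common way to realize early stopping with validation is sketched below (illustrative code; `train_one_epoch` and `error` are hypothetical helpers standing in for the SGD/backprop update and an error measure): keep the weights that achieved the lowest validation error so far and stop once it has not improved for a while.

```python
import copy

def train_with_early_stopping(weights, train_set, val_set,
                              max_epochs=200, patience=20):
    best_val = float("inf")
    best_weights = copy.deepcopy(weights)
    since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(weights, train_set)      # e.g. one pass of SGD + backprop updates
        e_val = error(weights, val_set)          # validation error decides when to stop
        if e_val < best_val:
            best_val = e_val
            best_weights = copy.deepcopy(weights)
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:           # no improvement for `patience` epochs
                break
    return best_weights                          # weights at the validation minimum
```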

(26)

Neural Network Optimization and Regularization

Fun Time

For the weight-elimination regularizer $\displaystyle\sum \frac{\bigl(w_{ij}^{(\ell)}\bigr)^2}{1 + \bigl(w_{ij}^{(\ell)}\bigr)^2}$, what is $\dfrac{\partial\,\mathrm{regularizer}}{\partial w_{ij}^{(\ell)}}$?

1 $2 w_{ij}^{(\ell)} \Big/ \Bigl(1 + \bigl(w_{ij}^{(\ell)}\bigr)^2\Bigr)^{1}$
2 $2 w_{ij}^{(\ell)} \Big/ \Bigl(1 + \bigl(w_{ij}^{(\ell)}\bigr)^2\Bigr)^{2}$
3 $2 w_{ij}^{(\ell)} \Big/ \Bigl(1 + \bigl(w_{ij}^{(\ell)}\bigr)^2\Bigr)^{3}$
4 $2 w_{ij}^{(\ell)} \Big/ \Bigl(1 + \bigl(w_{ij}^{(\ell)}\bigr)^2\Bigr)^{4}$

Reference Answer: 2

Too much calculus in this class, huh? :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/23
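The reference answer can also be spot-checked numerically (a small sketch, not from the course) by comparing the closed form against a central finite difference of $w^2 / (1 + w^2)$:

```python
# Verify d/dw [ w^2 / (1 + w^2) ] = 2w / (1 + w^2)^2 at a few points.
f = lambda w: w ** 2 / (1.0 + w ** 2)
for w in (-1.5, -0.3, 0.7, 2.0):
    closed = 2.0 * w / (1.0 + w ** 2) ** 2
    numeric = (f(w + 1e-6) - f(w - 1e-6)) / 2e-6
    assert abs(closed - numeric) < 1e-6
    print(f"w={w:+.1f}  closed={closed:+.6f}  numeric={numeric:+.6f}")
```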


(28)

Neural Network Optimization and Regularization

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

3

Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network

Motivation

multi-layer for power with biological inspirations
Neural Network Hypothesis
layered pattern extraction until linear hypothesis
Neural Network Learning
backprop to compute gradient efficiently
Optimization and Regularization
tricks on initialization, regularizer, early stopping

next: making neural network ‘deeper’

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/23
