(1)

Neural Network Basics

Applied Deep Learning

September 15th, 2022 http://adl.miulab.tw

(2)

Learning ≈ Looking for a Function

Speech Recognition

Handwriting Recognition

Weather forecast

Play video games

2

Each task is a function f mapping an input to an output:

Handwriting recognition: f(image) = "2"
Speech recognition: f(audio) = "你好" ("hello")
Weather forecast: f(Thursday) = "Saturday"
Playing video games: f(game screen) = "move left"

(3)

Machine Learning Framework

3

Training is to pick the best function given the observed data; testing is to predict the label using the learned function.

Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …  (x: function input, ŷ: function output / label)

Model: hypothesis function set {f1, f2, …}

Training: pick the best function f*

Testing: y = f*(x) for testing data (x, ?)
e.g., f*("It claims too much.") = − (negative)

(4)

How to Train a Model?

4

(5)

Training is to pick the best function given the observed data; testing is to predict the label using the learned function.

Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …  (x: function input, ŷ: function output / label)

Model: hypothesis function set {f1, f2, …}

Training: pick the best function f*

Testing: y = f*(x) for testing data (x, ?)
e.g., f*("It claims too much.") = − (negative)

Machine Learning Framework

5

Training Procedure

(6)

Training Procedure

Q1. What is the model? (function hypothesis set)

Q2. What does a “good” function mean?

Q3. How do we pick the “best” function?

6

Model: hypothesis function set {f1, f2, …}

Training: pick the best function f*

(7)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 7

(8)

What is the Model?

8

(9)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 9

(10)

Classification Task

Sentiment Analysis

Speech Phoneme Recognition

Handwriting Recognition

10

Sentiment analysis: "這規格有誠意!" ("These specs show real effort!") → +,  "太爛了吧~" ("This is so bad~") → −
Speech phoneme recognition: audio frame → /h/
Handwriting recognition: image → "2"

Binary classification: input object → Class A (yes) or Class B (no)
Multi-class classification: input object → Class A, Class B, or Class C

Some cases are not easy to formulate as classification problems.

(11)

Target Function

Classification Task

x: input object to be classified

y: class/label

11

f(x) = y,  f : R^N → R^M

Assume both x and y can be represented as fixed-size vectors:
x → an N-dim vector
y → an M-dim vector

(12)

Vector Representation Example

Handwriting Digit Classification

12

x: image, y: class/label, f : R^N → R^M

Input x: each pixel corresponds to an element in the vector (1 for ink, 0 otherwise); a 16 x 16 image → 16 x 16 = 256 dimensions.

Output y: one element per class ("1" or not, "2" or not, "3" or not, …) → 10 dimensions for digit recognition; e.g., "1" → (1, 0, 0, …), "2" → (0, 1, 0, …).
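As a concrete illustration of this encoding, here is a minimal NumPy sketch (not part of the original slides); the stroke drawn into the image and the class ordering are assumptions made for the example.

import numpy as np

# Input x: a 16 x 16 binary image (1 for ink, 0 otherwise), flattened to 256 dims.
image = np.zeros((16, 16), dtype=np.float32)
image[2:14, 7] = 1.0                       # a crude vertical stroke, roughly a "1"
x = image.reshape(-1)                      # shape (256,)

# Output y: a 10-dim one-hot vector, one element per class ("1" or not, "2" or not, ...).
classes = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0"]   # assumed class ordering
y = np.zeros(len(classes), dtype=np.float32)
y[classes.index("1")] = 1.0                # "1" -> (1, 0, 0, ...)

print(x.shape, y)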

(13)

Vector Representation Example

Sentiment Analysis

13

x: word, y: class/label, f : R^N → R^M

Input x: each element of the vector corresponds to a word in the vocabulary (1 indicates the word, 0 otherwise), so the dimension equals the vocabulary size; e.g., "love" → (…, 0, 1, 0, …).

Output y: 3 dimensions (positive, negative, neutral), i.e., "+" or not, "−" or not, "?" or not; e.g., "+" → (1, 0, 0).

(14)

Target Function

Classification Task

x: input object to be classified

y: class/label

14

f(x) = y,  f : R^N → R^M

Assume both x and y can be represented as fixed-size vectors:
x → an N-dim vector
y → an M-dim vector

(15)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 15

(16)

A Single Neuron

16

z = w1 x1 + w2 x2 + … + wN xN + b   (b: bias)

y = σ(z), where the activation function σ is the sigmoid function σ(z) = 1 / (1 + e^(−z))

Each neuron is a very simple function
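The neuron above can be written in a few lines; the following NumPy sketch is illustrative, with made-up weights, bias, and input values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b       # z = w1 x1 + ... + wN xN + b
    return sigmoid(z)          # y = sigma(z)

x = np.array([1.0, -2.0, 0.5])   # illustrative input
w = np.array([0.4, 0.1, -0.3])   # illustrative weights
b = 0.2                          # illustrative bias
print(neuron(x, w, b))           # a value between 0 and 1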

(17)

A Single Neuron

17

z = w1 x1 + w2 x2 + … + wN xN + b · 1

The bias term is an "always on" feature (its input is fixed to 1)

Activation function: the sigmoid function σ(z) = 1 / (1 + e^(−z)), giving the output y = σ(z)

(18)

Why Bias?

18

The bias term gives a class prior

(19)

Model Parameters of A Single Neuron

19

z = w1 x1 + w2 x2 + … + wN xN + b,  y = σ(z) = 1 / (1 + e^(−z))

w, b are the parameters of this neuron

(20)

A Single Neuron

20

z = w1 x1 + w2 x2 + … + wN xN + b,  y = σ(z),  f : R^N → R^M

y ≥ 0.5 → is "2"
y < 0.5 → not "2"

A single neuron can only handle binary classification

(21)

A Layer of Neurons

Handwriting digit classification

21

f : R^N → R^M

A layer of neurons can produce multiple outputs; the prediction is the class whose output is the maximum.

y1: "1" or not
y2: "2" or not
y3: "3" or not
…
10 neurons for 10 classes → which one is max?
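A hedged NumPy sketch of such a layer for 10 digit classes follows; W, b, and x are random placeholders rather than trained parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, M = 256, 10                        # 256 input pixels, 10 classes
W = rng.normal(size=(M, N))           # one row of weights per neuron / class
b = rng.normal(size=M)
x = rng.random(N)                     # a flattened image

y = sigmoid(W @ x + b)                # one output per class ("1" or not, "2" or not, ...)
predicted_class = int(np.argmax(y))   # the result depends on the max one
print(y.round(2), predicted_class)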

(22)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 22

(23)

A Layer of Neurons – Perceptron

Output units all operate separately – no shared weights

23

Adjusting the weights moves the location, orientation, and steepness of the cliff

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

(24)

Expression of Perceptron

24

A perceptron computes its output from z = w1 x1 + w2 x2 + b; it is a linear separator, so it can represent AND, OR, NOT, etc., but not XOR

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

(25)

How to Implement XOR?

25

Truth table:
A  B | A xor B
0  0 | 0
0  1 | 1
1  0 | 1
1  1 | 0

A xor B = AB' + A'B

Multiple operations can produce more complicated outputs: compute intermediate terms from A and B (A', B', A + B', A' + B) and combine them to obtain AB' + A'B, as in the sketch below.
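The following NumPy sketch implements XOR with two layers of threshold units following the AB' + A'B decomposition above; the particular weights are one hand-picked solution, not the only one.

import numpy as np

def step(z):
    return (z > 0).astype(float)      # threshold unit

def xor(a, b):
    x = np.array([a, b], dtype=float)
    # Hidden layer: h1 = A AND (NOT B), h2 = (NOT A) AND B
    W1 = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0]])
    b1 = np.array([-0.5, -0.5])
    h = step(W1 @ x + b1)
    # Output layer: OR of the two hidden units -> AB' + A'B
    w2 = np.array([1.0, 1.0])
    b2 = -0.5
    return step(w2 @ h + b2)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor(a, b)))       # 0, 1, 1, 0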

(26)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 26

(27)

Neural Networks – Multi-Layer Perceptron

27

Inputs x1, x2 feed hidden units (z1 → a1, z2 → a2), which in turn feed the output y

(28)

Expression of Multi-Layer Perceptron

Continuous function w/ 2 layers

Combine two opposite-facing

threshold functions to make a ridge

28

Continuous function w/ 3 layers

Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

Multiple layers enhance the model's expressiveness → the model can approximate more complex functions

(29)

Deep Neural Networks (DNN)

Fully connected feedforward network

29

Input vector x (x1 … xN) → Layer 1 → Layer 2 → … → Layer L → output vector y (y1 … yM)

f : R^N → R^M

Deep NN: multiple hidden layers

(30)

Notation Definition

30

Layer l − 1 has N_{l−1} nodes (indexed by j = 1, 2, …); layer l has N_l nodes (indexed by i = 1, 2, …)

a_i^l : the output of neuron i at layer l

a^l : the output of one layer → a vector

(31)

Notation Definition

31

w_ij^l : the weight from neuron j (layer l − 1) to neuron i (layer l)

W^l : the weights between layer l − 1 and layer l → a matrix

(32)

Notation Definition

32

b_i^l : the bias for neuron i at layer l

b^l : the biases of all neurons at layer l → a vector (the bias input is fixed to 1)

(33)

Notation Definition

33

z_i^l : the input of the activation function for neuron i at layer l

z^l : the activation function inputs at layer l → a vector

(34)

Notation Summary

34

a_i^l : output of a neuron
a^l : output vector of a layer
z_i^l : input of the activation function
z^l : input vector of the activation function for a layer
w_ij^l : a weight
W^l : a weight matrix
b_i^l : a bias
b^l : a bias vector

(35)

Layer Output Relation

35

Layer l − 1 outputs a^(l−1) = (a_1^(l−1), a_2^(l−1), …); layer l computes z^l = (z_1^l, z_2^l, …) and outputs a^l = (a_1^l, a_2^l, …)

(36)

Layer Output Relation – from a to z

36

From a to z: z_i^l = Σ_j w_ij^l a_j^(l−1) + b_i^l, i.e., z^l = W^l a^(l−1) + b^l

(37)

Layer Output Relation – from z to a

37

From z to a: the activation function is applied element-wise, a_i^l = σ(z_i^l), i.e.,

(a_1^l, a_2^l, …)ᵀ = (σ(z_1^l), σ(z_2^l), …)ᵀ

a^l = σ(z^l)

(38)

Layer Output Relation

38

z^l = W^l a^(l−1) + b^l

a^l = σ(z^l)
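A minimal NumPy sketch of this layer relation, with placeholder layer sizes and random values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N_prev, N_l = 4, 3                     # N_{l-1} and N_l nodes
W_l = rng.normal(size=(N_l, N_prev))   # W^l: weights from layer l-1 to layer l
b_l = rng.normal(size=N_l)             # b^l: biases of layer l
a_prev = rng.random(N_prev)            # a^{l-1}: outputs of layer l-1

z_l = W_l @ a_prev + b_l               # z^l = W^l a^{l-1} + b^l
a_l = sigmoid(z_l)                     # a^l = sigma(z^l), applied element-wise
print(a_l)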

(39)

Neural Network Formulation

Fully connected feedforward network

39

Input vector x (x1 … xN) → Layer 1 → Layer 2 → … → Layer L → output vector y (y1 … yM),  f : R^N → R^M

y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
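A minimal NumPy sketch of the full forward pass under this formulation; the layer sizes and random parameters are placeholders, not trained values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    a = x
    for W, b in zip(weights, biases):   # Layer 1, Layer 2, ..., Layer L
        a = sigmoid(W @ a + b)          # a^l = sigma(W^l a^{l-1} + b^l)
    return a                            # y: an M-dim output vector

rng = np.random.default_rng(2)
sizes = [256, 64, 32, 10]               # N = 256 inputs, two hidden layers, M = 10 outputs
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

y = forward(rng.random(sizes[0]), weights, biases)
print(y.shape)                          # (10,)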

(40)

Neural Network Formulation

Fully connected feedforward network

40

Input vector x (x1 … xN) → Layer 1 → Layer 2 → … → Layer L → output vector y (y1 … yM),  f : R^N → R^M

(41)

Activation Function

41

(Figure: an example of a bounded activation function.)

(42)

Activation Function

42

(Figure: examples of activation function types — boolean, linear, and non-linear.)

(43)

Non-Linear Activation Function

Sigmoid

Tanh

Rectified Linear Unit (ReLU)

43

Non-linear functions are frequently used in neural networks
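For reference, the three activation functions named above can be written directly in NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # bounded in (0, 1)

def tanh(z):
    return np.tanh(z)                  # bounded in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # Rectified Linear Unit: max(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))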

(44)

Why Non-Linearity?

Function approximation

Without non-linearity, a deep neural network behaves the same as a single linear transform

With non-linearity, networks with more layers can approximate more complex functions

44

http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf
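A small NumPy sketch of this claim: stacking two linear layers collapses into one linear transform, while inserting a non-linearity between them does not.

import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
x = rng.random(4)

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_linear_layer))   # True: same linear map

# With a non-linearity in between, the equivalence breaks:
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(W2 @ sigmoid(W1 @ x), (W2 @ W1) @ x))    # False in general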

(45)

What Does a "Good" Function Mean?

45

(46)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 46

(47)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 47

(48)

Function = Model Parameters

Formal definition

48

Different parameters W and b → different functions (the function set)

Model parameter set: θ = {W^1, b^1, W^2, b^2, …, W^L, b^L}

Picking a function f = picking a set of model parameters θ

(49)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 49

(50)

Model Parameter Measurement

Define a function to measure the quality of a parameter set θ

Evaluating by a loss/cost/error function C(θ) → how bad θ is
Best model parameter set: θ* = argmin_θ C(θ)

Evaluating by an objective/reward function O(θ) → how good θ is
Best model parameter set: θ* = argmax_θ O(θ)

50

(51)

Loss Function Example

51

A "good" function f* gives small error on the training data (x^1, ŷ^1), (x^2, ŷ^2), …  (x: function input, ŷ: function output / label)

Define an example loss function that sums the error over all training samples:
C(θ) = Σ_k error(f_θ(x^k), ŷ^k)

Model: hypothesis function set {f1, f2, …}; training: pick the best function f*, e.g., f*("It claims too much.") = − (negative)

(52)

Frequent Loss Function

Square loss

Hinge loss

Logistic loss

Cross entropy loss

Others: large margin, etc.

52

https://en.wikipedia.org/wiki/Loss_functions_for_classification
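As an illustration (not from the slides), here are NumPy sketches of two of the losses listed above, square loss and cross entropy; the predictions and one-hot targets are made-up examples.

import numpy as np

def square_loss(y, y_hat):
    # sum of squared errors over all training samples (y: predictions, y_hat: targets)
    return np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # y: predicted class probabilities, y_hat: one-hot targets
    return -np.sum(y_hat * np.log(y + eps))

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])        # predictions for 2 samples, 3 classes
y_hat = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])    # one-hot targets
print(square_loss(y, y_hat), cross_entropy_loss(y, y_hat))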

(53)

How Can We Pick the "Best" Function?

53

(54)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 54

(55)

Problem Statement

Given a loss function and several model parameter sets

Loss function: C(θ)

Model parameter sets: θ^1, θ^2, …

Find the model parameter set that minimizes C(θ)

55

How to solve this optimization problem?

1) Brute force – enumerate all possible θ

2) Calculus – solve ∇C(θ) = 0

Issue: the whole space of C(θ) is unknown

(56)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 56

(57)

Gradient Descent for Optimization

Assume that θ has only one variable

57

C(θ) as a function of a single parameter θ, with iterates θ^0, θ^1, θ^2, θ^3, …

θ^i : the model at the i-th iteration

Idea: drop a ball and find the position where the ball stops rolling (a local minimum)

(58)

Gradient Descent for Optimization

Assume that θ has only one variable

58

Randomly start at 𝜃0; compute dC(𝜃0)/d𝜃 and update: 𝜃1 = 𝜃0 − η · dC(𝜃0)/d𝜃

Compute dC(𝜃1)/d𝜃 and update: 𝜃2 = 𝜃1 − η · dC(𝜃1)/d𝜃

η is the "learning rate"

(59)

Gradient Descent for Optimization

Assume that θ has two variables {θ1, θ2}

59

(60)

Gradient Descent for Optimization

Assume that θ has two variables {θ1, θ2}

60

• Randomly start at 𝜃0 = (θ1^0, θ2^0)

• Compute the gradients of 𝐶(𝜃) at 𝜃0: 𝛻𝐶(𝜃0)

• Update parameters: 𝜃1 = 𝜃0 − η𝛻𝐶(𝜃0)

• Compute the gradients of 𝐶(𝜃) at 𝜃1: 𝛻𝐶(𝜃1), then update again

(61)

Gradient Descent for Optimization

61

In the (θ1, θ2) plane, the parameters move 𝜃0 → 𝜃1 → 𝜃2 → 𝜃3, each movement opposite to the gradient 𝛻𝐶(𝜃0), 𝛻𝐶(𝜃1), 𝛻𝐶(𝜃2), 𝛻𝐶(𝜃3).

Algorithm

Initialization: start at 𝜃0
while (𝜃(𝑖+1) ≠ 𝜃𝑖) {
    compute gradient at 𝜃𝑖
    update parameters
}
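A minimal plain-Python sketch of this algorithm on a toy loss C(θ) = (θ − 3)²; the loss, learning rate, and stopping rule are illustrative choices, not from the slides.

def C(theta):
    return (theta - 3.0) ** 2

def dC_dtheta(theta):
    return 2.0 * (theta - 3.0)

eta = 0.1            # learning rate
theta = -4.0         # randomly start at theta^0
for i in range(100):
    grad = dC_dtheta(theta)         # compute gradient at theta^i
    new_theta = theta - eta * grad  # update parameters
    if abs(new_theta - theta) < 1e-8:
        break                       # theta^(i+1) ~= theta^i
    theta = new_theta

print(theta, C(theta))              # close to the minimum at theta = 3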

(62)

Revisit Neural Network Formulation

Fully connected feedforward network

62

62

f : R^N → R^M

Input vector x (x1 … xN) → Layer 1 → Layer 2 → … → Layer L → output vector y (y1 … yM)

(63)

Gradient Descent for Neural Network

63

Algorithm

Initialization: start at 𝜃0
while (𝜃(𝑖+1) ≠ 𝜃𝑖) {
    compute gradient at 𝜃𝑖
    update parameters
}

(64)

Gradient Descent for Optimization

Simple Case

64

Algorithm

Initialization: start at 𝜃0
while (𝜃(𝑖+1) ≠ 𝜃𝑖) {
    compute gradient at 𝜃𝑖
    update parameters
}

Simple case: a single neuron with z = w1 x1 + w2 x2 + b, y = σ(z)

(65)

Gradient Descent for Optimization

Simple Case – Three Parameters & Square Error Loss

Update the three parameters (w1, w2, b) at the t-th iteration:
w1 ← w1 − η ∂C/∂w1,  w2 ← w2 − η ∂C/∂w2,  b ← b − η ∂C/∂b

Square error loss: C = (y − ŷ)²

65

(66)

Gradient Descent for Optimization

Simple Case – Square Error Loss

Square Error Loss

66

(67)

Gradient Descent for Optimization

Simple Case – Square Error Loss

67

By the chain rule, ∂C/∂w_i = (∂C/∂y)(∂y/∂z)(∂z/∂w_i); for the sigmoid function, ∂y/∂z = σ(z)(1 − σ(z))

(68)

Gradient Descent for Optimization

Simple Case – Square Error Loss

Square Error Loss

68

(69)

Gradient Descent for Optimization

Simple Case – Three Parameters & Square Error Loss

Update the three parameters (w1, w2, b) at the t-th iteration:
w1 ← w1 − η ∂C/∂w1,  w2 ← w2 − η ∂C/∂w2,  b ← b − η ∂C/∂b

69

(single neuron: z = w1 x1 + w2 x2 + b, y = σ(z)); a worked sketch follows below
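A hedged NumPy sketch of these chain-rule updates for the single sigmoid neuron with square error loss; the data point, initial parameters, and learning rate are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])        # one training sample (x1, x2)
y_hat = 1.0                     # its target label
w = np.array([0.1, -0.2])
b = 0.0
eta = 0.5

for t in range(50):
    z = w @ x + b
    y = sigmoid(z)
    # dC/dw_i = dC/dy * dy/dz * dz/dw_i = 2(y - y_hat) * sigma(z)(1 - sigma(z)) * x_i
    delta = 2.0 * (y - y_hat) * y * (1.0 - y)
    w = w - eta * delta * x     # update w1, w2
    b = b - eta * delta * 1.0   # update b (dz/db = 1)

print(sigmoid(w @ x + b))       # close to the target 1.0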

(70)

Optimization Algorithm

70

Algorithm

Initialization: set the parameters θ, b at random
while (stopping criteria not met) {
    for each training sample {x, ŷ}: compute the gradient and update the parameters
}

(71)

Gradient Descent for Neural Network

71

Computing the gradient involves millions of parameters.

To compute it efficiently, we use backpropagation.

Algorithm

Initialization: start at 𝜃0
while (𝜃(𝑖+1) ≠ 𝜃𝑖) {
    compute gradient at 𝜃𝑖
    update parameters
}

(72)

Gradient Descent Issue

72

Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …

The model is updated only after seeing all training samples → slow

(73)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 73

(74)

Stochastic Gradient Descent (SGD)

Gradient Descent: compute the gradient from all training samples before each update.

Stochastic Gradient Descent (SGD): pick a training sample x^k and compute the gradient from its loss alone; if all training samples have the same probability of being picked, the updates follow the full gradient on average.

74

The model can be updated after seeing one training sample → faster

Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …
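A minimal NumPy sketch of SGD for a single sigmoid neuron, updating after each individual sample; the data, labels, and hyperparameters are illustrative rather than from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.random((100, 3))                        # 100 training samples x^k
labels = (X.sum(axis=1) > 1.5).astype(float)    # made-up binary targets (the y-hat values)
w, b, eta = np.zeros(3), 0.0, 0.5

for epoch in range(10):                         # one epoch = see all samples once
    for k in rng.permutation(len(X)):           # every sample equally likely to be picked
        x, t = X[k], labels[k]
        y = sigmoid(w @ x + b)
        delta = 2.0 * (y - t) * y * (1.0 - y)   # square-error gradient for this one sample
        w -= eta * delta * x                    # update immediately after this sample
        b -= eta * delta

print(((sigmoid(X @ w + b) > 0.5) == (labels > 0.5)).mean())   # training accuracy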

(75)

Epoch Definition

When running SGD, the model starts from θ^0 and updates on one sample at a time: pick x^1, pick x^2, …, pick x^k, …, pick x^K, then pick x^1 again, …

Seeing all training samples once → one epoch

75

Training Data: (x^1, ŷ^1), (x^2, ŷ^2), …

(76)

Gradient Descent v.s. SGD

Gradient Descent

Update after seeing all examples → one update per epoch

76

Stochastic Gradient Descent

If there are 20 examples, update 20 times in one epoch; each update sees only one example.

SGD approaches the target point faster than gradient descent.

(77)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 77

(78)

Mini-Batch SGD

Batch Gradient Descent: use all K training samples in each iteration.

Stochastic Gradient Descent (SGD): pick one training sample x^k; use 1 sample in each iteration.

Mini-Batch SGD: pick a set of B training samples as a batch b; use the B samples in each iteration (B is the "batch size").

78
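A minimal NumPy sketch of mini-batch SGD; the batch size, data, and single-neuron model are illustrative choices, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
K, B = 100, 10                                   # K samples, batch size B
X = rng.random((K, 3))
labels = (X.sum(axis=1) > 1.5).astype(float)     # made-up binary targets
w, b, eta = np.zeros(3), 0.0, 0.5

for epoch in range(10):
    order = rng.permutation(K)
    for start in range(0, K, B):
        idx = order[start:start + B]             # pick a batch of B samples
        Xb, tb = X[idx], labels[idx]
        y = sigmoid(Xb @ w + b)                  # forward pass for the whole batch at once
        delta = 2.0 * (y - tb) * y * (1.0 - y)
        w -= eta * (Xb.T @ delta) / B            # gradient averaged over the batch
        b -= eta * delta.mean()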

(79)

Mini-Batch SGD

79

(80)

Batch vs. Mini-Batch

Handwriting Digit Classification

80

(Figure: comparison between batch size = 1 and full-batch gradient descent.)

(81)

Gradient Descent vs. SGD vs. Mini-Batch

81

(Figure: training time in seconds vs. batch size — 1 (SGD), 10, 100, 1000, 10000 (mini-batch), and full (gradient descent).)

Why is mini-batch faster than SGD?

Training speed: mini-batch > SGD > gradient descent

(82)

SGD vs. Mini-Batch

82

SGD: compute z^1 = W^1 x for one sample x at a time → a matrix-vector product per update.

Mini-Batch SGD: stack the batch samples into a matrix X = [x^1 x^2 …] and compute Z^1 = W^1 X once → a single matrix-matrix product.

Modern computers run one matrix-matrix multiplication faster than the equivalent sequence of matrix-vector multiplications, as sketched below.
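A small NumPy sketch of this point: the per-sample matrix-vector products of a batch can be packed into one matrix-matrix product that gives the same result; the sizes are placeholders.

import numpy as np

rng = np.random.default_rng(6)
W1 = rng.normal(size=(64, 256))
x1, x2 = rng.random(256), rng.random(256)

z_separate = np.stack([W1 @ x1, W1 @ x2], axis=1)    # two matrix-vector products
z_batched = W1 @ np.stack([x1, x2], axis=1)          # one matrix-matrix product
print(np.allclose(z_separate, z_batched))            # True: same result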

(83)

Big Issue: Local Optima

83

Neural network training has no guarantee of reaching the globally optimal solution

(84)

Training Procedure Outline

◉ Model Architecture

A Single Layer of Neurons (Perceptron)

Limitation of Perceptron

Neural Network Model (Multi-Layer Perceptron)

◉ Loss Function Design

Function = Model Parameters

Model Parameter Measurement

◉ Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-Batch SGD

Practical Tips 84

(85)

Initialization

Different initialization parameters may result in different trained models

85

Do not initialize the parameters equally → set them randomly

(86)

Learning Rate

86

(Figure: cost vs. number of parameter updates for different learning rates — very large, large, small, "just make" — together with the corresponding error surface.)

The learning rate should be set carefully

(87)

Tips for Mini-Batch Training

◉ Shuffle training samples before every epoch

the network might memorize the order you feed the samples

◉ Use a fixed batch size for every epoch

enables fast matrix multiplication in the implementation

◉ Adapt the learning rate to the batch size

K times the batch size → (theoretically) K times the learning rate

87

http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch
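A hedged NumPy sketch of these tips (reshuffling every epoch, a fixed batch size, and the theoretical linear learning-rate scaling); the function and variable names, reference batch size, and data are illustrative assumptions.

import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    order = rng.permutation(len(X))              # reshuffle before every epoch
    for start in range(0, len(X) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]    # fixed batch size every step
        yield X[idx], Y[idx]

base_lr, base_batch = 0.1, 32                    # assumed reference setting
batch_size = 128
lr = base_lr * (batch_size / base_batch)         # K times the batch size -> K times the lr

rng = np.random.default_rng(7)
X = rng.random((1000, 256))
Y = rng.integers(0, 10, size=1000)
for epoch in range(2):
    for xb, yb in iterate_minibatches(X, Y, batch_size, rng):
        pass                                     # compute gradients on (xb, yb), update with lr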

(88)

Learning Recipe

88

Training Data → used to pick f*, the "best" function.

Testing Data (Validation and Real Testing) → feed each input x to f* to predict y.

(89)

Learning Recipe

89

Validation: we immediately know the performance.

Real Testing: we do not know the performance until submission.

(90)

Learning Recipe

If we do not get good results on the training set → modify the training process.

Possible reasons:

no good function exists (bad hypothesis function set) → reconstruct the model architecture

cannot find a good function (local optima) → change the training strategy

90

(91)

Learning Recipe

91

Get good results on the training set? If no → modify the training process.

Get good results on the dev/validation set? If yes → done; if no → prevent overfitting.

Better performance on training but worse performance on dev → overfitting.

(92)

Overfitting

Possible solutions

more training samples

some tips: dropout, etc.

92

(93)

Concluding Remarks

Q1. What is the model?

Q2. What does a “good” function mean?

Q3. How do we pick the “best” function?

93

Model: hypothesis function set {f1, f2, …}

Training: pick the best function f*

Q1 → Model Architecture, Q2 → Loss Function Design, Q3 → Optimization
