(1)

Slide credit from Prof. Hung-Yi Lee

(2)

Review

2

(3)

Learning ≈ Looking for a Function

• Speech recognition: f(audio) = "你好" (Hello)
• Handwritten recognition: f(image) = "2"
• Weather forecast: f(Thursday) = "Saturday"
• Play video games: f(game screen) = "move left"

3

(4)

Machine Learning Framework

Training is to pick the best function given the observed data; testing is to predict the label using the learned function.

• Training Data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$, where $x$ is the function input and $\hat{y}$ the function output, e.g. $x$: "It claims too much." $\hat{y}$: − (negative)
• Model (hypothesis function set): $\{f_1, f_2, \dots\}$
• Training: pick the "best" function $f^*$
• Testing: $y = f^*(x)$ on Testing Data $\{(x, ?), \dots\}$

4

(5)

How to Train a Model?

5

(6)

Machine Learning Framework

Training is to pick the best function given the observed data; testing is to predict the label using the learned function. Same framework as above: training data $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$, model $\{f_1, f_2, \dots\}$, training picks the "best" function $f^*$, testing computes $y = f^*(x)$. We now focus on the Training Procedure.

6

(7)

Training Procedure

Q1. What is the model? (hypothesis function set)
Q2. What does a "good" function mean?
Q3. How do we pick the "best" function?

(Model: hypothesis function set $\{f_1, f_2, \dots\}$; training picks the "best" function $f^*$.)

7

(8)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

8

(9)

What is the Model?

9

(10)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

10

(11)

Classification Task

• Sentiment analysis: "這規格有誠意!" ("These specs show real sincerity!") → + ; "太爛了吧~" ("This is way too lousy~") → −
• Speech phoneme recognition: audio → /h/
• Handwritten recognition: image → "2"

Binary classification: input object → Class A (yes) or Class B (no)
Multi-class classification: input object → Class A, Class B, or Class C

Some cases are not easy to formulate as classification problems.

11

(12)

Target Function

Classification task:
• x: input object to be classified → an N-dim vector
• y: class/label → an M-dim vector

$f: x \rightarrow y$, i.e. $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

Assume both x and y can be represented as fixed-size vectors.

12

(13)

Vector Representation Example

Handwriting digit classification:
• Input x (image): each pixel corresponds to an element in the vector, 1 for ink and 0 otherwise; a 16 x 16 image gives 16 x 16 = 256 dimensions.
• Output y (class/label): one dimension per class ("1" or not, "2" or not, "3" or not, …), i.e. 10 dimensions for digit recognition; e.g. "1" → $[1, 0, 0, \dots]^T$, "2" → $[0, 1, 0, \dots]^T$.

$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

13
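A minimal sketch of this vector representation in Python (the NumPy encoding below is an illustrative assumption, not code from the slides):

    import numpy as np

    def image_to_vector(image):
        # 16 x 16 image -> 256-dim vector: 1 for ink, 0 otherwise
        return (np.asarray(image) > 0).astype(np.float32).reshape(256)

    def label_to_vector(digit, num_classes=10):
        # one-hot label: the element for the target digit is 1, all others are 0
        y = np.zeros(num_classes, dtype=np.float32)
        y[digit] = 1.0
        return y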

(14)

Vector Representation Example

Sentiment analysis:
• Input x (word): each element in the vector corresponds to a word in the vocabulary, 1 indicates the word and 0 otherwise (e.g. the vector for "love" has a single 1 at the position of "love"); the number of dimensions equals the size of the vocabulary.
• Output y (class/label): "+" or not, "−" or not, "?" or not → 3 dimensions (positive, negative, neutral); e.g. "+" → $[1, 0, 0]^T$, "−" → $[0, 1, 0]^T$.

$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

14

(15)

Target Function

Classification task:
• x: input object to be classified → an N-dim vector
• y: class/label → an M-dim vector

$f: x \rightarrow y$, i.e. $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

Assume both x and y can be represented as fixed-size vectors.

15

(16)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

16

(17)

A Single Neuron

$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$ (b: bias)
$y = \sigma(z) = \frac{1}{1 + e^{-z}}$ (sigmoid function as the activation function)

Each neuron is a very simple function.

17
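A minimal sketch of this single neuron (NumPy-based; the helper names are assumptions for illustration):

    import numpy as np

    def sigmoid(z):
        # sigmoid activation: 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # z = w1*x1 + ... + wN*xN + b, then apply the activation function
        z = np.dot(w, x) + b
        return sigmoid(z)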

(18)

A Single Neuron

$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b \cdot 1$
$y = \sigma(z) = \frac{1}{1 + e^{-z}}$ (sigmoid function as the activation function)

The bias term is an "always on" feature: a weight attached to a constant input of 1.

18

(19)

Why Bias?

The bias b shifts the input of the activation function $\sigma(z)$; the bias term gives a class prior.

19

(20)

Model Parameters of A Single Neuron

$z = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$, $y = \sigma(z) = \frac{1}{1 + e^{-z}}$

w, b are the parameters of this neuron.

20

(21)

A Single Neuron

$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$. With a single neuron the output is one value y; e.g. for recognizing the digit "2":
• $y \geq 0.5$ → "is 2"
• $y < 0.5$ → "not 2"

A single neuron can only handle binary classification.

21

(22)

A Layer of Neurons

Handwriting digit classification: $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

Inputs $x_1, x_2, \dots, x_N$ feed a layer of neurons whose outputs are $y_1$ ("1" or not), $y_2$ ("2" or not), $y_3$ ("3" or not), …, with 10 neurons for 10 classes.

A layer of neurons can handle multiple possible outputs; the result depends on which output is the max one ("Which one is max?").

22
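A rough sketch of such a layer used as a classifier (an assumption-level illustration: one sigmoid neuron per class, prediction by the maximum output):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer_predict(x, W, b):
        # one neuron per class: y_i = sigmoid(w_i . x + b_i)
        y = sigmoid(W @ x + b)      # W: 10 x N weights, b: 10 biases
        return int(np.argmax(y))    # the predicted class is the max one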

(23)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

23

(24)

A Layer of Neurons – Perceptron

Output units all operate separately – no shared weights.

Adjusting the weights moves the location, orientation, and steepness of the cliff.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

24

(25)

Expression of Perceptron

$y = \sigma(w_1 x_1 + w_2 x_2 + b)$

A perceptron can represent AND, OR, NOT, etc., but not XOR → it is only a linear separator.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

25

(26)

How to Implement XOR?

Input A  Input B  Output
   0        0        0
   0        1        1
   1        0        1
   1        1        0

A xor B = AB' + A'B

Combining multiple operations can produce more complicated outputs.

26
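As an illustrative sketch (the specific weights below are hand-picked assumptions, not values from the slides), two hidden sigmoid neurons approximating AB' and A'B plus an OR-like output neuron realize XOR:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        return sigmoid(np.dot(w, x) + b)

    def xor(a, b):
        x = np.array([a, b], dtype=float)
        h1 = neuron(x, np.array([ 20.0, -20.0]), -10.0)   # ~ A AND (NOT B) = AB'
        h2 = neuron(x, np.array([-20.0,  20.0]), -10.0)   # ~ (NOT A) AND B = A'B
        return neuron(np.array([h1, h2]), np.array([20.0, 20.0]), -10.0)  # ~ OR

    # round(xor(0, 0)) == 0, round(xor(0, 1)) == 1, round(xor(1, 0)) == 1, round(xor(1, 1)) == 0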

(27)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

27

(28)

Neural Networks – Multi-Layer Perceptron

Inputs $x_1, x_2$ feed hidden units $a_1 = \sigma(z_1)$ and $a_2 = \sigma(z_2)$, which in turn feed the output y (each unit also receives a constant bias input of 1).

28

(29)

Expression of Multi-Layer Perceptron

• Continuous function with 2 layers: combine two opposite-facing threshold functions to make a ridge.
• Continuous function with 3 layers: combine two perpendicular ridges to make a bump; add bumps of various sizes and locations to fit any surface.

Multiple layers enhance the model's expressiveness → the model can approximate more complex functions.

http://aima.eecs.berkeley.edu/slides-pdf/chapter20b.pdf

29

(30)

Deep Neural Networks (DNN)

Fully connected feedforward network: input vector $x = (x_1, \dots, x_N)$ → Layer 1 → Layer 2 → … → Layer L → output vector $y = (y_1, \dots, y_M)$

$f: \mathbb{R}^N \rightarrow \mathbb{R}^M$

Deep NN: multiple hidden layers.

30

(31)

Notation Definition

Layer $l-1$ has $N_{l-1}$ nodes with outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$; layer $l$ has $N_l$ nodes with outputs $a_1^l, a_2^l, \dots, a_i^l, \dots$

$a_i^l$: output of neuron $i$ at layer $l$. The outputs of one layer form a vector $a^l$.

31

(32)

Notation Definition

$w_{ij}^l$: the weight from neuron $j$ (layer $l-1$) to neuron $i$ (layer $l$). The weights between two layers form a matrix $W^l$.

32

(33)

Notation Definition

$b_i^l$: bias for neuron $i$ at layer $l$ (attached to a constant input of 1). The biases of all neurons at each layer form a vector $b^l$.

33

(34)

Notation Definition

$z_i^l$: input of the activation function for neuron $i$ at layer $l$:
$z_i^l = w_{i1}^l a_1^{l-1} + w_{i2}^l a_2^{l-1} + \dots + w_{ij}^l a_j^{l-1} + \dots + b_i^l$

The activation-function inputs at each layer form a vector $z^l$.

34

(35)

Notation Summary

• $a_i^l$: output of a neuron; $a^l$: output vector of a layer
• $z_i^l$: input of the activation function; $z^l$: input vector of the activation function for a layer
• $w_{ij}^l$: a weight; $W^l$: a weight matrix
• $b_i^l$: a bias; $b^l$: a bias vector

35

(36)

Layer Output Relation

From layer $l-1$ (outputs $a^{l-1}$) to layer $l$: first compute the activation inputs $z^l$, then the outputs $a^l$, i.e. $a^{l-1} \rightarrow z^l \rightarrow a^l$.

36

(37)

Layer Output Relation – from a to z

$z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l$, or in matrix form $z^l = W^l a^{l-1} + b^l$

37

(38)

Layer Output Relation – from z to a

Apply the activation function elementwise, $a_i^l = \sigma(z_i^l)$:

$a^l = \begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \\ a_i^l \\ \vdots \end{bmatrix} = \sigma\left( \begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix} \right) = \sigma(z^l)$

38

(39)

Layer Output Relation

$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$

39
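A minimal NumPy sketch of this layer relation (shapes assumed: $W^l$ is $N_l \times N_{l-1}$, while $a^{l-1}$ and $b^l$ are vectors):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer_forward(a_prev, W, b):
        # z^l = W^l a^{l-1} + b^l ; a^l = sigma(z^l)
        z = W @ a_prev + b
        return sigmoid(z)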

(40)

Neural Network Formulation

Fully connected feedforward network $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, input vector x, output vector y:

$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
…
$y = a^L = \sigma(W^L a^{L-1} + b^L)$

40

(41)

Neural Network Formulation

Fully connected feedforward network $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$; composing the layers gives

$y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$

41
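Putting the layers together, a rough sketch of the full forward pass (the layer sizes and random initialization below are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        # weights = [W1, ..., WL], biases = [b1, ..., bL]
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)   # a^l = sigma(W^l a^{l-1} + b^l)
        return a

    # example: N = 4 inputs, one hidden layer of 8 units, M = 3 outputs
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
    biases = [np.zeros(8), np.zeros(3)]
    y = forward(np.ones(4), weights, biases)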

(42)

Activation Function

bounded function

42

(43)

Activation Function

Possible forms: boolean, linear, non-linear.

43

(44)

Non-Linear Activation Function

• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• Tanh: $\tanh(z)$
• Rectified Linear Unit (ReLU): $\max(0, z)$

Non-linear functions are frequently used in neural networks.

44
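A minimal sketch of these three activations in NumPy:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return np.tanh(z)

    def relu(z):
        return np.maximum(0.0, z)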

(45)

Why Non-Linearity?

Function approximation:
• Without non-linearity, a deep neural network works the same as a linear transform.
• With non-linearity, networks with more layers can approximate more complex functions.

http://cs224d.stanford.edu/lectures/CS224d-Lecture4.pdf

45

(46)

What does a "Good" Function mean?

46

(47)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

47

(48)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

48

(49)

Function = Model Parameters

Formal definition: different parameters W and b → different functions; the function set corresponds to the model parameter set $\theta = \{W^1, b^1, W^2, b^2, \dots\}$.

Picking a function f = picking a set of model parameters θ.

49

(50)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

50

(51)

Model Parameter Measurement

Define a function to measure the quality of a parameter set θ:
◦ Evaluating by a loss/cost/error function C(θ) → how bad θ is; best model parameter set: $\theta^* = \arg\min_\theta C(\theta)$
◦ Evaluating by an objective/reward function O(θ) → how good θ is; best model parameter set: $\theta^* = \arg\max_\theta O(\theta)$

51

(52)

Loss Function Example

Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$ (x: function input, ŷ: function output), e.g. x: "It claims too much." ŷ: − (negative). The model is the hypothesis function set $\{f_1, f_2, \dots\}$, and training picks the "best" function $f^*$.

A "good" function keeps its outputs close to the labels on the training data. Define an example loss function as the sum over the error of all training samples:
$C(\theta) = \sum_k \mathrm{error}\big(f_\theta(x^k), \hat{y}^k\big)$

52
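A minimal sketch of such a loss (the squared error is one concrete choice of the per-sample error; the f(x, params) signature is an assumption):

    import numpy as np

    def total_loss(f, params, xs, ys):
        # sum of per-sample errors over all training samples, here squared error
        return sum(np.sum((f(x, params) - y) ** 2) for x, y in zip(xs, ys))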

(53)

Frequent Loss Function

• Square loss
• Hinge loss
• Logistic loss
• Cross entropy loss
• Others: large margin, etc.

https://en.wikipedia.org/wiki/Loss_functions_for_classification

53

(54)

How can we Pick the "Best" Function?

54

(55)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

55

(56)

Problem Statement

Given:
◦ a loss function C(θ)
◦ several model parameter sets $\theta^1, \theta^2, \dots$

Find a model parameter set that minimizes C(θ).

How to solve this optimization problem?
1) Brute force – enumerate all possible θ
2) Calculus – solve $\partial C(\theta)/\partial\theta = 0$

Issue: the whole space of C(θ) is unknown.

56

(57)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

57

(58)

Gradient Descent for Optimization

Assume that θ has only one variable; plot C(θ) and consider the sequence $\theta^0, \theta^1, \theta^2, \theta^3, \dots$, where $\theta^i$ is the model at the i-th iteration.

Idea: drop a ball and find the position where the ball stops rolling (a local minimum).

58

(59)

Gradient Descent for Optimization

Assume that θ has only one variable.

• Randomly start at $\theta^0$
• Compute $dC(\theta^0)/d\theta$ and update $\theta^1 = \theta^0 - \eta \, dC(\theta^0)/d\theta$
• Compute $dC(\theta^1)/d\theta$ and update $\theta^2 = \theta^1 - \eta \, dC(\theta^1)/d\theta$
• …

η is the "learning rate".

59

(60)

Gradient Descent for Optimization

Assume that θ has two variables $\{\theta_1, \theta_2\}$.

60

(61)

Gradient Descent for Optimization

Assume that θ has two variables $\{\theta_1, \theta_2\}$.

• Randomly start at $\theta^0$
• Compute the gradient of $C(\theta)$ at $\theta^0$: $\nabla C(\theta^0) = [\partial C(\theta^0)/\partial\theta_1,\ \partial C(\theta^0)/\partial\theta_2]^T$
• Update parameters: $\theta^1 = \theta^0 - \eta \nabla C(\theta^0)$
• Compute the gradient of $C(\theta)$ at $\theta^1$, and continue.

61

(62)

Gradient Descent for Optimization

In the $(\theta_1, \theta_2)$ parameter space, each update moves $\theta^0 \rightarrow \theta^1 \rightarrow \theta^2 \rightarrow \theta^3 \rightarrow \dots$, with the movement opposite to the gradients $\nabla C(\theta^0), \nabla C(\theta^1), \nabla C(\theta^2), \nabla C(\theta^3), \dots$

Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$)
  {
      compute gradient at $\theta^i$
      update parameters
  }

62
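A minimal numerical sketch of this loop (the quadratic example loss and its gradient are made-up assumptions, used only to have something concrete to minimize):

    import numpy as np

    def gradient_descent(grad, theta0, eta=0.1, tol=1e-8, max_iter=1000):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            step = eta * grad(theta)          # eta is the learning rate
            theta = theta - step              # theta^{i+1} = theta^i - eta * grad C(theta^i)
            if np.linalg.norm(step) < tol:    # stop once the parameters stop moving
                break
        return theta

    # example: C(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2, gradient = 2 * (theta - [3, -1])
    theta_star = gradient_descent(lambda t: 2 * (t - np.array([3.0, -1.0])), [0.0, 0.0])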

(63)

Revisit Neural Network Formulation

Fully connected feedforward network: input vector $x = (x_1, \dots, x_N)$ → Layer 1 → Layer 2 → … → Layer L → output vector $y = (y_1, \dots, y_M)$.

63

(64)

Gradient Descent for Neural Network

Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$)
  {
      compute gradient at $\theta^i$
      update parameters
  }

64

(65)

Gradient Descent for Optimization

Simple Case

Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$)
  {
      compute gradient at $\theta^i$
      update parameters
  }

Model: a single neuron $y = \sigma(z)$, $z = w_1 x_1 + w_2 x_2 + b$.

65

(66)

Gradient Descent for Optimization

Simple Case – Three Parameters & Square Error Loss

Update the three parameters at the t-th iteration:
$w_1^{t+1} = w_1^t - \eta \frac{\partial C}{\partial w_1}$, $w_2^{t+1} = w_2^t - \eta \frac{\partial C}{\partial w_2}$, $b^{t+1} = b^t - \eta \frac{\partial C}{\partial b}$

Square error loss: $C = (\hat{y} - y)^2$

66

(67)

Gradient Descent for Optimization

Simple Case – Square Error Loss

Square error loss: $C = (\hat{y} - y)^2$, with $y = \sigma(z)$ and $z = w_1 x_1 + w_2 x_2 + b$.

67

(68)

Gradient Descent for Optimization

Simple Case – Square Error Loss

Apply the chain rule together with the derivative of the sigmoid function, e.g.
$\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial y}\frac{\partial y}{\partial z}\frac{\partial z}{\partial w_1} = -2(\hat{y} - y)\,\sigma(z)\big(1 - \sigma(z)\big)\,x_1$

68

(69)

Gradient Descent for Optimization

Simple Case – Square Error Loss

Likewise for the other parameters:
$\frac{\partial C}{\partial w_2} = -2(\hat{y} - y)\,\sigma(z)\big(1 - \sigma(z)\big)\,x_2$, $\frac{\partial C}{\partial b} = -2(\hat{y} - y)\,\sigma(z)\big(1 - \sigma(z)\big)$

69
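A rough sketch of one resulting update step for this single-neuron case (the learning rate value and helper names are assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def update_step(w, b, x, y_hat, eta=0.1):
        # forward: z = w . x + b, y = sigmoid(z); loss C = (y_hat - y)^2
        z = np.dot(w, x) + b
        y = sigmoid(z)
        # chain rule: dC/dz = -2 (y_hat - y) * sigma(z) * (1 - sigma(z))
        dC_dz = -2.0 * (y_hat - y) * y * (1.0 - y)
        w_new = w - eta * dC_dz * x     # dC/dw_i = dC/dz * x_i
        b_new = b - eta * dC_dz         # dC/db  = dC/dz
        return w_new, b_new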

(70)

Gradient Descent for Optimization

Simple Case – Three Parameters & Square Error Loss

Update the three parameters of the neuron $y = \sigma(w_1 x_1 + w_2 x_2 + b)$ at the t-th iteration:
$w_1^{t+1} = w_1^t - \eta \frac{\partial C}{\partial w_1}$, $w_2^{t+1} = w_2^t - \eta \frac{\partial C}{\partial w_2}$, $b^{t+1} = b^t - \eta \frac{\partial C}{\partial b}$

70

(71)

Optimization Algorithm

Algorithm
  Initialization: set the parameters θ, b at random
  while (stopping criteria not met)
  {
      for each training sample {x, ŷ}: compute the gradient and update the parameters
  }

71

(72)

Gradient Descent for Neural Network

Algorithm
  Initialization: start at $\theta^0$
  while ($\theta^{(i+1)} \neq \theta^i$)
  {
      compute gradient at $\theta^i$
      update parameters
  }

Computing the gradient involves millions of parameters. To compute it efficiently, we use backpropagation.

72

(73)

Gradient Descent Issue

Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$

The model can be updated only after seeing all training samples → slow.

73

(74)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

74

(75)

Stochastic Gradient Descent (SGD)

Gradient descent updates with the gradient of the loss over all training samples. SGD instead picks a single training sample $x^k$ and updates with the gradient of the loss on that sample alone.
◦ If all training samples have the same probability of being picked, the expected SGD update matches the gradient-descent direction.

Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$

The model can be updated after seeing one training sample → faster.

75
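A minimal sketch of the SGD loop (uniform random sampling of one example per update; grad_single is a hypothetical per-sample gradient function):

    import numpy as np

    def sgd(theta, grad_single, xs, ys, eta=0.01, num_updates=1000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(num_updates):
            k = rng.integers(len(xs))          # pick one sample uniformly at random
            theta = theta - eta * grad_single(theta, xs[k], ys[k])
        return theta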

(76)

Epoch Definition

When running SGD, the model starts at $\theta^0$ and picks training samples one at a time: pick $x^1$, pick $x^2$, …, pick $x^k$, …, pick $x^K$, then pick $x^1$ again, …

Seeing all training samples once = one epoch.

Training data: $\{(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots\}$

76

(77)

Gradient Descent v.s. SGD

• Gradient Descent: update after seeing all examples; 1 epoch = see all examples once.
• Stochastic Gradient Descent: update after seeing only one example; if there are 20 examples, update 20 times in one epoch.

SGD approaches the target point faster than gradient descent.

77

(78)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

78

(79)

Mini-Batch SGD

• Batch Gradient Descent: use all K samples in each iteration.
• Stochastic Gradient Descent (SGD): pick a training sample $x^k$; use 1 sample in each iteration.
• Mini-Batch SGD: pick a set of B training samples as a batch b; use all B samples in each iteration (B is the "batch size").

79

(80)

Mini-Batch SGD

80
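A rough sketch of one mini-batch SGD epoch (the shuffling, batch size, and grad_batch helper are illustrative assumptions; xs and ys are assumed to be NumPy arrays):

    import numpy as np

    def minibatch_sgd_epoch(theta, grad_batch, xs, ys, batch_size=32, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(xs))               # shuffle samples before the epoch
        for start in range(0, len(xs), batch_size):
            idx = order[start:start + batch_size]      # one batch of B samples
            theta = theta - eta * grad_batch(theta, xs[idx], ys[idx])
        return theta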

(81)

Batch v.s. Mini-Batch

Handwriting digit classification: batch size = 1 v.s. full-batch gradient descent.

81

(82)

Gradient Descent v.s. SGD v.s. Mini-Batch

(Figure: training time in seconds v.s. batch size, from 1 (SGD) through 10, 100, 1000, and 10000 to the full set (gradient descent).)

Training speed: mini-batch > SGD > gradient descent.

Why is mini-batch faster than SGD?

82

(83)

SGD v.s. Mini-Batch

• Stochastic Gradient Descent (SGD): compute $z^1 = W^1 x$ separately for each sample (one matrix-vector product per sample).
• Mini-Batch SGD: stack the samples of a batch into a matrix and compute $z^1 = W^1 [\,x \; x \; \cdots\,]$ with a single matrix-matrix product.

Modern computers run one matrix-matrix multiplication faster than the equivalent sequence of matrix-vector multiplications.

83
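A tiny NumPy sketch of the difference (the sizes are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(256, 784))
    batch = rng.normal(size=(784, 32))     # 32 samples stacked as columns

    z_sgd = [W1 @ batch[:, k] for k in range(batch.shape[1])]   # 32 matrix-vector products
    z_mini = W1 @ batch                                         # one matrix-matrix product
    assert np.allclose(np.stack(z_sgd, axis=1), z_mini)         # same result, computed differently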

(84)

Big Issue: Local Optima

Neural network training has no guarantee of finding the globally optimal solution.

84

(85)

Training Procedure Outline

• Model Architecture
  ◦ A Single Layer of Neurons (Perceptron)
  ◦ Limitation of Perceptron
  ◦ Neural Network Model (Multi-Layer Perceptron)
• Loss Function Design
  ◦ Function = Model Parameters
  ◦ Model Parameter Measurement
• Optimization
  ◦ Gradient Descent
  ◦ Stochastic Gradient Descent (SGD)
  ◦ Mini-Batch SGD
• Practical Tips

85

(86)

Initialization

Different initialization parameters may result in different trained models

Do not initialize the parameters equally → set them randomly.

86

(87)

Learning Rate

(Figure: cost v.s. number of parameter updates for learning rates that are very large, large, small, or "just make", together with the corresponding steps on the error surface.)

The learning rate should be set carefully.

87

(88)

Tips for Mini-Batch Training

• Shuffle training samples before every epoch
  ◦ otherwise the network might memorize the order in which the samples are fed
• Use a fixed batch size for every epoch
  ◦ enables a fast matrix-multiplication implementation of the calculations
• Adapt the learning rate to the batch size
  ◦ larger batch → smaller learning rate

http://stackoverflow.com/questions/13693966/neural-net-selecting-data-for-each-mini-batch

88

(89)

Learning Recipe

The data is split into training data and testing data, and the testing data is further split into validation and real testing. The training pairs (x, y) are used to pick the "best" function f*, which is then applied to the testing inputs x to produce y.

89

(90)

Learning Recipe

Training data (x, y); testing data: validation (x, y) and real testing (x, y).
• Validation: we immediately know the performance.
• Real testing: we do not know the performance until submission.

90

(91)

Learning Recipe

If you do not get good results on the training set, modify the training process. Possible reasons:
◦ no good function exists: bad hypothesis function set → reconstruct the model architecture
◦ cannot find a good function: local optima → change the training strategy

91

(92)

Learning Recipe

Once you get good results on the training set, check the dev/validation set: if the results are good there as well, you are done; if not, prevent overfitting.

Better performance on training but worse performance on dev → overfitting.

92

(93)

Overfitting

Possible solutions

◦ more training samples

◦ some tips: dropout, etc.

93

(94)

Concluding Remarks

Q1. What is the model? → Model Architecture
Q2. What does a "good" function mean? → Loss Function Design
Q3. How do we pick the "best" function? → Optimization

(Model: hypothesis function set $\{f_1, f_2, \dots\}$; training picks the "best" function $f^*$.)

94
