## Networks

### Chih-Jen Lin

National Taiwan University

Last updated: May 25, 2020

## Outline

1 Regularized linear classification

2 Optimization problem for fully-connected networks

3 Optimization problem for convolutional neural networks (CNN)

4 Discussion

## Minimizing Training Errors

Basically a classification method starts with minimizing the training errors

min_{model} (training errors)
That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a neural network, or other types

## Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is
sgn(w^{T}x)

For any data x, the predicted label is

1 if w^{T}x ≥ 0, and −1 otherwise

## Minimizing Training Errors (Cont’d)

The two-dimensional situation:

[Figure: circles (one class) and triangles (the other class) separated by the line w^{T}x = 0]

This seems to be quite restricted, but practically x is in a much higher dimensional space

## Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function ξ(w; y, x) for each instance (y, x), where

y = ±1 is the label and x is the feature vector Ideally we should use 0–1 training loss:

ξ(w; y, x) = 1 if yw^{T}x < 0, and 0 otherwise

## Minimizing Training Errors (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult

[Figure: the 0–1 loss ξ(w; y, x) plotted against −yw^{T}x]

We need continuous approximations

## Common Loss Functions

Hinge loss (l1 loss)

ξL1(w; y, x) ≡ max(0, 1 − yw^{T}x) (1)
Logistic loss

ξLR(w; y, x) ≡ log(1 + e^{−yw^{T}x}) (2)

Support vector machines (SVM) use Eq. (1); logistic regression (LR) uses Eq. (2)

SVM and LR are two very fundamental classification methods

## Common Loss Functions (Cont’d)

[Figure: the losses ξL1 and ξLR plotted against −yw^{T}x]

Logistic regression is closely related to SVM. Their performance is usually similar

## Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

## Overfitting

See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy, but this is useless

When training on a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error

## An Illustration of Overfitting

[Figure: circles and squares are training points; triangles are testing points]

## Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme.

One idea is to make w's values closer to zero

We can add, for example,

w^{T}w/2 or ‖w‖_1

to the function that is minimized

## General Form of Linear Classification

Training data {y_i, x^{i}}, x^{i} ∈ R^{n}, i = 1, . . . , l, y_i = ±1

l : # of data, n: # of features

min_w f(w), f(w) ≡ w^{T}w/2 + C Σ_{i=1}^{l} ξ(w; y_i, x^{i})
w^{T}w/2: regularization term

ξ(w; y, x): loss function

C : regularization parameter (chosen by users)
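To make the formulation concrete, here is a minimal NumPy sketch (not from the slides; the function names are illustrative) of f(w) with the hinge loss and of the decision function sgn(w^{T}x):

```python
import numpy as np

def objective(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x^i)  (hinge loss).

    X: l x n matrix of feature vectors, y: labels in {-1, +1}."""
    reg = 0.5 * (w @ w)                       # regularization term w^T w / 2
    margins = y * (X @ w)                     # y_i w^T x^i for all i
    losses = np.maximum(0.0, 1.0 - margins)   # hinge (l1) loss, Eq. (1)
    return reg + C * losses.sum()

def predict(w, X):
    """Decision function: +1 if w^T x >= 0, -1 otherwise."""
    return np.where(X @ w >= 0, 1, -1)
```

Larger C emphasizes the training losses; smaller C emphasizes the regularization term.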

## Outline

1 Regularized linear classification

2 Optimization problem for fully-connected networks

3 Optimization problem for convolutional neural networks (CNN)

4 Discussion

## Multi-class Classification I

Our training set includes (y^{i}, x^{i}), i = 1, . . . , l.

x^{i} ∈ R^{n_1} is the feature vector.

y^{i} ∈ R^{K} is the label vector.

As the label is now a vector, we change (label, instance) from (y_i, x_i) to (y^{i}, x^{i})

K : # of classes

If x^{i} is in class k, then

y^{i} = [0, . . . , 0, 1, 0, . . . , 0]^{T} ∈ R^{K},

where the single 1 is preceded by k − 1 zeros.

## Multi-class Classification II

A neural network maps each feature vector to one of the class labels by the connection of nodes.

## Fully-connected Networks

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

[Figure: a fully-connected network with layers of nodes A_1, B_1, C_1 → A_2, B_2 → A_3, B_3, C_3]

## Operations Between Two Layers I

The weight matrix W^{m} at the mth layer is

W^{m} =

[ w^{m}_{11} w^{m}_{12} · · · w^{m}_{1n_m} ]
[ w^{m}_{21} w^{m}_{22} · · · w^{m}_{2n_m} ]
[ ... ... ... ... ]
[ w^{m}_{n_{m+1}1} w^{m}_{n_{m+1}2} · · · w^{m}_{n_{m+1}n_m} ]

∈ R^{n_{m+1}×n_m}

n_{m} : # input features at layer m

n_{m+1} : # output features at layer m, or # input
features at layer m + 1

L: number of layers

## Operations Between Two Layers II

n_{1} = # of features, n_{L+1} = # of classes

Let z^{m} be the input of the mth layer, z^{1} = x and
z^{L+1} be the output

From the mth layer to the (m + 1)th layer:

s^{m} = W^{m}z^{m},

z^{m+1}_{j} = σ(s^{m}_{j}), j = 1, . . . , n_{m+1},

where σ(·) is the activation function.

## Operations Between Two Layers III

Usually a bias term is added:

b^{m} = [b^{m}_{1}, b^{m}_{2}, . . . , b^{m}_{n_{m+1}}]^{T} ∈ R^{n_{m+1}×1},

so that

s^{m} = W^{m}z^{m} + b^{m}

## Operations Between Two Layers IV

The activation function is usually an R → R transformation. As we are interested in optimization, let's not worry about why it is needed.

We collect all variables:

θ =

[ vec(W^{1}) ]
[ b^{1} ]
[ ... ]
[ vec(W^{L}) ]
[ b^{L} ]

∈ R^{n}

## Operations Between Two Layers V

n : total # variables = (n_{1}+1)n_{2}+· · ·+(n_{L}+1)n_{L+1}
The vec(·) operator stacks columns of a matrix to a
vector
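The layer operations s^{m} = W^{m}z^{m} + b^{m}, z^{m+1}_{j} = σ(s^{m}_{j}) can be sketched as follows (an assumed example, using tanh as σ; the names are illustrative):

```python
import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    """Compute the last-layer output z^{L+1} from z^1 = x.

    weights[m]: n_{m+1} x n_m matrix W^m; biases[m]: length-n_{m+1} vector b^m."""
    z = x                      # z^1 = x
    for W, b in zip(weights, biases):
        s = W @ z + b          # s^m = W^m z^m + b^m
        z = sigma(s)           # z^{m+1}_j = sigma(s^m_j)
    return z                   # z^{L+1}
```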

## Optimization Problem I

We solve the following optimization problem: min_θ f(θ), where

f(θ) = θ^{T}θ/2 + C Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); y^{i}, x^{i}).

C : regularization parameter

z^{L+1,i}(θ) ∈ R^{n_{L+1}}: last-layer output vector for x^{i}.

ξ(z^{L+1}; y, x): loss function. Example:

ξ(z^{L+1}; y, x) = ||z^{L+1} − y||^{2}

## Optimization Problem II

The formulation is the same as for linear classification. However, the loss function is more complicated. Further, it is non-convex.

Note that in the earlier discussion we considered a single instance.

In the training process we actually have, for i = 1, . . . , l,

s^{m,i} = W^{m}z^{m,i},

z^{m+1,i}_{j} = σ(s^{m,i}_{j}), j = 1, . . . , n_{m+1}.

This makes the training more complicated

## Outline

1 Regularized linear classification

2 Optimization problem for fully-connected networks

3 Optimization problem for convolutional neural networks (CNN)

4 Discussion

## Why CNN? I

There are many types of neural networks

They are suitable for different types of problems. While deep learning is hot, it is not always better than other learning methods.

For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository).

They are not consistently better than random forests or SVM; see the comparisons in Meyer et al. (2003), Fernández-Delgado et al. (2014), and Wang et al. (2018).

## Why CNN? II

We are interested in CNN because it has been shown to be significantly better than others on image data. That is one of the main reasons deep learning became popular.

To study optimization algorithms, of course we want to consider an "established" network.

That is why CNN was chosen for our discussion. However, the problem is that operations in CNN are more complicated than in fully-connected networks. Most books/papers only give explanations without detailed mathematical forms.

## Why CNN? III

To study the optimization, we need some clean formulations

So let’s give it a try here

## Convolutional Neural Networks I

Consider a K -class classification problem with training data

(y^{i}, Z^{1,i}), i = 1, . . . , l.

y^{i}: label vector; Z^{1,i}: input image

If Z^{1,i} is in class k, then

y^{i} = [0, . . . , 0, 1, 0, . . . , 0]^{T} ∈ R^{K},

where the single 1 is preceded by k − 1 zeros.

CNN maps each image Z^{1,i} to y^{i}

## Convolutional Neural Networks II

Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.

Input and output of a convolutional layer are assumed to be images.

## Convolutional Layers I

For the current layer, let the input be an image
Z^{in} : a^{in}× b^{in} × d^{in}.

a^{in}: height, b^{in}: width, and d^{in}: #channels.

[Figure: the input as an a^{in} × b^{in} × d^{in} box]

## Convolutional Layers II

The goal is to generate an output image
Z^{out,i}

of d^{out} channels of a^{out}× b^{out} images.

Consider d^{out} filters.

Filter j ∈ {1, . . . , d^{out}} has dimensions
h × h × d^{in}.

For channel d ∈ {1, . . . , d^{in}}, filter j contains the h × h matrix

[ w^{j}_{1,1,d} · · · w^{j}_{1,h,d} ]
[ ... . . . ... ]
[ w^{j}_{h,1,d} · · · w^{j}_{h,h,d} ]

## Convolutional Layers III

h: filter height/width (layer index omitted)

[Figure: a 3 × 3 sub-image with entries indexed (1,1,1) through (3,3,1), and the resulting output values s^{out,i}_{1,1,j}, s^{out,i}_{1,2,j}, s^{out,i}_{2,1,j}, s^{out,i}_{2,2,j}]

To compute the j th channel of output, we scan the
input from top-left to bottom-right to obtain the
sub-images of size h × h × d^{in}

## Convolutional Layers IV

We then calculate the inner product between each sub-image and the j th filter

For example, if we start from the upper left corner of the input image, the first sub-image of channel d is

[ z^{i}_{1,1,d} · · · z^{i}_{1,h,d} ]
[ ... . . . ... ]
[ z^{i}_{h,1,d} · · · z^{i}_{h,h,d} ]

## Convolutional Layers V

We then calculate

Σ_{d=1}^{d^{in}} ⟨ [ z^{i}_{1,1,d} · · · z^{i}_{1,h,d}; ...; z^{i}_{h,1,d} · · · z^{i}_{h,h,d} ], [ w^{j}_{1,1,d} · · · w^{j}_{1,h,d}; ...; w^{j}_{h,1,d} · · · w^{j}_{h,h,d} ] ⟩ + b_{j}, (3)

where ⟨·, ·⟩ means the sum of component-wise products between two matrices.

This value becomes the (1, 1) position of channel j of the output image.

## Convolutional Layers VI

Next, we use other sub-images to produce values in other positions of the output image.

Let the stride s be the number of pixels vertically or horizontally to get sub-images.

For the (2, 1) position of the output image, we move down s pixels vertically to obtain the following sub-image:

[ z^{i}_{1+s,1,d} · · · z^{i}_{1+s,h,d} ]
[ ... . . . ... ]
[ z^{i}_{h+s,1,d} · · · z^{i}_{h+s,h,d} ]

## Convolutional Layers VII

The (2, 1) position of channel j of the output image is

Σ_{d=1}^{d^{in}} ⟨ [ z^{i}_{1+s,1,d} · · · z^{i}_{1+s,h,d}; ...; z^{i}_{h+s,1,d} · · · z^{i}_{h+s,h,d} ], [ w^{j}_{1,1,d} · · · w^{j}_{1,h,d}; ...; w^{j}_{h,1,d} · · · w^{j}_{h,h,d} ] ⟩ + b_{j}. (4)

## Convolutional Layers VIII

The output image sizes a^{out} and b^{out} are, respectively, the numbers of positions to which we can move the filter vertically and horizontally:

a^{out} = ⌊(a^{in} − h)/s⌋ + 1, b^{out} = ⌊(b^{in} − h)/s⌋ + 1 (5)

Rationale of (5): vertically, the last row of each sub-image is

h, h + s, . . . , h + ∆s ≤ a^{in}

## Convolutional Layers IX

Thus

∆ = ⌊(a^{in} − h)/s⌋
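A naive sketch (an assumed example, with channels-last arrays; not an efficient implementation) of Eqs. (3)-(5): each output entry is the inner product of the jth filter with a sub-image, summed over channels, plus b_j:

```python
import numpy as np

def conv_naive(Z, W, b, s=1):
    """Z: (a_in, b_in, d_in) image; W: (d_out, h, h, d_in) filters;
    b: (d_out,) biases; s: stride. Returns S of shape (a_out, b_out, d_out)."""
    a_in, b_in, d_in = Z.shape
    d_out, h = W.shape[0], W.shape[1]
    a_out = (a_in - h) // s + 1          # Eq. (5)
    b_out = (b_in - h) // s + 1
    S = np.empty((a_out, b_out, d_out))
    for j in range(d_out):               # channel j of the output
        for p in range(a_out):
            for q in range(b_out):
                sub = Z[p*s : p*s + h, q*s : q*s + h, :]  # h x h x d_in sub-image
                S[p, q, j] = np.sum(sub * W[j]) + b[j]    # Eqs. (3)-(4)
    return S
```

The triple loop mirrors the slide-by-slide scan of sub-images; the efficient matrix form comes next.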

## Matrix Operations I

For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations

We will go back to this issue later

## Matrix Operations II

Let's collect the images of all channels as the input

Z^{in,i} =

[ z^{i}_{1,1,1} z^{i}_{2,1,1} · · · z^{i}_{a^{in},b^{in},1} ]
[ ... ... . . . ... ]
[ z^{i}_{1,1,d^{in}} z^{i}_{2,1,d^{in}} · · · z^{i}_{a^{in},b^{in},d^{in}} ]

∈ R^{d^{in}×a^{in}b^{in}}.

## Matrix Operations III

Let all filters

W =

[ w^{1}_{1,1,1} w^{1}_{2,1,1} · · · w^{1}_{h,h,d^{in}} ]
[ ... ... . . . ... ]
[ w^{d^{out}}_{1,1,1} w^{d^{out}}_{2,1,1} · · · w^{d^{out}}_{h,h,d^{in}} ]

∈ R^{d^{out}×hhd^{in}}

be the variables (parameters) of the current layer

## Matrix Operations IV

Usually a bias term is considered:

b = [b_{1}, . . . , b_{d^{out}}]^{T} ∈ R^{d^{out}×1}

Operations at a layer:

S^{out,i} = Wφ(Z^{in,i}) + b1^{T}_{a^{out}b^{out}} ∈ R^{d^{out}×a^{out}b^{out}}, (6)

## Matrix Operations V

where

1_{a^{out}b^{out}} = [1, . . . , 1]^{T} ∈ R^{a^{out}b^{out}×1}.

φ(Z^{in,i}) collects all sub-images in Z^{in,i} into a matrix.

## Matrix Operations VI

Specifically,

φ(Z^{in,i}) =

[ z^{i}_{1,1,1} z^{i}_{1+s,1,1} · · · z^{i}_{1+(a^{out}−1)s,1+(b^{out}−1)s,1} ]
[ z^{i}_{2,1,1} z^{i}_{2+s,1,1} · · · z^{i}_{2+(a^{out}−1)s,1+(b^{out}−1)s,1} ]
[ ... ... . . . ... ]
[ z^{i}_{h,h,1} z^{i}_{h+s,h,1} · · · z^{i}_{h+(a^{out}−1)s,h+(b^{out}−1)s,1} ]
[ ... ... ... ]
[ z^{i}_{h,h,d^{in}} z^{i}_{h+s,h,d^{in}} · · · z^{i}_{h+(a^{out}−1)s,h+(b^{out}−1)s,d^{in}} ]

∈ R^{hhd^{in}×a^{out}b^{out}}
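A sketch (an assumed example; channels-first array, entries ordered with the row index varying fastest, then the column index, then the channel) of building φ(Z^{in,i}) with one column per sub-image:

```python
import numpy as np

def phi(Z, h, s=1):
    """Z: (d_in, a_in, b_in), channels first. Returns the
    hhd_in x (a_out * b_out) matrix whose columns are the sub-images."""
    d_in, a_in, b_in = Z.shape
    a_out = (a_in - h) // s + 1
    b_out = (b_in - h) // s + 1
    cols = []
    for q in range(b_out):            # sub-image positions, column-major
        for p in range(a_out):
            sub = Z[:, p*s : p*s + h, q*s : q*s + h]   # d_in x h x h
            # order entries: row index fastest, then column, then channel
            cols.append(sub.transpose(0, 2, 1).reshape(-1))
    return np.stack(cols, axis=1)
```

With this matrix, the convolution in (6) becomes a single matrix product W @ phi(Z).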

## Activation Function I

Next, an activation function scales each element of
S^{out,i} to obtain the output matrix Z^{out,i}.

Z^{out,i} = σ(S^{out,i}) ∈ R^{d^{out}×a^{out}b^{out}}. (7)

For CNN, commonly the following ReLU activation function

σ(x) = max(x, 0) (8)

is used

Later we need σ(x) to be differentiable, but the ReLU function is not.

## Activation Function II

Past works such as Krizhevsky et al. (2012) assume

σ'(x) = 1 if x > 0, and 0 otherwise
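A small sketch (an assumed example) of the ReLU function in (8) and the σ'(x) convention above:

```python
import numpy as np

def relu(x):
    """sigma(x) = max(x, 0), applied element-wise; Eq. (8)."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """The convention sigma'(x) = 1 if x > 0, and 0 otherwise."""
    return (x > 0).astype(float)
```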

## The Function φ(Z^{in,i}) I

In the matrix-matrix product
Wφ(Z^{in,i}),

each element is the inner product between a filter and a sub-image

We need to represent φ(Z^{in,i}) in an explicit form.

This is important for subsequent calculation

Clearly φ is a linear mapping, so there exists a 0/1 matrix P_{φ} such that

φ(Z^{in,i}) ≡ mat(P_{φ}vec(Z^{in,i}))_{hhd^{in}×a^{out}b^{out}}, ∀i, (9)

## The Function φ(Z^{in,i}) II

vec(M): all of M's columns concatenated into a vector:

vec(M) =

[ M_{:,1} ]
[ ... ]
[ M_{:,b} ]

∈ R^{ab×1}, where M ∈ R^{a×b}

mat(v) is the inverse of vec(·):

mat(v)_{a×b} =

[ v_{1} · · · v_{(b−1)a+1} ]
[ ... · · · ... ]
[ v_{a} · · · v_{ba} ]

∈ R^{a×b}, (10)

## The Function φ(Z^{in,i}) III

where v ∈ R^{ab×1}.

P_{φ} is a huge matrix:

P_{φ} ∈ R^{hhd^{in}a^{out}b^{out}×d^{in}a^{in}b^{in}}

and

φ : R^{d^{in}×a^{in}b^{in}} → R^{hhd^{in}×a^{out}b^{out}}
Later we will check implementation details

Past works using the form (9) include, for example, Vedaldi and Lenc (2015)
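A sketch (an assumed example) of the vec(·) and mat(·) operators in (10); mat inverts vec for a given shape:

```python
import numpy as np

def vec(M):
    """Concatenate the columns of M into an ab x 1 vector."""
    return M.reshape(-1, 1, order="F")

def mat(v, a, b):
    """Inverse of vec: reshape an ab-vector into an a x b matrix, column-major."""
    return np.asarray(v).reshape(a, b, order="F")
```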

## Optimization Problem I

We collect all weights to a vector variable θ.

θ =

[ vec(W^{1}) ]
[ b^{1} ]
[ ... ]
[ vec(W^{L}) ]
[ b^{L} ]

∈ R^{n}, n : total # variables

The output of the last layer L is a vector z^{L+1,i}(θ).

Consider any loss function such as the squared loss
ξ_i(θ) = ||z^{L+1,i}(θ) − y^{i}||^{2}.

## Optimization Problem II

The optimization problem is min_θ f(θ), where

f(θ) = θ^{T}θ/(2C) + (1/l) Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); y^{i}, Z^{1,i})
C : regularization parameter.

The formulation is almost the same as that for fully connected networks

## Optimization Problem III

Note that we divide the sum of training losses by the number of training data.

Thus the second term becomes the average training loss.

From the optimization problem to a real implementation, there is still a long way to go.

Further, CNN involves additional operations in practice:

padding

pooling

We will explain them

## Zero Padding I

To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border.

This technique is called zero-padding in CNN training.

An illustration:

## Zero Padding II

An input image with a zero border:

[Figure: an a^{in} × b^{in} image surrounded by p rows/columns of zeros on each side]

## Zero Padding III

The size of the new image is changed from
a^{in} × b^{in} to (a^{in} + 2p) × (b^{in} + 2p),
where p is specified by users

The operation can be treated as a layer of mapping
an input Z^{in,i} to an output Z^{out,i}.

Let

d^{out} = d^{in}.

## Zero Padding IV

There exists a 0/1 matrix

P_{pad} ∈ R^{d^{out}a^{out}b^{out}×d^{in}a^{in}b^{in}}

so that the padding operation can be represented by

Z^{out,i} ≡ mat(P_{pad}vec(Z^{in,i}))_{d^{out}×a^{out}b^{out}}. (11)
Implementation details will be discussed later
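A one-line sketch (an assumed example; channels-first array) of the zero-padding operation:

```python
import numpy as np

def zero_pad(Z, p):
    """Z: (d, a_in, b_in). Add p rows/columns of zeros on every side of
    each channel, giving shape (d, a_in + 2p, b_in + 2p)."""
    return np.pad(Z, ((0, 0), (p, p), (p, p)), mode="constant")
```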

## Pooling I

To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations.

Usually we consider an operation that can (approximately) extract rotationally or translationally invariant features.

Examples: average pooling, max pooling, and stochastic pooling

Let’s consider max pooling as an illustration

## Pooling II

An example:

image A:

[ 2 3 6 8 ]
[ 5 4 9 7 ]
[ 1 2 6 0 ]
[ 4 3 2 1 ]

→

[ 5 9 ]
[ 4 6 ]

image B:

[ 3 2 3 6 ]
[ 4 5 4 9 ]
[ 2 1 2 6 ]
[ 3 4 3 2 ]

→

[ 5 9 ]
[ 4 6 ]

## Pooling III

B is derived by shifting A by 1 pixel in the horizontal direction.

We split two images into four 2 × 2 sub-images and choose the max value from every sub-image.

In each sub-image because only some elements are changed, the maximal value is likely the same or similar.

This is called translational invariance

For our example the two output images from A and B are the same.
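The example above can be reproduced with a short sketch (an assumed example) of max pooling over non-overlapping h × h sub-regions:

```python
import numpy as np

def max_pool(Z, h):
    """Z: (a, b) single-channel image with a, b divisible by h.
    Take the max over each disjoint h x h sub-region (stride s = h)."""
    a, b = Z.shape
    return Z.reshape(a // h, h, b // h, h).max(axis=(1, 3))

A = np.array([[2, 3, 6, 8],
              [5, 4, 9, 7],
              [1, 2, 6, 0],
              [4, 3, 2, 1]])
B = np.array([[3, 2, 3, 6],    # A shifted by one pixel horizontally
              [4, 5, 4, 9],
              [2, 1, 2, 6],
              [3, 4, 3, 2]])
```

Both images pool to the same 2 × 2 output, illustrating translational invariance.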

## Pooling IV

For mathematical representation, we consider the
operation as a layer of mapping an input Z^{in,i} to an
output Z^{out,i}.

In practice pooling is considered as an operation at the end of the convolutional layer.

We partition every channel of Z^{in,i} into non-overlapping sub-regions by h × h filters.

Because the sub-regions are disjoint, the stride s for sliding the filters is equal to h.

## Pooling V

This partition step is a special case of how we generate sub-images in convolutional operations.

By the same definition as (9) we can generate the matrix

φ(Z^{in,i}) = mat(P_{φ}vec(Z^{in,i}))_{hh×d^{out}a^{out}b^{out}}, (12)

where

a^{out} = ⌊a^{in}/h⌋, b^{out} = ⌊b^{in}/h⌋, d^{out} = d^{in}. (13)

## Pooling VI

This agrees with the calculation in (5), because

⌊(a^{in} − h)/h⌋ + 1 = ⌊a^{in}/h⌋

Note that here we consider

hh × d^{out}a^{out}b^{out} rather than hhd^{out} × a^{out}b^{out}

because we can then do a max operation on each column

## Pooling VII

To select the largest element of each sub-region, there exists a 0/1 matrix

M^{i} ∈ R^{d^{out}a^{out}b^{out}×hhd^{out}a^{out}b^{out}}

so that each row of M^{i} selects a single element from vec(φ(Z^{in,i})).

Therefore,

Z^{out,i} = mat(M^{i}vec(φ(Z^{in,i})))_{d^{out}×a^{out}b^{out}}. (14)

## Pooling VIII

A comparison with (6) shows that M^{i} plays a role similar to the weight matrix W.

While M^{i} is 0/1, it is not constant. Its positions of 1's depend on the values of φ(Z^{in,i}).

By combining (12) and (14), we have

Z^{out,i} = mat(P_{pool}^{i}vec(Z^{in,i}))_{d^{out}×a^{out}b^{out}}, (15)

where

P_{pool}^{i} = M^{i}P_{φ} ∈ R^{d^{out}a^{out}b^{out}×d^{in}a^{in}b^{in}}. (16)

## Summary of a Convolutional Layer I

For implementation, padding and pooling are (optional) parts of the convolutional layers.

We now discuss the details of considering all operations together.

The whole convolutional layer involves the following procedure:

Z^{m,i} → padding by (11) → convolutional operations by (6), (7) → pooling by (15) → Z^{m+1,i}, (17)

## Summary of a Convolutional Layer II

where Z^{m,i} and Z^{m+1,i} are input and output of the
mth layer, respectively.

Let the following symbols denote image sizes at different stages of the convolutional layer.

a^{m}, b^{m} : size in the beginning

a^{m}_{pad}, b^{m}_{pad} : size after padding

a^{m}_{conv}, b^{m}_{conv} : size after convolution

The following table indicates what a^{in}, b^{in}, d^{in} and a^{out}, b^{out}, d^{out} are at each stage.

## Summary of a Convolutional Layer III

| Operation | Input | Output |
|---|---|---|
| Padding: (11) | Z^{m,i} | pad(Z^{m,i}) |
| Convolution: (6) | pad(Z^{m,i}) | S^{m,i} |
| Convolution: (7) | S^{m,i} | σ(S^{m,i}) |
| Pooling: (15) | σ(S^{m,i}) | Z^{m+1,i} |

| Operation | a^{in}, b^{in}, d^{in} | a^{out}, b^{out}, d^{out} |
|---|---|---|
| Padding: (11) | a^{m}, b^{m}, d^{m} | a^{m}_{pad}, b^{m}_{pad}, d^{m} |
| Convolution: (6) | a^{m}_{pad}, b^{m}_{pad}, d^{m} | a^{m}_{conv}, b^{m}_{conv}, d^{m+1} |
| Convolution: (7) | a^{m}_{conv}, b^{m}_{conv}, d^{m+1} | a^{m}_{conv}, b^{m}_{conv}, d^{m+1} |
| Pooling: (15) | a^{m}_{conv}, b^{m}_{conv}, d^{m+1} | a^{m+1}, b^{m+1}, d^{m+1} |

## Summary of a Convolutional Layer IV

Let the filter size, mapping matrices and weight matrices at the mth layer be

h^{m}, P^{m}_{pad}, P^{m}_{φ}, P^{m,i}_{pool}, W^{m}, b^{m}.

From (11), (6), (7), (15), all operations can be summarized as

S^{m,i} = W^{m}mat(P^{m}_{φ}P^{m}_{pad}vec(Z^{m,i}))_{h^{m}h^{m}d^{m}×a^{m}_{conv}b^{m}_{conv}} + b^{m}1^{T}_{a^{m}_{conv}b^{m}_{conv}},

Z^{m+1,i} = mat(P^{m,i}_{pool}vec(σ(S^{m,i})))_{d^{m+1}×a^{m+1}b^{m+1}}, (18)
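Putting the procedure in (17) together, here is a naive end-to-end sketch (an assumed example; channels-first arrays, ReLU activation, max pooling; all names illustrative) of one convolutional layer:

```python
import numpy as np

def conv_layer(Z, W, b, p=1, s=1, pool_h=2):
    """Z: (d_in, a, b); W: (d_out, h, h, d_in); b: (d_out,);
    p: padding width, s: convolution stride, pool_h: pooling filter size."""
    Z = np.pad(Z, ((0, 0), (p, p), (p, p)))            # padding, Eq. (11)
    d_in, a_in, b_in = Z.shape
    d_out, h = W.shape[0], W.shape[1]
    a_c = (a_in - h) // s + 1                          # Eq. (5)
    b_c = (b_in - h) // s + 1
    S = np.empty((d_out, a_c, b_c))
    for j in range(d_out):                             # convolution, Eq. (6)
        Wj = W[j].transpose(2, 0, 1)                   # (d_in, h, h)
        for pp in range(a_c):
            for qq in range(b_c):
                sub = Z[:, pp*s : pp*s + h, qq*s : qq*s + h]
                S[j, pp, qq] = np.sum(sub * Wj) + b[j]
    Zc = np.maximum(S, 0)                              # ReLU, Eqs. (7)-(8)
    a_o, b_o = a_c // pool_h, b_c // pool_h            # Eq. (13)
    Zc = Zc[:, :a_o * pool_h, :b_o * pool_h]
    return Zc.reshape(d_out, a_o, pool_h, b_o, pool_h).max(axis=(2, 4))  # max pooling
```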

## Fully-Connected Layer I

Assume L^{c} is the number of convolutional layers.

Input vector of the first fully-connected layer:

z^{m,i} = vec(Z^{m,i}), i = 1, . . . , l, m = L^{c} + 1.

In each of the fully-connected layers (L^{c} < m ≤ L), we consider a weight matrix and a bias vector between layers m and m + 1.

## Fully-Connected Layer II

Weight matrix:

W^{m} =

[ w^{m}_{11} w^{m}_{12} · · · w^{m}_{1n_m} ]
[ w^{m}_{21} w^{m}_{22} · · · w^{m}_{2n_m} ]
[ ... ... ... ... ]
[ w^{m}_{n_{m+1}1} w^{m}_{n_{m+1}2} · · · w^{m}_{n_{m+1}n_m} ]

∈ R^{n_{m+1}×n_m} (19)

Bias vector:

b^{m} = [b^{m}_{1}, b^{m}_{2}, . . . , b^{m}_{n_{m+1}}]^{T} ∈ R^{n_{m+1}×1}

## Fully-Connected Layer III

Here n_{m} and n_{m+1} are the numbers of nodes in layers m and m + 1, respectively.

If z^{m,i} ∈ R^{n_m} is the input vector, the following operations are applied to generate the output vector z^{m+1,i} ∈ R^{n_{m+1}}:

s^{m,i} = W^{m}z^{m,i} + b^{m}, (20)

z^{m+1,i}_{j} = σ(s^{m,i}_{j}), j = 1, . . . , n_{m+1}. (21)

## Outline

1 Regularized linear classification

2 Optimization problem for fully-connected networks

3 Optimization problem for convolutional neural networks (CNN)

4 Discussion

## Challenges in NN Optimization

The objective function is non-convex. It may have many local minima

It’s known that global optimization is much more difficult than local minimization

The problem structure is very complicated

In this course we will get first-hand experience in handling these difficulties

## Formulation I

We have written all CNN operations in matrix/vector forms

This is useful in deriving the gradient

Are our representation symbols good enough? Can we do better?

You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful

## References I

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.