Networks
Chih-Jen Lin
National Taiwan University
Last updated: May 25, 2020
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Minimizing Training Errors
Basically, a classification method starts by minimizing the training errors
min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a neural network, or other types
Minimizing Training Errors (Cont’d)
For simplicity, let’s consider the model to be a vector w
That is, the decision function is sgn(wTx)
For any data x, the predicted label is
1 if wTx ≥ 0, −1 otherwise
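As a tiny added sketch (not from the original slides), the decision function sgn(wTx) can be written in NumPy as follows; the vectors are made-up examples.

import numpy as np

def predict(w, x):
    # decision function sgn(w^T x): return 1 if w^T x >= 0, and -1 otherwise
    return 1 if w @ x >= 0 else -1

print(predict(np.array([1.0, -2.0]), np.array([3.0, 1.0])))  # 1*3 + (-2)*1 = 1 >= 0, so prints 1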
Minimizing Training Errors (Cont’d)
The two-dimensional situation
[Figure: circles and triangles denote two classes of points, separated by the line wTx = 0]
This seems to be quite restricted, but practically x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; y, x) for each instance (y, x), where y = ±1 is the label and x is the feature vector.
Ideally we should use the 0–1 training loss:
ξ(w; y, x) = 1 if ywTx < 0, 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss ξ(w; y, x) as a function of −ywTx]
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss)
ξL1(w; y, x) ≡ max(0, 1 − ywTx)   (1)
Logistic loss
ξLR(w; y, x) ≡ log(1 + e^(−ywTx))   (2)
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)
SVM and LR are two very fundamental classification methods
Common Loss Functions (Cont’d)
[Figure: the hinge loss ξL1 and the logistic loss ξLR as functions of −ywTx]
Logistic regression is closely related to SVM. Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy, but this is useless
When training a data set, we should
avoid underfitting: small training error
avoid overfitting: small testing error
[Figure: overfitting illustration; two symbol types mark training data and triangles mark testing data]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w’s values less extreme.
One idea is to make w's values closer to zero
We can add, for example,
wTw/2 or ‖w‖1
to the function that is minimized
General Form of Linear Classification
Training data {yi, xi}, xi ∈ R^n, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features
min_w f(w),   f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; yi, xi)
wTw/2: regularization term
ξ(w; y, x): loss function
C : regularization parameter (chosen by users)
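To make the formulation concrete, here is a small added NumPy sketch (not from the slides) of f(w) with the hinge and logistic losses; X, y, and C are placeholder data and a placeholder parameter.

import numpy as np

def hinge_loss(w, X, y):
    # xi_L1(w; y_i, x_i) = max(0, 1 - y_i * w^T x_i) for every instance
    return np.maximum(0.0, 1.0 - y * (X @ w))

def logistic_loss(w, X, y):
    # xi_LR(w; y_i, x_i) = log(1 + exp(-y_i * w^T x_i)) for every instance
    return np.log1p(np.exp(-y * (X @ w)))

def f(w, X, y, C, loss=hinge_loss):
    # f(w) = w^T w / 2 + C * sum_i xi(w; y_i, x_i)
    return 0.5 * w @ w + C * loss(w, X, y).sum()

# toy usage with random data: l = 20 instances, n = 5 features
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = np.where(rng.standard_normal(20) >= 0, 1.0, -1.0)  # labels in {+1, -1}
print(f(np.zeros(5), X, y, C=1.0))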
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Multi-class Classification I
Our training set includes (yi,xi), i = 1, . . . , l.
xi ∈ R^{n_1} is the feature vector.
yi ∈ R^K is the label vector
As the label is now a vector rather than a scalar ±1, each (label, instance) pair is written as (yi, xi) with yi ∈ R^K
K: # of classes
If xi is in class k, then
yi = [0, . . . , 0, 1, 0, . . . , 0]^T ∈ R^K,
where the single 1 is in position k (the first k − 1 entries are 0)
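For concreteness, a short added sketch (not in the slides) of building the label vector yi for an instance in class k:

import numpy as np

def one_hot(k, K):
    # y = [0, ..., 0, 1, 0, ..., 0]^T in R^K with the single 1 in position k (1-based)
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

print(one_hot(3, 5))  # [0. 0. 1. 0. 0.]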
Multi-class Classification II
A neural network maps each feature vector to one of the class labels by the connection of nodes.
Fully-connected Networks
Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).
[Figure: a small fully-connected network; every node in a layer connects to every node in the next layer]
Operations Between Two Layers I
The weight matrix Wm at the mth layer is
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\
w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m}
\end{bmatrix} ∈ R^{n_{m+1} × n_m}
n_m: # input features at layer m
n_{m+1}: # output features at layer m, or # input features at layer m + 1
L: number of layers
Operations Between Two Layers II
n_1 = # of features, n_{L+1} = # of classes
Let zm be the input of the mth layer, z1 = x and zL+1 be the output
From the mth layer to the (m + 1)th layer:
s^m = W^m z^m,
z^{m+1}_j = σ(s^m_j), j = 1, . . . , n_{m+1},
where σ(·) is the activation function.
Operations Between Two Layers III
Usually people add a bias term
b^m =
\begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix} ∈ R^{n_{m+1} × 1},
so that
s^m = W^m z^m + b^m
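As a rough added sketch (not part of the slides), the operations between two fully-connected layers can be written in NumPy as below; the activation σ is taken to be max(·, 0) purely for illustration.

import numpy as np

def fc_forward(W, b, z):
    # s^m = W^m z^m + b^m, then apply the activation sigma elementwise
    s = W @ z + b
    return np.maximum(s, 0.0)  # illustrative choice of sigma

# toy usage: n_m = 4 input features, n_{m+1} = 3 output features
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
z = rng.standard_normal(4)
print(fc_forward(W, b, z))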
Operations Between Two Layers IV
The activation function is usually an R → R transformation. As we are interested in optimization, let's not worry about why it's needed
We collect all variables:
θ =
\begin{bmatrix} vec(W^1) \\ b^1 \\ \vdots \\ vec(W^L) \\ b^L \end{bmatrix} ∈ R^n
Operations Between Two Layers V
n: total # of variables = (n_1 + 1)n_2 + · · · + (n_L + 1)n_{L+1}
The vec(·) operator stacks the columns of a matrix into a vector
Optimization Problem I
We solve the following optimization problem, min_θ f(θ), where
f(θ) = (1/2) θ^T θ + C Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); yi, xi)
C: regularization parameter
z^{L+1}(θ) ∈ R^{n_{L+1}}: last-layer output vector of x
ξ(z^{L+1}; y, x): loss function. Example:
ξ(z^{L+1}; y, x) = ‖z^{L+1} − y‖^2
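Below is a minimal added sketch of the whole objective for a fully-connected network with the squared loss, assuming the same σ(s) = max(s, 0) choice as above; Ws, bs, X, Y are placeholder names, and σ is applied at every layer simply to follow the slides' uniform rule.

import numpy as np

def network_output(Ws, bs, x):
    # forward pass: z^1 = x, s^m = W^m z^m + b^m, z^{m+1}_j = sigma(s^m_j)
    z = x
    for W, b in zip(Ws, bs):
        z = np.maximum(W @ z + b, 0.0)
    return z

def f(Ws, bs, X, Y, C):
    # f(theta) = theta^T theta / 2 + C * sum_i ||z^{L+1,i}(theta) - y_i||^2
    reg = 0.5 * sum((W ** 2).sum() + (b ** 2).sum() for W, b in zip(Ws, bs))
    loss = sum(((network_output(Ws, bs, x) - y) ** 2).sum() for x, y in zip(X, Y))
    return reg + C * loss

# toy usage: a 4-3-2 network on 5 random instances
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(2)]
X = rng.standard_normal((5, 4))
Y = np.eye(2)[rng.integers(0, 2, 5)]  # one-hot label vectors
print(f(Ws, bs, X, Y, C=1.0))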
Optimization Problem II
The formulation is the same as for linear classification
However, the loss function is more complicated
Further, it's non-convex
Note that in the earlier discussion we consider a single instance
In the training process we actually have for i = 1, . . . , l,
s^{m,i} = W^m z^{m,i},
z^{m+1,i}_j = σ(s^{m,i}_j), j = 1, . . . , n_{m+1}
This makes the training more complicated
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Why CNN? I
There are many types of neural networks
They are suitable for different types of problems. While deep learning is hot, it's not always better than other learning methods
For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository)
They are not consistently better than random forests or SVM; see the comparisons (Meyer et al., 2003; Fernández-Delgado et al., 2014; Wang et al., 2018)
Why CNN? II
We are interested in CNN because it's shown to be significantly better than others on image data. That's one of the main reasons deep learning became popular
To study optimization algorithms, of course we want to consider an “established” network
That's why CNN was chosen for our discussion
However, the problem is that operations in CNN are more complicated than in fully-connected networks
Most books/papers only give explanations without detailed mathematical forms
Why CNN? III
To study the optimization, we need some clean formulations
So let’s give it a try here
Convolutional Neural Networks I
Consider a K -class classification problem with training data
(yi, Z1,i), i = 1, . . . , l.
yi: label vector, Z1,i: input image
If Z1,i is in class k, then
yi = [0, . . . , 0, 1, 0, . . . , 0]^T ∈ R^K,
where the single 1 is in position k
CNN maps each image Z1,i to yi
Convolutional Neural Networks II
Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.
Input and output of a convolutional layer are assumed to be images.
Convolutional Layers I
For the current layer, let the input be an image Zin: ain × bin × din.
ain: height, bin: width, and din: #channels.
[Figure: an input image with height ain, width bin, and din channels]
Convolutional Layers II
The goal is to generate an output image Zout,i of dout channels of aout × bout images.
Consider dout filters.
Filter j ∈ {1, . . . , dout} has dimensions h × h × din.
Filter j contains, for each channel d = 1, . . . , din, an h × h matrix of weights:
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & \ddots & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}, d = 1, . . . , din.
Convolutional Layers III
h: filter height/width (layer index omitted)
[Figure: a 3 × 3 input channel and the corresponding 2 × 2 output channel j with entries s^{out,i}_{1,1,j}, s^{out,i}_{1,2,j}, s^{out,i}_{2,1,j}, s^{out,i}_{2,2,j}]
To compute the j th channel of output, we scan the input from top-left to bottom-right to obtain the sub-images of size h × h × din
Convolutional Layers IV
We then calculate the inner product between each sub-image and the j th filter
For example, if we start from the upper left corner of the input image, the first sub-image of channel d is
\begin{bmatrix}
z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\
\vdots & \ddots & \vdots \\
z^i_{h,1,d} & \cdots & z^i_{h,h,d}
\end{bmatrix}.
Convolutional Layers V
We then calculate
\sum_{d=1}^{din} \Big\langle
\begin{bmatrix}
z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\
\vdots & & \vdots \\
z^i_{h,1,d} & \cdots & z^i_{h,h,d}
\end{bmatrix},
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}
\Big\rangle + b_j,   (3)
where ⟨·, ·⟩ means the sum of component-wise products between two matrices.
This value becomes the (1, 1) position of the channel j of the output image.
Convolutional Layers VI
Next, we use other sub-images to produce values in other positions of the output image.
Let the stride s be the number of pixels moved vertically or horizontally to get the next sub-image.
For the (2, 1) position of the output image, we move down s pixels vertically to obtain the following sub-image:
\begin{bmatrix}
z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\
\vdots & \ddots & \vdots \\
z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d}
\end{bmatrix}.
Convolutional Layers VII
The (2, 1) position of the channel j of the output image is
\sum_{d=1}^{din} \Big\langle
\begin{bmatrix}
z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\
\vdots & & \vdots \\
z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d}
\end{bmatrix},
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}
\Big\rangle + b_j.   (4)
Convolutional Layers VIII
The output image sizes aout and bout are respectively the numbers of positions the filter can take vertically and horizontally:
aout = ⌊(ain − h)/s⌋ + 1,  bout = ⌊(bin − h)/s⌋ + 1   (5)
Rationale of (5): vertically, the last row of each sub-image is
h, h + s, . . . , h + ∆s ≤ ain
Convolutional Layers IX
Thus
∆ = ⌊(ain − h)/s⌋
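Putting (3)-(5) together, here is a naive added NumPy sketch of the convolutional operation; indices are 0-based, and the (channel, row, column) array layout is my own choice for illustration, not the vec-based layout used later in the slides.

import numpy as np

def conv_forward(Z, W, b, s):
    # Z: (d_in, a_in, b_in) input image, W: (d_out, d_in, h, h) filters, b: (d_out,) biases
    d_in, a_in, b_in = Z.shape
    d_out, _, h, _ = W.shape
    a_out = (a_in - h) // s + 1  # eq. (5)
    b_out = (b_in - h) // s + 1
    S = np.empty((d_out, a_out, b_out))
    for j in range(d_out):
        for p in range(a_out):
            for q in range(b_out):
                sub = Z[:, p*s:p*s+h, q*s:q*s+h]        # h x h x d_in sub-image
                S[j, p, q] = np.sum(sub * W[j]) + b[j]  # inner product plus bias, as in (3)/(4)
    return S

# toy usage: one 3x3 single-channel image, one 2x2 all-ones filter, stride 1
Z = np.arange(9.0).reshape(1, 3, 3)
W = np.ones((1, 1, 2, 2))
print(conv_forward(Z, W, np.zeros(1), s=1))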
Matrix Operations I
For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations
We will go back to this issue later
Matrix Operations II
Let's collect images of all channels as the input
Zin,i =
\begin{bmatrix}
z^i_{1,1,1} & z^i_{2,1,1} & \cdots & z^i_{ain,bin,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^i_{1,1,din} & z^i_{2,1,din} & \cdots & z^i_{ain,bin,din}
\end{bmatrix} ∈ R^{din × ain bin}.
Matrix Operations III
Let all filters
W =
\begin{bmatrix}
w^1_{1,1,1} & w^1_{2,1,1} & \cdots & w^1_{h,h,din} \\
\vdots & \vdots & \ddots & \vdots \\
w^{dout}_{1,1,1} & w^{dout}_{2,1,1} & \cdots & w^{dout}_{h,h,din}
\end{bmatrix} ∈ R^{dout × hh din}
be variables (parameters) of the current layer
Matrix Operations IV
Usually a bias term is considered:
b =
\begin{bmatrix} b_1 \\ \vdots \\ b_{dout} \end{bmatrix} ∈ R^{dout × 1}
Operations at a layer:
Sout,i = W φ(Zin,i) + b 1^T_{aout bout} ∈ R^{dout × aout bout},   (6)
Matrix Operations V
where
1_{aout bout} = [1, . . . , 1]^T ∈ R^{aout bout × 1}.
φ(Zin,i) collects all sub-images in Zin,i into a matrix.
Matrix Operations VI
Specifically,
φ(Zin,i) =
\begin{bmatrix}
z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(aout-1)s,\,1+(bout-1)s,\,1} \\
z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(aout-1)s,\,1+(bout-1)s,\,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(aout-1)s,\,h+(bout-1)s,\,1} \\
\vdots & \vdots & & \vdots \\
z^i_{h,h,din} & z^i_{h+s,h,din} & \cdots & z^i_{h+(aout-1)s,\,h+(bout-1)s,\,din}
\end{bmatrix} ∈ R^{hh din × aout bout}
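Here is a rough added sketch of φ(Zin,i) (often called im2col) and of the matrix form (6); it assumes Zin is stored as a din × (ain bin) matrix whose columns are pixels in column-major image order, as in the slides, and the helper names are mine.

import numpy as np

def im2col(Zin, d_in, a_in, b_in, h, s):
    # Zin: d_in x (a_in*b_in); column k holds pixel (r, c) with k = c*a_in + r (0-based)
    a_out = (a_in - h) // s + 1
    b_out = (b_in - h) // s + 1
    # recover an (a_in, b_in, d_in) array for easier slicing
    Z = Zin.reshape(d_in, a_in, b_in, order='F').transpose(1, 2, 0)
    cols = []
    for q in range(b_out):        # output column index
        for p in range(a_out):    # output row index moves fastest, as in the slides
            sub = Z[p*s:p*s+h, q*s:q*s+h, :]          # h x h x d_in sub-image
            cols.append(sub.reshape(-1, order='F'))   # rows, then columns, then channels
    return np.stack(cols, axis=1)                     # hh*d_in x (a_out*b_out)

def conv_op(Zin, W, b, d_in, a_in, b_in, h, s):
    # eq. (6): S^{out} = W phi(Z^{in}) + b 1^T
    return W @ im2col(Zin, d_in, a_in, b_in, h, s) + b[:, None]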
Activation Function I
Next, an activation function scales each element of Sout,i to obtain the output matrix Zout,i.
Zout,i = σ(Sout,i) ∈ R^{dout × aout bout}.   (7)
For CNN, commonly the following ReLU activation function
σ(x) = max(x, 0)   (8)
is used
Later we need σ(x) to be differentiable, but the ReLU function is not.
Activation Function II
Past works such as Krizhevsky et al. (2012) assume
σ′(x) = 1 if x > 0, 0 otherwise
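A one-line added sketch of the ReLU activation (8) and of the derivative convention above:

import numpy as np

def relu(x):
    # sigma(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # the convention sigma'(x) = 1 if x > 0 and 0 otherwise
    return (x > 0).astype(float)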
The Function φ(Zin,i) I
In the matrix-matrix product Wφ(Zin,i),
each element is the inner product between a filter and a sub-image
We need to represent φ(Zin,i) in an explicit form.
This is important for subsequent calculation
Clearly φ is a linear mapping, so there exists a 0/1 matrix Pφ such that
φ(Zin,i) ≡ mat(Pφ vec(Zin,i))_{hh din × aout bout}, ∀i,   (9)
The Function φ(Zin,i) II
vec(M): all of M's columns concatenated into a vector:
vec(M) =
\begin{bmatrix} M_{:,1} \\ \vdots \\ M_{:,b} \end{bmatrix} ∈ R^{ab × 1}, where M ∈ R^{a × b}
mat(v) is the inverse of vec(M):
mat(v)_{a×b} =
\begin{bmatrix}
v_1 & \cdots & v_{(b-1)a+1} \\
\vdots & & \vdots \\
v_a & \cdots & v_{ba}
\end{bmatrix} ∈ R^{a × b},   (10)
The Function φ(Zin,i) III
where v ∈ R^{ab × 1}.
Pφ is a huge matrix:
Pφ ∈ R^{hh din aout bout × din ain bin}
and
φ : R^{din × ain bin} → R^{hh din × aout bout}
Later we will check implementation details
Past works using the form (9) include, for example, Vedaldi and Lenc (2015)
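To make (9) and (10) concrete, here is a small added sketch of vec, mat, and a brute-force way to obtain the 0/1 matrix Pφ by applying the im2col sketch above to unit vectors; it only illustrates the linearity of φ and is not an efficient construction.

import numpy as np

def vec(M):
    # stack the columns of M into a vector (column-major)
    return M.reshape(-1, order='F')

def mat(v, a, b):
    # inverse of vec: fill an a x b matrix column by column
    return v.reshape(a, b, order='F')

def build_P_phi(d_in, a_in, b_in, h, s):
    # column j of P_phi is vec(phi(mat(e_j))), so that vec(phi(Z)) = P_phi vec(Z)
    a_out = (a_in - h) // s + 1
    b_out = (b_in - h) // s + 1
    n_in = d_in * a_in * b_in
    P = np.zeros((h * h * d_in * a_out * b_out, n_in))
    for j in range(n_in):
        e = np.zeros(n_in)
        e[j] = 1.0
        P[:, j] = vec(im2col(mat(e, d_in, a_in * b_in), d_in, a_in, b_in, h, s))
    return P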
Optimization Problem I
We collect all weights into a vector variable θ:
θ =
\begin{bmatrix} vec(W^1) \\ b^1 \\ \vdots \\ vec(W^L) \\ b^L \end{bmatrix} ∈ R^n, n: total # of variables
The output of the last layer L is a vector zL+1,i(θ).
Consider any loss function such as the squared loss ξ_i(θ) = ‖z^{L+1,i}(θ) − yi‖^2.
Optimization Problem II
The optimization problem is min_θ f(θ), where
f(θ) = (1/(2C)) θ^T θ + (1/l) Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); yi, Z1,i)
C: regularization parameter.
The formulation is almost the same as that for fully connected networks
Optimization Problem III
Note that we divide the sum of training losses by the number of training data
Thus the second term becomes the average training loss
With the optimization problem defined, there is still a long way to go before a real implementation
Further, CNN involves additional operations in practice
padding
pooling
We will explain them
Zero Padding I
To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border.
This technique is called zero-padding in CNN training.
An illustration:
Zero Padding II
An input image
[Figure: the ain × bin input image surrounded by a border of p rows/columns of zeros on each side]
Zero Padding III
The size of the new image is changed from ain × bin to (ain + 2p) × (bin + 2p), where p is specified by users
The operation can be treated as a layer of mapping an input Zin,i to an output Zout,i.
Let dout = din.
Zero Padding IV
There exists a 0/1 matrix
Ppad ∈ R^{dout aout bout × din ain bin}
so that the padding operation can be represented by
Zout,i ≡ mat(Ppad vec(Zin,i))_{dout × aout bout}.   (11)
Implementation details will be discussed later
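An added sketch of zero-padding in the same din × (ain bin) layout used above; np.pad on the unfolded per-channel images is used for brevity, so the explicit Ppad matrix is never formed.

import numpy as np

def zero_pad(Zin, d_in, a_in, b_in, p):
    # enlarge each channel from a_in x b_in to (a_in + 2p) x (b_in + 2p) with zeros on the border
    Z = Zin.reshape(d_in, a_in, b_in, order='F')   # per-channel images, pixels in column-major order
    Zp = np.pad(Z, ((0, 0), (p, p), (p, p)))       # pad only the two spatial dimensions
    return Zp.reshape(d_in, (a_in + 2*p) * (b_in + 2*p), order='F')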
Pooling I
To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations.
Usually we consider an operation that can (approximately) extract features with rotational or translational invariance.
Examples: average pooling, max pooling, and stochastic pooling.
Let’s consider max pooling as an illustration
Pooling II
An example:
image A
2 3 6 8
5 4 9 7
1 2 6 0
4 3 2 1
→
5 9
4 6
image B
3 2 3 6
4 5 4 9
2 1 2 6
3 4 3 2
→
5 9
4 6
Pooling III
B is derived by shifting A by 1 pixel in the horizontal direction.
We split two images into four 2 × 2 sub-images and choose the max value from every sub-image.
Because only some elements of each sub-image are changed, the maximal value is likely the same or similar.
This is called translational invariance
For our example the two output images from A and B are the same.
Pooling IV
For mathematical representation, we consider the operation as a layer of mapping an input Zin,i to an output Zout,i.
In practice pooling is considered as an operation at the end of the convolutional layer.
We partition every channel of Zin,i into
non-overlapping sub-regions by h × h filters with the stride s = h
Because of the disjoint sub-regions, the stride s for sliding the filters is equal to h.
Pooling V
This partition step is a special case of how we generate sub-images in convolutional operations.
By the same definition as (9) we can generate the matrix
φ(Zin,i) = mat(Pφ vec(Zin,i))_{hh × dout aout bout},   (12)
where
aout = ⌊ain/h⌋,  bout = ⌊bin/h⌋,  dout = din.   (13)
Pooling VI
This is the same as the calculation in (5) because
⌊(ain − h)/h⌋ + 1 = ⌊ain/h⌋
Note that here we consider
hh × dout aout bout rather than hh dout × aout bout
because we can then do a max operation on each column
Pooling VII
To select the largest element of each sub-region, there exists a 0/1 matrix
Mi ∈ R^{dout aout bout × hh dout aout bout}
so that each row of Mi selects a single element from vec(φ(Zin,i)).
Therefore,
Zout,i = mat(Mi vec(φ(Zin,i)))_{dout × aout bout}.   (14)
Pooling VIII
A comparison with (6) shows that Mi plays a similar role to the weight matrix W
While Mi is 0/1, it is not a constant. Its positions of 1's depend on the values of φ(Zin,i)
By combining (12) and (14), we have
Zout,i = mat(P^i_pool vec(Zin,i))_{dout × aout bout},   (15)
where
P^i_pool = Mi Pφ ∈ R^{dout aout bout × din ain bin}.   (16)
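An added sketch of max pooling with non-overlapping h × h regions in the same layout; it realizes (12)-(14) with s = h, except that the selection matrix Mi is never formed explicitly. The toy usage reuses image A from the earlier slide.

import numpy as np

def max_pool(Zin, d_in, a_in, b_in, h):
    # keep the maximum of every non-overlapping h x h region of each channel
    a_out, b_out = a_in // h, b_in // h  # eq. (13), with d_out = d_in
    Z = Zin.reshape(d_in, a_in, b_in, order='F')[:, :a_out*h, :b_out*h]
    blocks = Z.reshape(d_in, a_out, h, b_out, h)   # split each channel into h x h regions
    out = blocks.max(axis=(2, 4))                  # the largest element of every region
    return out.reshape(d_in, a_out * b_out, order='F')

# toy usage: image A from the pooling example, stored column by column as one channel
A = np.array([[2., 3, 6, 8], [5, 4, 9, 7], [1, 2, 6, 0], [4, 3, 2, 1]])
ZA = A.reshape(1, 16, order='F')
print(max_pool(ZA, 1, 4, 4, 2).reshape(2, 2, order='F'))  # [[5. 9.] [4. 6.]]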
Summary of a Convolutional Layer I
For implementation, padding and pooling are (optional) parts of the convolutional layers.
We discuss details of considering all operations together.
The whole convolutional layer involves the following procedure:
Zm,i → padding by (11) → convolutional operations by (6), (7) → pooling by (15) → Zm+1,i,   (17)
Summary of a Convolutional Layer II
where Zm,i and Zm+1,i are input and output of the mth layer, respectively.
Let the following symbols denote image sizes at different stages of the convolutional layer.
a_m, b_m: size in the beginning
a^pad_m, b^pad_m: size after padding
a^conv_m, b^conv_m: size after convolution.
The following table indicates what ain, bin, din and aout, bout, dout are at each stage.
Summary of a Convolutional Layer III
Operation        | Input       | Output
Padding: (11)    | Zm,i        | pad(Zm,i)
Convolution: (6) | pad(Zm,i)   | Sm,i
Convolution: (7) | Sm,i        | σ(Sm,i)
Pooling: (15)    | σ(Sm,i)     | Zm+1,i

Operation        | ain, bin, din                  | aout, bout, dout
Padding: (11)    | a_m, b_m, d_m                  | a^pad_m, b^pad_m, d_m
Convolution: (6) | a^pad_m, b^pad_m, d_m          | a^conv_m, b^conv_m, d_{m+1}
Convolution: (7) | a^conv_m, b^conv_m, d_{m+1}    | a^conv_m, b^conv_m, d_{m+1}
Pooling: (15)    | a^conv_m, b^conv_m, d_{m+1}    | a_{m+1}, b_{m+1}, d_{m+1}
Summary of a Convolutional Layer IV
Let the filter size, mapping matrices, and weight matrices at the mth layer be
h_m, P^m_pad, P^m_φ, P^{m,i}_pool, W^m, b^m.
From (11), (6), (7), (15), all operations can be summarized as
S^{m,i} = W^m mat(P^m_φ P^m_pad vec(Z^{m,i}))_{h_m h_m d_m × a^conv_m b^conv_m} + b^m 1^T_{a^conv_m b^conv_m},
Z^{m+1,i} = mat(P^{m,i}_pool vec(σ(S^{m,i})))_{d_{m+1} × a_{m+1} b_{m+1}},   (18)
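Assuming the zero_pad, conv_op, relu, and max_pool sketches above, the whole procedure (17) can be chained as below; the helper and its arguments are my own naming, not the slides'.

def conv_layer_forward(Zm, d_m, a_m, b_m, W, bias, h, s, p, h_pool):
    # padding by (11)
    Zpad = zero_pad(Zm, d_m, a_m, b_m, p)
    a_pad, b_pad = a_m + 2*p, b_m + 2*p
    # convolutional operations by (6) and (7)
    S = conv_op(Zpad, W, bias, d_m, a_pad, b_pad, h, s)
    a_conv = (a_pad - h) // s + 1
    b_conv = (b_pad - h) // s + 1
    Zact = relu(S)                 # sigma(S^{m,i}), eq. (7)
    # pooling by (15); the number of output channels equals the number of filters
    d_next = W.shape[0]
    return max_pool(Zact, d_next, a_conv, b_conv, h_pool)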
Fully-Connected Layer I
Assume Lc is the number of convolutional layers. The input vector of the first fully-connected layer is
zm,i = vec(Zm,i), i = 1, . . . , l, m = Lc + 1.
In each of the fully-connected layers (Lc < m ≤ L), we consider the weight matrix and bias vector between layers m and m + 1.
Fully-Connected Layer II
Weight matrix:
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\
w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m}
\end{bmatrix} ∈ R^{n_{m+1} × n_m}   (19)
Bias vector:
b^m =
\begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix} ∈ R^{n_{m+1} × 1}
Fully-Connected Layer III
Here nm and nm+1 are the numbers of nodes in layers m and m + 1, respectively.
If z^{m,i} ∈ R^{n_m} is the input vector, the following operations are applied to generate the output vector z^{m+1,i} ∈ R^{n_{m+1}}:
s^{m,i} = W^m z^{m,i} + b^m,   (20)
z^{m+1,i}_j = σ(s^{m,i}_j), j = 1, . . . , n_{m+1}.   (21)
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Challenges in NN Optimization
The objective function is non-convex. It may have many local minima
It’s known that global optimization is much more difficult than local minimization
The problem structure is very complicated
In this course we will have first-hand experience in handling these difficulties
Formulation I
We have written all CNN operations in matrix/vector forms
This is useful in deriving the gradient
Are our representation symbols good enough? Can we do better?
You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful
References I
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.
A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.
C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.