
# Optimization Problems for Neural Networks


### Chih-Jen Lin

National Taiwan University

Last updated: May 25, 2020


## Outline

1 Regularized linear classification

2 Optimization problem for fully-connected networks

3 Optimization problem for convolutional neural networks (CNN)

4 Discussion


## Minimizing Training Errors

Basically a classification method starts with minimizing the training errors

$$\min_{\text{model}} \ (\text{training errors})$$

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a neural network, or other types

## Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is $\operatorname{sgn}(w^T x)$

For any data $x$, the predicted label is

$$\begin{cases} 1 & \text{if } w^T x \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
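As a minimal sketch of this decision function in plain Python (the name `predict` and the example vector are illustrative assumptions, not part of the slides):

```python
def predict(w, x):
    """Return +1 if w^T x >= 0, and -1 otherwise."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1


w = [1.0, -2.0]                 # illustrative model vector
print(predict(w, [3.0, 1.0]))   # w^T x = 1, so the label is 1
print(predict(w, [1.0, 1.0]))   # w^T x = -1, so the label is -1
```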

## Minimizing Training Errors (Cont’d)

The two-dimensional situation

[Figure: two classes of points in the plane, separated by the line $w^T x = 0$.]

This seems to be quite restricted, but practically $x$ is in a much higher dimensional space

## Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function $\xi(w; y, x)$ for each instance $(y, x)$, where $y = \pm 1$ is the label and $x$ is the feature vector

Ideally we should use 0–1 training loss:

$$\xi(w; y, x) = \begin{cases} 1 & \text{if } y w^T x < 0, \\ 0 & \text{otherwise.} \end{cases}$$

## Minimizing Training Errors (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

[Figure: the 0–1 loss $\xi(w; y, x)$ plotted against $-y w^T x$.]

We need continuous approximations

## Common Loss Functions

Hinge loss (l1 loss)

$$\xi_{L1}(w; y, x) \equiv \max(0, 1 - y w^T x) \tag{1}$$

Logistic loss

$$\xi_{LR}(w; y, x) \equiv \log(1 + e^{-y w^T x}) \tag{2}$$

Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)

SVM and LR are two very fundamental classification methods
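The 0–1, hinge and logistic losses can be compared with a small sketch. The helper `dot` and the function names are assumptions made for illustration; the logistic loss is written in a numerically stable but mathematically equivalent form.

```python
import math


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def zero_one_loss(w, y, x):
    """0-1 loss: 1 if y w^T x < 0, and 0 otherwise."""
    return 1.0 if y * dot(w, x) < 0 else 0.0


def hinge_loss(w, y, x):
    """Equation (1): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * dot(w, x))


def logistic_loss(w, y, x):
    """Equation (2): log(1 + exp(-y w^T x)).

    Uses log(1 + e^m) = max(m, 0) + log(1 + e^{-|m|}) to avoid overflow.
    """
    m = -y * dot(w, x)
    return max(m, 0.0) + math.log1p(math.exp(-abs(m)))


w, x = [1.0, 0.0], [2.0, 0.0]
for y in (1, -1):
    print(y, zero_one_loss(w, y, x), hinge_loss(w, y, x), logistic_loss(w, y, x))
```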

## Common Loss Functions (Cont’d)

[Figure: the losses $\xi_{L1}$ and $\xi_{LR}$ plotted against $-y w^T x$.]

Logistic regression is closely related to SVM

Their performance is usually similar

## Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs


## Overfitting

See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy

This is useless

When training on a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error

## Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme.

One idea is to make the values of $w$ closer to zero

We can add, for example, $\frac{w^T w}{2}$ or $\|w\|_1$ to the function that is minimized

## General Form of Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \dots, l$, $y_i = \pm 1$

$l$: # of data, $n$: # of features

$$\min_w f(w), \quad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; y_i, x_i)$$

$w^T w / 2$: regularization term

$\xi(w; y, x)$: loss function

$C$: regularization parameter (chosen by users)
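A minimal sketch of evaluating $f(w)$ follows. The function name `objective`, the data layout (a list of `(y, x)` pairs) and the choice of the hinge loss in the example are assumptions for illustration.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def hinge_loss(w, y, x):
    return max(0.0, 1.0 - y * dot(w, x))


def objective(w, data, C, loss):
    """f(w) = w^T w / 2 + C * sum_i loss(w; y_i, x_i)."""
    regularization = 0.5 * dot(w, w)
    training_loss = sum(loss(w, y, x) for y, x in data)
    return regularization + C * training_loss


data = [(1, [2.0, 0.0]), (-1, [0.5, 0.0])]   # illustrative (y_i, x_i) pairs
print(objective([1.0, 0.0], data, C=2.0, loss=hinge_loss))   # 0.5 + 2 * 1.5 = 3.5
```

A larger $C$ puts more weight on the training losses relative to the regularization term.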

## Optimization problem for fully-connected networks

## Multi-class Classification I

Our training set includes $(y^i, x^i)$, $i = 1, \dots, l$.

$x^i \in R^{n_1}$ is the feature vector.

$y^i \in R^K$ is the label vector.

As the label is now a vector, we change (label, instance) from $(y_i, x_i)$ to $(y^i, x^i)$

$K$: # of classes

If $x^i$ is in class $k$, then

$$y^i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K$$
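The label vector can be built as in the following sketch (the name `one_hot` and the 1-based class index are assumptions for illustration):

```python
def one_hot(k, K):
    """Label vector in R^K for class k, where k is in 1, ..., K."""
    if not 1 <= k <= K:
        raise ValueError("class index k must satisfy 1 <= k <= K")
    return [1.0 if j == k else 0.0 for j in range(1, K + 1)]


print(one_hot(2, 4))   # [0.0, 1.0, 0.0, 0.0]
```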

## Multi-class Classification II

A neural network maps each feature vector to one of the class labels by the connection of nodes.


## Fully-connected Networks

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

[Figure: a fully-connected network with nodes arranged in layers.]

## Operations Between Two Layers I

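The weight matrix $W^m$ at the $m$th layer is

$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m}$$

$n_m$: # input features at layer $m$

$n_{m+1}$: # output features at layer $m$, or # input features at layer $m + 1$

$L$: number of layers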

## Operations Between Two Layers II

$n_1$ = # of features, $n_{L+1}$ = # of classes

Let $z^m$ be the input of the $m$th layer, $z^1 = x$ and $z^{L+1}$ be the output

From the $m$th layer to the $(m + 1)$th layer

$$s^m = W^m z^m, \qquad z^{m+1}_j = \sigma(s^m_j), \ j = 1, \dots, n_{m+1},$$

where $\sigma(\cdot)$ is the activation function.

## Operations Between Two Layers III

Usually people add a bias term

$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1},$$

so that

$$s^m = W^m z^m + b^m$$
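The operations between two layers can be sketched in plain Python as follows. The function names, the nested-list layout of $W^m$ and the use of ReLU as $\sigma$ in the example are assumptions for illustration.

```python
def layer_forward(W, b, z, sigma):
    """One layer: s = W z + b, then z_next[j] = sigma(s[j])."""
    s = [sum(w_jk * z_k for w_jk, z_k in zip(row, z)) + b_j
         for row, b_j in zip(W, b)]
    return [sigma(s_j) for s_j in s]


def network_forward(weights, biases, x, sigma):
    """Apply layers 1, ..., L starting from z^1 = x and return z^{L+1}."""
    z = x
    for W, b in zip(weights, biases):
        z = layer_forward(W, b, z, sigma)
    return z


def relu(t):
    return max(t, 0.0)


# Illustrative network with n_1 = 2, n_2 = 2, n_3 = 1.
weights = [[[1.0, -1.0], [0.0, 2.0]], [[1.0, 1.0]]]
biases = [[0.0, 1.0], [-1.0]]
print(network_forward(weights, biases, [1.0, 2.0], relu))   # [4.0]
```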

## Operations Between Two Layers IV

Activation function is usually an $R \to R$ transformation. As we are interested in optimization, let's not worry about why it's needed

We collect all variables:

$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ b^1 \\ \vdots \\ \text{vec}(W^L) \\ b^L \end{bmatrix} \in R^n$$

## Operations Between Two Layers V

$n$: total # variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$

The $\text{vec}(\cdot)$ operator stacks columns of a matrix to a vector
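As a quick check of the count, a short sketch (the name `num_variables` and the list `sizes` holding $n_1, \dots, n_{L+1}$ are assumptions for illustration):

```python
def num_variables(sizes):
    """sizes = [n_1, ..., n_{L+1}]; returns (n_1 + 1) n_2 + ... + (n_L + 1) n_{L+1}."""
    return sum((sizes[m] + 1) * sizes[m + 1] for m in range(len(sizes) - 1))


# Two layers with n_1 = 2, n_2 = 3, n_3 = 1: (2 + 1) * 3 + (3 + 1) * 1 = 13.
print(num_variables([2, 3, 1]))   # 13
```

Each term $(n_m + 1) n_{m+1}$ counts the $n_{m+1} \times n_m$ entries of $W^m$ plus the $n_{m+1}$ entries of $b^m$.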

## Optimization Problem I

We solve the following optimization problem,

$$\min_\theta f(\theta),$$

where

$$f(\theta) = \frac{1}{2} \theta^T \theta + C \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); y^i, x^i).$$

$C$: regularization parameter

$z^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $x$.

$\xi(z^{L+1}; y, x)$: loss function. Example:

$$\xi(z^{L+1}; y, x) = \|z^{L+1} - y\|^2$$

## Optimization Problem II

The formulation is the same as linear classification

However, the loss function is more complicated

Further, it's non-convex

Note that in the earlier discussion we considered a single instance

In the training process we actually have, for $i = 1, \dots, l$,

$$s^{m,i} = W^m z^{m,i}, \qquad z^{m+1,i}_j = \sigma(s^{m,i}_j), \ j = 1, \dots, n_{m+1}.$$

This makes the training more complicated

## Optimization problem for convolutional neural networks (CNN)

## Why CNN? I

There are many types of neural networks

They are suitable for different types of problems

While deep learning is hot, it's not always better than other learning methods

For example, fully-connected networks were evaluated for general classification data (e.g., data from the UCI machine learning repository)

They are not consistently better than random forests or SVM; see the comparisons (Meyer et al., 2003; Fernández-Delgado et al., 2014; Wang et al., 2018).

## Why CNN? II

We are interested in CNN because it's shown to be significantly better than others on image data

That's one of the main reasons deep learning has become popular

To study optimization algorithms, of course we want to consider an “established” network

That's why CNN was chosen for our discussion

However, the problem is that operations in CNN are more complicated than those in fully-connected networks

Most books/papers only give explanations without detailed mathematical forms

## Why CNN? III

To study the optimization, we need some clean formulations

So let’s give it a try here


## Convolutional Neural Networks I

Consider a $K$-class classification problem with training data

$$(y^i, Z^{1,i}), \ i = 1, \dots, l.$$

$y^i$: label vector, $Z^{1,i}$: input image

If $Z^{1,i}$ is in class $k$, then

$$y^i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K.$$

CNN maps each image $Z^{1,i}$ to $y^i$

## Convolutional Neural Networks II

Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.

Input and output of a convolutional layer are assumed to be images.


## Convolutional Layers I

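For the current layer, let the input be an image $Z^{\text{in}}$: $a^{\text{in}} \times b^{\text{in}} \times d^{\text{in}}$.

$a^{\text{in}}$: height, $b^{\text{in}}$: width, and $d^{\text{in}}$: # channels.

[Figure: an input image with height $a^{\text{in}}$, width $b^{\text{in}}$ and $d^{\text{in}}$ channels.]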

## Convolutional Layers II

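The goal is to generate an output image $Z^{\text{out},i}$ of $d^{\text{out}}$ channels of $a^{\text{out}} \times b^{\text{out}}$ images.

Consider $d^{\text{out}}$ filters.

Filter $j \in \{1, \dots, d^{\text{out}}\}$ has dimensions $h \times h \times d^{\text{in}}$.

$$\begin{bmatrix} w^j_{1,1,1} & \cdots & w^j_{1,h,1} \\ \vdots & \ddots & \vdots \\ w^j_{h,1,1} & \cdots & w^j_{h,h,1} \end{bmatrix} \ \cdots \ \begin{bmatrix} w^j_{1,1,d^{\text{in}}} & \cdots & w^j_{1,h,d^{\text{in}}} \\ \vdots & \ddots & \vdots \\ w^j_{h,1,d^{\text{in}}} & \cdots & w^j_{h,h,d^{\text{in}}} \end{bmatrix}.$$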

## Convolutional Layers III

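$h$: filter height/width (layer index omitted)

[Figure: a $3 \times 3$ input channel with entries indexed $(1,1,1)$ to $(3,3,1)$, mapped to the $2 \times 2$ outputs $s^{\text{out},i}_{1,1,j}$, $s^{\text{out},i}_{1,2,j}$, $s^{\text{out},i}_{2,1,j}$, $s^{\text{out},i}_{2,2,j}$.]

To compute the $j$th channel of output, we scan the input from top-left to bottom-right to obtain the sub-images of size $h \times h \times d^{\text{in}}$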

## Convolutional Layers IV

We then calculate the inner product between each sub-image and the $j$th filter

For example, if we start from the upper left corner of the input image, the first sub-image of channel $d$ is

$$\begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & \ddots & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}.$$

## Convolutional Layers V

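We then calculate

$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\ \vdots & \ddots & \vdots \\ z^i_{h,1,d} & \cdots & z^i_{h,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & \ddots & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j, \tag{3}$$

where $\langle \cdot, \cdot \rangle$ means the sum of component-wise products between two matrices.

This value becomes the $(1, 1)$ position of the channel $j$ of the output image.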

## Convolutional Layers VI

Next, we use other sub-images to produce values in other positions of the output image.

Let the stride $s$ be the number of pixels we move vertically or horizontally to get the next sub-image.

For the $(2, 1)$ position of the output image, we move down $s$ pixels vertically to obtain the following sub-image:

$$\begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & \ddots & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}.$$

## Convolutional Layers VII

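The $(2, 1)$ position of the channel $j$ of the output image is

$$\sum_{d=1}^{d^{\text{in}}} \left\langle \begin{bmatrix} z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\ \vdots & \ddots & \vdots \\ z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d} \end{bmatrix}, \begin{bmatrix} w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\ \vdots & \ddots & \vdots \\ w^j_{h,1,d} & \cdots & w^j_{h,h,d} \end{bmatrix} \right\rangle + b_j. \tag{4}$$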
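The values in (3) and (4) can be computed directly, as in this sketch. The function name `conv_value`, the nested-list layout `img[d][row][col]` and the 1-based output position `(p, q)` are assumptions for illustration.

```python
def conv_value(img, filt, bias, p, q, s):
    """Value at output position (p, q), 1-based, for one filter.

    img[d][r][c]:  input with d_in channels.
    filt[d][r][c]: one h x h x d_in filter.
    The sub-image starts (p - 1) * s pixels down and (q - 1) * s pixels right.
    """
    h = len(filt[0])
    top, left = (p - 1) * s, (q - 1) * s
    total = bias
    for d in range(len(img)):
        for r in range(h):
            for c in range(h):
                total += img[d][top + r][left + c] * filt[d][r][c]
    return total


img = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3 x 3 channel
filt = [[[1, 1], [1, 1]]]                   # one 2 x 2 filter of ones
print(conv_value(img, filt, 0.5, 1, 1, 1))  # 1 + 2 + 4 + 5 + 0.5 = 12.5
print(conv_value(img, filt, 0.5, 2, 1, 1))  # 4 + 5 + 7 + 8 + 0.5 = 24.5
```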

## Convolutional Layers VIII

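The output image sizes $a^{\text{out}}$ and $b^{\text{out}}$ are respectively the numbers of positions to which we can move the filter vertically and horizontally

$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}} - h}{s} \right\rfloor + 1, \qquad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}} - h}{s} \right\rfloor + 1 \tag{5}$$

Rationale of (5): vertically, the last row of each sub-image is

$$h, \ h + s, \ \dots, \ h + \Delta s \le a^{\text{in}}$$

Thus

$$\Delta = \left\lfloor \frac{a^{\text{in}} - h}{s} \right\rfloor$$

and there are $\Delta + 1$ vertical positions.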
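A small sketch of (5), together with a brute-force count of the last-row positions used in the rationale. The function names are assumptions for illustration, and the code assumes $a^{\text{in}} \ge h$, $b^{\text{in}} \ge h$ and $s \ge 1$.

```python
def output_size(a_in, b_in, h, s):
    """Equation (5): floor((a_in - h) / s) + 1 and floor((b_in - h) / s) + 1."""
    return (a_in - h) // s + 1, (b_in - h) // s + 1


def count_positions(length, h, s):
    """Count the last-row indices h, h + s, h + 2s, ... that do not exceed length."""
    count, last = 0, h
    while last <= length:
        count += 1
        last += s
    return count


print(output_size(7, 5, 3, 2))                                # (3, 2)
print(count_positions(7, 3, 2), count_positions(5, 3, 2))     # 3 2
```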

## Matrix Operations I

For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations

We will go back to this issue later


## Matrix Operations II

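Let's collect images of all channels as the input

$$Z^{\text{in},i} = \begin{bmatrix} z^i_{1,1,1} & z^i_{2,1,1} & \cdots & z^i_{a^{\text{in}},b^{\text{in}},1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{1,1,d^{\text{in}}} & z^i_{2,1,d^{\text{in}}} & \cdots & z^i_{a^{\text{in}},b^{\text{in}},d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}}.$$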

## Matrix Operations III

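Let all filters

$$W = \begin{bmatrix} w^1_{1,1,1} & w^1_{2,1,1} & \cdots & w^1_{h,h,d^{\text{in}}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{d^{\text{out}}}_{1,1,1} & w^{d^{\text{out}}}_{2,1,1} & \cdots & w^{d^{\text{out}}}_{h,h,d^{\text{in}}} \end{bmatrix} \in R^{d^{\text{out}} \times h h d^{\text{in}}}$$

be variables (parameters) of the current layer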

## Matrix Operations IV

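Usually a bias term is considered

$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_{d^{\text{out}}} \end{bmatrix} \in R^{d^{\text{out}} \times 1}$$

Operations at a layer

$$S^{\text{out},i} = W \phi(Z^{\text{in},i}) + b \mathbb{1}^T_{a^{\text{out}} b^{\text{out}}} \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \tag{6}$$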

## Matrix Operations V

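where

$$\mathbb{1}_{a^{\text{out}} b^{\text{out}}} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \in R^{a^{\text{out}} b^{\text{out}} \times 1}.$$

$\phi(Z^{\text{in},i})$ collects all sub-images in $Z^{\text{in},i}$ into a matrix.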

## Matrix Operations VI

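Specifically,

$$\phi(Z^{\text{in},i}) = \begin{bmatrix} z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(a^{\text{out}}-1)s,\,1+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,1} \\ \vdots & \vdots & & \vdots \\ z^i_{h,h,d^{\text{in}}} & z^i_{h+s,h,d^{\text{in}}} & \cdots & z^i_{h+(a^{\text{out}}-1)s,\,h+(b^{\text{out}}-1)s,\,d^{\text{in}}} \end{bmatrix} \in R^{h h d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$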
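The construction of $\phi(Z^{\text{in},i})$ and the product in (6) can be sketched as follows. The names `phi` and `conv_layer`, and the nested-list input layout `img[d][row][col]`, are assumptions for illustration; within each column the row index varies fastest, then the column index, then the channel, matching the ordering shown above.

```python
def phi(img, h, s):
    """Collect all h x h x d_in sub-images of img[d][row][col] as columns.

    Returns a list of rows of size (h * h * d_in) x (a_out * b_out).
    Columns are ordered by moving the filter down first, then right.
    """
    d_in, a_in, b_in = len(img), len(img[0]), len(img[0][0])
    a_out, b_out = (a_in - h) // s + 1, (b_in - h) // s + 1
    columns = []
    for Q in range(b_out):
        for P in range(a_out):
            columns.append([img[d][P * s + r][Q * s + c]
                            for d in range(d_in)
                            for c in range(h)
                            for r in range(h)])
    return [list(row) for row in zip(*columns)]   # transpose columns into rows


def conv_layer(W, b, img, h, s):
    """S = W phi(Z) + b 1^T, with W of size d_out x (h * h * d_in)."""
    Phi = phi(img, h, s)
    return [[sum(W[j][k] * Phi[k][t] for k in range(len(Phi))) + b[j]
             for t in range(len(Phi[0]))]
            for j in range(len(W))]


img = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]          # one 3 x 3 channel
print(phi(img, 2, 1))
print(conv_layer([[1, 1, 1, 1]], [0.5], img, 2, 1))   # [[12.5, 24.5, 16.5, 28.5]]
```

Each column of the result holds one sub-image, so a single matrix-matrix product evaluates every filter on every sub-image.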

## Activation Function I

Next, an activation function scales each element of $S^{\text{out},i}$ to obtain the output matrix $Z^{\text{out},i}$.

$$Z^{\text{out},i} = \sigma(S^{\text{out},i}) \in R^{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \tag{7}$$

For CNN, commonly the following RELU activation function

$$\sigma(x) = \max(x, 0) \tag{8}$$

is used

Later we need that $\sigma(x)$ is differentiable, but the RELU function is not.

## Activation Function II

Past works such as Krizhevsky et al. (2012) assume

$$\sigma'(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases}$$
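A minimal sketch of the RELU function (8) and the derivative convention above (the function names are assumptions for illustration):

```python
def relu(x):
    """Equation (8): sigma(x) = max(x, 0)."""
    return max(x, 0.0)


def relu_derivative(x):
    """Convention sigma'(x) = 1 if x > 0 and 0 otherwise (so sigma'(0) = 0)."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_derivative(x))
```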

## $\phi(Z^{\text{in},i})$ I

In the matrix-matrix product $W \phi(Z^{\text{in},i})$, each element is the inner product between a filter and a sub-image

We need to represent $\phi(Z^{\text{in},i})$ in an explicit form.

This is important for subsequent calculation

Clearly $\phi$ is a linear mapping, so there exists a 0/1 matrix $P_\phi$ such that

$$\phi(Z^{\text{in},i}) \equiv \text{mat}\left(P_\phi \text{vec}(Z^{\text{in},i})\right)_{h h d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}, \ \forall i, \tag{9}$$

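## $\phi(Z^{\text{in},i})$ II

$\text{vec}(M)$: all of $M$'s columns concatenated to a vector $v$

$$\text{vec}(M) = \begin{bmatrix} M_{:,1} \\ \vdots \\ M_{:,b} \end{bmatrix} \in R^{ab \times 1}, \text{ where } M \in R^{a \times b}$$

$\text{mat}(v)$ is the inverse of $\text{vec}(M)$

$$\text{mat}(v)_{a \times b} = \begin{bmatrix} v_1 & \cdots & v_{(b-1)a+1} \\ \vdots & \ddots & \vdots \\ v_a & \cdots & v_{ba} \end{bmatrix} \in R^{a \times b}, \tag{10}$$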

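## $\phi(Z^{\text{in},i})$ III

where $v \in R^{ab \times 1}$.

$P_\phi$ is a huge matrix:

$$P_\phi \in R^{h h d^{\text{in}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}$$

and

$$\phi: R^{d^{\text{in}} \times a^{\text{in}} b^{\text{in}}} \to R^{h h d^{\text{in}} \times a^{\text{out}} b^{\text{out}}}$$

Later we will check implementation details

Past works using the form (9) include, for example, Vedaldi and Lenc (2015)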
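The vec and mat operators can be sketched on nested lists as follows (the function names mirror the operators; the list-of-rows layout is an assumption for illustration):

```python
def vec(M):
    """Stack the columns of an a x b matrix (list of rows) into one list."""
    a, b = len(M), len(M[0])
    return [M[r][c] for c in range(b) for r in range(a)]


def mat(v, a, b):
    """Inverse of vec: reshape a vector of length a * b into an a x b matrix."""
    if len(v) != a * b:
        raise ValueError("len(v) must equal a * b")
    return [[v[c * a + r] for c in range(b)] for r in range(a)]


M = [[1, 2, 3], [4, 5, 6]]
print(vec(M))                 # [1, 4, 2, 5, 3, 6]
print(mat(vec(M), 2, 3))      # [[1, 2, 3], [4, 5, 6]]
```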

## Optimization Problem I

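We collect all weights to a vector variable $\theta$.

$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ b^1 \\ \vdots \\ \text{vec}(W^L) \\ b^L \end{bmatrix} \in R^n, \quad n: \text{total # variables}$$

The output of the last layer $L$ is a vector $z^{L+1,i}(\theta)$.

Consider any loss function such as the squared loss

$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - y^i\|^2.$$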

## Optimization Problem II

The optimization problem is

$$\min_\theta f(\theta),$$

where

$$f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); y^i, Z^{1,i})$$

$C$: regularization parameter.

The formulation is almost the same as that for fully connected networks
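A sketch of evaluating this objective with the squared loss, assuming the last-layer outputs $z^{L+1,i}(\theta)$ have already been computed by a forward pass. The function name `objective` and the argument layout are assumptions for illustration.

```python
def objective(theta, outputs, labels, C):
    """f(theta) = theta^T theta / (2C) + (1/l) * sum_i ||z^{L+1,i} - y^i||^2.

    outputs[i] is the last-layer output for instance i (computed elsewhere);
    labels[i] is the corresponding label vector y^i.
    """
    regularization = sum(t * t for t in theta) / (2.0 * C)
    losses = [sum((z - y) ** 2 for z, y in zip(z_i, y_i))
              for z_i, y_i in zip(outputs, labels)]
    return regularization + sum(losses) / len(losses)


theta = [1.0, 1.0]
outputs = [[1.0, 0.0], [0.0, 0.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
print(objective(theta, outputs, labels, C=2.0))   # 2 / 4 + (0 + 1) / 2 = 1.0
```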

## Optimization Problem III

Note that we divide the sum of training losses by the number of training data

Thus the second term becomes the average training loss

With the optimization problem, there is still a long way to go before a real implementation

Further, CNN involves additional operations in practice

We will explain them

## Zero-padding

To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border.

This technique is called zero-padding in CNN training.

An illustration:

[Figure: an input image of size $a^{\text{in}} \times b^{\text{in}}$ surrounded by a border of zeros of width $p$ on every side.]

The size of the new image is changed from $a^{\text{in}} \times b^{\text{in}}$ to $(a^{\text{in}} + 2p) \times (b^{\text{in}} + 2p)$, where $p$ is specified by users

The operation can be treated as a layer of mapping an input $Z^{\text{in},i}$ to an output $Z^{\text{out},i}$.

Let

$$d^{\text{out}} = d^{\text{in}}.$$

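There exists a 0/1 matrix

$$P_{\text{pad}} \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}$$

so that the padding operation can be represented by

$$Z^{\text{out},i} \equiv \text{mat}\left(P_{\text{pad}} \text{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \tag{11}$$

Implementation details will be discussed later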
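Zero-padding of a single channel can be sketched directly, without forming the 0/1 matrix in (11). The function name `zero_pad` and the list-of-rows layout are assumptions for illustration.

```python
def zero_pad(channel, p):
    """Surround an a_in x b_in channel (list of rows) with p rows/columns of zeros."""
    b_in = len(channel[0])
    blank = [0.0] * (b_in + 2 * p)
    padded = [list(blank) for _ in range(p)]
    padded += [[0.0] * p + list(row) + [0.0] * p for row in channel]
    padded += [list(blank) for _ in range(p)]
    return padded


for row in zero_pad([[1.0, 2.0], [3.0, 4.0]], 1):
    print(row)
```

The result has size $(a^{\text{in}} + 2p) \times (b^{\text{in}} + 2p)$; applying the function to every channel keeps $d^{\text{out}} = d^{\text{in}}$.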

## Pooling I

To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations.

Usually we consider an operation that can

(approximately) extract rotational or translational invariance features.

Examples: average pooling, max pooling, and stochastic pooling.

Let's consider max pooling as an illustration

## Pooling II

An example:

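image A

$$\begin{bmatrix} 2 & 3 & 6 & 8 \\ 5 & 4 & 9 & 7 \\ 1 & 2 & 6 & 0 \\ 4 & 3 & 2 & 1 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$

image B

$$\begin{bmatrix} 3 & 2 & 3 & 6 \\ 4 & 5 & 4 & 9 \\ 2 & 1 & 2 & 6 \\ 3 & 4 & 3 & 2 \end{bmatrix} \to \begin{bmatrix} 5 & 9 \\ 4 & 6 \end{bmatrix}$$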

## Pooling III

B is derived by shifting A by 1 pixel in the horizontal direction.

We split each of the two images into four 2 × 2 sub-images and choose the max value from every sub-image.

In each sub-image because only some elements are changed, the maximal value is likely the same or similar.

This is called translational invariance

For our example the two output images from A and B are the same.

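The example with images A and B can be reproduced with a short sketch of max pooling over non-overlapping $h \times h$ sub-regions (the function name `max_pool` and the list-of-rows layout are assumptions for illustration):

```python
def max_pool(channel, h):
    """Max over non-overlapping h x h sub-regions of one channel (list of rows)."""
    a_out, b_out = len(channel) // h, len(channel[0]) // h
    return [[max(channel[P * h + r][Q * h + c]
                 for r in range(h) for c in range(h))
             for Q in range(b_out)]
            for P in range(a_out)]


A = [[2, 3, 6, 8], [5, 4, 9, 7], [1, 2, 6, 0], [4, 3, 2, 1]]
B = [[3, 2, 3, 6], [4, 5, 4, 9], [2, 1, 2, 6], [3, 4, 3, 2]]
print(max_pool(A, 2))   # [[5, 9], [4, 6]]
print(max_pool(B, 2))   # [[5, 9], [4, 6]]
```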

## Pooling IV

For mathematical representation, we consider the operation as a layer of mapping an input Zin,i to an output Zout,i.

In practice pooling is considered as an operation at the end of the convolutional layer.

We partition every channel of $Z^{\text{in},i}$ into non-overlapping sub-regions by $h \times h$ filters with the stride $s = h$

Because of the disjoint sub-regions, the stride $s$ for sliding the filters is equal to $h$.

## Pooling V

This partition step is a special case of how we generate sub-images in convolutional operations.

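By the same definition as (9) we can generate the matrix

$$\phi(Z^{\text{in},i}) = \text{mat}\left(P_\phi \text{vec}(Z^{\text{in},i})\right)_{h h \times d^{\text{out}} a^{\text{out}} b^{\text{out}}}, \tag{12}$$

where

$$a^{\text{out}} = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor, \quad b^{\text{out}} = \left\lfloor \frac{b^{\text{in}}}{h} \right\rfloor, \quad d^{\text{out}} = d^{\text{in}}. \tag{13}$$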

## Pooling VI

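This is the same as the calculation in (5) because

$$\left\lfloor \frac{a^{\text{in}} - h}{h} \right\rfloor + 1 = \left\lfloor \frac{a^{\text{in}}}{h} \right\rfloor$$

Note that here we consider $h h \times d^{\text{out}} a^{\text{out}} b^{\text{out}}$ rather than $h h d^{\text{out}} \times a^{\text{out}} b^{\text{out}}$ because we can then do a max operation on each column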

## Pooling VII

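To select the largest element of each sub-region, there exists a 0/1 matrix

$$M^i \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times h h d^{\text{out}} a^{\text{out}} b^{\text{out}}}$$

so that each row of $M^i$ selects a single element from $\text{vec}(\phi(Z^{\text{in},i}))$.

Therefore,

$$Z^{\text{out},i} = \text{mat}\left(M^i \text{vec}(\phi(Z^{\text{in},i}))\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}. \tag{14}$$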

## Pooling VIII

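A comparison with (6) shows that $M^i$ plays a role similar to that of the weight matrix $W$

While $M^i$ is 0/1, it is not a constant. The positions of its 1's depend on the values of $\phi(Z^{\text{in},i})$

By combining (12) and (14), we have

$$Z^{\text{out},i} = \text{mat}\left(P^i_{\text{pool}} \text{vec}(Z^{\text{in},i})\right)_{d^{\text{out}} \times a^{\text{out}} b^{\text{out}}}, \tag{15}$$

where

$$P^i_{\text{pool}} = M^i P_\phi \in R^{d^{\text{out}} a^{\text{out}} b^{\text{out}} \times d^{\text{in}} a^{\text{in}} b^{\text{in}}}. \tag{16}$$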

## Summary of a Convolutional Layer I

For implementation, padding and pooling are (optional) part of the convolutional layers.

We discuss details of considering all operations together.

The whole convolutional layer involves the following procedure:

$$Z^{m,i} \to \text{padding by (11)} \to \text{convolutional operations by (6), (7)} \to \text{pooling by (15)} \to Z^{m+1,i}, \tag{17}$$

## Summary of a Convolutional Layer II

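where $Z^{m,i}$ and $Z^{m+1,i}$ are input and output of the $m$th layer, respectively.

Let the following symbols denote image sizes at different stages of the convolutional layer.

$a^m, b^m$: size at the beginning of the layer.

$a^m_{\text{pad}}, b^m_{\text{pad}}$: size after padding.

$a^m_{\text{conv}}, b^m_{\text{conv}}$: size after convolution.

The following tables indicate how these values serve as $a^{\text{in}}, b^{\text{in}}, d^{\text{in}}$ and $a^{\text{out}}, b^{\text{out}}, d^{\text{out}}$ at different stages.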

## Summary of a Convolutional Layer III

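| Operation | Input | Output |
|---|---|---|
| Convolution: (7) | $S^{m,i}$ | $\sigma(S^{m,i})$ |
| Pooling: (15) | $\sigma(S^{m,i})$ | $Z^{m+1,i}$ |

| Operation | $a^{\text{in}}, b^{\text{in}}, d^{\text{in}}$ | $a^{\text{out}}, b^{\text{out}}, d^{\text{out}}$ |
|---|---|---|
| Padding: (11) | $a^m, b^m, d^m$ | $a^m_{\text{pad}}, b^m_{\text{pad}}, d^m$ |
| Convolution: (6) | $a^m_{\text{pad}}, b^m_{\text{pad}}, d^m$ | $a^m_{\text{conv}}, b^m_{\text{conv}}, d^{m+1}$ |
| Convolution: (7) | $a^m_{\text{conv}}, b^m_{\text{conv}}, d^{m+1}$ | $a^m_{\text{conv}}, b^m_{\text{conv}}, d^{m+1}$ |
| Pooling: (15) | $a^m_{\text{conv}}, b^m_{\text{conv}}, d^{m+1}$ | $a^{m+1}, b^{m+1}, d^{m+1}$ |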

## Summary of a Convolutional Layer IV

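Let the filter size, mapping matrices and weight matrices at the $m$th layer be

$$h^m, \ P^m_{\text{pad}}, \ P^m_\phi, \ P^{m,i}_{\text{pool}}, \ W^m, \ b^m.$$

From (11), (6), (7), (15), all operations can be summarized as

$$Z^{m+1,i} = \text{mat}\left(P^{m,i}_{\text{pool}} \text{vec}(\sigma(S^{m,i}))\right)_{d^{m+1} \times a^{m+1} b^{m+1}}, \tag{18}$$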

## Fully-Connected Layer I

Assume $L^c$ is the number of convolutional layers

Input vector of the first fully-connected layer:

$$z^{m,i} = \text{vec}(Z^{m,i}), \ i = 1, \dots, l, \ m = L^c + 1.$$

In each of the fully-connected layers ($L^c < m \le L$), we consider weight matrix and bias vector between layers $m$ and $m + 1$.

## Fully-Connected Layer II

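Weight matrix:

$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m} \tag{19}$$

Bias vector

$$b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix}_{n_{m+1} \times 1}$$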

## Fully-Connected Layer III

Here $n_m$ and $n_{m+1}$ are the numbers of nodes in layers $m$ and $m + 1$, respectively.

If $z^{m,i} \in R^{n_m}$ is the input vector, the following operations are applied to generate the output vector $z^{m+1,i} \in R^{n_{m+1}}$.

$$s^{m,i} = W^m z^{m,i} + b^m, \tag{20}$$

$$z^{m+1,i}_j = \sigma(s^{m,i}_j), \ j = 1, \dots, n_{m+1}. \tag{21}$$

## Discussion

## Challenges in NN Optimization

The objective function is non-convex. It may have many local minima

It’s known that global optimization is much more difficult than local minimization

The problem structure is very complicated

In this course we will gain first-hand experience in handling these difficulties

## Formulation I

We have written all CNN operations in matrix/vector forms

This is useful in deriving the gradient

Are our representation symbols good enough? Can we do better?

You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful


## References I

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.
