Networks
Chih-Jen Lin
National Taiwan University
Last updated: May 25, 2020
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Minimizing Training Errors
Basically, a classification method starts by minimizing the training errors
min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a neural network, or other types
Minimizing Training Errors (Cont’d)
For simplicity, let’s consider the model to be a vector w
That is, the decision function is sgn(wTx)
For any data x, the predicted label is
1 if wTx ≥ 0, −1 otherwise
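As a tiny added sketch (not from the original slides), the decision function sgn(wTx) can be written in NumPy as follows; the vectors are made-up examples.

import numpy as np

def predict(w, x):
    # decision function sgn(w^T x): return 1 if w^T x >= 0, and -1 otherwise
    return 1 if w @ x >= 0 else -1

print(predict(np.array([1.0, -2.0]), np.array([3.0, 1.0])))  # 1*3 + (-2)*1 = 1 >= 0, so prints 1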
Minimizing Training Errors (Cont’d)
The two-dimensional situation
[Figure: circles and triangles denote two classes of points, separated by the line wTx = 0]
This seems to be quite restricted, but practically x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; y, x) for each instance (y, x), where y = ±1 is the label and x is the feature vector.
Ideally we should use the 0–1 training loss:
ξ(w; y, x) = 1 if ywTx < 0, 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss ξ(w; y, x) as a function of −ywTx]
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss)
ξL1(w; y, x) ≡ max(0, 1 − ywTx)   (1)
Logistic loss
ξLR(w; y, x) ≡ log(1 + e^(−ywTx))   (2)
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)
SVM and LR are two very fundamental classification methods
Common Loss Functions (Cont’d)
[Figure: the hinge loss ξL1 and the logistic loss ξLR as functions of −ywTx]
Logistic regression is closely related to SVM. Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy, but this is useless
When training a data set, we should
avoid underfitting: small training error
avoid overfitting: small testing error
[Figure: overfitting illustration; two symbol types mark training data and triangles mark testing data]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w’s values less extreme.
One idea is to make w's values closer to zero
We can add, for example,
wTw/2 or ‖w‖1
to the function that is minimized
General Form of Linear Classification
Training data {yi, xi}, xi ∈ R^n, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features
min_w f(w),   f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; yi, xi)
wTw/2: regularization term
ξ(w; y, x): loss function
C : regularization parameter (chosen by users)
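To make the formulation concrete, here is a small added NumPy sketch (not from the slides) of f(w) with the hinge and logistic losses; X, y, and C are placeholder data and a placeholder parameter.

import numpy as np

def hinge_loss(w, X, y):
    # xi_L1(w; y_i, x_i) = max(0, 1 - y_i * w^T x_i) for every instance
    return np.maximum(0.0, 1.0 - y * (X @ w))

def logistic_loss(w, X, y):
    # xi_LR(w; y_i, x_i) = log(1 + exp(-y_i * w^T x_i)) for every instance
    return np.log1p(np.exp(-y * (X @ w)))

def f(w, X, y, C, loss=hinge_loss):
    # f(w) = w^T w / 2 + C * sum_i xi(w; y_i, x_i)
    return 0.5 * w @ w + C * loss(w, X, y).sum()

# toy usage with random data: l = 20 instances, n = 5 features
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = np.where(rng.standard_normal(20) >= 0, 1.0, -1.0)  # labels in {+1, -1}
print(f(np.zeros(5), X, y, C=1.0))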
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Multi-class Classification I
Our training set includes (yi,xi), i = 1, . . . , l.
xi ∈ R^{n_1} is the feature vector.
yi ∈ R^K is the label vector
As the label is now a vector rather than a scalar ±1, each (label, instance) pair is written as (yi, xi) with yi ∈ R^K
K: # of classes
If xi is in class k, then
yi = [0, . . . , 0, 1, 0, . . . , 0]^T ∈ R^K,
where the single 1 is in position k (the first k − 1 entries are 0)
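For concreteness, a short added sketch (not in the slides) of building the label vector yi for an instance in class k:

import numpy as np

def one_hot(k, K):
    # y = [0, ..., 0, 1, 0, ..., 0]^T in R^K with the single 1 in position k (1-based)
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

print(one_hot(3, 5))  # [0. 0. 1. 0. 0.]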
Multi-class Classification II
A neural network maps each feature vector to one of the class labels by the connection of nodes.
Fully-connected Networks
Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).
[Figure: a small fully-connected network; every node in a layer connects to every node in the next layer]
Operations Between Two Layers I
The weight matrix Wm at the mth layer is
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\
w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m}
\end{bmatrix} ∈ R^{n_{m+1} × n_m}
n_m: # input features at layer m
n_{m+1}: # output features at layer m, or # input features at layer m + 1
L: number of layers
Operations Between Two Layers II
n_1 = # of features, n_{L+1} = # of classes
Let zm be the input of the mth layer, z1 = x and zL+1 be the output
From the mth layer to the (m + 1)th layer:
s^m = W^m z^m,
z^{m+1}_j = σ(s^m_j), j = 1, . . . , n_{m+1},
where σ(·) is the activation function.
Operations Between Two Layers III
Usually people add a bias term
b^m =
\begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix} ∈ R^{n_{m+1} × 1},
so that
s^m = W^m z^m + b^m
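As a rough added sketch (not part of the slides), the operations between two fully-connected layers can be written in NumPy as below; the activation σ is taken to be max(·, 0) purely for illustration.

import numpy as np

def fc_forward(W, b, z):
    # s^m = W^m z^m + b^m, then apply the activation sigma elementwise
    s = W @ z + b
    return np.maximum(s, 0.0)  # illustrative choice of sigma

# toy usage: n_m = 4 input features, n_{m+1} = 3 output features
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
z = rng.standard_normal(4)
print(fc_forward(W, b, z))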
Operations Between Two Layers IV
The activation function is usually an R → R transformation. As we are interested in optimization, let's not worry about why it's needed
We collect all variables:
θ =
\begin{bmatrix} vec(W^1) \\ b^1 \\ \vdots \\ vec(W^L) \\ b^L \end{bmatrix} ∈ R^n
Operations Between Two Layers V
n: total # of variables = (n_1 + 1)n_2 + · · · + (n_L + 1)n_{L+1}
The vec(·) operator stacks the columns of a matrix into a vector
Optimization Problem I
We solve the following optimization problem, min_θ f(θ), where
f(θ) = (1/2) θ^T θ + C Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); yi, xi)
C: regularization parameter
z^{L+1}(θ) ∈ R^{n_{L+1}}: last-layer output vector of x
ξ(z^{L+1}; y, x): loss function. Example:
ξ(z^{L+1}; y, x) = ‖z^{L+1} − y‖^2
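Below is a minimal added sketch of the whole objective for a fully-connected network with the squared loss, assuming the same σ(s) = max(s, 0) choice as above; Ws, bs, X, Y are placeholder names, and σ is applied at every layer simply to follow the slides' uniform rule.

import numpy as np

def network_output(Ws, bs, x):
    # forward pass: z^1 = x, s^m = W^m z^m + b^m, z^{m+1}_j = sigma(s^m_j)
    z = x
    for W, b in zip(Ws, bs):
        z = np.maximum(W @ z + b, 0.0)
    return z

def f(Ws, bs, X, Y, C):
    # f(theta) = theta^T theta / 2 + C * sum_i ||z^{L+1,i}(theta) - y_i||^2
    reg = 0.5 * sum((W ** 2).sum() + (b ** 2).sum() for W, b in zip(Ws, bs))
    loss = sum(((network_output(Ws, bs, x) - y) ** 2).sum() for x, y in zip(X, Y))
    return reg + C * loss

# toy usage: a 4-3-2 network on 5 random instances
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(2)]
X = rng.standard_normal((5, 4))
Y = np.eye(2)[rng.integers(0, 2, 5)]  # one-hot label vectors
print(f(Ws, bs, X, Y, C=1.0))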
Optimization Problem II
The formulation is the same as for linear classification
However, the loss function is more complicated
Further, it's non-convex
Note that in the earlier discussion we consider a single instance
In the training process we actually have for i = 1, . . . , l,
s^{m,i} = W^m z^{m,i},
z^{m+1,i}_j = σ(s^{m,i}_j), j = 1, . . . , n_{m+1}
This makes the training more complicated
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Why CNN? I
There are many types of neural networks
They are suitable for different types of problems. While deep learning is hot, it's not always better than other learning methods
For example, fully-connected networks were evaluated on general classification data (e.g., data from the UCI machine learning repository)
They are not consistently better than random forests or SVM; see the comparisons (Meyer et al., 2003; Fernández-Delgado et al., 2014; Wang et al., 2018)
Why CNN? II
We are interested in CNN because it's shown to be significantly better than others on image data. That's one of the main reasons deep learning became popular
To study optimization algorithms, of course we want to consider an “established” network
That's why CNN was chosen for our discussion
However, the problem is that operations in CNN are more complicated than in fully-connected networks
Most books/papers only give explanations without detailed mathematical forms
Why CNN? III
To study the optimization, we need some clean formulations
So let’s give it a try here
Convolutional Neural Networks I
Consider a K -class classification problem with training data
(yi, Z1,i), i = 1, . . . , l.
yi: label vector, Z1,i: input image
If Z1,i is in class k, then
yi = [0, . . . , 0, 1, 0, . . . , 0]^T ∈ R^K,
where the single 1 is in position k
CNN maps each image Z1,i to yi
Convolutional Neural Networks II
Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.
Input and output of a convolutional layer are assumed to be images.
Convolutional Layers I
For the current layer, let the input be an image Zin: ain × bin × din.
ain: height, bin: width, and din: #channels.
[Figure: an input image with height ain, width bin, and din channels]
Convolutional Layers II
The goal is to generate an output image Zout,i of dout channels of aout × bout images.
Consider dout filters.
Filter j ∈ {1, . . . , dout} has dimensions h × h × din.
Filter j contains, for each channel d = 1, . . . , din, an h × h matrix of weights:
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & \ddots & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}, d = 1, . . . , din.
Convolutional Layers III
h: filter height/width (layer index omitted)
[Figure: a 3 × 3 input channel and the corresponding 2 × 2 output channel j with entries s^{out,i}_{1,1,j}, s^{out,i}_{1,2,j}, s^{out,i}_{2,1,j}, s^{out,i}_{2,2,j}]
To compute the j th channel of output, we scan the input from top-left to bottom-right to obtain the sub-images of size h × h × din
Convolutional Layers IV
We then calculate the inner product between each sub-image and the j th filter
For example, if we start from the upper left corner of the input image, the first sub-image of channel d is
\begin{bmatrix}
z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\
\vdots & \ddots & \vdots \\
z^i_{h,1,d} & \cdots & z^i_{h,h,d}
\end{bmatrix}.
Convolutional Layers V
We then calculate
\sum_{d=1}^{din} \Big\langle
\begin{bmatrix}
z^i_{1,1,d} & \cdots & z^i_{1,h,d} \\
\vdots & & \vdots \\
z^i_{h,1,d} & \cdots & z^i_{h,h,d}
\end{bmatrix},
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}
\Big\rangle + b_j,   (3)
where ⟨·, ·⟩ means the sum of component-wise products between two matrices.
This value becomes the (1, 1) position of the channel j of the output image.
Convolutional Layers VI
Next, we use other sub-images to produce values in other positions of the output image.
Let the stride s be the number of pixels moved vertically or horizontally to get the next sub-image.
For the (2, 1) position of the output image, we move down s pixels vertically to obtain the following sub-image:
\begin{bmatrix}
z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\
\vdots & \ddots & \vdots \\
z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d}
\end{bmatrix}.
Convolutional Layers VII
The (2, 1) position of the channel j of the output image is
\sum_{d=1}^{din} \Big\langle
\begin{bmatrix}
z^i_{1+s,1,d} & \cdots & z^i_{1+s,h,d} \\
\vdots & & \vdots \\
z^i_{h+s,1,d} & \cdots & z^i_{h+s,h,d}
\end{bmatrix},
\begin{bmatrix}
w^j_{1,1,d} & \cdots & w^j_{1,h,d} \\
\vdots & & \vdots \\
w^j_{h,1,d} & \cdots & w^j_{h,h,d}
\end{bmatrix}
\Big\rangle + b_j.   (4)
Convolutional Layers VIII
The output image sizes aout and bout are respectively the numbers of positions the filter can take vertically and horizontally:
aout = ⌊(ain − h)/s⌋ + 1,  bout = ⌊(bin − h)/s⌋ + 1   (5)
Rationale of (5): vertically, the last row of each sub-image is
h, h + s, . . . , h + ∆s ≤ ain
Convolutional Layers IX
Thus
∆ = ⌊(ain − h)/s⌋
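Putting (3)-(5) together, here is a naive added NumPy sketch of the convolutional operation; indices are 0-based, and the (channel, row, column) array layout is my own choice for illustration, not the vec-based layout used later in the slides.

import numpy as np

def conv_forward(Z, W, b, s):
    # Z: (d_in, a_in, b_in) input image, W: (d_out, d_in, h, h) filters, b: (d_out,) biases
    d_in, a_in, b_in = Z.shape
    d_out, _, h, _ = W.shape
    a_out = (a_in - h) // s + 1  # eq. (5)
    b_out = (b_in - h) // s + 1
    S = np.empty((d_out, a_out, b_out))
    for j in range(d_out):
        for p in range(a_out):
            for q in range(b_out):
                sub = Z[:, p*s:p*s+h, q*s:q*s+h]        # h x h x d_in sub-image
                S[j, p, q] = np.sum(sub * W[j]) + b[j]  # inner product plus bias, as in (3)/(4)
    return S

# toy usage: one 3x3 single-channel image, one 2x2 all-ones filter, stride 1
Z = np.arange(9.0).reshape(1, 3, 3)
W = np.ones((1, 1, 2, 2))
print(conv_forward(Z, W, np.zeros(1), s=1))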
Matrix Operations I
For efficient implementations, we should conduct convolutional operations by matrix-matrix and matrix-vector operations
We will go back to this issue later
Matrix Operations II
Let's collect images of all channels as the input
Zin,i =
\begin{bmatrix}
z^i_{1,1,1} & z^i_{2,1,1} & \cdots & z^i_{ain,bin,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^i_{1,1,din} & z^i_{2,1,din} & \cdots & z^i_{ain,bin,din}
\end{bmatrix} ∈ R^{din × ain bin}.
Matrix Operations III
Let all filters
W =
\begin{bmatrix}
w^1_{1,1,1} & w^1_{2,1,1} & \cdots & w^1_{h,h,din} \\
\vdots & \vdots & \ddots & \vdots \\
w^{dout}_{1,1,1} & w^{dout}_{2,1,1} & \cdots & w^{dout}_{h,h,din}
\end{bmatrix} ∈ R^{dout × hh din}
be variables (parameters) of the current layer
Matrix Operations IV
Usually a bias term is considered:
b =
\begin{bmatrix} b_1 \\ \vdots \\ b_{dout} \end{bmatrix} ∈ R^{dout × 1}
Operations at a layer:
Sout,i = W φ(Zin,i) + b 1^T_{aout bout} ∈ R^{dout × aout bout},   (6)
Matrix Operations V
where
1_{aout bout} = [1, . . . , 1]^T ∈ R^{aout bout × 1}.
φ(Zin,i) collects all sub-images in Zin,i into a matrix.
Matrix Operations VI
Specifically,
φ(Zin,i) =
\begin{bmatrix}
z^i_{1,1,1} & z^i_{1+s,1,1} & \cdots & z^i_{1+(aout-1)s,\,1+(bout-1)s,\,1} \\
z^i_{2,1,1} & z^i_{2+s,1,1} & \cdots & z^i_{2+(aout-1)s,\,1+(bout-1)s,\,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^i_{h,h,1} & z^i_{h+s,h,1} & \cdots & z^i_{h+(aout-1)s,\,h+(bout-1)s,\,1} \\
\vdots & \vdots & & \vdots \\
z^i_{h,h,din} & z^i_{h+s,h,din} & \cdots & z^i_{h+(aout-1)s,\,h+(bout-1)s,\,din}
\end{bmatrix} ∈ R^{hh din × aout bout}
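Here is a rough added sketch of φ(Zin,i) (often called im2col) and of the matrix form (6); it assumes Zin is stored as a din × (ain bin) matrix whose columns are pixels in column-major image order, as in the slides, and the helper names are mine.

import numpy as np

def im2col(Zin, d_in, a_in, b_in, h, s):
    # Zin: d_in x (a_in*b_in); column k holds pixel (r, c) with k = c*a_in + r (0-based)
    a_out = (a_in - h) // s + 1
    b_out = (b_in - h) // s + 1
    # recover an (a_in, b_in, d_in) array for easier slicing
    Z = Zin.reshape(d_in, a_in, b_in, order='F').transpose(1, 2, 0)
    cols = []
    for q in range(b_out):        # output column index
        for p in range(a_out):    # output row index moves fastest, as in the slides
            sub = Z[p*s:p*s+h, q*s:q*s+h, :]          # h x h x d_in sub-image
            cols.append(sub.reshape(-1, order='F'))   # rows, then columns, then channels
    return np.stack(cols, axis=1)                     # hh*d_in x (a_out*b_out)

def conv_op(Zin, W, b, d_in, a_in, b_in, h, s):
    # eq. (6): S^{out} = W phi(Z^{in}) + b 1^T
    return W @ im2col(Zin, d_in, a_in, b_in, h, s) + b[:, None]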
Activation Function I
Next, an activation function scales each element of Sout,i to obtain the output matrix Zout,i.
Zout,i = σ(Sout,i) ∈ R^{dout × aout bout}.   (7)
For CNN, commonly the following ReLU activation function
σ(x) = max(x, 0)   (8)
is used
Later we need σ(x) to be differentiable, but the ReLU function is not.
Activation Function II
Past works such as Krizhevsky et al. (2012) assume
σ′(x) = 1 if x > 0, 0 otherwise
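A one-line added sketch of the ReLU activation (8) and of the derivative convention above:

import numpy as np

def relu(x):
    # sigma(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # the convention sigma'(x) = 1 if x > 0 and 0 otherwise
    return (x > 0).astype(float)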
The Function φ(Zin,i) I
In the matrix-matrix product Wφ(Zin,i),
each element is the inner product between a filter and a sub-image
We need to represent φ(Zin,i) in an explicit form.
This is important for subsequent calculation
Clearly φ is a linear mapping, so there exists a 0/1 matrix Pφ such that
φ(Zin,i) ≡ mat(Pφ vec(Zin,i))_{hh din × aout bout}, ∀i,   (9)
The Function φ(Zin,i) II
vec(M): all of M's columns concatenated into a vector:
vec(M) =
\begin{bmatrix} M_{:,1} \\ \vdots \\ M_{:,b} \end{bmatrix} ∈ R^{ab × 1}, where M ∈ R^{a × b}
mat(v) is the inverse of vec(M):
mat(v)_{a×b} =
\begin{bmatrix}
v_1 & \cdots & v_{(b-1)a+1} \\
\vdots & & \vdots \\
v_a & \cdots & v_{ba}
\end{bmatrix} ∈ R^{a × b},   (10)
The Function φ(Zin,i) III
where v ∈ R^{ab × 1}.
Pφ is a huge matrix:
Pφ ∈ R^{hh din aout bout × din ain bin}
and
φ : R^{din × ain bin} → R^{hh din × aout bout}
Later we will check implementation details
Past works using the form (9) include, for example, Vedaldi and Lenc (2015)
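To make (9) and (10) concrete, here is a small added sketch of vec, mat, and a brute-force way to obtain the 0/1 matrix Pφ by applying the im2col sketch above to unit vectors; it only illustrates the linearity of φ and is not an efficient construction.

import numpy as np

def vec(M):
    # stack the columns of M into a vector (column-major)
    return M.reshape(-1, order='F')

def mat(v, a, b):
    # inverse of vec: fill an a x b matrix column by column
    return v.reshape(a, b, order='F')

def build_P_phi(d_in, a_in, b_in, h, s):
    # column j of P_phi is vec(phi(mat(e_j))), so that vec(phi(Z)) = P_phi vec(Z)
    a_out = (a_in - h) // s + 1
    b_out = (b_in - h) // s + 1
    n_in = d_in * a_in * b_in
    P = np.zeros((h * h * d_in * a_out * b_out, n_in))
    for j in range(n_in):
        e = np.zeros(n_in)
        e[j] = 1.0
        P[:, j] = vec(im2col(mat(e, d_in, a_in * b_in), d_in, a_in, b_in, h, s))
    return P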
Optimization Problem I
We collect all weights into a vector variable θ:
θ =
\begin{bmatrix} vec(W^1) \\ b^1 \\ \vdots \\ vec(W^L) \\ b^L \end{bmatrix} ∈ R^n, n: total # of variables
The output of the last layer L is a vector zL+1,i(θ).
Consider any loss function such as the squared loss ξ_i(θ) = ‖z^{L+1,i}(θ) − yi‖^2.
Optimization Problem II
The optimization problem is min_θ f(θ), where
f(θ) = (1/(2C)) θ^T θ + (1/l) Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); yi, Z1,i)
C: regularization parameter.
The formulation is almost the same as that for fully connected networks
Optimization Problem III
Note that we divide the sum of training losses by the number of training data
Thus the second term becomes the average training loss
With the optimization problem defined, there is still a long way to go before a real implementation
Further, CNN involves additional operations in practice
padding
pooling
We will explain them
Zero Padding I
To better control the size of the output image, before the convolutional operation we may enlarge the input image to have zero values around the border.
This technique is called zero-padding in CNN training.
An illustration:
Zero Padding II
An input image
[Figure: the ain × bin input image surrounded by a border of p rows/columns of zeros on each side]
Zero Padding III
The size of the new image is changed from ain × bin to (ain + 2p) × (bin + 2p), where p is specified by users
The operation can be treated as a layer of mapping an input Zin,i to an output Zout,i.
Let dout = din.
Zero Padding IV
There exists a 0/1 matrix
Ppad ∈ R^{dout aout bout × din ain bin}
so that the padding operation can be represented by
Zout,i ≡ mat(Ppad vec(Zin,i))_{dout × aout bout}.   (11)
Implementation details will be discussed later
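An added sketch of zero-padding in the same din × (ain bin) layout used above; np.pad on the unfolded per-channel images is used for brevity, so the explicit Ppad matrix is never formed.

import numpy as np

def zero_pad(Zin, d_in, a_in, b_in, p):
    # enlarge each channel from a_in x b_in to (a_in + 2p) x (b_in + 2p) with zeros on the border
    Z = Zin.reshape(d_in, a_in, b_in, order='F')   # per-channel images, pixels in column-major order
    Zp = np.pad(Z, ((0, 0), (p, p), (p, p)))       # pad only the two spatial dimensions
    return Zp.reshape(d_in, (a_in + 2*p) * (b_in + 2*p), order='F')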
Pooling I
To reduce the computational cost, a dimension reduction is often applied by a pooling step after convolutional operations.
Usually we consider an operation that can (approximately) extract features with rotational or translational invariance.
Examples: average pooling, max pooling, and stochastic pooling.
Let’s consider max pooling as an illustration
Pooling II
An example:
image A
2 3 6 8
5 4 9 7
1 2 6 0
4 3 2 1
→
5 9
4 6
image B
3 2 3 6
4 5 4 9
2 1 2 6
3 4 3 2
→
5 9
4 6
Pooling III
B is derived by shifting A by 1 pixel in the horizontal direction.
We split two images into four 2 × 2 sub-images and choose the max value from every sub-image.
Because only some elements of each sub-image are changed, the maximal value is likely the same or similar.
This is called translational invariance
For our example the two output images from A and B are the same.
Pooling IV
For mathematical representation, we consider the operation as a layer of mapping an input Zin,i to an output Zout,i.
In practice pooling is considered as an operation at the end of the convolutional layer.
We partition every channel of Zin,i into
non-overlapping sub-regions by h × h filters with the stride s = h
Because of the disjoint sub-regions, the stride s for sliding the filters is equal to h.
Pooling V
This partition step is a special case of how we generate sub-images in convolutional operations.
By the same definition as (9) we can generate the matrix
φ(Zin,i) = mat(Pφ vec(Zin,i))_{hh × dout aout bout},   (12)
where
aout = ⌊ain/h⌋,  bout = ⌊bin/h⌋,  dout = din.   (13)
Pooling VI
This is the same as the calculation in (5) because
⌊(ain − h)/h⌋ + 1 = ⌊ain/h⌋
Note that here we consider
hh × dout aout bout rather than hh dout × aout bout
because we can then do a max operation on each column
Pooling VII
To select the largest element of each sub-region, there exists a 0/1 matrix
Mi ∈ R^{dout aout bout × hh dout aout bout}
so that each row of Mi selects a single element from vec(φ(Zin,i)).
Therefore,
Zout,i = mat(Mi vec(φ(Zin,i)))_{dout × aout bout}.   (14)
Pooling VIII
A comparison with (6) shows that Mi plays a similar role to the weight matrix W
While Mi is 0/1, it is not a constant. Its positions of 1's depend on the values of φ(Zin,i)
By combining (12) and (14), we have
Zout,i = mat(P^i_pool vec(Zin,i))_{dout × aout bout},   (15)
where
P^i_pool = Mi Pφ ∈ R^{dout aout bout × din ain bin}.   (16)
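An added sketch of max pooling with non-overlapping h × h regions in the same layout; it realizes (12)-(14) with s = h, except that the selection matrix Mi is never formed explicitly. The toy usage reuses image A from the earlier slide.

import numpy as np

def max_pool(Zin, d_in, a_in, b_in, h):
    # keep the maximum of every non-overlapping h x h region of each channel
    a_out, b_out = a_in // h, b_in // h  # eq. (13), with d_out = d_in
    Z = Zin.reshape(d_in, a_in, b_in, order='F')[:, :a_out*h, :b_out*h]
    blocks = Z.reshape(d_in, a_out, h, b_out, h)   # split each channel into h x h regions
    out = blocks.max(axis=(2, 4))                  # the largest element of every region
    return out.reshape(d_in, a_out * b_out, order='F')

# toy usage: image A from the pooling example, stored column by column as one channel
A = np.array([[2., 3, 6, 8], [5, 4, 9, 7], [1, 2, 6, 0], [4, 3, 2, 1]])
ZA = A.reshape(1, 16, order='F')
print(max_pool(ZA, 1, 4, 4, 2).reshape(2, 2, order='F'))  # [[5. 9.] [4. 6.]]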
Summary of a Convolutional Layer I
For implementation, padding and pooling are (optional) parts of the convolutional layers.
We discuss details of considering all operations together.
The whole convolutional layer involves the following procedure:
Zm,i → padding by (11) → convolutional operations by (6), (7) → pooling by (15) → Zm+1,i,   (17)
Summary of a Convolutional Layer II
where Zm,i and Zm+1,i are input and output of the mth layer, respectively.
Let the following symbols denote image sizes at different stages of the convolutional layer.
a_m, b_m: size in the beginning
a^pad_m, b^pad_m: size after padding
a^conv_m, b^conv_m: size after convolution.
The following table indicates what ain, bin, din and aout, bout, dout are at each stage.
Summary of a Convolutional Layer III
Operation        | Input       | Output
Padding: (11)    | Zm,i        | pad(Zm,i)
Convolution: (6) | pad(Zm,i)   | Sm,i
Convolution: (7) | Sm,i        | σ(Sm,i)
Pooling: (15)    | σ(Sm,i)     | Zm+1,i

Operation        | ain, bin, din                  | aout, bout, dout
Padding: (11)    | a_m, b_m, d_m                  | a^pad_m, b^pad_m, d_m
Convolution: (6) | a^pad_m, b^pad_m, d_m          | a^conv_m, b^conv_m, d_{m+1}
Convolution: (7) | a^conv_m, b^conv_m, d_{m+1}    | a^conv_m, b^conv_m, d_{m+1}
Pooling: (15)    | a^conv_m, b^conv_m, d_{m+1}    | a_{m+1}, b_{m+1}, d_{m+1}
Summary of a Convolutional Layer IV
Let the filter size, mapping matrices, and weight matrices at the mth layer be
h_m, P^m_pad, P^m_φ, P^{m,i}_pool, W^m, b^m.
From (11), (6), (7), (15), all operations can be summarized as
S^{m,i} = W^m mat(P^m_φ P^m_pad vec(Z^{m,i}))_{h_m h_m d_m × a^conv_m b^conv_m} + b^m 1^T_{a^conv_m b^conv_m},
Z^{m+1,i} = mat(P^{m,i}_pool vec(σ(S^{m,i})))_{d_{m+1} × a_{m+1} b_{m+1}},   (18)
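Assuming the zero_pad, conv_op, relu, and max_pool sketches above, the whole procedure (17) can be chained as below; the helper and its arguments are my own naming, not the slides'.

def conv_layer_forward(Zm, d_m, a_m, b_m, W, bias, h, s, p, h_pool):
    # padding by (11)
    Zpad = zero_pad(Zm, d_m, a_m, b_m, p)
    a_pad, b_pad = a_m + 2*p, b_m + 2*p
    # convolutional operations by (6) and (7)
    S = conv_op(Zpad, W, bias, d_m, a_pad, b_pad, h, s)
    a_conv = (a_pad - h) // s + 1
    b_conv = (b_pad - h) // s + 1
    Zact = relu(S)                 # sigma(S^{m,i}), eq. (7)
    # pooling by (15); the number of output channels equals the number of filters
    d_next = W.shape[0]
    return max_pool(Zact, d_next, a_conv, b_conv, h_pool)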
Fully-Connected Layer I
Assume Lc is the number of convolutional layers. The input vector of the first fully-connected layer is
zm,i = vec(Zm,i), i = 1, . . . , l, m = Lc + 1.
In each of the fully-connected layers (Lc < m ≤ L), we consider the weight matrix and bias vector between layers m and m + 1.
Fully-Connected Layer II
Weight matrix:
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\
w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m}
\end{bmatrix} ∈ R^{n_{m+1} × n_m}   (19)
Bias vector:
b^m =
\begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_{m+1}} \end{bmatrix} ∈ R^{n_{m+1} × 1}
Fully-Connected Layer III
Here nm and nm+1 are the numbers of nodes in layers m and m + 1, respectively.
If z^{m,i} ∈ R^{n_m} is the input vector, the following operations are applied to generate the output vector z^{m+1,i} ∈ R^{n_{m+1}}:
s^{m,i} = W^m z^{m,i} + b^m,   (20)
z^{m+1,i}_j = σ(s^{m,i}_j), j = 1, . . . , n_{m+1}.   (21)
Outline
1 Regularized linear classification
2 Optimization problem for fully-connected networks
3 Optimization problem for convolutional neural networks (CNN)
4 Discussion
Challenges in NN Optimization
The objective function is non-convex. It may have many local minima
It’s known that global optimization is much more difficult than local minimization
The problem structure is very complicated
In this course we will have first-hand experience in handling these difficulties
Formulation I
We have written all CNN operations in matrix/vector forms
This is useful in deriving the gradient
Are our representation symbols good enough? Can we do better?
You can say that this is only a matter of notation, but given the wide use of CNN, a good formulation can be extremely useful
References I
M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003.
A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.
C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.