Mathematics and Computer Science
Chih-Jen Lin
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Introduction
When Prof. Chao asked me to give a lecture here, I wasn’t quite sure what I should talk about
I don’t have a new and emerging topic to share with you here.
So instead I plan to talk about a topic
“Mathematics and Computer Science”
But why this topic?
I will explain my motivation
Introduction (Cont’d)
Some time ago, in a town-hall meeting with some faculty members, one student asked why calculus is a required course
I heard this from some faculty members as I wasn’t there
Anyway, I think it really happened. Here is the reaction from a professor:
He said “When we were students, we didn’t ask why xxx is a required course. We just took it.”
Introduction (Cont’d)
Then I asked myself if it’s possible to give you some reasons
That leads to this lecture
The Role of Mathematics in CS I
One reason why some students don’t think calculus is important is that they think
CS = programming
But many (or most) CS areas are beyond programming
One issue is that in our required courses, things like calculus are seldom used
Students can see that discrete mathematics is related to algorithms
The Role of Mathematics in CS II
But they find calculus/linear algebra/statistics useful only after taking computer vision, signal processing, machine learning and others
These are more advanced courses
Note that CS is a rapidly changing area
Before the Internet, many CS companies just hired programmers
For example, for Windows and Office
development, Microsoft hired many programmers with an undergraduate degree
The Role of Mathematics in CS III
Then Google started hiring many people with Ph.D. or master’s degrees
Compared with traditional software development, in the Internet era, analytics skills are more
important
This doesn’t mean every engineer in big Internet companies has the job of developing analytics tools (e.g., deep learning software)
Instead, most are users. They don’t need to know all the sophisticated details, but they need some basic understanding
The Role of Mathematics in CS IV
For example, as a user of deep learning, you probably need to roughly know how it works
Otherwise you might not know what you are doing and what kinds of results you will get
To have a basic understanding of these things, you need some mathematics background
I am going to illustrate this point in the lecture
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Neural Networks
To discuss why mathematics is important in some CS areas, we can consider many examples
We decided to talk about neural networks because deep learning is (incredibly) hot
There are many types of neural networks, but we will consider the simplest one
It’s the fully connected network for multi-class classification
So let’s check what data classification is
Data Classification
We extract information to build a model for future prediction
Data Classification (Cont’d)
The main task is finding a model
This setting is also called supervised learning
Data Classification (Cont’d)
Given training data in different classes (labels known)
Predict test data (labels unknown)
Classic example: medical diagnosis
Find a patient’s blood pressure, weight, etc.
After several years, know if he/she recovers
Build a machine learning model
New patient: find blood pressure, weight, etc.
Prediction
Minimizing Training Errors
Basically a classification method starts with minimizing the training errors
    min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a support vector machine, a neural network, or others
There are various ways to introduce classification methods. Here we consider probably the most popular one
Minimizing Training Errors (Cont’d)
For simplicity, let’s consider the model to be a vector w
That is, the decision function is sgn(wTx)
For any data x, the predicted label is

    1   if wTx ≥ 0
    −1  otherwise
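As a tiny illustration (a C sketch of ours, not from the lecture; the function name and data layout are our own choices), the decision function can be computed as:

    #include <stdio.h>

    /* predicted label: +1 if w^T x >= 0, -1 otherwise */
    int predict(const double *w, const double *x, int n)
    {
        double s = 0;
        int i;
        for (i = 0; i < n; i++)
            s += w[i]*x[i];
        return (s >= 0) ? 1 : -1;
    }

    int main(void)
    {
        double w[] = {1.0, -2.0};
        double x[] = {0.5, 0.1};   /* w^T x = 0.3 >= 0, so the label is +1 */
        printf("%d\n", predict(w, x, 2));
        return 0;
    }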
Minimizing Training Errors (Cont’d)
The two-dimensional situation
[Figure: circles (◦) and triangles (△) in two dimensions separated by the line wTx = 0]
This seems to be quite restricted, but practically x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:

    ξ(w; x, y) = 1 if ywTx < 0,
                 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss ξ(w; x, y) plotted against −ywTx]
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss)
    ξL1(w; x, y) ≡ max(0, 1 − ywTx)    (1)
Logistic loss
    ξLR(w; x, y) ≡ log(1 + e−ywTx)    (2)
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)
SVM and LR are two very fundamental classification methods
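As a sketch (our own C code, not part of the lecture), the two losses can be written directly from (1) and (2), taking ywTx as the argument:

    #include <math.h>

    /* hinge (l1) loss: max(0, 1 - y w^T x) */
    double hinge_loss(double ywTx)
    {
        return ywTx < 1 ? 1 - ywTx : 0;
    }

    /* logistic loss: log(1 + exp(-y w^T x)) */
    double logistic_loss(double ywTx)
    {
        return log(1 + exp(-ywTx));
    }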
Common Loss Functions (Cont’d)
[Figure: the hinge loss ξL1 and the logistic loss ξLR plotted against −ywTx]
Logistic regression is closely related to SVM
Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure: an illustration with training points and testing points]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w’s values less extreme.
One idea is to make w’s values closer to zero
We can add, for example,

    wTw/2 or ‖w‖1

to the function that is minimized
General Form of Linear Classification
Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features

    min_w f(w),  f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; xi, yi)

wTw/2: regularization term
ξ(w; x, y): loss function
C: regularization parameter (chosen by users)
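To make the formula concrete, here is a C sketch of ours (with the hinge loss and a simple row-by-row data layout; none of this is from a real library) that evaluates f(w):

    /* f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)
       X: l-by-n data matrix stored row by row (our own layout) */
    double objective(const double *w, const double *X, const double *y,
                     int l, int n, double C)
    {
        double reg = 0, loss = 0;
        int i, j;
        for (j = 0; j < n; j++)
            reg += w[j]*w[j];                  /* w^T w */
        for (i = 0; i < l; i++) {
            double wTx = 0, m;
            for (j = 0; j < n; j++)
                wTx += w[j]*X[i*n + j];
            m = 1 - y[i]*wTx;
            if (m > 0)                          /* hinge loss of instance i */
                loss += m;
        }
        return reg/2 + C*loss;
    }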
From Linear to Nonlinear
We now have linear classification because the decision function
sgn(wTx) is linear
We will see that a neural network (NN) is a nonlinear classifier
Neural Networks
We will explain neural networks using the same framework as for linear classification
Among various types of networks, we consider
fully-connected feed-forward networks for multi-class classification
Neural Networks (Cont’d)
Our training set includes (yi, xi), i = 1, . . . , l . xi ∈ Rn0 is the feature vector.
yi ∈ RK is the label vector.
K : # of classes
If xi is in class k, then

    yi = [0, . . . , 0, 1, 0, . . . , 0]T ∈ RK

with the single 1 in the kth position (preceded by k − 1 zeros)
A neural network maps each feature vector to one of the class labels by the connection of nodes.
Neural Networks (Cont’d)
Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).
[Figure: a small fully-connected network with nodes A0, B0, C0 in the input layer, A1, B1 in the hidden layer, and A2, B2, C2 in the output layer]
Neural Networks (Cont’d)
The weight matrix Wm at the mth layer is

    Wm = [ wm11        · · ·  wm1,nm
            ...                ...
           wmnm−1,1    · · ·  wmnm−1,nm ]  ∈ Rnm−1×nm,

nm: # neurons at layer m
nm−1: # neurons at layer m − 1
L: number of layers
n0 = # of features, nL = # of classes
Let zm−1 be the input of the mth layer, so z0 = x, and zL is the final output
Neural Networks (Cont’d)
From the (m − 1)th layer to the mth layer:

    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function.
We collect all variables:

    θ = [vec(W1); . . . ; vec(WL)] ∈ Rn

n: total # of variables = n0n1 + · · · + nL−1nL
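A minimal C sketch of one step of this mapping (our own code; the row-by-row storage of Wm and the choice of the sigmoid as σ are assumptions for illustration):

    #include <math.h>

    /* one layer: s = (W^m)^T z^{m-1}, z^m_j = sigma(s_j)
       W: n_prev-by-n_cur matrix stored row by row */
    void forward_layer(const double *W, const double *z_prev,
                       double *z_cur, int n_prev, int n_cur)
    {
        int i, j;
        for (j = 0; j < n_cur; j++) {
            double s = 0;
            for (i = 0; i < n_prev; i++)
                s += W[i*n_cur + j]*z_prev[i];   /* ((W^m)^T z^{m-1})_j */
            z_cur[j] = 1.0/(1.0 + exp(-s));      /* sigma(s_j), here a sigmoid */
        }
    }

Applying this layer by layer, L times, produces zL from z0 = x.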
Neural Networks (Cont’d)
We solve the following optimization problem:

    min_θ f(θ),

where

    f(θ) = (1/2)θTθ + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)

C: regularization parameter
zL(θ) ∈ RnL: last-layer output vector of x
ξ(zL; x, y): loss function. Example:
    ξ(zL; x, y) = ‖zL − y‖²
Neural Networks (Cont’d)
That is, we hope

    y = [0, . . . , 0, 1, 0, . . . , 0]T

and

    zL ≈ [±0.00 · · · , . . . , ±0.00 · · · , 1.00 · · · , ±0.00 · · · , . . . , ±0.00 · · · ]T

i.e., the output for the true class is close to 1 and the others are close to 0
Neural Networks (Cont’d)
The formulation is as before, but the loss function is more complicated
This NN method has been developed for decades.
So what’s new about deep learning?
Though there are some technical advances, one major thing is that more layers often lead to better results
Solving Optimization Problems I
How do you minimize f(θ)?
Usually by a descent method
That is, we find a sequence θ1, θ2, θ3, . . . such that

    f(θ1) > f(θ2) > f(θ3) > · · ·
Solving Optimization Problems II
Hopefully
    lim_{k→∞} f(θk)
exists and is the smallest function value
Now you see that calculus is used. You need to know what a limit is
But how to obtain
    f(θk+1) < f(θk)?
Usually by gradient descent
Gradient Descent I
Taylor expansion. If
    f(θ): R1 → R1,
then
    f(θk + d) = f(θk) + f′(θk)d + (1/2)f′′(θk)d² + · · ·
This is the one-dimensional case
Now we have multiple variables
    f(θ): Rn → R1
Gradient Descent II
So we need the multi-dimensional Taylor expansion
    f(θk + d) = f(θk) + ∇f(θk)Td + · · ·
We don’t get into details, but ∇f(θ) is called the gradient
The gradient is the multi-dimensional first derivative:

    ∇f(θ) = [∂f(θ)/∂θ1, . . . , ∂f(θ)/∂θn]T
Gradient Descent III
Let
    f(θk + d) ≈ f(θk) + ∇f(θk)Td
and we can find d by
    min_d ∇f(θk)Td
But easily this value goes to −∞
If
    ∇f(θk)Td = −100,
then
    ∇f(θk)T(2d) = −200, ∇f(θk)T(4d) = −400, . . .
Gradient Descent IV
Thus we need to confine the search space of d:

    min_d ∇f(θk)Td
    subject to ‖d‖ = 1    (3)

Here ‖d‖ means the length of d:

    ‖d‖ = √(d1² + · · · + dn²)

How to solve (3)?
Gradient Descent V
We will use the Cauchy inequality

    (a1b1 + · · · + anbn)² ≤ (a1² + · · · + an²)(b1² + · · · + bn²)

When
    d = −∇f(θk)/‖∇f(θk)‖,
we have
    (∇f(θk)Td)² = ‖∇f(θk)‖²
Gradient Descent VI
so that equality holds in the Cauchy inequality
Thus the minimum of (3) is attained
However, we may not have
    f(θk + d) < f(θk)
Instead, we need to search for a step size
Gradient Descent VII
Specifically we try
    α = 1, 1/2, 1/4, 1/8, . . .
until
    f(θk + αd) < f(θk) + σα∇f(θk)Td,    (4)
where σ ∈ (0, 1/2).
The condition (4) is usually called the sufficient decrease condition in optimization
Gradient Descent VIII
While θ isn’t optimal
    d = −∇f(θ) and α ← 1
    While true
        If (4) holds, break
        Else α ← α/2
    θ ← θ + αd
The procedure to search for α is called line search
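Putting the pieces together, here is a self-contained C sketch of gradient descent with this back-tracking line search on a toy two-variable quadratic (the objective, σ = 0.01, and the stopping rule are our own choices for illustration):

    #include <stdio.h>
    #include <math.h>

    #define N 2

    /* a toy objective f(t) = (t1-1)^2 + 2(t2+0.5)^2 */
    double f(const double *t)
    {
        return (t[0]-1)*(t[0]-1) + 2*(t[1]+0.5)*(t[1]+0.5);
    }

    void grad(const double *t, double *g)
    {
        g[0] = 2*(t[0]-1);
        g[1] = 4*(t[1]+0.5);
    }

    int main(void)
    {
        double theta[N] = {0, 0}, g[N], d[N], theta_new[N], sigma = 0.01;
        int iter, i;
        for (iter = 0; iter < 100; iter++) {
            double gTd = 0, alpha = 1, fval = f(theta);
            grad(theta, g);
            for (i = 0; i < N; i++) {          /* d = -grad f(theta) */
                d[i] = -g[i];
                gTd += g[i]*d[i];              /* grad f(theta)^T d */
            }
            if (sqrt(-gTd) < 1e-8)             /* gradient (almost) zero: stop */
                break;
            while (1) {                        /* back-tracking line search */
                for (i = 0; i < N; i++)
                    theta_new[i] = theta[i] + alpha*d[i];
                if (f(theta_new) < fval + sigma*alpha*gTd)   /* condition (4) */
                    break;
                alpha /= 2;
            }
            for (i = 0; i < N; i++)
                theta[i] = theta_new[i];
        }
        printf("theta = (%g, %g)\n", theta[0], theta[1]);
        return 0;
    }

Running it should print a point close to (1, −0.5), the minimizer of the toy objective.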
Gradient Descent IX
Instead of
    α = 1, 1/2, 1/4, 1/8, . . .
we can use
    α = 1, β, β², β³, . . . ,
where
    0 < β < 1
Step-size Search I
Why σ ∈ (0, 1/2)?
The use of 1/2 is for convergence, though we won’t discuss details
Q: how do we know that the line search procedure is guaranteed to stop?
Step-size Search II
In fact we can prove that if
    ∇f(θ)Td < 0    (5)
then there exists α∗ > 0 such that
    f(θ + αd) < f(θ) + σ∇f(θ)T(αd), ∀α ∈ (0, α∗)
Any d satisfying (5) is called a descent direction
Step-size Search III
Proof: assume the result is wrong. Then there exists a sequence {αt} with
    lim_{t→∞} αt = 0 and αt > 0, ∀t
such that
    f(θ + αtd) ≥ f(θ) + σαt∇f(θ)Td, ∀t
Step-size Search IV
Then
    lim_{t→∞} [f(θ + αtd) − f(θ)] / αt = ∇f(θ)Td ≥ σ∇f(θ)Td
However,
    ∇f(θ)Td < 0 and σ < 1
cause a contradiction (the inequality above would force σ ≥ 1)
Step-size Search V
Q: how do you formally say

    lim_{α→0} [f(θ + αd) − f(θ)] / α = ∇f(θ)Td?

Let
    g(α) ≡ f(θ + αd)
We essentially calculate

    lim_{α→0} [g(α) − g(0)] / α    (6)

By the definition of the first derivative,
Step-size Search VI
(6) is g′(0)
But what are g′(α) and then g′(0)?
We have

    g′(α) = [∂f(θ + αd)/∂θ1] [∂(θ1 + αd1)/∂α] + · · · + [∂f(θ + αd)/∂θn] [∂(θn + αdn)/∂α]
          = [∂f(θ + αd)/∂θ1] d1 + · · · + [∂f(θ + αd)/∂θn] dn
Step-size Search VII
and
    g′(0) = ∇f(θ)Td
This is the multi-variable chain rule
Statement of the multi-variable chain rule: let x = x(t) and y = y(t) be differentiable at t and suppose
    z = f(x, y)
Step-size Search VIII
is differentiable at (x, y). Then z(t) = f(x(t), y(t)) is differentiable at t and

    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
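A quick numerical check of this rule in C (the particular functions x(t) = cos t, y(t) = t², z = xy are our own toy choices, not from the lecture):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double t = 0.7, h = 1e-6;
        double x = cos(t), y = t*t;
        /* chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt)
                             = y*(-sin t) + x*(2t)               */
        double chain = y*(-sin(t)) + x*(2*t);
        /* direct numerical derivative of z(t) = cos(t)*t*t      */
        double numer = (cos(t+h)*(t+h)*(t+h) - cos(t-h)*(t-h)*(t-h))/(2*h);
        printf("chain rule: %.8f  numerical: %.8f\n", chain, numer);
        return 0;
    }

The two printed values should agree to several digits.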
Gradient of NN I
Recall that the NN optimization problem is

    min_θ f(θ), where f(θ) = (1/2)θTθ + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)
How to calculate the gradient?
Now zL is actually a function of all variables zL(θ)
Gradient of NN II
What we will calculate is

    ∇f(θ) = θ + C Σ_{i=1}^{l} ∇θξ(zL,i(θ); xi, yi)

So what is
    ∇θξ(zL(θ); x, y)?
Gradient of NN III
We have

    ∂ξ(zL(θ); x, y)/∂θ1 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ1] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ1]

    ∂ξ(zL(θ); x, y)/∂θ2 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ2] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ2]

    ...
Gradient of NN IV
Thus

    ∇θξ(zL(θ); x, y) =
      [ ∂zL1(θ)/∂θ1  · · ·  ∂zLnL(θ)/∂θ1 ] [ ∂ξ/∂zL1  ]
      [     ...                ...       ] [   ...    ]
      [ ∂zL1(θ)/∂θn  · · ·  ∂zLnL(θ)/∂θn ] [ ∂ξ/∂zLnL ]

where the matrix of the ∂zLj(θ)/∂θi terms
Gradient of NN V
is called the Jacobian of zL(θ)
We see that the chain rule is used again
There are a lot more details about the gradient evaluation, but let’s stop here
The point is that the techniques behind deep learning are quite complicated and need lots of mathematics
Next let’s switch to the issue of computation
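A small practical aside (ours, not from the lecture): gradient formulas like these are often verified against finite differences. A hedged C sketch with an interface of our own design:

    #include <stdlib.h>
    #include <math.h>

    /* compare an analytic gradient with central finite differences;
       f: objective, gradf: its (claimed) gradient, theta: current point */
    double max_gradient_error(double (*f)(const double *, int),
                              void (*gradf)(const double *, int, double *),
                              double *theta, int n)
    {
        double h = 1e-6, max_err = 0;
        double *g = (double *) malloc(n*sizeof(double));
        int i;
        gradf(theta, n, g);
        for (i = 0; i < n; i++) {
            double old = theta[i], fp, fm, approx, err;
            theta[i] = old + h; fp = f(theta, n);
            theta[i] = old - h; fm = f(theta, n);
            theta[i] = old;
            approx = (fp - fm)/(2*h);          /* numerical partial derivative */
            err = fabs(approx - g[i]);
            if (err > max_err) max_err = err;
        }
        free(g);
        return max_err;   /* should be tiny if the analytic gradient is right */
    }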
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Matrix Multiplication I
We will show that to calculate f(θ), the main operation from one layer to the next is a matrix-matrix product
Recall that from the (m − 1)th layer to the mth layer
    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,
where σ(·) is the activation function.
Matrix Multiplication II
Now each instance xi has its own zm−1,i
So we have
    zm−1,1, . . . , zm−1,l
if there are l training instances
Thus
    [sm,1 · · · sm,l] = (Wm)T [zm−1,1 · · · zm−1,l] ∈ Rnm×l,
where
    Wm ∈ Rnm−1×nm
Matrix Multiplication III
The main cost in calculating the function value of NN is the matrix-matrix product between every two layers
You know how to do matrix multiplication.
C = AB is a mathematical operation with

    Cij = Σ_{k=1}^{n} Aik Bkj
Matrix Multiplication IV
At first glance, it seems to have nothing to do with computer science
But have you ever thought about a question: why do people use GPU for deep learning?
An Internet search shows the following answer from
https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning
“Deep learning involves huge amount of matrix multiplications and other operations which can be massively parallelized and thus sped up on GPU-s.”
Matrix Multiplication V
As computer science students, we need to know a bit more of the details
I am going to use CPU rather than GPU to give an illustration of how computer architectures may affect a mathematical operation
Optimized BLAS: an Example by Using Block Algorithms I
Let’s test the matrix multiplication
A C program:
#define n 2000
double a[n][n], b[n][n], c[n][n];
int main() {
int i, j, k;
Optimized BLAS: an Example by Using Block Algorithms II
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      a[i][j]=1; b[i][j]=1;          /* initialize A and B */
    }
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      c[i][j]=0;
      for (k=0;k<n;k++)              /* C = A*B, inner-product form */
        c[i][j] += a[i][k]*b[k][j];
    }
Optimized BLAS: an Example by Using Block Algorithms III
}
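Assuming the file is saved as, say, matmul.c (the name and flags are just an example), it can be compiled and timed with something like gcc -O2 matmul.c -o matmul and then time ./matmul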
A Matlab program:
    n = 2000;
    A = randn(n,n); B = randn(n,n);
    t = cputime; C = A*B; t = cputime - t
To remove the effect of multi-threading, use
    matlab -singleCompThread
Timing is an issue
Optimized BLAS: an Example by Using Block Algorithms IV
cjlin@linux1:~$ matlab -singleCompThread
>> a = randn(3000,3000); tic; c = a*a; toc
Elapsed time is 4.520684 seconds.
>> a = randn(3000,3000); t=cputime; c = a*a; t=cputime-t
t =
    4.3500
Optimized BLAS: an Example by Using Block Algorithms V
cjlin@linux1:~$ matlab
>> a = randn(3000,3000); tic; c = a*a; toc
Elapsed time is 1.180799 seconds.
>> a = randn(3000,3000); t=cputime; c = a*a; t=cputime-t
t =
    8.4400
Matlab is much faster than the C code we wrote by hand
Optimized BLAS: an Example by Using Block Algorithms VI
Optimized BLAS: data locality is exploited
Use the highest level of memory as much as possible
Block algorithms: transfer sub-matrices between different levels of storage and localize operations to achieve good performance
Memory Hierarchy I
CPU
↓ Registers
↓ Cache
↓
Main Memory
↓
Secondary storage (Disk)
Memory Hierarchy II
↑: increasing in speed
↓: increasing in capacity
When I studied computer architecture, I didn’t quite understand why this setting is so useful
But from optimized BLAS I realize that it is extremely powerful
Memory Management I
Page fault: an operand is not available in main memory and must be transported from secondary memory
This (usually) overwrites the least recently used page
The extra I/O increases the total time
An example: C = AB + C, n = 1,024
Assumption: a page holds 65,536 doubles = 64 columns, so 16 pages for each matrix
48 pages for three matrices
Memory Management II
Assumption: available memory holds 16 pages; matrix access is column oriented
For
    A = [1 2
         3 4]
column-oriented storage is 1 3 2 4; row-oriented storage is 1 2 3 4
Accessing each row of A: 16 page faults, since 1024/64 = 16
Assumption: each time, a contiguous segment of data is brought into one page
Approach 1: inner product
Memory Management III
for i = 1:n
  for j = 1:n
    for k = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
We use a matlab-like syntax here
At each (i,j): each row a(i, 1:n) causes 16 page faults
Memory Management IV
Total: 1024² × 16 page faults, at least 16 million page faults
Approach 2:
for j = 1:n
  for k = 1:n
    for i = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
Memory Management V
For each j, we access all columns of A
A needs 16 pages, but B and C take space as well, so A must be read again for every j
For each j: 16 page faults for A, hence 1024 × 16 page faults in total
B, C: 16 page faults each
Approach 3: block algorithms (nb = 256)
Memory Management VI
for j = 1:nb:n
  for k = 1:nb:n
    for jj = j:j+nb-1
      for kk = k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj) + c(:,jj);
      end
    end
  end
end
In MATLAB, 1:256:1024 means 1, 257, 513, 769
Memory Management VII
Note that we calculate

    [A11 · · · A14] [B11 · · · B14]   [A11B11 + · · · + A14B41   · · ·]
    [ ...     ... ] [ ...     ... ] = [          ...             ... ]
    [A41 · · · A44] [B41 · · · B44]
Memory Management VIII
Each block: 256 × 256
    C11 = A11B11 + · · · + A14B41
    C21 = A21B11 + · · · + A24B41
    C31 = A31B11 + · · · + A34B41
    C41 = A41B11 + · · · + A44B41
For each (j, k), Bk,j is used to add A:,kBk,j to C:,j
Memory Management IX
Example: when j = 1, k = 1,
    C11 ← C11 + A11B11
    ...
    C41 ← C41 + A41B11
Use Approach 2 for A:,1B11
A:,1: 256 columns, 1024 × 256/65,536 = 4 pages
A:,1, . . . , A:,4: 4 × 4 = 16 page faults in calculating C:,1
Memory Management X
B: 16 page faults, C : 16 page faults
Now let’s try to compare approaches 1 and 2 in a C implementation
We see that approach 1 is faster. Why?
The C language is row-oriented (row-major) rather than column-oriented, so the innermost loop of approach 1 accesses a(i,:) contiguously
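Since C is row-oriented, a blocked (tiled) version suited to C would iterate over row-major blocks. A sketch of ours (NB = 256 mirrors the nb used above; this is only an illustration, not optimized BLAS code):

    /* blocked (tiled) computation of C = C + A*B for n-by-n row-major
       matrices; the caller is assumed to have set C to zero first */
    #define NB 256

    void blocked_matmul(int n, const double *a, const double *b, double *c)
    {
        int i0, j0, k0, i, j, k;
        for (i0 = 0; i0 < n; i0 += NB)
            for (k0 = 0; k0 < n; k0 += NB)
                for (j0 = 0; j0 < n; j0 += NB)
                    /* multiply block A(i0,k0) by block B(k0,j0) and
                       accumulate into block C(i0,j0) */
                    for (i = i0; i < i0 + NB && i < n; i++)
                        for (k = k0; k < k0 + NB && k < n; k++) {
                            double aik = a[i*n + k];
                            for (j = j0; j < j0 + NB && j < n; j++)
                                c[i*n + j] += aik*b[k*n + j];
                        }
    }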
Optimized BLAS Implementations
OpenBLAS
http://www.openblas.net/
It is an optimized BLAS library based on GotoBLAS2 (see the story in the next slide)
It’s a successful open-source project developed in China
Intel MKL (Math Kernel Library)
https://software.intel.com/en-us/mkl
Some Past Stories about Optimized BLAS
BLAS by Kazushige Goto
https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2
See the NY Times article: “Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips”
http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all
Homework I
We would like to compare the time for multiplying two 8,000 by 8,000 matrices using
Directly compiled sources of BLAS from http://www.netlib.org/blas/
Intel MKL
OpenBLAS
You can use the BLAS or CBLAS interface
Try to comment on the use of multi-core processors.
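As a starting point, here is a hedged sketch of ours using the standard CBLAS interface (the matrix size and CPU timing follow the slides; how you link against the reference BLAS, OpenBLAS, or MKL depends on your system):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        int n = 8000, i;
        double *A = malloc(sizeof(double)*n*n);
        double *B = malloc(sizeof(double)*n*n);
        double *C = malloc(sizeof(double)*n*n);
        clock_t t;
        if (!A || !B || !C) return 1;
        for (i = 0; i < n*n; i++) {
            A[i] = rand()/(double)RAND_MAX;
            B[i] = rand()/(double)RAND_MAX;
            C[i] = 0;
        }
        t = clock();
        /* C = 1.0*A*B + 0.0*C, all matrices row-major, no transposes */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        printf("CPU time: %.2f seconds\n",
               (double)(clock() - t)/CLOCKS_PER_SEC);
        free(A); free(B); free(C);
        return 0;
    }

Note that clock() measures CPU time, like MATLAB’s cputime; for wall-clock time (and hence the effect of multi-core processors) a different timer is needed.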
Conclusions I
In general I don’t think we should have too many required courses
However, some of them are very basic and are very useful in advanced topics
Some students do not think basic mathematics courses (e.g., calculus) are CS courses. But that may not be the case
When I evaluate applications for graduate school by checking transcripts, very often I first look at the grade of calculus
Conclusions II
I hope that through this lecture you have seen that some mathematics techniques are closely related to CS topics