Some thoughts on Mathematics and Computer Science



Computer Science

Chih-Jen Lin



1 Introduction

2 Neural Networks

3 Matrix Computation in NN



When Prof. Chao asked me to give a lecture here, I wasn’t quite sure what I should talk about

I don’t have a new and emerging topic to share with you here.

So instead I plan to talk about a topic

“Mathematics and Computer Science”

But why this topic?

I will explain my motivation


Introduction (Cont’d)

Some time ago, in a town-hall meeting with some faculty members, one student asked why calculus is a required course

I heard this from some faculty members as I wasn’t there

Anyway I think it really happened. Here is the reaction from a professor:

He said “When we were students, we didn’t ask why xxx is a required course. We just took it.”


Introduction (Cont’d)

Then I asked myself if it’s possible to give you some reasons

That leads to this lecture


The Role of Mathematics in CS I

One reason why some students don’t think calculus is important is that they think

CS = programming

But many (or most) CS areas are beyond programming

One issue is that in our required courses, things like calculus are seldom used

Students can see that discrete mathematics is related to algorithms


The Role of Mathematics in CS II

But they find calculus/linear algebra/statistics useful only after taking computer vision, signal processing, machine learning and others

These are more advanced courses

Note that CS is a rapidly changing area

Before the Internet, many CS companies just hired programmers

For example, for Windows and Office development, Microsoft hired many programmers with an undergraduate degree


The Role of Mathematics in CS III

Then Google started hiring many people with Ph.D. or master’s degrees

Compared with traditional software development, in the Internet era, analytics skills are more important


This doesn’t mean every engineer in big Internet companies has the job of developing analytics tools (e.g., deep learning software)

Instead, most are users. They don’t need to know all the sophisticated details, but some basic understanding is needed


The Role of Mathematics in CS IV

For example, as a user of deep learning, you probably need to roughly know how it works

Otherwise you might not know what you are doing and what kinds of results you will get

To have a basic understanding of these things, you need some mathematics background

I am going to illustrate this point in the lecture



1 Introduction

2 Neural Networks

3 Matrix Computation in NN


Neural Networks

To discuss why mathematics is important in some CS areas, we can consider many examples

We decided to talk about neural networks as deep learning is (incredibly) hot

There are many types of neural networks, but we will consider the simplest one

It’s the fully connected network for multi-class classification

So let’s check what data classification is


Data Classification

We extract information to build a model for future prediction


Data Classification (Cont’d)

The main task is finding a model

It’s also called supervised learning


Data Classification (Cont’d)

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example: medical diagnosis

Find a patient’s blood pressure, weight, etc.

After several years, know if he/she recovers

Build a machine learning model

New patient: find blood pressure, weight, etc.

Prediction


Minimizing Training Errors

Basically a classification method starts with minimizing the training errors


(Diagram: training data → model, with training errors minimized)

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a support vector machine, a neural network, or others

There are various ways to introduce classification methods. Here we consider probably the most popular one


Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is

sgn(wTx)

For any data x, the predicted label is

1 if wTx ≥ 0, −1 otherwise
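As a tiny side illustration (mine, not from the original slides), the decision function sgn(wTx) is only a few lines of C:

```c
#include <stddef.h>

/* Illustrative sketch: predict the label sgn(w^T x) for one instance.
   w and x are arrays of length n (the number of features). */
int predict(const double *w, const double *x, size_t n) {
    double inner = 0.0;
    for (size_t i = 0; i < n; i++)
        inner += w[i] * x[i];       /* inner product w^T x */
    return inner >= 0.0 ? 1 : -1;   /* sgn: label +1 or -1 */
}
```

For example, with w = (1, −1), the point x = (2, 1) gets label +1 and x = (0, 3) gets label −1.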


Minimizing Training Errors (Cont’d)

The two-dimensional situation

(Figure: circles ◦ on one side and triangles △ on the other, separated by the line wTx = 0)

This seems to be quite restricted, but practically x is in a much higher dimensional space


Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)

Ideally we should use the 0–1 training loss:

ξ(w; x, y) = 1 if y wTx < 0, 0 otherwise


Minimizing Training Errors (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

(Figure: the 0–1 loss ξ(w; x, y) as a function of −y wTx)

We need continuous approximations


Common Loss Functions

Hinge loss (l1 loss)

ξL1(w; x, y) ≡ max(0, 1 − y wTx) (1)

Logistic loss

ξLR(w; x, y) ≡ log(1 + e^(−y wTx)) (2)

Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)

SVM and LR are two very fundamental classification methods


Common Loss Functions (Cont’d)

(Figure: ξL1 and ξLR as functions of −y wTx)

Logistic regression is closely related to SVM

Their performance is usually similar


Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs



See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy

This is useless

When training a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error


(Figure: filled symbols: training data; triangles: testing data)



To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme

One idea is to make w values closer to zero

We can add, for example,

wTw/2 or ‖w‖1

to the function that is minimized


General Form of Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1

l: # of data, n: # of features

min_w f(w), where

f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; xi, yi)

wTw/2: regularization term

ξ(w; x, y): loss function

C: regularization parameter (chosen by users)
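As a sketch (my own naming and row-by-row data layout), f(w) with the hinge loss (1) can be evaluated as:

```c
#include <stddef.h>
#include <math.h>

/* f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)
   X: l-by-n matrix stored row by row, Y: labels +1/-1. */
double objective(const double *w, size_t n,
                 const double *X, const double *Y, size_t l, double C) {
    double f = 0.0;
    for (size_t j = 0; j < n; j++)
        f += 0.5 * w[j] * w[j];               /* regularization term w^T w / 2 */
    for (size_t i = 0; i < l; i++) {
        double wTx = 0.0;
        for (size_t j = 0; j < n; j++)
            wTx += w[j] * X[i*n + j];
        double m = 1.0 - Y[i] * wTx;          /* hinge loss of instance i */
        if (m > 0.0)
            f += C * m;
    }
    return f;
}
```

Swapping in the logistic loss (2) only changes the per-instance term.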


From Linear to Nonlinear

We now have linear classification because the decision function

sgn(wTx) is linear

We will see that a neural network (NN) is a nonlinear classifier


Neural Networks

We will explain neural networks using the same framework as linear classification

Among various types of networks, we consider

fully-connected feed-forward networks for multi-class classification


Neural Networks (Cont’d)

Our training set includes (yi, xi), i = 1, . . . , l

xi ∈ Rn0 is the feature vector

yi ∈ RK is the label vector

K: # of classes

If xi is in class k, then

yi = [0, . . . , 0, 1, 0, . . . , 0]T ∈ RK, where the single 1 is in position k

A neural network maps each feature vector to one of the class labels by the connection of nodes.


Neural Networks (Cont’d)

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

(Figure: a fully-connected network; nodes in consecutive layers are fully connected by weighted edges)

Neural Networks (Cont’d)

The weight matrix Wm at the mth layer is

Wm =
⎡ wm11 · · · wm1,nm ⎤
⎢ . . . ⎥
⎣ wmnm−1,1 · · · wmnm−1,nm ⎦

nm: # neurons at layer m

nm−1: # neurons at layer m − 1

L: number of layers

n0 = # of features, nL = # of classes

Let zm be the input of the mth layer. z0 = x and zL is the output


Neural Networks (Cont’d)

From the (m − 1)th layer to the mth layer:

sm = (Wm)T zm−1,
zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function

We collect all variables:

θ = [vec(W1); . . . ; vec(WL)] ∈ Rn

n: total # of variables = n0n1 + · · · + nL−1nL


Neural Networks (Cont’d)

We solve the following optimization problem:

min_θ f(θ), where

f(θ) = θTθ/2 + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)

C: regularization parameter

zL(θ) ∈ RnL: last-layer output vector of x

ξ(zL; x, y): loss function. Example:

ξ(zL; x, y) = ‖zL − y‖2


Neural Networks (Cont’d)

That is, we hope the output is close to the label vector:

y = [0, . . . , 0, 1, 0, . . . , 0]T and zL ≈ [±0.00· · · , . . . , 1.00· · · , . . . , ±0.00· · ·]T

so the entry of zL at the true class is close to one and all other entries are close to zero


Neural Networks (Cont’d)

The formulation is as before, but the loss function is more complicated

This NN method has been developed for decades.

So what’s new about deep learning?

Though there are some technical advances, one major thing is that more layers often lead to better results


Solving Optimization Problems I

How do you minimize f(θ)?

Usually by a descent method

That is, we find a sequence

θ1, θ2, θ3, . . . such that


Solving Optimization Problems II


lim_{k→∞} f(θk)

exists and is the smallest function value

Now you see that calculus is used. You need to know what a limit is

But how do we obtain

f(θk+1) < f(θk)?

Usually by gradient descent


Gradient Descent I

Taylor expansion. If

f(θ): R1 → R1, then

f(θk + d) = f(θk) + f′(θk)d + (1/2)f″(θk)d2 + · · ·

This is the one-dimensional case

Now we have multiple variables: f(θ): Rn → R1


Gradient Descent II

So we need the multi-dimensional Taylor expansion

f(θk + d) = f(θk) + ∇f(θk)Td + · · ·

We don’t get into details, but ∇f(θ) is called the gradient

The gradient is the multi-dimensional first derivative:

∇f(θ) = [∂f(θ)/∂θ1, . . . , ∂f(θ)/∂θn]T

Gradient Descent III


f(θk + d) ≈ f(θk) + ∇f(θk)Td

and we can find d by

min_d ∇f(θk)Td

But easily this value goes to −∞:

if d gives ∇f(θk)Td = −100, then 2d gives −200, and so on

Gradient Descent IV

Thus we need to confine the search space of d:

min_d ∇f(θk)Td subject to ‖d‖ = 1 (3)

Here ‖d‖ means the length of d:

‖d‖ = sqrt(d12 + · · · + dn2)

How to solve (3)?


Gradient Descent V

We will use the Cauchy inequality

(a1b1 + · · · + anbn)2 ≤ (a12 + · · · + an2)(b12 + · · · + bn2)

When

d = −∇f(θk)/‖∇f(θk)‖,

we have

|∇f(θk)Td|2 = ‖∇f(θk)‖2


Gradient Descent VI

Equality holds for the Cauchy inequality

Thus the minimum of (3) is obtained

However, we may not have

f(θk + d) < f(θk)

Instead, we need to search for a step size


Gradient Descent VII

Specifically we try

α = 1, 1/2, 1/4, 1/8, . . .

until

f(θk + αd) < f(θk) + σα∇f(θk)Td, (4)

where σ ∈ (0, 1/2)

The condition (4) is usually called the sufficient decrease condition in optimization


Gradient Descent VIII

While θ isn’t optimal
    d ← −∇f(θ), α ← 1
    while true
        if (4) holds
            break
        else
            α ← α/2
    θ ← θ + αd

The procedure to search for α is called line search
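The whole procedure can be sketched in C on a toy two-variable quadratic; the objective function, σ = 0.01, and the small-gradient stopping test are my own choices for illustration:

```c
#include <math.h>

#define DIM 2
/* Toy objective f(t) = t1^2 + 2 t2^2 and its gradient (my example). */
static double toy_f(const double t[DIM]) { return t[0]*t[0] + 2.0*t[1]*t[1]; }
static void toy_grad(const double t[DIM], double g[DIM]) {
    g[0] = 2.0*t[0]; g[1] = 4.0*t[1];
}

/* Gradient descent with the backtracking line search of condition (4). */
void gradient_descent(double theta[DIM], int iters, double sigma) {
    for (int k = 0; k < iters; k++) {
        double g[DIM], d[DIM], gTd = 0.0;
        toy_grad(theta, g);
        if (g[0]*g[0] + g[1]*g[1] < 1e-20)
            return;                              /* (nearly) optimal: stop */
        for (int i = 0; i < DIM; i++) { d[i] = -g[i]; gTd += g[i]*d[i]; }
        double alpha = 1.0;
        for (;;) {                               /* try alpha = 1, 1/2, 1/4, ... */
            double trial[DIM];
            for (int i = 0; i < DIM; i++) trial[i] = theta[i] + alpha*d[i];
            if (toy_f(trial) < toy_f(theta) + sigma*alpha*gTd)
                break;                           /* sufficient decrease (4) holds */
            alpha /= 2.0;
        }
        for (int i = 0; i < DIM; i++) theta[i] += alpha*d[i];
    }
}
```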


Gradient Descent IX

Instead of

α = 1, 1/2, 1/4, 1/8, . . .

we can use

α = 1, β, β2, β3, . . . ,

where 0 < β < 1


Step-size Search I


Why is σ ∈ (0, 1/2)?

The use of 1/2 is for convergence, though we won’t discuss details

Q: how do we know that the line search procedure is guaranteed to stop?


Step-size Search II

In fact we can prove that if

∇f(θ)Td < 0, (5)

then there exists ᾱ > 0 such that

f(θ + αd) < f(θ) + σα∇f(θ)Td, ∀α ∈ (0, ᾱ)

Any d satisfying (5) is called a descent direction


Step-size Search III

Proof: assume the result is wrong. Then there exists a sequence {αt} with

lim_{t→∞} αt = 0 and αt > 0, ∀t

such that

f(θ + αtd) ≥ f(θ) + σαt∇f(θ)Td, ∀t


Step-size Search IV



Then

lim_{t→∞} [f(θ + αtd) − f(θ)] / αt = ∇f(θ)Td ≥ σ∇f(θ)Td

However, ∇f(θ)Td < 0 and σ < 1 then give (1 − σ)∇f(θ)Td ≥ 0, i.e., ∇f(θ)Td ≥ 0, a contradiction


Step-size Search V

Q: how do you formally say

lim_{α→0} [f(θ + αd) − f(θ)] / α = ∇f(θ)Td?

Define

g(α) ≡ f(θ + αd)

We essentially calculate

lim_{α→0} [g(α) − g(0)] / α (6)

By the definition of the first derivative


Step-size Search VI

(6) is g′(0)

But what are g′(α) and then g′(0)?

We have

g′(α) = ∂f(θ + αd)/∂(θ1 + αd1) · ∂(θ1 + αd1)/∂α + · · · + ∂f(θ + αd)/∂(θn + αdn) · ∂(θn + αdn)/∂α

= ∂f(θ + αd)/∂θ1 · d1 + · · · + ∂f(θ + αd)/∂θn · dn


Step-size Search VII


g′(0) = ∇f(θ)Td

This is the multi-variable chain rule

Statement of the multi-variable chain rule: let x = x(t) and y = y(t) be differentiable at t and suppose

z = f(x, y)

Step-size Search VIII

is differentiable at (x, y). Then z(t) = f(x(t), y(t)) is differentiable at t and

dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
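The identity g′(0) = ∇f(θ)Td can be checked numerically; here is a sketch on a made-up two-variable function f(θ) = θ1² + θ1θ2 (my example, not from the slides):

```c
#include <math.h>

/* f(t1,t2) = t1^2 + t1 t2, so g(alpha) = f(theta + alpha d). */
static double f2(double t1, double t2) { return t1*t1 + t1*t2; }

/* Finite-difference quotient (g(alpha) - g(0)) / alpha, as in (6). */
double dirderiv_fd(double t1, double t2, double d1, double d2, double alpha) {
    return (f2(t1 + alpha*d1, t2 + alpha*d2) - f2(t1, t2)) / alpha;
}

/* Exact value from the chain rule: grad f = (2 t1 + t2, t1), so
   g'(0) = (2 t1 + t2) d1 + t1 d2. */
double dirderiv_exact(double t1, double t2, double d1, double d2) {
    return (2.0*t1 + t2)*d1 + t1*d2;
}
```

For small α the quotient approaches the chain-rule value, which is the content of (6).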



Gradient of NN I

Recall that the NN optimization problem is

min_θ f(θ), where

f(θ) = θTθ/2 + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)

How to calculate the gradient?

Now zL is actually a function of all variables zL(θ)


Gradient of NN II

What we will calculate is

∇f(θ) = θ + C Σ_{i=1}^{l} ∇θ ξ(zL,i(θ); xi, yi)

So what is

∇θ ξ(zL(θ); x, y)?


Gradient of NN III

We have

∂ξ(zL(θ); x, y)/∂θ1
= (∂ξ/∂zL1)(∂zL1/∂θ1) + · · · + (∂ξ/∂zLnL)(∂zLnL/∂θ1)

∂ξ(zL(θ); x, y)/∂θ2
= (∂ξ/∂zL1)(∂zL1/∂θ2) + · · · + (∂ξ/∂zLnL)(∂zLnL/∂θ2)

. . .

Gradient of NN IV


∇θξ(zL(θ); x, y) is therefore the product of the matrix

⎡ ∂zL1/∂θ1 · · · ∂zLnL/∂θ1 ⎤
⎢ . . . ⎥
⎣ ∂zL1/∂θn · · · ∂zLnL/∂θn ⎦

and the vector ∇zLξ(zL(θ); x, y). The matrix above


Gradient of NN V

is called the Jacobian of zL(θ)

We see that the chain rule is used again
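The same chain-rule pattern can be verified on a tiny made-up composition, with z(θ) playing the role of zL(θ) (everything here is my own toy example):

```c
#include <math.h>

/* z(theta) = (theta1^2, theta1*theta2), xi(z) = (z1 - 1)^2 + z2^2. */
double xi_of_theta(double t1, double t2) {
    double z1 = t1*t1, z2 = t1*t2;
    return (z1 - 1.0)*(z1 - 1.0) + z2*z2;
}

/* Chain rule: d xi / d theta1
   = (d xi / d z1)(d z1 / d theta1) + (d xi / d z2)(d z2 / d theta1). */
double dxi_dtheta1(double t1, double t2) {
    double z1 = t1*t1, z2 = t1*t2;
    return 2.0*(z1 - 1.0)*(2.0*t1) + 2.0*z2*t2;
}
```

A finite-difference quotient of xi_of_theta in θ1 agrees with dxi_dtheta1, mirroring the Jacobian computation above.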

There are a lot more details about the gradient evaluation, but let’s stop here

The point is that the techniques behind deep learning are quite complicated and need lots of mathematics

Next let’s switch to the issue of computation



1 Introduction

2 Neural Networks

3 Matrix Computation in NN


Matrix Multiplication I

We will show that to calculate f (θ)

the main operation from one layer to the next is a matrix-matrix product

Recall that from the (m − 1)th layer to the mth layer

sm = (Wm)T zm−1,
zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function.


Matrix Multiplication II

Now each instance xi has its own zm−1,i

So we have zm−1,1, . . . , zm−1,l if there are l training instances

Thus

[sm,1 · · · sm,l] = (Wm)T [zm−1,1 · · · zm−1,l] ∈ Rnm×l,

where Wm ∈ Rnm−1×nm


Matrix Multiplication III

The main cost in calculating the function value of an NN is the

matrix-matrix product between every two layers

You know how to do matrix multiplication.

C = AB is a mathematical operation with

Cij = Σ_k Aik Bkj


Matrix Multiplication IV

At the first glance, it has nothing to do with computer science

But have you ever thought about a question: why do people use GPU for deep learning?

An Internet search shows answers like the following:


“Deep learning involves huge amount of matrix multiplications and other operations which can be massively parallelized and thus sped up on GPU-s.”


Matrix Multiplication V

As computer science students, we need to know a bit more detail

I am going to use CPU rather than GPU to give an illustration of how computer architectures may affect a mathematics operation


Optimized BLAS: an Example by Using Block Algorithms I

Let’s test the matrix multiplication

A C program:

#define n 2000

double a[n][n], b[n][n], c[n][n];

int main() {

int i, j, k;


Optimized BLAS: an Example by Using Block Algorithms II

for (i=0;i<n;i++)
  for (j=0;j<n;j++) {
    a[i][j]=1; b[i][j]=1;
  }

for (i=0;i<n;i++)
  for (j=0;j<n;j++) {
    c[i][j]=0;
    for (k=0;k<n;k++)
      c[i][j] += a[i][k]*b[k][j];
  }

return 0;
}



Optimized BLAS: an Example by Using Block Algorithms III


A Matlab program:

n = 2000;
A = randn(n,n); B = randn(n,n);
t = cputime; C = A*B; t = cputime - t

To remove the effect of multi-threading, use

matlab -singleCompThread

Timing is an issue


Optimized BLAS: an Example by Using Block Algorithms IV

cjlin@linux1:~$ matlab -singleCompThread

>> a = randn(3000,3000);tic; c = a*a; toc Elapsed time is 4.520684 seconds.

>> a = randn(3000,3000);t=cputime; c = a*a;

t=cputime-t t =



Optimized BLAS: an Example by Using Block Algorithms V

cjlin@linux1:~$ matlab

>> a = randn(3000,3000);tic; c = a*a; toc Elapsed time is 1.180799 seconds.

>> a = randn(3000,3000);t=cputime; c = a*a;

t=cputime-t t =


Matlab is much faster than the C code we wrote ourselves


Optimized BLAS: an Example by Using Block Algorithms VI

Optimized BLAS: data locality is exploited

Use the highest (fastest) level of memory as much as possible

Block algorithms: transfer sub-matrices between different levels of storage and localize operations to achieve good performance


Memory Hierarchy I


Registers
↓
Cache
↓
Main Memory
↓
Secondary storage (Disk)


Memory Hierarchy II

↑: increasing in speed

↓: increasing in capacity

When I studied computer architecture, I didn’t quite understand why this setting is so useful

But from optimized BLAS I realize that it is extremely powerful


Memory Management I

Page fault: operand not available in main memory; it is transported from secondary memory and (usually) overwrites the least recently used page

I/O increases the total time

An example: C = AB + C, n = 1,024

Assumption: a page holds 65,536 doubles = 64 columns

16 pages for each matrix

48 pages for three matrices


Memory Management II

Assumption: 16 pages of memory are available and matrix access is column oriented

A = [1 2; 3 4]

column oriented storage: 1 3 2 4

row oriented storage: 1 2 3 4

Accessing each row of A: 16 page faults, since 1024/64 = 16

Assumption: each access brings a continuous segment of data into one page

Approach 1: inner product


Memory Management III

for i=1:n
  for j=1:n
    for k=1:n
      c(i,j) = a(i,k)*b(k,j)+c(i,j);
    end
  end
end

We use a matlab-like syntax here

At each (i,j): accessing the row a(i, 1:n) causes 16 page faults


Memory Management IV

Total: 1024^2 × 16 page faults, i.e., at least 16 million page faults

Approach 2:

for j=1:n
  for k=1:n
    for i=1:n
      c(i,j) = a(i,k)*b(k,j)+c(i,j);
    end
  end
end


Memory Management V

For each j, we access all columns of A

A needs 16 pages, but B and C take space as well, so A must be read again for every j

For each j: 16 page faults for A, i.e., 1024 × 16 page faults in total

B, C: 16 page faults each

Approach 3: block algorithms (nb = 256)


Memory Management VI

for j=1:nb:n
  for k=1:nb:n
    for jj=j:j+nb-1
      for kk=k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj)+c(:,jj);
      end
    end
  end
end

In MATLAB, 1:256:1024 means 1, 257, 513, 769
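A small C sketch of such a block algorithm, with toy sizes n = 8 and nb = 4 of my choosing (row-major C arrays), checked against the naive triple loop:

```c
#define MATN 8
#define MATNB 4

/* Naive triple loop: C = A B. */
void matmul_naive(double A[MATN][MATN], double B[MATN][MATN], double C[MATN][MATN]) {
    for (int i = 0; i < MATN; i++)
        for (int j = 0; j < MATN; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < MATN; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Block version: loop over MATNB-sized blocks of columns (jb) and of the
   inner dimension (kb), so each sub-matrix is reused while resident. */
void matmul_blocked(double A[MATN][MATN], double B[MATN][MATN], double C[MATN][MATN]) {
    for (int i = 0; i < MATN; i++)
        for (int j = 0; j < MATN; j++)
            C[i][j] = 0.0;
    for (int jb = 0; jb < MATN; jb += MATNB)
        for (int kb = 0; kb < MATN; kb += MATNB)
            for (int i = 0; i < MATN; i++)
                for (int k = kb; k < kb + MATNB; k++)
                    for (int j = jb; j < jb + MATNB; j++)
                        C[i][j] += A[i][k] * B[k][j];
}
```

At these toy sizes there is no speed difference; the point is only that the block ordering computes the same product.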


Memory Management VII

Note that we calculate

⎡ A11 · · · A14 ⎤ ⎡ B11 · · · B14 ⎤   ⎡ A11B11 + · · · + A14B41 · · · ⎤
⎢ . . . ⎥ ⎢ . . . ⎥ = ⎢ . . . ⎥
⎣ A41 · · · A44 ⎦ ⎣ B41 · · · B44 ⎦   ⎣ . . . ⎦

Memory Management VIII

Each block: 256 × 256

C11 = A11B11 + · · · + A14B41
C21 = A21B11 + · · · + A24B41
C31 = A31B11 + · · · + A34B41
C41 = A41B11 + · · · + A44B41

For each (j, k), Bk,j is used to add A:,kBk,j to C:,j


Memory Management IX

Example: when j = 1, k = 1

C11 ← C11 + A11B11
. . .
C41 ← C41 + A41B11

Use Approach 2 for A:,1B11

A:,1: 256 columns, so 1024 × 256/65536 = 4 pages

A:,1, . . . , A:,4: 4 × 4 = 16 page faults in calculating C:,1


Memory Management X

B: 16 page faults, C: 16 page faults

Now let’s compare approaches 1 and 2

We see that approach 2 is faster. Why? Its innermost loop walks down a column, which matches the assumed column-oriented storage

Note that the C language is row-oriented rather than column-oriented, so in a C program the roles of the two approaches are reversed
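The row-by-row storage of C arrays mentioned above can be observed directly with pointer arithmetic (a toy 4 × 4 array of my choosing):

```c
#include <stddef.h>

/* In C, a[i][j] and a[i][j+1] are adjacent in memory, while a[i][j] and
   a[i+1][j] are a whole row (here 4 doubles) apart. */
int row_distance(void) {
    static double a[4][4];
    ptrdiff_t same_row = &a[0][1] - &a[0][0];   /* distance 1 */
    ptrdiff_t next_row = &a[1][0] - &a[0][0];   /* distance 4 = row length */
    return (int)(next_row / same_row);
}
```

So in C, traversing a row is the contiguous direction, which is why the page-fault analysis flips relative to column-oriented storage.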


Optimized BLAS Implementations


OpenBLAS: an optimized BLAS library based on GotoBLAS2 (see the story in the next slide)

It’s a successful open-source project developed in China

Intel MKL (Math Kernel Library)


Some Past Stories about Optimized BLAS

GotoBLAS by Kazushige Goto



See the NY Times article: “Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips”



Homework I

We would like to compare the time for multiplying two 8,000 by 8,000 matrices by

Directly compiling the reference BLAS sources

Intel MKL

OpenBLAS

You can use the BLAS or CBLAS interface

Try to comment on the use of multi-core processors.


Conclusions I

In general I don’t think we should have too many required courses

However, some of them are very basic and are very useful in advanced topics

Some students do not think basic mathematics courses (e.g., calculus) are CS courses. But that may not be the case

When I evaluate applications for graduate schools by checking their transcripts, very often I first look at the grade of calculus


Conclusions II

I hope that through this lecture you have seen that some mathematics techniques are very related to CS topics



