Mathematics and Computer Science
Chih-Jen Lin
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Introduction
When Prof. Chao asked me to give a lecture here, I wasn’t quite sure what I should talk about
I don’t have a new and emerging topic to share with you here.
So instead I plan to talk about a topic
“Mathematics and Computer Science”
But why this topic?
I will explain my motivation
Introduction (Cont’d)
Some time ago, in a town-hall meeting with some faculty members, one student asked why calculus is a required course
I heard this from some faculty members as I wasn’t there
Anyway, I think it really happened. Here is the reaction from a professor:
He said “When we were students, we didn’t ask why xxx is a required course. We just took it.”
Introduction (Cont’d)
Then I asked myself if it’s possible to give you some reasons
That leads to this lecture
The Role of Mathematics in CS I
One reason why some students don’t think calculus is important is that they think
CS = programming
But many (or most) CS areas are beyond programming
One issue is that in our required courses, things like calculus are seldom used
Students can see that discrete mathematics is related to algorithms
The Role of Mathematics in CS II
But they find calculus/linear algebra/statistics useful only after taking computer vision, signal processing, machine learning and others
These are more advanced courses
Note that CS is a rapidly changing area
Before the Internet, many CS companies just hired programmers
For example, for Windows and Office
development, Microsoft hired many programmers with an undergraduate degree
The Role of Mathematics in CS III
Then Google started hiring many people with Ph.D. or master’s degrees
Compared with traditional software development, in the Internet era, analytics skills are more
important
This doesn’t mean every engineer in big Internet companies has the job of developing analytics tools (e.g., deep learning software)
Instead, most are users. They don’t need to know all the sophisticated details, but they need some basic understanding
The Role of Mathematics in CS IV
For example, as a user of deep learning, you probably need to roughly know how it works
Otherwise you might not know what you are doing and what kinds of results you will get
To have a basic understanding of these things, you need some mathematics background
I am going to illustrate this point in the lecture
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Neural Networks
To discuss why mathematics is important in some CS areas, we can consider many examples
We decided to talk about neural networks because deep learning is (incredibly) hot
There are many types of neural networks, but we will consider the simplest one
It’s the fully connected network for multi-class classification
So let’s check what data classification is
Data Classification
We extract information to build a model for future prediction
Data Classification (Cont’d)
The main task is finding a model
This setting is also called supervised learning
Data Classification (Cont’d)
Given training data in different classes (labels known)
Predict test data (labels unknown)
Classic example: medical diagnosis
Find a patient’s blood pressure, weight, etc.
After several years, know if he/she recovers
Build a machine learning model
New patient: find blood pressure, weight, etc.
Prediction
Minimizing Training Errors
Basically a classification method starts with minimizing the training errors
    min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a support vector machine, a neural network, or others
There are various ways to introduce classification methods. Here we consider probably the most popular one
Minimizing Training Errors (Cont’d)
For simplicity, let’s consider the model to be a vector w
That is, the decision function is sgn(wTx)
For any data x, the predicted label is

    1   if wTx ≥ 0
    −1  otherwise
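As a tiny illustration (a C sketch of ours, not from the lecture; the function name and data layout are our own choices), the decision function can be computed as:

    #include <stdio.h>

    /* predicted label: +1 if w^T x >= 0, -1 otherwise */
    int predict(const double *w, const double *x, int n)
    {
        double s = 0;
        int i;
        for (i = 0; i < n; i++)
            s += w[i]*x[i];
        return (s >= 0) ? 1 : -1;
    }

    int main(void)
    {
        double w[] = {1.0, -2.0};
        double x[] = {0.5, 0.1};   /* w^T x = 0.3 >= 0, so the label is +1 */
        printf("%d\n", predict(w, x, 2));
        return 0;
    }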
Minimizing Training Errors (Cont’d)
The two-dimensional situation
[Figure: circles (◦) and triangles (△) in two dimensions separated by the line wTx = 0]
This seems to be quite restricted, but practically x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:

    ξ(w; x, y) = 1 if ywTx < 0,
                 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss ξ(w; x, y) plotted against −ywTx]
We need continuous approximations
Common Loss Functions
Hinge loss (l1 loss)
    ξL1(w; x, y) ≡ max(0, 1 − ywTx)    (1)
Logistic loss
    ξLR(w; x, y) ≡ log(1 + e−ywTx)    (2)
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)
SVM and LR are two very fundamental classification methods
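As a sketch (our own C code, not part of the lecture), the two losses can be written directly from (1) and (2), taking ywTx as the argument:

    #include <math.h>

    /* hinge (l1) loss: max(0, 1 - y w^T x) */
    double hinge_loss(double ywTx)
    {
        return ywTx < 1 ? 1 - ywTx : 0;
    }

    /* logistic loss: log(1 + exp(-y w^T x)) */
    double logistic_loss(double ywTx)
    {
        return log(1 + exp(-ywTx));
    }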
Common Loss Functions (Cont’d)
[Figure: the hinge loss ξL1 and the logistic loss ξLR plotted against −ywTx]
Logistic regression is closely related to SVM
Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure: an illustration with training points and testing points]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w’s values less extreme.
One idea is to make w’s values closer to zero
We can add, for example,

    wTw/2 or ‖w‖1

to the function that is minimized
General Form of Linear Classification
Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features

    min_w f(w),  f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; xi, yi)

wTw/2: regularization term
ξ(w; x, y): loss function
C: regularization parameter (chosen by users)
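To make the formula concrete, here is a C sketch of ours (with the hinge loss and a simple row-by-row data layout; none of this is from a real library) that evaluates f(w):

    /* f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)
       X: l-by-n data matrix stored row by row (our own layout) */
    double objective(const double *w, const double *X, const double *y,
                     int l, int n, double C)
    {
        double reg = 0, loss = 0;
        int i, j;
        for (j = 0; j < n; j++)
            reg += w[j]*w[j];                  /* w^T w */
        for (i = 0; i < l; i++) {
            double wTx = 0, m;
            for (j = 0; j < n; j++)
                wTx += w[j]*X[i*n + j];
            m = 1 - y[i]*wTx;
            if (m > 0)                          /* hinge loss of instance i */
                loss += m;
        }
        return reg/2 + C*loss;
    }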
From Linear to Nonlinear
We now have linear classification because the decision function
sgn(wTx) is linear
We will see that a neural network (NN) is a nonlinear classifier
Neural Networks
We will explain neural networks using the same framework as for linear classification
Among various types of networks, we consider
fully-connected feed-forward networks for multi-class classification
Neural Networks (Cont’d)
Our training set includes (yi, xi), i = 1, . . . , l . xi ∈ Rn0 is the feature vector.
yi ∈ RK is the label vector.
K : # of classes
If xi is in class k, then

    yi = [0, . . . , 0, 1, 0, . . . , 0]T ∈ RK

with the single 1 in the kth position (preceded by k − 1 zeros)
A neural network maps each feature vector to one of the class labels by the connection of nodes.
Neural Networks (Cont’d)
Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).
[Figure: a small fully-connected network with nodes A0, B0, C0 in the input layer, A1, B1 in the hidden layer, and A2, B2, C2 in the output layer]
Neural Networks (Cont’d)
The weight matrix Wm at the mth layer is

    Wm = [ wm11        · · ·  wm1,nm
            ...                ...
           wmnm−1,1    · · ·  wmnm−1,nm ]  ∈ Rnm−1×nm,

nm: # neurons at layer m
nm−1: # neurons at layer m − 1
L: number of layers
n0 = # of features, nL = # of classes
Let zm−1 be the input of the mth layer, so z0 = x, and zL is the final output
Neural Networks (Cont’d)
From the (m − 1)th layer to the mth layer:

    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function.
We collect all variables:

    θ = [vec(W1); . . . ; vec(WL)] ∈ Rn

n: total # of variables = n0n1 + · · · + nL−1nL
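A minimal C sketch of one step of this mapping (our own code; the row-by-row storage of Wm and the choice of the sigmoid as σ are assumptions for illustration):

    #include <math.h>

    /* one layer: s = (W^m)^T z^{m-1}, z^m_j = sigma(s_j)
       W: n_prev-by-n_cur matrix stored row by row */
    void forward_layer(const double *W, const double *z_prev,
                       double *z_cur, int n_prev, int n_cur)
    {
        int i, j;
        for (j = 0; j < n_cur; j++) {
            double s = 0;
            for (i = 0; i < n_prev; i++)
                s += W[i*n_cur + j]*z_prev[i];   /* ((W^m)^T z^{m-1})_j */
            z_cur[j] = 1.0/(1.0 + exp(-s));      /* sigma(s_j), here a sigmoid */
        }
    }

Applying this layer by layer, L times, produces zL from z0 = x.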
Neural Networks (Cont’d)
We solve the following optimization problem:

    min_θ f(θ),

where

    f(θ) = (1/2)θTθ + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)

C: regularization parameter
zL(θ) ∈ RnL: last-layer output vector of x
ξ(zL; x, y): loss function. Example:
    ξ(zL; x, y) = ‖zL − y‖²
Neural Networks (Cont’d)
That is, we hope

    y = [0, . . . , 0, 1, 0, . . . , 0]T

and

    zL ≈ [±0.00 · · · , . . . , ±0.00 · · · , 1.00 · · · , ±0.00 · · · , . . . , ±0.00 · · · ]T

i.e., the output for the true class is close to 1 and the others are close to 0
Neural Networks (Cont’d)
The formulation is as before, but the loss function is more complicated
This NN method has been developed for decades.
So what’s new about deep learning?
Though there are some technical advances, one major thing is that more layers often lead to better results
Solving Optimization Problems I
How do you minimize f(θ)?
Usually by a descent method
That is, we find a sequence θ1, θ2, θ3, . . . such that

    f(θ1) > f(θ2) > f(θ3) > · · ·
Solving Optimization Problems II
Hopefully
    lim_{k→∞} f(θk)
exists and is the smallest function value
Now you see that calculus is used. You need to know what a limit is
But how to obtain
    f(θk+1) < f(θk)?
Usually by gradient descent
Gradient Descent I
Taylor expansion. If
    f(θ): R1 → R1,
then
    f(θk + d) = f(θk) + f′(θk)d + (1/2)f′′(θk)d² + · · ·
This is the one-dimensional case
Now we have multiple variables
    f(θ): Rn → R1
Gradient Descent II
So we need the multi-dimensional Taylor expansion
    f(θk + d) = f(θk) + ∇f(θk)Td + · · ·
We don’t get into details, but ∇f(θ) is called the gradient
The gradient is the multi-dimensional first derivative:

    ∇f(θ) = [∂f(θ)/∂θ1, . . . , ∂f(θ)/∂θn]T
Gradient Descent III
Let
    f(θk + d) ≈ f(θk) + ∇f(θk)Td
and we can find d by
    min_d ∇f(θk)Td
But easily this value goes to −∞
If
    ∇f(θk)Td = −100,
then
    ∇f(θk)T(2d) = −200, ∇f(θk)T(4d) = −400, . . .
Gradient Descent IV
Thus we need to confine the search space of d:

    min_d ∇f(θk)Td
    subject to ‖d‖ = 1    (3)

Here ‖d‖ means the length of d:

    ‖d‖ = √(d1² + · · · + dn²)

How to solve (3)?
Gradient Descent V
We will use the Cauchy inequality

    (a1b1 + · · · + anbn)² ≤ (a1² + · · · + an²)(b1² + · · · + bn²)

When
    d = −∇f(θk)/‖∇f(θk)‖,
we have
    (∇f(θk)Td)² = ‖∇f(θk)‖²
Gradient Descent VI
so that equality holds in the Cauchy inequality
Thus the minimum of (3) is attained
However, we may not have
    f(θk + d) < f(θk)
Instead, we need to search for a step size
Gradient Descent VII
Specifically we try
    α = 1, 1/2, 1/4, 1/8, . . .
until
    f(θk + αd) < f(θk) + σα∇f(θk)Td,    (4)
where σ ∈ (0, 1/2).
The condition (4) is usually called the sufficient decrease condition in optimization
Gradient Descent VIII
While θ isn’t optimal
    d = −∇f(θ) and α ← 1
    While true
        If (4) holds, break
        Else α ← α/2
    θ ← θ + αd
The procedure to search for α is called line search
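Putting the pieces together, here is a self-contained C sketch of gradient descent with this back-tracking line search on a toy two-variable quadratic (the objective, σ = 0.01, and the stopping rule are our own choices for illustration):

    #include <stdio.h>
    #include <math.h>

    #define N 2

    /* a toy objective f(t) = (t1-1)^2 + 2(t2+0.5)^2 */
    double f(const double *t)
    {
        return (t[0]-1)*(t[0]-1) + 2*(t[1]+0.5)*(t[1]+0.5);
    }

    void grad(const double *t, double *g)
    {
        g[0] = 2*(t[0]-1);
        g[1] = 4*(t[1]+0.5);
    }

    int main(void)
    {
        double theta[N] = {0, 0}, g[N], d[N], theta_new[N], sigma = 0.01;
        int iter, i;
        for (iter = 0; iter < 100; iter++) {
            double gTd = 0, alpha = 1, fval = f(theta);
            grad(theta, g);
            for (i = 0; i < N; i++) {          /* d = -grad f(theta) */
                d[i] = -g[i];
                gTd += g[i]*d[i];              /* grad f(theta)^T d */
            }
            if (sqrt(-gTd) < 1e-8)             /* gradient (almost) zero: stop */
                break;
            while (1) {                        /* back-tracking line search */
                for (i = 0; i < N; i++)
                    theta_new[i] = theta[i] + alpha*d[i];
                if (f(theta_new) < fval + sigma*alpha*gTd)   /* condition (4) */
                    break;
                alpha /= 2;
            }
            for (i = 0; i < N; i++)
                theta[i] = theta_new[i];
        }
        printf("theta = (%g, %g)\n", theta[0], theta[1]);
        return 0;
    }

Running it should print a point close to (1, −0.5), the minimizer of the toy objective.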
Gradient Descent IX
Instead of
    α = 1, 1/2, 1/4, 1/8, . . .
we can use
    α = 1, β, β², β³, . . . ,
where
    0 < β < 1
Step-size Search I
Why σ ∈ (0, 1/2)?
The use of 1/2 is for convergence, though we won’t discuss details
Q: how do we know that the line search procedure is guaranteed to stop?
Step-size Search II
In fact we can prove that if
    ∇f(θ)Td < 0    (5)
then there exists α∗ > 0 such that
    f(θ + αd) < f(θ) + σ∇f(θ)T(αd), ∀α ∈ (0, α∗)
Any d satisfying (5) is called a descent direction
Step-size Search III
Proof: assume the result is wrong. Then there exists a sequence {αt} with
    lim_{t→∞} αt = 0 and αt > 0, ∀t
such that
    f(θ + αtd) ≥ f(θ) + σαt∇f(θ)Td, ∀t
Step-size Search IV
Then
    lim_{t→∞} [f(θ + αtd) − f(θ)] / αt = ∇f(θ)Td ≥ σ∇f(θ)Td
However,
    ∇f(θ)Td < 0 and σ < 1
cause a contradiction (the inequality above would force σ ≥ 1)
Step-size Search V
Q: how do you formally say

    lim_{α→0} [f(θ + αd) − f(θ)] / α = ∇f(θ)Td?

Let
    g(α) ≡ f(θ + αd)
We essentially calculate

    lim_{α→0} [g(α) − g(0)] / α    (6)

By the definition of the first derivative,
Step-size Search VI
(6) is g′(0)
But what are g′(α) and then g′(0)?
We have

    g′(α) = [∂f(θ + αd)/∂θ1] [∂(θ1 + αd1)/∂α] + · · · + [∂f(θ + αd)/∂θn] [∂(θn + αdn)/∂α]
          = [∂f(θ + αd)/∂θ1] d1 + · · · + [∂f(θ + αd)/∂θn] dn
Step-size Search VII
and
    g′(0) = ∇f(θ)Td
This is the multi-variable chain rule
Statement of the multi-variable chain rule: let x = x(t) and y = y(t) be differentiable at t and suppose
    z = f(x, y)
Step-size Search VIII
is differentiable at (x, y). Then z(t) = f(x(t), y(t)) is differentiable at t and

    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
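A quick numerical check of this rule in C (the particular functions x(t) = cos t, y(t) = t², z = xy are our own toy choices, not from the lecture):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double t = 0.7, h = 1e-6;
        double x = cos(t), y = t*t;
        /* chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt)
                             = y*(-sin t) + x*(2t)               */
        double chain = y*(-sin(t)) + x*(2*t);
        /* direct numerical derivative of z(t) = cos(t)*t*t      */
        double numer = (cos(t+h)*(t+h)*(t+h) - cos(t-h)*(t-h)*(t-h))/(2*h);
        printf("chain rule: %.8f  numerical: %.8f\n", chain, numer);
        return 0;
    }

The two printed values should agree to several digits.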
Gradient of NN I
Recall that the NN optimization problem is

    min_θ f(θ), where f(θ) = (1/2)θTθ + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi)
How to calculate the gradient?
Now zL is actually a function of all variables zL(θ)
Gradient of NN II
What we will calculate is

    ∇f(θ) = θ + C Σ_{i=1}^{l} ∇θξ(zL,i(θ); xi, yi)

So what is
    ∇θξ(zL(θ); x, y)?
Gradient of NN III
We have

    ∂ξ(zL(θ); x, y)/∂θ1 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ1] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ1]

    ∂ξ(zL(θ); x, y)/∂θ2 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ2] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ2]

    ...
Gradient of NN IV
Thus

    ∇θξ(zL(θ); x, y) =
      [ ∂zL1(θ)/∂θ1  · · ·  ∂zLnL(θ)/∂θ1 ] [ ∂ξ/∂zL1  ]
      [     ...                ...       ] [   ...    ]
      [ ∂zL1(θ)/∂θn  · · ·  ∂zLnL(θ)/∂θn ] [ ∂ξ/∂zLnL ]

where the matrix of the ∂zLj(θ)/∂θi terms
Gradient of NN V
is called the Jacobian of zL(θ)
We see that the chain rule is used again
There are a lot more details about the gradient evaluation, but let’s stop here
The point is that the techniques behind deep learning are quite complicated and need lots of mathematics
Next let’s switch to the issue of computation
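A small practical aside (ours, not from the lecture): gradient formulas like these are often verified against finite differences. A hedged C sketch with an interface of our own design:

    #include <stdlib.h>
    #include <math.h>

    /* compare an analytic gradient with central finite differences;
       f: objective, gradf: its (claimed) gradient, theta: current point */
    double max_gradient_error(double (*f)(const double *, int),
                              void (*gradf)(const double *, int, double *),
                              double *theta, int n)
    {
        double h = 1e-6, max_err = 0;
        double *g = (double *) malloc(n*sizeof(double));
        int i;
        gradf(theta, n, g);
        for (i = 0; i < n; i++) {
            double old = theta[i], fp, fm, approx, err;
            theta[i] = old + h; fp = f(theta, n);
            theta[i] = old - h; fm = f(theta, n);
            theta[i] = old;
            approx = (fp - fm)/(2*h);          /* numerical partial derivative */
            err = fabs(approx - g[i]);
            if (err > max_err) max_err = err;
        }
        free(g);
        return max_err;   /* should be tiny if the analytic gradient is right */
    }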
Outline
1 Introduction
2 Neural Networks
3 Matrix Computation in NN
Matrix Multiplication I
We will show that to calculate f(θ), the main operation from one layer to the next is a matrix-matrix product
Recall that from the (m − 1)th layer to the mth layer
    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,
where σ(·) is the activation function.
Matrix Multiplication II
Now each instance xi has its own zm−1,i
So we have
    zm−1,1, . . . , zm−1,l
if there are l training instances
Thus
    [sm,1 · · · sm,l] = (Wm)T [zm−1,1 · · · zm−1,l] ∈ Rnm×l,
where
    Wm ∈ Rnm−1×nm
Matrix Multiplication III
The main cost in calculating the function value of NN is the matrix-matrix product between every two layers
You know how to do matrix multiplication.
C = AB is a mathematical operation with

    Cij = Σ_{k=1}^{n} Aik Bkj
Matrix Multiplication IV
At first glance, it seems to have nothing to do with computer science
But have you ever thought about a question: why do people use GPU for deep learning?
An Internet search shows the following answer from
https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning
“Deep learning involves huge amount of matrix multiplications and other operations which can be massively parallelized and thus sped up on GPU-s.”
Matrix Multiplication V
As computer science students, we need to know a bit more of the details
I am going to use CPU rather than GPU to give an illustration of how computer architectures may affect a mathematical operation
Optimized BLAS: an Example by Using Block Algorithms I
Let’s test the matrix multiplication
A C program:
#define n 2000
double a[n][n], b[n][n], c[n][n];
int main() {
int i, j, k;
Optimized BLAS: an Example by Using Block Algorithms II
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      a[i][j]=1; b[i][j]=1;          /* initialize A and B */
    }
  for (i=0;i<n;i++)
    for (j=0;j<n;j++) {
      c[i][j]=0;
      for (k=0;k<n;k++)              /* C = A*B, inner-product form */
        c[i][j] += a[i][k]*b[k][j];
    }
Optimized BLAS: an Example by Using Block Algorithms III
}
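Assuming the file is saved as, say, matmul.c (the name and flags are just an example), it can be compiled and timed with something like gcc -O2 matmul.c -o matmul and then time ./matmul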
A Matlab program:
    n = 2000;
    A = randn(n,n); B = randn(n,n);
    t = cputime; C = A*B; t = cputime - t
To remove the effect of multi-threading, use
    matlab -singleCompThread
Timing is an issue
Optimized BLAS: an Example by Using Block Algorithms IV
cjlin@linux1:~$ matlab -singleCompThread
>> a = randn(3000,3000); tic; c = a*a; toc
Elapsed time is 4.520684 seconds.
>> a = randn(3000,3000); t=cputime; c = a*a; t=cputime-t
t =
    4.3500
Optimized BLAS: an Example by Using Block Algorithms V
cjlin@linux1:~$ matlab
>> a = randn(3000,3000); tic; c = a*a; toc
Elapsed time is 1.180799 seconds.
>> a = randn(3000,3000); t=cputime; c = a*a; t=cputime-t
t =
    8.4400
Matlab is much faster than the C code we wrote by hand
Optimized BLAS: an Example by Using Block Algorithms VI
Optimized BLAS: data locality is exploited
Use the highest level of memory as much as possible
Block algorithms: transfer sub-matrices between different levels of storage and localize operations to achieve good performance
Memory Hierarchy I
CPU
↓ Registers
↓ Cache
↓
Main Memory
↓
Secondary storage (Disk)
Memory Hierarchy II
↑: increasing in speed
↓: increasing in capacity
When I studied computer architecture, I didn’t quite understand why this setting is so useful
But from optimized BLAS I realize that it is extremely powerful
Memory Management I
Page fault: an operand is not available in main memory and must be transported from secondary memory
This (usually) overwrites the least recently used page
The extra I/O increases the total time
An example: C = AB + C, n = 1,024
Assumption: a page holds 65,536 doubles = 64 columns, so 16 pages for each matrix
48 pages for three matrices
Memory Management II
Assumption: available memory holds 16 pages; matrix access is column oriented
For
    A = [1 2
         3 4]
column-oriented storage is 1 3 2 4; row-oriented storage is 1 2 3 4
Accessing each row of A: 16 page faults, since 1024/64 = 16
Assumption: each time, a contiguous segment of data is brought into one page
Approach 1: inner product
Memory Management III
for i = 1:n
  for j = 1:n
    for k = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
We use a matlab-like syntax here
At each (i,j): each row a(i, 1:n) causes 16 page faults
Memory Management IV
Total: 1024² × 16 page faults, at least 16 million page faults
Approach 2:
for j = 1:n
  for k = 1:n
    for i = 1:n
      c(i,j) = a(i,k)*b(k,j) + c(i,j);
    end
  end
end
Memory Management V
For each j, we access all columns of A
A needs 16 pages, but B and C take space as well, so A must be read again for every j
For each j: 16 page faults for A, hence 1024 × 16 page faults in total
B, C: 16 page faults each
Approach 3: block algorithms (nb = 256)
Memory Management VI
for j = 1:nb:n
  for k = 1:nb:n
    for jj = j:j+nb-1
      for kk = k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj) + c(:,jj);
      end
    end
  end
end
In MATLAB, 1:256:1024 means 1, 257, 513, 769
Memory Management VII
Note that we calculate

    [A11 · · · A14] [B11 · · · B14]   [A11B11 + · · · + A14B41   · · ·]
    [ ...     ... ] [ ...     ... ] = [          ...             ... ]
    [A41 · · · A44] [B41 · · · B44]
Memory Management VIII
Each block: 256 × 256
    C11 = A11B11 + · · · + A14B41
    C21 = A21B11 + · · · + A24B41
    C31 = A31B11 + · · · + A34B41
    C41 = A41B11 + · · · + A44B41
For each (j, k), Bk,j is used to add A:,kBk,j to C:,j
Memory Management IX
Example: when j = 1, k = 1,
    C11 ← C11 + A11B11
    ...
    C41 ← C41 + A41B11
Use Approach 2 for A:,1B11
A:,1: 256 columns, 1024 × 256/65,536 = 4 pages
A:,1, . . . , A:,4: 4 × 4 = 16 page faults in calculating C:,1
Memory Management X
B: 16 page faults, C : 16 page faults
Now let’s try to compare approaches 1 and 2 in a C implementation
We see that approach 1 is faster. Why?
The C language is row-oriented (row-major) rather than column-oriented, so the innermost loop of approach 1 accesses a(i,:) contiguously
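Since C is row-oriented, a blocked (tiled) version suited to C would iterate over row-major blocks. A sketch of ours (NB = 256 mirrors the nb used above; this is only an illustration, not optimized BLAS code):

    /* blocked (tiled) computation of C = C + A*B for n-by-n row-major
       matrices; the caller is assumed to have set C to zero first */
    #define NB 256

    void blocked_matmul(int n, const double *a, const double *b, double *c)
    {
        int i0, j0, k0, i, j, k;
        for (i0 = 0; i0 < n; i0 += NB)
            for (k0 = 0; k0 < n; k0 += NB)
                for (j0 = 0; j0 < n; j0 += NB)
                    /* multiply block A(i0,k0) by block B(k0,j0) and
                       accumulate into block C(i0,j0) */
                    for (i = i0; i < i0 + NB && i < n; i++)
                        for (k = k0; k < k0 + NB && k < n; k++) {
                            double aik = a[i*n + k];
                            for (j = j0; j < j0 + NB && j < n; j++)
                                c[i*n + j] += aik*b[k*n + j];
                        }
    }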
Optimized BLAS Implementations
OpenBLAS
http://www.openblas.net/
It is an optimized BLAS library based on GotoBLAS2 (see the story in the next slide)
It’s a successful open-source project developed in China
Intel MKL (Math Kernel Library)
https://software.intel.com/en-us/mkl
Some Past Stories about Optimized BLAS
BLAS by Kazushige Goto
https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2
See the NY Times article: “Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips”
http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all
Homework I
We would like to compare the time for multiplying two 8,000 by 8,000 matrices using
Directly compiled sources of BLAS from http://www.netlib.org/blas/
Intel MKL
OpenBLAS
You can use the BLAS or CBLAS interface
Try to comment on the use of multi-core processors.
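As a starting point, here is a hedged sketch of ours using the standard CBLAS interface (the matrix size and CPU timing follow the slides; how you link against the reference BLAS, OpenBLAS, or MKL depends on your system):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        int n = 8000, i;
        double *A = malloc(sizeof(double)*n*n);
        double *B = malloc(sizeof(double)*n*n);
        double *C = malloc(sizeof(double)*n*n);
        clock_t t;
        if (!A || !B || !C) return 1;
        for (i = 0; i < n*n; i++) {
            A[i] = rand()/(double)RAND_MAX;
            B[i] = rand()/(double)RAND_MAX;
            C[i] = 0;
        }
        t = clock();
        /* C = 1.0*A*B + 0.0*C, all matrices row-major, no transposes */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        printf("CPU time: %.2f seconds\n",
               (double)(clock() - t)/CLOCKS_PER_SEC);
        free(A); free(B); free(C);
        return 0;
    }

Note that clock() measures CPU time, like MATLAB’s cputime; for wall-clock time (and hence the effect of multi-core processors) a different timer is needed.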
Conclusions I
In general I don’t think we should have too many required courses
However, some of them are very basic and are very useful in advanced topics
Some students do not think basic mathematics courses (e.g., calculus) are CS courses. But that may not be the case
When I evaluate applications for graduate school by checking transcripts, very often I first look at the grade of calculus
Conclusions II
I hope that through this lecture you have seen that some mathematics techniques are closely related to CS topics