
Some thoughts on Mathematics and Computer Science

(1)

Some thoughts on Mathematics and Computer Science

Chih-Jen Lin

(2)

Outline

1 Introduction

2 Neural Networks

3 Matrix Computation in NN

(3)

Introduction

When Prof. Chao asked me to give a lecture here, I wasn’t quite sure what I should talk about

I don’t have a new and emerging topic to share with you here.

So instead I plan to talk about a topic

“Mathematics and Computer Science”

But why this topic?

I will explain my motivation

(4)

Introduction (Cont’d)

Some time ago, in a town-hall meeting with some faculty members, a student asked why calculus is a required course

I heard this from some faculty members as I wasn’t there

Anyway, I think it really happened. Here is the reaction from a professor:

He said “When we were students, we didn’t ask why xxx is a required course. We just took it.”

(5)

Introduction (Cont’d)

Then I asked myself if it’s possible to give you some reasons

That leads to this lecture

(6)

The Role of Mathematics in CS I

One reason why some students don’t think calculus is important is that they think

CS = programming

But many (or most) CS areas are beyond programming

One issue is that in our required courses, things like calculus are seldom used

Students can see that discrete mathematics is related to algorithms

(7)

The Role of Mathematics in CS II

But they find calculus/linear algebra/statistics useful only after taking computer vision, signal processing, machine learning and others

These are more advanced courses

Note that CS is a rapidly changing area

Before the Internet, many CS companies just hired programmers

For example, for Windows and Office development, Microsoft hired many programmers with an undergraduate degree

(8)

The Role of Mathematics in CS III

Then Google started hiring many people with Ph.D. or master’s degrees

Compared with traditional software development, in the Internet era analytics skills are more important

This doesn’t mean every engineer in big Internet companies has the job of developing analytics tools (e.g., deep learning software)

Instead, most are users. They don’t need to know all the sophisticated details, but they do need some basic understanding

(9)

The Role of Mathematics in CS IV

For example, as a user of deep learning, you probably need to roughly know how it works

Otherwise you might not know what you are doing and what kinds of results you will get

To have a basic understanding of these things, you need some mathematics background

I am going to illustrate this point in the lecture

(10)

Outline

1 Introduction

2 Neural Networks

3 Matrix Computation in NN

(11)

Neural Networks

To discuss why mathematics is important in some CS areas, we can consider many examples

We decide to talk about neural networks as deep learning is (incredibly) hot

There are many types of neural networks, but we will consider the simplest one

It’s the fully connected network for multi-class classification

So let’s check what data classification is

(12)

Data Classification

We extract information to build a model for future prediction

(13)

Data Classification (Cont’d)

The main task is finding a model

This is also called supervised learning

(14)

Data Classification (Cont’d)

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example: medical diagnosis

Find a patient’s blood pressure, weight, etc.

After several years, know if he/she recovers

Build a machine learning model

New patient: find blood pressure, weight, etc.

Prediction

(15)

Minimizing Training Errors

Basically a classification method starts with minimizing the training errors

    min_model (training errors)

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a support vector machine, a neural network, or others

There are various ways to introduce classification methods. Here we consider probably the most popular one

(16)

Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is sgn(wTx)

For any data x, the predicted label is

    1 if wTx ≥ 0, −1 otherwise
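In code, this decision rule is just the sign of an inner product. A minimal C sketch (the function name and array layout are illustrative assumptions):

/* predicted label of x: 1 if w'x >= 0, -1 otherwise */
int predict(const double *w, const double *x, int n)
{
    double wTx = 0.0;
    for (int j = 0; j < n; j++)
        wTx += w[j] * x[j];
    return wTx >= 0.0 ? 1 : -1;
}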

(17)

Minimizing Training Errors (Cont’d)

The two-dimensional situation

[Figure: in two dimensions, circles (◦) on one side and triangles (△) on the other side of the line wTx = 0]

This seems to be quite restricted, but practically x is in a much higher dimensional space

(18)

Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function ξ(w; x, y ) for each instance (x, y )

Ideally we should use the 0–1 training loss:

    ξ(w; x, y) = 1 if y wTx < 0, 0 otherwise

(19)

Minimizing Training Errors (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

[Figure: the 0–1 loss ξ(w; x, y) plotted against −y wTx]

We need continuous approximations

(20)

Common Loss Functions

Hinge loss (l1 loss):

    ξL1(w; x, y) ≡ max(0, 1 − y wTx)    (1)

Logistic loss:

    ξLR(w; x, y) ≡ log(1 + e^(−y wTx))    (2)

Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)

SVM and LR are two very fundamental classification methods
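As a small aside, the two losses in (1) and (2) are one-liners in code. A minimal C sketch (function names are illustrative):

#include <math.h>

/* hinge loss (1): max(0, 1 - y*w'x), given the value w'x and label y = +-1 */
double hinge_loss(double wTx, double y)
{
    double margin = 1.0 - y * wTx;
    return margin > 0.0 ? margin : 0.0;
}

/* logistic loss (2): log(1 + exp(-y*w'x)) */
double logistic_loss(double wTx, double y)
{
    return log1p(exp(-y * wTx));
}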

(21)

Common Loss Functions (Cont’d)

[Figure: the hinge loss ξL1 and the logistic loss ξLR plotted against −y wTx]

Logistic regression is closely related to SVM

Their performance is usually similar

(22)

Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

(23)

Overfitting

See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy. This is useless

When training a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error

(24)

[Figure: training data and testing data shown with different symbols, illustrating overfitting]

(25)

Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme.

One idea is to make w’s values closer to zero

We can add, for example,

    wTw/2 or ‖w‖1

to the function that is minimized

(26)

General Form of Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1

l: # of data, n: # of features

    min_w f(w),   f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; xi, yi)

wTw/2: regularization term

ξ(w; x, y ): loss function

C : regularization parameter (chosen by users)
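Putting the pieces together, the objective f(w) above can be evaluated with a few loops. A minimal C sketch using the logistic loss (2) (dense storage and the function name are assumptions for illustration; no attempt is made at numerical safeguards or efficiency):

#include <math.h>

/* f(w) = w'w/2 + C * sum_i log(1 + exp(-y_i * w'x_i))
   X: l instances with n features each, stored row by row; y: labels +1/-1 */
double objective(const double *w, const double *X, const double *y,
                 int l, int n, double C)
{
    double f = 0.0;
    for (int j = 0; j < n; j++)        /* regularization term w'w/2 */
        f += 0.5 * w[j] * w[j];
    for (int i = 0; i < l; i++) {      /* loss term */
        double wTx = 0.0;
        for (int j = 0; j < n; j++)
            wTx += w[j] * X[i * n + j];
        f += C * log1p(exp(-y[i] * wTx));
    }
    return f;
}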

(27)

From Linear to Nonlinear

We now have linear classification because the decision function

sgn(wTx) is linear

We will see that a neural network (NN) is a nonlinear classifier

(28)

Neural Networks

We will explain neural networks using the same framework as for linear classification

Among various types of networks, we consider

fully-connected feed-forward networks for multi-class classification

(29)

Neural Networks (Cont’d)

Our training set includes (yi, xi), i = 1, . . . , l . xi ∈ Rn0 is the feature vector.

yi ∈ RK is the label vector.

K : # of classes

If xi is in class k, then

    yi = [0, . . . , 0, 1, 0, . . . , 0]T ∈ RK,

where the single 1 is at the kth position (preceded by k − 1 zeros)

A neural network maps each feature vector to one of the class labels by the connection of nodes.

(30)

Neural Networks (Cont’d)

Between two layers a weight matrix maps inputs (the previous layer) to outputs (the next layer).

[Figure: a small fully-connected network with input nodes A0, B0, C0, hidden nodes A1, B1, and output nodes A2, B2, C2]

(31)

Neural Networks (Cont’d)

The weight matrix Wm at the mth layer is

    Wm = [ wm11        · · ·  wm1,nm      ]
         [  ...                ...        ]   ∈ Rnm−1×nm,
         [ wmnm−1,1    · · ·  wmnm−1,nm   ]

where

nm: # neurons at layer m

nm−1: # neurons at layer m − 1

L: number of layers

n0 = # of features, nL = # of classes

Let zm denote the vector at the mth layer: z0 = x is the input and zL is the output

(32)

Neural Networks (Cont’d)

From the (m − 1)th layer to the mth layer:

    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function.

We collect all variables:

    θ = [vec(W1); . . . ; vec(WL)] ∈ Rn,

where n is the total # of variables:

    n = n0n1 + · · · + nL−1nL
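To make the forward computation concrete, here is a minimal C sketch of one layer, sm = (Wm)T zm−1 followed by the activation. The sigmoid is used as σ only as an example, and the storage layout and function name are assumptions for illustration:

#include <math.h>

/* one common choice of the activation function (an assumption here) */
static double sigma(double s) { return 1.0 / (1.0 + exp(-s)); }

/* one layer of the forward pass: s = (W^m)' z^{m-1}, z^m_j = sigma(s_j).
   W has n_prev rows and n_cur columns, stored row by row:
   W[i*n_cur + j] is the entry w_ij */
void forward_layer(const double *W, const double *z_prev, double *z_cur,
                   int n_prev, int n_cur)
{
    for (int j = 0; j < n_cur; j++) {
        double s = 0.0;
        for (int i = 0; i < n_prev; i++)
            s += W[i * n_cur + j] * z_prev[i];
        z_cur[j] = sigma(s);
    }
}

Applying such a function for m = 1, . . . , L with z0 = x gives the network output zL.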

(33)

Neural Networks (Cont’d)

We solve the following optimization problem:

    min_θ f(θ),

where

    f(θ) = θTθ/2 + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi).

C : regularization parameter

zL(θ) ∈ RnL: last-layer output vector of x.

ξ(zL; x, y): loss function. Example:

ξ(zL; x, y) = ||zL− y||2

(34)

Neural Networks (Cont’d)

That is, we hope that the output zL is close to the label vector y:

    y  = [0, . . . , 0, 1, 0, . . . , 0]T,
    zL = [±0.00· · · , . . . , ±0.00· · · , 1.00· · · , ±0.00· · · , . . . , ±0.00· · · ]T

(35)

Neural Networks (Cont’d)

The formulation is as before, but the loss function is more complicated

This NN method has been developed for decades.

So what’s new about deep learning?

Though there are some technical advances, one major thing is that more layers often lead to better results

(36)

Solving Optimization Problems I

How do you minimize

f (θ)?

Usually by a descent method

That is, we find a sequence

    θ1, θ2, θ3, . . .

such that

    f(θk+1) < f(θk) for all k

(37)

Solving Optimization Problems II

Hopefully

    lim_{k→∞} f(θk)

exists and is the smallest function value

Now you see that calculus is used. You need to know what limit is

But how to obtain

    f(θk+1) < f(θk)?

Usually by gradient descent

(38)

Gradient Descent I

Taylor expansion. If

    f(θ): R1 → R1,

then

    f(θk + d) = f(θk) + f′(θk)d + (1/2)f″(θk)d² + · · ·

This is the one-dimensional case

Now we have multiple variables:

    f(θ): Rn → R1

(39)

Gradient Descent II

So we need the multi-dimensional Taylor expansion

    f(θk + d) = f(θk) + ∇f(θk)Td + · · ·

We don’t get into details, but ∇f(θ) is called the gradient

The gradient is the multi-dimensional first derivative:

    ∇f(θ) = [∂f(θ)/∂θ1, . . . , ∂f(θ)/∂θn]T

(40)

Gradient Descent III

Let

    f(θk + d) ≈ f(θk) + ∇f(θk)Td

and we can find d by

    min_d ∇f(θk)Td

But easily this value goes to −∞

If

    ∇f(θk)Td = −100,

then using 2d gives −200, 4d gives −400, and so on

(41)

Gradient Descent IV

Thus we need to confine the search space of d:

    min_d ∇f(θk)Td
    subject to ‖d‖ = 1    (3)

Here ‖d‖ means the length of d:

    ‖d‖ = √(d1² + · · · + dn²)

How to solve (3)?

(42)

Gradient Descent V

We will use the Cauchy inequality

    (a1b1 + · · · + anbn)² ≤ (a1² + · · · + an²)(b1² + · · · + bn²)

When

    d = −∇f(θk)/‖∇f(θk)‖,

we have

    (∇f(θk)Td)² = ‖∇f(θk)‖²
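Written out (a spelled-out version of this step, in LaTeX notation): by the Cauchy inequality with a = ∇f(θk) and b = d,

\[
\nabla f(\theta^k)^T d \;\ge\; -\|\nabla f(\theta^k)\|\,\|d\| \;=\; -\|\nabla f(\theta^k)\|
\quad \text{whenever } \|d\| = 1,
\]

while the particular choice

\[
d = -\frac{\nabla f(\theta^k)}{\|\nabla f(\theta^k)\|}
\quad \text{gives} \quad
\nabla f(\theta^k)^T d = -\|\nabla f(\theta^k)\|,
\]

so this d attains the smallest possible value in (3).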

(43)

Gradient Descent VI

Equality holds for the Cauchy inequality, so the minimum of (3) is attained at this d

However, we may not have

    f(θk + d) < f(θk)

Instead, we need to search for a step size

(44)

Gradient Descent VII

Specifically we try

    α = 1, 1/2, 1/4, 1/8, . . .

until

    f(θk + αd) < f(θk) + σα∇f(θk)Td,    (4)

where σ ∈ (0, 1/2).

The condition (4) is usually called the sufficient decrease condition in optimization

(45)

Gradient Descent VIII

While θ isn’t optimal
    d = −∇f(θ) and α ← 1
    while true
        If (4) holds
            break
        else
            α ← α/2
    θ ← θ + αd

The procedure to search for α is called line search
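For concreteness, here is a minimal C sketch of the procedure above with backtracking line search, applied to a toy quadratic objective (the objective, σ = 0.01, and the stopping tolerance are assumptions chosen only for illustration):

#include <stdio.h>
#include <math.h>

#define N 2
static const double SIGMA = 0.01;   /* parameter sigma in condition (4) */

/* toy objective f(theta) = 0.5*theta'theta and its gradient */
static double f(const double *t) { return 0.5 * (t[0]*t[0] + t[1]*t[1]); }
static void grad(const double *t, double *g) { g[0] = t[0]; g[1] = t[1]; }

int main(void)
{
    double theta[N] = {3.0, -4.0}, g[N], d[N];
    for (int iter = 0; iter < 100; iter++) {
        grad(theta, g);
        if (sqrt(g[0]*g[0] + g[1]*g[1]) < 1e-6)    /* theta is (nearly) optimal */
            break;
        for (int j = 0; j < N; j++) d[j] = -g[j];  /* d = -grad f(theta) */
        double gTd = g[0]*d[0] + g[1]*d[1];
        double alpha = 1.0, fold = f(theta);
        while (1) {                                /* line search: halve alpha */
            double trial[N];                       /* until (4) holds          */
            for (int j = 0; j < N; j++) trial[j] = theta[j] + alpha * d[j];
            if (f(trial) < fold + SIGMA * alpha * gTd)
                break;
            alpha /= 2.0;
        }
        for (int j = 0; j < N; j++) theta[j] += alpha * d[j];  /* update theta */
    }
    printf("theta = (%g, %g), f(theta) = %g\n", theta[0], theta[1], f(theta));
    return 0;
}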

(46)

Gradient Descent IX

Instead of

    α = 1, 1/2, 1/4, 1/8, . . .

we can use

    α = 1, β, β², β³, . . . ,

where 0 < β < 1

(47)

Step-size Search I

Why

    σ ∈ (0, 1/2)?

The use of 1/2 is for convergence though we won’t discuss details

Q: how do we know that the line search procedure is guaranteed to stop?

(48)

Step-size Search II

In fact we can prove that if

    ∇f(θ)Td < 0    (5)

then there exists ᾱ > 0 such that

    f(θ + αd) < f(θ) + σ∇f(θ)T(αd), ∀α ∈ (0, ᾱ)

Any d satisfying (5) is called a descent direction

(49)

Step-size Search III

Proof: assume the result is wrong. Then there exists a sequence {αt} with

    lim_{t→∞} αt = 0 and αt > 0, ∀t

such that

    f(θ + αt d) ≥ f(θ) + σαt ∇f(θ)Td, ∀t

(50)

Step-size Search IV

Then

    lim_{αt→0} [f(θ + αt d) − f(θ)] / αt = ∇f(θ)Td ≥ σ∇f(θ)Td

However,

    ∇f(θ)Td < 0 and σ ∈ (0, 1/2)

cause a contradiction

(51)

Step-size Search V

Q: how do you formally say

    lim_{α→0} [f(θ + αd) − f(θ)] / α = ∇f(θ)Td?

Let

    g(α) ≡ f(θ + αd)

We essentially calculate

    lim_{α→0} [g(α) − g(0)] / α    (6)

By the definition of the first derivative

(52)

Step-size Search VI

(6) is g′(0)

But what are g′(α) and then g′(0)?

We have

    g′(α) = [∂f(θ + αd)/∂θ1] [∂(θ1 + αd1)/∂α] + · · · + [∂f(θ + αd)/∂θn] [∂(θn + αdn)/∂α]

          = [∂f(θ + αd)/∂θ1] d1 + · · · + [∂f(θ + αd)/∂θn] dn

(53)

Step-size Search VII

and

    g′(0) = ∇f(θ)Td

This is the multi-variable chain rule

Statement of the multi-variable chain rule: let x = x(t) and y = y(t) be differentiable at t and suppose

    z = f(x, y)

(54)

Step-size Search VIII

is differentiable at (x, y). Then z(t) = f(x(t), y(t)) is differentiable at t and

    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
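As a quick sanity check of this identity, the directional derivative g′(0) = ∇f(θ)Td can be compared against a finite-difference quotient. A minimal C sketch on a toy function (the function, point, and direction are assumptions for illustration):

#include <stdio.h>
#include <math.h>

/* toy f(theta) = theta1^2 + sin(theta2) */
static double f(const double *t) { return t[0]*t[0] + sin(t[1]); }

int main(void)
{
    double theta[2] = {1.0, 0.5}, d[2] = {0.3, -0.7};
    /* analytic directional derivative: grad f(theta)'d */
    double grad[2] = {2.0 * theta[0], cos(theta[1])};
    double gTd = grad[0]*d[0] + grad[1]*d[1];
    /* finite-difference approximation of (g(a) - g(0)) / a with g(a) = f(theta + a d) */
    double a = 1e-6;
    double td[2] = {theta[0] + a*d[0], theta[1] + a*d[1]};
    double fd = (f(td) - f(theta)) / a;
    printf("grad'd = %.8f, finite difference = %.8f\n", gTd, fd);
    return 0;
}

The two printed numbers should agree to several digits.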

(55)

Gradient of NN I

Recall that the NN optimization problem is

    min_θ f(θ), where f(θ) = θTθ/2 + C Σ_{i=1}^{l} ξ(zL,i(θ); xi, yi).

How to calculate the gradient?

Now zL is actually a function of all variables: zL(θ)

(56)

Gradient of NN II

What we will calculate is

    ∇f(θ) = θ + C Σ_{i=1}^{l} ∇θ ξ(zL,i(θ); xi, yi)

So what is

    ∇θ ξ(zL(θ); x, y)?

(57)

Gradient of NN III

We have

    ∂ξ(zL(θ); x, y)/∂θ1 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ1] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ1]

    ∂ξ(zL(θ); x, y)/∂θ2 = [∂ξ(zL(θ); x, y)/∂zL1] [∂zL1(θ)/∂θ2] + · · · + [∂ξ(zL(θ); x, y)/∂zLnL] [∂zLnL(θ)/∂θ2]

    and similarly for ∂θ3, . . . , ∂θn

(58)

Gradient of NN IV

Thus

    ∇θ ξ(zL(θ); x, y) =

        [ ∂zL1(θ)/∂θ1   · · ·   ∂zLnL(θ)/∂θ1 ]   [ ∂ξ/∂zL1  ]
        [      ...                   ...     ] · [    ...   ]
        [ ∂zL1(θ)/∂θn   · · ·   ∂zLnL(θ)/∂θn ]   [ ∂ξ/∂zLnL ]

where the matrix of partial derivatives

        [ ∂zL1(θ)/∂θ1   · · ·   ∂zLnL(θ)/∂θ1 ]
        [      ...                   ...     ]
        [ ∂zL1(θ)/∂θn   · · ·   ∂zLnL(θ)/∂θn ]

(59)

Gradient of NN V

is called the Jacobian of zL(θ)

We see that the chain rule is used again

There are a lot more details about the gradient evaluation, but let’s stop here

The point is that the techniques behind deep learning are quite complicated and need a lot of mathematics

Next let’s switch to the issue of computation

(60)

Outline

1 Introduction

2 Neural Networks

3 Matrix Computation in NN

(61)

Matrix Multiplication I

We will show that to calculate f(θ), the main operation from one layer to the next is a matrix-matrix product

Recall that from the (m − 1)th layer to the mth layer:

    sm = (Wm)T zm−1,
    zmj = σ(smj), j = 1, . . . , nm,

where σ(·) is the activation function.

(62)

Matrix Multiplication II

Now each instance xi has its own zm−1,i

So we have zm−1,1, . . . , zm−1,l if there are l training instances

Thus

    [sm,1 · · · sm,l] = (Wm)T [zm−1,1 · · · zm−1,l] ∈ Rnm×l,

where Wm ∈ Rnm−1×nm

(63)

Matrix Multiplication III

The main cost in calculating the function value of an NN is the matrix-matrix product between every two layers

You know how to do matrix multiplication.

C = AB is a mathematical operation with

    Cij = Σ_{k=1}^{n} Aik Bkj

(64)

Matrix Multiplication IV

At first glance, it seems to have nothing to do with computer science

But have you ever thought about a question: why do people use GPU for deep learning?

An Internet search shows the following answer from https://www.quora.com/

Why-are-GPUs-well-suited-to-deep-learning

“Deep learning involves huge amount of matrix multiplications and other operations which can be massively parallelized and thus sped up on GPU-s.”

(65)

Matrix Multiplication V

As computer science students, we need to know a bit more of the details

I am going to use the CPU rather than the GPU to give an illustration of how computer architectures may affect a mathematical operation

(66)

Optimized BLAS: an Example by Using Block Algorithms I

Let’s test the matrix multiplication. A C program:

#define n 2000

double a[n][n], b[n][n], c[n][n];

int main() {

int i, j, k;

(67)

Optimized BLAS: an Example by Using Block Algorithms II

for (i=0;i<n;i++)          /* initialize a and b */
  for (j=0;j<n;j++) {
    a[i][j]=1; b[i][j]=1;
  }
for (i=0;i<n;i++)          /* c = a*b */
  for (j=0;j<n;j++) {
    c[i][j]=0;
    for (k=0;k<n;k++)
      c[i][j] += a[i][k]*b[k][j];
  }

(68)

Optimized BLAS: an Example by Using Block Algorithms III

}

A Matlab program:

n = 2000;
A = randn(n,n); B = randn(n,n);
t = cputime; C = A*B; t = cputime - t

To remove the effect of multi-threading, use

matlab -singleCompThread

Timing is an issue

(69)

Optimized BLAS: an Example by Using Block Algorithms IV

cjlin@linux1:~$ matlab -singleCompThread

>> a = randn(3000,3000);tic; c = a*a; toc Elapsed time is 4.520684 seconds.

>> a = randn(3000,3000);t=cputime; c = a*a;

t=cputime-t

t =

    4.3500

(70)

Optimized BLAS: an Example by Using Block Algorithms V

cjlin@linux1:~$ matlab

>> a = randn(3000,3000);tic; c = a*a; toc Elapsed time is 1.180799 seconds.

>> a = randn(3000,3000);t=cputime; c = a*a;

t=cputime-t

t =

    8.4400

Matlab is much faster than a code written by ourselves

(71)

Optimized BLAS: an Example by Using Block Algorithms VI

Optimized BLAS: data locality is exploited

Use the highest level of memory as much as possible

Block algorithms: transfer sub-matrices between different levels of storage and localize operations to achieve good performance

(72)

Memory Hierarchy I

CPU

↓ Registers

↓ Cache

↓ Main Memory

↓ Secondary storage (Disk)

(73)

Memory Hierarchy II

↑: increasing in speed

↓: increasing in capacity

When I studied computer architecture, I didn’t quite understand why this setting is so useful

But from optimized BLAS I realized that it is extremely powerful

(74)

Memory Management I

Page fault: an operand not available in main memory is transported from secondary memory

It (usually) overwrites the least recently used page

I/O increases the total time

An example: C = AB + C, n = 1,024

Assumption: a page holds 65,536 doubles = 64 columns

16 pages for each matrix, 48 pages for three matrices

(75)

Memory Management II

Assumption: available memory is 16 pages; matrix access is column oriented

For example, for

    A = [ 1 2
          3 4 ]

column-oriented storage: 1 3 2 4; row-oriented storage: 1 2 3 4

Accessing each row of A: 16 page faults, since 1024/64 = 16

Assumption: each time a continuous segment of data is brought into one page

Approach 1: inner product

(76)

Memory Management III

for i=1:n
  for j=1:n
    for k=1:n
      c(i,j) = a(i,k)*b(k,j)+c(i,j);
    end
  end
end

We use a matlab-like syntax here

At each (i,j): each row a(i, 1:n) causes 16 page faults

(77)

Memory Management IV

Total: 1024² × 16 page faults, i.e., at least 16 million page faults

Approach 2:

for j=1:n
  for k=1:n
    for i=1:n
      c(i,j) = a(i,k)*b(k,j)+c(i,j);
    end
  end
end

(78)

Memory Management V

For each j, we access all columns of A

A needs 16 pages, but B and C take space as well, so A must be read for every j

For each j: 16 page faults for A, so 1024 × 16 page faults in total

C, B: 16 page faults each

Approach 3: block algorithms (nb = 256)

(79)

Memory Management VI

for j=1:nb:n
  for k=1:nb:n
    for jj=j:j+nb-1
      for kk=k:k+nb-1
        c(:,jj) = a(:,kk)*b(kk,jj)+c(:,jj);
      end
    end
  end
end

In MATLAB, 1:256:1024 means 1, 257, 513, 769
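A C sketch of the same blocking idea is below (illustrative only; real optimized BLAS implementations are far more elaborate). Note that C stores arrays row by row, so the column-oriented page-fault analysis above corresponds to MATLAB/Fortran storage; the loops here simply mirror the structure of the pseudocode:

#include <stdio.h>

#define n 1024
#define nb 256
double a[n][n], b[n][n], c[n][n];

/* block algorithm for C = A*B + C: work on nb columns of C (index jj)
   and nb columns of A / rows of B (index kk) at a time */
void block_matmul(void)
{
    for (int j = 0; j < n; j += nb)
        for (int k = 0; k < n; k += nb)
            for (int jj = j; jj < j + nb; jj++)
                for (int kk = k; kk < k + nb; kk++)
                    for (int i = 0; i < n; i++)   /* c(:,jj) += a(:,kk)*b(kk,jj) */
                        c[i][jj] += a[i][kk] * b[kk][jj];
}

int main(void)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            a[i][j] = 1; b[i][j] = 1; c[i][j] = 0;
        }
    block_matmul();
    printf("c[0][0] = %g\n", c[0][0]);            /* should print 1024 */
    return 0;
}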

(80)

Memory Management VII

Note that we calculate

    [ A11 · · · A14 ]   [ B11 · · · B14 ]   [ A11B11 + · · · + A14B41   · · · ]
    [  ...      ... ] × [  ...      ... ] = [           ...            . . . ]
    [ A41 · · · A44 ]   [ B41 · · · B44 ]

(81)

Memory Management VIII

Each block: 256 × 256

    C11 = A11B11 + · · · + A14B41
    C21 = A21B11 + · · · + A24B41
    C31 = A31B11 + · · · + A34B41
    C41 = A41B11 + · · · + A44B41

For each (j, k), Bk,j is used to add A:,k Bk,j to C:,j

(82)

Memory Management IX

Example: when j = 1, k = 1

    C11 ← C11 + A11B11
    ...
    C41 ← C41 + A41B11

Use Approach 2 for A:,1 B11

A:,1: 256 columns, 1024 × 256/65536 = 4 pages

A:,1, . . . , A:,4: 4 × 4 = 16 page faults in calculating C:,1

(83)

Memory Management X

B: 16 page faults; C: 16 page faults

Now let’s try to compare approaches 1 and 2

We see that one of them is faster. Why?

C is row-oriented rather than column-oriented

(84)

Optimized BLAS Implementations

OpenBLAS

http://www.openblas.net/

It is an optimized BLAS library based on GotoBLAS2 (see the story in the next slide)

It’s a successful open-source project developed in China

Intel MKL (Math Kernel Library)

https://software.intel.com/en-us/mkl

(85)

Some Past Stories about Optimized BLAS

BLAS by Kazushige Goto

https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2

See the NY Times article: “Writing the fastest code, by hand, for fun: a human computer keeps speeding up chips”

http://www.nytimes.com/2005/11/28/technology/28super.html?pagewanted=all

(86)

Homework I

We would like to compare the time for multiplying two 8,000 by 8,000 matrices by

Directly using the sources of BLAS from http://www.netlib.org/blas/

Intel MKL

OpenBLAS

You can use BLAS or CBLAS

Try to comment on the use of multi-core processors.
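A possible starting point for the homework is a plain CBLAS call such as the sketch below (compile and link against the BLAS of your choice, e.g., the netlib reference BLAS, OpenBLAS, or MKL; the timing method is only one option):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

#define N 8000

int main(void)
{
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = malloc(sizeof(double) * N * N);
    for (long i = 0; i < (long)N * N; i++) {
        A[i] = 1.0; B[i] = 1.0; C[i] = 0.0;
    }
    clock_t t = clock();
    /* C = 1.0*A*B + 0.0*C, row-major storage, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    printf("CPU time: %.2f seconds\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    free(A); free(B); free(C);
    return 0;
}

Note that clock() accumulates CPU time over all threads, so with a multi-threaded BLAS it can exceed the elapsed wall-clock time, just like the cputime versus tic/toc difference shown earlier.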

(87)

Conclusions I

In general I don’t think we should have too many required courses

However, some of them are very basic and are very useful in advanced topics

Some students do not think basic mathematics courses (e.g., calculus) are CS courses. But that may not be the case

When I evaluate applications for graduate school by checking transcripts, very often I first look at the calculus grade

(88)

Conclusions II

I hope that through this lecture you have seen that some mathematics techniques are closely related to CS topics
