(1)

Big-data Analytics: Challenges and Opportunities

Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at 台灣資料科學愛好者年會 (Taiwan Data Science Enthusiasts Annual Conference), August 30, 2014

(2)

Everybody talks about big data now, but it’s not easy to get an overall picture of the subject

In this talk, I will give some personal thoughts on technical developments in big-data analytics. Some are quite premature, so your comments are very welcome

(3)

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

(4)

From data mining to big data

(5)

From Data Mining to Big Data

In the early 1990s, a buzzword called data mining appeared

Many years later, we have another one called big data

Well, what’s the difference?

(6)

Status of Data Mining and Machine Learning

Over the years, we have all kinds of effective methods for classification, clustering, and regression

We also have good integrated tools for data mining (e.g., Weka, R, Scikit-learn)

However, mining useful information remains difficult for some real-world applications

(7)

What’s Big Data?

• Though many definitions are available, I consider the situation where the data are larger than the capacity of one computer

• I think this is a main difference between data mining and big data

• So in a sense we are talking about distributed data mining or machine learning

[Figure: (a), (b): distributed systems]

(8)

From Small to Big Data

Two important differences:

Negative side:

Methods for big-data analytics are not quite ready, not to mention integrated tools

Positive side:

Some (Halevy et al., 2009) argue that almost unlimited data make it easier to mine information

(9)

Challenges

(10)

Possible Advantages of Distributed Data Analytics

Parallel data loading

Reading several TB of data from disk is slow

Using 100 machines, each with 1/100 of the data on its local disk ⇒ 1/100 of the loading time

But having data ready in these 100 machines is another issue

Fault tolerance

(11)

Possible Advantages of Distributed Data Analytics (Cont’d)

Workflow not interrupted

If data are already stored in a distributed manner, it’s not convenient to gather some of them on one machine for analysis

(12)

Possible Disadvantages of Distributed Data Analytics

More complicated (of course)

Communication and synchronization

Everybody says to move computation to the data, but this isn’t that easy

(13)

Going Distributed or Not Isn’t Easy to Decide

Quote from Yann LeCun (KDnuggets News 14:n05)

“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”

Now disk and RAM are large. You may load several TB of data once and conveniently conduct all the analysis

The decision is application dependent

We will discuss this issue again later

(14)

Distributed Environments

Many easy tasks on one computer become difficult in a distributed environment

For example, subsampling is easy on one machine, but may not be in a distributed system

Usually we attribute the problem to slow communication between machines
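
To illustrate why subsampling needs more thought once data are spread over machines, the following sketch draws a uniform sample of fixed size k without gathering all records in one place. It is not from the talk: the function names (`local_candidates`, `distributed_sample`) and the random-key technique are assumptions chosen for illustration, and plain Python lists stand in for the data held by each machine.

```python
import heapq
import random


def local_candidates(partition, k, rng):
    """Runs on one machine: tag each record with a random key, keep the k smallest."""
    keyed = ((rng.random(), record) for record in partition)
    return heapq.nsmallest(k, keyed, key=lambda pair: pair[0])


def distributed_sample(partitions, k, seed=0):
    """Uniform sample of size k (without replacement) across all partitions."""
    # One generator simulates all machines here; real machines would each
    # use their own independent generator.
    rng = random.Random(seed)
    candidates = []
    for partition in partitions:
        # Only k (key, record) pairs per partition need to be communicated.
        candidates.extend(local_candidates(partition, k, rng))
    # The k globally smallest keys identify a uniform random subset.
    winners = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])
    return [record for _, record in winners]


partitions = [list(range(0, 40)), list(range(40, 70)), list(range(70, 100))]
sample = distributed_sample(partitions, k=5)
print(sample)
```

Each machine sends at most k candidates, so the communication cost depends on the sample size rather than on the data size. On one machine the same task is a single call to `random.sample`.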

(15)

Challenges

Big data, small analysis

versus

Big data, big analysis

If you need a single record from a huge set, it’s reasonably easy

For example, accessing your high-speed rail reservation is fast

However, if you want to analyze the whole set by accessing the data several times, it can be much harder

(16)

Challenges (Cont’d)

Most existing data mining/machine learning methods were designed without considering data access and communication of intermediate results

They use the data iteratively, assuming the data are readily available

Example: doing least-squares regression isn’t easy in a distributed environment
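
To make the least-squares example concrete, here is a sketch of the simplest case only: one feature plus an intercept. Each partition summarizes its data in five sums, and only those sums are communicated. The function names and the toy data are hypothetical. With d features the summaries become d × d matrices, and iterative solvers need a pass over the data and a round of communication per iteration, which is where the difficulty starts.

```python
def partition_stats(points):
    """Runs on one machine: summarize (x, y) pairs by five sums."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return n, sx, sy, sxx, sxy


def fit_line(partitions):
    """Least-squares fit of y = slope * x + intercept from per-partition sums."""
    # Element-wise sum of the five statistics over all partitions.
    n, sx, sy, sxx, sxy = (
        sum(values) for values in zip(*map(partition_stats, partitions))
    )
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data on the line y = 2x + 1, split over two "machines".
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```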

(17)

Challenges (Cont’d)

So we are facing many challenges:

methods not ready

no convenient tools

rapid change on the system side

and many others

What should we do?

(18)

Opportunities

(19)

Opportunities

It looks like we are in the early stage of a research topic

But what are our chances?

(20)

Opportunities Lessons from past developments in one machine

Outline

3 Opportunities

Lessons from past developments in one machine

Successful examples?

Design of big-data algorithms

(21)

Algorithms for Distributed Data Analytics

This is an on-going research topic.

Roughly there are two types of approaches

1 Parallelize existing (single-machine) algorithms

2 Design new algorithms particularly for distributed settings

Of course there are things in between

(22)

Algorithms for Distributed Data Analytics (Cont’d)

Given the complicated distributed setting, we wonder whether easy-to-use big-data analytics tools can ever be available

I don’t know either. Let’s try to think about the situation on one computer first

Indeed, those easy-to-use analytics tools on one computer were not there from the first day

(23)

Past Development on One Computer

The problem now is we take many things for granted on one computer

On one computer, have you ever worried about calculating the average of some numbers?

Probably not. You can use Excel, statistical software (e.g., R and SAS), and many other things

We seldom care how these tools work internally

Can we go back to see the early development on one computer and learn some lessons/experiences?
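
As a small illustration of how even the average stops being trivial once data are partitioned, the sketch below combines per-partition (sum, count) pairs and contrasts the result with naively averaging the partition averages. The function names and numbers are hypothetical.

```python
def partial_sum_count(values):
    """Runs on one machine: return (sum, count) for the local values."""
    return sum(values), len(values)


def distributed_mean(partitions):
    """Combine per-partition (sum, count) pairs into the global mean."""
    total, count = 0.0, 0
    for part_sum, part_count in map(partial_sum_count, partitions):
        total += part_sum
        count += part_count
    return total / count


partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
mean = distributed_mean(partitions)
# Averaging the partition averages is wrong when partition sizes differ.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print(mean, naive)
```

Only two numbers per machine are communicated, but the programmer has to know that the pair, not the local average alone, is the right summary.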

(24)

Example: Matrix-matrix Product

Consider the example of matrix-matrix products $C = A \times B$, with $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{d \times m}$, where

$$C_{ij} = \sum_{k=1}^{d} A_{ik} B_{kj}$$

This is a simple operation. You can easily write your own code

(25)

Example: Matrix-matrix Product (Cont’d)

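A segment of C code (assume n = d = m here)

    /* c = a * b for n-by-n matrices: straightforward triple loop */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            /* inner product of row i of a and column j of b */
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

For 3,000 × 3,000 matrices

    $ gcc -O3 mat.c
    $ time ./a.out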

(26)

Example: Matrix-matrix Product (Cont’d)

But on Matlab (single-thread mode)

$ matlab -singleCompThread

>> tic; c = a*b; toc

Elapsed time is 4.095059 seconds.

(27)

Example: Matrix-matrix Product (Cont’d)

How can Matlab be much faster than ours?

The fast implementation comes from some deep research and development

Matlab calls optimized BLAS (Basic Linear Algebra Subprograms), which was developed in the 1980s-1990s

Our implementation is slow because the data are often not in fast memory when the computation needs them

(28)

Example: Matrix-matrix Product (Cont’d)

CPU
↓
Registers
↓
Cache
↓
Main Memory

↑: increasing in speed

↓: increasing in capacity

Optimized BLAS: try to make data available in a higher level of memory

You don’t waste time frequently moving data

(29)

Example: Matrix-matrix Product (Cont’d)

Optimized BLAS uses block algorithms

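$$A \times B =
\begin{bmatrix} A_{11} & \cdots & A_{14} \\ \vdots & & \vdots \\ A_{41} & \cdots & A_{44} \end{bmatrix}
\begin{bmatrix} B_{11} & \cdots & B_{14} \\ \vdots & & \vdots \\ B_{41} & \cdots & B_{44} \end{bmatrix}
=
\begin{bmatrix} A_{11}B_{11} + \cdots + A_{14}B_{41} & \cdots \\ \vdots & \ddots \end{bmatrix}$$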



If we compare the number of page faults (cache misses)

Ours: much larger
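
The block idea can be sketched as follows. This is only an illustration of the loop structure, under the assumption of square matrices stored as Python lists; the function name `blocked_matmul` and the block size are hypothetical. Pure Python will not show the cache benefit, which appears in compiled code, and optimized BLAS involves far more tuning than this.

```python
def blocked_matmul(a, b, block=2):
    """C = A * B for square n-by-n lists of lists, computed block by block."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Multiply the (i0, k0) block of A by the (k0, j0) block of B
                # and accumulate into the (i0, j0) block of C.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        s = c[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(blocked_matmul(a, identity))
```

The three inner loops touch only one block of each matrix at a time, so in a compiled implementation those blocks can stay in cache while they are reused.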

(30)

Example: Matrix-matrix Product (Cont’d)

I like this example because it involves both mathematical operations (matrix products) and computer architecture (memory hierarchy)

Only by knowing both can you make breakthroughs

(31)

Example: Matrix-matrix Product (Cont’d)

For big-data analytics, we are in a similar situation

We want to run mathematical algorithms (classification and clustering) on a complicated architecture (distributed system)

But we are at a point like the time before optimized BLAS was developed

(32)

Algorithms and Systems

To have technical breakthroughs for big-data analytics, we should know both algorithms and systems well, and consider them together

Indeed, if you are an expert on both topics, everybody wants you now

Many machine learning Ph.D. students don’t know much about systems. But this wasn’t the case in the early days of computer science

(33)

Algorithms and Systems (Cont’d)

At that time, every numerical analyst knew computer architecture well.

That’s how they successfully developed floating-point systems and the IEEE 754/854 standards

(34)

Example: Machine Learning Using Spark

Recently we developed a classifier on Spark

Spark is an in-memory cluster-computing platform

Beyond algorithms, we must take details of Spark and Scala into account

For example, you want to know

the difference between mapPartitions and map in Spark, and
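
The distinction between the two operations can be sketched without Spark. In Spark, `map` applies a function to each record, while `mapPartitions` applies a function once per partition to an iterator over that partition's records. The helpers below (`map_records`, `map_partitions`, `expensive_setup`) are hypothetical stand-ins that only mimic these semantics on Python lists; they are not Spark's API.

```python
setup_calls = 0


def expensive_setup():
    """Stand-in for costly initialization (for example, loading a model)."""
    global setup_calls
    setup_calls += 1
    return 10  # hypothetical scaling factor


def map_records(partitions, func):
    """Mimics map: func is called once per record."""
    return [[func(record) for record in part] for part in partitions]


def map_partitions(partitions, func):
    """Mimics mapPartitions: func is called once per partition with an iterator."""
    return [list(func(iter(part))) for part in partitions]


def scale_record(record):
    return record * expensive_setup()  # setup repeated for every record


def scale_partition(records):
    factor = expensive_setup()  # setup done once per partition
    return (record * factor for record in records)


partitions = [[1, 2, 3], [4, 5]]
by_record = map_records(partitions, scale_record)
calls_map = setup_calls
setup_calls = 0
by_partition = map_partitions(partitions, scale_partition)
calls_map_partitions = setup_calls
print(by_record == by_partition, calls_map, calls_map_partitions)
```

Both versions produce the same output, but the per-record version pays the setup cost five times and the per-partition version twice, once for each partition.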

(35)

Example: Machine Learning Using Spark (Cont’d)

During our development, Spark was significantly upgraded from version 0.9 to 1.0. We had to learn the changes

It’s like writing code on a computer while the compiler or OS is actively changing. We are in a stage just like that.

(36)

Opportunities Successful examples?

(37)

Example of Distributed Machine Learning

I don’t think we have many successful examples yet

Here I will show one: CTR (Click Through Rate) prediction for computational advertising

Many companies now run distributed classification for CTR problems

(38)

Example: CTR Prediction

Definition of CTR:

$$\text{CTR} = \frac{\#\,\text{clicks}}{\#\,\text{impressions}}$$

A sequence of events:

Not clicked    Features of user
Clicked        Features of user
Not clicked    Features of user
· · ·          · · ·
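
As a worked instance of the definition, the sketch below computes the CTR of a small event log in which every event is one impression. The log format, the function name `click_through_rate`, and the feature names are made up for illustration. A CTR predictor would go further and train a classifier on the features to estimate the click probability of a new impression.

```python
def click_through_rate(events):
    """CTR = (# clicks) / (# impressions); each event is one impression."""
    impressions = len(events)
    clicks = sum(1 for clicked, _features in events if clicked)
    return clicks / impressions


# Hypothetical log: (clicked?, features of user).
events = [
    (False, {"age_group": "18-24", "device": "mobile"}),
    (True, {"age_group": "25-34", "device": "desktop"}),
    (False, {"age_group": "35-44", "device": "mobile"}),
    (False, {"age_group": "25-34", "device": "mobile"}),
]
ctr = click_through_rate(events)
print(ctr)
```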

(39)

Example: CTR Prediction (Cont’d)

(40)

Opportunities Design of big-data algorithms

(41)

Design Considerations

Generally you want to minimize the data access and communication in a distributed environment

It’s possible that

method A is better than B on one computer, but

method A is worse than B in distributed environments

(42)

Design Considerations (Cont’d)

Example: on one computer, often we do batch rather than online learning

Online and streaming learning may be more useful for big-data applications

Example: very often we design synchronous parallel algorithms

Maybe asynchronous ones are better for big data?
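
To show what online learning means in contrast to batch learning, the sketch below fits a line by visiting each record of a stream exactly once and then discarding it. It uses plain stochastic gradient descent on squared error, picked here only as a familiar example; the function names, the learning rate, and the noise-free toy stream are assumptions.

```python
def online_fit(stream, learning_rate=0.1):
    """One pass of stochastic gradient descent for y ≈ w * x + b."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        # Update from this single record; the record is then discarded.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b


def record_stream(n):
    """Hypothetical stream of noise-free records from y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


w, b = online_fit(record_stream(5000))
print(w, b)
```

The learner keeps only two numbers in memory regardless of the stream length, whereas a batch method would need repeated access to the whole data set.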

(43)

Workflow Issues

Data analytics is often only part of the workflow of a big-data application

By workflow, I mean things from raw data to final use of the results

Other steps may be more complicated than the analytics step

In the one-computer situation, the focus is often on the analytics step

(44)

How to Get Started?

In my opinion, we should start from applications

Applications → programming frameworks and algorithms → general tools

Now almost every big-data application requires special settings of algorithms, but I believe general tools will be possible

(45)

Discussion and conclusions

(46)

Risk of This Topic

It’s unclear how successful we can be

Two problems:

Technology limits

Applicability limits

(47)

Risk: Technology Limits

It’s possible that we cannot get satisfactory results because of the distributed configuration

Recall that parallel programming or HPC (high performance computing) wasn’t very successful in the early 1990s. But there are two differences this time

1 We are using commodity machines

2 Data become the focus

Well, every area has its limitation. The degree of success varies

(48)

Risk: Technology Limits (Cont’d)

Let’s compare two matrix products:

Dense matrix products: very successful as the final outcome (optimized BLAS) is much better than what ordinary users wrote

Sparse matrix products: not as successful. My code is about as good as those provided by Matlab

For big-data analytics, it’s too early to tell

We never know until we try

(49)

Risk: Applicability Limits

What’s the percentage of applications that need big-data analytics?

Not clear. Indeed some think the percentage is small (so they think big-data analytics is hype)

One main reason is that you can always analyze a random subset on one machine

But you may say this is a chicken-and-egg problem: are there no applications because there are no available tools?

(50)

Risk: Applicability Limits (Cont’d)

Another problem is misunderstanding

Until recently, few universities or companies could access data center environments. They therefore think the big companies (e.g., Google) are doing big-data analytics for everything

In fact, the situation isn’t like that

(51)

Risk: Applicability Limits (Cont’d)

A quote from Dan Ariely, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ...”

In my recent visit to a large company, their people did say that most analytics work is still done on one machine

(55)

Open-source Developments

Open-source developments are very important for big data analytics

How it works:

A company must build an application X. They consider an open-source tool Y, but Y is not enough for X. Then their engineers improve Y and submit pull requests

Through this process, core developers of a project are formed. They are from various companies

(56)

Open-source Developments (Cont’d)

For Taiwanese data-science companies, I think we should actively participate in such developments

Indeed, industry rather than schools is in a better position to do this

(57)

Conclusions

Big-data analytics is in its infancy

It’s challenging to develop algorithms and tools in a distributed environment

To start, we should take both algorithms and systems into consideration

Hopefully we will get some breakthroughs in the near future
