Distributing data storage and processing with frameworks

First steps in big data

5.1 Distributing data storage and processing with frameworks

New big data technologies such as Hadoop and Spark make it much easier to work with and control a cluster of computers. Hadoop can scale up to thousands of com-puters, creating a cluster with petabytes of storage. This enables businesses to grasp the value of the massive amount of data available.

Figure 5.1 Interactive Qlik dashboard

5.1.1 Hadoop: a framework for storing and processing large data sets

Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all of the following things and more:

■ Reliable—By automatically creating multiple copies of the data and redeploying processing logic in case of failure.

■ Fault tolerant —It detects faults and applies automatic recovery.

■ Scalable—Data and its processing are distributed over clusters of computers (horizontal scaling).

■ Portable—Installable on all kinds of hardware and operating systems.

The core framework is composed of a distributed file system, a resource manager, and a system to run distributed programs. In practice it allows you to work with the distrib-uted file system almost as easily as with the local file system of your home computer.

But in the background, the data can be scattered among thousands of servers.

THEDIFFERENTCOMPONENTSOF HADOOP

At the heart of Hadoop we find

■ A distributed file system (HDFS)

■ A method to execute programs on a massive scale (MapReduce)

■ A system to manage the cluster resources (YARN)

On top of that, an ecosystem of applications arose (figure 5.2), such as the databases Hive and HBase and frameworks for machine learning such as Mahout. We’ll use Hive in this chapter. Hive has a language based on the widely used SQL to interact with data stored inside the database.

Figure 5.2 A sample from the ecosystem of applications that arose around the Hadoop Core Framework

It’s possible to use the popular tool Impala to query Hive data up to 100 times faster.

We won’t go into Impala in this book, but more information can be found at http://

impala.io/. We already had a short intro to MapReduce in chapter 4, but let’s elabo-rate a bit here because it’s such a vital part of Hadoop.

MAPREDUCE: HOW HADOOPACHIEVESPARALLELISM

Hadoop uses a programming method called MapReduce to achieve parallelism. A MapReduce algorithm splits up the data, processes it in parallel, and then sorts, com-bines, and aggregates the results back together. However, the MapReduce algorithm isn’t well suited for interactive analysis or iterative programs because it writes the data to a disk in between each computational step. This is expensive when working with large data sets.

Let’s see how MapReduce would work on a small fictitious example. You’re the director of a toy company. Every toy has two colors, and when a client orders a toy from the web page, the web page puts an order file on Hadoop with the colors of the toy. Your task is to find out how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count the colors. First let’s look at a simplified version in figure 5.3.

As the name suggests, the process roughly boils down to two big phases:

■ Mapping phase—The documents are split up into key-value pairs. Until we reduce, we can have many duplicates.

■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occur-rences are grouped together, and depending on the reducing function, a differ-ent result can be created. Here we wanted a count per color, so that’s what the reduce function returns.

In reality it’s a bit more complicated than this though.

Green, Blue,

Figure 5.3 A simplified example of a MapReduce flow for counting the colors in input texts

The whole process is described in the following six steps and depicted in figure 5.4.

1 Reading the input files.

2 Passing each line to a mapper job.

3 The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has been encountered (value). Or more techni-cally said, it maps a key (the color) to a value (the number of occurrences).

4 The keys get shuffled and sorted to facilitate the aggregation.

5 The reduce phase sums the number of occurrences per color and outputs one file per key with the total number of occurrences for each color.

6 The keys are collected in an output file.

NOTE While Hadoop makes working with big data easy, setting up a good working cluster still isn’t trivial, but cluster managers such as Apache Mesos do ease the burden. In reality, many (mid-sized) companies lack the compe-tence to maintain a healthy Hadoop installation. This is why we’ll work with the Hortonworks Sandbox, a pre-installed and configured Hadoop ecosys-tem. Installation instructions can be found in section 1.5: An introductory working example of Hadoop.

Now, keeping the workings of Hadoop in mind, let’s look at Spark.

5.1.2 Spark: replacing MapReduce for better performance

Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it can take awhile until an algorithm converges to a solution. As this is a weak point of the MapReduce framework, we’ll introduce the Spark Framework to over-come it. Spark improves the performance on such tasks by an order of magnitude.

Input files

Figure 5.4 An example of a MapReduce flow for counting the colors in input texts

WHATIS SPARK?

Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn’t handle the storage of files on the (distributed) file system itself, nor does it handle the resource management. For this it relies on systems such as the Hadoop File System, YARN, or Apache Mesos. Hadoop and Spark are thus complementary sys-tems. For testing and development, you can even run Spark on your local system.

HOWDOES SPARKSOLVETHEPROBLEMSOF MAPREDUCE?

While we oversimplify things a bit for the sake of clarity, Spark creates a kind of shared RAM memory between the computers of your cluster. This allows the different workers to share variables (and their state) and thus eliminates the need to write the interme-diate results to disk. More technically and more correctly if you’re into that: Spark uses Resilient Distributed Datasets (RDD), which are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way.¹ Because it’s an in-memory system, it avoids costly disk operations.

THEDIFFERENTCOMPONENTSOFTHE SPARKECOSYSTEM

Spark core provides a NoSQL environment well suited for interactive, exploratory analysis. Spark can be run in batch and interactive mode and supports Python.

Spark has four other large components, as listed below and depicted in figure 5.5.

1 Spark streaming is a tool for real-time analysis.

2 Spark SQL provides a SQL interface to work with Spark.

3 MLLib is a tool for machine learning inside the Spark framework.

4 GraphX is a graph database for Spark. We’ll go deeper into graph databases in chapter 7.

Figure 5.5 The Spark framework when used in combination with the Hadoop framework

Now let’s dip our toes into loan data using Hadoop, Hive, and Spark.

在文檔中 Davy CielenArno D. B. MeysmanMohamed Ali (頁 141-146)