Storm Real-time Processing Cookbook

Efficiently process unbounded streams of data in real time

Quinton Anderson

BIRMINGHAM - MUMBAI

Storm Real-time Processing Cookbook

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.

However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2013

Production Reference: 1190813

Published by Packt Publishing Ltd.

Livery Place 35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78216-442-5 www.packtpub.com

Cover Image by Suresh Mogre ( [email protected] )

Credits

Author
Quinton Anderson

Reviewers
Maarten Ectors
Alexey Kachayev
Paco Nathan

Acquisition Editor
Usha Iyer

Lead Technical Editor
Madhuja Chaudhari

Technical Editors
Hardik B. Soni
Dennis John

Copy Editors
Mradula Hegde
Alfida Paiva
Laxmi Subramanian
Aditya Nair

Project Coordinator
Navu Dhillon

Proofreaders
Stephen Copestake
Clyde Jenkins

Indexer
Mariammal Chettiyar

Graphics
Abhinash Sahu

Production Coordinator
Prachali Bhiwandkar

Cover Work
Prachali Bhiwandkar

About the Author

Quinton Anderson is a software engineer with a background and focus on real-time computational systems. His career has been split between building real-time communication systems for defense systems and building enterprise applications within financial services and banking. Quinton does not align himself with any particular technology or programming language, but rather prefers to focus on sound engineering and polyglot development. He is passionate about open source, and is an active member of the Storm community; he has also enjoyed delivering various Storm-based solutions.

Quinton's next area of focus is machine learning; specifically, Deep Belief networks, as they pertain to robotics. Please follow his blog entries on Computational Theory, general IT concepts, and Deep Belief networks for more information.

You can find more information on Quinton via his LinkedIn profile (http://au.linkedin.com/pub/quinton-anderson/37/422/11b/) or, more importantly, view and contribute to the source code available at his GitHub (https://github.com/quintona) and Bitbucket (https://bitbucket.org/qanderson) accounts.

I would like to thank the Storm community for their efforts in building a truly awesome platform for the open source community; a special mention, of course, to the core author of Storm, Nathan Marz.

I would like to thank my wife and children for putting up with my long working hours spent on this book and other related projects. Your effort in making up for my absence is greatly appreciated, and I love you all very dearly. I would also like to thank all those who participated in the review process of this book.

About the Reviewers

Maarten Ectors is an executive who is an expert in cloud computing, big data, and disruptive innovations. Maarten's strengths are his combination of deep technical and business skills as well as strategic insights.

Currently, Maarten is responsible for the cloud strategy at Canonical—the company behind Ubuntu—where he is changing the future of cloud, big data, and other disruptive innovations. Previously, Maarten had his own company and was defining and executing the cloud strategy of a global mobile company. Maarten worked for Nokia Siemens Networks in several roles. He was heading cloud and disruptive innovation, founded Startups@NSN, was responsible for implementing offshoring in Europe, and so on. Earlier, he worked as the Director of professional services for Telcordia (now Ericsson) and as a Senior Project / Product Manager for a dotcom. Maarten started his career at Accenture, where he was active in Java developments, portals, mobile applications, content management, ecommerce, security, project management, and so on.

I would like to thank my family for always being there for me, especially my wonderful wife, Esther, and my great kids.

Alexey Kachayev began his development career in a small team creating an open source CMS for social networks. For over two years, he worked as a Software Engineer at CloudMade, developing geo-relative technology for enterprise clients in Python and Scala.

Currently, Alexey is the CTO at Attendify and is focused on development of a distributed applications platform in Erlang. He is an active speaker at conferences and an open source contributor (working on projects in Python, Clojure, and Haskell).

His areas of professional interest include distributed systems and algorithms, type theory, and functional language compilers.

I would like to thank Nathan Marz and the Storm project contributors team for developing such a great technology and spreading great ideas.

Paco Nathan is the Chief Scientist at Mesosphere in San Francisco. He is a recognized expert in Hadoop, R, Data Science, and Cloud Computing, and has led innovative data teams building large-scale apps for the past decade. Paco is an evangelist for the Mesos and Cascading open source projects. He is also the author of Enterprise Data Workflows with Cascading, O'Reilly. He has a blog about Data Science at http://liber118.com/pxn/ .

www.packtpub.com

Support files, eBooks, discount offers and more

You might want to visit www.packtpub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packtpub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packtpub.com , you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.packtpub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print and bookmark content
• On demand and accessible via web browser

Preface
Chapter 1: Setting Up Your Development Environment
    Introduction
    Setting up your development environment
    Distributed version control
    Creating a "Hello World" topology
    Creating a Storm cluster – provisioning the machines
    Creating a Storm cluster – provisioning Storm
    Deriving basic click statistics
    Unit testing a bolt
    Implementing an integration test
    Deploying to the cluster
Chapter 2: Log Stream Processing
    Introduction
    Creating a log agent
    Creating the log spout
    Rule-based analysis of the log stream
    Indexing and persisting the log data
    Counting and persisting log statistics
    Creating an integration test for the log stream cluster
    Creating a log analytics dashboard
Chapter 3: Calculating Term Importance with Trident
    Introduction
Chapter 4: Distributed Remote Procedure Calls
    Introduction
    Using DRPC to complete the required processing
    Integration testing of a Trident topology
    Implementing a rolling window topology
    Simulating time in integration testing
Chapter 5: Polyglot Topology
    Introduction
    Implementing the multilang protocol in Qt
    Implementing the SplitSentence bolt in Qt
    Implementing the count bolt in Ruby
    Defining the word count topology in Clojure
Chapter 6: Integrating Storm and Hadoop
    Introduction
    Implementing TF-IDF in Hadoop
    Persisting documents from Storm
    Integrating the batch and real-time views
Chapter 7: Real-time Machine Learning
    Introduction
    Implementing a transactional topology
    Creating a Random Forest classification model using R
    Operational classification of transactional streams using Random Forest
    Creating an association rules model in R
    Creating a recommendation engine
    Real-time online machine learning
Chapter 8: Continuous Delivery
    Introduction
    Setting up a CI server
    Setting up system environments
    Defining a delivery pipeline
    Implementing automated acceptance testing
Chapter 9: Storm on AWS
    Introduction
    Deploying Storm on AWS using Pallet
    Setting up a Virtual Private Cloud
    Deploying Storm into Virtual Private Cloud using Vagrant
Index

Preface

Open source has changed the software landscape in many fundamental ways. There are many arguments that can be made for and against using open source in any given situation, largely in terms of support, risk, and total cost of ownership. Open source is more popular in certain settings than others, such as research institutions versus large institutional financial service providers. Within the emerging areas of web service providers, content provision, and social networking, open source is dominating the landscape. This is true for many reasons, cost being a large one among them. These solutions that need to grow to "Web scale" have been classified as "Big Data" solutions, for want of a better term. These solutions serve millions of requests per second with extreme levels of availability, all the while providing customized experiences for customers across a wide range of services.

Designing systems at this scale requires us to think about problems differently, architect solutions differently, and learn where to accept complexity and where to avoid it. As an industry, we have come to grips with designing batch systems that scale. Large-scale computing clusters following MapReduce, Bulk Synchronous Parallel, and other computational paradigms are widely implemented and well understood. The surge of innovation has been driven and enabled by open source, leaving even the top software vendors struggling to integrate Hadoop into their technology stack, never mind trying to implement some level of competition to it.

Customers, however, have developed an insatiable desire for more: more data, more services, more value, and more convenience, and they want it now and at lower cost. As the sheer volume of data increases, the demand for real-time response increases too. The next phase of computational platforms has arrived, and it is focused on real-time processing at scale. It presents many unique challenges, both theoretical and practical.

This cookbook will help you master one such platform, the Storm processor. The Storm processor is an open source, real-time computational platform written by Nathan Marz at Backtype, a social analytics company. Backtype was later acquired by Twitter, and Storm was released as open source. It has since thrived in an ever-expanding open source community of users, contributors, and success stories within production sites. At the time of writing this preface, the project had more than 6,000 stars on GitHub and 3,000 Twitter followers, had been benchmarked at over a million transactions per second per node, and had almost 80 reference customers with production instances of Storm.

These are extremely impressive figures. Moreover, you will find by the end of this book that it is also extremely enjoyable to deliver systems based on Storm, using whichever language is congruent with your way of thinking and delivering solutions.

This book is designed to teach you Storm with a series of practical examples. These examples are grounded in real-world use cases, and introduce various concepts as the book unfolds.

Furthermore, the book is designed to promote DevOps practice around the Storm technology, enabling the reader to develop Storm solutions and deliver them reliably into production, where they create value.

An introduction to the Storm processor

A common criticism of open source projects is their lack of documentation. Storm does not suffer from this particular issue; the documentation for the project is excellent, well-written, and well-supplemented by the vibrant user community. This cookbook does not seek to duplicate this documentation but rather supplement it, driven largely by examples with conceptual and theoretical discussion where required. It is highly recommended that the reader take the time to read the Storm introductory documentation before proceeding to Chapter 1, Setting Up Your Development Environment, specifically the following pages of the Storm wiki:

• https://github.com/nathanmarz/storm/wiki/Rationale
• https://github.com/nathanmarz/storm/wiki/Concepts
• https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology

What this book covers

Chapter 1, Setting Up Your Development Environment, will demonstrate the process of setting up a local development environment for Storm; this includes all required tooling and suggested development workflows.

Chapter 2, Log Stream Processing, will lead the reader through the process of creating a log stream processing solution, complete with a base statistics dashboard and log-searching capability.

Chapter 3, Calculating Term Importance with Trident, will introduce the reader to Trident, a data-flow abstraction that works on top of Storm to enable highly productive enterprise data pipelines.

Chapter 4, Distributed Remote Procedure Calls, will teach the user how to implement distributed remote procedure calls.

Chapter 5, Polyglot Topology, will guide the reader to develop a polyglot topology and add new technologies to the list of already supported technologies.

Chapter 6, Integrating Storm and Hadoop, will guide the user through the process of integrating Storm with Hadoop, thus creating a complete Lambda architecture.

Chapter 7, Real-time Machine Learning, will provide the reader with a very basic introduction to machine learning as a topic, and present various approaches to implementing it in real-time projects based on Storm.

Chapter 8, Continuous Delivery, will demonstrate how to set up a Continuous Delivery pipeline and deliver a Storm cluster reliably into an environment.

Chapter 9, Storm on AWS, will guide the user through various approaches to automated provisioning of a Storm cluster into the Amazon Computing Cloud.

What you need for this book

This book assumes a base environment of Ubuntu or Debian. The first chapter will guide the reader through the process of setting up the remaining required tooling. If the reader does not use Ubuntu as a developer operating system, any *Nix-based system is preferred, as all the recipes assume the existence of a bash command interface.

Who this book is for

Storm Real-time Processing Cookbook is ideal for developers who would like to learn real-time processing or would like to learn how to use Storm for real-time processing. It's assumed that you are a Java developer. Clojure, C++, and Ruby experience would be useful but is not essential. It would also be useful to have some experience with Hadoop or similar technologies.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

A block of code is set as follows:

<repositories>
    <repository>
        <id>github-releases</id>
        <url>http://oss.sonatype.org/content/repositories/github-releases/</url>
    </repository>
    <repository>
        <id>clojars.org</id>
        <url>http://clojars.org/repo</url>
    </repository>
    <repository>
        <id>twitter4j</id>
        <url>http://twitter4j.org/maven2</url>
    </repository>
</repositories>

Any command-line input or output is written as follows:

mkdir FirstGitProject
cd FirstGitProject
git init

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Uncheck the Use default location checkbox."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to [email protected] , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors .

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com . If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Open source versions of the code are maintained by the author at his Bitbucket account:

https://bitbucket.org/qanderson .

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata , selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support .

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1: Setting Up Your Development Environment

In this chapter we will cover:

• Setting up your development environment
• Distributed version control
• Creating a "Hello World" topology
• Creating a Storm cluster – provisioning the machines
• Creating a Storm cluster – provisioning Storm
• Deriving basic click statistics
• Unit testing a bolt
• Implementing an integration test
• Deploying to the cluster

Introduction

This chapter provides a very basic and practical introduction to the Storm processor. This will cover everything, from setting up your development environment to basic operational concerns

This book does not provide a theoretical introduction to the Storm processor and its primitives and architecture. The author assumes that the readers have orientated themselves through online resources such as the Storm wiki.

Delivery of systems is only achieved once a system is delivering a business value in a production environment consistently and reliably. In order to achieve this, quality and operational concerns must always be taken into account while developing your Storm topologies.

Setting up your development environment

A development environment consists of all the tools and systems that are required in order to start building Storm topologies. This book focuses on the technology as used by an individual developer; the development environment for a software development team, be it centralized or distributed, would require much more tooling and process to be effective and is considered outside the scope of this book.

The following classes of tools and processes are required in order to effectively set up the development environment, not only from an on-going perspective, but also in terms of implementing the recipes in this book:

• SDK(s)
• Version control
• Build environment
• System provisioning tooling
• Cluster provisioning tooling

The provisioning and installation recipes in this book are based on Ubuntu; they are, however, quite portable to other Linux distributions. If you have any issues working with another distribution using these instructions, please seek support from the Storm mailing list at https://groups.google.com/forum/#!forum/storm-user .

Variation between environments is the enemy of maintainable and available systems. Developing on one environment type and deploying on another is a very risky example of such variation. Developing on your target type should be done whenever possible.

How to do it…

1. Download the latest J2SE 6 SDK from Oracle's website (http://www.oracle.com/technetwork/java/javase/downloads/index.html) and install it as follows:

chmod 775 jdk-6u35-linux-x64.bin
yes | ./jdk-6u35-linux-x64.bin
sudo mv jdk1.6.0_35 /opt
sudo ln -s /opt/jdk1.6.0_35/bin/java /usr/bin
sudo ln -s /opt/jdk1.6.0_35/bin/javac /usr/bin
JAVA_HOME=/opt/jdk1.6.0_35
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH

2. The version control system, Git, must then be installed:

sudo apt-get install git

3. The installation should then be followed by Maven, the build system:

sudo apt-get install maven

4. Puppet, Vagrant, and VirtualBox must then be installed in order to provide application and environment provisioning:

sudo apt-get install virtualbox puppet vagrant

5. Finally, you need to install an IDE:

sudo apt-get install eclipse

There has been an ongoing debate about which distribution of the Java SDK should be used since Sun was acquired by Oracle. While the author understands the need for OpenJDK, the recipes in this book have been tested using the Oracle JDK. In general, there is little practical difference between OpenJDK and the Oracle JDK, apart from the Oracle JDK being more stable but lagging behind in terms of features.
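Once the JDK is installed, it can be worth confirming that the java and javac commands on your path belong to the installation you expect. The following is a minimal sketch rather than part of the recipe; the class name JdkCheck and the majorVersion helper are hypothetical names introduced here for illustration. It relies only on the standard java.version and java.home system properties and the JAVA_HOME environment variable.

```java
public class JdkCheck {

    // Extracts the major release from a java.version string.
    // Handles the legacy "1.6.0_35" scheme as well as the newer "11.0.2" scheme.
    public static int majorVersion(String version) {
        String v = version.startsWith("1.") ? version.substring(2) : version;
        int end = 0;
        while (end < v.length() && Character.isDigit(v.charAt(end))) {
            end++;
        }
        return Integer.parseInt(v.substring(0, end));
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version  = " + version);
        System.out.println("java.home     = " + System.getProperty("java.home"));
        // JAVA_HOME is null if the variable was not exported in this shell.
        System.out.println("JAVA_HOME     = " + System.getenv("JAVA_HOME"));
        System.out.println("major version = " + majorVersion(version));
    }
}
```

Compile it with javac JdkCheck.java and run it with java JdkCheck. With the JDK from step 1 on the path, the reported major version should be 6, and JAVA_HOME should point at the directory under /opt.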

How it works…

The JDK is obviously required for any Java development to take place. Git is an open source distributed version control system that has received wide adoption in recent years. A brief introduction to Git will be presented shortly.

Maven is a widely used build system that prefers convention over configuration. Maven includes many useful features including the Project Object Model (POM), which allows us to manage our libraries, dependencies, and versions in an effective manner. Maven is backed

Within the growing arena of DevOps and Continuous Delivery, the Puppet system is widely used to provide declarative server provisioning of Linux and other operating systems and applications. Puppet provides us with the ability to program the state of our servers and deployment environments. This is important because the state of our servers can then be maintained within a version control system such as Git, and manual changes to servers can be safely removed. This provides many advantages, including a deterministic Mean Time to Recovery (MTTR) and an audit trail, which in general make systems more stable.

This is also an important step on the path towards continuous delivery.
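To make the idea of declarative provisioning more concrete, the following plain-Java sketch compares a declared state with an actual state and plans the actions needed to converge them. It is only an illustration of the concept under simplifying assumptions: it does not use Puppet or its language, nothing is executed on a server, and the class name DesiredStateSketch, the plan method, and the package names and versions are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DesiredStateSketch {

    // Plans the actions needed to move a server from its actual state to the
    // declared state. Keys are package names, values are versions. Nothing is
    // executed here; the method only describes what would change.
    public static List<String> plan(Map<String, String> desired, Map<String, String> actual) {
        List<String> actions = new ArrayList<String>();
        for (Map.Entry<String, String> want : desired.entrySet()) {
            String have = actual.get(want.getKey());
            if (have == null) {
                actions.add("install " + want.getKey() + " " + want.getValue());
            } else if (!have.equals(want.getValue())) {
                actions.add("change " + want.getKey() + " " + have + " -> " + want.getValue());
            }
        }
        // Anything present but not declared is treated as a manual change and removed.
        for (String name : actual.keySet()) {
            if (!desired.containsKey(name)) {
                actions.add("remove " + name);
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        // Hypothetical package names and versions, for illustration only.
        Map<String, String> desired = new TreeMap<String, String>();
        desired.put("git", "1.7.9");
        desired.put("maven", "3.0.4");

        Map<String, String> actual = new TreeMap<String, String>();
        actual.put("git", "1.7.0");
        actual.put("telnetd", "0.17");

        for (String action : plan(desired, actual)) {
            System.out.println(action);
        }
        // Once the server matches the declaration, the plan is empty,
        // so applying the same declaration again changes nothing.
        System.out.println("Plan after convergence: " + plan(desired, desired));
    }
}
```

The design point is that the declaration, not a sequence of manual commands, is the source of truth. Because the declaration is just data, it can live in version control, and re-applying it to a server that already matches is a no-op.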

Vagrant is a very useful tool within development environments. It allows the automation of provisioning of VirtualBox virtual machines. Within the context of the Storm processor, this is important, given that it is a cluster-based technology. In order to test a cluster, you must either build an actual cluster of machines or provision many virtual machines. Vagrant allows us to do this locally in a deterministic and declarative way.

A virtual machine is an extremely useful abstraction within IT infrastructure, operations, and development. However, while reduced performance is expected and acceptable within locally hosted VMs, their usability depends entirely on the availability of RAM. Processing power is not a key concern at first, given that most modern processors are heavily underutilized, although this is not necessarily the case once your topologies are working. It is recommended that you ensure your computer has at least 8 GB of RAM.

Distributed version control

Traditional version control systems are centralized. Each client contains a checkout of the files at their current version, depending on what branch the client is using. All previous versions are stored on the server. This has worked well, in that it allows teams to collaborate closely and know to some degree what other members of the team are doing.

Centralized servers have some distinct drawbacks that have led to the rise of distributed version control systems. Firstly, the centralized server represents a single point of failure; if the server goes down or becomes unavailable for any reason, it becomes difficult for developers to work using their existing workflows. Secondly, if the data on the server is corrupted or lost for any reason, the history of the code base is lost.

Open source projects have been a large driver of distributed version control, for both of these reasons, but mostly because of the collaboration models that distribution enables. Developers can follow a disciplined set of workflows in their local environments and then distribute these changes to one or many remote repositories when it is convenient to do so, in both a flat and a hierarchical manner.

The obvious additional advantage is that there naturally exist many backups of the repository because each client has a complete mirror of the repository; therefore, if any client or server dies, it can simply be replicated back, once it has been restored.

How to do it…

Git is used in this book as the distributed version control system. In order to create a repository, you need to either clone or initialize a repository. For a new project that you create, the repository should be initialized.

1. First, let's create our project directory, as follows:

mkdir FirstGitProject
cd FirstGitProject
git init

2. In order to test if the workflow is working, we need some files in our repository.

touch README.txt
vim README.txt

Using vim, or any other text editor, add some descriptive text. In vim, press the Insert key (or i) first to enter insert mode. Once you have finished typing, hit the Esc key, type a colon followed by wq, and then hit the Enter key.

3. Before you commit, review the status of the repository.

git status

This should give you an output that looks similar to the following:

# On branch master
# Initial commit
# Untracked files:
# README.txt

4. Git requires that you add all files and folders manually; you can do it as follows:

git add README.txt

5. Then commit the file using the following:

git commit -a

6. This will open a vim editor and allow you to add your commit message.

Without pushing this repository to a remote host, you will essentially be placing it under the same risk as that of a centralized host. It is therefore important to push the repository to a remote host. Both www.github.com and www.bitbucket.org are good options for free-hosted Git services, providing that you aren't pushing your corporate intellectual property there for public consumption. This book uses bitbucket.org . In order to push your repository to this remote host, simply navigate there in your browser and sign up for an account.

Once the registration process is complete, create a new repository using the menu system.

Enter the following values in order to create the repository:

Once the repository is created, you need to add the remote repository to your local repository and push the changes to the remote repository.

git remote add origin https://[user]@bitbucket.org/[user]/firstgitproject.git
git push origin master

You must replace [user] in the preceding command with your registered username.

Cloning of a repository will be covered in later recipes, as will some standard version control workflows.

Creating a "Hello World" topology

The "Hello World" topology, as with all "Hello World" applications, is of no real use to anyone, except to illustrate some really basic concepts. The "Hello World" topology will show how to create a Storm project including a simple spout and bolt, build it, and execute it in the local cluster mode.
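Before building the real topology, it may help to see the shape of the data flow in isolation. The following plain-Java sketch simulates a source that emits messages based on random probability and a consumer that counts one kind of message. It deliberately uses no Storm classes, so it is not the recipe's implementation; the class names, the two message strings, and the run method are assumptions made for illustration.

```java
import java.util.Random;

public class HelloWorldFlowSketch {

    // Stand-in for a spout: produces one message per call.
    static class SentenceSource {
        private final Random random;
        private final int bound;
        private final int reference;

        SentenceSource(long seed, int bound) {
            this.random = new Random(seed);
            this.bound = bound;
            // Fixed reference value that later random draws are compared against.
            this.reference = random.nextInt(bound);
        }

        String next() {
            // Emits "Hello World" only when the draw matches the reference value.
            return random.nextInt(bound) == reference ? "Hello World" : "Other Random Word";
        }
    }

    // Stand-in for a bolt: consumes messages and keeps a running count.
    static class HelloCounter {
        private int count;

        void process(String message) {
            if ("Hello World".equals(message)) {
                count++;
            }
        }

        int getCount() {
            return count;
        }
    }

    // Wires the source to the counter and pushes the given number of messages through.
    public static int run(long seed, int bound, int messages) {
        SentenceSource source = new SentenceSource(seed, bound);
        HelloCounter counter = new HelloCounter();
        for (int i = 0; i < messages; i++) {
            counter.process(source.next());
        }
        return counter.getCount();
    }

    public static void main(String[] args) {
        System.out.println("Hello World seen " + run(42L, 10, 1000) + " times");
    }
}
```

In a real topology the loop in run is replaced by Storm itself: the cluster calls the spout repeatedly and routes each emitted tuple to the bolt, possibly on another thread or machine. The sketch only shows the division of responsibility between the two components.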

How to do it…

1. Create a new project folder and initialize your Git repository.

mkdir HelloWorld
cd HelloWorld
git init

2. We must then create the Maven project file as follows:

vim pom.xml

3. Using vim , or any other text editor, you need to create the basic XML tags and project metadata for the "Hello World" project.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>hello-world</name>
    <url>https://bitbucket.org/[user]/hello-world</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
</project>

4. We then need to declare which Maven repositories we need to fetch our dependencies from. Add the following to the pom.xml file within the project tags:

<repositories>
    <repository>
        <id>github-releases</id>
        <url>http://oss.sonatype.org/content/repositories/github-releases/</url>
    </repository>
    <repository>
        <id>clojars.org</id>
        <url>http://clojars.org/repo</url>
    </repository>
    <repository>
        <id>twitter4j</id>
        <url>http://twitter4j.org/maven2</url>
    </repository>
</repositories>

You can override these repositories using your .m2 and settings.xml files, the details of which are outside the scope of this book; however, this is extremely useful within development teams where dependency management is key.

5. We then need to declare our dependencies by adding them within the project tags:

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>storm</groupId>
        <artifactId>storm</artifactId>
        <version>0.8.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.googlecode.json-simple</groupId>
        <artifactId>json-simple</artifactId>
        <version>1.1</version>
    </dependency>
</dependencies>

6. Finally we need to add the build plugin definitions for Maven:

<build>

<plugins>

<!--

bind the maven-assembly-plugin to the package phase this will create a jar file without the Storm dependencies suitable for deployment to a cluster.

-->

<plugin>

<artifactId>maven-assembly-plugin</artifactId>

<configuration>

<descriptorRefs>

<descriptorRef>jar-with-dependencies</descriptorRef>

</descriptorRefs>

<archive>

<manifest>

<mainClass></mainClass>

</manifest>

</archive>

</configuration>

<executions>

<execution>

<id>make-assembly</id>

<phase>package</phase>

<goals>

<goal>single</goal>

</goals>

</execution>

</executions>

</plugin>

<plugin>

<groupId>com.theoryinpractise</groupId>

<artifactId>clojure-maven-plugin</artifactId>

<version>1.3.8</version>

<extensions>true</extensions>

<configuration>

<sourceDirectories>

<sourceDirectory>src/clj</sourceDirectory>

</sourceDirectories>

</configuration>

<executions>

<execution>

<id>compile</id>

<phase>compile</phase>

<goals>

<goal>compile</goal>

</goals>

</execution>

<execution>

<id>test</id>

<phase>test</phase>

<goals>

<goal>test</goal>

</goals>

</execution>

</executions>

</plugin>

<plugin>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-compiler-plugin</artifactId>

<configuration>

<source>1.6</source>

<target>1.6</target>

</configuration>

</plugin>

</plugins>

</build>

7. With the POM file complete, save it using the Esc + : + wq + Enter key sequence and complete the required folder structure for the Maven project:

mkdir src

cd src

mkdir test

mkdir main

cd main

mkdir java

8. Then return to the project root folder and generate the Eclipse project files using the following:

mvn eclipse:eclipse

The Eclipse project files are a generated artifact, much like a .class file, and should not be included in your Git check-ins, especially since they contain client-machine-specific paths.

9. You must now start your Eclipse environment and import the generated project files into the workspace.

10. You must then create your first spout by creating a new class named HelloWorldSpout, which extends BaseRichSpout and is located in the storm.cookbook package. Eclipse will generate default method stubs for the spout. The spout will simply generate tuples based on random probability. Create the following member variables and construct the object:

private SpoutOutputCollector collector;

private int referenceRandom;

private static final int MAX_RANDOM = 10;

public HelloWorldSpout(){

final Random rand = new Random();

referenceRandom = rand.nextInt(MAX_RANDOM);

}

11. After construction, the Storm cluster will open the spout; provide the following implementation for the open method:

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {

this.collector = collector;

}

12. The Storm cluster will repeatedly call the nextTuple method, which will do all the work of the spout. Provide the following implementation for the method:

Utils.sleep(100);

final Random rand = new Random();

int instanceRandom = rand.nextInt(MAX_RANDOM);

if(instanceRandom == referenceRandom){

collector.emit(new Values("Hello World"));

} else {

collector.emit(new Values("Other Random Word"));

}

13. Finally, you need to tell the Storm cluster which fields this spout emits within the declareOutputFields method:

declarer.declare(new Fields("sentence"));

14. Once you have resolved all the required imports for the class, you need to create HelloWorldBolt. This class will consume the produced tuples and implement the required counting logic. Create the new class within the storm.cookbook package; it should extend the BaseRichBolt class. Declare a private integer member variable named myCount and provide the following implementation for the execute method, which does the work for this bolt:

String test = input.getStringByField("sentence");

if("Hello World".equals(test)){

myCount++;

System.out.println("Found a Hello World! My Count is now: "

+ Integer.toString(myCount));

}

15. Finally, you need to bring the elements together and declare the Storm topology. Create a main class named HelloWorldTopology within the same package and provide the following main implementation:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("randomHelloWorld", new HelloWorldSpout(), 10);

builder.setBolt("HelloWorldBolt", new HelloWorldBolt(), 2).shuffleGrouping("randomHelloWorld");

Config conf = new Config();

conf.setDebug(true);

if(args!=null && args.length > 0) {

conf.setNumWorkers(3);

StormSubmitter.submitTopology(args[0], conf, builder.createTopology());

} else {

LocalCluster cluster = new LocalCluster();

cluster.submitTopology("test", conf,

builder.createTopology());

Utils.sleep(10000);

cluster.killTopology("test");

cluster.shutdown();

}

This will essentially set up the topology and submit it to either a local or remote Storm cluster, depending on the arguments passed to the main method.

16. After you have resolved the compiler issues, you can run the topology by issuing the following command from the project's root folder:

mvn compile exec:java -Dexec.classpathScope=compile -Dexec.

How it works…

The following diagram describes the "Hello World" topology:

[Figure: the "Hello World" topology, in which Hello World Spout instances feed Hello World Bolt instances]

The spout essentially emits a stream containing one of the following two sentences:

• Other Random Word

• Hello World

The choice is made at random. The spout generates a reference random number upon construction and, on each subsequent call to nextTuple, generates a new random number to test against the reference value held in the member variable. When the two match, Hello World is emitted; during the remaining executions, Other Random Word is emitted.
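The selection rule can be tried outside Storm. The following is a minimal sketch rather than the spout itself: the SentenceChooser class is a hypothetical name introduced for illustration, it uses no Storm classes, and it shares one seeded Random instead of creating a new one on every call as the recipe does:

```java
import java.util.Random;

// Stand-alone sketch of the spout's selection rule (hypothetical class, no Storm types).
final class SentenceChooser {
    static final int MAX_RANDOM = 10;

    private final int referenceRandom;
    private final Random rand;

    SentenceChooser(Random rand) {
        this.rand = rand;
        // Drawn once at construction, as in the HelloWorldSpout constructor.
        this.referenceRandom = rand.nextInt(MAX_RANDOM);
    }

    // Mirrors the body of nextTuple(): draw again and compare with the reference value.
    String nextSentence() {
        int instanceRandom = rand.nextInt(MAX_RANDOM);
        return instanceRandom == referenceRandom ? "Hello World" : "Other Random Word";
    }

    public static void main(String[] args) {
        // A fixed seed is an assumption made here so that runs are repeatable.
        SentenceChooser chooser = new SentenceChooser(new Random(42L));
        int draws = 10_000;
        int hits = 0;
        for (int i = 0; i < draws; i++) {
            if (chooser.nextSentence().equals("Hello World")) {
                hits++;
            }
        }
        System.out.println("Hello World chosen " + hits + " times in " + draws + " draws");
    }
}
```

Because the reference number is one of MAX_RANDOM equally likely values, roughly one call in ten should select Hello World.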

The bolt simply matches and counts the instances of Hello World . In the current implementation, you will notice sequential increments being printed from the bolt.

In order to scale this bolt, you simply need to increase the parallelism hint for the topology by updating the following line:

builder.setBolt("HelloWorldBolt", new HelloWorldBolt(), 3).shuffleGrouping("randomHelloWorld");

The key parameter here is parallelism_hint, which you can adjust upwards. If you run the topology again, you will then notice three separate counts that are printed independently and interleaved with each other.
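The reason for the separate counts is that each bolt task keeps its own myCount. The following sketch illustrates this under stated assumptions: ShuffleCountSketch and distribute are hypothetical names, and picking a task at random for each tuple only approximates how Storm's shuffle grouping spreads tuples across tasks:

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in for several bolt tasks, each with its own private counter.
final class ShuffleCountSketch {

    // Routes each matching tuple to one randomly chosen task and returns the per-task counts.
    static int[] distribute(int helloWorldTuples, int tasks, Random rand) {
        int[] myCount = new int[tasks]; // one counter per task, never shared
        for (int i = 0; i < helloWorldTuples; i++) {
            int task = rand.nextInt(tasks);
            myCount[task]++;
        }
        return myCount;
    }

    public static void main(String[] args) {
        int[] counts = distribute(30, 3, new Random(1L));
        System.out.println("Per-task counts: " + Arrays.toString(counts));
        System.out.println("Sum over tasks:  " + Arrays.stream(counts).sum());
    }
}
```

No single task sees every tuple, so a topology-wide total would need a further aggregation step; the per-task counts only add up to the total when summed.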

You can scale a cluster after deployment by updating these hints using the Storm GUI or CLI; however, you can't change the topology structure without recompiling and redeploying the JAR. For the command-line option, please see the CLI documentation on the wiki available at the following link:

https://github.com/nathanmarz/storm/wiki/Command-line-client

It is important to ensure that your project dependencies are declared correctly within your POM.

The Storm JARs must be declared with the provided scope; otherwise, they will be packaged into your JAR, which results in duplicate class files on the classpath within a deployed node of the cluster. Note that Storm checks for this classpath duplication; it will fail to start if you have included Storm in your distribution.
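A duplication check of this kind can be sketched with the standard class loader API. This is an illustration, not Storm's own code: ClasspathDuplicateCheck is a hypothetical class, and the defaults.yaml marker resource is an assumption about what the Storm JAR contains:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that reports how many classpath entries provide a given resource.
final class ClasspathDuplicateCheck {

    // Returns every classpath location that provides the named resource.
    static List<URL> locate(String resourceName) {
        try {
            ClassLoader loader = ClasspathDuplicateCheck.class.getClassLoader();
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // More than one location means the resource was packaged twice.
    static boolean isDuplicated(String resourceName) {
        return locate(resourceName).size() > 1;
    }

    public static void main(String[] args) {
        // Assumed marker resource; pass another name as the first argument to check it instead.
        String name = args.length > 0 ? args[0] : "defaults.yaml";
        List<URL> hits = locate(name);
        System.out.println(name + " found " + hits.size() + " time(s): " + hits);
        if (hits.size() > 1) {
            System.out.println("Storm is probably bundled into the topology JAR");
        }
    }
}
```

Running a check like this against the assembled JAR before deployment is one way to catch a missing provided scope early.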

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Open source versions of the code are maintained by the author at his Bitbucket account at https://bitbucket.org/qanderson.

Creating a Storm cluster – provisioning the machines

Testing the cluster in the local mode is useful for debugging and verifying the basic functional logic of the cluster. It doesn't, however, give you a realistic view as to the operation of the cluster.

Moreover, any development effort is only complete once the system is running in a production environment. This is a key consideration for any developer and is the cornerstone of the entire DevOps movement; regardless of the methodology, however, you must be able to reliably deploy your code into an environment. This recipe demonstrates how to create and provision an entire cluster directly from version control. There are many key principles in doing this:

• The state of any given server must be known at all times. It isn't acceptable that people can log into a server and make changes to its settings or files without strict version control being in place.

• Servers should be fundamentally immutable, with the state in some kind of separate volume. This allows deterministic recovery times of a server.

• If something causes problems in the delivery process, do it more often. In software development and IT operations, this applies heavily to disaster recovery and integration. Both tasks can only be performed often if they are automated.

• This book assumes that your destination production environment is a cluster (based on Amazon Web Services (AWS) EC2), which enables automatic scaling. Elastic auto-scaling is only possible where provisioning is automated.

How to do it...

Let's start by creating a new project as follows:

1. Create a new project named vagrant-storm-cluster. The later steps assume that its root contains three folders: data, manifests, and scripts.

2. Using your favorite editor, create a file in the project root called Vagrantfile . Inside the file, you must create the file header and the configuration for the virtual machines that we want to create. We need at least one nimbus node, two supervisor nodes, and a zookeeper node:

# -*- mode: ruby -*-

# vi: set ft=ruby :

boxes = [

{ :name => :nimbus, :ip => '192.168.33.100', :cpus =>2, :memory => 512 },

{ :name => :supervisor1, :ip => '192.168.33.101', :cpus =>4, :memory => 1024 },

{ :name => :supervisor2, :ip => '192.168.33.102', :cpus =>4, :memory => 1024 },

{ :name => :zookeeper1, :ip => '192.168.33.201', :cpus =>1, :memory => 512 }

]

Note that the use of a single zookeeper node is only for development environments, as this cluster is not highly available. The purpose of this cluster is to test your topology logic in a realistic setting and identify stability issues.

3. You must then create the virtual machine provisioning for each machine, specialized by the previous configuration at execution time. The first set of properties defines the hardware, networking, and operating system:

Vagrant::Config.run do |config|

boxes.each do |opts|

config.vm.define opts[:name] do |config|

config.vm.box = "ubuntu12"

config.vm.box_url = "http://dl.dropbox.com/u/1537815/precise64.box"

config.vm.network :hostonly, opts[:ip]

config.vm.host_name = "storm.%s" % opts[:name].to_s

config.vm.share_folder "v-data", "/vagrant_data", "./data", :transient => false

config.vm.customize ["modifyvm", :id, "--memory", opts[:memory]]

config.vm.customize ["modifyvm", :id, "--cpus", opts[:cpus] ] if opts[:cpus]

4. The provisioning of the application is then configured using a combination of bash and Puppet scripts:

config.vm.provision :shell, :inline => "cp -fv /vagrant_data/hosts /etc/hosts"

config.vm.provision :shell, :inline => "apt-get update"

# Check if the jdk has been provided

if File.exist?("./data/jdk-6u35-linux-x64.bin") then

config.vm.provision :puppet do |puppet|

puppet.manifests_path = "manifests"

puppet.manifest_file = "jdk.pp"

end

end

config.vm.provision :puppet do |puppet|

puppet.manifests_path = "manifests"

puppet.manifest_file = "provisioningInit.pp"

end

# Ask puppet to do the provisioning now.

config.vm.provision :shell, :inline => "puppet apply /tmp/storm-puppet/manifests/site.pp --verbose --modulepath=/tmp/storm-puppet/modules/ --debug"

end

end

end

The Vagrantfile simply defines the hypervisor-level configuration and provisioning; the remaining provisioning is done through Puppet and is defined at two levels. The first level makes the base Ubuntu installation ready for application provisioning. The second level contains the actual application provisioning. In order to create the first level of provisioning, you need to create the JDK provisioning bash script and the provisioning initialization Puppet script.

5. In the scripts folder of the project, create the installJdk.sh file and populate it with the following code:

#!/bin/sh

echo "Installing JDK!"

chmod 775 /vagrant_data/jdk-6u35-linux-x64.bin

cd /root

yes | /vagrant_data/jdk-6u35-linux-x64.bin

/bin/mv /root/jdk1.6.0_35 /opt

/bin/rm -rv /usr/bin/java

/bin/rm -rv /usr/bin/javac

/bin/ln -s /opt/jdk1.6.0_35/bin/java /usr/bin

/bin/ln -s /opt/jdk1.6.0_35/bin/javac /usr/bin

JAVA_HOME=/opt/jdk1.6.0_35

export JAVA_HOME

PATH=$PATH:$JAVA_HOME/bin

export PATH

This will simply be invoked by the Puppet script in a declarative manner.

6. In the manifests folder, create a file called jdk.pp:

$JDK_VERSION = "1.6.0_35"

package {"openjdk":

ensure => absent,

}

exec { "installJdk":

command => "installJdk.sh",

path => "/vagrant/scripts",

logoutput => true,

creates => "/opt/jdk${JDK_VERSION}",

}

7. In the manifests folder, create the provisioningInit.pp file and define the required packages and static variable values:

$CLONE_URL = "https://bitbucket.org/qanderson/storm-puppet.git"

$CHECKOUT_DIR="/tmp/storm-puppet"

package {git:ensure=> [latest,installed]}

package {puppet:ensure=> [latest,installed]}

package {ruby:ensure=> [latest,installed]}

package {rubygems:ensure=> [latest,installed]}

package {unzip:ensure=> [latest,installed]}

exec { "install_hiera":

command => "gem install hiera hiera-puppet",

path => "/usr/bin",

require => Package['rubygems'],

}

For more information on Hiera, please see the Puppet documentation page at http://docs.puppetlabs.com/hiera/1/index.html.

8. You must then clone the repository, which contains the second level of provisioning:

exec { "clone_storm-puppet":

command => "git clone ${CLONE_URL}",

cwd => "/tmp",

path => "/usr/bin",

creates => "${CHECKOUT_DIR}",

require => Package['git'],

}

9. You must now configure a Puppet plugin called Hiera, which is used to externalize properties from the provisioning scripts in a hierarchical manner:

exec {"/bin/ln -s /var/lib/gems/1.8/gems/hiera-puppet-1.0.0/ /tmp/storm-puppet/modules/hiera-puppet":

creates => "/tmp/storm-puppet/modules/hiera-puppet",

require => [Exec['clone_storm-puppet'],Exec['install_hiera']]

}

#install hiera and the storm configuration

file { "/etc/puppet/hiera.yaml":

source => "/vagrant_data/hiera.yaml",

replace => true,

require => Package['puppet']

}

file { "/etc/puppet/hieradata":

ensure => directory,

require => Package['puppet']

}

file {"/etc/puppet/hieradata/storm.yaml":

source => "${CHECKOUT_DIR}/modules/storm.yaml",

replace => true,

require => [Exec['clone_storm-puppet'],File['/etc/puppet/

10. Finally, you need to populate the data folder. Create the Hiera base configuration file, hiera.yaml :

---

:hierarchy:

  - %{operatingsystem}

  - storm

:backends:

  - yaml

:yaml:

  :datadir: '/etc/puppet/hieradata'

11. The final data file required is the hosts file, which acts as the DNS in our local cluster:

127.0.0.1 localhost

192.168.33.100 storm.nimbus

192.168.33.101 storm.supervisor1

192.168.33.102 storm.supervisor2

192.168.33.103 storm.supervisor3

192.168.33.104 storm.supervisor4

192.168.33.105 storm.supervisor5

192.168.33.201 storm.zookeeper1

192.168.33.202 storm.zookeeper2

192.168.33.203 storm.zookeeper3

192.168.33.204 storm.zookeeper4

The hosts file is not required in properly configured environments; however, it works nicely in our local "host only" development network.

The project is now complete, in that it will provision the correct virtual machines and install the base required packages; however, we need to create the Application layer provisioning, which is contained in a separate repository.

12. Initialize your Git repository for this project and push it to bitbucket.org .

How it works...

Provisioning is performed on three distinct layers:

[Figure: the three provisioning layers, from top to bottom: Application, Guest, Hypervisor]

This recipe only works in the bottom two layers, with the Application layer presented in the next recipe. A key reason for the separation is that you will typically create different provisioning at these layers depending on the Hypervisor you are using for deployment. Once the VMs are provisioned, however, the application stack provisioning should be consistent through all your environments. This is key, in that it allows us to test our deployments hundreds of times before we get to production, and ensure that they are in a repeatable and version-controlled state.

In the development environment, VirtualBox is the Hypervisor, with Vagrant and Puppet providing the provisioning. Vagrant works by specializing a VirtualBox base image. This base image represents a version-controlled artifact. For each box defined in our Vagrantfile, the following parameters are specified:

• The base box

• The network settings

• Shared folders

• Memory and CPU settings for the VM

This base provisioning does not include any of the baseline controls you would expect in a production environment, such as security, access controls, housekeeping, and monitoring. You must provision these before proceeding beyond your development environment. You can find these kinds of recipes on Puppet Forge (http://forge.puppetlabs.com/).

Provisioning agents are then invoked to perform the remaining heavy lifting:

config.vm.provision :shell, :inline => "cp -fv /vagrant_data/hosts /etc/hosts"

The preceding command installs the hosts file, which provides name resolution for the nodes in our cluster.

config.vm.provision :shell, :inline => "apt-get update"

This refreshes the apt package index within the Ubuntu installation; it does not upgrade any installed packages.

Vagrant then proceeds to install the JDK and the base provisioning. Finally it invokes the application provisioning.

The base VM image could contain the entire base provisioning already, thus making this portion of the provisioning unnecessary. However, it is important to understand the process of creating an appropriate base image and also to balance the amount of specialization in the base images you control; otherwise, they will proliferate.

Creating a Storm cluster – provisioning Storm

Once you have a base set of virtual machines that are ready for application provisioning, you need to install and configure the appropriate packages on each node.

How to do it…

1. Create a new project named storm-puppet. The later steps assume that its root contains a manifests folder and a modules folder.

2. The entry point into the Puppet execution on the provisioned node is site.pp . Create it in the manifests folder:

node 'storm.nimbus' {

$cluster = 'storm1'

include storm::nimbus

include storm::ui

}

node /storm.supervisor[1-9]/ {

$cluster = 'storm1'

include storm::supervisor

}

node /storm.zookeeper[1-9]/ {

include storm::zoo

}

3. Next, you need to define the storm module. A module exists in the modules folder and has its own manifests and template folder structure, much as with the structure found at the root level of the Puppet project. Within the storm module, create the required manifests ( modules/storm/manifests ), starting with the init.pp file:

class storm {

include storm::install

include storm::config

}

4. The installation of the Storm application is the same on each storm node; only the configurations are adjusted where required, via templating. Next create the install.pp file, which will download the required binaries and install them:

class storm::install {

$BASE_URL="https://bitbucket.org/qanderson/storm-deb-packaging/downloads/"

$ZMQ_FILE="libzmq0_2.1.7_amd64.deb"

$JZMQ_FILE="libjzmq_2.1.7_amd64.deb"

$STORM_FILE="storm_0.8.1_all.deb"

package { "wget": ensure => latest }

# call fetch for each file

exec { "wget_storm":

command => "/usr/bin/wget ${BASE_URL}${STORM_FILE}"

}

exec {"wget_zmq":

command => "/usr/bin/wget ${BASE_URL}${ZMQ_FILE}"

}

exec { "wget_jzmq":

command => "/usr/bin/wget ${BASE_URL}${JZMQ_FILE}"

}

#call package for each file

package { "libzmq0":

provider => dpkg,

ensure => installed,

source => "${ZMQ_FILE}",

require => Exec['wget_zmq']

}

#call package for each file

package { "libjzmq":

provider => dpkg,

ensure => installed,

source => "${JZMQ_FILE}",

require => [Exec['wget_jzmq'],Package['libzmq0']]

}

#call package for each file

package { "storm":

provider => dpkg,

ensure => installed,

The install manifest here assumes the existence of Debian packages for Ubuntu. These were built using scripts and can be tweaked based on your requirements. The binaries and creation scripts can be found at https://bitbucket.org/qanderson/storm-deb-packaging.

The installation consists of the following packages:

Storm

ZeroMQ: http://www.zeromq.org/

Java-ZeroMQ

5. The configuration of each node is done through the template-based generation of the configuration files. In the storm manifests, create config.pp :

class storm::config {

require storm::install

include storm::params

file { '/etc/storm/storm.yaml':

require => Package['storm'],

content => template('storm/storm.yaml.erb'),

owner => 'root',

group => 'root',

mode => '0644'

}

file { '/etc/default/storm':

require => Package['storm'],

content => template('storm/default.erb'),

owner => 'root',

group => 'root',

mode => '0644'

}

}

6. All the storm parameters are defined using Hiera, with the Hiera configuration invoked from params.pp in the storm manifests:

class storm::params {

#_ STORM DEFAULTS _#

$java_library_path = hiera_array('java_library_path', ['/usr/local/lib', '/opt/local/lib', '/usr/lib'])

}

Due to the sheer number of properties, the file has been truncated. For the complete file, please refer to the Git repository at https://bitbucket.org/qanderson/storm-puppet/src.

7. Each class of node is then specified; here we will specify the nimbus class:

class storm::nimbus {

require storm::install

include storm::config

include storm::params

# Install nimbus /etc/default

storm::service { 'nimbus':

start => 'yes',

jvm_memory => $storm::params::nimbus_mem

}

}

Specify the supervisor class:

class storm::supervisor {

require storm::install

include storm::config

include storm::params

# Install supervisor /etc/default

storm::service { 'supervisor':

start => 'yes',

jvm_memory => $storm::params::supervisor_mem

}

}

Specify the ui class:

class storm::ui {

require storm::install

include storm::config

include storm::params

# Install ui /etc/default

storm::service { 'ui':

start => 'yes',

jvm_memory => $storm::params::ui_mem

}

}

And finally, specify the zoo class (for a zookeeper node):

8. Once all the files have been created, initialize the Git repository and push it to bitbucket.org .

9. In order to actually run the provisioning, navigate to the vagrant-storm-cluster folder and run the following command:

vagrant up

10. If you would like to ssh into any of the nodes, simply specify the following command:

vagrant ssh nimbus

Replace nimbus with your required node name.

How it works…

There are various patterns that can be applied when using Puppet. The simplest one is a distributed model, whereby nodes provision themselves, as opposed to a centralized model using Puppet Master. In the distributed model, updating server configuration simply requires that you update your provisioning manifests and push them to your central Git repository. The various nodes will then pull and apply this configuration. This can be achieved through cron jobs, triggers, or a Continuous Delivery tool such as Jenkins, Bamboo, or Go. Provisioning in the development environment is explicitly invoked by Vagrant through the following command:

config.vm.provision :shell, :inline => "puppet apply /tmp/storm-puppet/manifests/site.pp --verbose --modulepath=/tmp/storm-puppet/modules/ --debug"

The manifest is then applied declaratively by Puppet. Puppet is declarative in that each language element specifies the desired state together with the means of getting there.

This means that when the system is already in the required state, that particular provisioning step is skipped, which avoids the adverse effects of duplicate provisioning.
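The idea of checking the desired state before acting can be shown in a few lines. The following is a sketch of the principle only, not of how Puppet is implemented: EnsureDirectory is a hypothetical class that plays the role of a resource declared with ensure => directory:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical idempotent provisioning step: act only when the desired state is missing.
final class EnsureDirectory {

    // Returns true when a change was applied, false when the directory already existed.
    static boolean ensure(Path dir) {
        if (Files.isDirectory(dir)) {
            return false; // desired state already holds, so the step is skipped
        }
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // A temporary location is used so that the sketch does not touch /etc.
        Path dir = Files.createTempDirectory("ensure-sketch").resolve("hieradata");
        System.out.println("first run changed state:  " + ensure(dir));
        System.out.println("second run changed state: " + ensure(dir));
    }
}
```

The first call creates the directory and reports a change; the second call finds the desired state in place and does nothing, which is why the same manifest can be applied repeatedly.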

The storm-puppet project is therefore cloned onto the node and then the manifest is applied locally. Each node only applies provisioning for itself, based on the hostname specified in the site.pp manifest, for example:

node 'storm.nimbus' {

$cluster = 'storm1'

include storm::nimbus

include storm::ui

}

In this case, the nimbus node will include the Hiera configurations for the storm1 cluster, and the installation for the nimbus and ui classes will be performed. Any combination of classes can be included in the node definition, thus allowing the complete environment to be succinctly defined.
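The hostname-to-classes mapping expressed by site.pp can be sketched as follows. NodeClassifier is a hypothetical class written for illustration; it copies the node names and patterns from the manifest but does not model Puppet's own node-matching rules beyond an exact name and an unanchored regular expression:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of how a hostname selects the classes that a node applies.
final class NodeClassifier {
    // Patterns copied from the site.pp node definitions.
    private static final Pattern SUPERVISOR = Pattern.compile("storm.supervisor[1-9]");
    private static final Pattern ZOOKEEPER = Pattern.compile("storm.zookeeper[1-9]");

    static List<String> classesFor(String hostname) {
        if ("storm.nimbus".equals(hostname)) { // node 'storm.nimbus'
            return Arrays.asList("storm::nimbus", "storm::ui");
        }
        if (SUPERVISOR.matcher(hostname).find()) { // node /storm.supervisor[1-9]/
            return Collections.singletonList("storm::supervisor");
        }
        if (ZOOKEEPER.matcher(hostname).find()) { // node /storm.zookeeper[1-9]/
            return Collections.singletonList("storm::zoo");
        }
        return Collections.emptyList(); // no matching node definition
    }

    public static void main(String[] args) {
        for (String host : Arrays.asList("storm.nimbus", "storm.supervisor2", "storm.zookeeper1", "web01")) {
            System.out.println(host + " -> " + classesFor(host));
        }
    }
}
```

Adding a supervisor to the cluster therefore needs no manifest change, provided that its hostname follows the storm.supervisorN convention.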

Deriving basic click statistics

The click topology is designed to gather basic website-usage statistics, specifically:

• The number of visitors

• The number of unique visitors

• The number of visitors for a given country

• The number of visitors for a given city

• The percentage of visitors for each city in a given country

The system assumes a limited possible visitor population and prefers server-side client keys as opposed to client-side cookies. The topology derives the geographic information from the IP address and a public IP resolution service.

The click topology also uses Redis to store click events being sent into the topology, specifically as a persistent queue, and it also leverages Redis in order to persistently recall the previous visitors to the site.
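Before the topology is built, the statistics themselves can be pinned down in plain Java. This is a minimal sketch under stated assumptions: ClickStats is a hypothetical class, the in-memory set and maps stand in for the Redis-backed state, and the country and city are passed in directly rather than resolved from an IP address:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single-process model of the click statistics; no Storm or Redis involved.
final class ClickStats {
    private final Set<String> seenClientKeys = new HashSet<>(); // stands in for the visitor store
    private final Map<String, Map<String, Integer>> visitsByCountryAndCity = new HashMap<>();
    private int totalVisits;

    void record(String clientKey, String country, String city) {
        totalVisits++;
        seenClientKeys.add(clientKey); // a set keeps each client key once
        visitsByCountryAndCity
                .computeIfAbsent(country, c -> new HashMap<>())
                .merge(city, 1, Integer::sum);
    }

    int totalVisits() {
        return totalVisits;
    }

    int uniqueVisitors() {
        return seenClientKeys.size();
    }

    int visitsForCountry(String country) {
        int sum = 0;
        for (int n : citiesOf(country).values()) {
            sum += n;
        }
        return sum;
    }

    // Share of a country's visits that came from the given city, as a percentage.
    double cityPercentage(String country, String city) {
        int countryTotal = visitsForCountry(country);
        if (countryTotal == 0) {
            return 0.0;
        }
        return 100.0 * citiesOf(country).getOrDefault(city, 0) / countryTotal;
    }

    private Map<String, Integer> citiesOf(String country) {
        return visitsByCountryAndCity.getOrDefault(country, Collections.<String, Integer>emptyMap());
    }

    public static void main(String[] args) {
        ClickStats stats = new ClickStats();
        stats.record("client-1", "AU", "Sydney");
        stats.record("client-2", "AU", "Melbourne");
        stats.record("client-1", "AU", "Sydney");
        System.out.println("visits=" + stats.totalVisits() + ", unique=" + stats.uniqueVisitors());
        System.out.println("Sydney share of AU visits: " + stats.cityPercentage("AU", "Sydney") + "%");
    }
}
```

In a running topology, this state would be split across bolts and must outlive any single worker process, which is where a persistent store for previously seen visitors becomes useful.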

For more information on Redis, please visit Redis.io.

Getting ready

Before you proceed, you must install Redis (Version 2.6 or greater):

wget http://download.redis.io/redis-stable.tar.gz

tar xvzf redis-stable.tar.gz

cd redis-stable

make

sudo cp redis-server /usr/local/bin/

sudo cp redis-cli /usr/local/bin/

How to do it…

1. Create a new Java project named click-topology , and create the pom.xml file and folder structure as per the "Hello World" topology project.

2. In the pom.xml file, update the project name and references, and then add the following dependencies to the <dependencies> tag:

<dependency>

<groupId>junit</groupId>

<artifactId>junit</artifactId>

<version>4.11</version>

<scope>test</scope>

</dependency>

<dependency>

<groupId>org.jmock</groupId>

<artifactId>jmock-junit4</artifactId>

<version>2.5.1</version>

<scope>test</scope>

</dependency>

<dependency>

<groupId>org.jmock</groupId>

<artifactId>jmock-legacy</artifactId>

<version>2.5.1</version>

<scope>test</scope>

</dependency>

<dependency>

<groupId>redis.clients</groupId>

<artifactId>jedis</artifactId>

<version>2.1.0</version>

</dependency>

3. Take special note of the scope definitions of JUnit and JMock, so as to not include them in your final deployable JAR.

4. In the src/main/java folder, create the ClickTopology main class in the storm.cookbook package. This class defines the topology and provides the mechanisms to launch the topology into a cluster or in a local mode. Create the class as follows:

About the Author

Quinton's next area of focus is machine learning; specifically, Deep Belief networks, as they pertain to robotics. Please follow his blog entries on Computational Theory, general IT concepts, and Deep Belief networks for more information.

You can find more information on Quinton via his LinkedIn profile ( http://au.linkedin.com/pub/quinton-anderson/37/422/11b/ ) or more importantly, view and contribute to the source code available at his GitHub ( https://github.com/quintona ) and Bitbucket ( https://bitbucket.org/qanderson ) accounts.

I would like to thank the Storm community for their efforts in building a truly awesome platform for the open source community; a special mention, of course, to the core author of Storm, Nathan Marz.

I would like to thank my wife and children for putting up with my long working hours spent on this book and other related projects. Your effort in making up for my absence is greatly appreciated, and I love you all very dearly. I would also like to thank all those who participated in the review process of this book.

About the Reviewers

Maarten Ectors is an executive who is an expert in cloud computing, big data, and disruptive innovations. Maarten's strengths are his combination of deep technical and business skills as well as strategic insights.

I would like to thank my family for always being there for me, especially my wonderful wife, Esther, and my great kids.

Alexey Kachayev began his development career in a small team creating an open source CMS for social networks. For over 2 years, he had been working as a Software Engineer at CloudMade, developing geo-relative technology for enterprise clients in Python and Scala.

Currently, Alexey is the CTO at Attendify and is focused on development of a distributed applications platform in Erlang. He is an active speaker at conferences and an open source contributor (working on projects in Python, Clojure, and Haskell).

His areas of professional interest include distributed systems and algorithms, type theory, and functional language compilers.

I would like to thank Nathan Marz and the Storm project contributors team for developing such a great technology and spreading great ideas.

Paco Nathan is the Chief Scientist at Mesosphere in San Francisco. He is a recognized expert in Hadoop, R, Data Science, and Cloud Computing, and has led innovative data teams building large-scale apps for the past decade. Paco is an evangelist for the Mesos and Cascading open source projects. He is also the author of Enterprise Data Workflows with Cascading, O'Reilly. He has a blog about Data Science at http://liber118.com/pxn/ .

www.packtpub.com

Support files, eBooks, discount offers and more

You might want to visit www.packtpub.com for support files and downloads related to your book.

At www.packtpub.com , you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.packtpub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt

Copy and paste, print and bookmark content

On demand and accessible via web browser

Table of Contents

Preface

Chapter 1: Setting Up Your Development Environment

Introduction

Setting up your development environment

Distributed version control

Creating a "Hello World" topology

Creating a Storm cluster – provisioning the machines

Creating a Storm cluster – provisioning Storm

Deriving basic click statistics

Unit testing a bolt

Implementing an integration test

Deploying to the cluster

Chapter 2: Log Stream Processing

Introduction

Creating a log agent

Creating the log spout

Rule-based analysis of the log stream

Indexing and persisting the log data

Counting and persisting log statistics

Creating an integration test for the log stream cluster

Creating a log analytics dashboard