Large Scale Machine Learning with Python

(1)

(2)

Large Scale Machine Learning with Python

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

Bastiaan Sjardin Luca Massaron Alberto Boschetti

BIRMINGHAM - MUMBAI

(3)

Large Scale Machine Learning with Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.

However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2016

Production reference: 1270716

Published by Packt Publishing Ltd.

Livery Place 35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-721-5 www.packtpub.com

(4)

Credits

Authors

Bastiaan Sjardin Luca Massaron Alberto Boschetti

Reviewers Oleg Okun Kai Londenberg

Commissioning Editor Akram Hussain

Acquisition Editor Sonali Vernekar

Content Development Editor Sumeet Sawant

Technical Editor Manthan Raja

Copy Editor Tasneem Fatehi

Project Coordinator Shweta H Birwatkar

Proofreader Safis Editing

Indexer

Mariammal Chettiyar

Graphics Disha Haria Kirk D'Penha

Production Coordinator Arvindkumar Gupta

Cover Work

Arvindkumar Gupta

(5)

About the Authors

Bastiaan Sjardin

is a data scientist and founder with a background in artificial intelligence and mathematics. He has a MSc degree in cognitive science obtained at the University of Leiden together with on campus courses at Massachusetts Institute of Technology (MIT). In the past 5 years, he has worked on a wide range of data science and artificial intelligence projects. He is a frequent community TA at Coursera in the social network analysis course from the University of Michigan and the practical machine learning course from Johns Hopkins University. His programming languages of choice are Python and R. Currently, he is the cofounder of Quandbee (http://www.quandbee.com/), a company providing machine learning and artificial intelligence applications at scale.

Luca Massaron

is a data scientist and marketing research director who is specialized in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience in solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of Web audience analysis in Italy to achieving the rank of a top ten Kaggler, he has always been very passionate about everything regarding data and its analysis, and also about demonstrating the potential of data- driven knowledge discovery to both experts and non-experts. Favoring simplicity over unnecessary sophistication, he believes that a lot can be achieved in data science just by doing the essentials.

I would like to thank Yukiko and Amelia for their continued

(6)

Alberto Boschetti

is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges that span from natural language processing (NLP) and machine learning to distributed processing. He is very passionate about his job and always tries to stay updated about the latest developments in data science technologies, attending meet-ups, conferences, and other events.

(7)

About the Reviewers

Oleg Okun

is a machine learning expert and an author/editor of four books, numerous journal articles, and conference papers. He has been working for more than a quarter of a century. During this time, Oleg was employed in both academia and industry in his mother country, Belarus, and abroad (Finland, Sweden, and Germany). His work experience includes document image analysis, fingerprint biometrics, bioinformatics, online/offline marketing analytics, and credit-scoring analytics. He is interested in all aspects of distributed machine learning and the Internet of Things. Oleg currently lives and works in Hamburg, Germany, and is about to start a new job as a chief architect of intelligent systems. His favorite programming languages are Python, R, and Scala.

I would like to express my deepest gratitude to my parents for everything that they have done for me.

Kai Londenberg

is a data science and big data expert with many years of professional experience. Currently, he is working as a data scientist at the Volkswagen Data Lab. Before that, he had the pleasure of being the lead data scientist at Searchmetrics, where Luca Massaron was a member of his team. Kai enjoys working with cutting-edge technologies, and while he is a pragmatic machine learning practitioner and software developer at work, he always enjoys staying up-to-date with the latest technologies and research in machine learning, AI, and related fields. You can find him on LinkedIn at https://www.linkedin.com/in/

(8)

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

TM

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt

• Copy and paste, print, and bookmark content

• On demand and accessible via a web browser

(9)

(10)

Preface

"The nice thing about having a brain is that one can learn, that ignorance can be supplanted by knowledge, and that small bits of knowledge can gradually pile up into substantial heaps."

– Douglas Hofstadter Machine learning is often referred to as the part of artificial intelligence that actually works. Its aim is to find a function based on an existing set of data (training set) in order to predict outcomes of a previously unseen dataset (test set) with the highest possible correctness. This occurs either in the form of labels and classes (classification problems) or in the form of a continuous value (regression problems). Tangible examples of machine learning in real-life applications range from predicting future stock prices to classifying the gender of an author from a set of documents.

Throughout this book, the most important machine learning concepts, together with methods suitable for larger datasets, will be made clear to the reader, thanks to practical examples in Python. We will look at supervised learning (classification &

regression), as well as unsupervised learning (such as Principal Component Analysis (PCA), clustering, and topic modeling) that have been found to be applicable to larger datasets.

Large IT corporations such as Google, Facebook, and Uber have generated a lot of buzz by claiming that they successfully applied such machine learning methods at a large scale. With the onset and availability of big data, the demand for scalable machine learning solutions has grown exponentially and many other companies and individuals have started aspiring to ripe the fruits of hidden correlations in big datasets. Unfortunately, most learning algorithms don't scale well, straining CPUs and memory either on a desktop computer or on a larger computing cluster. During these times, even if big data has passed the peak of hype, scalable machine learning solutions are not plentiful.

(17)

Preface

Frankly, we still need to work around a lot of bottlenecks even with datasets we would hardly categorize as big data (think of datasets up to 2GB or even smaller). The mission of this book is to provide methods (and sometimes unconventional ones) to apply the most powerful open source machine learning methods at a larger scale, without the need for expensive enterprise solutions or large computing clusters.

Throughout this book, we will use Python and some other readily available solutions that integrate well in scalable machine learning pipelines. Reading the book is a journey that will redefine what you knew about machine learning, setting you on the starting blocks of real big data analysis.

What this book covers

Chapter 1, First Steps to Scalability, sets the problem of scalable machine learning under the right perspective and familiarizes you with the tools that we will be using in this book.

Chapter 2, Scalable Learning in Scikit-learn, discusses strategies for stochastic gradient descent (SGD) where we mitigate memory consumption; it is based on the theme of out-of-core learning. We will also deal with data preparation techniques that can deal with a variety of data, such as the hashing trick.

Chapter 3, Fast-Learning SVMs, covers streaming algorithms that are capable of discovering non-linearity in the form of support vector machines. We will present alternatives to Scikit-learn, such as LIBLINEAR and Vowpal Wabbit, which, although operating as external shell commands, can be easily wrapped and directed by Python scripts.

Chapter 4, Neural Networks and Deep Learning, provides useful tactics for applying deep neural networks within the Theano framework together with large-scale applications with H2O. Even though it is a hot topic, it can be quite a challenge to apply it successfully, let alone provide scalable solutions. We will also resort to unsupervised pre-training with autoencoders with the theanets package.

Chapter 5, Deep Learning with TensorFlow, covers interesting deep learning techniques together with an online method for neural networks. Although TensorFlow is only in its infancy, the framework provides elegant machine learning solutions. We will also utilize Keras Convolutional Neural Networks capabilities within the TensorFlow environment.

(18)

Preface Chapter 7, Unsupervised Learning at Scale, dives into unsupervised learning, as we will cover PCA, cluster analysis, and topic modeling using the right approach for scaling them up.

Chapter 8, Distributed Environments – Hadoop and Spark, teaches us how to set up Spark within a virtual machine environment, shifting from a single machine to a computational network paradigm. As Python can easily glue and power up our efforts on a cluster of machines, it becomes a piece of cake to leverage the power of a Hadoop cluster.

Chapter 9, Practical Machine Learning with Spark, gets into action with Spark, teaching all the essentials for starting immediately to manipulate data and build predictive models on large datasets.

Appendix, Introduction to GPUs and Theano, will cover the basics of Theano and GPU-computation. It will help you install and prepare your environment for using Theano on the GPU, if your system allows it.

What you need for this book

The execution of the code examples provided in this book requires an installation of Python 2.7 or higher versions on macOS, Linux, or Microsoft Windows.

The examples throughout the book will make frequent use of Python's essential libraries, such as SciPy, NumPy, Scikit-learn, and StatsModels, and to a minor extent, matplotlib and pandas, for scientific and statistical computing. We will also make use of an out-of-core cloud computing application called H2O.

This book is highly dependent on Jupyter and its Notebooks powered by the Python kernel. We will use its most recent version, 4.1, for this book.

The first chapter will provide you with all the step-by-step instructions and some useful tips to set up your Python environment, these core libraries, and all the necessary tools.

Who this book is for

This book is suitable for aspiring and actual data science practitioners, developers, and everyone who intends to work with large and complex datasets. We strive to make this book as accessible as possible to a wider audience. Yet, considering that the topics in this book are quite advanced, it is recommended, but not strictly compulsory, that readers are familiar with basic machine learning concept such as classification and regression, error minimizing functions, and cross validation.

(19)

Preface

We also assume some experience with Python, Jupyter Notebooks, and command-line execution together with a reasonable level of mathematical knowledge to grasp the concepts behind the various large solutions we propose.

The text is written in a style that programmers of other languages (R, Java, and MATLAB) can follow. Ideally, it is highly suitable for (but not limited to) a data scientist familiar with machine learning and interested in leveraging Python, in respect to other languages such as R or MATLAB, because of its computational, memory, and I/O capabilities.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

"When inspecting the linear model, first check the coef_ attribute."

A block of code is set as follows:

from sklearn import datasets iris = datasets.load_iris()

Since we will be using Jupyter Notebooks along most of the examples, expect to have always an input (marked as In:) and often an output (marked Out:) from the cell containing the block of code. On your computer you have just to input the code after the In: and check if results correspond to the Out: content:

In: clf.fit(X, y)

Out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

When a command should be given in the terminal command line, you'll find the command with the prefix $>, otherwise, if it's for the Python REPL it will be preceded by >>>:

$>python

>>> import sys

(20)

Preface New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "As a rule, you just have to type the code after In: in your cells and run it."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

(21)

Preface

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

2. Hover the mouse pointer on the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book from.

7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged into your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

• WinRAR / 7-Zip for Windows

• Zipeg / iZip / UnRarX for Mac

• 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/

PacktPublishing/Large-Scale-Machine-Learning-With-Python.We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

On Github, you will also find Vowpal Wabbit executables for Windows.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the

screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/

(22)

Preface

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.

com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/

content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously.

If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

(23)

(24)

First Steps to Scalability

Welcome to this book on scalable machine learning with Python.

In this chapter, we will discuss how to learn effectively from big data with Python and how it can be possible using your single machine or a cluster of other machines, which you can get, for instance, from Amazon Web Services (AWS) or the Google Cloud Platform.

In the book, we will be using Python's implementation of machine learning

algorithms that are scalable. This means that they can work with a large amount of data and do not crash because of memory constraints. They also take a reasonable amount of time, which is something manageable for a data science prototype and also deployment in production. Chapters are organized around solutions (such as streaming data), algorithms (such as neural networks or ensemble of trees), and frameworks (such as Hadoop or Spark). We will also provide you with some basic reminders about the machine learning algorithms and explain how to make them scalable and suitable to problems with massive datasets.

Given such premises as a start, you'll need to learn the basics (so as to figure out the perspective under which this book has been written) and set up all your basic tools to start reading the chapters immediately.

In this chapter, we will introduce you to the following topics:

• What scalability actually means

• What bottlenecks you should pay attention to when dealing with data

• What kind of problems this book will help you solve

• How to use Python to analyze datasets at scale effectively

• How to set up your machine quickly to execute the examples presented in this book

(25)

First Steps to Scalability

Explaining scalability in detail

Even if the hype now is about big data, large datasets existed long before the term itself had been coined. Large collections of texts, DNA sequences, and vast amounts of data from radio telescopes have always represented a challenge for scientists and data analysts. As most machine learning algorithms have a computational complexity of O(n²) or even O(n³), where n is the number of training instances, the challenge from massive datasets has been previously faced by data scientists and analysts by resorting to data algorithms that could be more efficient. A machine learning algorithm is deemed scalable when it can work after an appropriate setup, in case of large datasets. A dataset can be large because of a large number of cases or variables, or because of both, but a scalable algorithm can deal with it in an efficient way as its running time increases almost linearly accordingly to the size of the problem. Therefore, it is just a matter of exchanging 1:1 more time (or more computational power) with more data. Instead, a machine learning algorithm doesn't scale if it's faced with large amounts of data; it simply stops working or operates with a running time that increases in a nonlinear way, for instance, exponentially, thus making learning unfeasible.

The introduction of cheap data storage, a large RAM, and multiprocessor CPU dramatically changed everything, increasing the ability of single laptops to analyze large amounts of data. Another big game changer arrived on the scene in the past years, shifting the attention from single powerful machines to clusters of commodity computers (cheaper, easily available machines). This big change has been the

introduction of MapReduce and the open source framework Apache Hadoop with its Hadoop Distributed File System (HDFS) and, in general, of parallel computation on networks of computers.

In order to figure out how both of these changes deeply and positively affected your capabilities of solving your large scale problems, we should first start from what actually prevented you (and still prevents, depending on how massive is your problem) from analyzing large datasets.

No matter what your problem is, you will eventually find out that you cannot analyze your data because of any of these limits:

• Computing affecting the time taken to execute the analysis

• I/O affecting how much of your data you can take from storage to memory in a time unit

(26)

Chapter 1 Your computer has limitations that will determine if you can learn from your data and how long it will take before you hit a wall. Computing limitations occur in many intensive calculations, I/O problems will bottleneck your prompt access to data, and finally memory limitations can constraint you to take on only a part of your data, thus limiting the kind of matrix computations that you may have access to or the precision or even exactness of your estimations.

Each of these hardware limitations will also affect you differently in severity with regard to the data you are analyzing:

• Tall data, which is characterized by a large number of cases

• Wide data, which is characterized by a large number of features

• Tall and wide data, which has a large number of both cases and features

• Sparse data, which is characterized by a large number of zero entries or entries that could be transformed into zeros (that is, the data matrix may be tall and/or wide but informative, but not all the matrix entries have informative value)

Finally, it comes down to the algorithm that you are going to use in order to learn from the data. Each algorithm has its own characteristics, being able to map data using a solution differently affected by bias or variance. Therefore, with respect to your problem that, so far, you solved by machine learning, you considered, based on experience or empirical tests, that certain algorithms may work better than others did. With large scale problems, you have to add other and different considerations when deciding on the algorithm:

• How complex your algorithm is; that is, if the number of rows and columns in your data affects the number of computations in a linear or nonlinear way.

Most machine learning solutions are based on algorithms of quadratic or cubic complexity, thus strongly limiting their applicability to big data.

• How many parameters your model has; here, it's not just a problem of variance of the estimates (overfitting), but of the time it may take to compute them all.

• If the optimization processes are parallelizable; that is, can you easily split the computations across multiple nodes or CPU cores, or do you have to rely on a single, sequential, optimization process?

• Should the algorithm learn from all the data at once or can you use single examples or small batches of data instead?

(27)

If you cross-evaluate hardware limitations with data characteristics and these kind of algorithms, you'll get a host of possible problematic combinations that can prevent you from getting results from large scale analysis. From a practical point of view, all the problematic combinations can be solved by three approaches:

• Scaling up, that is, improving performances on a single machine by software or hardware modifications (more memory, faster CPU, faster storage disk, and using GPUs)

• Scaling out, that is, distributing the computation (and the performances) across multiple machines leveraging outside resources, namely other storage disks and other CPUs (or GPUs)

• Scaling up and out, that is, taking the best of the scaling up and out solutions together

Making large scale examples

Some motivating examples may make things clearer and more memorable for you.

Let's take two simple examples:

• Being able to predict the click-through rate (CTR) can help you earn quite a lot these days when Internet advertising is so widespread, diffused, and eating large shares of traditional media communication

• Being able to propose the right information to your customers, when they are searching the products and services offered by your site, could really enhance your chances to sell if you can guess what to put at the top of their results

In both cases, we have quite large datasets as they are produced by users' interactions on the Internet.

Depending on the business that we have in mind (we can imagine some big players here), we are clearly talking of millions of data points per day in both our examples. In the advertising case, data is certainly tall, being a continuous stream of information as the most recent data, more representative of markets and consumers, replaces the older one. In the search engine case, data is wide, being enriched by the feature provided by the results you offered to your customers: for instance, if you are in the travels business, you will have quite a lot of features about hotels, locations, and services offered.

(28)

Chapter 1 Clearly, scalability is an issue for both these problems:

• You have to learn from data that is growing every day and you have to learn fast because as you are learning, new data keeps arriving. Yet, you have to deal with data that clearly cannot fit in memory because the matrix is too tall or too large.

• You frequently need to update your machine learning model in order to accommodate new data. You need an algorithm that can process the information in a timely manner. O(n²) or O(n³) complexities could be impossible for you to handle because of the data quantity; you need some algorithm that can work with lower complexity (such as O(n)) or by dividing the data so that n will be much, much smaller.

• You have to be able to predict fast because the predictions have to be delivered only to new customers. Again, the complexity of your algorithm does matter.

The scalability problem can be solved in one or multiple ways:

• Scaling up by reducing the dimensionality of the problem; for instance, in the case of the search engine, by effectively selecting the relevant features to be used

• Scaling up using the right algorithm; for instance, in the case of advertising data, there are appropriate algorithms to learn effectively from streams

• Scaling out the learning process by leveraging multiple machines

• Scaling up the deployment process using multiprocessing and vectorization on a single server effectively

In this book, we will point out for you what kind of practical problems can be solved by each one of the solutions or algorithms proposed. It will become automatic for you to connect a particular constraint in time and execution (CPU, memory, or I/O) to the most suitable solution among the ones that we propose.

Introducing Python

As our treatise will depend on Python—our open source language of choice for this book—we have to stop for a brief moment and present the language before clarifying how Python can easily help you scale up and out with your massive data problem.

(29)

Created in 1991 as a general-purpose, interpreted, object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory developments, and prompt deployments of scientific applications.

As a machine learning practitioner, you will find using Python interesting for various reasons:

• It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.

• It is very versatile. No matter what your programming background or style is (object-oriented or procedural), you will enjoy programming with Python.

• If you don't know it yet but you know other languages such as C/C++ or Java well, then it is very simple to learn and use. After you grasp the basics, there's no other better way to learn more than by immediately starting with the coding.

• It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and macOS systems. You won't have to worry about portability.

• Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language).

• It can work with in-memory big data because of its minimal memory

footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using the various iterations and reiterations of data wrangling.

If you are not already an expert (and actually we require some basic knowledge of Python in order to be able to make the most out of this book), you can read everything about the language and find the basic installations files directly from the Python foundations at https://www.python.org/.

(30)

Chapter 1

Scale up with Python

Python is an interpreted language; it runs the reading of your script from memory and executes it during runtime, thus accessing the necessary resources (files, objects in memory, and so on). Apart from being interpreted, another important aspect to take into consideration when using Python for data analysis and machine learning is that Python is single-threaded. Being single-threaded means that any Python program is executed sequentially from the start to the end of the script and that Python cannot take advantage of the extra processing power offered by the multiple threads and processors likely present in your computer (most computers nowadays are multicore).

Given such a situation, scaling up using Python can be achieved by different strategies:

• Compiling Python scripts in order to achieve more speed of execution.

Though easily possible using, for instance, PyPy—a Just-in-Time (JIT) compiler that can be found at http://pypy.org/, we actually didn't resort to such a solution in our book because it requires writing algorithms in Python from scratch.

• Using Python as a wrapping language; thus putting together the operations executed by Python with the execution of external libraries and programs, some capable of multicore processing. In our book, you will find many examples of this when we call specialized libraries such as the Library for Support Vector Machines (LIBSVM) or programs such as Vowpal Wabbit (VW), XGBoost, or H2O in order to execute machine learning activities.

• Effectively using vectorization techniques, that is, special libraries for matrix computations. This can be achieved using NumPy or pandas, both using computations from GPUs. GPUs are just like multicore CPUs, each one with their own memory and ability to process calculations in parallel (you can figure out that they have multiple tiny cores). Especially when working with neural networks, vectorization techniques based on GPUs can speed up computations incredibly. However, GPUs have their own limitations; first of all, their available memory has a certain I/O in passing your data to their memory and getting the results back to your CPU, and they require parallel programming via a special API, such as CUDA for NVIDIA-manufactured GPUs (so you have to install the appropriate drivers and programs).

• Reducing a large problem into chunks and solving each chunk one at a time in-memory (divide and conquer algorithms). This leads to the partitioning or subsampling of data from memory or disk and managing approximate solutions of your machine learning problem, which is quite effective. It is important to notice that both partitioning and subsampling can operate for cases and features (and both). If the original data is kept on a disk storage,

(31)

• Effectively leveraging both multiprocessing and multithreading, depending on the learning algorithm that you will be using. Some algorithms will naturally be able to split their operations into parallel ones. In such cases, the only constraint will be your CPU's and your memory (as your data will have to be replicated for every parallel worker that you will be using). Some other algorithms will instead take advantage of multithreading, thus managing more operations at the same time on the same memory blocks.

Scale out with Python

Scaling out solutions simply involve connecting together multiple machines into a cluster. As you connect the machines (scaling out), you can also scale up each one of them using configurations that are more powerful (thus augmenting CPU, memory, and I/O), applying the techniques we mentioned in the previous paragraph and enhancing their performances.

By connecting multiple machines, you can leverage their computational power in a parallel fashion. Your data will be distributed across multiple storage disks/memory, limiting I/O transfers by having each machine work only on its available data (that is, its own storage disk or RAM memory).

In our book, this translates into using outside resources effectively by means of the following:

• The H2O framework

• The Hadoop framework and its components, such as HDFS, MapReduce, and Yet Another Resource Negotiator (YARN)

• The Spark framework on top of Hadoop

Each of these frameworks will be controlled by Python (for instance, Spark by its Python interface named pySpark).

Python for large scale machine learning

Given the availability of many useful packages for machine learning and the fact that it is a programming language quite popular among data scientists, Python is our language of choice for all the code presented in this book.

(32)

Chapter 1

Choosing between Python 2 and Python 3

Before starting, it is important to know that there are two main branches of Python:

versions 2 and 3. As many core functionalities have changed, scripts built for one version are sometimes incompatible with the other one (they won't work without raising errors and warnings). Although the third version is the newest, the older one is still the most used version in the scientific area and the default version for many operative systems (mainly for compatibility in upgrades). When version 3 was released (in 2008), most scientific packages weren't ready so the scientific community stuck with the previous version. Fortunately, since then, almost all packages have been updated leaving just a few (see http://py3readiness.org for a compatibility overview) as orphans of Python 3 compatibility.

In spite of the recent growth in popularity of Python 3 (which, we shouldn't forget, is the future of Python), Python 2 is still widely used among data scientists and data analysts. Moreover, for a long time Python 2 has been the default Python installation (for instance, on Ubuntu), so it is the most likely version that most of the readers should have ready at hand. For all these reasons, we will adopt Python 2 for this book. It is not merely love for the old technologies, it is just a practical choice in order to make Large Scale Machine Learning with Python accessible to the largest audience:

• The Python 2 code will immediately address the existing audience of data experts.

• Python 3 users will find it very easy to convert our scripts in order to work under their favored Python version because the code we wrote is easily convertible and we will provide a Python 3 version of all our scripts and notebooks, freely downloadable from the Packt website.

In case you need to understand the differences between Python 2 and Python 3 in depth, we suggest reading this web page about writing Python 2-3 compatible code:

http://python-future.org/compatible_idioms.html From Python-Future, you may also find reading about how to convert Python 2 code to Python 3 useful:

http://python-future.org/automatic_conversion.html

(33)

Installing Python

As the first step, we are going to create a working environment for data science that you can use to replicate and test the examples in the book and prototype your own large solutions.

No matter in what language you are going to develop your application, Python will gift you with an easy time getting your data, building your model from it, and extracting the right parameters you need to make your predictions in a production environment.

Python is an open source, object-oriented, cross-platform programming language that, compared with its direct competitors (for instance, C/C++ and Java), produces very concise and readable code. It allows you to build a working software prototype in a very short time and tests, maintains, and scales it in the future. It has become the most used language in the data scientist's toolbox because, in the end, it is a general- purpose language turned very flexible thanks to a large variety of available packages that can easily and rapidly help you solve a wide spectrum of both common and niche problems.

Step-by-step installation

If you have never used Python (but this doesn't mean that you may not already have it installed on your machine), you need to first download the installer from the main website of the project, https://www.python.org/downloads/ (remember, we're using version 3), and then install it on your local machine.

This section provides you with full control over what can be installed on your machine. This is very useful when you are going to use Python as both your

prototyping and production language. Furthermore, it could help you keep track of the packages' versions that you are using. Anyway, be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of installation procedures and it may be

well-suited to first start and learn because it can save you quite a lot of time, though it will install a large number of packages (that for the most part you won't maybe ever use) on your computer all at once. Therefore, if you want to start immediately and don't want to bother much about controlling your installation, just skip this part and proceed to the next section, Scientific distributions.

(34)

Chapter 1 Being a multiplatform programming language, you'll find installers for computers that either run on Windows or Linux-/Unix-like operating systems. Remember that some Linux distributions (such as Ubuntu) already have Python 2 packed in the repository, which makes the installation process even easier.

1. Open a Python shell, type python in the terminal, or click on the Python icon.

2. Then, to test the installation, run the following code in the Python interactive shell or its Read-Eval-Print Loop (REPL) interface provided by Python's standard IDE or other solutions such as Spyder or PyCharm:

>>> import sys

>>> print sys.version

If a syntax error has been raised, it means that you are running Python 2 instead of Python 3. If you don't experience an error and you can read that your Python version is 3.4.x or 3.5.x (at the time of writing, the latest version is 3.5.2), then congratulations for running the version of Python that we elected for this book.

To clarify, when a command is given in the terminal command line, we prefix the command with $. Otherwise, if it's for the Python REPL, it's preceded by >>>.

The installation of packages

Depending on your system and past installations, Python may not come bundled with all that you need unless you have installed a distribution (which, on the other hand, usually is stuffed with much more than you may need).

To install any packages that you need, you can use either the pip or easy_install commands; however, easy_install is going to be dropped in the future and pip has important advantages over it.

pip is a tool to install Python packages directly accessing the Internet and picking them from the Python Package Index (https://pypi.python.org/pypi). PyPI is a repository containing third-party open source packages, which are constantly maintained and stored in the repository by their authors.

It is preferable to install everything using pip because of the following reasons:

• It is the preferred package manager for Python and starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers

• It provides an uninstall functionality

• It rolls back and leaves your system clear if, for whatever reason, the package installation fails

(35)

The pip command runs in the command line and makes the process of installation, upgrade, and removal of Python packages a breeze.

As we mentioned, if you're running at least Python 2.7.9 or Python 3.4, the pip command should already be there. To assure which tools have been installed on your local machine, directly test with the following command if any error is raised:

$ pip –V

In some Linux and Mac installations, Python 3 and not Python 2 being installed, the command may be present as pip3, so if you receive an error when looking for pip, try running the following command:

$ pip3 –V

If this is the case, remember that pip3 is suitable only to install packages on Python 3. As we are working with Python 2 in the book (unless you decide to use the most recent Python 3.4), pip should always be your choice to install packages.

Alternatively, you can also test whether the old easy_install command is available:

$ easy_install --version

Using easy_install in spite of pip and its advantages makes sense if you are working on Windows because pip will not install binary packages; therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day.

If your test ends with an error, you really need to install pip from scratch (and in doing so, also easy_install at the same time).

To install pip, simply follow the instructions given at https://pip.pypa.io/en/

stable/installing/. The safest way is to download the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:

$ python get-pip.py

By the way, the script will also install the setup tool from https://pypi.python.

org/pypi/setuptools, which contains easy_install.

As an alternative, if you are running a Debian/Ubuntu Unix-like system, then a fast

(36)

Chapter 1 After checking this basic requirement, you're now ready to install all the packages that you need in order to run the examples provided in this book. To install a generic

<pk> package, you just need to run the following command:

$ pip install <pk>

Alternatively, if you prefer to use easy_install, you can also run the following command:

$ easy_install <pk>

After this, the <pk> package and all its dependencies will be downloaded and installed.

If you're not sure whether a library has been installed or not, just try to import a module in it. If the Python interpreter raises an ImportError error, it can be concluded that the package has not been installed.

Let's take an example. This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it's not installed:

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

ImportError: No module named numpy

In the latter case, before importing it, you'll need to install it through pip or easy_install.

Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.

(37)

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because the new version is either required by a dependency or has

additional features that you would like to use. To do so, first check the version of the library that you have installed by glancing at the __version__ attribute, as shown in the following example using the NumPy package:

>>> numpy.__version__ # 2 underscores before and after '1.9.0'

Now, if you want to update it to a newer release, say precisely the 1.9.2 version, you can run the following command from the command line:

$ pip install -U numpy==1.9.2

Alternatively (but we do not recommend it unless it proves necessary), you can also use the following command:

$ easy_install --upgrade numpy==1.9.2

Finally, if you're just interested in upgrading it to the latest available version, simply run the following command:

$ pip install -U numpy

You can also run the easy_install alternative:

$ easy_install --upgrade numpy

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need. (Sometimes, the installation procedures may not go as smoothly as you'd hoped for earlier.)

If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use the scientific Python distribution. Apart from Python, they also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE

(38)

Chapter 1 To immediately focus on the contents of the book, we suggest that you first promptly download and install a scientific distribution, such as Anaconda (which is the most complete one around, in our opinion), and decide to fully uninstall the distribution and set up Python alone after practicing the examples in the book, which can be accompanied by just the packages you need for your projects.

Again, if possible, download and install the version containing Python 3.

The first package that we would recommend you to try is Anaconda (https://www.

continuum.io/downloads), which is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, including NumPy, SciPy, pandas, IPython, matplotlib, Scikit-learn, and StatsModels. It's a cross-platform distribution that can be installed on machines with other existing Python distributions and versions, and its base version is free. Additional add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on its website, Anaconda's goal is to provide enterprise-ready Python distribution for large-scale processing, predictive analytics and scientific computing. As for Python version 2.7, we recommend the Anaconda distribution 4.0.0. (In order to have a look at the packages installed with Anaconda, you can have a look at the list at https://docs.continuum.io/anaconda/pkg-docs.)

As a second suggestion, if you are working on Windows and you desire a portable distribution, WinPython (http://winpython.sourceforge.net/) could be a quite interesting alternative (sorry, no Linux or MacOS versions). WinPython is also a free, open source Python distribution maintained by the community. It is also designed with scientists in mind, and it includes many essential packages such as NumPy, SciPy, matplotlib, and IPython (basically the same as Anaconda's). It also includes Spyder as an IDE, which can be helpful if you have experience using the MATLAB language and interface. Its crucial advantage is that it is portable (you can put it in any directory or even in a USB flash drive), so you can have different versions present on your computer, move a version from a Windows computer to another, and you can easily replace an older version with a newer one just by replacing its directory. When you run WinPython or its shell, it will automatically set all the environment variables necessary to run Python as if it were regularly installed and registered on your system.

At the time of writing, Python 2.7 was the most recent distribution prepared on October 2015 with the release 2.7.10; since then, WinPython has published only updates of the Python 3 version of the distribution.

After installing the distribution on your system, you may need to update some of the key packages necessary for the examples present in this book.

(39)

Introducing Jupyter/IPython

IPython was initiated in 2001 as a free project by Fernando Perez, addressing a lack in the Python stack for scientific investigations using a user-programming interface that could incorporate the scientific approach (mainly experimenting and interactively discovering) in the process of software development.

A scientific approach implies the fast experimentation of different hypotheses in a reproducible fashion (as does the data exploration and analysis task in data science), and when using IPython, you will be able to implement an explorative, iterative, and trial-and-error research strategy more naturally during your code writing.

Recently, a large part of the IPython project has moved to a new one called Jupyter.

This new project extends the potential usability of the original IPython interface to a wide range of programming languages. (For a complete list, visit https://github.

com/ipython/ipython/wiki/IPython-kernels-for-other-languages.) Thanks to the powerful idea of kernels, programs that run the user's code are communicated by the frontend interface and provide feedback on the results of the executed code to the interface itself; you can use the same interface and interactive programming style, no matter what language you are developing in.

Jupyter (IPython is the zero kernel, the original starting one) can be simply described as a tool for interactive tasks operable by a console or web-based notebook, which offers special commands that help developers better understand and build the code that is being currently written.

Contrary to an IDE, which is built around the idea of writing a script, running it afterward and evaluating its results, Jupyter lets you write your code in chunks named cells, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphic outputs. Besides graphical integration, it provides you with further help, thanks to customizable commands, a rich history (in the JSON format), and computational parallelism for an enhanced performance when dealing with heavy numeric computations.

Such an approach is also particularly fruitful for the tasks involving developing code based on data as it automatically accomplishes the often neglected duty of documenting and illustrating how data analysis has been done, its premises and assumptions, and its intermediate and final results. If a part of your job is to also present your work and persuade internal or external stakeholders to the project,

(40)

Chapter 1 Actually, we have to confess that keeping a clean, up-to-date Jupyter Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to hastily present the state of our work.

In short, Jupyter offers you the following features:

• Seeing intermediate (debugging) results for each step of the analysis

• Running only some sections (or cells) of the code

• Storing intermediate results in the JSON format and having the ability to do version control on them

• Presenting your work (this will be a combination of text, code, and images), sharing it via the Jupyter Notebook Viewer service (http://nbviewer.

jupyter.org/), and easily exporting it to HTML, PDF, or even slideshows Jupyter is our favored choice throughout this book, and it is used to clearly and effectively illustrate storytelling operations with scripts and data and their consequent results.

Though we strongly recommend using Jupyter, if you are using an REPL or IDE, you can use the same instructions and expect identical results (except for print formats and extensions of the returned results).

If you do not have Jupyter installed on your system, you can promptly set it up using the following command:

$ pip install jupyter

You can find complete instructions about the Jupyter installation (covering different operating systems) at http://jupyter.

readthedocs.io/en/latest/install.html.

If you already have Jupyter installed, it should be upgraded to at least version 4.1.

After installation, you can immediately start using Jupyter, calling it from the command line:

$ jupyter notebook

(41)

Once the Jupyter instance has opened in the browser, click on the New button, and in the Notebooks section, choose Python 2 (other kernels may be present in the section, depending on what you installed):

At this point, your new empty notebook will look like the following screenshot and you can start entering the commands in the cells:

For instance, you may start typing the following in the cell:

In: print ("This is a test")

After writing in cells, you just press the play button (below the Cell tab) to run it and obtain an output. Then, another cell will appear for your input. As you are writing in a cell, if you press the plus button on the above menu bar, you will get a new cell, and you can move from a cell to another using the arrows on the menu.

Most of the other functions are quite intuitive and we invite you to try them.

(42)

Chapter 1

For a complete treatise of the full range of Jupyter functionalities when running the IPython kernel, refer to the following two Packt Publishing books:

• IPython Interactive Computing and Visualization Cookbook by Cyrille Rossant, Packt Publishing, September 25, 2014

• Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant, Packt Publishing, April 25, 2013

For our illustrative purposes, just consider that every Jupyter block of instructions has a numbered input statement and an output one, so you will find the code presented in this book structured in to two blocks—at least when the output is not trivial at all—otherwise, just expect only the input part:

In: <the code you have to enter>

Out: <the output you should get>

As a rule, you just have to type the code after In: in your cells and run it. You can then compare your output with the output that we provide using Out: followed by the output that we actually obtained on our computers when we tested the code.

Python packages

The packages that we are going to introduce in the present paragraph will be frequently used in the book. If you are not using a scientific distribution, we offer you a walkthrough on what versions you should decide on and how to install them quickly and successfully.

NumPy

NumPy, which is Travis Oliphant's creation, is at the core of every analytical solution in the Python language. It provides the user with multidimensional arrays along with a large set of functions to operate multiple mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Arrays are useful not just to store data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.

• Website: http://www.numpy.org/

• Version at the time of writing: 1.11.1

• Suggested install command:

(43)

As a convention that is largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np

SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more.

• Website: http://www.scipy.org/

$ pip install scipy

Pandas

Pandas deals with everything that NumPy and SciPy cannot do. In particular, thanks to its specific object data structures, DataFrames, and Series, it allows the handling of complex tables of data of different types (something that NumPy's arrays cannot) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources, and then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize it at your will.

• Website: http://pandas.pydata.org/

$ pip install pandas

Conventionally, pandas is imported as pd:

import pandas as pd

Large Scale Machine Learning with Python

Large Scale Machine Learning with Python

Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

Bastiaan Sjardin Luca Massaron Alberto Boschetti

Large Scale Machine Learning with Python

Credits

About the Authors

Bastiaan Sjardin

Luca Massaron

Alberto Boschetti

About the Reviewers

Oleg Okun

Kai Londenberg

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Table of Contents

Preface vii Chapter 1: First Steps to Scalability 1

Chapter 2: Scalable Learning in Scikit-learn 29

Chapter 3: Fast SVM Implementations 77

Chapter 4: Neural Networks and Deep Learning 129

Chapter 5: Deep Learning with TensorFlow 175

Chapter 6: Classification and Regression Trees at Scale 211

Chapter 7: Unsupervised Learning at Scale 253

Chapter 8: Distributed Environments – Hadoop and Spark 297

Chapter 9: Practical Machine Learning with Spark 341

Appendix: Introduction to GPUs and Theano 381

Index 389

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

First Steps to Scalability

Explaining scalability in detail

Making large scale examples

Introducing Python

Scale up with Python

Scale out with Python

Python for large scale machine learning

Choosing between Python 2 and Python 3

Installing Python

Step-by-step installation

The installation of packages

Package upgrades

Scientific distributions

Introducing Jupyter/IPython

Python packages

NumPy

SciPy

Pandas