As you would expect, you can perform mathematical operations such as addition, subtraction, multiplication, as well as the trigonometric functions on NumPy arrays.
Arithmetic operations on differently shaped arrays can be carried out by a process known as broadcasting. When operating on two arrays, NumPy compares their shapes dimension by dimension, starting from the trailing dimension. Two dimensions are compatible if they are the same size, or if one of them is 1. If these conditions are not met, then a ValueError exception is raised.
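The broadcasting rules above can be illustrated with a minimal sketch. The array names and values here are hypothetical and chosen only to make the shape comparisons easy to follow:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # shape (2, 3)
b = np.array([10, 20, 30])        # shape (3,): trailing dimensions match (3 and 3)
row_sum = a + b                   # b is stretched across both rows

col = np.array([[1], [2]])        # shape (2, 1): a dimension of size 1 is compatible
scaled = a * col                  # col is stretched across all three columns

# Shapes (2, 3) and (2,) are incompatible: trailing sizes 3 and 2 differ
# and neither of them is 1, so NumPy raises a ValueError.
broadcast_failed = False
try:
    a + np.array([1, 2])
except ValueError:
    broadcast_failed = True

print(row_sum)
print(scaled)
print(broadcast_failed)
```

Note that no data is copied to "stretch" the smaller array; broadcasting only describes how the shapes are matched up.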
This is all done in the background using ufunc objects. A ufunc operates on ndarrays on an element-by-element basis. Ufuncs are essentially wrappers that provide a consistent interface to scalar functions to allow them to work with NumPy arrays.
There are over 60 ufunc objects covering a wide variety of operations and types. The ufunc objects are called automatically when you perform operations such as adding two arrays using the + operator.
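As a small sketch of this, the + operator on arrays and an explicit call to the np.add ufunc give the same result. The arrays below are hypothetical examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

is_ufunc = isinstance(np.add, np.ufunc)   # np.add is one of the built-in ufuncs

via_operator = x + y         # the + operator dispatches to np.add
via_ufunc = np.add(x, y)     # calling the ufunc directly

# ufuncs also carry extra methods, such as reduce, which applies the
# operation along an axis (here it sums the elements of x).
total = np.add.reduce(x)

print(is_ufunc, via_operator, via_ufunc, total)
```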
Let's look into some additional mathematical features:
• Vectors: We can also create our own vectorized versions of scalar functions using the np.vectorize() function. It takes a Python scalar function or method as a parameter and returns a vectorized version of this function:
import numpy as np

def myfunc(a, b):
    # Scalar function: works on one pair of numbers at a time
    if a > b:
        return a - b
    else:
        return a + b

vfunc = np.vectorize(myfunc)
Chapter 2
We will observe the following output:
• Polynomial functions: The poly1d class allows us to deal with polynomial functions in a natural way. It accepts as a parameter an array of coefficients in decreasing powers. For example, the polynomial 2x^2 + 3x + 4 can be entered by the following:
We can see that it prints out the polynomial in a human-readable way.
We can perform various operations on the polynomial, such as evaluating at a point:
• Find the roots:
We can use asarray(p) to get the coefficients of the polynomial as an array so that it can be used in all functions that accept arrays.
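The two features above can be sketched as follows. The input values passed to vfunc and the evaluation point for the polynomial are hypothetical and chosen only for illustration:

```python
import numpy as np

def myfunc(a, b):
    if a > b:
        return a - b
    else:
        return a + b

# Vectorized version of the scalar function
vfunc = np.vectorize(myfunc)
pairwise = vfunc([1, 2, 3, 4], [4, 3, 2, 1])   # compares element by element
against_scalar = vfunc([1, 2, 3, 4], 2)        # the scalar 2 is broadcast

# The polynomial 2x^2 + 3x + 4, coefficients in decreasing powers
p = np.poly1d([2, 3, 4])
value_at_2 = p(2)            # evaluate at a point: 2*4 + 3*2 + 4
roots = p.r                  # the roots (complex for this polynomial)
coeffs = np.asarray(p)       # plain array of coefficients

print(pairwise, against_scalar)
print(p)
print(value_at_2, roots, coeffs)
```

Note that np.vectorize is provided mainly for convenience; it is essentially a loop over the inputs, so it does not give the speed of a true ufunc.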
As we will see, the packages that are built on NumPy give us a powerful and flexible framework for machine learning.
Tools and Techniques
Matplotlib
Matplotlib, or more importantly, its sub-package PyPlot, is an essential tool for visualizing two-dimensional data in Python. I will only mention it briefly here because its use should become apparent as we work through the examples. It is built to work like MATLAB, with command-style functions. Each PyPlot function makes some change to the current figure. At the core of PyPlot is the plot function. The simplest usage is to pass plot a list or a 1D array. If only one argument is passed to plot, it assumes that it is a sequence of y values and automatically generates the x values. More commonly, we pass plot two 1D arrays or lists for the x and y coordinates. The plot function can also accept arguments to indicate line properties such as line width, color, and style. Here is an example:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0., 5., 0.2)
plt.plot(x, x**4, 'r', x, x*90, 'bs', x, x**3, 'g^')
plt.show()
This code plots three data series in different styles: a red line, blue squares, and green triangles. Notice that we can pass more than one pair of coordinate arrays to plot multiple lines. For a full list of line styles, call help(plt.plot).
PyPlot, like MATLAB, applies plotting commands to the current axes. Multiple axes can be created using the subplot command. Here is an example:
x1 = np.arange(0., 5., 0.2)
x2 = np.arange(0., 5., 0.1)
plt.figure(1)
plt.subplot(211)
plt.plot(x1, x1**4, 'r', x1, x1*90, 'bs', x1, x1**3, 'g^', linewidth=2.0)
plt.subplot(212)
plt.plot(x2, np.cos(2*np.pi*x2), 'k')
plt.show()
The output of the preceding code is as follows:
Another useful plot is the histogram. The hist() function takes an array, or a sequence of arrays, of input values. The second parameter is the number of bins. In this example, we have divided a distribution into 10 bins. The density parameter (named normed in older Matplotlib releases), when set to True, normalizes the counts to form a probability density. Notice also that in this code, we have labeled the x and y axes, and displayed a title and some text at a location given by the coordinates:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(1000)
# density=True normalizes the counts (older Matplotlib releases used normed=1)
n, bins, patches = plt.hist(x, 10, density=True, facecolor='g')
plt.xlabel('Frequency')
plt.ylabel('Probability')
plt.title('Histogram Example')
plt.text(40, .028, 'mean=100 std.dev.=15')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
The output for this code will look like this:
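To make the normalization concrete, here is a minimal sketch that uses np.histogram, which performs the same binning without drawing anything. The seeded random generator is an assumption added only so that the numbers are reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded for reproducibility (assumption)
mu, sigma = 100, 15
x = mu + sigma * rng.standard_normal(1000)

# Raw counts per bin, and the same bins normalized to a probability density
counts, edges = np.histogram(x, bins=10)
density, _ = np.histogram(x, bins=10, density=True)

# Each density value is count / (total count * bin width),
# so the bar areas sum to 1.
widths = np.diff(edges)
area = np.sum(density * widths)

print(counts.sum(), area)
```

This is why the y values in a normalized histogram are typically small numbers rather than counts: it is the area of the bars, not their height, that sums to 1.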
The final 2D plot we are going to look at is the scatter plot. The scatter function takes two sequence objects, such as arrays, of the same length and optional parameters to denote color and style attributes. Let's take a look at this code:
N = 100
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)  # one color value per point, mapped through the colormap
area = np.pi * (10 * np.random.rand(N))**2  # 0 to 10 point radii
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
We will observe the following output:
Matplotlib also has a powerful toolbox for rendering 3D plots. The following code demonstrates simple examples of 3D line, scatter, and surface plots. 3D plots are created in a very similar way to 2D plots. Here, we create the axes with the projection parameter set to '3d' (older Matplotlib releases did this through the gca function). All the plotting methods work much like their 2D counterparts, except that they now take a third set of input values for the z axis:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib

mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
# Create 3D axes (older Matplotlib releases used fig.gca(projection='3d'))
ax = fig.add_subplot(111, projection='3d')

theta = np.linspace(-3 * np.pi, 6 * np.pi, 100)
z = np.linspace(-2, 2, 100)
r = z**2 + 1
x = r * np.sin(theta)
y = r * np.cos(theta)
ax.plot(x, y, z)

theta2 = np.linspace(-3 * np.pi, 6 * np.pi, 20)
z2 = np.linspace(-2, 2, 20)
r2 = z2**2 + 1
x2 = r2 * np.sin(theta2)
y2 = r2 * np.cos(theta2)
ax.scatter(x2, y2, z2, c='r')

x3 = np.arange(-5, 5, 0.25)
y3 = np.arange(-5, 5, 0.25)
x3, y3 = np.meshgrid(x3, y3)
R = np.sqrt(x3**2 + y3**2)
z3 = np.sin(R)
surf = ax.plot_surface(x3, y3, z3, rstride=1, cstride=1, cmap=cm.Greys_r, linewidth=0, antialiased=False)
ax.set_zlim(-2, 2)
plt.show()
We will observe this output:
Pandas
The Pandas library builds on NumPy by introducing several useful data structures and functionalities to read and process data. Pandas is a great tool for general data munging. It easily handles common tasks such as dealing with missing data, manipulating shapes and sizes, converting between data formats and structures, and importing data from different sources.
The main data structures introduced by Pandas are:
• Series
• DataFrame
• Panel (removed in recent Pandas releases)
The DataFrame is probably the most widely used. It is a two-dimensional structure that is effectively a table created from either a NumPy array, lists, dicts, or series.
You can also create a DataFrame by reading from a file.
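As a brief sketch of these construction routes, the following builds small DataFrames from a dict of lists, a NumPy array, and a Series. All column names and values here are hypothetical:

```python
import numpy as np
import pandas as pd

# From a dict of lists: the keys become the column labels
df_from_dict = pd.DataFrame({'year': [2001, 2002], 'maxtemp': [21.5, 22.1]})

# From a 2D NumPy array, with the column labels supplied explicitly
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

# From a Series: the Series index becomes the DataFrame index
df_from_series = pd.DataFrame({'s': pd.Series([1.0, 2.0, 3.0])})

print(df_from_dict)
print(df_from_array)
print(df_from_series)
```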
Probably the best way to get a feel for Pandas is to go through a typical use case. Let's say that we are given the task of discovering how the daily maximum temperature has changed over time. For this example, we will be working with historical weather observations from the Hobart weather station in Tasmania.
Download the following ZIP file and extract its contents into a folder called data in your Python working directory:
http://davejulian.net/mlbook/data
The first thing we do is create a DataFrame from it:
import pandas as pd
df = pd.read_csv('data/sampleData.csv')
Check the first few rows in this data:
df.head()
We can see that the product code and the station number are the same for each row and that this information is superfluous. Also, the days of accumulated maximum temperature are not needed for our purpose, so we will delete them as well:
del df['Bureau of Meteorology station number']
del df['Product code']
del df['Days of accumulation of maximum temperature']
Let's make our data a little easier to read by shortening the column labels:
df=df.rename(columns={'Maximum temperature (Degree C)':'maxtemp'})
We are only interested in data that is of high quality, so we include only records that have a Y in the quality column:
df=df[(df.Quality=='Y')]
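For readers without the data file to hand, the cleaning steps above can be tried on a small stand-in DataFrame. This is a sketch only: the column labels are the ones used in the text, but every value below is a made-up placeholder rather than real weather data, and drop(columns=...) is used as an alternative to repeated del statements:

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents (placeholder values only)
df = pd.DataFrame({
    'Product code': ['P1', 'P1', 'P1'],
    'Bureau of Meteorology station number': [1, 1, 1],
    'Year': [1990, 1991, 1992],
    'Maximum temperature (Degree C)': [17.0, 18.5, 16.2],
    'Days of accumulation of maximum temperature': [1, 1, 1],
    'Quality': ['Y', 'N', 'Y'],
})

# Remove the superfluous columns in a single call
df = df.drop(columns=['Bureau of Meteorology station number',
                      'Product code',
                      'Days of accumulation of maximum temperature'])

# Shorten the temperature label and keep only the quality-checked rows
df = df.rename(columns={'Maximum temperature (Degree C)': 'maxtemp'})
df = df[df.Quality == 'Y']

print(df)
```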
We can get a statistical summary of our data:
df.describe()
If we import the matplotlib.pyplot package, we can graph the data:
import matplotlib.pyplot as plt
plt.plot(df.Year, df.maxtemp)
Notice that PyPlot correctly formats the date axis and deals with the missing data by connecting the two known points on either side. We can convert a DataFrame into a NumPy array using the following:
ndarray = df.values
If the DataFrame contains a mixture of data types, then this attribute converts them to the lowest common denominator type, that is, the type that accommodates all the values. For example, if the DataFrame consists of a mix of float16 and float32 types, then the values will be converted to float32.
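The type conversion described above can be checked with a minimal sketch. The column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Two float columns of different precision
df_floats = pd.DataFrame({
    'a': np.array([1.0, 2.0], dtype=np.float16),
    'b': np.array([3.0, 4.0], dtype=np.float32),
})
floats_dtype = df_floats.values.dtype     # float32 accommodates both columns

# Mixing numbers and strings leaves only the generic object type
df_mixed = pd.DataFrame({'n': [1.5, 2.5], 's': ['x', 'y']})
mixed_dtype = df_mixed.values.dtype

print(floats_dtype, mixed_dtype)
```

An object array loses most of the speed advantages of NumPy, so it is usually worth selecting only the numerical columns before converting.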
The Pandas DataFrame is a great object for viewing and manipulating simple text and numerical data. However, Pandas is probably not the right tool for more sophisticated numerical processing such as calculating the dot product, or finding the solutions to linear systems. For numerical applications, we generally use the NumPy classes.
SciPy
SciPy (pronounced sigh pi) adds a layer to NumPy that wraps common scientific and statistical applications on top of the more purely mathematical constructs of NumPy. SciPy provides higher-level functions for manipulating and visualizing data, and it is especially useful when using Python interactively. SciPy is organized into sub-packages covering different scientific computing applications. A list of the packages most relevant to ML, along with their functions, appears as follows:
Package       Description
cluster       Contains two sub-packages: cluster.vq for K-means clustering and vector quantization, and cluster.hierarchy for hierarchical and agglomerative clustering, which is useful for distance matrices, calculating statistics on clusters, and visualizing clusters with dendrograms.
constants     Physical and mathematical constants such as pi and the speed of light.
integrate     Integration routines and differential equation solvers.
interpolate   Interpolation functions for creating new data points within a range of known points.
io            Input and output functions for creating string, binary, or raw data streams, and reading and writing to and from files.
optimize      Optimization and root finding.
linalg        Linear algebra routines such as basic matrix calculations, solving linear systems, finding determinants and norms, and decomposition.
ndimage       N-dimensional image processing.
odr           Orthogonal distance regression.
stats         Statistical distributions and functions.
Many of the NumPy modules have the same name and similar functionality as those in the SciPy package. For the most part, SciPy imports its NumPy equivalent and extends its functionality. However, be aware that some identically named functions in SciPy modules may have slightly different functionality compared to those in NumPy. It also should be mentioned that many of the SciPy classes have convenience wrappers in the scikit-learn package, and it is sometimes easier to use those instead.
Each of these packages requires an explicit import; here is an example:
import scipy.cluster
You can get documentation from the SciPy website (scipy.org) or from the console, for example, help(scipy.cluster).
As we have seen, a common task in many different ML settings is that of optimization. We looked at the mathematics of the simplex algorithm in the last chapter. Here is the implementation using SciPy. Recall that simplex optimizes a linear objective subject to a set of linear constraints. The problem we looked at was as follows:
Maximize x1 + x2 within the constraints of: 2x1 + x2 ≤ 4 and x1 + 2x2 ≤ 3
The linprog function is probably the simplest tool that will solve this problem. It is a minimization algorithm, so we reverse the sign of our objective.
from scipy.optimize import linprog
objective = [-1, -1]
con1 = [[2, 1], [1, 2]]
con2 = [4, 3]
res = linprog(objective, con1, con2)
print(res)
You will observe the following output:
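As a sketch of how the result object can be read, the solution vector is stored in res.x and the minimized objective in res.fun. Because we negated the objective, the maximum of x1 + x2 is -res.fun. The variable names best_x and best_value are hypothetical, and the constraints are passed by keyword here simply for readability:

```python
from scipy.optimize import linprog

objective = [-1, -1]          # minimize -(x1 + x2), that is, maximize x1 + x2
con1 = [[2, 1], [1, 2]]       # left-hand sides of the <= constraints
con2 = [4, 3]                 # right-hand sides of the <= constraints

# By default, linprog also constrains each variable to be non-negative
res = linprog(objective, A_ub=con1, b_ub=con2)

best_x = res.x                # values of x1 and x2 at the optimum
best_value = -res.fun         # undo the sign reversal

print(res.success, best_x, best_value)
```

Solving the two constraints as equalities by hand gives x1 = 5/3 and x2 = 2/3, so the maximum is 7/3, which is what the solver should report up to numerical tolerance.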
There is also a scipy.optimize.minimize function that is suitable for slightly more complicated problems. This function takes a solver as a parameter. There are currently about a dozen solvers available, and if you need a more specific solver, you can write your own. A commonly used solver, suitable for many problems, is nelder-mead. This particular solver uses a downhill simplex algorithm, which is basically a heuristic search that replaces the test point with the highest error with a new point found by reflecting it through the centroid of the remaining points. It iterates through this process until it converges on a minimum.
In this example, we use the Rosenbrock function as our test problem. This is a non-convex function that is often used to test optimization algorithms. The global minimum of this function lies in a long, narrow, parabolic valley that is relatively flat. Finding the valley is easy, but converging on the minimum within it is challenging for an algorithm. We will see more of this function:
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

def nMin(funct, x0):
    # xatol is the absolute tolerance on x (named xtol in older SciPy releases)
    return minimize(funct, x0, method='nelder-mead', options={'xatol': 1e-8, 'disp': True})

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
nMin(rosen, x0)
The output for the preceding code is as follows:
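The object returned by minimize can also be inspected programmatically rather than read from the printed message. The following is a minimal sketch, assuming a recent SciPy release in which the Nelder-Mead tolerance option is named xatol; the variable name res is hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # Rosenbrock function; its global minimum of 0 is at x = (1, 1, ..., 1)
    return np.sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='nelder-mead',
               options={'xatol': 1e-8, 'disp': False})

print(res.success)   # whether the solver reports convergence
print(res.x)         # location of the minimum that was found
print(res.fun)       # value of the function at that point
print(res.nit)       # number of iterations performed
```

Since the true minimum is known for this test function, comparing res.x against a vector of ones is a simple way to judge how well a given solver performed.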
The minimize function takes two mandatory parameters: the objective function and the initial guess, x0. It also takes an optional parameter for the solver method; in this example, we use the nelder-mead method.
The options are a solver-specific set of key-value pairs, represented as a dictionary.
Here, the tolerance option (xatol in current SciPy releases, xtol in older ones) is the absolute error in x between iterations that is acceptable for convergence, and disp is set to print a convergence message.
Another package that is extremely useful for machine learning applications is scipy.linalg. This package adds the ability to perform tasks such as inverting matrices, calculating eigenvalues, and matrix decomposition.
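A short sketch of the scipy.linalg tasks just mentioned follows. The matrix A and vector b are hypothetical values chosen so that the results are easy to verify by hand:

```python
import numpy as np
from scipy import linalg

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

A_inv = linalg.inv(A)              # matrix inverse
eigenvalues = linalg.eigvals(A)    # returned as complex numbers; here 2 and 5
P, L, U = linalg.lu(A)             # LU decomposition, with A = P @ L @ U
x = linalg.solve(A, b)             # solves A x = b without forming the inverse

print(A_inv)
print(np.sort(eigenvalues.real))
print(x)
```

When the goal is to solve a linear system, calling solve directly is generally preferable to multiplying by the inverse, as it is both faster and numerically more stable.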