2.6 Step 5: Build the models
With clean data in place and a good understanding of the content, you’re ready to build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system that you’re modeling. This phase is much more focused than the exploratory analysis step, because you know what you’re looking for and what you want the outcome to be. Figure 2.21 shows the components of model building.
The techniques you’ll use now are borrowed from the fields of machine learning, data mining, and statistics. In this chapter we only explore the tip of the iceberg of existing techniques, while chapter 3 introduces them properly. It’s beyond the scope of this book to give you more than a conceptual introduction, but it’s enough to get you started; 20% of the techniques will help you in 80% of the cases because techniques overlap in what they try to accomplish. They often achieve their goals in similar but slightly different ways.
Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and the type of technique you want to use. Either way, most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
2.6.1 Model and variable selection
Figure 2.21 Step 5: Data modeling. In the data science process (1: Setting the research goal, 2: Retrieving data, 3: Data preparation, 4: Data exploration, 5: Data modeling, 6: Presentation and automation), step 5 consists of model and variable selection, model execution, and model diagnostic and model comparison.
You’ll need to select the variables you want to include in your model and a modeling technique. Your findings from the exploratory analysis should already give a fair idea of what variables will help you construct a good model. Many modeling techniques are available, and choosing the right model for a problem requires judgment on your part. You’ll need to consider model performance and whether your project meets all the requirements to use your model, as well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
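As a rough illustration of how exploratory findings can feed variable selection, the following sketch ranks candidate variables by how strongly each one correlates with the target. The data and the variable names (x1, x2, noise) are made up for this example and don't come from the chapter; correlation screening is only one of several ways to shortlist variables.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: three candidate variables, of which only the first two
# actually drive the target. These names are assumptions for illustration.
n_rows = 500
candidates = {
    "x1": rng.random(n_rows),
    "x2": rng.random(n_rows),
    "noise": rng.random(n_rows),
}
target = 0.4 * candidates["x1"] + 0.6 * candidates["x2"] + 0.1 * rng.random(n_rows)

# Absolute Pearson correlation of each candidate with the target.
strength = {
    name: abs(np.corrcoef(values, target)[0, 1])
    for name, values in candidates.items()
}

# Strongest candidates first.
ranking = sorted(strength, key=strength.get, reverse=True)
for name in ranking:
    print(f"{name}: {strength[name]:.2f}")
```

A variable that ends up at the bottom of such a ranking isn't automatically useless (it may matter only in combination with others), so treat the ranking as a starting point for judgment rather than a verdict.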
When the thinking is done, it’s time for action.
2.6.2 Model execution
Once you’ve chosen a model you’ll need to implement it in code.
REMARK This is the first time we’ll go into actual Python code execution so make sure you have a virtual env up and running. Knowing how to set this up is required knowledge, but if it’s your first time, check out appendix D.
All code from this chapter can be downloaded from https://www.manning.com/books/introducing-data-science. This chapter comes with an IPython (.ipynb) notebook and a Python (.py) file.
Luckily, most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular techniques. Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. As you can see in the following code, it’s fairly easy to use linear regression (figure 2.22) with StatsModels or Scikit-learn. Doing this yourself would require much more effort even for the simple techniques. The following listing shows the execution of a linear prediction model.
Listing 2.1 Executing a linear prediction model on semi-random data
# Imports the required Python modules.
import statsmodels.api as sm
import numpy as np

# Creates random data for the predictors (x-values) and semi-random data for
# the target (y-values). The target is built from the predictors, so a
# correlation is baked in.
predictors = np.random.random(1000).reshape(500, 2)
target = predictors.dot(np.array([0.4, 0.6])) + np.random.random(500)

# Fits a linear regression on the data.
lmRegModel = sm.OLS(target, predictors)
result = lmRegModel.fit()

# Shows the model fit statistics (print is needed when run as a .py file).
print(result.summary())
Okay, we cheated here, quite heavily so. We created predictor values that are meant to predict how the target variable behaves. For a linear regression, a “linear relation” between each x (predictor) and the y (target) variable is assumed, as shown in figure 2.22.
We, however, created the target variable based on the predictors and added a bit of randomness. It shouldn’t come as a surprise that this gives us a well-fitting model. Calling result.summary() outputs the table in figure 2.23. Mind you, the exact outcome depends on the random values you got.
Figure 2.22 Linear regression tries to fit a line while minimizing the distance to each point. The x-axis shows X (predictor variable); the y-axis shows Y (target variable).
Figure 2.23 Linear regression model information output. The annotations point out the model fit (higher is better, but too high is suspicious); the p-value, which shows whether a predictor variable has a significant influence on the target (lower is better, and <0.05 is often considered “significant”); and the linear equation coefficients (y = 0.7658x1 + 1.1252x2).
Let’s ignore most of the output we got here and focus on the most important parts:
■ Model fit: For this the R-squared or adjusted R-squared is used. This measure indicates how much of the variation in the data is captured by the model. The difference between the adjusted R-squared and the R-squared is minimal here because the adjusted one is the normal one with a penalty for model complexity, and this model has only two predictors. A model gets complex when many variables (or features) are introduced. You don’t need a complex model if a simple model is available, so the adjusted R-squared punishes you for overcomplicating. At any rate, 0.893 is high, and it should be because we cheated. Rules of thumb vary, but in business settings a model fit above 0.85 is often considered good. If you want to win a competition, you need a fit in the high 0.90s. In research, however, very low model fits (even below 0.2) are common. What’s more important there is the influence of the introduced predictor variables.
■ Predictor variables have a coefficient: For a linear model this is easy to interpret. In our example, if you add 1 to x1 while holding x2 fixed, y changes by 0.7658. It’s easy to see how finding a good predictor can be your route to a Nobel Prize even though your model as a whole is rubbish. If, for instance, you determine that a certain gene is significant as a cause for cancer, this is important knowledge, even if that gene in itself doesn’t determine whether a person will get cancer. The example here is classification, not regression, but the point remains the same: detecting influences is more important in scientific studies than perfectly fitting models (not to mention more realistic). But when do we know a gene has that impact? This is called significance.
■ Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that the influence is there. This is what the p-value is about. A long explanation about type 1 and type 2 mistakes is possible here, but the short explanation would be: if the p-value is lower than 0.05, most people consider the variable significant. In truth, this is an arbitrary number. It means that if the predictor had no influence at all, you’d still see an effect at least this strong in about 5% of samples. Do you accept this 5% chance of a false alarm? That’s up to you. Some people also use the thresholds extremely significant (p<0.01) and marginally significant (p<0.1).
Linear regression works if you want to predict a value, but what if you want to classify something? Then you go to classification models, the best known among them being k-nearest neighbors.
As shown in figure 2.24, k-nearest neighbors looks at the labeled points near an unlabeled point and, based on these, predicts what the label should be.
Let’s try it in Python code using the Scikit-learn library, as in this next listing.
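To make the idea concrete before turning to a library, here is a hand-rolled sketch of the technique: measure the distance from the unlabeled point to every labeled point, keep the k closest, and let them vote. The function name knn_predict, the toy points, and the labels are invented for this illustration; a library implementation handles ties, weighting, and speed far more carefully.

```python
from collections import Counter

import numpy as np


def knn_predict(labeled_points, labels, query, k):
    """Return the majority label among the k labeled points closest to query."""
    # Euclidean distance from the query to every labeled point.
    distances = np.linalg.norm(labeled_points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote; on a tie, the label seen first among the nearest wins.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two small, well-separated groups of labeled points.
points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]], dtype=float)
labels = np.array(["circle", "circle", "circle", "square", "square", "square"])

print(knn_predict(points, labels, np.array([0.2, 0.1]), k=3))
print(knn_predict(points, labels, np.array([5.5, 5.2]), k=3))
```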
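# Imports the modules.
import numpy as np
from sklearn import neighbors

# Creates random predictor data and semi-random target data based on it.
predictors = np.random.random(1000).reshape(500, 2)
target = np.around(predictors.dot(np.array([0.4, 0.6])) + np.random.random(500))

# Fits a 10-nearest neighbors model.
clf = neighbors.KNeighborsClassifier(n_neighbors=10)
knn = clf.fit(predictors, target)

# Returns the share of cases the model classifies correctly.
knn.score(predictors, target)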
As before, we construct random correlated data and, surprise, surprise, we get 85% of cases correctly classified. If we want to look in depth, we need to score the model.
Don’t let knn.score() fool you; it returns the model accuracy, but by “scoring a model” we often mean applying it to data to make a prediction.
prediction = knn.predict(predictors)
Now we can use the prediction and compare it to the real thing using a confusion matrix.
from sklearn import metrics  # needed for confusion_matrix
metrics.confusion_matrix(target, prediction)
We get a 3-by-3 matrix as shown in figure 2.25.
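The matrix is 3-by-3 because rounding the target leaves three classes: 0, 1, and 2. The following sketch rebuilds the example end to end and shows how the accuracy reported by knn.score() can be recovered from the matrix itself. The fixed seed and the explicit labels argument are additions for this illustration.

```python
import numpy as np
from sklearn import metrics, neighbors

rng = np.random.default_rng(3)  # fixed seed, added here for repeatability
predictors = rng.random((500, 2))
target = np.around(predictors.dot(np.array([0.4, 0.6])) + rng.random(500))

knn = neighbors.KNeighborsClassifier(n_neighbors=10).fit(predictors, target)
prediction = knn.predict(predictors)

# Rows are the true classes 0, 1, 2; columns are the predicted classes.
matrix = metrics.confusion_matrix(target, prediction, labels=[0.0, 1.0, 2.0])
print(matrix)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
accuracy_from_matrix = np.trace(matrix) / matrix.sum()
print(accuracy_from_matrix, knn.score(predictors, target))
```

Everything off the diagonal is a misclassification, and the row and column tell you which class was mistaken for which.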
Listing 2.2 Executing k-nearest neighbor classification on semi-random data
Figure 2.24 K-nearest neighbor techniques look at the k nearest points to make a prediction. Panels: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.
Imports modules.
Creates random predictor data and semi-random target data based on predictor data.
Fits 10-nearest