Step 3: Cleansing, integrating, and transforming data

This chapter covers

2.4 Step 3: Cleansing, integrating, and transforming data

The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your task now is to sanitize and prepare it for use in the modeling and report-ing phase. Doreport-ing so is tremendously important because your models will perform bet-ter and you’ll lose less time trying to fix strange output. It can’t be mentioned nearly enough times: garbage in equals garbage out. Your model needs the data in a specific

Table 2.1 A list of open-data providers that should get you started

Open data site Description

Data.gov The home of the US Government’s open data https://open-data.europa.eu/ The home of the European Commission’s open data

Freebase.org An open database that retrieves its information from sites like Wikipedia, MusicBrains, and the SEC archive

Data.worldbank.org Open data initiative from the World Bank Aiddata.org Open data for international development

Open.fda.gov Open data from the US Food and Drug Administration

format, so data transformation will always come into play. It’s a good habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a realistic setting, so you’ll need to take corrective actions in your program.

Figure 2.4 shows the most common actions to take during the data cleansing, inte-gration, and transformation phase.

This mind map may look a bit abstract for now, but we’ll handle all of these points in more detail in the next sections. You’ll see a great commonality among all of these actions.

2.4.1 Cleansing data

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.

By “true and consistent representation” we imply that at least two types of errors exist. The first type is the interpretation error, such as when you take the value in your

Data science process

3: Data preparation –

Data cleansing –

Physically impossible values

Errors against codebook Missing values Errors from data entry

Outliers Spaces, typos, …

Data transformation

Combining data –

Extrapolating data Derived measures Aggregating data

Creating dummies

– Set operators

Merging/joining data sets

Creating views

Reducing number of variables 1: Setting the research goal +

2: Retrieving data +

4: Data exploration +

5: Data modeling +

6: Presentation and automation +

Figure 2.4 Step 3: Data preparation

data for granted, like saying that a person’s age is greater than 300 years. The second type of error points to inconsistencies between data sources or against your company’s standardized values. An example of this class of errors is putting “Female” in one table and “F” in another when they represent the same thing: that the person is female.

Another example is that you use Pounds in one table and Dollars in another. Too many possible errors exist for this list to be exhaustive, but table 2.2 shows an overview of the types of errors that can be detected with easy checks—the “low hanging fruit,”

as it were.

Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors; diagnostic plots can be especially insightful. For example, in fig-ure 2.5 we use a measfig-ure to identify data points that seem out of place. We do a regression to get acquainted with the data and detect the influence of individual observations on the regression line. When a single observation has too much influ-ence, this can point to an error in the data, but it can also be a valid point. At the data cleansing stage, these advanced methods are, however, rarely applied and often regarded by certain data scientists as overkill.

Now that we’ve given the overview, it’s time to explain these errors in more detail.

Table 2.2 An overview of common errors

General solution

Try to fix the problem early in the data acquisition chain or else fix it in the program.

Error description Possible solution

Errors pointing to false values within one data set

Mistakes during data entry Manual overrules Redundant white space Use string functions

Impossible values Manual overrules

Missing values Remove observation or value

Outliers Validate and, if erroneous, treat as missing value (remove or insert)

Errors pointing to inconsistencies between data sets

Deviations from a code book Match on keys or else use manual overrules Different units of measurement Recalculate

Different levels of aggregation Bring to same level of measurement by aggregation or extrapolation

DATAENTRYERRORS

Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their con-centration for a second and introduce an error into the chain. But data collected by machines or computers isn’t free from errors either. Errors can arise from human sloppiness, whereas others are due to machine or hardware failure. Examples of errors originating from machines are transmission errors or bugs in the extract, trans-form, and load phase (ETL).

For small data sets you can check every value by hand. Detecting data errors when the variables you study don’t have many classes can be done by tabulating the data with counts. When you have a variable that can take only two values: “Good” and

“Bad”, you can create a frequency table and see if those are truly the only two values present. In table 2.3, the values “Godo” and “Bade” point out something went wrong in at least 16 cases.

Table 2.3 Detecting outliers on simple variables with a frequency table

A single outlier can throw off a regression estimate.

Figure 2.5 The encircled point influences the model heavily and is worth investigating because it can point to a region where you don’t have enough data or might indicate an error in the data, but it also can be a valid data point.

Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == “Godo”:

x = “Good”

if x == “Bade”:

x = “Bad”

REDUNDANTWHITESPACE

Whitespaces tend to be hard to detect but cause errors like other redundant charac-ters would. Who hasn’t lost a few days in a project because of a bug that was caused by whitespaces at the end of a string? You ask the program to join two keys and notice that observations are missing from the output file. After looking for days through the code, you finally find the bug. Then comes the hardest part: explaining the delay to the project stakeholders. The cleaning during the ETL phase wasn’t well executed, and keys in one table contained a whitespace at the end of a string. This caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that couldn’t be matched.

If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most programming languages. They all provide string functions that will remove the leading and trailing whitespaces. For instance, in Python you can use the strip() function to remove leading and trailing spaces.

FIXING CAPITAL LETTER MISMATCHES Capital letter mismatches are common.

Most programming languages make a distinction between “Brazil” and “bra-zil”. In this case you can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python. “Brazil”.lower() ==

“brazil”.lower() should result in true.

IMPOSSIBLEVALUESANDSANITYCHECKS

Sanity checks are another valuable type of data check. Here you check the value against physically or theoretically impossible values such as people taller than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed with rules:

check = 0 <= age <= 120

OUTLIERS

An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values. An example is shown in figure 2.6.

The plot on the top shows no outliers, whereas the plot on the bottom shows possi-ble outliers on the upper side when a normal distribution is expected. The normal dis-tribution, or Gaussian disdis-tribution, is the most common distribution in natural sciences.

It shows most cases occurring around the average of the distribution and the occur-rences decrease when further away from it. The high values in the bottom graph can point to outliers when assuming a normal distribution. As we saw earlier with the regression example, outliers can gravely influence your data modeling, so investigate them first.

DEALINGWITHMISSINGVALUES

Missing values aren’t necessarily wrong, but you still need to handle them separately;

certain modeling techniques can’t handle missing values. They might be an indicator that something went wrong in your data collection or that an error happened in the ETL process. Common techniques data scientists use are listed in table 2.4.

Expected distribution

Frequency

–3 –2 –1 0 1 2 3

200

150

100

Distribution with outliers

Frequency

–2 0 2 4

200

150

100

Figure 2.6 Distribution plots are helpful in detecting outliers and helping you understand the variable.

Which technique to use at what time is dependent on your particular case. If, for instance, you don’t have observations to spare, omitting an observation is probably not an option. If the variable can be described by a stable distribution, you could impute based on this. However, maybe a missing value actually means “zero”? This can be the case in sales for instance: if no promotion is applied on a customer basket, that customer’s promo is missing, but most likely it’s also 0, no price cut.

DEVIATIONSFROMACODEBOOK

Detecting errors in larger data sets against a code book or against standardized values can be done with the help of set operations. A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observa-tion, the number of observations, and what each encoding within a variable means.

(For instance “0” equals “negative”, “5” stands for “very positive”.) A code book also tells the type of data you’re looking at: is it hierarchical, graph, something else?

You look at those values that are present in set A but not in set B. These are values that should be corrected. It’s no coincidence that sets are the data structure that we’ll use when we’re working in code. It’s a good habit to give your data structures addi-tional thought; it can save work and improve the performance of your program.

If you have multiple values to check, it’s better to put them from the code book into a table and use a difference operator to check the discrepancy between both tables. This way, you can profit from the power of a database directly. More on this in chapter 5.

Table 2.4 An overview of techniques to handle missing data

Technique Advantage Disadvantage

Omit the values Easy to perform You lose the information from an observation

Set value to null Easy to perform Not every modeling technique and/or implementation can han-dle null values

Impute a static value such as 0 or the mean

Easy to perform

You don’t lose information from the other variables in the observation

Can lead to false estimations from a model

Impute a value from an esti-mated or theoretical distribution

Does not disturb the model as much

Harder to execute

You make data assumptions Modeling the value

(nondepen-dent)

Does not disturb the model too much

Can lead to too much confidence in the model

Can artificially raise depen-dence among the variables Harder to execute

You make data assumptions

DIFFERENTUNITSOFMEASUREMENT

When integrating two data sets, you have to pay attention to their respective units of measurement. An example of this would be when you study the prices of gasoline in the world. To do this you gather data from different data providers. Data sets can con-tain prices per gallon and others can concon-tain prices per liter. A simple conversion will do the trick in this case.

DIFFERENTLEVELSOFAGGREGATION

Having different levels of aggregation is similar to having different types of measure-ment. An example of this would be a data set containing data per week versus one containing data per work week. This type of error is generally easy to detect, and sum-marizing (or the inverse, expanding) the data sets will fix it.

After cleaning the data errors, you combine information from different data sources. But before we tackle this topic we’ll take a little detour and stress the impor-tance of cleaning data as early as possible.

2.4.2 Correct errors as early as possible

A good practice is to mediate data errors as early as possible in the data collection chain and to fix as little as possible inside your program while fixing the origin of the problem. Retrieving data is a difficult task, and organizations spend millions of dollars on it in the hope of making better decisions. The data collection process is error-prone, and in a big organization it involves many steps and teams.

Data should be cleansed when acquired for many reasons:

■ Not everyone spots the data anomalies. Decision-makers may make costly mis-takes on information based on incorrect data from applications that fail to cor-rect for the faulty data.

■ If errors are not corrected early on in the process, the cleansing will have to be done for every project that uses that data.

■ Data errors may point to a business process that isn’t working as designed. For instance, both authors worked at a retailer in the past, and they designed a cou-poning system to attract more people and make a higher profit. During a data science project, we discovered clients who abused the couponing system and earned money while purchasing groceries. The goal of the couponing system was to stimulate cross-selling, not to give products away for free. This flaw cost the company money and nobody in the company was aware of it. In this case the data wasn’t technically wrong but came with unexpected results.

■ Data errors may point to defective equipment, such as broken transmission lines and defective sensors.

■ Data errors can point to bugs in software or in the integration of software that may be critical to the company. While doing a small project at a bank we discov-ered that two software applications used different local settings. This caused problems with numbers greater than 1,000. For one app the number 1.000 meant one, and for the other it meant one thousand.

Fixing the data as soon as it’s captured is nice in a perfect world. Sadly, a data scientist doesn’t always have a say in the data collection and simply telling the IT department to fix certain things may not make it so. If you can’t correct the data at the source, you’ll need to handle it inside your code. Data manipulation doesn’t end with correcting mistakes; you still need to combine your incoming data.

As a final remark: always keep a copy of your original data (if possible). Sometimes you start cleaning data but you’ll make mistakes: impute variables in the wrong way, delete outliers that had interesting additional information, or alter data as the result of an initial misinterpretation. If you keep a copy you get to try again. For “flowing data” that’s manipulated at the time of arrival, this isn’t always possible and you’ll have accepted a period of tweaking before you get to use the data you are capturing. One of the more difficult things isn’t the data cleansing of individual data sets however, it’s combining different sources into a whole that makes more sense.

2.4.3 Combining data from different data sources

Your data comes from several different places, and in this substep we focus on inte-grating these different sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.

We focus on data in table structures in this chapter for the sake of brevity. It’s easy to fill entire books on this topic alone, and we choose to focus on the data science pro-cess instead of presenting scenarios for every type of data. But keep in mind that other types of data sources exist, such as key-value stores, document stores, and so on, which we’ll handle in more appropriate places in the book.

THEDIFFERENTWAYSOFCOMBININGDATA

You can perform two operations to combine information from different data sets. The first operation is joining: enriching an observation from one table with information from another table. The second operation is appending or stacking: adding the observa-tions of one table to those of another table.

When you combine data, you have the option to create a new physical table or a virtual table by creating a view. The advantage of a view is that it doesn’t consume more disk space. Let’s elaborate a bit on these methods.

JOININGTABLES

Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation. Let’s say that the first table contains information about the pur-chases of a customer and the other table contains information about the region where your customer lives. Joining the tables allows you to combine the information so that you can use it for your model, as shown in figure 2.7.

To join tables, you use variables that represent the same object in both tables, such as a date, a country name, or a Social Security number. These common fields are known as keys. When these keys also uniquely define the records in the table they

are called primary keys. One table may have buying behavior and the other table may have demographic information on a person. In figure 2.7 both tables contain the cli-ent name, and this makes it easy to enrich the clicli-ent expenditures with the region of the client. People who are acquainted with Excel will notice the similarity with using a lookup function.

The number of resulting rows in the output table depends on the exact join type that you use. We introduce the different types of joins later in the book.

APPENDINGTABLES

Appending or stacking tables is effectively adding observations from one table to another table. Figure 2.8 shows an example of appending tables. One table contains the observations from the month January and the second table contains observations from the month February. The result of appending these tables is a larger one with the observations from January as well as February. The equivalent operation in set the-ory would be the union, and this is also the command in SQL, the common language

在文檔中 Davy CielenArno D. B. MeysmanMohamed Ali (頁 50-64)