Overview of the data science process

This chapter covers

Understanding the flow of a data science process

Discussing the steps in a data science process

2.1 Overview of the data science process

Following a structured approach to data science helps you to maximize your chances of success in a data science project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best. Take care, however: this approach may not be suitable for every type of project or be the only way to do good data science.

The typical data science process consists of six steps through which you’ll iterate, as shown in figure 2.1.

Figure 2.1 summarizes the data science process and shows the main steps and actions you’ll take during a project. The following list is a short introduction; each of the steps will be discussed in greater depth throughout this chapter.

1 The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.

2 The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

Figure 2.1 The six steps of the data science process (labels surviving from the diagram: 1: Setting the research goal; 2: Retrieving data; Model diagnostic and model comparison; Reducing number of variables)

3 Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.

4 The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.

5 Finally, we get to the sexiest part: model building (often referred to as “data modeling” throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember that research has taught us that a combination of simple models often (but not always) outperforms one complicated model. If you’ve done this phase right, you’re almost done.

6 The last step of the data science process is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the analysis will save time.
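To make steps 3 and 4 more concrete, here is a minimal sketch in Python using only the standard library. The records, field names, and the plausible age range are assumptions made up for illustration; a real project would typically try to correct bad values rather than simply drop the row, as this sketch does for brevity.

```python
from statistics import mean, median

# Hypothetical raw records; field names and values are illustrative only.
RAW_ROWS = [
    {"customer_id": "1", "age": "34", "spend": "120.50"},
    {"customer_id": "2", "age": "-5", "spend": "80.00"},   # impossible age
    {"customer_id": "3", "age": "41", "spend": ""},        # missing spend
    {"customer_id": "4", "age": "29", "spend": "95.25"},
    {"customer_id": "5", "age": "52", "spend": "210.00"},
]


def prepare(rows):
    """Step 3: convert types and drop rows that fail basic sanity checks."""
    clean = []
    for row in rows:
        try:
            age = int(row["age"])
            spend = float(row["spend"])
        except ValueError:
            continue  # unparseable or missing value
        if not 0 <= age <= 120:
            continue  # value outside an assumed plausible range
        clean.append({"customer_id": row["customer_id"], "age": age, "spend": spend})
    return clean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def explore(rows):
    """Step 4: a few descriptive statistics plus one correlation."""
    ages = [r["age"] for r in rows]
    spends = [r["spend"] for r in rows]
    return {
        "n": len(rows),
        "mean_spend": mean(spends),
        "median_age": median(ages),
        "age_spend_corr": pearson(ages, spends),
    }


clean_rows = prepare(RAW_ROWS)
summary = explore(clean_rows)
print(summary)
```

Keeping preparation and exploration as separate functions mirrors the separation of steps in the list above, and makes it easy to rerun exploration whenever the preparation rules change.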

In reality you won’t progress in a linear way from step 1 to step 6. Often you’ll regress and iterate between the different phases.

Following these six steps pays off in terms of a higher project success ratio and increased impact of research results. This process ensures you have a well-defined research plan, a good understanding of the business question, and clear deliverables before you even start looking at data. The first steps of your process focus on getting high-quality data as input for your models. This way your models will perform better later on. In data science there’s a well-known saying: Garbage in equals garbage out.

Another benefit of following a structured approach is that you work more in prototype mode while you search for the best model. When building a prototype, you’ll probably try multiple models and won’t focus heavily on issues such as program speed or writing code against standards. This allows you to focus on bringing business value instead.
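Working in prototype mode might look something like the following sketch: fit a couple of simple candidate models, score each on held-out data, and keep the best one. The toy data, the two candidate models, and the choice of mean absolute error as the score are all assumptions for illustration, not a recommended setup.

```python
from statistics import mean

# Hypothetical toy data: (x, y) pairs, illustrative only.
TRAIN = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
HOLDOUT = [(6, 12.0), (7, 14.1)]


def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mean_abs_error(model, data):
    """Average absolute difference between predictions and targets."""
    return mean(abs(model(x) - y) for x, y in data)


# Try every candidate quickly; code polish and speed come later.
candidates = {"mean_baseline": fit_mean, "linear": fit_line}
scores = {name: mean_abs_error(fit(TRAIN), HOLDOUT)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Adding another candidate only requires one more entry in `candidates`, which keeps the cost of trying an idea low during prototyping.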

Not every project is initiated by the business itself. Insights learned during analysis or the arrival of new data can spawn new projects. When the data science team generates an idea, work has already been done to make a proposition and find a business sponsor.

Dividing a project into smaller stages also allows employees to work together as a team. It’s impossible to be a specialist in everything. You’d need to know how to upload all the data to all the different databases, find an optimal data scheme that works not only for your application but also for other projects inside your company, and then keep track of all the statistical and data-mining techniques, while also being an expert in presentation tools and business politics. That’s a hard task, and it’s why more and more companies rely on a team of specialists rather than trying to find one person who can do it all.

The process we described in this section is best suited for a data science project that contains only a few models. It’s not suited for every type of project. For instance, a project that contains millions of real-time models would need a different approach than the flow we describe here. A beginning data scientist should get a long way following this manner of working, though.

2.1.1 Don’t be a slave to the process

Not every project will follow this blueprint, because your process is subject to the preferences of the data scientist, the company, and the nature of the project you work on.

Some companies may require you to follow a strict protocol, whereas others have a more informal manner of working. In general, you’ll need a structured approach when you work on a complex project or when many people or resources are involved.

The agile project model is an alternative to a sequential process with iterations. As this methodology gains ground in the IT department and throughout the company, it’s also being adopted by the data science community. Although the agile methodology is suitable for a data science project, many company policies will favor a more rigid approach toward data science.

Planning every detail of the data science process upfront isn’t always possible, and more often than not you’ll iterate between the different steps of the process.

For instance, after the briefing you start your normal flow until you’re in the exploratory data analysis phase. Your graphs show a distinction in the behavior between two groups, perhaps men and women. You aren’t sure because you don’t have a variable that indicates whether the customer is male or female. You need to retrieve an extra data set to confirm this. For this you need to go through the approval process, which means that you (or the business) need to provide a kind of project charter. In big companies, getting all the data you need to finish your project can be an ordeal.
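Once the extra data set arrives, confirming the hunch could be as simple as the sketch below: join the new variable onto the existing records and compare the groups. The data sets, the `customer_id` key, the `basket_value` field, and the group labels are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical existing data: one record per purchase.
purchases = [
    {"customer_id": 1, "basket_value": 20.0},
    {"customer_id": 2, "basket_value": 55.0},
    {"customer_id": 3, "basket_value": 24.0},
    {"customer_id": 4, "basket_value": 61.0},
    {"customer_id": 5, "basket_value": 40.0},
]

# Hypothetical extra data set obtained after approval.
# Customer 5 is deliberately absent to show an unmatched record.
customer_gender = {1: "F", 2: "M", 3: "F", 4: "M"}


def mean_by_group(rows, group_of, value_key):
    """Join rows to a group lookup and average value_key per group.

    Returns the per-group means and the number of rows with no match.
    """
    groups = defaultdict(list)
    unmatched = 0
    for row in rows:
        group = group_of.get(row["customer_id"])
        if group is None:
            unmatched += 1
            continue
        groups[group].append(row[value_key])
    return {g: mean(values) for g, values in groups.items()}, unmatched


group_means, unmatched = mean_by_group(purchases, customer_gender, "basket_value")
print(group_means, unmatched)
```

Counting unmatched rows is a cheap check that the newly retrieved data actually covers the customers you are analyzing; a high count would send you back to the retrieval step once more.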

2.2 Step 1: Defining research goals and creating