Join the NoSQL movement
6.2 Case study: What disease is that?
It has happened to many of us: you have sudden medical symptoms and the first thing you do is Google what disease the symptoms might indicate; then you decide whether it’s worth seeing a doctor. A web search engine is okay for this, but a more dedicated database would be better. Databases like this exist and are fairly advanced; they can be almost a virtual version of Dr. House, a brilliant diagnostician in the TV series House M.D. But they’re built upon well-protected data and not all of it is accessible by the public. Also, although big pharmaceutical companies and advanced hospitals have access to these virtual doctors, many general practitioners are still stuck with only their books. This information and resource asymmetry is not only sad and dangerous, it needn’t be there at all. If a simple, disease-specific search engine were used by all general practitioners in the world, many medical mistakes could be avoided.
Figure 6.15 Top 15 databases ranked by popularity according to DB-Engines.com in March 2015
In this case study, you’ll learn how to build such a search engine, albeit using only a fraction of the medical data that is freely accessible. To tackle the problem, you’ll use a modern NoSQL database called Elasticsearch to store the data, and the data science process to work with the data and turn it into a resource that’s fast and easy to search. Here’s how you’ll apply the process:
1 Setting the research goal.
2 Data collection—You’ll get your data from Wikipedia. There are more sources out there, but for demonstration purposes a single one will do.
3 Data preparation—The Wikipedia data might not be perfect in its current format. You’ll apply a few techniques to change this.
4 Data exploration—Your use case is special in that step 4 of the data science process is also the desired end result: you want your data to become easy to explore.
5 Data modeling—No real data modeling is applied in this chapter. Document-term matrices that are used for search are often the starting point for advanced topic modeling. We won’t go into that here.
6 Presenting results—To make data searchable, you’d need a user interface such as a website where people can query and retrieve disease information. In this chapter you won’t go so far as to build an actual interface. Your secondary goal is profiling a disease category by its keywords; you’ll reach this stage of the data science process because you’ll present it as a word cloud, such as the one in figure 6.16.
To follow along with the code, you’ll need these items:
■ A Python session with the elasticsearch-py and Wikipedia libraries installed (pip install elasticsearch and pip install wikipedia)
■ A locally set up Elasticsearch instance; see appendix A for installation instructions
■ The IPython library
NOTE The code for this chapter is available to download from the Manning website for this book at https://manning.com/books/introducing-data-science and is in IPython format.
Figure 6.16 A sample word cloud on non-weighted diabetes keywords
6.2.1 Step 1: Setting the research goal
Can you diagnose a disease by the end of this chapter, using nothing but your own home computer and the free software and data out there? Knowing what you want to do and how to do it is the first step in the data science process, as shown in figure 6.17.
■ Your primary goal is to set up a disease search engine that would help general practitioners in diagnosing diseases.
■ Your secondary goal is to profile a disease: What keywords distinguish it from other diseases?
This secondary goal is useful for educational purposes or as input to more advanced uses such as detecting spreading epidemics by tapping into social media. With your research goal and a plan of action defined, let’s move on to the data retrieval step.
Elasticsearch: the open source search engine/NoSQL database
To tackle the problem at hand, diagnosing a disease, the NoSQL database you’ll use is Elasticsearch. Like MongoDB, Elasticsearch is a document store. But unlike MongoDB, Elasticsearch is a search engine. Whereas MongoDB is great at performing complex calculations and MapReduce jobs, Elasticsearch’s main purpose is full-text search. Elasticsearch will do basic calculations on indexed numerical data such as summing, counts, median, mean, standard deviation, and so on, but in essence it remains a search engine.
Elasticsearch is built on top of Apache Lucene, the Apache search engine created in 1999. Lucene is notoriously hard to handle and is more a building block for more user-friendly applications than an end-to-end solution in itself. But Lucene is an enormously powerful search engine, and Apache Solr followed in 2004, opening for public use in 2006. Solr (an open source, enterprise search platform) is built on top of Apache Lucene and is at this moment still the most versatile and popular open source search engine. Solr is a great platform and worth investigating if you get involved in a project requiring a search engine. In 2010 Elasticsearch emerged, quickly gaining in popularity. Although Solr can still be difficult to set up and configure, even for small projects, Elasticsearch couldn’t be easier. Solr still has an advantage in the number of possible plugins expanding its core functionality, but Elasticsearch is quickly catching up and today its capabilities are of comparable quality.
Figure 6.17 Step 1 in the data science process: setting the research goal (primary goal: disease search; secondary goal: disease profiling)
6.2.2 Steps 2 and 3: Data retrieval and preparation
Data retrieval and data preparation are two distinct steps in the data science process, and even though this remains true for the case study, we’ll explore both in the same section. This way you can avoid setting up local intermediate storage and immediately do data preparation while the data is being retrieved. Let’s look at where we are in the data science process (see figure 6.18).
As shown in figure 6.18 you have two possible sources: internal data and external data.
■ Internal data—You have no disease information lying around. If you currently work for a pharmaceutical company or a hospital, you might be luckier.
■ External data—All you can use for this case is external data. You have several possibilities, but you’ll go with Wikipedia.
When you pull the data from Wikipedia, you’ll need to store it in your local Elasticsearch index, but before you do that you’ll need to prepare the data. Once data has entered the Elasticsearch index, it can’t be altered; all you can do then is query it.
Look at the data preparation overview in figure 6.19.
As shown in figure 6.19 there are three distinct categories of data preparation to consider:
■ Data cleansing—The data you’ll pull from Wikipedia can be incomplete or erroneous. Data entry errors and spelling mistakes are possible, and even false information isn’t excluded. Luckily, you don’t need the list of diseases to be exhaustive, and you can handle spelling mistakes at search time; more on that later. Thanks to the Wikipedia Python library, the textual data you’ll receive is fairly clean already. If you were to scrape it manually, you’d need to add HTML cleaning, removing all HTML tags. The truth of the matter is full-text search tends to be fairly robust toward common errors such as incorrect values. Even if you dumped in HTML tags on purpose, they’d be unlikely to influence the results; the HTML tags are too different from normal language to interfere.
Figure 6.18 Data science process step 2: data retrieval. In this case there’s no internal data; all data will be fetched from Wikipedia.
■ Data transformation—You don’t need to transform the data much at this point; you want to search it as is. But you’ll make the distinction between page title, disease name, and page body. This distinction is almost mandatory for search result interpretation.
■ Combining data—All the data is drawn from a single source in this case, so you have no real need to combine data. A possible extension to this exercise would be to get disease data from another source and match the diseases. This is no trivial task because no unique identifier is present and the names are often slightly different.
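If you did scrape the pages yourself, the HTML cleaning mentioned under data cleansing could look roughly like the following. This is a minimal Python 3 sketch that uses only the standard library; the names TagStripper and strip_html are made up for illustration and are not part of the wikipedia library, and a real scraper would likely also want to drop scripts, styles, and navigation text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_html(html):
    """Return the text content of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return stripper.text()


sample = "<p>Diabetes is a <b>metabolic</b> disease.</p>"
print(strip_html(sample))
```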
You can do data cleansing at only two stages: when using the Python program that connects Wikipedia to Elasticsearch and when running the Elasticsearch internal indexing system:
■ Python—Here you define what data you’ll allow to be stored by your document store, but you won’t clean the data or transform the data at this stage, because Elasticsearch is better at it for less effort.
■ Elasticsearch—Elasticsearch will handle the data manipulation (creating the index) under the hood. You can still influence this process, and you’ll do so more explicitly later in this chapter.
Figure 6.19 Data science process step 3: data preparation. The figure breaks data preparation into data cleansing (errors from data entry, physically impossible values, missing values, outliers, spaces and typos, errors against codebook), data transformation (aggregating data, extrapolating data, derived measures, creating dummies, reducing number of variables), and combining data (merging/joining data sets, set operators, creating views).
Now that you have an overview of the steps to come, let’s get to work. If you followed the instructions in the appendix, you should now have a local instance of Elasticsearch up and running. First comes data retrieval: you need information on the different diseases. You have several ways to get that kind of data. You could ask companies for their data or get data from Freebase or other open and free data sources. Acquiring your data can be a challenge, but for this example you’ll be pulling it from Wikipedia. This is a bit ironic because searches on the Wikipedia website itself are handled by Elasticsearch. Wikipedia used to have its own system built on top of Apache Lucene, but it became unmaintainable, and as of January 2014 Wikipedia began using Elasticsearch instead.
Wikipedia has a Lists of diseases page, as shown in figure 6.20. From here you can borrow the data from the alphabetical lists.
You know what data you want; now go grab it. You could download the entire Wikipedia data dump. If you want to, you can download it from http://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki.
Figure 6.20 Wikipedia’s Lists of diseases page, the starting point for your data retrieval
Of course, if you were to index the entire Wikipedia, the index would end up requiring about 40 GB of storage. Feel free to use this solution, but for the sake of preserving storage and bandwidth, we’ll limit ourselves in this book to pulling only the data we intend to use. Another option is scraping the pages you require. Like Google, you can make a program crawl through the pages and retrieve the entire rendered HTML. This would do the trick, but you’d end up with the actual HTML, so you’d need to clean that up before indexing it. Also, unless you’re Google, websites aren’t too fond of crawlers scraping their web pages. This creates an unnecessarily high amount of traffic, and if enough people send crawlers, it can bring the HTTP server to its knees, spoiling the fun for everyone. Sending billions of requests at the same time is also one of the ways denial of service (DoS) attacks are performed. If you do need to scrape a website, script in a time gap between each page request. This way, your scraper more closely mimics the behavior of a regular website visitor and you won’t blow up their servers.
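The time gap between page requests could be scripted along these lines. This is only a sketch in Python 3: fetch_politely and fake_fetch are hypothetical names, the one-second default is an arbitrary assumption rather than a recommended value, and the fetch function is passed in so the example runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return "<html>" + url + "</html>"


pages = fetch_politely(["page-a", "page-b"], fake_fetch, delay_seconds=0.01)
print(len(pages))
```

In a real scraper you would pass a function that performs the HTTP request in place of fake_fetch.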
Luckily, the creators of Wikipedia are smart enough to know that this is exactly what would happen with all this information open to everyone. They’ve put an API in place from which you can safely draw your information. You can read more about it at http://www.mediawiki.org/wiki/API:Main_page.
You’ll draw from the API. And Python wouldn’t be Python if it didn’t already have a library to do the job. There are several actually, but the easiest one will suffice for your needs: Wikipedia.
Activate your Python virtual environment and install all the libraries you’ll need for the rest of the book:
pip install wikipedia
pip install elasticsearch
You’ll use Wikipedia to tap into Wikipedia. Elasticsearch is the main Elasticsearch Python library; with it you can communicate with your database.
Open your favorite Python interpreter and import the necessary libraries:
from elasticsearch import Elasticsearch
import wikipedia
You’re going to draw data from the Wikipedia API and at the same time index on your local Elasticsearch instance, so first you need to prepare it for data acceptance.
client = Elasticsearch()               # Elasticsearch client used to communicate with the database
indexName = "medical"                  # index name
client.indices.create(index=indexName) # create the index
The first thing you need is a client. Elasticsearch() can be initialized with an address but the default is localhost:9200. Elasticsearch() and Elasticsearch('localhost:9200') are thus the same thing: your client is connected to your local Elasticsearch node. Then you create an index named "medical". If all goes well, you should see an "acknowledged:true" reply, as shown in figure 6.21.
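If you rerun the notebook, creating an index that already exists raises an error. One way around this, sketched here in Python 3, is to check for the index first. The helper name ensure_index is hypothetical; it relies only on the client's indices.exists and indices.create calls and assumes a client like the one created above.

```python
def ensure_index(client, index_name):
    """Create the index only if it is missing; return True when it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name)
    return True


# Usage with a running local node (not executed here):
# from elasticsearch import Elasticsearch
# ensure_index(Elasticsearch(), "medical")
```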
Elasticsearch claims to be schema-less, meaning you can use Elasticsearch without defining a database schema and without telling Elasticsearch what kind of data it needs to expect. Although this is true for simple cases, you can’t avoid having a schema in the long run, so let’s create one, as shown in the following listing.
# Define the mapping: three string fields per disease document
diseaseMapping = {
    'properties': {
        'name': {'type': 'string'},
        'title': {'type': 'string'},
        'fulltext': {'type': 'string'}
    }
}
client.indices.put_mapping(index=indexName, doc_type='diseases', body=diseaseMapping)
This way you tell Elasticsearch that your index will have a document type called "diseases", and you supply it with the field type for each of the fields. You have three fields in a disease document: name, title, and fulltext, all of them of type string. If you hadn’t supplied the mapping, Elasticsearch would have guessed their types by looking at the first entry it received. If it didn’t recognize the field to be boolean, double, float, long, integer, or date, it would set it to string. In this case, you didn’t need to manually specify the mapping.
Now let’s move on to Wikipedia. The first thing you want to do is fetch the Lists of diseases page, because this is your entry point for further exploration:
dl = wikipedia.page("Lists_of_diseases")
You now have your first page, but you’re more interested in the listing pages because they contain links to the diseases. Check out the links:
dl.links
The Lists of diseases page comes with more links than you’ll use. Figure 6.22 shows the alphabetical lists starting at the sixteenth link.
Listing 6.1 Adding a mapping to the document type (a mapping is defined and then attached to the "diseases" doc type, telling Elasticsearch what data to expect)
Figure 6.21 Creating an Elasticsearch index with Python-Elasticsearch
This page has a considerable array of links, but only the alphabetic lists interest you, so keep only those:
diseaseListArray = []
for link in dl.links[15:42]:
    try:
        diseaseListArray.append(wikipedia.page(link))
    except Exception as e:
        print(str(e))
You’ve probably noticed that the subset is hardcoded, because you know they’re the 16th to 42nd entries in the array. If Wikipedia were to add even a single link before the ones you’re interested in, it would throw off the results. A better practice would be to use regular expressions for this task. For exploration purposes, hardcoding the entry numbers is fine, but if regular expressions are second nature to you or you intend to turn this code into a batch job, regular expressions are recommended. You can find more information on them at https://docs.python.org/2/howto/regex.html.
One possibility for a regex version would be the following code snippet.
import re

diseaseListArray = []
# match() anchors at the start of the link title
check = re.compile("List of diseases.*")
for link in dl.links:
    if check.match(link):
        try:
            diseaseListArray.append(wikipedia.page(link))
        except Exception as e:
            print(str(e))
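To see how the regular expression separates the alphabetical lists from the other links without calling Wikipedia, you can try it on a handful of titles. The sketch below is Python 3; the sample titles are made-up stand-ins for dl.links, and the function name alphabetical_lists is hypothetical.

```python
import re

# match() only succeeds when the pattern fits at the start of the title
list_page = re.compile(r"List of diseases")


def alphabetical_lists(links):
    """Keep only the link titles that start with 'List of diseases'."""
    return [link for link in links if list_page.match(link)]


# Made-up sample titles standing in for dl.links
sample_links = [
    "Airborne disease",
    "List of diseases (0-9)",
    "List of diseases (A)",
    "Lists of diseases",
    "List of diseases (Z)",
]
print(alphabetical_lists(sample_links))
```

Because the filter looks at the title rather than the position, it keeps working if links are added before the alphabetical lists.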
Figure 6.22 Links on the Wikipedia page Lists of diseases. It has more links than you’ll need.
Figure 6.23 shows the first entries of what you’re after: the diseases themselves.
diseaseListArray[0].links
It’s time to index the diseases. Once they’re indexed, both data entry and data preparation are effectively over, as shown in the following listing.
# One list of allowed first characters per alphabetical list page
checkList = [["0","1","2","3","4","5","6","7","8","9"],
             ["A"],["B"],["C"],["D"],["E"],["F"],["G"],["H"],
             ["I"],["J"],["K"],["L"],["M"],["N"],["O"],["P"],
             ["Q"],["R"],["S"],["T"],["U"],["V"],["W"],["X"],["Y"],["Z"]]
docType = 'diseases'
for diseaselistNumber, diseaselist in enumerate(diseaseListArray):
    # Loop through the links of every disease list
    for disease in diseaselist.links:
        try:
            # First check if it's a disease, then index it
            if (disease[0] in checkList[diseaselistNumber]
                    and disease[0:4] != "List"):
                currentPage = wikipedia.page(disease)
                client.index(index=indexName,
                             doc_type=docType, id=disease,
                             body={"name": disease,
                                   "title": currentPage.title,
                                   "fulltext": currentPage.content})
        except Exception as e:
            print(str(e))
Listing 6.2 Indexing diseases from Wikipedia (the checklist is an array containing an array of allowed first characters for each disease list; each link is first checked to see if it’s a disease, then indexed)
Figure 6.23 First Wikipedia disease list, “list of diseases (0-9)”
Because each of the list pages will have links you don’t need, check to see if an entry is a disease. You indicate for each list what character the disease starts with, so you check for this. Additionally you exclude the links starting with “List” because these will pop up once you get to the L list of diseases. The check is rather naïve, but the cost of having a few unwanted entries is rather low because the search algorithms will exclude irrelevant results once you start querying. For each disease you index the disease name and the full text of the page. The name is also used as its index ID; this is useful for several advanced Elasticsearch features but also for quick lookup in the browser. For example, try this URL in your browser: http://localhost:9200/medical/diseases/11%20beta%20hydroxylase%20deficiency. The title is indexed separately; in most cases the link name and the page title will be identical and sometimes the title will contain an alternative name for the disease.
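The naïve disease check can be pulled out into a small function and tried on a few titles before running the full indexing loop. This is a Python 3 sketch; is_disease_link is a hypothetical helper name and the titles are only examples.

```python
def is_disease_link(link, allowed_first_chars):
    """Naive check: right first character and not another 'List ...' page."""
    return (bool(link)
            and link[0] in allowed_first_chars
            and not link.startswith("List"))


print(is_disease_link("Lyme disease", ["L"]))          # kept
print(is_disease_link("List of diseases (M)", ["L"]))  # skipped: a list page
print(is_disease_link("Anemia", ["L"]))                # skipped: wrong first letter
```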
With at least a few diseases indexed it’s possible to make use of the Elasticsearch URI for simple lookups. Have a look at a full body search for the word headache in