Cracking the Nut

From the document Beautiful Data (pages 187-192)

Oakland CrimeWatch is an application that serves crime report information in on-demand images with relatively primitive cartography and cartoon-like icons. CrimeWatch is optimized for data display, and follows from a development approach that focuses on predicting user needs rather than making raw ingredients available. The user experience of the application is informed by “wizards,” user interfaces where the user is presented with a sequence of dialog boxes that lead through a series of steps, performing tasks in a specific sequence. The steps required by CrimeWatch are:

1. What: select the type or types of incidents.

2. Where: search near an address, within an administrative boundary, or near a feature, such as a school or park.

3. When: how far into the past to search.

CrimeWatch responds with a static image showing iconic representations of individual reports. These can be clicked for more information.

170 C H A P T E R E L E V E N

My interest in CrimeWatch was first piqued when I began to think about a way to reverse the server-side merging process, to start with a static image and extract crime report information with explicit location information attached: latitude and longitude values compatible with those used by other geographic software systems, commonly called geolocation. This kind of simple recognition problem is fairly well understood, and there are well-established techniques for visual feature extraction.

First, we need to get an image to work with. This is actually more complicated than it seems, and we must jump through a series of hoops to convince the server to generate a crime report map. CrimeWatch stores session state on the server, so it’s necessary to simulate a complete set of wizard interactions by a fake user: accept terms and conditions with a form, proceed through the multiple steps of the interactive wizard while storing HTTP cookies and tokens along the way, and respond correctly to a series of nonstandard HTTP redirects. The process of reconstructing the steps necessary to arrive at a useful crime report image was the first serious hurdle for the project. The client-side HTTP proxy Charles (http://charlesproxy.com/) and the Mozilla plug-in LiveHTTPHeaders (http://livehttpheaders.mozdev.org/) made this process less painful than it would otherwise have been. Interpreting the intermediate HTML pages themselves is greatly simplified by the use of a page-scraping library like Leonard Richardson’s BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/). BeautifulSoup is designed to make sense of the HTML “tag soup” frequently found online, correcting for such common problems as improperly nested tags or partial markup, and it allows us to read the HTML forms and JavaScript commands that establish a complete client/server session.

It’s possible to mock up a first draft of the scraping process using simple Unix command-line tools such as shell scripts and cURL (http://curl.netmirror.org/). The key is carefully examining HTTP connections between the browser and server, looking for telltale bits of information to help you reconstruct the interaction: CGI variables in URLs and POST request bodies are the first step, showing exactly where the initial session is established upon acceptance of terms of use. Session-based applications such as CrimeWatch make heavy use of client-side state stored in cookies, so use of a cookie jar by your HTTP library is a must. CrimeWatch also relies heavily on client-side JavaScript smarts beyond simple form submissions, including the use of additional state variables, so intermediate response pages must be parsed with a tolerant HTML parser and regular expressions to search for details buried deep within page scripts. Finally, since many such older-generation web applications were built and released before cross-browser dynamic HTML became a common practice among developers, it’s often necessary to spoof the User-Agent header and pretend to be either Internet Explorer or Mozilla Firefox; other browsers are turned away with compatibility warnings and no data.
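
The session plumbing described above can be sketched with Python's standard library. This is only an illustration under stated assumptions: the User-Agent string, the regular expression, and the variable names in the sample page are hypothetical, and the real CrimeWatch URLs and form fields are not given here, so none are shown.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Assumed User-Agent value; any string the server accepts as Firefox or IE would do.
FIREFOX_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) "
              "Gecko/20061010 Firefox/2.0")

def make_browser_session(user_agent=FIREFOX_UA):
    """Build an opener that stores cookies and sends a spoofed User-Agent."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

# Hypothetical pattern: single-quoted string assignments in inline JavaScript.
STATE_VAR = re.compile(r"var\s+(\w+)\s*=\s*'([^']*)'\s*;")

def extract_script_vars(html):
    """Return {name: value} for simple string assignments found in page scripts."""
    return dict(STATE_VAR.findall(html))
```

Each wizard step would then be one `opener.open(...)` call on the same opener, so cookies set by the terms-of-use form carry through to the map request, with `extract_script_vars` pulling any extra state out of the intermediate pages.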

At the end of this process, you are left with a medium-sized image bitmap, hopefully containing recognizable crime report icons. The first pass at extracting the pixel locations of each icon was simple, but slow: for every possible location in the image, compare its pixel colors to a known icon, and report positive matches wherever the amount of difference was below a certain threshold. Since we’re dealing with predefined icons on a background relatively free of conflicting noise, this is actually a completely bulletproof method. The crime icons used in CrimeWatch are unique, and easily identifiable even when partially occluded by other icons or map features. The tool I use to perform these image checks is NumPy (http://numpy.scipy.org/), the venerable and powerful Python array-manipulation library. Figures 11-2 and 11-3 show a portion of a sample image from CrimeWatch, with programmatically recognized icons outlined.
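
A minimal NumPy sketch of the brute-force comparison might look like the following. It is not the project's actual code: the function name, the mean-absolute-difference measure, and the default threshold are assumptions chosen for illustration.

```python
import numpy as np

def find_icon(image, icon, threshold=10.0):
    """Return (row, col) top-left positions where icon matches image.

    image is an H x W x 3 RGB array and icon an h x w x 3 RGB array.
    A position matches when the mean absolute per-channel difference
    falls below threshold (an assumed tuning value).
    """
    image = image.astype(np.int32)  # avoid uint8 wraparound when subtracting
    icon = icon.astype(np.int32)
    icon_h, icon_w = icon.shape[:2]
    height, width = image.shape[:2]
    matches = []
    for row in range(height - icon_h + 1):
        for col in range(width - icon_w + 1):
            window = image[row:row + icon_h, col:col + icon_w]
            if np.abs(window - icon).mean() < threshold:
                matches.append((row, col))
    return matches
```

Raising the threshold is what would let a partially occluded icon still register, at the cost of more false positives.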

FIGURE 11-2. A sample image from CrimeWatch shows areas of theft, narcotics, robbery, vehicle theft, and other crimes. (See Color Plate 29.)

FIGURE 11-3. The same sample image from CrimeWatch with programmatically recognized icons outlined. (See Color Plate 30.)

The brute-force method is unfortunately quite slow on a typical CPU, but it’s possible to speed it up with some knowledge of the kinds of maps you’re likely to encounter. For example, many of the crime report icons have a significant characteristic color: theft is represented by a green bag of money, simple assault by a blue boxing glove, and prostitution by a pink letter “P”. A simple preprocessing step is a cheap scan of the image to find pixel locations near one of these desired colors, which drastically cuts down on the number of locations to expensively check for a full-icon match. Figure 11-4 shows just the reddish parts of Figure 11-2 in white, an indication of likely places where the aggravated assault marker, a red boxing glove, can be found.
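
The color scan can be expressed as a single vectorized comparison. In this sketch the target color and the per-channel tolerance are assumed values, not the ones used in the project.

```python
import numpy as np

def near_color_mask(image, color, tolerance=40):
    """Boolean H x W mask of pixels within tolerance of color on every channel."""
    diff = np.abs(image.astype(np.int32) - np.array(color, dtype=np.int32))
    return (diff <= tolerance).all(axis=2)

def candidate_positions(image, color, tolerance=40):
    """List the (row, col) pixels worth passing to the expensive full-icon check."""
    mask = near_color_mask(image, color, tolerance)
    return [(int(r), int(c)) for r, c in np.argwhere(mask)]
```

Rendering the mask as white-on-black would give a picture in the spirit of Figure 11-4.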

A slightly more complicated preprocessing step is a series of scans to search for proximity of characteristic icon colors. For example, burglary is represented by a small icon of a broken window rendered in black and white pixels. There is a lot of black on a typical map, and a lot of white, but only areas with icons and bits of text contain both black and white next to each other. We can find all pixels in close proximity to these two colors, and cut down the expensive search area to a limited number of candidate pixels.
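
One way to approximate this proximity test is to grow the black mask and the white mask by a few pixels and keep only where they overlap. The radius and the dark/light cutoffs below are assumptions, and the square-neighborhood dilation is a stand-in for whatever scan the project actually used.

```python
import numpy as np

def dilate(mask, radius):
    """Grow True regions of a boolean mask by radius pixels (square neighborhood)."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    height, width = mask.shape
    for d_row in range(2 * radius + 1):
        for d_col in range(2 * radius + 1):
            out |= padded[d_row:d_row + height, d_col:d_col + width]
    return out

def black_white_candidates(image, radius=2, dark=40, light=215):
    """Mask of pixels lying within radius of both a black and a white pixel."""
    black = (image <= dark).all(axis=2)
    white = (image >= light).all(axis=2)
    return dilate(black, radius) & dilate(white, radius)
```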

FIGURE 11-4. Again, the same sample image from CrimeWatch, this time with the reddish parts made white to show the red boxing glove icon (for aggravated assault) more clearly.

The geolocation step requires determining a location for each crime report based on its detected pixel position in the rendered map. For this to be possible, it was helpful that CrimeWatch always returns a predictably sized and positioned map for a given set of inputs. For example, a map of Police Service Area #3 needs to always cover an identical area, regardless of whether the actual crime reports present at the time were concentrated in one corner of the area or spread out all over. CrimeWatch serves up maps with underlying geographical features such as streets or coastlines always in the same place. For each possible geographic layout, it’s necessary to manually locate three widely spaced known reference points. Street intersections are great for this, as they can be easily picked out on CrimeWatch for their (x, y) pixel locations and compared to a simple service such as Simon Willison’s GetLatLon (http://getlatlon.com) for their (latitude, longitude) geographic locations. Six police service areas with three reference points apiece meant manually geolocating 18 known locations around Oakland. This needed to be done exactly once: all future icons found in each given Service Area could be compared to the known reference points using simple linear algebraic transforms to work out their geographic locations.

Figure 11-5 shows a map for a downtown zip code, with three geographic reference points selected. Knowing these three points, it’s possible to triangulate the location of any other point in the map.
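
Three non-collinear reference points are exactly enough to pin down an affine transform from pixel coordinates to latitude and longitude. The sketch below assumes the map covers an area small enough for a linear fit to be adequate; the function names are hypothetical.

```python
import numpy as np

def fit_affine(pixels, latlons):
    """Solve for the affine map taking three (x, y) pixels to (lat, lon) pairs.

    pixels and latlons are sequences of three pairs each; the pixel
    points must not lie on a single line.
    """
    a = np.array([[x, y, 1.0] for x, y in pixels])  # 3 x 3 system matrix
    b = np.array(latlons, dtype=float)              # 3 x 2 right-hand side
    return np.linalg.solve(a, b)                    # 3 x 2 coefficients

def pixel_to_latlon(coeffs, x, y):
    """Apply fitted coefficients to one pixel position."""
    lat, lon = np.array([x, y, 1.0]) @ coeffs
    return float(lat), float(lon)
```

With one set of coefficients fitted per map layout, every detected icon position converts with a single matrix product.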

The only thing left to do was to simulate a user click on each crime icon to collect further details about each report, such as its case number, date and time of day, and a simple textual description. The end product is a database containing 100 or so reports per day.

One challenge to be found at this step is to decide what constitutes a unique report. I was collecting reports from a moving window, which meant that each individual report would be collected more than once, while multiple separate reports could be covered by a single case number provided by the police department. We ended up using a tuple of case number and text description, which was enough to cover most inconsistencies in data collection.
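
The uniqueness rule can be illustrated with a dictionary keyed on the tuple of case number and description. The field names `case_number` and `description` are hypothetical, and a real collector would presumably persist to a database, not an in-memory dict.

```python
def merge_reports(existing, new_reports):
    """Add unseen reports to existing, keyed by (case_number, description).

    Returns the number of reports that were actually new. Reports seen
    again through the moving collection window are skipped, while distinct
    reports sharing one case number are kept apart by their descriptions.
    """
    added = 0
    for report in new_reports:
        key = (report["case_number"], report["description"])
        if key not in existing:
            existing[key] = report
            added += 1
    return added
```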

The code implementing this approach was executed in Twisted Python (http://www.twistedmatrix.com/), an event-driven networking engine that made it possible to open and maintain long-running simulated browser sessions with the CrimeWatch service. With this code library in hand, it was possible to transform a brittle process into an ongoing nightly collection run, and to eventually make the resulting data public in a form we believed more useful to Oakland residents than CrimeWatch.

Nightly collections of this data formed the basis of an initial eight months of collection and experimentation. Each evening, we’d run a web page scraper on a full combination of 13 types of crime and six Police Service Areas. Due to the one-at-a-time design of CrimeWatch, each individual report would require its own request/response loop with the server. We also added in considerable delays to each step (up to a minute or more between every individual step in the process) so as to not overload the CrimeWatch server with excessive requests. A single run would begin after midnight, and often last for six or more hours.
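
The shape of such a run, every crime type crossed with every service area and a pause after each request, can be sketched in a few lines. This is a plain sequential loop, not the Twisted-based code, and the `fetch` callable, the randomized delay, and the 60-second cap are assumptions.

```python
import itertools
import random
import time

def nightly_run(crime_types, service_areas, fetch, max_delay=60.0, sleep=time.sleep):
    """Call fetch(crime_type, area) for every combination, pausing between calls.

    fetch is a caller-supplied function (hypothetical here) that performs one
    scraping session. sleep is injectable so the loop can be exercised
    without actually waiting.
    """
    results = []
    for crime_type, area in itertools.product(crime_types, service_areas):
        results.append(fetch(crime_type, area))
        sleep(random.uniform(0.0, max_delay))  # be gentle with the server
    return results
```

Thirteen crime types and six areas give 78 combinations per night, before counting the per-report detail requests within each.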

FIGURE 11-5. A map of downtown Oakland showing three reference points for triangulation purposes. (See Color Plate 31.)

Frequently, there were errors. CrimeWatch often would lose its head completely, and cough up a map with no space, time, or type restrictions: all of Oakland, all crimes, for the past three months. We had no reliable way of detecting this case, and on frequent occasions reports in our database were geographically misplaced.

In this case, we felt that the occasional bad report was a small price to pay for an improved database browsing tool, and we continued to accumulate data over the first half of 2007, periodically releasing small experiments in visual presentation or publishing technique.
