
Our second method explicitly assumes at least two anchor points (for example, a home and a workplace) and treats each as the centroid of its own local cluster of crimes. This method requires determining an appropriate number of clusters, which we derive from the locations of the previous crimes.

Algorithm

The basis of this algorithm is a hierarchical clustering scheme [Jain, Murty, and Flynn 1999]. Once clusters are found, the previous algorithm is applied at each cluster centroid.

Finding Clusters in Crime Sequences

We force a minimum of 2 clusters and a maximum of 4. The clustering proceeds in three steps.

1. Compute the distances between all crime locations, using the Euclidean distance.

2. Organize the distances into a hierarchical cluster tree, represented by a dendrogram. The cluster tree of data points $P_1, \ldots, P_N$ is built up by first assuming that each data point is its own cluster. The dendrogram for Offender B is shown in Figure 7.

Figure 7. Dendrogram for Offender B.

3. Merge the two clusters that are the closest (in distance between their centroids), and continue such merging until the desired number of clusters is reached. These cluster merges are plotted as the horizontal lines in the dendrogram, and their height is based on the distance between merged clusters at the time of merging.
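The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `centroid` and `agglomerate` are hypothetical, the points are assumed to be 2-D coordinate pairs, and the brute-force search for the closest pair is chosen for readability, not speed.

```python
import math

def centroid(cluster):
    """Mean position of a list of (x, y) points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def agglomerate(points, n_clusters):
    """Merge the two clusters with the closest centroids until n_clusters remain.

    Returns the clusters and the centroid distance at each merge (the
    heights that a dendrogram would plot).
    """
    clusters = [[tuple(p)] for p in points]  # each point starts as its own cluster
    heights = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i is unaffected
    return clusters, heights

# Made-up coordinates, for illustration only.
example_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
example_clusters, example_heights = agglomerate(example_points, 2)
```

Recording the merge distances alongside the clusters keeps the information needed to draw a dendrogram such as the one in Figure 7.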

To determine the optimal number of clusters, we use the notion of silhouettes [Rousseeuw 1987]. We denote by $a(P_i)$ the average distance from $P_i$ to all other points in its cluster and by $b(P_i, k)$ the average distance from $P_i$ to the points in a different cluster $C_k$. Then the silhouette of $P_i$ is

$$s(P_i) = \frac{\min_k b(P_i, k) - a(P_i)}{\max\{a(P_i),\ \min_k b(P_i, k)\}}.$$

The closer $s(P_i)$ is to $1$, the better $P_i$ fits into its current cluster; and the closer $s(P_i)$ is to $-1$, the worse it fits within its current cluster.
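As a small worked illustration, the standard silhouette of Rousseeuw, (b - a) / max(a, b) with b taken as the smallest average distance to any other cluster, can be computed as follows. The function name `silhouette` and its argument layout are assumptions made for this sketch; single-point clusters are rejected because the text excludes them from the average.

```python
import math

def silhouette(index, own_cluster, other_clusters):
    """Silhouette of own_cluster[index], given the remaining clusters.

    a: average distance to the other points of its own cluster.
    b: smallest average distance to the points of any other cluster.
    """
    point = own_cluster[index]
    mates = [q for k, q in enumerate(own_cluster) if k != index]
    if not mates:
        raise ValueError("silhouette is not defined here for a single-point cluster")
    a = sum(math.dist(point, q) for q in mates) / len(mates)
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

# Made-up points: a two-point cluster and a distant single point.
own = [(0.0, 0.0), (1.0, 0.0)]
others = [[(4.0, 0.0)]]
s0 = silhouette(0, own, others)  # a = 1, b = 4, so s = 3/4
```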

To optimize the number of clusters, we compute the clusterings for 2, 3, and 4 clusters. Then for each number of clusters, we compute the average silhouette value across every point that is not the only point in a cluster.

(We ignore silhouette values at single-point clusters because otherwise such clusters influence the average in an undesirable way.) We then find the maximum of the three average silhouette values. For Offender B, we found average silhouette values of 0.52, 0.50, and 0.69 for 2, 3, and 4 clusters, respectively, so in this case we select four clusters. The cluster groupings computed by the algorithm are shown in Figure 8. Because the average silhouette value tends to increase with the number of clusters, we cap the possible number of clusters at 4.
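The selection rule can be sketched as below. The helper names `mean_silhouette` and `best_cluster_count` are hypothetical, and the sketch assumes that per-point silhouette values and cluster labels have already been computed (entries for single-point clusters may hold any placeholder, since they are skipped). The Offender B averages are the values reported in the text.

```python
from collections import Counter

def mean_silhouette(values, labels):
    """Average silhouette over points whose cluster has more than one member."""
    sizes = Counter(labels)
    kept = [s for s, lab in zip(values, labels) if sizes[lab] > 1]
    return sum(kept) / len(kept)

def best_cluster_count(mean_by_k):
    """Number of clusters with the largest average silhouette value."""
    return max(mean_by_k, key=mean_by_k.get)

# Average silhouette values reported for Offender B with 2, 3, and 4 clusters.
offender_b = {2: 0.52, 3: 0.50, 4: 0.69}
chosen = best_cluster_count(offender_b)  # 4
```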

Cluster Loop Algorithm

We compute the likelihood surface for the centroid of each cluster.

If a cluster contains a single point, we do not assume that this cluster represents an anchor point; instead, we treat this point as an outlier. We use a Gaussian distribution centered at the point as the likelihood surface, with mean equal to the expected value of the gamma distribution that is placed over the anchor point of every cluster that has more than one point.

Combining Cluster Predictions: Temporal and Size Weighting

Using the separate likelihood surfaces computed for each cluster, we create our final surface as a normalized linear combination of the individual surfaces, using weights for the number of points in the cluster (to weight more-common locations) and for the average temporal index of the events in the cluster (to weight more-recent clusters).

Figure 8. Offender B crime points sorted into 4 clusters; the clusters are colored differently and separated by virtual vertical lines into 3, 6, 3, and 3 locations.
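The weighted combination can be sketched as follows. The text does not state how the size weight and the temporal weight are combined, so this sketch assumes, purely for illustration, that each cluster's raw weight is the product of its size and its average temporal index, normalized so that the weights sum to 1. The function name `combine_surfaces` and the representation of a surface as a list of rows are likewise assumptions.

```python
def combine_surfaces(surfaces, sizes, mean_time_indices):
    """Normalized linear combination of per-cluster likelihood surfaces.

    surfaces: one grid (list of rows) per cluster, all with the same shape.
    sizes: number of crimes in each cluster.
    mean_time_indices: average chronological index of the crimes in each cluster.
    """
    # ASSUMPTION: raw weight = size * average temporal index (not from the source).
    raw = [n * t for n, t in zip(sizes, mean_time_indices)]
    total = sum(raw)
    weights = [w / total for w in raw]  # weights now sum to 1
    rows, cols = len(surfaces[0]), len(surfaces[0][0])
    combined = [
        [sum(w * s[r][c] for w, s in zip(weights, surfaces)) for c in range(cols)]
        for r in range(rows)
    ]
    return combined, weights

# Made-up 1-by-2 surfaces for two clusters of 3 and 1 crimes.
demo_surface, demo_weights = combine_surfaces(
    [[[1.0, 0.0]], [[0.0, 1.0]]], sizes=[3, 1], mean_time_indices=[2.0, 2.0]
)
```

If each input surface sums to 1 over the grid, the combined surface does as well, because the weights are normalized.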

Results and Analysis

The three test datasets conveniently display the cluster method’s superior adaptability.

Offender C: The highly localized nature of the data points in Figure 9 means there is little difference between the centroid-method results and the cluster-method results. The only difference is that the cluster method identifies the point directly below the centroid as a cluster of a single point (an outlier) and therefore excludes it from the computation of the larger cluster’s centroid (a slight Gaussian contribution from this point’s “own” cluster can be seen in the surface plot). This has the effect of slightly reducing the variance and therefore narrowing the fit function; consequently, the standard effectiveness multiplier rises slightly, from about 12 to almost 16.

Offender B: By contrast, Figure 10 shows the cluster method operating at the other edge of its range, as the silhouette-optimization routine produces four clusters. It might appear that the centroid method outperforms the cluster algorithm for this dataset; after all, the actual crime point no longer appears in the band of maximum likelihood. This is true and intentional, since the model weights the largest cluster most strongly and the “freshest” cluster next-most. Nevertheless, while not accurately predicting that the offender returns to an earlier activity zone, the cluster method still outperforms the centroid method with κs ≈ 23. This is because the craters generated by the cluster method are sharper and taller for this dataset, so fewer resources are “wasted” at high-likelihood areas where no crime is committed.

a. Heat map. b. Surface plot.

Figure 9. Offender C predictions of location of final crime, from cluster model of previous crimes.

a. Heat map. b. Surface plot.

Figure 10. Offender B predictions of location of final crime, from cluster model of previous crimes.

a. Heat map. b. Surface plot.

Figure 11. Offender A predictions of location of final crime, from cluster model of previous crimes.

Offender A: Unsurprisingly, the cluster model fares no better than the centroid method (Figure 11). Since the outlier points are excluded from the centroid calculation for the larger cluster, the model bets even more aggressively on this cluster, with a resulting κs ≈ 0.
