Clustering By Time Series Data Similarities

In this work, our main goal is to predict the one-day speed pattern for a given query date and target road segment. With this pattern, the prediction model can report an estimated driving speed for any queried time point during the day.

By observing the historical speed-time records, we find that the traffic behavior of one road shows similar patterns at the same time on different days. We further find that there exist a few kinds of one-day patterns across different day types. For example, the peak and off-peak time intervals fall at very similar times on some of the weekdays. On weekends, however, the traffic behavior is very likely to exhibit other patterns. In fact, weekdays do not always follow a single pattern, and moreover, Saturdays and Sundays usually exhibit different patterns on some roads. Clearly, categorizing traffic patterns only by weekday and weekend is not specific enough. Based on this observation, we believe that some basic day-types should have their own traffic behavior patterns. For example, in our observation of a chosen highway segment in Taiwan (Chubei-to-Hsinchu), shown in Figure 5.1, the Sunday time-speed records in May 2011 are very similar: traffic flows smoothly, at close to the speed limit for the whole 24 hours. Hence, if the path searching system needs the traffic condition of the Chubei-Hsinchu segment on a Sunday, the prediction model can report that the speed should be close to the speed limit at every query time throughout the day.

With the above observations, we arrive at the concept of finding the day-types of a target road segment, each with its own speed pattern. We know that it is not specific enough to define only two day-types, weekdays and weekends, and in fact, if we gather all the Monday records of one road, they do not all follow the same pattern. Hence we need to cluster the daily speed-time series data and identify rules that map different day-types to different clusters. However, the detailed shape of the peak-time pattern may vary further within a cluster. Take the Saturday time-speed records of the Chubei-to-Hsinchu highway segment in May as an example (Figure 5.1). Most of the Saturday traffic conditions tend to be congested (driving at a low speed) around noon; however, the length of the peak time and the amplitude of the speed drop vary. This is our motivation for proposing the second-phase clustering, which aims to find more detailed types of peak-time traffic patterns.

Since we take clustering as our mining method, we must choose a clustering algorithm that leads our model to the best performance and accuracy. In this work, we implemented K-means, hierarchical clustering, and DBSCAN, which are three typical kinds of clustering algorithms.

Figure 5.1: Real data example on the Chubei-to-Hsinchu highway segment recorded in May 2011

K-means is the most basic partitional clustering approach. At the initial step of K-means, the number of cluster centroids must be specified; that is, the value of K must be assigned. We select the initial centroids at random. Afterwards, the K-means algorithm iteratively assigns each point to the cluster with the closest centroid and recomputes the centroid of each cluster.
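The assign-and-recompute loop can be sketched as follows. This is a minimal illustration, not the implementation used in this work. It assumes a precomputed pairwise distance matrix, and it takes each centroid to be the member sequence with the smallest summed distance to the other members of its cluster, since time series are compared here only through a distance function. The names `dtw`, `kmeans_medoid`, and the optional `init` argument are hypothetical, and the toy speed values are invented for illustration only.

```python
import random

def dtw(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def kmeans_medoid(dist, k, init=None, seed=0, max_iter=100):
    """K-means-style loop over a distance matrix; centroids are member points.

    `init` (hypothetical) fixes the initial centroid indices; by default
    they are drawn at random, as described in the text.
    """
    n = len(dist)
    if init is not None:
        centers = list(init)
    else:
        centers = random.Random(seed).sample(range(n), k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [min(range(k), key=lambda c: dist[i][centers[c]])
                  for i in range(n)]
        # Update step: the new centroid is the member with the minimal
        # aggregated distance to all members of the same cluster.
        new_centers = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old centroid if a cluster empties
                new_centers.append(centers[c])
                continue
            new_centers.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        # Terminate when no centroid changed since the previous iteration.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy daily speed sequences (made-up values): two free-flow, two with a dip.
series = [
    [100, 100, 100, 100],
    [98, 99, 100, 100],
    [100, 60, 40, 100],
    [100, 55, 45, 95],
]
dist = [[dtw(a, b) for b in series] for a in series]
labels, centers = kmeans_medoid(dist, k=2, init=[0, 2])
print(labels, centers)
```

Because the initial centroids are random by default, different seeds can converge to different partitions; running several restarts and keeping the best result is a common safeguard.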

K-means terminates when the centroid of each cluster remains the same as in the previous iteration. Since we treat every daily time series sequence as one point in our clustering process, the centroid of each cluster is the data point with the minimal aggregated distance to all the other points in the same cluster. To measure the distance between points, we use the time series distance functions mentioned in the last subsection: DTW, ERP, LCSS, and EDR.

The hierarchical clustering algorithm we implement is the agglomerative method. The basic agglomerative hierarchical clustering method starts with the points as individual clusters and merges the closest pair of clusters at each step until only one cluster is left. There are two parameters that we need to determine when applying the agglomerative clustering method to our model:

• plev: The partition level of the agglomerative hierarchical tree. Cutting the tree at level plev splits it into sub-trees, which are output as the clusters. The plev of the root of the tree is set to 0.

• linkType: The definition of proximity between clusters, i.e., how to measure the distance between clusters. Here we implement three basic proximity types for our agglomerative hierarchical clustering: MIN, MAX, and Group average.

The three proximity definitions above are the simplest among existing methods. MIN defines cluster proximity as the proximity between the closest two points in different clusters. Conversely, MAX defines it as the proximity between the farthest two points in different clusters. Group average defines cluster proximity as the average of the pairwise proximities over all pairs of points from different clusters.
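The three proximity types and the cut level can be illustrated with a short sketch over a precomputed distance matrix. It is only an illustration under stated assumptions: the function names are hypothetical, "AVG" stands for Group average, and cutting at plev is assumed to mean undoing the last plev merges, so that plev = 0 yields the single root cluster and plev = p yields p + 1 clusters. The exact cut semantics used in this work may differ. The distance values are made up.

```python
def cluster_distance(dist, A, B, link_type):
    """Proximity between clusters A and B (lists of point indices)."""
    pairwise = [dist[i][j] for i in A for j in B]
    if link_type == "MIN":   # closest pair of points across the two clusters
        return min(pairwise)
    if link_type == "MAX":   # farthest pair of points across the two clusters
        return max(pairwise)
    if link_type == "AVG":   # group average over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown link_type: " + str(link_type))

def agglomerative(dist, plev, link_type="AVG"):
    """Merge the closest pair of clusters until plev + 1 clusters remain.

    Assumption: cutting the tree at level plev is equivalent to stopping
    the merging early, with the root (one cluster) at plev = 0.
    """
    clusters = [[i] for i in range(len(dist))]
    target = max(1, min(plev + 1, len(clusters)))
    while len(clusters) > target:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(dist, clusters[x], clusters[y], link_type)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Made-up symmetric distances between four daily sequences.
toy_dist = [
    [0, 3, 100, 105],
    [3, 0, 99, 106],
    [100, 99, 0, 15],
    [105, 106, 15, 0],
]
for link in ("MIN", "MAX", "AVG"):
    print(link, agglomerative(toy_dist, plev=1, link_type=link))
```

On well-separated data such as this toy matrix all three proximity types agree; they tend to differ when clusters are elongated or contain outliers, which is why the choice is exposed as a parameter.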

The third clustering method in our model is DBSCAN, the most common density-based clustering algorithm. DBSCAN requires two parameters:

• Eps: The maximum radius of the neighborhood

• MinPts: The minimum number of points required in the Eps-neighborhood of a point

In our experiments, MinPts is set to 2. After the distances between all the input elements, that is, all the daily speed-time series sequences, are measured, we set Eps to a value between the minimum and the maximum value returned by dist(R, S). Again, dist is one of the four time series distance functions (DTW, ERP, LCSS, and EDR) used to measure the distance between different time series sequences.
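A minimal sketch of DBSCAN over a precomputed distance matrix, with MinPts = 2 and Eps chosen between the smallest and largest pairwise distance, might look as follows. The helper `eps_between` and its `frac` parameter are assumptions made for illustration; the text only states that Eps lies between the two extremes, not how the value is picked. The distance values are made up.

```python
def eps_between(dist, frac):
    """Pick Eps at a fraction `frac` (0..1) of the way from the minimum
    to the maximum off-diagonal distance. `frac` is a hypothetical knob."""
    n = len(dist)
    values = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(values), max(values)
    return lo + frac * (hi - lo)

def dbscan(dist, eps, min_pts=2):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE = -1
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # The Eps-neighborhood includes the point itself.
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
        cluster += 1
    return labels

# Made-up symmetric distances between four daily sequences.
toy_dist = [
    [0, 3, 100, 105],
    [3, 0, 99, 106],
    [100, 99, 0, 15],
    [105, 106, 15, 0],
]
eps = eps_between(toy_dist, frac=0.2)
print(dbscan(toy_dist, eps, min_pts=2))
```

With MinPts = 2, any two sequences within Eps of each other already form a cluster, so the result is governed almost entirely by where Eps falls in the range of observed distances.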
