Datasets. We evaluate the proposed methods on both real and synthetic trajectory datasets. The taxi dataset (D1) is retrieved from the Microsoft GeoLife and T-Drive projects [Yuan et al. 2010; Zheng et al. 2010] with the road network of Beijing. The trajectories are generated from GPS devices installed on 500 taxis in the city of Beijing.
The dataset is available to the public (see footnote 5). The military trajectory dataset (D2) is retrieved from the CBMANET project [Krout 2007], in which an infantry battalion of 780 units, divided into 30 teams, moves from Fort Dix to Lakehurst for a mission on two routes in 3 hours. Meanwhile, to test the algorithm’s performance on large datasets, we also generate two synthetic datasets (D3 and D4), comprising 1,000 to 10,000 objects, with more than 10 million data records.
Baselines. The proposed Smart-and-Closed algorithm (SC) and Buddy-based discovery algorithm (BU) are compared with the Clustering-and-Intersection method (CI), which is used as the framework to find convoy patterns [Jeung et al. 2008], and with two state-of-the-art algorithms: (1) the Swarm pattern (SW) [Li et al. 2010], which captures the objects moving within clusters of arbitrary shape for certain snapshots that are possibly nonconsecutive; (2) the TraClu algorithm (TC) [Lee et al. 2007], which discovers the common subtrajectories with a density-based line-segment clustering algorithm.
Environments. The experiments are conducted on a PC with Intel 6400 Dual CPU 2.13 GHz and 2.00GB RAM. The operating system is Windows 7 Enterprise. All the algorithms are implemented in Java on the Eclipse 3.3.1 platform with JDK 1.6.0. The parameter settings are listed in Figure 20.
7.2. Comparisons in Discovery Efficiency
In this section we conduct experiments to evaluate the efficiency of companion discovery algorithms in Euclidean space. Since neither SW nor TC can output results incrementally, we take the running time on the entire dataset as the measure of time cost. The size of the candidate set (number of objects) is used to measure the space cost of companion computation. The only exception is TC: since the algorithm only carries out the subtrajectory clustering task and does not store any companion candidates, TC’s space cost is not included in the experiment.
5 GeoLife GPS Trajectories Datasets. Released at: http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/default.aspx.
Fig. 21. Efficiency: (a) time, (b) space on different datasets.
Fig. 22. Efficiency: (a) time, (b) space versus δs.
We first evaluate the algorithms’ time and space costs on different datasets with default settings. Figure 21 shows the experiment results. Note that the y-axes are in logarithmic scale. BU achieves the best performance on all the datasets. On the largest dataset, D4, BU is an order of magnitude faster than CI and SW. BU’s space cost is only 20% of SW’s and less than 5% of CI’s.
Figure 22 illustrates the influence of the companion size threshold δs in the experiments. The experiment is carried out on dataset D3. Based on default settings, we evaluate the algorithms with different values of δs. Generally speaking, when the size threshold grows larger, the filtering mechanism prunes more companion candidates in each snapshot. The space costs drop significantly, and the running times also decrease because fewer intersections are needed.
We also study the influence of the duration threshold δt. Based on default settings, the experiments are conducted on dataset D3. The value of δt is changed from 3 to 15, and the algorithms’ performances are shown in Figure 23. BU, SC, and CI are all faster when δt grows larger, because many companion candidates are not consistent enough to last for a long time. When setting δt as 15 snapshots, BU can process the dataset in less than 20 seconds (Figure 23(a)). It is almost an order of magnitude faster than SC and CI. TC is not influenced by δs and δt, since it is only a clustering algorithm and does not generate any companion candidates. Besides TC, SW also does not improve its performance when δt increases. The reason is that SW utilizes the object growth strategy to prune candidates. Such heuristics only work with the size threshold δs, and cannot benefit from a larger δt.
In summary, δs and δt are two important factors that influence the efficiency of companion discovery algorithms. As either threshold increases, more companion candidates are pruned and the time and space costs are reduced. BU outperforms the other methods in the efficiency evaluations, especially in the scenarios of long-lasting streams with a large number of objects.
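The pruning effect of the two thresholds can be sketched with a simplified clustering-and-intersection loop. This is an illustrative sketch only, not the SC or BU algorithm: it assumes clusters are already computed per snapshot, does not remove nonclosed candidates, and the function and parameter names are hypothetical.

```python
def discover_companions(snapshots, size_threshold, duration_threshold):
    """Simplified clustering-and-intersection sketch (hypothetical helper).

    snapshots: list of snapshots, each a list of clusters (sets of object ids).
    Returns (companions, peak_candidates), where companions is the set of
    object groups that stay together for at least `duration_threshold`
    consecutive snapshots with at least `size_threshold` members, and
    peak_candidates is the largest candidate set held at any snapshot.
    """
    candidates = {}  # frozenset of object ids -> consecutive snapshots so far
    companions = set()
    peak_candidates = 0
    for clusters in snapshots:
        next_candidates = {}
        # Intersect every surviving candidate with every current cluster.
        for group, duration in candidates.items():
            for cluster in clusters:
                common = frozenset(group & cluster)
                if len(common) >= size_threshold:  # prune by size threshold
                    previous = next_candidates.get(common, 0)
                    next_candidates[common] = max(previous, duration + 1)
        # Every sufficiently large cluster also starts a new candidate.
        for cluster in clusters:
            if len(cluster) >= size_threshold:
                next_candidates.setdefault(frozenset(cluster), 1)
        # Report candidates that have lasted long enough.
        for group, duration in next_candidates.items():
            if duration >= duration_threshold:
                companions.add(group)
        candidates = next_candidates
        peak_candidates = max(peak_candidates, len(candidates))
    return companions, peak_candidates


# Toy stream of three snapshots (made-up data).
toy_snapshots = [
    [{1, 2, 3, 4}, {5, 6}],
    [{1, 2, 3}, {4, 5, 6}],
    [{1, 2, 3, 5}, {4, 6}],
]
found, peak = discover_companions(toy_snapshots, size_threshold=3,
                                  duration_threshold=3)
```

On this toy input, raising the size threshold from 2 to 3 lowers the peak candidate count while the reported companion stays the same, which mirrors the trend described above in a very small setting.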
7.3. Efficiency Analysis for Buddy-Based Discovery
Why is the buddy-based discovery algorithm more efficient? In this section we carry out the experimental analysis to reveal the advantages of the buddy-based discovery method.
In the beginning, we tune the parameters of BU to study the factors that influence its efficiency. With δs and δt set to their default values, we test BU with different buddy radius thresholds δγ from ε/10 to ε/2, and record the average buddy size |b|, the buddy number, and the algorithm’s running time. Their relationships are shown in Figure 24. Figure 24(a) shows that the total buddy number is inversely proportional to the average buddy size |b|. In addition, the number of unchanged buddies decreases rapidly as |b| grows larger. However, as shown in Figure 24(b), the running time of both buddy-based clustering (B-Cluster) and BU decreases with larger |b|. This phenomenon can be explained by Proposition 2, which states that the cost of the buddy maintenance algorithm is O(n + m^2), where n is the number of objects and m is the number of buddies. If n is fixed, then m is inversely proportional to |b|. Hence BU costs less time if |b| is larger. Based on the efficiency analysis, we recommend setting the buddy radius to a relatively large value (such as ε/2). Figure 24(b) also records the time cost of the DBSCAN clustering algorithm as a reference. Even if less than 20% of the buddies stay unchanged (which is rare for real-world objects), as long as the average size of the buddies is larger than 3, the buddy-based clustering algorithm can still outperform DBSCAN. The experiment results show that BU is especially suitable for processing a trajectory stream with dense object clusters.
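The dependence of the maintenance cost on |b| can be written out explicitly. The derivation below assumes, for simplicity, that every buddy has exactly |b| members; the object count n = 10,000 is an illustrative value, not a measurement from the experiments.

```latex
\[
m \approx \frac{n}{|b|}
\quad\Longrightarrow\quad
O\!\left(n + m^{2}\right) = O\!\left(n + \frac{n^{2}}{|b|^{2}}\right)
\]
\[
n = 10{,}000:\qquad
|b| = 2 \;\Rightarrow\; m^{2} = 5{,}000^{2} = 2.5 \times 10^{7},
\qquad
|b| = 5 \;\Rightarrow\; m^{2} = 2{,}000^{2} = 4 \times 10^{6}
\]
```

Under this assumption, growing the average buddy size from 2 to 5 shrinks the quadratic term by a factor of 6.25, which is consistent with the decreasing running time reported for larger |b|.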
BU has three steps, namely the maintenance step (M-step, Algorithm 3), the clustering step (C-step, Algorithm 4), and the intersection step (I-step, Algorithm 5). To study the time cost of each step, we run BU on the four datasets and record the time cost of each step, as well as its proportion of the total running time, as shown in Figure 25. The results show that the clustering step is actually the most efficient of the three, costing less than 5% of the total running time, whereas DBSCAN clustering usually takes 40–50% of the total running time of SC. BU spends an extra 10%–15% of its time maintaining the buddies in order to save more time in the clustering task.
From the aforesaid experiments, one can clearly see the two key advantages of BU: (1) utilizing the buddy information to filter out most objects without accessing their details; (2) employing the buddy index to reduce the size of the candidate set, and so decrease the intersection times of companion discovery.
Fig. 24. Efficiency analysis: (a) buddy number; (b) time versus buddy size.
Fig. 25. Efficiency analysis: (a) running time; (b) percentage of BU steps on different datasets.
7.4. Evaluations on Algorithm’s Effectiveness
The third part of the experiment is to evaluate the quality of the retrieved companions.
In dataset D2, an infantry battalion of 780 units moves from Fort Dix to Lakehurst for a mission on two routes in 3 hours. The objects are organized in 30 teams, with each team having 25 to 30 units. The information of team partitioning is retrieved as the ground truth. The algorithm’s outputs are matched to the ground truth and the measures of precision and recall are calculated as follows.
Precision. The proportion of true companions over all the retrieved results of the algorithm is the precision. It represents the algorithm’s selectivity in finding meaningful companions.
Recall. The proportion of detected true companions over the ground truth is the recall. This criterion shows the algorithm’s sensitivity in detecting traveling companions.
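The two measures can be computed as in the following minimal sketch. It assumes that a retrieved companion counts as true only when its member set exactly matches a ground-truth team; the text does not spell out the matching rule, so this criterion and the function name are assumptions made for illustration.

```python
def precision_recall(retrieved, ground_truth):
    """Return (precision, recall) for retrieved companions.

    retrieved, ground_truth: iterables of collections of object ids.
    Assumption: a retrieved group is a true companion only if it is
    identical to one of the ground-truth teams.
    """
    retrieved_set = {frozenset(group) for group in retrieved}
    truth_set = {frozenset(group) for group in ground_truth}
    true_companions = retrieved_set & truth_set
    precision = len(true_companions) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(true_companions) / len(truth_set) if truth_set else 0.0
    return precision, recall


# Made-up example: three retrieved groups, two of which are real teams.
precision, recall = precision_recall(
    retrieved=[{1, 2, 3}, {4, 5}, {6, 7}],
    ground_truth=[{1, 2, 3}, {4, 5}],
)
```

In this toy case every real team is found, so recall is 1.0, while the one spurious group lowers precision to two thirds.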
We conduct experiments with different values of the size threshold δs. The results of the effectiveness evaluation are shown in Figure 26. BU and SC have the same precision and recall since they output identical companions. They have about 20% precision improvement over SW, and nearly 40% precision improvement over CI. SW generates the swarm patterns of frequently meeting objects, which is actually a superset of the companions. The swarm pattern is highly sensitive and finds all the companions (i.e., 100% recall), but SW also generates more false positives that bring down the algorithm’s selectivity. CI has the same problem with even lower precision. Since there are many redundant and nonclosed companions in the results, more than half of CI’s results are not useful.
Fig. 27. Effectiveness: (a) precision, (b) recall versus δt.
Again, TC is not affected by the parameters δs and δt. TC takes the movement direction as an important measure to compute subtrajectory clusters; its results reflect the major directions of the object movements. However, such clusters may not capture the information of companions, because the companion members’ moving directions might be different. As an illustration, consider Figure 1 again. From snapshot s2 to s3, the moving directions of o8 and o9 are different, hence they may be put in different subtrajectory clusters.
Another interesting observation is that, in Figure 26, the precisions of BU, SC, CI, and SW all increase when δs becomes larger, since fewer companions can pass a higher size threshold. However, if δs is set too high (more than 25), several true companions will also be filtered out and the algorithm cannot achieve 100% recall.
In the next experiment, we study the influence of the time threshold δt. Figure 27 shows the precision and recall of the five algorithms with different δt on D2. BU and SC achieve better performance than SW and CI. As δt increases, the algorithms’ precision increases while they still keep a high recall. Since all the true companions last for a long period in D2, if we set δt greater than 11, both BU and SC can achieve 100% precision and recall. However, if δt is set too high, for example, 15, no companion can be discovered since there exist no object groups moving together for such a long time.
In general, BU and SC can guarantee 100% recall (i.e., not missing any real companion). We suggest that in real applications, the user should set a relatively high time threshold to filter out false positives, but a moderate size threshold to guarantee the algorithm’s sensitivity.
7.5. Experiments on Road Companion Discovery
To test the efficiency of road companion discovery, we perform the evaluation on dataset D1 with the road network of Beijing, which has 106,579 road nodes and 141,380 road segments. The default size threshold δs is set as 8 and the time threshold δt is set as 11. In this experiment, we compare the performance of four methods: (1) the Clustering-and-Intersection framework with road network distance computation (CI); (2) the Smart-and-Closed algorithm with road network distance computation (SC); (3) the Smart-and-Closed algorithm with Filtering-and-Refinement strategy (FR); and (4) the Road-Buddy-based method (RB).
Fig. 28. Efficiency: (a) time; (b) I/O of road companion discovery versus δs.
Fig. 29. Efficiency: (a) time; (b) I/O of road companion discovery versus δt.
We first evaluate the time and space costs of road companion discovery. The number of accessed road nodes is used as the measure of I/O cost. Based on default settings, we evaluate the algorithms with different values of δs. Figure 28 shows the running time and the number of accessed nodes. Generally speaking, when the size threshold grows larger, both running time and I/O costs decrease. The computation cost of road companion discovery is much larger than that of traveling companion discovery in Euclidean space. This is mainly caused by the high I/O overhead in road network distance computation. Since the road network distance computation becomes the major cost, SC cannot save much time compared to CI. However, FR and RB are an order of magnitude faster than SC and CI, because they utilize the filtering-and-refinement strategy to avoid most unnecessary road network distance computations. RB performs better still, since it groups the objects in small buddies and limits the distance computation to a small region with lower I/O overhead.
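The idea behind filtering and refinement can be sketched as follows. The sketch assumes that the filter is the Euclidean distance, which never exceeds the road network distance when every edge is at least as long as the straight line between its endpoints; this choice of filter, the placement of objects on graph nodes, and all names are assumptions for illustration and are not taken from the FR or RB algorithms themselves.

```python
import heapq
import math


def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def undirected(edges):
    """Build an adjacency list from (u, v, length) triples."""
    graph = {}
    for u, v, length in edges:
        graph.setdefault(u, []).append((v, length))
        graph.setdefault(v, []).append((u, length))
    return graph


def network_distance(graph, source, target):
    """Shortest-path length by Dijkstra's algorithm (the expensive step)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, length in graph.get(node, []):
            candidate = dist + length
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return math.inf


def close_pairs(coords, graph, eps):
    """Return (pairs within eps by network distance, shortest-path calls made)."""
    nodes = sorted(coords)
    pairs, refinements = [], 0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            # Filter: the Euclidean distance is a lower bound, so a pair
            # farther apart than eps in a straight line cannot qualify.
            if euclidean(coords[a], coords[b]) > eps:
                continue
            # Refinement: only surviving pairs pay for a shortest-path search.
            refinements += 1
            if network_distance(graph, a, b) <= eps:
                pairs.append((a, b))
    return pairs, refinements


# Toy road network (made-up coordinates; edge lengths equal straight lines).
toy_coords = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (5, 5)}
toy_graph = undirected([("a", "b", 1.0), ("b", "c", 1.0),
                        ("c", "d", euclidean((1, 1), (5, 5)))])
toy_pairs, toy_calls = close_pairs(toy_coords, toy_graph, eps=1.5)
```

In the toy network, only three of the six object pairs reach the shortest-path computation; the other three are discarded by the cheap filter.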
The influence of the duration threshold δt is also studied in our experiment. Based on default settings, the value of δt is changed from 3 to 15; the algorithms’ performances are shown in Figure 29. All the algorithms run faster when δt grows larger, because fewer road companion candidates can last for a long time. Again, RB and FR take only 20%–50% of the time of CI and SC.
The experiment results show that the main bottleneck of road companion discovery is at the distance computation stage. The traditional companion discovery methods, BU and SC, do not work well on the road networks. The new frameworks of RB and FR reduce the time cost of unnecessary shortest path computation, and therefore achieve higher efficiency.
Fig. 31. Effectiveness: (a) precision; (b) recall versus δl.
7.6. Evaluations on Loose Companion Discovery
In the previous experiments, we set the leaving threshold δl as 0. In this section, we conduct experiments on loose companion discovery. We run the algorithms BU, SC, and CI on dataset D3, tuning δl from 0 to 6 snapshots.
Figure 30 shows the algorithms’ time and space costs. With larger δl, all the algorithms’ space costs increase rapidly, since they cannot prune the candidates when several objects temporarily leave the companion; hence the system has to spend more time making intersections with a larger candidate set. However, even with a large δl, BU can still discover the loose companions in about 20 seconds.
Finally, we carry out the effectiveness experiment on the military dataset D2. δl is changed from 0 to 6 snapshots, and the other parameters are set to their default values. As shown in Figure 31(a), the precision of companion discovery decreases with larger δl, since more companions are generated and inevitably the number of false positives increases. However, the good news is that the recall increases as δl grows (Figure 31(b)).
The experiment results show the necessity of loose companion discovery. With a relaxed time constraint, BU and SC can discover more meaningful companions and achieve a higher recall. This makes the system more practical in real applications.