To evaluate the accuracy and efficiency of the proposed methods, we conducted extensive experiments on real data under a number of configurations. The experiments are intended to show that the proposed methods are practical and accurate for real data. We also compare our proposed methods with one existing method, ECTS, proposed in [23].
We used a real traffic data set in our experiments: the traffic speed log of a freeway segment in Taiwan. The segment runs from Hsinchu to Jubei and has the highest throughput in Taiwan. Traffic bursts occur at commuting times on weekdays, so this data set contains the patterns we expected. We spent more than three months obtaining the data from government sensors.
However, longer training data is not always better: training data that is too long contains data which is out of date. In our experiments, we used two weeks of traffic status as the training data for the clustering-based methods, while for regression we used a training data length equal to the prediction length.
In Fig. 5.1 and Fig. 5.2, we compare the prediction methods for short-term and long-term prediction. In these two experiments, MCF and MPST use the best parameters found in the experiments shown in Fig. 5.5, 5.6, 5.7 and 5.8. Short-term and long-term prediction differ in the prediction length: a prediction is short-term if its prediction length is less than 1 hour. This breakpoint was determined from our data set.
Figure 5.1: Prediction results for short-term prediction
Figure 5.2: Prediction results for long-term prediction
Figure 5.3: Prediction results for 6 hours continuous prediction
Figure 5.4: Comparison of the average trace length of MPST and ECTS (x-axis: prediction length in hours; y-axis: average trace length for one value)
The sampling interval of our data set is 1 minute. In the short-term experiments, each prediction is a continuous prediction of the 15 minutes of values that follow the prediction length. For example, a prediction length of 0.5 hour means predicting the values from tc + 31 to tc + 45. As the results in Fig. 5.1 show, regression is, as we mentioned, the best method for short-term prediction. In the long-term experiments, each prediction is a continuous prediction of 1 hour of values. Although MPST is not always the most accurate prediction method, it is better than regression, MCF and ECTS on most occasions.
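The window arithmetic behind the short-term example (a prediction length of 0.5 hour covering tc + 31 to tc + 45) can be sketched as follows. This is only an illustration, assuming offsets from tc are counted in minutes, in line with the 1-minute sampling of the data set. The function name `prediction_window` and its parameters are hypothetical, and the 2-hour prediction length in the second call is an illustrative value rather than one taken from the experiments.

```python
def prediction_window(prediction_length_hours, horizon_minutes):
    """Return the (first, last) offsets in minutes from tc covered by one prediction.

    Hypothetical helper: the predicted window starts one sample after the
    prediction length and spans `horizon_minutes` samples.
    """
    lead = round(prediction_length_hours * 60)  # prediction length in minutes
    first = lead + 1
    last = lead + horizon_minutes
    return first, last


# Short-term experiment: a 15-minute window after a 0.5-hour prediction length.
print(prediction_window(0.5, 15))  # (31, 45)

# Long-term experiment: a 1-hour window (2 hours is an illustrative prediction length).
print(prediction_window(2, 60))  # (121, 180)
```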
Although the accuracy of MPST is close to that of ECTS, MPST is better than ECTS in terms of tracing length. The tracing length is how far back in the time series we must trace to predict a value. In Fig. 5.4, we compare the average tracing lengths of MPST and ECTS.
In this experiment, ECTS needs a longer tracing length, and the tracing length it needs is not fixed. MPST therefore uses less data than ECTS while achieving accuracy close to that of ECTS.
To summarize the above results, we ran a 6-hour continuous prediction experiment, the results of which are shown in Fig. 5.3. The parameters of the prediction methods are the same as in the above experiments. Hybrid prediction is the most accurate prediction method in this experiment. The threshold of hybrid prediction is 1 hour; we therefore use regression when the prediction length is less than 1 hour and MPST when it exceeds 1 hour.
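The switching rule of hybrid prediction can be sketched as a simple dispatch on the prediction length. This is a minimal sketch under stated assumptions, not the actual implementation: `hybrid_predict`, `regression_stub` and `mpst_stub` are hypothetical names, the two stubs are trivial stand-ins for the real regression and MPST predictors, and a prediction length of exactly 1 hour is assumed to go to MPST, since a prediction counts as short-term only when it is less than 1 hour.

```python
from typing import Callable, Sequence

# A predictor takes the observed history and a prediction length in hours.
Predictor = Callable[[Sequence[float], float], float]


def hybrid_predict(history: Sequence[float],
                   prediction_length_hours: float,
                   regression: Predictor,
                   mpst: Predictor,
                   threshold_hours: float = 1.0) -> float:
    """Use regression below the threshold and MPST otherwise (assumed boundary rule)."""
    if prediction_length_hours < threshold_hours:
        return regression(history, prediction_length_hours)
    return mpst(history, prediction_length_hours)


# Stand-in predictors, for illustration only.
def regression_stub(history, prediction_length_hours):
    return history[-1]  # pretend the short-term model returns the last value


def mpst_stub(history, prediction_length_hours):
    return sum(history) / len(history)  # pretend the long-term model returns the mean


speeds = [60.0, 50.0, 40.0]
print(hybrid_predict(speeds, 0.5, regression_stub, mpst_stub))  # 40.0 (regression branch)
print(hybrid_predict(speeds, 2.0, regression_stub, mpst_stub))  # 50.0 (MPST branch)
```

Keeping the threshold as a parameter makes it straightforward to repeat the comparison with other threshold values.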
We also carried out experiments on our clustering-based methods. In Fig. 5.5, 5.6, 5.7 and 5.8, we use different parameters for the clustering-based prediction methods MCF and MPST to investigate the effect of the parameters.
Figure 5.5: Different periods for MCF (MSE for different time series periods)
Figure 5.6: Different periods for MPST (MSE for different time series periods)
Figure 5.7: Different tree heights for MPST
Figure 5.8: Different pattern similarity threshold for MPST (MSE for different similarity thresholds)
MCF has only two parameters: k and the time series period. The parameter k belongs to the clustering algorithm k-means. The value of k affects the clustering result of both MCF and MPST; therefore we discuss this issue together with MPST. The experiment result for the time series period of MCF is shown in Fig. 5.5. As can be seen, the best time series period for MCF on our data set is 4 hours.
In Fig. 5.6, we used different time series periods for MPST. Because MPST predicts future values from several recent patterns, its period is shorter than that of MCF. If the period is 30 minutes and the MPST height is 3, MPST uses the most recent 1.5 hours of data to predict the next 30 minutes. In this experiment, we found that 10 minutes is the best time series period for MPST on our data set. The height of MPST is another important parameter.
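The relation between period, height and the amount of recent data used by MPST in the example above can be written out as a short calculation. This is a sketch that assumes, following that example, that MPST looks back over `height` consecutive periods and predicts one period ahead; the function name `mpst_spans` is hypothetical.

```python
def mpst_spans(period_minutes, height):
    """Return (lookback_minutes, predicted_minutes) for a given period and tree height.

    Assumes the lookback is `height` consecutive periods and the prediction
    covers one period, as in the 30-minute, height-3 example.
    """
    lookback_minutes = period_minutes * height
    predicted_minutes = period_minutes
    return lookback_minutes, predicted_minutes


lookback, predicted = mpst_spans(30, 3)
print(lookback / 60, predicted)  # 1.5 30
```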
We mentioned that the prediction result is decided from the recent patterns. The height of MPST determines the maximum number of patterns we use to make a prediction. The effect of the MPST height is shown in Fig. 5.7. The effect is not obvious in this experiment because we used the number of occurrences to make the predictions. The number of occurrences decreases toward the lower nodes of the tree, so the lower nodes contribute less to the prediction. Thus the MPST height is less important than the period.
Figure 5.9: Different threshold for Hybrid Prediction (axes: prediction length in hours, average trace length for one value; series: Hybrid)
Figure 5.10: Effect of K in K-means (clustering results with different values of k)
The last parameter of MPST is the pattern similarity threshold. This parameter decides whether a subsequence is considered similar to a pattern in the symbolization process: if the distance between the subsequence and the pattern is lower than the threshold, the subsequence is similar to this pattern. As shown in Fig. 5.8, the effect of the threshold is stepwise. When the threshold is greater than 6, the prediction result is inaccurate. This is because a threshold that is too large causes MPST to use patterns which are not similar to the subsequence to predict future values.
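The role of the pattern similarity threshold in symbolization can be illustrated with a small sketch. Several details here are assumptions and not taken from the method itself: the distance is taken to be Euclidean, a subsequence is matched to its nearest pattern, and the names `euclidean` and `symbolize` as well as the sample speed values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def symbolize(subsequence, patterns, threshold):
    """Return the index of the nearest pattern if its distance is below the threshold.

    Returns None when no pattern is similar enough to the subsequence.
    """
    best_index, best_distance = None, math.inf
    for index, pattern in enumerate(patterns):
        distance = euclidean(subsequence, pattern)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index if best_distance < threshold else None


patterns = [[60, 60, 60], [30, 30, 30]]  # illustrative free-flow and congested patterns

print(symbolize([58, 61, 59], patterns, threshold=6))   # 0: close to the first pattern
print(symbolize([44, 44, 44], patterns, threshold=6))   # None: no similar pattern
print(symbolize([44, 44, 44], patterns, threshold=30))  # 1: a loose threshold accepts a poor match
```

The last call shows the failure mode of an overly large threshold: a subsequence that resembles neither pattern is still mapped to one of them.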
For the hybrid prediction method, we tried different threshold values. The experiment results are shown in Fig. 5.9. The best threshold is 1 hour; therefore we use 1 hour as the breakpoint between short-term and long-term predictions. In this experiment, a larger threshold means that regression is used for a larger share of the predictions; thus accuracy is low when the threshold is large.
We mentioned that the value of k in the clustering algorithm k-means is not an important parameter. To verify this, we clustered our data set with different values of k and pruned the clusters. The result of this experiment is shown in Fig. 5.10. When k is large enough, the number of clusters no longer grows.
In these experiments, we verified the performance of the proposed methods. The results also show that the regression-based method is suited to short-term prediction, while clustering-based prediction is suited to long-term prediction.