# Practical Data Analysis

The practical data are wind speed records from October 2017 to September 2018, giving the maximum wind speed in each hour. We apply the two methods to these data to see whether they can separate the different maximum wind speed trends of different seasons.

Here we treat each month of data as one curve, giving 12 curves in total, as shown in Figure 5.1. The first curve is the wind speed data for October 2017, the second curve is the wind speed data for November 2017, and so on; the last curve is the wind speed data for September 2018. Table 5.1 shows the mean and standard deviation of the monthly wind speed data. The mean wind speed is larger from October 2017 to February 2018, and the standard deviation is smaller from May 2018 to August 2018.

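Table 5.1: The mean and standard deviation of monthly wind speed data

| (mm/yy) | 10/17 | 11/17 | 12/17 | 1/18 | 2/18 | 3/18 | 4/18 | 5/18 | 6/18 | 7/18 | 8/18 | 9/18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 14.95 | 15.04 | 16.87 | 14.64 | 14.26 | 9.67 | 7.42 | 6.78 | 8.53 | 6.49 | 6.97 | 9.89 |
| sd | 7.27 | 6.12 | 6.85 | 5.20 | 6.33 | 5.86 | 5.80 | 3.84 | 4.84 | 3.48 | 3.50 | 7.32 |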
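A summary of this kind can be computed by grouping the hourly readings by month. The following is a minimal sketch, assuming the readings are available as (month label, speed) pairs and that the standard deviation is the sample standard deviation; the function name `monthly_summary` and the toy readings are hypothetical and are not the thesis data.

```python
from collections import defaultdict
from statistics import mean, stdev


def monthly_summary(records):
    """Group (month_label, speed) pairs by month and return {month: (mean, sd)}.

    Assumption: sd is the sample standard deviation (n - 1 denominator).
    """
    by_month = defaultdict(list)
    for month, speed in records:
        by_month[month].append(speed)
    return {m: (mean(v), stdev(v)) for m, v in by_month.items()}


# Placeholder readings for illustration only, NOT the thesis data.
toy_records = [
    ("10/17", 12.0), ("10/17", 18.0), ("10/17", 15.0),
    ("11/17", 9.0), ("11/17", 11.0),
]
summary = monthly_summary(toy_records)
for month, (avg, sd) in summary.items():
    print(f"{month}: mean={avg:.2f}, sd={sd:.2f}")
```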

We first use the mixture of regression models to group the curves. According to the BIC, the selected number of clusters is three, and the estimated parameters $\hat{\beta}$ of each group are given in Table 5.2. Based on the estimated values in Table 5.2, we draw the regression lines in Figure 5.2. The figure shows that the three groups are characterised by low wind speed, increasing wind speed, and decreasing wind speed. Cluster 1 is the decreasing group and contains the wind speed of 11/2017, 12/2017, and 2/2018. Cluster 2 is the low group and contains the wind speed of March to August 2018. Cluster 3 is the increasing group and contains the wind speed of 10/2017, 1/2018, and 9/2018.

Table 5.2: The estimated parameters $\hat{\beta}$ of each group

| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| coef.(Intercept) | 17.648602 | 7.173842 | 8.766866 |
| coef.x | -0.008070 | 0.015417 | -0.002302 |
| coef.x2 | 0.000015 | 0.000008 | -0.000006 |

Figure 5.2: Data and regression lines by estimated values

Second, we use hierarchical clustering with DTW to group the curves. We first obtain the dendrogram shown in Figure 5.3. The dendrogram divides the data into three groups. The group on the left runs from December 2017 to February 2018, the middle group is March to August 2018, and the group on the right contains October and November 2017 and September 2018.
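The idea behind hierarchical clustering with DTW can be illustrated with a small pure-Python sketch. It assumes an absolute-difference local cost, no warping window, and complete linkage; the thesis does not state these choices here, so they are assumptions, and the names `dtw_distance` and `agglomerate` are hypothetical rather than functions of the dtwclust package.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Unconstrained DTW distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerate(curves, k, linkage=max):
    """Merge the two closest clusters until k remain.

    linkage=max gives complete linkage (an assumption); linkage=min
    would give single linkage. Returns lists of curve indices.
    """
    dist = {
        frozenset((i, j)): dtw_distance(curves[i], curves[j])
        for i, j in combinations(range(len(curves)), 2)
    }
    clusters = [[i] for i in range(len(curves))]

    def gap(pair):
        p, q = pair
        return linkage(dist[frozenset((i, j))] for i in clusters[p] for j in clusters[q])

    while len(clusters) > k:
        p, q = min(combinations(range(len(clusters)), 2), key=gap)  # p < q
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy curves for illustration only, NOT the wind speed data.
toy_curves = [[1, 2, 3, 4], [1, 1, 2, 3, 4], [10, 10, 10], [10, 11, 10]]
groups = agglomerate(toy_curves, k=2)
print(groups)
```

Because DTW aligns points along a warping path, the first two toy curves have distance zero even though their lengths differ, which is why curves of different lengths (months with 28 to 31 days) can be compared directly.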

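Table 5.3: The clustering results obtained by the two methods

| (mm/yy) | 10/17 | 11/17 | 12/17 | 1/18 | 2/18 | 3/18 | 4/18 | 5/18 | 6/18 | 7/18 | 8/18 | 9/18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MRM | 3 | 1 | 1 | 3 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| DTW | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 1 |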

In Table 5.3, we compare the clustering results obtained by the two methods. Both methods place March to August in a single group, but the results for the other months differ slightly. With the mixture of regression models, November 2017, December 2017, and February 2018 form one group, and October 2017, January 2018, and September 2018 form another. With hierarchical clustering, however, November 2017 is grouped with October 2017 and September 2018, and January 2018 is grouped with December 2017 and February 2018.
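The agreement between the two label vectors in Table 5.3 can also be summarised numerically. The sketch below computes the Rand index, the fraction of month pairs that both methods treat the same way (together in both, or apart in both). The Rand index is not used in the thesis; it is brought in here only as one possible illustration, and the function name `rand_index` is hypothetical. Since cluster numbers are arbitrary labels, only co-membership is compared.

```python
from itertools import combinations

# Cluster labels copied from Table 5.3 (10/17 through 9/18).
mrm = [3, 1, 1, 3, 1, 2, 2, 2, 2, 2, 2, 3]
dtw = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1]


def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


score = rand_index(mrm, dtw)
print(round(score, 3))  # 58 of the 66 pairs agree, about 0.879
```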

From this result, we find that the grouping given by hierarchical clustering with DTW is closer to the pattern of the monthly averages, and it also shows that the maximum wind speed behaves differently in different seasons.

## Chapter 6 Conclusions

In this thesis we compare two clustering methods for curve data. First, we consider mixture of regression models with different numbers of components and use the EM algorithm to estimate the parameters. Second, we consider hierarchical clustering in which DTW is used to calculate the distance between two curves. We compare the two methods by their correct clustering rate.
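To make the role of the EM algorithm concrete, the E-step for a mixture of regression models can be sketched as follows. This is a minimal illustration under stated assumptions: each whole curve belongs to one component, the component mean is quadratic in x (as in Table 5.2), and the errors are independent Gaussian with a component-specific standard deviation. The function names, the dictionary layout, and the toy parameter values are hypothetical and are not taken from the thesis or from flexmix.

```python
import math


def log_normal_pdf(y, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)


def quadratic(beta, x):
    """Mean curve b0 + b1 * x + b2 * x^2."""
    b0, b1, b2 = beta
    return b0 + b1 * x + b2 * x * x


def e_step(curve, components):
    """Posterior probability that one curve belongs to each component.

    curve: list of (x, y) points; components: list of dicts with keys
    "weight" (mixing proportion), "beta" (b0, b1, b2) and "sigma".
    """
    log_post = []
    for comp in components:
        total = math.log(comp["weight"])
        for x, y in curve:
            total += log_normal_pdf(y, quadratic(comp["beta"], x), comp["sigma"])
        log_post.append(total)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


# Toy components and curve for illustration only.
toy_components = [
    {"weight": 0.5, "beta": (10.0, 0.0, 0.0), "sigma": 1.0},
    {"weight": 0.5, "beta": (0.0, 0.0, 0.0), "sigma": 1.0},
]
posterior = e_step([(0, 10.0), (1, 10.0)], toy_components)
print(posterior)
```

In a full EM iteration these posterior probabilities would then serve as weights in a weighted least-squares update of each component's coefficients (the M-step), which is omitted here.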

From the results of the simulation study, we find that when the variation of the curves is small, hierarchical clustering with DTW gives a better result. However, when the variation of the curves is large, the mixture of regression models is a better choice. We also use the two methods to analyze a practical data set. According to the results, the clusters from hierarchical clustering with DTW are more in line with our expectation that the maximum wind speed behaves differently in different seasons.

There are still many problems that we can discuss in the future. First, in mix[...] to successfully divide the data into the number of groups we specify. Second, we chose the EM algorithm to estimate the parameters; the classification version of the EM algorithm (CEM) (Celeux & Govaert, 1992) and the stochastic version of the EM algorithm (SEM) (Celeux, 1985) could also be applied and compared. We can also consider partitional clustering, for example K-means or fuzzy C-means, with the DTW distance. Third, the curves we consider in this thesis are one-dimensional; curves in two or more dimensions could also be studied. Fourth, we treat the points in each curve as independent; the dependent case could also be discussed in the future. Finally, the maximum wind speed in the practical analysis could be analyzed with an AR(1) model.
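As a pointer for the last suggestion, an AR(1) model $x_t = c + \phi x_{t-1} + \varepsilon_t$ can be fitted by ordinary least squares of each value on its predecessor. The sketch below is only an illustration of that idea under the stated model form; the function name `fit_ar1` and the toy series are hypothetical, and no such fit is reported in the thesis.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    return c, phi


# Noise-free toy series generated by x_t = 1 + 0.5 * x_{t-1}, NOT wind data.
toy_series = [0.0, 1.0, 1.5, 1.75, 1.875]
c_hat, phi_hat = fit_ar1(toy_series)
print(c_hat, phi_hat)
```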

### References

Camargo, S. J., Robertson, A. W., Gaffney, S. J., Smyth, P., & Ghil, M. (2007). Cluster analysis of typhoon tracks. Part I: General properties. Journal of Climate, 20(14), 3635–3653.

Celeux, G. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73–82.

Celeux, G., & Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 315–332.

Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364–366.

DeSarbo, W. S., & Cron, W. L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5(2), 249–282.

Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: John Wiley & Sons.

Gaffney, S. (2004). Probabilistic curve-aligned clustering and prediction with regression mixture models (Ph.D. dissertation). University of California, Irvine.

Gaffney, S., & Smyth, P. (1999). Trajectory clustering with mixtures of regression models. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 63–72).

[...] using dynamic time warping distance. Engineering Applications of Artificial Intelligence, 39, 235–244.

Lee, J. G., Han, J., & Whang, K. Y. (2007). Trajectory clustering: A partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD international conference on management of data (pp. 593–604).

Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.

Leisch, F., & Gruen, B. (2012). Package ‘flexmix’. Information found at https://cran.r-project.org/web/packages/flexmix/flexmix.pdf.

Morris, B., & Trivedi, M. (2009). Learning trajectory patterns by clustering: Experimental studies and comparative evaluation. In 2009 IEEE conference on computer vision and pattern recognition (pp. 312–319).

Niennattrakul, V., & Ratanamahatana, C. A. (2007). On clustering multimedia time series data using K-means and dynamic time warping. In Proceedings of the 2007 international conference on multimedia and ubiquitous engineering (pp. 733–738).

Sakoe, H., & Chiba, S. (1971). A dynamic programming approach to continuous speech recognition. In Proceedings of the seventh international congress on acoustics (Vol. 3, pp. 65–69).

Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 43–49.

Sardá-Espinosa, A. (2017). Comparing time-series clustering algorithms in R using the dtwclust package. Vienna: R Development Core Team.

Sardá-Espinosa, A. (2019). Package ‘dtwclust’. Information found at https://cran.r-project.org/web/packages/dtwclust/dtwclust.pdf.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.

Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30–34.

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic Press.

Zheng, Y. (2015). Trajectory data mining: An overview. ACM Transactions on Intelligent Systems and Technology (TIST), 6(3), 1–41.