Clustering analysis of trajectory data: Comparison of mixture of regression models and hierarchical clustering with dynamic time warping

Master's thesis, Department of Mathematics, National Taiwan Normal University
Advisor: Dr. 蔡碧紋
Student: 蕭詠文
August 2019

Acknowledgements

First of all, I would like to thank my advisor, Professor 蔡碧紋, for guiding me through this thesis, from the initial reading of the literature and first attempts at writing programs, to settling on a research direction, and finally to completing and revising the whole manuscript. I met many setbacks and difficulties along the way, and it was her patient guidance, encouragement, and advice that carried me through. I also thank Professors 呂翠珊 and 丘政民 for serving on my oral examination committee; besides much help and many valuable suggestions that improved this thesis, they offered new ideas and pointed out possible directions for future research. I am grateful to all the teachers who have taught me over the years for their earnest instruction, for accompanying me as I grew, for guiding me when I lost my way, and for their many suggestions and encouragement. Thanks also go to my classmates: studying and discussing together, I learned a great deal from you and gained valuable friendships. Finally, I thank my family and friends for supporting me all the way here, for always listening to my worries, and for offering advice at the right moments; you are what keeps me moving forward.

蕭詠文, 2019.07.30

中文摘要 (Chinese abstract)

Trajectory data are curves indexed by time and appear in many fields such as climatology and time series analysis. Clustering trajectory data is an important part of statistical analysis: by grouping similar curves we can study the properties of each cluster and even predict which cluster a new observation belongs to. In this thesis we consider two clustering methods, the mixture of regression models and hierarchical clustering with dynamic time warping, and compare them through a simulation study and the analysis of a real data set. In the simulation we compare the correct clustering rate of the two methods under different scenarios and examine the parameter estimates of the mixture of regression models. The simulation results show that neither method is uniformly better; each has its own advantages in different situations. Finally, both methods are applied to a real data set.

Keywords: mixture of regression models, hierarchical clustering, dynamic time warping

Abstract

The clustering of trajectory data is an important part of statistical analysis. Trajectory data are curve data indexed by time. Through clustering, we divide similar curves into groups so that we can analyze the properties of each group. Two methods are studied: one is a model-based method, the mixture of regression models, and the other is hierarchical clustering with dynamic time warping. The two methods are compared through a simulation study, in which we examine the parameter estimates of the mixture of regression models and compare the performance of the two methods in different situations by the correct clustering rate. According to the simulation results, the two methods have their own advantages in different situations. Additionally, the two clustering methods are applied to a real data set.

Keywords: Mixture of regression models, Hierarchical clustering, Dynamic time warping

Contents

1 Introduction
2 Mixture of Regression Models
  2.1 Model
  2.2 EM Algorithm
  2.3 Example
3 Hierarchical Clustering with Dynamic Time Warping
  3.1 Hierarchical Clustering
  3.2 Dynamic Time Warping
  3.3 Example
4 Simulation Study
  4.1 Clustering Results of the Mixture of Regression Models
  4.2 Comparing Clustering Results of the Two Methods
5 Practical Data Analysis
6 Conclusions
References

List of Tables

2.1 The data Y and X
2.2 The true values and estimated values of the parameters
3.1 Value of d(y1(u), y2(v)) for each u, v
3.2 Result of γ(u, v) for each u, v
4.1 The mean and RMSE of the estimates of π based on 500 replications
4.2 The bias and RMSE of the estimates based on 500 replications (equal mixing weights)
4.3 The bias and RMSE of the estimates based on 500 replications (unequal mixing weights)
4.4 The mean correct clustering rate of the two methods
5.1 The mean and standard deviation of the monthly wind speed data
5.2 The estimated parameters β̂ of each group
5.3 The clustering results obtained by the two methods

Chapter 1
Introduction

Trajectory data record the location of an object over time. The object could be a person, an animal, a typhoon, and so on; examples include travel routes, animal movements, and typhoon tracks. By analyzing these trajectories, a lot of information can be obtained, so the clustering of trajectory data is an important research theme in many areas, such as science, sociology, and climate (Lee, Han, & Whang, 2007; Zheng, 2015). The trajectory data we consider are one-dimensional curves.

Figure 1.1: An example of trajectory data.

Each curve is measured as a function of an independent variable (e.g., time), as shown in Figure 1.1. The figure suggests that these curves come from two populations. Our goal is to divide similar data into the same group.

In this thesis, we use two clustering methods. The first is the mixture of regression models, a model-based clustering approach that fits probabilistic models to the data. Other common clustering approaches for multidimensional data are hierarchical clustering and partitional clustering, which are not based on a probability model. Hierarchical clustering is the second method we consider.

The mixture of regression models supposes that the data arise from a mixture of two or more regression models. In the standard setting, each individual data point is assigned to one of the regression models (DeSarbo & Cron, 1988). However, the data we focus on are curves, where each curve contains several points. Gaffney and Smyth (1999) applied this method to trajectory data, and it has been used on typhoon tracks to cluster the tracks and analyze the properties of each group (Camargo, Robertson, Gaffney, Smyth, & Ghil, 2007).

Hierarchical clustering groups the data by calculating distances between them. We use the dynamic time warping (DTW) distance (Sakoe & Chiba, 1971, 1978) instead of the Euclidean distance because it can measure the distance between curves of different lengths and has other advantages that we cover in detail in Chapter 3. In fact, the DTW distance is used in many clustering methods; for example, it has been used with K-means to cluster multimedia time-series data (Niennattrakul & Ratanamahatana, 2007).

It has also been combined with other partitional clustering methods to capture shape similarities between time series (Izakian, Pedrycz, & Jamal, 2015). So in this thesis, we consider hierarchical clustering with DTW as the second method.

This thesis is organized as follows. In Chapter 2, we introduce the mixture of regression models and illustrate how the expectation-maximization (EM) algorithm is used to estimate the cluster parameters; a simple example is then given. In Chapter 3, we introduce hierarchical clustering and illustrate how DTW measures the distance between two curves. In Chapter 4, we apply the two clustering methods to simulated data: we present the bias and root mean square error of the estimates of the mixture of regression models, and we compare the two methods by their correct clustering rates under different situations. In Chapter 5, we apply the two clustering methods to one year of data on the maximum wind speed per hour. In Chapter 6, we present our conclusions.

Chapter 2
Mixture of Regression Models

2.1 Model

Let y_j denote the j-th of the N curves, with the i-th point of y_j denoted y_j(i), where 1 ≤ j ≤ N and 1 ≤ i ≤ n_j; n_j is the number of points in the j-th curve. In other words, y_j is a column vector formed from the points of the j-th curve, y_j = [y_j(1) y_j(2) ... y_j(n_j)]'. x_j is the vector of time points corresponding to y_j, with i-th point x_j(i). Each curve belongs to a population that is a mixture of K clusters. We assume that y_j and x_j are one-dimensional, and we fit the data into this framework by defining the regression equation for component k as

    y_j = X_j β_k + e,    (2.1)

with

    X_j = \begin{bmatrix}
            1 & x_j(1)   & \cdots & x_j(1)^m   \\
            1 & x_j(2)   & \cdots & x_j(2)^m   \\
            \vdots & \vdots & \ddots & \vdots  \\
            1 & x_j(n_j) & \cdots & x_j(n_j)^m
          \end{bmatrix}.    (2.2)

X_j is an n_j by (m + 1) matrix, where m is the degree of the regression function. We assume that e is a vector of length n_j of zero-mean Gaussian errors with variance σ², and that β_k = [β_k0 β_k1 ... β_km]' is a vector of regression coefficients.

The probability of observing a particular point y_j(i), given x_j(i) and component model k, is written f_k(y_j(i) | x_j(i), θ_k) and is assumed to follow a conditional regression model. We can then define the probability of a complete curve, given a particular component model k, as

    P(y_j | x_j, θ_k) = P(y_j(1), ..., y_j(n_j) | x_j(1), ..., x_j(n_j), θ_k)
                      = \prod_{i=1}^{n_j} f_k(y_j(i) | x_j(i), θ_k).    (2.3)

Since we do not know which component each curve belongs to, we denote the proportion of the k-th component by π_k, 1 ≤ k ≤ K; the probabilities π_k are positive and sum to one. The conditional density of a curve, P(y_j | x_j), is then a mixture density:

    P(y_j | x_j, θ) = \sum_{k=1}^{K} π_k f_k(y_j | x_j, θ_k),    (2.4)

where the f_k(y_j | x_j, θ_k) are the mixture components, the π_k are the mixing weights, and θ_k is the set of parameters for component k. Conditional independence between curves, given the model, amounts to assuming that our curves constitute a random sample from a population of curves. So the full joint density is written as

    P(Y | X, θ) = \prod_{j=1}^{N} \sum_{k=1}^{K} π_k \prod_{i=1}^{n_j} f_k(y_j(i) | x_j(i), θ_k),    (2.5)

where Y and X are the sets of y_j and x_j, respectively. The log-likelihood of the parameters θ given the data set follows directly from Equation (2.5):

    L(θ | Y, X) = \sum_{j=1}^{N} \log \sum_{k=1}^{K} π_k \prod_{i=1}^{n_j} f_k(y_j(i) | x_j(i), θ_k).    (2.6)

To estimate the parameters θ, we maximize this log-likelihood with respect to θ. The problem can be solved by using the EM algorithm.

2.2 EM Algorithm

Maximizing the likelihood across the entire parameter space is a complicated maximization routine. The EM algorithm augments the observed data with additional information that replaces the unobserved data, which greatly simplifies the maximization of the likelihood. In the context of the mixture of regression models, the unobserved data are the cluster memberships z_jk, indicating whether the j-th curve belongs

to mixture component k. Let Z be a matrix of indicator vectors z_j = (z_j1, ..., z_jK), where z_jk = 1 if the j-th curve comes from cluster k and z_jk = 0 otherwise. The joint density of Y and Z given X can be written as

    P(Y, Z | X, θ) = P(Y | Z, X, θ) p(Z | X, θ)
                   = \prod_{j=1}^{N} P(y_j | z_j, x_j, θ) p(z_j)
                   = \prod_{j=1}^{N} \prod_{k=1}^{K} [f_k(y_j | x_j, θ_k) π_k]^{z_jk}
                   = \prod_{j=1}^{N} \prod_{k=1}^{K} \Big[ π_k \prod_{i=1}^{n_j} f_k(y_j(i) | x_j(i), θ_k) \Big]^{z_jk}.    (2.7)

This follows from our previous conditional independence assumptions on the y_j(i) and the fact that the z_j are independent. The complete-data log-likelihood function L follows directly from Equation (2.7):

    L(θ | Y, X, Z) = \sum_{j=1}^{N} \sum_{k=1}^{K} z_jk \log π_k + \sum_{j=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{n_j} z_jk \log f_k(y_j(i) | x_j(i), θ_k).    (2.8)

In the E-step, the expected value of Equation (2.8) is taken with respect to p(Z | Y, X, θ^(t−1)), where θ^(t−1) = {π_k^(t−1), β_k^(t−1), σ^(t−1) | k = 1, ..., K} is the current set of parameters. In the M-step, this expectation is maximized over the parameters to yield the new parameters θ^(t). By the particular form of Z chosen here, the expectation of L(θ | Y, X, Z) is

    E[L(θ | Y, X, Z)] = \sum_{j=1}^{N} \sum_{k=1}^{K} h_jk \log π_k + \sum_{j=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{n_j} h_jk \log f_k(y_j(i) | x_j(i), θ_k),    (2.9)

where

    h_jk = E[z_jk] = p(z_jk = 1 | y_j, x_j, θ^(t−1))
         = P(y_j | z_jk = 1, x_j, θ^(t−1)) p(z_jk = 1) / P(y_j)
         = π_k \prod_{i=1}^{n_j} f_k(y_j(i) | x_j(i), θ^(t−1)) \Big/ \sum_{l=1}^{K} π_l \prod_{i=1}^{n_j} f_l(y_j(i) | x_j(i), θ^(t−1)).    (2.10)

We defined the regression equation for y_j such that the expected value of y_j is a function of x_j. By specifying the mixture components as regression models, we set f_k(y_j | x_j, θ_k) equal to a Gaussian with mean X_j β_k and covariance matrix diag(σ²). The solutions for π_k, β_k, and σ² are obtained from the well-known weighted least squares problem (Draper & Smith, 1981).

Here we use the R package FlexMix to estimate all parameters of this model (Leisch, 2004; Leisch & Gruen, 2012). We estimate not only the model parameters θ but also the number of groups K. The function stepFlexmix returns the maximum likelihood solution, AIC, and BIC for each model under different numbers of components. Secondly, we use getModel to select the best number of components by the BIC criterion. The Bayesian information criterion (Schwarz et al., 1978) is defined as

    BIC = k \ln(N) − 2 \ln(\hat{L}),    (2.11)

where N is the number of data, k is the number of parameters, and \hat{L} is the maximized value of the likelihood function of the model. Since the likelihood can always be increased by adding parameters, BIC introduces a penalty term to address this problem. We choose the model with the lowest BIC. Finally, flexmix is used to divide the data into groups and to estimate the parameters.
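This workflow can be sketched in a few lines of R. The data layout and the formula below are illustrative assumptions rather than the thesis's exact call; the `| id` term asks FlexMix to keep all points of a curve in the same component, which matches the curve-level membership used here.

```r
## Sketch of the FlexMix workflow described above. The data frame layout and
## the formula are illustrative assumptions, not the thesis's exact call.
library(flexmix)

set.seed(1)
## Simulate 10 curves from the two quadratic components of Section 2.3.
curves <- do.call(rbind, lapply(1:10, function(j) {
  n  <- sample(10:14, 1)
  x  <- 0:(n - 1)
  x2 <- (x - mean(x))^2                 # centered squared term, as in Eq. (2.12)
  mu <- if (j <= 5) 2 * x + 0.5 * x2 else 35 - 0.25 * x - 0.6 * x2
  data.frame(id = j, x = x, x2 = x2, y = mu + rnorm(n, sd = 5))
}))

## Fit mixtures with 1-4 components; "| id" keeps every point of a curve in
## the same component (curve-level membership, Equation (2.3)).
fits <- stepFlexmix(y ~ x + x2 | id, data = curves, k = 1:4, nrep = 3)
best <- getModel(fits, which = "BIC")   # model with the lowest BIC, Eq. (2.11)

parameters(best)                        # estimated beta_k and sigma per component
table(clusters(best))                   # component label per row; rows of the
                                        # same curve share one label
```

Under these assumptions, clusters(best) returns the hard assignment corresponding to the largest posterior h_jk of Equation (2.10).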

2.3 Example

The data are simulated under a two-component mixture regression model. First, we generate x_j as the series {0, 1, 2, ..., n_j − 1}, where n_j is an integer in [10, 14]. Next, y_j is generated from g_1(x) = 2x + 0.5x² or g_2(x) = 35 − 0.25x − 0.6x², with error e ~ N(0, 5). We generate 5 curves per component, so there are 10 curves in total. The curves are plotted in Figure 2.1, and Table 2.1 lists the data Y and X, with one column per curve y_1, ..., y_10 and one row per time point x_j = {0, 1, 2, ..., n_j − 1}. To keep the squared term from making y_j(i) too large, we center the squared term of x.

Figure 2.1: Trajectory graph of the data in Table 2.1.

Table 2.1: The data Y and X (one column per curve; one row per time point x)

  x      y1     y2     y3     y4     y5     y6     y7     y8     y9    y10
  0   19.56  17.85  17.36  15.39  14.60  23.64  24.41  14.63  19.28  16.48
  1   19.14  19.40  17.33   0.63  13.63  29.23  25.31  15.48  23.71  30.21
  2   19.45  15.76  16.17   6.27  14.44  26.17  27.80  24.09  20.21  24.16
  3   14.58  21.18   5.31  10.23   3.68  32.65  35.22  24.85  24.91  21.54
  4   14.25  17.06  13.59  -2.05  14.73  32.34  36.91  34.22  33.29  36.88
  5   -2.79  17.87   3.07   3.78  13.16  33.89  32.32  26.21  31.65  34.83
  6   11.39  17.47  12.18  10.95  17.45  34.53  28.90  29.75  44.92  39.13
  7   14.32  13.07   5.25  13.67  17.40  31.67  29.73  29.29  26.71  28.22
  8    8.67  22.71  20.13  21.39  21.90  21.97  27.99  33.30  37.51  25.04
  9   32.03  18.09  31.77  24.45  17.21  22.84  13.11  25.38  27.77  27.62
 10   20.79  21.87  22.90  29.31  45.90  19.49  18.97  17.95  24.71  13.25

The longer curves have additional points at x = 11 (41.47, 44.45, 33.25, 44.68, 39.15, 39.12, 37.04, 48.34), x = 12 (21.97, 15.96, 11.45, 3.67), and x = 13 (7.88).

Because the squared term of x is centered, the curves y_j are generated from

    y_j = 2 x_j + 0.5 (x_j − x̄_j)² + e    or    y_j = 35 − 0.25 x_j − 0.6 (x_j − x̄_j)² + e.    (2.12)

The clustering result is shown in Figure 2.2: the trajectories drawn with dotted lines form one group, those drawn with dash-dotted lines form the other group, and the solid lines are the regression curves drawn from the estimated parameter values.

Figure 2.2: Result of clustering for this example.

Table 2.2: The true values and estimated values of the parameters

                    β10    β11    β12     β20     β21     β22    σ1    σ2    π1
 true value          0     2      0.5     35     -0.25   -0.6    5     5     0.5
 estimated value   0.835  1.913  0.478  34.127  -0.209  -0.584  4.87  4.31   0.5

Figure 2.2 shows that the data are correctly divided into two groups. The parameter estimates are given in Table 2.2. Except for β10 and β20, the estimated values of β are close to the true values, with differences of less than 0.1. Although the estimates of β10 and β20 are slightly farther from the true values, the differences are still less than 1.

Chapter 3
Hierarchical Clustering with Dynamic Time Warping

3.1 Hierarchical Clustering

Hierarchical clustering measures the similarity between data and gradually merges similar data. There are two types of hierarchical clustering, agglomerative and divisive. Agglomerative hierarchical clustering starts by treating each data point as its own cluster and then gradually merges similar clusters; divisive hierarchical clustering starts by treating all data points as one cluster and then gradually splits clusters. In this thesis, we use agglomerative hierarchical clustering. Each curve starts in its own cluster, so we start with N clusters. At each step, the two clusters with the smallest inter-cluster distance are combined into a higher-level cluster; the merging process ends when the number of clusters reaches the value K, and the dendrogram is built along the way. To merge the closest clusters, we need to compute the inter-group dissimilarity.

There are several ways to compute it: single linkage (Sibson, 1973), complete linkage (Defays, 1977), average linkage, and so on (Wilks, 2011). Single linkage is the smallest distance between one member of a group Q and one member of another group R; conversely, complete linkage is the largest such distance. Average linkage, the method we choose, is the average of all distances between pairs of curves y_q in Q and y_r in R:

    d(Q, R) = \frac{1}{|Q||R|} \sum_{y_q \in Q} \sum_{y_r \in R} d(y_q, y_r),    (3.1)

where d(y_q, y_r) is the distance between y_q and y_r, and |Q| and |R| are the sizes of Q and R, respectively. Here we use DTW to calculate the distance between two curves. The first reason we do not use the Euclidean distance is that the curves do not all have the same length: the Euclidean distance treats a curve vector as a point in R^n, so it requires the data to be of equal length, whereas the curves we consider have different lengths. Secondly, when curves have an overall similar shape but are not aligned on the time axis, the Euclidean distance produces a pessimistic dissimilarity measure. DTW allows a more flexible distance to be calculated: it is not limited to matching points at the same time index, and it finds a smaller distance between curves of similar shape. We introduce the DTW distance in the next section.
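Before turning to DTW, the average-linkage rule in Equation (3.1) can be written as a short helper; a sketch, where dist_fun stands for any pairwise curve distance (for example the DTW distance introduced next):

```r
## Average linkage, Equation (3.1): the mean of all pairwise distances between
## curves in Q and curves in R. Q and R are lists of numeric vectors;
## dist_fun is any curve distance (e.g. the DTW distance of Section 3.2).
avg_linkage <- function(Q, R, dist_fun) {
  d <- sapply(R, function(yr) sapply(Q, function(yq) dist_fun(yq, yr)))
  mean(d)   # average over all |Q| * |R| pairs
}
```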

3.2 Dynamic Time Warping

Suppose we have two curves y_q and y_r, of lengths n_q and n_r, respectively, where

    y_q = [y_q(1) y_q(2) ... y_q(n_q)]',    (3.2)
    y_r = [y_r(1) y_r(2) ... y_r(n_r)]'.    (3.3)

To align the two curves with DTW, we construct an n_q by n_r matrix D whose (u, v) element contains the distance between the two points y_q(u) and y_r(v), denoted d(y_q(u), y_r(v)), where

    d(y_q(u), y_r(v)) = (y_q(u) − y_r(v))².    (3.4)

A warping path W aligns the elements of y_q and y_r. The l-th element of W is defined as w_l = (u, v)_l, meaning that the path passes through the (u, v) element of the matrix D at the l-th position. We have

    W = w_1, w_2, ..., w_l, ..., w_L,    (3.5)

where max(n_q, n_r) ≤ L ≤ n_q + n_r − 1. The warping path is typically subject to several constraints.

(1) Boundary conditions: w_1 = (1, 1) and w_L = (n_q, n_r). This requires the warping path to start at the first points of the two curves and to end at their last points.

(2) Local constraint: for any given w_l = (a, b) in the path, the possible values of w_{l−1} are restricted to (a − 1, b), (a, b − 1), and (a − 1, b − 1). This requires the warping path to be continuous and monotonic.

For each warping path W = {w_l | l = 1, ..., L}, we can compute its cost as \sqrt{\sum_{l=1}^{L} d_l(y_q(u), y_r(v))}, where d_l(y_q(u), y_r(v)) is the distance between y_q(u) and y_r(v) with u and v given by w_l = (u, v)_l. The warping path that we are interested in minimizes the warping cost:

    DTW(y_q, y_r) = \min_W \sqrt{\sum_{l=1}^{L} d_l(y_q(u), y_r(v))}.    (3.6)

Many warping paths satisfy the above conditions. Instead of computing the cost of every path, the optimal path can be found by dynamic programming using the following recurrence, which defines the cumulative distance γ(u, v) as the distance of the current cell plus the minimum of the cumulative distances of the adjacent cells:

    γ(u, v) = d(y_q(u), y_r(v)) + \min\{γ(u − 1, v − 1), γ(u − 1, v), γ(u, v − 1)\}.    (3.7)

The DTW distance is then obtained from the final cell: DTW(y_q, y_r) = \sqrt{γ(n_q, n_r)}.

We now show a simple example of how DTW computes the distance between two curves and how it differs from the Euclidean distance. Figure 3.1 is the broken-line graph of two curves, y1 = (10, 9, 7, 9, 10, 10, 9, 7, 8)' and y2 = (9, 8, 6, 6, 8, 9, 9, 7, 6)'.

Figure 3.1: Trajectory graph of the example. The heavy line is y1 and the light line is y2.

(1) Euclidean distance: the distance between y1 and y2 is

    dist(y1, y2) = \sqrt{\sum_{i=1}^{9} (y1(i) − y2(i))²},    (3.8)

where y1(i) and y2(i) are the i-th points of y1 and y2, respectively. So we have

    dist(y1, y2) = \sqrt{(10 − 9)² + (9 − 8)² + \cdots + (8 − 6)²} = 4.58.    (3.9)

(2) DTW distance:

Our goal is to find a warping path

    W = w_1, w_2, ..., w_L    (where 9 ≤ L ≤ 17)    (3.10)

such that the distance is minimized. We use dynamic programming to evaluate the recurrence γ(u, v) to find the path. First we compute d(y1(u), y2(v)) for each u, v; the result, the matrix D, is shown in Table 3.1, where the rows follow y1 = (10, 9, 7, 9, 10, 10, 9, 7, 8)' and the columns follow y2 = (9, 8, 6, 6, 8, 9, 9, 7, 6)'. The entries are the distances between the points of the two curves, as defined in Equation (3.4). Next, Table 3.2 shows the computed values of γ(u, v); the cells visited by the minimum-distance path are those listed in Equation (3.12). By Table 3.2, we have

    DTW(y1, y2) = \sqrt{11} = 3.32.    (3.11)

Table 3.1: Value of d(y1(u), y2(v)) for each u, v

 y1\y2    9    8    6    6    8    9    9    7    6
   10     1    4   16   16    4    1    1    9   16
    9     0    1    9    9    1    0    0    4    9
    7     4    1    1    1    1    4    4    0    1
    9     0    1    9    9    1    0    0    4    9
   10     1    4   16   16    4    1    1    9   16
   10     1    4   16   16    4    1    1    9   16
    9     0    1    9    9    1    0    0    4    9
    7     4    1    1    1    1    4    4    0    1
    8     1    0    4    4    0    1    1    1    4

Table 3.2: Result of γ(u, v) for each u, v (rows follow y1, columns follow y2)

 y1\y2    9    8    6    6    8    9    9    7    6
   10     1    5   21   37   41   42   43   52   68
    9     1    2   11   20   21   21   21   25   34
    7     5    2    3    4    5    9   13   13   14
    9     5    3   11   12    5    5    5    9   18
   10     6    7   19   27    9    6    6   14   25
   10     7   10   23   35   13    7    7   15   30
    9     7    8   17   26   14    7    7   11   20
    7    11    8    9   10   11   11   11    7    8
    8    12    8   12   13   10   11   12    8   11

The warping path here is

    W = {(1,1)_1, (2,2)_2, (3,3)_3, (3,4)_4, (4,5)_5, (5,6)_6, (6,6)_7, (7,7)_8, (8,8)_9, (9,9)_10}.    (3.12)

We can see the different alignments produced by the Euclidean distance and the DTW distance in Figure 3.2.

Figure 3.2: The left side shows the Euclidean alignment, which pairs y1(i) with y2(i). The right side shows the DTW alignment, which follows the warping path found in Table 3.2.
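The recurrence in Equation (3.7) is easy to implement directly; the following sketch in plain R (independent of any package) reproduces the two values computed above.

```r
## Direct implementation of the DTW recurrence in Equation (3.7).
## A sketch for illustration, not the implementation used by dtwclust.
dtw_distance <- function(yq, yr) {
  nq <- length(yq); nr <- length(yr)
  g <- matrix(Inf, nq + 1, nr + 1)       # cumulative-cost matrix with a padded border
  g[1, 1] <- 0
  for (u in 1:nq) {
    for (v in 1:nr) {
      d <- (yq[u] - yr[v])^2             # local squared distance, Equation (3.4)
      g[u + 1, v + 1] <- d + min(g[u, v], g[u, v + 1], g[u + 1, v])
    }
  }
  sqrt(g[nq + 1, nr + 1])                # DTW(yq, yr) = sqrt(gamma(nq, nr))
}

y1 <- c(10, 9, 7, 9, 10, 10, 9, 7, 8)
y2 <- c(9, 8, 6, 6, 8, 9, 9, 7, 6)
dtw_distance(y1, y2)                     # sqrt(11) = 3.32, as in Equation (3.11)
sqrt(sum((y1 - y2)^2))                   # Euclidean distance 4.58, Equation (3.9)
```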

This example shows that the DTW distance can measure the similarity and gives a smaller distance than the Euclidean distance. In the next section, we apply this method, hierarchical clustering with DTW, to the example of Chapter 2.

3.3 Example

We apply the method of this chapter to the same data as in Section 2.3. To calculate the distances between curves and to cluster them, we use the function tsclust in the R package dtwclust (Sarda-Espinosa, 2019; Sardá-Espinosa, 2017). We obtain the dendrogram shown in Figure 3.3. From this figure, we can see that y6 and y7 are merged into one group first, and the merging then continues gradually. In the end, the 10 curves are divided into two clusters.

Figure 3.3: The dendrogram of hierarchical clustering.

Figure 3.4: The two clusters obtained by hierarchical clustering with DTW.

Figure 3.4 shows the five curves in each of the two clusters; the left panel is the first group and the right panel is the second group. The first group is concave upward, coming from the first equation in Equation (2.12), and the second group is concave downward, coming from the second equation in Equation (2.12). Curves of similar shape are indeed clustered in the same group, and this clustering result is the same as the result in Section 2.3.
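A minimal sketch of this step with dtwclust is given below. It assumes the `curves` data frame built in the Chapter 2 sketch is available; the object names, and the assumption that a list of numeric vectors is passed to tsclust, are illustrative, while the choices shown in this section (hierarchical type, DTW distance, average linkage) come from the text.

```r
## Sketch: hierarchical clustering with DTW via dtwclust, applied to the
## simulated curves of the Chapter 2 sketch (`curves` data frame).
library(dtwclust)

curve_list <- split(curves$y, curves$id)   # one numeric vector per curve
hc <- tsclust(curve_list,
              type     = "hierarchical",
              k        = 2L,                              # cut the tree into two clusters
              distance = "dtw_basic",                     # DTW distance between curves
              control  = hierarchical_control(method = "average"))  # average linkage

hc@cluster                                 # cluster label of each curve
plot(hc)                                   # dendrogram, as in Figure 3.3
```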

Chapter 4
Simulation Study

We compare the two methods under different scenarios. We fix the number of groups at K = 3, with regression coefficients β_1 = [50  4  0]', β_2 = [10  2  0.1]', and β_3 = [250  −0.75  −0.1]'. We generate N curves in total. Each group has proportion π_k, and each curve has length n_j, where n_j is drawn randomly from {12, 13, 14, 15} for each j. We select several factors to explore the results in different scenarios: under equal or unequal mixing weights, we consider different numbers of curves, and for each scenario we compare the results under different variances. The conditions are:

(1) Mixing weights:
    (a) equal mixing weights: π = (1/3, 1/3, 1/3)
    (b) unequal mixing weights: π = (1/6, 1/3, 1/2)
(2) Number of curves: N = 60, 120, or 180
(3) Variance: σ = 10, 30, or 50

x_j is generated as {0, 3, 6, ..., 3n_j} for the given n_j. Based on x_j, y_j is generated from Equation (2.1) with the designed β_k, π_k, and σ. For each scenario, we simulate until the correct number of groups K is selected 500 times. In the mixture of regression models, we use BIC as the criterion to select the number of groups; in the second step, we divide the data into the selected number of groups and estimate the parameters. In hierarchical clustering with DTW, we calculate the distances between curves by DTW and divide the data by hierarchical clustering.

In the first section, we present the clustering results of the mixture of regression models, the bias and root mean square error (RMSE) of the estimates, and how these results differ across situations. The bias and RMSE are calculated as

    bias(θ̂) = \frac{1}{T} \sum_{m=1}^{T} (θ̂^{(m)} − θ) \Big/ θ,    (4.1)

    RMSE(θ̂) = \sqrt{\frac{1}{T} \sum_{m=1}^{T} (θ̂^{(m)} − θ)²},    (4.2)

where θ is the true value, θ̂^{(m)} is the estimate from the m-th repetition, and T is the number of repetitions. If the true value is zero, we replace the denominator of the bias by one. In the second section, we present the clustering results of DTW and compare the correctness of the results of the two methods.
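As a small aside, the relative bias in Equation (4.1), including the convention for a zero true value, and the RMSE in Equation (4.2) can be computed as in the sketch below; the function and argument names are illustrative.

```r
## Relative bias (Eq. 4.1) and RMSE (Eq. 4.2) for one parameter.
## `est` is the vector of estimates over the T = 500 replications,
## `truth` is the true value; names are illustrative.
rel_bias <- function(est, truth) {
  denom <- if (truth == 0) 1 else truth    # zero-true-value convention used in the text
  mean(est - truth) / denom
}
rmse <- function(est, truth) sqrt(mean((est - truth)^2))
```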

4.1 Clustering Results of the Mixture of Regression Models

When we use the mixture of regression models to divide the data, the data may not be divided into three groups. However, when comparing the two methods, we consider how much of the data is classified into the wrong group when the data are divided into three groups. Therefore, we repeat the simulation until the data are divided into three groups 500 times, and the discussion of the bias and RMSE of the parameters is likewise based on these 500 replications.

Table 4.1: The mean and RMSE of the estimates of π based on 500 replications

π = (1/3, 1/3, 1/3)
                   σ = 10            σ = 30            σ = 50
   N             mean   RMSE       mean   RMSE       mean   RMSE
  60    π1      0.334  0.015      0.340  0.047      0.336  0.031
        π2      0.333  0.015      0.327  0.047      0.331  0.031
        π3      0.333  0.007      0.330  0.028      0.331  0.022
 120    π1      0.341  0.051      0.338  0.039      0.338  0.040
        π2      0.325  0.051      0.329  0.039      0.329  0.040
        π3      0.328  0.038      0.332  0.015      0.331  0.026
 180    π1      0.342  0.053      0.340  0.047      0.340  0.047
        π2      0.325  0.053      0.327  0.047      0.327  0.047
        π3      0.327  0.045      0.330  0.024      0.330  0.026

π = (1/6, 1/3, 1/2)
                   σ = 10            σ = 30            σ = 50
   N             mean   RMSE       mean   RMSE       mean   RMSE
  60    π1      0.209  0.111      0.223  0.125      0.232  0.132
        π2      0.291  0.111      0.277  0.125      0.268  0.132
        π3      0.472  0.107      0.466  0.096      0.464  0.090
 120    π1      0.213  0.115      0.238  0.137      0.266  0.152
        π2      0.287  0.115      0.262  0.137      0.234  0.153
        π3      0.467  0.115      0.453  0.117      0.441  0.115
 180    π1      0.226  0.128      0.264  0.152      0.280  0.158
        π2      0.274  0.128      0.236  0.152      0.220  0.158
        π3      0.465  0.118      0.436  0.141      0.428  0.136

Here we use the relative bias because some of the parameters in this simulation are quite large and we want to compare the bias of the estimates on a common scale. However, when the true value is very small this measure becomes somewhat distorted, so when we present the estimation results for π we report the mean instead of the bias.

Table 4.1 shows the mean and RMSE of the estimates of π. When the mixing weights are equal, π is estimated well, even as σ or the number of curves becomes larger, and the RMSE of the estimates remains relatively small even when the number of curves or σ becomes larger. When the mixing weights are unequal, the mean and RMSE of the estimates are relatively large; however, the RMSE does not become consistently larger or smaller as the number of curves grows.

Table 4.2 and Table 4.3 show the bias and RMSE of the estimates when the mixing weights are equal and unequal, respectively.

Table 4.2: The bias and RMSE of the estimates based on 500 replications (equal mixing weights, π = (1/3, 1/3, 1/3))

  N   σ           β10     β11     β12     β20     β21     β22     β30     β31     β32
 60  10  bias    0.002  -0.001   0.000  -0.005   0.002   0.002   0.000  -0.003   0.001
         RMSE    1.614   0.069   0.005   1.516   0.068   0.005   1.324   0.051   0.005
     30  bias   -0.002  -0.006   0.000   0.001   0.016  -0.005   0.000   0.008  -0.007
         RMSE    4.685   0.193   0.016   4.727   0.210   0.015   4.044   0.154   0.015
     50  bias    0.003   0.000  -0.001   0.014   0.003  -0.002  -0.001  -0.009   0.002
         RMSE    6.846   0.284   0.023   6.795   0.268   0.024   6.715   0.258   0.026
120  10  bias   -0.009  -0.006   0.001   0.053   0.011  -0.013   0.000   0.000  -0.004
         RMSE    3.190   0.159   0.009   3.194   0.156   0.008   0.924   0.035   0.003
     30  bias   -0.005  -0.002   0.001   0.029   0.004  -0.004   0.000  -0.001   0.003
         RMSE    3.638   0.158   0.012   3.500   0.162   0.012   2.756   0.107   0.010
     50  bias    0.000  -0.002   0.001   0.004   0.007  -0.002   0.001  -0.011   0.015
         RMSE    5.425   0.219   0.018   5.028   0.223   0.019   4.474   0.173   0.018
180  10  bias   -0.010  -0.006   0.001   0.055   0.013  -0.014   0.000   0.001  -0.002
         RMSE    3.384   0.158   0.009   3.174   0.167   0.008   0.754   0.030   0.003
     30  bias   -0.006  -0.006   0.001   0.056   0.009  -0.016  -0.001  -0.004  -0.004
         RMSE    3.542   0.168   0.010   3.661   0.166   0.011   2.295   0.088   0.009
     50  bias    0.000  -0.007   0.001   0.024   0.010  -0.011  -0.001  -0.015   0.002
         RMSE    4.688   0.203   0.017   4.835   0.195   0.016   3.954   0.158   0.015

Table 4.3: The bias and RMSE of the estimates based on 500 replications (unequal mixing weights, π = (1/6, 1/3, 1/2))

  N   σ           β10     β11     β12     β20     β21     β22     β30     β31     β32
 60  10  bias   -0.065  -0.042   0.008   0.164   0.041  -0.039   0.000   0.002  -0.003
         RMSE    9.155   0.449   0.024   4.643   0.233   0.012   1.019   0.041   0.004
     30  bias   -0.082  -0.054   0.010   0.243   0.050  -0.063   0.001   0.011   0.009
         RMSE   11.212   0.548   0.032   6.393   0.292   0.020   3.741   0.147   0.014
     50  bias   -0.087  -0.063   0.011   0.273   0.055  -0.056   0.001   0.009   0.006
         RMSE   14.489   0.636   0.043   8.447   0.365   0.027   5.991   0.253   0.025
120  10  bias   -0.073  -0.046   0.009   0.189   0.046  -0.049   0.000   0.001   0.002
         RMSE    9.216   0.465   0.023   4.771   0.232   0.012   0.780   0.032   0.003
     30  bias   -0.118  -0.068   0.014   0.273   0.071  -0.070   0.000  -0.001  -0.002
         RMSE   11.608   0.570   0.031   5.957   0.296   0.017   2.816   0.118   0.011
     50  bias   -0.158  -0.100   0.020   0.399   0.093  -0.100   0.001   0.001   0.014
         RMSE   13.580   0.660   0.038   7.735   0.352   0.022   5.144   0.218   0.020
180  10  bias   -0.094  -0.060   0.012   0.237   0.060  -0.060   0.000   0.000   0.000
         RMSE   10.261   0.511   0.026   5.192   0.257   0.013   0.660   0.027   0.003
     30  bias   -0.153  -0.096   0.018   0.383   0.099  -0.099   0.000  -0.002   0.012
         RMSE   12.714   0.613   0.033   6.380   0.322   0.017   2.214   0.097   0.009
     50  bias   -0.184  -0.113   0.024   0.440   0.117  -0.115   0.001   0.013  -0.001
         RMSE   13.374   0.658   0.035   7.308   0.341   0.020   4.043   0.173   0.017

Table 4.2 shows that all of the biases of the estimates are quite small. The RMSEs are also small except for the intercept terms. The value 0.000 in the tables means that the value is less than 0.0005. In addition, both the RMSE and the bias of the estimates become larger as σ becomes larger; they also become larger as the number of curves becomes larger. Compared with Table 4.2, the estimates under unequal mixing weights in Table 4.3 have relatively large bias and RMSE, and again both become larger as σ and the number of curves increase.

4.2 Comparing Clustering Results of the Two Methods

The above are the parameter estimates of the mixture of regression models. Next we use the second method, hierarchical clustering with DTW, to divide the data and compare it with the first method, the mixture of regression models (MRM). We calculate the proportion of curves that are divided into the correct group when the data are divided into three groups; we call this the correct clustering rate (CCR) (Morris & Trivedi, 2009):

    CCR = \frac{1}{N} \sum_{c=1}^{K} p_c,    (4.3)

where N is the number of curves and p_c is the number of trajectories matched to the c-th cluster.
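Computing the CCR requires matching the labels of the found clusters to the true groups; a small sketch for K = 3, as used here, simply takes the best relabeling over all permutations. The function and argument names are illustrative.

```r
## Correct clustering rate, Equation (4.3), for K = 3 groups.
## `true` and `found` are integer label vectors (one entry per curve);
## the found labels are relabeled by the permutation that matches best.
ccr <- function(true, found) {
  perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
  max(sapply(perms, function(p) mean(p[found] == true)))
}
```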

Table 4.4: The mean correct clustering rate of the two methods

π = (1/3, 1/3, 1/3)
            N = 60            N = 120           N = 180
  σ       MRM    DTW        MRM    DTW        MRM    DTW
 10      0.999  1.000      0.987  1.000      0.985  1.000
 30      0.990  0.975      0.994  0.969      0.990  0.978
 50      0.991  0.767      0.988  0.712      0.986  0.697

π = (1/6, 1/3, 1/2)
            N = 60            N = 120           N = 180
  σ       MRM    DTW        MRM    DTW        MRM    DTW
 10      0.930  1.000      0.921  1.000      0.906  1.000
 30      0.910  0.963      0.882  0.967      0.839  0.968
 50      0.897  0.731      0.839  0.658      0.813  0.625

Table 4.4 shows the mean correct clustering rate over the 500 replications. The correct clustering rates of both methods are high when σ is small and decrease as σ becomes larger. Table 4.4 also shows that the correct clustering rate is higher under equal mixing weights than under unequal mixing weights. Comparing the two methods, DTW can divide the curves into the correct groups almost perfectly when σ is small; however, as σ gets larger, the correct clustering rate of DTW decreases much faster than that of the mixture of regression models. So when the variance is large, the mixture of regression models may be the better choice.

Chapter 5
Practical Data Analysis

The practical data are wind speed data from October 2017 to September 2018, recording the maximum wind speed per hour. We apply the two methods to these data to see whether we can identify the different maximum wind speed trends in different seasons. We treat one month of data as one curve, giving a total of 12 curves, which are shown in Figure 5.1. The first curve is the wind speed data for October 2017, the second is for November 2017, and so on; the last curve is the wind speed data for September 2018.

Figure 5.1: Wind speed data per month.

Table 5.1 shows the mean and standard deviation of the monthly wind speed data. The months from October 2017 to February 2018 have larger wind speeds, and the months from May 2018 to August 2018 have smaller standard deviations.

Table 5.1: The mean and standard deviation of the monthly wind speed data

 (mm/yy)  10/17  11/17  12/17   1/18   2/18   3/18   4/18   5/18   6/18   7/18   8/18   9/18
 mean     14.95  15.04  16.87  14.64  14.26   9.67   7.42   6.78   8.53   6.49   6.97   9.89
 sd        7.27   6.12   6.85   5.20   6.33   5.86   5.80   3.84   4.84   3.48   3.50   7.32

We first use the mixture of regression models to group the curves. According to the BIC, the selected number of clusters is three, and the estimated parameters β̂ of each group are given in Table 5.2. Based on these estimated values, we draw the regression lines in Figure 5.2. The figure shows that the three groups have different properties: low wind speed, increasing wind speed, and decreasing wind speed. Cluster 1 is the decreasing wind speed group, containing 11/2017, 12/2017, and 2/2018. Cluster 2 is the low wind speed group, containing March to August 2018. Cluster 3 is the increasing wind speed group, containing 10/2017, 1/2018, and 9/2018.

Table 5.2: The estimated parameters β̂ of each group

                    Cluster 1   Cluster 2   Cluster 3
 coef.(Intercept)   17.648602    7.173842    8.766866
 coef.x             -0.008070    0.015417   -0.002302
 coef.x2             0.000015    0.000008   -0.000006

Secondly, we use hierarchical clustering with DTW to group the curves.

Figure 5.2: Data and regression lines drawn from the estimated values.

We first obtain the dendrogram shown in Figure 5.3. From this dendrogram, the data are divided into three groups: the group on the left is December 2017 to February 2018; the middle group is March to August 2018; the group on the right is October and November 2017 together with September 2018.

Figure 5.3: The dendrogram of hierarchical clustering.

Table 5.3: The clustering results obtained by the two methods

 (mm/yy)  10/17  11/17  12/17   1/18   2/18   3/18   4/18   5/18   6/18   7/18   8/18   9/18
 MRM        3      1      1      3      1      2      2      2      2      2      2      3
 DTW        1      1      2      2      2      3      3      3      3      3      3      1

Table 5.3 compares the clustering results obtained by the two methods. March to August are grouped together by both methods, but the results for the other months differ slightly. Under the mixture of regression models, November 2017, December 2017, and February 2018 are placed in the same group, and October 2017, January 2018, and September 2018 are placed in the same group. Under hierarchical clustering, however, November 2017 is placed in the same group as October 2017 and September 2018, and January 2018 is placed in the same group as December 2017 and February 2018. From this result, we find that the grouping produced by hierarchical clustering with DTW is closer to the grouping suggested by the monthly averages, and it also shows that the maximum wind speed behaves differently in different seasons.

Chapter 6
Conclusions

In this thesis we compare two clustering methods for curve data. First, we consider the mixture of regression models with different numbers of components and use the EM algorithm to estimate the parameters. Second, we consider hierarchical clustering in which DTW is used to calculate the distance between two curves. We compare the two methods by the correct clustering rate. From the simulation study, we find that when the variation of the curves is small, hierarchical clustering with DTW gives the better result; when the variation of the curves is large, the mixture of regression models is the better choice. We also use the two methods to analyze a practical data set. According to the results, the clustering results of hierarchical clustering with DTW are more in line with our expectations: the behavior of the maximum wind speed differs across seasons.

There are still many problems that we can discuss in the future. First, in the mixture of regression models, when we use the function flexmix in R, it is sometimes unable

to divide the data into the number of groups we specify. Second, we chose the EM algorithm to estimate the parameters; the classification EM algorithm (CEM) (Celeux & Govaert, 1992) and the stochastic EM algorithm (SEM) (Celeux, 1985) could also be applied and compared, and partitional clustering, for example K-means or fuzzy C-means, could be considered with the DTW distance. Third, the curves we consider in this thesis are one-dimensional; two or more dimensions could also be discussed. Fourth, we assume the points within each curve are independent; the dependent case could be studied in the future. Finally, the maximum wind speed in the practical analysis could be analyzed using an AR(1) model.

References

Camargo, S. J., Robertson, A. W., Gaffney, S. J., Smyth, P., & Ghil, M. (2007). Cluster analysis of typhoon tracks. Part I: General properties. Journal of Climate, 20(14), 3635-3653.

Celeux, G. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.

Celeux, G., & Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 315-332.

Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.

DeSarbo, W. S., & Cron, W. L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5(2), 249-282.

Draper, N. R., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: John Wiley & Sons.

Gaffney, S. (2004). Probabilistic curve-aligned clustering and prediction with regression mixture models (Ph.D. dissertation). University of California, Irvine.

Gaffney, S., & Smyth, P. (1999). Trajectory clustering with mixtures of regression models. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 63-72).

Izakian, H., Pedrycz, W., & Jamal, I. (2015). Fuzzy clustering of time series data using dynamic time warping distance. Engineering Applications of Artificial Intelligence, 39, 235-244.

Lee, J. G., Han, J., & Whang, K. Y. (2007). Trajectory clustering: A partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (pp. 593-604).

Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1-18.

Leisch, F., & Gruen, B. (2012). Package 'flexmix'. https://cran.r-project.org/web/packages/flexmix/flexmix.pdf

Morris, B., & Trivedi, M. (2009). Learning trajectory patterns by clustering: Experimental studies and comparative evaluation. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 312-319).

Niennattrakul, V., & Ratanamahatana, C. A. (2007). On clustering multimedia time series data using K-means and dynamic time warping. In Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering (pp. 733-738).

Sakoe, H., & Chiba, S. (1971). A dynamic programming approach to continuous speech recognition. In Proceedings of the Seventh International Congress on Acoustics (Vol. 3, pp. 65-69).

Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 43-49.

Sardá-Espinosa, A. (2017). Comparing time-series clustering algorithms in R using the dtwclust package. Vienna: R Development Core Team.

Sarda-Espinosa, A. (2019). Package 'dtwclust'. https://cran.r-project.org/web/packages/dtwclust/dtwclust.pdf

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.

Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic Press.

Zheng, Y. (2015). Trajectory data mining: An overview. ACM Transactions on Intelligent Systems and Technology (TIST), 6(3), 1-41.

