Data Structures and Parallel Algorithms

Chapter 3. Research Framework and Design

Section 2. Research Design


National Chengchi University (國立政治大學)

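The time complexity of the main algorithm in this study is therefore $O\left(\frac{xn^2}{y^2} + \frac{xC^2}{y^2}\right) = O\left(\frac{x(n^2 + C^2)}{y^2}\right)$. With effective clustering, the number of clusters C in each partition is far smaller than the original data size n, so C can be ignored, giving $O\left(\frac{xn^2}{y^2}\right)$. Compared with the $O(xn^2)$ of the original kNN algorithm, the parallelized algorithm designed in this study reduces the running time by a factor of $y^2$.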
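As a rough numerical illustration of the bound above, the following Python sketch evaluates the cost expression for sample values. This passage does not restate the symbols, so the sketch assumes that x is the number of attributes compared per pair of records and y is the number of partitions processed in parallel; the function name and the sample figures are hypothetical.

```python
def comparison_cost(n, x, y=1, c=0):
    """Evaluate x * (n^2 + c^2) / y^2, the cost expression from the text.

    Assumed meanings (not restated in this passage): n = number of records,
    x = attributes compared per pair, y = number of parallel partitions,
    c = number of clusters produced per partition.
    """
    return x * (n ** 2 + c ** 2) / (y ** 2)


baseline = comparison_cost(n=10_000, x=50)        # plain kNN: x * n^2
parallel = comparison_cost(n=10_000, x=50, y=4)   # C ignored, as in the text
print(baseline / parallel)                        # 16.0, i.e. y^2
```

With four partitions the ratio is 16, matching the $y^2$ factor stated in the text.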

3.2.6 Data Structures and Parallel Algorithms

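This study parallelizes the kNN algorithm in the PHP programming language. The pseudocode of the key steps and the data storage structures are as follows:

(I) Data structures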

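1. class_info: array; records the information of each cluster.

(1) The key is the cluster number.

(2) [attr]: array; every non-zero term and its corresponding value (the mean).

(3) [cnt]: integer; the number of records in this cluster.

(4) [include]: array; the records this cluster contains, recorded by their numbers.

(5) [InterSim]: the within-cluster similarity. Only the sum of similarities is stored here; it is divided by the number of comparisons when clustering quality is computed.

2. pointlist: array; records the information of each record.

(1) The key is the record number.

(2) [attribute]: array; each [term] of the record and its corresponding value.

(3) [inclass]: string; the cluster to which this record is assigned.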
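The two structures listed above can be pictured as nested dictionaries. The following Python sketch is an illustration only: the thesis implementation is in PHP, and the terms, values, and the helper `is_consistent` are hypothetical. Record keys use the string form region number, "+", counter, and the cluster key is the number of the record that founded the cluster, both as in the partition-processing pseudocode.

```python
import math

# Hypothetical sample records; keys follow the "region_num+counter" string form.
pointlist = {
    "1+1": {"attribute": {"cloud": 0.8, "knn": 0.6}, "inclass": "1+1"},
    "1+2": {"attribute": {"cloud": 0.6, "knn": 0.8}, "inclass": "1+1"},
}

class_info = {
    "1+1": {
        "attr": {"cloud": 0.7, "knn": 0.7},  # per-term mean over the member records
        "cnt": 2,                            # number of records in the cluster
        "include": ["1+1", "1+2"],           # member record numbers
        "InterSim": 0.96,                    # running SUM of within-cluster similarities
    },
}


def is_consistent(class_info, pointlist):
    """Check that [cnt], [include], [inclass], and [attr] agree with each other."""
    for class_num, info in class_info.items():
        members = info["include"]
        if info["cnt"] != len(members):
            return False
        if any(pointlist[m]["inclass"] != class_num for m in members):
            return False
        for term, mean in info["attr"].items():
            total = sum(pointlist[m]["attribute"].get(term, 0.0) for m in members)
            if not math.isclose(total / len(members), mean):
                return False
    return True


print(is_consistent(class_info, pointlist))  # True
```

Storing only the similarity sum in [InterSim] keeps updates cheap; the division by the number of comparisons is deferred until quality is evaluated, as item (5) describes.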

(II) Thresholds for each stage


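Several thresholds can be tuned, depending on the processing stage. Each stage's thresholds are described below.

1. First-stage threshold:

The first-stage threshold is applied while each partition is processed; it decides whether a record joins its nearest cluster. The data are not necessarily tightly packed: when a newly added record is far from every cluster (that is, its similarity to the data in every cluster is low), the record should form a cluster of its own.

The first-stage threshold is set for exactly this case. It determines how similar an incoming record must be to an existing cluster before it joins that cluster; a record that falls short of the threshold becomes a new cluster.

To avoid repeatedly splitting and regrouping clusters during later integration, this study sets the first-stage threshold to a relatively high value, so an incoming record must be quite similar to an existing cluster to be assigned to it. The intent is to produce many small, compact clusters first, and to form more complete clusters during second-stage integration.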
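The effect of a high first-stage threshold can be seen in a toy example. The Python sketch below is a deliberate simplification and not the method of this study: it works on one-dimensional values, compares each value with cluster centroids instead of voting among the k nearest neighbours, and uses 1 / (1 + distance) as a stand-in similarity. The function name and data are hypothetical.

```python
def first_stage(values, threshold_01):
    """Greedy pass: join the most similar cluster if it clears threshold_01,
    otherwise start a new cluster."""
    clusters = []  # each cluster is a list of values
    for v in values:
        best, best_sim = None, -1.0
        for cluster in clusters:
            centroid = sum(cluster) / len(cluster)
            sim = 1.0 / (1.0 + abs(v - centroid))  # stand-in similarity in (0, 1]
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold_01:
            best.append(v)
        else:
            clusters.append([v])
    return clusters


data = [1.0, 1.2, 5.0, 5.1, 9.0]
print(len(first_stage(data, threshold_01=0.95)))  # 5: every value stays alone
print(len(first_stage(data, threshold_01=0.2)))   # 2: distant values get absorbed
```

A strict threshold yields many small clusters, which is the outcome the first stage aims for before integration; a loose one absorbs distant values into the first cluster.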

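2. Second-stage threshold:

The main task of the second stage is to integrate the clusters from all partitions into a single clustering result.

The threshold at this stage is the merge condition: two clusters are merged only when their similarity reaches the threshold, which is therefore the "minimum mergeable similarity".

Compared with the high first-stage threshold, the second-stage threshold is somewhat lower, so that integration does not leave a large number of small clusters, which would mean that no real integration took place. The value must be chosen with care: if it is too large, clusters cannot merge properly; if it is too small, clusters are merged excessively.

3. Third-stage thresholds: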


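The third stage checks cluster quality; clusters that fall short of the standard are split, regrouped, or re-clustered as the situation requires. Clustering quality depends on both between-cluster similarity and within-cluster similarity, so this stage has two thresholds, one for each.

Higher clustering quality requires a lower average between-cluster similarity and a higher average within-cluster similarity. Each cluster's within-cluster similarity must therefore be high, and the similarity between clusters must be low. The two thresholds of this stage are accordingly the maximum between-cluster similarity and the minimum within-cluster similarity. If the between-cluster similarity of two clusters exceeds the maximum between-cluster threshold, the two clusters are either merged or, depending on the situation, the one with the lower similarity is split and regrouped. If a cluster's within-cluster similarity is too low, the cluster is split and its data are re-clustered.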
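The two third-stage checks can be sketched as a simple filter. In the Python sketch below, the function name, the input dictionaries, and the sample values are hypothetical; `min_within` and `max_between` play the roles of `threshold_03_Inter` and `threshold_03_Intra` in the pseudocode, and the comparisons use `<=` and `>=` as the pseudocode does.

```python
def third_stage_flags(within, between, min_within, max_between):
    """Return (clusters to split, cluster pairs to merge or regroup).

    within:  {class_num: average within-cluster similarity}
    between: {(class_a, class_b): between-cluster similarity}
    """
    to_split = sorted(c for c, s in within.items() if s <= min_within)
    to_merge = sorted(p for p, s in between.items() if s >= max_between)
    return to_split, to_merge


within = {"A": 0.40, "B": 0.05, "C": 0.35}
between = {("A", "B"): 0.02, ("A", "C"): 0.30, ("B", "C"): 0.01}
print(third_stage_flags(within, between, min_within=0.10, max_between=0.25))
# (['B'], [('A', 'C')])
```

Cluster B is too loose internally and is flagged for splitting, while A and C are too similar to each other and are flagged as a pair.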

(III) Pseudocode: parallel processing in each partition

1. get data from GoogleSpreadsheet (or another database holding the saved data)
2. data_num_counter = 0
3. (loop start; stop when all data are classified)
4.   add 1 to data_num_counter
5.   data_num (as a string) = region_num '+' data_num_counter   // region_num is the partition number
6.   pointlist[data_num] = data obtained at step 1
7.   foreach loop : pointlist
8.   {
9.     point_num = key of pointlist   // a separate name, so that data_num keeps identifying the new record
10.    similarity[ point_num ] = 0
11.    foreach loop : attributes
12.    {
13.      attr_sim = compute attributes similarity   // depends on the similarity measure chosen
14.      similarity[ point_num ] = similarity[ point_num ] + attr_sim
15.    }
16.  }   // the similarity array now holds the similarity between the incoming record and each existing data point
17.  select the top k of similarity   // k is the k of kNN; this study sets it to 30
18.  for loop : top k of similarity array
19.  {
20.    class_num = pointlist[ key of current similarity entry ][inclass]
21.    avg_class_similarity[class_num] = mean similarity of the top-k points that belong to class_num
22.  }   // avg_class_similarity sums and averages, per cluster, the similarities of the k nearest points
23.  if : largest avg_class_similarity >= threshold_01   // first-stage threshold reached
24.  {   // threshold_01 is the similarity threshold for joining a cluster
25.    add to this class:
26.      [attr] : compute new attribute average
27.      [cnt] : add 1 to [cnt]
28.      [include] : include this data_num
29.      [InterSim] : add the sum of similarities between this record and the data in this class to [InterSim]
30.  }
31.  else   // first-stage threshold not reached; the record forms its own cluster
32.  {
33.    create a new class:
34.      key = data_num
35.      [attr] = attributes of this data
36.      [cnt] = 1
37.      [include] : include this data_num (the only member for now)
38.      [InterSim] = 0
39.  }
40.  if : count(class_info) >= n   // n depends on computing capacity; once the number of clusters reaches n,
41.  {                             // integration is triggered, so that it is not slowed by too many clusters
42.    send data to integration server
43.    clear class_info array
44.  }
45. back to step 3, until all data are classified into a class
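The neighbour-voting part of the procedure above (selecting the k most similar points and averaging their similarities per cluster) can be sketched in Python as follows. This is an illustration under assumptions, not the PHP code of the thesis: the function name and sample values are hypothetical, the similarities are taken as already computed, and the example uses k = 3 for readability where the study uses k = 30.

```python
import heapq
from collections import defaultdict


def best_cluster(similarity, inclass, k):
    """Pick the cluster with the highest mean similarity among the k nearest points.

    similarity: {data_num: similarity between that point and the new record}
    inclass:    {data_num: cluster the point currently belongs to}
    Returns (class_num, mean similarity), or (None, 0.0) if there are no points.
    """
    top_k = heapq.nlargest(k, similarity.items(), key=lambda item: item[1])
    sums, counts = defaultdict(float), defaultdict(int)
    for data_num, sim in top_k:
        class_num = inclass[data_num]
        sums[class_num] += sim
        counts[class_num] += 1
    if not sums:
        return None, 0.0
    avg = {c: sums[c] / counts[c] for c in sums}
    winner = max(avg, key=avg.get)
    return winner, avg[winner]


similarity = {"1+1": 0.875, "1+2": 0.625, "1+3": 0.5, "1+4": 0.125}
inclass = {"1+1": "1+1", "1+2": "1+1", "1+3": "1+3", "1+4": "1+3"}
threshold_01 = 0.7

class_num, score = best_cluster(similarity, inclass, k=3)
print(class_num, score)        # 1+1 0.75
print(score >= threshold_01)   # True: the record joins cluster "1+1"
```

When the best mean similarity falls below `threshold_01`, the record would instead start a new cluster keyed by its own number.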

(IV) Pseudocode: merging similar clusters across partitions

get integrated_class_info from GoogleSpreadsheet (or another database)
get class_info array from regional server
foreach loop : class_info
{
  foreach loop : integrated_class_info
  {
    class_num = key of integrated_class_info
    similarity[ class_num ] = 0
    foreach loop : attr
    {
      attr_sim = compute attributes similarity   // depends on the similarity measure chosen
      add attr_sim to similarity[ class_num ]
    }
  }
  get the largest similarity in similarity array
  // merge class_info into integrated_class_info below
  if : largest similarity >= threshold_02   // second-stage threshold reached
  {   // threshold_02 is the similarity threshold for merging clusters
    merge into this class (in integrated_class_info array) :
      [attr] : compute new attribute average as
               ( integrated_class_info[attr] * integrated_class_info[cnt] + class_info[attr] * class_info[cnt] )
               / ( integrated_class_info[cnt] + class_info[cnt] )
      [cnt] : add class_info[cnt] to [cnt]
      [include] : include all data_num in class_info[include]
      [InterSim] : add class_info[InterSim] to [InterSim],
                   and compute the similarity of data between the two classes
  }
  else : add this class to integrated_class_info array
}
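The merge step above combines two cluster records with a count-weighted mean. A minimal Python sketch follows; the function name and sample clusters are hypothetical, and the `cross_similarity` argument is an assumption standing in for the sum of similarities between the data of the two clusters, which the caller is expected to compute and which is assumed here to be added to [InterSim].

```python
def merge_cluster(target, source, cross_similarity=0.0):
    """Merge `source` into `target` in place using the count-weighted mean of [attr]."""
    total = target["cnt"] + source["cnt"]
    terms = set(target["attr"]) | set(source["attr"])
    # A term missing from one cluster counts as 0 for that cluster.
    target["attr"] = {
        t: (target["attr"].get(t, 0.0) * target["cnt"]
            + source["attr"].get(t, 0.0) * source["cnt"]) / total
        for t in terms
    }
    target["cnt"] = total
    target["include"] = target["include"] + source["include"]
    target["InterSim"] += source["InterSim"] + cross_similarity
    return target


a = {"attr": {"x": 1.0}, "cnt": 3, "include": ["1+1", "1+2", "1+3"], "InterSim": 2.0}
b = {"attr": {"x": 5.0, "y": 4.0}, "cnt": 1, "include": ["2+1"], "InterSim": 0.0}
merged = merge_cluster(a, b, cross_similarity=0.5)
print(merged["attr"]["x"], merged["attr"]["y"], merged["cnt"], merged["InterSim"])
# 2.0 1.0 4 2.5
```

Weighting by [cnt] keeps the merged centroid equal to the mean over all member records, so a large cluster is not pulled toward a small one.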

(V) Pseudocode: cluster quality check and re-clustering

get integrated_class_info from GoogleSpreadsheet (or another database)
compute the within-cluster similarity and save it as an array : sim_inter[], keyed by class_num
sort the sim_inter array in ascending order
while loop : smallest value in sim_inter array <= threshold_03_Inter
{   // threshold_03_Inter is the minimum acceptable within-cluster similarity
  class_num = key of the smallest value of sim_inter
  foreach loop : integrated_class_info[class_num][include] (data in this class)
  {
    compute similarity with integrated_class_info[attr] of each other class
    // compare only with the centroids of the existing clusters, avoiding the costly part of ordinary kNN
    find the most similar class and add to this class:
      [attr] : compute new attribute average
      [cnt] : add 1 to [cnt]
      [include] : include this data_num
      [InterSim] : add the sum of similarities between this record and the data in this class to [InterSim]
  }
  unset integrated_class_info[class_num]
  remove class_num from sim_inter   // so that the loop moves on to the next cluster
}
// next, adjust the between-cluster similarity
compute the between-cluster similarity and save it as an array : sim_intra[], keyed by the string class_num '+' class_num, i.e. the similarity of these two classes
sort the sim_intra array in descending order
while loop : largest value in sim_intra array >= threshold_03_Intra
{   // threshold_03_Intra is the maximum acceptable between-cluster similarity
  merge these two classes :
    [attr] : compute new attribute average
    [cnt] : add these two classes' [cnt]
    [include] : include all data_num in these two classes
    [InterSim] : add these two classes' [InterSim],
                 and compute the similarity of data between the two classes
  unset integrated_class_info[(second)class_num]
  update sim_intra for the merged class and drop the entries of the removed class   // so that the loop condition reflects the current clusters
}
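The first loop above dissolves a low-quality cluster and hands each of its records to the most similar remaining centroid. The Python sketch below illustrates that step only. It is not the thesis implementation: cosine similarity over sparse term dictionaries is a stand-in for whichever measure is chosen, the [InterSim] update is omitted for brevity, at least one other cluster is assumed to remain, and all names and sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {term: value} dictionaries."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def dissolve(class_num, clusters, pointlist):
    """Remove cluster `class_num`; move each member to the most similar remaining centroid."""
    removed = clusters.pop(class_num)
    for data_num in removed["include"]:
        point = pointlist[data_num]["attribute"]
        # Compare with centroids only, never with individual records.
        target_num = max(clusters, key=lambda c: cosine(point, clusters[c]["attr"]))
        target = clusters[target_num]
        n = target["cnt"]
        terms = set(target["attr"]) | set(point)
        target["attr"] = {
            t: (target["attr"].get(t, 0.0) * n + point.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        target["cnt"] = n + 1
        target["include"].append(data_num)
        pointlist[data_num]["inclass"] = target_num
    return clusters


clusters = {
    "A": {"attr": {"x": 1.0}, "cnt": 1, "include": ["1+1"]},
    "B": {"attr": {"y": 1.0}, "cnt": 1, "include": ["1+2"]},
    "C": {"attr": {"x": 0.5, "y": 0.5}, "cnt": 2, "include": ["1+3", "1+4"]},
}
pointlist = {
    "1+1": {"attribute": {"x": 1.0}, "inclass": "A"},
    "1+2": {"attribute": {"y": 1.0}, "inclass": "B"},
    "1+3": {"attribute": {"x": 0.9, "y": 0.1}, "inclass": "C"},
    "1+4": {"attribute": {"x": 0.1, "y": 0.9}, "inclass": "C"},
}
dissolve("C", clusters, pointlist)
print(clusters["A"]["include"], clusters["B"]["include"])
# ['1+1', '1+3'] ['1+2', '1+4']
```

Each reassignment costs one comparison per remaining cluster, which is the saving the pseudocode comment refers to when it avoids a full kNN pass.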

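[...] are each computed with the parallelized kNN algorithm. In this stage, similarity is based on the Euclidean distance (denoted d), taking [...]

Clustering time (s)                 441.4271  443.2496  437.1011  454.6090  414.7192  438.2212
Average between-cluster similarity    0.0150    0.0148    0.0148    0.0158    0.0145    0.0150
Average within-cluster similarity     0.0290    0.0290    0.0304    0.0265    0.0286    0.0287
Clustering quality                    1.9325    1.9646    2.0567    1.6756    1.9759    1.9211

Note: all figures above are rounded to four decimal places after computation.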

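Source document: 適用於雲端分散儲存架構下的kNN平行演算法之研究 (A Study of a kNN Parallel Algorithm for Cloud Distributed Storage Architectures), 政大學術集成 (NCCU Academic Hub), pp. 32-40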
