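Predicting Unknown Data - The GACMS Method - Building a University Leave-of-Absence and Dropout Prediction System Using Generalized Associative Classification with Multiple Supports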

Chapter 4 The GACMS Method

4.3 Predicting Unknown Data

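When an unknown record is input, the GACMS method first searches the prediction model for all rules whose conditions match the record. It then compares the rules against the record's conditions one by one, in descending order of rule strength (confidence first, then support), until a match is found. In other words, the record is classified by the strongest of the classification rules whose conditions it satisfies. For example: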

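Unknown record:

{Kaohsiung City, Dept. of Computer Science and Information Engineering, weekly absence level s}

Rules in the model:

{Kaohsiung City, Dept. of Computer Science and Information Engineering, weekly absence level s → leave of absence} (conf = 40%, sup = 80%)
{Kaohsiung City, Dept. of Computer Science and Information Engineering, weekly absence level s → dropout} (conf = 40%, sup = 30%)
{Kaohsiung City, Dept. of Computer Science and Information Engineering, weekly absence level s → normal} (conf = 10%, sup = 95%)

Classification result: leave of absence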
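The matching procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the `Rule` class, the `predict` function, and the simple subset test used for matching are assumptions introduced here, and the abbreviation "CSIE" stands in for the department name. Matching against hierarchy-tree ancestors is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: frozenset  # items that must all appear in the record
    label: str             # class predicted by the rule
    conf: float            # confidence
    sup: float             # support


def predict(record, rules):
    """Return the label of the strongest matching rule, or None if no rule matches."""
    items = set(record)
    # Strongest first: compare confidence, then support as the tie-breaker.
    ranked = sorted(rules, key=lambda r: (r.conf, r.sup), reverse=True)
    for rule in ranked:
        if rule.conditions <= items:  # every condition of the rule is satisfied
            return rule.label
    return None


record = {"Kaohsiung City", "CSIE", "weekly absence level s"}
cond = frozenset(record)
rules = [
    Rule(cond, "normal", 0.10, 0.95),
    Rule(cond, "dropout", 0.40, 0.30),
    Rule(cond, "leave of absence", 0.40, 0.80),
]
print(predict(record, rules))  # leave of absence
```

The two rules with 40% confidence tie, so the higher support (80%) decides in favor of "leave of absence", in line with the example above.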

Chapter 5 Experiments and Results

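In this chapter, we describe the implementation of the GACMS algorithm and compare its performance with that of other algorithms.

5.1 Experimental Design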

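For the implementation platform, we used a personal computer with the following environment:

Operating system: Windows 7

Processor: Intel® Core™ i5 M430 @ 2.27GHz

Main memory: DDR3, 8192 MB

Hard disk: Intel SSD, 480 GB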

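For the database, we chose Microsoft SQL Server 2008. All programs and system interfaces were developed with C# (Windows Forms) and WEKA 3.6.6.

The training data are described in detail in Chapter 2. With reference to expert opinion, we built three hierarchy trees, for address, department, and number of weeks absent, as shown in Figures 3.3, 3.4, and 3.5.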

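It is also worth noting that this leave-of-absence and dropout data set has an extremely imbalanced class distribution: the vast majority of cases are normal, and leave-of-absence and dropout cases together account for less than 20%. We therefore drew an equal number of records from each of the three classes. With unequal class sizes, the classification rules learned by the system would all lean toward the largest class, and the model would ultimately fail to classify effectively.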
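Equal-size sampling of this kind can be sketched as below. The thesis does not say how the records were drawn, so random undersampling down to the size of the smallest class is an assumption; `balanced_sample`, `label_of`, and the toy class counts are hypothetical names and values used only for illustration.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(records, label_of, seed=0):
    """Randomly keep the same number of records per class (the smallest class size)."""
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, n))
    return sample


# Toy data (hypothetical counts): 80 normal, 12 leave-of-absence, 8 dropout records.
records = (
    [("normal", i) for i in range(80)]
    + [("leave of absence", i) for i in range(12)]
    + [("dropout", i) for i in range(8)]
)
balanced = balanced_sample(records, label_of=lambda r: r[0])
counts = Counter(label for label, _ in balanced)
print(counts)
```

Each class ends up with as many records as the smallest class, so no class dominates the rules that are learned.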

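5.2 System Description

As shown in Figure 5.1, the basic system settings, such as the database connection, training data conditions, test data conditions, minimum support adjustment parameter, minimum item occurrence threshold, and rule ordering, can all be configured on the Setup page.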

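Figure 5.1 System setup screen

After setup, the training data view (Figure 5.2) can be used to check that the training data are correct. The test data can likewise be inspected in the test data view (Figure 5.3).

Figure 5.2 Training data view

Figure 5.3 Test data view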

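On the Feature Selection page, the user selects the data fields used to build the model (Figure 5.4). The condition and hierarchy settings page (Figure 5.5) is then used to set the conditions and hierarchical relationships of the items being discriminated, and the hierarchy tree page (Figure 5.6) displays the resulting hierarchy for each item. This page also lets experts add items that do not appear in the data to the hierarchy tree, for example by placing Kaohsiung City under Southern Taiwan.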

Figure 5.4 Feature selection screen

Figure 5.5 Condition and hierarchy settings screen

Figure 5.6 Hierarchy tree screen
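A hierarchy tree of this kind can be represented as a child-to-parent map. The sketch below is only an illustration under stated assumptions: the root item "Address", the abbreviation "CSIE", and the functions `ancestors` and `extend_record` are hypothetical. Extending a record with its ancestors follows the extended-transaction idea used in generalized association rule mining [23]; whether GACMS does exactly this is not stated in this section.

```python
# Child -> parent map. "Southern Taiwan" is an expert-defined item that does not
# occur in the raw data; "Address" is a hypothetical root added for illustration.
parent = {
    "Kaohsiung City": "Southern Taiwan",
    "Southern Taiwan": "Address",
}


def ancestors(item, parent):
    """Return the ancestors of an item, nearest first."""
    chain = []
    seen = {item}
    while item in parent:
        item = parent[item]
        if item in seen:  # guard against a malformed (cyclic) hierarchy
            raise ValueError("cycle in hierarchy")
        seen.add(item)
        chain.append(item)
    return chain


def extend_record(record, parent):
    """Add every ancestor of every item so that generalized rules can match."""
    extended = set(record)
    for item in record:
        extended.update(ancestors(item, parent))
    return extended


print(ancestors("Kaohsiung City", parent))
print(sorted(extend_record({"Kaohsiung City", "CSIE"}, parent)))
```

With the record extended in this way, a rule stated at the level of "Southern Taiwan" can apply to a student whose address is recorded only as Kaohsiung City.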

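Next, the training data are loaded and the model is built with the GACMS algorithm (Figure 5.7). On this page, the system lists information on the enhanced FP-tree, the CR-tree, and the learned rules, and prunes the rules according to the ordering configured earlier (Figure 5.1).

Figure 5.7 Prediction model building screen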

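The user can then load the test data into the model and classify them with the learned rules. The system compares the classification results with the actual outcomes recorded in the test data and computes the metric values for each class (Figure 5.8).

Figure 5.8 Test results screen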

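The classification list is shown in Figure 5.9. The system lists the GACMS classification result in the "Predicted class" column; two short dashes "--" indicate that none of the classification rules in the model could classify the record.

Figure 5.9 Classification list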
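The per-class evaluation can be sketched as follows, assuming the metrics are per-class precision and recall. The function `per_class_metrics` and the toy labels are hypothetical. One design choice is made explicit here: a record marked "--" is never counted as a prediction for any class, so it lowers recall for its true class but leaves precision unaffected.

```python
def per_class_metrics(actual, predicted):
    """Return {class: (precision, recall)} computed from parallel label lists."""
    metrics = {}
    for c in sorted(set(actual)):
        tp = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        n_predicted = sum(1 for p in predicted if p == c)
        n_actual = sum(1 for a in actual if a == c)
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_actual if n_actual else 0.0
        metrics[c] = (precision, recall)
    return metrics


# Toy data: the second record could not be classified by any rule ("--").
actual = ["leave", "leave", "dropout", "normal", "normal"]
predicted = ["leave", "--", "dropout", "leave", "normal"]
for cls, (p, r) in per_class_metrics(actual, predicted).items():
    print(cls, p, r)
```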

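5.3 Analysis of Experimental Results

5.3.1 Prediction Accuracy Analysis

This experiment uses data from four departments, “Architectural Engineering”, “Business Management”, “Construction Engineering”, and “Digital Content”, which belong to three colleges (see Figure 3.5).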

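For accuracy, we compare our method with four algorithms commonly used in the past for leave-of-absence and dropout classification: NaiveBayes, Logistic, JRip, and DecisionTree (NB-Tree). All four were run with WEKA (version 3.6.6). For testing, 5,687 records from academic years 96 to 100 (ROC calendar) were used as training data, and 101 …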

In this experiment, α = 0.3 performed best.

Figure 5.11 Comparison of different multiple-support weight parameters

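5.3.2 Execution Speed Analysis

In this subsection we examine execution speed, again comparing our method with NaiveBayes, Logistic, JRip, and DecisionTree.

As shown in Table 5.1, our method takes the longest. This is because the rules we learn are more fragmented and our program runs on a single thread, so building the model is much slower than with the methods provided by WEKA, which commonly use multithreading. In future work we will also implement our method with multiple threads.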

[Figure 5.11 axis labels: multiple-support weight parameter; Precision or Recall value]

Chapter 6 Conclusions and Future Research

6.1 Conclusions

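In this study, we addressed the problem of predicting university students' leaves of absence and dropouts by building a prediction system that can effectively predict, at the end of each semester, whether an enrolled student will take a leave of absence, drop out, or remain enrolled. The system can also analyze the mined classification rules and output the most important ones as a reference for decision making by the relevant staff. Within this system we developed a new classification method, called GACMS. The method is based on associative classification and adds a multiple-support mechanism. This mechanism addresses the problem of infrequent but important classification conditions being filtered out by a single support threshold, so that more useful classification information is retained when the model is generated and the model's prediction accuracy improves. The method also uses expert-defined hierarchy trees, which give scattered but useful rules the chance to be aggregated into classification rules of sufficient strength, and it prunes redundant conditions from the rules in the trained model, making the classification rules more compact.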
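The effect of multiple supports can be illustrated with a small sketch. It follows the item-specific minimum support of Liu, Hsu, and Ma [13], where MIS(i) = max(β·f(i), LS) and an itemset is tested against the lowest MIS among its items. This is an assumption used for illustration: the exact formula and weight parameter used by GACMS may differ, and the names `item_min_supports` and `is_frequent`, the value β = 0.3, and the item frequencies are all hypothetical.

```python
def item_min_supports(item_freq, beta, least_support):
    """MIS(i) = max(beta * f(i), least_support), in the style of Liu, Hsu, and Ma [13]."""
    return {item: max(beta * f, least_support) for item, f in item_freq.items()}


def is_frequent(itemset, support, mis):
    """An itemset passes if its support reaches the lowest MIS among its items."""
    return support >= min(mis[item] for item in itemset)


# Hypothetical item frequencies: one common condition and one rare condition.
freq = {"Kaohsiung City": 0.30, "weekly absence level s": 0.04}
mis = item_min_supports(freq, beta=0.3, least_support=0.01)

itemset = {"Kaohsiung City", "weekly absence level s"}
support = 0.03                                   # hypothetical support of the pair
single_threshold = 0.05                          # a single global minimum support
print(support >= single_threshold)               # False: dropped by a single threshold
print(is_frequent(itemset, support, mis))        # True: kept with item-specific thresholds
```

Because the rare condition gets a lower threshold of its own, the pair survives mining, whereas a single global threshold set high enough to control the common items would discard it.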

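In the experiments, our method was slower than the other algorithms. However, tests on the test data confirmed that the classification rules it produces are far more precise than those of the other methods.

In summary, the proposed GACMS method is well suited to problems in which the frequencies of the various conditions in the data set differ greatly and rich hierarchical classification information is available. Besides studying student leaves of absence and dropouts, for example, it could be used to study students' future employment, or applied in other domains to mine useful classification rules.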

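6.2 Future Work

Student leave-of-absence and dropout prediction has been studied extensively, but to our knowledge the use of associative classification for this problem is a new attempt. In the future, we will build on this study and continue research in the following directions: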

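(1) In this study we found that some leave-of-absence and dropout rules are temporal. For example, more than 60% of the students who dropped out had previously taken a leave of absence. Such rules could be found by adding a temporal classification rule mining method, which would increase the accuracy of the classification model.

(2) In the multiple-support association rule mining used in this thesis, the weight parameter plays a very important role. A previous study [25] proposed adjusting this parameter according to confidence. In the future, this approach could be consulted to find a better way of tuning the parameter and thereby speed up model training.

(3) Because this study uses multiple-support association rule mining, much useful information is retained, but computation is also slowed considerably. Effective use of multitasking should resolve this problem. How to split the proposed method into multiple threads that run without interfering with one another is therefore another direction for further research. At present the most feasible place is the rule sorting and pruning step of GACMS: because each rule is sorted and pruned as soon as it appears, this step is a strong candidate for parallel execution, provided that the threads can be kept from contending for resources.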

References

[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.

[2] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.

[3] F. Araque, C. Roldan, and A. Salguero, “Factors influencing university drop out rates,” Computers & Education, vol. 53, pp. 563–574, 2009.

[4] G.W. Dekker, M. Pechenizkiy, and J.M. Vleeshouwers, “Predicting students drop out: A case study,” in Proceedings of the 2nd International Conference on Educational Data Mining, pp. 41–50, 2009.

[5] M. Feng, N. Heffernan, and K. Koedinger, “Looking for sources of error in predicting student’s knowledge,” in Proceedings of AAAI Workshop on Education Data Mining, pp. 1–8, 2005.

[6] J. Han and Y. Fu, “Discovery of multiple-level association rules from large databases,” in Proceedings of the 21st International Conference on Very Large Data Bases, pp. 420-431, 1995.

[7] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000.

[8] S. Kotsiantis, “Educational data mining: A case study for predicting dropout-prone students,” International Journal of Knowledge Engineering and Soft Data Paradigms, vol. 1, no. 2, pp. 101–111, 2009.

[9] S. Kotsiantis, K. Patriarcheas, and M. Xenos, “A combinational incremental ensemble of classifiers as a technique for predicting students performance in distance education,” Knowledge Based Systems, vol. 23, no. 6, pp. 529–535, 2010.

[10] S.B. Kotsiantis and P.E. Pintelas, “Predicting students’ marks in Hellenic Open University,” in Proceedings of IEEE International Conference on Advanced Learning Technologies, pp. 664–668, 2005.

[11] W. M. Li, J. W. Han, and J. Pei, “CMAR: Accurate and efficient classification based on multiple class-association rules,” in Proceedings of IEEE International Conference on Data Mining, pp. 369-376, 2001.

[12] B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 80–86, 1998.

[13] B. Liu, W. Hsu, and Y. Ma, “Mining association rules with multiple minimum supports,” in Proceedings of SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 337-341, 1999.

[14] C. L. Lui and F. L. Chung, “Discovery of generalized association rules with multiple minimum supports,” in Proceedings of 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 510-515, 2000.

[15] I. Lykourentzo, I. Giannoukos, V. Nikolopoulos, G. Mpardis, and V. Loumos, “Dropout prediction in elearning courses through the combination of machine learning techniques,” Computers & Education, vol. 53, pp. 950–965, 2009.

[16] C. Márquez-Vera, A. Cano, C. Romero, and S. Ventura, “Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data,” Applied Intelligence, vol. 38, no. 3, pp. 315-330, 2013.

[17] C. Marquez-Vera, C. Romero, and S. Ventura, “Predicting school failure using data mining,” in Proceedings of 4th International Conference on Educational Data Mining, pp. 271-276, 2011.

[18] D. Martinez, “Predicting student outcomes using discriminant function analysis,” in Proceedings of Annual Meeting of the Research and Planning Group, pp. 163–173, 2001.

[19] G. Mendez, T.D. Buskirk, S. Lohr, and S. Haag, “Factors associated with persistence in science and engineering majors: An exploratory study using classification trees and random forests,” Journal of Engineering Education, vol. 97, no. 1, pp. 57-70, 2008.

[20] A. Parker, “A study of variables that predict dropout from distance education,” International Journal of Educational Technology, vol. 1, no. 2, pp. 1–11, 1999.

[21] M.N. Quadril and N.V. Kalyankar, “Drop out feature of student data for academic performance using decision tree techniques,” Journal of Computer Science and Technology, vol. 10, pp. 2–5, 2010.

[22] C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 40, no. 6, pp. 601-618, 2010.

[23] R. Srikant and R. Agrawal, “Mining generalized association rules,” in Proceedings of the 21st International Conference on Very Large Data Bases, pp. 407-419, 1995.

[24] J.F. Superby, J.P. Vandamme, and N. Meskens, “Determination of factors influencing the achievement of the first year university students using data mining methods,” in Proceedings of AAAI Workshop on Educational Data Mining, pp. 1–8, 2006.

[25] M.C. Tseng and W.Y. Lin, “Maintenance of generalized association rules with multiple minimum supports,” Intelligent Data Analysis, vol. 8, pp. 417-436, 2004.

[26] M.C. Tseng and W.Y. Lin, “Efficient mining of generalized association rules with non-uniform minimum support,” Data and Knowledge Engineering, vol. 62, no. 1, pp. 41-64, 2007.

[27] T.Y. Tang and G. Mccalla, “Student modeling for a web-based learning environment: A data mining approach,” in Proceedings of the 18th National Conference on Artificial Intelligence, pp. 967–968, 2002.

[28] W. Veitch, “Identifying characteristics of high school dropouts: Data mining with a decision tree model,” in Proceedings of Annual Meeting of the American Educational Research Association, pp. 1–11, 2004.

[29] L. Wegner, A.J. Flisher, P. Chikobvu, C. Lombard, and G. King, “Leisure boredom and high school dropout in Cape Town, South Africa,” Journal of Adolescence, vol. 31, pp. 421–431, 2008.

[30] M.V. Yudelson, O. Medvedeva, E. Legowski, M. Castine, D. Jukic, and D. Rebecca, “Mining student learning data to develop high level pedagogic strategy in a medical ITS,” in Proceedings of AAAI Workshop on Educational Data Mining, pp. 1–8, 2006.

[31] ETToday 東森新聞雲, “Ministry of Education to launch a plan in ROC year 102 to merge six national universities,” http://www.ettoday.net/news/20121119/129267.htm

[32] P.N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining: Pearson, 2005, pp. 373-374.
