Data Mining with Uncertain Data


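Department of Electrical Engineering (Graduate Institute), National University of Kaohsiung. Master's Thesis. Data Mining with Uncertain Data. Graduate student: Chih-Wei Wu. Advisor: Dr. Tzung-Pei Hong. July 2009 (Republic of China year 98).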

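Chinese Abstract

Data Mining with Uncertain Data

Advisor: Dr. Tzung-Pei Hong, Institute of Computer Science and Information Engineering, National University of Kaohsiung.
Student: Chih-Wei Wu, Institute of Electrical Engineering, National University of Kaohsiung.

Abstract

Machine learning and data mining are two important techniques for extracting information from datasets. Although both can now handle very large amounts of data, some values may be lost while the data are collected, and such incomplete data are usually given appropriate treatment to improve data quality. Recovering missing data has therefore come to be regarded as an important issue in both machine learning and data mining. This thesis proposes a mechanism that combines iterative recovery of missing data with different support-counting methods to derive reliable association rules. The first method uses the support calculation of robust association rules to derive reliable association rules, and then uses those rules to iteratively infer the missing data in the database. The second method uses the partial-support calculation to derive reliable association rules, and then uses those rules to iteratively infer the missing data in the database. Both methods can assign a reasonable value to every missing item and thus provide a dataset of higher quality. The iterative recovery mechanism consists of three phases. The first phase uses association rules to roughly recover some of the missing data. The second phase iteratively lowers the support threshold to obtain more association rules, and uses the new rules to recover all the remaining missing data. The third phase uses the association rules mined from the recovered dataset to correct the recovered values and obtain more accurate guesses. Finally, experimental results show that both methods achieve a good data recovery rate and recovery accuracy even when the missing rate is high.

Keywords: association rule, data mining, missing data, incomplete data, support.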

English Abstract

Data Mining with Uncertain Data

Advisor: Dr. Tzung-Pei Hong, Institute of Computer Science and Information Engineering, National University of Kaohsiung.
Student: Chih-Wei Wu, Institute of Electrical Engineering, National University of Kaohsiung.

Abstract

Machine learning and data mining are two important techniques for extracting valuable information from datasets. Although current mining and learning technologies can handle large amounts of data, the rapid growth of datasets may cause some attribute values to be missed in the data-gathering process. Incomplete data are usually handled appropriately to improve the quality of the discovered information. Therefore, recovering missing values from a dataset has become an important research issue in data mining and machine learning. In this thesis, we first introduce an iterative missing-value completion method based on the RAR support values, which extracts useful association rules for inferring missing values in an iterative way. The proposed method can fully infer the missing attribute values by combining an iterative mechanism with data mining techniques. It consists of three phases. The first phase uses the association rules to roughly complete the missing values. The second phase iteratively reduces the minimum support to gather more association rules and complete the rest of the missing values. The third phase uses the association rules from the completed dataset to correct the missing values that have been filled in. The proposed approach is then slightly modified to consider partial support values in deriving missing values. The second approach performs slightly better than the first because it uses more information (incomplete tuples) in guessing. Experimental results show that both proposed approaches achieve good accuracy and data recovery even when the missing-value rate is high.

Keywords: association rule, data mining, missing value, incomplete data, support.

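Acknowledgements

For the successful completion of this thesis, I would first like to thank my advisor, Professor Tzung-Pei Hong. Despite his busy schedule, he spared no effort in giving professional and efficient guidance on the conception of the topic, the research direction, the writing, and the experimental simulations. Under his guidance I learned how to do good research and how to write a professional thesis, which allowed me to grow noticeably, gain a great deal in just two years, and complete my master's degree. I offer him my sincere respect and gratitude.

Next, I would like to thank the members of my oral defense committee: Professor 潘正祥 of the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences; Professor 李建興 of the Department of Computer Science and Information Engineering, National University of Tainan; and Professor 吳志宏 of the Department of Electrical Engineering at this university. Their professional suggestions and valuable guidance during the defense made this thesis more complete.

I also thank the senior members of the Artificial Intelligence Laboratory at National University of Kaohsiung, 俊豪, 浚瑋, 國誠 and 韋体, who always offered valuable advice when I ran into difficulties in my research, and my classmates 卓翰, 博正, 欣怡, 正男 and 升馨, with whom I grew, strove and shared encouragement over these two years. My thanks also go to the junior members 偉屏 and 廷一 for bringing a good learning atmosphere to the laboratory.

Finally, I dedicate my results and my joy to all the family members, teachers and friends who care about me.

Chih-Wei Wu, July 2009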

Table of Contents

Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
CHAPTER 1 Introduction
  1.1 Problem Definition and Motivation
  1.2 Contributions
  1.3 Reader's Guide
CHAPTER 2 Review of Related Works
  2.1 Incomplete Datasets
  2.2 Review of the Apriori Algorithm
  2.3 Review of the RAR Approach
  2.4 Missing-Value Completion Method with the RAR Approach
  2.5 Review of ~AR Approach
CHAPTER 3 The First Proposed Method
  3.1 The Idea
  3.2 The Algorithm
  3.3 An Example
CHAPTER 4 The Second Proposed Method
  4.1 The Idea
  4.2 Some Definitions
  4.3 The Proposed Algorithm
  4.4 An Example
CHAPTER 5 Experimental Results
  5.1 Experimental Results of the First Method
  5.2 Experimental Results of the Second Proposed Method
CHAPTER 6 Conclusions and Future Works
References

List of Figures

Fig 5-1: The comparison of accuracy for RAR-MVC and our methods on TAE
Fig 5-2: The comparison of recovery rates for RAR-MVC and our methods on TAE
Fig 5-3: The comparison of accuracy for RAR-MVC and our method on Tic-Tac-Toe
Fig 5-4: The comparison of recovery rates for RAR-MVC and our method on Tic-Tac-Toe
Fig 5-5: The comparison of accuracy for RAR-MVC and our methods on TAE
Fig 5-6: The comparison of recovery rates for RAR-MVC and our methods on TAE
Fig 5-7: The comparison of accuracy for RAR-MVC and our method on Tic-Tac-Toe
Fig 5-8: The comparison of recovery rates for RAR-MVC and our method on Tic-Tac-Toe

List of Tables

Table 2-1: An incomplete dataset
Table 2-2: An incomplete dataset in the example
Table 2-3: An incomplete dataset in the example
Table 3-1: The dataset in the example
Table 3-2: The incomplete dataset in the example
Table 3-3: All the frequent itemsets with their RAR-supports in the example
Table 3-4: The association rules with support and confidence values after STEP 3
Table 3-5: The dataset after STEP 4
Table 3-6: The candidate 2-itemset formed in STEP 8
Table 3-7: The reduced frequent 2-itemsets formed in STEP 9
Table 3-8: The reduced association rules AR’ with their support and confidence values
Table 3-9: The dataset after STEP 11
Table 3-10: The most frequent attribute value for each attribute in the original dataset D
Table 3-11: The dataset after Phase 2
Table 3-12: The new set of association rules after STEP 14
Table 3-13: The dataset after adjustment
Table 3-14: The final association rules in the example
Table 3-15: The finally completed dataset
Table 4-1: An incomplete dataset in the example
Table 4-2: The dataset in the example
Table 4-3: The incomplete dataset in the example
Table 4-4: All the frequent itemsets with their partial supports in the example
Table 4-5: The candidate 2-itemset formed in STEP 8
Table 4-6: The reduced frequent 2-itemsets formed in STEP 9
Table 4-7: The reduced association rules AR’ with their support and confidence values
Table 4-8: The dataset after STEP 11
Table 4-9: The reduced association rules AR’ with their support and confidence values
Table 4-10: The dataset after second iteration of Phase 2
Table 4-11: The new set of association rules after STEP 14
Table 4-12: The dataset after adjustment
Table 4-13: The final association rules in the example
Table 4-14: The finally completed dataset
Table 5-1: Characteristics of the two experimental datasets and the used threshold values

CHAPTER 1 Introduction

1.1 Problem Definition and Motivation

Modern enterprises use knowledge management techniques to obtain competitive advantages. Effective knowledge management usually needs proper tools to extract business knowledge from large databases [9][10]. Machine learning and data mining are two important techniques for this purpose: they attempt to mine hidden and useful business knowledge from large datasets. Typically, an enterprise continuously collects and stores important business data, resulting in a huge growth of data. Although current mining technologies can handle large amounts of data [1][2][3][4], the rapid growth of datasets may cause some attribute values to be missed in the data-gathering process. Data-preprocessing steps are thus performed to overcome the problem. Incomplete data are usually handled appropriately to improve the quality of the discovered information. Therefore, recovering missing values from a dataset has become an important research issue in data mining and machine learning.

Most learning approaches derive rules from complete datasets. If some attribute values are unknown in a dataset, the dataset is called incomplete. Mining valuable knowledge from incomplete datasets is usually more difficult than from complete datasets. Designing an algorithm that can infer the missing values in incomplete datasets and extract valuable knowledge thus presents a challenge in this research field. In the past, several completing methods were proposed to handle the problem of incomplete datasets [5][6][7][8][11][12][13][15]. A commonly used way of processing missing attribute values is to ignore the tuples that contain them. This method may disregard important information within the data, and a significant amount of data could be discarded. Some other methods, such as using association rules as an aid to complete the missing values, have been shown to have acceptable prediction accuracy [5][7][11][12]. For example, Ragel and Cremilleux proposed the Robust Association Rules (RAR) approach to reduce the impact of missing values in a dataset [6]. However, these association-rule-based completing methods cannot complete all missing values in an incomplete dataset when the missing ratio is high.

To deal with this disadvantage, we introduce an iterative missing-value completion method that fully infers the missing attribute values by combining an iterative mechanism with data mining techniques. The method uses the RAR support value approach [6] to extract useful association rules and infer the missing values in an iterative way. The proposed approach consists of three phases. The first phase uses the association rules mined from the original incomplete dataset to roughly complete the missing values. The second phase uses a reduced minimum support to gather more association rules from the original incomplete dataset and complete the missing values left after phase 1, iterating until no missing values exist. The third phase uses the association rules from the completed dataset to correct the missing values that have been filled in, repeating until the filled-in values converge. Experiments on two datasets are also made to show the performance of the proposed approach.

1.2 Contributions

The contributions of this thesis can be divided into the following parts.

(1) This thesis proposes an iterative missing-value completion algorithm to predict the missing values in a dataset. It achieves good accuracy and data recovery even when the missing-value rate is high.

(2) This thesis also proposes an approach that considers partial support values in deriving missing values. This approach performs slightly better than the first one because it uses more information in guessing.

1.3 Reader's Guide

The remainder of this thesis is organized as follows. The Apriori mining approach, the approach for robust association rules (RAR), an approach for completing missing values based on the RAR approach, and the ~AR approach are briefly reviewed in Chapter 2. In Chapter 3, an iterative algorithm with the traditional RAR support counting is proposed for completing missing values and deriving association rules in an incomplete dataset. In Chapter 4, an approach based on partial support values in deriving missing values is proposed. Experimental results for the proposed methods are given in Chapter 5. Finally, conclusions and future works are given in Chapter 6.

CHAPTER 2 Review of Related Works

2.1 Incomplete Datasets

Datasets can be roughly classified into two classes: complete and incomplete datasets. All the tuples in a complete dataset have known attribute values. If at least one tuple in a dataset has a missing value, the dataset is incomplete. Table 2-1 shows an example of an incomplete dataset, in which the symbol ‘?’ denotes a missing attribute value. The attribute A values in T9 and T10 are missing. Similarly, the B values of T4 and T10 and the C values of T4 and T8 are missing. The dataset is thus incomplete.

Table 2-1: An incomplete dataset

Tuple ID   A = {1, 2, 3}   B = {1, 2, 3}   C = {1, 2, 3}
T1         1               1               1
T2         1               1               1
T3         1               1               1
T4         1               ?               ?
T5         1               1               1
T6         1               1               1
T7         1               1               1
T8         1               1               ?
T9         ?               1               1
T10        ?               ?               2
T11        1               1               2
T12        1               1               2

2.2 Review of the Apriori Algorithm

The goal of data mining is to discover important associations among attributes such that the presence of some items in a tuple implies the presence of some other items. To achieve this purpose, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules [1][2][3][4]. They divided the mining process into the following steps:

Step 1: Generate candidate itemsets and count their frequencies by scanning the dataset.

Step 2: Consider an itemset as a large (frequent) itemset if the number of tuples in which the itemset appears is equal to or larger than a predefined threshold value (called the minimum support).

Step 3: Process the itemsets level by level. That is, the itemsets with only single items are processed first. Large itemsets containing only single items are then combined to form candidate itemsets with two items. This process is repeated until all large itemsets have been found.

Step 4: Form all possible association combinations for each large itemset, and calculate their confidences.

Step 5: Output the rules with confidence values larger than a predefined threshold (called the minimum confidence) as association rules.

Note that an association rule is an expression X→Y, where X is a set of attributes and Y is usually a single attribute. It means that, in the set of tuples, if all the attributes in X exist in a tuple, then Y is also in the tuple with a high probability.
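The level-wise process above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis: the function name `apriori`, the representation of a tuple as a set of item strings such as "A1", and the use of relative (fractional) thresholds are assumptions made for illustration only.

```python
from itertools import combinations


def apriori(tuples, min_sup, min_conf):
    """Level-wise mining sketch: returns (frequent itemsets, rules).

    `tuples` is a list of sets of items (assumed encoding, e.g. "A1").
    Rules are (antecedent, consequent item, support, confidence).
    """
    rows = [frozenset(t) for t in tuples]
    n = len(rows)

    def support(itemset):
        # Fraction of tuples that contain every item of the itemset.
        return sum(1 for row in rows if itemset <= row) / n

    # Candidate 1-itemsets.
    level = [frozenset([i]) for i in sorted({i for row in rows for i in row})]
    frequent = {}
    k = 1
    while level:
        # Keep the candidates whose support reaches the minimum support.
        current = {c: s for c in level if (s := support(c)) >= min_sup}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-item subset.
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]

    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:  # single-item consequent
            x = itemset - {y}
            conf = sup / frequent[x]
            if conf > min_conf:
                rules.append((x, y, sup, conf))
    return frequent, rules


data = [{"A1", "B1", "C1"}, {"A1", "B1", "C2"},
        {"A1", "B2", "C1"}, {"A1", "B1", "C1"}]
frequent, rules = apriori(data, min_sup=0.5, min_conf=0.7)
print(frequent[frozenset({"A1", "B1"})])  # 3 of the 4 tuples
```

The sketch recounts supports by rescanning the tuples at every level, which keeps it short; a practical implementation would count all candidates of a level in a single scan.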

2.3 Review of the RAR Approach

Ragel and Cremilleux proposed the Robust-Association-Rules (RAR) approach to reduce the impact of missing values in a dataset [6][7]. In the past, the most common way to deal with incomplete data was to delete the tuples with missing values. This, however, often leaves too few useful rules to be applied effectively. Instead of deleting the tuples with missing attribute values, RAR partially disables them to ease the problem of lost rules. The main idea of the RAR approach is to slice a dataset into several valid datasets, one for each itemset. The approach also redefines the support and confidence calculation of rules, and defines a threshold called representative to reduce redundant rules. These definitions are fully compatible with the traditional algorithm [4]. The rules mined with the RAR approach can be used to recover the missing values in a dataset [7]. Some definitions in RAR are described as follows.

Definition 1 (Disabled data) Let Dis(X) be the set of disabled (missing) data for the itemset X. It is defined as follows:

\[ \mathrm{Dis}(X) = \{\, Tup_i \mid \exists A \in X,\ A = \text{?} \text{ in } Tup_i,\ Tup_i \in D \,\}, \]

where A is a missing attribute-value pair (item) belonging to X.

Definition 2 (RAR-Support) Let the tuple set σ(X) for an itemset X be defined as follows:

\[ \sigma(X) = \{\, Tup_i \mid X \subseteq Tup_i,\ Tup_i \in D \,\}, \]

where X ⊆ Tup_i means that Tup_i contains the itemset X, and D is the dataset. The support for an itemset X based on the Robust-Association-Rule (RAR) approach is thus defined as follows:

\[ \mathrm{Sup}(X) = \frac{|\sigma(X)|}{|D| - |\mathrm{Dis}(X)|}. \]

The support of a rule X→Y is the support of the itemset X∪Y, so its denominator is |D| − |Dis(X∪Y)|.

Definition 3 (RAR-Confidence) The confidence for an association rule X→Y based on the RAR approach is defined as follows:

\[ \mathrm{Conf}(X \rightarrow Y) = \frac{|\sigma(X \cup Y)|}{|\sigma(X)| - |\mathrm{Dis}(Y) \cap \sigma(X)|}. \]

Definition 4 (Representative) Let Rep(X) be the representative value of X in D. The definition of Rep(X) is:

\[ \mathrm{Rep}(X) = \frac{|D| - |\mathrm{Dis}(X)|}{|D|}. \]

A rule X→Y whose representative value Rep(X∪Y) is less than the threshold is discarded, which avoids discovering rules from a dataset with too many disabled data. The user-specified minimum threshold of the representative value is denoted as minRep. Below, an example is given to explain how the mining process and the recovering process work.

Example 2-1: Consider the incomplete dataset in Table 2-2 with minSup = 50%, minConf = 50%, and minRep = 70%. The table has 12 tuples and three attributes. The three attributes, A, B and C separately include one, two and three attribute values. The symbol ‘?’ denotes the missing value of the corresponding item in a tuple. In this example, A1⇒B1 is a rule, but A1⇒C1 is not: the representative value of the rule A1⇒C1 is 8/12, which is 67% and is below minRep. The support, the confidence, and the representative values of the rule A1⇒B1 are calculated as follows: Sup(A1) = 10/10 = 100%, Sup(B1) = 10/10 = 100%, Sup(A1⇒B1) = 9/9 = 100%, Conf(A1⇒B1) = 9/(10−1) = 100%, and Rep(A1⇒B1) = 9/12 = 75%.

Table 2-2: An incomplete dataset in the example

Tuple ID   A   B   C
T1         1   1   1
T2         1   1   1
T3         1   1   1
T4         1   ?   ?
T5         1   1   1
T6         1   1   1
T7         1   1   1
T8         1   1   ?
T9         ?   1   1
T10        ?   ?   2
T11        1   1   2
T12        1   1   2

2.4 Missing-Value Completion Method with the RAR Approach

RAR only partially disables the victim tuples to passively discover the association rules. For active discovery, Ragel and Cremilleux have also proposed the Missing-Values Completion (MVC) approach [15], which is based on the RAR method, to recover multiple missing values in a dataset. First, MVC applies RAR to discover all association rules. Then, MVC applies the most appropriate rule to fill in a single missing value in a tuple. If a tuple has multiple missing values, MVC runs the process repeatedly. To avoid propagating wrong values, MVC uses only the rules with a high confidence value (more than 95%) to recover a tuple with multiple missing values.
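The RAR support, confidence and representative definitions of Section 2.3 can be checked against Table 2-2 with a short Python sketch. The encoding of a tuple as a dictionary from attribute name to value, and the function names `sigma`, `dis`, `rar_support`, `rar_confidence` and `representative`, are assumptions made for this illustration and do not come from the RAR papers.

```python
MISSING = "?"

# Table 2-2, one column per attribute (tuples T1..T12).
A = "1 1 1 1 1 1 1 1 ? ? 1 1".split()
B = "1 1 1 ? 1 1 1 1 1 ? 1 1".split()
C = "1 1 1 ? 1 1 1 ? 1 2 2 2".split()
D = [dict(A=a, B=b, C=c) for a, b, c in zip(A, B, C)]


def sigma(itemset, data):
    """Tuples that contain every attribute-value pair of the itemset."""
    return [t for t in data if all(t[a] == v for a, v in itemset.items())]


def dis(itemset, data):
    """Tuples disabled for the itemset: one of its attributes is missing."""
    return [t for t in data if any(t[a] == MISSING for a in itemset)]


def rar_support(itemset, data):
    # |sigma(X)| / (|D| - |Dis(X)|)
    return len(sigma(itemset, data)) / (len(data) - len(dis(itemset, data)))


def rar_confidence(x, y, data):
    # |sigma(X u Y)| / (|sigma(X)| - |Dis(Y) n sigma(X)|)
    sx = sigma(x, data)
    disabled = [t for t in sx if any(t[a] == MISSING for a in y)]
    return len(sigma({**x, **y}, data)) / (len(sx) - len(disabled))


def representative(itemset, data):
    # (|D| - |Dis(X)|) / |D|
    return (len(data) - len(dis(itemset, data))) / len(data)


print(rar_support({"A": "1", "B": "1"}, D))        # 9 / 9
print(rar_confidence({"A": "1"}, {"B": "1"}, D))   # 9 / (10 - 1)
print(representative({"A": "1", "B": "1"}, D))     # 9 / 12
print(representative({"A": "1", "C": "1"}, D))     # 8 / 12, below minRep = 70%
```

The sketch does not guard against an itemset for which every tuple is disabled, in which case the denominators become zero; a fuller version would return a support of zero in that case.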

Example 2-2: Consider the incomplete dataset in Table 2-2 again. The three minimum thresholds, minSup, minConf, and minRep, are identical to those in the above example. MVC executes the RAR process to generate the following four rules: A1⇒B1, B1⇒C1, B1⇒A1, and C1⇒B1, with 100%, 78%, 100%, and 78% confidence values, respectively. Except for the tuple T10, MVC recovers each missing value by filling ‘1’ in T4, T8, and T9.

2.5 Review of ~AR Approach

The ~AR algorithm [16] is divided into two parts. The first part of the ~AR algorithm replaces the missing values by a probability distribution. This probability distribution is calculated using the attribute-value frequency from the tuples. Consider a dataset that contains the following tuples, where “?” represents a missing value.

Table 2-3: An incomplete dataset in the example

Tuple ID   A = {1, 2, 3}   B = {1, 2, 3}   C = {1, 2, 3}
T1         1               1               1
T2         1               1               1
T3         1               1               1
T4         1               ?               ?
T5         1               1               1
T6         1               1               1
T7         1               1               1
T8         1               1               ?
T9         ?               1               1
T10        ?               ?               2
T11        1               1               2
T12        1               1               2

The missing value in Tuple 8 in Table 2-3 is replaced by a probability distribution which is calculated from the existing values in the same column. In this example, the probability that the missing value of attribute C in Tuple 8 is “1” is 7/10 = 0.7, and the probability that it is “2” is 3/10 = 0.3. These probabilities are then used to calculate the support value in the second part of the ~AR algorithm.

The second part of ~AR mines the association rules. Here, the calculation of the support count of each candidate itemset is the main difference between ~AR and the Apriori algorithm. In the Apriori algorithm, a tuple supports a candidate itemset only if the tuple fully matches all the attributes in the candidate itemset. In the ~AR algorithm, a tuple is allowed to partially support a candidate itemset. Given n items in a candidate itemset, each matching item in the tuple contributes 1/n of the total support to the candidate itemset. If the tuple fully matches the candidate itemset, the support is incremented by n × (1/n), which is 1 and the same as in the Apriori algorithm.

The other issue in ~AR is the type of non-full match. In the ~AR algorithm, there are two types of non-fully-matching situations. In the first situation, the items in a tuple do not fully match the items in a candidate itemset. The support is then incremented according to the items that match between the tuple and the candidate itemset. For example, consider the candidate itemset {A1, B1, C2}. The support contributed by each of Tuples 11 and 12 in Table 2-3 is (1/3) + (1/3) + (1/3) = 1. The support contributed by each of Tuples 1 and 2 in Table 2-3 is (1/3) + (1/3) + 0 = 2/3.

In the second situation, a tuple contains missing values. The support is then incremented by (1/n) multiplied by the probability that the missing value equals the corresponding item in the candidate itemset. For example, consider the candidate itemset {A1, B1, C2} again. The missing value of attribute C in Tuple 8 in Table 2-3 may be 1, 2 or 3, with probabilities 0.7, 0.3, and 0, respectively. Because the probability corresponding to C2 is 0.3, the support contributed by Tuple 8 is (1/3) + (1/3) + ((1/3) × 0.3) ≈ 0.77. Take Tuple 4 in Table 2-3 as another example. The missing value of attribute B in Tuple 4 may be 1, 2 or 3, with probabilities 1, 0, and 0, respectively. The missing value of attribute C in Tuple 4 may also be 1, 2 or 3, with probabilities 0.7, 0.3, and 0, respectively. Because the probabilities corresponding to B1 and C2 are 1 and 0.3, the support contributed by Tuple 4 is (1/3) + ((1/3) × 1) + ((1/3) × 0.3) ≈ 0.77.

Although ~AR provides a reasonable support-counting method, it may cause every tuple to potentially support every candidate itemset and thus make the confidence value greater than 1. To prevent this issue, a minimum match threshold is set by users. If the support provided by a tuple is below the threshold, the tuple does not contribute any support to the candidate itemset.
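The per-tuple support counting of ~AR described in this section can be sketched in Python on the data of Table 2-3. The function names `value_probabilities` and `partial_support`, the parameter `min_match` (standing for the minimum match threshold), and the dictionary encoding of tuples are assumptions made for this illustration, not part of the ~AR algorithm as published.

```python
from collections import Counter

MISSING = "?"

# Table 2-3, one column per attribute (tuples T1..T12).
COLUMNS = {
    "A": "1 1 1 1 1 1 1 1 ? ? 1 1".split(),
    "B": "1 1 1 ? 1 1 1 1 1 ? 1 1".split(),
    "C": "1 1 1 ? 1 1 1 ? 1 2 2 2".split(),
}
ROWS = [dict(zip(COLUMNS, values)) for values in zip(*COLUMNS.values())]


def value_probabilities(column):
    """Relative frequency of each known value in a column."""
    known = [v for v in column if v != MISSING]
    return {v: c / len(known) for v, c in Counter(known).items()}


PROBS = {attr: value_probabilities(col) for attr, col in COLUMNS.items()}


def partial_support(row, itemset, probs, min_match=0.0):
    """Support that one tuple contributes to a candidate itemset."""
    n = len(itemset)
    total = 0.0
    for attr, value in itemset.items():
        if row[attr] == MISSING:
            # Missing value: weight 1/n by the probability of the item.
            total += probs[attr].get(value, 0.0) / n
        elif row[attr] == value:
            # Known value that matches the item.
            total += 1.0 / n
    # Contributions below the minimum match threshold are dropped.
    return total if total >= min_match else 0.0


candidate = {"A": "1", "B": "1", "C": "2"}
print(partial_support(ROWS[10], candidate, PROBS))  # Tuple 11: full match
print(partial_support(ROWS[0], candidate, PROBS))   # Tuple 1: two of three items match
print(partial_support(ROWS[7], candidate, PROBS))   # Tuple 8: C missing, P(C = 2) = 0.3
```

Summing `partial_support` over all tuples gives the support count of the candidate itemset; with `min_match` set above 2/3, for instance, Tuples 1 and 2 would no longer contribute to this candidate.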

CHAPTER 3 The First Proposed Method

3.1 The Idea

In this chapter, we propose an iterative missing-value completion method that uses the RAR support-value counting method to extract association rules from an incomplete dataset with a high missing rate. It consists of three phases. The first phase uses the association rules mined from the original incomplete dataset to roughly complete the missing values. The second phase uses a reduced minimum support to gather more association rules from the original incomplete dataset and complete the missing values left after phase 1, iterating until no missing values exist. The third phase uses the association rules from the completed dataset to correct the missing values that have been filled with predicted values, repeating until convergence. The details of the proposed iterative completing algorithm are stated in Section 3.2, and an example is given to illustrate it in Section 3.3.

3.2 The Algorithm

INPUT: An incomplete dataset D with n tuples, a set of m attributes A, each with a set of values, the minimum support threshold minSup, and the minimum confidence threshold minConf.

OUTPUT: A set of association rules and a complete dataset with completed missing values.

PHASE 1:

STEP 1: Find the RAR-support values of all the 1-itemsets. If the support of a 1-itemset X is not less than the threshold minSup, put it in the set of frequent (large) 1-itemsets, L1.

STEP 2: Iteratively find the other frequent itemsets with more than one item in an Apriori-like way using the RAR-support evaluation. The set of frequent (large) k-itemsets is called Lk.

STEP 3: Find the confidence value of each possible candidate association rule generated from the frequent itemsets. Here, only the rules with a single item in the consequent are handled. If the confidence of a candidate association rule is not less than the threshold minConf, put it in the set of association rules, AR.

STEP 4: Use the set of association rules to infer the missing values of the incomplete dataset by the following sub-steps.

SUBSTEP 4.1: If there is only one association rule which can be used to derive the missing value of an attribute in a tuple, then use the rule.

SUBSTEP 4.2: If there is more than one association rule which can be used to derive the missing value of an attribute in a tuple, then use the one with the maximum RAR-confidence value; if more than one rule has the same maximum RAR-confidence value, then use the one with the maximum RAR-support value; if the maximum RAR-support values are still the same and the rules derive different values, then keep the value missing. Let the updated dataset be D’.

STEP 5: Check the dataset; if there are still missing values in the dataset, execute Phase 2; otherwise, execute Phase 3.
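The rule-selection policy of SUBSTEPs 4.1 and 4.2 can be sketched as follows. The `Rule` record, the function name `infer_value` and the dictionary encoding of a tuple are hypothetical names introduced for this illustration; the sketch only covers how one missing value is chosen once the rules and their RAR-support and RAR-confidence values are available.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: dict   # attribute -> required value
    attribute: str     # attribute of the single consequent item
    value: str         # value of the single consequent item
    support: float     # RAR-support of the rule
    confidence: float  # RAR-confidence of the rule


def infer_value(row, attribute, rules):
    """Return the value chosen for row[attribute], or None to keep it missing."""
    # A rule is usable if it concludes on the attribute and its antecedent
    # matches known values of the tuple.
    usable = [r for r in rules
              if r.attribute == attribute
              and all(row.get(a) == v for a, v in r.antecedent.items())]
    if not usable:
        return None
    # Highest confidence first, ties broken by highest support.
    best_key = max((r.confidence, r.support) for r in usable)
    values = {r.value for r in usable if (r.confidence, r.support) == best_key}
    # If the best rules still disagree, the value stays missing.
    return values.pop() if len(values) == 1 else None


rules = [
    Rule({"A": "1"}, "B", "1", support=0.6, confidence=0.9),
    Rule({"C": "3"}, "B", "2", support=0.7, confidence=0.9),
]
row = {"A": "1", "B": "?", "C": "3"}
print(infer_value(row, "B", rules))  # equal confidence, so the higher support wins
```

Comparing the pair (confidence, support) as a tuple reproduces the two-level tie-breaking in one step; the sketch compares floating-point values exactly, which a careful implementation might replace by a tolerance.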

(24) PHASE 2: STEP 6: Let y = 1, where y is used to control the reduced minimum support value and the reduced minimum confidence threshold. STEP 7: Set the reduced minimum support threshold RedMinSup = minSup/ry and the reduced minimum confidence threshold RedMinConf = minConf/ry, where r is the reduced coefficient. STEP 8: For each tuple still with missing values, find the set of originally non-missing attribute-value pairs in the tuple from the original dataset D and form the candidate 2-itemsets, with one item from the set of non-missing attribute-value pairs and the other from the possible values in a missing value. STEP 9: Find the RAR-support values of all the candidate 2-itemsets from the original dataset D; If the RAR-support of a candidate 2-itemset X is not less than the reduced support threshold RedMinSup, put it in the set of reduced frequent (large) 2-itemsets, L 2 ’. STEP 10: Find the confidence value of each possible candidate association rule generated from the frequent 2-itemsets, L 2 ’, generated in STEP 9. If the confidence of a candidate association rule is not less than the reduced confident threshold, put it in the set of reduced association rules, AR’. STEP 11: Use the set of association rules AR’ to infer the missing values of the updated dataset D’ by the following sub-steps. SUBSTEP 11.1: If there is only one association rule which can be used to derive the missing value of an attribute in an tuple, then use the rule. SUBSTEP 11.2: If there is more than one association rule which can be used to derive the missing value of an attribute in an tuple, then use the one with the maximum RAR-confidence value. If more than one 14.

(25) rule have the same maximum RAR-conference values, then use the one with the maximum RAR-support value. If the maximum RAR-support values are still the same and the rules derive different values, then keep the missing value still unknown. STEP 12: y = y + 1 and repeat STEPs 7 to 11 until there are no missing values in the updated dataset excluding the tuples with all their attribute values missing or y gets to a predefined value. STEP 13: If there is still a missing value for an attribute, fill in the most frequent value (based on the RAR support) of the attribute in the original dataset D.. PHASE 3: STEP 14: Find the new set of association rules, AR’ from the updated dataset D’ (which is complete now) in a way similar to STEPS 1 to 3 in Phase 1 using the traditional support and confidence (due to no missing values), but only focusing on the items in the tuples with originally missing values in D. STEP 15: Use the set of association rules to infer the missing values of the originally incomplete dataset by the following sub-steps. SUBSTEP 15.1:. If there is only one association rule which can be used to derive the missing value of an attribute in an tuple, then use the rule.. SUBSTEP 15.2:. If there is more than one association rule which can be used to derive the missing value of an attribute in an tuple, then use the one with the maximum RAR-confidence value; if more than one rule have the same maximum RAR-conference values, then use the one with the maximum. RAR-support 15. value;. if. the. maximum.

(26) RAR-support values are still the same and the rules derive different values, then keep the value the same as that in the updated dataset D’.

STEP 16: Compare the missing values before and after the dataset update. If they are not the same, repeat STEPs 14 to 16 using the newly updated D’. If they are the same (meaning convergence), go to the next step.

STEP 17: Output the final association rules and the updated dataset.

After STEP 17, the filled-in missing values have converged, and the association rules no longer change. Iterating until convergence reduces the adverse influence of missing values that were wrongly guessed at the beginning.

Note that tuples with all values missing may appear in a dataset. Take the data in Table 3-1 as an example, in which all the values in tuple 4 are missing. In STEP 8, since A = 1 and B = 3 are known in tuple 2 and the value of C is unknown, the set of candidate 2-itemsets generated from tuple 2 includes (A1, C3), (A1, C4), (B3, C3), and (B3, C4). Similarly, the set of candidate 2-itemsets generated from tuple 3 includes (A1, C3), (A2, C3), (B1, C3), (B2, C3), and (B3, C3). Thus, only a subset of the possible candidate 2-itemsets needs to be checked, which reduces the computation time. The algorithm checks only these candidates to extract useful association rules and infer missing values. STEP 14 generates its candidates in the same way.
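The candidate generation of STEP 8 can be illustrated with a minimal Python sketch. It assumes the four tuples of Table 3-1, with None standing for a missing value; the names DOMAINS, TABLE_3_1, candidate_2_itemsets and label are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Attribute domains and tuples of Table 3-1; None marks a missing value.
DOMAINS = {"A": [1, 2], "B": [1, 2, 3], "C": [3, 4]}
TABLE_3_1 = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": None},
    {"A": None, "B": None, "C": 3},
    {"A": None, "B": None, "C": None},
]


def candidate_2_itemsets(tup, domains):
    """Pair each known item of a tuple with each possible value of each missing attribute."""
    known = [(a, v) for a, v in tup.items() if v is not None]
    missing = [a for a, v in tup.items() if v is None]
    candidates = set()
    for (ka, kv), ma in product(known, missing):
        for mv in domains[ma]:
            candidates.add(frozenset([(ka, kv), (ma, mv)]))
    return candidates


def label(itemset):
    """Render an itemset such as {(A, 1), (C, 3)} as 'A1C3'."""
    return "".join(f"{a}{v}" for a, v in sorted(itemset))


for i, tup in enumerate(TABLE_3_1, start=1):
    found = sorted(label(c) for c in candidate_2_itemsets(tup, DOMAINS))
    print(f"Tuple {i}: {found}")
```

Tuples 1 and 4 yield no candidates in this sketch: tuple 1 has nothing missing, and tuple 4 has no known item to pair with, which is why STEP 13 is needed for such tuples.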

(27) Table 3-1: The dataset in the example

Tuple ID   A = {1, 2}   B = {1, 2, 3}   C = {3, 4}
Tuple 1    1            2               3
Tuple 2    1            3               ?
Tuple 3    ?            ?               3
Tuple 4    ?            ?               ?

In Table 3-1, all the values in tuple 4 are missing. Thus, no candidate is generated for it in STEP 8. In this case, STEP 13 is used to fill in the initial missing values. That is, the most frequent value of each attribute, based on the RAR-support in the original dataset, is used.

3.3 An Example

In this section, an example is given to show how the proposed algorithm can be used to find useful association rules and fill in missing values in an incomplete dataset with a high percentage of missing data. The incomplete dataset in Table 3-2 is used to demonstrate the idea. In Table 3-2, there are twelve tuples and three attributes, A, B and C. The possible values of the attributes are A = {1, 2}, B = {1, 2, 3} and C = {3, 4}. Each attribute has 50% missing values in this example. Assume the minSup threshold is set at 50% and the minConf threshold is set at 50%. The proposed algorithm processes this incomplete dataset as follows.

(28) Table 3-2: The incomplete dataset in the example

Tuple ID   A   B   C
T1         1   2   3
T2         1   3   ?
T3         ?   ?   3
T4         ?   2   4
T5         ?   3   ?
T6         ?   1   ?
T7         ?   2   3
T8         ?   ?   ?
T9         2   ?   3
T10        2   ?   ?
T11        2   ?   3
T12        2   ?   ?

Phase 1

Phase 1 first mines association rules from the incomplete dataset and roughly assigns proper values to incomplete slots according to the association rules. The processing for the example is as follows.

STEP 1: The RAR-support values of all the 1-itemsets are found from Table 3-2. Take the 1-itemset A = 2 as an example. There are four tuples with A = 2 in the example, namely tuples 9 to 12. The set of disabled (missing) data for attribute A includes six tuples (3 to 8). The RAR-support of A = 2 is thus 4/(12-6), which is 0.67. Since the RAR-support of A = 2 is larger than the minimum support threshold, which is 0.5 in the example, A = 2 is a frequent 1-itemset. The other 1-itemsets are processed similarly, and the set of frequent 1-itemsets is shown as follows:

(29) L1 = {A2, B2, C3}.

STEP 2: The other frequent itemsets with more than one item are found in an Apriori-like way but using the RAR-support evaluation. In this example, all the frequent itemsets with their RAR-supports are shown in Table 3-3.

Table 3-3: All the frequent itemsets with their RAR-supports in the example

Large n-Itemset   Itemset   Support
L1                A2        0.67
L1                B2        0.50
L1                C3        0.83
L2                A1B1      0.50
L2                A1B3      0.50
L2                A2C3      0.67
L2                B2C3      0.67
L3                A1B2C3    1.00

STEP 3: The possible association rules are then generated from the frequent (large) itemsets in Table 3-3, and their confidence values are calculated according to the RAR-confidence evaluation. Take the possible rule A2→C3 generated from the frequent 2-itemset (A2, C3) as an example. There are two tuples (9 and 11) with (A2, C3) and four tuples (9 to 12) with A2 in the example. The set of disabled data for attribute C among the tuples with A = 2 includes tuples 10 and 12. The confidence value Conf(A2→C3) is thus 2/(4-2), which is 1. Since the RAR-confidence of the rule is larger than the minimum confidence threshold, which is 0.5, A2→C3 is an association rule. Similarly, the confidence of the other possible rule C3→A2 generated from the same itemset (A2, C3) is 2/(5-2), which is 0.67. Since the RAR-confidence of the rule is also larger than the minimum

(30) confidence threshold, C3→A2 is an association rule. All the association rules with their support and confidence values are shown in Table 3-4.

Table 3-4: The association rules with support and confidence values after STEP 3

Rule ID   X    →   Y    Sup    Conf
R1        C3   →   A2   0.67   0.67
R2        C3   →   B2   0.67   1.00
R3        A2   →   C3   0.67   1.00
R4        B2   →   C3   0.67   0.67

STEP 4: The association rules in Table 3-4 are used to roughly infer the missing values in the incomplete dataset. Take tuple 3 as an example. There are two missing values in the tuple. According to the first rule (C3→A2), the missing value of attribute A is estimated as 2 because the attribute-value pair C3 is known. Similarly, the missing value of attribute B is estimated as 2 from the second association rule. The results after this step are shown in Table 3-5.

Table 3-5: The dataset after STEP 4

Tuple ID   A     B     C
T1         1     2     3
T2         1     3     ?
T3         ?=2   ?=2   3
T4         ?     2     4
T5         ?     3     ?
T6         ?     1     ?
T7         ?=2   2     3
T8         ?     ?     ?
T9         2     ?=2   3
T10        2     ?     ?=3
T11        2     ?=2   3
T12        2     ?     ?=3
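The RAR-support and RAR-confidence calculations of STEPs 1 and 3 can be reproduced with a short Python sketch. It follows the counting described in the worked example (tuples in which a relevant attribute is missing are excluded from the denominator) and assumes the data of Table 3-2; the helper names are hypothetical and the code is only an illustration, not the thesis implementation.

```python
# Table 3-2 as (A, B, C) rows; None marks a missing value.
ROWS = [
    (1, 2, 3), (1, 3, None), (None, None, 3), (None, 2, 4),
    (None, 3, None), (None, 1, None), (None, 2, 3), (None, None, None),
    (2, None, 3), (2, None, None), (2, None, 3), (2, None, None),
]
DATA = [dict(zip("ABC", row)) for row in ROWS]


def matches(tup, items):
    """True if the tuple contains every attribute-value pair in items."""
    return all(tup[a] == v for a, v in items.items())


def disabled(tup, attrs):
    """True if any of the given attributes is missing in the tuple."""
    return any(tup[a] is None for a in attrs)


def rar_support(data, items):
    """Count of matching tuples divided by the tuples not disabled for the itemset."""
    usable = [t for t in data if not disabled(t, items)]
    if not usable:
        return 0.0
    return sum(matches(t, items) for t in usable) / len(usable)


def rar_confidence(data, lhs, rhs):
    """Among tuples matching lhs and not disabled for rhs, the fraction matching rhs."""
    usable = [t for t in data if matches(t, lhs) and not disabled(t, rhs)]
    if not usable:
        return 0.0
    return sum(matches(t, rhs) for t in usable) / len(usable)


print(round(rar_support(DATA, {"A": 2}), 2))               # support of A2
print(round(rar_confidence(DATA, {"A": 2}, {"C": 3}), 2))  # confidence of A2 -> C3
print(round(rar_confidence(DATA, {"C": 3}, {"A": 2}), 2))  # confidence of C3 -> A2
```

Under these assumptions the three printed values correspond to 4/(12-6), 2/(4-2) and 2/(5-2) from the worked example.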

(31) STEP 5: As Table 3-5 shows, missing values still exist after STEP 4. Phase 2 then begins.

Phase 2

Phase 2 fills in the missing values that remain after Phase 1. It iteratively reduces the minimum support and minimum confidence thresholds for this purpose. Only 2-itemsets are used, to reduce the computation time. The processing for the example is as follows.

STEP 6: y is set at 1, where y controls the reduced minimum support and minimum confidence thresholds used to obtain more association rules for inferring the missing values.

STEP 7: The support and the confidence thresholds are decreased as follows:

\[ redMinSup = \frac{minSup}{2^{y}} = \frac{0.5}{2^{1}} = 0.25 \]

\[ redMinConf = \frac{minConf}{2^{y}} = \frac{0.5}{2^{1}} = 0.25 \]

STEP 8: For the tuples that still have missing values, the originally non-missing attribute-value pairs in those tuples are first identified. The candidate 2-itemsets are then formed, each with one item from the set of originally non-missing attribute-value pairs and the other from the possible values of the attributes with missing values. Take tuple 4 as an example. B2 and C4 are known in the original tuple, and the value of A is unknown. A1 and A2 are the two possible values of attribute A. Thus, four candidate

(32) 2-itemsets (A1, B2), (A2, B2), (A1, C4) and (A2, C4) are found in this step. In this example, all the candidate 2-itemsets are shown in Table 3-6.

Table 3-6: The candidate 2-itemsets formed in STEP 8

Candidate 2-Itemset
A1B1, A1B2, A1B3, A1C3, A1C4
A2B1, A2B2, A2B3, A2C4
B1C3, B1C4, B2C3, B3C3, B3C4

STEP 9: The RAR-support values of all the candidate 2-itemsets in Table 3-6 are calculated in the same way as in STEP 1. If the RAR-support value of a candidate 2-itemset is equal to or larger than the reduced minimum support threshold (0.25), it is put into the set of reduced frequent 2-itemsets, L2’. In this example, all the reduced frequent itemsets are shown in Table 3-7.

(33) Table 3-7: The reduced frequent 2-itemsets formed in STEP 9

Reduced Frequent 2-Itemset
A1B2, A1B3, A1C3, B2C3

STEP 10: The possible association rules are then generated from the reduced frequent (large) 2-itemsets in Table 3-7, and their confidence values are calculated according to the RAR-confidence evaluation, in the same way as in STEP 3. If the RAR-confidence value of a possible rule is equal to or larger than the reduced minimum confidence threshold (0.25), it is put into the set of reduced association rules, AR’. The reduced association rules with their support and confidence values are shown in Table 3-8.

Table 3-8: The reduced association rules AR’ with their support and confidence values

Rule ID   X    →   Y    Sup    Conf
R1        B2   →   A1   0.50   1.00
R2        B3   →   A1   0.50   1.00
R3        C3   →   A1   0.33   0.33
R4        A1   →   B2   0.50   0.50
R5        C3   →   B2   0.67   1.00
R6        A1   →   B3   0.50   0.50
R7        A1   →   C3   0.33   1.00
R8        B2   →   C3   0.67   0.67

STEP 11: The set of reduced association rules AR’ is then used to assign the missing values which have not yet been filled in. If there is more than one association

(34) rule which can be used to derive the missing value of an attribute, then the one with the maximum RAR-confidence value is used. The results after this step are shown in Table 3-9.

Table 3-9: The dataset after STEP 11

Tuple ID   A     B     C
T1         1     2     3
T2         1     3     ?=3
T3         ?=2   ?=2   3
T4         ?=1   2     4
T5         ?=1   3     ?=3
T6         ?     1     ?
T7         ?=2   2     3
T8         ?     ?     ?
T9         2     ?=2   3
T10        2     ?=2   ?=3
T11        2     ?=2   3
T12        2     ?=2   ?=3

STEP 12: y = y + 1 and STEPs 7 to 11 are repeated until there are no missing values in the updated dataset (excluding the tuples with all their attribute values missing) or y reaches a predefined value. In this example, y is set at 3 and the results are the same as those in Table 3-9.

STEP 13: If there is still a missing value for an attribute, the most frequent value (based on the RAR-support) of the attribute in the original dataset D is assigned to it. Take tuple 6 in Table 3-9 as an example. The most frequent value of each attribute is shown in Table 3-10. The missing value of attribute A in tuple 6 can thus be assigned the value 2 according to Table 3-10. The results after this step are shown in Table 3-11. After this step, all missing values are filled in and Phase 3 begins to adjust the values.
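The fallback of STEP 13 can be sketched as follows. For a single attribute, the RAR-support of a value is its count divided by the number of tuples in which the attribute is known, so the most frequent known value is the one chosen. The sketch assumes the columns of Table 3-2; COLUMNS and most_frequent_value are hypothetical names, and ties (which do not occur in this example) are resolved by first occurrence, which is an assumption of the sketch rather than something the algorithm specifies.

```python
from collections import Counter

# Columns of Table 3-2; None marks a missing value.
COLUMNS = {
    "A": [1, 1, None, None, None, None, None, None, 2, 2, 2, 2],
    "B": [2, 3, None, 2, 3, 1, 2, None, None, None, None, None],
    "C": [3, None, 3, 4, None, None, 3, None, 3, None, 3, None],
}


def most_frequent_value(column):
    """Return (value, RAR-support) of the most frequent known value in a column."""
    known = [v for v in column if v is not None]
    value, count = Counter(known).most_common(1)[0]
    return value, count / len(known)


for attr, column in COLUMNS.items():
    value, support = most_frequent_value(column)
    print(f"{attr}{value}: {support:.2f}")
```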

(35) Table 3-10: The most frequent attribute value for each attribute in the original dataset D

Most Frequent Attribute Value   Support
A2                              0.67
B2                              0.50
C3                              0.83

Table 3-11: The dataset after Phase 2

Tuple ID   A     B     C
T1         1     2     3
T2         1     3     ?=3
T3         ?=2   ?=2   3
T4         ?=1   2     4
T5         ?=1   3     ?=3
T6         ?=2   1     ?=3
T7         ?=2   2     3
T8         ?=2   ?=2   ?=3
T9         2     ?=2   3
T10        2     ?=2   ?=3
T11        2     ?=2   3
T12        2     ?=2   ?=3

Phase 3

Phase 3 adjusts the assigned missing values through an iterative process to raise the precision. It first mines the association rules from the filled, complete dataset and then uses the rules to guess the originally missing values once more. If a guess differs from the previous one, the same process is repeated until the results converge (no longer change). The processing for the example in this phase is as follows.

(36) STEP 14: The new set of association rules, AR’, is found from the updated dataset D’ (which is complete now) in a way similar to STEPs 1 to 3 in Phase 1, using the traditional support and confidence (since there are no missing values) with the original support and confidence thresholds. The new set of association rules obtained is shown in Table 3-12.

Table 3-12: The new set of association rules after STEP 14

Rule ID   X      →   Y    Sup    Conf
R1        B2     →   A2   0.58   0.78
R2        C3     →   A2   0.58   0.64
R3        B2C3   →   A2   0.58   0.88
R4        A2     →   B2   0.58   0.88
R5        C3     →   B2   0.67   0.73
R6        A2C3   →   B2   0.58   1.00
R7        A2     →   C3   0.58   0.88
R8        B2     →   C3   0.67   0.89
R9        A2B2   →   C3   0.58   1.00

STEP 15: The new set of association rules AR’ is used to adjust the previously filled-in missing values in the dataset. Take tuple 4 as an example. The missing value of attribute A in tuple 4 was assigned 1 by the processing in Phases 1 and 2. The value is then adjusted to 2 according to Rule 1 in Table 3-12. Similarly, the value of attribute A in tuple 5 is adjusted from 1 to 2 as well. The results after this step are shown in Table 3-13, where the first number in a missing slot represents the previous guess and the second one represents the current guess after adjustment.

(37) Table 3-13: The dataset after adjustment

Tuple ID   A        B        C
T1         1        2        3
T2         1        3        ?=3,3
T3         ?=2,2    ?=2,2    3
T4         ?=1,2    2        4
T5         ?=1,2    3        ?=3,3
T6         ?=2,2    1        ?=3,3
T7         ?=2,2    2        3
T8         ?=2,2    ?=2,2    ?=3,3
T9         2        ?=2,2    3
T10        2        ?=2,2    ?=3,3
T11        2        ?=2,2    3
T12        2        ?=2,2    ?=3,3

STEP 16: The guessed missing values before and after adjustment are compared. In this example, the guesses for tuple 4 and tuple 5 have changed, and thus STEPs 14 to 16 are repeated. In the second iteration, the guessed missing values are the same as those obtained in the first iteration, so the process has converged and STEP 17 is executed.

STEP 17: The final association rules and the completed dataset are then output to users. The results for this example are shown in Tables 3-14 and 3-15.
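The convergence test of STEP 16 amounts to comparing two sets of guesses. A minimal sketch, assuming the guesses for attribute A in Table 3-13 and using hypothetical names (previous, current, changed_slots):

```python
# Guesses for attribute A in Table 3-13: before and after adjustment.
previous = {"T3": 2, "T4": 1, "T5": 1, "T6": 2, "T7": 2, "T8": 2}
current = {"T3": 2, "T4": 2, "T5": 2, "T6": 2, "T7": 2, "T8": 2}


def changed_slots(prev, curr):
    """Return the slots whose guessed value differs between two iterations."""
    return sorted(slot for slot in prev if prev[slot] != curr[slot])


changed = changed_slots(previous, current)
print(changed)           # slots that changed in this iteration
print(len(changed) == 0) # True only once the guesses have converged
```

Because two slots differ after the first pass, another iteration of STEPs 14 to 16 is needed; the loop stops once changed_slots returns an empty list.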

(38) Table 3-14: The final association rules in the example

Rule ID   X      →   Y
R1        B2     →   A2
R2        C3     →   A2
R3        B2C3   →   A2
R4        A2     →   B2
R5        C3     →   B2
R6        A2C3   →   B2
R7        A2     →   C3
R8        B2     →   C3
R9        A2B2   →   C3

Table 3-15: The finally completed dataset

Tuple ID   A   B   C
T1         1   2   3
T2         1   3   3
T3         2   2   3
T4         2   2   4
T5         2   3   3
T6         2   1   3
T7         2   2   3
T8         2   2   3
T9         2   2   3
T10        2   2   3
T11        2   2   3
T12        2   2   3
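The rule-selection logic of SUBSTEPs 15.1 and 15.2 (highest confidence first, then highest support, and no change when the best rules disagree) can be sketched in Python as follows. The rule list is transcribed from Table 3-12; the tuple layout and the function name infer are assumptions made for this illustration.

```python
# Rules of Table 3-12 as (antecedent, (attribute, value), support, confidence).
RULES = [
    ({"B": 2}, ("A", 2), 0.58, 0.78),
    ({"C": 3}, ("A", 2), 0.58, 0.64),
    ({"B": 2, "C": 3}, ("A", 2), 0.58, 0.88),
    ({"A": 2}, ("B", 2), 0.58, 0.88),
    ({"C": 3}, ("B", 2), 0.67, 0.73),
    ({"A": 2, "C": 3}, ("B", 2), 0.58, 1.00),
    ({"A": 2}, ("C", 3), 0.58, 0.88),
    ({"B": 2}, ("C", 3), 0.67, 0.89),
    ({"A": 2, "B": 2}, ("C", 3), 0.58, 1.00),
]


def infer(known, attr, rules, fallback=None):
    """Pick a value for attr from the applicable rules, or return fallback."""
    usable = [
        r for r in rules
        if r[1][0] == attr and all(known.get(a) == v for a, v in r[0].items())
    ]
    if not usable:
        return fallback
    best_conf = max(r[3] for r in usable)
    top = [r for r in usable if r[3] == best_conf]
    best_sup = max(r[2] for r in top)
    top = [r for r in top if r[2] == best_sup]
    values = {r[1][1] for r in top}
    # Rules tied on both measures but deriving different values: keep the old value.
    return values.pop() if len(values) == 1 else fallback


# Tuple 4 has B = 2 and C = 4 known; its previous guess for A was 1.
print(infer({"B": 2, "C": 4}, "A", RULES, fallback=1))
```

For tuple 4 only Rule 1 (B2→A2) applies, so the guess for A moves from 1 to 2, as described in STEP 15.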

(39) CHAPTER 4 The Second Proposed Method

4.1 The Idea

In this chapter, we propose another iterative missing-value completion method, which uses a new partial-support counting method to extract association rules from an incomplete dataset with a high missing rate. It also has three phases, with functions similar to those in the first method. The organization of this chapter is as follows. Section 4.2 gives the definitions used in the proposed algorithm. Section 4.3 describes the details of the proposed algorithm with the partial supports. An example is given in Section 4.4 to illustrate the proposed algorithm.

4.2 Some Definitions

Let AP(a, v) be the appearance probability of a value v for an attribute a. It is formally defined as follows:

\[ AP(a, v) = \frac{1}{|V_a|}, \]

where |V_a| is the number of possible values of the attribute a. Based on the appearance probability, the partial support in a tuple for supporting an item is defined below.

Definition 1 - Partial support of an item in a tuple: The partial support PSup_t(a, v) of an item (a = v) in a tuple t is defined as:

(40) if t.a = v 1,  = PSupt (a, v)  AP (a, = v), if t.a ? , 0, otherwise . where t.a stands for the value of the attribute a in the tuple t and the question mark ‘?’ stands for a missing value. From the partial support of an item in a tuple, we can now define the partial support parSup t (X) of an itemset X with multiple items as follows:. Definition 2 - Partial support of an itemset in a tuple: Let an itemset X include k items {(a 1 , v 1 ), …, (a k , v k )}. The partial support PSup t (X) of an itemset X in a tuple t is defined as: PSup = PSupt (a1 , v1 ) ∗ ∗ PSupt (ak , vk ) t (X ). The partial support of an itemset in a whole database can thus be defined below.. Definition 3 - Partial support of an itemset in a whole dataset: The Partial support PSup(X) of an itemset X in a whole dataset D is defined as follows:  ∑ PSupt ( X )  , PSup ( X ) =  t∈D D where |D| is the size of the dataset D. The partial support will be used in this paper to find the frequent itemsets. The confidence will also be calculated by the partial supports. It is defined as follows.. Definition 4 – Confidence: The confidence for an association rule X→Y based on the partial supports is defined 30.

(41) as:

\[ conf(X \rightarrow Y) = \frac{PSup(X \cup Y)}{PSup(X)}. \]

Below, a simple example is given to illustrate the above concepts. Assume there is an incomplete dataset shown in Table 4-1.

Table 4-1: An incomplete dataset in the example

Tuple ID   A = {1, 2}   B = {1, 2, 3}   C = {3, 4}
T1         1            2               3
T2         1            3               ?
T3         ?            ?               3
T4         ?            ?               ?

Take the item A = 1 as an example. Since there are two possible values (1 or 2) for A, the appearance probability of (A = 1) for an incomplete slot of A is calculated as AP(A, 1) = 1/2, which is 0.5. The partial supports of A = 1 in the first two tuples are both 1 and in the last two tuples are both 0.5. The partial support of A = 1 in the whole dataset is thus calculated as PSup(A1) = (1+1+0.5+0.5) / 4, which is 0.75.

Take the 2-itemset (A = 1, B = 2) as another example. The appearance probabilities of (A = 1) and (B = 2) are 0.5 and 0.33, respectively. There is only one tuple (the first tuple) with (A = 1, B = 2) in the dataset. Thus, PSup_1(A, 1) = 1 and PSup_1(B, 2) = 1, so PSup_1({A=1, B=2}) = 1*1, which is 1. The partial supports of the 2-itemset (A1, B2) in all four tuples are calculated as follows:

PSup_1(A1, B2) = PSup_1(A, 1) * PSup_1(B, 2) = 1 * 1 = 1
PSup_2(A1, B2) = PSup_2(A, 1) * PSup_2(B, 2) = 1 * 0 = 0
PSup_3(A1, B2) = PSup_3(A, 1) * PSup_3(B, 2) = 0.5 * 0.33 = 0.165
PSup_4(A1, B2) = PSup_4(A, 1) * PSup_4(B, 2) = 0.5 * 0.33 = 0.165

Therefore, the partial support of the 2-itemset (A1, B2) in the dataset is calculated

(42) as PSup(A1, B2) = (1+0+0.165+0.165) / 4, which is 0.33.

4.3 The Proposed Algorithm

INPUT: An incomplete dataset D with n tuples, a set of m attributes A, each with a set of values, the minimum support threshold minSup, and the minimum confidence threshold minConf.

OUTPUT: A set of association rules and a complete dataset with the missing values filled in.

PHASE 1:

STEP 1: Find the partial support values of all the 1-itemsets. If the support of a 1-itemset X is not less than the threshold minSup, put it in the set of frequent (large) 1-itemsets, L1.

STEP 2: Iteratively find the other frequent itemsets with more than one item in an Apriori-like way using the partial support evaluation. The set of frequent (large) k-itemsets is called Lk.

STEP 3: Find the confidence value of each possible candidate association rule generated from the frequent itemsets. Here, only the rules with a single item in the consequent are handled. If the confidence of a candidate association rule is not less than the threshold minConf, put it in the set of association rules, AR.

STEP 4: Use the set of association rules to infer the missing values of the incomplete dataset by the following sub-steps.

SUBSTEP 4.1: If only one association rule can be used to derive the missing value of an attribute in a tuple, then use

(43) the rule.

SUBSTEP 4.2: If more than one association rule can be used to derive the missing value of an attribute in a tuple, then use the one with the maximum confidence value; if more than one rule has the same maximum confidence value, then use the one with the maximum partial support value; if the maximum partial support values are still the same and the rules derive different values, then keep the value missing. Denote the updated dataset as D’.

STEP 5: Check the dataset; if there are still missing values in the dataset, execute Phase 2; otherwise, execute Phase 3.

PHASE 2:

STEP 6: Let y = 1, where y controls the temporary minimum support threshold and the temporary minimum confidence threshold.

STEP 7: Set the temporary minimum support threshold tempMinSup = minSup / 2^y and the temporary minimum confidence threshold tempMinConf = minConf / 2^y.

STEP 8: For each tuple that still has missing values, find the set of originally non-missing attribute-value pairs of the tuple in the original dataset D and form the candidate 2-itemsets, each with one item from the set of non-missing attribute-value pairs and the other from the possible values of a missing attribute.

STEP 9: Find the partial support values of all the candidate 2-itemsets from the original dataset D. If the partial support of a candidate 2-itemset X is not less than the reduced support threshold tempMinSup, put it in the new set of frequent (large) 2-itemsets, L2’.

STEP 10: Find the confidence value of each possible candidate association rule

(44) generated from the frequent 2-itemsets L2’ found in STEP 9. If the confidence of a candidate association rule is not less than the reduced confidence threshold, put it in the set of association rules, AR’.

STEP 11: Use the set of association rules AR’ to infer the missing values of the updated dataset D’ by the following sub-steps.

SUBSTEP 11.1: If only one association rule can be used to derive the missing value of an attribute in a tuple, then use that rule.

SUBSTEP 11.2: If more than one association rule can be used to derive the missing value of an attribute in a tuple, then use the one with the maximum confidence value. If more than one rule has the same maximum confidence value, then use the one with the maximum partial support value. If the maximum partial support values are still the same and the rules derive different values, then keep the missing value unknown.

STEP 12: Set y = y + 1 and repeat STEPs 7 to 11 until there are no missing values in the updated dataset or y reaches a predefined value.

STEP 13: If there is still a missing value for an attribute, fill in the most frequent value (based on the partial support evaluation) of the attribute in the original dataset D.

PHASE 3:

STEP 14: Find the new set of association rules AR’ from the updated dataset D’ (which is complete now) in a way similar to STEPs 1 to 3 in Phase 1, using the traditional support and confidence (since there are no missing values), but focusing only on the items in the tuples with originally missing values in D.

(45) STEP 15: Use the set of association rules to infer the missing values of the originally incomplete dataset by the following sub-steps.

SUBSTEP 15.1: If only one association rule can be used to derive the missing value of an attribute in a tuple, then use that rule.

SUBSTEP 15.2: If more than one association rule can be used to derive the missing value of an attribute in a tuple, then use the one with the maximum confidence value; if more than one rule has the same maximum confidence value, then use the one with the maximum partial support value; if the maximum partial support values are still the same and the rules derive different values, then keep the value the same as that in the updated dataset D’.

STEP 16: Compare the missing values before and after the dataset update. If they are not the same, repeat STEPs 14 to 16 using the newly updated D’. If they are the same (meaning convergence), go to the next step.

STEP 17: Output the final association rules and the updated dataset.

After STEP 17, the filled-in missing values have converged, and the association rules no longer change. Iterating until convergence reduces the adverse influence of missing values that were wrongly guessed at the beginning.

Note that tuples with all values missing may appear in a dataset. Take the data in Table 4-2 as an example, in which all the values in tuple 4 are missing. In STEP 8, since A = 1 and B = 3 are known in tuple 2 and the value of C is unknown, the set of candidate 2-itemsets generated from tuple 2 includes (A1, C3), (A1, C4), (B3, C3), and (B3, C4). Similarly, the set of candidate 2-itemsets generated from tuple

(46) 3 includes (A1, C3), (A2, C3), (B1, C3), (B2, C3), and (B3, C3). Thus, only a subset of the possible candidate 2-itemsets needs to be checked, which reduces the computation time. The algorithm checks only these candidates to extract useful association rules and infer missing values. STEP 14 generates its candidates in the same way.

Table 4-2: The dataset in the example

Tuple ID   A = {1, 2}   B = {1, 2, 3}   C = {3, 4}
T1         1            2               3
T2         1            3               ?
T3         ?            ?               3
T4         ?            ?               ?

In Table 4-2, all the values in tuple 4 are missing. Thus, no candidate can be generated for it. In this case, STEP 13 is used to fill in the initial missing values. That is, the most frequent value of each attribute, based on the partial support in the original dataset, is used.

4.4 An Example

In this section, an example is given to show how the proposed algorithm can be used to find useful association rules and fill in missing values in an incomplete dataset with a high percentage of missing data. The incomplete dataset in Table 4-3 is used to demonstrate the idea. In Table 4-3, there are twelve tuples and three attributes, A, B and C. The possible values of the attributes are A = {1, 2}, B = {1, 2, 3} and C = {3, 4}. Each attribute has 50% missing values in this example. Assume the minSup threshold is set at 50% and the minConf threshold is set at 50%.

(47) The proposed algorithm processes this incomplete dataset as follows.

Table 4-3: The incomplete dataset in the example

Tuple ID   A   B   C
T1         1   2   3
T2         1   3   ?
T3         ?   ?   3
T4         ?   2   4
T5         ?   3   ?
T6         ?   1   ?
T7         ?   2   3
T8         ?   ?   ?
T9         2   ?   3
T10        2   ?   ?
T11        2   ?   3
T12        2   ?   ?

Phase 1

Phase 1 first mines association rules from the incomplete dataset and roughly assigns proper values to incomplete slots according to the association rules. The processing for the example is as follows.

STEP 1: The partial support values of all the 1-itemsets are found from Table 4-3. Take the 1-itemset A = 2 as an example. There are four tuples with A = 2 in the example, namely tuples 9 to 12. The value of attribute A is missing in six tuples (3 to 8). The partial support of A = 2 is thus ((0.5*6)+(1*4)) / 12, which is 0.58. Since the partial support of A = 2 is larger than the minimum support threshold, which is 0.5 in the example, A = 2 is a frequent 1-itemset. The other

(48) 1-itemsets are processed similarly, and the set of frequent 1-itemsets is shown as follows: L1 = {A2, C3}.

STEP 2: The other frequent itemsets with more than one item are found in an Apriori-like way but using the partial support evaluation. In this example, no other frequent itemsets are found in this step. All the frequent itemsets with their partial supports are shown in Table 4-4.

Table 4-4: All the frequent itemsets with their partial supports in the example

Large n-Itemset   Itemset   Support
L1                A2        0.58
L1                C3        0.67

STEP 3: The possible association rules are then generated from the frequent (large) itemsets in Table 4-4, and their confidence values are calculated according to the confidence evaluation. Since no useful rule can be generated from the large 1-itemsets, STEP 4 is executed.

STEP 4: Because there are no useful rules, no missing value can be completed in this step. STEP 5 is then executed.

STEP 5: As the above steps show, missing values still exist. Phase 2 then begins.

Phase 2

(49) Phase 2 fills in the missing values that remain after Phase 1. It iteratively reduces the minimum support and minimum confidence thresholds for this purpose. Only 2-itemsets are used, to reduce the computation time. The processing for the example is as follows.

STEP 6: y is set at 1, where y controls the reduced minimum support and minimum confidence thresholds used to obtain more association rules for inferring the missing values.

STEP 7: The support and the confidence thresholds are decreased as follows:

\[ redMinSup = \frac{minSup}{2^{y}} = \frac{0.5}{2^{1}} = 0.25 \]

\[ redMinConf = \frac{minConf}{2^{y}} = \frac{0.5}{2^{1}} = 0.25 \]

STEP 8: For the tuples that still have missing values, the originally non-missing attribute-value pairs in those tuples are first identified. The candidate 2-itemsets are then formed, each with one item from the set of originally non-missing attribute-value pairs and the other from the possible values of the attributes with missing values. Take tuple 4 in Table 4-3 as an example. B2 and C4 are known in the original tuple, and the value of A is unknown. A1 and A2 are the two possible values of attribute A. Thus, four candidate 2-itemsets (A1, B2), (A2, B2), (A1, C4) and (A2, C4) are found in this step. In this example, all the candidate 2-itemsets are shown in Table 4-5.

(50) Table 4-5: The candidate 2-itemsets formed in STEP 8

Candidate 2-Itemset
A1B1, A1B2, A1B3, A1C3, A1C4
A2B1, A2B2, A2B3, A2C3, A2C4
B1C3, B1C4, B2C3, B2C4, B3C3, B3C4

STEP 9: The partial support values of all the candidate 2-itemsets in Table 4-5 are calculated in the same way as in STEP 1. If the partial support value of a candidate 2-itemset is equal to or larger than the reduced minimum support threshold (0.25), it is put into the set of reduced frequent 2-itemsets, L2’. In this example, all the reduced frequent itemsets are shown in Table 4-6.
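The partial-support computation of STEP 9 follows directly from Definitions 1 to 3 and can be sketched in Python. The sketch assumes the data of Table 4-3 and the sixteen candidates of Table 4-5 (every pair of items drawn from two different attributes); exact fractions are used to avoid rounding 1/3 to 0.33. All names are hypothetical and the code is an illustration rather than the thesis implementation.

```python
from fractions import Fraction
from itertools import combinations, product

DOMAINS = {"A": [1, 2], "B": [1, 2, 3], "C": [3, 4]}
# Table 4-3 as (A, B, C) rows; None marks a missing value.
ROWS = [
    (1, 2, 3), (1, 3, None), (None, None, 3), (None, 2, 4),
    (None, 3, None), (None, 1, None), (None, 2, 3), (None, None, None),
    (2, None, 3), (2, None, None), (2, None, 3), (2, None, None),
]
DATA = [dict(zip("ABC", row)) for row in ROWS]


def psup_tuple(tup, items):
    """Definitions 1 and 2: product of the per-item partial supports in one tuple."""
    result = Fraction(1)
    for a, v in items.items():
        if tup[a] is None:
            result *= Fraction(1, len(DOMAINS[a]))  # appearance probability AP(a, v)
        elif tup[a] != v:
            return Fraction(0)
    return result


def psup(data, items):
    """Definition 3: average partial support over the whole dataset."""
    return sum(psup_tuple(t, items) for t in data) / len(data)


# All 2-itemsets whose items come from two different attributes (Table 4-5).
candidates = [
    {a: va, b: vb}
    for a, b in combinations(DOMAINS, 2)
    for va, vb in product(DOMAINS[a], DOMAINS[b])
]
threshold = Fraction(1, 4)  # reduced minimum support for y = 1
frequent = {
    "".join(f"{a}{v}" for a, v in c.items()): round(float(psup(DATA, c)), 2)
    for c in candidates
    if psup(DATA, c) >= threshold
}
print(frequent)
```

Under these definitions the sketch keeps A1C3, A2C3 and B2C3, which are the three itemsets from which the rules in Table 4-7 are generated.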

(51) Table 4-6: The reduced frequent 2-itemsets formed in STEP 9

Reduced Frequent 2-Itemset
A1C3, A2C3, B2C3

STEP 10: The possible association rules are then generated from the reduced frequent (large) 2-itemsets in Table 4-6, and their confidence values are calculated according to the confidence evaluation, in the same way as in STEP 3. If the confidence value of a possible rule is equal to or larger than the reduced minimum confidence threshold (0.25), it is put into the set of reduced association rules, AR’. The reduced association rules with their support and confidence values are shown in Table 4-7.

Table 4-7: The reduced association rules AR’ with their support and confidence values

Rule ID   X    →   Y    Sup    Conf
R1        C3   →   A1   0.27   0.4
R2        C3   →   A2   0.4    0.6
R3        C3   →   B2   0.29   0.43
R4        A1   →   C3   0.27   0.64
R5        A2   →   C3   0.4    0.69
R6        B2   →   C3   0.29   0.69

STEP 11: The set of reduced association rules AR’ is then used to assign the missing values which have not yet been filled in. If more than one association rule can be used to derive the missing value of an attribute, then the one with the maximum confidence value is used. The results after this step are shown in

(52) Table 4-8.

Table 4-8: The dataset after STEP 11

Tuple ID   A     B     C
T1         1     2     3
T2         1     3     ?=3
T3         ?=2   ?=2   3
T4         ?     2     4
T5         ?     3     ?
T6         ?     1     ?
T7         ?=2   2     3
T8         ?=2   ?=2   ?=3
T9         2     ?=2   3
T10        2     ?     ?=3
T11        2     ?=2   3
T12        2     ?     ?=3

STEP 12: y = y + 1 and STEPs 7 to 11 are repeated until there are no missing values in the updated dataset (excluding the tuples with all their attribute values missing) or y reaches a predefined value. In this example, y is set at 3, and the reduced association rules and the dataset of the second iteration are shown in Table 4-9 and Table 4-10, respectively.

(53) Table 4-9: The reduced association rules AR’ with their support and confidence values

Rule ID   X    →   Y    Sup    Conf
R1        B2   →   A1   0.19   0.45
R2        B3   →   A1   0.15   0.45
R3        C4   →   A1   0.15   0.45
R4        B1   →   A2   0.18   0.72
R5        B2   →   A2   0.22   0.52
R6        B3   →   A2   0.18   0.55
R7        C4   →   A2   0.15   0.45
R8        A2   →   B1   0.18   0.31
R9        C3   →   B1   0.17   0.25
R10       A1   →   B2   0.19   0.45
R11       A2   →   B2   0.22   0.38
R12       C3   →   B2   0.29   0.43
R13       A1   →   B3   0.15   0.36
R14       A2   →   B3   0.18   0.31
R15       C3   →   B3   0.21   0.31
R16       C4   →   B3   0.13   0.39
R17       B1   →   C3   0.17   0.68
R18       B2   →   C3   0.29   0.69
R19       B3   →   C3   0.21   0.64
R20       A1   →   C4   0.15   0.36
R21       A2   →   C4   0.15   0.26
R22       B3   →   C4   0.13   0.39
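The number of rules grows in the second iteration because the thresholds are halved again. A small sketch of the threshold schedule from STEP 7, assuming the reduction factor 2 used in this chapter and, as an assumption of the sketch, a predefined limit of y = 3 (reduced_thresholds and max_y are hypothetical names):

```python
def reduced_thresholds(min_sup, min_conf, r=2, max_y=3):
    """Return (y, reduced minSup, reduced minConf) for y = 1 .. max_y."""
    return [(y, min_sup / r**y, min_conf / r**y) for y in range(1, max_y + 1)]


for y, sup, conf in reduced_thresholds(0.5, 0.5):
    print(f"y = {y}: redMinSup = {sup}, redMinConf = {conf}")
```

For y = 2 both thresholds drop to 0.125, which is consistent with Table 4-9 containing rules whose support is as low as 0.13.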

(54) Table 4-10: The dataset after the second iteration of Phase 2

Tuple ID   A     B     C
T1         1     2     3
T2         1     3     ?=3
T3         ?=2   ?=2   3
T4         ?=2   2     4
T5         ?=2   3     ?=3
T6         ?=2   1     ?=3
T7         ?=2   2     3
T8         ?=2   ?=2   ?=3
T9         2     ?=2   3
T10        2     ?=2   ?=3
T11        2     ?=2   3
T12        2     ?=2   ?=3

STEP 13: If there is still a missing value for an attribute, the most frequent value (based on the partial support) of the attribute in the original dataset D is assigned to it. In this example, STEP 13 is skipped because no missing values remain. Phase 3 is then executed.

Phase 3

Phase 3 adjusts the assigned missing values through an iterative process to raise the precision. It first mines the association rules from the filled, complete dataset and then uses the rules to guess the originally missing values once more. If a guess differs from the previous one, the same process is repeated until the results converge (no longer change). The processing for the example in this phase is as follows.

STEP 14: The new set of association rules, AR’, is found from the updated dataset D’ (which is now complete) in a way similar to STEPs 1 to 3 in Phase 1. Because no values are missing, the traditional support and confidence are used, together with the original support and confidence thresholds. The new set of association rules is shown in Table 4-11.

Table 4-11: The new set of association rules after STEP 14

Rule ID    X     →    Y     Sup     Conf
R1         B2    →    A2    0.5     0.67
R2         C3    →    A2    0.58    0.63
R3         A2    →    B2    0.5     0.75
R4         C3    →    B2    0.67    0.73
R5         A2    →    C3    0.58    0.87
R6         B2    →    C3    0.67    0.89

STEP 15: The new set of association rules AR’ is used to adjust the previously filled-in missing values in the dataset. Take tuple 4 as an example. The missing value of attribute A in tuple 4 was assigned 2 by the processing in Phases 1 and 2. The value is then adjusted to 2 according to Rule 1 in Table 4-11. The results after this step are shown in Table 4-12, where the first number in a missing slot represents the previous guess and the second one represents the current guess after adjustment.
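STEPs 14 and 15 can be illustrated with a short Python sketch for single-item rules of the form X → Y, using the traditional definitions: support is the fraction of tuples containing both X and Y, and confidence is that count divided by the number of tuples containing X. The function names, the dictionary-based row format, and the choice of the highest-confidence rule when several rules apply are assumptions made for illustration, not details taken from the thesis.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Item = Tuple[str, int]                   # (attribute, value), e.g. ("B", 2)
Rule = Tuple[Item, Item, float, float]   # (X, Y, support, confidence)


def mine_rules(rows: List[Dict[str, int]], min_sup: float, min_conf: float) -> List[Rule]:
    """Mine rules X -> Y between single items of two different attributes."""
    n = len(rows)
    rules: List[Rule] = []
    for a, b in permutations(list(rows[0]), 2):
        # Every (value of a, value of b) pair that occurs in the data.
        for xa, yb in sorted({(r[a], r[b]) for r in rows}):
            both = sum(1 for r in rows if r[a] == xa and r[b] == yb)
            x_count = sum(1 for r in rows if r[a] == xa)
            sup, conf = both / n, both / x_count
            if sup >= min_sup and conf >= min_conf:
                rules.append(((a, xa), (b, yb), sup, conf))
    return rules


def guess(row: Dict[str, int], target: str, rules: List[Rule]) -> Optional[int]:
    """Guess row[target] from the applicable rule with the highest confidence
    (an assumed selection policy); return None if no rule applies."""
    usable = [r for r in rules if r[1][0] == target and row.get(r[0][0]) == r[0][1]]
    if not usable:
        return None
    return max(usable, key=lambda r: r[3])[1][1]


# A small hypothetical complete dataset (not taken from the thesis).
toy = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 2, "C": 3},
    {"A": 2, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 4},
]
toy_rules = mine_rules(toy, min_sup=0.5, min_conf=0.6)
```

With these thresholds, `guess(toy[1], "C", toy_rules)` re-derives the value of C in the second tuple from its A and B values, which mirrors how a previously filled slot is re-examined in STEP 15.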

Table 4-12: The dataset after adjustment

Tuple ID    A           B           C
T1          1           2           3
T2          1           3           ?=3, 3
T3          ?=2, 2      ?=2, 2      3
T4          ?=2, 2      2           4
T5          ?=2, 2      3           ?=3, 3
T6          ?=2, 2      1           ?=3, 3
T7          ?=2, 2      2           3
T8          ?=2, 2      ?=2, 2      ?=3, 3
T9          2           ?=2, 2      3
T10         2           ?=2, 2      ?=3, 3
T11         2           ?=2, 2      3
T12         2           ?=2, 2      ?=3, 3

STEP 16: The guessed missing values before and after adjustment are compared. In this example, the guessed values are the same as those obtained in Phase 2, so STEP 17 is executed.

STEP 17: The final association rules and the completed dataset are output to users. The results for this example are shown in Tables 4-13 and 4-14.

Table 4-13: The final association rules in the example

Rule ID    X     →    Y
R1         B2    →    A2
R2         C3    →    A2
R3         A2    →    B2
R4         C3    →    B2
R5         A2    →    C3
R6         B2    →    C3

Table 4-14: The finally completed dataset

            Attribute
Tuple ID    A     B     C
T1          1     2     3
T2          1     3     3
T3          2     2     3
T4          2     2     4
T5          2     3     3
T6          2     1     3
T7          2     2     3
T8          2     2     3
T9          2     2     3
T10         2     2     3
T11         2     2     3
T12         2     2     3
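The convergence test of Phase 3 (STEPs 14 to 16) is a fixed-point iteration and can be sketched in Python as below. The callables `mine` and `reguess` are hypothetical stand-ins for rule mining and for re-guessing a single slot; the safety cap `max_rounds` and the choice to re-guess every slot from the previous round's dataset are assumptions of this sketch rather than details stated in the thesis.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Row = Dict[str, int]
Slot = Tuple[int, str]  # (tuple index, attribute) of an originally missing value


def refine(
    rows: List[Row],
    slots: Set[Slot],
    mine: Callable[[List[Row]], Any],                    # hypothetical STEP 14
    reguess: Callable[[Row, str, Any], Optional[int]],   # hypothetical STEP 15
    max_rounds: int = 100,                               # assumed safety cap
) -> List[Row]:
    """Re-mine rules and re-guess the originally missing slots until the
    guesses stop changing."""
    rows = [dict(r) for r in rows]
    for _ in range(max_rounds):
        rules = mine(rows)                        # STEP 14: mine from filled data
        new_rows = [dict(r) for r in rows]
        for i, attr in sorted(slots):             # STEP 15: adjust filled slots
            value = reguess(rows[i], attr, rules)
            if value is not None:
                new_rows[i][attr] = value
        if new_rows == rows:                      # STEP 16: nothing changed
            break
        rows = new_rows
    return rows
```

Only the slots listed in `slots` are ever rewritten, so the values that were observed in the original dataset stay fixed throughout the iteration.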

CHAPTER 5 Experimental Results

In this thesis, experiments were conducted to show the performance of the proposed approach. All the experiments were performed on an Intel Core 2 Duo E8400 (3 GHz) PC with 2 GB of main memory, running the Windows XP Professional operating system. The proposed algorithm was implemented in Excel VBA and applied to two real datasets, Teaching Assistant Evaluation (TAE) and Tic-Tac-Toe, which were taken from the UCI Machine Learning Repository. Table 5-1 summarizes the two datasets and the thresholds used in the experiments. Each attribute in Teaching Assistant Evaluation and in Tic-Tac-Toe was randomly assigned different missing rates from 5% to 50%.

Table 5-1: Characteristics of the two experimental datasets and the threshold values used

Dataset        Tuple number    Attribute number    Missing rate    minSup    minConf    minRep
TAE            151             5                   5% - 50%        50%       50%        70%
Tic-Tac-Toe    958             9                   5% - 50%        50%       50%        70%

5.1 Experimental Results of the First Method

The robust association-rule (RAR) algorithm was also executed for comparison with the proposed iterative mining algorithm. In order to evaluate the performance of the proposed iterative mining algorithm, two measures, accuracy and recovery, are defined as follows:

\[ \text{accuracy} = \frac{|rrmv|}{|mv|} \qquad \text{(Eq. 1)} \]

\[ \text{recovery} = \frac{|rmv|}{|mv|} \qquad \text{(Eq. 2)} \]

where |mv| denotes the number of missing values in the dataset, |rrmv| denotes the number of correctly recovered missing values, and |rmv| denotes the number of recovered missing values, which may not be correct. Note that if a missing value cannot be recovered by any association rule, it is counted as a wrongly recovered missing value.

Figure 5-1 shows the accuracy rates generated by RAR-MVC and by our proposed method over various data missing rates on the TAE dataset. It can be seen from the figure that our proposed method had a higher accuracy than RAR-MVC.

Fig 5-1: The comparison of accuracy for RAR-MVC and our methods on TAE.

Figure 5-2 then shows the recovery rates generated by RAR-MVC and our

proposed method over various data missing rates on the TAE dataset. In this figure, our proposed method also had better recovery rates than RAR-MVC.

Fig 5-2: The comparison of recovery rates for RAR-MVC and our methods on TAE.

Figure 5-3 then shows the accuracy rates generated by RAR-MVC and our method over various data missing rates on the Tic-Tac-Toe dataset. In this figure, our proposed method had a better accuracy than RAR-MVC.

Fig 5-3: The comparison of accuracy for RAR-MVC and our method on

Tic-Tac-Toe.

Finally, Figure 5-4 shows the recovery rates generated by RAR-MVC and our method over various data missing rates on the Tic-Tac-Toe dataset. In this figure, our method demonstrated better recovery rates than RAR-MVC.

Fig 5-4: The comparison of recovery rates for RAR-MVC and our method on Tic-Tac-Toe.

From these figures, it can be observed that when the missing rate of a dataset is small, the accuracy rate of our proposed method is higher than that of the RAR approach. As the rate of missing values increases, the accuracy rate of the proposed method increases considerably. The reason is that we counted the missing values which cannot be recovered by association rules as wrongly recovered missing values, and our proposed methods are able to recover tuples whose values are all missing. As to the recovery rate, it can be observed that our method always performed better than the RAR approach. The reason is that our method used the iterative mechanism to recover all the missing values.
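The two measures of Eq. 1 and Eq. 2 can be computed with a few lines of Python. The sketch below assumes that each missing slot is represented by its true value and by the guessed value, with `None` standing for a slot that no association rule could recover; the function name and this representation are illustrative choices.

```python
from typing import List, Optional, Tuple


def accuracy_and_recovery(
    truth: List[int], recovered: List[Optional[int]]
) -> Tuple[float, float]:
    """Return (accuracy, recovery) over the missing slots.

    truth[i] is the true value of the i-th missing slot and recovered[i]
    is the guessed value, or None if the slot could not be recovered.
    """
    if len(truth) != len(recovered):
        raise ValueError("truth and recovered must have the same length")
    mv = len(truth)                                      # |mv|
    if mv == 0:
        raise ValueError("there are no missing values to evaluate")
    rmv = sum(1 for g in recovered if g is not None)     # |rmv|
    rrmv = sum(                                          # |rrmv|
        1 for t, g in zip(truth, recovered) if g is not None and g == t
    )
    return rrmv / mv, rmv / mv
```

For instance, with four missing slots whose true values are `[2, 2, 3, 3]` and guesses `[2, 1, None, 3]`, two guesses are correct and three slots are recovered, so the accuracy is 2/4 and the recovery is 3/4. The unrecovered slot lowers the accuracy because it stays in the denominator, which matches the convention that unrecoverable values count as wrongly recovered.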

5.2 Experimental Results of the Second Proposed Method

The robust association-rule (RAR) algorithm was also executed for comparison with the proposed iterative mining algorithm. In order to evaluate the performance of the proposed iterative mining algorithm, two measures, accuracy and recovery, are defined as follows:

\[ \text{accuracy} = \frac{|rrmv|}{|mv|} \qquad \text{(Eq. 3)} \]

\[ \text{recovery} = \frac{|rmv|}{|mv|} \qquad \text{(Eq. 4)} \]

where |mv| denotes the number of missing values in the dataset, |rrmv| denotes the number of correctly recovered missing values, and |rmv| denotes the number of recovered missing values, which may not be correct. Note that if a missing value cannot be recovered by any association rule, it is counted as a wrongly recovered missing value.

Figure 5-5 shows the accuracy rates generated by RAR-MVC and by our proposed method over various data missing rates on the TAE dataset. It can be seen from the figure that our proposed method had a higher accuracy than RAR-MVC.

Fig 5-5: The comparison of accuracy for RAR-MVC and our methods on TAE.

Figure 5-6 then shows the recovery rates generated by RAR-MVC and our proposed method over various data missing rates on the TAE dataset. In this figure, our proposed method also had better recovery rates than RAR-MVC.

Fig 5-6: The comparison of recovery rates for RAR-MVC and our methods on TAE.

Figure 5-7 then shows the accuracy rates generated by RAR-MVC and our

method over various data missing rates on the Tic-Tac-Toe dataset. In this figure, our proposed method had a better accuracy than RAR-MVC.

Fig 5-7: The comparison of accuracy for RAR-MVC and our method on Tic-Tac-Toe.

Finally, Figure 5-8 shows the recovery rates generated by RAR-MVC and our method over various data missing rates on the Tic-Tac-Toe dataset. In this figure, our method demonstrated better recovery rates than RAR-MVC.

Fig 5-8: The comparison of recovery rates for RAR-MVC and our method on Tic-Tac-Toe.

From these figures, it can be observed that when the missing rate of a dataset is small, the accuracy rate of our proposed method is higher than that of the RAR approach. As the rate of missing values increases, the accuracy rate of the proposed method increases considerably. The reason is that we counted the missing values which cannot be recovered by association rules as wrongly recovered missing values, and our proposed methods are able to recover tuples whose values are all missing. As to the recovery rate, it can be observed that our method always performed better than the RAR approach. The reason is that our method used the iterative mechanism to recover all the missing values.