Hiding Sensitive Frequent Itemsets through Transaction Addition


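National University of Kaohsiung, Department of Computer Science and Information Engineering (Graduate Institute). Master's Thesis. Hiding Sensitive Frequent Itemsets through Transaction Addition. Student: Chia-Ching Chang (張家境). Advisors: Dr. Tzung-Pei Hong (洪宗貝), Dr. Chun-Wei Lin (林浚瑋). July 2011.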

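Acknowledgements

When I used to read other people's theses, I would look at their acknowledgements and imagine what I would write if it were my turn. I never expected that I would now really be writing down my own feelings, and thinking back to my earlier self, I cannot help finding it a little amusing. These two years of graduate school went smoothly because of many people I need to thank and many things I am grateful for.

Entering graduate school after working for some time, I was very anxious. Without a computer science background, I worried whether I could cope with unfamiliar coursework alongside my job. Professor Tzung-Pei Hong was without doubt one of the forces that encouraged me to try. His professional ability is beyond question, but what I admire even more is his attitude and principles in dealing with things, which gave me valuable advice not only in my studies but also in daily life. In just two years I gained a great deal. Despite his busy schedule, he guided his students in both life and coursework, and during his two years as our class advisor he always cared about how everyone was doing, which I have always admired.

I thank the oral defense committee members, Dr. 蔡正發 and Dr. 林威成, for their guidance during my defense. When I was stuck in writing the thesis, my senior Chun-Wei stepped forward and led me step by step to complete it. I did not understand many things, and his patience impressed even me, a teacher in primary and secondary schools. In particular, while revising the thesis he had to patiently correct my mistakes. I am sincerely grateful for all his effort. Thank you!

The company of my labmates also added much color to these two years: seniors 國誠, 明泰, 韋體 and 欣怡, and 宗慶, 世濱, 峰世 and 柏諺. As a part-time student I was seldom in the lab, but whenever I was there I could see why you named it the "Happy Lab". Everyone got along pleasantly and without pressure, and perhaps that is the most precious thing of all.

Family support is what keeps a person going no matter how many difficulties arise. I thank my parents, especially my mother, who often told me to rest when I was tired. Most importantly, I thank my girlfriend 美鈴, who will soon become my family. From her joy at learning that I had been admitted to graduate school to her company when I was frustrated while writing the thesis, she has moved me deeply. Many times, without your encouragement I do not think I would have had the motivation to finish these two years of study. Thank you, I love you!

Chia-Ching Chang, at home, 2011.07.27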

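Hiding Sensitive Frequent Itemsets through Transaction Addition. Advisors: Dr. Tzung-Pei Hong and Dr. Chun-Wei Lin, Department of Computer Science and Information Engineering, National University of Kaohsiung. Student: Chia-Ching Chang, Department of Computer Science and Information Engineering, National University of Kaohsiung.

Abstract (translated from the Chinese abstract)

Data mining is mainly used to extract useful knowledge from large amounts of data in order to help companies make effective decisions. However, data collection and dissemination may lead to the risk of leaking private data. Sensitive information about individuals, enterprises or organizations should be protected by concealment when it is shared or published. Privacy-preserving data mining has therefore become an important research topic in recent years. In this thesis, we propose two methods that hide sensitive itemsets by adding new transactions. The first method is based on a greedy approach: it first computes the maximal number of new transactions needed to completely hide the sensitive itemsets and then performs the hiding. The length of each new transaction and the itemsets added to it are determined by the empirical rule of the standard normal distribution, which further reduces the side effects of hiding sensitive itemsets. In the second method, we propose an evolutionary privacy-preserving mining approach that finds the most suitable itemsets to add to the new transactions so that the sensitive itemsets are hidden. This method builds a flexible evaluation function from three factors, whose weights can be assigned according to user preference. In addition, the concept of pre-large itemsets is used to reduce the cost of rescanning the database and to speed up chromosome evaluation. Finally, experimental results are used to evaluate the performance of the algorithms.

Keywords: data mining, privacy preservation, greedy approach, genetic algorithm, pre-large itemset, sensitive itemset.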

Hiding Sensitive Frequent Itemsets through Transaction Addition. Advisors: Dr. Tzung-Pei Hong and Dr. Chun-Wei Lin, Department of Computer Science and Information Engineering, National University of Kaohsiung. Student: Chia-Ching Chang, Department of Computer Science and Information Engineering, National University of Kaohsiung.

ABSTRACT

Data mining technology is used to derive useful knowledge from large databases and can thus help companies make effective decisions. The process of data collection and dissemination may, however, pose privacy threats. Sensitive or personal information and knowledge of individuals, industries and organizations must be kept private before being shared or published. Privacy-preserving data mining (PPDM) has thus become an important issue in recent years. In this thesis, two approaches for hiding sensitive itemsets by inserting new transactions are proposed. The first is a greedy-based approach, which computes the maximal number of transactions to be inserted into the original database for totally hiding sensitive itemsets. The length of each inserted transaction and the itemsets within it are decided by heuristic rules, so that the side effects of hiding sensitive itemsets can be reduced. The second is an evolutionary privacy-preserving data mining method that finds appropriate itemsets for the inserted transactions in order to hide sensitive itemsets. It uses a flexible evaluation function with three factors, to which different weights are assigned depending on users' preference. The concept of pre-large itemsets is also used in the GA-based approach to reduce the cost of rescanning databases, thus speeding up the evaluation of chromosomes. Finally, experiments are conducted to evaluate the performance of the two proposed approaches.

Keywords: Data mining, Privacy preserving, Greedy approach, Genetic algorithm, Pre-large itemset, Sensitive itemset.

Contents

Acknowledgements
Abstract (Chinese)
ABSTRACT
List of Figures
List of Tables
Chapter 1 Introduction
  1.1. Motivation
  1.2. Contribution
  1.3. Thesis Organization
Chapter 2 Related Work
  2.1. Data Mining Approaches
  2.2. Data Sanitization
  2.3. Genetic Algorithms
  2.4. The Concept of Pre-Large Itemsets
Chapter 3 Problem Formulation
Chapter 4 A Greedy-based Approach
  4.1. The Proposed Greedy Algorithm
  4.2. An Example
Chapter 5 GA-based Approach
  5.1. Chromosome Representation
  5.2. Fitness Function
  5.3. Genetic Operators
    5.3.1. Crossover
    5.3.2. Mutation
    5.3.3. Selection
  5.4. The Proposed GA-based Algorithm
  5.5. An Example
Chapter 7 Conclusion and Future Work
References

List of Figures

Figure 2-1: Nine cases when new transactions are inserted into existing database
Figure 4-1: The probabilities of different lengths in standard normal distribution
Figure 5-1: The relationship of itemsets before and after the PPDM approach is processed
Figure 5-2: The set of sensitive itemsets that fail to be hidden
Figure 5-3: The set of missing itemsets
Figure 5-4: The set of artificial itemsets
Figure 5-5: The crossover operator
Figure 5-6: The mutation operator
Figure 5-7: The selection mechanism
Figure 5-8: The flowchart of the GA-based approach for PPDM
Figure 5-9: The probabilities of different lengths in standard normal distribution
Figure 5-10: k populations
Figure 6-1: Numbers of added transactions among three databases in different percentages of sensitive itemsets
Figure 6-2: Execution times among three databases in different percentages of sensitive itemsets
Figure 6-3: The evaluation of three side effects in Webview-1
Figure 6-4: The evaluation of three side effects in Webview-2
Figure 6-5: Execution time of the original GA-based approach and the GA-based approach with pre-large concept

List of Tables

Table 4-1: A database with 8 transactions
Table 4-2: All large itemsets
Table 4-3: The standardized values of transaction lengths
Table 4-4: The count differences of all frequent itemsets
Table 4-5: The process to add an itemset {bce}
Table 4-6: The process to add the itemset {ab}
Table 4-7: The process to add the small items
Table 4-8: The final updated database
Table 5-1: An example for a represented chromosome
Table 5-2: A database used as an example
Table 5-3: The large and pre-large itemsets
Table 5-4: The standardized values of transaction length
Table 5-5: The updated large and pre-large itemsets
Table 6-1: The details of the three databases

Chapter 1 Introduction

1.1 Motivation

In recent years, privacy-preserving data mining (PPDM) has become an important issue due to the rapid proliferation of electronic data in governments, industries and organizations. Such data may implicitly contain confidential information and lead to privacy threats if they are misused. As data mining technology rapidly progresses, it becomes easier to extract private information from user data. Private information includes confidential details such as income, medical history, address, credit card numbers, phone numbers and customer purchasing behaviors, among others. The range of private information may also extend to businesses. For business purposes, some information shared among companies may be extracted and analyzed by other partners, which may not only decrease the benefits of the companies but also threaten sensitive data. This has led to increasing concerns about the privacy of the underlying data and the implicit knowledge in the data.

In the past, many approaches have been proposed for privacy-preserving data mining [4, 8, 15-17, 23, 24-26], but most of them hide sensitive itemsets by deleting whole transactions or items within transactions. In real-world applications, however, deleting transactions from the original database may seriously affect decision making. In the first part of this thesis, a greedy-based approach that inserts new transactions into the original database is thus proposed. The empirical rules of the standard normal distribution are applied to determine the number of newly inserted transactions, the lengths of the inserted transactions, and the itemsets to be added to the inserted transactions.

Recently, genetic algorithms (GAs) [12] have become increasingly important for solving difficult problems since they can provide feasible solutions in a limited amount of time. For applying GAs to PPDM, Dehkordi [8] proposed an approach that encodes each transaction as a chromosome. Three operators, namely selection, crossover and mutation, were then used to adjust the population for hiding sensitive itemsets. The performance of that approach might, however, depend greatly on the number of transactions in the database. In the second part of this thesis, a GA-based approach is thus designed to evaluate the newly inserted transactions. It uses a flexible evaluation function with three factors, whose weights are assigned according to users' preference. The proposed approach also adopts the concept of pre-large itemsets to avoid rescanning the database during chromosome evaluation, which improves execution time. The two proposed approaches can thus make good trade-offs between privacy preservation and execution time. Several experiments are conducted to evaluate the execution time, the number of inserted transactions needed for hiding sensitive information, and the side effects of the two proposed algorithms.

1.2 Contribution

The contributions of this thesis are divided into two parts and described below.

1. We propose a greedy-based approach that inserts new transactions into the original database for hiding sensitive itemsets in PPDM. The empirical rules of the standard normal distribution are applied to determine the number of newly inserted transactions, the lengths of the inserted transactions, and the itemsets to be added to the inserted transactions.

2. We propose a GA-based approach and design a flexible evaluation function with three factors: the number of sensitive itemsets that fail to be hidden, the number of missing itemsets, and the number of artificial itemsets. The pre-large concept is also used in the proposed GA-based approach, thus reducing the computational cost of rescanning the database.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Related work is described in Chapter 2. The problem to be solved is stated in Chapter 3. The first proposed approach, which is greedy-based, is presented in Chapter 4. The second proposed approach, which is GA-based, is explained in Chapter 5. Experimental results are shown in Chapter 6. Conclusions and future work are given in Chapter 7.

Chapter 2 Related Work

In this chapter, related research is briefly reviewed. Data mining approaches are described in Section 2.1. Data sanitization is covered in Section 2.2. Genetic algorithms and the concept of pre-large itemsets are presented in Sections 2.3 and 2.4, respectively.

2.1 Data Mining Approaches

Data mining is most commonly used to induce association rules from transaction data [3, 5-6], such that the presence of certain items in a transaction implies the presence of some other items. For this purpose, the Apriori algorithm [3] and the FP-growth algorithm [10] are efficient approaches for deriving frequent itemsets in association rule mining. The former is a level-wise approach that generates and tests candidates, and the latter uses a tree structure to keep the frequent itemsets without candidate generation, thus reducing the computational cost of rescanning the database. In the Apriori algorithm, the database is first scanned to find the frequencies of items. An item is considered a large (frequent) item if its count (frequency) is larger than or equal to the minimum count threshold. Next, candidate 2-itemsets are formed by combining the large items. The counts of the candidate 2-itemsets are then checked against the minimum count threshold. This process is repeated until all large itemsets have been found. Association rules are then induced from the large itemsets found in the first phase. All possible association combinations for each large itemset are formed, and those with calculated confidence values larger than the minimum confidence are output as the desired association rules.

2.2 Data Sanitization

Years of effort in data mining have produced a variety of efficient techniques, which have also raised security problems and privacy threats. Privacy-preserving data mining (PPDM) has thus become a critical research issue for hiding confidential or secured information. Atallah et al. proposed a protection algorithm for data sanitization to avoid the inference of association rules [2]. It used both addition and deletion procedures to modify databases for hiding sensitive information. Dasseni et al. then proposed a hiding algorithm based on the Hamming-distance approach to reduce the confidence or support values of association rules [7]. Three heuristic hiding approaches were proposed to respectively increase the supports of antecedent parts, decrease the supports of consequent parts, and decrease the support of either the antecedent or the consequent parts. When the supports or the confidences of sensitive association rules fell below the corresponding minimum threshold, the sensitive association rules were hidden. Oliveira and Zaïane [21] also introduced a multiple-rule hiding approach to efficiently hide sensitive itemsets. It requires two database scans regardless of the number of sensitive itemsets. In the first scan, an index file was created to efficiently find sensitive itemsets within transactions. Three algorithms were then used in the second scan to remove a minimal number of individual items. Amiri then proposed three heuristic algorithms to hide multiple sensitive rules [1]. The first computes the union of the supporting transactions for all sensitive itemsets and removes the transaction that supports the most sensitive and the fewest non-sensitive itemsets. The second removes individual items from transactions instead of removing whole transactions. The third combines the previous two approaches to identify sensitive transactions and selectively delete items from these transactions until the sensitive knowledge has been hidden. Pontikakis et al. [22] then proposed two heuristic approaches based on data distortion. The first, the priority-based distortion algorithm (PDA), was designed to reduce the confidences of sensitive rules by decreasing consequent items. The second, the weight-based sorting distortion algorithm (WDA), was proposed to prioritize the selection of sanitized transactions. It used priority values to weight the transactions based on effective data structures. Hong et al. then proposed two approaches that delete either items within transactions or whole transactions from the original database for hiding sensitive itemsets [15-16].

Optimal sanitization of databases is regarded as an NP-hard problem. Atallah et al. [2] proved that selecting which data to modify or sanitize is NP-hard. Their proof was based on a reduction from the NP-hard hitting-set problem [9]: since the hitting-set problem is known to be NP-hard and can be reduced to the PPDM problem in polynomial time, the PPDM problem is NP-hard as well and cannot be solved in polynomial time unless P = NP. That paper provided a solid theoretical background explaining why PPDM is a difficult problem.

2.3 Genetic Algorithms

In the past, many heuristic algorithms have been developed for difficult optimization problems. Some nature-inspired approaches were proposed for this purpose, and one of the most commonly used is evolutionary computation, based on Darwin's theory: "Nature selects, the fittest survives". In 1975, Holland [12] proposed genetic algorithms (GAs), which have become increasingly important for solving difficult problems since they can provide feasible solutions in a limited amount of time [11]. GAs have been successfully applied to optimization fields [18-19], such as machine learning [19], neural networks [23] and fuzzy logic controllers [23], among others. According to the principle of survival of the fittest, GAs generate the next population through several operations, with each individual in the population representing a possible solution. In general, a genetic algorithm consists of five basic components, as summarized by Michalewicz [20]:

1. A genetic representation of solutions to the problem.
2. A way of generating the initial population.
3. An evaluation function for measuring the goodness of solutions.
4. Several genetic operators that alter the genetic composition of children.
5. Parameter values.

2.4 The Concept of Pre-Large Itemsets

Hong et al. proposed pre-large itemsets [13-14] for efficiently deriving the desired rules in incremental data mining. A pre-large itemset is not truly large (frequent), but may easily become large in the future through data insertion. Two support thresholds are defined in the pre-large concept: a lower support threshold and an upper support threshold. The upper support threshold is the same as the minimum support threshold in conventional mining algorithms. The count of an itemset must be larger than or equal to the upper count threshold for it to be considered a large itemset. The lower support threshold defines the lowest count for an itemset to be treated as a pre-large itemset. Pre-large itemsets act as a buffer in the incremental mining process, reducing the movement of itemsets directly from large to small and vice versa. An itemset with a count below the lower count threshold is a small itemset. The algorithm does not need to rescan the original database until a number of transactions have been processed. Since rescanning the database takes much computational time, the maintenance cost can be reduced based on the pre-large concept. The processing for transaction insertion is described below.

Considering an original database and newly inserted transactions under the two support thresholds [13], itemsets may fall into one of the nine cases illustrated in Figure 2-1.

Original database \ New transactions   Large itemsets   Pre-large itemsets   Small itemsets
Large itemsets                         Case 1           Case 2               Case 3
Pre-large itemsets                     Case 4           Case 5               Case 6
Small itemsets                         Case 7           Case 8               Case 9

Figure 2-1: Nine cases when new transactions are inserted into existing database

Cases 1, 5, 6, 8 and 9 will not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If all large and pre-large itemsets are retained with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to the old database is usually very small, and this becomes more apparent as databases grow larger. It has been formally shown that an itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is smaller than the safety number $f$ shown below [13]:

$$ f = \left\lfloor \frac{(S_u - S_l)\,d}{1 - S_u} \right\rfloor , $$

where $f$ is the safety number of new transactions, $S_u$ is the upper threshold, $S_l$ is the lower threshold, and $d$ is the number of transactions in the original database. Given the number of inserted transactions $f$ for which the original database need not be rescanned, the lower support threshold can be re-formulated as:

$$ S_l \le S_u - \frac{f\,(1 - S_u)}{d} . $$

The above formula will be used in the proposed GA-based approach to set an appropriate lower-bound threshold for efficient chromosome evaluation.
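As a minimal sketch of the two formulas above (not code from the thesis), the following Python functions compute the safety number $f$ and the largest lower threshold that still allows $f$ insertions without a rescan. The function names and the use of `fractions.Fraction` for exact arithmetic are assumptions made here for illustration, and the floor in the safety number follows the reading of the formula given above.

```python
from fractions import Fraction
from math import floor


def safety_number(s_u: Fraction, s_l: Fraction, d: int) -> int:
    """Safety number f = floor((S_u - S_l) * d / (1 - S_u))."""
    if not (0 <= s_l <= s_u < 1):
        raise ValueError("expected 0 <= S_l <= S_u < 1")
    return floor((s_u - s_l) * d / (1 - s_u))


def max_lower_threshold(s_u: Fraction, f: int, d: int) -> Fraction:
    """Largest S_l that satisfies S_l <= S_u - f * (1 - S_u) / d."""
    return s_u - Fraction(f) * (1 - s_u) / d
```

For example, with hypothetical thresholds of 50% (upper) and 40% (lower) on a database of 100 transactions, `safety_number(Fraction(1, 2), Fraction(2, 5), 100)` evaluates to 20, and `max_lower_threshold(Fraction(1, 2), 20, 100)` gives back 2/5. Exact fractions are used because floating-point rounding can push a quotient such as 20 slightly below the integer and change the floor.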

Chapter 3 Problem Formulation

In the problem of PPDM, some basic concepts are borrowed from association rule mining. It is therefore necessary to review association rule mining before exploring the issues of PPDM. The most popular algorithm is the Apriori algorithm proposed by Agrawal et al. [3]. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let $D$ be a set of transactions, where each transaction $T \in D$ consists of a set of items such that $T \subseteq I$. Each transaction $T$ has a unique identifier, called its TID. A set of items $X \subset I$ is called an itemset. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. Usually, $Y$ consists of only a single item. An association rule $X \Rightarrow Y$ holds in a database $D$ if the following two conditions are satisfied. The first is the support condition, which requires that at least s% of the transactions in $D$ contain $X \cup Y$. It can be thought of as a measure of the frequency of a rule, and is expressed by

$$ \frac{|X \cup Y|}{N} \ge s , $$

where $|X \cup Y|$ is the number of transactions in $D$ that contain $X \cup Y$ and $N$ is the number of transactions in $D$. The second is the confidence condition, which requires that at least c% of the transactions containing the itemset $X$ also contain $Y$. It is thus a measure of the strength of the rule, and is expressed by

$$ \frac{|X \cup Y|}{|X|} \ge c . $$

In privacy-preserving data mining, the set of sensitive itemsets $H = \{h_1, h_2, \ldots, h_i\}$ is normally defined by users. Sensitive itemsets are frequent itemsets that may contain confidential information. In this thesis, sensitive itemsets are hidden by adding a number of new transactions, which increases the minimum count threshold of $D$. Let the modified database be denoted $D'$. Each sensitive itemset will then not have enough support to be frequent in $D'$. In addition to hiding the sensitive itemsets from being mined, some other goals are set when the original database is sanitized. For example, all of the non-sensitive rules should still be successfully mined from the sanitized database $D'$. Besides, rules that are not found in the original database $D$ should not be generated from the sanitized database $D'$.

In this thesis, the sensitive itemsets are hidden by adding new transactions into the original database, thus increasing the minimum count threshold. Three factors, however, should be taken into consideration. First, the number of inserted transactions should be carefully determined so that the sensitive itemsets are totally hidden with minimal side effects. For this purpose, each sensitive itemset is evaluated in turn to find the maximal number of transactions to be inserted. Second, the length of each newly inserted transaction is calculated according to the empirical rules of the standard normal distribution. Last, the already existing large itemsets are added in turn into the newly inserted transactions according to the transaction lengths determined in the second step. This step avoids losing large itemsets (the missing-itemset side effect) and thus reduces the side effects of PPDM.

Thus, two approaches are proposed for PPDM in this thesis. The first algorithm is a greedy-based approach that iteratively adds the large itemsets into the inserted transactions to avoid the missing-itemset side effect. The second algorithm is a GA-based approach with a flexible evaluation function of three factors. Different weights are assigned to the three factors according to users' preference in the evaluation process. The pre-large concept is also applied to reduce the computational cost of rescanning the database, thus speeding up the evaluation of chromosomes.
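The support and confidence conditions, and the idea of hiding an itemset by enlarging the database, can be illustrated with a short Python sketch. The helper names and the four-transaction database are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from fractions import Fraction


def count(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    target = set(itemset)
    return sum(1 for t in db if target <= set(t))


def support(db, itemset):
    """Fraction of the transactions in db that contain itemset."""
    return Fraction(count(db, itemset), len(db))


def confidence(db, x, y):
    """Confidence of the rule X => Y: count(X u Y) / count(X)."""
    return Fraction(count(db, set(x) | set(y)), count(db, x))


# Hypothetical database and threshold (assumptions for this sketch).
db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"c"}]
min_sup = Fraction(1, 2)

# {a, b} appears in 2 of 4 transactions, so it is frequent.
frequent_before = support(db, {"a", "b"}) >= min_sup

# Insert one new transaction that does not contain {a, b}: its count
# stays at 2 while the database grows to 5 transactions.
db_sanitized = db + [{"c"}]
frequent_after = support(db_sanitized, {"a", "b"}) >= min_sup
```

Here `frequent_before` is true and `frequent_after` is false, because the count of {a, b} is unchanged while the minimum count rises from 2 to 2.5. The non-sensitive itemset {c} stays frequent in the sanitized database, which is the kind of side-effect control the two proposed approaches aim for.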

Chapter 4 A Greedy-based Approach

In this chapter, a greedy-based approach for data sanitization is introduced. Three steps are used to insert new transactions into the original database for hiding sensitive itemsets. In the first step, the safety bound of each sensitive itemset is calculated to determine how many transactions should be inserted, and the maximum of these safety bounds is taken as the number of inserted transactions. Next, the lengths of the inserted transactions are determined through the empirical rules of the standard normal distribution. In the third step, the count difference between the sensitive itemsets and the non-sensitive frequent itemsets is calculated at each k-level (k-itemsets). The non-sensitive frequent itemsets are then inserted into the transactions in descending order of their count difference. This ensures that the original frequent itemsets remain frequent after the transactions are inserted for hiding sensitive itemsets. The above steps are repeated until all the sensitive itemsets are hidden. The proposed algorithm is shown below.

4.1 The Proposed Greedy Algorithm

INPUT: A transaction dataset $D = \{T_1, T_2, \ldots, T_m\}$, a set of k frequent itemsets $FI = \{fi_1, fi_2, \ldots, fi_k\}$, a set of l infrequent (small) itemsets $I = \{i_1, i_2, \ldots, i_l\}$, a user-specified minimum support threshold $\alpha$, and a set of user-specified sensitive itemsets $S = \{si_1, si_2, \ldots, si_j, \ldots, si_p\}$.

OUTPUT: A sanitized dataset.

STEP 1: Calculate the value of the maximum safety bound (MSB) for the number of newly inserted transactions as:

$$ n = \max_{i=1}^{p} (SB_i), \qquad SB_i = \left\lfloor \frac{|si_i|}{\alpha} - m + 1 \right\rfloor , $$

where $SB_i$ is the safety bound of each sensitive itemset, $m$ is the number of original transactions in $D$, and $|si_i|$ is the count of sensitive itemset $si_i$.

STEP 2: Calculate the length $p_n$ of each inserted transaction in $d$ according to the empirical rules in standard normal distribution, where $d = \{d_1, d_2, \ldots, d_n\}$, and $n$ is the number of inserted transactions obtained in STEP 1.

STEP 3: Choose the itemsets to be inserted into each inserted transaction $d_n$. Do the substeps as follows.

Substep 3-1: Calculate the count difference ($CD_{fi_k}$) of each frequent itemset to be possibly inserted into the new transactions as:

$$ CD_{fi_k} = \left\lfloor |fi_k| - (m + n)\,\alpha \right\rfloor , $$

where $|fi_k|$ is the count (frequency) of an itemset $fi_k$, $m$ is the number of transactions in $D$ and $n$ is the number of transactions in $d$.

Substep 3-2: Put the frequent itemsets $fi_k$ with negative $CD_{fi_k}$ into the set of Insert_Items.

Substep 3-3: Sort the $fi_k$ in the set of Insert_Items in descending order of their lengths.

Substep 3-4: Sort the sorted results obtained in substep 3-3 in descending order of their $CD_{fi_k}$.

STEP 4: Process the inserted transactions $d_n$ one by one, adding the $fi_k$ in the set of Insert_Items according to the sorted order obtained in substep 3-4. Note that the length of an inserted itemset $fi_k$ must be no longer than $p_n$ in $d_n$, and the inserted itemset $fi_k$ in a transaction cannot form any super-itemset of a sensitive itemset in $S$.

STEP 5: Update (decrease) by 1 the value $CD_{fi_k}$ of the processed itemset $fi_k$ and of its corresponding sub-itemsets.

STEP 6: Repeat STEPs 4 to 5 until the set of Insert_Items is empty or no more itemsets can be inserted into $d_n$ under the constraints in STEP 4.

STEP 7: Add the small items in the set $I$ into $d_n$ while $d_n$ still has positions to be filled according to the empirical rules in standard normal distribution.

4.2 An Example

In this section, an example is used to illustrate the proposed algorithm step by step. Assume the database shown in Table 4-1 is used as an example. It consists of 8 transactions with 7 items, denoted a to g.
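Before the tables of the example, STEP 1 and Substeps 3-1 and 3-2 of Section 4.1 can be sketched in Python as follows. This is an illustrative reading, not the thesis implementation: the function names are hypothetical, the brackets in both formulas are assumed to denote the floor function (the extracted text does not preserve them), and the threshold is passed as an exact `Fraction` to avoid rounding problems.

```python
from fractions import Fraction
from math import floor


def max_safety_bound(sensitive_counts, alpha, m):
    """STEP 1: n = max over i of floor(count(si_i) / alpha - m + 1)."""
    return max(floor(Fraction(c) / alpha - m + 1)
               for c in sensitive_counts.values())


def count_differences(frequent_counts, alpha, m, n):
    """Substep 3-1: CD = floor(count - (m + n) * alpha) for each itemset."""
    return {fi: floor(c - (m + n) * alpha)
            for fi, c in frequent_counts.items()}


def insert_items(frequent_counts, alpha, m, n):
    """Substep 3-2: keep only itemsets with a negative count difference."""
    cd = count_differences(frequent_counts, alpha, m, n)
    return {fi: v for fi, v in cd.items() if v < 0}
```

With the sensitive itemsets of this example, `max_safety_bound({"c": 6, "be": 5, "abc": 4}, Fraction(1, 2), 8)` returns 5. A negative count difference means the itemset would drop below the raised minimum count of (8 + 5) × 0.5 = 6.5 unless it is added to some of the new transactions; for instance, an itemset with count 4 gets a count difference of -3. The sorting in Substeps 3-3 and 3-4 and the insertion loop of STEPs 4 to 7 are deliberately left out of this sketch.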

(25) Table 4-1: A database with 8 transactions

TID   Items
1     a, b, c, d, e
2     a, b, c, e
3     c, e
4     a, b, c, e
5     b, g
6     b, d, e, f
7     a, b, c, d
8     b, c, e, f

Assume the set of user-specified sensitive itemsets is S = {c:6, be:5, abc:4} and the minimum support threshold is set at 50%. The minimum count in this example is 0.5 × 8 = 4. The Apriori approach [3] is executed to find all frequent itemsets in Table 4-1, and the results are shown in Table 4-2. The sensitive itemsets are marked with an asterisk (*) in Table 4-2.

Table 4-2: All large itemsets

All items
Item   Count
a      4
b      7
c      6
d      3
e      6
f      2
g      1

Large 1-itemsets
Item   Count
a      4
b      7
c*     6
e      6

Large 2-itemsets
Item   Count
ab     4
ac     4
bc     5
be*    5
ce     5

Large 3-itemsets
Item   Count
abc*   4
bce    4

STEP 1: The number of inserted transactions is calculated first. There are three sensitive itemsets to be hidden. The safety bound for the sensitive itemset {c} is $\lfloor 6/0.5 - 8 + 1 \rfloor = 5$. The safety bounds for the sensitive itemsets {be} and {abc} are $\lfloor 5/0.5 - 8 + 1 \rfloor = 3$ and $\lfloor 4/0.5 - 8 + 1 \rfloor = 1$, respectively. The maximum safety bound (MSB) among the three sensitive itemsets is thus max(5, 3, 1) = 5, so the number of newly inserted transactions is initially set at 5.

(26) STEP 2: For the five inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rule of the standard normal distribution. In this example, the average length of the 8 transactions is (5 + 4 + 2 + 4 + 2 + 4 + 4 + 4)/8 = 3.625. The standard deviation is then calculated as

$$\sqrt{\frac{1}{8-1}\left[(5 - 3.625)^2 + (4 - 3.625)^2 + \cdots + (4 - 3.625)^2\right]} = 1.06.$$

The lengths of the transactions in the original database are standardized and shown in Table 4-3.

Table 4-3: The standardized values of transaction lengths

Length   Standardized
2        -1.53
3        -0.59
4        0.35
5        1.3

That is, the probabilities of lengths 2, 3, 4, and 5 are 13.5%, 34%, 34%, and 13.5%, respectively, as shown in Figure 4-1.

Figure 4-1: The probabilities of different lengths in standard normal distribution.
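The length statistics of STEP 2 can be reproduced with a short sketch. It assumes that each length is mapped to the probability mass of the one-sigma band its standardized value falls in under the empirical (68-95-99.7) rule; the helper name `band_probability` and that mapping are assumptions inferred from the percentages quoted in the example, not a definition given in the thesis.

```python
import statistics

lengths = [5, 4, 2, 4, 2, 4, 4, 4]  # transaction lengths from Table 4-1

mean = statistics.mean(lengths)
std = statistics.stdev(lengths)  # sample standard deviation (divides by n - 1)


def band_probability(z: float) -> float:
    """Empirical-rule mass of the one-sigma band containing z (assumed mapping)."""
    a = abs(z)
    if a < 1:
        return 0.34
    if a < 2:
        return 0.135
    return 0.0235 if a < 3 else 0.0015


for length in sorted(set(lengths)):
    z = (length - mean) / std
    print(length, round(z, 2), band_probability(z))
```

With these inputs the mean is 3.625 and the sample standard deviation rounds to 1.06, matching the values in the text.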

(27) The lengths of the five inserted transactions are then assigned according to Figure 4-1. Thus, the lengths of TIDs 9 to 13 are {4, 3, 4, 2, 3}, respectively.

STEP 3: The count difference of each frequent itemset in Table 4-2 is calculated. Take item {a} as an example. The count of {a} in the original database is 4. The minimum count in the updated database is (8 + 5) × 0.5 = 6.5. The count difference $CD_a$ is thus $\lfloor 4 - 6.5 \rfloor = -3$. The count differences of the other frequent itemsets are shown in Table 4-4.

Table 4-4: The count differences of all frequent itemsets

Large 1-itemsets
Item   CD
a      -3
b      0
e      1

Large 2-itemsets
Item   CD
ab     -3
ac     -3
bc     -2
ce     -2

Large 3-itemsets
Item   CD
bce    -3

In Table 4-4, only the itemsets with negative CD are considered for insertion. In this example, the itemsets {a:-3, ab:-3, ac:-3, bc:-2, ce:-2, bce:-3} satisfy the condition and are sorted by their lengths and then by their |CD| values. The sorted results are put into the set Insert_Items = {bce:3, ab:3, ac:3, bc:2, ce:2, a:3}.

STEP 4, 5 & 6: The itemsets are added to transactions 9 to 13 in the sorted order of Insert_Items. For example, the first itemset in Insert_Items is {bce:3}, indicating that {bce} can be added to three different inserted transactions. The results are shown in Table 4-5.
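The count difference and the two-level ordering of Insert_Items can be sketched as below. The sketch assumes itemsets are written as strings of single-letter items, uses the hypothetical helper name `count_difference`, and is applied only to the six itemsets that the example places in Insert_Items.

```python
import math


def count_difference(count: int, m: int, n: int, alpha: float) -> int:
    """Floor of count - (m + n) * alpha; negative means below the new minimum count."""
    return math.floor(count - (m + n) * alpha)


# Counts from Table 4-2 for the six itemsets that end up in Insert_Items.
counts = {"a": 4, "ab": 4, "ac": 4, "bc": 5, "ce": 5, "bce": 4}
m, n, alpha = 8, 5, 0.5

cd = {items: count_difference(c, m, n, alpha) for items, c in counts.items()}

# Keep negative CDs, then order by length (descending) and |CD| (descending).
# Python's sort is stable, so ties keep their original order.
insert_items = sorted(
    ((items, -d) for items, d in cd.items() if d < 0),
    key=lambda pair: (-len(pair[0]), -pair[1]),
)
print(insert_items)
# [('bce', 3), ('ab', 3), ('ac', 3), ('bc', 2), ('ce', 2), ('a', 3)]
```

The second component of each pair is |CD|, that is, how many more occurrences the itemset needs in the inserted transactions.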

(28) Table 4-5: The process to add the itemset {bce} (○ denotes a free position)

TID   Contents
9     bce○
10    bce
11    bce○
12    ○○
13    ○○○

Insert_Items
bce:3   3 times to be added
ab:3    3 times to be added
ac:3    3 times to be added
bc:2    2 times to be added
ce:2    2 times to be added
a:3     3 times to be added

After the itemset {bce} is inserted into transactions 9, 10, and 11, the count of {bce} in Insert_Items becomes 0. The counts of the corresponding sub-itemsets {bc, ce} are also decreased to 0. After that, Insert_Items = {ab:3, ac:3, a:3}. The itemset {ab} is then inserted into transactions 12 and 13. The results are shown in Table 4-6.

Table 4-6: The process to add the itemset {ab}

TID   Contents
9     bce○
10    bce
11    bce○
12    ab
13    ab○

Insert_Items
ab:1   1 time to be added
ac:3   3 times to be added
a:3    3 times to be added

Since transactions 9, 11, and 13 each have only one free position, there is no room left for the 2-itemsets {ab} and {ac}. In addition, the itemset {a} cannot be added to transactions 9 and 11, because those two transactions would then become {abce}, a super-itemset of the sensitive itemset {abc}. Thus, the greedy procedure terminates.

STEP 7: In Table 4-2, the small items in the original database are {d:3, f:2, g:1}. Since transactions 9, 11, and 13 each still have one free position, the small items are alternately

(29) selected by the empirical rule of the standard normal distribution. In this example, items {d}, {f}, and {g} are each added to a different transaction. The results are shown in Table 4-7.

Table 4-7: The process to add the small items

TID   Contents
9     bceg
10    bce
11    bcef
12    ab
13    abd

Small items
Item   Count
d      3 (+1)
f      2 (+1)
g      1 (+1)

The final updated database is shown in Table 4-8.

Table 4-8: The final updated database

TID   Items
1     a, b, c, d, e
2     a, b, c, e
3     c, e
4     a, b, c, e
5     b, g
6     b, d, e, f
7     a, b, c, d
8     b, c, e, f
9     b, c, e, g
10    b, c, e
11    b, c, e, f
12    a, b
13    a, b, d

(30) Chapter 5 GA-based Approach

In this chapter, a GA-based approach for sanitizing a database is proposed. It uses a genetic algorithm to choose the new transactions that are inserted to hide the sensitive itemsets. An evaluation function with three factors is designed in the proposed algorithm. Different weights are assigned to the three factors, according to the user's preference, to evaluate the fitness of the newly inserted transactions. The pre-large concept is also applied to reduce the computational cost of rescanning the database, thus speeding up the evaluation of chromosomes. The details of the proposed algorithm and an example are described below.

5.1 Chromosome Representation

In GAs, a chromosome represents a possible solution. In the proposed approach, at most m transactions are computed and inserted into the original database to hide the sensitive itemsets, such that the fitness value is optimized. A chromosome with m genes is thus used, with each gene representing a possible transaction to be inserted. The itemsets are assigned to each gene by the empirical rule of the standard normal distribution, forming a transaction to be inserted. Note that a gene must not be a super-itemset of a sensitive itemset, nor the sensitive itemset itself.

An example of the chromosome representation is given below. Assume the number of newly inserted transactions is computed as 3 by the proposed algorithm and the sensitive itemsets are {ab, acd}. A chromosome with three genes is shown in Table 5-1.
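The constraint on genes can be checked with a small sketch. It assumes genes and sensitive itemsets are represented as frozensets of single-letter items; the function names are hypothetical, and the data are the genes ade, ac, bc and the sensitive itemsets {ab, acd} from the example.

```python
def is_valid_gene(gene: frozenset, sensitive: list) -> bool:
    """A gene is valid if it contains no sensitive itemset as a subset."""
    return not any(s <= gene for s in sensitive)


def is_valid_chromosome(chromosome: list, sensitive: list) -> bool:
    """A chromosome is valid if every one of its genes is valid."""
    return all(is_valid_gene(g, sensitive) for g in chromosome)


sensitive = [frozenset("ab"), frozenset("acd")]
chromosome = [frozenset("ade"), frozenset("ac"), frozenset("bc")]

print(is_valid_chromosome(chromosome, sensitive))  # True
print(is_valid_gene(frozenset("abc"), sensitive))  # False: contains {a, b}
```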

(31) Table 5-1: An example of a represented chromosome

g1    g2    g3
ade   ac    bc

5.2 Fitness Function

In GAs, a fitness function is needed to evaluate the goodness of chromosomes. Different application domains may require different fitness functions according to the user's preference. The goal in PPDM is to hide the sensitive itemsets with minimal side effects. The relationship of itemsets before and after the PPDM process is depicted in Figure 5-1, where L represents the large itemsets of D, S represents the sensitive itemsets defined by users that are large, ~S represents the non-sensitive itemsets that are large, and L' represents the large itemsets after some records are inserted.

Figure 5-1: The relationship of itemsets before and after the PPDM approach is processed (regions L, S, ~S, and L').

(32) Let α be the number of sensitive itemsets that fail to be hidden, that is, the sensitive itemsets that still appear as large after the sanitization process. Ideally, α should be zero after the PPDM process. This set is shown in Figure 5-2, in which the α part is the intersection of S and L'.

Figure 5-2: The set of sensitive itemsets that fail to be hidden (region α within L, S, ~S, and L').

Similarly, let β be the number of missing itemsets, the second criterion in the evaluation process. A missing itemset is a non-sensitive large itemset in the original database that cannot be derived from the sanitized database. This side effect is shown in Figure 5-3, in which β is the difference of ~S and L'.

(33) Figure 5-3: The set of missing itemsets (region β within L, S, and L').

The last criterion in the evaluation process, γ, is the number of artificial itemsets. An artificial itemset is a large itemset that appears in the sanitized database but not in the original database. This side effect is shown in Figure 5-4, in which γ is the difference of L' and L.

(34) Figure 5-4: The set of artificial itemsets (region γ within L, ~S, S, and L').

From Figures 5-2 to 5-4, it can be seen that $\alpha = |S \cap L'|$, $\beta = |{\sim}S - L'| = |(L - S) - L'|$, and $\gamma = |L' - L|$. The fitness function used in this chapter is defined as follows:

$$fitness = \omega_1 \times \alpha + \omega_2 \times \beta + \omega_3 \times \gamma,$$

where $\omega_1$, $\omega_2$, and $\omega_3$ are the weighting parameters.

The pre-large concept is also used in the proposed algorithm to avoid database rescans and thus reduce the computational cost of the evaluation process. Traditional methods for evaluating the fitness value need to rescan the database to calculate the three numbers, which requires a lot of computation. This problem can be solved by the pre-large concept [13], since the pre-large itemsets act as a buffer that reduces the movement of itemsets directly from large to small and vice versa when transactions are

(35) inserted. When few transactions are inserted into the original database, the results can be derived without rescanning the whole database with the help of the stored pre-large itemsets. The concept of pre-large itemsets is used here to reduce the cost of rescanning the database and to speed up the evaluation of chromosomes.

5.3 Genetic Operators

Genetic operators are very important to the success of specific GA applications. The operators used in this chapter are described as follows.

5.3.1 Crossover

The crossover operator is the main genetic operator in GAs. It combines two chromosomes to generate new offspring. Many crossover operators have been designed, including the single-point crossover, the two-point crossover, the uniform crossover, and the arithmetic crossover, among others. In this chapter, the single-point crossover is used to generate new offspring, as shown in Figure 5-5. The crossover position is chosen randomly in the proposed approach.
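A single-point crossover on chromosomes stored as lists of genes might look like the sketch below. The function name, the optional `point` argument, and the sample parents are illustrative assumptions; only the operator itself (one random cut, tails swapped) comes from the text.

```python
import random


def single_point_crossover(parent_a, parent_b, point=None, rng=random):
    """Swap the tails of two equal-length chromosomes at one cut position."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of genes")
    if len(parent_a) < 2:
        # Nothing to exchange with fewer than two genes.
        return list(parent_a), list(parent_b)
    if point is None:
        # Cut strictly inside the chromosome so both children mix parents.
        point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


# Hypothetical parents with three genes each (one gene per inserted transaction).
p1 = ["ade", "ac", "bc"]
p2 = ["bd", "ce", "de"]
print(single_point_crossover(p1, p2, point=1))
# (['ade', 'ce', 'de'], ['bd', 'ac', 'bc'])
```

Because whole genes are exchanged, offspring of valid parents never contain a gene that was not already valid.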

(36) Figure 5-5: The crossover operator.

5.3.2 Mutation

The purpose of mutation is to diversify the search direction and prevent convergence to local optima. Mutation usually produces some random changes in chromosomes, but there is no guarantee that it will produce desirable features in the new chromosomes. In the proposed approach, the mutation operator changes an item within the selected gene of a chromosome to another item by the empirical rule of the standard normal distribution. The process is shown in Figure 5-6.
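A mutation of this kind could be sketched as follows. Note the simplifying assumptions: the replacement item is drawn uniformly at random rather than by the empirical rule used in the thesis, mutations that would make a gene contain a sensitive itemset are rejected, and the names `mutate` and `max_tries` are hypothetical.

```python
import random


def mutate(chromosome, items, sensitive, rng, max_tries=20):
    """Replace one item of one randomly chosen gene with a different item.

    Returns a new chromosome and leaves the input unchanged. If no valid
    replacement is found within max_tries, an unchanged copy is returned.
    """
    result = [set(g) for g in chromosome]
    for _ in range(max_tries):
        idx = rng.randrange(len(result))
        gene = result[idx]
        candidates = sorted(set(items) - gene)
        if not gene or not candidates:
            continue
        old = rng.choice(sorted(gene))
        new = rng.choice(candidates)
        mutated = (gene - {old}) | {new}
        # Reject genes that would contain a sensitive itemset.
        if not any(set(s) <= mutated for s in sensitive):
            result[idx] = mutated
            break
    return ["".join(sorted(g)) for g in result]


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(mutate(["ade", "ac", "bc"], items="abcde", sensitive=["ab", "acd"], rng=rng))
```

The gene length is preserved, so the transaction lengths assigned earlier are not disturbed by mutation.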

(37) Figure 5-6: The mutation operator.

5.3.3 Selection

The selection operation chooses some offspring for survival according to predefined rules. This keeps the population size under control. Many selection methods have been proposed, such as Elitism, Rank, Tournament, and Roulette-Wheel, among others. In this thesis, a hybrid selection method is proposed that combines the Elitism approach and the Rank approach. The chromosomes in the population are first sorted by their fitness values. The top k/2 chromosomes in the list are then selected for the next population, where k is the population size. Next, the other k/2 chromosomes are randomly selected from the original database for the next population. The selection mechanism is illustrated in Figure 5-7.
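The hybrid selection can be sketched under two assumptions that the text does not state explicitly: lower fitness is treated as better (the weighted sum of side effects is minimized), and the random half is produced by a caller-supplied factory, here the hypothetical `random_chromosome`. The toy fitness below is a stand-in only.

```python
import random


def hybrid_selection(population, fitness, random_chromosome, rng):
    """Keep the best k/2 chromosomes and fill the rest with random ones."""
    k = len(population)
    ranked = sorted(population, key=fitness)  # lower fitness first (assumption)
    elite = ranked[: k // 2]
    fresh = [random_chromosome(rng) for _ in range(k - len(elite))]
    return elite + fresh


items = "abcde"


def make(r):
    """Hypothetical factory: a chromosome of two random 2-item genes."""
    return ["".join(sorted(r.sample(items, 2))) for _ in range(2)]


def toy_fitness(chrom):
    """Stand-in fitness: counts occurrences of item 'a'."""
    return sum(g.count("a") for g in chrom)


rng = random.Random(1)
population = [make(rng) for _ in range(6)]
next_population = hybrid_selection(population, toy_fitness, make, rng)
print(len(next_population))  # 6
```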

(38) Figure 5-7: The selection mechanism (panels: old population; choose the best-fitness chromosomes; select k/2 chromosomes; new population of k chromosomes).

The flowchart of the proposed GA-based approach is shown in Figure 5-8. The algorithm is stated in the next section.

Figure 5-8: The flowchart of the GA-based approach for PPDM (nodes: Start; Generate initial population; Evaluate fitness value; Satisfied terminal condition, Yes/No; GA operations: Selection, Crossover, Mutation; Next generation; End).

(39) 5.4 The Proposed GA-based Algorithm

The proposed GA-based algorithm for PPDM is stated as follows.

The proposed algorithm:

INPUT: A transaction dataset D = {$T_1$, $T_2$, …, $T_m$}, a minimum support threshold α (the upper support threshold $S_u$), a set of user-specified sensitive itemsets S = {$si_1$, $si_2$, …, $si_j$, …, $si_p$}, and a population size k.

OUTPUT: A sanitized database.

STEP 1: Calculate the maximum safety bound (MSB), which gives the number $n$ of newly inserted transactions:

$$n = \max_{1 \le i \le p} SB_i, \qquad SB_i = \left\lfloor \frac{|si_i|}{\alpha} - m + 1 \right\rfloor,$$

where $SB_i$ is the safety bound of the sensitive itemset $si_i$, $m$ is the number of original transactions in D, and $|si_i|$ is the count of the sensitive itemset $si_i$.

STEP 2: Derive the lower support threshold $S_l$ as:

$$S_l \le S_u - \frac{n(1 - S_u)}{m}.$$

STEP 3: Scan the database to store the large and the pre-large itemsets.

STEP 4: Calculate the length $p_n$ of each inserted transaction in d according to the empirical rule of the standard normal distribution, where d = {$d_1$, $d_2$, …, $d_n$} and $n$ is the number of inserted transactions obtained in STEP 1.

(40) STEP 5: Randomly generate a population of k individuals with n genes according to the empirical rule of the standard normal distribution, with each gene being the itemset of a transaction to be inserted. Note that a gene must not be a super-itemset of any sensitive itemset.

STEP 6: Calculate the fitness value of each chromosome $C_i$ in the population as:

$$fitness(i) = \omega_1 \times \alpha_i + \omega_2 \times \beta_i + \omega_3 \times \gamma_i,$$

where $\omega_1$, $\omega_2$, and $\omega_3$ are weighting parameters, α is the number of sensitive itemsets that fail to be hidden, β is the number of missing itemsets, and γ is the number of artificial itemsets.

STEP 7: Execute the crossover operations on the population.

STEP 8: Execute the mutation operations on the population.

STEP 9: Choose the top k/2 chromosomes from the population and randomly select k/2 chromosomes from the original database to generate the k chromosomes of the next population.

STEP 10: If the termination criterion is not satisfied, go to STEP 6; otherwise, stop.
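The overall loop of STEPs 5 to 10 can be sketched generically. Everything here is an illustration under stated assumptions: `run_ga` and the toy operators are hypothetical, fitness is minimized, selection is applied to the pool of parents and offspring, and the toy problem (lists of small integers scored by their sum) merely stands in for chromosomes of inserted transactions.

```python
import random


def run_ga(init_population, fitness, crossover, mutate, select,
           max_generations=100, rng=None):
    """Evaluate, recombine, mutate, select; stop at fitness 0 or the limit."""
    rng = rng or random.Random()
    population = list(init_population)
    best = min(population, key=fitness)
    for _ in range(max_generations):
        if fitness(best) == 0:  # first termination criterion
            break
        offspring = []
        for a, b in zip(population[0::2], population[1::2]):
            offspring.extend(crossover(a, b, rng))        # STEP 7
        offspring = [mutate(c, rng) for c in offspring]   # STEP 8
        population = select(population + offspring, len(population), rng)  # STEP 9
        best = min(population + [best], key=fitness)
    return best


def toy_crossover(a, b, rng):
    p = rng.randint(1, len(a) - 1)
    return [a[:p] + b[p:], b[:p] + a[p:]]


def toy_mutate(c, rng):
    c = list(c)
    c[rng.randrange(len(c))] = rng.randint(0, 3)
    return c


def toy_select(candidates, k, rng):
    ranked = sorted(candidates, key=sum)
    fresh = [[rng.randint(0, 3) for _ in ranked[0]] for _ in range(k - k // 2)]
    return ranked[: k // 2] + fresh


rng = random.Random(7)
start = [[rng.randint(0, 3) for _ in range(4)] for _ in range(8)]
best = run_ga(start, sum, toy_crossover, toy_mutate, toy_select, rng=rng)
print(sum(best) <= min(sum(c) for c in start))  # True
```

Tracking `best` separately guarantees that the returned chromosome is never worse than the best one in the initial population, regardless of the random half introduced by selection.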

(41) 5.5 An Example

In this section, an example is given to illustrate the proposed GA-based approach. Assume the database shown in Table 5-2 is used. It consists of 11 transactions and 5 items, denoted a to e. Assume the sensitive itemsets are defined as {cd:5, bde:5} and the minimum support threshold is 40%. The procedure of the proposed GA-based algorithm is described below.

Table 5-2: A database used as an example

TID   Items
1     a, b, d, e
2     b, c, d
3     a, b, c, d, e
4     a, b, c, d, e
5     b, c, d, e
6     b, c, d, e
7     d
8     a, b, c
9     a, d, e
10    b, d
11    c, e

STEP 1: The number of inserted transactions is calculated first. There are 2 sensitive itemsets to be hidden. The safety bound for the sensitive itemset {cd} is $\lfloor 5/0.4 - 11 + 1 \rfloor = 2$. The safety bound for the sensitive itemset {bde} is $\lfloor 5/0.4 - 11 + 1 \rfloor = 2$. The maximum safety bound (MSB) among the two sensitive itemsets is thus max(2, 2) = 2, which is used as the number of newly inserted transactions.

STEP 2 & 3: After STEP 1, the number of newly inserted transactions is 2. The lower support threshold is then calculated as:

(42)
$$S_l \le 0.4 - \frac{2(1 - 0.4)}{11} = 0.29.$$

Table 5-3: The large and pre-large itemsets

Large 1-itemsets
Item   Count
a      5
b      8
c      7
d      9
e      7

Large 2-itemsets
Item   Count
bc     6
bd     7
be     5
cd     5
ce     5
de     6

Pre-large 2-itemsets
Item   Count
ab     4
ad     4
ae     4

Large 3-itemsets
Item   Count
bcd    5
bde    5

Pre-large 3-itemsets
Item   Count
bce    4
cde    4

STEP 4: For the two inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rule of the standard normal distribution. In this example, the average length of the 11 transactions is (4 + 3 + 5 + 5 + 4 + 4 + 1 + 3 + 3 + 2 + 2)/11 = 3.272. The standard deviation is then calculated as:

$$\sqrt{\frac{1}{11-1}\left[(4 - 3.272)^2 + (3 - 3.272)^2 + \cdots + (2 - 3.272)^2\right]} = 1.21.$$

The lengths of the transactions in the original database are standardized and shown in Table 5-4.

Table 5-4: The standardized values of transaction lengths

Length   Standardized
1        -1.87
2        -1.05
3        -0.22
4        0.6
5        1.42

(43) That is, the probabilities of lengths 1, 2, 3, 4, and 5 are 13.5%, 13.5%, 34%, 34%, and 13.5%, respectively, as shown in Figure 5-9.

Figure 5-9: The probabilities of different lengths in standard normal distribution.

The lengths of the two inserted transactions are then assigned according to Figure 5-9. Thus, the lengths of TIDs 12 and 13 are {2, 3}, respectively.

STEP 5: The number of newly inserted transactions was calculated as 2 in STEP 1. The length of a chromosome is therefore set at 2, indicating that two new transactions are inserted into the original database to hide the sensitive itemsets. The length of each gene is determined by the empirical rule of the standard normal distribution. Assume a population of k chromosomes is generated as shown in Figure 5-10.

Figure 5-10: k populations.
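The lower support threshold used in STEP 2 of this example can be reproduced with a brief sketch. The function name is hypothetical, and converting the two thresholds into minimum counts with a ceiling is an assumption made here for illustration.

```python
import math


def lower_support_threshold(s_u: float, n: int, m: int) -> float:
    """Upper limit on S_l given by S_l <= S_u - n * (1 - S_u) / m."""
    return s_u - n * (1 - s_u) / m


s_l = lower_support_threshold(s_u=0.4, n=2, m=11)
print(round(s_l, 2))  # 0.29

# Minimum counts implied by the two thresholds for the 11 original transactions.
print(math.ceil(s_l * 11), math.ceil(0.4 * 11))  # 4 5
```

Under that assumption, an itemset with count 4 lies between the two minimum counts, which agrees with the pre-large entries of Table 5-3.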

(44) STEP 6: The fitness value of each chromosome in the population is evaluated. Assume a chromosome in the current population whose two genes are the itemsets {a, b} and {b, c, e}, which are regarded as the two newly inserted transactions. To evaluate the chromosome, the large itemsets after the two transactions are added need to be obtained. These can be derived without rescanning the database with the aid of the pre-large itemsets. Since the minimum (upper) support threshold is set at 40%, the upper count threshold is updated to the ceiling of 13 × 40% (= 5.2), which is 6. The original large and pre-large itemsets are then updated according to the newly inserted transactions, with the results shown in Table 5-5. The final large itemsets are thus {a}, {b}, {c}, {d}, {e}, {bc}, {bd}, {be}, {ce}, and {de}.

Table 5-5: The updated large and pre-large itemsets

TID   Items
1     a, b, d, e
2     b, c, d
3     a, b, c, d, e
4     a, b, c, d, e
5     b, c, d, e
6     b, c, d, e
7     d
8     a, b, c
9     a, d, e
10    b, d
11    c, e
12    a, b
13    b, c, e

Large 1-itemsets
Item   Count
a      5+1=6
b      8+2=10
c      7+1=8
d      9
e      7+1=8

Large 2-itemsets
Item   Count
bc     6+1=7
bd     7
be     5+1=6
cd     5
ce     5+1=6
de     6

Pre-large 2-itemsets
Item   Count
ab     4+1=5
ad     4
ae     4

Large 3-itemsets
Item   Count
bcd    5
bde    5

Pre-large 3-itemsets
Item   Count
bce    4+1=5
cde    4

After the large itemsets are obtained, the number (α) of sensitive itemsets that fail to be hidden, the number (β) of missing itemsets, and the number (γ) of artificial itemsets can be easily

(45) obtained. In the above example, the set of sensitive itemsets that fail to be hidden is {cd, bde} ∩ {a, b, c, d, e, bc, bd, be, ce, de}, which is Ø. The value of α is thus 0. The set of missing itemsets is ({a, b, c, d, e, bc, bd, be, cd, ce, de, bcd, bde} − {cd, bde}) − {a, b, c, d, e, bc, bd, be, ce, de}, which is {bcd}. The value of β is thus 1. The set of artificial itemsets is Ø. The value of γ is thus 0. Let the three weighting parameters be 0.2, 0.3, and 0.5, respectively. The fitness value of the chromosome is then calculated as follows:

fitness = 0.2 × 0 + 0.3 × 1 + 0.5 × 0 = 0.3.

STEP 7: The crossover operation is executed on the population. An example was shown in Figure 5-5.

STEP 8: The mutation operation is executed on the population. An example was shown in Figure 5-6.

STEP 9: In the selection step, the top k/2 chromosomes from the current population and k/2 randomly chosen chromosomes from the original database are combined by the selection mechanism to form the new population.

STEP 10: If the new population does not satisfy the termination condition, STEPs 6 to 9 are repeated; otherwise, the algorithm stops. In the example, two criteria are used as termination conditions. One is that the fitness value of the best chromosome reaches 0, and the other is that a predefined number of generations is reached.
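The fitness evaluation in STEP 6 of this example can be reproduced from the stored counts alone, as in the sketch below. It assumes that only the large and pre-large itemsets of Table 5-3 need to be tracked (so artificial itemsets are looked for only among them), that itemsets and genes are strings of single-letter items, and that the function name `evaluate` is hypothetical.

```python
import math

# Stored counts from Table 5-3 (large and pre-large itemsets of the original database).
stored = {
    "a": 5, "b": 8, "c": 7, "d": 9, "e": 7,
    "bc": 6, "bd": 7, "be": 5, "cd": 5, "ce": 5, "de": 6,
    "ab": 4, "ad": 4, "ae": 4,
    "bcd": 5, "bde": 5, "bce": 4, "cde": 4,
}
m, s_u = 11, 0.4
sensitive = {"cd", "bde"}


def evaluate(chromosome, weights=(0.2, 0.3, 0.5)):
    """Return (alpha, beta, gamma, fitness) using only the stored counts."""
    old_min = math.ceil(s_u * m)
    new_min = math.ceil(s_u * (m + len(chromosome)))
    large_before = {i for i, c in stored.items() if c >= old_min}
    # Each gene is one inserted transaction; add 1 per gene containing the itemset.
    updated = {
        i: c + sum(set(i) <= set(gene) for gene in chromosome)
        for i, c in stored.items()
    }
    large_after = {i for i, c in updated.items() if c >= new_min}
    alpha = len(sensitive & large_after)                  # hiding failures
    beta = len((large_before - sensitive) - large_after)  # missing itemsets
    gamma = len(large_after - large_before)               # artificial itemsets
    w1, w2, w3 = weights
    return alpha, beta, gamma, w1 * alpha + w2 * beta + w3 * gamma


print(evaluate(["ab", "bce"]))  # (0, 1, 0, 0.3)
```

No transaction of the original database is read during the evaluation, which is the saving the pre-large concept is meant to provide.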

(46) Chapter 6 Experimental Results

Experiments were made to show the performance of the proposed approaches. They were performed on an Intel Core 2 CPU with 2 GB RAM running 64-bit Windows 7. The details of the three databases used in the experiments are shown in Table 6-1.

Table 6-1: The details of the three databases

Database        # of Transactions   # of Items   Maximum Transaction Size   Average Transaction Size
BMS-POS         515,597             1,657        164                        6.5
BMS-WebView-1   59,602              497          267                        2.5
BMS-WebView-2   77,512              3,340        161                        5.0

In the experiments, the minimum support thresholds were set at 9%, 3%, and 2% for the BMS-POS, BMS-WebView-1, and BMS-WebView-2 databases [27], respectively. The numbers of sensitive itemsets are defined as percentages of the frequent itemsets in the databases, which makes it easier to observe the performance of the proposed algorithms under different settings. For the first proposed greedy-based algorithm, the relationships between the numbers of inserted transactions, the execution time, and the side effects are compared in three different

(47) databases. The numbers of newly inserted transactions computed for the three databases are shown in Figure 6-1.

Figure 6-1: Numbers of added transactions among three databases in different percentages of sensitive itemsets.

Figure 6-1 shows that the BMS-POS database requires more transactions to be inserted, since it contains more information (transactions) than the others. Because the number of sensitive itemsets is calculated as a percentage of the frequent itemsets, more sensitive itemsets need to be hidden in this database. The execution times are also compared among the three databases for different percentages of sensitive itemsets. The results are shown in Figure 6-2.

(48) Figure 6-2: Execution times among three databases in different percentages of sensitive itemsets.

More computation time is also required for the BMS-POS database, since it has more sensitive itemsets to be hidden, for the same reason described above. The side effects of the proposed greedy-based approach are also evaluated, including the number of sensitive itemsets that fail to be hidden, the number of missing non-sensitive itemsets, and the number of artificial itemsets. The side effects for the BMS-WebView-1 and BMS-WebView-2 databases are shown in Figure 6-3 and Figure 6-4, respectively. Figures 6-3 and 6-4 show that the proposed greedy-based approach can totally hide the sensitive itemsets without any side effects.

(49) Figure 6-3: The evaluation of three side effects in Webview-1.

Figure 6-4: The evaluation of three side effects in Webview-2.

(50) For the second, GA-based algorithm, we evaluate the performance of the original GA-based approach and the GA-based approach with the pre-large concept on the BMS-POS database. The population size is set at 20, the crossover rate at 0.9, and the mutation rate at 0.1. The results are shown in Figure 6-5.

Figure 6-5: Execution time of the original GA-based approach and the GA-based approach with pre-large concept.

Figure 6-5 shows that the proposed GA algorithm with the pre-large concept can greatly reduce the execution time.

(51) Chapter 7 Conclusion and Future Work

In research on privacy-preserving data mining (PPDM), existing methods can normally be classified into two removal approaches. The first removes items from transactions, and the second removes transactions from the database. In real-world applications, however, important information and rules may thus be removed, causing unpredictable damage to industries or organizations. In this thesis, two approaches are proposed that insert new transactions into the original database to hide the sensitive itemsets efficiently.

In the first, greedy-based approach, the number of newly inserted transactions and the length of each inserted transaction are determined, the latter by the empirical rule of the standard normal distribution. The large itemsets in the original database are added to the inserted transactions to reduce the side effect of missing rules. This procedure is repeated until the set of sensitive itemsets becomes empty or there are no more large itemsets that can be added to the inserted transactions. After that, small items are added to the remaining positions of the inserted transactions according to the empirical rule of the standard normal distribution. The experimental results show the performance of the proposed greedy-based approach for inserting new transactions.

In the second, GA-based approach, each gene in a chromosome represents the itemset within an inserted transaction for hiding the sensitive itemsets. Three user-specified weights are assigned to three factors, namely hiding failures, missing itemsets, and artificial itemsets, to evaluate the fitness values of chromosomes. In addition, the pre-large concept is applied to the GA-based approach to reduce the computational cost of rescanning the database. In the experimental results, the original GA-based approach and the GA-based approach with pre-large concept are then

(52) compared to evaluate the performance on the BMS-POS database. The GA-based approach with the pre-large concept greatly reduces the computational cost and performs better than the original one.

In this thesis, new transactions are inserted into the original database to hide the sensitive itemsets. In real-world applications, modifying the itemsets within existing transactions can also be considered as another research issue. How to improve the performance of the proposed GA-based and greedy-based approaches can also be considered a critical research issue in PPDM.

(53) References

[1] A. Amiri, “Dare to share: Protecting sensitive knowledge with data sanitization,” Decision Support Systems, pp. 181-191, 2007.
[2] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. S. Verykios, “Disclosure limitation of sensitive rules,” Knowledge and Data Engineering Exchange Workshop, pp. 45-52, 1999.
[3] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” The ACM Special Interest Group on Management of Data Conference, pp. 207-216, 1993.
[4] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” The ACM Special Interest Group on Management of Data Conference, pp. 439-450, 2000.
[5] R. Agrawal and R. Srikant, “Fast algorithm for mining association rules,” The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[6] R. Agrawal, R. Srikant, and Q. Vu, “Mining association rules with item constraints,” The International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, 1997.
[7] E. Dasseni, V. S. Verykios, A. K. Elmagarmid, and E. Bertino, “Hiding association rules by using confidence and support,” The International Workshop on Information Hiding, pp. 369-383, 2001.
[8] M. N. Dehkordi, K. Badie, and A. K. Zadeh, “A novel method for privacy preserving in association rule mining based on genetic algorithms,” Journal of Software, vol. 4, no. 6, pp. 555-562, 2009.
[9] M. R. Garey and D. S. Johnson, “Computers and intractability: a Guide to the theory of NP-completeness,” W. H. Freeman & Co., New York, 1990.

(54) [10] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: a frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.
[11] A. Homaifar, S. Guan, and G. E. Liepins, “A new approach on the traveling salesman problem by genetic algorithms,” The International Conference on Genetic Algorithms, pp. 460-466, 1993.
[12] J. H. Holland, “Adaptation in natural and artificial systems,” University of Michigan Press, pp. 15, 1975.
[13] T. P. Hong, C. Y. Wang, and Y. H. Tao, “A new incremental data mining algorithm using pre-large itemsets,” Intelligent Data Analysis, vol. 5, no. 2, pp. 111-129, 2001.
[14] T. P. Hong and C. Y. Wang, “Maintenance of association rules using pre-large itemsets,” Intelligent Databases: Technologies and Applications, pp. 44-60, 2006.
[15] T. P. Hong, C. W. Lin, K. T. Yang, and S. L. Wang, “A heuristic data-sanitization approach based on tf-idf,” The International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 156-164, 2011.
[16] T. P. Hong, K. T. Yang, C. W. Lin, and S. L. Wang, “Evolutionary privacy-preserving data mining,” World Automation Congress, pp. 1-7, 2010.
[17] D. E. O'Leary, “Knowledge discovery as a threat to database security,” Knowledge Discovery in Databases, pp. 507-516, 1991.
[18] M. Mitchell, “An introduction to genetic algorithms,” MIT Press, 1996.
[19] Z. Michalewicz, “Genetic algorithms + data structures = evolution programs,” Springer-Verlag, 1994.
[20] Z. Michalewicz, “Evolutionary computation: practical issues,” The International Conference on Evolutionary Computation, pp. 30-39, 1996.
[21] S. R. M. Oliveira and O. R. Zaïane, “Privacy preserving frequent itemset mining,” IEEE

(55) International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002.
[22] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, “An experimental study of distortion-based techniques for association rule hiding,” IFIP Workshop on Database Security, pp. 325-339, 2004.
[23] E. Sanchez, T. Shibata, and L. A. Zadeh, “Genetic algorithms and fuzzy logic systems,” Soft Computing Perspectives, pp. 1-17, 1997.
[24] V. S. Verykios and A. Gkoulalas-Divanis, “Privacy-preserving data mining models and algorithms,” Advances in Database Systems, vol. 34, pp. 267-289, 2008.
[25] V. S. Verykios, A. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, “Association rule hiding,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 434-447, 2004.
[26] Z. Zhu, G. Wang, and W. Du, “Deriving private information from association rule mining results,” IEEE International Conference on Data Engineering, pp. 18-29, 2009.
[27] Z. Zheng, R. Kohavi, and L. Mason, “Real world performance of association rule algorithms,” The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 401-406, 2001.
