Thesis Organization - Hiding Sensitive Frequent Itemsets through Transaction Insertion

Chapter 1 Introduction

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Related work is reviewed in Chapter 2. The problem to be solved is stated in Chapter 3. The first proposed approach, which is greedy-based, is presented in Chapter 4. The second proposed approach, which is GA-based, is explained in Chapter 5. Experimental results are reported in Chapter 6. Conclusions and future work are given in Chapter 7.

Chapter 2

Related Work

This chapter briefly reviews related research. Data mining approaches are described in Section 2.1 and data sanitization in Section 2.2. Genetic algorithms and the concept of pre-large itemsets are introduced in Sections 2.3 and 2.4, respectively.

2.1 Data Mining Approaches

Data mining is most commonly used to induce association rules from transaction data [3, 5-6], such that the presence of certain items in a transaction implies the presence of some other items. For this purpose, the Apriori algorithm [3] and the FP-growth algorithm [10] are two efficient approaches to deriving frequent itemsets in association rule mining. The former is a level-wise approach that generates and tests candidates. The latter uses a tree structure to keep the frequent itemsets without candidate generation, thus reducing the computational cost of rescanning the database. In the Apriori algorithm, the database is first scanned to find the counts of the items. An item is considered a large (frequent) item if its count (frequency) is larger than or equal to the minimum count threshold. Next, candidate 2-itemsets are formed by combining the large items, and the database is scanned to check whether the count of each candidate is larger than or equal to the minimum count threshold. This process is repeated level by level until all large itemsets have been found. Association rules are then induced from the large itemsets found in the first phase. All possible association combinations for each large itemset are formed, and those with calculated confidence values larger than or equal to the minimum confidence are output as the desired association rules.
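The level-wise search described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of [3]: the function name `apriori`, the toy database `db`, and the minimum count of 3 are assumptions made only for this example, and the rule-generation phase is omitted.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every large (frequent) itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: scan the database once to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Join: combine large (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Test: scan the database to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {s: c for s, c in counts.items() if c >= min_count}
        result.update(large)
        k += 1
    return result


# Hypothetical toy database (not Table 4-1 of this thesis).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
large_itemsets = apriori(db, min_count=3)
```

In this toy database every single item and every pair reaches the minimum count, while {a, b, c} is supported by only two transactions and is therefore not large.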

2.2 Data Sanitization

Years of effort in data mining have produced a variety of efficient techniques, which have in turn raised security problems and privacy threats. Privacy-preserving data mining (PPDM) has thus become a critical research issue for hiding confidential or secure information. Atallah et al. proposed a protection algorithm for data sanitization to avoid the inference of association rules [2]. It used both addition and deletion procedures to modify databases for hiding sensitive information. Dasseni et al. then proposed a hiding algorithm based on the Hamming-distance approach to reduce the confidence or support values of association rules [7]. Three heuristic hiding approaches were proposed to respectively increase the supports of antecedent parts, decrease the supports of consequent parts, and decrease the support of either the antecedent or the consequent parts. Once the supports or the confidences of the sensitive association rules fell below the minimum thresholds, the sensitive association rules were hidden. Oliveira and Zaïane [21] also introduced a multiple-rule hiding approach to efficiently hide sensitive itemsets. It requires only two database scans regardless of the number of sensitive itemsets. In the first scan, an index file is created to efficiently find sensitive itemsets within transactions. Three algorithms are then used in the second scan to remove a minimal number of individual items. Amiri then proposed three heuristic algorithms to hide multiple sensitive rules [1]. The first approach computes the union of the supporting transactions for all sensitive itemsets and removes the transactions that support the most sensitive and the fewest non-sensitive itemsets. The second removes individual items from transactions instead of removing whole transactions. The third combines the previous two approaches to identify sensitive transactions and to selectively delete items from these transactions until the sensitive knowledge has been hidden. Pontikakis et al. [22] then proposed two heuristic approaches based on data distortion. The first, the priority-based distortion algorithm (PDA), was designed to reduce the confidences of sensitive rules by decreasing consequent items. The second, the weight-based sorting distortion algorithm (WDA), was proposed to prioritize the selection of transactions to be sanitized. It used priority values to weight the transactions based on effective data structures. Hong et al. then proposed two approaches that hide sensitive itemsets by deleting either selected items within transactions or whole transactions from the original database [15-16].

Optimal sanitization of a database is regarded as an NP-hard problem. Atallah et al. [2] proved that selecting which data to modify or sanitize is NP-hard. Their proof was based on a reduction from the NP-hard hitting-set problem [9]: since the hitting-set problem can be reduced to the sanitization problem in polynomial time, the sanitization problem is at least as hard as the hitting-set problem. The PPDM problem is therefore NP-hard as well and is not expected to be solvable in polynomial time. That paper provided a solid theoretical basis for the claim that PPDM is a difficult problem.

2.3 Genetic Algorithms

Many heuristic algorithms have been developed for difficult optimization problems. Several nature-inspired approaches have been proposed for this purpose, and one of the most commonly used is evolutionary computation, which is based on Darwin's theory of natural selection: “Nature selects, the fittest survive”. In 1975, Holland [12] proposed genetic algorithms (GAs), which have become increasingly important for solving difficult problems because they can provide feasible solutions in a limited amount of time [11]. GAs have been successfully applied to optimization fields [18-19] such as machine learning [19], neural networks [23], and fuzzy logic controllers [23], among others. Following the principle of survival of the fittest, a GA generates the next population through several operations, with each individual in the population representing a possible solution. In general, a genetic algorithm consists of five basic components, as summarized by Michalewicz [20]:

1. A genetic representation of solutions to the problem.

2. A way of generating the initial population.

3. An evaluation function for measuring the goodness of solutions.

4. Several genetic operators that alter the genetic composition of children.

5. Parameter values.
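The five components listed above can be seen together in a minimal Python sketch. It is only an illustration under stated assumptions: the bit-string representation, binary tournament selection, one-point crossover, elitism, the parameter defaults, and the toy fitness function (the number of 1-bits) are all hypothetical choices for this example and are not the chromosome design or evaluation function of the GA-based approach proposed in this thesis.

```python
import random


def tournament(population, fitness, rng):
    """Binary tournament: pick two individuals and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    # Component 5: parameter values are the keyword arguments above.
    rng = random.Random(seed)
    # Components 1 and 2: bit-string representation, random initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # Component 3: the evaluation function is the `fitness` callable.
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # Component 4: genetic operators (one-point crossover, bit flip).
            cut = rng.randrange(1, n_genes)
            if rng.random() < crossover_rate:
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Toy problem: maximize the number of 1-bits in a 16-bit chromosome.
best = run_ga(sum, n_genes=16)
start = run_ga(sum, n_genes=16, generations=0)  # best of the initial population
```

Because the best individual is copied unchanged into every new population, the fitness of `best` can never be lower than that of `start`.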

2.4 The Concept of Pre-Large Itemsets

Hong et al. proposed pre-large itemsets [13-14] for efficiently deriving the desired rules in incremental data mining. A pre-large itemset is not truly large (frequent), but is likely to become large in the future through data insertion. Two support thresholds are defined in the pre-large concept: a lower support threshold and an upper support threshold. The upper support threshold is the same as the minimum support threshold in conventional mining algorithms: the count of an itemset must be larger than or equal to the upper count threshold for the itemset to be considered large. The lower support threshold defines the lowest count for an itemset to be treated as pre-large. An itemset with a count below the lower count threshold is regarded as a small itemset. Pre-large itemsets act as a buffer in the incremental mining process, reducing the movement of itemsets directly from large to small and vice versa. With this buffer, the algorithm does not need to rescan the original database until a certain number of transactions have been processed. Since rescanning the database takes much computational time, the maintenance cost can thus be reduced by the pre-large concept. The processing for transaction insertion is described below.

Given an original database, a set of newly inserted transactions, and the two support thresholds [13], an itemset falls into one of the nine cases illustrated in Figure 2-1.

Figure 2-1: Nine cases when new transactions are inserted into existing database

Cases 1, 5, 6, 8 and 9 do not affect the final association rules, according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If all large and pre-large itemsets are retained with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to the old database is usually very small, and this becomes more apparent as the database grows larger. It has been formally shown that an itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is smaller than the safety number f shown below [13]:

$$f = \left\lfloor \frac{(S_u - S_l)\,d}{1 - S_u} \right\rfloor$$

where f is the safety number of new transactions, S_u is the upper threshold, S_l is the lower threshold, and d is the number of transactions in the original database. When the number of inserted transactions f that should be processed without rescanning the original database is given, the lower support threshold can be re-formulated as:

$$S_l = S_u - \frac{f\,(1 - S_u)}{d}$$

The above formula will be used in the proposed GA-based approach to set an appropriate lower-bound threshold for efficient chromosome evaluation.
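The relation between the two thresholds and the safety number can be checked numerically with a short Python sketch. It assumes that the safety number for insertion takes the form f = ⌊(S_u − S_l)·d / (1 − S_u)⌋ and that the lower threshold is obtained by solving this expression for S_l without the floor. The function names and the sample values (S_u = 0.5, S_l = 0.3, d = 100) are hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def safety_number(s_u, s_l, d):
    """Number of new transactions that can be absorbed without a rescan."""
    return math.floor((s_u - s_l) * d / (1 - s_u))


def lower_threshold(s_u, f, d):
    """Lower support threshold implied by a given safety number f."""
    return s_u - f * (1 - s_u) / d


def classify(count, total, s_u, s_l):
    """Label an itemset as large, pre-large, or small by its count."""
    if count >= s_u * total:
        return "large"
    if count >= s_l * total:
        return "pre-large"
    return "small"


# Hypothetical values: S_u = 50%, S_l = 30%, 100 original transactions.
s_u, s_l, d = Fraction(1, 2), Fraction(3, 10), 100
f = safety_number(s_u, s_l, d)
recovered_s_l = lower_threshold(s_u, f, d)
```

With these sample values the safety number is 40, and substituting it back recovers the lower threshold of 30%.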

Chapter 3

Problem Formulation

In PPDM, some basic concepts are borrowed from association rule mining, so it is necessary to review association rule mining before exploring the issues of PPDM. The most popular algorithm is the Apriori algorithm, proposed by Agrawal et al. [3]. Let I = {i1, i2, …, im} be a set of items and D be a database of transactions, each of which is a subset of I. An association rule is an implication of the form X ⇒ Y, where X and Y are itemsets drawn from I.

An association rule X ⇒ Y holds in a database D if the following two conditions are satisfied.

The first is the support condition: at least s% of the transactions in D contain X ∪ Y. Support can be thought of as a measure of the frequency of a rule, and the condition is expressed as $\frac{|X \cup Y|}{N} \geq s$, where |X ∪ Y| is the number of transactions in D that contain X ∪ Y and N is the number of transactions in D. The second is the confidence condition: at least c% of the transactions that contain the itemset X also contain Y. Confidence is thus a measure of the strength of the rule, and the condition is expressed as $\frac{|X \cup Y|}{|X|} \geq c$.
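As a small illustration of the two conditions, the following Python sketch computes the support and confidence of a rule, assuming that support is |X ∪ Y| / N and confidence is |X ∪ Y| / |X|. The helper names and the toy database are hypothetical and are not taken from this thesis.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support_confidence(x, y, transactions):
    """Return (support, confidence) of the rule X => Y."""
    xy_count = count(set(x) | set(y), transactions)
    x_count = count(x, transactions)
    support = xy_count / len(transactions)
    confidence = xy_count / x_count if x_count else 0.0
    return support, confidence


def rule_holds(x, y, transactions, s, c):
    """Check the support condition and the confidence condition together."""
    support, confidence = support_confidence(x, y, transactions)
    return support >= s and confidence >= c


# Hypothetical toy database with five transactions.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
support, confidence = support_confidence({"a"}, {"b"}, db)
```

Here {a, b} appears in 3 of the 5 transactions and {a} in 4, so the rule a ⇒ b has a support of 60% and a confidence of 75%. It holds for s = 50% and c = 70%, but not for c = 80%.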

In privacy-preserving data mining, the set of sensitive itemsets H = {h1, h2, …, hi} is normally defined by users. The sensitive itemsets are frequent itemsets that may contain confidential information. In this thesis, the sensitive itemsets are hidden by inserting new transactions into D, which raises the minimum count threshold. Let the modified database be denoted D’. After sanitization, each sensitive itemset no longer has enough support to be frequent in D’. In addition to hiding the sensitive itemsets from being mined, other goals are also set when the original database is sanitized. For example, all of the non-sensitive rules should still be mined from the sanitized database D’. Besides, rules that are not found in the original database D should not be generated from the sanitized database D’.

In this thesis, the sensitive itemsets are hidden by adding new transactions to the original database, thus raising the minimum count threshold. Three factors, however, should be taken into consideration. First, the number of inserted transactions should be carefully determined so that the sensitive itemsets are completely hidden with minimal side effects. To this end, each sensitive itemset is evaluated individually, and the maximum of the resulting numbers is taken as the number of transactions to be inserted. Second, the length of each newly inserted transaction is calculated according to the empirical rule of the standard normal distribution. Last, the existing large itemsets are added in turn to the newly inserted transactions according to the transaction lengths determined in the second step. This step prevents large itemsets from becoming infrequent (missing failures), thus reducing the side effects of PPDM.

Thus, two approaches are proposed for PPDM in this thesis. The first is a greedy-based approach that iteratively adds large itemsets to the inserted transactions to avoid the side effect of missing failures. The second is a GA-based approach with a flexible evaluation function composed of three factors, whose weights are assigned according to users’ preferences in the evaluation process. The pre-large concept is also applied to reduce the computational cost of rescanning the database, thus speeding up the evaluation of chromosomes.

Chapter 4

A Greedy-based Approach

In this chapter, a greedy-based approach for data sanitization is introduced. It inserts new transactions into the original database in three steps to hide the sensitive itemsets.

In the first step, the safety bound of each sensitive itemset is calculated to determine how many transactions should be inserted. The maximum of the safety bounds over all sensitive itemsets is then taken as the number of inserted transactions.

Next, the lengths of the inserted transactions are determined through the empirical rule of the standard normal distribution. In the third step, the count difference of each non-sensitive frequent itemset, that is, the gap between its count and the minimum count of the updated database, is calculated at each k-level (k-itemset).

The non-sensitive frequent itemsets are then inserted into the transactions in descending order of their count differences. This ensures that the original frequent itemsets remain frequent after the new transactions are inserted to hide the sensitive itemsets. The above steps are repeated until all the sensitive itemsets are hidden. The proposed algorithm is given below.

4.1 The Proposed Greedy Algorithm

INPUT: A transaction dataset D = {T1, T2, …, Tm}, a set of k frequent itemsets FI = {fi_1, fi_2, …, fi_k}, a set of sensitive itemsets S, and a minimum support threshold α.

STEP 1: Calculate the maximum safety bound (MSB) for the number of newly inserted transactions as:

$$MSB = \max_{si_i \in S} SB_{si_i}, \qquad SB_{si_i} = \left\lfloor \frac{count(si_i)}{\alpha} - m \right\rfloor + 1,$$

where SB_{si_i} is the safety bound of each sensitive itemset, m is the number of original transactions in D, and count(si_i) is the count of the sensitive itemset si_i.

STEP 2: Calculate the length p_n of each inserted transaction in d according to the empirical rule of the standard normal distribution, where d = {d_1, d_2, …, d_n}, and n is the number of inserted transactions obtained in STEP 1.

STEP 3: Choose the itemsets to be inserted into each inserted transaction d_n. Do the substeps as follows.

Substep 3-1: Calculate the count difference (CD_{fi_k}) of each frequent itemset to be possibly inserted into the new transactions as:

$$CD_{fi_k} = \left\lfloor count(fi_k) - (m + n) \times \alpha \right\rfloor,$$

where count(fi_k) is the count (frequency) of the itemset fi_k, m is the number of transactions in D, and n is the number of transactions in d.

Substep 3-2: Put the frequent itemsets fi_k with negative CD_{fi_k} into the set of Insert_Items.

Substep 3-3: Sort the fi_k in the set of Insert_Items in descending order of their lengths.

Substep 3-4: Sort the itemsets of equal length in the results of Substep 3-3 in descending order of their |CD_{fi_k}|.

STEP 4: Process the inserted transactions d_n one by one, adding the fi_k in the set of Insert_Items according to the sorted order obtained in Substep 3-4. Note that the length of an inserted itemset fi_k must be no longer than p_n in d_n, and the inserted itemset fi_k must not form a super-itemset of any sensitive itemset in S within a transaction.

STEP 5: Update (decrease) by 1 the |CD| values of the processed itemset fi_k and of its corresponding sub-itemsets.

STEP 6: Repeat STEPs 4 to 5 until the set of Insert_Items is empty or no more itemsets can be inserted into d_n under the constraints in STEP 4.

STEP 7: Add the small items in the set I to d_n while d_n still has empty positions, according to the lengths determined by the empirical rule of the standard normal distribution.

4.2 An Example

In this section, an example is used to illustrate the proposed algorithm step by step.

Assume the database shown in Table 4-1 is used. It consists of 8 transactions with 7 items, denoted a to g.

Table 4-1: A database with 8 transactions

TID Item

Assume the set of user-specified sensitive itemsets S is {c:6, be:5, abc:4}, and the minimum support threshold is set at 50%. The minimum count in this example is thus 0.5 × 8 (= 4). The Apriori approach [3] is executed to find all frequent itemsets from Table 4-1, and the results are shown in Table 4-2. Note that the sensitive itemsets are marked in red in Table 4-2.

Table 4-2: All large itemsets

All items Large 1-itemset Large 2-itemset Large 3-itemset

Item Count Item Count Item Count Item Count

STEP 1: The number of inserted transactions is first calculated. In this example, there are three sensitive itemsets to be hidden. The safety bound for the sensitive itemset {c} is calculated as $\lfloor (6/0.5) - 8 \rfloor + 1$ (= 5). The safety bounds for the sensitive itemsets {be} and {abc} are respectively calculated as $\lfloor (6/0.5) - 8 \rfloor + 1$ (= 5) and $\lfloor (4/0.5) - 8 \rfloor + 1$ (= 1). The maximum safety bound (MSB) among the three sensitive itemsets is thus max(5, 5, 1) (= 5), and the number of newly inserted transactions is initially set at 5.
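The safety-bound calculation of STEP 1 can be sketched in Python as follows, assuming that the bound of a sensitive itemset is ⌊count/α − m⌋ + 1, which is the form suggested by the arithmetic of this example. The function names are hypothetical. The counts are those listed in S, with m = 8 and α = 0.5.

```python
import math


def safety_bound(count, m, alpha):
    """Smallest number n of inserted transactions (none containing the
    itemset) such that count < alpha * (m + n)."""
    return math.floor(count / alpha - m) + 1


def max_safety_bound(sensitive_counts, m, alpha):
    """Maximum safety bound (MSB) over all sensitive itemsets."""
    return max(safety_bound(c, m, alpha) for c in sensitive_counts.values())


# Counts of the sensitive itemsets as listed in S.
sensitive = {"c": 6, "be": 5, "abc": 4}
msb = max_safety_bound(sensitive, m=8, alpha=0.5)
```

The maximum is determined by {c}: with 5 inserted transactions its count of 6 falls below the new minimum count of 0.5 × 13 = 6.5, whereas with only 4 insertions it would still reach the minimum count of 6. The resulting MSB of 5 agrees with the value used in the example.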

STEP 2: For the five inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rule of the standard normal distribution. In this example, the average length of the 8 transactions is calculated as (5 + 4 + 2 + 4 + 2 + 4 + 4 + 4)/8 (= 3.625). The standard deviation is then calculated as $\sqrt{\frac{1}{8-1}\left[(5-3.625)^2 + (4-3.625)^2 + \cdots + (4-3.625)^2\right]}$ (= 1.06).

The transaction lengths in the original database are standardized as shown in Table 4-3.

Table 4-3: The standardized values of transaction lengths

Length Standardized

2 -1.53

3 -0.59

4 0.35

5 1.3

That is, the probabilities of lengths 2, 3, 4, and 5 are 13.5%, 34%, 34%, and 13.5%, respectively, as shown in Figure 4-1.

Figure 4-1: The probabilities of different lengths in standard normal distribution

The lengths of the five inserted transactions are then assigned according to Figure 4-1. Thus, the lengths of TIDs 9 to 13 are {4, 3, 4, 2, 3}, respectively.
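The mean, standard deviation, and standardized lengths of STEP 2 can be reproduced with Python's standard library. This sketch assumes the sample standard deviation (divisor 8 − 1), which is the form that yields 1.06 for the eight transaction lengths of this example. The variable names are hypothetical.

```python
import statistics

# Lengths of the 8 original transactions, as listed in STEP 2.
lengths = [5, 4, 2, 4, 2, 4, 4, 4]

mean = statistics.mean(lengths)   # average transaction length
std = statistics.stdev(lengths)   # sample standard deviation (divisor n - 1)

# Standardized value (z-score) of each candidate length.
standardized = {
    length: round((length - mean) / std, 2) for length in (2, 3, 4, 5)
}
```

The standardized values obtained this way, -1.53, -0.59, 0.35, and 1.3, agree with Table 4-3.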

STEP 3: The count difference of each frequent itemset in Table 4-2 is then calculated. Take the item {a} as an example. The count of {a} in the original database is 4. The minimum count in the updated database is (8 + 5) × 0.5 (= 6.5). The count difference CD_a is thus $\lfloor 4 - 6.5 \rfloor$ (= -3). The count differences of the other frequent itemsets are shown in Table 4-4.

Table 4-4: The count differences of all frequent itemsets

Large 1-itemset    Large 2-itemset    Large 3-itemset
Item  CD           Item  CD           Item  CD
a     -3           ab    -3           bce   -3
b      0           ac    -3
e      1           bc    -2
                   ce    -2

In Table 4-4, only the itemsets with negative CD values are considered for insertion. In this example, the itemsets {a:-3, ab:-3, ac:-3, bc:-2, ce:-2, bce:-3} satisfy the condition and are sorted by their lengths and then by their |CD| values. The sorted results, recorded with their |CD| values, form the set Insert_Items = {bce:3, ab:3, ac:3, bc:2, ce:2, a:3}.
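The selection and ordering of Insert_Items can be sketched in Python as follows. The sketch assumes that the count difference is ⌊count − (m + n) × α⌋ and that itemsets are ordered by descending length and then by descending |CD|, with ties kept in their listed order. Representing an itemset as a string of single-letter items, and the function and variable names, are choices made only for this illustration. The CD values are those of Table 4-4.

```python
import math


def count_difference(count, m, n, alpha):
    """Gap between an itemset's count and the updated minimum count."""
    return math.floor(count - (m + n) * alpha)


# Count differences of the non-sensitive frequent itemsets (Table 4-4).
cd = {"a": -3, "b": 0, "e": 1, "ab": -3, "ac": -3,
      "bc": -2, "ce": -2, "bce": -3}

# Substep 3-2: keep only the itemsets with a negative CD, stored as |CD|.
insert_items = {itemset: -value for itemset, value in cd.items() if value < 0}

# Substeps 3-3 and 3-4: longer itemsets first, then larger |CD| first.
# sorted() is stable, so ties keep the order in which they were listed.
order = sorted(insert_items, key=lambda s: (-len(s), -insert_items[s]))
```

For the item {a}, `count_difference(4, 8, 5, 0.5)` gives -3, and `order` lists the itemsets as bce, ab, ac, bc, ce, a, which is the order of Insert_Items in the example.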

STEPs 4, 5 & 6: The itemsets are then added to transactions 9 to 13 according to the sorted order in the set of Insert_Items. For example, the first itemset in Insert_Items is {bce:3}, indicating that the itemset {bce} can be added to three different inserted transactions. The results are shown in Table 4-5.

Table 4-5: The process to add an itemset {bce}

After the itemset {bce} is inserted into transactions 9, 10, and 11, the count of {bce} in Insert_Items becomes 0. The counts of the corresponding sub-itemsets {bc, ce} are also updated (decreased) to 0. After that, Insert_Items = {ab:3, ac:3, a:3}. The itemset {ab} is then inserted into transactions 12 and 13. The results are shown in Table 4-6.

Table 4-6: The process to add the itemset {ab}

Since there is only one position left in each of transactions 9, 11, and 13, there is no more space for the 2-itemsets {ab} and {ac}. Besides, the item {a} cannot be added to transactions 9 and 11 because those two transactions would then contain {abce}, a super-itemset of the sensitive itemset {abc}. Thus, the greedy procedure terminates.

STEP 7: In Table 4-2, the small items are {d:3, f:2, g:1} in the original database. Since
