
Chapter 4 A Greedy-based Approach

4.2 An Example

This section illustrates the proposed algorithm step by step with an example.

Consider the database shown in Table 4-1, which consists of 8 transactions over 7 items, denoted a to g.


Table 4-1: A database with 8 transactions

TID Item

Assume the set of user-specified sensitive itemsets is S = {c:6, be:5, abc:4} and the minimum support threshold is set at 50%. The minimum count in this example is thus 0.5 × 8 = 4. The Apriori approach [3] is then executed to find all frequent itemsets in Table 4-1, and the results are shown in Table 4-2. Note that the sensitive itemsets are marked in red in Table 4-2.

Table 4-2: All large itemsets

All items Large 1-itemset Large 2-itemset Large 3-itemset

Item Count Item Count Item Count Item Count

STEP 1: The number of inserted transactions is calculated first. In this example, there are three sensitive itemsets to be hidden. The safety bound for the sensitive itemset {c} is calculated as $\lfloor 6/0.5 - 8 \rfloor + 1 = 5$. The safety bounds for the sensitive itemsets {be} and {abc} are respectively calculated as $\lfloor 6/0.5 - 8 \rfloor + 1 = 5$ and $\lfloor 4/0.5 - 8 \rfloor + 1 = 1$. The maximum safety bound (MSB) among the three sensitive itemsets is thus max(5, 5, 1) = 5, so the number of newly inserted transactions is initially set at 5.
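The safety-bound arithmetic of STEP 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming the bound has the form floor(count / threshold - |D|) + 1, which reproduces the values quoted in the worked examples; the function names `safety_bound` and `maximum_safety_bound` are hypothetical and not taken from the original text.

```python
from fractions import Fraction
from math import floor


def safety_bound(count, min_support, num_transactions):
    """Smallest number n of inserted transactions (none containing the itemset)
    such that count / (|D| + n) drops below the minimum support threshold."""
    # Fraction(str(...)) keeps thresholds such as 0.4 exact (assumed input: a decimal).
    ratio = Fraction(count) / Fraction(str(min_support))
    return floor(ratio - num_transactions) + 1


def maximum_safety_bound(counts, min_support, num_transactions):
    """MSB: the largest safety bound over all sensitive itemsets."""
    return max(safety_bound(c, min_support, num_transactions) for c in counts)


# Counts as they enter the calculation in STEP 1 (6, 6 and 4), with 8 transactions.
msb = maximum_safety_bound([6, 6, 4], 0.5, 8)
print(msb)
```

Exact rational arithmetic is used only to avoid floating-point surprises at the floor; plain floats would give the same result for the numbers in this example.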

STEP 2: For the five inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rules of the standard normal distribution. In this example, the average length of the 8 transactions is calculated as (5 + 4 + 2 + 4 + 2 + 4 + 4 + 4)/8 = 3.625. The standard deviation is then calculated as

$\sqrt{\frac{1}{8-1}\left[(5-3.625)^2 + (4-3.625)^2 + \cdots + (4-3.625)^2\right]} = 1.06$.

The length of transactions in the original database is standardized and shown in Table 4-3.

Table 4-3: The standardized values of transaction lengths

Length Standardized

2 -1.53

3 -0.59

4 0.35

5 1.3

That is, the probabilities of lengths 2, 3, 4, and 5 are (13.5%, 34%, 34%, 13.5%), as shown in Figure 4-1.

Figure 4-1: The probabilities of different lengths in standard normal distribution


The lengths of five inserted transactions are then assigned according to Figure 4-1. Thus, the lengths of TID 9 to 13 are {4, 3, 4, 2, 3}, respectively.
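The standardization in STEP 2 can be checked with the short sketch below. It assumes the sample standard deviation (division by n - 1), which matches the value 1.06 above, and it reads "empirical rules" as the 68-95-99.7 rule applied per one-sigma band; the helper names `standardize` and `empirical_band` are hypothetical.

```python
from statistics import mean, stdev

lengths = [5, 4, 2, 4, 2, 4, 4, 4]  # transaction lengths used in STEP 2

mu = mean(lengths)      # average length
sigma = stdev(lengths)  # sample standard deviation (divides by n - 1)


def standardize(length):
    """z-score of a transaction length."""
    return (length - mu) / sigma


def empirical_band(z):
    """Share (in %) of a normal population in the one-sigma band containing z,
    following the 68-95-99.7 rule; bands beyond 3 sigma are ignored here."""
    band = int(abs(z))  # 0: within 1 sigma, 1: between 1 and 2 sigma, ...
    return {0: 34.0, 1: 13.5, 2: 2.35}.get(band, 0.0)


z_scores = {n: round(standardize(n), 2) for n in (2, 3, 4, 5)}
probabilities = {n: empirical_band(standardize(n)) for n in (2, 3, 4, 5)}
print(z_scores)
print(probabilities)
```

How the resulting percentages are turned into concrete lengths for the inserted transactions is not specified in enough detail to reproduce here, so the sketch stops at the probabilities.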

STEP 3: The count difference of each frequent itemset in Table 4-2 is then calculated. Take item {a} as an example. The count of item {a} in the original database is 4. The minimum count in the updated database is calculated as (8 + 5) × 0.5 = 6.5. The count difference CDa is thus $\lfloor 4 - (8 + 5) \times 0.5 \rfloor = \lfloor 4 - 6.5 \rfloor = -3$. The count differences of the other frequent itemsets are shown in Table 4-4.

Table 4-4: The count differences of all frequent itemsets

Large 1-itemset    Large 2-itemset    Large 3-itemset
Item   CD          Item   CD          Item   CD
a      -3          ab     -3          bce    -3
b       0          ac     -3
e       1          bc     -2
                   ce     -2

In Table 4-4, only the itemsets with negative CD are considered for insertion. In this example, the itemsets {a:-3, ab:-3, ac:-3, bc:-2, ce:-2, bce:-3} satisfy the condition and are sorted according to their lengths and |CD| values. The sorted results are then put into the set Insert_Items = {bce:3, ab:3, ac:3, bc:2, ce:2, a:3}.
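The count difference and the ordering of Insert_Items in STEP 3 can be sketched as follows. The sketch assumes CD = floor(count - (|D| + n) × threshold) and a sort by itemset length and then |CD|, both descending, because that ordering reproduces the list given above; the function name `count_difference` is hypothetical.

```python
from math import floor


def count_difference(count, num_original, num_inserted, min_support):
    """CD = floor(count - (|D| + n) * s); negative when the itemset would
    fall below the minimum count of the updated database."""
    return floor(count - (num_original + num_inserted) * min_support)


# Count differences as listed in Table 4-4.
cd = {"a": -3, "b": 0, "e": 1, "ab": -3, "ac": -3, "bc": -2, "ce": -2, "bce": -3}

# Keep negative CDs, then sort by itemset length and |CD|, both descending.
# sorted() is stable, so ties keep the order in which they appear in the table.
insert_items = sorted(
    ((itemset, -value) for itemset, value in cd.items() if value < 0),
    key=lambda pair: (-len(pair[0]), -pair[1]),
)
print(insert_items)
```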

STEP 4, 5 & 6: The itemsets are then added into transactions 9 to 13 according to the sorted order in Insert_Items. For example, the first itemset in Insert_Items is {bce:3}, indicating that the itemset {bce} can be added into three different inserted transactions. The results are shown in Table 4-5.

Table 4-5: The process to add an itemset {bce}

After the itemset {bce} is inserted into transactions 9, 10, and 11, the count of {bce} in Insert_Items becomes 0. The counts of its sub-itemsets {bc} and {ce} are also decreased to 0. After that, Insert_Items = {ab:3, ac:3, a:3}. The itemset {ab} is then inserted into transactions 12 and 13. The results are shown in Table 4-6.

Table 4-6: The process to add the itemset {ab}

Since only one position remains in each of transactions 9, 11, and 13, there is no more space for the 2-itemsets {ab} and {ac}. Besides, the item {a} cannot be added into transactions 9 and 11, because those two transactions would then contain {abce}, a super-itemset of the sensitive itemset {abc}. Thus, the greedy procedure is terminated.

STEP 7: In Table 4-2, the small items in the original database are {d:3, f:2, g:1}. Since transactions 9, 11, and 13 each still have one position left for insertion, the small items are alternatively selected by the empirical rules of the standard normal distribution. In this example, items {d}, {f}, and {g} are respectively added into the three transactions. The results are shown in Table 4-7.

Table 4-7: The process to add the small items

The final updated database is shown in Table 4-8.

Table 4-8: The final updated database

TID Item


Chapter 5

GA-based Approach

In this chapter, a GA-based approach for sanitizing a database is proposed. It uses the genetic algorithm to insert new transactions that hide the sensitive itemsets. An evaluation function with three factors is designed in the proposed algorithm. Different weights are assigned to the three factors to evaluate the fitness of the newly inserted transactions according to the user's preference. The pre-large concept is also applied to reduce the computational cost of rescanning the database, thus speeding up the evaluation of chromosomes. The details of the proposed algorithm and an example are described below.

5.1 Chromosome Representation

In GAs, a chromosome represents a possible solution. In the proposed approach, at most m transactions are computed and inserted into the original database to hide the sensitive itemsets such that the fitness value is optimized. A chromosome with m genes is thus used, with each gene representing a possible transaction to be inserted. The itemsets are assigned to each gene by the empirical rules of the standard normal distribution, forming a transaction to be inserted. Note that a gene must not be a super-itemset of any sensitive itemset, nor a sensitive itemset itself. An example of the chromosome representation is given below. Assume the number of newly inserted transactions is computed as 3 by the proposed algorithm and the sensitive itemsets are {ab, acd}. A chromosome with three genes is shown in Table 5-1.

Table 5-1: An example for a represented chromosome

g1    g2    g3
ade   ac    bc
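The constraint that a gene must not contain any sensitive itemset can be expressed as a subset test. The following is a minimal sketch, assuming items are single characters and genes are written as strings; the helper names `contains_sensitive` and `is_valid_chromosome` are hypothetical.

```python
def contains_sensitive(gene, sensitive_itemsets):
    """True if the gene (a set of items) equals or is a super-itemset of
    any sensitive itemset."""
    items = set(gene)
    return any(set(s) <= items for s in sensitive_itemsets)


def is_valid_chromosome(chromosome, sensitive_itemsets):
    """A chromosome is valid when none of its genes contains a sensitive itemset."""
    return not any(contains_sensitive(g, sensitive_itemsets) for g in chromosome)


sensitive = ["ab", "acd"]
chromosome = ["ade", "ac", "bc"]  # the three genes of Table 5-1
valid = is_valid_chromosome(chromosome, sensitive)
print(valid)
```

A gene such as `"abd"` would be rejected because it contains the sensitive itemset {ab}.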

5.2 Fitness Function

In GAs, a fitness function is needed to evaluate the goodness of chromosomes. Different application domains may require different fitness functions according to the user's preference. The goal in PPDM is to hide the sensitive itemsets with minimal side effects. The relationship of itemsets before and after the PPDM process is depicted in Figure 5-1, where L represents the large itemsets of D, S represents the sensitive itemsets defined by users that are large, ~S represents the non-sensitive itemsets that are large, and L’ represents the large itemsets after some records are inserted.

Figure 5-1: The relationship of itemsets before and after the PPDM approach is processed

Let α be the number of sensitive itemsets that fail to be hidden, that is, sensitive itemsets that still appear after the sanitization process. Ideally, α should be zero after the PPDM process. This set is shown in Figure 5-2, in which the α part is the intersection of S and L’.

Figure 5-2: The set of sensitive itemsets that fail to be hidden

Similarly, let β be the number of missing itemsets, another criterion in the evaluation process. A missing itemset is a non-sensitive itemset that is large in the original database but cannot be derived from the sanitized database. This side effect is shown in Figure 5-3, in which β is the difference of ~S and L’.

Figure 5-3: The set of missing itemsets

Finally, γ is defined as the last criterion in the evaluation process, the number of artificial itemsets. An artificial itemset is a new large itemset that appears in the sanitized database but not in the original database. This side effect is shown in Figure 5-4, in which γ is the difference of L’ and L.

Figure 5-4: The set of artificial itemsets

From Figures 5-2 to 5-4, it can be seen that $\alpha = |S \cap L'|$, $\beta = |{\sim}S - L'| = |(L - S) - L'|$, and $\gamma = |L' - L|$. The fitness function used in this chapter is defined as follows:

$fitness = \omega_1 \times \alpha + \omega_2 \times \beta + \omega_3 \times \gamma$,

where ω1, ω2 and ω3 are the weighting parameters. The pre-large concept is also used in the proposed algorithm to avoid rescanning the database and thus reduce the computational cost of the evaluation process. It is described below.

Traditional methods for evaluating the fitness value need to rescan the database to calculate the three numbers, which requires considerable computation. This problem can be solved by the pre-large concept [13], since the pre-large itemsets act as a buffer that reduces the direct movement of itemsets from large to small and vice versa when transactions are inserted. When few transactions are inserted into the original database, the results can be easily derived from the stored pre-large itemsets without rescanning the whole database.

The concept of pre-large itemsets is used here to reduce the cost of rescanning a database and to speed up the evaluation process of chromosomes.

5.3 Genetic Operators

Genetic operators are very important to the success of specific GA applications. The operators used in this chapter are described as follows.

5.3.1 Crossover

The crossover operator is the main genetic operator in GAs. It combines two chromosomes to generate new offspring. Many crossover operators have been designed, including single-point crossover, two-point crossover, uniform crossover, and arithmetic crossover, among others. In this chapter, single-point crossover is used to generate new offspring, as shown in Figure 5-5. The crossover position is randomly chosen in the proposed approach.

Figure 5-5: The crossover operator
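Single-point crossover on chromosomes whose genes are candidate transactions can be sketched as below. This is only an illustration: the parent chromosomes are made-up examples, and the function name `single_point_crossover` and its `point` and `rng` parameters are hypothetical.

```python
import random


def single_point_crossover(parent1, parent2, point=None, rng=random):
    """Swap the gene tails of two equal-length chromosomes after `point`.
    Chromosomes are assumed to have at least two genes."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same number of genes")
    if point is None:
        # Choose a cut strictly inside the chromosome so both children mix parents.
        point = rng.randrange(1, len(parent1))
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


p1 = ["ade", "ac", "bc"]  # hypothetical parents, one gene per inserted transaction
p2 = ["be", "cd", "ae"]
c1, c2 = single_point_crossover(p1, p2, point=1)
print(c1, c2)
```

Because whole genes are exchanged, a child cannot contain a gene that was not already present in one of its parents, so genes that were valid before crossover remain valid.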

5.3.2 Mutation

The purpose of mutation is to diversify the search direction and prevent converging to local optima. Mutation usually produces some random changes in chromosomes, but there is no guarantee that mutation will produce desirable features in the new chromosomes. In the proposed approach, the adopted mutation operator will change an item within the selected gene of a chromosome to another item by empirical rules in standard normal distribution. The process is shown in Figure 5-6.


Figure 5-6: The mutation operator
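A mutation operator of this kind can be sketched as follows. Note the simplifying assumption: the replacement item is drawn uniformly at random here, whereas the text selects it by the empirical rules of the standard normal distribution. The function name `mutate` and the parameters `rng` and `max_tries` are hypothetical.

```python
import random


def mutate(chromosome, items, sensitive_itemsets, rng=random, max_tries=100):
    """Replace one item in one randomly chosen gene with an item not yet in it.

    Candidates that would make the gene contain a sensitive itemset are
    rejected; if none is found within max_tries, a copy is returned unchanged.
    Genes are assumed to be non-empty strings of single-character items.
    """
    for _ in range(max_tries):
        g = rng.randrange(len(chromosome))
        gene = sorted(chromosome[g])
        unused = [i for i in items if i not in gene]
        if not unused:
            continue
        # Uniform choice: an assumption, not the selection rule of the text.
        gene[rng.randrange(len(gene))] = rng.choice(unused)
        if not any(set(s) <= set(gene) for s in sensitive_itemsets):
            mutated = list(chromosome)
            mutated[g] = "".join(sorted(gene))
            return mutated
    return list(chromosome)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = ["ade", "ac", "bc"]
mutated = mutate(original, "abcde", ["ab", "acd"], rng=rng)
print(mutated)
```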

5.3.3 Selection

The selection operation chooses some offspring for survival according to predefined rules, which keeps the population size under control. Many selection methods have been proposed, such as Elitism, Rank, Tournament, and Roulette-Wheel, among others. In the proposed approach, a hybrid selection method combining the Elitism approach and the Rank approach is used. The chromosomes in the population are first sorted by their fitness values. The top k/2 chromosomes in the list are then selected for the next population, where k is the population size. Next, the other k/2 chromosomes are randomly selected from the original database for the next population. The selection mechanism is illustrated in Figure 5-7.

Figure 5-7: The selection mechanism
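The hybrid selection step can be sketched as below. Two assumptions are made: a lower fitness value is treated as better, since the fitness is a weighted count of side effects, and the randomly chosen half is produced by a caller-supplied function. The names `select_next_population` and `new_chromosome`, as well as the example scores, are hypothetical.

```python
def select_next_population(population, fitness, new_chromosome):
    """Keep the better half by fitness and refill the rest with fresh chromosomes.

    fitness: function mapping a chromosome to its fitness (lower is better here).
    new_chromosome: zero-argument function returning a randomly built chromosome.
    """
    k = len(population)
    elite = sorted(population, key=fitness)[: k // 2]
    fresh = [new_chromosome() for _ in range(k - len(elite))]
    return elite + fresh


# Hypothetical population of k = 4 chromosomes with made-up fitness values.
scores = {("ac", "bc"): 0.3, ("ad", "be"): 0.0, ("bd", "ce"): 0.8, ("ae", "cd"): 0.5}
population = list(scores)
next_population = select_next_population(
    population, scores.get, lambda: ("de", "ce")
)
print(next_population)
```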

The flowchart of the proposed GA-based approach is shown in Figure 5-8. The algorithm is then stated in the next section.

Figure 5-8: The flowchart of the GA-based approach for PPDM


5.4 The Proposed GA-based Algorithm

The proposed GA-based algorithm for PPDM is stated as follows.

The proposed algorithm:

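STEP 1: Calculate the maximum safety bound (MSB), which gives the number of newly inserted transactions, as:

$MSB = \max_i \left( \left\lfloor \frac{count(s_i)}{S_u} - m \right\rfloor + 1 \right)$,

where the term inside the maximum is the safety bound of each sensitive itemset $s_i$, m is the number of original transactions in D, $count(s_i)$ is the count of sensitive itemset $s_i$, and $S_u$ denotes the minimum (upper) support threshold.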

STEP 2: Derive the lower support threshold S

.

STEP 3: Scan the database to store the large and the pre-large itemsets.

STEP 4: Calculate the length pn of each inserted transaction in d according to the empirical rules in standard normal distribution, where d = {d1, d2, …, dn}, and n is the number of inserted transactions obtained in STEP 1.

STEP 5: Randomly generate a population of k individuals with n genes according to the empirical rules of the standard normal distribution, with each gene being the itemset of a transaction to be inserted. Note that a gene must not be a super-itemset of any sensitive itemset.

STEP 6: Calculate the fitness value of each chromosome Ci in the population as:

$fitness(i) = \omega_1 \times \alpha_i + \omega_2 \times \beta_i + \omega_3 \times \gamma_i$,

where ω1, ω2 and ω3 are weighting parameters, αi is the number of sensitive itemsets that fail to be hidden, βi is the number of missing itemsets, and γi is the number of artificial itemsets.

STEP 7: Execute the crossover operations on the population.

STEP 8: Execute the mutation operations on the population.

STEP 9: Choose the top k/2 chromosomes from the population and randomly select k/2 chromosomes from the original database to generate the k chromosomes in the next population.

STEP 10: If the termination criterion is not satisfied, go to STEP 6; otherwise, do the next step.

5.5 An Example

In this section, an example is given to illustrate the proposed GA-based approach. Assume a database shown in Table 5-2 is used as an example. It consists of 11 transactions and 5 items, denoted a to e. Assume the sensitive itemsets are defined as {cd:5, bde:5}. The minimum support threshold is defined as 40%. The procedure of the proposed GA-based algorithm is then described below.

Table 5-2: A database used as an example

TID Item

STEP 1: The number of inserted transactions is calculated first. In this example, there are 2 sensitive itemsets to be hidden. The safety bound for the sensitive itemset {cd} is calculated as $\lfloor 5/0.4 - 11 \rfloor + 1 = 2$. The safety bound for the sensitive itemset {bde} is likewise calculated as $\lfloor 5/0.4 - 11 \rfloor + 1 = 2$. The maximum safety bound (MSB) among the two sensitive itemsets is thus max(2, 2) = 2, which is used as the number of newly inserted transactions.

STEP 2 & 3: After STEP 1 is processed, the number of newly inserted transactions is defined as 2. The lower support threshold is then calculated as:


Table 5-3: The large and pre-large itemsets

Large 1-itemset Large 2-itemset Large 3-itemset

Item Count Item Count Item Count

STEP 4: For the two inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rules of the standard normal distribution. In this example, the average length of the 11 transactions is calculated as (4 + 3 + 5 + 5 + 4 + 4 + 1 + 3 + 3 + 2 + 2)/11 = 3.273. The standard deviation is then calculated as:

]

The lengths of the transactions in the original database are standardized and shown in Table 5-4.

Table 5-4: The standardized values of transaction length

Length Standardized

1 -1.87

That is, the probabilities of lengths 1, 2, 3, 4, and 5 are (13.5%, 13.5%, 34%, 34%, 13.5%), as shown in Figure 5-9.

Figure 5-9: The probabilities of different lengths in standard normal distribution

The lengths of two inserted transactions are then assigned according to Figure 5-9. Thus, the lengths of TID 12 and 13 are {2, 3}, respectively.

STEP 5: In this example, the number of newly inserted transactions was calculated as 2 in STEP 1. That is, the length of a chromosome is set at 2, indicating that two new transactions are inserted into the original database to hide the sensitive itemsets. The length of each gene is calculated by the empirical rules of the standard normal distribution. Assume a population of k chromosomes is generated as shown in Figure 5-10.

Figure 5-10: k populations

STEP 6: The fitness value of each chromosome in the population is evaluated. Assume a chromosome in the current population consists of the two itemsets {a, b} and {b, c, e}, which are considered as the two newly inserted transactions. To evaluate the chromosome, the large itemsets after the two transactions represented by the chromosome are added need to be obtained. They can be easily derived without rescanning the database with the aid of the pre-large itemsets.

Since the minimum (upper) support threshold is set at 40%, the upper count threshold is updated to the ceiling of 13 × 40% = 5.2, which is 6. The original large and pre-large itemsets are then updated according to the newly inserted transactions, with the results shown in Table 5-5. The final large itemsets are thus {a}, {b}, {c}, {d}, {e}, {bc}, {bd}, {be}, {ce}, and {de}.
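The update without a database rescan can be sketched as follows: the stored counts are incremented by scanning only the inserted transactions, and the result is compared with the new count threshold. The stored counts for {b}, {c}, and {bc} below are hypothetical placeholders and are not the values of Table 5-3; only the count of {cd} (5) comes from the example. The function names `update_counts` and `large_itemsets` are likewise hypothetical.

```python
from fractions import Fraction
from math import ceil


def update_counts(stored_counts, inserted_transactions):
    """Add the occurrences found in the inserted transactions to the stored
    counts of large and pre-large itemsets; the original database is not read."""
    updated = dict(stored_counts)
    for itemset in updated:
        items = set(itemset)
        updated[itemset] += sum(items <= set(t) for t in inserted_transactions)
    return updated


def large_itemsets(counts, num_transactions, min_support):
    """Itemsets whose count reaches ceil(num_transactions * min_support)."""
    threshold = ceil(num_transactions * Fraction(str(min_support)))
    return {itemset for itemset, c in counts.items() if c >= threshold}


# Hypothetical stored counts (placeholders), except cd:5 from the example.
stored = {"b": 8, "c": 6, "bc": 5, "cd": 5}
inserted = ["ab", "bce"]  # the two genes evaluated in STEP 6
updated = update_counts(stored, inserted)
large = large_itemsets(updated, 11 + 2, 0.4)
print(updated, large)
```

With these inputs the threshold is 6, so {cd} with a count of 5 is no longer large, in line with its absence from the final list of large itemsets above.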

Table 5-5: The updated large and pre-large itemsets

TID Item Large 1-itemset Large 2-itemset Large 3-itemset

1 a, b, d, e Item Count Item Count Item Count

After the large itemsets are obtained, the number (α) of sensitive itemsets that fail to be hidden, the number (β) of missing itemsets, and the number (γ) of artificial itemsets can be easily obtained. In the above example, the set of sensitive itemsets that fail to be hidden is {cd, bde} ∩ {a, b, c, d, e, bc, bd, be, ce, de}, which is Ø. The value of α is thus 0. The set of missing itemsets is ({a, b, c, d, e, bc, bd, be, cd, ce, de, bcd, bde} - {cd, bde}) - {a, b, c, d, e, bc, bd, be, ce, de}, which is {bcd}. The value of β is thus 1. The set of artificial itemsets is Ø. The value of γ is thus 0.

Let the three weighting parameters be set at 0.2, 0.3 and 0.5, respectively. The fitness value of the chromosome is then calculated as follows:

$fitness = 0.2 \times 0 + 0.3 \times 1 + 0.5 \times 0 = 0.3$.
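The evaluation in STEP 6 can be reproduced with plain set operations. The sketch below uses the itemsets and weights given in this example; the function names `side_effects` and `fitness` are hypothetical, and itemsets are written as strings of single-character items.

```python
def side_effects(large_before, sensitive, large_after):
    """Return (alpha, beta, gamma) as defined in Section 5.2."""
    alpha = len(sensitive & large_after)                   # hiding failures
    beta = len((large_before - sensitive) - large_after)   # missing itemsets
    gamma = len(large_after - large_before)                # artificial itemsets
    return alpha, beta, gamma


def fitness(alpha, beta, gamma, weights):
    """Weighted sum of the three side effects."""
    w1, w2, w3 = weights
    return w1 * alpha + w2 * beta + w3 * gamma


L_before = {"a", "b", "c", "d", "e", "bc", "bd", "be", "cd", "ce", "de", "bcd", "bde"}
S = {"cd", "bde"}
L_after = {"a", "b", "c", "d", "e", "bc", "bd", "be", "ce", "de"}

alpha, beta, gamma = side_effects(L_before, S, L_after)
value = fitness(alpha, beta, gamma, (0.2, 0.3, 0.5))
print(alpha, beta, gamma, value)
```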

STEP 7: The crossover operation is executed on the population. An example has been previously shown in Figure 5-5.

STEP 8: The mutation operation is executed on the population. An example has been previously shown in Figure 5-6.

STEP 9: In the selection step, the top k/2 chromosomes from the current population and k/2 chromosomes randomly chosen from the original database are combined to form the new population by the selection mechanism.

STEP 10: If the new population does not satisfy the termination condition, Steps 6 to 9 are repeated; otherwise, the algorithm stops. In the example, two criteria are used as termination conditions. One is that the fitness value of the best chromosome reaches 0, and the other is that a predefined number of generations is reached.

Chapter 6

Experimental Results

Experiments were conducted to show the performance of the proposed approaches. They were performed on an Intel Core 2 CPU with 2 GB of RAM running 64-bit Windows 7.

The details of the three databases used in the experiments are shown in Table 6-1.

Table 6-1: The details of the three databases

Database # of

Transactions # of Items

Maximum

In the experiments, the minimum support thresholds were set at 9%, 3% and 2% for the BMS-POS database, the BMS-WebView-1 database, and the BMS-WebView-2 database [27], respectively. The number of sensitive itemsets is defined as a percentage of the frequent itemsets in each database, which makes it more flexible to observe the performance of the proposed algorithm.

For the first proposed greedy-based algorithm, the relationships between the number of inserted transactions, the execution time, and the side effects are compared on the three databases. The numbers of newly inserted transactions computed for the three databases are shown in Figure 6-1.

Figure 6-1: Numbers of added transactions among three databases in different percentages of sensitive itemsets

Figure 6-1 shows that the BMS-POS database requires more transactions to be inserted since it contains more information (transactions) than the others. That is, more sensitive itemsets need to be hidden because the number of sensitive itemsets is calculated as a percentage of the frequent itemsets. In addition, the execution times are compared among the three databases for different percentages of sensitive itemsets. The results are shown in Figure 6-2.

Figure 6-2: Execution times among three databases in different percentages of sensitive itemsets

Likewise, more computation time is required for the BMS-POS database since it has more sensitive itemsets to be hidden, for the same reasons described above. The side effects of the proposed greedy-based approach are also evaluated, including the number of sensitive itemsets that fail to be hidden, the number of missing non-sensitive itemsets, and the number of artificial itemsets. The side effects for the BMS-WebView-1 and BMS-WebView-2 databases are evaluated to show the performance. The results are shown in Figure 6-3 and Figure 6-4, respectively. From Figure 6-3 and Figure 6-4, it can be seen that the proposed greedy-based
