
Chapter 6 A GA-Based Approach for PPDM

6.4 The Proposed GA-based Algorithm for PPDM

The proposed GA-based algorithm for PPDM is stated as follows.

The algorithm:

INPUT: A transaction dataset D, a minimum support threshold s, a set of user-defined sensitive itemsets, the maximum number m of transactions to be hidden, and a population size n.

OUTPUT: An appropriate set of transactions to be hidden.

STEP 1: Derive the lower support threshold S_l as:

$$S_l = s \times \left(1 - \frac{m}{|D|}\right)$$

STEP 2: Scan the database to find the large and the pre-large itemsets.

STEP 3: Randomly generate a population of n individuals with m genes, with each gene being the ID number of the transaction to be hidden.

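STEP 4: Calculate the fitness value of each chromosome C_i in the population as:

$$\mathit{fitness}(C_i) = w_1 \times \alpha + w_2 \times \beta + w_3 \times \gamma$$

where α is the number of sensitive itemsets that fail to be hidden, β is the number of missing itemsets, γ is the number of artificial itemsets, and w_1, w_2, and w_3 are the three weight parameters.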

STEP 5: Execute the crossover operations on the population.

STEP 6: Execute the mutation operations on the population.

STEP 7: Choose the top b chromosomes from the population and randomly select p chromosomes from the original database to generate the n chromosomes in the next population.

STEP 8: If the termination criterion is not satisfied, go to Step 4; otherwise, do the next step.

STEP 9: Output the hidden transaction numbers in the best chromosome to users.
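
The control flow of Steps 3 to 9 can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `run_ga`, the single-point crossover, the per-gene mutation rate, and the choice b = n/2 are assumptions made for the sketch, and the fitness function is passed in as a callable so that any evaluation scheme can be plugged in. Lower fitness is treated as better, since a fitness of 0 is the stopping target.

```python
import random

def run_ga(tids, m, n, fitness, generations=50, mutation_rate=0.1, seed=0):
    """Sketch of Steps 3-9: evolve chromosomes of m transaction IDs.

    tids: list of candidate transaction IDs; m: genes per chromosome (m >= 2);
    n: population size (n >= 2); fitness: callable, lower is better.
    """
    rng = random.Random(seed)
    # Step 3: random initial population, each gene is a transaction ID.
    population = [rng.sample(tids, m) for _ in range(n)]
    for _ in range(generations):
        # Step 5: single-point crossover on random parent pairs (assumed operator).
        offspring = []
        for _ in range(n // 2):
            p1, p2 = rng.sample(population, 2)
            cut = rng.randrange(1, m)
            offspring.append(p1[:cut] + p2[cut:])
            offspring.append(p2[:cut] + p1[cut:])
        # Step 6: mutation replaces a gene with a random transaction ID (assumed operator).
        for child in offspring:
            for i in range(m):
                if rng.random() < mutation_rate:
                    child[i] = rng.choice(tids)
        # Steps 4 and 7: rank by fitness, keep the top b, refill with
        # p = n - b chromosomes drawn at random from the database.
        ranked = sorted(population + offspring, key=fitness)
        b = n // 2
        population = ranked[:b] + [rng.sample(tids, m) for _ in range(n - b)]
        # Step 8: stop early when the best chromosome has fitness 0.
        if fitness(population[0]) == 0:
            break
    # Step 9: report the best chromosome found.
    return min(population, key=fitness)

# Toy usage with a hypothetical fitness: count genes outside a target set.
best = run_ga(list(range(1, 11)), m=4, n=6,
              fitness=lambda c: len(set(c) - {2, 3, 4, 5}))
```

Crossover in this sketch may place the same transaction ID in a chromosome twice; a fitness function can handle that by treating the chromosome as a set.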

6.5 An Example

In this section, an example is given to demonstrate the proposed GA-based algorithm for privacy-preserving data mining. Assume the original database includes 10 transactions, as shown in Table 6-1. Each transaction consists of its transaction identification (TID) and items. Also assume that the set of sensitive itemsets is {be, bce} and the minimum support threshold s is 40%. Thus, the upper threshold count S_u is 10 × 40% = 4. Let the maximum number m of deleted transactions be 4. The proposed algorithm processes the data as follows.

Table 6-1: The original database in the example (columns: TID, Items)

STEP 1: The lower threshold S_l is calculated as 0.24 according to the formula

$$S_l = s \times \left(1 - \frac{m}{|D|}\right) = 0.4 \times \left(1 - \frac{4}{10}\right) = 0.24$$
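
The value in Step 1 can be checked with a short sketch. It assumes the lower threshold has the form S_l = s × (1 − m/|D|), which reproduces the stated value of 0.24 for s = 40%, m = 4, and |D| = 10; the function name is illustrative only.

```python
import math

def lower_support_threshold(s, m, db_size):
    """Assumed form of Step 1: S_l = s * (1 - m / |D|)."""
    return s * (1 - m / db_size)

# Values from the example: s = 40%, m = 4 deletions, |D| = 10 transactions.
s_lower = lower_support_threshold(0.4, 4, 10)
```

With this threshold, an itemset whose original count is below S_l × |D| = s × (|D| − m) cannot reach the count s × (|D| − k) required after any k ≤ m deletions, which is why only the large and pre-large itemsets need to be tracked.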

STEP 2: After the lower threshold is obtained, the original database is processed to get the large and the pre-large itemsets. The large itemsets with different numbers of items are shown in Tables 6-2, 6-3 and 6-4, respectively. The pre-large itemsets with different numbers of items are shown in Tables 6-5, 6-6 and 6-7.

Table 6-2: Large 1-itemsets

STEP 3: A population of n individuals with m genes is randomly generated. In this example, at most four transactions are to be deleted, so each chromosome is composed of four genes. For example, assume the four transactions T2, T3, T4, and T5 are randomly selected to form an initial chromosome. Its chromosome representation is thus (2, 3, 4, 5).

STEP 4: The fitness value of each chromosome in the population is evaluated. Assume the chromosome (2, 3, 4, 5) exists in the current population. To evaluate it, the large itemsets that remain after the four transactions represented by the chromosome are deleted must be obtained. They can easily be derived without rescanning the database with the aid of the pre-large itemsets. First, the database size d is updated from 10 to 6 when the four transactions are deleted. Since the minimum support threshold is 40%, the upper count threshold is updated to the ceiling of 6 × 40% = 2.4, which is 3. The counts of the original large and pre-large itemsets are then updated according to the items in the deleted transactions, with the results shown in Figure 6-10. The final large itemsets are thus {a}, {b}, {c} and {bc}.

Figure 6-10: The large itemsets after the four transactions are deleted
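
The count update in Step 4 can be sketched as below. The toy database `DB` is hypothetical: it is not Table 6-1, but was constructed so that it yields the same large itemsets quoted in the example before and after deletion. All function names are illustrative. Only the deleted transactions are scanned to adjust the tracked counts; the full database is scanned once, up front.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy database (NOT Table 6-1): TID -> items, one letter per item.
DB = {1: "ab", 2: "abe", 3: "bce", 4: "abce", 5: "bce",
      6: "bce", 7: "abd", 8: "bc", 9: "ae", 10: "bc"}

def itemset_counts(transactions):
    """Count every non-empty itemset contained in the given transactions."""
    counts = Counter()
    for items in transactions:
        for k in range(1, len(items) + 1):
            for combo in combinations(sorted(items), k):
                counts["".join(combo)] += 1
    return counts

def tracked_itemsets(db, s_lower):
    """Large and pre-large itemsets: support of at least the lower threshold."""
    counts = itemset_counts(db.values())
    return {i: c for i, c in counts.items() if c >= s_lower * len(db)}

def large_after_deletion(tracked, db, deleted_tids, s):
    """Update tracked counts using only the deleted transactions."""
    removed = itemset_counts(db[t] for t in deleted_tids)
    min_count = math.ceil(s * (len(db) - len(deleted_tids)))
    return {i for i, c in tracked.items() if c - removed[i] >= min_count}

tracked = tracked_itemsets(DB, s_lower=0.24)           # one full scan
remaining = large_after_deletion(tracked, DB, [2, 3, 4, 5], s=0.4)
```

On this toy data, `remaining` is the set {a, b, c, bc}, matching the large itemsets stated in the example for the chromosome (2, 3, 4, 5).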

After the large itemsets are obtained, the number (α) of sensitive itemsets that fail to be hidden, the number (β) of missing itemsets, and the number (γ) of artificial itemsets can be easily obtained. In the above example, the set of sensitive itemsets that fail to be hidden is {be, bce} ∩ {a, b, c, bc}, which is Ø. The value of α is thus 0.

The set of missing itemsets is ({a, b, c, e, ab, bc, be, ce, bce} − {be, bce}) − {a, b, c, bc}, which is {e, ab, ce}. The value of β is thus 3. The set of artificial itemsets is Ø, so the value of γ is 0. Let the three weight parameters be set as 0.5, 0.25, and 0.25, respectively. The fitness value of the chromosome is then calculated as follows:

$$\mathit{fitness} = 0.5 \times 0 + 0.25 \times 3 + 0.25 \times 0 = 0.75$$
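
The fitness evaluation for the chromosome (2, 3, 4, 5) can be reproduced with set operations. The sketch below uses the itemsets quoted in the example; the function names are illustrative, and the definition of artificial itemsets as those that are large after deletion but were not large before is an assumption consistent with the empty set reported in the text.

```python
def side_effects(sensitive, large_before, large_after):
    """Return (alpha, beta, gamma) for one candidate deletion."""
    alpha = len(sensitive & large_after)                  # sensitive itemsets still large
    beta = len((large_before - sensitive) - large_after)  # non-sensitive itemsets lost
    gamma = len(large_after - large_before)               # newly large itemsets (assumed definition)
    return alpha, beta, gamma

def fitness(alpha, beta, gamma, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the three side effects; lower is better."""
    w1, w2, w3 = weights
    return w1 * alpha + w2 * beta + w3 * gamma

sensitive = {"be", "bce"}
large_before = {"a", "b", "c", "e", "ab", "bc", "be", "ce", "bce"}
large_after = {"a", "b", "c", "bc"}

effects = side_effects(sensitive, large_before, large_after)
score = fitness(*effects)
```

Here `effects` is (0, 3, 0) and `score` is 0.75, in agreement with the values of α, β, γ and the fitness given in the example.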

STEP 5: The crossover operation is executed on the population. An example has been previously shown in Figure 6-6.

STEP 6: The mutation operation is executed on the population. An example has been previously shown in Figure 6-7.

STEP 7: In the selection step, we set b as n/2. Thus, the top b chromosomes from the current population and the randomly chosen p (= n-b) chromosomes from the original database are gathered to form the new population by the selection scheme.

STEP 8: If the new population does not satisfy the termination condition, Steps 4 to 7 are repeated; otherwise, the algorithm stops. In the example, two criteria are used as the termination condition: the fitness value of the best chromosome reaches 0, or a predefined number of generations is reached.

Chapter 7 Experimental Results

Experiments were conducted to show the performance of the proposed approaches. They were performed on a Pentium IV 2 GHz CPU with 512 MB RAM on the Mandriva platform. The details of the three databases used in the experiments are shown in Table 7-1.

Table 7-1: The details of the three databases


In the experiments, the sensitive itemsets and the user-specified minimum support thresholds were the same for all three proposed algorithms. The minimum support thresholds were set at 2.41%, 1.95% and 5.9% for the BMS-POS, BMS-WebView-1, and BMS-WebView-2 databases, respectively. The numbers of sensitive itemsets for BMS-POS, BMS-WebView-1, and BMS-WebView-2 were set at 4, 4, and 8, respectively.

7.1 Experimental Results of SIF-IDF Algorithm

For the first proposed algorithm, SIF-IDF, the relationship between the number of iterations and the EC values is examined to show that the algorithm completely processes all user-specified sensitive itemsets. The results for BMS-POS, BMS-WebView-1, and BMS-WebView-2 are shown in Figure 7-1, Figure 7-2, and Figure 7-3, respectively.

Figure 7-1: The relationships between EC values and the number of iterations in the BMS-POS database with the SIF-IDF approach

Figure 7-2: The relationships between EC values and the number of iterations in the BMS-WebView-1 database with the SIF-IDF approach

Figure 7-3: The relationships between EC values and the number of iterations in the BMS-WebView-2 database with the SIF-IDF approach

We also evaluate the side effects of the proposed SIF-IDF approach, namely hidden fail itemsets, missing itemsets, and artificial itemsets. The results are shown in Table 7-2.

Table 7-2: The side effects of the SIF-IDF algorithm

| SIF-IDF | BMS-POS | BMS-WebView-2 | BMS-WebView-1 |
|---|---|---|---|
| Hidden fail itemsets | 0 | 0 | 0 |
| Missing itemsets | 0 | 0 | 0 |
| Artificial itemsets | 0 | 0 | 0 |

Table 7-2 shows that the proposed SIF-IDF algorithm completely hides the sensitive itemsets without any side effects in all three databases.

7.2 Experimental Results of Lattice-Like Algorithm

The second, lattice-like approach efficiently reduces the number of iterations. In addition, the EC values of multiple sensitive rules can be symmetrically reduced, which is more feasible than with the SIF-IDF algorithm. The results for the three databases are shown in Figures 7-4 to 7-6, respectively.

Figure 7-4: The relationships between EC values and the number of iterations in the BMS-POS database with the lattice-like approach

Figure 7-5: The relationships between EC values and the number of iterations in the BMS-WebView-1 database with the lattice-like approach

Figure 7-6: The relationships between EC values and the number of iterations in the BMS-WebView-2 database with the lattice-like approach

We also evaluate the three side effects of the proposed lattice-like approach. The results are shown in Table 7-3.

Table 7-3: The side effects of the lattice-like algorithm

| Lattice-like | BMS-POS | BMS-WebView-1 | BMS-WebView-2 |
|---|---|---|---|
| Hidden fail itemsets | 0 | 0 | 0 |
| Missing itemsets | 0 | 0 | 203 |
| Artificial itemsets | 0 | 0 | 0 |

Table 7-3 shows that the proposed lattice-like approach performs well, without any side effects, on BMS-POS and BMS-WebView-1. Missing itemsets, however, occur on BMS-WebView-2. This is because the proposed lattice-like approach completely prunes the sensitive itemsets from the processed transactions.

The execution times of the SIF-IDF algorithm and the lattice-like approach are then compared and shown in Table 7-4.

Table 7-4: The execution time of the proposed algorithms

| Algorithm | BMS-POS | BMS-WebView-1 | BMS-WebView-2 |
|---|---|---|---|
| SIF-IDF | 11,601.23 | 266.811 | 444.15 |
| Lattice-like | 3,434.326 | 20.53 | 59.558 |

Table 7-4 clearly shows that the lattice-like approach has a shorter execution time than the SIF-IDF algorithm on every database: it is roughly 3.4, 13, and 7.5 times faster on BMS-POS, BMS-WebView-1, and BMS-WebView-2, respectively.

7.3 Experimental Results of GA-based Algorithm

The third proposed approach, the GA-based algorithm, incorporates the pre-large concept to reduce the number of database rescans. In the experiments, we compare the proposed GA-based algorithm with the pre-large concept against a random method for PPDM. In the random method, the transactions in a chromosome are randomly selected without any elite process. The results for the three databases are shown in Figures 7-7 to 7-9, respectively.

Figure 7-7: The relation between fitness and generation in BMS-POS database with GA-based approach

Figure 7-8: The relation between fitness and generation in BMS-WebView-1 database with GA-based approach

Figure 7-9: The relation between fitness and generation in BMS-WebView-2 database with GA-based approach

In the evaluation process, the original GA-based algorithm and the GA-based algorithm with the pre-large concept are compared to show their performance on the three databases. The results are shown in Figure 7-10.

Figure 7-10: The relations between execution time and different databases

Figure 7-10 shows that the pre-large concept greatly reduces the execution time on all three databases. This is because the pre-large concept avoids the process of rescanning the database.

Chapter 8 Conclusion and Future Work

Approaches to privacy-preserving data mining (PPDM) can be classified into two categories. The first removes items from transactions, and the second removes whole transactions from the database. In this thesis, two approaches, the SIF-IDF algorithm and the lattice-like approach, are proposed for removing sensitive itemsets from transactions. A GA-based approach is also proposed for removing transactions from databases.

In the first, greedy-based approach, the SIF-IDF algorithm evaluates the similarity between sensitive itemsets and transactions in order to minimize side effects; it inherits its properties from the TF-IDF algorithm in information retrieval. It calculates the SIF-IDF values of all transactions and sorts them in descending order to set the processing priority. The frequencies of items within the sensitive itemsets are also calculated to set the deletion priority within the processed transactions. This procedure is repeated until the set of sensitive itemsets becomes empty. For the user-specified sensitive itemsets in the experiments, the proposed SIF-IDF algorithm processes all of them without any side effects in the three databases.

In the second, lattice-like approach, lattice structures are first designed to reduce the computational cost based on the downward-closure property of the Apriori algorithm. The need-to-delete count (NDC) value of each sensitive itemset is calculated to find the minimal number of transactions to process. The approach processes the lattice structures from the highest level to the lowest, completely deleting the items in the processed transactions. That is, the supports of multiple sensitive itemsets can be symmetrically hidden. In the experimental results, the proposed algorithm requires fewer iterations and less execution time than the SIF-IDF algorithm but produces some side effects.

In the third, GA-based approach, each gene in a chromosome represents a possible transaction to be hidden. If a gene contains the number 0, no transaction needs to be deleted for that gene. Three user-specified weights are assigned to three factors, which are hiding failures, missing itemsets, and artificial itemsets, to evaluate the fitness values of chromosomes. In addition, the pre-large concept is applied to the GA-based approach to reduce database rescans. In the experimental results, the original GA-based approach and the GA-based approach with the pre-large concept are compared to evaluate their performance on the three databases. The GA-based approach with the pre-large concept greatly reduces the computational cost and performs better than the original one.

In this thesis, we proposed a GA-based algorithm that removes transactions from databases. However, many non-sensitive itemsets may also be removed, causing significant side effects. In future work, each item within the sensitive itemsets can be considered as a gene to be processed for hiding sensitive itemsets. Better performance can then be expected, especially with respect to side effects. In addition, the pre-large concept covers insertion, deletion, and modification. How to combine those concepts with PPDM is another critical issue for future work.

