Thesis Organization - Several Heuristic Approaches for Privacy-Preserving Data Mining

Chapter 1 Introduction

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Related work is reviewed in Chapter 2. The problem to be solved is formulated in Chapter 3. The first proposed approach, the SIF-IDF algorithm, is presented in Chapter 4. The second proposed approach, a lattice-like approach, is described in Chapter 5. The third proposed approach, a GA-based algorithm, is explained in Chapter 6. Experimental results are shown in Chapter 7. Conclusions and future work are given in Chapter 8.

Chapter 2

Related Works

In this chapter, we review research related to this thesis. Section 2.1 describes the data mining process. Section 2.2 introduces the general concept of data sanitization, which can be further classified into anonymity, blocking and encryption. Section 2.3 reviews genetic algorithms for solving optimization problems. Section 2.4 states the pre-large concept, which is integrated with the GA process in this thesis to avoid re-scanning databases when hiding sensitive itemsets.

2.1 Data Mining Process

Data mining is most commonly used in attempts to induce association rules from transaction data, such that the presence of certain items in a transaction will imply the presence of some other items. To achieve this purpose, Agrawal et al. proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data [3][5][6]. They divided the mining process into two phases. In the first phase, candidate itemsets were generated and counted by scanning the transaction data. If the count of an itemset appearing in the transactions was larger than a pre-defined threshold value (called the minimum support), the itemset was considered a large itemset. Itemsets containing only one item were processed first.

Large itemsets containing only single items were then combined to form candidate itemsets containing two items. This process was repeated until all large itemsets had been found. In the second phase, association rules were induced from the large itemsets found in the first phase. All possible association combinations for each large itemset were formed, and those with calculated confidence values larger than a predefined threshold (called the minimum confidence) were output as association rules.
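The two phases described above can be sketched as follows. This is a minimal illustration rather than the exact algorithm of Agrawal et al.: the function names `large_itemsets` and `association_rules` are hypothetical, and an itemset is assumed to be large when its count reaches `min_count` (the comparison `>=` is an assumption chosen to match the support condition in Chapter 3).

```python
from itertools import combinations

def large_itemsets(transactions, min_count):
    """Phase 1: level-wise search for itemsets whose count reaches min_count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # single items first
    large = {}
    k = 1
    while candidates:
        # Count every candidate by scanning the transaction data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        # Combine large k-itemsets into candidates with k + 1 items.
        k += 1
        candidates = list({a | b for a in level for b in level if len(a | b) == k})
    return large

def association_rules(large, min_conf):
    """Phase 2: keep rules X => Y whose confidence reaches min_conf."""
    rules = []
    for itemset, count in large.items():
        for r in range(1, len(itemset)):
            for x in combinations(sorted(itemset), r):
                x = frozenset(x)
                conf = count / large[x]  # every subset of a large itemset is large
                if conf >= min_conf:
                    rules.append((set(x), set(itemset - x), conf))
    return rules
```

For instance, with the five toy transactions `abc`, `ab`, `ac`, `bc`, `abc` and a minimum count of 3, the three single items and the three pairs are large, while `abc` (count 2) is not.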

2.2 Data Sanitization

Years of effort in data mining have produced a variety of efficient techniques, which have also raised security and privacy threats [14]. Research on privacy-preserving data mining (PPDM) has thus become a critical issue. PPDM is usually performed to hide sensitive information. Atallah et al. first proposed a protection algorithm for data sanitization to avoid the inference of association rules [2]. It used addition and deletion procedures to modify databases for hiding sensitive information. Dasseni et al. then proposed a hiding algorithm based on the Hamming-distance approach to reduce the confidence or support values of association rules [7]. Three heuristic hiding approaches were proposed: increasing the supports of antecedent parts, decreasing the supports of consequent parts, and decreasing the support of either the antecedent or the consequent parts. A sensitive association rule was hidden once its support or confidence fell below the corresponding user-specified minimum threshold.

Oliveira and Zaïane [18] then introduced the multiple-rule hiding approach to efficiently hide sensitive itemsets. It required only two database scans regardless of the number of sensitive itemsets. In the first database scan, an index file was created to efficiently find sensitive itemsets within transactions. Three algorithms called MinFIA, MaxFIA and IGA were then used in the second database scan to remove a minimal number of individual items. Amiri then proposed three heuristic approaches to hide multiple sensitive rules [1]. The first approach, called Aggregate, computed the union of the supporting transactions for all sensitive itemsets and removed the transaction that supported the most sensitive and the fewest non-sensitive itemsets. The second one, called Disaggregate, aimed at removing individual items from transactions rather than removing whole transactions. The third approach, called Hybrid, was a combination of the previous two. It used the Aggregate approach to identify sensitive transactions and adopted the Disaggregate approach to selectively delete items from these transactions until the sensitive knowledge had been hidden.

Pontikakis et al. [19] then proposed two heuristic approaches based on data distortion. The first approach, named the priority-based distortion algorithm (PDA), was designed to reduce the confidences of sensitive rules by trying to decrease consequent items. The second approach, called the weight-based sorting distortion algorithm (WDA), was then proposed to prioritize the selection of transactions to be sanitized. It used priority values to weight the transactions based on effective data structures.

The optimal sanitization of databases is, in general, regarded as an NP-hard problem. Atallah et al. [2] proved that selecting which data to modify or sanitize is NP-hard. Their proof was based on a reduction from the hitting-set problem [9], which was already known to be NP-hard: the hitting-set problem was reduced to the PPDM problem in polynomial time. The PPDM problem is therefore NP-hard as well, and no polynomial-time algorithm for it is currently known. That paper provided a solid theoretical foundation for explaining why PPDM is a difficult issue.

2.3 Genetic Algorithms

Since the 1960s, there has been much interest in developing powerful heuristic algorithms for difficult optimization problems. Some nature-inspired approaches were then proposed to achieve this purpose. One of the most commonly used among them is evolutionary computation, based on Darwin's theory: “Nature selects, the fittest survives”. In 1975, Holland [11] applied the concept of evolution to the field of dynamic algorithms and proposed genetic algorithms (GAs). Since then, GAs have become increasingly important for researchers solving difficult problems because they can provide feasible solutions in a limited amount of time [10]. GAs have been successfully applied to the fields of optimization [15][16], machine learning [16], neural networks [20], fuzzy logic controllers [20], and so on. According to the principle of survival of the fittest, GAs generate the next population by several operations, with each individual in the population representing a possible solution. In general, a genetic algorithm has the following five basic components, as summarized by Michalewicz [17]:

1. A genetic representation of solutions to the problem,

2. A way for generating the initial population,

3. An evaluation function for measuring goodness of solutions,

4. Several genetic operators that alter the genetic composition of children, and

5. Parameter values.
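The five components listed above can be seen together in a small sketch. It is a toy example only, not the GA proposed later in this thesis: it assumes a bit-string representation, a fitness that simply counts 1-bits, tournament selection, one-point crossover, bit-flip mutation and elitism, and every name and parameter value (`run_ga`, `p_cross`, `p_mut`, and so on) is hypothetical.

```python
import random

def fitness(chrom):
    """Component 3: evaluation function (here, the number of 1-bits)."""
    return sum(chrom)

def run_ga(n_bits=20, pop_size=30, generations=40, p_cross=0.9, p_mut=0.02, seed=0):
    """Component 5: the keyword arguments are the parameter values."""
    rng = random.Random(seed)
    # Components 1 and 2: bit-string chromosomes, random initial population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Component 4: selection, crossover and mutation operators.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history
```

Because the best chromosome is copied unchanged into each new population, the best fitness recorded in `history` never decreases from one generation to the next.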

2.4 The Concept of Pre-Large Itemsets

Hong and Wang proposed the concept of pre-large itemsets [12] for efficient incremental data mining. A pre-large itemset is not truly large (frequent), but may easily become large in the future through the data insertion process. Formally, a lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as the minimum support in conventional mining algorithms. The count of an itemset needs to be larger than the upper support threshold in order to be regarded as large. The lower support threshold defines the lowest count for an itemset to be pre-large. Pre-large itemsets act like a buffer in the incremental mining process and are used to reduce the movement of an itemset directly from large to small and vice versa. The concept of pre-large itemsets can also be used for record deletion [13]. In this thesis, we delete transactions for PPDM. The processing for record deletion is thus used and is explained below.

When some records are deleted from a database, there are nine cases for candidate-itemsets to be considered, which are shown in Figure 2-1 [12].

Figure 2-1: Nine cases for candidate itemsets due to record deletion

In Figure 2-1, Cases 2, 3, 4, 7 and 8 do not affect the final association rules. Case 1 may remove existing association rules, and Cases 5, 6 and 9 may generate new association rules. If all large and pre-large itemsets from the original database are pre-stored, then Cases 1, 5 and 6 can be easily handled. Besides, an itemset in Case 9 cannot possibly be large for the entire updated database as long as the number of deleted records is a sufficiently small proportion of the original database. Hong and Wang derived the following theorem for Case 9 [15]. Given a lower support threshold S_l, an upper support threshold S_u, and a database with d transactions, if the number f of deleted records satisfies the following condition, then an itemset in Case 9 (neither large nor pre-large in both the original database and the deleted records) is certainly not large for the updated database:

$$f \le \left\lfloor \frac{(S_u - S_l)\, d}{S_u} \right\rfloor$$

Thus, no database rescan is needed if the above condition is satisfied. We may rearrange the condition and derive the following lower support threshold when the number f of deleted records for which no database rescan is needed is given:

$$S_l = S_u \left( 1 - \frac{f}{d} \right)$$

The above formula will be used in the proposed GA-based PPDM approach to set an appropriate lower-bound threshold for efficient chromosome evaluation.
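The relation between the two thresholds and the number of deleted records can be checked numerically with the short sketch below. It assumes that the safety bound has the form f ≤ ⌊(S_u − S_l)·d / S_u⌋ and that the corresponding lower threshold is S_l = S_u·(1 − f/d); both forms are assumptions made for this illustration, and the function names are hypothetical. Exact fractions are used to avoid floating-point rounding at the floor.

```python
import math
from fractions import Fraction

def max_safe_deletions(s_l, s_u, d):
    """Largest f for which a Case 9 itemset cannot become large (assumed bound)."""
    return math.floor((s_u - s_l) * d / s_u)

def lower_threshold(s_u, f, d):
    """Lower support threshold that tolerates f deletions without a rescan (assumed form)."""
    return s_u * (1 - Fraction(f, d))

# Hypothetical setting: S_u = 50%, S_l = 30%, d = 100 transactions.
f = max_safe_deletions(Fraction(3, 10), Fraction(1, 2), 100)  # 40
s_l = lower_threshold(Fraction(1, 2), f, 100)                  # 3/10
```

In this setting an itemset that is small in the original database has a count of at most 29, so after 40 deletions its support is at most 29/60, which stays below the upper threshold of 50%.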

Chapter 3

Problem Formulation

In the problem of PPDM, some basic concepts are borrowed from association rule mining. Thus, before exploring PPDM, we need the definition of association rule mining. Agrawal extended and formalized the problem as follows [3]. Let I = {i₁, i₂, …, i_m} be a set of literals, called items. Let D be a set of transactions, where each transaction T ∈ D consists of a set of items, such that T ⊆ I. Each transaction T has a unique identifier, called its TID. A set of items X ⊂ I is called an itemset. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. Usually, Y consists of only a single item.

We say an association rule X ⇒ Y holds in a database D if the following two conditions are satisfied. The first one is the support condition, which requires that at least s% of the transactions in D contain X ∪ Y. It can be thought of as a measure of the frequency of a rule, and is expressed by

$$\frac{|X \cup Y|}{N} \ge s,$$

where |X ∪ Y| is the number of transactions in D that contain X ∪ Y and N is the number of transactions in D. The second one is the confidence condition, which requires that at least c% of the transactions containing the itemset X also contain Y. It is thus a measure of the strength of the rule, and is expressed by

$$\frac{|X \cup Y|}{|X|} \ge c.$$
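The two conditions can be evaluated directly on a small database, as in the following sketch. The function names `count` and `rule_holds` and the toy transactions are hypothetical; thresholds are passed as fractions rather than percentages.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

def rule_holds(transactions, x, y, min_sup, min_conf):
    """Check the support and confidence conditions for the rule X => Y."""
    n = len(transactions)
    xy = count(transactions, set(x) | set(y))
    cx = count(transactions, x)
    support = xy / n
    confidence = xy / cx if cx else 0.0
    return support >= min_sup and confidence >= min_conf, support, confidence

# Toy database: each string is one transaction, each character one item.
D = ["abc", "ab", "ac", "bc", "abc"]
holds, sup, conf = rule_holds(D, "a", "b", min_sup=0.5, min_conf=0.7)
```

Here {a, b} appears in 3 of 5 transactions and a appears in 4, so the rule a ⇒ b has support 3/5 and confidence 3/4.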

In privacy-preserving data mining, users may pre-specify a set of sensitive itemsets H = {{h_1}, {h_2}, …, {h_i}}, which may be mined out from a database but are sensitive. We aim at preventing these sensitive itemsets from being disclosed, and a solution is to reduce the frequencies of the sensitive itemsets in D. Let the modified database be denoted D’. Each sensitive itemset should then not have enough support to be frequent in D’. Approaches of this kind can be thought of as support-based ones, and have to satisfy the constraint |h_i|/N’ < s, where N’ is the number of transactions in D’ and |h_i| is the number of occurrences of the sensitive itemset h_i in D’. In addition to hiding the sensitive itemsets from being mined, some other goals are set as well when the original database is sanitized. For example, all of the non-sensitive rules should still be successfully mined from the sanitized database D’. Besides, rules that are not found in the original database D should not be generated from the sanitized database D’.

In this thesis, we would like to hide sensitive itemsets, which are predefined by users. To achieve this purpose, new items or transactions may be inserted, or old items or transactions may be deleted or modified. Here, we focus only on the deletion of items or transactions for PPDM. We use the data sanitization process to hide sensitive knowledge by item or transaction deletion, which reduces the supports of the sensitive itemsets below the user-specified threshold. Two kinds of removal are considered. One is removing items from transactions and the other is removing transactions from databases. An example of the two kinds of removal is shown in Figure 3-1. At the left of Figure 3-1, an item is removed, and at the right, a whole transaction is deleted.

Figure 3-1: Two kinds of removal
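The two kinds of removal and the support-based hiding constraint |h_i|/N’ < s can be illustrated with the sketch below. The database layout (a dictionary from TID to a set of items), the function names and the toy data are assumptions made for this example only.

```python
def remove_item(db, tid, item):
    """Item removal: delete one item from one transaction; N' stays the same."""
    return {t: (items - {item} if t == tid else set(items)) for t, items in db.items()}

def remove_transaction(db, tid):
    """Transaction removal: delete a whole transaction; N' decreases by one."""
    return {t: set(items) for t, items in db.items() if t != tid}

def is_hidden(db, itemset, s):
    """Support-based constraint: occurrences / N' must be below s."""
    n = len(db)
    if n == 0:
        return True
    occurrences = sum(1 for items in db.values() if set(itemset) <= items)
    return occurrences / n < s

# Toy database in which the sensitive itemset {a, b} has support 2/4.
D = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a", "c"}, 4: {"b"}}
D_item = remove_item(D, 1, "b")        # {a, b} now occurs in 1 of 4 transactions
D_trans = remove_transaction(D, 1)     # {a, b} now occurs in 1 of 3 transactions
```

With s = 0.5, the itemset {a, b} is not hidden in D, but it is hidden after either kind of removal.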

In this thesis, three approaches are proposed for PPDM. The first two are the sensitive items frequency - inverse database frequency (SIF-IDF) approach and the lattice-like approach, which are designed to remove items from transactions for hiding sensitive itemsets. The third one is a GA-based approach, which deletes transactions from a database for PPDM.

Chapter 4

A Greedy Approach Based on Sensitive Items Frequency and Inverse Database Frequency

In this chapter, a greedy approach called sensitive items frequency - inverse database frequency (SIF-IDF) is proposed to hide given sensitive itemsets. It adapts the concept of TF-IDF [21] from text mining to evaluate the degree to which each transaction is associated with the given sensitive itemsets. The SIF-IDF value of a transaction T_i is defined as follows:

$$\text{SIF-IDF}(T_i) = \sum_{j=1}^{m} \left( \frac{|si_{ij}|}{|T_i|} \times \sum_{i_k \in si_j} \frac{n}{f_k - MRC_k} \right)$$

where |si_ij| is the number of sensitive items of the j-th sensitive itemset si_j contained in T_i, |T_i| is the number of items in transaction T_i, n is the number of records in the database, f_k is the frequency count of item i_k, and MRC_k is the maximum reduced count of item i_k.

The above formula consists of two components. One is the sensitive items frequency (SIF) and the other is the inverse database frequency (IDF). The SIF value is measured for each sensitive itemset si_j in a transaction T_i. It is calculated as the number |si_ij| of sensitive items in T_i that are included in the sensitive itemset si_j, divided by the number of all the items in T_i. In contrast, the IDF value shows the degree of influence of the sensitive itemsets within a transaction by considering the whole database. In this chapter, the SIF-IDF value of each transaction is calculated and used to measure whether a transaction has a large number of sensitive items but little influence on other transactions. Transactions with high SIF-IDF values are given high priority for sanitization.

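The proposed approach first calculates the maximum reduced count (MRC) of each item in the database. To do this, the reduced count value RC_kj of each item i_k is first calculated for each sensitive itemset si_j as f_j − s×n + 1 if si_j includes i_k and as 0 otherwise, where f_j is the occurrence frequency of the sensitive itemset si_j in the database, s is the minimum support threshold, and n is the number of transactions in the database, 1 ≤ j ≤ m, 1 ≤ k ≤ p. The value f_j − s×n + 1 is the number of occurrences of si_j that must be removed for its count to fall below the minimum count s×n. The MRC value of an item is then the maximum of its reduced count values over all sensitive itemsets.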
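The reduced count and maximum reduced count can be computed as in the sketch below. The frequencies used here are hypothetical (they are not taken from Table 4-1), the minimum count s×n is passed directly as an integer `min_count`, and the MRC of an item is assumed to be the maximum of its RC values over all sensitive itemsets.

```python
def reduced_counts(sensitive_freq, min_count):
    """RC_kj = f_j - s*n + 1 for items in si_j (items outside si_j have RC 0 and are omitted)."""
    return {
        (item, itemset): f_j - min_count + 1
        for itemset, f_j in sensitive_freq.items()
        for item in itemset
    }

def max_reduced_counts(sensitive_freq, min_count):
    """MRC_k = maximum of RC_kj over all sensitive itemsets si_j (assumed definition)."""
    mrc = {}
    for (item, _), rc in reduced_counts(sensitive_freq, min_count).items():
        mrc[item] = max(mrc.get(item, 0), rc)
    return mrc

# Hypothetical occurrence frequencies of three sensitive itemsets, minimum count 4.
freq = {"cfh": 4, "af": 5, "c": 7}
mrc = max_reduced_counts(freq, min_count=4)
```

With these numbers, item c belongs to cfh (RC = 1) and to c (RC = 4), so its MRC is 4, while f has MRC 2, a has MRC 2 and h has MRC 1.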

The IDF value of each item is then calculated as the number of transactions in the database divided by the difference between the occurrence frequency of the item and its MRC value. The IDF value for each sensitive itemset is then estimated as the summation of the IDF values of the items contained in the itemset. The SIF-IDF value of each transaction is the summation, over the sensitive itemsets appearing in the transaction, of the SIF value of each itemset multiplied by its corresponding IDF value. The transactions are then sorted in descending order of their SIF-IDF values. This order is used as the processing order of the transactions in the proposed algorithm. In data sanitization, an item with a higher occurrence frequency in the sensitive itemsets may be considered to have a larger influence than one with a lower occurrence frequency. The sensitive items in the processed transactions are then deleted according to the ordering of their occurrence frequencies. This procedure is repeated until the set of sensitive itemsets becomes null, which indicates that the supports of all the sensitive itemsets are below the user-specific minimum support threshold. The flowchart of the proposed SIF-IDF algorithm is shown in Figure 4-1. The proposed algorithm and an example are described in the next two sections, respectively.

Figure 4-1: The flowchart of SIF-IDF algorithm (boxes, in order: pre-process transactions with sensitive items; calculate SIF; calculate IDF; calculate SIF-IDF; execute pruning process; termination condition satisfied?; output a sanitized database)

4.1 The Proposed SIF-IDF Algorithm

INPUT: A transaction dataset D = {T_1, T_2, …, T_i, …, T_n} with a set of p items I = {i_1, i_2, …, i_k, …, i_p}, a user-specific minimum support threshold s, and a set of m user-specific sensitive itemsets S = {si_1, si_2, …, si_j, …, si_m}.

OUTPUT: A sanitized database with no sensitive rules mined out.

STEP 1: Find the transactions with sensitive itemsets in the database D.

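STEP 2: Calculate the sensitive items frequency (SIF_ij) value of each sensitive itemset si_j in each transaction T_i as:

$$SIF_{ij} = \frac{|si_{ij}|}{|T_i|}$$

where |si_ij| is the number of items of si_j contained in T_i and |T_i| is the number of items in T_i.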

STEP 3: Calculate the value of the inverse database frequency (IDF) of each sensitive itemset in each transaction by the following substeps.

Substep 3-1: Calculate the reduced count value (RC_kj) of each item i_k for each sensitive itemset si_j as f_j − s×n + 1 if si_j includes i_k and as 0 otherwise, where f_j is the occurrence frequency of the sensitive itemset si_j in the database, s is the minimum support threshold, and n is the number of transactions in the database, 1 ≤ j ≤ m, 1 ≤ k ≤ p.

Substep 3-2: Calculate the maximum reduced count value (MRC_k) of each item i_k as:

$$MRC_k = \max_{1 \le j \le m} RC_{kj}$$

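Substep 3-3: Calculate the inverse database frequency (IDF_k) value of each item i_k as the number of transactions divided by the difference between the occurrence count of the item and its MRC value:

$$IDF_k = \frac{n}{f_k - MRC_k}$$

where f_k is the occurrence count of item i_k in the database.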

Substep 3-4: Sum the IDF values of all sensitive items within sensitive itemsets and calculate the SIF-IDF value for each transaction as follows:

$$\text{SIF-IDF}(T_i) = \sum_{j=1}^{m} \left( SIF_{ij} \times \sum_{i_k \in si_j} IDF_k \right)$$

STEP 4: Find the transaction (T_b) which has the highest SIF-IDF value.

STEP 5: Process the transaction Tb to prune appropriate items by the following substeps.

Substep 5-1: Sort the items in a descending order of their occurrence frequencies within the sensitive itemsets.

Substep 5-2: Find the first sensitive item (item_o) in T_b according to the sorted order obtained in Substep 5-1.

Substep 5-3: Delete the item (item_o) from the transaction T_b.

STEP 6: Update the occurrence frequencies of the sensitive itemsets.

STEP 7: Repeat STEPS 2 to 6 until the set of sensitive itemsets is null, which indicates that the supports of all the sensitive itemsets are below the user-specific minimum support threshold s.
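STEPS 1 to 7 can be put together in the following sketch. It is one possible reading of the algorithm rather than the definitive implementation, and it makes several assumptions that the text does not fix: the minimum count s×n is passed as an integer `min_count` of at least 1; only sensitive itemsets fully contained in a transaction contribute to its score; IDF is the plain ratio n / (f_k − MRC_k), with a zero denominator treated as infinity; ties are broken by transaction position and by item name; and the item deleted from T_b is taken from the sensitive itemsets that T_b supports. All names are hypothetical.

```python
def sif_idf_sanitize(db, sensitive, min_count):
    """Delete items until every sensitive itemset occurs fewer than min_count times."""
    db = [set(t) for t in db]
    active = [frozenset(s) for s in sensitive]
    n = len(db)
    while True:
        # STEPS 6 and 7: recount, drop itemsets that are already hidden, stop if none remain.
        freq = {s: sum(1 for t in db if s <= t) for s in active}
        active = [s for s in active if freq[s] >= min_count]
        if not active:
            return db
        items = {i for s in active for i in s}
        # Substeps 3-1 and 3-2: MRC_k = max over itemsets containing i_k of f_j - s*n + 1.
        mrc = {i: max(freq[s] - min_count + 1 for s in active if i in s) for i in items}
        # Substep 3-3: IDF_k = n / (f_k - MRC_k).
        idf = {}
        for i in items:
            denom = sum(1 for t in db if i in t) - mrc[i]
            idf[i] = n / denom if denom > 0 else float("inf")

        # STEP 2 and Substep 3-4: sum of SIF * IDF over itemsets contained in t.
        def score(t):
            return sum(len(s) / len(t) * sum(idf[i] for i in s) for s in active if s <= t)

        # STEP 4: transaction with the highest SIF-IDF value (first one on ties).
        b = max(range(n), key=lambda idx: score(db[idx]))
        # STEP 5: delete the sensitive item of T_b that occurs in the most sensitive itemsets.
        occ = {i: sum(1 for s in active if i in s) for i in items}
        candidates = {i for s in active if s <= db[b] for i in s}
        victim = min(candidates, key=lambda i: (-occ[i], i))
        db[b].discard(victim)
```

Each iteration removes one item from a transaction that still supports a sensitive itemset, so the loop terminates once all sensitive itemsets fall below the minimum count. The number of transactions is unchanged, since only items are deleted.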

4.2 An Example

In this section, an example is given to demonstrate the proposed sensitive items frequency - inverse database frequency (SIF-IDF) algorithm for privacy-preserving data mining (PPDM). Assume the database shown in Table 4-1 is used as the example. It consists of 10 transactions and 9 items, denoted a to i.

Table 4-1: A database example with 10 transactions

TID Item

Assume the set of user-specific sensitive itemsets S is {cfh, af, c}. Also assume the user-specific minimum support threshold is set at 40%, which indicates that the minimum count is 0.4 × 10 = 4. The proposed approach proceeds as follows to hide the sensitive itemsets so that they cannot be mined from the database.

STEP 1: The transactions with sensitive itemsets in the database are found and kept. In this example, all the 10 transactions contain at least one sensitive itemset. All of them are then kept for later processing.

STEP 2: The sensitive items frequency (SIF) value of each sensitive itemset in each transaction is calculated. Take the first transaction as an example to illustrate the step. The first transaction includes the following seven items: {a, b, c, d, f, g, h}. The given sensitive itemsets are {cfh, af, c}. The sensitive items of the itemset {cfh} appearing in the first transaction are c, f and h, so the number is 3. Similarly, the numbers of sensitive items appearing in the first transaction for the sensitive itemsets {af} and {c} are 2 and 1, respectively. Thus, the SIF values of the three sensitive itemsets in the first transaction are calculated as 3/7, 2/7 and 1/7, respectively. The SIF values of each sensitive itemset in the other transactions can be found in a similar way. The results are shown in Table 4-2.
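The SIF values of the first transaction can be reproduced with a few lines. This is a small sketch with a hypothetical function name; each string stands for a set of single-letter items, and exact fractions are used so that the results match the values quoted in the text.

```python
from fractions import Fraction

def sif_values(transaction, sensitive_itemsets):
    """SIF of each sensitive itemset: its items found in the transaction over the transaction size."""
    t = set(transaction)
    return {s: Fraction(len(set(s) & t), len(t)) for s in sensitive_itemsets}

first = sif_values("abcdfgh", ["cfh", "af", "c"])
# first == {"cfh": Fraction(3, 7), "af": Fraction(2, 7), "c": Fraction(1, 7)}
```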

Table 4-2: The SIF values of each sensitive itemset in each transaction

TID Item SIFcfh SIFaf SIFc

STEP 3: The inverse database frequency (IDF) value of each sensitive itemset in each
