Chapter 2 Related Works
2.4 The Concept of Pre-Large Itemsets
Hong and Wang proposed the pre-large itemsets [12] for efficient incremental data mining.
A pre-large itemset was not truly large (frequent), but might easily become large in the future through the data insertion process. Formally, a lower support threshold and an upper support threshold were used to realize this concept. The upper support threshold was the same as that in the conventional mining algorithms. The count of an itemset needed to be larger than the upper
support threshold in order to be regarded as large. The lower support threshold defined the lowest count for an itemset to be pre-large. Pre-large itemsets acted like a buffer in the incremental mining process and were used to reduce the movement of an itemset directly from large to small and vice versa. The concept of pre-large itemsets could also be used for record deletion [13]. In this thesis, we will delete transactions for PPDM. The processing for record deletion will thus be used and is explained below.
When some records are deleted from a database, there are nine cases for candidate-itemsets to be considered, which are shown in Figure 2-1 [12].
Figure 2-1: Nine cases for candidate itemsets due to record deletion
In Figure 2-1, Cases 2, 3, 4, 7 and 8 do not affect the final association rules. Case 1 may remove existing association rules, and Cases 5, 6 and 9 may generate new association rules. If all large and pre-large itemsets from the original database are pre-stored, then Cases 1, 5 and 6 can be handled easily. Moreover, an itemset in Case 9 cannot possibly be large for the entire updated
database as long as the number of deleted records is a sufficiently small proportion of the original database. Hong and Wang derived the following theorem for Case 9 [15]. Given a lower support threshold Sl, an upper support threshold Su, and the number d of transactions in the original database, if the number f of deleted records satisfies the following condition, then an itemset in Case 9 (neither large nor pre-large in both the original database and the deleted records) is certainly not large for the updated database:
$$f \le \left\lfloor \frac{(S_u - S_l)\, d}{S_u} \right\rfloor$$
Thus, no database rescan is needed if the above formula is satisfied. The formula can be rearranged to give the lower support threshold that avoids a database rescan when the number f of deleted records is given:
$$S_l = S_u \left( 1 - \frac{f}{d} \right)$$
The above formula will be used in the proposed GA-based PPDM approach to set an appropriate lower-bound threshold for efficient chromosome evaluation.
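The two relations can be evaluated with a few lines of Python. This is only a sketch: it assumes the deletion bound has the form f ≤ ⌊(Su − Sl)d/Su⌋ and the rearranged threshold the form Sl = Su(1 − f/d). The function names and the sample numbers are hypothetical, and exact fractions are used to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import floor


def max_safe_deletions(s_lower, s_upper, d):
    """Largest number of deleted records for which no rescan is needed.

    Assumed form of the bound: floor((Su - Sl) * d / Su).
    """
    return floor((s_upper - s_lower) * d / s_upper)


def lower_threshold(s_upper, f, d):
    """Lower support threshold that tolerates f deletions out of d records.

    Assumed form of the rearranged formula: Su * (1 - f / d).
    """
    return s_upper * (1 - Fraction(f, d))


# Hypothetical setting: 100 transactions, Su = 50%, Sl = 30%.
s_u, s_l, d = Fraction(1, 2), Fraction(3, 10), 100
f_max = max_safe_deletions(s_l, s_u, d)   # number of deletions allowed
s_l_back = lower_threshold(s_u, f_max, d)  # threshold recovered from f_max
print(f_max, s_l_back)
```

In this setting the two functions are inverses of each other: the threshold recovered from the maximum number of deletions equals the lower threshold that was supplied.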
Chapter 3 Problem Formulation
In the problem of PPDM, some basic concepts are borrowed from association rule mining.
Thus, before exploring PPDM itself, we first review the definition of association rule mining. Agrawal extended and formalized the problem as follows [3]. Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T ∈ D consists of a set of items, such that T ⊆ I. Each transaction T has a unique identifier, called its TID. A set of items X ⊂ I is called an itemset. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. Usually, Y consists of only a single item.
We say an association rule X ⇒ Y holds in a database D if the following two conditions are satisfied. The first one is the support condition, which requires that at least s% of the transactions in D contain X ∪ Y. It can be thought of as a measure of the frequency of a rule, and is expressed by
$$\frac{|X \cup Y|}{N} \ge s,$$
where N is the number of transactions in D. The second one is the confidence condition, which requires that at least c% of the transactions containing the itemset X also contain Y. It is thus a measure of the strength of the rule, and is expressed by
$$\frac{|X \cup Y|}{|X|} \ge c.$$
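The support and confidence conditions can be checked directly by counting transactions. The following Python sketch illustrates this on a small hypothetical database; the function names and the data are illustrative assumptions, and thresholds are passed as ratios rather than percentages.

```python
def support_count(database, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in database if itemset <= t)


def rule_holds(database, x, y, s, c):
    """Check the support and confidence conditions for the rule X => Y."""
    n = len(database)
    xy_count = support_count(database, x | y)
    x_count = support_count(database, x)
    support = xy_count / n
    confidence = xy_count / x_count if x_count else 0.0
    return support >= s and confidence >= c


# Hypothetical database with five transactions.
db = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
    {"a", "b", "c"},
]

# For the rule {a} => {c}: support is 3/5 and confidence is 3/4.
print(rule_holds(db, {"a"}, {"c"}, s=0.5, c=0.7))
```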
In privacy-preserving data mining, users may pre-specify a set of sensitive itemsets H = {{h1}, {h2}, …, {hi}}, which could be mined from a database but are sensitive. We aim to prevent these sensitive itemsets from being disclosed, and one solution is to reduce the frequencies of the sensitive itemsets in D. Let the modified database be denoted D’. Each sensitive itemset should then not have enough support to be frequent in D’. Approaches of this kind can be thought of as support-based ones, and have to satisfy the constraint |hi|/N’ < s, where N’ is the
number of transactions in D’ and |hi| is the number of occurrences of the sensitive itemset hi. In addition to hiding the sensitive itemsets from being mined, some other goals are set as well when the original database is sanitized. For example, all of the non-sensitive rules should still be mined successfully from the sanitized database D’. Besides, rules that are not found in the original database D should not be generated from the sanitized database D’.
In this thesis, we would like to hide sensitive itemsets, which are predefined by users. To achieve this purpose, new items or transactions may be inserted, or old items or transactions may be deleted or modified. Here, we focus only on the deletion of items or transactions for PPDM. We use the data sanitization process to hide sensitive knowledge by item or transaction deletion, which reduces the support of the rules below the user-specified security threshold. Two kinds of removal are considered. One is removing items from transactions and the other is removing transactions from databases. An example of the two kinds of removal is shown in Figure 3-1. At the left of Figure 3-1, an item is removed, and at the right, a whole transaction is deleted.
Figure 3-1: Two kinds of removal
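The effect of the two kinds of removal on the support of a sensitive itemset can be illustrated with a short Python sketch. The database, the TIDs and the helper names below are hypothetical and are not taken from Figure 3-1. Note that item removal keeps the number of transactions unchanged, whereas transaction removal also shrinks the denominator N’.

```python
def support_count(database, itemset):
    """Number of transactions (values of the TID -> items map) containing `itemset`."""
    return sum(1 for items in database.values() if itemset <= items)


def remove_item(database, tid, item):
    """Return a copy of the database with `item` removed from transaction `tid`."""
    return {t: (items - {item} if t == tid else set(items))
            for t, items in database.items()}


def remove_transaction(database, tid):
    """Return a copy of the database without transaction `tid`."""
    return {t: set(items) for t, items in database.items() if t != tid}


# Hypothetical database and sensitive itemset.
db = {1: {"a", "b", "c"}, 2: {"a", "c"}, 3: {"b", "c"}, 4: {"a", "b"}}
sensitive = {"a", "c"}

by_item = remove_item(db, 2, "c")
by_transaction = remove_transaction(db, 2)

for name, d in [("original", db), ("item removed", by_item),
                ("transaction removed", by_transaction)]:
    print(name, support_count(d, sensitive), "of", len(d))
```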
In this thesis, three approaches are proposed for PPDM. The first two are called the sensitive items frequency - inverse database frequency (SIF-IDF) approach and the lattice-like approach, which are designed for removing items from transactions in order to hide sensitive itemsets. The third one is a GA-based approach, which deletes transactions from a database for PPDM.
Chapter 4 A Greedy Approach Based on Sensitive Items Frequency and Inverse Database Frequency
In this chapter, a greedy-based approach called sensitive items frequency - inverse database frequency (SIF-IDF) is proposed to hide given sensitive itemsets. It uses and modifies the concept of TF-IDF [21] in text mining to evaluate the degrees of transactions associated with given sensitive itemsets. The measure for the SIF-IDF value of a transaction Ti is defined as follows:
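$$SIF\text{-}IDF(T_i) = \sum_{j=1}^{m} \left( \frac{|si_{ij}|}{|T_i|} \times \sum_{i_k \in si_j \cap T_i} \log \frac{n}{f_k - MRC_k} \right),$$
where |siij| is the number of sensitive items of the j-th sensitive itemset sij contained in Ti, |Ti| is the number of items in transaction Ti, n is the number of records in the database, fk is the frequency count of item ik, and MRCk is the maximum reduced count of item ik.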
The above formula consists of two components. One is the sensitive items frequency (SIF) and the other is the inverse database frequency (IDF). The sensitive items frequency (SIF) value is measured for each sensitive itemset sij in a transaction Ti. It is calculated as the number (|siij|) of sensitive items in Ti which are included in an assigned sensitive itemset sij divided by the number of all the items in Ti. On the contrary, the inverse database frequency (IDF) value shows the influence degree of the sensitive itemsets within a transaction by considering the whole database. In this chapter, the SIF-IDF value of each transaction is calculated and is used to
measure whether a transaction contains a large number of sensitive items while having little influence on other transactions. Transactions with high SIF-IDF values are given priority for sanitization.
The proposed approach first calculates the maximum reduced count (MRC) of each item in the database. In doing this, the reduced count value (RCkj) of each item ik is first calculated for each sensitive itemset sij as fj – s*n + 1 if sij includes ik and as 0 otherwise, where fj is the occurrence frequency of the sensitive itemset sij in the database, s is the minimum support threshold, and n is the number of transactions in the database, 1≤ j≤m,1≤k≤ p.
The IDF value of each item is then calculated as the logarithm of the number of transactions in the database divided by the difference between the occurrence frequency of the item and its MRC value. The IDF value for each sensitive itemset is then estimated as the summation of the IDF values of the items contained in the itemset. The SIF-IDF value of each transaction is then the summation, over the sensitive itemsets appearing in the transaction, of the SIF value of each sensitive itemset multiplied by its corresponding IDF value. The transactions are then sorted in descending order of their SIF-IDF values. This order is used as the processing order of the transactions in the proposed algorithm. In data sanitization, an item with a higher occurrence frequency in the sensitive itemsets may be considered to have a larger influence than one with a lower occurrence frequency. The sensitive items in the processed transactions are therefore deleted according to the ordering of their occurrence frequencies. This procedure is repeated until the set of sensitive itemsets becomes null, which indicates that the supports of all the sensitive itemsets are under the user-specific minimum support threshold. The flowchart of the proposed SIF-IDF algorithm is shown in Figure 4-1. The proposed algorithm and an example are described in the next two sections, respectively.
Figure 4-1: The flowchart of SIF-IDF algorithm
4.1 The Proposed SIF-IDF Algorithm
INPUT: A transaction dataset D = {T1, T2, …, Ti, …, Tn} with a set of p items I = {i1, i2, …, ik, …, ip}, a user-specific minimum support threshold s, and a set of m user-specific sensitive itemsets S = {si1, si2, …, sij, …, sim}.
[Flowchart of Figure 4-1: Pre-process transactions with sensitive items → Calculate SIF → Calculate IDF → Calculate SIF-IDF → Execute pruning process → Termination condition satisfied? (N: repeat; Y: output a sanitized database)]
OUTPUT: A sanitized database with no sensitive rules mined out.
STEP 1: Find the transactions with sensitive itemsets in the database D.
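STEP 2: Calculate the sensitive items frequency (SIFij) value of each sensitive itemset sij in each transaction Ti as:
$$SIF_{ij} = \frac{|si_{ij}|}{|T_i|},$$
where |siij| is the number of items of sij contained in Ti and |Ti| is the number of items in Ti.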
STEP 3: Calculate the value of the inverse database frequency (IDF) of each sensitive itemset in each transaction by the following substeps.
Substep 3-1: Calculate the reduced count value (RCkj) of each item ik for each sensitive itemset sij as fj – s*n + 1 if sij includes ik and as 0 otherwise, where fj is the occurrence frequency of the sensitive itemset sij in the database, s is the minimum support threshold, and n is the number of transactions in the database, 1≤ j≤m ,1≤k ≤ p.
Substep 3-2: Calculate the maximum reduced count value (MRCk) of each item ik as:
$$MRC_k = \max_{1 \le j \le m} RC_{kj}$$
Substep 3-3: Calculate the inverse database frequency (IDFk) value of each item ik as follows:
$$IDF_k = \log \frac{n}{f_k - MRC_k},$$
where fk is the occurrence count of item ik in the database.
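Substep 3-4: Sum the IDF values of all sensitive items within sensitive itemsets and calculate the SIF-IDF value for each transaction as follows:
$$SIF\text{-}IDF(T_i) = \sum_{j=1}^{m} \left( SIF_{ij} \times \sum_{i_k \in si_j \cap T_i} IDF_k \right)$$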
STEP 4: Find the transaction (Tb) which has the highest SIF-IDF value.
STEP 5: Process the transaction Tb to prune appropriate items by the following substeps.
Substep 5-1: Sort the items in a descending order of their occurrence frequencies within the sensitive itemsets.
Substep 5-2: Find the first sensitive item (itemo) in Tb according to the sorted order obtained in Substep 5-1.
Substep 5-3: Delete the item (itemo) from the transaction.
STEP 6: Update the occurrence frequencies of the sensitive itemsets.
STEP 7: Repeat STEPS 2 to 6 until the set of sensitive itemsets is null, which indicates that the supports of all the sensitive itemsets are below the user-specific minimum support threshold s.
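The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation, and it relies on several assumptions that the steps do not fix: the logarithm is taken to base 10, a sensitive itemset is dropped from the working set once its count falls below s*n, ties between items are broken alphabetically, and s*n is assumed to be greater than 1 so that fk − MRCk stays positive. The function name, variable names and test database are hypothetical.

```python
import math


def sif_idf_sanitize(database, sensitive, s):
    """Sketch of the SIF-IDF procedure.

    database:  list of item sets (one per transaction)
    sensitive: list of frozensets (the sensitive itemsets)
    s:         minimum support threshold as a ratio (assumes s * n > 1)
    """
    db = [set(t) for t in database]      # work on a copy
    n = len(db)
    min_count = s * n
    while True:
        # Occurrence frequency f_j of each sensitive itemset (STEP 6 on later rounds).
        freq = {sj: sum(1 for t in db if sj <= t) for sj in sensitive}
        active = [sj for sj in sensitive if freq[sj] >= min_count]
        if not active:                   # STEP 7: every itemset is hidden
            return db
        items = set().union(*active)
        # Substeps 3-1 and 3-2: MRC_k = max_j (f_j - s*n + 1) over itemsets containing i_k.
        mrc = {k: max(freq[sj] - min_count + 1 for sj in active if k in sj)
               for k in items}
        # Substep 3-3: IDF_k = log10(n / (f_k - MRC_k)); base 10 is an assumption.
        idf = {}
        for k in items:
            f_k = sum(1 for t in db if k in t)
            idf[k] = math.log10(n / (f_k - mrc[k]))

        # STEP 2 and Substep 3-4: SIF-IDF value of one transaction.
        def score(t):
            if not t:
                return 0.0
            return sum(len(sj & t) / len(t) * sum(idf[k] for k in sj & t)
                       for sj in active)

        best = max(db, key=score)        # STEP 4
        # Substep 5-1: items sorted by how many sensitive itemsets contain them.
        order = sorted(items, key=lambda k: (-sum(k in sj for sj in active), k))
        victim = next(k for k in order if k in best)   # Substep 5-2
        best.remove(victim)              # Substep 5-3


# Hypothetical database: 6 transactions, one sensitive itemset, s = 50%.
db = [{"a", "b", "c"}, {"a", "c"}, {"a", "c", "d"},
      {"b", "c"}, {"a", "d"}, {"b", "d"}]
sensitive = [frozenset({"a", "c"})]
result = sif_idf_sanitize(db, sensitive, 0.5)
print(result)
```

Recomputing all counts in every round keeps the sketch short; an implementation intended for large databases would update the counts incrementally after each deletion.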
4.2 An Example
In this section, an example is given to demonstrate the proposed sensitive items frequency - inverse database frequency (SIF-IDF) algorithm for privacy preserving data mining (PPDM).
Assume a database shown in Table 4-1 is used as the example. It consists of 10 transactions and 9 items, denoted a to i.
Table 4-1: A database example with 10 transactions
TID Item
Assume the set of user-specific sensitive itemsets S is {cfh, af, c}. Also assume the user-specific minimum support threshold is set at 40%, which indicates that the minimum count is 0.4*10, which is 4. The proposed approach proceeds as follows to hide the sensitive itemsets so that they cannot be mined from the database.
STEP 1: The transactions with sensitive itemsets in the database are found and kept. In this example, all the 10 transactions contain at least one sensitive itemset. All of them are then kept for later processing.
STEP 2: The sensitive items frequency (SIF) value of each sensitive itemset in each transaction is calculated. Take the first transaction as an example to illustrate the step. The first transaction includes the following seven items: {a, b, c, d, f, g, h}. The given sensitive itemsets
include {cfh, af, c}. The appearing sensitive items in the first transaction for the sensitive itemset {cfh} are c, f, h, and the number is 3. Similarly, the numbers of the appearing sensitive items in the first transaction for the sensitive itemsets {af} and {c} are 2 and 1, respectively. Thus, the SIF values of each sensitive itemset in the first transaction are calculated as 3/7, 2/7 and 1/7, respectively. The SIF values of each sensitive itemset in the other transactions could be found in a similar way. The results are shown in Table 4-2.
Table 4-2: The SIF values of each sensitive itemset in each transaction
TID Item SIFcfh SIFaf SIFc
STEP 3: The inverse database frequency (IDF) value of each sensitive itemset in each transaction is calculated. In this step, the reduced count (RC) of each item for each sensitive itemset is first calculated and the maximum of the RC values of each item is found as the MRC value. Take item a as an example. The MRC value of item a is calculated as max{0, 5-0.4*10+1, 0}, which is max{0, 2, 0} and is equal to 2. The MRC values of the other items can be found in the same way. The results are shown in Table 4-3.
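The reduced counts and MRC values can be recomputed with a short Python sketch. It uses the occurrence frequencies of the three sensitive itemsets that the example reports before any deletion ({cfh:5, af:5, c:8}) and the minimum count of 4. The variable and function names are illustrative assumptions.

```python
# Occurrence frequencies of the sensitive itemsets before any deletion.
sensitive_freq = {"cfh": 5, "af": 5, "c": 8}
min_count = 4  # 0.4 * 10 transactions


def reduced_count(item, itemset, freq):
    """RC_kj = f_j - s*n + 1 if the itemset includes the item, and 0 otherwise."""
    return freq - min_count + 1 if item in itemset else 0


# MRC_k is the maximum RC_kj over all sensitive itemsets.
mrc = {item: max(reduced_count(item, sj, f) for sj, f in sensitive_freq.items())
       for item in "acfh"}
print(mrc)
```

For item a this reproduces max{0, 2, 0} = 2. Item c receives the larger value 5 because it also forms the sensitive itemset {c}, whose frequency is 8.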
Table 4-3: The MRC values of all the items
Item RCcfh RCaf RCc MRC
The IDF value of each item is then calculated. Take item a as an example. The occurrence count of item a is 6 and the MRC value is 2. Its IDF value is then calculated as log(10/(6-2)), which is 0.398. The IDF values of all the items are shown in Table 4-4.
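The value 0.398 corresponds to a base-10 logarithm, which can be confirmed with a short Python check. The function name is a hypothetical helper.

```python
import math


def idf(n, count, mrc):
    """IDF of an item: log10 of n divided by (occurrence count - MRC)."""
    return math.log10(n / (count - mrc))


idf_a = idf(10, 6, 2)  # item a: 10 transactions, count 6, MRC 2
print(round(idf_a, 3))
```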
Table 4-4: The IDF value of each item
Item Count MRC IDF
The IDF value of each sensitive itemset in each transaction is then calculated. Take the first transaction for the first sensitive itemset {cfh} as an example to illustrate the process. The IDF value of {cfh} in the first transaction is the sum of the IDF values of the three items c, f and h, which is 0.523 + 0.222 + 0.523 and is equal to 1.268. All the results after this step are shown in Table 4-5.
Table 4-5: The IDF value of each sensitive itemset in each transaction
TID IDFcfh IDFaf IDFc
The SIF-IDF value of a sensitive itemset in each transaction is then calculated as the SIF value of the sensitive itemset multiplied by its IDF value in the transaction. Take the first transaction as an example to illustrate the process. The SIF value of the sensitive itemset {cfh} in the first transaction is 3/7 as shown in Table 4-2, and its IDF value is 1.268 as shown in Table 4-5. The SIF-IDF value of {cfh} is then calculated as 3/7*1.268, which is 0.5434. The other SIF-IDF values for the two sensitive itemsets {af} and {c} are calculated as 0.1771 and 0.0747, respectively. That is, the SIF-IDF value of the first transaction is summed as 0.5434 + 0.1771 +
0.0747, which is 0.795. The other transactions are processed in the same way. After that, the results are shown in Table 4-6.
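The calculation for the first transaction can be reproduced with the following Python sketch. It uses the rounded item IDF values quoted in the example (a: 0.398, c: 0.523, f: 0.222, h: 0.523) and the items of the first transaction. The function and variable names are illustrative assumptions.

```python
# Rounded IDF values of the sensitive items, as quoted in the example.
idf = {"a": 0.398, "c": 0.523, "f": 0.222, "h": 0.523}
t1 = set("abcdfgh")              # items of the first transaction
sensitive = ["cfh", "af", "c"]   # the three sensitive itemsets


def sif_idf(transaction, sensitive, idf):
    """Sum over the sensitive itemsets of SIF multiplied by the itemset IDF."""
    total = 0.0
    for sj in sensitive:
        present = set(sj) & transaction           # sensitive items found in T
        sif = len(present) / len(transaction)     # SIF value
        total += sif * sum(idf[k] for k in present)
    return total


value = sif_idf(t1, sensitive, idf)
print(round(value, 3))
```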
Table 4-6: The SIF-IDF values for all the transactions
TID SIF1 IDF1 SIF2 IDF2 SIF3 IDF3 SIF-IDF
T1 3/7 1.268 2/7 0.62 1/7 0.523 0.795
T2 0/4 0 1/4 0.398 0/4 0 0.099
T3 3/6 1.268 1/6 0.222 1/6 0.523 0.758
T4 3/5 1.268 2/5 0.62 1/5 0.523 1.113
T5 1/5 0.523 0/5 0 1/5 0.523 0.209
T6 2/4 0.745 2/4 0.62 1/4 0.523 0.813
T7 2/6 0.523 1/6 0.222 1/6 0.523 0.299
T8 3/5 1.268 1/5 0.222 1/5 0.523 0.910
T9 1/5 0.222 1/5 0.62 0/5 0 0.168
T10 3/5 1.268 2/5 0.62 1/5 0.523 1.113
STEP 4: The transactions in Table 4-6 are sorted in the descending order of their SIF-IDF values. The results are then shown in Table 4-7.
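The sorting step can be sketched in Python by recomputing each SIF-IDF value from the SIF and IDF columns of Table 4-6 and ordering the transactions in descending order. The data layout (transaction size followed by pairs of item count and IDF value) is an illustrative assumption, and ties keep their original order because Python's sort is stable.

```python
# Rows of Table 4-6: TID -> (number of items, [(sensitive item count, IDF), ...]).
rows = {
    "T1": (7, [(3, 1.268), (2, 0.62), (1, 0.523)]),
    "T2": (4, [(0, 0), (1, 0.398), (0, 0)]),
    "T3": (6, [(3, 1.268), (1, 0.222), (1, 0.523)]),
    "T4": (5, [(3, 1.268), (2, 0.62), (1, 0.523)]),
    "T5": (5, [(1, 0.523), (0, 0), (1, 0.523)]),
    "T6": (4, [(2, 0.745), (2, 0.62), (1, 0.523)]),
    "T7": (6, [(2, 0.523), (1, 0.222), (1, 0.523)]),
    "T8": (5, [(3, 1.268), (1, 0.222), (1, 0.523)]),
    "T9": (5, [(1, 0.222), (1, 0.62), (0, 0)]),
    "T10": (5, [(3, 1.268), (2, 0.62), (1, 0.523)]),
}

# SIF-IDF value of each transaction: sum of SIF * IDF over the sensitive itemsets.
scores = {tid: sum(count / size * idf for count, idf in parts)
          for tid, (size, parts) in rows.items()}

# Descending order of SIF-IDF values; equal values keep their table order.
order = sorted(scores, key=scores.get, reverse=True)
print(order)
```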
Table 4-7: The sorted transactions according to the SIF-IDF values
TID Item SIF-IDF
STEP 5: The transactions are processed in the above descending order to prune appropriate items. In this example, the set of sensitive itemsets is {cfh, af, c}. The occurrence frequencies of the items within the sensitive itemsets are {a:1, c:2, f:2, h:1}. The items are then sorted in the descending order of their frequencies as {c:2, f:2, a:1, h:1}, which will be used as the deletion sequence. From Table 4-7, transaction 4 has the best SIF-IDF value among all the ten. It is thus selected to be processed. The item c in transaction 4 is then first selected to be deleted.
STEP 6: After item c is deleted from the fourth transaction, the occurrence frequencies of the sensitive itemsets in the transactions are updated. The sensitive itemsets with their occurrence frequencies are then updated from {cfh:5, af:5, c:8} to {cfh:4, af:5, c:7}.
STEP 7: STEPs 2 to 6 are then repeated until the supports of all the sensitive itemsets are below the minimum count. The results of the final sanitized database in the example are shown in Table 4-8.
Table 4-8: The result of the final sanitized database in the example
TID Item
T1 a, b, d, f, g, h
T2 a, b, d, e
T3 b, c, d, g, h
T4 a, b, h
T5 c, d, e, g, i
T6 a, f, i
T7 b, c, d, e, f, g
T8 d, f, h, i
T9 a, d, e, f, i
T10 a, e, h
Chapter 5 A Greedy Approach Based on Lattices
The Apriori mining algorithm [5] uses the downward closure property to efficiently generate and test itemsets level-by-level. The process can be represented by a lattice structure as shown in Figure 5-1.
Figure 5-1: A lattice structure with five items
In this chapter, a lattice-like algorithm for privacy-preserving data mining (PPDM) is then proposed for efficiently hiding the sensitive itemsets. The measure Extra Count (EC) of a sensitive itemset is first defined below:
$$EC_j = f_j - s \times n + 1,$$
where fj is the occurrence frequency of the sensitive itemset sij, s is the user-specific support threshold, and n is the number of transactions in the database, 1≤ j≤m.
The lattice-like approach treats all the sensitive itemsets as the initial sets in the lattice structure. For example, in Figure 5-2, the three itemsets S1, S2, and S3 are sensitive and are put in the first level (as the initial sets) of the lattice. All the combinations of the initial sets are then created level by level through the union operation on the items contained. The transactions corresponding to each set in the lattice structure can thus be found through the intersection operation on the transactions of the combined sets. For the above example, the lattice structure is built as shown in Figure 5-2.
Figure 5-2: The lattice structure built from the three sensitive itemsets
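The construction of the lattice can be sketched in Python as follows: each node at level L combines L sensitive itemsets, its items are the union of their items, and its transactions are the intersection of their TID sets. The database, the contents of S1, S2 and S3, and the function name are hypothetical and are not taken from Figure 5-2.

```python
from itertools import combinations


def build_lattice(database, sensitive):
    """Return {level: {combination of names: (items, TIDs)}}.

    database:  mapping TID -> set of items
    sensitive: mapping name -> frozenset of items (the initial sets)
    """
    # TIDs of the transactions that contain each sensitive itemset.
    tids = {name: {tid for tid, items in database.items() if itemset <= items}
            for name, itemset in sensitive.items()}
    lattice = {}
    for level in range(1, len(sensitive) + 1):
        nodes = {}
        for combo in combinations(sorted(sensitive), level):
            # Union of the items, intersection of the supporting transactions.
            items = frozenset().union(*(sensitive[name] for name in combo))
            covered = set.intersection(*(tids[name] for name in combo))
            nodes[combo] = (items, covered)
        lattice[level] = nodes
    return lattice


# Hypothetical database and sensitive itemsets.
db = {1: set("abcd"), 2: set("abc"), 3: set("bcd"), 4: set("ab"), 5: set("cd")}
sensitive = {"S1": frozenset("ab"), "S2": frozenset("bc"), "S3": frozenset("cd")}
lattice = build_lattice(db, sensitive)

for level, nodes in lattice.items():
    for combo, (items, covered) in nodes.items():
        print(level, combo, sorted(items), sorted(covered))
```

Intersecting the TID sets gives exactly the transactions that contain the union of the items, so no additional database scan is needed beyond the one that builds the first level.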
For support-based approaches of PPDM, the supports of all the sensitive itemsets must be under the user-specific support threshold. In the proposed lattice-like approach, the item removal
process is performed from the higher levels to the lower ones. For example, in Figure 5-2, the set at level 3 is processed first. It has three sensitive itemsets, and the one with the minimal EC value is found. If the number of transactions in the node is larger than or equal to EC, then EC transactions in the node are randomly selected and the items contained in the selected sensitive itemset are deleted from them. Otherwise, the items in the selected itemset are deleted from all the transactions in the node. After that, the EC values of the remaining sensitive itemsets are updated for the next round of processing. This procedure is repeated until the EC values of all the sensitive itemsets become 0. The flowchart of the proposed lattice-like approach is shown in Figure 5-3. The proposed algorithm and an example are described in the next sections.
Figure 5-3: The flowchart of the proposed lattice-like approach
The details of the proposed algorithm are described below.
Calculate EC for each sensitive itemset
Create the lattice structure