Extending Suppression for Anonymization on Set-Valued Data

Shyue-Liang Wang1, Yu-Chuan Tsai2, Hung-Yu Kao2 and Tzung-Pei Hong3

1Department of Information Management
3Department of Computer Science and Information Engineering
National University of Kaohsiung
Kaohsiung, Taiwan 81148
{slwang; tphong}@nuk.edu.tw

2Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan 70101
{p7894131; hykao}@mail.ncku.edu.tw

ABSTRACT. Anonymization of relational data to protect privacy against re-identification attacks has been studied extensively in recent years. The problem has been shown to be NP-hard, and several heuristic and approximation algorithms have been proposed. However, much published data exists in set-valued format (e.g. transactional, search query, and recommendation data), and anonymization techniques for such data have not been well studied. Previous work [19] proposed borrowing suppression-based approximation algorithms for relational data, together with the concept of flipping, to achieve an O(k*log k)-approximation to k-anonymity on set-valued data in two phases. In this work, we propose a new approach that anonymizes set-valued data in one single phase. The proposed approach is based on direct estimation of the minimal number of addition and deletion operations, without using the suppression technique. We show that the proposed approach achieves an O(log k)-approximation to k-anonymity on set-valued data. Experimental results also demonstrate that, compared to the previous approach, the proposed algorithms require fewer addition/deletion operations and incur less information loss on both real-world and synthetic data sets.

Keywords: Anonymization, K-anonymity, Privacy Preservation, Approximation Algorithm, Set-Valued Data.

1. Introduction

Recent publishing of market-basket data, recommendation data, and search query log data for public analysis, in pursuit of better system design, better service, and improved overall search quality, has raised privacy issues such as re-identification attacks. To protect the privacy of users against different types of attacks, data should be anonymized before they are published.

Current practices to protect user privacy in published data include: (1) removing all identifiable personal information such as names and social security numbers, (2) limiting access, (3) "fuzzing" the data, (4) eliminating unnecessary groupings, and (5) augmenting with additional data. However, it is still easy for an attacker to identify a target by performing various structural, non-structural, and linking attacks. Consider the following examples of re-identification attacks on relational and transactional data.


For published relational data, a well-known example is that an attacker re-identified the Massachusetts governor from de-identified (name and social security number removed) patient data of state employees. By a simple "linking" of the patient data with public voter registration data, the identity and medical history of the governor were exposed. In fact, according to one study, approximately 87% of the population of the United States can be uniquely identified on the basis of their 5-digit zip code, sex, and date of birth [23-25].

For published transaction data, America Online (AOL) released a large portion of its search engine query logs for research purposes in August 2006. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL replaced each user's name with a random identifier. However, by examining unique query terms, the New York Times [4] demonstrated that searcher No. 4417749 could be traced back to Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia. Even though a query may not contain an address or a name, a searcher may still be re-identified from a combination of query terms that is unique enough to the searcher.

For published recommendation data, Netflix announced a $1-million prize for improving its movie rating and recommendation system in October 2006. A dataset of 100 million movie ratings posted by 500,000 subscribers over a 6-year period was released. Similar to AOL's approach, Netflix replaced each username with a random identifier. However, it was shown that a subscriber can be re-identified by an adversary who knew 6 out of 8 movies that the subscriber had rated outside of the top 500 movies [20].

Motivated by these examples and others, many privacy models and anonymization approaches have been studied. Depending on the format of the data and the adversary's knowledge about the target of attack, non-trivial but practical techniques have been proposed. However, most of these works concentrate on relational data with the additional assumption of Domain Generalization Hierarchies (DGH) on the data attributes. Few works have studied the problem of anonymizing set-valued data without assuming any item correlation. In [19], Motwani and Nabar proposed a k-anonymization algorithm for set-valued data that is based on a suppression algorithm for relational data and the concept of flipping, operating in two phases. In this work, we address the same problem of anonymizing set-valued data without assuming any additional constraint on data items. We adopt the set-cover greedy approach to determine the optimal partition of a set of transactions, such that the total number of additions and deletions of items in the partition is minimized while achieving k-anonymity in a single phase. To find such a partition, we propose a new measure that directly estimates the number of operations required for each block of a partition. We show that this technique achieves an O(log k)-approximation to k-anonymity for set-valued data. Furthermore, we demonstrate numerically that the proposed algorithms require fewer addition/deletion operations and incur less information loss than the previous approach. To summarize our contributions:

• We present a new measure to estimate the number of addition and deletion operations required to achieve k-anonymity for set-valued data.


• We present three direct anonymization algorithms that are based on the new measure to accomplish k-anonymity.

• We show that the proposed algorithms can achieve an O(log k)-approximation to the k-anonymity problem for set-valued data.

• We provide comparative results of numerical experiments on real-life and synthetic datasets and demonstrate that the proposed algorithms require fewer operations, shorter running time, and less information loss.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 gives the problem description. Section 4 presents the proposed anonymization algorithms. Section 5 reports the numerical experiments. Section 6 concludes the paper.

2. Related Work

Recent studies in privacy-preserving data publishing have attracted considerable attention, especially in the context of relational data. One main task is to prevent re-identification attacks, in which Quasi-Identifier (QID) attributes such as age, sex, and zip code are used to infer the identities of target victims. To address this threat, the most widely studied anonymity paradigm is k-anonymity [9, 22-25] and its variants, l-diversity [17] and t-closeness [14]. The model requires each record in a dataset to be indistinguishable from at least k-1 other records with respect to the set of QID attributes. To form such an equivalence class, or anonymized group, many techniques have been proposed: generalization, suppression, clustering, permutation, and perturbation, among others.


Assuming a domain generalization hierarchy (DGH) exists for each QID attribute in relational data, k-anonymity can be achieved through generalization, which maps detailed attribute values to values in the DGH. Many bottom-up [23-25] and top-down [5] traversals of the DGH for finding the minimal generalized values have been studied and reported.

The suppression-based approach removes certain attribute values (cell suppression) [1] or entire records (tuple suppression) [23] from the microdata. It can be combined with the generalization technique [15, 26] or used independently in approximation algorithms [18, 21]. This process of data anonymization is called recoding. Global recoding means that a particular detailed value is always mapped to the same generalized value, whereas local recoding allows mapping to different generalized values in different anonymized groups. Many anonymization techniques based on suppression have been proposed.

For relational data without DGH, clustering-based local recoding methods [2] assume that metrics for measuring the distance between attribute values exist and can be used to form clusters of records. Each cluster is an anonymized group of records, and only the cluster center and its size are published to achieve anonymity. Several clustering-based anonymization techniques have been reported.

Instead of generalizing QID values, the permutation-based approach decouples the sensitive attribute (SA) values from the QID values: it does not modify the values directly but permutes the SA values among the records [27]. In [7], correlated QIDs are permuted into a group and the corresponding sensitive items are made homogeneous. The drawback of this approach is that the QID data and SA data must be published separately.

In [3], random perturbation by adding noise to the data was proposed to prevent re-identification of records. However, since the added noise is correlated with the original data [10], outliers can be easily exposed when the attacker has access to external knowledge [13].

For set-valued data with a given DGH on items, a local recoding, top-down generalization approach has been proposed to achieve k-anonymity [8]. For set-valued data without DGH, Motwani and Nabar [19] proposed a two-phase approach. It first transforms the set-valued data into binary relational data and applies the suppression-based anonymization techniques used for relational data; the concept of flipping is then applied to the suppressed data to obtain the anonymized result. It has been shown that k-anonymization of relational data is NP-hard, whether using generalization [1] or suppression [18]. Approximation algorithms with ratios of O(k*log k) [18], O(k) [1], and O(log k) [21] that minimize the number of suppressed entries on relational data were then proposed, but due to the loose approximation ratios, these algorithms perform poorly for large k. In this work, we concentrate on developing direct approximation algorithms for set-valued data.


3. Problem Description

This section describes the k-anonymity problem for set-valued data. Transaction data, online search query data, and rating/recommendation data are all considered set-valued data. For simplicity, we use transaction data for illustration.

3.1 K-anonymity for transaction data

Let D = {T1, …, Tn} be a dataset containing n transactions, where each transaction Ti, 1 ≦ i ≦ n, consists of a subset of items selected from a given universal itemset I = {e1, …, e|I|}.

In order to anonymize the identity of each transaction, k-anonymity for transaction data imposes a constraint that for every transaction in the data set, there must exist at least (k-1) other identical transactions. The k-anonymity concept and k-anonymity problem for transaction data are defined as follows. The definitions are similar to those in [19].

Definition 1. (K-anonymity for transaction data) We say that D is k-anonymous if every transaction Tj ∈ D has at least (k-1) other identical transactions in the dataset D.

Given this definition, the k-anonymization problem studied here is to efficiently determine the minimal number of additions and deletions on a dataset so that the modified dataset is k-anonymous.

Definition 2. (K-anonymization problem for transaction data) Given a transaction dataset D = {T1, …, Tn}, determine the minimal number of items that need to be added to or deleted from the transactions T1, ..., Tn, to ensure that the resulting dataset D' is k-anonymous.
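As a concrete check of Definition 1, the following sketch (ours, in Python; not part of the original paper) tests whether a dataset, represented as a list of item sets, is k-anonymous.

from collections import Counter

def is_k_anonymous(dataset, k):
    """Definition 1: every transaction must have at least (k-1) other
    identical transactions, i.e. every distinct transaction occurs >= k times."""
    counts = Counter(frozenset(t) for t in dataset)
    return all(c >= k for c in counts.values())

# The six transactions of Figure 1(a), written as item sets.
D = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(is_k_anonymous(D, 3))  # False: e.g. {e1, e2} occurs only once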

Figure 1(c) shows an example of 3-anonymization of the transaction data in Figure 1(a). The modified dataset in Figure 1(c) consists of a partition with two 3-anonymous blocks: {T1, T4, T5} and {T2, T3, T6}. Item e2 is deleted from transaction T1 and item e1 is added to transaction T2. It takes only one addition operation and one deletion operation to achieve 3-anonymity, which is the minimal number of operations over all possible partitions of the given dataset. As such, k-anonymity can be achieved by finding the optimal partition such that the sum of the numbers of addition and deletion operations over all blocks is minimal.

(a) Original dataset   (b) Suppressed dataset   (c) Anonymized dataset

     e1 e2 e3               e1 e2 e3                 e1 e2 e3
T1    1  1  0          T1    1  *  0            T1    1  0  0
T2    0  0  1          T2    *  0  1            T2    1  0  1
T3    1  0  1          T3    *  0  1            T3    1  0  1
T4    1  0  0          T4    1  *  0            T4    1  0  0
T5    1  0  0          T5    1  *  0            T5    1  0  0
T6    1  0  1          T6    *  0  1            T6    1  0  1

Figure 1. An example of 3-anonymization
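To illustrate the operation counts on a single block, here is a minimal sketch (ours) that minimally anonymizes one block by a per-item majority rule; on block {T1, T4, T5} of Figure 1(a) it reproduces the single deletion of e2 described above.

def anonymize_block(block):
    """Make all transactions in a block identical at minimal cost: for each
    item (column), keep it in every transaction iff at least half of the
    block contains it; the cost per item is min(#containing, #not containing)."""
    n = len(block)
    cols = list(zip(*block))
    target = [1 if 2 * sum(col) >= n else 0 for col in cols]
    ops = sum(min(sum(col), n - sum(col)) for col in cols)
    return [list(target) for _ in block], ops

# Block {T1, T4, T5} of Figure 1(a); rows are (e1, e2, e3) indicators.
rows, cost = anonymize_block([(1, 1, 0), (1, 0, 0), (1, 0, 0)])
print(rows, cost)  # every row becomes [1, 0, 0]; cost = 1 (delete e2 from T1)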

3.2 Suppression-based k-anonymity

To achieve k-anonymity on transaction data, Motwani and Nabar [19] proposed a skillful two-phase technique based on the suppression algorithm for relational data by Park and Shim [21] and the concept of flipping. In phase one, the suppression algorithm efficiently determines a partition of the dataset such that the total number of suppressions is minimal. For example, the suppression algorithm partitions the dataset of Figure 1(a) into two blocks Π = {{T1, T4, T5}, {T2, T3, T6}}, where in the first block T1 = T4 = T5 = (1, *, 0) and in the second block T2 = T3 = T6 = (*, 0, 1), as shown in Figure 1(b). The *

here represents that the item value is suppressed, 1 represents that the corresponding item is in the transaction, and 0 represents that the item is not in the transaction. In the second phase, for each block of the partition, the suppressed items are flipped to 1 or 0 depending on the minimal number of flips required. For example, in block {T1, T4, T5}, two transactions T4, T5 do not contain item e2 and one transaction T1 contains item e2. Deleting item e2 from transaction T1 (i.e. flipping item e2 from * to 0) makes this block of transactions identical and satisfies 3-anonymity with fewer operations. Similarly, in block {T2, T3, T6}, two transactions T3, T6 contain item e1 and one transaction T2 does not contain item e1. Adding item e1 to transaction T2 (i.e. flipping item e1 from * to 1) makes this block of transactions identical and satisfies 3-anonymity.
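A sketch of this second-phase flipping rule (our reading of [19], not the authors' code), assuming the block's original item values and the set of suppressed columns are available:

def flip_block(original_rows, suppressed_cols):
    """Phase-two flipping: for each suppressed column, set every transaction
    in the block to the majority of the original values, which needs the
    fewest addition/deletion operations for that column."""
    rows = [list(row) for row in original_rows]
    for j in suppressed_cols:
        ones = sum(row[j] for row in original_rows)
        majority = 1 if 2 * ones >= len(rows) else 0
        for row in rows:
            row[j] = majority
    return rows

# Block {T1, T4, T5} of Figure 1: column e2 (index 1) was suppressed.
# Original e2 values are [1, 0, 0], so flipping to 0 costs one deletion.
print(flip_block([(1, 1, 0), (1, 0, 0), (1, 0, 0)], [1]))
# -> [[1, 0, 0], [1, 0, 0], [1, 0, 0]]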

To obtain a partition with the minimal number of suppressions, Park and Shim [21] proposed using minimum length to estimate the number of suppressions required for each block of a given partition; the total for the partition is then estimated by the sum of the minimum lengths over all blocks. The partition with minimum length sum achieves a 2(1 + ln 2k)-approximation to the optimal number of suppressions for the k-anonymization problem. The minimum length, a(S), of a set of transactions S is defined as follows.

Definition 3. (Minimum length of suppression for relational data) Let S be a relational table and a(S) be the number of attributes with multiple distinct values in the table. The minimum length is defined as:

a(S) := |{i : ∃ u, v ∈ S, u[i] ≠ v[i]}|

where u[i] and v[i] are the values of the i-th attribute of transactions u and v respectively.

For example, a(S) = 1 for the first block {T1, T4, T5} in Figure 1, because attribute e2 has two distinct values, and a(S) = 1 for the second block {T2, T3, T6}, because attribute e1 has two distinct values.

However, even though minimum length sum may be a better estimate than minimum diameter [18] for suppression on relational data, anonymizing transaction data by directly estimating the additions and deletions of items for each block can yield a better partition and requires fewer operations. In the next section, we define a new measure on transaction data to find the optimal partition based on the minimal number of item addition and deletion operations required for each block, and propose approximation algorithms to achieve k-anonymity.


4. Approximation Algorithms

This section describes the approximation theory for the k-anonymity problem on transaction data. We first propose a new measure, called minimum operation, to estimate the number of additions and deletions required for each block of a partition. We then prove that the optimal partition with minimum operation sum leads to the optimal number of additions/deletions for the k-anonymity problem. Three algorithms based on the set-cover approximation approach, which achieve an O(log k)-approximation, are then presented.

4.1 Minimum operation sum

Given a relational dataset, k-minimum diameter sum [18] and k-minimum length sum [21] have been proposed to determine a partition with the minimal number of suppressions in order to achieve k-anonymity. For transaction data, we propose the following new measure, ad(S), to effectively estimate the number of addition and deletion operations required for a given block of transactions.

Definition 4. (Minimum operation of addition and deletion for transaction data) For a given block of transactions S, the minimal number of addition and deletion operations to anonymize S is defined as

ad(S) := Σ_{j=1}^{|I|} O_S(j)

where O_S(j) := min{number of transactions in S containing the j-th item, number of transactions in S not containing the j-th item}, and |I| is the total number of items in the universal itemset I.

For example, let S be the original dataset in Figure 1(a) with I = {e1, e2, e3}. Then O_S(1) = 1, O_S(2) = 1, O_S(3) = 3, and the minimum operation is ad(S) = 1 + 1 + 3 = 5.

Given a partition Π, the number of operations required to anonymize the entire partition (the minimum operation sum) is the sum of the minimum operations over all blocks:

ad(Π) := Σ_{S∈Π} ad(S)

For example, Figure 2 shows two different partitions of the same eight records into two blocks each. For the arbitrary partition in Figure 2(a), the minimum length sum is a(Π) = 3 + 3 = 6 and the minimum operation sum is ad(Π) = 4 + 4 = 8. For the optimal partition in Figure 2(b), the minimum length sum a(Π) is still 3 + 3 = 6, the same as before, so minimum length cannot distinguish the two partitions. However, the minimum operation sum ad(Π) of this partition is 3 + 3 = 6, which is smaller than that of the arbitrary partition and indicates that it is a better partition. This example shows that the proposed minimum operation sum is potentially a better estimate for finding the optimal partition for k-anonymity.

(a) Arbitrary partition        (b) Optimal partition

     e1 e2 e3                       e1 e2 e3
T1    1  1  1                  T1    1  1  1
T2    1  1  0                  T2    1  1  0
T4    0  1  1                  T3    1  0  1
T5    1  0  0                  T4    0  1  1

     e1 e2 e3                       e1 e2 e3
T3    1  0  1                  T5    1  0  0
T6    0  1  0                  T6    0  1  0
T7    0  0  1                  T7    0  0  1
T8    0  0  0                  T8    0  0  0

Figure 2. Two partitions of eight records
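The following sketch (ours) computes both measures for the two partitions of Figure 2 and reproduces the sums quoted above.

def a(block):
    """Minimum length (Definition 3): number of columns with more than
    one distinct value in the block."""
    return sum(1 for col in zip(*block) if len(set(col)) > 1)

def ad(block):
    """Minimum operation (Definition 4): min(#ones, #zeros) per column."""
    return sum(min(sum(col), len(col) - sum(col)) for col in zip(*block))

T = {1: (1, 1, 1), 2: (1, 1, 0), 3: (1, 0, 1), 4: (0, 1, 1),
     5: (1, 0, 0), 6: (0, 1, 0), 7: (0, 0, 1), 8: (0, 0, 0)}
arbitrary = [[T[i] for i in (1, 2, 4, 5)], [T[i] for i in (3, 6, 7, 8)]]
optimal = [[T[i] for i in (1, 2, 3, 4)], [T[i] for i in (5, 6, 7, 8)]]
for name, part in (("arbitrary", arbitrary), ("optimal", optimal)):
    print(name, sum(map(a, part)), sum(map(ad, part)))
# arbitrary: a = 6, ad = 8;  optimal: a = 6, ad = 6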

4.2 Approximation ratio

In this subsection, we first show that the optimal partition with minimum operation sum leads to the optimal solution to the k-anonymity problem for transaction data. Then, since (1 + ln 2k)-approximation algorithms have been introduced [12, 18, 21] to obtain the optimal partition using minimum diameter sum and minimum length sum, we show similarly that O(log k)-approximation algorithms can be obtained for the k-anonymity problem using minimum operation sum.


Let OPT(D) be the minimal number of addition and deletion operations for the optimal solution to the k-anonymity problem for a transaction dataset D. Let Π* be the optimal partition for the k-anonymity problem on transaction data. Let ANON(S) be the number of addition and deletion operations on S such that all transactions in S become identical. Then the optimal number of operations required to achieve k-anonymity on transaction data can be expressed as

OPT(D) = Σ_{S∈Π*} ANON(S)

On the other hand, for the optimal partition Π* with blocks S, the minimum operation sum can be expressed as ad(Π*) = Σ_{S∈Π*} ad(S). Since, for a given block of transactions, ad(S) = ANON(S), it follows that

ad(Π*) = Σ_{S∈Π*} ANON(S) = OPT(D)

This means that the optimal partition with minimum operation sum leads to the minimal number of additions/deletions for the optimal solution to the k-anonymity problem. The following lemma shows this relationship between ad(Π*) and the optimal solution for k-anonymity.

Lemma 1. For a transaction dataset D, we have

ad(Π*) = OPT(D) ≦ |D| · |I| / 2

where the cardinality of every block in the partition Π* is in the range [k, 2k-1].

Proof: Let S be a block of transactions in partition Π* with size between k and 2k-1. Since ad(S) = ANON(S), it follows that

ad(Π*) = Σ_{S∈Π*} ad(S) = Σ_{S∈Π*} ANON(S) = OPT(D).

In addition, ad(S) = Σ_{j=1}^{|I|} O_S(j) ≦ Σ_{j=1}^{|I|} |S|/2 = |S| · |I| / 2. Hence

OPT(D) = ad(Π*) = Σ_{S∈Π*} ad(S) ≦ Σ_{S∈Π*} |S| · |I| / 2 = |D| · |I| / 2. □

It has been shown [12, 18, 21] that the set-cover approximation approach provides algorithms that obtain a (1 + ln 2k)-approximation to the optimal (k, 2k-1)-partition, using either minimum diameter sum or minimum length sum. In the next subsection, we present three set-cover based algorithms that obtain a (1 + ln 2k)-approximation partition. This leads to the following approximation property: an O(log k)-approximation to the optimal k-anonymity problem for transaction data can be achieved.

Theorem 1. Let α ≧ 1, and let Π be a (k, 2k-1)-partition with operation sum at most α times that of the optimal solution of the k-minimum operation sum problem. Then the algorithm that minimally anonymizes each S ∈ Π is an α-approximation algorithm for the optimal k-anonymization problem.

Proof: Since ad(Π) ≦ α · ad(Π*) and, from Lemma 1, ad(Π*) = OPT(D), it is straightforward that ad(Π) ≦ α · OPT(D). □

4.3 Approximation algorithms based on minimum operation sum

This subsection presents three approximation algorithms based on minimum operation sum. The first algorithm forms the partition from an arbitrary collection of subsets of transactions. The second algorithm forms it from the sets of transactions that contain frequent itemsets. The third algorithm forms it from the sets of transactions that contain sorted frequent itemsets.

On a relational dataset D, to obtain an optimal partition that requires minimal suppression to achieve k-anonymity, Meyerson [18] proposed a set-cover type greedy approach based on Johnson's approximation theory for combinatorial problems [12]. The basic idea consists of two steps. In the first step, it considers all possible subsets of D with sizes in the range [k, 2k-1]. For each subset, the algorithm calculates its minimum diameter (or minimum length in [21]) and selects the subset with the smallest value; the selected subset is then added to a cover. The same process is repeated until all transactions are included in the cover. In the second step, the cover is converted into a partition to ensure the subsets are pairwise disjoint.

Let F be the collection of all possible subsets of the transaction dataset D with cardinality in the range [k, 2k-1], and let S ∈ F be a set in the collection. Our first approximation algorithm using k-minimum operation sum follows the set-cover and partition procedures described above [12, 18, 21]. It runs in two phases:

1. It generates a [k, 2k-1]-cover whose operation sum is at most (1 + ln 2k) times the operations required for the optimal solution. This is described in the procedure Cover(D, F) below.

2. It converts the cover into a disjoint [k, 2k-1]-partition. This conversion process is similar to [18] and is described in the procedure Convert(Γ) below.

Procedure Cover(D, F)
Input: dataset D, collection F
Output: cover Γ

1. Let Γ := ∅, D' := ∅; // D' contains the transactions of D covered so far
2. While (D' ≠ D) {
3.   For each S ∈ F, compute r(S) := ad(S) / |S − D'|;
4.   Choose an S with minimal r(S);
5.   D' := D' ∪ S;
6.   Γ := Γ ∪ {S}; }
7. Output Γ;

Procedure Convert(Γ)
Input: [k, 2k-1]-cover Γ
Output: [k, 2k-1]-partition Π'

1. For any Si, Sj ∈ Γ and Tm ∈ D such that Tm ∈ Si ∩ Sj
2.   If (|Si| > k or |Sj| > k) {
3.     Delete Tm from the larger set;
4.     Insert the larger set into Π' and delete the larger set from Γ; }
5.   Else { // |Si| = |Sj| = k
6.     Insert Si ∪ Sj into Π';
7.     Delete Si and Sj from Γ; }
8. Output Π';

Combining the two procedures described above, the first set-cover based greedy approximation algorithm for the k-anonymity problem on transaction data is given below.

Algorithm 1. (Set_Cover)
Input: dataset D, collection F
Output: anonymized database D' that satisfies k-anonymity

1. Γ := Cover(D, F);
2. Π := Convert(Γ);
3. For each S ∈ Π,
4.   Minimally add or delete ANON(S) items on S such that all transactions are identical;
5. Output D' := Π;

Based on an analysis similar to that of the greedy algorithm for set cover on subsets of cardinality at most 2k [12, 21], it can be shown that the cover produced by procedure Cover(D, F) is also a (1 + ln 2k)-approximation to the k-minimum operation sum problem. For the runtime of algorithm Set_Cover: in the first phase, the algorithm chooses at most |D| sets from F, and each selection requires O(|F|) time to determine which subset is to be selected, so the runtime of the first phase is O(|D| · |F|). The second-phase conversion is polynomial in |D| and is dominated by the first phase, so the total runtime over the two phases is O(|D| · |F|).
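For concreteness, here is a compact sketch of the greedy cover phase (ours; representing each candidate block by a frozenset of transaction ids with a precomputed ad() cost is a hypothetical layout, and r(S) is the ratio from step 3 above):

def greedy_cover(D, F):
    """Greedy [k, 2k-1]-cover sketch: repeatedly pick the candidate block S
    with the smallest cost per newly covered transaction,
    r(S) = ad(S) / |S - D'|, until every transaction in D is covered.
    D: set of transaction ids; F: dict mapping frozenset of ids -> ad() cost.
    Assumes the union of the blocks in F covers D."""
    cover, covered = [], set()
    while covered != D:
        best = min((S for S in F if S - covered),
                   key=lambda S: F[S] / len(S - covered))
        cover.append(best)
        covered |= best
    return cover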


To improve efficiency, instead of collecting all possible subsets of D with sizes in the range [k, 2k-1], we propose including in the collection F only the sets of transactions that contain frequent itemsets with support count greater than or equal to k. In other words, frequent itemsets with support count at least k are computed first. For each frequent itemset v, the transactions containing v form a subset S, which is added to F. The size of F is thus greatly reduced compared to the F used in algorithm Set_Cover. The greedy approach then selects the S with minimum operation ad(S) and adds it to the cover. The proposed frequent-itemset based algorithm is given as follows.

Algorithm 2. (APX_Cover)
Input: dataset D
Output: anonymized database D' that satisfies k-anonymity

1. Remove transactions that have at least (k-1) other identical records from D and put them in D';
2. Find all frequent itemsets vi and their support counts opcount(vi) from D;
3. Let FIL = { vi, opcount(vi) };
4. For (each vi ∈ FIL) {
5.   Find the subset of transactions S(vi) containing vi and add S(vi) to F;
6.   Calculate ad(S(vi)) and update opcount(vi) = ad(S(vi)) for vi in FIL; }
7. Sort the S(vi) in increasing order of opcount(vi);
8. While (D ≠ ∅ and F ≠ ∅) {
9.   If |S(vi)| ≦ 2k-1
10.    { Anonymize(S(vi)) and add to D';
11.      Remove S(vi) from D and remove vi from FIL; }
12.  Else
13.    { S'(vi) := randomly select (2k-1) transactions from S(vi);
14.      S(vi) := S(vi) − S'(vi);
15.      Anonymize(S'(vi)) and add to D'; }
16. } // end of While
17. Output D';
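The candidate generation of steps 2-5 can be sketched as follows (ours; the brute-force subset enumeration stands in for a real frequent-itemset miner such as Apriori, and the dictionary layout is a hypothetical choice):

from itertools import combinations

def candidate_blocks(D, k):
    """For every itemset v contained in at least k transactions, the set
    S(v) of transactions containing v becomes one candidate block.
    D maps transaction id -> set of items."""
    items = sorted(set().union(*D.values()))
    F = {}
    for size in range(1, len(items) + 1):
        for v in combinations(items, size):
            S = frozenset(t for t, s in D.items() if set(v) <= s)
            if len(S) >= k:  # the support count of v is at least k
                F[v] = S
    return F

D = {1: {"e1", "e2"}, 2: {"e3"}, 3: {"e1", "e3"},
     4: {"e1"}, 5: {"e1"}, 6: {"e1", "e3"}}
print(candidate_blocks(D, 3))
# {('e1',): frozenset({1, 3, 4, 5, 6}), ('e3',): frozenset({2, 3, 6})}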

To further improve the efficiency of the previous two approximation algorithms, we rearrange the order of the subsets in the collection F. The basic idea is as follows. Grouping transactions that contain the same frequent itemset into a block tends to reduce the number of addition and deletion operations required to anonymize the whole block. By the same token, grouping transactions that contain a larger frequent itemset (an itemset with more items) into a block should also reduce the operations required. As such, before selecting subsets from F based only on minimum operation, the subsets are sorted in descending order of itemset size, as illustrated in the sketch after the listing below. The following algorithm modifies APX_Cover by adopting this heuristic.

Algorithm 3. (Sort_APX_Cover)
Input: dataset D
Output: anonymized database D' that satisfies k-anonymity

1. Remove transactions that have at least (k-1) other identical records from D and put them in D';
2. Find all frequent itemsets vi and their support counts opcount(vi) from D;
3. Let FIL = { vi, opcount(vi) };
4. Find the subsets of transactions S(vi) containing vi and add them to F;
5. Calculate ad(S(vi)) and update opcount(vi) = ad(S(vi)) for vi in FIL;
6. Sort the S(vi) in descending order of itemset size m and ascending order of opcount(vi);
7. While (D ≠ ∅ and F ≠ ∅) {
8.   If |S(vi)| ≦ 2k-1
9.     { Anonymize(S(vi)) and add to D';
10.      Remove S(vi) from D and remove vi from FIL; }
11.  Else
12.    { S'(vi) := randomly select (2k-1) transactions from S(vi);
13.      S(vi) := S(vi) − S'(vi);
14.      Anonymize(S'(vi)) and add to D'; }
15. } // end of While
16. Output D';
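The only change from APX_Cover is the sort key of step 6, shown in this small sketch (ours; the Candidate record is a hypothetical representation):

from collections import namedtuple

# Each candidate pairs a frequent itemset v with its cost opcount(v) = ad(S(v)).
Candidate = namedtuple("Candidate", ["itemset", "opcount"])

candidates = [Candidate(("e1",), 3), Candidate(("e1", "e3"), 2),
              Candidate(("e3",), 1)]
# Step 6: larger itemsets first (descending size), ties broken by cheaper cost.
candidates.sort(key=lambda c: (-len(c.itemset), c.opcount))
print([c.itemset for c in candidates])  # [('e1', 'e3'), ('e3',), ('e1',)]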

5. Experiments

To evaluate the performance of the proposed algorithms and compare them with the suppression-based approach, we ran simulations on the BMS-WebView-1 real-world dataset and on IBM synthetic datasets [11]. We carried out the comparisons in two phases. In the first phase, we compare the number of addition/deletion operations and the running times of the minimum length based and minimum operation based algorithms. Because an arbitrary collection F is not feasible for large datasets, we use the frequent-itemset based APX_Cover algorithm. For the minimum length based approach, five implementations were proposed in [21]. The fastest implementation is OPT-LB, which is not an algorithm for the k-anonymity problem but gives a lower bound on the number of suppressed cells for the optimal solution. We modify this implementation to use frequent itemsets instead of closed frequent itemsets, to be consistent with the APX_Cover algorithm; we call this implementation modified a(S) (OPT-LB with frequent itemsets). In the second phase, we further compare the Sort_APX_Cover algorithm with the APX_Cover and modified a(S) algorithms. We examine the number of operations, running time, and information loss.

All experiments reported in this section were performed on a Pentium 4 2.3 GHz machine with 2 GB of main memory, running the Microsoft Windows 2000 operating system. All methods were implemented using Microsoft SQL Server 2000. The BMS-WebView-1 dataset contains 59,602 transactions and 497 items, with maximum transaction length 267 and average length 2.5. The tested dataset consists of 10,000 records selected from BMS-WebView-1. The number of items tested ranges from 3 to 10 (the top-3 to top-10 most frequent items, respectively). Three IBM synthetic datasets are used, each containing 10,000 records with 8, 10, and 12 items respectively; the average transaction length is 2 for all three datasets.

Figure 3 shows the number of addition and deletion operations under different privacy levels k, 3 ≦ k ≦ 200. The numbers of operations and running times are averages over six simulation runs. The improvement in the number of operations ranges from -3% to 15%, with an average of about 4%. This implies that the proposed APX_Cover algorithm requires fewer addition and deletion operations than the previously proposed suppression-based approach (modified a(S)). Figure 4 shows the running times under different privacy levels k, 3 ≦ k ≦ 200. The percentage difference in running time ranges from -15% to 0%, with an average of -7%; that is, APX_Cover requires more running time than the modified a(S) algorithm.

Figure 5 shows the number of addition and deletion operations under different numbers of items (k fixed at 5). As when varying k, the proposed APX_Cover algorithm requires fewer addition and deletion operations; the improvement ranges from -14% to 23%, with an average of 7%. Figure 6 shows the running times under different numbers of items (k fixed at 5). The percentage difference ranges from -6% to 1%, with an average of -2%; that is, APX_Cover requires slightly more running time than the modified a(S) algorithm.

Figure 3. Number of operations varying k (APX_Cover vs. modified_a(S))

Figure 4. Running time varying k (APX_Cover vs. modified_a(S))


Figure 5. Number of operations varying item (k=5)

Figure 6. Running time varying item (k=5)

For the second phase, we compare the performance of the Sort_APX_Cover algorithm, which applies the additional sorting heuristic, against the APX_Cover and modified a(S) algorithms. In addition to the number of addition/deletion operations and running times, we also compare the information loss of the anonymized data.

To estimate information loss, we adopt the Kullback-Leibler (KL) divergence for random variables with binary values. Intuitively, the KL divergence measures the number of additional bits required when coding a random variable with probability distribution f(x) using an alternative probability distribution g(x); it compares the entropy of two distributions over the same random variable. For a random variable X = {0, 1} and two distributions f(x) and g(x), let f(x=1) = r, f(x=0) = 1-r, g(x=1) = s, g(x=0) = 1-s. The KL divergence is

KL_Divergence(g, f) = (1-s) log((1-s)/(1-r)) + s log(s/r)

and, summed over the items of a block,

KL = Σ_i [ (1-s_i) log((1-s_i)/(1-r_i)) + s_i log(s_i/r_i) ]

where KL(g, f) = 0 if and only if f = g, and 0 log(0/x) and 0 log(0/0) are defined as zero.

To measure the KL divergence between the original dataset and the anonymized dataset, one such random variable is defined per item. For the block of three transactions {T1, T4, T5} with three items shown in Figure 1(a), the probabilities of the items in the original dataset are [r1, r2, r3] = [1, 1/3, 0], as item e1 appears three times, item e2 appears once, and item e3 appears zero times in the three transactions. The probabilities of the items in the anonymized dataset of Figure 1(c), calculated similarly, are [s1, s2, s3] = [1, 0, 0]. The KL divergence of this anonymized block of transactions is log(3/2).
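A sketch (ours) of this per-block computation, using the 0 log(0/x) = 0 convention stated above; it reproduces log(3/2) for the block of Figure 1:

from math import log

def kl_block(r, s):
    """KL divergence between per-item Bernoulli distributions of a block:
    r[i] is the fraction of transactions containing item i in the original
    block, s[i] the fraction in the anonymized block. Terms with a zero
    numerator are taken as 0, per the convention in the text; a zero
    denominator with nonzero numerator would make the divergence infinite."""
    def term(p, q):
        return 0.0 if p == 0 else p * log(p / q)
    return sum(term(1 - si, 1 - ri) + term(si, ri) for ri, si in zip(r, s))

# Block {T1, T4, T5} of Figure 1: r = [1, 1/3, 0], s = [1, 0, 0].
print(kl_block([1, 1 / 3, 0], [1, 0, 0]))  # log(3/2), approx. 0.405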

Figure 7 shows the number of addition and deletion operations under different privacy levels k, 3 ≦ k ≦ 200, for the modified a(S), APX_Cover, and Sort_APX_Cover algorithms. The improvement in the number of operations for Sort_APX_Cover over modified a(S) ranges from 2% to 24%, with an average of about 11%. Figure 8 shows the running times of the three algorithms. The percentage difference ranges from -17% to 30%, with an average of 10% less running time for the Sort_APX_Cover algorithm. This implies that the proposed Sort_APX_Cover algorithm requires not only fewer operations but also less running time.

Figures 9 and 10 show the number of addition/deletion operations and the running times, respectively, under different numbers of items (k fixed at 5). Similarly, the Sort_APX_Cover algorithm requires, on average, 8% fewer operations and 11% less running time than the modified a(S) algorithm.

Figures 11 and 12 show the KL divergence of the three algorithms. There is one KL divergence value for each block of a partition; the value shown is the average over all blocks of a given partition. The Sort_APX_Cover algorithm incurs 8% less information loss than the modified a(S) algorithm, averaged over all privacy levels k, 3 ≦ k ≦ 200, and 4% less information loss under different numbers of items. Overall, the Sort_APX_Cover algorithm delivers better performance in terms of fewer operations, shorter running time, and less information loss.


Figure 7. Number of operations with sorting (k)

Figure 8. Running time with sorting (k)


Figure 9. Number of operations with sorting (item)

Figure 10. Running time with sorting (item)

Figure 11. KL Divergence (k)


Figure 12. KL Divergence (item)

6. Conclusion

In this work, we have studied the privacy-preserving data publishing problem in general and proposed anonymization techniques against re-identification attacks on set-valued data in particular. The k-anonymity problem for set-valued data is known to be NP-hard. Existing approximation algorithms for k-anonymity on set-valued data rely indirectly on the suppression technique for relational data. Our approach counts addition/deletion operations directly on the set-valued data and hence achieves a better approximation ratio. The improvements of the proposed direct approach, in terms of fewer operations, shorter running time, and less information loss, have been shown both theoretically and numerically in this work.

Even though the proposed direct approach shows improvements over the indirect suppression-based approach, more work can be done. Both approaches try to estimate and find an "optimal" partition of the unprocessed dataset such that the sum of the addition/deletion operations over blocks of size [k, 2k-1] is minimized. We plan to investigate infusing different preprocessing steps and heuristics before the initial search for a partition, so that higher anonymity levels and lower information loss can be obtained. In addition, the balance between anonymity and information loss could be further examined. From a different perspective, the techniques studied here could potentially be extended to anonymization of social network graph data, and spatial and temporal data, among others.

Acknowledgments. This work was supported in part by the National Science Council, Taiwan, under grant NSC-99-2221-E-390-033.

References

[1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu, Anonymizing tables. In Proc. of the 10th International Conference on Database Theory, pp. 246-258, 2005.

[2] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu, Achieving anonymity via clustering. In Proc. of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 153-162, 2006.

[3] R. Agrawal and R. Srikant, Privacy preserving data mining. In Proc. of ACM SIGMOD, pp. 439-450, 2000.

[4] M. Barbaro and T. Zeller Jr., A face is exposed for AOL searcher no. 4417749. New York Times, Aug 2006.

[5] R. J. Bayardo and R. Agrawal, Data privacy through optimal k-anonymization. In Proc. of ICDE, pp. 217-228, 2005.

[6] Y. Gao, K.Y. Qin and J.L. Yang, On the axiomatic characterizations of approximation operators on a CCD lattice, ICIC Express Letters, vol.3, no.4(A), pp.915-920, 2009.

[7] G. Ghinita, Y. Tao, and P. Kalnis, On the anonymization of sparse high-dimensional data. In Proc. of ICDE, pp. 715-724, 2008.

[8] Y. He and J.F. Naughton, Anonymization of set-valued data via top-down, local generalization. In Proc. of VLDB, pp. 934-945, 2009.

[9] H.F. Huang, K.C. Liu and H.W. Wang, A new design of cryptographic key management for HIPAA privacy and security regulations, International Journal of Innovative Computing, Information and Control, vol.5, no.11(A), pp.3923-3932, 2009.

[10] Z. Huang, W. Du, and B. Chen, Deriving private information from randomized data. In Proc. of ACM SIGMOD, pp. 37-48, 2005.

[11] IBM Quest Market-Basket Synthetic Data Generator, www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData.

[12] D.S. Johnson, Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.

[13] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, Workload-aware anonymization. In Proc. of KDD, pp. 277-286, 2006.

[14] N. Li, T. Li, and S. Venkatasubramanian, t-closeness: Privacy beyond k-anonymity and l-diversity. In Proc. of ICDE, pp. 106-115, 2007.

[15] J.Q. Liu and K. Wang, Anonymizing transaction data by integrating suppression and generalization. In Proc. of PAKDD, pp. 171-180, 2010.

[16] P.Q. Liu, D.M. Zhu, H. Fan and Q.S. Xie, An efficient approximation algorithm of BCMV(p) based on LP-rounding, ICIC Express Letters, vol.3, no.4(A), pp.983-988, 2009.

[17] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity. ACM TKDD, article 3, 2007.

[18] A. Meyerson and R. Williams, On the complexity of optimal k-anonymity. In Proc. of PODS, pp. 223-228, 2004.

[19] R. Motwani and S.U. Nabar, Anonymizing unstructured data. arXiv:0810.5582v2 [cs.DB], 2008.

[20] A. Narayanan and V. Shmatikov, Robust de-anonymization of large sparse datasets. In Proc. of IEEE Symposium on Security and Privacy, pp. 111-125, 2008.

[21] H. Park and K. Shim, Approximate algorithms for k-anonymity. In Proc. of ACM SIGMOD, pp. 67-78, 2007.

[22] M. Saito, Y. Namba and S. Serikawa, Extraction of values representing human features in order to develop a privacy-preserving sensor, International Journal of Innovative Computing, Information and Control, vol.4, no.4, pp.883-896, 2008.

[23] P. Samarati, Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, 2001.

[24] P. Samarati and L. Sweeney, Generalizing data to provide anonymity when disclosing information. In Proc. of ACM Symposium on Principles of Database Systems, pp. 188, 1998.

[25] L. Sweeney, k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.

[26] W.K. Wong, N. Mamoulis, and D.W. Cheung, Non-homogeneous generalization in privacy preserving data publishing. In Proc. of SIGMOD, pp. 747-758, 2010.

[27] X. Xiao and Y. Tao, Anatomy: Simple and effective privacy preservation. In Proc. of VLDB, pp. 139-150, 2006.
