
**Extending Suppression for Anonymization on Set-Valued Data**

Shyue-Liang Wang¹, Yu-Chuan Tsai², Hung-Yu Kao² and Tzung-Pei Hong³

¹ Department of Information Management
³ Department of Computer Science and Information Engineering
National University of Kaohsiung
Kaohsiung, Taiwan 81148
{slwang; tphong}@nuk.edu.tw

² Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan 70101
{p7894131; hykao}@mail.ncku.edu.tw

ABSTRACT. *Anonymization of relational data to protect privacy against re-identification attacks has been studied extensively in recent years. The problem has been shown to be NP-hard, and several heuristic and approximation algorithms have been proposed. However, much published data exists in set-valued format (e.g. transactional, search query, and recommendation data), and anonymization techniques for such data have not been well studied. Previous work [19] borrowed suppression-based approximation algorithms for relational data, together with the concept of flipping, to achieve an O(k log k)-approximation to k-anonymity on set-valued data in two phases. In this work, we propose a new approach that anonymizes set-valued data in one single phase. The proposed approach is based on direct estimation of the minimal number of addition and deletion operations, without using the suppression technique. We show that the proposed approach achieves an O(log k)-approximation to k-anonymity on set-valued data. Experimental results also demonstrate that, compared to the previous approach, the proposed algorithms require fewer addition/deletion operations and incur less information loss on both real-world and synthetic datasets.*

Keywords: *Anonymization, K-anonymity, Privacy Preservation, Approximation Algorithm, Set-Valued Data.*

**1. Introduction **

The recent publication of market-basket data, recommendation data, and search query log data for public analysis, in pursuit of better system design, better service, and improved overall search quality, has raised privacy issues such as re-identification attacks. To protect users' privacy against different types of attacks, data should be anonymized before it is published.

Current practices to protect user privacy in published data include: (1) removing all identifiable personal information such as names and social security numbers, (2) limiting access, (3) "fuzzing" the data, (4) eliminating unnecessary groupings, and (5) augmenting with additional data, etc. However, it is still easy for an attacker to identify a target by performing different kinds of structural, non-structural, and linking attacks. Let us consider the following examples of re-identification attacks on relational data and transaction data.


For published relational data, a well-known example is the re-identification of the Massachusetts governor from a de-identified (names and social security numbers removed) patient dataset of state employees. By simply "linking" the patient data with public voter registration data, the identity and medical history of the governor were exposed. In fact, according to one study, approximately 87% of the population of the United States can be uniquely identified on the basis of their 5-digit zip code, sex, and date of birth [23-25].

For published transaction data, America Online (AOL) released a large portion of its search engine query logs for research purposes in August 2006. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL replaced each user's name with a random identifier. However, by examining unique query terms, the New York Times [4] demonstrated that searcher No. 4417749 could be traced back to Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia. Even though a query contains no address or name, a searcher may still be re-identified from a combination of query terms that is unique enough to that searcher.

For published recommendation data, Netflix announced a $1 million prize for improving its movie rating and recommendation system in October 2006. A dataset of 100 million movie ratings posted by 500,000 subscribers over a 6-year period was released. Similar to AOL's approach, Netflix replaced each username with a random identifier. However, it was shown that a subscriber could still be re-identified if an adversary knew 6 out of 8 movies that the subscriber had rated outside of the top 500 movies [20].

Motivated by these and other examples, many privacy models and anonymization approaches have been studied. Depending on the format of the data and the adversary's knowledge about the target of attack, non-trivial but practical techniques have been proposed. However, most of these works concentrate on relational data, with the additional assumption of Domain Generalization Hierarchies (DGH) on the data attributes. Few works have studied the problem of anonymizing set-valued data without assuming any item correlation. In [19], Motwani et al. proposed a k-anonymization algorithm for set-valued data that was based on a suppression algorithm for relational data and the concept of flipping, and that runs in two phases. In this work, we address the same problem of anonymizing set-valued data without assuming any additional constraint on data items. We adopt a set-cover greedy approach to determine the optimal partition of a set of transactions, such that the total number of item additions and deletions over the partition is minimized while achieving k-anonymity in one phase. To find such a partition, we propose a new measure that directly estimates the number of operations required for each block in a partition. We show that this technique achieves an O(log k)-approximation to k-anonymity for set-valued data. Furthermore, we demonstrate numerically that the proposed algorithms require fewer addition/deletion operations and less information loss than the previous approach. To summarize our contributions:

• We present a new measure to estimate the number of addition and deletion operations required to achieve k-anonymity for set-valued data.

• We present three direct anonymization algorithms that are based on the new measure to accomplish k-anonymity.

• We show that the proposed algorithms achieve an O(log k)-approximation to the k-anonymity problem for set-valued data.

• We provide comparative numerical experiments on real-life and synthetic datasets and demonstrate that the proposed algorithms require fewer operations, shorter running times, and less information loss.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 gives the problem description. Section 4 presents the proposed anonymization algorithms. Section 5 reports the numerical experiments. Section 6 concludes the paper.

**2. Related Work **

Recent studies in privacy-preserving data publishing have attracted considerable attention, especially in the context of relational data. One main task is to prevent re-identification attacks, in which quasi-identifier (QID) attributes such as age, sex, and zip code are used to infer the identities of target victims. To address this threat, the most widely studied anonymity paradigm is k-anonymity [9, 22-25] and its variants, l-diversity [17] and t-closeness [14]. The model requires each record in a dataset to be indistinguishable from at least k-1 other records with respect to the set of QID attributes. To form such an equivalence class or anonymized group, many techniques have been proposed: generalization, suppression, clustering, permutation, and perturbation, etc.


Assuming that a domain generalization hierarchy exists for each QID attribute in relational data, k-anonymity can be achieved through generalization, which maps detailed attribute values to values in the DGH. Many bottom-up [23-25] and top-down [5] traversals of the DGH for finding the minimal generalized values have been studied and reported.

The suppression-based approach removes certain attribute values (cell suppression) [1] or entire records (tuple suppression) [23] from the microdata. It can be combined with generalization techniques [15, 26] or used independently in approximation algorithms [18, 21]. This process of data anonymization is called recoding. Global recoding means that a particular detailed value is always mapped to the same generalized value, whereas local recoding allows mapping to different generalized values in different anonymized groups. Many anonymization techniques based on suppression have been proposed.

For relational data without a DGH, clustering-based local recoding methods [2] assume that metrics for measuring the distance between attribute values exist and can be used to form clusters of records. Each cluster is an anonymized group of records, and only the cluster center and its size are published to achieve anonymity. Several clustering-based anonymization techniques have been reported.

Instead of generalizing QID values, the permutation-based approach decouples the sensitive attribute (SA) values from the QIDs; it does not modify the values directly but permutes the SA values among the records. In [7], records with correlated QIDs are permuted into a group and the corresponding sensitive items are made homogeneous. The problem with this approach is that the QID data and SA data must be published separately.

In [3], random perturbation by adding noise to the data was proposed to prevent re-identification of records. However, the added noise is correlated with the original data [10], and therefore outliers can easily be exposed when the attacker has access to external knowledge [13].

For set-valued data with a given DGH on items, a local-recoding, top-down generalization approach has been proposed to achieve k-anonymity [8]. For set-valued data without a DGH, Motwani et al. [19] proposed a two-phase approach. It first transforms the set-valued data into binary relational data and applies the suppression anonymization techniques used for relational data. The concept of flipping is then applied to the suppressed data to obtain the anonymized result. It has been shown that k-anonymization on relational data is NP-hard, whether using generalization [1] or suppression [18]. Approximation algorithms with ratios of O(k log k) [18], O(k) [1], and O(log k) [21] that minimize the number of suppressed entries on relational data were then proposed. But due to the loose approximation ratio, these algorithms show poor performance for large k. In this work, we concentrate on developing direct approximation algorithms for set-valued data.


**3. Problem Description **

*This section describes the k-anonymity problem for set-valued data. Transaction data, *
on-line search query data, and rating/recommendation data are all considered as set-valued
data. For simplicity, we will use transaction data for illustration.

**3.1 K-anonymity for transaction data **

Let D = {T1, …, Tn} be a dataset containing n transactions, where each transaction Ti, 1 ≤ i ≤ n, consists of a subset of items selected from a given universal itemset I = {e1, …, e|I|}. To anonymize the identity of each transaction, k-anonymity for transaction data imposes the constraint that for every transaction in the dataset, there must exist at least (k-1) other identical transactions. The k-anonymity concept and the k-anonymity problem for transaction data are defined as follows; the definitions are similar to those in [19].

**Definition 1. (K-anonymity for transaction data)** We say that D is k-anonymous if every transaction Tj ∈ D has at least (k-1) other identical transactions in the dataset D.

Given this definition, the k-anonymization problem studied here is to efficiently determine the minimal number of additions and deletions on a dataset so that the modified dataset is k-anonymous.

**Definition 2. (K-anonymization problem for transaction data)** Given a transaction dataset D and an anonymity parameter k, determine the minimal number of items that need to be added to or deleted from the transactions T1, ..., Tn to ensure that the resulting dataset D' is k-anonymous.
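Definition 1 can be checked directly by counting identical transactions. Below is a minimal Python sketch (illustrative only; the function name and the set-of-items encoding are ours, not part of the paper):

```python
from collections import Counter

def is_k_anonymous(transactions, k):
    """Definition 1: every transaction must have at least (k-1) other
    identical transactions in the dataset."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(c >= k for c in counts.values())

# Figure 1(c), the anonymized dataset: two identical groups of size 3.
D_anon = [{"e1"}, {"e1", "e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(is_k_anonymous(D_anon, 3))   # True

# Figure 1(a), the original dataset: T1 = {e1, e2} is unique.
D_orig = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(is_k_anonymous(D_orig, 3))   # False
```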

Figure 1(c) shows an example of 3-anonymization of the transaction data in Figure 1(a). The modified dataset in Figure 1(c) consists of a partition with two 3-anonymous blocks: {T1, T4, T5} and {T2, T3, T6}. The item e2 is deleted from transaction T1 and item e1 is added to transaction T2. It takes only one addition and one deletion to achieve 3-anonymity, which is the minimal number of operations over all possible partitions of the given dataset. As such, k-anonymity can be achieved by finding the optimal partition such that the sum of the numbers of addition and deletion operations over all blocks is minimal.

(a) Original dataset

|    | e1 | e2 | e3 |
|----|----|----|----|
| T1 | 1 | 1 | 0 |
| T2 | 0 | 0 | 1 |
| T3 | 1 | 0 | 1 |
| T4 | 1 | 0 | 0 |
| T5 | 1 | 0 | 0 |
| T6 | 1 | 0 | 1 |

(b) Suppressed dataset

|    | e1 | e2 | e3 |
|----|----|----|----|
| T1 | 1 | * | 0 |
| T2 | * | 0 | 1 |
| T3 | * | 0 | 1 |
| T4 | 1 | * | 0 |
| T5 | 1 | * | 0 |
| T6 | * | 0 | 1 |

(c) Anonymized dataset

|    | e1 | e2 | e3 |
|----|----|----|----|
| T1 | 1 | 0 | 0 |
| T2 | 1 | 0 | 1 |
| T3 | 1 | 0 | 1 |
| T4 | 1 | 0 | 0 |
| T5 | 1 | 0 | 0 |
| T6 | 1 | 0 | 1 |

Figure 1. An example of 3-anonymization

**3.2 Suppression-based k-anonymity **

To achieve k-anonymity on transaction data, Motwani et al. [19] proposed a skillful two-phase technique based on the suppression algorithm for relational data proposed by Park et al. [21] and the concept of flipping. In phase one, the suppression algorithm efficiently determines a partition of the dataset so that the total number of suppressions is minimal. For example, in the first phase, the suppression algorithm will partition the dataset into two blocks π = {{T1, T4, T5}, {T2, T3, T6}}, where in the first block T1 = T4 = T5 = (1, *, 0), and in the second block T2 = T3 = T6 = (*, 0, 1), as shown in Figure 1(b). Here * denotes that the item value is suppressed, 1 denotes that the corresponding item is in the transaction, and 0 denotes that it is not. In the second phase, for each block of the partition, the suppressed items are flipped to 1 or 0, depending on which requires the fewer flips. For example, in block {T1, T4, T5}, two transactions (T4, T5) do not contain item e2 and one transaction (T1) contains it. Deleting item e2 from transaction T1 (i.e. flipping e2 from * to 0) makes this block of transactions identical and satisfies 3-anonymity with fewer operations. Similarly, in block {T2, T3, T6}, two transactions (T3, T6) contain item e1 and one transaction (T2) does not. Adding item e1 to transaction T2 (i.e. flipping e1 from * to 1) makes this block of transactions identical and satisfies 3-anonymity.
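The phase-two flipping step amounts to making every block agree with its per-item majority, because flipping the minority side always costs fewer operations. A small illustrative Python sketch (the helper name `flip_block` and the encoding are ours, not from [19]):

```python
def flip_block(block, items):
    """For a block of transactions (sets of items), flip every item to its
    majority value; return the common transaction the block is flipped to
    and the number of add/delete operations this requires."""
    target, ops = set(), 0
    n = len(block)
    for j in items:
        have = sum(1 for t in block if j in t)
        if have > n - have:      # majority contains j: add j to the rest
            target.add(j)
            ops += n - have
        else:                    # majority lacks j: delete j from the rest
            ops += have
    return target, ops

# Block {T1, T4, T5} of Figure 1: T1 = {e1, e2}, T4 = T5 = {e1}.
blk = [{"e1", "e2"}, {"e1"}, {"e1"}]
print(flip_block(blk, ["e1", "e2", "e3"]))   # ({'e1'}, 1): delete e2 from T1
```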

To obtain a partition with a minimal number of suppressions, Park et al. [21] proposed using the minimum length to estimate the number of suppressions required for each block in a given partition; the total number of suppressions for a partition is then estimated by the sum of the minimum lengths over all blocks of the partition. A partition with minimum length sum achieves a 2(1 + ln 2k)-approximation to the optimal number of suppressions for the k-anonymization problem. The minimum length a(S) for a set of transactions S is defined as follows.

**Definition 3. (Minimum length of suppression for relational data)** Let S be a relational table and a(S) be the number of attributes with multiple distinct values in the table; the minimum length is defined as:

$$a(S) := |\{\, i : \exists\, u, v \in S,\ u[i] \neq v[i] \,\}|$$

where u[i] and v[i] are the values of the i-th attribute of the transactions u and v, respectively.

For example, a(S) = 1 for the first block {T1, T4, T5} in Figure 1, because attribute e2 takes two distinct values; and a(S) = 1 for the second block {T2, T3, T6}, because attribute e1 takes two distinct values.

However, even though the minimum length sum may be a better estimate than the minimum diameter [18] for suppressing relational data, for anonymizing transaction data a direct estimation of the item additions and deletions for each block can achieve a better partition and requires fewer operations. In the next section, we define a new measure on transaction data to find the optimal partition based on the minimal number of item addition and deletion operations required for each block, and propose approximation algorithms to achieve k-anonymity.


**4. Approximation Algorithms**

This section describes the approximation theory for the k-anonymity problem on transaction data. We first propose a new measure, called minimum operation, to estimate the number of additions and deletions required for each block in a partition. We then prove that the optimal partition with minimum operation sum leads to the optimal number of additions/deletions for the k-anonymity problem. Three algorithms based on the set-cover approximation approach, which achieve an O(log k)-approximation, are then presented.

**4.1 Minimum operation sum **

*Given a relational dataset, k-minimum diameter sum [18] and k-minimum length sum [21] *
have been proposed to determine a partition with minimal number of suppression in order
*to achieve k-anonymity. For transaction data, we propose the following new measure, *

*ad(S), to effectively estimate the number of addition and deletion operations required for a *

given block of transactions.

**Definition 4. (Minimum operation of addition and deletion for transaction data)** For a given block of transactions S, the minimal number of addition and deletion operations to anonymize S is defined as ad(S),

$$ad(S) := \sum_{j=1}^{|I|} O_s(j)$$

where $O_s(j) = \min\{\,|\{\text{transactions in } S \text{ containing the } j\text{-th item}\}|,\ |\{\text{transactions in } S \text{ not containing the } j\text{-th item}\}|\,\}$, and |I| is the total number of items in the universal itemset I.

For example, let S be the original dataset in Figure 1(a) and I = {e1, e2, e3}; then Os(1) = 1, Os(2) = 1, Os(3) = 3, and the minimum operation is ad(S) = 1 + 1 + 3 = 5.
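The measure ad(S) of Definition 4 can be computed in one pass over the items. A minimal Python sketch, checked against the Figure 1(a) example (function name and encoding are ours):

```python
def ad(S, items):
    """Minimum operation of Definition 4: for every item j, the cheaper of
    adding j to the transactions that lack it or deleting j from the
    transactions that have it, summed over all items."""
    n = len(S)
    total = 0
    for j in items:
        have = sum(1 for t in S if j in t)
        total += min(have, n - have)   # O_s(j)
    return total

# The whole dataset of Figure 1(a) as one block: ad(S) = 1 + 1 + 3 = 5.
D = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(ad(D, ["e1", "e2", "e3"]))   # 5
```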

Given a partition π, the number of operations required to anonymize the entire partition (the minimum operation sum) is the sum of the minimum operations over all blocks and can be expressed as

$$ad(\pi) = \sum_{S \in \pi} ad(S)$$

For example, Figure 2 shows two different partitions of the same eight records into two blocks each. For the arbitrary partition in Figure 2(a), the minimum length sum is a(π) = 3 + 3 = 6 and the minimum operation sum is ad(π) = 4 + 4 = 8. For the optimal partition in Figure 2(b), the minimum length sum a(π) is still 3 + 3 = 6, the same as for the previous partition, so it cannot distinguish the two partitions. However, the minimum operation sum ad(π) for this partition is 3 + 3 = 6, which is smaller than for the arbitrary partition and indicates that this is a better partition. This example shows that the proposed minimum operation sum is potentially a better estimate for finding the optimal partition for k-anonymity.

(a) Arbitrary partition

|    | e1 | e2 | e3 |
|----|----|----|----|
| T1 | 1 | 1 | 1 |
| T2 | 1 | 1 | 0 |
| T4 | 0 | 1 | 1 |
| T5 | 1 | 0 | 0 |

|    | e1 | e2 | e3 |
|----|----|----|----|
| T3 | 1 | 0 | 1 |
| T6 | 0 | 1 | 0 |
| T7 | 0 | 0 | 1 |
| T8 | 0 | 0 | 0 |

(b) Optimal partition

|    | e1 | e2 | e3 |
|----|----|----|----|
| T1 | 1 | 1 | 1 |
| T2 | 1 | 1 | 0 |
| T3 | 1 | 0 | 1 |
| T4 | 0 | 1 | 1 |

|    | e1 | e2 | e3 |
|----|----|----|----|
| T5 | 1 | 0 | 0 |
| T6 | 0 | 1 | 0 |
| T7 | 0 | 0 | 1 |
| T8 | 0 | 0 | 0 |

Figure 2. Two partitions of eight records

**4.2 Approximation ratio **

In this subsection, we first show that the optimal partition with minimum operation sum leads to the optimal solution of the k-anonymity problem for transaction data. Then, since O(1 + ln 2k)-approximation algorithms have been introduced [5, 13, 17] to obtain the optimal partition using the minimum diameter sum and the minimum length sum, we show similarly that O(log k)-approximation algorithms can be obtained for the k-anonymity problem using the minimum operation sum.


Let OPT(D) be the minimal number of addition and deletion operations for the optimal solution to the k-anonymity problem on a transaction dataset D. Let π* be the optimal partition for the k-anonymity problem on transaction data, and let ANON(S) be the number of addition and deletion operations on S such that all transactions in S become identical. Then the optimal number of operations required to achieve k-anonymity on transaction data can be expressed as

$$OPT(D) = \sum_{S \in \pi^*} ANON(S).$$

On the other hand, for the optimal partition π* with blocks S, the minimum operation sum can be expressed as $ad(\pi^*) = \sum_{S \in \pi^*} ad(S)$. Since for a given block of transactions ad(S) = ANON(S), it follows that

$$ad(\pi^*) = \sum_{S \in \pi^*} ad(S) = \sum_{S \in \pi^*} ANON(S) = OPT(D).$$

This means that the optimal partition with minimum operation sum leads to the minimal number of additions/deletions for the optimal solution to the k-anonymity problem. The following lemma shows this relationship between ad(π*) and the optimal solution for k-anonymity.

**Lemma 1.** For a transaction dataset D, we have

$$ad(\pi^*) = OPT(D) \le \frac{|I|\,|D|}{2}$$

where the cardinality of every block in the partition π* is in the range [k, 2k-1].

Proof: Let S be a block of transactions in partition π* with size in [k, 2k-1]. Since ad(S) = ANON(S), it follows that

$$ad(\pi^*) = \sum_{S \in \pi^*} ad(S) = \sum_{S \in \pi^*} ANON(S) = OPT(D).$$

In addition, since $O_S(j) \le |S|/2$ for every item j,

$$ad(S) = \sum_{j=1}^{|I|} O_S(j) \le \sum_{j=1}^{|I|} \frac{|S|}{2} = \frac{|I|\,|S|}{2}.$$

Hence

$$OPT(D) = ad(\pi^*) = \sum_{S \in \pi^*} ad(S) \le \sum_{S \in \pi^*} \frac{|I|\,|S|}{2} = \frac{|I|}{2} \sum_{S \in \pi^*} |S| = \frac{|I|\,|D|}{2}. \qquad \blacksquare$$

In [5, 13, 17], it has been shown that the set-cover approximation approach can provide algorithms that obtain an O(1 + ln 2k)-approximation to the optimal (k, 2k-1)-partition, using either the minimum diameter sum or the minimum length sum. In the next subsection, we present three set-cover based algorithms that obtain an O(1 + ln 2k)-approximation partition. This leads to the following approximation property: an O(log k)-approximation to the optimal k-anonymity problem for transaction data can be achieved.

**Theorem 1.** Let α ≥ 1, and let π be a (k, 2k-1)-partition with operation sum at most α times that of the optimal solution of the k-minimum operation sum problem. Then the algorithm that minimally anonymizes each S ∈ π is an α-approximation algorithm for the optimal k-anonymization problem.

Proof: Since $\sum_{S \in \pi} ANON(S) = ad(\pi) \le \alpha \cdot ad(\pi^*)$ and, from Lemma 1, $ad(\pi^*) = OPT(D)$, it is straightforward that the number of operations is at most $\alpha \cdot OPT(D)$. ∎

**4.3 Approximation algorithms based on minimum operation sum **

This subsection presents three approximation algorithms based on the minimum operation sum. The first algorithm forms the optimal partition from an arbitrary collection of transaction subsets. The second algorithm forms it from the transactions that contain frequent itemsets. The third algorithm forms it from the transactions that contain sorted frequent itemsets.

For a relational dataset D, to obtain an optimal partition that requires minimal suppression in order to achieve k-anonymity, Meyerson [18] proposed a set-cover type greedy approach based on Johnson's approximation theory for combinatorial problems [12]. The basic idea of the approach consists of two steps. In the first step, it collects all possible subsets of D with sizes in the range [k, 2k-1]. For each subset, the algorithm calculates its minimum diameter (or minimum length in [21]) and selects the subset with the smallest value; the selected subset is then saved in a cover. The same process is repeated until all transactions are included in the cover, so that each subset saved in the cover had minimal diameter at the time it was selected. In the second step, the cover is converted to a partition to ensure that the subsets are pairwise disjoint.

Let F be the collection of all possible subsets of the transaction dataset D with cardinality in the range [k, 2k-1], and let S ∈ F be a set in the collection. Our first approximation algorithm using the k-minimum operation sum follows the set-cover and partition procedures described above [12, 18, 21]. It runs in two phases:

1. It generates a [k, 2k-1]-cover whose operation sum is at most (1 + ln 2k) times the number of operations required for the optimal solution. This is described in the procedure Cover(D, F) below.

2. It converts the cover into a partition whose blocks are pairwise disjoint. The conversion process is similar to [18] and is described in the procedure Convert(C) below.

**Procedure Cover(D, F)**

Input: dataset D, collection F
Output: cover C

1. Let C := ∅, D' := ∅; // D' contains the currently covered transactions from D
2. While (D' ≠ D) {
3.   For each S ∈ F, compute r(S) := ad(S) / |S \ D'|;
4.   Choose an S with minimal r(S);
5.   D' := D' ∪ S;
6.   C := C ∪ {S}; }
7. Output C;

**Procedure Convert(C)**

Input: [k, 2k-1]-cover C
Output: [k, 2k-1]-partition π'

1. For any Si, Sj ∈ C and Tm ∈ D such that Tm ∈ Si ∩ Sj
2.   If (|Si| > k or |Sj| > k) {
3.     Delete Tm from the larger set;
4.     Insert the larger set into π' and delete the larger set from C; }
5.   Else { // |Si| = |Sj| = k
6.     Insert Si ∪ Sj into π';
7.     Delete Si and Sj from C; }
8. Output π';

Combining the two procedures described above, the first set-cover based greedy approximation algorithm for the k-anonymity problem on transaction data is given below.

**Algorithm 1. (Set_Cover)**

Input: dataset D, collection F
Output: anonymized database D' that satisfies k-anonymity

1. C := Cover(D, F);
2. π := Convert(C);
3. For each S ∈ π,
4.   Minimally add or delete ANON(S) items on S such that all transactions are identical;
5. Output D' := π;
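The overall flow of Algorithm 1 can be sketched compactly in Python. This is an illustrative sketch only: the function names are ours, and the Convert step is simplified (each transaction stays in the first block that covered it, whereas the paper's Convert also guarantees that every block keeps size at least k):

```python
from itertools import combinations

def ad(S, items):
    """Minimum operation ad(S) of Definition 4 for a block of transactions."""
    n = len(S)
    return sum(min(h, n - h)
               for h in (sum(1 for t in S if j in t) for j in items))

def set_cover_anonymize(D, items, k):
    """Greedy sketch of Algorithm 1 (Set_Cover): F holds all index subsets
    of size in [k, 2k-1] (exponential, so only workable for tiny D); the
    cover repeatedly picks the subset with the smallest ad / newly-covered
    ratio; finally each block is flipped to its per-item majority."""
    n = len(D)
    F = [set(c) for size in range(k, 2 * k)
                for c in combinations(range(n), size)]
    covered, cover = set(), []
    while covered != set(range(n)):                       # Procedure Cover
        best = min((S for S in F if S - covered),
                   key=lambda S: ad([D[i] for i in S], items) / len(S - covered))
        cover.append(best)
        covered |= best
    partition, seen = [], set()                           # simplified Convert
    for S in cover:
        block = S - seen
        if block:
            partition.append(block)
            seen |= block
    D_out = list(D)                                       # minimal anonymization
    for block in partition:
        target = {j for j in items
                  if 2 * sum(1 for i in block if j in D[i]) > len(block)}
        for i in block:
            D_out[i] = set(target)
    return D_out

# The dataset of Figure 1(a); with k = 3 this yields the result of Figure 1(c).
D = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(set_cover_anonymize(D, ["e1", "e2", "e3"], 3))
```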

Based on an analysis similar to that of the greedy algorithm for set cover on subsets of cardinality at most 2k [12, 21], it can be shown that the cover C produced by procedure Cover(D, F) is also a (1 + ln 2k)-approximation to the k-minimum operation sum problem. For the runtime of algorithm Set_Cover: in the first phase, the algorithm chooses at most |D| sets from F, where each selection requires O(|F|) time to determine which subset is to be selected, so the runtime of the first phase is O(|D| |F|). The second phase only resolves overlaps among the chosen sets, so the total runtime over the two phases is dominated by the first phase, O(|D| |F|).


To improve efficiency, instead of collecting all possible subsets of D with sizes in [k, 2k-1], we propose including in the collection F only the sets of transactions that contain frequent itemsets with support count greater than or equal to k. In other words, frequent itemsets with support count ≥ k are computed first. For each frequent itemset v, the transactions that contain v form a subset S, which is collected into F. The size of the collection F is thus greatly reduced compared to the F used in algorithm Set_Cover. The greedy approach then selects the S with minimum operation ad(S) and adds it to the cover. The proposed frequent-itemset based algorithm is given as follows.

**Algorithm 2. (APX_Cover)**

Input: dataset D
Output: anonymized database D' that satisfies k-anonymity

1. Remove transactions that have at least (k-1) other identical records from D and put them in D';
2. Find all frequent itemsets vi and their support counts opcount(vi) from D;
3. Let FIL = { (vi, opcount(vi)) };
4. For (each vi ∈ FIL) {
5.   Find the subset of transactions S(vi) containing vi and let F = {S(vi)};
6.   Calculate ad(S(vi)) and update opcount(vi) = ad(S(vi)) for vi in FIL; }
7. Sort S(vi) in increasing order of opcount(vi);
8. While (D ≠ ∅ and F ≠ ∅) {
9.   If |S(vi)| ≤ 2k-1
10.    { Anonymize(S(vi)) and add to D';
11.      Remove S(vi) from D and remove vi from FIL; }
12.  Else
13.    { S'(vi) := randomly select (2k-1) transactions from S(vi);
14.      S(vi) := S(vi) - S'(vi);
15.      Anonymize(S'(vi)) and add to D'; }
16. }; // end of While
17. Output D';
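The reduced collection F used in steps 2-5 can be illustrated as follows. This Python sketch enumerates itemsets by brute force for clarity (a real implementation would use Apriori or FP-growth); the function name is ours:

```python
from itertools import combinations

def frequent_itemset_collection(D, items, k):
    """Build the reduced collection F of APX_Cover: for every itemset v
    contained in at least k transactions (support count >= k), collect
    S(v), the indices of the transactions containing v."""
    F = {}
    for size in range(1, len(items) + 1):
        for v in combinations(items, size):
            S_v = frozenset(i for i, t in enumerate(D) if set(v) <= t)
            if len(S_v) >= k:        # support count >= k
                F[v] = S_v
    return F

# Figure 1(a) with k = 3: only {e1} and {e3} are frequent enough.
D = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
for v, S_v in frequent_itemset_collection(D, ["e1", "e2", "e3"], 3).items():
    print(v, sorted(S_v))
```

Note how much smaller this F is than the full collection of all [k, 2k-1]-sized subsets used by Set_Cover: here two candidate blocks instead of dozens.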

To further improve the efficiency of the previous two approximation algorithms, we re-arrange the order of the subsets in the collection F. The basic idea is as follows. Grouping transactions that contain the same frequent itemset into a block tends to reduce the number of addition and deletion operations required to anonymize the whole block. By the same token, grouping transactions that contain a larger frequent itemset into a block should reduce the required operations further; a larger frequent itemset here means an itemset with more items. As such, before selecting subsets from F based only on minimum operation, the subsets should be sorted in descending order of itemset size. The following algorithm modifies the previous APX_Cover algorithm by adopting this heuristic.

**Algorithm 3. (Sort_APX_Cover)**

Input: dataset D
Output: anonymized database D' that satisfies k-anonymity

1. Remove transactions that have at least (k-1) other identical records from D and put them in D';
2. Find all frequent itemsets vi and their support counts opcount(vi) from D;
3. Let FIL = { (vi, opcount(vi)) };
4. Find the subsets of transactions S(vi) containing vi and let F = {S(vi)};
5. Calculate ad(S(vi)) and update opcount(vi) as ad(S(vi)) for vi in FIL;
6. Sort S(vi) in descending order of itemset size m and ascending order of opcount(vi);
7. While (D ≠ ∅ and F ≠ ∅) {
8.   If |S(vi)| ≤ 2k-1
9.     { Anonymize(S(vi)) and add to D';
10.      Remove S(vi) from D and remove vi from FIL; }
11.  Else
12.    { S'(vi) := randomly select (2k-1) transactions from S(vi);
13.      S(vi) := S(vi) - S'(vi);
14.      Anonymize(S'(vi)) and add to D'; }
15. }; // end of While
16. Output D';
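The two-key ordering of step 6 (descending itemset size first, then ascending operation count) can be expressed as a single sort key. A small Python sketch with made-up FIL entries (the itemsets and opcount values below are hypothetical, chosen only to show the ordering):

```python
# Hypothetical FIL entries after step 5: (itemset v, opcount(v)) pairs.
FIL = [(("e1",), 4), (("e1", "e3"), 2), (("e3",), 2), (("e1", "e2"), 3)]

# Step 6 of Sort_APX_Cover: larger frequent itemsets come first; among
# itemsets of equal size, the one needing fewer operations comes first.
FIL.sort(key=lambda entry: (-len(entry[0]), entry[1]))
print(FIL)
# [(('e1', 'e3'), 2), (('e1', 'e2'), 3), (('e3',), 2), (('e1',), 4)]
```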

**5. Experiments **

To evaluate the performance of the proposed algorithms and compare them with the suppression-based approach, we ran simulations on the BMS-WebView-1 real-world dataset and on IBM synthetic datasets [11]. We carried out the comparisons in two phases. In the first phase, we compare the number of addition/deletion operations and running times of the minimum length based and minimum operation based algorithms. Because an arbitrary collection F is not feasible for large datasets, we use the frequent-itemset based APX_Cover algorithm. For the minimum length based approach, five implementations were proposed in [21]. The fastest, OPT-LB, is not an algorithm for the k-anonymity problem, but it gives a lower bound on the number of suppressed cells for the optimal solution of the k-anonymity problem. We modify this implementation to use frequent itemsets instead of closed frequent itemsets, to be consistent with the APX_Cover algorithm; we call this implementation modified a(S) (OPT-LB with frequent itemsets). In the second phase, we further compare the Sort_APX_Cover algorithm with the APX_Cover and modified a(S) algorithms. We examine the number of operations, running time, and information loss.

All experiments reported in this section were performed on a Pentium 4 2.3 GHz machine with 2 GB of main memory, running the Microsoft Windows 2000 operating system. All methods were implemented using Microsoft SQL Server 2000. The BMS-WebView-1 dataset contains 59,602 transactions and 497 items, with a maximum transaction length of 267 and an average length of 2.5. The tested dataset consists of 10,000 records selected from the BMS-WebView-1 dataset. The number of items tested ranges from 3 to 10 (the top-3 to top-10 frequent items, respectively). Three IBM synthetic datasets are used; each contains 10,000 records, with 8, 10, and 12 items respectively, and an average transaction length of 2.

Figure 3 shows the number of addition and deletion operations under different privacy levels k, 3 ≦ k ≦ 200. The reported numbers of operations and running times are averages of six simulation runs. The improvement in the number of operations ranges from -3% to 15%, with an average of about 4%. This implies that the proposed APX_Cover algorithm requires fewer addition and deletion operations than the previously proposed suppression-based approach (modified_a(S)). Figure 4 shows the running times under different privacy levels k, 3 ≦ k ≦ 200. The percentage difference in running time ranges from -15% to 0%, with an average of -7%. This implies that the APX_Cover algorithm requires more running time than the modified_a(S) algorithm.

Figure 5 shows the number of addition and deletion operations under different numbers of items (k is fixed at 5). Similar to the case of varying k, the proposed APX_Cover algorithm requires fewer addition and deletion operations; the improvement ranges from -14% to 23%, with an average of 7%. Figure 6 shows the running times under different numbers of items (k is fixed at 5). The percentage difference ranges from -6% to 1%, with an average of -2%. This implies that the APX_Cover algorithm requires slightly more running time than the modified_a(S) algorithm.

Figure 3. Number of operations varying k

Figure 4. Running time varying k

Figure 5. Number of operations varying item (k=5)

Figure 6. Running time varying item (k=5)

For the second phase, we compare the performance of the Sort_APX_Cover algorithm, which applies an additional sorting heuristic, against the APX_Cover and modified_a(S) algorithms. In addition to the number of addition/deletion operations and running times, we also compare the information loss of the anonymized data.

To estimate information loss, we adopt the Kullback-Leibler (KL) divergence for random variables of binary values. Intuitively, the KL divergence measures the number of additional
bits required when coding a random variable with a probability distribution f(x) while using an alternative probability distribution g(x). It basically compares the entropies of two distributions over the same random variable. For a random variable X = {0, 1} and two distributions f(x) and g(x), let f(x=1) = r, f(x=0) = 1-r, g(x=1) = s, and g(x=0) = 1-s. The KL divergence is then given as

KL(g, f) = (1-s) log((1-s)/(1-r)) + s log(s/r),

where KL(g, f) = 0 if and only if f = g, and 0 log(0/x) and 0 log(0/0) are defined as zero. For a block containing m items, the KL divergence is the sum of the per-item divergences:

KL(g, f) = Σ_{i=1..m} [(1-s_i) log((1-s_i)/(1-r_i)) + s_i log(s_i/r_i)].

To measure the KL divergence between the original dataset and the anonymized dataset, a random variable is defined for each item. For the block of three transactions {T1, T4, T5} with three items shown in Figure 1(a), the probabilities of the items in the original dataset are [r1, r2, r3] = [1, 1/3, 0], as item e1 appears three times, item e2 appears once, and item e3 does not appear in the three transactions. The probabilities of the items in the anonymized dataset given in Figure 1(c) can be calculated similarly and are [s1, s2, s3] = [1, 0, 0]. The KL divergence of this anonymized set of transactions is therefore log(3/2).
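The per-item computation above can be sketched in a few lines of code (a sketch only; the function name and the frequency-list layout are our own, not from the original implementation):

```python
import math

def kl_divergence(r, s):
    """KL(g, f) between a block's original item frequencies r and its
    anonymized item frequencies s, per the binary-variable formula above.
    Terms of the form 0 log(0/x) and 0 log(0/0) are defined as zero."""
    def term(p, q):
        # p * log(p/q), with the 0 log(0/x) = 0 convention
        return 0.0 if p == 0 else p * math.log(p / q)
    return sum(term(si, ri) + term(1 - si, 1 - ri)
               for ri, si in zip(r, s))

# Block {T1, T4, T5}: r = [1, 1/3, 0] before, s = [1, 0, 0] after
print(kl_divergence([1, 1/3, 0], [1, 0, 0]))  # ≈ 0.405, i.e., log(3/2)
```

Running this on the example block reproduces the log(3/2) value derived above.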

Figure 7 shows the number of addition and deletion operations under different privacy levels k, 3 ≦ k ≦ 200, for the modified_a(S), APX_Cover, and Sort_APX_Cover algorithms. The
improvement in the number of operations, for Sort_APX_Cover over modified_a(S), ranges from 2% to 24%, with an average of about 11%. Figure 8 shows the running times of the three algorithms. The percentage difference ranges from -17% to 30%, with the Sort_APX_Cover algorithm requiring 10% less running time on average. This implies that the proposed Sort_APX_Cover algorithm requires not only fewer operations but also less running time.

Figures 9 and 10 show the number of addition/deletion operations and the running times, respectively, under different numbers of items (k is fixed at 5). Similarly, the Sort_APX_Cover algorithm requires, on average, 8% fewer operations and 11% less running time than the modified_a(S) algorithm.

Figures 11 and 12 show the KL divergence of the three algorithms. There is one KL divergence value for each block in a partition; the value shown here is the average over all blocks in a given partition. The Sort_APX_Cover algorithm incurs 8% less information loss than the modified_a(S) algorithm, averaged over all privacy levels k, 3 ≦ k ≦ 200, and 4% less information loss under different numbers of items. Overall, the Sort_APX_Cover algorithm performs better in terms of fewer operations, shorter running times, and less information loss.

Figure 7. Number of operations with sorting (k)

Figure 8. Running time with sorting (k)

Figure 9. Number of operations with sorting (item)

Figure 10. Running time with sorting (item)

Figure 11. KL Divergence (k)

Figure 12. KL Divergence (item)

**6. Conclusion **

In this work, we have studied the privacy-preserving data publishing problem in general and proposed anonymization techniques against re-identification attacks on set-valued data in particular. The problem of k-anonymity for set-valued data is known to be NP-hard. Existing approximation algorithms for k-anonymity on set-valued data rely indirectly on the suppression technique for relational data. Our approach calculates the number of addition/deletion operations directly on set-valued data and hence achieves a better approximation ratio. The improvements of the proposed direct approach, in terms of fewer operations, shorter running times, and less information loss, have been shown both theoretically and numerically in this work.

Even though the proposed direct approach shows improvements over the indirect suppression-based approach, more work can be done. Both approaches try to estimate and find an "optimal" partition of the "un-processed" data set such that the sum of
addition/deletion operations over all blocks, each of size in [k, 2k-1], is minimized. We plan to investigate incorporating different preprocessing steps and heuristics before the initial search for a partition, so that higher anonymity levels and lower information loss can be obtained. In addition, the balance between anonymity and information loss could be further examined. From a different perspective, the techniques studied here could potentially be extended to anonymization of social network graph data, and spatial and temporal data, among others.
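To make the per-block objective concrete: the minimal number of addition/deletion operations needed to make all transactions in a block identical can be computed item by item, since for each item one either deletes it from the transactions that contain it or adds it to those that do not, whichever is cheaper. The following is a minimal sketch of this cost (the function name and data layout are ours, for illustration only):

```python
def block_cost(block, universe):
    """Minimal number of item additions/deletions to make every
    transaction in a block identical: for each item, take the cheaper of
    deleting it everywhere it appears or adding it everywhere it is absent."""
    n = len(block)
    cost = 0
    for item in universe:
        count = sum(1 for t in block if item in t)
        cost += min(count, n - count)  # delete-all vs add-all, per item
    return cost

# Block {T1, T4, T5} of Figure 1(a): e1 in all three, e2 in one, e3 in none;
# one deletion (e2) suffices, matching the anonymized block of Figure 1(c).
print(block_cost([{'e1', 'e2'}, {'e1'}, {'e1'}], ['e1', 'e2', 'e3']))  # 1
```

A partition of the dataset into blocks of size [k, 2k-1] then minimizes the sum of these per-block costs.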

**Acknowledgments.** This work was supported in part by the National Science Council, Taiwan, under grant NSC-99-2221-E-390-033.

**References **

[1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu,
*Anonymizing tables. In Proc. of the 10th International Conference on Database Theory, pp. 246–258, *
2005.

[2] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving
*anonymity via clustering. In Proc. of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on *
*Principles of Database Systems, pp. 153–162, 2006. *

[3] *R. Agrawal and R. Srikant, Privacy preserving data mining. In Proc. of ACM SIGMOD, pp. 439-450, *
2000.

[4] M. Barbaro and T. Zeller Jr., A face is exposed for AOL searcher no. 4417749. New York Times, Aug 2006.
[5] R. J. Bayardo and R. Agrawal, Data privacy through optimal k-anonymization. In Proc. of ICDE, 2005.

[6] Y. Gao, K.Y. Qin and J.L. Yang, On the Axiomatic Characterizations of Approximation Operators on a
*CCD Lattice, ICIC Express Letters, vol.3, no.4 (A), pp.915-920, 2009. *

[7] *G. Ghinita, Y. Tao, and P. Kalnis, On the anonymization of sparse high-dimensional data. In Proc. of *

*ICDE, pp. 715-724, 2008. *

[8] *Y. He and J.F. Naughton, Anonymization of set-valued data via top-down, local generalization. In Proc. *

*of VLDB, pp. 934-945, 2009. *

[9] H.F. Huang, K.C. Liu and HW. Wang, A New Design of Cryptographic Key Management for HIPAA
*Privacy and Security Regulations, International Journal of Innovative Computing, Information and *

*Control, vol.5, no.11(A), pp.3923-3932, 2009. *

*[10] Z. Huang, W. Du, and B. Chen, Deriving private information from randomized data. In Proc. of ACM *

*SIGMOD, pp. 37-48, 2005. *

[11] IBM Quest Market-Basket Synthetic Data Generator,

www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData.

[12] D.S. Johnson, Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.

*[13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, Workload-aware Anonymization. In Proc. of KDD, pp. *
277–286, 2006.

[14] N. Li, T. Li, and S. Venkatasubramanian, t-closeness: Privacy beyond k-anonymity and l-diversity. In

*Proc. of ICDE, pp. 106-115, 2007. *

[15] J.Q. Liu and K. Wang. Anonymizing transaction data by integrating suppression and generalization. In

*Proc. of PAKDD, pp. 171-180, 2010. *

[16] P.Q. Liu, D.M. Zhu, H. Fan and Q.S. Xie, An Efficient Approximation Algorithm of BCMV(p) Based
*on LP-rounding, ICIC Express Letters, vol.3, no.4 (A), pp.983-988, 2009. *

*[17] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, l-diversity: Privacy beyond *

*k-anonymity. ACM TKDD, article 3, 2007. *

*[18] A. Meyerson and R. Williams, On the complexity of optimal k-anonymity. In Proc. of PODS, pp. *
223-228, 2004.

[19] R. Motwani and S.U. Nabar, Anonymizing unstructured data, arXiv: 0810.5582v2, [cs.DB], 2008.
*[20] A. Narayanan and V. Shmatikov, Robust de-anonymization of large sparse datasets. In Proceedings of *

*IEEE Symposium on security and privacy, pp. 111-125, 2008. *

*[21] H. Park and K. Shim, Approximate algorithms for k-anonymity. In Proc. of ACM SIGMOD, pp. 67–78, *
2007.

[22] M. Saito, Y. Namba and S. Serikawa, Extraction of Values Representing Human Features in Order to
*Develop a Privacy-preserving Sensor, International Journal of Innovative Computing, Information and *

*Control, vol.4, no.4, pp.883-896, 2008. *

*[23] P. Samarati, Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge *

*and Data Engineering, 13(6): 1010-1027, 2001. *

[24] P. Samarati and L. Sweeney, Generalizing data to provide anonymity when disclosing information. In Proc. of ACM Symposium on Principles of Database Systems, p. 188, 1998.

[25] L. Sweeney, k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.

[26] W.K. Wong, N. Mamoulis, and D.W. Cheung, Non-homogeneous generalization in privacy preserving data publishing. In Proc. of SIGMOD, pp. 747-758, 2010.

*[27] X. Xiao and Y. Tao, Anatomy: simple and effective privacy preservation, in Proc. of VLDB, pp. *
139-150, 2006.