4 Related Works - Research and Development of Sensitive Data Protection Techniques for Data Mining

The problem of hiding frequent patterns and association rules was first proposed in [9]. The authors proved that finding an optimal sanitization for hiding frequent patterns is NP-hard and proposed a heuristic approach that deletes items from transactions in the database to hide sensitive frequent patterns. In recent years, more and more researchers have started paying attention to privacy issues. The subsequent approaches can be classified into two categories: data modification and data reconstruction.

Data modification. The main idea of this group is to alter the original database such that the sensitive information cannot be mined from the new database. To decrease the support or confidence of sensitive rules below the user-predefined threshold, these algorithms choose some items as victims and delete them from, or insert them into, some transactions [6]. In [17], the authors present a novel approach that uses a disclosure parameter instead of the support threshold to directly control the balance between privacy requirements and information preservation. Based on the disclosure parameter, the support of each sensitive pattern is decreased by the same proportion. The proposed IGA algorithm first groups sensitive patterns and then chooses the victim items that cause minimal side effects on the database. In [7], the border-based concept was proposed to efficiently evaluate the impact of any modification on the database. The quality of the database and the relative frequency of each frequent itemset can be well maintained by greedily selecting the modifications with minimal side effects. In [4], the authors propose an algorithm that is secure against forward inference attacks. By multiplying the original database matrix by a sanitization matrix, the method improves efficiency. Most recent works focus on minimizing the side effects on the database [5][10]. These methods minimize the number of sanitized transactions or sanitized items, respectively, to limit the side effects on the database.
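As a rough illustration of the data-modification idea, and not the algorithm of any work cited above, the following sketch lowers the support of one sensitive itemset below a positive threshold by deleting a victim item from supporting transactions. The function names `support` and `hide_by_deletion`, the list-of-sets database representation, and the victim choice (the item of the sensitive itemset with the highest support) are all assumptions made for this example; the cited methods select victims so as to minimize side effects.

```python
from typing import FrozenSet, List, Set


def support(db: List[Set[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)


def hide_by_deletion(db: List[Set[str]], sensitive: FrozenSet[str],
                     threshold: float) -> List[Set[str]]:
    """Return a sanitized copy of `db` with support(sensitive) < threshold.

    Assumes threshold > 0. The victim is simply the most frequent item of
    `sensitive` (an assumption; ties are broken alphabetically).
    """
    new_db = [set(t) for t in db]  # leave the original database untouched
    victim = max(sorted(sensitive), key=lambda i: support(db, frozenset([i])))
    for t in new_db:
        if support(new_db, sensitive) < threshold:
            break  # the pattern is already hidden
        if sensitive <= t:
            t.discard(victim)  # this transaction no longer supports the pattern
    return new_db


if __name__ == "__main__":
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    pattern = frozenset({"a", "b"})
    sanitized = hide_by_deletion(toy, pattern, threshold=0.4)
    print(support(toy, pattern), support(sanitized, pattern))
```

Recomputing the support on every iteration keeps the sketch short but is quadratic in the database size; it is meant only to make the victim-deletion step concrete.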

Data reconstruction. The motivation of this group is that the previous database-based modification methods spend more time scanning the database, and the information in the released database cannot be directly controlled. Hence, the data-reconstruction group uses a knowledge-based method to directly reconstruct a released dataset containing the knowledge that the database owner wants to preserve. In general, the set of frequent patterns in the original dataset is regarded as the knowledge. The concept of the “inverse frequent set mining problem” was first proposed in [11] and was proved to be NP-hard. Subsequently, much research has applied this concept to privacy issues and algorithm benchmarks [12]. In [3], the authors proposed a constraint inverse itemset lattice mining technique to automatically generate a sample dataset that can be released for sharing. It shows that if there exists a feasible support set of all itemsets, a new database containing the same frequent itemsets can be generated by one-to-one mapping. In [13], the authors proposed an FP-tree-based method for inverse frequent set mining, and the new database exactly satisfies all the given constraints. However, this method does not provide complete hiding: it only keeps the support counts of the non-sensitive patterns the same as before, and the frequencies do not satisfy the original constraint. For this issue, the major open problem is how to find a feasible support set that has a compatible dataset.
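To make the notion of a dataset that is compatible with a support set concrete, the sketch below checks whether a candidate dataset reproduces a given set of support counts. It only verifies a candidate and does not solve the inverse mining problem itself. The names `support_count` and `is_compatible` and the dictionary layout of the targets are assumptions for this example.

```python
from typing import Dict, FrozenSet, List, Set


def support_count(db: List[Set[str]], itemset: FrozenSet[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def is_compatible(db: List[Set[str]],
                  target_counts: Dict[FrozenSet[str], int]) -> bool:
    """True if each itemset in `target_counts` has exactly the required count."""
    return all(support_count(db, x) == c for x, c in target_counts.items())


if __name__ == "__main__":
    candidate = [{"a", "b"}, {"a"}, {"b", "c"}]
    wanted = {frozenset({"a"}): 2, frozenset({"a", "b"}): 1}
    print(is_compatible(candidate, wanted))
```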

In addition, the concept of multiple support thresholds was first proposed in [14], owing to the observation that the support thresholds of different itemsets are not always uniform in practice. We take some existing specifications of multiple support thresholds as benchmarks for our sanitization framework; they are introduced in the next section.

5 Experiments

In this section, the performance, efficiency, and scalability of our proposed sanitization process are shown. We use the support constraint [18] and the maximum constraint [8] to assign a different sensitive threshold to each sensitive pattern. In addition, in order to compare our sanitization process with the Item Grouping Algorithm [17], IGA for short, the same disclosure threshold as IGA is used by our framework to show its performance under a uniform sensitive threshold. These constraints are described as follows:

Disclosure thresholds: This setting is devised to compare our method with IGA under uniform support thresholds. Therefore, the sensitive thresholds are set to be the same as those of IGA, as follows:

st(X) = sup(X) × α

where α is the same as the disclosure threshold used by IGA.

Support constraints: This setting is similar to that in [18]. We first partition the support range in the database into a given number of intervals (the bin number). Each interval has the same number of items, so that each bin, B_i, contains all items in the i-th interval.

Next, the support constraints are generated with schemas made up of all possible combinations of bins. The support threshold of the support constraint SC_k(B_1, ..., B_r) ≤ θ_k is defined as follows:

$$\theta_k = \min\{\gamma^{k-1} \times S(B_1) \times \cdots \times S(B_r),\ 1\}$$

where S(B_i) denotes the smallest item support in bin B_i, and γ is an integer larger than 1. A large value of γ can be used to slow down the rapid decrease of S(B_1) × ... × S(B_r). We can vary the value of γ to generate different support constraints.

Maximal constraints: We use the same formula as in [14] to assign the support threshold of each item:

$$st(X) = \begin{cases} \mathrm{sup}(X) \times \sigma & \text{if } \mathrm{sup}(X) \times \sigma > \mathrm{minsup}, \\ \mathrm{minsup} & \text{otherwise,} \end{cases}$$

where 0 ≤ σ ≤ 1 and sup(X) denotes the support of item X in the dataset. If σ is set to zero, the support thresholds of all items are the same, and this case reduces to the uniform one.
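The three threshold settings described above can be sketched as small functions. This is a minimal sketch under stated assumptions: all function names are hypothetical, supports are relative frequencies in [0, 1], bins are formed by sorting items by support, and k is passed explicitly because the text does not spell out how k relates to the schema length r.

```python
from math import prod
from typing import Dict, List


def disclosure_threshold(sup_x: float, alpha: float) -> float:
    """st(X) = sup(X) * alpha; alpha = 0 asks for complete hiding."""
    return sup_x * alpha


def maximal_constraint_threshold(sup_x: float, sigma: float,
                                 minsup: float) -> float:
    """st(X) = sup(X) * sigma if that exceeds minsup, otherwise minsup."""
    scaled = sup_x * sigma
    return scaled if scaled > minsup else minsup


def equal_count_bins(item_supports: Dict[str, float],
                     n_bins: int) -> List[List[str]]:
    """Split items, sorted by ascending support, into bins of near-equal size."""
    items = sorted(item_supports, key=item_supports.get)
    size, extra = divmod(len(items), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)  # spread the remainder
        bins.append(items[start:end])
        start = end
    return bins


def support_constraint_threshold(schema: List[List[str]],
                                 item_supports: Dict[str, float],
                                 gamma: int, k: int) -> float:
    """theta_k = min(gamma^(k-1) * S(B_1) * ... * S(B_r), 1).

    `schema` lists the bins B_1..B_r; S(B_i) is the smallest item support
    in bin B_i. Every bin is assumed to be non-empty.
    """
    smallest = [min(item_supports[i] for i in b) for b in schema]
    return min(gamma ** (k - 1) * prod(smallest), 1)


if __name__ == "__main__":
    sup = {"a": 0.1, "b": 0.2, "c": 0.4, "d": 0.8}
    bins = equal_count_bins(sup, n_bins=2)
    print(bins)
    print(support_constraint_threshold(bins, sup, gamma=2, k=2))
    print(maximal_constraint_threshold(0.5, sigma=0.0, minsup=0.1))
```

With σ = 0 the maximal-constraint threshold always falls back to minsup, which matches the remark that this case reduces to the uniform one.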

We use two real datasets with different characteristics, accidents [19] and kosarak, to compare our method with IGA when applying the disclosure threshold. The accidents dataset was donated by Karolien Geurts and contains traffic accident data, and the kosarak dataset was provided by Ferenc Bodon and contains click-stream data from a Hungarian on-line news portal. On the other hand, considering the time complexity of mining with multiple sensitive thresholds, and without loss of generality, two smaller real datasets, chess and mushroom [20], are used to evaluate the performance of our hiding approach. These four datasets are commonly used for performance evaluation of various association rule mining algorithms. Their characteristics are shown in Table 3. For each original dataset, we first execute the Apriori algorithm to mine the supports of all items and use them to establish the settings of item support thresholds and support constraints. Next, according to the application, we apply the algorithms Apriori, Apriori-like [8], and Adaptive-Apriori [18] to mine the frequent patterns under the uniform threshold, the maximal constraint, and the support constraints, respectively. Subsequently, the sensitive patterns to be hidden are selected randomly to simulate application-oriented sensitive information.
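For reference, the uniform-threshold mining step can be sketched with a textbook Apriori. This is not the implementation used in the experiments; the function name `apriori` and the in-memory list-of-sets representation are assumptions, and the Apriori-like and Adaptive-Apriori variants for multiple thresholds are not shown.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(db: List[Iterable[str]],
            minsup: float) -> Dict[FrozenSet[str], float]:
    """Return {itemset: support} for all itemsets with support >= minsup."""
    transactions = [frozenset(t) for t in db]
    n = len(transactions)

    # Level 1: count the single items.
    counts: Dict[FrozenSet[str], int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # Count step: one pass over the database per level.
        frequent = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= minsup:
                frequent[c] = sup
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, sup in sorted(apriori(toy, 0.6).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)
```

The level-1 supports returned here are the item supports from which the item thresholds and support constraints would then be derived.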

Our testing environment consists of a 3.4 GHz Intel(R) Pentium(R) D processor with 1 GB of memory running the Windows XP operating system. All recorded execution times include the CPU time and the I/O time.

5.1 Metrics

Table 3: The characteristics of each dataset

Dataset     Transactions   Items
accidents        340,184     572
kosarak          990,002  41,270
chess              3,196      75
mushroom           8,124     119

Fig. 2: IL with disclosure thresholds. (a) accidents dataset; (b) kosarak dataset.

Fig. 3: IL with support constraints on the chess dataset.

Two measures, information loss and hiding failure, are adopted to evaluate the performance of our hiding strategies. Information loss (IL) is the percentage of non-sensitive patterns that are hidden in the sanitization process, as shown in the following equation:

$$IL = \frac{(|FP| - |P_s|) - (|FP'| - |P_s'|)}{|FP| - |P_s|}$$

where |FP| and |P_s| are the number of frequent patterns in the original database, D, and the number of sensitive patterns in D, respectively, and |FP'| and |P_s'| are the number of frequent patterns in the new database, D', and the number of sensitive patterns in D', respectively. Hiding failure (HF) is the percentage of sensitive patterns remaining in D' after sanitization, represented as follows:

$$HF = \frac{|P_s'|}{|P_s|}$$

Under the uniform support threshold, P_s contains all the supersets of any pattern in SP. However, under multiple support thresholds, a subset of a frequent pattern may not be frequent. Hence, P_s should contain only those supersets of any pattern in SP whose sensitive thresholds are larger than those of their sensitive subsets.
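The two metrics defined above reduce to simple arithmetic on four pattern counts. The sketch below assumes the counts have already been obtained by mining D and D'; the function and parameter names are hypothetical, and both functions return fractions rather than percentages.

```python
def information_loss(n_fp: int, n_ps: int, n_fp_new: int, n_ps_new: int) -> float:
    """IL: share of non-sensitive frequent patterns lost by sanitization.

    n_fp, n_ps         -- frequent / sensitive pattern counts in D
    n_fp_new, n_ps_new -- frequent / sensitive pattern counts in D'
    """
    before = n_fp - n_ps          # non-sensitive frequent patterns in D
    after = n_fp_new - n_ps_new   # non-sensitive frequent patterns in D'
    return (before - after) / before


def hiding_failure(n_ps: int, n_ps_new: int) -> float:
    """HF: share of sensitive patterns still frequent in D'."""
    return n_ps_new / n_ps


if __name__ == "__main__":
    # Toy counts: 100 frequent patterns, 10 of them sensitive; after
    # sanitization 81 frequent patterns remain and none is sensitive.
    print(information_loss(100, 10, 81, 0))
    print(hiding_failure(10, 0))
```

In the toy counts, 9 of the 90 non-sensitive patterns are lost, so IL is 0.1, and HF is 0 because no sensitive pattern survives.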

5.2 Performance

First, our framework is compared with IGA to evaluate its performance under the uniform support threshold. We use the disclosure thresholds 0.2 and 0.002 to hide sensitive patterns in the accidents and kosarak datasets, respectively. The sensitive patterns are randomly selected from the frequent patterns whose support is larger than 20% and 0.2%, respectively. Subsequently, we mine the frequent patterns in the new dataset with support thresholds from 10% to 90% and from 0.1% to 0.28%, respectively. The result of IL is shown in Fig. 2. The trend of the IL is highly related to the characteristics of the experimental dataset. We can observe that our method achieves better information preservation. Most HF values are zero; the exception is the accidents dataset at a support threshold of 10%, where the HF of IGA is 0.0057% and that of our method is 0.325%.

Fig. 4: IL under maximal constraints. (a) chess dataset; (b) mushroom dataset.

Fig. 5: The comparison with IGA on the kosarak dataset. (a) the efficiency; (b) the scalability.

To assess the capability of hiding with multiple thresholds, we evaluate performance under the support constraints and under the maximal constraints. Under the support constraints, we evaluate performance when hiding from 1 to 10 patterns. All the sensitive patterns are chosen randomly. We compare the results for γ = 15 and 20 with the number of bins fixed at 8, and for 8 and 10 bins with γ fixed. The result of IL is shown in Fig. 3. All HF values are zero. We can observe that the IL is not affected by the number of bins or by γ, but probably depends on the distribution of the dataset and the chosen sensitive patterns.

Finally, performance under the maximal constraints is measured. Again, we evaluate performance when hiding from 1 to 10 patterns, and all the sensitive patterns are chosen randomly. Different parameter settings, σ = 0.7 and 0.85 on the chess dataset and σ = 0.25 and 0.5 on the mushroom dataset, are examined. The result of IL is shown in Fig. 4(a) and 4(b). Since the supports of the chosen sensitive patterns of the chess dataset are higher and more items are deleted to hide such patterns, the IL is higher. All HF values are zero. We can observe that the IL increases as σ decreases and as the number of hidden sensitive patterns increases.

5.3 Efficiency and Scalability

We evaluate the efficiency and scalability of our method, compared with IGA, with respect to the size of the database and the number of sensitive patterns. The disclosure parameter of IGA is set to zero, and so is α in our method. The zero value means hiding completely.

We vary the size of the dataset from 100K to 900K while hiding six mutually exclusive frequent patterns with lengths 2 to 7. The result is illustrated in Fig. 5(a). Next, the number of hidden sensitive patterns is varied from 1 to 10. All the patterns are chosen randomly. The result is illustrated in Fig. 5(b). We can observe that the execution time is linear in the size of the database and in the number of sensitive patterns. Note that our method achieves scalability as good as that of IGA while attaining better information preservation and providing the additional capability of hiding with multiple sensitive thresholds.

6 Conclusions

In this paper, we introduce the concept of frequent pattern hiding under multiple sensitive thresholds. A new hiding strategy for multiple sensitive thresholds is proposed, which is more applicable in practical applications. Considering the properties of frequent patterns under multiple sensitive thresholds, we propose a revised border-based method to reduce redundant work in hiding, and we use an inverted file and a pattern index to speed up updates in our framework.

We empirically validated the performance, efficiency, and scalability of our method through a series of experiments. In all of these experiments, we took into account the uniform support threshold, multiple support thresholds with support constraints, and multiple support thresholds under maximal support constraints. The results of our experiments reveal that our method is effective and achieves significant improvement over IGA under uniform support thresholds. Furthermore, we can hide sensitive knowledge of the dataset with multiple sensitive thresholds.

References

1. Agrawal, R., Imieliński, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. Proc. of the ACM SIGMOD Int. Conf. on Management of Data (1993) 207–216

2. Clifton, C., Marks, D.: Security and Privacy Implication of Data Mining. ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (1996) 15–19

3. Chen, X., Orlowska, M., Li, X.: A New Framework of Privacy Preserving Data Sharing. Proc. of IEEE 4th Int. Workshop on Privacy and Security Aspects of Data Mining (2004) 47–56

4. Wang, E.T., Lee, G., Lin, Y.T.: A Novel Method for Protecting Sensitive Knowledge in Association Rules Mining. Proc. of the 29th Annual Int. COMPSAC 1 (2005) 511–516

5. Wu, Y.H., Chiang, C.M., Chen, A.L.P.: Hiding Sensitive Association Rules with Limited Side Effects. IEEE Transactions on Knowledge and Data Engineering, 19(1) (2007) 29–42

6. Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., Dasseni, E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4) (2004) 434–447

7. Xingzhi, S., Yu, P.S.: A Border-Based Approach for Hiding Sensitive Frequent Itemsets. Proc. of 5th IEEE Int. Conf. on Data Mining (2005) 426–433

8. Lee, Y.C., Hong, T.P., Lin, W.Y.: Mining Association Rules with Multiple Minimum Supports Using Maximum Constraints. Int. Journal of Approximate Reasoning on Data Mining and Granular Computing 40(1–2) (2005) 44–54

9. Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., Verykios, V.: Disclosure Limitation of Sensitive Rules. Proc. of the IEEE Knowledge and Data Exchange Workshop (1999) 45–52

10. Gkoulalas-Divanis, A., Verykios, V.S.: An Integer Programming Approach for Frequent Itemset Hiding. Proc. of Int. Conf. on Information and Knowledge Management (2006) 748–757

11. Mielikainen, T.: On Inverse Frequent Set Mining. Proc. of the 2nd IEEE ICDM Workshop on Privacy Preserving Data Mining (2003)

12. Wu, X., Wu, Y., Wang, Y., Li, Y.: Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining. Proc. 5th SIAM Int. Conf. on Data Mining (2005)

13. Guo, Y.: Reconstruction-Based Association Rule Hiding. Proc. of SIGMOD 2007 Ph.D. Workshop on Innovative Database Research (2007)

14. Liu, B., Hsu, W., Ma, Y.: Mining Association Rules with Multiple Minimum Supports. Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (1999) 337–341

15. Wang, K., Fung, B.C.M., Yu, P.S.: Template-Based Privacy Preservation in Classification Problems. Proc. - IEEE Int. Conf. on Data Mining (2005) 466–473

16. Dwork, C.: Ask a Better Question, Get a Better Answer: A New Approach to Private Data Analysis. Lecture Notes in Computer Science on Database Theory - ICDT 4353 (2007) 18–27

17. Oliveira, S.R.M., Zaïane, O.R.: A Unified Framework for Protecting Sensitive Association Rules in Business Collaboration. Int. J. of Business Intelligence and Data Mining 1(3) (2006) 247–287

18. Wang, K., He, Y., Han, J.: Pushing Support Constraints into Association Rules Mining. IEEE Transactions on Knowledge and Data Engineering 15(3) (2003) 642–658

19. Geurts, K., Wets, G., Brijs, T., Vanhoof, K.: Profiling High-Frequency Accident Locations Using Association Rules. Proc. of the 82nd Annual Transportation Research Board (2003) 18

20. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Dept. of Inf. and CS. (1998)
