Hiding Constrained Association Rules in Privacy Preserving Data Mining

(1)

Leon S.L. Wang (王學亮)

Department of Information Management National University of Kaohsiung

Kaohsiung, Taiwan 81148

Hiding Constrained Association Rules in Privacy Preserving Data Mining

(2)

Outline

• Privacy Preserving Data Mining

• Hiding Informative Association Rules

• Proposed Algorithms

• Numerical Experiments

• Analyses

(3)

What is Data Mining?

1

• Market basket analysis (Association Rules)

– “if a customer purchases diapers, then he will very likely purchase beer”

• Sequences (Sequential Patterns)

– “A customer who bought an iPod three months ago is likely to order an iPhone within one month”

(4)

What is Data Mining?

2

• Classification

Training Data

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no

Classification Algorithms -> Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

(5)

What is Data Mining?

3

Classifier

Testing Data

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no

Unseen Data

(Jeff, Professor, 4)

Tenured?

• Classification
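A minimal sketch (not code from the slides; the function name is hypothetical) of how the rule learned on slide (4), IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’, would be applied to the unseen record (Jeff, Professor, 4):

def predict_tenured(rank, years):
    # rule induced from the training table on slide (4)
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))        # Jeff -> yes
print(predict_tenured("Assistant Prof", 2))   # Tom  -> no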

(6)

What is Data Mining?

4

Clustering

• Unsupervised learning: Finds “natural” grouping of instances given unlabeled data

(7)

What is Data Mining?

5

• Data mining:

– Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names:

– Knowledge discovery in databases (KDD),

knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

(8)

Current Technology:

A KDD Process

– Data mining: the core of the knowledge discovery process
– Typical KDD steps: Data Cleaning -> Data Integration -> Data Warehouse -> Task-relevant Data Selection -> Data Mining -> Pattern Evaluation

(9)

Current Technology: Architecture

of a Typical Data Mining System

Data -> Data cleaning & data integration -> Database or data warehouse server (filtering) -> Data mining engine -> Pattern evaluation -> Graphical user interface

(10)

What is Privacy Preserving Data

Mining?

1

• A scenario:

– Retailer XYZ Supermarket Chain allows Supplier ABC Paper Company access to its customer DB
– ABC can predict inventory needs; XYZ is offered reduced prices

(11)

What is Privacy Preserving Data

Mining?

2

• Supplier ABC discovers (thru data mining):

– Sequence: Cold remedies -> Facial tissue
– Association: (Skim milk, Green paper)

• Supplier ABC runs coupon marketing campaign:

– “50 cents off skim milk when you buy ABC products”

• Results:

– Lower sales of Green paper for XYZ (Bad)

(12)

What is Privacy Preserving Data

Mining?

3

• Main objective

– To develop algorithms for modifying the

original data in some way, so that the private data and private knowledge remain private even after the mining process

(13)

What is Privacy Preserving Data

Mining?

4

• Different Concepts of Privacy!

– Output privacy

• Selectively modify individual values from a database to prevent the discovery of a set of association rules

(Dasseni et al., Information Hiding Workshop 2001)

– Input privacy

• Add noise to the data before delivery to the data miner

– Technique to reduce impact of noise on learning a decision tree (Agrawal and Srikant, SIGMOD)

• Two parties, each with a portion of the data

– Learn a decision tree without sharing data (Lindell and Pinkas, CRYPTO)

(14)

What is Privacy Preserving Data

Mining?

5

• Output privacy (diagram): the original database DO yields rules RO; after modification, the released database DM yields rules RM

• Input privacy (1) (diagram): the original database DO is modified into D’o before it is delivered to the data miner

(15)

What is Privacy Preserving Data

Mining?

6

• Input privacy (2) (diagram): parties holding partitions D1, D2, D3 jointly obtain the mining result RO without pooling their raw data

(16)

Possible Solutions

1

• Limiting access

– Control access to the data

– Used by secure DBMS community

• “Fuzz” the data

– Force aggregation into daily records instead of individual transactions, or slightly alter data values

(17)

Possible Solutions

2

• “Fuzz” the data

– k-anonymity, at least k tuples in one group
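As a rough sketch of what the k-anonymity condition checks (attribute names and generalized values are hypothetical, not from the slides):

from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    # group tuples by their quasi-identifier values and require every group to have >= k members
    groups = Counter(tuple(row[a] for a in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

rows = [{"age": "2*", "zip": "537**"}, {"age": "2*", "zip": "537**"},
        {"age": "3*", "zip": "537**"}, {"age": "3*", "zip": "537**"}]
print(is_k_anonymous(rows, ["age", "zip"], 2))   # True: each group holds at least 2 tuples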

(18)

Possible Solutions

3

• Eliminate unnecessary groupings

– The first 3 digits of SSNs are assigned by office sequentially

– Clustering high-order bits of a “unique identifier” is likely to group similar data elements

– Unique identifiers are assigned randomly

• Augment the data

– Populate the phone book with extra, fictitious people in non-obvious ways

– Return correct information for a query about one individual, but incorrect information for queries over everyone

(19)

Possible Solutions

4

• Audit

– Detect misuse by legitimate users

– Administrative or criminal disciplinary action may be initiated

(20)

Current Techniques

1

• Data sources

– Centralized data – Distributed data

• Horizontally distributed or vertically distributed

• Data modification

– Perturbation (changing a 1 to a 0, or adding noise)

– Blocking (replaced by “?”)

– Aggregation or merging values

– Swapping (interchanging values of individual records)
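A tiny illustrative sketch (not from the slides) of two of these operators on a 0/1 transaction matrix: perturbation changes a 1 to a 0, blocking replaces a value with "?":

def perturb(matrix, row, col):
    matrix[row][col] = 0          # perturbation: flip the bit to 0
    return matrix

def block(matrix, row, col):
    matrix[row][col] = "?"        # blocking: hide the value entirely
    return matrix

D = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
print(perturb([r[:] for r in D], 0, 2))   # [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
print(block([r[:] for r in D], 0, 2))     # [[1, 1, '?'], [1, 1, 0], [1, 0, 1]]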

(21)

Current Techniques

2

• Data mining algorithm

– To find commonalities among algorithms, in order to develop strategies to defeat a wide variety of data mining tools

• Data or rule hiding

– Hiding raw data or aggregated data (or rule)

• Privacy preservation (selective modification of data)

– Heuristic-based techniques

– Cryptography-based techniques – Reconstruction-based techniques

(22)

Current Research Areas

1

• Privacy-Preserving Data Publishing

– K-anonymity

• Try to prevent re-identification of individuals

• Privacy-Preserving Application

– Association rules hiding

• Utility-based Privacy-Preserving

• Distributed Privacy with Adversarial Collaboration

(23)

Current Research Areas

2

(24)

Current Research Areas

3

(25)

Current Research Areas

4

• Utility-based Privacy-Preserving

– Q1:”How many customers under age 29 are there in the data set?”

– Q2: “Is an individual with age=25, education= Bachelor, Zip Code = 53712 a target customer?”

– Table 2, answers: “2”; “Y”

(26)

Problem Description

1

• Association rule mining

• Input: DO, min_supp, min_conf

• Output: RO

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%
|A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3

RO:
1   B=>A   (66%, 100%)
2   C=>A   (66%, 100%)
3   B=>C   (50%, 75%)
4   C=>B   (50%, 75%)
5   AB=>C  (50%, 75%)
6   AC=>B  (50%, 75%)
7   BC=>A  (50%, 100%)
8   C=>AB  (50%, 75%)
9   B=>AC  (50%, 75%)
10  A=>B   (66%, 66%)   <- not an AR (conf below min_conf)
11  A=>C   (66%, 66%)   <- not an AR
12  A=>BC  (50%, 50%)   <- not an AR
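For readers who want to recheck these numbers, a minimal Python sketch (variable and function names are illustrative) that recomputes support and confidence over this toy database:

DO = {"T1": {"A", "B", "C"}, "T2": {"A", "B", "C"}, "T3": {"A", "B", "C"},
      "T4": {"A", "B"}, "T5": {"A"}, "T6": {"A", "C"}}

def supp(itemset):
    # fraction of transactions containing every item of itemset
    return sum(itemset <= t for t in DO.values()) / len(DO)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

print(round(supp({"A", "B"}), 2), round(conf({"B"}, {"A"}), 2))   # 0.67 1.0  -> rule #1 B=>A
print(round(supp({"A", "B"}), 2), round(conf({"A"}, {"B"}), 2))   # 0.67 0.67 -> rule #10 A=>B, below min_conf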

(27)

Problem Description

2

• Informative Association rules

• Input: DO, min_supp = 33%, min_conf = 70%, X = {C}
• Output: {C=>A (66%, 100%), C=>B (50%, 75%)}
• Same DO and rule list (#1-#11) as on the previous slide; rules #6, #7 and #8 predict the same RHS items {A, B} as rules #2 and #4, so the informative rule set for X = {C} is just {C=>A, C=>B}

(28)

Problem Description

2

(LHS)

• Input: DO, X (items to be hidden on LHS), min_supp, min_conf
• Output: DM

X = {C}, min_supp = 33%, min_conf = 70%
DM: item B is removed from T1 (ABC -> AC); other transactions unchanged
|A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

1   B=>A   (50%, 100%)
2   C=>A   (66%, 100%)   <- still an AR: try next
3   B=>C   (33%, 66%)    <- lost
4   C=>B   (33%, 50%)    <- hidden
5   AB=>C  (33%, 66%)    <- lost
6   AC=>B  (33%, 50%)    <- hidden
7   BC=>A  (33%, 100%)   <- still an AR: try next
8   C=>AB  (33%, 50%)    <- hidden
9   B=>AC  (33%, 66%)    <- lost
10  A=>B   (50%, 50%)
11  A=>C   (50%, 66%)
12  A=>BC  (33%, 33%)

(29)

Side effects

• Hiding failure, lost rules, new rules

[Diagram: R = rules mined from the original database, Rh = rules to be hidden, ~R = rules mined from the modified database; the side effects are ① hiding failures, ② lost rules, ③ new rules]
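One way to phrase the three side effects as set operations (a sketch; the rule sets R, Rh and Rm below are illustrative, with Rm standing for the rules mined from the modified database):

def side_effects(R, Rh, Rm):
    hiding_failure = Rh & Rm          # sensitive rules that can still be mined
    lost_rules = (R - Rh) - Rm        # non-sensitive rules that disappeared
    new_rules = Rm - R                # spurious rules introduced by the modification
    return hiding_failure, lost_rules, new_rules

R = {"C=>A", "C=>B", "B=>A", "B=>C"}
Rh = {"C=>A", "C=>B"}
Rm = {"B=>A", "C=>B"}
print(side_effects(R, Rh, Rm))        # ({'C=>B'}, {'B=>C'}, set())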

(30)

Proposed Algorithms

1

• Strategy:

– To lower the confidence of a given rule X => Y,

– either

• Increase the support of X, but not XY, OR

• Decrease the support of XY (or both X and XY)

Conf(X => Y) = support(XY) / support(X)
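A quick numeric illustration of both strategies for rule C => A in the toy database DO (supp(C) = 4/6, supp(CA) = 4/6; the "raise supp(C) to 5/6" case is hypothetical, since every transaction in DO already contains A):

def conf(supp_xy, supp_x):
    return round(supp_xy / supp_x, 2)

print(conf(4/6, 4/6))   # 1.0  original confidence of C => A
print(conf(3/6, 4/6))   # 0.75 after decreasing support of CA (delete A from one CA transaction)
print(conf(4/6, 5/6))   # 0.8  after increasing support of C only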

(31)

Proposed Algorithms

2

• Multiple scans of database

– Increase Support of LHS First (ISL) – Decrease Support of RHS First (DSR)

• One scan of database

– Decrease Support and Confidence (DSC)

(32)

Proposed Algorithms

3

• Multiple-scan algorithm ISL:

Find large 1-itemsets from D;
For each predicting item x ∈ X
  If x is not a large 1-itemset, then X := X - {x};
If X is empty, then EXIT;   // no rule contains X in LHS
Find large 2-itemsets from D;
For each x ∈ X {
  For each large 2-itemset containing x {
    Compute confidence of rule U, where U is a rule like x => y;
    If conf(U) < min_conf, then
      Go to next large 2-itemset;
    Else {   // Increase Support of LHS
      Find TL = { t in D | t does not support U };
      Sort TL in ascending order by the number of items;
      While ( conf(U) >= min_conf and TL is not empty ) {
        Choose the first transaction t from TL;
        Modify t to support x, the LHS(U);
        Compute support and confidence of U;

(33)

Proposed Algorithms

4

• Multiple-scan algorithm ISL: (con’t)

      } // end while
      If TL is empty, then {
        Cannot hide x => y;
        Restore D;
        Go to next large 2-itemset;
      } // end if TL is empty
    } // end else (conf(U) >= min_conf)
  } // end of for each large 2-itemset
  Remove x from X;
} // end of for each x ∈ X
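Below is a hedged Python sketch of the core ISL step on the toy database (the function name and the set-of-items layout are illustrative, not the paper's code): it keeps adding the LHS item x to transactions that do not support the rule until conf(x => y) falls below min_conf.

def supp(D, items):
    return sum(items <= t for t in D.values()) / len(D)

def isl_hide(D, x, y, min_conf):
    while supp(D, {x, y}) / supp(D, {x}) >= min_conf:
        # TL: transactions that do not support the rule x => y, shortest first
        TL = sorted((tid for tid, t in D.items() if not {x, y} <= t),
                    key=lambda tid: len(D[tid]))
        if not TL:
            return False              # cannot hide x => y (the real ISL also restores D)
        D[TL[0]].add(x)               # modify t to support the LHS
    return True

D = {"T1": {"A", "B", "C"}, "T2": {"A", "B", "C"}, "T3": {"A", "B", "C"},
     "T4": {"A", "B"}, "T5": {"A"}, "T6": {"A", "C"}}
print(isl_hide(D, "C", "B", 0.70))    # True: conf(C => B) drops from 75% to 60%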

(34)

Proposed Algorithms

5

• One-scan algorithm DSC:

1  Initialize a PI-tree and an item frequency list L;
2  For each transaction t in D              // Build initial PI-tree
3    Sort t according to item name (or item #);
4    Insert t into PI-tree;
5    Update L with items in t;
6  For each path p’ from the root to a leaf // Restructure the initial PI-tree
7    Set support count of each node of p’ to common support count;
8    Sort p’ according to L;
9    Insert p’ into a new PI-tree;
10 For each x ∈ X and x is a frequent item {    // Sanitize all rules of the form U: x => y
11   For each large 2-itemset containing x {    // e.g. {xy} and rule U: x => y
12     If confidence(U) >= min_conf, then {
         // Calculate the number of transactions to sanitize
         // # of transactions required to lower the confidence

(35)

Proposed Algorithms

6

• One-scan algorithm DSC: (con’t)

13       iterNumConf = |D| * (supp(xy) - min_conf * supp(x));
         // # of transactions required to lower the support
14       iterNumSupp = |D| * (supp(xy) - min_supp);
15       k = minimum(iterNumConf, iterNumSupp);
16       Sanitize item y in the shortest k transactions containing xy by updating the PI-tree, the frequency list L, and D;
17     }; // end if
18   }; // end for each large 2-itemset
19   Remove x from X;
20 }; // end for each x ∈ X
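As a rough illustration of lines 13-15, this sketch (the helper name and the rounding-up step are added assumptions, so that at least one whole transaction is changed) computes k for rule C => B in the toy database with |D| = 6, min_supp = 33%, min_conf = 70%:

import math

def transactions_to_sanitize(num_trans, supp_xy, supp_x, min_supp, min_conf):
    iter_num_conf = num_trans * (supp_xy - min_conf * supp_x)   # line 13
    iter_num_supp = num_trans * (supp_xy - min_supp)            # line 14
    return math.ceil(min(iter_num_conf, iter_num_supp))         # line 15, rounded up

# C => B: supp(CB) = 3/6, supp(C) = 4/6
print(transactions_to_sanitize(6, 3/6, 4/6, 0.33, 0.70))        # 1 transaction is enough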

(36)

Proposed Algorithms

7

• One-scan algorithm DSC

– Pattern-inversion tree (prefix)

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

PI-tree:
Root
  A:6:[T5]
    B:4:[T4]
      C:3:[T1,T2,T3]
    C:1:[T6]
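A minimal Python sketch of one plausible PI-tree representation (class and field names are assumptions, not the paper's data structure): each node stores an item, a support count, and the TIDs of transactions whose sorted item list ends at that node, which reproduces the counts shown above.

class PITreeNode:
    def __init__(self, item):
        self.item = item
        self.count = 0            # support count of the prefix ending at this node
        self.tids = []            # transactions that end exactly here
        self.children = {}        # item -> PITreeNode

def insert(root, sorted_items, tid):
    node = root
    for item in sorted_items:
        node = node.children.setdefault(item, PITreeNode(item))
        node.count += 1
    node.tids.append(tid)

root = PITreeNode(None)
for tid, items in [("T1", "ABC"), ("T2", "ABC"), ("T3", "ABC"),
                   ("T4", "AB"), ("T5", "A"), ("T6", "AC")]:
    insert(root, sorted(items), tid)

a = root.children["A"]
print(a.count, a.tids)                                # 6 ['T5']
print(a.children["B"].count, a.children["B"].tids)    # 4 ['T4']
print(a.children["B"].children["C"].tids)             # ['T1', 'T2', 'T3']
print(a.children["C"].count, a.children["C"].tids)    # 1 ['T6']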

(37)

Example

• Databases before and after sanitization

TID  D    D1
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001

(38)

Numerical Experiments

1

• Figure 1 Number of Hidden and Total Rules (ISL/DSR)

[Chart: Total Rules, Hidden Rules and percentage vs. number of transactions (5k, 10k, 15k)]

(39)

Numerical Experiments

2

[Chart: DSR Time Effects, Time in Seconds vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(40)

Numerical Experiments

3

• Figure 3 Side Effects of DSR with 2 Predicting Items

[Chart: DSR Side Effects, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, Hiding Failures, New Rules, Lost Rules]

(41)

Numerical Experiments

4

[Chart: DSR Altered Transactions, Percentage vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(42)

Numerical Experiments

5

• Figure 5 Time Effects of ISL

[Chart: ISL Time Effects, Time in Seconds vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(43)

Numerical Experiments

6

[Chart: ISL Side Effects, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, Hiding Failures, New Rules, Lost Rules]

(44)

Numerical Experiments

7

• Figure 7 Database Effects of ISL

[Chart: ISL Altered Transactions, Percentage vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(45)

Numerical Experiments

8

[Chart: Total Rules (DSR, DSC) and Hidden Rule % (DSR, DSC) vs. number of transactions (5k, 10k, 15k)]

(46)

Numerical Experiments

9

• Figure 9 Two-item Time Effects of DSR and DSC

[Chart: Time Effects - Two Items, Time in Seconds vs. Number of Transactions (5k, 10k, 15k); series: DSR, DSC]

(47)

Numerical Experiments

10

[Chart: DSR Side Effects - Two Items, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, New Rules-DSR, Lost Rules-DSR, Hiding Failure-DSR]

(48)

Numerical Experiments

11

• Figure 11 Two-item Side Effects of DSC

[Chart: DSC Side Effects - Two Items, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, New Rules-DSC, Lost Rules-DSC, Hiding Failure-DSC]

(49)

Analysis

1

• For multiple-scan algorithms:
• Time effects

– DSR faster than ISL

• Because the set of candidate transactions is smaller

• Side effects

– DSR: no hiding failure (0%), few new rules (5%) and some lost rules (11%)

– ISL: shows some hiding failure (12.9%), many new rules (33%) and no lost rule (0%)

(50)

Analysis

2

• Database effects

– DSR: 14% and 25% of transactions are altered for one and two predicting items respectively

– ISL: 59% and 128% of transactions are

altered for one and two predicting items respectively

– In ISL, some transactions are modified more than once

(51)

Analysis

3

• Item effects: different hiding orders & algorithms => different transformed databases

TID  D    D2   D4   D5   D6
T1   111  101  110  110  101
T2   111  111  111  111  111
T3   111  111  111  111  111
T4   110  110  110  110  110
T5   100  110  101  100  100
T6   101  101  101  101  101

(D2: ISL, hide C then B;  D4: ISL, hide B then C;  D5: DSR, hide C then B;  D6: DSR, hide B then C)

(52)

Analysis

4

• Fewer database scans and more rules pruned

– *: from calculating the large itemsets at different levels
– #: rules are not checked for pruning after each transaction update

        DB Scans*   Rules Pruned#
ISL     3           2

(53)

Analysis

5

• For single-scan algorithm:

– DSC is O(2|D| + |X|*l2*K*logK)

• where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in DSC algorithm.

– SWA is O((n1-1)*n1/2*|D|*Kw)

• where n1 is the initial number of restrictive rules in the database D and Kw is the window size chosen.

– SWA has a higher order of complexity

(54)

Discussions

1

• Study the problem of hiding informative association rule sets

• In phase one, propose two multiple-scan algorithms ISL & DSR, based on reducing the confidence of association rules

• Automatically hide informative rule sets without pre-mining and selection of a class of rules

• Analyze the characteristics of the proposed algorithms and compare them with the Dasseni et al. algorithms

(55)

Discussions

2

• In phase two, propose a one-scan algorithm, DSC, to improve the time efficiency by using Pattern-Inversion tree

• Numerical comparison with ISL/DSR: better time efficiency, with similar side effects
• Analytical comparison with SWA: lower computational complexity

(56)

Discussions

3

• Maintenance of hiding informative association rule sets:

TID  D    D'
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001

New data set ∆+:
TID  Items
T7   101
T8   101
T9   110

D+ = D + ∆+,  to be sanitized into D+'

(57)

Discussions

4

• Maintenance of hiding informative association

rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101

(58)

Discussions

5

• Maintenance of hiding informative association rule sets

– Centralized databases (one table)

– Distributed databases (partitioned table)

• Horizontally partitioned • Vertically partitioned

(59)

Discussions

6

• Distributed databases

– Horizontally partitioned

• Grocery shopping data collected by different supermarkets

• Credit card databases of two different credit unions
• “Fraudulent customers often have similar transaction histories, etc.”

(60)

Discussions

7

• Distributed Hiding of Association Rules

Site S1:  T11 ABC, T12 ABC, T13 ABC, T14 AB, T15 A, T16 AC
Site S2:  T21 ABC, T22 AB, T23 B, T24 AB, T25 A
Site S3:  T31 ABC, T32 ABC, T33 ABC, T34 AB, T35 A, T36 A

Hide C => A (44%, 100%), min_supp = 30%

(61)

Discussions

7

• Distributed databases

– Vertically partitioned

• “Cell phones with Li/Ion batteries lead to brain tumors in diabetics”

(62)

Discussions

8

• Multi-relational databases

– Multi-dimensional association rules

(63)

Discussions

9

• Multi-dimensional association rules

– Example: Store of material for hiking trips

– Customers(cno, name, rating, age, occupation, city)

– Items(ino, item, name, price)

– Buys(cno, ino, date, qty, total)

• 1. Single-dimensional association rule

• Buys(c, “Ski pants”) -> buys(c, “Sunglasses”)

• 2. Multidimensional association rule

(64)

Discussions

10

• Multi-dimensional association rules

– Store of material for hiking trips

– Customers(cno, name, rating, age, occupation,city)

– Items(ino, item, name, price)

– Buys(cno, ino, date, qty, total)

• 3. Hybrid-dimensional association rule

• Buys(c, “Glove”) and occupation(c,

(65)

Discussions

11

• Multi-relational association rules

Item <-> Atom,

e.g., likes(joni, icecream), has(joni, piglet), likes(elliot, piglet), has(elliot, icecream), prefers(joni, icecream, pudding)

(66)

Discussions

12

Transaction <-> Example Ei

e.g., Example joni = {7 atoms in Fig. 2}, elliot={6 atoms}

TID <-> Example Key, ExKey, e.g., KID has values joni, elliot

Cover: atomset X covers example Ei if X is a subset of Ei

Example joni

(67)

Discussions

13

• Support & Confidence

of MRAR

– DB w/ 4 examples:
  {p(1,a), q(1,b)}
  {p(2,a), q(2,b)}
  {p(3,a), p(3,d), q(3,b)}
  {p(4,a)}
– Atomsets:
  X = {p(k,x)}, Y = {q(k,y)}, X U Y = {p(k,x), q(k,y)}
– Support of X U Y = 3/4 = 0.75
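A small Python sketch of this support computation (the dictionary layout and the omission of the example key k are simplifications of the slide's notation):

examples = {
    1: {("p", "a"), ("q", "b")},
    2: {("p", "a"), ("q", "b")},
    3: {("p", "a"), ("p", "d"), ("q", "b")},
    4: {("p", "a")},
}

def support(atomset):
    # an atomset covers an example if it is a subset of that example's atoms
    covered = sum(atomset <= ex for ex in examples.values())
    return covered / len(examples)

X, Y = {("p", "a")}, {("q", "b")}
print(support(X | Y))                   # 0.75, as on the slide
print(support(X | Y) / support(X))      # 0.75, the confidence of X => Y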

(68)

Discussions

14

• AprioriRel, extension of Apriori

– Association rule from one table

• likes(KID, piglet), likes(KID, ice cream) =>

likes(KID, dolphin); (c:85, s:9)

– Association rule from three tables

• likes(KID, A), has(KID, B) => prefers(KID, A,

B); (c:98, s:70)

• Hiding MRAR

– Hide user specified MRARs by minimally

(69)

Discussions

11

• So far, all hiding algorithms are heuristic

• Computational complexity

– NP-complete? Yes, for one table, hiding

selected rules

– How about hiding rule sets?

• Reversibility

– Some blocking algorithms are reversible – How about perturbation algorithms?

(70)

References

• Some websites

– Privacy-Preserving Data Mining

• http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html
• http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html
• http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true
– Kdnuggets: www.kdnuggets.com
