Hiding Constrained Association Rules in Privacy Preserving Data Mining

(1)

Leon S.L. Wang (王學亮)

Department of Information Management National University of Kaohsiung

Kaohsiung, Taiwan 81148

Hiding Constrained Association Rules in Privacy Preserving Data Mining

(2)

Outline

• Privacy Preserving Data Mining

• Hiding Informative Association Rules

• Proposed Algorithms

• Numerical Experiments

• Analyses

(3)

What is Data Mining?

1

• Market basket analysis (Association Rules)

– “if a customer purchases diapers, then he will very likely purchase beer”

• Sequences (Sequential Patterns)

– “A customer who bought an iPod three months ago is likely to order an iPhone within one month”

(4)

What is Data Mining?

2

• Classification

Training Data

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no

Classification Algorithms -> Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

(5)

What is Data Mining?

3

Classifier

Testing Data

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no

Unseen Data

(Jeff, Professor, 4)

Tenured?

• Classification
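A minimal sketch (not code from the slides; the function name is hypothetical) of how the rule learned on slide (4), IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’, would be applied to the unseen record (Jeff, Professor, 4):

def predict_tenured(rank, years):
    # rule induced from the training table on slide (4)
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))        # Jeff -> yes
print(predict_tenured("Assistant Prof", 2))   # Tom  -> no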

(6)

What is Data Mining?

4

Clustering

• Unsupervised learning: Finds “natural” grouping of instances given unlabeled data

(7)

What is Data Mining?

5

• Data mining:

– Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names:

– Knowledge discovery in databases (KDD),

knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

(8)

Current Technology:

A KDD Process

– Data mining: the core of the knowledge discovery process
– Typical KDD steps: Data Cleaning -> Data Integration -> Data Warehouse -> Task-relevant Data Selection -> Data Mining -> Pattern Evaluation

(9)

Current Technology: Architecture

of a Typical Data Mining System

Data -> Data cleaning & data integration -> Database or data warehouse server (filtering) -> Data mining engine -> Pattern evaluation -> Graphical user interface

(10)

What is Privacy Preserving Data

Mining?

1

• A scenario:

– Retailer XYZ Supermarket Chain allows Supplier ABC Paper Company access to its customer DB
– ABC can predict inventory needs; XYZ is offered reduced prices

(11)

What is Privacy Preserving Data

Mining?

2

• Supplier ABC discovers (thru data mining):

– Sequence: Cold remedies -> Facial tissue
– Association: (Skim milk, Green paper)

• Supplier ABC runs coupon marketing campaign:

– “50 cents off skim milk when you buy ABC products”

• Results:

– Lower sales of Green paper for XYZ (Bad)

(12)

What is Privacy Preserving Data

Mining?

3

• Main objective

– To develop algorithms for modifying the

original data in some way, so that the private data and private knowledge remain private even after the mining process

(13)

What is Privacy Preserving Data

Mining?

4

• Different Concepts of Privacy!

– Output privacy

• Selectively modify individual values from a database to prevent the discovery of a set of association rules

(Dasseni et al., Information Hiding Workshop 2001)

– Input privacy

• Add noise to the data before delivery to the data miner

– Technique to reduce impact of noise on learning a decision tree (Agrawal and Srikant, SIGMOD)

• Two parties, each with a portion of the data

– Learn a decision tree without sharing data (Lindell and Pinkas, CRYPTO)

(14)

What is Privacy Preserving Data

Mining?

5

• Output privacy (diagram): the original database DO yields rules RO; after modification, the released database DM yields rules RM

• Input privacy (1) (diagram): the original database DO is modified into D’o before it is delivered to the data miner

(15)

What is Privacy Preserving Data

Mining?

6

• Input privacy (2) (diagram): parties holding partitions D1, D2, D3 jointly obtain the mining result RO without pooling their raw data

(16)

Possible Solutions

1

• Limiting access

– Control access to the data

– Used by secure DBMS community

• “Fuzz” the data

– Force aggregation into daily records instead of individual transactions, or slightly alter data values

(17)

Possible Solutions

2

• “Fuzz” the data

– k-anonymity, at least k tuples in one group
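As a rough sketch of what the k-anonymity condition checks (attribute names and generalized values are hypothetical, not from the slides):

from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    # group tuples by their quasi-identifier values and require every group to have >= k members
    groups = Counter(tuple(row[a] for a in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

rows = [{"age": "2*", "zip": "537**"}, {"age": "2*", "zip": "537**"},
        {"age": "3*", "zip": "537**"}, {"age": "3*", "zip": "537**"}]
print(is_k_anonymous(rows, ["age", "zip"], 2))   # True: each group holds at least 2 tuples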

(18)

Possible Solutions

3

• Eliminate unnecessary groupings

– The first 3 digits of SSNs are assigned by office sequentially

– Clustering high-order bits of a “unique identifier” is likely to group similar data elements

– Unique identifiers are assigned randomly

• Augment the data

– Populate the phone book with extra, fictitious people in non-obvious ways

– Return correct information for a query about one individual, but incorrect information for queries over everyone

(19)

Possible Solutions

4

• Audit

– Detect misuse by legitimate users

– Administrative or criminal disciplinary action may be initiated

(20)

Current Techniques

1

• Data sources

– Centralized data – Distributed data

• Horizontally distributed or vertically distributed

• Data modification

– Perturbation (changing a 1 to a 0, or adding noise)

– Blocking (replaced by “?”)

– Aggregation or merging values

– Swapping (interchanging values of individual records)
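A tiny illustrative sketch (not from the slides) of two of these operators on a 0/1 transaction matrix: perturbation changes a 1 to a 0, blocking replaces a value with "?":

def perturb(matrix, row, col):
    matrix[row][col] = 0          # perturbation: flip the bit to 0
    return matrix

def block(matrix, row, col):
    matrix[row][col] = "?"        # blocking: hide the value entirely
    return matrix

D = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
print(perturb([r[:] for r in D], 0, 2))   # [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
print(block([r[:] for r in D], 0, 2))     # [[1, 1, '?'], [1, 1, 0], [1, 0, 1]]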

(21)

Current Techniques

2

• Data mining algorithm

– To find commonalities among algorithms, in order to develop strategies to defeat a wide variety of data mining tools

• Data or rule hiding

– Hiding raw data or aggregated data (or rule)

• Privacy preservation (selective modification of data)

– Heuristic-based techniques

– Cryptography-based techniques – Reconstruction-based techniques

(22)

Current Research Areas

1

• Privacy-Preserving Data Publishing

– K-anonymity

• Try to prevent re-identification of individuals

• Privacy-Preserving Application

– Association rules hiding

• Utility-based Privacy-Preserving

• Distributed Privacy with Adversarial Collaboration

(23)

Current Research Areas

2

(24)

Current Research Areas

3

(25)

Current Research Areas

4

• Utility-based Privacy-Preserving

– Q1:”How many customers under age 29 are there in the data set?”

– Q2: “Is an individual with age=25, education= Bachelor, Zip Code = 53712 a target customer?”

– Table 2, answers: “2”; “Y”

(26)

Problem Description

1

• Association rule mining

• Input: DO, min_supp, min_conf

• Output: RO

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%
|A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3

RO:
1   B=>A   (66%, 100%)
2   C=>A   (66%, 100%)
3   B=>C   (50%, 75%)
4   C=>B   (50%, 75%)
5   AB=>C  (50%, 75%)
6   AC=>B  (50%, 75%)
7   BC=>A  (50%, 100%)
8   C=>AB  (50%, 75%)
9   B=>AC  (50%, 75%)
10  A=>B   (66%, 66%)   <- not an AR (conf below min_conf)
11  A=>C   (66%, 66%)   <- not an AR
12  A=>BC  (50%, 50%)   <- not an AR
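For readers who want to recheck these numbers, a minimal Python sketch (variable and function names are illustrative) that recomputes support and confidence over this toy database:

DO = {"T1": {"A", "B", "C"}, "T2": {"A", "B", "C"}, "T3": {"A", "B", "C"},
      "T4": {"A", "B"}, "T5": {"A"}, "T6": {"A", "C"}}

def supp(itemset):
    # fraction of transactions containing every item of itemset
    return sum(itemset <= t for t in DO.values()) / len(DO)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

print(round(supp({"A", "B"}), 2), round(conf({"B"}, {"A"}), 2))   # 0.67 1.0  -> rule #1 B=>A
print(round(supp({"A", "B"}), 2), round(conf({"A"}, {"B"}), 2))   # 0.67 0.67 -> rule #10 A=>B, below min_conf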

(27)

Problem Description

2

• Informative Association rules

• Input: DO, min_supp = 33%, min_conf = 70%, X = {C}
• Output: {C=>A (66%, 100%), C=>B (50%, 75%)}
• Same DO and rule list (#1-#11) as on the previous slide; rules #6, #7 and #8 predict the same RHS items {A, B} as rules #2 and #4, so the informative rule set for X = {C} is just {C=>A, C=>B}

(28)

Problem Description

2

(LHS)

• Input: DO, X (items to be hidden on LHS), min_supp, min_conf
• Output: DM

X = {C}, min_supp = 33%, min_conf = 70%
DM: item B is removed from T1 (ABC -> AC); other transactions unchanged
|A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

1   B=>A   (50%, 100%)
2   C=>A   (66%, 100%)   <- still an AR: try next
3   B=>C   (33%, 66%)    <- lost
4   C=>B   (33%, 50%)    <- hidden
5   AB=>C  (33%, 66%)    <- lost
6   AC=>B  (33%, 50%)    <- hidden
7   BC=>A  (33%, 100%)   <- still an AR: try next
8   C=>AB  (33%, 50%)    <- hidden
9   B=>AC  (33%, 66%)    <- lost
10  A=>B   (50%, 50%)
11  A=>C   (50%, 66%)
12  A=>BC  (33%, 33%)

(29)

Side effects

• Hiding failure, lost rules, new rules

[Diagram: R = rules mined from the original database, Rh = rules to be hidden, ~R = rules mined from the modified database; the side effects are ① hiding failures, ② lost rules, ③ new rules]
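One way to phrase the three side effects as set operations (a sketch; the rule sets R, Rh and Rm below are illustrative, with Rm standing for the rules mined from the modified database):

def side_effects(R, Rh, Rm):
    hiding_failure = Rh & Rm          # sensitive rules that can still be mined
    lost_rules = (R - Rh) - Rm        # non-sensitive rules that disappeared
    new_rules = Rm - R                # spurious rules introduced by the modification
    return hiding_failure, lost_rules, new_rules

R = {"C=>A", "C=>B", "B=>A", "B=>C"}
Rh = {"C=>A", "C=>B"}
Rm = {"B=>A", "C=>B"}
print(side_effects(R, Rh, Rm))        # ({'C=>B'}, {'B=>C'}, set())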

(30)

Proposed Algorithms

1

• Strategy:

– To lower the confidence of a given rule X => Y,

– either

• Increase the support of X, but not XY, OR

• Decrease the support of XY (or both X and XY)

Conf(X => Y) = support(XY) / support(X)
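A quick numeric illustration of both strategies for rule C => A in the toy database DO (supp(C) = 4/6, supp(CA) = 4/6; the "raise supp(C) to 5/6" case is hypothetical, since every transaction in DO already contains A):

def conf(supp_xy, supp_x):
    return round(supp_xy / supp_x, 2)

print(conf(4/6, 4/6))   # 1.0  original confidence of C => A
print(conf(3/6, 4/6))   # 0.75 after decreasing support of CA (delete A from one CA transaction)
print(conf(4/6, 5/6))   # 0.8  after increasing support of C only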

(31)

Proposed Algorithms

2

• Multiple scans of database

– Increase Support of LHS First (ISL) – Decrease Support of RHS First (DSR)

• One scan of database

– Decrease Support and Confidence (DSC)

(32)

Proposed Algorithms

3

• Multiple-scan algorithm ISL:

Find large 1-itemsets from D;
For each predicting item x ∈ X
  If x is not a large 1-itemset, then X := X - {x};
If X is empty, then EXIT;   // no rule contains X in LHS
Find large 2-itemsets from D;
For each x ∈ X {
  For each large 2-itemset containing x {
    Compute confidence of rule U, where U is a rule like x => y;
    If conf(U) < min_conf, then
      Go to next large 2-itemset;
    Else {   // Increase Support of LHS
      Find TL = { t in D | t does not support U };
      Sort TL in ascending order by the number of items;
      While ( conf(U) >= min_conf and TL is not empty ) {
        Choose the first transaction t from TL;
        Modify t to support x, the LHS(U);
        Compute support and confidence of U;

(33)

Proposed Algorithms

4

• Multiple-scan algorithm ISL: (con’t)

      } // end while
      If TL is empty, then {
        Cannot hide x => y;
        Restore D;
        Go to next large 2-itemset;
      } // end if TL is empty
    } // end else (conf(U) >= min_conf)
  } // end of for each large 2-itemset
  Remove x from X;
} // end of for each x ∈ X
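Below is a hedged Python sketch of the core ISL step on the toy database (the function name and the set-of-items layout are illustrative, not the paper's code): it keeps adding the LHS item x to transactions that do not support the rule until conf(x => y) falls below min_conf.

def supp(D, items):
    return sum(items <= t for t in D.values()) / len(D)

def isl_hide(D, x, y, min_conf):
    while supp(D, {x, y}) / supp(D, {x}) >= min_conf:
        # TL: transactions that do not support the rule x => y, shortest first
        TL = sorted((tid for tid, t in D.items() if not {x, y} <= t),
                    key=lambda tid: len(D[tid]))
        if not TL:
            return False              # cannot hide x => y (the real ISL also restores D)
        D[TL[0]].add(x)               # modify t to support the LHS
    return True

D = {"T1": {"A", "B", "C"}, "T2": {"A", "B", "C"}, "T3": {"A", "B", "C"},
     "T4": {"A", "B"}, "T5": {"A"}, "T6": {"A", "C"}}
print(isl_hide(D, "C", "B", 0.70))    # True: conf(C => B) drops from 75% to 60%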

(34)

Proposed Algorithms

5

• One-scan algorithm DSC:

1  Initialize a PI-tree and an item frequency list L;
2  For each transaction t in D              // Build initial PI-tree
3    Sort t according to item name (or item #);
4    Insert t into PI-tree;
5    Update L with items in t;
6  For each path p’ from the root to a leaf // Restructure the initial PI-tree
7    Set support count of each node of p’ to common support count;
8    Sort p’ according to L;
9    Insert p’ into a new PI-tree;
10 For each x ∈ X and x is a frequent item {    // Sanitize all rules of the form U: x => y
11   For each large 2-itemset containing x {    // e.g. {xy} and rule U: x => y
12     If confidence(U) >= min_conf, then {
         // Calculate the number of transactions to sanitize
         // # of transactions required to lower the confidence

(35)

Proposed Algorithms

6

• One-scan algorithm DSC: (con’t)

13       iterNumConf = |D| * (supp(xy) - min_conf * supp(x));
         // # of transactions required to lower the support
14       iterNumSupp = |D| * (supp(xy) - min_supp);
15       k = minimum(iterNumConf, iterNumSupp);
16       Sanitize item y in the shortest k transactions containing xy by updating the PI-tree, the frequency list L, and D;
17     }; // end if
18   }; // end for each large 2-itemset
19   Remove x from X;
20 }; // end for each x ∈ X
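As a rough illustration of lines 13-15, this sketch (the helper name and the rounding-up step are added assumptions, so that at least one whole transaction is changed) computes k for rule C => B in the toy database with |D| = 6, min_supp = 33%, min_conf = 70%:

import math

def transactions_to_sanitize(num_trans, supp_xy, supp_x, min_supp, min_conf):
    iter_num_conf = num_trans * (supp_xy - min_conf * supp_x)   # line 13
    iter_num_supp = num_trans * (supp_xy - min_supp)            # line 14
    return math.ceil(min(iter_num_conf, iter_num_supp))         # line 15, rounded up

# C => B: supp(CB) = 3/6, supp(C) = 4/6
print(transactions_to_sanitize(6, 3/6, 4/6, 0.33, 0.70))        # 1 transaction is enough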

(36)

Proposed Algorithms

7

• One-scan algorithm DSC

– Pattern-inversion tree (prefix)

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

PI-tree:
Root
  A:6:[T5]
    B:4:[T4]
      C:3:[T1,T2,T3]
    C:1:[T6]
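A minimal Python sketch of one plausible PI-tree representation (class and field names are assumptions, not the paper's data structure): each node stores an item, a support count, and the TIDs of transactions whose sorted item list ends at that node, which reproduces the counts shown above.

class PITreeNode:
    def __init__(self, item):
        self.item = item
        self.count = 0            # support count of the prefix ending at this node
        self.tids = []            # transactions that end exactly here
        self.children = {}        # item -> PITreeNode

def insert(root, sorted_items, tid):
    node = root
    for item in sorted_items:
        node = node.children.setdefault(item, PITreeNode(item))
        node.count += 1
    node.tids.append(tid)

root = PITreeNode(None)
for tid, items in [("T1", "ABC"), ("T2", "ABC"), ("T3", "ABC"),
                   ("T4", "AB"), ("T5", "A"), ("T6", "AC")]:
    insert(root, sorted(items), tid)

a = root.children["A"]
print(a.count, a.tids)                                # 6 ['T5']
print(a.children["B"].count, a.children["B"].tids)    # 4 ['T4']
print(a.children["B"].children["C"].tids)             # ['T1', 'T2', 'T3']
print(a.children["C"].count, a.children["C"].tids)    # 1 ['T6']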

(37)

Example

• Databases before and after sanitization

TID  D    D1
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001

(38)

Numerical Experiments

1

• Figure 1 Number of Hidden and Total Rules (ISL/DSR)

[Chart: Total Rules, Hidden Rules and percentage vs. number of transactions (5k, 10k, 15k)]

(39)

Numerical Experiments

2

[Chart: DSR Time Effects, Time in Seconds vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(40)

Numerical Experiments

3

• Figure 3 Side Effects of DSR with 2 Predicting Items

[Chart: DSR Side Effects, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, Hiding Failures, New Rules, Lost Rules]

(41)

Numerical Experiments

4

[Chart: DSR Altered Transactions, Percentage vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(42)

Numerical Experiments

5

• Figure 5 Time Effects of ISL

[Chart: ISL Time Effects, Time in Seconds vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(43)

Numerical Experiments

6

[Chart: ISL Side Effects, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, Hiding Failures, New Rules, Lost Rules]

(44)

Numerical Experiments

7

• Figure 7 Database Effects of ISL

[Chart: ISL Altered Transactions, Percentage vs. Number of Transactions (5k, 10k, 15k), for 1 item and 2 items]

(45)

Numerical Experiments

8

[Chart: Total Rules (DSR, DSC) and Hidden Rule % (DSR, DSC) vs. number of transactions (5k, 10k, 15k)]

(46)

Numerical Experiments

9

• Figure 9 Two-item Time Effects of DSR and DSC

[Chart: Time Effects - Two Items, Time in Seconds vs. Number of Transactions (5k, 10k, 15k); series: DSR, DSC]

(47)

Numerical Experiments

10

[Chart: DSR Side Effects - Two Items, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, New Rules-DSR, Lost Rules-DSR, Hiding Failure-DSR]

(48)

Numerical Experiments

11

• Figure 11 Two-item Side Effects of DSC

[Chart: DSC Side Effects - Two Items, Percentage of Rules vs. Number of Transactions (5k, 10k, 15k); series: Hidden Rules, New Rules-DSC, Lost Rules-DSC, Hiding Failure-DSC]

(49)

Analysis

1

• For multiple-scan algorithms:
• Time effects

– DSR faster than ISL

• Because the set of candidate transactions is smaller

• Side effects

– DSR: no hiding failure (0%), few new rules (5%) and some lost rules (11%)

– ISL: shows some hiding failure (12.9%), many new rules (33%) and no lost rule (0%)

(50)

Analysis

2

• Database effects

– DSR: 14% and 25% of transactions are altered for one and two predicting items respectively

– ISL: 59% and 128% of transactions are

altered for one and two predicting items respectively

– In ISL, some transactions are modified more than once

(51)

Analysis

3

• Item effects: different hiding orders & algorithms => different transformed databases

TID  D    D2   D4   D5   D6
T1   111  101  110  110  101
T2   111  111  111  111  111
T3   111  111  111  111  111
T4   110  110  110  110  110
T5   100  110  101  100  100
T6   101  101  101  101  101

(D2: ISL, hide C then B;  D4: ISL, hide B then C;  D5: DSR, hide C then B;  D6: DSR, hide B then C)

(52)

Analysis

4

• Fewer database scans and more rules pruned

– *: from calculating the large itemsets at different levels
– #: rules are not checked for pruning after each transaction update

        DB Scans*   Rules Pruned#
ISL     3           2

(53)

Analysis

5

• For single-scan algorithm:

– DSC is O(2|D| + |X|*l2*K*logK)

• where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in DSC algorithm.

– SWA is O((n1-1)*n1/2*|D|*Kw)

• where n1 is the initial number of restrictive rules in the database D and Kw is the window size chosen.

– SWA has a higher order of complexity

(54)

Discussions

1

• Study the problem of hiding informative association rule sets

• In phase one, propose two multiple-scan algorithms ISL & DSR, based on reducing the confidence of association rules

• Automatically hide informative rule sets without pre-mining and selection of a class of rules

• Analyze the characteristics of the proposed algorithms and compare them with the Dasseni et al. algorithms

(55)

Discussions

2

• In phase two, propose a one-scan algorithm, DSC, to improve the time efficiency by using Pattern-Inversion tree

• Numerical comparison with ISL/DSR: better time efficiency, with similar side effects
• Analytical comparison with SWA: lower computational complexity

(56)

Discussions

3

• Maintenance of hiding informative association rule sets:

TID  D    D'
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001

New data set ∆+:
TID  Items
T7   101
T8   101
T9   110

D+ = D + ∆+,  to be sanitized into D+'

(57)

Discussions

4

• Maintenance of hiding informative association

rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101

(58)

Discussions

5

• Maintenance of hiding informative association rule sets

– Centralized databases (one table)

– Distributed databases (partitioned table)

• Horizontally partitioned • Vertically partitioned

(59)

Discussions

6

• Distributed databases

– Horizontally partitioned

• Grocery shopping data collected by different supermarkets

• Credit card databases of two different credit unions
• “Fraudulent customers often have similar transaction histories, etc.”

(60)

Discussions

7

• Distributed Hiding of Association Rules

Site S1:  T11 ABC, T12 ABC, T13 ABC, T14 AB, T15 A, T16 AC
Site S2:  T21 ABC, T22 AB, T23 B, T24 AB, T25 A
Site S3:  T31 ABC, T32 ABC, T33 ABC, T34 AB, T35 A, T36 A

Hide C => A (44%, 100%), min_supp = 30%

(61)

Discussions

7

• Distributed databases

– Vertically partitioned

• “Cell phones with Li/Ion batteries lead to brain tumors in diabetics”

(62)

Discussions

8

• Multi-relational databases

– Multi-dimensional association rules

(63)

Discussions

9

• Multi-dimensional association rules

– Example: Store of material for hiking trips

– Customers(cno, name, rating, age, occupation, city)

– Items(ino, item, name, price)

– Buys(cno, ino, date, qty, total)

• 1. Single-dimensional association rule

• Buys(c, “Ski pants”) -> buys(c, “Sunglasses”)

• 2. Multidimensional association rule

(64)

Discussions

10

• Multi-dimensional association rules

– Store of material for hiking trips

– Customers(cno, name, rating, age, occupation,city)

– Items(ino, item, name, price)

– Buys(cno, ino, date, qty, total)

• 3. Hybrid-dimensional association rule

• Buys(c, “Glove”) and occupation(c,

(65)

Discussions

11

• Multi-relational association rules

Item <-> Atom,

e.g., likes(joni, icecream), has(joni, piglet), likes(elliot, piglet), has(elliot, icecream), prefers(joni, icecream, pudding)

(66)

Discussions

12

Transaction <-> Example Ei

e.g., Example joni = {7 atoms in Fig. 2}, elliot={6 atoms}

TID <-> Example Key, ExKey, e.g., KID has values joni, elliot

Cover: atomset X covers example Ei if X is a subset of Ei

Example joni

(67)

Discussions

13

• Support & Confidence

of MRAR

– DB w/ 4 examples:
  {p(1,a), q(1,b)}
  {p(2,a), q(2,b)}
  {p(3,a), p(3,d), q(3,b)}
  {p(4,a)}
– Atomsets:
  X = {p(k,x)}, Y = {q(k,y)}, X U Y = {p(k,x), q(k,y)}
– Support of X U Y = 3/4 = 0.75
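A small Python sketch of this support computation (the dictionary layout and the omission of the example key k are simplifications of the slide's notation):

examples = {
    1: {("p", "a"), ("q", "b")},
    2: {("p", "a"), ("q", "b")},
    3: {("p", "a"), ("p", "d"), ("q", "b")},
    4: {("p", "a")},
}

def support(atomset):
    # an atomset covers an example if it is a subset of that example's atoms
    covered = sum(atomset <= ex for ex in examples.values())
    return covered / len(examples)

X, Y = {("p", "a")}, {("q", "b")}
print(support(X | Y))                   # 0.75, as on the slide
print(support(X | Y) / support(X))      # 0.75, the confidence of X => Y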

(68)

Discussions

14

• AprioriRel, extension of Apriori

– Association rule from one table

• likes(KID, piglet), likes(KID, ice cream) =>

likes(KID, dolphin); (c:85, s:9)

– Association rule from three tables

• likes(KID, A), has(KID, B) => prefers(KID, A,

B); (c:98, s:70)

• Hiding MRAR

– Hide user specified MRARs by minimally

(69)

Discussions

11

• So far, all hiding algorithms are heuristic

• Computational complexity

– NP-complete? Yes, for one table, hiding

selected rules

– How about hiding rule sets?

• Reversibility

– Some blocking algorithms are reversible – How about perturbation algorithms?

(70)

References

• Some websites

– Privacy-Preserving Data Mining

• http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html
• http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html
• http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true
– Kdnuggets: www.kdnuggets.com
