Hiding Constrained Association Rules in Privacy Preserving Data Mining

Leon S.L. Wang (王學亮)
Department of Information Management, National University of Kaohsiung
Kaohsiung, Taiwan 81148
Outline
• Privacy Preserving Data Mining
• Hiding Informative Association Rules
• Proposed Algorithms
• Numerical Experiments
• Analyses
What is Data Mining?
• Market basket analysis (Association Rules)
– “if a customer purchases diapers, then he will very likely purchase beer”
• Sequences (Sequential Patterns)
– “A customer who bought an iPod three months ago is likely to order an iPhone within one month”
• Classification

Training Data
NAME    RANK            YEARS  TENURED
Mike    Assistant Prof  3      no
Mary    Assistant Prof  7      yes
Bill    Professor       2      yes
Jim     Associate Prof  7      yes
Dave    Assistant Prof  6      no

Classification Algorithms → Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
What is Data Mining?
Classifier
Testing Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
Unseen Data
(Jeff, Professor, 4)
Tenured?
• Classification
What is Data Mining?
• Clustering
• Unsupervised learning: Finds “natural” grouping of instances given unlabeled data
What is Data Mining?
• Data mining:
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names:
– Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Current Technology:
A KDD Process
– Data mining: the core of the knowledge discovery process
Data Cleaning → Data Integration → Data Warehouse → Task-relevant Data (Selection) → Data Mining → Pattern Evaluation
Current Technology: Architecture
of a Typical Data Mining System
(bottom to top)
Data → Data cleaning & data integration → Database or data warehouse server (filtering) → Data mining engine → Pattern evaluation → Graphical user interface
What is Privacy Preserving Data
Mining?
• A scenario:
– Retailer XYZ Supermarket Chain allows Supplier ABC Paper Company access to its customer DB
– ABC can predict inventory needs; XYZ is offered reduced prices
What is Privacy Preserving Data
Mining?
• Supplier ABC discovers (through data mining):
– Sequence: Cold remedies -> Facial tissue
– Association: (Skim milk, Green paper)
• Supplier ABC runs coupon marketing campaign:
– “50 cents off skim milk when you buy ABC products”
• Results:
– Lower sales of Green paper for XYZ (Bad)
What is Privacy Preserving Data
Mining?
• Main objective
– To develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process
What is Privacy Preserving Data
Mining?
• Different Concepts of Privacy!
– Output privacy
• Selectively modify individual values in a database to prevent the discovery of a set of association rules (Dasseni et al., Information Hiding Workshop 2001)
– Input privacy
• Add noise to data before delivery to the data miner
– Technique to reduce the impact of noise on learning a decision tree (Agrawal and Srikant, SIGMOD)
• Two parties, each with a portion of the data
– Learn a decision tree without sharing data (Lindell and Pinkas, CRYPTO)
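The additive-noise idea above can be sketched in a few lines of Python. This is a toy illustration only, not Agrawal and Srikant's actual distribution-reconstruction procedure; the uniform noise and its range are arbitrary choices:

```python
import random

def perturb(values, noise_range=20):
    """Release each numeric value with uniform random noise added.

    Individual originals are masked, but aggregate statistics (e.g. the
    mean) can still be estimated, since the noise has zero mean.
    """
    return [v + random.uniform(-noise_range, noise_range) for v in values]

ages = [23, 35, 41, 29, 52]
released = perturb(ages)
# Every released value lies within noise_range of its original.
```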
What is Privacy Preserving Data
Mining?
• Output privacy
– The original database DO (yielding rules RO) is modified into DM; mining DM yields RM
• Input privacy (1)
– DO is modified into D'O before delivery to the data miner
What is Privacy Preserving Data
Mining?
• Input privacy (2)
– Parties holding D1 + D2 + D3 jointly obtain RO without pooling their data
Possible Solutions
• Limiting access
– Control access to the data
– Used by the secure DBMS community
• “Fuzz” the data
– Force aggregation into daily records instead of individual transactions, or slightly alter data values
Possible Solutions
• “Fuzz” the data
– k-anonymity: at least k tuples in one group
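A toy sketch of the k-anonymity idea: generalize quasi-identifiers into ranges, then suppress groups smaller than k. The decade-range scheme and the suppression step are illustrative choices, not part of any particular k-anonymity algorithm:

```python
from collections import Counter

def k_anonymize(records, k=2):
    """Generalize exact ages to decade ranges; suppress any record whose
    range-group has fewer than k members, so every released record is
    indistinguishable from at least k-1 others on the quasi-identifier."""
    generalized = [(name, f"{age // 10 * 10}-{age // 10 * 10 + 9}")
                   for name, age in records]
    group_size = Counter(rng for _, rng in generalized)
    return [(name, rng) for name, rng in generalized if group_size[rng] >= k]

records = [("Alice", 23), ("Bob", 27), ("Carol", 25), ("Dave", 41)]
released = k_anonymize(records, k=2)
# Dave's record is suppressed: "40-49" would form a group of size 1.
```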
Possible Solutions
• Eliminate unnecessary groupings
– The first 3 digits of SSNs are assigned sequentially by office
– Clustering high-order bits of a “unique identifier” is likely to group similar data elements
– Remedy: assign unique identifiers randomly
• Augment the data
– Populate the phone book with extra, fictitious people in non-obvious ways
– Return correct info when asking an individual, but return incorrect info when asking all
Possible Solutions
• Audit
– Detect misuse by legitimate users
– Administrative or criminal disciplinary action may be initiated
Current Techniques
• Data sources
– Centralized data
– Distributed data
• Horizontally distributed or vertically distributed
• Data modification
– Perturbation (changing 1 to 0, or add noise)
– Blocking (replaced by “?”)
– Aggregation or merging values
– Swapping (interchanging values of individual records)
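Three of the four modification operators above can be illustrated on binary transactions. A sketch; the dict encoding and item names are mine (aggregation, the fourth operator, would merge several such records into one coarser record):

```python
t1 = {"A": 1, "B": 1, "C": 1}
t2 = {"A": 1, "B": 0, "C": 0}

def perturb(t, item):
    """Perturbation: change a 1 to 0, i.e. silently drop the item."""
    return {**t, item: 0}

def block(t, item):
    """Blocking: replace the value with an unknown marker '?'."""
    return {**t, item: "?"}

def swap(ta, tb, item):
    """Swapping: interchange the item's values between two records."""
    return {**ta, item: tb[item]}, {**tb, item: ta[item]}

print(perturb(t1, "C"))   # {'A': 1, 'B': 1, 'C': 0}
print(block(t1, "B"))     # {'A': 1, 'B': '?', 'C': 1}
```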
Current Techniques
• Data mining algorithm
– To find commonalities among algorithms, in order to develop strategies to defeat a wide variety of data mining tools
• Data or rule hiding
– Hiding raw data or aggregated data (or rule)
• Privacy preservation (selective modification of data)
– Heuristic-based techniques
– Cryptography-based techniques
– Reconstruction-based techniques
Current Research Areas
• Privacy-Preserving Data Publishing
– K-anonymity
• Try to prevent privacy de-identification
• Privacy-Preserving Application
– Association rules hiding
• Utility-based Privacy-Preserving
• Distributed Privacy with Adversarial Collaboration
Current Research Areas
• Utility-based Privacy-Preserving
– Q1:”How many customers under age 29 are there in the data set?”
– Q2: “Is an individual with age=25, education= Bachelor, Zip Code = 53712 a target customer?”
– Table 2, answers: “2”; “Y”
Problem Description
• Association rule mining
• Input: DO, min_supp, min_conf
• Output: RO

DO:
TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%
|A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3

 1  B => A   (66%, 100%)
 2  C => A   (66%, 100%)
 3  B => C   (50%, 75%)
 4  C => B   (50%, 75%)
 5  AB => C  (50%, 75%)
 6  AC => B  (50%, 75%)
 7  BC => A  (50%, 100%)
 8  C => AB  (50%, 75%)
 9  B => AC  (50%, 75%)
10  A => B   (66%, 66%)   not an AR
11  A => C   (66%, 66%)   not an AR
12  A => BC  (50%, 50%)   not an AR
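The rule list above can be checked mechanically. A small sketch (not an Apriori implementation: it brute-forces every candidate rule over {A, B, C} against both thresholds):

```python
from itertools import combinations

DB = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
MIN_SUPP, MIN_CONF = 1 / 3, 0.7

def supp(s):
    return sum(s <= t for t in DB) / len(DB)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

rules = []
for size in (2, 3):
    for xy in combinations("ABC", size):
        xy = set(xy)
        for k in range(1, len(xy)):
            for lhs in map(set, combinations(sorted(xy), k)):
                rhs = xy - lhs
                if supp(xy) >= MIN_SUPP and conf(lhs, rhs) >= MIN_CONF:
                    rules.append((frozenset(lhs), frozenset(rhs)))

# B => A qualifies (supp 66%, conf 100%); A => B does not (conf 66% < 70%).
print(len(rules))   # 9 rules, matching #1-#9 on the slide
```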
Problem Description
• Informative association rules
• Input: DO, min_supp, min_conf, X = {C}
• Output: {C => A (66%, 100%), C => B (50%, 75%)}
– Same DO, min_supp = 33%, min_conf = 70%, and rules #1–#12 as the previous slide
– Rules #6, #7, #8 predict the same RHS items {A, B} as #2 and #4
Problem Description (LHS)
• Input: DO, X (items to be hidden on LHS), min_supp, min_conf
• Output: DM

X = {C}; sanitization changes T1 from ABC to AC

DM:
TID  Items
T1   AC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

|A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

 1  B => A   (50%, 100%)
 2  C => A   (66%, 100%)   try
 3  B => C   (33%, 66%)    lost
 4  C => B   (33%, 50%)    hidden
 5  AB => C  (33%, 66%)    lost
 6  AC => B  (33%, 50%)    hidden
 7  BC => A  (33%, 100%)   try
 8  C => AB  (33%, 50%)    hidden
 9  B => AC  (33%, 66%)    lost
10  A => B   (50%, 50%)
11  A => C   (50%, 66%)
12  A => BC  (33%, 33%)
Side effects
• Hiding failure, lost rules, new rules
– R: rules mined from the original database; Rh: rules to hide; ~R: rules mined from the sanitized database
– ① Hiding failure: rules in Rh that still appear in ~R
– ② Lost Rules: rules in R − Rh that no longer appear in ~R
– ③ New Rules: rules in ~R that were not in R
Proposed Algorithms
• Strategy:
– To lower the confidence of a given rule X => Y,
– either
• Increase the support of X, but not XY, OR
• Decrease the support of XY (or both X and XY)

Conf(X ⇒ Y) = support(XY) / support(X)
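The formula and the decrease-support strategy can be demonstrated on the running example. A sketch only; the choice of which transaction to sanitize is arbitrary here, unlike in the DSR/DSC algorithms that follow:

```python
DB = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]

def supp(s):
    return sum(s <= t for t in DB) / len(DB)

def conf(x, y):                     # Conf(X => Y) = supp(XY) / supp(X)
    return supp(x | y) / supp(x)

X, Y = {"C"}, {"A"}
print(conf(X, Y))                   # 1.0 -- C => A holds with 100% confidence

# Decrease support of XY: remove A from one transaction supporting both.
for t in DB:
    if X | Y <= t:
        t.discard("A")
        break

print(conf(X, Y))                   # ~0.75: lower, but not yet under min_conf 70%
```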
Proposed Algorithms
• Multiple scans of database
– Increase Support of LHS First (ISL)
– Decrease Support of RHS First (DSR)
• One scan of database
– Decrease Support and Confidence (DSC)
Proposed Algorithms
• Multiple-scan algorithm ISL:
  Find large 1-itemsets from D;
  For each predicting item x ∈ X
    If x is not a large 1-itemset, then X := X − {x};
  If X is empty, then EXIT;  // no rule contains X in LHS
  Find large 2-itemsets from D;
  For each x ∈ X {
    For each large 2-itemset containing x {
      Compute confidence of rule U, where U is a rule of the form x => y;
      If conf(U) < min_conf, then
        Go to next large 2-itemset;
      Else {  // Increase Support of LHS
        Find TL = { t in D | t does not support U };
        Sort TL in ascending order by the number of items;
        While ( conf(U) >= min_conf and TL is not empty ) {
          Choose the first transaction t from TL;
          Modify t to support x, the LHS(U);
          Compute support and confidence of U;
        }  // end while
      }  // end else
      If TL is empty, then {
        Cannot hide x => y;
        Restore D;
        Go to next large 2-itemset;
      }  // end if TL is empty
    }  // end of for each large 2-itemset
    Remove x from X;
  }  // end of for each x ∈ X
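A compact runnable rendering of the ISL pseudocode (helper names and the restore-on-failure bookkeeping are mine; following the pseudocode, TL holds transactions that do not support x => y, sorted shortest first). On the example database with X = {B}, it hides B => C but fails to hide B => A, since every transaction already contains A:

```python
def isl(db, X, min_supp, min_conf):
    """Hide rules x => y (x in X) by adding x to non-supporting
    transactions until confidence drops below min_conf.
    db is a list of mutable item sets; modified in place."""
    supp = lambda s: sum(s <= t for t in db) / len(db)
    conf = lambda x, y: supp({x, y}) / supp({x})
    items = sorted(set().union(*db))
    for x in sorted(X):
        if supp({x}) < min_supp:
            continue                          # x is not a large 1-itemset
        for y in items:
            if y == x or supp({x, y}) < min_supp or conf(x, y) < min_conf:
                continue                      # no rule x => y above thresholds
            backup = [set(t) for t in db]     # for "Restore D" on failure
            tl = sorted((t for t in db if not {x, y} <= t), key=len)
            while tl and conf(x, y) >= min_conf:
                tl.pop(0).add(x)              # raise supp(x) to dilute confidence
            if conf(x, y) >= min_conf:
                db[:] = backup                # cannot hide x => y: restore D
    return db

db = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
isl(db, {"B"}, min_supp=1/3, min_conf=0.7)
```

After the call, conf(B => C) = 60% < 70% (hidden), while conf(B => A) stays at 100%: a hiding failure, consistent with the nonzero failure rate reported for ISL later in the deck.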
Proposed Algorithms
• One-scan algorithm DSC:
  1  Initialize a PI-tree and an item frequency list L;
  2  For each transaction t in D {       // build initial PI-tree
  3    Sort t according to item name (or item #);
  4    Insert t into PI-tree;
  5    Update L with items in t; }
  6  For each path p' from the root to a leaf {   // restructure the initial PI-tree
  7    Set support count of each node of p' to the common support count;
  8    Sort p' according to L;
  9    Insert p' into a new PI-tree; }
  10 For each x ∈ X where x is a frequent item {  // sanitize all rules of the form U: x => y
  11   For each large 2-itemset containing x {    // e.g. {xy} and rule U: x => y
  12     If confidence(U) >= min_conf, then {
           // number of transactions required to lower the confidence
  13       iterNumConf = |D| * (supp(xy) − min_conf * supp(x));
           // number of transactions required to lower the support
  14       iterNumSupp = |D| * (supp(xy) − min_supp);
  15       k = minimum(iterNumConf, iterNumSupp);
  16       Sanitize item y in the shortest k transactions containing xy by
           updating the PI-tree, frequency list L, and D;
  17     }; // end if
  18   }; // end for each large 2-itemset
  19   Remove x from X;
  20 }; // end for each x ∈ X
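The core sanitization step (lines 13–16) can be sketched as follows. PI-tree bookkeeping is omitted: this sketch works on the raw transaction list, and rounding up to the next whole transaction is my reading of the slide:

```python
import math

def dsc_sanitize(db, x, y, min_supp, min_conf):
    """Hide rule x => y by deleting y from the k shortest transactions
    containing both x and y, where k is the smaller of the counts
    needed to break the confidence or the support threshold."""
    n = len(db)
    supp = lambda s: sum(s <= t for t in db) / n
    if supp({x, y}) < min_supp or supp({x, y}) / supp({x}) < min_conf:
        return db                                   # already not a rule
    iter_num_conf = n * (supp({x, y}) - min_conf * supp({x}))
    iter_num_supp = n * (supp({x, y}) - min_supp)
    k = math.floor(min(iter_num_conf, iter_num_supp)) + 1
    for t in sorted((t for t in db if {x, y} <= t), key=len)[:k]:
        t.discard(y)                                # sanitize item y
    return db

db = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
dsc_sanitize(db, "C", "A", min_supp=1/3, min_conf=0.7)
# C => A drops to confidence 2/4 = 50% < 70%: hidden.
```

On this example k = 2 (iterNumConf ≈ 1.2 beats iterNumSupp = 2), so A is removed from the two shortest transactions containing AC (T6, then T1).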
Proposed Algorithms
• One-scan algorithm DSC
– Pattern-inversion tree (prefix)

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

Root
└─ A:6:[T5]
   ├─ B:4:[T4]
   │  └─ C:3:[T1,T2,T3]
   └─ C:1:[T6]
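The prefix tree on this slide can be reproduced with a few lines of Python. A structural sketch only: the frequency-ordered restructuring pass of DSC (lines 6–9) is omitted, and class and field names are mine:

```python
class Node:
    def __init__(self):
        self.count = 0      # common support count of this prefix
        self.tids = []      # transactions ending exactly at this node
        self.children = {}

def build_pi_tree(db):
    """Insert each transaction in sorted item order; shared prefixes
    share nodes, exactly as in the slide's tree."""
    root = Node()
    for tid, items in db:
        node = root
        for item in sorted(items):
            node = node.children.setdefault(item, Node())
            node.count += 1
        node.tids.append(tid)
    return root

db = [("T1", "ABC"), ("T2", "ABC"), ("T3", "ABC"),
      ("T4", "AB"), ("T5", "A"), ("T6", "AC")]
root = build_pi_tree(db)
a = root.children["A"]
print(a.count, a.tids)                      # 6 ['T5']  -> A:6:[T5]
print(a.children["B"].children["C"].tids)   # ['T1', 'T2', 'T3']
```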
Example
• Databases before and after sanitization

TID  D    D1
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001
Numerical Experiments
• Figure 1  Number of Hidden and Total Rules (ISL/DSR)
  [chart omitted: total rules and hidden-rule percentage at 5k, 10k, 15k transactions]
Numerical Experiments
• Figure 2  DSR Time Effects
  [chart omitted: time in seconds vs. number of transactions (5k, 10k, 15k); 1 item and 2 items]
Numerical Experiments
• Figure 3  Side Effects of DSR with 2 Predicting Items
  [chart omitted: percentage of hidden rules, hiding failures, new rules, and lost rules vs. number of transactions]
Numerical Experiments
• Figure 4  DSR Altered Transactions
  [chart omitted: percentage of altered transactions vs. number of transactions; 1 item and 2 items]
Numerical Experiments
• Figure 5  Time Effects of ISL
  [chart omitted: time in seconds vs. number of transactions; 1 item and 2 items]
Numerical Experiments
• Figure 6  ISL Side Effects
  [chart omitted: percentage of hidden rules, hiding failures, new rules, and lost rules vs. number of transactions]
Numerical Experiments
• Figure 7  Database Effects of ISL
  [chart omitted: percentage of altered transactions vs. number of transactions; 1 item and 2 items]
Numerical Experiments
• Figure 8  Number of Hidden and Total Rules (DSR/DSC)
  [chart omitted: total rules and hidden-rule percentage for DSR and DSC vs. number of transactions]
Numerical Experiments
• Figure 9  Two-item Time Effects of DSR and DSC
  [chart omitted: time in seconds vs. number of transactions; DSR and DSC]
Numerical Experiments
• Figure 10  DSR Side Effects - Two Items
  [chart omitted: percentage of hidden rules, new rules, lost rules, and hiding failures vs. number of transactions]
Numerical Experiments
• Figure 11  Two-item Side Effects of DSC
  [chart omitted: percentage of hidden rules, new rules, lost rules, and hiding failures vs. number of transactions]
Analysis
• For multiple-scan algorithms:
• Time effects
– DSR is faster than ISL
• Because the set of candidate transactions is smaller
• Side effects
– DSR: no hiding failures (0%), few new rules (5%), and some lost rules (11%)
– ISL: some hiding failures (12.9%), many new rules (33%), and no lost rules (0%)
Analysis
• Database effects
– DSR: 14% and 25% of transactions are altered for one and two predicting items, respectively
– ISL: 59% and 128% of transactions are altered for one and two predicting items, respectively
– In ISL, some transactions are modified more than once
Analysis
• Item effects: different hiding orders & algorithms => different transformed databases

TID  D    D2 (ISL,C,B)  D4 (ISL,B,C)  D5 (DSR,C,B)  D6 (DSR,B,C)
T1   111  101           110           110           101
T2   111  111           111           111           111
T3   111  111           111           111           111
T4   110  110           110           110           110
T5   100  110           101           100           100
T6   101  101           101           101           101
Analysis
• Fewer database scans and more rules pruned
– *: from calculating the large itemsets at different levels
– #: does not check whether rules can be pruned after each transaction update

       DB Scans   Rules Pruned
ISL    3          2
Analysis
• For the single-scan algorithm:
– DSC is O(2|D| + |X| * l2 * K * logK)
• where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in the DSC algorithm
– SWA is O((n1 − 1) * n1/2 * |D| * Kw)
• where n1 is the initial number of restrictive rules in the database D and Kw is the window size chosen
– SWA has a higher order of complexity
Discussions
• Study the problem of hiding informative association rule sets
• In phase one, propose two multiple-scan algorithms, ISL & DSR, based on reducing the confidence of association rules
• Automatically hide informative rule sets without pre-mining and selection of a class of rules
• Analyze the characteristics of the proposed algorithms and compare them with Dasseni et al.'s algorithms
Discussions
• In phase two, propose a one-scan algorithm, DSC, to improve time efficiency by using a Pattern-Inversion tree
• Numerically compared with ISL/DSR: better time efficiency with similar side effects
• Analytically compared with SWA: better complexity
Discussions
• Maintenance of hiding informative association rule sets:

TID  D    D'
T1   111  001
T2   111  111
T3   111  111
T4   110  110
T5   100  100
T6   101  001

New data set Δ+:
TID  Items
T7   101
T8   101
T9   110

D+ = D + Δ+  →  D+'
Discussions
• Maintenance of hiding informative association rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101
Discussions
• Maintenance of hiding informative association rule sets
– Centralized databases (one table)
– Distributed databases (partitioned table)
• Horizontally partitioned
• Vertically partitioned
Discussions
• Distributed databases
– Horizontally partitioned
• Grocery shopping data collected by different supermarkets
• Credit card databases of two different credit unions
• “Fraudulent customers often have similar transaction histories, etc.”
Discussions
• Distributed Hiding of Association Rules

Site S1:         Site S2:         Site S3:
TID  Items       TID  Items       TID  Items
T11  ABC         T21  ABC         T31  ABC
T12  ABC         T22  AB          T32  ABC
T13  ABC         T23  B           T33  ABC
T14  AB          T24  AB          T34  AB
T15  A           T25  A           T35  A
T16  AC                           T…   A

Hide C => A (44%, 100%); min_supp = 30%
Discussions
• Distributed databases
– Vertically partitioned
• “Cell phones with Li/Ion batteries lead to brain tumors in diabetics”
Discussions
• Multi-relational databases
– Multi-dimensional association rules
Discussions
• Multi-dimensional association rules
– Example: store of material for hiking trips
– Customers(cno, name, rating, age, occupation, city)
– Items(ino, item, name, price)
– Buys(cno, ino, date, qty, total)
• 1. Single-dimensional association rule
• Buys(c, “Ski pants”) -> Buys(c, “Sunglasses”)
• 2. Multidimensional association rule
Discussions
• Multi-dimensional association rules
– Store of material for hiking trips
– Customers(cno, name, rating, age, occupation, city)
– Items(ino, item, name, price)
– Buys(cno, ino, date, qty, total)
• 3. Hybrid-dimensional association rule
• Buys(c, “Glove”) and occupation(c,
Discussions
• Multi-relational association rules
– Item <-> Atom
e.g., likes(joni, icecream), has(joni, piglet), likes(elliot, piglet), has(elliot, icecream), prefers(joni, icecream, pudding)
Discussions
• Transaction <-> Example Ei
e.g., Example joni = {7 atoms in Fig. 2}, elliot = {6 atoms}
• TID <-> Example Key, ExKey; e.g., KID has values joni, elliot
• Cover: atomset X covers example Ei if X is a subset of Ei
Discussions
• Support & Confidence of MRAR
– DB with 4 examples:
• {p(1,a), q(1,b)}
• {p(2,a), q(2,b)}
• {p(3,a), p(3,d), q(3,b)}
• {p(4,a)}
– Atomsets:
• X = {p(k,x)}, Y = {q(k,y)}
• X ∪ Y = {p(k,x), q(k,y)}
– Support of X ∪ Y = 3/4 = 0.75
Discussions
• AprioriRel, an extension of Apriori
– Association rule from one table
• likes(KID, piglet), likes(KID, ice cream) => likes(KID, dolphin); (c:85, s:9)
– Association rule from three tables
• likes(KID, A), has(KID, B) => prefers(KID, A, B); (c:98, s:70)
• Hiding MRAR
– Hide user-specified MRARs by minimally
Discussions
• So far, all hiding algorithms are heuristic
• Computational complexity
– NP-complete? Yes, for one table, hiding selected rules
– How about hiding rule sets?
• Reversibility
– Some blocking algorithms are reversible
– How about perturbation algorithms?
References
• Some websites
– Privacy-Preserving Data Mining
• http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html
• http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html
• http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true
– Kdnuggets: www.kdnuggets.com