Recent Studies in Privacy-Preserving
Data Mining
Leon S.L. Wang
Department of Information Management, National University of Kaohsiung
Outline
Data Mining – a quick glance
Privacy-Preserving Data Mining (PPDM)
Objective, common practices, possible attacks
Current Research Areas
K-Anonymity, Utility, Distributed Privacy, Association Rule
Hiding
Recent Studies
Data Mining
1
Market basket analysis (Association Rules)
“if a customer purchases diapers, then he will very likely
purchase beer”
Sequences (Sequential Patterns)
“A customer who bought an iPod three months ago is likely to
order an iPhone within one month”
Training Data
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Classification Algorithms Classifier (Model)
Classification
Data Mining
2
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Data Mining
3
Classifier
Testing Data
Unseen Data (Jeff, Professor, 4)
Tenured?
Data Mining
4
• Clustering
• Unsupervised learning: Finds “natural” grouping of instances given unlabeled data
Data Mining
5
Data mining:
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from data in large databases
Alternative names:
Knowledge discovery in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
Privacy Preserving Data Mining
1
Motivating example – Group Insurance Commission: they found the MA governor’s medical record
Privacy Preserving Data Mining
2
Hospital Patient Data (Name, ID are hidden)
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info)
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Andre has heart disease!
Motivating examples – Group Insurance
Privacy Preserving Data Mining
3
Motivating examples – A Face Is Exposed
for AOL Searcher No. 4417749
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.
Thelma Arnold's
Privacy Preserving Data Mining
4
Motivating examples – America Online
~650k users, 3 months period, ~20 million queries released
No name, no SSN, no driver license #, no credit card #
The user, ID 4417749, was found to be Thelma Arnold, a 62-year-old woman living in Georgia.
Loss of privacy for users, damage to AOL, significant damage to academics who depend on such data.
Privacy Preserving Data Mining
5
Motivating examples – Netflix Prize
In October of 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.
Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.
Privacy Preserving Data Mining
6
Motivating examples – Association Rules
Supplier: ABC Paper Company
Retailer: XYZ Supermarket Chain
1. Allow ABC to access customer XYZ’s DB
2. Predict XYZ’s inventory needs & offer reduced prices
Privacy Preserving Data Mining
7
Supplier ABC discovers (thru data mining):
Sequence: Cold remedies -> Facial tissue
Association: (Skim milk, Green paper)
Supplier ABC runs coupon marketing
campaign:
“50 cents off skim milk when you buy ABC
products”
Privacy Preserving Data Mining
1
- Objective
Privacy
The state of being private; the state of not being seen by others
Database security
To prevent loss of privacy due to
viewing/disclosing unauthorized data
PPDM
Privacy Preserving Data Mining
2
- Common Practices
Limiting access
Control access to the data
Used by secure DBMS community
“Fuzz” the data
Forcing aggregation into daily records instead of individual transactions or slightly altering data values
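The two "fuzzing" options above can be sketched in a few lines. This is only an illustration under assumed inputs: the transaction list, the `max_shift` bound, and the function names are hypothetical and do not come from the slides.

```python
import random
from collections import defaultdict

def aggregate_daily(transactions):
    """Collapse (date, amount) transactions into one total per day."""
    totals = defaultdict(float)
    for date, amount in transactions:
        totals[date] += amount
    return dict(totals)

def perturb(values, max_shift, rng):
    """Slightly alter each value by a uniform shift in [-max_shift, max_shift]."""
    return [v + rng.uniform(-max_shift, max_shift) for v in values]

# Hypothetical individual transactions (date, amount).
transactions = [("2010-05-01", 12.5), ("2010-05-01", 7.5), ("2010-05-02", 30.0)]
amounts = [amount for _, amount in transactions]

daily = aggregate_daily(transactions)                     # one record per day
noisy = perturb(amounts, max_shift=1.0, rng=random.Random(0))
```

Aggregation hides individual transactions entirely, while perturbation keeps one value per transaction but makes each one slightly inexact.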
Privacy Preserving Data Mining
3
- Common Practices
Eliminate unnecessary groupings
The first 3 digits of SSNs are assigned sequentially by office
Clustering on the high-order bits of a “unique identifier” is likely to group similar data elements
Assign unique identifiers randomly
Augment the data
Populate the phone book with extra, fictitious people in non-obvious ways
Return correct info when asking an individual, but return incorrect info when asking all individuals in a department
Privacy Preserving Data Mining
4
- Common Practices
Audit
Detect misuse by legitimate users
Administrative or criminal disciplinary action may be initiated
Privacy Preserving Data Mining
5
- Possible Attacks
Linking attacks (Sweeney IJUFKS '02)
Re-identification
Identity linkage (K-anonymity)
Attribute linkage (l-diversity)
Hospital Patient Data (Name, ID are hidden)
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info)
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
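A linking attack on these two tables amounts to a join on the shared quasi-identifiers (DOB, Sex, Zipcode). The sketch below uses the values from the slide; the function name `link` and the dictionary layout are illustrative choices, not part of the original material.

```python
QUASI_IDENTIFIERS = ("dob", "sex", "zipcode")

# Released hospital data: names and IDs removed, quasi-identifiers kept.
hospital = [
    {"dob": "1/21/76", "sex": "Male", "zipcode": "53715", "disease": "Heart Disease"},
    {"dob": "4/13/86", "sex": "Female", "zipcode": "53715", "disease": "Hepatitis"},
    {"dob": "2/28/76", "sex": "Male", "zipcode": "53703", "disease": "Bronchitis"},
    {"dob": "1/21/76", "sex": "Male", "zipcode": "53703", "disease": "Broken Arm"},
    {"dob": "4/13/86", "sex": "Female", "zipcode": "53706", "disease": "Flu"},
    {"dob": "2/28/76", "sex": "Female", "zipcode": "53706", "disease": "Hang Nail"},
]

# Public voter registration data.
voters = [
    {"name": "Andre", "dob": "1/21/76", "sex": "Male", "zipcode": "53715"},
    {"name": "Beth", "dob": "1/10/81", "sex": "Female", "zipcode": "55410"},
    {"name": "Carol", "dob": "10/1/44", "sex": "Female", "zipcode": "90210"},
    {"name": "Dan", "dob": "2/21/84", "sex": "Male", "zipcode": "02174"},
    {"name": "Ellen", "dob": "4/19/72", "sex": "Female", "zipcode": "02237"},
]

def link(released, public, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs where a public record matches exactly one released record."""
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for person in public:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((person["name"], candidates[0]["disease"]))
    return matches

reidentified = link(hospital, voters)
```

Only Andre's quasi-identifier combination appears in both tables, and it appears exactly once in the hospital data, which is why removing Name and ID alone does not protect him.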
Privacy Preserving Data Mining
6
- Possible Attacks
Corruption attacks (Tao ICDE '08, Chaytor ICDM '09)
Background knowledge; Perturbed generalization (PG)
Privacy Preserving Data Mining
7
- Possible Attacks
Differential privacy (Dwork ICALP '06)
Add noise to the data set so that the output of any query over 30 records differs only slightly (within a differential) from its output over any 29 of those records.
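One standard way to achieve this for counting queries is the Laplace mechanism, which the slide does not name; the sketch below is offered under that assumption. The records, the privacy parameter `epsilon`, and the function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Counting query with Laplace noise; a count changes by at most 1 when one record is removed."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
ages = list(range(20, 50))            # 30 hypothetical records
under_30 = lambda age: age < 30

with_all_30 = noisy_count(ages, under_30, epsilon=0.5, rng=rng)
with_only_29 = noisy_count(ages[1:], under_30, epsilon=0.5, rng=rng)
```

Because the noise scale is tied to the largest change a single record can cause, the two noisy answers are statistically hard to tell apart, which is the property the slide describes. Smaller `epsilon` means more noise and stronger protection.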
Privacy Preserving Data Mining
8
- Possible Attacks
Realistic adversaries (Machanavajjhala VLDB '09)
Weak privacy:
l-diversity, t-closeness
Adversaries need to know very specific information
Strong privacy:
Differential privacy
Adversaries need to know all information except the victim's
Epsilon-privacy
Adversary's knowledge can vary and learn, and is
Privacy Preserving Data Mining
9
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Degree attack: knowing Bob has 4 friends =>
Privacy Preserving Data Mining
10
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Sub-graph attack (one match), knowing Bob’s friends and friends' friends => Vertex 7 is Bob
Privacy Preserving Data Mining
11
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Sub-graph attack (k-match), knowing Bob’s friends => 8 matches, but they share common labels 6 & 7 => vertex 7 is still uniquely identified as Bob
Privacy Preserving Data Mining
12
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Hub fingerprint attacks
Some hubs have been identified, adversary knows the distance between Bob and hubs
Privacy Preserving Data Mining
13
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Label attack
Interaction graph
Privacy Preserving Data Mining
14
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preserving Data Mining
15
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preserving Data Mining
16
- Possible Attacks
Active attacks (Backstrom WWW '07)
Plant a subgraph H into G and connect it to targeted nodes (add new nodes and edges)
Recover H from G & identify targeted nodes' identities and relationships
Walk-based (largest H), cut-based (smallest H)
Privacy Preserving Data Mining
17
- Possible Attacks
Passive attacks (no nodes, no edges added)
Start from a coalition of friends (nodes) in the anonymized graph G, discover the existence of edges among the users to whom they are linked
Semi-passive attacks (add edges only, no nodes)
From existing nodes in G, add fake edges to targeted nodes
Privacy Preserving Data Mining
18
- Possible Attacks
Intersection attacks (Puttaswamy CoNEXT '09)
Two users were compromised, A and B,
A queries server for the visitor of “website xyz”,
B queries server for the visitor of “website xyz”,
Privacy Preserving Data Mining
19
- Possible Attacks
Intersection attacks
StarClique (add latent edges)
The graph evolution process for a node. The node first selects a
subset of its neighbors. Then it builds a clique with the members of this subset. Finally, it connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.
Privacy Preserving Data Mining
20
- Possible Attacks
Relationship attacks (Liu SDM '09)
Sensitive edge weights, e.g. transaction expenses in business network,
Reveal the shortest path between source and sink, e.g., A -> D,
Privacy Preserving Data Mining
21
- Possible Attacks
Relationship attacks (Liu SDM '09)
Preserving the shortest path, e.g. A -> D,
Min perturbation on path length, path weight,
Privacy Preserving Data Mining
22
- Possible Attacks
Relationship attacks (Liu ICIS '10)
Preserving the shortest path, e.g. A -> D,
K-anonymous weight privacy
the blue edge group and the green edge group satisfy the 4-anonymous privacy where =10.
Privacy Preserving Data Mining
23
- Possible Attacks
Relationship attacks (Das ICDE '10)
Preserving linear property, e.g., shortest paths,
The ordering of the five edge weights is preserved after naïve anonymization.
x5 ≤ x1 ≤ x4 ≤ x3 ≤ x2, where x1=(v1, v2), x2=(v1, v4), x3=(v1, v3), x4=(v2, v4), x5=(v3, v4),
The minimum cost spanning tree is preserved. {(v1, v2), (v2, v4), (v1, v3)}
The shortest path from v1 to v4 is changed.
The ordering of the edge weight is still exposed. For example, v3 and v4 are best friends and v1 and v4 are not so good friends.
[Figure: example graph with edge weights x1 to x5]
Privacy Preserving Data Mining
24
- Possible Attacks
Location Privacy (Papadopoulos VLDB '10)
Preserving the privacy of the location of user in
Privacy Preserving Data Mining
25
- Possible Attacks
Location Privacy (Papadopoulos VLDB '10)
Location obfuscation
Send additional set of “dummy” queries, in addition to actual query
Data transformation
Encrypted query is sent to LBS
PIR-based location privacy
PIR-based queries are sent to LBS server and retrieve blocks without server discovering which blocks are requested
Privacy Preserving Data Mining
26
- Possible Attacks
Inference through data mining attacks
Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden
Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ
[Figure: original database D_O, modified database D_M, rule set R_O]
Privacy Preserving Data Mining
27
- Possible Attacks
ρ-uncertainty (Cao VLDB '10)
Given a transaction dataset, sensitive items Is, and uncertainty level ρ, the objective is to make the confidence of all sensitive association rules less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ∈ Is.
If Alice knows Bob bought b1, then she knows Bob also bought {a1, b2, α, }, where Is = {α, }
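The ρ-uncertainty condition can be checked by brute force on a small dataset. The sketch below is an illustration only: the transactions and item names are hypothetical, antecedents are restricted to non-empty itemsets that occur in the data, and suppression is simulated by simply deleting an item.

```python
from itertools import combinations

def support(transactions, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def violations(transactions, sensitive, rho):
    """List rules chi -> alpha (alpha sensitive) whose confidence is >= rho."""
    found, seen = [], set()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for chi in map(frozenset, combinations(sorted(t), size)):
                if chi in seen:
                    continue
                seen.add(chi)
                base = support(transactions, chi)
                for alpha in sorted(sensitive - chi):
                    conf = support(transactions, chi | {alpha}) / base
                    if conf >= rho:
                        found.append((tuple(sorted(chi)), alpha, conf))
    return found

# Hypothetical data: "alpha" is the only sensitive item.
data = [frozenset(t) for t in (
    {"a1", "b1", "b2", "alpha"}, {"a1", "b2"}, {"a1", "b2", "alpha"}, {"b2"},
)]
sensitive = {"alpha"}

before = violations(data, sensitive, rho=0.7)
# Simulated suppression of the item b1 that gives alpha away.
suppressed = [t - {"b1"} for t in data]
after = violations(suppressed, sensitive, rho=0.7)
```

In this toy data every transaction containing b1 also contains alpha, so anyone who knows that a person bought b1 learns alpha with confidence 1. Removing b1 brings every remaining rule below ρ = 0.7.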
Privacy Preserving Data Mining
28
- Possible Attacks
ρ-uncertainty (Cao VLDB '10)
Given a transaction dataset, sensitive items Is, uncertainty level ρ = 0.7, a hierarchy of
non-sensitive items, the published data after suppression
Current Research Areas
1
Privacy-Preserving Data Publishing
K-anonymity
Try to prevent re-identification
Utility-based Privacy-Preserving
Distributed Privacy with Adversarial
Collaboration
Privacy-Preserving Application
Association rules hiding
Privacy Preserving Data Publishing
1
Hospital Patient Data (Name, ID are hidden)
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info)
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Andre has heart disease!
Privacy Preserving Data Publishing
2
Privacy Preserving Data Publishing
3
“Fuzz” the data
k-anonymity, at least k tuples in one group
Utility-based Privacy-Preserving
1
Utility-based Privacy-Preserving
2
Utility-based Privacy-Preserving
3
Q1: “How many customers under age 29 are there in the data set?”
Q2: “Is an individual with age = 25, education = Bachelor, Zip Code = 53712 a target customer?”
Table 2, answers: “2”; “Y”
Distributed Privacy with Adversarial
Collaboration
Input privacy (2)
[Figure: D_1 + D_2 + D_3, DM, R_O]
Association rule mining
Input: D_O, min_supp, min_conf
Output: R_O

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%

#  Rule     (support, confidence)
1  B=>A     (66%, 100%)
2  C=>A     (66%, 100%)
3  B=>C     (50%, 75%)
4  C=>B     (50%, 75%)
5  AB=>C    (50%, 75%)
6  AC=>B    (50%, 75%)
7  BC=>A    (50%, 100%)
8  C=>AB    (50%, 75%)
9  B=>AC    (50%, 75%)

|A|=6, |B|=4, |C|=4
|AB|=4, |AC|=4, |BC|=3
|ABC|=3
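The rule list above can be reproduced with a brute-force miner. This sketch makes two assumptions that should be read as such: the database includes a sixth transaction T6 = AC (implied by the counts |A|=6 and |AC|=4 and listed on the later hiding slide), and the 33% threshold is taken as exactly 1/3. The function names are illustrative.

```python
from itertools import combinations

# Assumed original database D_O, including T6 = AC.
D_O = {"T1": "ABC", "T2": "ABC", "T3": "ABC", "T4": "AB", "T5": "A", "T6": "AC"}

def support_count(db, itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= set(items))

def mine_rules(db, min_supp, min_conf):
    """Enumerate all rules LHS=>RHS meeting both thresholds (exhaustive, small data only)."""
    n = len(db)
    items = sorted(set("".join(db.values())))
    rules = {}
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            count = support_count(db, itemset)
            if count / n < min_supp:
                continue
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    conf = count / support_count(db, lhs)
                    if conf >= min_conf:
                        rules["".join(lhs) + "=>" + "".join(rhs)] = (count / n, conf)
    return rules

R_O = mine_rules(D_O, min_supp=1 / 3, min_conf=0.7)
```

Rules such as A=>B are dropped because their confidence (4/6) falls below 70%, which matches the nine rules listed on the slide.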
Input: D_O, X (items to be hidden on LHS), min_supp, min_conf
Output: D_M

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%, X = {C}
D_M: one ABC transaction is modified to AC

D_M Association Rules
#   Rule     (support, confidence)
1   B=>A     (50%, 100%)
2   C=>A     (66%, 100%)
3   B=>C     (33%, 66%)
4   C=>B     (33%, 50%)
5   AB=>C    (33%, 66%)
6   AC=>B    (33%, 50%)
7   BC=>A    (33%, 100%)
8   C=>AB    (33%, 50%)
9   B=>AC    (33%, 66%)
10  A=>B     (50%, 50%)
11  A=>C     (50%, 66%)
12  A=>BC    (33%, 33%)

|A|=6, |B|=3, |C|=4
|AB|=3, |AC|=4, |BC|=2
|ABC|=2

Slide annotations on the rules: hidden (3 rules), lost (3 rules), try (2 rules)
Association Rule Hiding
3
- Side effects
Hiding failure, lost rules, new rules
[Figure: diagram of the rule sets R, R_h and ~R; region ② marks the lost rules]
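The three side effects can be expressed as set operations on rule sets. The sketch below takes the rule lists from the slides' A/B/C example; it assumes that the sensitive set R_h is every rule with C on its left-hand side and that a 33% support meets the 33% threshold. Both are assumptions for illustration, as are the function and variable names.

```python
def side_effects(original_rules, sensitive_rules, rules_after):
    """Classify the outcome of hiding, using plain set algebra."""
    return {
        "hidden": sensitive_rules - rules_after,                       # sensitive rules that disappeared
        "hiding_failure": sensitive_rules & rules_after,               # sensitive rules still minable
        "lost_rules": (original_rules - sensitive_rules) - rules_after,  # non-sensitive rules destroyed
        "new_rules": rules_after - original_rules,                     # rules that did not exist before
    }

# Rules mined from the original database (9 rules on the slide).
R = {"B=>A", "C=>A", "B=>C", "C=>B", "AB=>C", "AC=>B", "BC=>A", "C=>AB", "B=>AC"}
# Rules that still meet min_supp and min_conf after modification (assumed reading of the slide).
R_after = {"B=>A", "C=>A", "BC=>A"}
# Assumed sensitive set: every rule with C on the left-hand side.
R_h = {rule for rule in R if "C" in rule.split("=>")[0]}

effects = side_effects(R, R_h, R_after)
```

Under these assumptions the result has three hidden rules, three lost rules, and two sensitive rules that survive, which agrees with the "hidden", "lost" and "try" annotations on the hiding slide.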
Association Rule Hiding
4
Output privacy
[Figure labels: D_O, R_O, D_M, R_M, DM, Modification]
[Figure labels: D_O, R_O, D'_O, DM, Modification]
• Input privacy (1)
R_M = R_O - R_H
Recent Studies
1
-
Informative Association Rule Set (IRS)
Informative Association rules
Input: D_O, min_supp, min_conf, X = {C}
Output: {C=>A (66%, 100%), C=>B (50%, 75%)}

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB

#  Rule     (support, confidence)
1  B=>A     (66%, 100%)
2  C=>A     (66%, 100%)
3  B=>C     (50%, 75%)
4  C=>B     (50%, 75%)
5  AB=>C    (50%, 75%)
6  AC=>B    (50%, 75%)
7  BC=>A    (50%, 100%)
8  C=>AB    (50%, 75%)
9  B=>AC    (50%, 75%)

Rules #6, 7, 8 predict same RHS {A,B} as #2, 4
Recent Studies
2
-
Hiding IRS 1 (LHS)
Input: D_O, X (items to be hidden on LHS), min_supp, min_conf
Output: D_M

TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

min_supp = 33%, min_conf = 70%, X = {C}
D_M: one ABC transaction is modified to AC

D_M Association Rules
#   Rule     (support, confidence)
1   B=>A     (50%, 100%)
2   C=>A     (66%, 100%)
3   B=>C     (33%, 66%)
4   C=>B     (33%, 50%)
5   AB=>C    (33%, 66%)
6   AC=>B    (33%, 50%)
7   BC=>A    (33%, 100%)
8   C=>AB    (33%, 50%)
9   B=>AC    (33%, 66%)
10  A=>B     (50%, 50%)
11  A=>C     (50%, 66%)
12  A=>BC    (33%, 33%)

|A|=6, |B|=3, |C|=4
|AB|=3, |AC|=4, |BC|=2
|ABC|=2

Slide annotations on the rules: hidden (3 rules), lost (3 rules), try (2 rules)
Recent Studies
3
–
Proposed Algorithms
Strategy:
To lower the confidence of a given rule X => Y,
either
Increase the support of X, but not XY, OR
Decrease the support of XY (or both X and XY)
$\mathrm{Conf}(X \Rightarrow Y) = \frac{\mathrm{support}(XY)}{\mathrm{support}(X)} = \frac{|XY|}{|X|}$
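Both options of the strategy can be tried on the small A/B/C database used throughout the slides (six transactions, including T6 = AC as listed on the hiding slide). The choice of rule C => B and of which transactions to modify is purely illustrative and is not the authors' ISL, DSR or DSC algorithm.

```python
def confidence(db, lhs, rhs):
    """Conf(X => Y) = support(XY) / support(X), computed from transaction counts."""
    supp_x = sum(1 for t in db.values() if lhs <= t)
    supp_xy = sum(1 for t in db.values() if (lhs | rhs) <= t)
    return supp_xy / supp_x

def copy_db(db):
    return {tid: set(items) for tid, items in db.items()}

D_O = {tid: set(items) for tid, items in
       {"T1": "ABC", "T2": "ABC", "T3": "ABC", "T4": "AB", "T5": "A", "T6": "AC"}.items()}
X, Y = {"C"}, {"B"}
original = confidence(D_O, X, Y)            # 3 / 4

# Option 1: increase support of X but not XY (add C to a transaction without B).
raised_lhs = copy_db(D_O)
raised_lhs["T5"].add("C")                   # T5: A -> AC
conf_raised = confidence(raised_lhs, X, Y)  # 3 / 5

# Option 2: decrease support of XY (remove B from a transaction containing C and B).
lowered_xy = copy_db(D_O)
lowered_xy["T1"].discard("B")               # T1: ABC -> AC
conf_lowered = confidence(lowered_xy, X, Y)  # 2 / 4
```

Either single modification pushes the confidence of C => B below the 70% threshold used in the slides; the second one is the modification that produces the D_M counts shown earlier.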
Recent Studies
4
–
Proposed Algorithms
Multiple scans of database (Apriori-based)
Increase Support of LHS First (ISL)
Decrease Support of RHS First (DSR)
One scan of database
Decrease Support and Confidence (DSC)
Propose Pattern-Inversion tree to store data
Maintenance of hiding informative association rule sets
Recent Studies
5
-
Analysis
For multiple-scan algorithms:
Time effects
DSR is faster than ISL
because its set of candidate transactions is smaller
Database effects
Recent Studies
6
-
Analysis
Side effects
DSR: no hiding failure (0%), few new rules (5%) and
some lost rules (11%)
ISL: shows some hiding failure (12.9%), many new
Recent Studies
7
-
Proposed Algorithms
One-scan algorithm DSC
Pattern-inversion tree
TID  Items
T1   ABC
T2   ABC
T3   ABC
[Figure: pattern-inversion tree with nodes Root, A:6:[T5], B:4:[T4], C:1:[T6]]
Recent Studies
8
-
Proposed Algorithms
One-scan algorithm DSC
Time effects, Database effects
Recent Studies
9
-
Proposed Algorithms
Recent Studies
10
-
Analysis
For single-scan algorithm:
DSC is $O(2|D| + |X| \cdot l_2 \cdot K \log K)$,
where $|X|$ is the number of items in $X$, $l_2$ is the maximum number of large 2-itemsets, and $K$ is the maximum number of iterations in the DSC algorithm.
SWA is $O\left(\frac{(n_1 - 1)\, n_1}{2} \cdot |D| \cdot K_w\right)$,
where $n_1$ is the initial number of restrictive rules in the database $D$ and $K_w$ is the window size chosen.
SWA has a higher order of complexity, $O(l_2^2 \cdot |X|^2 \cdot |D|^2)$, if $K_w \approx |D|$.
Recent Studies
11
–
Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D    D'
T1   111  001
T2   111  111
T3   111  111
T4   110  110

New transactions:
TID
T7   101
T8   101
T9   110

D+ = D + (new transactions T7 to T9)
Recent Studies
12
–
Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101
T9   110  110    110
Recent Studies
13
–
Maintenance of Hiding IRS
Recent Studies
14
–
Maintenance of Hiding IRS
Recent Studies
15
–
K-anonymity and K^m-anonymity
K-anonymity
Every record is identical to at least k-1 other records on the quasi-identifiers (QIs)
K^m-anonymity
The support count of every m-itemset ≥ k
Domain Generalization Hierarchy (DGH)
Data types
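Both definitions can be checked mechanically on 0/1 transaction data. The sketch below uses the six-transaction table from the 3-anonymity example that follows in the slides, before and after modification. Two points are assumptions on my part: K^m-anonymity is read as "every itemset of up to m items that occurs at all has support at least k", and the cost is counted as the number of flipped cells.

```python
from collections import Counter
from itertools import combinations

ITEMS = ("e1", "e2", "e3")

# Transaction table from the slides' 3-anonymity example (rows are e1, e2, e3 bits).
before = {"T1": (1, 1, 0), "T2": (0, 0, 1), "T3": (1, 0, 1),
          "T4": (1, 0, 0), "T5": (1, 0, 0), "T6": (1, 0, 1)}
after = {"T1": (1, 0, 0), "T2": (1, 0, 1), "T3": (1, 0, 1),
         "T4": (1, 0, 0), "T5": (1, 0, 0), "T6": (1, 0, 1)}

def is_k_anonymous(table, k):
    """Every row must be identical to at least k-1 other rows."""
    return all(count >= k for count in Counter(table.values()).values())

def is_km_anonymous(table, k, m):
    """Every occurring itemset with at most m items must have support >= k (assumed reading)."""
    counts = Counter()
    for row in table.values():
        present = [item for item, bit in zip(ITEMS, row) if bit]
        for size in range(1, m + 1):
            counts.update(combinations(present, size))
    return all(count >= k for count in counts.values())

def edit_cost(original, modified):
    """Number of item additions plus deletions between two tables."""
    return sum(x != y for tid in original for x, y in zip(original[tid], modified[tid]))

cost = edit_cost(before, after)
```

Here the anonymized table is reached by deleting e2 from T1 and adding e1 to T2, so the two groups {T1, T4, T5} and {T2, T3, T6} each contain three identical rows.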
Recent Studies
16
–
K-anonymity on transaction data with minimal
addition/deletion of items
3-anonymity
Groups: {T1, T4, T5}, {T2, T3, T6}

Before:
    e1  e2  e3
T1  1   1   0
T2  0   0   1
T3  1   0   1
T4  1   0   0
T5  1   0   0
T6  1   0   1

After:
    e1  e2  e3
T1  1   0   0
T2  1   0   1
T3  1   0   1
T4  1   0   0
T5  1   0   0
T6  1   0   1

Recent Studies
17
–
K-anonymity on transaction data with minimal
addition/deletion of items
    a1  a2  a3
T1  1   1   1
T2  1   1   0
T4  0   1   1
T5  1   0   0

    a1  a2  a3
T3  1   0   1
T6  0   1   0

    a1  a2  a3
T1  1   1   1
T2  1   1   0
T3  1   0   1
T4  0   1   1

    a1  a2  a3
T5  1   0   0
T6  0   1   0

Recent Studies
18
–
K-anonymity on transaction data with minimal
addition/deletion of items
Propose an O(log k)-approximation solution to
Recent Studies
19
–
K-anonymity and K^m-anonymity
Data type    K-anonymity    K^m-anonymity
Relational + DGH
Generalization & suppression: Samarati TKDE '01, Sweeney IJUFKS '02;
Clustering: Li DAWAK '06; Byun DASFAA '07; Mohammed KDD '09, LKC-privacy, Top-down;
Relational, no DGH
Approximation algorithms (minimize suppression, clustering): Park SIGMOD '07;
Clustering: Aggarwal PODS '06, r-gather
Recent Studies
20
–
Graph and Spatial Data
Data type    K-anonymity    K^m-anonymity
Graph + DGH
Campan PinKDD '08, SaNGreeA;
Graph, no DGH
1. Kuramochi DMKD '05, find frequent patterns in large sparse graph;
2. Liu SIGMOD '08, k-degree, k nodes w/ same edge #, random, small-world, scale-free, prefuse, Enron, powergrid, co-authors graphs;
3. Chang VLDB '09, Predictive anonymity;
4. Zou VLDB '09, K-automorphism for multiple structural attacks, prefuse & co-author graphs, Pajek sw Erdos Renyi, scale-free models generated;
5. Bhagat VLDB '09, class-based anonymity;
6. Puttaswamy CoNEXT '09, Intersection
Recent Studies
21
–
Graph and Spatial Data
Data type    K-anonymity    K^m-anonymity
Graph, edge weight
1. Liu SDM '09, anonymize sensitive edge weights of undirected graph, maintain shortest paths, Gaussian randomization, greedy perturbation, EIES and synthetic datasets;
2. Liu ICIS '10, k-anonymous weighted edges on directed graph, k edges with weight differences less than ;
3. Das ICDE '10, anonymize directed edge weights, keep linear property, generate linear inequalities for shortest paths, LP solver, model Flickr, LiveJournal, Orkut, Youtube;
Recent Studies
22
–
K-anonymity and K^m-anonymity
Major issues
Large data volume
High dimensionality
Sparseness
Approaches
With DGH: generalization or suppression, bottom-up or top-down
No DGH: clustering,
Discussions
1
From relational and set data to graph data
Privacy-preserving social networking
Privacy-preserving collaborative filtering
Discussions
2
From centralized data to distributed data
Distributed databases
Horizontally partitioned
Grocery shopping data collected
by different supermarkets
Credit card databases of two
different credit unions
“fraudulent customers often have
Discussions
3
From centralized data to distributed data
Distributed databases
Vertically partitioned
Discussions
4
From data privacy to information privacy
Hiding aggregated information
Car dealer inventory, hide stock, not individual query
Airline: hide total seats left, to prevent terrorists from choosing less crowded flights
References
Some websites
Privacy-Preserving Data Mining
Privacy Preserving Data Mining: Models and Algorithms (http://www.springerlink.com/content/978-0-387-70991-8)
http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true
http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html
http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html
Kdnuggets: www.kdnuggets.com