Privacy Preserving Data Modeling:
An Overview
Leon S.L. Wang
Outline
Private Query of Public Information
Privacy Preservation
  Objective, common practices, possible attacks
Current Research Areas
  K-Anonymity, Utility, Distributed Privacy, Association Rule Hiding, Location Privacy
Recent Studies
  Association Rule Set Hiding
  K-anonymity on Transaction Data
  K-anonymous Path Privacy
Private query of public information
Querying public information is a daily activity.
You have easy access to your own data, any time, any place.
But the same is true for others getting your data (a potential privacy breach).
Private query of public information
"Privately"
Data privacy: the user cannot retrieve data other than what was requested (and cannot infer new information)
  Inferring name, ID, medical records, political/religious/sexual preferences, …
  Requesting data block i, but receiving data blocks i-1, i, i+1, …
User privacy: the server is not aware of the data a user is actually retrieving
  Alice is searching for gold (going around with a GPS) and checking the server to verify if the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?
Privacy Preservation (1)
Motivating example – Group Insurance Commission: the MA governor's medical record was found
Privacy Preservation (2)
Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

Andre has heart disease!
Privacy Preservation (3)
Motivating example – A Face Is Exposed for AOL Searcher No. 4417749
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.
Privacy Preservation (4)
Motivating example – America Online (AOL)
~650k users, 3-month period, ~20 million queries released
No names, no SSNs, no driver license numbers, no credit card numbers
The user with ID 4417749 was found to be Thelma Arnold, a 62-year-old woman living in Georgia.
Loss of privacy for users, damage to AOL, and significant damage to academics who depend on such data.
Privacy Preservation (5)
Motivating example – Netflix Prize
In October 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.
Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.
Anonymization: replacing usernames with random identifiers.
Privacy Preservation (6)
Motivating example – Association Rules
Supplier: ABC Paper Company
Retailer: XYZ Supermarket Chain
1. Allow ABC to access customer XYZ's DB
2. Predict XYZ's inventory needs & offer reduced prices
Privacy Preservation (7)
Supplier ABC discovers (through data mining):
  Sequence: Cold remedies -> Facial tissue
  Association: (Skim milk, Green paper)
Supplier ABC runs a coupon marketing campaign:
  "50 cents off skim milk when you buy ABC products"
Results:
Privacy Preservation (8) – Objective
Privacy
  The state of being private; the state of not being seen by others
Database security
  To prevent loss of privacy due to viewing/disclosing unauthorized data
Privacy Preservation
Privacy Preservation (9) – Common Practices
Limiting access
  Control access to the data
  Used by the secure DBMS community
"Fuzz" the data
  Force aggregation into daily records instead of individual transactions, or slightly alter data values
Privacy Preservation (10) – Common Practices
Eliminate unnecessary groupings
  The first 3 digits of an SSN are assigned sequentially by office
  Clustering on the high-order bits of a "unique identifier" is likely to group similar data elements
  Assign unique identifiers randomly instead
Augment the data
  Populate the phone book with extra, fictitious people in non-obvious ways
  Return correct info when asked about an individual, but incorrect info when asked about all individuals in a department
Privacy Preservation (11) – Common Practices
Audit
Detect misuse by legitimate users
Administrative or criminal disciplinary action may be initiated
Privacy Preservation (12) – Possible Attacks
Linking attacks (Sweeney IJUFKS '02)
  Re-identification
  Identity linkage (k-anonymity)
  Attribute linkage (l-diversity)

Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Privacy Preservation (13) – Possible Attacks
Corruption attacks (Tao ICDE '08, Chaytor ICDM '09)
  Background knowledge; Perturbed generalization (PG)
Privacy Preservation (14) – Possible Attacks
Differential privacy (Dwork ICALP '06)
  Add noise to the query answers so that the difference between any query output over 30 records and over any 29 of them is very small (within a differential).
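A minimal sketch of this idea (not from the slides; the Laplace mechanism is the standard construction, and the function names here are illustrative): noise drawn from a Laplace distribution with scale 1/ε masks the presence or absence of any single record in a counting query.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Sample Laplace(0, scale) by the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, rng=random):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the true count by at most 1, so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With ε = 1 the reported count is typically within a few units of the true count, yet the output distributions over 30 records and any 29 of them differ by at most a factor of e^ε.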
Privacy Preservation (15) – Possible Attacks
Realistic adversaries (Machanavajjhala VLDB '09)
  Weak privacy: l-diversity, t-closeness
    Adversaries need to know very specific information
  Strong privacy: differential privacy
    Adversaries need to know all information except the victim's
  Epsilon-privacy
    The adversary's knowledge can vary and learn, and is characterized by a stubbornness parameter
Privacy Preservation (16) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Degree attack: knowing Bob has 4 friends =>
Privacy Preservation (17) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Sub-graph attack (one match): knowing Bob's friends and friends' friends => vertex 7 is Bob
Privacy Preservation (18) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Sub-graph attack (k-match): knowing Bob's friends => 8 matches, but they share common labels 6 & 7 => vertex 7 is still uniquely identified as Bob
Privacy Preservation (19) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Hub fingerprint attack: some hubs have been identified, and the adversary knows the distances between Bob and the hubs
Privacy Preservation (20) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
  Label attack
  Interaction graph
Privacy Preservation (21) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation (22) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation (23) – Possible Attacks
Active attacks (Backstrom WWW '07)
  Plant a subgraph H into G and connect it to targeted nodes (adding new nodes and edges)
  Recover H from G and identify the targeted nodes' identities and relationships
  Walk-based (largest H), cut-based (smallest H)
Privacy Preservation (24) – Possible Attacks
Passive attacks (no nodes or edges added)
  Start from a coalition of friends (nodes) in the anonymized graph G and discover the existence of edges among the users to whom they are linked
Semi-passive attacks (edges added only, no nodes)
  From existing nodes in G, add fake edges to targeted nodes
Privacy Preservation (25) – Possible Attacks
Intersection attacks (Puttaswamy CoNEXT '09)
  Two users, A and B, were compromised
  A queries the server for the visitors of "website xyz"
  B queries the server for the visitors of "website xyz"
Privacy Preservation (26) – Possible Attacks
Intersection attacks
  StarClique (add latent edges)
  The graph evolution process for a node: the node first selects a subset of its neighbors, then builds a clique with the members of this subset, and finally connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.
Privacy Preservation (27) – Possible Attacks
Relationship attacks (Liu SDM '09)
  Sensitive edge weights, e.g., transaction expenses in a business network
  Reveal the shortest path between source and sink, e.g., A -> D
Privacy Preservation (28) – Possible Attacks
Relationship attacks (Liu SDM '09)
  Preserve the shortest path, e.g., A -> D
  Minimal perturbation on path length and path weight
Privacy Preservation (28) – Possible Attacks
Relationship attacks (Liu ICIS '10)
  Preserve the shortest path, e.g., A -> D
  K-anonymous weight privacy: the blue edge group and the green edge group satisfy 4-anonymous privacy (with the privacy parameter equal to 10).
Privacy Preservation (29) – Possible Attacks
Relationship attacks (Das ICDE '10)
  Preserve a linear property, e.g., shortest paths. The ordering of the five edge weights is preserved after naive anonymization:
  x5 ≦ x1 ≦ x4 ≦ x3 ≦ x2, where x1=(v1, v2), x2=(v1, v4), x3=(v1, v3), x4=(v2, v4), x5=(v3, v4)
  The minimum cost spanning tree is preserved: {(v1, v2), (v2, v4), (v1, v3)}. The shortest path from v1 to v4 is changed.
  The ordering of the edge weights is still exposed. For example, v3 and v4 are best friends while v1 and v4 are not such good friends.
Privacy Preservation (30) – Possible Attacks
Location Privacy (Papadopoulos VLDB '10)
  Preserve the privacy of a user's location in location-based services, e.g., Google or Bing maps
  Alice is searching for gold (going around with a GPS) and checking the server to verify if the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?
Privacy Preservation (32) – Possible Attacks
Inference through data mining attacks
  Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden
  Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ
  DO -> RO (mining); DO -> DM (modification); DM -> RM (mining)
Privacy Preservation (33) – Possible Attacks
ρ-uncertainty (Cao VLDB '10)
  Given a transaction dataset, sensitive items Is, and uncertainty level ρ, the objective is to make the confidence of every sensitive association rule less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ∈ Is.
  If Alice knows Bob bought b1, then she knows Bob also bought {a1, b2, α, …}, where Is = {α, …}
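As an illustrative sketch (function names are mine, not from the Cao paper), a brute-force check of ρ-uncertainty enumerates every candidate antecedent χ and sensitive item α and verifies Conf(χ -> α) < ρ:

```python
from itertools import combinations

def support(db, itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in db if itemset <= t)

def is_rho_uncertain(db, sensitive, rho):
    # db: list of transactions (sets of items); sensitive: the set Is.
    # Check Conf(chi -> alpha) < rho for every antecedent chi that
    # actually occurs and every sensitive item alpha.
    items = set().union(*db)
    for alpha in sensitive:
        others = sorted(items - {alpha})
        for r in range(1, len(others) + 1):
            for chi in combinations(others, r):
                s = support(db, set(chi))
                if s and support(db, set(chi) | {alpha}) / s >= rho:
                    return False
    return True
```

Real anonymizers avoid this exponential enumeration; the point here is only to make the definition concrete.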
Privacy Preservation (34) – Possible Attacks
ρ-uncertainty (Cao VLDB '10)
  Given a transaction dataset, sensitive items Is, uncertainty level ρ = 0.7, and a hierarchy of non-sensitive items, the data is published after suppression
Current Research Areas (1)
Privacy-Preserving Data Publishing
  K-anonymity: try to prevent privacy de-identification
  Utility-based Privacy Preservation
  Distributed Privacy with Adversarial Collaboration
Privacy-Preserving Applications
  Association rule hiding
Current Research Areas – Privacy-Preserving Data Publishing (1)
Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Current Research Areas – Privacy-Preserving Data Publishing (2)
An example of 2-Anonymity (one-to-many approach)
Current Research Areas – Privacy-Preserving Data Publishing (3)
"Fuzz" the data
  k-anonymity: at least k tuples in each group of quasi-identifier values
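A simple way to see the definition: after generalization, every combination of quasi-identifier values must occur at least k times. A small checker (an illustrative sketch, stdlib only):

```python
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    # Count how many rows share each quasi-identifier combination;
    # k-anonymity requires every group to have at least k members.
    groups = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(count >= k for count in groups.values())
```

For example, a generalized table with rows ("197*", "Male", "5371*") appearing twice and ("198*", "Female", "5370*") appearing twice is 2-anonymous on those three quasi-identifiers.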
Current Research Areas – Utility-based Privacy Preservation (1)
Current Research Areas – Utility-based Privacy Preservation (2)
Current Research Areas – Utility-based Privacy Preservation (3)
Q1: "How many customers under age 29 are there in the data set?"
Q2: "Is an individual with age=25, education=Bachelor, Zip Code=53712 a target customer?"
Table 2 answers: "2"; "Y"
Current Research Areas – Distributed Privacy with Adversarial Collaboration
  Input privacy (2): D1 + D2 + D3 -> RO
Current Research Areas – Association Rule Hiding (1)
Association rule mining
Input: DO, min_supp, min_conf
Output: RO

DO:                 min_supp = 33%, min_conf = 70%
TID  Items          Support counts: |A|=6, |B|=4, |C|=4,
T1   ABC            |AB|=4, |AC|=4, |BC|=3, |ABC|=3
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

RO (association rules):
1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)
Not association rules (confidence below min_conf):
10 A=>B   (66%, 66%)
11 A=>C   (66%, 66%)
12 A=>BC  (50%, 50%)

Current Research Areas – Association Rule Hiding (2)
Input: DO, X (items to be hidden on the LHS), min_supp, min_conf
Output: DM
X = {C}; one ABC transaction is modified to AC
Resulting rules (|A|=6, |B|=3, |C|=4):
1  B=>A   (50%, 100%)
2  C=>A   (66%, 100%)  try (still to hide)
3  B=>C   (33%, 66%)   lost
4  C=>B   (33%, 50%)   hidden
5  AB=>C  (33%, 66%)   lost
6  AC=>B  (33%, 50%)   hidden
7  BC=>A  (33%, 100%)  try (still to hide)
8  C=>AB  (33%, 50%)   hidden
9  B=>AC  (33%, 66%)   lost
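The toy database DO above can be mined with a few lines of brute-force Python (an illustrative sketch, not the algorithm used in the cited work); it reproduces exactly the nine rules that meet min_supp = 33% and min_conf = 70%:

```python
from itertools import combinations

def mine_rules(db, min_supp, min_conf):
    # Brute force: enumerate every frequent itemset Z and every split
    # Z = L + R, keeping rules L => R with enough support and confidence.
    items = sorted(set().union(*db))
    n = len(db)
    supp = lambda s: sum(1 for t in db if s <= t) / n
    rules = []
    for size in range(2, len(items) + 1):
        for z in combinations(items, size):
            zset = frozenset(z)
            if supp(zset) < min_supp:
                continue
            for r in range(1, size):
                for lhs in combinations(z, r):
                    lset = frozenset(lhs)
                    if supp(zset) / supp(lset) >= min_conf:
                        rules.append((lset, zset - lset))
    return rules

do = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
rules = mine_rules(do, 0.33, 0.70)  # 9 rules: B=>A, C=>A, B=>C, C=>B, ...
```

A=>B, A=>C, and A=>BC are correctly excluded because their confidence (66%, 66%, 50%) falls below min_conf.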
Current Research Areas – Association Rule Hiding (3) – Side effects
Hiding failure, lost rules, new rules
(Venn-diagram regions over R, Rh, and ~Rh: ① hiding failure, ② lost rules, ③ new rules)
Current Research Areas – Association Rule Hiding (4)
Output privacy: DO -> RO, DM -> RM (DO -> DM by modification)
Input privacy (1): DO -> RO, D'O -> DM (DO -> D'O by modification)
Current Research Areas – Location Privacy (1)
Location Privacy (Papadopoulos VLDB '10)
  Location obfuscation: send an additional set of "dummy" queries alongside the actual query
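A toy sketch of obfuscation through dummies (illustrative only; real schemes choose dummies so they are statistically indistinguishable from genuine movement): the client sends k locations, only one of which is real, so the server sees the true position with probability 1/k.

```python
import random

def obfuscated_query_set(real_loc, k, radius, rng=random):
    # Mix the real location with k-1 dummy locations drawn from a
    # square of side 2*radius around it, then shuffle so the server
    # cannot tell which entry is genuine.
    lat, lon = real_loc
    queries = [real_loc]
    for _ in range(k - 1):
        queries.append((lat + rng.uniform(-radius, radius),
                        lon + rng.uniform(-radius, radius)))
    rng.shuffle(queries)
    return queries
```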
Current Research Areas – Location Privacy (2)
Location Privacy
Data transformation
Current Research Areas – Location Privacy (3)
Location Privacy
  PIR-based location privacy
  PIR-based queries are sent to the LBS server (computational PIR or a secure co-processor), and blocks are retrieved without the server learning which blocks were requested
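The classic two-server, XOR-based PIR scheme illustrates how a client can fetch block i without either (non-colluding) server learning i. This is a textbook construction, not necessarily the computational variant the slide refers to:

```python
import random

def xor_answer(db, query):
    # Each server XORs together the blocks whose indices it was sent.
    acc = 0
    for j in query:
        acc ^= db[j]
    return acc

def pir_fetch(db, i, rng=random):
    # Client: send a random index set S to server 1 and S XOR {i} to
    # server 2. Each set alone is uniformly random, so neither server
    # learns which block is wanted; the two answers XOR to db[i].
    n = len(db)
    s1 = {j for j in range(n) if rng.random() < 0.5}
    s2 = s1 ^ {i}  # symmetric difference flips membership of i only
    return xor_answer(db, s1) ^ xor_answer(db, s2)
```

Correctness follows from XOR algebra: blocks in both query sets cancel, leaving exactly db[i].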
Recent Studies (1) – Informative Association Rule Set (IRS)
Informative association rules
Input: DO, min_supp, min_conf, X = {C}
Output: {C=>A (66%, 100%), C=>B (50%, 75%)}

TID  Items          min_supp = 33%
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)
Rules #6, 7, 8 predict the same RHS {A, B} as #2, 4
Recent Studies (2) – Hiding IRS 1 (LHS)
Input: DO, X (items to be hidden on the LHS), min_supp, min_conf
Output: DM
X = {C}; one ABC transaction is modified to AC

TID  Items          min_supp = 33%, min_conf = 70%
T1   AC             |A|=6, |B|=3, |C|=4,
T2   ABC            |AB|=3, |AC|=4, |BC|=2, |ABC|=2
T3   ABC
T4   AB
T5   A
T6   AC

DM association rules:
1  B=>A   (50%, 100%)
2  C=>A   (66%, 100%)  try (still to hide)
3  B=>C   (33%, 66%)   lost
4  C=>B   (33%, 50%)   hidden
5  AB=>C  (33%, 66%)   lost
6  AC=>B  (33%, 50%)   hidden
7  BC=>A  (33%, 100%)  try (still to hide)
8  C=>AB  (33%, 50%)   hidden
9  B=>AC  (33%, 66%)   lost
10 A=>B   (50%, 50%)
11 A=>C   (66%, 66%)
12 A=>BC  (33%, 33%)
Recent Studies (3) – Proposed Algorithms
Strategy: to lower the confidence of a given rule X => Y, either
  increase the support of X but not XY, or
  decrease the support of XY (or of both X and XY)

Conf(X => Y) = support(XY) / support(X)
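The effect of the second option (a DSR-style deletion) can be seen on the running example; `conf` below is an illustrative helper, not code from the cited work:

```python
def conf(db, lhs, rhs):
    # Confidence = support(LHS | RHS) / support(LHS), with support
    # measured as the number of transactions containing the itemset.
    both = sum(1 for t in db if lhs | rhs <= t)
    return both / sum(1 for t in db if lhs <= t)

db = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
print(conf(db, {"B"}, {"C"}))  # 0.75 -> B=>C is visible at min_conf 70%
db[0].discard("C")             # decrease support of the RHS item C
print(conf(db, {"B"}, {"C"}))  # 0.5  -> B=>C is now hidden
```

Deleting C from a single transaction drops support(BC) from 3 to 2 while support(B) stays at 4, pushing the confidence below the 70% threshold.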
Recent Studies (4) – Proposed Algorithms
Multiple scans of the database (Apriori-based):
  Increase Support of LHS First (ISL)
  Decrease Support of RHS First (DSR)
One scan of the database:
  Decrease Support and Confidence (DSC)
  Proposes a Pattern-Inversion tree to store the data
Maintenance of hiding informative association rule sets
Recent Studies (5) – Analysis
For multiple-scan algorithms:
  Time effects: DSR is faster than ISL because the set of candidate transactions is smaller
  Database effects
Recent Studies (6) – Analysis
Side effects
  DSR: no hiding failure (0%), few new rules (5%), and some lost rules (11%)
  ISL: shows some hiding failure (12.9%) and many new rules
Recent Studies (7) – Proposed Algorithms
One-scan algorithm DSC
  Pattern-inversion tree, e.g., for TID/Items T1 ABC, T2 ABC, T3 ABC, …: Root with nodes A:6:[T5], B:4:[T4], C:1:[T6]

Recent Studies (6) – Proposed Algorithms
One-scan algorithm DSC
Time effects, Database effects
Recent Studies (9) – Proposed Algorithms
Recent Studies (10) – Analysis
For the single-scan algorithm:
  DSC is O(2|D| + |X|*l2*K*log K), where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in the DSC algorithm.
  SWA is O((n1-1)*n1/2*|D|*Kw), where n1 is the initial number of restrictive rules in the database D and Kw is the chosen window size.
  SWA has a higher order of complexity, O(l2²*|X|²*|D|²), if Kw = |D|.
Recent Studies (11) – Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D    (modified)
T1   111  001
T2   111  111
T3   111  111
T4   110  110

Incremental data D': T7 101, T8 101, T9 110
D+ = D + D'
Recent Studies (12) – Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101
T9   110  110    110
Recent Studies (13) – Maintenance of Hiding IRS
Recent Studies (14) – Maintenance of Hiding IRS
Recent Studies (15) – K-anonymity and Km-anonymity
K-anonymity
  Every record has k-1 other identical records on the quasi-identifiers (QIs)
Km-anonymity
  The support count of every m-itemset is ≧ k
Domain Generalization Hierarchy (DGH)
Data types
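A brute-force check of the Km-anonymity condition (an illustrative sketch; the published algorithms avoid this exponential enumeration): every itemset of up to m items that occurs at all must occur in at least k transactions.

```python
from itertools import combinations

def is_km_anonymous(db, k, m):
    # db: list of transactions (sets of items). For every itemset of
    # size <= m, its support must be either 0 (no one to re-identify)
    # or at least k.
    items = sorted(set().union(*db))
    for size in range(1, m + 1):
        for subset in combinations(items, size):
            count = sum(1 for t in db if set(subset) <= t)
            if 0 < count < k:
                return False
    return True
```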
Recent Studies (16) – K-anonymity on transaction data with minimal addition/deletion of items

3-anonymity: groups {T1, T4, T5}, {T2, T3, T6}

Before:             After:
     e1 e2 e3            e1 e2 e3
T1   1  1  0        T1   1  0  0
T2   0  0  1        T2   1  0  1
T3   1  0  1        T3   1  0  1
T4   1  0  0        T4   1  0  0
T5   1  0  0        T5   1  0  0
T6   1  0  1        T6   1  0  1

Recent Studies (17) – K-anonymity on transaction data with minimal addition/deletion of items

Candidate groupings over items (a1, a2, a3):
  {T1 111, T2 110, T4 011, T5 100} and {T3 101, T6 010, T7 001}
  {T1 111, T2 110, T3 101, T4 011} and {T5 100, T6 010}
  Full data: T1 111, T2 110, T3 101, T4 011, T5 100, T6 010, T7 001

Recent Studies (18) – K-anonymity on transaction data with minimal addition/deletion of items

Proposes an O(log k)-approximation solution to this problem
Recent Studies (19) – K-anonymous path privacy

Goal: minimally modify the graph such that there exist k shortest paths between each given pair of vertices specified in H, without adding or deleting any vertices or edges.
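To check the k-shortest-paths condition for one vertex pair, one can count minimum-weight paths with a Dijkstra variant (an illustrative checker, not the modification algorithm itself):

```python
import heapq

def count_shortest_paths(adj, src, dst):
    # adj: {node: [(neighbor, positive weight), ...]}. Returns the
    # shortest distance and the number of distinct minimum-weight paths.
    dist = {src: 0}
    count = {src: 1}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                count[v] += count[u]  # another equally short path
    return dist.get(dst), count.get(dst, 0)
```

K-anonymous path privacy would require this count to be at least k for every sensitive pair in H.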
(Figure: example graph on v1, v2, v4 with edge weights 1, 8, 7, before and after modification)

Recent Studies (19) – K-anonymous path privacy
Ratios of perturbed edges for different k
EIES data set: 48 nodes, 830 edges
Discussions (1)
Major issues
  Large data volume
  High dimensionality
  Sparseness
Discussions (2)
From relational and set data to graph, spatial, … data and beyond
  Privacy-preserving social networking (PPNP)
  Privacy-preserving collaborative filtering
Discussions (3)
From data privacy to information privacy
  Hiding aggregated information
    Car dealer inventory: hide total stock, not the individual query
    Airline: hide total seats left, to prevent terrorists from choosing a less crowded flight
Discussions (4)
Thank You