Social Networks and
Privacy Preservation
(Part II)
Leon S.L. Wang (王學亮)
Outline
Overview of Social Networks (Part I)
Analysis
Extraction, Construction, Application
Challenge
Privacy Preservation (Part II)
Privacy-Preserving Data Publishing (PPDP)
Privacy-Preserving Network Publishing (PPNP)
Privacy-Preserving Data Mining (PPDM)
Querying public information (e.g., in the cloud) is an activity of daily life. But:
Attacks from the client side (un-trusted clients)
Retrieve information, analyze it, re-identify personal information, then attack
Relational data privacy - PPDP
Set data (transaction data) privacy - PPDP
Graph data (social network data) privacy – PPNP, PPSN
Edge weight privacy – PPNP, PPSN
Attacks from the server side (un-trusted servers)
Based on a client's queries, identify personal information (location, identity), then attack
Location-based service privacy; obfuscation: cloaking and anonymity
Data transformation: incremental NN, encrypted databases, PIR-based (Private Information Retrieval)
Objective
To protect privacy of users and information (PPNP)
Identity disclosure
Link disclosure
Attribute disclosure
Recent Studies
PPDP, PPNP, PPDM
Motivating Examples, Objective, Common Practices, Possible
Attacks
Some Research Areas
K-Anonymity, Utility Issues, Distributed Privacy, Association Rule Hiding
Privacy Preserving Data Publishing
1
Motivating example – Group Insurance Commission: researchers re-identified the MA governor's medical record
Privacy Preserving Data Publishing
2
Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Motivating examples – Group Insurance
Privacy Preserving Data Publishing
3
Motivating examples – A Face Is Exposed
for AOL Searcher No. 4417749
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.
Thelma Arnold's identity was betrayed by AOL records of her Web searches, like ones for her dog.
Privacy Preserving Data Publishing
4
Motivating examples – America Online (AOL)
~650k users, 3-month period, ~20 million queries released
No names, no SSNs, no driver license #, no credit card #
The user, ID 4417749, was found to be Thelma Arnold, a 62-year-old woman living in Georgia.
Loss of privacy to users, damage to AOL, and significant damage to academics who depend on such data.
Privacy Preserving Data Publishing
5
Motivating examples – Netflix Prize
In October 2006, Netflix announced the $1-million Netflix Prize for improving its movie recommendation system.
Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.
Anonymization: usernames replaced with random identifiers.
It was shown that 84% of the subscribers could be uniquely identified by an attacker who knew 6 out of 8 of their movie ratings.
Privacy Preserving Network Publishing
7
You have control over what information you want to share and who you want to connect with.
You do not have a comprehensive and accurate idea of the information you have explicitly and implicitly disclosed.
Setting online privacy is time consuming, and many of you choose to accept the default settings.
Eventually you lose control… and you are
Privacy Preserving Data Mining
1
Motivating examples – Association Rules
Supplier: ABC Paper Company
Retailer: XYZ Supermarket Chain
1. Allow ABC to access customer XYZ's DB
2. Predict XYZ's inventory needs & offer reduced prices
Privacy Preserving Data Mining
2
Supplier ABC discovers (through data mining):
Sequence: Cold remedies -> Facial tissue
Association: (Skim milk, Green paper)
Supplier ABC runs coupon marketing
campaign:
“50 cents off skim milk when you buy ABC
products”
Results:
Customers buy Green paper from ABC
Lower sales of Green paper for XYZ (Bad)
Privacy Preservation
1
- Objective
Privacy
The state of being private; the state of not being seen by others
Database security
To prevent loss of privacy due to
viewing/disclosing unauthorized data
Privacy Preservation
2
- Common Practices
Limiting access
Control access to the data
Used by secure DBMS community
“Fuzz” the data
Forcing aggregation into daily records instead of individual transactions or slightly altering data values
Privacy Preservation
3
- Common Practices
Eliminate unnecessary groupings
The first 3 digits of SSNs are assigned by office sequentially
Clustering high-order bits of a “unique identifier” is
likely to group similar data elements
Unique identifiers are assigned randomly
Augment the data
Populate the phone book with extra, fictitious people in non-obvious ways
Return correct info when asking an individual, but return incorrect info when asking all individuals in
Privacy Preservation
4
- Common Practices
Audit
Detect misuse by legitimate users
Administrative or criminal disciplinary action may be initiated
Privacy Preservation
5
- Possible Attacks
Linking attacks (Sweeney IJUFKS '02)
Re-identification
Identity linkage (K-anonymity)
Attribute linkage (l-diversity)
(Hospital patient data and voter registration tables as in the GIC example above; linking on DOB, Sex, Zipcode re-identifies individuals.)
Privacy Preservation
6
- Possible Attacks
Corruption attacks (Tao ICDE '08, Chaytor ICDM '09)
Background knowledge; Perturbed generalization (PG)
Privacy Preservation
7
- Possible Attacks
Differential privacy (Dwork ICALP '06)
Add noise to query answers so that the output on a data set of, say, 30 records and on any neighboring set of 29 records (one record removed) differs only slightly (within a differential).
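A minimal sketch of this idea for counting queries: a count has sensitivity 1 (removing one record changes it by at most 1), so adding Laplace noise with scale 1/ε gives ε-differential privacy. The data and function names here are illustrative, not from the slides.

```python
import math, random

def noisy_count(records, predicate, epsilon):
    """Release a count under the Laplace mechanism.

    A count query has sensitivity 1, so Laplace(0, 1/epsilon) noise
    suffices; the sample is drawn via the inverse CDF."""
    true_count = sum(1 for r in records if predicate(r))
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_count + noise

ages = [25, 31, 42, 29, 28]
print(noisy_count(ages, lambda a: a < 30, epsilon=1.0))
```

The same query asked of the data set with any one record removed yields an answer whose distribution is almost indistinguishable, which is what "within a differential" means.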
Privacy Preservation
10
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Degree attack: knowing Bob has 4 friends => vertex 7 is Bob
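The degree attack amounts to filtering the anonymized graph by a known degree. The toy edge set below is hypothetical (the slide's figure is not reproduced here), but the logic is the attack itself:

```python
# Degree attack sketch: the adversary knows Bob has exactly 4 friends.
# Edges of a hypothetical anonymized graph (vertex labels are arbitrary):
edges = {(1, 7), (2, 7), (3, 7), (4, 7), (1, 2), (3, 4), (5, 6)}

def degree(v):
    """Number of edges incident to vertex v."""
    return sum(1 for e in edges if v in e)

# If exactly one vertex has the known degree, Bob is re-identified.
candidates = [v for v in range(1, 8) if degree(v) == 4]
print(candidates)  # [7]
```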
Privacy Preservation
11
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Sub-graph attack (one match): knowing Bob's friends and friends' friends => vertex 7 is Bob
Privacy Preservation
12
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Sub-graph attack (k-match): knowing Bob's friends => 8 matches, but the matches share common labels 6 & 7 => vertex 7 is still uniquely identified as Bob
Privacy Preservation
13
- Possible Attacks
Structural attacks (Zhou VLDB '09)
Hub fingerprint attacks
Some hubs have been identified, and the adversary knows the distances between Bob and the hubs
Privacy Preservation
14
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Label attack
Interaction graph
Privacy Preservation
15
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation
16
- Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation
17
- Possible Attacks
Active attacks (Backstrom WWW '07)
Plant a subgraph H into G and connect it to targeted nodes (add new nodes and edges)
Recover H from G and identify the targeted nodes' identities and relationships
Walk-based (largest H), cut-based (smallest H)
Privacy Preservation
18
- Possible Attacks
Passive attacks (no nodes, no edges added)
Start from a coalition of friends (nodes) in the anonymized graph G; discover the existence of edges among users to whom they are linked
Semi-passive attacks (add edges only, no nodes)
From existing nodes in G, add fake edges to targeted nodes
Privacy Preservation
19
- Possible Attacks
Intersection attacks (Puttaswamy CoNEXT '09)
Two users, A and B, were compromised. A queries the server for the visitors of "website xyz"; B queries the server for the same. The intersection of the results is a single user, C (privacy leak).
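The intersection attack is literally a set intersection; a sketch with hypothetical visitor sets:

```python
# Intersection attack sketch: two compromised users A and B each learn,
# from their own vantage point, which of their friends visited "website xyz".
seen_by_A = {"C", "D", "E"}   # visitors among A's friends (hypothetical)
seen_by_B = {"C", "F", "G"}   # visitors among B's friends (hypothetical)

# Intersecting the two views isolates a single common friend.
leaked = seen_by_A & seen_by_B
print(leaked)  # {'C'} -- C's visit is exposed
```

Defenses such as StarClique (next slide) add latent edges precisely so that this intersection never shrinks to a single user.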
Privacy Preservation
20
- Possible Attacks
Intersection attacks
StarClique (add latent edges)
The graph evolution process for a node. The node first selects a
subset of its neighbors. Then it builds a clique with the members of this subset. Finally, it connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.
Privacy Preservation
21
- Possible Attacks
Relationship attacks (Liu SDM '09)
Sensitive edge weights, e.g., transaction expenses in a business network
Reveal the shortest path between source and sink, e.g., A -> D
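A sketch of what the adversary computes: with exact edge weights published, Dijkstra's algorithm recovers the sensitive shortest path. The graph and weights below are hypothetical.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over an undirected weighted graph given as
    {node: [(neighbor, weight), ...]}; returns (cost, path)."""
    pq, seen = [(0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Hypothetical business network with sensitive transaction weights
g = {"A": [("B", 2), ("C", 5)], "B": [("A", 2), ("D", 6)],
     "C": [("A", 5), ("D", 1)], "D": [("B", 6), ("C", 1)]}
print(shortest_path(g, "A", "D"))  # (6, ['A', 'C', 'D'])
```

Weight-perturbation schemes (discussed later) aim to change the published weights while keeping properties such as this path intact.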
Privacy Preservation
22
- Possible Attacks
Location Privacy (Papadopoulos VLDB '10)
Preserving the privacy of the user's location in location-based services
Privacy Preservation
23
- Possible Attacks
Inference through data mining attacks
Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden
Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ
D_O --(DM)--> R_O --(modification)--> R_M = R_O - R_H
Privacy Preservation
24
- Possible Attacks
ρ-uncertainty (Cao VLDB '10)
Given a transaction dataset, sensitive items Is, and uncertainty level ρ, the objective is to make the confidence of all sensitive association rules less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ⊆ Is.
If Alice knows Bob bought b1, then she knows Bob also bought {a1, b2, α, }, where Is = {α, }
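The ρ-uncertainty condition can be checked directly: for every sensitive item and every antecedent, the inferred confidence must stay below ρ. A sketch with hypothetical transactions (single-item antecedents only, for brevity):

```python
def rho_uncertain(transactions, sensitive, rho):
    """Check Conf(x -> a) < rho for every sensitive item a and every
    single-item antecedent x; return (ok, first_violation)."""
    items = set().union(*transactions)
    for x in items:
        nx = sum(1 for t in transactions if x in t)
        for a in sensitive:
            if a == x or nx == 0:
                continue
            conf = sum(1 for t in transactions if x in t and a in t) / nx
            if conf >= rho:
                return False, (x, a, conf)
    return True, None

# Hypothetical data: everyone who bought b1 also bought sensitive "alpha",
# so Conf(b1 -> alpha) = 1.0 and 0.7-uncertainty is violated.
db = [{"b1", "alpha"}, {"b1", "alpha"}, {"a1"}]
print(rho_uncertain(db, {"alpha"}, 0.7))
```

Suppression or generalization (next slide) would remove or coarsen items until this check passes.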
Privacy Preservation
25
- Possible Attacks
ρ-uncertainty (Cao VLDB '10)
Given a transaction dataset, sensitive items Is, uncertainty level ρ = 0.7, a hierarchy of
non-sensitive items, the published data after suppression
Current Research Areas
1
Privacy-Preserving Data Publishing
K-anonymity
Try to prevent re-identification of individuals
Utility-based Privacy-Preserving
Distributed Privacy with Adversarial
Collaboration
Privacy-Preserving Application
Association rules hiding
Privacy-Preserving Network Publishing
Privacy Preserving Data Publishing
1
(Hospital patient data and voter registration tables repeated from the GIC example above.)
Privacy Preserving Data Publishing
2
K-Anonymity for linking attacks
Privacy Preserving Data Publishing
3
“Fuzz” the data
k-anonymity, at least k tuples in one group
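A sketch of this "fuzzing": generalize the quasi-identifier, then verify every generalized group holds at least k tuples. The generalization rule here (keep birth year, drop sex, truncate zip) is an illustrative choice, not the slide's.

```python
# k-anonymity sketch: generalize quasi-identifiers, then check group sizes.
rows = [("1/21/76", "Male", "53715"), ("1/21/76", "Male", "53703"),
        ("4/13/86", "Female", "53715"), ("4/13/86", "Female", "53706"),
        ("2/28/76", "Male", "53703"), ("2/28/76", "Female", "53706")]

def generalize(row):
    dob, sex, zipc = row
    # Keep only birth year, suppress sex, truncate zipcode to 3 digits.
    return (dob[-2:], "*", zipc[:3] + "**")

def is_k_anonymous(rows, k):
    """True iff every generalized group contains at least k tuples."""
    groups = {}
    for r in rows:
        groups.setdefault(generalize(r), []).append(r)
    return all(len(g) >= k for g in groups.values())

print(is_k_anonymous(rows, 2))  # True
```

With this table the generalization achieves 2-anonymity but not 3-anonymity, showing how the choice of generalization trades utility against the achievable k.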
Utility-based Privacy-Preserving
1
Utility-based Privacy-Preserving
2
Utility-based Privacy-Preserving
3
Q1: "How many customers under age 29 are there in the data set?"
Q2: "Is an individual with age=25, education=Bachelor, Zip Code=53712 a target customer?"
Table 2 answers: "2"; "Y"
Distributed Privacy with Adversarial
Collaboration
Input privacy (2): D_1 + D_2 + D_3 --(DM)--> R_O

Association rule mining
Input: D_O, min_supp, min_conf
Output: R_O

Example (D_O, min_supp = 33%, min_conf = 70%):
TID  Items
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC
Supports: |A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3
Association rules (support, confidence):
1. B=>A (66%, 100%)    2. C=>A (66%, 100%)    3. B=>C (50%, 75%)
4. C=>B (50%, 75%)     5. AB=>C (50%, 75%)    6. AC=>B (50%, 75%)
7. BC=>A (50%, 100%)   8. C=>AB (50%, 75%)    9. B=>AC (50%, 75%)
Not rules (conf < 70%): 10. A=>B (66%, 66%)   11. A=>C (66%, 66%)   12. A=>BC (50%, 50%)

Association rule hiding
Input: D_O, X (items to be hidden on the LHS), min_supp, min_conf
Output: D_M
With X = {C}, modify one ABC transaction (T2) to AC; now |A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, and:
1. B=>A (50%, 100%)            2. C=>A (66%, 100%, try again)
3. B=>C (33%, 66%, lost)       4. C=>B (33%, 50%, hidden)
5. AB=>C (33%, 66%, lost)      6. AC=>B (33%, 50%, hidden)
7. BC=>A (33%, 100%, try again) 8. C=>AB (33%, 50%, hidden)
9. B=>AC (33%, 66%, lost)      10. A=>B (50%, 50%)
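The support and confidence figures in the example can be reproduced with a few lines (a sketch; `DB` is the slide's D_O):

```python
# Support/confidence computation for the slide's D_O example.
DB = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"},
      {"A", "B"}, {"A"}, {"A", "C"}]

def supp(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(1 for t in DB if itemset <= t) / len(DB)

def conf(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return supp(lhs | rhs) / supp(lhs)

print(round(supp({"A", "B"}), 2), round(conf({"B"}, {"A"}), 2))  # 0.67 1.0
print(round(conf({"A"}, {"B"}), 2))  # 0.67 < min_conf => A=>B is not a rule
```

Rule hiding then amounts to editing transactions until each sensitive rule's support or confidence drops below its threshold.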
Association Rule Hiding
3
- Side effects
Hiding failure, lost rules, new rules
① Hiding failure: sensitive rules (R_h) that still appear in the released data
② Lost rules: non-sensitive rules (~R_h) that disappear as a side effect
③ New rules: rules not in R that appear only after modification
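The three side effects fall out of simple set algebra over the rule sets mined before and after sanitization. The rule names below are hypothetical placeholders:

```python
# Side-effect bookkeeping sketch: rules before (R) vs. after (Rp) sanitization,
# with Rh the sensitive rules the sanitizer tried to hide.
R  = {"B=>A", "C=>A", "B=>C", "C=>B"}   # rules in the original data
Rh = {"C=>A", "C=>B"}                    # sensitive rules to hide
Rp = {"B=>A", "C=>A", "D=>A"}            # rules mined from released data

hiding_failure = Rp & Rh        # sensitive rules that survived
lost_rules     = (R - Rh) - Rp  # innocent rules accidentally removed
new_rules      = Rp - R         # rules created by the modification
print(hiding_failure, lost_rules, new_rules)
```

An ideal sanitizer makes all three sets empty; in practice the slide's point is that they trade off against one another.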
Association Rule Hiding
4
Output privacy: D_O --(DM)--> R_O --(modification)--> R_M
• Input privacy (1): D_O --(modification)--> D'_O --(DM)--> R_M
R_M = R_O - R_H
Privacy-Preserving Network Publishing
1
K-anonymity
K-degree, neighborhood, automorphism,
k-isomorphism, k-symmetry, k-security, k-obfuscation
Generalization
Clustering nodes into supernodes
Randomization
Statistically add/delete/switch edges
Other works
Edge weighted graph, privacy scores
Output Perturbation
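A minimal sketch of the randomization approach listed above (delete k random edges, then add k random non-edges; the graph is hypothetical):

```python
import random

def randomize_edges(edges, n_nodes, k):
    """Rand-add/delete perturbation: remove k random edges and add k
    random non-edges, keeping the edge count unchanged."""
    edges = set(edges)
    removed = random.sample(sorted(edges), k)
    edges -= set(removed)
    # Candidate non-edges (u < v), excluding the just-removed ones.
    non_edges = [(u, v) for u in range(n_nodes) for v in range(u + 1, n_nodes)
                 if (u, v) not in edges and (u, v) not in removed]
    edges |= set(random.sample(non_edges, k))
    return edges

g = {(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)}
print(randomize_edges(g, 5, 2))
```

Switch-based variants instead swap endpoints of edge pairs so that the degree sequence (and hence degree-based utility) is preserved exactly.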
Privacy-Preserving Network Publishing
10
Relationship attacks (Liu SDM '09)
Preserving the shortest path, e.g., A -> D
Minimum perturbation of path length and path weight
Privacy-Preserving Network Publishing
11
Relationship attacks (Liu ICIS '10)
Preserving the shortest path, e.g., A -> D
K-anonymous weight privacy
The blue edge group and the green edge group satisfy 4-anonymous privacy, where the parameter = 10.
Privacy-Preserving Network Publishing
12
Relationship attacks (Das ICDE '10)
Preserving linear properties, e.g., shortest paths
The ordering of the five edge weights is preserved after naive anonymization:
x5 ≤ x1 ≤ x4 ≤ x3 ≤ x2, where x1=(v1, v2), x2=(v1, v4), x3=(v1, v3), x4=(v2, v4), x5=(v3, v4)
The minimum cost spanning tree is preserved: {(v1, v2), (v2, v4), (v1, v3)}
The shortest path from v1 to v4 is changed.
The ordering of the edge weights is still exposed; for example, v3 and v4 are best friends while v1 and v4 are not so close.
Privacy-Preserving Network Publishing
13
Location Privacy
1
Location Privacy (Papadopoulos VLDB '10)
Location obfuscation
Send an additional set of "dummy" queries along with the actual query
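A sketch of dummy-query obfuscation: the real location is hidden among fake locations drawn from a bounding box, and the batch is shuffled so the server cannot tell which query is genuine. Coordinates and parameters are illustrative.

```python
import random

def obfuscated_queries(real_loc, n_dummies, bbox):
    """Hide the real location among n_dummies fakes drawn uniformly
    from a bounding box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    dummies = [(random.uniform(min_x, max_x), random.uniform(min_y, max_y))
               for _ in range(n_dummies)]
    queries = dummies + [real_loc]
    random.shuffle(queries)   # server cannot tell which one is real
    return queries

qs = obfuscated_queries((24.95, 121.54), 4, (24.0, 121.0, 25.5, 122.0))
print(len(qs))  # 5 queries, only one of them real
```

The user filters the server's answers locally, keeping only the result for the real location.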
Location Privacy
2
Location Privacy (Papadopoulos VLDB '10)
Data transformation
Location Privacy
3
PIR-based location privacy
PIR-based queries are sent to the LBS server to retrieve blocks without the server discovering which blocks are requested
Discussions
1
From relational and set data to
graph data
Privacy-preserving social networking
Privacy-preserving collaborative filtering
Discussions
2
From centralized data to distributed data
Distributed databases
Horizontally partitioned
Grocery shopping data collected
by different supermarkets
Credit card databases of two
different credit unions
“fraudulent customers often have
Discussions
3
From centralized data to distributed data
Distributed databases Vertically partitioned
Discussions
4
From data privacy to information privacy
Hiding aggregated information
Car dealer inventory: hide total stock, not individual query results
Airline: hide total seats left, to prevent terrorists from picking a less crowded flight
Discussions
5
How to quantify the graph utility-privacy tradeoff, especially for (dynamic) rich graphs?
Existing data publication techniques do not provide guarantees on the accuracy of graph analysis
How to define utility?
Scalability is always an issue.
Differential-privacy-preserving social network mining
Tutorials
Privacy in data systems, Rakesh Agrawal, PODS03
Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
Models and methods for privacy preserving data publishing and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
Cryptographic techniques in privacy preserving data mining, Helger Lipmaa, PKDD06
Randomization based privacy preserving data mining, Xintao Wu, PKDD06
Privacy in data publishing, Johannes Gehrke & Ashwin Machanavajjhala, S&P09
Anonymized data: generation, models, usage, Graham Cormode & Divesh Srivastava, SIGMOD09
A tutorial of privacy-preservation of graphs and social networks, Xintao Wu, Xiaowei Ying, PAKDD 2011
Privacy-aware data management in information networks, Michael Hay, Kun Liu,