Privacy Preserving Data Modeling:
An Overview
Leon S.L. Wang
Outline
Private Query of Public Information
Privacy Preservation
  Objective, common practices, possible attacks
Current Research Areas
  K-Anonymity, Utility, Distributed Privacy, Association Rule Hiding, Location Privacy
Recent Studies
  Association Rule Set Hiding
  K-anonymity on Transaction Data
  K-anonymous Path Privacy
Private query of public information
Querying public information is a daily activity.
You have easy access to your own data, any time, any place.
But the same is true for others getting your data (a potential privacy breach).
Private query of public information
"Privately"
Data privacy: the user cannot retrieve data other than what was requested (and cannot infer new information)
  Inferring name, ID, medical records, political/religious/sexual preferences, …
  Requesting data block i, but receiving data blocks i-1, i, i+1, …
User privacy: the server is not aware of the data a user is actually retrieving
  Alice is searching for gold (going around with a GPS) and checking the server to verify if the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?
Privacy Preservation (1)
Motivating example – Group Insurance Commission: the MA governor's medical record was found
Privacy Preservation (2)
Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237

Andre has heart disease!
Privacy Preservation (3)
Motivating example – A Face Is Exposed for AOL Searcher No. 4417749
Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.
Privacy Preservation (4)
Motivating example – America Online (AOL)
~650k users, 3-month period, ~20 million queries released
No names, no SSNs, no driver license numbers, no credit card numbers
The user with ID 4417749 was found to be Thelma Arnold, a 62-year-old woman living in Georgia.
Loss of privacy for users, damage to AOL, and significant damage to academics who depend on such data.
Privacy Preservation (5)
Motivating example – Netflix Prize
In October 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.
Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.
Anonymization: replacing usernames with random identifiers.
Privacy Preservation (6)
Motivating example – Association Rules
Supplier: ABC Paper Company
Retailer: XYZ Supermarket Chain
1. Allow ABC to access customer XYZ's DB
2. Predict XYZ's inventory needs & offer reduced prices
Privacy Preservation (7)
Supplier ABC discovers (through data mining):
  Sequence: Cold remedies -> Facial tissue
  Association: (Skim milk, Green paper)
Supplier ABC runs a coupon marketing campaign:
  "50 cents off skim milk when you buy ABC products"
Results:
Privacy Preservation (8) – Objective
Privacy
  The state of being private; the state of not being seen by others
Database security
  To prevent loss of privacy due to viewing/disclosing unauthorized data
Privacy Preservation
Privacy Preservation (9) – Common Practices
Limiting access
  Control access to the data
  Used by the secure DBMS community
"Fuzz" the data
  Force aggregation into daily records instead of individual transactions, or slightly alter data values
Privacy Preservation (10) – Common Practices
Eliminate unnecessary groupings
  The first 3 digits of an SSN are assigned sequentially by office
  Clustering on the high-order bits of a "unique identifier" is likely to group similar data elements
  Assign unique identifiers randomly instead
Augment the data
  Populate the phone book with extra, fictitious people in non-obvious ways
  Return correct info when asked about an individual, but incorrect info when asked about all individuals in a department
Privacy Preservation (11) – Common Practices
Audit
Detect misuse by legitimate users
Administrative or criminal disciplinary action may be initiated
Privacy Preservation (12) – Possible Attacks
Linking attacks (Sweeney IJUFKS '02)
  Re-identification
  Identity linkage (k-anonymity)
  Attribute linkage (l-diversity)

Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Privacy Preservation (13) – Possible Attacks
Corruption attacks (Tao ICDE '08, Chaytor ICDM '09)
  Background knowledge; Perturbed generalization (PG)
Privacy Preservation (14) – Possible Attacks
Differential privacy (Dwork ICALP '06)
  Add noise to the query answers so that the difference between any query output over 30 records and over any 29 of them is very small (within a differential).
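A minimal sketch of this idea (not from the slides; the Laplace mechanism is the standard construction, and the function names here are illustrative): noise drawn from a Laplace distribution with scale 1/ε masks the presence or absence of any single record in a counting query.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Sample Laplace(0, scale) by the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, rng=random):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the true count by at most 1, so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With ε = 1 the reported count is typically within a few units of the true count, yet the output distributions over 30 records and any 29 of them differ by at most a factor of e^ε.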
Privacy Preservation (15) – Possible Attacks
Realistic adversaries (Machanavajjhala VLDB '09)
  Weak privacy: l-diversity, t-closeness
    Adversaries need to know very specific information
  Strong privacy: differential privacy
    Adversaries need to know all information except the victim's
  Epsilon-privacy
    The adversary's knowledge can vary and learn, and is characterized by a stubbornness parameter
Privacy Preservation (16) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Degree attack: knowing Bob has 4 friends =>
Privacy Preservation (17) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Sub-graph attack (one match): knowing Bob's friends and friends' friends => vertex 7 is Bob
Privacy Preservation (18) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Sub-graph attack (k-match): knowing Bob's friends => 8 matches, but they share common labels 6 & 7 => vertex 7 is still uniquely identified as Bob
Privacy Preservation (19) – Possible Attacks
Structural attacks (Zhou VLDB '09)
  Hub fingerprint attack: some hubs have been identified, and the adversary knows the distances between Bob and the hubs
Privacy Preservation (20) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
  Label attack
  Interaction graph
Privacy Preservation (21) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation (22) – Possible Attacks
Non-structural attacks (Bhagat VLDB '09)
Privacy Preservation (23) – Possible Attacks
Active attacks (Backstrom WWW '07)
  Plant a subgraph H into G and connect it to targeted nodes (adding new nodes and edges)
  Recover H from G and identify the targeted nodes' identities and relationships
  Walk-based (largest H), cut-based (smallest H)
Privacy Preservation (24) – Possible Attacks
Passive attacks (no nodes or edges added)
  Start from a coalition of friends (nodes) in the anonymized graph G and discover the existence of edges among the users to whom they are linked
Semi-passive attacks (edges added only, no nodes)
  From existing nodes in G, add fake edges to targeted nodes
Privacy Preservation (25) – Possible Attacks
Intersection attacks (Puttaswamy CoNEXT '09)
  Two users, A and B, were compromised
  A queries the server for the visitors of "website xyz"
  B queries the server for the visitors of "website xyz"
Privacy Preservation (26) – Possible Attacks
Intersection attacks
  StarClique (add latent edges)
  The graph evolution process for a node: the node first selects a subset of its neighbors, then builds a clique with the members of this subset, and finally connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.
Privacy Preservation (27) – Possible Attacks
Relationship attacks (Liu SDM '09)
  Sensitive edge weights, e.g., transaction expenses in a business network
  Reveal the shortest path between source and sink, e.g., A -> D
Privacy Preservation (28) – Possible Attacks
Relationship attacks (Liu SDM '09)
  Preserve the shortest path, e.g., A -> D
  Minimal perturbation on path length and path weight
Privacy Preservation (28) – Possible Attacks
Relationship attacks (Liu ICIS '10)
  Preserve the shortest path, e.g., A -> D
  K-anonymous weight privacy: the blue edge group and the green edge group satisfy 4-anonymous privacy (with the privacy parameter equal to 10).
Privacy Preservation (29) – Possible Attacks
Relationship attacks (Das ICDE '10)
  Preserve a linear property, e.g., shortest paths. The ordering of the five edge weights is preserved after naive anonymization:
  x5 ≦ x1 ≦ x4 ≦ x3 ≦ x2, where x1=(v1, v2), x2=(v1, v4), x3=(v1, v3), x4=(v2, v4), x5=(v3, v4)
  The minimum cost spanning tree is preserved: {(v1, v2), (v2, v4), (v1, v3)}. The shortest path from v1 to v4 is changed.
  The ordering of the edge weights is still exposed. For example, v3 and v4 are best friends while v1 and v4 are not such good friends.
Privacy Preservation (30) – Possible Attacks
Location Privacy (Papadopoulos VLDB '10)
  Preserve the privacy of a user's location in location-based services, e.g., Google or Bing maps
  Alice is searching for gold (going around with a GPS) and checking the server to verify if the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?
Privacy Preservation (32) – Possible Attacks
Inference through data mining attacks
  Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden
  Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ
  DO -> RO (mining); DO -> DM (modification); DM -> RM (mining)
Privacy Preservation (33) – Possible Attacks
ρ-uncertainty (Cao VLDB '10)
  Given a transaction dataset, sensitive items Is, and uncertainty level ρ, the objective is to make the confidence of every sensitive association rule less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ∈ Is.
  If Alice knows Bob bought b1, then she knows Bob also bought {a1, b2, α, …}, where Is = {α, …}
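As an illustrative sketch (function names are mine, not from the Cao paper), a brute-force check of ρ-uncertainty enumerates every candidate antecedent χ and sensitive item α and verifies Conf(χ -> α) < ρ:

```python
from itertools import combinations

def support(db, itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in db if itemset <= t)

def is_rho_uncertain(db, sensitive, rho):
    # db: list of transactions (sets of items); sensitive: the set Is.
    # Check Conf(chi -> alpha) < rho for every antecedent chi that
    # actually occurs and every sensitive item alpha.
    items = set().union(*db)
    for alpha in sensitive:
        others = sorted(items - {alpha})
        for r in range(1, len(others) + 1):
            for chi in combinations(others, r):
                s = support(db, set(chi))
                if s and support(db, set(chi) | {alpha}) / s >= rho:
                    return False
    return True
```

Real anonymizers avoid this exponential enumeration; the point here is only to make the definition concrete.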
Privacy Preservation (34) – Possible Attacks
ρ-uncertainty (Cao VLDB '10)
  Given a transaction dataset, sensitive items Is, uncertainty level ρ = 0.7, and a hierarchy of non-sensitive items, the data is published after suppression
Current Research Areas (1)
Privacy-Preserving Data Publishing
  K-anonymity: try to prevent privacy de-identification
  Utility-based Privacy Preservation
  Distributed Privacy with Adversarial Collaboration
Privacy-Preserving Applications
  Association rule hiding
Current Research Areas – Privacy-Preserving Data Publishing (1)
Hospital Patient Data (Name, ID are hidden):
DOB      Sex     Zipcode  Disease
1/21/76  Male    53715    Heart Disease
4/13/86  Female  53715    Hepatitis
2/28/76  Male    53703    Bronchitis
1/21/76  Male    53703    Broken Arm
4/13/86  Female  53706    Flu
2/28/76  Female  53706    Hang Nail

Vote Registration Data (public info):
Name   DOB      Sex     Zipcode
Andre  1/21/76  Male    53715
Beth   1/10/81  Female  55410
Carol  10/1/44  Female  90210
Dan    2/21/84  Male    02174
Ellen  4/19/72  Female  02237
Current Research Areas – Privacy-Preserving Data Publishing (2)
An example of 2-Anonymity (one-to-many approach)
Current Research Areas – Privacy-Preserving Data Publishing (3)
"Fuzz" the data
  k-anonymity: at least k tuples in each group of quasi-identifier values
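A simple way to see the definition: after generalization, every combination of quasi-identifier values must occur at least k times. A small checker (an illustrative sketch, stdlib only):

```python
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    # Count how many rows share each quasi-identifier combination;
    # k-anonymity requires every group to have at least k members.
    groups = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(count >= k for count in groups.values())
```

For example, a generalized table with rows ("197*", "Male", "5371*") appearing twice and ("198*", "Female", "5370*") appearing twice is 2-anonymous on those three quasi-identifiers.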
Current Research Areas – Utility-based Privacy Preservation (1)
Current Research Areas – Utility-based Privacy Preservation (2)
Current Research Areas – Utility-based Privacy Preservation (3)
Q1: "How many customers under age 29 are there in the data set?"
Q2: "Is an individual with age=25, education=Bachelor, Zip Code=53712 a target customer?"
Table 2 answers: "2"; "Y"
Current Research Areas – Distributed Privacy with Adversarial Collaboration
  Input privacy (2): D1 + D2 + D3 -> RO
Current Research Areas – Association Rule Hiding (1)
Association rule mining
Input: DO, min_supp, min_conf
Output: RO

DO:                 min_supp = 33%, min_conf = 70%
TID  Items          Support counts: |A|=6, |B|=4, |C|=4,
T1   ABC            |AB|=4, |AC|=4, |BC|=3, |ABC|=3
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

RO (association rules):
1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)
Not association rules (confidence below min_conf):
10 A=>B   (66%, 66%)
11 A=>C   (66%, 66%)
12 A=>BC  (50%, 50%)

Current Research Areas – Association Rule Hiding (2)
Input: DO, X (items to be hidden on the LHS), min_supp, min_conf
Output: DM
X = {C}; one ABC transaction is modified to AC
Resulting rules (|A|=6, |B|=3, |C|=4):
1  B=>A   (50%, 100%)
2  C=>A   (66%, 100%)  try (still to hide)
3  B=>C   (33%, 66%)   lost
4  C=>B   (33%, 50%)   hidden
5  AB=>C  (33%, 66%)   lost
6  AC=>B  (33%, 50%)   hidden
7  BC=>A  (33%, 100%)  try (still to hide)
8  C=>AB  (33%, 50%)   hidden
9  B=>AC  (33%, 66%)   lost
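The toy database DO above can be mined with a few lines of brute-force Python (an illustrative sketch, not the algorithm used in the cited work); it reproduces exactly the nine rules that meet min_supp = 33% and min_conf = 70%:

```python
from itertools import combinations

def mine_rules(db, min_supp, min_conf):
    # Brute force: enumerate every frequent itemset Z and every split
    # Z = L + R, keeping rules L => R with enough support and confidence.
    items = sorted(set().union(*db))
    n = len(db)
    supp = lambda s: sum(1 for t in db if s <= t) / n
    rules = []
    for size in range(2, len(items) + 1):
        for z in combinations(items, size):
            zset = frozenset(z)
            if supp(zset) < min_supp:
                continue
            for r in range(1, size):
                for lhs in combinations(z, r):
                    lset = frozenset(lhs)
                    if supp(zset) / supp(lset) >= min_conf:
                        rules.append((lset, zset - lset))
    return rules

do = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
rules = mine_rules(do, 0.33, 0.70)  # 9 rules: B=>A, C=>A, B=>C, C=>B, ...
```

A=>B, A=>C, and A=>BC are correctly excluded because their confidence (66%, 66%, 50%) falls below min_conf.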
Current Research Areas – Association Rule Hiding (3) – Side effects
Hiding failure, lost rules, new rules
(Venn-diagram regions over R, Rh, and ~Rh: ① hiding failure, ② lost rules, ③ new rules)
Current Research Areas – Association Rule Hiding (4)
Output privacy: DO -> RO, DM -> RM (DO -> DM by modification)
Input privacy (1): DO -> RO, D'O -> DM (DO -> D'O by modification)
Current Research Areas – Location Privacy (1)
Location Privacy (Papadopoulos VLDB '10)
  Location obfuscation: send an additional set of "dummy" queries alongside the actual query
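A toy sketch of obfuscation through dummies (illustrative only; real schemes choose dummies so they are statistically indistinguishable from genuine movement): the client sends k locations, only one of which is real, so the server sees the true position with probability 1/k.

```python
import random

def obfuscated_query_set(real_loc, k, radius, rng=random):
    # Mix the real location with k-1 dummy locations drawn from a
    # square of side 2*radius around it, then shuffle so the server
    # cannot tell which entry is genuine.
    lat, lon = real_loc
    queries = [real_loc]
    for _ in range(k - 1):
        queries.append((lat + rng.uniform(-radius, radius),
                        lon + rng.uniform(-radius, radius)))
    rng.shuffle(queries)
    return queries
```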
Current Research Areas – Location Privacy (2)
Location Privacy
Data transformation
Current Research Areas – Location Privacy (3)
Location Privacy
  PIR-based location privacy
  PIR-based queries are sent to the LBS server (computational PIR or a secure co-processor), and blocks are retrieved without the server learning which blocks were requested
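The classic two-server, XOR-based PIR scheme illustrates how a client can fetch block i without either (non-colluding) server learning i. This is a textbook construction, not necessarily the computational variant the slide refers to:

```python
import random

def xor_answer(db, query):
    # Each server XORs together the blocks whose indices it was sent.
    acc = 0
    for j in query:
        acc ^= db[j]
    return acc

def pir_fetch(db, i, rng=random):
    # Client: send a random index set S to server 1 and S XOR {i} to
    # server 2. Each set alone is uniformly random, so neither server
    # learns which block is wanted; the two answers XOR to db[i].
    n = len(db)
    s1 = {j for j in range(n) if rng.random() < 0.5}
    s2 = s1 ^ {i}  # symmetric difference flips membership of i only
    return xor_answer(db, s1) ^ xor_answer(db, s2)
```

Correctness follows from XOR algebra: blocks in both query sets cancel, leaving exactly db[i].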
Recent Studies (1) – Informative Association Rule Set (IRS)
Informative association rules
Input: DO, min_supp, min_conf, X = {C}
Output: {C=>A (66%, 100%), C=>B (50%, 75%)}

TID  Items          min_supp = 33%
T1   ABC
T2   ABC
T3   ABC
T4   AB
T5   A
T6   AC

1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)
Rules #6, 7, 8 predict the same RHS {A, B} as #2, 4
Recent Studies (2) – Hiding IRS 1 (LHS)
Input: DO, X (items to be hidden on the LHS), min_supp, min_conf
Output: DM
X = {C}; one ABC transaction is modified to AC

TID  Items          min_supp = 33%, min_conf = 70%
T1   AC             |A|=6, |B|=3, |C|=4,
T2   ABC            |AB|=3, |AC|=4, |BC|=2, |ABC|=2
T3   ABC
T4   AB
T5   A
T6   AC

DM association rules:
1  B=>A   (50%, 100%)
2  C=>A   (66%, 100%)  try (still to hide)
3  B=>C   (33%, 66%)   lost
4  C=>B   (33%, 50%)   hidden
5  AB=>C  (33%, 66%)   lost
6  AC=>B  (33%, 50%)   hidden
7  BC=>A  (33%, 100%)  try (still to hide)
8  C=>AB  (33%, 50%)   hidden
9  B=>AC  (33%, 66%)   lost
10 A=>B   (50%, 50%)
11 A=>C   (66%, 66%)
12 A=>BC  (33%, 33%)
Recent Studies (3) – Proposed Algorithms
Strategy: to lower the confidence of a given rule X => Y, either
  increase the support of X but not XY, or
  decrease the support of XY (or of both X and XY)

Conf(X => Y) = support(XY) / support(X)
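The effect of the second option (a DSR-style deletion) can be seen on the running example; `conf` below is an illustrative helper, not code from the cited work:

```python
def conf(db, lhs, rhs):
    # Confidence = support(LHS | RHS) / support(LHS), with support
    # measured as the number of transactions containing the itemset.
    both = sum(1 for t in db if lhs | rhs <= t)
    return both / sum(1 for t in db if lhs <= t)

db = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
print(conf(db, {"B"}, {"C"}))  # 0.75 -> B=>C is visible at min_conf 70%
db[0].discard("C")             # decrease support of the RHS item C
print(conf(db, {"B"}, {"C"}))  # 0.5  -> B=>C is now hidden
```

Deleting C from a single transaction drops support(BC) from 3 to 2 while support(B) stays at 4, pushing the confidence below the 70% threshold.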
Recent Studies (4) – Proposed Algorithms
Multiple scans of the database (Apriori-based):
  Increase Support of LHS First (ISL)
  Decrease Support of RHS First (DSR)
One scan of the database:
  Decrease Support and Confidence (DSC)
  Proposes a Pattern-Inversion tree to store the data
Maintenance of hiding informative association rule sets
Recent Studies (5) – Analysis
For multiple-scan algorithms:
  Time effects: DSR is faster than ISL because the set of candidate transactions is smaller
  Database effects
Recent Studies (6) – Analysis
Side effects
  DSR: no hiding failure (0%), few new rules (5%), and some lost rules (11%)
  ISL: shows some hiding failure (12.9%) and many new rules
Recent Studies (7) – Proposed Algorithms
One-scan algorithm DSC
  Pattern-inversion tree, e.g., for TID/Items T1 ABC, T2 ABC, T3 ABC, …: Root with nodes A:6:[T5], B:4:[T4], C:1:[T6]

Recent Studies (6) – Proposed Algorithms
One-scan algorithm DSC
Time effects, Database effects
Recent Studies (9) – Proposed Algorithms
Recent Studies (10) – Analysis
For the single-scan algorithm:
  DSC is O(2|D| + |X|*l2*K*log K), where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in the DSC algorithm.
  SWA is O((n1-1)*n1/2*|D|*Kw), where n1 is the initial number of restrictive rules in the database D and Kw is the chosen window size.
  SWA has a higher order of complexity, O(l2²*|X|²*|D|²), if Kw = |D|.
Recent Studies (11) – Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D    (modified)
T1   111  001
T2   111  111
T3   111  111
T4   110  110

Incremental data D': T7 101, T8 101, T9 110
D+ = D + D'
Recent Studies (12) – Maintenance of Hiding IRS
Maintenance of hiding informative association rule sets:

TID  D+   (DSC)  (MSI)
T1   111  111    001
T2   111  111    111
T3   111  111    111
T4   110  110    110
T5   100  100    100
T6   101  001    001
T7   101  001    001
T8   101  101    101
T9   110  110    110
Recent Studies (13) – Maintenance of Hiding IRS
Recent Studies (14) – Maintenance of Hiding IRS
Recent Studies (15) – K-anonymity and Km-anonymity
K-anonymity
  Every record has k-1 other identical records on the quasi-identifiers (QIs)
Km-anonymity
  The support count of every m-itemset is ≧ k
Domain Generalization Hierarchy (DGH)
Data types
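A brute-force check of the Km-anonymity condition (an illustrative sketch; the published algorithms avoid this exponential enumeration): every itemset of up to m items that occurs at all must occur in at least k transactions.

```python
from itertools import combinations

def is_km_anonymous(db, k, m):
    # db: list of transactions (sets of items). For every itemset of
    # size <= m, its support must be either 0 (no one to re-identify)
    # or at least k.
    items = sorted(set().union(*db))
    for size in range(1, m + 1):
        for subset in combinations(items, size):
            count = sum(1 for t in db if set(subset) <= t)
            if 0 < count < k:
                return False
    return True
```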
Recent Studies (16) – K-anonymity on transaction data with minimal addition/deletion of items

3-anonymity: groups {T1, T4, T5}, {T2, T3, T6}

Before:             After:
     e1 e2 e3            e1 e2 e3
T1   1  1  0        T1   1  0  0
T2   0  0  1        T2   1  0  1
T3   1  0  1        T3   1  0  1
T4   1  0  0        T4   1  0  0
T5   1  0  0        T5   1  0  0
T6   1  0  1        T6   1  0  1

Recent Studies (17) – K-anonymity on transaction data with minimal addition/deletion of items

Candidate groupings over items (a1, a2, a3):
  {T1 111, T2 110, T4 011, T5 100} and {T3 101, T6 010, T7 001}
  {T1 111, T2 110, T3 101, T4 011} and {T5 100, T6 010}
  Full data: T1 111, T2 110, T3 101, T4 011, T5 100, T6 010, T7 001

Recent Studies (18) – K-anonymity on transaction data with minimal addition/deletion of items

Proposes an O(log k)-approximation solution to this problem
Recent Studies (19) – K-anonymous path privacy

Goal: minimally modify the graph such that there exist k shortest paths between each given pair of vertices specified in H, without adding or deleting any vertices or edges.
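To check the k-shortest-paths condition for one vertex pair, one can count minimum-weight paths with a Dijkstra variant (an illustrative checker, not the modification algorithm itself):

```python
import heapq

def count_shortest_paths(adj, src, dst):
    # adj: {node: [(neighbor, positive weight), ...]}. Returns the
    # shortest distance and the number of distinct minimum-weight paths.
    dist = {src: 0}
    count = {src: 1}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                count[v] += count[u]  # another equally short path
    return dist.get(dst), count.get(dst, 0)
```

K-anonymous path privacy would require this count to be at least k for every sensitive pair in H.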
(Figure: example graph on v1, v2, v4 with edge weights 1, 8, 7, before and after modification)

Recent Studies (19) – K-anonymous path privacy
Ratios of perturbed edges for different k
EIES data set: 48 nodes, 830 edges
Discussions (1)
Major issues
  Large data volume
  High dimensionality
  Sparseness
Discussions (2)
From relational and set data to graph, spatial, … data and beyond
  Privacy-preserving social networking (PPNP)
  Privacy-preserving collaborative filtering
Discussions (3)
From data privacy to information privacy
  Hiding aggregated information
    Car dealer inventory: hide total stock, not the individual query
    Airline: hide total seats left, to prevent terrorists from choosing a less crowded flight
Discussions (4)
Thank You