
(1)

Recent Studies in Privacy-Preserving Data Mining

Leon S.L. Wang

Department of Information Management, National University of Kaohsiung

(2)

Outline



Data Mining – a quick glance



Privacy-Preserving Data Mining (PPDM)

 Objective, common practices, possible attacks



Current Research Areas

 K-Anonymity, Utility, Distributed Privacy, Association Rule Hiding



Recent Studies

(3)

Data Mining

₁



Market basket analysis (Association Rules)

 “if a customer purchases diapers, then he will very likely purchase beer”



Sequences (Sequential Patterns)

 “A customer who bought an iPod three months ago is likely to order an iPhone within one month”

(4)

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes

Classification Algorithms Classifier (Model) 

Classification

Data Mining

₂

(5)

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Data Mining

₃

Classifier

Testing Data

Unseen Data (Jeff, Professor, 4)

Tenured?

(6)

Data Mining

₄

• Clustering

• Unsupervised learning: finds “natural” groupings of instances given unlabeled data

(7)


Data Mining

₅



Data mining:

 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases



Alternative names:

 Knowledge discovery in databases (KDD), knowledge

extraction, data/pattern analysis, data archeology, data

(8)

Privacy Preserving Data Mining

₁



Motivating example – Group Insurance Commission: they found the MA governor’s medical record

(9)

Privacy Preserving Data Mining

₂

Hospital Patient Data (Name, ID are hidden)

DOB       Sex      Zipcode   Disease
1/21/76   Male     53715     Heart Disease
4/13/86   Female   53715     Hepatitis
2/28/76   Male     53703     Bronchitis
1/21/76   Male     53703     Broken Arm
4/13/86   Female   53706     Flu
2/28/76   Female   53706     Hang Nail

Voter Registration Data (public info)

Name    DOB       Sex      Zipcode
Andre   1/21/76   Male     53715
Beth    1/10/81   Female   55410
Carol   10/1/44   Female   90210
Dan     2/21/84   Male     02174
Ellen   4/19/72   Female   02237

 Andre has heart disease!
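The linking step on this slide can be sketched as a join on the shared attributes (DOB, Sex, Zipcode). The code below is only an illustration: the function name `link` and the tuple layout are assumptions made for this sketch, and the rows are copied from the two tables on the slide.

```python
# Rows copied from the slide; the tuple layout is an assumption of this sketch.
hospital = [
    ("1/21/76", "Male", "53715", "Heart Disease"),
    ("4/13/86", "Female", "53715", "Hepatitis"),
    ("2/28/76", "Male", "53703", "Bronchitis"),
    ("1/21/76", "Male", "53703", "Broken Arm"),
    ("4/13/86", "Female", "53706", "Flu"),
    ("2/28/76", "Female", "53706", "Hang Nail"),
]
voters = [
    ("Andre", "1/21/76", "Male", "53715"),
    ("Beth", "1/10/81", "Female", "55410"),
    ("Carol", "10/1/44", "Female", "90210"),
    ("Dan", "2/21/84", "Male", "02174"),
    ("Ellen", "4/19/72", "Female", "02237"),
]


def link(hospital_rows, voter_rows):
    """Join both tables on the quasi-identifier (DOB, Sex, Zipcode)."""
    by_qi = {}
    for dob, sex, zipcode, disease in hospital_rows:
        by_qi.setdefault((dob, sex, zipcode), []).append(disease)
    matches = {}
    for name, dob, sex, zipcode in voter_rows:
        diseases = by_qi.get((dob, sex, zipcode), [])
        if len(diseases) == 1:  # a unique match re-identifies the person
            matches[name] = diseases[0]
    return matches


linked = link(hospital, voters)
print(linked)
```

Only a voter whose quasi-identifier values match exactly one hospital record is reported, which is why removing names and IDs alone does not protect the record.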



Motivating examples – Group Insurance

(10)

Privacy Preserving Data Mining

₃



Motivating examples – A Face Is Exposed for AOL Searcher No. 4417749

 Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.

Thelma Arnold's

(11)

Privacy Preserving Data Mining

₄




Motivating examples – America Online (AOL)

 ~650k users, 3-month period, ~20 million queries released

 No name, no SSN, no driver's license #, no credit card #

 The user, ID 4417749, was found to be Thelma Arnold, a 62-year-old woman living in Georgia.

 Loss of privacy for users, damage to AOL, and significant damage to academics who depend on such data.

(12)

Privacy Preserving Data Mining

₅



Motivating examples – Netflix Prize

 In October 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.

 Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.

(13)


Privacy Preserving Data Mining

₆

 Motivating examples – Association Rules

Supplier

ABC Paper Company

Retailer

XYZ Supermarket Chain

1. Allow ABC to access customer XYZ’s DB

2. Predict XYZ’s inventory needs & Offer reduced prices

(14)

Privacy Preserving Data Mining

₇



Supplier ABC discovers (thru data mining):

 Sequence: Cold remedies -> Facial tissue

 Association: (Skim milk, Green paper)



Supplier ABC runs coupon marketing campaign:

 “50 cents off skim milk when you buy ABC products”

(15)


Privacy Preserving Data Mining

₁

- Objective



Privacy

 The state of being private; the state of not being seen by others



Database security

 To prevent loss of privacy due to viewing/disclosing unauthorized data



PPDM

(16)

Privacy Preserving Data Mining

₂

- Common Practices



Limiting access

 Control access to the data

 Used by secure DBMS community



“Fuzz” the data

 Forcing aggregation into daily records instead of individual transactions or slightly altering data values

(17)


Privacy Preserving Data Mining

₃

- Common Practices



Eliminate unnecessary groupings

 The first 3 digits of SSNs are assigned by office sequentially

 Clustering high-order bits of a “unique identifier” is likely to group similar data elements

 Instead, assign unique identifiers randomly



Augment the data

 Populate the phone book with extra, fictitious people in non-obvious ways

 Return correct info when asking an individual, but return incorrect info when asking all individuals in a department

(18)

Privacy Preserving Data Mining

₄

- Common Practices



Audit

 Detect misuse by legitimate users

 Administrative or criminal disciplinary action may be initiated

(19)


Privacy Preserving Data Mining

₅

- Possible Attacks



Linking attacks (Sweeney IJUFKS ’02)

 Re-identification

 Identity linkage (K-anonymity)

 Attribute linkage (l-diversity)

(20)

Privacy Preserving Data Mining

₆

- Possible Attacks



Corruption attacks (Tao ICDE ’08, Chaytor ICDM ’09)

 Background knowledge; Perturbed generalization (PG)

(21)


Privacy Preserving Data Mining

₇

- Possible Attacks



Differential privacy (Dwork ICALP ’06)

 Add noise to the data set so that the difference between the output of any query on 30 records and on any 29 of those records will be very small (within a differential).
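One common way to obtain this behaviour is to perturb query answers with Laplace noise. The sketch below is an assumption about how the idea could be realized, not the method shown on the slide: the function `noisy_count`, the parameter `epsilon`, and the age data are all hypothetical.

```python
import random


def noisy_count(records, predicate, epsilon, rng=random):
    """Count matching records, then add Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one record is added or removed,
    so its sensitivity is 1 and the noise scale is 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


def over_30(age):
    return age > 30


ages_30 = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44] * 3  # 30 hypothetical records
ages_29 = ages_30[:-1]                                   # the same data minus one record

rng = random.Random(0)
print(noisy_count(ages_30, over_30, epsilon=0.5, rng=rng))
print(noisy_count(ages_29, over_30, epsilon=0.5, rng=rng))
```

The two true counts differ by one, and the added noise is large enough relative to that difference that a single answer does not reveal whether the 30th record was present. A smaller `epsilon` means more noise and stronger protection.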

(22)

Privacy Preserving Data Mining

₈

- Possible Attacks



Realistic adversaries (Machanavajjhala VLDB ’09)

 Weak privacy:

 l-diversity, t-closeness

 Adversaries need to know very specific information

 Strong privacy:

 Differential privacy

 Adversaries need to know all information except the victim

 Epsilon-privacy

 Adversary’s knowledge can vary and learn, and is

(23)


Privacy Preserving Data Mining

₉

- Possible Attacks



Structural attacks (Zhou VLDB ’09)

 Degree attack: knowing Bob has 4 friends =>

(24)

Privacy Preserving Data Mining

₁₀

- Possible Attacks



Structural attacks (Zhou VLDB ’09)

 Sub-graph attack (one match): knowing Bob’s friends and friends’ friends => Vertex 7 is Bob

(25)


Privacy Preserving Data Mining

₁₁

- Possible Attacks



Structural attacks (Zhou VLDB ’09)

 Sub-graph attack (k-match): knowing Bob’s friends => 8 matches, but they share common labels 6 & 7 => vertex 7 is still uniquely identified as Bob

(26)

Privacy Preserving Data Mining

₁₂

- Possible Attacks



Structural attacks (Zhou VLDB ’09)

 Hub fingerprint attacks

 Some hubs have been identified, adversary knows the distance between Bob and hubs

(27)


Privacy Preserving Data Mining

₁₃

- Possible Attacks



Non-structural attacks (Bhagat VLDB ’09)

 Label attack

 Interaction graph

(28)

Privacy Preserving Data Mining

₁₄

- Possible Attacks



Non-structural attacks (Bhagat VLDB ’09)

(29)


Privacy Preserving Data Mining

₁₅

- Possible Attacks



Non-structural attacks (Bhagat VLDB ’09)

(30)

Privacy Preserving Data Mining

₁₆

- Possible Attacks



Active attacks (Backstrom WWW ’07)

 Plant a subgraph H into G and connect it to targeted nodes (add new nodes and edges)

 Recover H from G and identify the targeted nodes’ identities and relationships

 Walk-based (largest H), cut-based (smallest H)

(31)


Privacy Preserving Data Mining

₁₇

- Possible Attacks



Passive attacks (no nodes, no edges added)

 Start from a coalition of friends (nodes) in anonymized graph G, and discover the existence of edges among the users to whom they are linked



Semi-passive attacks (add edges only, no nodes)

 From existing nodes in G, add fake edges to targeted nodes

(32)

Privacy Preserving Data Mining

₁₈

- Possible Attacks



Intersection attacks (Puttaswamy CoNEXT ’09)

 Two users, A and B, were compromised,

 A queries the server for the visitor of “website xyz”,

 B queries the server for the visitor of “website xyz”,

(33)


Privacy Preserving Data Mining

₁₉

- Possible Attacks



Intersection attacks

 StarClique (add latent edges)

 The graph evolution process for a node: the node first selects a subset of its neighbors. Then it builds a clique with the members of this subset. Finally, it connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.

(34)

Privacy Preserving Data Mining

₂₀

- Possible Attacks



Relationship attacks (Liu SDM ’09)

 Sensitive edge weights, e.g. transaction expenses in business network,

 Reveal the shortest path between source and sink, e.g., A -> D,

(35)


Privacy Preserving Data Mining

₂₁

- Possible Attacks



Relationship attacks (Liu SDM ’09)

 Preserving the shortest path, e.g. A -> D,

 Min perturbation on path length, path weight,

(36)

Privacy Preserving Data Mining

₂₂

- Possible Attacks



Relationship attacks (Liu ICIS ’10)

 Preserving the shortest path, e.g. A -> D,

 K-anonymous weight privacy

 the blue edge group and the green edge group satisfy the 4-anonymous privacy where  =10.

(37)


Privacy Preserving Data Mining

₂₃

- Possible Attacks



Relationship attacks (Das ICDE ’10)

 Preserving linear property, e.g., shortest paths,

 The ordering of the five edge weights is preserved after naïve anonymization.

 x₅≦ x₁≦ x₄≦ x₃≦ x₂, where x₁=(v₁, v₂), x₂=(v₁, v₄), x₃=(v₁, v₃), x₄=(v₂, v₄), x₅=(v₃, v₄),

 The minimum cost spanning tree is preserved: {(v₁, v₂), (v₂, v₄), (v₁, v₃)}

 The shortest path from v₁ to v₄ is changed.

 The ordering of the edge weights is still exposed. For example, v₃ and v₄ are best friends, while v₁ and v₄ are not such good friends.

(38)

Privacy Preserving Data Mining

₂₄

- Possible Attacks



Location Privacy (Papadopoulos VLDB ’10)

 Preserving the privacy of the location of user in

(39)


Privacy Preserving Data Mining

₂₅

- Possible Attacks



Location Privacy (Papadopoulos VLDB ’10)

 Location obfuscation

 Send an additional set of “dummy” queries, in addition to the actual query

 Data transformation

 Encrypted query is sent to LBS

 PIR-based location privacy

 PIR-based queries are sent to the LBS server and retrieve blocks without the server discovering which blocks are requested

(40)

Privacy Preserving Data Mining

₂₆

- Possible Attacks



Inference through data mining attacks

 Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden

 Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ

[Figure: D_O -> DM -> R_O]

(41)


Privacy Preserving Data Mining

₂₇

- Possible Attacks



ρ-uncertainty (Cao VLDB ’10)

 Given a transaction dataset, sensitive items I_s, and uncertainty level ρ, the objective is to make the confidence of all sensitive association rules less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ∈ I_s.

 If Alice knows Bob bought b₁, then she knows Bob also bought {a₁, b₂, α,  }, where I_s = {α, }

(42)

Privacy Preserving Data Mining

₂₈

- Possible Attacks



ρ-uncertainty (Cao VLDB ’10)

 Given a transaction dataset, sensitive items I_s, uncertainty level ρ = 0.7, a hierarchy of non-sensitive items, the published data after suppression

(43)

Current Research Areas

₁



Privacy-Preserving Data Publishing

 K-anonymity

 Tries to prevent re-identification



Utility-based Privacy-Preserving



Distributed Privacy with Adversarial Collaboration



Privacy-Preserving Application

 Association rules hiding

(44)

(45)

Privacy Preserving Data Publishing

₁

 Andre has heart disease!


(46)

Privacy Preserving Data Publishing

₂

(47)


Privacy Preserving Data Publishing

₃



“Fuzz” the data

 k-anonymity, at least k tuples in one group
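The "at least k tuples in one group" condition can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper `is_k_anonymous`, the column names, and the generalized table (year of birth only, truncated zip code) are hypothetical and not taken from the slides.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in >= k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical "fuzzed" table: DOB reduced to the year, Zipcode truncated.
table = [
    {"dob": "1976", "sex": "Male", "zip": "537**", "disease": "Heart Disease"},
    {"dob": "1976", "sex": "Male", "zip": "537**", "disease": "Bronchitis"},
    {"dob": "1976", "sex": "Male", "zip": "537**", "disease": "Broken Arm"},
    {"dob": "1986", "sex": "Female", "zip": "537**", "disease": "Hepatitis"},
    {"dob": "1986", "sex": "Female", "zip": "537**", "disease": "Flu"},
]
qi = ["dob", "sex", "zip"]
print(is_k_anonymous(table, qi, 2))
print(is_k_anonymous(table, qi, 3))
```

The smallest group here has two rows, so the table passes the check for k = 2 and fails it for k = 3.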

(48)

Utility-based Privacy-Preserving

₁

(49)


Utility-based Privacy-Preserving

₂

(50)

Utility-based Privacy-Preserving

₃



Q1: “How many customers under age 29 are there in the data set?”



Q2: “Is an individual with age=25, education=Bachelor, Zip Code=53712 a target customer?”



Table 2, answers: “2”; “Y”

(51)


Distributed Privacy with Adversarial Collaboration



Input privacy (2)

[Figure: D₁ + D₂ + D₃ -> DM -> R_O]

(52)



Association rule mining



Input: D_O, min_supp, min_conf

Output: R_O

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

min_supp=33%, min_conf=70%

Rule (support, confidence)

1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)

|A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3
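The support and confidence figures on this slide can be recomputed directly. This sketch assumes a six-transaction database: T1 to T5 as listed plus T6 = AC, which is what the counts |A|=6 and |C|=4 imply and what the later hiding slide lists. The helper names `count`, `support`, and `confidence` are illustrative only.

```python
from fractions import Fraction

# Assumption: six transactions, T1..T5 as listed plus T6 = AC (implied by |A|=6, |C|=4).
D_O = {"T1": "ABC", "T2": "ABC", "T3": "ABC", "T4": "AB", "T5": "A", "T6": "AC"}


def count(db, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= set(items))


def support(db, lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return Fraction(count(db, lhs + rhs), len(db))


def confidence(db, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return Fraction(count(db, lhs + rhs), count(db, lhs))


for lhs, rhs in [("B", "A"), ("B", "C"), ("BC", "A"), ("A", "B")]:
    s, c = support(D_O, lhs, rhs), confidence(D_O, lhs, rhs)
    reported = s >= Fraction(33, 100) and c >= Fraction(70, 100)
    print(f"{lhs}=>{rhs}: support={float(s):.2f}, confidence={float(c):.2f}, reported={reported}")
```

A=>B is included to show a rule that meets min_supp but falls below min_conf, which is why it does not appear among the nine rules.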

(53)

 Input: D_O, X (items to be hidden on LHS), min_supp, min_conf

 Output: D_M

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

min_supp=33%, min_conf=70%, X = {C}

D_M (modified transaction: AC)

Association Rules

 1  B=>A   (50%, 100%)
 2  C=>A   (66%, 100%)
 3  B=>C   (33%, 66%)
 4  C=>B   (33%, 50%)
 5  AB=>C  (33%, 66%)
 6  AC=>B  (33%, 50%)
 7  BC=>A  (33%, 100%)
 8  C=>AB  (33%, 50%)
 9  B=>AC  (33%, 66%)
10  A=>B   (50%, 50%)
11  A=>C   (50%, 66%)
12  A=>BC  (33%, 33%)

|A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

Rule labels: hidden, hidden, hidden; lost, lost, lost; try, try
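To see which rules change between D_O and D_M, both rule sets can be mined and compared. The sketch below is a brute-force illustration, not one of the hiding algorithms discussed in this talk. It assumes that T1 is the transaction changed from ABC to AC (the counts only say that one ABC transaction lost its B), and the names `mine_rules` and `side_effects` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def mine_rules(db, min_supp, min_conf):
    """Brute-force miner: the set of rules (lhs, rhs) meeting both thresholds."""
    def count(itemset):
        return sum(1 for t in db if set(itemset) <= set(t))

    items = sorted(set("".join(db)))
    rules = set()
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            n = count(itemset)
            if n == 0 or Fraction(n, len(db)) < min_supp:
                continue
            for r in range(1, size):
                for lhs in combinations(itemset, r):
                    if Fraction(n, count(lhs)) >= min_conf:
                        rhs = "".join(i for i in itemset if i not in lhs)
                        rules.add(("".join(lhs), rhs))
    return rules


def side_effects(rules_before, rules_after, hidden_items):
    """Classify changes; a rule is sensitive if its LHS contains an item of X."""
    def sensitive(rule):
        return bool(set(rule[0]) & set(hidden_items))

    gone = rules_before - rules_after
    return {
        "hidden": {r for r in gone if sensitive(r)},
        "lost": {r for r in gone if not sensitive(r)},
        "hiding_failure": {r for r in rules_after if sensitive(r)},
        "new": rules_after - rules_before,
    }


D_O = ["ABC", "ABC", "ABC", "AB", "A", "AC"]
D_M = ["AC", "ABC", "ABC", "AB", "A", "AC"]  # assumption: T1 is the transaction changed to AC
thresholds = dict(min_supp=Fraction(1, 3), min_conf=Fraction(7, 10))
effects = side_effects(mine_rules(D_O, **thresholds), mine_rules(D_M, **thresholds), "C")
for kind, rules in effects.items():
    print(kind, sorted(f"{lhs}=>{rhs}" for lhs, rhs in rules))
```

Exact fractions are used for the thresholds so that a support of exactly 2/6 is compared against 1/3 without floating-point rounding.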

(54)

Association Rule Hiding

₃

- Side effects



Hiding failure, lost rules, new rules

[Figure: rule-set diagram; region ② = Lost Rules]

(55)


Association Rule Hiding

₄



Output privacy

[Figure: D_O -> DM -> R_O; D_O -> Modification -> D_M -> DM -> R_M, where R_M = R_O - R_H]

• Input privacy (1)

[Figure: D_O -> Modification -> D’_O -> DM -> R_O]

(56)

Recent Studies

₁

-

Informative Association Rule Set (IRS)

 Informative Association rules

 Input: D_O, min_supp, min_conf, X = {C}

 Output: {C=>A (66%, 100%), C=>B (50%, 75%)}

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB

1  B=>A   (66%, 100%)
2  C=>A   (66%, 100%)
3  B=>C   (50%, 75%)
4  C=>B   (50%, 75%)
5  AB=>C  (50%, 75%)
6  AC=>B  (50%, 75%)
7  BC=>A  (50%, 100%)
8  C=>AB  (50%, 75%)
9  B=>AC  (50%, 75%)

Rules #6, 7, 8 predict the same RHS {A,B} as #2, 4

(57)

Recent Studies

₂

-

Hiding IRS

₁

(LHS)

 Input: D_O, X (items to be hidden on LHS), min_supp, min_conf

 Output: D_M

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

min_supp=33%, min_conf=70%, X = {C}

D_M (modified transaction: AC)

Association Rules

 1  B=>A   (50%, 100%)
 2  C=>A   (66%, 100%)
 3  B=>C   (33%, 66%)
 4  C=>B   (33%, 50%)
 5  AB=>C  (33%, 66%)
 6  AC=>B  (33%, 50%)
 7  BC=>A  (33%, 100%)
 8  C=>AB  (33%, 50%)
 9  B=>AC  (33%, 66%)
10  A=>B   (50%, 50%)
11  A=>C   (50%, 66%)
12  A=>BC  (33%, 33%)

|A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

Rule labels: hidden, hidden, hidden; lost, lost, lost; try, try

(58)

Recent Studies

₃

–

Proposed Algorithms



Strategy:

 To lower the confidence of a given rule X => Y,

 either

 Increase the support of X, but not XY, OR

 Decrease the support of XY (or both X and XY)

$\mathrm{Conf}(X \Rightarrow Y) = \frac{|XY|}{|X|} = \frac{\mathrm{support}(XY)}{\mathrm{support}(X)}$
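The two options in the strategy can be compared on a small example. This is a sketch only: it does not implement ISL, DSR, or DSC, the example database is assumed, and the choice of which transaction to modify is arbitrary. The rule being hidden is C => B, i.e. X = {C} and Y = {B}.

```python
from fractions import Fraction


def confidence(db, x, y):
    """Conf(X => Y) = |XY| / |X| over a list of transactions (strings of items)."""
    n_x = sum(1 for t in db if set(x) <= set(t))
    n_xy = sum(1 for t in db if set(x + y) <= set(t))
    return Fraction(n_xy, n_x)


D_O = ["ABC", "ABC", "ABC", "AB", "A", "AC"]  # assumed example database

# Option 1: increase the support of X = {C} only, by adding C to a transaction without B.
increase_lhs = ["ABC", "ABC", "ABC", "AB", "AC", "AC"]
# Option 2: decrease the support of XY = {B, C}, by removing B from one transaction.
decrease_xy = ["AC", "ABC", "ABC", "AB", "A", "AC"]

for name, db in [("original", D_O), ("increase |X|", increase_lhs), ("decrease |XY|", decrease_xy)]:
    print(name, confidence(db, "C", "B"))
```

Both modifications push the confidence of C => B below a 70% threshold, the first by enlarging the denominator and the second by shrinking the numerator.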





(59)

Recent Studies

₄

–

Proposed Algorithms




Multiple scans of database (Apriori-based)

 Increase Support of LHS First (ISL)

 Decrease Support of RHS First (DSR)



One scan of database

 Decrease Support and Confidence (DSC)

 Propose Pattern-Inversion tree to store data



Maintenance of hiding informative association rule sets

(60)

Recent Studies

₅

-

Analysis



For multiple-scan algorithms:



Time effects

 DSR is faster than ISL

 Because the set of candidate transactions is smaller

Database effects

(61)

Recent Studies

₆

-

Analysis




Side effects

 DSR: no hiding failure (0%), few new rules (5%) and some lost rules (11%)

 ISL: shows some hiding failure (12.9%), many new

(62)

Recent Studies

₇

-

Proposed Algorithms



One-scan algorithm DSC

 Pattern-inversion tree

TID   Items
T₁    ABC
T₂    ABC
T₃    ABC

[Figure: pattern-inversion tree with nodes Root, A:6:[T₅], B:4:[T₄], C:1:[T₆]]

(63)

Recent Studies

₈

-

Proposed Algorithms



One-scan algorithm DSC

 Time effects, Database effects

(64)

Recent Studies

₉

-

Proposed Algorithms

(65)

Recent Studies

₁₀

-

Analysis




For single-scan algorithm:

 DSC is O(2|D| + |X|*l₂*K*logK)

 where |X| is the number of items in X, l₂ is the maximum number of large 2-itemsets, and K is the maximum number of iterations in the DSC algorithm.

 SWA is O((n₁-1)*n₁/2*|D|*Kw)

 where n₁ is the initial number of restrictive rules in the database D and Kw is the window size chosen.

 SWA has higher order of complexity O(l₂²*|X|²*|D|²) if Kw ≈ |D|

(66)

Recent Studies

₁₁

–

Maintenance of Hiding IRS



Maintenance of hiding informative association rule sets:

TID   D     D'
T₁    111   001
T₂    111   111
T₃    111   111
T₄    110   110

TID   New transactions
T₇    101
T₈    101
T₉    110

D+ = D + (new transactions)

(67)

Recent Studies

₁₂

–

Maintenance of Hiding IRS




Maintenance of hiding informative association rule sets:

TID   D+    (DSC)   (MSI)
T₁    111   111     001
T₂    111   111     111
T₃    111   111     111
T₄    110   110     110
T₅    100   100     100
T₆    101   001     001
T₇    101   001     001
T₈    101   101     101
T₉    110   110     110

(68)

Recent Studies

₁₃

–

Maintenance of Hiding IRS

(69)

Recent Studies

₁₄

–

Maintenance of Hiding IRS


(70)

Recent Studies

₁₅

–

K-anonymity and K^m-anonymity



K-anonymity

 Every record has at least k-1 other records that are identical on the quasi-identifiers (QIs)



K^m-anonymity

 The support count of every m-itemset ≧ k
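Both definitions can be checked on transaction data with a few lines of code. This is a sketch under assumptions: the helper names and the item labels are hypothetical, and the K^m check follows the usual reading that every itemset of size up to m that actually occurs in the data must be supported by at least k transactions.

```python
from collections import Counter
from itertools import combinations


def is_k_anonymous(transactions, k):
    """Every transaction must be identical to at least k-1 others."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(c >= k for c in counts.values())


def is_km_anonymous(transactions, k, m):
    """Every occurring itemset of size <= m must appear in >= k transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, m + 1):
            counts.update(combinations(sorted(t), size))
    return all(c >= k for c in counts.values())


# Hypothetical transaction data over items e1, e2, e3.
original = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
anonymized = [{"e1"}, {"e1", "e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]

print(is_k_anonymous(original, 3), is_km_anonymous(original, 3, 2))
print(is_k_anonymous(anonymized, 3), is_km_anonymous(anonymized, 3, 2))
```

In the anonymized data the transactions fall into two groups of three identical records, so both checks pass for k = 3; in the original data the item e2 occurs only once, so both fail.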



Domain Generalization Hierarchy (DGH)



Data types

(71)

Recent Studies

₁₆

–

K-anonymity on transaction data with minimal addition/deletion of items

3-anonymity

 {T₁, T₄, T₅}, {T₂, T₃, T₆}

Original data:

      e₁   e₂   e₃
T₁    1    1    0
T₂    0    0    1
T₃    1    0    1
T₄    1    0    0
T₅    1    0    0
T₆    1    0    1

3-anonymous data:

      e₁   e₂   e₃
T₁    1    0    0
T₂    1    0    1
T₃    1    0    1
T₄    1    0    0
T₅    1    0    0
T₆    1    0    1

(72)

Recent Studies

₁₇

–

K-anonymity on transaction data with minimal addition/deletion of items

      a₁   a₂   a₃
T₁    1    1    1
T₂    1    1    0
T₄    0    1    1
T₅    1    0    0

      a₁   a₂   a₃
T₃    1    0    1
T₆    0    1    0

      a₁   a₂   a₃
T₁    1    1    1
T₂    1    1    0
T₃    1    0    1
T₄    0    1    1

      a₁   a₂   a₃
T₅    1    0    0
T₆    0    1    0

(73)

Recent Studies

₁₈

–

K-anonymity on transaction data with minimal addition/deletion of items



Propose an O(log k)-approximation solution to

(74)

Recent Studies

₁₉

–

K-anonymity and K^m-anonymity

Data type | K-anonymity | K^m-anonymity

Relational + DGH

 Generalization & suppression: Samarati TKDE ’01, Sweeney IJUFKS ’02;

 Clustering: Li DAWAK ’06; Byun DASFAA ’07; Mohammed KDD ’09, LKC-privacy, Top-down;

Relational, no DGH

 Approximation algorithms (minimize suppression, clustering): Park SIGMOD ’07;

 Clustering: Aggarwal PODS ’06, r-gather, ;

(75)

Recent Studies

₂₀

–

Graph and Spatial Data

Graph + DGH

 Campan PinKDD ’08, SaNGreeA;

Graph, no DGH

1. Kuramochi DMKD ’05, find frequent patterns in large sparse graph;

2. Liu SIGMOD ’08, k-degree, k nodes with the same number of edges; random, small-world, scale-free, prefuse, Enron, powergrid, co-authors graphs;

3. Chang VLDB ’09, Predictive anonymity;

4. Zou VLDB ’09, K-automorphism for multiple structural attacks; prefuse & co-author graphs; Pajek software, Erdos Renyi and scale-free models generated;

5. Bhagat VLDB ’09, class-based anonymity;

6. Puttaswamy CoNEXT ’09, Intersection

(76)

Recent Studies

₂₁

–

Graph and Spatial Data

Graph, edge weight

1. Liu SDM ’09, anonymize sensitive edge weights of undirected graph, maintain shortest paths, Gaussian randomization, greedy perturbation, EIES and synthetic datasets;

2. Liu ICIS ’10, k-anonymous weighted edges on directed graph, k edges with weight differences less than ;

3. Das ICDE ’10, anonymize directed edge weights, keep linear property, generate linear inequalities for shortest paths, LP solver, model Flickr, LiveJournal, Orkut, YouTube;

(77)

Recent Studies

₂₂

–

K-anonymity and K^m-anonymity



Major issues

 Large data volume

 High dimensionality

 Sparseness

Approaches

 With DGH

 Generalization or suppression,

 bottom-up or top-down

 No DGH

 clustering,

(78)

Discussions

₁



From relational and set data to graph data

 Privacy-preserving social networking

 Privacy-preserving collaborative filtering

(79)

Discussions

₂




From centralized data to distributed data



Distributed databases

 Horizontally partitioned

 Grocery shopping data collected by different supermarkets

 Credit card databases of two different credit unions

 “fraudulent customers often have

(80)

Discussions

₃



From centralized data to distributed data

 Distributed databases

 Vertically partitioned

(81)

Discussions

₄




From data privacy to information privacy

 Hiding aggregated information

 Car dealer inventory: hide the stock level, not individual queries

 Airline: hide the total seats left, to prevent terrorists from choosing less crowded flights

(82)

References



Some websites

 Privacy-Preserving Data Mining

 Privacy Preserving Data Mining: Models and Algorithms (http://www.springerlink.com/content/978-0-387-70991-8)

 http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true

 http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html

 http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html

 Kdnuggets: www.kdnuggets.com