## Recent Studies in Privacy-Preserving Data Mining

Leon S.L. Wang

Department of Information Management, National University of Kaohsiung

## Outline

### Data Mining – a quick glance

### Privacy-Preserving Data Mining (PPDM)

Objective, common practices, possible attacks

### Current Research Areas

K-Anonymity, Utility, Distributed Privacy, Association Rule Hiding

### Recent Studies

## Data Mining (1)

### Market basket analysis (Association Rules)

*“If a customer purchases diapers, then he will very likely purchase beer.”*

### Sequences (Sequential Patterns)

*“A customer who bought an iPod three months ago is likely to order an iPhone within one month.”*

### Classification

Training Data

| Name | Rank | Years | Tenured |
|---|---|---|---|
| Mike | Assistant Prof | 3 | no |
| Mary | Assistant Prof | 7 | yes |

Training Data → Classification Algorithms → Classifier (Model)

## Data Mining (2)

| Name | Rank | Years | Tenured |
|---|---|---|---|
| Tom | Assistant Prof | 2 | no |
| Merlisa | Associate Prof | 7 | no |
| George | Professor | 5 | yes |
| Joseph | Assistant Prof | 7 | yes |

## Data Mining (3)

Classifier

Testing Data

Unseen Data: (Jeff, Professor, 4)

### Tenured?

## Data Mining (4)

### Clustering

• Unsupervised learning: Finds “natural” grouping of instances given unlabeled data

## Data Mining (5)

### Data mining:

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

### Alternative names:

Knowledge discovery in databases (**KDD**), knowledge extraction, data/pattern analysis, data archeology, data

## Privacy Preserving Data Mining (1)

*Motivating example – Group Insurance Commission: they found MA governor’s medical record*

## Privacy Preserving Data Mining (2)

*Motivating examples – Group Insurance*

Hospital Patient Data (Name, ID are hidden)

| DOB | Sex | Zipcode | Disease |
|---|---|---|---|
| 1/21/76 | Male | 53715 | Heart Disease |
| 4/13/86 | Female | 53715 | Hepatitis |
| 2/28/76 | Male | 53703 | Bronchitis |
| 1/21/76 | Male | 53703 | Broken Arm |
| 4/13/86 | Female | 53706 | Flu |
| 2/28/76 | Female | 53706 | Hang Nail |

Voter Registration Data (public info)

| Name | DOB | Sex | Zipcode |
|---|---|---|---|
| Andre | 1/21/76 | Male | 53715 |
| Beth | 1/10/81 | Female | 55410 |
| Carol | 10/1/44 | Female | 90210 |
| Dan | 2/21/84 | Male | 02174 |
| Ellen | 4/19/72 | Female | 02237 |

Andre has heart disease!
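The linkage in the two tables can be reproduced with a plain join on the shared attributes. The following is a minimal sketch, assuming the rows shown on the slide; the function name `link` and the rule of keeping only unique matches are illustrative choices, not part of the original material.

```python
# Hospital release: names and IDs removed, quasi-identifiers kept.
hospital = [
    ("1/21/76", "Male", "53715", "Heart Disease"),
    ("4/13/86", "Female", "53715", "Hepatitis"),
    ("2/28/76", "Male", "53703", "Bronchitis"),
    ("1/21/76", "Male", "53703", "Broken Arm"),
    ("4/13/86", "Female", "53706", "Flu"),
    ("2/28/76", "Female", "53706", "Hang Nail"),
]

# Public voter registration list.
voters = [
    ("Andre", "1/21/76", "Male", "53715"),
    ("Beth", "1/10/81", "Female", "55410"),
    ("Carol", "10/1/44", "Female", "90210"),
    ("Dan", "2/21/84", "Male", "02174"),
    ("Ellen", "4/19/72", "Female", "02237"),
]


def link(hospital_rows, voter_rows):
    """Join both tables on (DOB, Sex, Zipcode); keep only unique matches."""
    by_qi = {}
    for dob, sex, zipcode, disease in hospital_rows:
        by_qi.setdefault((dob, sex, zipcode), []).append(disease)
    matches = {}
    for name, dob, sex, zipcode in voter_rows:
        diseases = by_qi.get((dob, sex, zipcode), [])
        if len(diseases) == 1:  # exactly one patient fits this voter
            matches[name] = diseases[0]
    return matches


reidentified = link(hospital, voters)
print(reidentified)
```

Only Andre's combination of DOB, sex and zipcode appears exactly once in the hospital table, which is why removing names and IDs alone does not protect him.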

## Privacy Preserving Data Mining (3)

*Motivating examples – A Face Is Exposed for AOL Searcher No. 4417749*

*Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher’s anonymity, but it was not much of a shield. – New York Times, August 9, 2006.*

*Thelma Arnold's*

## Privacy Preserving Data Mining (4)

*Motivating examples – America Online*

*~650k users, 3-month period, ~20 million queries released*

*No name, no SSN, no driver license #, no credit card #*

*The user, ID 4417749, was found to be Thelma Arnold, a 62-year-old woman living in Georgia.*

*Loss of privacy to users, damage to AOL, significant damage to academics who depend on such data.*

## Privacy Preserving Data Mining (5)

*Motivating examples – Netflix Prize*

In October of 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.

Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.

## Privacy Preserving Data Mining (6)

*Motivating examples – Association Rules*

### Supplier

ABC Paper Company

### Retailer

XYZ Supermarket Chain

1. Allow ABC to access customer XYZ’s DB

2. Predict XYZ’s inventory needs & offer reduced prices

## Privacy Preserving Data Mining (7)

### Supplier ABC discovers (through data mining):

Sequence: Cold remedies -> Facial tissue

*Association: (Skim milk, Green paper)*

### Supplier ABC runs coupon marketing campaign:

“50 cents off skim milk when you buy ABC products”

## Privacy Preserving Data Mining (1) - Objective

### Privacy

The state of being private; the state of not being seen by others

### Database security

To prevent loss of privacy due to viewing/disclosing *unauthorized* data

### PPDM

## Privacy Preserving Data Mining (2) - Common Practices

### Limiting access

Control access to the data

Used by secure DBMS community

### “Fuzz” the data

Forcing aggregation into daily records instead of individual transactions or slightly altering data values

## Privacy Preserving Data Mining (3) - Common Practices

### Eliminate unnecessary groupings

The first 3 digits of SSNs are assigned by office sequentially

Clustering high-order bits of a “unique identifier” is likely to group similar data elements

Unique identifiers are assigned randomly

### Augment the data

Populate the phone book with extra, fictitious people in non-obvious ways

Return correct info when asking an individual, but return incorrect info when asking all individuals in a department

## Privacy Preserving Data Mining (4) - Common Practices

### Audit

Detect misuse by legitimate users

Administrative or criminal disciplinary action may be initiated

## Privacy Preserving Data Mining (5) - Possible Attacks

### Linking attacks (Sweeney IJUFKS ’02)

Re-identification

Identity linkage (K-anonymity)

Attribute linkage (l-diversity)

Hospital Patient Data (Name, ID are hidden)

| DOB | Sex | Zipcode | Disease |
|---|---|---|---|
| 1/21/76 | Male | 53715 | Heart Disease |
| 4/13/86 | Female | 53715 | Hepatitis |
| 2/28/76 | Male | 53703 | Bronchitis |
| 1/21/76 | Male | 53703 | Broken Arm |
| 4/13/86 | Female | 53706 | Flu |
| 2/28/76 | Female | 53706 | Hang Nail |

Voter Registration Data (public info)

| Name | DOB | Sex | Zipcode |
|---|---|---|---|
| Andre | 1/21/76 | Male | 53715 |
| Beth | 1/10/81 | Female | 55410 |
| Carol | 10/1/44 | Female | 90210 |
| Dan | 2/21/84 | Male | 02174 |
| Ellen | 4/19/72 | Female | 02237 |

## Privacy Preserving Data Mining (6) - Possible Attacks

### Corruption attacks (Tao ICDE08, Chaytor ICDM09)

Background knowledge; Perturbed generalization (PG)

## Privacy Preserving Data Mining (7) - Possible Attacks

### Differential privacy (Dwork ICALP ’06)

Add noise to the data set so that the output of any query on 30 records differs only slightly (within a differential) from its output on any 29 of those records.
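One common way to realize this idea for counting queries is the Laplace mechanism. The slide does not name a mechanism, so the sketch below is an assumption: the helper names, the `epsilon` parameter and the sample ages are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Count matching records, then add Laplace noise with scale 1/epsilon.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity of the query is 1.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages_30 = [23, 35, 41, 29, 52, 38, 27, 44, 31, 60] * 3  # 30 hypothetical records
ages_29 = ages_30[:-1]                                  # the same data minus one record


def over_30(age):
    return age > 30


print(noisy_count(ages_30, over_30, 0.5, rng))
print(noisy_count(ages_29, over_30, 0.5, rng))
```

The true counts on the 30-record and 29-record data sets differ by one, and the added noise is meant to make the two noisy answers hard to tell apart; a smaller `epsilon` gives more noise and stronger protection.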

## Privacy Preserving Data Mining (8) - Possible Attacks

### Realistic adversaries (Machanavajjhala VLDB ’09)

Weak privacy:

l-diversity, t-closeness

Adversaries need to know very specific information

Strong privacy:

Differential privacy

Adversaries need to know all information except the victim

*Epsilon-privacy*

Adversary’s knowledge can vary and learn, and is

## Privacy Preserving Data Mining (9) - Possible Attacks

### Structural attacks (Zhou VLDB ’09)

Degree attack: *knowing Bob has 4 friends* =>
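A degree attack needs nothing more than a scan over vertex degrees. The sketch below uses a small hypothetical friendship graph, since the slide's figure did not survive; the vertex numbers and the function name `degree_attack` are assumptions for illustration only.

```python
# Hypothetical anonymized friendship graph: vertex id -> set of neighbour ids.
graph = {
    1: {2, 3},
    2: {1, 3, 7},
    3: {1, 2, 7},
    4: {7},
    5: {6, 7},
    6: {5},
    7: {2, 3, 4, 5},
}


def degree_attack(adjacency, known_degree):
    """Return the vertices whose degree matches the adversary's background knowledge."""
    return sorted(v for v, nbrs in adjacency.items() if len(nbrs) == known_degree)


candidates = degree_attack(graph, 4)
print(candidates)
```

When only one vertex has the known degree, removing names from the graph gives the target no protection against this attack.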

## Privacy Preserving Data Mining (10) - Possible Attacks

### Structural attacks (Zhou VLDB ’09)

Sub-graph attack (one match), knowing *Bob’s friends and friends’ friends* => *Vertex 7 is Bob*

## Privacy Preserving Data Mining (11) - Possible Attacks

### Structural attacks (Zhou VLDB ’09)

Sub-graph attack (k-match), knowing *Bob’s friends* => 8 matches, but they share common labels 6 & 7 => *vertex 7 is still uniquely identified as Bob*

## Privacy Preserving Data Mining (12) - Possible Attacks

### Structural attacks (Zhou VLDB ’09)

Hub fingerprint attacks

Some hubs have been identified, adversary knows the distance between Bob and hubs

## Privacy Preserving Data Mining (13) - Possible Attacks

### Non-structural attacks (Bhagat VLDB ’09)

Label attack

*Interaction graph*

## Privacy Preserving Data Mining (14) - Possible Attacks

### Non-structural attacks (Bhagat VLDB ’09)

## Privacy Preserving Data Mining (15) - Possible Attacks

### Non-structural attacks (Bhagat VLDB ’09)

## Privacy Preserving Data Mining (16) - Possible Attacks

### Active attacks (Backstrom WWW ’07)

Plant a subgraph *H* into *G* and connect it to targeted nodes (add new nodes and edges)

Recover *H* from *G* & identify targeted nodes’ identity and relationships

Walk-based (largest *H*), cut-based (smallest *H*)

## Privacy Preserving Data Mining (17) - Possible Attacks

### Passive attacks (no nodes, no edges added)

Start from a coalition of friends (nodes) in anonymized graph *G*, discover the existence of edges among the users to whom they are linked

### Semi-passive attacks (add edges only, no nodes)

From existing nodes in *G*, add fake edges to targeted nodes

## Privacy Preserving Data Mining (18) - Possible Attacks

### Intersection attacks (Puttaswamy CoNEXT ’09)

Two users, *A* and *B*, were compromised,

*A* queries server for the visitors of “website xyz”,

*B* queries server for the visitors of “website xyz”,

## Privacy Preserving Data Mining (19) - Possible Attacks

### Intersection attacks

StarClique (add latent edges)

The graph evolution process for a node. The node first selects a subset of its neighbors. Then it builds a clique with the members of this subset. Finally, it connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.

## Privacy Preserving Data Mining (20) - Possible Attacks

### Relationship attacks (Liu SDM ’09)

Sensitive edge weights, e.g. transaction expenses in business network,

Reveal the shortest path between source and sink, e.g., *A -> D*,

## Privacy Preserving Data Mining (21) - Possible Attacks

### Relationship attacks (Liu SDM ’09)

Preserving the shortest path, e.g. *A -> D*,

Min perturbation on path length, path weight,

## Privacy Preserving Data Mining (22) - Possible Attacks

### Relationship attacks (Liu ICIS ’10)

Preserving the shortest path, e.g. *A -> D*,

*K-anonymous weight privacy*

the blue edge group and the green edge group satisfy the
4-anonymous privacy where * =10.*


## Privacy Preserving Data Mining (23) - Possible Attacks

### Relationship attacks (Das ICDE ’10)

Preserving linear property, e.g., shortest paths,

The ordering of the five edge weights is preserved after naïve anonymization.

$x_5 \le x_1 \le x_4 \le x_3 \le x_2$, where $x_1=(v_1, v_2)$, $x_2=(v_1, v_4)$, $x_3=(v_1, v_3)$, $x_4=(v_2, v_4)$, $x_5=(v_3, v_4)$

*The minimum cost spanning tree is preserved:* $\{(v_1, v_2), (v_2, v_4), (v_1, v_3)\}$

*The shortest path from $v_1$ to $v_4$* is changed.

The ordering of the edge weights is still exposed. *For example, $v_3$ and $v_4$ are best friends and $v_1$ and $v_4$ are not so good friends.*

## Privacy Preserving Data Mining (24) - Possible Attacks

### Location Privacy (Papadopoulos VLDB ’10)

Preserving the privacy of the location of user in

## Privacy Preserving Data Mining (25) - Possible Attacks

### Location Privacy (Papadopoulos VLDB ’10)

Location obfuscation

Send additional set of “dummy” queries, in addition to actual query

Data transformation

Encrypted query is sent to LBS

PIR-based location privacy

PIR-based queries are sent to LBS server and retrieve blocks without server discovering which blocks are requested
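Of the three approaches, location obfuscation is the simplest to sketch. The example below is a simplified illustration, not the scheme from the cited paper: dummies are drawn uniformly from a bounding box, and the function name, coordinates and bounds are hypothetical.

```python
import random


def obfuscated_queries(actual, num_dummies, bounds, rng):
    """Mix the real location with random dummy locations, in shuffled order."""
    (min_x, max_x), (min_y, max_y) = bounds
    queries = [actual]
    for _ in range(num_dummies):
        queries.append((rng.uniform(min_x, max_x), rng.uniform(min_y, max_y)))
    rng.shuffle(queries)  # the server cannot tell the real query by its position
    return queries


rng = random.Random(42)
batch = obfuscated_queries((22.73, 120.28), 4, ((22.0, 23.0), (120.0, 121.0)), rng)
print(batch)
```

The client keeps only the answer to its real query; the cost is that the server processes several queries for each real one.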

## Privacy Preserving Data Mining (26) - Possible Attacks

### Inference through data mining attacks

Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden

Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ

$D_O$ → DM → $R_O$

## Privacy Preserving Data Mining (27) - Possible Attacks

### ρ-uncertainty (Cao VLDB ’10)

Given a transaction dataset, sensitive items $I_s$, and uncertainty level $\rho$, the objective is to make the confidence of all sensitive association rules less than $\rho$, i.e., $\mathrm{Conf}(\chi \rightarrow \alpha) < \rho$, where $\chi \subseteq I$ and $\alpha \in I_s$.

*If Alice knows Bob bought $b_1$, then she knows Bob also bought $\{a_1, b_2, \alpha,\ \}$, where $I_s = \{\alpha,\ \}$*
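The ρ-uncertainty condition can be checked by brute force on a small dataset. The sketch below is an illustration under assumptions: the transactions, the single sensitive item `"s"` and the `max_lhs` bound are hypothetical, and real datasets would need a smarter enumeration.

```python
from itertools import combinations


def violates_rho_uncertainty(transactions, sensitive, rho, max_lhs=2):
    """Return rules (lhs, alpha, conf) with alpha sensitive and conf >= rho."""
    items = sorted(set().union(*transactions))
    violations = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(items, size):
            lhs_set = set(lhs)
            support_lhs = sum(1 for t in transactions if lhs_set <= t)
            if support_lhs == 0:
                continue  # the antecedent never occurs, so there is no rule
            for alpha in sorted(sensitive - lhs_set):
                support_both = sum(
                    1 for t in transactions if lhs_set <= t and alpha in t
                )
                conf = support_both / support_lhs
                if conf >= rho:
                    violations.append((lhs, alpha, conf))
    return violations


# Hypothetical data: "s" is the only sensitive item.
data = [{"a1", "b1", "s"}, {"b1", "s"}, {"a1", "b2"}, {"b2"}]
violations = violates_rho_uncertainty(data, {"s"}, rho=0.7)
print(violations)
```

Here everyone who bought `b1` also bought the sensitive item, so any rule with `b1` in its antecedent breaks the bound until items are suppressed or generalized.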

## Privacy Preserving Data Mining (28) - Possible Attacks

### ρ-uncertainty (Cao VLDB ’10)

Given a transaction dataset, sensitive items $I_s$, uncertainty level $\rho = 0.7$, a hierarchy of non-sensitive items, the published data after suppression

## Current Research Areas (1)

### Privacy-Preserving Data Publishing

K-anonymity

Try to prevent re-identification

### Utility-based Privacy-Preserving

### Distributed Privacy with Adversarial Collaboration

### Privacy-Preserving Application

Association rules hiding

## Privacy Preserving Data Publishing (1)

Hospital Patient Data (Name, ID are hidden)

| DOB | Sex | Zipcode | Disease |
|---|---|---|---|
| 1/21/76 | Male | 53715 | Heart Disease |
| 4/13/86 | Female | 53715 | Hepatitis |
| 2/28/76 | Male | 53703 | Bronchitis |
| 1/21/76 | Male | 53703 | Broken Arm |
| 4/13/86 | Female | 53706 | Flu |
| 2/28/76 | Female | 53706 | Hang Nail |

Voter Registration Data (public info)

| Name | DOB | Sex | Zipcode |
|---|---|---|---|
| Andre | 1/21/76 | Male | 53715 |
| Beth | 1/10/81 | Female | 55410 |
| Carol | 10/1/44 | Female | 90210 |
| Dan | 2/21/84 | Male | 02174 |
| Ellen | 4/19/72 | Female | 02237 |

Andre has heart disease!

## Privacy Preserving Data Publishing (2)

## Privacy Preserving Data Publishing (3)

### “Fuzz” the data

*k-anonymity, at least k tuples in one group*
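Whether a table meets "at least k tuples in one group" can be checked by grouping rows on their quasi-identifiers. The sketch below reuses the hospital rows from the earlier slide; the particular generalization (birth year only, sex suppressed, 3-digit zip prefix) is a hypothetical choice made for illustration.

```python
from collections import Counter


def anonymity_level(rows):
    """Smallest group size when rows are grouped by their quasi-identifier tuple."""
    return min(Counter(rows).values())


def generalize(dob, sex, zipcode):
    """Hypothetical generalization: keep birth year, drop sex, keep a 3-digit zip prefix."""
    year = dob.split("/")[-1]
    return (year, "*", zipcode[:3] + "**")


raw = [
    ("1/21/76", "Male", "53715"),
    ("4/13/86", "Female", "53715"),
    ("2/28/76", "Male", "53703"),
    ("1/21/76", "Male", "53703"),
    ("4/13/86", "Female", "53706"),
    ("2/28/76", "Female", "53706"),
]
generalized = [generalize(*row) for row in raw]

print(anonymity_level(raw))          # every raw row is unique
print(anonymity_level(generalized))  # the smallest group after generalization
```

The generalized table is 2-anonymous at the cost of coarser data, which is the privacy versus utility trade-off that later slides return to.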

## Utility-based Privacy-Preserving (1)

## Utility-based Privacy-Preserving (2)

## Utility-based Privacy-Preserving (3)

### Q1: “How many customers under age 29 are there in the data set?”

### Q2: “Is an individual with age=25, education=Bachelor, Zip Code = 53712 a target customer?”

### Table 2, answers: “2”; “Y”

## Distributed Privacy with Adversarial Collaboration

### Input privacy (2)

$D_1 + D_2 + D_3$ → DM → $R_O$

### Association rule mining

### Input: $D_O$, min_supp, min_conf

### Output: $R_O$

| TID | Items |
|---|---|
| T1 | ABC |
| T2 | ABC |
| T3 | ABC |
| T4 | AB |
| T5 | A |
| T6 | AC |

min_supp=33%, min_conf=70%

Support counts: |A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3

| # | Rule | Support | Confidence |
|---|---|---|---|
| 1 | B=>A | 66% | 100% |
| 2 | C=>A | 66% | 100% |
| 3 | B=>C | 50% | 75% |
| 4 | C=>B | 50% | 75% |
| 5 | AB=>C | 50% | 75% |
| 6 | AC=>B | 50% | 75% |
| 7 | BC=>A | 50% | 100% |
| 8 | C=>AB | 50% | 75% |
| 9 | B=>AC | 50% | 75% |
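The nine rules can be re-derived from the transactions with a brute-force enumeration. This is a minimal sketch, not an Apriori implementation; it assumes a sixth transaction T6 = AC, which is what the listed counts |A|=6 and |AC|=4 require, and the function names are illustrative.

```python
from itertools import combinations

# Six transactions, assuming T6 = AC so that the slide's counts (|A|=6, |AC|=4) hold.
transactions = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)


def rule_stats(lhs, rhs, db):
    """Return (support, confidence) of the rule lhs => rhs as fractions."""
    both = support_count(set(lhs) | set(rhs), db)
    lhs_count = support_count(lhs, db)
    confidence = both / lhs_count if lhs_count else 0.0
    return both / len(db), confidence


def mine_rules(db, min_supp, min_conf):
    """Brute-force enumeration of all rules meeting both thresholds."""
    items = sorted(set().union(*db))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    supp, conf = rule_stats(lhs, rhs, db)
                    if supp >= min_supp and conf >= min_conf:
                        rules.append(("".join(lhs), "".join(rhs)))
    return rules


rules = mine_rules(transactions, 0.33, 0.70)
print(len(rules), rules)
```

Rules such as A=>B are absent from the result because their confidence (4/6) falls below the 70% threshold.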


*Input: $D_O$, $X$ (items to be hidden on LHS), min_supp, min_conf*

*Output: $D_M$*

| TID | Items |
|---|---|
| T1 | ABC |
| T2 | ABC |
| T3 | ABC |
| T4 | AB |
| T5 | A |
| T6 | AC |

min_supp=33%, min_conf=70%

*X = {C}*

One ABC transaction is modified to *AC* in $D_M$.

$D_M$ Association Rules

| # | Rule | Support | Confidence |
|---|---|---|---|
| 1 | B=>A | 50% | 100% |
| 2 | C=>A | 66% | 100% |
| 3 | B=>C | 33% | 66% |
| 4 | C=>B | 33% | 50% |
| 5 | AB=>C | 33% | 66% |
| 6 | AC=>B | 33% | 50% |
| 7 | BC=>A | 33% | 100% |
| 8 | C=>AB | 33% | 50% |
| 9 | B=>AC | 33% | 66% |
| 10 | A=>B | 50% | 50% |
| 11 | A=>C | 66% | 66% |
| 12 | A=>BC | 33% | 33% |

Support counts: |A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

Annotations on the slide: hidden (3 rules), lost (3 rules), try (2 rules)

## Association Rule Hiding (3) - Side effects

### Hiding failure, lost rules, new rules

(Figure: overlapping rule sets, with region ② marked “Lost Rules”)

## Association Rule Hiding (4)

### Output privacy

Diagram: $D_O$, $R_O$, $D_M$, $R_M$; steps: DM, DM, Modification

### Input privacy (1)

Diagram: $D_O$, $R_O$, $D'_O$; steps: DM, DM, Modification

### $R_M = R_O - R_H$

## Recent Studies (1) - Informative Association Rule Set (IRS)

Informative Association rules

Input: $D_O$, min_supp, min_conf, *X = {C}*

Output: *{C=>A (66%, 100%), C=>B (50%, 75%)}*

| TID | Items |
|---|---|
| T1 | ABC |
| T2 | ABC |
| T3 | ABC |
| T4 | AB |

| # | Rule | Support | Confidence |
|---|---|---|---|
| 1 | B=>A | 66% | 100% |
| 2 | C=>A | 66% | 100% |
| 3 | B=>C | 50% | 75% |
| 4 | C=>B | 50% | 75% |
| 5 | AB=>C | 50% | 75% |
| 6 | AC=>B | 50% | 75% |
| 7 | BC=>A | 50% | 100% |
| 8 | C=>AB | 50% | 75% |
| 9 | B=>AC | 50% | 75% |

Rules #6, 7, 8 predict the same RHS *{A,B}* as #2, 4

## Recent Studies (2) - Hiding IRS _{1} (LHS)

*Input: $D_O$, $X$ (items to be hidden on LHS), min_supp, min_conf*

*Output: $D_M$*

| TID | Items |
|---|---|
| T1 | ABC |
| T2 | ABC |
| T3 | ABC |
| T4 | AB |
| T5 | A |
| T6 | AC |

min_supp=33%, min_conf=70%

*X = {C}*

One ABC transaction is modified to *AC* in $D_M$.

$D_M$ Association Rules

| # | Rule | Support | Confidence |
|---|---|---|---|
| 1 | B=>A | 50% | 100% |
| 2 | C=>A | 66% | 100% |
| 3 | B=>C | 33% | 66% |
| 4 | C=>B | 33% | 50% |
| 5 | AB=>C | 33% | 66% |
| 6 | AC=>B | 33% | 50% |
| 7 | BC=>A | 33% | 100% |
| 8 | C=>AB | 33% | 50% |
| 9 | B=>AC | 33% | 66% |
| 10 | A=>B | 50% | 50% |
| 11 | A=>C | 66% | 66% |
| 12 | A=>BC | 33% | 33% |

Support counts: |A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

Annotations on the slide: hidden (3 rules), lost (3 rules), try (2 rules)

## Recent Studies (3) – Proposed Algorithms

### Strategy:

*To lower the confidence of a given rule $X \Rightarrow Y$,* either

Increase the support of X, but not XY, OR

Decrease the support of XY (or both X and XY)

$$\mathrm{Conf}(X \Rightarrow Y) = \frac{|XY|}{|X|} = \frac{\mathrm{support}(XY)}{\mathrm{support}(X)}$$
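Both strategies can be tried on the six-transaction example, with C=>B as the rule to hide. This is a minimal sketch: which transactions to modify (T1 and T5 here) is an illustrative assumption, not the selection rule used by the proposed algorithms.

```python
def confidence(lhs, rhs, db):
    """Confidence of lhs => rhs: support count of lhs+rhs over support count of lhs."""
    lhs_set = set(lhs)
    both = lhs_set | set(rhs)
    return sum(1 for t in db if both <= t) / sum(1 for t in db if lhs_set <= t)


db = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
before = confidence("C", "B", db)  # 3 of the 4 transactions with C also contain B

# Decrease the support of XY: delete B from T1, leaving AC.
decreased = [set(t) for t in db]
decreased[0].discard("B")
after_decrease = confidence("C", "B", decreased)

# Increase the support of X only: add C to T5 = {A}.
increased = [set(t) for t in db]
increased[4].add("C")
after_increase = confidence("C", "B", increased)

print(before, after_decrease, after_increase)
```

Either single change pushes the confidence of C=>B below min_conf = 70%, but each one also shifts the support and confidence of other rules, which is where lost rules and new rules come from.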

## Recent Studies (4) – Proposed Algorithms

### Multiple scans of database (Apriori-based)

Increase Support of LHS First (ISL)

Decrease Support of RHS First (DSR)

### One scan of database

Decrease Support and Confidence (DSC)

Propose Pattern-Inversion tree to store data

### Maintenance of hiding informative association rule sets

## Recent Studies (5) - Analysis

### For multiple-scan algorithms:

### Time effects

DSR is faster than ISL

Because the set of candidate transactions is smaller

### Database effects

## Recent Studies (6) - Analysis

### Side effects

DSR: no hiding failure (0%), few new rules (5%) and some lost rules (11%)

ISL: shows some hiding failure (12.9%), many new

## Recent Studies (7) - Proposed Algorithms

### One-scan algorithm DSC

Pattern-inversion tree

| TID | Items |
|---|---|
| $T_1$ | ABC |
| $T_2$ | ABC |
| $T_3$ | ABC |

Tree nodes: Root; A:6:[$T_5$]; B:4:[$T_4$]; C:1:[$T_6$]

## Recent Studies (8) - Proposed Algorithms

### One-scan algorithm DSC

Time effects, Database effects

## Recent Studies (9) - Proposed Algorithms

## Recent Studies (10) - Analysis

### For single-scan algorithm:

DSC is $O(2|D| + |X| \cdot l_2 \cdot K \cdot \log K)$, where $|X|$ is the number of items in $X$, $l_2$ is the maximum number of large 2-itemsets, and $K$ is the maximum number of iterations in the DSC algorithm.

SWA is $O((n_1 - 1) \cdot n_1 / 2 \cdot |D| \cdot K_w)$, where $n_1$ is the initial number of restrictive rules in the database $D$ and $K_w$ is the window size chosen.

SWA has a higher order of complexity, $O(l_2^2 \cdot |X|^2 \cdot |D|^2)$, if $K_w$ $|D|$

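## Recent Studies (11) – Maintenance of Hiding IRS

### Maintenance of hiding informative association rule sets:

| TID | $D$ | $D'$ |
|---|---|---|
| $T_1$ | 111 | 001 |
| $T_2$ | 111 | 111 |
| $T_3$ | 111 | 111 |
| $T_4$ | 110 | 110 |

Added transactions:

| TID | Items |
|---|---|
| $T_7$ | 101 |
| $T_8$ | 101 |
| $T_9$ | 110 |

$D^+ = D +$ the added transactions $T_7$ to $T_9$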

## Recent Studies (12) – Maintenance of Hiding IRS

### Maintenance of hiding informative association rule sets:

| TID | $D^+$ | (DSC) | (MSI) |
|---|---|---|---|
| $T_1$ | 111 | 111 | 001 |
| $T_2$ | 111 | 111 | 111 |
| $T_3$ | 111 | 111 | 111 |
| $T_4$ | 110 | 110 | 110 |
| $T_5$ | 100 | 100 | 100 |
| $T_6$ | 101 | 001 | 001 |
| $T_7$ | 101 | 001 | 001 |
| $T_8$ | 101 | 101 | 101 |
| $T_9$ | 110 | 110 | 110 |

## Recent Studies (13) – Maintenance of Hiding IRS

## Recent Studies (14) – Maintenance of Hiding IRS

## Recent Studies (15) – K-anonymity and $K^m$-anonymity

### K-anonymity

Every record has at least k-1 other records that are identical on the quasi-identifiers (QIs)

### $K^m$-anonymity

The support count of every m-itemset ≥ k

### Domain Generalization Hierarchy (DGH)

### Data types
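The $K^m$-anonymity condition on transaction data can be checked by counting itemsets. The sketch below assumes the common reading that every itemset of size up to m occurring in the data must appear in at least k transactions; the sample transactions and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """True if every itemset of size <= m that occurs in the data occurs in >= k transactions."""
    counts = Counter()
    for t in transactions:
        for size in range(1, m + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return all(c >= k for c in counts.values())


# Hypothetical transaction data.
baskets = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}]

print(is_km_anonymous(baskets, k=2, m=1))  # every single item occurs at least twice
print(is_km_anonymous(baskets, k=2, m=2))  # the pair (b, c) occurs only once
```

An adversary who knows up to m items of a victim's transaction can then never narrow the candidates to fewer than k transactions.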

## Recent Studies (16) – K-anonymity on transaction data with minimal addition/deletion of items

### 3-anonymity

$\{T_1, T_4, T_5\}$, $\{T_2, T_3, T_6\}$

| | $e_1$ | $e_2$ | $e_3$ |
|---|---|---|---|
| $T_1$ | 1 | 1 | 0 |
| $T_2$ | 0 | 0 | 1 |
| $T_3$ | 1 | 0 | 1 |
| $T_4$ | 1 | 0 | 0 |
| $T_5$ | 1 | 0 | 0 |
| $T_6$ | 1 | 0 | 1 |

| | $e_1$ | $e_2$ | $e_3$ |
|---|---|---|---|
| $T_1$ | 1 | 0 | 0 |
| $T_2$ | 1 | 0 | 1 |
| $T_3$ | 1 | 0 | 1 |
| $T_4$ | 1 | 0 | 0 |
| $T_5$ | 1 | 0 | 0 |
| $T_6$ | 1 | 0 | 1 |

## Recent Studies (17) – K-anonymity on transaction data with minimal addition/deletion of items

| | $a_1$ | $a_2$ | $a_3$ |
|---|---|---|---|
| $T_1$ | 1 | 1 | 1 |
| $T_2$ | 1 | 1 | 0 |
| $T_4$ | 0 | 1 | 1 |
| $T_5$ | 1 | 0 | 0 |

| | $a_1$ | $a_2$ | $a_3$ |
|---|---|---|---|
| $T_3$ | 1 | 0 | 1 |
| $T_6$ | 0 | 1 | 0 |

| | $a_1$ | $a_2$ | $a_3$ |
|---|---|---|---|
| $T_1$ | 1 | 1 | 1 |
| $T_2$ | 1 | 1 | 0 |
| $T_3$ | 1 | 0 | 1 |
| $T_4$ | 0 | 1 | 1 |

| | $a_1$ | $a_2$ | $a_3$ |
|---|---|---|---|
| $T_5$ | 1 | 0 | 0 |
| $T_6$ | 0 | 1 | 0 |

## Recent Studies (18) – K-anonymity on transaction data with minimal addition/deletion of items

*Propose an O(log k)-approximation solution to*
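The 3-anonymity example on the earlier slide (six transactions over items e1 to e3) can be verified by counting identical rows and the number of item additions and deletions. This sketch only checks a given solution; it does not implement the approximation algorithm, and the function names are illustrative.

```python
from collections import Counter

# Bit vectors over (e1, e2, e3), as in the 3-anonymity example.
original = {
    "T1": (1, 1, 0), "T2": (0, 0, 1), "T3": (1, 0, 1),
    "T4": (1, 0, 0), "T5": (1, 0, 0), "T6": (1, 0, 1),
}
anonymized = {
    "T1": (1, 0, 0), "T2": (1, 0, 1), "T3": (1, 0, 1),
    "T4": (1, 0, 0), "T5": (1, 0, 0), "T6": (1, 0, 1),
}


def anonymity_level(table):
    """Size of the smallest group of identical transactions."""
    return min(Counter(table.values()).values())


def modification_cost(before, after):
    """Number of item additions and deletions (flipped bits) between two tables."""
    return sum(
        a != b
        for tid in before
        for a, b in zip(before[tid], after[tid])
    )


print(anonymity_level(original), anonymity_level(anonymized))
print(modification_cost(original, anonymized))
```

Deleting e2 from T1 and adding e1 to T2 is enough to form the two groups {T1, T4, T5} and {T2, T3, T6}.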

## Recent Studies (19) – K-anonymity and $K^m$-anonymity

Columns: Data type; K-anonymity; $K^m$-anonymity

Relational + DGH

Generalization & suppression: Samarati TKDE ’01, Sweeney IJUFKS ’02;

Clustering: Li DAWAK ’06; Byun DASFAA ’07; Mohammed KDD ’09, LKC-privacy, Top-down;

Relational, no DGH

Approximation algorithms (minimize suppression, clustering): Park SIGMOD ’07;

Clustering: Aggarwal PODS ’06, r-gather;

## Recent Studies (20) – Graph and Spatial Data

Columns: Data type; K-anonymity; $K^m$-anonymity

Graph + DGH

Campan PinKDD08, SaNGreeA;

Graph, no DGH

1. Kuramochi DMKD ’05, find frequent patterns in large sparse graph;

2. Liu SIGMOD ’08, k-degree, k nodes w/ same edge #, random, small-world, scale-free, prefuse, Enron, powergrid, co-authors graphs;

3. Chang VLDB ’09, Predictive anonymity;

4. Zou VLDB ’09, K-automorphism for multiple structural attacks, prefuse & co-author graphs, Pajek software, Erdős–Rényi and scale-free models generated;

5. Bhagat VLDB ’09, class-based anonymity;

6. Puttaswamy CoNEXT ’09, Intersection

## Recent Studies (21) – Graph and Spatial Data

Columns: Data type; K-anonymity; $K^m$-anonymity

Graph, edge weight

1. Liu SDM ’09, anonymize sensitive edge weight of undirected graph, maintain shortest paths, Gaussian randomization, greedy perturbation, EIES and synthetic datasets;

2. Liu ICIS ’10, k-anonymous weighted edge on directed graph, k edges with weight differences less than ;

3. Das ICDE ’10, anonymize directed edge weight, keep linear property, generate linear inequalities for shortest paths, LP solver, model Flickr, LiveJournal, Orkut, YouTube;

## Recent Studies (22) – K-anonymity and $K^m$-anonymity

### Major issues

Large data volume

High dimensionality

Sparseness

### Approaches

With DGH: Generalization or suppression, bottom-up or top-down

No DGH: clustering,

## Discussions (1)

### From relational and set data to graph data

Privacy-preserving social networking

Privacy-preserving collaborative filtering

## Discussions (2)

**From centralized data to distributed data**

### Distributed databases

Horizontally partitioned

Grocery shopping data collected by different supermarkets

Credit card databases of two different credit unions

“fraudulent customers often have

## Discussions (3)

**From centralized data to distributed data**

**Distributed databases**

Vertically partitioned

## Discussions (4)

**From data privacy to information privacy**

Hiding aggregated information

Car dealer inventory: hide stock, not individual query

Airline: hide total seats left, to prevent terrorists from choosing less crowded flights

## References

### Some websites

Privacy-Preserving Data Mining

Privacy Preserving Data Mining: Models and Algorithms (http://www.springerlink.com/content/978-0-387-70991-8)

http://www.springer.com/west/home?SGWID=4-102-22-52496494-0&changeHeader=true

http://www.cs.umbc.edu/~kunliu1/research/privacy_review.html

http://www.cs.ualberta.ca/~oliveira/psdm/pub_by_year.html

Kdnuggets: www.kdnuggets.com