
(1)

Privacy Preserving Data Modeling:

An Overview

Leon S.L. Wang

(2)

Outline

Private Query of Public Information

Privacy Preservation
• Objective, common practices, possible attacks

Current Research Areas
• K-Anonymity, Utility, Distributed Privacy, Association Rule Hiding, Location Privacy

Recent Studies
• Association Rule Set Hiding
• K-anonymity on Transaction Data
• K-anonymous path privacy

(3)

Private query of public information

Querying public information is a daily activity.
• Easy access to your own data, any time, any place
• But it is just as easy for others to get your data (a potential privacy breach)

(4)

Private query of public information

Data privacy: the user cannot retrieve data other than what was requested
(and cannot infer new information).
• Example: inferring name, ID, medical records, political/religious/sexual preferences, …
• Example: requesting data block i, but receiving data blocks i-1, i, i+1, …

User privacy: the server is not aware of the data a user is actually retrieving.
• Example: Alice is searching for gold (going around with a GPS) and checking the server to verify whether the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?

(5)

Privacy Preservation 1

Motivating example – Group Insurance Commission: the MA governor's medical record was found in the released data.

(6)

Privacy Preservation 2

Motivating example – Group Insurance Commission (linking attack)

Hospital Patient Data (Name, ID are hidden):
  DOB      Sex     Zipcode  Disease
  1/21/76  Male    53715    Heart Disease
  4/13/86  Female  53715    Hepatitis
  2/28/76  Male    53703    Bronchitis
  1/21/76  Male    53703    Broken Arm
  4/13/86  Female  53706    Flu
  2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info):
  Name   DOB      Sex     Zipcode
  Andre  1/21/76  Male    53715
  Beth   1/10/81  Female  55410
  Carol  10/1/44  Female  90210
  Dan    2/21/84  Male    02174
  Ellen  4/19/72  Female  02237

• Joining the two tables on (DOB, Sex, Zipcode) reveals that Andre has heart disease!
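The re-identification here is simply a join on the quasi-identifier attributes (DOB, Sex, Zipcode). A minimal sketch of the attack, assuming pandas is available and using the values from the slide's tables:

```python
import pandas as pd

# "Anonymized" hospital data: names and IDs removed, quasi-identifiers kept.
hospital = pd.DataFrame(
    [("1/21/76", "Male", "53715", "Heart Disease"),
     ("4/13/86", "Female", "53715", "Hepatitis"),
     ("2/28/76", "Male", "53703", "Bronchitis"),
     ("1/21/76", "Male", "53703", "Broken Arm"),
     ("4/13/86", "Female", "53706", "Flu"),
     ("2/28/76", "Female", "53706", "Hang Nail")],
    columns=["DOB", "Sex", "Zipcode", "Disease"])

# Public voter registration data, with names.
voters = pd.DataFrame(
    [("Andre", "1/21/76", "Male", "53715"),
     ("Beth", "1/10/81", "Female", "55410"),
     ("Carol", "10/1/44", "Female", "90210"),
     ("Dan", "2/21/84", "Male", "02174"),
     ("Ellen", "4/19/72", "Female", "02237")],
    columns=["Name", "DOB", "Sex", "Zipcode"])

# The linking attack: join the two tables on the quasi-identifiers.
linked = voters.merge(hospital, on=["DOB", "Sex", "Zipcode"])
print(linked[["Name", "Disease"]])   # -> Andre  Heart Disease
```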

(7)

Privacy Preservation 3

Motivating example – "A Face Is Exposed for AOL Searcher No. 4417749"

"Buried in a list of 20 million Web search queries collected by AOL and recently released on the Internet is user No. 4417749. The number was assigned by the company to protect the searcher's anonymity, but it was not much of a shield." – New York Times, August 9, 2006.

(Photo: Thelma Arnold)

(8)

Privacy Preservation 4

Motivating example – America Online (AOL)
• ~650K users, a 3-month period, ~20 million queries released
• No names, no SSNs, no driver license numbers, no credit card numbers
• The user with ID 4417749 was nevertheless identified as Thelma Arnold, a 62-year-old woman living in Georgia.
• Loss of privacy for users, damage to AOL, and significant damage to academics who depend on such data.

(9)

Privacy Preservation 5

Motivating example – Netflix Prize
• In October 2006, Netflix announced the $1-million Netflix Prize for improving their movie recommendation system.
• Netflix publicly released a dataset containing 100 million movie ratings of 18,000 movies, created by 500,000 Netflix subscribers over a period of 6 years.
• Anonymization: replacing usernames with random identifiers.

(10)

Privacy Preservation 6

Motivating example – Association Rules

Supplier: ABC Paper Company
Retailer: XYZ Supermarket Chain

1. Allow ABC to access customer XYZ's DB
2. Predict XYZ's inventory needs & offer reduced prices

(11)

Privacy Preservation 7

Supplier ABC discovers (through data mining):
• Sequence: Cold remedies -> Facial tissue
• Association: (Skim milk, Green paper)

Supplier ABC runs a coupon marketing campaign:
• "50 cents off skim milk when you buy ABC products"

Results:

(12)

Privacy Preservation 8 - Objective

Privacy
• The state of being private; the state of not being seen by others

Database security
• To prevent loss of privacy due to viewing/disclosing unauthorized data

Privacy Preservation

(13)

Privacy Preservation 9 - Common Practices

Limiting access
• Control access to the data
• Used by the secure DBMS community

"Fuzz" the data
• Force aggregation into daily records instead of individual transactions, or slightly alter data values

(14)

Privacy Preservation 10 - Common Practices

Eliminate unnecessary groupings
• The first 3 digits of an SSN are assigned sequentially by issuing office
• Clustering the high-order bits of such a "unique identifier" is likely to group similar data elements
• Instead, assign unique identifiers randomly

Augment the data
• Populate the phone book with extra, fictitious people in non-obvious ways
• Return correct info when asked about one individual, but incorrect info when asked about all individuals in a department

(15)

Privacy Preservation 11 - Common Practices

Audit
• Detect misuse by legitimate users
• Administrative or criminal disciplinary action may be initiated

(16)

Privacy Preservation 12 - Possible Attacks

Linking attacks (Sweeney IJUFKS '02)
• Re-identification
• Identity linkage (k-anonymity)
• Attribute linkage (l-diversity)

Hospital Patient Data (Name, ID are hidden):
  DOB      Sex     Zipcode  Disease
  1/21/76  Male    53715    Heart Disease
  4/13/86  Female  53715    Hepatitis
  2/28/76  Male    53703    Bronchitis
  1/21/76  Male    53703    Broken Arm
  4/13/86  Female  53706    Flu
  2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info):
  Name   DOB      Sex     Zipcode
  Andre  1/21/76  Male    53715
  Beth   1/10/81  Female  55410
  Carol  10/1/44  Female  90210
  Dan    2/21/84  Male    02174
  Ellen  4/19/72  Female  02237

(17)

Privacy Preservation 13 - Possible Attacks

Corruption attacks (Tao ICDE '08, Chaytor ICDM '09)
• Background knowledge; perturbed generalization (PG)

(18)

Privacy Preservation 14 - Possible Attacks

Differential privacy (Dwork ICALP '06)
• Add noise so that the difference between any query output over 30 records and over any 29 of them is very small (within a differential).
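For reference, the formal condition behind this intuition can be written as follows (a standard statement of ε-differential privacy; the slide itself only gives the informal "30 vs. 29 records" description):

```latex
% A randomized mechanism M is \epsilon-differentially private if, for all pairs
% of neighboring datasets D and D' (differing in a single record, e.g. 30 vs. 29
% rows) and for every set S of possible outputs:
\Pr[\,M(D) \in S\,] \;\le\; e^{\epsilon} \cdot \Pr[\,M(D') \in S\,]
```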

(19)

Privacy Preservation 15 - Possible Attacks

Realistic adversaries (Machanavajjhala VLDB '09)
• Weak privacy: l-diversity, t-closeness
  - Adversaries need to know very specific information
• Strong privacy: differential privacy
  - Adversaries need to know all information except the victim's
• Epsilon-privacy
  - The adversary's knowledge can vary and learn, and is characterized by a stubbornness parameter

(20)

Privacy Preservation 16 - Possible Attacks

Structural attacks (Zhou VLDB '09)
• Degree attack: knowing Bob has 4 friends => vertices of degree 4 in the anonymized graph are candidates for Bob
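A minimal sketch of the degree attack (hypothetical toy graph; assumes the networkx package, which is not mentioned in the talk):

```python
import networkx as nx

# Hypothetical "anonymized" social graph: labels replaced by numeric IDs.
G = nx.Graph([(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 6), (5, 6), (6, 7)])

# Adversary's background knowledge: Bob has exactly 4 friends.
candidates = [v for v, deg in G.degree() if deg == 4]
print(candidates)  # the smaller this candidate set, the weaker the anonymity
```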

(21)

Privacy Preservation 17 - Possible Attacks

Structural attacks (Zhou VLDB '09)
• Sub-graph attack (one match): knowing Bob's friends and friends' friends => vertex 7 is Bob

(22)

Privacy Preservation 18 - Possible Attacks

Structural attacks (Zhou VLDB '09)
• Sub-graph attack (k-match): knowing Bob's friends => 8 matches, but they share the common labels 6 & 7 => vertex 7 can still be uniquely identified as Bob

(23)

Privacy Preservation 19 - Possible Attacks

Structural attacks (Zhou VLDB '09)
• Hub fingerprint attacks: some hubs have already been identified, and the adversary knows the distance between Bob and those hubs

(24)

Privacy Preservation 20 - Possible Attacks

Non-structural attacks (Bhagat VLDB '09)
• Label attack
• Interaction graph

(25)

Privacy Preservation 21 - Possible Attacks

Non-structural attacks (Bhagat VLDB '09)

(26)

Privacy Preservation 22 - Possible Attacks

Non-structural attacks (Bhagat VLDB '09)

(27)

Privacy Preservation 23 - Possible Attacks

Active attacks (Backstrom WWW '07)
• Plant a subgraph H into G and connect it to targeted nodes (add new nodes and edges)
• Recover H from the anonymized G & identify the targeted nodes' identities and relationships
• Walk-based (largest H), cut-based (smallest H)

(28)

Privacy Preservation 24 - Possible Attacks

Passive attacks (no nodes, no edges added)
• Starting from a coalition of friends (nodes) in the anonymized graph G, discover the existence of edges among the users to whom they are linked

Semi-passive attacks (add edges only, no nodes)
• From existing nodes in G, add fake edges to targeted nodes

(29)

Privacy Preservation 25 - Possible Attacks

Intersection attacks (Puttaswamy CoNEXT '09)
• Two users, A and B, are compromised
• A queries the server for the visitor of "website xyz"
• B queries the server for the visitor of "website xyz"

(30)

Privacy Preservation 26 - Possible Attacks

Intersection attacks
• StarClique (add latent edges)
• The graph evolution process for a node: the node first selects a subset of its neighbors, then builds a clique with the members of this subset, and finally connects the clique members with all the non-clique members in the neighborhood. Latent or virtual edges are added in the process.

(31)

Privacy Preservation 27 - Possible Attacks

Relationship attacks (Liu SDM '09)
• Sensitive edge weights, e.g. transaction expenses in a business network
• Revealing the weights exposes the shortest path between source and sink, e.g., A -> D

(32)

Privacy Preservation 28 - Possible Attacks

Relationship attacks (Liu SDM '09)
• Goal: preserve the shortest path, e.g. A -> D
• Minimal perturbation of path length and path weight

(33)

Privacy Preservation 28 - Possible Attacks

Relationship attacks (Liu ICIS '10)
• Preserving the shortest path, e.g. A -> D
• K-anonymous weight privacy: in the slide's figure, the blue edge group and the green edge group satisfy 4-anonymous weight privacy with the parameter set to 10.

(34)

Privacy Preservation 29 - Possible Attacks

Relationship attacks (Das ICDE '10)
• Preserving a linear property, e.g., shortest paths
• The ordering of the five edge weights x5, x1, x4, x3, x2 is preserved after naive anonymization, where x1=(v1, v2), x2=(v1, v4), x3=(v1, v3), x4=(v2, v4), x5=(v3, v4).
• The minimum-cost spanning tree is preserved: {(v1, v2), (v2, v4), (v1, v3)}. The shortest path from v1 to v4 is changed.
• The ordering of the edge weights is still exposed. For example, v3 and v4 are best friends while v1 and v4 are not such good friends.

(35)

Privacy Preservation 30 - Possible Attacks

Location privacy (Papadopoulos VLDB '10)
• Preserving the privacy of a user's location in a location-based service, e.g., Google or Bing Maps
• Alice is searching for gold (going around with a GPS) and checking the server to verify whether the place has been awarded a mining patent. What would you do if you knew there is no patent registered at the location she queries?

(36)

Privacy Preservation 32 - Possible Attacks

Inference through data mining attacks
• Addition and/or deletion of items so that Sensitive Association Rules (SAR) will be hidden
• Generalization and/or suppression of items so that the confidence of SAR will be lower than ρ

(Figure: the original database DO is transformed by a Modification step into DM; data mining on DO yields the rule set RO, while mining the modified DM yields RM.)

(37)

Privacy Preservation 33 - Possible Attacks

ρ-uncertainty (Cao VLDB '10)
• Given a transaction dataset, sensitive items Is, and an uncertainty level ρ, the objective is to make the confidence of every sensitive association rule less than ρ, i.e., Conf(χ -> α) < ρ, where χ ⊆ I and α ∈ Is.
• Example: if Alice knows Bob bought b1, then she learns Bob also bought {a1, b2, α, …}, where Is = {α, …}.
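A minimal sketch of checking the ρ-uncertainty condition (hypothetical helper code, not the anonymization algorithm from Cao et al.): enumerate rules whose right-hand side is a sensitive item and flag any whose confidence reaches ρ.

```python
from itertools import combinations

def rho_violations(transactions, sensitive, rho, max_lhs=2):
    """Naive check: return sensitive rules X -> s whose confidence reaches rho."""
    items = set().union(*transactions)
    out = []
    for s in sensitive:
        for k in range(1, max_lhs + 1):
            for lhs in map(set, combinations(sorted(items - {s}), k)):
                supp_lhs = sum(1 for t in transactions if lhs <= t)
                supp_rule = sum(1 for t in transactions if lhs | {s} <= t)
                if supp_lhs and supp_rule / supp_lhs >= rho:
                    out.append((lhs, s, supp_rule / supp_lhs))
    return out

# Toy data in the spirit of the slide; b2 plays the role of a sensitive item.
T = [{"a1", "b1", "b2"}, {"a1", "b1"}, {"a2", "b2"}, {"a1", "b1", "b2"}]
print(rho_violations(T, sensitive={"b2"}, rho=0.7))   # [({'a2'}, 'b2', 1.0)]
```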

(38)

Privacy Preservation 34 - Possible Attacks

ρ-uncertainty (Cao VLDB '10)
• Given a transaction dataset, sensitive items Is, uncertainty level ρ = 0.7, and a hierarchy of non-sensitive items, the slide shows the published data after suppression.

(39)

Current Research Areas 1

Privacy-Preserving Data Publishing
• K-anonymity
• Tries to prevent de-identification (re-identification of individuals)

Utility-based Privacy Preserving

Distributed Privacy with Adversarial Collaboration

Privacy-Preserving Applications
• Association rule hiding

(40)

Current Research Areas 2

(41)

Current Research Areas - Privacy Preserving Data Publishing 1

Hospital Patient Data (Name, ID are hidden):
  DOB      Sex     Zipcode  Disease
  1/21/76  Male    53715    Heart Disease
  4/13/86  Female  53715    Hepatitis
  2/28/76  Male    53703    Bronchitis
  1/21/76  Male    53703    Broken Arm
  4/13/86  Female  53706    Flu
  2/28/76  Female  53706    Hang Nail

Voter Registration Data (public info):
  Name   DOB      Sex     Zipcode
  Andre  1/21/76  Male    53715
  Beth   1/10/81  Female  55410
  Carol  10/1/44  Female  90210
  Dan    2/21/84  Male    02174
  Ellen  4/19/72  Female  02237

(42)

Current Research Areas - Privacy Preserving Data Publishing 2

• An example of 2-anonymity (one-to-many approach)

(43)

Current Research Areas - Privacy Preserving Data Publishing 3

"Fuzz" the data
• k-anonymity: every group (equivalence class on the quasi-identifiers) contains at least k tuples
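A minimal sketch of this idea (hypothetical code using pandas and reusing the hospital table from the motivating example): generalize the quasi-identifiers and verify that every group holds at least k tuples.

```python
import pandas as pd

def is_k_anonymous(df, quasi_ids, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return df.groupby(quasi_ids).size().min() >= k

hospital = pd.DataFrame(
    [("1/21/76", "Male", "53715", "Heart Disease"),
     ("4/13/86", "Female", "53715", "Hepatitis"),
     ("2/28/76", "Male", "53703", "Bronchitis"),
     ("1/21/76", "Male", "53703", "Broken Arm"),
     ("4/13/86", "Female", "53706", "Flu"),
     ("2/28/76", "Female", "53706", "Hang Nail")],
    columns=["DOB", "Sex", "Zipcode", "Disease"])

qi = ["DOB", "Sex", "Zipcode"]
print(is_k_anonymous(hospital, qi, k=2))   # False: every row is unique on the QI

# "Fuzz" the data: keep only birth year, suppress Sex, truncate Zipcode to 3 digits.
fuzzed = hospital.assign(DOB=hospital["DOB"].str[-2:],
                         Sex="*",
                         Zipcode=hospital["Zipcode"].str[:3])
print(is_k_anonymous(fuzzed, qi, k=2))     # True: each group now holds >= 2 tuples
```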

(44)

Current Research Areas - Utility-based Privacy Preserving 1

(45)

Current Research Areas - Utility-based Privacy Preserving 2

(46)

Current Research Areas - Utility-based Privacy Preserving 3

Q1: "How many customers under age 29 are there in the data set?"
Q2: "Is an individual with age=25, education=Bachelor, Zip Code=53712 a target customer?"

Table 2, answers: "2"; "Y"

(47)

Current Research Areas - Distributed Privacy with Adversarial Collaboration

Input privacy (2)

(Figure: distributed databases D1 + D2 + D3 are mined jointly to produce the rule set RO.)

(48)

Current Research Areas - Association Rule Hiding 1

Association rule mining
• Input: DO, min_supp, min_conf
• Output: RO

DO (min_supp=33%, min_conf=70%):
  TID  Items
  T1   ABC
  T2   ABC
  T3   ABC
  T4   AB
  T5   A
  T6   AC

Item supports: |A|=6, |B|=4, |C|=4, |AB|=4, |AC|=4, |BC|=3, |ABC|=3

RO: Association Rules (support, confidence)
   1  B=>A   (66%, 100%)
   2  C=>A   (66%, 100%)
   3  B=>C   (50%, 75%)
   4  C=>B   (50%, 75%)
   5  AB=>C  (50%, 75%)
   6  AC=>B  (50%, 75%)
   7  BC=>A  (50%, 100%)
   8  C=>AB  (50%, 75%)
   9  B=>AC  (50%, 75%)
  10  A=>B   (66%, 66%)   not an AR (conf < min_conf)
  11  A=>C   (66%, 66%)   not an AR (conf < min_conf)
  12  A=>BC  (50%, 50%)   not an AR (conf < min_conf)
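The numbers above can be reproduced with a few lines of straightforward counting (a hypothetical sketch, not the mining algorithm used in the studies; note the slide rounds 66.7% down to 66%):

```python
from itertools import combinations

# The example database DO from the slide (min_supp = 33%, min_conf = 70%).
DO = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
MIN_SUPP, MIN_CONF = 1 / 3, 0.70
ITEMS = {"A", "B", "C"}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in DO if itemset <= t) / len(DO)

for k in range(1, len(ITEMS)):                       # size of the left-hand side
    for lhs in map(set, combinations(sorted(ITEMS), k)):
        supp_lhs = support(lhs)
        if supp_lhs == 0:
            continue
        for m in range(1, len(ITEMS - lhs) + 1):     # size of the right-hand side
            for rhs in map(set, combinations(sorted(ITEMS - lhs), m)):
                supp = support(lhs | rhs)
                conf = supp / supp_lhs
                if supp >= MIN_SUPP and conf >= MIN_CONF:
                    print(f"{''.join(sorted(lhs))} => {''.join(sorted(rhs))}"
                          f"  ({supp:.0%}, {conf:.0%})")
```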

(49)

Current Research Areas - Association Rule Hiding 2

Hiding rules with X = {C} on the LHS
• Input: DO, X (items to be hidden on the LHS), min_supp=33%, min_conf=70%, X = {C}
• Output: DM (one ABC transaction becomes AC, i.e., item B is removed)

Rules over DM (support, confidence), with |A|=6, |B|=3, |C|=4:
   1  B=>A   (50%, 100%)
   2  C=>A   (66%, 100%)  try (still above both thresholds)
   3  B=>C   (33%, 66%)   lost
   4  C=>B   (33%, 50%)   hidden
   5  AB=>C  (33%, 66%)   lost
   6  AC=>B  (33%, 50%)   hidden
   7  BC=>A  (33%, 100%)  try (still above both thresholds)
   8  C=>AB  (33%, 50%)   hidden
   9  B=>AC  (33%, 66%)   lost

(50)

Current Research Areas - Association Rule Hiding 3 - Side effects

Side effects: hiding failure, lost rules, new rules

(Figure: Venn diagram of the original rule set R, split into the sensitive rules Rh and the non-sensitive rules ~Rh; ① hiding failure = rules in Rh that survive sanitization, ② lost rules = rules in ~Rh that disappear, ③ new rules = rules not in R that appear after sanitization.)

(51)

Current Research Areas - Association Rule Hiding 4

Output privacy
(Figure: DO is mined to produce RO; a Modification step yields DM, which is mined to produce the released rule set RM.)

• Input privacy (1)
(Figure: DO is mined to produce RO; a Modification step yields the sanitized database D'o, which is released and mined in place of DO.)

(52)

Current Research Areas - Location Privacy 1

Location Privacy (Papadopoulos VLDB '10)
• Location obfuscation
• Send an additional set of "dummy" queries alongside the actual query

(53)

Current Research Areas - Location Privacy 2

Location Privacy
• Data transformation

(54)

Current Research Areas - Location Privacy 3

Location Privacy
• PIR-based location privacy
• PIR-based queries are sent to the LBS server (computational PIR or a secure co-processor) and retrieve data blocks without the server learning which blocks were requested
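As a toy illustration of the PIR idea, here is the classic two-server information-theoretic variant (shown for illustration only; the schemes cited in the talk are computational or hardware-based): the client recovers block i while neither server alone learns which block was requested.

```python
import secrets
from functools import reduce

# Hypothetical database of equal-length blocks, replicated on two
# non-colluding servers.
DB = [b"blk0", b"blk1", b"blk2", b"blk3", b"blk4", b"blk5"]

def server_answer(query):
    """Each server XORs together the blocks whose indices are in its query set."""
    picked = [DB[i] for i in query] or [bytes(len(DB[0]))]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), picked)

def pir_fetch(i):
    """Client: random set S to server 1, S xor {i} to server 2; XOR the answers."""
    S = {j for j in range(len(DB)) if secrets.randbits(1)}
    T = S ^ {i}                      # symmetric difference toggles index i
    a1, a2 = server_answer(S), server_answer(T)
    return bytes(x ^ y for x, y in zip(a1, a2))

print(pir_fetch(3))                  # b'blk3', yet neither query set reveals i=3
```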

(55)

Recent Studies 1 - Informative Association Rule Set (IRS)

Informative association rules
• Input: DO, min_supp=33%, min_conf, X = {C}
• Output: {C=>A (66%, 100%), C=>B (50%, 75%)}

Full rule set over DO (support, confidence):
   1  B=>A   (66%, 100%)
   2  C=>A   (66%, 100%)
   3  B=>C   (50%, 75%)
   4  C=>B   (50%, 75%)
   5  AB=>C  (50%, 75%)
   6  AC=>B  (50%, 75%)
   7  BC=>A  (50%, 100%)
   8  C=>AB  (50%, 75%)
   9  B=>AC  (50%, 75%)

Rules #6, #7, #8 predict the same RHS items {A, B} as rules #2 and #4, so only #2 and #4 are kept in the informative rule set.

(56)

Recent Studies 2 - Hiding IRS 1 (LHS)

• Input: DO, X (items to be hidden on the LHS), min_supp=33%, min_conf=70%, X = {C}
• Output: DM (one ABC transaction becomes AC, i.e., item B is removed)

Supports in DM: |A|=6, |B|=3, |C|=4, |AB|=3, |AC|=4, |BC|=2, |ABC|=2

DM association rules (support, confidence):
   1  B=>A   (50%, 100%)
   2  C=>A   (66%, 100%)  try (still above both thresholds)
   3  B=>C   (33%, 66%)   lost
   4  C=>B   (33%, 50%)   hidden
   5  AB=>C  (33%, 66%)   lost
   6  AC=>B  (33%, 50%)   hidden
   7  BC=>A  (33%, 100%)  try (still above both thresholds)
   8  C=>AB  (33%, 50%)   hidden
   9  B=>AC  (33%, 66%)   lost
  10  A=>B   (50%, 50%)
  11  A=>C   (66%, 66%)
  12  A=>BC  (33%, 33%)

(57)

Recent Studies 3 - Proposed Algorithms

Strategy: to lower the confidence of a given rule X => Y, either
• increase the support of X, but not of XY, OR
• decrease the support of XY (or of both X and XY)

since Conf(X => Y) = support(XY) / support(X).
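A minimal sketch of the second option, decreasing the support of XY (hypothetical code, not the ISL/DSR/DSC algorithms themselves): delete an RHS item from supporting transactions until the confidence drops below min_conf. On the example DO it reproduces the ABC -> AC modification shown earlier.

```python
def confidence(transactions, lhs, rhs):
    supp_lhs = sum(1 for t in transactions if lhs <= t)
    supp_both = sum(1 for t in transactions if lhs | rhs <= t)
    return supp_both / supp_lhs if supp_lhs else 0.0

def hide_rule_by_deletion(transactions, lhs, rhs, min_conf):
    """Remove one RHS item from supporting transactions until conf < min_conf."""
    db = [set(t) for t in transactions]
    while confidence(db, lhs, rhs) >= min_conf:
        victim = next(t for t in db if lhs | rhs <= t)   # a supporting transaction
        victim.discard(next(iter(rhs)))                  # drop one RHS item
    return db

DO = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]
DM = hide_rule_by_deletion(DO, lhs={"C"}, rhs={"B"}, min_conf=0.70)
print(confidence(DM, {"C"}, {"B"}))   # now below 0.70, so C=>B is hidden
```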

(58)

Recent Studies 4 - Proposed Algorithms

Multiple scans of the database (Apriori-based)
• Increase Support of LHS First (ISL)
• Decrease Support of RHS First (DSR)

One scan of the database
• Decrease Support and Confidence (DSC)
• Proposes a Pattern-Inversion tree to store the data

Maintenance of hiding informative association rule sets

(59)

Recent Studies 5 - Analysis

For multiple-scan algorithms:

Time effects
• DSR is faster than ISL
• Because the set of candidate transactions is smaller

Database effects

(60)

Recent Studies 6 - Analysis

Side effects
• DSR: no hiding failure (0%), few new rules (5%), and some lost rules (11%)
• ISL: some hiding failure (12.9%) and many new rules

(61)

Recent Studies 7 - Proposed Algorithms

One-scan algorithm DSC
• Pattern-inversion tree

(Figure: the pattern-inversion tree built from the example transactions T1-T3 = ABC, etc.; each node is labeled item:count:[transaction list], e.g. A:6:[T5], B:4:[T4], C:1:[T6] under the Root.)

(62)

Recent Studies 8 - Proposed Algorithms

One-scan algorithm DSC
• Time effects, database effects

(63)

Recent Studies 9 - Proposed Algorithms

(64)

Recent Studies 10 - Analysis

For the single-scan algorithm:

• DSC is O(2|D| + |X| * l2 * K * log K),
  where |X| is the number of items in X, l2 is the maximum number of large 2-itemsets, and K is the maximum number of iterations in the DSC algorithm.

• SWA is O((n1-1) * n1/2 * |D| * Kw),
  where n1 is the initial number of restrictive rules in the database D and Kw is the chosen window size.

• SWA has a higher order of complexity, O(l2^2 * |X|^2 * |D|^2), if Kw ≈ |D|.

(65)

Recent Studies 11 - Maintenance of Hiding IRS

Maintenance of hiding informative association rule sets:

Original database D and its sanitized version D' (items as bit vectors):
  TID  D    D'
  T1   111  001
  T2   111  111
  T3   111  111
  T4   110  110

Incremental transactions ΔD:
  TID
  T7   101
  T8   101
  T9   110

D+ = D + ΔD

(66)

Recent Studies 12 - Maintenance of Hiding IRS

Maintenance of hiding informative association rule sets:

  TID  D+   (DSC)  (MSI)
  T1   111  111    001
  T2   111  111    111
  T3   111  111    111
  T4   110  110    110
  T5   100  100    100
  T6   101  001    001
  T7   101  001    001
  T8   101  101    101
  T9   110  110    110

(67)

Recent Studies 13 - Maintenance of Hiding IRS

(68)

Recent Studies 14 - Maintenance of Hiding IRS

(69)

Recent Studies 15 - K-anonymity and K^m-anonymity

K-anonymity
• Every record has k-1 other records with identical values on the quasi-identifiers (QIs)

K^m-anonymity
• The support count of every m-itemset is ≥ k

Domain Generalization Hierarchy (DGH)

Data types
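A minimal sketch of the K^m-anonymity check on transaction data (hypothetical helper; it assumes the common reading that every itemset of size ≤ m that occurs at all must be supported by at least k transactions):

```python
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """Check k^m-anonymity: every occurring itemset of size <= m has support >= k."""
    support = {}
    for t in transactions:
        for size in range(1, m + 1):
            for itemset in combinations(sorted(t), size):
                support[itemset] = support.get(itemset, 0) + 1
    return all(count >= k for count in support.values())

# Toy transaction data (items e1, e2, e3), in the spirit of the next slide.
T = [{"e1"}, {"e1", "e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
print(is_km_anonymous(T, k=3, m=2))   # True: every 1- and 2-itemset occurs >= 3 times
```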

(70)

Recent Studies 16 - K-anonymity on transaction data with minimal addition/deletion of items

3-anonymity with groups {T1, T4, T5} and {T2, T3, T6}:

Original data:              Anonymized data:
  TID  e1  e2  e3             TID  e1  e2  e3
  T1   1   1   0              T1   1   0   0
  T2   0   0   1              T2   1   0   1
  T3   1   0   1              T3   1   0   1
  T4   1   0   0              T4   1   0   0
  T5   1   0   0              T5   1   0   0
  T6   1   0   1              T6   1   0   1

(71)

Recent Studies 17 - K-anonymity on transaction data with minimal addition/deletion of items

Full data set:
  TID  a1  a2  a3
  T1   1   1   1
  T2   1   1   0
  T3   1   0   1
  T4   0   1   1
  T5   1   0   0
  T6   0   1   0
  T7   0   0   1

The slide's figure also shows the transactions split into candidate groups: {T1, T2, T4, T5} with {T3, T6, T7}, and {T1, T2, T3, T4} with {T5, T6}.

(72)

Recent Studies 18 - K-anonymity on transaction data with minimal addition/deletion of items

Proposes an O(log k)-approximation solution to this problem.

(73)

Recent Studies 19 - K-anonymous path privacy

Goal: minimally modify the graph such that there exist k shortest paths between each given pair of vertices specified in H, without adding or deleting any vertices or edges.

(Figure: a small weighted graph on vertices including v1, v2, v4 with edge weights 1, 8, 7, 1, shown before and after perturbation.)
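A small sketch of checking the k-shortest-path condition for one sensitive pair (hypothetical toy graph; assumes the networkx package, not the EIES data set used in the experiments):

```python
import networkx as nx

# Toy weighted graph; the anonymization goal is that a sensitive pair such as
# (v1, v4) should be connected by at least k equally short paths.
G = nx.Graph()
G.add_weighted_edges_from([("v1", "v2", 1), ("v2", "v4", 1),
                           ("v1", "v3", 1), ("v3", "v4", 1),
                           ("v1", "v4", 8)])

k = 2
paths = list(nx.all_shortest_paths(G, "v1", "v4", weight="weight"))
print(paths)            # two cost-2 paths: via v2 and via v3
print(len(paths) >= k)  # True: the true shortest path is 2-anonymous
```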

(74)

Recent Studies 19 - K-anonymous path privacy

Ratios of perturbed edges for different k
• EIES data set: 48 nodes, 830 edges

(75)

Recent Studies 19 - K-anonymous path privacy

(76)

Discussions 1

Major issues
• Large data volume
• High dimensionality
• Sparseness

(77)

Discussions 2

From relational and set data to graph, spatial, … data and beyond …
• Privacy-preserving social networking (PPNP)
• Privacy-preserving collaborative filtering

(78)

Discussions 3

From data privacy to information privacy
• Hiding aggregated information
• Car dealer inventory: hide the total stock, not the individual query
• Airline: hide the total number of seats left, to prevent a terrorist from choosing a less crowded flight

(79)

Discussions 4

Do you have any privacy concerns when you are Googling, Tweeting, or Blogging?

Thank You
