Protocol Design for Privacy-Preserving Data Mining Using Partial Homomorphic Encryption

A THESIS Presented to

The Academic Faculty By

Yin-Ming Chang

In Partial Fulfillment

of the Requirements for the Degree of Master in Computer Science

Institute of Computer Science College of Computer Science National Chiao Tung University

2013


Abstract

With advances in computing power, data mining techniques can extract useful information from large amounts of data. In 2012, 2.5 quintillion bytes of data (a 1 followed by 18 zeros) were created every day. Data privacy is of utmost concern in distributed data mining across multiple parties, which may be competitors. In this thesis, we focus on privacy-preserving techniques for distributed data mining algorithms. We propose two protocols: multi-party association rule mining (MP-ARM) and multi-party decision tree learning (MP-DTL). Both protocols use partial homomorphic encryption to perform data mining securely and are more efficient than existing work. With the aid of a third participant, two or more parties can securely perform large-scale data mining algorithms without revealing any additional information to the cloud servers.


Acknowledgements

I would like to thank my parents for their support. I especially thank Professor Li-Chun Wang, who discussed my work with me every week during these two years, for his suggestions. I could not have finished this work without his guidance and comments. I would also like to thank my laboratory mates in the cloud computing group, who discussed ideas and exercised with me. I could hardly have gone through these days without their company. Last but not least, I give thanks to everyone who supported me during the completion of this thesis.


Abstract
Acknowledgements
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Thesis Outline
2 Background
  2.1 Homomorphic Encryption
    2.1.1 Unpadded RSA
    2.1.2 ElGamal
    2.1.3 Goldwasser-Micali
    2.1.4 Benaloh
    2.1.5 Paillier Cryptosystem
  2.2 Association Rule Mining
  2.3 Decision Tree Learning
    2.3.1 ID3 Algorithm
    2.3.2 ID3δ Approximation
  2.4 Secure Multi-party Computation
3 Privacy-Preserving Association Rule Mining
  3.1 Problem Definition
  3.2 Related Work
  3.3 Proposed Protocol
  3.4 Measurement
  3.5 Security Analysis
  3.6 Multi-Party Case
  3.7 Large Scale Privacy-Preserving Association Rule Mining
4 Privacy-Preserving Decision Tree Learning
  4.1 Problem Definition
  4.2 Related Work
  4.3 Proposed Protocol
  4.4 Measurement
  4.5 Analysis
  4.6 Security Analysis
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work


List of Tables

2.1 NIST Recommended Key Size
2.2 An original transaction database
2.3 The transaction table partitioned by outlook (Sunny)
2.4 The transaction table partitioned by outlook (Overcast)
2.5 The transaction table partitioned by outlook (Rain)
2.6 The calculation of information gain with different attribute
3.1 The notation table of MP-ARM
4.1 The notation table of MP-DTL


List of Figures

1.1 The process of knowledge discovery.
1.2 The overview of proposed protocols in the process.
2.1 The example of ID3 decision tree.
3.1 Proposed architecture (two-party case).
3.2 The execution time of MP-ARM and Kaosar2012 [17].
3.3 The execution time of MP-ARM with different key size.
3.4 Proposed Architecture (multi-party case).


Chapter 1 Introduction

Data mining is the process of knowledge discovery, as shown in Fig. 1.1. Its overall goal is to extract patterns and useful trends from large data sets, drawing on knowledge from different fields, including artificial intelligence, machine learning, statistics, and database systems. Data mining techniques [1] have been developed to support precise decision making. For example, one Midwest grocery chain used data mining techniques to analyze local buying patterns [2]. They found that people tended to buy beer when they bought diapers on Saturdays. On Thursdays, however, they bought only a few items. The retailer concluded that it needed to stock more beer before the weekend. The grocery chain can use this newly discovered information in various ways to increase revenue. For example, it can move the beer display closer to the diaper display, or sell beer and diapers at full price on Thursdays.

There are various types of data mining algorithms, which can be mainly classified as: 1) classification; 2) clustering; 3) association.

Classification. Data mining classification is the process to group items

data mining classification, including nearest neighbor classification [3], decision tree learning [4], and support vector machines [5].

Clustering. Data mining clustering finds groups of objects that are similar to each other and different from the objects in other groups.

Association. In data mining, association rule learning is a popular method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. For example, the rule {diaper} → {beer} means that when diapers appear in a transaction record, beer also appears in that record with high probability.

Figure 1.1: The process of knowledge discovery.

To perform a data mining algorithm, the traditional approach first collects all data at a centralized site and then runs the mining algorithm. However, a centralized database makes it difficult to share data among competitors because of privacy concerns. Data may be distributed among several sites that are not allowed to transfer data directly to one another owing to the untrusted network environment.

Take medical research cooperation as an example. Suppose that the Center for Disease Control wants to mine health records to find the relationship between gender, age, and health. Insurance companies and local hospitals hold data on patient diseases and prescriptions. Mining these data can help the Center for Disease Control discover important rules such as gender & age ⇒ health status. However, insurance companies and hospitals will not share their data publicly, and it is illegal to share patients’ records without their permission. Therefore, a protocol designed for this situation should ensure that no sensitive information can be learned by outsiders.

To realize the above scenario, data transferred between the Center for Disease Control and the insurance companies or hospitals must be encrypted before being sent to another site. This encryption limits the computations that can be performed on the data. We propose efficient protocols for association rule mining and decision tree learning that reduce the time spent collecting intermediate values before the mining results are obtained.

In this thesis, we assume that two or more participants want to perform data mining algorithms on their joint databases. In the two-party case, we introduce a new participant, Ted, who helps decide whether an outcome is larger than the predetermined threshold. One special property of the two-party case makes privacy difficult to preserve. In association rule mining, a party may learn that a rule is supported globally but not supported at its own site. This shows that the other site supports the rule. Thus, private information is revealed even under a secure protocol. We also assume no collusion in the proposed protocols, because collusion may break the security guarantee.

We consider a scenario where two or more parties with private databases plan to run a data mining algorithm on the union of their databases. Because the data are confidential, neither party will divulge any contents to the other. We show how decision tree learning can be computed efficiently, with no party learning anything other than the output itself. We demonstrate our approach on the Apriori algorithm and the ID3 algorithm, both of which are well-known and influential data mining algorithms. We note that extensions of Apriori and ID3 are widely used in real market applications. Our proposed protocols are designed to securely perform these two data mining algorithms, as shown in Fig. 1.2.

Figure 1.2: The overview of proposed protocols in the process.

1.1 Thesis Outline

The rest of this thesis is organized as follows. Chapter 2 describes the background on the data mining algorithms and the cryptosystem that form the basis of our proposed protocols. Chapter 3 introduces the proposed protocol for securely performing distributed association rule mining on private databases. Chapter 4 presents the other protocol, for privacy-preserving decision tree learning. Chapter 5 gives concluding remarks and outlines directions for future work.


Chapter 2 Background

2.1 Homomorphic Encryption

Homomorphic encryption is a scheme that allows computations to be carried out on ciphertext. The decryption of the computation result matches the outcome of the same operations performed on the plaintext. The concept of homomorphic encryption, or privacy homomorphism, was first proposed to the scientific community in 1978 by Ronald Rivest, Leonard Adleman and Michael Dertouzos. A semantically secure homomorphic encryption scheme was developed and proposed by Shafi Goldwasser and Silvio Micali in 1982. In 2009, Craig Gentry proved that a fully homomorphic encryption scheme is possible.

Rivest, Adleman and Dertouzos developed their theory based on the fact that existing security and encryption systems severely limited the ability to manipulate data after it was encrypted into ciphertext. Without a homomorphic solution, “sending” and “receiving” data are the only functions that can be accomplished with encrypted data. The biggest concern was the level of computing needed to process an encrypted request on the encrypted data. This manipulation may reduce the security level of the encryption scheme.

With the advent and rapid expansion of cloud computing, a feasible homomorphic encryption method is crucial. Otherwise, the risk of entrusting sensitive data to a cloud computing service provider is too high. If a service provider can access data in decrypted form, the data can be directly exposed to malicious users. Gentry [6] proved that homomorphic encryption is viable, though the computation time is a concern.

In [6], the author outlined how to create an encryption scheme that allows data to be securely stored in a cloud environment, where the owner can use the computational power of the cloud provider to manipulate the encrypted data. There are three main steps in [6]. First, an encryption scheme is constructed that is “bootstrappable”. In this step, a somewhat homomorphic encryption scheme can work with its own decryption circuit. Next, an almost-bootstrappable public key encryption scheme is built using ideal lattices. Finally, the scheme is simplified while maintaining the property of being bootstrappable.

Although [6] created a fully homomorphic encryption scheme, it remains impractical. Homomorphic encryption has evolved to be mostly secure against chosen-plaintext attacks, but security against chosen-ciphertext attacks remains a problem. In addition to the security issue, fully homomorphic schemes are so complex that their running time has precluded their use in many applications. Somewhat homomorphic encryption systems have been developed to address at least the running time, using only the most efficient portions of a fully homomorphic encryption scheme.

In this thesis, we apply homomorphic encryption in a realistic setting, so efficiency must be taken into consideration. We use partial homomorphic encryption as our main encryption scheme and combine it with protocol design to realize a privacy-preserving data mining process. There are several efficient partial homomorphic cryptosystems:

2.1.1 Unpadded RSA

If the RSA public key is modulus $m$ and exponent $e$, then the encryption of a message $x$ is given by $E(x) = x^e \bmod m$. The homomorphic property is then

$$E(x_1) \cdot E(x_2) = x_1^e x_2^e \bmod m = (x_1 x_2)^e \bmod m = E(x_1 \cdot x_2).$$
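The multiplicative property of unpadded RSA can be checked numerically. The following is a minimal sketch with toy parameters chosen here purely for illustration (the primes 61 and 53 and the exponent 17 are assumptions, far too small to be secure); it requires Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy unpadded RSA (illustrative parameters only, not secure).
p, q = 61, 53
m = p * q                           # modulus
e = 17                              # public exponent, coprime to (p-1)(q-1)
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

def encrypt(x):
    """E(x) = x^e mod m."""
    return pow(x, e, m)

def decrypt(c):
    """D(c) = c^d mod m."""
    return pow(c, d, m)

x1, x2 = 7, 11
# Multiply the two ciphertexts without ever decrypting them.
product_cipher = (encrypt(x1) * encrypt(x2)) % m

# The product of ciphertexts is a valid encryption of the product.
assert product_cipher == encrypt(x1 * x2)
assert decrypt(product_cipher) == x1 * x2
```

The check only holds as an equality of integers while the product $x_1 x_2$ stays below the modulus; beyond that, the result is the product reduced modulo $m$.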

2.1.2 ElGamal

In a group $G$, if the public key is $(G, q, g, h)$, where $h = g^x$ and $x$ is the secret key, then the encryption of a message $m$ is $E(m) = (g^r, m \cdot h^r)$ for some random $r \in \{0, 1, \ldots, q-1\}$. The homomorphic property is then

$$E(x_1) \cdot E(x_2) = (g^{r_1}, x_1 \cdot h^{r_1})(g^{r_2}, x_2 \cdot h^{r_2}) = (g^{r_1+r_2}, (x_1 \cdot x_2) h^{r_1+r_2}).$$

2.1.3 Goldwasser-Micali

In the Goldwasser-Micali cryptosystem, if the public key is the modulus $m$ and a quadratic non-residue $x$, then the encryption of a bit $b$ is $E(b) = x^b r^2 \bmod m$ for some random $r \in \{0, 1, \ldots, m-1\}$. The homomorphic property is then

$$E(b_1) \cdot E(b_2) = x^{b_1} r_1^2 \, x^{b_2} r_2^2 = x^{b_1+b_2} (r_1 r_2)^2 = E(b_1 \oplus b_2).$$

2.1.4 Benaloh

If the public key is the modulus $m$ and the base $g$ with a blocksize of $c$, then the encryption of a message $x$ is $E(x) = g^x r^c \bmod m$ for some random $r \in \{0, 1, \ldots, m-1\}$. The homomorphic property is then

$$E(x_1) \cdot E(x_2) = (g^{x_1} r_1^c)(g^{x_2} r_2^c) = g^{x_1+x_2} (r_1 r_2)^c = E(x_1 + x_2 \bmod c).$$

2.1.5 Paillier Cryptosystem

The Paillier cryptosystem [7] is a public key encryption scheme based on modular arithmetic, created by Pascal Paillier. It is additively homomorphic:

$$E_k(x) \times E_k(y) = E_k(x + y).$$

Encryption

To encrypt a message using the Paillier cryptosystem, a public key must be established first.

To construct the public key, one must choose two large primes, $p$ and $q$, then calculate their product, $n = p \cdot q$. Then a semi-random, nonzero integer $g$ in $\mathbb{Z}_{n^2}$ must be selected so that the order of $g$ in $\mathbb{Z}^*_{n^2}$ is a multiple of $n$. Thus, the public key is $(n, g)$.

The steps of encryption are as follows:

1. Create a message $m$ with $m \in \mathbb{Z}_n$.
2. Choose a random, nonzero integer $r \in \mathbb{Z}^*_n$.
3. Compute $c \equiv g^m \cdot r^n \pmod{n^2}$.

Decryption

1. Define $L(u) = (u - 1)/n$.
2. Calculate $k = L(g^{\lambda(n)} \bmod n^2)$, where $\lambda(n) = \mathrm{lcm}(p-1, q-1)$.
3. Compute $\mu \equiv k^{-1} \pmod{n}$.
4. Recover $m \equiv L(c^{\lambda(n)} \bmod n^2) \cdot \mu \pmod{n}$.
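The key generation, encryption, and decryption steps above can be sketched as follows. This is a minimal illustration rather than the thesis implementation (which was written in Java): the function names are hypothetical, the choice $g = n + 1$ is one common valid choice assumed here, and the primes 293 and 433 are toy values that offer no security.

```python
import math
import random

def keygen(p, q):
    """Build a Paillier key pair from two primes (toy sizes here)."""
    n = p * q
    n2 = n * n
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lambda(n) = lcm(p-1, q-1)
    g = n + 1                       # assumed choice; its order in Z*_{n^2} is n
    k = (pow(g, lam, n2) - 1) // n  # k = L(g^lambda mod n^2)
    mu = pow(k, -1, n)              # mu = k^{-1} mod n (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    """c = g^m * r^n mod n^2 for a random r in Z*_n."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    """m = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

pk, sk = keygen(293, 433)
n2 = pk[0] ** 2
# Multiplying ciphertexts adds the underlying plaintexts.
c_sum = (encrypt(pk, 15) * encrypt(pk, 27)) % n2
print(decrypt(pk, sk, c_sum))  # 42
```

Because encryption is randomized, two encryptions of the same message differ, yet both decrypt to the same plaintext.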

Our proposed protocols use an additively homomorphic scheme to securely sum the encrypted results, so we adopt the Paillier cryptosystem as our encryption scheme. Table 2.1 shows the key sizes recommended by NIST for security. We implement our system with a 1024-bit key.

Table 2.1: NIST Recommended Key Size

Symmetric Key Size (bits)   RSA and Diffie-Hellman Key Size (bits)   Elliptic Curve Key Size (bits)
80                          1024                                     160
112                         2048                                     224
128                         3072                                     256
192                         7680                                     384
256                         15360                                    521

2.2 Association Rule Mining

Association rule mining is a process that helps find high-confidence rules in a large amount of data. The problem can be defined as follows:

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. Let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions, where each $t_i \subseteq I$. Given an itemset $X \subseteq I$, a transaction $t_i$ contains $X$ if and only if $X \subseteq t_i$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. The rule has support $s$ in the transaction database $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$. The rule holds in $D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. An itemset $X$ with $k$ items is called a $k$-itemset. The problem of mining association rules is to find all rules whose support and confidence are higher than the minimum support and minimum confidence defined by the user.
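The definitions of support and confidence can be made concrete with a short sketch. The function names and the four example transactions below are hypothetical, and the values are returned as fractions in [0, 1] rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical transaction database with four records.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

# Rule {diaper} => {beer}: 2 of 4 records contain both items,
# and 2 of the 3 records containing diaper also contain beer.
print(support(transactions, {"diaper", "beer"}))       # 0.5
print(confidence(transactions, {"diaper"}, {"beer"}))
```

With a minimum support of 0.4 and a minimum confidence of 0.6, this rule would be reported, since its support is 0.5 and its confidence is 2/3.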

It has been shown that the problem of mining association rules can be reduced to two subproblems:

1. Find all large itemsets for a predetermined minimum support.
2. Generate the association rules from the large itemsets found.

The most crucial part affecting the performance of mining association rules is to solve the first problem efficiently.

2.2.1 Apriori Algorithm

The Apriori algorithm is an effective method for determining association rules [8]. The idea is to separate the association rule mining problem into two stages:

1. The large itemsets are computed iteratively. In each iteration, the database is scanned once and all large itemsets of the same size are computed. The large itemsets are computed in ascending order of their sizes.

2. In the first iteration, the size-1 large itemsets are computed by scanning the database once. Subsequently, in the $k$th iteration ($k > 1$), a set of candidate sets $C_k$ is created by applying the candidate set generation function Apriori-gen on $L_{k-1}$, where $L_{k-1}$ is the set of all large $(k-1)$-itemsets found in iteration $k-1$. Apriori-gen generates only those $k$-itemsets whose every $(k-1)$-itemset subset is in $L_{k-1}$. The support counts of the candidate itemsets in $C_k$ are then computed by scanning the database once, and the size-$k$ large itemsets are extracted from the candidates.
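The level-wise search described above can be sketched as follows. This is a plain, non-private version of Apriori written for illustration: the function names, the fractional `min_support` parameter, and the example transactions are assumptions, and no attempt is made at the optimizations a production implementation would use.

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Candidate k-itemsets whose every (k-1)-subset is in L_{k-1}."""
    prev = set(prev_large)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

def apriori(transactions, min_support):
    """Return {large itemset: support count}; min_support is a fraction."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)
    # C_1: every single item is a candidate.
    candidates = {frozenset([i]) for t in transactions for i in t}
    large = {}
    k = 1
    while candidates:
        # One database scan per iteration counts all size-k candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        k += 1
        candidates = apriori_gen(level.keys(), k)
    return large

# Hypothetical example database.
example = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in sorted(apriori(example, 0.5).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this example every single item and every pair is large, while the triple {a, b, c} is generated as a candidate but rejected because it appears in only two of the five transactions.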

In general, for both frequent itemsets XY and X, we can determine if the rule X → Y holds by computing the ratio r:

r = support(XY )/support(X).

The rule holds only if r ≥ minimum confidence. Note that this rule has the minimum support because XY is frequent.

2.3 Decision Tree Learning

Decision tree algorithms are learning methods for approximating discrete-valued target functions. These well-known inductive learning algorithms have been successfully applied to a broad range of tasks.

2.3.1 ID3 Algorithm

ID3 is a simple decision tree learning algorithm developed by J. Ross Quinlan in 1986 [9]. ID3 constructs a decision tree by a top-down greedy search through the given set of training data, testing each attribute at every node. It uses a statistical property called information gain to select which attribute to test at each node. Information gain measures how well a given attribute separates the training examples according to their target classification. The notation is as follows:

• R: the set of attributes;
• C: the class attribute;
• T: the set of transactions.

Entropy

Entropy is a measure from information theory that characterizes the impurity of an arbitrary collection of examples. If the class attribute $C$ takes on $l$ different values, then the entropy $H_C(T)$ relative to this $l$-wise classification is defined as

$$H_C(T) = \sum_{i=1}^{l} -\frac{|T(c_i)|}{|T|} \log_2 \frac{|T(c_i)|}{|T|},$$

where $T(c_i)$ is the set of transactions with class value $c_i$, so $|T(c_i)|$ is the number of such transactions. The logarithm is base 2 because entropy is measured in bits. For instance, if the training data have 14 instances with 6 positive and 8 negative instances, the entropy is calculated as

$$H_C(T) = -(6/14)\log_2(6/14) - (8/14)\log_2(8/14) = 0.985.$$

Note that the more uniform the probability distribution, the greater the entropy. A high entropy implies that it is difficult to obtain clear information from the current distribution.
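The entropy formula can be evaluated directly from a list of class labels. The sketch below uses a hypothetical helper named `entropy` and reproduces the example with 6 positive and 8 negative instances.

```python
import math
from collections import Counter

def entropy(labels):
    """H_C(T) in bits for a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

# 14 instances: 6 positive and 8 negative.
labels = ["+"] * 6 + ["-"] * 8
print(round(entropy(labels), 3))  # 0.985
```

An evenly split two-class collection gives the maximum of 1 bit, and a collection with a single class gives 0, which matches the remark that more uniform distributions have higher entropy.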

Information Gain

Let $T$ be a set of transactions, $C$ the class attribute with $l$ different values, and $A$ a non-class attribute with $m$ possible values $(a_1, a_2, \ldots, a_m)$. $T(a_j)$ denotes the transactions containing $a_j$. Then the conditional entropy can be calculated as follows:

$$H_C(T|A) = \sum_{j=1}^{m} \frac{|T(a_j)|}{|T|} H_C(T(a_j)).$$

Information gain measures the expected reduction in entropy from partitioning the transaction set according to this attribute. The information gain $Gain(A)$ of an attribute $A$, relative to the collection of transactions $T$, is defined as

$$Gain(A) = H_C(T) - H_C(T|A).$$

We use this measure to select, among the attributes that have not yet been considered on the path from the root, the attribute with the highest information gain. By doing so recursively, we obtain the decision tree.

Example

Table 2.2: An original transaction database (Outlook, Temperature, Humidity and Wind are the attributes; Play Tennis is the class)

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          Normal     Strong   No


Assume that we have the transaction database in Table 2.2. The goal is to construct a decision tree using the ID3 algorithm. We first compute the entropy, which is

$$H_C(T) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) = 0.940.$$

Then we test each non-class attribute $A$ as the root node of the decision tree and calculate its information gain. Here, we use “Outlook” as an example. The Outlook field has three values: $a_1$ (Sunny), $a_2$ (Overcast), and $a_3$ (Rain). Partitioning the original transaction database by Outlook yields Tables 2.3, 2.4, and 2.5.

Table 2.3: The transaction table partitioned by outlook (Sunny)

Day   Outlook   Play Tennis
D1    Sunny     No
D2    Sunny     No
D8    Sunny     No
D9    Sunny     Yes
D11   Sunny     Yes

Table 2.4: The transaction table partitioned by outlook (Overcast)

Day   Outlook    Play Tennis
D3    Overcast   Yes
D7    Overcast   Yes
D12   Overcast   Yes
D13   Overcast   Yes

Table 2.5: The transaction table partitioned by outlook (Rain)

Day   Outlook   Play Tennis
D4    Rain      Yes
D5    Rain      Yes
D6    Rain      No
D10   Rain      Yes
D14   Rain      No

For Outlook, the conditional entropy is the weighted sum over the three partitions, each weighted by its share of the 14 transactions. The Overcast partition contains only “Yes” records, so its entropy is 0:

$$H_C(T|A(\text{Outlook})) = \frac{5}{14}\left(-0.4\log_2 0.4 - 0.6\log_2 0.6\right) + \frac{4}{14}\cdot 0 + \frac{5}{14}\left(-0.6\log_2 0.6 - 0.4\log_2 0.4\right) = 0.6935,$$

$$Gain(A(\text{Outlook})) = 0.9403 - 0.6935 \approx 0.247.$$

The information gain of all non-class attributes is listed in Table 2.6. “Outlook” has the highest information gain, so we construct our decision tree with the root node “Outlook”. By doing this recursively, we obtain the final decision tree shown in Fig. 2.1.

Figure 2.1: The example of ID3 decision tree.

Table 2.6: The calculation of information gain with different attributes

Attribute     Gain
Outlook       0.247
Temperature   0.029
Humidity      0.048
Wind          0.048
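The attribute selection step of ID3 can be checked by computing $Gain(A) = H_C(T) - H_C(T|A)$ for every attribute of Table 2.2. The sketch below is illustrative only: the names `ROWS`, `ATTRIBUTES`, and `information_gain` are hypothetical, and the rows are transcribed from Table 2.2 with the class label in the last position.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = [  # rows D1..D14 of Table 2.2; last field is Play Tennis
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Strong", "No"),
]

def entropy(labels):
    """H_C(T) in bits for a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(rows, attr_index):
    """Gain(A) = H_C(T) - sum_j |T(a_j)|/|T| * H_C(T(a_j))."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[-1])
    conditional = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy([row[-1] for row in rows]) - conditional

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(best)  # Outlook
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

ID3 would place the attribute with the largest gain at the root and then repeat the same computation on each partition with the remaining attributes.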

2.3.2 ID3δ Approximation

The traditional ID3 algorithm chooses the attribute that maximizes the information gain. For two attributes $A_1$ and $A_2$, we say that $A_1$ and $A_2$ have δ-close information gains if

$$|H_C(T|A_1) - H_C(T|A_2)| \le \delta.$$

2.4 Secure Multi-party Computation

Substantial work has been done on secure multi-party computation. The key result is that a wide class of functions can be computed securely under reasonable assumptions. We give an overview of this work. The definitions were given by Goldreich [10]. For simplicity, we concentrate on the two-party case. Extending the definitions to the multi-party case is straightforward.

1. Security in the semi-honest model: A semi-honest party follows the protocol using its correct input, but is free to later use what it sees during the execution of the protocol to compromise security. This is a realistic assumption because parties wanting to mine data for their mutual benefit will follow the protocol to get correct results. Also, a protocol that is buried in large and complex software cannot be easily altered.

2. Yao’s general two-party secure function evaluation: Yao’s general secure two-party evaluation expresses the function $f(x, y)$ as a circuit and encrypts the gates for secure evaluation [11]. Using this protocol, any two-party function can be evaluated securely in the semi-honest model. However, the function must have a small circuit representation. We do not detail this generic method, but we adopt it for securely solving the millionaire problem. For securely comparing two integers, Yao’s generic method is one of the most efficient methods, although other asymptotically equivalent but practically more efficient algorithms could be used as well [12].

Chapter 3 Privacy-Preserving Association Rule Mining

3.1 Problem Definition

Let $n \ge 2$ be the number of participants who jointly perform association rule mining on their databases. Each participant $i$ has a private transaction database $D_i$. Given a support threshold $s$ and a confidence threshold $c$, the goal is to discover all association rules that satisfy the thresholds. To protect the sensitive data owned by the participants, information leakage must be limited: each participant should learn nothing that cannot be simulated from its own input and the mining results.

We consider the following to be sensitive information that should not be revealed during the execution of our protocol:

• The itemsets supported at each site;
• The global support count of an itemset at each iteration k;
• Database size at each site.

3.2 Related Work

There are several privacy-preserving data mining approaches. Previous work can be classified into two categories. The first aims to preserve participants’ privacy by distorting data values [13]. The basic idea is that the distorted data do not reveal private information and are therefore safe to use for mining. The distorted data, together with information about the distortion, can be used to generate an approximation of the original data distribution without revealing the original data values. The distribution can then be used to produce mining results, rather than mining the distorted data directly. This is primarily done by selecting split points to “bin” continuous data. A later refinement of this approach tightened the bounds on what private information is disclosed, by showing that the ability to reconstruct the distribution can be used to tighten estimates of the original values based on the distorted data [14].

More recently, the data distortion approach has been applied to boolean association rules [15], [16]. The key idea is to modify the data values so that reconstructing the values of any individual transaction becomes more difficult, while the rules learned from the distorted data remain valid. One feature of this work is its flexible definition of privacy. The other approach constructs decision trees and is discussed in Chapter 4.

The problem of privately computing association rules in vertically partitioned distributed databases has also been addressed in [11]. Vertical partitioning occurs when each transaction is split across multiple sites, with each site holding a different set of attributes for the entire set of transactions. With horizontal partitioning, each site holds a set of complete transactions. In relational terms, with horizontal partitioning the relation to be mined is the union of the relations at the sites, whereas with vertical partitioning the relations at the individual sites must be joined to obtain the relation to be mined. This change in how the data are distributed makes vertical partitioning a much different problem from the one we address here, resulting in a very different solution.

Two-party association rule mining has received particular attention because of the difficulty of preserving privacy in this setting. If the sizes of the two databases are $d_1$ and $d_2$, and the counts of an itemset are $c_1$ and $c_2$, then the inequality

$$\frac{c_1 + c_2}{d_1 + d_2} \ge s$$

tests whether the itemset is frequent, where $s$ is the minimum support threshold. In [17], the authors used the fully homomorphic encryption proposed by Gentry [6], instead of Yao’s garbled circuit, to test the inequality above. The main contribution of that work is the use of fully homomorphic encryption to avoid the higher communication cost of Yao’s solution, in which the generated circuit cannot be reused. However, the protocol is difficult to use in a real system because of the inefficiency of fully homomorphic encryption.

To realize privacy-preserving association rule mining in a real system, we introduce a trusted participant who only helps decide whether an itemset is frequent. This participant learns nothing beyond whether the inequality holds.


Figure 3.1: Proposed architecture (two-party case).

3.3 Proposed Protocol

We extend the Apriori algorithm [8] to achieve privacy. The goal is to establish a secure version of the Apriori algorithm that minimizes information disclosure during the execution of the protocol. Our work is inspired by the method proposed in [17] for discovering association rules privately, which is based on fully homomorphic encryption.

Instead of using the actual support count to decide whether an itemset is frequent, we use the excess support count of the itemset, i.e., by how much the support count at a participant’s transaction database exceeds the support threshold $s$. For an itemset $X$ to be frequent, the following inequality must hold:

$$\sum_{i=1}^{n} X.sup_i \ge s \times \sum_{i=1}^{n} |D_i|,$$

which is equivalent to

$$\sum_{i=1}^{n} (X.sup_i - s \times |D_i|) \ge 0.$$

We denote $(X.sup_i - s \times |D_i|)$ as the excess support count of participant $i$.
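The excess support test can be illustrated in plaintext, before any encryption is involved. The sketch below only evaluates the inequality and is not the secure protocol itself; the function names and the two-site example values are hypothetical, and the threshold is represented as an exact fraction to avoid floating-point rounding.

```python
from fractions import Fraction

def excess_support(local_count, local_db_size, s):
    """X.sup_i - s * |D_i| for one participant."""
    return local_count - s * local_db_size

def is_globally_frequent(sites, s):
    """True if the excess support counts of all sites sum to >= 0."""
    return sum(excess_support(c, d, s) for c, d in sites) >= 0

s = Fraction(30, 100)            # minimum support of 30%
sites = [(20, 100), (45, 100)]   # (X.sup_i, |D_i|) for two sites

print([excess_support(c, d, s) for c, d in sites])  # [Fraction(-10, 1), Fraction(15, 1)]
print(is_globally_frequent(sites, s))               # True
```

In this example the itemset is not frequent at the first site (20 of 100 records) but is frequent globally (65 of 200 records), so only the sum of the excess support counts needs to be compared with zero.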

There are three participants in our protocol: Alice, Bob and Ted. Alice and Bob are two sites that both hold large databases. Alice wants to perform association rule mining on the union of both databases. All sites are assumed to be semi-honest. That is, all participants follow the protocol honestly, but try to collect all the data they receive and infer as much secret information as they can. Unlike other association rule mining protocols that use homomorphic encryption to directly encrypt the excess support count, our protocol does not allow Alice, who could run the mining with fake data, to acquire the counts of itemsets in Bob’s database.

Table 3.1: The notation table of MP-ARM

s          The minimum support count
c_i        The local support count at party i
X.sup      The global support count of an itemset X
X.sup_i    The local support count of X at site S_i
D          The number of transactions in DB
D_i        The number of transactions in DB_i
L_k        The set of global large k-itemsets
C_k        The set of candidate k-itemsets
pk         The public key of the first site S_1
sk         The private key of the first site S_1
E_pk(m)    Encrypt m with public key pk
D_sk(c)    Decrypt c with private key sk

Our protocol is composed of two communication steps. First, Alice receives the encrypted results by cooperating with Bob. However, she does not learn the encrypted support count. In the second step, Ted informs Alice whether the support count exceeds the minimum support count, which lets Alice determine whether the current itemset is frequent. By performing the protocol iteratively, Alice obtains the complete mining results. Alice can misbehave and send a fake support count to Bob, or vice versa. Although the mining results will then be wrong, neither of them can learn the other’s support count. We will discuss this situation in Section 4.6.

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items. Let $D_{Alice}$ be the set of transactions owned by participant Alice. Each transaction $T$ in $D_{Alice}$ is an itemset with $T \subseteq I$. Alice wants to check the support count over all transactions in $D_{Alice}$ and $D_{Bob}$, i.e., whether

$$\sum_{i \in \text{participants}} (c_i \times 100 - s_i \times D_i) \ge 0.$$

If this inequality holds, the current itemset is frequent. Fig. 3.1 shows the basic protocol used in our system model. The two communication channels are the one from Ted to Alice and the one from Bob to Alice. In addition to performing association rule mining, Alice also acts as a relay point that helps Bob send messages to Ted. The advantage of this approach is that it hides Bob from Ted, thereby enhancing privacy compared with Bob sending messages to Ted directly.

The protocol consists of two steps. First, Alice collects the excess support counts from all the other users. After collecting the necessary information, Alice still does not know the exact mining result. She then asks Ted to help determine whether the current itemset is frequent. The protocol is detailed in Algorithms 1 and 2.

3.4 Measurement

We implemented a software prototype to test the feasibility of our protocol. The prototype was executed on a machine with a 2.66 GHz Intel Core(TM)2 processor and 6 GB of memory, running the Windows 8 operating system. We used Java as the programming language, together with an implementation of the Paillier cryptosystem.

We separated the total execution time into three stages: encryption, evaluation and decryption. The data or count values were encrypted by the Paillier cryptosystem with key sizes (security parameters) of 512, 1024 and 2048, to examine the feasibility of our proposed protocol for large-scale association rule mining. The results are shown in Fig. 3.2, where we also compare our results with [17]. Our protocol takes less time than the previous work in the encryption and evaluation stages but more time in the decryption stage. Fig. 3.3 shows that even with a larger key size, our proposed protocol completes in a short period (less than 1.5 sec).
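The three measured stages can be made concrete with a toy Paillier instance. The sketch below is not the Java prototype; it assumes the common simplification $g = n + 1$ and uses tiny hypothetical primes (1009 and 1013) that offer no security, only to show what is timed in each stage. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets
import time


def keygen(p, q):
    """Toy Paillier key generation from two given primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because g = n + 1 is assumed
    return n, (lam, mu)


def encrypt(n, m):
    """E(m) = (1 + m * n) * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def add_encrypted(n, c1, c2):
    """Evaluation stage: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)


def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


n, priv = keygen(1009, 1013)
t0 = time.perf_counter()
c1, c2 = encrypt(n, 20), encrypt(n, 22)
t1 = time.perf_counter()
c_sum = add_encrypted(n, c1, c2)
t2 = time.perf_counter()
total = decrypt(n, priv, c_sum)
t3 = time.perf_counter()
print(total)  # 42
print("encryption", t1 - t0, "evaluation", t2 - t1, "decryption", t3 - t2)
```

The timings printed by this sketch are not comparable with the figures in this section, since the key size and the implementation differ.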

[Figure 3.2: bar chart of execution time (sec) for the encryption, evaluation and decryption stages; series: Proposed and Kaosar2012. Data labels as extracted: 0.0035, 0.041, 0.293, 0.008, 7.05, 0.000113.]

[Figure 3.3 plot: bar chart of execution time (sec) for the encryption, evaluation and decryption stages; series: 256 bit, 512 bit and 1024 bit. Data labels as extracted: 0.001, 0.125, 0.002, 0.0035, 0.293, 0.008, 0.018, 1.22, 0.04.]

Figure 3.3: The execution time of MP-ARM with different key size.

3.5 Security Analysis

The protocol does not prevent participants from using incorrect data as their input. Under this consideration, we assume that the protocol is executed under the semi-honest model in order to obtain correct mining results. That is, all participants may gather all the information they receive in the course of the protocol, but they should not deviate from the protocol.

Owing to the semi-honest model used in this protocol, it is assumed that no participant deviates from the protocol. Because of the two secure channels used in this protocol, Ted does not know of the existence of Bob, which slightly improves privacy. Our protocol prevents information leakage even if one or more participants use an incorrect excess support count. We now analyze the situation in which one of the participants misbehaves.

Alice misbehaves. Suppose Alice tries to use incorrect data to acquire information from Bob. Owing to the random number k that Bob adds when returning the calculation result d, the returned value may lie anywhere in the field used by the encryption scheme, so Alice cannot figure out the excess support count owned by Bob.

Bob misbehaves. If Bob adds an incorrect excess support count to the ciphertext, Alice will not get the correct mining result from Ted. Nevertheless, Bob cannot obtain information about Alice's transaction database, because the protocol messages are mainly encrypted with Alice's public key. Each itemset at each site should be requested only once. That is, if Bob receives a request from Alice for the same itemset twice, he may suspect that Alice is trying to recover information about his database, and he can then abort the protocol.

The additional participant, Ted, cannot learn any information about Alice and Bob. The value d that Ted obtains is the sum of the excess support counts of all participants. Moreover, Ted does not know what the current itemset is during the protocol.

Our protocol provides partial collusion resistance. If either Alice or Bob colludes with Ted, he or she can figure out the excess support count owned by the other participant. However, if there are more than three participants and two of them collude with each other, they cannot obtain information about the third participant, owing to the random value k chosen by the third participant. In conclusion, if Alice or Bob uses fake data in our protocol, the mining results will be incorrect, but neither of them can obtain any additional information beyond the wrong mining result. Besides, Ted cannot learn information from Alice or Bob without colluding, because Ted only receives a number and judges whether it is greater than or equal to zero.
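The role of the random value k and the effect of collusion can be illustrated with plain modular arithmetic. This is a sketch under stated assumptions: the modulus `N` is a hypothetical plaintext modulus, values above `N // 2` are read as negative numbers, and the excess support counts are made up.

```python
import secrets

N = 1022117  # hypothetical plaintext modulus of the encryption scheme


def mask(d, k, n=N):
    """Value that Alice sees after decryption: (d + k) mod n."""
    return (d + k) % n


def unmask(masked, k, n=N):
    """Value that Ted recovers once he knows k, read as a signed number."""
    d = (masked - k) % n
    return d - n if d > n // 2 else d


t_alice, t_bob = 0, 500            # hypothetical excess support counts
d = t_alice + t_bob
k = secrets.randbelow(N)           # chosen by Bob, unknown to Alice
masked = mask(d, k)

# Ted, who learns k, recovers the sum d but not the individual terms.
assert unmask(masked, k) == d

# Alice alone: every candidate d is consistent with exactly one k,
# so the masked value by itself does not single out the true d.
assert all(mask(c, (masked - c) % N) == masked for c in range(-5, 6))

# Collusion: if Ted reveals d to Alice, she obtains Bob's term directly.
leaked_t_bob = d - t_alice
print(leaked_t_bob)  # 500
```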

3.6 Multi-Party Case

There are two approaches that can extend our proposed two-party case into the multi-party case:

1. Choose one of the participants to play the role that Ted plays in the two-party case.

2. Add an additional participant, as in the two-party case.

We now briefly discuss the two approaches. The first approach is the same as described in Section 4.6, except that more than two participants choose random numbers to distort their true excess support counts, which makes it more secure than the two-party case. The second approach should be performed under the semi-honest model; that is, it assumes that Bob is the one chosen to execute the steps performed by Ted in the two-party case.

Additionally, this architecture uses only one key pair to perform encryption and decryption, which reduces the time spent on the protocol. Each participant can multiply his or her own encrypted excess support count by an encrypted random value and send the encrypted result to Alice. This can be performed concurrently. As shown in Fig. 3.4, the first site can issue two parallel mining commands: the first goes through site 2 and site 5, and the second passes through site 3 and site 4. After collecting the calculation results from both commands, site 1 can merge them into the final result directly, using the additive homomorphic property on ciphertexts. In this example, the parallel execution saves almost half of the execution time compared with sequential calculation.
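The merging of two parallel branches can be sketched as follows. This is an illustration only: it assumes a toy Paillier instance with $g = n + 1$ and tiny hypothetical primes, it omits the random masks for brevity, it lets site 1 hold the private key, and the per-site values are made up. Threads stand in for the remote sites.

```python
import math
import secrets
from concurrent.futures import ThreadPoolExecutor

P, Q = 1009, 1013                      # toy primes, not secure
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def enc(m):
    """Paillier encryption under the single shared public key."""
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def dec(c):
    """Decryption by the key owner, read as a signed number."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def run_branch(site_values):
    """Each site on the branch multiplies in its own encrypted value."""
    acc = enc(0)
    for value in site_values:
        acc = acc * enc(value) % N2
    return acc


def mine_in_parallel(own_value, branches):
    """Site 1 runs the branches concurrently and merges the ciphertexts."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        partials = list(pool.map(run_branch, branches))
    total = enc(own_value)
    for c in partials:
        total = total * c % N2
    return dec(total)


# Branch 1: site 2 and site 5; branch 2: site 3 and site 4.
print(mine_in_parallel(100, [[-40, 25], [10, -5]]))  # 90
```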


Figure 3.4: Proposed Architecture (multi-party case).

3.7 Large Scale Privacy-Preserving Association Rule Mining

In real databases, the numbers of itemsets and transactions are much larger than in the simulation in Section 3.4. In this section, we measure the relation between the execution time and the number of itemsets to test the feasibility of MP-ARM with large-scale databases. The result is shown in Fig. 3.5. We observe that the execution time roughly doubles after a new itemset is inserted into the database. This is mainly caused by the increasing number of iterations needed for more itemsets.
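The doubling behaviour can be checked against the data labels recovered from Fig. 3.5. The sketch below assumes that the eleven labels correspond to consecutive item counts, which the extracted figure does not state explicitly; it also shows the number of non-empty itemsets over n items, which doubles (plus one) with every added item.

```python
import math

# Data labels of Fig. 3.5 in seconds; consecutive item counts are assumed.
times = [1.278, 3.663, 6.363, 12.337, 24.994, 49.167,
         99.757, 190.055, 371.266, 780.486, 1580.097]

# Growth factor between successive measurements.
ratios = [later / earlier for earlier, later in zip(times, times[1:])]

# Geometric mean of the growth factors over the whole series.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print([round(r, 2) for r in ratios])
print(round(geo_mean, 2))


def itemset_count(n_items):
    """Number of non-empty itemsets over n_items items."""
    return 2 ** n_items - 1


print([itemset_count(n) for n in range(1, 6)])  # [1, 3, 7, 15, 31]
```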

[Figure 3.5: plot of the execution time (sec) against the number of items. Data labels as extracted: 1.278, 3.663, 6.363, 12.337, 24.994, 49.167, 99.757, 190.055, 371.266, 780.486, 1580.097.]


Algorithm 1 Finding large k-itemsets
Input: $L_{k-1}$, $s$
Output: $L_k$

Step 1: Candidate sets generation
  Alice generates $C_k = \text{Apriori-gen}(L_{k-1})$

Step 2: Frequent itemsets generation
  for $c_i \in C_k$ do
    Alice sends the candidate and the encrypted excess support count $E_A(T_A(c_i))$ to Bob.
    Bob calculates $E_A(d + k)$ and sends it to Alice, where $d = T_A(c_i) + T_B(c_i)$ and $k$ is a random number in the predefined field.
    Alice decrypts $d + k$ and sends it to Ted together with $E_T(k)$.
    Ted decrypts $k$ and calculates $d$.
    if $d \geq 0$ then
      Ted sends $E_T(\text{'YES'})$ to Alice
    else
      Ted sends $E_T(\text{'NO'})$ to Alice
    end if
    Alice decrypts the final result.
    if the result from Ted is 'YES' then
      Add the itemset to $L_k$
    end if
  end for
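One iteration of Step 2 of Algorithm 1 can be sketched with a toy Paillier scheme. Several points are assumptions made for this illustration and are not taken from the algorithm: the primes are tiny and insecure, Bob forms $E_A(d + k)$ by multiplying ciphertexts, Ted knows Alice's modulus and has a larger modulus of his own, and Ted's answer is returned as a plain boolean instead of an encrypted 'YES' or 'NO'. The class `Paillier` bundles both keys for brevity; in a real deployment Bob would hold only the public parts.

```python
import math
import secrets


class Paillier:
    """Toy Paillier key pair with g = n + 1 (illustration only)."""

    def __init__(self, p, q):
        self.n = p * q
        self.n2 = self.n * self.n
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self._mu = pow(self._lam, -1, self.n)

    def enc(self, m):
        r = secrets.randbelow(self.n - 1) + 1
        while math.gcd(r, self.n) != 1:
            r = secrets.randbelow(self.n - 1) + 1
        return (1 + (m % self.n) * self.n) * pow(r, self.n, self.n2) % self.n2

    def dec(self, c):
        return (pow(c, self._lam, self.n2) - 1) // self.n * self._mu % self.n


def bob_reply(alice_key, ted_key, c_alice, t_bob):
    """Bob adds his excess support and a random mask k under encryption."""
    k = secrets.randbelow(alice_key.n)
    c_sum = c_alice * alice_key.enc(t_bob) * alice_key.enc(k) % alice_key.n2
    return c_sum, ted_key.enc(k)          # E_A(d + k), E_T(k)


def ted_decide(ted_key, n_alice, masked, c_k):
    """Ted removes the mask and tests d >= 0 (d read as a signed number)."""
    k = ted_key.dec(c_k)
    d = (masked - k) % n_alice
    if d > n_alice // 2:
        d -= n_alice
    return d >= 0


def is_frequent(alice_key, ted_key, t_alice, t_bob):
    c_alice = alice_key.enc(t_alice)      # Alice -> Bob
    c_sum, c_k = bob_reply(alice_key, ted_key, c_alice, t_bob)
    masked = alice_key.dec(c_sum)         # Alice sees only (d + k) mod n
    return ted_decide(ted_key, alice_key.n, masked, c_k)


alice = Paillier(1009, 1013)
ted = Paillier(1019, 1021)                # larger modulus, so k fits
print(is_frequent(alice, ted, 0, 500))        # True, since d = 500
print(is_frequent(alice, ted, -1000, -1000))  # False, since d = -2000
```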


Algorithm 2 Association Rule Generation
Input: $L_k$, $c$
Output: $AR$ (the set of all association rules)

for each frequent itemset $L_i \in L_k$ do
  Alice splits $L_i$ into all possible $i_1$ and $i_2$ such that $L_i = i_1 \cup i_2$ and $i_1 \cap i_2 = \emptyset$
  Alice sends $\alpha_1 \leftarrow E_A(\text{count}(L_i))$ to Bob
  Bob sends $E_A(d' + k)$ and $E_T(k)$ to Alice
  Alice decrypts $d' + k$ and sends it to Ted together with $E_T(k)$
  Ted decrypts $k$ and calculates $d'$
  if $d' \geq 0$ then
    Ted sends $E_T(\text{'YES'})$ to Alice
  else
    Ted sends $E_T(\text{'NO'})$ to Alice
  end if
  Alice decrypts the final result.
  if the result from Ted is 'YES' then
    $AR \leftarrow AR \cup \{i_1 \Rightarrow i_2\}$
  end if
end for
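The splitting step of Algorithm 2 can be sketched in plaintext. The enumeration of $(i_1, i_2)$ pairs follows the algorithm directly. The helper `excess_confidence` is an assumption: the algorithm does not spell out how $d'$ is formed, and the sketch supposes it is built like the excess support, from the count of the whole itemset, the count of the antecedent and the minimum confidence $c$ in percent.

```python
from itertools import combinations


def split_itemset(itemset):
    """Yield every (i1, i2) with i1 and i2 non-empty, disjoint, i1 | i2 = itemset."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for left in combinations(items, size):
            right = tuple(x for x in items if x not in left)
            yield left, right


def excess_confidence(count_itemset, count_antecedent, min_conf_pct):
    """Hypothetical d' for one site: count(L_i) * 100 - c * count(i1)."""
    return count_itemset * 100 - min_conf_pct * count_antecedent


splits = list(split_itemset({"bread", "milk", "eggs"}))
print(len(splits))  # 6, that is 2**3 - 2 candidate rules
print(splits[0])    # (('bread',), ('eggs', 'milk'))

# Hypothetical counts: the itemset occurs 40 times, its antecedent 50 times.
print(excess_confidence(40, 50, 75) >= 0)  # True: 40 / 50 = 80% >= 75%
```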


Chapter 4 Privacy-Preserving Decision Tree Learning

Big data are collections of large and complex datasets that are difficult to process using traditional data processing algorithms. The main objective of big data analysis is to classify large amounts of raw data into meaningful clusters. In this chapter, we apply our techniques to implement a classification algorithm, ID3 decision tree learning.

4.1 Problem Definition

Consider two participants, Alice and Bob, who possess two horizontally partitioned transaction databases $T_1$ and $T_2$ of sizes $|T_1|$ and $|T_2|$ respectively, where $T = T_1 \cup T_2$. First, we assume that the two databases $T_1$ and $T_2$ have the same structure and that the attribute names are public, which is essential for joint computation. On the one hand, using universal names clearly reduces the complexity of our protocol. On the other hand, even if the attribute names are sensitive information, this can be handled by applying some distortion to the naming at the preprocessing phase. Here, we simplify the protocol by using public attribute names. The goal is for the two participants to jointly compute the decision tree using the ID3 algorithm without additional information leakage.

4.2 Related Work

Several approaches have been proposed for privacy-preserving decision tree learning [19–27]. In [18], the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. The key idea is that finding the attribute that maximizes the information gain is equivalent to finding the attribute that minimizes the conditional entropy. The conditional entropy for an attribute in the two-party case can be expressed as a sum of expressions of the form $(v_1 + v_2) \times \log(v_1 + v_2)$. In that paper, the authors propose a way to securely compute the value $(v_1 + v_2) \times \log(v_1 + v_2)$ and show how to take advantage of this function to securely build the ID3$_\delta$ decision tree. This approach treats privacy-preserving data mining as a special case of secure multi-party computation [10]; it not only preserves individual privacy but also tries to prevent the leakage of any information.

Although this algorithm can be implemented by the previous method, efficiency is still a bottleneck when there is a large amount of data (big data). In the following sections, we first describe how this problem can be solved by homomorphic encryption and then use a protocol similar to the one described in Chapter 3. To increase efficiency, we can sometimes share some non-sensitive information.


Table 4.1: The notation table of MP-DTL

Notation    Meaning
$R$         The set of attributes
$C$         The class attribute
$T$         The set of transactions
$A$         The non-class attributes
$H_C(T)$    The entropy of transaction database $T$
$T(c_i)$    The transaction count which has class attribute $c_i$
$T(a_j)$    The transaction count which has non-class attribute $a_j$

4.3 Proposed Protocol

To perform privacy-preserving decision tree learning, we can use the protocol described before with a small modification. We first analyze why the joint calculation can be separated. The root node is decided by calculating the information gain for all non-class attributes. Assume we have $n$ non-class attributes $A = \{a_1, a_2, \dots, a_n\}$. The impurity of the original database is

$$H_C(T) = \sum_{i=1}^{l} -\frac{|T(c_i)|}{|T|} \log \frac{|T(c_i)|}{|T|}.$$

For a given non-class attribute $A$, let $A$ have $m$ possible values $a_1, a_2, \dots, a_m$ and let the class attribute $C$ have $l$ possible values $c_1, c_2, \dots, c_l$. Then

$$\begin{aligned}
H_C(T|A) &= \sum_{j=1}^{m} \frac{|T(a_j)|}{|T|} H_C(T(a_j)) \\
&= \frac{1}{|T|} \sum_{j=1}^{m} |T(a_j)| \sum_{i=1}^{l} -\frac{|T(a_j, c_i)|}{|T(a_j)|} \log\left(\frac{|T(a_j, c_i)|}{|T(a_j)|}\right) \\
&= \frac{1}{|T|} \left( -\sum_{j=1}^{m} \sum_{i=1}^{l} |T(a_j, c_i)| \log |T(a_j, c_i)| + \sum_{j=1}^{m} |T(a_j)| \log |T(a_j)| \right).
\end{aligned}$$

This shows that the conditional entropy can be calculated jointly by collecting $T(a_j, c_i)$ and $T(a_j)$. Here,

$$|T(a_j)| = |T_1(a_j)| + |T_2(a_j)|, \qquad |T(a_j, c_i)| = |T_1(a_j, c_i)| + |T_2(a_j, c_i)|,$$

where $T_b(a_j)$ comes from party $b$, $b = 1, 2$. This means that we can jointly compute the entropy and the information gain by collecting the data from the different partitioned databases.
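The claim that the conditional entropy depends only on the summed counts can be checked numerically. The sketch below merges hypothetical per-party counts and evaluates $H_C(T|A)$ in two ways: from the definition, and from a count-only form that uses only $|T(a_j, c_i)|$ and $|T(a_j)|$. The base-2 logarithm, the attribute values and the counts are assumptions chosen for illustration.

```python
import math


def merge(counts_1, counts_2):
    """Add the per-(attribute value, class) counts of the two parties."""
    keys = set(counts_1) | set(counts_2)
    return {key: counts_1.get(key, 0) + counts_2.get(key, 0) for key in keys}


def value_totals(joint):
    """|T(a_j)| for every attribute value a_j."""
    totals = {}
    for (value, _cls), n in joint.items():
        totals[value] = totals.get(value, 0) + n
    return totals


def cond_entropy_direct(joint):
    """H_C(T|A) from the definition: sum_j |T(a_j)| / |T| * H_C(T(a_j))."""
    total = sum(joint.values())
    by_value = value_totals(joint)
    h = 0.0
    for (value, _cls), n in joint.items():
        if n > 0:
            h -= n / total * math.log2(n / by_value[value])
    return h


def cond_entropy_counts(joint):
    """Same quantity from sums of n * log(n) over the joint and value counts."""
    total = sum(joint.values())
    s_joint = sum(n * math.log2(n) for n in joint.values() if n > 0)
    s_value = sum(n * math.log2(n) for n in value_totals(joint).values() if n > 0)
    return (-s_joint + s_value) / total


alice = {("sunny", "yes"): 2, ("sunny", "no"): 3, ("rainy", "yes"): 3}
bob = {("sunny", "yes"): 1, ("rainy", "yes"): 1, ("rainy", "no"): 2}
joint = merge(alice, bob)

print(round(cond_entropy_direct(joint), 4))
print(math.isclose(cond_entropy_direct(joint), cond_entropy_counts(joint)))  # True
```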

The protocol contains two steps: 1) jointly calculate $H_C(T)$; 2) jointly calculate $H_C(T|A)$. After executing these two steps, Ted can determine the attribute that maximizes the information gain. The detailed protocol steps are listed in Algorithm 3.

4.4 Measurement

The communication and computational complexity for each party is analyzed as follows (recall that $R$ is the set of non-class attributes and $T$ the set of transactions):

1. The proposed protocol is repeated $m \cdot l$ times for each tested attribute, where $m$ and $l$ are the numbers of possible values of the non-class attribute and of the class attribute, as mentioned before. For all $|R|$ attributes, we have $O(m \cdot l \cdot |R|)$.

2. The number of rounds needed to communicate between the parties is constant. To construct the decision tree with $|R|$ non-class attributes, the total communication complexity is $O(m \cdot l \cdot |R|^2)$. Compared with the non-private distributed ID3, our proposed protocol doubles the communication by inserting the special participant Ted.


4.5 Analysis

4.6 Security Analysis

The protocol does not prevent participants from using incorrect data as their input. Under this consideration, we assume that the protocol is executed under the semi-honest model in order to obtain correct mining results. That is, all participants may gather all the information they receive in the course of the protocol, but they should not deviate from the protocol.

Owing to the semi-honest model used in this protocol, it is assumed that no participant deviates from the protocol. Because of the two secure channels used in this protocol, Ted does not know of the existence of Bob, which slightly improves privacy. Our protocol prevents information leakage even if one or more participants use an incorrect excess support count. We now analyze the situation in which one of the participants misbehaves.

Alice misbehaves. Suppose Alice tries to use incorrect data to acquire information from Bob. Owing to the random number k that Bob adds when returning the calculation result d, the returned value may lie anywhere in the field used by the encryption scheme, so Alice cannot figure out the excess support count owned by Bob.

Bob misbehaves. If Bob adds an incorrect excess support count to the ciphertext, Alice will not get the correct mining result from Ted. Nevertheless, Bob cannot obtain information about Alice's transaction database, because the protocol messages are mainly encrypted with Alice's public key. Each itemset at each site should be requested only once. That is, if Bob receives a request from Alice for the same itemset twice, he may suspect that Alice is trying to recover information about his database, and he can then abort the protocol.

The additional participant, Ted, cannot learn any information about Alice and Bob. The value d that Ted obtains is the sum of the excess support counts of all participants. Moreover, Ted does not know what the current itemset is during the protocol.

Our protocol provides partial collusion resistance. If either Alice or Bob colludes with Ted, he or she can figure out the excess support count owned by the other participant. However, if there are more than three participants and two of them collude with each other, they cannot obtain information about the third participant, owing to the random value k chosen by the third participant. In conclusion, if Alice or Bob uses fake data in our protocol, the mining results will be incorrect, but neither of them can obtain any additional information beyond the wrong mining result. Besides, Ted cannot learn information from Alice or Bob without colluding, because Ted only receives a number and judges whether it is greater than or equal to zero.


Algorithm 3 ID3 Decision Tree Learning
Input: Partitioned databases $T_A$ (Alice), $T_B$ (Bob)
Output: Decision tree

Step 1: Jointly calculate $H_C(T)$
  for each class value $c_i \in C$ do
    1. Alice calculates the count $T_A(c_i)$
    2. Alice sends the encrypted count $E_A(T_A(c_i))$ to Bob
    3. Bob sends $E_A(d_i + k_i)$ and $E_T(k_i)$ to Alice, where $d_i = T(c_i) = T_A(c_i) + T_B(c_i)$
    4. Alice decrypts $d_i + k_i$ and sends it to Ted together with $E_T(k_i)$
    5. Ted decrypts $k_i$ and calculates $d_i$
  end for
  Ted calculates the entropy $H_C(T) = \sum_{i=1}^{l} -d_i \log d_i$

Step 2: Jointly calculate $H_C(T|A)$
  for each attribute $a_j \in A$ do
    1. Alice calculates her own $T_A(a_j)$ and $T_A(a_j, c_i)$
    2. Alice sends the encrypted results to Bob
    3. Bob calculates his $T_B(a_j)$ and $T_B(a_j, c_i)$
    4. Bob encrypts the result and sends $E_A(d'_j + k_j)$, $E_A(d''_j + k_j)$ and $E_T(k_j)$ to Alice, where $d'_j = T_A(a_j) + T_B(a_j)$ and $d''_j = T_A(a_j, c_i) + T_B(a_j, c_i)$
    5. Alice decrypts $d'_j + k_j$ and $d''_j + k_j$, and sends them to Ted together with $E_T(k_j)$
    6. Ted decrypts $k_j$ and calculates $d'_j$ and $d''_j$
    7. Ted calculates the conditional entropy $H_C(T|A)$
  end for

Ted finds the largest information gain and sends it back to Alice.
Recursively construct the decision tree.
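The final selection performed by Ted can be sketched in plaintext, once the masks have been removed and only the summed counts remain. The sketch assumes normalised entropies with a base-2 logarithm and uses hypothetical attribute names and counts; it covers a single node of the tree, not the recursive construction.

```python
import math


def entropy(class_counts):
    """H_C of a list of per-class counts, with frequencies n / total."""
    total = sum(class_counts)
    return -sum(n / total * math.log2(n / total) for n in class_counts if n > 0)


def conditional_entropy(joint):
    """joint maps each attribute value to its list of per-class counts."""
    total = sum(sum(row) for row in joint.values())
    return sum(sum(row) / total * entropy(row)
               for row in joint.values() if sum(row) > 0)


def best_attribute(class_counts, joint_by_attribute):
    """Return the attribute with the largest information gain and all gains."""
    h = entropy(class_counts)
    gains = {name: h - conditional_entropy(joint)
             for name, joint in joint_by_attribute.items()}
    return max(gains, key=gains.get), gains


# Hypothetical summed counts (Alice plus Bob); classes are [yes, no].
class_counts = [7, 5]
joint_by_attribute = {
    "outlook": {"sunny": [3, 3], "rainy": [4, 2]},
    "windy": {"true": [1, 5], "false": [6, 0]},
}
best, gains = best_attribute(class_counts, joint_by_attribute)
print(best)  # windy
print({name: round(g, 3) for name, g in gains.items()})
```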


Chapter 5 Conclusion and Future Work

5.1 Conclusion

In this thesis, we have introduced two privacy-preserving protocols, one for association rule mining and the other for decision tree learning. Our results show that the execution time of the protocol is much shorter than that of previous work. By adding a special participant to these protocols, we improve the efficiency over previous work. The protocols were shown to be secure under the semi-honest model of multi-party computation. The security analysis is based on mathematical hardness assumptions.

The main contribution is the design of protocols that strike a balance between efficiency and privacy preservation. Previous work either uses Yao's garbled circuit or relies on fully homomorphic encryption. The former approach spends too much time on complex circuits, and the latter does not scale to large distributed databases.


5.2 Future Work

Although these protocols have higher efficiency, they cannot ensure that each participant uses correct data when jointly computing the mining results. Using signatures to guarantee the correctness of the data may be a possible way to solve this problem.


Bibliography

[1] P. N. Tan, M. Steinbach, and V. Kumar, “Introduction to Data Mining”. Addison-Wesley, 2006.

[2] T. Fawcett, “diapers and beer”, 2000.

[3] T. Hastie and R. Tibshirani, “Discriminant adaptive nearest neighbor classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 607–616, Jun. 1996.

[4] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” Journal of Artificial Intelligence Research, vol. 2, pp. 1–32, 1994.

[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[6] C. Gentry, “Fully Homomorphic Encryption Using Ideal Lattices,” in STOC ’09: Proceedings of the 2009 ACM Symposium on Theory of Computing, ser. Annual ACM Symposium on Theory of Computing. ACM, 2009, pp. 169–178, 41st Annual ACM Symposium on Theory of Computing, Bethesda, MD, May 31-Jun. 2, 2009.

[7] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in Advances in Cryptology EUROCRYPT 99, ser. Lecture Notes in Computer Science, J. Stern, Ed. Springer Berlin Heidelberg, 1999, vol. 1592, pp. 223–238.

[8] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” Proc. 20th Int. Conf. Very Large Data, 1994.

[9] J. R. Quinlan, “Induction of decision trees,” Mach. Learn, pp. 81–106, 1986.

[10] O. Goldreich and A. Warning, “Secure multi-party computation,” 1998.

[11] A. C. C. Yao, “How to generate and exchange secrets,” in Foundations of Computer Science, 1986., 27th Annual Symposium on, 1986, pp. 162–167.

[12] I. Ioannidis and A. Grama, “An efficient protocol for Yao’s millionaires’ problem,” in System Sciences, 2003. Proceedings of the 36th Annual Hawaii International Conference on, 2003, pp. 6 pp.–.

[13] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” SIGMOD Record, vol. 29, no. 2, pp. 439–450, Jun. 2000, International Conference on Management of Data, Dallas, Texas, May 16-18, 2000.

[14] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ser. PODS ’01. New York, NY, USA: ACM, 2001, pp. 247–255.

[15] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining of association rules,” Information Systems, vol. 29, no. 4, pp. 343–364, Jun. 2004, 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Canada, Jul. 23-26, 2002.

[16] S. Rizvi and J. Haritsa, “Maintaining data privacy in association rule mining,” the 28th international conference on Very Large Data, 2002.

[17] M. G. Kaosar, R. Paulet, and X. Yi, “Fully homomorphic encryption based two-party association rule mining,” Data and Knowledge Engineering, vol. 76-78, pp. 1–15, June 2012.

[18] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, ser. CRYPTO ’00. London, UK, UK: Springer-Verlag, 2000, pp. 36–54.

[19] W. Du and Z. Zhan, “Using randomized response techniques for privacy-preserving data mining,” the conference on Knowledge discovery and data mining, pp. 505–510, 2003.

[20] Z. Zhong and R. Wright, “Privacy-preserving classification of customer data without loss of accuracy,” SIAM International Conference on Data Mining, pp. 1–11, 2005.

[21] R. Wright and Z. Yang, “Privacy-preserving Bayesian network structure computation on distributed heterogeneous data,” Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04, p. 713, 2004.

[22] M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data,” Knowledge and Data Engineering, pp. 2–13, 2004.

[23] D. Cheung, V. Ng, and A. W. Fu, “Efficient mining of association rules in distributed databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 911–922, 1996.

[24] Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” Journal of Cryptology, vol. 15, no. 3, pp. 177–206, June 2002.

[25] J. Vaidya and C. Clifton, “Privacy preserving association rule mining in vertically partitioned data,” ACM SIGKDD Conference on Knowledge discovery and data mining, 2002.

[26] I. Saleh and A. Mokhtar, “P3ARM: privacy-preserving protocol for association rule mining,” Information Assurance Workshop, 2006 IEEE, pp. 76–83, 2006.

[27] M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules on horizontally partitioned data,” IEEE Transactions on Knowledge and Data Engineering, vol. 16,

Vita

Yin-Ming Chang

He was born in Taiwan, R. O. C. in 1985. He received a B.S. in Chemical Engineering from National Taiwan University in 2007. From June 2011 to July 2013, he worked toward his Master's degree in the Mobile Communications and Cloud Computing Lab in the Department of Computer Science at National Chiao-Tung University. His research interests are in the field of security issues in cloud environments.