Secure Nearest Neighbor Revisited


Bin Yao¹, Feifei Li², Xiaokui Xiao³

¹Department of Computer Science and Engineering, Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, China

²School of Computing, University of Utah

³School of Computer Engineering, Nanyang Technological University, Singapore

¹{yaobin}@cs.sjtu.edu.cn, ²{lifeifei}@cs.utah.edu, ³{xkxiao}@ntu.edu.sg

Abstract—In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a cloud service provider and asks for an encrypted data point in E(D) (the encrypted database) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods [15], [21], even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field [5], [6], and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models [5], [6], our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D).

Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed to contain the answer for the SNN query.

Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption scheme.

I. INTRODUCTION

The cloud has gained increasing popularity for its flexibility and scalability, which motivates cloud service providers to offer access to cloud databases, such as Amazon Relational Database Service (Amazon RDS), Google Cloud SQL, and Microsoft SQL Azure. Data owners outsource their databases to the cloud service providers and rely on their services for storage, management, and query processing of the databases.

Clearly, this framework offers great flexibility and scalability to data owners and their clients, and it is especially useful for users with stringent local resources.

However, the remote placement of the data brings security concerns. A data owner may prefer to prevent the server (i.e., the service provider) from learning the content of his/her database D or the contents of queries to D, while still requiring the server to provide database functionality for D in the cloud [11], [13], [16]. For this purpose, the data owner needs to encrypt D with an encryption scheme E and publish to the server only an encrypted version of D, denoted as E(D).

The clients also need to encrypt their queries q and send only E(q) to the server. The server needs to identify the ciphertext in E(D) that corresponds to the answer of q on D, using only E(q) and E(D). We dub a query processing procedure satisfying these constraints secure query processing.

Under the settings of secure query processing, there have been a few techniques that address general range selection queries and aggregation queries [11], [13], [14], [16], [19].

For more complicated query types, however, there is only a narrow selection of existing studies. In particular, this has been the case for the secure nearest neighbor (SNN) problem, where a client issues an encrypted query point E(q) to a server and asks for an encrypted data point in E(D) that is closest to the query point, without allowing the server to learn the plaintext of the query or the data. This problem is of significant practical importance, since nearest neighbor (NN) queries are fundamental in spatial and multimedia databases, both of which find extensive applications in the cloud. Unfortunately, as we will show in this work, existing methods for secure NN queries, though claimed to be secure [15], [21] in standard security models (e.g., indistinguishability under chosen plaintext or ciphertext attacks), are not secure.

This inspires us to examine the hardness of the SNN problem. Our investigation shows that the SNN problem is at least as hard as constructing an order-preserving encryption (OPE) scheme [5], [6]. Given the recent findings that it is impossible to construct secure OPE schemes in standard security models [5], our results immediately imply that one cannot expect the server to find the exact (encrypted) answer for an SNN query given only E(q) and E(D).

However, this does not mean we are hopeless in answering secure NN queries in cloud databases. The trick is to relax what we require the server to achieve. Instead of asking him to find the encrypted exact nearest neighbor based on only E(q) and E(D), we can ask the server to find a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that G is guaranteed to contain the nearest neighbor of q.

Note that the hardness of the SNN problem does not apply to this modified problem, as finding E(G) is not the same as finding the encrypted exact nearest neighbor. Once given E(G), any trusted entity with the decryption function E⁻¹ can easily identify the answer to a query q, by using E⁻¹ to decrypt E(G) and then inspecting the data points therein.

Our new methods build on this idea for data in one and two dimensions, and they allow a customizable tradeoff between communication and computation costs in cloud databases. Our idea leverages special partitions over D and the Voronoi diagram of D. We dub our new method the secure Voronoi diagram (SVD) method. Since the SVD method does not use any new encryption scheme but relies only on a standard encryption scheme E (e.g., public-key encryption RSA, symmetric-key encryption AES), it is as secure as E in any standard security model in which E is proven secure (e.g., indistinguishability under either chosen-plaintext or chosen-ciphertext attacks).

Finally, we use extensive experiments to demonstrate the efficiency of our attacks on existing methods, as well as the efficiency and the scalability of our new method SVD. Our experiments were conducted over large real datasets (up to a few million points in multi-dimensional spaces).

The rest of the paper is organized as follows. Section II formulates the problem. Sections III-A and III-B show our construction of two (different) efficient attacks on the SNN schemes in [21] and [15], respectively. Section IV investigates the relationship between SNN and OPE, which establishes the hardness of the SNN problem. Section V presents the new secure Voronoi diagram (SVD) method. Section VI gives our experimental results. Section VII surveys the related work, and Section VIII concludes the paper.

II. PROBLEM FORMULATION

Formally, the SNN problem involves three parties:

1) A data owner who has a database D with d-dimensional Euclidean objects/points, and would like to outsource D to a server that cannot be fully trusted.

2) A client (or multiple clients) who wants to access and pose queries to the database D.

3) A server that is honest but potentially curious about the tuples in D and/or the queries from the clients. A server could be curious either of his own accord or because he has been compromised to become curious on behalf of a third party without his explicit knowledge.

Our objective is to enable the client to perform NN queries without letting the server learn the contents of the query (and its result) or the tuples in the database. Note that in practice, a data owner could also be a client. Clearly, in order to achieve our objective, the database D has to be encrypted by some encryption scheme E by the data owner. We use E(D) to denote the encrypted version of the database, and E⁻¹ to denote the corresponding decryption function (with the necessary secret). Similarly to the case for the data owner, clients only send to the server the encrypted versions E(q) of their queries q (each of which represents a query point).

We aim to ensure that the SNN method is as secure as the encryption method E used by the data owner. For ease of exposition, we consider that E is proven secure under chosen-plaintext attack, but our method can be straightforwardly extended to other adversary models (e.g., indistinguishability under chosen-ciphertext attack).

The above formulation of the SNN problem is adopted in previous work [15], [21] and is of considerable practical importance for various reasons [11], [12], [14]–[16], [19], [21]. For example, D might contain some sensitive values that cannot be disclosed to the server, or D represents a business asset to the data owner and/or the client, or the service providers wish to provide SNN service as an assurance to attract more customers.

Without loss of generality, we assume that D is represented by N tuples {p1, ..., pN}. Each tuple can be viewed as a d-dimensional point: for any p ∈ D, p = ⟨v1, v2, ..., vd⟩; p can also be considered as a vector in d-dimensional space.

Remarks. One may consider an alternative formulation of the SNN problem by regarding the data owner and the client as the same party, since they share the same key and are assumed to trust each other. We explicitly differentiate the data owner from the client, so as to be consistent with previous work [15], [21] and to simplify our discussion on the loopholes of the existing methods (see Section III).

Unless otherwise specified, all proofs in this paper can be found in the appendix of our online technical report [22].

III. INSECURITY OF EXISTING METHODS

A. Solution by Wong et al. [21]

Basic idea. The data owner encrypts each tuple in the database before sending it to the server using a customized encryption scheme ET. The client sends an encrypted query to the server using a related but slightly different customized encryption scheme EQ. Wong et al. [21] showed that the customized encryption schemes ET and EQ preserve the dot product between the query vector q and any tuple vector p from the database D, i.e., q · p = EQ(q) · ET(p). Furthermore, for any two points p1 and p2 from D, p1 · p2 ≠ ET(p1) · ET(p2), i.e., the dot product between two tuples in D cannot be obtained directly from their ciphertexts.

Given an encrypted query point E(q), the server (i) inspects the dot product between E(q) and the ciphertext ET(p) of each tuple p ∈ D, and (ii) returns to the client the ciphertext that corresponds to the answer of the SNN query (see [21] for the detailed algorithm). After that, the client decrypts the ciphertext to get the query result. Note that, in order for the client to encrypt and decrypt, the data owner needs to share the secret key with the client.

Customized encryption schemes. The data owner and the client share the following secret information: (i) an integer d′ ≥ d, (ii) two d′ × d′ matrices M1 and M2 that are random but invertible, (iii) a random bit vector A with d′ bits b1, b2, ..., b_{d′}, and (iv) some additional information that is irrelevant to our discussion.

Let p = ⟨v1, v2, ..., vd⟩ be any tuple in the database. The encryption scheme ET works as follows. The data owner first converts p into a d′-dimensional tuple p′ = [v′_1, ..., v′_{d′}]^T, such that (i) v′_i = v_i for any i ≤ d, (ii)

$$v'_{d+1} = -\frac{1}{2}\sum_{j=1}^{d} v_j^2,$$

and (iii) v′_{d+2}, ..., v′_{d′} are random numbers generated by some specific rules (omitting such details does not affect our analysis here). Next, p′ is transformed into two tuples p′_a = [x1, ..., x_{d′}]^T and p′_b = [y1, ..., y_{d′}]^T based on A, such that for any i ∈ [1, d′],

1) if b_i = 1 (the ith bit from A), then x_i and y_i are two random numbers satisfying x_i + y_i = v′_i;

2) otherwise, x_i = y_i = v′_i.


Finally, the data owner computes p∗_a = M1^T · p′_a and p∗_b = M2^T · p′_b, and sends p∗_a and p∗_b to the server as the encrypted version of p.

Meanwhile, the scheme EQ for encrypting query points is as follows. Given q = ⟨v1, v2, ..., vd⟩ and a random number r > 0, the client first converts q into a d′-dimensional tuple q′ = [v′_1, ..., v′_{d′}]^T that satisfies two conditions. First, v′_i = r · v_i for any i ≤ d, and v′_{d+1} = r. Second, the v′_i's for i ∈ [d + 2, d′] are random values generated according to some specific rules, such that the scalar product over the artificial attributes (between d + 1 and d′) from any q′ and p′ is always 0.

Then q′ is transformed into two tuples q′_a = [x1, ..., x_{d′}]^T and q′_b = [y1, ..., y_{d′}]^T, such that ∀i ∈ [1, d′],

1) if b_i = 0 (the ith bit from A), then x_i and y_i are two random numbers such that x_i + y_i = v′_i;

2) otherwise, x_i = y_i = v′_i.

The encrypted version of q is the pair q∗_a = M1⁻¹ · q′_a and q∗_b = M2⁻¹ · q′_b.
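The split-and-multiply step shared by ET and EQ can be checked numerically. The Python sketch below is illustrative only and is not the implementation of [21]: it starts from already-extended d′-dimensional vectors p′ and q′ (how the coordinates beyond d are filled in is left out), uses small random integer matrices with exact fractions, and all helper names (`solve`, `split`, `encrypt_tuple`, `encrypt_query`) are ours. It checks that p∗_a · q∗_a + p∗_b · q∗_b equals p′ · q′ for a randomly drawn secret bit vector A.

```python
from fractions import Fraction
import random

def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan elimination (raises if M is singular)."""
    n = len(M)
    A = [[Fraction(v) for v in M[i]] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        A[col], A[pivot] = A[pivot], A[col]
        head = A[col][col]
        A[col] = [v / head for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def random_invertible(n, rng):
    """Draw random integer matrices until one is invertible."""
    while True:
        M = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            solve(M, [0] * n)
            return M
        except ValueError:
            pass

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def transpose_times(M, v):
    """Compute M^T v."""
    n = len(M)
    return [sum(M[i][j] * v[i] for i in range(n)) for j in range(n)]

def split(v, bits, split_bit, rng):
    """Split coordinate i into two random shares when bits[i] == split_bit, else copy it."""
    a, b = [], []
    for vi, bit in zip(v, bits):
        if bit == split_bit:
            x = Fraction(rng.randint(-50, 50))
            a.append(x)
            b.append(vi - x)
        else:
            a.append(Fraction(vi))
            b.append(Fraction(vi))
    return a, b

def encrypt_tuple(p_ext, key, rng):
    M1, M2, A = key
    pa, pb = split(p_ext, A, 1, rng)          # tuples are split where b_i = 1
    return transpose_times(M1, pa), transpose_times(M2, pb)

def encrypt_query(q_ext, key, rng):
    M1, M2, A = key
    qa, qb = split(q_ext, A, 0, rng)          # queries are split where b_i = 0
    return solve(M1, qa), solve(M2, qb)       # M^{-1} q' computed as the solution of M x = q'

rng = random.Random(7)
n = 5                                          # plays the role of d'
key = (random_invertible(n, rng), random_invertible(n, rng),
       [rng.randint(0, 1) for _ in range(n)])
p_ext = [rng.randint(-9, 9) for _ in range(n)]
q_ext = [rng.randint(-9, 9) for _ in range(n)]
pa, pb = encrypt_tuple(p_ext, key, rng)
qa, qb = encrypt_query(q_ext, key, rng)
score = dot(pa, qa) + dot(pb, qb)
print(score == dot(p_ext, q_ext))
```

The identity holds because (M^T x) · (M⁻¹ y) = x · y for any invertible M, and because in every coordinate exactly one of the two vectors is split into shares while the other is copied.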

Security guarantee. Wong et al. show that the above encryption schemes preserve the dot product between any query point q and any point p in the database, i.e., q · p = EQ(q) · ET(p) (this forms the basis for their SNN method [21]). Furthermore, Wong et al. claim that this protocol can guard against any attacks based on the knowledge of a number of (plaintext, ciphertext) pairs [21], and their argument is as follows: if the boolean vector A is known to the adversary, then he/she would be able to use the known (plaintext, ciphertext) pairs to construct linear equations about M1 and M2, and then solve the equations to obtain M1 and M2. However, since A is secret, it would be hard for the adversary to derive the correct linear equations to use, since the formulation of the equations depends on A. A brute-force approach would require the adversary to examine all possible bit vectors A, which leads to 2^|A| linear equation systems that cannot be solved in reasonable time when |A| is large.

We observe that the above reasoning is not rigorous, since it assumes that the adversary only attempts his/her attacks by solving for M1 and M2. We will demonstrate an attack that does not require any knowledge about M1 and M2.

Our attack using chosen-plaintext attack. Assume that the server obtains d query points and their encryptions (by asking the oracle in the chosen-plaintext attack model). For each q of those query points, the server would have two encrypted points q∗_a and q∗_b generated by EQ. Wong et al.’s encryption schemes ensure that the dot product between q and any database point p can be calculated based on the following equation:

$$p \cdot q = p^*_a \cdot q^*_a + p^*_b \cdot q^*_b. \tag{1}$$

Notice that the above equation contains only d variables unknown to the server, i.e., the d coordinates of the data point p. Since the server has the plaintext-ciphertext pairs of d query points, he can construct d linear equations like (1) to derive the coordinates of p.
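The linear-algebra step of this attack can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: it takes Eq. (1) at face value and simulates the value the server computes from ciphertexts by a hypothetical function `score_from_ciphertexts`, which simply returns q · p for a hidden point. The d known query points are assumed to be linearly independent, and `solve` is our own helper.

```python
from fractions import Fraction

def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan elimination (raises if M is singular)."""
    n = len(M)
    A = [[Fraction(v) for v in M[i]] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("query points are not linearly independent")
        A[col], A[pivot] = A[pivot], A[col]
        head = A[col][col]
        A[col] = [v / head for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Hidden database point; the attacker never reads this variable directly.
_hidden_p = [4, -2, 7]

def score_from_ciphertexts(q):
    """Stand-in for the right-hand side of Eq. (1), which the server can
    evaluate from ciphertexts alone; simulated here as q . p."""
    return sum(qi * pi for qi, pi in zip(q, _hidden_p))

# d = 3 known query plaintexts (obtained via the chosen-plaintext oracle).
known_queries = [[1, 0, 2],
                 [0, 1, 1],
                 [3, 1, 0]]

# One linear equation q_k . p = s_k per known query; unknowns are p's coordinates.
scores = [score_from_ciphertexts(q) for q in known_queries]
recovered_p = solve(known_queries, scores)
print(recovered_p == _hidden_p)
```

Note that the secret matrices M1, M2 and the bit vector A never appear: the attacker only needs the d scores and the d query plaintexts.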

B. Solution by Hu et al. [15]

Hu et al. [15] consider the SNN problem under a setting that is similar to, but different from, ours. They assume that each client (i) has the ciphertexts of all data points in D and the encryption function E used to encrypt D, but (ii) does not have the decryption function E⁻¹. On the other hand, the server has E⁻¹ and some auxiliary information about each data point (which is irrelevant to our discussion), but does not have the plaintext or ciphertext of any data point. The objective of Hu et al.’s method is to enable the client to identify the encrypted answer for any SNN query (with help from the server), after which the client can retrieve the auxiliary information associated with the answer from the server (the plaintext of the answer remains secret to the client). Hu et al. claimed that their solution not only prevents the server and the client from learning the plaintext of any data point, but also prevents the data owner and the server from learning the queries posed by the clients. However, we will show that Hu et al.’s solution does not fulfill their security claims.

The encryption scheme. The construction in [15] relies on an encryption scheme that provides what they call privacy homomorphism (PH). PHs are encryption transformations that map some operations on cleartext to operations on ciphertext. Formally, they are encryptions E_k : P → X that allow a set of operations F on encrypted data without knowledge of the decryption function (here, P is the domain of plaintext and X is the domain of ciphertext). In particular, they used the ASM-PH encryption scheme from [9], which supports modular addition, subtraction, and multiplication. In what follows, we use E to denote an ASM-PH encryption function, and E⁻¹ to denote the corresponding decryption function (with the necessary secret keys). The PH encryption E in [9] is shown to be secure against known-plaintext attacks, and it can perform addition, subtraction, and multiplication directly on the ciphertexts, e.g., E(a + b) = E(a) + E(b). Note that in a PH, the operation applied to E(a) and E(b) is not necessarily the same as the plaintext operation it preserves; it may be some function f on E(a) and E(b) that is efficient to compute and gives the same output as the operation of interest over the corresponding plaintexts, i.e., E(a + b) = f(E(a), E(b)). For simplicity, in the remainder of this paper, we use the same operation over the ciphertexts to denote such an f.

However, even assuming a fully secure ASM-PH scheme E, we can still launch an efficient attack on Hu et al.’s solution.

Basic idea of Hu et al.’s solution. The data owner builds a multidimensional index I over his database D. Each node b in I has a set of entries, where each entry e has a key value v and a pointer p (to a child node of b). Note that v can represent any object. For example, in the case of a d-dimensional R-tree, v is a minimum bounding rectangle (MBR) and can be represented as (ℓ⃗, u⃗), where ℓ⃗ = {ℓ1, ..., ℓd} and u⃗ = {u1, ..., ud}, such that [ℓi, ui] is the extent of the MBR in the ith dimension. After I is generated, the data owner constructs a shadow index E(I), which is identical to I except that the key values (and only the key values) in all entries from all nodes are encrypted by an ASM-PH encryption function E. E(I) is published to all clients, and only the decryption function E⁻¹ of E is sent to the server.

During query processing, Hu et al.’s solution requires the client to encrypt the query q and traverse the shadow index E(I) locally with the help of the server. In particular, for each node b in E(I) visited by the client and for each entry e = (E(v), p) in b, the client computes E(mindist(q, v)) = mindist(E(q), E(v)) using the properties of the ASM-PH encryption E, where mindist(·) is the minimum distance between a query point q and an MBR v. Hu et al. showed that in the ith dimension (suppose that q = (q1, ..., qd)):

$$E(2 \cdot \mathrm{mindist}(q_i, [\ell_i, u_i])) = \mathrm{sign}(u_i - q_i)\,(E(u_i) - E(q_i)) + \mathrm{sign}(\ell_i - q_i)\,(E(\ell_i) - E(q_i)) - (E(u_i) - E(\ell_i)), \tag{2}$$

and

$$\mathrm{mindist}(E(q), E(v)) = \sum_{i=1}^{d} E^2(2 \cdot \mathrm{mindist}(q_i, [\ell_i, u_i])), \tag{3}$$

where sign(a − b) returns −1 if a < b and 1 otherwise.
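The identity behind (2) is easiest to see on plaintexts. The following Python sketch is our own check, not part of [15]: it drops the encryption E entirely and evaluates the right-hand sides of (2) and (3) on plain numbers. The function names are ours. It confirms that (2) yields twice the one-dimensional distance from q_i to the interval [ℓ_i, u_i], so the sum in (3) equals four times the squared minimum distance, which orders MBRs the same way as the distance itself.

```python
def sign(x):
    """sign(a - b) as defined in the text: -1 if a < b (i.e., x < 0), 1 otherwise."""
    return -1 if x < 0 else 1

def two_mindist_1d(q, lo, hi):
    """Right-hand side of Eq. (2), evaluated on plaintexts."""
    return sign(hi - q) * (hi - q) + sign(lo - q) * (lo - q) - (hi - lo)

def mindist_1d(q, lo, hi):
    """Reference value: distance from q to the interval [lo, hi]."""
    return max(lo - q, 0, q - hi)

def eq3(q, lows, highs):
    """Right-hand side of Eq. (3), evaluated on plaintexts."""
    return sum(two_mindist_1d(qi, l, u) ** 2 for qi, l, u in zip(q, lows, highs))

# Exhaustive check of Eq. (2) on a small integer range, including boundary cases.
ok_eq2 = all(
    two_mindist_1d(q, lo, hi) == 2 * mindist_1d(q, lo, hi)
    for lo in range(-3, 4)
    for hi in range(lo, 4)
    for q in range(-6, 7)
)

# A two-dimensional example: q is 2 units left of and 3 units above the MBR.
q, lows, highs = (0, 8), (2, 1), (6, 5)
squared = sum(mindist_1d(qi, l, u) ** 2 for qi, l, u in zip(q, lows, highs))
print(ok_eq2, eq3(q, lows, highs), 4 * squared)
```

The three cases q_i < ℓ_i, ℓ_i ≤ q_i ≤ u_i, and q_i > u_i each pick a different pair of signs, which is why the client cannot evaluate (2) without learning the two sign values.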

Note that in (2), only sign(ui − qi) and sign(ℓi − qi) are not known/computable by the client locally. To figure out their values, the client computes E(ui) − E(qi) = E(ui − qi) and E(ℓi) − E(qi) = E(ℓi − qi) locally and sends them to the server. Given E(ui − qi) and E(ℓi − qi), the server, with the decryption function E⁻¹, can easily tell the values of sign(ui − qi) and sign(ℓi − qi) and send them back to the client.

Now, knowing the values of sign(ui − qi) and sign(ℓi − qi), the client proceeds to compute E(2 · mindist(qi, [ℓi, ui])) by (2) and then derive mindist(E(q), E(v)) by (3). Finally, the client sends mindist(E(q), E(v)), which is equivalent to E(mindist(q, v)), to the server for decryption to figure out the actual distance between q and the MBR represented by v.

The client repeats this process for each entry e = (E(v), p) from an index node u, and chooses the proper child nodes to browse in the next level, following the standard NN search algorithm in an R-tree.

To prevent the server from knowing the exact values of ui − qi and ℓi − qi, the client subjects E(ui − qi) and E(ℓi − qi) to a scrambling process before sending them to the server.

Similarly, to prevent the client from knowing the exact value of mindist(q, v), the server subjects the decrypted value of mindist(E(q), E(v)) to a recoding process.

The details of the scrambling and recoding procedures are not important to our discussion here. The bottom line is that, in Hu et al.’s scheme, sign(ui − qi) and sign(ℓi − qi) must be computed by the server and sent to the client for each dimension i ∈ [1, d], so that the client can compute some distances locally to decide the next node(s) to visit. We construct our attack using this simple knowledge.

Our attack using chosen-plaintext attack. As said above, for any d-dimensional query point q and any encrypted value E(v) where v = {ℓ⃗, u⃗}, the server returns the values of sign(ui − qi) and sign(ℓi − qi) (−1 or 1) to the client (for all i ∈ [1, d]).

With this knowledge, however, the client can actually recover the value v using a chosen-plaintext attack. This means that the client can recover any point p in D, since p’s ciphertext E(p) is stored in the leaf-level entries of the shadow index E(I) at the client side. As a result, Hu et al.’s approach cannot conceal the database D from the client (which contradicts their security claims). Our attack works as follows.

Suppose that v = [ℓ⃗, u⃗]. Given E(v), the client can perform a binary search along the ith dimension to recover the value ℓi (or ui) in the ith dimension (i ∈ [1, d]). Specifically, if the ith dimension has a domain [a, b], then the client can start with a ciphertext E(x1) for a random x1 ∈ [a, b] (he can obtain any such pair (E(x), x) through an oracle in a chosen-plaintext attack), and then send E(x1) and E(ℓi) to the server and ask for the value of sign(x1, ℓi) (i.e., the sign of x1 − ℓi). If E(x1) = E(ℓi), the client terminates and outputs ℓi = x1. Otherwise, if sign(x1, ℓi) = −1, the client asks for another ciphertext E(x2) by choosing a random x2 ∈ (x1, b]; if sign(x1, ℓi) = 1, the client asks for another ciphertext E(x2) by choosing a random x2 ∈ [a, x1).

Clearly, the client only needs to repeat this procedure for log(b − a) steps in the worst case to recover the value ℓi. Note that this works even for non-discrete domains, as any real-valued coordinate still has a fixed level of precision. Hence, with no more than 2d log n steps (where n is the maximum domain size, or bits of precision, for any dimension), the client can fully recover the value v.

This attack shows that in Hu et al.’s method, the client can recover the plaintext of any point in D and any entry in the index I. Using similar techniques, we can also construct an attack where the data owner can recover any query q given only its encrypted version E(q). In summary, Hu et al.’s method [15] is not secure even if the E used is fully secure.
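The search can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the server's reply is modeled by a hypothetical `make_sign_oracle` that hides an integer coordinate and answers sign(x − ℓ) as defined above, and the attacker bisects deterministically at the midpoint rather than picking random probes. Encryption, scrambling, and recoding are omitted, since the attack relies only on the returned signs.

```python
def make_sign_oracle(secret):
    """Model of the server: returns -1 if x < secret and 1 otherwise.
    The call counter lets us count how many sign queries the attacker issues."""
    counter = {"calls": 0}

    def oracle(x):
        counter["calls"] += 1
        return -1 if x < secret else 1

    return oracle, counter

def recover_coordinate(oracle, a, b):
    """Recover an integer hidden in [a, b] using only sign answers."""
    lo, hi = a, b                    # invariant: the secret lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(mid) == -1:        # mid < secret, so search the upper half
            lo = mid + 1
        else:                        # mid >= secret, so search the lower half
            hi = mid
    return lo

# Recover both corners of a 2-dimensional MBR over the domain [0, 1023].
hidden_low, hidden_high = [17, 600], [250, 1023]
recovered, total_calls = [], 0
for coords in (hidden_low, hidden_high):
    row = []
    for secret in coords:
        oracle, counter = make_sign_oracle(secret)
        row.append(recover_coordinate(oracle, 0, 1023))
        total_calls += counter["calls"]
    recovered.append(row)

print(recovered, total_calls)
```

With a domain of 1024 values, each coordinate costs 10 sign queries, so the four coordinates of this MBR cost 40 queries in total, in line with the 2d log n bound.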

IV. HARDNESS OF THE SNN PROBLEM

Given the observation that none of the existing SNN methods guarantees security in a standard security model, one may wonder if the SNN problem is indeed hard. By hard, we mean that it is at least as hard as some other well-understood cryptographic problem that is known to have no efficient secure scheme in well-established security models.

To answer this question, we will show in this section that the SNN problem is at least as hard as the order-preserving encryption (OPE) problem [5], [6].

An order-preserving encryption E : P → X is an encryption that takes a plaintext in a domain P and outputs a ciphertext in a domain X, such that for any p1, p2 ∈ P, one can determine if p1 > p2 or p1 < p2 given only c1 = E(p1) and c2 = E(p2). (Note that deciding whether p1 = p2 can be reduced to checking whether both p1 > p2 and p1 < p2 are false.) In other words, one can understand an OPE encryption scheme as a set of functions {E, E⁻¹, op}, such that:

$$\mathrm{op}(c_1, c_2) = 1 \text{ iff } p_1 < p_2, \qquad \mathrm{op}(c_1, c_2) = -1 \text{ iff } p_1 > p_2, \tag{4}$$

where op(·) is a polynomial operation that does not involve (or have any knowledge of) the decryption function E⁻¹.

The concept of an OPE scheme was first proposed in the database community [1]. The solution, however, did not come with a rigorous analysis of its security guarantee. It was not until the efforts from the cryptography community [5], [6] that we understood the hardness of constructing a truly secure OPE scheme. Specifically, Boldyreva et al. [6] have shown that it is impossible to construct a secure OPE in standard security models (such as indistinguishability against chosen-plaintext attack, a.k.a. IND-CPA). Interested readers are referred to [5], [6] for details. The critical message from [5], [6] is as follows:

Theorem 1: [from [5], [6]] A truly secure OPE does not exist in standard security models, such as IND-CPA. It also does not exist even in much relaxed security models, such as the indistinguishability under ordered chosen-plaintext attack security model (a.k.a. IND-OCPA).

We can establish the hardness of the SNN problem using this negative result. We will show that, given E(q), designing an SNN method to find q’s (encrypted) exact nearest neighbor from an encrypted database E(D) is as hard as designing a secure OPE in standard security models. For convenience, let nn(q, D) denote the nearest neighbor of q in D.

The reduction. Suppose that we have a d-dimensional database D that contains N points p1, ..., pN, and an encryption scheme E that is secure under a standard security model M. We use E⁻¹ to denote the decryption function of E, and E(D) to denote the set {E(p1), ..., E(pN)}. Assume that we can construct a truly secure SNN method that is able to find exactly E(nn(q, D)) efficiently given E(q) and the encrypted database E(D) alone. We denote this polynomial method as B and formally define it as:

$$B(E(q), E(D)) \rightarrow E(nn(q, D)), \quad B \text{ does not use } E^{-1}. \tag{5}$$

If such a method B(·) exists, we can construct an OPE ℰ(·) in the same security model M (we write ℰ for the OPE to distinguish it from the underlying encryption scheme E). Our construction is as follows.

Suppose that the message space for the OPE ℰ(·) is the same as that of the encryption scheme E above, and is represented by N values {m1, ..., mN}, where, without loss of generality, m1 < m2 < ··· < mN−1 < mN. Our first step in constructing ℰ(·) is to map these values to a set of one-dimensional points D = {p1, ..., pN, pN+1} using a special random hash function h, such that for any i ∈ [1, N + 1], pi is a random value subject to the following constraints:

1) pi ∈ Z⁺ for i ∈ [1, N + 1];

2) h(mi) = pi for i ∈ [1, N];

3) pi < pj for any i, j ∈ [1, N + 1] iff i < j;

4) pi+1 − pi < pi − pi−1 for any i ∈ [2, N].

Lemma 1: For any message space {m1, ..., mN}, the above hash function h(·) guarantees that:

$$nn(h(m_i), D) = h(m_{i+1}) \text{ for any } i \in [1, N-1], \text{ and } nn(h(m_N), D) = p_{N+1}, \quad nn(p_{N+1}, D) = p_N = h(m_N). \tag{6}$$

Lemma 1 indicates that, for any two consecutive values mi and mi+1 in the message space of the OPE ℰ(·), the hash value of mi+1 is the NN of the hash value of mi. In addition, for the maximum element mN in the message space of ℰ(·), the NN of its hash value is the maximum element in the output space of h(·). Figure 1 shows an example of the hash function h(·).

We are now ready to present our OPE. For any mi from a message space {m1, ..., mN}, we define an encryption scheme ℰ (built on top of the underlying encryption scheme E):

$$\mathcal{E}(m_i) = E(h(m_i)) = E(p_i), \qquad \mathcal{E}^{-1}(c) = h^{-1}(E^{-1}(c)), \tag{7}$$

Fig. 1: The hash function h(·), mapping the messages m1, ..., m4 to points p1 < ··· < p4 in Z⁺ (followed by the extra point p5), with pi+1 − pi < pi − pi−1.

where h⁻¹(h(mi)) = mi. The secret in the OPE ℰ contains both the secret of the underlying scheme E and the random mapping h, which easily allows us to construct its decryption function ℰ⁻¹ as above. We have the following lemma about ℰ.

Lemma 2: For any i ∈ [1, N − 1] and any ciphertext ℰ(mi), we have B(ℰ(mi), E(D)) = ℰ(mi+1), B(ℰ(mN), E(D)) = E(pN+1), and B(E(pN+1), E(D)) = E(pN) = ℰ(mN).

By Lemma 2, if there exists a secure SNN method B(·) based on the encryption scheme E(·), then we can combine B(·) with ℰ(·) (i.e., an encryption scheme that combines E(·) and the hash function h(·)) to construct a secure approach for searching successors. In particular, given the ciphertext ℰ(mi) of any non-maximum element mi in the message space of ℰ(·), we can invoke B(·) to compute the ciphertext of mi+1, where mi+1 is the smallest element in the message space that is larger than mi.
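One way to realize the four constraints on h and to check Lemma 1 is sketched below in Python. This is only one possible construction, chosen by us for illustration: the gaps between consecutive points are drawn as distinct positive integers and sorted in decreasing order, which gives constraint 4 directly. The names `build_hash_points` and `nn` are ours, and `nn` excludes the query point itself, since every query in Lemma 1 is itself a member of D.

```python
import random

def build_hash_points(N, rng):
    """Return p_1 < ... < p_{N+1} in Z+ whose consecutive gaps strictly shrink."""
    # N distinct positive gaps, largest first: gap i is p_{i+1} - p_i.
    gaps = sorted(rng.sample(range(1, 10 * N + 1), N), reverse=True)
    points = [rng.randint(1, 100)]
    for g in gaps:
        points.append(points[-1] + g)
    return points

def nn(x, points):
    """Nearest neighbour of x among the other points (x itself is excluded)."""
    return min((y for y in points if y != x), key=lambda y: abs(y - x))

rng = random.Random(2013)
N = 8
p = build_hash_points(N, rng)      # p[0] is p_1, ..., p[N] is p_{N+1}

# Lemma 1: the NN of p_i is its successor p_{i+1} for i = 1..N,
# and the NN of the extra point p_{N+1} is p_N.
successor_ok = all(nn(p[i], p) == p[i + 1] for i in range(N))
bounce_ok = nn(p[N], p) == p[N - 1]
print(successor_ok, bounce_ok)
```

Because each point is strictly closer to its right neighbour than to its left one, repeated NN queries walk up the sorted order one step at a time, which is exactly the successor search described above.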

Algorithm 1: op(c = ℰ(mi), z = ℰ(mj))
1  i = traverse(c); /* see Algorithm 2 */
2  j = traverse(z); /* see Algorithm 2 */
3  if i < j then return 1; else return −1;

Algorithm 2: traverse(c)
1  set γ = 0;
2  let t = B(c, E(D)); t′ = c;
3  while B(t, E(D)) ≠ t′ do
4      let t′ = t and t = B(t, E(D));
5      γ = γ + 1;
6  return N − γ; /* B(·), E(D) are global inputs */

Next, we will show that ℰ(·) is an OPE, by proving that there exists an operator op(·) that satisfies Equation (4). Algorithm 1 presents the details of our formulation of op(·). Note that the algorithm only uses E(D) and B, without any knowledge about the secret of ℰ (which includes both the secret of E and the random mapping h). Algorithm 1 also uses Algorithm 2 as its building block. Given a ciphertext c = ℰ(mi), Algorithm 2 outputs the index value i for the plaintext mi. Hence, op(·) in Algorithm 1 simply compares the two index values of the two input ciphertexts.

Specifically, the idea in Algorithm 2 is as follows. By Lemma 2, B(c, E(D)) = B(ℰ(mi), E(D)) outputs ℰ(mi+1) (for i ∈ [1, N − 1]). Now, we can repeatedly apply B(·) γ times on its result until we hit ℰ(mN) = E(pN); clearly, i = N − γ. The only challenge left is: how can we check if we have reached ℰ(mN) = E(pN)? To explain how we address this, observe that, by Lemma 2, we have B(ℰ(mN), E(D)) = E(pN+1) and B(E(pN+1), E(D)) = E(pN) = ℰ(mN), i.e., once we hit ℰ(mN) for the first time, subsequent calls to B(·) on its result will bounce between E(pN+1) and ℰ(mN). Furthermore, among all possible ciphertexts ℰ(m1) = E(p1), ..., ℰ(mN) = E(pN), E(pN+1), this mini-loop only happens between ℰ(mN) and E(pN+1) (when we repeatedly apply B(·) to its own output).

Thus, if we record the current value of t as t′ before we set t = B(t, E(D)) in line 4 of Algorithm 2 (same idea in line 2), then when t becomes ℰ(mN), line 4 would first set t′ = t = ℰ(mN), and then set t = B(t, E(D)) = B(ℰ(mN), E(D)) = E(pN+1). After that, B(t = E(pN+1), E(D)) outputs ℰ(mN) = t′ again, in which case t and t′ would satisfy B(t, E(D)) = t′; this is precisely the condition that terminates the loop in line 3.

Finally, for c = ℰ(mi), it is easy to verify that this termination condition is first met when we have iterated t through ℰ(mi+1), ..., and ℰ(mN) exactly once (by the mini-loop observation above). Thus, the running time of Algorithm 2 is at most O(NZ), where Z is the running time of the SNN method B(·). This indicates that the cost of Algorithm 1 for op is also O(NZ).
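Algorithms 1 and 2 can be simulated end to end with a toy stand-in for the hypothetical SNN method B. In the Python sketch below, everything inside `make_toy_setup` is an assumption made for illustration: "encryption" is a random bijection from points to opaque integer tokens, the hash points use the fixed gaps N, N − 1, ..., 1, and B cheats by looking up plaintexts internally, standing in for a method that would work on ciphertexts alone. Only `traverse` and `op` mirror the paper's algorithms, and they touch nothing but tokens, B, and N.

```python
import random

def make_toy_setup(N, rng):
    """Build the points p_1..p_{N+1}, a toy token 'encryption', and the oracle B."""
    points = [1]
    for gap in range(N, 0, -1):              # strictly shrinking gaps N, N-1, ..., 1
        points.append(points[-1] + gap)
    tokens = rng.sample(range(10**6), N + 1)  # opaque, unordered ciphertext stand-ins
    enc = dict(zip(points, tokens))
    dec = {t: x for x, t in enc.items()}

    def B(c):
        """Stand-in for B(c, E(D)): returns the token of the NN of c's plaintext."""
        x = dec[c]
        y = min((z for z in points if z != x), key=lambda z: abs(z - x))
        return enc[y]

    ope_ciphertexts = [enc[x] for x in points[:N]]   # tokens of m_1, ..., m_N
    return B, ope_ciphertexts

def traverse(c, B, N):
    """Algorithm 2: recover the index i of the plaintext m_i behind ciphertext c."""
    gamma = 0
    t, t_prev = B(c), c
    while B(t) != t_prev:                    # stops once t and t_prev bounce back and forth
        t_prev, t = t, B(t)
        gamma += 1
    return N - gamma

def op(c, z, B, N):
    """Algorithm 1: 1 if the plaintext of c is smaller than that of z, else -1."""
    return 1 if traverse(c, B, N) < traverse(z, B, N) else -1

rng = random.Random(5)
N = 6
B, cts = make_toy_setup(N, rng)
indices = [traverse(c, B, N) for c in cts]
print(indices, op(cts[1], cts[4], B, N), op(cts[4], cts[1], B, N))
```

The recovered indices are 1, ..., N in order, so op orders the opaque tokens correctly even though the tokens themselves carry no order information; each call to `traverse` invokes B a number of times linear in N, matching the O(NZ) bound.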

Theorem 2: Let {m1, ..., mN} be any message space with mi < mj if i < j. Let ℰ(mi) = E(h(mi)) and ℰ⁻¹ be as defined in (7), using the hash function h and the secure encryption scheme E from the SNN method B(·). Define op(·) as shown in Algorithm 1. Then (ℰ, ℰ⁻¹, op) is an OPE scheme that is secure in any security model M in which E is secure.

Finally, by Theorems 1 and 2, our conclusion is:

Theorem 3: It is impossible to construct a secure SNN method B(·) satisfying (5) in standard security models, such as IND-CPA. It is not even possible to construct such a B(·) in much relaxed security models such as IND-OCPA.

V. PARTITION BASED SNN METHOD

As shown in Theorem 3, given only an encrypted NN query E(q) and an encrypted database E(D), there is no SNN method that can pinpoint the NN of q in D. To circumvent this impossibility result, a natural idea is to devise an SNN method that does not exactly retrieve the NN of q. For example, the server may answer an SNN query by returning the encrypted database E(D) as a whole to the client, after which the client can decrypt E(D) and compute the answer to the query locally. This naive approach is clearly secure (as secure as E), but it is highly inefficient as it requires transferring E(D) to the client. To improve on this, we propose to partition D into small groups and store the encrypted version of each group on the server, such that any SNN query can be answered by returning one encrypted group instead of the whole encrypted database.

Specifically, our partitioning of D is based on the Voronoi diagram of D, as explained in the following.

A. Secure Voronoi Diagram

Given a multi-dimensional point database D with |D| = N, a Voronoi diagram of D is a decomposition of the space Ω in which the points in D are defined. The diagram consists of N disjoint Voronoi cells, each of which is associated with a point in D (referred to as the owner of the cell). If a point p is the owner of a cell c, then c equals the union of all points in Ω that are closer to p than to any other point in D. Thus, if a query point q falls in c, then its nearest neighbor in D is p. For example, Figure 2 illustrates the Voronoi diagram of a database with 16 two-dimensional points.

The Voronoi diagram of points in D can induce a partition of D for SNN. Specifically, we can impose a square grid on the Voronoi diagram, and then construct an overlapping partition of D, such that each element of the partition (i) corresponds to a grid cell B and (ii) consists of the owners of all Voronoi cells that overlap with B. For example, in Figure 2, the partition element G1 (i) corresponds to the grid cell B1, and (ii) consists of ten points (i.e., p1, p2, p3, p4, p5, p6, p7, p8, and p10), since B1 overlaps with the Voronoi cells owned by those ten points. Observe that, if a query point falls in a grid cell Bi, then its nearest neighbor must be a point in the partition element Gi associated with Bi.
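As a minimal illustrative sketch (not the implementation used in the experiments, which relies on Qhull), the partition element of a single grid cell can be computed without materializing the full Voronoi diagram: the Voronoi cell of p restricted to a grid cell B is B clipped by the half-plane "closer to p than to q" for every other point q. The function names below are hypothetical, the cost is quadratic in N per grid cell, and the sketch treats "overlap" as an intersection of positive area, which is an assumption.

```python
def clip_halfplane(poly, a, b, c):
    """Keep the part of the convex polygon `poly` where a*x + b*y <= c."""
    out = []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        fp = a * p[0] + b * p[1] - c
        fq = a * q[0] + b * q[1] - c
        if fp <= 0:
            out.append(p)                      # p lies inside the half-plane
        if (fp < 0 < fq) or (fq < 0 < fp):     # edge p->q strictly crosses the line
            t = fp / (fp - fq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def polygon_area(poly):
    """Shoelace formula for a simple polygon."""
    s = 0.0
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def partition_element(points, box, eps=1e-12):
    """Indices of the points whose Voronoi cells overlap box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    owners = []
    for i, p in enumerate(points):
        poly = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
        for j, q in enumerate(points):
            if i == j:
                continue
            # |x - p|^2 <= |x - q|^2  <=>  2(q - p).x <= |q|^2 - |p|^2
            a = 2 * (q[0] - p[0])
            b = 2 * (q[1] - p[1])
            c = q[0] ** 2 + q[1] ** 2 - p[0] ** 2 - p[1] ** 2
            poly = clip_halfplane(poly, a, b, c)
            if len(poly) < 3:
                break                          # nothing of the cell is left inside the box
        if len(poly) >= 3 and polygon_area(poly) > eps:
            owners.append(i)
    return owners
```

For four points at the centers of the quadrants of the unit square, a grid cell covering the lower-left quadrant yields only that quadrant's point, whereas a small grid cell centered at (0.5, 0.5) yields all four, which is why the partition elements overlap.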

Given the aforementioned partition of D, the data owner pads each partition element to the same size, and then encrypts them separately (with the same key) and gives each element a random identifier. (The padding procedure ensures that the encrypted partition elements cannot be distinguished by their sizes.) After that, the data owner sends all encrypted partition elements and their associated identifiers to the server, and informs the client about the description of the square grid and the identifier of each partition element. Whenever the client has an SNN query q, she first identifies the grid cell B that contains q, and then retrieves the identifier i of the partition element that corresponds to B. Then, the client asks the server to return the encrypted partition element whose identifier equals i (notice that this partition element must contain the nearest neighbor of q). Upon receiving the partition element, the client decrypts it and computes the answer to q locally. Intuitively, this SNN method is secure, as it allows the server to learn nothing but the identifier of the returned encrypted partition, which is randomly generated. We provide the formal security proof in Section V-C.
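The workflow above can be sketched as follows. This is only an illustration under stated assumptions: `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins for the scheme E (a hash-based keystream that is not a vetted cipher; a real deployment would call an established encryption library), the fixed-width point encoding is our own choice, and the server is modeled as a dictionary keyed by identifier.

```python
import hashlib
import math
import secrets
import struct


def _keystream(key, nonce, n):
    """Hypothetical keystream (SHA-256 in counter mode); for illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def toy_encrypt(key, plaintext):
    """Stand-in for E: fresh nonce, then XOR with the keystream."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def toy_decrypt(key, ciphertext):
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def encode_element(points, smax):
    """Count header, 16 bytes per 2-d point, random padding up to smax points."""
    body = b"".join(struct.pack("<dd", x, y) for x, y in points)
    padding = secrets.token_bytes(16 * (smax - len(points)))
    return struct.pack("<I", len(points)) + body + padding


def decode_element(blob):
    (n,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<dd", blob, 4 + 16 * i) for i in range(n)]


def owner_setup(key, elements):
    """Data owner: pad, encrypt, and label each partition element.

    Returns (server_store, cell_to_id); cell_to_id goes to the client only.
    """
    smax = max(len(g) for g in elements)
    server_store, cell_to_id = {}, []
    for g in elements:
        ident = secrets.token_hex(8)           # random identifier
        server_store[ident] = toy_encrypt(key, encode_element(g, smax))
        cell_to_id.append(ident)
    return server_store, cell_to_id


def client_query(key, q, cell_of, cell_to_id, server_store):
    """Client: map q to its grid cell, fetch by identifier, finish locally."""
    ident = cell_to_id[cell_of(q)]
    blob = server_store[ident]                 # the server only ever sees `ident`
    points = decode_element(toy_decrypt(key, blob))
    return min(points, key=lambda p: math.dist(p, q))
```

Because every element is padded to smax points before encryption, all ciphertexts stored at the server have the same length, so their sizes reveal nothing about how many real points each element holds.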

The above partition scheme, albeit simple and secure, incurs significant space overhead (for the server) and communication cost (for the client). To understand this, recall that each partition element needs to be padded to the same size before encryption. Let smax be the number of points in the largest partition element. Then, after padding, the size of each partition element is smax. Assuming that the size of the encryption of a message depends (roughly) linearly, or in a stair-case fashion, on the size of the message, which is the case for most encryption functions, the server requires O(k · smax) space to store all encrypted partition elements (where k denotes the total number of partition elements), and the client pays O(smax) communication cost for each SNN query. Ideally, we would prefer a partition scheme that ensures smax = N/k, in which case both the server's space overhead and the client's communication cost are minimized. This, however, would require that each partition element contains an equal number of points, which is rather unlikely under the square grid partition scheme. Specifically, as real datasets are often skewed, some cells in a square grid may intersect a significantly larger number of Voronoi cells than other grid cells. This leads to a large smax, which results in significant space and communication overheads. To remedy this deficiency, we propose improved schemes that adaptively partition the data space to minimize smax, as will be discussed in Section V-B.

Fig. 2: The SVD method. (Shown: points p1, . . . , p16, grid cells B1 and B2, and partition elements G1 and G2.)

Fig. 3: Different partitioning schemes, k = 4. Panels: (a) Square Grid: SG; (b) MinSG; (c) MinMax.

Fig. 4: Partition Scheme for One-dimensional Data. (Shown: points p1, . . . , p4 with Voronoi cells vc1, . . . , vc4 and boundary σ between vc2 and vc3; k = 2, G1 = {p1, p2}, G2 = {p3, p4}, B1 = {−∞, σ}, B2 = {σ, +∞}.)

Remark: The partition scheme of imposing a square grid over the Voronoi diagram was also used by Ghinita et al. [10] in the context of private information retrieval (PIR). We will discuss the difference between SNN and PIR in Section VII. We dub this scheme the SG (Square Grid) partition.

B. Improved Partition Schemes

In what follows, we present partition schemes that aim to minimize smax, i.e., the number of points in the largest partition element. We focus on the case when D is one- or two-dimensional, as (i) SNN queries are particularly important for one- or two-dimensional spatial data, and (ii) constructing Voronoi diagrams is computationally challenging on data with dimensionality over 2 (as each Voronoi cell would become a complex polytope).

1) One Dimension: In one dimension, there is an optimal partition scheme that generates disjoint and equal-size partition elements. To explain this, observe that the Voronoi cells of any one-dimensional dataset are one-dimensional intervals, such that (i) any two intervals are disjoint, and (ii) the union of all intervals equals the data space. For example, Figure 4 shows the Voronoi cells (vc1, . . . , vc4) of a one-dimensional dataset that contains four points p1, p2, p3, p4.

Assume that we aim to generate a partition of D with k elements. Then, it suffices to divide the data space into k disjoint intervals, such that each interval contains N/k Voronoi cells of D (where N is the number of points in D). The owners of the Voronoi cells in each interval can be taken as a partition element, which leads to an optimal partition scheme where smax = N/k. For instance, if we are to generate a partition with two elements on the dataset in Figure 4, then we can put p1, p2 into one partition element, and p3, p4 into the other.
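The one-dimensional scheme can be sketched in a few lines. The sketch assumes distinct points and uses ⌈N/k⌉ points per element when k does not divide N (in which case fewer than k elements may result); the function names are hypothetical. It relies on the fact that, in one dimension, the boundary between two adjacent Voronoi cells is the midpoint of their owners.

```python
from bisect import bisect_right


def partition_1d(points, k):
    """Split sorted 1-d points into consecutive groups of ceil(N/k) points.

    Returns (groups, cuts), where cuts[i] is the interval boundary between
    groups[i] and groups[i + 1].
    """
    pts = sorted(points)
    size = -(-len(pts) // k)                   # ceil(N / k)
    groups = [pts[i:i + size] for i in range(0, len(pts), size)]
    # The Voronoi boundary between neighbours a and b is their midpoint.
    cuts = [(g[-1] + h[0]) / 2 for g, h in zip(groups, groups[1:])]
    return groups, cuts


def locate(cuts, q):
    """Index of the interval (and hence of the partition element) containing q."""
    return bisect_right(cuts, q)
```

For example, with the four points 1, 2, 4, 7 and k = 2, the groups are {1, 2} and {4, 7} and the single boundary is σ = 3, mirroring the layout of Figure 4; any query below 3 is answered from the first group and any query above 3 from the second.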

2) Two Dimensional Data: The case for two-dimensional data is much more complicated than the one-dimensional case, which renders it difficult to compute an optimal solution efficiently. This motivates us to derive two partition schemes based on heuristics, as will be explained in the following.

The MinSG scheme. This partition scheme adopts a greedy approach to generate a grid where the sizes of cells vary in a manner that adapts to the distribution of points in D. We start with a grid that contains only one cell Ω that covers the entire data space. Then, we iteratively cut the grid cells into smaller ones using either horizontal (or vertical) lines through [−∞, +∞] along the x-axis (or the y-axis). The process is repeated until the desired number of grid cells is reached. After that, we generate the partition by identifying the owners of the Voronoi cells that intersect with each grid cell (as with the case of a square grid in the SG method).

The effectiveness of the above algorithm relies on how we choose the horizontal (or vertical) line to cut the grid cells. As we aim to minimize smax, we propose to choose a line cutting through the grid cell Bmax that determines smax, i.e., the grid cell that intersects the largest number of Voronoi cells. The intuition is that, when Bmax is split into two smaller cells Bα and Bβ, each of the two cells may intersect a smaller number of Voronoi cells, which leads to a decrease of smax. To maximize the decrease of smax, we should minimize max {|sα|, |sβ|}, where |sα| (|sβ|) denotes the number of Voronoi cells that intersect Bα (Bβ). Choosing a line that minimizes max {|sα|, |sβ|}, however, is challenging given that there exists an infinite number of lines that cut through Bmax. To remedy this problem, we make the following key observation.

Lemma 3: max {|sα|, |sβ|} is minimized only if the cutting line passes through (i) a vertex of a Voronoi cell or (ii) the intersection point between the boundary of Bmax and a face of a Voronoi cell.

Let [xl, xu] and [yl, yu] be the projections of Bmax onto the x- and y-axes, respectively. Let V1 be a set that contains any vertex of any Voronoi cell with its x-coordinate in [xl, xu] or its y-coordinate in [yl, yu]. Let V2 be a set that contains any intersection point between the boundary of Bmax and a face of any Voronoi cell. By Lemma 3, we can identify the cutting line that minimizes max {|sα|, |sβ|}, by (i) inspecting all horizontal or vertical lines that pass through the points in V1 ∪ V2 and (ii) choosing the line that leads to the smallest max {|sα|, |sβ|}.
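The selection of a vertical cut can be sketched as below (the horizontal case is symmetric). The sketch makes a simplifying assumption that is not spelled out in the text: each Voronoi cell that overlaps Bmax has already been clipped to Bmax and is summarized by the x-extent [lo, hi] of the clipped polygon. Since the clipped polygon is convex, a vertical line x = c leaves part of it on the left iff lo < c and part of it on the right iff hi > c. The endpoints of these extents are x-coordinates of Voronoi vertices, of intersections between Voronoi faces and the boundary of Bmax, or of corners of Bmax (which are skipped), so they are among the candidate positions suggested by Lemma 3. The function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def best_vertical_cut(extents, xl, xu):
    """Pick x = c minimizing max(|s_alpha|, |s_beta|) for a vertical cut of Bmax.

    extents: list of (lo, hi) x-ranges of the Voronoi cells clipped to Bmax.
    Returns (c, score), or None if no candidate lies strictly inside (xl, xu).
    """
    los = sorted(lo for lo, _ in extents)
    his = sorted(hi for _, hi in extents)
    candidates = sorted({x for e in extents for x in e if xl < x < xu})
    best = None
    for c in candidates:
        left = bisect_left(los, c)                 # cells with lo < c reach the left part
        right = len(his) - bisect_right(his, c)    # cells with hi > c reach the right part
        score = max(left, right)
        if best is None or score < best[1]:
            best = (c, score)
    return best
```

With the two sorted arrays, each candidate is evaluated in logarithmic time, so a single grid cell is processed in O(n log n) time for n overlapping Voronoi cells; this is a convenience of the sketch rather than a claim about the cost analysis of the scheme itself.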

As mentioned, the above splitting procedure is applied iteratively based on the grid cell that intersects the largest number of Voronoi cells, until the number of grid cells reaches a pre-defined threshold k. Since each cutting line that we use spans the whole x- or y-axis, it may split more than one grid cell, and hence, when the algorithm terminates, the total number of grid cells may be more than k. Whenever this happens, we iteratively merge some neighboring cells without affecting the size of the largest partition, until only k cells are left. We omit such technical details for brevity.

We refer to the aforementioned partition scheme as the MINSG (Minimum Space Grid) method. For example, Figure 3(b) shows the result of applying MINSG to the same dataset as in Figure 3(a).

It can be verified that the number of iterations performed by MINSG is no more than k − 1 (since each iteration increases the total number of grid cells by at least one). Furthermore, the number of cutting lines that MINSG needs to inspect in each iteration is O(N), since (i) the total number of vertices in a Voronoi diagram is at most 2N − 5 (for N ≥ 3), i.e., |V1| ≤ 2N − 5 (see Theorem 7.3, [8]); and (ii) each Voronoi cell, being a convex polygon, can intersect the boundary of a grid cell at no more than 8 points, i.e., |V2| ≤ 8N. For each candidate cutting line ℓ, we need to identify the Voronoi cells intersecting ℓ and to incrementally update the number of Voronoi cells that overlap with each grid cell, which incurs O(N) cost in the worst case. Therefore, the time complexity of MINSG is O(kN²). In practice, however, the running time of MINSG is often not quadratic in N, since the sizes of V1 and V2 are often much smaller than N.

The MinMax scheme. The MINSG method tries to minimize the maximum element size in the partition, by iteratively splitting the largest partition element. Nevertheless, the split is induced by a horizontal or vertical line that spans the whole x- or y-axis, which often incurs unnecessary splits of other partition elements. For example, consider the grid partition in Figure 3(b), where the grid cell B1 intersects the largest number of Voronoi cells. If we are to split B1 with a vertical line ℓ, then the cell B2 would be split into two smaller cells by ℓ as well, even though the split of B2 does not decrease the maximum element size in the induced partition.

To address this problem, we propose to improve MINSG by splitting, in each iteration, nothing but the grid cell that induces the largest partition element. Specifically, in an iteration where the largest partition element is induced by grid cell Bmax, we split Bmax using a horizontal (vertical) line segment ℓ′ whose projection on the x-axis (y-axis) equals Bmax's projection on the same axis. Furthermore, the line segment is selected among those that pass through (i) a vertex of a Voronoi cell or (ii) the intersection point between a face of a Voronoi cell and the boundary of Bmax; the correctness of this approach can be shown by a result similar to Lemma 3.

The rest of the algorithm is the same as in MINSG. We refer to this improved method as the MINMAX (Minimum Max partition) method. Figure 3(c) shows the results of applying MINMAX on the same dataset in Figure 3(a).

Observe that each iteration of MINMAX increases the number of grid cells by one. Thus, the total number of iterations performed by MINMAX is (k − 1). Each iteration inspects O(N) cutting line segments, and examines O(N) Voronoi cells for each line segment. Therefore, the time complexity of MINMAX is O(kN²), as with the case of MINSG. But, for similar reasons, its running time in practice is much better.

Remark. No matter which partition scheme is used, the data owner does not send the k partition elements, which have a total size of N points, to the client. Instead, he only needs to send the description of the grid constructed, and the associated (randomly generated) identifier for each grid cell, which only has a size of O(k) for all three methods, SG, MINSG, and MINMAX.

C. Security Analysis

No matter which partition scheme is used, the SVD method only releases to the server the encryptions of all partition elements, produced with an encryption scheme E that is proven secure in a standard security model M, and their associated (randomly generated) identifiers. Furthermore, during query processing, the client sends to the server only the identifier of the partition element whose corresponding grid cell contains the query point q. Thus, the server learns nothing but an id number (randomly generated by the data owner for each partition element) in answering a query. Formally:

Theorem 4: If E is a secure encryption scheme in a standard security model M, e.g., indistinguishability under chosen plaintext attack (IND-CPA), then the SVD method is as secure as E in the same model M with respect to a single query.

Further improvement of security. The above analysis considers a powerful adversary that is able to obtain a large number of plaintext-ciphertext pairs (in either the chosen plaintext attack model or the chosen ciphertext attack model), but implicitly assumes that the adversary has no knowledge of how the user's queries are distributed. When this assumption does not hold, the adversary may infer information by exploiting the correlation between different encrypted queries from the user.

For example, assume that the adversary knows in advance that most of the queries from the user would have a query point in a region R. Then, the adversary can keep track of the encrypted partition element retrieved by each user query over a long period, after which the adversary can link R to the partition element that is retrieved most frequently.
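The linking attack in this example can be simulated with a short script. All parameters here are hypothetical (a 5 × 5 grid, a region R that coincides with one grid cell, and 60% of the queries falling in R); the point is only that the server's access log, which contains nothing but identifiers, is enough to single out the element associated with R.

```python
import random
from collections import Counter


def simulate_linking(num_queries=10_000, grid=5, hot_cell=(1, 3), bias=0.6, seed=7):
    """Return True if the most frequently requested identifier is the one for R."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(grid) for j in range(grid)]

    # Data owner: assign identifiers at random; the adversary never sees this map.
    ids = list(range(len(cells)))
    rng.shuffle(ids)
    cell_to_id = dict(zip(cells, ids))

    # Server-side log: only the requested identifiers are observed.
    log = Counter()
    for _ in range(num_queries):
        cell = hot_cell if rng.random() < bias else rng.choice(cells)
        log[cell_to_id[cell]] += 1

    guessed_id, _ = log.most_common(1)[0]
    return guessed_id == cell_to_id[hot_cell]
```

The random identifiers hide which grid cell an element belongs to, but not how often each element is fetched; the skew in the log is exactly what the adversary exploits.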

To guard against such attacks, we can extend our solution by adopting private information retrieval (PIR) techniques [10]. In particular, for any SNN query q, the user can first identify the encrypted partition element that contains the answer to q; and then, the user can invoke a PIR protocol to retrieve the encrypted partition element from the server, without allowing the server to learn which partition element is returned. But notice that existing PIR techniques either incur considerable overhead in query processing [10] or require specialized secure hardware that is expensive [17], i.e., the improved security guarantee would come at a cost.

Fig. 5: The attack on the solution by Wong et al. [21]. Panels: (a) Vary |D|; (b) Vary d; (c) Vary d′, d = 30. Y-axis: time to attack (secs).

Fig. 6: The attack on the solution by Hu et al. [15], varying |D|.

Return a partition element vs. return nn(q, D). One may wonder if the SVD method “leaks” too much information to the client about the database D, by returning a partition element (i.e., a number of points) for each query. We argue that this should not be a concern at all: if one allows the client to learn the nearest neighbor of a query point of her choice (by any SNN method), the client can efficiently reconstruct the entire database D anyway, i.e., there is no way of hiding the database D from the client if one wants to support the client's nearest neighbor queries. We show why this is the case in the appendix of our online technical report [22].

VI. EXPERIMENT

We implemented all methods in C++. We used the Qhull library to compute the Voronoi diagram of a dataset D, and the latest Crypto++ library for any standard encryption schemes. All experiments were performed on a Linux machine with an Intel Xeon 3.07GHz CPU and 12GB memory.

Datasets. When investigating our attacks on existing schemes in different dimensions, we generate random points following either a random cluster or a uniform distribution. By the construction of our attacks, the distribution of the points does not affect their efficiency.

For SVD methods, we focus on two dimensions with two real datasets, which are points of interest in California (CA) and Texas (TX) from the OpenStreetMap project. In each dataset, we randomly select 2 million points to create the largest dataset Dmax and form smaller datasets based on Dmax. We also make sure that smaller datasets are always subsets of larger datasets, in order to isolate the impact of |D|.

Setup. In the investigation of our attacks, since the distribution of the data does not introduce a noticeable difference, our default distribution is uniform. In the scheme from Wong et al. [21], we set the default values of d and d′ as d = 30 and d′ = 60. In the scheme from Hu et al. [15], we only report the results for d = 2 for brevity.

In the study of the SVD method, the default values are: |D| = 10⁶ and k = 625 (number of partition elements). The default dataset is CA. By default, we used the AES encryption scheme with a key size of 256 bits and a block size of 256 bits. The trends from other secure encryption schemes are similar. Any secure public-key or symmetric-key encryption scheme can be used to construct an SVD method.

In all experiments, unless otherwise specified, when we vary the value of one parameter in concern, we keep all other parameters at their default values.

A. Attacks on Existing Methods

We first studied the efficiency of our attack on the SNN method by Wong et al. [21], and the results are given in Figure 5. As shown in Section III-A, our attack only needs d query points and their encryptions to form a linear system of d linear equations with d unknowns to solve for one data point p ∈ D. Thus, the cost of our attack should be linear in both |D| and d, which is reflected in Figures 5(a) and 5(b). Our attack is extremely efficient. For example, Figure 5(a) shows that it takes less than 23 minutes to recover a database with 2 million points in 30 dimensions and also with 30 artificial dimensions.

Lastly, the number of artificial dimensions d′ introduced in the solution of Wong et al. [21] has little effect on our attack, as seen in Figure 5(c). When d′ goes from 30 up to 80, the overhead added to the overall attack time is less than 40 seconds, out of a total attack time of more than 600 seconds.

Next, focusing on d = 2 and a domain size of 10⁹ in both dimensions, Figure 6 shows that our attack on the solution by Hu et al. [15] is extremely efficient when we vary |D|. For a database with 2 million points, each with 1 billion possible values in either dimension, our attack takes less than 20 seconds to recover all 2 million points! Our attack is linear in both d and |D|, since its cost to recover one point in D is only O(d log n), where n is the maximum domain size in any one dimension, according to our analysis in Section III-B.

Finally, both attacks can be easily parallelized, as they recover each point in D independently.

B. Evaluation of the SVD Method

We next evaluate the efficiency of the new SVD method, when we instantiate it with three different partitioning schemes introduced in Section V. To differentiate them, we simply refer to the resulting SVD methods by the names of the partitioning methods used.

Preprocessing cost. For the data owner, there are two major steps: partition and encryption. Both are mainly affected by the number of grid cells k (which is also the number of partition elements) and the size of the database. Figure 7(a) shows that the partition costs of both MINSG and MINMAX increase linearly with k; however, the cost of MINSG increases much faster than that of MINMAX, which increases very slowly (almost unnoticeably). This is due to the optimization in MINMAX (compared to the cutting lines in MINSG) that limits any cutting line to be strictly within the cell to split.

Fig. 7: Running time of the partition phase. Panels: (a) vary k; (b) vary |D|. Y-axis: partition time (secs). Methods: SG, MinSG, MinMax.

Fig. 8: Partition size in different methods. Panels: (a) vary k; (b) vary |D|. Y-axis: avg partition size (bytes), log scale. Methods: SG, MinSG, MinMax.

The partition cost of SG is constant regardless of k, since it computes the unit length of each cell in constant time once k is given. MINMAX is extremely efficient; its partition cost is almost as good as the partition cost of SG. It produces 1,225 partitions for 1 million points in just 20 seconds.

Next, Figure 7(b) shows the partition cost when we vary the size of the database from 0.2 million to 2 million. The partition cost of SG clearly should be linear in N = |D| once it has figured out the unit length of each cell, which is observed in Figure 7(b). Interestingly, even though our analysis in Section V-B indicates that the worst-case complexity of both MINSG and MINMAX is O(kN²), in practice both their costs are in fact only linear in N. This is because the bad case in our analysis, where a cutting line intersects with all O(N) Voronoi cells, is almost impossible in practice. Instead, for both MINSG and MINMAX, in any step a cutting line typically only intersects with a constant number of Voronoi cells. This means that both their costs should be only O(kN) (for a maximum of k possible steps). Also, a cutting line in MINSG is clearly expected to intersect with more Voronoi cells than one in MINMAX in any step. Hence, we also see a higher cost in MINSG than in MINMAX, and a faster pace of increase in cost when |D| increases. Figure 7(b) indicates that MINMAX is almost as efficient and scalable as SG. When |D| changes from 0.2 to 2 million points, the partition cost of MINMAX only increases from 5 seconds to 40 seconds.

We then examine the sizes of the partition elements from different methods before applying the random padding operation. Since the random padding operation “inflates” every partition element with random bytes to the size smax of the maximum partition element, two values are critical: smax, which decides the storage cost at the server and the communication cost of every query, and the variance of the sizes of partition elements, which decides the overhead of the total “inflation”. Figure 8 plots the average size of partition elements, with smax and smin (the size of the minimum partition element) as the error bars, w.r.t. k and |D|.

Fig. 9: Total running time of the preprocessing step. Panels: (a) vary k; (b) vary |D|. Y-axis: total running time (secs), log scale. Methods: SG, MinSG, MinMax, Send-D.

All methods produce partitions that have a similar average size. However, SG always has the largest values in both smax and the variance, due to its complete ignorance of the data distribution. MINSG does significantly reduce the value of smax, which is expected since its design is to greedily split the partition element with the maximum size. However, following our analysis in Section V-B, due to the long extent of its cutting lines (always from [−∞, +∞]), its splitting method will potentially lead to partitions of very small size, as observed in Figure 8. The design of MINMAX makes further improvement to MINSG, so it further reduces smax considerably. It also eliminates very small partition elements by limiting a cutting line to be only within the extent of the cell to split (the cell for the current maximum partition element). Hence, MINMAX always produces a balanced partition. As a result, its smax is very close to the average size of all partition elements and the variance is also small, as seen in Figure 8.

Finally, not surprisingly, for all partition methods, the values for both the average size and smax decrease when more grid cells are used (Figure 8(a)), and increase when D grows (Figure 8(b)). Nevertheless, MINMAX always produces highly balanced partitions with the smallest smax values (much smaller than those from the other two methods).

Figure 9 reports the total running time for the preprocessing step, which includes the costs of both the partition and the encryption (as well as the Voronoi diagram construction and the random padding operation, whose costs are small compared to partition and encryption costs). For reference, we also include the preprocessing cost from the naive method Send-D (see the first paragraph in Section V), whose preprocessing cost is to encrypt D in its entirety as one message.

What is interesting to observe is that, even though SG is the fastest in producing its partitions, it becomes the slowest method overall. This is explained by the fact that partitions produced by SG suffer from the largest variance and the largest smax value. After the random padding operation, all partitions share the same size as the maximum partition element. Hence, the value of smax decides the total encryption cost when there is an equal number of partitions, and the variance in partition size decides the overhead introduced to the encryption step by the “inflation” of random bytes. Given the results in Figure 8, it is not surprising to see that SG becomes the slowest method, and MINMAX becomes the fastest method (especially when MINMAX also enjoys a partition cost that is almost as good as the partition cost in SG, see Figure 7).
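The effect of smax and of the size variance on the amount of data to encrypt can be illustrated with a small calculation. The partition sizes below are made-up numbers chosen only to contrast a balanced partition with a skewed one; they are not measurements from our experiments, and the helper name is hypothetical.

```python
def padding_cost(sizes):
    """Summarize the storage implied by padding every element to smax.

    sizes: number of points in each partition element before padding.
    """
    k = len(sizes)
    smax = max(sizes)
    total = sum(sizes)
    return {
        "smax": smax,                    # per-query communication cost (in points)
        "stored": k * smax,              # points encrypted and stored after padding
        "padding": k * smax - total,     # dummy content added by the "inflation"
    }


balanced = padding_cost([260, 250, 240, 250])   # low variance, MINMAX-like
skewed = padding_cost([700, 100, 100, 100])     # high variance, SG-like on skewed data
```

Both inputs hold 1,000 points in total across k = 4 elements, yet the skewed partition must encrypt 2,800 padded points against 1,040 for the balanced one, which is consistent with SG being the slowest method overall despite its cheap partitioning step.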