K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free

(1)

K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free

Bin Yao, Feifei Li, Piyush Kumar

Computer Science Department, Florida State University, Tallahassee, FL, U.S.A.

{yao, lifeifei, piyush}@cs.fsu.edu

Abstract— Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize. We design algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with a constant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.

I. INTRODUCTION

Thek-Nearest Neighbor query (kNN) is a classical problem that has been extensively studied, due to its many important applications, such as spatial databases, pattern recognition, DNA sequencing and many others. A more general version is the kNN-Join problem [7], [8], [11], [31], [33]: Given a data setP and a query set Q, for each point q ∈ Q we would like to retrieve itsk nearest neighbors from points in P .

Previous work has concentrated on the use of spatial databases or stand-alone systems. In these solution method- ologies, changes to the database engine maybe necessary; for example, new index structures or novel algorithms need to be incorporated into the engine. This requirement poses a limitation when the data set is stored in a database in which neither the spatial indices (such as the popular R-tree) nor the kNN-Join algorithms are available. Another limitation of the existing approaches that are outside the relational database, is the lack of support for query optimization, when additional query conditions are specified. Consider the query:

Retrieve the k nearest restaurants of q with both Italian food and French wines.

The results of this query could be empty (in the case where

no restaurants offer both Italian food and French wines) and the kNN retrieval is not necessary at all. It is well known that when the data is stored in a relational database, for various types of queries using only primitive SQL operators, the sophisticated query optimizer built inside the database engine will do an excellent job in finding a fairly good query execution plan [4].

These advantages of storing and processing data sets in a relational database motivate us to study the kNN query and the kNN-Join problem in a relational database environment.

Our goal is to design algorithms that could be implemented by the primitive SQL operators and require no changes to the database engine. The benefits of satisfying such constraints are threefold. First, kNN based queries could be augmented with ad-hoc query conditions dynamically, and they are automatically optimized by the query optimizer, without updating the query algorithm each time for specific query conditions.

Second, such an approach could be readily applied on existing commercial databases, without incurring any cost for upgrading or updating the database engine, e.g., to make it support spatial indices. We denote an algorithm that satisfies these two constraints as a relational algorithm. Finally, this approach makes it possible to support thekNN-Join efficiently.

We would like to design algorithms that work well for data in multiple dimensions and easily support dynamic updates without performance degeneration.

The similar relational principle has been observed for other problems as well, e.g., approximate string joins in relational databases [15]. The main challenge in designing relational algorithms is presented by the fact that a query optimizer cannot optimize any user-defined functions (UDF) [15]. This rules out the possibility of using a UDF as a main query condition. Otherwise, the query plan always degrades to the expensive linear scan or the nested-loop join approach, which is prohibitive for large databases. For example,

SELECT TOP k * FROM Address A, Restaurant R WHERE R.Type=‘Italian’ AND R.Wine=‘French’

ORDER BY Euclidean(A.X, A.Y, R.X, R.Y)

In this query, X and Y are attributes representing the coordinates of an Address record or a Restaurant record.

“Euclidean” is a UDF that calculates the Euclidean distance between two points. Hence, this is a kNN-Join query. Even though this query does qualify as a relational algorithm, the query plan will be a nested-loop join when there are restaurants satisfying both the “Type” and “Wine” constraints, since the

(2)

query optimizer cannot optimize the UDF “Euclidean”. This implies that we may miss potential opportunities to fully optimize the queries. Generalizing this example to the kNN- query problem, the UDF-based approach will degrade to the expensive linear scan approach.

Our Contributions. In this work, we design relational algorithms that can be implemented using primitive SQL operators without the reliance on the UDF as a main query condition, so that the query optimizer can understand and optimize. We would like to support both approximate and exact kNN and kNN-Join queries. More specifically,

• We formalize the problems of kNN queries and kNN- Joins in a relational database (Section II).

• We provide a constant factor approximate solution based on the Z-order values from a small, constant number of randomly shifted copies of the database (Section III). We provide the theoretical analysis to show that by using only O(1) random shifts for data in any fixed dimension, our approach gives an expected constant factor approximation (in terms of the radius of thek nearest neighbor ball) with onlylog N number of page accesses for the kNN query whereN is the size of data set P . Our approach can be achieved with only primitive SQL operators.

• Using the approximate solution, we show how to get exact results for a kNN query efficiently (Section IV).

Furthermore, we show that for certain types of data distributions, our exact solution also only usesO(log N ) number of page accesses in any fixed dimension. The exact solution is also easy to implement using SQL.

• We extend our algorithms to kNN-Join queries, which can be achieved in relational databases without changes to the engines (Section V).

• We show that our algorithms easily support float values, data in arbitrary dimension and dynamic updates without any changes to the algorithms (Section VI).

• We present a comprehensive experimental study, that confirms the significant performance improvement of our approach, against the state of the art (Section VII).

In summary, we show how to find constant approximations for kNN queries in logarithm page accesses with a small constant number of random shifts in any fixed dimension; our approximate results lead to highly efficient search of the exact answers, with a simple post-processing of the results. Finally, our framework enables the efficient processing ofkNN-Joins.

We survey the related work in Section VIII.

II. PROBLEMFORMULATION

Suppose that the data set P in a d dimensional space is stored in a relational table RP. The coordinates of each point p ∈ P are stored in d attributes {Y1, . . . , Yd}. Each point could associate with other values, e.g., types of restaurants. These additional attributes are denoted by{A1, . . . , Ag} for some value g. Hence, the schema of RP is{pid, Y1, · · · , Yd, A1, · · · , Ag} wherepid corresponds to the point id from P .

kNN queries. Given a query point q and its coordinates

{X1, . . . , Xd}, let A=kNN(q, RP) be the set of k nearest neighbors ofq from RP and|x, y| be the Euclidean distance between the point x and the point y (or the corresponding records for the relational representation of the points), then:

(A ⊆ RP)∧(|A| = k)∧(∀a ∈ A, ∀r ∈ RP−A, |a, q| ≤ |r, q|).

kNN-Join. In this case, the query is a set of points denoted byQ, and it is stored in a relational table RQ. The schema of RQ is {qid, X1, · · · , Xd, B1, · · · , Bh}. Each point q in Q is represented as a records in RQ, and its coordinates are stored in attributes {X1, . . . , Xd} of s. Additional attributes of q, are represented by attributes{B1, · · · , Bh} for some value h.

Similarly,qid corresponds to the point id from Q. The goal of thekNN-Join query is to join each record s from RQ with itskNN from RP, based on the Euclidean distance defined by {X1, . . . , Xd} and {Y1, . . . , Yd}, i.e., for ∀s ∈ Q, we would like to produce pairs(s, r), for ∀r ∈ kNN(s, Rp).

Other query conditions. Additional, ad-hoc query conditions could be specified by any regular expression, over {A1, · · · , Ag} in case of kNN queries, or both {A1, · · · , Ag} and{B1, · · · , Bh} in case of kNN-Join queries. The relational requirement clearly implies that the query optimizer will be able to automatically optimize input queries based on these query conditions (see our discussion in Section VII).

Approximate k nearest neighbors. Suppose q’s kth nearest neighbor from P is p^∗ and r^∗ = |q, p^∗|. Let p be the kth nearest neighbor of q for some kNN algorithm A and r^p =

|q, p|. Given ǫ > 0 (or c > 1), we say that (p, r^p) ∈ R^d× R is a (1 + ǫ)-approximate (or c-approximate) solution to the kNN query kNN(q, P ) if r^∗≤ r^p≤ (1 + ǫ)r^∗ (orr^∗≤ r^p≤ cr^∗ for some constant c). Algorithm A is called a (1 + ǫ)- approximation (orc-approximation) algorithm.

Similarly for kNN-Joins, an algorithm that finds a kth nearest neighbor point p ∈ P for each query point q ∈ Q, that is at least a (1 + ǫ)-approximation or c-approximation w.r.t kNN(q, P ) is a (1 + ǫ)-approximate or c-approximate kNN-Join algorithm. The result by this algorithm is referred to as a(1 + ǫ)-approximate or c-approximate join result.

Additional notes. The default value for d in our running examples is two, but, our proofs and algorithms are presented for any fixed dimension,d. We focus on the case where the coordinates for points are always integers. Handling floating points coordinates is discussed in Section VI. Our approach easily supports updates, which is also discussed in Section VI. The number of records in RP and RQ are denoted by N = |RP| and M = |RQ| respectively. We assume that each page can store maximally B records from either RP or RQ, and the fan-out of a B+ tree is f . Without loss of generality, we assume that points inP are all in unique locations. For the general case, our algorithms can be easily adapted by breaking the ties arbitrarily.

III. APPROXIMATION BYRANDOMSHIFTS

The z-value of a point is calculated by interleaving the binary representations of its coordinate values from the most

(3)

significant bit (msb) to the least significant bit (lsb). For example, given a point (2, 6) in a 2-d space, the binary representation of its coordinates is (010, 110). Hence, its z- value is 011100 = 28. The Z-order curve for a set of points P is obtained by connecting the points in P by the numerical order of their z-values and this produces the recursively Z- shaped curve. A key observation for the computation of z- value is that, it only requires simple bit-shift operations which are readily available or easily achievable in most commercial database engines. For a pointp, zp denotes itsz-value.

Our idea utilizes the z-values to map points in a multi- dimensional space into one dimension, and then translate the kNN search for a query point q into one dimensional range search on the z-values around the q’s z-value. In most cases, z-values preserve the spatial locality and we can findq’s kNN in a close neighborhood (say γ positions up and down) of its z-value. However, this is not always the case. In order to get a theoretical guarantee, we produceα, independent, randomly shifted copies of the input data set P and repeat the above procedure for each randomly shifted version of P .

Specifically, we define the “random shift” operation, as shifting all data points in P by a random vector −→v ∈ R^d. This operation is simply p + −→v for all p ∈ P , and denoted as P + −→v . We independently at random generate α number of vectors {−→v1, . . . , −v→α} where ∀i ∈ [1, α], −→vi ∈ R^d. Let Pⁱ= P +−→vi,P⁰= P and −→v0=−→

0 . For each Pⁱ, its points are sorted by theirz-values. Note that the random shift operation is executed only once for a data setP and used for subsequent queries. Next, for a query pointq and a data set P , let zp be the successor z-value of zq among all z-values for points in P . The γ-neighborhood of q is defined as the γ points up and down next to zp. For the special case, whenzp does not have γ points before or after, we simply take enough points after or before zp to make the total number of points in the γ-neighborhood to be 2γ + 1, including zp itself. Our kNN query algorithm essentially finds the γ-neighborhood of the query pointqⁱ= q + −→vi inPⁱfori ∈ [0, α] and select the final topk from the points in the unioned (α+ 1) γ-neighborhoods, with a maximum (α + 1)(2γ + 1) number of distinct points.

We denote this algorithm as the z^χ-kNN algorithm and it is shown in Algorithm 1. It is important to note that in Line 5, we obtain the original point from its shifted version if it is selected to be a candidate in one of the γ-neighborhoods.

This step simplifies the final retrieval of the kNN from the candidate set C. It also implies that the candidate sets Cⁱ’s may contain duplicate points, i.e., a point may be in the γ- neighborhood of the query point in more than one randomly shifted versions.

The z^χ-kNN is clearly very simple and can be implemented efficiently using only(α + 1) one-dimensional range searches, each requires only logarithmic IOs w.r.t the number of pages occupied by the data set (N/B) if γ is some constant. More importantly, we can show that, withα = O(1) and γ = O(k), z^χ-kNN gives a constant approximation kNN result in any fixed dimensiond, with only O(log_f ^N_B+k/B) page accesses.

In fact, we can show these results with just α = 1 and

Algorithm 1: z^χ-kNN (point q, point sets {P⁰, . . . , P^α}) CandidatesC = ∅;

1

fori = 0, . . . , α do

2

Find zⁱ_p as the successor ofz_q+⁻^→_v_i inPⁱ;

3

LetCⁱ beγ points up and down next to zⁱ_p inPⁱ;

4

For each pointp in Cⁱ, letp = p − −→vi;

5

C = CS Cⁱ;

6

LetA^χ= kNN(q, C) and output A^χ.

7

γ = k. In this case, let P^′be a randomly shifted version of the point setP , q^′ be the correspondingly shifted query point and Aq^′be thek nearest neighbors of q^′inP^′. Note thatAq^′ = A.

Let points in P^′ be {p1, p2, . . . , pN} and they are sorted by theirz-values. The successor of zq^′ w.r.t the z-values inP^′ is denoted aspτ for someτ ∈ [1, N ]. Clearly, the candidate set C in this case is simply C = {pτ −k, . . . , pτ +k}. Without loss of generality, we assume that bothτ − k and τ + k are within the range of [1, N ], otherwise we can simply take additional points after or before the successor as explained above in the algorithm description. We use B(c, r) to denote a ball with centerc and radius r and let rad(p, S) be the distance from the pointp to the farthest point in a point set S. We first show the following lemmas in order to claim the main theorem.

Lemma 1 LetM be the smallest box, centered at q^′ contain- ingAq^′ and with side length2ⁱ(wherei is assumed w.l.o.g to be an integer> 0) which is randomly placed in a quadtree T (associated with the Z-order). If the eventEj is defined asM being contained in a quadtree boxMT with side length2^i+j, andMT is the smallest such quadtree box, then

Pr(Ej) ≤

1 − 1

2^j

d

d^j−1 2^{j2 −j}²

Proof: The proof is in the paper’s full version [32].

Lemma 2 The z^χ-kNN algorithm gives an approximate k- nearest neighbor ballB(q^′, rad(q^′, C)) using q^′andP^′ where rad(q^′, C) is at most the side length of MT. HereMT is the smallest quadtree box containingM , as defined in Lemma 1.

Proof: z^χ-kNN scans at least pτ −k. . . pτ +k, and picks the top k nearest neighbors to q among these candidates.

Let a be the number of points between pτ and the point with the largest z-value in B(q^′, rad(q^′, Aq^′)), the exact k- nearest neighbor ball of q^′. Similarly, let b be the number of points between pτ and the point with the smallest z- value in B(q^′, rad(q^′, Aq^′)). Clearly, a + b = k. Note that B(q^′, rad(q^′, Aq^′)) ⊂ M is contained inside MT, hence the number of points inside MT ≥ k. Now, pτ. . . pτ +k

must containa points inside MT. Similarly,pτ −k. . . pτ must contain at leastb points from MT. Since we have collected at least k points from MT,rad(q^′, C) is upper bounded by the side length ofMT.

Lemma 3 MT is only constant factor larger than M in expectation.

(4)

Proof: The expected side length ofMT is:

E[2^i+j] ≤

∞

X

j=1

2^i+jPr(Ej) ≤ 2ⁱ

∞

X

j=1

(1 − 1

2^j)^dd^j−12^3j−j2² , where Lemma 1 givesPr(Ej). Using Taylor’s approximation:

1 − 1

2^j

d

≤ 1 − d2⁻^j 1 + 2^−1−j + d²2^−2j−1 and substituting it in the expectation calculation, we can show thatE[2^i+j] is O(2ⁱ). The detail is in the full version [32].

These lemmas lead to the main theorem for z^χ-kNN.

Theorem 1 Using α = O(1), or just one randomly shifted copy of P , and γ = O(k), z^χ-kNN guarantees an expected constant factor approximatekNN result with O(log_f ^N_B+k/B) number of page accesses.

Proof: The IO cost follows directly from the fact that the one dimensional range search used by z^χ-kNN takes O(log_f ^N_B + k/B) number of page accesses with a B-tree index. Let q be the point for which we just computed the approximate k-nearest neighbor ball B(q^′, rad(q^′, C)). From Lemma 2, we know thatrad(q^′, C) is at most the side length ofMT. From Lemma 3 we know that the side length ofMT

is at most constant factor larger than rad(q^′, Aq^′) in P^′, in expectation. Note that rad(q, Aq) in P equals rad(q^′, Aq^′) in P^′. Hence our algorithm, in expectation, computes an approximate k-nearest neighbor ball that is only a constant factor larger than the truek-nearest neighbor ball.

Algorithm 1 clearly indicates that z^χ-kNN only relies on one dimensional range search as its main building block (Line 4). One can easily implement this algorithm with just SQL statements over tables that store the point sets {P⁰, . . . , P^α}.

We simply use the original data set P to demonstrate the translation of the z^χ-kNN into SQL. Specifically, we pre- process the table RP so that an additional attribute zval is introduced. A record r ∈RP uses {Y1, . . . , Yd} to compute its z-value and store it as the zval. Next, a clustered B+-tree indexis built on the attribute zval over the table RP. This also implies that records in RP are sorted by the zval attribute. For a given query point q, we first calculate its z-value (denoted as zval as well) and then, we find the successor record of q in RP, based on the zval attribute of Rp andq. In the sequel, we assume that such a successor record always exists. In the special case, when q’s z-value is larger than all the z-values in RP, we simply take its predecessor instead. This technical detail will be omitted. The successor can be found by:

SELECT TOP 1 * FROM RP WHERE R_P.zval ≥ q.zval

Note that TOP is a ranking operator that becomes part of the standard SQL in most commercial database engines. For example, Microsoft SQL Server 2005 has the TOP operator available. Recent versions of Oracle, MySQL, and DB2 have the LIMIT operator which has the same functionality. Since table RP has a clustered B+ tree on the zval attribute, this query has only a logarithmic cost (to the size of RP) in terms of the IOs, i.e., O(log_f ^N_B). Suppose the successor record is

rs, in the next step, we retrieve all records that locate within γ positions away from rs. This can be done as:

SELECT * FROM

( SELECT TOP γ * FROM RP WHERE RP.zval

> rs.zval ORDER BY R_P.zval ASC UNION

SELECT TOP γ * FROM RP WHERE R_P.zval

< rs.zval ORDER BY R_P.zval DESC ) AS C

Again, due to the clustered B+ tree index on the zval, this query is essentially a sequential scan around rs, with a query cost of O(log_f ^N_B +_B^γ). The first SELECT clause of this query is similar to the successor query as above. The second SELECT clause is also similar but the ranking is in descending order of the zval. However, even in the second case, the query optimizer is robust enough to realize that with the clustered index on the zval, no ranking is required and it simply sequentially scansγ records backwards from rs.

The final step of our algorithm is to simply retrieve the top k records from these 2γ + 1 candidates (including rs), based on their Euclidean distance to the query pointq. Hence, the UDF ‘Euclidean’ is only applied in the last step with 2γ + 1 number of records. The complete algorithm can be expressed in one SQL statement as follows:

1 SELECT TOP k * FROM

2 ( SELECT TOP γ+ 1 * FROM RP, 3 ( SELECT TOP 1 zval FROM RP

4 WHERE R_P.zval ≥ q.zval 5 ORDER BY RP.zval ASC ) AS T 6 WHERE R_P.zval≥T.zval

7 ORDER BY RP.zval ASC

8 UNION

9 SELECT TOP γ * FROM RP

10 WHERE R_P.zval < T.zval 11 ORDER BY RP.zval DESC ) AS C

12 ORDER BY Euclidean(q.X1,q.X2,C.Y1,C.Y2) (Q1)

For all SQL environments, line 3 to 5 need to be copied into the FROM clause in line 9. We omit this from the SQL statement to shorten the presentation. In the sequel, for similar situations in all queries, we choose to omit this. Essentially, line 2 to line 11 in Q1 select the γ-neighborhood of q in P , which will be the candidate set for finding the approximate kNN of q using the Euclidean UDF by line 1 and 12.

In general, we can pre-process {P¹, . . . , P^α} similarly to get the randomly shifted tables {R¹_P, . . . , R^α_P} of RP, the above procedure could be then easily repeated and the final answer is the topk selected based on applying the Euclidean UDF over the union of the(α + 1) γ-neighborhoods retrieved.

The whole process could be done in just one SQL by repeating line2 to line 11 on each randomly shifted table and unioning them together with the SQL operator UNION.

In the work by Liao et al. [23], using Hilbert-curves a set of d + 1 deterministic shifts can be done to guarantee a constant factor approximate answer but in this case the space requirement is high, O(d) copies of the entire point set is required, and the associated query cost is increased by a multiplicative factor of d. In contrast, our approach only requires O(1) shifts for any dimension and can be adapted to yield exact answers. In practice, we just use the optimal value of α = min(d, 4) that we found experimentally to

(5)

give the best results for any fixed dimension. In addition, z- values are much simpler to calculate than the Hilbert values, especially in higher dimensions, making it suitable for the SQL environment. Finally, we emphasize that randomly shifted tables {R¹_P, . . . , R^α_P} are generated only once and could be used for multiple, different queries.

IV. EXACTkNN RETRIEVAL

The z^χ-kNN algorithm finds a good approximate solution in O(log_f ^N_B + k/B) number of IOs. We denote the kNN result from the z^χ-kNN algorithm as A^χ. One could further retrieve the exactkNN results based on A^χ. A straightforward solution is to perform a range query using the approximate kth nearest neighbor ball of A^χ,B(A^χ). It is defined as the ball centered at q with the radius rad(q, A^χ), i.e., B(A^χ) = B(q, rad(q, A^χ)). Clearly, rad(q, A^χ) ≥ r^∗. Hence, the exact kNN points are enclosed by B(A^χ). This implies that we can find the exactkNN for q by:

SELECT TOP k * FROM RP

WHERE Euclidean(q.X1,q.X2,R_P.Y1,R_P.Y2)≤ rad(p, A^χ) ORDER BY Euclidean(q.X1,q.X2,R_P.Y1,R_P.Y2) (Q2)

This query does reduce the number of records participated in the final ranking procedure with the Euclidean UDF. However, it still needs to scan the entire table RP to calculate the Euclidean UDF first for each record, thus becomes very expensive. Fortunately, we can again utilize the z-values to find the exactkNN based on A^χ much more efficiently.

We define the kNN box for A^χ as the smallest box that fully encloses B(A^χ), denoted as M (A^χ). Generalizing this notation, let A^χ_i be the kNN result of z^χ-kNN when it is applied only on table Rⁱ_P and M (A^χ_i) and B(A^χ_i) be the corresponding kNN box and kth nearest neighbor ball from table Rⁱ_P. The exact kNN results from all table Rⁱ_P’s are the same and they are always equal to A. In the sequel, when the context is clear, we omit the subscripti from A^χ_i.

An example of the kNN box is shown in Figure 1(a). In this case, the z^χ-kNN algorithm on this table RP returns the kNN result as A^χ= {p1, p2, p4} for k = 3 and p4 is thekth nearest neighbor. Hence,B(A^χ) = B(q, |q, p4|), shown as the solid circle in Figure 1(a). ThekNN box M (A^χ) is defined by the lower-left and right-upper corner pointsδℓandδh, i.e., the

∆ points in Figure 1(a). As we have argued above, the exact kNN result must be enclosed by B(A^χ), hence, also enclosed by M (A^χ). In this case, the exact kNN result is {p1, p2, p3} and the dotted circle in Figure 1(a) is the exact kth nearest neighbor ballB(A).

Lemma 4 For a rectangular box M and its lower-left and upper-right corner pointsδℓ,δh,∀p ∈ M , zp∈ [zℓ, zh], where zp stands for thez-value of a point p and zℓ,zh correspond to the z-values of δℓ andδh respectively (See Figure 1(a)).

Proof: Consider 2-d points, let p.X and p.Y be the coordinate values of p in x-axis and y-axis respectively. By p ∈ M , we have p.X ∈ [δℓ.X, δh.X] and p.Y ∈ [δℓ.Y, δh.Y ].

Since the z-value of a point is obtained by shuffling bits of its coordinate values from the msb to the lsb in an alternating

p4

zp₁

γ zp4

zℓ zp2 zp₅zp₃zh

γ z-val p5

zq

r^∗ r^p

δh

δℓ

p1 q p2

p3

(a) kNN box.

p5

p8 p9

p7

p1 q p2

p3 δh

δ_ℓ p4

p6

γh

γℓ

(b) A^χ= A.

Fig. 1. kNN box: definition and exact search.

fashion, this immediately implies that zp∈ [zℓ, zh]. The case for higher dimensions is similar.

Lemma 4 implies that thez-values of all exact kNN points will be bounded by the range[zℓ, zh], where zℓ andzh are the z-values for the δℓ andδh points ofM (A^χ), in other words:

Corollary 1 Letzℓandzhbe thez-values of δℓandδhpoints ofM (A^χ). For all p ∈ A, zp∈ [zℓ, zh].

Proof: ByB(A) ⊂ M (A^χ) and Lemma 4.

Consider the example in Figure 1(a), Corollary 1 guarantees thatzpi fori ∈ [1, 5] and zq are located between zℓ and zh

in the one-dimensionalz-value axis. The z^χ-kNN essentially searchesγ number of points around both the left and the right ofzq in the z-value axis, for α number of randomly shifted copies ofP , including P itself. However, as shown in Figure 1(a), it may still miss some of the exactkNN points. Let γℓ

andγhdenote the left and rightγ-th points respectively for this search. In this case, letγ = 2, zp3 is outside the search range, specifically, zp3 > zγh. Hence, z^χ-kNN could not find the exactkNN result. However, given Corollary 1, an immediate result is that one can guarantee to find the exactkNN result by considering all points with their z-values between zℓ and zh

ofM (A^χ). In fact, if zγℓ ≤ zℓandzγh ≥ zhin at least one of the table Rⁱ’s fori = 0, . . . , α, we know for sure that z^χ-kNN has successfully retrieved the exact kNN result; otherwise it may have missed some exactkNN points. The case when zℓ

andzh are both contained by zγℓ and zγh of theM (A^χ) in one of the randomly shifted tables is illustrated in Figure 1(b).

In this case, in one of the(α + 1) randomly shifted tables, the z-order curve passes through zγℓ first before any points in M (A^χ) (i.e., zγℓ ≤ zℓ) and it comes tozγh after all points inM (A^χ) have been visited (i.e., zγh ≥ zh). As a result, the candidate points considered by z^χ-kNN include every point inA and the kNN result from the algorithm z^χ-kNN will be exact, i.e.,A^χ = A.

When this is not the case, i.e., eitherz_ℓⁱ< z_γⁱ_ℓ orz_hⁱ > zⁱ_γ_h or both in all tables Rⁱ_P’s fori = 0, . . . , α, A^χ might not be equal toA. To address this issue, we first choose one of the table R^j_P such that its M (A^χ_j) contains the least number of points amongkNN boxes from all tables; then the candidate points for the exactkNN search only need to include all points contained by this box, i.e.,M (A^χ_j) from the table R^j_P.

To find the exactkNN from the box M (A^χ_j), we utilizing Lemma 4 and Corollary 1. In short, we calculate thez_ℓ^j and

(6)

Algorithm 2: z-kNN (point q, point sets {P⁰, . . . , P^α}) Let A^χ_i bekNN(q, Cⁱ) whereCⁱ is from Line4 in the

1

z^χ-kNN algorithm;

Let zⁱ_ℓandz_hⁱ be the z-values of the lower-left and

2

upper-right corner points for the boxM (A^χ_i);

Let zⁱ_γ_ℓ andzⁱ_γ_h be the lower bound and upper bound of

3

the z-values z^χ-kNN has searched to produce Cⁱ; if ∃i ∈ [0, α], s.t. z_γⁱ_ℓ ≤ z_ℓⁱ andz_γⁱ_h ≥ z_hⁱ then

4

ReturnA^χ by z^χ-kNN as A;

5

else

6

Find j ∈ [0, α] such that the number of points in R^j_P

7

with z-values in ∈ [z_ℓ^j, z^j_h] is minimized;

Let Ce be those points and return A = kNN(q, Ce);

8

z_h^j of this box and do a range query with [z^j_ℓ, z_h^j] on the zval attribute in table R^j_P. Since there is a clustered index built on the zval attribute, this range query becomes a sequential scan in [z_ℓ^j, z_h^j] which is very efficient. It essentially involves logarithmic IOs to access the path from the root to the leaf level in the B+ tree, plus some sequential IOs linear to the number of points between z_ℓ^j and z_h^j in table R^j_P. The next lemma is immediate based on the above discussion. Let zⁱ_s be the successor z-value to qⁱ’s z-value in table Rⁱ_P where qⁱ= q + −→vi,z_pⁱ be thez-value of a point p in table Rⁱ_P andLⁱ be the number of points in the range[z_ℓⁱ, z_hⁱ] from table Rⁱ_P. Lemma 5 For the algorithm z^χ-kNN, if there exists at least onei ∈ {0, . . . , α}, such that [zⁱ_ℓ, z_hⁱ] ⊆ [z_γⁱ_ℓ, zⁱ_γ_h], then A^χ= A; otherwise, we can find A ⊆ [z_ℓ^j, z^j_h] for some table R^j_P where j ∈ {0, . . . , α} and L^j = min{L⁰, . . . , L^α}.

These discussion leads to a simple algorithm for finding the exactkNN based on z^χ-kNN and it is shown in Algorithm 2.

We denote it as the z-kNN algorithm.

Figure 2 illustrates the z-kNN algorithm. In this case, α = 1, In both table R⁰_P and R¹_P, [zγℓ, zγh] does not fully contain [zℓ, zh]. Hence, algorithm z^χ-kNN does not guarantee to return the exact kNN. Among the two tables, [z_ℓ¹, z_h¹] contains less number of points than [z_ℓ⁰, z_h⁰] (in Figure 2, L¹< L⁰), hence we do the range query using[z_ℓ¹, z_h¹] in table R¹_P and apply the

“Euclidean” UDF on the records returned by this range query to select the final, exactkNN records. Algorithm z-kNN can be achieved using SQL alone. The first step is to check if A^χ = A. This is equivalent to check if [zⁱ_ℓ, z_hⁱ] ⊆ [z_γⁱ_ℓ, z_γⁱ_h] for somei. Note that we can calculate the coordinate values of δ_ℓⁱandδ_hⁱ for anyM (A^χ_i) easily with addition and subtraction, givenqⁱ andA^χ_i, and convert them into the z-values (z_ℓⁱ and z_hⁱ) in SQL (we omit this detail for brevity). That said, this checking on table Rⁱ_P can be done via:

SELECT COUNT(*) FROM RⁱP AS R WHERE R.zval ≥ z_ℓⁱ AND R.zval ≤ z_hⁱ

AND R.pid NOT IN ( SELECT pid FROM Rⁱ_P AS R1 WHERE R1.zval ≥ z_γⁱ_ℓ AND R1.zval ≤ zⁱ_γ_h) (Q3)

z1s

· · · zval

· · ·

· · · R⁰_P q⁰.zval

γ

γ z0ℓ

z0h

· · ·

· · · zval

· · ·

γ γ

R¹_P q¹.zval

z1ℓ

z1h

L¹ L⁰

z0s

Fig. 2. Exact search of kNN.

If this count equals 0, then [z_ℓⁱ, zⁱ_h] ⊆ [zⁱ_γ_ℓ, z_γⁱ_h]. One can easily create one SQL statement to do this checking for all tables. If there is at least one table with a count equal to 0 among R⁰_P, . . . , R^α_P, then we can can safely returnA^χ as the exact kNN result. Otherwise, we continue to the next step.

We first select the table (say R^j_P) with the smallestLⁱ value.

This is done by (Q4), in which we find the valuej. Then we select the final kNN result from R^j_P among the records with z-values between [z^j_ℓ, z_h^j] via (Q5).

SELECT TOP 1 ID FROM

( SELECT 0 AS ID, COUNT(*) AS L FROM R⁰_P AS R⁰ WHERE R⁰.zval ≥ z_ℓ⁰ AND R⁰.zval ≤ z⁰_h

UNION · · · UNION

SELECT α AS ID, COUNT(*) AS L FROM R^αP AS R^α WHERE R^α.zval ≥ z_ℓ^α AND R^α.zval ≤ z_h^α

) AS T ORDER BY T.L ASC (Q4)

SELECT TOP k * FROM R^j_P //The ID from Q4 is j WHERE R^j_P.zval ≥ z^j_ℓ AND R^j_P.zval ≤ z_h^j

ORDER BY Euclidean(q^j.X1,q^j.X2,R^j_P.Y1,R^j_P.Y2) (Q5)

One can combine (Q3), (Q4) and (Q5) to get the exact kNN result based on the approximate kNN result given by (Q1)(the z^χ-kNN algorithm).

Finally, we would like to highlight that theoretically, for many practical distributions, the z-kNN algorithm still achieves O(log_f ^N_B + k/B) number of page accesses to report the exact kNN result in any fixed dimension. When A^χ = A, this result is immediate by Theorem 1. When A^χ 6= A, the key observation is that the number of “false positives” in the candidate setCe(line8 in Algorithm 2) is bounded by some constantO(k) for many practical data distributions. Hence, the number of candidate points that the z-kNN algorithm needs to check is still O(k), the same as the z^χ-kNN algorithm.

Referring back to Figure 1(b), the false positives in akNN box M (A^χ_i) are defined as those points p such that p 6∈ M (A^χ_i) but zp ∈ [z_ℓⁱ, zⁱ_h]. For example, in Figure 1(b), the false positives are {p8, p6, p7}. Note that p9 is not a false positive w.r.t this kNN box as it is beyond the range [z_ℓⁱ, zⁱ_h].

Let P be a fixed distribution of points in a fixed d- dimensional space, i.e.,d is considered as a constant. Let P be i.i.d. from P and its size be N ≫ k ≥ 1. We will call this distribution a Doubling Distribution if it has the following property: LetS be a d dimensional ball with center pi∈ P and radiusr that contains k points. Then, the d dimensional ball S^′ with centerpi and radius2r has at most νk points, for some ν = O(1). This is a similar restriction to the doubling metric restriction on metric spaces and has been used before [21].

(7)

Note that many distributions of points occurring in real data sets, including uniform distribution satisfy this property.

Theorem 2 For doubling distributions, the expected number of false positives for akNN box M (A^χ_i) for all i ∈ {0, . . . , α}

is O(k); the number of points that are fully enclosed by M (A^χ_i) is also O(k).

Proof: Without loss of generality, let ι be any fixed value from 0 to α. Let MT be the smallest quadtree box that contains M (A^χ_ι). Let d = 1. We will now show that the number of points in MT is O(k). The expected number of points in M (A^χ_ι) is at most kν¹₂ + kν^{2 1}₂³₄ + kν¹₂¹₄⁷₈ + . . . ≤ kP^∞

j=1 ν^j

2^{(j2 +1)/2} = O(k), since ν = O(1). A similar argument shows that as the dimension increases (but is still O(1)), the expected number of points in M (A^χ_ι) is still O(k).

A two level expectation argument shows that the expected number of points in MT is still O(k). The details of this calculation are omitted for brevity. The points in MT are consecutive in z-values and the Z-order curve enters the lower- left corner and sweeps through the entireMT before it leaves through the upper right corner ofMT. Since,M (A^χ_ι) ⊂ MT, the curve passing through the lower left corner ofM (A^χ_ι) and ending at the upper right corner ofM (A^χ_ι) can not go out of MT. This implies that all the false positives are contained in MT and hence the expected number of false positives is also upper bounded by O(k) in expectation.

Corollary 2 For doubling distributions, the z-kNN algorithm, using O(1) number of random shifts, retrieves the exact kNN result with O(log_f ^N_B + k/B) number of page accesses for data in any fixed dimension.

Our Theorem requires just 1 random shift. In practice, we use several random shifts to amplify the probability of getting smallerMT sizes (which in turn reduces the query cost), at the expense of increasing storage cost. We explore this trade-off in our experiments.

V. kNN-JOIN ANDDISTANCEBASEDθ-JOIN

An important feature for our approach is that we can easily and efficiently support join queries. The basic principle of finding the k nearest neighbors stays the same for the kNN- Join query over two tables RQ and RP. However, the main challenge is to achieve this using a single SQL statement. We still generate R⁰_P, . . . , R^α_P in the same fashion. Concentrating on the approximate solution, we need to perform the similar procedure as shown in Section III to joining two tables (RQ

and Rⁱ_P’s). The general problem is to join each individual record si from RQ to 2γ + 1 number of records from RP

aroundsi’s successor (based on thez-value) record rs(si) in RP; and then for each such group (si and the2γ + 1 records around rs(si)) we need to select the top-k records based on their Euclidean distances to si. A simple approach is to use a store procedure to implement this idea, i..e, for each record from RQ, we execute the z^χ-kNN query from Section III.

If one would like to implement this join with just one SQL statement, the observation is that the second step above is

equivalent to retrieving topk records in each group based on some ranking functions and grouping conditions. This has been addressed by all commercial database engines. For example, in Microsoft SQL Server, this is achieved by the RANK() OVER (PARTITION BY . . . ORDER BY . . .) clause. Conceptu- ally, this clause assigns a rank number to each record involved in one partition or group according to its sorted order in that partition. Hence, we could simply select the tuple with the rank number that is less than or equal tok from each group.

Oracle, MySQL and DB2 all have their own operators for similar purposes.

We denote this query as the z^χ-kNNJ algorithm. It is important to note that in some engines, the query optimizer may not do a good job in optimizing the top-k query for each group. An alternative approach is to implement the same idea with a store procedure, By the same argument as shown in Section III, the following result is immediate.

Lemma 6 Using α = O(1) and γ = O(k), the z^χ-kNNJ algorithm guarantees an expected constant approximatekNN- Join result inO ^M_B(log_f ^N_B +_B^k) number of page accesses.

We could extend the exact z-kNN algorithm from Section IV to derive SQL statements as the z-kNNJ algorithm for the exactkNN-Join. We omit it for brevity.

Our approach is quite flexible and supports a variety of interesting queries. In particular, we demonstrate how it could be adopted to support the distance basedθ-Join query, denoted as the θ-DJoin. Our idea for the θ-DJoin is similar to the principle adopted in [24]. This query joins each recordsi ∈ RQ with the set A^θ(si) which contains all records rj ∈ RP

such that |si, rj| ≤ θ for some specified θ value. Suppose si corresponds to a query point q. An obvious observation is that the furthest point (or record) to q in A^θ(si) always has a distance that is at most θ. Hence, the ball B(q, θ) completely encloses all points from A^θ(si). Let the θ-box for a recordsi be the smallest box that enclosesB(q, θ) and denote it as M (A^θ(si)). Clearly, all points from A^θ(si) are also fully enclosed by M (A^θ(si)). By Lemma 4, we have for ∀p ∈ A^θ(si), zp ∈ [zℓ, zh], here zℓ and zh are the z- values of the bottom-left and top-right corner points of the boxM (A^θ(si)). This becomes exactly the same problem as the exact kNN search and similar ideas from z-kNN could then be applied.

VI. FLOATVALUES, HIGHERDIMENSIONS ANDUPDATES

Our method easily supports the floating point coordinates by computing thez-values explicitly for floating point coordinates in the pre-processing phase. This is done via the same bit- interleaving operation. The only problem with this approach is that the number of bits required for interleaving d-single precision co-ordinates is 256d, assuming IEEE 754 floating point representation. To avoid such a long string we can use the following trick: We fix the length of the interleaved Z-order bits to be at mostµd bits, where µ is a small constant. We scale the input data such that all coordinates lie between(0, 1). We then only interleave the firstµ bits of each coordinate after the

(8)

decimal point. For most practical data sets,µ < 32 (equivalent to more than9 digits of precision in the decimal system).

There are no changes required to our framework for dealing with data in any dimension. Though our techniques work for any dimension, however, as d increases, the number of bits required for thez-value also increases. This introduces storage overhead as well as performance degradation (the fanout of the clustered B+ tree on thezval attribute drops). Hence, for the really large dimensionality (say d > 30) one should consider using techniques that are specially designed for those purposes, for example the LSH-based method [5], [14], [25], [30].

Another nice property of our approach is the easy and efficient support of updates, both insertions and deletions. For the deletion of a record r, we simply delete r based on its pid from all tables R⁰, . . . , R^α. For an insertion of a record r that corresponds to a point p, we calculate the z-values for p⁰, . . . , p^α, recall pⁱ = p + −→vi. Next, we simply insertr into all tables R⁰, . . . , R^αbut with differentz-values. The database engine will take care of maintaining the clustered B+ tree indices on the zval attribute in all these tables.

Finally, our queries are parallel-friendly as they execute similar queries over multiple tables with the same schema.

VII. EXPERIMENT

We implemented all algorithms in a database server running Microsoft SQL Server 2005. The state of the art algorithm for the exactkNN queries in arbitrary dimension is the iDistance [20] algorithm. For the approximatekNN, we compare against the M edrank algorithm [12], since it is the state of the art for finding approximate kNN in relatively low dimensions and is possible to adapt it in the relational principle. We also compared against the approach using deterministic shifts with the Hilbert-curve [23], however, that method requires O(d) shifts and computing Hilbert values in different dimensions.

Both become very expensive when d increases, especially in a SQL environment. Hence, we focused on the comparison against the Medrank algorithm. We would like to emphasize they were not initially designed to be relational algorithms that are tailored for the SQL operators. Hence, the results here do not necessarily reflect their behavior when being used without the SQL constraint. We implemented iDistance [20]

using SQL, assuming that the clustering step has been pre- processed outside the database and its incremental, recursive range exploration is achieved by a stored procedure. We built the clustered B+ tree index on the one-dimensional distance value in this approach. We used the suggested 120 clusters with thek-means clustering method, and the recommended ∆r value from [20]. For theM edrank, the pre-processing step is to generateα random vectors, then create α one-dimensional lists to store the projection values of the data sets onto the α random vectors, lastly sort these α lists. We created one clustered index on each list. In the query step, we leveraged on the cursors in the SQL Server as the ‘up’ and ‘down’

pointers and used them to retrieve one record from each list in every iteration. The query process terminates when there arek elements have been retrieved satisfying the dynamic threshold

(essentially the TA algorithm [13]). Both steps are achieved by using stored procedures. We did not implement the iJoin algorithm [33], the state of the art method forkNN-Joins, using only SQL, as it is not clear if that is feasible. Furthermore, it is based on iDistance and our kNN algorithm significantly outperforms the SQL-version iDistance. All experiments were executed on a Windows machine with an Intel2.33GHz CPU.

The memory of the SQL Server is set to1.5GB.

Data sets. The real data sets were obtained from [1]. Each data set represents the road-networks for a state in United States.

We have tested California, New Jersey, Maryland, Florida and others. They all exhibit similar results. Since the California data set is the largest with more than10 million points, we only show its results. By default, we randomly sample 1 million points from the California data set. We also generate two types of synthetic data sets, namely, the uniform (UN) points, and the random-clustered (R-Cluster) points. Note that the California data set is in2-dimensional space. For experiments in higher dimensional space, we use the UN and R-Cluster data sets.

Setup. Unless otherwise specified, we measured an algorithm’s performance by the wall clock time metric which is the total execution timeof the algorithm, i.e, including both the IO cost and the CPU cost. By default, 100 queries were generated for each experiment and we report the average for one query.

For both kNN and kNN-Join queries, the query point or the query points are generated uniformly at random in the space of the data set P . The default size of P is N = 10⁶. The default value for k is 10. We keep α = 2, randomly shifted copies, for the UN data set and α = min{4, d}, randomly shifted copies, for the California and R-Cluster data sets, and setγ = 2k. These values of α and γ do not change for different dimensions. For theM edrank algorithms, we set its α value as2 for experiments in two dimensions and 4 for d larger than 2. This is to make fair comparison with our algorithms (with the same space overhead). The default dimensionality is2.

A. Results for thekNN Query

0 2 3 4 5 6

0 0.04 0.08 0.12

α

Running time (secs)

z−kNN z^χ−kNN UN California

Fig. 4. Impact of α on the running time.

Impact of α. The number of “randomly shifted” copies has a direct impact on the running time for algorithms z^χ-kNN and z-kNN, as well as their space overhead. Its effect on the running time is shown in Figure 4. For the z^χ-kNN algorithm, we expect

its running time to increase linearly with α. Indeed, this is the case in Figure 4. The running time for the exact z- kNN algorithm has a more interesting trend. For the uniform UN data set, its running time also increases linearly with α, simply because it has to search more tables, and for the uniform distribution more random shifts do not change the probability of A^χ = A. This probability is essentially Pr(∃i, [z_ℓⁱ, z_hⁱ] ⊆ [z_γⁱ_ℓ, zⁱ_γ_h]) and it stays the same in the uniform data for different α values. The number of false