K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free
Bin Yao, Feifei Li, Piyush Kumar
Computer Science Department, Florida State University, Tallahassee, FL, U.S.A.
{yao, lifeifei, piyush}@cs.fsu.edu
Abstract— Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize. We design algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the “best” query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with a constant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.
I. INTRODUCTION
Thek-Nearest Neighbor query (kNN) is a classical problem that has been extensively studied, due to its many important applications, such as spatial databases, pattern recognition, DNA sequencing and many others. A more general version is the kNN-Join problem [7], [8], [11], [31], [33]: Given a data setP and a query set Q, for each point q ∈ Q we would like to retrieve itsk nearest neighbors from points in P .
Previous work has concentrated on the use of spatial databases or stand-alone systems. In these solution method- ologies, changes to the database engine maybe necessary; for example, new index structures or novel algorithms need to be incorporated into the engine. This requirement poses a limitation when the data set is stored in a database in which neither the spatial indices (such as the popular R-tree) nor the kNN-Join algorithms are available. Another limitation of the existing approaches that are outside the relational database, is the lack of support for query optimization, when additional query conditions are specified. Consider the query:
Retrieve the k nearest restaurants of q with both Italian food and French wines.
The results of this query could be empty (in the case where
no restaurants offer both Italian food and French wines) and the kNN retrieval is not necessary at all. It is well known that when the data is stored in a relational database, for various types of queries using only primitive SQL operators, the sophisticated query optimizer built inside the database engine will do an excellent job in finding a fairly good query execution plan [4].
These advantages of storing and processing data sets in a relational database motivate us to study the kNN query and the kNN-Join problem in a relational database environment.
Our goal is to design algorithms that could be implemented by the primitive SQL operators and require no changes to the database engine. The benefits of satisfying such constraints are threefold. First, kNN based queries could be augmented with ad-hoc query conditions dynamically, and they are auto- matically optimized by the query optimizer, without updating the query algorithm each time for specific query conditions.
Second, such an approach could be readily applied on ex- isting commercial databases, without incurring any cost for upgrading or updating the database engine, e.g., to make it support spatial indices. We denote an algorithm that satisfies these two constraints as a relational algorithm. Finally, this approach makes it possible to support thekNN-Join efficiently.
We would like to design algorithms that work well for data in multiple dimensions and easily support dynamic updates without performance degeneration.
The similar relational principle has been observed for other problems as well, e.g., approximate string joins in relational databases [15]. The main challenge in designing relational algorithms is presented by the fact that a query optimizer cannot optimize any user-defined functions (UDF) [15]. This rules out the possibility of using a UDF as a main query condition. Otherwise, the query plan always degrades to the expensive linear scan or the nested-loop join approach, which is prohibitive for large databases. For example,
SELECT TOP k * FROM Address A, Restaurant R WHERE R.Type=‘Italian’ AND R.Wine=‘French’
ORDER BY Euclidean(A.X, A.Y, R.X, R.Y)
In this query, X and Y are attributes representing the coordinates of an Address record or a Restaurant record.
“Euclidean” is a UDF that calculates the Euclidean distance between two points. Hence, this is a kNN-Join query. Even though this query does qualify as a relational algorithm, the query plan will be a nested-loop join when there are restaurants satisfying both the “Type” and “Wine” constraints, since the
query optimizer cannot optimize the UDF “Euclidean”. This implies that we may miss potential opportunities to fully optimize the queries. Generalizing this example to the kNN- query problem, the UDF-based approach will degrade to the expensive linear scan approach.
Our Contributions. In this work, we design relational algo- rithms that can be implemented using primitive SQL operators without the reliance on the UDF as a main query condition, so that the query optimizer can understand and optimize. We would like to support both approximate and exact kNN and kNN-Join queries. More specifically,
• We formalize the problems of kNN queries and kNN- Joins in a relational database (Section II).
• We provide a constant factor approximate solution based on the Z-order values from a small, constant number of randomly shifted copies of the database (Section III). We provide the theoretical analysis to show that by using only O(1) random shifts for data in any fixed dimension, our approach gives an expected constant factor approximation (in terms of the radius of thek nearest neighbor ball) with onlylog N number of page accesses for the kNN query whereN is the size of data set P . Our approach can be achieved with only primitive SQL operators.
• Using the approximate solution, we show how to get exact results for a kNN query efficiently (Section IV).
Furthermore, we show that for certain types of data distributions, our exact solution also only usesO(log N ) number of page accesses in any fixed dimension. The exact solution is also easy to implement using SQL.
• We extend our algorithms to kNN-Join queries, which can be achieved in relational databases without changes to the engines (Section V).
• We show that our algorithms easily support float values, data in arbitrary dimension and dynamic updates without any changes to the algorithms (Section VI).
• We present a comprehensive experimental study, that confirms the significant performance improvement of our approach, against the state of the art (Section VII).
In summary, we show how to find constant approximations for kNN queries in logarithm page accesses with a small constant number of random shifts in any fixed dimension; our approximate results lead to highly efficient search of the exact answers, with a simple post-processing of the results. Finally, our framework enables the efficient processing ofkNN-Joins.
We survey the related work in Section VIII.
II. PROBLEMFORMULATION
Suppose that the data set P in a d dimensional space is stored in a relational table RP. The coordinates of each point p ∈ P are stored in d attributes {Y1, . . . , Yd}. Each point could associate with other values, e.g., types of restaurants. These ad- ditional attributes are denoted by{A1, . . . , Ag} for some value g. Hence, the schema of RP is{pid, Y1, · · · , Yd, A1, · · · , Ag} wherepid corresponds to the point id from P .
kNN queries. Given a query point q and its coordinates
{X1, . . . , Xd}, let A=kNN(q, RP) be the set of k nearest neighbors ofq from RP and|x, y| be the Euclidean distance between the point x and the point y (or the corresponding records for the relational representation of the points), then:
(A ⊆ RP)∧(|A| = k)∧(∀a ∈ A, ∀r ∈ RP−A, |a, q| ≤ |r, q|).
kNN-Join. In this case, the query is a set of points denoted byQ, and it is stored in a relational table RQ. The schema of RQ is {qid, X1, · · · , Xd, B1, · · · , Bh}. Each point q in Q is represented as a records in RQ, and its coordinates are stored in attributes {X1, . . . , Xd} of s. Additional attributes of q, are represented by attributes{B1, · · · , Bh} for some value h.
Similarly,qid corresponds to the point id from Q. The goal of thekNN-Join query is to join each record s from RQ with itskNN from RP, based on the Euclidean distance defined by {X1, . . . , Xd} and {Y1, . . . , Yd}, i.e., for ∀s ∈ Q, we would like to produce pairs(s, r), for ∀r ∈ kNN(s, Rp).
Other query conditions. Additional, ad-hoc query condi- tions could be specified by any regular expression, over {A1, · · · , Ag} in case of kNN queries, or both {A1, · · · , Ag} and{B1, · · · , Bh} in case of kNN-Join queries. The relational requirement clearly implies that the query optimizer will be able to automatically optimize input queries based on these query conditions (see our discussion in Section VII).
Approximate k nearest neighbors. Suppose q’s kth nearest neighbor from P is p∗ and r∗ = |q, p∗|. Let p be the kth nearest neighbor of q for some kNN algorithm A and rp =
|q, p|. Given ǫ > 0 (or c > 1), we say that (p, rp) ∈ Rd× R is a (1 + ǫ)-approximate (or c-approximate) solution to the kNN query kNN(q, P ) if r∗≤ rp≤ (1 + ǫ)r∗ (orr∗≤ rp≤ cr∗ for some constant c). Algorithm A is called a (1 + ǫ)- approximation (orc-approximation) algorithm.
Similarly for kNN-Joins, an algorithm that finds a kth nearest neighbor point p ∈ P for each query point q ∈ Q, that is at least a (1 + ǫ)-approximation or c-approximation w.r.t kNN(q, P ) is a (1 + ǫ)-approximate or c-approximate kNN-Join algorithm. The result by this algorithm is referred to as a(1 + ǫ)-approximate or c-approximate join result.
Additional notes. The default value for d in our running examples is two, but, our proofs and algorithms are presented for any fixed dimension,d. We focus on the case where the coordinates for points are always integers. Handling floating points coordinates is discussed in Section VI. Our approach easily supports updates, which is also discussed in Section VI. The number of records in RP and RQ are denoted by N = |RP| and M = |RQ| respectively. We assume that each page can store maximally B records from either RP or RQ, and the fan-out of a B+ tree is f . Without loss of generality, we assume that points inP are all in unique locations. For the general case, our algorithms can be easily adapted by breaking the ties arbitrarily.
III. APPROXIMATION BYRANDOMSHIFTS
The z-value of a point is calculated by interleaving the binary representations of its coordinate values from the most
significant bit (msb) to the least significant bit (lsb). For example, given a point (2, 6) in a 2-d space, the binary representation of its coordinates is (010, 110). Hence, its z- value is 011100 = 28. The Z-order curve for a set of points P is obtained by connecting the points in P by the numerical order of their z-values and this produces the recursively Z- shaped curve. A key observation for the computation of z- value is that, it only requires simple bit-shift operations which are readily available or easily achievable in most commercial database engines. For a pointp, zp denotes itsz-value.
Our idea utilizes the z-values to map points in a multi- dimensional space into one dimension, and then translate the kNN search for a query point q into one dimensional range search on the z-values around the q’s z-value. In most cases, z-values preserve the spatial locality and we can findq’s kNN in a close neighborhood (say γ positions up and down) of its z-value. However, this is not always the case. In order to get a theoretical guarantee, we produceα, independent, randomly shifted copies of the input data set P and repeat the above procedure for each randomly shifted version of P .
Specifically, we define the “random shift” operation, as shifting all data points in P by a random vector −→v ∈ Rd. This operation is simply p + −→v for all p ∈ P , and denoted as P + −→v . We independently at random generate α number of vectors {−→v1, . . . , −v→α} where ∀i ∈ [1, α], −→vi ∈ Rd. Let Pi= P +−→vi,P0= P and −→v0=−→
0 . For each Pi, its points are sorted by theirz-values. Note that the random shift operation is executed only once for a data setP and used for subsequent queries. Next, for a query pointq and a data set P , let zp be the successor z-value of zq among all z-values for points in P . The γ-neighborhood of q is defined as the γ points up and down next to zp. For the special case, whenzp does not have γ points before or after, we simply take enough points after or before zp to make the total number of points in the γ-neighborhood to be 2γ + 1, including zp itself. Our kNN query algorithm essentially finds the γ-neighborhood of the query pointqi= q + −→vi inPifori ∈ [0, α] and select the final topk from the points in the unioned (α+ 1) γ-neighborhoods, with a maximum (α + 1)(2γ + 1) number of distinct points.
We denote this algorithm as the zχ-kNN algorithm and it is shown in Algorithm 1. It is important to note that in Line 5, we obtain the original point from its shifted version if it is selected to be a candidate in one of the γ-neighborhoods.
This step simplifies the final retrieval of the kNN from the candidate set C. It also implies that the candidate sets Ci’s may contain duplicate points, i.e., a point may be in the γ- neighborhood of the query point in more than one randomly shifted versions.
The zχ-kNN is clearly very simple and can be implemented efficiently using only(α + 1) one-dimensional range searches, each requires only logarithmic IOs w.r.t the number of pages occupied by the data set (N/B) if γ is some constant. More importantly, we can show that, withα = O(1) and γ = O(k), zχ-kNN gives a constant approximation kNN result in any fixed dimensiond, with only O(logf NB+k/B) page accesses.
In fact, we can show these results with just α = 1 and
Algorithm 1: zχ-kNN (point q, point sets {P0, . . . , Pα}) CandidatesC = ∅;
1
fori = 0, . . . , α do
2
Find zip as the successor ofzq+−→vi inPi;
3
LetCi beγ points up and down next to zip inPi;
4
For each pointp in Ci, letp = p − −→vi;
5
C = CS Ci;
6
LetAχ= kNN(q, C) and output Aχ.
7
γ = k. In this case, let P′be a randomly shifted version of the point setP , q′ be the correspondingly shifted query point and Aq′be thek nearest neighbors of q′inP′. Note thatAq′ = A.
Let points in P′ be {p1, p2, . . . , pN} and they are sorted by theirz-values. The successor of zq′ w.r.t the z-values inP′ is denoted aspτ for someτ ∈ [1, N ]. Clearly, the candidate set C in this case is simply C = {pτ −k, . . . , pτ +k}. Without loss of generality, we assume that bothτ − k and τ + k are within the range of [1, N ], otherwise we can simply take additional points after or before the successor as explained above in the algorithm description. We use B(c, r) to denote a ball with centerc and radius r and let rad(p, S) be the distance from the pointp to the farthest point in a point set S. We first show the following lemmas in order to claim the main theorem.
Lemma 1 LetM be the smallest box, centered at q′ contain- ingAq′ and with side length2i(wherei is assumed w.l.o.g to be an integer> 0) which is randomly placed in a quadtree T (associated with the Z-order). If the eventEj is defined asM being contained in a quadtree boxMT with side length2i+j, andMT is the smallest such quadtree box, then
Pr(Ej) ≤
1 − 1
2j
d
dj−1 2j2 −j2
Proof: The proof is in the paper’s full version [32].
Lemma 2 The zχ-kNN algorithm gives an approximate k- nearest neighbor ballB(q′, rad(q′, C)) using q′andP′ where rad(q′, C) is at most the side length of MT. HereMT is the smallest quadtree box containingM , as defined in Lemma 1.
Proof: zχ-kNN scans at least pτ −k. . . pτ +k, and picks the top k nearest neighbors to q among these candidates.
Let a be the number of points between pτ and the point with the largest z-value in B(q′, rad(q′, Aq′)), the exact k- nearest neighbor ball of q′. Similarly, let b be the number of points between pτ and the point with the smallest z- value in B(q′, rad(q′, Aq′)). Clearly, a + b = k. Note that B(q′, rad(q′, Aq′)) ⊂ M is contained inside MT, hence the number of points inside MT ≥ k. Now, pτ. . . pτ +k
must containa points inside MT. Similarly,pτ −k. . . pτ must contain at leastb points from MT. Since we have collected at least k points from MT,rad(q′, C) is upper bounded by the side length ofMT.
Lemma 3 MT is only constant factor larger than M in expectation.
Proof: The expected side length ofMT is:
E[2i+j] ≤
∞
X
j=1
2i+jPr(Ej) ≤ 2i
∞
X
j=1
(1 − 1
2j)ddj−123j−j22 , where Lemma 1 givesPr(Ej). Using Taylor’s approximation:
1 − 1
2j
d
≤ 1 − d2−j 1 + 2−1−j + d22−2j−1 and substituting it in the expectation calculation, we can show thatE[2i+j] is O(2i). The detail is in the full version [32].
These lemmas lead to the main theorem for zχ-kNN.
Theorem 1 Using α = O(1), or just one randomly shifted copy of P , and γ = O(k), zχ-kNN guarantees an expected constant factor approximatekNN result with O(logf NB+k/B) number of page accesses.
Proof: The IO cost follows directly from the fact that the one dimensional range search used by zχ-kNN takes O(logf NB + k/B) number of page accesses with a B-tree index. Let q be the point for which we just computed the approximate k-nearest neighbor ball B(q′, rad(q′, C)). From Lemma 2, we know thatrad(q′, C) is at most the side length ofMT. From Lemma 3 we know that the side length ofMT
is at most constant factor larger than rad(q′, Aq′) in P′, in expectation. Note that rad(q, Aq) in P equals rad(q′, Aq′) in P′. Hence our algorithm, in expectation, computes an approximate k-nearest neighbor ball that is only a constant factor larger than the truek-nearest neighbor ball.
Algorithm 1 clearly indicates that zχ-kNN only relies on one dimensional range search as its main building block (Line 4). One can easily implement this algorithm with just SQL statements over tables that store the point sets {P0, . . . , Pα}.
We simply use the original data set P to demonstrate the translation of the zχ-kNN into SQL. Specifically, we pre- process the table RP so that an additional attribute zval is introduced. A record r ∈RP uses {Y1, . . . , Yd} to compute its z-value and store it as the zval. Next, a clustered B+-tree indexis built on the attribute zval over the table RP. This also implies that records in RP are sorted by the zval attribute. For a given query point q, we first calculate its z-value (denoted as zval as well) and then, we find the successor record of q in RP, based on the zval attribute of Rp andq. In the sequel, we assume that such a successor record always exists. In the special case, when q’s z-value is larger than all the z-values in RP, we simply take its predecessor instead. This technical detail will be omitted. The successor can be found by:
SELECT TOP 1 * FROM RP WHERE RP.zval ≥ q.zval
Note that TOP is a ranking operator that becomes part of the standard SQL in most commercial database engines. For example, Microsoft SQL Server 2005 has the TOP operator available. Recent versions of Oracle, MySQL, and DB2 have the LIMIT operator which has the same functionality. Since table RP has a clustered B+ tree on the zval attribute, this query has only a logarithmic cost (to the size of RP) in terms of the IOs, i.e., O(logf NB). Suppose the successor record is
rs, in the next step, we retrieve all records that locate within γ positions away from rs. This can be done as:
SELECT * FROM
( SELECT TOP γ * FROM RP WHERE RP.zval
> rs.zval ORDER BY RP.zval ASC UNION
SELECT TOP γ * FROM RP WHERE RP.zval
< rs.zval ORDER BY RP.zval DESC ) AS C
Again, due to the clustered B+ tree index on the zval, this query is essentially a sequential scan around rs, with a query cost of O(logf NB +Bγ). The first SELECT clause of this query is similar to the successor query as above. The second SELECT clause is also similar but the ranking is in descending order of the zval. However, even in the second case, the query optimizer is robust enough to realize that with the clustered index on the zval, no ranking is required and it simply sequentially scansγ records backwards from rs.
The final step of our algorithm is to simply retrieve the top k records from these 2γ + 1 candidates (including rs), based on their Euclidean distance to the query pointq. Hence, the UDF ‘Euclidean’ is only applied in the last step with 2γ + 1 number of records. The complete algorithm can be expressed in one SQL statement as follows:
1 SELECT TOP k * FROM
2 ( SELECT TOP γ+ 1 * FROM RP, 3 ( SELECT TOP 1 zval FROM RP
4 WHERE RP.zval ≥ q.zval 5 ORDER BY RP.zval ASC ) AS T 6 WHERE RP.zval≥T.zval
7 ORDER BY RP.zval ASC
8 UNION
9 SELECT TOP γ * FROM RP
10 WHERE RP.zval < T.zval 11 ORDER BY RP.zval DESC ) AS C
12 ORDER BY Euclidean(q.X1,q.X2,C.Y1,C.Y2) (Q1)
For all SQL environments, line 3 to 5 need to be copied into the FROM clause in line 9. We omit this from the SQL statement to shorten the presentation. In the sequel, for similar situations in all queries, we choose to omit this. Essentially, line 2 to line 11 in Q1 select the γ-neighborhood of q in P , which will be the candidate set for finding the approximate kNN of q using the Euclidean UDF by line 1 and 12.
In general, we can pre-process {P1, . . . , Pα} similarly to get the randomly shifted tables {R1P, . . . , RαP} of RP, the above procedure could be then easily repeated and the final answer is the topk selected based on applying the Euclidean UDF over the union of the(α + 1) γ-neighborhoods retrieved.
The whole process could be done in just one SQL by repeating line2 to line 11 on each randomly shifted table and unioning them together with the SQL operator UNION.
In the work by Liao et al. [23], using Hilbert-curves a set of d + 1 deterministic shifts can be done to guarantee a constant factor approximate answer but in this case the space requirement is high, O(d) copies of the entire point set is required, and the associated query cost is increased by a multiplicative factor of d. In contrast, our approach only requires O(1) shifts for any dimension and can be adapted to yield exact answers. In practice, we just use the optimal value of α = min(d, 4) that we found experimentally to
give the best results for any fixed dimension. In addition, z- values are much simpler to calculate than the Hilbert values, especially in higher dimensions, making it suitable for the SQL environment. Finally, we emphasize that randomly shifted tables {R1P, . . . , RαP} are generated only once and could be used for multiple, different queries.
IV. EXACTkNN RETRIEVAL
The zχ-kNN algorithm finds a good approximate solution in O(logf NB + k/B) number of IOs. We denote the kNN result from the zχ-kNN algorithm as Aχ. One could further retrieve the exactkNN results based on Aχ. A straightforward solution is to perform a range query using the approximate kth nearest neighbor ball of Aχ,B(Aχ). It is defined as the ball centered at q with the radius rad(q, Aχ), i.e., B(Aχ) = B(q, rad(q, Aχ)). Clearly, rad(q, Aχ) ≥ r∗. Hence, the exact kNN points are enclosed by B(Aχ). This implies that we can find the exactkNN for q by:
SELECT TOP k * FROM RP
WHERE Euclidean(q.X1,q.X2,RP.Y1,RP.Y2)≤ rad(p, Aχ) ORDER BY Euclidean(q.X1,q.X2,RP.Y1,RP.Y2) (Q2)
This query does reduce the number of records participated in the final ranking procedure with the Euclidean UDF. However, it still needs to scan the entire table RP to calculate the Euclidean UDF first for each record, thus becomes very expensive. Fortunately, we can again utilize the z-values to find the exactkNN based on Aχ much more efficiently.
We define the kNN box for Aχ as the smallest box that fully encloses B(Aχ), denoted as M (Aχ). Generalizing this notation, let Aχi be the kNN result of zχ-kNN when it is applied only on table RiP and M (Aχi) and B(Aχi) be the corresponding kNN box and kth nearest neighbor ball from table RiP. The exact kNN results from all table RiP’s are the same and they are always equal to A. In the sequel, when the context is clear, we omit the subscripti from Aχi.
An example of the kNN box is shown in Figure 1(a). In this case, the zχ-kNN algorithm on this table RP returns the kNN result as Aχ= {p1, p2, p4} for k = 3 and p4 is thekth nearest neighbor. Hence,B(Aχ) = B(q, |q, p4|), shown as the solid circle in Figure 1(a). ThekNN box M (Aχ) is defined by the lower-left and right-upper corner pointsδℓandδh, i.e., the
∆ points in Figure 1(a). As we have argued above, the exact kNN result must be enclosed by B(Aχ), hence, also enclosed by M (Aχ). In this case, the exact kNN result is {p1, p2, p3} and the dotted circle in Figure 1(a) is the exact kth nearest neighbor ballB(A).
Lemma 4 For a rectangular box M and its lower-left and upper-right corner pointsδℓ,δh,∀p ∈ M , zp∈ [zℓ, zh], where zp stands for thez-value of a point p and zℓ,zh correspond to the z-values of δℓ andδh respectively (See Figure 1(a)).
Proof: Consider 2-d points, let p.X and p.Y be the coordinate values of p in x-axis and y-axis respectively. By p ∈ M , we have p.X ∈ [δℓ.X, δh.X] and p.Y ∈ [δℓ.Y, δh.Y ].
Since the z-value of a point is obtained by shuffling bits of its coordinate values from the msb to the lsb in an alternating
p4
zp1
γ zp4
zℓ zp2 zp5zp3zh
γ z-val p5
zq
r∗ rp
δh
δℓ
p1 q p2
p3
(a) kNN box.
p5
p8 p9
p7
p1 q p2
p3 δh
δℓ p4
p6
γh
γℓ
(b) Aχ= A.
Fig. 1. kNN box: definition and exact search.
fashion, this immediately implies that zp∈ [zℓ, zh]. The case for higher dimensions is similar.
Lemma 4 implies that thez-values of all exact kNN points will be bounded by the range[zℓ, zh], where zℓ andzh are the z-values for the δℓ andδh points ofM (Aχ), in other words:
Corollary 1 Letzℓandzhbe thez-values of δℓandδhpoints ofM (Aχ). For all p ∈ A, zp∈ [zℓ, zh].
Proof: ByB(A) ⊂ M (Aχ) and Lemma 4.
Consider the example in Figure 1(a), Corollary 1 guarantees thatzpi fori ∈ [1, 5] and zq are located between zℓ and zh
in the one-dimensionalz-value axis. The zχ-kNN essentially searchesγ number of points around both the left and the right ofzq in the z-value axis, for α number of randomly shifted copies ofP , including P itself. However, as shown in Figure 1(a), it may still miss some of the exactkNN points. Let γℓ
andγhdenote the left and rightγ-th points respectively for this search. In this case, letγ = 2, zp3 is outside the search range, specifically, zp3 > zγh. Hence, zχ-kNN could not find the exactkNN result. However, given Corollary 1, an immediate result is that one can guarantee to find the exactkNN result by considering all points with their z-values between zℓ and zh
ofM (Aχ). In fact, if zγℓ ≤ zℓandzγh ≥ zhin at least one of the table Ri’s fori = 0, . . . , α, we know for sure that zχ-kNN has successfully retrieved the exact kNN result; otherwise it may have missed some exactkNN points. The case when zℓ
andzh are both contained by zγℓ and zγh of theM (Aχ) in one of the randomly shifted tables is illustrated in Figure 1(b).
In this case, in one of the(α + 1) randomly shifted tables, the z-order curve passes through zγℓ first before any points in M (Aχ) (i.e., zγℓ ≤ zℓ) and it comes tozγh after all points inM (Aχ) have been visited (i.e., zγh ≥ zh). As a result, the candidate points considered by zχ-kNN include every point inA and the kNN result from the algorithm zχ-kNN will be exact, i.e.,Aχ = A.
When this is not the case, i.e., eitherzℓi< zγiℓ orzhi > ziγh or both in all tables RiP’s fori = 0, . . . , α, Aχ might not be equal toA. To address this issue, we first choose one of the table RjP such that its M (Aχj) contains the least number of points amongkNN boxes from all tables; then the candidate points for the exactkNN search only need to include all points contained by this box, i.e.,M (Aχj) from the table RjP.
To find the exactkNN from the box M (Aχj), we utilizing Lemma 4 and Corollary 1. In short, we calculate thezℓj and
Algorithm 2: z-kNN (point q, point sets {P0, . . . , Pα}) Let Aχi bekNN(q, Ci) whereCi is from Line4 in the
1
zχ-kNN algorithm;
Let ziℓandzhi be the z-values of the lower-left and
2
upper-right corner points for the boxM (Aχi);
Let ziγℓ andziγh be the lower bound and upper bound of
3
the z-values zχ-kNN has searched to produce Ci; if ∃i ∈ [0, α], s.t. zγiℓ ≤ zℓi andzγih ≥ zhi then
4
ReturnAχ by zχ-kNN as A;
5
else
6
Find j ∈ [0, α] such that the number of points in RjP
7
with z-values in ∈ [zℓj, zjh] is minimized;
Let Ce be those points and return A = kNN(q, Ce);
8
zhj of this box and do a range query with [zjℓ, zhj] on the zval attribute in table RjP. Since there is a clustered index built on the zval attribute, this range query becomes a sequential scan in [zℓj, zhj] which is very efficient. It essentially involves logarithmic IOs to access the path from the root to the leaf level in the B+ tree, plus some sequential IOs linear to the number of points between zℓj and zhj in table RjP. The next lemma is immediate based on the above discussion. Let zis be the successor z-value to qi’s z-value in table RiP where qi= q + −→vi,zpi be thez-value of a point p in table RiP andLi be the number of points in the range[zℓi, zhi] from table RiP. Lemma 5 For the algorithm zχ-kNN, if there exists at least onei ∈ {0, . . . , α}, such that [ziℓ, zhi] ⊆ [zγiℓ, ziγh], then Aχ= A; otherwise, we can find A ⊆ [zℓj, zjh] for some table RjP where j ∈ {0, . . . , α} and Lj = min{L0, . . . , Lα}.
These discussion leads to a simple algorithm for finding the exactkNN based on zχ-kNN and it is shown in Algorithm 2.
We denote it as the z-kNN algorithm.
Figure 2 illustrates the z-kNN algorithm. In this case, α = 1, In both table R0P and R1P, [zγℓ, zγh] does not fully contain [zℓ, zh]. Hence, algorithm zχ-kNN does not guarantee to return the exact kNN. Among the two tables, [zℓ1, zh1] contains less number of points than [zℓ0, zh0] (in Figure 2, L1< L0), hence we do the range query using[zℓ1, zh1] in table R1P and apply the
“Euclidean” UDF on the records returned by this range query to select the final, exactkNN records. Algorithm z-kNN can be achieved using SQL alone. The first step is to check if Aχ = A. This is equivalent to check if [ziℓ, zhi] ⊆ [zγiℓ, zγih] for somei. Note that we can calculate the coordinate values of δℓiandδhi for anyM (Aχi) easily with addition and subtraction, givenqi andAχi, and convert them into the z-values (zℓi and zhi) in SQL (we omit this detail for brevity). That said, this checking on table RiP can be done via:
SELECT COUNT(*) FROM RiP AS R WHERE R.zval ≥ zℓi AND R.zval ≤ zhi
AND R.pid NOT IN ( SELECT pid FROM RiP AS R1 WHERE R1.zval ≥ zγiℓ AND R1.zval ≤ ziγh) (Q3)
z1s
· · · zval
· · ·
· · ·
· · ·
· · ·
· · · R0P q0.zval
γ
γ z0ℓ
z0h
· · ·
· · ·
· · · zval
· · ·
· · ·
· · ·
· · ·
· · ·
γ γ
R1P q1.zval
z1ℓ
z1h
L1 L0
z0s
Fig. 2. Exact search of kNN.
If this count equals 0, then [zℓi, zih] ⊆ [ziγℓ, zγih]. One can easily create one SQL statement to do this checking for all tables. If there is at least one table with a count equal to 0 among R0P, . . . , RαP, then we can can safely returnAχ as the exact kNN result. Otherwise, we continue to the next step.
We first select the table (say RjP) with the smallestLi value.
This is done by (Q4), in which we find the valuej. Then we select the final kNN result from RjP among the records with z-values between [zjℓ, zhj] via (Q5).
SELECT TOP 1 ID FROM
( SELECT 0 AS ID, COUNT(*) AS L FROM R0P AS R0 WHERE R0.zval ≥ zℓ0 AND R0.zval ≤ z0h
UNION · · · UNION
SELECT α AS ID, COUNT(*) AS L FROM RαP AS Rα WHERE Rα.zval ≥ zℓα AND Rα.zval ≤ zhα
) AS T ORDER BY T.L ASC (Q4)
SELECT TOP k * FROM RjP //The ID from Q4 is j WHERE RjP.zval ≥ zjℓ AND RjP.zval ≤ zhj
ORDER BY Euclidean(qj.X1,qj.X2,RjP.Y1,RjP.Y2) (Q5)
One can combine (Q3), (Q4) and (Q5) to get the exact kNN result based on the approximate kNN result given by (Q1)(the zχ-kNN algorithm).
Finally, we would like to highlight that theoretically, for many practical distributions, the z-kNN algorithm still achieves O(logf NB + k/B) number of page accesses to report the exact kNN result in any fixed dimension. When Aχ = A, this result is immediate by Theorem 1. When Aχ 6= A, the key observation is that the number of “false positives” in the candidate setCe(line8 in Algorithm 2) is bounded by some constantO(k) for many practical data distributions. Hence, the number of candidate points that the z-kNN algorithm needs to check is still O(k), the same as the zχ-kNN algorithm.
Referring back to Figure 1(b), the false positives in akNN box M (Aχi) are defined as those points p such that p 6∈ M (Aχi) but zp ∈ [zℓi, zih]. For example, in Figure 1(b), the false positives are {p8, p6, p7}. Note that p9 is not a false positive w.r.t this kNN box as it is beyond the range [zℓi, zih].
Let P be a fixed distribution of points in a fixed d- dimensional space, i.e.,d is considered as a constant. Let P be i.i.d. from P and its size be N ≫ k ≥ 1. We will call this distribution a Doubling Distribution if it has the following property: LetS be a d dimensional ball with center pi∈ P and radiusr that contains k points. Then, the d dimensional ball S′ with centerpi and radius2r has at most νk points, for some ν = O(1). This is a similar restriction to the doubling metric restriction on metric spaces and has been used before [21].
Note that many distributions of points occurring in real data sets, including uniform distribution satisfy this property.
Theorem 2 For doubling distributions, the expected number of false positives for akNN box M (Aχi) for all i ∈ {0, . . . , α}
is O(k); the number of points that are fully enclosed by M (Aχi) is also O(k).
Proof: Without loss of generality, let ι be any fixed value from 0 to α. Let MT be the smallest quadtree box that contains M (Aχι). Let d = 1. We will now show that the number of points in MT is O(k). The expected number of points in M (Aχι) is at most kν12 + kν2 1234 + kν121478 + . . . ≤ kP∞
j=1 νj
2(j2 +1)/2 = O(k), since ν = O(1). A similar argument shows that as the dimension increases (but is still O(1)), the expected number of points in M (Aχι) is still O(k).
A two level expectation argument shows that the expected number of points in MT is still O(k). The details of this calculation are omitted for brevity. The points in MT are consecutive in z-values and the Z-order curve enters the lower- left corner and sweeps through the entireMT before it leaves through the upper right corner ofMT. Since,M (Aχι) ⊂ MT, the curve passing through the lower left corner ofM (Aχι) and ending at the upper right corner ofM (Aχι) can not go out of MT. This implies that all the false positives are contained in MT and hence the expected number of false positives is also upper bounded by O(k) in expectation.
Corollary 2 For doubling distributions, the z-kNN algorithm, using O(1) number of random shifts, retrieves the exact kNN result with O(logf NB + k/B) number of page accesses for data in any fixed dimension.
Our Theorem requires just 1 random shift. In practice, we use several random shifts to amplify the probability of getting smallerMT sizes (which in turn reduces the query cost), at the expense of increasing storage cost. We explore this trade-off in our experiments.
V. kNN-JOIN ANDDISTANCEBASEDθ-JOIN
An important feature for our approach is that we can easily and efficiently support join queries. The basic principle of finding the k nearest neighbors stays the same for the kNN- Join query over two tables RQ and RP. However, the main challenge is to achieve this using a single SQL statement. We still generate R0P, . . . , RαP in the same fashion. Concentrating on the approximate solution, we need to perform the similar procedure as shown in Section III to joining two tables (RQ
and RiP’s). The general problem is to join each individual record si from RQ to 2γ + 1 number of records from RP
aroundsi’s successor (based on thez-value) record rs(si) in RP; and then for each such group (si and the2γ + 1 records around rs(si)) we need to select the top-k records based on their Euclidean distances to si. A simple approach is to use a store procedure to implement this idea, i..e, for each record from RQ, we execute the zχ-kNN query from Section III.
If one would like to implement this join with just one SQL statement, the observation is that the second step above is
equivalent to retrieving topk records in each group based on some ranking functions and grouping conditions. This has been addressed by all commercial database engines. For example, in Microsoft SQL Server, this is achieved by the RANK() OVER (PARTITION BY . . . ORDER BY . . .) clause. Conceptu- ally, this clause assigns a rank number to each record involved in one partition or group according to its sorted order in that partition. Hence, we could simply select the tuple with the rank number that is less than or equal tok from each group.
Oracle, MySQL and DB2 all have their own operators for similar purposes.
We denote this query as the zχ-kNNJ algorithm. It is important to note that in some engines, the query optimizer may not do a good job in optimizing the top-k query for each group. An alternative approach is to implement the same idea with a store procedure, By the same argument as shown in Section III, the following result is immediate.
Lemma 6 Using α = O(1) and γ = O(k), the zχ-kNNJ algorithm guarantees an expected constant approximatekNN- Join result inO MB(logf NB +Bk) number of page accesses.
We could extend the exact z-kNN algorithm from Section IV to derive SQL statements as the z-kNNJ algorithm for the exactkNN-Join. We omit it for brevity.
Our approach is quite flexible and supports a variety of interesting queries. In particular, we demonstrate how it could be adopted to support the distance basedθ-Join query, denoted as the θ-DJoin. Our idea for the θ-DJoin is similar to the principle adopted in [24]. This query joins each recordsi ∈ RQ with the set Aθ(si) which contains all records rj ∈ RP
such that |si, rj| ≤ θ for some specified θ value. Suppose si corresponds to a query point q. An obvious observation is that the furthest point (or record) to q in Aθ(si) always has a distance that is at most θ. Hence, the ball B(q, θ) completely encloses all points from Aθ(si). Let the θ-box for a recordsi be the smallest box that enclosesB(q, θ) and denote it as M (Aθ(si)). Clearly, all points from Aθ(si) are also fully enclosed by M (Aθ(si)). By Lemma 4, we have for ∀p ∈ Aθ(si), zp ∈ [zℓ, zh], here zℓ and zh are the z- values of the bottom-left and top-right corner points of the boxM (Aθ(si)). This becomes exactly the same problem as the exact kNN search and similar ideas from z-kNN could then be applied.
VI. FLOATVALUES, HIGHERDIMENSIONS ANDUPDATES
Our method easily supports the floating point coordinates by computing thez-values explicitly for floating point coordinates in the pre-processing phase. This is done via the same bit- interleaving operation. The only problem with this approach is that the number of bits required for interleaving d-single precision co-ordinates is 256d, assuming IEEE 754 floating point representation. To avoid such a long string we can use the following trick: We fix the length of the interleaved Z-order bits to be at mostµd bits, where µ is a small constant. We scale the input data such that all coordinates lie between(0, 1). We then only interleave the firstµ bits of each coordinate after the
decimal point. For most practical data sets,µ < 32 (equivalent to more than9 digits of precision in the decimal system).
There are no changes required to our framework for dealing with data in any dimension. Though our techniques work for any dimension, however, as d increases, the number of bits required for thez-value also increases. This introduces storage overhead as well as performance degradation (the fanout of the clustered B+ tree on thezval attribute drops). Hence, for the really large dimensionality (say d > 30) one should consider using techniques that are specially designed for those purposes, for example the LSH-based method [5], [14], [25], [30].
Another nice property of our approach is the easy and efficient support of updates, both insertions and deletions. For the deletion of a record r, we simply delete r based on its pid from all tables R0, . . . , Rα. For an insertion of a record r that corresponds to a point p, we calculate the z-values for p0, . . . , pα, recall pi = p + −→vi. Next, we simply insertr into all tables R0, . . . , Rαbut with differentz-values. The database engine will take care of maintaining the clustered B+ tree indices on the zval attribute in all these tables.
Finally, our queries are parallel-friendly as they execute similar queries over multiple tables with the same schema.
VII. EXPERIMENT
We implemented all algorithms in a database server running Microsoft SQL Server 2005. The state of the art algorithm for the exactkNN queries in arbitrary dimension is the iDistance [20] algorithm. For the approximatekNN, we compare against the M edrank algorithm [12], since it is the state of the art for finding approximate kNN in relatively low dimensions and is possible to adapt it in the relational principle. We also compared against the approach using deterministic shifts with the Hilbert-curve [23], however, that method requires O(d) shifts and computing Hilbert values in different dimensions.
Both become very expensive when d increases, especially in a SQL environment. Hence, we focused on the comparison against the Medrank algorithm. We would like to emphasize they were not initially designed to be relational algorithms that are tailored for the SQL operators. Hence, the results here do not necessarily reflect their behavior when being used without the SQL constraint. We implemented iDistance [20]
using SQL, assuming that the clustering step has been pre- processed outside the database and its incremental, recursive range exploration is achieved by a stored procedure. We built the clustered B+ tree index on the one-dimensional distance value in this approach. We used the suggested 120 clusters with thek-means clustering method, and the recommended ∆r value from [20]. For theM edrank, the pre-processing step is to generateα random vectors, then create α one-dimensional lists to store the projection values of the data sets onto the α random vectors, lastly sort these α lists. We created one clustered index on each list. In the query step, we leveraged on the cursors in the SQL Server as the ‘up’ and ‘down’
pointers and used them to retrieve one record from each list in every iteration. The query process terminates when there arek elements have been retrieved satisfying the dynamic threshold
(essentially the TA algorithm [13]). Both steps are achieved by using stored procedures. We did not implement the iJoin algorithm [33], the state of the art method forkNN-Joins, us- ing only SQL, as it is not clear if that is feasible. Furthermore, it is based on iDistance and our kNN algorithm significantly outperforms the SQL-version iDistance. All experiments were executed on a Windows machine with an Intel2.33GHz CPU.
The memory of the SQL Server is set to1.5GB.
Data sets. The real data sets were obtained from [1]. Each data set represents the road-networks for a state in United States.
We have tested California, New Jersey, Maryland, Florida and others. They all exhibit similar results. Since the California data set is the largest with more than10 million points, we only show its results. By default, we randomly sample 1 million points from the California data set. We also generate two types of synthetic data sets, namely, the uniform (UN) points, and the random-clustered (R-Cluster) points. Note that the California data set is in2-dimensional space. For experiments in higher dimensional space, we use the UN and R-Cluster data sets.
Setup. Unless otherwise specified, we measured an algo- rithm’s performance by the wall clock time metric which is the total execution timeof the algorithm, i.e, including both the IO cost and the CPU cost. By default, 100 queries were generated for each experiment and we report the average for one query.
For both kNN and kNN-Join queries, the query point or the query points are generated uniformly at random in the space of the data set P . The default size of P is N = 106. The default value for k is 10. We keep α = 2, randomly shifted copies, for the UN data set and α = min{4, d}, randomly shifted copies, for the California and R-Cluster data sets, and setγ = 2k. These values of α and γ do not change for different dimensions. For theM edrank algorithms, we set its α value as2 for experiments in two dimensions and 4 for d larger than 2. This is to make fair comparison with our algorithms (with the same space overhead). The default dimensionality is2.
A. Results for thekNN Query
0 2 3 4 5 6
0 0.04 0.08 0.12
α
Running time (secs)
z−kNN zχ−kNN UN California
Fig. 4. Impact of α on the running time.
Impact of α. The number of “randomly shifted” copies has a direct impact on the running time for algorithms zχ-kNN and z-kNN, as well as their space overhead. Its effect on the running time is shown in Figure 4. For the zχ-kNN algorithm, we expect
its running time to increase linearly with α. Indeed, this is the case in Figure 4. The running time for the exact z- kNN algorithm has a more interesting trend. For the uniform UN data set, its running time also increases linearly with α, simply because it has to search more tables, and for the uniform distribution more random shifts do not change the probability of Aχ = A. This probability is essentially Pr(∃i, [zℓi, zhi] ⊆ [zγiℓ, ziγh]) and it stays the same in the uniform data for different α values. The number of false