
Fast and versatile algorithm for nearest neighbor search based on a lower bound tree

Yong-Sheng Chen a,∗, Yi-Ping Hung b,c, Ting-Fang Yen a, Chiou-Shann Fuh b

a Department of Computer Science, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan, ROC
b Department of Computer Science and Information Engineering, National Taiwan University, 1 Roosevelt Road, Section 4, Taipei 106, Taiwan, ROC
c Institute of Information Science, Academia Sinica, 128 Academia Road, Section 2, Taipei 115, Taiwan, ROC

Received 14 April 2004; received in revised form 1 June 2005; accepted 24 August 2005

Abstract

In this paper, we present a fast and versatile algorithm which can rapidly perform a variety of nearest neighbor searches. Efficiency improvement is achieved by utilizing the distance lower bound to avoid the calculation of the distance itself if the lower bound is already larger than the global minimum distance. At the preprocessing stage, the proposed algorithm constructs a lower bound tree (LB-tree) by agglomeratively clustering all the sample points to be searched. Given a query point, the lower bound of its distance to each sample point can be calculated by using the internal nodes of the LB-tree. To reduce the number of lower bounds actually calculated, the winner-update search strategy is used for traversing the tree. For further efficiency improvement, data transformation can be applied to the sample and the query points. In addition to finding the nearest neighbor, the proposed algorithm can also (i) provide the k-nearest neighbors progressively; (ii) find the nearest neighbors within a specified distance threshold; and (iii) identify neighbors whose distances to the query are sufficiently close to the minimum distance of the nearest neighbor. Our experiments have shown that the proposed algorithm can save substantial computation, particularly when the distance of the query point to its nearest neighbor is relatively small compared with its distance to most other samples (which is the case for many object recognition problems).

© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Nearest neighbor search; Lower bound tree

1. Introduction

Nearest neighbor search has been widely applied in many fields, including object recognition [1], pattern classification and clustering [2,3], image matching [4,5], data compression [6,7], texture synthesis [8], and information retrieval in database systems [9,10]. Depending on the application, each object (pattern, image block, or other kind of data) can be represented as a multi-dimensional point. Using a distance function as the measure of dissimilarity, the nearest neighbor search for the most similar object can be regarded as the closest point search in a multi-dimensional space.

∗ Corresponding author. Tel.: +886 3 5131316; fax: +886 3 5724176.
E-mail addresses: yschen@cs.nctu.edu.tw (Y.-S. Chen), hung@csie.ntu.edu.tw (Y.-P. Hung), tyen@andrew.cmu.edu (T.-F. Yen), fuh@csie.ntu.edu.tw (C.-S. Fuh).

0031-3203/$30.00 © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2005.08.016

In general, a fixed data set P of s sample points in a d-dimensional space is given, represented by P = {p_i ∈ R^d | i = 1, ..., s}. Preprocessing can be performed, if necessary, to construct a particular data structure. The goal of the nearest neighbor search is to find in P the point closest to each query point q in the d-dimensional space. A straightforward way to do so is to exhaustively compute and compare the distances between the query point and all sample points. This exhaustive search has a computational complexity of O(s · d), and when one or both of s and d are large, the process can be very time-consuming.
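As a point of reference for the timings reported later, a minimal sketch of this exhaustive baseline is shown below, assuming NumPy and an (s, d) sample array; the function name is illustrative and not from the paper.

```python
import numpy as np

def exhaustive_nn(P, q):
    """Brute-force nearest neighbor: one O(d) distance per sample, O(s*d) total."""
    dists = np.linalg.norm(P - q, axis=1)  # distances to all s samples
    i = int(np.argmin(dists))
    return i, float(dists[i])

# Example: 1000 samples in 32 dimensions.
P = np.random.rand(1000, 32)
q = np.random.rand(32)
index, distance = exhaustive_nn(P, q)
```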

Many methods have been proposed to speed up the computation of nearest neighbor search. One category of these methods partitions the data space into "regions" according to the sample points. Various shapes of region have been adopted, including hyper-rectangular buckets (the k-d tree method [11]), bounding rectangles (the R-tree [12] and the SR-tree [13] methods), bounding spheres (the SS-tree [14] and the SR-tree [13] methods), pyramids [15], and Voronoi cells [16]. A data structure, usually a tree, is used for recording and indexing these regions. Given a query point, its nearest neighbor can be found by using the tree structure. For example, Bentley (the k-d tree method [11]) partitioned the data space into hyper-rectangular buckets, each of which contains several sample points. Nearest neighbor search is then performed by a binary search for the target bucket, followed by a local search for the desired sample point in the target bucket and its neighboring buckets, which is very efficient when the dimension of the data space is small. However, as reported in Refs. [16,17], when the number of dimensions increases, its performance degrades exponentially, in an effect known as the curse of dimensionality. The main reason for this phenomenon is that more neighboring buckets must be checked when the dimension is higher. Thus, the number of sample points to be examined increases dramatically.

Another category of fast nearest neighbor search methods is the elimination-based methods (see Ref. [18] for a review). For example, Fukunaga and Narendra [19] constructed a tree structure for the sample points, and used a branch-and-bound search strategy to traverse and prune the tree structure in the query process for efficiently determining the nearest neighbor. To construct the tree structure, the set of sample points is first divided into k subsets using the k-means clustering algorithm. Each subset is then further divided into k subsets. This process is repeated, thereby creating a tree structure, with each node in the tree representing a number of sample points. The mean of these sample points and the farthest distance from the mean to these sample points are recorded. For a node in the tree, if the distance between its recorded mean and the query point, subtracted by the recorded farthest distance, is larger than the minimum distance obtained so far, the distance computation for all the sample points represented by this node can be avoided due to the triangle inequality. Brin [20] proposed a method similar to Ref. [19], constructing another kind of data structure, called GNAT, by hierarchical decomposition of the sample points. Each level in the GNAT data structure can have a different number of branches. Vidal [21] also utilized the branch-and-bound search strategy to reduce the distance calculations. Friedman et al. [22] proposed a projection-based search algorithm. On the projection coordinate, the sample points are sorted according to the values of this coordinate. They are then examined in the order of their distance (on this coordinate) to the query point. Sample points whose distance to the query point on the projection coordinate is larger than the current minimum distance (on all coordinates) can be eliminated, thereby speeding up the search process. Soleymani and Morgera [23] used an elimination technique similar to Ref. [22], but performed the elimination test on each coordinate, instead of only on the projection coordinate. Djouadi and Bouktache [24] partitioned the underlying space of the sample points into a set of cells. By calculating the distances between the query point and the centers of the cells, the nearest neighbor can be found efficiently by searching only those cells in the vicinity of the query point, rather than the whole space. Lee and Chae [25] also proposed a fast nearest neighbor search method, which uses a number of anchor sample points to eliminate many distance calculations based on the triangle inequality. In Ref. [26], McNames presented a fast nearest-neighbor algorithm based on principal axis trees. This method utilizes depth-first search and distance lower bounds to eliminate many distance calculations.

Instead of finding the exact nearest neighbor, that is, the global optimum, another research direction is to find the approximate nearest neighbor. Arya et al. [27] proposed a fast algorithm to find the (1 + r)-approximate nearest neighbor, which lies within a factor of (1 + r) of the distance between the query point and its exact nearest neighbor. They constructed a balanced box-decomposition (BBD) tree by hierarchically decomposing the underlying space. A priority search is then applied to efficiently find the approximate nearest neighbors.

There are some application-dependent issues worth considering for nearest neighbor search. For example, Faragó et al. [28] presented a fast nearest-neighbor search algorithm in dissimilarity space, in which the triangle inequality may not hold [29]. In database systems, the obtained query results may have to be checked against some other conditions in addition to the minimum distance requirement. In this case, the number k of the k-nearest neighbors cannot be specified beforehand. Hjaltason and Samet [30] proposed a fast algorithm which can provide k-nearest neighbors progressively (one by one) until the required number of nearest neighbors satisfying other conditions is obtained. In object recognition applications, the nearest neighbor of a query object is of interest only when the distance between the query object and its nearest neighbor is small enough. For this kind of application, Nene and Nayar [17] proposed a fast algorithm for searching for the nearest neighbor within a pre-specified small distance threshold in a high-dimensional space. For each dimension, their method excludes the sample points whose distances to the query point at the current dimension are larger than the distance threshold. The nearest neighbor can then be determined by examining the remaining candidates. This process may eliminate all the sample points if the distance threshold is too small. The remedy is to enlarge the distance threshold gradually. For some other applications that involve audio or image matching, each multi-dimensional sample or query point represents an autocorrelated signal. That is, the signal values in consecutive dimensions are correlated. In such cases, the search process can be accelerated by applying some data transformation to each data point, such as mean pyramid construction [5,6,31] or the wavelet transform [7].

In this paper, a novel algorithm is presented which efficiently searches for the exact nearest neighbor in Euclidean space. The proposed algorithm first preprocesses the sample points by constructing a lower bound tree (LB-tree), in which each leaf node represents a sample point and each internal node represents a mean point in a space of smaller dimension. For each query point, a lower bound of its distance to each sample point can be calculated by using a mean point of an internal node in the LB-tree. Distance calculations can be avoided for many sample points whose lower bound of the distance to the query point is larger than the minimum distance between the query point and its nearest neighbor. The whole search process is accelerated this way because the computational cost of the lower bounds is less than that of the distance. In addition to the use of an LB-tree, the following three techniques are further adopted to reduce lower-bound calculation:

1. Winner-update search: To reduce the number of nodes examined, we apply a winner-update search strategy for traversing the LB-tree. Starting from the root node of the LB-tree, the node having the minimum lower bound is replaced by its children for the following competition after the lower bounds of these children have been calculated.
2. Agglomerative clustering: When constructing the LB-tree, we use an agglomerative clustering technique to keep the number of internal nodes as small as possible while keeping the lower bound as tight as possible.
3. Data transformation: Data transformation, such as the wavelet transform or principal component analysis, is applied to each point so that the lower bound of an internal node can be further tightened, thus saving more computation.

Among the above three techniques, both the winner-update search strategy and the data transformation are performed in the search process. That means additional computations are required for each query. Fortunately, this increased burden is relatively small compared to the savings gained by these two techniques, and the overall search efficiency can be improved in most situations (see Section 7). The other technique, agglomerative clustering for LB-tree construction, can be very time-consuming. However, the LB-tree is constructed in the preprocessing stage, and it is usually worthwhile to obtain a good data structure at the expense of a large amount of computation beforehand for the sake of high search efficiency. For example, in our experiment it took about 3 h to construct the LB-tree for 36,000 sample points in a 35-dimensional space, and the search process using the constructed LB-tree was more than one thousand times faster (see Section 7.3).

Our experiments have shown that the proposed algorithm for nearest neighbor search can save a considerable amount of computation, particularly when the query point is relatively closer to its nearest neighbor than to most other samples. Furthermore, the proposed algorithm is versatile because it can deal with various types of queries. More specifically, this algorithm can speed up the progressive search for k-nearest neighbors, the search for nearest neighbors within a specified distance threshold, and the search for neighbors whose distances to the query are sufficiently close to the minimum distance of the nearest neighbor.

This paper is organized as follows. First, we introduce the data structure and the proposed algorithm for nearest neighbor search in Sections 2 and 3, respectively. Next, we present extensions of the proposed algorithm to other query types in Section 4. Then, the construction of the LB-tree is described in Section 5. Two kinds of data transformation are introduced in Section 6. Section 7 presents the experimental results of the proposed algorithm. Finally, conclusions are stated in Section 8.

2. Multilevel structure and LB-tree

This section introduces the LB-tree used in the proposed algorithm for nearest neighbor search. We will first describe the multilevel structure of a data point. The multilevel structures of all the sample points can then be used to construct the LB-tree. We shall also introduce some properties of the LB-tree, which reveal the effectiveness of the proposed algorithm.

2.1. Multilevel structure of each data point

For a point p = [p_1, p_2, ..., p_d] in a d-dimensional Euclidean space R^d, we denote its multilevel structure of L + 1 levels by {p^0, p^1, ..., p^L}, defined as follows. At each level l, p^l = [p_1, p_2, ..., p_{d_l}], which comprises the first d_l dimensions of the point p, is referred to as the level-l projection of p, where 1 ≤ d_l ≤ d and l = 0, ..., L. A trivial way to construct a d-level structure is to let d_l = l + 1, l = 0, ..., d − 1. Here, d_l is an increasing function of l because d_l = l + 1 < (l + 1) + 1 = d_{l+1}. Notice that this construction of the multilevel structure of a data point is one kind of telescoping function, which can be used to contract and extend feature vectors, as proposed by Lin et al. [32]. In this paper, the dimension at level l is set to d_l = 2^l.

Without loss of generality, we assume that the dimension of the data point space, d, is equal to 2^L. If d is not a power of 2, zero padding can be used to enlarge the dimension of the underlying space. In this way, an (L + 1)-level structure for point p can be constructed. Notice that the level-L projection, p^L, is equivalent to the point p. An example of a 4-level structure, {p^0, ..., p^3}, where d = 8, is shown in Fig. 1.
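A minimal sketch of this construction is shown below, assuming NumPy; the function name is illustrative, and each level simply keeps the first 2^l coordinates of the (zero-padded) point.

```python
import numpy as np

def multilevel_structure(p):
    """Build the (L+1)-level structure {p^0, ..., p^L} of a point p.

    The point is zero-padded so that its dimension is a power of two
    (d = 2**L); the level-l projection keeps the first d_l = 2**l coordinates.
    """
    p = np.asarray(p, dtype=float)
    L = int(np.ceil(np.log2(len(p))))
    padded = np.zeros(2 ** L)
    padded[:len(p)] = p                              # zero padding up to 2**L
    return [padded[:2 ** l] for l in range(L + 1)]   # p^0, p^1, ..., p^L

# Example: an 8-dimensional point yields projections of length 1, 2, 4, and 8.
levels = multilevel_structure([3, 1, 4, 1, 5, 9, 2, 6])
```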

Given the multilevel structures of points p and q, we can derive the following inequality property:

Property 1. The Euclidean distance between p and q is larger than or equal to the Euclidean distance between their level-l projections p^l and q^l for each level l. That is,

‖p − q‖_2 ≥ ‖p^l − q^l‖_2,  l = 0, ..., L.

Although all the properties shown in this section (and hence the proposed algorithm) are valid for any l_p norm, we adopt the l_2 norm (Euclidean distance) here. The reason is that if the data transformation described in Section 6 is applied, distances other than the l_2 norm may change. From Property 1, a lower bound of the distance ‖p − q‖_2 can be taken to be the distance ‖p^l − q^l‖_2 calculated using the level-l projections. Notice that the computational complexity of the distance ‖p^l − q^l‖_2 is less than that of the distance ‖p − q‖_2. Specifically, the complexity of calculating the distance between level-l projections is O(2^l) for l = 0, ..., L.

Fig. 1. An example of the 4-level structure of the point p, where p ∈ R^8.

Fig. 2. An example of hierarchical construction of the LB-tree. All the points in the same dark region are determined agglomeratively and are grouped into a cluster. Notice that each point is transposed in order to fit into the limited space.

2.2. LB-tree for the data set

This section introduces the LB-tree and some of its properties. To construct an LB-tree, we require the multilevel structures of all sample points p_i, i = 1, ..., s, in a data set P, where s is the number of data points in P. The LB-tree has the same number of levels as the multilevel structure, without considering the dummy root node, which is considered to have zero dimension. At level L in the LB-tree, each leaf node contains a level-L projection p^L_i, which is equivalent to the sample point p_i. From level 0 to level L − 1, the level-l projections p^l_i, i = 1, ..., s, of all the sample points can be clustered to form a hierarchy, as illustrated in Fig. 2, where L = 3 and s = 9. More details of the LB-tree construction will be given in Section 5.

Fig. 3. An example of the LB-tree.

Let s_l denote the number of clusters at level l, and let ⟨p⟩ denote the node containing the point p in the LB-tree. Each cluster C^l_j, j = 1, ..., s_l, is represented by an internal node ⟨m^l_j⟩ at level l in the LB-tree. The internal node ⟨m^l_j⟩ contains the mean point m^l_j, which is the mean of all the level-l projections of the sample points contained in this cluster, and the associated radius r^l_j, which is the radius of the smallest 2^l-dimensional hyper-sphere centered at m^l_j that covers all the level-l projections in cluster C^l_j. An example of an LB-tree is shown in Fig. 3. This smallest hyper-sphere is called the bounding sphere of C^l_j; its radius can be calculated as the maximum distance from the mean point m^l_j to all level-l projections in this cluster. The LB-tree has the following inequality property:

Property 2. Given a sample point p∗, the distance between its level-l projection, p∗^l, and its level-l ancestor, m^l_{j∗}, is smaller than or equal to the radius of the bounding sphere of cluster C^l_{j∗}. That is,

‖p∗^l − m^l_{j∗}‖_2 ≤ r^l_{j∗},  l = 0, ..., L.


Fig. 4. Illustration of the distance inequality of Eq. (1).

Notice that a leaf node is equivalent to a cluster of only one point. The radius is zero and the mean point is the sample point itself in such cases.
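To make the node contents just described concrete, here is a minimal, hypothetical Python layout for LB-tree nodes (the class name and fields are illustrative, not from the paper); the search sketches given in later sections reuse this layout.

```python
import numpy as np

class Node:
    """Hypothetical LB-tree node.

    An internal node at level l stores the cluster mean m^l_j (a 2**l-dimensional
    vector), the bounding-sphere radius r^l_j, and its children at level l + 1.
    A leaf node is a single-point cluster: radius 0 and the index of its sample.
    """
    def __init__(self, mean, radius=0.0, children=None, point_index=None):
        self.mean = np.asarray(mean, dtype=float)
        self.radius = float(radius)
        self.children = children or []
        self.point_index = point_index   # set only for leaf nodes

    @property
    def dim(self):
        return len(self.mean)            # 2**l for a level-l node
```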

Now, given a query point q, we first construct its multilevel structure as described in Section 2.1. For a sample point p∗ and its corresponding leaf node ⟨p∗⟩, its ancestor at level l in the LB-tree is denoted ⟨m^l_{j∗}⟩. As illustrated in Fig. 4, the following inequality can be derived using the triangle inequality and Properties 1 and 2:

‖p∗ − q‖_2 ≥ ‖p∗^l − q^l‖_2 ≥ ‖m^l_{j∗} − q^l‖_2 − ‖p∗^l − m^l_{j∗}‖_2 ≥ ‖m^l_{j∗} − q^l‖_2 − r^l_{j∗}.  (1)

The LB-distance d_LB(⟨m^l_j⟩, q^l) between the internal node ⟨m^l_j⟩ and q^l, the level-l projection of the query point q, is defined as

d_LB(⟨m^l_j⟩, q^l) ≡ ‖m^l_j − q^l‖_2 − r^l_j.  (2)
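A one-line sketch of Eq. (2), reusing the hypothetical Node layout above (NumPy assumed); note that the value may be negative when q^l lies inside the node's bounding sphere.

```python
import numpy as np

def lb_distance(node, q_level):
    """LB-distance of Eq. (2): ||m^l_j - q^l||_2 - r^l_j."""
    return float(np.linalg.norm(node.mean - np.asarray(q_level)) - node.radius)
```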

We then have the following inequality property:

Property 3. Given a query point q and a sample point p∗, the LB-distance between the level-l ancestor of p∗ (that is, ⟨m^l_{j∗}⟩) and the level-l projection of q is smaller than or equal to the distance between p∗ and q. That is,

d_LB(⟨m^l_{j∗}⟩, q^l) ≤ ‖p∗ − q‖_2,  l = 0, ..., L.

We now know from the above property that d_LB(⟨m^l_{j∗}⟩, q^l) is a lower bound of the distance ‖p∗ − q‖_2. Notice that the LB-distance is not a valid distance metric. A negative d_LB(⟨m^l_{j∗}⟩, q^l) implies that the query point q is located within the bounding sphere of C^l_{j∗} centered at m^l_{j∗}. Also, d_LB(⟨m^0_{j∗}⟩, q^0), d_LB(⟨m^1_{j∗}⟩, q^1), ..., d_LB(⟨m^L_{j∗}⟩, q^L) is not necessarily (and not required to be) an ascending list of lower bounds of the distance between p∗ and q.

The internal node ⟨m^l_{j∗}⟩ can have a number of descendants. Therefore, besides being a lower bound on the distance to q for any particular p∗, d_LB(⟨m^l_{j∗}⟩, q^l) is also a lower bound on the distances to q from all the sample points in the cluster C^l_{j∗} containing p∗. Hence, we have the following property:

Property 4. Let q be a query point and p̂ be a sample point. For any internal node ⟨m^l_j⟩ of the LB-tree, if

d_LB(⟨m^l_j⟩, q^l) > ‖p̂ − q‖_2,

then, for every descendant leaf node ⟨p⟩ of ⟨m^l_j⟩, we have

‖p − q‖_2 > ‖p̂ − q‖_2.

From Property 4, if the LB-distance of the internal node ⟨m^l_j⟩ is already larger than the distance between p̂ and q, all the descendant leaf nodes ⟨p⟩ of ⟨m^l_j⟩ can be eliminated from the search, since there is already a better candidate, p̂, which is closer to q.

3. Winner-update search strategy and the proposed algorithm

In our algorithm, an LB-tree of L + 1 levels has to be constructed using the data set P before running the query process. For each query point q, the goal of nearest neighbor search is to find the sample point p̂ in P such that the Euclidean distance ‖p̂ − q‖_2 is minimum. According to Property 4, if at some point the LB-distance of an internal node ⟨m^l_j⟩ is larger than the minimum distance between p̂ and q, then the nearest neighbor cannot be among the descendant samples of node ⟨m^l_j⟩. Hence, the costly calculation of their distances to q can all be saved by calculating only the less expensive LB-distance of node ⟨m^l_j⟩.

The above saving requires knowing the value ‖p̂ − q‖_2, but it is unknown beforehand which sample point p̂ is. In fact, p̂ is exactly the nearest neighbor we are looking for. To achieve the same saving effect, we adopt the winner-update search strategy, which computes the lower bounds from the root node toward the leaf nodes while traversing the LB-tree. The LB-distances of the internal nodes are calculated starting from the top level down. Since the computation cost of the LB-distance is smaller at the upper levels, and an upper-level node generally has more descendants, we can save more distance calculation if the LB-distance of an upper-level node is already larger than the minimum distance.

We now describe the winner-update search strategy that greatly reduces the number of LB-distance calculations. First, the LB-distances between q^0 and all the level-0 nodes in the LB-tree are calculated using Eq. (2). A heap data structure is then constructed using these level-0 nodes, ⟨m^0_1⟩, ⟨m^0_2⟩, ..., ⟨m^0_{s_0}⟩, with the root node of the heap, ⟨p̂⟩, being the node having the minimum LB-distance. Then, we delete the node ⟨p̂⟩ and insert its children into the heap, calculating their LB-distances and rearranging the heap to maintain the heap property. This produces a new root node with the minimum LB-distance, which becomes the new ⟨p̂⟩. The procedure of deleting ⟨p̂⟩ and inserting its children is repeated until the dimension of ⟨p̂⟩, dim(⟨p̂⟩), is equal to d. At this point, the node ⟨p̂⟩ is a leaf node containing a sample point, and its key in the heap is the actual distance ‖p̂ − q‖_2. The nearest neighbor p̂ is thus determined, since the lower bounds of the distances from all the other sample points to the query point q are already larger than ‖p̂ − q‖_2.

Fig. 5 illustrates three intermediate stages of the heap constructed during the search process, based on the LB-tree example shown in Fig. 3. Given a query point q, the LB-distances d_LB(⟨m^0_1⟩, q^0) and d_LB(⟨m^0_2⟩, q^0) for nodes ⟨m^0_1⟩ and ⟨m^0_2⟩ at level 0 are first calculated and used to construct a heap, as shown in Fig. 5(a). Suppose d_LB(⟨m^0_1⟩, q^0) is 6 and d_LB(⟨m^0_2⟩, q^0) is 2. At this point, node ⟨m^0_2⟩ is on top of the heap and will be replaced by its two children: nodes ⟨m^1_2⟩ and ⟨m^1_3⟩. Next, suppose d_LB(⟨m^1_2⟩, q^1) is 8 and d_LB(⟨m^1_3⟩, q^1) is 3. Then the heap is rearranged to maintain the heap property, and node ⟨m^1_3⟩ pops up to the top of the heap, as shown in Fig. 5(b). Again, the new top node (that is, ⟨m^1_3⟩) is replaced by its children and the heap is rearranged according to the LB-distances of the nodes. Fig. 5(c) illustrates the heap at this stage, where the LB-distances of the newly inserted nodes ⟨m^2_4⟩ and ⟨m^2_5⟩ are 9 and 4, respectively.


Fig. 5. Three intermediate stages of the heap. (a) Given a query point q, the LB-distances for nodes ⟨m^0_1⟩ and ⟨m^0_2⟩ at level 0 are calculated and used to construct a heap. (b) The node ⟨m^0_2⟩ in (a) has the smaller LB-distance and is replaced by its children: nodes ⟨m^1_2⟩ and ⟨m^1_3⟩. (c) The node ⟨m^1_3⟩ in (b) has the smallest LB-distance and is replaced by nodes ⟨m^2_4⟩ and ⟨m^2_5⟩.

The proposed algorithm is summarized below:

Proposed Algorithm for nearest neighbor search

/∗ Preprocessing Stage ∗/
(1) Given a data set P = {p_i ∈ R^d | i = 1, ..., s}
(2) Construct the LB-tree of L + 1 levels for P
/∗ Nearest Neighbor Search Stage ∗/
(3) Given a query point q ∈ R^d
(4) Construct the (L + 1)-level structure of q
(5) Insert the root node of the LB-tree into an empty heap
(6) Let ⟨p̂⟩ be the root node of the heap
(7) while dim(⟨p̂⟩) < d do
(8)   Delete node ⟨p̂⟩ from the heap
(9)   Calculate the LB-distances for all the children of ⟨p̂⟩
(10)  Insert all the children of ⟨p̂⟩ into the heap
(11)  Rearrange the heap to maintain the heap property that the root node is the node having the minimum LB-distance
(12)  Update ⟨p̂⟩ as the root node of the heap
(13) endwhile
(14) Output p̂

For conciseness of the above pseudo-code, the heap is initialized with the dummy root node of the LB-tree, instead of the level-0 nodes. This does not affect the result, since the dummy root node is replaced immediately in the first iteration of the loop by its children, that is, all the level-0 nodes. Because it adopts the winner-update search strategy, which is actually the best-first search strategy, the proposed algorithm can be regarded as a special case of the A∗ algorithm [33]. In our algorithm, the path cost term is always zero and the estimated distance to the goal node is the LB-distance, which is a lower bound of the distance from a sample point to the query point.
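The following is a minimal Python sketch of this winner-update traversal under the hypothetical Node layout from Section 2.2 (heapq-based; function and variable names are illustrative). q_levels holds the multilevel structure of the query and d is the full dimension.

```python
import heapq
import numpy as np

def winner_update_search(root, q_levels, d):
    """Best-first (winner-update) traversal of an LB-tree.

    The heap is keyed by LB-distance; when the current winner is a leaf,
    its key equals its true distance to q, so it is the exact nearest neighbor.
    """
    counter = 0                                # tie-breaker so nodes are never compared
    heap = [(0.0, counter, root)]              # dummy root, LB-distance 0
    while heap:
        lb, _, node = heap[0]                  # current winner
        if node.dim == d and node.point_index is not None:
            return node.point_index, lb        # leaf: lb is the exact distance
        heapq.heappop(heap)
        for child in node.children:            # replace the winner by its children
            level = int(round(np.log2(child.dim)))
            child_lb = float(np.linalg.norm(child.mean - q_levels[level])
                             - child.radius)
            counter += 1
            heapq.heappush(heap, (child_lb, counter, child))
    raise ValueError("empty LB-tree")
```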

4. Other query types

With slight modification, the proposed algorithm can also speed up the following three search tasks: (i) the progressive search for k-nearest neighbors, (ii) the search for k-nearest neighbors within a specified distance threshold, and (iii) the search for neighbors that are close enough to the query, compared with the nearest neighbor.

4.1. Progressive search for k-nearest neighbors

When the nearest neighbor is obtained by using the proposed algorithm, there may be some other candidates remaining in the heap. Distance calculation for these candidates has been partially performed, and we can continue the search process to determine the next nearest neighbor without starting all over again. In general, we can easily extend the proposed algorithm to find the k-nearest neighbors, 1 < k ≤ s, in the following way. Once the nearest neighbor p̂ is obtained by using the algorithm described in Section 3, we can delete it from the heap and continue the process until the second nearest neighbor is obtained. By repeating the above procedure, one can obtain the third nearest neighbor, the fourth nearest neighbor, and so on, until all the desired k-nearest neighbors are obtained. The following pseudo-code can be merged into the original algorithm, in the designated line number order, to provide k-nearest neighbors:

(6.5) for loop = 1, 2, ..., k
(15) Delete node ⟨p̂⟩ from the heap
(16) Rearrange the heap to maintain the heap property that the root node is the node having the minimum LB-distance
(17) Update ⟨p̂⟩ as the root node of the heap
(18) endfor
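A minimal Python sketch of this progressive extension is shown below, again under the hypothetical Node layout assumed earlier; it simply keeps popping winners, so the neighbors are yielded one by one in increasing order of distance.

```python
import heapq
import numpy as np

def progressive_knn(root, q_levels, d, k):
    """Yield up to k (index, distance) pairs, nearest first (Section 4.1 sketch)."""
    counter, found = 0, 0
    heap = [(0.0, counter, root)]              # dummy root, LB-distance 0
    while heap and found < k:
        lb, _, node = heapq.heappop(heap)
        if node.dim == d and node.point_index is not None:
            found += 1
            yield node.point_index, lb         # next nearest neighbor
            continue
        for child in node.children:
            level = int(round(np.log2(child.dim)))
            child_lb = float(np.linalg.norm(child.mean - q_levels[level])
                             - child.radius)
            counter += 1
            heapq.heappush(heap, (child_lb, counter, child))
```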

Notice that these k-nearest neighbors are provided incrementally. This feature is particularly useful when additional tests on the obtained nearest neighbors are required, and hence the number k cannot be known before the query process begins. Hjaltason and Samet [30] ascribed the capability of incremental k-nearest neighbor search to the heap (priority queue) employed in the algorithm. Arya et al. [27] adopted a similar progressive approach, which enumerates leaf cells of their BBD-tree in increasing order of distance from the query point and examines the data points in the cells. Traditional methods such as Ref. [19] use an array to record the first k candidates of the k-nearest neighbors during the search process. Each newly computed distance is compared against the elements in the array and is substituted for the largest element in the array that is larger than the newly computed distance. If, after the k-nearest neighbors are determined, we find that more nearest neighbors are needed, the search process has to be started all over again with a larger k. As a result, there are wasteful, duplicated distance calculations.

4.2. Search for k-nearest neighbors within a distance threshold

In many pattern recognition applications, a query object is considered to be "recognized with high confidence" only when it is sufficiently close to an object in the data set. Therefore, the distance between the query point and its nearest neighbor should be smaller than a pre-specified distance threshold T. For further speedup, the proposed algorithm can be easily extended to meet this requirement by adding the following two lines to the pseudo-code of Section 3:

(7.5) if the LB-distance of ⟨p̂⟩ is larger than T stop
(13.5) if the LB-distance of ⟨p̂⟩ is larger than T stop

When the k-nearest neighbors within the distance threshold T are needed, the additional pseudo-code for providing k-nearest neighbors, given in Section 4.1, can also be added.
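A sketch of this early-exit variant follows, under the same hypothetical Node layout; as soon as the current winner's LB-distance exceeds T, every remaining candidate is at least that far away, so the search can stop.

```python
import heapq
import numpy as np

def nn_within_threshold(root, q_levels, d, T):
    """Nearest neighbor within distance T, or None (Section 4.2 sketch)."""
    counter = 0
    heap = [(0.0, counter, root)]
    while heap:
        lb, _, node = heap[0]                  # current winner
        if lb > T:                             # the early-exit test of lines (7.5)/(13.5)
            return None
        if node.dim == d and node.point_index is not None:
            return node.point_index, lb        # nearest neighbor, within T
        heapq.heappop(heap)
        for child in node.children:
            level = int(round(np.log2(child.dim)))
            child_lb = float(np.linalg.norm(child.mean - q_levels[level])
                             - child.radius)
            counter += 1
            heapq.heappush(heap, (child_lb, counter, child))
    return None
```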

4.3. Search for neighbors close enough to the query compared with the nearest neighbor

In some applications, all the points that are sufficiently close to the query point, compared with the nearest neighbor, should be considered good matches. To achieve this goal, all the points of distance smaller than (1 + r)‖p̂ − q‖_2 have to be identified, where p̂ is the nearest neighbor and r is a small number. Our algorithm can be easily extended to provide this functionality. After the nearest neighbor p̂ and the minimum distance ‖p̂ − q‖_2 are obtained, the methods described in Sections 4.1 and 4.2 can be used to provide all the points having distance smaller than T, where k is set to s and the threshold T is set to (1 + r)‖p̂ − q‖_2.

5. Construction of LB-tree

The LB-tree plays an important role in our algorithm. It should be noted that there exists more than one method for constructing the LB-tree described in Section 2.2. Although many methods could be chosen for constructing the LB-tree, some lead to better performance than others. Hence, it is desirable to construct a “good” LB-tree in view of the need for efficiency in nearest neighbor search. Since construction of the LB-tree is performed in the preprocessing stage, its computational cost is not a major concern here, hence the efficiency of the resulting LB-tree is more important than the speed of its construction.

To construct an LB-tree, the simplest way is to directly use the multilevel structures of the sample points without clustering. In this case, there are s nodes at each level l, l = 0, ..., L, in the LB-tree. Each node ⟨m^l_i⟩, i = 1, ..., s, at level l contains exactly one level-l projection of a sample point, say p^l_i. Here, the mean point m^l_i equals p^l_i and the radius r^l_i is set to zero. All the internal nodes in the LB-tree thus constructed have only one child node, with the exception of the root, which has s child nodes.

Another method of LB-tree construction is to use the k-means clustering method [3] to hierarchically cluster the sample points, similar to what was done in Ref. [19]. At level 0, all the sample points are partitioned into k disjoint clusters according to the distances between their level-0 projections. For each cluster at level 0, the mean point and the radius of the bounding sphere can be calculated and recorded by using the level-0 projections of the sample points that belong to the same cluster. Then, the sample points that belong to the same cluster at level 0 can be further partitioned into k disjoint sub-clusters according to the distances between their level-1 projections. After partitioning all the clusters at level 0, all the obtained sub-clusters constitute the nodes at level 1 of the LB-tree. This process is repeated for the succeeding levels until level L is reached. The result is an LB-tree (though maybe not the best one) in which every internal node has k branches.

From Property 3 in Section 2.2, given a query point q, the LB-distance of each internal node (that is, the LB-distance between an internal node and the level-l projection of q) is a lower bound on the distances between q and all the sample points contained in the descendant leaf nodes of this internal node. In order to obtain a tighter lower bound, and thus reduce the number of distance calculations, the LB-distance of each internal node should be as large as possible, which in turn requires the radius of the bounding sphere, r^l_j, to be as small as possible, according to Eq. (2). From this perspective, it would seem advisable to adopt the simplest construction method mentioned above, which directly uses the multilevel structures of the sample points, since the radius of each internal node is zero. However, the number of internal nodes is proportional to the amount of memory storage and LB-distance computation required, so we would also like the number of internal nodes to be as small as possible. Although the k-means clustering method can construct an LB-tree with fewer internal nodes, the radius of a bounding sphere may be very large, since there is a chance that sample points far away from each other are grouped into one cluster. As a result, the trade-off between the number of internal nodes and the radii of the associated bounding spheres needs to be taken into consideration when constructing an LB-tree.

In this work, we use an agglomerative hierarchical clustering technique [3,34] to construct the LB-tree, in which both the number of internal nodes and the associated radii can be kept small. Details are given below.


5.1. Multi-dimensional case for each level

Suppose that the level-l projections of all the sample points have been partitioned into s_l clusters C^l_j, j = 1, ..., s_l. (For instance, the example shown in Fig. 2 contains three clusters, C^1_1, C^1_2, and C^1_3, at level 1.) Each cluster C^l_j at level l is to be further partitioned into sub-clusters independently. Notice that cluster C^l_j is a set of level-l projections. Denote the members of C^l_j as p^l_{i^l_j(k)}, k = 1, 2, ..., n^l_j, where n^l_j is the number of elements in C^l_j. Consider the example shown in Fig. 2. For l = 1 and j = 3, we have C^1_3 = {p^1_3, p^1_4, p^1_7}, where n^1_3 = 3, i^1_3(1) = 3, i^1_3(2) = 4, and i^1_3(3) = 7. Then, by using the multilevel structures of p_{i^l_j(k)}, k = 1, 2, ..., n^l_j, we denote the set of level-(l + 1) projections {p^{l+1}_{i^l_j(k)} | k = 1, 2, ..., n^l_j} by S^{l+1}_j. For the above example (l = 1 and j = 3), we have S^2_3 = {p^2_3, p^2_4, p^2_7}. Our approach is to partition S^{l+1}_j into clusters by using an agglomerative method. In the example of Fig. 2, S^2_3 is then partitioned into C^2_4 and C^2_5.

The agglomerative method begins by treating each point as a distinct cluster, and successively merges clusters together until a stopping criterion is satisfied [3]. There are two issues to be determined when adopting the agglomerative method. The first concerns how to choose clusters for merging, and the other is the stopping criterion. Suppose X and Y are two disjoint subsets of S^{l+1}_j. We define the between-cluster distance d^{l+1}_max(X, Y) of X and Y as the maximum Euclidean distance over every pair (x^{l+1}, y^{l+1}) of level-(l + 1) projections, where x^{l+1} ∈ X and y^{l+1} ∈ Y. That is,

d^{l+1}_max(X, Y) = max_{x^{l+1} ∈ X, y^{l+1} ∈ Y} ‖x^{l+1} − y^{l+1}‖_2.

The pair of clusters with minimum d^{l+1}_max is chosen for consideration to be merged, because they are the closest clusters in the sense of d^{l+1}_max. The radius of the cluster obtained by merging this pair is more likely to be small.

For the stopping criterion, we use the radius constraint, which requires that the radii of all clusters at level l be smaller than a pre-specified radius threshold r^l_T. When further cluster merging cannot satisfy the radius constraint, the agglomerative procedure is terminated. In this way we can obtain a clustering result by gradually reducing the number of clusters via merging, while the radius of each cluster gradually increases, approaching the radius threshold. If we raise the radius threshold, the number of clusters (and the resulting number of nodes) at level l + 1 will decrease. By specifying a "good" radius threshold, the number of internal nodes and their associated radii reach a compromise and a good clustering result can be obtained.

For the set S^{l+1}_j = {p^{l+1}_{i^l_j(k)} | k = 1, 2, ..., n^l_j}, we initially treat each of its members as a separate cluster. Then we calculate and sort the between-cluster distances d^{l+1}_max for every pair of clusters. The pair of clusters with the minimum d^{l+1}_max is chosen for consideration of merging. If d^{l+1}_max of this pair is larger than twice the radius threshold r^{l+1}_T, the radius of the bounding sphere of the merged cluster will definitely be larger than r^{l+1}_T, and so violates the radius constraint. In this case, clustering can be terminated because no further merging can satisfy the radius constraint. Otherwise, we tentatively merge this pair of clusters by computing the mean of the merged cluster and the radius of its bounding sphere. If the newly computed radius is indeed smaller than the radius threshold r^{l+1}_T, this pair of clusters is actually merged, and the between-cluster distances d^{l+1}_max between the merged cluster and all other clusters must be updated accordingly. If the newly computed radius is not smaller than the radius threshold, we do not merge this pair of clusters and instead choose the pair with the second minimum d^{l+1}_max for consideration. This procedure is repeated until all the pairs are examined or the d^{l+1}_max of the examined pair is larger than twice the radius threshold r^{l+1}_T.

All sample points whose level-(l + 1) projections are grouped into the same cluster at level l + 1 can be further partitioned at level l + 2 by using the same method presented above. This recursive clustering is applied until the bottom level is reached, where each sample point is treated as a separate cluster of zero radius. In this way we can construct an LB-tree satisfying the radius constraint with a specified radius threshold while trying to keep the number of internal nodes as small as possible.
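A simplified Python sketch of this radius-constrained agglomerative step is given below; it re-sorts the candidate pairs on every merge rather than maintaining the sorted distance list incrementally as the paper does, and the function name and interface are illustrative.

```python
import numpy as np
from itertools import combinations

def agglomerate(points, radius_threshold):
    """Cluster level-(l+1) projections of one parent cluster (Section 5.1 sketch).

    Merge the closest pair (complete-link distance d_max) whenever the merged
    bounding sphere stays within radius_threshold; stop when no pair qualifies.
    Returns clusters as lists of indices into `points`.
    """
    pts = [np.asarray(p, dtype=float) for p in points]
    clusters = [[i] for i in range(len(pts))]

    def radius(members):
        mean = np.mean([pts[i] for i in members], axis=0)
        return max(np.linalg.norm(pts[i] - mean) for i in members)

    def d_max(a, b):
        return max(np.linalg.norm(pts[i] - pts[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ab: d_max(clusters[ab[0]], clusters[ab[1]]))
        merged = False
        for a, b in pairs:
            if d_max(clusters[a], clusters[b]) > 2 * radius_threshold:
                break                      # no remaining pair can satisfy the constraint
            candidate = clusters[a] + clusters[b]
            if radius(candidate) < radius_threshold:
                clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
                clusters.append(candidate)
                merged = True
                break
        if not merged:
            break
    return clusters
```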

5.2. One-dimensional case for level 0

If at level 0 the number of sample points, s, is large, the number of between-cluster distances for every pair of initial clusters can be very large (O(s^2)). Hence, the agglomerative clustering process can be very time-consuming at level 0. If there is only one dimension at level 0, as in this work, we can reduce this problem with the following method. The level-0 projections of all the sample points are first sorted. Then, only pairs of neighboring level-0 projections are considered for merging, because the minimum d^0_max can appear only between neighboring level-0 projections. In this way, the number of cluster pairs to be considered is reduced from O(s^2) to O(s). When a pair of level-0 projections with minimum d^0_max is merged, these two level-0 projections are replaced with their mean in the sorted list. This process is repeated until the radius of the bounding sphere (or rather, the bounding segment) for the best merge is larger than the radius threshold r^0_T.
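Below is a small, simplified sketch of this one-dimensional pass, assuming NumPy; it tracks each cluster's segment endpoints (rather than literally replacing merged points by their mean, as the paper describes) so that the segment-radius check stays exact, and the function name is illustrative.

```python
import numpy as np

def cluster_level0(values, radius_threshold):
    """Level-0 (1-D) agglomerative clustering sketch of Section 5.2.

    Only neighboring clusters in sorted order are merge candidates; a merge is
    accepted while the merged segment's radius stays within radius_threshold.
    Returns clusters as lists of indices into `values`.
    """
    order = np.argsort(values)
    # Each cluster keeps (member indices, segment min, segment max).
    clusters = [([int(i)], float(values[i]), float(values[i])) for i in order]
    while len(clusters) > 1:
        # Extent of the would-be merged segment for each neighboring pair.
        spans = [clusters[i + 1][2] - clusters[i][1] for i in range(len(clusters) - 1)]
        i = int(np.argmin(spans))
        if spans[i] / 2.0 > radius_threshold:
            break                            # even the best merge violates the constraint
        a, b = clusters[i], clusters[i + 1]
        clusters[i:i + 2] = [(a[0] + b[0], a[1], b[2])]
    return [members for members, _, _ in clusters]
```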

5.3. Selection of the radius threshold

The radius threshold r^l_T for each level has a great influence on the construction of the LB-tree and the resulting search efficiency. Remember that a tighter LB-distance can save more distance calculations. Toward the goal of achieving tighter LB-distances, we have to lower the radius threshold r^l_T at level l in order to obtain smaller radii r^l_j for all internal nodes at this level. However, a smaller radius threshold will in general result in more clusters, which tends to increase the computational cost of the proposed algorithm (because more LB-distances have to be calculated). This is the trade-off between choosing a smaller r^l_j and choosing a smaller s_l, as mentioned in Section 5.1.

It is difficult to determine a good radius threshold beforehand because the choice depends on the distribution of the sample points. Therefore, instead of specifying a radius threshold r^0_T, the experiments shown in this paper specify the number of clusters s_0 at level 0, where s_0 < s. (For levels other than level 0, we specify radius thresholds as described below instead of specifying the number of clusters.) Hence, the stopping criterion at level 0 is modified as follows. All level-0 projections are merged agglomeratively until the number of clusters equals s_0. When the number of clusters reaches s_0, the radius of the most recently merged cluster is recorded as r^∗_T, and then used to determine the radius thresholds of the other levels. For each level l other than level 0, that is, l = 1, ..., L − 1, the radius threshold r^l_T can be determined based on r^∗_T. (Note that there is no need to perform agglomerative clustering at level L because each cluster contains only one point at this level.) In this work, we simply use r^∗_T as the radius threshold at each level l, that is, r^l_T = r^∗_T, l = 1, ..., L − 1.

6. Data transformation

This section explains how to further improve the efficiency of nearest neighbor search by applying data transformation. Recall that the Euclidean distance calculated at level l in the LB-tree is the distance in the subspace of the first 2^l dimensions. If these dimensions are not discriminative enough, meaning the projections of the sample points on this subspace are too close to each other, the distances calculated in this subspace may be almost identical for different samples, which does not help much in the search for the nearest neighbor. To alleviate this problem, we apply a transformation to the data points, transforming them into another space so that the anterior dimensions are likely to be more discriminative than the posterior dimensions. The transformation affects efficiency but not the final search result, since the Euclidean distances calculated in both spaces are the same. Moreover, because this transformation is also applied to the query points during the query process, it should be computationally inexpensive. The pseudo-code for data transformation is as follows, and is to be joined with the algorithm in Section 3:

(1.5) Transform each sample point p_i, i = 1, ..., s
(3.5) Transform the query point q

Depending on the characteristics of the data, one of the following two types of data transformation can be used. The wavelet transform with an orthonormal basis [35] is applied when the data point represents an autocorrelated signal, like an audio signal or an image block. The basis has to be orthonormal to preserve Euclidean distances. Here we adopt Haar wavelets for transforming the autocorrelated data, which is then represented at one of its multiple resolutions at each level of the multilevel structure. Readers are referred to Ref. [35] for the computation method of Haar wavelets as well as the proof that Haar wavelets form an orthonormal basis.
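A minimal sketch of an orthonormal Haar transform is shown below (NumPy assumed; the function name is illustrative, and Ref. [35] remains the authoritative description). With the 1/√2 normalization each stage is orthogonal, so Euclidean distances are preserved, and the coarse averages end up in the leading coordinates, which is what the multilevel structure exploits.

```python
import numpy as np

def haar_transform(x):
    """Full orthonormal Haar transform of a vector whose length is a power of two."""
    out = np.asarray(x, dtype=float).copy()
    n = len(out)
    while n > 1:
        half = n // 2
        a = (out[0:n:2] + out[1:n:2]) / np.sqrt(2.0)   # scaled pairwise averages
        d = (out[0:n:2] - out[1:n:2]) / np.sqrt(2.0)   # scaled pairwise differences
        out[:half], out[half:n] = a, d                 # averages first, then details
        n = half                                       # recurse on the coarse part
    return out

# The transform is a rotation, so distances between points are unchanged.
p, q = np.random.rand(8), np.random.rand(8)
assert np.isclose(np.linalg.norm(p - q),
                  np.linalg.norm(haar_transform(p) - haar_transform(q)))
```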

Another type of data transformation is principal component analysis (PCA). PCA finds a set of vectors ordered by their ability to account for the variation of the data projected onto those vectors. The data point is transformed into the space spanned by this set of vectors so that the anterior dimensions become more discriminative. This transformation is particularly useful for object recognition, where not all features are equally important.
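A small sketch of this idea follows, assuming NumPy and at least as many samples as dimensions so that the projection is a full rotation (and therefore distance-preserving); the function name and interface are illustrative.

```python
import numpy as np

def pca_transform(samples, queries):
    """Rotate data onto the principal axes, most discriminative coordinates first."""
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    # With n_samples >= d, vt is a full d x d orthogonal matrix.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)

    def transform(Y):
        return (np.asarray(Y, dtype=float) - mean) @ vt.T

    return transform(X), transform(queries)
```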

7. Experimental results

In this section, we show some experimental results for four algorithms: the exhaustive search algorithm (ES), the searching-by-slicing (SBS) algorithm proposed by Nene and Nayar [17], the BBD tree (BBDT) algorithm proposed by Arya et al. [27], and the lower bound tree (LBT) algorithm proposed in this paper. We obtained via FTP the software of the SBS algorithm and the BBDT algorithm implemented by Nene and Nayar [17] and Arya et al. [27], respectively. In SBS, the initial distance threshold is set to 0.1. To guarantee that the nearest neighbor can always be found, this threshold is enlarged gradually by adding 0.1 each time no point is found, as recommended in Ref. [17]. Remember that the BBDT algorithm finds the (1 + r)-approximate nearest neighbor, within a factor of (1 + r) of the distance between the query point and its exact nearest neighbor. To guarantee that the exact nearest neighbor can be found, we set the parameter r to 0 in the software. We implemented the ES algorithm and the LBT algorithm in the C programming language.

We used three different kinds of data distribution to examine the efficiency of these algorithms: a computer-generated set of autocorrelated data (Section 7.1), a computer-generated set of clustered Gaussian data (Section 7.2), and a real data set acquired from an object recognition system (Sections 7.3–7.5). The experiments were performed on a PC with a Pentium III 700 MHz CPU. To compare the efficiency of different algorithms, we use the execution time instead of the number of distances calculated, for the following two reasons. First, the insertion and deletion of an element in the heap, rearranging the heap, and updating node ⟨p̂⟩ give our algorithm some overhead. Second, the computational cost of the LB-distance of a node differs at different levels. Specifically, the computational cost of the LB-distance increases from the top level to the bottom level of the LB-tree.


7.1. Experiments on autocorrelated data

We now demonstrate the efficiency of the proposed algorithm with three experiments in which the following three factors vary: the number of sample points in the data set, s; the dimensionality of the underlying space, d; and the average of the minimum distances between query points and their nearest neighbors, ε_min. Autocorrelated data points were randomly generated to simulate real signals. For each data point, the value of its first dimension was chosen from a uniform distribution with extent [−1, 1], and the value of each subsequent dimension was assigned the value of the previous dimension plus normally distributed noise with zero mean and standard deviation 0.1. The value of each dimension was truncated to the extent [−1, 1]. In order to see how data transformation affects the search efficiency for autocorrelated data, we performed the nearest neighbor search twice for the SBS, BBDT, and LBT algorithms, with the Haar transform applied to each data point only the second time. The LB-tree was constructed with its number of clusters at level 0 specified as 45.

In the first experiment, we probed the algorithm efficiency by varying the number of sample points, s, in the data set. Seven data sets of s sample points, s = 800, 1600, 3200, ..., 51,200, were generated using the random process described above. The dimension of the underlying space, d, was 32. Constructing the LB-tree took 0.1, 0.5, 1.6, 7.8, 67.5, 730.8, and 6434 s for the respective data sets. Another set containing 100,000 query points was also generated using the same random process, and nearest neighbor search was then performed for each query point. Fig. 6 shows the mean query time for each algorithm, where both the Haar transform, if applied, and the search process were taken into account. It is apparent in Fig. 6 that the search efficiency of the SBS, BBDT, and LBT algorithms without the Haar transform, i.e., "SBS", "BBDT", and "LBT", can be significantly improved by applying the Haar transform, as denoted by "SBS + Haar", "BBDT + Haar", and "LBT + Haar". This clearly demonstrates that the Haar transform can help to reduce the computational cost when the data set consists of autocorrelated data. Among all the algorithms in this experiment, the proposed LBT algorithm (the "LBT + Haar" case) is the fastest, being 12.2 and 56.2 times faster than the ES algorithm when s is 800 and 51,200, respectively. When s increases from 800 to 51,200, there are more sample points scattered in the fixed space, so the average minimum distance, ε_min, decreases from 0.73 to 0.53. When the minimum distance is smaller, the LB-distance is more likely to be larger than the minimum distance of the query point q to its nearest neighbor p̂, according to Property 4. That is, more distance calculations can be avoided if ε_min is smaller, which is why the speedup factor increases as s increases.

In the second experiment, we varied the dimensionality, d, of the underlying space. Eight data sets of 10,000 sample points, with dimension d = 2, 4, 8, ..., 256, respectively, were generated. The construction time of the LB-tree was 3.1, 24.9, 26.7, 27.3, 28.5, 35.5, 37.3, and 51.8 s for the respective data sets. The same random process was also used to generate eight corresponding sets of 100,000 query points, with matched dimensions d = 2, 4, 8, ..., 256. Fig. 7 shows that the Haar transform can improve the search efficiency, particularly when d is large. The proposed LBT algorithm (the "LBT + Haar" case) outperforms the other algorithms when d is larger than 4. Interestingly, our algorithm does not suffer from the curse of dimensionality for autocorrelated data as the k-dimensional binary search tree algorithm does, as reported in Refs. [16,17]. In fact, the computational speedup of the proposed algorithm (the "LBT + Haar" case) over the ES algorithm rises from 5.6 to 64.1 as d increases from 2 to 256. The increase of d also increases the number of levels of the multilevel structure and of the constructed LB-tree. Using the Haar transform causes the anterior dimensions to contain the more significant components of the autocorrelated data, and so the lower bound of the distance can be tighter when calculated at the upper levels. Distance calculation can therefore be avoided for more sample points by calculating only the LB-distances of a few of their upper-level ancestors, with the exception of a few tough competitors. Without the Haar transform (i.e., the "LBT" case), each dimension of the data point is equally significant, and so the LB-distance at the lower levels needs to be calculated to determine the nearest neighbor, which requires more computation and degrades performance. In addition, data transformation makes the agglomerative clustering from top to bottom more effective because the anterior dimensions contain the more significant components. There are more internal nodes in the "LBT" case than in the "LBT + Haar" case, and thus efficiency is reduced. The increase of d amplifies this phenomenon, which results in the dramatic drop of the speedup factor for the non-transform case, but not for the transform case.

The third experiment demonstrates the efficiency of the algorithms with respect to ε_min. We generated a data set of 10,000 sample points in a space of dimension d = 32, where each sample point was then used to generate a query point by adding uniformly distributed noise with extent [−e, e] to each coordinate. As a result, when e is large, the distance between the query point and its nearest neighbor tends to be large as well. In this case, the construction time of the LB-tree was 29.6 s. In this experiment, eight sets of 10,000 query points were generated, with e = 0.01, 0.02, 0.04, ..., 1.28. The mean query time versus the mean of the minimum distances, ε_min, is compared among the different algorithms in Fig. 8. Again, the Haar transform improves the search efficiency and the proposed LBT algorithm (the "LBT + Haar" case) outperforms the other algorithms. As e increases from 0.01 to 1.28, ε_min increases from 0.033 to 3.838. The increase in the computational cost of the LBT algorithm is due to the fact that when the minimum distance of the nearest neighbor is already very large, the LB-distance is less likely to be larger than the minimum distance, so less distance calculation can be saved. The speedup factor of the LBT algorithm (the "LBT + Haar" case), compared with the ES algorithm, decreases from 570.4 to 0.63 in this case. Notice that when the speedup factor becomes 0.63, the noise extent, [−1.28, 1.28], is larger than the data extent, [−1, 1]. ε_min is usually relatively small in most applications, and therefore the case in which the LBT algorithm does not outperform the ES algorithm, shown in the right part of Fig. 8, is unlikely to occur.

Fig. 6. Mean query time versus size, s, of the sample point set (d = 32).

Fig. 7. Mean query time versus dimension of the underlying space, d (s = 10,000).

7.2. Experiments on clustered Gaussian data

This section shows the experimental results when the sample point set consists of clustered Gaussian data, generated to simulate an object database. We first randomly chose 100 cluster center points in a 32-dimensional space. For each cluster center point, the value of each dimension was randomly generated from a uniform distribution with extent [−1, 1]. Then, we generated 100 sample points for each cluster. Each sample point was randomly drawn from a Gaussian distribution with standard deviation σ around the cluster center point. That is, the value of each dimension of a sample point was assigned the value of the corresponding dimension of the cluster center point plus normally distributed noise with zero mean and standard deviation σ. We obtained a set of 10,000 sample points in this way. The LB-tree construction time was 78.7, 80.5, 98.7, 114, and 89.7 s, respectively, as σ ranged from 0.02, 0.04, ..., up to 0.1.

Fig. 8. Mean query time versus the mean of the minimum distances, ε_min (s = 10,000, d = 32).

Around each of the same 100 cluster center points, we randomly chose another 1000 data points from the same Gaussian distribution with standard deviation σ. These 100,000 points constituted the set of query points in the nearest neighbor search process. In total, we generated five sets of sample points and query points with different standard deviations σ ranging from 0.02 up to 0.1. Table 1 shows the mean query time of nearest neighbor search using the ES, SBS, BBDT, and LBT algorithms. Numbers in parentheses denote the speedup factor compared with the ES algorithm. The search efficiency of the proposed LBT algorithm is the best, particularly when the clusters are compact (i.e., σ is small). The reason is that when the clusters are more compact, the minimum distance from the query point to its nearest neighbor tends to be smaller compared with the distances from the query to the points in other clusters.

7.3. Experiments on an object recognition database

The database adopted in the experiments described here is the same as the one used in Refs. [1,17], which was generated from 72 images of an object taken at different poses, for a total of 100 objects. Each of these 7200 images of size 128 × 128 was represented in vector form, and each vector was normalized to unit length. An eigenspace of dimension 35 can be computed from those normalized vectors, so that by projecting onto the eigenspace, each vector can be compressed from 16,384 dimensions to 35 dimensions. In the eigenspace, the manifold for each object can be constructed using the 72 vectors belonging to that object. Each of the 100 manifolds was sampled to obtain 360 vectors, resulting in a total of s = 36,000 sampled vectors constituting the data set, where each sample point has dimension d = 35.
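The compression step can be illustrated as follows. This is only a generic principal-component (eigenspace) projection sketched in NumPy with our own naming and with mean subtraction assumed; it is not claimed to reproduce the exact procedure of Refs. [1,17].

import numpy as np

def project_to_eigenspace(images, k=35):
    # images: (n, 16384) array, one vectorized 128 x 128 image per row.
    # Rows are normalized to unit length, an eigenspace of dimension k is
    # computed from the normalized vectors, and each vector is projected
    # onto it, compressing 16,384 dimensions down to k.
    x = images / np.linalg.norm(images, axis=1, keepdims=True)
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    basis = vt[:k]                     # k leading principal directions
    coords = (x - mean) @ basis.T      # (n, k) compressed representation
    return coords, basis, mean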

To generate the set of query points, we first uniformly sample the manifolds by sampling each of the 100 manifolds at 3600 equally spaced positions. Then we add to each coordinate uniformly distributed noise with extent [−e, e]. This yields a set of 360,000 query points.

The ES, SBS, BBDT, and LBT algorithms were used to perform the nearest neighbor search. The initial distance threshold of the SBS algorithm was selected to be 0.035 in this experiment. Table 2 shows the mean query time for these algorithms when the noise extent e is 0.005, 0.01, and 0.015. Numbers in parentheses denote the speedup factor compared with the ES algorithm. In this case, the proposed LBT algorithm can tremendously speed up the nearest neighbor search process. When e is 0.005, the LBT algorithm is 1088 times faster than the ES algorithm; this is roughly 13 times faster than the result attained by the SBS algorithm and roughly 1.7 times faster than the BBDT algorithm. Furthermore, the speedup factors of the LBT algorithm compared with the SBS and BBDT algorithms rise as the noise extent e rises.

Table 1
Efficiency comparison for clustered Gaussian data with different σ

Algorithm   σ = 0.02         σ = 0.04         σ = 0.06         σ = 0.08         σ = 0.1
ES          11.938 ms
SBS         0.619 ms (19)    0.679 ms (18)    0.710 ms (17)    1.024 ms (12)    1.916 ms (6)
BBDT        0.236 ms (51)    0.243 ms (49)    0.249 ms (48)    0.265 ms (45)    0.289 ms (41)
LBT         0.047 ms (254)   0.052 ms (230)   0.077 ms (155)   0.082 ms (146)   0.115 ms (104)

Table 2
Efficiency comparison for an object recognition database

Algorithm   e = 0.005         e = 0.01          e = 0.015
ES          50.048 ms
SBS         0.613 ms (82)     1.096 ms (46)     2.161 ms (23)
BBDT        0.079 ms (634)    0.164 ms (305)    0.281 ms (178)
LBT         0.046 ms (1088)   0.072 ms (695)    0.095 ms (527)

Table 3
Mean query time (ms) for k-nearest neighbor search

Algorithm   k = 2    4        6        8        10       12       14       16       18       20
BBDT        0.087    0.121    0.204    0.363    0.587    0.875    1.192    1.532    1.885    2.251
LBT         0.053    0.069    0.091    0.121    0.158    0.202    0.252    0.302    0.356    0.408

Table 4
Mean query time (ms) for ε_T-nearest neighbor search

Algorithm   ε_T = 0   0.012   0.024   0.036   0.048   0.060   0.072   0.084   0.096   0.108
LBT         0.023     0.040   0.060   0.078   0.096   0.114   0.134   0.154   0.174   0.193

The construction time of the LB-tree is 11,679 s using those 36,000 sample points of dimension 35, and the numbers of clusters, s_l, at level l of the LB-tree are s_l = 20, 245, 2456, 4684, 5716, 7019, 36,000 for l = 0, 1, . . . , 6. In this case, although the construction time is acceptable, more work should be done to improve the efficiency of LB-tree construction when dealing with a large sample point set. The average of the minimum distances of all the sample points to their nearest neighbors is 0.017376.

7.4. Experiments for k-nearest neighbor search

This section presents the experiments for k-nearest neighbor search using the BBDT algorithm and the LBT algorithm modified as described in Section 4.1. These experiments were performed with the object recognition database described in Section 7.3, and the same LB-tree and query point set with noise extent e = 0.005 were used. However, instead of searching for only the single nearest neighbor, we searched for the k nearest neighbors of each query point.

Table 3 illustrates the mean query time for the k-nearest neighbor search, k = 2, 4, . . . , 20. As k rises to 20, the mean query time using the LBT algorithm increases to 0.408 ms, which is about 8.9 times larger than that for 1-nearest neighbor search. When the BBDT algorithm is applied, the mean query time increases to 2.251 ms as k rises to 20; that is, with the BBDT algorithm the mean query time of 20-nearest neighbor search is about 28.5 times larger than that for 1-nearest neighbor search. This shows that the proposed LBT algorithm offers an additional advantage when adopted for k-nearest neighbor search.
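For reference, the quantity measured here can also be computed by a brute-force k-nearest neighbor search such as the NumPy sketch below (our own baseline, not the modified LBT procedure of Section 4.1); it is useful for verifying the results returned by the faster algorithms.

import numpy as np

def knn_exhaustive(samples, query, k):
    # Return the indices and distances of the k samples closest to the query,
    # sorted by increasing Euclidean distance.
    dists = np.linalg.norm(samples - query, axis=1)
    idx = np.argpartition(dists, k - 1)[:k]   # the k smallest, unordered
    idx = idx[np.argsort(dists[idx])]         # order them by distance
    return idx, dists[idx]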

7.5. Experiments on searching for k-nearest neighbors within a distance threshold

This section presents the experiments for the LBT algorithm modified as described in Section 4.2. Again, the same LB-tree and query point set described in Section 7.3 were used. For each query point, at most the first 20 of its nearest neighbors within the distance threshold ε_T were obtained. As ε_T rises from 0 to 0.108 (ε_T = 0 implies the requirement of a perfect match), the mean query time goes from 0.023 to 0.193 ms, as shown in Table 4. As can be expected, a larger ε_T results in more neighbors being obtained and, hence, more computation time. In this experiment, the average number of neighbors obtained over all query points increases from 0 to 17.4.
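As a plain-NumPy reference for this search mode (again our own sketch, not the modified LBT procedure of Section 4.2), the neighbors within the threshold can be obtained as follows.

import numpy as np

def knn_within_threshold(samples, query, eps_t, k=20):
    # At most k nearest neighbors whose distance to the query does not
    # exceed eps_t, closest first; empty arrays are returned when no
    # sample lies within the threshold.
    dists = np.linalg.norm(samples - query, axis=1)
    idx = np.where(dists <= eps_t)[0]
    idx = idx[np.argsort(dists[idx])][:k]
    return idx, dists[idx]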

8. Conclusions

In this paper, we have proposed a fast algorithm for nearest neighbor search. By creating an LB-tree using the agglomerative clustering technique and then traversing the tree with the winner-update search strategy, we can efficiently find the exact nearest neighbor. To further speed up the search process, data transformations such as the Haar transform (for autocorrelated data) and PCA (for general object recognition data) can be applied to the sample points and query points. Moreover, the proposed algorithm can be easily extended to provide the k-nearest neighbors progressively, the nearest neighbors within a specified distance threshold, and neighbors whose distances are sufficiently close to that of the nearest neighbor. Our experiments show that the search process is dramatically accelerated by the proposed algorithm, especially when the distance of the query point to its nearest neighbor is relatively small compared with its distances to most other sample points. Our algorithm is therefore particularly advantageous in many object recognition applications, where a query point of an object is close to the sample points of the same object but far from the sample points of other objects. In this paper we applied our algorithm to the object recognition database used in Refs. [1,17], and the result is about 500 to 1000 times faster than exhaustive search. In addition, we believe the proposed algorithm can be very helpful in applications where each sample point represents an autocorrelated signal, such as content-based retrieval from a large audio, image, or video database, as in Refs. [9,36]. In these applications both the dimension d and the number of sample points s are large, which makes our algorithm especially appealing.

Acknowledgements

The authors would like to thank the reviewers for their helpful comments and suggestions. This work was supported in part by the Ministry of Economic Affairs, Taiwan, under Grants 93-EC-17-A-02-S1-032 and 94-EC-17-A-02-S1-032.

References

[1]H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance, Int. J. Comput. Vision 14 (1995) 5–24.

[2]T. Hastie, R. Tibshirani, Discriminant adaptive nearest neighbor classification, IEEE Trans. Pattern Anal. Mach. Intell. 18 (6) (1996) 607–616.

[3]A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.

[4]C. Tomasi, R. Manduchi, Stereo matching as a nearest-neighbor problem, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 333–340.

[5]Y.-S. Chen, Y.-P. Hung, C.-S. Fuh, Fast block matching algorithm based on the winner-update strategy, IEEE Trans. Image Process. 10 (8) (2001) 1212–1222.

[6]C.-H. Lee, L.-H. Chen, A fast search algorithm for vector quantization using mean pyramids of codewords, IEEE Trans. Commun. 43 (2/3/4) (1995) 1697–1702.

[7]C.-H. Hsieh, Y.-J. Liu, Fast search algorithms for vector quantization of images using multiple triangle inequalities and wavelet transform, IEEE Trans. Image Process. 9 (3) (2000) 321–328.

[8]L.-Y. Wei, M. Levoy, Fast texture synthesis using tree-structured vector quantization, in: Proceedings of SIGGRAPH, New Orleans, LA, July 2000, pp. 479–488.

[9]M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video content: the QBIC system, IEEE Comput. 28 (9) (1995) 23–32.

[10]S. Berchtold, C. Böhm, B. Braunmüller, D.A. Keim, H.-P. Kriegel, Fast parallel similarity search in multimedia databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, USA, May 1997, pp. 1–12.

[11]J.L. Bentley, Multidimensional binary search trees used for associative searching, Commun. ACM 18 (9) (1975) 509–517.

[12]A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Boston, MA, June 1984, pp. 47–57.

[13]N. Katayama, S. Satoh, The SR-tree: an index structure for high-dimensional nearest neighbor queries, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, USA, May 1997, pp. 369–380.

[14]D.A. White, R. Jain, Similarity indexing with the SS-tree, in: Proceedings of the International Conference on Data Engineering, New Orleans, LA, February 1996, pp. 516–523.

[15]S. Berchtold, C. Böhm, H.-P. Kriegel, The pyramid-technique: towards breaking the curse of dimensionality, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, June 1998, pp. 142–153.

[16]S. Berchtold, D.A. Keim, H.-P. Kriegel, T. Seidl, Indexing the solution space: a new technique for nearest neighbor search in high-dimensional space, IEEE Trans. Knowledge Data Eng. 12 (1) (2000) 45–57.

[17]S.A. Nene, S.K. Nayar, A simple algorithm for nearest neighbor search in high dimensions, IEEE Trans. Pattern Anal. Mach. Intell. 19 (9) (1997) 989–1003.

[18]V. Ramasubramanian, K.K. Paliwal, Fast nearest-neighbor search algorithms based on approximation-elimination search, Pattern Recognition 33 (9) (2000) 1497–1510.

[19]K. Fukunaga, P.M. Narendra, A branch and bound algorithm for computing k-nearest neighbors, IEEE Trans. Comput. 24 (1975) 750–753.

[20]S. Brin, Near neighbor search in large metric spaces, in: Proceedings of the International Conference on Very Large Data Bases, Zurich, Switzerland, September 1995, pp. 574–584.

[21]E. Vidal, New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA), Pattern Recogn. Lett. 15 (1) (1994) 1–7.

[22]J.H. Friedman, F. Baskett, L.J. Shustek, An algorithm for finding nearest neighbors, IEEE Trans. Comput. 24 (1975) 1000–1006.

[23]M.R. Soleymani, S.D. Morgera, An efficient nearest neighbor search method, IEEE Trans. Commun. COM-35 (6) (1987) 677–679.

[24]A. Djouadi, E. Bouktache, A fast algorithm for the nearest-neighbor classifier, IEEE Trans. Pattern Anal. Mach. Intell. 19 (3) (1997) 277–282.
