Introduction - Improving Range Search Methods under Bloom Filters

1.1 Overview

In recent years, overlay and peer-to-peer (P2P) network applications, such as file sharing, Internet telephony, and group communication systems, have been replacing the traditional client-server model. These applications use distributed hash tables to locate a node or an object in the peer-to-peer network [1], [2]. Each node keeps only a partial list of object locations rather than the location of every object held by other nodes. Because the global index is replicated and distributed across the network, maintaining the distributed hash table at each node is important for building moderate-sized peer-to-peer networks that can scale.

Bloom filters have been used to summarize the description of a node in a P2P system or a set of data containing numerical and non-numerical items. PlanetP is a peer-to-peer system that uses Bloom filters to summarize the set of data items in each peer’s local index [3]. As a result, the cost of replication is reduced, and the distributed hash table held in a peer’s local cache can be kept small by compressing the Bloom filter. Reynolds and Vahdat demonstrate another application, in which Bloom filters are used to compute set intersections for keyword searches [4].

Although a Bloom filter is a space-efficient way to represent a set of data, it has difficulty representing a large range of numerical data. Because the false positive rate increases with the number of inserted items, we need a more efficient way to insert data into a Bloom filter. For example, suppose the numerical range of a data set contains the class C IP address block 140.113.214.X, which covers 256 individual addresses (X from 0 to 255). If we used a Bloom filter to represent this data set in the application described above [3], the large number of inserted elements would degrade the performance of the Bloom filter. In this thesis we discuss this kind of numerical range problem and propose schemes that improve on earlier methods for numerical ranges.
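To make this effect concrete, the standard approximation for the false positive probability of a Bloom filter with m bits, k hash functions, and n inserted elements is (1 - e^(-kn/m))^k. The sketch below evaluates it for a growing number of elements. The values of m and k are illustrative assumptions and are not taken from this thesis.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter
    with m bits, k hash functions, and n inserted elements."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Illustrative parameters (assumed, not from the thesis).
M_BITS = 1024
K_HASHES = 4

for n_items in (16, 64, 256):
    rate = false_positive_rate(M_BITS, K_HASHES, n_items)
    print(f"n = {n_items:3d}  approximate false positive rate = {rate:.4f}")
```

With m and k held fixed, the rate grows monotonically with n, which is why inserting every address of a range individually is costly.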

1.2 Related Work

Several studies have addressed how to reduce the space or comparison time of a Bloom filter while still maintaining a low false positive probability.

A Bloom filter is a bit array that represents a set of data elements by hashing each element to pseudo-random indices of the array. In other words, the bits at those indices are set to 1, and the remaining bits stay 0, to represent the set. A false positive occurs when the Bloom filter reports that an element x is in the set although it is not. In addition, every insertion of an element into the Bloom filter changes the false positive probability.

The background on Bloom filter theory is presented in Chapter 2.
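The bit-array description above can be sketched as follows. This is a minimal illustration, not the construction analyzed in this thesis: the class name, the parameters, and the use of salted SHA-256 to derive the k indices are all assumptions made for the example.

```python
import hashlib
from typing import List


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _indices(self, item: str) -> List[int]:
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the hash number i.
        result = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            result.append(int.from_bytes(digest[:8], "big") % self.m)
        return result

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=1024, k=4)
bf.add("140.113.214.7")
print("140.113.214.7" in bf)
```

An inserted element is always reported as present; an element that was never inserted may still be reported as present when all of its indices happen to be set by other elements.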

Fan, Cao, Almeida, and Broder [5] proposed an extended Bloom filter that replaces the bit array with an array of counters to support both insertion and deletion; it is therefore more scalable for summarizing a web server cache. When an element is inserted into the cache, each counter it hashes to is incremented; when an element is deleted from the cache, those counters are decremented. This avoids the problem that a standard Bloom filter loses correctness once elements are removed, because a counter, unlike a single bit, can be increased and decreased dynamically. Mitzenmacher [6] suggested the compressed Bloom filter to save bandwidth when Bloom filters are transmitted as messages. The method compresses the bit array of the Bloom filter and uses fewer hash functions. The author emphasized that the number of hash functions that minimizes the false positive probability in the uncompressed case maximizes it in the compressed case. Kirsch and Mitzenmacher [7] also proposed the distance-sensitive Bloom filter, which uses a set of locality-sensitive hash functions to answer queries of the form “Is x close to an element of S?” It can potentially speed up membership query comparisons and requires less space than the original data.
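The counter-array idea of [5] can be illustrated with a short sketch. The class and method names and the salted SHA-256 hashing are assumptions for this example; the code is not the implementation from [5], which also bounds the width of each counter.

```python
import hashlib
from typing import List


class CountingBloomFilter:
    """Bloom filter with one counter per position instead of one bit."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _indices(self, item: str) -> List[int]:
        # Assumption: salted SHA-256 stands in for k hash functions.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.counters[idx] += 1

    def remove(self, item: str) -> None:
        # Only remove items that appear to be present, so that no
        # counter is driven below zero.
        if item not in self:
            raise KeyError(item)
        for idx in self._indices(item):
            self.counters[idx] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[idx] > 0 for idx in self._indices(item))


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("page-a")
cbf.add("page-b")
cbf.remove("page-a")
print("page-b" in cbf)
```

Because removal only decrements counters, an element that shares positions with a deleted element stays visible, which a plain bit array cannot guarantee.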

Cohen and Matias [8] proposed the spectral Bloom filter and addressed the issue of element deletion over multi-sets. The spectral Bloom filter extends the original Bloom filter to estimate the multiplicities of individual elements with a small error probability. Kumar, Xu, Li, and Wang [9] presented another compact structure, the space-code Bloom filter, which supports queries about how many occurrences of an element there are in a multi-set. Both filters are approximate representations of a multi-set that allow the multiplicity of an element to be queried. The spectral Bloom filter, the space-code Bloom filter, and their variations are suitable for representing static sets whose size can be estimated before design and development.
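As a rough illustration of multiplicity queries, the sketch below keeps one counter per position and estimates how often an element occurs by taking the minimum of its counters. This is a deliberately simplified stand-in and not the full spectral or space-code Bloom filter of [8] and [9]; the class name, parameters, and hashing scheme are assumptions.

```python
import hashlib
from typing import List


class MultisetSketch:
    """Counter array that estimates element multiplicities."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _indices(self, item: str) -> List[int]:
        # Assumption: salted SHA-256 stands in for k hash functions.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.counters[idx] += 1

    def estimate(self, item: str) -> int:
        # Every occurrence increments all of the item's counters, so the
        # smallest counter is an upper bound on the true multiplicity.
        return min(self.counters[idx] for idx in self._indices(item))


sketch = MultisetSketch(m=1024, k=4)
for _ in range(3):
    sketch.add("140.113.214.7")
sketch.add("140.113.214.8")
print(sketch.estimate("140.113.214.7"))
```

The estimate never falls below the true count; it can exceed it only when every counter of the element is also shared with other elements.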

Whereas those structures represent static sets, the dynamic Bloom filter [10] and the scalable Bloom filter [11] were proposed for dynamic sets whose actual size grows over time. A dynamic Bloom filter is a bit matrix with s rows and m columns. In other words, it consists of s standard Bloom filters of length m, and it starts with s = 1 when no element has been inserted. A standard Bloom filter within it is called active if the number of elements inserted into it has not exceeded the threshold that keeps the false positive rate of a filter of size m below a constant value. When a new element is inserted and no active Bloom filter can be found, the dynamic Bloom filter increases the number of rows s. An element is therefore inserted only after an active Bloom filter has been found or a new standard Bloom filter has been added to serve as the active one. The scalable Bloom filter addresses the performance degradation that the dynamic Bloom filter suffers as the number of standard Bloom filters increases. The main difference between the two is how a new standard Bloom filter is added. Like the dynamic Bloom filter, the scalable Bloom filter is a bit matrix, but each new active Bloom filter it adds is twice the size of the previous one rather than the same size. The scalable Bloom filter thus provides lower query time and a more scalable insertion method than the dynamic Bloom filter.
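The two growth strategies described above can be compared in one sketch. A hypothetical `growth` parameter selects the behaviour: 1 adds filters of the same size, as described for the dynamic Bloom filter, and 2 doubles the size of each new filter, as described for the scalable Bloom filter. The class names, the per-filter `capacity` threshold, and the hashing are assumptions; this follows only the description in this section and is not the construction of [10] or [11].

```python
import hashlib
from typing import List


class _Slice:
    """One standard Bloom filter with an insertion threshold."""

    def __init__(self, m: int, k: int, capacity: int) -> None:
        self.m, self.k, self.capacity = m, k, capacity
        self.bits = [0] * m
        self.count = 0  # insertions so far (duplicates are counted too)

    def _indices(self, item: str) -> List[int]:
        # Assumption: salted SHA-256 stands in for k hash functions.
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[idx] for idx in self._indices(item))


class GrowingBloomFilter:
    """growth=1: same-size filters; growth=2: each new filter doubles."""

    def __init__(self, m: int, k: int, capacity: int, growth: int = 1) -> None:
        self.k = k
        self.growth = growth
        self.slices = [_Slice(m, k, capacity)]

    def add(self, item: str) -> None:
        # Without deletions only the newest filter can still be active.
        active = self.slices[-1]
        if active.count >= active.capacity:
            active = _Slice(active.m * self.growth, self.k,
                            active.capacity * self.growth)
            self.slices.append(active)
        active.add(item)

    def __contains__(self, item: str) -> bool:
        # A query must check every filter, so fewer filters mean faster queries.
        return any(item in s for s in self.slices)


dynamic = GrowingBloomFilter(m=64, k=3, capacity=4, growth=1)
scalable = GrowingBloomFilter(m=64, k=3, capacity=4, growth=2)
for i in range(10):
    dynamic.add(f"item-{i}")
    scalable.add(f"item-{i}")
print(len(dynamic.slices), len(scalable.slices))
```

For the same ten insertions the doubling variant ends up with fewer filters to scan per query, which is the advantage the text attributes to the scalable Bloom filter.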

1.3 Objective

Although many studies have improved the data structure of the Bloom filter, little information is available on methods for inserting elements into it. Previous work has proposed many variations of the standard Bloom filter, but the question of how to insert elements efficiently remains open. The purpose of this thesis is to investigate the effect of inserting many numerical elements, which increases the false positive probability, and to propose three schemes, Division, Overlapping, and their combination Division-Overlapping, that improve the insertion of numerical elements.

In this thesis, we address the issue of numerical range insertion using a Bloom filter and show how a numerical range containing many elements can be represented and stored in a Bloom filter with less space. Our representation scheme may increase the efficiency of the Bloom filter in query time and space when numerical elements make up a large percentage of a data set. Our contribution is an efficient scheme for mapping numerical ranges to a Bloom filter, together with suggestions for setting the parameters of our methods.

1.4 Summary

The remainder of this thesis is organized as follows. Chapter 2 presents the background of Bloom filter theory and the definition of a range query. In Chapter 3, we describe our methods for representing a numeric range and their analytic models. In Chapter 4, we evaluate the effectiveness of our methods and discuss the simulation results. Finally, we give our conclusions in Chapter 5.
