Single Numerical Attribute in a Bloom filter

Chapter 3 Bloom Filter Design for Range Query

3.4 Optimal Parameters for Bloom Filter

3.4.1 Single Numerical Attribute in a Bloom filter

A. “Division” scheme

For a given Bloom filter size m and number of hash functions k, the optimal dividing-range d is determined by the number of inserted numbers n (the size of the query range) and the domain range R of the numerical attribute. According to the simulation results in Fig. 3-2, the false positive rate first decreases as the dividing-range d increases, but beyond a certain point it increases linearly. The optimal dividing-range d for the “Division” scheme is the x-axis value of that point, the minimum of the false positive rate curve, which we call the inflection point.

In Fig. 3-8, the line shows the search path: if the false positive rate at d = 4 is larger than at the previous point d = 3, the search stops and takes the previous point as the optimal d for the Division scheme. In this figure, the optimal dividing-range d is 3 for a range of n = 100 in the domain R = 1000, and the false positive rate at d = 3 is about 0.003. Our scheme finds the optimal dividing-range d by locating the inflection point of the false positive rate given by Equation 3-5, with the parameters m, n and k fixed. Before the inflection point, the false positive rate decreases exponentially because the number of insertion bits is effectively compressed; after the inflection point, the error rate increases linearly because of the penalty of the dividing-range d, which groups d continuous numbers into one division. Therefore, the curve has only one local minimum, which is also the global minimum, and the optimal dividing-range d is the inflection point of the false positive curve.

Fig. 3-8 Searching the inflection point of false positive rate
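The stop-at-first-rise search described above can be sketched as follows. This is only an illustration under stated assumptions: Equation 3-5 is not reproduced in this section, so the false positive function is passed in as a callable, and the function toy_fp below is a made-up stand-in (an exponentially decaying term plus a linear penalty), not the formula of the Division scheme. The names first_minimum and toy_fp are hypothetical.

```python
from typing import Callable


def first_minimum(fp: Callable[[int], float], lo: int, hi: int) -> int:
    """Walk x = lo, lo+1, ... and stop as soon as fp(x) rises above the best value."""
    best, best_fp = lo, fp(lo)
    for x in range(lo + 1, hi + 1):
        cur = fp(x)
        if cur > best_fp:
            # The curve has started to rise, so the previous point is optimal.
            break
        best, best_fp = x, cur
    return best


def toy_fp(d: int) -> float:
    # Made-up curve (NOT Equation 3-5): exponential decay plus a linear penalty in d.
    return 0.5 ** (2 * d) + 0.001 * d


best_d = first_minimum(toy_fp, 1, 100)
print(best_d)
```

Because the curve is assumed to have a single minimum, stopping at the first rise is sufficient; on a curve with several local minima this search would return only the first one.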

B. “Overlapping” scheme

Finding the optimal shift-bit s works the same way as in Division when the size of the Bloom filter m, the number of inserted numbers n and the number of hash functions k are fixed. The difference is that the shift-bit s varies from 1 to k, whereas the dividing-range d varies from 1 to n. The same search for the inflection point on the false positive curve is used to find the optimal shift-bit s for the Overlapping scheme. In Fig. 3-9, the number of inserted elements n is 40 rather than 100, and the other parameters are the same as in Fig. 3-8. The false positive rate at s = 4 is larger than at s = 3, so s = 3 is the optimal shift-bit.

The penalty of the Overlapping scheme is the random indices checking error, and we found that the curve is always concave upward when the parameters m, n, k and s are all larger than 0. Therefore, the Overlapping scheme shares with the Division scheme the property that the curve has only one extreme value.

Fig. 3-9 Searching the inflection point of false positive rate

C. “Division-Overlapping”

Finding the optimal parameter (d, s), that is, the dividing-range d and the shift-bit s, is similar to the previous two searches, except that the search path is two-dimensional. In Fig. 3-10, a query range of n = 1000 continuous numbers was inserted into a Bloom filter, and the domain of the numerical attribute R was 10000; the size of the Bloom filter m was 512, and the number of hash functions k was 8. The search first follows the curve (d, 1), where the dividing-range d varies and the shift-bit s is held constant at 1. The inflection point (d₁, 1) on the curve (d, 1) is found with the same search method as in Division.

Next, we find the inflection point (d₂, 2) on the curve (d, 2). If the false positive rate at (d₂, 2) is larger than at (d₁, 1), the search stops and takes (d₁, 1) as the optimal parameter for Division-Overlapping; otherwise, the search moves on to the next curve in the same way. In this figure, the optimal parameter (d, s) for the Bloom filter is (6, 1). Algorithm 3-1 “Optimization-SingleAttribute” finds the optimal dividing-range d and shift-bit s for a numerical attribute with n continuous numbers and domain range R in a Bloom filter. The time complexity of finding the optimal parameter (d, s) for a numerical attribute is O(n×k). The algorithm can also be applied to the Division or the Overlapping scheme alone by constraining the choice of the dividing-range d or the shift-bit s in the search.
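The two-dimensional search described above can be sketched as a nested stop-at-first-rise loop. This is a minimal sketch, not the text of Algorithm 3-1: the false positive function of the Division-Overlapping scheme is not reproduced here, so it is passed in as a callable fp(d, s), and toy_fp is a made-up surface used only to exercise the code. The names optimize_single_attribute, best_d_for and toy_fp are hypothetical.

```python
from typing import Callable, Tuple


def optimize_single_attribute(
    fp: Callable[[int, int], float], n: int, k: int
) -> Tuple[int, int]:
    """Search d in 1..n on each curve (d, s), for s = 1..k, stopping at the first rise."""

    def best_d_for(s: int) -> Tuple[int, float]:
        # One-dimensional search along the curve (d, s) with s held constant.
        best_d, best_fp = 1, fp(1, s)
        for d in range(2, n + 1):
            cur = fp(d, s)
            if cur > best_fp:
                break
            best_d, best_fp = d, cur
        return best_d, best_fp

    best_s = 1
    best_d, best_fp = best_d_for(1)
    for s in range(2, k + 1):
        d, cur = best_d_for(s)
        if cur > best_fp:
            # The best point on this curve is worse than the previous one: stop.
            break
        best_s, best_d, best_fp = s, d, cur
    return best_d, best_s


def toy_fp(d: int, s: int) -> float:
    # Made-up surface (NOT the thesis equation) with a single minimum.
    return 0.5 ** d + 0.002 * d + 0.01 * (s - 2) ** 2


print(optimize_single_attribute(toy_fp, n=1000, k=8))
```

In the worst case the loops evaluate fp once for every pair (d, s), which matches the O(n×k) bound stated above. Fixing s = k or d = 1 in the call restricts the search to a single scheme.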

Fig. 3-10 Searching the inflection point of two-dimensional false positive curve

3.4.2 Multiple Numerical Attribute in a Bloom filter

In the case of multiple numerical attributes in a Bloom filter, the target is to find the optimal parameter (d_i, s_i) of the Division-Overlapping scheme for each numerical attribute A_i, so that the average false positive rate of all attributes is minimal.

The false positive rate of each numerical attribute A_i depends on the total number of insertion bits and on its parameter (d_i, s_i) for the Division-Overlapping scheme. Because there is more than one numerical attribute in the data set, the total number of insertion bits is the sum of the insertion bits of all numerical attributes.

Therefore, we must consider all possible values of each parameter (d_i, s_i). Based on Equation 3-11 and Equation 3-13, we modify the equation for the number of insertion bits and the false positive rate equation of the Division-Overlapping scheme for the parameter optimization of multiple numerical attributes. In Equation 3-14, the parameter “addBits” is the number of additional insertion bits, which is the sum of the insertion bits of the other attributes.

With the actual insertion bits in Bloom filter, the false positive rate of each numerical attribute in that same Bloom filter can be calculated as follows.

∑

Function 3-1 Calculate total false positive

Function 3-2 Finding the optimal parameter (d_i, s_i) of each numerical attribute

There are two functions: “TotalFalsePostiveRate”, depicted in Function 3-1, and “OptiSingleAttribute”, depicted in Function 3-2. They are used by our parameter optimization algorithm “Optimization-MultipleAttribute”, depicted in Algorithm 3-2. In Function 3-1, the false positive rate of each attribute A_i, with the same parameters (m, k), is calculated from its parameters (n_i, R_i, d_i, s_i); once the false positive rate of every attribute A_i has been calculated, the sum over all numerical attributes is returned as the total false positive rate. Because every attribute is equally likely to be queried, the target of our optimization algorithm is the minimal average false positive rate of all attributes. For simplicity, we use the total false positive rate of all numerical attributes instead of the average; since the number of attributes is fixed, minimizing one minimizes the other. In Function 3-2, line 4 of Algorithm 3-1 is changed to call Function 3-1, because our concern is the total false positive rate summed over all attributes rather than an individual false positive rate. After Function 3-2 has run, the parameter (d_i, s_i) of the numerical attribute A_i is optimal for the total false positive rate.

Our parameter optimization algorithm for multiple numerical attributes has two phases. In the first phase, “Preprocess”, of Algorithm 3-2, we assume that each numerical attribute is the only one in the Bloom filter and use Algorithm 3-1 to find its optimal parameter (d_i, s_i) and its false positive rate. After this single-attribute optimization, the false positive rate and the number of insertion bits of each numerical attribute are calculated again for the case where all attributes are inserted into the same Bloom filter. The total false positive rate of all attributes in the Bloom filter and the false positive rate of each numerical attribute are passed to the next phase. The purpose of the first phase is to calculate the false positive rate FP[i] of each attribute A_i in the data set S.

In the second phase, “Iteration”, we optimize the parameter (d_i, s_i) of each numerical attribute. The numerical attribute with the largest false positive rate is processed first, because it may have more insertion elements than the other attributes. We therefore optimize the parameter of this attribute first and recalculate the false positive rates in lines 19 to 21 of Algorithm 3-2. In line 23 of Algorithm 3-2, the termination condition is that the total false positive rate did not change during the iteration. A brute-force search over all parameter combinations of all numerical attributes takes O((n×k)^l), whereas the time complexity of our optimization algorithm is about O(l×n×k), where l is the number of numerical attributes in the data set S and n is the size of the largest query range. The purpose of the second phase is to find the individual optimal parameter (d_i, s_i) of the Division-Overlapping scheme for each numerical attribute A_i of the data set S, so that the average false positive rate over the multiple numerical attributes is minimal.

Algorithm 3-1 Optimization-SingleAttribute

Algorithm 3-2 Optimization-MultipleAttributes
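The “Iteration” phase can be illustrated with a small coordinate-descent sketch: attributes are visited from the largest individual false positive rate down, each one is re-optimized against the total while the others are held fixed, and the loop ends when the total no longer improves. This is not the text of the algorithm itself. Two things are assumptions made for the sketch: an exhaustive scan over a small candidate grid is swapped in for the stop-at-first-rise search of Function 3-2, and the toy model (toy_bits, toy_attr_fp, toy_total_fp, with the made-up sizes and domains) stands in for Equations 3-14 and 3-15, which are not reproduced here. The initial parameters are assumed to come from the “Preprocess” phase and to be contained in the candidate grid.

```python
import math
from typing import Callable, List, Sequence, Tuple

Param = Tuple[int, int]  # (d_i, s_i)


def optimize_multiple_attributes(
    total_fp: Callable[[Sequence[Param]], float],
    attr_fp: Callable[[int, Sequence[Param]], float],
    candidates: Sequence[Sequence[Param]],
    initial: Sequence[Param],
) -> List[Param]:
    params = list(initial)
    prev_total = total_fp(params)
    while True:
        # Visit attributes in descending order of their individual false positive rate.
        order = sorted(range(len(params)),
                       key=lambda i: attr_fp(i, params), reverse=True)
        for i in order:
            # Re-optimize attribute i against the total, holding the others fixed.
            params[i] = min(
                candidates[i],
                key=lambda p: total_fp(params[:i] + [p] + params[i + 1:]),
            )
        total = total_fp(params)
        if total >= prev_total:  # the total did not improve in this pass: stop
            break
        prev_total = total
    return params


# Toy model (hypothetical, NOT Equations 3-14 and 3-15).
M, K = 512, 8
SIZES = [10, 100, 1000]         # made-up query-range sizes n_i
DOMAINS = [120, 10000, 100000]  # made-up domain ranges R_i


def toy_bits(i: int, p: Param) -> int:
    d, s = p
    return math.ceil(SIZES[i] / d) * s


def toy_attr_fp(i: int, params: Sequence[Param]) -> float:
    # Every attribute sees the load of all attributes sharing the filter.
    load = sum(toy_bits(j, p) for j, p in enumerate(params))
    fill = 1.0 - math.exp(-load / M)
    d, _ = params[i]
    return min(1.0, fill ** K + d / DOMAINS[i])


def toy_total_fp(params: Sequence[Param]) -> float:
    return sum(toy_attr_fp(i, params) for i in range(len(params)))


grid = [(d, s) for d in (1, 2, 5, 10, 20, 50) for s in (1, 2, 4, 8)]
initial = [(1, 8), (1, 8), (1, 8)]
result = optimize_multiple_attributes(toy_total_fp, toy_attr_fp, [grid] * 3, initial)
print(result, toy_total_fp(result))
```

Each pass touches one attribute at a time, which is why the cost grows with the number of attributes l rather than exponentially in it; the price is that the result is a coordinate-wise optimum and is not guaranteed to match a brute-force search.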

Chapter 4 Performance Analysis

In this chapter, we run simple simulation experiments on Algorithm 3-1 and Algorithm 3-2 and evaluate the Division, Overlapping and Division-Overlapping schemes. The simulation results are compared with the original method of inserting numerical elements into a Bloom filter. There are two simulation cases: in the first, only one numerical attribute is inserted into a Bloom filter; in the second, more than one numerical attribute is inserted. In the single-attribute case, Algorithm 3-1 is applied to find the optimal parameter (d, s). In the multiple-attribute case, we use only Division-Overlapping because it is adjustable: it reduces to the pure Division scheme when the shift-bit s = k and to the pure Overlapping scheme when the dividing-range d = 1. Algorithm 3-2 is applied to find the optimal parameter (d_i, s_i) for each numerical attribute of the data set.

4.1 Single Numerical Attribute in a Bloom Filter

In the single-attribute experiments, we evaluate our schemes with a single numerical attribute in a Bloom filter. Algorithm 3-1 is used to find the optimal parameter (d, s) that minimizes the false positive rate when many numerical elements are inserted into the Bloom filter. In Fig. 4-1, a numerical attribute with domain R = 10000 was inserted into a Bloom filter whose size m was 512 and whose number of hash functions k was 8. The number of inserted elements n of the query range varied from 20 to 340. When the number of inserted elements was more than 20, the false positive rates of the Division, Overlapping and Division-Overlapping schemes were lower than that of the original scheme. Fig. 4-1 also shows that the Overlapping scheme has a lower false positive rate than the Division scheme when n is less than 260, whereas the Division scheme is better when n is larger than 260. The Division-Overlapping scheme, which combines Division and Overlapping, has the lowest false positive rate with its optimal parameter (d, s).

Fig. 4-1 Compare different schemes with optimal configuration
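For reference, the baseline in this comparison (inserting every number of the query range individually with k hash functions) can be simulated with a few lines of code. The sketch below is not the simulator used for Fig. 4-1; the hash family (salted SHA-256) and the names BloomFilter and simulated_fp_rate are assumptions made for the illustration. It inserts n continuous numbers and measures how many of the remaining numbers in the domain are reported as present.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter: every inserted number sets k hashed bit positions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit position, for readability

    def _indices(self, item: int) -> Iterator[int]:
        # Assumed hash family: SHA-256 of "<j>:<item>" for j = 0..k-1.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: int) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item: int) -> bool:
        return all(self.bits[idx] for idx in self._indices(item))


def simulated_fp_rate(m: int, k: int, start: int, n: int, domain: int) -> float:
    """Insert the n numbers start..start+n-1 and test every other number in 1..domain."""
    bf = BloomFilter(m, k)
    inserted = range(start, start + n)
    for x in inserted:
        bf.add(x)
    outside = [x for x in range(1, domain + 1) if x not in inserted]
    false_hits = sum(1 for x in outside if x in bf)
    return false_hits / len(outside)


# Parameters taken from the experiment above: m = 512, k = 8, R = 10000.
rate_small = simulated_fp_rate(m=512, k=8, start=1, n=20, domain=10000)
rate_large = simulated_fp_rate(m=512, k=8, start=1, n=340, domain=10000)
print(rate_small, rate_large)
```

With n = 340 the baseline sets up to 2720 bit positions in a 512-bit filter, so the filter is nearly saturated; this is the regime in which compressing the range with Division or Overlapping pays off.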

The next experiment examines the relationship between the number of hash functions k and the optimal parameters of the numerical attribute. In Fig. 4-2, a numerical attribute whose domain R is 10000 was inserted into an empty Bloom filter. The size of the Bloom filter m was 512, and the number of hash functions k was 8 or 12. The number of inserted numbers n of the query range varied from 100 to 1000. The theoretical false positive rate of the Division-Overlapping scheme, based on Equation 3-13, is close to the simulation results. Table 4-1 lists k and (d, s, bits), where k is the number of hash functions used in the Bloom filter, (d, s) is the optimal parameter of the Division and Overlapping schemes, and bits is the total number of insertion bits used in the Bloom filter. Each row of the table corresponds to a different k, and each column to a different number of inserted elements n of the query range. The optimal parameter (d, s) differs slightly with the number of hash functions k.

Fig. 4-2 The false positive rate of theory values and simulations results

Table 4-1 The optimal configuration of using different k

Insertion Elements (n): 100 200 300 400 500 600 700 800 900

k = 8 (d, s, bits): (3, 1, 174.33)

In Fig. 4-2, we find that the false positive rate with k = 12 is lower than with k = 8. Table 4-1 also shows that k = 12 uses more insertion bits than k = 8 in every column. Recall the optimal relation of the original Bloom filter in Equation 2-3: the optimal probability p is 1/2, which means that the false positive rate is optimal when the number of insertion bits in the Bloom filter is approximately m × ln 2. In the original Bloom filter, the optimal false positive rate is approximately 0.004 ≈ (1/2)⁸ when m = 512, n = 44 and k = 8 (n × k = 352 insertion bits), but this may not apply to our scheme, because the inserted elements in our case are n continuous numbers within the domain R of the numerical attribute, and the false positive rate function in Equation 3-13 differs from Equation 2-2. As a result, the optimal probability p may not always be 1/2 in our Division-Overlapping scheme.
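The figures quoted for the original Bloom filter can be checked numerically. Equations 2-2 and 2-3 are not reproduced in this section, so the sketch below assumes they are the standard Bloom filter relations: the probability that a bit is still 0 after inserting n elements is (1 − 1/m)^(kn), the false positive rate is (1 − p)^k, and the optimal number of hash functions is (m/n) ln 2. The value k = 8 is taken so that n × k = 352 insertion bits.

```python
import math

m, n, k = 512, 44, 8

insertion_bits = n * k                 # 352, close to m * ln 2
p_zero = (1 - 1 / m) ** (k * n)        # probability that a given bit is still 0
fp = (1 - p_zero) ** k                 # false positive rate of the plain filter
k_opt = (m / n) * math.log(2)          # optimal number of hash functions

print(insertion_bits, round(m * math.log(2), 1))
print(round(p_zero, 3), round(fp, 4), round(k_opt, 2))
```

The probability p comes out very close to 1/2 and the false positive rate close to (1/2)⁸, consistent with the approximation of 0.004 given above.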

Fig. 4-3 The false positive rate of using different k

Since the optimal number of hash functions k is not constant, the optimal k is found in a way similar to finding the inflection point on the false positive rate curve. For each k, we find the optimal parameters d and s; the search for the optimal k stops when the false positive rate of k+1 with its optimal d and s is larger than that of k. In Fig. 4-3, we compare the theoretical false positive rate of the optimal k with those of k = 8 and k = 12. The figure shows that using the optimal k with its optimal parameter (d, s) gives a lower false positive rate than a constant k = 8 or k = 12. In Table 4-1, the optimal k uses more insertion bits than k = 8 and k = 12 for every number of inserted elements. The number of insertion bits under the optimal k is not the same for different numbers of inserted elements, because of the additional false positive penalty of the Division-Overlapping scheme in Equation 3-8. One possible explanation is that the Division-Overlapping scheme changes the optimal relation that holds in the original Bloom filter between the size m, the number of inserted elements n and the number of hash functions k.

4.2 Multiple Numerical Attribute in a Bloom Filter

Instead of inserting a single numerical attribute into a Bloom filter, we inserted a set of numerical and non-numerical attributes and found the optimal parameter setting (d_i, s_i) for each numerical attribute of the data set with our proposed optimization Algorithm 3-2. The test data set is the System Defined Attributes (SDA) used in the MFPGC System [13]. Table 4-2 lists the items needed for querying a user profile; an SDA may contain one or more non-numerical and numerical attributes. A numerical attribute of the SDA may have many inserted elements in its query range. For example, the numerical attribute “Age” has the domain R = 120 (from 1 to 120), and its number of inserted elements n ranges from 1 to 10. The number of insertion bits of a numerical attribute is larger than k when the size of its query range is more than one. Because the number of inserted elements of a numerical attribute is large and the values of its query range are all continuous numbers, Division-Overlapping was applied to insert the numerical attributes of the SDA into a Bloom filter. A non-numerical attribute, in contrast, has only one inserted element, because its value is a single string.

Table 4-2 System Defined Attribute

Attribute Name    Attribute Type                                 Attribute Value
Name              String (non-numerical)                         Random String in length 20
Nick Name         String (non-numerical)                         Random String in length 20
University        String (non-numerical)                         Random String in length 20
Hobby             String (non-numerical)                         Random String in length 20
Professional      String (non-numerical)                         Random String in length 20
Age               Integer [1:120] (numerical)                    Query Range: 1~10
Year              Integer [1900:2100] (numerical)                Query Range: 2~20
Income            Integer [0:5000000] (numerical)                Query Range: 50000~5000000
Longitude         Integer [-1800000:1800000] (numerical)         Query Range: 10~100
Latitude          Integer [-900000:900000] (numerical)           Query Range: 10~100

In our multiple-attribute experiments, the SDA contained five non-numerical attributes and five numerical attributes. The five numerical attributes were inserted into the Bloom filter with our Division-Overlapping representation scheme, and our optimization Algorithm 3-2 was used to find the optimal dividing-range d and shift-bit s for each numerical attribute. In Fig. 4-4, we inserted the data set of multiple attributes listed in Table 4-2 into a Bloom filter whose size m is 512 and whose number of hash functions k is 8 or 12. The simulated false positive rates in Fig. 4-4 are consistent with the theoretical values based on Equation 3-15. The false positive rate with k = 12 and the optimal d and s for each numerical attribute of the SDA is lower than with k = 8. We found that the number of insertion bits is a key factor in the false positive rate of our Division-Overlapping scheme. As in the single-attribute case, the false positive rate is lower when more insertion bits are used. Moreover, Fig. 4-5 compares the theoretical false positive rates of the optimal k = 15 for multiple attributes with those of k = 8 and k = 12; the optimal k = 15 is better than the others.

Fig. 4-4 The false positive rate of theory values and simulations results

Fig. 4-5 The false positive rate of different k

Tables 4-3 and 4-4 summarize the optimal parameter (d_i, s_i) of each numerical attribute of the SDA and their false positive rates when 500235 elements are inserted into a Bloom filter. In Table 4-3, the number of hash functions k is 8, and the number of insertion bits differs from attribute to attribute. For a non-numerical attribute such as “Name”, “Nick Name”, “University”, “Hobby” or “Profession”, the number of insertion bits is k, because a random string of any length is hashed to only k random values by the k hash functions; for a numerical attribute such as “Age”, “Year”, “Income”, “Longitude” or “Latitude”, the number of insertion bits is decided by the parameter (d, s) of the Division-Overlapping scheme.

In Tables 4-3 and 4-4, the number of insertion bits of the “Income” attribute is larger than that of the other attributes, and its false positive rate is also larger. This is because the “Income” attribute has 500000 inserted elements; as a result, the dividing-range d has to be very large to compress the continuous numbers. The resulting dividing-range error penalty is large, so the false positive rate of the “Income” attribute becomes the main factor in the total false positive rate of all attributes in the SDA. Parameter optimization Algorithm 3-2 determines the parameter of each numerical attribute that gives the optimal average false positive rate in the test data set.

Results of the optimal parameter of using the optimal k = 15 are presented in
