Results of Filtering Unnecessary Cache Searches

Chapter 4 Evaluation

4.2 Results of Filtering Unnecessary Cache Searches

Fig. 11 shows the filtered snooping rate achieved by the classic Bloom filter, CBF, BBF, and DLCBF for each benchmark. The filtered rate is the fraction of snoops screened out by the filter, that is, the unnecessary snoops that a cache does not need to handle. The experiments were performed on multi-core systems with two, four, eight, and sixteen processors. We chose the "simmedium" data set of the PARSEC benchmark suite so that the proposed architecture could be evaluated in a reasonable time. The rightmost column of Fig. 11 gives the geometric mean of the filtered rates observed in the benchmarks. We use the geometric mean instead of the arithmetic mean because the total number of snoops differs significantly between benchmarks and the results shown here are ratios relative to those totals.

The classic Bloom filter performs poorly in all benchmarks. The reason is that it suffers from array saturation: it does not support “resetting” a slot when an element is removed from the set. In this context, removing an element corresponds to a cache line invalidation or eviction. Its mapping array therefore saturates even when the data set is small, and the classic Bloom filter loses its filtering ability very quickly. The banked Bloom filter suffers from the same problem, although it supports faster query access. On the other hand, because its counters provide the “resetting” ability, CBF reduces the array saturation rate. With lower saturation, CBF filters better than the classic and banked Bloom filters. Compared with the classic Bloom filter, CBF improves the filtered rate by up to 3.57X.
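The saturation argument can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the array size `m`, the number of hashes `k`, and the BLAKE2-based `slot_indexes` helper are hypothetical and do not reproduce the hardware evaluated in this chapter.

```python
import hashlib


def slot_indexes(line_addr: int, k: int, m: int) -> list:
    """Map a cache-line address to k slots (illustrative hashing only)."""
    slots = []
    for i in range(k):
        digest = hashlib.blake2b(f"{i}:{line_addr}".encode(), digest_size=8).digest()
        slots.append(int.from_bytes(digest, "big") % m)
    return slots


class ClassicBloomFilter:
    """Bit array: slots can be set but never safely cleared."""

    def __init__(self, m: int = 256, k: int = 2):
        self.m, self.k = m, k
        self.bits = [0] * m

    def insert(self, line_addr: int) -> None:
        for s in slot_indexes(line_addr, self.k, self.m):
            self.bits[s] = 1

    def remove(self, line_addr: int) -> None:
        # An eviction or invalidation cannot clear a bit,
        # because another cached line may map to the same slot.
        pass

    def may_contain(self, line_addr: int) -> bool:
        return all(self.bits[s] for s in slot_indexes(line_addr, self.k, self.m))

    def occupancy(self) -> float:
        return sum(1 for b in self.bits if b) / self.m


class CountingBloomFilter:
    """Counter array: a slot returns to zero once every line using it is gone."""

    def __init__(self, m: int = 256, k: int = 2):
        self.m, self.k = m, k
        self.counters = [0] * m

    def insert(self, line_addr: int) -> None:
        for s in slot_indexes(line_addr, self.k, self.m):
            self.counters[s] += 1

    def remove(self, line_addr: int) -> None:
        # Assumes the caller only removes lines that were inserted earlier.
        for s in slot_indexes(line_addr, self.k, self.m):
            self.counters[s] -= 1

    def may_contain(self, line_addr: int) -> bool:
        return all(self.counters[s] > 0 for s in slot_indexes(line_addr, self.k, self.m))

    def occupancy(self) -> float:
        return sum(1 for c in self.counters if c) / self.m


def churn(filt, n_lines: int = 2000) -> float:
    """Insert n_lines addresses, evict them all, and report slot occupancy."""
    for addr in range(n_lines):
        filt.insert(addr)
    for addr in range(n_lines):
        filt.remove(addr)
    return filt.occupancy()
```

Calling `churn` on both filters shows the difference: the classic filter stays occupied after every line has been evicted, so it can no longer screen out snoops, while the counting filter returns to an empty array.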

With an additional hash function and a hard-wired permutation table, DLCBF divides the whole memory space and maps it to several storage arrays. This design prevents data from different memory regions from colliding with each other. Inside each storage array, DLCBF uses multiple hash functions to prevent collisions. In addition, the counting feature gives a much lower probability of collision and allows an element to be reset more effectively. Therefore, DLCBF further improves the filtered rate and significantly outperforms the classic Bloom filter, CBF, and BBF. On average, the filtered rate improves by 81.99% and 31.36% compared with the classic Bloom filter and CBF, respectively.
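The partitioning idea described above can be sketched roughly in Python. This is only an illustration under stated assumptions, not the DLCBF hardware itself: the class name `TwoLevelCountingFilter`, the array count, the counters per array, the value of `k`, and the BLAKE2-based `_hash` helper are all hypothetical, and the hard-wired permutation table is not modeled.

```python
import hashlib


def _hash(tag: str, line_addr: int, modulus: int) -> int:
    """Illustrative hash; the thesis hardware uses different hash functions."""
    digest = hashlib.blake2b(f"{tag}:{line_addr}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulus


class TwoLevelCountingFilter:
    """Selector hash picks one storage array; k hashes index counters inside it."""

    def __init__(self, n_arrays: int = 3, counters_per_array: int = 256, k: int = 3):
        self.k = k
        self.size = counters_per_array
        self.arrays = [[0] * counters_per_array for _ in range(n_arrays)]

    def _locate(self, line_addr: int):
        # First level: choose the storage array for this address.
        array = self.arrays[_hash("select", line_addr, len(self.arrays))]
        # Second level: choose k counters inside that array only.
        slots = [_hash(f"slot{i}", line_addr, self.size) for i in range(self.k)]
        return array, slots

    def insert(self, line_addr: int) -> None:
        array, slots = self._locate(line_addr)
        for s in slots:
            array[s] += 1

    def remove(self, line_addr: int) -> None:
        # Assumes the line was inserted earlier (eviction or invalidation).
        array, slots = self._locate(line_addr)
        for s in slots:
            array[s] -= 1

    def may_contain(self, line_addr: int) -> bool:
        array, slots = self._locate(line_addr)
        return all(array[s] > 0 for s in slots)
```

Because an address only ever touches the counters of its own array, addresses that the selector sends to different arrays cannot collide, which is the property the paragraph above attributes to the design.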

Another observation from Fig. 11 is that the filtered rate increases with the number of processors for all four filters. This is because, as the number of processors increases, each processor is responsible for a smaller portion of the data set. For example, assume the total data set is 1 MB: each processor in a 2-processor system would deal with 512 kB of data, and in a 4-processor system each core would be responsible for only 256 kB. The effective data set size allocated to each core decreases as the number of processors increases, and the smaller effective data set reduces the memory space that each filter has to handle. Therefore, each filter performs better when there are more processors in the system. Note that freqmine and streamcluster do not follow this trend. A possible reason is that these two benchmarks have many true-sharing accesses, or that their data sets are not distributed among the processors as evenly as in our example.

Fig. 11. Filtered rate of classic Bloom filter (BF), counting Bloom filter (CBF), banked Bloom filter (BBF), and double layer counting Bloom filter (DLCBF) with simmedium PARSEC benchmarks

We use Synopsys Design Compiler and CACTI 5.3 [17] to estimate the area of the proposed double layer counting Bloom filter. Two SRAM modules with 1-byte words and one read/write port were modeled with CACTI to estimate the area and access time of the DLCBF. The upper layer is modeled as a 64-byte, single-bank SRAM module, while the lower layer is a 768-byte SRAM. We use a 2-bank architecture to estimate the lower layer module because CACTI does not support the 3-bank architecture used by the DLCBF. The four hash functions, the permutation table, and the additional control logic are modeled with Synopsys Design Compiler. We implement the H3 hash because it has been shown to perform better than other hash functions [18]. The overall area of the DLCBF is about 113,413 μm² in 90 nm technology, and the critical path is 1.544 ns.
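For readers unfamiliar with the H3 hash mentioned above, the following Python sketch shows the general construction: each input bit selects a random row, and the selected rows are XORed together. The function name `make_h3`, the bit widths, and the seeded random matrix are illustrative assumptions; the matrices used in the synthesized design are not given in this section.

```python
import random


def make_h3(input_bits: int, output_bits: int, seed: int):
    """Build one member of the H3 family from a random binary matrix.

    The matrix has one output_bits-wide row per input bit; the seed stands in
    for the fixed wiring a hardware implementation would use (an assumption).
    """
    rng = random.Random(seed)
    rows = [rng.getrandbits(output_bits) for _ in range(input_bits)]

    def h3(x: int) -> int:
        result = 0
        for i in range(input_bits):
            if (x >> i) & 1:        # input bit i is set,
                result ^= rows[i]   # so XOR in the matching row
        return result

    return h3


# Hypothetical usage: hash 32-bit line addresses into an 8-bit index.
index_of = make_h3(input_bits=32, output_bits=8, seed=1)
slot = index_of(0x0040A1C0)
```

Since the function consists only of AND and XOR operations, it maps naturally onto a shallow tree of XOR gates, which is one reason this family is common in hardware filters.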

Fig. 12. Energy savings introduced by BF and DLCBF

CACTI is also used to estimate the energy savings. Fig. 12 depicts the estimated percentage of energy saved after applying BF and DLCBF, respectively. The estimate is the energy saved by filtering unnecessary snoops minus the energy consumed by the filter itself. BF increases the energy consumption of the system because its filtered rate is so low that the savings cannot compensate for the energy the filter consumes. DLCBF, on the other hand, benefits from its higher filtered rate and yields energy savings in most cases. On average, the DLCBF could save up to 13.17% of energy in an SMP system.
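The accounting described above (energy saved by filtered snoops minus energy spent in the filter) can be written as a small Python function. The model is a simplifying assumption, not the CACTI-based procedure itself: it supposes that every snoop queries the filter once, that each filtered snoop avoids exactly one cache lookup, and that the result is normalized to the unfiltered lookup energy. The function name and all numeric values are hypothetical.

```python
def net_energy_saving(filtered_rate: float,
                      e_cache_lookup: float,
                      e_filter_access: float) -> float:
    """Net saving per snoop as a fraction of the unfiltered lookup energy.

    filtered_rate   -- fraction of snoops screened out (0.0 to 1.0)
    e_cache_lookup  -- energy of one snoop-induced cache lookup (any unit)
    e_filter_access -- energy of one filter query (same unit)
    """
    if not 0.0 <= filtered_rate <= 1.0:
        raise ValueError("filtered_rate must lie in [0, 1]")
    if e_cache_lookup <= 0:
        raise ValueError("e_cache_lookup must be positive")
    saved = filtered_rate * e_cache_lookup      # lookups that were avoided
    return (saved - e_filter_access) / e_cache_lookup


# Made-up numbers: a filter costing 10% of a cache lookup per query.
low_rate = net_energy_saving(0.05, e_cache_lookup=1.0, e_filter_access=0.1)
high_rate = net_energy_saving(0.50, e_cache_lookup=1.0, e_filter_access=0.1)
```

Under this model a filter only pays off when its filtered rate exceeds the ratio of filter energy to lookup energy, which matches the observation that a filter with a very low filtered rate adds energy rather than saving it.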

Fig. 13. Filtered rate of double layer counting Bloom filter with different data set sizes of blackscholes

To examine the scalability of the proposed design, we evaluated one of the PARSEC benchmarks, blackscholes, with different data set sizes. As shown in Fig. 13, PARSEC provides five data set sizes for each benchmark [16]: from smallest to largest, they are test, simdev, simsmall, simmedium, and simlarge. The results show that the filtered rate of the DLCBF decreases as the data set size increases. With this insight, we evaluated each benchmark with the largest data set size. These experiments were conducted on the same multi-core system as before, except that we extended the number of processors to twenty-four and thirty-two. Fig. 14 shows the filtered snooping rate for each benchmark in this evaluation. Note that facesim does not work with the medium-size data set, and neither facesim nor fluidanimate supports a twenty-four-processor system. The bottom rightmost column is the geometric mean of the filtered rates. With the larger data set, all the filters perform worse than with the medium-size benchmarks because they have to manage more data. Nevertheless, the proposed DLCBF still outperforms the other designs in most cases. In the large-size benchmarks, the filtered rate improves on average by 7X and 1.1X compared with the classic Bloom filter and CBF, respectively.
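The geometric-mean column and the improvement factors quoted above can be computed as in the following Python sketch. The helper names and every rate in the example are made up for illustration and are not taken from Fig. 11 or Fig. 14; the sketch also does not claim to reproduce how the reported averages were actually derived.

```python
import math


def geometric_mean(rates):
    """Geometric mean of strictly positive ratios."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be non-empty and strictly positive")
    return math.exp(math.fsum(math.log(r) for r in rates) / len(rates))


def improvement_factor(new_rate: float, baseline_rate: float) -> float:
    """How many times larger new_rate is than baseline_rate (e.g. 3.57 for 3.57X)."""
    return new_rate / baseline_rate


# Hypothetical per-benchmark filtered rates for two filters.
baseline = [0.05, 0.10, 0.20]
proposed = [0.40, 0.55, 0.70]

summary_factor = improvement_factor(geometric_mean(proposed), geometric_mean(baseline))
per_benchmark = [improvement_factor(p, b) for p, b in zip(proposed, baseline)]
```

One convenient property of this choice is that the ratio of two geometric means equals the geometric mean of the per-benchmark ratios, so `summary_factor` and `geometric_mean(per_benchmark)` agree up to rounding. The arithmetic mean does not have this property.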

Fig. 14. Filtered rate of classic Bloom filter (BF), counting Bloom filter (CBF), banked Bloom filter (BBF), and double layer counting Bloom filter (DLCBF) with simlarge PARSEC benchmarks

4.3 Results of Filtering Unnecessary Data Transmission on a

In the document: A High-Performance Double Layer Counting Bloom Filter for Multi-core Systems (pages 35-43)