
In modern computing systems, multi-core architectures have become prominent because of their high performance and energy efficiency. The shared-memory Symmetric Multi-processor (SMP) is one of the most widely adopted multi-core architectures; its cores exchange data through a shared memory space. However, data sharing among multiple cores in an SMP introduces cache coherence issues [1]. To maintain a coherent memory system, a shared-memory SMP system needs to update or invalidate every copy of a shared datum whenever one of its owners writes a new value to that location. The broadcast-based snoopy protocol is a widely used scheme for maintaining a coherent memory system. This protocol broadcasts the data sharing states on the system interconnection to trigger the distributed data management mechanisms on each processor.

Fig. 1. A coherence mechanism when multiple cores share the same data. The colored cache blocks are shared among different cores. They should be either updated or invalidated while this particular block is being written.

Fig. 1 illustrates a simple example of the cache coherence issue on an SMP system with three processors (Core 1 to Core 3). Assume that Core 1 and Core 3 share the same data (A=5). When Core 1 writes a new value to its copy of the shared data (A=7), the broadcast-based snoopy protocol either updates the shared copy in Core 3 with the latest written value or invalidates it. The snoopy protocol supports single-hop cache-to-cache data transfers. Compared with the directory-based cache coherence protocol, its simple broadcasting scheme makes the snoopy protocol a low-complexity design that does not require specialized structures to maintain the sharing information. Smaller-scale multi-core systems, such as those recently deployed in embedded and mobile devices, usually adopt the snoopy protocol because of its simplicity and fast cache-to-cache transfers. For example, ARM provides a cache coherent interconnect with a snoopy protocol for its Cortex-A15 processor [2, 3].
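The scenario of Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration of the write-invalidate variant only, not the protocol implementation evaluated in this paper; the class names `Cache` and `Bus` and their methods are hypothetical, and each cache is modeled as a plain dictionary of valid lines.

```python
class Cache:
    """A toy private cache: maps address -> value for valid lines only."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop_invalidate(self, addr):
        """Handle a broadcast invalidation; return True if a copy was dropped."""
        return self.lines.pop(addr, None) is not None


class Bus:
    """A toy shared bus that broadcasts every write to all other caches."""

    def __init__(self, caches):
        self.caches = caches

    def write(self, writer, addr, value):
        """Write addr in the writer's cache and invalidate all other copies."""
        invalidated = [c.name for c in self.caches
                       if c is not writer and c.snoop_invalidate(addr)]
        writer.lines[addr] = value
        return invalidated


core1, core2, core3 = Cache("Core 1"), Cache("Core 2"), Cache("Core 3")
bus = Bus([core1, core2, core3])

# Core 1 and Core 3 share A=5, as in Fig. 1.
core1.lines["A"] = 5
core3.lines["A"] = 5

# Core 1 writes A=7; the broadcast invalidates the stale copy in Core 3.
dropped = bus.write(core1, "A", 7)
print(dropped)  # ['Core 3']
```

Note that Core 2 also receives the invalidation and searches its cache, even though it never held a copy of A.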

However, this broadcast-based scheme blindly disseminates the data sharing information across the system and usually causes a significant amount of unnecessary data transfer on the interconnection. In general, data sharing happens within a limited group of parallel tasks, and the number of these related tasks is much smaller than the size of the overall multi-core system. With the broadcast-based scheme, the processors that are not involved in the current data sharing state consequently perform needless data searches in their own caches. For example, assume that a datum is owned by only one processor. When this datum is written, the snoopy protocol broadcasts an invalidation message and invokes searches at all the caches, although none of these searches is actually needed. All these redundant data management operations occupy system resources and degrade performance and energy efficiency.
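To make the cost of blind broadcasting concrete, the following sketch counts how many snoop lookups find no copy of the written block. It is an illustrative model under simplifying assumptions, not the measurement setup of this paper: the function name `count_snoops` and the example sharer sets are hypothetical, and the sharer sets are treated as fixed for the duration of the trace.

```python
def count_snoops(sharers_by_addr, writes, num_cores):
    """Return (total, redundant) snoop lookups for a list of (writer, addr) writes.

    A lookup is redundant when the snooped core holds no copy of addr.
    """
    total = 0
    redundant = 0
    for writer, addr in writes:
        sharers = sharers_by_addr.get(addr, set())
        for core in range(num_cores):
            if core == writer:
                continue  # the writer does not snoop its own request
            total += 1
            if core not in sharers:
                redundant += 1
    return total, redundant


# Block A is shared by cores 0 and 2; block B is private to core 5.
sharers = {"A": {0, 2}, "B": {5}}
writes = [(0, "A"), (5, "B")]

total, redundant = count_snoops(sharers, writes, num_cores=16)
print(total, redundant)  # 30 29
```

In this toy trace the write to the private block B triggers 15 lookups on a 16-core system, and every one of them is redundant.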

Fig. 2. Redundant snoops in PARSEC benchmark suite.

Fig. 2 shows the percentage of redundant coherence requests for the PARSEC benchmark suite on a 16-core SMP system. According to Fig. 2, 78.37% of the coherence messages are unnecessary on average. These redundant requests, introduced by the broadcasting behavior of the snoopy protocol, needlessly keep the caches busy and increase energy consumption. They can even block useful requests from the processors and therefore degrade system performance. This problem is more severe in embedded systems because of their stringent energy constraints and strict performance requirements. If these unnecessary data communications and cache searches can be filtered out, the effective utilization of system resources, such as the caches and the interconnection, can be enhanced significantly. This paper proposes a novel architecture, the Double Layer Counting Bloom Filter, to screen out the unnecessary data management caused by the broadcast-based snoopy protocol in an SMP system.

A Bloom filter [4] is a classic structure used in database management. It uses hash functions to maintain a data mapping structure and provides an effective method for membership queries. However, because the size of the filter is limited, it suffers from a rapid array saturation problem [5]. If the dataset of an application is too large, the data mapping structure saturates and the filtering mechanism becomes ineffective. This paper proposes a novel architecture, the Double Layer Counting Bloom Filter (DLCBF), which uses a two-layer filtering scheme to achieve high filtering rates at low implementation cost. The DLCBF adds an extra layer of hash functions and a counting feature at each filter entry. By using a hierarchical hash structure, the DLCBF can manage larger query spaces and effectively increase the filtering rate while requiring less memory than conventional Bloom filters. The counting feature of the DLCBF further improves its ability to handle the array saturation issue.
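As background for the counting feature, the sketch below shows a conventional single-layer counting Bloom filter. It is not the DLCBF, whose two-layer structure is described later in the paper; the class name, the default sizes, and the use of SHA-256 as the hash function are assumptions made purely for illustration.

```python
import hashlib


class CountingBloomFilter:
    """A conventional counting Bloom filter (illustrative, not the DLCBF)."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, key):
        """Yield one counter index per hash function for the given key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.counters)

    def insert(self, key):
        for i in self._indexes(key):
            self.counters[i] += 1

    def remove(self, key):
        """Remove a key that was previously inserted."""
        for i in self._indexes(key):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def may_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.counters[i] > 0 for i in self._indexes(key))


cbf = CountingBloomFilter()
cbf.insert(0x1000)                 # a block address enters the cache
present = cbf.may_contain(0x1000)  # True: a snoop for it must be forwarded
cbf.remove(0x1000)                 # the block is evicted
absent = cbf.may_contain(0x1000)   # False: a snoop for it can be filtered
```

Because entries are counters rather than single bits, an eviction can be undone without clearing the whole array. When many keys map onto a small array, most counters become nonzero and `may_contain` returns True for almost every query, which is the saturation problem described above.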

To demonstrate the efficacy of the proposed DLCBF, this paper implements the DLCBF on two system modules of an SMP system to reduce the unnecessary data processing of the snoopy coherence protocol. The first module, depicted in Fig. 10, is the local cache of each processor. By connecting a Bloom filter between a cache and the system interconnection, the filter can screen out the unnecessary snooping messages that would otherwise be handled by each processor. The second module is the hierarchical shared system bus illustrated in Fig. 15. A Bloom filter is embedded in the system interconnection to reduce costly system-wide data broadcasting. Compared with conventional Bloom filters, the DLCBF can manage a larger data set with fast data accesses while requiring a smaller memory area. By deploying the DLCBF in an SMP system, a substantial amount of redundant memory operations and data transmission can be eliminated. In our experiments, the DLCBF eliminates up to 65.8% of unnecessary snoops and reduces the energy consumption of local caches by up to 13.17% while using 18.75% less memory. Simulation results also show that the DLCBF outperforms conventional filters by 58% for local transmissions and by 1.86X for remote transmissions on a hierarchical system interconnection. Furthermore, we implemented the DLCBF in Verilog HDL. The RTL simulation shows that the DLCBF achieves a query delay of 1.544 ns and an overall area of 113,413 μm² at the 90 nm technology node. In short, our contributions are:

1. We propose a novel and area-efficient design, the Double Layer Counting Bloom Filter (DLCBF), which can effectively manage a larger query space than conventional Bloom filters.

2. We demonstrate that, on a 16-core SMP system, the DLCBF achieves an 81.99% better filtering rate than the conventional Bloom filters while requiring 18.75% less memory storage.

3. We also demonstrate that, by removing the unnecessary data management, the DLCBF achieves an overall energy saving of 13.17%.

The rest of the paper is organized as follows. Chapter 2 introduces the basic architecture of a Bloom filter (BF), discusses two modified versions, the Counting Bloom Filter (CBF) and the Banked Bloom Filter (BBF), and reviews several related works. Chapter 3 presents the proposed filter structure, the Double Layer Counting Bloom Filter (DLCBF), and discusses its detailed functionality and implementation concerns. Chapter 4 covers our evaluation methodology and presents the cycle-accurate simulation results. Finally, Chapter 5 concludes this paper.
