
4.1 Cache Partitioning

Cache partitioning as a research topic saw an increase in interest with the rise of chip multi-processors. A number of different methods have been proposed, with a large proportion using way-based partitioning.

Dynamic Partitioning of Shared Cache Memory [1] is a way-based partitioning method to dynamically reduce the total number of misses for simultaneously executing processes.

Cache miss information for each process is collected through stack distance counters (termed marginal gain counters), and a greedy algorithm is used to determine the new partition sizes. Of note is the rollback mechanism, in which the performance of the current and previous partition sizes is compared and the better one is chosen for the next repartitioning period.
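As a minimal sketch of the greedy allocation step (the counter layout, core and way counts, and function names here are illustrative assumptions, not the paper's hardware implementation), in C:

    /* Greedy way allocation from marginal gain counters: mg[c][w] is the
     * estimated number of extra hits core c gains from its (w+1)-th way.
     * Hypothetical sketch, not the paper's actual implementation. */
    #define NCORES 2
    #define NWAYS  8

    void greedy_partition(const unsigned mg[NCORES][NWAYS],
                          unsigned alloc[NCORES])
    {
        for (int c = 0; c < NCORES; c++)
            alloc[c] = 0;

        /* Hand out ways one at a time to whichever core gains the most
         * from its next way. */
        for (int given = 0; given < NWAYS; given++) {
            int best = 0;
            for (int c = 1; c < NCORES; c++)
                if (mg[c][alloc[c]] > mg[best][alloc[best]])
                    best = c;
            alloc[best]++;
        }
    }

The rollback mechanism would then compare the measured misses under the new allocation against those of the previous one and keep whichever performed better.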

Limitations of this method include the lack of separate hit counters for each core, which makes miss prediction less accurate, and its limited scalability to four or more cores.

Utility Based Cache Partitioning [8], the example method used for determining the private partitions in this paper, allocates ways amongst the cores so as to maximize the reduction in misses. This is computed through stack distance counters and alternate tag directories, enabling the effect of various cache partition sizes to be evaluated simultaneously. The method, however, is not able to adequately adjust to situations where having no explicit partitioning policy performs well (applications with a low number of inter-process conflict misses running concurrently).
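To illustrate how stack distance counters allow the misses of several partition sizes to be evaluated at once, the following sketch (counter layout and names are assumptions for illustration) estimates the misses a core would incur with a given number of ways:

    /* hits[w] counts hits whose LRU stack depth was w (0 = MRU way),
     * as gathered by an auxiliary tag directory. A core allocated
     * `ways` ways retains exactly the hits at depths 0..ways-1; all
     * deeper references become misses. Illustrative sketch only. */
    #define NWAYS 8

    unsigned long misses_with_ways(const unsigned long hits[NWAYS],
                                   unsigned long accesses, int ways)
    {
        unsigned long kept = 0;
        for (int w = 0; w < ways; w++)
            kept += hits[w];
        return accesses - kept;
    }

Evaluating this for every value of `ways` gives the utility of each additional way without re-running the workload.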

Cooperative Cache Partitioning [9] is another way-based partitioning method, designed to deal with thrashing threads. It uses Multiple Time-sharing Partitions to share a large partition between multiple thrashing threads, giving each thread the entire partition for a portion of the repartitioning period. This, combined with the Cooperative Caching [10] method, provides an improvement in performance, particularly in Quality of Service. This scheme is also compatible with our proposed shared partitioning method, and we anticipate additional improvements in performance if the two are used together.
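A minimal sketch of the time-sharing idea (the round-robin schedule and names are assumptions; the actual MTP policy is more elaborate):

    /* Within one repartitioning period, each thrashing thread in turn
     * owns the large shared partition for a fixed time slice, so every
     * thread periodically sees the full capacity. Illustrative only. */
    int mtp_owner(int time_slice, int n_thrashing_threads)
    {
        return time_slice % n_thrashing_threads;
    }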

Adaptive Shared/Private NUCA Cache Partitioning [11] is similar to our proposed shared partitioning method in that it divides the cache into shared and private partitions. The difference lies in how the partition sizes are determined. That method uses shadow tags; however, only one way is reallocated per repartitioning period, so it cannot adjust quickly to changes in working sets, unlike our method, which can make larger changes in partition sizes. Additionally, cache misses are used as the trigger for repartitioning, so applications with a large number of cache misses but no change in their working sets will cause unnecessary repartitioning.
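The limitation can be seen in a sketch of such a repartitioning rule (the threshold value and names are illustrative assumptions, not the paper's parameters):

    /* Misses beyond a threshold trigger a repartition, and at most one
     * way moves between the shared and private partitions per period,
     * so adapting to a large working-set change takes many periods. */
    #define MISS_THRESHOLD 10000   /* assumed tuning parameter */

    void maybe_repartition(unsigned long period_misses,
                           int *private_ways, int *shared_ways,
                           int grow_private)
    {
        if (period_misses < MISS_THRESHOLD)
            return;                /* working set presumed stable */

        if (grow_private && *shared_ways > 0) {
            (*shared_ways)--; (*private_ways)++;
        } else if (!grow_private && *private_ways > 0) {
            (*private_ways)--; (*shared_ways)++;
        }
    }

Note that a phase with many misses but a stable working set still crosses the threshold, causing the unnecessary repartitioning described above.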

The reconfigurable cache proposed in [12] implements a similar, albeit simpler, method of cache partitioning; the authors focus on the hardware requirements and feasibility of implementing cache partitioning. Differences from our proposed cache partitioning method include the use of software rather than hardware for partition size determination, cache scrubbing rather than lazy repartitioning, and the use of the L1 cache rather than a cache explicitly shared between multiple cores.

Peir et al. [13] describe a dynamic partitioning technique for a direct-mapped cache in which partitioning is done by grouping sets (termed a group-associative cache). In addition, underutilized cache blocks are detected based on the recency of their use (approximating a global LRU scheme), and prefetched data is placed in those blocks in the hope of increasing their utilization. This method of underutilization detection is somewhat similar to our proposed method, but operates on a direct-mapped cache, using what would be classified as a set-based granularity if the associativity were increased.
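A sketch of recency-based underutilization detection (the field names and window constant are assumptions made for illustration):

    /* A block whose last access is older than a global recency window
     * is treated as underutilized and becomes a candidate for holding
     * prefetched data. Illustrative sketch only. */
    #include <stdbool.h>
    #include <stdint.h>

    #define RECENCY_WINDOW 4096   /* in accesses; assumed parameter */

    struct block { uint64_t last_access; /* ... tag, data ... */ };

    bool underutilized(const struct block *b, uint64_t now)
    {
        return now - b->last_access > RECENCY_WINDOW;
    }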

Adaptive set pinning [14] can be thought of as cache partitioning at a set granularity. Sets are allocated to processors based on how frequently each processor accesses a particular set, with each set having an owner. This scheme is more scalable than way-based partitioning, and our proposed method could be extended to complement set pinning by detecting underutilized ways within allocated sets that can then be shared.
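A minimal sketch of the ownership assignment (the counter array shape, set and processor counts, and names are illustrative assumptions):

    /* Each set is pinned to the processor that accesses it most
     * frequently; only the owner may allocate into that set. */
    #define NSETS  1024
    #define NPROCS 4

    void assign_owners(const unsigned freq[NSETS][NPROCS],
                       int owner[NSETS])
    {
        for (int s = 0; s < NSETS; s++) {
            int best = 0;
            for (int p = 1; p < NPROCS; p++)
                if (freq[s][p] > freq[s][best])
                    best = p;
            owner[s] = best;
        }
    }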

Recent work has noted the poor scalability of having separate monitors for each core, and methods such as In-Cache Estimation Monitors [15] and set-dueling [16] have been proposed to eliminate the need for them. A number of sets in the cache are dedicated to a particular core, from which the monitored statistics can be gathered. These methods become more effective as the cache size increases while associativity remains constant, as there are a larger number of sets and a smaller reduction in effective cache capacity per core. These methods are compatible with our proposed method and can also be adapted to help monitor set usage, reducing the overhead of our proposed method.
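A sketch of the dedicated-set idea (the modulus-based set selection and constants are assumptions for illustration, not the published mechanisms):

    /* A small, fixed subset of sets is dedicated to each core for
     * statistics gathering, removing the separate monitor structures.
     * Returns the core that set `s` is dedicated to, or -1 if `s` is
     * an ordinary follower set. Illustrative sketch only. */
    #define NSETS        4096
    #define SAMPLE_EVERY 64      /* 1 in 64 sets is a sampled set */

    int sampled_for_core(int s, int ncores)
    {
        if (s % SAMPLE_EVERY != 0)
            return -1;
        return (s / SAMPLE_EVERY) % ncores;
    }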

4.2 Cache Indexing Functions

Previous research on cache indexing functions has generally focused on reducing conflict misses within a single process rather than on the interaction between processes in a shared cache.

Rau [17] discusses calculating the index as the address modulo an irreducible polynomial. However, Gonzalez et al. [18] show that there is only a marginal advantage to choosing a polynomial mapping scheme over their own simpler bitwise XOR mapping.
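A bitwise XOR mapping of the kind compared in [18] can be sketched as follows (the field widths are illustrative assumptions):

    /* The set index is the XOR of two adjacent bit fields of the
     * block address, rather than the low-order bits alone, which
     * breaks up power-of-two strides. Illustrative sketch only. */
    #include <stdint.h>

    #define INDEX_BITS 10
    #define INDEX_MASK ((1u << INDEX_BITS) - 1)

    unsigned xor_index(uint64_t block_addr)
    {
        unsigned lo = (unsigned)(block_addr & INDEX_MASK);
        unsigned hi = (unsigned)((block_addr >> INDEX_BITS) & INDEX_MASK);
        return lo ^ hi;
    }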

Kharbutli et al. [19] propose a fast implementation of an index function that uses the address modulo a prime number. While this improves performance, they recognize that the hardware cost and increase in delay make it more suitable for higher-level caches.
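The mapping itself can be sketched as follows (the prime and set count are illustrative; their fast hardware implementation of the modulo is not shown):

    /* The set index is the block address modulo a prime close to the
     * number of sets, which spreads strided access patterns more
     * evenly; a few sets go unused. Illustrative sketch only. */
    #include <stdint.h>

    #define NSETS 1024
    #define PRIME 1021   /* largest prime below NSETS */

    unsigned prime_index(uint64_t block_addr)
    {
        return (unsigned)(block_addr % PRIME);  /* sets PRIME..NSETS-1 unused */
    }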

Zero Cost Indexing for Improved Processor Cache Performance [20] describes a heuristic to select address bits for the index given a program trace. At no cost (simply selecting different bits for the index) it is able to reduce the miss rate; however, it requires a program trace, making it unsuitable for workloads other than the one profiled.
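A simplified stand-in for the bit-selection idea (the balance-based scoring rule here is an assumption for illustration, not the paper's heuristic):

    /* From a recorded address trace, score each candidate address bit
     * by how evenly it splits the trace (an evenly-toggling bit
     * spreads accesses across sets) and pick the k best-scoring bits
     * as the index. Illustrative sketch only. */
    #include <stdint.h>
    #include <stdlib.h>

    /* Count how often bit b is set across the trace. */
    static size_t ones(const uint64_t *trace, size_t n, int b)
    {
        size_t c = 0;
        for (size_t i = 0; i < n; i++)
            c += (trace[i] >> b) & 1u;
        return c;
    }

    /* Choose k index bits whose set-count is closest to n/2. */
    void select_index_bits(const uint64_t *trace, size_t n,
                           int k, int out[])
    {
        for (int chosen = 0; chosen < k; chosen++) {
            long best_score = -1;
            int best_bit = 0;
            for (int b = 0; b < 48; b++) {
                int already = 0;
                for (int j = 0; j < chosen; j++)
                    if (out[j] == b) already = 1;
                if (already) continue;
                long score = labs((long)ones(trace, n, b) - (long)(n / 2));
                if (best_score < 0 || score < best_score) {
                    best_score = score;
                    best_bit = b;
                }
            }
            out[chosen] = best_bit;
        }
    }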

In a similar manner, Vandierendonck and De Bosschere [21] present an algorithm to determine the optimal XOR function that minimizes misses for a given program trace, finding that XOR-based functions provide the best reduction in misses of the index functions surveyed.

While these techniques propose individual index functions per application, they do not address the case of a cache shared between multiple applications.

