Chapter 3 Way-Predicting Set Associative Cache
3.4 Conclusions
The performance and power efficiency of the way-predicting set associative cache depend largely on the accuracy of the way prediction. A misprediction makes the cache search all of the ways and consume the power of all of them. In other words, the way-predicting set associative cache cannot reduce power consumption in this scenario.
A misprediction also costs additional cycles, which degrades performance.
In addition, the way-prediction table adds a significant area overhead compared to the conventional set associative cache.
Chapter 4
History-Based Tag-Comparison Instruction Cache
In this chapter, we introduce a low-power instruction cache architecture called the history-based tag-comparison (HBTC) cache [7] [8]. The HBTC cache reuses tag-comparison results to eliminate the power consumed by tag comparison, including the tag-memory accesses. The cache records tag-comparison results in an extended branch target buffer (BTB) and reuses them to select directly only the hit way that holds the target instruction. In this scenario, (NTag, NData) is equal to (0, 1) in Equation (2.3).
However, not all processors employ a BTB, so the HBTC cache has a hardware limitation: it can only be used in processors that already have a BTB.
4.1 Concept
The content of cache memory is updated when cache misses take place. Instruction caches achieve very high hit rates because instruction references exhibit strong locality.
This means that the content of instruction caches is rarely updated.
Programs contain many loops, so some instruction blocks are executed many times. We call such a run-time instruction block “a dynamic basic-block”. A dynamic basic-block consists of one or more successive basic blocks. It begins at a branch-target address and ends at a taken-branch or jump address. Therefore, not-taken conditional branches may be included in a dynamic basic-block.
A dynamic basic-block is executed many times during program execution. On its first execution, the tag comparison has to be performed for all of its instructions. However, on the second execution, if no cache miss has occurred since the first execution, it is guaranteed that the dynamic basic-block still resides in the cache.
Hence, we can determine that the indexed cache entry corresponds to the requested address without performing the tag comparison.
When a dynamic basic-block is executed, the history-based tag-comparison cache attempts to avoid unnecessary tag comparisons by detecting the following conditions:
1. The dynamic basic-block has been executed, and
2. No cache miss has occurred since the previous execution of the dynamic basic-block.
The history-based tag-comparison cache omits the tag comparison when the above conditions are satisfied regardless of the intra-line and inter-line flows.
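The two conditions above can be sketched as a small bookkeeping model. This is only an illustration of the idea, not the HBTC hardware: the class name `TagCheckHistory` and its methods are hypothetical, and a dynamic basic-block is identified here simply by its start address.

```python
class TagCheckHistory:
    """Illustrative model: which dynamic basic-blocks may skip tag comparison."""

    def __init__(self):
        # Start addresses of dynamic basic-blocks executed since the last miss.
        self.executed_since_miss = set()

    def can_omit_tag_check(self, block_start):
        # Condition 1 (executed before) and condition 2 (no miss since then)
        # both hold exactly when the block is still in the recorded set.
        return block_start in self.executed_since_miss

    def record_execution(self, block_start):
        self.executed_since_miss.add(block_start)

    def on_cache_miss(self):
        # A miss may evict any line, so all recorded history is discarded.
        self.executed_since_miss.clear()


history = TagCheckHistory()
print(history.can_omit_tag_check(0x100))   # first execution: tag check needed
history.record_execution(0x100)
print(history.can_omit_tag_check(0x100))   # second execution: may be omitted
history.on_cache_miss()
print(history.can_omit_tag_check(0x100))   # after a miss: tag check needed again
```

Clearing the whole set on a miss is conservative, but it avoids having to track which lines a particular miss actually evicted.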
4.2 Organization
As shown in Figure 4.1, the HBTC cache requires six additional components:
Way-Pointer table (WP table), Way-Pointer Register (WPreg), Way-Pointer Record Register (WPRreg), a mode controller, Previous Branch-Address Register (PBAreg), and
Figure 4.1: Block diagram of a 4-way SA HBTC cache
A conventional BTB is extended by adding the WP table. Each entry of the table corresponds to an entry of the BTB and consists of two sets of M way-pointers, one for the taken direction and one for the not-taken direction. A tag-comparison result (i.e., hit-way number) is stored in the extended BTB as a way pointer (WP).
Therefore, the WP can be implemented as a log2 n-bit flag, where n is the cache associativity, and specifies the hit-way of the corresponding instructions. The 1-bit valid flag is used for determining whether the M WPs are valid or not. The taken WPs are used for the target instructions, and the not-taken WPs are used for the fall-through instructions of the corresponding branch in the BTB. In Figure 4.1, for example, cache lines A, B, C, and D are referenced sequentially after a taken branch is executed. In this case, the tag-comparison results (or the hit-way numbers) for these references are 0, 1, 0, and 3. This information is stored in the WP table, and is reused when the target instructions are referenced in the future.
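The storage cost of one WP-table entry follows from the description above. The sketch below assumes one valid flag for the taken WPs and one for the not-taken WPs; the function names and the example values n = 4, M = 4 are illustrative and not taken from the text.

```python
def wp_bits(n_ways):
    """Bits needed to name one of n_ways ways, i.e. ceil(log2(n_ways))."""
    return (n_ways - 1).bit_length()


def wp_entry_bits(n_ways, m):
    """Bits per WP-table entry: M WPs plus a valid flag, for taken and not-taken.

    Assumption: one 1-bit valid flag per direction.
    """
    per_direction = m * wp_bits(n_ways) + 1
    return 2 * per_direction


# Example: 4-way cache with M = 4 way-pointers per direction (hypothetical M).
print(wp_bits(4))            # 2
print(wp_entry_bits(4, 4))   # 18
```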
At the first reference of instructions, we have to perform tag checks. In order to record the generated tag-comparison results in the WP table, the WPRreg is used as a temporary register. The PBAreg stores the previous-branch-instruction address and the result of branch prediction (taken or not-taken), and is used as an address register to store the value of the WPRreg to the WP table. At every BTB hit, the WPs read from the BTB are stored in the WPreg, and are provided to the I-cache for tag-comparison reuse. The mode controller manages the HBTC behavior. The details of the HBTC operation are explained in Section 4.3.
In order to reuse the tag-comparison results at cache-line granularity, we need to detect cache-line boundaries for instruction references. This can be done by monitoring a few bits of the PC [12]. (A BTB hit is used as a hint that the successive instructions follow in sequential flow.)
4.3 Operation
The HBTC cache has the following three operation modes, one of which is activated by the mode controller:
• Normal mode (Nmode): The cache behaves as a conventional I-cache, so that the tag check is performed at every cache access ((NTag, NData) is equal to (n, n) in Equation (2.3)).
• Omitting mode (Omode): The cache reuses tag-comparison results, so that only the hit way is activated without performing tag checks ((NTag, NData) is equal to (0, 1) in Equation (2.3)).
• Tracing mode (Tmode): The cache works in the same way as in the Nmode ((NTag, NData) is equal to (n, n) in Equation (2.3)), and also attempts to record the tag-comparison results generated by the I-cache (this operation is not performed in the Nmode).
Figure 4.2: Operation-mode transition of HBTC cache
Figure 4.2 shows the operation transitions. On every BTB hit, the HBTC cache reads in parallel both the taken and not-taken WPs associated with the BTB-hit entry, and selects one of them based on the branch prediction result. If the selected valid flag is ‘1’, the operation enters the Omode and the selected WPs are stored to the WPreg. Otherwise the Tmode is activated, and both the branch instruction address (PC) and the branch prediction result (taken or not-taken) are stored to the PBAreg.
In the Omode, whenever a cache-line boundary is detected, the next WP in the WPreg is selected. On the other hand, in the Tmode, the tag-comparison results generated by the I-cache are stored to the WPRreg at cache-line granularity. When the next BTB hit occurs in the Tmode, the value of the WPRreg is written into the WP-table entry pointed to by the PBAreg and the corresponding valid flag is set to ‘1’.
The WPreg and the WPRreg can hold up to M WPs, where M is the total number of WPs implemented in a taken (or not-taken) WP-table entry. In the Omode or the Tmode, if the cache attempts to access the (M + 1)-th WP in the WPreg or WPRreg, a WP-access overflow occurs and the operation switches to the Nmode.
Whenever a cache miss takes place, all WPs recorded in the WP table are invalidated by resetting all the valid flags to ‘0’, and the operation switches to the Nmode. This is because instructions corresponding to valid WPs may be evicted from the cache.
In the Tmode, a BTB hit may occur after only L (L < M) tag-comparison results have been written to the WPRreg, so that M - L invalid WPs are stored to the WP table together with them. We assume that no BTB replacement has occurred since the previous Tmode. Under this assumption, it is guaranteed that the next BTB hit occurs right after the L valid WPs have been accessed. Since the WPreg is overwritten at that BTB hit, the M - L invalid WPs are never used. In order to guarantee this assumption, the cache performs WP invalidation and changes the operation mode to the Nmode not only when a cache miss takes place but also when a BTB replacement occurs.
The cache operates in the Nmode whenever a branch-target address is provided by a return address stack (RAS), or a branch mis-prediction is detected (WP invalidation is not performed).
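The mode transitions described in this section can be summarized in a behavioural sketch. It is a simplified model written for illustration, not the authors' implementation: the class and method names are hypothetical, a valid flag is modelled as the presence of a key in a dictionary, and an empty WPRreg is simply not written back.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "Nmode"
    OMITTING = "Omode"
    TRACING = "Tmode"


class ModeController:
    def __init__(self, m):
        self.m = m              # WPs per taken (or not-taken) entry
        self.mode = Mode.NORMAL
        self.wp_table = {}      # (branch_pc, taken) -> recorded way numbers
        self.wpreg = []         # WPs being reused in the Omode
        self.next_wp = 0        # index of the next WP to use from wpreg
        self.wprreg = []        # WPs being recorded in the Tmode
        self.pbareg = None      # (branch_pc, taken) of the previous branch

    def on_btb_hit(self, branch_pc, taken):
        # Leaving the Tmode: write the recorded WPs back and mark them valid.
        if self.mode is Mode.TRACING and self.wprreg:
            self.wp_table[self.pbareg] = list(self.wprreg)
        key = (branch_pc, taken)
        if key in self.wp_table:            # valid flag is '1'
            self.mode = Mode.OMITTING
            self.wpreg = list(self.wp_table[key])
            self.next_wp = 0
        else:                               # valid flag is '0'
            self.mode = Mode.TRACING
            self.pbareg = key
            self.wprreg = []

    def on_new_line(self, hit_way):
        """Call once per newly referenced cache line.

        hit_way is the way a full tag check finds (unused in the Omode).
        Returns the way selected without a tag check, or None when a full
        tag check is performed.
        """
        if self.mode is Mode.OMITTING:
            if self.next_wp < len(self.wpreg):
                way = self.wpreg[self.next_wp]
                self.next_wp += 1
                return way
            self.mode = Mode.NORMAL         # WP-access overflow
        elif self.mode is Mode.TRACING:
            if len(self.wprreg) < self.m:
                self.wprreg.append(hit_way)
            else:
                self.mode = Mode.NORMAL     # WP-access overflow
        return None

    def invalidate(self):
        # Cache miss or BTB replacement: reset all valid flags.
        self.wp_table.clear()
        self.mode = Mode.NORMAL

    def on_ras_or_mispredict(self):
        # No WP invalidation in this case.
        self.mode = Mode.NORMAL


ctrl = ModeController(m=4)
ctrl.on_btb_hit(0x40, True)                 # no valid WPs yet: Tmode
for way in (0, 1, 0, 3):                    # lines A, B, C, D of Figure 4.1
    ctrl.on_new_line(way)
ctrl.on_btb_hit(0x80, False)                # writes WPs for (0x40, taken)
ctrl.on_btb_hit(0x40, True)                 # valid WPs found: Omode
print([ctrl.on_new_line(None) for _ in range(4)])   # [0, 1, 0, 3]
```

Treating the end of a shorter recorded list as an overflow is a conservative choice made for this sketch; it falls back to the Nmode instead of relying on the BTB-replacement assumption discussed above.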
4.4 Conclusions
The HBTC cache records tag-comparison results in an extended branch target buffer (BTB).
However, not all processors employ a BTB, so the HBTC cache has a limitation on the hardware implementation: it can only be used in processors that already have a BTB.
The size of the WP table, which grows in proportion to the number M of WPs, is a significant area overhead compared to the conventional SA cache, and it also causes some power
Chapter 5
Proposed Low-Power Instruction Cache
In this chapter, we propose our low-power instruction cache architecture, which combines the following four techniques to reduce the power consumption of cache memory.
1. Memory sub-banking [13].
2. Two-phase cache.
3. Pre-tag checking.
4. Signal “seq” for tag-memory access skipping.
5.1 Memory sub-banking
In conventional SA caches, the entire memory block is enabled just to access one word (one data word or one tag for comparison). We can partition a large memory block into several small blocks. During each memory access, we enable only the small block that contains the critical word and consume only the power of that small block.
Figure 5.1 shows the concept of memory sub-banking. We partition a 16KB memory bank into four 4KB memory sub-banks. A sub-bank address decoder is needed to enable only the desired sub-bank, and a 4-to-1 multiplexer is needed to choose the correct output data. In the example, the power consumption of the 16KB memory bank is reduced by about (57.6 - 47.4) / 57.6 = 17.7%. However, the area of the 16KB memory bank increases by about (347748 × 4 - 989604) / 989604 = 40.6%. Therefore, memory sub-banking is a trade-off between power and area.
Figure 5.1: The concept of memory sub-banking (power values from the Artisan UMC 0.18um memory compiler)
Figure 5.2 shows the address space partition for the memory sub-banking in a cache.
The sum of the SUB field width and the Sub-Index field width is equal to the original INDEX field width. The SUB field width depends on the number of memory sub-banks. For example, suppose we partition the tag-memory into four sub-banks and the data-memory into eight sub-banks. The SUB field widths are then 2 bits and 3 bits, respectively. The Sub-Index field selects the set within the sub-bank.
Figure 5.3 (a) shows that a 2-bit sub-bank address decoder is needed to generate the control signals for the four tag-memory sub-banks, and a 3-bit sub-bank address decoder is needed to generate the control signals for the eight data-memory sub-banks.
Figure 5.3 (b) shows the memory partition of each way in a cache. Only 1/4 of the tag-memory and 1/8 of the data-memory are activated in each way during a cache access.
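The split of the INDEX field into SUB and Sub-Index, and the decoder that enables a single sub-bank, can be sketched as follows. The sketch assumes that SUB occupies the upper bits of INDEX, which the text does not state; the function names are illustrative. The example uses an 8-bit tag-memory index (256 sets) and four sub-banks.

```python
def split_index(index, index_bits, sub_bits):
    """Split INDEX into (SUB, Sub-Index), assuming SUB is the upper bits."""
    sub_index_bits = index_bits - sub_bits
    sub = index >> sub_index_bits
    sub_index = index & ((1 << sub_index_bits) - 1)
    return sub, sub_index


def sub_bank_enables(sub, sub_bits):
    """One-hot enable signals produced by the sub-bank address decoder."""
    return [int(bank == sub) for bank in range(1 << sub_bits)]


# Tag-memory example: 8-bit INDEX, 2-bit SUB, 6-bit Sub-Index.
sub, sub_index = split_index(0b10110101, index_bits=8, sub_bits=2)
print(sub, sub_index)              # 2 53
print(sub_bank_enables(sub, 2))    # [0, 0, 1, 0]
```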
Figure 5.2: The address space partition for the memory sub-banking
Figure 5.3: Example of memory sub-banking
In order to examine the power efficiency of the memory sub-banking technique further, we use the Artisan UMC 0.18um memory compiler to analyze the power of the memory sub-banks. Table 5.1 shows the power consumption of the memory sub-bank in an 8KB cache. “No sub” means that the memory is not partitioned. “SubN” means that the memory is partitioned into N sub-banks. We can see that the power consumption of the tag-memory barely improves whether we perform memory sub-banking or not.
The power consumption of the data-memory is reduced by only 2.5 mW (47.4 - 44.9 = 2.5), an improvement of about 5%. Table 5.2 shows the power consumption of the memory sub-bank in a 32KB cache. The power consumption of the data-memory is reduced by 11.6 mW (57.6 - 46.0 = 11.6), an improvement of about 20%.
Table 5.1: Power consumption of the memory sub-bank in an 8KB cache
8KB cache size (2-way set associative, 16-byte line)
Tag Memory                          Data Memory
Type      Size      Power (mW)      Type      Size      Power (mW)
No sub    256x20    28.5            No sub    1024x32   47.4
Sub2      128x20    28.2            Sub2      512x32    46.0
Sub4      64x20     28.0            Sub4      256x32    45.3
-         -         -               Sub8      128x32    44.9
Table 5.2: Power consumption of the memory sub-bank in a 32KB cache
32KB cache size (2-way set associative, 16-byte line)
Tag Memory                          Data Memory
Type      Size      Power (mW)      Type      Size      Power (mW)
No sub    1024x18   27.8            No sub    4096x32   57.6
Sub2      512x18    26.4            Sub2      2048x32   50.2
Sub4      256x18    25.7            Sub4      1024x32   47.4
-         -         -               Sub8      512x32    46.0
Based on these experimental results, we find that the memory sub-banking technique should be used for a larger cache size, say, more than 32KB.
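The relative savings behind this observation can be recomputed from Table 5.1 and Table 5.2. The snippet below only transcribes the table values; the dictionary layout and the function name are illustrative.

```python
# Power in mW, transcribed from Table 5.1 (8KB) and Table 5.2 (32KB).
POWER_MW = {
    ("8KB", "tag"):   {"No sub": 28.5, "Sub2": 28.2, "Sub4": 28.0},
    ("8KB", "data"):  {"No sub": 47.4, "Sub2": 46.0, "Sub4": 45.3, "Sub8": 44.9},
    ("32KB", "tag"):  {"No sub": 27.8, "Sub2": 26.4, "Sub4": 25.7},
    ("32KB", "data"): {"No sub": 57.6, "Sub2": 50.2, "Sub4": 47.4, "Sub8": 46.0},
}


def saving_percent(cache, memory, config):
    """Power saving of a sub-banked configuration relative to 'No sub'."""
    table = POWER_MW[(cache, memory)]
    base = table["No sub"]
    return 100.0 * (base - table[config]) / base


for (cache, memory), table in POWER_MW.items():
    best = min(table, key=table.get)    # configuration with the lowest power
    print(f"{cache} {memory}: {best} saves {saving_percent(cache, memory, best):.1f}%")
```

The data-memory saving grows from roughly 5% in the 8KB cache to roughly 20% in the 32KB cache, which is the trend the conclusion above relies on.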
5.2 Two-phased cache
Although at most one way holds the data desired by the processor, all the ways are accessed in parallel in conventional set associative caches. Thus, a lot of power is wasted.
To solve this issue, Hasegawa et al. proposed a low-power set associative cache architecture [11] called the phased set associative cache. The phased set associative cache first looks up the tag-memory in all the ways and then accesses the data-memory in
and 2 PData, power reduction from the conventional two-way set associative cache on cache hits, and cache misses, respectively. The average power consumption in a phased two-way set associative cache (PP2SACache) for a cache access can be expressed as follows:
$$P_{P2SACache} = 2 P_{Tag} + CHR \times P_{Data} \tag{5.1}$$
Here, CHR is the cache hit rate. However, all the cache accesses are delayed by one cycle. This latency significantly slows down the overall performance.
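As a worked example, Equation (5.1), read as P = 2 × PTag + CHR × PData, can be evaluated numerically. The sketch below uses the unpartitioned 8KB values from Table 5.1 (PTag = 28.5 mW, PData = 47.4 mW) as per-way access power, which is an assumption made for illustration, and a hypothetical hit rate of 0.98.

```python
def conventional_2way_power(p_tag, p_data):
    """Conventional 2-way cache: both tag and both data memories are accessed."""
    return 2 * p_tag + 2 * p_data


def phased_2way_power(p_tag, p_data, hit_rate):
    """Equation (5.1): two tag accesses, one data access only on a hit."""
    return 2 * p_tag + hit_rate * p_data


P_TAG = 28.5     # mW, Table 5.1, "No sub" tag memory
P_DATA = 47.4    # mW, Table 5.1, "No sub" data memory
HIT_RATE = 0.98  # hypothetical value, not from the text

base = conventional_2way_power(P_TAG, P_DATA)
phased = phased_2way_power(P_TAG, P_DATA, HIT_RATE)
print(f"conventional: {base:.1f} mW, phased: {phased:.1f} mW")
print(f"reduction: {100 * (base - phased) / base:.1f}%")
```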
To remove this latency, we propose a new architecture, similar to the phased set associative cache, called the two-phased set associative cache. We use a positive-edge-triggered tag-memory and a negative-edge-triggered data-memory to make one-cycle cache accesses possible.
Figure 5.4 shows a cache hit in the two-phased two-way set associative cache (8KB cache size, 16-byte block). The cache accesses the tag-memory and performs the tag comparison in all the ways during the high half-cycle, and then accesses the data-memory for the desired data in the matching way during the low half-cycle.
Figure 5.5 shows a cache miss in the two-phased two-way set associative cache (8KB cache size, 16-byte block). Since no way matches during the high half-cycle, no data-memory is activated in the low half-cycle. The average power consumption in a two-phased two-way set associative cache (P2P2SACache) for a cache access is the same as in Equation (5.1).
However, the timing of the two-phased cache is more critical than that of the phased cache, because the tag-memory access and the tag comparison must complete within a half-cycle.
Figure 5.4: The cache hit in the two-phased two-way set associative cache
5.3 Pre-tag checking
Due to the locality principle of programs, the addresses of the instructions loaded into the cache are close to one another. This means that the most significant bits of the tag field are usually the same, while the least significant bits usually differ. In other words, a tag comparison on only a few least significant bits can almost always decide whether the cache hits or not. In order to reduce the power consumption of the tag-memory further, we propose a technique called pre-tag checking, used together with the two-phased set associative cache. We take the 3 least significant bits of the tag (TAG[LSB+2:LSB]) to perform pre-tag checking for way selection.
Figure 5.6 shows the address space partition in a 2-way set associative cache (8KB, 16-byte block) for the pre-tag checking.
Figure 5.6: The address space partition for the pre-tag checking
Figure 5.7 shows the architecture of the two-phased cache with pre-tag checking.
During the high half-cycle, the cache accesses the ptag-memory and compares its contents with the 3-bit PTAG field of the address for way selection. During the low half-cycle, the cache accesses the otag-memory and the data-memory in parallel in the matching way selected by the PTAG comparison result. The OTAG comparison result then confirms whether the cache actually hits or not.
Figure 5.7: The architecture of a two-phased cache with pre-tag checking
The cache-access power can be expressed as follows:
$$P_{Cache} = N_{PTag} \times P_{PTag} + N_{OTag} \times P_{OTag} + N_{Data} \times P_{Data} \tag{5.2}$$
There are four cases for the power consumption in the two-phased two-way set associative cache with pre-tag checking. Among them, there are two better cases and two worse cases for the power consumption compared to the two-phased two-way set associative cache without pre-tag checking.
• BC I: Figure 5.8 shows the better case I for the power consumption. The PTAG comparison result indicates that there is one way matching. The otag-memory and data-memory in the matching way are activated, and the OTAG comparison result indicates that the cache actually hits. In this case, the (NPTag, NOTag, NData) is equal to (2, 1, 1).
• BC II: Figure 5.9 shows the better case II for the power consumption. The PTAG comparison result indicates that there is no way matching (cache miss). Therefore, no otag-memory and data-memory will be activated. In this case, the (NPTag, NOTag, NData) is equal to (2, 0, 0).
• WC I: Figure 5.10 shows the worse case I for the power consumption. The PTAG comparison result indicates that there are two ways matching. The otag-memory and data-memory in the matching ways are activated, and the OTAG comparison result indicates that the cache actually hits or misses. In this case, the (NPTag, NOTag, NData) is equal to (2, 2, 2).
• WC II: Figure 5.11 shows the worse case II for the power consumption. The PTAG comparison result indicates that there is one way matching. The otag-memory and data-memory in the matching way are activated, but the OTAG comparison result indicates that the cache actually misses. In this case, the (NPTag, NOTag, NData) is equal to (2, 1, 1).
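The four cases above can be reproduced with a small behavioural model of one set of a two-way cache. This is a sketch for illustration only: the function names are hypothetical, the tag values are arbitrary 20-bit examples, and the two half-cycles are modelled as two sequential steps.

```python
PTAG_BITS = 3   # number of least significant tag bits used for pre-tag checking


def split_tag(tag):
    """Return (OTAG, PTAG): the upper tag bits and the 3 least significant bits."""
    return tag >> PTAG_BITS, tag & ((1 << PTAG_BITS) - 1)


def lookup(stored_tags, req_tag):
    """Look up req_tag in one set; stored_tags holds the tag kept by each way.

    Returns (hit_way or None, (N_PTag, N_OTag, N_Data)).
    """
    req_otag, req_ptag = split_tag(req_tag)
    # High half-cycle: the ptag-memory of every way is read and compared.
    n_ptag = len(stored_tags)
    candidates = [way for way, tag in enumerate(stored_tags)
                  if split_tag(tag)[1] == req_ptag]
    # Low half-cycle: otag- and data-memory are read only in matching ways.
    n_otag = n_data = len(candidates)
    hit_way = next((way for way in candidates
                    if split_tag(stored_tags[way])[0] == req_otag), None)
    return hit_way, (n_ptag, n_otag, n_data)


print(lookup([0x12345, 0x0ABC2], 0x12345))   # BC I:  (0, (2, 1, 1))
print(lookup([0x12345, 0x0ABC2], 0x00001))   # BC II: (None, (2, 0, 0))
print(lookup([0x12345, 0x0ABCD], 0x12345))   # WC I:  (0, (2, 2, 2))
print(lookup([0x12345, 0x0ABC2], 0x0FFF5))   # WC II: (None, (2, 1, 1))
```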
Based on the previous discussion and the locality principle of programs, the ratio of WC I and WC II is expected to be much smaller than the ratio of BC I and BC II. To support this point, we run several benchmarks, including a JPEG encoder (jpeg2000_enc), Dhrystone (dhry), fast Fourier transform (fft), discrete cosine transform (dct), math operations (math), and a dual-tone multi-frequency algorithm (dtmf), and measure the ratio of WC I and WC II. According to the results in Table 5.3 and Table 5.4, the total average ratio of WC I and WC II is about 1% to 2%, which is much smaller than the ratio of BC I and BC II. Therefore, we can say that the (NPTag, NOTag, NData) of the two-phased two-way set associative cache with pre-tag checking is roughly equal to (2, 1, 1) on cache hits and (2, 0, 0) on cache misses.
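To see how the case ratios feed into Equation (5.2), read as P = NPTag × PPTag + NOTag × POTag + NData × PData, the average access counts can be computed as a weighted sum over the four cases. In the sketch below the WC I and WC II ratios are the dtmf values from Table 5.4, while the split of the remainder between BC I and BC II is a hypothetical assumption, since the text does not report it; the function names are illustrative.

```python
# (N_PTag, N_OTag, N_Data) for each case of the two-way cache.
CASE_COUNTS = {
    "BC I":  (2, 1, 1),
    "BC II": (2, 0, 0),
    "WC I":  (2, 2, 2),
    "WC II": (2, 1, 1),
}


def average_counts(case_ratio):
    """Weighted average of (N_PTag, N_OTag, N_Data) over the four cases."""
    total = sum(case_ratio.values())
    return tuple(
        sum(case_ratio[case] * CASE_COUNTS[case][i] for case in CASE_COUNTS) / total
        for i in range(3)
    )


def cache_power(counts, p_ptag, p_otag, p_data):
    """Equation (5.2) for given access counts and per-memory access power."""
    n_ptag, n_otag, n_data = counts
    return n_ptag * p_ptag + n_otag * p_otag + n_data * p_data


# WC ratios from Table 5.4 (dtmf); the BC I / BC II split is hypothetical.
ratios = {"BC I": 0.9619, "BC II": 0.0100, "WC I": 0.0277, "WC II": 0.0004}
n_ptag, n_otag, n_data = average_counts(ratios)
print(f"N_PTag = {n_ptag:.4f}, N_OTag = {n_otag:.4f}, N_Data = {n_data:.4f}")
```

Even for dtmf, the benchmark with the largest WC I ratio, the average NOTag and NData stay close to 1, which is consistent with the approximation stated above.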
Figure 5.8: Better case I of the two-phased cache with pre-tag checking
Figure 5.10: Worse case I of the two-phased cache with pre-tag checking
Figure 5.11: Worse case II of the two-phased cache with pre-tag checking
Table 5.3: The ratio of WC I and WC II (part 1)
Benchmark        jpeg2000_enc        dhry                fft
Type             WC I     WC II      WC I     WC II      WC I     WC II
pre-tag 3-bit    0.00%    0.00%      1.05%    0.01%      0.12%    0.01%
Table 5.4: The ratio of WC I and WC II (part 2)
Benchmark        arm_dct             arm_math            dtmf
Type             WC I     WC II      WC I     WC II      WC I     WC II
pre-tag 3-bit    0.73%    0.01%      1.45%    0.07%      2.77%    0.04%
To summarize, for a conventional 2-way set associative cache, the (NPTag, NOTag, NData) is roughly equal to (2, 2, 2) regardless of hits or misses. The two-phased two-way set associative cache without pre-tag checking reduces the power consumption by making (NPTag, NOTag, NData) equal to (2, 2, 1) on cache hits and (2, 2, 0) on cache misses. The two-phased two-way set associative cache with pre-tag checking can