Drowsy Cache - State Preserving Technique

Chapter 2 Background and Related Work

2.3 State Preserving Technique

2.3.1 Drowsy Cache

Two turn-off policies have been proposed for the drowsy cache [2] [14]. First, the simple policy periodically turns off all cache lines, on the premise that the working set changes from time to time. Second, the noaccess policy places into drowsy mode only the lines that have not been accessed for a fixed period of time. The noaccess turn-off policy is very similar to the cache decay policy; the difference is that the former places cache lines into drowsy mode, whereas the latter places them into sleep mode.

The simple policy turns off all cache lines periodically without checking whether they will be accessed again. For this reason, the simple policy incurs a larger run-time increase on the instruction cache, and the noaccess policy may do better than the simple policy there. On the other hand, the simple policy is the simplest to implement and has very little implementation overhead. The simple policy can do better than the noaccess policy on the data cache, since its run-time increase on the data cache is not as serious.
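The difference between the two policies can be illustrated with a minimal sketch that counts wake-ups over an access trace. The function name `simulate`, the one-access-per-cycle trace, and the rule that a line becomes drowsy after more than `window` idle cycles are assumptions made for illustration; they are not taken from [2] or [14].

```python
def simulate(trace, num_lines, window, policy):
    """Count wake-ups for a drowsy cache under one turn-off policy.

    trace:  list of line indices, one access per cycle (hypothetical input).
    policy: "simple"   - every `window` cycles all lines go drowsy;
            "noaccess" - a line goes drowsy after more than `window` idle cycles.
    """
    awake = [False] * num_lines          # all lines start in drowsy mode
    last_access = [None] * num_lines
    wakeups = 0
    for cycle, line in enumerate(trace):
        if policy == "simple":
            if cycle > 0 and cycle % window == 0:
                awake = [False] * num_lines
        elif policy == "noaccess":
            for i in range(num_lines):
                if awake[i] and cycle - last_access[i] > window:
                    awake[i] = False
        else:
            raise ValueError(f"unknown policy: {policy}")
        if not awake[line]:              # accessing a drowsy line costs a wake-up
            wakeups += 1
            awake[line] = True
        last_access[line] = cycle
    return wakeups


loop_trace = [0, 1, 2, 3] * 4            # a four-line loop executed four times
simple_wakeups = simulate(loop_trace, num_lines=8, window=4, policy="simple")
noaccess_wakeups = simulate(loop_trace, num_lines=8, window=4, policy="noaccess")
```

On this loop-like trace the simple policy wakes every line again after each window, while the noaccess policy wakes each line only once, which matches the instruction cache behaviour described above.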

2.3.2 Drowsy Instruction Cache

The technique proposed for the drowsy instruction cache [18] [14] adopts a bank-based strategy in which the instruction cache consists of several banks. Only the bank that is currently being accessed is in active mode; the others are in drowsy mode. When program execution moves from one bank to another, the hardware turns off the former and turns on the latter. Without pre-waking the next target bank, performance degrades significantly because of the wake-up penalties. To solve this problem, two bank pre-activation policies have been proposed.

Memory sub-bank prediction buffer. Figure 2-7 illustrates this bank prediction technique, which uses a fully associative buffer (CAM), similar to a BTB structure, to pre-activate the target bank. Each prediction entry in the buffer contains an instruction address and the target bank number. The instruction address is that of the instruction before a branch instruction, so that the prediction can pre-wake the target bank before it is accessed. On each cache access, this buffer is searched to decide whether to pre-activate the next target bank. The power overhead of the CAM search on every cache access is significant.

Figure 2-7. Next sub-bank prediction buffer
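The lookup performed by the prediction buffer can be sketched as a small associative map from an instruction address to a target bank. The class name `BankPredictionBuffer`, the `train`/`lookup` methods, and the FIFO replacement are hypothetical; the source does not specify how entries are allocated or replaced.

```python
class BankPredictionBuffer:
    """Toy model of a fully associative next sub-bank prediction buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}                # instruction address -> target bank

    def train(self, addr_before_branch, target_bank):
        # FIFO replacement is an assumption; the source does not state one.
        if (addr_before_branch not in self.entries
                and len(self.entries) >= self.capacity):
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[addr_before_branch] = target_bank

    def lookup(self, fetch_addr):
        """Return the bank to pre-wake, or None when no entry matches."""
        return self.entries.get(fetch_addr)


buf = BankPredictionBuffer(capacity=2)
buf.train(0x1008, target_bank=3)         # 0x1008 precedes a branch into bank 3
prewake = buf.lookup(0x1008)             # searched on every cache access
miss = buf.lookup(0x2000)
```

Because `lookup` runs on every fetch, a hardware version pays the CAM search energy on each access, which is the overhead noted above.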

Next sub-bank predictors in cache tags. Figure 2-8 illustrates this prediction technique, which extends the cache tags to record the predictor information. Each tag entry is extended to contain a byte-offset address, a valid bit, and a target bank address. The stored byte offset is that of the instruction before a branch instruction, so that the prediction can pre-wake the target bank before it is accessed.

On each instruction cache access, the byte offset in the extended tag is read and compared with the byte offset of the program counter (PC), and the valid bit is checked, to decide whether to pre-wake the target bank. The disadvantage of this technique is that multiple target banks cannot be kept if there is more than one branch instruction in the cache line.

Figure 2-8. Next sub-bank predictor in cache tags

2.3.3 Hotspot Based Leakage Management & Just-In-Time Activation

In [19], hotspot based leakage management (HSLM) is proposed as the turn-off policy, and the sequential access property of instruction execution is exploited for the turn-on policy. The concept of hotspot leakage management is that the cache lines used in a program hotspot are likely to be accessed frequently, and thus should not be turned off. This technique builds on the simple policy of the drowsy cache, which periodically transitions cache lines to drowsy mode, and detects the cache lines used in a hotspot to prevent them from being turned off when the fixed period expires.

Just-In-Time Activation (JITA) The just-in-time activation technique exploits the sequential access pattern of the instruction cache to pre-activate the line that sequentially follows the currently accessed cache line. Figure 2-9 depicts the JITA scheme. In a set-associative cache, the JITA scheme pre-activates the set that sequentially follows the currently accessed cache line, or uses way prediction information to pre-activate a selected way in that set. This pre-activation scheme cannot handle taken branches.

Figure 2-9. Just-In-Time Activation scheme
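As a minimal sketch of the JITA selection rule, the function below returns the lines that would be woken for a given access. The name `jita_preactivate`, the `(set, way)` tuple representation, and the wrap-around at the last set are assumptions for illustration.

```python
def jita_preactivate(current_set, num_sets, num_ways, predicted_way=None):
    """Return the (set, way) pairs to wake for the next sequential set."""
    next_set = (current_set + 1) % num_sets   # wrap-around is an assumption
    if predicted_way is not None:
        # With way prediction, only the selected way is pre-activated.
        return [(next_set, predicted_way)]
    # Without it, every way of the sequential set is pre-activated.
    return [(next_set, way) for way in range(num_ways)]
```

For example, `jita_preactivate(5, num_sets=64, num_ways=4)` selects all four ways of set 6. The rule depends only on the current set index, so the target of a taken branch is never selected.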

HotSpot Based Leakage Management (HSLM) The hotspot detection mechanism proposed in this work tracks branch behavior and collects the execution frequencies of basic blocks by adding some information to the branch target buffer (BTB). Figure 2-10 depicts the enhanced BTB structure that supports hotspot detection. Each entry in the BTB is augmented with two access counters: one for the target basic block (tgt_cnt) and one for the fall-through basic block (fth_cnt). During each branch prediction, if the BTB hits, the target counter is incremented by one when the branch is taken, and the fall-through counter is incremented by one when it is not taken. The access counter has a predefined threshold T_acc, and the counter has log₂(T_acc)+1 bits. This means that the counter stops increasing once its most significant bit is 1. When a branch target entry is replaced, the corresponding access counters are reset to zero.

The value of the target/fall-through counter reflects the execution frequency of the target/fall-through basic block. If the BTB predicts taken/not-taken and the most significant bit of the target/fall-through counter is 1 (the frequency has reached the predefined threshold), the global mask bit (GB) is set to indicate that a hot basic block is executing. In addition, each cache line has a voltage control mask (VCM) bit that identifies whether the cache line is used in a hotspot. When a cache line is accessed while the global mask bit is 1, its VCM bit is set. A cache line whose VCM bit is 1 is therefore not turned off when the fixed period expires.

Furthermore, several initialization operations take place when the period expires. First, all cache lines are turned off except those whose voltage control mask bit is set. Second, all voltage control mask bits are reset. Third, all the access counters in the BTB are shifted right by one bit to halve the counter values, and then to

In addition to preventing hot cache lines from being turned off inadvertently, the hotspot detection mechanism detects loop execution in order to issue the turn-off signal early. If a BTB access predicts a taken backward branch and the most significant bit of its corresponding access counter is 1, the mechanism assumes that the program is in a hotspot executing a loop. At this point, the global turn-off signal is issued to turn off all cache lines except those whose voltage control mask bit is set.

Figure 2-10. HotSpot based Leakage Management scheme
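The counter, GB, and VCM mechanics can be sketched in a few lines. This is a toy model under stated assumptions, not the design in [19]: the class `HotspotDetector` and its method names are hypothetical, every branch is assumed to hit in the BTB, T_acc is assumed to be a power of two, and GB is simply recomputed at each prediction because the source does not say when it is cleared.

```python
import math


class HotspotDetector:
    """Toy model of the BTB access counters, the GB bit and per-line VCM bits."""

    def __init__(self, t_acc, num_lines):
        self.bits = int(math.log2(t_acc)) + 1   # counter width
        self.msb = 1 << (self.bits - 1)
        self.counters = {}       # branch pc -> [target count, fall-through count]
        self.gb = False
        self.vcm = [False] * num_lines

    def predict(self, pc, taken):
        cnt = self.counters.setdefault(pc, [0, 0])   # assume a BTB hit
        idx = 0 if taken else 1
        if not cnt[idx] & self.msb:                  # stop counting once MSB is 1
            cnt[idx] += 1
        self.gb = bool(cnt[idx] & self.msb)          # hot basic block executing?

    def access_line(self, line):
        if self.gb:
            self.vcm[line] = True

    def period_expired(self):
        """Return the lines to turn off, then reset VCM bits and halve counters."""
        to_turn_off = [i for i, hot in enumerate(self.vcm) if not hot]
        self.vcm = [False] * len(self.vcm)
        for cnt in self.counters.values():
            cnt[0] >>= 1
            cnt[1] >>= 1
        return to_turn_off


det = HotspotDetector(t_acc=4, num_lines=4)
for _ in range(4):
    det.predict(pc=0x40, taken=True)     # the target block becomes hot
det.access_line(1)                       # line 1 is used inside the hotspot
turned_off = det.period_expired()
```

With T_acc = 4 the counters are three bits wide, so the fourth taken prediction sets the most significant bit and line 1 is protected when the period expires.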

2.3.4 Compiler Approach

A compiler-based leakage management approach is proposed in [15]. This approach manages instruction cache leakage by inserting a turn-off instruction at loop granularity (it treats straight-line code as a loop that iterates once) to turn off all cache lines. The concept is based on determining the last use of an instruction at loop granularity. A conservative strategy does not turn off an instruction cache line unless it knows for sure that the instruction currently residing in that line is dead. This means that it will not insert the turn-off instruction at the exit of an inner loop. An optimistic strategy, by contrast, may turn off a cache line if it detects that the next access to the instruction will occur only after a long gap. This compiler-based turn-off approach places cache lines into either drowsy mode or state-destructive sleep mode.

Furthermore, a hybrid strategy is proposed that employs both the state-destructive and the state-preserving techniques under the optimistic approach. In this strategy, if the loop is not going to be accessed again, its cache lines are placed into state-destructive mode; otherwise they are placed into state-preserving mode. This hybrid strategy needs more complicated hardware to control three supply voltages, for active mode, drowsy mode, and state-destructive sleep mode. Figure 2-11 shows cases of the conservative and optimistic strategies.

Figure 2-11. The control flow graph of conservative strategy and optimistic strategy
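The three strategies can be summarized as a decision function evaluated at a candidate turn-off point. This is only a sketch: the function name `turn_off_mode`, the boolean inputs, and the mode strings are hypothetical, and the compiler analysis in [15] that would produce `is_dead` and `long_gap` is not modelled.

```python
def turn_off_mode(strategy, is_dead, long_gap, low_power="drowsy"):
    """Pick the mode for a loop's cache lines at a candidate turn-off point.

    is_dead:   the loop's instructions are known never to be used again.
    long_gap:  the next use is expected only after a long gap.
    low_power: mode used by the non-hybrid strategies ("drowsy" or "sleep").
    """
    if strategy == "conservative":
        return low_power if is_dead else "active"
    if strategy == "optimistic":
        return low_power if (is_dead or long_gap) else "active"
    if strategy == "hybrid":
        if is_dead:
            return "sleep"               # state-destructive mode
        return "drowsy" if long_gap else "active"
    raise ValueError(f"unknown strategy: {strategy}")
```

For an inner loop that will run again after a long gap, the conservative strategy returns "active", whereas the optimistic and hybrid strategies return "drowsy".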

2.4 Summary and Observation

In Table 2-2, we compare the previous leakage management works that are based on the state-preserving circuit technique (except the drowsy instruction bank) with our proposed design. We find that previous works turn off cache lines after a large fixed period, and they either have no pre-activation scheme (the drowsy cache policies) or have only a simple pre-activation scheme (JITA).

[Figure 2-11 panels: (a) Conservative strategy; (b) Optimistic strategy, each annotated with the inserted turn-off instructions]

Schemes | Turn-on Mechanism | Turn-off Mechanism
Simple policy | When accessed | Periodically turn off the entire cache
Noaccess policy | When accessed | Turn off the lines that are unused during a fixed time
HSLM & JITA | Pre-activate the sequential set | Turn off non-hot cache lines periodically or on a hot backward branch
Proposed design | Pre-activate the sequential line and the jump line | Keep a small number of most recently used and currently pre-activated cache lines active, and turn off the other lines

Table 2-2. Summary of previous work and our proposed design

Under the periodic turn-off policy, a cache line needs to be woken up only on its first access during a window, after which it is kept in active mode until the turn-off window expires. The extra stalls due to the wake-up penalty are therefore few, and no pre-activation scheme is needed for this turn-off policy. The same holds for the noaccess turn-off policy. However, since the transition energy due to voltage switching is not large, the cache lines that will not be accessed again soon should be turned off to obtain more leakage reduction, provided there are no wake-up penalties.

To turn off cache lines more aggressively, an effective pre-activation scheme is required to avoid the extra stalls due to wake-up latency. We propose a two-direction pre-activation scheme that exploits the instruction access pattern, which is either sequential or a taken branch, to pre-activate the sequential line for sequential access and the jump line for a taken branch. Furthermore, our design proposes a simple and aggressive MRU-Keeping turn-off policy, which keeps only a small number of the most recently used cache lines active and turns off the others. We introduce both in the next chapter.

Chapter 3 Proposed Design of Leakage Management on L1 Cache

In this chapter, we propose the two-direction pre-activation scheme and the MRU-Keeping turn-off policy. The two-direction pre-activation scheme exploits the sequential and taken-branch access patterns of the instruction cache to pre-activate likely cache lines in both the sequential and jump directions. We introduce the pre-activation approaches for the sequential and jump directions in section 3.1, and the MRU-Keeping scheme, which keeps only the most recently used cache lines in active mode, in section 3.2. Finally, the detailed circuit blocks of the pre-activation scheme and the MRU-Keeping scheme are shown in section 3.3.

3.1 Two-Direction Pre-activation

An aggressive turn-off policy keeps most cache lines in drowsy mode and therefore increases the number of times a drowsy cache line must be woken up. Since the extra stalls due to the wake-up penalty consume a large amount of energy, turning off cache lines aggressively requires an accurate pre-activation scheme to reduce this overhead.

According to our experiments, 67.7% of cache line accesses move in the sequential direction (details in section 4.3.1). The jump direction therefore also accounts for a considerable proportion and must be pre-activated as well to avoid the wake-up penalty.

Figure 3-1 illustrates the concept of the two-direction pre-activation scheme: when a cache line is accessed, its sequential line (for sequential access) and its jump line (for a taken branch) are pre-activated.

In the following subsections, we introduce three pre-activation approaches for the sequential direction and two for the jump direction. There are thus six combinations of two-direction pre-activation schemes.

Figure 3-1. The concept of two-direction pre-activation

3.1.1 Pre-activate Sequential Line (SL pre-activation)

This approach adds a sequential-line entry to each cache line to pre-activate a cache line in the sequential set. The sequential-line entry stores which way of the sequential set to wake up, and it has log₂(number of ways in a set) bits. In a direct-mapped cache, the sequential-line entry is not needed. The pre-activation entry can be placed into drowsy mode together with its hosting cache line to reduce its leakage energy overhead. Figure 3-2 depicts this pre-activation approach: when a cache line is being accessed, its sequential-line entry is read out to pre-activate the sequential line in the sequential direction.

Figure 3-2. Pre-activate sequential line

This approach has a late pre-activation problem: if a cache line is accessed for only one cycle before the program counter moves to a new cache line, the scheme pre-activates the new line too late. The reason is that the sequential-line entry is read out while its hosting cache line is being accessed, so the wake-up of the jump and sequential lines occurs in the next cycle. Figure 3-3 shows the late pre-activation case of the sequential line. Assume that cache line ‘A’ is accessed in the first cycle and cache line ‘B’ will be accessed in the second cycle. The sequential-line entry of ‘A’ is read out in the first cycle; this information then starts to wake up the drowsy cache line ‘B’ in the second cycle, and the wake-up completes at the end of that cycle. However, cache line ‘B’ is also accessed in the second cycle, so the pre-activation hits but is too late, and a one-cycle wake-up latency is incurred. According to our experiments (simulation environment detailed in section 4.1), about 5.7% of cache line accesses stay only one cycle (the highest is 21%). We introduce some solutions to this late pre-activation problem in the following sections.

Figure 3-3. Late pre-activation case of SL pre-activation
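The timing argument can be checked with a small calculation. The function `sl_wakeup_stall` and its cycle convention (the entry is read in the access cycle of the hosting line, the wake-up starts one cycle later and takes `wake_latency` cycles) are assumptions that follow the example in the text rather than a circuit-level model.

```python
def sl_wakeup_stall(access_cycle_a, access_cycle_b, wake_latency=1):
    """Stall cycles seen by line B when line A's entry pre-activates it.

    The entry of A is read in A's access cycle, so B's wake-up starts in the
    following cycle and B is usable `wake_latency` cycles after that.
    """
    ready_cycle = access_cycle_a + 1 + wake_latency
    return max(0, ready_cycle - access_cycle_b)
```

With `sl_wakeup_stall(1, 2)` the result is one stall cycle, the late case described above; with `sl_wakeup_stall(1, 4)` the line is awake in time and no stall occurs.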

3.1.2 Pre-activate Sequential Line of Next Accessed Line (NSL Pre-activation)

To solve the late pre-activation problem of SL pre-activation, the sequential-line information of a cache line is stored in the cache line accessed before it. The pre-activation information can thus be read out earlier, avoiding late pre-activation.

Since this pre-activation entry pre-activates the sequential line of the next accessed line, we call it the next-sequential-line entry. Figure 3-4 shows NSL pre-activation. On a cache access, the next-sequential-line entry is read out to indicate which line will be woken up, and that line is woken up when execution switches to the next accessed cache line.

Figure 3-4. Pre-activate sequential line of next accessed line

Figure 3-5 depicts an example of NSL pre-activation. Assume that cache line ‘A’ is accessed in the first cycle and cache line ‘B’ will be accessed in the fourth cycle. The next-sequential-line entry of ‘A’ is read out in the first cycle to indicate cache line ‘C’, and ‘C’ starts to be woken up when ‘B’ is accessed in the fourth cycle.

Figure 3-5 (a). Cache line ‘A’ is accessed at cycle 1 and its next-sequential-line entry is read out to indicate cache line ‘C’

Figure 3-5 (b). Cache line ‘B’ is accessed at cycle 4 and cache line ‘C’ is woken up at this cycle
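The example of Figure 3-5 can be replayed with a short sketch. The function `nsl_wake_events`, the trace format (the cycle at which each new line starts being accessed), and the dictionary of next-sequential-line entries are hypothetical names chosen for illustration.

```python
def nsl_wake_events(trace, nsl_entry):
    """Return (cycle, line) wake-up events under NSL pre-activation.

    trace:     list of (cycle, line) pairs, one per newly accessed line.
    nsl_entry: maps a line to the sequential line of its next accessed line.
    """
    events = []
    pending = None                       # entry read from the previous line
    for cycle, line in trace:
        if pending is not None:
            # The stored target is woken when execution switches lines.
            events.append((cycle, pending))
        pending = nsl_entry.get(line)    # read out while `line` is accessed
    return events


events = nsl_wake_events([(1, "A"), (4, "B")], {"A": "C"})
```

Here the entry of ‘A’ is read at cycle 1 and ‘C’ is woken at cycle 4, when ‘B’ starts being accessed, so ‘C’ is already waking even if ‘B’ stays for only one cycle.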

Since the next accessed line is a predicted line, the accuracy of NSL pre-activation (without considering the late pre-activation case) will be worse than that of SL pre-activation.

3.1.3 Pre-activate Sequential Set (SS Pre-activation, i.e. JITA)

Another approach to solving the late pre-activation problem of SL pre-activation is to pre-activate all lines of the sequential set (the JITA scheme). Because no information needs to be stored in a cache line, late pre-activation does not occur when a whole set is pre-activated. Figure 3-6 depicts the approach of jump-line & sequential-set pre-activation.

Figure 3-6. Pre-activate sequential set (JITA scheme)

3.1.4 Pre-activate Jump Line (JL Pre-activation)

Like SL pre-activation, this approach adds a jump-line entry to each cache line to pre-activate the cache line in the jump direction. The jump-line entry stores the jump line address, which consists of the set address and the way address, and it has log₂(number of sets in the cache) + log₂(number of ways in a set) bits. It has the same late pre-activation problem as SL pre-activation. Figure 3-7 depicts this pre-activation approach: when a cache line is being accessed, its jump-line entry is read out to pre-activate the cache line in the jump direction.

Figure 3-7. Pre-activate jump line

3.1.5 Pre-activate Jump Line of Next Accessed Line (NJL Pre-activation)

The solution to the late pre-activation problem of jump-line pre-activation is the same as in NSL pre-activation: the jump-line information of a cache line is stored in the cache line accessed before it. Figure 3-8 depicts the approach of next-line’s-jump-line & sequential-line pre-activation.

3.1.6 Summary of Two-Direction Pre-activate Schemes

There are six combinations of two-direction pre-activation: the JL-SL, JL-NSL, JL-SS, NJL-SL, NJL-NSL, and NJL-SS pre-activation schemes. Their pre-activation accuracy is evaluated in chapter 4. Figure 3-9 depicts an example of JL-SL two-direction pre-activation.

Figure 3-8. Pre-activate jump line of next accessed line

Figure 3-9. JL-SL pre-activation
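The six combinations and their per-line storage cost can be enumerated with a short sketch based on the entry widths stated in sections 3.1.1 and 3.1.4. The function `entry_bits` and the example cache geometry (128 sets, 4 ways) are hypothetical, and the sketch assumes that NSL and NJL entries have the same width as SL and JL entries, which the text does not state explicitly.

```python
import math
from itertools import product


def entry_bits(approach, num_sets, num_ways):
    """Bits of pre-activation information stored per cache line."""
    way_bits = int(math.log2(num_ways))
    set_bits = int(math.log2(num_sets))
    if approach in ("SL", "NSL"):
        return way_bits                  # which way of the sequential set
    if approach == "SS":
        return 0                         # the whole set is woken, nothing stored
    if approach in ("JL", "NJL"):
        return set_bits + way_bits       # set and way address of the jump line
    raise ValueError(f"unknown approach: {approach}")


# Two jump-direction approaches times three sequential-direction approaches.
schemes = {
    f"{jump}-{seq}": entry_bits(jump, 128, 4) + entry_bits(seq, 128, 4)
    for jump, seq in product(["JL", "NJL"], ["SL", "NSL", "SS"])
}
```

For this assumed geometry a JL-SL line stores 11 bits of pre-activation information, while the schemes that use SS need only the 9-bit jump-line entry.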

3.2 MRU-Keeping Turn-Off Policy

Although a power mode transition consumes extra energy, this transition energy is not large. Therefore, only the cache lines that will be accessed again soon should be kept on; the cache lines that will not be accessed again in the near future can be turned off aggressively to obtain more leakage reduction. This concept is shown in figure 3-10.

Figure 3-10. A reference stream of a cache line

Assuming that pre-activation is perfect, equation (3) shows how many cycles a cache line must stay turned off to compensate for the energy overhead of the voltage transition.

(3)

According to the power parameters shown in the table, interval_off is about 73 cycles. Cache lines that will not be accessed soon should therefore be turned off aggressively. For this reason, we propose a simple and aggressive MRU-Keeping turn-off policy that keeps only a small number of the most recently used (MRU) cache lines in active mode; the others can be turned off. An MRU buffer stores the addresses of these cache lines, each consisting of a set number and a way number, to indicate the most recently used cache lines that are in active mode. This buffer is implemented as a CAM structure and updated by an LRU policy, which needs log₂(number of buffer entries) bits to maintain. When an entry in the buffer is evicted, the cache line indicated by the evicted entry is turned off. Figure 3-11 depicts the MRU-Keeping concept.

Figure 3-11. MRU-Keeping buffer (updated by LRU policy)
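The behaviour of the MRU buffer can be sketched in software with an ordered dictionary standing in for the CAM and its LRU state. The class `MRUKeepingBuffer` and its `access` method are hypothetical names, and the sketch models only which line is turned off, not the circuit.

```python
from collections import OrderedDict


class MRUKeepingBuffer:
    """Keeps the most recently used (set, way) lines active, LRU-replaced."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = OrderedDict()     # (set, way) keys, least recent first

    def access(self, set_idx, way):
        """Record an access; return the line to turn off, or None."""
        key = (set_idx, way)
        if key in self.entries:
            self.entries.move_to_end(key)    # refresh its recency
            return None
        self.entries[key] = None
        if len(self.entries) > self.num_entries:
            evicted, _ = self.entries.popitem(last=False)
            return evicted               # this cache line is turned off
        return None


mru = MRUKeepingBuffer(num_entries=2)
results = [mru.access(0, 0), mru.access(1, 0), mru.access(0, 0), mru.access(2, 1)]
```

In this run the last access evicts line (1, 0), the least recently used entry, so that line is the one placed into drowsy mode.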

Since our two-direction pre-activation policy needs to update the pre-activation information in the previous cache line, or in the one before it, the MRU buffer must keep at least one or two cache lines in active mode so that this information can be updated.

3.3 Circuit Block of Our Proposed Design

In this section, we illustrate the circuit blocks of our proposed design. Figure 3-12 shows the circuit blocks of the JL-SL pre-activation policy. Each cache line has two bits: a PA (pre-activation) bit and an MRU bit. The PA bit indicates whether the cache line is a pre-activated cache line, and the MRU bit indicates whether it is a most recently used cache line. When a cache line is accessed, its MRU bit is set, and its pre-activation information is read out
