Chapter 2 Background and Related Work

2.3 State Preserving Technique

2.3.4 Compiler Approach

A compiler-based leakage management approach is proposed in [15]. It manages instruction cache leakage by inserting a turn-off instruction at loop granularity (straight-line code is treated as a loop that iterates once) to turn off all cache lines. The concept is based on determining the last use of an instruction at loop granularity. A conservative strategy does not turn off an instruction cache line unless it knows for sure that the instruction currently residing in that line is dead. This means that it will not insert the turn-off instruction at the exit of an inner loop. An optimistic strategy, in contrast, may turn off a cache line if it detects that the next access to the instruction will occur only after a long gap. This compiler-based turn-off approach places cache lines into either drowsy mode or state-destructive sleep mode.

Furthermore, a hybrid strategy is proposed that employs both the state-destructive and the state-preserving techniques under the optimistic approach. In this strategy, if the loop is not going to be accessed again, the cache lines are placed into state-destructive mode; otherwise they are placed into state-preserving mode. This hybrid strategy needs more complicated hardware to control three supply voltages for active mode, drowsy mode, and state-destructive sleep mode. Figure 2-11 shows the conservative and optimistic strategy cases.

Figure 2-11. The control flow graph of conservative strategy and optimistic strategy
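The three strategies can be read as a small decision rule applied at each loop exit. The following is a minimal sketch of that reading, not the algorithm of [15]: the function name `choose_mode`, the boolean inputs, and the `off_mode` parameter are hypothetical, and how the compiler decides that a gap is "long" is left open.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    DROWSY = "drowsy"  # state-preserving
    SLEEP = "sleep"    # state-destructive


def choose_mode(strategy, reused_later, gap_is_long, off_mode=Mode.DROWSY):
    """Pick the power mode for a loop's cache lines at the loop exit.

    reused_later: the compiler found a later use of these instructions.
    gap_is_long:  that later use is far away (the threshold is an assumption).
    off_mode:     mode used when a non-hybrid strategy turns lines off.
    """
    if strategy == "conservative":
        # Turn off only instructions that are known to be dead.
        return Mode.ACTIVE if reused_later else off_mode
    if strategy == "optimistic":
        # Also turn off when the next use is a long way off.
        return off_mode if (not reused_later or gap_is_long) else Mode.ACTIVE
    if strategy == "hybrid":
        if not reused_later:
            return Mode.SLEEP  # contents are never needed again
        return Mode.DROWSY if gap_is_long else Mode.ACTIVE
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, an inner loop that will be re-entered after a long gap stays active under the conservative strategy but is put into drowsy mode under the hybrid one.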

2.4 Summary and Observation

In table 2-2, we compare previous leakage management work based on the state-preserving circuit technique (except the drowsy instruction bank) with our proposed design. Previous works turn off cache lines after a large fixed period, and they either have no pre-activation scheme (the drowsy cache policies) or use only a simple one (JITA).

[Figure 2-11 panels: (a) Conservative strategy; (b) Optimistic strategy. Arrows in the control flow graphs mark the inserted turn-off instructions.]

| Schemes | Turn-on Mechanism | Turn-off Mechanism |
|---|---|---|
| Simple policy | When accessed | Periodically turn off entire cache |
| Noaccess policy | When accessed | Turn off the lines which are unused during a fixed time |
| HSLM & JITA | Pre-activate sequential set | Turn off non-hot cache lines periodically or when hot-backward |
| Proposed design | Two-direction pre-activation (sequential line and jump line) | Keep a small part of the most recently used and currently pre-activated cache lines active, and turn off the other lines |

Table 2-2. Summary of previous work and our proposed design

For the periodic turn-off policy, a cache line needs to be woken up only on its first access during the window, and it is then kept in active mode until the turn-off window expires. Extra stalls due to the wake-up penalty are therefore few, and a pre-activation scheme is not needed for this turn-off policy. The same holds for the noaccess turn-off policy. However, since the transition energy due to voltage switching is not large, the cache lines that will not be accessed again soon should be turned off to obtain more leakage reduction, provided that no wake-up penalties are incurred.

In order to turn off cache lines more aggressively, an effective pre-activation scheme is required to avoid extra stalls due to wake-up latency. We propose a two-direction pre-activation scheme that exploits the instruction access pattern, which is either sequential or a taken branch: it pre-activates the sequential line for sequential access and the jump line for a taken branch. Furthermore, a simple and aggressive MRU-Keeping turn-off policy is proposed in our design. It keeps only a small number of the most recently used cache lines active and turns off the others. We introduce both in the next chapter.

Chapter 3 Proposed design of leakage management on L1 cache

In this chapter, we propose the two-direction pre-activation scheme and the MRU-Keeping turn-off policy. The two-direction pre-activation scheme exploits the sequential and taken-branch access patterns of the instruction cache to pre-activate likely cache lines in both the sequential and the jump direction. We introduce the pre-activation approaches for the sequential and jump directions in section 3.1, and the MRU-Keeping scheme, which keeps only the most recently used cache lines in active mode, in section 3.2. Finally, the detailed circuit blocks of the pre-activation scheme and the MRU-Keeping scheme are shown in section 3.3.

3.1 Two-Direction Pre-activation

An aggressive turn-off policy puts most cache lines in drowsy mode and therefore increases the number of drowsy cache lines that must be woken up. Since the extra stalls due to the wake-up penalty consume a large amount of energy, turning off cache lines aggressively requires an accurate pre-activation scheme to reduce this overhead.

According to our experiment, 67.7% of cache line accesses move in the sequential direction (details in section 4.3.1). The jump direction therefore also accounts for a large proportion of accesses and must be pre-activated as well to avoid the wake-up penalty.

Figure 3-1 illustrates the concept of the two-direction pre-activation scheme: when a cache line is accessed, its sequential line (for sequential access) and its jump line (for a taken branch) are pre-activated.

In the following subsections, we introduce three pre-activation approaches in the sequential direction and two in the jump direction. Thus there are six combinations of two-direction pre-activation schemes.

Figure 3-1. The concept of two-direction pre-activation

3.1.1 Pre-activate Sequential Line (SL pre-activation)

This approach adds a sequential-line entry to each cache line to pre-activate a cache line of the sequential set. The sequential-line entry stores which way of the sequential set to wake up, so its width is log₂(number of ways in a set) bits. In a direct-mapped cache, the sequential-line entry is not needed. The pre-activation entry can be placed into drowsy mode together with its hosting cache line to decrease its leakage energy overhead. Figure 3-2 depicts this pre-activation approach: when a cache line is accessed, its sequential-line entry is read out to pre-activate the sequential line.

Figure 3-2. Pre-activate sequential line

This approach has a late pre-activation problem: if a cache line is accessed for only one cycle before the program counter moves to a new cache line, the scheme pre-activates the next line too late. The reason is that the pre-activation entry is read out while its hosting cache line is being accessed, so the wake-up of the jump and sequential lines occurs in the next cycle. Figure 3-3 shows such a late pre-activation case. Assume that cache line ‘A’ is accessed in the first cycle and cache line ‘B’ will be accessed in the second cycle. The sequential-line entry of ‘A’ is read out in the first cycle; this information then starts to wake up the drowsy cache line ‘B’ in the second cycle, and the wake-up completes at the end of that cycle. However, cache line ‘B’ is also accessed in the second cycle, so the pre-activation hits but comes too late, and a one-cycle wake-up latency is incurred. According to our experiment (the simulation environment is detailed in section 4.1), about 5.7% of cache line accesses stay for only one cycle (the highest is 21%). We introduce solutions to this late pre-activation problem in the following sections.

Figure 3-3. Late pre-activation case of SL pre-activation
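The timing argument can be checked with a small model. This is a sketch under simplifying assumptions that go beyond the text: every pre-activation predicts the right line, every target starts out drowsy, and wake-up takes one cycle as in the example. The function name `late_stall_cycles` and the trace format are hypothetical.

```python
WAKEUP_CYCLES = 1  # assumed one-cycle wake-up, as in the 'A'/'B' example


def late_stall_cycles(trace, wakeup=WAKEUP_CYCLES):
    """Count stall cycles caused purely by late pre-activation.

    trace: list of (line, cycles_stayed) in access order.
    The entry of a line is read in its first cycle and the target then
    needs `wakeup` more cycles, so the target is ready only if the
    current line is occupied for at least 1 + wakeup cycles.
    """
    stalls = 0
    # Pair each access with its successor; the last access wakes nothing we use.
    for (_, stay), _next_access in zip(trace, trace[1:]):
        ready_after = 1 + wakeup
        stalls += max(0, ready_after - stay)
    return stalls


example_trace = [("A", 1), ("B", 3), ("C", 1), ("D", 2)]
example_stalls = late_stall_cycles(example_trace)
```

In this trace only ‘A’ and ‘C’ are left after a single cycle, so each of their successors pays one stall cycle, while ‘B’ is occupied long enough for ‘C’ to be ready in time.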

3.1.2 Pre-activate Sequential Line of Next Accessed Line (NSL Pre-activation)

To solve the late pre-activation problem of SL pre-activation, the sequential-line information of a cache line is stored in its previously accessed cache line. The pre-activation information can thus be read out earlier, which avoids late pre-activation.

Since this pre-activation entry pre-activates the sequential line of the next accessed line, we call it the next-sequential-line entry. Figure 3-4 shows NSL pre-activation. On a cache access, the next-sequential-line entry is read out to indicate which line will be woken up, and that line is woken up when execution switches to the next accessed cache line.

Figure 3-4. Pre-activate sequential line of next accessed line

Figure 3-5 depicts an example of NSL pre-activation. Assume cache line ‘A’ is accessed in the first cycle and cache line ‘B’ will be accessed in the fourth cycle. The next-sequential-line entry of ‘A’ is read out in the first cycle to indicate cache line ‘C’, and ‘C’ starts to wake up when ‘B’ is accessed in the fourth cycle.

Figure 3-5 (a). The cache line ‘A’ is accessed at cycle 1 and its next-sequential-line entry is read out to indicate cache line ‘C’

Figure 3-5 (b). The cache line ‘B’ is accessed at cycle 4 and cache line ‘C’ is woken up at this cycle

Since the next accessed line is itself a predicted line, the accuracy of NSL pre-activation (without considering the late pre-activation case) will be worse than that of SL pre-activation.
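The read-then-wake ordering of the Figure 3-5 example can be sketched as follows. The dictionary `nsl_entry` (mapping a line to the line its next-sequential-line entry names) and the function `nsl_wakeups` are hypothetical names used only for illustration; a latch variable stands in for whatever holds the entry between the two accesses.

```python
def nsl_wakeups(accesses, nsl_entry):
    """Return (accessed_line, line_woken_on_this_switch) pairs.

    accesses:  cache lines in the order the program counter visits them.
    nsl_entry: per-line next-sequential-line entry; missing means no entry.
    """
    latch = None  # entry read out during the previous access
    events = []
    for line in accesses:
        # Switching to `line` triggers the wake-up named by the previous entry.
        events.append((line, latch))
        # Reading this line's own entry prepares the wake-up for the next switch.
        latch = nsl_entry.get(line)
    return events


# 'A' holds the entry for the sequential line of its successor 'B', namely 'C'.
example_events = nsl_wakeups(["A", "B"], {"A": "C"})
```

Here the entry of ‘A’ is already available when execution switches to ‘B’, so ‘C’ begins to wake up at the first cycle of ‘B’ rather than one cycle later.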

3.1.3 Pre-activate Sequential Set (SS Pre-activation, i.e. JITA)

Another approach to solving the late pre-activation problem of SL pre-activation is to pre-activate all lines of the sequential set (the JITA scheme). Because no information needs to be stored in a cache line, late pre-activation does not occur when pre-activating a whole set. Figure 3-6 depicts the approach of jump-line and sequential-set pre-activation.

Figure 3-6. Pre-activate sequential set (JITA scheme)
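Because the sequential set follows directly from the current set index, the lines to wake can be computed without any stored entry. A minimal sketch, assuming the usual set indexing in which the set index wraps around to 0 after the last set; the function name is hypothetical.

```python
def sequential_set_lines(set_index, num_sets, ways):
    """List the (set, way) pairs woken up by sequential-set pre-activation."""
    next_set = (set_index + 1) % num_sets  # assumed wrap-around indexing
    return [(next_set, way) for way in range(ways)]
```

The cost of avoiding the stored entry is that all ways of the next set are woken, not only the one that will be used.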

3.1.4 Pre-activate Jump Line (JL Pre-activation)

This approach adds a jump-line entry to each cache line to pre-activate the cache line in the jump direction, in the same way as SL pre-activation. The jump-line entry stores the jump line address, which consists of the set address and the way address, so its width is log₂(number of sets in the cache) + log₂(number of ways in a set) bits. It has the same late pre-activation problem as SL pre-activation. Figure 3-7 depicts this pre-activation approach: when a cache line is accessed, its jump-line entry is read out to pre-activate the cache line in the jump direction.

Figure 3-7. Pre-activate jump line
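The two entry widths can be worked out for a concrete cache. The sketch below applies the formulas given for the sequential-line and jump-line entries to the 32KB, 32B-block instruction cache of the simulated configuration; the helper names are hypothetical and power-of-two sizes are assumed.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, without floating point."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def entry_bits(cache_bytes, block_bytes, ways):
    """Bits per cache line for the sequential-line and jump-line entries."""
    num_lines = cache_bytes // block_bytes
    num_sets = num_lines // ways
    way_bits = log2_exact(ways)
    set_bits = log2_exact(num_sets)
    return {"sequential_line": way_bits, "jump_line": set_bits + way_bits}


four_way = entry_bits(32 * 1024, 32, 4)       # 256 sets of 4 ways
direct_mapped = entry_bits(32 * 1024, 32, 1)  # 1024 sets of 1 way
```

For this cache the jump-line entry always needs 10 bits, since set bits plus way bits equal log₂ of the 1024 lines, while the sequential-line entry shrinks to 0 bits in the direct-mapped case.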

3.1.5 Pre-activate Jump Line of Next Accessed Line (NJL Pre-activation)

The solution to the late pre-activation problem of jump-line pre-activation is the same as in NSL pre-activation: the jump-line information of a cache line is stored in its previously accessed cache line. Figure 3-8 depicts the approach of next-line's-jump-line and sequential-line pre-activation.

3.1.6 Summary of Two-Direction Pre-activate Schemes

There are six combinations of two-direction pre-activation: the JL-SL, JL-NSL, JL-SS, NJL-SL, NJL-NSL, and NJL-SS pre-activation schemes. Their pre-activation accuracy is evaluated in chapter 4. Figure 3-9 depicts an example of JL-SL pre-activation.

Figure 3-8. Pre-activate jump line of next accessed line

Figure 3-9. JL-SL pre-activation
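The six scheme names are simply the cross product of the two jump-direction approaches and the three sequential-direction approaches, which a short enumeration makes explicit (the variable names are illustrative only):

```python
from itertools import product

JUMP_APPROACHES = ["JL", "NJL"]
SEQUENTIAL_APPROACHES = ["SL", "NSL", "SS"]

schemes = [f"{jump}-{seq}" for jump, seq in product(JUMP_APPROACHES, SEQUENTIAL_APPROACHES)]
```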

3.2 MRU-Keeping Turn-Off Policy

Since a power mode transition consumes extra energy but this transition energy is not large, only the cache lines that will be accessed again soon should be kept on; the cache lines that will not be accessed again in the near future can be turned off aggressively to obtain more leakage reduction. This concept is shown in figure 3-10.

Figure 3-10. A reference stream of a cache line

Assuming the pre-activation is perfect, equation (3) shows how many cycles a cache line should stay turned off to compensate for the energy overhead of the voltage transition.

(3)
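A break-even condition of this kind can be sketched as follows. The symbols are assumptions introduced here for illustration and are not the original notation: $E^{cycle}_{active}$ and $E^{cycle}_{drowsy}$ denote the leakage energy of one cache line per cycle in active and drowsy mode, and $E_{transition}$ denotes the energy of one turn-off plus wake-up pair. Turning a line off pays for itself once the leakage saved over the off interval covers the transition energy:

```latex
interval_{off} \cdot \left( E^{cycle}_{active} - E^{cycle}_{drowsy} \right) \ge E_{transition}
\quad\Longrightarrow\quad
interval_{off} \ge \frac{E_{transition}}{E^{cycle}_{active} - E^{cycle}_{drowsy}}
```

With perfect pre-activation no stall energy enters the balance, which is why only the transition energy appears on the right-hand side.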

According to the power parameters shown in the table, interval_off is about 73 cycles. The cache lines which will not be accessed soon should therefore be turned off aggressively. For this reason, we propose a simple and aggressive MRU-Keeping turn-off policy that keeps only a small number of the most recently used (MRU) cache lines in active mode; the others can be turned off. An MRU buffer stores cache line addresses, each consisting of a set number and a way number, to indicate the most recently used cache lines that are kept in active mode. This buffer is implemented as a CAM structure and updated by an LRU policy, which needs log₂(number of buffer entries) bits to maintain. When an entry in the buffer is evicted, the cache line indicated by the evicted entry is turned off. Figure 3-11 depicts the MRU-Keeping concept.

Figure 3-11. MRU-Keeping buffer (updated by LRU policy)

Since our two-direction pre-activation policy needs to update the pre-activation information in the previous cache line or in the one before it, the MRU buffer needs to keep at least one or two cache lines in active mode so that this information can be updated.
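The behaviour of the MRU buffer can be modelled in software. The sketch below is a functional stand-in, not the CAM circuit: an `OrderedDict` plays the role of the LRU-ordered buffer, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class MRUKeepingBuffer:
    """Tracks the most recently used cache lines; evicted lines get turned off."""

    def __init__(self, entries):
        self.entries = entries
        self._lines = OrderedDict()  # keys are (set_number, way_number), oldest first

    def access(self, set_number, way_number):
        """Record an access and return the line to turn off, or None."""
        key = (set_number, way_number)
        if key in self._lines:
            self._lines.move_to_end(key)  # refresh: now the most recent entry
            return None
        self._lines[key] = True
        if len(self._lines) > self.entries:
            evicted, _ = self._lines.popitem(last=False)  # drop the LRU entry
            return evicted
        return None

    def active_lines(self):
        return list(self._lines)


buf = MRUKeepingBuffer(entries=2)
turned_off = [buf.access(0, 0), buf.access(1, 0), buf.access(0, 0), buf.access(2, 1)]
```

In this run the fourth access evicts line (1, 0), because line (0, 0) was refreshed by the third access; the evicted line is the one that would be turned off.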

3.3 Circuit Block of Our Proposed Design

In this section, we illustrate the circuit block of our proposed design. Figure 3-12 shows the circuit blocks of the JL-SL pre-activation policy. Each cache line has two bits: the PA (pre-activation) bit and the MRU bit. The PA bit indicates whether the cache line is a pre-activated cache line, and the MRU bit indicates whether it is a most recently used cache line. When a cache line is accessed, its MRU bit is set, and its pre-activation information is read out to pre-activate the target cache line and set that line's PA bit. When the program counter moves to a new cache line, a new pre-activation occurs, so the PA bits of the cache lines pre-activated by the previously accessed cache line are reset.

Furthermore, an entry evicted from the MRU buffer resets the MRU bit of the cache line it indicates. If the PA bit and the MRU bit of a cache line are both zero, the cache line is neither a pre-activated nor an MRU cache line, and it is turned off. Figure 3-13 shows the update policy of the JL-SL pre-activation policy. If a drowsy cache line is accessed, a pre-activation miss has occurred, and the pre-activation information of the previously accessed cache line is updated.

Figure 3-12. The circuit block of JL-SL pre-activation policy

Figure 3-13. The update policy of JL-SL pre-activation policy
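The interplay of the PA and MRU bits can be traced with a small behavioural model. This is a sketch of the bit rules described for the JL-SL policy, not of the circuit itself; `LineBits`, `on_access`, and `on_mru_evict` are hypothetical names, and the pre-activation targets are passed in directly instead of being read from stored entries.

```python
from dataclasses import dataclass


@dataclass
class LineBits:
    pa: bool = False   # set while the line is a current pre-activation target
    mru: bool = False  # set while the line has an entry in the MRU buffer

    def is_drowsy(self):
        # A line is turned off only when it is neither pre-activated nor MRU.
        return not (self.pa or self.mru)


def on_access(lines, accessed, targets, old_targets):
    """Update the bits when the program counter moves to `accessed`."""
    for line in old_targets:
        lines[line].pa = False  # the new pre-activation replaces the old one
    lines[accessed].mru = True
    for line in targets:
        lines[line].pa = True


def on_mru_evict(lines, evicted):
    lines[evicted].mru = False


lines = {name: LineBits() for name in "ABC"}
on_access(lines, "A", targets=["B"], old_targets=[])     # 'A' pre-activates 'B'
on_access(lines, "B", targets=["C"], old_targets=["B"])  # 'B' pre-activates 'C'
on_mru_evict(lines, "A")                                  # 'A' leaves the MRU buffer
```

After these steps ‘A’ has both bits cleared and becomes drowsy, ‘B’ stays active through its MRU bit, and ‘C’ stays active through its PA bit.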

Figure 3-14 shows the circuit block of the NJL-NSL pre-activation policy. The only difference between the JL-SL and NJL-NSL pre-activation policies is that the pre-activation information of NJL-NSL is read out to a latch, and the pre-activation then occurs when the program counter moves to the next accessed cache line, since the NJL-NSL information is used to pre-activate the jump and sequential lines of the next accessed line. Figure 3-15 shows the update policy of the NJL-NSL pre-activation policy. If a drowsy cache line is accessed, a pre-activation miss has occurred, and the pre-activation information of the cache line accessed before the previous one (the previous-of-previous line) is updated.

Figure 3-14. The circuit block of NJL-NSL pre-activation policy

Figure 3-15. The update policy of NJL-NSL pre-activation policy

Prev_prev_index: the address of the previous-of-previous accessed cache line

Chapter 4 Simulation Environment and Experiment Result

In this chapter, we introduce our simulation environment, benchmark suite, and energy model. We then evaluate the accuracy of the two-direction pre-activation schemes and determine the best number of MRU buffer entries. Finally, we present and discuss the experimental results of our design compared with previous works in terms of I-cache leakage reduction and run-time increase.

4.1 Simulation Environment

We extended SimpleScalar/ARM [20], an architectural simulator that can execute ARM binary code, to implement our design and the previous works, namely the drowsy cache and the HSLM & JITA schemes. SimpleScalar/ARM is configured to resemble an embedded processor as closely as possible: in-order issue and execution, one instruction committed per cycle, no L2 cache, and so on. The configuration parameters are given in table 4-1. We simulate cache associativities from 1-way to 32-way.

Processor Core

| Parameter | Configuration |
|---|---|
| Instruction window | 2 RUU, 2 LSQ |
| Decode/Issue Width | 1 instruction per cycle |
| Commit Width | 1 instruction per cycle |
| Function Units | 2 IALU, 1 MULT/IDIV, 1 Memport |
| Branch Predictor | Bimodal, 1024 entries; 256-set 4-way BTB, 8-entry RAS |
| L1 I-Cache | 32KB (1 way ~ 32 way), 32B blocks, 1 cycle latency |
| L1 D-Cache | 32KB (1 way ~ 32 way), 32B blocks, 1 cycle latency |
| Memory | 22 cycles first chunk, 2 cycles rest |
| TLB | ITLB 32 entries fully associative, DTLB 32 entries fully associative, 30 cycle miss penalty |

Table 4-1. Configuration Parameters of SimpleScalar/ARM

We use the MiBench benchmark suite [21] as our simulation benchmark. It is a free, commercially representative embedded benchmark suite and consists of six categories: Automotive and Industrial Control, Consumer Devices, Network, Office Automation, Security, and Telecommunications. These categories offer different program characteristics and are introduced below:

Automotive and Industrial Control

The Automotive and Industrial Control category is intended to demonstrate use of embedded processors in embedded control systems. These processors require performance in basic math abilities, bit manipulation, data input/output and simple data organization. Typical applications are air bag controllers, engine performance monitors and sensor systems. The tests used to characterize these situations are a basic math test, a bit counting test, a sorting algorithm and shape recognition program.

Consumer Devices

The Consumer Devices benchmarks are intended to represent the many consumer devices that have grown in popularity during recent years, like scanners, digital cameras and Personal Digital Assistants (PDAs). The category focuses primarily on multimedia applications, with the representative algorithms being jpeg encoding/decoding, image color format conversion, image dithering, color palette reduction, MP3 encoding/decoding, and HTML typesetting. Several of the algorithms are from the SGI TIFF utilities. All of the image benchmarks use small and large images as data input.

Network

The Network category represents embedded processors in network devices like switches and routers. The work done by these embedded processors involves shortest path calculations, tree and table lookups and data input/output. The algorithms used to demonstrate the networking category are finding a shortest path in a graph and creating and searching a Patricia trie data structure. Some of the benchmarks in the Security and Telecommunications categories are also relevant to the Network category: CRC32, sha, and blowfish. However, they are separated for organization.

Office Automation

The Office applications are primarily text manipulation algorithms to represent office machinery like printers, fax machines and word processors.

The PDA market mentioned in Consumer category also relies heavily on the manipulation of text for data organization.

Security

Data Security is going to have increased importance as the Internet continues to gain popularity in e-commerce activities. The Security category includes several common algorithms for data encryption, decryption and hashing. One algorithm, rijndael, is the new Advanced Encryption Standard (AES). The other representative security algorithms are Blowfish, PGP, and SHA.

Telecommunications

The final category of embedded processors is the Telecommunications category. Telecommunication benchmarks consist of voice encoding and decoding algorithms, frequency analysis and a checksum algorithm.

| Auto./Industrial | Consumer | Office | Network | Security | Telecomm. |
|---|---|---|---|---|---|
| basicmath | jpeg | ghostscript | dijkstra | blowfish enc. | CRC32 |
| bitcount | lame | ispell | patricia | blowfish dec. | FFT |
| qsort | mad | rsynth | (CRC32) | pgp sign | IFFT |
| susan (edges) | tiff2bw | sphinx | (sha) | pgp verify | ADPCM enc. |
| susan (corners) | tiff2rgba | stringsearch | (blowfish) | rijndael enc. | ADPCM dec. |
| susan (smoothing) | tiffdither | | | rijndael dec. | GSM enc. |
| | tiffmedian | | | sha | GSM dec. |
| | typeset | | | | |

Table 4-2. MiBench Benchmarks

4.2 Power Model

Our energy model is as follows:
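Taking the five parts defined below as additive, the model can be sketched as the following sum. The additive form is an assumption read off from the component list:

```latex
E_{mng} = E_{active} + E_{drowsy} + E_{transition} + E_{stall} + E_{mng\_overhead}
```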

E_mng is the leakage energy consumption of the instruction cache, including the extra overhead energy of the leakage management scheme. It is composed of five parts:

E_active: leakage energy consumed by cache lines in active mode

E_drowsy: leakage energy consumed by cache lines in drowsy mode

E_transition: dynamic energy consumed by power mode transitions

E_stall: dynamic and static energy consumed by the run-time increase due to the wake-up penalty

E_mng_overhead: dynamic and static energy consumed by the extra hardware of the leakage management scheme. E_mng_overhead differs between leakage management schemes:

- Simple policy of drowsy cache:

Dynamic energy consumed by the global cycle counter for the periodic turn-off scheme.

- Noaccess policy of drowsy cache:
