INTRODUCTION - A Low-Power Branch Target Buffer Based on the Instruction Cache

1.1 Importance of Low Power Design

As technology advances, more and more products must contend with the limited battery life of battery-powered equipment and with heat dissipation problems. Battery-powered devices, such as MP3 players and PDAs, benefit from a longer operating time. Heat affects system stability, and a poor heat dissipation policy may itself consume extra power for cooling and shorten battery life. Low-power design helps extend the battery life of such equipment and reduces heat generation.

1.2 Power Components of CMOS Circuit

Today, CMOS technology is the dominant semiconductor technology for microprocessors, memories, and application-specific integrated circuits (ASICs). The power consumed by a CMOS circuit can be divided into two kinds: dynamic power and static power. Dynamic power is composed of switching power and short-circuit power. Switching power is dissipated by charging and discharging the gate output capacitance. Short-circuit power is dissipated when VDD and VSS are intermittently shorted while a logic gate switches. Static power has three major components: 1> sub-threshold leakage from source to drain; 2> gate leakage; 3> reverse-bias junction leakage. Sub-threshold leakage is the dominant component of static power consumption. It should also be noted that static power generally scales with silicon area.
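The power components above are commonly summarized by a first-order textbook model. The sketch below is an illustration under that assumption and is not taken from this thesis; the function and parameter names (`alpha` for the switching activity factor, `c_load`, `vdd`, `freq`, `i_leak`) are hypothetical.

```python
def switching_power(alpha, c_load, vdd, freq):
    """First-order switching power in watts: alpha * C * Vdd^2 * f.

    alpha  : fraction of clock cycles in which the node toggles (assumed)
    c_load : switched capacitance in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    """
    return alpha * c_load * vdd ** 2 * freq


def static_power(i_leak, vdd):
    """First-order static power in watts: total leakage current * Vdd."""
    return i_leak * vdd


def total_power(alpha, c_load, vdd, freq, i_leak):
    """Sum of the two terms; short-circuit power is ignored in this sketch."""
    return switching_power(alpha, c_load, vdd, freq) + static_power(i_leak, vdd)
```

In this model, switching power depends quadratically on the supply voltage, while the static term is paid every cycle regardless of activity, which is why a structure that is both large and accessed every cycle is costly in both terms.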

In most cases, switching power dominates dynamic power. Thus, much of the related research on reducing dynamic power focuses on reducing switching power. However, static power begins to dominate total power consumption as process technology moves below 0.1 um, as shown in Fig. 1 (Reference 9). Therefore, how to reduce dynamic and static power simultaneously has become an important research issue.

Fig. 1 Power Trend

1.3 Importance of Low Power Design on Dynamic Branch Prediction

Dynamic branch prediction predicts the branch direction and the next instruction address of each branch instruction dynamically. Moreover, dynamic branch prediction performs well, with prediction accuracy ranging from 90% to 98%.

Dynamic branch prediction is typically performed at the first pipeline stage to eliminate pipeline stalls due to branches. A drawback arises here: since the fetched instruction cannot yet be identified as a branch at this stage, the dynamic branch predictor is always exercised. Worse yet, the branch target buffer (BTB), which supports dynamic branch prediction, is a large storage structure with both tag and data memories. Thus, dynamic branch prediction is power-hungry in terms of both dynamic and static power, yet it remains very attractive for processors in power-constrained applications because of its success in performance-oriented designs.

1.4 Introduction of I-Cache

In today's processor designs, the instruction cache (I-Cache) is an indispensable structure that supplies instructions every cycle, and it dominates the dynamic and static power consumption of the total system.

The I-Cache is composed of a valid bit, a tag array, and a data array, as shown in Fig. 2.

I-Cache organizations fall into three kinds: direct-mapped, set-associative, and fully associative. The I-Cache commonly uses a set-associative organization with a low number of ways, and it is addressed by the program counter. The cache line size is mostly 8 to 16 instructions.

Fig. 2 Introduction of I-Cache

The I-Cache usually supports read and write operations (Fig. 3). Take the five-stage MIPS pipeline as an example. In the IF stage, the instruction fetcher reads the I-Cache, using the index part of the program counter to select the corresponding cache line. It then compares the tag part of the program counter with the tag field of the I-Cache and relies on the valid bit to determine whether the access hits. A write is performed when an instruction read miss occurs: the instruction is read from another instruction memory and written into the I-Cache line chosen by the replacement policy of the I-Cache.
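The read path described above can be sketched as follows. This is a minimal illustration, not the thesis's simulator: the geometry (`LINE_BYTES`, `NUM_SETS`, `NUM_WAYS`) and all class and function names are assumptions chosen for the example, and the victim way is passed in by the caller rather than chosen by a replacement policy.

```python
from dataclasses import dataclass

# Assumed geometry for illustration only.
LINE_BYTES = 32   # 8 four-byte instructions per cache line
NUM_SETS = 64
NUM_WAYS = 4

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits of byte offset
INDEX_BITS = NUM_SETS.bit_length() - 1      # 6 bits of set index


def split_pc(pc):
    """Split a program counter into (tag, index, offset) fields."""
    offset = pc & (LINE_BYTES - 1)
    index = (pc >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = pc >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: bytes = b""


class ICache:
    def __init__(self):
        self.sets = [[Line() for _ in range(NUM_WAYS)] for _ in range(NUM_SETS)]

    def read(self, pc):
        """Return the 4-byte instruction at pc on a hit, or None on a miss."""
        tag, index, offset = split_pc(pc)
        for line in self.sets[index]:          # all ways of the indexed set
            if line.valid and line.tag == tag:
                return line.data[offset:offset + 4]
        return None

    def fill(self, pc, line_data, way):
        """Write a fetched line into the given way after a read miss."""
        tag, index, _ = split_pc(pc)
        self.sets[index][way] = Line(True, tag, line_data)
```

A miss followed by a fill and a second read shows the hit condition: `read` succeeds only when the indexed set holds a valid line whose tag matches the tag bits of the program counter.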

Fig. 3 Structure of 4-way I-Cache

1.5 Introduction of Branch Target Buffer

Using a branch target buffer (BTB) to predict the next instruction address of a branch is a popular dynamic branch prediction policy (e.g., Pentium 4, Alpha 21264, XScale). The BTB is a small cache memory that saves the branch target addresses of executed branch instructions. Each instruction looks up the BTB, which may return a predicted branch target address to reduce the performance loss caused by the branch penalty. BTB organizations fall into three kinds: direct-mapped, set-associative, and fully associative.

The BTB structure, shown in Fig. 4, is composed of a tag field, a status field, and a branch target address field. The status field consists of a valid bit and predictor bits. Moreover, the high-order bits of the BTB tag array are equal to the high-order bits of the I-Cache tag array. Fig. 5 shows a direct-mapped BTB.

Fig. 4 Introduction of BTB

The BTB usually supports read and write operations. In the five-stage MIPS pipeline, each instruction looks up the BTB in the IF stage, using the index part of the program counter to select the corresponding BTB entry, and then compares the tag part of the program counter with the tag field of that entry. The write operation updates the BTB in the EXE stage. There are two situations in which the BTB is updated. One is that a branch is executed and its branch information is not in the BTB. The other is that a branch is executed and its branch target address differs from the one stored in the BTB; the correct branch target address is compared with the predicted branch target address to decide whether to update the BTB. On an update, the valid bit, tag, and branch target address are written into the BTB.

The replacement policy of the BTB decides which BTB entry to replace.
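The lookup and update steps above can be sketched for a direct-mapped BTB as follows. This is an illustrative sketch under assumed parameters: `BTB_ENTRIES`, the 4-byte instruction size, and the class and method names are hypothetical, and the predictor bits of the status field are omitted for brevity.

```python
BTB_ENTRIES = 128                               # assumed size
BTB_INDEX_BITS = BTB_ENTRIES.bit_length() - 1   # 7 bits of index


class DirectMappedBTB:
    def __init__(self):
        self.valid = [False] * BTB_ENTRIES
        self.tag = [0] * BTB_ENTRIES
        self.target = [0] * BTB_ENTRIES

    @staticmethod
    def _split(pc):
        """Split pc into (tag, index), assuming 4-byte aligned instructions."""
        word = pc >> 2
        return word >> BTB_INDEX_BITS, word & (BTB_ENTRIES - 1)

    def lookup(self, pc):
        """IF stage: return the predicted target, or None on a BTB miss."""
        tag, index = self._split(pc)
        if self.valid[index] and self.tag[index] == tag:
            return self.target[index]
        return None

    def update(self, pc, actual_target):
        """EXE stage: write the entry if it is missing or its target is wrong.

        Returns True if the BTB was written, False if it was already correct.
        """
        if self.lookup(pc) == actual_target:
            return False
        tag, index = self._split(pc)
        self.valid[index] = True
        self.tag[index] = tag
        self.target[index] = actual_target
        return True
```

In a direct-mapped BTB each program counter maps to exactly one entry, so the replacement decision is trivial; a set-associative BTB would additionally need a policy to pick the victim way.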

Fig. 5 Structure of direct-mapped BTB

Common replacement policies in the I-Cache and BTB include Least Recently Used (LRU), First-In-First-Out (FIFO), and Random. LRU replaces the entry that was least recently used. FIFO replaces the entry that was inserted first. Random replaces a randomly chosen entry.
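The three policies can be compared with a small model of a single set. This is a sketch for illustration only: the class name `ReplacementSet`, the policy strings, and the seeded random generator are assumptions, and real hardware typically approximates LRU rather than keeping a full ordering.

```python
import random


class ReplacementSet:
    """One cache or BTB set that tracks tags under a given replacement policy."""

    def __init__(self, ways, policy, rng=None):
        self.ways = ways
        self.policy = policy                  # "lru", "fifo", or "random"
        self.order = []                       # front = next LRU/FIFO victim
        self.rng = rng or random.Random(0)    # seeded for repeatable runs

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag and evict if full."""
        if tag in self.order:
            if self.policy == "lru":
                # A hit refreshes recency only under LRU.
                self.order.remove(tag)
                self.order.append(tag)
            return True
        if len(self.order) == self.ways:
            if self.policy == "random":
                victim = self.rng.randrange(self.ways)
            else:
                victim = 0                    # oldest (FIFO) or least recent (LRU)
            self.order.pop(victim)
        self.order.append(tag)
        return False
```

For a 2-way set and the access sequence A, B, A, C, LRU evicts B (A was touched more recently), whereas FIFO evicts A (it was inserted first).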

1.6 Similar Features of I-Cache and BTB

From the introduction above, we can see several similar features of the I-Cache and BTB:

1> Each instruction has to access the I-Cache and BTB in the first pipeline stage.

2> Both of them have a tag array, and the high-order bits are the same.

3> The BTB is a cache memory in nature.

If the tag array of the I-Cache is shared with the BTB, it would help simplify the operation of BTB access. The BTB would also save area and power consumption by sharing the tag array of the I-Cache.

1.7 Our Design

We examined the share of each component in BTB power consumption and found that the tag array accounts for a critical portion (36%). Based on this observation, we propose an architecture in which the BTB shares the tag array of the I-Cache. In this architecture, the cache lines at the same index can use N BTB entries. Moreover, subject to priority, cache lines at different indexes can share the BTB entries that belong to one another. For branch instructions that still have no free BTB entry to use, we provide K additional BTB entries.

For BTB operation, we discuss three parts: identification, placement, and replacement.

Identification: how an entry is found if the information is in the BTB.

Placement: the possible places in which an entry can be placed.

Replacement: which BTB entry is replaced when a BTB miss occurs.

Simulation results show that we can reduce dynamic power consumption by as much as 42% and static power consumption by 24% compared with the independent BTB of the ARM-A8.

1.8 Thesis Organization
