Design of Power Aware Data Bus Codec

National Chiao Tung University

Department of Electrical and Control Engineering

Master's Thesis

Student: De-Wei Huang

Advisor: Prof. Chin-Teng Lin

Prof. You-Yeng Chen

A Thesis

Submitted to Institute of Electrical and Control Engineering

College of Electrical and Computer Engineering

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

Electrical and Control Engineering

July 2007

Hsinchu, Taiwan, Republic of China

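Abstract (Chinese)

This thesis proposes a power-aware data bus codec for bus transmission that lowers transition activity and thereby reduces power consumption. In simulations with an 8-bit bus width and a 50 pF external load capacitance, the codec reduces power consumption by 23% compared with the unencoded bus and by 6% compared with the RSH method. Its design features are: (1) the encoder and decoder require neither a large hardware cost nor a long processing time, so data can be transferred quickly while power is reduced efficiently; (2) the codec automatically selects the most suitable encoding for different applications. Test results show that, with only about 6% additional hardware cost, dynamic power is reduced by about 20% on average for multimedia data transmission and by about 50~60% on average for DCT and FIR programs. Furthermore, we integrate this low-power bus into a previously developed embedded RISC/DSP single-core processor and add digital low-power design techniques to the processor system architecture to reduce power consumption efficiently, aiming for a balance between performance and power. The design uses the TSMC 0.18 μm process; the chip area is about 2.11 × 2.11 mm², the estimated maximum operating frequency is 100 MHz, and the power consumption is about 16 mW.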


Design of Power Aware Data Bus Codec

Student：De-Wei Huang

Advisor：Dr. Chin-Teng Lin

Dr. You-Yeng Chen

Department of Electrical and Control Engineering

National Chiao-Tung University


Abstract

In this thesis, we propose a power-aware codec scheme to reduce transition activity in data bus design. The low-power data bus codec, which consists of transparent, inverter, XOR, and XNOR modules, achieves 23% and 6% power reduction compared with the uncoded bus and the R-S-H method, respectively, under an 8-bit width and a 50 pF capacitive load. The main features of this codec design are: (1) the codec saves 68% area overhead compared with the R-S-H design, and (2) the codec adaptively chooses the optimal encoding scheme for the different data types found in versatile applications. In the FIR and DCT benchmark simulations, the power is reduced by 50%~60% on average.

Furthermore, we integrate this data bus codec into a RISC/DSP unit-core processor, trading off cost against power. The chip is fabricated in the TSMC 0.18 μm CMOS process with a total area of 2.11 × 2.11 mm² and consumes 16 mW at 100 MHz with a 1.8 V supply voltage.


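Acknowledgments

My two years of graduate study come to a close with the completion of this thesis. During these two years, many people gave me encouragement and help, which allowed me to gain solid professional skills and complete my graduate studies smoothly. First, I would like to thank my advisor, Prof. Chin-Teng Lin. Prof. Lin is an outstanding professor in Taiwan with excellent research results in many fields. I thank him for providing an ideal research environment and sound guidance, which made my research go very smoothly. Under his careful supervision, I learned how to solve problems and the attitude that research requires. I am also most grateful to Prof. 范倫達 of the Department of Computer Science and to my senior labmate 鐘仁峰 for their teaching. In particular, when I was under pressure because the direction of my thesis topic was unclear, Prof. 范 gave me a great deal of help, and his friendly manner and knowledge made our thesis discussions relaxed and rewarding. In the lab, I often took problems large and small to 仁峰, and I thank him for teaching me so patiently, which improved my professional knowledge of integrated circuit design and broadened my horizons. I also thank all my lab partners: my seniors 經翔, 紹航, 峻谷, 家昇, 庭緯, and 有德; 靜瑩, with her sweet smile and curly hair; the cool all-round athlete 智文; the cheerful all-round athlete 俊傑; the lovely and enthusiastic 林玫; the "Hanlin scholar" 肇廷; and the junior labmates who kept me from finishing my thesis earlier. Thank you all for the mutual support and encouragement in research. Finally, I thank my father, mother, grandmother, and brother: your support has always been my warmest backing, and your encouragement is the source of my confidence.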

List of Tables

Chapter 1 Introduction
1.1 Brief Introduction
1.2 Organization of the Thesis

Chapter 2 Power Aware Data Bus Codec
2.1 Motivation
2.2 Related Works
2.2.1 Bus-Invert Bus Encoding
2.2.2 Zero-Transition Activity Encoding
2.2.3 A Coding Framework for Low Power Address and Data Busses
2.3 Power Aware Data Bus Codec
2.3.1 Proposal of Codec
2.3.2 Architecture of Codec
2.4 Power Aware Data Bus Codec Simulator
2.4.1 8 bits Power Aware Data Bus Codec Simulator
2.4.2 16 bits Power Aware Data Bus Codec Simulator
2.5 Result and Analysis

Chapter 3 Low Power Embedded Processor Design
3.1 Architecture of the Low Power Embedded Processor
3.1.1 Low Power Embedded Processor Core
3.1.2 Low Power Embedded Processor Instruction Set
3.2 Configurable Master-Slave I-Cache Controller
3.2.1 The Proposal of Configurable Master-Slave I-Cache Controller
3.3 High Performance Pipeline Design of Low Power Phased Cache
3.4 Tool Chain
3.4.1 Assembler
3.4.2 Simulator
3.5 Verification
3.5.1 Finite Impulse Response
3.5.2 Discrete Cosine Transform
3.5.3 Sobel Operator
3.6 Field-Programmable Gate Array (FPGA)
3.7 Summary

Chapter 4 Chip Implementation and Verification Results
4.1 Chip Fabrication
4.1.1 Chip Design Flow
4.1.2 Synthesis
4.1.3 Auto Placement and Routing (APR)
4.2 Power Analysis

Chapter 5 Conclusions and Future Works

Appendix
A. DRC and LVS Verification


List of Figures

Fig. 2-1. Harvard architecture with four busses.
Fig. 2-2. von Neumann architecture with two busses.
Fig. 2-3. von Neumann architecture with one bus.
Fig. 2-4. Bus-Invert Encoding.
Fig. 2-5. Zero-Transition Activity encoder/decoder.
Fig. 2-6. A general communication system.
Fig. 2-7. A general communication system of noiseless channel.
Fig. 2-8. A Practical communication system of noiseless channel.
Fig. 2-9. Occurrence distribution for EEG data before dbm.
Fig. 2-10. Occurrence distribution for EEG data after dbm.
Fig. 2-11. Waveform of the classic music.
Fig. 2-12. Data variation.
Fig. 2-13. Block diagram of Invert coding.
Fig. 2-14. Block diagram of XOR coding.
Fig. 2-15. Block diagram of XNOR coding.
Fig. 2-16. System architecture.
Fig. 2-17. Block diagram of encoder.
Fig. 2-18. Block diagram of decoder.
Fig. 2-19. Switch activity reduction for 8-bit data.
Fig. 2-21. Switch activity reduction for 8-bit data.
Fig. 2-24. Switch activity reduction for 16 bits data.
Fig. 2-28. Simulation for Multi-Media data.
Fig. 3-1. The architecture of processor.
Fig. 3-2. Pipeline processing flow.
Fig. 3-3. MACHR operation.
Fig. 3-4. The Configurable Master-Slave I-Cache controller algorithm.
Fig. 3-5. The improvement of MS-cache.
Fig. 3-6. The architecture of High performance pipeline design of low power phased cache.
Fig. 3-7. Cache access cycles & Power consumption.
Fig. 3-8. The assembler Figure.
Fig. 3-9. Assembler Interface.
Fig. 3-10. Software pipeline design flow.
Fig. 3-11. The simulator interface.
Fig. 3-12. FIR RTL simulation and simulator result.
Fig. 3-13. Switch activity for FIR.
Fig. 3-14. 1 dimension 8 by 8 DCT.
Fig. 3-15. 2 dimension 8-8 DCT RTL simulation and simulator result.
Fig. 3-16. Switch activity for DCT.
Fig. 3-17. Sobel Operator simulation.
Fig. 3-18. The Sobel operator result in FPGA and Matlab.
Fig. 4-1. Chip Design Flow.
Fig. 4-3. Chip Pin Description Diagram.
Fig. 4-4. 160pin-CQFP Bounding Diagram.
Fig. 4-5. DCT gate-level simulation.
Fig. 4-6. Sobel gate-level simulation.
Fig. 4-7. Power dissipation for Proposed and Original.


List of Tables

Table 2-1 Without Zero-Transition Activity Encoding
Table 2-2 With Zero-Transition Activity Encoding
Table 2-3 Example of Difference-Based Mapping (dbm)
Table 2-4 Example of Probability-Based Mapping (pbm)
Table 2-5 First Ten Data Sequences of Classic Music
Table 2-6 Data Variation
Table 2-7 Example of Classic Music before Using Invert
Table 2-8 Example of Classic Music after Using Invert
Table 2-9 Example of Classic Music before Using XOR
Table 2-10 Example of Classic Music after Using XOR
Table 3-1 Data Moving Instructions List
Table 3-2 Arithmetic & Logic Instructions List
Table 3-3 Branch/Jump Instructions List
Table 3-4 SIMD Instructions List
Table 3-5 Other Instructions List
Table 4-1 Synthesis Report
Table 4-2 APR Report


Chapter 1 Introduction

1.1 Brief Introduction

In the era of 3C integration, the mobile phone is no longer only a communication device; it also offers functions such as a digital camera, an MP3 player, and games. Consequently, only multi-function mobile phones can win the favor of consumers in the information market.

However, as the demand for performance and functionality in mobile phones increases, power consumption becomes an important design issue. Most companies seek not only high performance and low cost but also low-power design.

In other words, low power is a primary consideration in System on Chip (SoC) design, especially for handheld devices, because battery life is limited. To accomplish such challenging tasks, many design techniques, such as multi-Vth design [1][2], dynamic voltage scaling [3][4], gated clocks [5], and low-power on-chip memory architectures [6], have been proposed to reduce both dynamic power and leakage power. However, those design techniques require advanced design processes to reach the low-power goal.

Processors are increasingly limited by memory performance and system power consumption [7]. The power associated with off-chip accesses can dominate the overall power budget. The memory power problem is even more acute for processors that have memory-intensive access patterns and require streaming serial memory access, which tends to exhibit low temporal locality.


To reduce memory power, one approach is to consider how to schedule off-chip accesses optimally. The capacitance associated with the external bus is much larger than the internal node capacitance inside a microprocessor [7]. For example, a low-power embedded microprocessor system such as an Analog Devices ADSP-BF533 running at 500 MHz consumes about 374 mW on average during normal execution. Assuming a 3.65 V supply voltage and a 133 MHz bus frequency, the average external power consumed is around 170 mW, which accounts for approximately 30% of the overall system power dissipation. One factor affecting the capacitance, and hence the power, of the external bus is the bus width. For example, the power dissipation of a 16-bit bus is more than 30% larger than that of an 8-bit bus. As a consequence, design targets such as MP3 players, PDAs, and mobile phones usually use a narrow bus instead of a wide one.

Recently, R-S-H proposed a codec scheme to reduce the power consumption of data and address buses [16]. However, the table size in [16] is proportional to the bit width, which means that the wider the data, the more power is consumed. In this thesis, we are motivated to design a power-aware data bus codec that reduces dynamic power during data transmission. This power-aware codec is composed of transparent, inverter, XOR, and XNOR modules. We use audio, image, EEG, random, and specific data to verify the codec characteristics through simulation and compare the results with other encoding schemes. For the codec implementation, a RISC/DSP unit-core processor that integrates the proposed codec and a low-power cache controller design is used for verification. The chip has been fabricated in TSMC 0.18 μm CMOS technology with a total area of 2.11 × 2.11 mm². The maximum clock frequency is 100 MHz with a single 1.8 V supply voltage.

The proposed codec design has the following features:

(1) Low cost

The codec requires only a small hardware cost (about 5% of the total processor gate count) and a one-cycle processing time penalty.

(2) Low power

In the 8-bit simulation, our proposal achieves a 23% dynamic power reduction on the bus on average. For DSP functions such as DCT and FIR, it achieves a 50-60% dynamic power reduction on the bus. In PrimePower simulation, the proposed encoder and decoder consume only 0.8 mW.

(3) Awareness

A general encoder is usually suitable only for a few specific data streams or data properties. For instance, the Bus-Invert encoding scheme is effective only for data with sharp variability. Our proposed method compares the results of all encoding functions in the encoder and adaptively chooses the optimal encoding scheme for the different data types found in versatile applications.

1.2 Organization of the Thesis

This thesis is organized as follows. Chapter 1 gives a brief introduction to low-power design. Chapter 2 proposes a new power-aware codec design for the data bus. Chapter 3 demonstrates the integrated processor, including the proposed bus codec, and its tool chain. Chapter 4 shows the processor layout and simulation results. Finally, the last chapter presents conclusions and future work.


Chapter 2 Power-Aware Data Bus Codec

This chapter presents an adaptive data bus codec with the features of low power, low cost, and awareness, covering its proposal, architecture, and performance comparison.

2.1 Motivation

There are two major sources of power dissipation in digital CMOS circuits, which are summarized as follows [8][9]:

$$P = \alpha \times C \times V^{2} \times f + I_{leakage} \times V \tag{2-1}$$

where P, C, α, V, f, and I_leakage denote power consumption, capacitance, transition activity, supply voltage, clock frequency, and leakage current, respectively. The first and second terms represent the dynamic power and the leakage power, respectively. In the second term, the leakage current, which arises from substrate injection and sub-threshold effects, is primarily determined by the fabrication technology.

To reduce dynamic power, the main design principle is to minimize the values of V, C, f, and α in Eq. (2-1) [10]. Among the four parameters, lowering the supply voltage V, which has a quadratic effect, and the capacitance C are very efficient ways of decreasing power dissipation. However, in CMOS circuits designers usually reduce V and C at the layout level, and for larger digital circuits and systems reducing V and C is difficult in cell-based design. On the other hand, lowering the transition activity is a very promising way to reduce power consumption in cell-based design.
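As a rough illustration of the dynamic term of Eq. (2-1), the following Python sketch evaluates α × C × V² × f for two activity factors. The 50 pF load is the value quoted in the abstract; the 3.3 V swing, the 100 MHz frequency, both α values, and the helper name `dynamic_power` are illustrative assumptions, not figures from this thesis.

```python
def dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
    """Dynamic term of Eq. (2-1): alpha * C * V^2 * f, in watts."""
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz

# Illustrative numbers only (assumptions, not measurements from the thesis).
C_LOAD = 50e-12   # 50 pF external load on one bus line
VDD = 3.3         # assumed I/O supply voltage
FREQ = 100e6      # assumed bus clock frequency

p_before = dynamic_power(0.50, C_LOAD, VDD, FREQ)
p_after = dynamic_power(0.40, C_LOAD, VDD, FREQ)
saving = 1 - p_after / p_before   # power scales linearly with alpha

print(f"{p_before * 1e3:.2f} mW -> {p_after * 1e3:.2f} mW per line "
      f"({saving:.0%} saved)")
```

Because the dynamic term is linear in α, any relative reduction in transition activity translates into the same relative reduction in dynamic bus power, with V, C, and f left untouched.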


Generally speaking, the bus accounts for between 10% and 80% of the power dissipation of a microprocessor. Busses fall into two categories: external and internal. The external bus carries external memory data and I/O data. The internal bus carries internal memory, cache, and IP data. The power dissipation of external busses is usually about a hundred times larger than that of internal busses [8]. Thus, we are motivated to solve this critical data bus power problem at the architecture and logic levels. In this thesis, we propose a power-aware encoder and decoder that reduce the data transition activity α and thus save power.

Bus streams fall into four types, with the following properties [11].

(1) Instruction address stream: Instruction addresses are often consecutive. As a result, the instruction address stream is very predictable.

(2) Data address stream: Data accesses may be consecutive when accessing arrays; otherwise, the data address stream is random. Although data addresses are less predictable, they still follow the principles of spatial and temporal locality.

(3) Instruction stream: Most ISAs (Instruction Set Architectures) exhibit some regularity, and instructions can be partitioned into fixed-location fields. As a result, the instruction stream is predictable within its fixed-location fields.

(4) Data stream: The sequence is not predictable. The values vary irregularly with different applications and algorithms.

The above bus stream properties have been widely applied to three off-the-shelf computer architectures.

(a) Harvard architecture with four busses:

Fig. 2-1. Harvard architecture with four busses.

The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. Each address bus and data bus serves only the instruction memory or only the data memory. As a result, each stream has an independent bus and is easily controlled.

(b) von Neumann architecture with two busses:

Fig. 2-2. von Neumann architecture with two busses.

The von Neumann architecture is a computer architecture that uses a single storage structure to hold both instructions and data. The instruction address stream and the data address stream share one bus, and the instruction stream and the data stream share the other.

(c) von Neumann architecture with one bus:

Fig. 2-3. von Neumann architecture with one bus.

All streams run on the same bus. This bus needs more signals to control the stream operations.

2.2 Related Works

In this section, we introduce related research on low-power bus encoding. We begin with a brief subsection on Bus-Invert encoding. Bus-Invert encoding [12] is a traditional encoding from early low-power designs. It has the advantage of a low-cost hardware implementation. In Section 2.2.2, we introduce Zero-Transition Activity encoding [15]. In Section 2.2.3, we present a coding framework for low power address and data busses [16].

2.2.1 Bus-Invert Bus Encoding

We consider the activity on a typical data bus to be characterized by a random, uniformly distributed sequence of values [13][14]. The assumption of random, uniformly distributed inputs is also conveniently made by most statistical power estimation methods. With this assumption, in any given time-slot the data on an n-bit wide bus can be any of 2^n possible values with equal probability. The average number of transitions per time-slot is n/2. For example, on an eight-bit bus there is an average of 4 transitions per time-slot, or 0.5 transitions per bus-line per time-slot.
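The n/2 figure can be checked by brute force. The sketch below, a minimal illustration with hypothetical helper names, averages the Hamming distance over every ordered pair of n-bit values, which corresponds to the uniform-distribution assumption stated above.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def average_transitions(n_bits):
    """Average Hamming distance over all ordered pairs of n-bit values."""
    size = 1 << n_bits
    total = sum(hamming(a, b) for a in range(size) for b in range(size))
    return total / (size * size)

print(average_transitions(8))  # 4.0
```

Each bus line toggles with probability 1/2 under this assumption, so the per-line average is 0.5 and the total over n lines is n/2.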

When all the bus lines toggle at the same time (the probability of this happening in any time-slot is 1/2^n), there is a maximum of n transitions in a time-slot, and thus the worst-case power dissipation is proportional to n.

The Bus-Invert method [12] uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value. When invert = 1 the bus value is the inverted data value. The worst-case power dissipation can then be halved by coding the bus as follows (Bus-Invert method):

(1) Compute the Hamming distance (the number of bits in which they differ) between the present bus value and the last data value.

(2) If the Hamming distance is larger than n/2, set invert = 1 and make the present bus value equal to the inverted present data value.

(3) Otherwise, let invert = 0 and let the present bus value equal the present data value.

(4) At the decoder side, the contents of the bus must be conditionally inverted according to the invert line. In any case, the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).

Bus-Invert encoding has the advantage that the maximum number of transitions per time-slot is reduced from n to n/2. Therefore, the worst-case power dissipation of the bus is halved. Fig. 2-4 shows a 16-bit data sequence that uses Bus-Invert encoding to decrease the number of transitions.

Data 0 : 1000000100110101
Data 1 : 1000000010000001
Data 2 : 1100000001111111
INV    : 0011111110000000

Fig. 2-4. Bus-Invert Encoding.

The Hamming distance between data 0 and data 1 is smaller than 8, so invert = 0. However, the Hamming distance between data 1 and data 2 is bigger than 8, so invert = 1 and data 2 is inverted.

2.2.2 Zero-Transition Activity Encoding

The Zero-Transition Activity scheme is related to Bus-Invert encoding: both Bus-Invert encoding [12] and Zero-Transition Activity encoding [15] rely on the addition of a redundant line to reduce the total number of transitions that occur when streams of patterns are transmitted over the bus. For example, Bus-Invert encoding uses a redundant line, INV, that controls the data encoding for power reduction.

The basic idea of the Zero-Transition Activity encoding scheme, called the T0 code, is to avoid transferring consecutive addresses on the bus by using a redundant line, INC, to tell the receiving sub-system that the addresses are sequential. When two addresses in the stream to be transmitted are consecutive, the INC line is set to 1, the address bus lines are frozen (to avoid unnecessary switching activity), and the new address is computed directly by the receiver. On the other hand, when two addresses are not consecutive, the INC line is driven to 0 and the bus lines operate normally.


If all addresses of the ideal stream are consecutive, the INC line is always high and the bus lines never switch. Consequently, the switching activity of the T0 code is zero transitions per emitted consecutive address.

More formally, the Zero-Transition Activity encoding (T0 code) scheme can be described by Eq. (2-2):

$$(B(t),\, INC(t)) = \begin{cases} (B(t-1),\, 1), & \text{if } t > 0 \text{ and } b(t) = b(t-1) + S \\ (b(t),\, 0), & \text{otherwise} \end{cases} \tag{2-2}$$

where B(t) is the value on the encoded bus lines at time t, INC(t) is the additional bus line, b(t) is the address value at time t, and S is a constant increment that we call the stride. The corresponding decoding scheme can be formally defined by Eq. (2-3):

$$b(t) = \begin{cases} b(t-1) + S, & \text{if } INC = 1 \text{ and } t > 0 \\ B(t), & \text{if } INC = 0 \end{cases} \tag{2-3}$$

Notice that the T0 code retains its zero-transition property even if the addresses are incremented by a constant stride equal to a power of two (as is often the case for practical machines that are byte addressable but access data or instructions aligned at word boundaries).

We take an example that shows Zero-Transition Activity encoding following Eqs. (2-2) and (2-3). Table 2-1 lists the switching activity of the original data transfer; the total number of transitions is 10 from cycle 0 to cycle 6. Table 2-2 lists the data transmission with Zero-Transition Activity encoding. At a given clock cycle t (t = 1 to 6 for Table 2-2), the encoder increments the address of cycle t - 1 and compares the result with the address generated at cycle t. If the incremented old (t - 1) address and the new (t) address are equal, the INC line is raised and the old address is left on the bus. The encoder/decoder architecture is shown in Fig. 2-5. The incrementer can be programmable so that the constant increment S can be defined flexibly. In Table 2-2, S is defined as 1.

The decoder architecture is simple. At any given clock cycle, the last cycle's address is incremented. If the INC line is high, the old incremented value is used for addressing; otherwise, the value coming from the bus lines is selected. Finally, we find that the total number of transitions becomes 4. Zero-Transition Activity encoding freezes the address value on the bus when the addresses are consecutive, so that power dissipation is reduced efficiently.

Fig. 2-5. Zero-Transition Activity encoder/decoder.

Table 2-1 Without Zero-Transition Activity Encoding (continuous bus address transition)

cycle   Address to be transferred   Address on bus
0       00000000                    00000000
1       00000001                    00000001
2       00000010                    00000010
3       00000011                    00000011
4       00001000                    00001000
5       00001001                    00001001
6       00001010                    00001010
Total transitions: 10

Table 2-2 With Zero-Transition Activity Encoding (continuous bus address transition)

cycle   Address to be transferred   Address on bus   INC
0       00000000                    00000000         0
1       00000001                    frozen           1
2       00000010                    frozen           1
3       00000011                    frozen           1
4       00001000                    00001000         0
5       00001001                    frozen           1
6       00001010                    frozen           1
Total transitions: 4
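The T0 rules of Eqs. (2-2) and (2-3) can be sketched in a few lines of Python. This is a minimal software model with hypothetical function names, not the hardware of Fig. 2-5. It assumes the bus starts at zero and counts the INC line as one more bus line when totaling transitions, which is the counting that yields the totals in Tables 2-1 and 2-2.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def t0_encode(addresses, stride=1):
    """Return (bus_value, inc) pairs following Eq. (2-2)."""
    encoded = []
    bus = 0  # assumed initial bus state
    for t, addr in enumerate(addresses):
        if t > 0 and addr == addresses[t - 1] + stride:
            inc = 1              # consecutive: freeze the bus lines
        else:
            bus, inc = addr, 0   # not consecutive: drive the address
        encoded.append((bus, inc))
    return encoded

def t0_decode(encoded, stride=1):
    """Recover the address stream following Eq. (2-3)."""
    addresses = []
    for t, (bus, inc) in enumerate(encoded):
        if inc == 1 and t > 0:
            addresses.append(addresses[-1] + stride)
        else:
            addresses.append(bus)
    return addresses

def count_transitions(words):
    """Total Hamming distance between successive words."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))

stream = [0b00000000, 0b00000001, 0b00000010, 0b00000011,
          0b00001000, 0b00001001, 0b00001010]
encoded = t0_encode(stream)
plain = count_transitions(stream)
# Append INC as an extra least significant line so its toggles are counted too.
coded = count_transitions([(bus << 1) | inc for bus, inc in encoded])
print(plain, coded)  # 10 4
```

Calling `t0_decode(encoded)` returns the original address stream, since the receiver regenerates every frozen address by adding the stride to the previous one.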

2.2.3 A Coding Framework for Low Power Address and Data Busses

In this section, we present a source-coding framework for describing low power encoding schemes and then employ the framework to develop new encoding schemes [16].

In the framework proposed here, a data source is first processed by a decorrelating function f1. Next, a variant of the entropy coding function, f2, is employed, which reduces the transition activity.

In entropy coding, signal samples that have a higher probability of occurrence are assigned code words with fewer ON bits. This scheme suits systems in which the power dissipation depends on the number of ON bits. In VLSI systems, however, power dissipation depends on the number of transitions rather than the number of ON bits.

A general communication system, shown in Fig. 2-6, consists of a source coder, a channel coder, a noisy channel, a channel decoder, and a source decoder. The source coder (decoder) compresses (decompresses) the input data so that the number of bits required to represent the source is minimized. While the source coder removes redundancy, the channel coder adds just enough of it to combat errors that may arise from noise in the physical channel.

Fig. 2-6. A general communication system.

We consider the bus between two chips as the physical channel, and the transmitter and receiver blocks as part of the pad circuitry that drives (in the case of the transmitting chip) or detects (in the case of the receiving chip) the data signals. We assume here that the signal levels are sufficiently high that the channel can be considered noiseless. The noiseless channel assumption allows us to eliminate the channel coder, resulting in the system shown in Fig. 2-7.

Fig. 2-7. A general communication system of noiseless channel.

There are two functions, f1 and f2, in the source encoder shown in Fig. 2-8. The function f1 decorrelates the input so that all linear dependencies are removed. The function f2 employs a variant of encoding whereby, instead of minimizing the average number of bits at the output, it reduces the average number of transitions. Therefore, the function f1 decorrelates the input and adjusts the input probability distribution for the subsequent encoding.

Fig. 2-8. A Practical communication system of noiseless channel. The source encoder consists of f1 (decorrelator) and f2 (encoder); the source decoder consists of the inverse of f2 (decoder) and the inverse of f1 (correlator).


In this thesis, we choose Difference-Based Mapping as the function f1 and Probability-Based Mapping as the function f2. In a later chapter, we compare this encoding method with other encoding schemes, including Bus-Invert, XOR, XNOR, and the proposed scheme.

The Difference-Based Mapping (dbm) method is given in Eq. (2-4), where x(n) is the input data and the prediction x̂(n) is a function of the past values of x(n). The dbm function returns the difference between x(n) and x̂(n), properly adjusted so that the output fits in the available B bits.

$$dbm(x(n), \hat{x}(n)) = \begin{cases} 2x(n) - 2\hat{x}(n), & \text{if } x(n) \ge \hat{x}(n) \text{ and } 2\hat{x}(n) \ge x(n) \\ 2\hat{x}(n) - 2x(n) - 1, & \text{else if } x(n) < \hat{x}(n) \text{ and } 2\hat{x}(n) - x(n) < 2^{B} \\ x(n), & \text{else if } \hat{x}(n) < 2^{B-1} \\ 2^{B} - 1 - x(n), & \text{otherwise} \end{cases} \tag{2-4}$$

In the Difference-Based Mapping (dbm), we define four ranges for mapping: {x̂(n) < 2^(B-1)}, {2x̂(n) - 2^B ≤ x(n) ≤ x̂(n)}, {x̂(n) < x(n) < 2x̂(n)}, and others. The proper calculation is chosen according to these four mapping ranges. In the example listed in Table 2-3, the dbm output is 0 when the current x(n) equals the prediction x̂(n), and the output value increases as the distance between x(n) and x̂(n) increases. The goal of dbm is to shift the overall data distribution toward 0 so that the number of transitions is reduced. The occurrence distribution of 8-bit EEG data before and after dbm is shown in Fig. 2-9 and Fig. 2-10. The dbm skews the original distribution for most data sets and hence enables the function f2, Probability-Based Mapping (pbm), to reduce the number of transitions even more.

Table 2-3 Example of Difference-Based Mapping (dbm)

x̂(n)   x(n)   dbm(x(n), x̂(n))
011     000    101
011     001    011
011     010    001
011     011    000
011     100    010
011     101    100
011     110    110
011     111    111
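The four-way case split of dbm is easier to follow as code. The sketch below is one reading of Eq. (2-4) in Python, with a hypothetical function signature; it evaluates the 3-bit case with prediction x̂(n) = 011 that Table 2-3 uses.

```python
def dbm(x, x_hat, bits):
    """Difference-based mapping of x given the prediction x_hat (Eq. 2-4)."""
    if x >= x_hat and 2 * x_hat >= x:
        return 2 * x - 2 * x_hat          # at or above the prediction: even codes
    if x < x_hat and 2 * x_hat - x < (1 << bits):
        return 2 * x_hat - 2 * x - 1      # below the prediction: odd codes
    if x_hat < (1 << (bits - 1)):
        return x                          # far above a small prediction
    return (1 << bits) - 1 - x            # far below a large prediction

# 3-bit example with prediction 011.
row = [dbm(x, 0b011, 3) for x in range(8)]
print([format(v, "03b") for v in row])
```

Under this reading the mapping is one-to-one for every prediction value, which is what allows a decoder that knows x̂(n) to invert it. Values close to the prediction receive small codes, alternating between even codes at or above it and odd codes below it.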

Fig. 2-9. Occurrence distribution for EEG data before dbm.

Fig. 2-10. Occurrence distribution for EEG data after dbm.

The Probability-Based Mapping (pbm) is a sorting-based method for reducing the number of '1's. It satisfies the condition given below.

$$\text{if } \Pr(i) > \Pr(j) \text{ then } pbm(i) \le pbm(j), \quad \forall\, (i, j) \tag{2-5}$$

The probabilities in (2-5) can be computed using a representative data sequence. If the most probable value is i, then pbm(i) = 0. If the second most probable value is j, then pbm(j) = 1, and so on. Therefore, all values are mapped to values in 2^i (i = 0…B-1) by pbm. A sorting table can be built according to probability. An example of pbm is listed in Table 2-4.

Table 2-4 Example of Probability-Based Mapping (pbm)

i     Pr(i)   pbm(i)
000   0.37    000
001   0.14    010
010   0.22    001
011   0.11    011
100   0.05    101
101   0.03    110
110   0.06    100
111   0.02    111
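The pbm table can be built by sorting values by probability. The following Python sketch assumes, as the entries of Table 2-4 suggest, that code words are assigned in increasing numerical order of rank (most probable value to 000, next to 001, and so on); `build_pbm` is a hypothetical helper name.

```python
def build_pbm(probabilities):
    """Map each value to its probability rank; the most probable value gets 0."""
    ranked = sorted(probabilities, key=lambda v: -probabilities[v])
    return {value: rank for rank, value in enumerate(ranked)}

# Probabilities Pr(i) taken from Table 2-4.
pr = {0b000: 0.37, 0b001: 0.14, 0b010: 0.22, 0b011: 0.11,
      0b100: 0.05, 0b101: 0.03, 0b110: 0.06, 0b111: 0.02}
pbm = build_pbm(pr)

for value in sorted(pr):
    print(format(value, "03b"), pr[value], format(pbm[value], "03b"))
```

In hardware the same mapping is a lookup table with one entry per possible value, which is the table whose size grows with the data width.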

In summary, transition activity can be reduced by combining the dbm and pbm encoding schemes, which assign code words with fewer ON bits to values that have a higher probability of occurrence. In VLSI circuits, power dissipation depends on the number of transitions occurring at the capacitive nodes of the circuit. Unfortunately, dbm + pbm requires more hardware to build the input probability distribution table and more execution time for encoding.

2.3 Power Aware Data Bus Codec

Different encoding schemes suit different data properties and correlations. Zero-Transition Activity encoding [15], which needs highly correlated, slowly varying data, is suitable for instruction memory. The Bus-Invert encoding method [12], which needs weakly correlated, rapidly varying data, is suitable for data memory. The dbm and pbm encoding schemes [16] have the advantage that they can change the correlation of the data and choose proper values by probability mapping. They are suitable for a specific data value range, but they pay a heavy penalty in hardware implementation cost.

On the other hand, although the data width is constant, the variation of the most significant bit group (MSBG) generally differs from that of the least significant bit group (LSBG). For an 8-bit data width, we define the MSBG as bits 4 to 7 and the LSBG as bits 0 to 3. For example, we take the first ten decimal data values in Fig. 2-11 and list their binary representations in Table 2-5. In Table 2-5, the data values range between 115 and 151, and the variation of the MSBG is smoother than that of the LSBG. Fig. 2-12 shows the variation curves.

Fig. 2-11. Waveform of the classic music.

Table 2-5 First Ten Data Sequences of Classic Music

No.   Value (decimal)   Value (binary)
1     140               1000_1100
2     131               1000_0011
3     146               1001_0010
4     151               1001_0111
5     136               1000_1000
6     125               0111_1101
7     115               0111_0011
8     130               1000_0010
9     145               1001_0001
10    139               1000_1011

Fig. 2-12. Data variation (variation in % of the MSBG and the LSBG over cycles 1 to 100).

Table 2-6 Data Variation

                         MSBG    LSBG
Total Hamming distance   126     200
Average of variation     31.5%   50%

The difference between the MSBG and the LSBG is obvious in Fig. 2-12. Therefore, unlike in [19], we can separate the data width into specific blocks such that the proper encoding can be applied to each block, and the transition activity of data transmission can be reduced by encoding.
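The MSBG/LSBG comparison can be reproduced on a small scale. The Python sketch below, with hypothetical helper names, splits each 8-bit sample into its upper and lower nibbles and totals the Hamming distance per group for the ten decimal values of Table 2-5. Table 2-6 appears to cover 100 cycles, so its totals differ from this ten-sample run.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def group_transitions(samples):
    """Return (MSBG, LSBG) total Hamming distances for 8-bit samples."""
    pairs = list(zip(samples, samples[1:]))
    msbg = sum(hamming(a >> 4, b >> 4) for a, b in pairs)      # bits 7..4
    lsbg = sum(hamming(a & 0xF, b & 0xF) for a, b in pairs)    # bits 3..0
    return msbg, lsbg

# Decimal values from Table 2-5.
samples = [140, 131, 146, 151, 136, 125, 115, 130, 145, 139]
msbg, lsbg = group_transitions(samples)
slots = 4 * (len(samples) - 1)   # 4 lines per group, 9 sample-to-sample steps

print(msbg, lsbg)  # 12 21
print(f"MSBG {msbg / slots:.1%}, LSBG {lsbg / slots:.1%}")
```

Even on ten samples the lower group switches almost twice as often as the upper group, which is the asymmetry that motivates encoding the two groups separately.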

2.3.1 Proposed Data Bus Codec

The encoder provides four kinds of encoding schemes: Invert, XOR [17][18], XNOR [17][18], and original (transparent). We introduce each encoding algorithm and the data type that suits it.

The Invert function is given in Eq. (2-6), where Hamming(x(n), x̂(n)) returns the Hamming distance between the current data x(n) and the previous data x̂(n). If the Hamming distance exceeds half the number of bus lines, the input is inverted and the inversion is signaled using an extra bit. An example of classic music before using Invert is listed in Table 2-7, and an example of classic music after using Invert is listed in Table 2-8.

$$y(n) = \begin{cases} inv(x(n)), & \text{if } Hamming(x(n), \hat{x}(n)) > \dfrac{Bitwidth}{2} \\ x(n), & \text{otherwise} \end{cases} \tag{2-6}$$


Table 2-7 Example of Classic Music before Using Invert

cycle  previous x̄(n)  current x(n)  transitions
1      00000000       10001100      3
2      10001100       10000011      4
3      10000011       10010010      2
4      10010010       10010111      2
5      10010111       10001000      5
6      01110111       01011101      3
7      01011101       01010011      3
8      01010011       10000010      4
9      10000010       10010001      3
10     10010001       10101010      5
Total transitions                   34

Table 2-8 Example of Classic Music after Using Invert

cycle  previous x̄(n)  current x(n)  Hamming(x(n), x̄(n))  Y(n)      Inv  transitions
1      00000000       10001100      3                     10001100  off  3
2      10001100       10000011      4                     10000011  off  4
3      10000011       10010010      2                     10010010  off  2
4      10010010       10010111      2                     10010111  off  2
5      10010111       10001000      5                     01110111  on   (*)3
6      01110111       01011101      3                     01011101  off  3
7      01011101       01010011      3                     01010011  off  3
8      01010011       10000010      4                     10000010  off  4
9      10000010       10010001      3                     10010001  off  3
10     10010001       10101010      5                     01010101  on   (*)3
Total transitions                                                        30

The block diagram of Invert encoding is sketched in Fig. 2-13, where the Hamming function is composed of eight exclusive-OR gates and adders for an 8-bit input.

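Fig. 2-13. Block diagram of Invert coding (Value n and Value n-1 enter the Hamming block; if the distance is greater than 4, the inverted Value n is output, otherwise Value n is output unchanged; all data paths are 8 bits wide).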
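As an illustration of the Invert rule, the following sketch encodes and decodes a short sequence. The function names are hypothetical, and the bus is assumed to start at all zeros, as in the first row of Table 2-8.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bit positions."""
    return bin(a ^ b).count("1")


def invert_encode(data, width=8):
    """Invert coding: returns a list of (bus_value, inv_flag) pairs."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state
    encoded = []
    for x in data:
        if hamming(x, prev) > width // 2:
            y, inv = ~x & mask, 1   # more than half the lines would toggle
        else:
            y, inv = x, 0
        encoded.append((y, inv))
        prev = y                    # compare next word against the actual bus value
    return encoded


def invert_decode(encoded, width=8):
    """Undo invert coding using only the bus value and the extra bit."""
    mask = (1 << width) - 1
    return [(~y & mask) if inv else y for y, inv in encoded]


# Cycles 1 to 5 of Table 2-8.
samples = [0b10001100, 0b10000011, 0b10010010, 0b10010111, 0b10001000]
encoded = invert_encode(samples)
print([(format(y, "08b"), inv) for y, inv in encoded])
```

Only the fifth word exceeds the threshold of four toggling lines, so it is the only one sent inverted, which agrees with the Inv column of Table 2-8 for these cycles.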

The XOR function is given in Eq. 2-7, where XOR($x(n)$, $\bar{x}(n)$) returns the bitwise exclusive-OR of the current data $x(n)$ and the previous data $\bar{x}(n)$. If Hamming(XOR($x(n)$, $\bar{x}(n)$), $\bar{x}(n)$) is smaller than Hamming($x(n)$, $\bar{x}(n)$), the output for transmission equals XOR($x(n)$, $\bar{x}(n)$). Otherwise, the output for transmission is unchanged.

For example, classic music coding results using the transparent and XOR coding schemes are listed in Table 2-9 and Table 2-10.

$$
y(n) =
\begin{cases}
\mathrm{XOR}(x(n), \bar{x}(n)), & \text{if } \mathrm{Hamming}(x(n), \bar{x}(n)) > \mathrm{Hamming}(\mathrm{XOR}(x(n), \bar{x}(n)), \bar{x}(n)) \\
x(n), & \text{otherwise}
\end{cases}
\qquad
\mathrm{XOR}(x(n), \bar{x}(n)) = x(n) \oplus \bar{x}(n)
\tag{2-7}
$$

Table 2-9 Example of Classic Music before Using XOR

cycle  previous x̄(n)  current x(n)  transitions
1      00000000       10001100      3
2      10001100       10000011      4
3      10000011       10010010      2
4      10010010       10010111      2
5      10010111       10001000      5
6      01110111       01011101      3
7      01011101       01010011      3
8      01010011       10000010      4
9      10000010       10010001      3
10     10010001       10101010      5
Total transitions                   34

Table 2-10 Example of Classic Music after Using XOR

H1 = Hamming(x(n), x̄(n)); H2 = Hamming(XOR(x(n), x̄(n)), x̄(n))

cycle  previous x̄(n)  current x(n)  H1  H2  Y(n)      XOR  transitions
1      00000000       10001100      3   3   10001100  off  3
2      10001100       10000011      4   3   00001111  on   (*)3
3      00001111       10010010      5   2   10011101  on   (*)2
4      10011101       10010111      2   5   10010111  off  2
5      10010111       10001000      5   3   00011111  on   (*)3
6      00011111       01011101      2   5   01011101  off  3
7      01011101       01010011      3   4   01010011  off  3
8      01010011       10000010      4   2   11010001  on   (*)2
9      11010001       10010001      1   3   10010001  off  3
10     10010001       10101010      5   4   00111011  on   (*)4
Total transitions                                          28

The block diagram of XOR encoding is sketched in Fig. 2-14. The conditional block selects whichever candidate yields the smaller Hamming distance.

Fig. 2-14. Block diagram of XOR coding (Value n and Value n-1 enter the Hamming blocks; the comparison A>B selects between XOR(Value n) and Value n as the 8-bit output).
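A minimal sketch of the XOR rule of Eq. 2-7 follows. The function names are hypothetical, the previous data is taken to be the value last placed on the bus, and the bus is assumed to start at all zeros.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bit positions."""
    return bin(a ^ b).count("1")


def xor_encode(data):
    """XOR coding per Eq. 2-7: returns a list of (bus_value, xor_flag) pairs."""
    prev = 0  # assumed initial bus state
    encoded = []
    for x in data:
        candidate = x ^ prev
        if hamming(x, prev) > hamming(candidate, prev):
            y, flag = candidate, 1
        else:
            y, flag = x, 0
        encoded.append((y, flag))
        prev = y
    return encoded


def xor_decode(encoded):
    """Recover the data from the bus values and the extra bit."""
    prev = 0
    decoded = []
    for y, flag in encoded:
        decoded.append(y ^ prev if flag else y)
        prev = y  # the decoder tracks the bus value, like the encoder
    return decoded


# x(n) column of Table 2-9.
samples = [0b10001100, 0b10000011, 0b10010010, 0b10010111, 0b10001000,
           0b01011101, 0b01010011, 0b10000010, 0b10010001, 0b10101010]
encoded = xor_encode(samples)
print([(format(y, "08b"), "on" if f else "off") for y, f in encoded])
```

Run on the x(n) column of Table 2-9, this sketch yields the same Y(n) values and on/off pattern as Table 2-10, and decoding returns the original samples.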

The XNOR function is given in Eq. 2-8, where XNOR($x(n)$, $\bar{x}(n)$) returns the bitwise exclusive-NOR of the current data $x(n)$ and the previous data $\bar{x}(n)$. If Hamming(XNOR($x(n)$, $\bar{x}(n)$), $\bar{x}(n)$) is smaller than Hamming($x(n)$, $\bar{x}(n)$), the output for transmission equals XNOR($x(n)$, $\bar{x}(n)$). Otherwise, the output for transmission is unchanged. The encoding is signaled using an extra bit.

$$
y(n) =
\begin{cases}
\mathrm{XNOR}(x(n), \bar{x}(n)), & \text{if } \mathrm{Hamming}(x(n), \bar{x}(n)) > \mathrm{Hamming}(\mathrm{XNOR}(x(n), \bar{x}(n)), \bar{x}(n)) \\
x(n), & \text{otherwise}
\end{cases}
\qquad
\mathrm{XNOR}(x(n), \bar{x}(n)) = \sim\!\left(x(n) \oplus \bar{x}(n)\right)
\tag{2-8}
$$

The logic diagram is shown in Fig. 2-15. The conditional block selects whichever candidate yields the smaller Hamming distance.

Fig. 2-15. Block diagram of XNOR coding (Value n and Value n-1 enter the Hamming blocks; the comparison A>B selects between XNOR(Value n) and Value n as the 8-bit output).
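The XNOR rule of Eq. 2-8 differs from the XOR rule only in the candidate value and in how the decoder undoes it. The sketch below uses hypothetical function names and assumes an all-zero initial bus.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bit positions."""
    return bin(a ^ b).count("1")


def xnor_encode(data, width=8):
    """XNOR coding per Eq. 2-8: returns (bus_value, xnor_flag) pairs."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state
    encoded = []
    for x in data:
        candidate = ~(x ^ prev) & mask
        if hamming(x, prev) > hamming(candidate, prev):
            y, flag = candidate, 1
        else:
            y, flag = x, 0
        encoded.append((y, flag))
        prev = y
    return encoded


def xnor_decode(encoded, width=8):
    """Invert the bus value, then XOR with the previous bus value."""
    mask = (1 << width) - 1
    prev = 0
    decoded = []
    for y, flag in encoded:
        decoded.append(((~y & mask) ^ prev) if flag else y)
        prev = y
    return decoded


samples = [0xFF, 0xF0, 0x0F, 0xAA]
encoded = xnor_encode(samples)
```

In this example the first word 0xFF would toggle all eight lines, while its XNOR with the idle bus is 0x00 and toggles none, so the XNOR candidate is chosen.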

2.3.2 Architecture of Codec

The overall codec system is shown in Fig. 2-16. The proposed codec is placed between the I/O and external memory interface on one side and the I/O and external memory modules on the other. The extra bit lines on the bus notify the decoder which decoding function to apply.

The proposed encoder is composed of four encoding functions. It targets different kinds of data types and adaptively chooses the optimal encoding for transmission. Because bit groups at different positions show different kinds of variation, the transmission data is separated into several blocks for encoding.

Fig. 2-16. System architecture (the CPU's I/O and external memory interface and the I/O and external memory each sit behind an encoder/decoder, connected by the bus and its extra bits).

[Figure: encoder architecture. The input data is split into two N/2-bit groups for Encoder 1 and Encoder 2; in each encoder a clocked register holds Value n-1, Value n passes through INV, XOR, and XNOR, and a comparator drives the output MUX.]

The input data for transmission is separated into two or more bit groups. Each bit group has its own encoder.

The current data Value n and the previous data Value n-1 enter the INV, XOR, and XNOR functions. The comparator then chooses the encoding with the minimum Hamming distance and sends the encoded data onto the bus.

The decoder architecture is similar to the encoder architecture.

It has two inputs: the transmitted data on the bus and the extra bits. The decoder relies on the extra bits from the encoder to decode the transmitted data. The extra bits select one of four decoding functions: INV, XOR, XNOR, and transparent. The MUX selects the matching decoding function, which returns the data to its original form. Fig. 2-18 shows the decoder architecture diagram.

[Fig. 2-18: decoder architecture. A clocked register holds Value n-1; the extra bits drive a MUX that selects the decoded output.]
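The encoder and decoder described above can be sketched in software as follows. This is a behavioural illustration only, not the RTL design: the 2-bit mode codes, the tie-breaking order, the all-zero initial bus state, and all function names are assumptions made for the example.

```python
TRANSPARENT, INV, XOR, XNOR = range(4)  # assumed 2-bit mode codes per group


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def encode_group(x, prev, width):
    """Pick the candidate closest (in Hamming distance) to the previous bus value."""
    mask = (1 << width) - 1
    cands = [x, ~x & mask, x ^ prev, ~(x ^ prev) & mask]
    mode = min(range(4), key=lambda m: hamming(cands[m], prev))
    return cands[mode], mode


def decode_group(y, mode, prev, width):
    mask = (1 << width) - 1
    if mode == TRANSPARENT:
        return y
    if mode == INV:
        return ~y & mask
    if mode == XOR:
        return y ^ prev
    return (~y & mask) ^ prev  # XNOR


def encode_stream(data, width=8, groups=2):
    gw = width // groups
    gmask = (1 << gw) - 1
    prev = [0] * groups  # previous bus value of each bit group
    out = []
    for x in data:
        word, modes = 0, []
        for g in range(groups):
            y, mode = encode_group((x >> (g * gw)) & gmask, prev[g], gw)
            prev[g] = y
            word |= y << (g * gw)
            modes.append(mode)
        out.append((word, tuple(modes)))
    return out


def decode_stream(encoded, width=8, groups=2):
    gw = width // groups
    gmask = (1 << gw) - 1
    prev = [0] * groups
    out = []
    for word, modes in encoded:
        x = 0
        for g in range(groups):
            y = (word >> (g * gw)) & gmask
            x |= decode_group(y, modes[g], prev[g], gw) << (g * gw)
            prev[g] = y
        out.append(x)
    return out


samples = [0b10001100, 0b10000011, 0b10010010, 0b10010111, 0b10001000]
assert decode_stream(encode_stream(samples)) == samples
```

The decoder needs only the bus word, the mode bits, and its own copy of the previous bus value, which mirrors the register and MUX structure of the decoder diagram. Transitions on the extra bits themselves are not counted in this sketch.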

2.4 Power Aware Data Bus Codec Simulator

To verify the Power Aware Data Bus Codec and compare its performance with other encoding schemes such as Bus-Invert, XOR, XNOR, and Dbm (difference-based mapping) plus Pbm (probability-based mapping), this thesis provides not only an RTL model but also a simulator written in C++. The simulator shows how switch activity behaves under different kinds of data variation.

2.4.1 8 bits Power Aware Data Bus Codec Simulator

In the 8-bit Power Aware Data Bus Codec Simulator, the proposed codec separates 8-bit data into two 4-bit groups for encoding. We configure the following variability parameters for simulation.

Most significant bit group variability: the variability of the value changing from 0 to 1 or from 1 to 0 in bits 4 to 7.

Least significant bit group variability: the variability of the value changing from 0 to 1 or from 1 to 0 in bits 0 to 3.
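One way to generate test patterns with these two parameters is sketched below. It rests on an assumption that is not stated in the text: that "variability" is the probability with which each bit of the group toggles between consecutive words. The function name and the seed parameter are hypothetical.

```python
import random


def generate_pattern(n, msbg_var, lsbg_var, width=8, seed=0):
    """Generate n words; each bit toggles with its group's variability (0.0 to 1.0)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    half = width // 2
    word = 0
    words = []
    for _ in range(n):
        for bit in range(width):
            p = lsbg_var if bit < half else msbg_var
            if rng.random() < p:
                word ^= 1 << bit
        words.append(word)
    return words


# "25/50": MSBG variability 25%, LSBG variability 50%.
pattern = generate_pattern(100_000, 0.25, 0.50)
```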

When the simulator executes a test pattern, it records the following bit-transition results:

switch_act                      switch activity before encoding
switch_act_BI_total             switch activity after Bus-Invert encoding
switch_act_XOR_total            switch activity after XOR encoding
switch_act_XNOR_total           switch activity after XNOR encoding
switch_act_dbm                  switch activity after dbm encoding
switch_act_dbm+pbm              switch activity after dbm + pbm encoding
switch_act_1block_total         switch activity after 1 block encoding
switch_act_2blocks_total        switch activity after 2 blocks encoding
switch_act_BI_ctrl              switch activity on extra bits after Bus-Invert encoding
switch_act_XOR_ctrl             switch activity on extra bits after XOR encoding
switch_act_XNOR_ctrl            switch activity on extra bits after XNOR encoding
switch_act_1block_ctrl_high     switch activity on extra bits in most significant bit group after 1 block encoding
switch_act_1block_ctrl_low      switch activity on extra bits in least significant bit group after 1 block encoding
switch_act_2blocks_ctrl_high    switch activity on extra bits in most significant bit group after 2 blocks encoding
switch_act_proposal2_ctrl_low   switch activity on extra bits in least significant bit group after 2 blocks encoding

As shown in Section 2.1, the dynamic power depends on the transition activity α. Therefore, we use the switch activity reduction (SAR) as a metric for the power reduction:

$$
\mathrm{SAR}\,(\%) = \frac{SA_{\text{before encoding}} - \left(SA_{\text{after encoding}} + SA_{\text{control extra bit}}\right)}{SA_{\text{before encoding}}}
\tag{2-9}
$$

where SA denotes switch activity.
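Eq. 2-9 translates directly into code. In this sketch the function name is hypothetical, and the ratio is multiplied by 100 so that the result is expressed in percent.

```python
def sar_percent(sa_before: int, sa_after: int, sa_ctrl: int) -> float:
    """Switch activity reduction (Eq. 2-9), expressed in percent."""
    if sa_before <= 0:
        raise ValueError("switch activity before encoding must be positive")
    return 100.0 * (sa_before - (sa_after + sa_ctrl)) / sa_before


# Invert example: 34 transitions before encoding (Table 2-7), 30 on the data
# lines after encoding (Table 2-8), and 3 toggles of the Inv flag, counted from
# the Inv column of Table 2-8 with the flag assumed to start at "off".
print(round(sar_percent(34, 30, 3), 2))
```

The example shows why the extra bits must be included: the data lines alone save 4 of 34 transitions, but the flag toggles take back 3 of them.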

We apply the four existing encoding schemes and our proposal to 100,000 data words under different variability parameters; the results are shown in Fig. 2-19, Fig. 2-20, Fig. 2-21, and Fig. 2-22. We define "1 block" as the proposed coding scheme with one 8-bit group and "2 blocks" as the proposed coding scheme with two 4-bit groups.

The x-axis shows the variability of the bit groups and the y-axis shows the switch activity reduction. For example, 25/50 means that the high-level bit group has 25% variability and the low-level bit group has 50% variability.

Fig. 2-19. Switch activity reduction for 8-bit data (x-axis: MSBG/LSBG variability for specific, random, 25/25, 25/50, 25/75, 25/100; y-axis: SA reduction (%); series: BI, XOR, XNOR, Dbm+pbm, 1 block, 2 blocks).

[Plot: the same comparison for 8-bit data at MSBG/LSBG variability 50/25, 50/50, 50/75, 50/100.]

[Plot: the same comparison for 8-bit data at MSBG/LSBG variability 75/25, 75/50, 75/75, 75/100.]

[Plot: the same comparison for 8-bit data at MSBG/LSBG variability 100/25, 100/50, 100/75, 100/100.]

Besides the data sets above, we also simulate a specific data set and a random data set. In the specific data set, values fall within a particular range with high probability.

The data distribution is shown in Fig. 2-23. Most general video and audio data follow a distribution like the one in this figure.

Fig. 2-23. The Data Distribution.

For both the specific data and the random data, the proposed scheme reduces switch activity by 15% to 18%. As the variability parameters increase, our proposal achieves a larger reduction in switch activity. For the variability parameters 25/75 and 75/25, we obtain a 20% switch activity reduction, a greater improvement than the other encoding schemes.

2.4.2 16 bits Power Aware Data Bus Codec Simulator

In the 16-bit Power Aware Data Bus Codec Simulator, the proposed codec separates 16-bit data into either two 8-bit groups or four 4-bit groups for encoding. We configure the following variability parameters for simulation.

Most significant bit group variability: the variability of the value changing from 0 to 1 or from 1 to 0 in bits 8 to 15.

Least significant bit group variability: the variability of the value changing from 0 to 1 or from 1 to 0 in bits 0 to 7.

When the simulator executes a test pattern, it records the following bit-transition results:

switch_act                      switch activity before encoding
switch_act_BI_total             switch activity after Bus-Invert encoding
switch_act_XOR_total            switch activity after XOR encoding
switch_act_XNOR_total           switch activity after XNOR encoding
switch_act_dbm+pbm              switch activity after dbm + pbm encoding
switch_act_1block_total         switch activity after proposal encoding by 1 block
switch_act_2blocks_total        switch activity after proposal encoding by 2 blocks
switch_act_4blocks_total        switch activity after proposal encoding by 4 blocks
switch_act_BI_ctrl              switch activity on extra bits after Bus-Invert encoding
switch_act_XOR_ctrl             switch activity on extra bits after XOR encoding
switch_act_XNOR_ctrl            switch activity on extra bits after XNOR encoding
switch_act_1block_ctrl_high     switch activity on extra bits in high level group after proposal encoding by 1 block
switch_act_1block_ctrl_low      switch activity on extra bits in low level group after proposal encoding by 1 block
switch_act_2blocks_ctrl_high    switch activity on extra bits in high level group after proposal encoding by 2 blocks
switch_act_4blocks_ctrl_high    switch activity on extra bits in high level group after proposal encoding by 4 blocks
switch_act_4blocks_ctrl_low     switch activity on extra bits in low level group after proposal encoding by 4 blocks

We run 100,000 data words through the four existing encoding schemes and our proposal under different variability parameters; the results are shown in Fig. 2-24, Fig. 2-25, Fig. 2-26, and Fig. 2-27. The encoding schemes are:

Bus-Invert; XOR; XNOR; Dbm+Pbm;

1 block: proposed coding scheme with one 16-bit group;
2 blocks: proposed coding scheme with two 8-bit groups;
4 blocks: proposed coding scheme with four 4-bit groups.

[Plot: SA reduction (%) for 16-bit data at MSBG/LSBG variability random, 25/25, 25/50, 25/75, 25/100; series: BI, XOR, XNOR, Dbm+pbm, 1 block, 2 blocks, 4 blocks.]

Fig. 2-25. Switch activity reduction for 16-bit data (MSBG/LSBG variability 50/25, 50/50, 50/75, 50/100; same series).

[Plot: SA reduction (%) for 16-bit data at MSBG/LSBG variability 75/25, 75/50, 75/75, 75/100; same series.]

Fig. 2-27. Switch activity reduction for 16-bit data (MSBG/LSBG variability 100/25, 100/50, 100/75, 100/100; same series).

For the random data, the proposed scheme reduces switch activity by 20%. As the variability parameters increase, our proposed method achieves a larger reduction in switch activity and a greater improvement than the other encoding schemes.

2.5 Result and Analysis

Fig. 2-28. Audio & image data simulation (y-axis: SA reduction (%); data sets: Pop music (2.68 Mb WAV), Classic music (3.81 Mb WAV), mobile pic (352x288), stefan pic (352x288), table pic (352x288), EEG (8 bit); series: BI, XOR, XNOR, Dbm+pbm, Proposed 1, Proposed 2).

Proposed 1: proposed coding scheme with one 8-bit group;
Proposed 2: proposed coding scheme with two 4-bit groups.

The simulation of 8-bit multimedia data in Fig. 2-28 shows that our proposal achieves a 20% dynamic power reduction on average. For image data, we choose three 352x288 pictures, Mobile, Table Tennis, and Stefan, for encoding. The Table image has low data variability, which suits the Dbm + Pbm encoding scheme. The other images have high data variability, for which our proposed encoder can select the optimal encoding scheme.

Chapter 3 Low Power Embedded Processor Design

To verify the codec function [20], we integrated the codec with a 32-bit embedded processor [21]. This chapter introduces the properties of the processor, its instruction set, its tool chain, and other specific designs.

3.1 Architecture of the Low Power Embedded Processor

3.1.1 Low Power Embedded Processor Core

Our processor uses a RISC architecture with several low-power designs: the Master-Slave cache, the low-power phased cache controller, and the power-aware data bus codec.

The low power embedded processor has a seven-stage pipeline architecture. Every instruction starts by using the program counter (PC) to supply the instruction address to the instruction memory. After the instruction is fetched, the ID stage decodes it and specifies the register operands. Once the operands have been fetched, the ALU operates on them to compute a memory address, compute an arithmetic result, or perform a comparison. If the instruction is an arithmetic-logical instruction, the ALU result must be written to a register. If the operation is a load or store, the ALU result is used as the address at which to store or load a value. The result from the ALU or memory is written back to the register file. The cache controller controls Load/Store operations on the memory peripheral device. Fig. 3-1 shows the architecture of the processor.

[Fig. 3-1: processor architecture. The RISC processor has a 4KB data cache with a phased D-cache control unit and a 4KB instruction cache with an MS I-cache control unit; main memory and the I/O units each connect to the AMBA bus through a bus encoder/decoder.]

Fig. 3-2. Pipeline processing flow (stages: Instruction Fetch/Program Counter/Branch Prediction, MS-Cache, Instruction Decoder, Register File/Cache address generator, ALU/Cache tag comparison, WB/Cache Data Access; loads reach main memory and I/O over the bus through the bus encoder/decoder).

The processor has a seven-stage pipeline: Instruction Fetch/Program Counter/Branch Prediction, MS-cache (2 stages), Instruction Decoder, Register File, ALU/Cache tag comparison, and Write-Back/Cache data access.

The seven stages are the following:

PC Counter/Branch Predict/Instruction Fetch: In the top portion of the hardware architecture, the program counter handles branch instructions and generates the PC address. The instruction is read from memory using the address in the PC and is then placed in the ID pipeline register. Because some instructions need the PC address for computation in the ALU stage, the PC address is saved stage by stage, in the register of each following stage. To avoid fetching an instruction after a branch instruction occurs, we set two flags to handle branch instructions. These flags show whether the pipeline is in a stall state and decide how each stage proceeds.

MS-cache (2 stages): The second portion of Fig. 3-2 shows the operation of instructions. If a miss occurs, the data is replaced from main memory. The MS-cache design is based on the phased cache, which compares the tag value in the first cycle and reads the hit data to the ID stage in the second cycle. The MS-cache also enhances the hit rate for branch/jump instructions.

Instruction Decoder: In the ID stage, the instruction is separated into two source register locations and one destination register location. These locations are used to fetch the source operands in the Register stage and to provide the destination operand for the ALU stage.

Register File: It provides 16 general-purpose registers and 16 interrupt registers for external interrupts, internal interrupts, and other configuration.

ALU/Cache Tag access: All computation on operands from the Register File is executed in the ALU stage. Data forwarding is supported in the ALU stage to eliminate RAW hazards. Meanwhile, when Load/Store instructions are executed, the value in the tag cache is compared with the memory address to determine whether the access is a cache hit or a miss.

Write-Back/Cache data access: The ALU result is written back to the register file, cache, or memory in this stage. For Load/Store instructions, memory data is accessed according to the cache hit result.

Five specific hardware designs are supported for DSP:

SIMD (Single Instruction, Multiple Data) support: An 8/16-bit SIMD instruction set is supported to improve multimedia processing, such as 8-bit image processing or 16-bit speech processing.

Bit Reverse: A memory addressing mode designed for FFT. For example, address 01101 is transformed to 10110.

Effective data forwarding [22].

Conditional Branch: predict-untaken method.
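The bit-reverse addressing mode listed above can be illustrated in a few lines. The function name and the explicit address width argument are choices made for this sketch, not details of the processor.

```python
def bit_reverse(addr: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of addr (FFT bit-reversed addressing)."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (addr & 1)  # move the current LSB into the result
        addr >>= 1
    return out


print(format(bit_reverse(0b01101, 5), "05b"))   # the example from the text
print([bit_reverse(i, 3) for i in range(8)])    # access order for an 8-point FFT
```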

3.1.2 Low Power Embedded Processor Instruction Set

The instruction set has five categories: data moving instructions, arithmetic & logic instructions, branch/jump instructions, SIMD instructions, and others.

Six addressing modes are supported: Direct, Reg to Reg, Indirect, Displacement (base add), Index, and Bit-Reverse.

Table 3-1 Data Moving Instructions List

Instruction Opcode Example Mode

MOVRC 000001 MOV rd,data Direct

MOVRR 000010 MOV rd,rs Reg-Reg

MOVRM 000011 MOV rd,address Direct

MOVMR 000100 MOV address,rs Direct

MOVMRR 000101 MOV @rs2,rs Indirect

MOVRRM 000110 MOV rd,@rs Indirect

MOVARR 100010 MOV rd(a),rs(b) Reg-Reg

MOVB 101111 MOVB rd,base(rs) Displacement

MOVI 110000 MOVI rd,rs1(rs2) Index

MOVREVRM 101010 MOV rd,address Bit Reverse

MOVREVMR 101011 MOV address,rs Bit Reverse


MOVREVRRM 101101 MOV rd,@rs Bit Reverse

Table 3-2 Arithmetic & Logic Instructions List

Instruction Opcode Example

ADDRR 001000 ADD rd,rs1,rs2

SUBRR 001010 SUB rd,rs1,rs2

MULRR 001100 MUL rd,rs1,rs2

ADDRC 000111 ADD rd,data

SUBRC 001001 SUB rd,data

MULRC 001011 MUL rd,data

MACR 100111 MAC rd,rs1,rs2

MACC 110001 MAC rd,rs1,data

ANDRR 001110 AND rd,rs1,rs2

ORRR 001111 OR rd,rs1,rs2

XORRR 010000 XOR rd,rs1,rs2

INVR 010001 INV rd,rs

Table 3-3 Branch/Jump Instructions List

JMP 010010 JMP address

JMPR 010011 JMP @rs

JBE 010100 JBE rs1,address

JNE 010101 JNE rs1,address

JMB 010110 JMB rs1,address


JBER 011000 JBER rs1,rs2,address

JNER 011001 JNBR rs1,rs2,address

JMBR 011010 JMBR rs1,rs2,address

JLBR 011011 JLBR rs1,rs2,address

CALL 100011 CALL address

RET 011110 RET

Table 3-4 SIMD Instructions List

MOVHLRC 110001 MOVHLRC rd,direct

MOVHURC 110010 MOVHURC rd,direct

ADDHRR 110011 ADDHRR rd,rs1,rs2

SUBHRR 110100 SUBHRR rd,rs1,rs2

MULHRR 110101 MULHRR rd,rs1,rs2

MACHR 100110 MACHR rd,rs1,rs2

ANDHRR 110110 ANDHRR rd,rs1,rs2

ORHRR 110111 ORHRR rd,rs1,rs2

XORHRR 111000 XORHRR rd,rs1,rs2

ADDBRR 111001 ADDBRR rd,rs1,rs2

SUBBRR 111010 SUBBRR rd,rs1,rs2

MULBRR 111011 MULBRR rd,rs1,rs2

ANDBRR 111100 ANDBRR rd,rs1,rs2

ORBRR 111101 ORBRR rd,rs1,rs2

XORBRR 111110 XORBRR rd,rs1,rs2

For SIMD instructions, the 32-bit data in the register file is divided into 8-bit or 16-bit blocks, and the blocks are processed in parallel. This speeds up 8-bit and 16-bit calculations.

For example, the MACHR instruction computes

$$
R_d = ACC = ACC + A_1 \times B_1 + A_2 \times B_2
$$

Fig. 3-3. MACHR operation (the 16-bit halves A1 and A2 are multiplied by B1 and B2, and the products are summed into the 32-bit ACC).
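A software model of the MACHR equation might look as follows. Several details are assumptions made for this sketch, since the text does not specify them: A1/B1 are taken from the upper 16 bits and A2/B2 from the lower 16 bits, the halves are treated as unsigned, and the accumulator wraps around at 32 bits.

```python
def machr(acc: int, ra: int, rb: int) -> int:
    """ACC + A1*B1 + A2*B2 on two packed 16-bit halves, wrapped to 32 bits."""
    a1, a2 = (ra >> 16) & 0xFFFF, ra & 0xFFFF   # assumed: A1 upper half, A2 lower half
    b1, b2 = (rb >> 16) & 0xFFFF, rb & 0xFFFF
    return (acc + a1 * b1 + a2 * b2) & 0xFFFFFFFF  # assumed wrap-around


ra = (2 << 16) | 3   # A1 = 2, A2 = 3
rb = (4 << 16) | 5   # B1 = 4, B2 = 5
print(machr(10, ra, rb))  # 10 + 2*4 + 3*5
```

One instruction thus performs two multiply-accumulate steps of a 16-bit dot product, which is the inner loop of FIR filtering.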

Table 3-5 Other Instructions List

SET 011100 SET A,rs

INTOK 011101 INTOK

SHR 100000 SHR rs

SHL 100001 SHL rs

SET: Sets the two extra 16-bit I/O ports.
INTOK: Instruction for software interrupts.
SHR: Shifts rs right by 1 bit.
SHL: Shifts rs left by 1 bit.

3.2 Configurable Master-Slave I-Cache Controller

In general, 20% to 30% of the total power dissipated in a processor is dissipated in the instruction cache. Therefore, the configurable Master-Slave instruction cache controller is designed for low power [24].

3.2.1 The Proposal of Configurable Master-Slave I-Cache Controller

The Configurable Master-Slave I-Cache controller is designed to increase the hit rate efficiently for long-range jumps. Its algorithm is shown in Fig. 3-4.

Fig. 3-4. The Configurable Master-Slave I-Cache controller algorithm.

3.2.2 The Performance of Configurable Master-Slave I-Cache

Fig. 3-5 shows the total performance improvement for different values of CR_Ratio.

CR_Ratio: the ratio of returnable jumps to total jump instructions.
Eff_Improve: a parameter of total performance improvement.

As CR_Ratio increases, Eff_Improve increases noticeably. In addition, the MS-cache uses the phased cache architecture, which reduces power dissipation by 44%.

3.3 High performance pipeline design of low power phased cache

The high-performance pipeline design of the low-power phased cache combines a phased cache with a specific pipeline. Its advantages are that it eliminates the power overhead of set-associative cache access and, thanks to the specific pipeline, accesses the cache data one stage early. Our approach reduces cache power consumption by 44% to 70% (2-way to 4-way) without any added latency, at a cost of only 6% of the total gate count.

Fig. 3-6. The architecture of the high-performance pipeline design of the low-power phased cache (stages: IF, ID, REG, ALU, WB/MEM; blocks: Reg file, address calculator, ALU, L1 Tag, L1 Data, main memory; paths: load, other, HIT, MISS).

Fig. 3-7 reports the cache access cycles and total performance measured with SimpleScalar. Cache access time is reduced by 38% and power consumption is reduced by 40% to 70%.

Fig. 3-7. Cache access cycles & Power consumption.

3.4 Tool Chain

3.4.1 Assembler

The GUI assembler supports machine code translation, program ROM generation, and debug information output. Users can debug and generate a test bench from this information. The assembler flow is shown in Fig. 3-8.

[Fig. 3-8: the assembler generates the data ROM, machine code, debug information, and test bench.]

We implemented the tool in Visual C++; its interface is shown in Fig. 3-9. The assembler generates two files:

Pop.txt: Hexadecimal program code for testing the chip.
Bin.txt: Binary program code for simulation.

Fig. 3-9. Assembler Interface (areas: Direct, File, Message, Compile, Build, Edit).

3.4.2 Simulator

This thesis provides a simulator implemented in Visual C++ for different kinds of test patterns. We apply a method like software pipelining [13] in the simulator, so that within each iteration the stages are arranged in inverse order. An example for a five-stage pipelined RISC architecture is given in Fig. 3-10, where all stages are sorted in inverse order.

for (;; cycle++) {
    // 5th stage: Write Back
    ...
    // 4th stage: ALU
    ...
    // 3rd stage: Reg_File
    ...
    // 2nd stage: Decoder
    ...
    // 1st stage: Fetch
    ...
}

Fig. 3-10. Software pipeline design flow (within each cycle, execution proceeds through the stages in the listed order).
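The benefit of the inverse stage order can be seen in a toy model. The sketch below is not the thesis simulator: it is a hypothetical five-stage pipeline in which instructions are opaque labels, each stage simply passes its input on, and there are no hazards or stalls.

```python
def run_pipeline(program, cycles):
    """Simulate a toy 5-stage pipeline; return (cycle, instruction) at write back."""
    # Pipeline registers between stages: IF/ID, ID/REG, REG/ALU, ALU/WB.
    latch = [None, None, None, None]
    pc = 0
    retired = []
    for cycle in range(1, cycles + 1):
        # 5th stage, Write Back: consume the ALU/WB register first.
        if latch[3] is not None:
            retired.append((cycle, latch[3]))
        # 4th stage, ALU: its input is still last cycle's value.
        latch[3] = latch[2]
        # 3rd stage, Reg_File.
        latch[2] = latch[1]
        # 2nd stage, Decoder.
        latch[1] = latch[0]
        # 1st stage, Fetch: runs last, so it cannot overwrite unread data.
        if pc < len(program):
            latch[0] = program[pc]
            pc += 1
        else:
            latch[0] = None
    return retired


print(run_pipeline(["i0", "i1", "i2"], 7))
```

Because the later stages run first, every stage reads its input register before the preceding stage overwrites it. A single set of pipeline registers is therefore enough, and no shadow copy of the previous cycle's state is needed.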

The simulator can display register values and memory contents, and it counts the number of hazards and the total penalty cycles.

This information helps programmers analyze performance and debug easily. Fig. 3-11 shows the assembly code, memory data, register values, total cycle count, and total instruction count.

Fig. 3-11. The simulator interface.

3.5 Verification

In order to respond to ISS (Information Systems and Sciences), we verify the function of our processor with test patterns including FIR (Finite Impulse Response), DCT (Discrete Cosine Transform), and the Sobel operator, and we compare the results against those of the simulator. The following sections introduce the three test patterns and their results.

3.5.1 Finite Impulse Response

FIR filtering is a common application in the communication and multimedia fields. Fig. 3-12 shows the 16-tap impulse response FIR filter.

Fig. 3-12. FIR RTL simulation and simulator result.

To verify the performance of the proposed codec, we provide a module that counts the switch activity of the data sent to external memory over the bus. As Fig. 3-13 shows, our proposed method reduces switch activity on the data bus by 46.13%.

[Fig. 3-13: switch activity on the data bus for FIR, without encoding versus with encoding.]

3.5.2 Discrete Cosine Transform

The 8 by 8 1-dimensional DCT algorithm is shown in Fig. 3-14. The 8 by 8 2-dimensional DCT is implemented by applying the 1-dimensional DCT row by row and then column by column. The simulation result is shown in Fig. 3-15.

Fig. 3-14. 1 dimension 8 by 8 DCT.
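The row-column decomposition can be sketched as follows. The exact 1-dimensional formula of Fig. 3-14 is not reproduced in the text, so this example assumes the orthonormal DCT-II; the function names are hypothetical, and floating point is used rather than the processor's fixed-point arithmetic.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a sequence (assumed variant)."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """2-D DCT: 1-D DCT on every row, then on every column of the result."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # transform each column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major


flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # a constant block has only a DC coefficient
```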

Fig. 3-16. Switch activity for DCT (without encoding versus with encoding).

As Fig. 3-16 shows, our proposal reduces switch activity on the data bus by 58.92%.

3.5.3 Sobel Operator

We use the Sobel operator to verify large data movement in the data cache. The Sobel operator is an edge detection algorithm in image processing. Technically, it is a discrete differentiation operator that computes the gradient of the image intensity function. At each point in the image, the result of the Sobel operator is either the corresponding gradient vector or the norm of this vector.

The Sobel operator computes approximations of the derivatives for horizontal and vertical changes by using two 3x3 kernels, which are convolved with the original image. We define A as the source image, and Gx and Gy as two images which contain the
