Organization of the Thesis - Power-Aware Data Bus Codec Design

Chapter 1 Introduction

1.2 Organization of the Thesis

This thesis is organized as follows. Chapter 1 gives a brief introduction to low-power design. Chapter 2 proposes a new power-aware codec design for the data bus. Chapter 3 demonstrates the integrated processor, which includes the proposed bus codec, together with its tool chain. Chapter 4 presents the processor layout and the simulation results. Finally, conclusions and future work are given in the last chapter.

Chapter 2 Power-Aware Data Bus Codec

This chapter presents an adaptive data bus codec, covering the proposal, the architecture, and a performance comparison, with the features of low power, low cost, and power awareness.

2.1 Motivation

There are two major sources of power dissipation in digital CMOS circuits, summarized as follows [8][9]:

$$P = \alpha \times C \times V^{2} \times f + I_{leakage} \times V \tag{2-1}$$

where P, C, α, V, and f denote power consumption, capacitance, transition activity, supply voltage, and clock frequency, respectively. The first and second terms represent the dynamic power and the leakage power, respectively. In the second term, the leakage current, which arises from substrate injection and sub-threshold effects, is primarily determined by the fabrication technology.

To reduce the dynamic power, the main design principle is to minimize the values of V, C, f, and α in Eq. (2-1) [10]. Among the four parameters, reducing the supply voltage V, which has a quadratic effect, and the capacitance C are very efficient ways of decreasing the power dissipation. However, for CMOS circuits, designers usually decrease V and C at the layout level, which is a troublesome problem for larger digital circuits and systems in cell-based design. On the other hand, lowering the transition activity is a very promising way to reduce the power consumption in cell-based design.
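The scaling behaviour of the dynamic term of Eq. (2-1) can be illustrated with a minimal sketch. The function name and all numeric values below are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic term of Eq. (2-1): alpha * C * V^2 * f (leakage term omitted)."""
    return alpha * c * v ** 2 * f

# Hypothetical operating point: 10 pF switched capacitance, 1.8 V, 100 MHz.
base = dynamic_power(alpha=0.5, c=10e-12, v=1.8, f=100e6)

# Halving the transition activity scales the dynamic power linearly.
half_activity = dynamic_power(alpha=0.25, c=10e-12, v=1.8, f=100e6)

# Halving the supply voltage scales the dynamic power quadratically.
half_voltage = dynamic_power(alpha=0.5, c=10e-12, v=0.9, f=100e6)

print(half_activity / base, half_voltage / base)
```

The two ratios come out close to one half and one quarter, which is why α is the parameter targeted when V and C cannot easily be changed.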

Generally speaking, the bus accounts for between 10% and 80% of the power dissipation of a microprocessor. Busses are classified as external or internal. An external bus carries external memory data transmission and I/O data transmission. An internal bus carries internal memory, cache, and IP data transmission. The power dissipation of external busses is usually about a hundred times larger than that of internal busses [8]. Thus, we are motivated to solve this critical data bus power problem at the architecture and logic levels. In this thesis, we propose a power-aware encoder and decoder that reduce the data transition activity α, and thus save power.

The bus streams have four properties [11], discussed as follows.

(1) Instruction address stream: Instruction addresses are often consecutive. As a result, the instruction address stream is very predictable.

(2) Data address stream: Data accesses may be consecutive while accessing arrays; otherwise, the data address stream is random. Although data addresses are less predictable, they still follow the principles of spatial and temporal locality.

(3) Instruction stream: Most ISAs (Instruction Set Architectures) exhibit some regularity, and instructions can be partitioned into fixed-location fields. As a result, the instruction stream is predictable by fixed-location fields.

(4) Data stream: The sequence is not predictable. The values vary irregularly with different kinds of applications and algorithms.

The above bus stream properties have been widely applied to three off-the-shelf computer architectures.

(a) Harvard architecture with four busses:

Fig. 2-1. Harvard architecture with four busses.

The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. Each address bus and data bus serves only the instruction memory or the data memory. As a result, each stream has an independent bus and is easily controlled.

(b) von Neumann architecture with two busses:

Fig. 2-2. von Neumann architecture with two busses.

The von Neumann architecture is a computer architecture that uses a single storage structure to hold both instructions and data. The instruction address stream and the data address stream share one bus, and the instruction stream and the data stream likewise share the other.

(c) von Neumann architecture with one bus:

Fig. 2-3. von Neumann architecture with one bus. (Blocks: CPU, memory, shared I/D address/data bus.)

All streams run on the same bus. This bus needs more signals to control the stream operations.

2.2 Related Works

In this section, we introduce related research on low-power bus encoding. We begin with a brief subsection on Bus-Invert encoding. Bus-Invert encoding [12] is a traditional encoding from early low-power designs, and it has the advantage of low-cost hardware implementation. In Section 2.2.2, we introduce Zero-Transition Activity encoding [15]. In Section 2.2.3, we present a coding framework for low-power address and data busses [16].

2.2.1 Bus-Invert Bus Encoding

We consider the activity on a typical data bus to be characterized by a random, uniformly distributed sequence of values [13][14]. The assumption of random, uniformly distributed inputs is also conveniently made by most statistical power estimation methods. With this assumption, for any given time-slot the data on an n-bit wide bus can be any of 2ⁿ possible values with equal probability. The average number of transitions per time-slot is n/2. For example, on an eight-bit bus there is an average of 4 transitions per time-slot, or 0.5 transitions per bus-line per time-slot.

When all the bus-lines toggle at the same time (the probability of this happening in any time-slot is 1/2ⁿ), there is a maximum of n transitions in a time-slot, and thus the worst-case power dissipation is proportional to n.

The Bus-Invert method [12] uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value. When invert = 1 the bus value is the inverted data value. The worst-case power dissipation can then be decreased by half by coding the bus as follows (Bus-Invert method):

(1) Compute the Hamming distance (the number of bits in which they differ) between the present data value and the last bus value.

(2) If the Hamming distance is larger than n/2, set invert = 1 and make the present bus value equal to the inverted present data value.

(3) Otherwise, let invert = 0 and let the present bus value equal the present data value.

(4) At the decoder side, the contents of the bus must be conditionally inverted according to the invert line. In any case, the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).

Bus-Invert encoding has the advantage that the maximum number of transitions per time-slot is reduced from n to n/2. Therefore, the worst-case power dissipation of the bus is reduced by half. Fig. 2-4 shows a 16-bit data sequence that uses Bus-Invert encoding to decrease the number of transitions.
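The average and worst-case figures quoted above can be checked by exhaustive enumeration. The following is a minimal sketch, not part of the original method description: the function names are hypothetical, and only the n data lines are counted (the extra invert line is left out of the count).

```python
from itertools import product

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def transition_stats(n):
    """Enumerate every (last bus value, present data value) pair on an n-bit bus.

    Returns (average transitions without coding,
             worst case without coding,
             worst case with Bus-Invert, data lines only).
    """
    mask = (1 << n) - 1
    total = worst_plain = worst_coded = 0
    for last_bus, data in product(range(1 << n), repeat=2):
        d = hamming(last_bus, data)
        total += d
        worst_plain = max(worst_plain, d)
        # Steps (2) and (3): invert when the distance is larger than n/2.
        bus = (~data & mask) if 2 * d > n else data
        worst_coded = max(worst_coded, hamming(last_bus, bus))
    return total / (1 << n) ** 2, worst_plain, worst_coded

print(transition_stats(8))  # (4.0, 8, 4)
```

For n = 8 the enumeration gives an average of 4 transitions, a worst case of 8 without coding, and a worst case of 4 with Bus-Invert, in agreement with the n/2 figures in the text.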

Fig. 2-4. Bus-Invert encoding.

Data 0 : 1000000100110101
Data 1 : 1000000010000001
Data 2 : 1100000001111111 (transmitted inverted as INV : 0011111110000000)

We can see that the Hamming distance between data 0 and data 1 is smaller than 8, so invert = 0. However, the Hamming distance between data 1 and data 2 is bigger than 8, so invert = 1 and data 2 is inverted.

2.2.2 Zero-Transition Activity Encoding

This scheme is related to Bus-Invert encoding: both Bus-Invert encoding [12] and Zero-Transition Activity encoding [15] rely on the addition of a redundant line to reduce the total number of transitions that may happen when streams of patterns are transmitted over the bus. For example, Bus-Invert encoding uses a redundant line INV that controls the data encoding for power reduction.

The idea of the Zero-Transition Activity encoding scheme, called the T0 code, is that of avoiding the transfer of consecutive addresses on the bus by using a redundant line, INC, to transfer to the receiving sub-system the information on the sequentiality of the addresses. When two addresses in the stream to be transmitted are consecutive, the INC line is set to 1, the address bus lines are frozen (to avoid unnecessary switching activity), and the new address is computed directly by the receiver. On the other hand, when two addresses are not consecutive, the INC line is driven to 0 and the bus lines operate normally.

If all addresses of the stream are consecutive, the INC line is always high and the bus lines never make a transition. Consequently, the switching activity of the T0 code is zero transitions per emitted consecutive address.

More formally, the Zero-Transition Activity encoding (T0 code) scheme can be described by Eq. (2-2):

$$(B(t), INC(t)) = \begin{cases} (B(t-1),\ 1), & \text{if } t > 0 \text{ and } b(t) = b(t-1) + S \\ (b(t),\ 0), & \text{otherwise} \end{cases} \tag{2-2}$$

where B(t) is the value on the encoded bus lines at time t, INC(t) is the additional bus line, b(t) is the address value at time t, and S is a constant increment, which we call the stride. The corresponding decoding scheme can be formally defined by Eq. (2-3):

$$b(t) = \begin{cases} b(t-1) + S, & \text{if } INC = 1 \text{ and } t > 0 \\ B(t), & \text{if } INC = 0 \end{cases} \tag{2-3}$$

Notice that the T0 code retains its zero-transition property even if the addresses are incremented by a constant stride equal to a power of two (as is often the case for practical machines that are byte addressable but access data or instructions aligned at word boundaries).

As an example of Zero-Transition Activity encoding following Eqs. (2-2) and (2-3), Table 2-1 lists the switching activity of the original data transfer; the total is 10 transitions from cycle 0 to cycle 6. Table 2-2 lists the data transmission with Zero-Transition Activity encoding. At a given clock cycle t (t = 1, ..., 6 in Table 2-2), the encoder computes the incremented address of cycle t - 1 and compares it to the address generated at cycle t. If the incremented old (t - 1) address and the new (t) address are equal, the INC line is raised, and the old address is left on the bus.

The encoder/decoder architecture is shown in Fig. 2-5. The incrementer can be programmable, so that the constant increment S can be defined flexibly. In Table 2-2, S is defined as 1.

The decoder architecture is simple. At any given clock cycle, the last cycle's address is incremented. If the INC line is high, the incremented old value is used for addressing; otherwise, the value coming from the bus lines is selected. As a result, the total number of transitions becomes 4. Zero-Transition Activity encoding freezes the address value on the bus when the addresses are consecutive, so that the power dissipation is reduced efficiently.

Fig. 2-5. Zero-Transition Activity encoder/decoder.

Table 2-1 Without Zero-Transition Activity Encoding

cycle   Address to be transferred   Address on BUS
0   00000000   00000000
1   00000001   00000001
2   00000010   00000010
3   00000011   00000011
4   00001000   00001000
5   00001001   00001001
6   00001010   00001010
Total transitions 10

Table 2-2 With Zero-Transition Activity Encoding

cycle   Address to be transferred   Address on BUS   INC
0   00000000   00000000   0
1   00000001   frozen   1
2   00000010   frozen   1
3   00000011   frozen   1
4   00001000   00001000   0
5   00001001   frozen   1
6   00001010   frozen   1
Total transitions 4
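The totals of Table 2-1 and Table 2-2 can be reproduced with a short sketch of the T0 encoder and decoder. This is an illustration under stated assumptions rather than the reference implementation: the function names are hypothetical, address wrap-around is ignored, and the encoded total counts the transitions on the INC line together with those on the bus lines.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def t0_encode(addresses, stride=1):
    """Return a list of (bus, inc) pairs for an address stream."""
    encoded = []
    for t, b in enumerate(addresses):
        if t > 0 and b == addresses[t - 1] + stride:
            encoded.append((encoded[-1][0], 1))  # consecutive: freeze the bus lines
        else:
            encoded.append((b, 0))               # otherwise: drive the address
    return encoded

def t0_decode(encoded, stride=1):
    """Recover the address stream from (bus, inc) pairs."""
    decoded = []
    for bus, inc in encoded:
        decoded.append(decoded[-1] + stride if inc else bus)
    return decoded

def count_transitions(words):
    """Sum of Hamming distances between consecutive words."""
    return sum(hamming(p, c) for p, c in zip(words, words[1:]))

# Address stream of Table 2-1 (cycles 0 to 6), with S = 1.
addresses = [0b00000000, 0b00000001, 0b00000010, 0b00000011,
             0b00001000, 0b00001001, 0b00001010]
encoded = t0_encode(addresses)

plain_total = count_transitions(addresses)
coded_total = (count_transitions([bus for bus, _ in encoded])
               + count_transitions([inc for _, inc in encoded]))
print(plain_total, coded_total)  # 10 4
```

In this stream the bus lines change only once (cycle 4) and the INC line changes three times, which together give the 4 transitions of Table 2-2.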

2.2.3 A Coding Framework for Low Power Address and Data Busses

In this section, we present a source-coding framework for describing low-power encoding schemes and then employ the framework to develop new encoding schemes [16].

In this framework, a data source is first processed by a decorrelating function f1. Next, a variant of entropy coding, function f2, is employed, which reduces the transition activity.

One possible scheme assigns code words with fewer ON bits to signal samples that have a higher probability of occurrence. This scheme is suited to systems in which the power dissipation depends on the number of ON bits. In VLSI systems, however, power dissipation depends on the number of transitions rather than the number of ON bits.

A general communication system (Fig. 2-6) consists of a source coder, a channel coder, a noisy channel, a channel decoder, and a source decoder. The source coder (decoder) compresses (decompresses) the input data so that the number of bits required in the representation of the source is minimized. While the source coder removes redundancy, the channel coder adds just enough of it to combat errors that may arise due to the noise in the physical channel.

We consider the bus between two chips as the physical channel, and the transmitter and receiver blocks to be part of the pad circuitry, driving (in the case of the transmitting chip) or detecting (in the case of the receiving chip) the data signals. We assume here that the signal levels are sufficiently high that the channel can be considered noiseless. The noiseless channel assumption allows us to eliminate the channel coder, resulting in the system shown in Fig. 2-7.

The source encoder shown in Fig. 2-8 contains two functions, f1 and f2. The function f1 decorrelates the input so that all linear dependencies are removed. The function f2 employs a variant of entropy coding whereby, instead of minimizing the average number of bits at the output, it reduces the average number of transitions.

Therefore, the function f1 decorrelates the input and adjusts the input probability distribution so that function f2 can reduce the transition activity through its mapping.

Fig. 2-6. A general communication system. (Blocks: input, source encoder, channel encoder, noisy channel, channel decoder, source decoder.)

Fig. 2-7. A general communication system with a noiseless channel. (Blocks: input, source encoder, noiseless channel, source decoder.)

Fig. 2-8. A practical communication system with a noiseless channel. (Source encoder: F1 (decorrelator), F2 (encoder); noiseless channel; source decoder: F2⁻¹ (decoder), F1⁻¹ (correlator).)

In this thesis, we choose Difference-Based Mapping as the function f1 and Probability-Based Mapping as the function f2. In a later chapter, we compare this encoding method with other encoding schemes, including Bus-Invert, XOR, XNOR, and the proposed scheme.

The Difference-Based Mapping (dbm) is given in Eq. (2-4), where x(n) is the input data and the prediction $\hat{x}(n)$ is a function of the past values of x(n). The dbm function returns the difference between x(n) and $\hat{x}(n)$, properly adjusted so that the output fits in the available B bits.

(2-4)

In the Difference-Based Mapping (dbm), we define four ranges for the mapping: $\{\hat{x}(n) < 2^{B-1}\}$, $\{2\hat{x}(n) - 2^{B} \le x(n) \le \hat{x}(n)\}$, $\{\hat{x}(n) < x(n) < 2\hat{x}(n)\}$, and others. The proper calculation is chosen according to these four mapping ranges. In the example listed in Table 2-3, the dbm output is 0 when the current x(n) is equal to the previous value $\hat{x}(n)$, and the output value increases as the distance between the current x(n) and the previous value $\hat{x}(n)$ increases. The goal of dbm is to move the overall data distribution close to 0 so that the number of transitions is reduced. The occurrence distributions for 8-bit EEG data before and after dbm are shown in Fig. 2-9 and Fig. 2-10. The dbm skews the original distribution for most of the data sets and hence enables function f2, the Probability-Based Mapping (pbm), to reduce the number of transitions even more.

Table 2-3 Example of Difference-Based Mapping (dbm)

$\hat{x}(n)$   x(n)   dbm(x(n), $\hat{x}(n)$)
011   000   101
011   001   011
011   010   001
011   011   000
011   100   010
011   101   100
011   110   101
011   111   111

Fig. 2-9. Occurrence distribution for EEG data before dbm.

Fig. 2-10. Occurrence distribution for EEG data after dbm.

The Probability-Based Mapping (pbm) is a sorting method for reducing the number of '1' bits. It satisfies the condition given below:

$$\text{if } \Pr(i) > \Pr(j) \text{ then } pbm(i) \le pbm(j), \quad \forall (i, j) \tag{2-5}$$

The probabilities in Eq. (2-5) can be computed using a representative data sequence. If the most probable value is i, then pbm(i) = 0. If the second most probable value is j, then pbm(j) = 1, and so on. In this way, pbm maps every value to a B-bit code word with bit weights $2^{i}$ ($i = 0, \ldots, B-1$).

A sorting table can be built according to the probabilities. An example of pbm is listed in Table 2-4.

Table 2-4 Example of Probability-Based Mapping ( pbm )

i   Pr(i)   pbm(i)

000 0.37 000

001 0.14 010

010 0.22 001

011 0.11 011

100 0.05 101

101 0.03 110

110 0.06 100

111 0.02 111
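The mapping of Table 2-4 can be reproduced with a minimal sketch in which pbm is simply the rank of each value when the values are sorted by decreasing probability. The function name `build_pbm` is hypothetical, and the way ties are broken (original order is kept) is an assumption, since the text does not discuss equal probabilities.

```python
def build_pbm(probabilities):
    """Map each value to its rank in order of decreasing probability.

    probabilities: dict {value: Pr(value)}, e.g. measured on a
    representative data sequence.
    """
    ranked = sorted(probabilities, key=lambda v: probabilities[v], reverse=True)
    return {value: rank for rank, value in enumerate(ranked)}

# Probabilities from Table 2-4.
table_2_4 = {0b000: 0.37, 0b001: 0.14, 0b010: 0.22, 0b011: 0.11,
             0b100: 0.05, 0b101: 0.03, 0b110: 0.06, 0b111: 0.02}

pbm = build_pbm(table_2_4)
for value in sorted(pbm):
    print(f"{value:03b} -> {pbm[value]:03b}")
```

A decoder would need the inverse table, which is one reason why this scheme requires extra storage for the probability-sorted mapping.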

In summary, the transition activity can be reduced by combining the dbm and pbm encoding schemes, so that values with a higher probability of occurrence are assigned code words with fewer ON bits. In VLSI circuits, power dissipation depends on the number of transitions occurring at the capacitive nodes of the circuit.

Unfortunately, dbm + pbm requires more hardware to build the input probability distribution table and more execution time for encoding.

2.3 Power Aware Data Bus Codec

Different encoding schemes suit different kinds of data properties and correlations. Zero-Transition Activity encoding [15], which needs highly correlated, slowly varying data, is suitable for instruction memory. Bus-Invert encoding [12], which needs weakly correlated, rapidly varying data, is suitable for data memory. The dbm and pbm encoding schemes [16] have the advantage that they can change the correlation of the data and choose a proper value by probability mapping. They are suitable for a specific data value range, but they pay a heavy penalty in hardware implementation cost.

On the other hand, although the data width is constant, the variation of the most significant bit group (MSBG) generally differs from that of the least significant bit group (LSBG). For an 8-bit data width, we define the MSBG as bits 4 to 7 and the LSBG as bits 0 to 3. For example, we take the first ten decimal data values of the sequence in Fig. 2-11 and list their binary representations in Table 2-5. In Table 2-5, the data values range between 115 and 151, and the variation of the MSBG is smoother than that of the LSBG. Fig. 2-12 shows the variation curve.

Fig. 2-11. Waveform of the classic music.

Table 2-5 First Ten Data Sequences of Classic Music

Index   Value (decimal)   Value (binary)

1 140 1000_1100

2 131 1000_0011

3 146 1001_0010

4 151 1001_0111

5 136 1000_1000

6 125 0101_1101

7 115 0101_0011

8 130 1000_0010

9 145 1001_0001

10 139 1000_1011

Fig. 2-12. Data variation.

Table 2-6 Data Variation

MSBG LSBG

Total Hamming distance 126 200

Average of variation 31.5% 50%

The difference between the MSBG and the LSBG is obvious in Fig. 2-12.

Therefore, unlike in [19], we separate the data bit width into specific blocks so that a proper encoding can be applied to each block. The transition activity of data transmission can thus be reduced by encoding.
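The per-group variation can be measured with a short sketch that splits each 8-bit sample into the MSBG and the LSBG and accumulates the Hamming distance of each group separately. The function name is hypothetical, and the input is limited to the ten binary values listed in Table 2-5, so the totals are not those of Table 2-6, which covers a longer sequence.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def group_variation(samples):
    """Total Hamming distance between consecutive 8-bit samples,
    returned separately as (MSBG, LSBG)."""
    msbg = lsbg = 0
    for prev, cur in zip(samples, samples[1:]):
        msbg += hamming(prev >> 4, cur >> 4)      # bits 4 to 7
        lsbg += hamming(prev & 0x0F, cur & 0x0F)  # bits 0 to 3
    return msbg, lsbg

# Binary column of Table 2-5, as listed.
samples = [0b1000_1100, 0b1000_0011, 0b1001_0010, 0b1001_0111, 0b1000_1000,
           0b0101_1101, 0b0101_0011, 0b1000_0010, 0b1001_0001, 0b1000_1011]

msbg_total, lsbg_total = group_variation(samples)
print(msbg_total, lsbg_total)  # 10 21
```

Even on these ten samples the LSBG toggles about twice as often as the MSBG, which is the observation that motivates encoding the two groups separately.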

2.3.1 Proposed Data Bus Codec

The encoder architecture has four kinds of encoding schemes: Invert, XOR [17][18], XNOR [17][18], and original (unencoded). We introduce each encoding algorithm and the data type it suits.

The Invert function is given in Eq. 2-6, where Hamming(x(n), $\hat{x}(n)$) returns the Hamming distance between the current data x(n) and the previous data $\hat{x}(n)$. If the Hamming distance exceeds half the number of bus lines, then the input is inverted and the inversion is signaled using an extra bit. An example of classic music before using Invert is listed in Table 2-7, and the same example after using Invert is listed in Table 2-8.

Table 2-7 Example of Classic Music before Using Invert

cycle   $\hat{x}(n)$   x(n)   transitions
1   00000000   10001100   3
2   10001100   10000011   4
3   10000011   10010010   2
4   10010010   10010111   2
5   10010111   10001000   5
6   01110111   01011101   3
7   01011101   01010011   3
8   01010011   10000010   4
9   10000010   10010001   3
10   10010001   10101010   5
Total transitions 34

Table 2-8 Example of Classic Music after Using Invert

cycle   $\hat{x}(n)$   x(n)   Hamming(x(n), $\hat{x}(n)$)   Y(n)   Inv   transitions

1 00000000 10001100 3 10001100 off 3

2 10001100 10000011 4 10000011 off 4

3 10000011 10010010 2 10010010 off 2

4 10010010 10010111 2 10010111 off 2

5 10010111 10001000 5 01110111 on (*)3

6 01110111 01011101 3 01011101 off 3

7 01011101 01010011 3 01010011 off 3

8 01010011 10000010 4 10000010 off 4

9 10000010 10010001 3 10010001 off 3

10 10010001 10101010 5 01010101 on (*)3

Total transitions 30
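The rows of Table 2-8 can be reproduced with a minimal sketch of the Invert scheme. The function names are hypothetical, the bus is assumed to start at 00000000 as in cycle 1 of the table, and only the eight data lines are counted (the Inv line is excluded, which matches the per-cycle figures of the table).

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def invert_encode(samples, width=8, initial=0):
    """Return ([(y, inv), ...], total data-line transitions)."""
    mask = (1 << width) - 1
    prev = initial            # value currently on the bus
    encoded, transitions = [], 0
    for x in samples:
        inv = 2 * hamming(x, prev) > width   # distance exceeds half the bus width
        y = (~x & mask) if inv else x
        transitions += hamming(y, prev)
        encoded.append((y, inv))
        prev = y
    return encoded, transitions

def invert_decode(encoded, width=8):
    """Undo the conditional inversion using the Inv flag."""
    mask = (1 << width) - 1
    return [(~y & mask) if inv else y for y, inv in encoded]

# x(n) column of Table 2-8.
samples = [0b10001100, 0b10000011, 0b10010010, 0b10010111, 0b10001000,
           0b01011101, 0b01010011, 0b10000010, 0b10010001, 0b10101010]

encoded, total = invert_encode(samples)
print(total)  # 30
```

Only cycles 5 and 10 have a Hamming distance above 4, so only those two words are inverted, and the total number of data-line transitions is 30.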

The block diagram of Invert encoding is sketched in Fig. 2-13, where the Hamming function is composed of 8 exclusive-OR gates and adders for an 8-bit input.

Fig. 2-13. Block diagram of Invert coding.

The XOR function is given in Eq. 2-7, where XOR(x(n), $\hat{x}(n)$) returns the exclusive-OR of the current data x(n) and the previous data $\hat{x}(n)$. If the value of Hamming(XOR(x(n), $\hat{x}(n)$), $\hat{x}(n)$) is smaller than Hamming(x(n), $\hat{x}(n)$), then the output for transmission equals XOR(x(n), $\hat{x}(n)$). Otherwise, the output for transmission is unchanged.

For example, classic music coding results using the transparent (unencoded) and XOR coding schemes are listed in Table 2-9 and Table 2-10.

Table 2-9 Example of Classic Music before Using XOR

cycle   $\hat{x}(n)$   x(n)   transitions
1   00000000   10001100   3
2   10001100   10000011   4
3   10000011   10010010   2
4   10010010   10010111   2
5   10010111   10001000   5
6   01110111   01011101   3
7   01011101   01010011   3
8   01010011   10000010   4
9   10000010   10010001   3
10   10010001   10101010   5
Total transitions 34

Table 2-10 Example of Classic Music after Using XOR

cycle   $\hat{x}(n)$   x(n)   Hamming(x(n), $\hat{x}(n)$)   Hamming(XOR(x(n), $\hat{x}(n)$), $\hat{x}(n)$)   Y(n)   XOR   transitions

1 00000000 10001100 3 3 10001100 off 3

2 10001100 10000011 4 3 00001111 on (*)3

3 00001111 10010010 5 2 10011101 on (*)2

4 10011101 10010111 2 5 10010111 off 2

5 10010111 10001000 5 3 00011111 on (*)3

6 00011111 01011101 2 5 01011101 off 3

7 01011101 01010011 3 4 01010011 off 3

8 01010011 10000010 4 2 11010001 on (*)2

9 11010001 10010001 1 3 10010001 off 3

10 10010001 10101010 5 4 00111011 on (*)4

Total transitions 28
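The per-cycle decision of the XOR scheme can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, `prev` is taken to be the value previously placed on the bus (as in the second column of Table 2-10), and the decoder is assumed to keep the same previous bus value.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def xor_encode_step(x, prev):
    """Return (y, flag): y is the word to transmit, flag is the extra XOR line."""
    candidate = x ^ prev
    # Transmit the XOR-ed word only when it toggles fewer bus lines.
    if hamming(candidate, prev) < hamming(x, prev):
        return candidate, 1
    return x, 0

def xor_decode_step(y, flag, prev):
    """Recover x from the received word, the flag, and the previous bus value."""
    return y ^ prev if flag else y

# Cycle 2 of Table 2-10: the XOR-ed word is selected.
print(xor_encode_step(0b10000011, 0b10001100))  # (15, 1), i.e. 00001111 with XOR on

# Cycle 4 of Table 2-10: the original word is kept.
print(xor_encode_step(0b10010111, 0b10011101))  # (151, 0), i.e. 10010111 with XOR off
```

Because XOR-ing twice with the same previous value restores the original word, the decoder needs only the flag and the last bus value.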

The block diagram of XOR encoding is sketched in Fig. 2-14. The conditional block selects the result for which the Hamming function has the smallest value.

Fig. 2-14. Block diagram of XOR coding.

The XNOR function is given in Eq. 2-8, where XNOR(x(n), $\hat{x}(n)$) returns the exclusive-NOR of the current data x(n) and the previous data $\hat{x}(n)$. If the value of Hamming(XNOR(x(n), $\hat{x}(n)$), $\hat{x}(n)$) is smaller than Hamming(x(n), $\hat{x}(n)$), then the output for transmission equals XNOR(x(n), $\hat{x}(n)$). Otherwise, the output for transmission is unchanged. The inversion is signaled using an extra bit.

(2-8)

The logic diagram is shown in Fig. 2-15. The conditional block selects the result for which the Hamming function has the smaller value.

Fig. 2-15. Block diagram of XNOR coding.

2.3.2 Architecture of Codec

The total codec system overview is shown in Fig. 2-16. The proposed codec
