
Chapter 3 Development Roadmap and Proposed Design

3.3 An ALU cluster Intellectual Property

3.3.1 Overview of AMBA

3.3.1.4 Basic Transfer

A basic AHB transfer consists of two phases: an address phase and a data phase. An example of a simple transfer is shown in Fig 3.9. As illustrated in the figure, the address phase lasts only one cycle, but the data phase may require several cycles. The signals required for a basic transfer are HCLK, HADDR, the control signals and HWDATA. The slave responds with HREADY, HRDATA and HRESP. The figure shows how the address and data phases of a transfer occur in different clock periods. The address phase of any transfer occurs during the data phase of the previous transfer. This overlap follows from the pipelined nature of the bus and allows high-performance operation. A HIGH level on HREADY indicates that the transfer can complete, and a LOW level indicates that the transfer must be extended. An example of an extended transfer is shown in Fig 3.10. The address phase is the same as in Fig 3.9, but the data phase is extended by two cycles because HREADY is driven LOW to signal that the transfer is not ready to complete. Either the master or the slave may delay a transfer, depending on the transfer type, which is introduced in the following section.

Fig 3.9 An example of simple transfer

Fig 3.10 An example of an extended transfer
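The overlap of address and data phases can be made concrete by counting clock cycles. The following is a minimal sketch, not part of the original design: the function name `transfer_cycles` and its list-of-wait-states input are assumptions introduced only for illustration.

```python
def transfer_cycles(wait_states):
    """Count HCLK cycles for a sequence of back-to-back AHB transfers.

    wait_states[i] is the number of cycles for which HREADY is held LOW
    during the data phase of transfer i (a hypothetical input format).
    """
    if not wait_states:
        return 0
    # The first address phase occupies one cycle on its own. Every later
    # address phase overlaps the data phase of the previous transfer, so
    # each transfer adds only its data phase: one cycle plus its waits.
    return 1 + sum(1 + w for w in wait_states)


# Simple transfer in the style of Fig 3.9: no wait states.
print(transfer_cycles([0]))
# Extended transfer in the style of Fig 3.10: two wait states.
print(transfer_cycles([2]))
```

Under these assumptions, three pipelined transfers without wait states need four cycles rather than six, which is the benefit of the overlap described above.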

3.3.1.5 Transfer Type

AMBA AHB transfers are classified into four transfer types. The HTRANS signal indicates the type of a transfer. This two-bit signal encodes IDLE, BUSY, NONSEQ and SEQ as 00, 01, 10 and 11 respectively. These transfer types are introduced below.

The IDLE transfer type indicates that no data transfer is required. It is used when a bus master is granted the bus but does not intend to perform a data transfer. Slaves must always provide a zero wait state OKAY response to IDLE transfers, and the slave should ignore the transfer.

The BUSY transfer type indicates that the bus master is continuing with a burst of transfers, but the next transfer cannot take place immediately. This transfer type allows bus masters to insert idle cycles in the middle of a burst. When a master uses the BUSY transfer type, the address and control signals must reflect the next transfer in the burst. The slave should ignore the transfer. Slaves must always provide a zero wait state OKAY response, in the same way that they respond to IDLE transfers.

The NONSEQ transfer type indicates the first transfer of a burst or a single transfer. The address and control signals are unrelated to the previous transfer. In the AMBA AHB specification, a single transfer on the bus is treated as a burst of one, so its transfer type is also NONSEQ.

The last transfer type is SEQ. It indicates the remaining transfers in a burst. The address is related to that of the previous transfer: in an incrementing burst it equals the previous address plus the transfer size in bytes. In a wrapping burst, the address wraps at the address boundary equal to the size in bytes multiplied by the number of beats in the burst. In addition, the control information is identical to that of the previous transfer.
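The four encodings and the rules attached to them can be summarized in a few lines of code. This is an illustrative sketch only; the class name `HTrans` and the helper functions are assumptions made for this example and do not appear in the AMBA specification.

```python
from enum import IntEnum


class HTrans(IntEnum):
    """Two-bit HTRANS encodings as listed in the text."""
    IDLE = 0b00
    BUSY = 0b01
    NONSEQ = 0b10
    SEQ = 0b11


def slave_must_ignore(htrans):
    """IDLE and BUSY transfers are ignored and answered with a zero wait
    state OKAY response."""
    return HTrans(htrans) in (HTrans.IDLE, HTrans.BUSY)


def starts_burst(htrans):
    """NONSEQ marks a single transfer or the first transfer of a burst."""
    return HTrans(htrans) is HTrans.NONSEQ


for code in range(4):
    print(f"{code:02b}", HTrans(code).name, slave_must_ignore(code))
```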

3.3.1.6 Address Decoding

The select signal HSELx for each slave on the bus is provided by an address decoder, as shown in Fig 3.11. The select signal is a combinatorial decode of the high-order address signals. Simple address decoding schemes are encouraged in order to avoid complex decode logic and to suit high-performance operation. A slave samples the address and control signals and HSELx only when HREADY is HIGH, which indicates that the current transfer is completing. Under certain circumstances HSELx may be asserted while HREADY is LOW, but the selected slave will have changed by the time the current transfer completes. The minimum address space that can be allocated to a single slave is 1kB. All bus masters are designed so that they do not perform incrementing transfers across a 1kB boundary, which ensures that a burst never crosses an address decode boundary.

Fig 3.11 Slave Selected Signal
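A decoder of this kind can be sketched as a lookup on the high-order address bits. The memory map below is entirely hypothetical (the slave names and region assignments are assumptions made for illustration); only the 1kB granularity comes from the text.

```python
REGION_BITS = 10  # 1kB minimum region: 2**10 bytes

# Hypothetical memory map: region index (HADDR >> 10) -> slave name.
MEMORY_MAP = {0: "instr_mem", 1: "data_mem", 2: "alu_ctrl"}


def decode(haddr):
    """Return the HSELx value of every slave for a given HADDR.

    Only the high-order bits take part in the decode, so all addresses
    inside one 1kB region select the same slave.
    """
    selected = MEMORY_MAP.get(haddr >> REGION_BITS)
    return {name: name == selected for name in MEMORY_MAP.values()}


print(decode(0x03FC))  # last word of region 0
print(decode(0x0400))  # first word of region 1
```

Because a burst never crosses a 1kB boundary, the selected slave in this sketch cannot change in the middle of a burst.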

3.3.1.7 Burst Operation

The AMBA AHB protocol supports two kinds of bursts: incrementing and wrapping. Four-beat, eight-beat and sixteen-beat bursts are defined in the protocol. The burst information is carried by the HBURST signal, which determines both the burst mode and the number of beats. The relationship between the signal value and the burst type is listed in Table 3.2, which defines eight types.

Table 3.2 Burst Signal Encoding

HBURST[2:0]  Type    Description
000          SINGLE  Single transfer
001          INCR    Incrementing burst of unspecified length
010          WRAP4   4-beat wrapping burst
011          INCR4   4-beat incrementing burst
100          WRAP8   8-beat wrapping burst
101          INCR8   8-beat incrementing burst
110          WRAP16  16-beat wrapping burst
111          INCR16  16-beat incrementing burst

In the incrementing burst mode, the address of each transfer in the burst is the address of the previous transfer plus the transfer size. In the wrapping burst mode, if the start address is not aligned to the total number of bytes in the burst (size x beats), the address wraps when the boundary is reached. For example, a four-beat wrapping burst of word (4-byte) accesses wraps at 16-byte boundaries. Therefore, if the start address is 0x34, the burst consists of four transfers to addresses 0x34, 0x38, 0x3C and 0x30: after 0x3C the next address would reach the boundary at 0x40, so it wraps back to 0x30.
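The address sequence of both burst modes can be reproduced with a short routine. This is a minimal sketch under the assumption that the start address is aligned to the transfer size; the function name and parameters are chosen for illustration only.

```python
def burst_addresses(start, size_bytes, beats, wrapping):
    """List the addresses of every beat in a fixed-length burst.

    Assumes `start` is aligned to `size_bytes`.
    """
    span = size_bytes * beats          # total bytes in the burst
    base = start - (start % span)      # lower wrap boundary
    addrs = []
    addr = start
    for _ in range(beats):
        addrs.append(addr)
        addr += size_bytes
        # A wrapping burst returns to the lower boundary when it reaches
        # the upper one; an incrementing burst simply keeps counting.
        if wrapping and addr == base + span:
            addr = base
    return addrs


# Four-beat wrapping burst of word accesses starting at 0x34.
print([hex(a) for a in burst_addresses(0x34, 4, 4, wrapping=True)])
# The same burst in incrementing mode.
print([hex(a) for a in burst_addresses(0x34, 4, 4, wrapping=False)])
```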

Bursts must not cross a 1kB address boundary. It is therefore important that masters do not attempt to start a fixed-length incrementing burst that would cause this boundary to be crossed. An incrementing burst of unspecified length can be of any length, but its upper limit is set by the fact that the address must not cross a 1kB boundary. The HSIZE signal controls the transfer size. It supports eight sizes: 8, 16, 32, 64, 128, 256, 512 and 1024 bits. Finally, the endian policies defined for this protocol are shown in Table 3.3 and Table 3.4, which are big-endian and little-endian respectively. The designer only needs to follow one of the endian policies and keep the whole system consistent.
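The boundary rule and the size encoding can be checked with a small helper. This is an illustrative sketch: the function names are assumptions, the three-bit HSIZE encoding (000 for 8 bits up to 111 for 1024 bits) follows the AMBA AHB specification rather than the text above, and the start address is assumed to be aligned to the transfer size.

```python
def hsize_to_bytes(hsize):
    """Map a three-bit HSIZE value to a transfer size in bytes."""
    if not 0 <= hsize <= 7:
        raise ValueError("HSIZE is a three-bit value")
    return 1 << hsize  # 1, 2, 4, ..., 128 bytes (8 to 1024 bits)


def crosses_1kb(start, size_bytes, beats):
    """True if an incrementing burst would cross a 1kB boundary."""
    last = start + size_bytes * (beats - 1)  # address of the final beat
    return (start >> 10) != (last >> 10)


print([8 * hsize_to_bytes(h) for h in range(8)])
print(crosses_1kb(0x3F0, 4, 4))  # beats at 0x3F0..0x3FC stay in one region
print(crosses_1kb(0x3F8, 4, 4))  # beats at 0x3F8..0x404 cross 0x400
```

A master could use such a check before issuing a fixed-length incrementing burst and split the burst if the check fails.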

Table 3.3 Active byte lanes for a 32-bit big-endian data bus

Table 3.4 Active byte lanes for a 32-bit little-endian data bus

3.3.2 Micro-Architecture of an ALU cluster Intellectual Property

The proposed ALU cluster Intellectual Property (IP) is described in this section. The detailed architecture is shown in Fig 3.12. As illustrated in the figure, the design is composed of four main blocks: the AMBA AHB wrapper, the ALU cluster, the instruction memory and the data memory. The instruction and data memories feed the instructions and data required for operation into the functional units.

Fig 3.12 The Proposed ALU Cluster IP Architecture

The major part for handling media applications is the ALU cluster described in Section 3.2. The arithmetic units and internal storage of the ALU cluster in this IP are the same as those introduced in Section 3.2.1.

However, the control and internal storage are improved in this ALU cluster IP. The ability to read sources and write destinations is improved: all banks of the data and instruction memories are exposed to the AMBA bus, so these memory banks can be accessed directly from the AMBA bus through the AMBA AHB wrapper, which is introduced later. In addition, performance is improved by shortening the read latency. In the original ALU cluster, a read takes four cycles in a burst read operation. In the improved ALU cluster, a burst read has an initial latency of two cycles, after which data is read sequentially in every cycle.

The ALU cluster IP must be able to execute while the AMBA bus is granted to other masters, so the ALU cluster needs a functional block that feeds addresses to the instruction memory automatically. As illustrated in Fig 3.12, the Pc_counter performs this job. It increases the program counter by one in every clock cycle. The decoder compares the value of the program counter with the end value of the Pc_counter every cycle to check whether the ALU cluster has finished the job. If the job is completed, the alu_work signal is activated to inform the wrapper. When the alu_work signal is inactive, the IP cannot be accessed and returns a RETRY response to the AMBA bus. In addition, one special input signal combination can clear the end value of the Pc_counter in the decoder and force the IP to stop execution. This mechanism is designed to avoid the possibility of deadlock.

Another key component of the ALU cluster IP is the AMBA AHB wrapper, which is discussed next. The wrapper interface conforms to the Advanced Microcontroller Bus Architecture (AMBA) Advanced High-performance Bus (AHB) protocol described in Section 3.3.1. It provides a common interface for integrating the proposed design with the ARM Versatile baseboard to form a media processing system.

The proposed wrapper is composed of a finite state machine (FSM) and an address generation unit (AGU). The FSM controls the states of the wrapper and responds to requests from the AMBA bus. It provides the communication between the AHB slave interface and the ALU cluster inside the proposed IP: it receives signals from the AMBA bus and activates the ALU cluster to respond. The FSM also controls the AGU to produce the addresses needed by the ALU cluster, in either the incrementing or the wrapping mode of burst operation.

The FSM has six states: Idle, Accessible, ALU_Work, Un-readable Wait, Un-writable Wait and Error. As shown in the state diagram in Fig 3.13, the FSM stays in the Idle state while the IP is not being accessed or after the operation of the ALU cluster has finished. Whether the IP has completed its work or has encountered an error, the FSM returns to the Idle state. In this state, the wrapper is ready to receive signals from the bus and to prepare the next operation. The FSM moves to another state when the bus is granted and either the IP is accessed or the ALU cluster is activated. It leaves the Idle state only when the HTRANS signal equals NONSEQ. When NONSEQ is encountered, HWRITE identifies which operation of the IP is requested, and the FSM moves to the corresponding state.

Fig 3.13 The state diagram of the finite state machine

The next state is the Accessible state, in which the IP is accessible. When the HTRANS signal equals NONSEQ and the HWRITE signal is HIGH, the FSM moves directly to this state. While the FSM stays in this state, a control signal identifies the type of access and whether the incrementing or the wrapping mode is used in the burst transfer. In one type, the IP is accessed at a new address with HTRANS equal to NONSEQ. In the other, the IP is accessed continuously at an address derived from the previous access, in the wrapping or incrementing mode of a burst transfer. Three conditions force the FSM into another state: the access is finished, a read is requested but the data is not ready, and the master is busy during a write. The FSM then moves to the Idle, Un-readable Wait and Un-writable Wait states respectively. The latter two states are described below.

The Un-readable Wait state exists because of the two-cycle read latency. There are two paths into this state. The first is taken when the FSM is in the Idle state, HTRANS is NONSEQ and HWRITE is LOW, which indicates that the IP is being read. The first read operation needs two cycles to prepare the data, so the FSM must remain in this state until the data is ready. It then enters the Accessible state to serve the following read requests. The second path leads from the Accessible state to the Un-readable Wait state, again because of the read latency. In addition, when the IP is being written in a wrapping or incrementing burst and the HTRANS signal changes to BUSY, the FSM enters the Un-writable Wait state. After HTRANS changes from BUSY to NONSEQ or SEQ, the FSM returns from the Un-writable Wait state to the Accessible state.

The last two states of the six-state FSM are the Error and ALU_Work states. When the proposed IP is accessed illegally because of an invalid address or transaction, the FSM goes to the Error state. Invalid addresses and transactions result from the depth limitation of the data and instruction memories. The other reason for entering this state is that the IP is accessed when it has not been granted as expected. In both cases the Error state is entered so that the AMBA AHB protocol is not violated. The Error state must obey the AHB protocol and therefore issues a two-cycle response, driving HREADY and HRESP as defined in the AMBA AHB specification.

Finally, the ALU_Work state indicates that an application is being processed in the ALU cluster. The only path into this state is from the Idle state. Whether the access is a read or a write, the FSM issues a two-cycle response to the AHB bus in the ALU_Work state. In addition, the ALU cluster keeps working until it finishes its operations, unaffected by any unexpected access.
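The state transitions described in this section can be collected into a single next-state function. The following is only a behavioral sketch and not the implemented wrapper: the input names `addr_valid`, `data_ready`, `access_done`, `alu_start` and `alu_done` are hypothetical, the trigger for entering ALU_Work is assumed, and the two-cycle responses of the Error and ALU_Work states are not modeled.

```python
from enum import Enum, auto

IDLE_T, BUSY_T, NONSEQ_T, SEQ_T = 0b00, 0b01, 0b10, 0b11  # HTRANS values


class State(Enum):
    IDLE = auto()
    ACCESSIBLE = auto()
    ALU_WORK = auto()
    UNREADABLE_WAIT = auto()
    UNWRITABLE_WAIT = auto()
    ERROR = auto()


def next_state(state, htrans, hwrite, *, addr_valid=True, data_ready=True,
               access_done=False, alu_start=False, alu_done=False):
    """Return the next wrapper state (all keyword inputs are assumed names)."""
    if state is State.IDLE:
        if alu_start:                      # assumed trigger for ALU_Work
            return State.ALU_WORK
        if htrans == NONSEQ_T:             # Idle is left only on NONSEQ
            if not addr_valid:
                return State.ERROR
            # Writes go straight to Accessible; reads wait for the data.
            return State.ACCESSIBLE if hwrite else State.UNREADABLE_WAIT
        return State.IDLE
    if state is State.ACCESSIBLE:
        if not addr_valid:
            return State.ERROR
        if access_done:
            return State.IDLE
        if hwrite and htrans == BUSY_T:    # master busy during a write burst
            return State.UNWRITABLE_WAIT
        if not hwrite and not data_ready:  # read requested, data not ready
            return State.UNREADABLE_WAIT
        return State.ACCESSIBLE
    if state is State.UNREADABLE_WAIT:
        return State.ACCESSIBLE if data_ready else State.UNREADABLE_WAIT
    if state is State.UNWRITABLE_WAIT:
        if htrans in (NONSEQ_T, SEQ_T):
            return State.ACCESSIBLE
        return State.UNWRITABLE_WAIT
    if state is State.ALU_WORK:
        return State.IDLE if alu_done else State.ALU_WORK
    return State.IDLE                      # Error returns to Idle


s = next_state(State.IDLE, NONSEQ_T, hwrite=False, data_ready=False)
print(s)
print(next_state(s, SEQ_T, hwrite=False, data_ready=True))
```

Writing the transitions as a pure function keeps the sketch easy to compare against the state diagram in Fig 3.13, one state at a time.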

One further characteristic of the wrapper is that the data and instruction memories embedded in the IP can be accessed directly through the proposed wrapper.

One point about the ALU cluster IP should be noted. An instruction must pass through several stages, so it takes several cycles to write the result back. The ALU is a two-stage pipelined unit, so it takes six cycles: two extra cycles plus the four cycles needed by every operation for steps such as instruction decoding, data source selection and result writing. Likewise, the four-stage pipelined multiplier needs eight cycles and the divider needs twenty cycles to write back its results.

3.4 Floating Point Units for the ALU cluster IP

In modern media application systems, floating point operations are indispensable. Floating point units (FPUs) are designed and implemented to be integrated with the original ALU cluster IP. In the following, the design considerations of the floating point units are described. Their implementation results are discussed in a later chapter. The next chapter also compares the performance of the selected benchmark on the ALU cluster IP with and without the floating point units.

3.4.1 Design Consideration

The floating point units are designed to make the ALU cluster IP more suitable for, and more widely applicable to, media processing applications. The architecture of the original ALU cluster IP is not well matched to floating point operations in the IEEE 754 standard format for floating point arithmetic [26]. In that architecture, a floating point operation that follows the IEEE 754 format must be decomposed into several fields to complete the calculation, and each field is prone to errors caused by saturation. Consequently, the floating point units are designed for the IEEE 754 single precision floating point format.

A brief review of the IEEE 754 standard binary floating point formats follows. The standard defines four formats, distinguished by precision: single precision, double precision, single extended precision and double extended precision. They use 32 bits, 64 bits, at least 43 bits and at least 79 bits respectively. The last two formats, single extended and double extended precision, are not commonly used, so the following focuses on the single and double precision formats.

The single and double precision formats are listed in Table 3.5. As shown in the table, an IEEE 754 floating point number has three basic fields: the sign field, the exponent field and the mantissa field. The one-bit sign field represents the sign of the number: zero denotes a positive number and one denotes a negative number. The exponent field must represent both positive and negative exponents. It occupies eight bits in the single precision format and eleven bits in the double precision format. A value called the bias is added to the actual exponent to form the value stored in this field. The bias is 127 for single precision and 1023 for double precision. For example, an exponent of zero is stored as 127 in the single precision format and as 1023 in the double precision format. The mantissa, also called the significand, represents the precision bits of the number. The significand field occupies twenty-three bits in the single precision format and fifty-two bits in the double precision format. In both formats, the significand is composed of an implicit leading bit and the fraction bits. To maximize the quantity of representable numbers, floating point numbers are stored in normalized form, so the leading digit is assumed to be 1 and does not need to be represented explicitly. In other words, the mantissa in the single precision format effectively has twenty-four bits of resolution with twenty-three stored fraction bits. The double precision format is similar.

Table 3.5 Format of single and double precision IEEE 754 floating point number

                  Sign Field   Exponent Field    Significand Field
Single Precision  1 bit [31]   8 bits [30:23]    23 bits [22:0]
Double Precision  1 bit [63]   11 bits [62:52]   52 bits [51:0]
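The single precision layout in Table 3.5 can be verified by splitting a 32-bit pattern into its three fields. This is a minimal sketch using the Python standard library; the function names are chosen for illustration, and the reconstruction covers normalized numbers only.

```python
import struct


def decode_single(x):
    """Split a number, rounded to single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit [31]
    exponent = (bits >> 23) & 0xFF    # bits [30:23], biased by 127
    fraction = bits & 0x7FFFFF        # bits [22:0]
    return sign, exponent, fraction


def value_of_normalized(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its fields."""
    significand = 1 + fraction / 2**23    # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


print(decode_single(1.0))       # true exponent 0 is stored as 127
print(decode_single(0.15625))   # 1.25 x 2**-3
```

For 0.15625 the stored exponent is 124, that is, the true exponent -3 plus the bias of 127, and the fraction field holds only the 0.25 that follows the implicit leading 1.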

In summary: first, the sign bit is zero for a positive number and one for a negative number. Second, the exponent field contains the true exponent plus 127 for the single precision format and plus 1023 for the double precision format. Third, the first bit of the significand is typically assumed to
