Design Consideration - Floating Point Units for the ALU cluster IP

Chapter 3 Development Roadmap and Proposed Design

3.4 Floating Point Units for the ALU cluster IP

3.4.1 Design Consideration

The floating point units are designed for the ALU cluster IP to make it suitable for a wider range of media processing applications. The original architecture of the ALU cluster IP is not well matched to floating point operations in the IEEE 754 standard format for floating point arithmetic [26]. In the original architecture, an operation on IEEE 754 operands has to be decomposed into several fields that are processed separately, and the computation on each field is prone to erroneous results caused by saturation.

Consequently, floating point units are designed for the IEEE 754 standard single precision floating point format.

A brief review of the IEEE 754 standard for binary floating point arithmetic follows. The standard defines four formats, distinguished by their precision: single precision, double precision, single extended precision and double extended precision. They use 32 bits, 64 bits, at least 43 bits and at least 79 bits respectively. The two extended formats are not commonly used, so this review focuses on the single precision and double precision formats.

Table 3.5 lists the single precision and double precision formats. An IEEE 754 floating point number has three basic fields: the sign field, the exponent field and the significand (mantissa) field. The one-bit sign field represents the sign of the number: zero denotes a positive number and one denotes a negative number. The exponent field has to represent both positive and negative exponents. It occupies eight bits in the single precision format and eleven bits in the double precision format. The value stored in this field is the actual exponent plus a constant called the bias.

The bias value is 127 and 1023 for single precision and double precision respectively.

For example, an actual exponent of zero is stored as 127 in the single precision format and as 1023 in the double precision format. The mantissa, also called the significand, represents the precision bits of the number. The significand field occupies twenty-three bits in the single precision format and fifty-two bits in the double precision format. In both formats, the significand is composed of an implicit leading bit and the fraction bits. To maximize the number of representable values, floating point numbers are stored in normalized form, so the leading digit is always 1 and does not need to be stored explicitly. In other words, the single precision significand effectively has twenty-four bits of resolution with twenty-three stored fraction bits. The double precision format is similar, with fifty-three bits of resolution and fifty-two stored fraction bits.

Table 3.5 Format of single and double precision IEEE 754 floating point number

| Format | Sign Field | Exponent Field | Significand Field |
|---|---|---|---|
| Single Precision | 1 bit [31] | 8 bits [30:23] | 23 bits [22:0] |
| Double Precision | 1 bit [63] | 11 bits [62:52] | 52 bits [51:0] |

In summary: first, the sign bit is zero for a positive number and one for a negative number. Second, the exponent field stores the true exponent plus 127 in the single precision format and the true exponent plus 1023 in the double precision format. Third, the significand of a normalized number is interpreted as 1.f, where the leading 1 is implicit and f is the fraction stored in the significand field.
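As an illustration of the three points above, the following Python sketch splits a single precision value into its sign, exponent and fraction fields and rebuilds a normalized value from them. The function names `decode_single` and `value_of_normalized` are chosen here for illustration only and are not part of the proposed design.

```python
import struct

def decode_single(x):
    """Round x to single precision and return (sign, stored exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # bit [31]
    exponent = (bits >> 23) & 0xFF    # bits [30:23], stored with bias 127
    fraction = bits & 0x7FFFFF        # bits [22:0]
    return sign, exponent, fraction

def value_of_normalized(sign, exponent, fraction):
    """Value of a normalized number: (-1)^sign * 1.f * 2^(exponent - 127)."""
    significand = 1.0 + fraction / 2**23   # implicit leading 1 plus fraction
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)

sign, exponent, fraction = decode_single(-6.5)
print(sign, exponent, hex(fraction))                  # 1 129 0x500000
print(value_of_normalized(sign, exponent, fraction))  # -6.5
```

Here -6.5 equals -1.625 * 2^2, so the stored exponent is 2 + 127 = 129 and the fraction encodes 0.625.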

Table 3.6 lists the effective range of IEEE 754 floating point numbers for the single and double precision formats. Five distinct numerical ranges cannot be represented in this format. Taking the single precision format as an example, they are: positive numbers less than 2^-149, positive numbers greater than (2-2^-23) * 2^127, zero, negative numbers less than -(2-2^-23) * 2^127 and negative numbers greater than -2^-149. They are called positive underflow, positive overflow, zero, negative overflow and negative underflow respectively. Overflow means that the magnitude of the value is too large to represent. Underflow means that the magnitude is too small to represent, which results in a loss of precision.

Table 3.6 Effective Range of the IEEE 754 floating point number

| Format | Binary Value | Decimal Value |
|---|---|---|
| Single Precision | -(2-2^-23) * 2^127 ~ +(2-2^-23) * 2^127 | -10^38.53 ~ +10^38.53 |
| Double Precision | -(2-2^-52) * 2^1023 ~ +(2-2^-52) * 2^1023 | -10^308.25 ~ +10^308.25 |
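The limits in Table 3.6 can be checked numerically. The short Python sketch below evaluates the binary expressions, converts them to decimal exponents, and cross-checks the single precision limits against the corresponding bit patterns. The variable names are illustrative.

```python
import math
import struct

max_single = (2 - 2**-23) * 2.0**127     # largest finite single precision value
min_single = 2.0**-149                   # smallest positive (denormalized) value
max_double = (2 - 2**-52) * 2.0**1023    # largest finite double precision value

# Decimal exponents quoted in Table 3.6.
print(round(math.log10(max_single), 2))  # 38.53
print(round(math.log10(max_double), 2))  # 308.25

# Cross-check against the single precision bit patterns:
# 0x7F7FFFFF is the largest finite value, 0x00000001 the smallest positive one.
largest_from_bits = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
smallest_from_bits = struct.unpack(">f", bytes.fromhex("00000001"))[0]
print(largest_from_bits == max_single, smallest_from_bits == min_single)
```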

Finally, the special values defined by the IEEE 754 standard are introduced. The standard reserves the all-zero and all-one exponent field values to denote special values. The first is zero. Zero cannot be represented in the normalized format because of the implicit leading one of the significand, so it is defined as an exponent field of all zero and a fraction field of all zero. Positive zero and negative zero are distinct values, although they compare as equal. The second is the denormalized number, which has an exponent field of all zero and a non-zero fraction. No leading one is assumed before the binary point, so the value is (-1)^S * 0.f * 2^-126 in the single precision format and (-1)^S * 0.f * 2^-1022 in the double precision format. The third is infinity, which is denoted by an exponent field of all one and a fraction of all zero. The sign bit determines whether it is positive or negative infinity. Representing infinity as a specific value is useful because it allows operations to continue past overflow situations, and operations on infinite values are well defined in the IEEE standard. The last special value is Not a Number (NaN), which has an exponent field of all one and a non-zero fraction. It represents a value that cannot be expressed as a real number. There are two types of NaN: Quiet NaN (QNaN) and Signaling NaN (SNaN). The most significant bit of the fraction field is set for a QNaN, which is produced by an operation whose result is not mathematically defined. The most significant fraction bit is clear for an SNaN, which is used to signal an exception.
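The special value encodings described above can be summarized as a small classifier over the 32-bit single precision pattern. This is a minimal Python sketch; the function name `classify_single` and the returned labels are illustrative, and the QNaN/SNaN distinction follows the convention stated in the text (most significant fraction bit set for QNaN).

```python
def classify_single(bits):
    """Classify a 32-bit IEEE 754 single precision bit pattern."""
    exponent = (bits >> 23) & 0xFF   # bits [30:23]
    fraction = bits & 0x7FFFFF       # bits [22:0]
    if exponent == 0x00:
        # All-zero exponent: zero or denormalized number.
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:
        # All-one exponent: infinity or NaN.
        if fraction == 0:
            return "infinity"
        return "QNaN" if fraction & 0x400000 else "SNaN"
    return "normalized"

def denormalized_value(bits):
    """Value of a denormalized number: (-1)^S * 0.f * 2^-126."""
    sign = bits >> 31
    fraction = bits & 0x7FFFFF
    return (-1.0) ** sign * (fraction / 2**23) * 2.0**-126

print(classify_single(0x80000000))  # zero (negative zero)
print(classify_single(0x7FC00000))  # QNaN
```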

The features and formats described above are a brief review of the IEEE 754 floating point standard. Further details can be found in the standard itself [26].

Three different types of FPU are designed and implemented in this thesis. As shown in Fig 3.14, the FPU is integrated with the ALU cluster IP. Because the data format of the ALU cluster IP is 32 bits wide, the IEEE 754 single precision floating point format is adopted for the floating point units.

The three types of FPU are described as follows. Depending on the features of the benchmarks and applications, some floating point arithmetic operations are critical and some are not. In other words, not all functional units in the FPUs have to operate at the same frequency: critical operations need a faster clock rate and the others do not. The type 1 FPU supports addition, subtraction and multiplication. The type 2 FPU supports addition, subtraction, multiplication and division. The type 3 FPU supports division only. In most media processing applications, division is less common than addition, subtraction and multiplication. Type 1 and type 3 are therefore designed to make the critical operations faster and to reduce the logic resources needed for the non-critical operations by increasing their latencies.

The type 2 FPU is designed for general benchmarks and applications in which the operations are evenly distributed. Each FPU has two input operands and one output result, all with a 32-bit data width. The operation is selected by the control signal. The implementation details are introduced in the following chapter.
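The interface shared by the three FPU types (two 32-bit operands, one 32-bit result, operation chosen by a control signal) can be sketched as a behavioral model in Python. This is only an illustration under stated assumptions: the operation names used as the control value are hypothetical, since the actual control encoding is not given here, and the model does not cover latency, division by zero, or results outside the single precision range.

```python
import struct

# Operations supported by each FPU type, as listed in the text.
FPU_OPS = {
    1: ("add", "sub", "mul"),
    2: ("add", "sub", "mul", "div"),
    3: ("div",),
}

def to_bits(value):
    """Encode a Python float as a 32-bit single precision pattern."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def to_float(bits):
    """Decode a 32-bit single precision pattern into a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fpu(fpu_type, op, a_bits, b_bits):
    """Behavioral model: two 32-bit operands in, one 32-bit result out.

    op is a hypothetical stand-in for the control signal.
    """
    if op not in FPU_OPS[fpu_type]:
        raise ValueError(f"type {fpu_type} FPU does not support {op}")
    a, b = to_float(a_bits), to_float(b_bits)
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "mul":
        result = a * b
    else:
        result = a / b  # division by zero is not modeled in this sketch
    return to_bits(result)  # round the result back to single precision

print(hex(fpu(1, "add", to_bits(1.5), to_bits(2.25))))  # 0x40700000 (3.75)
```

A type 1 FPU rejects a division request in this model, which mirrors the split of critical and non-critical operations between type 1 and type 3.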

Fig 3.14 An ALU cluster IP with Floating Point Unit Supported Architecture (block diagram; labels: ALU, Data Input, Data Output, Control)

CHAPTER 4 Implementation Results and Performance Evaluation

The previous three chapters addressed the background, challenges, development roadmap and micro-architectures of the designs proposed in this thesis. The circuit implementation results of the proposed designs are discussed in this chapter.

The first section presents the implementation and chip testing results of the previous design, the ALU cluster, which serves as the processing element of the ALU cluster intellectual property implemented later. Testing of the fabricated chip confirms that the functionality is correct and that the architecture is not only feasible but also efficient for media applications.

The second section presents the implementation and verification results of the designed ALU cluster intellectual property. It gives an overview of the features of Magnetic RAM and the details of the architecture modified to include the Magnetic RAM, and then describes the results of chip tape-out, circuit verification and chip testing.

The third section introduces the circuit implementation of the floating point units for the ALU cluster IP, which will be integrated with the ALU cluster IP, and summarizes their implementation and verification results.

The fourth section evaluates the performance of selected benchmarks to confirm that integrating the floating point unit is efficient.

4.1 Implementation and Testing Results of An ALU cluster

This section briefly describes the previous design, an ALU cluster, including its implementation, verification and testing results and related photographs. Silicon measurements confirm that the ALU cluster, referred to above as prototype 1, can handle media applications as expected.

Table 4.1 summarizes the manufactured ALU cluster. The chip is taped out in UMC 0.18um CMOS technology with the Artisan cell-based design kit. The post-layout simulation operating frequency is 100MHz. The chip size and core size are about 3x3 mm² and 2.2x2.2 mm² respectively. The gate count is 411491 and the power consumption is 968.35mW. Besides the logic resources of the arithmetic units and control logic, a total of fifteen memory banks are used for data and instructions. The instruction memory consists of four 32 x 128 single port static RAMs (SRAMs) and one 14 x 128 single port SRAM. With 128 entries, it provides an output bandwidth of 142 bits per cycle for VLIW instructions. The data memory consists of ten 32 x 32 single port SRAMs. With 32 entries, it provides a data bandwidth of 320 bits per cycle.
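The bank counts and bandwidth figures quoted above follow directly from the bank widths. The Python sketch below reproduces the arithmetic, assuming that each "width x entries" bank delivers one word of the stated width per cycle; the variable names are illustrative.

```python
# Word widths in bits of the single port SRAM banks described in the text.
instruction_bank_widths = [32, 32, 32, 32, 14]   # four 32 x 128 and one 14 x 128
data_bank_widths = [32] * 10                     # ten 32 x 32

total_banks = len(instruction_bank_widths) + len(data_bank_widths)
instruction_bits_per_cycle = sum(instruction_bank_widths)
data_bits_per_cycle = sum(data_bank_widths)

print(total_banks)                 # 15
print(instruction_bits_per_cycle)  # 142
print(data_bits_per_cycle)         # 320
```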

Table 4.1 Implementation Results Summary of ALU cluster

| Item | Value |
|---|---|
| Process | UMC 0.18um CMOS Technology |
| Library | Artisan SAGE-X Standard Cell Library |
| Post Layout Clock Rate | 100MHz |
| Chip Size | 2.98 x 2.98 mm² |
| Instruction Memory | 4 blocks 32 x 128 single port SRAM; 1 block 14 x 128 single port SRAM |

These memory banks are generated by the memory compiler with the Artisan library. Excluding these memories, the gate count is 255669 and the power dissipation drops to 312.38mW. The physical layout of the ALU cluster is shown in Fig 4.1, and the floorplan and pad assignment are shown in Fig 4.2.

Fig 4.1 Physical Layout of an ALU cluster

Fig 4.2 Floorplan and Pad Assignment of an ALU cluster (floorplan blocks: Instruction Memory, Data Memory, ALU Cluster; pad ring with clk, q_*, d_*, core_vdd/core_gnd and io_vdd/io_gnd pads)

There are 127 I/O pads in total: 47 input pads, 32 output pads and 48 power pads. The die microphotograph of the taped-out chip is shown in Fig 4.3. The manufactured chip is packaged in a CQFP128 package, and a photograph of the packaged prototype 1 is shown in Fig 4.4.

Fig 4.3 Microphotograph of taped out ALU cluster

Fig 4.4 An ALU cluster with CQFP128 package

As illustrated in Fig 4.5, the manufactured chip is mounted on a PCB for testing. The measurement equipment is the Agilent 16902A Logic Analyzer System [27] with Agilent 16720A pattern generator and Agilent 16910A logic analyzer modules. The maximum signal frequency of the pattern generator is 300MHz in half channel mode and 180MHz in full channel mode, which is sufficient for measuring the chip.

Fig 4.5 An ALU cluster with PCB board

We measure and verify the ALU cluster chip mainly in terms of functionality and performance. The functional testing consists of memory testing, instruction testing and real program testing. A 16-tap FIR filter is selected as the benchmark for the real program testing and the performance testing. In the memory testing, the patterns written to and read back from the data memory and the instruction memory are all one, all zero and interleaved one and zero. These patterns are 32'hFFFFFFFF, 32'h00000000, 32'hAAAAAAAA, 32'h55555555 and a mix of 32'hAAAAAAAA and 32'h55555555. Such patterns help to find stuck-at-zero and stuck-at-one errors. The functionality of the instructions is tested by feeding random sequences into all operation units. Every instruction of every operation unit is exercised at least once to ensure correctness. The test patterns used in chip testing must follow the format of the pattern generator.
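To illustrate why these patterns expose stuck-at faults, the following Python sketch runs the write and read-back test against a simple software memory model with an optional injected fault. The model, the class name `MemoryModel` and the choice of alternating the two interleaved patterns by address are assumptions made for this illustration; they do not describe the actual test bench or the pattern generator format.

```python
MASK32 = 0xFFFFFFFF

def pattern_sets():
    """Patterns from the text; each maps an address to a 32-bit word."""
    fixed = [0xFFFFFFFF, 0x00000000, 0xAAAAAAAA, 0x55555555]
    patterns = [lambda addr, p=p: p for p in fixed]
    # Mixed pattern (assumed here to alternate by address).
    patterns.append(lambda addr: 0xAAAAAAAA if addr % 2 == 0 else 0x55555555)
    return patterns

class MemoryModel:
    """Memory with an optional stuck-at fault at one address (illustrative)."""
    def __init__(self, depth, fault_addr=None, stuck_at_zero=0, stuck_at_one=0):
        self.cells = [0] * depth
        self.fault_addr = fault_addr
        self.stuck_at_zero = stuck_at_zero   # bit mask of bits stuck at 0
        self.stuck_at_one = stuck_at_one     # bit mask of bits stuck at 1

    def write(self, addr, value):
        if addr == self.fault_addr:
            value = (value & ~self.stuck_at_zero) | self.stuck_at_one
        self.cells[addr] = value & MASK32

    def read(self, addr):
        return self.cells[addr]

def run_memory_test(mem, depth):
    """Write each pattern to every address, read back, collect mismatches."""
    failures = []
    for pattern in pattern_sets():
        for addr in range(depth):
            mem.write(addr, pattern(addr))
        for addr in range(depth):
            expected, got = pattern(addr), mem.read(addr)
            if got != expected:
                failures.append((addr, expected, got))
    return failures

print(run_memory_test(MemoryModel(32), 32))  # []
print(len(run_memory_test(MemoryModel(32, fault_addr=5, stuck_at_zero=0x1), 32)))
```

A bit stuck at zero is caught by every pattern that drives that bit to one, and a bit stuck at one by every pattern that drives it to zero, so the all-one and all-zero patterns together cover both fault types for every bit.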

All functional tests mentioned above, including the memory testing, the instruction testing and the real program testing, give correct results. In the performance testing, however, the FIR filter system operates correctly only up to about 16 to 18 MHz.

The testing results mentioned above are summarized in Table 4.2. Because of the heavy loading of the probe lead set, the measured speed is more than five times slower than the post-layout clock rate. It is hard to measure the true performance of the chip unless the pod of the pattern generator is connected to the PCB directly.

Table 4.2 Testing results summaries of ALU cluster

| Testing Items | Results |
|---|---|
| Read/Write Memory: Data Memory | Correct |
| Read/Write Memory: Instruction Memory | Correct |
| ALU0 | Correct |
| FIR filter system | Correct when operating at 16~18 MHz |

4.2 Verification and Implementation Results of An ALU cluster Intellectual Property

In this section, an ALU cluster Intellectual Property (IP) is designed and implemented. As mentioned in the previous chapter, the ALU cluster IP differs from the ALU cluster in that it includes an AMBA AHB wrapper. First, the Magnetic RAM (MRAM) is introduced and the modified ALU cluster IP architecture is presented. The taped-out chip works in concert with the MRAM developed by the Industrial Technology Research Institute (ITRI). Then the implementation of the ALU cluster IP is introduced. Finally, circuit verification and chip testing are covered in the last two sub-sections. The details are described in the following sub-sections.

4.2.1 An ALU cluster IP with Magnetic RAM

The ALU cluster IP with Magnetic RAM is an extended version of the ALU cluster IP. First, the characteristics of MRAM are briefly introduced. Then the architecture of the ALU cluster IP is slightly modified to use MRAM as its data memory. Finally, the implementation and verification results are presented.

4.2.1.1 Introduction of Magnetic RAM

Three types of memory, static RAM (SRAM), dynamic RAM (DRAM) and Flash, occupy most of the current market. However, each has its own drawbacks. SRAM and DRAM offer high access speed, but they are volatile, losing their contents when power is turned off, and their power consumption is high. Flash solves these drawbacks: it is non-volatile and its power dissipation is low. However, its access speed is low and its read/write endurance is limited.

Magnetic RAM (MRAM) is an innovation over these types of memory [28]. Its memory cell is formed from a magnetic tunnel junction (MTJ) and a MOSFET [29]. It combines the advantages of existing memories: non-volatility, low power, high access speed, high cell density and strong radiation resistance. In addition, MRAM offers process compatibility and long read/write endurance. Because the MRAM is fabricated in the metal layers, it is compatible with CMOS technology without extra overhead, and its endurance is much higher than that of conventional non-volatile memory such as Flash.

MRAM applications currently target mobile systems, smart cards, radiation-hardened military applications, database storage, RFID and MRAM elements in FPGAs. These include both standalone and embedded memory applications. Given the advantages above and its wide range of applications, MRAM is expected to become a major trend in new-generation memory systems.

4.2.1.2 Modified ALU cluster IP for Magnetic RAM

The architecture of the ALU cluster IP, which uses SRAM as the data memory, must be modified to adopt MRAM as the data memory because the interfaces of MRAM and SRAM are not the same and their data bandwidths also differ. To connect MRAM to our IP, an extra load store unit (LSU) is added. The modified architecture is shown in Fig 4.6. The instruction format is extended slightly from 142 bits to 143 bits. The additional bit controls the mode of the ALU cluster IP. If the additional bit is not set, the IP executes applications normally. When the additional bit is set, the data in the IRF and the MRAM can be accessed separately, depending on the executing instruction. The data bandwidth between the IRF and the MRAM is restricted by the MRAM. The IRF is also modified to support byte transfers. This also suits the AHB wrapper, since it is designed for byte, half word and word access in little endian manner, the same as in the non-modified IP.
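The byte, half word and word accesses in little endian order mentioned above can be illustrated with a small Python sketch over a byte-addressed memory. The helper names `read_le` and `write_le` and the alignment check are assumptions made for this example; they are not taken from the AHB wrapper or LSU implementation.

```python
def write_le(memory, addr, size, value):
    """Store value as size bytes (1, 2 or 4) at addr, least significant byte first."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1 (byte), 2 (half word) or 4 (word)")
    if addr % size != 0:
        raise ValueError("unaligned access (alignment is assumed in this sketch)")
    mask = (1 << (8 * size)) - 1
    memory[addr:addr + size] = (value & mask).to_bytes(size, "little")

def read_le(memory, addr, size):
    """Load size bytes (1, 2 or 4) from addr and assemble them little endian."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1 (byte), 2 (half word) or 4 (word)")
    if addr % size != 0:
        raise ValueError("unaligned access (alignment is assumed in this sketch)")
    return int.from_bytes(memory[addr:addr + size], "little")

memory = bytearray(8)
write_le(memory, 0, 4, 0x11223344)      # word write
print(hex(read_le(memory, 0, 1)))       # 0x44   (lowest address holds the LSB)
print(hex(read_le(memory, 2, 2)))       # 0x1122 (upper half word)
print(hex(read_le(memory, 0, 4)))       # 0x11223344
```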

Fig 4.6 Modified ALU cluster IP architecture for MRAM

4.2.2 Implementation Results

Table 4.3 summarizes the implementation characteristics. The proposed ALU cluster IP is implemented with a cell-based design flow and taped out in TSMC 0.15um CMOS technology. The Synopsys computer aided design (CAD) flow is adopted to complete the chip. The post-layout operating frequency of the ALU cluster IP is 100 MHz. The chip size, core size and gate count are about 3.9x3.9 mm², 3.0x3.0 mm² and 0.27 million respectively. The physical layout and pad assignment are shown in Fig 4.7 and Fig 4.8 respectively.

Table 4.3 Summary of Implementation Characteristics

| Item | Value |
|---|---|
| Process | TSMC 0.15um |
| Post-layout Clock Rate | 100 MHz |
| Chip Size | 3.91 x 3.90 mm² |
| Core Size | 2.98 x 2.98 mm² |
| Gate Count | 267,473 |
| On-chip memory | Instruction Memory: synthesized; Data Memory: MRAM |
| Package Type | COB(PGA256) |
| Pad | Input: 34 pins; Inout: 32 pins; Output: 24 pins; Power: 40 pins |

Fig 4.7 Physical Layout of an ALU Cluster IP

Fig 4.8 Pads Assignment of an ALU Cluster IP

There are 130 pads in total in this design: 34 input pads, 24 output pads, 32 inout pads and 40 power pads. The die microphotograph of the taped-out chip is shown in Fig 4.9. The package selected for the manufactured die is PGA256, and the packaged prototype is shown in Fig 4.10. The definitions of the I/O ports are listed in Table 4.4.

Fig 4.9 Die Microphotograph of Taped Out Chip

Fig 4.10 Photograph of Prototype with Package

Table 4.4 The Definitions of the I/O ports

| I/O Port Name | Input/Output/Inout | Signal Description |
|---|---|---|
| HCLK | Input | The clock signal provided to the designed chip |
