
Chapter 4 Implementation Results and Performance Evaluation

4.2 Verification and Implementation Results of An ALU cluster Intellectual

4.2.1 An ALU cluster IP with Magnetic RAM

4.2.1.2 Modified ALU cluster IP for Magnetic RAM

The architecture of the ALU cluster IP, which uses SRAM as its data memory, must be modified to adopt MRAM as the data memory, because the MRAM interface differs from the SRAM interface and the data bandwidths of the two memories also differ. To connect MRAM to the IP, an extra load store unit (LSU) is added. The modified architecture is shown in Fig 4.6. The instruction format is extended slightly, from 142 bits to 143 bits. The additional bit controls the mode of the ALU cluster IP. When the bit is not set, the IP executes applications normally. When the bit is set, the data in the IRF and in the MRAM can be accessed separately, depending on the instruction being executed. The data bandwidth between the IRF and the MRAM is limited by the MRAM. The IRF is also modified to support byte transfers. This also suits the AHB wrapper, which is designed for byte, half-word, and word accesses in a little-endian manner, the same as in the unmodified IP.
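The byte-wide path between the IRF and the MRAM can be illustrated with a small sketch. It assumes, purely for illustration, that the extra mode bit is the most significant bit of the 143-bit instruction and that a 32-bit word is moved as four byte transfers, least significant byte first; neither detail is stated in the text, and the function names are hypothetical.

```python
MODE_BIT = 142  # assumed position of the extra mode bit (not given in the text)


def mode_bit_set(instruction: int) -> bool:
    """Return True when the assumed mode bit of a 143-bit instruction is set."""
    if not 0 <= instruction < (1 << 143):
        raise ValueError("instruction must fit in 143 bits")
    return bool((instruction >> MODE_BIT) & 1)


def word_to_bytes(word: int) -> list[int]:
    """Split a 32-bit word into four bytes, least significant byte first."""
    if not 0 <= word < (1 << 32):
        raise ValueError("word must fit in 32 bits")
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def bytes_to_word(data: list[int]) -> int:
    """Reassemble four little-endian bytes into one 32-bit word."""
    if len(data) != 4:
        raise ValueError("exactly four bytes are required")
    word = 0
    for i, byte in enumerate(data):
        word |= (byte & 0xFF) << (8 * i)
    return word
```

In this model one word access costs four transfers on the 8-bit memory interface, which is one way to read the statement that the bandwidth is limited by the MRAM.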

Fig 4.6 Modified ALU cluster IP architecture for MRAM

4.2.2 Implementation Results

The implementation characteristics are summarized in Table 4.3. The proposed ALU cluster IP is implemented with a cell-based design flow and taped out in TSMC 0.15um CMOS technology. The Synopsys computer aided design (CAD) flow is adopted to implement the chip. The post-layout operating frequency of the ALU cluster IP is 100 MHz. The chip size, core size and gate count are about 3.9 x 3.9 mm², 3.0 x 3.0 mm² and 0.2 million, respectively. The physical layout and pad assignment are shown in Fig 4.7 and Fig 4.8, respectively.

Table 4.3 Summary of Implementation Characteristics

Process                  TSMC 0.15um
Post-layout Clock Rate   100 MHz
Chip Size                3.91 mm x 3.90 mm
Core Size                2.98 mm x 2.98 mm
Gate Count               267,473
On-chip Memory           Instruction Memory: synthesized
                         Data Memory: MRAM
Package Type             COB (PGA256)
Pad                      Input: 34 pins
                         Inout: 32 pins
                         Output: 24 pins
                         Power: 40 pins

Fig 4.7 Physical Layout of an ALU Cluster IP

Fig 4.8 Pads Assignment of an ALU Cluster IP

There are 130 pads in total in this design: 34 input pads, 24 output pads, 32 inout pads and 40 power pads. In addition, the die microphotograph of the taped-out chip is shown in Fig 4.9. The package selected for the manufactured die is PGA256. The packaged prototype is shown in Fig 4.10. The definitions of the I/O ports are listed in Table 4.4.

Fig 4.9 Die Microphotograph of Taped Out Chip

Fig 4.10 Photograph of Prototype with Package

Table 4.4 Definitions of the I/O Ports

I/O Port Name      Direction  Signal Description

HCLK               Input      The clock signal provided to the chip.
HADDR              Input      A 14-bit input that specifies the address of the
                              instruction memory and data memory.
HSELx              Input      The select signal from the arbiter of the AHB bus
                              that enables the bus slave to work. It will be
                              logical low at all execution stage.
HWRITE             Input      Indicates a write transfer at logical high and a
                              read transfer at logical low.
HTRANS             Input      A 2-bit signal that determines the transfer type
                              of the AHB protocol: IDLE, BUSY, NONSEQ or SEQ.
HSIZE              Input      A 3-bit signal that determines the size of the
                              transfer.
HBURST             Input      A 3-bit signal that indicates which type of burst
                              is used. The burst may be either incrementing or
                              wrapping; four, eight and sixteen beat bursts are
                              supported.
HRESTn             Input      The reset signal provided to the chip.
mem_ls_q           Input      An 8-bit signal that receives data from the data
                              memory (Magnetic RAM), which is not embedded in
                              the chip.
HREADY             Output     Indicates whether the transfer on the bus has
                              finished. Logic high means the transfer has
                              finished; logic low means the transfer needs to
                              be extended.
HRESP              Output     A 2-bit signal that reports the status of a
                              transfer. OKAY, ERROR, RETRY and SPLIT are
                              provided.
mram_ls_d          Output     An 8-bit data output to the data memory, which is
                              not embedded in the chip.
mram_ls_a          Output     A 10-bit output that specifies the address of the
                              external data memory.
Mram_ls_cen        Output     These three signals control the status of the
Mram_ls_wen                   external data memory (disable, read, write and
Mram_ls_oen                   selected disable mode) through different
                              combinations of their values.
HDATA              Inout      A 32-bit inout signal. It receives input data for
                              computation and outputs the calculated results
                              from the chip.
IOVDD & IOVSS      Power      The power supply provides for the core of this
                              chip. There are 12 pairs of power supply.
CoreVDD & CoreVSS  Power      The power supply provides for the IO pads. There
                              are 8 pairs of power supply.
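The control signals in Table 4.4 carry small encoded fields. The sketch below decodes them using the encodings defined in the AMBA 2.0 AHB specification; the table itself does not list the bit patterns, and whether the wrapper accepts every encoding is not stated, so the mapping is given here as background. The helper name describe_transfer is hypothetical.

```python
# Field encodings as defined by the AMBA 2.0 AHB specification.
HTRANS = {0b00: "IDLE", 0b01: "BUSY", 0b10: "NONSEQ", 0b11: "SEQ"}
HSIZE_BYTES = {0b000: 1, 0b001: 2, 0b010: 4}  # byte, half word, word
HBURST = {
    0b000: "SINGLE", 0b001: "INCR",
    0b010: "WRAP4", 0b011: "INCR4",
    0b100: "WRAP8", 0b101: "INCR8",
    0b110: "WRAP16", 0b111: "INCR16",
}
HRESP = {0b00: "OKAY", 0b01: "ERROR", 0b10: "RETRY", 0b11: "SPLIT"}


def describe_transfer(hwrite: int, htrans: int, hsize: int, hburst: int) -> str:
    """Render one AHB address phase as a readable string (illustrative helper)."""
    direction = "write" if hwrite else "read"
    return (f"{HTRANS[htrans]} {direction}, "
            f"{HSIZE_BYTES[hsize]} byte(s), {HBURST[hburst]} burst")
```

For example, `describe_transfer(1, 0b10, 0b010, 0b011)` describes the first beat of a four-beat incrementing word write.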

4.2.3 Circuit Verification

A common operation in multimedia processing applications, the finite impulse response (FIR) filter system [30], is chosen as the benchmark of the ALU cluster IP. The benchmark used to simulate and verify the proposed IP is a 16-tap FIR filter system. Media applications can be expressed in the stream programming model, which fits the features of the ALU cluster IP. In modern media and DSP applications, FIR filtering is one of the most popular and widely applied operations, used in matched filtering, pulse shaping, equalization and so on. The selected benchmark is suitable for functional verification of a one-dimensional architecture because it is repetitive and has a high percentage of additions and multiplications.

A brief review of the FIR filter system follows. The input-output relationship of a linear time-invariant FIR filter is described by Equation 4.1.

Equation 4.1

\[ y[n] = \sum_{k=0}^{M-1} b_k \, x[n-k] \]

As shown in the equation, M is the length of the FIR filter, b_k are the coefficients and x[n-k] is the data sampled at time instance n-k. The output y[n] is the response at time instance n. Fig 4.11 plots the coefficients b_k of the sixteen-tap Kaiser-window FIR bandpass filter together with the input data, an exponential function with ten sampling points. MathWorks MATLAB is used to compute the expected results in advance. The results are illustrated in Fig 4.12.
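The FIR relationship, y[n] = sum over k from 0 to M-1 of b_k x[n-k], can be written directly as a multiply-accumulate loop. The following is a minimal reference sketch, not the code run on the ALU cluster IP; it assumes samples before n = 0 are zero, and the function name is illustrative.

```python
def fir_filter(b: list[float], x: list[float]) -> list[float]:
    """Direct-form FIR filter: y[n] = sum_k b[k] * x[n - k].

    Samples of x before index 0 are treated as zero (an assumption).
    """
    taps = len(b)  # M, the filter length
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(taps):
            if n - k >= 0:  # skip samples that precede the input sequence
                acc += b[k] * x[n - k]
        y.append(acc)
    return y
```

Each output sample costs M multiplications and M-1 additions, which is why the benchmark is dominated by these two operations.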

Fig 4.11 The input function and coefficients of the FIR filter system

Fig 4.12 Output results of the FIR filter system

After the 16-tap FIR filter system is simulated with MathWorks MATLAB, the benchmark is ported to the implemented design for circuit verification. The circuit verification is carried out by post-layout simulation of the taped-out chip introduced above. Due to the imperfect library of the CMOS technology, the simulation results presented here are from the ALU cluster IP hard macro. The post-layout simulation results are shown consecutively in Fig 4.13(a) to Fig 4.13(e). The post-layout simulation is based on TSMC 0.15um CMOS technology. Comparing the final result from MATLAB with the post-layout simulation results shows that the proposed design works correctly at a clock rate of 100 MHz, which confirms the functionality of the IP.

Fig 4.13(a) Post-Layout Simulation Results of an ALU cluster IP (Ⅰ)

Fig 4.13(b) Post-Layout Simulation Results of an ALU cluster IP (Ⅱ)

Fig 4.13(c) Post-Layout Simulation Results of an ALU cluster IP (Ⅲ)

Fig 4.13(d) Post-Layout Simulation Results of an ALU cluster IP (Ⅳ)

Fig 4.13(e) Post-Layout Simulation Results of an ALU cluster IP (Ⅴ)

4.2.4 Chip Testing

Chip testing of prototype 2, the ALU cluster IP, has been carried out. A printed circuit board (PCB) is designed and manufactured in order to verify the silicon chip. As illustrated in Fig 4.14, a four-layer PCB with a socket is used, and the packaged chip is placed in the socket soldered on the board for measurement. Three different supply voltages are used, as shown in Fig 4.14: the IO pad power for the chip, the core power for the chip and the supply power for the MRAM. The voltages are 3.3 V, 1.2 V and 5 V, respectively. The buffer on the PCB is used to control the input and output states of the bidirectional IO pads adopted by the chip.

Fig 4.14 The Printed Circuit Board (PCB) for the manufactured chip

The test equipment is an Agilent 16902A Logic Analyzer System with Agilent 16720A pattern generator and Agilent 16910A logic analyzer modules, as shown in Fig 4.15. To avoid performance degradation caused by the pods of the pattern generator modules, the pods are connected directly to the PCB that carries the chip. The connection is illustrated in Fig 4.16.

Fig 4.15 Testing Equipments – Logic Analyzer System

Fig 4.16 Connection between the Chip and Testing Equipments

Chip testing is divided into two parts: functional measurement and performance measurement. The performance measurement adopts the 16-tap FIR filter system presented previously as the benchmark. The functional measurements cover the basic functions: writing to the IRF and memory, and reading from the IRF and memory. While verifying the chip, several phenomena were observed. First, the input signals from the pattern generator modules cannot be delivered to the chip correctly. Measuring the signals with an oscilloscope reveals that they have irregular voltage values near the threshold voltage. The peripheral signals on the PCB were also measured, and the same phenomenon is present there. Second, control signals for the MRAM, such as the chip enable signal, toggle irregularly, which disagrees with the results of the post-layout simulation. Besides the chip enable signal, some other signals show the same behavior. These are the two main phenomena observed during chip testing, and they may cause the measurement errors. Chip testing is still in progress, and the errors will be resolved and discussed.

4.3 Circuit Implementation and Results of Floating Point Units for the ALU cluster IP

In this section, the implementation results of the FPUs described above are introduced. These FPUs are implemented as hard macros with a cell-based design flow. The macros can be integrated with the ALU cluster IP to provide efficient floating point operations. The synthesis results and the results of Auto Place and Route (APR) are discussed. The circuit verification results obtained through post-layout simulation are also listed in this section.

The floating point units are synthesized with Synopsys Design Compiler, and the physical layouts of the FPU macros are completed with Synopsys Astro. The TSMC 0.18um CMOS technology and the Artisan SAGE-X Standard Cell Library are adopted for implementing these FPUs. As mentioned above, three types of FPUs are designed. The type 1 FPU provides floating point addition, subtraction and multiplication. It operates at a post-layout simulation frequency of 75 MHz. The gate count and area are 23,298 and 0.415 mm², respectively. The area utilization and power consumption are 0.9 and 10.85 mW, respectively. The physical layout of the type 1 FPU macro is shown in Fig 4.17. The type 2 FPU macro, shown in Fig 4.18, provides floating point addition, subtraction, multiplication and division. The post-layout simulation clock rate of the type 2 FPU is 25 MHz. The gate count and area are 31,331 and 0.529 mm², respectively. The area utilization, the same as for the type 1 FPU, is 0.9. The power dissipation of the type 2 FPU is 4.60 mW. The type 3 FPU, which provides floating point division only, is intended to be paired with the type 1 FPU to provide the same operations as the type 2 FPU. The type 3 FPU is therefore implemented with a post-layout simulation clock rate of 25 MHz, although this is not the highest operating frequency it can achieve. The gate count and area are 24,931 and 0.396 mm², respectively. The area utilization, the same as for the type 1 and type 2 FPUs, is 0.9, and the power dissipation is 6.59 mW. The physical layout of the type 3 FPU macro is shown in Fig 4.19 below. These results are summarized in Table 4.5.
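Because the type 3 FPU is meant to be paired with the type 1 FPU as an alternative to the type 2 FPU, it can help to put the reported figures side by side. The sketch below uses only the numbers quoted above; the combined totals and the energy per cycle (power divided by clock rate) are derived here for illustration and are not results reported in the thesis.

```python
# Figures quoted in the text for each FPU macro.
FPUS = {
    "type1": {"clock_mhz": 75.0, "gates": 23298, "area_mm2": 0.415, "power_mw": 10.85},
    "type2": {"clock_mhz": 25.0, "gates": 31331, "area_mm2": 0.529, "power_mw": 4.60},
    "type3": {"clock_mhz": 25.0, "gates": 24931, "area_mm2": 0.396, "power_mw": 6.59},
}


def energy_per_cycle_nj(fpu: dict) -> float:
    """Derived metric: mW divided by MHz gives nanojoules per clock cycle."""
    return fpu["power_mw"] / fpu["clock_mhz"]


def combined(first: dict, second: dict) -> dict:
    """Sum gate count and area of two macros placed side by side."""
    return {
        "gates": first["gates"] + second["gates"],
        "area_mm2": first["area_mm2"] + second["area_mm2"],
    }


pair = combined(FPUS["type1"], FPUS["type3"])
print(pair["gates"], round(pair["area_mm2"], 3))
print({name: round(energy_per_cycle_nj(f), 4) for name, f in FPUS.items()})
```

Under these figures the type 1 plus type 3 pair costs more gates and area than the single type 2 macro, while keeping addition, subtraction and multiplication at the higher 75 MHz clock rate.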

Fig 4.17 Physical Layout of the Type 1 FPU macro

Fig 4.18 Physical Layout of the Type 2 FPU macro

Fig 4.19 Physical Layout of the Type 3 FPU macro

Table 4.5 Summary of the Implementation Results

Floating Point Unit  Type 1 (ADD, SUB, MUL)  Type 2 (ADD, SUB, MUL, DIV)  Type 3 (DIV only)
Technology           TSMC 0.18um             TSMC 0.18um                  TSMC 0.18um
Cell Library         Artisan SAGE-X^TM       Artisan SAGE-X^TM            Artisan SAGE-X^TM

The circuit verification results are listed below. Post-layout simulations are performed with the TSMC 0.18um CMOS technology and library environment. The post-layout simulation results for the type 1 FPU are shown in Fig 4.20 and Fig 4.21, which give the full view and an excerpt of the whole simulation period, respectively. Likewise, the verification results of the type 2 FPU are shown in Fig 4.22 and Fig 4.23, and those of the type 3 FPU in Fig 4.24 and Fig 4.25, again as a full view and an excerpt of the whole simulation period. These results confirm that the hard macros work correctly at their respective post-layout simulation clock rates.

Fig 4.20 Full View of Post-Layout Simulation Results for Type 1 FPU

Fig 4.21 Excerpt of Post-Layout Simulation Results for Type 1 FPU

Fig 4.22 Full View of Post-Layout Simulation Results for Type 2 FPU

Fig 4.23 Excerpt of Post-Layout Simulation Results for Type 2 FPU

Fig 4.24 Full View of Post-Layout Simulation Results for Type 3 FPU

Fig 4.25 Excerpt of Post-Layout Simulation Results for Type 3 FPU

4.4 Performance Evaluation and Comparison

In this section, the performance of floating point operations is evaluated and compared. The benchmark used for the evaluation is the Fast Fourier Transform (FFT), which is commonly used in media processing applications. Two target architectures are used to evaluate the performance of floating point operations and are compared with each other. The benchmark and the evaluation results are discussed in the following subsections.

4.4.1 Selected Benchmark

In this thesis, the 32-point Fast Fourier Transform is selected as the benchmark. The FFT is an efficient algorithm for computing the Fourier Transform. The FFT is selected for three reasons. First, the FFT is one of the most frequently used operations in multimedia and signal processing applications. Second, it is used in [18] as an example of an application executing on the streaming programming model. Third, applications involving the FFT usually need to handle floating point operations. It is therefore well matched to the performance evaluation presented in this thesis.

The Split-Radix FFT (SRFFT) algorithm is adopted to form the benchmark [31] [32]. Inspection of the decimation-in-frequency flowchart of the FFT shows that the even terms and the odd terms of the Discrete Fourier Transform (DFT) can be computed independently. The radix-2 algorithm is better suited to the even terms of the DFT, and the radix-4 algorithm to the odd terms. The SRFFT algorithm, which reduces the number of computations, therefore mixes the radix-2 and radix-4 algorithms within the same FFT algorithm.

As mentioned above, the FFT algorithm is decomposed into even terms and odd terms that are computed independently. The radix-2 decimation-in-frequency FFT algorithm used for the even-numbered samples of the N-point DFT is given below.

\[ X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{2nk} \]

The odd-numbered samples X(2k+1) of the DFT require pre-multiplication of the input sequence by W_N^n. The radix-4 decimation-in-frequency FFT algorithm used for the odd-numbered samples of the N-point DFT is given by the two equations below.
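For reference, the standard split-radix decimation-in-frequency expressions for the odd-numbered samples are commonly written as follows, with the twiddle factor W_N = e^{-j 2\pi / N} and k = 0, 1, ..., N/4 - 1. The notation is assumed here and may differ from the one used in [31] [32].

\[ X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] - j \left[ x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{n} \, W_N^{4nk} \]

\[ X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] + j \left[ x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{3n} \, W_N^{4nk} \]

The factors -j and +j come from W_N^{N/4} = -j and W_N^{3N/4} = j, and W_N^{n} and W_N^{3n} are the pre-multiplications applied to the input sequence.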

The length-N DFT is obtained by applying the Split-Radix FFT algorithm. The benchmark adopted is the length-32 Split-Radix FFT, and its flowchart is shown in Fig 4.26.
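The decomposition described above can be sketched as a short recursive routine: one half-length transform for the even-numbered outputs and two quarter-length transforms for the outputs 4k+1 and 4k+3. This is a plain reference model for checking results, assuming the input length is a power of two; it is not the schedule mapped onto the ALU cluster IP, and the function names are illustrative.

```python
import cmath


def split_radix_fft(x: list[complex]) -> list[complex]:
    """Recursive decimation-in-frequency split-radix FFT (length a power of two)."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    if n_points == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n_points // 2, n_points // 4
    # Radix-2 part: even-numbered outputs come from x(n) + x(n + N/2).
    even_in = [x[n] + x[n + half] for n in range(half)]
    # Radix-4 part: odd-numbered outputs, pre-multiplied by W_N^n and W_N^(3n).
    odd1_in, odd3_in = [], []
    for n in range(quarter):
        a = x[n] - x[n + half]
        b = x[n + quarter] - x[n + 3 * quarter]
        w1 = cmath.exp(-2j * cmath.pi * n / n_points)
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / n_points)
        odd1_in.append((a - 1j * b) * w1)
        odd3_in.append((a + 1j * b) * w3)
    out = [0j] * n_points
    out[0::2] = split_radix_fft(even_in)   # X(2k)
    out[1::4] = split_radix_fft(odd1_in)   # X(4k+1)
    out[3::4] = split_radix_fft(odd3_in)   # X(4k+3)
    return out


def naive_dft(x: list[complex]) -> list[complex]:
    """Direct O(N^2) DFT, used only as a reference for comparison."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]
```

Comparing `split_radix_fft` with `naive_dft` on a 32-point input is a quick way to confirm that the three sub-transforms are interleaved into the correct output positions.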

Fig 4.26 Flowchart of the length 32 Split-Radix FFT algorithm

4.4.2 Evaluation and Comparison Results

Recalling the development roadmap discussed in the first part of Chapter 3, the ALU cluster IPs will be combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. As shown in Fig 3.4, different numbers of ALU cluster IPs will be stacked on the board. The numbers of ALU cluster IPs are decided according to the results of the simulator: one, two, four and eight ALU cluster IPs will be integrated with the baseboard in turn. As introduced above, there are two target architectures for the performance evaluation, and each of them is considered for every number of integrated ALU cluster IPs.

The two target architectures used in the thesis are the Original Integer Architecture and the mixed Floating Point Unit and Original Integer Architecture. The Original Integer Architecture is the architecture of the ALU cluster IP described in Section 3.3.2, which is essentially an architecture for integer operations. For this architecture, the floating point operations of the benchmark are decomposed: all operands are represented in the IEEE 754 floating point format and split into several integer fields, which are computed separately. In other words, floating point operations are executed on the integer architecture by decomposing the format to fit the original integer form. The second evaluation architecture is the mixed floating point unit and original integer architecture. The SRFFT benchmark includes both integer operations and floating point operations. In the second architecture, the integer operations are computed by the integer units and the floating point operations by the floating point unit.
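Decomposing an IEEE 754 operand into integer fields can be sketched as follows. The example assumes the 32-bit single-precision format (the precision is not stated in the text) and shows a multiplication built only from integer operations on the fields. It handles normalized finite values only, truncates instead of rounding, and ignores overflow and underflow; it illustrates the idea and is not the instruction sequence used on the ALU cluster IP.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision


def decompose(value: float) -> tuple[int, int, int]:
    """Split a float32 into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def compose(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a float32 from its three integer fields."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def multiply(a: float, b: float) -> float:
    """Multiply two normalized float32 values using integer operations only."""
    sign_a, exp_a, frac_a = decompose(a)
    sign_b, exp_b, frac_b = decompose(b)
    sign = sign_a ^ sign_b                  # sign field: exclusive OR
    exponent = exp_a + exp_b - BIAS         # exponent field: integer add
    # Significand field: integer multiply of the 24-bit values (hidden 1 restored).
    product = ((1 << 23) | frac_a) * ((1 << 23) | frac_b)
    if product >> 47:                       # product in [2, 4): renormalize
        product >>= 24
        exponent += 1
    else:                                   # product in [1, 2)
        product >>= 23
    return compose(sign, exponent, product & 0x7FFFFF)
```

Each field is handled by a different kind of integer operation, which is why one floating point operation expands into several integer operations on the original architecture.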

Before the evaluation results are presented, the cycles required to execute floating point operations on the original integer architecture, that is, the ALU cluster IP, are calculated. A floating point multiplication needs six cycles to complete. A floating point addition in the ALU costs seven cycles to complete. Apart from these floating point operations, the other integer operations of this architecture need one cycle each. The first data fed into a functional unit needs additional cycles to complete because of pipelining: the integer ALU operations of the ALU cluster IP are two-stage pipelined and the integer multiplication operations are four-stage pipelined. In the floating point unit architecture, a floating point addition takes one cycle and a floating point multiplication takes one cycle. In the performance evaluation of the floating point unit architecture, all data are represented in the IEEE 754 format and computed in this architecture.
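The per-operation costs quoted above can be turned into a rough cycle estimate. The sketch below is a simplified model under stated assumptions: operations on one functional unit are counted serially, and the pipeline fill of a fully pipelined unit that accepts one operation per cycle is modeled as (stages - 1) extra cycles. The operation counts in the test values are hypothetical and are not the counts of the 32-point SRFFT benchmark.

```python
# Cycles per operation as quoted in the text.
CYCLES_PER_OP = {
    "integer": {"fadd": 7, "fmul": 6, "int": 1},  # original integer architecture
    "fpu": {"fadd": 1, "fmul": 1, "int": 1},      # mixed FPU and integer architecture
}


def estimate_cycles(op_counts: dict[str, int], architecture: str) -> int:
    """Sum count x cost over all operation kinds (serial issue assumed)."""
    costs = CYCLES_PER_OP[architecture]
    return sum(count * costs[op] for op, count in op_counts.items())


def pipelined_latency(n_ops: int, stages: int) -> int:
    """Cycles for n_ops back-to-back operations on a stages-deep pipeline."""
    if n_ops <= 0:
        return 0
    return stages + n_ops - 1
```

With a hypothetical mix of 10 floating point additions, 4 floating point multiplications and 3 integer operations, the model gives 97 cycles on the integer architecture and 17 cycles with the floating point unit.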

Different numbers of clusters are considered and evaluated with the two target architectures. The cycles required to complete the 32-point FFT benchmark for the different numbers of clusters and the different target architectures are listed separately in the following tables. The leftmost field gives the number of ALU cluster IPs included in the evaluation. The middle field gives the operation cycles needed by each functional unit while executing the benchmark. The rightmost field gives the critical cycles, which are underlined and dominate the performance for each combination of cluster count and target architecture.
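The critical cycles in the rightmost field are simply the largest per-unit cycle count, since the benchmark cannot finish before its busiest functional unit does. A minimal sketch of that selection follows; the ALU2 value is the single-cluster figure reported for the Original Integer Architecture, while the other unit names and numbers are placeholders invented for illustration.

```python
def critical_unit(cycles_per_unit: dict[str, int]) -> tuple[str, int]:
    """Return the functional unit with the most cycles and its cycle count."""
    unit = max(cycles_per_unit, key=cycles_per_unit.get)
    return unit, cycles_per_unit[unit]


# ALU2 = 647 is the reported single-cluster figure; the rest are placeholders.
example = {"ALU1": 600, "ALU2": 647, "MUL": 320}
print(critical_unit(example))
```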

Table 4.6 shows the detailed cycle information for the Original Integer Architecture, that is, the ALU cluster IP described in the earlier section. When only one cluster is used in the evaluation, the dominant functional unit is ALU2 and the dominant cycle count is 647 cycles. In the same way, the evaluation results for two clusters are listed, and the dominant functional unit is
