
Chapter 4 Implementation Results and Performance Evaluation

4.2 Verification and Implementation Results of the ALU Cluster Intellectual Property (IP)

4.2.3 Circuit Verification

The finite impulse response (FIR) filter [30], one of the most common operations in multimedia processing applications, is chosen as the benchmark for the ALU cluster IP. The benchmark used to simulate and verify the proposed IP is a 16-tap FIR filter system.

Media applications can be expressed in the stream programming model, which fits the features of the ALU cluster IP well. In modern media and DSP applications, FIR filtering is one of the most widely applied operations, appearing in matched filtering, pulse shaping, equalization, and so on. The selected benchmark is therefore suitable for functional verification of a one-dimensional architecture that relies on repetitive operations with a high percentage of additions and multiplications.

A brief review of the FIR filter system is given below. The input-output relationship of a linear time-invariant FIR filter is described by Equation 4.1.

Equation 4.1

y[n] = \sum_{k=0}^{M-1} b_k \, x[n-k]

As shown in the equation, M represents the length of the FIR filter, b_k represents the filter coefficients, and x[n-k] denotes the data sampled at time instance n-k. The output y[n] is the response at time instance n. As illustrated in Fig 4.11, the coefficients b_k of the sixteen-tap Kaiser-window FIR bandpass filter and the input data, an exponential function with ten sampling points, are plotted. Mathworks Matlab is used to simulate the correct results in advance; the results are illustrated in Fig 4.12.

Fig 4.11 The input function and coefficients of the FIR filter system

Fig 4.12 Output results of the FIR filter system
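To make the golden-reference step concrete, a direct evaluation of Equation 4.1 can be sketched in a few lines. This is an illustrative model only; the coefficient and input values below are placeholders rather than the actual Kaiser-window coefficients and exponential samples of Fig 4.11.

```python
import numpy as np

def fir_filter(x, b):
    """Direct-form FIR filter: y[n] = sum_{k=0}^{M-1} b[k] * x[n-k] (Equation 4.1)."""
    M = len(b)
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(M):
            if n - k >= 0:            # samples before n = 0 are treated as zero
                y[n] += b[k] * x[n - k]
    return y

# Placeholder stimulus: the thesis uses a 16-tap Kaiser-window bandpass filter
# and an exponential input with ten sampling points; the values below are
# illustrative only.
b = np.ones(16) / 16.0                # hypothetical 16-tap coefficients
x = np.exp(-0.5 * np.arange(10))      # hypothetical exponential input samples
print(fir_filter(x, b))               # golden reference to compare against the IP
```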

After simulating the 16-tap FIR filter system with Mathworks Matlab, the benchmark is ported onto the implemented design for circuit verification. The circuit verification is performed by post-layout simulation of the taped-out chip introduced above. Because of the incomplete library of the CMOS technology, the simulation results reported here are taken from the ALU cluster IP hard macro. The results of the post-layout simulation are listed in Fig 4.13(a) through Fig 4.13(e). The post-layout simulation is based on TSMC 0.15um CMOS technology. Comparing the final result from Matlab with the post-layout simulation results shows that the proposed circuit works correctly at a clock rate of 100 MHz, confirming that the functionality of the IP is correct.

Fig 4.13(a) Post-Layout Simulation Results of an ALU cluster IP (Ⅰ)

Fig 4.13(b) Post-Layout Simulation Results of an ALU cluster IP (Ⅱ)

Fig 4.13(c) Post-Layout Simulation Results of an ALU cluster IP (Ⅲ)

Fig 4.13(d) Post-Layout Simulation Results of an ALU cluster IP (Ⅳ)

Fig 4.13(e) Post-Layout Simulation Results of an ALU cluster IP (Ⅴ)

4.2.4 Chip Testing

Chip testing of prototype 2, the ALU cluster IP, is carried out. A printed circuit board (PCB) is designed and manufactured in order to verify the silicon chip.

As illustrated in Fig 4.14, a four-layer PCB with a socket is used, and the packaged chip is placed in the socket soldered on the board for measurement. Three power supply voltages are used, as shown in Fig 4.14: IO pad power for the chip, core power for the chip, and supply power for the MRAM, at 3.3 V, 1.2 V, and 5 V respectively. The buffer on the PCB is used to control the input and output states of the bidirectional IO pads adopted by the chip.

Fig 4.14 The Printed Circuit Board (PCB) for the manufactured chip

The testing equipment adopted is the Agilent 16902A Logic Analyzer System with an Agilent 16720A pattern generator module and an Agilent 16910A logic analyzer module, as shown in Fig 4.15. In order to avoid the performance degradation caused by the pods of the pattern generator module, the pods are connected directly to the PCB carrying the chip. A diagram is illustrated in Fig 4.16.

Fig 4.15 Testing Equipment – Logic Analyzer System

Fig 4.16 Connection between the Chip and Testing Equipments

Chip testing is divided into two parts: functional measurement and performance measurement. The performance measurement adopts the 16-tap FIR filter system presented previously as the benchmark, while the functional measurements include basic functions, writing to the IRF and memory, and reading from the IRF and memory. While verifying the chip, several phenomena are revealed. First, the input signals from the pattern generator modules cannot be sent into the chip correctly. Measuring the signals with an oscilloscope shows that they take irregular voltage values near the threshold voltage. The peripheral signals on the PCB are also measured, and the same phenomenon is still present. Second, the control signals for the MRAM, such as the chip enable signal, toggle irregularly, which disagrees with the results of the post-layout simulation.

Not only the chip enable signal but also several other signals exhibit the same behavior. These are the two main phenomena observed during chip testing, and they may cause measurement errors. Chip testing is still in progress, and these errors will be resolved and discussed.

4.3 Circuit Implementation and Results of Floating Point Units for the ALU cluster IP

In this section, the implementation results of the FPUs described above are presented. These FPUs are implemented as hard macros with a cell-based design flow.

The macros can be integrated with the ALU cluster IP to provide efficient floating-point operation capability. The synthesis results and the results of automatic place and route (APR) are discussed, and the circuit verification results obtained through post-layout simulation are also listed in this section.

The floating point units are synthesized with Synopsys Design Compiler, and the physical layouts of the FPU macros are completed with Synopsys Astro.

The TSMC 0.18um CMOS technology and the Artisan SAGE-X Standard Cell Library are adopted for implementing these FPUs. As mentioned above, three types of FPUs are designed. The Type 1 FPU provides the floating-point addition, subtraction, and multiplication operations. It operates at a post-layout simulation frequency of 75 MHz; its gate count and area are 23,298 and 0.415 mm2 respectively, with an area utilization of 0.9 and a power consumption of 10.85 mW. The physical layout of the Type 1 FPU macro is shown in Fig 4.17. As shown in Fig 4.18, the Type 2 FPU macro is also implemented; it provides floating-point addition, subtraction, multiplication, and division. The post-layout simulation clock rate of the Type 2 FPU is 25 MHz; its gate count and area are 31,331 and 0.529 mm2 respectively. Its area utilization, the same as the Type 1 FPU, is 0.9, and its power dissipation is 4.60 mW. The Type 3 FPU, which implements only the floating-point division operation, is intended to be paired with the Type 1 FPU to provide the same operations as the Type 2 FPU. The Type 3 FPU is therefore implemented at a 25 MHz post-layout simulation clock rate, although this is not the fastest frequency it can achieve. Its gate count and area are 24,931 and 0.396 mm2; its area utilization, the same as the Type 1 and Type 2 FPUs, is 0.9, and its power dissipation is 6.59 mW. The physical layout of the Type 3 FPU macro is shown in Fig 4.19 below. A summary of these results is listed in Table 4.5.

Fig 4.17 Physical Layout of the Type 1 FPU macro

Fig 4.18 Physical Layout of the Type 2 FPU macro

Fig 4.19 Physical Layout of the Type 3 FPU macro

Table 4.5 Summary of the Implementation Results

                      Type 1 (ADD, SUB, MUL)   Type 2 (ADD, SUB, MUL, DIV)   Type 3 (DIV only)
  Technology          TSMC 0.18um              TSMC 0.18um                   TSMC 0.18um
  Cell Library        Artisan SAGE-X           Artisan SAGE-X                Artisan SAGE-X
  Clock Rate          75 MHz                   25 MHz                        25 MHz
  Gate Count          23,298                   31,331                        24,931
  Area                0.415 mm2                0.529 mm2                     0.396 mm2
  Area Utilization    0.9                      0.9                           0.9
  Power               10.85 mW                 4.60 mW                       6.59 mW

The circuit verification results are listed below. Post-layout simulations are performed with the TSMC 0.18um CMOS technology and library environment. The post-layout simulation results for the Type 1 FPU are shown in Fig 4.20 and Fig 4.21.

They show the full simulation window and a zoomed-in portion of it, respectively. As with the Type 1 FPU, the verification results of the Type 2 FPU are shown in Fig 4.22 and Fig 4.23, again as a full view and a portion of the whole period respectively. Finally, the post-layout simulation results of the Type 3 FPU are illustrated in Fig 4.24 and Fig 4.25, organized in the same way. These results confirm that the hard macros operate correctly at their respective post-layout simulation clock rates.

Fig 4.20 Full View of Post-Layout Simulation Results for Type 1 FPU

Fig 4.21 Interception of Post-Layout Simulation Results for Type 1 FPU

Fig 4.22 Full View of Post-Layout Simulation Results for Type 2 FPU

Fig 4.23 Interception of Post-Layout Simulation Results for Type 2 FPU

Fig 4.24 Full View of Post-Layout Simulation Results for Type 3 FPU

Fig 4.25 Interception of Post-Layout Simulation Results for Type 3 FPU

4.4 Performance Evaluation and Comparison

In this section, the performance of the floating-point operations is evaluated and compared. The benchmark used for the evaluation is the Fast Fourier Transform (FFT), which is commonly used in media processing applications. Two target architectures are used to evaluate the performance of the floating-point operations and to compare them with each other. These two parts are discussed in the following.

4.4.1 Selected Benchmark

In this thesis, a 32-point Fast Fourier Transform is selected as the benchmark. The FFT is an efficient algorithm for computing the Fourier Transform. There are three reasons why the FFT is selected. First, the FFT is one of the most frequently used operations in multimedia and signal processing applications. Second, the FFT is used as an example of execution on the stream programming model in [18]. Third, applications involving the FFT usually need to handle floating-point operations, which matches the performance evaluation presented in this thesis.

The Split-Radix FFT (SRFFT) algorithm is adopted to form the benchmark [31][32]. An inspection of the decimation-in-frequency flowchart of the FFT shows that the even-indexed and odd-indexed terms of the Discrete Fourier Transform (DFT) can be computed independently. The radix-2 algorithm is better suited to the even-indexed terms and the radix-4 algorithm to the odd-indexed terms of the DFT. The Split-Radix FFT (SRFFT) algorithm therefore reduces the number of computations by mixing the radix-2 and radix-4 algorithms within the same FFT.

As mentioned above, the FFT algorithm is decomposed into even-indexed and odd-indexed terms that are computed independently. The radix-2 decimation-in-frequency relation used for the even-numbered samples of the N-point DFT is given below.
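For reference, this relation is reproduced here in its standard textbook form (as in [31][32]), with W_N = e^{-j 2\pi / N}:

X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{nk}, \qquad k = 0, 1, \ldots, \tfrac{N}{2} - 1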

The odd-numbered samples X(2k+1) of the DFT require a pre-multiplication of the input sequence by W_N^n. The radix-4 decimation-in-frequency relations used for the odd-numbered samples of the N-point DFT are given in the two equations below.
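In their standard form (again following [31][32]), the two relations for the odd-indexed samples are:

X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] - j\left[ x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{n} \, W_{N/4}^{nk}

X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] + j\left[ x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{3n} \, W_{N/4}^{nk}

for k = 0, 1, \ldots, \tfrac{N}{4} - 1.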

The length-N DFT is obtained by applying the Split-Radix FFT decomposition recursively.

The benchmark adopted is the length-32 Split-Radix FFT, and its flowchart is shown in Fig 4.26.

Fig 4.26 Flowchart of the length 32 Split-Radix FFT algorithm
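To make the recursion behind Fig 4.26 concrete, a minimal recursive split-radix FFT reference model can be sketched as follows. This is only an illustrative software model, not the code that is scheduled onto the ALU cluster IP.

```python
import cmath

def srfft(x):
    """Recursive split-radix DIF FFT (reference model, power-of-two length)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    # Radix-2 part: even-indexed outputs come from a length-N/2 DFT of x(n)+x(n+N/2).
    u = [x[n] + x[n + N // 2] for n in range(N // 2)]
    # Radix-4 part: odd-indexed outputs use x(n)-x(n+N/2) with twiddle pre-multiplication.
    z = [x[n] - x[n + N // 2] for n in range(N // 2)]
    w = cmath.exp(-2j * cmath.pi / N)                                          # W_N
    zp = [(z[n] - 1j * z[n + N // 4]) * w ** n       for n in range(N // 4)]   # -> X(4k+1)
    zm = [(z[n] + 1j * z[n + N // 4]) * w ** (3 * n) for n in range(N // 4)]   # -> X(4k+3)
    X = [0j] * N
    X[0::2] = srfft(u)
    X[1::4] = srfft(zp)
    X[3::4] = srfft(zm)
    return X

# Example: 32-point SRFFT of an impulse (all outputs should equal 1).
print(srfft([1.0] + [0.0] * 31))
```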

4.4.2 Evaluation and Comparison Results

Recalling the development roadmap discussed in the first part of Chapter 3, the ALU cluster IPs will be combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. As shown in Fig 3.4, different numbers of ALU cluster IPs will be stacked on the board. Based on the simulator results, the numbers of ALU cluster IPs to evaluate are decided: one, two, four, and eight ALU cluster IPs will be integrated with the baseboard in turn. As introduced above, there are two target architectures for evaluating the performance, and each of them is considered for every number of integrated ALU cluster IPs.

The two target architectures used in this thesis are the Original Integer Architecture and the Floating Point Unit and Original Integer Architecture Mixed. First, the original integer architecture is the architecture of the ALU cluster IP described in Section 3.3.2; it is essentially an architecture for integer operations. For this architecture, the floating-point operations of the benchmark are decomposed: every operand is represented in the IEEE 754 floating-point format, split into several integer fields, and computed separately. In other words, the floating-point operations are executed on the integer architecture by decomposing the format into fields that fit the original datapath. The second evaluation architecture is the mixture of the floating point unit and the original integer architecture. The SRFFT benchmark contains both integer operations and floating-point operations; in the second architecture, the integer operations and the floating-point operations are computed by the integer units and the floating point unit respectively.
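As an illustration of the field decomposition used by the first target architecture, a single-precision IEEE 754 operand can be split into its sign, exponent, and fraction fields with integer operations only. The sketch below shows only the field extraction; the multi-cycle add and multiply sequences actually executed on the ALU cluster IP are not reproduced here.

```python
import struct

def fp32_fields(value):
    """Split an IEEE 754 single-precision number into its integer fields."""
    bits = struct.unpack('>I', struct.pack('>f', value))[0]
    sign     = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF        # biased exponent (bias = 127)
    fraction = bits & 0x7FFFFF            # 23-bit fraction field
    return sign, exponent, fraction

# The integer architecture would then operate on these fields separately, e.g.
# adding exponents and multiplying the (1.fraction) significands for a multiply.
s, e, f = fp32_fields(-6.25)
print(s, e - 127, hex(f))                 # sign, unbiased exponent, fraction
```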

Before the evaluation results are presented, the cycles required to execute floating-point operations on the original integer architecture, i.e. the ALU cluster IP, are calculated. A floating-point multiplication takes six cycles to complete, and a floating-point addition in the ALU unit takes seven cycles. Apart from these floating-point operations, the other integer operations of this architecture take one cycle each. The first data item entering a functional unit needs additional cycles because of the pipelining: the integer ALU of the ALU cluster IP is two-stage pipelined and the integer multiplier is four-stage pipelined. On the floating point unit architecture, a floating-point addition and a floating-point multiplication each take one cycle. In the performance evaluation of the floating point unit architecture, all data are represented in the IEEE 754 format and computed in that architecture.
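The per-operation latencies quoted above (seven cycles for a floating-point addition and six for a floating-point multiplication on the integer architecture, one cycle each on the FPU, and one cycle for plain integer operations) can be tallied per functional unit as sketched below. The operation counts are placeholders and the model ignores pipeline fill and dependency stalls, so it only illustrates the bookkeeping, not the actual schedules behind Tables 4.6 and 4.7.

```python
# Per-operation latencies (cycles) as described in the text.
INTEGER_ARCH = {'int_alu': 1, 'int_mul': 1, 'fp_add': 7, 'fp_mul': 6}
MIXED_ARCH   = {'int_alu': 1, 'int_mul': 1, 'fp_add': 1, 'fp_mul': 1}

def unit_cycles(op_counts, latencies):
    """Tally cycles for one functional unit's workload (rough throughput model)."""
    return sum(latencies[op] * n for op, n in op_counts.items())

# Hypothetical per-unit workload; the counts are placeholders, not the SRFFT schedule.
workload = {'int_alu': 40, 'fp_add': 30, 'fp_mul': 20}
print(unit_cycles(workload, INTEGER_ARCH))   # integer-only architecture
print(unit_cycles(workload, MIXED_ARCH))     # FPU + integer architecture
```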

Different numbers of clusters are considered and evaluated with the two target architectures. The cycles required to complete the 32-point FFT benchmark for the different numbers of clusters and the two target architectures are listed in the following tables. The leftmost field indicates the number of ALU cluster IPs included in the evaluation. The middle field lists the operation cycles required by each functional unit while executing the benchmark. The rightmost field gives the critical cycles, which dominate the performance for each combination of cluster count and target architecture.

Table 4.6 shows the detailed cycle information for the Original Integer Architecture, i.e. the ALU cluster IP described in the previous section. When only one cluster is used, the dominant functional unit is ALU2 and the critical cycle count is 647 cycles. For two clusters, the dominant functional unit is ALU1 in the second cluster and the critical cycle count is 331 cycles. For four clusters, the dominant functional unit is ALU2 in the second cluster and the critical cycle count is 213 cycles. For eight clusters, the dominant functional unit is ALU1 in the fifth cluster and the critical cycle count is 151 cycles.

Table 4.6 Performance Evaluation Results for Original Integer Architecture

             Function Unit and Cycles                        Critical Cycles
1 Cluster    ALU1: 645  ALU2: 647  MUL1: 412  MUL2: 330      647
2 Clusters   ALU1: 320  ALU2: 319  MUL1: 141  MUL2: 95       320
             ALU1: 331  ALU2: 309  MUL1: 256  MUL2: 240      331
4 Clusters   ALU1: 177  ALU2: 213  MUL1: 55   MUL2: 72       213
             ALU1: 190  ALU2: 153  MUL1: 105  MUL2: 72       190
             ALU1: 173  ALU2: 179  MUL1: 135  MUL2: 112      179
             ALU1: 154  ALU2: 141  MUL1: 189  MUL2: 122      189
8 Clusters   ALU1: 145  ALU2: 96   MUL1: 39   MUL2: 34       145
             ALU1: 96   ALU2: 126  MUL1: 52   MUL2: 34       126
             ALU1: 85   ALU2: 90   MUL1: 54   MUL2: 54       90
             ALU1: 91   ALU2: 60   MUL1: 67   MUL2: 44       91
             ALU1: 151  ALU2: 132  MUL1: 89   MUL2: 84       151
             ALU1: 66   ALU2: 103  MUL1: 92   MUL2: 64       103
             ALU1: 97   ALU2: 90   MUL1: 74   MUL2: 74       97
             ALU1: 72   ALU2: 102  MUL1: 97   MUL2: 64       102

Table 4.7 shows the detailed cycle information for the Floating Point Unit and Original Integer Architecture Mixed, which processes the integer and floating-point numbers separately. When only one cluster is used, the dominant functional unit is ALU2 and the critical cycle count is 157 cycles. For two clusters, the dominant functional unit is ALU2 in both the first and the second cluster and the critical cycle count is 74 cycles. For four clusters, the dominant functional unit is ALU1 in the third cluster and the critical cycle count is 52 cycles. For eight clusters, the dominant functional units are ALU1 in the fifth cluster and ALU2 in the sixth cluster, with a critical cycle count of 42 cycles.

Table 4.7 Performance Evaluation Results for Floating Point Unit and Original Integer Architecture Mixed

Function Unit and Cycles Critical Cycles

As listed and described above, the two target architectures are evaluated with different numbers of clusters. To compare the performance on the selected benchmark, the number of clusters is fixed when comparing the two architectures. First, the performance with one cluster in each architecture is compared and plotted in Fig 4.27. The cycles required to complete the benchmark are 647 and 157 for the original integer architecture and the mixture of the floating point unit and the original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 4.12 times better than the original integer architecture in cycles.

[Chart: cycles required per functional unit (ALU1, ALU2, MUL1, MUL2) for the Original Integer Architecture and the Floating Point Unit and Original Integer Architecture Mixed, one cluster]

Fig 4.27 Performance Evaluation of one cluster included in these architectures

Second, the performance with two clusters in each architecture is compared and plotted in Fig 4.28. The cycles required to complete the benchmark are 331 and 74 for the original integer architecture and the floating point unit and original integer architecture mixed respectively. From these data, the mixed floating point unit and original integer architecture is 4.47 times better than the original integer architecture in cycles.


Fig 4.28 Performance Evaluation of two clusters included in these architectures

Next, the performance with four clusters in each architecture is compared and sketched in Fig 4.29. The critical cycles needed to finish the benchmark are 213 and 52 for the original integer architecture and the floating point unit and original integer mixed architecture respectively. From these data, the mixture of the floating point unit and the original integer architecture is 4.09 times better than the original integer architecture in cycles.


Fig 4.29 Performance Evaluation of four clusters included in these architectures

Finally, the performance with eight clusters in each architecture is compared and sketched in Fig 4.30. The cycles needed to finish the benchmark are 151 and 42 for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 3.60 times better than the original integer architecture in cycles.

Fig 4.30 Performance Evaluation of eight clusters included in these architectures
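As a quick cross-check, the speedups quoted in this section follow directly from the critical cycle counts; the small script below reproduces them.

```python
# Critical cycle counts for the 32-point SRFFT benchmark, taken from the text:
# (Original Integer Architecture, Floating Point Unit and Original Integer
# Architecture Mixed), indexed by the number of ALU cluster IPs.
critical_cycles = {1: (647, 157), 2: (331, 74), 4: (213, 52), 8: (151, 42)}

for clusters, (integer_only, mixed) in critical_cycles.items():
    speedup = integer_only / mixed
    print(f"{clusters} cluster(s): {speedup:.2f}x")
# Prints roughly 4.12x, 4.47x, 4.10x, and 3.60x, closely matching the quoted speedups.
```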