
Chapter 4 Implementation Results and Performance Evaluation

4.3 Circuit Implementation and Results of Floating Point Units for the

This section presents the implementation results of the FPUs described above. The FPUs are implemented as hard macros using a cell-based design flow.

The macros can be integrated with the ALU cluster IP to provide efficient floating-point operations. The synthesis results and the Auto Place and Route (APR) results are discussed. The circuit verification results obtained through post-layout simulation are also listed in this section.

The floating point units are synthesized with Synopsys Design Compiler, and the physical layouts of the FPU macros are completed with Synopsys Astro.

The TSMC 0.18um CMOS technology and the Artisan SAGE-X Standard Cell Library are adopted for implementing these FPUs. As mentioned above, three types of FPUs are designed.

The Type 1 FPU supports floating-point addition, subtraction and multiplication. It operates at a post-layout simulation frequency of 75 MHz. Its gate count and area are 23,298 and 0.415 mm² respectively. Its area utilization and power consumption are 0.9 and 10.85 mW respectively. The physical layout of the Type 1 FPU macro is shown in Fig 4.17.

The Type 2 FPU macro, shown in Fig 4.18, supports floating-point addition, subtraction, multiplication and division. Its post-layout simulation clock rate is 25 MHz. Its gate count and area are 31,331 and 0.529 mm² respectively. Its area utilization, the same as that of the Type 1 FPU, is 0.9. Its power dissipation is 4.60 mW.

The Type 3 FPU supports floating-point division only and is intended to be paired with the Type 1 FPU so that together they provide the same operations as the Type 2 FPU. It is therefore implemented with a post-layout simulation clock rate of 25 MHz, although this is not the highest operating frequency it can achieve. Its gate count and area are 24,931 and 0.396 mm² respectively. Its area utilization, the same as that of the Type 1 and Type 2 FPUs, is 0.9, and its power dissipation is 6.59 mW. The physical layout of the Type 3 FPU macro is shown in Fig 4.19 below. These results are summarized in Table 4.5.

Fig 4.17 Physical Layout of the Type 1 FPU macro

Fig 4.18 Physical Layout of the Type 2 FPU macro

Fig 4.19 Physical Layout of the Type 3 FPU macro

Table 4.5 Summary of the Implementation Results

| Floating Point Unit | Type 1 (ADD, SUB, MUL) | Type 2 (ADD, SUB, MUL, DIV) | Type 3 (DIV only) |
| --- | --- | --- | --- |
| Technology | TSMC 0.18um | TSMC 0.18um | TSMC 0.18um |
| Cell Library | Artisan SAGE-X™ | Artisan SAGE-X™ | Artisan SAGE-X™ |

The circuit verification results are listed below. Post-layout simulation is performed with the TSMC 0.18um CMOS technology and library environment. The post-layout simulation results for the Type 1 FPU are shown in Fig 4.20 and Fig 4.21.

They show the full view and a portion of the whole simulation period respectively.

Similarly, the verification results of the Type 2 FPU are shown in Fig 4.22 and Fig 4.23, which show the full view and a portion of the whole simulation period respectively. Finally, the post-layout simulation results of the Type 3 FPU are illustrated in Fig 4.24 and Fig 4.25, again as a full view and a portion of the whole simulation period. These results indicate that the hard macros work correctly at their respective post-layout simulation clock rates.

Fig 4.20 Full View of Post-Layout Simulation Results for Type 1 FPU

Fig 4.21 Partial View of Post-Layout Simulation Results for Type 1 FPU

Fig 4.22 Full View of Post-Layout Simulation Results for Type 2 FPU

Fig 4.23 Partial View of Post-Layout Simulation Results for Type 2 FPU

Fig 4.24 Full View of Post-Layout Simulation Results for Type 3 FPU

Fig 4.25 Partial View of Post-Layout Simulation Results for Type 3 FPU

4.4 Performance Evaluation and Comparison

In this section, the performance of floating-point operations is evaluated and compared. The benchmark used for the evaluation is the Fast Fourier Transform (FFT), which is commonly used in media processing applications. Two target architectures are used to evaluate the performance of floating-point operations, and their results are compared with each other. The benchmark and the target architectures are discussed in the following subsections.

4.4.1 Selected Benchmark

In this thesis, the 32-point Fast Fourier Transform is selected as the benchmark. The FFT is an efficient algorithm for computing the Fourier Transform. There are three reasons for selecting it. First, the FFT is one of the most frequently used operations in multimedia and signal processing applications. Second, the FFT is used in [18] as an example application executed on the streaming programming model. Third, applications involving the FFT usually need to handle floating-point operations. The FFT is therefore well matched to the performance evaluations presented in this thesis.

The Split-Radix FFT (SRFFT) algorithm is adopted to form the benchmark [31] [32]. An inspection of the decimation-in-frequency flowchart of the FFT shows that the even terms and the odd terms of the Discrete Fourier Transform (DFT) can be computed independently. The radix-2 algorithm is better suited to the even terms, and the radix-4 algorithm is better suited to the odd terms of the DFT. The Split-Radix FFT algorithm exploits this by mixing the radix-2 and radix-4 algorithms in the same FFT, which reduces the number of computations.

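As mentioned above, the FFT is decomposed into even terms and odd terms that are computed independently. The radix-2 decimation-in-frequency FFT used for the even-numbered samples of the N-point DFT is given below, where $W_N = e^{-j2\pi/N}$.

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{nk}, \qquad k = 0, 1, \ldots, \tfrac{N}{2} - 1$$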

The odd-numbered samples $X(2k+1)$ of the DFT require pre-multiplication of the input sequence by $W_N^{n}$. The radix-4 decimation-in-frequency FFT used for the odd-numbered samples of the N-point DFT is given in the two equations below.
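In the standard split-radix formulation, the two odd-term equations can be written as follows. This is the textbook decimation-in-frequency form with $W_N = e^{-j2\pi/N}$; it is assumed, not confirmed, to match the notation of [31] [32].

```latex
\begin{align}
X(4k+1) &= \sum_{n=0}^{N/4-1}
  \Big\{ \big[x(n) - x(n+\tfrac{N}{2})\big]
       - j\,\big[x(n+\tfrac{N}{4}) - x(n+\tfrac{3N}{4})\big] \Big\}
  \, W_N^{n}\, W_{N/4}^{nk} \\
X(4k+3) &= \sum_{n=0}^{N/4-1}
  \Big\{ \big[x(n) - x(n+\tfrac{N}{2})\big]
       + j\,\big[x(n+\tfrac{N}{4}) - x(n+\tfrac{3N}{4})\big] \Big\}
  \, W_N^{3n}\, W_{N/4}^{nk}
\end{align}
% valid for k = 0, 1, ..., N/4 - 1
```

Both follow from splitting the DFT sum into four quarter-length blocks and using $W_N^{N/4} = -j$ and $W_N^{N/2} = -1$.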

The length-N DFT is then obtained with the length-N Split-Radix FFT algorithm.

The benchmark adopted is the length-32 Split-Radix FFT, and its flowchart is shown in Fig 4.26.

Fig 4.26 Flowchart of the length 32 Split-Radix FFT algorithm
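The decomposition described above can be sketched in a few lines of Python. This is only an illustrative recursive version, not the scheduled benchmark code of the thesis; the function names `split_radix_fft` and `naive_dft` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def split_radix_fft(x):
    """Recursive decimation-in-frequency split-radix FFT (sketch).

    Assumes len(x) is a power of two. Returns the DFT in natural order.
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]

    half, quarter = n // 2, n // 4

    # Even outputs X(2k): radix-2 step, a length-N/2 DFT of x(n) + x(n + N/2).
    even = [x[i] + x[i + half] for i in range(half)]

    # Odd outputs X(4k+1) and X(4k+3): radix-4 step with twiddle
    # pre-multiplication by W_N^n and W_N^(3n).
    odd1, odd3 = [], []
    for i in range(quarter):
        a = x[i] - x[i + half]
        b = 1j * (x[i + quarter] - x[i + 3 * quarter])
        w1 = cmath.exp(-2j * cmath.pi * i / n)
        w3 = cmath.exp(-2j * cmath.pi * 3 * i / n)
        odd1.append((a - b) * w1)
        odd3.append((a + b) * w3)

    out = [0j] * n
    out[0::2] = split_radix_fft(even)
    out[1::4] = split_radix_fft(odd1)
    out[3::4] = split_radix_fft(odd3)
    return out


def naive_dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


# Arbitrary 32-point test input (illustrative values only).
samples = [complex(i % 5, -(i % 3)) for i in range(32)]
max_error = max(
    abs(a - b) for a, b in zip(split_radix_fft(samples), naive_dft(samples))
)
```

Comparing against the direct DFT on a 32-point input gives a quick check that the even and odd branches are recombined at the right output indices.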

4.4.2 Evaluation and Comparison Results

Recalling the development roadmap discussed in the first part of Chapter 3, the ALU cluster IPs will be combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. As shown in Fig 3.4, different numbers of ALU cluster IPs will be stacked on the board. The numbers of ALU cluster IPs are chosen according to the simulator results: one, two, four and eight ALU cluster IPs are integrated with the baseboard in turn. As introduced above, two target architectures are used to evaluate the performance, and each of them is considered for every number of integrated ALU cluster IPs.

The two target architectures used in this thesis are the Original Integer Architecture and the Floating Point Unit and Original Integer Architecture Mixed. The original integer architecture is the architecture of the ALU cluster IP mentioned in Section 3.3.2, which is essentially an architecture for integer operations. For this architecture, the floating-point operations of the benchmark are decomposed: all operands are represented in the IEEE 754 floating-point format, split into several integer fields, and the fields are computed separately. In other words, the floating-point operations are executed on the integer architecture by decomposing the format to fit the integer data path. The second evaluation architecture is the mixed floating point unit and original integer architecture. The SRFFT benchmark includes both integer operations and floating-point operations. On the second architecture, the integer operations are computed by the integer units and the floating-point operations by the floating point unit.
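As a rough illustration of what decomposing an IEEE 754 operand into integer fields means, the sketch below splits a value into sign, exponent and fraction and multiplies two values using integer operations only. Single precision is an assumption here (the text does not state the precision), all function names are hypothetical, and the multiply handles only normal numbers with truncation; zeros, subnormals, infinities, NaN, overflow and rounding are deliberately ignored. It is not the instruction sequence used on the ALU cluster IP.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision (assumed format)


def decompose(value):
    """Split a float32 into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def compose(sign, exponent, fraction):
    """Rebuild a float32 from its three integer fields."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def multiply_with_integer_ops(a, b):
    """Multiply two normal float32 values using only integer arithmetic."""
    sign_a, exp_a, frac_a = decompose(a)
    sign_b, exp_b, frac_b = decompose(b)

    sign = sign_a ^ sign_b            # sign field: XOR
    exponent = exp_a + exp_b - BIAS   # exponent field: integer add

    # Significand field: integer multiply of the 24-bit values
    # (implicit leading one restored), giving a 47- or 48-bit product.
    product = ((1 << 23) | frac_a) * ((1 << 23) | frac_b)
    if product >> 47:                 # product in [2, 4): renormalize
        product >>= 24
        exponent += 1
    else:                             # product in [1, 2)
        product >>= 23
    return compose(sign, exponent, product & 0x7FFFFF)
```

Each field is handled by a separate integer step, which is why a single floating-point operation costs several cycles on an integer-only data path.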

Before the evaluation results are presented, the cycles required to execute floating-point operations on the original integer architecture, that is, the ALU cluster IP, are determined. A floating-point multiplication needs six cycles to complete, and a floating-point addition in the ALU unit needs seven cycles. All other integer operations on this architecture need one cycle. Because the functional units are pipelined, the first data item entering a functional unit needs additional cycles to complete: the integer ALU operations of the ALU cluster IP are two-stage pipelined, and the integer multiplication operations are four-stage pipelined. On the floating point unit architecture, a floating-point addition needs one cycle and a floating-point multiplication needs one cycle. In the performance evaluation of the floating point unit architecture, all data are represented in the IEEE 754 format and computed directly in this architecture.
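The per-operation costs above can be collected into a small model. The sketch below is only a back-of-the-envelope estimate: it ignores instruction scheduling, NOPs and the parallelism between functional units, so it does not reproduce the cycle counts reported later. The operation mix `example_mix` and both function names are hypothetical.

```python
# Cycles per operation as stated in the text.
CYCLES_PER_OP = {
    "integer": {"fp_add": 7, "fp_mul": 6, "int_op": 1},
    "mixed": {"fp_add": 1, "fp_mul": 1, "int_op": 1},
}


def issue_cycles(op_counts, architecture):
    """Total cycles if the operations were issued one after another."""
    costs = CYCLES_PER_OP[architecture]
    return sum(costs[op] * count for op, count in op_counts.items())


def pipelined_cycles(num_ops, stages):
    """Ideal pipeline: first result after `stages` cycles, then one per cycle."""
    if num_ops == 0:
        return 0
    return stages + num_ops - 1


# Hypothetical operation mix, for illustration only (not taken from the SRFFT).
example_mix = {"fp_add": 10, "fp_mul": 4, "int_op": 6}

integer_estimate = issue_cycles(example_mix, "integer")  # 70 + 24 + 6
mixed_estimate = issue_cycles(example_mix, "mixed")      # 10 + 4 + 6

alu_fill = pipelined_cycles(8, stages=2)  # two-stage integer ALU
mul_fill = pipelined_cycles(8, stages=4)  # four-stage integer multiplier
```

Under these assumptions the gain depends on the share of floating-point operations in the mix, since integer operations cost one cycle on both architectures.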

Different numbers of clusters are considered and evaluated with the two target architectures. The cycles required to complete the 32-point FFT benchmark for each number of clusters and each target architecture are listed in the following tables. The leftmost field gives the number of ALU cluster IPs included in the evaluation. The middle field gives the operation cycles needed by each functional unit while executing the benchmark. The rightmost field gives the critical cycles; the underlined value dominates the performance for the given number of clusters and target architecture.

Table 4.6 shows the detailed cycle information for the Original Integer Architecture, that is, the ALU cluster IP mentioned in the previous section. When only one cluster is used, the dominant functional unit is ALU2 and the dominant cycle count is 647. Likewise, for two clusters, the dominant functional unit is ALU1 in the second cluster and the dominant cycle count is 331. For four clusters, the dominant functional unit is ALU2 in the second cluster and the dominant cycle count is 213. For eight clusters, the dominant functional unit is ALU1 of the fifth cluster and the dominant cycle count is 151.

Table 4.6 Performance Evaluation Results for Original Integer Architecture

| Clusters | Function Unit and Cycles | Critical Cycles |
| --- | --- | --- |
| 1 Cluster | ALU1:645 ALU2:647 MUL1:412 MUL2:330 | **647** |
| 2 Clusters | ALU1:320 ALU2:319 MUL1:141 MUL2:95 | 320 |
| | ALU1:331 ALU2:309 MUL1:256 MUL2:240 | **331** |
| 4 Clusters | ALU1:177 ALU2:213 MUL1:55 MUL2:72 | **213** |
| | ALU1:190 ALU2:153 MUL1:105 MUL2:72 | 190 |
| | ALU1:173 ALU2:179 MUL1:135 MUL2:112 | 179 |
| | ALU1:154 ALU2:141 MUL1:189 MUL2:122 | 189 |
| 8 Clusters | ALU1:145 ALU2:96 MUL1:39 MUL2:34 | 145 |
| | ALU1:96 ALU2:126 MUL1:52 MUL2:34 | 126 |
| | ALU1:85 ALU2:90 MUL1:54 MUL2:54 | 90 |
| | ALU1:91 ALU2:60 MUL1:67 MUL2:44 | 91 |
| | ALU1:151 ALU2:132 MUL1:89 MUL2:84 | **151** |
| | ALU1:66 ALU2:103 MUL1:92 MUL2:64 | 103 |
| | ALU1:97 ALU2:90 MUL1:74 MUL2:74 | 97 |
| | ALU1:72 ALU2:102 MUL1:97 MUL2:64 | 102 |

Values in bold (underlined in the original table) are the dominant critical cycles of each configuration.
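The dominant cycle count of each configuration is simply the largest per-unit count over all of its clusters. A minimal sketch of that selection, using the values of Table 4.6 (the names `TABLE_4_6` and `critical_cycles` are hypothetical):

```python
UNITS = ("ALU1", "ALU2", "MUL1", "MUL2")

# Per-cluster cycle counts from Table 4.6, keyed by number of clusters.
# Each tuple is one cluster row in (ALU1, ALU2, MUL1, MUL2) order.
TABLE_4_6 = {
    1: [(645, 647, 412, 330)],
    2: [(320, 319, 141, 95), (331, 309, 256, 240)],
    4: [
        (177, 213, 55, 72),
        (190, 153, 105, 72),
        (173, 179, 135, 112),
        (154, 141, 189, 122),
    ],
    8: [
        (145, 96, 39, 34),
        (96, 126, 52, 34),
        (85, 90, 54, 54),
        (91, 60, 67, 44),
        (151, 132, 89, 84),
        (66, 103, 92, 64),
        (97, 90, 74, 74),
        (72, 102, 97, 64),
    ],
}


def critical_cycles(rows):
    """Return (cycles, unit) of the slowest functional unit over all rows."""
    return max(
        (cycles, unit) for row in rows for unit, cycles in zip(UNITS, row)
    )


dominant = {n: critical_cycles(rows) for n, rows in TABLE_4_6.items()}
```

The benchmark finishes only when the slowest functional unit finishes, so this maximum is the figure carried into the comparisons that follow.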

Table 4.7 shows the detailed cycle information for the Floating Point Unit and Original Integer Architecture Mixed, which processes integer numbers and floating-point numbers separately. When only one cluster is used, the dominant functional unit is ALU2 and the dominant cycle count is 157. Likewise, for two clusters, the dominant functional unit is ALU2 in both the first and the second cluster and the dominant cycle count is 74. For four clusters, the dominant functional unit is ALU1 in the third cluster and the dominant cycle count is 52. For eight clusters, the dominant functional units are ALU1 of the fifth cluster and ALU2 of the sixth cluster, and the dominant cycle count is 42.

Table 4.7 Performance Evaluation Results for Floating Point Unit and Original Integer Architecture Mixed

Function Unit and Cycles Critical Cycles

As listed and described above, the two target architectures are evaluated with different numbers of clusters. To compare their performance on the selected benchmark, the number of clusters is fixed and the two architectures are compared with each other.

First, the performance of the architectures with one cluster is compared and plotted in Fig 4.27. The cycles needed to complete the benchmark are 647 and 157 for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 4.12 times faster than the original integer architecture in terms of cycles.

[Chart: One Cluster Used for Different Target Architectures. Axis: Cycles. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Series: ALU1, ALU2, MUL1, MUL2.]

Fig 4.27 Performance Evaluation of one cluster included in these architectures

Second, the performance of the architectures with two clusters is compared and plotted in Fig 4.28. The cycles needed to complete the benchmark are 331 and 74 for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 4.47 times faster than the original integer architecture in terms of cycles.

[Chart: Two Clusters Used for Different Target Architectures. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.28 Performance Evaluation of two clusters included in these architectures

Next, the performance of the architectures with four clusters is compared and plotted in Fig 4.29. The critical cycles needed to finish the benchmark are 213 and 52 for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is about 4.10 times faster than the original integer architecture in terms of cycles.

[Chart: Four Clusters Used for Different Target Architectures. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.29 Performance Evaluation of four clusters included in these architectures

Finally, the performance of the architectures with eight clusters is compared and plotted in Fig 4.30. The cycles needed to finish the benchmark are 151 and 42 for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 3.60 times faster than the original integer architecture in terms of cycles.

[Chart: Eight Clusters Used for Different Target Architectures. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.30 Performance Evaluation of eight clusters included in these architectures
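The cycle-based improvement factors quoted above are the ratios of the critical cycle counts of the two architectures. A short sketch that recomputes them from the figures in the text (the names `CRITICAL_CYCLES` and `cycle_speedup` are hypothetical):

```python
# Critical cycles (original integer architecture, mixed architecture)
# for each number of clusters, as quoted in the text.
CRITICAL_CYCLES = {
    1: (647, 157),
    2: (331, 74),
    4: (213, 52),
    8: (151, 42),
}


def cycle_speedup(integer_cycles, mixed_cycles):
    """Improvement factor of the mixed architecture, measured in cycles."""
    return integer_cycles / mixed_cycles


speedups = {
    clusters: cycle_speedup(*pair) for clusters, pair in CRITICAL_CYCLES.items()
}

for clusters, factor in sorted(speedups.items()):
    print(f"{clusters} cluster(s): {factor:.2f}x")
```

The improvement factor stays near four for one, two and four clusters and drops for eight clusters.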

Because of the hardware implementation, these architectures may operate at different clock rates. As described in the previous sections, the original integer architecture, that is, the ALU cluster IP, operates at a clock frequency of 100 MHz in post-layout simulation. The floating point unit adopted in the mixed floating point unit and original integer architecture is designed and implemented with a 75 MHz clock rate in post-layout simulation. The execution times of these architectures are therefore compared as well. Table 4.8 lists the critical execution time of the benchmark for the two target architectures with different numbers of clusters. The boldface and underlined data represent the critical execution time needed to complete the benchmark. Because of the scheduling of the instructions in the SRFFT benchmark, the critical functional units in these architectures differ slightly. When the instructions are scheduled, the cycles that execute no-operation (NOP) instructions also influence the total execution time. The results are compared with each other below to show that the trend in execution time is similar to the trend in clock cycles.

Table 4.8 Performance Evaluation Results in Execution Time

| Execution Time (ns) | Original Integer Architecture | Floating Point Unit and Original Integer Architecture Mixed |
| --- | --- | --- |
| 1 Cluster | 6470 | 2035.3 |
| | 3200 | 957.8 |

As listed in Table 4.8 and plotted in Fig 4.31 below, the performance of the architectures with one cluster is compared first. The execution times needed to complete the benchmark are 6470 ns and 2035.3 ns for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 3.18 times faster than the original integer architecture.

[Chart: One Cluster Used for Different Target Architectures. Axis: Execution Time (ns), 0 to 7000. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Series: ALU1, ALU2, MUL1, MUL2.]

Fig 4.31 Performance Evaluation of one cluster included in execution time

Second, the performance of the architectures with two clusters is compared in Table 4.8 and plotted in Fig 4.32. The execution times needed to complete the benchmark are 3310 ns and 957.8 ns for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is 3.46 times faster than the original integer architecture.

[Chart: Two Clusters Used for Different Target Architectures. Axis: Execution Time (ns), 0 to 3500. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Series: Cluster1_ALU1, Cluster1_ALU2, Cluster1_MUL1, Cluster1_MUL2, Cluster2_ALU1, Cluster2_ALU2, Cluster2_MUL1, Cluster2_MUL2.]

Fig 4.32 Performance Evaluation of two clusters included in execution time

Third, the performance of the architectures with four clusters is compared in Table 4.8 and plotted in Fig 4.33. The critical times needed to finish the benchmark are 2130 ns and 628.9 ns for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is about 3.39 times faster than the original integer architecture in execution time.

[Chart: Four Clusters Used for Different Target Architectures. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.33 Performance Evaluation of four clusters included in execution time

Finally, the performance of the architectures with eight clusters is compared and plotted in Fig 4.34. The times needed to complete the benchmark are 1510 ns and 545.4 ns for the original integer architecture and the mixed floating point unit and original integer architecture respectively. From these data, the mixed floating point unit and original integer architecture is about 2.77 times faster than the original integer architecture.
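The execution-time figures can be checked in the same way. In the sketch below, the times of the original integer architecture are recomputed from its critical cycles at the 100 MHz clock stated in the text, while the times of the mixed architecture are taken directly from the text rather than derived, since they depend on the instruction scheduling. All names are hypothetical.

```python
INTEGER_CLOCK_MHZ = 100  # post-layout clock of the ALU cluster IP (from the text)

# Critical cycles of the original integer architecture per number of clusters.
INTEGER_CRITICAL_CYCLES = {1: 647, 2: 331, 4: 213, 8: 151}

# Critical execution times (ns) of the mixed architecture, as quoted in the text.
MIXED_TIME_NS = {1: 2035.3, 2: 957.8, 4: 628.9, 8: 545.4}


def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds for a given clock in MHz."""
    return cycles * 1000.0 / clock_mhz


integer_time_ns = {
    clusters: cycles_to_ns(cycles, INTEGER_CLOCK_MHZ)
    for clusters, cycles in INTEGER_CRITICAL_CYCLES.items()
}

time_speedup = {
    clusters: integer_time_ns[clusters] / MIXED_TIME_NS[clusters]
    for clusters in MIXED_TIME_NS
}

for clusters in sorted(time_speedup):
    print(
        f"{clusters} cluster(s): {integer_time_ns[clusters]:.0f} ns vs "
        f"{MIXED_TIME_NS[clusters]} ns, {time_speedup[clusters]:.2f}x"
    )
```

The factors are lower than the cycle-based ones because the floating point unit runs at a slower clock, but they follow the same trend across cluster counts.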

[Chart: Eight Clusters Used for Different Target Architectures. Categories: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.34 Performance Evaluation of eight clusters included in execution time

This paragraph summarizes the results above. The performance evaluations in terms of execution cycles and execution time for the two target architectures with different numbers of clusters are normalized to the original integer architecture, that is, the architecture of the ALU cluster IP mentioned in the previous section, and the resulting performance enhancements are listed. As shown
