Selected Benchmark - Performance Evaluation and Comparison

Chapter 4 Implementation Results and Performance Evaluation

4.4 Performance Evaluation and Comparison

4.4.1 Selected Benchmark

In this thesis, the 32-point Fast Fourier Transform (FFT) is selected as the benchmark. The FFT is an efficient algorithm for computing the discrete Fourier transform. It is selected for three reasons. First, the FFT is one of the most frequently used operations in multimedia and signal processing applications. Second, it is used in paper [18] as an example of an application executing on the streaming programming model. Third, applications involving the FFT usually need to handle floating point operations, which makes it well matched to the performance evaluations presented in this thesis.

The Split-Radix FFT (SRFFT) algorithm is adopted to form the benchmark [31][32]. An inspection of the decimation-in-frequency flowchart of the FFT shows that the even terms and the odd terms of the Discrete Fourier Transform (DFT) can be computed independently. The radix-2 algorithm is better suited to the even terms and the radix-4 algorithm is better suited to the odd terms of the DFT. The SRFFT algorithm exploits this by mixing the radix-2 and radix-4 algorithms within the same FFT algorithm, which reduces the number of computations.

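As mentioned above, the FFT algorithm is decomposed into even terms and odd terms that are computed independently. The radix-2 decimation-in-frequency FFT algorithm used for the even-numbered samples of the N-point DFT is given below, where $W_N = e^{-j2\pi/N}$.

\[ X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{nk} \]

The odd-numbered samples X(2k+1) of the DFT require pre-multiplication of the input sequence by the twiddle factor $W_N^{n}$. The radix-4 decimation-in-frequency FFT algorithm used for the odd-numbered samples of the N-point DFT is given in the two equations below.

\[ X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\left(n + \tfrac{N}{2}\right) \right] - j \left[ x\left(n + \tfrac{N}{4}\right) - x\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{n} W_{N/4}^{nk} \]

\[ X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x\left(n + \tfrac{N}{2}\right) \right] + j \left[ x\left(n + \tfrac{N}{4}\right) - x\left(n + \tfrac{3N}{4}\right) \right] \right\} W_N^{3n} W_{N/4}^{nk} \]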

The length-N DFT is obtained by using the N-point Split-Radix FFT algorithm.
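The decomposition described above can be sketched in a few lines of Python. This is a minimal recursive illustration, not the scheduled implementation evaluated in this thesis. It assumes the standard split-radix decimation-in-frequency form (even outputs from one half-length DFT, outputs 4k+1 and 4k+3 from two quarter-length DFTs) and a power-of-two length. The names `split_radix_fft` and `direct_dft` are hypothetical.

```python
import cmath


def direct_dft(x):
    """Reference O(N^2) DFT, used only to check the fast version."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def split_radix_fft(x):
    """Recursive split-radix DIF FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    if n_points == 2:
        return [x[0] + x[1], x[0] - x[1]]

    half, quarter = n_points // 2, n_points // 4

    # Even outputs X(2k): radix-2 step, one DFT of length N/2.
    even = split_radix_fft([x[n] + x[n + half] for n in range(half)])

    # Odd outputs X(4k+1), X(4k+3): radix-4 step, two DFTs of length N/4,
    # each preceded by a twiddle-factor multiplication.
    u, v = [], []
    for n in range(quarter):
        a = x[n] - x[n + half]
        b = x[n + quarter] - x[n + 3 * quarter]
        w1 = cmath.exp(-2j * cmath.pi * n / n_points)      # W_N^n
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / n_points)  # W_N^{3n}
        u.append((a - 1j * b) * w1)
        v.append((a + 1j * b) * w3)

    out = [0j] * n_points
    out[0::2] = even
    out[1::4] = split_radix_fft(u)
    out[3::4] = split_radix_fft(v)
    return out


# Arbitrary 32-point test signal (illustrative data only).
signal = [complex(n % 5, (3 * n) % 7) for n in range(32)]
fast = split_radix_fft(signal)
slow = direct_dft(signal)
max_error = max(abs(f - s) for f, s in zip(fast, slow))
```

Comparing against the direct DFT on a 32-point input is a simple way to confirm that the even and odd branches are recombined into the correct output positions.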

The benchmark adopted is the length 32 Split-Radix FFT and its flowchart is shown in Fig 4.26.

Fig 4.26 Flowchart of the length 32 Split-Radix FFT algorithm

4.4.2 Evaluation and Comparison Results

Recalling the development roadmap discussed in the first part of Chapter 3, the ALU cluster IPs are combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. As shown in Fig 3.4, different numbers of ALU cluster IPs are stacked on the board. The numbers of ALU cluster IPs are decided from the simulator results: one, two, four and eight ALU cluster IPs are integrated with the baseboard in turn. As introduced above, there are two target architectures for the performance evaluation, and each of them is considered for every number of integrated ALU cluster IPs.

The two target architectures used in this thesis are the Original Integer Architecture and the Floating Point Unit and Original Integer Architecture Mixed. The original integer architecture is the architecture of the ALU cluster IP described in Section 3.3.2, which is essentially an architecture for integer operations. To run the benchmark on it, the floating point operations are decomposed: every operand is represented in the IEEE 754 floating point format, split into several integer fields, and each field is computed separately. In other words, the floating point operations are executed on the integer architecture by decomposing the format to fit the original integer form. The second evaluation architecture is the floating point unit and original integer architecture mixed. The SRFFT benchmark includes both integer operations and floating point operations. In the second architecture, they are computed by the integer units and the floating point unit respectively.
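The field decomposition mentioned above can be illustrated with a short sketch. It assumes the single-precision (32-bit) IEEE 754 layout of 1 sign bit, 8 exponent bits and 23 fraction bits. The thesis text does not state which precision is used, so the precision and the helper names `decompose` and `recompose` are assumptions made for illustration.

```python
import struct


def decompose(value):
    """Split a float into (sign, biased exponent, fraction) integer fields,
    assuming the IEEE 754 single-precision layout."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored fraction bits
    return sign, exponent, fraction


def recompose(sign, exponent, fraction):
    """Pack the three integer fields back into a Python float."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# -6.5 = -1.625 * 2**2, so the biased exponent is 2 + 127 = 129
# and the stored fraction is 0.625 * 2**23 = 0x500000.
fields = decompose(-6.5)
restored = recompose(*fields)
```

Once the operand is held as three integers, an integer-only datapath can handle the sign, exponent and fraction in separate steps, which is why a single floating point operation costs several cycles on such an architecture.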

Before the evaluation results are presented, the cycles required to execute floating point operations on the original integer architecture, that is, the ALU cluster IP, are calculated. A floating point multiplication needs six cycles to complete, and a floating point addition in the ALU unit needs seven cycles. Apart from these floating point operations, every other integer operation of this architecture needs one cycle. Because the functional units are pipelined, the first data fed into a functional unit needs additional cycles: the integer ALU of the ALU cluster IP is two-stage pipelined and the integer multiplier of the ALU cluster IP is four-stage pipelined. In the floating point unit architecture, a floating point addition needs one cycle and a floating point multiplication needs one cycle. For the performance evaluation of the floating point unit architecture, all data are represented in the IEEE 754 format and computed in that architecture.
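The effect of the pipeline fill on the first data item can be sketched with a simple textbook model. It assumes an ideal pipeline that accepts one independent operation per cycle with no stalls, which is an assumption for illustration and not the scheduling model used to produce the tables in this section. The function name `pipelined_cycles` is hypothetical.

```python
def pipelined_cycles(num_ops, stages):
    """Cycles to finish num_ops back-to-back independent operations on an
    ideal pipeline: the first result needs `stages` cycles, and every
    further result needs one more cycle."""
    if num_ops <= 0:
        return 0
    return stages + (num_ops - 1)


ALU_STAGES = 2   # integer ALU is two-stage pipelined (from the text)
MUL_STAGES = 4   # integer multiplier is four-stage pipelined (from the text)

first_alu_result = pipelined_cycles(1, ALU_STAGES)
ten_multiplies = pipelined_cycles(10, MUL_STAGES)
```

Under this model the fill cost is paid once per burst of operations, so it matters most for short sequences.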

Different numbers of clusters are evaluated with the two target architectures. The cycles required to complete the 32-point FFT benchmark for each number of clusters and each target architecture are listed in the following tables. The leftmost field gives the number of ALU cluster IPs included in the evaluation. The middle field gives the operation cycles needed by each functional unit while executing the benchmark. The rightmost field gives the critical cycles. The underlined value dominates the performance for that number of clusters and target architecture.

Table 4.6 shows the detailed cycle counts for the Original Integer Architecture, that is, the ALU cluster IP described in the earlier section. When only one cluster is used, the dominant functional unit is ALU2 and the dominant cycle count is 647. With two clusters, the dominant functional unit is ALU1 in the second cluster and the dominant cycle count is 331. With four clusters, the dominant functional unit is ALU2 in the second cluster and the dominant cycle count is 213. With eight clusters, the dominant functional unit is ALU1 of the fifth cluster and the dominant cycle count is 151.

Table 4.6 Performance Evaluation Results for Original Integer Architecture

Configuration   Function Unit and Cycles                    Critical Cycles
1 Cluster       ALU1:645  ALU2:647  MUL1:412  MUL2:330      647 *
2 Clusters      ALU1:320  ALU2:319  MUL1:141  MUL2:95       320
                ALU1:331  ALU2:309  MUL1:256  MUL2:240      331 *
4 Clusters      ALU1:177  ALU2:213  MUL1:55   MUL2:72       213 *
                ALU1:190  ALU2:153  MUL1:105  MUL2:72       190
                ALU1:173  ALU2:179  MUL1:135  MUL2:112      179
                ALU1:154  ALU2:141  MUL1:189  MUL2:122      189
8 Clusters      ALU1:145  ALU2:96   MUL1:39   MUL2:34       145
                ALU1:96   ALU2:126  MUL1:52   MUL2:34       126
                ALU1:85   ALU2:90   MUL1:54   MUL2:54       90
                ALU1:91   ALU2:60   MUL1:67   MUL2:44       91
                ALU1:151  ALU2:132  MUL1:89   MUL2:84       151 *
                ALU1:66   ALU2:103  MUL1:92   MUL2:64       103
                ALU1:97   ALU2:90   MUL1:74   MUL2:74       97
                ALU1:72   ALU2:102  MUL1:97   MUL2:64       102

* Dominant (underlined) critical cycles for the configuration.
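How the critical cycles follow from the per-unit cycle counts can be sketched as a maximum over functional units and clusters. The data below is transcribed from Table 4.6, with rows kept in the order they are listed. The names `TABLE_4_6` and `dominant` are hypothetical.

```python
UNITS = ("ALU1", "ALU2", "MUL1", "MUL2")

# Per-cluster cycle counts (ALU1, ALU2, MUL1, MUL2) from Table 4.6,
# keyed by the number of clusters in the configuration.
TABLE_4_6 = {
    1: [(645, 647, 412, 330)],
    2: [(320, 319, 141, 95), (331, 309, 256, 240)],
    4: [(177, 213, 55, 72), (190, 153, 105, 72),
        (173, 179, 135, 112), (154, 141, 189, 122)],
    8: [(145, 96, 39, 34), (96, 126, 52, 34), (85, 90, 54, 54),
        (91, 60, 67, 44), (151, 132, 89, 84), (66, 103, 92, 64),
        (97, 90, 74, 74), (72, 102, 97, 64)],
}


def dominant(rows):
    """Return (cycles, unit name) of the slowest functional unit
    across all clusters of one configuration."""
    best_cycles, best_unit = 0, None
    for row in rows:
        for unit, cycles in zip(UNITS, row):
            if cycles > best_cycles:
                best_cycles, best_unit = cycles, unit
    return best_cycles, best_unit


critical = {n: dominant(rows) for n, rows in TABLE_4_6.items()}
```

The benchmark finishes only when the slowest functional unit finishes, so the largest entry in each configuration sets the cycle count used in the comparisons.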

Table 4.7 shows the detailed cycle counts for the Floating Point Unit and Original Integer Architecture Mixed, which processes integer numbers and floating point numbers separately. When only one cluster is used, the dominant functional unit is ALU2 and the dominant cycle count is 157. With two clusters, the dominant functional unit is ALU2 in both the first and the second cluster and the dominant cycle count is 74. With four clusters, the dominant functional unit is ALU1 in the third cluster and the dominant cycle count is 52. With eight clusters, the dominant functional units are ALU1 of the fifth cluster and ALU2 of the sixth cluster, and the dominant cycle count is 42.

Table 4.7 Performance Evaluation Results for Floating Point Unit and Original Integer Architecture Mixed

Configuration   Function Unit and Cycles   Critical Cycles

As listed and described above, the two target architectures are evaluated with different numbers of clusters. The performance of the two architectures is now compared based on the data listed above.

To compare the performance on the selected benchmark, the number of clusters is held fixed across the architectures. First, the performance with one cluster is compared and plotted in Fig 4.27. The benchmark takes 647 cycles on the original integer architecture and 157 cycles on the mixed floating point unit and original integer architecture. In terms of cycles, the mixed architecture therefore performs 4.12 times better than the original integer architecture.

[Figure: One Cluster Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Axis: Cycles. Series: ALU1, ALU2, MUL1, MUL2.]

Fig 4.27 Performance Evaluation of one cluster included in these architectures

Second, the performance with two clusters is compared and plotted in Fig 4.28. The benchmark takes 331 cycles on the original integer architecture and 74 cycles on the mixed floating point unit and original integer architecture. In terms of cycles, the mixed architecture therefore performs 4.47 times better than the original integer architecture.

[Figure: Two Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.28 Performance Evaluation of two clusters included in these architectures

Third, the performance with four clusters is compared and plotted in Fig 4.29. The benchmark takes 213 cycles on the original integer architecture and 52 cycles on the mixed floating point unit and original integer architecture. In terms of cycles, the mixed architecture therefore performs about 4.10 times (213/52) better than the original integer architecture.

[Figure: Four Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.29 Performance Evaluation of four clusters included in these architectures

Finally, the performance with eight clusters is compared and plotted in Fig 4.30. The benchmark takes 151 cycles on the original integer architecture and 42 cycles on the mixed floating point unit and original integer architecture. In terms of cycles, the mixed architecture therefore performs 3.60 times better than the original integer architecture.

[Figure: Eight Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.30 Performance Evaluation of eight clusters included in these architectures

Because of the hardware implementation, these architectures may operate at different clock rates. As described in the previous sections, the original integer architecture, that is, the ALU cluster IP, operates at a clock frequency of 100 MHz in post-layout simulation. The floating point unit adopted in the mixed floating point unit and original integer architecture is designed and implemented with a 75 MHz clock rate in post-layout simulation. The execution time of the two architectures is therefore compared as well. Table 4.8 lists the critical execution time of the benchmark for the two target architectures with different numbers of clusters. The boldface and underlined values are the critical execution times needed to complete the benchmark. As a result of scheduling the instructions of the operations in the SRFFT benchmark, the critical functional units in these architectures differ slightly. When the instructions are scheduled, the cycles that execute no-operation (NOP) instructions also influence the total execution time in this situation. The results are compared with each other later to demonstrate that the trend of the performance evaluation in execution time is similar to that in clock cycles.
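For the original integer architecture, the conversion from critical cycles to execution time is a direct multiplication by the clock period. A minimal sketch under that assumption is given below. The helper name `cycles_to_ns` is hypothetical. The mixed architecture is deliberately left out, because its reported times depend on the instruction scheduling described above and are taken from Table 4.8 as reported.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Execution time in nanoseconds for a cycle count at a given clock
    frequency; one cycle lasts 1000 / clock_mhz nanoseconds."""
    return cycles * 1000.0 / clock_mhz


INTEGER_CLOCK_MHZ = 100  # post-layout clock of the ALU cluster IP (from the text)

# Critical cycles of the original integer architecture (Table 4.6).
INTEGER_CYCLES = {1: 647, 2: 331, 4: 213, 8: 151}

integer_time_ns = {
    n: cycles_to_ns(c, INTEGER_CLOCK_MHZ) for n, c in INTEGER_CYCLES.items()
}
```

At 100 MHz each cycle lasts 10 ns, so the 647 critical cycles of the one-cluster configuration correspond to the 6470 ns quoted for the original integer architecture.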

Table 4.8 Performance Evaluation Results in Execution Time

Execution Time (ns)   Original Integer Architecture   Floating Point Unit and Original Integer Architecture Mixed
1 Cluster             6470                            2035.3
                      3200                            957.8

As listed in Table 4.8 and shown in detail in Fig 4.31, the performance with one cluster is compared between the two architectures. The execution time needed to complete the benchmark is 6470 ns on the original integer architecture and 2035.3 ns on the mixed floating point unit and original integer architecture. The mixed architecture therefore performs 3.18 times better than the original integer architecture.

[Figure: One Cluster Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Axis: Execution Time (ns), 0 to 7000. Series: ALU1, ALU2, MUL1, MUL2.]

Fig 4.31 Performance Evaluation of one cluster included in execution time

Second, the performance with two clusters is compared in detail in Fig 4.32 and Table 4.8. The execution time needed to complete the benchmark is 3310 ns on the original integer architecture and 957.8 ns on the mixed floating point unit and original integer architecture. The mixed architecture therefore performs 3.46 times better than the original integer architecture.

[Figure: Two Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed. Axis: Execution Time (ns), 0 to 3500. Series: Cluster1_ALU1, Cluster1_ALU2, Cluster1_MUL1, Cluster1_MUL2, Cluster2_ALU1, Cluster2_ALU2, Cluster2_MUL1, Cluster2_MUL2.]

Fig 4.32 Performance Evaluation of two clusters included in execution time

Third, the performance with four clusters is compared in detail in Fig 4.33 and Table 4.8. The critical time needed to finish the benchmark is 2130 ns on the original integer architecture and 628.9 ns on the mixed floating point unit and original integer architecture. In terms of execution time, the mixed architecture therefore performs about 3.39 times better than the original integer architecture.

[Figure: Four Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.33 Performance Evaluation of four clusters included in execution time

Finally, the performance with eight clusters is compared and plotted in Fig 4.34. The execution time needed to complete the benchmark is 1510 ns on the original integer architecture and 545.4 ns on the mixed floating point unit and original integer architecture. The mixed architecture therefore performs about 2.77 times better than the original integer architecture.

[Figure: Eight Clusters Used for Different Target Architectures. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.34 Performance Evaluation of eight clusters included in execution time

The descriptions above are summarized in this paragraph. The performance evaluations in execution cycles and in execution time, for the two target architectures with different numbers of clusters, are normalized to the original integer architecture, that is, the architecture of the ALU cluster IP described in the previous section, and the resulting performance enhancements are listed. Fig 4.35 plots the performance normalized to the original integer architecture. It is clear from Fig 4.35 that the performance improves regardless of the number of clusters included. Comparing the target architectures at each number of clusters, the performance improvement grows slightly more slowly as more clusters are used. This results from the fact that additional clusters share, and thereby reduce, the computation load of each functional unit, which slows the increase in performance slightly. Even so, the performance improvement is still about four times when the floating point units are included in the architecture.

The performance evaluation in execution time shows the same trend and the same phenomenon. As illustrated in Fig 4.36, the performance improves regardless of the number of clusters. Comparing the target architectures at each number of clusters, the performance again grows slightly more slowly as more clusters are used, for the same reason as described above. Even so, the performance improvement is still about 3.3 times when the floating point units are included in the architecture. These two performance evaluations, both executing the SRFFT benchmark, show that the floating point unit is an essential unit for improving the performance of the architecture.

[Figure: Performance Normalized to Original Integer Architecture. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.35 Comparison of Performance Normalized in execution cycles

[Figure: Performance Normalized to Original Integer Architecture. Compared: Original Integer Architecture; Floating Point Unit and Original Integer Architecture Mixed.]

Fig 4.36 Comparison of Performance Normalized in execution time
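The normalization used for Fig 4.35 and Fig 4.36 can be reproduced from the critical cycles and execution times quoted in this section. The sketch below divides the original integer architecture's cost by the mixed architecture's cost at each number of clusters. The dictionary and function names are hypothetical.

```python
# Critical cycles and execution times (ns) quoted in the text,
# keyed by the number of clusters.
INTEGER_CYCLES = {1: 647, 2: 331, 4: 213, 8: 151}
MIXED_CYCLES = {1: 157, 2: 74, 4: 52, 8: 42}
INTEGER_TIME_NS = {1: 6470.0, 2: 3310.0, 4: 2130.0, 8: 1510.0}
MIXED_TIME_NS = {1: 2035.3, 2: 957.8, 4: 628.9, 8: 545.4}


def normalized_performance(baseline, improved):
    """Performance of `improved` relative to `baseline` (baseline = 1.0),
    computed as baseline cost divided by improved cost."""
    return {n: baseline[n] / improved[n] for n in baseline}


by_cycles = normalized_performance(INTEGER_CYCLES, MIXED_CYCLES)
by_time = normalized_performance(INTEGER_TIME_NS, MIXED_TIME_NS)

for n in sorted(by_cycles):
    print(f"{n} cluster(s): {by_cycles[n]:.2f}x in cycles, "
          f"{by_time[n]:.2f}x in execution time")
```

The gain in execution time is smaller than the gain in cycles at every cluster count, which is consistent with the lower clock rate of the floating point unit.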

The remainder of this section presents the performance comparison when the architecture type is fixed and different numbers of clusters are included in the selected architecture. For the execution cycles needed to finish the SRFFT benchmark, the results are shown in Fig 4.37. As the figure illustrates, the more clusters are used in the architecture, the higher the performance gained. For the original integer architecture, the performance with two, four and eight clusters is 1.95, 3.04 and 4.28 times that of the one-cluster architecture respectively. The mixed floating point unit and original integer architecture shows the same trend: computed from the critical cycles of 157, 74, 52 and 42, its performance with two, four and eight clusters is 2.12, 3.02 and 3.74 times that of the single-cluster architecture respectively.

The performance comparison from the perspective of execution time is also presented. As shown in Fig 4.38, the trend of performance improvement is the same. For the original integer architecture, the performance with two, four and eight clusters is 1.95, 3.04 and 4.28 times that of the one-cluster architecture respectively. The mixed floating point unit and original integer architecture shows the same trend in execution time: its performance with two, four and eight clusters is 2.12, 3.24 and 3.73 times that of the single-cluster architecture respectively. These results show that using more clusters in the selected architecture exploits more parallelism and yields higher performance.
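The scaling figures above can be recomputed from the one-cluster cost divided by the cost at each cluster count. The sketch below also reports a parallel efficiency, defined here as the speedup divided by the number of clusters. That metric is an added illustration and does not appear in the thesis, and the names used are hypothetical.

```python
INTEGER_CYCLES = {1: 647, 2: 331, 4: 213, 8: 151}
MIXED_CYCLES = {1: 157, 2: 74, 4: 52, 8: 42}
MIXED_TIME_NS = {1: 2035.3, 2: 957.8, 4: 628.9, 8: 545.4}


def scaling(cost_by_clusters):
    """Map cluster count -> (speedup over one cluster, parallel efficiency).

    Efficiency = speedup / cluster count (illustrative metric, an assumption
    of this sketch rather than a quantity reported in the text)."""
    base = cost_by_clusters[1]
    result = {}
    for n, cost in cost_by_clusters.items():
        speedup = base / cost
        result[n] = (speedup, speedup / n)
    return result


integer_scaling = scaling(INTEGER_CYCLES)
mixed_scaling = scaling(MIXED_CYCLES)
mixed_time_scaling = scaling(MIXED_TIME_NS)

for n in (2, 4, 8):
    speedup, efficiency = integer_scaling[n]
    print(f"integer, {n} clusters: {speedup:.2f}x, efficiency {efficiency:.2f}")
```

Because the original integer architecture runs at a fixed 100 MHz, its scaling in execution time equals its scaling in cycles, so only the mixed architecture needs a separate execution-time series.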

A phenomenon observed in Fig 4.37 and Fig 4.38 is discussed in this paragraph. When the number of clusters included in the architecture increases from one to two, or from two to four, the performance improvement is doubled regardless of the architecture adopted and regardless of whether execution time or execution cycles are considered. When the number of clusters increases from four to eight, however, the performance enhancement is not doubled and is not conspicuous. The phenomenon results from the huge data exchange outside the intra register files embedded
