
Chapter 4 Performance Evaluation

4.4 Performance comparison

Figure 4.10 Performance of FFT on different micro-architectures

Figure 4.10 charts the results of the performance evaluation described in Section 4.3.

The x-axis represents the different stream micro-architectures (1-cluster, 2-cluster, 4-cluster, and 8-cluster), while the y-axis denotes performance.

The curve marked with diamonds shows the CPU_Time required to execute FFT. As the figure shows, when the number of clusters in the micro-architecture increases from 1 to 2, the performance doubles.

From the 2-cluster to the 4-cluster micro-architecture, the performance doubles again. From the 4-cluster to the 8-cluster micro-architecture, however, hardly any improvement in performance can be observed.
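The scaling trend described for the CPU_Time curve can be checked numerically by taking the ratio between successive configurations. The following is a minimal sketch, assuming that performance is inversely proportional to CPU time. The function name `successive_speedups` and the sample CPU-time values are hypothetical; the values only mimic the trend described for Figure 4.10 and are not measurements from the simulator.

```python
def successive_speedups(cpu_time_by_clusters):
    """Return {(n_prev, n_next): speedup} for consecutive cluster counts.

    Speedup is the CPU time of the smaller configuration divided by the
    CPU time of the larger one, so 2.0 means the performance doubled.
    """
    counts = sorted(cpu_time_by_clusters)
    return {
        (a, b): cpu_time_by_clusters[a] / cpu_time_by_clusters[b]
        for a, b in zip(counts, counts[1:])
    }


# Hypothetical CPU times in arbitrary units, NOT values read from Figure 4.10.
example_cpu_time = {1: 800.0, 2: 400.0, 4: 200.0, 8: 180.0}

for (a, b), speedup in successive_speedups(example_cpu_time).items():
    print(f"{a}-cluster -> {b}-cluster: speedup {speedup:.2f}x")
```

With real simulator output in place of the placeholder dictionary, a ratio near 2 indicates that doubling the clusters doubled the performance, and a ratio near 1 indicates that the extra clusters brought little benefit.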

The curve marked with squares represents the number of SRF accesses required to execute FFT, which can also be viewed as the number of data exchanges through the SRF. As the figure shows, when the number of clusters increases from 1 to 2 or from 2 to 4, the number of data exchanges through the SRF increases slowly. When the number of clusters increases from 4 to 8, however, it clearly doubles.

The curve marked with triangles denotes the number of SP accesses when executing FFT, which can also be viewed as the number of data exchanges through the SP. As the figure shows, the number of SP accesses has no relationship to the number of clusters. It can therefore be concluded that, regardless of the number of clusters, few data exchanges take place through the SP between functional units in the same cluster. Moreover, the functional units themselves do not vary with the number of clusters.

The curve marked with “x” represents the number of LRF accesses when executing FFT, i.e., the number of data exchanges through the LRF. As the figure shows, this number decreases as the number of clusters increases. When the number of clusters increases from one to four, the number of data exchanges through the LRF decreases linearly. From 4 clusters to 8 clusters, however, the number of LRF accesses shows a more drastic decrease.

Since the total number of accesses to the memory hierarchy is fixed when executing FFT, the number of SRF accesses increases linearly as the number of clusters goes from one to four, while the number of LRF accesses decreases linearly. When the number of clusters increases from 4 to 8, the number of SRF accesses suddenly doubles and the number of LRF accesses decreases noticeably. This means that data exchange between clusters, i.e., through the SRF, becomes frequent, while data exchange inside each cluster, i.e., through the LRF, decreases.

Now consider how performance varies across the micro-architectures. From the 1-cluster to the 2-cluster to the 4-cluster micro-architecture, the performance doubles at each step. From the 4-cluster to the 8-cluster micro-architecture, however, the performance improves only slightly. The reason is that a large amount of data exchange moves from the high-bandwidth LRF to the low-bandwidth SRF, since the bandwidth ordering is LRF > SP > SRF. In this case, the gain in performance does not keep pace with the added hardware cost.
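The bandwidth argument can be made concrete with a simple cost model in which the time spent on data exchange at each level is the number of accesses divided by that level's bandwidth. The following is a sketch under stated assumptions: the function `exchange_cost`, the relative bandwidth figures, and the access counts are all hypothetical. Only the ordering LRF > SP > SRF and the fixed total number of accesses come from the text.

```python
def exchange_cost(accesses, bandwidth):
    """Sum accesses / bandwidth over the memory levels (arbitrary time units)."""
    return sum(accesses[level] / bandwidth[level] for level in accesses)


# Hypothetical relative bandwidths; only the ordering LRF > SP > SRF is from the text.
bandwidth = {"LRF": 16.0, "SP": 4.0, "SRF": 1.0}

# Hypothetical access counts with the same total of 1000 accesses in both cases.
mostly_local = {"LRF": 900, "SP": 50, "SRF": 50}
more_srf = {"LRF": 850, "SP": 50, "SRF": 100}

print("mostly local:", exchange_cost(mostly_local, bandwidth))
print("more SRF:    ", exchange_cost(more_srf, bandwidth))
```

In this toy model, moving even a small share of accesses from the LRF to the SRF raises the total exchange cost noticeably, which is consistent with the limited gain reported for the 8-cluster micro-architecture.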

Figure 4.11 is the analysis chart of memory usage, where the x-axis represents the levels of the memory hierarchy and the y-axis denotes the memory usage at each level. The memory demand at each level is not closely related to performance.

The demand for SRF capacity is roughly the same in the 1-cluster, 2-cluster, 4-cluster, and 8-cluster micro-architectures; it does not increase with the number of clusters. The SP capacity used differs considerably among the four micro-architectures, but it has no positive relation with the number of clusters. The LRF capacity increases linearly with the number of clusters. In the three-tiered memory hierarchy, only the demand for LRF is positively related to the number of clusters.

Figure 4.11 Memory capacity of memory hierarchy (memory capacity, axis range 0 to 250, for SRF usage, SP usage, and LRF usage in the 1-cluster, 2-cluster, 4-cluster, and 8-cluster micro-architectures)

From the analysis above, it can be concluded that the 4-cluster micro-architecture is the most suitable hardware micro-architecture for the FFT-32 application.

As for the memory size required by this micro-architecture, refer to the memory usage chart in Figure 4.12: SRF = 64 32-bit registers, SP = 20 32-bit registers, and LRF = 212 32-bit registers.
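The register counts can be converted into bytes with simple arithmetic. The sketch below only restates the figures given in the text. Whether they apply per cluster or to the whole processor is not stated here, so the numbers are converted as given; the variable names are illustrative.

```python
BYTES_PER_REGISTER = 32 // 8  # each register is 32 bits wide

# Register counts as stated in the text for the 4-cluster micro-architecture.
registers = {"SRF": 64, "SP": 20, "LRF": 212}

size_in_bytes = {level: n * BYTES_PER_REGISTER for level, n in registers.items()}
total_bytes = sum(size_in_bytes.values())

for level, size in size_in_bytes.items():
    print(f"{level}: {registers[level]} registers = {size} bytes")
print(f"Total: {sum(registers.values())} registers = {total_bytes} bytes")
```

This gives 256 bytes for the SRF, 80 bytes for the SP, and 848 bytes for the LRF, for a total of 296 registers or 1184 bytes.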

4.5 Summary

In this thesis, a micro-architecture simulator is designed whose main function is to simulate the operation of media applications on a stream processor and to estimate their performance. FFT is taken as the benchmark. Several micro-architectures with different numbers of clusters were simulated and compared in terms of the number of memory accesses at each level and CPU Time. It can be observed that when the number of clusters increases from 1 to 2 and from 2 to 4, the performance in terms of CPU Time doubles, the number of SRF accesses increases roughly linearly, and the number of LRF accesses decreases linearly. When the simulator goes from 4 clusters to 8 clusters, however, the improvement in CPU Time is limited and the number of SRF accesses grows sharply, i.e., a large amount of data is exchanged between clusters.

A stream processor is designed to keep computation on data inside a cluster as far as possible; only when necessary is data exchanged between clusters through the SRF. The simulation results for the 8-cluster micro-architecture, however, do not meet this expectation.

Therefore, the 4-cluster micro-architecture is chosen as the most suitable one for executing FFT on the stream processor. In addition, the memory hierarchy usage is SRF : SP : LRF = 64 : 20 : 212.
