

Chapter 4 Results and Analysis

4.2 Performance Analysis and Discussion

In this section, we discuss and compare the results presented in Section 4.1 and explain how the configuration changes affect system performance.

First, we take the 16QAM modulation result in Case 1 as an example. This configuration is the closest one to the reference SW code. From Table 4.1, we can clearly see the improvement in total execution time due to the enabled cache. Table 4.2 – Table 4.5 provide further cache details in terms of memory access counts.

Figure 4.1 shows the cache efficiency for function execution time in simulation Case 1 with the 16QAM modulation mode. Relative to the no-cache result, using 4k-byte caches for both instruction and data memory reduces the total execution time to 49.06% for function TX, 31.43% for function Modulation, 31.22% for function STBC Encoder, 51.59% for function OFDM Modulator, and 52.65% for function IFFT. In comparison, 32k-byte caches reduce the execution times to 48.33%, 30.29%, 28.40%, 51.03%, and 52.27% for the five functions.

Figure 4.1: Cache efficiency for function execution time in Case 1

This improvement in execution time comes mainly from reducing memory accesses. From Table 4.2 – Table 4.4 we find that with both 4k-byte caches the ROM read accesses of function TX are reduced by about 99.69%, with the smallest reduction being 99.68% for function OFDM Modulator; with both 32k-byte caches, function TX reaches 99.96% and the smallest reduction is 99.77% for function STBC Encoder. The RAM read accesses with both 4k-byte caches are reduced by 74.65% for function TX, with the smallest reduction being 22.25% for function STBC Encoder; with both 32k-byte caches, we get 92.38% for function TX and 49.39% for function STBC Encoder as the smallest. However, the RAM write accesses of function TX are reduced by 0% with either 4k-byte or 32k-byte caches; although some of its sub-functions appear to show 100% reductions, the total access count is not reduced. This is because the values are recorded as the difference between the access counters read at the beginning and at the end of each function, and write-backs to memory may not occur immediately when the data cache is enabled.
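To make this measurement caveat concrete, the following minimal C++ sketch shows the delta-based counting described above; the counter structure and accessor are hypothetical stand-ins for the platform's bus monitors, not the actual profiling code:

```cpp
#include <cstdint>

// Hypothetical snapshot of the platform's memory access counters.
struct MemCounters { uint64_t rom_rd = 0, ram_rd = 0, ram_wr = 0; };

MemCounters g_counters;                        // updated by the (assumed) bus monitors
MemCounters read_counters() { return g_counters; }

template <typename F>
MemCounters profile(F&& fn) {
    MemCounters before = read_counters();      // sample at function entry
    fn();                                      // run the function under test
    MemCounters after = read_counters();       // sample at function exit
    // Accesses are attributed to fn only if they occur between the two
    // samples; a delayed cache write-back lands outside this window, so a
    // sub-function can show a "100% reduction" while the program-wide
    // write count stays unchanged.
    return { after.rom_rd - before.rom_rd,
             after.ram_rd - before.ram_rd,
             after.ram_wr - before.ram_wr };
}
```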

Thus, when determining cache efficiency for memory accesses, we use the access counts of the whole program instead of separating them by function.

Figure 4.2 shows the cache efficiency for total memory accesses in simulation Case 1 with 16QAM modulation. There is an obvious improvement in ROM and RAM read accesses, but none in RAM write accesses. In this case, the reduction in ROM read access counts should be the main contributor to the improvement in execution time. We can also see that increasing the data cache size seems to matter more than increasing the instruction cache size for this case. Since a single AHB bus is used to access RAM and ROM, we should also check the effect on the bus.

Comparing Table 4.6 with Table 4.5 (without cache), we can see that the total ROM access count is larger than the bus transaction count through the IAHB port, while the total RAM access count is smaller than the bus transaction count through the DAHB port. This means some ROM accesses are initiated through the DAHB port and fetch data rather than instructions. The cache efficiency for bus transactions is similar to the cache efficiency for total memory accesses, as Figure 4.3 shows. With 4k-byte caches, transactions are reduced to 0.55% through IAHB and 53.18% through DAHB; with 32k-byte caches, the values are 0.36% through IAHB and 45.92% through DAHB.

Figure 4.3: Cache efficiency for bus transactions in Case 1

Besides, the bus utilization in Table 4.6 shows that the instruction cache greatly reduces IAHB utilization, from 52.87% to 0.59% or 0.40%. Compared with the instruction cache, the data cache removes far fewer transactions through DAHB, so its utilization remains around 7%. With 4k-byte caches, the DAHB bus utilization even increases to 7.85%, and its transaction throughput increases as well. We can also see that the total waiting time on IAHB is reduced from 1.52% to less than 0.012%, and on DAHB from 15.44% to 3.03%.

After this discussion of the cache effect in Case 1, we find that even with 32k-byte caches, the total execution time of 44,198,600 ns is about 197 times our constraint time of 224,089 ns. Therefore, we turn to reducing RAM accesses through HW acceleration. First, we examine the function profiling result of Case 1, find the most heavily loaded function, write a TLM model for it, and embed it in our platform as a HW accelerator. Figure 4.4 is the pie chart of the function profiling in Case 1: the left pie represents the whole execution time of function TX, and the right pie represents the execution time of its most heavily loaded sub-function, OFDM Modulator.

Figure 4.4: Function profiling pie chart for Case 1

Figure 4.4 shows that function IFFT is the most heavily loaded function, costing 89% of the total execution time, so we should partition it as a HW accelerator first, then redraw the pie chart with the new profiling result to pick the next heaviest function if the performance is still not acceptable. But since even the execution time of the 2% function STBC Encoder still exceeds our bound by 4.6 times, we partition three functions as three HW accelerators. This configuration is our simulation Case 2.
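For illustration, a partitioned function can be wrapped as a memory-mapped TLM-2.0 target along the following lines. This is only a minimal sketch under assumed conventions (the buffer size and the trigger-on-last-write behavior are invented here), not the thesis model:

```cpp
#include <cstring>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>

// Minimal sketch of a HW accelerator as a blocking-transport TLM target.
struct HwAccel : sc_core::sc_module {
    tlm_utils::simple_target_socket<HwAccel> socket;
    double buf[2048];                        // assumed sample buffer

    SC_CTOR(HwAccel) : socket("socket") {
        socket.register_b_transport(this, &HwAccel::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& tr, sc_core::sc_time& delay) {
        uint64_t off = tr.get_address();     // offset into the buffer region
        if (tr.is_write()) {
            std::memcpy(reinterpret_cast<char*>(buf) + off,
                        tr.get_data_ptr(), tr.get_data_length());
            if (off + tr.get_data_length() >= sizeof(buf))
                compute();                   // last write triggers processing
        } else {
            std::memcpy(tr.get_data_ptr(),
                        reinterpret_cast<char*>(buf) + off,
                        tr.get_data_length());
        }
        tr.set_response_status(tlm::TLM_OK_RESPONSE);
        // delay left untouched: an ideal, zero-delay accelerator,
        // matching the assumption used in the simulation cases.
    }

    void compute() { /* the partitioned function (e.g. IFFT) would run here */ }
};
```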

Figure 4.5 is the pie chart of the function profiling in Case 2. In this case, the two most heavily loaded parts are the remaining portions of functions TX and OFDM Modulator, which still run in SW. The remaining parts of function TX include setting up the complex variables, subcarrier allocation, the pilot preamble, and so on; the remaining parts of function OFDM Modulator include copying the complex variables and guard interval insertion. Although we partition three functions as HW accelerators in this case, only the 3% function Modulation by itself can meet our constraint time. This means the configuration would not be acceptable even if we partitioned the remaining parts as HW accelerators as well. Hence we need to change the configuration.

Figure 4.5: Function profiling pie chart for Case 2

Before starting the new configuration with simulation Case 3, we first analyze the comparison of Case 2 against Case 1. By the same token, we take 16QAM modulation as an example. Figure 4.6 shows the reduction in instruction counts and function execution time (without cache) achieved by the HW accelerators. By partitioning function Modulation as a HW accelerator, we reduce instruction counts to 45.26% and execution time to 74.30% compared with Case 1 without cache; if we partition function STBC Encoder, we reduce instruction counts to 34.49% and execution time to 32.43%; similarly, the partitioned HW IFFT reduces instruction counts to 1.01% and execution time to 2.14%.

Figure 4.6: Instructions and time reduction by HW acceleration

Figure 4.7 shows the reduction in memory accesses achieved by HW acceleration, without utilizing cache. The total memory access count is reduced by less than 25% relative to Case 1 for function Modulation, much worse than the 66.7% for function STBC Encoder and the 98.1% for function IFFT, which roughly matches the reduction rates in execution time above.

Figure 4.7: Memory accesses reduction by HW acceleration

Next, we compare the bus transaction information between Case 1 and Case 2 without considering the cache effect. The total bus transaction count in Case 2 is reduced to 25.62% through IAHB and 29.99% through DAHB compared with Case 1. The IAHB bus utilization in Case 2 decreases slightly, from 52.87% in Case 1 to 50.13%, but the DAHB utilization increases from 7.26% to 8.06%, probably due to the additional accesses to the HW accelerators.

After comparing Case 1 and Case 2 without cache to show the effect of HW acceleration, we continue the performance analysis with 32k-byte caches. Figure 4.8 shows the execution times of the functions in Case 1 and Case 2. The total execution time in Case 2 is reduced to 5,377,680 ns, with the HW accelerator for function IFFT contributing 99.28% of the total improvement, but it is still about 24 times our constraint time of 224,089 ns. With cache enabled, the execution time of function Modulation in Case 2 is worse than in Case 1 because of the 2,304 additional bus transactions for accessing the HW accelerator, which cannot be reduced by caching.

Figure 4.8: Execution time comparison between Case 1 and Case 2

Inspecting Figure 4.5, since the total execution time is 24 times the time constraint, the constraint corresponds to only about 4% of that pie. Thus we must not only accelerate the remaining 48% and 23% SW parts, but also solve the problem caused by HW accesses to reduce the execution time of the other parts.

For this reason, we may need HW accelerators for almost the whole function TX to meet the execution time constraint. Since we should avoid too many additional bus transactions for accessing these HW accelerators, we combine all of them into a single HW accelerator for function TX. And since the program is a transmitter, we do not need to read the tx_signal back after each iteration unless we use it in the validation step with the SW golden functions. This configuration becomes our simulation Case 3. However, in this configuration almost nothing remains for the SW running on the microprocessor. According to the comparison between Case 1 and Case 2, function Modulation is suitable to run in SW. That is why we have simulation Case 4.

Figure 4.9 shows the execution time comparison among Case 2, Case 3, and Case 4.

The total execution times of 7,856 ns for Case 3 and 176,120 ns for Case 4 are both under the constraint time, so both cases should be acceptable. However, both are simulated with ideal HW accelerators without any delay, and we should therefore look at them from another viewpoint. In fact, for the combined HW accelerator we only write data to the bus and never read the transmitted data back during simulation, so the recorded execution time of function TX covers the time until all data have been sent to the HW accelerator. Thus we can subtract the execution time of function TX in these cases from the execution time constraint. With a 125 MHz system clock, dividing the leftover time by the 8 ns clock period gives the number of leftover cycles the HW accelerator has to complete those functions. Of course, more leftover cycles give more flexibility to the HW design, so we list the values for the different configurations for further discussion. But since we have two acceptable cases, let us first look at their cache efficiency.
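As a check on this arithmetic (assuming the 7,856 ns figure above is the 16QAM, 32k-byte cache measurement, which is consistent with Table 4.21):

\[
N_{\mathrm{HW}} \;=\; \left\lfloor \frac{T_{\mathrm{constraint}} - T_{\mathrm{TX}}}{T_{\mathrm{clk}}} \right\rfloor \;=\; \left\lfloor \frac{224089\,\mathrm{ns} - 7856\,\mathrm{ns}}{8\,\mathrm{ns}} \right\rfloor \;=\; 27029,
\]

which matches the Case 3, 16QAM, 32k entry in Table 4.21.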

Figure 4.9: Execution time comparison among Case 2, Case 3, and Case 4

Table 4.13 and Table 4.17 compare the differences in execution time for the different cache sizes. Figure 4.10 shows the cache efficiency for function execution time in Case 3 and Case 4. The execution time reduction rate in Case 4 is higher than in Case 3 because function Modulation runs in SW in Case 4. We can also see, by comparing the 4k-byte and 32k-byte configurations, that increasing the cache size produces no obvious effect. Referring to Table 4.15, Table 4.16, Table 4.19, and Table 4.20, we find no difference in ROM read and RAM write access counts between the 4k and 32k cache sizes for either case.

In both Case 3 and Case 4, the difference between the 4k and 32k cache sizes in the percentage of reduced IAHB transaction counts is only 0.01%, and the differences for DAHB are both less than 1%. It seems that a larger instruction cache is not needed. In this configuration, the DAHB bus utilization is reduced to less than 5% in both cases. Figure 4.11 and Figure 4.12 show the cache efficiency for bus transactions in Case 3 and Case 4.

Figure 4.10: Cache efficiency for execution time in Case 3 and Case 4

Figure 4.11: Cache efficiency for bus transactions in Case 3

Figure 4.12: Cache efficiency for bus transactions in Case 4

Table 4.21 lists the leftover cycles for the HW accelerator in Case 3 and Case 4 for the three modulation modes and the three cache sizes.

Table 4.21: Leftover cycles for the HW accelerator in Case 3 and Case 4

Simulation case   Modulation mode   Leftover cycles for HW (cache size)
                                         0k       4k      32k
Case 3            QPSK                27364    27512    27512
                  16QAM               26740    26993    27029
                  64QAM               26116    26510    26546
Case 4            QPSK              -21053     6074     6086
                  16QAM             -19423     5983     5996
                  64QAM             -31690      841      910

As the table shows, the HW designer has more than 26,000 clock cycles at 125 MHz to design a suitable HW accelerator for function TX in Case 3. In Case 4, however, there is bad news: although around 6,000 clock cycles are left for HW in the QPSK and 16QAM modulation modes, the SW execution time for 64QAM modulation is almost equal to the constraint time. This is because 64QAM modulation handles three bits at a time, and since three is not a power of 2, extracting the bit groups increases the time complexity.
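One plausible illustration of the extra cost: power-of-two symbol widths can be extracted with a fixed shift and mask per byte, while 3-bit groups straddle byte boundaries and need a running bit cursor. The helper functions below are hypothetical, not the reference code:

```cpp
#include <cstddef>
#include <cstdint>

// 16QAM: 4 bits per symbol divide a byte evenly -> fixed shift and mask.
uint8_t sym16qam(const uint8_t* bits, size_t i) {
    return (bits[i >> 1] >> ((i & 1) * 4)) & 0xF;
}

// 64QAM: 3 bits per symbol; a group may span two adjacent bytes, so the
// extraction needs a bit-position multiply and a two-byte window.
// (Caller must pad the buffer by one byte to keep the read in bounds.)
uint8_t sym64qam(const uint8_t* bits, size_t i) {
    size_t bitpos = i * 3;                 // multiply, not a simple shift
    size_t byte = bitpos >> 3, off = bitpos & 7;
    uint16_t window = uint16_t(bits[byte]) | (uint16_t(bits[byte + 1]) << 8);
    return (window >> off) & 0x7;          // may cross a byte boundary
}
```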

We have several methods to solve this problem. The first is to run the simulation with a fixed-point representation. Since the HW access count on the bus is 1,536 and the transaction data type is 64-bit double-precision floating point, using a fixed-point representation of no more than 32 bits saves 768 accesses, which should save at least 768 clock cycles of transaction time. This leaves more than 1,609 clock cycles for the HW accelerator. Besides fixed-point representation, 32-bit single-precision floating point should give a similar result.
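A minimal sketch of this first method, assuming a Q2.29 fixed-point format (the actual format would be chosen from the signal's dynamic range, which the thesis does not specify) and packing two 32-bit values per 64-bit bus word:

```cpp
#include <cmath>
#include <cstdint>

int32_t to_fixed(double x) {               // Q2.29: range roughly [-4, 4)
    return static_cast<int32_t>(std::lrint(x * (1 << 29)));
}

double from_fixed(int32_t q) {             // HW-side decode
    return static_cast<double>(q) / (1 << 29);
}

// Two samples per 64-bit transfer: halves the 1,536 HW accesses to 768.
uint64_t pack2(double a, double b) {
    return uint64_t(uint32_t(to_fixed(a))) |
          (uint64_t(uint32_t(to_fixed(b))) << 32);
}
```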

The second method is to apply a data encoding technique. The three supported modulation modes use fourteen distinct values in total, so we can represent them with 4 bits in the SW part and decode them back to their actual values in the HW part. Doing so also saves 768 access counts without any buffering; with buffering, several 4-bit codes could be packed into each transaction to save even more.

Although this method may seem similar to implementing function Modulation in HW, at least the HW part can ignore the modulation mode.
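A sketch of the encode/decode pair this method implies; the table contents are deliberately left blank here because the fourteen actual constellation values come from the reference code:

```cpp
#include <cstdint>

// The fourteen distinct values used across QPSK/16QAM/64QAM would be
// listed here (elided; taken from the reference SW code).
static const double kLevels[14] = {};

uint8_t encode(int level_index) {          // SW side: 4-bit code on the bus
    return static_cast<uint8_t>(level_index) & 0xF;
}

double decode(uint8_t code) {              // HW side: one table lookup,
    return code < 14 ? kLevels[code]       // independent of modulation mode
                     : 0.0;                // codes 14-15 unused
}
```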

The third method is to divide the application into three parts according to the modulation mode: we use the Case 4 configuration for QPSK and 16QAM modulation and the Case 3 configuration for 64QAM modulation. According to the current modulation mode, the SW part and the HW part both switch to the corresponding configuration.

The last method is overclocking. For this method, we change the clock divisor from 4 to 3 and rerun the simulation. With a 166 MHz system clock frequency, the function profiling reports an execution time of 162,606 ns for function TX and 83,964 ns for function Modulation. These values are exactly 75% of their 125 MHz counterparts, and the shorter transaction duration also increases the transaction throughput to about 1.33 times. This method leaves 10,247 clock cycles for the HW accelerator at 166 MHz, corresponding to 7,685 clock cycles at 125 MHz.
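The conversion between the two clock rates is just the frequency ratio; assuming the 125 MHz and 166 MHz clocks come from the same base clock divided by 4 and 3 respectively:

\[
N_{125} \;=\; N_{166} \times \frac{f_{125}}{f_{166}} \;=\; 10247 \times \frac{3}{4} \;\approx\; 7685 .
\]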

Since we have four methods to make Case 4 fully acceptable for all three modulation modes, we now have two cases with several different configurations to choose from. According to the cache efficiency analysis of the acceptable cases, a 4k-byte cache is enough. In Case 4, we execute part of the function in SW, hoping to decrease the HW cost. However, this may not be useful, because the leftover cycles for the HW design also drop to less than 20% of those in Case 3. Depending on the HW design, more factors may enter the performance analysis. Besides, in our simulation all memory accesses are treated as on-chip memory accesses without delay. From our memory size evaluation, all acceptable configurations use no more than 64k bytes except 64QAM in Case 4. Therefore, if the system does not have more than 64k bytes of on-chip RAM, accesses to off-chip memory will decrease the performance. For this reason, applying the third method to Case 4 may be more feasible.

Chapter 5

Conclusion and Future Work

In this work, we followed an ESL design methodology and focused on system-level modeling before implementation. We built an ARM-based SoC virtual platform and ported a reference SW code for the IEEE 802.16e baseband to it. We then developed, debugged, and optimized the application SW for the transmitter on this virtual platform. We also classified three kinds of HW/SW configurations and defined four simulation cases for HW/SW partition analysis during HW/SW co-design. We used the platform for program validation to make sure each case was functionally correct. We then gathered information on function profiling, memory accesses, and bus transactions during simulation with different cache sizes for architectural exploration and system performance evaluation in each case. We derived the required execution time constraint for the application SW, compared the results of the cases, and identified the acceptable ones. For the acceptable cases, we reviewed the profiling results to offer better configuration suggestions before real implementation. In the discussion of Case 1 and Case 2, we found that running the Modulation function in SW with cache is more efficient than partitioning it as a HW accelerator, due to the bus transaction problem. In the discussion of the fully acceptable Case 3 and the partially acceptable Case 4 under the three supported modulation modes, a 4k-byte cache is enough, especially for instruction memory. We also proposed several methods to solve the 64QAM problem in Case 4 and finally listed the allowable clock cycles for the HW design to consult.

The use of TLM brings flexibility and usability to non-HW experts in HW/SW co-design. It simplifies the HW/SW interface and the HW models, while the support for multi-cycle accesses can be quickly parameterized and applied to simulate system performance once the HW architecture and timing have been decided. Simulation is faster, and it is easier to obtain results when making changes to the system architecture. With more controllable factors in the system, such as the memory space and specified access delays, we have more dimensions for analysis and a better chance of finding good architectures through early exploration, well before real implementation begins.

In the future, this work can be extended to find acceptable configurations for the receiver, which can then be combined with the transmitter and a channel model for a complete transceiver analysis. In the transceiver analysis, we should consider the influence of the channel model and try to isolate it from the performance analysis through the system architecture; for example, a separate bus for the channel model would decrease its influence on bus transactions. We could also use the virtual platform for HDL co-simulation [16], taking the SystemC TLM models as golden models for co-verification with the HW RTL design, and so carry the ESL design methodology through to implementation. In addition, the evaluation of HW power consumption and area cost can be carried out for further analysis.

References

[1] J. Bhasker, A SystemC Primer, Star Galaxy Publishing, Allentown, PA, 2002, ISBN 0-9650391-8-8.

[2] J. Gozalvez, “Mobile WiMAX Rollouts Announced,” IEEE Vehicular Technology Magazine, vol. 1, no. 3, pp. 53–59, Sept. 2006.

