Chapter 7 Realization on Embedded Real-Time System
7.4 Optimization
To achieve high processing performance, we must understand the core processor structures that help optimize performance. This section discusses optimization in two parts: system and coding. Section 7.4.1 describes system optimization, and Section 7.4.2 discusses how to tune C code for the BF-561.
7.4.1 System Optimization
• Memory
Efficient system resource utilization is critical for developing applications that demand high bandwidth on an embedded platform. Systems can often run out of bandwidth, even if the throughput requirements are within the limits of the system.
The critical factors that result in lower than expected throughput, more often than not, are external memory access latencies and inefficient utilization of system resources.
In order to fully exploit the capabilities of an embedded processor, it is important to understand its system architecture and the available system optimization techniques.
This section serves as a quick reference to the Blackfin processor memory hierarchy and its system architecture. It also provides guidelines for using several optimization techniques to efficiently utilize the available system resources and discusses benchmark studies to evaluate and quantify the suggested optimization techniques.
Fig. 79 shows the Blackfin processor’s memory hierarchy and the relative tradeoffs between on-chip (L1 and L2) memory and off-chip (external) memory.
Guidelines are also provided to efficiently map code and data into the memory hierarchy to achieve minimal memory access latencies.
Fig. 79: Blackfin processor memory hierarchy
Cached memory can provide significant benefits for execution of code and data mapped to L2 or external memory. Cache performance depends on the temporal and spatial characteristics of the application. The disadvantage of cache memory is that it suffers from cache miss penalties, which increase memory access latencies and thus increase external memory bandwidth requirements. Also, for streaming data, cache lines must be invalidated when new data is transferred into external memory.
Invalidating cache lines is expensive and can significantly decrease performance.
• System Architecture
The Blackfin processor’s system architecture includes the system buses, DMA controllers, peripherals, and external bus arbiter.
The system throughput can be greatly increased by using the maximum bus width for every transfer. Using 32-bit DMA access for ADSP-BF561 processors combined with packing can free up the system buses for other activities, thereby greatly increasing the throughput of the system. For example, the PPI provides 32-bit packing for ADSP-BF561 processors.
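As a rough illustration of why packing helps, the sketch below packs four 8-bit samples into one 32-bit word in software, so that a single 32-bit transfer carries what would otherwise need four 8-bit transfers. On the BF561 the PPI performs this packing in hardware; the helper name `pack_bytes_to_words` and the byte order (first sample in the least significant byte) are assumptions made only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack 8-bit samples into 32-bit words, four per word.
 * Assumed byte order: first sample goes into the least significant byte.
 * n_bytes is expected to be a multiple of 4; any remainder is ignored.
 * Returns the number of 32-bit words written to dst. */
size_t pack_bytes_to_words(const uint8_t *src, size_t n_bytes, uint32_t *dst)
{
    size_t n_words = n_bytes / 4;

    for (size_t w = 0; w < n_words; ++w) {
        const uint8_t *p = &src[4 * w];
        dst[w] = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    }
    return n_words;
}
```

With packing, moving N bytes occupies the bus for N/4 word transfers instead of N byte transfers, which is the effect the hardware packing is meant to achieve.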
Blackfin processors provide traffic control on all the system buses. If the traffic on the bus is switching directions too often, the result will be increased latencies due to bank turnaround times. Using the traffic control registers is one of the best ways to optimize the system bus traffic, consequently improving bandwidth utilization. The traffic period for each of the DMA buses can be specified to group transfers in one direction, thereby minimizing bank turnaround times. Fig. 80 illustrates an optimized traffic pattern over the DAB bus. [71]
Fig. 80: Optimizing DMA traffic over the system buses
7.4.2 Tuning C Code for Blackfin BF-561
There is a vast difference in performance between C code compiled with optimization and C code compiled without it. In some cases, optimized code can run ten or twenty times faster. Note that the default setting is non-optimized compilation, which is intended to help programmers diagnose problems in their initial code.
• Avoid Float/Double Arithmetic
Floating-point arithmetic operations are implemented by library routines and, consequently, are far slower than integer operations. An arithmetic floating-point operation inside a loop will prevent the optimizer from using a hardware loop.
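One common way to follow this advice is to replace floating-point scaling with 16-bit fractional (Q15) integer arithmetic. The sketch below is a minimal illustration under that assumption; the names `q15_t`, `q15_mul`, and `scale_q15` are made up for the example and are not part of any Blackfin library.

```c
#include <stddef.h>
#include <stdint.h>

typedef int16_t q15_t;   /* Q15: real value = raw / 32768, range [-1, 1) */

/* Multiply two Q15 values with rounding and saturation.
 * Assumes >> on a negative int32_t is an arithmetic shift, which holds
 * for the usual compilers but is implementation-defined in C. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round and return to Q15 */
    if (p > INT16_MAX)                     /* only (-1) * (-1) can overflow */
        p = INT16_MAX;
    return (q15_t)p;
}

/* Scale a buffer by a Q15 gain using integer arithmetic only. */
void scale_q15(const q15_t *in, q15_t *out, size_t n, q15_t gain)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = q15_mul(in[i], gain);
}
```

A gain of 0.5 is written as 16384 in Q15, so `scale_q15(in, out, n, 16384)` halves every sample without calling any floating-point routine.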
• Avoid Integer Division in Loops
The hardware does not provide direct support for 32-bit integer division, so the division and modulus operations on int variables are multi-cycle operations. The compiler will convert an integer division by a power of two to a right-shift operation if the value of the divisor is known. If the compiler has to issue a full division operation, it will issue a call to a library function. In addition to being a multi-cycle operation, this will prevent the optimizer from using a hardware loop for any loops around the division. Whenever possible, do not use divide or modulus operators inside a loop.
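A typical case is a circular-buffer index computed with the modulus operator. The sketch below shows one possible rewrite in which the index is wrapped with a compare, so no divide or modulus remains inside the loop. The function names and the ring-buffer scenario are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Straightforward version: one modulus operation per iteration. */
int32_t sum_ring_mod(const int16_t *buf, unsigned len,
                     unsigned start, unsigned count)
{
    int32_t sum = 0;

    for (unsigned i = 0; i < count; ++i)
        sum += buf[(start + i) % len];
    return sum;
}

/* Same result with the modulus hoisted out of the loop: the index is
 * wrapped by a compare and reset instead of a division. */
int32_t sum_ring_wrap(const int16_t *buf, unsigned len,
                      unsigned start, unsigned count)
{
    int32_t sum = 0;
    unsigned idx = start % len;      /* single modulus, outside the loop */

    for (unsigned i = 0; i < count; ++i) {
        sum += buf[idx];
        if (++idx == len)
            idx = 0;
    }
    return sum;
}
```

If the buffer length can be chosen as a power of two, `idx & (len - 1)` is another option that avoids the division entirely.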
• Indexed Arrays versus Pointers
The C language allows you to program data accesses from an array in two ways: either by indexing from an invariant base pointer or by incrementing a pointer. The pointer style introduces additional variables that compete with the surrounding code for resources during the optimizer’s analysis. Array accesses, on the other hand, must be transformed to pointers by the compiler, and sometimes it does not do the job as well as you could do by hand.
The best strategy is to start with array notation. If the generated code looks unsatisfactory, try using pointers. Outside the important loops, use the indexed style because it is easier to understand.
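The two styles can be compared on a simple dot product. Both functions below compute the same result; the names `dot_indexed` and `dot_pointer` are chosen for this sketch only, and which version compiles to better code depends on the compiler and the surrounding loop.

```c
#include <stdint.h>

/* Indexed style: invariant base pointers plus one loop index. */
int32_t dot_indexed(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;

    for (int i = 0; i < n; ++i)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* Pointer style: each pointer is advanced explicitly. */
int32_t dot_pointer(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    const int16_t *end = a + n;

    while (a < end)
        sum += (int32_t)*a++ * *b++;
    return sum;
}
```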
• Initialize Constants Statically
Inter-procedural analysis identifies variables that have only one value and replaces them with constants, which can enable better optimization.
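The difference can be sketched as follows. A variable that is assigned at run time starts with the default value zero and later receives a second value, so the analysis typically cannot treat it as a constant; a statically initialized object that is never written has exactly one value. The names and the frame length of 256 are assumptions for this example.

```c
#include <stdint.h>

/* Run-time initialization: the variable holds 0 by default and 256 after
 * init_frame_len(), so it is not a single-valued variable. */
static int frame_len_runtime;

void init_frame_len(void)
{
    frame_len_runtime = 256;
}

/* Static initialization: the only value is stated in the definition. */
static const int frame_len_static = 256;

/* Loop bound read from the run-time initialized variable. */
int32_t sum_frame_runtime(const int16_t *x)
{
    int32_t sum = 0;

    for (int i = 0; i < frame_len_runtime; ++i)
        sum += x[i];
    return sum;
}

/* Loop bound known as a constant, which lets the compiler fix the
 * trip count at compile time. */
int32_t sum_frame_static(const int16_t *x)
{
    int32_t sum = 0;

    for (int i = 0; i < frame_len_static; ++i)
        sum += x[i];
    return sum;
}
```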
• Word-align Your Data
To make most efficient use of the hardware, it must be kept fed with data. In many algorithms, the balance of data accesses to computations is such that, to keep the hardware fully utilized, data must be fetched with 32-bit loads.
Although the Blackfin architecture supports byte addressing, the hardware requires that references to memory be naturally aligned. Thus, 16-bit references must be at even address locations, and 32-bit at word-aligned addresses. So, for the most efficient code to be generated, we should ensure that data are word-aligned.
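As a minimal sketch of this idea in portable C11, the buffer below is forced onto a 32-bit boundary with `alignas`, and two 16-bit samples are fetched per 32-bit access. The names `sample_buf`, `is_word_aligned`, and `sum_pairs` are assumptions for the example; a Blackfin toolchain may offer its own alignment pragmas or attributes, which are not shown here.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A buffer of 16-bit samples placed on a 4-byte (32-bit word) boundary. */
alignas(4) int16_t sample_buf[8] = {1, -2, 3, -4, 5, -6, 7, -8};

/* Returns 1 if p lies on a 4-byte boundary, 0 otherwise. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Sum n 16-bit samples (n even), reading two samples per 32-bit fetch.
 * memcpy keeps the code valid standard C; an optimizing compiler
 * normally reduces it to a single 32-bit load on an aligned buffer. */
int32_t sum_pairs(const int16_t *x, size_t n)
{
    int32_t sum = 0;

    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t pair;
        memcpy(&pair, &x[i], sizeof pair);
        sum += (int16_t)(pair & 0xFFFFu);   /* one half of the word */
        sum += (int16_t)(pair >> 16);       /* the other half */
    }
    return sum;
}
```

Checking `is_word_aligned(buf)` before entering the loop is a cheap way to catch a buffer that was placed at an odd 16-bit offset.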
7.5 Summary
The memory and system optimization techniques discussed in this chapter help produce efficient code and data layouts and optimize system performance. Tuning the C code obtains the best performance the compiler can deliver. The content of this chapter reflects our own practical experience in coding for the BF-561.