We now introduce the software development environment used in our work. TI provides DSP users with a GUI-based tool set for developing and debugging code: Code Composer Studio (CCS). CCS is a key element of TI's DSP software and development tools. This fully integrated development environment includes real-time analysis capabilities, an easy-to-use debugger, a C/C++ compiler, an assembler, a linker, an editor, a visual project manager, simulators, XDS560 and XDS510 emulation drivers, and DSP/BIOS support.
Some of CCS’s fully integrated host tools include:
• Simulators for full devices, CPU only and CPU plus memory to suit different performance needs.
• Integrated visual project manager with source control interface, multi-project support and the ability to handle thousands of project files.
• Source code debugger with a common interface for both simulator and emulator targets:
– C/C++/assembly language support.
– Simple breakpoints.
– Advanced watch window.
– Symbol browser.
• DSP/BIOS host tooling support (configure, real-time analysis and debug).
• Data transfer for real-time data exchange between host and target.
• Profiler to analyze code performance.
CCS also delivers “foundation software” consisting of:
• DSP/BIOS kernel for the TMS320C6000 DSPs.
– Pre-emptive multi-threading.
– Interthread communication.
– Interrupt handling.
• TMS320 DSP Algorithm Standard to enable software reuse.
• Chip Support Libraries (CSL) to simplify device configuration. CSL provides C-program functions to configure and control on-chip peripherals.
TI also provides optimized DSP functions for the TMS320C64x devices in the TMS320C64x digital signal processor library (DSPLIB). This source code library includes C-callable functions (ANSI-C language compatible) for general signal processing, mathematical, and vector operations [19]. The routines included in the DSP library are organized as follows:
• Adaptive filtering.
• Correlation.
• FFT.
• Filtering and convolution.
• Math.
• Matrix functions.
• Miscellaneous.
Although TI offers these routines for optimization, we did not use any of them in our C code.
5.2.1 Code Development Flow
The recommended code development flow involves utilizing the C6000 code generation tools to aid in optimization rather than forcing the programmer to code by hand in assembly. This makes the compiler do all the laborious work of instruction selection, parallelizing, pipelining, and register allocation, which simplifies the maintenance of the code, as everything resides in a C framework that is simple to maintain, support, and upgrade.
The recommended code development flow for the C6000 involves the phases described in Fig. 5.3. The tutorial section of the Programmer’s Guide [20] focuses on phases 1 and 2, and the Guide also instructs the programmer about the tuning stage of phase 3. It is important to give the compiler enough information to realize its full potential. A beneficial feature of this compiler is that it provides direct feedback on the program’s high-MIPS areas (loops). Based on this feedback, the programmer can take some simple steps to pass more complete information to the compiler and thereby maximize its performance.
The following items list the goal for each phase in the software development flow shown in Fig. 5.3.
• Phase 1: Develop the C code without any knowledge of the C6000. Use the C6000 profiling tools to identify any inefficient areas in the C code. To improve the performance of the code, proceed to phase 2.
• Phase 2: Use the techniques described in [20] to improve the C code. Use the C6000 profiling tools to check its performance. If the code is still not as efficient as we would like it to be, proceed to phase 3.
• Phase 3: Extract the time-critical areas from the C code and rewrite them in linear assembly. The assembly optimizer can then be used to optimize this code.
Figure 5.3: Code development flow for TI C6000 DSP [20].
TI provides high-performance optimization tools for C programs and does not recommend that the programmer code by hand in assembly. In this thesis, the development flow stops at phase 2: we do not optimize the code by writing linear assembly. Coding the program in a high-level language also preserves the flexibility of porting it to other platforms.
5.2.2 Compiler Optimization Options
The compiler supports several options to optimize the code, which can target either code size or execution performance. Our primary concern in this work is execution performance. Our code does not use up the L2 memory, and hence code size is of little concern. The easiest way to invoke optimization is to use the cl6x shell program, specifying the -on option on the cl6x command line, where n denotes the level of optimization (0, 1, 2, or 3) and controls the type and degree of optimization:
• -o0.
– Performs control-flow-graph simplification.
– Allocates variables to registers.
– Performs loop rotation.
– Eliminates unused code.
– Simplifies expressions and statements.
– Expands calls to functions declared inline.
• -o1. Performs all -o0 optimizations, and:
– Performs local copy/constant propagation.
– Removes unused assignments.
– Eliminates local common expressions.
• -o2. Performs all -o1 optimizations, and:
– Performs software pipelining.
– Performs loop optimizations.
– Eliminates global common subexpressions.
– Eliminates global unused assignments.
– Converts array references in loops to incremented pointer form.
– Performs loop unrolling.
• -o3. Performs all -o2 optimizations, and:
– Removes all functions that are never called.
– Simplifies functions with return values that are never used.
– Inlines calls to small functions.
– Reorders function declarations so that the attributes of called functions are known when the caller is optimized.
– Propagates arguments into function bodies when all calls pass the same value in the same argument position.
– Identifies file-level variable characteristics.
-o2 is the default if -o is specified without an optimization level.
Program-level optimization can be specified by using the -pm option together with the -o3 option. With program-level optimization, all of the source files are compiled into one intermediate file called a module. The module moves through the optimization and code generation passes of the compiler. Because the compiler can see the entire program, it performs several optimizations that are rarely applied during file-level optimization:
• If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
• If a return value of a function is never used, the compiler deletes the return code in the function.
• If a function is not called directly or indirectly, the compiler removes the function.
When program-level optimization is selected in Code Composer Studio, options that have been selected to be file-specific are ignored. Program-level optimization is the highest level of optimization. We use this option to optimize our code.
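The argument-propagation case listed above can be pictured with a small sketch. The function names are hypothetical and not taken from our code, and the "specialized" version is only a hand-written picture of the effect, not actual compiler output.

```c
#include <stdint.h>

/* Hypothetical helper: assume every call in the whole program passes shift == 4. */
static int32_t scale_down(int32_t x, int shift)
{
    return x >> shift;
}

/* Hand-written picture of what the compiler may effectively reduce the helper
   to once it has seen every call site: the constant replaces the argument. */
static int32_t scale_down_specialized(int32_t x)
{
    return x >> 4;
}

/* The only caller in this sketch; it always passes the same shift value. */
int32_t sum_scaled(const int32_t *in, int n)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < n; i++)
        acc += scale_down(in[i], 4);
    return acc;
}
```

With file-level optimization, the compiler cannot assume that a function visible to other files is always called with the same argument, which is why this transformation generally needs the whole-program view.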
5.2.3 Using Intrinsics
The C6000 compiler provides intrinsics, which are special functions that map directly to C64x instructions, to optimize the performance of C code. All instructions that are not easily expressed in C code are supported as intrinsics. Intrinsics are specified with a leading underscore (_) and are accessed by calling them like ordinary functions. A table of TMS320C6000 C/C++ compiler intrinsics can be found in [20]. In this thesis, we do not use any intrinsics in our program; this could be explored in future work.
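As an illustration only (our program does not use intrinsics), the sketch below wraps a saturating 32-bit addition. It assumes that the TI compiler predefines the macro _TMS320C6X and offers the intrinsic _sadd for saturated addition; both names should be checked against the intrinsics table in [20]. On any other compiler, a portable fallback is used instead. The wrapper names are hypothetical.

```c
#include <stdint.h>

/* Portable reference: 32-bit addition that clips instead of wrapping around. */
static int32_t sat_add32_portable(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;   /* widen so the sum cannot overflow */

    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Hypothetical wrapper: uses the intrinsic when built with the TI compiler. */
static int32_t sat_add32(int32_t a, int32_t b)
{
#ifdef _TMS320C6X                 /* assumed: predefined by the TI C6000 compiler */
    return _sadd(a, b);           /* assumed: intrinsic for saturated addition */
#else
    return sat_add32_portable(a, b);
#endif
}
```

Keeping the intrinsic behind a wrapper of this kind would preserve the portability argument made in Section 5.2.1, since the same source still builds on other platforms.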
Chapter 6
Fixed-Point Implementation and Optimization Methods
In this chapter, we discuss some technical issues concerning our fixed-point implementation and some optimization methods to reduce the run time. We first introduce the basic concepts of fixed-point and floating-point arithmetic: what is the difference between them, and what are the advantages and disadvantages of fixed-point calculation? We then propose some techniques for C code implementation and optimization.
6.1 Fixed-Point Concepts
6.1.1 Fixed-Point and Floating-Point
IEEE 754 is a standard for floating-point arithmetic, which describes the formats for floating-point numbers. There are two basic formats, namely, single precision and double precision. The formats are shown in Figure 6.1. The floating-point value is given by
$$\text{floating-point value} = (-1)^{s} \times (1 + \text{fraction}) \times 2^{\text{Exp} - \text{bias}}. \tag{6.1}$$
From this equation, we see that the translation of floating-point numbers is not very direct, which makes the associated arithmetic somewhat complex. But with this format, we can represent a wide range of values.
Figure 6.1: Formats for floating-point numbers.
Figure 6.2: Formats for fixed-point numbers.
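To make Eq. (6.1) concrete, the following sketch splits a single-precision number into its sign, exponent, and fraction fields and rebuilds the value from them. It assumes that float is a 32-bit IEEE 754 single-precision type with bias 127 and a 23-bit fraction, and it only handles normalized numbers (not zero, denormals, infinities, or NaN). The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Exact power of two for the small exponents used here (avoids libm). */
static double pow2_int(int n)
{
    double r = 1.0;

    while (n > 0) { r *= 2.0; n--; }
    while (n < 0) { r *= 0.5; n++; }
    return r;
}

/* Rebuild a normalized single-precision value from its bit fields, Eq. (6.1). */
static double decode_single(float x)
{
    uint32_t bits, s, e, f;
    double fraction;

    memcpy(&bits, &x, sizeof bits);      /* reinterpret the 32 bits of the float */
    s = bits >> 31;                      /* 1 sign bit */
    e = (bits >> 23) & 0xFFu;            /* 8 exponent bits (Exp) */
    f = bits & 0x7FFFFFu;                /* 23 fraction bits */

    fraction = (double)f / 8388608.0;    /* f / 2^23, so 0 <= fraction < 1 */
    return (s ? -1.0 : 1.0) * (1.0 + fraction) * pow2_int((int)e - 127);
}
```

For example, 0.15625 is stored as 1.25 × 2^-3, that is, with s = 0, Exp = 124, and fraction = 0.25.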
The fixed-point formats are shown in Figure 6.2, where the black fields indicate the sign bit. They are simpler and more direct than the floating-point formats, so the hardware implementation can be cheaper. Their main drawback is dynamic range: the range of fixed-point values is usually much more limited. Therefore, if an application requires a large dynamic range of numerical values, one should use floating-point arithmetic; otherwise, one should use fixed-point arithmetic. In our implementation, we choose fixed-point arithmetic.
6.1.2 Fixed-Point Translation and Rounding
In this section, we discuss how to translate floating-point values to fixed point. There are three basic fixed-point data types on the TI DSP chip: short, int, and long. On the TMS320C6416 DSP, the short and int types are 16 and 32 bits long, respectively (both including the sign bit). The long type has 40 bits, but it requires more computation time and memory space, so we do not use it in our implementation. Unless a wider type is necessary, we use the short type, which reduces the load on the DSP chip and increases the speed of the implementation.
To avoid excessive loss of precision (including overflow), we may use suitable scaling and rounding when representing a value in fixed point. For example, to represent the number 0.33 to a precision of $\frac{1}{2} \times 2^{-10}$, we may compute $0.33 \times 2^{10} + 0.5 = 338.42$ and truncate at the decimal point to obtain
$$\mathrm{Fixed}(0.33 \times 2^{10} + 0.5) = 338. \tag{6.2}$$
Note that the addition of 0.5 before truncation is for rounding. Without it, the average error would be larger; in the above example, the result would have been 337. In C, we can use a right shift to accomplish division by an integer power of 2. For example, to calculate 100/16, we can use the expression 100 >> 4. To effect rounding at the last bit position, we can add half of the divisor first, that is, (100 + 2^3) >> 4, written in C as (100 + 8) >> 4.
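The two rounding operations described above can be sketched in C as follows. The function names are hypothetical, and both helpers assume non-negative inputs, since adding 0.5 (or half the divisor) before truncation rounds correctly only in that case.

```c
#include <stdint.h>

/* Convert a non-negative real value to fixed point with frac_bits fractional
   bits, as in Eq. (6.2): scale, add 0.5, then truncate via the integer cast. */
static int16_t to_fixed(double x, int frac_bits)
{
    return (int16_t)(x * (double)(1L << frac_bits) + 0.5);
}

/* Divide a non-negative integer by 2^shift (shift >= 1) with rounding:
   add half of the divisor, then shift right. */
static int32_t shift_round(int32_t v, int shift)
{
    return (v + ((int32_t)1 << (shift - 1))) >> shift;
}
```

Here to_fixed(0.33, 10) gives 338, matching Eq. (6.2). For the shift, 100/16 = 6.25 rounds to 6 either way, whereas 104/16 = 6.5 shows the difference: 104 >> 4 gives 6, while shift_round(104, 4) gives 7.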
6.1.3 Q and S Expressions
There are two common ways to represent fixed-point formats: the Q expression and the S expression. Table 6.1 shows the decimal ranges of these expressions. In this thesis, we use the Q expression. For convenience, we can use macros in the C code to define some expressions, which helps us keep track of the scaling factors of various quantities without a lot of notes.

Table 6.1: Q and S Expressions [21]
Q     S       Decimal Range             Q    S       Decimal Range
Q15   S0.15   -1 ≤ X ≤ 0.9999695        Q7   S8.7    -256 ≤ X ≤ 255.9921875
Q14   S1.14   -2 ≤ X ≤ 1.9999390        Q6   S9.6    -512 ≤ X ≤ 511.984375
Q13   S2.13   -4 ≤ X ≤ 3.9998779        Q5   S10.5   -1024 ≤ X ≤ 1023.96875
Q12   S3.12   -8 ≤ X ≤ 7.9997559        Q4   S11.4   -2048 ≤ X ≤ 2047.9375
Q11   S4.11   -16 ≤ X ≤ 15.9995117      Q3   S12.3   -4096 ≤ X ≤ 4095.875
Q10   S5.10   -32 ≤ X ≤ 31.9990234      Q2   S13.2   -8192 ≤ X ≤ 8191.75
Q9    S6.9    -64 ≤ X ≤ 63.9980469      Q1   S14.1   -16384 ≤ X ≤ 16383.5
Q8    S7.8    -128 ≤ X ≤ 127.9960938    Q0   S15.0   -32768 ≤ X ≤ 32767

At the end of this subsection, we give an example to illustrate fixed-point arithmetic, shown in Figure 6.3. In this example, X and Y are originally represented with two different scaling factors. When we add them together, the one with the lower scaling factor needs to be down-scaled. This avoids overflow of the final result but also sacrifices some precision.
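A minimal sketch of such macros and of an addition between two differently scaled operands is given below. The macro and function names are hypothetical, and the formats (X in Q15, Y in Q13) are assumed for illustration only; they are not taken from Figure 6.3.

```c
#include <stdint.h>

/* Hypothetical macros recording the number of fractional bits of each format. */
#define Q15_BITS 15
#define Q13_BITS 13

/* Convert a non-negative real constant to the given Q format with rounding. */
#define TO_Q(x, q) ((int16_t)((x) * (double)(1L << (q)) + 0.5))

/* Add a Q15 value to a Q13 value and return the sum in Q13.
   X is first aligned to Q13 by a right shift, which discards its two lowest
   fraction bits. For negative X this relies on an arithmetic right shift,
   which is implementation-defined in C. The caller must ensure that the sum
   fits the Q13 range of Table 6.1 (-4 <= sum < 4). */
static int16_t add_q15_q13(int16_t x_q15, int16_t y_q13)
{
    int16_t x_q13 = (int16_t)(x_q15 >> (Q15_BITS - Q13_BITS));

    return (int16_t)(x_q13 + y_q13);
}
```

For example, X = 0.5 is 16384 in Q15 and Y = 1.5 is 12288 in Q13. After alignment, X becomes 4096 in Q13, and the sum is 16384, which is 2.0 in Q13. A sum of 2.0 could not be represented in Q15 at all, which is why the result is kept in the format with the larger range.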