
3 DSP Implementation Environment

3.2 The DSP Baseboard

The Quixote DSP Baseboard card is shown in Fig. 3.3 and its architecture in Fig. 3.4 [15]. Quixote combines a TMS320C6416 600 MHz 32-bit fixed-point DSP chip and a Xilinx two- or six-million-gate Virtex-II FPGA on a single board, using the two signal processing technologies together to provide processing flexibility, efficiency, and high performance. Quixote has 32 MBytes of SDRAM for use by the DSP and 4 or 8 MBytes of zero-bus-turnaround (ZBT) SBSRAM for use by the FPGA.

Developers could build complicated signal processing systems by integrating these reusable logic designs with their specific application logic.

Figure 3.3: Innovative Integration’s Quixote DSP Baseboard Card.

Figure 3.4: The Architecture of Quixote Baseboard.

3.3 Data Communication Mechanism

Many applications of the Quixote baseboard involve communication with the host CPU in some manner. At a minimum, every application must be reset and downloaded from the host, even if it is isolated from the host afterward. The simplest communication method supported is a mapping of standard C++ I/O to the Uniterminal applet, which allows console-type I/O on the host. This supports simple data input and control and sending text strings to the user. The next level of support is provided by the Packetized Message Interface, which allows more complicated medium-rate transfer of commands and information between the host and the target, at the cost of more host software support than standard I/O. For full-rate data transfers, Quixote supports data streaming to the host, giving the maximum ability to move data between the target and the host. On Quixote baseboards, a second type of busmaster communication between target and host is also available: the CPU Busmaster interface.

The primary CPU busmaster interface is based on a streaming model, in which data is logically an infinite stream between the source and the destination. This model is efficient because signaling between the two parties can be kept to a minimum and transfers can be buffered for maximum throughput. In addition, the busmaster streaming interface is fully handshaked, so no data loss can occur during streaming. For example, if the application cannot process blocks fast enough, the buffers fill, then the busmaster region fills, and then busmastering stops until the application resumes processing. When busmastering stops, the DSP can no longer add data to the PCI interface FIFO.
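The stop-and-resume behavior just described is essentially bounded-buffer back-pressure. The sketch below is our own minimal single-threaded model of it (the Ring type, slot count, and block size are illustrative, not part of the Quixote API): when all slots are full, the producer's put fails until the consumer drains a block, just as busmastering halts until the application resumes processing.

```c
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 4          /* illustrative buffer depth   */
#define BLOCK_BYTES 256       /* illustrative block size     */

typedef struct {
    uint8_t slot[RING_SLOTS][BLOCK_BYTES];
    int head;                 /* next slot to write          */
    int tail;                 /* next slot to read           */
    int count;                /* currently filled slots      */
} Ring;

/* Producer side: returns 0 on success, -1 when the ring is full
 * (the analogue of busmastering stopping until the reader drains). */
int ring_put(Ring *r, const uint8_t *block)
{
    if (r->count == RING_SLOTS)
        return -1;            /* back-pressure: caller retries later */
    memcpy(r->slot[r->head], block, BLOCK_BYTES);
    r->head = (r->head + 1) % RING_SLOTS;
    r->count++;
    return 0;
}

/* Consumer side: returns 0 on success, -1 when the ring is empty. */
int ring_get(Ring *r, uint8_t *block)
{
    if (r->count == 0)
        return -1;
    memcpy(block, r->slot[r->tail], BLOCK_BYTES);
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->count--;
    return 0;
}
```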

However, in the FEC encoder and decoder application, the data sequence is first divided into RS blocks, and the encoding and decoding procedures are then performed block by block. Hence continuous streaming may not be suitable for the FEC application. Alternatively, a data flow paradigm for non-continuous data sequences, called block mode streaming, is supported. For very high-rate applications, any processing done on each point may reduce the maximum achievable data rate. Since block mode does no implicit processing on a point-by-point basis, the fastest data rates are achievable using this mode.
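The division into RS blocks can be sketched as a simple segmentation step. The code below is an illustrative helper of ours, not the thesis code; the block length of 239 data bytes follows the common (255, 239) RS configuration and is an assumption here:

```c
#include <stddef.h>
#include <string.h>
#include <stdint.h>

#define RS_K 239   /* data bytes per RS block; illustrative (255,239) code */

/* Splits `len` input bytes into ceil(len / RS_K) blocks, zero-padding the
 * last one. Returns the number of blocks written into `blocks`. */
size_t segment_into_rs_blocks(const uint8_t *in, size_t len,
                              uint8_t blocks[][RS_K], size_t max_blocks)
{
    size_t n = 0;
    while (len > 0 && n < max_blocks) {
        size_t chunk = len < RS_K ? len : RS_K;
        memset(blocks[n], 0, RS_K);   /* pad short final block with zeros */
        memcpy(blocks[n], in, chunk);
        in += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

Each resulting block can then be handed to the encoder independently, which is why block mode streaming fits this scheme better than a continuous stream.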

The DSP Streaming interface is bi-directional. Two streams can run simultaneously: one runs from the analog peripherals through the DSP into the application and is called the "Incoming Stream"; the other runs out to the analog peripherals and is called the "Outgoing Stream". In both cases, the DSP needs to act as a mediator, since there

Figure 3.5: Block Diagram of DSP Streaming Mode.

DSP Streaming is initiated by the Host, using the Caliente component.

On the target, the DSP interface uses a pair of DSP/BIOS device drivers, PciIn (on the Outgoing Stream) and PciOut (on the Incoming Stream), provided in the Pismo peripheral libraries for the DSP. They use burst mode and are capable of copying blocks of data between target SDRAM and host busmaster memory via the PCI interface at instantaneous rates up to 264 MBytes/sec.

In addition to the busmaster streaming interface, the DSP and the host also have a lower-bandwidth communication link, the packetized message interface, for sending commands or side information between the host PC and the target DSP.

Chapter 4

Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

As mentioned in the last chapter, we adopt a Texas Instruments (TI) digital signal processor (DSP) for implementing the forward error correction (FEC) scheme of the IEEE 802.16a wireless communication standard. In this chapter, we discuss the main themes of this thesis: the implementation and optimization of the specified FEC scheme on the newly released Innovative Integration (II) Quixote DSP baseboard, which houses a TI TMS320C6416 DSP chip. We first briefly introduce the overall system structure of our FEC implementation and its communication mechanism. Second, we introduce some special features of the TI C6000 family DSP that are helpful for compiler-level optimization. Then, we propose some simple yet practically useful techniques for improving the computational speed of the Reed-Solomon (RS) code and the convolutional (CC) code (mainly the decoder part) on the TI C64x family DSP. Finally, we present the improvement resulting from our RS code and CC code optimization efforts by showing the simulation profiles generated by the TI Code Composer Studio (CCS).

4.1 System Structure of the FEC Implementation

As defined in the IEEE 802.16a standard, the FEC scheme, which consists of an FEC encoder and an FEC decoder, is located between the source encoder/decoder and the channel modulator/demodulator. The FEC scheme requires massive computation in the decoding procedure. Thus, to achieve real-time processing, it is necessary to assign a DSP board to the FEC alone. Consequently, the FEC scheme, the source coding scheme, and the channel modulation scheme each use an individual DSP board, linked through the PCI ports of the personal computer (PC). One thing to note here is the communication mechanism between the DSP board and the host PC. It is supposed to be the data streaming mode described in Chapter 3, but due to a malfunction of the newly released II Quixote DSP baseboard, the streaming mode on the DSP board does not yet work. As a substitute, we use standard I/O (fread and fwrite) to implement the DSP file I/O mechanism on the TMS320C6416 simulator. The drawback of using standard I/O is that it cannot handle too much input data, or the processor may crash during I/O, and it takes extra cycles to perform the file I/O.
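The standard-I/O fallback can be pictured as a plain read-process-write loop over files. This is a hedged sketch of ours, not the thesis code; the file names, the block size, and the run_fec_over_files/invert_block helpers are all placeholders:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define IO_BLOCK 1024  /* placeholder block size */

/* Example processing step: bitwise inversion (a stand-in for FEC work). */
void invert_block(uint8_t *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) buf[i] ^= 0xFF;
}

/* Reads the input file block by block, applies `process` to each block,
 * and writes the result; returns total bytes processed or -1 on error. */
long run_fec_over_files(const char *in_path, const char *out_path,
                        void (*process)(uint8_t *buf, size_t n))
{
    FILE *fin = fopen(in_path, "rb");
    FILE *fout = fopen(out_path, "wb");
    uint8_t buf[IO_BLOCK];
    long total = 0;
    size_t n;

    if (!fin || !fout) {
        if (fin) fclose(fin);
        if (fout) fclose(fout);
        return -1;
    }
    while ((n = fread(buf, 1, IO_BLOCK, fin)) > 0) {
        process(buf, n);
        if (fwrite(buf, 1, n, fout) != n) { total = -1; break; }
        total += (long)n;
    }
    fclose(fin);
    fclose(fout);
    return total;
}
```

Because every block goes through fread/fwrite, the extra cycles spent in the C runtime's buffered I/O are paid on each block, which is the overhead noted above.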

The system structure is shown in Fig. 4.1 and Fig. 4.2 for the transmitter side and the receiver side, respectively. At the transmitter side, the source-coded data sequence is first multiplexed by an audio/video multiplexer and then transmitted to the randomizer of the FEC encoding scheme through the PCI interface of the host PC. The sequence is then processed by the randomizer, the RS encoder, the CC encoder, and the block interleaver, and the interleaved coded sequence is transmitted to the channel modulator through the PCI interface. At the receiver side, the procedure is the reverse of that at the transmitter side. First, the demodulator transmits the soft-decision demodulated metric sequence to the FEC decoding scheme (again through the PCI interface). After FEC decoding, the decoded sequence is passed to the source decoder through the PCI interface, and the source decoding operation is performed.

Figure 4.1: System Structure of Transmitter Side.

Figure 4.2: System Structure of Receiver Side.

4.2 Compiler Level Optimization Techniques

In this subsection, we first introduce the TI C6000 family pipeline structure to understand how the processor arranges the pipeline stages and which instructions are more time-consuming and should be avoided if possible. Second, the code development flow is described. Finally, we explain how we can improve program efficiency by using the software pipelining technique.

4.2.1 Pipeline Structure of the TI C6000 Family

Several features of the TI C6000 family's pipeline structure provide the advantages of good performance, low cost, and simple programming.

The following are several useful features [11]:

Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operation.

Pipeline control is simplified by eliminating pipeline locks.

The pipeline can dispatch eight parallel instructions in every cycle.

Parallel instructions proceed simultaneously through the same pipeline phase.

The pipeline structure of the C6000 family consists of three basic pipeline stages: the fetch stage (F), the decode stage (D), and the execute stage (E). In the F stage, the CPU first generates an address, fetches the opcode of the specified instruction from memory, and passes it to the program decoder. In the D stage, the program decoder routes the opcode to the functional unit determined by the instruction type (LDW, ADD, SHR, MPY, etc.). Once the instruction reaches the E stage, it is executed by its assigned functional unit. Most instructions of the C6000 family fall into the instruction-single-cycle (ISC) category, such as ADD, SHR, AND, OR, and XOR.

However, the results of a few instructions are delayed. For example, the multiply instruction MPY (and its varieties) has a delay of one cycle.

A one-cycle delay means that the execution result is not available until one cycle later (i.e., not available to the next instruction). The results of a load instruction, LDW (and its varieties), are delayed for 4 cycles. Branch instructions reach their target destination 5 cycles later. A store instruction is viewed as an ISC from the CPU's perspective because no execution phase is required, but it actually completes 2 cycles later. Since the maximum delay among all instructions is 5 cycles (6 execution cycles in total), it is natural to split the execution stage (E) into six phases, as shown in Table 4.1.

Table 4.1: Completing Phase of Different Type Instructions.

4.2.2 Code Development Flow

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand-coded DSP assembly language. This is both time-consuming and error-prone. The recommended code development flow instead utilizes the C6000 code generation tools to help with optimization rather than forcing the programmer to code by hand in assembly. This approach lets the compiler do all the laborious work.

Instructions' Category              Completing Phase
Single-cycle (ISC) instructions     E1
Multiply (MPY) instructions         E2
Store (STW) instructions            E3
Load (LDW) instructions             E5
Branch instructions                 E6

With this flow, the programmer needs to write linear assembly code only when the software pipelining efficiency is very poor or the resource allocation is very unbalanced.

Figure 4.3: Code Development Flow.

4.2.3 Software Pipelining

Software pipelining is extensively used to exploit instruction-level parallelism (ILP) in loops, and the TI CCS compiler is capable of applying it. More precisely, it is a technique that schedules the instructions of a loop so that multiple iterations execute in parallel; empty issue slots can thus be filled and the functional units used more efficiently. Overall, it turns a loop into highly optimized loop code and hence accelerates program execution significantly.

To ease the understanding of how software pipelining actually works, we give an example [16]. A simple for loop and its code after applying software pipelining are shown in Fig. 4.4(a) and 4.4(b). The loop schedule length is reduced from four control steps to one control step in the software-pipelined loop. However, the code size of the software-pipelined loop is three times that of the original code in this example. Fig. 4.5(a) and 4.5(b) show the execution records of the original loop and the software-pipelined loop, respectively.

Figure 4.4: (a) The Original Loop. (b) The Loop After Applying Software Pipelining.

(a) Original loop:

for i = 1 to n do
    A[i] = E[i-4] + 9;
    B[i] = A[i] * 5;
    C[i] = A[i] + B[i-2];
    D[i] = A[i] * C[i];
    E[i] = D[i] + 30;
end

(b) After software pipelining:

/* prologue */
A[1] = E[-3] + 9;
A[2] = E[-2] + 9;  B[1] = A[1] * 5;  C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;  B[2] = A[2] * 5;  C[2] = A[2] + B[0];  D[1] = A[1] * C[1];
/* loop kernel */
for i = 1 to n-3 do
    A[i+3] = E[i-1] + 9;
    B[i+2] = A[i+2] * 5;
    C[i+2] = A[i+2] + B[i];
    D[i+1] = A[i+1] * C[i+1];
    E[i]   = D[i] + 30;
end
/* epilogue */
B[n] = A[n] * 5;  C[n] = A[n] + B[n-2];  D[n-1] = A[n-1] * C[n-1];  E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];  E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

In these figures, we can clearly observe that only two (B and C) of the five instructions A, B, C, D, E execute in parallel in the original loop, while all five execute in parallel in the software-pipelined loop; hence the program efficiency is improved significantly. We can also see that the pipelined code falls into three regions: the prologue, the loop kernel (repeating schedule), and the epilogue. The prologue is the "setup" of the loop; running the prologue code is often called "priming" the loop. The length of the prologue depends on the latency between the beginning and the end of the loop code, i.e., on the number of instructions and their latencies. The epilogue refers to the ending instructions, which must complete after the loop kernel; it is similar in spirit to the prologue and is optional: if necessary, it can be rolled into the loop kernel. The prologue and epilogue of a software-pipelined loop occupy a large part of the code size, so there may be a trade-off between speed and memory size that we have to take into account. But since the program memory of the Quixote DSP baseboard is quite large and the original FEC code size is quite small, this is not a serious issue when we adopt software pipelining in our code.

Concerning implementation on the TI C6000 DSP family, C loop performance is greatly influenced by how well the CCS compiler can software-pipeline the loop. The compiler provides feedback to help the programmer fine-tune the loop structure. By understanding this feedback, we can quickly tune our C code to obtain the highest possible performance. The feedback explains exactly what the issues related to pipelining the loop are and what the results are. The compiler goes through three basic stages when compiling a loop [13]:

1. Qualify the loop for software pipelining.

2. Collect loop resource and dependency graph information.

3. Software-pipeline the loop.

In the first stage, the compiler tries to identify the loop counter (called the trip counter because it counts the number of trips through the loop) and any information about it, such as its minimum value (known minimum trip count) and whether it is a multiple of some number (known maximum trip count factor).

If the above information is known about a loop counter, the compiler can be more aggressive with performing packed data processing and loop unrolling optimizations.

For example, if the exact value of a loop counter is not known but it is known that the value is a multiple of some number, the compiler may be able to unroll the loop to improve the performance.
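When such trip-count facts are known to the programmer but not deducible by the compiler, the TI C6000 compiler lets us state them explicitly with the MUST_ITERATE pragma. The sketch below is our own illustration (the function and array names are made up); it promises a trip count that is at least 8 and a multiple of 8, which enables the unrolling described above. Non-TI compilers simply ignore the pragma.

```c
#include <stdint.h>

/* Sums `n` 16-bit samples. The MUST_ITERATE pragma (TI C6000 compiler)
 * promises the trip count is >= 8 and a multiple of 8, letting the
 * compiler unroll and software-pipeline the loop more aggressively.
 * Other compilers ignore the pragma with at most a warning. */
int32_t sum_samples(const int16_t *x, int n)
{
    int32_t acc = 0;
    int i;
#pragma MUST_ITERATE(8, , 8)
    for (i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
```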

There are several conditions that must be met before software pipelining is allowed, or legal, from the compiler’s point of view. These conditions are

The loop cannot contain too many instructions. Loops that are too big typically require more registers than are available, and they require a longer compilation time.

The loop cannot call another function unless the called function is inlined. Any break in control flow makes software pipelining impossible, because multiple iterations execute in parallel.

If any of the conditions for software pipelining is not met, qualification of the pipeline halts and a disqualification message appears. In this situation, software pipelining is not applied to the loop, and the program will run considerably slower.

In the second stage, the compiler collects loop resource and dependency-graph information. It derives the loop carried dependency bound, the unpartitioned resource bound across all resources, and the partitioned resource bound across all resources from our loop code.

In the third stage, the compiler attempts to software-pipeline the loop based on the knowledge collected in the previous two stages. The first thing the compiler attempts during this stage is to schedule the loop at an iteration interval (ii) equal to the minimum value allowed by the three bounds obtained in the second stage (i.e., the largest of the bounds). If the attempt is not successful, the compiler provides additional feedback to help explain why it failed, e.g., a register is live too long or no schedule was found. The compiler then retries with ii = (previous failed ii + 1) until it finds a valid schedule, and the software pipelining is done.
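Each of the three bounds is a lower limit on the achievable ii, so the compiler's search starts at their maximum. The toy helper below (names ours, purely illustrative, not compiler code) states that relation in code:

```c
/* Toy model of the compiler's starting point for modulo scheduling:
 * the initial iteration interval is the largest of the three bounds
 * collected in stage two, since each bound individually must be met. */
static int max2(int a, int b) { return a > b ? a : b; }

int initial_ii(int loop_carried_bound,
               int unpartitioned_resource_bound,
               int partitioned_resource_bound)
{
    return max2(loop_carried_bound,
                max2(unpartitioned_resource_bound,
                     partitioned_resource_bound));
}
```

If scheduling fails at this ii, the search simply continues at ii + 1, ii + 2, and so on.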

4.3 Optimization on Reed-Solomon Code

Following the code development flow described in Section 4.2.2, before actually revising our program code, we first generate a profile using the CCS built-in profiler to learn the exact execution cycle counts. We can then identify which part of the program consumes the most execution time and concentrate on that part to make the whole program faster. In the following subsections, the optimization of our RS code program on the TI DSP platform is discussed separately for the encoder and the decoder.

4.3.1 Optimization on RS Encoder

4.3.1.1 Choose Appropriate Data Types

Table 4.2 shows the original (before optimization) profile of the RS encoder. The first shadowed function, GF_Multiply, is the Galois field multiplier used by the RS_Encode function to encode the data sequence into RS blocks. The last two shadowed functions, Int_to_Vec and Vec_to_Int, are used by the GF_Multiply function to convert the integer containing an 8-bit Galois field element into a vector containing each bit of the integer, and to convert the vector back to the original integer, respectively. The TI Programmer's Guide [13] reminds us that data type sizes differ between the DSP platform and the PC platform. For example, the "int" and "long" data types are both 32 bits on the PC platform, but the "long" data type on the DSP platform is larger, namely 40 bits. If we simply port our C program to the DSP using the CCS compiler without checking the data types, the result may be larger variables and data arrays if the program assumed that "int" and "long" are the same size.

Furthermore, using the "long" data type results in worse execution efficiency, since it requires extra instructions to be generated and limits functional unit selection. A "long" value needs two registers: one 32-bit register holding the lower 32 bits plus 8 more bits in the other register of the pair. If the registers used in the program are all occupied, the register contents must be stored to the stack and loaded back after the "long" data are fully computed. That is, we waste time and memory because of the load/store operations, which are among the most time-consuming instructions.

After examining our program code carefully, we find that replacing the "long" data type with the "int" data type does not affect the correctness of our program, and it results in a significant improvement in execution speed. Fig. 4.6 shows the pseudo assembly code for a variable using the "long" data type and the "int" data type. Obviously, compared to the variable with the "int" data type, the one with the "long" data type needs extra assembly instructions to execute the same C statement.
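A portable way to sidestep such size mismatches, which we note as a general C practice rather than the thesis's actual code, is to use the fixed-width types of <stdint.h>, so a Galois field symbol is always 8 bits and an accumulator is exactly 32 bits on every platform:

```c
#include <stdint.h>

/* GF(2^8) symbols fit in 8 bits, so a fixed-width type documents intent
 * and avoids the 40-bit "long" of the C6000 entirely. */
typedef uint8_t gf_elem;

/* int32_t maps to a single 32-bit register on the C64x, unlike long,
 * which occupies a 40-bit register pair. */
int32_t sum_block(const gf_elem *block, int len)
{
    int32_t s = 0;
    int i;
    for (i = 0; i < len; i++)
        s += block[i];
    return s;
}
```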

Table 4.2: Profile of Original RS Encoder.

Areas           Code Size   Cycles     Percentage (%)   Processing Rate (Kbits/sec)
Main Function   1164        1433434    100              120
RS_Encode       356         1430005    99

The profile of the modified RS encoder is shown in Table 4.3. We find that not only the processing rate but also the code size is improved by this modification.


Table 4.3: Profile of Revised RS Encoder (Data Type Modification).

Figure 4.6: Pseudo Code for Variable Using Long and Int Data Type.

4.3.1.2 Galois Field Multiplication

Table 4.3 shows the profile of our RS encoder program after the data type modification.
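The bit-vector-based GF_Multiply described above is a natural target for optimization. A widely used alternative, sketched here under the assumption of the GF(256) field polynomial x^8 + x^4 + x^3 + x^2 + 1 (the table names and layout are ours, not necessarily the thesis implementation), replaces the bit-by-bit product with log/antilog table lookups:

```c
#include <stdint.h>

#define GF_POLY 0x11D  /* x^8 + x^4 + x^3 + x^2 + 1; assumed field polynomial */

static uint8_t gf_exp[512];  /* antilog table, doubled to avoid a mod-255 */
static uint8_t gf_log[256];

/* Builds the log/antilog tables by repeatedly multiplying by the
 * primitive element x (i.e., 2) and reducing modulo GF_POLY. */
void gf_init(void)
{
    int i, x = 1;
    for (i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= GF_POLY;
    }
    for (i = 255; i < 512; i++)        /* duplicate the table so that  */
        gf_exp[i] = gf_exp[i - 255];   /* log a + log b needs no mod   */
}

/* Two loads and one add replace a bit-by-bit polynomial multiply. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}
```

With the tables precomputed once, each multiplication costs two table reads and an addition, removing the per-bit conversion work done by Int_to_Vec and Vec_to_Int.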
