
2.3 Implementation Issues of the FEC Scheme

2.3.2 Convolutional Code

2.3.2.4 Viterbi Decoding of Tail-Biting Convolutional Code

According to [8] and [10], the practical suboptimal tail-biting Viterbi decoder is shown in Fig. 2.20, where the “SDD Viterbi Decoder” block denotes the Viterbi decoder with puncturing mechanism and bit-interleaved SDD. Both parameters are chosen to be 24 to balance computational complexity against error-correction performance, based on the analysis done in [8].

Figure 2.20: Block Diagram of the Suboptimal Tail-Biting Viterbi Decoder.

In order to reduce the computational complexity of the ACS part, we bring in the concept of the butterfly structure from the trellis diagram. The symmetry in the trellis diagram, which forms the butterfly structure, can be used to reduce the number of branch metric calculations. Fig. 2.21 shows the butterfly structure associated with the Viterbi decoder, pairing new states 2i and 2i+1 with previous states i and i+s/2, where s is the total number of possible states. In our case of constraint length K = 7, s equals 64 (2^6). Even though there are four incoming branches, there are only two distinct branch costs.

Path metrics for each new state are calculated as each incoming branch cost plus the previous path cost associated with that branch. The smaller of the two incoming path metrics is selected as the survivor. The butterfly computations consist of two “Add-Compare-Select” (ACS) operations and updating the survivor-path history.

The two ACS operations are:

Sn(2i)   = min{ Sn-1(i) + b, Sn-1(i + s/2) + a },
Sn(2i+1) = min{ Sn-1(i) + a, Sn-1(i + s/2) + b }.

After completing N stages of decoding, one of the M survivor paths is selected for trace-back. Clearly, the number of branch metric calculations is greatly reduced by introducing the butterfly structure.
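The two ACS operations above can be sketched in C as follows. This is an illustrative sketch and not the thesis implementation: the metric layout, the one-bit decision storage, and the function names are assumptions.

```c
#include <stdint.h>

#define NSTATES 64  /* 2^(K-1) states for constraint length K = 7 */

/* One butterfly: updates new-state metrics 2i and 2i+1 from old
 * states i and i + NSTATES/2 using only the two distinct branch
 * costs a and b.  Survivor decisions are stored as one bit each
 * (0 = branch from state i, 1 = branch from state i + NSTATES/2). */
static void butterfly(const int32_t *old_pm, int32_t *new_pm,
                      uint8_t *decision, int i, int32_t a, int32_t b)
{
    int32_t up = old_pm[i];                 /* path cost via state i       */
    int32_t lo = old_pm[i + NSTATES / 2];   /* path cost via state i + s/2 */

    /* New state 2i: branch cost b on the upper branch, a on the lower. */
    if (up + b <= lo + a) { new_pm[2*i] = up + b; decision[2*i] = 0; }
    else                  { new_pm[2*i] = lo + a; decision[2*i] = 1; }

    /* New state 2i+1: the costs are swapped (a upper, b lower). */
    if (up + a <= lo + b) { new_pm[2*i+1] = up + a; decision[2*i+1] = 0; }
    else                  { new_pm[2*i+1] = lo + b; decision[2*i+1] = 1; }
}

/* One trellis stage: 32 butterflies cover all 64 new states, with
 * only 2 branch-cost values per butterfly instead of 4. */
static void acs_stage(const int32_t *old_pm, int32_t *new_pm,
                      uint8_t *decision, const int32_t *cost_a,
                      const int32_t *cost_b)
{
    for (int i = 0; i < NSTATES / 2; i++)
        butterfly(old_pm, new_pm, decision, i, cost_a[i], cost_b[i]);
}
```

The decision bits recorded per state are exactly what the later trace-back step consumes to recover the decoded sequence.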

Figure 2.21: Butterfly Structure Showing Branch Cost Symmetry.

Chapter 3

DSP Implementation Environment

In our IEEE 802.16a OFDMA project, to ease linking up all the subprojects, we choose a digital signal processor (DSP) platform to implement the whole system.

The DSP baseboard we use is Innovative Integration's (II's) Quixote, released in 2003, which houses the Texas Instruments TMS320C6416 DSP chip. In this chapter, in addition to an introduction to the DSP chip and the DSP baseboard, the data communication process between the host PC and the target DSP is also described.

3.1 The TI DSP Chip

The DSP chip we adopt is one of the TMS320C64x series DSPs. According to [11], the TMS320C64x series is a member of the TMS320C6000 (C6x) family. A C6000 device can execute up to eight 32-bit instructions per cycle, and its core CPU consists of 64 general-purpose 32-bit registers (for the C64x only) and eight functional units. The detailed features of the C6000 family devices include:

Advanced VLIW CPU with eight functional units, including two multipliers and six arithmetic units.

8/16/32-bit data support, providing efficient memory support for a variety of applications.

40-bit arithmetic options add extra precision for computationally intensive applications.

Saturation and normalization provide support for key arithmetic operations.

Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications.

The block diagram of the C6000 family is shown in Fig. 3.1. The C6000 devices come with program memory, which, on some devices, can be used as a program cache.

The devices also have varying sizes of data memory. Peripherals such as a direct memory access (DMA) controller, power-down logic, and external memory interface (EMIF) usually come with the CPU, while peripherals such as serial ports and host ports are available only for certain models.

In the following subsections, the TMS320C64x DSP chip is introduced in three major parts: the central processing unit (CPU), memory, and peripherals.

Figure 3.1: The Block Diagram of TMS320C6x DSP Chip.

3.1.1 Central Processing Unit

Besides the eight independent functional units and sixty-four general-purpose registers mentioned before, the C64x CPU also contains a program fetch unit, an instruction dispatch unit (with advanced instruction packing), an instruction decode unit, two data paths (A and B, each with four functional units), a test unit, an emulation unit, interrupt logic, several control registers, and two register files (A and B, corresponding to the two data paths). The architecture is illustrated in more detail in Fig. 3.2 [12]. Compared with the other C6000 family DSP chips, the C64x provides more hardware resources. The additional features available only on the C64x are:

Each multiplier can perform two 16 x 16-bit or four 8 x 8-bit multiplies every clock cycle.

Quad 8-bit and dual 16-bit instruction set extensions with data flow support.

Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.

Special communication-specific instructions have been added to address common operations in error-correcting codes.

Bit count and rotate hardware extends support for bit-level algorithms.

The program fetch unit shown in the figure can fetch eight 32-bit instructions (implying a 256-bit-wide program data bus) every cycle, and the instruction dispatch and decode units can decode and route the eight instructions to the eight functional units. The eight functional units in the C64x architecture are divided into two data paths, A and B, as shown in Fig. 3.2. Each path has one unit for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation, and arithmetic operations (.D). The .S and .L units are for arithmetic,

logical, and branch instructions. All data transfers make use of the .D units. Two cross-paths (1x and 2x) allow functional units on one data path to access a 32-bit operand from the register file on the opposite side, with a maximum of two cross-path source reads per cycle. Each register file contains 32 general-purpose registers, but some of them are reserved for specific addressing or are used for conditional instructions.

Most of the buses in the CPU support 32-bit operands, and some support 40-bit operands. Each functional unit has its own 32-bit write port into a general-purpose register file. All functional units whose names end in 1 (for example, .L1) write to register file A, while all functional units whose names end in 2 (for example, .L2) write to register file B. Four specific units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit writes as well as an extra 8-bit-wide input port for 40-bit reads. Since each unit has its own 32-bit write port, all eight functional units can operate in parallel in every cycle.

Program pipelining is also an important technique for making instructions execute in parallel and hence reducing the overall execution cycles. To make pipelining work properly, we need knowledge of the pipeline stages and the instruction execution phases. Since program pipelining is closely related to the optimization of DSP programs, we leave the detailed discussion to the next chapter.

3.1.2 Memory

Internal Memory

The C64x DSP chip has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices into a single memory space via the external memory interface (EMIF).

Memory Options

Besides the internal memory, the C64x DSP Chip also provides a variety of memory options:

Large on-chip RAM, up to 7M bits.

Program cache.

2-level caches.

32-bit external memory interface supports SDRAM, SBSRAM, SRAM, and other asynchronous memories, for a broad range of external memory requirements and maximum system performance.

3.1.3 Peripherals

In addition to the on-chip memory, the TMS320C64x DSP chips also contain peripherals for supporting off-chip memory options, co-processors, host processors, and serial devices. The peripherals include a direct memory access (DMA) controller, a host-port interface (HPI), the EMIF, timers, and some other units.

The DMA controller transfers data between regions in the memory map without intervention by the CPU. It can move data from internal memory to external memory or from internal peripherals to external devices, and is used for communication with other devices.

The host-port interface (HPI) is a 16-bit-wide parallel port through which a host processor can directly access the CPU's memory space. It is used for communication between the host PC and the target DSP.

The C64x has two 32-bit general-purpose timers that can time events, count events, generate pulses, interrupt the CPU, and send synchronization events to the DMA controller. Each timer has two signaling modes and can be clocked by an internal or an external source.

3.2 The DSP Baseboard

The Quixote DSP baseboard card is shown in Fig. 3.3, and its architecture is shown in Fig. 3.4 [15]. Quixote consists of a 600 MHz TMS320C6416 32-bit fixed-point DSP chip and a Xilinx two- or six-million-gate Virtex-II FPGA on a single board, utilizing both signal processing technologies to provide processing flexibility, efficiency, and high performance. Quixote has 32 MBytes of SDRAM for use by the DSP and 4 or 8 MBytes of zero bus turnaround (ZBT) SBSRAM for use by the FPGA.

Developers could build complicated signal processing systems by integrating these reusable logic designs with their specific application logic.

Figure 3.3: Innovative Integration’s Quixote DSP Baseboard Card.

Figure 3.4: The Architecture of Quixote Baseboard.

3.3 Data Communication Mechanism

Many applications of the Quixote baseboards involve communication with the host CPU in some manner. At a minimum, all applications must be reset and downloaded from the host, even if they are isolated from the host after that. The simplest communication method supported is a mapping of standard C++ I/O to the Uniterminal applet, which allows console-type I/O on the host. This allows simple data input and control and sending text strings to the user. The next level of support is provided by the Packetized Message Interface, which allows more complicated, medium-rate transfer of commands and information between the host and the target; it requires more software support on the host than the standard I/O. For full-rate data transfers, Quixote supports streaming data to the host, giving the maximum ability to move data between the target and the host. On Quixote baseboards, a second type of busmaster communication between target and host is also available: the CPU busmaster interface.

The primary CPU busmaster interface is based on a streaming model, in which data is logically an infinite stream between the source and the destination. This model is more efficient because the signaling between the two parties in the transfer can be kept to a minimum and transfers can be buffered for maximum throughput. In addition, the busmaster streaming interface is fully handshaken, so that no data loss can occur in the process of streaming. For example, if the application cannot process blocks fast enough, the buffers will fill, then the busmaster region will fill, and then busmastering will stop until the application resumes processing. When busmastering stops, the DSP can no longer add data to the PCI interface FIFO.
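The back-pressure behaviour described above, where the producer stalls once the buffers fill so that no data is ever lost, can be illustrated with a minimal generic ring buffer. This is only a sketch of the flow-control idea, not the Quixote busmaster or Caliente API.

```c
#include <stddef.h>
#include <stdbool.h>

#define RING_CAP 4

/* Minimal single-producer ring buffer: when the consumer falls
 * behind and the buffer fills, push is refused (back-pressure),
 * analogous to busmastering stopping until the application
 * resumes processing. */
typedef struct {
    int    buf[RING_CAP];
    size_t head, tail, count;
} ring_t;

static bool ring_push(ring_t *r, int v)
{
    if (r->count == RING_CAP)        /* full: apply back-pressure */
        return false;
    r->buf[r->head] = v;
    r->head = (r->head + 1) % RING_CAP;
    r->count++;
    return true;
}

static bool ring_pop(ring_t *r, int *v)
{
    if (r->count == 0)               /* empty: nothing to consume */
        return false;
    *v = r->buf[r->tail];
    r->tail = (r->tail + 1) % RING_CAP;
    r->count--;
    return true;
}
```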

However, in the application of the FEC encoder and decoder, the data sequence is first divided into RS blocks, and encoding and decoding are then performed block by block. Hence continuous streaming may not be suitable for the FEC application. Alternatively, there is a data-flow paradigm supported for non-continuous data sequences, called block-mode streaming. For very high-rate applications, any processing done on each point may reduce the maximum achievable data rate. Since block mode does no implicit processing on a point-by-point basis, the fastest data rates are achievable in this mode.

The DSP streaming interface is bi-directional. Two streams can run simultaneously: one runs from the analog peripherals through the DSP into the application and is called the “Incoming Stream”; the other runs out to the analog peripherals and is the “Outgoing Stream”. In both cases, the DSP needs to act as a mediator.

Figure 3.5: Block Diagram of DSP Streaming Mode.

DSP Streaming is initiated and started by the Host, using the Caliente component.

On the target, the DSP interface uses a pair of DSP/BIOS device drivers, PciIn (on the Outgoing Stream) and PciOut (on the Incoming Stream), provided in the Pismo peripheral libraries for the DSP. They use burst mode and are capable of copying blocks of data between target SDRAM and host bus-master memory via the PCI interface at instantaneous rates up to 264 MBytes/sec.

In addition to the busmaster streaming interface, the DSP and the host also have a lower bandwidth communications link called packetized message interface for sending commands or side information between the host PC and the target DSP.

Chapter 4

Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

As mentioned in the last chapter, we adopt a Texas Instruments (TI) digital signal processor (DSP) to implement the Forward Error Correction (FEC) scheme of the IEEE 802.16a wireless communication standard. In this chapter, we discuss the main themes of this thesis: the implementation and optimization of the specified FEC scheme on II's newly released Quixote DSP baseboard, which houses a TI TMS320C6416 DSP chip. We first briefly introduce the overall system structure of our FEC implementation and its communication mechanism. Secondly, we introduce some special features of the TI C6000 family DSPs that are helpful for compiler-level optimization. Then we propose some simple yet practically useful techniques for improving the computational speed of the Reed-Solomon (RS) code and the convolutional code (CC), mainly the decoders, on the TI C64x family DSP. Finally, we present the improvement achieved by the RS code and CC code optimizations, using the simulation profiles generated by the TI Code Composer Studio (CCS).

4.1 System Structure of the FEC Implementation

As defined in the IEEE 802.16a standard, the FEC scheme, consisting of an FEC encoder and an FEC decoder, is located between the source encoder/decoder and the channel modulator/demodulator. Because the FEC scheme requires massive computation in the decoding procedure, a dedicated DSP board must be assigned to the FEC alone in order to achieve real-time processing. Consequently, the FEC scheme, the source coding scheme, and the channel modulation scheme each use an individual DSP board, linked through the PCI ports of the personal computer (PC). One thing to note here is the communication mechanism between the DSP board and the host PC. It is supposed to be the data streaming mode described in Chapter 3, but due to a malfunction of the newly released II Quixote DSP baseboard, the streaming mode on the DSP board does not yet work. As a substitute, we use standard I/O (fread and fwrite) to implement the DSP file I/O mechanism on the TMS320C6416 simulator. The drawback of standard I/O is that it cannot handle too much input data (the processor may crash during I/O), and it takes extra cycles to perform the file I/O.
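The standard-I/O fallback just described can be sketched as a block-wise fread/process/fwrite loop. The file names and the process_block() hook below are illustrative assumptions; the actual randomizer/RS/CC processing is not shown.

```c
#include <stdio.h>
#include <stdint.h>

#define BLK 1024  /* process in small blocks so the simulator's
                     file I/O is not asked to move too much at once */

/* Reads the input in fixed-size blocks with fread, applies a
 * caller-supplied processing step to each block, and writes the
 * result with fwrite.  Returns the number of bytes written. */
static size_t copy_blocks(const char *in_path, const char *out_path,
                          void (*process_block)(uint8_t *, size_t))
{
    FILE *in = fopen(in_path, "rb");
    FILE *out = fopen(out_path, "wb");
    uint8_t buf[BLK];
    size_t n, total = 0;

    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return 0;
    }
    while ((n = fread(buf, 1, BLK, in)) > 0) {
        process_block(buf, n);            /* e.g. one FEC stage */
        total += fwrite(buf, 1, n, out);
    }
    fclose(in);
    fclose(out);
    return total;
}
```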

The system structure is shown in Fig. 4.1 and Fig. 4.2 for the transmitter side and the receiver side, respectively. At the transmitter side, the source-coded data sequence is first multiplexed by an audio/video multiplexer and then transmitted to the randomizer of the FEC encoding scheme through the PCI interface of the host PC. The sequence is then processed by the randomizer, the RS encoder, the CC encoder, and the block interleaver, and the interleaved coded sequence is transmitted to the channel modulator through the PCI interface. At the receiver side, the procedure is the reverse. First, the demodulator transmits the soft-decision demodulated metric sequence to the FEC decoding scheme (again through the PCI interface). After FEC decoding, the decoded sequence is passed to the source decoder through the PCI interface, and the source decoding operation is performed.

Figure 4.1: System Structure of Transmitter Side.

Figure 4.2: System Structure of Receiver Side.

4.2 Compiler Level Optimization Techniques

In this section, the TI C6000 family pipeline structure is first introduced, to show how the processor arranges the pipeline stages and which instructions are more time-consuming and should be avoided where possible. Secondly, the code development flow is described, and we explain how program efficiency can be improved by using the software pipelining technique.

4.2.1 Pipeline Structure of the TI C6000 Family

There are a few features of the TI C6000 family's pipeline structure that provide the advantages of good performance, low cost, and simple programming.

The following are several useful features [11]:

Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operation.

Pipeline control is simplified by eliminating pipeline locks.

The pipeline can dispatch eight parallel instructions in every cycle.

Parallel instructions proceed simultaneously through the same pipeline phase.

The pipeline structure of the C6000 family consists of three basic pipeline stages: the fetch stage (F), the decode stage (D), and the execution stage (E). At the F stage, the CPU first generates an address, fetches the opcode of the specified instruction from memory, and passes it to the program decoder. At the D stage, the program decoder routes the opcode to the specific functional unit determined by the type of instruction (LDW, ADD, SHR, MPY, etc.). Once the instruction reaches the E stage, it is executed by its specified functional unit. Most instructions of the C6000 family fall into the instruction-single-cycle (ISC) category, such as ADD, SHR, AND, OR, and XOR.

However, the results of a few instructions are delayed. For example, the multiply instruction MPY (and its varieties) has a delay of one cycle.

A one-cycle delay means that the execution result is not available until one cycle later (i.e., not available to the next instruction). The results of a load instruction, LDW (and its varieties), are delayed by 4 cycles. Branch instructions reach their target destination 5 cycles later. A store instruction is viewed as an ISC from the CPU's perspective, since no execution phase is required for it, but it actually completes 2 cycles later. Since the maximum delay among all instructions is 5 cycles (6 execution cycles in total), it is natural to split the execution stage (E) into six phases, as shown in Table 4.1.

Table 4.1: Completing Phase of Different Types of Instructions.

    Instruction Category     Completing Phase
    Single-cycle (ISC)       E1
    Multiply (MPY)           E2
    Store (STW)              E3
    Load (LDW)               E5
    Branch (B)               E6

4.2.2 Code Development Flow

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or Unix workstation and then painstakingly porting that C code to hand-coded DSP assembly language. This is both time-consuming and error-prone. The recommended code development flow instead utilizes the C6000 code generation tools to help in optimization, rather than forcing the programmer to code by hand in assembly, letting the compiler do all the laborious work.


There is usually no need to write linear assembly code unless the software-pipelining efficiency is very poor or the resource allocation is very unbalanced.

Figure 4.3: Code Development Flow.

4.2.3 Software Pipelining

Software pipelining is extensively used to exploit instruction-level parallelism (ILP) in loops, and the TI CCS compiler is capable of doing it automatically. More specifically, it is a technique for scheduling the instructions in a loop so that multiple iterations of the loop execute in parallel; empty slots are filled and the functional units are used more efficiently. Overall, it turns a loop into highly optimized loop code and hence accelerates program execution significantly.

For ease of understanding how software pipelining actually works, we give an example to illustrate [16]. A simple for loop and its code after applying software pipelining are shown in Figs. 4.4(a) and 4.4(b). The loop schedule length is reduced from four control steps to one control step in the software-pipelined loop. However, the code size of the software-pipelined loop is three times that of the original code in this example.
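The idea can be sketched in C by manually pipelining a simple element-wise multiply loop: the body is split into load, multiply, and store stages, and consecutive iterations are overlapped. The CCS compiler performs this scheduling automatically; this hand-written version is only an illustration of the prologue/kernel/epilogue structure, not the thesis code.

```c
#include <stdint.h>

/* Software-pipelined element-wise multiply (n >= 2).  Each kernel
 * iteration works on three different loop indices at once:
 * it stores the result of iteration i-2, multiplies the operands
 * of iteration i-1, and loads the operands of iteration i. */
static void vec_mul_pipelined(const int16_t *a, const int16_t *b,
                              int32_t *y, int n)
{
    int16_t la, lb;   /* stage 1: loaded operands */
    int32_t prod;     /* stage 2: multiply result */

    /* Prologue: fill the pipeline (iterations 0 and 1 started). */
    la = a[0]; lb = b[0];
    prod = (int32_t)la * lb;
    la = a[1]; lb = b[1];

    /* Kernel: the three stages run "in parallel" each step. */
    for (int i = 2; i < n; i++) {
        y[i - 2] = prod;              /* store iteration i-2 */
        prod = (int32_t)la * lb;      /* multiply iteration i-1 */
        la = a[i]; lb = b[i];         /* load iteration i */
    }

    /* Epilogue: drain the pipeline. */
    y[n - 2] = prod;
    y[n - 1] = (int32_t)la * lb;
}
```

Note the trade-off mentioned above: the kernel is one control step instead of three, but the explicit prologue and epilogue grow the code size.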
