
DSP Implementation Environment

3.1 The TMS320C6416 DSP Chip

The following text is mainly taken from references [16] and [17].

3.1.1 TMS320C6416 Features

The TMS320C64x DSPs are the highest-performance fixed-point DSP generation on the TMS320C6000 DSP platform. The TMS320C64x device is based on the second-generation high-performance, very-long-instruction-word (VLIW) architecture developed by TI. The C6416 device has two high-performance embedded coprocessors, the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP), which can significantly speed up channel-decoding operations on-chip, but we do not make use of these coprocessors in the present work.

The C64x core CPU consists of 64 general-purpose 32-bit registers and eight functional units.

Features of C6000 devices include:

• The eight functional units include two multipliers and six arithmetic units:

– Execute up to eight instructions per cycle.

– Allow designers to develop highly effective RISC-like code for fast development time.

• Instruction packing:

– Gives code size equivalence for eight instructions executed serially or in parallel.

– Reduces code size, program fetches, and power consumption.

• Conditional execution of all instructions:

– Reduces costly branching.

– Increases parallelism for higher sustained performance.

• Efficient code execution on independent functional units:

– Efficient C compiler on DSP benchmark suite.

– Assembly optimizer for fast development and improved parallelization.

• 8/16/32-bit data support, providing efficient memory support for a variety of applications.

• 40-bit arithmetic options add extra precision for applications requiring it.

• Saturation and normalization provide support for key arithmetic operations.

• Field manipulation instructions (extract, set, clear) and bit counting support common operations found in control and data manipulation applications.
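The saturation and 40-bit arithmetic options listed above can be illustrated in portable C. The following is only a sketch under stated assumptions, not TI code: `sat_add32` and `acc40_add` are hypothetical helper names, and a 64-bit integer stands in for a 40-bit register value.

```c
#include <stdint.h>

/* Saturating 32-bit addition (hypothetical helper): the result is clamped
 * to the int32_t range instead of wrapping around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* exact in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* 40-bit accumulation emulated in an int64_t (hypothetical helper).
 * The 8 guard bits above bit 31 allow many 32-bit terms to be summed
 * before overflow. The sum is wrapped to 40 bits and sign-extended
 * from bit 39, as a 40-bit two's-complement value would behave. */
int64_t acc40_add(int64_t acc, int32_t x)
{
    const uint64_t mask40 = 0xFFFFFFFFFFULL;
    uint64_t raw = ((uint64_t)acc + (uint64_t)(int64_t)x) & mask40;
    int64_t value = (int64_t)(raw & 0x7FFFFFFFFFULL);  /* low 39 bits */
    if (raw & 0x8000000000ULL) {
        value -= (int64_t)1 << 39;                     /* apply sign bit */
    }
    return value;
}
```

In this model, adding two large positive 32-bit values with `acc40_add` keeps the exact sum, whereas `sat_add32` clamps it to the largest 32-bit value.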

The C64x additional features include:

• Each multiplier can perform two 16×16-bit or four 8×8-bit multiplies every clock cycle.

• Quad 8-bit and dual 16-bit instruction set extensions with data flow support.

• Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses.

• Special communication-specific instructions have been added to address common operations in error-correcting codes.

• Bit count and rotate hardware extends support for bit-level algorithms.
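One operation that is common in error-correcting codes is multiplication in a Galois field (Table 3.2 lists a Galois field multiply among the .M unit operations). What such an operation computes can be sketched in portable C. The function name `gf256_mul`, the field GF(2^8), and the field polynomial 0x11D are assumptions chosen for illustration; none of them is specified in this text.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) by carry-less shift-and-add, reducing
 * modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D, an assumed field polynomial). */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;   /* running XOR of partial products */
    uint16_t aa = a;    /* a multiplied by successive powers of x */

    while (b != 0) {
        if (b & 1u) {
            acc ^= aa;              /* addition in GF(2^m) is XOR */
        }
        aa = (uint16_t)(aa << 1);   /* multiply by x */
        if (aa & 0x100u) {
            aa ^= 0x11Du;           /* reduce back below degree 8 */
        }
        b >>= 1;
    }
    return (uint8_t)acc;
}
```

A software loop like this takes several operations per bit of the operand, which is why a dedicated multiply instruction is attractive for such codes.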

3.1.2 Central Processing Unit Features [18]

The block diagram of the C6416 DSP is shown in Fig. 3.1. The DSP contains a program fetch unit, an instruction dispatch unit, an instruction decode unit, two data paths with four functional units each, 64 32-bit registers, control registers, control logic, and test, emulation, and interrupt logic.

The TMS320C64x DSP pipeline provides flexibility to simplify programming and improve performance. The pipeline can dispatch eight parallel instructions every cycle. Two factors provide this flexibility: first, control of the pipeline is simplified by eliminating pipeline interlocks; second, increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operations. This provides single-cycle throughput.

Figure 3.1: Block diagram of TMS320C6416 DSP (from [18]).

Figure 3.2: Pipeline phases of TMS320C6416 DSP (from [18]).

The pipeline phases are divided into three stages: fetch, decode, and execute. All instructions in the C62x/C64x instruction set flow through the fetch, decode, and execute stages of the pipeline. The fetch stage of the pipeline has four phases for all instructions, and the decode stage has two phases for all instructions. The execute stage of the pipeline requires a varying number of phases, depending on the type of instruction. The stages of the C62x/C64x pipeline are shown in Fig. 3.2.

Reference [18] contains detailed information regarding the fetch and decode phases. The pipeline operation of the C62x/C64x instructions can be categorized into seven instruction types. Six of these are shown in Table 3.1, which gives a mapping of operations occurring in each execution phase for the different instruction types. The delay slots associated with each instruction type are listed in the bottom row.

The execution of instructions can be defined in terms of delay slots. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction. Results from instructions with delay slots are not available until the end of the last delay slot. For example, a multiply instruction has one delay slot, which means that one CPU cycle elapses before the results of the multiply are available for use by a subsequent instruction. However, results are available from other instructions finishing execution during the same CPU cycle in which the multiply is in a delay slot.
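The delay-slot rule can be made concrete with a small model. The sketch below uses hypothetical names (`instr_type`, `delay_slots`, `first_use_cycle`). Only the single delay slot of the multiply is taken from the text; the values used for the single-cycle and load classes are assumptions and should be checked against Table 3.1 and [18].

```c
#include <assert.h>

/* Hypothetical instruction classes, for illustration only. */
enum instr_type { INSTR_SINGLE_CYCLE, INSTR_MULTIPLY, INSTR_LOAD };

/* Number of delay slots per class. Only the multiply value (1) is stated
 * in the text; the other two values are assumptions. */
int delay_slots(enum instr_type t)
{
    switch (t) {
    case INSTR_SINGLE_CYCLE: return 0;  /* assumed */
    case INSTR_MULTIPLY:     return 1;  /* from the text */
    case INSTR_LOAD:         return 4;  /* assumed */
    }
    assert(0 && "unknown instruction type");
    return -1;
}

/* First cycle in which a later instruction can use the result, given the
 * cycle in which the producing instruction is in E1. The result becomes
 * available at the end of the last delay slot, so the consumer can
 * execute in the following cycle. */
int first_use_cycle(int e1_cycle, enum instr_type t)
{
    return e1_cycle + delay_slots(t) + 1;
}
```

For a multiply whose E1 phase falls in cycle 10, cycle 11 is its delay slot and the product can first be consumed in cycle 12. Because the pipeline has no interlocks, it is the programmer or the compiler that has to respect this distance when scheduling dependent instructions.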

Table 3.1: Execution Stage Length Description for Each Instruction Type (from [18]).

The eight functional units in the C6000 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The functional units are described in Table 3.2.

Besides being able to perform 32-bit operations, the C64x also contains many 8-bit and 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8×8 unsigned multiplies with a single instruction on a .M unit. The ADD4 instruction performs four 8-bit additions with a single instruction on a .L unit.
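The effect of these packed instructions can be emulated in portable C. This is a sketch of the arithmetic only, not of the TI instructions themselves: `add4_emul` and `mpyu4_emul` are hypothetical names, and the lane ordering and the wrap-around (non-saturating) behavior of the byte additions are assumptions made for illustration.

```c
#include <stdint.h>

/* Four independent 8-bit additions on bytes packed in 32-bit words.
 * Carries do not cross byte boundaries; each lane wraps modulo 256
 * (assumed behavior). */
uint32_t add4_emul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return result;
}

/* Four unsigned 8x8 multiplies on packed bytes. Each product needs
 * 16 bits, so the four products are returned packed in a 64-bit value,
 * with lane 0 in bits 15..0 (assumed ordering). */
uint64_t mpyu4_emul(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint64_t x = (a >> (8 * lane)) & 0xFFu;
        uint64_t y = (b >> (8 * lane)) & 0xFFu;
        result |= (x * y) << (16 * lane);
    }
    return result;
}
```

Each loop here performs four extract, operate, and insert steps, which is the work that a single packed instruction replaces on the DSP.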

The data paths in the CPU support 32-bit, long (40-bit), and double-word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (see Fig. 3.3). All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle when performing 32-bit operations.

3.1.3 Cache Memory Architecture Overview [19]

The C64x memory architecture consists of a two-level internal cache-based memory architecture plus external memory. Level 1 cache is split into program (L1P) and data (L1D) caches. The C64x memory architecture is shown in Fig. 3.4. On C64x devices, each L1 cache is 16 kB. All caches and data paths are automatically managed by the cache controller. Level 1 cache is accessed by the CPU without stalls. Level 2 memory is configurable and can be split into L2 SRAM (addressable on-chip memory) and L2 cache for caching external memory locations. On a C6416 DSP, the L2 memory is 1 MB in size, and the external memory on the Quixote baseboard is 32 MB. A more detailed introduction to the cache system can be found in [19].
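How an address maps into a cache of this kind can be sketched as follows. The structure and function names are hypothetical. Only the 16 kB L1 capacity comes from the text; the line size and associativity used in the note after the code are assumed parameters that should be checked against [19].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a set-associative cache. */
struct cache_geometry {
    uint32_t size_bytes;  /* total capacity */
    uint32_t line_bytes;  /* bytes per cache line */
    uint32_t ways;        /* associativity; 1 means direct mapped */
};

/* Number of sets: capacity divided by the bytes held in one set. */
uint32_t cache_num_sets(const struct cache_geometry *g)
{
    assert(g->line_bytes != 0 && g->ways != 0);
    return g->size_bytes / (g->line_bytes * g->ways);
}

/* Set that a byte address maps to: drop the offset within the line,
 * then take the line number modulo the number of sets. */
uint32_t cache_set_index(const struct cache_geometry *g, uint32_t addr)
{
    return (addr / g->line_bytes) % cache_num_sets(g);
}
```

With an assumed geometry of 16 kB capacity, 64-byte lines, and two ways, this model has 128 sets, so addresses that are 8 kB apart map to the same set. Data buffers that are accessed together and are spaced at such a distance compete for the same few lines, which matters when arranging data in memory.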

Table 3.2: Functional Units and Operations Performed (from [18]).

• .L unit (.L1, .L2):

– 32/40-bit arithmetic and compare operations

– 32-bit logical operations

– Leftmost 1 or 0 counting for 32 bits

– Normalization count for 32 and 40 bits

– Byte shifts

• .S unit (.S1, .S2):

– 32-bit arithmetic operations

– 32/40-bit shifts and 32-bit bit-field operations

– 32-bit logical operations

– Branches

– Constant generation

– Register transfers to/from control register file (.S2 only)

– Byte shifts

– Data packing/unpacking

– Dual 16-bit compare operations

– Quad 8-bit compare operations

– Dual 16-bit shift operations

– Dual 16-bit saturated arithmetic operations

– Quad 8-bit saturated arithmetic operations

• .M unit (.M1, .M2):

– 16×16 multiply operations

– 16×32 multiply operations

– Quad 8×8 multiply operations

– Dual 16×16 multiply operations

– Dual 16×16 multiply with add/subtract operations

– Quad 8×8 multiply with add operation

– Bit expansion

– Bit interleaving/de-interleaving

– Variable shift operations and rotation

– Galois field multiply

• .D unit (.D1, .D2):

– 32-bit add, subtract, linear and circular address calculation

– Loads and stores with 5-bit constant offset

– Loads and stores with 15-bit constant offset (.D2 only)

– Load and store double words with 5-bit constant

– Load and store non-aligned words and double words

– 5-bit constant generation

Figure 3.3: TMS320C64x CPU data paths (from [18]).

Figure 3.4: C64x cache memory architecture (from [19]).
