
Overview of the IEEE 802.16m Standard

4.1 The DSP Chip [11]

The TMS320C6416T DSP is a fixed-point DSP in the TMS320C64x series of the TMS320C6000 DSP platform family. It is based on the advanced VelociTI very-long-instruction-word (VLIW) architecture developed by TI. A functional block and DSP core diagram of the TMS320C64x series is shown in Fig. 4.1.

The C64x core CPU consists of 64 general-purpose 32-bit registers and eight functional units. Features of the C6000 devices include the following.

• Eight functional units, including two multipliers and six arithmetic-logic units

– Executes up to eight instructions per cycle

– Allows designers to develop effective RISC-like code for fast development time

• Instruction packing

– Gives code size equivalence for eight instructions executed serially or in parallel

– Reduces code size, program fetches, and power consumption

• Conditional execution of all instructions

– Reduces costly branching

– Increases parallelism for higher sustained performance

• Efficient code execution on independent functional units

– Efficient C compiler on DSP benchmark suite

– Assembly optimizer for fast development and improved parallelization

• 8/16/32-bit data support, providing efficient memory support for a variety of applications

• 40-bit arithmetic options add extra precision for vocoders

• 32 × 32-bit integer multiply with 32- or 64-bit result

• Saturation and normalization provide support for key arithmetic operations

• Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications

• Each multiplier can perform two 16 × 16-bit or four 8 × 8-bit multiplies every clock cycle

• Quad 8-bit and dual 16-bit instruction set extensions with data flow support

Figure 4.1: Functional block and CPU (DSP core) diagram [12].

• Special communication-specific instructions have been added to address common operations in error-correcting codes

• Bit count and rotate hardware extends support for bit-level algorithms
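Two of the features listed above, saturation and dual 16-bit multiplication, can be illustrated in portable C. The following is only a minimal sketch of the arithmetic involved, not TI's instruction semantics or intrinsics; the function names sat_add32, pack2, and dot_dual16 are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Saturating 32-bit addition: the result is clamped to the int32_t range
   instead of wrapping around on overflow. Hypothetical helper name. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen so the sum cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Pack two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Dual 16 x 16 multiply with add: multiplies the two upper halves and the
   two lower halves, then sums both products. A 64-bit result is returned
   so that the corner case (-32768 * -32768) * 2 cannot overflow.
   The uint16_t -> int16_t casts assume the usual two's-complement wrap. */
static int64_t dot_dual16(uint32_t x, uint32_t y)
{
    int16_t xh = (int16_t)(uint16_t)(x >> 16);
    int16_t xl = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t yh = (int16_t)(uint16_t)(y >> 16);
    int16_t yl = (int16_t)(uint16_t)(y & 0xFFFFu);
    return (int64_t)xh * yh + (int64_t)xl * yl;
}
```

On the DSP, each of these operations would map to a single instruction on one functional unit, whereas the portable version above needs several ordinary operations.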

In the following subsections, we introduce three parts of the TMS320C64x DSP: the CPU, the memory, and the peripherals.

4.1.1 Central Processing Unit

The C64x DSP core contains 64 32-bit general-purpose registers, a program fetch unit, an instruction decode unit, two data paths each with four functional units, control registers, control logic, advanced instruction packing, a test unit, emulation logic, and interrupt logic. The program fetch, instruction fetch, and instruction decode units can deliver up to eight 32-bit instructions to the eight functional units every CPU clock cycle. The processing of instructions occurs in each of the two data paths (A and B) shown in Fig. 4.1, each of which contains four functional units and one register file. The four functional units are as follows: a multiplier (.M), a unit for arithmetic and logic operations (.L), a unit for branches, byte shifts, and arithmetic operations (.S), and a unit for linear and circular address calculation for loads from and stores to memory (.D). The details of the functional units are described in Table 4.1.
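The way a fetch packet of eight instructions is divided into groups that execute in parallel can be sketched as follows. This sketch assumes the convention documented in TI's CPU reference guide, not stated in this chapter, that bit 0 of each 32-bit instruction (the p-bit) indicates that the next instruction executes in parallel with it. The function name split_execute_packets is hypothetical, and the sketch ignores that on the C64x an execute packet may span a fetch packet boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define FETCH_PACKET_WORDS 8  /* eight 32-bit instructions per fetch packet */

/* Splits one fetch packet into execute packets using the p-bit (bit 0).
   lengths[k] receives the number of instructions in the k-th execute
   packet; the return value is the number of execute packets found. */
static size_t split_execute_packets(const uint32_t fp[FETCH_PACKET_WORDS],
                                    size_t lengths[FETCH_PACKET_WORDS])
{
    size_t count = 0;  /* execute packets emitted so far */
    size_t run = 0;    /* instructions in the packet being built */

    for (size_t i = 0; i < FETCH_PACKET_WORDS; i++) {
        int parallel_with_next = (fp[i] & 1u) != 0;
        run++;
        /* A cleared p-bit, or the end of the fetch packet, closes the packet. */
        if (!parallel_with_next || i == FETCH_PACKET_WORDS - 1) {
            lengths[count++] = run;
            run = 0;
        }
    }
    return count;
}
```

A fetch packet whose first seven p-bits are set forms a single execute packet that occupies all eight functional units in one cycle, while a fetch packet with all p-bits cleared executes serially over eight cycles.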

Each register file consists of 32 32-bit registers, and each of the four functional units reads and writes directly within its own data path. That is, the functional units .L1, .S1, .M1, and .D1 can only write to register file A. The same condition holds for register file B. However, two cross paths (1X and 2X) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. Cross path 1X allows the functional units of data path A to read a source operand from register file B, and cross path 2X allows the functional units of data path B to read a source operand from register file A. In the C64x, the CPU pipelines data cross-path accesses over multiple clock cycles. This allows the same register to be used as a data cross-path operand by multiple functional units in the same execute packet.
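The register-file rules described above can be expressed as a small check. This is a simplified model for illustration only: the types side_t and instr_t and both function names are hypothetical, and the model assumes each instruction has one destination and two source registers, with at most one source taken through the cross path.

```c
#include <stdbool.h>

typedef enum { SIDE_A, SIDE_B } side_t;

/* Simplified instruction model: the data path of the functional unit
   and the register file of each operand. */
typedef struct {
    side_t unit;  /* data path of the functional unit (.x1 = A, .x2 = B) */
    side_t dst;   /* register file of the destination */
    side_t src1;  /* register file of the first source */
    side_t src2;  /* register file of the second source */
} instr_t;

/* True if any source operand is read from the opposite register file. */
static bool uses_cross_path(const instr_t *in)
{
    return in->src1 != in->unit || in->src2 != in->unit;
}

/* A unit writes only to its own register file and, in this model,
   reads at most one operand through the cross path (1X or 2X). */
static bool operands_legal(const instr_t *in)
{
    int cross = (in->src1 != in->unit) + (in->src2 != in->unit);
    return in->dst == in->unit && cross <= 1;
}
```

For example, an instruction on .L1 that writes to register file A and reads one operand from register file B is accepted and uses cross path 1X, whereas the same instruction writing to register file B is rejected.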

4.1.2 Memory Architecture and Peripherals

The C64x is a two-level cache-based architecture. The level 1 cache is separated into program and data spaces. The level 1 program cache (L1P) is a 128 Kbit (16 Kbyte) direct-mapped cache and the level 1 data cache (L1D) is a 128 Kbit (16 Kbyte) 2-way set-associative cache. The level 2 (L2) memory is a 1 MB memory space, of which up to 256 Kbytes can be configured as cache and the remainder serves as unified mapped memory.
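To make the difference between the direct-mapped L1P and the 2-way set-associative L1D concrete, the following sketch computes the number of sets and the set index of an address. The cache sizes come from the text above; the line sizes used in the usage note (32 bytes for L1P and 64 bytes for L1D) are assumptions that are not stated in this chapter, and the type and function names are hypothetical.

```c
#include <stdint.h>

/* Cache geometry: total size, associativity, and line size. */
typedef struct {
    uint32_t size_bytes;
    uint32_t ways;        /* 1 = direct mapped */
    uint32_t line_bytes;  /* assumed value, see the text above */
} cache_cfg_t;

/* Converts a size in Kbit to bytes (128 Kbit -> 16384 bytes). */
static uint32_t kbit_to_bytes(uint32_t kbit)
{
    return kbit * 1024u / 8u;
}

/* Number of sets = size / (ways * line size). */
static uint32_t cache_sets(cache_cfg_t c)
{
    return c.size_bytes / (c.ways * c.line_bytes);
}

/* Set that a byte address maps to. */
static uint32_t cache_set_index(cache_cfg_t c, uint32_t addr)
{
    return (addr / c.line_bytes) % cache_sets(c);
}

/* Tag stored with the line to tell apart addresses sharing a set. */
static uint32_t cache_tag(cache_cfg_t c, uint32_t addr)
{
    return addr / (c.line_bytes * cache_sets(c));
}
```

Under these assumptions, a 16 Kbyte direct-mapped cache with 32-byte lines has 512 sets, so addresses 0x0000 and 0x4000 map to the same set with different tags and evict each other. A 16 Kbyte 2-way cache with 64-byte lines has 128 sets, and each set can hold two such conflicting lines at once.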

The external memory interface (EMIF) provides interfaces between the DSP core and external memory, with 64-bit-wide (EMIFA) and 16-bit-wide (EMIFB) memory read capability.

Table 4.1: Functional Units and Operations Performed [11]

Functional Unit: Operations Performed

.L unit (.L1, .L2):
– 32/40-bit arithmetic and compare operations
– 32-bit logical operations
– Leftmost 1 or 0 counting for 32 bits
– Normalization count for 32 and 40 bits
– Byte shifts
– Data packing/unpacking
– 5-bit constant generation
– Dual 16-bit and quad 8-bit arithmetic operations
– Dual 16-bit and quad 8-bit min/max operations

.S unit (.S1, .S2):
– 32-bit arithmetic operations
– 32/40-bit shifts and 32-bit bit-field operations
– 32-bit logical operations
– Branches
– Constant generation
– Register transfers to/from control register file (.S2 only)
– Byte shifts
– Data packing/unpacking
– Dual 16-bit and quad 8-bit compare operations
– Dual 16-bit and quad 8-bit saturated arithmetic operations

.M unit (.M1, .M2):
– 16 × 16 multiply operations
– 16 × 32 multiply operations
– Dual 16 × 16 and quad 8 × 8 multiply operations
– Dual 16 × 16 multiply with add/subtract operations
– Quad 8 × 8 multiply with add operations
– Bit expansion
– Bit interleaving/de-interleaving
– Variable shift operations
– Rotation
– Galois field multiply

.D unit (.D1, .D2):
– 32-bit add, subtract, linear and circular address calculation
– Loads and stores with 5-bit constant offset
– Loads and stores with 15-bit constant offset (.D2 only)
– Loads and stores of double words with 5-bit constant
– Loads and stores of non-aligned words and double words
– 5-bit constant generation
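The Galois field multiply listed for the .M unit in Table 4.1 is the kind of operation used in error-correcting codes such as Reed-Solomon codes. The sketch below shows a bitwise multiplication in GF(2^8) in portable C. It is an illustration of the arithmetic only, not the semantics of the C64x instruction; the function name gf256_mul is hypothetical, and the field polynomial 0x11D used in the usage note is an assumed, commonly used choice.

```c
#include <stdint.h>

/* Multiplies a and b in GF(2^8) by shift-and-add (carry-less) multiplication,
   reducing modulo the 9-bit field polynomial poly (for example 0x11D,
   i.e. x^8 + x^4 + x^3 + x^2 + 1, an assumed choice). */
static uint8_t gf256_mul(uint8_t a, uint8_t b, uint16_t poly)
{
    uint16_t acc = 0;  /* running XOR of partial products */
    uint16_t aa = a;   /* a multiplied by successive powers of x */

    while (b != 0) {
        if (b & 1u) {
            acc ^= aa;            /* addition in GF(2^8) is XOR */
        }
        aa = (uint16_t)(aa << 1); /* multiply by x */
        if (aa & 0x100u) {
            aa ^= poly;           /* reduce when the degree reaches 8 */
        }
        b >>= 1;
    }
    return (uint8_t)acc;
}
```

With poly = 0x11D, multiplying 0x02 by 0x80 gives 0x1D, because the product x^8 is reduced by the field polynomial. A loop of this kind costs several operations per bit in software, which is why a dedicated multiply instruction is useful for coding algorithms.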

The C64x contains peripherals such as an enhanced direct-memory-access (EDMA) controller, a host-port interface (HPI), a PCI port, three multichannel buffered serial ports (McBSPs), three 32-bit general-purpose timers, and sixteen general-purpose I/O pins. The EDMA controller, which has 64 independent channels, handles all data transfers between the level-two (L2) cache/memory and the device peripherals. The HPI is a 32-/16-bit-wide parallel port through which a host processor can directly access the CPU's memory space. The PCI port supports connection of the DSP to a PCI host via the integrated PCI master/slave bus interface.
