
DSP Implementation Environment

3.2 The DSP Chip

3.2.1 Central Processing Unit [23]

Besides the eight independent functional units and sixty-four general-purpose 32-bit registers mentioned before, the C64x CPU also consists of the program fetch unit, instruction dispatch unit (with advanced instruction packing), instruction decode unit, two data paths (A and B, each with four functional units), test unit, emulation unit, interrupt logic, several control registers, and two register files (A and B, one for each data path).

The architecture is illustrated in more detail in Fig. 3.3. Compared with the other DSP chips in the C6000 family, the C64x provides more hardware resources.

The block diagram of the C6416 DSP is shown in Fig. 3.2. The DSP contains a program fetch unit, an instruction dispatch unit, an instruction decode unit, two data paths with four functional units each, 64 32-bit registers, control registers, control logic, and logic for test, emulation, and interrupts.

The TMS320C64x DSP pipeline provides flexibility to simplify programming and improve performance. The pipeline can dispatch eight parallel instructions every cycle. Two factors provide this flexibility: control of the pipeline is simplified by eliminating pipeline interlocks, and pipelining is increased to eliminate traditional architectural bottlenecks in program fetch, data access, and multiply operations. This provides single-cycle throughput.

Figure 3.3: The TMS320C64x DSP chip architecture and comparison with earlier TMS320C62x/C67x chip (from [23]).

The pipeline phases are divided into three stages: fetch, decode, and execute. All instructions in the C62x/C64x instruction set flow through the fetch, decode, and execute stages of the pipeline. The fetch stage of the pipeline has four phases for all instructions, and the decode stage has two phases for all instructions. The execute stage of the pipeline requires a varying number of phases, depending on the type of instruction. The stages of the C62x/C64x pipeline are shown in Fig. 3.4.
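The stage structure described above can be sketched as a small model. This is only an illustration: the phase mnemonics (PG, PS, PW, PR, DP, DC) are an assumption taken from TI's C6000 documentation, since the text here gives only the phase counts, and `pipeline_phases` is a hypothetical helper name.

```python
# Phase names are an assumption (TI C6000 documentation mnemonics);
# the text above states only that fetch has 4 phases and decode has 2.
FETCH_PHASES = ["PG", "PS", "PW", "PR"]
DECODE_PHASES = ["DP", "DC"]


def pipeline_phases(execute_phases):
    """Return the ordered list of phases an instruction flows through.

    execute_phases is the number of execute phases (E1, E2, ...) that the
    instruction type needs; it varies with the instruction type.
    """
    if execute_phases < 1:
        raise ValueError("an instruction needs at least one execute phase")
    execute = [f"E{i}" for i in range(1, execute_phases + 1)]
    return FETCH_PHASES + DECODE_PHASES + execute
```

For an instruction that completes in E1, the model yields seven phases in total: four fetch, two decode, and one execute.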

Reference [23] contains detailed information regarding the fetch and decode phases. The pipeline operation of the C62x/C64x instructions can be categorized into seven instruction types. Six of these are shown in Fig. 3.5, which gives a mapping of operations occurring in each execution phase for the different instruction types. The delay slots associated with each instruction type are listed in the bottom row.

Figure 3.4: Pipeline phases of TMS320C6416 DSP (from [23]).

The execution of instructions can be defined in terms of delay slots. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction. Results from instructions with delay slots are not available until the end of the last delay slot. For example, a multiply instruction has one delay slot, which means that one CPU cycle elapses before the results of the multiply are available for use by a subsequent instruction. However, results are available from other instructions finishing execution during the same CPU cycle in which the multiply is in a delay slot.
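The delay-slot rule can be illustrated with a minimal sketch. The multiply value (one delay slot) comes from the text; the single-cycle and load values are assumptions based on TI's C6000 documentation, and `first_use_cycle` is a hypothetical helper name.

```python
# Delay slots per instruction type. Only "multiply" is stated in the text;
# the other two entries are assumptions from TI's C6000 documentation.
DELAY_SLOTS = {
    "single_cycle": 0,
    "multiply": 1,
    "load": 4,
}


def first_use_cycle(e1_cycle, delay_slots):
    """Earliest cycle in which a dependent instruction can read the result.

    The producer is in E1 during e1_cycle, its delay slots occupy the
    following delay_slots cycles, and the result is available after the
    end of the last delay slot.
    """
    if delay_slots < 0:
        raise ValueError("delay_slots must be non-negative")
    return e1_cycle + delay_slots + 1
```

With a multiply in E1 at cycle 10, a dependent instruction can use the product no earlier than cycle 12, whereas the result of a single-cycle instruction issued in the same cycle is usable at cycle 11.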

The program fetch unit shown in Fig. 3.3 can fetch eight 32-bit instructions every cycle (which implies a 256-bit-wide program data bus), and the instruction dispatch and decode units can decode and route these eight instructions to the eight functional units.
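As a rough sketch of how a 256-bit fetch packet maps onto parallel instructions, the code below groups eight 32-bit instruction words into execute packets. It assumes, following TI's C6000 documentation rather than this text, that bit 0 of each instruction word is a p-bit that is set when the instruction executes in parallel with the next one. The function name is hypothetical.

```python
INSTRUCTIONS_PER_FETCH = 8
INSTRUCTION_BITS = 32
FETCH_PACKET_BITS = INSTRUCTIONS_PER_FETCH * INSTRUCTION_BITS  # 256 bits


def split_execute_packets(words):
    """Group 32-bit instruction words into execute packets.

    Assumption: bit 0 (the p-bit) set means "the next instruction runs in
    parallel with this one"; a cleared p-bit closes the current packet.
    """
    packets, current = [], []
    for word in words:
        current.append(word)
        if word & 1 == 0:
            packets.append(current)
            current = []
    if current:  # trailing instructions whose packet continues past the input
        packets.append(current)
    return packets
```

When all eight p-bits except the last are set, the whole fetch packet forms one execute packet and all eight functional units can be busy in the same cycle.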

The eight functional units in the C64x architecture are divided between two data paths, A and B, as shown in Fig. 3.3. Each path has one unit for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation, and arithmetic operations (.D). The .S and .L units are for arithmetic, logical, and branch instructions.

All data transfers make use of the .D units. Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. There can be a maximum of two cross-path source reads per cycle. Each register file contains 32 general-purpose registers, but some of them are reserved for specific addressing or are used for conditional instructions.

Figure 3.5: Execution stage length description for each instruction type (from [23]).

The eight functional units in the C6000 data paths can be divided into two groups of four; each functional unit in one data path is almost identical to the corresponding unit in the other data path. The functional units are described in Table 3.1.

Besides being able to perform 32-bit operations, the C64x also contains many 8-bit and 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8×8 unsigned multiplies with a single instruction on a .M unit. The ADD4 instruction performs four 8-bit additions with a single instruction on a .L unit.
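The packed 8-bit behaviour of these two instructions can be emulated on a host machine as follows. This is an illustrative model, not TI's intrinsic API: the function names are hypothetical, and the assumption (from TI's instruction-set documentation) is that ADD4 wraps within each byte lane without saturation and that MPYU4 produces four 16-bit products.

```python
def add4(a, b):
    """Emulate four independent 8-bit additions on packed 32-bit words.

    Each byte lane wraps modulo 256; no carry crosses into the next lane.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result


def mpyu4(a, b):
    """Emulate four unsigned 8x8 multiplies on packed 32-bit words.

    Returns the four products as a list, lowest byte lane first. Each
    product fits in 16 bits.
    """
    return [((a >> (8 * lane)) & 0xFF) * ((b >> (8 * lane)) & 0xFF)
            for lane in range(4)]
```

For example, `add4(0x01FF7F10, 0x01010101)` gives `0x02008011`: the lane holding 0xFF wraps to 0x00 without disturbing its neighbour.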

The data paths in the CPU support 32-bit operands, long (40-bit) operands, and double-word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (see Fig. 3.6). All units ending in 1 (for example, .L1) write to register file A, and all units ending in 2 write to register file B. Each functional unit has two 32-bit read ports for source operands src1 and src2. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, all eight units can be used in parallel every cycle when performing 32-bit operations.

Table 3.1: Functional Units and Operations Performed (from [23])

.L unit (.L1, .L2)
• 32/40-bit arithmetic and compare operations
• 32-bit logical operations
• Leftmost 1 or 0 counting for 32 bits
• Normalization count for 32 and 40 bits
• Byte shifts

.S unit (.S1, .S2)
• 32-bit arithmetic operations
• 32/40-bit shifts and 32-bit bit-field operations
• 32-bit logical operations
• Branches
• Constant generation
• Register transfers to/from control register file (.S2 only)
• Byte shifts
• Data packing/unpacking
• Dual 16-bit compare operations
• Quad 8-bit compare operations
• Dual 16-bit shift operations
• Dual 16-bit saturated arithmetic operations
• Quad 8-bit saturated arithmetic operations

.M unit (.M1, .M2)
• 16 x 16 multiply operations
• 16 x 32 multiply operations
• Quad 8 x 8 multiply operations
• Dual 16 x 16 multiply operations
• Dual 16 x 16 multiply with add/subtract operations
• Quad 8 x 8 multiply with add operation
• Bit expansion
• Bit interleaving/de-interleaving
• Variable shift operations and rotation
• Galois Field Multiply

.D unit (.D1, .D2)
• 32-bit add, subtract, linear and circular address calculation
• Loads and stores with 5-bit constant offset
• Loads and stores with 15-bit constant offset (.D2 only)
• Load and store double words with 5-bit constant
• Load and store non-aligned words and double words
• 5-bit constant generation

3.2.2 Memory [24]

Internal Memory

The C64x DSP chip has a 32-bit, byte-addressable address space. Internal (on-chip) memory is organized in separate data and program spaces. When off-chip memory is used, these spaces are unified on most devices to a single memory space via the external memory interface (EMIF). The C64x has two 64-bit internal ports to access internal data memory and a single internal port to access internal program memory, with an instruction-fetch width of 256 bits.
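The port widths above imply a peak on-chip transfer rate, which the following sketch works out. The port counts and widths come from the text; the clock frequency is a caller-supplied parameter, and the 600 MHz figure used in the note below is purely an illustrative assumption.

```python
DATA_PORTS = 2            # two internal data-memory ports (from the text)
DATA_PORT_BITS = 64       # each data port is 64 bits wide (from the text)
PROGRAM_FETCH_BITS = 256  # instruction-fetch width (from the text)


def peak_bytes_per_cycle():
    """Peak bytes moved per CPU cycle on the data and program sides."""
    return {
        "data": DATA_PORTS * DATA_PORT_BITS // 8,
        "program": PROGRAM_FETCH_BITS // 8,
    }


def peak_bytes_per_second(clock_hz):
    """Scale the per-cycle peak by a clock rate given in hertz."""
    return {side: per_cycle * clock_hz
            for side, per_cycle in peak_bytes_per_cycle().items()}
```

At an assumed 600 MHz clock, this gives a peak of 9.6 GB/s on the data side and 19.2 GB/s on the program side. These are upper bounds that ignore stalls and bank conflicts.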

Memory Options

The C64x DSP chip also provides a variety of memory options:

• Large on-chip RAM, up to 7M bits.

• Program cache.

• 2-level caches.

• 32-bit external memory interface supporting SDRAM, SBSRAM, SRAM, and other asynchronous memories, for a broad range of external memory requirements and maximum system performance.

Figure 3.6: TMS320C64x CPU data paths (from [23]).

Figure 3.7: C64x cache memory architecture (from [24]).

Cache Memory

The C64x memory architecture consists of a two-level internal cache-based memory plus external memory, as shown in Fig. 3.7. Level 1 cache is split into program (L1P) and data (L1D) caches. On C64x devices, each L1 cache is 16 kB. All caches and data paths are automatically managed by the cache controller. Level 1 cache is accessed by the CPU without stalls. Level 2 cache is configurable and can be split into L2 SRAM (addressable on-chip memory) and L2 cache for caching external memory locations. On a C6416 DSP, the size of L2 cache is 1 MB, and the external memory on the Quixote baseboard is 32 MB. A more detailed introduction to the cache system can be found in [24].
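The configurable L2 split can be sketched as below. The 1 MB total is taken from the text; the set of selectable cache sizes is an assumption based on TI's cache documentation for C64x devices and should be checked against [24], and `split_l2` is a hypothetical helper name.

```python
L2_TOTAL_KB = 1024  # 1 MB of L2 on the C6416, as stated in the text

# Assumption: selectable L2 cache sizes in kB (not given in the text).
ASSUMED_CACHE_MODES_KB = (0, 32, 64, 128, 256)


def split_l2(cache_kb):
    """Return how L2 divides into addressable SRAM and cache, in kB."""
    if cache_kb not in ASSUMED_CACHE_MODES_KB:
        raise ValueError(f"unsupported L2 cache size: {cache_kb} kB")
    return {"l2_sram_kb": L2_TOTAL_KB - cache_kb, "l2_cache_kb": cache_kb}
```

Under these assumptions, choosing a 256 kB cache leaves 768 kB of L2 as addressable SRAM. A larger cache helps when working sets live in the 32 MB external memory, while a larger SRAM helps when data can be placed on-chip explicitly.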
