Complexity Analysis of H.264/AVC

Chapter 2. Conspectus of H.264 Standard

2.10. Complexity Analysis of H.264/AVC


The H.264/AVC standard specifies only the decoder; the encoder design remains open. In this paper, we adopt the official H.264/AVC JM reference software as the decoder for its completeness, and the x264 encoder for its faster encoding speed. The complexity of the important functions is illustrated in Figure 2-19 and Figure 2-20.

Figure 2-19 Distribution of clock cycles over the functions of the encoder.

Figure 2-20 Distribution of clock cycles over the functions of the decoder.

Chapter 3. DSP Implementation Environment

In this chapter, we briefly introduce the DSP platform environment and some optimization methods. We use the DSP module (MEX) made by Vitec Multimedia.

Four TMS320DM642 DSP chips are housed on this board. Our implementation comprises the software system and some peripherals on the board. For the TI DSP, the Code Composer Studio (CCS) and some efficient optimization methods are introduced. In addition, Reference Framework 5 (RF5) and the Network Developer's Kit (NDK), which ease the use of the system and its peripherals, are presented as well.

3.1. Introduction of DSP Platform

The DSP board used in our implementation is the MEX (Multi-Channel Video Platform) shown in Figure 3-1, a powerful platform for video applications. The architecture of the MEX includes four TI DSPs, two FPGAs (one as crossbar, the other as PCI interface), eight video decoders, four audio stereo ADCs, and a 100BaseT Ethernet controller, as shown in Figure 3-2.

The key features of the MEX are listed below:

• Four TMS320DM642 DSPs run at up to 600 MHz (fixed point).

• Each DSP has 32 MB of private memory, implemented as 64-bit SDRAM running at 100 MHz.

• Each DSP has three powerful configurable video ports. By configuring the crossbar (implemented in an FPGA), the video architecture becomes flexible. With proper configuration, the video pathway can distribute one video source to four DSPs, four distinct video sources to four DSPs, four distinct video sources to one DSP, and so on.

• DSP-DSP or DSP-PCI communication is facilitated by the "Inter-DSP communication & PCI interface" FPGA. Each DSP has a dedicated FIFO inside the FPGA that is mapped into its memory. This FIFO can be written by the DSP and its contents sent to the PCI interface or to the other DSPs, which provides PC-DSP and DSP-DSP communication respectively.

Figure 3-1 MEX (Multi-Channel Video Platform) [6]

Figure 3-2 Block diagram of the MEX [6]

As shown in Figure 3-2, the flexible architecture relies on several modules of the TMS320DM642 DSP chip: the I2C bus, used to configure the video (SAA7113) and audio (CS4221) chips; the video ports, used to configure the video acquisition data path; and the EMIF, which defines the addresses at which the DSP sees the FPGA. These DSP modules are introduced further in the following sections.

Figure 3-3 Block diagram of (a) emulator system and (b) application system

In the development phase, a JTAG emulator pod called “USB 560BP” is used to connect the MEX to the PC. With the JTAG emulator, CCS emulation of the DSPs on the board is fully supported. We develop and debug our system in this way. After that, the emulator can be removed from the system to demonstrate the stand-alone capability of the MEX.

The only things the PC then has to do are to supply 3V power and load the DSP program onto the board. Figure 3-3 shows the block diagrams of the system in the emulation phase and in the application phase.

3.2. DSP Chip

In our system, the TMS320DM642 DSP chip is the most important component, and this section describes it in some detail. The TMS320DM642 is a high-performance fixed-point DSP based on the second-generation advanced VelociTI™ very long instruction word (VLIW) architecture (VelociTI.2™) developed by Texas Instruments (TI). The VelociTI.2 extensions in the eight functional units include new instructions that accelerate performance in key applications and extend the parallelism of the VelociTI architecture. This VLIW architecture makes these DSP chips an excellent choice for digital media applications [8].

The DM642 DSP is a video/imaging fixed-point digital signal processor in the TMS320C64x family. It has eight independent functional units running at 600 MHz, for a peak execution rate of 4800 MIPS. Some key features of the DM642 are listed below.

• Eight highly independent functional units: two multipliers that generate 32-bit results and six arithmetic logic units (ALUs).

• The VelociTI.2™ extensions in the eight functional units include new instructions to accelerate the performance in video and imaging manipulations and to extend the parallelism of the VelociTI™ architecture.

• Conditional execution reduces the cost of branches and increases parallelism.

• Instruction packing reduces code size, program fetches, and power consumption.

• 8/16/32/40-bit data support.

• Saturation and normalization provide support for key arithmetic operations.

Figure 3-4 shows the functional block and DSP core diagram of the TMS320C64x.

In the following sections, the three major components of the TMS320C64x DSP are introduced: the central processing unit, the memory, and the peripherals.

Figure 3-4 Block diagram of the TMS320DM642 [9]

3.2.1. Central Processing Unit (CPU)

The DSP core of the C64x series consists of eight independent functional units, 64 general-purpose registers, a program fetch unit, an instruction dispatch unit (with advanced instruction packing), an instruction decode unit, two data paths, a test unit, an emulation unit, interrupt logic, etc. The instruction dispatch and decode units decode up to eight instructions and route them to the eight functional units.

Thus the program fetch, instruction dispatch, and instruction decode units can deliver up to eight 32-bit instructions to the functional units during every CPU clock cycle.
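The figures quoted so far are related by simple arithmetic, which the following minimal sketch makes explicit. It assumes one instruction per functional unit per cycle; the identifier names are illustrative only and do not come from any TI header.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the text; the identifier names are illustrative only. */
enum {
    FUNCTIONAL_UNITS = 8,   /* at most one instruction per unit per cycle */
    INSTRUCTION_BITS = 32,  /* width of one instruction */
    CLOCK_MHZ        = 600  /* maximum CPU clock */
};

/* Peak instruction rate in MIPS: one instruction per unit per cycle. */
static uint32_t peak_mips(uint32_t units, uint32_t clock_mhz)
{
    assert(units > 0 && clock_mhz > 0);
    return units * clock_mhz;
}

/* Width in bits of a fetch that carries one instruction for every unit. */
static uint32_t fetch_width_bits(uint32_t units, uint32_t instruction_bits)
{
    assert(units > 0 && instruction_bits > 0);
    return units * instruction_bits;
}
```

With the values above, peak_mips(FUNCTIONAL_UNITS, CLOCK_MHZ) gives the 4800 MIPS peak rate, and fetch_width_bits(FUNCTIONAL_UNITS, INSTRUCTION_BITS) gives a 256-bit fetch. Both are upper bounds that real code reaches only when all eight units are kept busy.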

The eight functional units in the C64x architecture are divided into two data paths, data path A and data path B, as shown in Figure 3-4. Each data path has four functional units: one for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing and arithmetic operations (.D). Table 3-1 shows these functional units and their operations. Two cross data paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. Most data lines in the CPU support 32-bit operands, while some support long (40-bit) operands. Each functional unit has its own 32-bit write port to a general-purpose register file and 32-bit read ports for source operands src1 and src2 (refer to Figure 3-5). All functional units whose names end in 1 (for example, .L1) write to register file A, while those whose names end in 2 (for example, .M2) write to register file B.

Table 3-1 Functional Units and Operations Performed [10]

.L unit (.L1, .L2)
Fixed-point operations: 32/40-bit arithmetic and compare operations; 32-bit logical operations; leftmost 1 or 0 counting for 32 bits; normalization count for 32 and 40 bits; byte shifts

.S unit (.S1, .S2)
Fixed-point operations: 32/40-bit shifts and 32-bit bit-field operations; 32-bit logical operations; branches; constant generation

.M unit (.M1, .M2)
Fixed-point operations: 16 x 16 multiply operations; 16 x 32 multiply operations; quad 8 x 8 multiply operations; dual 16 x 16 multiply operations; dual 16 x 16 multiply with add/subtract operations; quad 8 x 8 multiply with add operation; bit expansion; bit interleaving/de-interleaving; variable shift operations; rotation; Galois field multiply
Floating-point operations: 32 x 32-bit fixed-point multiply operations; floating-point multiply operations

.D unit (.D1, .D2)
Fixed-point operations: 32-bit add, subtract, linear and circular address calculation; loads and stores with 5-bit constant offset; loads and stores with 15-bit constant offset (.D2 only); load and store double words with 5-bit constant; load and store non-aligned words and double words; 5-bit constant generation; 32-bit logical operations
Floating-point operations: load doubleword with 5-bit constant offset

Figure 3-5 TMS320C64x™ CPU (DSP Core) Data Paths [9]

3.2.2. Memory Architecture

The DM642 uses a two-level cache-based architecture and has a powerful set of peripherals. This memory architecture consists of the following:

• Internal data/program memory

• External memory, with external memory interface (EMIF)

• Enhanced Direct Memory Access (EDMA)

The Level 1 program cache (L1P) is a 128-Kbit direct-mapped cache, and the Level 1 data cache (L1D) is a 128-Kbit 2-way set-associative cache. The Level 2 memory/cache (L2) consists of a 2-Mbit memory space that is shared by program space and data space. The TMS320DM642 internal program memory can be mapped into the CPU address space or operated as a program cache. There is a single port to access internal program memory, with an instruction fetch width of 256 bits.

The internal data memory on C64x devices is divided into eight 32-bit-wide banks. These banks are single-ported, allowing only one access per bank per cycle. This is in contrast to the C621x/C671x devices, which use a single bank of dual-ported memory rather than multiple banks of single-ported memory. More details are given in [11].
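To see why the banked organization matters for data layout, the following minimal sketch maps a byte address to a bank. It assumes that consecutive aligned 32-bit words fall into consecutive banks; this interleaving and the function names are assumptions of the example, and the exact conflict rules should be taken from [11].

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS   8u  /* eight banks, as stated in the text */
#define BANK_BYTES  4u  /* each bank is 32 bits wide */

/* Assumption: consecutive aligned 32-bit words map to consecutive banks.
   The model only covers word-aligned accesses. */
static unsigned bank_of(uint32_t byte_addr)
{
    assert(byte_addr % BANK_BYTES == 0u);
    return (unsigned)((byte_addr / BANK_BYTES) % NUM_BANKS);
}

/* True when two same-cycle accesses would compete for one single-ported bank. */
static int same_bank(uint32_t addr_a, uint32_t addr_b)
{
    return bank_of(addr_a) == bank_of(addr_b);
}
```

Under this model, two words that are 32 bytes apart (for example offsets 0x00 and 0x20) land in the same bank, while neighbouring words do not. Arrays that are accessed in parallel are therefore best placed so that their simultaneous accesses fall into different banks.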

3.2.3. Peripherals

The C64x contains peripherals such as the enhanced direct memory access (EDMA) controller, external memory interface (EMIF), video port peripheral, inter-integrated circuit (I2C) bus module, 10/100 Mb/s Ethernet MAC (EMAC), etc.

3.2.3.1. External Memory Interface (EMIF)

The EMIF supports a glueless interface to a variety of external devices, including:

• Pipelined synchronous-burst SRAM (SBSRAM)

• Synchronous DRAM (SDRAM)

• Asynchronous devices, including SRAM, ROM, and FIFOs

• An external shared-memory device

On the MEX board, the EMIF serves as the interface between the DSP and two SDRAM memories of 1Meg×32bits×4banks each (32 MB in total), a synchronous FIFO for writing and reading data, and various registers accessed through an asynchronous interface. The resulting external memory map is listed in Table 3-2.

Table 3-2 Memory map using EMIF of each DSP on MEX

Start | End | Device | Width
0x80000000 | 0x81FFFFFF | SDRAM | 64 bits
0x90000000 | | Synchronous FIFO | 16 bits
0xB0000000 | 0xB000000E | Asynchronous interface | 16 bits
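The memory map can be captured in a small header-style sketch, which also confirms that the SDRAM window matches the 32 MB stated earlier. The addresses are those of Table 3-2; the macro, type, and function names are hypothetical and are not taken from the MEX software.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses from Table 3-2; all names here are hypothetical. */
#define MEX_SDRAM_START  0x80000000u
#define MEX_SDRAM_END    0x81FFFFFFu
#define MEX_SYNC_FIFO    0x90000000u  /* only a start address is listed */
#define MEX_ASYNC_START  0xB0000000u
#define MEX_ASYNC_END    0xB000000Eu

typedef enum {
    REGION_SDRAM,
    REGION_SYNC_FIFO,
    REGION_ASYNC,
    REGION_UNMAPPED
} mex_region_t;

/* Size of the SDRAM window in bytes (end address is inclusive). */
static uint32_t mex_sdram_size_bytes(void)
{
    assert(MEX_SDRAM_END > MEX_SDRAM_START);
    return MEX_SDRAM_END - MEX_SDRAM_START + 1u;
}

/* Classify an address against the ranges listed in the table. */
static mex_region_t mex_classify(uint32_t addr)
{
    if (addr >= MEX_SDRAM_START && addr <= MEX_SDRAM_END) return REGION_SDRAM;
    if (addr == MEX_SYNC_FIFO) return REGION_SYNC_FIFO;
    if (addr >= MEX_ASYNC_START && addr <= MEX_ASYNC_END) return REGION_ASYNC;
    return REGION_UNMAPPED;
}
```

Because the table gives no end address for the synchronous FIFO, the sketch treats it as a single location rather than guessing at a range.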

3.2.3.2. The Enhanced Direct Memory Access (EDMA)

The enhanced direct memory access (EDMA) controller handles all data transfers between the Level-two (L2) cache/memory controller and the device peripherals on the C64x DSP. The EDMA controller in the C64x DSP has a different architecture from the previous DMA controller in the C620x/C670x devices. The EDMA includes several enhancements to the DMA, such as 64 channels for the C64x DSP, with programmable priority, and the ability to link and chain data transfers. The EDMA allows movement of data to/from any addressable memory spaces, including internal memory, peripherals, and external memory.
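The idea of linking transfers can be illustrated with a host-side software model, in which each transfer descriptor names the descriptor to be loaded once it completes. This is only a conceptual sketch: the structure below is hypothetical and is not the layout of the DM642 EDMA parameter entries, and memcpy stands in for the hardware transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conceptual transfer descriptor; NOT the DM642 EDMA parameter layout. */
typedef struct transfer {
    const void *src;               /* source address */
    void *dst;                     /* destination address */
    size_t bytes;                  /* number of bytes to move */
    const struct transfer *link;   /* descriptor to load next, or NULL */
} transfer_t;

/* Software stand-in for the controller: perform one transfer, then
   follow its link until the chain of descriptors ends. */
static unsigned run_linked_transfers(const transfer_t *t)
{
    unsigned completed = 0;
    while (t != NULL) {
        assert(t->src != NULL && t->dst != NULL);
        memcpy(t->dst, t->src, t->bytes);
        completed++;
        t = t->link;
    }
    return completed;
}
```

In this model, a frame could be assembled from two separate source buffers by linking two descriptors, without the CPU having to reprogram the controller between them.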

3.2.3.3. Video Port

The DM642 device has three configurable video port peripherals. These video port peripherals provide an interface to common video decoder and encoder devices.

The DM642 video port peripherals support multiple resolutions and video standards.

These three video port peripherals are configurable and can support video capture and/or video display modes. As shown in Figure 3-6, each video port consists of two channels, A and B, with a 5120-byte capture/display buffer that can be split between the two channels. The video port peripheral can operate as a video capture port, a video display port, or a transport stream interface (TSI) capture port. In capture mode, the video port may operate as two 8/10-bit channels of BT.656 or raw video. It may also operate as a single channel of 8/10-bit BT.656, 8/10-bit raw video, 16/20-bit Y/C video, 16/20-bit raw video, or 8-bit TSI. In display mode, the video port may operate as a single channel of 8/10-bit BT.656, 8/10-bit raw video, 16/20-bit Y/C video, or 16/20-bit raw video. It may also operate in a two-channel 8/10-bit raw mode. More details are given in [12].

Figure 3-6 Video Port Block Diagram [12]

3.2.3.4. Inter-Integrated Circuit (I2C)

The inter-integrated circuit (I2C) module provides an interface between the TMS320C6000 DSP and other devices compliant with the Philips Semiconductors Inter-IC bus (I²C bus) specification. On the MEX board, the I2C bus connects the DM642 chip to the video (SAA7113) and audio (CS4221) chips, and is used to initialize these chips and to configure the video/audio data pathway with the format shown in Figure 3-7.

Figure 3-7 I²C bus format

3.2.3.5. Ethernet MAC (EMAC)

The Ethernet media access controller (EMAC) provides an efficient interface between the DM642 DSP core processor and the network. It supports both 10Base-T and 100Base-TX, that is, 10 Mbits/second (Mbps) and 100 Mbps, in either half- or full-duplex mode. The EMAC controls the flow of packet data from the DSP to the physical layer device (PHY). The MDIO module controls PHY configuration and status monitoring. Figure 3-8 is the EMAC Control Module Block Diagram.

Figure 3-8 EMAC Control Module Block Diagram

3.3. Coding Development Environment

In this section, we will briefly introduce the coding environment of our project.

The powerful coding environment tool called Code Composer Studio (CCS) is described. In CCS, DSP programmers can develop, debug, and optimize a project. A DSP programmer needs to be familiar with this environment in order to develop programs efficiently.

3.3.1. Code Composer Studio

Code Composer Studio (CCS) extends the basic code generation tools with a set of debugging and real-time analysis capabilities. It speeds up and enhances the development process for programmers who create and test real-time, embedded signal processing applications. Every phase of the development cycle, including conceptual design, coding and building, debugging, and real-time analysis, is fully supported. Code Composer Studio includes the following components, which work together as shown in Figure 3-9:

• TMS320C6000 code generation tools

• Code Composer Studio Integrated Development Environment (IDE)

• DSP/BIOS plug-ins and API

• RTDX plug-in, host interface, and API

Figure 3-9 Code Composer Studio environment [14]

3.3.1.1. Code Generation Tool and Integrated Development Environment

The foundation of the development environment provided by Code Composer Studio consists of code generation tools, including the C compiler, assembler, assembly optimizer, linker, archiver, library-build utility, etc.

The Code Composer Studio Integrated Development Environment (IDE) is designed to allow the user to edit, build, and debug DSP target programs. In the coding phase, C source code and the corresponding assembly instructions can be shown and edited. In the building phase, different files, including C source files, assembly source files, object files, libraries, linker command files, and include files, can be added to build the application. In the debugging phase, flexible breakpoint setting, access to memory and registers, graphical display of signals, and execution profiling statistics make debugging easier.

3.3.1.2. DSP/BIOS Plug-ins

DSP/BIOS gives DSP developers the ability to develop and analyze embedded real-time software. DSP/BIOS provides a graphical interface for static system setup, real-time scheduling, real-time analysis (RTA), and real-time data exchange (RTDX). By using the DSP/BIOS Configuration Tool, we can initialize data structures and set various parameters of DSP/BIOS objects. The Configuration Tool provides developers with a Windows Explorer-like interface, as shown in Figure 3-10, for using the DSP/BIOS real-time library, the DSP/BIOS API, and the CSL.

Figure 3-10 DSP/BIOS Configuration Tool Interface

For real-time DSP applications such as our system, it is often necessary to perform a number of seemingly unrelated functions at the same time. Such functions are called threads. DSP/BIOS enables applications to be structured as a collection of threads, each of which carries out a modularized function. Multi-threaded programs run on a single processor by allowing higher-priority threads to preempt lower-priority threads, and by allowing various types of interaction among threads, including blocking, communication, and synchronization [15]. The thread types provided by DSP/BIOS are, from highest to lowest priority: hardware interrupts (HWI), software interrupts (SWI), tasks (TSK), and the background thread (IDL). Programs using multiple threads, as opposed to a single centralized polling loop, are easier to design, implement, and maintain.

Since the DSP/BIOS task object (TSK) is the major component of our multi-threaded system, the way tasks work is described below. Tasks have 15 priority levels and four states of execution: running, ready, blocked, and terminated. Tasks are scheduled for execution according to the priority level assigned to them by the application. Only one task can be running at a time, while the other ready tasks wait because of their lower priorities. When a task with higher priority becomes ready, it preempts the currently running task, which returns to the ready state and resumes only after the higher-priority task blocks or terminates.

As shown in Figure 3-11, TSK preempts the running task in favor of the higher-priority ready task. During the course of a program, each task’s mode of execution can change for a number of reasons; Figure 3-11 shows how the execution modes change.

Figure 3-11 TSK module execution flow chart
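The scheduling rule described above can be sketched as a small host-side model. It does not use the DSP/BIOS API; the type and function names are hypothetical. The model assumes that a larger number means a higher priority and that a preempted task returns to the ready state.

```c
#include <assert.h>
#include <stddef.h>

/* The four execution states named in the text. */
typedef enum { TASK_READY, TASK_RUNNING, TASK_BLOCKED, TASK_TERMINATED } task_state_t;

typedef struct {
    const char *name;
    int priority;        /* assumption: larger value = higher priority */
    task_state_t state;
} task_t;

/* Select the highest-priority task that is ready or running, mark it as
   running, and move a preempted task back to ready. Returns the index of
   the running task, or -1 if no task can run. */
static int schedule(task_t *tasks, size_t n)
{
    int best = -1;
    size_t i;

    assert(tasks != NULL);
    for (i = 0; i < n; i++) {
        if (tasks[i].state != TASK_READY && tasks[i].state != TASK_RUNNING)
            continue;
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    }
    for (i = 0; i < n; i++) {
        if (tasks[i].state == TASK_RUNNING && (int)i != best)
            tasks[i].state = TASK_READY;   /* preempted, not blocked */
    }
    if (best >= 0)
        tasks[best].state = TASK_RUNNING;
    return best;
}
```

Calling schedule again whenever a task changes state reproduces the behaviour in the text: a newly ready higher-priority task takes over, and the preempted task resumes once that task blocks or terminates.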

3.3.1.3. Hardware Emulation and Real-Time Data Exchange

TI DSPs provide on-chip emulation support that enables Code Composer Studio to control program execution and monitor real-time program activity. An emulator interface, like the TI XDS510, provides the host side of the JTAG connection.

In addition, real-time data exchange (RTDX) capability is exposed through host and DSP APIs, allowing bi-directional real-time communication between the host and the DSP. It provides real-time, continuous visibility into the way DSP applications operate in the real world. As shown in Figure 3-12, RTDX between the host and the DSP is achieved via the JTAG emulator.

Figure 3-12 Real-time data exchange of DSP emulation [14]

3.3.2. DSP Program Development Flow

Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or UNIX workstation, after which the programmer takes great effort to port the C code to hand-coded DSP assembly language.

However, this is both time-consuming and error-prone. The recommended code development flow instead uses the C6000 code generation tools to aid in optimization, rather than forcing the programmer to code by hand in assembly. This lets the compiler do all the laborious work of instruction parallelizing, pipelining, and register allocation.

The phases of recommended code development flow are illustrated in Figure 3-13.

Figure 3-13 DSP Program Development Flow

In phase one, some compiler-level optimizations can be adopted without any knowledge of the C6000. In the second phase, intrinsics and compiler options are used to improve the code. In the last phase, linear assembly code is written only if the compiler cannot achieve an efficient software pipeline or cannot resolve an unbalanced resource allocation.
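As an illustration of the second phase, the sketch below computes a sum of absolute differences (SAD) four pixels at a time in the style enabled by packed-data intrinsics. The C64x compiler provides intrinsics such as _subabs4 and _dotpu4 for this purpose; the emu_ functions here are portable host-side emulations written for this example, so that the code runs on any C compiler, and they are not TI code or the implementation used in this work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulates a packed absolute difference such as _subabs4:
   |a_i - b_i| for each of the four unsigned bytes. */
static uint32_t emu_subabs4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int i;
    for (i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        r |= (x > y ? x - y : y - x) << (8 * i);
    }
    return r;
}

/* Emulates an unsigned packed dot product such as _dotpu4:
   the sum of the four byte-wise products. */
static uint32_t emu_dotpu4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    int i;
    for (i = 0; i < 4; i++)
        sum += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
    return sum;
}

/* SAD over n pixels, four per iteration; n must be a multiple of 4. */
static uint32_t sad_packed(const uint8_t *cur, const uint8_t *ref, int n)
{
    uint32_t sad = 0;
    int i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        uint32_t c, r;
        memcpy(&c, cur + i, 4);   /* load four pixels as one word */
        memcpy(&r, ref + i, 4);
        /* multiplying by 1 in each byte lane sums the four differences */
        sad += emu_dotpu4(emu_subabs4(c, r), 0x01010101u);
    }
    return sad;
}

/* Plain scalar reference for comparison. */
static uint32_t sad_scalar(const uint8_t *cur, const uint8_t *ref, int n)
{
    uint32_t sad = 0;
    int i;
    for (i = 0; i < n; i++)
        sad += (uint32_t)(cur[i] > ref[i] ? cur[i] - ref[i] : ref[i] - cur[i]);
    return sad;
}
```

On the DSP, each emulation function would be replaced by the corresponding intrinsic, so that every loop iteration handles four pixels with two instructions instead of four separate subtract, absolute-value, and add sequences.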


3.4. Optimization on TI DSP Platform

As shown in Figure 3-13, optimization is adopted to increase the execution performance. In this section, some commonly used optimizations that we adopt are described.

3.4.1. Compiler Level Optimization

Figure 3-14 Process that translates source program into code [16]

Figure 3-14 shows the process that translates a source program into code. The compiler is able to perform various optimizations in this process. High-level optimizations are performed in the optimizer, and low-level, target-specific optimizations occur in the code generator. The optimizer can reduce code size and improve execution time through different compiler options. There are four optimization levels, -o0, -o1, -o2, and -o3, denoting different types and degrees of optimization, namely register-level, local-level, function-level, and file-level optimization respectively.

The register-level optimization (-o0) performs control-flow-graph simplification, allocates variables to registers, performs loop rotation, eliminates unused code, simplifies expressions and statements, and expands calls to functions declared inline.

Besides the optimizations done at -o0, the local-level optimization (-o1) performs local copy/constant propagation, removes unused assignments, and eliminates local common expressions. The function level (-o2) performs all -o1 optimizations, plus software pipelining, loop optimizations, elimination of global common sub-expressions and global unused assignments, and loop unrolling. Finally, the highest level, the file level (-o3), performs all -o2 optimizations and in addition removes never-called functions, simplifies functions with return values that are never used, inlines calls to small functions, reorders function declarations, propagates arguments into function bodies, and identifies file-level variable characteristics. In addition to these, some optimizations are performed regardless of the optimization level and cannot be turned off.
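Loop unrolling, one of the transformations applied at the function level, can be illustrated by writing it out by hand. The sketch below is only an illustration of the transformation, not compiler output; the function names are made up for the example, and the unrolled version assumes a trip count that is a multiple of four.

```c
#include <assert.h>

/* Straightforward loop, as the programmer would write it. */
static int sum_rolled(const short *x, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* The same loop unrolled by four by hand. The four partial sums are
   independent, so they can be accumulated in parallel on separate
   functional units. n must be a multiple of 4. */
static int sum_unrolled4(const short *x, int n)
{
    int i, s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result; the unrolled form simply exposes more independent work per iteration, which is what allows a VLIW core to fill more of its functional units in each cycle.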

3.4.2. Program Level Optimization

Besides the compiler optimization obtained by configuring the optimization level of the compiler, described in the previous section, there are further refinements we can make to speed up the program. Several optimization methods exploit the special architecture of the TI C64x DSP.

First, we can allocate the code sections and the data sections into memories. In the two-level memory architecture mentioned in Section 3.2.2, there are fast memories with small size
