
Chapter 3 The Proposed Framework

3.4 Inter-frame processing

3.4.3 Inter macroblock encoding

As with I macroblock encoding in intra-frame processing, inter-frame processing also contains a transform coding module. The two encodings differ in two respects. First, the input to P macroblock encoding is the residual between the current frame and the previous frame. Second, the control flow of P macroblock encoding is more complex than that of I macroblock encoding. Inter-frame processing contains both I macroblock encodings and P macroblock encodings, selected according to the result of the mode decision module in motion estimation. To handle both encodings in parallel within the transform coding of an inter frame, our design combines them in one architecture. Consequently, before dispatching a job to the ARM or the DSP, we check the macroblock type so that the proper function module is executed. For the same reason as in I macroblock encoding, we define P macroblock encoding as the sequence FDCT, quantization, dequantization, and IDCT. The following figure and tables show and describe the dual-core inter macroblock encoding architecture.

Fig 27 Inter macroblock encoding scheme

Table 11 Description of dual-core P MB encoding module

Step Description

1 Process the frame in scanline order

2 Control module decides to dispatch the next job to DSP.

3 Transfer MB data and control parameters to DSP.

4 DSP interface decides to do I MB encoding.

5 After the DSP completes I MB encoding, it asserts an interrupt.

6 DSP interface decides to do P MB encoding.

7 After the DSP completes P MB encoding, it asserts an interrupt.

8 MCU executes handler routine.

9 Control module decides next step.

10 Integrate results from DSP.

11 Control module decides to let ARM do I MB encoding.

12 Integrate the computation result from ARM.

13 Control module decides to let ARM do P MB encoding.

14 Integrate the computation result from ARM.

15 The encoding loop repeats, until all jobs have completed
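
The dispatch decision in steps 2 to 14 can be sketched in C as follows. This is only an illustration under stated assumptions, not the implementation described in this work: the type names, the `dispatch_mb` function, and the policy of sending a job to the DSP whenever it is idle are all hypothetical. The text only states that the control module decides which core receives the next job and that the macroblock type selects I or P MB encoding.

```c
/* Hypothetical types for illustration; none of these names come from the thesis. */
typedef enum { MB_INTRA, MB_INTER } MbType;        /* result of the mode decision */
typedef enum { CORE_ARM, CORE_DSP } Core;          /* core that runs the job      */
typedef enum { ENCODE_I_MB, ENCODE_P_MB } EncodeFn;

typedef struct {
    Core     core;
    EncodeFn fn;
} Dispatch;

/* Choose a core and an encoding function for the next macroblock.
 * Assumed policy: use the DSP when it is idle, otherwise let the ARM
 * encode the macroblock itself. */
Dispatch dispatch_mb(MbType type, int dsp_busy)
{
    Dispatch d;
    d.core = dsp_busy ? CORE_ARM : CORE_DSP;
    d.fn   = (type == MB_INTRA) ? ENCODE_I_MB : ENCODE_P_MB;
    return d;
}
```

Under this assumed policy, the A/D ratio reported in the experiments would simply be the count of jobs that fell to each core.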

Table 12 Specification of DSP P MB encoding

Specification of DSP P macroblock encoding

DSP input

1. One source macroblock data

Four Y data blocks (8*8*4*2 bytes)

One U data block (8*8*2 bytes)

One V data block (8*8*2 bytes)

2. QP parameter for DSP quantization

2 bytes

DSP output

1. One Q coefficient macroblock data

Four Y data blocks (8*8*4*2 bytes)

One U data block (8*8*2 bytes)

One V data block (8*8*2 bytes)

2. One reconstructed macroblock data

(transferred according to execution status)

Four Y data blocks (8*8*4*2 bytes)

One U data block (8*8*2 bytes)

One V data block (8*8*2 bytes)

3. P MB encoding execution status

2 bytes
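
The buffer sizes in Table 12 can be written down as C structures to check the byte counts. The sketch below is a host-side illustration only: the structure and field names are hypothetical, the block order (four Y, then U, then V) follows the order of the table, and the sizes assume a host with 8-bit bytes and no structure padding.

```c
#include <stdint.h>

#define MB_BLOCKS   6    /* four Y blocks, one U block, one V block */
#define BLOCK_COEFS 64   /* 8x8 samples per block                   */

/* One macroblock with 16 bits per sample: 6 * 64 * 2 = 768 bytes. */
typedef struct {
    int16_t blocks[MB_BLOCKS][BLOCK_COEFS];
} MbData;

/* DSP input: source macroblock plus the QP parameter (2 bytes). */
typedef struct {
    MbData   src;
    uint16_t qp;
} DspPmbInput;

/* DSP output: Q coefficients, reconstructed macroblock (only meaningful
 * when the status says it was produced), and the execution status. */
typedef struct {
    MbData   qcoef;
    MbData   recon;
    uint16_t status;
} DspPmbOutput;
```

With these definitions the input occupies 770 bytes and the full output 1538 bytes, of which 768 bytes belong to the optional reconstructed macroblock.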

As the execution flow figure above shows, the dual-core architecture of this module is similar to that of dual-core I MB encoding, with an extra P MB encoding module. The main feature of P MB encoding is that it decides whether to perform dequantization and IDCT according to the status of the quantization result. The following figure shows the detailed execution flow of the two transform encoding components in this dual-core module. The control module, in the middle of the figure, decides whether I MB encoding or P MB encoding is performed; the modules on the left and right sides show the detailed execution flow of each. When P MB encoding is performed, the amount of computation result returned by the DSP varies, so it can transfer less result data from the DSP than I MB encoding does.

Fig 28 Execution flow in Inter macroblock encoding scheme
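
The conditional reconstruction in P MB encoding can be sketched as follows. The text does not say which quantization status triggers the skip, so this example assumes that dequantization (and the IDCT that would follow it) is skipped when every quantized coefficient of a block is zero. The function names, the status codes, and the simple uniform quantizer are placeholders for illustration, not the quantizer used in this codec.

```c
#include <stdint.h>

#define BLOCK_COEFS 64

/* Hypothetical status codes for one block. */
enum { P_STATUS_SKIP_RECON = 0, P_STATUS_RECON = 1 };

/* Placeholder uniform quantizer: q = coef / (2 * qp), truncating toward
 * zero. Returns the number of nonzero quantized coefficients. qp > 0. */
static int quantize_block(const int16_t *coef, int16_t *q, int qp)
{
    int nonzero = 0;
    for (int i = 0; i < BLOCK_COEFS; i++) {
        q[i] = (int16_t)(coef[i] / (2 * qp));
        if (q[i] != 0)
            nonzero++;
    }
    return nonzero;
}

/* Placeholder inverse of the quantizer above. */
static void dequantize_block(const int16_t *q, int16_t *coef, int qp)
{
    for (int i = 0; i < BLOCK_COEFS; i++)
        coef[i] = (int16_t)(q[i] * 2 * qp);
}

/* Quantization stage of P block encoding. `coef` holds transform
 * coefficients on entry. When the block quantizes to all zeros, `deq`
 * is left untouched and the caller can skip the IDCT and the transfer
 * of the reconstructed block. */
int p_block_quant_stage(const int16_t *coef, int16_t *q, int16_t *deq, int qp)
{
    if (quantize_block(coef, q, qp) == 0)
        return P_STATUS_SKIP_RECON;
    dequantize_block(q, deq, qp);
    return P_STATUS_RECON;
}
```

Under this assumption a skipped block carries no residual, which is why no reconstructed data needs to be returned for it.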

The DMA module is again used to improve this architecture, and there are two ways to implement it. The first is the same as the DMA module for I macroblock encoding; it may perform redundant transfers but keeps the DMA handler simple. The second is shown in the following figure. The two methods involve a tradeoff. Because P macroblock encoding sometimes does not reconstruct blocks from the Q coefficients, the DMA module can skip those transfers for efficiency. However, making the DMA module detect whether each block needs to be transferred complicates the DMA control module and may reduce DMA performance, since both the DMA handling overhead and the interrupt overhead must be taken into account. Before finalizing the design we therefore compared the implementation results of the two methods and found that the DMA module with detection support performs better.

Fig 29 DMA architecture for dual-core P MB encoding
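
The second method, in which the DMA handler transfers only the blocks that were actually reconstructed, can be illustrated with the sketch below. It assumes, purely for the example, that the execution status is a bit mask with one bit per block; the `DmaSegment` type and the `plan_recon_transfers` function are hypothetical and do not correspond to a real DMA driver interface.

```c
#include <stdint.h>

#define MB_BLOCKS   6             /* four Y blocks, one U, one V    */
#define BLOCK_BYTES (8 * 8 * 2)   /* one 8x8 block of 16-bit values */

/* One contiguous piece of the reconstructed macroblock to copy. */
typedef struct {
    uint16_t offset;   /* byte offset inside the reconstructed macroblock */
    uint16_t length;   /* number of bytes to transfer                     */
} DmaSegment;

/* Build the list of transfers for the blocks whose bit is set in
 * status_mask (assumed encoding: bit b set means block b was
 * reconstructed). Returns the number of segments written to seg,
 * which must have room for MB_BLOCKS entries. */
int plan_recon_transfers(uint16_t status_mask, DmaSegment *seg)
{
    int n = 0;
    for (int b = 0; b < MB_BLOCKS; b++) {
        if (status_mask & (1u << b)) {
            seg[n].offset = (uint16_t)(b * BLOCK_BYTES);
            seg[n].length = BLOCK_BYTES;
            n++;
        }
    }
    return n;
}
```

The first method corresponds to always transferring all six blocks, which removes the loop and the test at the cost of redundant transfers.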

Chapter 4 Implementation using DSP Hardware Extension for Video Coding

The TMS320C55x DSP core was created with an open architecture that allows the addition of application-specific hardware to boost performance on specific algorithms.

The TI C55x IMGLIB is an optimized image/video processing function library for C programmers using TMS320C55x devices. It includes many C-callable, assembly-optimized, general-purpose image/video processing routines. The library is implemented with the TI C55x hardware extension set, so through IMGLIB we can easily exploit the full capability of the TI C55x. These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical. Using them helps us achieve execution speeds considerably faster than equivalent code written in standard ANSI C.

In addition, by providing ready-to-use DSP functions, TI IMGLIB can shorten the development time of our image/video processing application. The TI C55x IMGLIB contains commonly used image/video processing routines and also provides source code, so functions can be modified to match specific needs. It covers many application fields; since our focus here is compression, we describe how to use the motion estimation, interpolation, FDCT, and IDCT algorithms in the TI IMGLIB library, which are implemented with the TI DSP hardware extension set.

4.1 FDCT module and IDCT module

The TI C55x IMGLIB library provides DCT and IDCT algorithms implemented with the TI C55x DSP hardware extension set. We can use these two modules to improve I macroblock encoding and P macroblock encoding by replacing our own FDCT and IDCT modules with IMGLIB's DCT and IDCT modules.

The following two tables show the specifications of the FDCT and IDCT modules.

Table 13 Specification of FDCT with HW extensions

FDCT for an 8x8 Image using built-in hardware extensions

Syntax

void IMG_fdct_8x8(short *fdct_data, short *inter_buffer);

Arguments

Inputs:

fdct_data: Points to a short format array [0...63] containing an 8x8 macroblock row by row. Data format is Q16.0.

inter_buffer: Points to a short format array [0...71] used as a temporary buffer that contains intermediate results in the transform.

Outputs:

fdct_data: Points to a short format array [0...63] containing the results of 2-D DCT for the macroblock. Data format is Q16.0.

Description

The routine IMG_fdct_8x8 implements the Forward Discrete Cosine Transform (FDCT) using built-in hardware extensions for an 8x8 image block. Input terms are expected to be signed Q16.0 values, producing signed Q16.0 results.

Table 14 Specification of IDCT with HW extensions

IDCT for an 8x8 image block using built-in hardware extensions

Syntax

void IMG_idct_8x8(short *idct_data, short *inter_buffer);

Arguments

Inputs:

idct_data: Points to a short format array [0...63] containing an 8x8 macroblock row by row. Data format is Q13.3.

inter_buffer: Points to a short format array [0...71] used as a temporary buffer that contains intermediate results in the transform.

Outputs:

idct_data: Points to a short format array [0...63] containing the results of 2-D IDCT for the input block. Data format is Q16.0.

Description

The routine IMG_idct_8x8 implements the Inverse Discrete Cosine Transform (IDCT) using built-in hardware extensions for an 8x8 image block. Input terms are expected to be signed Q13.3 values, producing signed Q16.0 results.

From these specifications we can see that their inputs and outputs are similar to those of our original design. IMGLIB's FDCT module needs only a source data block and one temporary buffer, so it can replace our FDCT module directly. IMGLIB's IDCT module, however, expects its input in Q13.3 format, so we must convert our Q coefficients from Q16.0 format to Q13.3 format to meet the specification. The following figure shows how this conversion from Q16.0 to Q13.3 is performed by shifting.

Fig 30 Format conversion
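
The conversion from Q16.0 to Q13.3 adds three fractional bits, which corresponds to a left shift by three. A minimal sketch is given below; the function name `q16_to_q13_3` is hypothetical, and the shift is written as a multiplication by 8 because left-shifting a negative signed value is undefined in standard C. The sketch assumes that every input value fits in 13 signed bits (from -4096 to 4095) so that the result fits in 16 bits.

```c
#include <stdint.h>

#define BLOCK_COEFS 64

/* Convert one 8x8 block in place from Q16.0 (plain integers) to Q13.3
 * (three fractional bits). Multiplying by 8 equals a left shift by 3.
 * Inputs are assumed to lie in [-4096, 4095]. */
void q16_to_q13_3(int16_t *block)
{
    for (int i = 0; i < BLOCK_COEFS; i++)
        block[i] = (int16_t)(block[i] * 8);
}
```

For example, the integer coefficient 5 becomes 40 in Q13.3, and -3 becomes -24.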

With this background, the following figure shows how the FDCT and IDCT hardware extension modules are added to our I MB encoding and P MB encoding.

Fig 31 Macroblock encoding with built-in hardware extension support

4.2 Interpolation module

IMGLIB also contains an interpolation module, which implements pixel interpolation for a 16x16 source block located in a reference window using built-in hardware extensions.

This module can therefore replace our original interpolation module to accelerate computation.

Before using this module, one issue must be considered. Our visual encoder stores each pixel in 16 bits, for the reasons concerning ARM and DSP processing mentioned earlier, whereas the IMGLIB interpolation module packs two pixels into each 16-bit word. A format conversion is therefore needed before the module can be used, and this conversion costs performance. On the other hand, if we processed pixel data in 8-bit width throughout, the function modules on the DSP side that have no corresponding hardware extension support would become more complex. There is thus a tradeoff. The FDCT and IDCT modules do not have this issue, because they process source data in 16-bit width. The following table gives the specification of the interpolation module.

Table 15 Specification of Interpolation with HW extensions

Pixel Interpolation for 16x16 Image block using built-in hardware extensions

Syntax

IMG_pix_inter_16x16(short *reference_window, short *pixel_inter_block, int offset, short *align_variable);

Arguments

Inputs:

reference_window: Points to a packed integer format buffer [0...1152] that contains a 48x48 image block row by row. Must be doubleword aligned. Every four pixels are packed into one 32-bit doubleword. Data format Q16.0.

offset: Specifies the top-left corner index of the 18x18 MBE (MBE=16x16 macroblock + extension) in reference_window. Offset is even because of the doubleword alignment.

align_variable: Configures four alignment cases of the MBE in the reference_window.

Outputs:

pixel_inter_block: Points to a packed integer format buffer [0...612] that contains the 36x34 interpolated result. Only the lower 33x33 part that corresponds to the whole 36x34 interpolated zone is usually used. Every four pixels are packed into one 32-bit doubleword.

Description

The routine IMG_pix_inter_16x16 implements pixel interpolation for a 16x16 source block located in reference_window using built-in hardware extensions and it is useful in video compression.

To implement full interpolation for the 16x16 source block, the 18x18 MBE (MBE=16x16 macroblock + extension) is needed.

The full interpolated zone is composed of 36x34 pixels, but only the lower 33x33 part corresponding to the full interpolated zone is usually of interest.

The specification shows that the module supports several alignment modes, from which we can choose the one that best fits our architecture. The input reference buffer reserves a large area for source data, although only a macroblock-sized region is interpolated. The purpose of this design is to let the corresponding IMGLIB modules work together: because the interpolation and motion estimation modules use reference buffers of identical size, the interpolation module can directly support half-pixel refinement in motion estimation. In our own design, however, interpolation is an independent module, so some modifications are needed before IMGLIB's interpolation module can be used.

Fig 32 Interpolation processing with built-in hardware extension support

The figure above shows the execution flow when IMGLIB's interpolation module is used. First, we convert the input source macroblock from one pixel per 16-bit word to two pixels per 16-bit word, and place it at the corresponding position in the input buffer of IMGLIB's interpolation module.

Because the computation behavior of the two modules differs slightly, we shift right by one pixel unit to obtain exact results. Two important control parameters must also be set. The first is the rounding signal, which decides whether rounding is performed during interpolation; it is enabled by default in IMGLIB. The second is the output format, which decides how the output of the interpolation module is arranged.
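
The conversion from one pixel per 16-bit word to two pixels per 16-bit word can be sketched as below. The function name `pack_pixels` is hypothetical, and placing the first pixel of each pair in the upper byte is an assumption of this example; the actual order must follow the IMGLIB specification.

```c
#include <stdint.h>

/* Pack n_pixels 8-bit pixel values, stored one per 16-bit word in src,
 * into n_pixels / 2 words in dst, two pixels per word. n_pixels is
 * assumed to be even. Assumed order: first pixel in the upper byte. */
void pack_pixels(const int16_t *src, uint16_t *dst, int n_pixels)
{
    for (int i = 0; i < n_pixels / 2; i++) {
        uint16_t hi = (uint16_t)(src[2 * i] & 0xFF);
        uint16_t lo = (uint16_t)(src[2 * i + 1] & 0xFF);
        dst[i] = (uint16_t)((hi << 8) | lo);
    }
}
```

A 16x16 macroblock of 256 pixels then occupies 128 packed words instead of 256, which is the cost paid in conversion time for using the hardware extension.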

4.3 Motion estimation module

Motion estimation is the most time-consuming part of video compression algorithms such as MPEG-4 and H.263, so it is no surprise that IMGLIB contains a motion estimation module. The following table shows its specification. Its motion estimation algorithm is the same as ours: the four-step hierarchical search algorithm.

Table 16 Specification of ME in HW extensions

Motion Estimation by 4-step search using built-in hardware extensions

Syntax

IMG_mad_16x16_4step(short *src_data, short *search_window, unsigned int *match);

Arguments

Inputs:

src_data: Points to a packed integer format buffer [0...128] that contains 16x16 source data row by row. Data format is Q16.0. Every two pixels are packed into one 16-bit integer.

search_window: Points to a packed integer format buffer [0...1152] that contains the 48x48 search-window row by row. Data format is Q16.0. Every two pixels are packed into one 16-bit integer.

Outputs:

match [2]: The location of the best match block is packed in match[0]. The upper halfword contains the horizontal pixel position, and the lower halfword contains the vertical pixel position of the best matching 16x16 block in the search window. The minimum absolute difference value at the best match location is packed in match [1].

Description

The routine IMG_mad_16x16_4step implements the motion estimation by 4-step (distance=8, 4, 2, 1) search using built-in hardware extensions. The 4-step search is a popular fast searching technique. Input terms are packed in 16-bit integers and doubleword aligned. Input and output data format is Q16.0.

Before using this motion estimation module in our codec, one point needs attention. Motion vectors are usually calculated relative to the center point of the reference frame, but the built-in hardware extension motion estimation module calculates them relative to the top-left point of the reference frame. Its motion vectors must therefore be compensated to fit our codec. The following figure shows the execution flow of our motion estimation module with the built-in hardware extension module.

Fig 33 Motion estimation with built-in hardware extension support

The figure above shows how IMGLIB's motion estimation module is added to our architecture. Our own motion estimation module is replaced by IMGLIB's, with some adjustment of inputs and outputs, and it thereby gains performance from the built-in hardware extensions.

In fact, we have replaced only part of our motion estimation module with the built-in hardware extension module, because of limited implementation time and because little information about the instruction set of the hardware extensions is available at present. We therefore use only the IMGLIB provided by TI and follow its rules. Since we do not yet know the detailed specifications and algorithms of the remaining motion search modules in IMGLIB, we cannot add them to our motion estimation architecture. We will address this in future work.
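
The compensation of the motion vector can be sketched as follows. The example rests on two assumptions that should be checked against the IMGLIB documentation: that match[0] is a 16-bit word whose upper and lower 8-bit halves hold the horizontal and vertical positions, and that the zero-motion position of the 16x16 block is the center of the 48x48 search window, that is, 16 pixels from its top-left corner in each direction. The names `MotionVector`, `SEARCH_CENTER`, and `mv_from_match` are hypothetical.

```c
#include <stdint.h>

/* (48 - 16) / 2: offset of the co-located block inside the 48x48
 * search window, assuming the block is centered in the window. */
#define SEARCH_CENTER 16

typedef struct {
    int x;   /* horizontal displacement in pixels */
    int y;   /* vertical displacement in pixels   */
} MotionVector;

/* Unpack the best-match position (assumed packing: horizontal in the
 * upper 8 bits, vertical in the lower 8 bits) and convert it from
 * top-left-relative coordinates to a center-relative motion vector. */
MotionVector mv_from_match(uint16_t packed_pos)
{
    MotionVector mv;
    int px = (packed_pos >> 8) & 0xFF;
    int py = packed_pos & 0xFF;
    mv.x = px - SEARCH_CENTER;
    mv.y = py - SEARCH_CENTER;
    return mv;
}
```

Under these assumptions a best match at position (16, 16) yields the zero vector, and a match at (18, 8) yields the vector (2, -8).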

Chapter 5 Experimental results

Some experimental results are shown in this chapter. The QCIF version of the Stefan sequence is used for the experiments. The first 150 frames of the sequence are encoded, and the target bit rate is set to 96 kbps. The test environment is configured similarly to the general test environment often used by TI on OMAP platforms.

The following table shows the main features of the test environment. On the ARM side, the main program is stored in SDRAM, and the SRAM is used as the frame buffer for the LCD controller. On the DSP side, the main program sections are placed in SARAM and the data sections in DARAM. The MPUI is set to shared mode so that the ARM core can access the DSP core's memory.

Table 17 Setup of experiment environment

Experiment environment

ARM core 150 MHz

DSP core 150 MHz

Traffic controller 75 MHz

System DMA No burst, 16-bit width

5.1 Experiment of Intra frame processing

5.1.1 Overall result

In this section, the main goal is to evaluate the I MB encoding module, so all frames are encoded in intra-frame mode. The implementation results and improvements are presented step by step.

Execution with pure ARM core

First we look at the results of intra-frame processing. The following table shows the result of execution on the ARM core alone, which gives the baseline performance of our codec, ported from the PC, on intra-frame processing.

Table 18 Experiment result of pure ARM core

Qcif,150 I frames Execution time (ms) Percentage

Initialization 236 0.735

Coding 4111 12.793

Sequence conversion 1684 5.241

Prediction 2631 8.190

DCT/Q/Q^-1 /IDCT 22297 69.396

Total 30963 100

Encoding frame rate =4.7

Execution with pure DSP core

The following table shows the result of execution on the DSP core alone. This illustrates the computation ability of the DSP core.

Table 19 Experiment result of pure DSP core

Qcif,150 I frames Execution time (ms) Percentage

Initialization 236 0.885

Coding 4123 15.455

Sequence conversion 1683 6.308

Prediction 2637 9.886

DCT/Q/Q^-1 /IDCT 16811 63.011

Total 26680 100

Encoding frame rate =5.6

Execution with pure DSP core, FIQ

The following table shows the result of execution using only the DSP core with interrupt mode (FIQ). This experiment shows that interrupt mode improves the performance of our codec only slightly.

Table 20 Experiment result of pure DSP core, FIQ

Qcif,150 I frames Execution time (ms) Percentage

Initialization 236 0.886

Coding 4123 15.470

Sequence conversion 1683 6.314

Prediction 2637 9.895

DCT/Q/Q^-1 /IDCT 16785 62.974

Total 26654 100

Encoding frame rate = 5.6

Execution with pure DSP core, FIQ, HW extensions

The following table shows the result of execution using only the DSP core with interrupt mode (FIQ), with the built-in hardware extension modules for DCT and IDCT used to improve I MB encoding. This experiment shows the substantial performance gain provided by the hardware extensions.

Table 21 Experiment result of pure DSP core, FIQ, HW extensions

Qcif,150 I frames Execution time (ms) Percentage

Initialization 236 1.158

Coding 4121 20.214

Sequence conversion 1683 8.255

Prediction 2638 12.941

DCT/Q/Q^-1 /IDCT 10519 51.598

Total 20388 100

Encoding frame rate = 7.4
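
The frame rates and gains reported in these tables follow from simple arithmetic on the total execution times. The two helper functions below are hypothetical names introduced only to make that arithmetic explicit.

```c
/* Frames per second from a frame count and a total time in milliseconds. */
double encoding_fps(int frames, double total_ms)
{
    return frames * 1000.0 / total_ms;
}

/* Speedup of a new execution time relative to a baseline (same units). */
double speedup(double baseline_ms, double new_ms)
{
    return baseline_ms / new_ms;
}
```

With the totals of Table 20 and Table 21, `encoding_fps(150, 20388.0)` gives about 7.4 frames per second and `speedup(26654.0, 20388.0)` is about 1.31. For the DCT/Q/Q^-1/IDCT stage alone, `speedup(16785.0, 10519.0)` is about 1.60.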

Execution with dual-core

The following table shows the result of the proposed dual-core architecture with interrupt mode (FIQ). The built-in hardware extension modules for DCT and IDCT are again used to improve I MB encoding. The result shows that efficiency increases when the ARM core shares part of the computation load of the DSP core. The A/D column in the following table gives the ratio of tasks executed on the ARM core to tasks executed on the DSP core.

Table 22 Experiment result of dual-core

Qcif,150 I frames Execution time (ms) A/D Percentage

Initialization 236 1.212

Coding 4133 21.227

Sequence conversion 1684 8.647

Prediction 2652 13.622

DCT/Q/Q^-1 /IDCT 9572 1:6.07 49.159

Total 19471 100

Encoding frame rate = 7.7

Execution with dual-core, DMA
