

Chapter 4 Proposed video codec implementation based on DSP

4.2 H.264/AVC decoder implementation

In this section, we describe how to port the H.264/AVC decoder to the ADSP-BF548 development board using VisualDSP++ (VDSP), the integrated development environment (IDE) for Blackfin DSPs. The official H.264/AVC reference software (JM) was developed for the x86 architecture, which makes it convenient for academic research on a computer. To test or improve H.264/AVC modules on the DSP platform, however, the JM code must be rewritten according to the coding rules of the ADSP-BF548. To reduce the heavy computational load of the H.264/AVC decoder, we first analyze the complexity of the individual modules and then modify and optimize each decoder module and its code. Because we consider only the Baseline Profile (BP), unnecessary program modules such as B-frame support, interlaced coding, and data partitioning are removed so that the H.264/AVC BP decoder can be realized.


H.264/AVC Decoding Procedure

Fig. 4.4 shows the H.264/AVC decoding procedure, which consists of bitstream parsing including network abstraction layer (NAL) units, entropy decoding using CABAC or CAVLC, reordering, inverse quantization (IQ), inverse integer cosine transform (IICT), motion compensation (MC), intra prediction, the deblocking filter (DF), and reconstruction of the video. Table 4.1 summarizes the most time-consuming processes of the decoder, which include IICT, MC, entropy decoding, and reconstruction of the video; the IICT module is found to use the most resources in the decoder. The IICT module is generally applied to a block of 8×8 pixels using Chen’s algorithm [39]. In multimedia applications, the spatial and temporal correlation is usually very high, which means that very few significant (non-zero) coefficients remain after the ICT and quantization. The optimization of the IICT therefore skips the unnecessary multiplications and additions for the zero-valued coefficients. Finally, to further increase the decoding performance, the deblocking filter technique proposed by F. Pescador et al. [23], which uses only internal memory, is implemented in our study.
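The zero-coefficient shortcut can be illustrated with a minimal C sketch. It is not the implementation used in this study: it assumes the 4×4 inverse core transform defined in the H.264/AVC standard rather than the 8×8 arrangement mentioned above, and the names butterfly4 and iict4x4_sparse are hypothetical. When every AC coefficient is zero, all 16 outputs equal the rounded DC value, so the butterflies can be skipped entirely.

```c
#include <assert.h>
#include <string.h>

/* 1-D butterfly of the H.264 4x4 inverse core transform, applied in place
 * to four values spaced `stride` elements apart. */
static void butterfly4(int *d, int stride)
{
    int e0 = d[0] + d[2 * stride];
    int e1 = d[0] - d[2 * stride];
    int e2 = (d[stride] >> 1) - d[3 * stride];
    int e3 = d[stride] + (d[3 * stride] >> 1);

    d[0]          = e0 + e3;
    d[stride]     = e1 + e2;
    d[2 * stride] = e1 - e2;
    d[3 * stride] = e0 - e3;
}

/* Inverse transform of one row-major 4x4 block of dequantized coefficients.
 * Returns 0 for an all-zero block, 1 for a DC-only block (both take the
 * shortcut) and 2 when the full transform was needed.
 * Note: right-shifting negative ints is implementation-defined in C; an
 * arithmetic shift is assumed here, as on most compilers. */
int iict4x4_sparse(const int coeff[16], int out[16])
{
    int i;
    int nonzero_ac = 0;

    assert(coeff != NULL && out != NULL);

    for (i = 1; i < 16; i++)
        nonzero_ac |= coeff[i];

    if (!nonzero_ac) {
        /* Shortcut: a DC-only block produces 16 identical residuals. */
        int dc = (coeff[0] + 32) >> 6;
        for (i = 0; i < 16; i++)
            out[i] = dc;
        return coeff[0] != 0;
    }

    memcpy(out, coeff, 16 * sizeof(int));
    for (i = 0; i < 4; i++)
        butterfly4(out + 4 * i, 1);   /* rows */
    for (i = 0; i < 4; i++)
        butterfly4(out + i, 4);       /* columns */
    for (i = 0; i < 16; i++)
        out[i] = (out[i] + 32) >> 6;  /* final rounding */
    return 2;
}
```

The return code is only there to make the taken path visible; a production decoder would more likely use the coded block pattern or the coefficient count from the entropy decoder instead of scanning the block.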

Memory Usage

The DSP memory used by the decoder for processing data on the ADSP-BF548 is allocated according to the proposed structure. Because the IICT, motion compensation, and deblocking modules consume the most processing resources, we place the code of these modules in the internal L1 instruction memory. The related coefficients of these modules, such as ICT_COEFFS, IQ, and DC/AC_COEFFS, are located in the internal L1 data memory. On the other hand, the reference pictures and the MBs for the deblocking filter need a larger data memory. The L1 data area is too small to hold them, and the external L3 memory is too slow. Therefore, the internal L2 memory is the best area to speed up the decoding procedure. The assigned address ranges for L1 and L2 are 0xff800000 to 0xffb00000 and 0xfeb00000 to 0xfeb20000, respectively.
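As a rough capacity check, the following C sketch estimates how many QCIF reference pictures fit into the L2 range quoted above. It assumes 8-bit YUV 4:2:0 frames, treats the upper L2 address as exclusive, and ignores any other data placed in L2; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* L2 address range as quoted in the text (end treated as exclusive). */
#define L2_START 0xFEB00000UL
#define L2_END   0xFEB20000UL

/* Bytes in one 8-bit YUV 4:2:0 frame: a full-size luma plane plus two
 * chroma planes subsampled by two in each direction. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Number of whole frames that fit into the L2 range. */
size_t frames_fitting_in_l2(size_t width, size_t height)
{
    size_t l2_bytes = (size_t)(L2_END - L2_START);
    return l2_bytes / yuv420_frame_bytes(width, height);
}
```

Under these assumptions a QCIF frame occupies 38016 bytes and the 131072-byte L2 range holds three such frames, which is consistent with using L2 for a small number of reference pictures rather than for the whole display buffer.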

Parallelization

The ADSP-BF548 plays ITU-656 format video on the LCM through the parallel peripheral interface (PPI). Therefore, the YUV data must be converted into ITU-656 format after the video system decodes each frame. Video playback usually adopts the sequential playing model (SPM) shown in Fig. 4.5, in which the video system decodes the bitstream of one frame and then displays that frame, one after the other. Because the computational load of H.264/AVC is very heavy, this leads to frame delay on an embedded video system. To overcome the playback lag, we use a parallel architecture in which moving memory data and decoding proceed concurrently.
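The format conversion step can be sketched in C as follows. The sketch assumes that the decoder outputs planar 8-bit YUV 4:2:0 and covers only the active-video samples, which ITU-656 carries as 4:2:2 data in the order Cb, Y, Cr, Y; the blanking intervals and the SAV/EAV timing codes are omitted. The function name yuv420p_to_uyvy is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack planar YUV 4:2:0 into interleaved 4:2:2 (Cb Y Cr Y).
 * `out` must hold width * height * 2 bytes. Each chroma row is reused for
 * two consecutive luma rows, which is the simplest 4:2:0 -> 4:2:2 upsampling. */
void yuv420p_to_uyvy(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                     size_t width, size_t height, uint8_t *out)
{
    size_t row, col;

    assert(width % 2 == 0 && height % 2 == 0);

    for (row = 0; row < height; row++) {
        const uint8_t *yrow = y + row * width;
        const uint8_t *urow = u + (row / 2) * (width / 2);
        const uint8_t *vrow = v + (row / 2) * (width / 2);

        for (col = 0; col < width; col += 2) {
            *out++ = urow[col / 2];  /* Cb shared by two pixels */
            *out++ = yrow[col];      /* Y of the even pixel */
            *out++ = vrow[col / 2];  /* Cr shared by two pixels */
            *out++ = yrow[col + 1];  /* Y of the odd pixel */
        }
    }
}
```

For a QCIF frame this doubles the chroma data, so each converted frame takes 176 × 144 × 2 bytes before any blanking is added.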

To achieve real-time decoding and playback, we further design two frame buffer groups (FBGs) for parallel processing. Each FBG consists of several frame buffers (FBs), as shown in Fig. 4.6. Each FB stores the data of one frame, and each FBG can store the data of N-1 frames. The proposed two-FBG buffer structure is shown in Fig. 4.7. Buffer group A receives and stores the decoded frame data in YUV format and then converts them into ITU-656 format. Buffer group B displays the converted frame data through the PPI using DMA. The video system needs only one interrupt to exchange the two FBG pointers. In this way, the DSP core carries out the decoding procedure while the DMA handles the PPI transfer in parallel. Fig. 4.7 shows the proposed parallel playing model (PPM) for the ping-pong buffer structure.
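The pointer exchange between the two FBGs can be sketched in C as below. This is a simplified host-side model under stated assumptions, not the code used on the board: the types FrameBufferGroup and PingPong and the three functions are hypothetical, the actual frame storage and the DMA/PPI setup are left out, and pingpong_swap stands for the body of the single interrupt handler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;         /* 0 = group A, 1 = group B */
    size_t filled;  /* number of frames currently stored */
} FrameBufferGroup;

typedef struct {
    FrameBufferGroup groups[2];
    FrameBufferGroup *decode_fbg;   /* filled by the decoder and converter */
    FrameBufferGroup *display_fbg;  /* read by the DMA/PPI display path */
    size_t capacity;                /* frames per group */
} PingPong;

void pingpong_init(PingPong *pp, size_t capacity)
{
    assert(pp != NULL && capacity > 0);
    pp->groups[0].id = 0;
    pp->groups[0].filled = 0;
    pp->groups[1].id = 1;
    pp->groups[1].filled = 0;
    pp->decode_fbg = &pp->groups[0];
    pp->display_fbg = &pp->groups[1];
    pp->capacity = capacity;
}

/* Record one decoded frame; returns 1 once the decode group is full. */
int pingpong_store_frame(PingPong *pp)
{
    if (pp->decode_fbg->filled < pp->capacity)
        pp->decode_fbg->filled++;
    return pp->decode_fbg->filled == pp->capacity;
}

/* Exchange the two group pointers; no frame data is copied. */
void pingpong_swap(PingPong *pp)
{
    FrameBufferGroup *tmp = pp->decode_fbg;
    pp->decode_fbg = pp->display_fbg;
    pp->display_fbg = tmp;
    pp->decode_fbg->filled = 0;  /* the old display group is reused */
}
```

Because only two pointers change hands, the cost of the exchange is independent of the frame size, which is what allows the decoder and the display path to run in parallel between interrupts.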


Code Optimization

To reach real-time operation, the following additional optimization steps, proposed by Pescador et al. [24], have been carried out:

1. The functions with the highest computational cost have been moved to internal memory.

2. Frequently accessed data have been moved to internal memory.

3. Frequently used arithmetic operations have been rewritten from C into assembly code.

4. The cores of the IQ, IICT, entropy decoding, and MC modules have also been rewritten from C into assembly code.

Table 4.1 Time consumed by each decoding module.

Decoding module               Time consumed
IICT + Motion Compensation    47.66%
Deblocking Filter             25.80%
Entropy Coding                13.58%
Others                        12.96%
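The shares in Table 4.1 indicate how much an optimization of a single module can contribute to the whole decoder. The C sketch below applies Amdahl's law to such a share; the module speedup factor passed in is a free, hypothetical parameter and not a measured result of this study.

```c
#include <assert.h>

/* Overall speedup when a module that takes `fraction` of the run time
 * (0..1) is made `module_speedup` times faster (Amdahl's law). */
double overall_speedup(double fraction, double module_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && module_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / module_speedup);
}
```

For example, overall_speedup(0.4766, 2.0) evaluates to about 1.31, so even doubling the speed of IICT and MC together leaves roughly three quarters of the original decoding time, which is why the deblocking filter and the entropy decoder are optimized as well.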

Fig. 4.4 H.264/AVC decoding procedure.


Fig. 4.5 Sequential playing model.

(a) The procedure of decoding and converting into frame buffer group A

(b) Display the frame data from frame buffer group B

Fig. 4.6 The proposed two frame buffer groups structure.


Fig. 4.7 The proposed parallel playing model.

Experimental results

We evaluate three QCIF (176×144) video sequences, “Foreman”, “Carphone”, and “Test”, as shown in Fig. 4.8. The “Test” sequence is taken from digital video (DV) footage. The test streams are stored as files in the external memory. The decoder reads the streams and decodes them into the frame buffers. The number of FBs is set to N = 16.

The experiment was carried out with VisualDSP++ 5.0 on the ADSP-BF548 EZ-KIT Lite. Table 4.2 summarizes the average playing rate of the SPM and the proposed PPM for the QCIF sequences mentioned above.


It is clear that the proposed PPM reaches real-time playback and that its playing rate is higher than that of the general SPM. The ADSP-BF548 is an embedded processor with a 533 MHz core clock. Fig. 4.9 compares the MHz counts of the SPM and the proposed PPM. The proposed PPM consumes fewer cycles and achieves nearly real-time decoding and playback, although some core cycles are wasted at the peaks corresponding to N = 16. This cost is worthwhile, because it comes from waiting for the FBG data transfer after the video display has finished.

This work has proposed an optimized H.264/AVC BP video decoder for real-time performance on the ADSP-BF548 processor. Simulation results show that the proposed method requires 50.2 MHz of core cycles. With the proposed PPM, the playing rate of decoded QCIF frames increases to as much as 25 fps, so the video system achieves real-time decoding and playback.
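The reported figures can be turned into a per-frame cycle budget with a short C sketch. It assumes that the 50.2 MHz figure is the cycle rate consumed while playing at 25 fps, which the text does not state explicitly, and the function names are hypothetical.

```c
#include <assert.h>

/* Core cycles available per frame for a given cycle rate and frame rate. */
double cycles_per_frame(double mhz, double fps)
{
    assert(mhz > 0.0 && fps > 0.0);
    return mhz * 1e6 / fps;
}

/* Number of 16x16 macroblocks in a frame of the given size. */
unsigned macroblocks_per_frame(unsigned width, unsigned height)
{
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

/* Fraction of the core clock that the decoder occupies. */
double core_utilization(double used_mhz, double core_mhz)
{
    assert(used_mhz >= 0.0 && core_mhz > 0.0);
    return used_mhz / core_mhz;
}
```

Under that assumption, a QCIF frame of 99 macroblocks is decoded in roughly two million cycles, and the decoder occupies a little under one tenth of the 533 MHz core clock.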

Fig. 4.8 Test video sequences: (a) Foreman, (b) Carphone, (c) Test.

Table 4.2 Average playing rate for SPM and the proposed PPM using QCIF sequences.


Fig. 4.9 Comparison of the MHz counts of the SPM and the proposed PPM.