Chapter 1.
INTRODUCTION
1.1. INTRODUCTION
With compression efficiency higher than that of all prior video coding standards, the latest video compression standard H.264, also known as MPEG-4 Part 10 or MPEG-4 AVC, is expected to become the major video standard in the coming years.
H.264/AVC achieves its high coding efficiency through new features and functionalities. With the H.264/AVC standard, the size of a digital video can be reduced by up to 80% compared with the Motion JPEG format and by up to 50% compared with the MPEG-4 Part 2 standard.
At the same time, the demand for multimedia services over the Internet is steadily increasing. With its high coding efficiency, H.264/AVC has become one of the most popular video compression standards for transmitting video over the Internet. However, the high complexity of the H.264/AVC coding process makes the standard very difficult to implement.
The general-purpose Digital Signal Processor (DSP) has been widely used to implement various algorithms. The C64x DSP family, developed by Texas Instruments (TI), is a popular choice for digital media applications. In this thesis, we implement an H.264/AVC video communication system on the multi-DSP board MEX (Multi-Channel Video Platform), which carries four TMS320DM642 DSP chips. The H.264-based video transmission is implemented with multiple threads. Moreover, to speed up the encoding/decoding process, the optimization and parallelization of the DSP code are investigated in this thesis.
1.2. OVERVIEW OF THE THESIS
The rest of the thesis is organized as follows. Chapter 2 gives a brief introduction to the H.264/AVC coding standard. Chapter 3 presents a brief overview of the DSP platform and the development environment. Chapter 4 discusses a multi-task, multi-thread implementation of the H.264-based video communication system. Finally, conclusions are given in Chapter 5.
Chapter 2.
CONSPECTUS OF H.264 STANDARD
H.264, also known as MPEG-4 Part 10 or MPEG-4 AVC, is the state-of-the-art video coding standard. It was proposed by the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The final drafting work on the first version was completed in May 2003 [1].
The primary goal of H.264/AVC is to provide a video coding standard with high coding efficiency and a network-friendly video representation. As shown in Figure 2-1, H.264 covers a Video Coding Layer (VCL), which efficiently represents the video content, and a Network Abstraction Layer (NAL), which formats that representation for conveyance over particular transport layers or storage media. With its state-of-the-art coding tools, it can achieve lower bit rates than all prior standards, such as MPEG-2, H.263, and MPEG-4 Part 2 [2]. Moreover, its packet-based video representation addresses both conversational and non-conversational applications. Outperforming earlier standards, H.264/AVC is becoming the worldwide digital video standard for consumer electronics and video broadcasting. In this chapter, the H.264/AVC standard is briefly introduced. More details about H.264/AVC can be found in [3].
Figure 2-1 Structure of an H.264/AVC video encoder [4]
2.1. OVERVIEW OF H.264/AVC
As shown in Figure 2-2, the scope of the H.264/AVC standard includes only the decoder of the typical video coding/decoding chain. The decoder is standardized by prescribing the bitstream syntax and defining the decoding process. Limiting the scope of the standard in this way leaves the encoder maximal freedom to be tailored to different applications.
Although the encoder/decoder pair is not explicitly defined, an encoder and a decoder are likely to include the functional elements shown in Figure 2-3 and Figure 2-4 in order to be compliant with the standard.
2.1.1. THE H.264/AVC ENCODER
A block diagram of a typical H.264/AVC encoder is shown in Figure 2-3. The encoding process is divided into several functional blocks. Except for the deblocking filter, most of these functional components (intra/inter prediction, transformation, quantization, entropy encoding) were already present in previous standards. However, the details of each functional block have changed significantly in H.264.
Figure 2-3 H.264/AVC Encoder[5]
Figure 2-2 Scope of H.264/AVC [4]
Intra prediction and motion estimation/compensation remove spatial redundancy and temporal redundancy, respectively. After that, the prediction mode and the residual data are recorded. Transformation and quantization then map the residual data into a more suitable data space and drop details that are less perceptible to human vision. Entropy coding removes the syntax redundancy.
In addition, deblocking is performed in the reconstruction path to reduce the blocking effect.
2.1.2. THE H.264/AVC DECODER
Figure 2-4 shows the block diagram of the H.264/AVC decoder. The entropy decoder decodes the quantized coefficients and the motion data, which are used for motion-compensated prediction. As in the encoder, the prediction data are obtained by intra prediction or motion compensation and are added to the inverse-transformed coefficients.
After deblocking filtering, the macroblock is completely decoded.
Figure 2-4 H.264/AVC Decoder[5]
2.2. PROFILE AND LEVELS
Three profiles are defined in the H.264/AVC standard: the baseline profile, the main profile, and the extended profile. A profile is chosen according to the application. The baseline profile, which supports intra coding, inter coding, and entropy coding with CAVLC, is intended primarily for lower-cost applications.
Designed as the mainstream consumer profile, the main profile supports interlaced video, B-pictures, inter coding using weighted prediction, and entropy coding using CABAC. The extended profile, which is robust to data losses, does not support interlaced video or CABAC, but adds modes that enable switching between bitstreams and improve error resilience. Table 2-1 lists the coding tools and features of these three profiles.
Table 2-1 Coding tools and features of different profiles [3]
Coding tool                          Baseline  Extended  Main
I and P Slices                       Yes       Yes       Yes
B Slices                             No        Yes       Yes
SI and SP Slices                     No        Yes       No
Multiple Reference Frames            Yes       Yes       Yes
In-Loop Deblocking Filter            Yes       Yes       Yes
CAVLC Entropy Coding                 Yes       Yes       Yes
CABAC Entropy Coding                 No        No        Yes
Flexible Macroblock Ordering (FMO)   Yes       Yes       No
Arbitrary Slice Ordering (ASO)       Yes       Yes       No
Redundant Slices (RS)                Yes       Yes       No
2.3. INTER PREDICTION
Inter prediction is formed from previously encoded video frames or fields by motion estimation and motion compensation. As in prior coding standards, block-based motion compensation is used. However, H.264 differs from the earlier standards in supporting variable block sizes, which makes it more efficient.
In the prediction procedure, motion estimation searches the reference picture Fn-1 for a predicted block P. The Motion Vector (MV) is the displacement from the current block to the predicted block P. Given the encoded MVs and residual, motion compensation can reconstruct the current picture from the reference picture Fn-1. In this standard, MVs have quarter-sample accuracy to achieve higher coding efficiency. The following subsections describe these features of H.264 inter prediction.
2.3.1. TREE-STRUCTURE MOTION COMPENSATION
In the H.264/AVC standard, the luma component of each macroblock can be segmented into one 16x16 partition, two 8x16 partitions, two 16x8 partitions, or four 8x8 partitions, as shown in Figure 2-5. If the 8x8 partitioning is chosen, each 8x8 block can be further divided in one of four ways, as shown in Figure 2-6: one 8x8, two 8x4, two 4x8, or four 4x4 sub-partitions. In general, large partitions are appropriate for smooth regions; smaller partitions yield a smaller residual, but increase the number of motion vectors. With the flexibility of variable block-size motion compensation, the coding efficiency can be increased.
Figure 2-5 Macroblock partitions: 16x16, 16x8, 8x16 and 8x8 [3]
Figure 2-6 Macroblock sub-partitions: 8x8, 8x4, 4x8 and 4x4 [3]
2.3.2. FRACTIONAL PIXEL PRECISION
To increase the accuracy of motion compensation, H.264 supports quarter-pixel resolution for the luma component and one-eighth-pixel resolution for the chroma components. If a sub-pixel position gives a better prediction than the integer-pixel position, the sub-pixel position is chosen.
The half-pixel samples are obtained by applying a six-tap filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). For example, the half-pixel sample b in Figure 2-7 is obtained from the six horizontal integer neighbors E, F, G, H, I, and J as:
b = round((E - 5F + 20G + 20H - 5I + J) / 32)
The quarter-pixel samples can be calculated once all the half-pixel samples are available. They are produced by linear interpolation between two adjacent samples. As shown in Figure 2-8, the value of a can be calculated as:
a = round((G + b) / 2)
In Figure 2-9, a chroma sample at one-eighth-sample position (dx, dy) is obtained by linear interpolation of the four neighboring integer samples A, B, C, and D:
a = round([(8-dx)(8-dy)A + dx(8-dy)B + (8-dx)dyC + dxdyD] / 64)
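The three interpolation formulas above can be sketched in integer arithmetic as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, rounding is done by adding half the divisor before shifting, and clipping the half-pixel result to the 8-bit range [0, 255] is an assumption about the sample format that the formulas above do not state.

```c
#include <assert.h>

/* Clip a value to the assumed 8-bit sample range [0, 255]. */
static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Half-pixel sample b from six integer neighbours E..J,
   using the taps (1, -5, 20, 20, -5, 1) / 32. */
int luma_half_pel(int E, int F, int G, int H, int I, int J)
{
    int v = E - 5 * F + 20 * G + 20 * H - 5 * I + J + 16; /* +16 rounds */
    if (v < 0)
        return 0;                 /* avoid shifting a negative value */
    return clip255(v >> 5);       /* divide by 32 */
}

/* Quarter-pixel sample a: rounded average of two adjacent samples. */
int luma_quarter_pel(int G, int b)
{
    return (G + b + 1) >> 1;
}

/* Chroma sample at eighth-sample offset (dx, dy), 0 <= dx, dy <= 7,
   bilinearly interpolated from integer neighbours A, B, C, D. */
int chroma_eighth_pel(int dx, int dy, int A, int B, int C, int D)
{
    assert(dx >= 0 && dx < 8 && dy >= 0 && dy < 8);
    return ((8 - dx) * (8 - dy) * A + dx * (8 - dy) * B +
            (8 - dx) * dy * C + dx * dy * D + 32) >> 6;   /* divide by 64 */
}
```

With dx = dy = 0 the chroma weights collapse to 64·A, so the function returns A unchanged, which is a quick sanity check on the weights.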
Figure 2-7 Interpolation of luma half-pel positions [3]
Figure 2-8 Interpolation of luma quarter-pixel positions [3]
Figure 2-9 Interpolation of chroma samples [3]
2.3.3. MOTION VECTOR PREDICTION
As mentioned in 2.3.1, the number of motion vectors increases when variable block partitions are used. Encoding a motion vector for each partition can cost a significant number of bits. Since the motion vectors of neighboring partitions are highly correlated, a motion vector can be predicted from nearby ones. Hence the motion vector prediction (MVp) is generated from the motion vectors of the adjacent partitions. The way MVp is formed depends on the motion compensation partition size and on the availability of nearby vectors. MVp is obtained as follows (see Figure 2-10):
• For 16x8 partitions, the MVp of the upper 16x8 partition is predicted from B, and the MVp of the lower one is predicted from A.
• For 8x16 partitions, the MVp of the left 8x16 partition is predicted from A, and the MVp of the right one is predicted from C.
• The MVp of any other partition is the median of the motion vectors of A, B, and C.
The motion vector difference (MVD) is then calculated as the difference between the actual motion vector and MVp. These MVDs are the values that are finally encoded. In general, fewer bits are needed to encode the MVDs than to encode the actual motion vectors.
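The prediction rules above can be sketched as a small C routine. The type and function names here are hypothetical, and the sketch assumes that all three neighbours A, B, and C are available; the handling of unavailable neighbours, which the text mentions, is omitted.

```c
#include <assert.h>

/* Motion vector in quarter-sample units (hypothetical type). */
typedef struct { int x, y; } MotionVector;

/* Which rule applies to the current partition (hypothetical enum). */
typedef enum {
    PART_16x8_UPPER, PART_16x8_LOWER,
    PART_8x16_LEFT,  PART_8x16_RIGHT,
    PART_OTHER
} PartitionKind;

/* Median of three integers. */
static int median3(int a, int b, int c)
{
    int lo = a < b ? a : b;
    int hi = a < b ? b : a;
    return c < lo ? lo : (c > hi ? hi : c);
}

/* MVp from the neighbours A (left), B (above), C (above-right). */
MotionVector predict_mv(PartitionKind kind,
                        MotionVector A, MotionVector B, MotionVector C)
{
    MotionVector p;
    assert((int)kind <= (int)PART_OTHER);
    switch (kind) {
    case PART_16x8_UPPER: return B;
    case PART_16x8_LOWER: return A;
    case PART_8x16_LEFT:  return A;
    case PART_8x16_RIGHT: return C;
    default:
        /* Component-wise median for all other partitions. */
        p.x = median3(A.x, B.x, C.x);
        p.y = median3(A.y, B.y, C.y);
        return p;
    }
}

/* MVD = actual motion vector minus the prediction. */
MotionVector mv_difference(MotionVector mv, MotionVector mvp)
{
    MotionVector d;
    d.x = mv.x - mvp.x;
    d.y = mv.y - mvp.y;
    return d;
}
```

The decoder reverses the last step by adding the decoded MVD to the same prediction, which is why encoder and decoder must form MVp identically.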
Figure 2-10 Current and neighboring partitions for MVp [3]
2.4. INTRA PREDICTION
The high correlation between neighboring regions within a frame implies high redundancy in the spatial domain. As mentioned in 2.1.1, intra prediction is used to eliminate this spatial redundancy. For the luma samples, an intra prediction block is formed for each 4x4 block or each 16x16 block; for the chroma samples, an intra prediction block is formed for each 8x8 block. The spatial prediction is calculated from the edge pixels of neighboring blocks.
2.4.1. 4X4 LUMA PREDICTION MODES
When the 4x4 intra mode is applied, one of nine possible modes can be chosen.
As shown in Figure 2-11, the samples above and to the left (labeled A–M) have previously been encoded and reconstructed to form a prediction reference. The prediction block (the gray part) is calculated based on A–M. The arrows in Figure 2-11 indicate the direction of prediction in each mode. In mode 0 and mode 1, the samples A–D and I–L are extrapolated vertically and horizontally, respectively.
Mode 2 (DC prediction) is modified depending on the availability of samples A to M.
In the remaining modes, modes 3 to 8, the predicted samples are calculated as a weighted average of the reference samples A–M.
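Modes 0 to 2 are simple enough to sketch directly. The code below is illustrative only, with hypothetical function names. It assumes `above[0..3]` holds samples A to D and `left[0..3]` holds samples I to L, and it signals an unavailable neighbour by a NULL pointer. The DC fallback rules (average of whichever side is available, or 128 when neither is) follow the usual description of the standard rather than anything stated in the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mode 0 (vertical): each column repeats the sample above it. */
void intra4x4_vertical(const uint8_t above[4], uint8_t pred[4][4])
{
    int r, c;
    assert(above != NULL);
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            pred[r][c] = above[c];
}

/* Mode 1 (horizontal): each row repeats the sample to its left. */
void intra4x4_horizontal(const uint8_t left[4], uint8_t pred[4][4])
{
    int r, c;
    assert(left != NULL);
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            pred[r][c] = left[r];
}

/* Mode 2 (DC): rounded mean of the available neighbours.
   Pass NULL for a side that is not available. */
void intra4x4_dc(const uint8_t *above, const uint8_t *left, uint8_t pred[4][4])
{
    int r, c, i, sum = 0, dc;
    if (above != NULL)
        for (i = 0; i < 4; i++) sum += above[i];
    if (left != NULL)
        for (i = 0; i < 4; i++) sum += left[i];

    if (above != NULL && left != NULL)
        dc = (sum + 4) >> 3;      /* mean of 8 samples */
    else if (above != NULL || left != NULL)
        dc = (sum + 2) >> 2;      /* mean of 4 samples */
    else
        dc = 128;                 /* mid-grey when nothing is available */

    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            pred[r][c] = (uint8_t)dc;
}
```

An encoder would typically evaluate each candidate mode and keep the one with the smallest residual cost; that selection step is not shown here.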
Figure 2-11 4 × 4 luma prediction modes [3]
2.4.2. 16X16 LUMA PREDICTION MODES
In addition to the 4x4 luma modes described in the previous section, there are four 16x16 modes for luma intra prediction: vertical, horizontal, DC, and plane, as shown in Figure 2-12. As in 4x4 luma prediction, the samples above and to the left must already have been reconstructed.
Figure 2-12 Intra 16 × 16 prediction modes [3]
2.4.3. 8X8 CHROMA PREDICTION MODES
Four 8x8 intra prediction modes are provided for the chroma samples. Similar to the 16x16 luma intra prediction in Figure 2-12, the four modes are DC, horizontal, vertical, and plane.
2.5. IN-LOOP DE-BLOCKING FILTER
One drawback of the block-based video compression described above is visible block boundaries, the so-called blocking effect: the lower the bit rate, the more visible the block edges. To eliminate the blocking effect, a deblocking filter is applied after the inverse transform in both the encoder and the decoder.
As shown in Figure 2-13, the filter is applied to the vertical and horizontal edges of the 4x4 blocks in a macroblock, in the following order: the four vertical boundaries of luma (a, b, c, then d), the four horizontal boundaries of luma (e, f, g, then h), the two vertical boundaries of chroma (i, j), and the two horizontal boundaries of chroma (k, l).
Figure 2-13 Edge filtering order in a macroblock [3]
The filtering is applied adaptively according to the boundary strength and the gradient across the boundary. The boundary strength depends on the compression mode of the macroblock, the quantization parameter, the motion vectors, the frame or field coding decision, and the pixel values.
With this filter, the subjective quality is significantly improved, as shown in Figure 2-14. The filter also reduces the bit rate by 5%–10% compared with non-filtered video of the same objective quality [4].
Figure 2-14 Performance of the deblocking filter for highly compressed pictures: (a) without deblocking filter and (b) with deblocking filter [4]
2.6. TRANSFORM AND QUANTIZATION
Like prior video standards, H.264/AVC applies transform coding to the prediction residual. The residual generated by intra or inter prediction is transformed and then quantized. Each macroblock is divided into 24 4x4 blocks (16 luma and 8 chroma), and a 4x4 integer transform is applied to each block with the transform matrix:
    |  1   1   1   1 |
H = |  2   1  -1  -2 |
    |  1  -1  -1   1 |
    |  1  -2   2  -1 |
In addition, for a macroblock coded in 16x16 intra mode, a 4x4 Hadamard transform is applied to the DC coefficients of the 16 luma blocks, while a 2x2 Hadamard transform is applied to the DC coefficients of the four blocks of each of the two chroma components, as shown in Figure 2-15.
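As an illustration of the 4x4 integer transform, the sketch below computes Y = H·X·Hᵀ for one residual block, using the core matrix commonly given for H.264 (rows 1 1 1 1 / 2 1 -1 -2 / 1 -1 -1 1 / 1 -2 2 -1). The function name is hypothetical, and the scaling that the standard folds into quantization is deliberately left out, so the output is the unscaled coefficient block only.

```c
#include <assert.h>
#include <stddef.h>

/* Core 4x4 forward transform matrix (integer approximation of the DCT). */
static const int H4[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* y = H4 * x * H4^T for one 4x4 residual block (no scaling applied). */
void forward_transform4x4(int x[4][4], int y[4][4])
{
    int tmp[4][4];
    int i, j, k;
    assert(x != NULL && y != NULL);

    /* tmp = H4 * x */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            tmp[i][j] = 0;
            for (k = 0; k < 4; k++)
                tmp[i][j] += H4[i][k] * x[k][j];
        }

    /* y = tmp * H4^T */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (k = 0; k < 4; k++)
                y[i][j] += tmp[i][k] * H4[j][k];
        }
}
```

Because every entry is 1 or 2 in magnitude, a production implementation would normally replace the multiplications with additions and shifts; the matrix form is used here only for readability.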
Figure 2-15 Scanning order of residual blocks within a macroblock. [3]
A quantization parameter (QP) determines the quantization step size used to quantize the transform coefficients. The standard supports a total of 52 quantization step sizes (Qstep), indexed by QP. An increase of one in QP increases the quantization step size by approximately 12%, which in turn reduces the bit rate by approximately 12% [4].
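The 12% figure corresponds to a step size that doubles every six QP values, since 2^(1/6) ≈ 1.12. A minimal sketch of that mapping is given below. The function name is hypothetical, and the six base values are those commonly tabulated for QP 0 to 5 in descriptions of H.264, not values taken from this text.

```c
#include <assert.h>

/* Quantization step size for QP in [0, 51].
   The step doubles for every increase of 6 in QP. */
double qstep_from_qp(int qp)
{
    /* Commonly tabulated step sizes for QP = 0..5 (assumed here). */
    static const double base[6] = { 0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125 };
    assert(qp >= 0 && qp <= 51);
    return base[qp % 6] * (double)(1 << (qp / 6));
}
```

Under these assumptions the step size ranges from 0.625 at QP 0 to 224 at QP 51.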
2.7. ENTROPY CODING
To eliminate syntax redundancy, entropy coding is applied. The syntax above the slice layer is encoded as fixed- or variable-length codes (VLCs). At the slice layer and below, elements are coded using Context-Adaptive Variable Length Coding (CAVLC) or Context-Adaptive Binary Arithmetic Coding (CABAC), depending on the entropy encoding mode. The parameters that must be encoded and transmitted include those listed in Table 2-2.
Table 2-2 Examples of parameters to be encoded
Parameters                          Description
Syntax elements above slice layer   Headers and parameters
Macroblock type (mb_type)           Prediction method for each coded macroblock
Coded block pattern                 Blocks containing coded coefficients within a macroblock
Reference frame index               Identify reference frame(s) for inter prediction
Motion vector                       Difference (mvd) from predicted motion vector
Residual data                       Coefficient data for each 4 × 4 or 2 × 2 block
2.8. NAL UNIT
So that the coded video content can be carried by whichever transport protocol suits the application, the coded video is organized as a collection of NAL units (NALUs). Each NALU is a video packet containing an integer number of bytes. As shown in Figure 2-16, the first byte of a NALU is a header byte that contains the NAL unit type (T), the nal_reference_idc (R), which indicates the importance of the NALU for the reconstruction process, and the forbidden_bit (F), which is set to ‘0’ in H.264 encoding.
Figure 2-16 NALU header.
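The header byte can be unpacked with a few shifts and masks. The sketch below uses hypothetical type and function names and assumes the field widths given in the H.264 specification, which the text above does not state: 1 bit for F (most significant), 2 bits for R, and 5 bits for T.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the one-byte NALU header (hypothetical struct). */
typedef struct {
    unsigned forbidden_bit;      /* F: 1 bit, expected to be 0 */
    unsigned nal_reference_idc;  /* R: 2 bits, importance for reconstruction */
    unsigned nal_unit_type;      /* T: 5 bits */
} NaluHeader;

/* Split the first byte of a NALU into its three fields. */
NaluHeader parse_nalu_header(uint8_t first_byte)
{
    NaluHeader h;
    h.forbidden_bit     = (first_byte >> 7) & 0x01u;
    h.nal_reference_idc = (first_byte >> 5) & 0x03u;
    h.nal_unit_type     =  first_byte       & 0x1Fu;
    return h;
}

/* Build a header byte with F = 0 from R and T. */
uint8_t make_nalu_header(unsigned ref_idc, unsigned type)
{
    assert(ref_idc <= 3u && type <= 31u);
    return (uint8_t)((ref_idc << 5) | type);
}
```

For example, the byte 0x67 parses to F = 0, R = 3, T = 7 under this layout.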
2.9. DATA DEPENDENCY OF H.264/AVC
Taking the macroblock as the basic element in H.264/AVC, the data dependencies across macroblocks are illustrated in Figure 2-17 and Figure 2-18. Intra prediction needs the macroblocks above and to the left to be decoded first; in addition, prediction of a 4x4 luma block needs information from the upper, left, and upper-right blocks. For deblocking filtering, four taps (samples) in the macroblock above and in the macroblock to the left are needed.
In Figure 2-18, the data within the search range of the reference frame are needed to perform inter prediction.
Figure 2-17 Data dependency induced by (Left) intra prediction and (Right)deblocking filter
Figure 2-18 Data dependency induced by inter prediction
2.10. COMPLEXITY ANALYSIS OF H.264/AVC
The H.264/AVC standard specifies only the decoder, and the encoder design remains open. In this work, we adopt the official H.264/AVC reference software (JM) as the decoder for its completeness, and the x264 encoder for its faster encoding speed. The complexity of the important functions of each is illustrated in Figure 2-19 and Figure 2-20.
Figure 2-19 Distribution of clock cycles over the functions of the encoder.
Figure 2-20 Distribution of clock cycles over the functions of the decoder.
Chapter 3.
DSP IMPLEMENTATION ENVIRONMENT
In this chapter, we briefly introduce the DSP platform environment and some optimization methods. We use the DSP module (MEX) made by Vitec Multimedia.
Four TMS320DM642 DSP chips are housed on this board. Our implementation comprises the software system and several peripherals on the board. For the TI DSP, Code Composer Studio (CCS) and some efficient optimization methods are introduced. In addition, Reference Framework 5 (RF5) and the Network Developer’s Kit (NDK), which support the system software and the peripherals, are introduced as well.
3.1. INTRODUCTION OF DSP PLATFORM
The DSP board used in our implementation is the MEX (Multi-Channel Video Platform) shown in Figure 3-1, which is a powerful platform for video applications. The architecture of the MEX includes four TI DSPs, two FPGAs (one as a crossbar, the other as the PCI interface), eight video decoders, four stereo audio ADCs, and a 100BaseT Ethernet controller, as shown in Figure 3-2.
The key features of the MEX are listed below:
• Four TMS320DM642 DSPs running at up to 600 MHz (fixed point).
• Each DSP has 32 MB of private memory, which is 64-bit SDRAM running at 100 MHz.
• Each DSP has three powerful configurable video ports. By configuring the crossbar (implemented in an FPGA), the video architecture becomes flexible. With proper configuration, the video path can distribute one video source to four DSPs, four distinct video sources to four DSPs, four distinct video sources to one DSP, and so on.
• DSP-DSP communication and DSP-PCI communication are facilitated by the "Inter-DSP communication & PCI interface" FPGA. Each DSP has a dedicated FIFO inside the FPGA, which is mapped into its memory. Data written to this FIFO by the DSP can be sent to the PCI interface or to the other DSPs, which provides PC-DSP communication and DSP-DSP communication, respectively.
Figure 3-1 MEX (Multi-Channel Video Platform) [6]
Figure 3-2 Block diagram of the MEX [6]
As shown in Figure 3-2, the flexible architecture involves several modules of the TMS320DM642 DSP chip: the I2C bus, used to configure the video (7113) and audio (CS4221) chips; the video ports, set up to configure the video acquisition data path; and the EMIF, which defines the address at which the DSP sees the FPGA. These DSP modules are introduced further in the following sections.
Figure 3-3 Block diagram of (a) emulator system and (b) application system
In the development phase, a JTAG emulator pod called “USB 560BP” is used to connect the MEX to the PC. With the JTAG emulator, CCS emulation of the DSPs on the board is fully supported. We develop and debug our system in this way. After that, the emulator can be removed from the system to demonstrate the stand-alone capability of the MEX.
The only things the PC then has to do are to supply 3 V power and to load the DSP program onto the board. Figure 3-3 shows the block diagrams of the system in the emulation phase and in the application phase.
3.2. DSP CHIP
The TMS320DM642 DSP chip is the most important part of our system. In this section, we describe some details of this chip. The TMS320DM642, a high-performance fixed-point DSP, is based on the second-generation, high-performance, advanced VelociTI™ very long instruction word (VLIW) architecture (VelociTI.2™) developed by Texas Instruments (TI). The VelociTI.2 extensions in the eight functional units include new instructions to accelerate the performance in key applications and extend the parallelism of the VelociTI architecture. This VLIW architecture makes these DSP chips an excellent choice for digital media applications [8].
The DM642 is a video/imaging fixed-point digital signal processor in the TMS320C64x family. It has eight independent functional units running at 600 MHz, for a peak execution rate of 4800 MIPS. Some key features of the DM642 are listed below.
• Eight highly independent functional units - two multipliers that generate 32-bit results and six arithmetic logic units (ALUs)
• The VelociTI.2™ extensions in the eight functional units include new