Hardware architecture design for H.264/AVC intra frame coder


HARDWARE ARCHITECTURE DESIGN FOR H.264/AVC INTRA FRAME CODER

Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen

DSP/IC Design Lab., Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University

{yuwen, bingyu, djchen, lgchen}@video.ee.ntu.edu.tw

ABSTRACT

In this paper, we present a VLSI architecture design for an H.264/AVC intra frame coder. First, the coding algorithm is analyzed with a RISC model to obtain the proper degrees of parallelism under the SDTV specification. Second, a two-stage macroblock pipelining is proposed to double the processing capability and hardware utilization. Third, the Hadamard-based mode decision is modified into a DCT-based version to reduce memory access by 40%. Overall, our system architecture is 215 times faster than a RISC-based software implementation in terms of processing cycles. In addition, we developed area-speed efficient modules. The reconfigurable intra predictor generator supports all prediction modes. The parallel multi-transform engine has four times the throughput of the serial one with little area overhead. The CAVLC engine efficiently provides coding information for the bitstream packer. A prototype chip was fabricated with TSMC 0.25 µm CMOS technology and is capable of encoding 720x480 4:2:0 30 Hz video in real time at a working frequency of 54 MHz. The transistor count is 429K, and the core size is only 1.855x1.885 mm².

1. INTRODUCTION

The H.264/AVC intra frame coder [1] is competitive with the latest image coding standard, JPEG2000 [2], in coding performance. According to the experimental results of JPEG2000 VM7.2 and H.264/AVC JM7.3, the rate-distortion curve of the H.264/AVC Main Profile intra frame coder (CABAC, high complexity mode decision) is almost the same as that of JPEG2000 DWT97. The H.264/AVC Baseline Profile intra frame coder (CAVLC, low complexity mode decision) is 0.2-1.0 dB better than JPEG2000 DWT53. For encoding and decoding the “Bike” image (2048x2560x8b), JPEG2000 DWT53 requires 3430 and 3180 Mega-instructions, respectively, while H.264/AVC Baseline Profile requires 3648 and 584 Mega-instructions. For applications whose key functionality is compression instead of scalability, such as digital storage cameras, digital scanners, digital video editing, and digital video surveillance, the H.264/AVC intra frame coder may be a more attractive solution because of its hardware-friendly block-based algorithm. In JPEG2000, DWT is a frame-based transform that requires a huge amount of memory, and EBCOT is a sequential bitplane process that requires a high operating frequency.

Intra prediction with rate-distortion constrained mode decision is the most important technology in the H.264/AVC intra frame coder. The predictor generation engine for intra prediction and the transform engine for mode decision are critical because these operations occupy about 80% of the computation time of the whole compression process, and it is difficult for general purpose processors (GPP) to meet the real-time constraints. In this paper, we analyze the coding algorithm to develop the VLSI architecture of an H.264/AVC Baseline Profile intra frame coder targeted at the SDTV specification (720x480 4:2:0 30 Hz video). The rest of this paper is organized as follows. In Section 2, the fundamentals of H.264/AVC intra frame coding are reviewed. In Section 3, the system design is proposed based on a detailed analysis. Module design and implementation results are described in Section 4 and Section 5, respectively. Finally, Section 6 gives a conclusion.

2. FUNDAMENTALS

The encoding flow of each macroblock (MB) can be separated into a mode decision phase and a residue encoding phase. In the mode decision phase, 17 kinds of predictions are generated for one MB (9 I4MB modes for luma, 4 I16MB modes for luma, and 4 modes similar to I16MB for chroma). The distortion cost is evaluated by the sum of absolute values of the 2-D 4x4 Hadamard transformed differences (SATD), and the rate cost is estimated from the quantization parameter and the number of bits required to code the mode information. Then, the best MB mode is chosen by minimizing the Lagrangian cost value (distortion cost plus rate cost) [3]. In the residue encoding phase, prediction residues are transformed and quantized [4]. The mode information and residues are then compressed by Exp-Golomb code and context-based adaptive variable length code (CAVLC) [5], respectively.
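The cost evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference software: the function names, the `lam` multiplier, and the per-mode bit counts are hypothetical, and any normalization that an encoder may apply to SATD is omitted.

```python
# Hadamard matrix of order 4 (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(cur, pred):
    """Sum of absolute 2-D Hadamard-transformed differences of a 4x4 block."""
    diff = [[cur[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    coeffs = matmul4(matmul4(H4, diff), H4)
    return sum(abs(c) for row in coeffs for c in row)

def best_mode(cur, candidates, lam):
    """Pick the mode with the lowest cost = distortion + lam * mode bits.

    candidates maps a mode name to (predictor block, bits for the mode).
    lam stands in for the QP-dependent multiplier and is assumed given.
    """
    costs = {mode: satd4x4(cur, pred) + lam * bits
             for mode, (pred, bits) in candidates.items()}
    return min(costs, key=costs.get), costs

cur = [[10] * 4 for _ in range(4)]
candidates = {
    "mode_a": ([[9] * 4 for _ in range(4)], 1),   # cheap to signal, off by one
    "mode_b": ([[10] * 4 for _ in range(4)], 3),  # exact match, more mode bits
}
mode, costs = best_mode(cur, candidates, lam=4)
print(mode, costs)
```

In this toy case the exact predictor wins even though it costs more bits to signal, because its distortion term is zero.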

The instruction profile of the H.264/AVC Baseline Profile intra frame coder with low complexity mode decision for the SDTV specification is shown in Table 1. Real-time processing requires 10,820 million instructions per second (MIPS), which is far beyond the capability of today’s GPP. The instructions are classified into three categories: computing, controlling, and memory access. Memory access operations are the most heavily demanded, which indicates that local SRAM and registers are critical for reducing the bus bandwidth. Figure 1 shows the run-time percentages of several major functional modules. As can be seen, transform for cost generation (SATD computation) and mode decision take the largest portion of computation, and intra predictor generation is the second. These two functions take 77% of the computation and are clearly the processing bottleneck.

3. SYSTEM DESIGN

In this section, a parallel H.264/AVC intra frame coding architecture is proposed for the SDTV specification, which requires encoding about 16 Mega-pixels per second. The detailed analysis and the system and module designs are described as follows.

Table 1: Instruction profile for SDTV specification.

| Instruction Type | MIPS | % | Category |
|---|---|---|---|
| Arithmetic | 1,785 | 16.5 | Computing |
| Logic | 83 | 0.77 | Computing |
| Rotate and Shift | 279 | 2.58 | Computing |
| Jump and Compare | 1,558 | 14.4 | Controlling |
| Stack Instruction | 3,154 | 29.15 | Memory Access |
| Data Instruction | 3,961 | 36.6 | Memory Access |
| Total | 10,820 | 100 | |

Figure 1: Run-time percentages of various functional modules. (Pie chart: transform for cost generation and mode decision 57%, intra predictor generation 20%, DCT/Q/IQ/IDCT 16%, Exp-Golomb VLC and CAVLC 4%, others 3%.)

3.1. Exploration of Parallelism

First, two assumptions are made: a RISC is able to execute one instruction in one cycle, with the exception of multiplication, which requires two cycles; and a processing element (PE) is capable of generating the predictor of one pixel in one cycle. Next, we compute the average instruction counts required for intra predictor generation. For example, the operation “c = a + b” requires two load instructions and one add instruction. Table 2 shows that it takes 3.2629 and 3.9610 cycles on average for a RISC to generate one luma predictor and one chroma predictor, respectively.
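As a worked check of these averages, the sketch below recomputes them from the per-mode instruction counts in Table 2 and converts them into clock rates for SDTV (720x480 4:2:0 at 30 frames/sec, with 13 luma and 4 chroma candidate modes per pixel). The variable and function names are illustrative only, and the per-mode values are listed without mode labels because only their averages matter here.

```python
WIDTH, HEIGHT, FPS = 720, 480, 30

# Cycles per pixel for each mode, i.e. the bracketed sums of Table 2
# divided by the block size (16, 256, or 64 pixels).
LUMA = [16 / 16, 16 / 16, 25 / 16,              # I4MB vertical, horizontal, DC
        75 / 16, 51 / 16, 58 / 16, 58 / 16,     # I4MB directional modes
        67 / 16, 67 / 16,
        289 / 256, 256 / 256, 256 / 256,        # I16MB DC, vertical, horizontal
        3130 / 256]                             # I16MB plane
CHROMA = [84 / 64, 64 / 64, 64 / 64, 802 / 64]  # DC, vertical, horizontal, plane

avg_luma = sum(LUMA) / len(LUMA)
avg_chroma = sum(CHROMA) / len(CHROMA)

def risc_hz():
    """Clock rate for a RISC that generates every candidate predictor."""
    luma_pixels = WIDTH * HEIGHT * FPS
    chroma_pixels = (WIDTH // 2) * (HEIGHT // 2) * 2 * FPS
    return (luma_pixels * len(LUMA) * avg_luma
            + chroma_pixels * len(CHROMA) * avg_chroma)

def dedicated_hz():
    """13 dedicated PEs: all candidate predictors of a pixel in one cycle."""
    return WIDTH * HEIGHT * 1.5 * FPS

def reconfigurable_hz():
    """One reconfigurable PE: one predictor per cycle, 13 passes per pixel."""
    return dedicated_hz() * 13

for name, hz in [("RISC", risc_hz()),
                 ("dedicated PEs", dedicated_hz()),
                 ("reconfigurable PE", reconfigurable_hz())]:
    print(f"{name}: {hz / 1e6:.1f} MHz")
```

Dividing the reconfigurable-PE rate by the degree of parallelism gives the scaled operating frequencies for two, four, and eight parallel PEs.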

We first discuss three possible solutions for intra predictor generation, as shown in Fig. 2. The first one is a RISC solution, which would have to run at 521.9 MHz to generate the predictors in time, not to mention transform, entropy coding, and other system jobs. Consequently, a RISC seems impractical. The second solution is a set of 13 different PEs. This hardware can generate 13 kinds of predictors in one cycle. The architecture only needs to operate at 15.6 MHz, but the cost is very high. The third choice is to design a reconfigurable PE that generates all the intra predictors under different configurations. This solution targets higher area-speed efficiency. Nevertheless, it still has to operate at 202.2 MHz. Thus, parallel reconfigurable PEs become the most promising solution. Table 3 lists the hardware complexity and required frequency of the three solutions under different degrees of parallelism. We conclude this subsection by adopting a four-parallel reconfigurable PE for intra predictor generation in our design.

Figure 2: Three possible solutions and the required working frequency to meet the real-time requirement of the SDTV specification (720x480 4:2:0, 30 frames/sec).

| Solution | Capability | Required frequency |
|---|---|---|
| RISC | 3.2629 cycles per luma predictor, 3.9610 cycles per chroma predictor | 720x480x30x13x3.2629 + 360x240x2x30x4x3.9610 = 521.9 MHz |
| Dedicated PEs (PE01 to PE13) | 13 luma predictors or 4 chroma predictors in one cycle | 720x480x1.5x30 = 15.6 MHz |
| Reconfigurable PE (R-PE) | 1 predictor in one cycle | 720x480x1.5x30x13 = 202.2 MHz |

Table 2: Analysis of instructions for intra predictor generation.

| Component | Intra prediction mode | Average cycles to generate the predictor of a pixel |
|---|---|---|
| Luma | Intra4x4 Vertical | (0+0+0+4x4)/16 = 1 |
| Luma | Intra4x4 Horizontal | (0+0+0+4x4)/16 = 1 |
| Luma | Intra4x4 DC | (8+1+0+4x4)/16 = 1.5625 |
| Luma | Intra4x4 Diagonal Down/Left | (6x6+4+7+12+4x4)/16 = 4.6875 |
| Luma | Intra4x4 Diagonal Down/Right | (3x7+7+7+4x4)/16 = 3.1875 |
| Luma | Intra4x4 Vertical Left | (2x4+3x6+10+6+4x4)/16 = 3.625 |
| Luma | Intra4x4 Horizontal Down | (2x4+3x6+10+6+4x4)/16 = 3.625 |
| Luma | Intra4x4 Vertical Right | (2x4+3x4+6x2+10+9+4x4)/16 = 4.1875 |
| Luma | Intra4x4 Horizontal Up | (2x4+3x4+6x2+10+9+4x4)/16 = 4.1875 |
| Luma | Intra16x16 DC | (15x2+2+1+0+16x16)/256 = 1.1289 |
| Luma | Intra16x16 Vertical | (0+0+0+16x16)/256 = 1 |
| Luma | Intra16x16 Horizontal | (0+0+0+16x16)/256 = 1 |
| Luma | Intra16x16 Plane | (8x3x2+2x2+2+2+2+(3+2x2+2+2)x256+16x16)/256 = 12.2266 |
| Luma | Average | 3.2629 (cycles/pixel required for RISC) |
| Chroma | DC | (3x4+4+4+0+8x8)/64 = 1.3125 |
| Chroma | Vertical | (0+0+0+8x8)/64 = 1 |
| Chroma | Horizontal | (0+0+0+8x8)/64 = 1 |
| Chroma | Plane | (4x3x2+2x2+2+2+2+(3+2x2+2+2)x64+8x8)/64 = 12.5313 |
| Chroma | Average | 3.9610 (cycles/pixel required for RISC) |

Table 3: Hardware complexity and operating frequency under different degrees of parallelism.

| Parallelism | Metric | RISC | Reconfigurable PE | Dedicated PEs |
|---|---|---|---|---|
| No Parallelism | Hardware Complexity | ~A | A | <13A |
| No Parallelism | Operating Frequency | >>521.9 MHz | 202.2 MHz | 15.6 MHz |
| Two-Parallel | Hardware Complexity | ~2A | 2A | <26A |
| Two-Parallel | Operating Frequency | >>261.0 MHz | 101.1 MHz | 7.8 MHz |
| Four-Parallel | Hardware Complexity | ~4A | 4A | <52A |
| Four-Parallel | Operating Frequency | >>130.5 MHz | 50.6 MHz | 3.9 MHz |
| Eight-Parallel | Hardware Complexity | ~8A | 8A | <104A |
| Eight-Parallel | Operating Frequency | >>65.2 MHz | 25.3 MHz | 1.9 MHz |

3.2. System Architecture

We divide our system into two main parts, the encoding loop and the bitstream generation unit, as illustrated in Fig. 3. Assume that one row of reconstructed pixels and coding information is buffered in the external DRAM. At the beginning, the current MB pixels and the upper reconstructed pixels and coding information are loaded from external DRAM to on-chip SRAM. The reconstructed pixels and coding information of the previous (left) MB can be kept directly in registers to save bus bandwidth. With on-chip SRAM and coding information registers, the bus bandwidth is reduced from hundreds of Mbytes/sec to about 20 Mbytes/sec. Then, intra prediction proceeds block by block. According to the previous analysis, four pixels should be processed (predictor generation and SATD computation) in one cycle. Therefore, the number of cycles required for luma intra prediction and mode decision is 832 (4x13x16), and that for chroma is 128 (4x4x8). The current 4x4-block cannot proceed to intra prediction before DCT/Q/IQ/IDCT finishes the previous 4x4-block, so four-parallel DCT/IDCT and two-parallel Q/IQ are adopted to reduce the latency. CAVLC is a sequential algorithm and its computational load is not very large. Hence, parallel processing is not a must. In sum, the initial system architecture needs about 2400 cycles to encode an MB. Also, with the straightforward data flow, the residues are generated twice: once for intra predictor generation and mode decision, and once for entropy coding of the best mode.

Figure 3: Illustration of initial system architecture. (System block diagram: the encoding loop contains the current MB SRAM, intra predictor generation, DIFF, DCT/Hadamard, mode decision, Q, IQ, IDCT/IHadamard, and the reconstructed MB SRAM; the bitstream generation unit contains a 4x4 buffer and CAVLC & Exp-Golomb; the degrees of parallelism are 4, 2, and 1. Processing schedule for the current macroblock, time axis from 0 to 2304 cycles: in the encoding loop, initialization and Hadamard-based mode decision for I4MB, I16MB, and UV, followed by generation of 4x4 residues, DCT, and Q for Y, U, and V at 4 cycles per 4x4 block; in the bitstream generation unit, CAVLC of an MB with 24 blocks at 16 to 50 cycles per 4x4 block.)

As observed in the previous paragraph, the number of processing cycles for the encoding loop is about the same as that for the bitstream generation unit in the worst case. These two procedures are separable because there is no feedback loop between them. Therefore, MB pipelining is incorporated into the system to accelerate processing at the cost of a coefficient buffer for one MB. While the current MB is processed by the encoding loop, the bitstream generation unit processes the previous MB simultaneously. The number of processing cycles is reduced to less than 1300 cycles.
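The cycle budget behind these numbers can be sketched as follows. The prediction cycle counts and the CAVLC worst case (24 blocks at up to 50 cycles each) come from the text and Fig. 3; the 1200-cycle figure for the encoding loop is an assumed round number chosen only to match "about the same" and "about 2400 cycles" in total, and all names are illustrative.

```python
PIXELS_PER_CYCLE = 4                                  # four-parallel predictor generation
CYCLES_PER_4X4 = 16 // PIXELS_PER_CYCLE               # 4 cycles per 4x4 block and mode

LUMA_CYCLES = CYCLES_PER_4X4 * 13 * 16                # 13 modes, 16 luma blocks
CHROMA_CYCLES = CYCLES_PER_4X4 * 4 * 8                # 4 modes, 8 chroma blocks

MB_PER_SEC = (720 // 16) * (480 // 16) * 30           # SDTV macroblock rate

def required_mhz(cycles_per_mb):
    """Clock rate needed to sustain the SDTV macroblock rate."""
    return cycles_per_mb * MB_PER_SEC / 1e6

ENCODING_LOOP_CYCLES = 1200     # assumption: roughly half of the ~2400 total
CAVLC_WORST_CYCLES = 24 * 50    # 24 blocks, up to 50 cycles each

# Without pipelining the two stages run back to back.
sequential = ENCODING_LOOP_CYCLES + CAVLC_WORST_CYCLES
# With MB pipelining they overlap, so the slower stage sets the pace.
pipelined = max(ENCODING_LOOP_CYCLES, CAVLC_WORST_CYCLES)

print(LUMA_CYCLES, CHROMA_CYCLES)                      # prediction cycles per MB
print(sequential, required_mhz(2400))                  # initial architecture
print(pipelined, required_mhz(1300))                   # bound after MB pipelining
```

The overlap works only because the bitstream generation unit never feeds data back into the encoding loop, which is why a one-MB coefficient buffer is enough to decouple the two stages.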

In the reference software, the Hadamard transform is used in SATD, so the transform coefficients in the mode decision phase cannot be reused for CAVLC. Thus, we modify the SATD computation to use DCT. Luma mode decision is performed block by block, and 13 kinds of predictors are generated for each 4x4-block. The first nine prediction modes decide the best I4MB mode, and its quantized transform coefficients are stored in the coefficient buffer. In this way, if the I4MB mode is chosen, regeneration of the quantized luma transform residues can be avoided. The improvement is significant in high quality applications where almost all MBs select I4MB. In our experience, when the quantization parameter (QP) is smaller than 25, the percentage of I16MB mode is less than 10%. The amount of on-chip memory access can thus be reduced from 113.17 Mbytes/sec to 72.25 Mbytes/sec. Also, the proposed mode decision does not suffer any quality loss compared with the mode decision in the reference software. Fig. 4 shows the final system block diagram of our proposed H.264/AVC intra frame encoder.
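The idea of reusing mode-decision coefficients can be illustrated with the 4x4 integer core transform of H.264/AVC [4]. This is only a sketch under assumptions: the paper does not state how the DCT-based cost is scaled, so the cost below is simply the sum of absolute unscaled core-transform coefficients, and the function names and mode labels are hypothetical.

```python
# Forward 4x4 integer core transform matrix of H.264/AVC.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def core_transform4x4(block):
    """Compute CF * block * CF^T with integer arithmetic."""
    tmp = [[sum(CF[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * CF[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]

def dct_cost(cur, pred):
    """Return (cost, coefficients) so the coefficients can be kept for reuse."""
    diff = [[cur[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    coeffs = core_transform4x4(diff)
    return sum(abs(c) for row in coeffs for c in row), coeffs

def decide_and_buffer(cur, predictors):
    """Evaluate all candidate predictors and keep the winner's coefficients."""
    best = None
    for mode, pred in predictors.items():
        cost, coeffs = dct_cost(cur, pred)
        if best is None or cost < best[1]:
            best = (mode, cost, coeffs)   # coeffs play the role of the coefficient buffer
    return best

cur = [[10] * 4 for _ in range(4)]
predictors = {"flat_9": [[9] * 4 for _ in range(4)],
              "flat_6": [[6] * 4 for _ in range(4)]}
mode, cost, coeffs = decide_and_buffer(cur, predictors)
print(mode, cost, coeffs[0][0])
```

Because the winning coefficients are already the transform of the residue, they can go to quantization directly, whereas Hadamard coefficients would have to be discarded and the residue transformed again.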

Table 4 shows the comparison of the three developed architectures. The last version has the fewest processing cycles and the least memory access. Compared with a software implementation on RISC, which requires 0.28M cycles to encode one MB, the proposed architecture is 215.4 times faster. The chip only needs to operate at about 50 MHz to meet the SDTV specification.

Figure 4: Illustration of final system architecture. (Block diagram: current MB buffer 96x32, reconstructed MB buffer 96x32, plane prediction buffer 64x32, coefficient buffers CoefBuf1 to CoefBuf4 of 96x16 each, intra predictor generation, decoded block boundary handling, DIFF, DCT/Hadamard, cost generation, mode decision, DC coefficient registers for I16 modes with Hadamard transform of DC coefficients, best coefficient registers, best mode registers, Q, IQ, IDCT/IHadamard, CAVLC & Exp-Golomb with MB header, three control FSMs, and an external upper line buffer for decoded pixels, MP mode, Nu, and Nl. The degrees of parallelism are 4, 2, and 1. Current MB and previous MB tasks are interleaved between source input and bitstream output.)

Table 4: Comparison of system architectures.

| Architecture | Initial | MB Pipelining | MB Pipelining and DCT-Based Mode Decision |
|---|---|---|---|
| Parallelism Optimized | YES | YES | YES |
| Task Schedule | Sequential | Interleaved | Interleaved |
| Mode Decision Method | Hadamard-Based | Hadamard-Based | DCT-Based |
| Processing Cycles / MB | < 2400 | < 1300 | < 1300 |
| On-Chip SRAM Access (Mbytes/s) | 113.17 | 113.17 | 72.25 |
| Coefficient Buffer (bits) | 16x16 | 96x64 | 96x64 |
| Required Frequency | 97.2 MHz | 52.7 MHz | 52.7 MHz |
| Bus Bandwidth (Mbytes/s) | ~20 | ~20 | ~20 |

4. MODULE DESIGN

We developed a four-parallel reconfigurable intra predictor generator to achieve resource sharing among all prediction modes. Due to limited space, only I16MB plane prediction is explained. We also developed an area-speed efficient four-parallel multi-transform engine; details can be found in [6]. The CAVLC engine is described later.

4.1. I16MB Prediction Modes

The I16MB plane prediction mode, which is an approximation of a bilinear transform, is defined as

pred[y, x] = Clip1((a + b*(x - 7) + c*(y - 7) + 16) >> 5), x = 0..15, y = 0..15,

where the plane parameters a, b, and c are derived from the neighboring reconstructed pixels. We propose a decomposition technique to avoid multiplications. First, there is a short setup period to precompute the plane parameters and buffer them in registers. Next, four seed values, A0, A1, A2, and A3, are computed. With the precomputed parameters and these seed values, all the other I16MB plane predictors can be computed by add and shift operations, as expressed in Fig. 5. The proposed PE is shown in Fig. 6.
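The decomposition can be checked numerically with a short sketch. The parameter derivation follows the definition in the H.264 standard; the way the four seeds are laid out (one per group of four columns, each group advancing by c per row) is an assumption made for illustration, since only the role of the seeds is stated here, and all function names are hypothetical.

```python
def clip1(v):
    """Clamp to the 8-bit pixel range."""
    return max(0, min(255, v))

def plane_params(top, left, corner):
    """Plane parameters a, b, c from the neighbors of a 16x16 block.

    top[x] = p[y=-1, x], left[y] = p[y, x=-1], corner = p[-1, -1].
    """
    def t(i):
        return corner if i < 0 else top[i]

    def l(i):
        return corner if i < 0 else left[i]

    h = sum((i + 1) * (t(8 + i) - t(6 - i)) for i in range(8))
    v = sum((i + 1) * (l(8 + i) - l(6 - i)) for i in range(8))
    return 16 * (left[15] + top[15]), (5 * h + 32) >> 6, (5 * v + 32) >> 6

def plane_direct(a, b, c):
    """Reference form with two multiplications per pixel."""
    return [[clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)] for y in range(16)]

def plane_incremental(a, b, c):
    """Add-and-shift form: four outputs per step from one accumulator."""
    a0 = a - 7 * b - 7 * c + 16                 # value behind pred[0, 0]
    b4 = b << 2
    seeds = [a0, a0 + b4, a0 + b4 + b4, a0 + b4 + b4 + b4]  # assumed seed layout
    offsets = [0, b, b << 1, b + (b << 1)]      # 0, b, 2b, 3b without multiplying
    pred = [[0] * 16 for _ in range(16)]
    for g, seed in enumerate(seeds):
        acc = seed
        for y in range(16):
            for k in range(4):
                pred[y][4 * g + k] = clip1((acc + offsets[k]) >> 5)
            acc += c                            # next row: add c before shifting
    return pred

top = [(37 * i + 11) % 256 for i in range(16)]
left = [(91 * i + 200) % 256 for i in range(16)]
a, b, c = plane_params(top, left, corner=128)
print(plane_direct(a, b, c) == plane_incremental(a, b, c))
```

The accumulation must be done on the unshifted sums; the shift by 5 and the clip are applied only when a predictor is output.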

4.2. Bitstream Generation

Figure 7 shows the bitstream generator. The macroblock header is first produced. Then, the CAVLC forms the bitstream block by block.

Figure 5: Decomposition of I16MB plane prediction mode.

Predictors for an MB (x = 0..15, y = 0..15):
pred[y, x] = Clip1((a + b*(x - 7) + c*(y - 7) + 16) >> 5)

Let A0 = a - 7b - 7c + 16, the first of the four seed values A0, A1, A2, and A3.

pred[0, 0] = Clip1((a + b*(-7) + c*(-7) + 16) >> 5) = Clip1(A0 >> 5)
pred[0, 1] = Clip1((a + b*(-6) + c*(-7) + 16) >> 5) = Clip1((A0 + b) >> 5)
pred[0, 2] = Clip1((a + b*(-5) + c*(-7) + 16) >> 5) = Clip1((A0 + 2b) >> 5)
pred[0, 3] = Clip1((a + b*(-4) + c*(-7) + 16) >> 5) = Clip1((A0 + 3b) >> 5)

Each following row adds c to the unshifted sum of the row above:

pred[1, 0] = Clip1((a + b*(-7) + c*(-6) + 16) >> 5) = Clip1((A0 + c) >> 5)
pred[1, 1] = Clip1((a + b*(-6) + c*(-6) + 16) >> 5) = Clip1((A0 + b + c) >> 5)
pred[1, 2] = Clip1((a + b*(-5) + c*(-6) + 16) >> 5) = Clip1((A0 + 2b + c) >> 5)
pred[1, 3] = Clip1((a + b*(-4) + c*(-6) + 16) >> 5) = Clip1((A0 + 3b + c) >> 5)

pred[2, 0] = Clip1((a + b*(-7) + c*(-5) + 16) >> 5) = Clip1((A0 + 2c) >> 5)
pred[2, 1] = Clip1((a + b*(-6) + c*(-5) + 16) >> 5) = Clip1((A0 + b + 2c) >> 5)
pred[2, 2] = Clip1((a + b*(-5) + c*(-5) + 16) >> 5) = Clip1((A0 + 2b + 2c) >> 5)
pred[2, 3] = Clip1((a + b*(-4) + c*(-5) + 16) >> 5) = Clip1((A0 + 3b + 2c) >> 5)

Figure 6: Proposed four-parallel reconfigurable intra predictor generator. (Block diagram: four parallel data paths, each built from registers, multiplexers, a round-and-shift stage, and a clip stage, producing Predictor Outputs 0 to 3. Each data path takes the neighboring pixels A to M as inputs for the I4MB prediction modes, the boundary pixels p[y=-1, x] and p[y, x=-1] of its own column and row positions (x or y = 0,4,8,12 for output 0; 1,5,9,13 for output 1; 2,6,10,14 for output 2; 3,7,11,15 for output 3), and a, b, c with the four seeds. The configurations are: compute the DC predictor of I4MB/I16MB/chroma; bypass for the vertical and horizontal predictors of I16MB/chroma; use accumulation to compute the I16MB/chroma plane predictors.)

It takes sixteen cycles to load the coefficients of a 4x4-block from the memory in reverse zigzag scan order. During loading, the level detection checks whether each coefficient is zero. If the level is nonzero, it is stored in the level-FIFO, and the corresponding run information is stored in the run-FIFO. At the same time, the trailing one counter, total coefficient counter, and run counter update the corresponding counts in registers. After the scan, the total coefficient/trailing one module outputs the code word to the packer by looking up the VLC table according to the total coefficient register and the trailing one register. Next, level code information is sent to the packer by looking up the VLC table for levels, followed by total zeros and runs.
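The scan described above can be modeled in software as a single pass over the block in reverse zigzag order. The sketch below only gathers the symbols (levels, runs, and the three counts); the VLC table lookups and the packer are left out, and the function and variable names are illustrative rather than taken from the hardware.

```python
# Zigzag scan order of a 4x4 block, as raster indices.
ZIGZAG = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def cavlc_symbols(block16):
    """Collect CAVLC symbols from 16 quantized coefficients in raster order.

    Returns (levels, runs, total_coeff, trailing_ones, total_zeros), where
    levels and runs are listed from the highest-frequency coefficient down.
    """
    levels, runs = [], []          # software stand-ins for the level and run FIFOs
    trailing_ones = 0
    counting_t1 = True
    run = 0
    for pos in reversed(ZIGZAG):   # reverse zigzag: high frequency first
        coeff = block16[pos]
        if coeff == 0:
            if levels:             # zeros above the last nonzero are ignored
                run += 1
            continue
        if levels:
            runs.append(run)       # zeros preceding the previously stored level
        run = 0
        levels.append(coeff)
        if counting_t1 and abs(coeff) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            counting_t1 = False    # a larger level or a fourth one ends the count
    if levels:
        runs.append(run)           # zeros before the lowest-frequency level
    return levels, runs, len(levels), trailing_ones, sum(runs)

block = [0, 3, -1, 0,
         0, -1, 1, 0,
         1, 0, 0, 0,
         0, 0, 0, 0]
print(cavlc_symbols(block))
```

Scanning from the high-frequency end means the trailing ones are met first, so their count is settled before any larger level arrives.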

Figure 7: Hardware architecture of bitstream generation engine. (Block diagram: zig-zag scan address generation for Y, U, and V; level detection; trailing one counter, total coefficient counter, and run/total zero counter; level and run FIFOs; table selection using Nu and Nl from off chip; total coefficient/trailing one table (VLC0 to VLC2); level table (VLC0 to VLC6); total coefficient/total zeros table; run before/zeros left table; Exp-Golomb code of the MB header; a multiplexer that passes codeword and code length to the VLC packer; bitstream output.)

Figure 8: Chip photo and specifications.

| Item | Value |
|---|---|
| Technology | TSMC 0.25 µm CMOS 1P5M |
| Package | CQFP 208 |
| Core Size | 1.855 x 1.885 mm² |
| Logic Gate Count | 84,985 |
| On-Chip SRAM | Single Port 64x32 (x1), Single Port 96x32 (x2), Dual Port 96x16 (x4) |
| Transistor Count | 429,139 |
| Max. Clock Rate | 55 MHz |
| Processing Capability | 434 fps for 4:2:0 QCIF (176x144); 107 fps for 4:2:0 CIF (352x288); 31 fps for 4:2:0 SDTV (720x480); 16.26 Mega-pixels within 1 sec |

5. IMPLEMENTATION RESULTS

The chip photo and specifications are shown in Fig. 8. I16MB plane predictors are buffered in one 64x32 RAM to avoid regenerating the predictors when the plane mode is selected as the best MB mode. Two 96x32 RAMs are used to store the current MB and the reconstructed MB. The other four RAMs are used as residue buffers for MB pipelining.
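The frame rates listed in Fig. 8 can be cross-checked against the per-MB cycle bound. The sketch below assumes exactly 1300 cycles per MB, which is only the upper bound given in Table 4, so the results are conservative estimates; the function name is illustrative.

```python
def min_fps(width, height, clock_hz, cycles_per_mb=1300):
    """Frame rate sustained if every MB takes the full cycles_per_mb budget."""
    macroblocks = (width // 16) * (height // 16)
    return clock_hz / (macroblocks * cycles_per_mb)

CLOCK_HZ = 55e6  # maximum clock rate from Fig. 8

for name, (w, h) in {"QCIF": (176, 144),
                     "CIF": (352, 288),
                     "SDTV": (720, 480)}.items():
    print(f"{name}: at least {min_fps(w, h, CLOCK_HZ):.1f} fps")
```

These estimates sit at or slightly below the measured figures, which is consistent with the actual cycle count per MB being under 1300.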

6. CONCLUSION

This paper presents a VLSI architecture design for an H.264/AVC intra frame coder. We provide an analysis to obtain suitable degrees of parallelism under the SDTV specification. MB pipelining and DCT-based mode decision are then proposed to double the speed and to reduce memory access by 40%, respectively. Area-speed efficient modules are also designed. Our implementation is capable of encoding 16 Mega-pixels within one second at 54 MHz with a core area of 1.855x1.885 mm².

7. REFERENCES

[1] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.

[2] JPEG 2000 Part I, ISO/IEC JTC1/SC29/WG1 Final Committee Draft, Rev. 1.0, Mar. 2000.

[3] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 688–703, July 2003.

[4] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 598–603, July 2003.

[5] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.

[6] T. C. Wang, Y. W. Huang, H. C. Fang, and L. G. Chen, “Parallel 4x4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264,” in Proc. of IEEE International Symposium on Circuits and Systems, 2003.
