Hybrid scheduling

CHAPTER 4 ARCHITECTURE DESIGN OF MPEG-2/H.264/AVC DECODER

4.6.3 Hybrid scheduling

Fig. 4.24 (a) Slice memory, shown as the grid or shaded region, and (b) content memory, shown as the black-dotted region, for a frame of width N and height M. Figure labels: upper neighbor, left neighbor, 16x16 MB.


To reduce the overhead of reloading data when the filtering direction switches from horizontal to vertical, we adopt a hybrid filter scheduling that re-schedules the standard-defined edges. The de-blocking filter in the H.264/AVC system processes the vertical edges first and then the horizontal edges. From this standard-defined ordering, we can deduce the filter order on each 4x4 block, as shown in Fig. 4.25(a): within one 4x4 block, the left edge is filtered first and the lower edge last. We develop a novel filter ordering to schedule the filter process on each edge, as shown in Fig. 4.25(b). The filter order of every block still obeys the rule of left edge first and lower edge last. Compared with the traditional scheduling [13][14], our method avoids re-accessing data for the other filtering direction and combines the vertical and horizontal filtering while remaining standard-compliant.
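The per-block ordering rule can be checked mechanically. The following Python sketch is an illustration only: the edge naming ("V" or "H", block row, block column), the function names, and the interleaved schedule are assumptions made for this example and are not the ordering of Fig. 4.25(b). It also assumes the per-block order left, right, top, lower, which follows from filtering all vertical edges from left to right before all horizontal edges from top to bottom.

```python
from itertools import product

N = 4  # a 16x16 luma MB holds a 4x4 grid of 4x4 blocks

# Hypothetical edge naming: ("V", r, c) is the left edge of block (r, c),
# ("H", r, c) is its top edge.
ALL_EDGES = {(d, r, c) for d in "VH" for r, c in product(range(N), repeat=2)}


def block_edges(r, c):
    """Edges of block (r, c) inside this MB, listed left, right, top, lower."""
    edges = [("V", r, c)]                  # left vertical edge
    if c + 1 < N:
        edges.append(("V", r, c + 1))      # right edge = left edge of next block
    edges.append(("H", r, c))              # top horizontal edge
    if r + 1 < N:
        edges.append(("H", r + 1, c))      # lower edge = top edge of block below
    return edges


def is_compliant(schedule):
    """True if each edge is filtered once and every block sees its edges
    in the order left -> right -> top -> lower."""
    if len(schedule) != len(ALL_EDGES) or set(schedule) != ALL_EDGES:
        return False
    position = {edge: i for i, edge in enumerate(schedule)}
    for r, c in product(range(N), repeat=2):
        order = [position[e] for e in block_edges(r, c)]
        if order != sorted(order):
            return False
    return True


def standard_order():
    """All vertical edges (left to right), then all horizontal edges."""
    vertical = [("V", r, c) for c in range(N) for r in range(N)]
    horizontal = [("H", r, c) for r in range(N) for c in range(N)]
    return vertical + horizontal


def interleaved_order():
    """Illustrative mixed schedule: a block's top edge is filtered as soon
    as both of its vertical edges are done."""
    order = []
    for r in range(N):
        for c in range(N):
            order.append(("V", r, c))
            if c > 0:
                order.append(("H", r, c - 1))
        order.append(("H", r, N - 1))
    return order


print(is_compliant(standard_order()), is_compliant(interleaved_order()))
```

Both schedules pass the check, whereas a schedule that filters the horizontal edges before the vertical ones fails it. This is the sense in which a mixed vertical/horizontal order can stay compatible with the standard-defined ordering.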

Fig. 4.25 Hybrid scheduling method

We use four 4x4 pixel buffers to hold the temporary data during the hybrid scheduling process. In Fig. 4.26(a), each MB is partitioned into two main parts (Loop Filter-MB-Upper and Loop Filter-MB-Lower) to reduce the required buffer size. Each part takes eight time instances to complete the filtering procedure, as shown in Fig. 4.26(b). The grid region represents the neighboring block, and the shaded region marks the position of the data buffer, whose size is four 4x4 blocks. Except in the initial state t1, there is no need to keep the neighboring block in the data buffer, because the neighboring block and the current MB are located in different memory modules. Both can be accessed in the same time instance and sent to the inputs of the edge filter.

Fig. 4.26 The partitioned MB and each time instance when applying the hybrid scheduling method: (a) LF-MB = LF-MB-U + LF-MB-L; (b) time instances t1 to t8

We derive the filter ordering of the proposed hybrid scheduling method in Fig. 4.26(b). Each bold line represents the edge to be filtered in that time instance. The filter ordering at each time instance t1 to t8 complies with the hybrid scheduling in Fig. 4.25(a). In the same way, the proposed scheduling is also applied to the chroma 4x4 blocks.

The main problem of the in/post-loop de-blocking filter is its considerable amount of memory access and processing cycles. To apply the proposed hybrid scheduling to the overall system and enhance the system throughput, we use a high-throughput de-blocking filter architecture. Fig. 4.27 shows the block diagram and data flow of the proposed design. The external frame buffer is an off-chip memory whose size is determined by the frame size. The shaded arrows denote the data flow inside the de-blocking filter unit, and the black arrows denote the data flow outside it. The pixel buffer stores the intermediate pixel values produced during the proposed hybrid scheduling. It holds four 4×4 blocks of pixel values. Moreover, in each time instance it is located at the position shown by the shaded regions of Fig. 4.26(b).

Fig. 4.27 The block diagram and data flow of the MPEG-2/H.264 combined de-blocking filter. Figure labels: Intra/Inter Prediction, IDCT, Slice Memory, Content Memory, Threshold Memory, Pixel Buffer (four 4x4 sub-blocks), Triple P-i-P-o Edge Filter, Triple-Mode Decision & Control, External Frame Memory, De-Blocking Filter Unit; memory sizes 96x32 and 4:2:0 96x32; filter inputs p0~3 and q0~3.

Chapter 5 Chip Implementation for Digital TV Applications

5.1 System Specification

In our MPEG-2/H.264 dual-mode decoder design, the MPEG-2 part targets the MPEG-2 simple profile at main level (SP@ML); Table 5.1 shows the details of this profile. The H.264/AVC part targets the H.264/AVC baseline profile at level 3.2; Table 5.2 shows the details of this profile.

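Table 5.1 Simple profile @ Main level of MPEG-2 system

| Item | Value |
|---|---|
| No. of layers | 1 |
| Layer id | 0 |
| Scalable mode | Base |
| Maximum sample density (samples per line / lines per frame / frames per second) | 720/576/30 |
| Maximum luminance sample rate (samples/s) | 10,368,000 |
| Maximum bit rate (Mbit/s) | 15 |
| VBV buffer size (bits) | 1,835,008 |
| Profile and level indication | SP@ML |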

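Table 5.2 Baseline profile @ level 3.2 of H.264 system

| Parameter | Value |
|---|---|
| Level | 3.2 |
| Max macroblock processing rate MaxMBPS (MB/s) | 216,000 |
| Max frame size MaxFS (MBs) | 5,120 |
| Max decoded picture buffer size MaxDPB (1024 bytes) | 7,680.0 |
| Max video bit rate MaxBR (1000 bits/s or 1200 bits/s) | 20,000 |
| Max CPB size MaxCPB (1000 bits or 1200 bits) | 20,000 |
| Vertical MV component range MaxVmvR (luma frame samples) | [-512,+511.75] |
| Min compression ratio MinCR | 4 |
| Max number of motion vectors per two consecutive MBs MaxMvsPer2MB | 16 |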

The maximum computational capability supports real-time decoding of 1080i (1920x1088) MPEG-2 video sequences and SXGA (1280x1024) H.264 video sequences at 30fps. The required operating frequency is 80.92MHz for MPEG-2 and 79.64MHz for H.264.
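The two frequency figures can be read as a cycle budget per macroblock. The sketch below is a back-of-the-envelope check, not part of the design: it assumes that the required frequency scales linearly with the number of 16x16 macroblocks per second, and the function names are hypothetical. Under that assumption the budgets come out at roughly 330.6 cycles per MB for MPEG-2 and 518.5 cycles per MB for H.264, and the same budgets reproduce the 720p operating points reported in Section 5.3 (35.7MHz and 56MHz).

```python
import math


def macroblocks(width, height, mb_size=16):
    """Number of 16x16 macroblocks needed to cover one frame."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def cycles_per_mb(freq_hz, width, height, fps=30):
    """Average clock cycles available per macroblock at a given frequency."""
    return freq_hz / (macroblocks(width, height) * fps)


def required_freq_mhz(cycle_budget, width, height, fps=30):
    """Clock (MHz) needed for real-time decoding with a fixed cycle budget."""
    return cycle_budget * macroblocks(width, height) * fps / 1e6


# Budgets derived from the figures quoted in Section 5.1.
mpeg2_budget = cycles_per_mb(80.92e6, 1920, 1088)   # 1080i MPEG-2
h264_budget = cycles_per_mb(79.64e6, 1280, 1024)    # SXGA H.264

print(round(mpeg2_budget, 1), round(h264_budget, 1))
print(round(required_freq_mhz(mpeg2_budget, 1280, 720), 1))
print(round(required_freq_mhz(h264_budget, 1280, 720), 1))
```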

5.2 Design Flow

We use a standard-cell-based design flow. Fig. 5.1 shows the flow from system specification to the physical level.

Fig. 5.1 Design flow from system specification to physical-level. Stages shown: system specification, system design, architecture design, RTL level design, physical level design.

In the system design stage, we first estimated the throughput required by the specification, applied the 4x4-sub-block-level pipeline scheme, and then modified it into a hybrid scheme because the macroblock-level pipelining scheme is more suitable for some modules. We carefully evaluated the efficiency of different decoding orderings for all modules, because the ordering is an important interface between modules, and finally chose the 1x4 column-by-column decoding ordering for the implementation. Because we aimed at a multi-mode decoder design, hardware sharing also had to be considered in this first stage. The overall block diagram and data flow were designed in this stage.

In the architecture design stage, the work was divided mainly among four people: one for motion compensation, one for entropy decoding, one for the de-blocking filter, and one (the author) for the system design and the other modules. Hardware sharing between the two systems had to be considered in the design of each module. Meeting the required throughput was the primary goal for each module. Under this throughput constraint, we focused on making the architecture of each module low-complexity and low-power. Several low-complexity architectures and low-power techniques were derived in this stage.

The RTL design proceeds alongside the architecture design. Its main task is to translate the architecture of each module into an RTL description, with the goal that the synthesis result matches the designed architecture. Some synthesizer-oriented coding techniques are therefore applied in this stage. It is also important that the RTL code be synthesizable and easy to understand.

In the physical design stage, the CAD tools are important. Making good use of these tools and carrying out the remaining work carefully is the key to the final result. The design margin, the technology used, and the nanometer-scale effects of deep sub-micron circuits also need to be considered. At the end of the physical design stage, our work is taped out for prototyping and final verification.

5.3 Implementation Result

In our work, we implemented an MPEG-2/H.264 dual-mode decoder. Fig. 5.2 shows its layout. The total gate count is about 491K, and the chip size is 3.9x3.9mm² in 0.18um technology. The maximum working frequency is 83.3MHz. The chip decodes a 720p H.264 video sequence at 56MHz and a 720p MPEG-2 video sequence at 35.7MHz, both at 30fps, with power consumptions of 44.35mW and 30.15mW, respectively.

Fig. 5.2 Layout of this work. Labeled blocks: Motion Compensation, In/Post-Loop Deblocking Filter, Intra Prediction, Content Memory, IDCT, Syntax Parser, IQ, I-ZZ, CAVLC/VLC.

Table 5.3 Chip details

| Items | Specification |
|---|---|
| Function | H.264 Baseline@Level 3.2; MPEG-2 SP@ML |
| Gate counts | 491,260 (on-chip SRAM included) |
| Technology | 0.18um 1P6M |
| Supply voltage | 3.3V/1.8V |
| Die size | 3.9x3.9mm² |
| Package | 208CQFP |
| Max working frequency | 83.3MHz |
| Core power consumption (720p@30fps) | 44.35mW@56MHz (H.264); 30.15mW@35.7MHz (MPEG-2) |

5.4 Measurement Results and Comparison

Table 5.4 shows the power report of our work. Decoding CIF, NTSC, and 720pHD MPEG-2 video sequences consumes 3.12mW, 11.15mW, and 30.15mW, respectively; decoding CIF, NTSC, and 720pHD H.264 video sequences consumes 4.51mW, 16.39mW, and 44.35mW, respectively.

Table 5.4 Power report

| Items (core power) | MPEG-2 power analysis | H.264 power analysis |
|---|---|---|
| 720pHD (1280x720) | 30.15mW@35.7MHz | 44.35mW@56MHz |
| NTSC (648x486) | 11.15mW@13.2MHz | 16.39mW@20.7MHz |
| CIF (352x288) | 3.12mW@3.7MHz | 4.51mW@5.7MHz |
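One way to read Table 5.4 is to divide each power figure by its operating frequency. The short Python sketch below does only that arithmetic on the reported values; the dictionary layout and function name are choices made for this example. The ratio stays nearly constant within each standard (about 0.84mW/MHz for MPEG-2 and about 0.79mW/MHz for H.264), which suggests that core power scales roughly linearly with clock frequency across the three formats.

```python
# (power in mW, frequency in MHz), copied from Table 5.4
POWER_REPORT = {
    "MPEG-2": {"720pHD": (30.15, 35.7), "NTSC": (11.15, 13.2), "CIF": (3.12, 3.7)},
    "H.264": {"720pHD": (44.35, 56.0), "NTSC": (16.39, 20.7), "CIF": (4.51, 5.7)},
}


def mw_per_mhz(report):
    """Power per unit clock frequency for every standard and format."""
    return {
        standard: {fmt: power / freq for fmt, (power, freq) in rows.items()}
        for standard, rows in report.items()
    }


ratios = mw_per_mhz(POWER_REPORT)
for standard, rows in ratios.items():
    for fmt, value in rows.items():
        print(f"{standard:7s} {fmt:7s} {value:.3f} mW/MHz")
```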

Table 5.5 compares our work with the state of the art. Pure ASIC decoders are hard to find; most published works include a RISC processor or are ARM-based, so a fair comparison is difficult. However, the comparison still shows that our work is a good solution for a dual-mode H.264/MPEG-2 decoder.

Table 5.5 Comparisons

| | Proposed [1]-[4] (ISCAS'05, VLSI-TSA'05) | C&S [11] (ISCAS'04) | Conexant [18] (ISCE'04) | NTU [19] (ISCAS'05) |
|---|---|---|---|---|
| Specification | 1280x720@30fps | 1920x1088@30fps | 2048x1024@30fps | 2048x1024@30fps |
| Operating frequency | 56MHz | 130MHz (local bus: 170MHz) | 200MHz | 120MHz |
| Technology | 180nm (1.8V) | 130nm (1.2V) | 130nm (1.2V) | 180nm (1.8V) |
| Profile | H.264 baseline, MPEG-2 SP@ML | H.264 baseline | MPEG-4 SP, H.261, H.263, JPEG, H.264 main | H.264 baseline |
| Implementation | ASIC | ARM-based | ARM-based | ASIC+RISC |
| Gate count | 491K | 910K | 300K | 217K |
| Internal memory | 24K bytes | N/A | 74K bytes | 10K bytes |
| Power | 44.35mW | 554mW | 160mW | N/A |
| Normalized power* | 100.92mW | 2422.88mW | 691.89mW | N/A |

*Normalized to 180nm (1.8V), 2048x1024@30fps
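For the proposed design, which is already in 180nm at 1.8V, the normalized power in Table 5.5 is consistent with scaling the measured power by the ratio of pixels per frame. The sketch below reproduces only that step; the function name is hypothetical, and the linear-in-pixel-count scaling is an assumption inferred from the numbers. The technology and voltage factors applied to the other columns are not spelled out in the text, so they are not reproduced here.

```python
def scale_power_by_resolution(power_mw, src_resolution, dst_resolution):
    """Scale power linearly with the number of pixels per frame."""
    src_pixels = src_resolution[0] * src_resolution[1]
    dst_pixels = dst_resolution[0] * dst_resolution[1]
    return power_mw * dst_pixels / src_pixels


# Proposed design: 44.35 mW at 1280x720@30fps, normalized to 2048x1024@30fps.
normalized = scale_power_by_resolution(44.35, (1280, 720), (2048, 1024))
print(round(normalized, 2))
```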

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this work, we implemented a dual-mode H.264/MPEG-2 video decoder. We adopted many design techniques at both the system level and the architecture level.

From the system point of view, first, we proposed the hybrid 4x4-sub-block pipelining scheme, which saves 93.75% of the intermediate buffers compared with the macroblock-level pipelining scheme at the cost of a slight throughput degradation. The instantaneous switching scheme reduces the latency between pipeline stages to a minimum. Second, we proposed the efficient 1x4 column-by-column decoding ordering, which saves 28% of the memory accesses and 28% of the processing cycles in the motion compensation process, as well as 17% of the memory accesses in the intra predictor. Third, we proposed a variable-length FIFO architecture to solve the synchronization problem of adding pixels from the residual and prediction paths. Fourth, exploiting the coded-block-pattern saves 30% to 86% of the power in the inverse quantizer and IDCT modules for qP ranging from 20 to 48.
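The 93.75% figure matches a simple size ratio. As a hedged reading (the text does not state the derivation), if each pipeline stage buffers one 4x4 sub-block instead of one 16x16 macroblock, the saving is 1 - 16/256. The helper name below is hypothetical.

```python
def buffer_saving(unit_width, unit_height, mb_width=16, mb_height=16):
    """Fraction of buffer saved when a pipeline stage holds one
    unit_width x unit_height sub-block instead of a whole macroblock."""
    return 1 - (unit_width * unit_height) / (mb_width * mb_height)


saving = buffer_saving(4, 4)
print(f"{saving:.2%}")
```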

In the architecture design, first, we proposed a hierarchical syntax parser. It is easy to design and is well suited to a bit-stream with a hierarchical structure. With the hierarchical enable signals in these parsers, clock gating can save up to 86% of the parser power. Second, the register sharing technique is applied to the syntax parsers of both systems. This technique reduces the number of registers required by the two systems, saving 26% of the registers. Third, we implemented the Exp-Golomb decoder for parsing the H.264 bit-stream. Its dedicated interface allows this decoder to be shared by all parsers. Fourth, three types of reusable buffers (upper, left, and corner) are proposed for the intra predictor. With the aid of these buffers, the directional modes become easy to implement and the number of memory accesses is also reduced.
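For readers unfamiliar with Exp-Golomb codes, the following Python sketch shows the decoding rule for the unsigned ue(v) and signed se(v) syntax elements of H.264. It is a software illustration under simplifying assumptions (the bit-stream is a string of '0' and '1' characters, and the function names are hypothetical); it does not describe the hardware decoder or its interface.

```python
def read_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code ue(v).

    bits is a string of '0'/'1' characters; returns (value, next_position).
    """
    zeros = 0
    while bits[pos + zeros] == "0":        # count the leading zero bits
        zeros += 1
    pos += zeros + 1                       # skip the zeros and the marker '1'
    suffix = bits[pos:pos + zeros]         # the next `zeros` bits are the offset
    if len(suffix) < zeros:
        raise ValueError("truncated Exp-Golomb code")
    value = (1 << zeros) - 1 + (int(suffix, 2) if zeros else 0)
    return value, pos + zeros


def read_se(bits, pos=0):
    """Decode one signed Exp-Golomb code se(v).

    Code numbers 1, 2, 3, 4, ... map to +1, -1, +2, -2, ...
    """
    code, pos = read_ue(bits, pos)
    magnitude = (code + 1) // 2
    return (magnitude if code % 2 else -magnitude), pos


# Three codes back to back: '1', '010', '011'
stream = "1010011"
pos, values = 0, []
while pos < len(stream):
    value, pos = read_ue(stream, pos)
    values.append(value)
print(values)
```

Because every code word is self-delimiting, a single decoder of this kind can serve any parser that hands it the current bit position, which is in keeping with sharing one Exp-Golomb decoder among all parsers.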

In our final chip implementation, the total gate count is about 491K and the maximum working frequency is 83.3MHz. The chip supports real-time decoding of a 720pHD H.264 sequence at 56MHz and a 720pHD MPEG-2 sequence at 35.7MHz, both at 30fps. The power consumption is 44.35mW for the 720p H.264 sequence and 30.15mW for the 720p MPEG-2 sequence.

6.2 Future Work

In our future work, we will first try to integrate and combine more functional blocks of the two systems, such as the IDCT and the inverse quantizer. For the IDCT, we will work on splitting the 8x8 IDCT formula into two stages of 4x4 and 2x2 IDCTs so that the 4x4 IDCT module for H.264 can be shared with the 8x8 IDCT operation of MPEG-2. Then, we will try to add CABAC and other functional blocks to our current work to support the H.264 main profile. We will also try to identify the critical path in our design so that the decoder can run at more than 120MHz and support real-time decoding of 1080i H.264 video sequences at 30fps.

Bibliography

[1] Ting-An Lin, Sheng-Zen Wang, Tsu-Ming Liu and Chen-Yi Lee, "An H.264/AVC Decoder with 4x4-block level pipeline", ISCAS 2005

[2] Ting-An Lin, Tsu-Ming Liu and Chen-Yi Lee, "A Low-Power H.264/AVC Decoder", VLSI-TSA 2005

[3] Sheng-Zen Wang, Ting-An Lin, Tsu-Ming Liu and Chen-Yi Lee, "A New Motion Compensation Design For H.264/AVC Decoder", ISCAS 2005

[4] Tsu-Ming Liu, Wen-Ping Lee, Ting-An Lin and Chen-Yi Lee, "A Memory-Efficient Deblocking Filter For H.264/AVC Video Coding", ISCAS 2005

[5] Shih-Hao Wang, Wen-Hsiao Peng et al., “A Platform-Based MPEG-4 Advanced Video Coding (AVC) Decoder with Block Level Pipelining”, Information, Communications and Signal Processing, ICICS-PCM December 2003

[6] Tung-Chien Chen, Yu-Wen Huang, and Liang-Gee Chen, “Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture”, ISCAS 2004

[7] Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification”, ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC, May 2003

[8] Iain E. G. Richardson, “H.264 and MPEG-4 video compression”, John Wiley & Sons, autumn 2003, ISBN 0-470-84837-5

[9] Ville Lappalainen, Antti Hallapuro, and Timo D. Hamalainen, “Complexity of Optimized H.26L Video Decoder Implementation”, IEEE Transactions on Circuits and Systems for Video Technology, July 2003

[10] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, “Hardware architecture design for H.264/AVC intra frame coder”, ISCAS 2004

[11] Hae-Yong Kang, Kyung-Ah Jeong, Jung-Yang Bae, Young-Su Lee, Seung-Ho Lee, “MPEG4 AVC/H.264 decoder with scalable bus architecture and dual memory controller”, ISCAS 2004

[12] K. Suhring, Ed., JM 8.2 reference software (online), 2004. Available at ftp://ftp.imtc.org/jvt-experts/

[13] Yu-Wen Huang, To-Wei Chen, Bing-Yu Hsieh, Tu-Chih Wang, Te-Hao Chang and Liang-Gee Chen, “Architecture Design for Deblocking Filter in H.264/JVT/AVC”, International Conference on Multimedia and Expo (ICME’03), Vol. 1, pp. I-693-6, July 2003.

[14] Miao Sima, Yuanhua Zhou and Wei Zhang, “An Efficient Architecture for Adaptive Deblocking Filter of H.264/AVC Video Coding”, IEEE Transactions on Consumer Electronics, Vol. 50, Issue 1, pp. 292-296, Feb. 2004.

[15] He-Wei Feng, Zhi-Gang Mao, Jin-Xiang Wang, Dao-Fu Wang, “Design and implementation of motion compensation for MPEG-4 AS profile streaming video decoding”, Proceedings of the 5th International Conference on ASIC, Oct. 2003.

[16] Tung-Chien Chen, Yu-Wen Huang, and Liang-Gee Chen, “Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC,” IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2004.

[17] Shu-Tzu Lin, Chen-Yi Lee, “Analysis and design of a high-throughput two dimension inverse scan discrete cosine transform processor”, Master Thesis, Department of Electronics Engineering, National Chiao Tung University, Taiwan, June 2000

[18] Y. Hu, A. Simpson, K. McAdoo, and J. Cush, “A high definition H.264/AVC hardware video decoder core for multimedia SoC’s,” Proc. ISCE 2004.

[19] T. W. Chen, Y. W. Huang, T. C. Chen, Y. H. Chen, C. Y. Tsai, and L. G. Chen, “Architecture Design of H.264/AVC Decoder with Hybrid Task Pipelining for High Definition Videos,” Proc. ISCAS 2005.

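Author Biography

Name: Ting-An Lin (林亭安)

Place of birth: Taipei City, Taiwan

Date of birth: 1980. 11. 24

Education:

1987. 9 ~ 1993. 6 Taipei Municipal Da-An Elementary School

1993. 9 ~ 1996. 6 Taipei Municipal Heping Junior High School

1996. 9 ~ 1999. 6 Taipei Municipal Songshan Senior High School

1999. 9 ~ 2003. 6 B.S., Department of Electronics Engineering, National Chiao Tung University

2003. 9 ~ 2005. 6 M.S., Institute of Electronics (Systems Group), National Chiao Tung University

Awards

2000/06 Academic Excellence Award (書卷獎)

2001/01 Academic Excellence Award (書卷獎)

2003/05 2003 National IC Design Contest, Design Completeness Award

2003/06 Electronics Laboratory Project Competition, awarded the 殷之同 Scholarship

2004/05 2004 National IC Design Contest, Design Completeness Award

2004/06 Academic Excellence Award (書卷獎)

2004/10 2004 National SoC Design Contest, Optoelectronics and Communications category, SIP group, Outstanding Award

2005/05 2005 National IC Design Contest, Excellent Award

Publications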

Ting-An Lin, Sheng-Zen Wang, Tsu-Ming Liu and Chen-Yi Lee, "An H.264/AVC Decoder with 4x4-block level pipeline", ISCAS 2005

Ting-An Lin, Tsu-Ming Liu and Chen-Yi Lee, "A Low-Power H.264/AVC Decoder", VLSI-TSA 2005

Sheng-Zen Wang, Ting-An Lin, Tsu-Ming Liu and Chen-Yi Lee, "A New Motion Compensation Design for H.264/AVC Decoder", ISCAS 2005

Tsu-Ming Liu, Wen-Ping Lee, Ting-An Lin and Chen-Yi Lee, "A Memory-Efficient Deblocking Filter for H.264/AVC Video Coding", ISCAS 2005

Ting-An Lin and Chen-Yi Lee, "Predictive Equalizer Design for DVB-T system", ISCAS 2005

Source document: Design and Implementation of a Dual-Standard Video Decoder for Digital TV Applications
