
Chapter 5 Simulation and Implementation Result for Digital TV

5.1 Simulation result

In this section, we present a platform that simplifies verification of our CABAD design.

Figure 44 shows our verification architecture. The platform is divided into two levels: the C model and the circuit design. The blue block in Figure 44 is the C model, which is composed of the H.264/AVC encoder and decoder from the JM 9.2 reference software [5].

In the C model, the JM 9.2 encoder reads the test video sequence and converts it into an H.264/AVC bit-stream. The JM 9.2 decoder decompresses this bit-stream into the yuv format. During the JM 9.2 decoding process, we extract the information needed as input data for our RTL design, including the frame type, the address of the MB layer, the address of the sub-MB layer, and the current SE decoding mode. We also extract the correct SE values from the JM 9.2 decoding process and compare them with the SE values computed under the Verilog test pattern, which simplifies debugging.

Apart from the C model, everything outside the blue block forms the circuit design flow. We use a Verilog test pattern to drive the RTL code. The test pattern takes the “CABAD input pattern” block as the input signal of the “verilog HDL coding” block and records the output signals for comparison with the correct SE values from the JM 9.2 decoder. We then also verify the design with circuit-level simulation.
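The comparison step described above can be sketched in a few lines. This is only an illustration: the function name and the representation of the SE values as plain integer lists are our assumptions, not part of the JM 9.2 software or of the actual test bench.

```python
from typing import List, Optional


def first_mismatch(golden: List[int], rtl: List[int]) -> Optional[int]:
    """Return the index of the first SE value that differs, or None if the traces agree.

    golden: SE values dumped from the reference decoder (hypothetical dump format).
    rtl:    SE values recorded from the RTL simulation (hypothetical dump format).
    """
    for index, (expected, actual) in enumerate(zip(golden, rtl)):
        if expected != actual:
            return index
    if len(golden) != len(rtl):
        # One trace ended early: report the first position that is missing.
        return min(len(golden), len(rtl))
    return None
```

Reporting only the first mismatch is a deliberate choice in this sketch: once a context-adaptive decoder produces one wrong SE, later values are usually wrong as well, so the first index is the useful one for debugging.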

Figure 44 Platform of CABAD verification

We simulate our design and compute the cycle count and throughput using the platform in Figure 44. We adopt two CIF test sequences, foreman and mobile, for comparison with other designs. Table 14 gives the specification for CIF frames at 30 fps, for which we adopt level 2.0 of the H.264/AVC standard [1]. The throughput of our design must exceed the maximum MB processing rate of 11,880 MB/s in the standard [1] at the maximum video bit-rate of 2,000,000 bps.

Table 14 Specification for CIF at 30 fps

Level   Max MB processing rate   Max frame size   Max video bit-rate
        MaxMBPS (MB/s)           MaxFS (MBs)      MaxBR (1000 bit/s or 1200 bit/s)

2.0     11,880                   396              2,000
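The MB processing rate for a given frame size and frame rate follows from simple arithmetic. The sketch below is a minimal illustration assuming 16x16 macroblocks and a rate equal to MBs per frame times frames per second; the function names are ours.

```python
def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of mb_size x mb_size macroblocks needed to cover one frame."""
    mbs_wide = (width + mb_size - 1) // mb_size   # round up partial columns
    mbs_high = (height + mb_size - 1) // mb_size  # round up partial rows
    return mbs_wide * mbs_high


def mb_rate(width: int, height: int, fps: int) -> int:
    """Macroblocks that must be processed per second for real-time decoding."""
    return macroblocks_per_frame(width, height) * fps
```

For a CIF frame of 352x288 pixels, `macroblocks_per_frame(352, 288)` gives 396 and `mb_rate(352, 288, 30)` gives 11,880.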

From the simulation we obtain the cycle count and the throughput for each test sequence.

Figure 45(a) shows the cycle count for the foreman sequence, covering I slices, P slices, B slices, and the average. Figure 45(b) shows the throughput for foreman, expressed as the average number of macroblocks processed per second at three working frequencies. We focus on the red circles in Figure 45(a) and (b), which mark the maximum video bit-rate of 2,000,000 bps. There, the maximum cycle count is 576 cycles per MB and the worst throughput is 172,165 MB/s at a working frequency of 100 MHz, which satisfies the level 2.0 requirement in Table 14.

Figure 46(a) shows the cycle count for the mobile sequence, and Figure 46(b) shows the corresponding throughput at three working frequencies. We again focus on the red circles in Figure 46(a) and (b) at the maximum video bit-rate of 2,000,000 bps. There, the maximum cycle count is 563 cycles per MB and the worst throughput is 176,046 MB/s at 100 MHz, which easily meets the level 2.0 requirement in Table 14.
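The link between cycle count, working frequency, and throughput can be sketched as below. This is an idealized estimate that assumes every MB takes the same number of cycles; the function names are ours. The plain ratio comes out slightly higher than the simulated throughputs quoted above, which are taken from the full platform rather than from this formula.

```python
def throughput_mb_per_s(clock_hz: float, cycles_per_mb: float) -> float:
    """Idealized throughput: clock cycles available per second / cycles spent per MB."""
    return clock_hz / cycles_per_mb


def meets_requirement(clock_hz: float, cycles_per_mb: float, required_mb_per_s: float) -> bool:
    """True if the idealized throughput reaches the required MB processing rate."""
    return throughput_mb_per_s(clock_hz, cycles_per_mb) >= required_mb_per_s
```

For example, `throughput_mb_per_s(100e6, 576)` is about 173,611 MB/s, well above the 11,880 MB/s required for CIF at 30 fps.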

Figure 45 (a) Characteristic curve of the processing cycle for foreman and (b) characteristic curve of the throughput for foreman

Figure 46 (a) Characteristic curve of the processing cycle for mobile and (b) characteristic curve of the throughput for mobile

We have simulated the typical sequence, foreman, and the worst-case sequence, mobile, on our verification platform. The MB processing rate for CIF frames easily meets the level 2.0 requirement of the H.264/AVC standard [1]. This does not, however, show that our design meets the requirements of HDTV applications. We therefore adopt two 1080HD test sequences, riverbed and station, to simulate the throughput of our work. In addition, we generate down-sampled videos to analyze how the throughput varies with the texture of the sequence. Each video is thus available in three frame sizes: 1920x1088, 960x544, and 480x272 pixels. Figure 47 illustrates the down-sampling flow used in our verification.

Figure 47 Down-sample of the 1080HD frame

Following the flow in Figure 47, we generate down-sampled videos in yuv format from the two video sequences. The three yuv sequences of each video are then encoded into H.264 bit-streams by the JM 9.2 [5] reference encoder.
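A factor-of-two down-sampling step can be sketched as follows. The filter actually used to produce the DS2 and DS4 sequences is not stated here, so the 2x2 averaging below is purely an assumption for illustration, as are the function name and the list-of-rows representation of one luma plane.

```python
from typing import List


def downsample_by_2(plane: List[List[int]]) -> List[List[int]]:
    """Halve width and height by averaging each 2x2 block of samples (assumed filter)."""
    height, width = len(plane), len(plane[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)  # rounded integer average
        out.append(row)
    return out
```

Applying the function once to a 1920x1088 plane yields 960x544, and applying it twice yields 480x272, matching the three frame sizes used in the text.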

Figure 48 shows the 1080HD test sequence and its down-sampled versions for the “station” video. Figure 49 shows the same for the “riverbed” video. “DS2” denotes the sequence down-sampled by 2, and “DS4” denotes the sequence down-sampled by 4. We use the JM 9.2 [5] decoder to restore the video and extract the cycle count of CABAD processing using the platform in Figure 44.

Figure 48 1080HD and down-sample sequences for station

Figure 49 1080HD and down-sample sequences for riverbed

As a result, for each sequence we draw three plots to analyze the performance at working frequencies of 200, 110, and 100 MHz, shown in Figure 50 through Figure 55.

Table 15 Specification for 1080HD at 30 fps

Level   Max MB processing rate   Max frame size   Max video bit-rate
        MaxMBPS (MB/s)           MaxFS (MBs)      MaxBR (1000 bit/s or 1200 bit/s)

4.0     244,800                  8160             20,000

Figure 50, Figure 51, and Figure 52 are the characteristic curves of the 1080HD, DS2, and DS4 frames for the “riverbed” sequence at 200 MHz, 110 MHz, and 100 MHz respectively. Figure 53, Figure 54, and Figure 55 are the corresponding curves for the “station” sequence. Each plot shows throughput on the primary (left) vertical axis and PSNR Y on the secondary (right) vertical axis. Because a down-sampled video has more texture than the original, the throughput of DS4 is worse than that of DS2, and the throughput of DS2 is in turn worse than that of 1080HD. The throughput for both 1080HD sequences reaches level 4.0, whose specification is shown in Table 15, which means that our CABAD can process thirty frames per second. None of the DS2 and DS4 measurements, however, meets the maximum MB processing rate requirement. We conclude that, for the same video content, a small frame has more texture and fewer skipped MBs than a large frame. Consequently, using a smaller frame to estimate the throughput of a larger one underestimates the 1080HD throughput, so in practice the DS2 and DS4 measurements are not valid estimates. The measurements on the 1080HD videos themselves satisfy level 4.0 of the standard [1].
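For reference, the PSNR Y quantity plotted on the secondary axis is conventionally computed from the mean squared error between the original and decoded luma planes. The sketch below uses the standard definition for 8-bit samples; the function name and the list-of-rows plane representation are our assumptions, and it is not taken from the JM 9.2 source.

```python
import math
from typing import List


def psnr_y(reference: List[List[int]], decoded: List[List[int]], peak: int = 255) -> float:
    """PSNR in dB between two luma planes of equal size (8-bit samples by default)."""
    squared_error = 0
    samples = 0
    for ref_row, dec_row in zip(reference, decoded):
        for ref, dec in zip(ref_row, dec_row):
            squared_error += (ref - dec) ** 2
            samples += 1
    if squared_error == 0:
        return math.inf  # identical planes: PSNR is unbounded
    mse = squared_error / samples
    return 10.0 * math.log10(peak * peak / mse)
```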

Figure 50 Characteristic curves of 200MHz for “riverbed”

Figure 51 Characteristic curves of 110MHz for “riverbed”

Figure 52 Characteristic curves of 100MHz for “riverbed”

Figure 53 Characteristic curves of 200MHz for “station”

Figure 54 Characteristic curves of 110MHz for “station”

Figure 55 Characteristic curves of 100MHz for “station”

5.2 Chip implementation

In our H.264/AVC decoder design, the target specification is the main profile at level 4.0; Table 15 gives the details of this level.

The maximum computational capability supports real-time decoding of 1080HD (1920x1088) H.264/AVC video sequences at 30 fps.
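As a rough illustration of what this target implies, the sketch below computes the average cycle budget per MB at a given clock frequency. It assumes 16x16 macroblocks and that the required rate is MBs per frame times the frame rate; the function name is hypothetical.

```python
def cycle_budget_per_mb(clock_hz: float, width: int, height: int, fps: int) -> float:
    """Average clock cycles available per macroblock for real-time decoding."""
    mbs_per_frame = ((width + 15) // 16) * ((height + 15) // 16)
    return clock_hz / (mbs_per_frame * fps)
```

Under these assumptions, a 200 MHz clock leaves roughly 817 cycles per MB for 1920x1088 at 30 fps, and a 100 MHz clock leaves roughly 408.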

5.2.1 Design flow

We use the standard cell based design flow. Figure 56 shows our design flow from system specification to physical-level.


Figure 56 Design flow from system specification to physical-level

In the system design stage, we first estimated the throughput required by the specification. We started from the 4x4 sub-block-level pipeline scheme of the syntax parser and modified it into a hybrid scheme, because a macroblock-level pipelining scheme is suitable for the CABAD modules. We carefully estimated the efficiency of different decoding orders for all the modules, because the decoding order is an important interface between modules. Because we target an H.264 main-profile decoder, hardware speed must also be considered in this first stage. The overall block diagram and data flows are designed in this stage.

In the architecture design stage, we divide the work into three main parts: the binarization engine, the arithmetic decoder, and the memory system. At the system level, we have to consider hardware sharing of the context model between the two entropy decoders, CABAD and CAVLD, which requires the entropy decoder to spend additional memory to realize CAVLD.

Under the throughput constraint, we focus on the architecture design and on making each module low in complexity. Several low-complexity architectures are derived in this stage.

The RTL design proceeds alongside the architecture design. Its main task is to translate the architecture of each module into an RTL description, with the goal that the synthesis result matches the intended architecture. Coding techniques suited to the synthesizer are also considered in this stage. It is equally important that the RTL code be synthesizable and easy to understand.

In the physical design stage, the CAD tools are important. Making good use of these tools and carrying out the remaining work carefully is key to the final result.

The design margin, the technology used, and nanometer-scale effects in deep sub-micron circuits also need to be considered. At the end of the physical design stage, our work is taped out for prototyping and final verification.

5.2.2 Implementation result

In our work, we implemented an H.264@MP decoder. Figure 57 shows the layout of this work. The total gate count is about 557,730. The chip size is 2.1 x 2.1 mm² in 0.13µm technology. The gate count and SRAM requirement of each function block in the H.264/AVC decoder system are shown in Table 16 and Table 17 respectively. The decoder supports decoding of 1080HD H.264 video sequences at 30 fps.

We also synthesize our CABAD individually with RTL Compiler (RC). Based on the characteristic curves in Figure 45(a), we take the throughput at a reasonable video quality for comparison with other designs; the comparison is shown in Table 18.

We implement the proposed design in Verilog HDL in the UMC 0.13µm CMOS process.

The total gate count including the embedded SRAM is 163,573 at a working frequency of 200 MHz. According to the DC report, the total power consumption is about 4.107 mW. For comparison with other designs, we assume a video bit-rate of 800 kbps, at which the quality of 41 dB meets our requirement. With the sequence type “IBBBPBBBP…”, the design achieves 229 cycles per MB.

Figure 57 Layout of this work

Table 16 Gate count list of each function block

Table 17 Memory requirement of each function block

Table 18 Comparison with other designs

                    This work     Chen [2]       Yu [3]

Decode. Spec.       H.264@main    H.264@main     H.264@main
Target technology   0.13µm UMC    0.13µm TSMC    0.18µm
Clock rate          200MHz        200MHz         150MHz (6.7ns)
Gate count          163,573       138,226        area: 0.3mm2
Power               4.107mW       na             na

Table 19 shows the percentage of cycle reduction achieved by each of the three proposed methods, which indicates how much each method contributes to the throughput improvement.

Table 19 Percentage of cycle reduction for three proposed methods

Proposed method   Target decoding              Percentage of cycle reduction (%)

Pipeline          Normal arithmetic decoding   50%
Multi-symbol      Bypass arithmetic decoding   67%
Zero-skip         Residual data decoding       20%
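A cycle reduction percentage can be restated as a speed-up factor for the targeted decoding step alone. The conversion below is plain arithmetic, speed-up = 1 / (1 - reduction); it says nothing about the overall decoder, because that would also depend on how much of the total cycle count each step accounts for, which is not given here. The function name is ours.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Speed-up factor of a step whose cycle count drops by reduction_percent."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)
```

With the values in Table 19, a 50% reduction corresponds to a factor of 2, 67% to about 3.03, and 20% to 1.25 for the respective decoding step.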

5.3 Summary

In this chapter, we have simulated our CABAD on the hybrid C model and Verilog platform and obtained the cycle count for each macroblock, from which the throughput of CABAD in the H.264/AVC decoder system is derived. The design reaches level 4.0 of the standard [1], which means that it can play 1080HD video sequences at 30 fps. In addition, we synthesize our CABAD with Design Compiler and obtain circuit-level information about it. As a result, the cycle count per MB is better than that of other designs at a reasonable video quality.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this work, we implement an advanced entropy decoder, the context-based adaptive binary arithmetic decoder (CABAD). We adopt several design techniques at both the system level and the architecture level.

We construct a CABAD architecture that includes the binarization engine, the arithmetic decoder, and the memory system (context model and row storage). We have completed the three arithmetic decoding processes and the binary tree with the finite state machine of the binarization engine. Further, we organize the memory system with SRAM, which holds the context model and stores the neighboring information of the left and top blocks.

Three methods have been proposed to enhance the throughput: a pipelined architecture, multi-symbol organization for bypass decoding, and zero-skip handling for residual decoding. Besides the proposed CABAD IP, we build a co-verification environment that combines the C model of the JM reference software [5] with the test pattern of the RTL design. We complete functional verification on this simulation architecture and integrate the module with the syntax parser of our H.264/AVC decoder. We also provide the characteristic curves of our CABAD architecture under variable bit-rates. It can achieve a speed of up to 1,00,000 MBs per second for real-time decoding of video sequences.

As a result, the average cycle count per MB is reduced by up to about 50% at a reasonable video quality compared with other proposed designs. Our design reaches level 4.0 of the H.264/AVC standard [1], which means that it can play 1080HD videos at 30 fps. The power consumption of CABAD is 4.107 mW.

6.2 Future work

To deliver high-quality video, a frame rate of 30 fps no longer meets the requirements of the digital TV market, where high resolution and high frame rate have become the targets. Large frames and high-speed playback are therefore essential for digital TV applications. From the viewpoint of CABAD, playing 1080HD video at 60 fps is the basic requirement. CABAD therefore has to reach 1080HD at 60 fps under a maximum bit-rate of 50,000,000 bits per second, which means that supporting level 4.2 of H.264/AVC is the future work for CABAD. Compared with level 4.0, CABAD has to be accelerated by a factor of 5. Accelerating CABAD is thus the essential task for such advanced applications.
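The scaling between the two operating points can be made concrete with a short calculation. The sketch below compares the frame rates and maximum bit-rates quoted in this thesis for level 4.0 (30 fps, 20,000 kbit/s) and level 4.2 (60 fps, 50,000 kbit/s). Treating the product of the two ratios as the required acceleration is only one possible reading of the factor of 5, which the text does not derive; the function name is ours.

```python
def scaling_factors(fps_old: int, fps_new: int, kbps_old: int, kbps_new: int):
    """Return (MB-rate ratio, bit-rate ratio, product) between two operating points.

    The frame size is assumed unchanged, so the MB rate scales with the frame rate.
    """
    mb_rate_ratio = fps_new / fps_old
    bit_rate_ratio = kbps_new / kbps_old
    return mb_rate_ratio, bit_rate_ratio, mb_rate_ratio * bit_rate_ratio
```

For the figures above, `scaling_factors(30, 60, 20000, 50000)` returns a MB-rate ratio of 2.0, a bit-rate ratio of 2.5, and a product of 5.0.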

Bibliography

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), May 2003.

[2] Jian-Wen Chen, Cheng-Ru Chang, Youn-Long Lin, “A Hardware Accelerator for Context-Based Adaptive Binary Arithmetic Decoding in H.264/AVC”, ISCAS 2005, pp. 4525-4528.

[3] Wei Yu, Yun He, “A High Performance CABAC Decoding Architecture”, IEEE Transactions on Consumer Electronics, 2005, pp. 1352-1359.

[4] Detlev Marpe, Heiko Schwarz, Thomas Wiegand, “Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard”, CSVT 2003, pp. 209-217.

[5] JVT H.264/AVC Reference Software JM 9.2.

[6] Z.M. Milicevic, Z.S. Bojkovic, “Multimedia H.264/AVC standard as a tool for video coding performance improvement”, Telecommunications in Modern Satellite, Cable and Broadcasting Services, 2005, pp. 237-240.

[7] V.H.S. Ha, Woo-Sung Shim, Jung-Woo Kim, “Real-time MPEG-4 AVC/H.264 CABAC entropy coder”, ICCE 2005, pp. 255-256.

[8] C. C. Lin, J. I. Guo, H. C. Yang, J. W. Chen, M. C. Tsai, J. S. Wang, “A 160kGate 4.5kB SRAM H.264 Video Decoder for HDTV Application”, ISSCC 2006, pp. 406-407.

[9] R.R. Osorio, J.D. Bruguera, “A new architecture for fast arithmetic coding in H.264 advanced video coder”, Digital System Design 2005, pp. 298-305.

[10] Chu Yu, Hwai-Tsu Hu, “Design and implementation of an ASIC architecture for the context-based binary arithmetic encoder”, ISCE 2005, pp. 83-86.

[11] Keng-Khai Ong, Wei-Hsin Chang, Yi-Chen Tseng, Chen-Yi Lee, “A high throughput context-based adaptive arithmetic codec for JPEG2000”, IEEE Int. Symp. Circuits Syst. 2002, vol. 4, pp. 133-136.

[12] W. B. Pennebaker et al., “An overview of the basic principles of the Q-Coder adaptive binary arithmetic coder”, IBM J. Res. Develop., 1988, vol. 32, no. 6, pp. 717-726.

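Author Biography

Name: Yi-Hong Huang (黃毅宏)

Place of birth: Taichung City, Taiwan

Date of birth: 1977.04.24

Education and experience:

1983.9 ~ 1989.6 Wu-Han Elementary School, Taoyuan County

1989.9 ~ 1992.6 Longtan Junior High School, Taoyuan County

1992.9 ~ 1995.6 National Chung-Li Senior High School

1996.9 ~ 2000.6 B.S., Department of Electronic Engineering, Chung Yuan Christian University

2000.10 ~ 2004.6 Assistant Technician, Phased Array Radar Section, Electronic Systems Research Division, Chung-Shan Institute of Science and Technology

2004.9 ~ 2006.7 M.S., Systems Group, Institute of Electronics, National Chiao Tung University

Awards

2006/05 Honorable Mention, 2006 National IC Design Contest

Publications

• Yi-Hong Huang, Ping-Chang Lin, Kang-Cheng Hou, Yueh-Chi Hung, Tsu-Ming Liu, Chen-Yi Lee, “A High-Throughput SRAM-Based Context Adaptive Binary Arithmetic Decoder (CABAD) for H.264/AVC”, in Proceedings of the 17th VLSI/CAD Symposium, August 2006.

• Kang-Cheng Hou, Sheng-Zen Wang, Yi-Hong Huang, Tsu-Ming Liu, Chen-Yi Lee, “A Bandwidth-Efficient Motion Compensation Architecture for H.264/AVC HDTV Decoder”, in Proceedings of the 17th VLSI/CAD Symposium, August 2006.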
