
Chapter 4 Memory System

4.3 Row-storage memory system of the neighbor information

4.3.2 RS memory in sub-MB level

The target syntax elements are coded_block_pattern for luminance, mvd, ref_idx, and coded_block_flag for the residual data. At the sub-MB level, the context model index is decided by the top and left sub-macroblocks. In each dimension there are two possible sources for the SE value of a neighbor block. When the sub-address of the current sub-macroblock in that dimension equals zero, the neighbor information is read from the row-storage SRAM; otherwise it is read from the register file. The registers must hold the neighbor sub-macroblocks until the current SE has been decoded completely. Each macroblock has two axes, horizontal and vertical, so there are four cases for locating the neighbor information, as listed in Table 12. Figure 40 illustrates the four cases at the sub-MB level.

Table 12 Four cases of getting neighbor information

Case   Sub_address Y == 0   Sub_address X == 0   Left block      Top block
1      False                False                Register file   Register file
2      False                True                 SRAM            Register file
3      True                 False                Register file   SRAM
4      True                 True                 SRAM            SRAM

Figure 40 Illustration of four getting neighbor information cases
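The four cases of Table 12 reduce to one independent test per axis. The following is a minimal sketch of that selection rule, assuming sub-addresses are small non-negative integers; the function name `neighbor_sources` and the string labels are illustrative and do not come from the RTL design.

```python
def neighbor_sources(sub_x: int, sub_y: int) -> tuple[str, str]:
    """Return (left_source, top_source) for the sub-macroblock at (sub_x, sub_y).

    Hypothetical helper: a zero sub-address on an axis means the neighbor on
    that axis lies outside the current macroblock, so it comes from the RS SRAM.
    """
    left = "SRAM" if sub_x == 0 else "register file"
    top = "SRAM" if sub_y == 0 else "register file"
    return left, top


# Enumerate the cases in the same order as Table 12: (Y == 0, X == 0).
for case, (y, x) in enumerate([(1, 1), (1, 0), (0, 1), (0, 0)], start=1):
    left, top = neighbor_sources(sub_x=x, sub_y=y)
    print(f"case {case}: left from {left}, top from {top}")
```

Treating the two axes independently is what keeps the case count at four rather than requiring a separate rule per sub-macroblock position.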

The neighbor information is fetched from the RS SRAM when the SE being decoded is located at sub_address Y = 0 or sub_address X = 0. Figure 41 shows the RS SRAM in relation to the sub-macroblocks. The shaded region is the content of the RS SRAM, which holds the SE values of the right and bottom boundaries. The update timing is the same as in Figure 39 of Section 4.3.1: the SE values of the right and bottom boundaries are updated when end_of_slice_flag is decoded.

Figure 41 Row-Storage SRAM relates to the decoded sub-macroblocks

The neighbor information is fetched from registers when the SE being decoded is located at sub_address Y != 0 or sub_address X != 0. We set up a register file to store the SE values used as neighbor information. To support B slices, we configure two sets of registers for the forward and backward mvd and ref_idx. In total, 664 bits of registers serve as the RS memory system at the sub-MB level. Table 13 lists the number and size of the registers required to obtain neighbor information from registers.

Table 13 Required registers of the neighbor information

Syntax element type          Unit register size (bit)   Required number
mvd_x (forward)              8                          16
mvd_y (forward)              8                          16
mvd_x (backward)             8                          16
mvd_y (backward)             8                          16
ref_idx (forward)            4                          16
ref_idx (backward)           4                          16
coded_block_pattern (luma)   3                          1
coded_block_flag (residual)  21                         1
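The 664-bit total quoted above can be checked directly from the rows of Table 13. The short sketch below only sums unit size times required number; the list name `REGISTERS` and the function name are illustrative.

```python
# (syntax element, unit register size in bits, required number), copied from Table 13.
REGISTERS = [
    ("mvd_x (forward)", 8, 16),
    ("mvd_y (forward)", 8, 16),
    ("mvd_x (backward)", 8, 16),
    ("mvd_y (backward)", 8, 16),
    ("ref_idx (forward)", 4, 16),
    ("ref_idx (backward)", 4, 16),
    ("coded_block_pattern (luma)", 3, 1),
    ("coded_block_flag (residual)", 21, 1),
]


def total_register_bits(rows):
    """Sum size * count over all register groups."""
    return sum(bits * count for _, bits, count in rows)


print(total_register_bits(REGISTERS))  # 512 (mvd) + 128 (ref_idx) + 24 = 664
```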

Figure 42 Macroblock and sub-macroblock partition: (a) macroblock partition, (b) sub-macroblock partition

The block size is one of seven types: 16x16, 8x16, 16x8, 8x8, 4x8, 8x4, and 4x4, as shown in Figure 42. So that the top and left syntax elements can be read from registers, the decoded value is copied to the registers of the 4x4 blocks that the current block covers, according to the block sizes of Figure 42. Table 13 lists the register number and size required at the sub-MB level for mvd, ref_idx, coded_block_pattern for luminance, and coded_block_flag.

The unit register of a syntax element corresponds to one 4x4 sub-macroblock. If the current block is larger than 4x4, the resulting value of the current SE has to be copied to all registers that the current block covers.

We take the example of Figure 43 and denote the vertical and horizontal coordinates within the current MB as {Y, X}. In Figure 43(a), CABAD decodes the syntax element mvd at {0, 0} with a block size of 8x8. The current block is composed of four 4x4 sub-macroblocks, so the SE value computed at {0, 0} has to update the four 8-bit registers at {0, 0}, {0, 1}, {1, 0}, and {1, 1}. When decoding next at {0, 2}, the binarization engine reads the left sub-macroblock from the register at {0, 1}. In addition, the top sub-macroblock is read from the RS SRAM, because sub_address Y is zero (case 3 of Table 12).

Figure 43(b) through Figure 43(f) describe the copy methods for the block sizes of 16x16, 8x16, 16x8, 4x8, and 8x4, respectively.
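The copy rule can be sketched in software as a write to every 4x4 register that the decoded block covers. This is a minimal model under stated assumptions: one macroblock is a 4x4 grid of unit registers indexed as {Y, X}, block dimensions are given in pixels with height first, and the names `make_register_grid` and `store_se` are hypothetical rather than taken from the design.

```python
def make_register_grid():
    """One register per 4x4 sub-macroblock of a 16x16 macroblock, indexed [y][x]."""
    return [[0] * 4 for _ in range(4)]


def store_se(regs, y, x, height_px, width_px, value):
    """Copy a decoded SE value to every 4x4 register covered by the block.

    y and x are the {Y, X} coordinates of the block's top-left 4x4 unit;
    height_px and width_px are the block dimensions in pixels.
    """
    for dy in range(height_px // 4):
        for dx in range(width_px // 4):
            regs[y + dy][x + dx] = value


regs = make_register_grid()
store_se(regs, y=0, x=0, height_px=8, width_px=8, value=5)  # the 8x8 case of Figure 43(a)

# Decoding next at {0, 2}: the left neighbor is the register at {0, 1}.
left_neighbor = regs[0][1]
print(left_neighbor)
```

Writing the value redundantly into every covered register means a later lookup never needs to know the partition size of its neighbor; it simply reads the adjacent 4x4 position.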

Figure 43 SE values mapping for different block size

4.4 Summary

In this chapter, we have discussed the organization of the CABAD memory system for H.264/AVC. We implement the context model with a 399x7-bit two-port SRAM, which serves the pipelined architecture of CABAD. We then proposed the RS memory system in two forms, SRAM-based and register-based, and gave the SRAM size requirement of the SRAM-based RS memory. The methods for supporting variable block sizes and B frames have been described in the final section.

Chapter 5 Simulation and Implementation Result for Digital TV Applications

In this chapter, we set up the verification platform and show the characteristic curves of the throughput and the cycle count per macroblock for frame sizes of CIF (352x288) and 1080HD (1920x1088). To examine the effect of different texture, we also take two 1080HD video sequences and compare the throughput of the 1920x1088 originals with that of their down-sampled versions (960x544 and 480x272 pixels). We collect these measurements and present them as characteristic figures, from which the performance of our CABAD design can be read.

We then present our H.264/AVC decoder chip, showing the layout of the design and the detailed chip information at the end of the chapter.

This chapter is organized as follows. In Section 5.1, we propose a structure to verify our system, present the simulation results, and analyze the performance of CABAD in more detail. Section 5.2 shows the chip implementation of the H.264/AVC@MP decoder, followed by the measurement results and a comparison with other designs.

5.1 Simulation result

In this section, we propose a platform to easily verify our CABAD design. Figure 44 shows the verification architecture. The platform is divided into two levels: the C model and the circuit design. The blue block in Figure 44 is the C model, which is composed of the H.264/AVC encoder and decoder from the reference software JM 9.2 [5].

In the C model, the JM 9.2 encoder reads the test video sequence and turns it into an H.264/AVC bit-stream. The JM 9.2 decoder decompresses this bit-stream to the yuv format. During the JM 9.2 decoding process, we extract the information needed as input data for our RTL design, including the frame type, the address of the MB layer, the address of the sub-MB layer, and the current SE decoding mode. We also extract the correct SE values from the JM 9.2 decoding process and check them against the SE values computed under the Verilog test pattern, which simplifies debugging.

Besides the C model, we construct the circuit design flow, which covers everything outside the blue block. A Verilog test pattern controls the current RTL design. The test pattern takes the “CABAD input pattern” block as the input signal of the “verilog HDL coding” block and records the output signal for comparison with the correct SE values from the JM 9.2 decoder. We then also verify the design with circuit-level simulation.
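The comparison step of this flow amounts to walking two SE sequences in lockstep and stopping at the first disagreement. The sketch below illustrates that idea only; it assumes both sides have already been dumped as plain lists of integers, and the function name `first_mismatch` is hypothetical and not part of JM or of the test bench.

```python
from itertools import zip_longest


def first_mismatch(reference, computed):
    """Return (index, reference_value, computed_value) for the first differing SE.

    Returns None when both sequences agree in content and length. A sequence
    that ends early shows up as a mismatch against None.
    """
    for index, (ref, got) in enumerate(zip_longest(reference, computed)):
        if ref != got:
            return index, ref, got
    return None


reference_se = [3, 0, 1, 7]   # illustrative values standing in for the reference decoder dump
computed_se = [3, 0, 2, 7]    # illustrative values standing in for the RTL simulation dump
print(first_mismatch(reference_se, computed_se))
```

Reporting the index of the first differing SE narrows debugging to a single decoding step, since every later SE may be corrupted by the same fault.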

Figure 44 Platform of CABAD verification

We simulate our design and compute the cycle count and throughput on the platform of Figure 44. We adopt two CIF test video sequences, foreman and mobile, for comparison with other designs. Table 14 gives the specification for CIF frames at 30 fps, for which we adopt level 2.0 of the H.264/AVC standard [1]. The throughput of our design has to exceed the maximum MB processing rate of 11,880 MB/s in the standard [1] at the maximum video bit-rate of 2,000,000 bps.

Table 14 Specification for CIF at 30 fps

Level  Max MB processing rate MaxMBPS (MB/s)  Max frame size MaxFS (MBs)  Max video bit-rate MaxBR (1000 bit/s or 1200 bit/s)
2.0    11,880                                  396                         2,000

As a result, we obtain the cycle count and the throughput for each test sequence. Figure 45(a) shows the cycle count for the foreman test sequence in I slices, P slices, and B slices, together with the average. Figure 45(b) shows the throughput for foreman, given as the average number of macroblocks processed per second at three working frequencies. We focus on the red circle in Figure 45(a) and (b), because it corresponds to the maximum video bit-rate of 2,000,000 bps. There, the maximum cycle count is 576 cycles per MB and the worst throughput is 172,165 MB/s at a working frequency of 100MHz in our simulation, which meets the requirement of level 2.0 in Table 14. Figure 46(a) shows the cycle count for the mobile test sequence, and Figure 46(b) shows its throughput, again as the average number of macroblocks processed per second at three working frequencies. We focus on the red circle in Figure 46(a) and (b), because it corresponds to the maximum video bit-rate of 2,000,000 bps. There, the maximum cycle count is 563 cycles per MB and the worst throughput is 176,046 MB/s at a working frequency of 100MHz in our simulation, which easily meets the requirement of level 2.0 in Table 14.
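The margin against level 2.0 can be made concrete by converting the required MB rate into a cycle budget per macroblock. The sketch below is an idealized back-of-the-envelope model, assuming one macroblock is processed at a time and ignoring how the simulation averages over a sequence, so it need not reproduce the simulated throughput figures exactly. The function names are illustrative.

```python
def cycle_budget_per_mb(clock_hz: int, required_mb_per_s: int) -> int:
    """Largest whole number of cycles per MB that still meets the required MB rate."""
    return clock_hz // required_mb_per_s


def ideal_throughput_mb_per_s(clock_hz: int, cycles_per_mb: int) -> float:
    """Idealized MB rate if every macroblock took exactly cycles_per_mb cycles."""
    return clock_hz / cycles_per_mb


CLOCK_HZ = 100_000_000       # 100MHz working frequency
LEVEL_2_0_MB_RATE = 11_880   # maximum MB processing rate of level 2.0 (MB/s)

budget = cycle_budget_per_mb(CLOCK_HZ, LEVEL_2_0_MB_RATE)
print(budget)  # 8417 cycles per MB are available at 100MHz

# The measured maxima of 576 (foreman) and 563 (mobile) cycles per MB
# are far below this budget.
for measured in (576, 563):
    print(measured, measured <= budget, round(ideal_throughput_mb_per_s(CLOCK_HZ, measured)))
```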

Figure 45 (a) Characteristic curve of the processing cycle for foreman and (b) Characteristic curve of the throughput for foreman

Figure 46 (a) Characteristic curve of the processing cycle for mobile and (b) Characteristic curve of the throughput for mobile

We have simulated the typical sequence foreman and the worst-case sequence mobile on our verification platform. For CIF frames, the MB processing rate easily meets the requirement of level 2.0 in the H.264/AVC standard [1]. This does not imply that the design also meets the requirement of HDTV applications. We therefore adopt two 1080HD test sequences, riverbed and station, to simulate the throughput of our work. In addition, we generate down-sampled videos to analyze the throughput across sequences of different texture. Each video is thus available in three frame sizes: 1920x1088, 960x544, and 480x272 pixels. Figure 47 illustrates the down-sampling flow used in our verification.

Figure 47 Down-sample of the 1080HD frame

Following the flow of Figure 47, we take the two video sequences and generate the down-sampled videos in yuv format. The three yuv sequences of each video are then fed to our platform and encoded into H.264 bit-streams by the JM 9.2 [5] reference encoder.
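For reference, the number of macroblocks per frame follows directly from the frame dimensions, since a macroblock covers 16x16 pixels. The sketch below computes it for the three frame sizes used here; the function name `macroblocks_per_frame` is illustrative, and the check that both dimensions are multiples of 16 is an assumption that happens to hold for all three sizes.

```python
MB_SIZE = 16  # a macroblock covers 16x16 luma pixels


def macroblocks_per_frame(width: int, height: int) -> int:
    """Number of macroblocks in a frame whose dimensions are multiples of 16."""
    if width % MB_SIZE or height % MB_SIZE:
        raise ValueError("frame dimensions must be multiples of 16")
    return (width // MB_SIZE) * (height // MB_SIZE)


FRAME_SIZES = {
    "1080HD": (1920, 1088),
    "DS2": (960, 544),   # down-sampled by 2 in each dimension
    "DS4": (480, 272),   # down-sampled by 4 in each dimension
}

for name, (width, height) in FRAME_SIZES.items():
    print(name, macroblocks_per_frame(width, height))
```

Each halving of both dimensions divides the macroblock count by four, so a DS4 frame holds one sixteenth of the macroblocks of the 1080HD frame.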

Figure 48 shows the 1080HD test sequence and its down-sampled videos for the “station” sequence. Figure 49 shows the same for the “riverbed” sequence. “DS2” denotes the sequence down-sampled by 2, and “DS4” denotes the sequence down-sampled by 4. We use the JM 9.2 [5] decoder to restore the video and extract the cycle count of CABAD processing on the platform of Figure 44.

Figure 48 1080HD and down-sample sequences for station

Figure 49 1080HD and down-sample sequences for riverbed

As a result, we draw three plots per sequence to analyze the performance at working frequencies of 200, 110, and 100MHz, given as Figure 50 through Figure 55.

Table 15 Specification for 1080HD at 30 fps

Level  Max MB processing rate MaxMBPS (MB/s)  Max frame size MaxFS (MBs)  Max video bit-rate MaxBR (1000 bit/s or 1200 bit/s)
4.0    224,800                                 8160                        20,000

Figure 50, Figure 51, and Figure 52 are the characteristic curves of the 1080HD, DS2, and DS4 frames at 200MHz, 110MHz, and 100MHz, respectively, for the “riverbed” sequence. Figure 53, Figure 54, and Figure 55 are the corresponding curves for the “station” sequence. Each plot contains the throughput and the PSNR Y, scaled on the primary (left) and secondary (right) vertical axes respectively. Because a down-sampled video has more texture than the original, the throughput of DS4 is worse than that of DS2, which in turn is worse than that of 1080HD. The throughput of the two 1080HD sequences achieves level 4.0, which means that our CABAD can process thirty frames per second. The specification of level 4.0 is shown in Table 15. None of the DS2 and DS4 measurements, however, meets the required maximum MB processing rate. We conclude that, for the same video content, a small frame has more texture and fewer skipped MBs than a large frame. Using a smaller frame to estimate a larger one therefore underestimates the throughput at 1080HD. In practice, the DS2 and DS4 measurements are not valid estimates, whereas the measurements on the 1080HD videos meet level 4.0 of the standard [1].

Figure 50 Characteristic curves of 200MHz for “riverbed”

Figure 51 Characteristic curves of 110MHz for “riverbed”

Figure 52 Characteristic curves of 100MHz for “riverbed”

Figure 53 Characteristic curves of 200MHz for “station”

Figure 54 Characteristic curves of 110MHz for “station”

Figure 55 Characteristic curves of 100MHz for “station”

5.2 Chip implementation

Our H.264/AVC decoder design targets the main profile at level 4.0; Table 15 gives the details of this level. Its maximum computational capability supports real-time decoding of a 1080HD (1920x1088) H.264/AVC video sequence at 30 fps.

5.2.1 Design flow

We use the standard cell based design flow. Figure 56 shows our design flow from system specification to physical-level.

Figure 56 Design flow from system specification to physical-level

In the system design stage, we first estimated the throughput required by the specification. We followed the 4x4 sub-block level pipeline scheme of the syntax parser and modified it into a hybrid scheme, because a macroblock-level pipelining scheme is suitable for the CABAD modules. We carefully estimate the efficiency of different decoding orders for all modules, because the decoding order is an important interface between modules. Because we aim at an H.264 decoder for the main profile, hardware speed must also be considered at this first stage. The overall block diagram and data flow are designed in this stage.

In the architecture design stage, we divide the work into three main parts: the binarization engine, the arithmetic decoder, and the memory system. From the system point of view, we have to consider sharing the context model hardware between the two entropy decoders, CABAD and CAVLD, which costs the entropy decoder additional memory to realize CAVLD.

Under the constraint of the throughput requirement, we focus on the architecture design and on making each module low-complexity. Some low-complexity architectures are derived in this stage.

RTL design proceeds along with the architecture design. Its main task is to translate the architecture of each module into an RTL description, with the goal that the synthesis result is identical to the intended architecture. Some coding techniques for the synthesizer are also considered in this stage. It is equally important that the RTL code is synthesizable and easy to understand.

In the physical design stage, the CAD tools are important. Making good use of these tools and carrying out the remaining work carefully is the key to our final result. The design margin, the technology used, and some nanometer-scale effects in deep sub-micron circuits also need to be considered. At the end of the physical design stage, our work is taped out for prototyping and final verification.

5.2.2 Implementation result

In our work, we implemented an H.264@MP decoder. Figure 57 shows the layout of this work. The total gate count is about 557,730. The chip size is 2.1 x 2.1 mm² in 0.13µm technology. The gate count and SRAM requirement of each function block in the H.264/AVC decoder system are shown in Table 16 and Table 17 respectively. The decoder supports decoding 1080HD H.264 video sequences at 30 fps.

We also synthesize our CABAD individually with RTL Compiler (RC). Based on the characteristic curves of Figure 45(a), we take the throughput at a reasonable video quality to compare with other designs, as shown in Table 18.

We implement the proposal in Verilog HDL in the UMC 0.13µm CMOS process. The total gate count, including the embedded SRAM, is 163,573 at a working frequency of 200MHz. According to the DC report, the total power consumption is about 4.107mW. A video bit-rate of 800kbps is assumed for the comparison with other designs. A video quality of 41 dB meets our requirement. Under the sequence type “IBBBPBBBP…”, the design takes 229 cycles per MB.

Figure 57 Layout of this work

Table 16 Gate count list of each function block

Table 17 Memory requirement of each function block

Table 18 Comparison with other designs

Item                This work     Chen [2]      Yu [3]
Decode. Spec.       H.264@main    H.264@main    H.264@main
Target technology   0.13µm UMC    0.13µm TSMC   0.18µm
Clock rate          200MHz        200MHz        150MHz (6.7ns)
Gate count          163,573       138,226       area: 0.3mm²
Power               4.107mW       na            na

Table 19 shows the percentage of cycle reduction for the three proposed methods, from which the contribution of each method to the throughput improvement can be seen.

Table 19 Percentage of cycle reduction for three proposed methods

Proposed method   Target decoding               Percentage of cycle reduction (%)
Pipeline          Normal arithmetic decoding    50%
Multi-symbol      Bypass arithmetic decoding    67%
Zero-skip         Residual data decoding        20%

5.3 Summary

In this chapter, we have simulated our CABAD on the hybrid platform of the C model and Verilog and obtained the cycle count of each macroblock, from which the throughput of CABAD in the H.264/AVC decoder system is derived. We achieve level 4.0 of the standard [1], which means that our design supports playing 1080HD video sequences at 30 fps. In addition, we synthesize our CABAD with Design Compiler and obtain circuit information for CABAD. As a result, the cycle count per MB is better than that of other designs at a reasonable video quality.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this work, we implement an advanced entropy decoder, the context-based adaptive binary arithmetic decoder (CABAD). We adopt several design techniques at both the system level and the architecture level.

We construct the architecture of CABAD, which includes the binarization engine, the arithmetic decoder, and the memory system (context model and row storage). We have completed the three arithmetic decoding processes and the binary tree with the finite state machine of the binarization engine. Further, we organize the memory system with SRAM, which serves the context model and stores the neighboring information of the left and top blocks.

Three methods have been proposed to enhance the throughput: the pipelined architecture, multi-symbol organization for bypass decoding, and zero-skip consideration for residual decoding. Besides the proposed CABAD IP, we build the co-verification of the C model of the reference software JM [5] and the test pattern of the RTL design. We complete the functional verification on this simulation architecture and integrate the module with the syntax parser of our H.264/AVC decoder. We also provide the characteristic curves of our CABAD architecture under variable bit-rate. It can achieve a speed of up to 1,00,000 MBs per second for real-time decoding of video sequences.

As a result, the average cycle count per MB can be reduced by up to about 50% under
