
Chapter 3 DATA MAPPING AWARE FRAME MEMORY CONTROLLER

3.4 Summary

In this chapter we proposed an SDRAM memory controller dedicated to the H.264 video decoder. We analyzed the statistics of video sequences and the characteristics of SDRAMs. Based on this information, we devised an optimized data arrangement that effectively reduces the memory setup overheads. Besides, we employ a simple prediction mechanism to enhance the performance. We schedule the memory accesses with the bank alternating technique, which gives a chance to skip the row active operations. The memory controller helps improve the bus utilization and ease the bandwidth requirement of the overall decoder. The controller's performance does not collapse at large frame sizes; the row miss rate remains stable across different frame sizes and sequences. Moreover, the controller's performance can be further enhanced with advanced DRAMs such as DDR and with wider data bus widths.

Chapter 4

ENTROPY DECODER

4.1 UVLC decoder

In H.264/AVC, Exp-Golomb coding, also known as UVLC, is adopted for all syntax elements except fixed-length codes and the quantized transform coefficients. The UVLC entropy coding uses a single finite-extent codeword table. Thus, this coding type is constructed in a regular way, characterized by a predetermined code pattern [15].

Instead of designing a different VLC table for each syntax element, only the mapping to the exp-Golomb code table is customized according to the data statistics. These mappings can be divided into four kinds: unsigned element (ue), signed element (se), truncated element (te) and mapped element (me).

Fig.34 Exp-Golomb code construction

Exp-Golomb codes are variable length codes with a regular construction. The code structure separates the bit string into "prefix" and "suffix" bits as shown in Fig.34. The leading N zeros and the middle "1" can be regarded as the "prefix" bits. The information is carried by the last N bits, decoded as the "suffix" bits. The length of each codeword is 2N+1 bits. Each exp-Golomb code can be constructed by the following formula:

codeword = [N zeros][1][INFO], where INFO is the N-bit suffix and CodeNum = 2^N + INFO - 1.

Table 11 shows a brief example of exp-Golomb code.

Table 11 Exp-Golomb codeword

    CodeNum   Codeword
    0         1
    1         010
    2         011
    3         00100
    4         00101
    5         00110
    6         00111
    7         0001000
    8         0001001
    ...       ...

For an unsigned element, the value of the syntax element is the same as CodeNum. Otherwise, the mappings of the signed element and the mapped element are as listed in Table 12(a) and Table 12(b), respectively. If the syntax element is coded as a truncated element, the range may be between 0 and x, with x being greater than or equal to 1. If x is greater than 1, the value is the same as the CodeNum decoded from the exp-Golomb code. Otherwise (x is equal to 1), one more bit is read and inverted to give the CodeNum.

Table 12(a) Post mapping of signed exp-Golomb code syntax element

    CodeNum   Syntax element value
    0          0
    1          1
    2         -1
    3          2
    4         -2
    ...       ...

Table 12(b) Post mapping of mapped exp-Golomb code syntax element

    CodeNum   Syntax element value (intra 4x4)   Syntax element value (inter)
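To make these post mappings concrete, the following minimal C sketch shows how a CodeNum already extracted from the bitstream would be mapped for the ue, se and te cases; the function names are illustrative (not from this design), and the me mapping, being purely table-driven, is omitted.

    #include <stdint.h>

    /* ue(v): the syntax element value equals CodeNum. */
    static uint32_t map_ue(uint32_t code_num)
    {
        return code_num;
    }

    /* se(v): CodeNum 0, 1, 2, 3, 4, ... maps to 0, 1, -1, 2, -2, ... */
    static int32_t map_se(uint32_t code_num)
    {
        int32_t k = (int32_t)((code_num + 1) >> 1);
        return (code_num & 1) ? k : -k;
    }

    /* te(v): if the upper bound x is greater than 1, decode as ue(v);
     * if x equals 1, the code is a single bit and the value is that bit inverted. */
    static uint32_t map_te(uint32_t x, uint32_t code_num, uint32_t single_bit)
    {
        if (x > 1)
            return code_num;        /* regular exp-Golomb code */
        return single_bit ^ 1u;     /* one-bit truncated code  */
    }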

4.1.1 Syntax organization under macroblock layer

In this section, we describe the syntax organization under the macroblock layer. The descriptors listed below specify the parsing process of each syntax element.

- ce(v): context-adaptive variable-length entropy-coded syntax element.

- f(n): fixed-pattern bit string using n bits written with the left bit first.

- me(v): mapped exp-Golomb-coded element.

- se(v): signed integer exp-Golomb-coded element.

- te(v): truncated exp-Golomb-coded element.

- u(n): unsigned integer using n bits.

- ue(v): unsigned exp-Golomb-coded element.

The syntax is coded as shown below.

    mb_pred( mb_type ) {                                                    Descriptor
      if( MbPartPredMode( mb_type, 0 ) == Intra_4x4 ||
          MbPartPredMode( mb_type, 0 ) == Intra_16x16 ) {
        if( MbPartPredMode( mb_type, 0 ) == Intra_4x4 )
          for( luma4x4BlkIdx = 0; luma4x4BlkIdx < 16; luma4x4BlkIdx++ ) {
            prev_intra4x4_pred_mode_flag[ luma4x4BlkIdx ]                   u(1)
            if( !prev_intra4x4_pred_mode_flag[ luma4x4BlkIdx ] )
              rem_intra4x4_pred_mode[ luma4x4BlkIdx ]                       u(3)
          }
        intra_chroma_pred_mode                                              ue(v)
      } else if( MbPartPredMode( mb_type, 0 ) != Direct ) {
        for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
          if( ( num_ref_idx_l0_active_minus1 > 0 || mb_field_decoding_flag ) &&
              MbPartPredMode( mb_type, mbPartIdx ) != Pred_L1 )
            ref_idx_l0[ mbPartIdx ]                                         te(v)
        for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
          if( ( num_ref_idx_l1_active_minus1 > 0 || mb_field_decoding_flag ) &&
              MbPartPredMode( mb_type, mbPartIdx ) != Pred_L0 )
            ref_idx_l1[ mbPartIdx ]                                         te(v)
        for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
          if( MbPartPredMode( mb_type, mbPartIdx ) != Pred_L1 )
            for( compIdx = 0; compIdx < 2; compIdx++ )
              mvd_l0[ mbPartIdx ][ 0 ][ compIdx ]                           se(v)
        for( mbPartIdx = 0; mbPartIdx < NumMbPart( mb_type ); mbPartIdx++ )
          if( MbPartPredMode( mb_type, mbPartIdx ) != Pred_L0 )
            for( compIdx = 0; compIdx < 2; compIdx++ )
              mvd_l1[ mbPartIdx ][ 0 ][ compIdx ]                           se(v)
      }
    }

    sub_mb_pred( mb_type ) {                                                Descriptor
      ...
      for( mbPartIdx = 0; mbPartIdx < 4; mbPartIdx++ )
        if( ( num_ref_idx_l1_active_minus1 > 0 || mb_field_decoding_flag ) &&
            sub_mb_type[ mbPartIdx ] != B_Direct_8x8 &&
            SubMbPredMode( sub_mb_type[ mbPartIdx ] ) != Pred_L0 )
          ref_idx_l1[ mbPartIdx ]                                           te(v)
      ...
    }

mb_type specifies the macroblock type. It determines the number of macroblock partitions used (NumMbPart( mb_type )) and the prediction mode of the macroblock or of its partitions. For Intra_16x16, the luma intra prediction mode and the coded block pattern are coded in mb_type. Other macroblock types have their own coded block pattern, coded in the syntax element coded_block_pattern, which specifies which of the six 8x8 blocks (luma and chroma) contain non-zero transform coefficient levels. For Intra_4x4, the luma intra prediction mode is specified by rem_intra4x4_pred_mode and its predictor flag prev_intra4x4_pred_mode_flag. For inter or bi-directional macroblocks partitioned into 8x16 or 16x8, ref_idx_l0 (ref_idx_l1) represents the forward (backward) reference frame index and mvd_l0 (mvd_l1) describes the difference between a motion vector and its prediction in the corresponding partition. For block sizes below 8x8, an additional syntax element, sub_mb_type, specifies the sub-macroblock information, and each partition in a sub-macroblock has its own ref_idx_l0 (ref_idx_l1) and mvd_l0 (mvd_l1). For mb_type equal to I_PCM, all transform coefficients are coded as 8-bit unsigned integers. Fig.35 shows some examples for different cases.

Fig.35 Syntax organization

4.1.2 Hardware architecture of UVLC

Fig.36 illustrates the structure of the UVLC decoder. Due to the characteristics of exp-Golomb coding, we can determine the code length by detecting the leading zeros. Since the code length is not more than 32 bits, we use two 32-bit registers to buffer the input bit stream.

Fig.36 Architecture of UVLC decoder

The alignment shifter aligns the input data to the proper position for the next decoding step. The leading one detector and the accumulator calculate and store the sum of the code lengths. The code number extractor extracts CodeNum+1 based on the determined code length. The post-processing unit generates the final value of the UVLC-coded syntax element. The control unit contains a syntax FSM and a register updating controller: the syntax FSM controls the decoding flow of the syntax elements under the macroblock layer, while the register updating controller determines when to update the buffers. With this architecture, the syntax elements can be decoded in a smooth flow.
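The length determination and code number extraction described above can be modeled in software as follows; this is only a behavioral sketch assuming the next codeword is left-aligned in a 32-bit window, with illustrative names rather than the actual RTL.

    #include <stdint.h>

    /* Behavioral model of the UVLC front end: count the leading zeros (N),
     * derive the code length 2N+1, and extract CodeNum.  The prefix, the
     * middle '1' and the N suffix bits, read as one binary number, equal
     * CodeNum + 1.  Valid H.264 UVLC codewords fit in 32 bits, so no
     * overflow guard is included here. */
    typedef struct {
        uint32_t code_num;   /* decoded CodeNum                 */
        uint32_t length;     /* consumed code length, 2N+1 bits */
    } uvlc_code;

    static uvlc_code uvlc_decode(uint32_t window)
    {
        uvlc_code c;
        uint32_t n = 0;

        /* leading zero detector */
        while (n < 32 && !(window & (0x80000000u >> n)))
            n++;

        c.length = 2u * n + 1u;
        /* code number extractor: take the 2N+1 leading bits, minus one */
        c.code_num = (window >> (32u - c.length)) - 1u;
        return c;
    }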

4.2 CAVLC decoder

4.2.1 Introduction

Context-based Adaptive Variable Length Coding (CAVLC) has been adopted in the MPEG-4 AVC/H.264 video coding standard [1] as one of the entropy coding methods to further reduce the bit rate. However, in a video decoder system, the entropy decoding often becomes the performance bottleneck since it is hard to speed up with parallelism and pipelining.

Previous designs for CAVLC decoding focus on simplifying the VLC tables to reduce area or on using a gated clock for all-zero block decoding [16] - [18], quite similar to other VLC decoding approaches such as those in MPEG-2 and MPEG-4 video coding. Besides, these designs can only generate one decoded coefficient per cycle, whether the coefficient is zero or not, since they have to merge the decoded nonzero and zero coefficients together.

CAVLC, though similar to other VLC processes in its codeword table construction, is quite different in the overall process. The overall decoding process of CAVLC can be decomposed into several sub-processes, and some of them can be merged for a lower cycle count. The one-coefficient-per-cycle limitation of previous designs arises because they merge and reorder the decoded nonzero and zero coefficients in a one-by-one processing fashion.

This style can take 16 cycles for a 4x4 block just for the merging process. In summary, previous approaches only skip the all-zero blocks and have not fully exploited the large zero-coefficient portion, 87% for the Qp = 28 case, within a nonzero block. Moreover, their one-by-one sample processing style severely limits the achievable performance regardless of the underlying architecture.

However, how to exploit the zero coefficients while keeping the hardware simple is a challenge in the architecture design. In the decoder, in addition to the codeword decoding, one of the major challenges is how the abundant extracted zeros can be placed into the correct positions among the nonzero coefficients without using the time-consuming one-by-one processing style. Our approach is to use a fast coefficient merging unit. In this unit, the buffer is automatically initialized to zero for every new block decoding, and only the nonzero coefficients are written to the corresponding positions in parallel during the zero coefficient decoding. With this, we can easily merge the zero and nonzero coefficients together and skip the cycles spent merging zero coefficients. In addition, we further improve the speed by adopting all-zero block skipping and multi-symbol decoding for the sign and run-before decoding.

The final design can save up to 76% of the cycle count compared with other designs in the Qp = 28 case.

4.2.2 Overview of CAVLC

Fig.37 Typical CAVLC decoding flow [16]

H.264 video coding uses CAVLC for transform coefficient coding and Exp-Golomb codes for all other syntax elements. Fig.37 illustrates a typical decoding flow. The CAVLC decoding steps are listed below, followed by a simplified software sketch of the flow:

1. Decode the number of total nonzero coefficients, TotalCoeffs, and the trailing ones (up to 3), TrailingOnes (T1s).

2. Decode the sign of each trailing one.

3. Decode the level of each remaining nonzero coefficient.

4. Decode the number of total zeros, TotalZeros, after the first nonzero coefficient.

5. Decode each run of zeros, RunBefore, between the nonzero coefficients, which will also depend on the number of zeros that have not yet been coded (ZerosLeft).

6. Merge the level and run information to generate the final transform coefficients
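As a reference for the following discussion, the sketch below summarizes this standard flow in C; the decode_* helpers and the bitstream type are hypothetical placeholders for the VLC table lookups, not the proposed hardware.

    /* Structural sketch of the standard CAVLC residual decoding flow.
     * All decode_* helpers are hypothetical placeholders for table lookups;
     * coeff[] is assumed to be zero-initialized by the caller. */
    typedef struct bitstream bitstream;   /* abstract bitstream reader */
    int  read_bit(bitstream *bs);
    void decode_coeff_token(bitstream *bs, int *total_coeff, int *t1s);
    int  decode_level(bitstream *bs);
    int  decode_total_zeros(bitstream *bs, int total_coeff);
    int  decode_run_before(bitstream *bs, int zeros_left);

    void cavlc_decode_block(bitstream *bs, int max_num_coeff, int coeff[16])
    {
        int level[16], run[16];

        /* 1. number of nonzero coefficients and trailing ones */
        int total_coeff, t1s;
        decode_coeff_token(bs, &total_coeff, &t1s);
        if (total_coeff == 0)
            return;                               /* all-zero block */

        /* 2. sign of each trailing one (+1 or -1) */
        for (int i = 0; i < t1s; i++)
            level[i] = read_bit(bs) ? -1 : 1;

        /* 3. remaining nonzero levels */
        for (int i = t1s; i < total_coeff; i++)
            level[i] = decode_level(bs);

        /* 4. total number of zeros before the last nonzero coefficient */
        int zeros_left = (total_coeff < max_num_coeff)
                         ? decode_total_zeros(bs, total_coeff) : 0;

        /* 5. runs of zeros between the nonzero coefficients */
        for (int i = 0; i < total_coeff - 1; i++) {
            run[i] = (zeros_left > 0) ? decode_run_before(bs, zeros_left) : 0;
            zeros_left -= run[i];
        }
        run[total_coeff - 1] = zeros_left;

        /* 6. merge levels and runs into scan positions */
        int pos = -1;
        for (int i = total_coeff - 1; i >= 0; i--) {
            pos += run[i] + 1;
            coeff[pos] = level[i];
        }
    }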

In this manner, the main cycle-consuming sources are level decoding, run decoding and the merging process. Fig.38 shows the distribution of the required cycles among the different processes. Considering the complexity of the level adaptation, multi-level decoding would require high overheads in a hardware implementation. Thus, we use multi-run-like decoding and zero-skipped merging as our solution to reduce the required decoding cycle count.


Fig.38 Cycle count distribution among the different processes

4.2.3 The proposed CAVLC decoding flow

In the CAVLC decoding process, each symbol has several corresponding context-based adaptive VLC tables, and the selection of these tables is based on the statistics of the block content and the previously decoded symbols. Thus the decoding process depends not only on the bit stream but also on the previous symbols. This prevents speedup techniques such as parallelism and pipelining. One often-adopted solution is to combine the codewords of different symbols so that multi-symbol decoding becomes possible.

However, this leads to longer codes and thus a larger table to be decoded in one cycle. In this work, we propose to use some already decoded information as an inherent table index to reduce the table size.

The proposed decoding flow basically follows the standard decoding flow, as shown in Fig.39. The major difference is that we adopt multi-symbol decoding at two stages: the sign decoding stage and the run-before decoding stage. In addition, another major issue of the decoding process is how to combine the decoded nonzero coefficients (levels) and zero coefficients (runs). In other approaches [16] - [18], the levels and runs are decoded separately and then combined in a one-by-one raster scan order to reconstruct the coefficients. However, the combination can be started as soon as a run is decoded. Thus we propose a coefficient merging unit for this purpose to save the extra merging cycles.

Fig.39 The proposed decoding flow of CAVLC [19]

The whole decoding process and its impact on the hardware design are described below.

1) Coefficient token and T1 sign decoding process

Decoding of the coefficient token is combined with the decoding of the T1 signs, since the length of the sign code is predictable as soon as T1s and TotalCoeff are known. In the hardware design, we use a three-bit sign code register to store the possible sign code since its maximum length is three. Then, once the length of the sign code is known, the signs of all T1s can be decoded in the next cycle. During the same cycle as the sign decoding, the decoding of the first level code (or of TotalZeros if no other levels remain) can be started. Fig.40 illustrates the operation of the T1s decoding. Since this is multi-symbol decoding, the level buffer, which stores the decoded level values, has to be a multiple-input level buffer instead of the single-input FIFO used in other methods.

When TotalCoeff is equal to zero, which implies an all-zero block, only the coefficient token process is required and the other decoding stages are skipped, as in other designs.

Fig.40 Proposed T1s decoding
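A small behavioral sketch of this T1 sign decoding, assuming the three buffered bits and the T1s count are already available (the names are illustrative):

    #include <stdint.h>

    /* The next three bits of the stream are buffered in a sign code register;
     * once TrailingOnes (t1s) is known from the coeff_token, all trailing-one
     * levels (+1/-1) are produced in one step.  Only the first t1s buffered
     * bits are actually sign bits; the shifter then consumes exactly t1s bits. */
    static void decode_t1_signs(uint32_t sign_code_reg /* 3 bits, MSB first */,
                                int t1s /* 0..3 */,
                                int level_buf[])
    {
        for (int i = 0; i < t1s; i++) {
            int sign_bit = (sign_code_reg >> (2 - i)) & 1;  /* i-th buffered bit */
            level_buf[i] = sign_bit ? -1 : 1;               /* 1 means negative  */
        }
    }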

2) Level decoding process

The adaptation of level decoding is quite complex. Thus, we do not adopt multi-symbol decoding here due to hardware cost considerations.

3) Total zero decoding process

This process is skipped if TotalCoeff is equal to zero or to maxNumCoeff. Whether to skip it is controlled by the coefficient token process.

4) Run before decoding process

In this process, multi-symbol decoding is adopted. By examining this process, we find that the ZerosLeft for the next run is predictable unless ZerosLeft is larger than six. For example, if the run is equal to 1 under ZerosLeft == 6, then the next run should be decoded with the table under ZerosLeft == 5. Thus, we decide to decode two runs in the same cycle only when ZerosLeft <= 6, and we partition the run-before table accordingly. However, the number of remaining runs is not always even, so two-symbol decoding is not always possible; if only one run needs to be processed, only the length of the first run code is considered. Fig.41 shows the different cases of the proposed run decoding and merging.

For example, when decoding the block data shown in Fig.39, the tail level of the block, "-1", is first placed in the 14th position of the coefficient buffer according to the decoded symbols TotalCoeff and TotalZeros. Second, we decode the first run and merge the 6th level into the 8th position of the coefficient buffer, since ZerosLeft > 6 at this stage, as illustrated in Fig.41 (a). Then, we decode 2 runs and merge the 4th and 5th levels into the 4th and 6th positions at the same time, because the updated ZerosLeft becomes 2, which is less than or equal to 6, as shown in Fig.41 (b). Fourth, we decode the 4th and 5th runs and find that the index ZerosLeft is equal to 0 after the 4th run decoding. In this case, we only decode the 4th run and merge the 3rd level into the 3rd position of the coefficient buffer, as described in Fig.41 (c). Finally, when ZerosLeft is equal to 0 in the following decoding process, we put all the remaining un-merged levels into the coefficient buffer according to the order previously arranged in the level buffer, as illustrated in Fig.41 (d).

Fig.41 (a) When ZerosLeft > 6, decode 1 run and merge 1 coefficient.

Fig.41 (b) When 0 <= ZerosLeft <= 6, decode 2 runs and merge 2 coefficients.

Fig.41 (c) When ZerosLeft == 0 occurs during the 2nd run merging, decode 2 runs and merge 1 coefficient.

Fig.41 (d) When ZerosLeft == 0, merge the remaining coefficients.

Due to the characteristics of run decoding, when ZerosLeft is 3 and the first decoded run is 1, the possible codewords of the 2nd run fall only under the index ZerosLeft equal to 2, which means only 3 possible combinations; many combinations can therefore be eliminated to minimize the two-run decoding table. The possible combinations of two run codes number 77, and the longest combined code length is six bits, since the longest run codeword under ZerosLeft <= 6 is three bits. The final modified run table contains 84 items, comprising the two-run decoding table and the codewords for runs ranging from 0 to 6 under ZerosLeft > 6, plus a leading one detector that decodes the runs ranging from 7 to 14 when ZerosLeft > 6. In addition, unlike previous designs that only skip all-zero blocks, we also skip this process and directly copy the remaining levels into the coefficient buffer once ZerosLeft is equal to 0. For this purpose, the data in the level buffer must be stored in decoding order for an easier copy.
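The merging behavior can be summarized by the following simplified software model; it assumes the levels were already stored in decoding order in level_buf, uses the standard position relation pos_next = pos_prev - run - 1, and uses decode_run_before() as a stand-in for the (one- or two-symbol) run table lookup.

    #include <string.h>

    int decode_run_before(int zeros_left);   /* hypothetical run table lookup */

    /* Zero-skipped merging model: the coefficient buffer starts zeroed, the
     * first (highest-frequency) level is placed at TotalCoeff + TotalZeros - 1,
     * and every further level is placed as soon as its run is decoded. */
    void merge_levels(const int level_buf[], int total_coeff, int total_zeros,
                      int coeff[16])
    {
        memset(coeff, 0, 16 * sizeof(int));        /* zero-initialized buffer  */

        int pos = total_coeff + total_zeros - 1;   /* position of first level  */
        int zeros_left = total_zeros;
        coeff[pos] = level_buf[0];

        for (int i = 1; i < total_coeff; i++) {
            if (zeros_left == 0) {
                /* zero-skip: remaining levels occupy consecutive positions */
                coeff[--pos] = level_buf[i];
                continue;
            }
            /* when zeros_left <= 6 the hardware decodes two runs per cycle;
             * the per-run merging relation is identical, so one run is shown */
            int run = decode_run_before(zeros_left);
            zeros_left -= run;
            pos -= run + 1;
            coeff[pos] = level_buf[i];
        }
    }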

The critical case for the proposed decoding flow is shown in Fig.42. In this case, the proposed method needs 26 cycles to complete the whole process, including the combination of the level and run information. For the traditional process, the required cycle count is 46. The proposed flow can save 43% of the cycle count even in this worst case.

Fig.42 The critical case for the CAVLC decoding design, where X denotes the nonzero coefficients.

4.2.4 Hardware architecture of CAVLC decoder

Fig.43 shows the decoder architecture based on the proposed flow. The design takes its input from the bitstream shifter, which shifts the bitstream according to the length of the previous codeword and provides the aligned bitstream for the next decoding step. It includes two registers, a shifter and a code length accumulator, as generally used in traditional VLC decoding hardware [21]. The bitstream is then decoded according to the proposed flow. The output stage of the decoder is a coefficient merging unit that contains a zero-initialized coefficient buffer for the final output. The coefficient merging unit directly merges the levels and runs together during the run-before decoding: when a run is decoded, the corresponding levels are copied to the coefficient buffer. This scheme reduces the processing cycles and saves the extra run buffer. Another output of the design is the zero block index, asserted when an all-zero block is decoded. This enables zero skipping in the subsequent inverse quantization and other components.

Fig.43 The proposed architectures for CAVLC decoding
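A behavioral sketch of the bitstream shifter described above, assuming two 32-bit registers form a 64-bit window and fill_word() is a hypothetical callback that supplies the next 32 bits of the stream:

    #include <stdint.h>

    /* Two 32-bit registers form a 64-bit window; a barrel shifter keeps the
     * next codeword aligned at the MSB side, and an accumulator tracks how
     * many bits of the upper register have been consumed. */
    typedef struct {
        uint64_t window;      /* two 32-bit registers                     */
        uint32_t consumed;    /* code length accumulator (bits used, <32) */
        uint32_t (*fill_word)(void *ctx);
        void *ctx;
    } bs_shifter;

    /* Return the next 32 aligned bits without consuming them. */
    static uint32_t bs_peek(const bs_shifter *s)
    {
        return (uint32_t)(s->window >> (32 - s->consumed));
    }

    /* Consume 'len' bits (the length of the codeword just decoded). */
    static void bs_consume(bs_shifter *s, uint32_t len)
    {
        s->consumed += len;
        if (s->consumed >= 32) {               /* upper register exhausted */
            s->consumed -= 32;
            s->window = (s->window << 32) | s->fill_word(s->ctx);
        }
    }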

The details of each component are described below.

1) Coeff_token-sign decoding

Fig.44 shows the architecture of the coeff_token-sign decoding. The bitstream is first decoded through the coeff_token table to generate the values of TotalCoeff and T1s and the code length. Then, the next three bits of the bitstream are stored in the sign code register for sign decoding. According to T1s, these bits are marked as sign code bits or not and decoded through the T1 masking logic.

Finally, the T1s are sent to the level buffer. The zero-block detector detects the codeword of a zero block and sets the zero block index to one.

Fig.44 Architecture of coeff_token-sign decoder

2) Level decoder

The prefix of the level codeword is first decoded by the leading one detector to obtain the information for suffix decoding and the codeword length. After extracting the suffix, the value of the level can be decoded from the prefix and suffix. An escape code occurs when the prefix is 15, which results in a 28-bit codeword length. This decoder is implemented with simple arithmetic and combinational logic.
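The relation between the prefix, the suffix and the final level value can be sketched as below; this is a simplified bit-serial software model (the hardware uses a leading one detector instead of a loop), and the escape code for prefix 15, the suffix-length adaptation and the first-level adjustment are omitted.

    int read_bit(void);        /* hypothetical single-bit reader */
    int read_bits(int n);      /* hypothetical n-bit reader      */

    /* Simplified level decoding: the prefix is the number of leading zeros
     * before the first '1', the suffix carries suffix_len extra bits, and the
     * signed level is recovered from the parity of the combined level code. */
    int decode_level_simplified(int suffix_len)
    {
        int prefix = 0;
        while (read_bit() == 0)                 /* leading one detection */
            prefix++;

        int suffix = (suffix_len > 0) ? read_bits(suffix_len) : 0;
        int level_code = (prefix << suffix_len) + suffix;

        /* even level codes map to positive levels, odd ones to negative */
        return (level_code & 1) ? -((level_code + 1) >> 1)
                                : (level_code + 2) >> 1;
    }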

3) Total zero decoder and run before decoder

The total zero decoder contains two tables, one for 4x4 blocks and one for 2x2 DC blocks. The tables are further partitioned into several sub-tables, and the selection of a sub-table is determined by the decoded symbol TotalCoeff. The total zero decoder decodes the symbol TotalZeros and the code length for the total zero decoding process. The run before decoder contains a run before table and pipelining registers for pipelining the coefficient merging process and the run before decoding.

