The Proposed CAVLC Decoding Flow

Chapter 4 Entropy Decoder

4.2 CAVLC decoder

4.2.3 The Proposed CAVLC Decoding Flow

In the CAVLC decoding process, each symbol has several context-based adaptive VLC tables, and the table is selected according to the statistics of the block content and the previously decoded symbols. The decoding process therefore depends not only on the bit stream but also on the previous symbols, which hinders speedup techniques such as parallelism and pipelining. An often adopted solution is to combine the codewords of different symbols so that multi-symbol decoding becomes possible.

However, this leads to longer codes and thus a larger table to be decoded in one cycle. In this paper, we propose to use some of the decoded information as an inherent table index to reduce the table size.

The proposed flow basically follows the standard decoding flow. The major difference is that we adopt multi-symbol decoding at two stages: the sign decoding stage and the run before decoding stage. In addition, another major problem of the decoding process is combining the decoded nonzero coefficients (levels) and zero coefficients (runs). In other approaches [16] - [18], the levels and runs are decoded separately and then combined one by one in raster scan order to reconstruct the coefficients. However, the combination can start as soon as a run is decoded. We therefore propose a coefficient merging unit for this purpose, which saves the extra merging cycles.

Fig.39 The proposed decoding flow of CAVLC [19]

The whole decoding process and its impact on the hardware design are described below.

1) Coefficient token and T1 sign decoding process

Decoding of the coefficient token is combined with decoding of the T1 signs, since the length of the sign code is known as soon as the number of T1s and TotalCoeff are known. In the hardware design, we use three-bit sign code registers to store the possible sign code, since its maximum length is three. Once the length of the sign code is known, the signs of all T1s can be decoded in the next cycle. In the same cycle as the sign decoding, decoding of the first level code, or of TotalZero if no other levels remain, can start. Fig.40 illustrates the operation of T1s decoding. Since this is multi-symbol decoding, the level buffer, which stores the decoded level values, has to be a multiple-input buffer instead of the single-input FIFO used in other methods.

When TotalCoeff is equal to zero, which implies an all-zero block, only the coefficient token process is required and the other decoding stages are skipped, as in other designs.
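To illustrate why the T1 signs can be resolved in a single step, the following Python sketch models the three-bit sign code register in software. It is only a behavioral model under stated assumptions, not the hardware design described here: the function name decode_t1_signs and the use of a bit string for the register are hypothetical choices made for this example. The sign mapping (bit 0 for +1, bit 1 for -1) follows the H.264 CAVLC syntax.

```python
def decode_t1_signs(sign_bits, trailing_ones):
    """Decode the signs of all trailing ones (T1s) in one step.

    sign_bits: the next three bitstream bits as a string such as "101"
               (hypothetical software model of the sign code register).
    trailing_ones: number of T1s reported by the coefficient token (0..3).
    Returns (levels, bits_consumed).
    """
    if not 0 <= trailing_ones <= 3:
        raise ValueError("trailing_ones must be in 0..3")
    if len(sign_bits) < trailing_ones:
        raise ValueError("not enough bits in the sign register")
    # The sign code length equals the number of T1s, so only that many
    # bits of the register are meaningful. Bit 0 -> +1, bit 1 -> -1.
    levels = [-1 if bit == "1" else 1 for bit in sign_bits[:trailing_ones]]
    return levels, trailing_ones
```

Because the number of consumed bits equals trailing_ones, the position of the next symbol (the first level code or TotalZero) is known without waiting for the signs themselves, which is what allows the two operations to share a cycle.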

Fig.40 Proposed T1s decoding

2) Level decoding process

The adaptation in level decoding is quite complex. Thus, we do not adopt multi-symbol decoding for this process, in consideration of hardware cost.

3) Total zero decoding process

This process is skipped if TotalCoeff is equal to zero or to maxNumCoeff.

Whether to skip is controlled by the coefficient token process.

4) Run before decoding process

In this process, multi-symbol decoding is adopted. By examining this process, we find that the ZerosLeft for the next run is predictable unless ZerosLeft is larger than six. For example, if the run is equal to 1 under ZerosLeft == 6, then the next run is decoded with the table for ZerosLeft == 5. Thus, we decode two runs in the same cycle only when ZerosLeft <= 6 and partition the run before table accordingly. However, the number of available runs is not always even, so two-symbol decoding is not always possible. If only one run needs to be processed, only the length of the first run code is considered. Fig.41 (a) to (d) show the different cases of the proposed run decoding and merging.
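The dependency between consecutive runs can be sketched in Python as follows. This is a sequential software model, assuming the codewords below were transcribed correctly from the run_before table of the H.264 standard (they should be checked against the standard before reuse). The names RUN_BEFORE, decode_one_run and decode_two_runs are hypothetical, and the hardware described here would resolve both runs with one combined table lookup rather than two sequential ones.

```python
# run_before codewords for ZerosLeft 1..6, indexed by run value.
# Transcribed from the H.264 run_before table; verify before reuse.
RUN_BEFORE = {
    1: ["1", "0"],
    2: ["1", "01", "00"],
    3: ["11", "10", "01", "00"],
    4: ["11", "10", "01", "001", "000"],
    5: ["11", "10", "011", "010", "001", "000"],
    6: ["11", "000", "001", "011", "010", "101", "100"],
}

def decode_one_run(bits, pos, zeros_left):
    """Match one run_before codeword at bits[pos:]; return (run, new_pos)."""
    for run, code in enumerate(RUN_BEFORE[zeros_left]):
        if bits.startswith(code, pos):
            return run, pos + len(code)
    raise ValueError("no matching run_before codeword")

def decode_two_runs(bits, pos, zeros_left, runs_needed):
    """Model one cycle: decode up to two runs while 0 < zeros_left <= 6.

    The table for the second run is selected by the ZerosLeft value
    predicted from the first run (zeros_left - first run).
    """
    runs = []
    while len(runs) < 2 and runs_needed > 0 and zeros_left > 0:
        run, pos = decode_one_run(bits, pos, zeros_left)
        runs.append(run)
        zeros_left -= run      # predicted ZerosLeft for the next run
        runs_needed -= 1
    return runs, pos, zeros_left
```

For instance, with ZerosLeft == 6 and the bits "000011", the first codeword "000" gives a run of 1, so the second codeword "011" is read from the ZerosLeft == 5 table and gives a run of 2.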

For example, when decoding the block data shown in Fig.39, the tail level of the block, “-1”, is first placed in the 14th position of the coefficient buffer according to the decoded symbols TotalCoeff and TotalZero. Second, we decode the first run and merge the 6th level into the 8th position of the coefficient buffer, since ZerosLeft > 6 at this stage, as illustrated in Fig.41 (a). Third, we decode 2 runs and merge the 4th and 5th levels into the 4th and 6th positions at the same time, because the updated ZerosLeft becomes 2, which is less than or equal to 6, as shown in Fig.41 (b). Fourth, we decode the 4th and 5th runs and find that the index ZerosLeft is equal to 0 after the 4th run is decoded. In this case, we only decode the 4th run and merge the 3rd level into the 3rd position of the coefficient buffer, as described in Fig.41 (c). Finally, since ZerosLeft is equal to 0 in the following decoding process, we put all the remaining unmerged levels into the coefficient buffer in the order in which they were arranged in the level buffer, as illustrated in Fig.41 (d).
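The merging of levels and runs can be modeled in software as shown below. This is a minimal sketch of the standard CAVLC coefficient placement, not of the merging unit itself: the function name merge_coefficients and its argument layout are assumptions for this example, and the block used in the usage note is a made-up one rather than the block of Fig.39.

```python
def merge_coefficients(levels, total_zeros, runs, max_num_coeff=16):
    """Place decoded levels into a coefficient buffer in scan order.

    levels: nonzero levels in decoding order (highest scan position first).
    total_zeros: zeros located before the last nonzero coefficient.
    runs: decoded run_before values in decoding order; it may be shorter
          than levels because no run is coded once ZerosLeft reaches 0,
          and none is coded for the final level.
    """
    if len(levels) + total_zeros > max_num_coeff:
        raise ValueError("levels and zeros do not fit in the block")
    coeffs = [0] * max_num_coeff
    pos = len(levels) + total_zeros - 1   # position of the tail level
    zeros_left = total_zeros
    run_iter = iter(runs)
    for level in levels:
        coeffs[pos] = level
        # Once ZerosLeft is 0 the remaining levels are contiguous,
        # so they can be copied without decoding any further run.
        run = next(run_iter, 0) if zeros_left > 0 else 0
        zeros_left -= run
        pos -= run + 1
    return coeffs
```

As a usage example, merge_coefficients([1, -1, -1, 1, 3], 3, [1, 0, 0, 1]) places the tail level at position 7 and yields a buffer that starts with 0, 3, 0, 1, -1, -1, 0, 1 and is zero afterwards. Each level can be written as soon as its run is known, which is the property the coefficient merging unit exploits.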

Fig.41 (a) When ZerosLeft > 6, decode 1 run and merge 1 coefficient.

Fig.41 (b) When 0 <= ZerosLeft <= 6, decode 2 runs and merge 2 coefficients.

Fig.41 (c) When ZerosLeft == 0 during the 2nd run merging, decode 2 runs and merge 1 coefficient.

Fig.41 (d) When ZerosLeft == 0, merge the remaining coefficients.

Owing to the characteristics of run decoding, many combinations can be eliminated to minimize the two-run decoding table. For example, when ZerosLeft is 3 and the first decoded run is 1, the codeword of the 2nd run can only come from the table indexed by ZerosLeft equal to 2, which gives 3 possible combinations. There are 77 possible combinations of two run codes, and the longest combined code length is six, since the longest run code under ZerosLeft <= 6 is 3 bits. The final modified run table contains 84 items: the two-run decoding table and the codewords for runs from 0 to 6 under ZerosLeft > 6. A leading one detector decodes the runs from 7 to 14 when ZerosLeft > 6. In addition, unlike the previous design, which skips only all-zero blocks, we also skip this process and directly copy the remaining levels into the coefficient buffer as soon as ZerosLeft is equal to 0. For this purpose, data in the level buffer must be stored in decoding order for easier copying.
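The table sizes quoted above can be reproduced with a short count. The sketch below assumes that a two-run entry exists only when the first run leaves ZerosLeft greater than 0 (otherwise no second run is coded); this counting rule is our assumption, chosen because it matches the stated figures, and the names used are hypothetical.

```python
def count_two_run_entries(max_zeros_left=6):
    """Count (first run, second run) pairs for ZerosLeft in 1..max_zeros_left."""
    count = 0
    for zeros_left in range(1, max_zeros_left + 1):
        for first_run in range(zeros_left + 1):
            remaining = zeros_left - first_run
            if remaining > 0:
                # The second run uses the table for `remaining`
                # and can take any value in 0..remaining.
                count += remaining + 1
    return count

TWO_RUN_ENTRIES = count_two_run_entries()            # 77 pairs
SINGLE_RUN_ENTRIES = 7                               # runs 0..6 when ZerosLeft > 6
TABLE_ITEMS = TWO_RUN_ENTRIES + SINGLE_RUN_ENTRIES   # 84 items
```

Under this rule, ZerosLeft == 3 with a first run of 1 contributes exactly 3 pairs, in agreement with the example in the text, and the totals come out as 77 two-run entries and 84 table items.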

The critical case for the proposed decoding flow is shown in Fig.42. In this case, the proposed method needs 26 cycles to complete the whole process, including the combination of level and run information, whereas the traditional process requires 46 cycles. The proposed flow thus saves 20 of 46 cycles, about 43% of the cycle count, even in this worst case.

Fig.42 The critical cases for CAVLC decoding design, where the X denotes the nonzero coefficients.

Source document: Design of Memory Controller and Entropy Decoder for H.264 Video Decoder (pages 78-86)