Previous Work - H.264 CAVLC/CABAC Entropy Decoding Integrated IP Design

In this chapter, we review previous designs of CAVLD and CABAD architectures.

2.1. CAVLD

Chang et al. [4] published an efficient CAVLD architecture and proposed four techniques to reduce both the hardware cost and the power consumption of CAVLD: Partial Combinational Component Freezing (PCCF), Hierarchical Logic for Look-up Tables (HLLT), Zero_left Table Elimination by Arithmetic (ZTEBA), and Zero Codeword Skip (ZCS). Compared with the architecture proposed in [3], the design achieves a 23% reduction in hardware cost and a 40% improvement in speed.

PCCF assigns one enable signal to each CAVLC decoder component so that the non-operating combinational components can be frozen, which lowers power consumption. HLLT partitions the original large LUT into many small LUTs so that the unused LUTs can be disabled to reduce power consumption. ZTEBA decodes Run_before syntax elements more efficiently by exploiting the regularities among the LUTs. ZCS skips the decoding of zero codewords when all coefficients in a 4x4 or 2x2 block are zero.
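As a rough software analogy of per-component enable signals combined with zero-codeword skipping, the sketch below derives which CAVLD stages need to be active for one block. The function name `enabled_stages` and the stage labels are hypothetical and do not come from [4]. The conditions follow the order of the CAVLC syntax elements in the H.264 standard, and the actual gating logic in [4] may differ.

```python
def enabled_stages(total_coeff, trailing_ones, total_zeros=0, max_coeff=16):
    """Return one enable flag per decoder stage for a single block.

    total_coeff   : number of non-zero coefficients (from Coeff_token)
    trailing_ones : number of trailing +/-1 coefficients (from Coeff_token)
    total_zeros   : zeros before the last coefficient (known after that stage)
    max_coeff     : 16 for a 4x4 block, 4 for a 2x2 chroma DC block
    """
    return {
        # Coeff_token is always parsed first.
        "coeff_token": True,
        # Sign bits exist only if there are trailing ones.
        "t1_sign": trailing_ones > 0,
        # Level decoding is needed only for coefficients beyond the trailing ones.
        "level": total_coeff > trailing_ones,
        # Total_zeros is absent for empty and for completely full blocks.
        "total_zeros": 0 < total_coeff < max_coeff,
        # Run_before is needed only while zeros remain to be distributed.
        "run_before": total_coeff > 1 and total_zeros > 0,
    }


# An all-zero block: every stage after Coeff_token stays frozen.
all_zero_block = enabled_stages(total_coeff=0, trailing_ones=0)
```

For an all-zero block only the Coeff_token stage remains enabled, which is the case that ZCS targets.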

Lin et al. [7] suggested a power-efficient approach called prefix pre-decoding that can reduce power consumption by 25%. Empirical analysis shows that the codeword lengths of Coeff_token rarely exceed 8 bits. The LUTs are therefore divided into two groups: one group holds the LUTs whose codewords stay within 8 bits, and the other holds the LUTs whose codewords can exceed 8 bits.
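The idea of splitting the tables by codeword length can be illustrated with a minimal sketch. The code table below is a toy prefix-free code and not the H.264 Coeff_token table; the names `SHORT_LUT`, `LONG_LUT`, and `decode_stream` are hypothetical. The sketch only counts how often the long-codeword group has to be activated, as a stand-in for the power saved by keeping it disabled.

```python
SHORT_MAX = 8    # longest codeword held in the short group
LONG_MAX = 10    # longest codeword in this toy example

# Toy prefix-free code (an assumption for illustration only).
SHORT_LUT = {"1": "A", "01": "B", "001": "C", "00011": "D"}
LONG_LUT = {"000000001": "E", "0000000001": "F"}


def decode_symbol(bits, pos):
    """Decode one symbol; return (symbol, new_pos, used_long_group)."""
    # Stage 1: only the short group is searched, using at most 8 bits.
    for length in range(1, SHORT_MAX + 1):
        word = bits[pos:pos + length]
        if len(word) == length and word in SHORT_LUT:
            return SHORT_LUT[word], pos + length, False
    # Stage 2: the long group is enabled only after the short group misses.
    for length in range(SHORT_MAX + 1, LONG_MAX + 1):
        word = bits[pos:pos + length]
        if len(word) == length and word in LONG_LUT:
            return LONG_LUT[word], pos + length, True
    raise ValueError("no codeword matches at bit position %d" % pos)


def decode_stream(bits):
    """Decode a whole bit string; return (symbols, long_group_activations)."""
    symbols, long_hits, pos = [], 0, 0
    while pos < len(bits):
        symbol, pos, used_long = decode_symbol(bits, pos)
        symbols.append(symbol)
        long_hits += used_long
    return symbols, long_hits
```

If long codewords are rare, as the empirical analysis above suggests, the second group is idle for most symbols.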

Yu et al. [8] proposed several techniques to improve the performance of CAVLD: merging the Coeff_token and T1s processes to reduce cycles, skipping parts of the decoding process when no coefficient needs to be decoded, and decoding multiple symbols in the Run_before stage. The proposed design uses 90 cycles on average to decode one MB, which meets the real-time HDTV requirement and saves 64% of the cycle count on average compared with the design in [4].
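The real-time claim can be sanity-checked with simple arithmetic. The sketch below assumes 1920x1080 video at 30 frames per second, neither of which is stated above, and it ignores any cycles spent outside CAVLD. The helper names are hypothetical.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def required_clock_mhz(width, height, fps, cycles_per_mb):
    """Minimum clock (MHz) needed to sustain the given cycles per MB."""
    mbs_per_second = macroblocks_per_frame(width, height) * fps
    return mbs_per_second * cycles_per_mb / 1e6


def implied_baseline_cycles(cycles_per_mb, saving):
    """Cycle count of the reference design implied by a relative saving."""
    return cycles_per_mb / (1.0 - saving)


# Assumed operating point: 1080p at 30 fps (not given in the text).
hd_clock = required_clock_mhz(1920, 1080, 30, 90)
baseline = implied_baseline_cycles(90, 0.64)
```

Under these assumptions, 90 cycles per MB corresponds to a clock of roughly 22 MHz, and a 64% saving implies a baseline of about 250 cycles per MB for the reference design.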

Tseng et al. [9] proposed an algorithm based on redesigned LUTs. The most frequently occurring 4x4-block (or 2x2-block) bitstream patterns are recorded in a table; if a pattern in the bitstream matches a table entry, the standard CAVLD procedure is skipped and the block is reconstructed directly. Performance improves by 10% compared with the standard CAVLD procedure. They sample the 4,000 most frequent patterns and rank them by frequency; together, these 4,000 patterns account for 67.63% of all decoded blocks. The patterns are then re-ordered by bit length: 81.07% of them are represented within 8 bits and 96.93% within 12 bits. The pattern search uses a two-pass table look-up method, and all the coefficients in a 4x4 (or 2x2) block can be reconstructed directly.
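A two-pass pattern look-up with a fallback to the standard procedure might be organized as in the following sketch. The pattern table is a toy example with invented bit patterns, the window widths of 8 and 12 bits are taken from the statistics above, and the names `build_table` and `decode_block` are hypothetical. The exact table organization in [9] is not described here and may differ.

```python
PASS1_BITS = 8
PASS2_BITS = 12

# Toy prefix-free patterns: bit pattern -> 16 reconstructed coefficients.
PATTERNS = {
    "1": [0] * 16,
    "0101": [1] + [0] * 15,
    "0111": [-1] + [0] * 15,
    "0010000011": [2, 1] + [0] * 14,
}


def build_table(patterns, window_bits, min_len):
    """Index every window value that starts with a pattern of suitable length."""
    table = {}
    for code, block in patterns.items():
        if min_len <= len(code) <= window_bits:
            pad = window_bits - len(code)
            base = int(code, 2) << pad
            for tail in range(1 << pad):   # all possible following bits
                table[base + tail] = (len(code), block)
    return table


PASS1 = build_table(PATTERNS, PASS1_BITS, 1)
PASS2 = build_table(PATTERNS, PASS2_BITS, PASS1_BITS + 1)


def peek(bits, pos, n):
    """Read n bits as an integer, zero-padded at the end of the stream."""
    return int(bits[pos:pos + n].ljust(n, "0"), 2)


def decode_block(bits, pos, fallback):
    """Return (block, new_pos, matched); call fallback on a table miss."""
    for table, width in ((PASS1, PASS1_BITS), (PASS2, PASS2_BITS)):
        hit = table.get(peek(bits, pos, width))
        if hit is not None and pos + hit[0] <= len(bits):
            length, block = hit
            return list(block), pos + length, True
    # No frequent pattern matched: use the standard CAVLD procedure.
    block, new_pos = fallback(bits, pos)
    return block, new_pos, False
```

Here `fallback` stands for a full CAVLD decoder, which is outside the scope of the sketch; any callable that returns the decoded block and the new bit position can be passed in.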

Kim et al. [10] proposed a new CAVLC decoding method that uses arithmetic hashing operations instead of the conventional table look-up method. Experimental results show that the proposed algorithm is 50% faster and requires 95% fewer memory accesses than three conventional search-based table look-up CAVLC algorithms, such as Moon’s method [5].

2.2. CABAD

occurring frequency during the decoding process. For the more frequent syntax elements, a new architecture is proposed that can decode two regular bins together with one bypass bin in one cycle. These syntax elements include Motion Vector Difference, Significance Map, and Level Information. The context models are also divided into 18 groups according to their access frequency. With this mechanism, accesses to the RAM storing the context models are greatly reduced. For a typical 4 Mbps bitstream at D1 resolution, experimental results show that each MB can be decoded within 500 cycles on average.

Yang et al. [18] proposed several techniques to optimize the CABAC decoding process. At the MB information level, previously decoded MB information is packed so that it can be accessed more efficiently when decoding the current MB. At the slice-data and MB layer level, they improve decoding performance through careful pipeline scheduling, segmented context tables, cache registers, and look-ahead codeword parsing. In summary, it takes three cycles to generate a bin. When an internal loop occurs, there may be a succession of context memory accesses that use the same context values. Cache registers therefore hold the context values, which are written back to context memory only once, at the end of the internal loop. In addition, they propose a look-ahead codeword parsing scheme that detects whether re-normalization of the probability model occurs in CABAD. If the look-ahead condition is met, two bins can be decoded in each cycle; otherwise, it takes one cycle to decode one bit of the CABAD codeword. Furthermore, they partition the context table into multiple segmented context memories. Combining the segmented context memories with the cache registers allows memory to be read and written more flexibly. With all the proposed techniques applied, the cycle count of the CABAD decoding process is reduced by about 53% on average.
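The effect of the cache registers on context memory traffic can be illustrated with a small model. The sketch below assumes a single cache register and uses a placeholder state update instead of the real CABAC probability-state transition; the function names are hypothetical, and the number and organization of registers in [18] may differ.

```python
def update(state):
    """Placeholder for the CABAC probability-state transition."""
    return min(state + 1, 62)


def run_direct(memory, ctx_sequence):
    """Every bin reads and writes context memory; return (mem, reads, writes)."""
    mem, reads, writes = list(memory), 0, 0
    for ctx in ctx_sequence:
        state = mem[ctx]
        reads += 1
        mem[ctx] = update(state)
        writes += 1
    return mem, reads, writes


def run_cached(memory, ctx_sequence):
    """Keep the active context in a register; write back only when it changes."""
    mem, reads, writes = list(memory), 0, 0
    cached_ctx, cached_state = None, None
    for ctx in ctx_sequence:
        if ctx != cached_ctx:
            if cached_ctx is not None:
                mem[cached_ctx] = cached_state   # write back the old context
                writes += 1
            cached_state = mem[ctx]              # load the new context
            reads += 1
            cached_ctx = ctx
        cached_state = update(cached_state)      # update stays in the register
    if cached_ctx is not None:
        mem[cached_ctx] = cached_state           # final write-back
        writes += 1
    return mem, reads, writes
```

For a loop that reuses the same context several times in a row, the cached variant produces the same final context memory with one read and one write per run of identical contexts instead of one read and one write per bin.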

Similarly, Zheng et al. [22] point out that not all the parameters of a reference MB, nor all the possible values of an SE, need to be stored. They show that only 142 bits are required to store a reference MB. The proposed architecture and control/decoding strategy improve time efficiency by 27.9%, 18.2%, and 48.8% for I frames, P frames, and B frames, respectively, compared with the architecture of Chen et al. [16].
