CHAPTER 2 ALGORITHM OF CABAC
2.4 PREVIOUS WORK
Many architectures for arithmetic coders and decoders have been proposed in the literature for different standards. In this section, we summarize the main ideas behind these architectures.
In [4], a 4-stage pipeline design for the JPEG2000 MQ-coder is proposed. Figure 2.13 shows the block diagram of the architecture in [4]. Stages 0 and 1 perform the context update and the range computation; the context index is updated in the next cycle. Because the range register is 16 bits wide and the low register is 28 bits wide in JPEG2000, the adder for low is pipelined into two stages to reduce the path delay. Stage 2 computes the least significant 16 bits of the low register. Stage 3 computes the most significant 12 bits of the low register and generates the output bits.
Figure 2.13 4-stage pipeline design of encoder in [4]
In [7], an extension of the architecture in [4] is proposed that can handle more than one symbol per cycle. To reduce the delay of the critical path, the order of some elementary operations is inverted. As a consequence, six branches are needed to cover the allowable cases. This method is called the inverse multiple branch selection (IMBS) method.
In [5] and [6], a CABAC architecture is proposed for multiple standards and JPEG2000. The architecture is a 3-stage pipeline design. Figure 2.14 shows the pipelined context-based AC encoding flow in [5] and [6]. In stage 0, the interval computation and the context information update are completed. In stage 1, the bit-stuffing handler takes care of bit stuffing. The bit-stuffing handler uses a buffer to detect the sequence “0xFF” in the result bits. If it detects the sequence “0xFF”, it inserts a bit after the sequence. The inserted bit is the carry bit generated by the previous stage. Because the output length of the bit-stuffing handler is 0, 1, or 2 bytes, a 4-byte FIFO register is added to limit the output length to one byte.
Figure 2.14 Pipelined context-based AC encoding flow in [5] & [6]
In [8], the architecture is based on H.264/AVC. Figure 2.15 shows the block diagram of the arithmetic coder in [8]. The arithmetic coder is a 2-stage pipeline design. The first stage, stage 0, handles the encoding iteration. In this design, the original LPS table is divided into four LPS tables that are independent of range. In the encoding iteration, a carry-save adder and a prefix adder are used to reduce the computation time of low and range. In the renormalization, leading-zero detection is performed in parallel with the prefix adder to reduce the renormalization time. Stage 1 packs the output bits from stage 0 into bytes. The bit-packing stage also detects the sequence “0xFF” in the result, so that a result is not emitted while it can still be changed by carry propagation. If the bit-packing stage detects the sequence “0xFF”, it suspends the output process and counts the total length of the “0xFF” run until later operations rule out a carry propagation.
Figure 2.15 Block diagram of arithmetic coder in [8]
In [10], a co-processor architecture on an SoC platform is proposed. In the coder design, the MZ-coder is used instead of the M-coder of H.264/AVC. The MZ-coder provides a bit rate equivalent to that of the M-coder. Furthermore, the MZ-coder eliminates the multiple renormalization cycles of the M-coder. The co-processor achieves a constant throughput of 1 symbol per cycle for both the encoding and the decoding processes.
In [11], the bypass coding of CABAC in H.264/AVC is improved. Different hardware is used to handle the regular mode and the bypass mode. If a regular-mode symbol follows a bypass-mode symbol, the architecture can encode the two symbols in one cycle. The probability of this situation is about 10%. Therefore, the performance increases by about 10% on average.
In the next chapter, we propose a novel architecture of arithmetic encoder and decoder for H.264/AVC. The encoder is a 3-stage pipeline design and the decoder is a non-pipelined design. To reduce the path delay, we rearrange the range operations in both the encoder and the decoder. Furthermore, we extend the encoder architecture to encode more than one symbol per cycle.
Chapter 3
Architecture of Arithmetic Encoder and Decoder
In this chapter, we present hardware architectures for arithmetic encoder and decoder. First, we show a basic encoder architecture, which can encode one symbol per cycle. Then, based on the basic architecture, we further extend the design to support the encoding of multiple symbols per cycle. In the second section, we show the architecture of arithmetic decoder.
3.1 Encoding Architecture
Figure 3.1 shows the block diagram of the basic architecture, which includes 3 pipeline stages. In our architecture, we separate the operations of range and low into stage 0 and stage 1. In stage 2, the byte packing unit packs the results into bytes. Furthermore, the architecture supports two encoding modes. Specifically, as illustrated in Figure 3.1, stage 0 computes the value of range and updates the context probability model. Stage 1 computes the value of low and generates the output bits. Stage 2 groups the bits output by stage 1 and packs them in a byte-by-byte manner. In addition, the bit stuffing is also done in stage 2.
Figure 3.1 Pipeline stage of one-symbol encoding architecture
To improve encoding throughput, we propose an extended architecture that can encode multiple symbols per cycle. To achieve this, the range operations are first reordered to shorten the critical path. Then we duplicate the one-symbol encoding architecture and add additional hardware in each pipeline stage. The details are given in Section 3.1.2.
3.1.1 One-symbol Encoding Architecture
In this section, we detail the design of each pipeline stage. For the stage 0, the operation includes two parts, which are the computation of range and the update of context probability model. Particularly, the critical path of the stage 0 is the computation of range, as shown in Figure 3.2 (a). We summarize the operations in the critical path as follows:
1. The table look-up of rLPS (range of LPS).
2. The subtraction for getting rMPS (range of MPS).
3. The renormalization.
To reduce the delay of the critical path, we rearrange the order of these operations. Originally, to produce the rLPS, the look-up table takes both the range and the probability of LPS, i.e., Qe, as input. To eliminate the data dependency between the rLPS and the range, we produce 4 sub-tables by unrolling the cases of range. After the unrolling, each sub-table simply takes Qe as input, as shown in Figure 3.2 (b). Then, the renormalization of the previous iteration and the table look-up of rLPS can be done in parallel, as shown in Figure 3.2 (c). Lastly, we can reorganize the operations in the iteration, as shown in Figure 3.2 (d).
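The sub-table unrolling can be illustrated with a minimal Python sketch. The table values below are placeholders chosen for illustration, not the rLPS table of the H.264/AVC standard, and the function names are hypothetical. The sketch only shows that looking up all four candidates from Qe alone and selecting one afterwards with range[7:6] gives the same result as the direct two-input look-up.

```python
# Placeholder rLPS values (NOT the H.264/AVC table). Rows are Qe indices,
# columns are the four range classes selected by range[7:6].
RLPS_TABLE = [
    [128, 176, 208, 240],
    [100, 130, 160, 190],
    [60, 80, 100, 120],
]

# Unroll the table into four sub-tables, each indexed by Qe alone.
SUB_TABLES = [[row[q] for row in RLPS_TABLE] for q in range(4)]


def rlps_direct(qe, rng):
    """Original look-up: needs both Qe and the current range."""
    return RLPS_TABLE[qe][(rng >> 6) & 3]


def rlps_unrolled(qe, rng):
    """Unrolled look-up: the four candidates depend on Qe only, so they can
    be fetched while the previous range is still being renormalized."""
    candidates = [sub[qe] for sub in SUB_TABLES]
    # Late 4-to-1 selection once range[7:6] is available.
    return candidates[(rng >> 6) & 3]
```

In hardware terms, the list of candidates corresponds to four parallel table outputs and the final indexing to a multiplexer driven by two bits of range.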
Figure 3.2 Path of range in encoding iteration.
The detailed block diagram of stage 0 is shown in Figure 3.3. As shown, stage 0 has three input signals, one output signal, and three intermediate signals for the next stage. The meaning of each signal is as follows:
• The signal “Symbol” is the symbol to be encoded.
• The signal “Context” is the context probability model, which includes the Qe and the MPS.
• The signal “Encode mode” specifies whether the coding is in the regular mode or the bypass mode.
• The other two signals passed to the next stage carry the value that will be added to low and the number of output bits of this encoding step.
• The output signal “Context update” is used to update the context information.
To support the bypass mode in stage 0, the range register and the signal “Addtolow” are controlled by the signal “Encode mode”. When the encoding is in bypass mode, the range remains unchanged and the signal “Addtolow” takes the value of range. Then, the signal “Encode mode” is passed to the next stage.
Figure 3.3 Detailed block diagram of stage 0.
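The behaviour of stage 0 can be sketched in Python as follows. This is a simplified functional model in the straightforward operation order, not the reordered datapath of Figure 3.2 (d). The function name `stage0` is hypothetical, the rLPS value is passed in by the caller instead of being looked up, and the 9-bit range with a renormalization threshold of 256 follows the H.264/AVC arithmetic coder. The bypass branch assumes that nothing is added to low when the bypass symbol is 0.

```python
def stage0(rng, symbol, mps, rlps, bypass):
    """Return (new_range, add_to_low, shift) for one encoding step."""
    if bypass:
        # Bypass mode: range is unchanged; low is later doubled (shift = 1)
        # and range is added to it only when the symbol is 1 (assumption).
        return rng, (rng if symbol else 0), 1

    rmps = rng - rlps                 # subtraction for rMPS
    if symbol == mps:
        new_range, add_to_low = rmps, 0
    else:
        new_range, add_to_low = rlps, rmps

    # Renormalization: shift left until bit 8 of the 9-bit range is set.
    shift = 0
    while new_range < 256:
        new_range <<= 1
        shift += 1
    return new_range, add_to_low, shift
```

The returned pair `(add_to_low, shift)` plays the role of the intermediate signals that stage 0 hands to stage 1.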
In stage 1, the main operations are the computation and the renormalization of low. Figure 3.4 shows the detailed block diagram of stage 1. As shown, stage 1 takes the intermediate signals produced by stage 0 as input and produces “Output bits” and “N_bits”. The signals “Output bits” and “N_bits” stand for the encoding result of one symbol and the associated number of coded bits, respectively. In particular, when the coding is in bypass mode, the value of low is first shifted to the left by one bit and the total shift value is set to 1.
Figure 3.4 Block diagram of the stage1 in basic architecture.
In the stage 2, the encoding results, i.e., the compressed bits, from the stage 1 are packed in a byte-by-byte manner. In [7], the byte-stuffing technology is used for packing. To detect the occurrence of 0xFF sequence, a 16-bit buffer is used to buffer the compressed bits. When the value of 0xFF is detected and identified, there are possibilities that the carry propagation will affect the byte that has been outputted.
Therefore, we need to hold the output byte and use a register to store the length of the stuffed bytes. The operation of the packing buffer is shown in Figure 3.5. In the beginning, the second byte in Pbuffer is 0xFF. We then store the length of the 0xFF run in the register “N_bytestuff” and the bit value of the 0xFF byte in the register “Stuff”. After storing this stuffing information, we continue the process of byte packing. If the next byte is not 0xFF, stage 2 outputs the first byte in the packing buffer, the value in the register “Stuff”, and the value in the register “N_bytestuff”. On the other hand, if the next byte is 0xFF, the register “N_bytestuff” is increased by 1. If the following operations produce a carry signal, the register “Stuff” is set to 0 and the first byte of the packing buffer is increased by 1.
Figure 3.6 shows the block diagram of stage 2. The register “Pbuffer” stores the encoding results. The register “N_Pbuffer” records the number of result bits in “Pbuffer”. The registers “Stuff” and “N_bytestuff” together hold the byte-stuffing information. Upon the detection of a “0xFF” byte, the register “Stuff” records the content of the stuffing bits and “N_bytestuff” records the number of bytes that have the value “0xFF”. Once the next byte is not “0xFF”, the stuffing information is output together with the signal “Outputbyte”.
Figure 3.5 Operation in packing buffer.
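The hold-and-count behaviour described above can be sketched at byte granularity in Python. This is a simplified software model, not the bit-level 16-bit Pbuffer of the proposed hardware: the class name `CarryAwarePacker`, its attributes, and the `carry` argument are hypothetical, and the model assumes that a carry is reported together with the byte that follows it.

```python
class CarryAwarePacker:
    """Holds one byte plus a count of following 0xFF bytes until it is
    known whether a carry will propagate into them."""

    def __init__(self):
        self.held = None          # byte that may still be incremented
        self.n_ff = 0             # number of pending 0xFF bytes
        self.out = bytearray()    # bytes that are final

    def push(self, byte, carry=False):
        if self.held is None:     # first byte: nothing to resolve yet
            self.held = byte
        elif carry:
            # Carry ripples through the 0xFF run: the held byte is
            # incremented and every pending 0xFF becomes 0x00.
            self.out.append((self.held + 1) & 0xFF)
            self.out.extend(b"\x00" * self.n_ff)
            self.held, self.n_ff = byte, 0
        elif byte == 0xFF:
            self.n_ff += 1        # extend the run, keep holding
        else:
            self._emit_pending()  # no carry can reach the held bytes now
            self.held = byte

    def _emit_pending(self):
        self.out.append(self.held)
        self.out.extend(b"\xff" * self.n_ff)
        self.n_ff = 0

    def finish(self):
        if self.held is not None:
            self._emit_pending()
            self.held = None
        return bytes(self.out)
```

For example, pushing 0x12, 0xFF, 0xFF and then 0x34 with a carry yields the bytes 0x13, 0x00, 0x00, 0x34, whereas the same sequence without a carry is emitted unchanged.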
3.1.2 Multiple Symbols Encoding Architecture
For improving the performance of AC encoding, we propose an encoding architecture that is capable of coding multiple symbols per cycle. While maintaining similar or higher coding performance, our scalable architecture provides the flexibility to adjust clock rate by changing the number of coding symbols per cycle.
Figure 3.6 Block diagram of the stage2 in the basic architecture.
Figure 3.7 shows the block diagram of the scalable architecture. To encode more than one symbol per cycle, we duplicate the one-symbol encoding architecture and add additional hardware in each pipeline stage. To encode n symbols per cycle, we duplicate the one-symbol encoding architecture n times in stages 0 and 1. As shown in Figure 3.7, the one-symbol encoding unit in stages 0 and 1 includes the range operation, the context update, and the low operation. For multi-symbol encoding, these functional units are duplicated. After a symbol is encoded, the values of range and low are passed to the next functional unit. In stage 0, if two or more symbols encoded in the same cycle use the same context probability model, the later symbols must use the context probability model after the update. Therefore, a multiplexer is needed to select the correct context information. In stage 1, the number of result bits produced by each low operation is variable. To reduce the workload and complexity of stage 2, we combine all the results before passing the data to stage 2.
In stage 2, we insert a small input buffer to support multi-symbol encoding. The basic byte packing unit can process 8 bits in one cycle. When coding one symbol, the average number of result bits from stage 1 is less than 1. When we extend the design to multiple-symbol encoding, the probability that the total number of input bits exceeds 8 is very small. Such an exception occurs only a few times in each video frame. Thus, we insert a small buffer in front of stage 2. The input buffer limits the number of input bits to 8. As a result, the structure of the byte packing unit in stage 2 can remain unchanged.
Figure 3.7 Block diagram of the multiple symbols encoding architecture
Figure 3.8 details the unit for result combination. The result combination unit consists of shifters and adders. There are two kinds of input signals in Figure 3.8. The signal “Output bits_i” (i=0,1,…,n) is the encoding result of one symbol and the signal “R_shift_i” (i=0,1,…,n) denotes the length of that encoding result. To combine the results produced by different low operations, we first shift the previous encoding result to the correct position. Then we use adders to combine all the result bits. In this way, the result combination unit outputs the total number of result bits and the combined sequence of result bits.
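The shift-and-add combination can be sketched in Python as below. The function name `combine_results` and the representation of each result as a `(bits, length)` pair are assumptions made for illustration. Addition is used instead of a bitwise OR on the assumption that a later result may carry into the bits of an earlier one, which would explain the use of adders.

```python
def combine_results(results):
    """Combine per-symbol results, given as (bits, length) pairs in
    encoding order, into one bit sequence and its total length."""
    total_bits = 0
    total_len = 0
    for bits, length in results:
        # Shift the earlier bits up to make room, then add the new ones.
        total_bits = (total_bits << length) + bits
        total_len += length
    return total_bits, total_len
```

For instance, the results `10`, `1`, an empty result, and `011` combine into the 6-bit sequence `101011`.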
Figure 3.8 Block diagram of result combination unit
Figure 3.9 shows the block diagram of the input limit buffer. There are two input signals, two output signals, and two local registers. The register “Buffer” temporarily stores the residual bits if the length of the previous packing bits is greater than 8. The register “N_buffer” records the number of bits stored in the register “Buffer”. If the buffer is not empty, we combine the bits in the buffer with the input bits. Then we check whether the total number of bits is greater than 8. If it is, we select the 8 most significant bits of the combined result as the output and keep the residual bits in the buffer. Otherwise, we pass the bits directly to the byte packing unit.
Figure 3.9 Block diagram of input limit buffer
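A minimal Python sketch of the input limit buffer follows. The class name `InputLimitBuffer` and the method `step` are hypothetical; the attributes `buffer` and `n_buffer` mirror the registers “Buffer” and “N_buffer”. The sketch assumes one call per cycle and that the input value fits in the stated number of bits.

```python
class InputLimitBuffer:
    """Limits the number of bits forwarded per cycle to `limit` (8 here)."""

    def __init__(self, limit=8):
        self.limit = limit
        self.buffer = 0      # residual bits from earlier cycles
        self.n_buffer = 0    # number of valid bits in `buffer`

    def step(self, bits, n_bits):
        """Accept this cycle's bits; return (out_bits, out_len)."""
        # Residual bits are older, so they go in front of the new bits.
        combined = (self.buffer << n_bits) | bits
        total = self.n_buffer + n_bits
        if total > self.limit:
            residual = total - self.limit
            out = combined >> residual                 # 8 most significant bits
            self.buffer = combined & ((1 << residual) - 1)
            self.n_buffer = residual
            return out, self.limit
        # Everything fits: pass through and leave the buffer empty.
        self.buffer, self.n_buffer = 0, 0
        return combined, total
```

With a 12-bit input, the first call forwards the upper 8 bits and keeps 4; a following 2-bit input is then forwarded together with those 4 residual bits.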
3.1.3 Multiple Standard Support
In addition to supporting multi-symbol encoding, our structure can also be easily tailored to support the arithmetic encoding in JPEG2000. Table 3.1 summarizes the differences between the arithmetic coders in H.264/AVC and JPEG2000. There are three major differences: (1) the method for getting rLPS, (2) the operations of low and range, and (3) the precision for representing range and low. In JPEG2000, rLPS depends only on Qe, whereas in H.264/AVC, rLPS depends on both Qe and range. For the operations of low and range, JPEG2000 updates low by adding the value of rLPS when the input symbol is the MPS. On the other hand, H.264/AVC updates low by adding the value of rMPS when the input symbol is the LPS. Lastly, the precision for representing range and low is different. Specifically, JPEG2000 requires higher precision for range and low.
Table 3.1 Differences of arithmetic coder in H.264/AVC and JPEG2000
        H.264/AVC                 JPEG2000
rLPS    table[Qe][range[7:6]]     Qe
rMPS    range - rLPS              range - rLPS
To support JPEG2000, our design is modified to accommodate these differences. More specifically, when the coding is for JPEG2000, we remove the 4 sub-tables of LPS and directly connect the Qe to the range compute unit. In the encoding operation, we change the value to be added to low from rMPS to rLPS. The addition is then performed when the symbol equals the MPS instead of when it equals the LPS. Lastly, we use higher-precision registers and adders to meet the needs of JPEG2000. In this way, our design can be extended to support JPEG2000 without changing the architecture significantly.
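The change in the low update between the two standards can be summarized by the following Python sketch. The function name `add_to_low` and the string labels are hypothetical, and the sketch covers only the rule stated above; other details of the JPEG2000 MQ-coder, such as the conditional exchange of the MPS and LPS intervals, are deliberately left out.

```python
def add_to_low(standard, symbol_is_mps, rlps, rmps):
    """Value added to low for one regular-mode symbol."""
    if standard == "H.264/AVC":
        # Low moves up by rMPS only when the symbol is the LPS.
        return 0 if symbol_is_mps else rmps
    if standard == "JPEG2000":
        # Low moves up by rLPS only when the symbol is the MPS.
        return rlps if symbol_is_mps else 0
    raise ValueError("unsupported standard: " + standard)
```

Both branches select between zero and one precomputed value, which is consistent with the claim that only the operand and the condition of the addition need to change.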
3.2 Decoder Architecture
In this section, we describe the architecture of the binary arithmetic decoder. Our decoder architecture decodes only one symbol per cycle. Unlike in the encoder, the context index for a symbol is known at the decoder only after the previous symbol has been decoded. Because of this strong data dependency and the insufficient context information, it is more difficult to decode multiple symbols per cycle. Thus, our proposed decoder architecture is not pipelined.
To reduce the computation delay, we apply the same reordering technique as in the encoder. Figure 3.10 (a) shows the critical path in the straightforward implementation. As shown, the longest delay is for the computation of the new offset value. The offset value is determined after the range, and the computation of range needs four steps. The first two steps in the decoding process are similar to those in the encoding process: the rLPS and the rMPS are obtained in these two steps. When rMPS is known, we compare the value of rMPS with the offset to make the decision. After the comparison, the new range and offset are determined, and we then renormalize the range and the offset. To shorten the critical path, we employ the same reordering technique used for the encoder. As shown in Figure 3.10 (b), the registers of range and offset store the values without renormalization. Also, we use 4 sub-tables to eliminate the data dependency between rLPS and range. Figure 3.10 (b) illustrates the proposed decoder data path after the reordering.
Figure 3.10 Data path of range in decoder
Figure 3.11 shows the detailed architecture of the decoder. Basically, the decoder has four input signals and three output signals. The signal “Bits_in” replenishes the least significant bits of the offset value during the renormalization. The encode mode indicates whether the decoding process is in regular or bypass mode. The state index and the MPS form the context information. The signal “Output bit” is the decoding result, and the signals “Qe update” and “MPS update” update the context probability model. To support bypass decoding, two additional multiplexers are deployed in this design, as shown by the dashed blocks in Figure 3.11. In bypass mode, we make the decision according to the range and the doubled offset. The first multiplexer chooses rMPS or range according to the encode mode. The second multiplexer chooses either the doubled value of the offset or the value of the offset for decoding. In our design, however, the renormalization is done at the beginning. Therefore, we can combine the shift of the renormalization with the shift that doubles the offset. In bypass decoding, the shift value of the offset is increased by 1.
Figure 3.11 Architecture of the AC decoder
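The decoding decision in both modes can be sketched in Python as follows. This is a functional model in the straightforward order of Figure 3.10 (a), following the binary arithmetic decoding procedure of H.264/AVC, and not the reordered data path of Figure 3.10 (b). The function names are hypothetical, the rLPS value is supplied by the caller, and `next_bit` is an assumed callable standing in for the signal “Bits_in”.

```python
def decode_regular(rng, offset, mps, rlps, next_bit):
    """Decode one regular-mode symbol; return (bit, new_range, new_offset)."""
    rmps = rng - rlps
    if offset >= rmps:
        # LPS path: the offset falls into the upper sub-interval.
        bit, offset, rng = 1 - mps, offset - rmps, rlps
    else:
        bit, rng = mps, rmps
    # Renormalize range and offset together, pulling in new stream bits.
    while rng < 256:
        rng <<= 1
        offset = (offset << 1) | next_bit()
    return bit, rng, offset


def decode_bypass(rng, offset, next_bit):
    """Decode one bypass symbol: compare the doubled offset with range."""
    offset = (offset << 1) | next_bit()
    if offset >= rng:
        return 1, rng, offset - rng
    return 0, rng, offset
```

The bypass function makes the role of the two multiplexers visible: the comparison uses range instead of rMPS, and the doubled offset instead of the offset.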
Chapter 4
Implementation
4.1 Design flow and verification
In this section, we detail the design flow and the verification environment. In the beginning of the design, we build the C model for the proposed architecture. The C model is used not only for analysis but also for debugging in RTL-level design. After the C model design and verification, the design and the simulation of the register