

2.6 Summary

In this chapter, we have focused on the algorithm of the CABAD system. We first introduced the three kinds of binary arithmetic decoding flows of the first decoding layer through the basic arithmetic decoding algorithm. Then, the binarization decoding flow was depicted; it generates the current syntax elements, including mb_type, coded_block_pattern, mvd, the residual data, and so on, and it forms the second decoding layer of CABAD. The CABAD system is thus composed of a two-layer hierarchical decoding architecture. To provide high compression gain with good performance, we have also shown how to address the adaptive context model depending on the value of binIdx. Based on the context model index, the required syntax elements of the neighboring blocks were defined in the previous section.

After this introduction to the behavior of the CABAD system, the hardware implementation will be proposed in Chapters 3 and 4. It includes the binary arithmetic engine, the SRAM-based context model, and the storage for fetching the neighbor information.

Chapter 3

Binary Arithmetic Decoder Engine

In this chapter, we propose the hardware architecture of our design and present data to analyze its efficiency.

In the H.264/AVC system, the entropy coding includes the variable length coding (UVLC and CAVLC) and CABAC. In the baseline profile, UVLD and CAVLD are the main decoders that decompress the macroblock information, namely the parameters and the pixel coefficients. In the main profile, CABAD substitutes for UVLD and CAVLD to restore the video data.

CABAD applies a two-level hierarchical decoding flow. The second level is the binarization decoding flow, which is similar to the process of variable length decoders such as UVLD and CAVLD; apart from looking up the context model index, the binarization algorithm is easy to realize. The first level is the arithmetic decoding flow. This flow has a strong data dependency between the current interval and the next one, which is the cost of the high decompression gain, so it is hard to accelerate this hierarchical decoding flow, and the cycle count of CABAD is worse than that of CAVLD. We therefore focus on throughput improvement in the current design.

This chapter is organized as follows. In Section 3.1, we present the overview of CABAD for H.264/AVC and show the considerations of the two-level decoding process. In Section 3.2, an in-depth discussion of the proposed architecture of the arithmetic decoder is given, and we propose three methods to improve the performance of our design. In Section 3.3, the binarization engine is realized by a finite state machine.

3.1 Overview of CABAD

In our system, CABAD consists of three main modules, namely the arithmetic decoder (AD), the binarization engine, and the SRAM module. AD is the most computationally intensive module, so we focus on enhancing its efficiency.

Figure 14. System architecture of CABAD

Figure 14 is the system architecture of CABAD. The entire decoding procedure is described as follows. When decoding starts, the context model SRAM (399x7 bits) has to be initialized by looking up the initial table, which is implemented as a combinational circuit. AD reads the bit-stream to get the bin value. At the same time, it refers to the current probability from the context model SRAM to find the sub-ranges of MPS and LPS, and it updates the probability at the location of the current context model index (ctxIdx). Through consecutive AD executions, several bin values form a bin string. The binarization engine reads the bin string until it matches a bin definition of the standard [1] and turns it into the mapped syntax element (SE), such as the macroblock parameters, mvd, the residual data, and so on. We store the SE value in the SE buffer (SEB) and, when the macroblock decoding is complete, back up the essential SEs from the SE buffer to the row-storage (RS) SRAM (120x208 bits), which is applied to record the neighbor information.

AG1 generates the address of the row-storage SRAM. We will discuss it in Chapter 4. AG2 generates the address of the context model SRAM which has been defined in Section 2.4.

We rearrange the CABAD decoding flow. Figure 15 is the flow chart of the CABAD decoding flow. At the beginning, all the probabilities of the context model SRAM have to be initialized by the context model initial table, and the context model also has to be re-initialized when a new slice starts. In Figure 15, there are two decoding flows separated by the dotted lines. The first decoding flow is the arithmetic decoder, the first stage of decoding one syntax element; it produces the bin value depending on the current range (codIRange) and the current value (codIOffset).

The second decoding flow is the binarization engine. It reads the bin values to judge whether the bin string forms meaningful data. If not, the binarization engine requests the arithmetic decoder to decode one more bin and re-judges the bin string until the value of the current syntax element is identified. When the current slice is completed, codIRange is re-assigned to 512₁₀ and codIOffset is refilled with 9 bits of the bit-stream from the syntax parser.

Figure 15. Flow chart of the CABAD decoding flow, showing the 1st level and the 2nd level decoding flows [1]
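To make the interaction between the two levels concrete, the following C sketch outlines how one syntax element is decoded according to the flow of Figure 15. The helper names are illustrative stand-ins, not the actual module interfaces: arithmetic_decode_bin() plays the role of the 1st level (AD), while matches_bin_definition() and map_to_syntax_element() abstract the 2nd level (binarization) matching against the bin definitions of the standard [1].

```c
extern unsigned arithmetic_decode_bin(unsigned binIdx);
extern int matches_bin_definition(unsigned bin_string, unsigned len);
extern unsigned map_to_syntax_element(unsigned bin_string, unsigned len);

unsigned decode_syntax_element(void)
{
    unsigned bin_string = 0, binIdx = 0;
    for (;;) {
        /* 1st level: the arithmetic decoder produces one bin per request */
        bin_string = (bin_string << 1) | arithmetic_decode_bin(binIdx);
        binIdx++;
        /* 2nd level: keep requesting bins until the string is meaningful */
        if (matches_bin_definition(bin_string, binIdx))
            return map_to_syntax_element(bin_string, binIdx);
    }
}
```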

3.2 1st Decoding Flow - Architecture of the Arithmetic Decoder

The arithmetic decoder is the first level decoding flow. It decompresses the bit-stream into the bin string, a variable length codeword, and offers it to the binarization engine, the second level decoding flow, which turns it into the SE value.

The arithmetic decoder has three kinds of decoding flows: the normal, the bypass, and the terminal decoding flows. An in-depth discussion of the proposed architecture is given below.

3.2.1 Normal decoding flow

The normal decoding flow accounts for 84% of all AD invocations, as shown in the pie chart of Figure 16. It has the highest usage among the three AD flows, so focusing on the normal decoding flow is the most effective way to improve processing efficiency.

Figure 16. Percentage of the AD usage

The normal decoding flow refers to the probability and the previously decoded bins to produce the current bin. The architecture is described as follows.

Because AD has a data dependency between the current and previous intervals, the cycle-time bottleneck is sensitive to the implementation architecture. Thus, we propose a pipeline organization to overcome the speed problem.

As shown in Figure 17, we divide the normal decoding flow into two stages, which are described as follows.

1st stage

The first stage reads the context model. Because the normal decoding flow has to refer to the context model to generate the bin string, this stage requests the context model SRAM to load the current probability depending on the SE type and the bin index (binIdx). We implement the context model with a two-port SRAM so that both storing and loading of the context model can be done in the same cycle.

Figure 17. Pipeline schedule of the normal decoding flow

2nd stage

The second stage is the sub-division behavior. Figure 18 is the hardware architecture of the three decoding flows; in this section, we discuss the architecture of the normal decoding flow first. We construct AD as a combinational circuit, except for the L1 and L2 pre-load caches, and divide the AD decoding flow into two parts. The first part is the AD kernel. When executing the normal decoding flow, the AD kernel estimates the probability of the next interval and the current ranges of LPS and MPS by means of the rangeTabLPS, transIdxMPS, and transIdxLPS tables, which are implemented as hardwired logic.

AD kernel:
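As a reference for this part, the following C sketch outlines the sub-division of one normal-decoded bin, following the arithmetic decoding process of the standard [1]. The contents of rangeTabLPS, transIdxLPS, and transIdxMPS are the tables defined there, and read_bit() is an illustrative stand-in for the pre-load cache interface discussed below.

```c
typedef struct {
    unsigned pStateIdx;   /* probability state index, 0..63 */
    unsigned valMPS;      /* value of the most probable symbol */
} ContextModel;

extern const unsigned char rangeTabLPS[64][4];
extern const unsigned char transIdxLPS[64];
extern const unsigned char transIdxMPS[64];
extern unsigned read_bit(void);

unsigned decode_normal(ContextModel *cm,
                       unsigned *codIRange, unsigned *codIOffset)
{
    unsigned bin;
    /* sub-divide the current interval into MPS and LPS sub-ranges */
    unsigned qIdx = (*codIRange >> 6) & 3;
    unsigned rLPS = rangeTabLPS[cm->pStateIdx][qIdx];
    *codIRange -= rLPS;                     /* MPS sub-range */

    if (*codIOffset >= *codIRange) {        /* LPS path */
        bin = !cm->valMPS;
        *codIOffset -= *codIRange;
        *codIRange = rLPS;
        if (cm->pStateIdx == 0)
            cm->valMPS = !cm->valMPS;
        cm->pStateIdx = transIdxLPS[cm->pStateIdx];
    } else {                                /* MPS path */
        bin = cm->valMPS;
        cm->pStateIdx = transIdxMPS[cm->pStateIdx];
    }
    /* renormalization: shift until the MSB of the 9-bit codIRange is '1' */
    while (*codIRange < 0x100) {
        *codIRange <<= 1;
        *codIOffset = (*codIOffset << 1) | read_bit();
    }
    return bin;
}
```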

Figure 18. Hardware architecture of AD

Table 9. Truth table of the shift number definition related to codIRange[8:0]

The second part is the renormalization arbiter and the pre-load cache. We adopt a 2-level pre-load cache and the renormalization arbiter to avoid the waiting time of bit-stream loading. The renormalization arbiter detects whether the value of codIRange is smaller than 0100₁₆ and generates the shift number of codIRange; the shift number is also the number of bit-stream bits that codIOffset needs to fill in. Table 9 shows the shift number for all possible cases. As a result, the MSB of codIRange is always '1'.
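What Table 9 tabulates can be sketched as follows: the shift number is the count of leading zeros of the 9-bit codIRange, i.e., the number of left shifts needed until bit 8 becomes '1'. In hardware this is a simple priority encoder; the loop below is only a behavioral model of it.

```c
unsigned shift_number(unsigned codIRange)   /* 9-bit value */
{
    unsigned n = 0;
    while (n < 9 && ((codIRange << n) & 0x100) == 0)
        n++;
    return n;   /* also the number of bits codIOffset must refill */
}
```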

After the shift number is determined, codIRange can be obtained and codIOffset is filled with enough bit-stream bits by means of the renormalization arbiter. We propose the 2-level pre-load cache to provide the sufficient bit-stream that the renormalization arbiter requires.

We meet three cases related to providing the bit-stream, as follows.

In the first case, there is no need to provide any bit-stream, so the L1 cache offers nothing.

Figure 19. Example of the second case

The second case is the general one, in which the current index of the L1 cache is greater than or equal to the shift number. The renormalization arbiter fetches the necessary bit-stream only from the L1 cache. Figure 19 is an example of the second case.

Assume the current index of the L1 cache is located at bit 5 and the shift number of the renormalization arbiter is equal to five, which means that the 5-bit bit-stream of the L1 cache is available.


Figure 20. Example of the third case

The third case has to borrow the bit-stream from the L2 cache when the current index of the L1 cache is less than the shift number. The renormalization arbiter fetches the required bit-stream not only from the L1 cache but also from the L2 cache, and the mapping control signal is asserted at the same time. On reading the mapping control signal, the content of the L2 cache is sent to the L1 cache and the L2 cache loads new bit-stream from the bit-stream SRAM. Figure 20 shows an example of the third case. Assume the current index of the L1 cache is located at bit 2 and the shift number of the renormalization arbiter is equal to five, which means that the L1 cache can provide only 3 bits and the renormalization arbiter reads the extra bit-stream from the L2 cache. The third case prevents the loading miss penalty that arises when the L1 cache cannot offer enough bit-stream, where the loading miss penalty refers to the handshaking delay of loading between the renormalization arbiter and the bit-stream SRAM.
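The refill behavior of the three cases can be sketched in C as follows. The 8-bit cache width, the load_word() helper, and the interpretation of the index as the number of valid bits remaining in L1 are assumptions for illustration only, not the actual design parameters.

```c
#define CACHE_BITS 8              /* assumed cache width (illustrative) */
extern unsigned load_word(void);  /* hypothetical bit-stream SRAM read */

typedef struct {
    unsigned l1, l2;   /* L1/L2 pre-load caches */
    unsigned index;    /* number of valid bits remaining in L1 */
} PreloadCache;

/* return `shift` bits for codIOffset, refilling L1 from L2 when needed */
unsigned fetch_bits(PreloadCache *c, unsigned shift)
{
    unsigned bits;
    if (shift == 0) {                       /* case 1: nothing needed */
        return 0;
    } else if (c->index >= shift) {         /* case 2: L1 alone suffices */
        bits = (c->l1 >> (c->index - shift)) & ((1u << shift) - 1);
        c->index -= shift;
    } else {                                /* case 3: borrow from L2 */
        unsigned from_l1 = c->index;
        unsigned from_l2 = shift - from_l1;
        bits  = (c->l1 & ((1u << from_l1) - 1)) << from_l2;
        bits |= (c->l2 >> (CACHE_BITS - from_l2)) & ((1u << from_l2) - 1);
        /* mapping control signal: L2 -> L1, L2 refilled from the SRAM */
        c->l1 = c->l2;
        c->l2 = load_word();
        c->index = CACHE_BITS - from_l2;
    }
    return bits;
}
```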

Table 10 lists the required bit-stream from the L1 and L2 caches, where the symbol "V" denotes the required bit-stream from the L1 and L2 caches. The first row of this table is the first case: it needs no bit-stream, because the shift number is equal to "0" no matter what the index of the L1 cache is. The gray regions belong to the third case, in which the renormalization arbiter fetches the bit-stream from both the L1 and L2 caches. The other regions are the second case, which only needs to fetch the bit-stream of the L1 cache.

Thus, both the AD kernel and the renormalization arbiter share one cycle to compute one bin.

Table 10. Required bit-stream from the L1 and L2 caches ("V" denotes the required bit-stream from the L1 and L2 caches)

We divide the normal decoding of AD into the two stages shown in Figure 17. The first stage reads the context model SRAM; the second stage decodes the bit-stream into a bin and writes the probability back to the context model SRAM. We apply these two stages to schedule the pipeline organization.


Figure 21. Timing diagram of the pipeline comparison

Figure 21 shows the timing diagram of the pipeline comparison for the normal decoding in AD. In every cycle, the current bin is produced by the AD operation of the second stage, the current context model is written back through the write port of the context model SRAM, and the next context model is read from the read port of the context model SRAM. The read-port index of the context model SRAM is looked up depending on the current division condition. It can be seen that the schedule without pipelining produces one bin every 2 cycles, while the schedule with pipelining produces one bin per cycle on average. Compared with the non-pipelined organization, the pipelined normal decoding flow saves about 50% of the processing cycles.

Besides, it is difficult to produce more than one bin per cycle by cascading two normal decoding ADs, because of the data dependency between the previous sub-division and the next context model. It would cost extra hardware to read and write the context model in the same cycle in order to reduce the processing cycles. Thus, we do not adopt a dual-AD architecture with cascaded normal decoding; instead, we will show a simpler parallel connection with low hardware complexity in the next section.

3.2.2 Bypass decoding flow

When decoding the syntax elements of mvd and the residual data, the bypass decoding of AD is the typical format used to define the coefficients and the sign bit. The Exp-Golomb (EG) code accounts for most of the bypass decoding usage. According to Figure 16, the bypass decoding flow occupies 16% of the AD usage, so it is worthwhile to improve the throughput of producing the bin string.

The algorithm has been discussed in Section 2.2.2.2 and the flow chart of the bypass decoding is shown in Figure 10. Based on this flow chart, the bypass decoding unit is mainly composed of only one subtractor, as shown in Figure 23(a). The bypass decoding flow does not need to refer to the context model, so it is unnecessary to load the context model from the context model SRAM. Hence, the bypass decoding flow needs only a single stage to complete one bin. codIRange is not changed either, which means that only codIOffset and the bin value have to be computed. So the complexity of the bypass decoding unit is much less than that of the normal decoding architecture.
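A sketch of one bypass-decoded bin, following the decoding process of the standard [1], is given below: codIRange stays unchanged, so only codIOffset and the bin value are computed, which needs just one subtractor in hardware. read_bit() again stands in for the bit-stream interface.

```c
extern unsigned read_bit(void);

unsigned decode_bypass(unsigned codIRange, unsigned *codIOffset)
{
    *codIOffset = (*codIOffset << 1) | read_bit();
    if (*codIOffset >= codIRange) {
        *codIOffset -= codIRange;
        return 1;                 /* bin = 1 */
    }
    return 0;                     /* bin = 0 */
}
```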

Figure 22 shows the statistics of the number and percentage of consecutive bypass decodings when executing the six typical test sequences. In 96.5% of the cases, a run of consecutive bypass decodings produces fewer than six bins per issue. So we construct five bypass decoding units in cascade, as shown in Figure 23(b).

Figure 22. Percentage of the number of consecutive bypass decodings when executing 100 CIF frames of six sequences, including mother and daughter and football

Figure 23. (a) Bypass decoding unit and (b) organization of the multi-symbol bypass decoding

The cascade architecture of consecutive bypass decoding pre-computes five sub-divisions and pre-generates five bins per cycle. However, according to our statistics, it does not always execute five bypass decoding flows. The binarization engine judges whether the bins of the 5-bin string are valid symbols. If the first n bins are valid, the n-th result of codIOffset has to be selected by the binarization engine, through a 5-to-1 multiplexer, to feed the next AD operation, where "n" is the position in the cascade bypass decoding architecture. We arrange the probable cases in Table 11.

Table 11 shows the valid bin string that the binarization engine has to take and the valid codIOffset that feeds the next sub-interval division executed by AD.

Table 11. Cascade bypass decoding output values for the five required cases

The cascade architecture produces at most five bin values in one cycle. In other words, under consecutive bypass decoding it consumes as few as 0.2 cycles per bin.
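The cascade of Figure 23(b) can be sketched as follows, reusing the decode_bypass() sketch above: five sub-divisions are pre-computed per cycle, and each intermediate codIOffset is kept as a candidate for the 5-to-1 multiplexer, so the binarization engine can select the n-th result once it knows how many bins are valid.

```c
typedef struct { unsigned bin; unsigned codIOffset; } BypassResult;

extern unsigned decode_bypass(unsigned codIRange, unsigned *codIOffset);

void bypass_cascade(unsigned codIRange, unsigned codIOffset,
                    BypassResult out[5])
{
    for (int n = 0; n < 5; n++) {     /* unrolled combinationally in HW */
        out[n].bin = decode_bypass(codIRange, &codIOffset);
        out[n].codIOffset = codIOffset;   /* candidate for the 5-to-1 mux */
    }
}
```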

According to our statistics in Figure 22, the average run length of consecutive bypass decodings is about 3 per EG issue. Compared with the bypass decoding without the cascade architecture, the cascade reduces the processing cycles of the bypass decoding by a factor of about 3. Hence, the cascade architecture improves the efficiency of the CABAD system by about 10%.

3.2.3 Terminal decoding flow

The final AD flow is the terminal decoding flow, which is applied to the syntax elements mb_type and end_of_slice_flag. Because the terminal decoding flow is used to judge whether the current slice is complete, it works only once or twice per macroblock. Thus, the terminal decoding flow is seldom used in the CABAD system. The pie chart of Figure 16 shows the percentage of AD usage: the terminal decoding flow occupies approximately 0%, so we do not care much about its efficiency.

According to the flow chart of Figure 11, the terminal decoding flow needs only one comparator, which compares codIOffset with codIRange, plus the renormalization part. Hence, it shares the comparator and the renormalization part with the normal decoding flow, as shown in Figure 18.
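A sketch of the terminal decoding flow, following the decoding process of the standard [1], is given below: codIRange is reduced by 2 and one comparison decides the bin, while renormalization is skipped on the LPS (terminate) condition, as discussed next.

```c
extern unsigned read_bit(void);

unsigned decode_terminal(unsigned *codIRange, unsigned *codIOffset)
{
    *codIRange -= 2;
    if (*codIOffset >= *codIRange)
        return 1;                     /* LPS: slice ends, skip renorm */
    while (*codIRange < 0x100) {      /* MPS: renormalize as usual */
        *codIRange <<= 1;
        *codIOffset = (*codIOffset << 1) | read_bit();
    }
    return 0;
}
```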

Based on the algorithm of the terminal decoding flow, codIOffset and codIRange do not need to be renormalized when sub-dividing to LPS. Thus, the control signal "skip renormalization" in Figure 18 is introduced. "skip renormalization" is activated in two cases: one is when the bypass decoding flow is used; the other is the LPS condition when the terminal decoding flow is applied. "skip renormalization" controls one 9-bit 2-to-1 multiplexer for codIRange and one 10-bit 2-to-1 multiplexer for codIOffset to skip the renormalization part. The simple logic function is shown in Figure 24.

Figure 24. Simple logic function for the "skip renormalization" control signal
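A one-line sketch of this control signal, matching the two activation cases above (the signal and operand names are illustrative):

```c
/* asserted for the bypass decoding flow, or for the LPS condition of the
 * terminal decoding flow; steers the 2-to-1 multiplexers that bypass the
 * renormalization part */
unsigned skip_renormalization(unsigned is_bypass,
                              unsigned is_terminal, unsigned is_lps)
{
    return is_bypass | (is_terminal & is_lps);
}
```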

3.3 2nd Decoding Flow - Architecture of the Binarization Engine

The binarization engine is the second level decoding flow of the CABAD architecture, and we treat it as the top module of our proposed design. It schedules the timing of reading and writing back the context model and of selecting the arithmetic decoding flows. It also controls the syntax element buffer (SEB) that records the required coefficients into the row-storage SRAM (RS SRAM). In addition to the aforementioned tasks, the main work of the binarization engine is to read the bin string and find the suitable syntax element values by means of the unary, truncated unary, fixed-length, Exp-Golomb, and specially defined codes.


State 0: waiting for the request of the syntax parser
State 1: check the AD mode
  If (normal decoding): get the context model
  If (bypass or terminal decoding): skip getting the context model
State 2: decode the bin string
  Match definition: binIdx = 0
  No match definition: binIdx = binIdx + 1, request AD to produce the next bin
State 3: generate the SE value

Figure 25. Finite state machine of the binarization engine

Figure 25 shows the finite state machine (FSM) of the binarization engine. The first state (state 0) is the stand-by state: the binarization engine waits for the request of the syntax parser that activates the CABAD system, and then jumps to state 1. State 1 checks the type of AD. If it is the normal decoding, the binarization engine reads the neighbor information from the RS SRAM, generates the context model index, and reads the context model from the context model SRAM. Then, the FSM jumps to state 2. State 2 is the binary tree which we have defined in Section 2.3. Based on the bin index (binIdx), the
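As a reference, the FSM of Figure 25 can be sketched in C as follows. All helpers are hypothetical stand-ins: ad_mode() reports the decoding flow, load_context_model() covers the RS SRAM and context model SRAM reads, and bin_string_matches() abstracts the binary-tree matching of Section 2.3. The back-edge from state 2 to state 1, which fetches the context model of the next bin, is our reading of the flow rather than an explicit statement of the design.

```c
typedef enum { S0_WAIT, S1_CHECK_MODE, S2_DECODE, S3_GEN_SE } State;
typedef enum { NORMAL, BYPASS, TERMINAL } AdMode;

extern void wait_for_request(void);
extern AdMode ad_mode(void);
extern void load_context_model(unsigned binIdx);
extern unsigned ad_decode_bin(void);
extern void append_bin(unsigned bin);
extern int bin_string_matches(void);
extern void emit_se(void);

void binarization_engine(void)
{
    State s = S0_WAIT;
    unsigned binIdx = 0;
    for (;;) {
        switch (s) {
        case S0_WAIT:               /* stand-by for the syntax parser */
            wait_for_request();
            s = S1_CHECK_MODE;
            break;
        case S1_CHECK_MODE:         /* only normal decoding loads the
                                       context model */
            if (ad_mode() == NORMAL)
                load_context_model(binIdx);
            s = S2_DECODE;
            break;
        case S2_DECODE:             /* grow the bin string until it
                                       matches a defined code */
            append_bin(ad_decode_bin());
            if (bin_string_matches()) {
                binIdx = 0;
                s = S3_GEN_SE;
            } else {
                binIdx++;           /* request AD for the next bin */
                s = S1_CHECK_MODE;
            }
            break;
        case S3_GEN_SE:             /* output the syntax element value */
            emit_se();
            s = S0_WAIT;
            break;
        }
    }
}
```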
