
Chapter 3 Binary Arithmetic Decoder Engine

3.1 Overview of CABAD

3.2.1 Normal decoding flow

The normal decoding flow accounts for 84% of all AD operations, as shown in the pie chart of Figure 16, which is the highest share among the three AD flows. Therefore, focusing on the normal decoding flow is the most effective way to improve processing efficiency.

Figure 16. Percentage of the AD usage

The normal decoding flow refers to the probability model and the previously decoded bin values to produce the current bin. The architecture is shown below.

Because AD exhibits a data dependency between the current and previous intervals, the processing-cycle bottleneck is sensitive to the implementation architecture. Thus, we propose a pipeline organization to overcome the speed problem.

As shown in Figure 17, we divide the normal decoding flow into the two stages described below.

1st stage

The first stage reads the context model. Because the normal decoding flow must refer to the context model to generate the bin string, this stage requests the context model SRAM to load the current probability according to the SE type and the bin index (binIdx). We implement the context model with a two-port SRAM so that the context model can be stored and loaded in the same cycle.
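A minimal C sketch of this first stage is given below. The probability-state fields follow the H.264/AVC CABAC context-model definition, while the SRAM depth and the address mapping ctx_addr() are hypothetical placeholders for illustration only.

```c
#include <stdint.h>

/* One context model entry: a 6-bit probability state and the MPS value,
 * following the H.264/AVC CABAC context-model definition. */
typedef struct {
    uint8_t pStateIdx;  /* probability state index, 0..63 */
    uint8_t valMPS;     /* value of the most probable symbol, 0 or 1 */
} ctx_model_t;

/* Hypothetical two-port context model SRAM; the depth is an assumption. */
#define CTX_SRAM_DEPTH 460
static ctx_model_t ctx_sram[CTX_SRAM_DEPTH];

/* Hypothetical address mapping from the SE type and binIdx to an SRAM
 * entry; the real mapping follows the ctxIdxOffset/ctxIdxInc rules of
 * each syntax element. */
static int ctx_addr(int se_type, int bin_idx)
{
    return (se_type * 8 + bin_idx) % CTX_SRAM_DEPTH;
}

/* Stage 1: load the current probability from the read port of the context
 * model SRAM; with a two-port SRAM, the write-back of the previous context
 * can proceed in the same cycle through the write port. */
static ctx_model_t stage1_read_context(int se_type, int bin_idx)
{
    return ctx_sram[ctx_addr(se_type, bin_idx)];
}
```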

Figure 17. Pipeline schedule of the normal decoding flow

2nd stage

The second stage performs the interval subdivision. Figure 18 shows the hardware architecture of the three decoding flows; in this section, we discuss the architecture of the normal decoding flow first. We build AD from combinational circuits, except for the L1 and L2 pre-load caches, and divide the AD decoding flow into two parts. The first part is the AD kernel. When executing the normal decoding flow, the AD kernel estimates the probability of the next interval and the current ranges of the LPS and MPS by means of the rangeTabLPS, transIdxMPS, and transIdxLPS tables, which are implemented as hardwired logic.

AD kernel:

Figure 18. Hardware architecture of AD
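As a software reference for the subdivision behavior of the AD kernel, the sketch below follows the standard CABAC decoding procedure for one regular bin; it reuses ctx_model_t from the previous sketch, the contents of the rangeTabLPS, transIdxMPS, and transIdxLPS tables are omitted, and the renormalization is factored out into renorm(), mirroring the split between the AD kernel and the renormalization arbiter described next.

```c
/* Hardwired tables defined by the H.264/AVC standard (contents omitted). */
extern const uint8_t rangeTabLPS[64][4];
extern const uint8_t transIdxMPS[64];
extern const uint8_t transIdxLPS[64];

/* Renormalization arbiter, sketched in the following discussion. */
void renorm(uint16_t *codIRange, uint16_t *codIOffset);

/* AD kernel: decode one bin with the normal (context-based) decoding flow.
 * codIRange and codIOffset are the 9-bit interval registers. */
static int decode_normal_bin(ctx_model_t *ctx,
                             uint16_t *codIRange, uint16_t *codIOffset)
{
    int bin;
    uint8_t  qIdx         = (*codIRange >> 6) & 0x3;           /* quantized range  */
    uint16_t codIRangeLPS = rangeTabLPS[ctx->pStateIdx][qIdx];  /* LPS sub-interval */

    *codIRange -= codIRangeLPS;               /* MPS sub-interval */
    if (*codIOffset >= *codIRange) {          /* offset falls in the LPS interval */
        bin = !ctx->valMPS;
        *codIOffset -= *codIRange;
        *codIRange   = codIRangeLPS;
        if (ctx->pStateIdx == 0)
            ctx->valMPS = !ctx->valMPS;
        ctx->pStateIdx = transIdxLPS[ctx->pStateIdx];
    } else {                                  /* offset falls in the MPS interval */
        bin = ctx->valMPS;
        ctx->pStateIdx = transIdxMPS[ctx->pStateIdx];
    }
    renorm(codIRange, codIOffset);            /* handled by the renormalization arbiter */
    return bin;
}
```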

Table 9. Truth table of the shift number definition related to codIRange (indexed by codIRange[8:0])

The second part is the renormalization arbiter and the pre-load caches. We adopt a 2-level pre-load cache and the renormalization arbiter to avoid the waiting time of bit-stream loading. The renormalization arbiter compares the value of codIRange against 0100 (hexadecimal) and generates the shift number of codIRange. The shift number is also the number of bit-stream bits that codIOffset needs to be filled with. Table 9 shows the shift number for all possible cases. As a result, the MSB of codIRange is always '1' after renormalization.
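A behavioral sketch of the shift-number detection in Table 9 and the resulting multi-bit renormalization is given below; fetch_bits() stands for the 2-level pre-load cache, which is sketched after the three cases, and the 9-bit widths follow codIRange[8:0].

```c
/* Bit-stream supply from the 2-level pre-load cache (sketched later). */
uint16_t fetch_bits(int shift);

/* Shift-number detection of Table 9: how many left shifts are needed so
 * that bit 8 (the MSB) of the 9-bit codIRange becomes '1'. */
static int shift_number(uint16_t codIRange)
{
    int shift = 0;
    while (shift < 8 && ((codIRange << shift) & 0x100) == 0)
        shift++;
    return shift;
}

/* Renormalization arbiter: shift codIRange once by the detected amount and
 * fill codIOffset with the same number of bit-stream bits in one step,
 * instead of renormalizing one bit per iteration. */
void renorm(uint16_t *codIRange, uint16_t *codIOffset)
{
    int shift = shift_number(*codIRange);
    *codIRange  = (uint16_t)((*codIRange  << shift) & 0x1FF);
    *codIOffset = (uint16_t)(((*codIOffset << shift) | fetch_bits(shift)) & 0x1FF);
}
```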

After the shift number is determined, codIRange can be renormalized and codIOffset is filled with the required bit-stream by the renormalization arbiter. We propose a 2-level pre-load cache to provide the bit-stream that the renormalization arbiter requires.

Three cases arise in providing the bit-stream, as follows.

In the first case, no bit-stream needs to be provided, so the L1 cache offers nothing.

Figure 19. Example of the second case

The second case is the general one, in which the current index of the L1 cache is greater than or equal to the shift number. The renormalization arbiter fetches the necessary bit-stream from the L1 cache only. Figure 19 shows an example of the second case.

Assume the current index of the L1 cache is located at bit 5 and the shift number of the renormalization arbiter is equal to five; the 5 bits of bit-stream required can then be supplied by the L1 cache.

Figure 20. Example of the third case

The third case, in which the current index of the L1 cache is less than the shift number, has to borrow bit-stream from the L2 cache. The renormalization arbiter fetches the required bit-stream not only from the L1 cache but also from the L2 cache, and the mapping control signal is asserted at the same time. On reading the mapping control signal, the content of the L2 cache is sent to the L1 cache and the L2 cache loads new bit-stream from the bit-stream SRAM. Figure 20 shows an example of the third case. Assume the current index of the L1 cache is located at bit 2 and the shift number of the renormalization arbiter is equal to five; the L1 cache can then provide only 3 bits of bit-stream, and the renormalization arbiter reads the extra bits from the L2 cache. The third case prevents the loading-miss penalty that would occur when the L1 cache cannot offer enough bit-stream, where the loading-miss penalty refers to the handshaking latency of loading between the renormalization arbiter and the bit-stream SRAM.
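A behavioral C sketch of the 2-level pre-load cache covering the three cases is shown below. The 8-bit cache width, the bit-indexing convention (the current index points at the most significant unread bit of L1), and load_from_bitstream_sram() are assumptions chosen to match the examples of Figures 19 and 20.

```c
/* Hypothetical 8-bit pre-load caches and current L1 index. */
static uint8_t l1_cache, l2_cache;
static int     l1_index = 7;                 /* MSB position of the next unread L1 bit */

extern uint8_t load_from_bitstream_sram(void);   /* hypothetical bit-stream SRAM read */

/* Provide 'shift' bit-stream bits to the renormalization arbiter. */
uint16_t fetch_bits(int shift)
{
    uint16_t bits;

    if (shift == 0)                          /* case 1: nothing is needed */
        return 0;

    if (l1_index >= shift) {                 /* case 2: the L1 cache alone suffices */
        bits = (uint16_t)((l1_cache >> (l1_index - shift + 1)) & ((1u << shift) - 1));
        l1_index -= shift;
    } else {                                 /* case 3: borrow from L2, assert mapping */
        int from_l1 = l1_index + 1;          /* bits still available in L1 */
        int from_l2 = shift - from_l1;       /* extra bits taken from the top of L2 */
        uint16_t l1_bits = (uint16_t)(l1_cache & ((1u << from_l1) - 1));
        uint16_t l2_bits = (uint16_t)((l2_cache >> (8 - from_l2)) & ((1u << from_l2) - 1));
        bits = (uint16_t)((l1_bits << from_l2) | l2_bits);

        /* Mapping control signal = 1: L2 content moves to L1 and the L2
         * cache reloads new bit-stream from the bit-stream SRAM. */
        l1_cache = l2_cache;
        l1_index = 7 - from_l2;              /* next unread bit of the old L2 content */
        l2_cache = load_from_bitstream_sram();
    }
    return bits;
}
```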

Table 10 lists the required bit-stream from the L1 and L2 caches, where the symbol "V" denotes the required bit-stream from the L1 and L2 caches. The first row of the table corresponds to the first case: no bit-stream is needed because the shift number is equal to "0", no matter what the index of the L1 cache is. The gray regions belong to the third case, in which the renormalization arbiter fetches the bit-stream from both the L1 and L2 caches. The other regions are the second case, in which only the bit-stream of the L1 cache needs to be fetched.

Thus, both the AD kernel and the renormalization arbiter share one cycle to compute one bin.

Table 10. Required bit-stream from the L1 and L2 caches for each shift number and index of the L1 cache ("V" marks the bit positions required from the L1 cache and the L2 cache)

We divide the normal decoding of AD into the two stages shown in Figure 17. The first stage reads the context model SRAM. The second stage decodes the bit-stream into a bin and writes the updated probability back to the context model SRAM. We schedule the pipeline organization on these two stages.

Figure 21. Timing diagram of the pipeline comparison

Figure 21 shows the timing diagram of the pipeline comparison for the normal decoding in AD. In every cycle, the current bin is produced by the AD operation of the second stage, the current context model is written back through the write port of the context model SRAM, and the next context model is read from the read port of the context model SRAM. The read-port index of the context model SRAM is looked up according to the current subdivision result. It can be seen that the schedule without pipelining produces one bin every 2 cycles, whereas the pipelined schedule produces one bin per cycle on average. Compared with the non-pipelined organization, the pipelined normal decoding flow therefore saves about 50% of the processing cycles.
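The saving can also be read from the cycle counts: without pipelining, N bins take 2N cycles (one read cycle plus one decode cycle each), whereas the pipelined schedule overlaps the stage-1 read of bin i+1 with the stage-2 decode of bin i and takes about N+1 cycles. The loop below is only a schematic model of the pipelined schedule; it reuses the earlier sketches, and se_type_of()/bin_idx_of() are hypothetical helpers that select the context of the next bin.

```c
/* Hypothetical helpers that map the bin position to its SE type and binIdx. */
extern int se_type_of(int i);
extern int bin_idx_of(int i);

/* Schematic model of the pipelined schedule: each loop iteration stands for
 * one cycle in which the stage-2 decode of bin i overlaps the stage-1
 * context read of bin i+1. */
static void decode_bins_pipelined(int num_bins, uint16_t *codIRange,
                                  uint16_t *codIOffset, int *bins)
{
    /* Prologue cycle: read the context of the first bin. */
    ctx_model_t cur_ctx = stage1_read_context(se_type_of(0), bin_idx_of(0));

    for (int i = 0; i < num_bins; i++) {
        /* Stage 2: decode bin i and write the updated context back through
         * the write port of the context model SRAM. */
        bins[i] = decode_normal_bin(&cur_ctx, codIRange, codIOffset);
        ctx_sram[ctx_addr(se_type_of(i), bin_idx_of(i))] = cur_ctx;

        /* Stage 1 (same cycle, read port): fetch the context of bin i+1,
         * whose selection depends on the bin just decoded. */
        if (i + 1 < num_bins)
            cur_ctx = stage1_read_context(se_type_of(i + 1), bin_idx_of(i + 1));
    }
    /* Total: about num_bins + 1 cycles, versus 2 * num_bins without pipelining. */
}
```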

Besides, it is difficult to produce more than one bin per cycle by cascading two normal-decoding AD units, because of the data dependency between the previous subdivision and the next context model. Reading and writing the context model in the same cycle to reduce the processing cycles would cost additional hardware. Thus, we do not adopt the cascaded dual-AD architecture for normal decoding; instead, we will show a simpler parallel connection with low hardware complexity in the next section.
