
Chapter 3 Binary Arithmetic Encoding Engine

3.2 Three throughput promoting methods

3.2.3 Case Efficiency Architecture

3.2.3.1 Throughput bottleneck of arithmetic encoding engine

The arithmetic encoder is the core of the CABAC encoder and dominates the throughput of the whole CABAC encoding process. Of the three encoding modes of the arithmetic encoder, normal encoding occurs most often, as shown in Figure 23 of Section 3.2. Within normal encoding, the renormalization process takes the most time because of the accumulation of bitsOutstanding. Improving the renormalization process is therefore the most effective way to raise the throughput of the CABAC encoder.

The bottleneck of the arithmetic encoding engine is the accumulation of bitsOutstanding in the renormalization process. It leads to the following two bottlenecks:

(1) the data dependence between successive symbols.

(2) the variable executing cycles of the renormalization process.

The first bottleneck makes a multi-symbol architecture for normal encoding difficult to implement because of the data dependence; it is an intrinsic drawback of the arithmetic encoding algorithm. The second bottleneck is caused by the accumulation of bitsOutstanding: the accumulated number ranges from 0 to 31, so accumulating bitsOutstanding for one symbol takes 0 to 31 cycles. Both bottlenecks can be seen in Figure 7 and Figure 8 of Section 2.4.1. An example follows:

If the initial state of the renormalization process is:

The current codlRange is: 0_0000_01xx => shift = 6
The current codlLow is: 11_0110_1110
The initial bitsOutstanding = 0

The corresponding renormalization process:

(step 0): 11_0110_1110 => PutBit(1) => 1
(step 1): 10_1101_1100 => PutBit(1) => 1
(step 2): 01_1011_1000 => bitsOutstanding = 1
(step 3): 01_0111_0000 => bitsOutstanding = 2
(step 4): 00_1110_0000 => PutBit(0) => 011
(step 5): 01_1100_0000 => bitsOutstanding = 1

=> The bit-stream of the current symbol is 1_1011
=> The remainder bitsOutstanding = 1

(PS: The remainder bitsOutstanding is carried over to the next symbol.)

(Eq. 15)
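The steps of (Eq. 15) can be reproduced with a short bit-serial model of the renormalization loop. This is only a sketch in Python: the names `renorm_encode` and `put_bit` are illustrative and not taken from the original design, the loop follows the usual CABAC renormalization flow (the flowchart referred to as Figure 7), and the first-bit handling of a full encoder is omitted as an assumption.

```python
def renorm_encode(cod_range, cod_low, bits_outstanding=0):
    """Bit-serial renormalization, one loop iteration per shift.

    cod_range is the 9-bit range, cod_low the 10-bit low value.
    Returns (bit_stream, cod_low, bits_outstanding, steps).
    """
    out = []

    def put_bit(b):
        # Emit b, then flush every pending outstanding bit as its complement.
        nonlocal bits_outstanding
        out.append(str(b))
        out.extend(str(1 - b) * bits_outstanding)
        bits_outstanding = 0

    steps = 0
    while cod_range < 256:
        if cod_low < 256:            # top two bits are 00
            put_bit(0)
        elif cod_low >= 512:         # top bit is 1
            cod_low -= 512
            put_bit(1)
        else:                        # top two bits are 01: undecided bit
            cod_low -= 256
            bits_outstanding += 1
        cod_range <<= 1
        cod_low <<= 1
        steps += 1
    return "".join(out), cod_low, bits_outstanding, steps


# State of (Eq. 15); the "xx" bits of codlRange are set to 00 here,
# any value of them gives the same shift of 6.
bits, low, remainder, steps = renorm_encode(0b0_0000_0100, 0b11_0110_1110)
print(bits, remainder, steps)
```

The remainder returned by one call has to be passed as `bits_outstanding` to the next call, which is exactly the symbol-to-symbol dependence described as the first bottleneck, and the loop count is the variable cycle count of the second.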

The remainder bitsOutstanding is the first throughput bottleneck mentioned above: because it is carried over to the next symbol, a multi-symbol architecture for normal encoding is difficult to design.

The second throughput bottleneck is the accumulation of bitsOutstanding itself: because bitsOutstanding is accumulated cycle by cycle, normal encoding may take many cycles in some situations. In the next section, we propose the Case Efficiency Architecture to reduce the inefficiency caused by the second bottleneck.

3.2.3.2 Throughput efficiency design

The accumulation of bitsOutstanding restricts the throughput of the whole arithmetic encoder, so we analyze the statistical characteristics of bitsOutstanding to find a way to improve it. The following pie charts show the statistics for five different video test sequences, the same sequences as those used in Figure 24 and Figure 25 of Section 3.2.1.

Figure 31 Statistic of bitsOutstanding for the low complexity and slow motion video sequence

Figure 32 Statistic of bitsOutstanding for the high complexity and fast motion video sequence

Figure 33 Statistic of bitsOutstanding for 1080HD video sequence

Figure 31, Figure 32, and Figure 33 show similar probability distributions for bitsOutstanding. The smaller the accumulated number of bitsOutstanding, the higher its probability, so we use this characteristic to divide the whole renormalization process into cases.

Figure 34 shows the tree map of the renormalization process, which corresponds to the flowchart shown in Figure 7 of Section 2.4.1. In Figure 34, PB(1) and PB(0) denote the processes of putting bit “1” and putting bit “0”, and (bs+1) denotes bitsOutstanding + 1. (9,8) denotes bit 9 and bit 8 of codlLow, and 7, 6, 5, 4, …, 0 denote bits 7, 6, 5, 4, …, 0 of codlLow.

We divide the iterative renormalization process into several cases and determine the bit-stream and remainder bitsOutstanding of each case from the tree map shown in Figure 34. An example is given in (Eq. 15) of Section 3.2.3.1.

The cases are divided according to the codlLow of the current symbol and the remainder bitsOutstanding of the previous symbol. The codlLow is 10 bits wide and the maximum shift determined by codlRange is 7, so at most 7 bitsOutstanding can be accumulated for one symbol. The cases, grouped by the accumulated number of bitsOutstanding, are listed in the following tables. Because the smaller accumulated numbers of bitsOutstanding have too many cases, only Table 12, Table 13, and Table 14 are shown here. The cases in Table 12 and Table 13 are those whose last remainder bitsOutstanding equals 0, and the cases in Table 14 are those whose last remainder bitsOutstanding equals 1.

Table 12 The cases whose accumulated number of bitsOutstanding is 7 (last remainder bitsOutstanding = 0; 4 cases)

| codlLow | Remainder bitsOutstanding | Corresponding bit-stream | Bit-stream length |
|---|---|---|---|
| 01_1111_110x | 0 | 0111_1111 | 8 |
| 10_1111_111x | 7 | 1 | 1 |

Table 13 The cases whose accumulated number of bitsOutstanding is 6 (last remainder bitsOutstanding = 0; 12 cases)

| codlLow | Remainder bitsOutstanding | Corresponding bit-stream | Bit-stream length |
|---|---|---|---|
| 01_1111_10xx | 0 | 011_1111 | 7 |
| 10_1111_11xx | 6 | 1 | 1 |
| shift = 7 | | | |
| 00_0111_111x | 6 | 00 | 2 |
| 00_1111_110x | 0 | 0011_1111 | 8 |
| 01_0111_111x | 6 | 01 | 2 |
| 01_1111_100x | 0 | 0111_1110 | 8 |
| 01_1111_101x | 1 | 011_1111 | 7 |
| 10_0111_111x | 6 | 10 | 2 |
| 10_1111_110x | 0 | 1011_1111 | 8 |
| 11_0111_111x | 6 | 11 | 2 |

Table 14 The cases whose accumulated number of bitsOutstanding is 5 (last remainder bitsOutstanding = 1; 32 cases)

| codlLow | Remainder bitsOutstanding | Corresponding bit-stream | Bit-stream length |
|---|---|---|---|
| 01_1111_0xxx | 0 | 011_1111 | 7 |
| 10_1111_1xxx | 5 | 10 | 2 |
| shift = 6 | | | |
| 00_0111_11xx | 5 | 010 | 3 |
| 00_1111_10xx | 0 | 0101_1111 | 8 |
| 01_0111_11xx | 5 | 011 | 3 |
| 01_1111_00xx | 0 | 0111_1110 | 8 |
| 01_1111_01xx | 1 | 011_1111 | 7 |
| 10_0111_11xx | 5 | 100 | 3 |
| 10_1111_10xx | 0 | 1001_1111 | 8 |
| 11_0111_11xx | 5 | 101 | 3 |
| shift = 7 | | | |
| 00_0011_111x | 5 | 0100 | 4 |
| 00_0111_110x | 0 | 0_1001_1111 | 9 |
| 00_1011_111x | 5 | 0101 | 4 |
| 00_1111_100x | 0 | 0_1011_1110 | 9 |
| 00_1111_101x | 1 | 0101_1111 | 8 |
| 01_0011_111x | 5 | 0110 | 4 |
| 01_0111_110x | 0 | 0_1101_1111 | 9 |
| 01_1011_111x | 5 | 0111 | 4 |
| 01_1111_000x | 0 | 0_1111_1100 | 9 |
| 01_1111_001x | 1 | 0111_1110 | 8 |
| 01_1111_010x | 0 | 0_1111_1101 | 9 |
| 01_1111_011x | 2 | 011_1111 | 7 |
| 10_0011_111x | 5 | 1000 | 4 |
| 10_0111_110x | 0 | 1_0001_1111 | 9 |
| 10_1011_111x | 5 | 1001 | 4 |
| 10_1111_100x | 0 | 1_0011_1110 | 9 |
| 11_0011_111x | 5 | 1010 | 4 |
| 11_0111_110x | 0 | 1_0101_1111 | 9 |
| 11_1011_111x | 5 | 1011 | 4 |

Table 15 The number of cases for each accumulated number of bitsOutstanding of one symbol (applies to every value of the last remainder bitsOutstanding)

| Accumulated number of bitsOutstanding | Number of corresponding cases |
|---|---|
| 7 | 4 |
| 6 | 12 |
| 5 | 32 |
| 4 | 80 |
| 3 | 187 |
| 2 | 298 |
| 1 | 355 |
| 0 | 52 |

Table 15 shows the number of cases for each accumulated number of bitsOutstanding when encoding one symbol; these cases total 1020.

The total of 1020 cases can be verified as follows.

The shift, which is determined by codlRange, ranges from 0 to 7 (the relationship is detailed in Table 18).

The codlLow is 10 bits wide, but the shift ranges only from 0 to 7, so only the 9 bits codlLow[9:1] need to be considered. This is illustrated in Figure 35.

Figure 35 Illustration of the shifting left of codlLow

So the total number of cases can be estimated as follows:

$$\text{Total cases} = \underbrace{2^{9}}_{\text{shift}=7} + \underbrace{2^{8}}_{\text{shift}=6} + \underbrace{2^{7}}_{\text{shift}=5} + \underbrace{2^{6}}_{\text{shift}=4} + \underbrace{2^{5}}_{\text{shift}=3} + \underbrace{2^{4}}_{\text{shift}=2} + \underbrace{2^{3}}_{\text{shift}=1} + \underbrace{2^{2}}_{\text{shift}=0} = 1020 \qquad \text{(Eq. 16)}$$

Besides the 1020 cases, the last remainder bitsOutstanding has to be considered. The remainder bitsOutstanding ranges from 0 to 31, so it can take 32 values. Therefore, the total number of cases for the renormalization process is 1020 x 32 = 32640.
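The counts above can be checked with a few lines of arithmetic. The following Python sketch uses illustrative names that are not from the original design; it recomputes the per-shift sum of (Eq. 16), compares it with the column total of Table 15, and multiplies by the 32 possible remainder values.

```python
# Cases per shift value as given in (Eq. 16): shift s contributes 2**(s + 2).
cases_per_shift = {shift: 2 ** (shift + 2) for shift in range(8)}
cases_per_symbol = sum(cases_per_shift.values())

# Case counts per accumulated number of bitsOutstanding, copied from Table 15.
table15 = {7: 4, 6: 12, 5: 32, 4: 80, 3: 187, 2: 298, 1: 355, 0: 52}

# The remainder bitsOutstanding ranges over 0..31, i.e. 32 values.
REMAINDER_VALUES = 32
total_cases = cases_per_symbol * REMAINDER_VALUES

print(cases_per_symbol, sum(table15.values()), total_cases)
```

Both ways of counting give the same per-symbol total, which is a useful consistency check on Table 15.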

Building all of these cases is inefficient, because so many cases would make the critical path too long. We therefore analyze the utility rate of the cases based on the probability distribution of bitsOutstanding for typical video test sequences.

Table 16 The utility rate of each case for one symbol (probability of appearance, %)

Table 16 shows the utility rate of each case for one symbol; it is the same for every value of the remainder bitsOutstanding. We rank the cases by utility rate. The cases whose utility rate is smaller than 0.041% are implemented with a sequential circuit: bitsOutstanding is accumulated one per cycle, and the bit-stream for one symbol takes more than one cycle to produce. The cases whose utility rate is greater than 0.041% are implemented with a combinational circuit only: the bit-stream for one symbol is produced in a single cycle. The one-cycle cases account for about 78.41% of occurrences.

Next, we have to consider the probabilities of the remainder bitsOutstanding. We analyze only the values with the larger probabilities, namely remainder bitsOutstanding smaller than or equal to 7.

Table 17 The number of cases and the probabilities of the different containing ranges of remainder bitsOutstanding

| Containing range of remainder bitsOutstanding | Remainder values included | Number of cases | Probability | Case utility rate |
|---|---|---|---|---|
| < 1 | 0 | 455 x 1 = 455 | 78.41% x 51.37% = 40.28% | 0.0885% |
| < 2 | 0+1 | 455 x 2 = 910 | 78.41% x 75.78% = 59.42% | 0.0652% |
| < 3 | 0+1+2 | 455 x 3 = 1365 | 78.41% x 87.88% = 68.91% | 0.0505% |
| < 4 | 0+1+2+3 | 455 x 4 = 1820 | 78.41% x 93.93% = 73.65% | 0.0405% |
| < 5 | 0+1+2+3+4 | 455 x 5 = 2275 | 78.41% x 96.95% = 76.01% | 0.0334% |
| < 6 | 0+1+2+3+4+5 | 455 x 6 = 2730 | 78.41% x 98.46% = 77.20% | 0.0283% |
| < 7 | 0+1+2+3+4+5+6 | 455 x 7 = 3185 | 78.41% x 99.22% = 77.80% | 0.0244% |
| < 8 | 0+1+2+3+4+5+6+7 | 455 x 8 = 3640 | 78.41% x 99.60% = 78.10% | 0.0215% |

Figure 36 Probabilities of the different containing ranges of remainder bitsOutstanding

Table 17 shows the number of cases and the resulting probabilities for the different containing ranges of remainder bitsOutstanding, and Figure 36 plots these probabilities. The curve saturates as the range of remainder bitsOutstanding increases. Our design therefore adopts a containing range of 3; that is, it supports remainder bitsOutstanding equal to 0, 1, and 2. This gives 455 x 3 = 1365 cases, all implemented with combinational circuits only, so 68.91% of renormalization processes can be executed in a single cycle.

The remaining 32640 – 1365 = 31275 cases are too numerous and too rarely used to justify a combinational implementation, so they are implemented with a sequential circuit.
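The trade-off in Table 17 can be recomputed from the figures quoted in the text. The Python sketch below uses illustrative names that are not from the original design; the cumulative probabilities are copied from Table 17. It also notes, as an observation rather than a stated design rule, that a range of 3 is the largest range whose per-case utility rate stays above the 0.041% threshold used earlier.

```python
ONE_CYCLE_CASES = 455      # one-cycle cases per remainder value (Table 17)
ONE_CYCLE_PROB = 0.7841    # probability of the one-cycle cases (78.41%)
# Cumulative probability of remainder bitsOutstanding < 1, < 2, ..., < 8.
CUMULATIVE = [0.5137, 0.7578, 0.8788, 0.9393, 0.9695, 0.9846, 0.9922, 0.9960]
THRESHOLD = 0.00041        # 0.041% utility-rate threshold from the text


def table17_rows():
    """Return (range, cases, probability, per-case utility) for each range."""
    rows = []
    for n, cum in enumerate(CUMULATIVE, start=1):
        cases = ONE_CYCLE_CASES * n
        prob = ONE_CYCLE_PROB * cum
        rows.append((n, cases, prob, prob / cases))
    return rows


rows = table17_rows()
for n, cases, prob, util in rows:
    print(f"range < {n}: {cases:4d} cases, {prob:.2%} one-cycle, {util:.4%} per case")

# Largest range whose per-case utility is still above the threshold.
largest = max(n for n, _, _, util in rows if util > THRESHOLD)
sequential_cases = 1020 * 32 - ONE_CYCLE_CASES * largest
print(largest, sequential_cases)
```

Each extra remainder value adds another 455 combinational cases while the covered probability grows by less and less, which is the saturation visible in Figure 36.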

Based on the above analysis of the renormalization process, we propose the Case Efficiency Architecture shown in Figure 37 to improve the efficiency of the renormalization process.

Figure 37 shows the Case Efficiency Architecture for the renormalization process. The upper blue block contains the high-probability cases, which are implemented with combinational circuits only; it produces the bit-stream of the current symbol in a single cycle.

The generated bit-stream has variable length, so a control signal, the bit-stream length, is used to put the bit-stream into the bit-stream output buffer. The bit-stream length is determined in the lower blue block of Figure 37, which is also implemented with combinational circuits only. The red block contains the low-utility-rate cases, which are implemented with a sequential circuit. There, bitsOutstanding is accumulated one per cycle, so generating the bit-stream for one symbol takes several cycles, the number of which is determined by the shift number.

The shift number is determined by the current codlRange (9 bits); the relationship is shown in Table 18, and the Shift Judgement block in Figure 37 is implemented from it. The relationship follows from the codlRange decision branch shown in Figure 7 of Section 2.4.1: if the current codlRange is smaller than 256 (decimal), it is shifted left until it is no longer smaller than 256. We simplify this process into the rule shown in Table 18, so that the shift can be determined in a single cycle by the combinational Shift Judgement block of Figure 37. In Table 18, the shift number is the number of zeros before the first “1”, counted from the MSB of codlRange. For example, if the current codlRange is 0_0010_1101, the corresponding shift is 3.
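The leading-zero rule for the shift can be illustrated in software. The Python sketch below uses hypothetical function names: `shift_from_range` counts the zeros before the first “1” of the 9-bit codlRange, and `shift_by_loop` models the iterative shift-until-256 branch it replaces. It is a behavioural model under these assumptions, not the circuit of Figure 37.

```python
RANGE_BITS = 9  # width of codlRange


def shift_from_range(cod_range):
    """Shift number = count of zeros above the first '1' of the 9-bit range."""
    if not 0 < cod_range < (1 << RANGE_BITS):
        raise ValueError("cod_range must be a non-zero 9-bit value")
    return RANGE_BITS - cod_range.bit_length()


def shift_by_loop(cod_range):
    """Reference model: shift left until the range is no longer below 256."""
    shift = 0
    while cod_range < 256:
        cod_range <<= 1
        shift += 1
    return shift


print(shift_from_range(0b0_0010_1101))   # example from the text
print(shift_from_range(0b0_0000_0100))   # codlRange pattern 0_0000_01xx
```

Because the result depends only on the position of the first “1”, it maps naturally onto a priority encoder, which is why the shift can be resolved in one cycle instead of one cycle per shift.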

In conclusion, the proposed Case Efficiency Architecture improves the efficiency of the renormalization process, so that almost 70% of normal encoding operations take only one cycle.

Table 18 The relationship between shift and codlRange
