
CHAPTER 2 OVERVIEW OF CABAD FOR H.264/AVC

2.6 Paper survey for CABAD designs

This section introduces several CABAD designs published recently (2005 to 2007). They differ mainly in the design of the arithmetic decoder, because the arithmetic coder dominates the throughput of the whole CABAD system. The designs are summarized as follows.

1. For the CABAD design of [4] proposed by Yongseok Yi, In-Cheol Park, the initial design without optimization takes 7.43 clock cycles per bin. The optimization

per bin. However, the throughput of this design is not high, because it is a one-symbol architecture, and its context memory incurs a large hardware cost.

2. A high-performance CABAD design is proposed by J. W. Chen and Y. L. Lin [5]. The initial design without optimization decodes 0.44 bins per cycle. The design applies the following three parallel processing techniques.

(1) Parallelizing the tasks of decoding coefficients and getting neighboring data.

(2) The two-bin-per-cycle decoding method.

(3) Context table rearrangement method.

After adopting these methods, the throughput rises to 0.99 bins per cycle.

3. The CABAD design of [8] is proposed by Y. C. Yang, C. C. Lin, H. C. Chang et al. They adopt four techniques to improve the performance of CABAD: 1) two-symbol pipeline scheduling, 2) segmented context tables, 3) cache registers that store context memory values, and 4) look-ahead codeword parsing.

4. We also reference the multi-symbol architecture for the arithmetic encoder [6] proposed by Y. J. Chen, C. H. Tsai, and L. G. Chen. The one-symbol arithmetic coder is partitioned into four stages: Update State, Update Range, Update Low, and Output. The authors then extend the one-symbol arithmetic encoder architecture to an arbitrary number of symbols m.

5. A novel configurable architecture for the CABAC encoder [7] is proposed by Y. J. Chen, C. H. Tsai, and L. G. Chen. The traditional processing unit is divided into two parts, an MPS encoder and an LPS encoder. With different arrangements of these two basic components, they develop two types of ML-decomposed structures: 1) the ML cascade architecture and 2) the throughput-selection architecture. The ML cascade architecture exploits the complementary critical paths of the MPS and LPS coders, and the throughput-selection architecture offers more choices of ML cascades so that the one with the highest throughput can be selected.

Chapter 3 Multi-Symbol of Binary Arithmetic Decoder Engine

3.1 Overview of CABAD system

Figure 3-1 Block diagram of CABAD

Arithmetic coding is a recursive subdivision procedure. It contains two data dependencies, which result in intensive computation. The first is the interval, which is specified by range and offset: depending on whether the symbol is the Most Probable Symbol (MPS) or the Least Probable Symbol (LPS), the next interval is updated to one of two sub-intervals. The second is the adaptive probability state of the symbol's context: the probability table is updated according to the current symbol.

Figure 3-1 shows the system architecture of CABAD, which consists of three main modules: the binary arithmetic decoder (BAD), the binarization engine, and the context model. The decoding procedure is as follows. At the start of decoding, the context model is initialized by looking up the initial table. The BAD reads the bitstream to get the bin value. At the same time, it reads the current probability from the context model to find the sub-range of the MPS or LPS, and it updates the probability at the location given by the current context model index (ctxIdx). The bin string formed from several bin values is fed to the binarization engine, which then outputs the value of the syntax element. The address generator (AG) generates the address of the context model, as described in Section 2.5. Because of these strict data dependencies, the elementary operations can hardly be processed in parallel.

Figure 3-2 Elementary operations of CABAD

To execute multi-symbol CABAD, the BAD unit and the Context model should

context selection to BAD stage.

Figure 3-3 Overview of our architecture

Figure 3-3 gives an overview of our architecture. γbase(s) is a base context index generated by the AG stage, corresponding to the ctxIdxOffset defined in the standard [1]. The set of context data Cj(s) belonging to the same syntax element is read from the context memory according to γbase(s) and stored in a small storage called the context state register (CSR). After the context data is obtained, the BAD stage takes place. Our BAD stage contains three parts: context selection, the binary arithmetic decoding core, and the binarization engine. We select the needed context data (c1, c2, …) from Cj(s) according to binIdx (ctxIdxInc) and feed them to the BAD core. At the same time, each context entry must be updated. For example, if ctx1 and ctx2 are the same, the pState and valMPS of ctx2 should be replaced by the updated values of ctx1. In the BAD core, the symbol is decided by comparing the coding offset with the coding range. Renormalization then follows to keep the coding range and the coding offset at a fixed precision. Next, we send the bin string (b1, b2, b3) to the binarization engine to resolve the value of the syntax element.

In addition, only the updated values of the context data ci that correspond to valid bins are written back to the CSR. Finally, the data Cj(s) in the CSR is written back to the context memory. The BAD core is described in the next section, and the details of the context model in the next chapter.
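The load, select, update, and write-back flow of the CSR described above can be sketched in software as follows. This is only a minimal illustration under stated assumptions: the class name `ContextStateRegister`, its method names, and the representation of the context memory as a list of (pState, valMPS) pairs are hypothetical and are not taken from the hardware design.

```python
class ContextStateRegister:
    """Small buffer holding the context entries of one syntax element.

    Hypothetical software model; names and layout are illustrative only.
    """

    def __init__(self, context_memory):
        self.memory = context_memory   # list of (pState, valMPS) pairs
        self.base = 0
        self.entries = []

    def load(self, base, count):
        """Copy `count` entries starting at the base index into the CSR."""
        self.base = base
        self.entries = list(self.memory[base:base + count])

    def select(self, ctx_idx_inc):
        """Return the entry chosen by the increment relative to the base."""
        return self.entries[ctx_idx_inc]

    def update(self, ctx_idx_inc, p_state, val_mps):
        """Store an updated state in the CSR only, not in the memory."""
        self.entries[ctx_idx_inc] = (p_state, val_mps)

    def write_back(self):
        """Copy the whole set back to the context memory in one step."""
        self.memory[self.base:self.base + len(self.entries)] = self.entries


memory = [(10, 0), (20, 1), (30, 0), (40, 1)]
csr = ContextStateRegister(memory)
csr.load(base=1, count=2)      # start of a syntax element
csr.update(0, 21, 1)           # a decoded bin changed the first entry
csr.write_back()               # end of the syntax element
print(memory)
```

In this sketch the context memory is touched only in `load` and `write_back`, which mirrors the idea that intermediate updates stay inside the CSR while the bins of one syntax element are decoded.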

3.2 Statistics and analysis of syntax elements

Figure 3-4 Decoding flow at syntax element level

Figure 3-4 shows the state transitions at the syntax element level. H.264/AVC defines twenty-five syntax elements. Many syntax elements need only one bin to decode, such as significant_coeff_flag, last_significant_coeff_flag, end_of_slice_flag, coded_block_flag, and intra_pre_mode_flag; the first two of these account for around 40% of the bins. Others need multiple bins to carry their information, such as coeff_abs, rem_intra_pre_mode, mb_type, sub_mb_type, ref_idx, and mvd.

Table 3-1 Percentage of the bins at each syntax element

Table 3-2 Percentage of the cycle counts at each syntax element


Table 3-1 and Table 3-2 show the percentage of decoded bins and of cycle counts for the different syntax elements. "sig. & last_sig." and "coeff_abs" account for most of the decoded bins.

Therefore, the throughput enhancement is divided into two parts. The first is our multi-symbol architecture, which can decode multiple bins per cycle; it is described in the next section. However, the multi-symbol architecture does not improve the performance of one-bin syntax elements such as sig. & last_sig. Secondly, we rearrange our context

Table 3-3 Percentage of each concatenated symbol

Table 3-3 gives the average percentage of each symbol alignment. The statistics are obtained by simulating four CIF sequences (stefan, foreman, news, and mobile) with JM8.2, using 200 frames at QP16, QP28, and QP40. The percentage of concatenated M-symbols is clearly higher than that of the others, especially MMM among the 3-symbol cases and MM among the 2-symbol cases. Taking the 3-symbol case as an example, we divide the alignments into four groups ordered from most probable to least probable. The first group is MMM, which accounts for 44%. The second group consists of MML, MLM, and LMM, which account for 13% each. The last group is LLL, which accounts for 3%. It is therefore most efficient to improve the concatenated symbols MMM first.

1. MMM

2. MML, MLM, LMM

3. MLL, LML, LLM

4. LLL

3.3 Proposed multi-symbol architecture

In this section, we extend the one-symbol arithmetic decoder architecture to three symbols. The decoder has data dependencies in range and offset. Depending on whether the symbol is the Most Probable Symbol (MPS) or the Least Probable Symbol (LPS), the next interval is updated to one of two sub-intervals. The range and offset equations are as follows:

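MPS: $\mathrm{Range}_n = \mathrm{Range}_{n-1} - \mathrm{rLPS}_n$, $\quad \mathrm{Offset}_n = \mathrm{Offset}_{n-1}$

LPS: $\mathrm{Range}_n = \mathrm{rLPS}_n$, $\quad \mathrm{Offset}_n = \mathrm{Offset}_{n-1} - \mathrm{Range}_{n-1} + \mathrm{rLPS}_n$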

where n denotes the current symbol and rLPS is the estimated range when coding an LPS.
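The two update rules can be illustrated with a short sketch. It assumes the usual decision rule in which the symbol is the MPS when the offset is smaller than Range minus rLPS; the function name `decode_step` is hypothetical, rLPS is passed in directly instead of being looked up in the rangeLPS table, and renormalization and the probability state update are left out.

```python
def decode_step(range_prev, offset_prev, r_lps):
    """Return (symbol, range_n, offset_n) for one decision-mode bin.

    Sketch only: no renormalization, no context update, and r_lps is
    supplied by the caller rather than read from the rangeLPS table.
    """
    range_mps = range_prev - r_lps          # sub-interval assigned to the MPS
    if offset_prev < range_mps:             # offset lies in the MPS sub-interval
        return "MPS", range_mps, offset_prev
    # Otherwise the LPS sub-interval is selected.
    return "LPS", r_lps, offset_prev - range_prev + r_lps


print(decode_step(400, 100, 60))   # MPS branch: ('MPS', 340, 100)
print(decode_step(400, 350, 60))   # LPS branch: ('LPS', 60, 10)
```

The numbers in the two calls are arbitrary illustration values, not data from the simulations.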

3.3.1 One-symbol structure of BAD

Table 3-4 Results for range, offset and binVal at each mode

The basic binary arithmetic decoding core is shown in Figure 3-5 [11]. For hardware sharing, it combines three modes (decision, bypass, and terminal) into one architecture, and Table 3-4 shows the resulting range, offset, and bin value in each mode.

The shaded adder also acts as the comparator: it calculates the temporary variables OffsetX and RangeX and decides whether the symbol is the MPS or the LPS, which yields binVal. The rangeLPS table has 256 entries. Unfortunately, this large table lies on the critical path when decoding multiple symbols. To speed it up, we divide the table lookup into two parts, 64:1 and 4:1, as in [6], so that the larger part (64:1) can be pre-computed while other operations are in progress. Table 3-5 shows the dependency among Bin_flag, valMPS, and the bin value. The signal Bin_flag is the MSB of the result of subtracting RangeX from Offset. When Offset is less than RangeX, Bin_flag is set to 1, which means that the decoded symbol is the MPS. The decoded bin value is the XNOR of the two signals Bin_flag and valMPS, as listed in Table 3-5: it equals valMPS for an MPS and the complement of valMPS for an LPS.

Table 3-5 Dependency of symbol, Bin_flag, valMPS and bin value

Comparator          Bin_flag    Symbol
Offset >= RangeX    0           LPS
Offset < RangeX     1           MPS

Bin_flag    valMPS    Bin value
0           0         1
1           0         0
0           1         0
1           1         1
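The relation in Table 3-5 can be checked with a small sketch. The function names and the 10-bit subtractor width are assumptions made for illustration; the width is chosen only so that the sign bit of Offset minus RangeX is available for 9-bit operands.

```python
def bin_flag(offset, range_x, width=10):
    """MSB (sign bit) of the two's-complement result of offset - range_x.

    The 10-bit width is an assumption for 9-bit offset and range values.
    """
    diff = (offset - range_x) & ((1 << width) - 1)
    return diff >> (width - 1)


def bin_value(flag, val_mps):
    """Bin value as the XNOR of Bin_flag and valMPS, following Table 3-5."""
    return 1 - (flag ^ val_mps)


for flag in (0, 1):
    for val_mps in (0, 1):
        symbol = "MPS" if flag else "LPS"
        print(flag, val_mps, symbol, bin_value(flag, val_mps))
```

An MPS (Bin_flag = 1) returns valMPS itself, and an LPS (Bin_flag = 0) returns its complement, which is the behaviour listed in the table.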

3.3.2 Cascaded structure of multi-symbol BAD

The intuitive method for multi-symbol BAD is to cascade the one-symbol architecture shown in Figure 3-6. The next bin cannot be decoded until the comparator result of the current bin is available. The critical path of the one-symbol architecture contains two adders and the rangeLPS table, so extending it to three symbols gives a critical path of six adders and three rangeLPS tables, which is too long. The hardware cost is three times that of the one-symbol architecture. Figure 3-7 is a simplified drawing of the cascade architecture of the three-symbol BAD [5].

Figure 3-6 Simplified one-symbol BAD architecture and its simplified drawing

Figure 3-7 Cascade architecture of three-symbol BAD

3.3.3 Extending structure of multi-symbol BAD

Figure 3-8 Example of 2-symbol expanding BAD architecture

As shown in Figure 3-8, expanding the range and offset equations of the two-symbol BAD shortens the long critical path. The following equations give the resulting Range and Offset. LPS1 denotes that the first decoded symbol is an LPS, LPS2 denotes that the second decoded symbol is an LPS, and likewise for MPS1 and MPS2. LL_R denotes the resulting range when both concatenated symbols are LPS. The same applies to LL_O, where the last letter O denotes the resulting offset. And LM_R

While the comparator of the first symbol is working, the operation of the second symbol is performed at the same time. In addition, the rangeLPS table lookup can be computed in advance, because each expanded case already fixes whether the next decoded symbol is the MPS or the LPS. As a result, each symbol added to the extending architecture costs one adder delay and one rangeLPS table delay less than in the cascade architecture. The critical path of the two-symbol extending architecture is therefore three adders and one rangeLPS table.
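The idea of expanding the two-symbol equations can be sketched by applying the one-symbol range and offset equations of Section 3.3 twice and simplifying. The code below is an illustration under stated assumptions, not the expanded equations of the hardware: renormalization is ignored, the function names are hypothetical, and the second-symbol rLPS values (`r2_after_m`, `r2_after_l`) are passed in for both possible outcomes of the first symbol instead of being looked up.

```python
import random


def one_symbol(rng, off, r_lps):
    """One-symbol update without renormalization."""
    r_mps = rng - r_lps
    if off < r_mps:
        return "M", r_mps, off
    return "L", r_lps, off - rng + r_lps


def cascade_two(rng, off, r1, r2_after_m, r2_after_l):
    """Second symbol waits for the result of the first one."""
    s1, rng1, off1 = one_symbol(rng, off, r1)
    r2 = r2_after_m if s1 == "M" else r2_after_l
    s2, rng2, off2 = one_symbol(rng1, off1, r2)
    return s1 + s2, rng2, off2


def expanded_two(rng, off, r1, r2_after_m, r2_after_l):
    """All four (range, offset) candidates depend only on the initial values."""
    candidates = {
        "MM": (rng - r1 - r2_after_m, off),
        "ML": (r2_after_m, off - rng + r1 + r2_after_m),
        "LM": (r1 - r2_after_l, off - rng + r1),
        "LL": (r2_after_l, off - rng + r2_after_l),
    }
    first = "M" if off < rng - r1 else "L"
    if first == "M":
        second = "M" if off < rng - r1 - r2_after_m else "L"
    else:
        second = "M" if off - rng + r1 < r1 - r2_after_l else "L"
    key = first + second
    return (key, *candidates[key])


# Compare both forms on arbitrary illustration values.
random.seed(0)
for _ in range(1000):
    args = [random.randint(1, 510) for _ in range(5)]
    assert cascade_two(*args) == expanded_two(*args)
print(expanded_two(400, 100, 60, 50, 20))
```

Because every candidate is written in terms of the initial range and offset, the four results can be computed side by side and the comparators only select one of them, which is why the expanded form shortens the critical path at the price of duplicated hardware.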

It is easy to extend this to a three-symbol architecture, whose critical path is four adders and one rangeLPS table. Compared with the cascade three-symbol BAD architecture, this shortens the critical path by two adders and two rangeLPS tables. However, the hardware cost of the extending three-symbol architecture is seven times that of the one-symbol architecture, which is too large. Next, we propose an efficient method that reduces both the critical path and the cost.

3.3.4 M-cascade of multi-symbol architecture

Table 3-6 Critical path of the adders in three-symbol extending BAD

Concatenated symbols      MMM    MML    MLM    MLL
Number of adder delays    3      4      3      4

Concatenated symbols      LLL    LLM    LMM    LML
Number of adder delays    4      3      2      3

Table 3-6 shows the number of adders on the critical path when decoding each case of concatenated symbols.

The critical path of the cascade three-symbol architecture is too long, and the hardware cost of the extending three-symbol architecture is too large. After weighing Table 3-6, Table 3-3, and the hardware design against each other, we choose to decode only MMM and MML among the three-symbol cases, which allows hardware sharing (lower cost) and enhances the throughput efficiently. This speeds up 57% of the decoded bins (44% for MMM plus 13% for MML in Table 3-3) while minimizing the hardware cost and the critical path.

Figure 3-9 Organization of the multi-symbol BAD

Table 3-7 Cases of multi-symbol that our architecture can decode

Number of symbols    Case
1-symbol             L, M
2-symbol             ML, MM
3-symbol             MML, MMM

We propose our M-cascade multi-symbol BAD architecture in Figure 3-9. The architecture can decode up to three concatenated symbols in either decision mode or bypass mode, and it handles only the symbol alignments L, M, ML, MM, MML, and MMM, as shown in Table 3-7. The architecture decodes the next symbol only when the prior symbol is an M-symbol, so we call it the M-cascade architecture. For example, to decode the symbol stream MLLMMM, our architecture decodes ML first, L in the next cycle, and MMM last. It therefore does not always decode three symbols per cycle.

This design has two consequences. First, the other symbol alignments (MLM, MLL, LMM, LML, LLM, and LLL) are not processed in parallel in our design. They must be split into a one-symbol and a two-symbol decode, or into three one-symbol decodes. Because we focus on improving the most frequent cases, we choose to decode MMM and MML. Second, the binarization engine must judge which bins of the three-bin string are valid. The signals binvalidx_BAD (x = 1, 2, 3) indicate whether each decoded bin value is valid under our restricted architecture (which decodes only L, M, ML, MM, MML, and MMM). Table 3-8 shows their relation. We then send the needed signals to the binarization engine. If the first n bins are valid, the n-th results of codIOffset and codIRange are selected by the binarization engine and passed to the next BAD operation.
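The grouping rule of the M-cascade can be sketched as a small function that splits a stream of M and L symbols into per-cycle groups. This is a simplified model under stated assumptions: the function name `m_cascade_groups` is hypothetical, and syntax element boundaries, mode changes, and pipeline stalls are ignored.

```python
def m_cascade_groups(symbols, max_symbols=3):
    """Split a string of 'M'/'L' symbols into the groups decoded per cycle.

    A group keeps growing only while the prior symbol is 'M' and fewer
    than max_symbols symbols have been taken, so it ends after an 'L'.
    """
    groups = []
    i = 0
    while i < len(symbols):
        group = ""
        while i < len(symbols) and len(group) < max_symbols:
            group += symbols[i]
            i += 1
            if group[-1] == "L":      # an L-symbol always closes the group
                break
        groups.append(group)
    return groups


print(m_cascade_groups("MLLMMM"))   # ['ML', 'L', 'MMM']
```

For the stream MLLMMM the sketch reproduces the three cycles ML, L, and MMM given in the text, and every group it produces is one of L, M, ML, MM, MML, or MMM.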

In Figure 3-9, the bitstream buffer is fed to ①, ②, ③, ④, ⑤ and the renormalization unit. ①, ③, and ⑤ belong to the bypass decoding process. ② is the renormalization after decoding “M”, and ④ is the renormalization after decoding “MM”.

Table 3-8 Truth table of binvalid?_BAD related to our BAD architecture

INPUT                      OUTPUT
MSB_1    MSB_2    MSB_3    binvalid1_BAD    binvalid2_BAD    binvalid3_BAD
0        ?        ?        1                0                0

3.4 Pipeline organization

The most effective way to enhance performance is pipelining. In decision mode, conventional processing without pipelining takes 4 cycles to decode one bin. The bypass mode and the terminal mode do not need probability data, so they skip the context memory stages and take one cycle to decode one bin, as shown in Figure 3-3. In this section, we use these stages to schedule the pipeline organization, and we also describe some restrictions of our design.

Figure 3-10 Timing diagram of the pipeline comparison

Figure 3-10 shows the timing diagram of the pipeline comparison for decision mode. It is almost the same as that of [4], except that we move the context selection (CS) operation to the BAD stage.

In the conventional scheme, which has no CSR, the context address must be computed for every symbol, and the context data (pState and valMPS) must be loaded for the next stage. In our design, we load a set of context data into the CSR at the beginning of a syntax element and write it back to the context memory at the end of the syntax element. We therefore read and write the context memory only once per syntax element (except for the two syntax elements significant_coeff_flag and last_significant_coeff_flag), whereas the conventional scheme reads and writes the context memory as many times as there are decoded bins in that syntax element. The conventional scheme produces one bin every 4 cycles on average, while the scheme with pipelining and the CSR produces 1 to 3 bins every cycle. Compared with the conventional organization, the proposed pipelined design saves a large number of processing cycles. Next, we show timing diagrams of some restrictions and situations that result from our multi-symbol BAD unit.

Figure 3-11 Timing diagram of the restrictions of our architecture

Our architecture does not support the concatenated symbols MLM, but it does support ML. It therefore judges the symbol M at binIdx = 5 to be invalid and forwards the values of binIdx = 4. Although the comparator decides that the symbol at binIdx = 5 is M, which is indeed correct, the resulting codIRange and codIOffset would be wrong, and the subsequent decoding would go wrong as well. We therefore add some logic to judge this case.

Figure 3-11(b) shows the timing when the syntax element changes. When a new syntax element is to be decoded, the pipeline stalls for two cycles to update and load the set of context data: the CSR (Figure 3-3) writes the context data of the prior syntax element back to the context memory and then loads the data of the current syntax element. When the correct output ML is decoded and sent to the binarization engine, the binarization engine judges that the syntax element has ended. We then write the CSR back to the context memory, and in the next cycle we load the context data of the new syntax element into the CSR. This wastes two cycles and is the bottleneck of our architecture.

When decoding the syntax elements MVD and coeff_abs, a bin may be decoded in either bypass mode or decision mode. This part describes how our architecture schedules the change from decision mode to bypass mode. When the number of decoded bins in these two syntax elements exceeds the value boundary (£), the following bins are decoded in bypass mode. The value boundary (£) is set to 8 for MVD and 13 for coeff_abs, as shown in Table 3-9.

Table 3-9 Parameters of the change from decision mode to bypass mode

Syntax element    Value boundary    Additional bins to decode in bypass mode
MVD               8                 n+2
coeff_abs         13                n

In the syntax element MVD, if the number of bins decoded until a bin value of 0 occurs exceeds 8 by n, then n+2 more bins are decoded in bypass mode. The same applies to the syntax element coeff_abs: if the number of bins decoded until a bin value of 0 occurs exceeds 13 by n, then n more bins are decoded in bypass mode. In both cases, the last bypass bin is the sign bin. If fewer than 8 bins (MVD) or 13 bins (coeff_abs) are decoded before a bin value of 0 occurs, one more bin is decoded in bypass mode as the sign bin. The change to bypass mode always happens in the next cycle, whether or not our architecture supports the concatenated symbols involved. Figure 3-12 is an example for the syntax element MVD.

syntax element MVD. When binIdx = 6 decodes a bin value of 0, one more bin is decoded at binIdx = 7 as the sign bin in the next cycle, and the processing of this syntax element finishes. Figure 3-12(b) shows the case in which more than 8 bins are decoded before the bin value is 0. When binIdx = 10 decodes a bin value of 0, we know that the result of binIdx = 11 is wrong, even though our architecture supports the concatenated symbols MML. In addition, 5 more bins must be decoded in bypass mode in the following cycles.

Chapter 4 Structure of Context Model

4.1 Overview of the context model

The values of the context model depending on the context index (ctxIdx) offer the
