Chapter 3 Algorithm of Memory-based VLC Decoding
3.4 Symbol Memory Allocation
Generally, the symbols are stored in the symbol memory, and the symbol lengths differ from table to table. If only one symbol memory is used, its word length must equal the length of the longest symbol among all tables. This allocation wastes space for shorter symbols. In [2], several symbol memories with different word lengths are used in order to save space. The symbol length is
That is, the symbols in Coeff_Token and Total_Zeros/Run_Before can be concatenated into 11-bit words and stored in the 256 × 11 symbol memory. The overhead consists of a mask and a multiplexer that choose the symbol format according to the standard and the decoding table, as shown in Fig. 22. In addition, the start positions of the tables are stored in a small register file. When CAVLC is used, we select the most significant 7 bits of the memory output for Coeff_Token symbols or the least significant 4 bits for Total_Zeros/Run_Before symbols; for MPEG-2, the whole word is assigned to the symbol.
Fig. 22 256-entry symbol memory allocation in VLC decoder for MPEG-2 and H.264 standards.
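The field selection described above can be sketched in a few lines. This is only an illustration of the bit slicing: the function name `select_symbol`, the mode strings and the example word are assumptions, not part of the design.

```python
def select_symbol(word: int, mode: str) -> int:
    """Pick the symbol field out of one 11-bit symbol-memory word."""
    word &= 0x7FF                 # keep only the 11 stored bits
    if mode == "coeff_token":
        return word >> 4          # most significant 7 bits
    if mode == "tz_rb":
        return word & 0xF         # least significant 4 bits
    if mode == "mpeg2":
        return word               # the whole 11-bit word
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical memory output: 7-bit field 1010101 followed by 4-bit field 1100.
word = (0b1010101 << 4) | 0b1100
print(select_symbol(word, "coeff_token"))  # 85
print(select_symbol(word, "tz_rb"))        # 12
```

In hardware the same choice is made by the mask and multiplexer; the sketch only shows which bits each mode reads.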
Chapter 4
Error Resynchronization
4.1 Concept of Error Resynchronization
Among the H.264/AVC error resilience tools, the most important feature is slicing. A slice consists of a sequence of macroblocks (Fig. 23), and intra-prediction within one slice does not refer to data belonging to other slices of the same frame. If one slice is corrupted during transmission, the error does not propagate to other slice regions, so the corrupted data is confined to that slice. Insertion of refreshment frames, slices or macroblocks is a method to stop error propagation in the temporal domain. In addition, [17] mentioned that inserting additional markers into the bitstream can achieve VLC resynchronization. With these three methods, error propagation can be stopped within a certain region and the following bitstream can be decoded correctly. That is, error resynchronization is achieved. However, these methods add bitstream overhead and increase the required data bandwidth.
Fig. 23 Division of a picture into several slices.
performance. To accomplish error resynchronization, the position of the next decoding unit must be known so that the error is confined to the current decoding unit. In the following sections, we focus on I-frame error resynchronization, because I-frames are more important than P-frames: an I-frame is the reference of the P-frames.
4.2 Proposed Scheme for Error Resynchronization
The smallest decoding unit in H.264/AVC is the 4x4 block, and the position of a block can only be known from an end-of-block symbol. Unlike the previous video standards, CAVLC has no end-of-block symbol. Taking MPEG-2 as an example, TB-14 and TB-15 define the end-of-block symbol with the codewords “10” and “0110”, respectively. As a result, we can try to find the block boundary by searching for “10” or “0110” in the bitstream. A closer look at the tables shows, however, that these two codewords can mispredict block boundaries, because other codewords also contain “10” or “0110”. For instance, the codewords of (run, level) = (4, 1), (7, 1) and (2, 2) contain “10” in TB-14; the codewords of (run, level) = (4, 1), (1, 2) and (16, 1) contain “0110” in TB-15.
In the conventional CAVLC decoding process, a block with non-zero coefficients is decoded in the following steps: 1. decoding of total coefficients and trailing ones, 2. decoding of the signs of the trailing ones, 3. decoding of levels, 4. decoding of total zeros, 5. decoding of the run before every non-zero coefficient. The last two steps concern total zeros and runs. Therefore, we can say that the end of a block consists of these two syntax elements. In the bitstream, all combinations of codewords from these two tables that obey the rules of CAVLC decoding can be viewed as EOBs. Since the constructed EOBs have many possibilities, we choose those that are not a segment of a combination of codewords from the other tables. Once the EOBs are known, they can be used to predict the block boundaries, and thus the error can be confined to the current block. The decoding of the next block is then resynchronized and correct. The EOB format is shown in Fig. 24.
Fig. 24 The format of EOB
4.3 EOB Construction
To enumerate all series of codewords formed from the Total_Zeros (T.Z.) and Run_Before (R.B.) tables, we must know all possible distributions of coefficients in one 4x4 block. For different distributions of non-zero coefficients, the combinations of the T.Z. codeword and the R.B. codewords are different. The total number of combinations can be computed by equation (1):
\[ \text{Total} = \sum_{k=1}^{16} C^{16}_{k}, \quad \text{where } k \text{ is the number of coefficients in 4x4 block} \tag{1} \]
Therefore, there are totally 65535 kinds of EOB and histogram of the distribution is shown in Fig. 25.
Fig. 25 Distribution of EOB number at different numbers of coefficients
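The count in equation (1) is easy to check numerically. The short sketch below evaluates each binomial term and the sum; the variable names are illustrative only.

```python
from math import comb

# C(16, k) for each possible number of coefficients k in a 4x4 block.
per_k = {k: comb(16, k) for k in range(1, 17)}
total = sum(per_k.values())

print(total)  # 65535, i.e. 2**16 - 1
```

The per-k values in `per_k` are the binomial terms of equation (1), with the largest term at k = 8.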
The next step identifies the EOBs that are not segments of combinations of other codewords. Each EOB is treated as a sliding window over the whole bitstream, and only those EOBs that occur exclusively at the exact end-of-block positions are collected. However, the total number of EOBs is large and the total number of bits in the bitstream is much larger, thus the
short EOBs because short EOBs are more easily found in other places in the bitstream.
With the EOB positions known, the reduced EOB set is checked within a certain range to determine whether each EOB is unique within that range. As Fig. 26 shows, the EOB is treated as a sliding window, and the correlation between the EOB and the bitstream is checked. The range is limited because, in real cases, an EOB does not need to be globally unique; only local uniqueness is checked.
Fig. 26 Checking whether the EOB is unique within the given range or not
After all EOBs and the corresponding bitstream are checked, the EOB set is reduced because the EOBs that occurred multiple times are removed, as shown in Fig. 27. The remaining EOBs are defined as the intra-set. To ensure that the EOBs also survive in other frames, their uniqueness is checked in other frames as well. The further reduced set is called the inter-set, as shown in Fig. 28. After inter-checking, the remaining set serves as the error resynchronization information and is stored in the memory.
Fig. 27 Reduction phase of intra-checking. The number of EOBs generated from each frame is reduced.
Fig. 28 Reduction phase of inter-checking. After this reduction, the remaining EOBs form the EOB library stored in memory.
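The local uniqueness check and the intra-checking reduction described above might look as follows. This is a minimal sketch assuming the bitstream is held as a string of '0'/'1' characters and that the true EOB positions are already known; the function names and the toy data are hypothetical.

```python
def is_locally_unique(bitstream, eob, pos, search_range):
    """True if `eob` sits at `pos` and nowhere else within +/- search_range bits."""
    n = len(eob)
    if bitstream[pos:pos + n] != eob:
        return False
    lo = max(0, pos - search_range)
    hi = min(len(bitstream) - n, pos + search_range)
    # Slide the EOB window over the range and look for any other match.
    return all(bitstream[s:s + n] != eob
               for s in range(lo, hi + 1) if s != pos)


def intra_check(bitstream, eob_positions, search_range):
    """Keep only EOB patterns whose every known occurrence is locally unique."""
    rejected = {eob for eob, pos in eob_positions
                if not is_locally_unique(bitstream, eob, pos, search_range)}
    return {eob for eob, _ in eob_positions} - rejected


bits = "0001011000101"                      # toy bitstream
candidates = [("1011", 3), ("101", 10)]     # (EOB pattern, true position)
print(intra_check(bits, candidates, search_range=8))  # {'1011'}
```

Inter-checking could reuse the same window test on the bitstreams of other frames; that step is omitted here.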
4.4 EOB Storage Using Group-based Scheme
We can treat all EOBs stored in memory as the codewords of a virtual table.
From the group-based algorithm we know that the symbol of the decoded codeword
1. Do group searching
→ PCLC_mincode1(29’b0000_0000_0101_1111_0000_0000_0000_0) < bitstream_num < PCLC_mincode2(29’b0000_0000_0110_1011_0000_0000_0000_0)
→ The matching group: G1
2. Send group information
→ code length = 15-bit, PCLC_mincode = 29’b0000_0000_0110_0100_0000_0000_0000_0, base_addr(6-bit) = 6’b1000_00.
3. Find the valid VLC_codeoffset, which is the code length most significant bits of the result of subtracting the PCLC_mincode from the bitstream_num
→ Bitstream_num(29’b0000_0000_0110_0110_0000_0000_0000_0) – PCLC_mincode(29’b0000_0000_0110_0100_0000_0000_0000_0) = 29’b0000_0000_0000_0010_0000_0000_0000_0.
→ The valid VLC_codeoffset = 15’b0000_0000_0000_001 = 1.
4. Extract the VLC_codeoffset operand, which has the same word length as the symbol address
→ VLC_codeoffset = 6’b0000_01 = 1.
5. Calculate the decoded symbol address
→ symbol_addr = base_addr(6’b1000_00) + VLC_codeoffset(6’b0000_01) = 6’b1000_01 = 33.
6. Fetch the decoded symbol
→ sym_memory[33] = S33.
As a consequence, we can store a flag as S33 at address 33. The flag indicates a hit or a miss for the EOB checking.
PCLC                                    Length-1   Address
---G0---
0000_0000_0110_1111_0000_0000_0000_0    15         [ 40 ]
0000_0000_0110_1100_0000_0000_0000_0    15         [ 37 ]
0000_0000_0110_1011_0000_0000_0000_0    15         [ 36 ]
---G1---
0000_0000_0110_1010_0000_0000_0000_0    14         [ 35 ]
0000_0000_0110_1000_0000_0000_0000_0    14         [ 34 ]
0000_0000_0110_0110_0000_0000_0000_0    14         [ 33 ]
0000_0000_0110_0100_0000_0000_0000_0    14         [ 32 ]
---G2---
0000_0000_0101_1111_0000_0000_0000_0    15         [ 31 ]
---G3---
0000_0000_0101_1110_0000_0000_0000_0    14         [ 30 ]
0000_0000_0101_1100_0000_0000_0000_0    14         [ 29 ]
---G4---
0000_0000_0011_1110_0000_0000_0000_0    14         [ 28 ]
0000_0000_0011_1100_0000_0000_0000_0    14         [ 27 ]
0000_0000_0011_1010_0000_0000_0000_0    14         [ 26 ]
0000_0000_0011_1000_0000_0000_0000_0    14         [ 25 ]
0000_0000_0011_0110_0000_0000_0000_0    14         [ 24 ]
0000_0000_0011_0100_0000_0000_0000_0    14         [ 23 ]
0000_0000_0011_0000_0000_0000_0000_0    14         [ 21 ]
0000_0000_0010_1110_0000_0000_0000_0    14         [ 20 ]
0000_0000_0010_1100_0000_0000_0000_0    14         [ 19 ]
Fig. 29 A portion of groups of EOBs
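The six decoding steps can be traced with a short sketch. The group list below is read off groups G0 to G3 of Fig. 29 (G4 is left out because only a portion of it is shown). The search rule used here, picking the group with the largest PCLC_mincode not exceeding bitstream_num, is an assumption about how group searching resolves, and all names are illustrative.

```python
WORD_BITS = 29   # width of bitstream_num and PCLC_mincode in the example
ADDR_BITS = 6    # width of the symbol address in the example


def b(text):
    """Parse a binary literal written with '_' separators."""
    return int(text.replace("_", ""), 2)


# (group, PCLC_mincode, code length, base_addr), sorted by descending mincode.
GROUPS = [
    ("G0", b("0000_0000_0110_1011_0000_0000_0000_0"), 16, 36),
    ("G1", b("0000_0000_0110_0100_0000_0000_0000_0"), 15, 32),
    ("G2", b("0000_0000_0101_1111_0000_0000_0000_0"), 16, 31),
    ("G3", b("0000_0000_0101_1100_0000_0000_0000_0"), 15, 29),
]


def decode_address(bitstream_num):
    """Return (group, code length, symbol address), or None if no group matches."""
    for name, mincode, length, base_addr in GROUPS:
        if bitstream_num >= mincode:                       # step 1: group search
            diff = bitstream_num - mincode                 # step 3: subtract
            offset = diff >> (WORD_BITS - length)          # keep `length` MSBs
            offset &= (1 << ADDR_BITS) - 1                 # step 4: 6-bit operand
            return name, length, base_addr + offset        # step 5: add base
    return None


print(decode_address(b("0000_0000_0110_0110_0000_0000_0000_0")))  # ('G1', 15, 33)
```

The returned address would then index the symbol memory (step 6), where the stored flag reports an EOB hit or miss.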
4.5 Joint with Channel and VLC Source Decoder
The block diagram is shown in Fig. 30. In this error resynchronization scheme, we assume that channel information is available. The system includes the channel side and the source side. The source decoder can use the bit reliability which is the
actual bit position. Besides, the method mainly targets random errors.
[Block diagram: channel side (Channel, FEC Decoder) and source side (VLC Decoder, Video Decoder); signals: Bit Reliability, Error Bitstream]
Fig. 30 Block diagram of the decoder in the wireless transmission environment.
Chapter 5
Simulation Result
In the proposed algorithm, block boundary prediction by EOBs is probabilistic, and the EOBs are themselves variable-length codes; therefore, the EOB length affects the prediction probability. On the one hand, longer EOB codewords have a lower occurrence probability. On the other hand, longer EOB codewords are less likely to be removed during checking and are more likely to remain in the final EOB library.
The probability of EOBs that meet the length constraints is shown in Fig. 31, and it is calculated by equation (2):
\[ \text{occurrence probability} = \frac{\#\text{ of EOBs (length} > L)}{\text{Total } \#\text{ of EOBs}} \tag{2} \]
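As a small illustration of equation (2), the ratio can be computed from a list of EOB lengths. The function name and the sample lengths below are made up for the example and are not simulation data.

```python
def occurrence_probability(eob_lengths, min_len):
    """Fraction of EOBs whose length is greater than min_len (equation (2))."""
    if not eob_lengths:
        return 0.0
    longer = sum(1 for n in eob_lengths if n > min_len)
    return longer / len(eob_lengths)


lengths = [6, 9, 12, 15]                     # hypothetical EOB lengths in bits
print(occurrence_probability(lengths, 8))    # 0.75
print(occurrence_probability(lengths, 14))   # 0.25
```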
All test frames are in QCIF format: the first 100 frames are the first 100 frames of the akiyo sequence, frames 200~299 are the first 100 frames of the foreman sequence, and the last 100 frames are the last 100 frames of the foreman sequence. Fig. 32 shows the probability of correctly found EOBs under three different length constraints. The simulation shows that the probabilities are very close when the minimum length is 8 and 10 bits, while the probability for EOBs with L > 14 is lower.
Fig. 31 Probability of EOBs with length constraints.
Fig. 32 Probability of EOBs that are correctly found in the bitstream.
Table 6 and Table 7 show the memory space for different sets of EOBs. Different grouping strategies lead to quite different memory requirements for the group information and the symbols. If we separate the EOBs into groups by the position of the leading one (prefix), the number of groups is small while the maximum symbol address is large. In contrast, if we separate the EOBs into groups by both prefix and length, the number of groups is large while the maximum symbol address is much smaller than in the previous case.
Table 6 Estimation of memory usage for different sets of EOBs. Grouping by prefix only.
Length                      8               10              14
number of groups            17              17              17
max. symbol address         160975          282645          639418
PCLC                        29              29              29
symbol memory               1024*256        2048*256        8192*128
base address                10              11              13
group information memory    17*(29+5+10)    17*(29+5+11)    17*(29+4+13)
Total                       262892          525053          1049358
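The totals in Table 6 follow from adding the symbol memory bits to the group information bits. The sketch below reproduces that arithmetic; reading the middle term of the group information entry as the bit width of the stored code length is an assumption, and the function and variable names are illustrative.

```python
def total_bits(words, word_bits, groups, pclc_bits, length_bits, base_bits):
    """Symbol memory bits plus group information memory bits."""
    symbol_memory = words * word_bits
    group_information = groups * (pclc_bits + length_bits + base_bits)
    return symbol_memory + group_information


# Length constraint -> (words, word bits, groups, PCLC, length field, base address)
configs = {
    8:  (1024, 256, 17, 29, 5, 10),
    10: (2048, 256, 17, 29, 5, 11),
    14: (8192, 128, 17, 29, 4, 13),
}
for length, cfg in configs.items():
    print(length, total_bits(*cfg))
# 8 262892
# 10 525053
# 14 1049358
```

In all three cases the symbol memory dominates the total, which is why the grouping strategy matters mainly through the maximum symbol address.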
Table 7 Estimation of memory usage for different sets of EOBs.
Grouping by prefix and length
Length
memory(bits) 1024*32 1024*32 1024*32
base address(bits) 10 10 10
group information
Chapter 6
Hardware Architecture and Implementation
6.1 Overview of Hardware Architecture
Fig. 33 shows the block diagram of the proposed dual-mode memory-based VLC decoder. There are five main components: 1) Controller, 2) Input Shift Buffer, 3) Memory-based VLC Decoder, 4) Coefficient Buffer and 5) Level Decoder. The controller assigns the control signals for each syntax element according to nC, maxnumcoeff and the enable signal from the syntax parser. The controller is implemented as a finite state machine (FSM). The memory-based VLC decoder supports CAVLC and MPEG-2 coefficient decoding. Several control signals are needed to control the internal memory, and these control signals come directly from the chip I/O ports so that the content can be loaded into memory for different video standards. The level decoder is mainly composed of the level prefix decoder, the level suffix decoder and the suffix length.
The level prefix decoder consists of a leading-one detector, and the level suffix is decoded by getting bits from the bitstream buffer according to the level suffix size. The coefficient buffer consists of a 4x4 register array and is controlled by the run index and level index so that the decoded runs and levels can be put into the buffer in order. The input bitstream buffer is integrated into a higher-level module of the H.264/MPEG2 video decoder proposed in [18]. The design itself is likewise integrated into the H.264/MPEG2 video decoder of [18].
Fig. 33 Block diagram: controller, bitstream buffer, memory-based VLC decoder, level decoder, coefficient buffer
6.2 Memory-based VLC Decoder
Following the modified MTM algorithm, the VLC decoder has two categories of storage, for group information and for table information. They are selected by the table index. The information of each group is extracted to compute an offset. The final offset is determined by the enable signals from the group detectors. The tri-state buffer can be viewed as the gate for each group. One group has three data items: offset, base address and (length-1).
In each decoding pass, only one set of data among all groups is passed by the tri-state buffers while the others are left floating. After the set of data is passed, (length-1) is returned to the bitstream shift buffer in the other module, and the base address is added to the offset to compute the symbol address. Finally, the symbol memory is accessed to output the decoded symbol.
Fig. 34 Total Block Diagram of memory-based VLC decoder
The stage partition of the memory-based VLC decoder is briefly shown in Fig. 35, together with the cycle timing of the important signals. The register file generated by the memory compiler needs one cycle to read data, so the access to the group information and table information memory is regarded as the first stage. After the first stage, the symbol address is known, so the access to the symbol memory is the second stage.
The group information and table information memory can be accessed once the table used for decoding is known. An example of the cycle timing is shown in Fig. 35. The first cycle covers the memory read and the address computation, and the decoded symbol is output in the second cycle. The in_valid signal indicates that the input bitstream and the table index are valid. The symbol is not valid until the second cycle. In the third cycle, the next valid data is sent after the controller has received the Symbol_valid signal.
[Stage and timing diagram: first memory stage (group and table information, address calculation), second memory stage (symbol); signals: CLK, table index, bitstream]
Fig. 35 Stage and timing diagram
Run_Before consumes many cycles per block, especially in low-QP video, and most codewords of the Run_Before tables are short. Therefore, we assume a small cache for short codewords that stores the Run_Before information, such as the codeword length, the symbol and the table index.
Fig. 36 shows the block diagram of the memory-based VLC decoder with cache. Only some important signals are annotated for simplicity. The controller sends the read_en and write_en signals to activate reading or writing of the cache. If a cache hit occurs, the memory-based VLC decoder is disabled and the decoded symbol and length come from the cache. If a cache miss occurs, the conventional VLC decoder is enabled and the symbol and length are stored in the cache. A cache lookup takes one cycle, while decoding with the memory-based VLC decoder needs three cycles. As a result, adding the cache is a good way to improve the throughput of the whole decoding process.
Fig. 36 Small Cache for Run (short codewords)
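The hit and miss behaviour of the short-codeword cache can be modelled in software as follows. This is a behavioural sketch only: the toy prefix-free table, the `slow_decode` stand-in for the memory-based VLC decoder, the class name and the 3-bit length limit are all assumptions made for illustration.

```python
# Hypothetical prefix-free toy table: table index -> {codeword: symbol}.
TOY_TABLES = {2: {"1": 0, "01": 1, "00": 2}}


def slow_decode(table_idx, bits):
    """Stand-in for the memory-based VLC decoder; returns (symbol, length)."""
    for codeword, symbol in TOY_TABLES[table_idx].items():
        if bits.startswith(codeword):
            return symbol, len(codeword)
    raise ValueError("no codeword matches")


class ShortCodewordCache:
    def __init__(self, max_len=3):
        self.max_len = max_len    # only codewords up to this length are cached
        self.entries = {}         # (table index, codeword) -> (symbol, length)
        self.cycles = 0

    def decode(self, table_idx, bits):
        # Cache hit: one cycle, the memory-based decoder stays disabled.
        for n in range(1, min(self.max_len, len(bits)) + 1):
            hit = self.entries.get((table_idx, bits[:n]))
            if hit is not None:
                self.cycles += 1
                return hit
        # Cache miss: three cycles, then store the result without replacement.
        symbol, length = slow_decode(table_idx, bits)
        self.cycles += 3
        if length <= self.max_len:
            self.entries[(table_idx, bits[:length])] = (symbol, length)
        return symbol, length


cache = ShortCodewordCache()
first = cache.decode(2, "0110")    # miss
second = cache.decode(2, "0100")   # hit on codeword "01"
print(first, second, cache.cycles)  # (1, 2) (1, 2) 4
```

Because the codewords are prefix-free, at most one cached prefix of the incoming bits can match, so the lookup order does not affect the result.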
In addition to the cache, another method is proposed to improve the throughput further. In Fig. 35, the table index is known only after the FSM enters the Coeff_Token and Total_Zeros states. From the conditions of table transition, however, the next table to be used for decoding can be known earlier than these two states. As a consequence, the group information and table information can be accessed as soon as the condition is known, and the conventional three-cycle stage is reduced to a two-cycle stage, as shown in Fig. 37. In this figure, the current Information Memory_out is known before in_valid goes high or in the same cycle in which in_valid goes high. Therefore, the pre-fetching mechanism can decode one symbol every two cycles. This method further improves the throughput compared to the original design in this thesis.
[Timing diagram over CLK cycles 1 to 9; signals: in_valid, table index, Bitstream, Information Memory_out, Symbol Memory_out, Symbol, Length-1, Symbol_valid; inputs In 1, In 3, In 5, In 7 with the corresponding address, symbol and length values]
Fig. 37 Cycle time of the pre-fetching method
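The effect of going from three cycles to two cycles per symbol can be put into numbers with a one-line model. It is a rough upper bound that ignores every other part of the decoding flow; the function name is an assumption, and the 100 MHz clock is the synthesis frequency quoted in this thesis.

```python
def symbols_per_second(freq_hz: int, cycles_per_symbol: int) -> int:
    """Upper bound on decoded symbols per second at a given clock frequency."""
    return freq_hz // cycles_per_symbol


clock = 100_000_000                       # 100 MHz
print(symbols_per_second(clock, 3))       # 33333333 without pre-fetching
print(symbols_per_second(clock, 2))       # 50000000 with pre-fetching
```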
6.4 Implementation Result
In hardware implementation, the group information for MTM_PCLC and
Table 8 Memory space for VLC decoder
Storage content Number of bits Physical space Table information
The gate count and power consumption of the designs are shown in Table 9. According to this table, the gate counts of these designs are similar. However, the power consumption of the memory-based VLC decoder with cache only is lower than that of the memory-based VLC decoder without cache. This shows that a cache storing frequent codewords without replacement can reduce power and improve throughput. The throughput for the foreman and mobile sequences under different QPs is shown in Fig. 38 and Fig. 39, respectively. From these two figures, the proposed design can achieve HD 720p at 100MHz even for very low QPs. The design can also meet the requirement of HD1080p when the operating frequency is 200MHz, as shown in Fig. 40 and Fig. 41.
Table 9 Gate count and power of different designs.
Design                                  Gate Count   Power Consumption (mW)
Memory-based VLC decoder                15.4 k       1.132
Memory-based VLC decoder + cache        17.2 k       1.078
Memory-based VLC decoder + pre-fetch    17.2 k       1.017
Fig. 38 Throughput of decoding for Foreman sequence under 100MHz.
Fig. 39 Throughput of decoding for Mobile sequence under 100MHz.
Fig. 40 Throughput of decoding for Foreman sequence under 200MHz.
Fig. 41 Throughput of decoding for Mobile sequence under 200MHz.
Fig. 42 and Fig. 43 show the power distribution of each main module for the mobile and akiyo test patterns, respectively. The operating frequency is 100MHz. Memory1 stores the base addresses and table information for large tables, and Memory2 stores the base addresses and table information for small tables.
[Pie chart, (a) QP = 16 for mobile; modules: Memory1, Memory2, Symbol_memory, Others; slice values: 1.44E-04 (12%), 9.82E-05 (8%), 4.19E-04 (35%), 5.42E-04 (45%)]
[Pie chart; modules: Memory1, Memory2, Symbol_memory, Others; slice values: 4.96E-04 (44%), 9.53E-05 (8%), 8.60E-05 (7%), 4.80E-04 (41%)]
[Pie chart, (a) QP = 16 for akiyo; modules: Memory1, Memory2, Symbol_memory, Others; slice values: 4.79E-04 (31%), 5.75E-04 (38%), 3.82E-04 (25%), 9.38E-05 (6%)]
[Pie chart, (b) QP = 34 for akiyo; modules: Memory1, Memory2, Symbol_memory, Others; slice values: 5.75E-04 (45%), 4.58E-04 (36%), 1.45E-04 (11%), 9.88E-05 (8%)]
Fig. 43 Power chart of different modules in akiyo test frame for (a) QP=16, (b) QP=34
Table 10 compares the proposed design with other designs. The proposed design supports two different entropy decoding modes, i.e. MPEG2 and H.264. Besides, the proposed design has an error resilience feature for wireless video transmission applications. These two features distinguish it from the other designs.
Table 10 Comparison of other designs and proposed design [1] NCCU
Process          0.18um         0.18um                   0.18um                    0.09um
Technique        Hardwired      Hardwired                Hardwired                 Memory-based
Features         Parallel LUT   Multi-symbol for level   Modified level detector   Error resilience
Max Frequency    N/A            102MHz                   213MHz                    200MHz
Gate
Chapter 7
Conclusion and Future Work
The entropy decoders of MPEG-2 and H.264/AVC differ considerably from each other in decoding flow, symbol format and table transition. A memory-based VLC decoder that supports dual-mode video formats (H.264/AVC and MPEG-2) with error robustness is proposed in this thesis. The thesis first focuses on improving the memory efficiency of the conventional VLC decoder. Although the MPEG-2 part is not yet fully implemented in the decoder, only a little overhead, such as multiplexers and additional coefficient buffers, is needed when MPEG-2 is required, because the memory utilization and size have been considered in this design. For CAVLC decoding, the throughput is limited by the dependency between syntax elements; hence, pipelining is not suitable for this decoding. However, the decoding of MPEG2 can be pipelined because its symbols are independent. The VLC decoder is synthesized at 100MHz and can support HD720p even under low QPs. The design can also meet the requirement of HD1080p when the operating frequency is 200MHz.
In addition, a novel error resynchronization scheme is proposed in this thesis. This method can be combined with conventional memory-based VLC decoding without extra bandwidth overhead. In this scheme, the EOBs are constructed under a length constraint. A flow of EOB construction is proposed to reduce the off-line simulation time, and an analysis of the EOB probability is also presented. After the EOB library is set, group-based VLC decoding is applied to determine whether EOBs are found.