4.1. Efficient Coding

4.1.1. Level Efficient Coding

Figure 4-1 : Algorithm of level encoding and decoding

Figure 4-1 shows the algorithm of level encoding and decoding. According to the value of suffix_length, we choose the decoding table from NUM_VLC0 ~ NUM_VLC6. When suffix_length is 0, there are two escape cases (level_prefix = 14 or 15) in which level_suffix has to be fetched to complete the decoding. Otherwise, the length of the suffix is equal to the variable suffix_length. The variable sign indicates whether the level is positive or negative.

In the encoding procedures, length is the codeword length and code represents the codeword value. The variable, escape, is defined as the following equation.

escape = 15 << suffix_length

The variable escape determines the threshold of the escape case: if the value of level is greater than or equal to escape, the encoding procedure enters the escape case. In the escape case, level_prefix is set to 15 and the length of level_suffix is 12, which accommodates large level values. The two cases |level| < 16 and |level| ≥ 16 map to the two escape cases in the decoding process. From the encoding and decoding algorithm shown in Figure 4-1, we can formulate the calculations of level encoding/decoding shown in Figure 4-2.

[Figure 4-2, only fragments survive: if (level_encoding), case 2'b00: level_out0 = 1 << 12; level_out1 = {|level| - 16, sign}; shift = suffix_length - 1.]

Figure 4-2 : Calculations of level encoding and decoding

The escape cases for the level encoding procedure are |level| < 8, |level| < 16, and |level| < escape, and those for the level decoding procedure are level_prefix < 14, 14 ≤ level_prefix < 15, and level_prefix = 15. For level encoding, length is the length of the encoded codeword; for level decoding, it is the suffix length of the codeword being decoded, which is transmitted to the codeword boundary detector to calculate the codeword boundary. Likewise, level_out is the codeword value for level encoding and the value of level_code for level decoding. From level_code, we obtain the decoded level as shown in Figure 2-10. These calculations reduce the complexity of level encoding and decoding, and the resulting architecture lets us handle the parallel input bitstream for level decoding and integrate level encoding/decoding into an area-efficient level codec system. The level decoding/encoding procedures and the corresponding examples are shown in Figure 4-3 and Figure 4-4.

Decoding procedures – assume the decoding codewords “00001101000…” and suffix_length is 3.

1) Count leading zeros and fetch level_prefix.

leading zeros = 4 => level_prefix = 4;

2) Evaluate the escape case and fetch level_suffix according to suffix_length.

level_prefix < 14 => escape0 = 0;

level_prefix < 15 => escape1 = 0 & escape2 = 0;

suffix_length = 3 => suffix = 3'b101;

3) According to suffix_length, escape cases, level_prefix, and level_suffix, we can get the decoded suffix_length, level_out0, level_out1, and sign.

suffix_length != 0 && escape2 = 0 => length = suffix_length = 3;

suffix_length != 0 => level_out0 = level_prefix << suffix_length = 4'b0100 << 3 = 7'b0100_000 = 32;

level_out1 = level_suffix = 3'b101 = 5;

sign = level_suffix[0] = 1'b1;

4) Extend level_out0 and level_out1 to the word length of levels, which is 16 bits.

level_out0' = 16'b0000_0000_0010_0000 = 32;

level_out1' = 16'b0000_0000_0000_0101 = 5;

5) Calculate level_code by adding level_out0' and level_out1' and derive level according to sign and level_code.

level_code = level_out0' + level_out1' = 16'b0000_0000_0010_0101 = 37;

sign = 1 => level = ~level_code >> 1= 16'b1111_1111_1110_1101 = -19;

Figure 4-3 : Level decoding procedures and the corresponding examples
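The five decoding steps above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the thesis's hardware: the function name decode_level and the bit-string input are hypothetical, only the non-escape case is handled, the mapping for an even level_code follows the H.264/AVC standard (the thesis refers to Figure 2-10 for it), and the special adjustment for the first level after fewer than three trailing ones is left out.

```python
def decode_level(bits: str, suffix_length: int) -> tuple[int, int]:
    """Decode one non-escape level codeword from a string of '0'/'1' characters.

    Returns (level, number_of_bits_consumed).
    """
    # Step 1: count leading zeros to obtain level_prefix.
    level_prefix = len(bits) - len(bits.lstrip("0"))
    if level_prefix == len(bits):
        raise ValueError("no terminating 1 bit found")

    # Step 2: evaluate the escape cases; they are omitted from this sketch.
    if level_prefix >= 15 or (suffix_length == 0 and level_prefix >= 14):
        raise NotImplementedError("escape cases are not covered by this sketch")

    # Fetch level_suffix (suffix_length bits after the terminating 1).
    start = level_prefix + 1
    suffix_bits = bits[start:start + suffix_length]
    if len(suffix_bits) < suffix_length:
        raise ValueError("bitstream too short for level_suffix")
    level_suffix = int(suffix_bits, 2) if suffix_length else 0

    # Steps 3 to 5: level_code = level_out0 + level_out1.
    level_out0 = level_prefix << suffix_length
    level_out1 = level_suffix
    level_code = level_out0 + level_out1

    # The least significant bit of level_code carries the sign.
    sign = level_code & 1
    if sign:
        level = ~level_code >> 1        # negative level, as in step 5
    else:
        level = (level_code + 2) >> 1   # positive level (H.264/AVC mapping)
    return level, start + suffix_length


if __name__ == "__main__":
    # Example of Figure 4-3: bitstream 00001101000..., suffix_length = 3.
    print(decode_level("00001101000", 3))
```

For the bitstream of Figure 4-3 the sketch consumes 8 bits (four zeros, the terminating one, and the 3-bit suffix) and returns the level -19.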

Encoding procedure – assume the encoding level is 14 and suffix_length is 1.

1) Calculate the absolute value of level, the escape value according to suffix_length, and sign.

level = 14 => |level| = 14;

escape = 15 << suffix_length = 15 << 1 = 30;

level >= 0 => sign = 0;

2) Evaluate the escape cases according to the absolute value of level.

|level| = 14 => escape0 = 1, escape1 = 0, escape2 = 0;

3) According to suffix_length, escape cases, and the absolute value of level, we can get the encoded codeword length and level_out0 and level_out1.

suffix_length != 0 && escape2 == 0 => codeword length = ((|level| - 1) >> shift) + suffix_length + 1 = ((14 - 1) >> 0) + 1 + 1 = 15;

suffix = (|level| - 1) & (~((0xffffffff)<<shift)) = (4'b1101) & (0x0000_0000) = 0;

escape2 == 0 => level_out0 = 1 << suffix_length = 1 << 1 = 2;

level_out1 = {suffix,sign} = 0;

4) Extend level_out0 and level_out1 to a common word length (16 bits here).

level_out0' = 16'b0000_0000_0000_0010 = 2;

level_out1' = 16'b0000_0000_0000_0000 = 0;

5) Calculate codeword by adding level_out0' and level_out1'.

codeword = level_out0' + level_out1' = 16'b0000_0000_0000_0010;

Figure 4-4 : Level encoding procedures and the corresponding examples
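The encoding steps above can be sketched in the same way. This is a minimal sketch under stated assumptions: the names encode_level and to_bits are hypothetical, only suffix_length >= 1 and the non-escape case (level_prefix < 15) are handled, and the adjustment for the first level after fewer than three trailing ones is not modelled.

```python
def encode_level(level: int, suffix_length: int) -> tuple[int, int]:
    """Encode one non-escape level; returns (codeword_value, codeword_length)."""
    if level == 0:
        raise ValueError("CAVLC levels are non-zero")
    if suffix_length < 1:
        raise NotImplementedError("suffix_length = 0 is not covered by this sketch")

    # Step 1: absolute value and sign.
    abs_level = abs(level)
    sign = 1 if level < 0 else 0

    # shift = suffix_length - 1, as in Figure 4-2.
    shift = suffix_length - 1
    level_prefix = (abs_level - 1) >> shift
    # Step 2: a prefix of 15 or more belongs to the escape case (omitted here).
    if level_prefix >= 15:
        raise NotImplementedError("escape cases are not covered by this sketch")

    # Step 3: suffix = (|level| - 1) masked to `shift` bits.
    suffix = (abs_level - 1) & ((1 << shift) - 1)
    level_out0 = 1 << suffix_length      # the terminating 1 of the prefix
    level_out1 = (suffix << 1) | sign    # {suffix, sign}
    length = level_prefix + suffix_length + 1

    # Step 5: codeword = level_out0 + level_out1.
    return level_out0 + level_out1, length


def to_bits(codeword: int, length: int) -> str:
    """Render a codeword value as a bit string of the given length."""
    return format(codeword, "0{}b".format(length))


if __name__ == "__main__":
    code, length = encode_level(-19, 3)
    print(to_bits(code, length))
```

Encoding level -19 with suffix_length 3 yields the bit string 00001101, which is the codeword decoded in Figure 4-3, so the two sketches are consistent with each other.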

The architecture of the proposed level codec system is shown in Figure 4-5. Both the level decoding and encoding procedures run on this codec system. When executing level encoding, the valid outputs are codeword and length; when executing level decoding, the valid outputs are level and length. The results of |level| - 8, |level| - 16, and |level| - escape are obtained from the calculations of the escape cases. The codeword boundary detector always sends a 12-bit bitstream for level decoding, and according to level_prefix and suffix_length the system fetches the required level_suffix with the correct length. The value of suffix_length is provided by the suffix_length generator, whose architecture is the same as that in Figure 2-11, so we do not describe it here. The three components, decoder for length, decoder for level_out0, and decoder for level_out1, implement the calculations shown in Figure 4-2 with PLA architectures. The three adders in the left part of Figure 4-5 calculate the escape cases of level encoding and decoding. The two input signals, level_encoding and level_decoding, serve not only as select signals but also as enable signals that turn the level codec system on or off. The absolute-value component either takes the 2's complement of the input or passes the input unchanged, according to the most significant bit (msb) of the input, which indicates whether the input is positive or negative. This completes the overview of the proposed level encoding and decoding architecture.

[Figure 4-5 block diagram. Recoverable labels: inputs level, level_prefix, level_suffix, and suffix_length; control signals level_encoding and level_decoding; MUXes; barrel shifters; decoder for level_out1; data paths {|level|, sign}, {|level| - 8, sign}, {|level| - 16, sign}, {|level| - escape, sign}, and {suffix, sign}.]

Figure 4-5 : Architecture of level decoding/encoding

4.1.2. Run_before Efficient Coding

Table 4-1 : Table for run_before

The run_before table is shown in Table 4-1. Although PZTP gives good performance when realizing the run_before table, we want an easier and more efficient method to implement a run_before codec system that accepts a parallel input bitstream and also includes the encoding part. Observing the run_before table, we find the numerical relations for run_before decoding and encoding shown in Figure 4-6. For both decoding and encoding, the run_before table can be divided into three groups: zerosLeft < 6, zerosLeft = 6, and zerosLeft > 6. Moreover, the calculations within each group are similar. For example, when zerosLeft is equal to 6, run_before is obtained by adding one to the codeword in the decoding process, and the codeword is run_before minus one in the encoding process. Such relations allow us to implement efficient coding for the run_before table.
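To illustrate what such a numerical relation looks like, the sketch below covers only the zerosLeft > 6 group. It is based on the run_before table of the H.264/AVC standard for that group (3-bit codewords 111 down to 001 for run_before 0 to 6, then a run of zeros terminated by a one for run_before 7 to 14) and is not a reproduction of Figure 4-6, which is not available here. The function names are hypothetical.

```python
def decode_run_before_gt6(bits: str) -> tuple[int, int]:
    """Decode run_before for zerosLeft > 6; returns (run_before, codeword_length)."""
    if len(bits) < 3:
        raise ValueError("need at least 3 bits")
    value = int(bits[:3], 2)
    if value != 0:
        # 3-bit codewords: 111 -> 0, 110 -> 1, ..., 001 -> 6.
        return 7 - value, 3
    # Longer codewords: k zeros followed by a 1 encode run_before = k + 4.
    zeros = len(bits) - len(bits.lstrip("0"))
    if zeros == len(bits) or zeros > 10:
        raise ValueError("not a valid run_before codeword")
    return zeros + 4, zeros + 1


def encode_run_before_gt6(run_before: int) -> str:
    """Encode run_before (0..14) for zerosLeft > 6 as a bit string."""
    if not 0 <= run_before <= 14:
        raise ValueError("run_before out of range")
    if run_before <= 6:
        return format(7 - run_before, "03b")
    return "0" * (run_before - 4) + "1"


if __name__ == "__main__":
    for rb in range(15):
        code = encode_run_before_gt6(rb)
        print(rb, code, decode_run_before_gt6(code))
```

The same arithmetic serves both directions (a subtraction from 7, or a zero count with an offset of 4), which is the kind of sharing between encoder and decoder that a table look-up cannot offer.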

Figure 4-6 : The numerical calculations of run_before encoding and decoding

The architecture of the proposed run_before codec system is shown in Figure 4-7. This architecture can replace the look-up table method. Its advantage is that the major function units are shared between the encoding and decoding procedures, whereas a look-up table implementation would need two separate tables, one for each procedure.

[Figure 4-7 block diagram. Recoverable labels: run_before, codeword, encoding, decoding, MUX.]

Figure 4-7 : Architecture of run_before codec system

4.2. Zero skipping and proposed symbols constructor

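| Code | Element | Value | Output array |
|---|---|---|---|
| 0000100 | coeff_token | TotalCoeff = 5, TrailingOnes = 3 | Empty |
| 0 | T1 sign | + | 1 |
| 1 | T1 sign | - | -1,1 |
| 1 | T1 sign | - | -1,-1,1 |
| 1 | level | +1 | 1,-1,-1,1 |
| 0010 | level | +3 | 3,1,-1,-1,1 |
| 111 | total_zeros | 3 | 3,1,-1,-1,1 |
| 10 | run_before | 1 | 3,1,-1,-1,0,1 |
| 1 | run_before | 0 | 3,1,-1,-1,0,1 |
| 1 | run_before | 0 | 3,1,-1,-1,0,1 |
| 01 | run_before | 1 | 3,0,1,-1,-1,0,1 |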

Figure 4-8 : An example of decoding procedures of CAVLC

Figure 4-8 shows an example of the CAVLC decoding procedure, in which the DCT coefficients are constructed in zigzag order. Conventionally, the DCT coefficients are arranged only after all run_befores have been decoded, which costs additional cycles equal in number to TotalCoeff. If we instead arrange a coefficient in the cycle following the decoding of its run_before, we can save a few of these cycles. Before executing the proposed symbols construction, we have to know the location of the last non-zero coefficient in the coefficient storage, which can be calculated from TotalCoeff and total_zeros. The procedure of the proposed symbols construction and the corresponding example are shown in Figure 4-9. In Figure 4-9, cycle is the cycle of symbols construction (run_before is being decoded in cycles 1 ~ 4), run_before is the value of the run_before decoded in the present cycle, level_count is the pointer into the levels buffer, coeff_count is the pointer into the coefficients buffer, and coeff_buffer records the contents of the coefficients buffer in the next cycle. The initial value of coeff_count is the sum of TotalCoeff and total_zeros minus one. Because this sum is the total number of decoded symbols, including both non-zero and zero coefficients, it gives the location of the last non-zero coefficient in coeff_buffer. In the first cycle, level_count is 4, which maps to the level 1. Therefore, we write 1 into coeff_buffer at location 7, and the next coeff_count is the current coeff_count minus 1. At the same time, the decoded run_before is 1, which also has to be subtracted, so the next coeff_count is 5. Repeating these steps, we finally obtain the DCT coefficients in zigzag order.

| cycle | run_before | level_count | coeff_count | coeff_buffer 0 ~ 15 |
|---|---|---|---|---|
| 1 | 1 | 4 | 7 | 0000_0001_0000_0000 |
| 2 | 0 | 3 | 6 - 1 | 0000_0-101_0000_0000 |
| 3 | 0 | 2 | 4 | 0000_-1-101_0000_0000 |
| 4 | 1 | 1 | 3 | 0001_-1-101_0000_0000 |
| 5 | N.A. | 0 | 2 - 1 | 0301_-1-101_0000_0000 |

Figure 4-9 : The proposed symbols construction for example in Figure 4-8.
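The pointer updates of Figure 4-9 can be sketched as a small software model. This is a minimal sketch, assuming a hypothetical function name construct_coefficients, a levels buffer indexed by level_count, and one loop iteration per construction cycle; it models the bookkeeping only, not the hardware timing.

```python
def construct_coefficients(levels, run_befores, total_zeros, size=16):
    """Place decoded levels into the coefficient buffer, one level per cycle.

    levels      : levels buffer; the entry at the highest index is placed first
    run_befores : decoded run_before values in decoding order
    total_zeros : decoded total_zeros
    """
    coeff_buffer = [0] * size
    # Initial pointer: TotalCoeff + total_zeros - 1.
    coeff_count = len(levels) + total_zeros - 1
    for cycle, level_count in enumerate(range(len(levels) - 1, -1, -1)):
        coeff_buffer[coeff_count] = levels[level_count]
        # The last level has no run_before of its own ("N.A." in Figure 4-9).
        run_before = run_befores[cycle] if cycle < len(run_befores) else 0
        # Step over the coefficient just written and over the zeros before it.
        coeff_count -= 1 + run_before
    return coeff_buffer


if __name__ == "__main__":
    # Example of Figure 4-8: TotalCoeff = 5, total_zeros = 3.
    print(construct_coefficients([3, 1, -1, -1, 1], [1, 0, 0, 1], 3))
```

For the example of Figure 4-8 the first eight buffer entries come out as 0, 3, 0, 1, -1, -1, 0, 1, matching the last row of Figure 4-9.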

However, the proposed symbols construction is not the optimal solution. When the decoded run_before is equal to 0, the location of the next coefficient can be predicted without decoding that run_before. That is, if we skip the zero run_before and decode the next run_before, we can still store the levels into the correct locations in the coefficients buffer. This is not difficult: when calculating level_count and coeff_count, we take the number of skipped zero run_befores into account. An example of the proposed symbols construction with zero-skipping is shown in Figure 4-10.

| cycle | run_before | level_count | coeff_count | coeff_buffer 0 ~ 15 |
|---|---|---|---|---|
| 1 | 1 | 4 | 7 | 0000_0001_0000_0000 |
| 4 | (0), (0), 1 | 3 - 2 | 6 - 1 - 2 | 0001_-1-101_0000_0000 |
| 5 | N.A. | 0 | 2 - 1 | 0301_-1-101_0000_0000 |

Figure 4-10 : An example of the proposed symbols construction with zero-skipping
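The effect of zero-skipping on the cycle count can be sketched with the following software model. It is a sketch under assumptions: the function name is hypothetical, and each loop iteration stands for one construction cycle that absorbs any run of zero run_befores together with the next run_before, as in the second row of Figure 4-10.

```python
def construct_with_zero_skipping(levels, run_befores, total_zeros, size=16):
    """Return (coeff_buffer, cycles) when zero run_befores are skipped."""
    coeff_buffer = [0] * size
    coeff_count = len(levels) + total_zeros - 1
    level_count = len(levels) - 1
    index = 0    # position in run_befores
    cycles = 0
    while level_count >= 0:
        cycles += 1
        # Count the zero run_befores that can be skipped in this cycle.
        skipped = 0
        while index + skipped < len(run_befores) and run_befores[index + skipped] == 0:
            skipped += 1
        has_run = index + skipped < len(run_befores)
        run_before = run_befores[index + skipped] if has_run else 0
        # Each skipped zero run_before releases one more level in this cycle.
        for _ in range(min(skipped + 1, level_count + 1)):
            coeff_buffer[coeff_count] = levels[level_count]
            level_count -= 1
            coeff_count -= 1
        coeff_count -= run_before
        index += skipped + (1 if has_run else 0)
    return coeff_buffer, cycles


if __name__ == "__main__":
    print(construct_with_zero_skipping([3, 1, -1, -1, 1], [1, 0, 0, 1], 3))
```

For the example of Figure 4-8 the buffer contents are the same as without zero-skipping, but the model needs three iterations instead of five, which corresponds to the three rows of Figure 4-10.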

The final problem is how to realize the function unit that detects the zero-skipping condition. Figure 4-11 shows the run_before table entries for a zero run_before under different values of zerosLeft. The codewords of a zero run_before are “1”, “11”, and “111”. Therefore, the zero-skipping detector is easy to realize, because a leading-one counter has already been designed in the codeword boundary detector for MPEG-2 codewords. We reuse that leading-one counter and add one more decoder, whose inputs are the number of leading ones and zerosLeft, to obtain the number of zero run_befores to skip.

| run_before | zerosLeft = 1 | 2 | 3 | 4 | 5 | 6 | >6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 11 | 11 | 11 | 11 | 111 |

Figure 4-11 : The run_before table mapping to zero run_before
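The added decoder can be sketched as a small function of the leading-one count and zerosLeft. This is a sketch under assumptions: the function name and the remaining_runs argument (a cap so that ones belonging to later syntax elements are not counted) are hypothetical and not taken from the thesis. It relies on two facts: a zero run_before leaves zerosLeft unchanged, so consecutive zero codewords all have the same length, and no non-zero run_before codeword consists only of ones.

```python
def count_zero_skips(leading_ones: int, zeros_left: int, remaining_runs: int) -> int:
    """Number of consecutive zero run_befores at the head of the bitstream.

    leading_ones   : output of the leading-one counter
    zeros_left     : current zerosLeft
    remaining_runs : run_before values still to be decoded (hypothetical cap)
    """
    if zeros_left <= 0 or remaining_runs <= 0:
        return 0
    # Length of the zero run_before codeword, from Figure 4-11.
    if zeros_left <= 2:
        code_length = 1      # codeword "1"
    elif zeros_left <= 6:
        code_length = 2      # codeword "11"
    else:
        code_length = 3      # codeword "111"
    return min(leading_ones // code_length, remaining_runs)


if __name__ == "__main__":
    # After the first run_before of Figure 4-8, zerosLeft = 2 and the
    # bitstream continues with 1, 1, 01: two leading ones, two skips.
    print(count_zero_skips(leading_ones=2, zeros_left=2, remaining_runs=3))
```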

4.3. Summary

| proposed method | I-frame | P-frame | frame |
|---|---|---|---|
| level efficient coding | 40% | 17% | 29% |
| run_before efficient coding | 4% | 12% | 8% |
| symbols construction | 14% | 12% | 13% |
| zero skipping | 4% | 5% | 4% |

Table 4-2 : Throughput improvement of each proposed method, foreman QP = 10

Table 4-2 shows the throughput improvement of each proposed approach when decoding the sequence foreman with QP equal to 10. Level efficient coding has the largest effect, improving throughput by about 40% when decoding an I-frame. Run_before efficient coding improves P-frames more than I-frames, because the blocks of a P-frame contain more zero coefficients than those of an I-frame. Symbols construction also gives a good improvement for both I-frames and P-frames. However, the effect of zero skipping is less significant. We suspect that this picture does not contain many zero run_befores. Therefore, we decode another sequence, mobile, with QP set to 28. The resulting throughput improvement is shown in Table 4-3.

| proposed method | I-frame | P-frame | frame |
|---|---|---|---|
| level efficient coding | 27% | 5% | 22% |
| run_before efficient coding | 6% | 12% | 7.5% |
| symbols construction | 14% | 5% | 12% |
| zero skipping | 4% | 3% | 4% |

Table 4-3 : Throughput improvement of each proposed method, mobile QP = 28

Table 4-3 shows the throughput improvement of each proposed approach when decoding the sequence mobile with QP equal to 28. Level efficient coding still performs very well for I-frames (27%), but its improvement for P-frames is small (5%). Run_before efficient coding again performs well for P-frames, and symbols construction provides a large improvement for I-frames. Zero skipping, however, still yields only a small improvement. Fortunately, its hardware cost is acceptable, even though its throughput improvement is limited.

Chapter 5. Implementation Results and Conclusion

5.1. Implementation Results

Figure 5-1 and Figure 5-2 show the encoding throughput of the proposed VLC group-based codec system, measured with the H.264/AVC standard C code, JM 9.2. In Figure 5-1 the encoded sequence is mobile.yuv, and in Figure 5-2 it is foreman.yuv. In both figures, the proposed VLC group-based codec system can support H.264/AVC main profile @5.1 when QP is equal to 28. Table 5-1 shows the average encoding cycles per MB in the proposed design.

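| QP | mobile | foreman |
|---|---|---|
| 10 | 368 | 329 |
| 12 | 353 | 292 |
| 16 | 320 | 226 |
| 20 | 278 | 156 |
| 24 | 227 | 102 |
| 28 | 165 | 69 |
| 32 | 114 | 50 |
| 36 | 86 | 35 |
| 40 | 68 | 23 |
| average | 220 | 142 |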

Table 5-1 : The average encoding cycles per MB in the proposed design
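The averages in Table 5-1, and the macroblock rate they imply, can be checked with a few lines of arithmetic. This is a sketch under an assumption: the 125 MHz clock is taken from the "Proposed" column of Table 5-2 and is assumed, not confirmed, to be the clock used for these measurements. The function name is hypothetical.

```python
from statistics import mean

# Average encoding cycles per MB from Table 5-1, keyed by QP.
CYCLES_PER_MB = {
    "mobile":  {10: 368, 12: 353, 16: 320, 20: 278, 24: 227,
                28: 165, 32: 114, 36: 86, 40: 68},
    "foreman": {10: 329, 12: 292, 16: 226, 20: 156, 24: 102,
                28: 69, 32: 50, 36: 35, 40: 23},
}


def macroblocks_per_second(cycles_per_mb: float, clock_hz: float) -> float:
    """Macroblock rate reached when every MB takes cycles_per_mb cycles."""
    if cycles_per_mb <= 0:
        raise ValueError("cycles_per_mb must be positive")
    return clock_hz / cycles_per_mb


if __name__ == "__main__":
    ASSUMED_CLOCK_HZ = 125e6  # assumption, see the lead-in
    for name, table in CYCLES_PER_MB.items():
        print(name, "average cycles per MB:", round(mean(table.values())))
        print(name, "MB/s at QP 28:",
              round(macroblocks_per_second(table[28], ASSUMED_CLOCK_HZ)))
```

Fewer cycles per MB at higher QP translate directly into a higher sustainable macroblock rate at a fixed clock.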

Figure 5-1 : The encoding throughput of proposed design running mobile.yuv

Figure 5-2 : The encoding throughput of proposed design running foreman.yuv

Figure 5-3 : The decoding throughput of proposed design running mobile.yuv

Figure 5-4 : The decoding throughput of proposed design running foreman.yuv

Figure 5-3 and Figure 5-4 show the decoding throughput of the proposed VLC group-based codec system; Figure 5-3 decodes mobile.yuv and Figure 5-4 decodes foreman.yuv. When evaluating decoding throughput, we usually consider the I-frames of the decoded picture. In Figure 5-3, the I-frame decoding throughput reaches the requirement of H.264/AVC main profile @5.0 when QP is 32, and in Figure 5-4 the proposed VLC codec system meets it when QP is 28. Therefore, the proposed VLC codec system can support H.264/AVC main profile @5.0.

| | Chien [2] | Chen [1] | Yu [5] | Proposed |
|---|---|---|---|---|
| Technology | 0.18um | 0.18um | 0.18um | 0.13um |
| Gate Count | 9724 | 17635 | 13192 | 20357 |
| Clock Frequency | 125 MHz | 100 MHz | 125 MHz | 125 MHz |

Encoding/Decoding Encoding Encoding Decoding

Decoding : 8554

Table 5-2 : Comparison of the proposed design with others

To implement the proposed CAVLC codec system, we performed logic synthesis on the proposed design with a 0.13um CMOS technology. The comparison of the proposed design with others is shown in Table 5-2. Design [1] contains a bitstream packer, which packs the codewords produced by the symbol encoders and also handles the packing of bitstream headers and Exp-Golomb codes.

In MPEG-2, the only throughput difference from the conventional group-based VLC codec design lies in the decoding procedure, because the proposed design has to access the symbol group memory when decoding an MPEG-2 symbol. However, with a well-pipelined architecture, this difference is not noticeable. Besides, the encoding procedure in the proposed design has the same steps as in the conventional group-based VLC codec design, so its throughput is the same as the conventional one. The simulation results are shown in Table 5-3. The average symbol rate of the encoding process is 99.98 Msps at a 100 MHz clock rate, and the average symbol rate of the decoding process is 99.8 Msps at the same clock rate. Some overhead is introduced by stalls of the bitstream FIFOs.

image: (4:2:2) @ 1920 X 1080

simulation results

# of bitstream (bit) 3439392 1912640

# of symbols 590302 252817

Encoding cycle 590348 252864

Decoding cycle 591484 253323

Table 5-3 : Simulation results based on HDTV systems (I-frame) in MPEG-2
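The symbol rates quoted above follow from the symbol and cycle counts in Table 5-3. The sketch below assumes that the symbol rate is the number of symbols divided by the number of cycles, scaled by the 100 MHz clock; the function name and the dictionary layout are hypothetical.

```python
def symbol_rate_msps(symbols: int, cycles: int, clock_mhz: float = 100.0) -> float:
    """Symbol rate in Msps when `symbols` symbols take `cycles` clock cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return symbols / cycles * clock_mhz


# The two data columns of Table 5-3.
TABLE_5_3 = [
    {"symbols": 590302, "encoding_cycles": 590348, "decoding_cycles": 591484},
    {"symbols": 252817, "encoding_cycles": 252864, "decoding_cycles": 253323},
]

if __name__ == "__main__":
    for column in TABLE_5_3:
        enc = symbol_rate_msps(column["symbols"], column["encoding_cycles"])
        dec = symbol_rate_msps(column["symbols"], column["decoding_cycles"])
        print("encoding: {:.2f} Msps, decoding: {:.2f} Msps".format(enc, dec))
```

Both columns give a decoding rate of about 99.8 Msps and an encoding rate just under 100 Msps, so the design processes close to one symbol per cycle in both directions.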

5.2. Conclusion

In this thesis, we propose a low-power, low-cost VLC decoder for dual standards, MPEG-2 and H.264/AVC. Compared to [4], we reduce the hardware cost of the H.264/AVC CAVLD by 30%. The hardware cost of the proposed dual-standard VLD is 7683 gates, and its power consumption at 100 MHz is 1.719 mW for MPEG-2, 1.302 mW for H.264/AVC baseline@3.0 I-frames, and 1.376 mW for H.264/AVC baseline@3.0 P-frames.

Besides, we propose another group-based VLC codec system for dual standards, MPEG-2 and H.264/AVC. Using the group-based approach, level efficient coding, run_before efficient coding, the proposed symbols construction, and zero-skipping, we design a VLC codec system that can support H.264/AVC main profile @5.0 with 20357 gates at 100 MHz. The throughput improvement contributed by each proposed method is shown in Table 5-4 and Table 5-5. Compared to the conventional group-based VLC codec system, the proposed design reduces memory usage by 16%.

| proposed method | I-frame | P-frame | frame |
|---|---|---|---|
| level efficient coding | 40% | 17% | 29% |
| run_before efficient coding | 4% | 12% | 8% |
| symbols construction | 14% | 12% | 13% |
| zero skipping | 4% | 5% | 4% |

Table 5-4 : Throughput improvement of each proposed method, foreman QP = 10

| proposed method | I-frame | P-frame | frame |
|---|---|---|---|
| level efficient coding | 27% | 5% | 22% |
| run_before efficient coding | 6% | 12% | 7.5% |
| symbols construction | 14% | 5% | 12% |
| zero skipping | 4% | 3% | 4% |

Table 5-5 : Throughput improvement of each proposed method, mobile QP = 28

5.3. Future Work

The hardware cost is a problem for the proposed group-based VLC codec design: for the throughput it achieves, the hardware cost is not efficient enough. Therefore, reducing the hardware cost is one target for future effort. Besides, power is always an issue for group-based designs, so reducing the power consumption of the proposed group-based VLC codec design is another target. Perhaps this problem can be addressed with a memory hierarchy, because the codewords of the VLC tables reflect the occurrence probabilities of the symbols.

On the other hand, mobile devices are now in widespread use. In wireless communication, receiving an erroneous bitstream due to noise is a serious problem: it causes blocks to be decoded incorrectly, and the decoded picture may show mosaic artifacts. Therefore, developing error resilience approaches is very important.

Reference

[1] T. C. Chen, Y. W. Huang, C. Y. Tsai, B. Y. Hsieh, and L. G. Chen, “Dual-block-pipelined VLSI architecture of entropy coding for H.264/AVC baseline profile”, Proc. International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 271-274, 2005.

[2] C. D. Chien, K. P. Lu, Y. H. Shih, and J. I. Guo, “A High Performance CAVLC Encoder Design for MPEG-4 AVC/H.264 Video Coding Applications”, in Proc. ISCAS, 2006.

[3] Wu Di, Gao Wen, Hu Mingzeng, and Ji Zhenzhou, “A VLSI Architecture Design of CAVLC Decoder”, Proc. 5th International Conference on ASIC, Vol. 2, pp. 962-965, 21-24 Oct. 2003.

[4] H. C. Chang, C. C. Lin, and J. I. Guo, “A Novel Low-Cost High-Performance VLSI Architecture for MPEG AVC/H.264 CAVLC Decoding”, in Proc. ISCAS, pp. 6110-6113, 2005.

[5]K. S. Yu and T. S. Chang “A Zero-Skipping Multi-symbol CAVLC Decoder for
