Zero skipping and proposed symbols constructor

Code      Element       Value                              Output array
0000100   coeff_token   TotalCoeff = 5, TrailingOnes = 3   Empty
0         T1 sign       +                                  1
1         T1 sign       -                                  -1,1
1         T1 sign       -                                  -1,-1,1
1         level         +1                                 1,-1,-1,1
0010      level         +3                                 3,1,-1,-1,1
111       total_zeros   3                                  3,1,-1,-1,1
10        run_before    1                                  3,1,-1,-1,0,1
1         run_before    0                                  3,1,-1,-1,0,1
01        run_before    1                                  3,0,1,-1,-1,0,1

Figure 4-8 : An example of decoding procedures of CAVLC

Figure 4-8 shows an example of the CAVLC decoding procedure and how the DCT coefficients are constructed in zigzag order. A conventional decoder arranges the DCT coefficients only after all run_befores have been decoded, which takes additional cycles equal to the value of TotalCoeff. If we instead arrange each coefficient in the cycle after its run_before is decoded, we can save a few of these cycles. Before executing the proposed symbols construction, we have to know the location of the last non-zero coefficient in the coefficients storage, which can be calculated from TotalCoeff and total_zeros. The procedure of the proposed symbols construction and the corresponding example are shown in Figure 4-9. In Figure 4-9, cycle is the cycle of symbols construction (run_befores are decoded in cycles 1 ~ 4), run_before is the value of the run_before decoded in the present cycle, level_count is the pointer to the levels buffer, coeff_count is the pointer to the coefficients buffer, and coeff_buffer records the contents of the coefficients buffer in the next cycle. The initial value of coeff_count is the sum of TotalCoeff and total_zeros minus one. This sum is the total number of decoded symbols, counting both non-zero and zero coefficients, so it gives the location of the last non-zero coefficient in coeff_buffer. In the first cycle, level_count is 4, which points to the level 1. Therefore, we write 1 to coeff_buffer at location 7 and subtract 1 from coeff_count. At the same time the decoded run_before is 1, which also has to be subtracted from coeff_count, so the next coeff_count is 5. Repeating these steps finally yields the DCT coefficients in zigzag order.

cycle   run_before   level_count   coeff_count   coeff_buffer 0 ~ 15
1       1            4             7             0000_0001_0000_0000
2       0            3             6 - 1         0000_0-101_0000_0000
3       0            2             4             0000_-1-101_0000_0000
4       1            1             3             0001_-1-101_0000_0000
5       N.A.         0             2 - 1         0301_-1-101_0000_0000

Figure 4-9 : The proposed symbols construction for example in Figure 4-8.
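The procedure traced in Figure 4-9 can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design: the function name construct_symbols and the trace list are hypothetical, the levels buffer is assumed to hold the levels in zigzag order (so level_count = 4 points to the first decoded trailing one), and one run_before is assumed to be available for every level except the last.

```python
def construct_symbols(levels, total_zeros, run_befores, size=16):
    """Place one level per cycle into a zigzag-ordered coefficient buffer.

    levels      -- non-zero levels in zigzag order (assumed buffer layout)
    total_zeros -- decoded total_zeros
    run_befores -- run_before values in decoding order, one per level
                   except the last (assumption of this sketch)
    """
    total_coeff = len(levels)
    coeff_buffer = [0] * size
    # Location of the last non-zero coefficient.
    coeff_count = total_coeff + total_zeros - 1
    trace = []
    for cycle in range(1, total_coeff + 1):
        level_count = total_coeff - cycle
        coeff_buffer[coeff_count] = levels[level_count]
        trace.append((cycle, level_count, coeff_count))
        if level_count > 0:
            # Step past this coefficient and the zeros that precede it.
            coeff_count -= 1 + run_befores[cycle - 1]
    return coeff_buffer, trace


if __name__ == "__main__":
    buf, trace = construct_symbols([3, 1, -1, -1, 1], 3, [1, 0, 0, 1])
    for row in trace:
        print(row)
    print(buf)
```

With the values of Figure 4-8, the model visits coeff_count values 7, 5, 4, 3 and 1, as in Figure 4-9, and the buffer ends as 0, 3, 0, 1, -1, -1, 0, 1 followed by zeros.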

However, the proposed symbols construction is not the optimal solution. When a run_before is equal to 0, the location of the next coefficient can be predicted without spending a cycle on decoding that run_before. That is, if we skip the zero run_befores and decode the next run_before, we can still store the levels into the correct locations in the coefficients buffer. This is not difficult: when calculating level_count and coeff_count, we take the number of skipped zero run_befores into account. An example of the proposed symbols construction with zero-skipping is shown in Figure 4-10.

cycle   run_before    level_count   coeff_count   coeff_buffer 0 ~ 15
1       1             4             7             0000_0001_0000_0000
4       (0), (0), 1   3 - 2         6 - 1 - 2     0001_-1-101_0000_0000
5       N.A.          0             2 - 1         0301_-1-101_0000_0000

Figure 4-10 : An example of the proposed symbols construction with zero-skipping
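A minimal software sketch of the zero-skipping idea is given here. It assumes, as Figure 4-10 suggests, that every skipped zero run_before lets one more level be written to the adjacent buffer location within the same cycle. The function name construct_with_zero_skipping and the cycle counter are hypothetical and only model the behaviour, not the hardware.

```python
def construct_with_zero_skipping(levels, total_zeros, run_befores, size=16):
    """Model of symbols construction that folds zero run_befores into one cycle.

    levels      -- non-zero levels in zigzag order (assumed buffer layout)
    total_zeros -- decoded total_zeros
    run_befores -- run_before values in decoding order
    Returns the coefficient buffer and the number of construction cycles used.
    """
    total_coeff = len(levels)
    coeff_buffer = [0] * size
    coeff_count = total_coeff + total_zeros - 1   # last non-zero location
    level_count = total_coeff - 1
    next_run = 0                                  # next run_before to decode
    cycles = 0
    while level_count >= 0:
        cycles += 1
        # Count the consecutive zero run_befores that can be skipped.
        skipped = 0
        while (next_run + skipped < len(run_befores)
               and run_befores[next_run + skipped] == 0):
            skipped += 1
        # Each skipped zero means one more level sits in the adjacent location.
        for _ in range(min(skipped + 1, level_count + 1)):
            coeff_buffer[coeff_count] = levels[level_count]
            coeff_count -= 1
            level_count -= 1
        next_run += skipped
        if next_run < len(run_befores):
            # The run_before decoded in this cycle moves the pointer further.
            coeff_count -= run_befores[next_run]
            next_run += 1
    return coeff_buffer, cycles


if __name__ == "__main__":
    print(construct_with_zero_skipping([3, 1, -1, -1, 1], 3, [1, 0, 0, 1]))
```

For the example of Figure 4-8 this model needs three construction cycles instead of five and produces the same buffer contents.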

The final problem is how to realize the function unit that detects the zero-skipping condition. Figure 4-11 shows the codewords of the run_before table that map to a zero run_before under different values of zerosLeft. The codewords of a zero run_before are “1”, “11”, and “111”. Therefore, the zero-skipping detector is easy to realize, because a leading-one counter has already been designed in the codeword boundary detector for MPEG-2 codewords. We reuse that leading-one counter and add one more decoder whose inputs are the number of leading ones and zerosLeft, which gives the number of zero run_befores that can be skipped.

run_before \ zerosLeft   1    2    3    4    5    6    >6
0                        1    1    11   11   11   11   111

Figure 4-11 : The run_before table mapping to zero run_before
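The extra decoder described above can be sketched as follows. The codeword lengths 1, 2 and 3 are taken from Figure 4-11. The rest is an assumption of this sketch: because a zero run_before leaves zerosLeft unchanged, the number of skippable zero run_befores is taken as the leading-one count divided by the zero-codeword length, capped by the number of run_befores still to be decoded. The names zero_codeword_length, count_zero_run_befores and remaining_runs are hypothetical.

```python
def zero_codeword_length(zeros_left):
    """Length of the codeword for run_before = 0 (values from Figure 4-11)."""
    if zeros_left <= 2:
        return 1      # codeword "1"
    if zeros_left <= 6:
        return 2      # codeword "11"
    return 3          # codeword "111"


def count_zero_run_befores(leading_ones, zeros_left, remaining_runs):
    """Number of zero run_befores that can be skipped (sketch, see assumptions).

    leading_ones   -- output of the leading-one counter on the bitstream
    zeros_left     -- current zerosLeft
    remaining_runs -- run_befores still to be decoded (hypothetical cap)
    """
    if zeros_left <= 0:
        return 0      # no run_before is coded once zerosLeft reaches 0
    # A zero run_before does not change zerosLeft, so the same codeword
    # length applies to every consecutive zero run_before.
    return min(leading_ones // zero_codeword_length(zeros_left), remaining_runs)


if __name__ == "__main__":
    # Figure 4-8 after the first run_before: bits "1", "1", "01", zerosLeft = 2.
    print(count_zero_run_befores(leading_ones=2, zeros_left=2, remaining_runs=3))
```

In the example of Figure 4-8, two leading ones with zerosLeft equal to 2 give two skipped run_befores, which corresponds to the entry (0), (0), 1 in Figure 4-10.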

4.3. Summary

proposed method               I-frame   P-frame   frame
level efficient coding        40%       17%       29%
run_before efficient coding   4%        12%       8%
symbols construction          14%       12%       13%
zero skipping                 4%        5%        4%

Table 4-2 : Throughput improvement of each proposed method, foreman QP = 10

Table 4-2 shows the throughput improvement of each proposed approach when decoding the test sequence foreman with QP equal to 10. Level efficient coding has the largest effect and improves throughput by about 40% when decoding an I-frame. Run_before efficient coding helps P-frames more than I-frames, because the blocks of a P-frame have more zero coefficients than those of an I-frame. Symbols construction also gives a good improvement for both I-frames and P-frames. However, the effect of zero skipping is not significant. We suspect that this sequence does not contain many zero run_befores. Therefore, we decode another sequence, mobile, with QP set to 28. The throughput improvement is shown in Table 4-3.

proposed method               I-frame   P-frame   frame
level efficient coding        27%       5%        22%
run_before efficient coding   6%        12%       7.5%
symbols construction          14%       5%        12%
zero skipping                 4%        3%        4%

Table 4-3 : Throughput improvement of each proposed method, mobile QP = 28

Table 4-3 shows the throughput improvement of each proposed approach when decoding the test sequence mobile with QP equal to 28. Level efficient coding still performs very well for I-frames, but its improvement for P-frames is small. Run_before efficient coding again performs well for P-frames, and symbols construction provides a large improvement for I-frames. The zero-skipping approach still gives little improvement. Fortunately, the hardware cost of zero-skipping is acceptable, even though its throughput improvement is limited.

Chapter 5.

Implementation Results and Conclusion

5.1. Implementation Results

Figure 5-1 and Figure 5-2 show the encoding throughput of the proposed VLC group-based codec system, measured using the H.264/AVC reference C code, JM 9.2. In Figure 5-1 we encode mobile.yuv, and in Figure 5-2 we encode foreman.yuv. In both figures, the proposed VLC group-based codec system can support H.264/AVC main profile @5.1 when QP is equal to 28. Table 5-1 shows the average encoding cycles per MB in the proposed design.

QP        mobile   foreman
10        368      329
12        353      292
16        320      226
20        278      156
24        227      102
28        165      69
32        114      50
36        86       35
40        68       23
average   220      142

Table 5-1 : The average encoding cycles per MB in the proposed design
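As a quick arithmetic check, the average row of Table 5-1 can be recomputed from the per-QP entries. The dictionary name cycles_per_mb is only illustrative, and the values are copied from the table.

```python
# Average encoding cycles per MB for QP = 10, 12, 16, 20, 24, 28, 32, 36, 40
# (values copied from Table 5-1).
cycles_per_mb = {
    "mobile":  [368, 353, 320, 278, 227, 165, 114, 86, 68],
    "foreman": [329, 292, 226, 156, 102, 69, 50, 35, 23],
}

# Mean over the nine QP settings, rounded to the nearest integer.
averages = {name: round(sum(v) / len(v)) for name, v in cycles_per_mb.items()}

if __name__ == "__main__":
    print(averages)
```

The rounded means agree with the averages of 220 and 142 cycles per MB listed in the table.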

Figure 5-1 : The encoding throughput of proposed design running mobile.yuv

Figure 5-2 : The encoding throughput of proposed design running foreman.yuv

Figure 5-3 : The decoding throughput of proposed design running mobile.yuv

Figure 5-4 : The decoding throughput of proposed design running foreman.yuv

Figure 5-3 and Figure 5-4 show the decoding throughput of the proposed VLC group-based codec system; we decode mobile.yuv in Figure 5-3 and foreman.yuv in Figure 5-4. When evaluating decoding throughput, we usually consider the I-frames of the decoded picture. In Figure 5-3, the decoding throughput of the I-frame reaches the requirement of H.264/AVC main profile @5.0 when QP is 32, and in Figure 5-4 the proposed VLC codec system meets it when QP is 28. Therefore, the proposed VLC codec system can support H.264/AVC main profile @5.0.

                    Chien [2]   Chen [1]   Yu [5]     Proposed
Technology          0.18um      0.18um     0.18um     0.13um
Gate Count          9724        17635      13192      20357
Clock Frequency     125 MHz     100 MHz    125 MHz    125 MHz
Encoding/Decoding   Encoding    Encoding   Decoding
Decoding : 8554

Table 5-2 : Comparison of the proposed design with others

To implement the proposed CAVLC codec system, we performed logic synthesis of the design in a 0.13um CMOS technology. The comparison of the proposed design with other designs is shown in Table 5-2. Design [1] also contains a bitstream packer, which packs the codewords produced by the symbol encoders, as well as the packing of bitstream headers and Exp-Golomb coding.

In MPEG-2, the only throughput difference from the conventional group-based VLC codec design lies in the decoding procedure, because the proposed design has to access the symbol group memory when decoding an MPEG-2 symbol. However, with a well-pipelined architecture this difference is not noticeable. Besides, the encoding procedure in the proposed design has the same steps as in the conventional group-based VLC codec design, so the encoding throughput is the same as the conventional one. The simulation results are shown in Table 5-3. The average symbol rate of the encoding process is 99.98 Msps at a 100 MHz clock rate, and the average symbol rate of the decoding process is 99.8 Msps at the same clock rate. Some overhead is introduced by stalls of the bitstream FIFOs.

image: (4:2:2) @ 1920 X 1080
simulation results
# of bitstream (bit)   3439392   1912640
# of symbols           590302    252817
Encoding cycle         590348    252864
Decoding cycle         591484    253323

Table 5-3 : Simulation results based on HDTV systems (I-frame) in MPEG-2
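The quoted symbol rates can be reproduced from Table 5-3 by dividing the number of symbols by the number of cycles and scaling by the clock rate. The sketch below assumes that the two result columns are two separate simulation runs and that the reported figure is the mean over both; the names results and symbol_rate_msps are hypothetical.

```python
CLOCK_MHZ = 100  # clock rate stated in the text

# The two result columns of Table 5-3 (assumed to be two simulation runs).
results = [
    {"symbols": 590302, "enc_cycles": 590348, "dec_cycles": 591484},
    {"symbols": 252817, "enc_cycles": 252864, "dec_cycles": 253323},
]


def symbol_rate_msps(symbols, cycles, clock_mhz=CLOCK_MHZ):
    """Symbols per cycle scaled by the clock rate, in Msymbols per second."""
    return symbols / cycles * clock_mhz


enc_rates = [symbol_rate_msps(r["symbols"], r["enc_cycles"]) for r in results]
dec_rates = [symbol_rate_msps(r["symbols"], r["dec_cycles"]) for r in results]
avg_enc = sum(enc_rates) / len(enc_rates)
avg_dec = sum(dec_rates) / len(dec_rates)

if __name__ == "__main__":
    print(f"encoding: {avg_enc:.2f} Msps, decoding: {avg_dec:.2f} Msps")
```

Under these assumptions both averages land within 0.01 Msps of the 99.98 Msps and 99.8 Msps figures given in the text. The gap to 100 Msps corresponds to the extra cycles beyond one per symbol.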

5.2. Conclusion

In this thesis, we propose a low-power, low-hardware-cost VLC decoder for dual standards, MPEG-2 and H.264/AVC. Compared to [4], we reduce the hardware cost of the H.264/AVC CAVLD by 30%. The hardware cost of the proposed dual-standard VLD is 7683 gates, and at 100 MHz the power is 1.719 mW for MPEG-2, 1.302 mW for H.264/AVC baseline@3.0 I-frame, and 1.376 mW for H.264/AVC baseline@3.0 P-frame.

Besides, we propose another group-based VLC codec system for dual standards, MPEG-2 and H.264/AVC. Based on the group-based approach, level efficient coding, run_before efficient coding, the proposed symbols construction, and zero-skipping, we design a VLC codec system that can support H.264/AVC main profile @5.0 with 20357 gates at 100 MHz. The throughput improvement of each proposed method is shown in Table 5-4 and Table 5-5. Compared to the conventional group-based VLC codec system, the proposed design reduces memory usage by 16%.

proposed method               I-frame   P-frame   frame
level efficient coding        40%       17%       29%
run_before efficient coding   4%        12%       8%
symbols construction          14%       12%       13%
zero skipping                 4%        5%        4%

Table 5-4 : Throughput improvement of each proposed method, foreman QP = 10

proposed method               I-frame   P-frame   frame
level efficient coding        27%       5%        22%
run_before efficient coding   6%        12%       7.5%
symbols construction          14%       5%        12%
zero skipping                 4%        3%        4%

Table 5-5 : Throughput improvement of each proposed method, mobile QP = 28

5.3. Future Work

Hardware cost is a problem for the proposed group-based VLC codec design, because the hardware cost is not efficient enough for the throughput it achieves. Therefore, hardware cost reduction is one target for future effort. Besides, power has always been a problem of group-based designs, so reducing the power consumption of the proposed group-based VLC codec design is another. Perhaps this problem can be addressed with a memory hierarchy, because the codewords of the VLC tables reflect the occurrence probabilities of the symbols.

On the other hand, mobile devices are now in general use. In wireless communication, receiving an erroneous bitstream because of noise is a serious problem. It results in erroneously decoded blocks, and the decoded picture may show mosaic artifacts. Therefore, developing error resilience approaches is very important.

Reference

[1] T. C. Chen, Y. W. Huang, C. Y. Tsai, B. Y. Hsieh, and L. G. Chen, “Dual-block-pipelined VLSI architecture of entropy coding for H.264/AVC baseline profile”, in Proc. International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 271-274, 2005.

[2] C. D. Chien, K. P. Lu, Y. H. Shih, and J. I. Guo, “A High Performance CAVLC Encoder Design for MPEG-4 AVC/H.264 Video Coding Applications”, in Proc. ISCAS, 2006.

[3] Wu Di, Gao Wen, Hu Mingzeng, and Ji Zhenzhou, “A VLSI Architecture Design of CAVLC Decoder”, in Proc. 5th International Conference on ASIC, Vol. 2, pp. 962-965, 21-24 Oct. 2003.

[4] H. C. Chang, C. C. Lin, and J. I. Guo, “A Novel Low-Cost High-Performance VLSI Architecture for MPEG AVC/H.264 CAVLC Decoding”, in Proc. ISCAS, pp. 6110-6113, 2005.

[5] K. S. Yu and T. S. Chang, “A Zero-Skipping Multi-symbol CAVLC Decoder for MPEG-4 AVC/H.264”, in Proc. ISCAS, 2006.

[6] B. J. Shieh, Y. S. Lee, and C. Y. Lee, “A New Approach of Group-Based VLC Codec System”, in Proc. ISCAS, Vol. 4, pp. 609-612, 28-31 May 2000.

[7] B. J. Shieh, Y. S. Lee, and C. Y. Lee, “A New Approach of Group-Based VLC Codec System with Full Table Programmability”, in Proc. ISCAS, Vol. 2, pp. 210-221, Feb 2001.

[8] T. M. Liu, T. A. Lin, S. Z. Wang, W. P. Lee, K. C. Hou, J. Y. Yang, and C. Y. Lee, “An 865-uW H.264/AVC Video Decoder for Mobile Applications”, in Proc. ASSCC, 2005.

[9] T. M. Liu, T. A. Lin, S. Z. Wang, W. P. Lee, K. C. Hou, J. Y. Yang, and C. Y. Lee, “A 125-uW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications”, in Proc. ISSCC, 2006.

[10] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.

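Vita

Name: 楊俊彥

Place of birth: Kaohsiung City, Taiwan

Date of birth: May 29, 1982

Education:

Sept. 2004 ~ July 2006: Master's program, Systems Group, Institute of Electronics, National Chiao Tung University

Sept. 2000 ~ July 2004: Department of Electronics Engineering, National Chiao Tung University

Sept. 1997 ~ June 2000: Kaohsiung Municipal Kaohsiung Senior High School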

