
Performance Assessments

6.4 Chapter Summary

Combining the dynamic boundary prediction scheme with the proposed low-power, high-performance methodology, our work achieves substantially better power and rate-distortion performance than prior studies. As shown in Figure 6.2 through Table 6.8, the core dissipates 621.8 µW on average, of which 198.8 µW is dynamic power, and in the best case (stefan) it reduces the bitrate by more than half at equal frame PSNR compared with the prior hardware-oriented full search.

Table 6.6: Power and rate-distortion assessments for the proposed memory architecture (CIF, 30 fps, QP = {36, 42})

Scene                           Static    Static    Static    Active    Active    Active
Sequence                        akiyo     M&D       foreman   foreman   mobile    stefan    Average
Frames                          1–30      1–30      221–250   191–220   1–30      221–250
Motion                          local     local     global    extreme   global    extreme

Core: Total (µW)                483.0     493.7     615.7     680.6     601.9     856.2     621.8
Core: Dyn. (µW)                 60.0      70.7      192.7     257.6     178.9     433.2     198.8

Ref. buffer: Total (µW)         62.7      63.6      78.8      88.1      73.2      104.2     78.4
Ref. buffer: Dyn. (µW)          2.7       3.6       18.8      28.1      13.2      44.2      18.4
Ref. buffer: wrt. Total (core)  13.0%     12.9%     12.8%     12.9%     12.2%     12.2%     12.6%
Ref. buffer: wrt. Dyn. (core)   4.5%      5.0%      8.7%      10.9%     7.4%      10.2%     7.8%

SAP: Total (µW)                 280.0     285.0     310.9     378.0     340.7     502.4     349.5
SAP: wrt. Total (core)          58.0%     57.7%     51.4%     55.5%     56.6%     58.7%     56.3%

Clock: Dyn. (µW)                33.25     36.97     73.71     94.94     63.41     112.60    69.1
Clock: wrt. Dyn. (core)         55.6%     52.4%     39.5%     36.9%     35.5%     26.1%     41.0%

Comp. Red.: ME skipped          70.3%     71.0%     33.1%     27.0%     12.9%     26.3%     40.1%
Comp. Red.: BWmem               81.1%     80.4%     76.7%     57.8%     77.7%     59.1%     72.1%
Comp. Red.: Op. rate            97.6%     97.3%     95.3%     84.5%     96.0%     85.9%     92.8%

RD: ∆PSNR (dB)                  0.076     0.065     0.054     -0.145    0.011     0.155     0.029
RD: ∆Rate (%)                   -1.76%    -0.15%    -8.86%    -32.52%   -2.58%    -52.21%   -17.35%
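The derived rows of Table 6.6 can be cross-checked with a few lines of arithmetic. The sketch below is only a reading aid. It assumes that the "Average" column is the plain mean over the six sequences and that "wrt. Total (core)" is the reference-buffer total divided by the core total. The "non-dynamic" quantity (Total minus Dyn.) is a name introduced here for illustration and does not appear in the table.

```python
# Per-sequence values transcribed from Table 6.6 (CIF, 30 fps, QP = {36, 42}).
# Order: akiyo, M&D, foreman 221-250, foreman 191-220, mobile, stefan.
core_total_uw = [483.0, 493.7, 615.7, 680.6, 601.9, 856.2]
core_dyn_uw = [60.0, 70.7, 192.7, 257.6, 178.9, 433.2]
ref_total_uw = [62.7, 63.6, 78.8, 88.1, 73.2, 104.2]


def mean(values):
    """Plain arithmetic mean (assumed to be how the Average column is formed)."""
    return sum(values) / len(values)


avg_core_total = mean(core_total_uw)
avg_core_dyn = mean(core_dyn_uw)

# Hypothetical derived quantity: the part of core power that is not dynamic.
non_dynamic_uw = [t - d for t, d in zip(core_total_uw, core_dyn_uw)]

# Reference-buffer share of the total core power, in percent.
ref_share_pct = [100.0 * r / t for r, t in zip(ref_total_uw, core_total_uw)]

print(f"average core total: {avg_core_total:.2f} uW")
print(f"average core dynamic: {avg_core_dyn:.2f} uW")
print("non-dynamic power per sequence:", [round(x, 1) for x in non_dynamic_uw])
print("ref. buffer share per sequence:", [round(x, 1) for x in ref_share_pct])
```

Under these assumptions the means agree with the tabulated 621.8 µW and 198.8 µW to within rounding, and the difference between total and dynamic power comes out the same for every sequence. That suggests the variation between sequences in this table is entirely in the dynamic component.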

Master Thesis, National Chiao Tung University

Sequence              QP    Core power                 Ref. buffer       SAP array         Clock tree        SKIP     BWmem             Arithmetic Op.    Rate-distortion
                            Total    Dyn.     D/T      Total    wrt.     Total    wrt.     Total    wrt.     ratio    Avg.     Saving   Avg.     Saving   ∆Rate    ∆PSNR
                            (µW)     (µW)     (%)      (µW)     Core     (µW)     Core     (µW)     Core     (%)      (SA)     (%)      (SP)     (%)      (%)      (dB)

Static scene
Akiyo CIF             36    487.8    64.8     13.3%    63.1     12.9%    282.0    57.8%    35.1     54.2%    65.3%    441.0    80.9%    26.9     97.5%    0.62%    0.062
                      42    478.2    55.2     11.5%    62.3     13.0%    278.0    58.1%    31.4     56.9%    75.3%    429.9    81.3%    24.4     97.8%    -4.14%   0.089
                      Avg.  483.0    60.0     12.4%    62.7     13.0%    280.0    58.0%    33.2     55.6%    70.3%    435.5    81.1%    25.6     97.6%    -1.76%   0.076
M&D CIF               36    500.2    77.2     15.4%    64.2     12.8%    288.1    57.6%    39.2     50.8%    68.2%    457.9    80.1%    30.9     97.2%    -2.77%   0.075
                      42    487.2    64.2     13.2%    63.0     12.9%    281.8    57.8%    34.7     54.0%    73.7%    445.3    80.7%    27.5     97.5%    2.47%    0.054
                      Avg.  493.7    70.7     14.3%    63.6     12.9%    285.0    57.7%    37.0     52.4%    71.0%    451.6    80.4%    29.2     97.3%    -0.15%   0.065
Foreman CIF 221–250   36    537.8    114.8    21.3%    67.1     12.5%    317.0    58.9%    48.7     42.4%    47.1%    526.5    77.1%    47.4     95.6%    -2.40%   0.051
                      42    693.5    270.5    39.0%    90.5     13.0%    304.8    44.0%    98.7     36.5%    19.1%    549.2    76.2%    53.9     95.1%    -15.32%  0.058
                      Avg.  615.7    192.7    30.2%    78.8     12.8%    310.9    51.4%    73.7     39.5%    33.1%    537.9    76.7%    50.6     95.3%    -8.86%   0.054

Active scene
Foreman CIF 191–220   36    693.5    270.5    39.0%    90.5     13.0%    384.4    55.4%    98.7     36.5%    19.1%    980.9    57.4%    175.6    83.9%    -34.31%  -0.224
                      42    667.7    244.7    36.6%    85.7     12.8%    371.6    55.7%    91.2     37.2%    34.8%    964.9    58.1%    162.7    85.1%    -30.73%  -0.065
                      Avg.  680.6    257.6    37.8%    88.1     12.9%    378.0    55.5%    94.9     36.9%    27.0%    972.9    57.8%    169.1    84.5%    -32.52%  -0.145
Mobile CIF            36    601.4    178.4    29.7%    73.3     12.2%    340.3    56.6%    63.6     35.6%    9.1%     513.6    77.7%    44.0     96.0%    -1.33%   -0.003
                      42    602.3    179.3    29.8%    73.1     12.1%    341.1    56.6%    63.2     35.3%    16.6%    513.8    77.7%    43.7     96.0%    -3.82%   0.024
                      Avg.  601.9    178.9    29.7%    73.2     12.2%    340.7    56.6%    63.4     35.5%    12.9%    513.7    77.7%    43.9     96.0%    -2.58%   0.011
Stefan CIF            36    896.4    473.4    52.8%    110.1    12.3%    527.5    58.8%    120.0    25.3%    21.0%    963.3    58.2%    164.6    84.9%    -53.56%  0.131
                      42    815.9    392.9    48.2%    98.4     12.1%    477.2    58.5%    105.2    26.8%    31.6%    921.5    60.0%    142.6    86.9%    -50.86%  0.178
                      Avg.  856.2    433.2    50.5%    104.2    12.2%    502.4    58.7%    112.6    26.1%    26.3%    942.4    59.1%    153.6    85.9%    -52.21%  0.155

Average                     621.8    198.8    32.0%    78.4     12.6%    349.5    56.3%    69.1     41.0%    40.1%    642.3    72.1%    78.7     92.8%    -16.35%  0.019


                      NTU, Huang et al.   NTU, Huang et al.   Miyama et al.       NTU, Lin et al.     NTU, Chen et al.      This work
                      ISCAS’03 [46]       TCSVT’04 [65]       JSSC’04 [32]        ISCAS’04 [33,45]    TCSVT’07 [31,37,47]

Processing Capability 720 × 480@30fps     CIF@30fps           CIF@30fps           CIF@30fps           CIF@30fps             CIF@30fps
Resolution            Integer-pel         Integer-pel         Half-pel            Integer-pel         Integer-pel           Integer-pel
Block Size            16 × 16             16 × 16             16 × 16, 8 × 8      16 × 16             16 × 16 ∼ 4 × 4       16 × 16 ∼ 8 × 8
Algorithm             Full search         Global              Gradient            Fast full search    Fast full search      Fast full search
                                          Elimination         Descent             (4SS)               (4SS-based)
Search Center         (0, 0)              (0, 0)              (0, 0)              (0, 0)              (0, 0)                Adaptive
Technology            0.35 µm             0.35 µm             0.18 µm             0.18 µm             0.18 µm               0.13 µm
Voltage               3.3 V               3.3 V               1.2 V               1.8 V               1.3 V                 1.2 V
Core Area             25.56 mm²           27.8 mm²            13.65 mm²           2.2 mm²             3.61 mm²              0.69 mm²
Internal Memory       24.6 Kbits          24.1 Kbits          40 Kbits            24.1 Kbits          64 Kbits              11.5 Kbits
Eqv. Gates (2-NAND)   106 Kgates          115 Kgates          1000 Kgates         137 Kgates          300 Kgates            96 Kgates
Clock Rate            66.67 MHz           27.8 MHz            13.5 MHz            48.67 MHz           13.5 MHz              20 MHz
Power Consumption     737.32 mW           189 mW              12 mW               8.46 mW             4.38 mW               0.62 mW
Dynamic Power         N/A                 N/A                 N/A                 N/A                 N/A                   0.20 mW


Conclusions

To present a design methodology for portable video codecs, this thesis thoroughly examines the design tradeoffs in motion estimation among rate-distortion performance, complexity, and power. Three advanced power-efficiency techniques are explored to meet the low-power, low-cost, and high rate-distortion requirements of advanced motion estimation.

A false-rate-constrained MB-skipping detection based on a likelihood ratio test successfully exploits the computational dependence and effectively removes much of the complexity burden. Prior to motion estimation, the proposed method decides whether the current macroblock should be coded as SKIP. Depending on the motion activity, it achieves a detection probability of 17%–87% at a false rate of 1%.
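The idea of fixing the false rate first and then reading off the detection probability can be sketched as follows. This is a generic illustration, not the detector of this thesis. It assumes each macroblock already has a scalar score (for example a log-likelihood ratio) where larger means "more likely SKIP". The function names and the toy scores are hypothetical.

```python
def threshold_for_false_rate(non_skip_scores, max_false_rate):
    """Return a threshold such that at most max_false_rate of the non-SKIP
    training scores lie strictly above it (SKIP is declared when score > threshold)."""
    if not non_skip_scores or not 0.0 <= max_false_rate < 1.0:
        raise ValueError("need scores and a false rate in [0, 1)")
    ordered = sorted(non_skip_scores)
    # Number of non-SKIP macroblocks we tolerate being misdetected as SKIP.
    allowed = int(max_false_rate * len(ordered))
    # Only the 'allowed' largest scores may exceed the threshold.
    return ordered[len(ordered) - allowed - 1]


def rate_above(scores, threshold):
    """Fraction of scores that would be declared SKIP."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Toy scores, invented purely for illustration.
non_skip_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
skip_scores = [7, 9, 11, 12]

thr = threshold_for_false_rate(non_skip_scores, max_false_rate=0.2)
print("threshold:", thr)
print("false rate:", rate_above(non_skip_scores, thr))
print("detection probability:", rate_above(skip_scores, thr))
```

With the false rate held fixed, the detection probability depends only on how well the SKIP and non-SKIP score distributions separate. This is one way to read the dependence on motion activity reported above.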

Because it performs advanced prediction with intensive arithmetic, motion estimation dominates both computation and memory access complexity. To relieve the power demand, prior studies strictly exploited hardware characteristics at the cost of excessively compromising rate-distortion performance. In addition, measuring power at a high level of abstraction, without regard to the architecture, is generally insufficient. To address this, a biased block-matching scheme using dynamic boundaries is proposed, which significantly improves both arithmetic efficiency and motion-compensated fidelity. Simulation results show that our method saves up to 52.6% of the coding bitrate and eliminates more than 90% of the arithmetic operations.
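To make the link between search boundaries and arithmetic amount concrete, the sketch below runs a plain SAD (sum of absolute differences) block search inside a caller-supplied window and counts the candidates it evaluates. It is a minimal illustration under stated assumptions. The window bounds are simply passed in, the frame contents are synthetic, and neither the boundary prediction nor the biased matching cost of this thesis is reproduced.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def block_search(cur, ref, bx, by, n, bounds):
    """Exhaustive search inside bounds = (dx_min, dx_max, dy_min, dy_max), inclusive.
    Returns ((cost, dx, dy), number_of_candidates_evaluated)."""
    dx_min, dx_max, dy_min, dy_max = bounds
    height, width = len(ref), len(ref[0])
    best, evaluated = None, 0
    for dy in range(dy_min, dy_max + 1):
        for dx in range(dx_min, dx_max + 1):
            # Skip candidates that fall outside the reference frame.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + n > width or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, evaluated


# Synthetic 8 x 8 reference frame; the current frame is the reference
# displaced so that the true motion vector is (dx, dy) = (-1, 1).
ref = [[x * x + 3 * y * y + x * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x - 1] if (y + 1 < 8 and x - 1 >= 0) else 0
        for x in range(8)] for y in range(8)]

full_best, full_count = block_search(cur, ref, 2, 2, 2, (-2, 2, -2, 2))
tight_best, tight_count = block_search(cur, ref, 2, 2, 2, (-1, 0, 0, 1))
print(full_best, full_count)
print(tight_best, tight_count)
print("candidates saved:", 1 - tight_count / full_count)
```

In this toy case both windows find the same motion vector, but the tighter one evaluates far fewer candidates. The saving only holds when the predicted boundaries still contain the true motion, which is why boundary prediction quality affects both arithmetic amount and rate-distortion performance.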

In reference buffer optimization, we give a theoretical perspective on optimizing the memory structure using a technology-dependent, high-level power analysis methodology.

By applying this optimization methodology, the power dissipation of the internal memory (register file) is reduced to about 12% of the total core power and only 7.8% of the dynamic power.

Moreover, three novel key techniques are first presented in this thesis: 1) a low-switching AD (absolute difference) logic (Section 4.2), 2) a shortest-distance bus coding scheme (Section 4.4), and 3) a low-cost accumulator-structured MV cost computation logic (Section 4.5). These arithmetic simplifications and designs also benefit our work.

With this design methodology applied in the implementation, the proposed design achieves substantially better power and rate-distortion performance than prior designs: an average power dissipation of 621.8 µW, of which 198.8 µW is dynamic, and, in the best case, a bitrate reduction of more than half at equal frame PSNR compared with the prior hardware-oriented full search. In brief, the present work not only closes the gap in prior designs between rate-distortion fidelity and power demand, but also provides a design guideline for low-power video motion estimation, and it encourages progress toward a real-time, low-cost, and high-performance video codec.

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Recommendation H.264 and ISO/IEC 14496-10 AVC, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Mar. 2003.

[2] Codec for Videoconferencing Using Primary Digital Group Transmission, ITU-T Recommendation H.120, ITU-T (formerly CCITT), version 1, 1984; version 2, 1988.

[3] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, “Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, Jun. 2006.

[4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[5] R.-L. Lin, “Method for determining to skip macroblocks in encoding video,” U.S. Patent 6,192,148, Feb. 20, 2001.

[6] B. A. Hall et al., “Macroblock coding technique with biasing towards skip macroblock coding,” U.S. Patent 6,993,078, Jan. 31, 2006.

[7] C. S. Kannangara et al., “Low-complexity skip prediction for H.264 through Lagrangian cost estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 202–208, Feb. 2006.

[8] Y. V. Ivanov and C. J. Bleakley, “Skip prediction and early termination for fast mode decision in H.264/AVC,” in Proc. ICDT’06, 2006.

[9] Y. Zhao, M. Bystrom, and I. Richardson, “A MAP framework for efficient skip/code mode decision in H.264,” in Proc. IEEE ICIP’06, Oct. 2006, pp. 45–48.

[10] JM web site. [Online]. Available: http://iphome.hhi.de/suehring/tml/

[11] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74–90, Nov. 1998.

[12] JM reference software version 8.6. [Online]. Available: http://iphome.hhi.de/suehring/tml/

[13] J.-F. Yang, S.-C. Chang, and C.-Y. Chen, “Computation reduction for motion search in low rate video coders,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 10, pp. 948–951, Oct. 2002.

[14] Y. Cheng, K. Dai, Z. Wang, and J. Lu, “Motion search method based on zero-block detection in H.264/AVC,” in Proc. Int. Conf. on Computer Supported Cooperative Work in Design (CACWD’04), vol. 2, May 26–28, 2004, pp. 739–743.

[15] Y. H. Moon, G. Y. Kim, and J. H. Kim, “An improved early detection algorithm for all-zero blocks in H.264 video encoding,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp. 1053–1057, Aug. 2005.

[16] H. Wang, S. Kwong, and C.-W. Kok, “Effectively detecting all-zero DCT blocks for H.264 optimization,” in Proc. IEEE ICIP’06, Oct. 2006, pp. 1329–1332.

[17] H. L. V. Trees, Detection, Estimation, and Modulation Theory. New York, US: John Wiley & Sons, 2001.

[18] A. Raghunathan, N. K. Jha, and S. Dey, High-Level Power Analysis and Optimization. Norwell, US: Kluwer Academic Publishers, 1998.

[19] F. N. Najm, “Transition density: a new measure of activity in digital circuits,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 12, no. 2, pp. 310–323, Feb. 1993.

[20] A. M. Tekalp, Digital Video Processing. New Jersey, US: Prentice Hall, 1995.

[21] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Norwell, US: Kluwer Academic Publishers, 1999.

[22] Video Coding for Low Bitrate Communication, DRAFT ITU-T Recommendation H.263, ITU-T, May 2, 1996.

[23] A. Puri, X. Chen, and A. Luthra, “Video coding using the H.264/MPEG-4 AVC compression standard,” Signal Processing: Image Communication, vol. 19, no. 9, pp. 793–849, Oct. 2004.

[24] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, “Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 182–190, Apr. 1996.

[25] G. V. Reklaitis, A. Ravindran, and K. M. Ragsdell, Engineering Optimization: Methods and Applications. New York, US: John Wiley & Sons, 1983.

[26] Y. Shoham and A. Gersho, “Efficient bit allocation for an arbitrary set of quantizers,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 9, pp. 1445–1453, Sep. 1988.

[27] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, vol. 53, no. 2, pp. 578–593, Feb. 2006.

[28] L. Fanucci, L. Bertini, and S. Saponara, “Programmable and low power VLSI architecture for full search motion estimation in multimedia communications,” in Proc. ICME’00, vol. 3, Jul. 30–Aug. 2, 2000, pp. 1395–1398.

[29] M. Sayed, I. Amer, and W. Badawy, “Towards an H.264/AVC full encoder on chip: An efficient real-time VBSME ASIC chip,” in Proc. IEEE ISCAS’06, vol. 2, May 21–24, 2003, pp. 2613–2616.

[30] C.-M. Ou, C.-F. Le, and W.-J. Hwang, “An efficient VLSI architecture for H.264 variable block size motion estimation,” IEEE Trans. Consum. Electron., vol. 51, no. 4, pp. 1291–1299, Nov. 2005.

[31] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, “Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 568–577, May 2007.

[32] M. Miyama, J. Miyakoshi, Y. Kuroda, K. Imamura, H. Hashimoto, and M. Yoshimoto, “A sub-mW MPEG-4 motion estimation processor core for mobile video application,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1562–1570, Sep. 2004.

[33] S.-S. Lin, P.-C. Tseng, and L.-G. Chen, “Low-power parallel tree architecture for full search block-matching motion estimation,” in Proc. IEEE ISCAS’04, vol. 2, May 23–26, 2004, pp. 313–316.

[34] C.-Y. Chen, Y.-W. Huang, C.-L. Lee, and L.-G. Chen, “One-pass computation-aware motion estimation with adaptive search strategy,” IEEE Trans. Multimedia, vol. 8, no. 4, pp. 698–706, Aug. 2006.

[35] T. Yamada, M. Ikekawa, and I. Kuroda, “Fast and accurate motion estimation algorithm by adaptive search range and shape selection,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP’05), vol. 2, Mar. 18–23, 2005, pp. 897–900.

[36] S. Li, Y. Jiang, T. Ikenaga, and S. Goto, “Content-based motion estimation with extended temporal-spatial analysis,” IEICE Trans. on Info. and Syst., vol. E88-D, no. 7, pp. 1561–1568, Jul. 2005.

[37] Y.-H. Chen, T.-C. Chen, and L.-G. Chen, “Hardware oriented content-adaptive fast algorithm for variable block-size integer motion estimation in H.264,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP’05), Dec. 13–16, 2005, pp. 341–344.

[38] P. E. Landman and J. M. Rabaey, “Power estimation for high level synthesis,” in Proc. IEEE EDAC’93, Feb. 22–25, 1993, pp. 361–366.

[39] Z.-L. He, C.-Y. Tsui, K.-K. Chan, and M. L. Liou, “Low-power VLSI design for motion estimation using adaptive pixel truncation,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 5, pp. 669–678, Aug. 2000.

[40] V. G. Moshnyaga, “Reducing switching activity of subtraction via variable truncation of the most-significant bits,” J. VLSI Signal Process. Syst., vol. 33, no. 1, pp. 75–82, 2003.

[41] S. Y. Kung, VLSI Array Processors. New Jersey, US: Prentice Hall, 1988.

[42] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York, US: John Wiley & Sons, 1999.

[43] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-time Signal Processing, 2nd ed. New Jersey, US: Prentice Hall, 1999.

[44] C.-P. Lin, P.-C. Tseng, Y.-T. Chiu, S.-S. Lin, C.-C. Cheng, H.-C. Fang, W.-M. Chao, and L.-G. Chen, “A 5mW MPEG4 SP encoder with 2D bandwidth-sharing motion estimation for mobile applications,” in Proc. IEEE ISSCC’06, vol. 2, Feb. 6–9, 2006, pp. 1626–1635.

[45] S.-S. Lin, “Low-power motion estimation processors for mobile video application,” Master’s thesis, National Taiwan University, Taipei, Taiwan, 2004.

[46] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh, and L.-G. Chen, “Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264,” in Proc. IEEE ISCAS’03, vol. 2, May 25–28, 2003, pp. 796–799.

[47] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, and L.-G. Chen, “Architecture design of low power integer motion estimation for H.264/AVC,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP’06), vol. 3, 2006, pp. 900–903.

[48] S. Vassiliadis et al., “The sum-absolute-difference motion estimation accelerator,” in Proc. IEEE Euromicro Conf., vol. 2, Vasteras, Sweden, Aug. 25–27, 1998, pp. 559–566.

[49] D. F. McManigal, “Absolute difference generator for use in display system,” U.S. Patent 4,218,751, Aug. 19, 1980.

[50] T. Kanoh, “Absolute value calculating circuit having a single adder,” U.S. Patent 4,953,115, Aug. 28, 1990.

[51] J. M. Dodson and C. T. Cheng, “Low-power area-efficient absolute value arithmetic unit,” U.S. Patent 5,251,164, Oct. 5, 1993.

[52] T. Komarek and P. Pirsch, “Array architectures for block matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 36, no. 10, pp. 1301–1308, Oct. 1989.

[53] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, Jan. 2002.

[54] H. Bhatnagar, Advanced ASIC Chip Synthesis: Using Synopsys Design Compiler Physical Compiler and PrimeTime, 2nd ed. Norwell, US: Kluwer Academic Publishers, 2002.

[55] Artisan Standard Library Register File Generator User Manual, Artisan Components, 2003Q4, release 2.2.

[56] W.-T. Hsien et al., “An efficient power modeling approach for embedded memory using LIB format,” in Proc. IEEE VLSI-TSA-DAT, Apr. 27–29, 2005, pp. 55–58.

[57] C.-L. Su, C.-Y. Tsui, and A. M. Despain, “Saving power in the control path of embedded processors,” IEEE Des. Test. Comput., vol. 11, no. 4, pp. 24–31, 1994 Winter.

[58] C.-H. Lin, C.-M. Chen, and C.-W. Jen, “Low power design for MPEG-2 video decoder,” in Proc. IEEE TCE’96, vol. 42, no. 3, 1996, pp. 513–520.

[59] J. P. Hayes, Computer Architecture and Organization, 3rd ed. New York, US: McGraw-Hill, 1998.

[60] S. Yalcin, H. F. Ates, and I. Hamzaoglu, “A high performance hardware architecture for an SAD reuse based hierarchical motion estimation algorithm for H.264 video coding,” in Proc. Int. Conf. on Field Programmable Logic and Appl. (FPL’06), Aug. 24–26, 2005, pp. 509–514.

[61] L. Zhang and W. Gao, “Improved FFSBM algorithm and its VLSI architecture for variable block size motion estimation of H.264,” in Proc. Int. Symp. on Intell. Signal Process. and Commun. Syst. (ISPACS’05), Hong Kong, Dec. 13–16, 2005, pp. 445–448.

[62] Physical Compiler User Guide, SYNOPSYS, Inc., Jun. 2003, version U-2003.06.

[63] PrimePower Manual, SYNOPSYS, Inc., Dec. 2004, version W-2004.12.

[64] Power Compiler User Guide, SYNOPSYS, Inc., Jan. 2005, release W-2004.12.

[65] Y.-W. Huang, S.-Y. Chien, B.-Y. Hsieh, and L.-G. Chen, “Global elimination algorithm and architecture design for fast block matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 6, pp. 898–907, Jun. 2004.
