Third Stage - Proposed Reconfigurable Architecture

Chapter 3 Proposed Reconfigurable Compression Algorithm and Architecture

3.2 Proposed Reconfigurable Architecture

3.2.3 Third Stage

The third stage contains the two-plane differential combination, compression scheme selection, and packing blocks. Fig. 3.13 illustrates the implementation of the two-plane differential combination, which chooses the desired differentials from two sets of differentials for compression in the two-plane type. Once a tile is classified as the two-plane type, at most the seven lower bits of every differential are saved by the packing block. Therefore, the two-plane differential combination operates on only the seven lower bits of each differential, which reduces hardware resources and power consumption.
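The selection described above can be modelled in software as follows. This is a minimal sketch, not the RTL: the function names and the per-position boolean list `select_second` (a stand-in for the break-point information) are hypothetical, and only the truncation to the seven lower bits is taken from the text.

```python
MASK7 = 0x7F  # keep only the seven lower bits of each differential


def combine_two_plane(diffs_a, diffs_b, select_second):
    """Pick, per position, a differential from set A or set B.

    diffs_a, diffs_b: equal-length lists of signed integers.
    select_second: list of booleans (hypothetical stand-in for the
                   break-point map); True selects from diffs_b.
    Returns 7-bit two's-complement fields in the range 0..127.
    """
    if not (len(diffs_a) == len(diffs_b) == len(select_second)):
        raise ValueError("all inputs must have the same length")
    return [(b if sel else a) & MASK7
            for a, b, sel in zip(diffs_a, diffs_b, select_second)]


def sign_extend7(field):
    """Recover the signed value from a 7-bit two's-complement field."""
    return field - 128 if field & 0x40 else field


combined = combine_two_plane([3, -2, 0, 1], [-1, 5, -64, 63],
                             [False, True, True, False])
print([sign_extend7(f) for f in combined])  # [3, 5, -64, 1]
```

Because every differential of a two-plane tile fits in seven bits, truncating before the selection loses no information, which is why narrower multiplexers suffice.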

Fig. 3.14 illustrates the implementation of compression scheme selection.

According to the range of the differentials, this block chooses an adequate bit length for storing them. If a tile applies the type-1 HA scheme, the constant one is added to each 2^nd order differential and, at the same time, the constant one is subtracted from the 1^st order differential. Instead of an adder circuit, we use inverters at the end of the packing block. When a tile applies the type-1 HA scheme, only the least significant bit of each differential is selected for packing, i.e. only one bit is used for compression. Since these differentials are elements of the set {-1,0}, and adding the constant one maps them to elements of the set {0,1}, the least significant bit of each differential simply changes from 1 to 0 or from 0 to 1. Finally, the packing block packs the necessary information, for example the control-code, for a compressed or uncompressed tile. At the next cycle, the signal out_valid is pulled high to indicate that an output is available.
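The equivalence between adding the constant one and inverting the least significant bit can be checked with a few lines of code. This is an illustrative sketch; the helper names `lsb` and `pack_type1_bit` are hypothetical and do not correspond to signals in the design.

```python
def lsb(value):
    """Least significant bit of a two's-complement integer."""
    return value & 1


def pack_type1_bit(diff):
    """Bit stored for a type-1 HA differential: the inverted LSB, no adder."""
    if diff not in (-1, 0):
        raise ValueError("type-1 HA differentials are expected to be -1 or 0")
    return lsb(diff) ^ 1


# For both possible inputs, inverting the LSB gives the same bit as adding one.
for d in (-1, 0):
    assert pack_type1_bit(d) == lsb(d + 1) == d + 1
```

The check covers both members of {-1,0}, so a single inverter per differential reproduces the result of the addition.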

Fig. 3.13. Block diagram of two-plane differential combination.

Fig. 3.14. Block diagram of the compression scheme selection.

Different packing formats are obtained according to the compression/uncompression mode and the one-/two-plane type. The most significant bit, the flag, indicates in each mode whether a tile is compressed. In uncompression mode, the remaining bits are the original depth values. In the one-plane type, apart from the flag, the control-code indication bits, and the reference point, the first part of the remaining bits, △z (V) and △²z (V), belongs to the vertical part of the tile; △z (H) and △²z (H) belong to the horizontal part. In the two-plane type, the first part of the remaining bits, excluding the flag, control-code, 1^st reference, and 2^nd reference bits, belongs to the vertical part, and the second part belongs to the horizontal part. In addition, the break-point is included in the control-code in the two-plane type. The clock gating technique is applied at this stage as well. Table 3.3 summarizes the number of clock cycles needed for each compression/uncompression mode.

(a) Uncompression mode: Flag | Original depth values

(b) One-plane type: Flag | Control-code | 1^st Ref. | ∆z (V) | ∆²z (V) | ∆z (H) | ∆²z (H)

(c) Two-plane type: Flag | Control-code | 1^st Ref. | 2^nd Ref. | ∆z (V) | ∆²z (V) | ∆z (H) | ∆²z (H)

Fig. 3.15. Packing format.
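To make the packing format concrete, the following sketch concatenates fields MSB-first, starting with the flag. It is a software illustration under assumptions: the helper names, the flag polarity (0 for an uncompressed tile), and the 8x8 tile of 16-bit depth values used in the example are choices made here, not details of the packing block itself.

```python
def pack_fields(fields):
    """Concatenate (value, width) fields MSB-first into one integer.

    Returns (packed_value, total_bits). Each value is masked to its width.
    """
    packed, total = 0, 0
    for value, width in fields:
        packed = (packed << width) | (value & ((1 << width) - 1))
        total += width
    return packed, total


def pack_uncompressed(depths, flag=0):
    """Uncompression mode: a 1-bit flag followed by 16-bit depth values.

    The flag polarity (0 = uncompressed) is an assumption of this sketch.
    """
    return pack_fields([(flag, 1)] + [(z, 16) for z in depths])


word, nbits = pack_uncompressed([0xFFFF] * 64)
print(nbits)  # 1025
```

For a tile of 64 depth values, the flag plus the original data occupies 1 + 64 x 16 = 1025 bits.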

When a tile is input, the first step is to compute the differentials. Because nothing is known about the tile yet, i.e. we do not know which plane type it belongs to, we set the upper-left pixel as the default reference point. In each cycle, half a set of differentials is computed, so a whole set takes two cycles. After the differentials are computed, the corresponding break-point map is checked to classify the tile into uncompression mode, one-plane type, or two-plane type.

In uncompression mode, a tile is checked twice, with two sets of differentials computed from two reference points, the upper-left and lower-left pixels. A tile that is definitively classified into uncompression mode then bypasses the two-plane differential combination block and passes through the packing block only.

In the one-plane type, the break-point map is checked after the 1^st set of differentials is computed from the upper-left pixel. The tile then bypasses the two-plane differential combination and passes through the compression scheme selection and packing blocks.

In the two-plane type, excluding falling cases, the 1^st set of differentials is computed from the upper-left pixel, and then the 2^nd set of differentials is computed from the lower-right pixel. The two sets of break points are checked to determine which combination case the tile belongs to, and also to make sure that both sets of differentials are recognized as the same combination case, such as a rising case. After stage 3, the tile is passed through the two-plane differential combination block to combine the two sets of differentials. The break points obtained with the 1^st reference point indicate which differential is chosen in the two-plane differential combination. Finally, the combined tile passes through the compression scheme selection and packing blocks.

In the two-plane type, including falling cases, the 1^st set of differentials, computed from the upper-left pixel, is classified into the uncompression mode. The 2^nd and 3^rd sets of differentials, computed from the lower-left and upper-right pixels, respectively, are classified into the two-plane type, and the combination case is falling. The tile then passes through the two-plane differential combination, compression scheme selection, and packing blocks.
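The routing of a tile through the third-stage blocks, as described in the preceding paragraphs, can be summarized as a small lookup. This is a sketch for illustration only; the mode and block identifiers are hypothetical names chosen here.

```python
# Third-stage blocks each tile class passes through, in order.
# Identifiers are illustrative names, not signal names from the design.
ROUTING = {
    "uncompression": ("packing",),
    "one_plane": ("scheme_selection", "packing"),
    "two_plane": ("two_plane_combination", "scheme_selection", "packing"),
}


def stage3_path(mode):
    """Return the ordered third-stage blocks for a classified tile."""
    try:
        return ROUTING[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None


print(stage3_path("one_plane"))  # ('scheme_selection', 'packing')
```

Both the falling and the non-falling two-plane cases use the same path; they differ only in how many sets of differentials are computed beforehand.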

Table 3.3. Summary of number of clock cycles needed for each compression/uncompression mode.

Mode                                           # clock cycles
One-plane type                                 5
Two-plane type (rising/vertical/horizontal)    9
Two-plane type (falling)                       12
Uncompression mode                             8
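As a worked example, the cycle counts in Table 3.3 can be converted into per-tile processing time at a given clock frequency. This sketch assumes the four values read, in order, as one-plane, two-plane (rising/vertical/horizontal), two-plane (falling), and uncompression mode; the 100 MHz figure is the maximum clock frequency reported in Table 4.3, and the dictionary keys are hypothetical names.

```python
# Cycle counts from Table 3.3 (column order as assumed in the lead-in).
CYCLES = {
    "one_plane": 5,
    "two_plane_rising_vertical_horizontal": 9,
    "two_plane_falling": 12,
    "uncompression": 8,
}


def latency_ns(cycles, freq_mhz):
    """Processing time in nanoseconds for a given cycle count and clock."""
    return cycles * 1e3 / freq_mhz


for mode, cycles in CYCLES.items():
    print(f"{mode}: {latency_ns(cycles, 100):.0f} ns at 100 MHz")
```

At 100 MHz each cycle lasts 10 ns, so a one-plane tile takes 50 ns and a falling two-plane tile takes 120 ns.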

For power efficiency, several power-reduction techniques are applied. Clock gating is used in the proposed architecture. The folded differential computation reduces redundant computation and power consumption. Because a large number of transitions among registers and MUXs results in high power consumption, the data reorder architecture, designed to trade off the number of transitions among registers and MUXs, reschedules the source and destination data of the differential computation. In the two-plane differential combination, only the seven lower bits of every differential are used, because a tile passed to this block has been classified into the two-plane type and each differential is saved in at most 7 bits. This architecture uses fewer MUXs. For the type-1 1-bit HA compression scheme, additions are avoided: 1-bit inverters consume less power and area than 16-bit adders. Furthermore, the proposed architecture reuses hardware within blocks, such as the break-point map generation and the compression scheme selection.

Chapter 4 Simulation Results and Chip Implementation

4.1 Simulation Results

In this section, the 11 compression modes are illustrated, and the total number of compressed bits per tile is listed in Table 4.1. In OP-HA-HA, the vertical and horizontal parts are both compressed by the HA scheme, and the total compressed tile size is 16+7+7+61+6 = 97 bits, comprising one 16-bit reference point, two 7-bit 1^st order differentials, 61 bits of 2^nd order differentials, and a 6-bit control-code. In TP-HA-HA, under the same conditions, the total compressed tile size is 16+16+7+7+7+7+58+6+8 = 132 bits, comprising two reference points, four 7-bit 1^st order differentials, 58 bits of 2^nd order differentials, a 6-bit control-code, and an 8-bit break point. Tile sizes for the other modes can be calculated similarly. With the 7-bit DDPCM scheme, we expect the compressed tile to be smaller than half the size of the original tile.
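The two sums above can be reproduced programmatically. The sketch below assumes an 8x8 tile in which every pixel that is neither a reference point nor a 1^st order differential contributes one 1-bit 2^nd order differential; that reading is inferred from the sums, and the constant and function names are hypothetical.

```python
PIXELS = 64            # 8x8 tile
REF_BITS = 16          # one reference depth value
FIRST_ORDER_BITS = 7   # each 1st-order differential
CONTROL_BITS = 6       # control-code
BREAK_POINT_BITS = 8   # break point (two-plane type only)


def op_ha_ha_bits():
    """One-plane tile, HA scheme in both the vertical and horizontal parts."""
    second_order = PIXELS - 1 - 2      # 61 one-bit 2nd-order differentials
    return REF_BITS + 2 * FIRST_ORDER_BITS + second_order + CONTROL_BITS


def tp_ha_ha_bits():
    """Two-plane tile, HA scheme in both parts."""
    second_order = PIXELS - 2 - 4      # 58 one-bit 2nd-order differentials
    return (2 * REF_BITS + 4 * FIRST_ORDER_BITS + second_order
            + CONTROL_BITS + BREAK_POINT_BITS)


print(op_ha_ha_bits(), tp_ha_ha_bits())  # 97 132
```

Both results match the OP-HA-HA and TP-HA-HA entries of Table 4.1.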

The teapot and stereoscopic polygons benchmarks, shown in Fig. 4.1 (a) and (b), are used as reference simulations. Table 4.2 lists the average compression ratio (CR) for the two benchmarks and compares the proposed algorithm with the 1-bit HA and 2-bit DDPCM schemes, with the 1-bit HA scheme normalized to 100%. For the teapot, the proposed reconfigurable algorithm exceeds the standalone 2-bit DDPCM and 1-bit HA schemes by 27.2 and 13.6 percentage points, respectively. For the stereoscopic polygons, it exceeds the 2-bit DDPCM and 1-bit HA schemes by 33.6 and 21.7 percentage points, respectively.
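The percentages in Table 4.2 can be recomputed from the average CR values. The sketch below uses the table's numbers with hypothetical dictionary keys; it also shows that the 27.2 and 33.6 figures are differences between percentages normalized to the 1-bit HA scheme, whereas the direct ratio against the 2-bit DDPCM scheme gives the 31.6% and 38.1% improvements quoted in Chapter 5.

```python
# Average compression ratios from Table 4.2.
CR = {
    "teapot": {"ha": 1.54, "ddpcm": 1.33, "proposed": 1.75},
    "stereo": {"ha": 1.43, "ddpcm": 1.26, "proposed": 1.74},
}


def normalized(cr, baseline):
    """CR as a percentage of a baseline CR, rounded to one decimal place."""
    return round(100 * cr / baseline, 1)


for scene, cr in CR.items():
    prop = normalized(cr["proposed"], cr["ha"])    # e.g. 113.6 for teapot
    ddpcm = normalized(cr["ddpcm"], cr["ha"])      # e.g. 86.4 for teapot
    gap = round(prop - ddpcm, 1)                   # percentage-point gap
    gain = round(normalized(cr["proposed"], cr["ddpcm"]) - 100, 1)
    print(scene, prop, ddpcm, gap, gain)
```

For the teapot this gives a 27.2-point gap and a 31.6% relative gain over the 2-bit DDPCM scheme; for the stereoscopic polygons it gives 33.6 points and 38.1%.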

Fig. 4.2 and Fig. 4.3 show the sample distribution of the average CR for the teapot benchmark, Fig. 4.1 (a), comparing our proposed algorithm with the 1-bit HA and 2-bit DDPCM schemes, respectively. Fig. 4.4 and Fig. 4.5 show the same comparison for the stereoscopic polygons benchmark, Fig. 4.1 (b). Each point in Fig. 4.2 through Fig. 4.5 is the average compression ratio of five tiles. The plots show that our proposed reconfigurable algorithm achieves a more stable average compression ratio than the 1-bit HA and 2-bit DDPCM schemes.

Table 4.1. Bit width of compressed/uncompressed tile in proposed algorithm.

Mode Name              Number of bits
OP-HA-HA               97
OP-2bDDPCM-HA          103
OP-7bDDPCM-HA          113
OP-7bDDPCM-2bDDPCM     188
OP-7bDDPCM-7bDDPCM     463
TP-HA-HA               132
TP-2bDDPCM-HA          138
TP-7bDDPCM-HA          168
TP-7bDDPCM-2bDDPCM     220
TP-7bDDPCM-7bDDPCM     480
Uncompression          1025

Fig. 4.1 (a) Teapot, and (b) Stereoscopic polygons.

Fig. 4.2. Proposed algorithm vs. the 1-bit HA compression scheme for teapot scenario.

Fig. 4.3. Proposed algorithm vs. the 2-bit DDPCM scheme for teapot scenario.

Fig. 4.4. Proposed algorithm vs. the 1-bit HA compression scheme for stereoscopic polygons scenario.

Fig. 4.5. Proposed algorithm vs. the 2-bit DDPCM scheme for stereoscopic polygons scenario.

Table 4.2. Average compression ratio with 8x8 tile size.

Scheme                     Teapot           Stereoscopic polygons
1-bit HA scheme [21]       1.54 (100%)      1.43 (100%)
2-bit DDPCM scheme [16]    1.33 (86.4%)     1.26 (88.1%)
Proposed algorithm         1.75 (113.6%)    1.74 (121.7%)

4.2 Chip Implementation

For the chip implementation, a cell-based design flow with the Artisan standard cell library is adopted, and the proposed architecture has been implemented in the TSMC 0.18-um CMOS process. Synopsys Design Compiler is used to synthesize the RTL design, Cadence SOC Encounter is used for placement and routing (P&R), and Synopsys PrimePower is used to measure the power consumption of each mode after post-layout simulation. Table 4.3 summarizes the chip characteristics of the proposed architecture.

Table 4.3. Chip characteristics of the proposed architecture.

Active Chip Area                                 1.13 x 1.13 mm²
Gate Count                                       97,246
Max Clock Frequency                              100 MHz
Process Technology                               TSMC 0.18-um CMOS
Power Consumption (mW) @ 100 MHz
  One-Plane Type                                 22.75
  Two-Plane Type (rising/vertical/horizontal)    51.76/56.25/71.9
  Two-Plane Type (falling)                       57.63
  Uncompression Mode                             38.63
Power Consumption (mW) @ 66.7 MHz
  One-Plane Type                                 15.18
  Two-Plane Type (rising/vertical/horizontal)    34.52/37.51/57.26
  Two-Plane Type (falling)                       38.43
  Uncompression Mode                             25.76

Fig. 4.6. Chip layout of the proposed architecture.

Chapter 5 Conclusion and Future Work

In this work, a reconfigurable algorithm for depth buffer compression is presented. The proposed algorithm not only supports the 1-bit HA, 2-bit DDPCM, and 7-bit DDPCM schemes, but also handles both one-plane and two-plane type compression. In addition, different compression schemes can be applied to the vertical and horizontal parts of a tile. In total, 11 compression modes are applied adaptively according to the 3D scene. In the two-plane type, the presented algorithm considers four kinds of combination cases: rising, vertical, horizontal, and falling.

For an 8x8 tile size with 16-bit depth values under the teapot benchmark, the proposed reconfigurable algorithm achieves an average CR of 1.75, an improvement of 13.6% and 31.6% over the HA and DDPCM compression methods, respectively. For the same tile size and depth precision under the stereoscopic polygons benchmark, it achieves an average CR of 1.74, an improvement of 21.7% and 38.1% over the HA and DDPCM compression methods, respectively.

Furthermore, the proposed reconfigurable and power-efficient depth buffer compression architecture has been verified and implemented in the TSMC 0.18-um CMOS process. The core has a gate count of 97,246, and its area is 1.13 x 1.13 mm². It operates at 100 MHz with a maximum power consumption of 38.63 mW in uncompression mode, 22.75 mW in one-plane type, 51.76/56.25/71.9 mW in two-plane type for the rising, vertical, and horizontal cases, and 57.63 mW in two-plane type for the falling case, at a supply voltage of 1.8 V.

In future work, the ranges of the horizontal and vertical parts will be investigated to improve compression performance.

Bibliography

[1] DVB Multimedia Home Platform (MHP) Specification 1.1, TS 102 812, Nov. 2001.

[2] T. Heinonen, A. Lahtinen and V. Hakkinen, “Implementation of three-dimensional EEG brain mapping,” Computers and Biomedical Research, vol.32, pp. 123–131, 1999.

[3] R.-W. Woo, S. Choi, J.-H. Sohn, S.-J. Song, Y.-D. Bae, and H.-J. Yoo, “A Low-Power 3-D Rendering Engine With Two Texture Units and 29-Mb Embedded DRAM for 3G Multimedia Terminals,” in IEEE Journal of Solid-State Circuits, vol. 39, no. 7, pp. 1101-1109, July 2004.

[4] R. Woo, S. Choi, J.-H. Sohn, and H.-J. Yoo, “A 210-mW Graphics LSI Implementation Full 3-D Pipeline With 264 Mtexels/s Texturing for Mobile Multimedia Applications,” in IEEE Journal of Solid-State Circuits, vol. 39, no. 2, pp. 358-367, February 2004.

[5] J.-H. Sohn, J.-H. Woo, M.-W. Lee, H.-J. Kim, R. Woo, and H.-J. Yoo, “A 155-mW 50-Mvertices/s Graphics Processor With Fixed-Point Programmable Vertex Shader for Mobile Applications,” in IEEE Journal of Solid-State Circuits, vol. 41, no. 5, pp. 1081-1091, May 2006.

[6] C.-W. Yoon, R. Woo, J. Kook, S.-J. Lee and H.-J. Yoo, “An 80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3-D Rendering Engine for Mobile Applications,” in IEEE Journal of Solid-State Circuits, vol. 36, no. 11, pp. 1758-1767, November 2001.

[7] Y.-H. Park, S.-H. Han, J.-H. Lee, and H.-J. Yoo, “A 7.1-GB/s Low-Power Rendering Engine in 2-D Array-Embedded Memory Logic CMOS for Portable Multimedia System,” in IEEE Journal of Solid-State Circuits, vol. 36, no. 6, pp. 944-955, June 2001.

[8] R. Woo, C.-W. Yoo, J. Kook, S.-J. Lee and H.-J. Yoo, “A 120-mW 3-D Rendering Engine With 6-Mb Embedded DRAM and 3.2-GB/s Runtime Reconfigurable Bus for PDA Chip,” in IEEE Journal of Solid-State Circuits, vol. 37, no. 10, pp. 1352-1355, October 2002.

[9] B.-G. Nam, H. Kim, and H.-J. Yoo, “A Low-Power Unified Arithmetic Unit for Programmable Handheld 3-D Graphics Systems,” in IEEE Journal of Solid-State Circuits, vol. 42, no. 8, pp. 1767-1778, August 2007.

[10] D. Kim, K. Chung, C.-H. Yu, C.-H. Kim, I. Lee, J. Bae, Y.-J. Kim, J.-H. Park, S. Kim, Y.-H. Park, N.-H. Seong, J.-A. Lee, J. Park, S. Oh, S.-W. Jeong, and L.-S. Kim, “An SoC With 1.3 Gtexels/s 3-D Graphics Full Pipeline for Consumer Applications,” in IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 71-84, January 2006.

[11] T. Akenine-Möller and J. Ström, “Graphics for the masses: a hardware rasterization architecture for mobile phones,” in ACM Transactions on Graphics, vol. 22, issue 3, pp. 801-808, July 2003.

[12] H.-C. Shin, J.-A. Lee, and L.-S. Kim, “A Cost-Effective VLSI Architecture for Anisotropic Texture Filtering in Limited Memory Bandwidth,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 3, pp. 254-267, March 2002.

[13] S. Fenney, “Texture compression using low-frequency signal modulation,” in Graphics Hardware, SIGGRAPH/EUROGRAPHICS, pp. 84-91, 2003.

[14] J. Ström and T. Akenine-Möller, “iPACKMAN: high-quality, low-complexity texture compression for mobile phones,” in Graphics Hardware, SIGGRAPH/EUROGRAPHICS, pp. 63-70, 2005.

[15] S. Morein, “Method and apparatus for efficient clearing of memory,” U.S. Patent 6 421 764, July 16, 2002.

[16] J. DeRoo, S. Morein, B. Favela, M. Wright, “Method and apparatus for compressing parameter values for pixels in a display frame,” U.S. Patent 6 476 811, Nov. 5, 2002.

[17] J. Van Dyke, J. Margeson, “Method and apparatus for managing and accessing depth data in a computer graphics system,” U.S. Patent 6 961 057, Nov. 1, 2005.

[18] T. Van Hook, “Method and Apparatus for Compression and Decompression of Z Data,” U.S. Patent 6 630 933, Oct. 7, 2003.

[19] B.-S. Liang, Y.-C. Lee, W.-C. Yeh, and C.-W. Jen, “Index rendering: hardware-efficient architecture for 3-D graphics in multimedia system,” in IEEE Transactions on Multimedia, vol. 4, no. 2, pp. 343-360, June 2002.

[20] S. Morein, M. Natale, “System, method, and apparatus for compression of video data using offset values,” U.S. Patent 6 762 758, July 13, 2004.

[21] J. Hasselgren, T. Akenine-Möller, “Efficient depth buffer compression,” in Graphics Hardware, SIGGRAPH/EUROGRAPHICS, pp. 102-110, 2006.

[22] S. Morein, “ATI Radeon HyperZ technology,” in Hot3D Proc. ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, Aug. 2000.

[23] C.-H. Chen and C.-Y. Lee, “Two-level hierarchical Z-buffer with compression technique for 3D graphics hardware,” in The Visual Computer, Springer, vol. 19, no. 7-8, pp. 467-479, Dec. 2003.

[24] C.-H. Yu and L.-S. Kim, “A hierarchical depth buffer for minimizing memory bandwidth in 3D rendering engine: depth filter,” in Proc. ISCAS'03, May 2003, pp. II-724-II-727.

[25] Per Wennersten, “Depth buffer compression,” M.S. thesis, Dept. Computer Science and Communication, Royal Institute of Technology, Stockholm, Sweden, 2007.

[26] M.-H. Choi, W.-C. Park, Francis Neelamkavil, T.-D. Han, and S.-D. Kim, “An effective visibility culling method based on cache block,” IEEE Trans. Computers, vol. 55, no. 8, pp. 1024–1032, Aug. 2006.

[27] N. Greene, M. Kass, and G. Miller, “Hierarchical Z-buffer visibility,” in Proc. of SIGGRAPH '93, Jul. 1993, pp. 231–238.

[28] C.-H. Yu and L.-S. Kim, “An adaptive spatial filter for early depth test,” in Proc. IEEE ISCAS'04, May 2004, pp. II-137-II-40.

[29] Y.-M. Tsao, C.-L. Wu, S.-Y. Chien, and L.-G. Chen, “Adaptive tile depth filter for the depth buffer bandwidth minimization in the low power graphics systems,” in Proc. IEEE ISCAS’06, May 2006, pp. 5023-5026.
