
Chapter 5 Methodology and Experiment

6.3 Synthesis Result

6.3.2 Power

Power Report (4 warps 4 cores, single compute unit)

Term                   Power (mW)   Share of dynamic power (%)
Cell Internal Power    142.6        95
Net Switching Power    6.8          5
Total Dynamic Power    149.4        100
Cell Leakage Power     32.5
Total Power            181.9

These power figures are not measured with real test data; they are estimates from Design Compiler.

We do not apply any power-saving techniques here. Considerable power could still be saved by techniques such as clock gating and power gating.
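
The totals and shares in the power report can be reproduced from its three component figures. The following minimal Python sketch only repeats that arithmetic on the reported numbers; the variable and function names are illustrative and are not part of any Design Compiler output.

```python
# Component figures copied from the power report above (mW).
# Nothing here is newly measured; this only re-derives the totals.
cell_internal = 142.6
net_switching = 6.8
cell_leakage = 32.5

total_dynamic = cell_internal + net_switching
total_power = total_dynamic + cell_leakage


def share(part, whole):
    """Return part / whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(f"Total dynamic power: {total_dynamic:.1f} mW")
print(f"Total power:         {total_power:.1f} mW")
print(f"Internal share of dynamic power:  {share(cell_internal, total_dynamic)} %")
print(f"Switching share of dynamic power: {share(net_switching, total_dynamic)} %")
print(f"Leakage share of total power:     {share(cell_leakage, total_power)} %")
```

The internal and switching shares round to the 95 % and 5 % listed in the report. The leakage share of total power is not listed in the report and is derived here only from the reported values.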

Chapter 7 Limitations and Future Work

7.1 Factors Not Simulated

Some factors are not modeled by our simulator and might change the performance of the proposed system.

Most importantly, the Traverser is not modeled in detail in either throughput or latency. Ideally, we would simulate the performance of a real hardware traversal unit, but we do not have a custom design with timing information in a 40 nm process that can be merged directly into our system. In future work, it might be modeled by its functional parts, for example by assigning latency according to the tree traversal depth and the hit rate of the tree node cache, and by simulating the dynamic behavior across the separate cores in the Traverser.
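
As an illustration of the latency assignment suggested above, the sketch below computes an expected traversal latency from the traversal depth and the node cache hit rate. It is a minimal first-order model under our own assumptions, not the Traverser model used in this thesis: the function name, the per-node hit and miss cycle counts, and the assumption that every visited node costs one cache access are all hypothetical.

```python
def expected_traversal_latency(depth, hit_rate, hit_cycles, miss_cycles):
    """Expected cycles to visit `depth` tree nodes through a node cache.

    Hypothetical first-order model: each visited node costs one cache
    access, which takes `hit_cycles` on a hit and `miss_cycles` on a miss.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    per_node = hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
    return depth * per_node


# Example with made-up parameters: 20 nodes visited, 90 % hit rate,
# 2 cycles per hit, 40 cycles per miss.
latency = expected_traversal_latency(depth=20, hit_rate=0.9,
                                     hit_cycles=2, miss_cycles=40)
print(f"Expected latency: {latency:.1f} cycles")
```

Such a model ignores contention between the separate cores in the Traverser, which would still need a dynamic simulation.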

Secondly, in the simulation of multiple compute units, we check and update hits and misses against a single shared set of cache tags. In a real design, each compute unit would have its own data cache. In that case there would be more misses from first allocation in the different CUs (since data are not shared), but the accesses within each cache could be more coherent, which may remove some misses.
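
The two effects described above can be seen in a toy model. The sketch below compares one shared tag store against one tag store per CU on two small hand-made traces. It is not the simulator used in this thesis: the class and function names, the 64 sets, the 32-byte line size, and the LRU replacement are all assumptions chosen for illustration (only the 2-way associativity follows the text).

```python
from collections import OrderedDict


class SetAssocCache:
    """Tiny LRU set-associative tag store that only counts misses."""

    def __init__(self, num_sets=64, ways=2, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        tags = self.sets[line % self.num_sets]
        if line in tags:
            tags.move_to_end(line)      # mark as most recently used
            return True
        self.misses += 1
        if len(tags) >= self.ways:
            tags.popitem(last=False)    # evict the least recently used line
        tags[line] = True
        return False


def count_misses(trace, num_cus, shared_tags):
    """Count misses for a trace of (cu_id, byte_address) pairs."""
    caches = [SetAssocCache() for _ in range(1 if shared_tags else num_cus)]
    for cu, addr in trace:
        caches[0 if shared_tags else cu].access(addr)
    return sum(cache.misses for cache in caches)


# Trace A: both CUs read the same four lines (first-allocation effect).
trace_a = [(cu, line * 32) for cu in (0, 1) for line in range(4)]

# Trace B: lines 0, 64 and 128 all map to set 0. CU 0 reuses lines 0 and 64,
# CU 1 reuses line 128 (conflict effect in a shared 2-way set).
trace_b = [(0, 0), (0, 64 * 32), (1, 128 * 32)] * 3

for name, trace in (("A", trace_a), ("B", trace_b)):
    shared = count_misses(trace, num_cus=2, shared_tags=True)
    separate = count_misses(trace, num_cus=2, shared_tags=False)
    print(f"Trace {name}: shared tags {shared} misses, per-CU tags {separate} misses")
```

On trace A the per-CU caches miss more often because each CU must allocate the shared lines itself; on trace B they miss less often because the CUs no longer evict each other's lines.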

Also, arbitration for the off-chip bus is not simulated. In a real system there may be additional contention on the off-chip bus.

The miss penalty of the instruction cache is not simulated either. We assume that a 4 KB instruction cache can hold all the kernel code, so only the first access to each instruction may miss, which is negligible.

The models of the two pools are not verified by a hardware implementation, but we believe the specification we used is feasible.

7.2 Possible Further Optimizations

Many parts of the system can be further optimized for performance.

7.2.1 Functional Block Optimization

The multiprocessor proposed in this thesis is only a baseline design. There is considerable room to speed it up, or to shrink its area with little performance loss.

For example, the caches could be made non-blocking and pipelined, or given the ability to handle dynamically formed split requests. Replacing the current 2-way set-associative cache with a 4-way or 8-way one is another option. For industrial use, the SRAM could be replaced by high-bandwidth SRAM like that used in recent GPUs, which is not available to us.

The computational ALUs are pipelined by Design Compiler without a global view of the design. In general, hand-designed ALUs that take the process and cycle time into account are much better than these generated ones, and they can also achieve more hardware resource sharing.

7.2.2 Customized Composition of ALUs

Our proposed configurable multiprocessor has the ability to plug in customized ALUs and to duplicate critical ALUs, but we have not yet tried such experiments for further optimization.

For example, Section 5.4 shows that many warps wait for the IntALU as the number of warps grows, which means the IntALU is a bottleneck in our test cases. We could plug in additional IntALU resources to relieve it.

As the number of ALUs grows, the switching circuit in the dispatcher might no longer meet timing. We could merge some low-utilization ALUs to compensate. For example, Section 5.4 shows that TidALU and BrhALU both have low utilization; we might merge them and use the freed switch-circuit capacity for duplicated IntALUs.

7.2.3 Better Host Control Policy

We tested our system only with a naive control policy for kernel choice, shown in Section 5.2.2. Performance, however, depends strongly on this policy, because the policy balances time utilization against resource utilization. Different hardware configurations also favor different control policies.

For example, if we change a single parameter in Figure 7.1 from 8 to 96, the performance changes. Since the configuration with 4 CUs can consume more contents, a larger parameter may achieve higher resource efficiency. For configurations with fewer contents, however, this policy has little effect on performance.

Since the hardware utilization rate of configurations with many contents is low, many experiments can be done to raise it. Conceivably, configurations with multiple compute units might then scale up to the per-unit performance of a single compute unit.

Configuration                                 Performance (Mrays/s)   Time utilization (%)   Resource utilization (%)
8 warps 8 cores, 4 CUs, before modification   82.08                   93.65                  33.30
8 warps 8 cores, 4 CUs, after modification    96.83 (+18.0%)          56.74                  95.80
4 warps 4 cores, 1 CU, before modification    18.69                   100.0                  99.35
4 warps 4 cores, 1 CU, after modification     18.75 (+0.3%)           100.0                  99.84

Table 7.1: Quick example in kernel choice policy
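
The relative gains quoted in Table 7.1 follow directly from the before and after throughput figures. The short Python sketch below repeats that calculation; the function name is illustrative, and the inputs are the Mrays/s values from the table.

```python
def relative_gain(before, after):
    """Percentage change from `before` to `after`, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


# Throughput in Mrays/s, taken from Table 7.1.
gain_4cu = relative_gain(82.08, 96.83)   # 8 warps 8 cores, 4 CUs
gain_1cu = relative_gain(18.69, 18.75)   # 4 warps 4 cores, 1 CU

print(f"4 CUs: {gain_4cu:+.1f} %")
print(f"1 CU:  {gain_1cu:+.1f} %")
```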

Figure 7.1: An example of changing the kernel choice policy. (Figure not reproduced; surviving labels: "RP+ISP>0", "Traversal", "→ 192", "→ 96".)

In short, there is much that can be tried to optimize performance through changes in the kernel choice policy. Each configuration also has its own best policy.

7.2.4 Ray Arrangement in Pools

Ray tracing is known to be a tough problem for GPGPU because of its highly divergent data access. We might add a sorting algorithm to the pools to make rays more coherent in both the Shader kernels and the Traverser traversal. Many topics in such sorting methods can be researched further.
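
One possible form of such sorting is sketched below: rays are ordered first by the sign octant of their direction and then by a Morton (Z-order) code of their quantized origin, so that neighboring rays in the pool tend to start close together and point the same way. This is a generic coherence heuristic given only as an example, not a method evaluated in this thesis; the function names, the 10-bit quantization, and the representation of a ray as an (origin, direction) pair are assumptions.

```python
def direction_octant(direction):
    """3-bit key built from the signs of the direction components."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)


def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def quantize(value, lo, hi, bits=10):
    """Map value in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    t = (value - lo) / (hi - lo)
    return max(0, min(cells, int(t * cells)))


def sort_rays(rays, scene_min, scene_max):
    """Return rays sorted by (direction octant, Morton code of origin)."""
    def key(ray):
        origin, direction = ray
        cell = [quantize(origin[i], scene_min[i], scene_max[i]) for i in range(3)]
        return (direction_octant(direction), morton3(*cell))
    return sorted(rays, key=key)


# Three example rays inside a unit-cube scene (made-up data).
rays = [
    ((0.9, 0.9, 0.9), (1.0, 1.0, 1.0)),
    ((0.1, 0.1, 0.1), (-1.0, 1.0, 1.0)),
    ((0.1, 0.1, 0.1), (1.0, 1.0, 1.0)),
]
ordered = sort_rays(rays, scene_min=(0.0, 0.0, 0.0), scene_max=(1.0, 1.0, 1.0))
for origin, direction in ordered:
    print(origin, direction)
```

In hardware, a full sort may be too costly; binning rays by a few key bits would be a cheaper variant of the same idea.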

7.3 Possible Further Extensions

7.3.1 Other Ray Tracing Algorithms

We have not tried changing the kernel code to support other algorithms for ray-tracing-based rendering, but the system appears capable of supporting algorithms such as path tracing and special effects such as ambient occlusion. The Traversal ASIC can also be changed to support other acceleration data structures and methods.

7.3.2 Other Applications for the Configurable Multiprocessor

The proposed configurable multiprocessor can be separated from our system and used as a general-purpose or customized multiprocessor. Our ray-pool-based ray tracing engine can be viewed as one example of adopting this multiprocessor. All the optimization steps and experiments on different configurations can be repeated for other applications when needed. With its simple but sound simulation tools, instruction generator, and basic hardware implementation, our configurable multiprocessor is well suited to being ported to other applications.

Chapter 8 Conclusion

In this thesis, a simple but complete system for accelerating ray-tracing-based rendering is proposed. It achieves up to 80 Mrays per second on lightweight, programmable hardware, which is competitive with many large-scale solutions, and room remains for further optimization. It is also programmable for different ray tracing effects.

In this system, a configurable multiprocessor is designed. It can be extended to different applications with customized ALUs and is easy to simulate and configure. It comes with a near-cycle-accurate simulator and instruction generation tools, and its functionality and timing are verified by a Verilog HDL implementation and gate-level synthesis.

From our experiments on different configurations, we suggest the 8 warps 8 cores design, and we show how performance varies as the number of compute units grows.

To summarize, we provide a new architecture for ray tracing with competitive performance in our first trial. We believe that, with further optimization and in combination with a stronger multiprocessor and traversal hardware, real-time ray tracing in HD will be possible in the near future.

