Quality - Experimental Results - A Thermal-Aware and Location-Dependent 3D Floorplanning Algorithm on GPGPU

Chapter 5 Experiments

5.2 Experimental Results

5.2.1 Quality

In this part, we compare our work with the related work [29], as shown in the following tables. Table 5, Table 6, and Table 7 show the results for circuits n100, n200, and n300, respectively. The first row gives the zone size, and the number in parentheses is zone size / (#modules/#layers). ZT denotes Cong’s work [29], which does not apply the OHPZ technique, so its zone size is effectively 0. The other columns show the results of LDZT for various zone sizes. Max_T is the maximum temperature of a single floorplan; the Max_T rows give the maximum, minimum, average, and standard deviation of Max_T over the floorplans generated with 100 different random seeds. Finally, the bottom two rows give the average wirelength (WL) and TSV count of the 100 floorplans.
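The parenthesized percentages in Tables 5 – 7 can be reproduced with a small helper. This is a minimal sketch under two assumptions inferred from the tables rather than stated in the text: every circuit uses 4 layers (so n100 has 25 modules per layer and 40% gives a zone of 10), and fractional zone sizes are rounded half up (10% of 25 = 2.5 becomes 3). The function name `zone_size` is hypothetical.

```python
def zone_size(percent: int, n_modules: int, n_layers: int) -> int:
    """Convert a zone-size percentage into a number of modules.

    Assumption: the percentage is relative to #modules/#layers and the result
    is rounded half up, which matches the values printed in Tables 5-7.
    """
    # x = percent/100 * n_modules/n_layers; integer math avoids float rounding.
    denominator = 100 * n_layers
    # floor(x + 0.5) == (2*numerator + denominator) // (2*denominator)
    return (2 * percent * n_modules + denominator) // (2 * denominator)


if __name__ == "__main__":
    for circuit, n in (("n100", 100), ("n200", 200), ("n300", 300)):
        sizes = [zone_size(p, n, n_layers=4) for p in (10, 20, 30, 40)]
        print(circuit, sizes)
```

With 4 layers assumed, this prints [3, 5, 8, 10], [5, 10, 15, 20], and [8, 15, 23, 30], matching the zone-size rows of Tables 5, 6, and 7.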

Figure 19, Figure 20, and Figure 21 show the distribution of the thermal data in Table 5, Table 6, and Table 7, respectively. The top endpoint of each line marks the maximum of Max_T, the bottom endpoint marks the minimum of Max_T, and the label on the line gives the average of Max_T.
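Each line in Figures 19 – 21 is a two-level aggregation: first the hottest grid of each floorplan (Max_T), then the statistics of Max_T over the random seeds. The sketch below assumes each floorplan's thermal result is available as a flat list of grid temperatures; the function names and the use of the population standard deviation (the text does not say which form was used) are assumptions.

```python
import statistics
from typing import Sequence


def max_t(grid_temperatures: Sequence[float]) -> float:
    """Max_T of one floorplan: the temperature of its hottest grid."""
    return max(grid_temperatures)


def summarize_over_seeds(floorplans: Sequence[Sequence[float]]) -> dict:
    """Max/Min/Avg/Std of Max_T over floorplans from different random seeds."""
    values = [max_t(grids) for grids in floorplans]
    return {
        "Max": max(values),
        "Min": min(values),
        "Avg": statistics.mean(values),
        # Population standard deviation (assumption; statistics.stdev is the sample form).
        "Std": statistics.pstdev(values),
    }


# Four hypothetical seeds, each floorplan reduced to two grid temperatures.
print(summarize_over_seeds([[120.0, 150.0], [146.0, 140.0], [148.0, 100.0], [142.0, 141.0]]))
```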

Table 5. Experimental results – n100.

Zone size       ZT      0(0%)   3(10%)  5(20%)  8(30%)  10(40%)
Max_T   Max     163.2   160.4   154.6   150.0   151.5   155.3
        Min     130.1   133.6   137.2   139.3   135.4   135.1
        Avg     148.0   147.0   145.7   144.7   145.0   146.6
        Std     5.6     5.1     2.8     2.7     2.5     3.0
WL              131554  131486  130930  131207  131412  131772
TSV             703.2   702.7   699.3   693.4   704.7   701.3

Table 6. Experimental results – n200.

Zone size       ZT      0(0%)   5(10%)  10(20%) 15(30%) 20(40%)
Max_T   Max     193.8   191.4   191.7   192.1   189.4   189.5
        Min     181.3   180.7   178.6   179.8   181.2   179.8
        Avg     186.3   185.3   184.6   184.4   184.1   184.5
        Std     2.7     2.2     1.9     1.7     1.4     1.6
WL              241258  239184  240007  240083  240619  242574
TSV             1540.4  1527.6  1522.7  1520.0  1516.6  1516.8

Table 7. Experimental results – n300.

Zone size       ZT      0(0%)   8(10%)  15(20%) 23(30%) 30(40%)
Max_T   Max     203.7   199.1   197.5   196.2   196.7   197.8
        Min     189.3   189.1   187.5   189.2   188.9   189.3
        Avg     193.7   193.2   192.8   192.7   193.3   193.5
        Std     2.5     1.6     1.8     1.3     1.2     1.3
WL              343863  340112  341874  343885  345545  347358
TSV             1592.5  1570.4  1565.6  1566.8  1564.2  1557.2

Figure 19. Thermal data – n100.

Figure 20. Thermal data – n200.

Figure 21. Thermal data – n300.

From the data above, the TSV count and wirelength of our work are similar to those of Cong’s work; the average wirelength differs by about 1% or less. Next, we examine the thermal results. In Figure 19 – Figure 21, the range of the maximum temperatures is generally narrower with our method, and the standard deviation of Max_T is lower than that of ZT in every case. This means our method is stable; in other words, it is less sensitive to the random seed. If the zone size is too small, over-heat modules can still be placed close to the bottom-left corner. But if the zone size is too large, we cannot guarantee that the modules in the OHPZ are placed where we want them. Therefore, based on these results, we recommend a zone size of 20% – 30%.
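The recommendation can be checked directly against the tables by computing the range (Max minus Min of Max_T) of each column. The snippet below uses the Table 5 values for n100; the dictionary and function names are ours.

```python
# Max and Min of Max_T for n100, copied from Table 5 and keyed by column label.
TABLE5_MAX = {"ZT": 163.2, "0(0%)": 160.4, "3(10%)": 154.6,
              "5(20%)": 150.0, "8(30%)": 151.5, "10(40%)": 155.3}
TABLE5_MIN = {"ZT": 130.1, "0(0%)": 133.6, "3(10%)": 137.2,
              "5(20%)": 139.3, "8(30%)": 135.4, "10(40%)": 135.1}


def spread(max_row: dict, min_row: dict) -> dict:
    """Range (Max - Min) of Max_T per column, rounded to one decimal."""
    return {col: round(max_row[col] - min_row[col], 1) for col in max_row}


ranges = spread(TABLE5_MAX, TABLE5_MIN)
narrowest = min(ranges, key=ranges.get)
print(ranges)
print("narrowest:", narrowest)
```

For n100, ZT spans 33.1 while the 20% zone spans only 10.7, the narrowest column. Repeating the computation with Table 6 and Table 7 gives the narrowest range at 15(30%) for n200 (8.2) and at 15(20%) for n300 (7.0).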

After this analysis, we consider observing only the maximum temperature of each floorplan insufficient. Suppose two floorplans have the same maximum temperature, but one is cool everywhere except at the hotspot while the other is hot everywhere. If we consider only the maximum temperature, these two floorplans are equivalent, although the former is clearly better. Therefore, we select the grids in the bottom layer with the top 5% temperatures for analysis. The results are shown in Figure 22 – Figure 24, which give the number of grids with the top 5% temperatures for Cong’s work and for our work with a 20% zone size. The number of grids in the high-temperature range is smaller for our work than for Cong’s. Thus, our work not only achieves a lower maximum temperature but also yields a more uniform temperature distribution.
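The top-5% comparison can be sketched as follows. The text does not specify how the 5% threshold is chosen, so this sketch assumes one threshold computed over the pooled bottom-layer grids of the floorplans being compared (with a separate threshold per floorplan, every floorplan would trivially contribute the same number of grids). The two floorplans are hypothetical: both peak at 150, but B has many more warm grids.

```python
from typing import Dict, List


def top_percent_threshold(temps: List[float], percent: int = 5) -> float:
    """Temperature of the k-th hottest grid, k = ceil(percent% of all grids)."""
    ordered = sorted(temps, reverse=True)
    k = max(1, (len(ordered) * percent + 99) // 100)  # integer ceiling
    return ordered[k - 1]


def count_hot_grids(floorplans: Dict[str, List[float]], percent: int = 5) -> Dict[str, int]:
    """Count each floorplan's bottom-layer grids at or above the pooled threshold."""
    pooled = [t for grids in floorplans.values() for t in grids]
    threshold = top_percent_threshold(pooled, percent)
    return {name: sum(t >= threshold for t in grids) for name, grids in floorplans.items()}


# Hypothetical bottom layers with 100 grids each and the same maximum (150).
hotspot_only = [150.0] + [60.0] * 99                 # cool except one hotspot
warm_everywhere = [150.0 - i for i in range(100)]    # many grids near the peak
print(count_hot_grids({"A": hotspot_only, "B": warm_everywhere}))
```

Although both floorplans have Max_T = 150, this prints {'A': 1, 'B': 9}: B contributes most of the hot grids, which is the difference that Max_T alone cannot reveal.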

Figure 22. Thermal data II – n100.

Figure 23. Thermal data II – n200.

Figure 24. Thermal data II – n300.

5.2.2 Runtime

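The following tables and figure show the runtime results.

In Table 8 – Table 10, the first row indicates whether each part runs on the CPU or on the GPU. The first column lists the components of the cost function, and the following columns give the runtime, runtime ratio, and speedup of each component on the CPU and GPU. Time(s) is in seconds, Speedup is the CPU time divided by the GPU time, and Ratio(%) is the share of the corresponding total runtime. The first Total row sums the four evaluations; Tovh (kernel overhead) and Memcpy (CPU–GPU communication) are added on the GPU side to give the final Total. The rightmost column, #Pstr, gives the number of streaming processors for each component, which is the ideal upper bound of its speedup. Figure 25 plots the data of Table 8 – Table 10.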
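The derived columns of the runtime tables follow from the measured times: Speedup is the CPU time divided by the GPU time, and Ratio(%) is a component's share of the corresponding total. The sketch below recomputes them from the Table 8 (n100) values; the variable and function names are ours.

```python
# Measured times (s) for n100, copied from Table 8.
CPU_TIME = {"Area": 16.9, "WL": 6.2, "Temp": 120.1, "RF": 4.9}
GPU_TIME = {"Area": 0.8, "WL": 1.3, "Temp": 2.8, "RF": 1.5}
GPU_EXTRA = {"Tovh": 20.8, "Memcpy": 15.4}  # GPU time outside the evaluation kernels


def runtime_summary(cpu: dict, gpu: dict, extra: dict) -> dict:
    cpu_total = sum(cpu.values())
    kernel_total = sum(gpu.values())
    gpu_total = kernel_total + sum(extra.values())
    all_gpu = {**gpu, **extra}
    return {
        "speedup": {k: round(cpu[k] / gpu[k], 1) for k in cpu},
        "cpu_ratio": {k: round(100 * t / cpu_total, 1) for k, t in cpu.items()},
        "gpu_ratio": {k: round(100 * t / gpu_total, 1) for k, t in all_gpu.items()},
        "kernel_speedup": round(cpu_total / kernel_total, 1),
        "overall_speedup": round(cpu_total / gpu_total, 1),
    }


summary = runtime_summary(CPU_TIME, GPU_TIME, GPU_EXTRA)
print(summary["kernel_speedup"], summary["overall_speedup"])
```

This prints 23.1 and 3.5, the two Total speedups of Table 8. Because the table entries are already rounded, recomputed values can differ in the last digit; for example, the WL share of GPU time comes out as 3.1% rather than the printed 3.0%.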

Table 8. Runtime – n100.

Item     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area     11.4          16.9         0.8          21.1     1.9           256
WL       4.2           6.2          1.3          4.8      3.0           64
Temp     81.1          120.1        2.8          42.9     6.6           128
RF       3.3           4.9          1.5          3.3      3.5           128
Total    100.0         148.1        6.4          23.1     –             –
Tovh     –             –            20.8         –        48.8          –
Memcpy   –             –            15.4         –        36.2          –
Total    –             148.1        42.6         3.5      100.0         –

Table 9. Runtime – n200.

Item     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area     8.5           78.9         4.8          16.4     4.4           256
WL       2.6           23.9         4.5          5.3      4.1           64
Temp     85.3          792.6        18.2         43.5     16.6          128
RF       3.6           33.2         9.0          3.7      8.2           128
Total    100.0         928.6        36.5         25.4     –             –
Tovh     –             –            41.8         –        38.1          –
Memcpy   –             –            31.4         –        28.6          –
Total    –             928.6        109.7        8.5      100.0         –

Table 10. Runtime – n300.

Item     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area     6.7           190.1        14.4         13.2     6.9           256
WL       1.6           46.1         8.0          5.8      3.8           64
Temp     87.7          2479.6       56.8         43.7     27.3          128
RF       4.0           112.9        27.0         4.2      13.0          128
Total    100.0         2828.7       106.2        26.6     –             –
Tovh     –             –            55.3         –        26.5          –
Memcpy   –             –            46.9         –        22.5          –
Total    –             2828.7       208.4        13.6     100.0         –

Figure 25. (a) Runtime – n100, (b) Runtime – n200, (c) Runtime – n300.

As the tables show, the GPU speedup of each component falls short of its number of streaming processors, because the algorithm is not fully parallel and CUDA has drawbacks that we discuss below. First, consider the speedup of area evaluation. We compute the coordinates module by module and obtain the final area only after all modules are placed. The data are therefore highly dependent, and threads may sit idle waiting for one another until all threads finish. As a result, the data dependency reduces the speedup.
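The dependency problem in area evaluation can be seen in a toy packing. In the hypothetical sketch below, modules are placed left to right in a single row, so each x-coordinate depends on every module placed before it, and the area can be computed only after the loop finishes. This is not the packing procedure used in this thesis; it only illustrates the loop-carried dependency described above.

```python
from typing import List, Sequence, Tuple


def pack_row(widths: Sequence[int], heights: Sequence[int]) -> Tuple[List[int], int]:
    """Toy one-row packing (hypothetical): returns x-coordinates and area."""
    xs = []
    x = 0
    for w in widths:
        xs.append(x)  # x of module i depends on all modules before it
        x += w        # loop-carried dependency: next iteration needs this value
    # The bounding-box area is only available once every coordinate is known.
    width = max(xi + w for xi, w in zip(xs, widths))
    height = max(heights)
    return xs, width * height


print(pack_row([2, 3, 1], [4, 1, 2]))
```

This prints ([0, 2, 5], 24). A one-row chain like this is a prefix sum and could be handled by a parallel scan, but the final area still has to wait for every coordinate.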

Second, wirelength evaluation contains a large number of branch instructions, so its performance suffers from the CUDA property introduced previously: when threads in the same warp take different branches, the branch paths are executed serially. Next, for temperature evaluation, the gap between the speedup and the number of streaming processors is smaller than for the other evaluations.
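The cost of branch divergence can be illustrated with a simplified SIMT model: a warp of 32 threads runs every distinct path that any of its threads takes, one after another, with the other threads masked off. The model ignores predication, nested branches, and reconvergence details, and the path costs are hypothetical units rather than measurements from this thesis.

```python
from typing import Dict, Sequence

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_cost(paths_taken: Sequence[str], path_cost: Dict[str, int]) -> int:
    """Cycles to run one warp under a simplified divergence model."""
    if len(paths_taken) != WARP_SIZE:
        raise ValueError("expected one branch outcome per thread in the warp")
    # Each distinct path is executed once for the whole warp, serially.
    return sum(path_cost[p] for p in set(paths_taken))


cost = {"if": 10, "else": 10}
uniform = ["if"] * WARP_SIZE                   # all threads take the same path
divergent = ["if", "else"] * (WARP_SIZE // 2)  # threads split between paths
print(warp_cost(uniform, cost), warp_cost(divergent, cost))
```

This prints 10 20: the divergent warp takes twice as long although each thread does the same amount of work, which is why branch-heavy wirelength evaluation scales worse than the other kernels.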

This is because, as introduced before, the temperature of each grid is evaluated independently, so temperature evaluation has better parallelism than the other evaluations.
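Why per-grid evaluation parallelizes well can be seen in a simplified model that, like LDZT, omits lateral thermal resistances. The sketch below is not LDZT: it is a hypothetical one-dimensional series resistance stack per grid column, with the heat sink assumed below layer 0, and all function and parameter names are illustrative.

```python
from typing import List, Sequence


def column_temperatures(powers: Sequence[float], resistances: Sequence[float],
                        t_ambient: float) -> List[float]:
    """Layer temperatures of one grid column with vertical heat flow only.

    Layer 0 sits next to the heat sink (assumption). resistances[k] is the
    vertical thermal resistance between layer k and the layer (or sink) below it.
    """
    temps = []
    t = t_ambient
    for k in range(len(powers)):
        # All power dissipated in layers k..top must flow through resistance k.
        t += resistances[k] * sum(powers[k:])
        temps.append(t)
    return temps


def grid_temperatures(grid_powers: Sequence[Sequence[float]],
                      resistances: Sequence[float], t_ambient: float) -> List[List[float]]:
    # No column reads another column's data, so each could map to one GPU thread.
    return [column_temperatures(p, resistances, t_ambient) for p in grid_powers]


print(grid_temperatures([[1.0, 2.0], [0.0, 0.0]], [0.5, 1.0], 25.0))
```

This prints [[26.5, 28.5], [25.0, 25.0]]. Because the list comprehension is a pure map over columns, it has no cross-grid dependency; the computations that follow power density evaluation are what limit the temperature speedup.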

However, the computations performed after power density evaluation slow it down, so its final speedup is reduced. Last, we analyze the speedup of repulsion force evaluation. In general, the number of streaming processors bounds the speedup. However, as introduced before, each multiprocessor has far fewer dividers than threads, and repulsion force evaluation involves division, so its maximum speedup is determined by the total number of dividers. Overall, the combined speedup of these evaluations is approximately 25 (23.1, 25.4, and 26.6 for n100, n200, and n300). Finally, communication time and kernel overhead further reduce the final speedup. Communication time and kernel overhead grow linearly with the number of modules, whereas the evaluation time grows faster. Hence, as the number of modules grows, the share of these two overheads in the total GPU runtime decreases (from 85.0% for n100 to 49.0% for n300), which is why the final speedup increases with circuit size.
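The trend can be recomputed from Tables 8 – 10, reading Tovh as the kernel overhead and Memcpy as the communication time. The sketch below reports, for each circuit, the share of GPU runtime spent on these two overheads and the overall speedup; the dictionary and function names are ours.

```python
# Runtime (s) from Tables 8-10: CPU total, GPU evaluation kernels, Tovh, Memcpy.
RUNTIME = {
    "n100": {"cpu": 148.1, "kernels": 6.4, "tovh": 20.8, "memcpy": 15.4},
    "n200": {"cpu": 928.6, "kernels": 36.5, "tovh": 41.8, "memcpy": 31.4},
    "n300": {"cpu": 2828.7, "kernels": 106.2, "tovh": 55.3, "memcpy": 46.9},
}


def overhead_share_and_speedup(r: dict) -> tuple:
    """Return (overhead share of GPU runtime in %, overall speedup)."""
    gpu_total = r["kernels"] + r["tovh"] + r["memcpy"]
    share = 100 * (r["tovh"] + r["memcpy"]) / gpu_total
    return round(share, 1), round(r["cpu"] / gpu_total, 1)


results = {circuit: overhead_share_and_speedup(r) for circuit, r in RUNTIME.items()}
for circuit, (share, speedup) in results.items():
    print(circuit, share, speedup)
```

As the circuit grows, the overhead share drops from 85.0% to 66.7% to 49.0% while the overall speedup rises from 3.5 to 8.5 to 13.6.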

Chapter 6 Conclusion

In this thesis, we propose a fast location-dependent thermal model and a thermal-aware floorplanning algorithm, and we implement the algorithm on both CPU and GPGPU.

LDZT, the fast thermal model we propose, captures the location-dependent property without increasing runtime. Moreover, we propose two strategies to prevent hotspots. First, we refine the repulsion force so that modules with high power density are kept apart; this also compensates for the thermal coupling that LDZT misses by omitting lateral thermal resistances. Second, we define a zone, the over-heat prevention zone (OHPZ), to prevent the bottom-left corner of the floorplan from overheating during the SA procedure: an over-heat module, whose power density exceeds the average power density by more than one standard deviation, cannot be placed in this zone. With these strategies, we reduce the maximum and average temperatures and decrease the number of grids in the high-temperature range. Additionally, the floorplanner is insensitive to the random seed, which indicates that our method is robust. Finally, we use CUDA to accelerate the floorplanner and obtain a 3.5X – 13.6X speedup.
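The over-heat criterion can be written in a few lines. This minimal sketch only classifies modules; it does not model the OHPZ geometry or the SA moves, and it assumes the population standard deviation because the text does not say which form is used. The function name and module names are hypothetical.

```python
import statistics
from typing import Dict, List


def overheat_modules(power_density: Dict[str, float]) -> List[str]:
    """Modules whose power density exceeds the mean by more than one std.

    Population standard deviation is assumed. Modules returned here would be
    barred from the over-heat prevention zone.
    """
    values = list(power_density.values())
    limit = statistics.mean(values) + statistics.pstdev(values)
    return sorted(name for name, p in power_density.items() if p > limit)


print(overheat_modules({"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 5.0}))
```

Here the mean is 2.0 and the standard deviation is about 1.73, so only m4 (5.0) exceeds the limit and is printed as ['m4'].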

The speedup becomes more significant as the design size grows.

References

[1] “International Technology Roadmap for Semiconductors,” Semiconductor Industry Association, 2005–2010.

[2] G. Metze, M. Khbels, N. Goldsman, and B. Jacob, “Heterogeneous integration,” Tech Trend Notes, vol. 12, no. 2, p. 3, 2003.

[3] A. W. Topol, D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong, “Three-dimensional integrated circuits,” IBM J. of Research and Development, vol. 50, no. 4.5, pp. 491–506, Jul. 2006.

[4] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: a novel chip design for improving deep submicron interconnect performance and systems-on-chip integration,” Proc. IEEE, vol. 89, no. 5, pp. 602–633, May 2001.

[5] R. Tummala and V. Madisetti, “System on chip or system on package?” IEEE Design & Test of Computers, vol. 16, no. 2, pp. 48–56, Apr.–Jun. 1999.

[6] P. H. Shiu and K. S. Lim, “Multi-layer floorplanning for reliable system-on-package,” Proc. Int’l Symp. Circuits and Systems, pp. 23–26, 2004.

[7] S. Spiesshoefer, Z. Rahman, G. Vangara, S. Polamreddy, S. Burkett, and L. Schaper, “Process integration for through-silicon vias,” J. of Vacuum Science and Technology A, vol. 23, no. 4, pp. 824–829, Jul. 2005.

[8] SOCcentral. [Online]. Available: http://www.soccentral.com

[9] S. Das, A. P. Chandrakasan, and R. Reif, “Calibration of Rent's rule models for three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol. 12, no. 4, pp. 359–366, Apr. 2004.

[10] A. Rahman and R. Reif, “System-level performance evaluation of three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol. 8, no. 6, pp. 671–678, Dec. 2000.

[11] S. Das, A. Fan, K. Chen, C. S. Tan, N. Checka, and R. Reif, “Technology, performance, and computer-aided design of three-dimensional integrated circuits,” Proc. Int’l Symp. Physical Design, pp. 108–115, 2004.

[12] I. Kaya, S. Salewski, M. Olbrich, and E. Barke, “Wirelength reduction using 3D physical design,” Int’l Workshop Integrated Circuit System Design, pp. 453–462, 2004.

[13] I. Loi, S. Mitra, T. H. Lee, S. Fujita, and L. Benini, “A low-overhead fault tolerance scheme for TSV-based 3D network on chip links,” Proc. Int’l Conf. Computer-Aided Design, pp. 598–602, 2008.

[14] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon, “Demystifying 3D ICs: the pros and cons of going vertical,” IEEE Design & Test of Computers, vol. 22, no. 6, pp. 498–510, Nov.–Dec. 2005.

[15] D. F. Wong and C. L. Liu, “A new algorithm for floorplan design,” Proc. Design Automation Conf., pp. 101–107, 1986.

[16] R. Otten, “Automatic floorplan design,” Proc. Design Automation Conf., pp. 261–267, 1982.

[17] Y.-C. Chang, Y.-W. Chang, G.-M. Wu, and S.-W. Wu, “B*-trees: A new representation for nonslicing floorplans,” Proc. Design Automation Conf., pp. 458–463, 2000.

[18] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “VLSI module placement based on rectangle-packing by the sequence-pair,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 12, pp. 1518–1524, Dec. 1996.

[19] P.-N. Guo, C.-K. Cheng, and T. Yoshimura, “An O-tree representation of nonslicing floorplan and its applications,” Proc. Design Automation Conf., pp. 268–273, 1999.

[20] Z. Li, X. Hong, Q. Zhou, Y. Cai, J. Bian, H. H. Yang, V. Pitchumani, and C.-K. Cheng, “Hierarchical 3D floorplanning algorithm for wirelength optimization,” IEEE Trans. Circuits and Syst. I: Regular Papers, vol. 53, no. 12, pp. 2637–2646, Dec. 2006.

[21] T. Yan, Q. Dong, Y. Takashima, and Y. Kajitani, “How does partitioning matter for 3D floorplanning?” Proc. ACM Great Lakes Symp. VLSI, pp. 73–78, 2006.

[22] Z. Li, X. Hong, Q. Zhou, S. Zeng, J. Bian, W. Yu, H. H. Yang, V. Pitchumani, and C.-K. Cheng, “Efficient thermal via planning approach and its application in 3-D floorplanning,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 4, pp. 645–658, Apr. 2007.

[23] Z. Li, X.-L. Hong, Q. Zhou, S. Zeng, H. Yang, V. Pitchumani, and C.-K. Cheng, “Integrating dynamic thermal via planning with 3D floorplanning algorithm,” Proc. Int’l Symp. Physical Design, pp. 178–185, 2006.

[24] Y. Huang, Q. Zhou, and Y. Cai, “Thermal via planning aware force-directed floorplanning for 3D ICs,” Int’l Conf. Application-Specific Integrated Circuit, pp. 751–753, 2009.

[25] E. Wong and S. Lim, “Whitespace redistribution for thermal via insertion in 3D stacked ICs,” Proc. Int’l Conf. Computer Design, pp. 267–272, 2007.

[26] X. Li, Y. Ma, X. Hong, S. Dong, and J. Cong, “LP based white space redistribution for thermal via planning and performance optimization in 3D ICs,” Proc. Asia and South Pacific Design Automation Conf., pp. 209–212, 2008.

[27] Y. Chen, H. Zhou, and R. Dick, “Integrated circuit white space redistribution for temperature optimization,” Proc. Int’l Conf. Design, Automation & Test in Europe, pp. 1–6, 2011.

[28] L. Xiao, S. Sinha, J. Xu, E. Young, “Fixed-outline thermal-aware 3D floorplanning,” Proc. Int’l Conf. Design Automation Conference, pp. 561–567, 2010.

[29] J. Cong, J. Wei, and Y. Zhang, “A thermal-driven floorplanning algorithm for 3D ICs,” Proc. Int’l Conf. Computer-Aided Design, pp. 306–313, 2004.

[30] J. Kung, I. Han, S. Sapatnekar, and Y. Shin, “Thermal signature: a simple yet accurate thermal index for floorplan optimization,” Proc. Design Automation Conf., pp. 108–113, 2011.

[31] NVIDIA CUDA. [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html

[32] P. Wilkerson, A. Raman, and M. Turowski, “Fast, automated thermal simulation of three-dimensional integrated circuits,” Int’l Society Conf. on Thermal Phenomena, vol. 1, pp. 706–713, Jun. 2004.

[33] W. Huang, “HotSpot - A chip and package compact thermal modeling methodology for VLSI design,” PhD Thesis, ECE, University of Virginia, 2007.

[34] W. K. Chu and W. H. Kao, “A three-dimensional transient electro thermal simulation system for IC’s,” Proc. Therminic Workshop, pp. 201–207, 1995.

[35] T.-Y. Wang, Y.-M. Lee, and C. C.-P. Chen, “3D thermal-ADI: an efficient chip-level transient thermal simulator,” Proc. International Symposium Physical Design, pp. 10–17, 2003.

[36] S. Logan and M. Guthaus, “Fast thermal-aware floorplanning using white-space optimization,” Proc. Int’l Conf. Very Large Scale Integration, pp. 65–70, 2009.

[37] Y.-S. Huang, Y.-H. Liu, and J.-D. Huang, “Layer-aware design partitioning for vertical interconnect minimization,” Proc. IEEE Computer Society Annual Symp. on VLSI, pp. 144–149, 2011.

[38] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning: enabling hierarchical design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 6, pp. 1120–1135, Dec. 2003.

[39] C.-Y. Huang, “Efficient TSV planning via congestion-aware block shifting in 3D floorplanning,” Master Thesis, EE, National Chiao Tung University, 2012.
