
Chapter 3 Thermal-aware Floorplan Algorithm

3.2 Proposed Thermal-aware Floorplan Algorithm

3.2.2 Over-heat Prevention Zone

We ran the algorithm introduced previously with 100 random seeds. The worst of the 100 floorplans is shown in Figure 15. The hotspot occurs at the origin, i.e., the bottom-left corner of the floorplan. This is not the result we expect, because LDZT should drive modules with high power density toward the center. We therefore analyzed this case and found two reasons why the hotspot occurs at the origin.

Figure 15. Temperature distribution

The first reason is that we use the sequence pair (SP) representation for the floorplan. SP places modules from bottom-left to top-right. Hence, the module density near the origin (bottom-left corner) is higher than near the other corners, as shown in Figure 16.

Figure 16. Modules are placed by using SP representation.

The second reason is the degree of perturbation. The cost function contains other terms besides temperature, and all terms together determine where a module is placed. Although the thermal cost favors moving a module away from the corner, the other terms may not. In brief, the other terms can restrict the degree of perturbation.

To move the hotspot away from the origin, we want modules with high power density to be placed away from the origin. Consider the SP representation (Γ⁺, Γ⁻). If module A precedes module B in both sequences, module A is placed to the left of module B.

Otherwise, if A precedes B in Γ⁺ but follows B in Γ⁻, module A is placed above module B.

Based on this property, the closer a module appears to the left end of both sequences, the closer it is placed to the origin (bottom-left corner).
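The two rules above can be written as a small lookup. The following C++ sketch is only an illustration: the names `Relation`, `indexOf`, and `relation` are hypothetical, modules are assumed to be numbered by integers, and the two mirror cases (right of, below) are filled in by symmetry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Relative placement of module a with respect to module b.
enum class Relation { LeftOf, RightOf, Above, Below, Same };

// Index of `module` inside `seq`, or seq.size() if it is absent.
std::size_t indexOf(const std::vector<int>& seq, int module) {
    for (std::size_t i = 0; i < seq.size(); ++i)
        if (seq[i] == module) return i;
    return seq.size();
}

// Sequence-pair rule from the text:
//   a before b in both sequences                  -> a is left of b
//   a before b in gammaPlus, after b in gammaMinus -> a is above b
// The two remaining cases are the mirror images (assumed by symmetry).
Relation relation(const std::vector<int>& gammaPlus,
                  const std::vector<int>& gammaMinus, int a, int b) {
    if (a == b) return Relation::Same;
    const std::size_t aPlus = indexOf(gammaPlus, a);
    const std::size_t bPlus = indexOf(gammaPlus, b);
    const std::size_t aMinus = indexOf(gammaMinus, a);
    const std::size_t bMinus = indexOf(gammaMinus, b);
    // Both modules must appear in both sequences.
    assert(aPlus < gammaPlus.size() && bPlus < gammaPlus.size());
    assert(aMinus < gammaMinus.size() && bMinus < gammaMinus.size());
    const bool beforeInPlus = aPlus < bPlus;
    const bool beforeInMinus = aMinus < bMinus;
    if (beforeInPlus && beforeInMinus) return Relation::LeftOf;
    if (beforeInPlus && !beforeInMinus) return Relation::Above;
    if (!beforeInPlus && beforeInMinus) return Relation::Below;
    return Relation::RightOf;
}
```

For example, with Γ⁺ = (0, 1, 2) and Γ⁻ = (1, 0, 2), module 0 is above module 1 and to the left of module 2.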

Based on this observation, we define the N leftmost positions (rooms) of the two sequences as the over-heat prevention zone (OHPZ), where N is the zone size. The purpose of the OHPZ is to keep modules with high power density out of these positions.

We then define an over-heat module as a module whose power density is higher than the average power density plus one standard deviation of the power density. We add the constraint that an over-heat module cannot be placed in the OHPZ.
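As a minimal sketch of this constraint, the C++ code below flags over-heat modules and checks a sequence pair against the OHPZ. It rests on assumptions not stated in the text: the standard deviation is the population form (divide by n), a module violates the constraint if it sits in the zone of either sequence, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Flags modules whose power density exceeds mean + one standard deviation.
// Assumption: population standard deviation (divide by n).
std::vector<bool> overHeatModules(const std::vector<double>& powerDensity) {
    assert(!powerDensity.empty());
    const double n = static_cast<double>(powerDensity.size());
    double mean = 0.0;
    for (double p : powerDensity) mean += p;
    mean /= n;
    double variance = 0.0;
    for (double p : powerDensity) variance += (p - mean) * (p - mean);
    variance /= n;
    const double threshold = mean + std::sqrt(variance);
    std::vector<bool> flags(powerDensity.size());
    for (std::size_t i = 0; i < powerDensity.size(); ++i)
        flags[i] = powerDensity[i] > threshold;
    return flags;
}

// True if none of the first zoneSize entries of seq is an over-heat module.
bool zoneIsClear(const std::vector<int>& seq, const std::vector<bool>& overHeat,
                 std::size_t zoneSize) {
    const std::size_t limit = std::min(zoneSize, seq.size());
    for (std::size_t i = 0; i < limit; ++i)
        if (overHeat[static_cast<std::size_t>(seq[i])]) return false;
    return true;
}

// Assumption: the OHPZ constraint is checked on both sequences separately.
bool respectsOhpz(const std::vector<int>& gammaPlus,
                  const std::vector<int>& gammaMinus,
                  const std::vector<bool>& overHeat, std::size_t zoneSize) {
    return zoneIsClear(gammaPlus, overHeat, zoneSize) &&
           zoneIsClear(gammaMinus, overHeat, zoneSize);
}
```

In an SA loop, a check of this kind could be used to reject a perturbation that moves an over-heat module into the zone; how the thesis enforces the constraint in practice is not specified here.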

Chapter 4 Parallel Floorplan Algorithm on GPGPU

In this chapter, we describe how we parallelize the floorplanner and how the parallel algorithm can approach the maximum speedup. As discussed in Section 1.3, global memory accesses have large latency, so we use global memory as little as possible. Global memory is needed in two situations: communication between blocks, and communication between the CPU and the GPU. The best way to avoid the first is to keep blocks from communicating with each other, so we assign independent tasks to different blocks. The second cannot be avoided: if the CPU does not send data to the GPU, the CUDA kernel functions have no data to work on, and if the GPU does not send data back, the CPU cannot receive the results. All we can do is reduce the frequency of communication and the amount of data communicated.

Two types of data are sent from the CPU to the GPU. The first type does not change during the SA procedure, such as the netlist, the power density of each module, and the thermal resistances. The second type changes during the SA procedure, such as the locations and the heights/widths of modules. To reduce the frequency of communication, we send the first type of data to the GPU once, before the SA procedure. During the SA procedure, we send only the second type of data, which reduces the amount of data communicated.

The terms of the cost function are area, wirelength, TSV count, temperature, and repulsion force. We implement each term as a separate kernel function, because each term is suited to a different number of blocks and threads.

We describe how we parallelize these terms in turn.

4.1 Parallel Floorplan – Area

The SP representation places modules from bottom-left to top-right. When a module is placed, its location is determined relative to the modules that have already been placed. Thus, module positions depend on each other during area evaluation.

Because the data are dependent, we cannot divide the algorithm into completely independent parts. Instead, we break up the for-loop of Algorithm 1, the pseudo-code of area evaluation in Parquet 4.5 [38], so that the loop body is performed by each thread. In this way, the time complexity of area evaluation is reduced from O(#modules²) to O(#modules).

Algorithm 1. The pseudo-code for area evaluation [38].

Area evaluation
    for( i = 1 to #modules )
        match[X[i]].x = i;
        match[Y[i]].y = i;
        Length[i] = 0;
    for( i = 1 to #modules )
        b = X[i];
        p = match[b].y;
        Position[b] = Length[p];
        t = Position[b] + weights(b);
        for( j = p to #modules )
            if( t > Length[j] ) Length[j] = t;
            else break;
    return Length[#modules];
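For reference, Algorithm 1 can be transcribed into sequential C++ as follows. This is a sketch under stated assumptions, not the Parquet source: indices are 0-based, `evaluateSpan` is a hypothetical name, and only the `match[].y` table is kept because the `.x` entries are not read inside this routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of Algorithm 1 with 0-based indices.
// X and Y are the two sequences (each a permutation of 0..n-1) and
// weights[b] is the width of module b. On return, position[b] holds the
// x-coordinate of module b and the result is the total packing width.
double evaluateSpan(const std::vector<int>& X, const std::vector<int>& Y,
                    const std::vector<double>& weights,
                    std::vector<double>& position) {
    const std::size_t n = X.size();
    assert(Y.size() == n && weights.size() == n);
    position.assign(n, 0.0);
    if (n == 0) return 0.0;

    // matchY[b] = index of module b in sequence Y.
    std::vector<std::size_t> matchY(n);
    for (std::size_t i = 0; i < n; ++i)
        matchY[static_cast<std::size_t>(Y[i])] = i;

    // length[j] = largest right edge among placed modules with Y-index <= j.
    std::vector<double> length(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t b = static_cast<std::size_t>(X[i]);
        const std::size_t p = matchY[b];
        position[b] = length[p];
        const double t = position[b] + weights[b];
        for (std::size_t j = p; j < n; ++j) {
            if (t > length[j]) length[j] = t;
            else break;  // length is non-decreasing, so later entries are fine
        }
    }
    return length[n - 1];
}
```

Under the usual sequence-pair convention, the y-coordinates and the total height can be obtained from the same routine by passing the first sequence in reverse order together with the module heights.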

In this algorithm, the independent elements are the x-coordinate evaluation, the y-coordinate evaluation, and the different layers. We can therefore evaluate the x and y coordinates of the modules on all layers simultaneously.

So the number of blocks is 2 × #layers. After the evaluation, we return only the footprint area to the CPU to reduce the communication time (the amount of data communicated). The module data, such as height/width and coordinates, are stored in global memory, because shared memory is cleared when the kernel function ends. The stored module data are used by the other kernel functions; for example, wirelength evaluation needs the module coordinates. Since all other kernel functions need the module coordinates, we evaluate area first.


4.2 Parallel Floorplan – Wirelength and TSV Count

We handle these two terms together because they use the same data: the netlist and the module coordinates. To reduce the number of transfers between shared memory and global memory, we evaluate both terms in a single kernel function.

In this algorithm, the independent element is the net. Because the wirelength of a net and the number of layers the net crosses are independent of other nets, we assign one net to one thread, which reduces the time complexity from O(total #degrees) to O(maximum #degrees of a net). We then sum the values of all threads to obtain the total wirelength and TSV count by the tree reduction technique, as shown in Figure 17. With this technique, the time complexity of the summation is reduced from O(#nets) to O(log #nets). Finally, we send only the total wirelength and TSV count to the CPU.

Figure 17. Tree reduction technique.
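The reduction pattern can be emulated sequentially to see where the logarithmic number of steps comes from. The C++ sketch below is an illustration only: `treeReduce` is a hypothetical name, the operation is passed as a parameter so that the same routine covers summation and maximum, and on a GPU the iterations of the inner loop would run in parallel, one per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sequential emulation of a tree reduction. In every round, element i is
// combined with element i + stride, then the stride doubles, so n values
// are reduced in ceil(log2(n)) rounds.
// Returns the reduced value and stores the number of rounds in `rounds`.
double treeReduce(std::vector<double> values,
                  const std::function<double(double, double)>& op,
                  std::size_t& rounds) {
    assert(!values.empty());
    rounds = 0;
    for (std::size_t stride = 1; stride < values.size(); stride *= 2) {
        // These iterations are independent of each other within one round.
        for (std::size_t i = 0; i + stride < values.size(); i += 2 * stride)
            values[i] = op(values[i], values[i + stride]);
        ++rounds;
    }
    return values[0];
}
```

Passing an addition gives the total used for wirelength and TSV count; passing a comparison that returns the larger operand gives a maximum.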

4.3 Parallel Floorplan – Temperature

4.3.1 Power Density Evaluation

The way to evaluate temperature is expressed in Equation (3). Before evaluating temperature, we first calculate the power density. The power density of a grid is the sum, over all modules, of each module's power density multiplied by its area ratio, i.e., the overlapping area between the module and the grid divided by the grid area, as shown in Equation (14). Different grids are evaluated simultaneously: one thread handles one grid, and each thread scans all modules on the layer where its grid resides to calculate the power density. Hence, the time complexity is reduced from O(#grids × #modules) to O(#modules). In this algorithm, the number of blocks is determined by the number of layers, because different layers are independent. Moreover, each block needs only the module data of one layer, so the module data of that layer can be kept in the shared memory of the block.
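The per-grid work described above can be sketched in C++ as follows. The types `Rect` and `Module` and the function names are hypothetical, modules and grids are assumed to be axis-aligned rectangles, and the loop corresponds to what a single thread would do for its grid.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Axis-aligned rectangle given by its lower-left corner and its size.
struct Rect { double x, y, w, h; };

// Module on one layer: its outline and its power density.
struct Module { Rect box; double powerDensity; };

// Overlapping area of two rectangles (0 if they do not intersect).
double overlapArea(const Rect& a, const Rect& b) {
    const double w = std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x);
    const double h = std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y);
    return (w > 0.0 && h > 0.0) ? w * h : 0.0;
}

// Work of one thread: scan all modules on the grid's layer and accumulate
// powerDensity * (overlap area / grid area).
double gridPowerDensity(const Rect& grid,
                        const std::vector<Module>& layerModules) {
    assert(grid.w > 0.0 && grid.h > 0.0);
    const double gridArea = grid.w * grid.h;
    double sum = 0.0;
    for (const Module& m : layerModules)
        sum += m.powerDensity * overlapArea(m.box, grid) / gridArea;
    return sum;
}
```

A module that does not touch the grid contributes zero, so scanning every module of the layer is correct, if not minimal; that scan is the O(#modules) cost per thread mentioned above.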

4.3.2 Temperature Evaluation

After calculating the power density, we calculate the temperature immediately.

Each grid evaluates its temperature independently. The temperatures are then accumulated layer by layer, so the maximum temperature occurs on the bottom layer. We use the tree reduction technique to find the maximum temperature. Unlike the reduction used for wirelength/TSV count, the operation performed here is comparison rather than addition: the maximum is found by comparing pairs of temperatures iteratively. So the time complexity is reduced from O(#grids) to O(log #grids), and we return only the maximum temperature.





4.4 Parallel Floorplan – Repulsion Force

The repulsion force evaluations of different modules are independent, so we use one thread to calculate the repulsion force of one module. We evaluate the repulsion force layer by layer; modules on other layers are not involved. This is similar to power density evaluation, so we handle it in the same way: the number of blocks is determined by the number of layers, for the shared-memory reason discussed previously.

Each thread scans all modules on one layer to calculate the repulsion force, so the time complexity is reduced from O(#modules²) to O(#modules). We then sum the repulsion forces by the tree reduction technique, which reduces the time complexity of the summation from O(#modules) to O(log #modules). Finally, we return only the total repulsion force.

Chapter 5 Experiments

5.1 Environmental Setup

Figure 18. Experiments flow (boxes: module information and netlist, construct LDZT, partition, floorplan, temperature evaluation, floorplan result, thermal profile).

The largest cases from the GSRC benchmark are shown in Table 3. The overall flow is shown in Figure 18 and is described as follows. Before floorplanning, we first construct the thermal model, LDZT. Then, the initial layer of each module is determined by iLap [37]. The floorplanner we use is Parquet 4.5 [38], a 2D floorplanner, which [39] modified for 3D floorplanning. The experimental settings are as follows. The fixed-outline constraints are 20% whitespace for the 4-layer design, and the area ratio is 1. The proposed floorplan algorithm is implemented in C++ in a Linux environment. Finally, we use a compact resistive thermal model for thermal simulation after floorplanning; its details are shown in Table 4.

Thickness Substrate 30

Interconnect sublayer 150

5.2 Experimental Results

In this section, we first show the results of floorplan quality, including area, wirelength, TSV count, and temperature. Next, we present the runtime analysis on the CUDA platform.

5.2.1 Quality

In this part, we compare our work with related work [29], as shown in the following tables. Table 5, Table 6, and Table 7 show the results for circuits n100, n200, and n300, respectively. The first row shows the zone size; the number in parentheses is zone size / (#modules / #layers). ZT represents Cong's work [29] without the OHPZ technique, which implies a zone size of 0. The other columns show the results for various zone sizes in LDZT. Max_T denotes the maximum temperature of a single floorplan, and the Max_T rows give the maximum, minimum, average, and standard deviation of Max_T over the floorplans generated with 100 different random seeds. The bottom two rows show the average wirelength and TSV count of the 100 floorplans.

Figure 19, Figure 20, and Figure 21 show the distribution of the thermal data in Table 5, Table 6, and Table 7. The top endpoint of each line is the maximum of Max_T, and the bottom endpoint is the minimum of Max_T. The label on each line is the average of Max_T.

Table 5. Experimental results – n100.

Zone size      ZT       0(0%)    3(10%)   5(20%)   8(30%)   10(40%)
Max_T  Max     163.2    160.4    154.6    150.0    151.5    155.3
Max_T  Min     130.1    133.6    137.2    139.3    135.4    135.1
Max_T  Avg     148.0    147.0    145.7    144.7    145.0    146.6
Max_T  Std     5.6      5.1      2.8      2.7      2.5      3.0
WL             131554   131486   130930   131207   131412   131772
TSV            703.2    702.7    699.3    693.4    704.7    701.3

Table 6. Experimental results – n200.

Zone size      ZT       0(0%)    5(10%)   10(20%)  15(30%)  20(40%)
Max_T  Max     193.8    191.4    191.7    192.1    189.4    189.5
Max_T  Min     181.3    180.7    178.6    179.8    181.2    179.8
Max_T  Avg     186.3    185.3    184.6    184.4    184.1    184.5
Max_T  Std     2.7      2.2      1.9      1.7      1.4      1.6
WL             241258   239184   240007   240083   240619   242574
TSV            1540.4   1527.6   1522.7   1520.0   1516.6   1516.8

Table 7. Experimental results – n300.

Zone size      ZT       0(0%)    8(10%)   15(20%)  23(30%)  30(40%)
Max_T  Max     203.7    199.1    197.5    196.2    196.7    197.8
Max_T  Min     189.3    189.1    187.5    189.2    188.9    189.3
Max_T  Avg     193.7    193.2    192.8    192.7    193.3    193.5
Max_T  Std     2.5      1.6      1.8      1.3      1.2      1.3
WL             343863   340112   341874   343885   345545   347358
TSV            1592.5   1570.4   1565.6   1566.8   1564.2   1557.2

Figure 19. Thermal data – n100.

Figure 20. Thermal data – n200.

Figure 21. Thermal data – n300.

From the data above, the TSV count and wirelength of our work are similar to those of Cong's work. Regarding temperature, Figure 19 – Figure 21 show that the range of the maximum temperature becomes smaller with our method. This means our method is stable; in other words, it is less sensitive to the random seed. If the zone size is too small, over-heat modules can still be placed close to the bottom-left corner. If the zone size is too large, we cannot guarantee that the modules in the OHPZ are located where we intend. Therefore, from the results, we recommend a zone size of 20% – 30%.

After analyzing the thermal results, we consider that observing only the maximum temperature of each floorplan is not enough. Suppose two floorplans have the same maximum temperature, but one is cool everywhere except at the hotspot while the other is hot everywhere. Judged by maximum temperature alone, the two floorplans are equivalent, but the former is obviously better than the latter. We therefore analyze the grids on the bottom layer whose temperatures are in the top 5%. The results are shown in Figure 22 – Figure 24, which give the number of grids in the top 5% temperature range for Cong's work and for our work with a 20% zone size. The number of grids in the high temperature range is smaller for our work than for Cong's. So our work not only has a lower maximum temperature but also has a more uniform temperature distribution.

Figure 22. Thermal data II – n100.

Figure 23. Thermal data II – n200.

Figure 24. Thermal data II – n300.

5.2.2 Runtime

The following tables and figure show the runtime results. The CPU/GPU labels in the header indicate whether the floorplanner runs on the CPU or the GPU. The first column lists the terms of the cost function, and the following columns show the runtime, runtime ratio, and speedup of these terms on the CPU/GPU. The rightmost column, #Pstr, shows the number of streaming processors, which is the ideal upper bound of the speedup. Figure 25 plots the data of Table 8 – Table 10.

Table 8. Runtime – n100.

                     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area                 11.4          16.9         0.8          21.1     1.9           256
WL                   4.2           6.2          1.3          4.8      3.0           64
Temp                 81.1          120.1        2.8          42.9     6.6           128
RF                   3.3           4.9          1.5          3.3      3.5           128
Total (evaluation)   100.0         148.1        6.4          23.1     –             –
Tovh                 –             –            20.8         –        48.8          –
Memcpy               –             –            15.4         –        36.2          –
Total (overall)      –             148.1        42.6         3.5      100.0         –

Table 9. Runtime – n200.

                     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area                 8.5           78.9         4.8          16.4     4.4           256
WL                   2.6           23.9         4.5          5.3      4.1           64
Temp                 85.3          792.6        18.2         43.5     16.6          128
RF                   3.6           33.2         9.0          3.7      8.2           128
Total (evaluation)   100.0         928.6        36.5         25.4     –             –
Tovh                 –             –            41.8         –        38.1          –
Memcpy               –             –            31.4         –        28.6          –
Total (overall)      –             928.6        109.7        8.5      100.0         –

Table 10. Runtime – n300.

                     CPU Ratio(%)  CPU Time(s)  GPU Time(s)  Speedup  GPU Ratio(%)  #Pstr
Area                 6.7           190.1        14.4         13.2     6.9           256
WL                   1.6           46.1         8.0          5.8      3.8           64
Temp                 87.7          2479.6       56.8         43.7     27.3          128
RF                   4.0           112.9        27.0         4.2      13.0          128
Total (evaluation)   100.0         2828.7       106.2        26.6     –             –
Tovh                 –             –            55.3         –        26.5          –
Memcpy               –             –            46.9         –        22.5          –
Total (overall)      –             2828.7       208.4        13.6     100.0         –

Figure 25. (a) Runtime – n100 (b) Runtime – n200 (c) Runtime – n300.

The speedup on the GPU differs from the number of streaming processors, because the algorithm is not completely parallel and because of the CUDA drawbacks discussed below. First, consider the speedup of area evaluation. We evaluate coordinates module by module and obtain the final area only when all modules are done. That is, the data are highly dependent, and threads may sit idle waiting for each other until all threads finish their work. Thus, the speedup decreases because of the data dependence.

Second, wirelength evaluation contains a large number of branch instructions, so its performance is reduced by the CUDA property introduced previously. Next, the gap between the speedup and the number of streaming processors is smaller for temperature evaluation than for the other evaluations. This is because the temperature evaluation of each grid is independent, as described before, so temperature evaluation has better parallelism than the others. However, the computations performed after power density evaluation slow it down, so the final speedup is reduced. Last, we analyze the speedup of the repulsion force. In general, the number of streaming processors bounds the maximum speedup. However, as introduced before, the number of dividers on each multiprocessor is much smaller than the number of threads on each multiprocessor, and repulsion force evaluation involves division, so the maximum speedup is determined by the total number of dividers. The total speedup of these evaluations is approximately 25.

Finally, the communication time and kernel overhead reduce the final speedup. The communication time and kernel overhead increase linearly with the design size, whereas the evaluation time increases faster than the design size. Hence, as the design grows, these two overheads account for a smaller fraction of the total runtime. This is why the final speedup increases with the design size.
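The relationship between the evaluation speedup, the overall speedup, and the two overheads can be checked with a few lines of arithmetic. The C++ sketch below uses hypothetical names (`GpuRuntime`, `kernelSpeedup`, `overallSpeedup`, `overheadShare`) and assumes, as the runtime tables suggest, that the GPU total is the sum of the evaluation kernels, the kernel overhead (Tovh), and the memory copies (Memcpy).

```cpp
#include <cassert>

// Runtime components of one GPU run, in seconds.
struct GpuRuntime {
    double kernels;   // sum of the four evaluation kernels
    double overhead;  // kernel overhead (Tovh)
    double transfer;  // CPU-GPU communication (Memcpy)
};

// Speedup of the evaluation kernels alone.
double kernelSpeedup(double cpuSeconds, const GpuRuntime& gpu) {
    assert(gpu.kernels > 0.0);
    return cpuSeconds / gpu.kernels;
}

// Overall speedup once kernel overhead and communication are included.
double overallSpeedup(double cpuSeconds, const GpuRuntime& gpu) {
    const double total = gpu.kernels + gpu.overhead + gpu.transfer;
    assert(total > 0.0);
    return cpuSeconds / total;
}

// Fraction of the GPU runtime spent outside the evaluation kernels.
double overheadShare(const GpuRuntime& gpu) {
    const double total = gpu.kernels + gpu.overhead + gpu.transfer;
    assert(total > 0.0);
    return (gpu.overhead + gpu.transfer) / total;
}
```

With the n100 values (148.1 s on the CPU; 6.4 s, 20.8 s, and 15.4 s on the GPU), the evaluation speedup is about 23 and the overall speedup about 3.5, with roughly 85% of the GPU runtime spent on overhead and communication. With the n300 values (2828.7 s; 106.2 s, 55.3 s, and 46.9 s), that share drops to about 49% and the overall speedup rises to about 13.6.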

Chapter 6 Conclusion

In this thesis, we propose a fast location-dependent thermal model and a thermal-aware floorplan algorithm, and we implement the algorithm on both CPU and GPGPU.

LDZT, the fast thermal model we propose, captures the location-dependent property without increasing runtime. We also propose two strategies to prevent hotspots. First, we refine the repulsion force to repel modules with high power density. This technique also compensates for the thermal coupling that is lost by omitting lateral thermal resistances in LDZT. Second, we define a zone, named the over-heat prevention zone, to prevent the bottom-left corner of the floorplan from over-heating during the SA procedure. An over-heat module, whose power density is higher than the average power density plus one standard deviation of the power density, cannot be placed in this zone. With these strategies, we reduce the maximum/average temperature and decrease the number of grids in the high temperature range. Additionally, the floorplanner is insensitive to random seeds, which indicates that our method is robust. Finally, we use CUDA to reduce the runtime and obtain a 3.5X – 13.6X speedup. The speedup becomes more significant as the size of the design grows.

References

[1] “International Technology Roadmap for Semiconductor,” Semiconductor Industry Association 2005–2010.

[2] G. Metze, M. Khbels, N. Goldsman, and B. Jacob, “Heterogeneous integration,” Tech Trend Notes, vol. 12, no. 2, p. 3, 2003.

[3] A. W. Topol, D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong, “Three-dimensional integrated circuits,” IBM J. of Research and Development, vol. 50, no. 4.5, pp. 491–506, Jul. 2006.

[4] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: a novel chip design for improving deep submicron interconnect performance and systems-on-chip integration,” Proc. IEEE, vol. 89, no. 5, pp. 602–633, May 2001.

[5] R. Tummala and V. Madisetti, “System on chip or system on package?” IEEE Design & Test of Computers, vol. 16, no. 2, pp. 48–56, Apr.–Jun. 1999.

[6] P. H. Shiu and K. S. Lim, “Multi-layer floorplanning for reliable system-on-package,” Proc. Int’l Symp. Circuits and System, pp. 23–26, 2004.

[7] S. Spiesshoefer, Z. Rahman, G. Vangara, S. Polamreddy, S. Burkett, and L. Schaper, “Process integration for through-silicon vias,” J. of Vacuum Science and Technology A, vol. 23, no. 4, pp. 824–829, Jul. 2005.

[8] SOCcentral. [Online]. Available: http://www.soccentral.com

[9] S. Das, A. P. Chandrakasan, and R. Reif, “Calibration of Rent's rule models for three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol. 12, no. 4, pp. 359–366, Apr. 2004.

[10] A. Rahman and R. Reif, “System-level performance evaluation of three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol. 8, no. 6, pp. 671–678, Dec. 2000.

[11] S. Das, A. Fan, K. Chen, C. S. Tan, N. Checka, and R. Reif, “Technology, performance, and computer-aided design of three-dimensional integrated circuits,” Proc. Int’l Symp. Physical Design, pp. 108–115, 2004.

[12] I. Kaya, S. Salewski, M. Olbrich, and E. Barke, “Wirelength reduction using 3D physical design,” Int’l Workshop Integrated Circuit System Design, pp. 453–462, 2004.

[13] I. Loi, S. Mitra, T. H. Lee, S. Fujita, and L. Benini, “A low-overhead fault tolerance scheme for TSV-based 3D network on chip links,” Proc. Int’l Conf. Computer-Aided Design, pp. 598–602, 2008.

[14] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon, “Demystifying 3D ICs: the pros and cons of going vertical,” IEEE Design & Test of Computers, vol. 22, no. 6, pp. 498–510, Nov.–Dec. 2005.

[15] D. F. Wong and C. L. Liu, “A new algorithm for floorplan design,” Proc. Design Automation Conf., pp. 101–107, 1986.

[16] R. Otten, “Automatic floorplan design,” Proc. Design Automation Conf., pp. 261–267, 1982.

[17] Y.-C. Chang, Y.-W. Chang, G.-M. Wu, and S.-W. Wu, “B*-trees: A new representation for nonslicing floorplans,” Proc. Design Automation Conf., pp. 458–463, 2000.

[18] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “VLSI module placement based on rectangle-packing by the sequence pair,” IEEE Trans. Computer Aided Design of Integrated Circuits and Systems, vol. 15, no. 12, pp. 1518–1524, Dec. 1996.

[19] P.-N. Guo, C.-K. Cheng, and T. Yoshimura, “An O-tree representation of nonslicing floorplan and its applications,” Proc. Design Automation Conf., pp. 268–273, 1999.

[20] Z. Li, X. Hong, Q. Zhou, Y. Cai, J. Bian, H. H. Yang, V. Pitchumani, and C.-K. Cheng, “Hierarchical 3D floorplanning algorithm for wirelength optimization,” IEEE Trans. Circuits and Syst. I: Regular Papers, vol. 53, no. 12, pp. 2637–2646, Dec. 2006.

[21] T. Yan, Q. Dong, Y. Takashima, and Y. Kajitani, “How does partitioning matter for 3D floorplanning,” Proc. ACM Great Lakes Symposium on VLSI, pp. 73–78, 2006.

[22] Z. Li, X. Hong, Q. Zhou, S. Zeng, J. Bian, W. Yu, H. H. Yang, V. Pitchumani, and C.-K. Cheng, “Efficient thermal via planning approach and its application in
