
Chapter 1 Introduction

1.4 Contribution

In this thesis, we first show that power optimization alone is not sufficient for temperature optimization during placement and routing. We then point out that, even though heat in 3D ICs is primarily dissipated along the vertical path, the lateral heat flow cannot be neglected. These observations motivate a thermal-aware placement and routing algorithm named TherWare. Both the placement and the routing algorithms aim to minimize the maximum temperature, the temperature deviation and the maximum temperature gradient, while keeping the delay and runtime overhead within a few percent. TherWare placement is based on simulated annealing; its thermal cost integrates three guidelines: distributing power uniformly, finding better positions for potentially hotter tiles, and preventing an excessive increase of interconnect power. TherWare routing is based on the Pathfinder negotiated congestion algorithm [27] and takes the power overhead and the power distribution into consideration.

1.5 Thesis Organization

The remainder of this thesis is organized as follows. In Chapter 2, the EDA flow and the problem formulation are presented first; then a set of fine-grained thermal resistive models with different granularities, named FG-8, FG-4 and FG-2, is proposed. Chapter 3 and Chapter 4 introduce three related works along with the motivation and propose the TherWare placement and routing algorithms. Chapter 5 presents the experimental environment, two experimental results and a case study.

Finally, the concluding remarks are given in Chapter 6.


Chapter 2

Preliminaries

In this chapter, we first introduce the typical EDA flow and then present the thermal-aware EDA flow used by our framework. Finally, we describe the problem formulation of this thesis.

2.1 EDA Flow

2.1.1 Typical EDA Flow

[Figure 8 flow: Circuit → Technology mapping and packing → Partitioning and layering → Placement → Routing → P&R result]

Figure 8. Typical EDA flow.

Figure 8 shows the typical EDA flow [14][17]. The first stage is technology mapping and packing: LUTs and registers are packed into basic logic elements (BLEs), and multiple BLEs are then clustered into a netlist of CLBs. In the second stage, these CLBs are divided into several partitions, and each partition is assigned to a different layer so that the number of TSVs is minimized. The third stage places the CLBs onto the available hardware; its goal is to minimize wirelength and delay. The final stage determines, based on the placement result, which routing resources are used to connect all the CLB input and output pins required by the circuit. Nevertheless, thermal analysis shows that the placement and routing result obtained from this flow contains many hotspots, because thermal issues are not considered during synthesis.

2.1.2 Thermal-Aware EDA Flow

[Figure 9 flow: Circuit → Technology mapping and packing → Partitioning and layering → Thermal-aware placement → Thermal-aware routing → P&R result, with two additional steps: net activity analysis and CLB power calculation]

Figure 9. Thermal-aware EDA flow.

Figure 9 shows the thermal-aware EDA flow. It differs from the typical EDA flow in that it needs two additional pieces of information: i) the switching activity of nets; ii) the power consumption of CLBs. With this information, the thermal-aware placer and router can identify which tiles are more likely to generate hotspots, either because they are crossed by nets with higher switching activity or because they hold a CLB with higher power consumption. The thermal-aware placer and router then place the CLBs and route the nets in a way that minimizes the temperature metrics, namely maximum temperature, temperature deviation and maximum temperature gradient. As a result, the thermal evaluation yields a better temperature profile.


2.2 Problem Formulation

Given a netlist of CLBs, a 3D FPGA architecture, the switching activity of nets and the power consumption of CLBs, our goal is to find a placement result and a routing result under the constraints that i) each logic hardware location holds at most one CLB; ii) each routing resource is occupied by at most one net. The primary objective of this thesis is to minimize the maximum temperature, the temperature deviation and the maximum temperature gradient with acceptable delay overhead.

2.3 Thermal Model

2.3.1 Thermal Modeling of 3D IC

Figure 10. Thermal resistive model.

[Figure 10 labels: Layer 1, Layer 2, ..., Layer N; substrate (thinned substrate in the stacked dies), interconnect sublayer, bonding interface; heat sink; a tile treated as a grid; X, Y and Z axes]

Figure 10 illustrates a typical 3D IC stacking configuration. Based on the thermal resistive model in [18], a single die is composed of three sublayers: the substrate where active devices reside, the interconnect sublayer where metal wires and vias reside, and the bonding interface that attaches two adjacent silicon dies. Heat generated by active devices is carried from the substrates to a heat sink and then dissipated into the ambient air (25℃). Each die in a 3D design is first partitioned into a number of small regular grids, each with its own power density. Each grid is further divided into several nodes according to the number of sublayers. A thermal resistance, whose value is determined by both the grid size and the thermal properties of the material, is then attached between two adjacent nodes. Finally, the thermal resistive model applies thermal-electrical duality (as shown in Table 1) to generate a steady-state temperature profile for the given design.
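The duality described above can be illustrated with a small nodal-analysis sketch: temperatures play the role of node voltages, dissipated power the role of current sources, and thermal resistances the role of resistors. The function names, the tiny two-node network and its resistance and power values are hypothetical and chosen only for illustration; the thesis itself builds a much larger network and solves it with a circuit simulator.

```python
from typing import Dict, List, Tuple

AMBIENT_C = 25.0  # ambient temperature stated in the text


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def steady_state(n_nodes: int,
                 resistors: List[Tuple[int, int, float]],
                 to_ambient: Dict[int, float],
                 power: List[float]) -> List[float]:
    """Return node temperatures in degrees Celsius.

    resistors:  (node_a, node_b, R_th in K/W) between two nodes
    to_ambient: node -> R_th in K/W towards the ambient (e.g. via the heat sink)
    power:      heat injected at each node in W
    """
    g = [[0.0] * n_nodes for _ in range(n_nodes)]  # thermal conductance matrix
    for a, b, r in resistors:
        c = 1.0 / r
        g[a][a] += c
        g[b][b] += c
        g[a][b] -= c
        g[b][a] -= c
    for node, r in to_ambient.items():
        g[node][node] += 1.0 / r
    rise = solve_linear(g, power)  # temperature rise above ambient
    return [AMBIENT_C + t for t in rise]


# Hypothetical example: node 0 sits next to the heat sink, node 1 is stacked on it.
temps = steady_state(2, [(0, 1, 3.0)], {0: 2.0}, [1.0, 2.0])
print(temps)
```

In this toy case all 3 W leave through the 2 K/W path to ambient, so node 0 rises by 6 K, and the 2 W from node 1 cross the 3 K/W resistor, adding another 6 K at node 1.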

Since our application targets 3D FPGAs, it is natural to regard a tile as a grid because of the regularity of FPGAs. The power consumption of each tile therefore consists of logic power and interconnect power. The logic power of a tile is contributed only by the CLB placed in it. The interconnect power of a tile is contributed by three elements: the wire segment, the switch box (SB) and the connection box (CB); if a wire segment in a tile is occupied by a net, all three elements consume power because of the hardware architecture.

Table 1. Thermal-electrical duality [19].

| Thermal quantity | Unit | Electrical quantity | Unit |
|---|---|---|---|
| T, Temperature difference | K | V, Voltage difference | V |
| P, Power density | W | I, Current source | A |
| Rth, Thermal resistance | K/W | R, Electrical resistance | Ω |

2.3.2 Proposed Fine-Grained Thermal Model

In this thesis, we assume that the target 3D FPGAs are implemented in 45 nm technology; the horizontal channel width is 32, and each vertical channel also contains 32 TSVs [20]; the pitch of each TSV is 6 μm [1]. Following [17], we estimate the area of each tile to be 47.8×47.8 μm². Other related parameters (e.g., the thickness of each sublayer and the thermal conductivity of each material) are set based on [21][22].

However, in the 3D era, the presence of TSVs complicates accurate thermal modeling of 3D FPGAs, since the thermal properties of all sublayers in a die change substantially. Based on the settings introduced above, we estimate that TSVs occupy 50.4% of the tile area. We therefore assume that TSVs account for nearly half the area of a single tile and are uniformly distributed within it. Consequently, we further partition a tile into an array of finer grids and construct a set of fine-grained thermal models. As shown in Figure 11, to examine the effect of grid granularity, we divide a tile into a 2×2, 4×4 and 8×8 grid array and name the resulting models FG-2, FG-4 and FG-8, respectively.
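The 50.4% figure can be reproduced as a plausibility check, under the assumption (not spelled out in the text) that each of the 32 TSVs reserves a square footprint of one pitch per side. The function name is hypothetical.

```python
TSV_COUNT = 32        # TSVs per vertical channel, from the text
TSV_PITCH_UM = 6.0    # TSV pitch in micrometres, from the text
TILE_SIDE_UM = 47.8   # tile side length in micrometres, from the text


def tsv_area_ratio(count: int, pitch_um: float, tile_side_um: float) -> float:
    """Fraction of the tile area reserved for TSVs.

    Assumption: every TSV occupies a pitch-by-pitch square.
    """
    return count * pitch_um ** 2 / tile_side_um ** 2


ratio = tsv_area_ratio(TSV_COUNT, TSV_PITCH_UM, TILE_SIDE_UM)
print(f"{ratio:.1%}")  # 32 * 36 / 2284.84
```

With these inputs the ratio is 1152 / 2284.84, which rounds to the 50.4% quoted above.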

Figure 11. Proposed fine-grained thermal model.

2.3.3 Comparisons

To demonstrate the accuracy of the proposed fine-grained thermal models with different granularities, FG-8, FG-4 and FG-2, we first perform random logic mapping [...] x/y-dimension); the logic utilization is set to 75%; the power of each mapped tile is set to the product of the power density of 2×10^6 W/m² [21] and the tile area. After a thermal resistive network is built, HSPICE is invoked to obtain the corresponding temperature profile. Every value reported below is an average over 5 random logic mapping runs.

Figure 12. Node-to-node RMSE and MAD against FG-8.
[Plot: RMSE and MAD (0% to 18%) versus number of layers (1 to 8); series FG-4: RMSE, FG-4: MAD, FG-2: RMSE, FG-2: MAD]

Figure 13. Node-to-node correlation against FG-8.
[Plot: correlation (98.5% to 100.0%) versus number of layers (1 to 8); series FG-4, FG-2]

Figure 12 and Figure 13 report the root mean square error (RMSE), the maximum absolute difference (MAD) and the correlation to show node-to-node differences with respect to FG-8. These figures show that FG-4 becomes only slightly less accurate as the number of layers increases, whereas FG-2 is noticeably less accurate than FG-4. For FG-4/FG-2, the RMSE is less than 2.5%/6.7%, the MAD is less than 3.9%/14.0%, and the correlation is more than 99.8%/98.5%. Moreover, FG-8 takes 144.6 seconds on average to produce a temperature profile, while FG-4/FG-2 requires only 19.9/17.0 seconds, a 7.3/8.5 times speedup over FG-8.

From the above results, we conclude that FG-4 achieves a large runtime speedup with only a small loss in accuracy. Therefore, FG-4 is used as our thermal model.
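The three node-to-node metrics used in this comparison can be sketched as follows. The sketch assumes that both temperature profiles have already been sampled at the same nodes and returns absolute values; how the thesis normalizes RMSE and MAD into percentages is not stated here, so that step is left out. The sample profiles are made up for illustration, while the runtimes are the ones reported above.

```python
import math
from typing import Sequence


def rmse(ref: Sequence[float], est: Sequence[float]) -> float:
    """Root mean square error between two node temperature profiles."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref))


def mad(ref: Sequence[float], est: Sequence[float]) -> float:
    """Maximum absolute difference between two node temperature profiles."""
    return max(abs(r - e) for r, e in zip(ref, est))


def correlation(ref: Sequence[float], est: Sequence[float]) -> float:
    """Pearson correlation coefficient between two profiles."""
    n = len(ref)
    mean_r = sum(ref) / n
    mean_e = sum(est) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(ref, est))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_r * var_e)


# Hypothetical profiles in degrees Celsius (reference model vs. coarser model).
reference = [100.0, 110.0, 120.0]
coarse = [101.0, 109.0, 122.0]
print(rmse(reference, coarse), mad(reference, coarse), correlation(reference, coarse))

# Speedups implied by the reported average runtimes (seconds).
speedup_fg4 = 144.6 / 19.9
speedup_fg2 = 144.6 / 17.0
print(round(speedup_fg4, 1), round(speedup_fg2, 1))
```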


Chapter 3

Placement Algorithm

3.1 Related Works

3.1.1 TPR

For 3D FPGAs, the first backend tool is three-dimensional place and route (TPR) [14], which supports timing-driven placement and routing. In TPR, the placement approach is based on simulated annealing; logic blocks are randomly selected and swapped or moved during placement until the maximum number of iterations is reached.

A cost function is used to evaluate the quality of a placement result, as shown in Equation (1).

\[ \mathit{Cost} = \alpha \cdot \mathit{Cost}_{\mathit{Wire}} + \beta \cdot \mathit{Cost}_{\mathit{Delay}} \tag{1} \]

Equation (1) aims to minimize wirelength and delay; these two costs are calculated by a net semi-perimeter wirelength estimator and a timing analyzer, respectively. The factors α and β trade off wirelength against delay.

3.1.2 3D MEANDER

3D MEANDER [23] is another design framework for 3D FPGAs, and it supports thermal-aware placement and routing. Because 3D MEANDER takes thermal issues into consideration, Equation (1) is extended to Equation (2), and the thermal cost in Equation (2) is given by Equation (3).

\[ \mathit{Cost} = \alpha \cdot \mathit{Cost}_{\mathit{Wire}} + \beta \cdot \mathit{Cost}_{\mathit{Delay}} + \gamma \cdot \mathit{Cost}_{\mathit{Thermal}} \tag{2} \]

\[ \mathit{Cost}_{\mathit{Thermal}} = \sum_{i \in \mathit{Net}} \mathit{Activity}(i) \cdot \mathit{Cost}_{\mathit{Wire}}(i) \tag{3} \]

The idea of the 3D MEANDER placement algorithm is to minimize interconnect dynamic power, because heat originates from power consumption; that is, lower power may result in lower temperature. According to Equation (3), the thermal cost is the sum over all nets of switching activity multiplied by wirelength; thus, nets with higher switching activity are shortened by placing the CLBs they connect close to each other, which leads to lower power consumption.
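The activity-weighted wirelength cost of Equation (3) and the weighted sum of Equation (2) can be sketched as follows. The wirelength estimate is a plain half-perimeter of the 3D bounding box; how TPR or 3D MEANDER actually weight the vertical dimension is not described here, so treating all three axes equally is an assumption. All names, the sample netlist and the weight values are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]  # (x, y, layer)


def half_perimeter(pins: List[Coord]) -> int:
    """Semi-perimeter of the bounding box enclosing all pins of a net."""
    return sum(max(p[d] for p in pins) - min(p[d] for p in pins) for d in range(3))


def thermal_cost(nets: Dict[str, List[str]],
                 activity: Dict[str, float],
                 placement: Dict[str, Coord]) -> float:
    """Sum over nets of switching activity times estimated wirelength."""
    return sum(activity[net] * half_perimeter([placement[clb] for clb in clbs])
               for net, clbs in nets.items())


def total_cost(wire: float, delay: float, thermal: float,
               alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum of wirelength, delay and thermal cost."""
    return alpha * wire + beta * delay + gamma * thermal


# Hypothetical placement with two nets.
placement = {"a": (0, 0, 0), "b": (3, 1, 0), "c": (0, 2, 1)}
nets = {"n1": ["a", "b"], "n2": ["a", "c"]}
activity = {"n1": 0.5, "n2": 0.1}

cost_thermal = thermal_cost(nets, activity, placement)  # 0.5 * 4 + 0.1 * 3
print(cost_thermal)
```

Because net n1 switches five times as often as n2, shortening n1 reduces this cost far more than shortening n2, which is exactly the pressure the text describes.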

3.1.3 Z-tile

Figure 14. Construction of Z-tile model.

Over the past decade, the Z-tile model has been one of the most widely used simplified thermal models, as depicted in Figure 14. The authors of [24] observed that heat in 3D ICs is primarily dissipated along the vertical path; therefore, to simplify the thermal model, all lateral heat flows are intentionally ignored. This is also why the simplified model is named the Z-tile model. By omitting all lateral thermal resistances, the Z-tile model enables fast temperature evaluation by Equation (4).

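[Figure 14 labels: a single Z-tile with power sources P1 to P4, thermal resistances R1 to R4 and Rsink, and temperature T]

\[ T = \sum_{i=1}^{4} \left( R_i \sum_{j=i}^{4} P_j \right) + R_{sink} \sum_{i=1}^{4} P_i \tag{4} \]

\[ \mathit{Cost}_{\mathit{Thermal}} = T_{max} \tag{5} \]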

For example, Figure 14 contains four single Z-tiles. The temperature of each single Z-tile is calculated, and the highest value among all single Z-tiles is used as the thermal cost (Equation (5)) to minimize the maximum temperature. By substituting this thermal cost into Equation (2), the Z-tile model has been widely used in many thermal-aware 3D ASIC design flows.
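A minimal sketch of this evaluation is given below. It assumes that layer 1 is the layer adjacent to the heat sink, that each vertical resistance carries the heat of its own layer plus all layers farther from the sink, and that the returned value is a temperature rise above ambient. The function names and the numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple


def z_tile_temperature(powers: Sequence[float],
                       resistances: Sequence[float],
                       r_sink: float) -> float:
    """Temperature rise at the layer farthest from the heat sink.

    powers[i] and resistances[i] belong to layer i + 1; index 0 is assumed
    to be the layer next to the heat sink. Lateral heat flow is ignored.
    """
    rise = r_sink * sum(powers)  # all heat leaves through the sink resistance
    for i, r in enumerate(resistances):
        rise += r * sum(powers[i:])  # R_i carries the power of layers i..n
    return rise


def z_tile_thermal_cost(tiles: List[Tuple[Sequence[float], Sequence[float]]],
                        r_sink: float) -> float:
    """Thermal cost: the highest temperature among all single Z-tiles."""
    return max(z_tile_temperature(p, r, r_sink) for p, r in tiles)


# Two hypothetical Z-tiles of a 4-layer stack (W and K/W).
uniform = ([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
bottom_only = ([2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
print(z_tile_thermal_cost([uniform, bottom_only], r_sink=2.0))
```

Note that the evaluation of one Z-tile never looks at its neighbours, which is what makes the model fast and also what the following motivation section questions.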

3.2 Motivation

In 3D MEANDER placement, nets with higher switching activity must be shortened to minimize interconnect power, as Figure 15 shows; however, the CLBs connected by these nets generally have higher power consumption as well. As a result, some regions become hotspots because CLBs with higher power consumption are clustered there, as shown in Figure 16. Consequently, the maximum temperature does not decrease noticeably even though the interconnect power is minimized.

Figure 15. Interconnect power minimization.

Figure 16. Effect of Interconnect power minimization.


To demonstrate this drawback, we evaluate 3D MEANDER on a 4-layer design for the 20 largest MCNC benchmarks, with the logic utilization set to 75%. Figure 17 compares the maximum temperature with the total power of 3D MEANDER; both curves are normalized to timing-driven TPR. We observe that some benchmarks, such as diffeq, frisc, spla and pdc, show a large improvement in total power without a corresponding improvement in maximum temperature.

Furthermore, we measure the correlation between these two curves; it is -0.22, a weak negative relationship. For these reasons, we argue that temperature optimization should focus on power distribution.

Figure 17. Comparison between max. temp with total power of 3D MEANDER.

To observe the effect of different power distributions on temperature, we construct two contrasting tile mapping patterns, vertically staggered and vertically aligned, as shown in Figure 18. For each mapping pattern, we further consider a set of configurations with different numbers of tiles in a big block. A configuration with group factor n has n×n tiles within a block. In all experiments, the dimension of the target 3D FPGA is fixed to 36×36×6, the utilization is set to 50%, all mapped tiles consume the same power, and the FG-4 thermal model is used for temperature evaluation. Note that the vertically staggered pattern reflects the case in which vertical heat flow is taken into account, and a smaller group factor reflects the case in which lateral heat flow is taken into account.
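One way to generate such mapping patterns is sketched below. It assumes a checkerboard of n×n blocks on every layer, which gives exactly 50% utilization on a 36×36 layer for the group factors considered; in the staggered variant the checkerboard is inverted on every other layer. The exact construction used in the experiments is not given in the text, and all function names are hypothetical.

```python
from typing import Set, Tuple

Coord = Tuple[int, int, int]  # (x, y, layer)


def mapping_pattern(size: int, layers: int, group: int, staggered: bool) -> Set[Coord]:
    """Return the set of mapped tiles for one configuration.

    Blocks of group x group tiles form a checkerboard on each layer.
    Aligned: every layer uses the same checkerboard.
    Staggered: odd layers use the inverted checkerboard.
    """
    occupied = set()
    for z in range(layers):
        shift = z % 2 if staggered else 0
        for x in range(size):
            for y in range(size):
                if (x // group + y // group + shift) % 2 == 0:
                    occupied.add((x, y, z))
    return occupied


def stacked_fraction(occupied: Set[Coord], layers: int) -> float:
    """Fraction of mapped tiles (below the top layer) with a mapped tile directly above."""
    below_top = [t for t in occupied if t[2] + 1 < layers]
    stacked = sum((x, y, z + 1) in occupied for x, y, z in below_top)
    return stacked / len(below_top)


aligned = mapping_pattern(36, 6, 2, staggered=False)
staggered = mapping_pattern(36, 6, 2, staggered=True)
print(len(aligned), stacked_fraction(aligned, 6), stacked_fraction(staggered, 6))
```

Under this construction the aligned pattern stacks every mapped tile on top of another one, while the staggered pattern never does, so the two patterns isolate the effect of vertical heat accumulation.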

[Figure 17 plot: improvement (0% to 25%) in maximum temperature and total power for the benchmarks tseng, ex5p, apex4, diffeq, dsip, misex3, alu4, des, seq, bigkey, apex2, s298, elliptic, frisc, spla, pdc, ex1010, s38584.1, s38417 and clma]

[Figure 18 panels: aligned and staggered patterns for group factor = 1 and group factor = 2; labels: lateral heat flow, vertical heat flow; legend: tile with placed CLB, tile with unplaced CLB]

Figure 18. Staggered and aligned mapping patterns.

Figure 19 reports the results as a function of group factor. For configurations using the staggered pattern, the maximum temperature is virtually independent of the group factor. In contrast, for configurations using the aligned pattern, the maximum temperature increases significantly as the group factor grows.

Therefore, it seems reasonable that the Z-tile model considers only vertical heat flow and ignores all lateral heat flow.

Figure 19. Maximum temperature with different group factors.
[Plot: maximum temperature (℃, 110 to 200) versus horizontal group factor (1, 2, 3, 6, 9, 18); series: vertically staggered, vertically aligned]

Nevertheless, as the logic utilization increases, the vertical heat flow can no longer be staggered at some single Z-tiles, as shown in Figure 20; that is, the maximum temperature may increase significantly because all lateral heat flow is ignored. In other words, the group factor cannot be controlled if the Z-tile model is used as the thermal cost during placement. Therefore, we argue that the lateral heat flow should not be neglected.

Figure 20. Effect of vertically aligned as utilization increases.

3.3 Proposed Algorithm – TherWare

In this section, we introduce our proposed thermal-aware placement algorithm, TherWare, which is based on simulated annealing and uses the cost function shown in Equation (2). The thermal cost of TherWare placement is shown in Equation (6); it accounts for three guidelines: power uniformity (Cost_PU), heat dissipativity (Cost_HD) and interconnect power (Cost_IP).

\[ \mathit{Cost}_{\mathit{Thermal}} = w_1 \cdot \mathit{Cost}_{PU} + w_2 \cdot \mathit{Cost}_{HD} + w_3 \cdot \mathit{Cost}_{IP} \tag{6} \]

3.3.1 Power Uniformity

This cost keeps the power uniform among the tiles with placed CLBs. In timing-driven placement, CLBs are generally placed onto adjacent available hardware to shorten wirelength and delay. However, such a placement result has a higher maximum temperature because of heat congestion. Therefore, the tiles with placed CLBs should be spread uniformly over the entire FPGA, as shown in Figure 21. In brief, for each tile with a placed CLB, we want to minimize the power consumption of its adjacent tiles.

Figure 21. Uniform power distribution.

As shown in Figure 22, for a placed CLB i, Adj(i) denotes the set of its adjacent placed CLBs. We classify this set into three subsets: i) Adj_vertical(i), the vertically adjacent CLBs of placed CLB i; ii) Adj_lateral(i), the laterally adjacent CLBs of placed CLB i; iii) Adj_diagonal(i), the diagonally adjacent CLBs of placed CLB i.

Figure 22. Definition of Adj(i).

In Equation (7) and Equation (8), the power uniformity cost is the sum of the adjacent power over all tiles with placed CLBs. Moreover, since the vertical dissipating path is more important than the lateral dissipating path, the Weight function in Equation (9) is set according to how important each kind of adjacency is for heat dissipation.

\[ \mathit{Cost}_{PU} = \sum_{i} \mathit{Adjacent\_Power}(i) \tag{7} \]

\[ \mathit{Adjacent\_Power}(i) = \sum_{j \in \mathit{Adj}_{vertical}(i) \,\cup\, \mathit{Adj}_{lateral}(i) \,\cup\, \mathit{Adj}_{diagonal}(i)} \mathit{Weight}(j) \cdot \mathit{Power}(j) \tag{8} \]

(9)
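The power uniformity cost can be sketched as follows. Several details are assumptions made only for illustration: vertical neighbours share the same (x, y) on an adjacent layer, lateral and diagonal neighbours lie on the same layer, and the three weight values are placeholders that merely respect the stated ordering (vertical more important than lateral); they are not the values of Equation (9). All names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[int, int, int]  # (x, y, layer)

# Placeholder weights; only the ordering vertical > lateral follows the text.
WEIGHTS = {"vertical": 1.0, "lateral": 0.5, "diagonal": 0.25}


def adjacency_kind(a: Coord, b: Coord) -> Optional[str]:
    """Classify tile b relative to tile a, or return None if not adjacent."""
    dx, dy, dz = abs(a[0] - b[0]), abs(a[1] - b[1]), abs(a[2] - b[2])
    if dx == 0 and dy == 0 and dz == 1:
        return "vertical"
    if dz == 0 and dx + dy == 1:
        return "lateral"
    if dz == 0 and dx == 1 and dy == 1:
        return "diagonal"
    return None


def adjacent_power(tile: Coord, power: Dict[Coord, float]) -> float:
    """Weighted power of the placed CLBs adjacent to the given tile."""
    total = 0.0
    for other, p in power.items():
        kind = adjacency_kind(tile, other)
        if kind is not None:
            total += WEIGHTS[kind] * p
    return total


def power_uniformity_cost(power: Dict[Coord, float]) -> float:
    """Sum of adjacent power over all tiles with a placed CLB."""
    return sum(adjacent_power(tile, power) for tile in power)


# Hypothetical placements: tile -> power of the CLB placed there.
clustered = {(0, 0, 0): 2.0, (0, 0, 1): 1.0, (1, 0, 0): 4.0, (1, 1, 0): 1.0}
spread = {(0, 0, 0): 2.0, (2, 2, 0): 1.0, (4, 0, 0): 4.0, (0, 4, 0): 1.0}
print(power_uniformity_cost(clustered), power_uniformity_cost(spread))
```

The clustered placement is penalized, while the spread placement has no adjacent placed CLBs at all and therefore costs nothing, which is the behaviour the guideline is after.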

3.3.2 Heat Dissipativity

Figure 23. The cool point in 4-layer 3D IC.

To let the heat of potentially hotter tiles dissipate easily, the first step is to find the position with the best heat dissipativity in a 3D IC. As shown in Figure 23, the top-most layer is the closest to the heat sink, so it has better heat dissipativity than the other layers, and the center of the top-most layer has more area for dissipating heat. The center of the top-most layer therefore has the best heat dissipativity in the entire 3D IC, and we name this position the cool point (CP). Next, we estimate the heat potential of each tile with a placed CLB. As introduced in Section 2.3.1, the power of a tile consists of logic power and interconnect power, so the pin activity must also be considered: the pins of a placed CLB are terminals of some nets, and after routing these nets are likely to consume high interconnect power in this tile. Higher interconnect power means that the tile has a higher heat potential; for example, tile 2 has a higher heat potential than tile 1 in Figure 24.

Figure 24. Comparison of two tiles.


As shown in Equation (10) and Equation (11), the first term represents the heat potential of a tile with a placed CLB, and Distance_to_CP is the distance from this tile to the cool point; therefore, tiles with higher heat potential are placed close to the cool point. Moreover, since the pin activity is taken into consideration, nets with higher activity are also routed close to the cool point, because the terminals that define their bounding boxes are placed close to the cool point. The factor ω provides additional flexibility to this cost and must be set to less than 1. The factor λ is also set to less than 1 because the vertical dissipating path is more important than the lateral dissipating path. Equations (12) to (14) give the coordinates of the cool point.

\[ \mathit{Cost}_{HD} = \sum_{i} \left( \omega \cdot \mathit{Power}(i) + (1 - \omega) \cdot \mathit{Activity}(i) \right) \cdot \mathit{Distance\_to\_CP}(i) \tag{10} \]

(11) (12) (13) (14)

3.3.3 Interconnect Power

As the technology node scales down, interconnect power comes to dominate the total power; it can contribute 75~85% of the total power [25][26]. However, both the power uniformity cost and the heat dissipativity cost may lengthen some nets and thereby increase the interconnect power. To prevent an excessive increase, we take interconnect power into consideration. As shown in Equation (15), this cost function is the same as Equation (3).

\[ \mathit{Cost}_{IP} = \sum_{i \in \mathit{Net}} \mathit{Activity}(i) \cdot \mathit{Cost}_{\mathit{Wire}}(i) \tag{15} \]
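The heat dissipativity cost of Section 3.3.2 can be sketched as follows. The exact distance formula and cool point coordinates are not reproduced in the text above, so the sketch makes its own assumptions: the cool point is the geometric center of the layer with the highest index (taken to be the one next to the heat sink), and the distance is a Manhattan distance in which the lateral part is scaled by λ. Power and activity are assumed to be normalized to comparable ranges. All names and values are hypothetical.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]  # (x, y, layer)


def cool_point(width: int, height: int, layers: int) -> Tuple[float, float, int]:
    """Assumed cool point: center of the layer closest to the heat sink."""
    return ((width - 1) / 2, (height - 1) / 2, layers - 1)


def distance_to_cp(tile: Coord, cp: Tuple[float, float, int], lam: float) -> float:
    """Assumed distance: lateral Manhattan distance scaled by lam, plus layer distance."""
    lateral = abs(tile[0] - cp[0]) + abs(tile[1] - cp[1])
    vertical = abs(tile[2] - cp[2])
    return lam * lateral + vertical


def heat_dissipativity_cost(tiles: Dict[Coord, Tuple[float, float]],
                            cp: Tuple[float, float, int],
                            omega: float, lam: float) -> float:
    """Sum over placed tiles of heat potential times distance to the cool point.

    tiles maps a tile to (power, pin activity) of the CLB placed there.
    """
    return sum((omega * p + (1 - omega) * a) * distance_to_cp(t, cp, lam)
               for t, (p, a) in tiles.items())


cp = cool_point(5, 5, 4)  # (2.0, 2.0, 3)
hot, cool = (4.0, 4.0), (1.0, 1.0)  # (power, activity) of two hypothetical CLBs

hot_at_cp = heat_dissipativity_cost({(2, 2, 3): hot, (0, 2, 1): cool}, cp, 0.5, 0.5)
hot_far = heat_dissipativity_cost({(2, 2, 3): cool, (0, 2, 1): hot}, cp, 0.5, 0.5)
print(hot_at_cp, hot_far)
```

Swapping the two CLBs so that the hotter one sits on the cool point lowers the cost, which is the move a simulated annealing placer would tend to accept under this term.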

Chapter 4

Routing Algorithm

4.1 Related Works

4.1.1 TPR

(a) FPGA routing architecture (b) Routing-resource graph

Figure 25. Routing-resource graph.

In FPGA routing, a directed graph named routing-resource graph is usually used to represent the routing architecture of the FPGA. Each wire segment, TSV and CLB pin becomes a node, and potential connections become edges in this graph as shown in Figure 25.
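The routing-resource graph described above can be sketched as a directed adjacency map, together with a lowest-cost path search of the kind a negotiated-congestion router runs for each source-sink pair. The node names, the per-node costs and the function name are hypothetical, and the sketch uses a plain Dijkstra search with a fixed cost per node rather than the full cost function of any particular router.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def lowest_cost_path(graph: Dict[str, List[str]],
                     node_cost: Dict[str, int],
                     source: str,
                     sink: str) -> Optional[Tuple[int, List[str]]]:
    """Dijkstra search where the cost of a path is the sum of its node costs."""
    best = {source: node_cost[source]}
    prev: Dict[str, str] = {}
    heap = [(node_cost[source], source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == sink:
            path = [node]
            while node in prev:  # walk back to the source
                node = prev[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, []):
            new_cost = cost + node_cost[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return None  # sink unreachable


# Hypothetical graph: CLB pins, wire segments and a TSV are nodes; edges are connections.
graph = {
    "clb0.out": ["wire_a", "wire_b"],
    "wire_a": ["tsv0"],
    "wire_b": ["clb1.in"],
    "tsv0": ["clb1.in"],
}
node_cost = {"clb0.out": 0, "wire_a": 1, "wire_b": 3, "tsv0": 1, "clb1.in": 0}

result = lowest_cost_path(graph, node_cost, "clb0.out", "clb1.in")
print(result)
```

Attaching the cost to nodes rather than edges mirrors the fact that each routing resource is itself a node of the graph, so a per-resource penalty can later be folded into the same search.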

The routing algorithm in TPR is based on the Pathfinder negotiated congestion algorithm [27]; its flow chart is shown in Figure 26. Initially, for each net, the router finds the path with the lowest total cost between the net's source node and sink node in the routing-resource graph. In this step, the congestion cost is set to 0, so some routing resources are likely to be overused. Consequently, the routing iteration
