Thesis Organization - Architectural Exploration of 3D FPGAs towards a Better Balance between Area and Delay

Chapter 1 Introduction

1.4 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2, two previous works along with the motivation of this thesis are introduced. The evaluation environment including architectural parameters, benchmark and the 3D P&R tool is described in Chapter 3. Chapter 4 and Chapter 5 present two major families of architectures and how we explore those architectures. The TSV utilization issue is discussed in Chapter 6. Finally, the concluding remarks are given in Chapter 7.

Chapter 2 Previous Works and Motivation

So far, there are two synthesis frameworks targeting 3D FPGA architectures: three-dimensional place and route (TPR) [19]–[21] and 3D MEANDER [22][23] (Section 2.1). After discussing the pros and cons of these two previous works, we describe the motivation of this thesis in Section 2.2.

2.1 Previous Works

The first work is TPR, i.e., three-dimensional place and route for FPGAs [19]–[21]. To the best of our knowledge, TPR is the first tool that supports placement and routing for 3D FPGAs. In TPR, all SBs are assumed to be regular 3D-SBs and the number of available TSVs in a 3D-SB is assumed to be unlimited, which is impractical.

3D MEANDER is another design framework for 3D FPGAs [22][23]. It also studies the impact of different deployment strategies for 3D-SBs, and proposes a family of 3D FPGA architectures in which 2D-SBs and 3D-SBs are mixed in certain regular spatial patterns. However, the number of available TSVs within a 3D-SB is assumed to be fixed in 3D MEANDER. That is, it does not investigate the impact of varying the number of TSVs in a 3D-SB.

2.2 Motivation

Consider an FPGA in which every SB is a 3D-SB with the maximal number of TSVs, as shown in Fig. 9. Each square stands for a 3D-SB. The number of available TSVs in each 3D-SB is 4, i.e., the maximum, corresponding to the 4 wires in the X-Y directions. A 3D-SB with the maximal number of TSVs is called fully connected. Moreover, since all the SBs of the FPGA in Fig. 9 are 3D, the spatial distribution of these 3D-SBs is called fully distributed. Fully connected TSVs in every 3D-SB, combined with fully distributed 3D-SBs, clearly incur a large area cost.

Fig. 9. Fully connected and fully distributed 3D-SBs.

Fig. 10. Fully connected but partially distributed 3D-SBs.

Fig. 11. Different spatial distributions of 3D-SBs.

3D MEANDER reduces the TSV area by removing 3D-SBs in a regular pattern, as shown in Fig. 10. Squares with solid lines stand for fully connected 3D-SBs and squares with dotted lines stand for 2D-SBs. This spatial distribution of 3D-SBs is called fully connected but partially distributed. Fig. 11 shows two variants of 3D MEANDER’s architecture. The number of 2D-SBs between two adjacent 3D-SBs in the X-Y directions can be 2 (Fig. 11(a)) or 3 (Fig. 11(b)). That is, more TSVs can be eliminated by removing more 3D-SBs. However, 3D MEANDER reduces TSVs only by removing whole 3D-SBs, not by removing individual TSVs inside a 3D-SB.

Fig. 12. Partially connected but fully distributed 3D-SBs.

One of our motivations, shown in Fig. 12, is to take the number of TSVs in a 3D-SB into account, which is not considered in the two previous works. Each 3D-SB has only 2 TSVs; the other two are removed. The spatial distribution of the 3D-SBs in Fig. 12 is called partially connected but fully distributed. The total number of TSVs is the same as that in 3D MEANDER (Fig. 11(a)), i.e., 50% of TPR. Reducing the number of TSVs in this way is another method for area reduction, but it has not been considered before.

Fig. 13. Variants of partially connected but fully distributed 3D-SBs.

Fig. 13 shows two variants of partially connected but fully distributed 3D-SBs. In this example, the number of connected TSVs in a 3D-SB can be 3 (Fig. 13(a)) or 1 (Fig. 13(b)). That is, the number of TSVs removed inside a 3D-SB is as flexible as the spacing of 3D-SBs in the variants of 3D MEANDER.

Furthermore, we can combine the concepts of Fig. 10 and Fig. 12 for further TSV reduction, as Fig. 14 shows. This is not considered in the two previous works either. The resulting spatial distribution of 3D-SBs is called partially connected and partially distributed. That is, not only are half of the TSVs inside each 3D-SB removed, but half of the 3D-SBs are also replaced with 2D-SBs. Therefore, the total number of TSVs after the combination is 25% of that in TPR (Fig. 9), which is lower than the 50% of 3D MEANDER (Fig. 10). Combining different TSV-reduction methods has not been considered before, but it can be an efficient way to reduce the TSV area further. Consequently, it must be taken into account for a complete architectural exploration.
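The percentages above follow from multiplying two independent reduction factors: the fraction of TSVs kept inside each 3D-SB and the fraction of SBs that remain 3D. The following minimal Python sketch reproduces that arithmetic for the 4-wire examples in the figures; the function name `tsv_fraction` and its parameters are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def tsv_fraction(tsvs_per_3d_sb, max_tsvs_per_sb, num_3d_sbs, num_sbs):
    """Total TSV count relative to the fully connected, fully distributed case."""
    connectivity = Fraction(tsvs_per_3d_sb, max_tsvs_per_sb)  # TSVs kept inside a 3D-SB
    distribution = Fraction(num_3d_sbs, num_sbs)              # SBs that remain 3D
    return connectivity * distribution


# The 4-wire switch blocks of the motivating figures.
patterns = {
    "fully connected, fully distributed (Fig. 9)": tsv_fraction(4, 4, 1, 1),
    "fully connected, partially distributed (Fig. 10)": tsv_fraction(4, 4, 1, 2),
    "partially connected, fully distributed (Fig. 12)": tsv_fraction(2, 4, 1, 1),
    "partially connected, partially distributed (Fig. 14)": tsv_fraction(2, 4, 1, 2),
}

for name, frac in patterns.items():
    print(f"{name}: {float(frac):.0%} of the TSVs in TPR")
```

Using exact fractions avoids rounding noise when the two factors are multiplied.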

Fig. 14. Mix the concepts of Fig. 10 and Fig. 12 together for further TSV reduction.

Fig. 15. Four categories of spatial distributions of TSVs.

Finally, all the spatial distributions discussed in this chapter are summarized in Fig. 15. We define the spatial distributions of TSVs as patterns. The starting point is the “Fully-Fully” 3D-SBs in Fig. 15(a). 3D MEANDER takes the area of TSVs into consideration and proposes the “Fully-Partially” 3D-SBs in Fig. 15(b). Another way to reduce the TSV area, which is not considered in previous works, is to remove some of the TSVs inside a 3D-SB; this is called “Partially-Fully” 3D-SBs. The last TSV-reduction method, which is also not considered before, is called “Partially-Partially” 3D-SBs and combines the concepts of Fig. 10 and Fig. 12 for further TSV reduction.

Our objective is to provide a complete architectural exploration of different patterns for generic 3D FPGAs. We take into account the number of TSVs inside a 3D-SB and mixed patterns, neither of which is considered in previous works. After all evaluations, we provide several area-saving configurations with only a minor delay increase by properly tailoring the structure and deployment strategy of 3D-SBs for generic 3D FPGAs.

Chapter 3 Evaluation Environment

3.1 Architectural Settings

The basic architectural settings in our 3D FPGA are shown in TABLE I. Most of the parameters are set according to existing commercial FPGAs [24][25], well-known FPGA synthesizers [19]–[21], and related research works [26]–[28].

TABLE I

ARCHITECTURAL PARAMETERS

Z direction                  L1 only (routability-driven)
I/O location                 Bottom-most layer
Delay model (L1Z : L1X-Y)    1 : 10
TSV pitch                    10 µm
Process technology node      65 nm

Each LB has 8 inputs and consists of 2 LUTs, rather than 1 as in the previous work [19][21], which is considered more realistic; each LUT has 6 inputs. The settings of LBs are based on Altera Stratix IV [24]. The channel width, i.e., the number of wires in the X-channel (WX) and the Y-channel (WY), is set to 32 based on Xilinx FPGAs [25].

These 32 wires comprise four types of wire segments with different lengths: L1, L2, L4 and L8. The length of a wire segment is the number of logic blocks it spans. There are 12 L1 wires, 12 L2 wires, 4 L4 wires and 4 L8 wires. A long wire segment offers better performance while a short wire segment offers better routability (Section 1.2.1). The number of TSVs varies with the pattern. In the vertical direction (Z), only L1 is available in order to maximize routability. As the number of TSVs increases, the timing of the test cases improves as a result of increased routability. I/O pads are located only around the bottom-most layer, as in practical FPGAs. The delay ratio between a TSV and an L1 segment in the X-Y directions is set to 1:10. The ratio we obtained from [28] is actually around 1:50, which is so small that the delay of TSVs would be almost negligible. On the other hand, the 1:1 ratio assumed by TPR is impractical because the delay of a TSV is actually smaller than that of a horizontal wire. We therefore select 1:10 as a reasonable setting. The TSV pitch is 10 µm [1] and the process technology node is 65 nm.

The number of layers ranges from 1 to 8. For a given number of layers, the pattern, i.e., the spatial distribution of TSVs, is the same on all layers.
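As a compact summary, the settings described in this section can be collected into a single configuration object. The sketch below is only an illustration: the class name `ArchConfig`, its field names, and the `validate` and `tsv_delay` helpers are hypothetical and do not correspond to the architecture file format of the actual tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArchConfig:
    """Hypothetical container for the Section 3.1 settings."""
    luts_per_lb: int = 2        # N
    lb_inputs: int = 8          # I
    lut_inputs: int = 6         # K
    channel_width: int = 32     # WX = WY
    # Number of tracks per segment length (L1, L2, L4, L8) in an X-Y channel.
    segment_tracks: dict = field(default_factory=lambda: {1: 12, 2: 12, 4: 4, 8: 4})
    tsv_to_l1_delay_ratio: tuple = (1, 10)  # L1Z : L1X-Y
    tsv_pitch_um: float = 10.0
    process_node_nm: int = 65
    min_layers: int = 1
    max_layers: int = 8

    def validate(self):
        """Check that the segment mix fills the channel and the layer range is sane."""
        if sum(self.segment_tracks.values()) != self.channel_width:
            raise ValueError("segment tracks must add up to the channel width")
        if not 1 <= self.min_layers <= self.max_layers:
            raise ValueError("invalid layer range")

    def tsv_delay(self, l1_delay):
        """Delay of one TSV hop, derived from the delay of a horizontal L1 segment."""
        tsv, l1 = self.tsv_to_l1_delay_ratio
        return l1_delay * tsv / l1


cfg = ArchConfig()
cfg.validate()
print(cfg.segment_tracks, cfg.tsv_delay(10.0))
```

Keeping the delay ratio as a pair makes it straightforward to re-run an experiment with the 1:50 or 1:1 ratios discussed above.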

3.2 3D P&R Tool

This section introduces the 3D placement and routing tool we use. The framework is shown in Fig. 16. A netlist file and a specified architecture file are fed into a 3D FPGA synthesizer, which performs three steps: TSV-driven 3D layering, timing-driven 3D placement and 3D routing. The results consist of a placement file and a routing file.

Fig. 16. Tool flow.

The netlist consisting of LBs is packed by T-VPack, which is part of the 2D FPGA synthesis framework VPR [14][15]. The architecture file includes the settings in Section 3.1, the total number of layers and the pattern of TSVs. Because the area cost of TSVs is huge, an initial layering step partitions the netlist into different layers with the objective of TSV minimization [31]. The timing-driven 3D placement and 3D routing processes, which are adapted from TPR, are then conducted on the layered netlist. Finally, the tool generates the placement and routing results. The delay of the netlist is also reported, along with the FPGA area derived from the architecture file.

3.3 Benchmark

TABLE II

BENCHMARK

Design                            #LBs    #Nets   #I/Os
spla¹                             1845    2977    62
frisc¹                            1778    2821    136
ex1010¹                           2299    3932    20
s38417¹                           3203    5617    135
oc_des_perf_opt_opt_abc_resyn⁴    10641   17150   185
systemcaes_0abc_ch²               13480   19634   388
b22³                              14585   23114   55
b17³                              15390   24986   135

¹: MCNC; ²: IWLS2005; ³: ITC99; ⁴: Altera

TABLE II shows the 24 test cases in our benchmark set: 14 are from MCNC [29] and the other 10 larger ones are from IWLS2005, ITC99 and Altera [30]. The numbers of LBs and nets are obtained from T-VPack with the parameters (N, I, K) = (2, 8, 6), as specified in TABLE I.

In our evaluation environment, the X-Y dimension (D) of a 3D FPGA architecture varies according to the number of layers (L) and the overall capacity of LBs (C), as given in (1):

(1)

All the test cases are classified into the small, medium, or large category according to their sizes. Three corresponding FPGA capacities (small, medium and large) are designed according to the upper limit of each category for about 80% LB utilization, as shown in TABLE III. For example, the upper limit of the small category is 1000 LBs and thus the capacity of the small FPGA is 1200. For the same reason, the capacities of the medium and large FPGAs are 5400 and 19200, respectively.

TABLE III

CIRCUIT SIZE VS. FPGA CAPACITY

Circuit size    Range (#LBs)    FPGA capacity (#LBs)
Small           ~1000           1200
Medium          1001~4500       5400
Large           4501~16000      19200
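The mapping from circuit size to FPGA capacity in TABLE III can be expressed as a simple lookup. The following Python sketch is illustrative only; the names `CATEGORIES` and `select_fpga` are hypothetical, while the thresholds and capacities are taken directly from TABLE III.

```python
# (category, upper limit in LBs, FPGA capacity in LBs), taken from TABLE III.
CATEGORIES = [
    ("Small", 1000, 1200),
    ("Medium", 4500, 5400),
    ("Large", 16000, 19200),
]


def select_fpga(num_lbs):
    """Return (category, capacity) of the smallest FPGA whose range covers num_lbs."""
    if num_lbs <= 0:
        raise ValueError("a circuit must contain at least one LB")
    for name, upper_limit, capacity in CATEGORIES:
        if num_lbs <= upper_limit:
            return name, capacity
    raise ValueError(f"{num_lbs} LBs exceeds the largest category")


# A few designs from TABLE II.
for design, lbs in [("spla", 1845), ("s38417", 3203), ("b17", 15390)]:
    category, capacity = select_fpga(lbs)
    print(f"{design}: {lbs} LBs -> {category} FPGA ({capacity} LBs)")
```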

Chapter 4 Architectural Evaluations

Fig. 17. Overview of architectures.

Fig. 17 shows an overview of the different architectures. Comparisons between different patterns need a common reference, i.e., a baseline architecture (BSL). The FPGA with fully connected TSVs inside every 3D-SB and fully distributed 3D-SBs is our BSL because this architecture has the maximal number of TSVs. It therefore costs the most area, but its delay is minimal as a result of maximal routability.

There are two ways to reduce the number of TSVs. The first is to remove TSVs inside a 3D-SB, which makes the 3D-SBs partially connected but still fully distributed. This architecture is called the internally-sparse (IS) architecture. The second is to remove some of the 3D-SBs in a regular pattern, which leaves the 3D-SBs fully connected but partially distributed. This architecture is called the externally-sparse (ES) architecture. Combining the two yields a hybrid architecture in which the 3D-SBs are both partially connected and partially distributed. This architecture is called the sparse (SP) architecture.

4.1 Baseline Architecture (BSL)

The baseline architecture (BSL), which is based on TPR, is discussed in this section. Because there are 32 wires in each X-Y channel, the number of TSVs in each 3D-SB is also 32. The pattern is shown in Fig. 18.

Fig. 18. Baseline architecture (BSL).

Fig. 19. The average normalized delay of BSL.

Fig. 20. The average TSV utilization of BSL.

Delay results are shown in Fig. 19. For each test case, the delay at each number of layers is first normalized to its 2D delay, i.e., the delay when the number of layers is 1. These normalized delays are then averaged over all test cases. The results show that 3D (2–8 layers) is better than 2D (1 layer) in terms of delay. The delay decreases substantially from 1 to 3 layers, where the benefit of 3D in shortening global interconnects is most evident, and saturates at 4–5 layers. It then increases slightly at 6–8 layers: when too many layers are used, most horizontal paths are replaced with vertical ones, and the delay increases because more switches are traversed. All in all, 4–5 layers are good enough for delay.

Although the vertical routability is maximized in BSL, which is best for delay, the TSV utilization is extremely low (<10%, Fig. 20), which implies that more than 90% of the TSVs are unused. This also suggests that there is considerable room for further architectural improvement. In this thesis, this architecture serves as the baseline for comparisons with the proposed architectures.
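The normalize-then-average procedure used for Fig. 19 can be sketched as follows. The function name `average_normalized_delay` and the data layout are hypothetical, and the numbers in `example` are made up purely to exercise the code; they are not measured results.

```python
def average_normalized_delay(delays):
    """Average per-test-case delays after normalizing each case to its 1-layer delay.

    delays maps test case -> {number of layers: delay}. Every test case is
    expected to provide the same set of layer counts, including 1 (the 2D case).
    """
    layer_counts = sorted(next(iter(delays.values())))
    averaged = {}
    for layers in layer_counts:
        ratios = [per_case[layers] / per_case[1] for per_case in delays.values()]
        averaged[layers] = sum(ratios) / len(ratios)
    return averaged


# Made-up delays (arbitrary units), for illustration only.
example = {
    "case_a": {1: 10.0, 2: 8.0, 4: 6.0},
    "case_b": {1: 20.0, 2: 18.0, 4: 14.0},
}

for layers, value in average_normalized_delay(example).items():
    print(f"{layers} layer(s): {value:.3f}")
```

Normalizing before averaging prevents large test cases with long absolute delays from dominating the result.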

4.2 Internally-Sparse Architecture (IS)

The internally-sparse (IS) architecture removes TSVs inside a 3D-SB and makes all 3D-SBs partially connected but still fully distributed. The architecture is shown in Fig. 21.

Fig. 21. IS architecture (IS).

IS# represents a pattern of the IS architecture, where the postfix # specifies the number of TSVs available in a 3D-SB. For example, every 3D-SB in IS16 has only 16 TSVs inside, while IS32 is actually equivalent to the BSL.

As mentioned in Section 3.1, a multi-segment routing structure is adopted, with four different wire lengths and different numbers of tracks in the X-Y directions. In BSL, this makes no difference because every track in the X-Y channels has its own vertical connection. In an IS architecture, however, some of the TSVs in a 3D-SB are removed, and the wire segments in the X-Y directions that the remaining TSVs connect to may differ. Removing a TSV connected to a short wire segment causes a loss of routability. In contrast, removing a TSV connected to a long wire segment causes a loss of performance. To strike a balance between performance and routability, it is better to preserve the ratio between the numbers of TSVs connected to the four wire-segment types whenever possible. If the ratio cannot be preserved, then for better timing the TSVs connected to L1 are removed first, followed by those connected to L2, L4 and L8, in that order.

As TABLE IV shows, there are 6 patterns excluding BSL in this IS architecture.

TABLE IV

NUMBER OF TSVS IN ONE 3D-SB OF IS PATTERNS

The ratio among L1, L2, L4 and L8 in IS32 is 3:3:1:1 because the numbers of tracks with vertical (Z) connectivity for (L1, L2, L4, L8) are (12, 12, 4, 4). To preserve this ratio in IS24, the corresponding numbers of vertical tracks, i.e., TSVs, are (9, 9, 3, 3), obtained by removing 3 TSVs each from L1 and L2 and 1 TSV each from L4 and L8. From IS32 to IS28, however, only 4 TSVs are removed and thus the ratio cannot be preserved. Based on timing-driven considerations, 2 TSVs each are removed from L1 and L2, and the numbers of TSVs for (L1, L2, L4, L8) become (10, 10, 4, 4). The TSVs of the other IS patterns are removed following the same reasoning, as listed in TABLE IV.

Fig. 22. The normalized areas of different IS patterns.

IS (#TSVs in one 3D-SB)    L1    L2    L4    L8
BSL (IS32)                 12    12    4     4
IS28                       10    10    4     4
IS24                       9     9     3     3
IS20                       7     7     3     3
IS16                       6     6     2     2
IS12                       4     4     2     2
IS8                        3     3     1     1
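The allocations in TABLE IV can be reproduced by a short rule, sketched below in Python. This is one reading of the table rather than the procedure actually used: the rule assumes that a TSV count that is a multiple of 8 keeps the 3:3:1:1 ratio exactly, and that otherwise the next larger ratio-preserving allocation is taken and the 4 surplus TSVs are removed from the short segments, 2 each from L1 and L2. The name `is_allocation` is hypothetical.

```python
RATIO = {"L1": 3, "L2": 3, "L4": 1, "L8": 1}  # segment ratio in the BSL (IS32)
RATIO_UNIT = sum(RATIO.values())               # 8 TSVs per ratio unit


def is_allocation(num_tsvs):
    """Return the per-segment TSV counts of one 3D-SB for pattern IS<num_tsvs>."""
    if num_tsvs % 4 != 0 or not 8 <= num_tsvs <= 32:
        raise ValueError("expected a multiple of 4 between 8 and 32")
    # Smallest ratio-preserving allocation with at least num_tsvs TSVs.
    units = -(-num_tsvs // RATIO_UNIT)  # ceiling division
    alloc = {segment: share * units for segment, share in RATIO.items()}
    # Assumed rule: the surplus (0 or 4 TSVs) is split evenly between L1 and L2.
    surplus = units * RATIO_UNIT - num_tsvs
    alloc["L1"] -= surplus // 2
    alloc["L2"] -= surplus // 2
    return alloc


for n in (32, 28, 24, 20, 16, 12, 8):
    print(f"IS{n}: {is_allocation(n)}")
```

Removing the surplus from L1 and L2 rather than from L4 and L8 reflects the timing-driven preference for keeping the TSVs attached to long wire segments.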

Fig. 22 shows the areas of all 6 patterns normalized to BSL. SB and Tile denote the normalized area of all SBs and of all tiles, respectively, after removing TSVs. The normalized area of Tile is larger than that of SB because a tile contains not only an SB but also an LB and a CB. Because the numbers of TSVs in the patterns from IS28 to IS8 form an arithmetic sequence with a common difference of 4 TSVs, the area decreases linearly.

Fig. 23. The average normalized delays of different IS patterns.

Fig. 23 shows the delays of these IS patterns. For an IS pattern, the delay of each test case at each number of layers is first normalized to that of BSL with the same number of layers. These normalized delays are then averaged over all test cases, and the result is the delay of this IS pattern. Because the delays of IS28, IS24 and IS20 are too close to be distinguished, only IS20 is plotted in Fig. 23 to represent them. As the number of TSVs in a 3D-SB decreases, the delay becomes worse, as expected. Note that some test cases fail to be mapped onto IS12. Most of the test cases fail on IS8; even for those that succeed, the delay increases significantly.

The failures result from an insufficient number of vertical routing tracks in some hot regions. The example in Fig. 24 illustrates the TSV utilization of test case s38584.1 in the 8-layer IS8 pattern. The test case is on the verge of failure because the mapping succeeds only for some of the given seeds. The number of used TSVs in the red region is eight, i.e., the maximal number of TSVs inside a 3D-SB in the IS8 pattern. This shows that congestion occurs in these hot regions, where the TSV demand exceeds the supply, and accounts for the failure of test cases.
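The two quantities discussed here, the overall TSV utilization and the set of saturated 3D-SBs, can be computed from a per-SB usage map as sketched below. The function name `tsv_usage_report` and the map layout are hypothetical, and the sample map is made up for illustration; it is not the data behind Fig. 24.

```python
def tsv_usage_report(used, capacity):
    """Summarize TSV usage on one layer.

    used is a 2D list holding the number of TSVs in use at each 3D-SB, and
    capacity is the number of TSVs available in every 3D-SB (8 for IS8).
    Returns (overall utilization, list of (row, col) of saturated 3D-SBs).
    """
    cells = [count for row in used for count in row]
    if any(count < 0 or count > capacity for count in cells):
        raise ValueError("usage must lie between 0 and the capacity")
    utilization = sum(cells) / (capacity * len(cells))
    saturated = [
        (r, c)
        for r, row in enumerate(used)
        for c, count in enumerate(row)
        if count == capacity
    ]
    return utilization, saturated


# Made-up usage map with a small hot region in the middle.
usage = [
    [0, 1, 0, 0],
    [2, 8, 8, 1],
    [0, 7, 8, 0],
    [0, 0, 1, 0],
]

utilization, hot_spots = tsv_usage_report(usage, capacity=8)
print(f"utilization = {utilization:.1%}, saturated 3D-SBs = {hot_spots}")
```

The sample shows how a low average utilization can coexist with locally exhausted 3D-SBs, which is the situation described for the failing test cases.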

Fig. 24. The TSV utilization of case s38584.1 in the 8-layer IS8 pattern.

Finally, since IS20 achieves about 27% overall area reduction with a delay penalty of less than 3% compared with BSL, it offers the best area/delay balance among the IS patterns.

4.3 Externally-Sparse Architecture (ES)

The externally-sparse (ES) architecture removes partial 3D-SBs regularly and makes all 3D-SBs still fully connected but partially distributed.

ES# represents a pattern of the ES architecture, where the postfix # specifies the maximum distance between two adjacent 3D-SBs in either X or Y direction.

Fig. 25. Examples of ES patterns.

As Fig. 25 shows, ES1 is our BSL. In ES2, one out of every 2 SBs along the X and Y directions is a 3D-SB, so the total number of TSVs is half of that in BSL. Similarly, in ES3 one out of every 3 SBs along the X and Y directions is a 3D-SB, and the total number of TSVs is 33% of that in BSL. Patterns ES4 to ES9 are defined in the same way.

Fig. 26. The ES architecture with (a)/(c) the oblique stripes pattern and (b)/(d) the vertical stripes pattern.

The ES architecture replaces some of the fully connected 3D-SBs with 2D-SBs to save TSV area. There are various ways, or patterns, of mixing 2D-SBs and 3D-SBs. The pattern used in the ES architecture is oblique stripes, the same as the one used in 3D MEANDER. Unlike the vertical or horizontal stripes patterns, the oblique stripes pattern guarantees that each 3D-SB can reach some other 3D-SBs within a fixed distance in any direction, which provides much better routability.

Take the patterns shown in Fig. 26 as an example. From the perspective of a 2D-SB, the shaded region in Fig. 26(a) covers the area in which the 2D-SB can access vertical links within the shortest distance. With the same shaded region, however, a 2D-SB in an architecture with the vertical stripes pattern can reach only two 3D-SBs, as shown in Fig. 26(b). From the perspective of a 3D-SB, the shaded region in Fig. 26(c) covers 8 other 3D-SBs through which the 3D-SB can access vertical links within the shortest distance. With the same shaded region, however, a 3D-SB in an architecture with the vertical stripes pattern can reach only 6 other 3D-SBs, as shown in Fig. 26(d). Therefore, the pattern in Fig. 26(a)/(c) is preferred over that in Fig. 26(b)/(d).
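To make the two stripe styles concrete, the sketch below generates both on a small grid. The placement rules are assumptions made for illustration, not taken from the thesis or from 3D MEANDER: an SB at (x, y) is treated as a 3D-SB when (x + y) is a multiple of the spacing for oblique stripes, and when x is a multiple of the spacing for vertical stripes. All function names are hypothetical.

```python
from fractions import Fraction


def es_pattern(size, spacing, style="oblique"):
    """Return a size x size grid of booleans; True marks a 3D-SB."""
    if style not in ("oblique", "vertical"):
        raise ValueError("style must be 'oblique' or 'vertical'")
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            # Assumed placement rules, for illustration only.
            key = x + y if style == "oblique" else x
            row.append(key % spacing == 0)
        grid.append(row)
    return grid


def fraction_3d(grid):
    """Fraction of SBs that are 3D-SBs."""
    cells = [cell for row in grid for cell in row]
    return Fraction(sum(cells), len(cells))


def columns_without_3d(grid):
    """Number of columns that contain no 3D-SB at all."""
    return sum(1 for column in zip(*grid) if not any(column))


def render(grid):
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


for style in ("oblique", "vertical"):
    grid = es_pattern(size=6, spacing=3, style=style)
    print(style, fraction_3d(grid), columns_without_3d(grid))
    print(render(grid))
```

Under these assumed rules both styles keep the same share of 3D-SBs, but the vertical stripes leave entire columns without any vertical link, whereas the oblique stripes place a 3D-SB in every row and every column.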

Fig. 27. The normalized areas of different ES patterns.

Fig. 27 shows the normalized areas of the different patterns in the ES architecture. As the distance increases, the area reduction is significant at first and then gradually flattens out. The curve resembles a harmonic sequence because the total number of TSVs in ES# is 1/# of that in BSL.
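The shape of the area curves can be illustrated with a simple model in which the normalized tile area is a fixed part plus a part proportional to the number of TSVs. The sketch below is not the area model of the thesis; it is a two-point fit that assumes the figures quoted in this chapter refer to tile area, namely about 73% for IS20 (27% reduction) and 55% for IS12. The names `fit_linear_area` and `es_area` are hypothetical.

```python
def fit_linear_area(point_a, point_b):
    """Fit area = fixed_share + tsv_share * tsv_fraction through two points.

    Each point is (fraction of BSL TSVs kept, normalized tile area).
    """
    (t1, a1), (t2, a2) = point_a, point_b
    tsv_share = (a1 - a2) / (t1 - t2)
    fixed_share = a1 - tsv_share * t1
    return fixed_share, tsv_share


# Assumed data points: IS20 keeps 20/32 of the TSVs, IS12 keeps 12/32.
fixed_share, tsv_share = fit_linear_area((20 / 32, 0.73), (12 / 32, 0.55))


def es_area(spacing):
    """Estimated normalized tile area of ES<spacing>, which keeps 1/spacing of the TSVs."""
    return fixed_share + tsv_share / spacing


print(f"fixed share = {fixed_share:.2f}, TSV share = {tsv_share:.2f}")
for spacing in range(1, 10):
    print(f"ES{spacing}: {es_area(spacing):.2f}")
```

Under these assumptions the estimate for ES8 comes out at about 0.37, in line with the 37% tile area quoted for ES8, and the 1/# term explains why the savings flatten out for large spacings.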

Fig. 28. The average normalized delays of different ES patterns.

Fig. 28 shows the delays of the ES patterns normalized to BSL. As the distance between 3D-SBs in an FPGA increases, the delay becomes worse, as expected. Failed test cases appear in ES8 and ES9. ES2, ES3 and ES4 incur a delay penalty of around 3% relative to BSL. The areas of ES7 to ES9 are similar, but the delay of ES9 increases substantially. Therefore, there is no need to replace too many 3D-SBs with 2D-SBs, because the area saturates while the delay worsens considerably.

Fig. 29. (a) Connection manner inside a 3D-SB and (b) partially connected 3D-SB.

Unlike the IS architecture, in which test cases start to fail at IS12 (55% of the BSL tile area), no test case fails in the ES architecture until ES8, which has 37% of the BSL tile area. The reason lies in the connection manner inside a 3D-SB, as shown in Fig. 29(a). TSVs connect only to the intersections of X-direction and Y-direction wires with the same index. For example, the wire1 in X-direction connects to a TSV only at the intersection with the wire1
