Benchmark - Evaluation Environment - Exploring 3D FPGA Architectures for a Better Balance Between Area and Delay

Chapter 3 Evaluation Environment

3.3 Benchmark

TABLE II BENCHMARK

TABLE II shows the 24 test cases in our benchmark set: 14 are from MCNC [29] and the other 10 larger ones are from IWLS2005, ITC99 and Altera [30]. The numbers of LBs and nets are derived from T-VPack with the parameters (N, I, K) = (2, 8, 6), which are specified in TABLE I.

In our evaluation environment, the X-Y dimension (D) of a 3D FPGA architecture varies according to the number of layers (L) and the overall capacity of LBs (C), as given in (1):

(1)

All the test cases are classified into a small, medium, or large category according to their sizes. Three corresponding FPGA capacities (small, medium and large) are derived from the upper limit of each category, targeting roughly 80% LB utilization, as shown in TABLE III. For example, the upper limit of the small category is 1000 LBs and thus the capacity of the small FPGA is 1200, i.e., 1.2 times the upper limit. For the same reason, the capacities of the medium and large FPGAs are 5400 and 19200, respectively, as TABLE III shows.

Design                          #LBs   #Nets  #I/Os
spla¹                           1845   2977   62
frisc¹                          1778   2821   136
ex1010¹                         2299   3932   20
s38417¹                         3203   5617   135
oc_des_perf_opt_opt_abc_resyn⁴  10641  17150  185
systemcaes_0abc_ch²             13480  19634  388
b22³                            14585  23114  55
b17³                            15390  24986  135

¹: MCNC; ²: IWLS2005; ³: ITC99; ⁴: Altera

TABLE III

CIRCUIT SIZE VS. FPGA CAPACITY

Circuit size  Range (#LBs)    FPGA capacity (#LBs)
Small         up to 1000      1200
Medium        1001 to 4500    5400
Large         4501 to 16000   19200
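The classification in TABLE III amounts to a small lookup. The following Python sketch is only an illustration: the names `CATEGORIES`, `classify` and `utilization` are hypothetical and not part of the original tool flow, while the thresholds and capacities are copied from TABLE III.

```python
# Category upper limits and FPGA capacities copied from TABLE III (in LBs).
CATEGORIES = [
    ("small", 1000, 1200),
    ("medium", 4500, 5400),
    ("large", 16000, 19200),
]


def classify(num_lbs):
    """Return (category, FPGA capacity) for a circuit with num_lbs logic blocks."""
    for name, upper_limit, capacity in CATEGORIES:
        if num_lbs <= upper_limit:
            return name, capacity
    raise ValueError("circuit exceeds the largest category in TABLE III")


def utilization(num_lbs):
    """LB utilization of the circuit on the FPGA assigned to its category."""
    _, capacity = classify(num_lbs)
    return num_lbs / capacity
```

For instance, `classify(1845)` places spla (1845 LBs in TABLE II) in the medium category, which is mapped onto the FPGA with 5400 LBs.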

Chapter 4 Architectural Evaluations

Fig. 17. Overview of architectures.

Fig. 17 shows an overview of the different architectures. Comparing different patterns requires a common standard, i.e., a baseline architecture (BSL).

The FPGA with fully connected TSVs inside each 3D-SB and fully distributed 3D-SBs is our BSL because this architecture has the maximum number of TSVs. It therefore costs the most area, but its delay is minimal as a result of maximal routability.

There are two ways to reduce the number of TSVs. The first is to remove TSVs inside a 3D-SB, which makes the 3D-SBs partially connected but still fully distributed; this is called the internally-sparse (IS) architecture. The second is to remove some of the 3D-SBs in a regular pattern, which leaves the remaining 3D-SBs fully connected but partially distributed; this is called the externally-sparse (ES) architecture. Combining the two yields a hybrid in which the 3D-SBs are both partially connected and partially distributed; this is called the sparse (SP) architecture.

4.1 Baseline Architecture (BSL)

The baseline architecture (BSL), which is based on TPR, is discussed in this section. Because there are 32 wires in each X-Y channel, the number of TSVs in each 3D-SB is also 32. The pattern is shown in Fig. 18.

Fig. 18. Baseline architecture (BSL).

Fig. 19. The average normalized delay of BSL.

Fig. 20. The average TSV utilization of BSL.

Delay results are shown in Fig. 19. For each test case, the delay at each layer count is first normalized to its 2D delay, i.e., the delay when the number of layers is 1. These normalized delays are then averaged over all test cases. The results show that 3D (2–8 layers) is better than 2D (1 layer) in terms of delay. The delay decreases substantially from 1 to 3 layers, where the 3D benefit of shortening global interconnects is most evident, and saturates at 4–5 layers. It then increases slightly at 6–8 layers: with that many layers, most horizontal paths are replaced with vertical ones, and the larger number of switches on those paths increases the delay. All in all, 4–5 layers are sufficient for delay.
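The normalize-then-average procedure can be sketched as follows. This is a minimal Python illustration that assumes an arithmetic mean; the function name and the sample delays are placeholders, not measured data.

```python
def average_normalized_delay(delays):
    """Average per-layer delays after normalizing each test case to its 2D delay.

    delays maps a test-case name to {number of layers: delay}; every test case
    must provide the same layer counts, including the 1-layer (2D) entry.
    """
    layer_counts = sorted(next(iter(delays.values())))
    averages = {}
    for layers in layer_counts:
        # Normalize each test case to its own 2D result before averaging.
        ratios = [per_case[layers] / per_case[1] for per_case in delays.values()]
        averages[layers] = sum(ratios) / len(ratios)
    return averages


# Placeholder delays in arbitrary units, for illustration only.
example = {
    "case_a": {1: 10.0, 2: 8.0, 4: 7.0},
    "case_b": {1: 20.0, 2: 18.0, 4: 15.0},
}
curve = average_normalized_delay(example)
```

Normalizing before averaging keeps large test cases with long absolute delays from dominating the curve.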

Although the vertical routability is maximized in BSL (which is best for delay), the TSV utilization is extremely low (<10%), which implies that more than 90% of the TSVs are unused. This suggests that there is still considerable room for architectural improvement. In this thesis, BSL serves as the baseline for comparison with the architectures proposed later.

4.2 Internally-Sparse Architecture (IS)

The internally-sparse (IS) architecture removes TSVs inside a 3D-SB and makes all 3D-SBs partially connected but still fully distributed. The architecture is shown in Fig. 21.

Fig. 21. IS architecture (IS).

IS# represents a pattern of the IS architecture, where the postfix # specifies the number of TSVs available in a 3D-SB. For example, every 3D-SB in IS16 has only 16 TSVs inside, while IS32 is actually equivalent to the BSL.

As mentioned in Section 3.1, the multi-segment routing structure is adopted, so the X-Y directions contain four different wire lengths, each with a different number of tracks. In BSL this makes no difference, since each X-Y channel has its own vertical connection. In an IS architecture, however, some of the TSVs in a 3D-SB are removed, and the X-Y wire segments that the individual TSVs connect to may differ in length. Removing a TSV connected to a short wire segment causes a loss of routability, whereas removing a TSV connected to a long wire segment causes a loss of performance. To strike a balance between performance and routability, it is better to preserve the ratio among the numbers of TSVs connected to the four wire-segment types whenever possible. If the ratio cannot be preserved, TSVs are removed, for better timing, in the priority order L1 first, followed by L2, L4 and L8.

As TABLE IV shows, there are 6 patterns excluding BSL in this IS architecture.

TABLE IV

NUMBER OF TSVS IN ONE 3D-SB OF IS PATTERNS

IS (#TSVs in one 3D-SB)  L1  L2  L4  L8
BSL (IS32)               12  12  4   4
IS28                     10  10  4   4
IS24                     9   9   3   3
IS20                     7   7   3   3
IS16                     6   6   2   2
IS12                     4   4   2   2
IS8                      3   3   1   1

The ratio among L1, L2, L4 and L8 in IS32 is 3:3:1:1 because the numbers of tracks with vertical (Z) connectivity for (L1, L2, L4, L8) are (12, 12, 4, 4). To preserve this ratio in IS24, 3 TSVs are removed from each of L1 and L2 and 1 TSV from each of L4 and L8, so the corresponding numbers of vertical tracks, i.e., TSVs, are (9, 9, 3, 3). However, only 4 TSVs are removed when going from IS32 to IS28, so the ratio cannot be preserved. Based on the timing-driven consideration, 2 TSVs are removed from each of L1 and L2, and the numbers of TSVs for (L1, L2, L4, L8) become (10, 10, 4, 4). The TSV counts of the other IS patterns listed in TABLE IV are determined in the same way.

Fig. 22. The normalized areas of different IS patterns.

Fig. 22 shows the areas of all these 6 patterns normalized to BSL. SB and Tile denote the normalized areas of all SBs and all tiles, respectively, after removing TSVs. The area of Tile is larger than that of SB because a tile contains not only an SB but also an LB and a CB. Because the numbers of TSVs in the patterns from IS28 to IS8 form an arithmetic sequence with a common difference of 4 TSVs, the area decreases linearly.
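The per-segment TSV counts listed in TABLE IV follow from the ratio rule described for the IS patterns. The Python sketch below is one way to reproduce the table; the function name `tsvs_per_segment` is hypothetical, and the rule is inferred from the table rows rather than taken from the original implementation.

```python
def tsvs_per_segment(num_tsvs):
    """Return the (L1, L2, L4, L8) TSV counts for pattern IS<num_tsvs>."""
    if num_tsvs % 4 != 0 or not 8 <= num_tsvs <= 32:
        raise ValueError("only IS8, IS12, ..., IS32 are covered")
    if num_tsvs % 8 == 0:
        # The 3:3:1:1 ratio can be preserved exactly.
        unit = num_tsvs // 8
        return (3 * unit, 3 * unit, unit, unit)
    # Otherwise start from the next larger ratio-preserving pattern and
    # remove 2 TSVs from each of L1 and L2 (short segments go first).
    unit = (num_tsvs + 4) // 8
    return (3 * unit - 2, 3 * unit - 2, unit, unit)
```

Calling the function for 32, 28, ..., 8 yields the seven rows of TABLE IV in order.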

Fig. 23. The average normalized delays of different IS patterns.

Fig. 23 shows the delays of these IS patterns. For an IS pattern, the delay of each test case at each layer count is first normalized to the delay of the same test case at the same layer count in BSL. These normalized delays are then averaged over all test cases to give the delay of this IS pattern. Because the delays of IS28, IS24 and IS20 are too close to be distinguished, only IS20 is plotted in Fig. 23 to represent them. As the number of TSVs in a 3D-SB decreases, the delay becomes worse, as expected. Note that some test cases fail to be mapped onto IS12. Most of the test cases fail in IS8, and even when one succeeds, its delay increases significantly.

These failures result from an insufficient number of vertical routing tracks in some hot regions. The example in Fig. 24 illustrates the TSV utilization of test case s38584.1 in the 8-layer IS8 pattern. The test case is on the verge of failure because the mapping succeeds only for some of the given seeds. The number of used TSVs in the red regions is eight, i.e., the maximum number of TSVs inside a 3D-SB in the IS8 pattern. This shows that congestion in these hot regions accounts for the failure of test cases, because the TSV demand there exceeds the supply.

Fig. 24. The TSV utilization of case s38584.1 in the 8-layer IS8 pattern.

Finally, since IS20 achieves about 27% overall area reduction with a delay penalty of less than 3% as compared to BSL, it is the pattern with the best area/delay balance among all IS patterns.

4.3 Externally-Sparse Architecture (ES)

The externally-sparse (ES) architecture removes partial 3D-SBs regularly and makes all 3D-SBs still fully connected but partially distributed.

ES# represents a pattern of the ES architecture, where the postfix # specifies the maximum distance between two adjacent 3D-SBs in either X or Y direction.

Fig. 25. Examples of ES patterns.

As Fig. 25 shows, ES1 is our BSL. ES2 means that one 3D-SB exists for every 2 SBs in the X-Y directions, so the total number of TSVs in ES2 is half that of BSL. Similarly, in ES3 one 3D-SB exists for every 3 SBs in the X-Y directions and the total number of TSVs is about 33% of that of BSL. Patterns ES4 to ES9 are defined in the same way.

Fig. 26. The ES architecture with (a)/(c) the oblique stripes pattern and (b)/(d) the vertical stripes pattern.

The ES architecture replaces some of the fully connected 3D-SBs with 2D-SBs to save TSV area. There are various ways, or patterns, of mixing 2D-SBs and 3D-SBs. The pattern used in the ES architecture is oblique stripes, which is the same as the one used in 3D MEANDER. Unlike the vertical or horizontal stripes patterns, the oblique stripes pattern guarantees that each 3D-SB can reach some other 3D-SBs within a fixed distance in any direction, which provides much better routability.

Take the patterns shown in Fig. 26 as an example. From the perspective of a 2D-SB, the shaded region in Fig. 26(a) covers the area where the 2D-SB can access vertical links within the shortest distance. With the same shaded region, however, a 2D-SB can reach only two 3D-SBs in an architecture with the vertical stripes pattern, as shown in Fig. 26(b). From the perspective of a 3D-SB, the shaded region in Fig. 26(c) covers 8 other 3D-SBs through which the 3D-SB can access vertical links within the shortest distance. With the same shaded region, however, a 3D-SB can reach only 6 other 3D-SBs in an architecture with the vertical stripes pattern, as shown in Fig. 26(d). Therefore, the pattern in Fig. 26(a)/(c) is preferred over that in Fig. 26(b)/(d).
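The difference between the two stripe styles can be illustrated with a small Python sketch. It assumes that the oblique stripes of ES# place a 3D-SB wherever (x + y) is a multiple of #, which is one plausible reading of Fig. 25 rather than a confirmed definition; all function names are hypothetical, and the layer is treated as unbounded when walking along a direction.

```python
from fractions import Fraction


def oblique(x, y, n):
    """Assumed oblique-stripes rule: 3D-SBs lie on diagonals spaced n apart."""
    return (x + y) % n == 0


def vertical(x, y, n):
    """Vertical-stripes rule used for comparison: every n-th column is 3D."""
    return x % n == 0


def fraction_3d(pattern, dim, n):
    """Share of SBs in a dim x dim layer that are 3D-SBs."""
    hits = sum(pattern(x, y, n) for x in range(dim) for y in range(dim))
    return Fraction(hits, dim * dim)


def steps_to_3d(pattern, x, y, n, dx, dy, limit=64):
    """Steps from (x, y) along (dx, dy) until a 3D-SB is met, or None."""
    for step in range(1, limit + 1):
        if pattern(x + step * dx, y + step * dy, n):
            return step
    return None
```

Under this assumption both styles keep the same share of 3D-SBs (1/2 for ES2, 1/3 for ES3), but only the oblique style reaches a 3D-SB within # steps in each of the four directions; with vertical stripes, a walk along Y from a column without 3D-SBs never finds one.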

Fig. 27. The normalized areas of different ES patterns.

Fig. 27 shows the normalized areas of different patterns in the ES architecture. As the distance increases, the area reduction is significant at first and then gradually flattens. The curve resembles a harmonic sequence, since the total number of TSVs in ES# is 1/# of that of BSL.

Fig. 28. The average normalized delays of different ES patterns.

Fig. 28 shows the delays of the ES patterns, normalized to BSL. As the distance between 3D-SBs in an FPGA increases, the delay becomes worse, as expected. Failed test cases appear in ES8 and ES9. ES2, ES3 and ES4 incur a delay penalty of around 3% relative to BSL. The areas of ES7 to ES9 are similar, but the delay of ES9 increases substantially. Therefore, there is no need to replace too many 3D-SBs with 2D-SBs: the area saturates while the delay degrades considerably.

Fig. 29. (a) Connection manner inside a 3D-SB and (b) partially connected 3D-SB.

Unlike the IS architecture, in which test cases start to fail at IS12 (55% of the BSL tile area), no test case fails in the ES architecture until ES8, which has only 37% of the BSL tile area. The reason is the connection manner inside a 3D-SB, as shown in Fig. 29(a). TSVs connect only to the intersections of X-Y direction wires with the same index. For example, wire₁ in the X direction connects to a TSV only at its intersection with wire₁ in the Y direction. In other words, signals on wire₁ will never be passed to wire₂, wire₃ or wire₄. Take the partially connected 3D-SB in Fig. 29(b) as an example. The 2 TSVs connect to the intersections of wire₁ and wire₃ rather than those of wire₂ and wire₄. That is, signals between different layers are passed only through wire₁ and wire₃; wire₂ and wire₄ carry only signals within the same layer and provide no interlayer communication. Therefore, the fewer TSVs there are inside a 3D-SB in the IS architecture, the fewer X-Y tracks can pass signals between different layers. The situation in the ES architecture is completely different. Even though some 3D-SBs are replaced with 2D-SBs, i.e., all the TSVs inside them are removed, signals can still access vertical links through other, fully connected 3D-SBs within the same layer. Consequently, the number of failed cases is smaller in the ES architecture.

Finally, since ES2 achieves about 36% overall area reduction with a delay penalty of less than 3% as compared to BSL, it should be the one with a better area/delay balance among all ES architectures.

4.4 Sparse Architecture (SP)

The sparse (SP) architecture is a hybrid architecture which takes the two previous strategies simultaneously. It makes 3D-SBs partially connected and also partially distributed. The architecture is shown in Fig. 30.

Fig. 30. Sparse architecture (SP).

SP(#₁, #₂) is used to name a specific pattern of this hybrid architecture, which is the combination of IS#₁ and ES#₂. This notation can be generalized to represent the IS and ES architectures as well. For example, SP(32, 1) is equivalent to BSL, SP(16, 1) implies IS16, SP(32, 2) is actually ES2, and SP(20, 2) means that a 3D-SB exists on every 2 SBs in the X-Y directions with 20 TSVs in each 3D-SB.

It is not difficult to directly determine the relative area sizes of patterns in the IS or ES architectures because only one strategy is applied. However, the SP architecture applies the two strategies simultaneously, and thus it is not easy to obtain the relative area sizes of SP patterns intuitively. To address this issue, we define the TSV density as the average number of TSVs per tile, as shown in (2):

TSV density = IS# / ES# (2)

Because the area of the TSVs is so large that it almost dominates the area of a tile, the average number of TSVs per tile, i.e., the TSV density, can be regarded as the relative area size of a pattern.

There are 48 possible combinations in total when the 6 IS patterns (Section 4.2) and the 8 ES patterns (Section 4.3) are mixed together. Although many patterns could be considered, there is no need to go through all of them. Only 4 IS patterns, IS28, IS24, IS20 and IS16, are involved because many test cases start to fail from IS12. Only 3 ES patterns, ES2, ES3 and ES4, are considered because the delay penalties of the other ES patterns are larger than 6% relative to BSL. Combining the 4 IS patterns with the 3 ES patterns leaves 12 combinations to be considered. These patterns are arranged in decreasing order of area, as Fig. 31(a) shows. Only the 9 patterns with the largest areas are listed because the delays of the other 3 patterns fall outside the acceptable range, i.e., their delay penalty exceeds 5% relative to BSL for most layer counts.
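The ordering of the 12 candidates can be reproduced from (2) by using the TSV density as a proxy for area. The following Python sketch is illustrative only; the names are hypothetical, and exact fractions are used so that patterns with equal density compare as ties.

```python
from fractions import Fraction
from itertools import product

IS_CANDIDATES = (28, 24, 20, 16)   # IS patterns kept for the SP study
ES_CANDIDATES = (2, 3, 4)          # ES patterns kept for the SP study


def tsv_density(is_n, es_n):
    """TSV density of SP(is_n, es_n) according to (2): IS# / ES#."""
    return Fraction(is_n, es_n)


def ranked_sp_patterns():
    """All 12 SP(IS#, ES#) candidates, from highest to lowest TSV density."""
    combos = product(IS_CANDIDATES, ES_CANDIDATES)
    return sorted(combos, key=lambda p: tsv_density(*p), reverse=True)


ranked = ranked_sp_patterns()
kept = ranked[:9]      # the 9 patterns with the largest area
dropped = ranked[9:]   # the 3 patterns with the smallest area
```

With this ranking, SP(20, 2) has a density of 10, the 9 retained patterns span densities from 14 down to 6, and the 3 discarded patterns are SP(16, 3), SP(20, 4) and SP(16, 4).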

Fig. 31. The normalized (a) areas and (b) delays of different SP patterns (partial).

As the TSV density decreases, the delay becomes worse, as expected, because of diminished routability. A TSV density of 10 is the threshold for keeping the delay penalty within 3% of BSL. This suggests that a promising configuration in the SP architecture should have a TSV density of at least 10. Although the delays of patterns with TSV densities from 8 down to 6 are difficult to distinguish, the number of layer counts at which the delay increases grows as the TSV density becomes smaller. Therefore, the delay becomes worse on average as the TSV density decreases. This is the reason for discarding the 3 patterns with smaller area.

Finally, since SP(20, 2) achieves about 50% overall area reduction with a delay penalty of less than 3% as compared to BSL, it is the pattern with the best area/delay balance among all SP patterns.

4.5 Comparisons between IS, ES and SP

This section compares the three architectures, IS, ES and SP. In each architecture style, the pattern with the smallest area that still satisfies the delay constraint (i.e., a delay penalty of less than 3% as compared to BSL) is selected as the representative. Fig. 32(a) shows the areas of the three representative patterns, IS20=SP(20, 1), ES2=SP(32, 2) and SP(20, 2), with all values normalized to that of IS20. The TSV density of each pattern is also marked. The area reduction of SP(20, 2) is larger than that of ES2 and IS20.

Delays of the three patterns, normalized to BSL, are shown in Fig. 32(b). All three patterns have a delay penalty of less than 3% relative to BSL. All in all, SP(20, 2) outperforms IS20 and ES2 with a smaller tile area under similar delay. That is, the SP architecture outperforms the IS and ES architectures.

Fig. 32. Comparisons of (a) areas and (b) delays among IS20, ES2 and SP(20, 2).

Chapter 5 Sunny Egg Architecture

5.1 Introduction and Patterns Involved

Although SP(20, 2) already achieves an area reduction of 50% with a minor delay loss as compared to BSL, there is still room for improvement. Fig. 33 shows the average TSV distribution in a 6-layer BSL over all medium test cases. The results suggest that the TSV demand is much higher in the central region than in the peripheral zone: the TSV utilization ratio between center and periphery is about 4:1. This observation inspires the sunny egg (SE) architecture shown in Fig. 34.

Fig. 33. Average TSV distribution in a 6-layer BSL over all medium test cases.

Fig. 34. Sunny egg.

The sunny egg architecture divides a horizontal plane into two regions: center (egg yolk) and periphery (egg white). The two regions are implemented using different SP architectures, with the TSV density in the center set larger than that in the periphery.

SE(IS_C#, ES_C#, R, IS_P#, ES_P#) indicates a specific SE architecture, where SP_C(IS_C#, ES_C#) and SP_P(IS_P#, ES_P#) are the SP patterns used for the center and the periphery, respectively, and R is the ratio between the dimension of the center and the dimension of the FPGA (D).

The dimension of the center (D_C) is the product of D and R, as given in (3):

D_C = D × R (3)
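How a tile is assigned to the center or the periphery can be sketched in Python as follows. The sketch assumes a square, centered yolk and rounds D_C to a whole number of tiles; neither detail is stated in the text, and the function names are hypothetical.

```python
def center_dimension(d, r):
    """D_C = D x R, rounded to a whole number of tiles (assumption)."""
    # Python's round() resolves exact .5 ties to the nearest even integer.
    return round(d * r)


def region(x, y, d, r):
    """Classify tile (x, y) of a d x d layer as 'center' or 'periphery'."""
    d_c = center_dimension(d, r)
    low = (d - d_c) // 2      # assumed: the center square is centered
    high = low + d_c
    if low <= x < high and low <= y < high:
        return "center"
    return "periphery"
```

For D = 10 and R = 0.5, the center is a 5 × 5 square, so 25 of the 100 tiles use the center pattern and the remaining 75 use the periphery pattern.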

Fig. 35. An example of sunny egg architecture, SE(32, 2, 0.5, 16, 4).

An example of the sunny egg architecture, SE(32, 2, 0.5, 16, 4), is shown in Fig. 35.

In order to determine the specific value of each SE parameter, a total of 20 SE patterns are evaluated as shown in TABLE V.

TABLE V

SE PATTERNS UNDER EVALUATION (PARTIAL)

IS_C# ES_C# R IS_P# ES_P#

The TSV density of center/periphery is set to 16/4 at R=0.7, where the areas of center and periphery are approximately equal (R² = 0.49), because: 1) the TSV utilization ratio between center and periphery is about 4:1 in Fig. 33; and 2) the overall TSV density of SP(20, 2), the pattern with the better area/delay balance in the SP architecture, is 10. Requiring the average of the center and periphery densities to be 10 and the center density to be 4 times the periphery density gives a center/periphery TSV density of 16/4 at R=0.7.

Two SP configurations, SP_C(32, 2) and SP_C(16, 1), with the higher TSV density (16) are set for the center. Two SP configurations, SP_P(16, 4) and SP_P(8, 2), with the lower TSV density (4) are set for the periphery; the other two configurations with this density, SP_P(32, 8) and SP_P(4, 1), are not appropriate here for two reasons: 1) in SP_P(32, 8), the distance between 3D-SBs, i.e., ES_P#, is as large as in ES8, so test cases may fail to be mapped onto it; and 2) in SP_P(4, 1), the number of TSVs inside a 3D-SB, i.e., IS_P#, is far smaller than in IS20, so the delay may increase substantially. Given the same TSV density, SP_C(16, 1) distributes TSVs more evenly than SP_C(32, 2) in the center. Similarly, SP_P(8, 2) distributes TSVs more evenly than SP_P(16, 4) in the periphery. With two patterns each for the center and the periphery, there are four patterns under R=0.7.

For more area reduction, R ranges from 0.7 to 0.3 under the same center/periphery settings as shown in TABLE V.
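The 20 SE patterns and their overall TSV densities can be enumerated with a short Python sketch. Two assumptions are made here and are not stated in the text: R is stepped by 0.1 from 0.7 down to 0.3, and the overall density is the area-weighted average of the center and periphery densities, with the center occupying R² of the plane. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

CENTER = [(32, 2), (16, 1)]      # SP_C candidates, TSV density 16
PERIPHERY = [(16, 4), (8, 2)]    # SP_P candidates, TSV density 4
R_VALUES = [Fraction(k, 10) for k in (7, 6, 5, 4, 3)]  # assumed step of 0.1


def overall_density(center, periphery, r):
    """Area-weighted TSV density; the center covers r*r of the plane (assumption)."""
    share = r * r
    return share * Fraction(*center) + (1 - share) * Fraction(*periphery)


def se_patterns():
    """Enumerate SE(IS_C#, ES_C#, R, IS_P#, ES_P#) over all candidate settings."""
    return [
        (c[0], c[1], r, p[0], p[1])
        for r, c, p in product(R_VALUES, CENTER, PERIPHERY)
    ]
```

At R = 0.7 the weighted density is 0.49 × 16 + 0.51 × 4 = 9.88, close to the density of 10 of SP(20, 2); smaller values of R shrink the dense center and lower the overall density further.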

center/periphery and the value of R.

Fig. 36. The average normalized delays of uneven/even TSV distribution at the center.

Fig. 36 shows the average normalized delays of the uneven and even TSV distributions at the center. Each result is the average of 10 patterns, and the results suggest that the uneven distribution is better than the even one at the center. That is, SP_C(32, 2) is better than SP_C(16, 1).

Fig. 37. The average normalized delays of uneven/even TSV distribution at the periphery.

Fig. 37 shows the average normalized delays of the uneven and even TSV distributions at the periphery. Each result is the average of 10 patterns. There is not much difference between the two, and we choose the even distribution, i.e., SP_P(8, 2), as the type at the periphery.

Fig. 38. The (a) normalized areas and (b) avg. normalized delays of different R’s.
