Problem Formulation - Fault-Tolerant Architecture Exploration and Fast Reconfiguration Algorithm for Three-Dimensional Field-Programmable Gate Arrays

Chapter 2 Preliminaries

2.5 Problem Formulation

Given a netlist of CLBs, architecture specification, existing placement and routing result and locations of faulty CLBs, our objective is to partially reconfigure

Chapter 3 Proposed Demand-Aware Reconfiguration Algorithm

As mentioned in Section 1.4, we propose a demand-aware reconfiguration algorithm that takes the demand for spare CLBs into account during fault reconfiguration. Starting from the cost function of the ripple-move reconfiguration algorithm (Equation (8)), we modify it into a demand-aware version consisting of a delay cost and a demand cost, as shown in Equation (9).

(8)

(9)

To estimate the delay between two blocks quickly, an approximation is computed from their Manhattan distance. The estimated delays are recorded in a delay lookup matrix. When we construct a DAG for a faulty CLB, each edge is weighted, by looking up this matrix, with the delay required to move the block from its existing location to the adjacent CLB, and Cost_delay is obtained by finding the shortest (least-delay) path on the DAG.

The objective of the demand cost Cost_demand is to achieve a solution with more [...] our iterative reconfiguration algorithm, and in the last section the routing algorithm and re-routing algorithm are described.


3.1 Reconfiguration Algorithm

3.1.1 The Concept of Our Algorithm

In the ripple-move reconfiguration method, a DAG is constructed for each faulty block. The weight of an edge e = <u, v> is set to the difference between the required delay of the block when it resides in CLB u and when it resides in the adjacent CLB v; the shortest (least-cost) path between the source CLB and the destination CLB is then found. The faulty block is reconfigured by ripple-moving blocks to their adjacent CLBs along this path. However, the result of this greedy method may be too local: for each fault, it constructs a DAG only from the faulty block's point of view, and non-faulty spare CLBs are chosen greedily as destinations so that the local delay cost is minimized in each iteration. In fact, the demand for non-faulty spare CLBs may differ from one faulty block to another, so both factors, delay and demand, have to be considered when constructing the DAG. Our algorithm is built on this concept.
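The ripple-move step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph, the edge weights, and the names `shortest_path` and `ripple_move` are hypothetical, and Dijkstra's algorithm is used as a generic shortest-path search rather than the thesis's own DAG routine.

```python
import heapq

def shortest_path(edges, source, destinations):
    """Dijkstra over a weighted directed graph {u: [(v, weight), ...]}.
    Returns (cost, path) to the cheapest node in `destinations`, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u in destinations:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return None

def ripple_move(placement, path):
    """Shift every block one CLB forward along `path`, starting at the
    spare end, so that the faulty source CLB ends up empty."""
    new = dict(placement)
    for i in range(len(path) - 1, 0, -1):
        new[path[i]] = new.get(path[i - 1])
    new[path[0]] = None
    return new

# Hypothetical example: faulty CLB "F8", spares "S1" and "S3".
edges = {"F8": [("b7", 1.0), ("b9", 2.0)], "b7": [("S1", 1.0)], "b9": [("S3", 2.0)]}
cost, path = shortest_path(edges, "F8", {"S1", "S3"})
placement = {"F8": "blkA", "b7": "blkB", "b9": "blkC", "S1": None, "S3": None}
after = ripple_move(placement, path)
```

In this toy instance the block on the faulty CLB moves to `b7`, and the block previously on `b7` moves to the spare `S1`; blocks off the path are untouched.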

3.1.2 Demand-Aware Reconfiguration Algorithm

For ease of explanation, we use an example to detail how this concept of demand is applied. At the beginning of each iteration, we determine a search range with distance d_SER by expanding outward from b_FT, the target faulty and mapped CLB that is the most critical, until the range contains at least K spare CLBs (K = 3 in this example). Notice that this is a three-dimensional range, as shown in Figure 16; its definition is given in Equation (10).

Figure 16. Find the searching distance.
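As a hedged illustration of the range search, the sketch below assumes that the range is measured by three-dimensional Manhattan distance (Equation (10) is not reproduced here, so this is an assumption). Under that assumption, expanding the range until it contains K spares is the same as taking the K-th smallest distance. The function names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y, z) CLB coordinates."""
    return sum(abs(p - q) for p, q in zip(a, b))

def search_distance(b_ft, spares, k):
    """Smallest d such that at least k spare CLBs lie within
    Manhattan distance d of b_ft; None if fewer than k spares exist."""
    dists = sorted(manhattan(b_ft, s) for s in spares)
    return dists[k - 1] if len(dists) >= k else None

# Hypothetical coordinates (x, y, layer).
b_ft = (2, 2, 0)
spares = [(0, 2, 0), (2, 4, 0), (4, 3, 1), (5, 5, 3)]
d_ser = search_distance(b_ft, spares, k=3)
```

Here the three closest spares are at distances 2, 2 and 4, so the range must grow to 4 before it holds K = 3 spares.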

Next, we find the neighboring faulty and mapped CLBs of each b_S. Considering b_S3 first, we use d_SER to find b_FN, as shown in Figure 17. We define B_FN as the set of faulty and mapped CLBs in the neighborhood of b_S, as shown in Equation (11), where B_F and B_M denote the sets of faulty and mapped CLBs and d(·,·) is the distance between two CLBs.

$$B_{FN}(b_S) = \{\, b_{FN} \mid d(b_S, b_{FN}) \le d_{SER},\ b_{FN} \in B_F \cap B_M \,\} \qquad (11)$$

Figure 17. Neighbor faulty and mapped CLBs of b_S3.

We then use d_SER to find the neighboring non-faulty spare CLBs B_SN of these faulty and mapped CLBs, as shown in Figure 18. B_SN is defined as the set of spare CLBs residing in the neighborhood of b_F at a distance smaller than or equal to d_SER, as shown in Equation (12), where B_S denotes the set of non-faulty spare CLBs.

$$B_{SN}(b_F) = \{\, b_{SN} \mid d(b_F, b_{SN}) \le d_{SER},\ b_{SN} \in B_S \,\} \qquad (12)$$

Figure 18. Neighbor spare CLBs of (a) b_F8, (b) b_F10, and (c) b_F15.
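The two neighborhood sets can be sketched directly from their verbal definitions. The code below is a minimal illustration that assumes Manhattan distance on (x, y, layer) coordinates; the function names and the sample coordinates are hypothetical and do not correspond to the CLBs in Figures 17 and 18.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y, z) CLB coordinates."""
    return sum(abs(p - q) for p, q in zip(a, b))

def neighbor_faulty_mapped(b_s, faulty, mapped, d_ser):
    """B_FN(b_S): CLBs that are both faulty and mapped within d_ser of spare b_s."""
    return {b for b in (faulty & mapped) if manhattan(b_s, b) <= d_ser}

def neighbor_spares(b_f, spares, d_ser):
    """B_SN(b_F): (non-faulty) spare CLBs within d_ser of faulty CLB b_f."""
    return {s for s in spares if manhattan(b_f, s) <= d_ser}

# Hypothetical data: (2, 0, 0) is faulty but unmapped, so it is ignored.
faulty = {(1, 1, 0), (3, 1, 0), (2, 0, 0)}
mapped = {(1, 1, 0), (3, 1, 0), (2, 2, 0)}
spares = {(2, 1, 0), (4, 4, 1)}

b_fn = neighbor_faulty_mapped((2, 1, 0), faulty, mapped, d_ser=1)
b_sn = neighbor_spares((1, 1, 0), spares, d_ser=1)
```

A faulty CLB that holds no block needs no repair, which is why only the intersection of the faulty and mapped sets is searched.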

Here we introduce Equation (13), which represents the demand of b_F for b_S and is inversely proportional to the number of spare CLBs in the neighborhood and to the Manhattan distance between them. If a spare CLB is very close to b_F and only a few spare CLBs are in b_F's neighborhood, then the demand of b_F for that spare CLB is very high. The maximum demand D_md-max for a spare CLB is defined as

(13) (14)

Figure 19. The reconfiguration procedure when considering b_S1.

According to the definition, the demands of b_F10 and b_F15 for b_S3 are:

For the reconfiguration, the demand cost of b_FT for b_S is defined in Equation (15). A small demand cost means that the demand of b_FT for b_S is high.

(15)

So, in the example, the demand cost of b_F8 for b_S3 is:

Similarly, for b_S1, the procedure is shown in Figure 19. Notice that only b_F8 is in the neighborhood of b_S1, so the demand cost is 0.

Finally, for b_S13, the procedure is shown in Figure 20.

Figure 20. The reconfiguration procedure when considering b_S13.


In the example, we assume that the delay cost equals the Manhattan distance between two CLBs, so the delay cost of every shortest path in this example equals 2.

The costs of the candidate destinations in the first iteration are shown in Table 2, where the two weighting coefficients are set to 0.2 and 0.8, respectively.

Table 2. The cost of first iteration in the example.
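The way a delay cost and a demand cost combine into one figure of merit can be sketched as a weighted sum. The sketch below is an assumption-laden illustration, not the thesis's Equation (9): the assignment of 0.2 to the delay term and 0.8 to the demand term is assumed, and the demand costs given for b_S3 and b_S13 are placeholders (only the delay cost of 2 and the zero demand cost of b_S1 come from the example).

```python
def total_cost(delay_cost, demand_cost, w_delay, w_demand):
    """Weighted sum of the delay cost and the demand cost."""
    return w_delay * delay_cost + w_demand * demand_cost

def pick_destination(candidates, w_delay, w_demand):
    """candidates: {name: (delay_cost, demand_cost)}.
    Returns the candidate spare CLB with the smallest total cost."""
    return min(candidates,
               key=lambda n: total_cost(*candidates[n], w_delay, w_demand))

candidates = {
    "b_S1": (2, 0.0),    # demand cost 0, as stated in the example
    "b_S3": (2, 0.5),    # placeholder demand cost (hypothetical)
    "b_S13": (2, 0.25),  # placeholder demand cost (hypothetical)
}
best = pick_destination(candidates, w_delay=0.2, w_demand=0.8)
```

Because all delay costs are equal in the example, the demand term alone decides the choice, and any positive weights select b_S1.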

Finally, we take b_S1 as the destination, and b_F8 is reconfigured by moving blocks to adjacent CLBs along the shortest path {b_F8, b_7, b_S1}. This procedure is performed iteratively for each faulty block until all faults are successfully reconfigured or no path is found, in which case reconfiguration fails. The final result of the ripple-move reconfiguration algorithm is shown in Figure 21-(a). That method chooses one of {b_S1, b_S3, b_S13} at random as the destination because it does not consider the demand cost.

The result of our method is shown in Figure 21-(b); it demonstrates that our demand-aware reconfiguration algorithm can find a better solution than the ripple-move reconfiguration algorithm.

Figure 21. The result of (a) the ripple-move algorithm and (b) our algorithm.

The algorithm flow is shown in Figure 22. A DAG is constructed first; the reconfiguration path is then determined by finding the shortest path between the source and one of the K destinations; finally, the fault is reconfigured by moving blocks to adjacent CLBs along this path. The reconfiguration iteration is repeated until every block is located on a non-faulty CLB.

Figure 22. Our algorithm flow.

3.2 Re-routing Algorithm

3.2.1 Concept of Routing Algorithm

After placement, the locations of all CLBs have been determined, and a timing-driven router then routes all connections between CLBs. In the routing stage, the FPGA architecture is represented as a routing resource graph, which models wire segments, TSVs, and the input and output pins of logic blocks, as shown in Figure 23.

Figure 23. FPGA routing architecture and routing resource graph.

The routing algorithm in TPR is based on the Pathfinder negotiated congestion algorithm [26]. It iteratively rips up and re-routes every net until the result meets the congestion constraint. In the first iteration, all nets are routed to minimize delay without any congestion constraint; that is, routing resources are allowed to be overused.

When overuse exists at the end of a routing iteration, the cost of overusing a routing resource is increased, so that congestion is resolved in later routing iterations. This process is repeated until every routing resource is used by at most one net.
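The negotiation loop can be illustrated with a toy router. This sketch is not TPR's implementation: it assumes unit base costs, a capacity of one per node, a node cost of the form (base + history) × present-congestion penalty in the spirit of Pathfinder, and hypothetical parameters `pres_fac` and `hist_fac`.

```python
import heapq

def route_net(adj, src, dst, node_cost):
    """Dijkstra where entering node v costs node_cost(v)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None

def negotiate(adj, nets, max_iters=20, pres_fac=0.5, hist_fac=1.0):
    """nets: {name: (src, dst)}. Returns {name: path} once no node is
    shared by two nets, or None if that is not reached."""
    history = {n: 0.0 for n in adj}
    usage = {n: 0 for n in adj}
    routes = {}
    for it in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):  # rip up the old route
                usage[n] -= 1

            def cost(n):
                if it == 0:  # first pass ignores congestion
                    return 1.0
                return (1.0 + history[n]) * (1.0 + pres_fac * usage[n])

            path = route_net(adj, src, dst, cost)
            if path is None:
                return None
            routes[name] = path
            for n in path:
                usage[n] += 1
        overused = [n for n in adj if usage[n] > 1]
        if not overused:
            return routes
        for n in overused:  # make persistently contested nodes dearer
            history[n] += hist_fac * (usage[n] - 1)
    return None

# Two nets compete for node "m"; net B has a longer detour via x and y.
adj = {"s1": ["m"], "t1": ["m"], "s2": ["m", "x"], "t2": ["m", "y"],
       "m": ["s1", "t1", "s2", "t2"], "x": ["s2", "y"], "y": ["x", "t2"]}
routes = negotiate(adj, {"A": ("s1", "t1"), "B": ("s2", "t2")})
```

In the first pass both nets take the shared node `m`; once its history cost has been raised, net B moves to the detour while net A, which has no alternative, keeps `m`.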

3.2.2 Re-routing Algorithm

During fault-tolerant reconfiguration, the blocks on the shortest path are each moved by one grid position in every iteration, and with 10% faulty CLBs a considerable set of blocks is generally moved. When a block is moved, its connections are also affected, so we have to re-route these connections, as shown in Figure 24.

Figure 24. The concept of re-route.

We record the blocks that are moved during the placement stage of fault-tolerant reconfiguration. In the routing stage, we rip up all affected nets, keep the existing routing of the other nets fixed, and then re-route the affected nets. Figure 25 shows an example in which the endpoints of Net_1 connect to CLB_A, CLB_B and CLB_C, respectively. If the block originally residing in CLB_A is moved to a new CLB, we rip up Net_1, and the new routing of Net_1 starts from the output pin of the new CLB and terminates at the original sink1 and sink2. If the block residing in CLB_B or CLB_C is moved to a new CLB, we rip up Net_1, and the new routing of Net_1 starts from the original source and terminates at the input pin of the new CLB.

Figure 25. Rip-up and re-route the affected net.
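The bookkeeping behind this step can be sketched as follows. The data layout (a dictionary of nets with one source and a list of sinks) and the function name `affected_nets` are hypothetical; the sketch only identifies which nets to rip up and what their new endpoints are, leaving the re-routing itself to the router.

```python
def affected_nets(nets, moved):
    """nets: {name: {"source": clb, "sinks": [clb, ...]}}
    moved: {old_clb: new_clb} recorded during reconfiguration.
    Returns (updated_nets, names_to_rip_up)."""
    updated, rip_up = {}, []
    for name, net in nets.items():
        src = moved.get(net["source"], net["source"])
        sinks = [moved.get(s, s) for s in net["sinks"]]
        if src != net["source"] or sinks != net["sinks"]:
            rip_up.append(name)  # at least one endpoint changed
        updated[name] = {"source": src, "sinks": sinks}
    return updated, rip_up

nets = {
    "Net_1": {"source": "CLB_A", "sinks": ["CLB_B", "CLB_C"]},
    "Net_2": {"source": "CLB_D", "sinks": ["CLB_E"]},
}
# The block in CLB_A is moved to a (hypothetical) CLB named CLB_A2.
updated, rip_up = affected_nets(nets, {"CLB_A": "CLB_A2"})
```

Only Net_1 is ripped up here; Net_2 keeps its existing routing because none of its endpoints moved.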

Chapter 4 Fault Tolerant Architecture

4.1 Non-Reserved (NR)

The locations of blocks are determined by an SA-based placement algorithm whose objective is to minimize wirelength and circuit delay. As a result, spare CLBs are pushed to the edge of the FPGA; such a distribution of spare CLBs is called the non-reserved (NR) architecture, as shown in Figure 26-(a). This placement is not well suited to fault reconfiguration through replacement with spare CLBs, because most spares are located along the edge, which may cause a large number of CLBs to be moved by ripple-move fault reconfiguration, as shown in Figure 26-(b). Therefore, even with a better reconfiguration algorithm, the results are limited by the restrictions of the architecture.

Figure 26. (a) The timing-driven placement. (b) The drawback of timing-driven placement for fault tolerance.

4.2 Evenly-Distributed (ED)

As mentioned above, the traditional architecture is not well suited to fault tolerance, which motivates us to explore new architectures that take fault tolerance into consideration. We address this problem by distributing spare CLBs evenly across the FPGA and pre-allocating them as spare resources before the SA-based placement algorithm runs. These pre-allocated spare CLBs may not be used during SA-based placement, so we obtain a placement result with spares evenly distributed in the 3D FPGA design. Such a distribution of spare CLBs is called the evenly-distributed (ED) architecture. When faults occur, spares are very likely to be close to the faulty CLBs, which benefits replacement without severe timing degradation.

We propose five optional ED architectures: ED3, ED4, ED5, ED6 and ED7. ED# denotes a spare pattern in which the postfix # specifies the maximum distance between two adjacent spare CLBs in the X, Y, or XY direction, as shown in Figure 27. The estimated percentage of reserved spare CLBs for each ED architecture is shown in Table 3.

Table 3. The estimated percentage of reserved spare CLBs.

It should be noted that the CLB utilization of most FPGAs is only 70–80% in order to enhance routability. Because we use spare CLBs, the total number of signal nets does not increase, and thus the routing complexity does not increase significantly. However, the price to be paid for the fault-tolerant architecture is additional delay, because we change the original timing-driven placement; detailed discussions are given in Chapter 5.

Figure 27. Evenly-distributed architecture.

Chapter 5 Experimental Results

5.1 Experimental Environment

The architectural settings in our experiments are shown in Table 4. The settings of CLBs and channel width are based on Altera Stratix IV [27], Xilinx FPGAs [28] and related work [29]. The 32 wires comprise four types of wire segments with different lengths: L1, L2, L4 and L8. The length of a wire segment is the number of CLBs it spans. There are 12 L1/L2 and 4 L4/L8 wires. In the Z direction, each TSV spans only one layer for routability.

Table 4. The architecture setting.

Table 5 shows the 16 test cases in our benchmark set, sorted by number of CLBs: 15 are from MCNC [30] and 1 is from IWLS2005 [31]. For each test case we perform 25 experimental runs with different random seeds (5 fault seeds and 5 placement seeds) and report the average as the result. In addition, the number of layers (nz) is set to 4, the CLB utilization is set to 70%, and the fault rate is set to 10%.

Table 5. The benchmark circuits.
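The seed protocol can be sketched as below. The fault generator is an assumption: it draws exactly the given fraction of CLBs uniformly at random, which is only one plausible reading of a uniform fault model, and `run` stands for a hypothetical function that executes one experiment and returns a number.

```python
import itertools
import random

def inject_uniform_faults(clbs, fault_rate, seed):
    """Pick round(len(clbs) * fault_rate) CLBs uniformly at random
    (assumed reading of the uniform fault model)."""
    rng = random.Random(seed)
    n_faults = round(len(clbs) * fault_rate)
    return set(rng.sample(clbs, n_faults))

def average_over_seeds(run, fault_seeds, placement_seeds):
    """Average run(fault_seed, placement_seed) over every seed pair."""
    results = [run(fs, ps)
               for fs, ps in itertools.product(fault_seeds, placement_seeds)]
    return sum(results) / len(results)

clbs = [(x, y, z) for x in range(5) for y in range(5) for z in range(4)]
faults = inject_uniform_faults(clbs, fault_rate=0.10, seed=1)

# Placeholder experiment: returns a dummy value instead of a real delay.
mean = average_over_seeds(lambda fs, ps: float(fs + ps), range(5), range(5))
```

With 5 fault seeds and 5 placement seeds, `itertools.product` yields the 25 combinations used per test case.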

5.2 Results and Analysis

5.2.1 Experimental Flow

In our experiments, three types of configuration-level repair methods are implemented: i) resynthesis, ii) Cong's reconfiguration algorithm, and iii) our reconfiguration algorithm. Figure 28 shows the experimental flow of resynthesis: the faulty CLBs are marked before layer assignment and treated as unmappable. Figure 29 shows the experimental flow of the two reconfiguration algorithms: taking the initial placement and routing as an existing result, faults are repaired by partially reconfiguring blocks so that they avoid faulty CLBs.

Figure 28. The experimental flow of resynthesis.

Figure 29. The experimental flow of reconfiguration.

5.2.2 Analysis of Timing Penalty

Timing degradation has two causes:

i) Initial architecture: six architectures are used in our experiments (NR, ED3, ED4, ED5, ED6 and ED7), each with a different percentage of reserved spare CLBs, i.e., a different spare density. A higher spare density spreads more blocks toward the edge of the FPGA and thus increases the delay more. Figure 30 shows the delay increase of each architecture compared to NR. ED7 has the smallest impact on timing because it has the lowest spare density, whereas ED3 has the largest timing overhead. For ease of exposition, we refer to the result of the NR architecture as IA-NR.

Figure 30. Timing penalty caused by fault tolerant architecture.

ii) Reconfiguration: the delay increases as the circuit placement is reconfigured. The ED architectures are fault-tolerance friendly: the higher the spare density, the more spare CLBs lie close to faulty blocks, and the lower the timing degradation during reconfiguration. Figure 31-(a) shows the delay increase caused by reconfiguration for the uniform fault model, relative to the respective IA results. The delay overhead is gradually reduced as the spare density grows, and the delay increase of our method is always lower than Cong's.

Similarly, Figure 31-(b) illustrates the delay increase for the clustered fault model. The delay increase is significantly higher than for the uniform fault model because a number of faults are localized within a region. This indicates that a clustered fault distribution is more difficult to reconfigure.

Figure 31. Timing penalty caused by reconfiguration for (a) the uniform fault model and (b) the clustered fault model.

Figure 32-(a) shows the delay increase caused by reconfiguration for the uniform fault model with IA-NR as the baseline. Our delay increases are lower than Cong's. The delay increase is gradually reduced at first as the spare density grows; however, if the spare density keeps increasing, the timing degradation caused by the initial architecture dominates, so the total delay increase grows again.

For the clustered fault model, the delay increase caused by reconfiguration is much higher than the degradation caused by the spare pattern, so the total delay increase decreases as the spare density grows, and our delay increases are lower than Cong's, as shown in Figure 32-(b).

Figure 32. Combined effect on timing penalty for (a) the uniform fault model and (b) the clustered fault model.

5.2.3 Success Rate

A successful result is one in which all faults are successfully reconfigured and the critical delay is within the timing constraint; otherwise, the case is called a failure case. The success rate is the percentage of successful results.
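The definition can be written down as a small sketch. The representation of a run as a pair (all faults repaired, critical delay) and the sample values are hypothetical; the split of failures into reconfiguration failures and timing failures follows the two failure types distinguished later in this section.

```python
def success_rate(runs, timing_constraint):
    """runs: list of (all_faults_repaired, critical_delay).
    Returns the percentage of runs that are repaired and meet timing."""
    ok = sum(1 for repaired, delay in runs
             if repaired and delay <= timing_constraint)
    return 100.0 * ok / len(runs)

def classify_failures(runs, timing_constraint):
    """Count (reconfiguration failures, timing failures)."""
    reconfig = sum(1 for repaired, _ in runs if not repaired)
    timing = sum(1 for repaired, delay in runs
                 if repaired and delay > timing_constraint)
    return reconfig, timing

# Hypothetical runs; delays are in arbitrary units.
runs = [(True, 9.5), (True, 10.05), (False, 9.0), (True, 10.2)]
strict = success_rate(runs, timing_constraint=10.0)
relaxed = success_rate(runs, timing_constraint=10.0 * 1.01)  # relaxed by 1%
```

Relaxing the constraint by 1% turns the run with delay 10.05 from a timing failure into a success, which is the effect examined in the results below.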

Resynthesis has the highest success rate as well as the smallest timing degradation, so we take its result as the baseline for reconfiguration. We therefore set the timing constraint to the delay at which every case has a 96% success rate in the resynthesis flow. Figure 33 shows the success rates for the uniform fault model. Our algorithm improves the success rate by up to 13%. If we relax the timing constraint by 1% (i.e., to 101% of the delay of the resynthesis flow with a 96% success rate), the overall success rate increases by 5% to 10% and our algorithm achieves up to 9% improvement.

Figure 33. The success rate for the uniform fault model.

Figure 34. The number of failure cases for the uniform fault model, panels (a) and (b).

Figure 34 shows the number of failure cases for the uniform fault model, separated into i) reconfiguration failure, where not all faults can find a corresponding reconfiguration path, and ii) timing failure, where all faults can be reconfigured but the resulting timing cannot meet the timing requirement. In Figure 34-(a), as expected, the number of reconfiguration failure cases decreases as the spare density grows. However, an initial architecture with a high spare density dominates the timing degradation, which makes its number of timing failure cases larger than that of the architectures with lower spare densities. Therefore, the total number of failure cases increases. In Figure 34-(b), also as expected, the number of reconfiguration failure cases decreases as the spare density grows. However, the number of timing failure cases is unstable because this algorithm makes only the locally optimal choice in each iteration. Taking NR and ED7 as an example, low timing degradation is obtained in the initial iterations for ED7 because the faulty blocks are close to spare CLBs; however, more results violate the timing constraint in the last iterations for ED7 than for NR.

Figure 35 shows the success rates for the clustered fault model. Our algorithm improves the success rate by up to 25%. If we relax the target timing constraint by 1%, the overall success rate increases by 3% to 5% and our algorithm improves the success rate by up to 25%. The number of failure cases is far larger than for the uniform fault model because concentrated faulty and mapped CLBs are difficult to reconfigure, as shown in Figure 36. The number of failure cases decreases as the spare density grows; however, the results of the two algorithms differ little at high spare density because so many non-faulty spare CLBs are reserved.

Figure 35. The success rate for the clustered fault model.

Figure 36. The number of failure cases for the clustered fault model, panels (a) and (b).

5.2.4 Runtime

The average runtime is shown in Figure 37. Among the three configuration-level repair methods, the runtime of the reconfiguration methods (i.e., ours and Cong’s) is roughly half that of the resynthesis method. The improvement is dominated by the placement stage, since constructing the DAGs and then finding the shortest paths in the reconfiguration methods is more efficient than the SA-based method.

Figure 37. The runtime for the uniform fault model, panels (a) and (b).

In Figure 37, the runtime is separated into placement runtime and routing runtime. The placement runtime of our method is slightly higher than Cong’s because our placer considers more factors when calculating costs in the reconfiguration iterations. With a more global point of view, our placer affects fewer nets than Cong’s, which implies that fewer routing iterations are needed. Therefore, our router runs faster than Cong’s. Figure 38 shows the average runtime for each architecture; the runtime decreases as the spare density grows because the faulty blocks are closer to spare CLBs.

Figure 38. The runtime for the clustered fault model, panels (a) and (b).

Chapter 6 Conclusion

As process technology scaling continues, manufacturing large fault-free integrated circuits becomes increasingly difficult. The architectural regularity of FPGAs provides inherently redundant resources that can be exploited for fault tolerance and yield enhancement. In this thesis, we propose a fault-tolerant reconfiguration algorithm for CLBs. A faulty block is relocated to adjacent CLBs along a reconfiguration path from a faulty and mapped CLB to a non-faulty spare CLB. After all faulty CLBs are successfully reconfigured, we rip up the affected nets and then re-route them. We also propose a generic fault-tolerant architecture for 3D FPGAs that distributes spare CLBs evenly across the 3D FPGA, providing a reconfiguration-friendly architecture that improves the success rate. The experimental results show that more faults can be repaired when the fault patterns are generated using the uniform fault model than when using the clustered fault model. Moreover, our algorithm improves the success rate by up to 13% for the uniform fault model and by up to 25% for the clustered fault model compared to the previous work. The runtime overhead of our method is only slightly higher than that of the prior art.

Reference

[1] International Technology Roadmap for Semiconductor. Semiconductor Industry Association 2005–2010.

[2] A. W. Topol, D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A.
