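Thesis Organization - Fault-Tolerant Architecture Exploration and Fast Reconfiguration Algorithm for Three-Dimensional Field-Programmable Gate Arrays (應用於三維可程式邏輯閘陣列之容錯架構探索暨快速重組態演算法)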

Chapter 1 Introduction

1.6 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 presents the timing model, the fault model, the definitions, and the problem formulation. Chapter 3 proposes our reconfiguration algorithm. Chapter 4 proposes two fault-tolerant architectures. Experimental results are presented in Chapter 5, and Chapter 6 concludes with our contributions.

Chapter 2 Preliminaries

In this chapter we first introduce the tool used in this thesis and then describe the two models used for timing analysis and for the fault location assumption. Finally, we give the definitions and the problem formulation.

2.1 3D P&R Tool

Three-dimensional place and route (TPR) [14][26] is the first complete academic CAD flow for 3D FPGAs, covering every step from layer assignment to routing. Its main flow is shown in Figure 13.

Figure 13. A 3D FPGA CAD flow.

The flow starts with a technology-mapped netlist in .blif format, which describes the circuit. To map the circuit onto the FPGA, T-VPack [14] converts the .blif netlist into a .net netlist of FPGA logic blocks. The .net netlist and the architecture description file are then input to the placement algorithm. First, the placement algorithm partitions the circuit into n balanced partitions, where n equals the number of layers in the 3D FPGA design. Second, every layer is placed by an SA-based placement algorithm: CLBs are randomly selected and swapped or moved until the maximum number of iterations is reached. Finally, global and detailed routing are performed using the adapted 3D version of the TPR routing algorithm.

2.2 Timing Model

In a “tile-based” FPGA, the structure is homogeneous, i.e., every location (x, y, z) in the FPGA is built from an identical tile. Exploiting this regularity, a delay lookup matrix indexed by (Δx, Δy, Δz) is constructed. For each entry, TPR's timing-driven router routes a connection between two blocks separated by (Δx, Δy, Δz), and the resulting delay is recorded at that index. The matrix therefore acts as a function that returns the estimated delay between two blocks given their relative offset (Δx, Δy, Δz).
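The lookup idea can be sketched in a few lines of Python. This is only an illustration: `build_delay_lookup`, `estimated_delay`, and especially `toy_route_delay` are hypothetical names, and the toy delay model stands in for the routing that TPR's timing-driven router would actually perform.

```python
from typing import Callable, Dict, Tuple

Offset = Tuple[int, int, int]


def build_delay_lookup(nx: int, ny: int, nz: int,
                       route_delay: Callable[[int, int, int], float]) -> Dict[Offset, float]:
    """Record one delay per offset (dx, dy, dz); route_delay stands in for the router."""
    return {(dx, dy, dz): route_delay(dx, dy, dz)
            for dx in range(nx) for dy in range(ny) for dz in range(nz)}


def estimated_delay(table: Dict[Offset, float], a: Offset, b: Offset) -> float:
    """Look up the delay between blocks at locations a and b by their offset."""
    offset = tuple(abs(p - q) for p, q in zip(a, b))
    return table[offset]


def toy_route_delay(dx: int, dy: int, dz: int) -> float:
    # Hypothetical delay model (an assumption, not TPR's router):
    # 1.0 per planar hop and 2.5 per layer crossing.
    return 1.0 * (dx + dy) + 2.5 * dz


table = build_delay_lookup(4, 4, 2, toy_route_delay)
print(estimated_delay(table, (0, 0, 0), (3, 1, 1)))  # 6.5
```

Because the table depends only on the offset, it is built once per architecture and can be reused for any pair of blocks.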

The circuit is represented as a directed acyclic graph, the timing graph. Nodes represent the input and output pins of circuit elements such as registers and LUTs. Connections between nodes are modeled as edges annotated with the delay required to pass through the circuit element or the routing. To determine the delay of the circuit, a breadth-first traversal is applied to the timing graph, and each node with incoming edges is labeled with its arrival time as shown in Equation (1):

$T_{arrival}(i) = \max_{j \in fanin(i)} \{ T_{arrival}(j) + delay(j, i) \}$ (1)

Node i is the node currently being computed, and delay(j, i) is the delay value marked on the edge from node j to node i. To compute the slack, we perform a second breadth-first traversal of the timing graph for the required time T_required. T_required at all sinks is set to the maximum arrival time and is then propagated backwards from the sinks using Equation (2):

$T_{required}(i) = \min_{j \in fanout(i)} \{ T_{required}(j) - delay(i, j) \}$ (2)

Finally, the slack of a connection (j, i) is given by Equation (3):

$Slack(j, i) = T_{required}(i) - T_{arrival}(j) - delay(j, i)$ (3)
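The two traversals can be illustrated with a short Python sketch. It assumes the usual recurrences for this kind of timing graph: arrival time is the maximum over fan-in edges, required time is the minimum over fan-out edges, and slack is required time minus arrival time minus edge delay. The function name `analyze_timing` and the edge-dictionary input format are hypothetical.

```python
from collections import defaultdict, deque


def analyze_timing(edges):
    """edges maps (j, i) -> delay. Returns (arrival, required, slack) dictionaries."""
    fanin, fanout, nodes = defaultdict(list), defaultdict(list), set()
    for (j, i) in edges:
        fanout[j].append(i)
        fanin[i].append(j)
        nodes.update((j, i))

    # Level-by-level (breadth-first) topological order of the DAG.
    pending = {n: len(fanin[n]) for n in nodes}
    queue = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in fanout[n]:
            pending[m] -= 1
            if pending[m] == 0:
                queue.append(m)

    # Forward pass: arrival times (sources start at 0).
    arrival = {}
    for i in order:
        arrival[i] = max((arrival[j] + edges[(j, i)] for j in fanin[i]), default=0.0)

    # Backward pass: required times (sinks get the maximum arrival time).
    t_max = max(arrival.values())
    required = {}
    for i in reversed(order):
        required[i] = min((required[k] - edges[(i, k)] for k in fanout[i]), default=t_max)

    slack = {(j, i): required[i] - arrival[j] - d for (j, i), d in edges.items()}
    return arrival, required, slack


example = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0, ("b", "d"): 1.0}
arrival, required, slack = analyze_timing(example)
print(arrival["d"], slack[("a", "c")], slack[("b", "d")])  # 5.0 0.0 4.0
```

Connections with zero slack, here a to c and c to d, form the critical path. Connections with larger slack can tolerate extra delay when blocks are moved.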

2.3 Fault Model

In this thesis we use only a CLB-level fault model, which assumes that any fault inside a CLB disables the whole CLB; faults in other parts of the FPGA are not considered. In our experiments, we use the two fault models described below.

i) Uniform fault model – Faults are uniformly distributed across the FPGA. In other words, the probability of a CLB being faulty is independent of the state of its neighboring CLBs. The model is implemented by randomly selecting a CLB at coordinate (x, y, z) to be faulty.

ii) Clustered fault model – Faults are distributed in clusters. In other words, if a CLB is faulty, its neighboring CLBs have a higher probability of being faulty, as shown in Figure 14. This model is implemented by randomly selecting a CLB at coordinate (x, y, z) to be the center C of a fault cluster of radius r. On layer z, the CLBs within distance r of the center are faulty with an exponentially decreasing probability, as shown in Equation (4):

$P(X, \mu) = \mu e^{-\mu X}$ (4)

where μ is the failure rate and X is a positive value ranging between 1 and r.

Figure 14. (a) A clustered fault of r = 2. (b) The exponential probability function.
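Both fault models can be sketched as small fault injectors. The sketch below rests on several assumptions that the text does not fix: the probability function is taken to have the exponential form μe^(−μX), the in-layer distance is Manhattan, the cluster center itself is always faulty, and the uniform model draws a fixed number of faults. All function names are hypothetical.

```python
import math
import random


def uniform_faults(dims, n_faults, rng):
    """Uniform model: pick n_faults CLB coordinates independently of their neighbors."""
    nx, ny, nz = dims
    all_clbs = [(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)]
    return set(rng.sample(all_clbs, n_faults))


def cluster_probability(distance, mu):
    """Assumed form of the exponentially decreasing fault probability."""
    return mu * math.exp(-mu * distance)


def clustered_faults(dims, center, r, mu, rng):
    """Clustered model: CLBs on the center's layer within distance r may fail."""
    nx, ny, _ = dims
    cx, cy, cz = center
    faults = {center}  # assumption: the cluster center is always faulty
    for x in range(max(0, cx - r), min(nx, cx + r + 1)):
        for y in range(max(0, cy - r), min(ny, cy + r + 1)):
            dist = abs(x - cx) + abs(y - cy)  # assumption: Manhattan distance
            if 1 <= dist <= r and rng.random() < cluster_probability(dist, mu):
                faults.add((x, y, cz))
    return faults


rng = random.Random(0)
print(len(uniform_faults((4, 4, 2), 5, rng)))
print(sorted(clustered_faults((8, 8, 2), (4, 4, 1), 2, 0.9, rng)))
```

A seeded `random.Random` makes each fault pattern reproducible, which suits experiments that average over several fault seeds.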


2.4 Definitions

In this section we define the sets and functions used in the following chapters. B is the set of all CLBs on the FPGA, as shown in Equation (5). B_F is the set of all faulty and mapped CLBs, as shown in Equation (6), and B_S is the set of all non-faulty spare CLBs, as shown in Equation (7). b_FT is the most critical faulty and mapped CLB, and d_M(b_i, b_j) is a function that returns the Manhattan distance between two CLBs, as shown in Figure 15.

(5) (6) (7)

Figure 15. The definitions of CLB array.
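These definitions map directly onto sets of coordinates. The sketch below is only illustrative: it assumes CLBs are identified by (x, y, z) tuples, and the helper names `manhattan` and `classify_clbs` are hypothetical.

```python
def manhattan(a, b):
    """d_M: Manhattan distance between two CLB coordinates (x, y, z)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def classify_clbs(all_clbs, mapped, faulty):
    """Return (B, B_F, B_S) for the given mapped and faulty coordinate sets."""
    B = set(all_clbs)
    B_F = {b for b in B if b in faulty and b in mapped}          # faulty and mapped
    B_S = {b for b in B if b not in faulty and b not in mapped}  # non-faulty spares
    return B, B_F, B_S


grid = [(x, y, 0) for x in range(2) for y in range(2)]
B, B_F, B_S = classify_clbs(grid, mapped={(0, 0, 0), (1, 0, 0)}, faulty={(0, 0, 0), (1, 1, 0)})
print(B_F, B_S, manhattan((0, 0, 0), (1, 1, 1)))  # {(0, 0, 0)} {(0, 1, 0)} 3
```

A faulty CLB that holds no block, such as (1, 1, 0) here, belongs to neither B_F nor B_S: it needs no repair and cannot serve as a spare.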

2.5 Problem Formulation

Given a netlist of CLBs, architecture specification, existing placement and routing result and locations of faulty CLBs, our objective is to partially reconfigure

Chapter 3 Proposed Demand-Aware Reconfiguration Algorithm

As mentioned in Section 1.4, we propose a demand-aware reconfiguration algorithm that takes the demand for spare CLBs into account during fault reconfiguration. Starting from the cost function of the ripple-move reconfiguration algorithm, shown in Equation (8), we modify it into a demand-aware version consisting of a delay cost and a demand cost, as shown in Equation (9).

(8) (9)

To estimate the delay between two blocks quickly, an approximation based on the Manhattan distance can be used. The estimated delay is recorded in the delay lookup matrix. When we construct a DAG for a faulty CLB, each edge is weighted with the delay incurred by moving a block from its existing location to the adjacent CLB, obtained from the delay lookup matrix, and Cost_delay is calculated by finding the shortest (least delay cost) path on the DAG.

The objective of the demand cost Costdemand is to achieve a solution with more our iterative reconfiguration algorithm and in the last section, the routing algorithm and re-routing algorithm are described.


3.1 Reconfiguration Algorithm

3.1.1 The Concept of Our Algorithm

In the ripple-move reconfiguration method, a DAG is constructed for each faulty block. The weight of an edge e = <u, v> is set to the difference between the required delay of the block when it resides in CLB u and when it resides in the adjacent CLB v; the shortest (least cost) path between the source CLB and the destination CLB is then found. The faulty block is reconfigured by ripple-moving the blocks along this path to their adjacent CLBs. However, the result of this greedy method may be too local, because for each fault it constructs a DAG only from the point of view of that faulty block, and non-faulty spare CLBs are greedily chosen as destinations so that the local delay cost is minimized in each iteration. In fact, the demand of each faulty block for a given non-faulty spare CLB may differ, so both factors, delay and demand, have to be considered when constructing the DAG. Our algorithm is based on this concept.
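The ripple-move step itself can be sketched as a shortest-path search followed by a one-position shift of every block on the path. This is a simplified illustration and not the original implementation: it uses Dijkstra's algorithm on a grid with non-negative, caller-supplied edge weights in place of the DAG construction, it does not exclude other faulty CLBs from the path, and all function names are hypothetical.

```python
import heapq


def grid_neighbors(dims):
    """Return a function yielding the axis-adjacent CLBs of a coordinate."""
    def neighbors(u):
        for axis in range(3):
            for step in (-1, 1):
                v = list(u)
                v[axis] += step
                if 0 <= v[axis] < dims[axis]:
                    yield tuple(v)
    return neighbors


def shortest_path(source, targets, neighbors, weight):
    """Dijkstra from source to the nearest target; returns (cost, path) or None."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u in targets:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v in neighbors(u):
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return None


def ripple_move(placement, path):
    """Shift every block on the path one step toward the spare at path[-1]."""
    for src, dst in zip(reversed(path[:-1]), reversed(path[1:])):
        placement[dst] = placement.pop(src)


placement = {(0, 0, 0): "blk_on_faulty_clb", (1, 0, 0): "blk_mid"}
cost, path = shortest_path((0, 0, 0), {(2, 0, 0)}, grid_neighbors((3, 1, 1)), lambda u, v: 1.0)
ripple_move(placement, path)
print(cost, placement)
```

Moving from the spare end backwards ensures that each destination is empty when a block arrives, so no block is overwritten during the shift.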

3.1.2 Demand-Aware Reconfiguration Algorithm

For ease of explanation, we use an example to detail how the concept of demand is applied. At the beginning of each iteration, we determine a search range with distance d_SER by expanding outward from b_FT, the most critical faulty and mapped CLB, until the range contains at least K spare CLBs (K = 3 in this example).

Note that this range is three-dimensional, as shown in Figure 16; its definition is given in Equation (10).

Figure 16. Find the searching distance.
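Expanding the range until it contains at least K spare CLBs is equivalent to taking the K-th smallest Manhattan distance from b_FT to a spare. A minimal sketch, assuming (x, y, z) coordinates and a hypothetical function name `search_distance`:

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def search_distance(b_ft, spares, k):
    """Smallest d_SER whose 3D Manhattan ball around b_ft holds at least k spares."""
    distances = sorted(manhattan(b_ft, s) for s in spares)
    if len(distances) < k:
        return None  # fewer than k spares exist anywhere
    return distances[k - 1]


spares = [(2, 3, 0), (2, 2, 1), (0, 2, 0), (5, 4, 0)]
print(search_distance((2, 2, 0), spares, 3))  # 2
```

Because the distance is taken over all three coordinates, a spare directly above or below b_FT on an adjacent layer counts as being at distance 1.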

Next, we find the neighboring faulty and mapped CLBs of each b_S. Considering b_S3 first, we use d_SER to find its neighboring faulty and mapped CLBs b_FN, as shown in Figure 17. We define B_FN as the set of faulty and mapped CLBs in the neighborhood of b_S, as shown in Equation (11).

$B_{FN}(b_S) = \{ b_{FN} \mid d_M(b_{FN}, b_S) \le d_{SER},\ b_{FN} \in B_F \}$ (11)

Figure 17. Neighbor faulty and mapped CLBs of b_S3.

We then use d_SER to find the neighboring non-faulty spare CLBs B_SN of these faulty and mapped CLBs, as shown in Figure 18. B_SN is defined as the set of spare CLBs residing in the neighborhood of b_F at a distance smaller than or equal to d_SER, as shown in Equation (12).

$B_{SN}(b_F) = \{ b_{SN} \mid d_M(b_{SN}, b_F) \le d_{SER},\ b_{SN} \in B_S \}$ (12)
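Both neighborhoods are distance-bounded filters over the sets B_F and B_S. A minimal sketch, assuming (x, y, z) coordinates and hypothetical function names:

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def faulty_neighbors(b_s, B_F, d_ser):
    """B_FN(b_S): faulty and mapped CLBs within d_SER of the spare b_S."""
    return {b for b in B_F if manhattan(b, b_s) <= d_ser}


def spare_neighbors(b_f, B_S, d_ser):
    """B_SN(b_F): non-faulty spare CLBs within d_SER of the faulty CLB b_F."""
    return {s for s in B_S if manhattan(s, b_f) <= d_ser}


B_F = {(1, 1, 0), (4, 4, 0)}
B_S = {(2, 1, 0), (1, 1, 1), (5, 5, 1)}
print(faulty_neighbors((2, 1, 0), B_F, 2))        # {(1, 1, 0)}
print(sorted(spare_neighbors((1, 1, 0), B_S, 2)))  # [(1, 1, 1), (2, 1, 0)]
```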

Figure 18. Neighbor spare CLBs of (a) b_F8, (b) b_F10, and (c) b_F15.

Here we introduce Equation (13), which represents the demand of b_F for b_S. It is inversely proportional to the number of spare CLBs in the neighborhood of b_F and to the Manhattan distance between b_F and b_S. If a spare CLB is very close to b_F and only a few spare CLBs lie in the neighborhood of b_F, then the demand of b_F for that spare CLB is very high. The maximum demand D_md-max for a spare CLB is defined in Equation (14).

(13) (14)
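The equations themselves are not reproduced here, so the following sketch is only one form consistent with the description, not necessarily the author's. It assumes the demand equals 1 / (|B_SN(b_F)| · d_M(b_F, b_S)) and that the maximum demand for a spare is the largest demand among its neighboring faulty CLBs. The function names are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def demand(b_f, b_s, B_S, d_ser):
    """Assumed demand of b_F for b_S: inverse of (spares near b_F) * distance.

    Expects b_S to be one of the spares within d_SER of b_F, so the count is at least 1.
    """
    n_spares = sum(1 for s in B_S if manhattan(s, b_f) <= d_ser)
    return 1.0 / (n_spares * manhattan(b_f, b_s))


def max_demand(b_s, B_F, B_S, d_ser):
    """Assumed maximum demand for b_S over its neighboring faulty CLBs."""
    near = [f for f in B_F if manhattan(f, b_s) <= d_ser]
    return max((demand(f, b_s, B_S, d_ser) for f in near), default=0.0)


B_S = {(2, 1, 0), (1, 1, 1)}
B_F = {(1, 1, 0), (3, 1, 0)}
print(demand((1, 1, 0), (2, 1, 0), B_S, 2))  # 0.5: two spares nearby, distance 1
print(demand((3, 1, 0), (2, 1, 0), B_S, 2))  # 1.0: only one spare nearby
print(max_demand((2, 1, 0), B_F, B_S, 2))    # 1.0
```

Under this assumed form, the faulty CLB at (3, 1, 0) has the stronger claim on the spare at (2, 1, 0), because it has no alternative spare within the search distance.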

Figure 19. The reconfiguration procedure when considering b_S1.

According to the definition, the demands of b_F10 and b_F15 for b_S3 are:

For the reconfiguration, the demand cost of b_FT for b_S is defined in Equation (15). A small demand cost means that the demand of b_FT for b_S is high.

(15)

In the example, the demand cost of b_F8 for b_S3 is:

Similarly, for b_S1, the procedure is shown in Figure 19. Note that only b_F8 is in the neighborhood of b_S1, so the demand cost is 0.

Finally, for b_S13, the procedure is shown in Figure 20.

Figure 20. The reconfiguration procedure when considering b_S13.


In the example, we assume that the delay cost equals the Manhattan distance between two CLBs, so the delay cost of every shortest path in this example equals 2.

The costs of each candidate destinations in the first iteration are shown in Table 2, where is 0.2 and is 0.8.

Table 2. The cost of first iteration in the example.

Finally, we take b_S1 as the destination, and b_F8 is reconfigured by moving blocks to adjacent CLBs along the shortest path {b_F8, b₇, b_S1}. This procedure is performed iteratively for each faulty block until all faults are successfully reconfigured or no path is found, in which case reconfiguration fails. The final result of the ripple-move reconfiguration algorithm is shown in Figure 21-(a); that method randomly chooses one of {b_S1, b_S3, b_S13} as the destination because it does not consider the demand cost.

The result of our algorithm is shown in Figure 21-(b), which demonstrates that our demand-aware reconfiguration algorithm can find a better solution than the ripple-move reconfiguration algorithm.

Figure 21. The result of (a) the ripple-move algorithm and (b) our algorithm.

The algorithm flow is shown in Figure 22. After a DAG is constructed, the reconfiguration path is determined by finding the shortest path between the source and one of the K destinations; the fault is then reconfigured by moving blocks to adjacent CLBs along this path. The reconfiguration iteration is repeated until every block is located on a non-faulty CLB.

Figure 22. Our algorithm flow.

3.2 Re-routing Algorithm

3.2.1 Concept of Routing Algorithm

After placement, the locations of all CLBs have been determined, and a timing-driven router then routes all connections between CLBs. In the routing stage, the FPGA architecture is represented as a routing resource graph, which models wire segments, TSVs, and the input and output pins of logic blocks, as shown in Figure 23.

Figure 23. FPGA routing architecture and routing resource graph.

The routing algorithm in TPR is based on the PathFinder negotiated congestion algorithm [26]. It iteratively rips up and re-routes every net until the result meets the congestion constraint. In the first iteration, all nets are routed for minimum delay without any congestion constraint; that is, routing resources are allowed to be overused.

When overuse exists at the end of a routing iteration, the cost of overusing a routing resource is increased so that the congestion is resolved in a later routing iteration. This process is repeated until no routing resource is used by more than one net.
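The negotiation loop can be illustrated with a toy model. This sketch is a deliberate simplification and not TPR's router: each net picks the cheapest route from a fixed list of candidates rather than searching the routing resource graph, every resource has unit base cost, and the present-sharing and history penalties use assumed factors (`pres_fac`, `hist_fac`). All names are hypothetical.

```python
def negotiate(nets, capacity=1, pres_fac=0.5, hist_fac=1.0, max_iters=20):
    """nets maps a net name to candidate routes (tuples of resource names).

    Returns (chosen route per net, iterations used), or None if congestion remains.
    """
    history = {}  # accumulated penalty of resources that were overused before
    choice = {}
    for it in range(max_iters):
        occupancy = {}
        for net, candidates in nets.items():  # rip up and re-route every net
            def cost(route):
                total = 0.0
                for res in route:
                    over = max(0, occupancy.get(res, 0) + 1 - capacity)
                    present = 1.0 + pres_fac * it * over  # no penalty in iteration 0
                    total += (1.0 + history.get(res, 0.0)) * present
                return total
            best = min(candidates, key=cost)
            choice[net] = best
            for res in best:
                occupancy[res] = occupancy.get(res, 0) + 1
        overused = {res: n - capacity for res, n in occupancy.items() if n > capacity}
        if not overused:
            return choice, it + 1
        for res, over in overused.items():  # make contested resources more expensive
            history[res] = history.get(res, 0.0) + hist_fac * over
    return None


nets = {"n1": [("w1",), ("w2", "w3")], "n2": [("w1",), ("w4", "w5")]}
print(negotiate(nets))  # ({'n1': ('w1',), 'n2': ('w4', 'w5')}, 2)
```

In the first iteration both nets take the short route through `w1`. The overuse raises the cost of `w1`, so in the second iteration one net detours and the conflict is resolved.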

3.2.2 Re-routing Algorithm

During fault-tolerant reconfiguration, the blocks on the shortest path are each moved by one grid position in every iteration, and with 10% faulty CLBs a considerable set of blocks is generally moved. When a block is moved, its connections are also affected, so these connections have to be re-routed, as shown in Figure 24.

Figure 24. The concept of re-route.

We record the blocks that are moved during the placement stage of fault-tolerant reconfiguration. In the routing stage, we rip up all affected nets, keep the existing routing of the remaining nets fixed, and then re-route the affected nets. Figure 25 shows an example in which the endpoints of Net_1 connect to CLB_A, CLB_B, and CLB_C. If the block originally residing in CLB_A is moved to a new CLB, we rip up Net_1, and its new routing starts from the output pin of the new CLB and terminates at the original sink1 and sink2. If the block residing in CLB_B or CLB_C is moved to a new CLB, we rip up Net_1, and its new routing starts from the original source and terminates at the input pin of the new CLB.

Figure 25. Rip-up and re-route the affected net.
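Selecting the nets to rip up amounts to checking whether any endpoint of a net belongs to a moved block. A minimal sketch, assuming each net is stored as a (source block, sink blocks) pair; the data layout and function names are hypothetical.

```python
def affected_nets(nets, moved_blocks):
    """Names of nets whose source or any sink block was moved."""
    moved = set(moved_blocks)
    return {name for name, (source, sinks) in nets.items()
            if source in moved or moved.intersection(sinks)}


def rip_up(routing, affected):
    """Split the existing routing into the part kept fixed and the nets to re-route."""
    kept = {name: route for name, route in routing.items() if name not in affected}
    return kept, sorted(affected)


nets = {"Net_1": ("blk_A", ["blk_B", "blk_C"]), "Net_2": ("blk_D", ["blk_E"])}
routing = {"Net_1": ["w1", "w2"], "Net_2": ["w7"]}
kept, todo = rip_up(routing, affected_nets(nets, {"blk_B"}))
print(kept, todo)  # {'Net_2': ['w7']} ['Net_1']
```

Only `Net_1` is re-routed here. The routing of `Net_2` is left untouched, which is what keeps the repair partial rather than a full re-route.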

Chapter 4 Fault Tolerant Architecture

4.1 Non-Reserved (NR)

The locations of blocks are determined by an SA-based placement algorithm whose objective is to minimize wirelength and circuit delay. As a result, spare CLBs are pushed to the edge of the FPGA; such a distribution of spare CLBs is called the non-reserved (NR) architecture, as shown in Figure 26-(a). This placement is not well suited to fault reconfiguration through replacement with spare CLBs, because most spares are located along the edge, which may cause ripple-move fault reconfiguration to move a large number of CLBs, as shown in Figure 26-(b). Therefore, even with a better reconfiguration algorithm, the results are limited by the restrictions of the architecture.

Figure 26. (a) The timing-driven placement. (b) The drawback of timing-driven placement for fault tolerance.

4.2 Evenly-Distributed (ED)

As mentioned above, the traditional architecture is not well suited to fault tolerance, which motivates us to explore new architectures that take fault tolerance into consideration. We address this problem by evenly distributing spare CLBs across the FPGA and pre-allocating them as spare resources before the SA-based placement algorithm runs. These pre-allocated spare CLBs may not be used during SA-based placement, so the placement result has spares evenly distributed throughout the 3D FPGA design. Such a distribution of spare CLBs is called the evenly-distributed (ED) architecture. When faults occur, spares are very likely to be close to the faulty CLBs, which allows replacement without severe timing degradation.

We propose five optional ED architectures: ED3, ED4, ED5, ED6, and ED7. ED# denotes a spare pattern in which the suffix # specifies the maximum distance between two adjacent spare CLBs in the X, Y, or XY direction, as shown in Figure 27. The estimated percentage of reserved spare CLBs of each ED architecture is shown in Table 3.

Table 3. The estimated percentage of reserved spare CLBs.

It should be noted that the CLB utilization of most FPGAs is only 70–80%, in order to enhance routability. Because we use spare CLBs, the total number of signal nets does not increase, so routing complexity does not increase significantly. However, the price of the fault-tolerant architecture is additional delay, because we alter the original timing-driven placement. A detailed discussion is given in Chapter 5.

Figure 27. Evenly-distributed architecture.

Chapter 5 Experimental Results

5.1 Experimental Environment

The architectural settings used in our experiments are shown in Table 4. The settings of CLBs and channel width are based on Altera Stratix IV [27], Xilinx FPGAs [28], and related work [29]. The 32 wires comprise four types of wire segments with different lengths: L1, L2, L4, and L8, where the length of a wire segment is the number of CLBs it spans. There are 12 wires each of L1 and L2 and 4 wires each of L4 and L8. In the Z direction, each TSV spans only one layer for routability.

Table 4. The architecture setting.

Table 5 shows the 16 test cases in our benchmark set, sorted by number of CLBs: 15 are from MCNC [30] and 1 is from IWLS2005 [31]. For each test case we perform 25 experimental runs with different random seeds (5 fault seeds and 5 placement seeds) and report the average as the result. In addition, the number of layers (nz) is set to 4, the CLB utilization is set to 70%, and the fault rate is set to 10%.

Table 5. The benchmark circuits.

5.2 Results and Analysis

5.2.1 Experimental Flow

In our experiments, three configuration-level repair methods are implemented: i) resynthesis, ii) Cong's reconfiguration algorithm, and iii) our reconfiguration algorithm. Figure 28 shows the experimental flow of resynthesis, in which the faulty CLBs are marked before layer assignment and treated as unavailable for mapping. Figure 29 shows the experimental flow of the two reconfiguration algorithms: taking the initial placement and routing as an existing result, faults are repaired by partially reconfiguring blocks so that they avoid faulty CLBs.

Figure 28. The experimental flow of resynthesis.

Figure 29. The experimental flow of reconfiguration.

5.2.2 Analysis of Timing Penalty

Timing degradation has two causes:

i) Initial architecture – six architectures are used in our experiments (NR, ED3, ED4, ED5, ED6, and ED7), each with a different percentage of reserved spare CLBs, i.e., a different spare density. A higher spare density spreads more blocks toward the edge of the FPGA and thus increases the delay more. Figure 30 shows the delay increase of each architecture compared to NR. ED7 has the smallest impact on timing because it has the lowest spare density, whereas ED3 has the largest timing overhead. For ease of exposition, we refer to the result of the NR architecture as IA-NR.

Figure 30. Timing penalty caused by fault tolerant architecture.

ii) Reconfiguration – the delay increases as the circuit placement is reconfigured. The ED architecture is friendly to fault tolerance: the higher the spare density, the more spare CLBs lie close to faulty blocks, which lowers the timing degradation during reconfiguration. Figure 31-(a) shows the delay increase caused by reconfiguration under the uniform fault model, relative to the IA result of each architecture. The delay overhead gradually decreases as the spare density grows, and the delay increase of our method is always lower than that of Cong's.

Similarly, Figure 31-(b) shows the delay increase under the clustered fault model. The delay increase is significantly higher than under the uniform fault model because a number of faults are localized within a region, which indicates that a clustered fault distribution is more difficult to reconfigure.

Figure 31. Timing penalty caused by reconfiguration.

Figure 32-(a) shows the delay increase caused by reconfiguration under the uniform fault model, with IA-NR as the baseline. Our delay increases are lower than Cong's. The delay increase initially decreases as the spare density grows; however, if the spare density continues to increase, the timing degradation caused by the initial architecture dominates, so the delay increase rises again.

Under the clustered fault model, the delay increase caused by reconfiguration is much higher than the degradation caused by the spare pattern, so the total delay increase decreases as the spare density grows, and our delay increase is again lower than Cong's, as shown in Figure 32-(b).

Figure 32. Combined effect on timing penalty.

5.2.3 Success Rate

A result is defined as successful if all faults are successfully reconfigured and the critical delay is within the timing constraint; otherwise it is called a failure case. The success rate is the percentage of successful results.

Resynthesis has the highest success rate as well as the minimal timing degradation, so we take its result as the baseline for reconfiguration. We therefore set the timing constraint to the delay at which every test case achieves a 96% success rate in the resynthesis flow. Figure 33 shows the success rates under the uniform fault model. Our algorithm improves the success rate by up to 13%. If we relax the timing constraint by 1% (i.e., to 101% of the delay at which the resynthesis flow has a 96% success rate), the overall success rate increases by 5% to 10% and our algorithm achieves up to 9% improvement.
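The success-rate bookkeeping can be sketched as follows. The sketch reads the constraint as the smallest delay that at least 96% of the resynthesis runs of a test case meet, which is one interpretation of the text. The function names and the (repaired, delay) run format are hypothetical, and the numbers are made up for illustration.

```python
def timing_constraint(resynthesis_delays, percent=96):
    """Smallest delay met by at least `percent` % of the resynthesis runs."""
    ordered = sorted(resynthesis_delays)
    needed = -(-percent * len(ordered) // 100)  # integer ceiling, avoids float rounding
    return ordered[needed - 1]


def success_rate(runs, constraint):
    """runs: (all_faults_repaired, critical_delay) pairs; returns a percentage."""
    ok = sum(1 for repaired, delay in runs if repaired and delay <= constraint)
    return 100.0 * ok / len(runs)


resyn = [float(d) for d in range(1, 26)]  # 25 illustrative resynthesis delays
constraint = timing_constraint(resyn)     # 24 of 25 runs meet this delay
runs = [(True, 20.0), (True, 24.2), (False, 10.0), (True, 24.0)]
print(constraint, success_rate(runs, constraint), success_rate(runs, 1.01 * constraint))
# 24.0 50.0 75.0
```

Relaxing the constraint to 101% admits the run at 24.2 in this made-up data, which shows how a small relaxation can raise the measured success rate.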
