Chapter 3 Proposed Demand-Aware Reconfiguration Algorithm
3.2 Re-routing Algorithm
3.2.1 Concept of Routing Algorithm
After placement, the locations of all CLBs have been determined, and a timing-driven router then routes all connections between CLBs. In the routing stage, the FPGA architecture is represented as a routing resource graph, which models wire segments, TSVs, and the input and output pins of logic blocks, as shown in Figure 23.
Figure 23. FPGA routing architecture and routing resource graph.
The routing algorithm in TPR is based on the Pathfinder negotiated congestion algorithm [26]. It iteratively rips up and re-routes every net until the result meets the congestion constraint. In the first iteration, all nets are routed to minimize delay without any congestion constraint; that is, routing resources are allowed to be overused.
When overuse remains at the end of a routing iteration, the cost of using each overused routing resource is increased, so that congestion is resolved in subsequent routing iterations. This process is repeated until every routing resource is used by at most one net.
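The negotiation loop described above can be illustrated with a minimal sketch. It is not the TPR router: it assumes two-terminal nets, a unit base cost per node standing in for delay, a capacity of one net per node, and a simple doubling schedule for the present-congestion factor. The names `shortest_path`, `negotiated_routing`, `GRAPH` and `NETS` are hypothetical and introduced only for illustration.

```python
import heapq

def shortest_path(graph, cost, src, dst):
    """Dijkstra search in which entering node v costs cost(v)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def negotiated_routing(graph, nets, capacity=1, max_iters=50):
    """Rip up and re-route every net until no node is overused."""
    usage = {n: 0 for n in graph}
    history = {n: 0.0 for n in graph}
    routes = {}
    pres_fac = 0.0  # first iteration: overuse is not penalized
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):  # rip up the old route
                usage[n] -= 1

            def cost(n):
                over = max(0, usage[n] + 1 - capacity)
                return (1.0 + history[n]) * (1.0 + pres_fac * over)

            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1
        overused = [n for n in graph if usage[n] > capacity]
        if not overused:
            return routes
        for n in overused:  # overused nodes become more expensive
            history[n] += usage[n] - capacity
        pres_fac = 1.0 if pres_fac == 0.0 else 2.0 * pres_fac
    raise RuntimeError("no legal routing found")

def undirected(edges):
    g = {}
    for u, v in edges:
        g.setdefault(u, []).append(v)
        g.setdefault(v, []).append(u)
    return g

# Hypothetical graph: both nets prefer the shared node "m",
# and each net also has a longer private detour.
GRAPH = undirected([("s1", "m"), ("m", "t1"), ("s2", "m"), ("m", "t2"),
                    ("s1", "a"), ("a", "b"), ("b", "t1"),
                    ("s2", "c"), ("c", "d"), ("d", "t2")])
NETS = {"net1": ("s1", "t1"), "net2": ("s2", "t2")}
ROUTES = negotiated_routing(GRAPH, NETS)
```

In this toy graph both nets take the shared node in the first iteration, so it is overused; its increased cost then pushes the nets onto disjoint routes in the next iteration.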
3.2.2 Re-routing Algorithm
During fault-tolerant reconfiguration, the blocks on a shortest path are each moved by one grid position per iteration, and, because about 10% of the CLBs are faulty, a set of blocks is generally moved. When a block is moved, its connections are also affected, so these connections must be re-routed, as shown in Figure 24.
Figure 24. The concept of re-route.
We record the blocks that are moved during the placement stage of fault-tolerant reconfiguration. In the routing stage, we rip up all the affected nets, keep the existing routing of the unaffected nets fixed, and then re-route the affected nets. Figure 25 shows an example in which the endpoints of Net_1 connect to CLB_A, CLB_B and CLB_C. If the block originally residing in CLB_A is moved to a new CLB, we rip up Net_1, and the new routing of Net_1 starts from the output pin of the new CLB and terminates at the original sink1 and sink2. If the block residing in CLB_B or CLB_C is moved to a new CLB, we rip up Net_1, and the new routing of Net_1 starts from the original source and terminates at the input pin of the new CLB as well as at the unmoved sink.
Figure 25. Rip-up and re-route the affected net.
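The selection of nets to rip up can be sketched as follows. This is only an illustration under assumed data structures: a net is a dictionary with a `source` block and a list of `sinks`, and a placement maps block names to CLB names. The block names, `Net_2`, `CLB_D` and both function names are hypothetical.

```python
def nets_to_reroute(nets, moved_blocks):
    """Names of nets with at least one terminal on a moved block."""
    return sorted(
        name for name, net in nets.items()
        if net["source"] in moved_blocks
        or any(sink in moved_blocks for sink in net["sinks"])
    )

def terminals(net, placement):
    """Source CLB and sink CLBs of a net under the given placement."""
    return placement[net["source"]], [placement[s] for s in net["sinks"]]

# Hypothetical netlist: Net_1 follows the example of Figure 25.
NETS = {
    "Net_1": {"source": "blk_a", "sinks": ["blk_b", "blk_c"]},
    "Net_2": {"source": "blk_b", "sinks": ["blk_c"]},
}
PLACEMENT = {"blk_a": "CLB_A", "blk_b": "CLB_B", "blk_c": "CLB_C"}

# The block in CLB_A is moved to a new CLB (called CLB_D here).
MOVED = {"blk_a"}
NEW_PLACEMENT = dict(PLACEMENT, blk_a="CLB_D")

AFFECTED = nets_to_reroute(NETS, MOVED)            # only Net_1 is ripped up
NEW_ENDS = terminals(NETS["Net_1"], NEW_PLACEMENT)  # new source, original sinks
```

Net_2 does not touch the moved block, so its existing routing is kept and only Net_1 is handed back to the router with its new source location.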
Chapter 4 Fault Tolerant Architecture
4.1 Non-Reserved (NR)
Locations of blocks are determined using an SA-based placement algorithm with the objective of minimizing wirelength and circuit delay. As a result, spare CLBs are pushed to the edge of the FPGA; such a distribution of spare CLBs is called the non-reserved (NR) architecture, as shown in Figure 26-(a). This placement is not well suited to fault reconfiguration through replacement with spare CLBs, because most spares are located along the edge, which may force ripple-move fault reconfiguration to move a large number of CLBs, as shown in Figure 26-(b). Therefore, even with a better reconfiguration algorithm, the results are limited by the restrictions of the architecture.
Figure 26. (a) The timing-driven placement. (b) The drawback of timing-driven placement for fault tolerance.
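The effect of spare location on the number of moved blocks can be illustrated with a simplified ripple-move sketch. It assumes a single 2D layer, unit-cost moves and a breadth-first search, which is a stand-in for the cost-based shortest-path search of the reconfiguration algorithm rather than a reproduction of it. The function names and the six-CLB row are hypothetical.

```python
from collections import deque

def ripple_path(width, height, start, spares, faulty):
    """Shortest grid path from the faulty CLB `start` to the nearest
    reachable spare, never stepping onto another faulty CLB."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur in spares:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in prev and nxt not in faulty:
                prev[nxt] = cur
                queue.append(nxt)
    return None  # no spare is reachable

def ripple_move(placement, path):
    """Shift every block on the path one step toward the spare."""
    moved = dict(placement)
    for i in range(len(path) - 1, 0, -1):
        moved[path[i]] = moved.pop(path[i - 1])
    return moved

# Hypothetical row of six CLBs with a fault at the left end.
FAULTY = {(0, 0)}
EDGE_PATH = ripple_path(6, 1, (0, 0), {(5, 0)}, FAULTY)  # spare at the edge
NEAR_PATH = ripple_path(6, 1, (0, 0), {(2, 0)}, FAULTY)  # spare nearby
EDGE_MOVES = len(EDGE_PATH) - 1
NEAR_MOVES = len(NEAR_PATH) - 1
```

With the spare at the far edge, every block in the row has to shift, whereas a nearby spare limits the ripple to the blocks between the fault and that spare.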
4.2 Evenly-Distributed (ED)
As mentioned above, the traditional architecture is not well suited to fault tolerance, which motivates us to explore new architectures that take fault tolerance into consideration. We address this problem by distributing spare CLBs evenly across the FPGA and pre-allocating them before the SA-based placement algorithm runs. These pre-allocated spare CLBs may not be used during SA-based placement, so the placement result has spares evenly distributed in the 3D FPGA design. Such a distribution of spare CLBs is called the evenly-distributed (ED) architecture. When faults occur, spares are very likely to be close to the faulty CLBs, which allows replacement without severe timing degradation.
We propose five optional ED architectures: ED3, ED4, ED5, ED6 and ED7. ED# denotes a spare pattern in which the postfix # specifies the maximum distance between two adjacent spare CLBs in the X, Y, or XY direction, as shown in Figure 27. The estimated percentage of reserved spare CLBs for each ED architecture is shown in Table 3.
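As a rough illustration of how the postfix controls spare density, the sketch below reserves spares on a plain square lattice with a given pitch on one layer. This lattice is an assumption made for illustration only; it does not reproduce the exact pattern of Figure 27 or the percentages of Table 3, and the function names and the 21 x 21 grid are hypothetical.

```python
def ed_spares(width, height, pitch):
    """Reserved spare locations on a square lattice with the given pitch."""
    return {(x, y)
            for x in range(0, width, pitch)
            for y in range(0, height, pitch)}

def spare_fraction(width, height, pitch):
    """Fraction of CLBs on the layer that are reserved as spares."""
    return len(ed_spares(width, height, pitch)) / (width * height)

# Hypothetical 21 x 21 layer: a smaller pitch reserves more spares.
DENSITY = {f"ED{p}": spare_fraction(21, 21, p) for p in range(3, 8)}
```

Under this assumption the reserved fraction falls roughly with the square of the pitch, which agrees with the ordering used later in the text: ED3 has the highest spare density and ED7 the lowest.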
Table 3. The estimated percentage of reserved spare CLBs.
It should be noted that the CLB utilization of most FPGAs is only 70–80%, in order to enhance routability. Because we only use spare CLBs, the total number of signal nets does not increase, and thus routing complexity does not increase significantly. However, the price paid for the fault-tolerant architecture is additional delay, because we change the original timing-driven placement; a detailed discussion is given in Chapter 5.
Figure 27. Evenly-distributed architecture.
Chapter 5 Experimental Results
5.1 Experimental Environment
The architectural settings in our experiments are shown in Table 4. The settings of CLBs and channel width are based on Altera Stratix IV [27], Xilinx FPGAs [28] and related work [29]. The 32 wires comprise wire segments of four different lengths: L1, L2, L4 and L8. The length of a wire segment is the number of CLBs it spans. There are 12 wires each of L1 and L2 and 4 wires each of L4 and L8. In the Z direction, each TSV spans only one layer, for routability.
Table 4. The architecture setting.
Table 5 shows the 16 test cases in our benchmark set, sorted by number of CLBs: 15 are from MCNC [30] and 1 is from IWLS2005 [31]. For each test case, we perform 25 experimental runs with different random seeds (5 fault seeds and 5 placement seeds) and report the average as the result. In addition, the number of layers (nz) is set to 4, the CLB utilization is set to 70%, and the fault rate is set to 10%.
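The 25 runs per test case correspond to the cross product of the two seed sets. A minimal sketch of this averaging is given below; the seed values, the callback `run_once` and the dummy measurement are hypothetical placeholders, not the actual experimental harness.

```python
from itertools import product
from statistics import mean

FAULT_SEEDS = range(5)      # hypothetical seed values
PLACEMENT_SEEDS = range(5)  # hypothetical seed values

def average_over_seeds(run_once):
    """Run every (fault seed, placement seed) pair and average the results."""
    results = [run_once(f, p)
               for f, p in product(FAULT_SEEDS, PLACEMENT_SEEDS)]
    return len(results), mean(results)

# Dummy stand-in for one place-and-route experiment returning a delay.
def dummy_run(fault_seed, placement_seed):
    return 100 + fault_seed + 10 * placement_seed

NUM_RUNS, AVERAGE = average_over_seeds(dummy_run)
```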
Table 5. The benchmark circuits.
5.2 Results and Analysis
5.2.1 Experimental Flow
In our experiments, three configuration-level repair methods are implemented: i) resynthesis, ii) Cong's reconfiguration algorithm, and iii) our reconfiguration algorithm. Figure 28 shows the experimental flow of resynthesis, in which the faulty CLBs are marked before layer assignment and treated as unavailable for mapping. Figure 29 shows the experimental flow of the two reconfiguration algorithms. Taking the initial placement and routing as an existing result, faults are repaired by partially reconfiguring blocks so that they avoid the faulty CLBs.
Figure 28. The experimental flow of resynthesis.
Figure 29. The experimental flow of reconfiguration.
5.2.2 Analysis of Timing Penalty
Timing degradation has two causes:
i) Initial architecture: six architectures are used in our experiments, namely NR, ED3, ED4, ED5, ED6 and ED7, each with a different percentage of reserved spare CLBs, i.e., a different spare density. A higher spare density spreads more blocks toward the edge of the FPGA and thus increases the delay more. Figure 30 shows the delay increase of each architecture compared to NR. ED7 has the smallest impact on timing because it has the lowest spare density, whereas ED3 has the largest timing overhead. For ease of exposition, we refer to the result of the NR architecture as IA-NR.
Figure 30. Timing penalty caused by fault tolerant architecture.
ii) Reconfiguration: the delay increases as the circuit placement is reconfigured. The ED architecture is fault-tolerance friendly: the higher the spare density, the more spare CLBs lie close to faulty blocks, and the lower the timing degradation during reconfiguration. Figure 31-(a) shows the delay increase caused by reconfiguration under the uniform fault model, relative to each architecture's own IA result. The delay overhead gradually decreases as the spare density grows, and the delay increase of our method is always lower than that of Cong's.
Similarly, Figure 31-(b) illustrates the delay increase under the clustered fault model. The delay increase is significantly higher than under the uniform fault model because a number of faults are localized within a region. This indicates that a clustered fault distribution is more difficult to reconfigure.
Figure 31. Timing penalty caused by reconfiguration: (a) uniform fault model; (b) clustered fault model.
Figure 32-(a) shows the delay increase caused by reconfiguration under the uniform fault model with IA-NR as the baseline. Our delay increases are lower than Cong's. The delay increase initially decreases as the spare density grows; however, if the spare density continues to increase, the timing degradation caused by the initial architecture dominates, so the delay increase rises again.
Under the clustered fault model, the delay increase caused by reconfiguration is much higher than the degradation caused by the spare pattern, so the total delay increase decreases as the spare density grows, and our delay increase is lower than Cong's, as shown in Figure 32-(b).
Figure 32. Combined effect on timing penalty: (a) uniform fault model; (b) clustered fault model.
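The relation between the two baselines can be made explicit with a small worked example. It assumes that every delay increase is a relative increase (new delay divided by baseline delay, minus one), in which case the architecture penalty relative to IA-NR and the reconfiguration penalty relative to the architecture's own IA compose multiplicatively. The function name and the numbers are hypothetical and are not measured results.

```python
def combined_increase(arch_increase, reconf_increase):
    """Delay increase relative to IA-NR, given the architecture penalty
    (vs. IA-NR) and the reconfiguration penalty (vs. the own IA)."""
    return (1.0 + arch_increase) * (1.0 + reconf_increase) - 1.0

# Hypothetical values: 5% architecture penalty, 10% reconfiguration penalty.
EXAMPLE = combined_increase(0.05, 0.10)
```

For these hypothetical values the combined increase is about 15.5%. The same formula shows why the combined penalty can rise again at high spare density: the first factor grows while the second one shrinks.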
5.2.3 Success Rate
A successful result is defined as one in which all faults are successfully reconfigured and the critical delay is within the timing constraint; otherwise, the case is called a failure case. The success rate is then the percentage of successful results.
Resynthesis has the highest success rate as well as the minimal timing degradation, so we take its result as the baseline for reconfiguration. Accordingly, for every case we set the timing constraint to the delay at which the resynthesis flow achieves a 96% success rate. Figure 33 shows the success rates under the uniform fault model. Our algorithm improves the success rate by up to 13%. If we relax the timing constraint by 1% (i.e., to 101% of the delay at which the resynthesis flow achieves a 96% success rate), the overall success rate increases by 5–10% and our algorithm achieves up to 9% improvement.
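One possible reading of this setup is sketched below. It assumes that the timing constraint of a test case is the smallest delay bound met by 96% of its resynthesis runs (24 of 25 runs) and that a run is a pair of a repair flag and a critical delay. The function names and all delay values are hypothetical, not measured data.

```python
def timing_constraint(resynth_delays, target_percent=96):
    """Smallest delay bound met by target_percent of the resynthesis runs."""
    ordered = sorted(resynth_delays)
    # Integer ceiling of target_percent * n / 100 runs must meet the bound.
    needed = (target_percent * len(ordered) + 99) // 100
    return ordered[needed - 1]

def success_rate(runs, constraint, relax=1.0):
    """Fraction of runs that repair all faults and meet the constraint."""
    ok = sum(1 for repaired, delay in runs
             if repaired and delay <= constraint * relax)
    return ok / len(runs)

# Hypothetical delays (in ps) of 25 resynthesis runs of one test case.
RESYNTH = [1000 + 10 * i for i in range(25)]
CONSTRAINT = timing_constraint(RESYNTH)

# Hypothetical reconfiguration runs: (all faults repaired, critical delay).
RUNS = [(True, 1200), (True, 1235), (False, 1100), (True, 1250)]
STRICT = success_rate(RUNS, CONSTRAINT)
RELAXED = success_rate(RUNS, CONSTRAINT, relax=1.01)
```

In this toy data the run with a delay of 1235 ps misses the strict constraint of 1230 ps but passes once the constraint is relaxed by 1%, which is the kind of borderline case the relaxation is meant to capture.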
Figure 33. The success rate for uniform fault model.
Figure 34. The number of failure cases for uniform fault model, in panels (a) and (b).
Figure 34 shows the number of failure cases under the uniform fault model, separated into i) reconfiguration failures, in which not all faults can find a corresponding reconfiguration path, and ii) timing failures, in which all faults can be reconfigured but the resulting timing cannot meet the timing requirement. In Figure 34-(a), as expected, the number of reconfiguration failures decreases as the spare density grows. However, an initial architecture with high spare density dominates the timing degradation, which makes the number of timing failures larger than for the architectures with lower spare densities. Therefore, the total number of failure cases increases. In Figure 34-(b), also as expected, the number of reconfiguration failures decreases as the spare density grows. However, the number of timing failures is unstable because this algorithm makes only the locally optimal choice at each iteration. Taking NR and ED7 as an example, low timing degradation can be obtained in the initial iterations for ED7 because the faulty blocks are close to spare CLBs; however, more results violate the timing constraint in the last iterations of ED7 than of NR.
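The two failure categories can be expressed as a small classification routine. The sketch assumes that each run is summarized by a flag stating whether all faults were repaired and by its critical delay; the function name, the constraint value and the runs are hypothetical.

```python
from collections import Counter

def classify(all_faults_repaired, critical_delay, timing_constraint):
    """Label one run as a success or as one of the two failure types."""
    if not all_faults_repaired:
        return "reconfiguration failure"
    if critical_delay > timing_constraint:
        return "timing failure"
    return "success"

# Hypothetical runs: (all faults repaired, critical delay in ps).
RUNS = [(True, 1200), (True, 1300), (False, 1100), (False, 1400)]
COUNTS = Counter(classify(ok, delay, 1230) for ok, delay in RUNS)
```

A run that cannot repair every fault is counted as a reconfiguration failure regardless of its delay, so the two categories never overlap.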
Figure 35 shows the success rates under the clustered fault model. Our algorithm improves the success rate by up to 25%. If we relax the target timing constraint by 1%, the overall success rate increases by 3–5% and our algorithm still improves the success rate by up to 25%. The number of failure cases is far higher than under the uniform fault model because concentrated faulty and mapped CLBs are difficult to reconfigure, as shown in Figure 36. The number of failure cases decreases as the spare density grows; however, the results of the two algorithms do not differ much at high spare density because so many non-faulty spare CLBs are reserved.
Figure 35. The success rate for clustered fault model.
Figure 36. The number of failure cases for clustered fault model, in panels (a) and (b).
5.2.4 Runtime
The average runtime is shown in Figure 37. Among the three configuration-level repair methods, the runtime of the reconfiguration methods (i.e., ours and Cong's) is roughly half that of the resynthesis method. The improvement is dominated by the placement stage, since constructing the DAGs and then finding the shortest paths in the reconfiguration methods is more efficient than the SA-based method.
Figure 37. The runtime for uniform fault model, in panels (a) and (b).
In Figure 37, the runtime is separated into placement runtime and routing runtime. The placement runtime of our method is slightly higher than that of Cong's because our placer considers more factors when calculating costs in the reconfiguration iterations. With its more global view, our placer affects fewer nets than Cong's, which implies that fewer routing iterations are needed. Therefore, our router runs faster than Cong's. Figure 38 shows the average runtime for each architecture; the runtime decreases as the spare density grows because the faulty blocks are closer to spare CLBs.
Figure 38. The runtime for clustered fault model, in panels (a) and (b).
Chapter 6 Conclusion
As process technology scaling continues, manufacturing large fault-free integrated circuits becomes increasingly difficult. The architectural regularity of FPGAs provides inherent redundant resources that can be exploited for fault tolerance and yield enhancement. In this thesis, we propose a fault-tolerant reconfiguration algorithm for CLBs. The block in a faulty CLB is relocated to an adjacent CLB along a reconfiguration path that runs from the faulty, mapped CLB to a non-faulty spare CLB.
After all faulty CLBs are successfully reconfigured, we rip up the affected nets and then re-route them. We also propose a generic fault-tolerant architecture for 3D FPGAs that distributes spare CLBs evenly across the 3D FPGA, providing a reconfiguration-friendly architecture that improves the success rate. The experimental results show that more faults can be repaired when the fault patterns are generated using the uniform fault model than with the clustered fault model. In addition, our algorithm improves the success rate by up to 13% for the uniform fault model and by up to 25% for the clustered fault model compared to the previous work. The runtime overhead of our method is only slightly higher than that of the prior art.
References
[1] International Technology Roadmap for Semiconductor. Semiconductor Industry Association 2005–2010.
[2] A. W. Topol, D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong, “Three-dimensional integrated circuits,” IBM J. of Res. and Develop., vol. 50, no. 4/5, pp. 491–506, Jul.–Sep. 2006.
[3] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: a novel chip design for improving deep submicron interconnect performance and systems-on-chip integration,” Proc. IEEE, vol. 89, no. 5, pp. 602–633, May 2001.
[4] R. R. Tummala and V. K. Madisetti, “System on chip or system on package?” Design & Test Computers, vol. 16, no. 2, pp. 48–56, Apr.–Jun. 1999.
[5] P. H. Shiu, R. Ravichandran, S. Easwar, and S. K. Lim, “Multi-layer floorplanning for reliable system-on-package,” Int’l Symp. Circuits and System, pp. 23–26, 2004.
[6] K. L. Tai, “System-In-Package (SIP): challenges and opportunities,” Asia South Pacific Design Automation Conf., pp. 191–196, 2000.
[7] SOCcentral. [Online]. Available: http://www.soccentral.com
[8] S. Das, A. P. Chandrakasan, and R. Reif, “Calibration of rent's rule models for three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol. 12, no. 4, pp. 359–366, Apr. 2004.
[9] A. Rahman and R. Reif, “System-level performance evaluation of three-dimensional integrated circuits,” IEEE Trans. Very Large Scale Integration Systems, vol.8, no.6, pp. 671–678, Dec. 2000.
[10] S. Das, A. Fan, K. Chen, C. S. Tan, N. Checka, and R. Reif, “Technology, performance, and computer-aided design of three-dimensional integrated circuits,” Proc. Int’l Symp. Physical Design, pp. 108–115, 2004.
[11] I. Kaya, S. Salewski, M. Olbrich, and E. Barke, “Wirelength reduction using 3D physical design,” Int’l Workshop Integrated Circuit System Design, pp. 453–462, 2004.
[12] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon, “Demystifying 3D ICs: the pros and cons of going vertical,” IEEE Design & Test of Computers, vol. 22, no. 6, pp. 498–510, Nov.–Dec. 2005.
[13] I. Loi, S. Mitra, T. H. Lee, S. Fujita, and L. Benini, “A low-overhead fault tolerance scheme for TSV-based 3D network on chip links,” Proc. Int’l Conf. Computer-Aided Design, pp. 598–602, 2008.
[14] C. Ababei, H. Mogal, and K. Bazargan, “Three-dimensional place and route for FPGAs,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 6, pp. 1132–1140, Jun. 2006.
[15] E. Stott, P. Sedcole and P. Cheung, “Fault Tolerance and Reliability in Field Programmable Gate Arrays,” Computers & Digital Techniques, vol. 4, no. 3, pp. 196–210, 2010.
[16] M. Mishra, S. Goldstein, “Defect tolerance at the end of the roadmap,” Proc. Int’l Test Conf., vol. 1, pp. 1201–1210, Sep. 2003.
[17] P. Maidee, “Methodologies and Tools for Yield Improvement of Field-programmable Logic Architectures,” PhD thesis, 2009.
[18] J. Emmert, C. Stroud, and M. Abramovici, “Online Fault Tolerance for FPGA Logic Blocks,” IEEE Trans on Very Large Scale Integration (VLSI) Systems, Vol. 15, No. 2, pp. 216-226, Feb. 2007.
[19] A. Mathur and C. L. Liu, “Timing-driven placement reconfiguration for fault tolerance and yield enhancement in FPGAs,” Proc. European conference on Design and Test, pp. 165–169, 1996.
[20] F. Hatori, T. Sakurai, K. Sawada, M. Takahashi, M. Ichida, M. Uchida, I. Yoshii, Y. Kawahara, T. Hibi, Y. Saeki, H. Muroga, A. Tanaka, and K. Kanzaki, “Introducing redundancy in field programmable gate arrays,” Proc. CIC Conf., vol. 7, pp. 1–4, Aug. 2002.
[21] F. Hanchek and S. Dutt, “Design methodologies for tolerating cell and interconnect faults in FPGAs,” Conf. Computer Design. pp. 326–331, 1996.
[22] A. Doumar and H. Ito. “Defect and fault tolerance SRAM-based FPGAs by shifting the configuration data,” IEICE Trans. Inf. Syst. pp. 1104–1115, 2000.
[23] A.K. Agarwal, J. Cong, and B. Tagiku. “Fault tolerant placement and defect reconfiguration for nano-FPGAs,” In Proc. Int. Conf on Computer Aided Design, 2008.
[24] J. Narasimhan, K. Nakajima, C. S. Rim, A. T. Dahbura, “Yield enhancement of programmable ASIC arrays by reconfiguration of circuit placements,” IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, pp. 976–986, Aug. 1994.
[25] F. Hanchek, S. Dutt, “Node-covering based defect and fault tolerance methods for increased yield in FPGAs,” Proc. 9th Int. Conf on VLSI Design. pp. 225–229, 1996.
[26] C. Ababei, Y. Feng, B. Goplen, H. Mogal, T. Zhang, K. Bazargan, and S. Sapatnekar, “Placement and routing in 3D integrated circuits,” IEEE Design Test Computers, vol. 22, no. 6, pp. 520–531, Nov. 2005.
[27] Altera. [Online]. Available: http://www.altera.com/
[28] Xilinx. [Online]. Available: http://www.xilinx.com/
[29] C.-I Chen, B.-C. Lee, and J.-D. Huang, “Architectural exploration of 3D FPGAs towards a better balance between area and delay,” Proc. Design, Automation & Test in Europe Conf. and Exhibit., pp. 587–590, 2011.
[30] S. Yang, “Logic synthesis and optimization benchmarks user guide,” Technical Report 1991-IWLS-UG-Saeyang, Microelectronics Center of North Carolina, 1991.
[31] [Online]. Available: http://www.eecs.berkeley.edu/~alanmi/benchmarks/