Achieving Performance Optimization in Latency-Insensitive Systems with the Minimum Number of Buffers

Chapter 4 Experimental Results

4.4 Discussion

In Experiment Ⅰ, our proposed reduction procedure effectively decreases the number of cycles in the graph. Because the number of cycles in a directed graph can grow faster than the exponential 2ⁿ, reducing the graph size of practical circuits is important. Table 4-1 shows that, without the reduction procedure, some benchmarks contain more than one million cycles. Cycle counts on the order of millions are hard to process on ordinary computers, and enumerating all of them is time-consuming. This is why the reduction procedure is so important.
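To illustrate how quickly cycle counts grow, the following minimal sketch counts the elementary cycles of a complete directed graph with a plain depth-first search. It is not the enumeration algorithm used in the experiments; the function names `count_simple_cycles` and `complete_digraph` are hypothetical and chosen only for this example.

```python
def count_simple_cycles(adj):
    """Count elementary cycles of a directed graph given as {vertex: [successors]}.

    Each cycle is counted exactly once by rooting it at its smallest vertex:
    a search started at `start` only walks through vertices larger than `start`.
    This is a plain backtracking search, assumed here for illustration only.
    """
    count = 0

    def dfs(start, v, on_path):
        nonlocal count
        for w in adj[v]:
            if w == start:
                count += 1                 # closed a cycle back to the root
            elif w > start and w not in on_path:
                on_path.add(w)
                dfs(start, w, on_path)
                on_path.remove(w)          # backtrack

    for s in adj:
        dfs(s, s, {s})
    return count


def complete_digraph(n):
    """Complete directed graph on vertices 0..n-1, without self-loops."""
    return {u: [v for v in range(n) if v != u] for u in range(n)}


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, count_simple_cycles(complete_digraph(n)), 2 ** n)
```

On the complete graph with 4 vertices the count is 20, already above 2⁴ = 16, and with 5 vertices it is 84 against 2⁵ = 32.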

In Experiment Ⅱ, our proposed method saves 22% of queues on average compared with Collins’ method. Although our method takes about 2.5 times as long to run as Collins’ method, the additional time is still acceptable. For instance, ami49, the benchmark with the most cycles in our experiment, takes only 947 seconds to solve. We therefore usually prefer to spend an acceptable amount of extra run time in order to save valuable chip area. We also conducted another experiment to examine what happens when channel latency becomes worse. The results show that our method is more suitable than Collins’ method under worse channel latency.

In Experiment Ⅲ, our proposed method saves 33% of queues on average compared with Collins’ method when bit width is considered. With a time overhead similar to that of the experimental results shown in Table 4-2, our method saves more area than Collins’ method when bit width is considered. The run time is similar because the only difference between the two formulations is the objective function. This makes our method flexible in adapting to different bit-width assignments.
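The effect of changing only the objective function can be sketched on a toy instance. The constraint form used here (each cycle needs a minimum total number of extra queue slots on its edges), the edge names, and the bit widths are all hypothetical and serve only to show that the same feasible region can yield a different optimum once each queue slot is weighted by its bit width. A brute-force search stands in for the ILP solver.

```python
from itertools import product


def min_cost_queues(cycles, demand, weights, max_q=2):
    """Brute-force a tiny queue-sizing instance (illustrative, not the thesis ILP).

    cycles:  {cycle name: list of edges on that cycle}        (hypothetical)
    demand:  {cycle name: minimum total extra slots required} (hypothetical)
    weights: {edge: cost of one queue slot on that edge}
    Returns (best cost, {edge: number of slots}).
    """
    edges = sorted(weights)
    best = None
    for qs in product(range(max_q + 1), repeat=len(edges)):
        q = dict(zip(edges, qs))
        # The constraints are identical for both objectives.
        feasible = all(sum(q[e] for e in cycles[c]) >= demand[c] for c in cycles)
        if not feasible:
            continue
        cost = sum(weights[e] * q[e] for e in edges)
        if best is None or cost < best[0]:
            best = (cost, q)
    return best


# Edge e2 is shared by both cycles; e1 and e3 each belong to one cycle.
cycles = {"A": ["e1", "e2"], "B": ["e2", "e3"]}
demand = {"A": 1, "B": 1}

unit = {"e1": 1, "e2": 1, "e3": 1}       # objective: number of queue slots
width = {"e1": 8, "e2": 64, "e3": 8}     # objective: slots weighted by bit width

if __name__ == "__main__":
    print(min_cost_queues(cycles, demand, unit))
    print(min_cost_queues(cycles, demand, width))
```

With unit weights the cheapest solution places one slot on the shared edge e2 (cost 1). With the assumed bit widths, it becomes cheaper to place one slot on each of the narrow edges e1 and e3 (cost 16) than one slot on the 64-bit edge.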

Since the number of cycles determines the efficiency of our method, decreasing the number of cycles is vital. We propose a reduction procedure, consisting of path condensation and edge unification, to decrease the number of cycles.

However, some experimental techniques help us further reduce the number of cycles. One is to ignore a cycle if and only if its T(C) > 1, since such a cycle is not the most critical one. Another is to collapse each strongly connected component (SCC) into a single vertex. This is because the throughput upper bound in our experiment is 1, so each sub-system must eventually have throughput 1 as well. Hence, we can view each SCC as a sub-system with throughput 1 and collapse it into a single vertex.

The final one is to ignore cycles containing only two edges, since such a cycle must be a self-loop cycle in the original marked graph.
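The SCC-collapsing step can be sketched as a generic graph condensation. The code below uses Kosaraju's two-pass algorithm, which is an assumption made for this example; the text does not state which SCC algorithm was used or exactly which sub-graphs it was applied to. The function names and the adjacency-dictionary representation are likewise hypothetical.

```python
def strongly_connected_components(adj):
    """Map each vertex to a representative of its SCC (Kosaraju's algorithm).

    adj is {vertex: [successors]}; every vertex must appear as a key.
    """
    # Pass 1: record vertices in order of DFS finish time (iterative DFS).
    visited, order = set(), []
    for root in adj:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(v)            # all successors of v are done
                stack.pop()

    # Build the reversed graph.
    radj = {v: [] for v in adj}
    for v, ws in adj.items():
        for w in ws:
            radj[w].append(v)

    # Pass 2: explore the reversed graph in decreasing finish time.
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if w not in comp:
                    comp[w] = root
                    stack.append(w)
    return comp


def condense(adj):
    """Collapse every SCC into one vertex; return (vertex -> SCC, condensed graph)."""
    comp = strongly_connected_components(adj)
    dag = {c: set() for c in set(comp.values())}
    for v, ws in adj.items():
        for w in ws:
            if comp[v] != comp[w]:
                dag[comp[v]].add(comp[w])  # keep only edges between different SCCs
    return comp, dag


if __name__ == "__main__":
    g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
    print(condense(g))
```

In the small example, vertices a, b, c form one component and d, e form another, so the condensed graph has two vertices joined by a single edge.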

Chapter 5 Future Work and Conclusions

As manufacturing processes move into deep submicron technology, the length of interconnects becomes more unpredictable and uncontrollable. Because signal propagation times are unknown, it is hard for designers to assemble pre-designed IP cores at an early design stage. Repeater insertion is a promising solution to this problem that does not require heavy design changes. However, slight modifications to existing IP cores are unavoidable, which prolongs the product development period with modifications that add no value. Even worse, repeater insertion degrades the performance of the overall system because of multi-clock communication. LIS is a good solution to these problems. LIS handles unpredictable interconnects by automatically inserting relay stations, which is similar to the repeater insertion mentioned above. LIS avoids repeated modification by encapsulating every existing IP core.

Encapsulation adds extra hardware, called a shell, to each existing IP. This step lets all encapsulated IP cores and relay stations follow the same communication protocol, the latency-insensitive protocol. LIS addresses performance degradation mainly through queue sizing. As a result, the product development period shortens and the company can earn more profit. For these reasons, LIS is an attractive solution for time-to-market. However, physical parameters such as the length of interconnects and the positions of IPs are known only after floorplanning is performed. From Section 2.1, we know that the throughput upper bound of an LIS is determined by the system architecture. In other words, a poor system architecture limits the room for improvement available to LIS. In our experimental results, channel latency is assigned within a reasonable interval rather than obtained from realistic floorplanning results. Many studies on determining the best system architecture at the floorplanning stage have been reported, such as [28] and [29]. After such performance-aware floorplanning is performed, we acquire real physical information that is closer to the optimal architecture. Our future direction is therefore to combine our proposed method with real physical information acquired from performance-aware floorplanning.

We propose a technique that achieves optimal throughput for LIS with minimal queue size. First, we transform the original marked graph into a quantitative graph. Then, we develop a reduction procedure to reduce the graph size. We use an ILP formulation to guarantee the minimal queue demand. After obtaining the minimal queue solution from the reduced quantitative graph, we apply a recovery procedure that transforms the reduced quantitative graph back into the quantitative graph while maintaining the correctness of the minimal queue size. The experimental results show that our method outperforms Collins’ in terms of queue size (area cost). The run time of our method is acceptable for real industrial systems.

References

[1] R. H. Havemann and J. A. Hutchby, “High-performance interconnects: an integration overview,” in Proceedings of IEEE, pp. 586–601, 2001.

[2] R. Ho, K. W. Mai and M.A. Horowitz, “The future of wires,” in Proceedings of IEEE, pp. 490–504, 2001.

[3] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2005.

[4] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation, Stanford Univ., CA, 1984.

[5] D. S. Bormann and P. Y. Cheung, “Asynchronous wrapper for heterogeneous systems,” in Proc. ICCD, pp. 307–317, 1997.

[6] J. Muttersbach, T. Villiger and W. Fichtner, “Practical design of globally-asynchronous locally-synchronous systems,” in Proc. ASYNC, pp. 52–59, 2000.

[7] S. Kumar, A. Jantsch, J-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A network on chip architecture and design methodology,” in Proc. Symposium on VLSI, pp. 117–124, 2002.

[8] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, “Network on a chip: an architecture for billion transistor era,” in Proc. of the IEEE NorChip Conf., 2002.

[9] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip multicycle communication,” in Proc. TCAD, pp. 550–564, 2004.

[10] J. Cong, Y. Fan, Z. Zhang, “Architecture-level synthesis for automatic interconnect pipelining,” in Proc. DAC, pp. 602–607, 2004.

[11] L. P. Carloni, K. L. McMillan, and A.L. Sangiovanni-Vincentelli, “Latency-insensitive protocols,” in Proc. of the Computer-Aided Verification (CAV), 1999.

[12] L. P. Carloni, K. L. McMillan, and A.L. Sangiovanni-Vincentelli, “Theory of latency-insensitive design,” in IEEE Tran. CAD, vol. 20, no. 9, 2001.

[13] V. Adler, E. G. Friedman, “Repeater design to reduce delay and power in resistive interconnect,” in IEEE Trans. Circuits Syst. II: Analog Digital Signal Process, vol. 45, no. 5, pp. 607–616, 1998.

[14] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” in Proc. ICCAD, pp. 309–315, 1999.

[15] C. Li, R. Collins, S. Sonalkar, and L. P. Carloni, “Design, implementation, and validation of a new class of interface circuits for latency insensitive design,” in Fifth ACM-IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE), 2007.

[16] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Performance analysis and optimization of latency insensitive systems,” in Proc. DAC, pp. 361–367, 2000.

[17] R. Lu and C. Koh, “Performance optimization of latency insensitive systems through buffer queue sizing of communication channels,” in Proc. ICCAD, pp. 207–231, 2003.

[18] L. P. Carloni, “The role of back-pressure in implementing latency-insensitive systems,” in Electronic Notes in Theoretical Computer Science (ENTCS), vol. 146, no. 2, 2006.

[19] M. R. Casu and L. Macchiarulo, “Issues in implementing latency insensitive protocols,” in Proc. of the Conf. on Design, Automation and Test in Europe, pp. 1390–1391, 2004.

[20] R. Lu and C. Koh, “Performance analysis of latency-insensitive systems,” in IEEE Trans. CAD, vol. 25, pp. 469–483, 2006.

[21] C. Li, R. Collins, and L. P. Carloni, “Topology-based optimization of maximal sustainable throughput in a latency-insensitive system,” in Proc. DAC, pp. 410–415, 2007.

[22] C. Li, R. Collins, and L. P. Carloni, “Topology-based performance analysis and optimization of latency-insensitive systems,” in IEEE Trans. CAD, vol. 27, pp. 2277–2290, 2008.

[23] M. R. Casu and L. Macchiarulo, “A new approach to latency insensitive design,” in Proc. DAC, pp. 576–581, 2004.

[24] D. Bufistov, J. Julvez, and J. Cortadella, “Performance optimization of elastic systems using buffer resizing and buffer insertion,” in Proc. ICCAD, pp. 442–448, 2008.

[25] F. Commoner, A. W. Holt, S. Even, and A. Pnueli, “Marked directed graphs,” in J. Comput. Syst. Sci., pp. 511–523, 1971.

[26] D. B. Johnson, “Finding All the Elementary Circuits of a Directed Graph,” in SIAM J. Comput., 1975.

[27] “lp_solver,” http://lpsolve.sourceforge.net/5.5/

[28] M. R. Casu and L. Macchiarulo, “Floorplanning for throughput,” in Proc. ISPD, pp. 62–69, 2004.

[29] J. Wang, H. Zhou and P. Wu, “Processing rate optimization by sequential system floorplanning,” in Proc. ISQED, pp. 340–345, 2006.
