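Recovery Phase - Research on Emerging Electronic Design Automation Technologies for the Post-Submicron Era, Subproject 1: Geometry-Aware System Architecture Synthesis Techniques for Next-Generation On-Chip Communication (III)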

Once ILP produces an optimal solution for a given CQG, further operations are needed to derive the optimal solution for the original QG. We present two operations for this purpose: edge split and path expansion.

Edge Split:

Fig. 12 (a) The edge split operation. (b) The path expansion operation. (c) The solution for the original QG.

Assume an edge e_d(w_d, q_d, c_d) is a selected dominating edge and e_k(w_k, q_k, c_k) is e_d’s removed parallel edge. After applying ILP, q_d is set large enough to ensure that (10) holds for every cycle C passing through e_d in the reverse direction. When e_k is put back into the CQG, q_k must also be set properly so that (10) still holds for any newly generated cycle C’ derived from C by replacing e_d with e_k. It follows that (10) is guaranteed to hold for every such cycle C’ if the following inequality is satisfied:

\[ q_k - c_k \ge q_d - c_d \quad (\text{or } q_k \ge c_k + q_d - c_d) \tag{12} \]

Therefore, the minimal queue size of an edge previously removed by edge unification can be derived from (12). For example, in Fig. 12(a), the queue size of the blue selected dominating edge (v1, v5) is set to 4 after ILP. After edge splitting, the minimal queue size of the lower red non-dominating edge is set to 1 + 4 – 3 = 2 by (12).
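As a minimal sketch of how (12) is applied during edge split, the helper below returns the smallest queue size that satisfies the inequality for a restored parallel edge. The function and argument names are illustrative assumptions and are not taken from the original implementation; the numbers are those of the Fig. 12(a) example.

```python
def min_queue_after_split(q_d: int, c_d: int, c_k: int) -> int:
    """Smallest q_k that satisfies (12): q_k - c_k >= q_d - c_d."""
    return c_k + q_d - c_d


# Fig. 12(a): the dominating edge has q_d = 4 and c_d = 3, and the
# restored non-dominating edge has c_k = 1.
q_k = min_queue_after_split(q_d=4, c_d=3, c_k=1)
print(q_k)  # 2
```

Any larger value of q_k also satisfies (12); the bound itself is the minimal choice.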

Path Expansion:

When a condensed edge e_p is expanded back into its condensable path p, (7) ensures that (10) automatically holds for any cycle C that originally passed through e_p in the forward direction and now passes through p. Hence, a path expansion operation only has to ensure that (10) also holds for such a cycle C in the reverse direction, which is achieved if the following constraint is satisfied:

[Equation (13) is not recoverable from the source.]

In general, the way of distributing q(e_p) among the edges along p is not unique. However, the following strategy must be adopted to guarantee minimal queue insertion. Let e_m ∈ E(p) be the edge with the lowest burden factor along a condensable path p, i.e., b(e_m) ≤ b(e) for all e ∈ E(p). Then q(e) of each edge e along p is determined as:

[Equation (14) is not recoverable from the source.]

If two or more edges share the lowest burden factor, pick one arbitrarily. It is apparent that using (14) for queue sizing ensures that (13) always holds. Fig. 12(b) illustrates an example of path expansion. Finally, as shown in Fig. 12(c), edge split and path expansion can be performed repeatedly until the complete optimal solution for the original QG is obtained.

Fig. 13 Overall flow of the proposed method.

At the end of this section, Fig. 13 summarizes the overall flow of our proposed method for minimal queue insertion. Unlike [34], [35], identifying strongly connected components (SCCs) is unnecessary here since the test cases are acyclic.

The proposed approach has been implemented in C++ on Linux. Since real-world systems are difficult to obtain, we instead randomly build a set of directed acyclic graphs (DAGs) of different sizes as QGs for evaluation, similar to the experimental setup of [35]. Furthermore, the latency of every edge in a DAG (i.e., a communication channel in a system) is randomly assigned an integer in the interval [1, L]; that is, the number of relay stations that must be inserted on each edge (channel) is in the range [0, L – 1]. All experiments are conducted on a workstation with a 1.81 GHz AMD CPU and 2 GB of RAM. The lp_solve package [41] is used to solve the ILP.
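The test-case construction described above can be sketched as follows. This is only an illustration under stated assumptions: the report says the DAGs and latencies are random but does not give the generator, so the function name `random_dag`, the fixed seed, and the trick of drawing edges only from lower-numbered to higher-numbered vertices (which guarantees acyclicity) are all hypothetical choices.

```python
import random


def random_dag(num_vertices, num_edges, max_latency, seed=0):
    """Return a list of (src, dst, latency) edges that form a DAG.

    Assumed construction: an edge may only go from a lower-numbered
    vertex to a higher-numbered one, so no directed cycle can exist.
    """
    rng = random.Random(seed)
    candidates = [(u, v)
                  for u in range(num_vertices)
                  for v in range(u + 1, num_vertices)]
    chosen = rng.sample(candidates, num_edges)
    # Latency is an integer drawn uniformly from [1, L].
    return [(u, v, rng.randint(1, max_latency)) for u, v in chosen]


edges = random_dag(num_vertices=8, num_edges=12, max_latency=3)
# A channel with latency t needs t - 1 relay stations, i.e. a value in [0, L - 1].
relay_stations = [latency - 1 for _, _, latency in edges]
```

With L = 3, every entry of `relay_stations` is 0, 1, or 2, matching the range [0, L – 1] stated in the text.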

Our first experiment verifies whether the proposed compaction techniques are effective. Johnson’s algorithm [42] is applied to identify all cycles in both the original QG and the minimal CQG. The experimental results in Table 5 clearly indicate that the proposed technique reduces the number of vertices and edges and also achieves a remarkable reduction in cycle count (for example, from 30540 to 10123 cycles for Testcase3). Before compaction, the cycle count of several test cases even exceeds one million, which makes it virtually impossible for ILP to find a feasible solution.
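To make the notion of cycle count concrete, the sketch below counts the elementary cycles of a small directed graph. It is deliberately a naive depth-first enumeration, not Johnson’s algorithm [42] used in the experiments, and it is only practical for tiny graphs; the function name and the example graph are assumptions made for illustration.

```python
def count_simple_cycles(num_vertices, edges):
    """Count elementary cycles in a directed graph without parallel edges."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)

    def dfs(start, node, on_path):
        total = 0
        for nxt in adj[node]:
            if nxt == start:
                total += 1  # closed a cycle rooted at `start`
            elif nxt > start and nxt not in on_path:
                on_path.add(nxt)
                total += dfs(start, nxt, on_path)
                on_path.remove(nxt)
        return total

    # Root every cycle at its smallest vertex so that it is counted once.
    return sum(dfs(s, s, {s}) for s in range(num_vertices))


# Complete directed graph on three vertices:
# three 2-cycles plus two 3-cycles.
complete3 = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]
print(count_simple_cycles(3, complete3))  # 5
```

Even this six-edge graph has five cycles, which hints at why cycle counts grow so quickly with graph size and why reducing the graph before building per-cycle constraints matters.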

Table 5 Experimental results of cycle reduction.

| Case Name | Original QG (V, E) | Original QG #Cycles | Minimal CQG (V, E) | Minimal CQG #Cycles |
|---|---|---|---|---|
| Testcase1 | (11, 15) | 55 | (8, 11) | 12 |
| Testcase2 | (17, 21) | 51 | (13, 17) | 14 |
| Testcase3 | (45, 61) | 30540 | (20, 35) | 10123 |
| Testcase4 | (58, 76) | 48590 | (39, 45) | 10497 |
| Testcase5 | (104, 121) | 42435 | (56, 73) | 19754 |
| Testcase6 | (126, 172) | > 1 million | (77, 98) | 132415 |
| Testcase7 | (175, 201) | > 1 million | (66, 84) | 15423 |
| Testcase8 | (297, 318) | > 1 million | (116, 142) | 23862 |

In our second experiment, we compare our proposed method with Collins’ heuristic method proposed in [35]. Table 6 and Table 7 report the results with L = 3 and L = 16, respectively. The results show that our proposed method achieves an average reduction in queue size of 23% and 28% for L = 3 and L = 16, respectively, compared with Collins’ method. The results also suggest that the improvement increases slightly as the fabrication process keeps scaling (i.e., as L increases).

Meanwhile, our method needs about 58% more runtime than Collins’ on average. However, this should be acceptable since all test cases complete within 24 minutes. Tables 6 and 7 also show that ILP fails on several test cases (denoted as ‘*’) if it is applied directly to the QG instead of the minimal CQG, because the constraint set is too large at the QG level. It is also worth mentioning that several test cases contain hundreds of vertices and edges, which suggests that our approach can handle moderately large systems in practice.
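The summary figures quoted above can be checked against the per-case data in Tables 6 and 7. The sketch below assumes that each Ratio row is the ratio of column totals (proposed divided by Collins’) rather than an average of per-case ratios; this is an inference from the numbers, not something the report states. The variable and function names are illustrative.

```python
# (#queues, runtime) per test case, copied from Tables 6 and 7.
proposed_l3 = [(20, 0), (9, 0), (51, 5), (43, 14),
               (29, 40), (77, 867), (84, 32), (114, 73)]
collins_l3 = [(20, 0), (9, 0), (80, 4), (46, 13),
              (78, 27), (90, 542), (90, 23), (141, 47)]
proposed_l16 = [(68, 1), (76, 0), (290, 9), (291, 31),
                (256, 77), (519, 1438), (673, 69), (641, 131)]
collins_l16 = [(68, 0), (77, 0), (437, 6), (351, 19),
               (386, 48), (793, 913), (753, 40), (1035, 83)]


def total_ratio(ours, baseline, column):
    """Ratio of column totals; column 0 is #queues, column 1 is runtime."""
    return sum(row[column] for row in ours) / sum(row[column] for row in baseline)


queue_ratio_l3 = round(total_ratio(proposed_l3, collins_l3, 0), 2)
runtime_ratio_l3 = round(total_ratio(proposed_l3, collins_l3, 1), 2)
queue_ratio_l16 = round(total_ratio(proposed_l16, collins_l16, 0), 2)
runtime_ratio_l16 = round(total_ratio(proposed_l16, collins_l16, 1), 2)
print(queue_ratio_l3, runtime_ratio_l3)    # 0.77 1.57
print(queue_ratio_l16, runtime_ratio_l16)  # 0.72 1.58
```

Under this assumption the computed values reproduce the Ratio rows, which correspond to the 23% and 28% queue size reductions and the roughly 58% runtime overhead reported in the text.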

4. Conclusion

First, the number of IICs has been reported to model the global interconnect cost better, and it can therefore be considered a major QoR evaluation metric at early design stages in DRFM. In this project, we propose a resource-constrained synthesis algorithm for IIC minimization. An iterative binding-then-rescheduling procedure is first used for island assignment. A better island binding result can be expected because rescheduling significantly expands the solution search space. The proposed algorithm also incorporates the read port restriction into the scheduling and binding procedures to minimize potential access conflicts. A post-processing procedure is then conducted to eliminate all remaining access conflicts.

The experimental results indicate that the proposed algorithm reduces the number of IICs by 21.0% to 24.7% on average compared with the prior art. When the read port restriction is adopted, the proposed method also outperforms the previous work by about 12% in terms of average latency. As a result, the proposed algorithm should be regarded as a better alternative for architectural synthesis targeting DRFM.

Table 6 Experimental results with L = 3.

| Case Name | Proposed Method #Queues | Proposed Method Runtime (s) | Collins’ Method [35] #Queues | Collins’ Method [35] Runtime (s) | ILP directly on QG #Queues | ILP directly on QG Runtime (s) |
|---|---|---|---|---|---|---|
| Testcase1 | 20 | 0 | 20 | 0 | 20 | 1 |
| Testcase2 | 9 | 0 | 9 | 0 | 9 | 0 |
| Testcase3 | 51 | 5 | 80 | 4 | 51 | 14 |
| Testcase4 | 43 | 14 | 46 | 13 | 43 | 44 |
| Testcase5 | 29 | 40 | 78 | 27 | 29 | 340 |
| Testcase6 | 77 | 867 | 90 | 542 | * | * |
| Testcase7 | 84 | 32 | 90 | 23 | * | * |
| Testcase8 | 114 | 73 | 141 | 47 | * | * |
| Ratio | 0.77 | 1.57 | 1 | 1 | - | - |

Furthermore, a throughput optimization technique for LIS with minimal queue size is presented. First, an LIS is transformed into a newly proposed quantitative graph (QG); next, the size of the QG is minimized through the developed compaction operations; ILP then follows to obtain an exact solution of minimal queue size, which is further converted into an optimal solution for the original LIS. The experimental results demonstrate that our algorithm achieves an average reduction of up to 28% in queue size compared with the prior art. Moreover, the required runtime is within half an hour for a system with hundreds of cores. Consequently, we believe that the proposed technique is a better alternative for resolving the queue sizing issue for moderately large systems in practice. The proposed algorithm can currently handle only acyclic systems. We are working on an improved version that can deal with cyclic systems as well.

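Table 7 Experimental results with L = 16.

| Case Name | Proposed Method #Queues | Proposed Method Runtime (s) | Collins’ Method [35] #Queues | Collins’ Method [35] Runtime (s) | ILP directly on QG #Queues | ILP directly on QG Runtime (s) |
|---|---|---|---|---|---|---|
| Testcase1 | 68 | 1 | 68 | 0 | 68 | 1 |
| Testcase2 | 76 | 0 | 77 | 0 | 76 | 0 |
| Testcase3 | 290 | 9 | 437 | 6 | 290 | 19 |
| Testcase4 | 291 | 31 | 351 | 19 | 291 | 52 |
| Testcase5 | 256 | 77 | 386 | 48 | 256 | 459 |
| Testcase6 | 519 | 1438 | 793 | 913 | * | * |
| Testcase7 | 673 | 69 | 753 | 40 | * | * |
| Testcase8 | 641 | 131 | 1035 | 83 | * | * |
| Ratio | 0.72 | 1.58 | 1 | 1 | - | - |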

5. References

[1] International Technology Roadmap for Semiconductors. Semiconductor Industry Association, 2007.

[2] Matzke, “Will physical scalability sabotage performance gains?” IEEE Computer, vol. 20, pp. 37–39, 1997.

[3] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Coping with latency in SOC design,” IEEE Micro, vol. 22, pp. 24–35, 2002.

[4] W. J. Dally, “Interconnect-limited VLSI architecture,” IEEE Int’l Conf. Interconnect Technology, 1999.

[5] Y. Mori, V. Moshnyaga, H. Onodera, and K. Tamaru, “A performance-driven macro-block placer for architectural evaluation of ASIC designs,” Proc. Annual IEEE Int’l ASIC Conf. and Exhibit, pp. 233–236, Sep. 1995.

[6] V. Moshnyaga and K. Tamaru, “A placement driven methodology for high-level synthesis of sub-micron ASIC’s,” Proc. Int’l Symp. Circuits and Systems, vol. 4, pp. 572–575, May 1996.

[7] P. Prabhakaran and P. Banerjee, “Parallel algorithms for simultaneous scheduling, binding and floorplanning in high-level synthesis,” Proc. of Int’l Symp. Circuits and Systems, vol. 6, pp. 372–376, May 1998.

[8] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 1984.

[9] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” Proc. Int’l Conf. Computer Aided Design, pp. 309–315, 1999.

[10] J.-D. Huang, Y.-S. Huang, L. Wang, and G.-W. Lee, “Throughput-Aware Floorplanning via Dynamic Optimization on Performance-Critical Loops,” Intl. Journal of Electrical Engineering, vol. 17, no. 1, pp. 33–42, Feb. 2010.

[11] Kim, J. Jung, S. Lee, J. Jeon, and K. Choi, “Behavior-to-placed RTL synthesis with performance-driven placement,” Proc. Int’l Conf. Computer Aided Design, pp. 320–325, Nov. 2001.

[12] J. Jeon, D. Kim, D. Shin, and K. Choi, “High-level synthesis under multi-cycle interconnect delay,” Proc. Asia and South Pacific Design Automation Conf., pp. 662–667, Jan. 2001.

[13] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip multicycle communication,” IEEE Trans. on Computer-Aided Design Integrated Circuits and Systems, vol. 23, no. 4, pp. 550–564, Apr. 2004.

[14] C.-I Chen and J.-D. Huang, “A Hierarchical Criticality-Aware Architectural Synthesis Framework for Multicycle Communication,” IEICE Trans. on Fundamentals, vol. E93-A, no. 7, pp. 1300–1308, Jul. 2010.

[15] S.-H. Huang, C.-H. Chiang, and C.-H. Cheng, “Three-dimension scheduling under multi-cycle interconnect communications,” IEICE Electronics Express, vol. 2, no. 4, pp. 108–114, Feb. 2005.

[16] J. Cong, Y. Fan, and Z. Zhang, “Architecture-level synthesis for automatic interconnect pipelining,” Proc. Design Automation Conf., pp. 602–607, Jun. 2004.

[17] W.-S. Huang, Y.-R. Hong, J.-D. Huang, and Y.-S. Huang, “A multicycle communication architecture and synthesis flow for global interconnect resource sharing,” Proc. Asia and South Pacific Design Automation Conf., pp. 16–21, Jan. 2008.

[18] Y.-S. Huang, Y.-J. Hong, and J.-D. Huang, “Communication Synthesis for Interconnect Minimization in Multicycle Communication Architecture,” IEICE Trans. on Fundamentals, vol. E92-A, no. 12, pp. 3143–3150, Dec. 2009.

[19] Ohchi, N. Togawa, M. Yanagisawa, and T. Ohtsuki, “High-level synthesis algorithms with floorplaning for distributed/shared-register architectures,” Proc. Int’l Symp. VLSI Design, Automation and Test, pp. 164–167, Apr. 2008.

[20] J. Cong, Y. Fan, and J. Xu, “Simultaneous resource binding and interconnection optimization based on a distributed register-file microarchitecture,” ACM Trans. Design Automation Electronics Systems, vol. 14, no. 3, pp. 1–31, May 2009.

[21] K. Lim, Y. Kim, and T. Kim, “Interconnect and communication synthesis for distributed register-file microarchitecture,” Proc. Design Automation Conf., pp. 765–770, Jun. 2007.

[22] J.-D. Huang, C.-I Chen, Y.-T. Lin, and W.-L. Hsu, “Communication synthesis for interconnect minimization targeting distributed register-file microarchitecture,” IEICE Trans. on Fundamentals, vol. E94-A, no. 4, pp. 1151–1155, Apr. 2011.

[23] S. Gao, K. Seto, S. Komatsu, and M. Fujita, “Pipeline scheduling for array based reconfigurable architectures considering interconnect delays,” Proc. Int’l Conf. Field-Programmable Technology, pp. 137–144, Dec. 2005.

[24] Terechko, E. L. Thenaff, M. Garg, J. van Eijndhoven, and H. Corporaal, “Inter-cluster communication models for clustered VLIW processors,” Proc. Int’l Symp. High Performance Computer Architecture, 2003.

[25] L. P. Carloni, K. L. McMillan and A. L. Sangiovanni-Vincentelli, “Theory of latency-insensitive design,” IEEE Trans. on CAD, vol. 20, no. 9, pp. 1059–1076, Sep. 2001.

[26] Altera website. [Online]. Available: http://www.altera.com

[27] Xilinx website. [Online]. Available: http://www.xilinx.com

[28] B. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell System Technical Journal, pp. 291–307, Feb. 1970.

[29] MCAS: multicycle architectural synthesis system. [Online]. Available: http://cadlab.cs.ucla.edu/software_release/mcas/

[30] ExPRESS group. [Online]. Available: http://express.ece.ucsb.edu/

[31] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd edition, The MIT Press, 2001.

[32] R. Lu and C.-K. Koh, “Performance optimization of latency insensitive systems through buffer queue sizing of communication channels,” in Proc. Intl. Conf. on Computer-Aided Design, Nov. 2003, pp. 227–231.

[33] R. Lu and C.-K. Koh, “Performance analysis of latency-insensitive system,” IEEE Trans. on CAD, vol. 25, no. 3, pp. 469–483, Mar. 2006.

[34] R. L. Collins and L. P. Carloni, “Topology-based optimization of maximal sustainable throughput in a latency-insensitive system,” in Proc. of the Design Automation Conf., 2007, pp. 410–415.

[35] R. L. Collins and L. P. Carloni, “Topology-based performance analysis and optimization of latency-insensitive systems,” IEEE Trans. on CAD, vol. 27, no. 12, pp. 2277–2290, Dec. 2008.

[36] M. R. Casu and L. Macchiarulo, “A new approach to latency insensitive design,” in Proc. of the Design Automation Conf., 2004, pp. 576–581.

[37] M. R. Casu and L. Macchiarulo, “Issues in implementing latency insensitive protocols,” in Proc. of the Design, Automation and Test in Europe Conf., 2004, pp. 1390–1391.

[38] D. Bufistov, J. Julvez, and J. Cortadella, “Performance optimization of elastic systems using buffer resizing and buffer insertion,” in Proc. Intl. Conf. on Computer-Aided Design, Nov. 2008, pp. 442–448.

[39] T. Murata, “Circuit theoretic analysis and synthesis of marked graphs,” IEEE Trans. on Circuit and Systems, vol. 24, no. 7, pp. 400–405, Jul. 1977.

[40] T. Murata, “Petri nets: Properties, analysis and applications,” Proc. of the IEEE, vol. 77, no. 4, pp. 541–580, Apr. 1989.

[41] lp_solve. http://lpsolve.sourceforge.net/5.5/

[42] D. B. Johnson, “Finding All the Elementary Circuits of a Directed Graph,” SIAM Journal on Computing, vol. 4, no. 1, pp. 77–84, Mar. 1975.

6. Self-Evaluation of Results

In terms of academic publications, this year we published the following international journal papers:

1. Che-Hua Shih, Ya-Ching Yang, Chia-Chih Yen, Juinn-Dar Huang, and Jing-Yang Jou, “FSM-Based Formal Compliance Verification of Interface Protocols,” Journal of Information Science and Engineering, vol. 26, no. 5, pp. 1601–1617, Sep. 2010.

2. Juinn-Dar Huang, Chia-I Chen, Yen-Ting Lin, and Wan-Lin Hsu, “Communication synthesis for interconnect minimization targeting distributed register-file microarchitecture,” IEICE Trans. on Fundamentals, vol. E94-A, no. 4, pp. 1151–1155, Apr. 2011.

3. Juinn-Dar Huang, Chia-I Chen, Wan-Ling Hsu, Yen-Ting Lin, and Jing-Yang Jou, “Performance-driven architectural synthesis for distributed register-file microarchitecture with inter-island delay,” IEICE Trans. on Fundamentals, to appear.

and the following international conference papers:

1. Ya-Shih Huang and Juinn-Dar Huang, “Throughput-Driven Hierarchical Placement for Two-Dimensional Regular Multicycle Communication Architecture,” Asia Symposium on Quality Electronic Design, pp. 134–139, Aug. 2010.

2. Juinn-Dar Huang, Chia-I Chen, Wan-Ling Hsu, Yen-Ting Lin, and Jing-Yang Jou, “Inter-Island Delay Aware Communication Synthesis for Island-Based Distributed Register Architecture,” Proc. of the 16th Workshop on Synthesis and System Integration of Mixed Information Technologies, pp. 58–63, Oct. 2010.

3. Juinn-Dar Huang, Yi-Hang Chen, and Ya-Chien Ho, “Quantitative Graph-Based Minimal Queue Sizing for Throughput Optimization in Latency-Insensitive Designs,” Proc. of the 16th Workshop on Synthesis and System Integration of Mixed Information Technologies, pp. 430–435, Oct. 2010.

4. Juinn-Dar Huang, Yi-Hang Chen, and Ya-Chien Ho, “Throughput Optimization for Latency-Insensitive System with Minimal Queue Insertion,” Proc. of IEEE Asia and South Pacific Design Automation Conference, pp. 585–590, Jan. 2011.

5. Juinn-Dar Huang, Yi-Hang Chen, and Wan-Hsien Lin, “Performance-Optimal Behavioral Synthesis with Degenerable Compound Functional Units,” Proc. of IEEE International Symposium on VLSI Design, Automation, and Test, pp. 337–340, Apr. 2011. (Best Paper Candidate)

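In terms of patents, the following domestic and international patents were granted:

1. 黃俊達, 林步青, 李耿維, and 周景揚, “Arbiter with fine-grained bandwidth control and arbitration method thereof,” R.O.C. (Taiwan) Patent No. I332615, Nov. 1, 2010.

2. Juinn-Dar Huang and Chia-I Chen, “Dynamical sequentially-controlled low-power multiplexer device,” US 7881241 B2, Feb. 2011.

3. 黃俊達 and 陳嘉怡, “Low-power dynamic sequentially-controlled multiplexer,” R.O.C. (Taiwan) Patent No. I342670, May 21, 2011.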

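The following patent applications are pending:

1. 黃俊達, 呂智宏, 林步青, and 周景揚, “Delay optimization of compressor trees for LUT-based FPGAs.”

2. 黃俊達, 王毓翔, 林步青, and 周景揚, “Parameterizable pipelined fast Fourier transform hardware generator.”

3. 黃俊達, 呂智宏, 林步青, and 周景揚, “Delay optimal compressor tree synthesis for LUT-based FPGAs.”

Compared with the goals set in our original project proposal, we assess this year’s project completion as quite good, with particularly strong results in academic publications. (The expected goal was to publish one full-length international journal paper and three international conference papers.)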
