Extension for Read Port Limitation

Chapter 3 Proposed Algorithm

3.4 Extension for Read Port Limitation

As shown in Fig. 8, combining the read port limitation with scheduling and binding can avoid the access conflicts. This section gives an extended algorithm that takes the read port limitation into account.

In IIC Refinement, a secondary gain of swaps, defined as the decreased number of access conflicts of all islands at all csteps, is added. The number of access conflicts of an island at a cstep is calculated by demanded variables on that island at the cstep minus the number of read ports that a register file has. In Fig. 8, for example, the secondary gain of (6, 7) is one because there is a conflict (three demanded variables minus two read ports) at cstep 4 on island A before the swap (i.e., Fig. 8(a)), but there is no conflict after the swap (i.e., Fig. 8(b)).

Meanwhile, the original gain (i.e., the reduced number of IICs) is called the primary gain.

The second step of IIC refinement is modified as follows: find a swap pair with the largest primary gain from FSPS; if there are many pairs with the same largest primary gain, choose the pair with the largest secondary gain. By means of secondary gain, read port limitation is well complied during scheduling and binding, so the access conflicts can be minimized.

In Data Detouring, only the paths which would not produce any access conflicts of read ports are considered while building detouring graphs. As a result, Data Detouring does not inject any access conflicts.

Chapter 4 Experimental Results

The proposed algorithm is implemented in C++/Linux environment, and all experiments are conducted on a workstation with an Intel Xeon 3.2GHz CPU and 4GB RAM. The test cases are from different benchmark sets [24]–[26], which are frequently used in the high-level synthesis field. The basic information of these test cases (DFGs) is shown in Table 2. The first three columns are the names, number of nodes and number of edges of the DFGs respectively, and the last one is the latency obtained by ASAP scheduling with unlimited resources. For fair and comprehensive comparison, two different synthesis flows are presented, as shown in Fig.

16. Given an input DFG and a resource constraint, list scheduling is first performed to provide an initial scheduling result for both flows. Flow1 implements the approach proposed in [9];

Flow2 applies the algorithm proposed in this work.

Two configurations are considered in our experiments – synthesis is performed without/with a resource constraint in Configuration 1/2, respectively. In the first configuration, the number of islands is set as the minimum number that still guarantees the synthesis outcome with the minimum latency indicated in Table 2. In Configuration 2, the number of islands is reduced by half as:

The results of the configuration 1 and 2 are shown in Table 3 and Table 4 respectively. The numbers of islands in Config. 1

numbers of islands in Config. 2 =

⎢ ⎥

⎣ ⎦ (5)

second column is the number of the given island, and the third one is the latency of DFGs after list scheduling is applied; the fourth and fifth ones are the number of IICs by Flow1 and Flow2 respectively, and the sixth one is the percentage of reduction in terms of the number of IICs.

The experimental results of [9] showed that [9] was able to produce good solutions, but there is still room for improvement. The proposed algorithm can reduce the number of IICs on average by 21.1% without resource constraints (i.e., configuration 1). When the resource constraints (i.e., configuration 2) are applied, the results show that the number of IICs can still be reduced by 24.5% on average. Furthermore, the number of nodes of DFGs ranges from 40 up to 500, so the proposed algorithm remains consistent in the size of DFGs and the number of islands.

Moreover, the number of read ports is unlimited in the above experiments. Thus another experiment is conducted under read port number limitation, as shown in Fig. 17. A post-processing is added to remove the access conflicts. When access conflicts occur at a cstep, the operation with the smallest ALAP value is postponed one cstep, and the scheduling has to be modified to maintain data dependency.

The experiments are also conducted in the two different configurations, respectively. The results are shown in Table 4 and 5. The third column is the latency of DFGs after list scheduling, and the fourth and fifth ones are the latency by Flow3 and Flow4 after the post-processing; the sixth and seventh ones are the number of IICs by Flow3 and Flow4.

The number of csteps of Flow3 increases by 12% on average because of the access conflicts, whereas that of Flow4 remains the same because the proposed algorithm integrates the read port limitation into scheduling and binding. The percentage of IIC reduction of Flow4 remains consistent, so the IIC reduction is not a tradeoff with the consideration of read port limitation.

The experimental results clearly demonstrate that our algorithm outperforms the previous

art both with/without read port limitation. We believe that the advantage comes from the joint effects of the iterative binding-then-rescheduling scheme, the data-detouring process utilizing bubbles and the consideration of read port conflict.

Table 2: The basic information of benchmarks

Test case #nodes #edges ASAP latency

fir2 40 39 11

fir1 44 43 11

lee 49 62 9

cosine 82 91 8

honda 105 104 15

wribmp 106 88 7

dir 127 126 15

chem 342 327 15

fft16 414 672 14

u5ml 564 557 26

Fig. 16: Experimental flows w/o read port limitation.

Fig. 17: Experimental flows w/ read port limitation.

Table 3: Experimental results of configuration 1 (w/o read port limitation).

#IICs Test case #islands #csteps

Flow1 (1) Flow2 (2)

Table 4: Experimental results of configuration 2 (w/o read port limitation).

#IICs Test case #islands #csteps

Flow1 (3) Flow2 (4)

Table 5: Experimental results of configuration 3, #read_ports = 2.

Table 6: Experimental results of configuration 4, #read_ports = 2.

#csteps #IICs

Chapter 5 Conclusion

The number of IICs on DRFM is used as the metric for QoR in early design phases because it is highly correlated with the area and performance of designs. In this work, we have proposed a two-phase resource-constrained synthesis algorithm for IIC minimization targeting DRFM. The iterative binding-then-rescheduling procedure is first performed. Island Assignment maps operations onto islands, and a better result can be derived because the solution space is expanded by IIC Refinement. Moreover, the read port limitation is also considered in this work. Next, data detouring is applied for further elimination of IICs. The experimental results indicate that the proposed algorithm reduces the number of IICs by 24%

on average as compared to the prior art.

References

[1] International Technology Roadmap for Semiconductors. Semiconductor Industry Association, 2005.

[2] D. Matzke, “Will physical scalability sabotage performance gains?” IEEE Computer, vol.20, pp. 37–39, 1997.

[3] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Coping with latency in SOC design,”

IEEE Micro, vol. 22, pp. 24–35, 2002.

[4] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, “Interconnect-power dissipation in a microprocessor,” Proc. Int’l Workshop System Level Interconnect Prediction, pp. 7–13, 2004.

[5] A. Singh, G. Parthasarathy, and M. Marek-Sadowska, “Efficient circuit clustering for area and power reduction in FPGAs,” ACM Trans. Design Automation of Electronics Systems, vol. 7, no. 4, pp. 643–663, Oct. 2002.

[6] E. Kusse and J. Rabaey, “Low-energy embedded FPGA structures,” Proc. Int’l Symp.

Low Power Electronics and Design, pp. 155–160, 1998.

[7] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D.

dissertation, Stanford Univ., Stanford, CA, 1984.

[8] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” Proc. Int’l Conf.

Computer Aided Design, pp. 309–315, 1999.

[9] J. Cong, Y. Fan, and W. Jiang, “Platform-based resource binding using a distributed register-file microarchitecture,” Proc. Int’l Conf. Computer Aided Design, pp. 709–715,

2006.

[10] K. Lim, Y. Kim, and T. Kim, “Interconnect and communication synthesis for distributed register-file microarchitecture,” Proc. Design Automation Conf., pp. 765–770, 2007.

[11] D. Kim, J. Jung, S. Lee, J. Jeon, and K. Choi, “Behavior-to-placed RTL synthesis with performance-driven placement,” Proc. Int’l Conf. Computer Aided Design, pp. 320–325, 2001.

[12] J. Jeon, D. Kim, D. Shin, and K. Choi, “High-level synthesis under multi-cycle interconnect delay,” Proc. Asia South Pacific Design Automation Conf., pp. 662–667, 2001.

[13] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip multicycle communication,” IEEE Trans. Computer-Aided Design Integrated Circuits and Systems, vol. 23, no. 4, pp. 550–564, Apr. 2004.

[14] C.-I Chen and J.-D. Huang, “CriAS: A performance-driven criticality-aware synthesis flow for on-chip multicycle communication architecture,” Proc. Asia and South Pacific Design Automation Conf., pp. 67–72, Jan. 2009.

[15] S.-H. Huang, C.-H. Chiang, and C.-H. Cheng, “Three-dimension scheduling under multi-cycle interconnect communications,” IEICE Electronics Express, vol. 2, no. 4 pp.108–114, Feb. 2005.

[16] J. Cong, Y. Fan, and Z. Zhang, “Architecture-level synthesis for automatic interconnect pipelining,” Proc. Design Automation Conf., pp. 602–607, 2004.

[17] W.-S. Huang, Y.-R. Hong, J.-D. Huang, and Y.-S. Huang, “A multicycle communication architecture and synthesis flow for global interconnect resource sharing,” Proc. Asia and South Pacific Design Automation Conf., pp. 16–21, 2008.

[18] Y.-J. Hong, Y.-S. Huang, and J.-D. Huang, “Simultaneous data transfer routing and scheduling for interconnect minimization in multicycle communication architecture,”

Proc. Asia and South Pacific Design Automation Conf., pp. 19–24, Jan. 2009.

[19] A. Ohchi, N. Togawa, M. Yanagisawa, and T. Hothuki, “High-level synthesis algorithms with floorplanning for distributed/shared-register architectures,” Proc. Int’l Symp. VLSI Design, Automation and Test, pp. 164–167, 2008.

[20] S. Gao, K. Seto, S. Komatsu, and M. Fujita, "Pipeline scheduling for array based reconfigurable architectures considering interconnect delays," Proc. Int’l Conf.

Field-Programmable Technology, pp. 137–144, Dec. 2005.

[21] Altera website. [Online]. Available: http://www.altera.com [22] Xilinx website. [Online]. Available: http://www.xilinx.com

[23] B. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell System Technical Journal, pp. 291–307, Feb. 1970.

[24] MCAS: multicycle architectural synthesis system. [Online]. Available:

http://cadlab.cs.ucla.edu/software_release/mcas/

[25] ExPRESS group. [Online]. Available: http://express.ece.ucsb.edu/

[26] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to algorithms, 2nd edition, the MIT press, 2001.

[27] D. Herrmann and R. Ernst, “Improved interconnect sharing by identity operation insertion,” Proc. IEEE/ACM Int’l Conf. Computer Aided Design, pp. 489–493, 1999.

[28] M. Potkonjak and S. Dey, “Optimizing resource utilization and testability using hot potato techniques,” Proc. Design Automation Conf., pp. 201–205, 1994.

[29] C.-I Chen, Y.-T. Lin, W.-L. Hsu, and J.-D. Huang, "Communication synthesis for interconnect minimization targeting distributed register-file microarchitecture," Proc.

Electronic Technology Symposium, Jun. 2009.

[30] G. De Micheli, Synthesis and optimization of digital circuits, McGraw-Hill, 1994.

在文檔中分散式暫存器檔案架構之資料傳輸合成 (頁 31-41)