Read Port Limitation - Motivational Examples

Chapter 2 Related Works and Motivations

2.4 Motivational Examples

2.4.3 Read Port Limitation

Incorporating the read port limitation into scheduling and binding can avoid the increase in DFG latency caused by read port access conflicts. Fig. 8(a) shows a DFG that was scheduled and bound without considering the number of read ports. Assume that each local register file has only two read ports. An access conflict then occurs at cstep 4, because three variables, produced by v1, v2 and v3, are read from island A while its register file has only two read ports. One of the read accesses made by v4, v7 or v11 has to be postponed, which increases the latency of the DFG to five. The conflict can be avoided if the read port limitation is taken into account during scheduling and binding, as shown in Fig. 8(b), where v6 and v7 are scheduled at cstep 4 and cstep 3, respectively.

The two solutions in Fig. 8(a) and (b) have the same number of IICs. Consequently, the access conflict can be avoided while maintaining the minimized number of IICs.

Fig. 7(a): The scheduled and bound DFG with bubbles and (b): the scheduled and bound DFG after data detouring.

Fig. 8(a): A scheduled and bound DFG and (b): the same DFG with another scheduling.

Chapter 3 Proposed Algorithm

3.1 Overview

The problem is formulated as follows: given a DFG and a resource constraint (the number of islands), obtain a scheduled and bound DFG with minimized latency and a minimized number of required IICs.

The overall flow of the proposed method is shown in Fig. 9. Given a DFG, list scheduling is first performed to obtain an initial schedule, followed by an iterative cstep-by-cstep binding-then-rescheduling process. In each iteration, two procedures, island assignment (binding) and IIC refinement (rescheduling and rebinding), are applied consecutively. Island assignment in this work is similar to the horizontal assignment adopted in [9]: it is formulated as a minimum-weight bipartite matching problem, where the weight on an edge represents the number of extra IICs induced by the corresponding matching. However, that algorithm does not allow rescheduling and generally produces a locally optimal solution. Hence, an IIC refinement process is proposed to search for a better result in the solution space expanded by rescheduling; details are described in Section 3.2. After the iterative phase, data detouring is conducted for further IIC reduction; details are given in Section 3.3. Sections 3.2 and 3.3 assume that the number of read ports of the register files is unlimited, and Section 3.4 explains how to integrate the read port limitation into the flow. In the end, a scheduled and bound DFG with minimized IICs is derived.

Fig. 9: The overall flow of the proposed algorithm.
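To make the matching formulation of island assignment concrete, the following minimal sketch binds the operations of one cstep to distinct islands so that the total number of extra IICs is smallest. The function name `assign_islands`, the weight matrix and the brute-force search are assumptions made for illustration only; they are not taken from [9] or from the proposed implementation, which may use any minimum-weight matching algorithm.

```python
from itertools import permutations

def assign_islands(extra_iics):
    """extra_iics[op][island] = extra IICs induced if `op` is bound to `island`.

    Returns (assignment, total) where assignment[op] is the chosen island.
    Brute force over all one-to-one bindings: only suitable for tiny examples.
    """
    n_ops = len(extra_iics)
    n_islands = len(extra_iics[0])
    if n_ops > n_islands:
        raise ValueError("each operation needs its own island in one cstep")
    best, best_cost = None, None
    for perm in permutations(range(n_islands), n_ops):
        cost = sum(extra_iics[op][isl] for op, isl in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

# Hypothetical edge weights for 3 operations of one cstep and 3 islands.
weights = [
    [2, 0, 1],
    [1, 1, 0],
    [0, 2, 2],
]
assignment, total = assign_islands(weights)
print(assignment, total)
```

For realistic sizes, a polynomial-time method such as the Hungarian algorithm would replace the exhaustive search; the objective stays the same.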

3.2 IIC Refinement

As mentioned above, the algorithm for island assignment generally leads to a locally optimized solution. However, further improvement can still be achieved by allowing certain operation rescheduling and rebinding, as depicted in Fig. 6(b) and Fig. 6(c), as long as the data dependency is still intact.

The proposed IIC refinement process is based on the Kernighan-Lin (KL) algorithm [23], which is widely used in partitioning-related problems. Within the process, nodes and bubbles are swapped to minimize IICs. A swap can be made between two nodes or between a node and a bubble. A swap is considered feasible only under two conditions: (i) the nodes must be unlocked, and (ii) data dependency must be preserved after swapping. For example, in Fig. 10(a), the feasible swap candidates for node 5 are {node 1, node 7, node a}. A feasible swap pair of node u and node/bubble v is denoted as (u, v). The gain of a swap pair is defined as the number of IICs it removes, i.e., the difference between the numbers of IICs before and after the swap. The gain of a swap pair (u, v) is denoted as g_u,v. All feasible swap pairs are collected into the feasible swap pair set (FSPS). After an actual swap is performed, FSPS and the gains of its swap pairs are updated accordingly. The key steps of IIC refinement are as follows:

(i) Set all operation nodes unlocked.

(ii) Find a swap pair with the largest gain from FSPS.

(iii) Swap the pair then lock the operation node.

(iv) Update FSPS and recalculate the gains of pairs in FSPS.

(v) Repeat (ii) to (iv) until FSPS is empty.

(vi) Let k be the number of swaps that maximizes the partial gain sum. If that sum is positive, keep the first k swaps, undo the rest, and go to (i).

(vii) Otherwise, terminate IIC refinement.
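The control flow of steps (i) to (vii) can be sketched as below. This is a simplified illustration under stated assumptions, not the proposed implementation: the callables `cost`, `feasible_pairs` and `apply_swap` are hypothetical interfaces, only the first element of a swapped pair is locked, and the toy problem at the end counts edges that cross islands instead of real IICs and ignores csteps, bubbles and data dependency.

```python
def refinement_pass(state, cost, feasible_pairs, apply_swap):
    """Steps (i)-(vi): return (state after the best prefix of swaps, its gain)."""
    locked = set()                              # (i) all nodes start unlocked
    current, history = state, []
    while True:
        pairs = feasible_pairs(current, locked)
        if not pairs:                           # (v) stop when FSPS is empty
            break
        base = cost(current)
        # (ii) pick the pair with the largest gain
        gain, pair = max(((base - cost(apply_swap(current, p)), p) for p in pairs),
                         key=lambda gp: gp[0])
        current = apply_swap(current, pair)     # (iii) swap, then lock node u
        locked.add(pair[0])
        history.append((gain, current))         # (iv) gains are recomputed next round
    best_k, best_sum, running = 0, 0, 0
    for k, (gain, _) in enumerate(history, start=1):
        running += gain
        if running > best_sum:                  # (vi) smallest k with the largest positive sum
            best_k, best_sum = k, running
    return (history[best_k - 1][1] if best_k else state), best_sum

def refine(state, cost, feasible_pairs, apply_swap):
    while True:
        state, gained = refinement_pass(state, cost, feasible_pairs, apply_swap)
        if gained <= 0:                         # (vii) no improving prefix: terminate
            return state

# Toy stand-in problem (hypothetical): 4 nodes, 2 islands, cost = crossing edges.
EDGES = [(0, 1), (2, 3)]

def cut_size(assign):
    return sum(assign[u] != assign[v] for u, v in EDGES)

def unlocked_pairs(assign, locked):
    free = [n for n in range(len(assign)) if n not in locked]
    return [(u, v) for i, u in enumerate(free) for v in free[i + 1:]]

def swap_islands(assign, pair):
    u, v = pair
    new = list(assign)
    new[u], new[v] = new[v], new[u]
    return tuple(new)

result = refine(("A", "B", "B", "A"), cut_size, unlocked_pairs, swap_islands)
print(result, cut_size(result))
```

Swaps with zero or negative gain are still executed inside a pass; this is what lets the procedure climb out of a local optimum before the best prefix is chosen.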

For example, a partially scheduled and bound DFG with 4 IICs is shown in Fig. 10(a). Initially, the gains of all feasible swap pairs in FSPS are calculated as follows:

g_1,5 = 0, g_1,7 = -1, g_2,a = -1, g_2,8 = -1, g_3,6 = 0, g_3,9 = -2, g_4,b = 0, g_4,c = -1, g_5,7 = -2, g_5,a = 0, g_6,9 = -1, g_8,a = -1, g_9,c = 0, g_9,b = 1

The swap pair (9, b) has the largest gain, so it is selected and swapped, and node 9 is locked after the swap.

The process continues until FSPS is empty. Table 1 shows the gain and the partial gain sum of the eight consecutive feasible swaps in this iteration. The partial gain sum first reaches its maximum of 2 at the third swap, so only the first three swaps, (9, b), (1, 5) and (2, a), are kept. The resulting DFG at the end of this iteration is shown in Fig. 10(b); it requires only 2 IICs instead of the 4 in Fig. 10(a).

Fig. 10(a): The DFG at the beginning of the iteration and (b): the DFG at the end of the iteration.

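Table 1: Gains and partial gain sums in an iteration

| n-th swap | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Swapped pair | (9, b) | (1, 5) | (2, a) | (5, a) | (7, a) | (4, c) | (3, b) | (6, b) |
| Gain | 1 | 0 | 1 | 0 | -1 | -1 | 1 | -2 |
| Partial gain sum | 1 | 1 | 2 | 2 | 1 | 0 | 1 | -1 |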
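The partial gain sums of Table 1 and the choice of k in step (vi) can be checked with a few lines. Taking the smallest k that attains the maximum is an assumption here; it matches the three swaps kept in the example, although the sum is also 2 after the fourth swap.

```python
from itertools import accumulate

# Swaps and gains copied from Table 1.
swaps = [(9, "b"), (1, 5), (2, "a"), (5, "a"), (7, "a"), (4, "c"), (3, "b"), (6, "b")]
gains = [1, 0, 1, 0, -1, -1, 1, -2]

partial_sums = list(accumulate(gains))
best_sum = max(partial_sums)
# index() returns the first position of the maximum, i.e. the smallest such k.
k = partial_sums.index(best_sum) + 1 if best_sum > 0 else 0
kept = swaps[:k]
print(partial_sums, k, kept)
```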

3.3 Data Detouring

As shown previously in Fig. 7, data detouring can further reduce the number of IICs. However, not every IIT can be detoured. Only an IIT with slack greater than zero, called a splittable IIT, can be detoured. The slack of an IIT is defined in (1), where T(v_i) is the cstep in which v_i is scheduled.

$$\mathrm{slack}(v_1, v_2) = T(v_2) - T(v_1) - 1 \tag{1}$$

In contrast, an IIT with zero slack is called a non-splittable IIT. As shown in Fig. 11(a), IIT_1,7 and IIT_2,8 are splittable, while IIT_6,2 and IIT_3,11 are non-splittable. A splittable IIT can be detoured through a series of bubbles. For instance, IIT_1,7 in Fig. 11(a) can be detoured through IIT_1,c and IIT_c,7, as shown in Fig. 11(b).

Fig. 11(a): The splittable and non-splittable IITs, and (b): the resultant DFG after data detouring.

Fig. 12 outlines the data detouring procedure. Since a non-splittable IIT cannot be detoured, an IIC is always required for it. Hence, the objective of data detouring is to reroute certain splittable IITs so that the number of IICs can be further reduced. However, successfully detouring an IIT does not necessarily remove an IIC. As discussed previously, an IIC can be shared by several IITs, and it cannot be safely removed unless all the IITs utilizing it are successfully detoured. Therefore, to eliminate an IIC, all IITs utilizing it must be identified first, as indicated in Fig. 12.

Fig. 13 gives a heuristic policy to determine which IIC an IIT actually utilizes. If there are multiple IICs, this mapping strategy tries to assign fewer IITs, and those with larger slack, to the later IICs, because an IIC is more likely to be removed when fewer IITs utilize it or when those IITs have larger slack. As shown in Fig. 11(a), nine IITs are mapped onto six IICs. For example, IIC¹_B,C contains IIT_6,10 and IIT_7,11, and IIC²_B,C contains IIT_5,10, where IIC^i_A,B denotes the i-th inter-island connection between islands I_A and I_B. After all the IITs are mapped onto IICs, two kinds of IICs are identified: an IIC containing at least one non-splittable IIT is a hard IIC, and an IIC containing no non-splittable IIT is a soft IIC. In the above example, IIC¹_B,C is a hard IIC because IIT_6,10 is non-splittable, while IIC²_B,C is a soft IIC since it contains only the splittable IIT_5,10. A hard IIC cannot be removed via data detouring because of its non-splittable IITs. In contrast, a soft IIC can be eliminated if all the IITs utilizing it are successfully detoured. For example, there are two soft IICs in Fig. 11(a): IIC¹_A,B can be removed if IIT_1,7 and IIT_2,8 can both be detoured, and IIC²_B,C can be removed if IIT_5,10 can be detoured. In addition, an IIC is fixed if it is inherently a hard IIC or if it is a soft IIC with at least one IIT that cannot be detoured.

Fig. 12: Two key steps of the data detouring procedure.
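The slack of an IIT and the hard/soft distinction can be sketched as follows. The cstep values in `T` are invented so that the slacks agree with the statements in the text (for instance, IIT_6,10 has zero slack); they are not read from Fig. 11, and the function and key names are hypothetical.

```python
def slack(T, u, v):
    """Slack of the IIT from node u to node v: T(v) - T(u) - 1."""
    return T[v] - T[u] - 1

def classify_iics(T, iics):
    """Label an IIC 'hard' if any of its IITs has zero slack, else 'soft'."""
    return {
        name: "hard" if any(slack(T, u, v) == 0 for u, v in iits) else "soft"
        for name, iits in iics.items()
    }

# Hypothetical csteps (assumption, not taken from Fig. 11).
T = {1: 1, 5: 1, 6: 2, 2: 3, 7: 3, 10: 3, 8: 5, 11: 5}
iics = {
    "IIC1_A,B": [(1, 7), (2, 8)],
    "IIC1_B,C": [(6, 10), (7, 11)],
    "IIC2_B,C": [(5, 10)],
}
kinds = classify_iics(T, iics)
print(kinds)
```

A soft IIC found this way is only a candidate for removal; it still becomes fixed if one of its IITs cannot actually be detoured.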

An iterative edge splitting (i.e., IIT detouring) procedure is proposed to eliminate soft IICs, as shown in Fig. 14. As mentioned previously, bubbles are used to perform IIT detouring. Since the number of bubbles is fixed, the fewer bubbles the current IIT consumes, the more bubbles remain for detouring later IITs. Furthermore, some bubbles can be used to detour many IITs, while others can be used by only a few. For example, in Fig. 11(a), bubble c can be used to detour IIT_1,7 or IIT_5,10, but bubble a can only be used by IIT_5,10. Hence, the overall objective of the proposed iterative edge splitting procedure is to detour the IITs of a given IIC using as few bubbles, and as unpopular bubbles, as possible. First, a detouring graph is created for each IIT belonging to some soft IIC. It enumerates all possible detouring paths via the existing fixed IICs. The detouring graphs of the example in Fig. 11(a) are shown in Fig. 15(a), 15(b) and 15(c). A weight is associated with each node and edge to indicate its importance and popularity.

For every source-destination island pair (I_A, I_B):

Sort all IITs (v_i, v_j), where v_i ∈ I_A and v_j ∈ I_B, in:

(i) Increasing order of T(v_j) as the primary key, and

(ii) Decreasing order of T(v_i) as the secondary key.

Map IITs onto IICs in the order generated above:

(i) Attempt to map an IIT onto the first IIC.

(ii) Attempt to map an IIT onto the second IIC only when the first one is occupied, and so forth.

Fig. 13: Mapping IITs onto IICs.
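The policy of Fig. 13 amounts to a sort followed by a first-fit assignment, which the sketch below illustrates. The text does not spell out when an IIC counts as occupied, so the default `conflicts` rule (two IITs whose destination nodes share a cstep cannot share an IIC) is purely an assumption, as are the function name, the cstep values and the island labels.

```python
from collections import defaultdict

def map_iits_to_iics(iits, T, island, conflicts=None):
    """Return {(src_island, dst_island): [IITs on IIC 1, IITs on IIC 2, ...]}."""
    if conflicts is None:
        # ASSUMPTION (not from the text): two IITs cannot share an IIC
        # when their destination nodes are scheduled in the same cstep.
        conflicts = lambda a, b: T[a[1]] == T[b[1]]
    groups = defaultdict(list)
    for u, v in iits:
        groups[(island[u], island[v])].append((u, v))
    mapping = {}
    for pair, group in groups.items():
        # Primary key: increasing T(v_j); secondary key: decreasing T(v_i).
        group.sort(key=lambda e: (T[e[1]], -T[e[0]]))
        channels = []
        for iit in group:
            for channel in channels:            # first fit, starting from the first IIC
                if not any(conflicts(iit, other) for other in channel):
                    channel.append(iit)
                    break
            else:
                channels.append([iit])          # every existing IIC is occupied
        mapping[pair] = channels
    return mapping

# Hypothetical csteps and island binding (not read from Fig. 11).
T = {5: 1, 6: 2, 7: 3, 10: 3, 11: 5}
island = {5: "B", 6: "B", 7: "B", 10: "C", 11: "C"}
mapping = map_iits_to_iics([(5, 10), (6, 10), (7, 11)], T, island)
print(mapping)
```

With these invented values the result agrees with the example in the text: the first IIC between islands B and C carries IIT_6,10 and IIT_7,11, and the second carries IIT_5,10. Because of the secondary key, IITs with smaller slack are placed first and the larger-slack ones drift to the later IICs.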

The weight of a source node of an IIT is defined as (2) to reflect its importance. Then weights of other nodes and edges are computed in topological order by (3) and (4).

Fig. 14: The iterative edge splitting procedure.

Fig. 15: The detouring graphs for (a) IIT_1,7 , (b) IIT_2,8 and (c) IIT_5,10.

The bubble weights are then obtained by summing up the weights in all detouring graphs. In the example of Fig. 15, the weights of nodes a, b, c and d are 0.5, 0, 1 and 0.5, respectively.

After all bubble weights are available, the path with the minimum bubble weight is identified and used to detour the given IIT. The minimum-bubble-weight problem can be formulated as a shortest path problem and solved accordingly. Once the given IIT is detoured, certain detouring graphs must be updated, since some bubbles have been consumed and are no longer available. Because a soft IIC containing fewer IITs is easier to detour, the soft IICs containing fewer IITs are processed earlier. For example, IIC²_B,C is processed before IIC¹_A,B.
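As an illustration of the shortest path formulation, the sketch below runs Dijkstra's algorithm on a small detouring graph in which the cost of a path is the sum of the weights of the bubbles it visits. The graph shape is hypothetical (it is not Fig. 15), the bubble weights are the ones quoted for the example, and the function name is an assumption.

```python
import heapq

def min_bubble_weight_path(graph, weight, src, dst):
    """Dijkstra where a path costs the sum of the weights of the bubbles
    it passes through; src and dst themselves cost nothing."""
    heap = [(0.0, [src])]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt in graph.get(node, []):
            if nxt not in settled:
                step = 0.0 if nxt == dst else weight[nxt]
                heapq.heappush(heap, (cost + step, path + [nxt]))
    return None                                 # the IIT cannot be detoured

# Hypothetical detouring graph for one IIT from v5 to v10.
graph = {"v5": ["a", "c"], "a": ["v10"], "c": ["v10"]}
weight = {"a": 0.5, "b": 0.0, "c": 1.0, "d": 0.5}
best = min_bubble_weight_path(graph, weight, "v5", "v10")
print(best)
```

Dijkstra's algorithm applies because bubble weights are non-negative. A return value of `None` corresponds to an IIT that cannot be split, which makes its IIC fixed.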

Overall, the proposed procedure described in Fig. 14 attempts to detour the IITs of the target soft IIC in increasing order of their slacks. If any IIT cannot be split, all previously split IITs are restored and the target IIC is marked as a fixed IIC. Conversely, if all IITs in the target IIC are successfully split, the IIC can be safely removed. In either case, the procedure proceeds to the next candidate soft IIC. Note that the data detouring procedure never increases the number of IICs. In the worst case, all soft IICs become fixed IICs and no IIC reduction is achieved. Finally, the resultant DFG after data detouring is shown in Fig. 11(b), where the number of IICs is reduced from 6 to 4.

$$\text{weight of a source node} = \frac{1}{\text{number of edges this IIC contains}} \tag{2}$$

3.4 Extension for Read Port Limitation

As shown in Fig. 8, combining the read port limitation with scheduling and binding can avoid the access conflicts. This section gives an extended algorithm that takes the read port limitation into account.

In IIC Refinement, a secondary gain of a swap is added, defined as the decrease in the total number of access conflicts over all islands and all csteps. The number of access conflicts of an island at a cstep is the number of variables demanded from that island at that cstep minus the number of read ports of its register file, or zero if the demand does not exceed the number of read ports. In Fig. 8, for example, the secondary gain of (6, 7) is one, because there is one conflict (three demanded variables minus two read ports) at cstep 4 on island A before the swap (Fig. 8(a)), but no conflict after the swap (Fig. 8(b)).
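The conflict count and the secondary gain can be sketched as follows. The representation of read accesses as `(island, cstep, variable)` tuples, the function name, and the concrete accesses after the swap are assumptions for illustration; only the three reads at cstep 4 on island A before the swap come from the Fig. 8 example.

```python
from collections import Counter

def access_conflicts(reads, read_ports):
    """reads: iterable of (island, cstep, variable) read accesses.

    Returns the total number of access conflicts over all islands and csteps.
    A variable read several times in the same cstep is counted once (assumption).
    """
    demanded = Counter()
    for island, cstep, _var in set(reads):
        demanded[(island, cstep)] += 1
    return sum(max(0, n - read_ports) for n in demanded.values())

# Before the swap: three variables are read from island A at cstep 4.
before = [("A", 4, "v1"), ("A", 4, "v2"), ("A", 4, "v3")]
# After the swap (hypothetical distribution): one read moves to cstep 3.
after = [("A", 4, "v1"), ("A", 4, "v2"), ("A", 3, "v3")]

secondary_gain = access_conflicts(before, 2) - access_conflicts(after, 2)
print(secondary_gain)
```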

Meanwhile, the original gain (i.e., the reduced number of IICs) is called the primary gain.

The second step of IIC refinement is modified as follows: find the swap pair with the largest primary gain in FSPS; if several pairs share the largest primary gain, choose the one with the largest secondary gain. With the secondary gain, the read port limitation is respected during scheduling and binding, so access conflicts can be minimized.
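The modified selection rule is a lexicographic comparison of the two gains, which the short sketch below expresses with tuple ordering. The gain values and the function name are hypothetical.

```python
def select_swap(fsps):
    """fsps: {pair: (primary_gain, secondary_gain)}; returns the chosen pair.

    Tuples compare lexicographically, so the secondary gain only breaks
    ties among pairs with the largest primary gain.
    """
    return max(fsps, key=lambda pair: fsps[pair])

# Hypothetical gains: (IICs removed, access conflicts removed).
fsps = {
    (1, 5): (0, 0),
    (6, 7): (0, 1),
    (3, 9): (-2, 2),
}
chosen = select_swap(fsps)
print(chosen)
```

Note that (3, 9) is not chosen despite its larger secondary gain, because the primary gain always takes precedence.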

In Data Detouring, only the paths that would not produce any read port access conflicts are considered while building the detouring graphs. As a result, Data Detouring does not introduce any access conflicts.

Chapter 4 Experimental Results

The proposed algorithm is implemented in C++ in a Linux environment, and all experiments are conducted on a workstation with an Intel Xeon 3.2 GHz CPU and 4 GB of RAM. The test cases are taken from different benchmark sets [24]–[26] that are frequently used in the high-level synthesis field. The basic information of these test cases (DFGs) is shown in Table 2. The first three columns give the name, the number of nodes and the number of edges of each DFG, and the last column gives the latency obtained by ASAP scheduling with unlimited resources. For a fair and comprehensive comparison, two different synthesis flows are presented, as shown in Fig. 16. Given an input DFG and a resource constraint, list scheduling is first performed to provide an initial scheduling result for both flows. Flow1 implements the approach proposed in [9]; Flow2 applies the algorithm proposed in this work.

Two configurations are considered in our experiments: synthesis is performed without a resource constraint in Configuration 1 and with one in Configuration 2. In Configuration 1, the number of islands is set to the minimum number that still guarantees the minimum latency indicated in Table 2. In Configuration 2, the number of islands is reduced by half:

$$\text{number of islands in Config. 2} = \left\lfloor \frac{\text{number of islands in Config. 1}}{2} \right\rfloor \tag{5}$$

The results of Configurations 1 and 2 are shown in Table 3 and Table 4, respectively. The second column is the number of islands given, and the third is the latency of the DFG after list scheduling is applied; the fourth and fifth are the numbers of IICs produced by Flow1 and Flow2, respectively, and the sixth is the percentage reduction in the number of IICs.

The experimental results of [9] showed that it was able to produce good solutions, but there is still room for improvement. The proposed algorithm reduces the number of IICs by 21.1% on average without resource constraints (i.e., Configuration 1). When resource constraints are applied (i.e., Configuration 2), the number of IICs is still reduced by 24.5% on average. Furthermore, the number of nodes of the DFGs ranges from 40 to 564 (Table 2), so the improvement is consistent across DFG sizes and numbers of islands.

Moreover, the number of read ports is unlimited in the above experiments. Therefore, another experiment is conducted under a limited number of read ports, as shown in Fig. 17. A post-processing step is added to remove access conflicts: when access conflicts occur at a cstep, the operation with the smallest ALAP value is postponed by one cstep, and the schedule is modified to maintain data dependency.

These experiments are also conducted in two different configurations. The results are shown in Table 5 and Table 6. The third column is the latency of the DFGs after list scheduling, and the fourth and fifth are the latencies of Flow3 and Flow4 after the post-processing; the sixth and seventh are the numbers of IICs of Flow3 and Flow4.

The number of csteps of Flow3 increases by 12% on average because of access conflicts, whereas that of Flow4 remains the same because the proposed algorithm integrates the read port limitation into scheduling and binding. The percentage of IIC reduction achieved by Flow4 remains consistent, so considering the read port limitation does not come at the expense of IIC reduction.

The experimental results clearly demonstrate that our algorithm outperforms the prior art both with and without the read port limitation. We believe that the advantage comes from the joint effects of the iterative binding-then-rescheduling scheme, the data detouring process utilizing bubbles, and the consideration of read port conflicts.

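Table 2: The basic information of benchmarks

| Test case | #nodes | #edges | ASAP latency |
|---|---|---|---|
| fir2 | 40 | 39 | 11 |
| fir1 | 44 | 43 | 11 |
| lee | 49 | 62 | 9 |
| cosine | 82 | 91 | 8 |
| honda | 105 | 104 | 15 |
| wribmp | 106 | 88 | 7 |
| dir | 127 | 126 | 15 |
| chem | 342 | 327 | 15 |
| fft16 | 414 | 672 | 14 |
| u5ml | 564 | 557 | 26 |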

Fig. 16: Experimental flows w/o read port limitation.

Fig. 17: Experimental flows w/ read port limitation.

Table 3: Experimental results of configuration 1 (w/o read port limitation).

Columns: Test case, #islands, #csteps, #IICs (Flow1 (1), Flow2 (2)).

Table 4: Experimental results of configuration 2 (w/o read port limitation).

Columns: Test case, #islands, #csteps, #IICs (Flow1 (3), Flow2 (4)).

Table 5: Experimental results of configuration 3, #read_ports = 2.

Table 6: Experimental results of configuration 4, #read_ports = 2.

Columns: #csteps, #IICs.

Chapter 5 Conclusion

The number of IICs on DRFM is used as the metric for QoR in early design phases because it is highly correlated with the area and performance of designs. In this work, we have proposed a two-phase resource-constrained synthesis algorithm for IIC minimization targeting DRFM. In the first phase, the iterative binding-then-rescheduling procedure is performed: Island Assignment maps operations onto islands, and IIC Refinement derives a better result by expanding the solution space. The read port limitation is also considered in this work. In the second phase, data detouring is applied for further elimination of IICs. The experimental results indicate that the proposed algorithm reduces the number of IICs by 24% on average as compared to the prior art.

References

[1] International Technology Roadmap for Semiconductors. Semiconductor Industry Association, 2005.

[2] D. Matzke, “Will physical scalability sabotage performance gains?” IEEE Computer, vol. 20, pp. 37–39, 1997.

[3] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Coping with latency in SOC design,” IEEE Micro, vol. 22, pp. 24–35, 2002.

[4] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, “Interconnect-power dissipation in a microprocessor,” Proc. Int’l Workshop System Level Interconnect Prediction, pp. 7–13, 2004.

[5] A. Singh, G. Parthasarathy, and M. Marek-Sadowska, “Efficient circuit clustering for area and power reduction in FPGAs,” ACM Trans. Design Automation of Electronic Systems, vol. 7, no. 4, pp. 643–663, Oct. 2002.

[6] E. Kusse and J. Rabaey, “Low-energy embedded FPGA structures,” Proc. Int’l Symp. Low Power Electronics and Design, pp. 155–160, 1998.

[7] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 1984.

[8] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” Proc. Int’l Conf. Computer Aided Design, pp. 309–315, 1999.

[9] J. Cong, Y. Fan, and W. Jiang, “Platform-based resource binding using a distributed register-file microarchitecture,” Proc. Int’l Conf. Computer Aided Design, pp. 709–715, 2006.

[10] K. Lim, Y. Kim, and T. Kim, “Interconnect and communication synthesis for distributed register-file microarchitecture,” Proc. Design Automation Conf., pp. 765–770, 2007.

[11] D. Kim, J. Jung, S. Lee, J. Jeon, and K. Choi, “Behavior-to-placed RTL synthesis with performance-driven placement,” Proc. Int’l Conf. Computer Aided Design, pp. 320–325, 2001.

[12] J. Jeon, D. Kim, D. Shin, and K. Choi, “High-level synthesis under multi-cycle interconnect delay,” Proc. Asia South Pacific Design Automation Conf., pp. 662–667, 2001.

[13] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip
