IIC Refinement - Proposed Algorithm - Data Transfer Synthesis for Distributed Register File Architectures (分散式暫存器檔案架構之資料傳輸合成)

Chapter 3 Proposed Algorithm

3.2 IIC Refinement

As mentioned above, the island assignment algorithm generally yields only a locally optimal solution. Further improvement can still be achieved by allowing certain operations to be rescheduled and rebound, as depicted in Fig. 6(b) and Fig. 6(c), as long as data dependencies remain intact.

The proposed IIC refinement process is based on the Kernighan-Lin (KL) algorithm [23], which is widely used for partitioning problems. Within the process, nodes and bubbles are swapped to minimize the number of IICs. A swap can be made between two nodes or between a node and a bubble. A swap is considered feasible only if two conditions hold: (i) the nodes involved must be unlocked, and (ii) data dependencies must be preserved after the swap. For example, in Fig. 10(a), the feasible swap candidates for node 5 are {1, 7, a}. A feasible swap pair of a node u and a node or bubble v is denoted as (u, v). The gain of a swap pair (u, v), denoted g_u,v, is the number of IICs the swap eliminates, i.e., the number of IICs before the swap minus the number of IICs after it. All feasible swap pairs are collected into the feasible swap pair set (FSPS). After each actual swap, FSPS and the gains of the swap pairs are updated accordingly. The key steps of IIC refinement are as follows:

(i) Set all operation nodes unlocked.

(ii) Find a swap pair with the largest gain from FSPS.

(iii) Swap the pair, then lock the operation node.

(iv) Update FSPS and recalculate the gains of pairs in FSPS.

(v) Repeat (ii) to (iv) until FSPS is empty.

(vi) Let k be the number of swaps at which the partial gain sum of the first k swaps is largest. If this largest partial gain sum is positive, keep the first k swaps, undo the rest, and go to (i).

(vii) Otherwise, terminate IIC refinement.
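The steps above can be sketched as a generic KL-style loop. In this minimal Python sketch, the problem-specific parts (building FSPS under the two feasibility rules, computing g_u,v by counting IICs, and applying a swap) are assumed to be supplied as callbacks; the names kl_refine, best_prefix, feasible_pairs, gain, swap, and nodes_to_lock are hypothetical, and the toy cost at the bottom (number of label changes along a sequence) merely stands in for the IIC count.

```python
from itertools import accumulate
from typing import Callable, Hashable, Iterable, Sequence, TypeVar

S = TypeVar("S")                  # hypothetical state: a partial schedule/binding
Pair = tuple[Hashable, Hashable]  # swap pair (u, v)


def best_prefix(gains: Sequence[int]) -> tuple[int, int]:
    """Step (vi): smallest k maximizing the partial gain sum; (0, 0) if none is positive."""
    best_k, best_sum = 0, 0
    for k, partial in enumerate(accumulate(gains), start=1):
        if partial > best_sum:  # strict '>' keeps the smallest k on ties
            best_k, best_sum = k, partial
    return best_k, best_sum


def kl_refine(
    state: S,
    feasible_pairs: Callable[[S, set], Iterable[Pair]],  # builds FSPS
    gain: Callable[[S, Pair], int],                      # g_u,v
    swap: Callable[[S, Pair], S],                        # returns a new state
    nodes_to_lock: Callable[[Pair], Iterable[Hashable]],
) -> S:
    while True:
        locked: set = set()                  # step (i): all nodes unlocked
        current, trail, gains = state, [state], []
        while True:
            # Steps (ii)/(iv): FSPS and its gains are rebuilt for the current state.
            fsps = [(gain(current, p), p) for p in feasible_pairs(current, locked)]
            if not fsps:                     # step (v): FSPS is empty
                break
            g, pair = max(fsps, key=lambda gp: gp[0])  # step (ii): largest gain
            current = swap(current, pair)    # step (iii): swap, then lock
            locked.update(nodes_to_lock(pair))
            gains.append(g)
            trail.append(current)
        k, total = best_prefix(gains)        # step (vi)
        if total <= 0:                       # step (vii): no positive gain left
            return state
        state = trail[k]                     # keep the first k swaps, undo the rest


# Toy stand-in for the IIC count (hypothetical): number of label changes.
def changes(s: tuple) -> int:
    return sum(a != b for a, b in zip(s, s[1:]))


def toy_swap(s: tuple, p: Pair) -> tuple:
    i, j = p
    t = list(s)
    t[i], t[j] = t[j], t[i]
    return tuple(t)


def toy_pairs(s: tuple, locked: set) -> list:
    return [(i, j) for i in range(len(s)) for j in range(i + 1, len(s))
            if i not in locked and j not in locked and s[i] != s[j]]


result = kl_refine(("A", "B", "A", "B"), toy_pairs,
                   lambda s, p: changes(s) - changes(toy_swap(s, p)),
                   toy_swap, lambda p: p)
```

Applied to the gains listed in Table 1, best_prefix returns (3, 2), i.e., the first three swaps are kept. The toy run reduces the cost from 3 to 1 in its first pass, and its second pass finds no positive partial gain sum, so refinement stops.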

For example, a partially scheduled and bound DFG requiring four IICs is shown in Fig. 10(a). Initially, the gains of all feasible swap pairs in FSPS are calculated as follows:

g_1,5 = 0; g_1,7 = -1; g_2,a = -1; g_2,8 = -1; g_3,6 = 0;
g_3,9 = -2; g_4,b = 0; g_4,c = -1; g_5,7 = -2; g_5,a = 0;
g_6,9 = -1; g_8,a = -1; g_9,c = 0; g_9,b = 1

Since g_9,b = 1 is the largest gain, the swap pair (9, b) is selected, and node 9 is locked after the swap.

The process continues until FSPS is empty. Table 1 shows the gain and the partial gain sum of the eight consecutive feasible swaps in this iteration. The partial gain sum first reaches its maximum of 2 after the third swap, so only the first three swaps, (9, b), (1, 5), and (2, a), are kept and the rest are undone. The resulting DFG at the end of this iteration, shown in Fig. 10(b), requires only 2 IICs instead of the 4 in Fig. 10(a).

Fig. 10(a): The DFG at the beginning of the iteration and (b): the DFG at the end of the iteration.

Table 1: Gains and partial gain sums in an iteration

n-th swap           1       2       3       4       5       6       7       8
Swapped pair      (9, b)  (1, 5)  (2, a)  (5, a)  (7, a)  (4, c)  (3, b)  (6, b)
Gain                1       0       1       0      -1      -1       1      -2
Partial gain sum    1       1       2       2       1       0       1      -1

3.3 Data Detouring

As shown previously in Fig. 7, data detouring can further reduce the number of IICs.

However, not every IIT can be detoured. Only an IIT whose slack is greater than zero, called a splittable IIT, can be detoured. The slack of an IIT from v_1 to v_2 is defined as

$$\mathrm{slack}(v_1, v_2) = T(v_2) - T(v_1) - 1 \qquad (1)$$

where T(v_i) is the cstep in which v_i is scheduled.

Conversely, an IIT with zero slack is called a non-splittable IIT. As shown in Fig. 11(a), IIT_1,7 and IIT_2,8 are splittable, while IIT_6,2 and IIT_3,11 are non-splittable. A splittable IIT can be detoured through a series of bubbles. For instance, IIT_1,7 in Fig. 11(a) can be detoured through bubble c, i.e., replaced by IIT_1,c and IIT_c,7, as shown in Fig. 11(b).

Fig. 11(a): The splittable and non-splittable IITs, and (b): the resultant DFG after data detouring.

Fig. 12 outlines the data detouring procedure. Since a non-splittable IIT cannot be detoured, it always requires an IIC. Hence, the objective of data detouring is to reroute certain splittable IITs so that the number of IICs is further reduced. However, successfully detouring an IIT does not necessarily reduce the number of IICs: as discussed previously, an IIC can be shared by several IITs, and it cannot be safely removed unless all the IITs utilizing it are detoured. Therefore, to eliminate an IIC, all IITs utilizing it must be identified first, as indicated in Fig. 12.
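Equation (1) and the splittable test translate directly into code. In this small Python sketch, the schedule T is a hypothetical dict from node names to csteps, and detour_csteps is a hypothetical helper reflecting our reading that the slack counts the csteps strictly between the producer and the consumer, which is where an intermediate bubble would have to sit.

```python
from typing import Mapping


def slack(T: Mapping[str, int], v1: str, v2: str) -> int:
    """Equation (1): slack(v1, v2) = T(v2) - T(v1) - 1 for an IIT from v1 to v2."""
    return T[v2] - T[v1] - 1


def is_splittable(T: Mapping[str, int], v1: str, v2: str) -> bool:
    """Only an IIT with positive slack can be detoured."""
    return slack(T, v1, v2) > 0


def detour_csteps(T: Mapping[str, int], v1: str, v2: str) -> list[int]:
    """Csteps strictly between producer and consumer (assumed bubble positions)."""
    return list(range(T[v1] + 1, T[v2]))


# Hypothetical schedule: cstep of each node.
T = {"u": 1, "v": 2, "w": 4}
```

Here the IIT from u to v has zero slack and is non-splittable, while the IIT from u to w has slack 2, matching the two intermediate csteps 2 and 3.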

Fig. 13 gives a heuristic policy to determine which IIC each IIT actually utilizes. When there are multiple IICs, this mapping strategy tries to place fewer IITs, and IITs with larger slack, on the later IICs, because an IIC is more likely to be removed when fewer IITs utilize it or when those IITs have larger slack. In Fig. 11(a), nine IITs are mapped onto six IICs. For example, IIC¹_B,C contains IIT_6,10 and IIT_7,11, and IIC²_B,C contains IIT_5,10, where IICⁱ_A,B denotes the i-th inter-island connection between islands I_A and I_B. After all IITs are mapped onto IICs, two kinds of IICs are identified: an IIC containing at least one non-splittable IIT is a hard IIC, and an IIC containing no non-splittable IIT is a soft IIC. In the example above, IIC¹_B,C is a hard IIC because IIT_6,10 is non-splittable, while IIC²_B,C is a soft IIC since it contains only the splittable IIT_5,10. A hard IIC cannot be removed by data detouring because of its non-splittable IITs. In contrast, a soft IIC can be eliminated if all the IITs utilizing it are successfully detoured. For example, there are two soft IICs in Fig. 11(a): IIC¹_A,B can be removed if IIT_1,7 and IIT_2,8 can both be detoured, and IIC²_B,C can be removed if IIT_5,10 can be detoured. In addition, an IIC is called fixed if it is inherently a hard IIC, or if it is a soft IIC with at least one IIT that cannot be detoured.

Fig. 12: Two key steps of the data detouring procedure.

An iterative edge splitting (i.e., IIT detouring) procedure, shown in Fig. 14, is proposed to eliminate soft IICs. As mentioned previously, IIT detouring consumes bubbles. Since the number of bubbles is fixed, the fewer bubbles the current IIT consumes, the more bubbles remain for detouring later IITs. Furthermore, some bubbles can be used to detour many IITs, while others can be used by only a few. For example, in Fig. 11(a), bubble c can be used to detour IIT_1,7 or IIT_5,10, but bubble a can be used only by IIT_5,10. Hence, the overall objective of the proposed iterative edge splitting procedure is to detour the IITs of a given soft IIC using as few bubbles as possible, preferring less popular ones. First, a detouring graph is created for each IIT belonging to a soft IIC; it enumerates all possible detouring paths via the existing fixed IICs. The detouring graphs for the example in Fig. 11(a) are shown in Fig. 15(a), 15(b), and 15(c). A weight is associated with each node and edge to indicate its importance and popularity.

For every source-destination island pair (I_A, I_B):

Sort all IITs (v_i, v_j), where v_i ∈ I_A and v_j ∈ I_B, by:
(i) increasing order of T(v_j) as the primary key, and
(ii) decreasing order of T(v_i) as the secondary key.

Map the IITs onto IICs in the order generated above:
(i) attempt to map an IIT onto the first IIC;
(ii) attempt to map it onto the second IIC only when the first one is occupied, and so forth.

Fig. 13: Mapping IITs onto IICs.
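The ordering of Fig. 13 and the hard/soft classification can be sketched as follows for a single island pair. Fig. 13 does not define when an IIC is "occupied", so this Python sketch assumes, hypothetically, that an IIT (v_i, v_j) needs one slot on an IIC at some cstep t with T(v_i) < t ≤ T(v_j), and that an IIC carries at most one IIT per cstep. The IIT class, map_iits_to_iics, classify, and the three example IITs are illustrative inventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IIT:
    """Inter-island transfer from v_i (island I_A) to v_j (island I_B)."""
    src: str
    dst: str
    t_src: int  # T(v_i), cstep of the producer
    t_dst: int  # T(v_j), cstep of the consumer (assumed > t_src)

    @property
    def slack(self) -> int:
        return self.t_dst - self.t_src - 1  # equation (1)


def map_iits_to_iics(iits: list[IIT]) -> list[dict[int, IIT]]:
    """Greedy mapping for ONE island pair; each IIC is a dict {cstep: IIT}.

    Hypothetical occupancy model: one slot per IIT at a cstep t with
    t_src < t <= t_dst, and at most one IIT per IIC per cstep.
    """
    # Primary key: increasing T(v_j); secondary key: decreasing T(v_i).
    order = sorted(iits, key=lambda e: (e.t_dst, -e.t_src))
    iics: list[dict[int, IIT]] = []
    for iit in order:
        window = range(iit.t_src + 1, iit.t_dst + 1)
        for slots in iics:  # try the first IIC, then the second, ...
            t = next((c for c in window if c not in slots), None)
            if t is not None:
                slots[t] = iit
                break
        else:  # every existing IIC is occupied over this window
            iics.append({window[0]: iit})
    return iics


def classify(iics: list[dict[int, IIT]]) -> list[str]:
    """A hard IIC carries at least one non-splittable (zero-slack) IIT."""
    return ["hard" if any(e.slack == 0 for e in slots.values()) else "soft"
            for slots in iics]


# Hypothetical example (not the IITs of Fig. 11).
a = IIT("v1", "v2", 0, 1)  # slack 0
b = IIT("v3", "v4", 0, 2)  # slack 1
c = IIT("v5", "v6", 1, 2)  # slack 0
iics = map_iits_to_iics([b, a, c])
```

With this ordering, the zero-slack IITs a and c share the first IIC, and the slack-1 IIT b ends up alone on a second, soft IIC. Had b been mapped first, it would have taken cstep 1 of the first IIC, pushing a onto a second IIC and leaving two hard IICs.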

The weight of the source node of an IIT's detouring graph is defined by (2) to reflect its importance. The weights of the other nodes and edges are then computed in topological order by (3) and (4).

Fig. 14: The iterative edge splitting procedure.

Fig. 15: The detouring graphs for (a) IIT_1,7, (b) IIT_2,8, and (c) IIT_5,10.

The weight of each bubble is then obtained by summing its weights over all detouring graphs.

In the example of Fig. 15, the weights of bubbles a, b, c, and d are 0.5, 0, 1, and 0.5, respectively.

Once all bubble weights are available, the path with the minimum total bubble weight is identified and used to detour the given IIT. This minimum-bubble-weight problem can be formulated as a shortest-path problem and solved accordingly. Once the IIT is detoured, the affected detouring graphs must be updated, since the consumed bubbles are no longer available. Because a soft IIC is easier to eliminate the fewer IITs it contains, soft IICs containing fewer IITs are processed first. For example, IIC²_B,C (one IIT) is processed before IIC¹_A,B (two IITs).
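Summing per-graph bubble weights and finding the minimum-bubble-weight path can be sketched in Python as follows. Because the weights sit on nodes (bubbles), this sketch uses Dijkstra's algorithm with the cost of entering a node equal to its bubble weight, which is valid as long as all weights are non-negative. The adjacency-dict graph representation, the function names, and all values are hypothetical, and equations (3) and (4) for the per-graph weights are not reproduced.

```python
import heapq
from typing import Mapping, Optional, Sequence


def total_bubble_weights(per_graph: Sequence[Mapping[str, float]]) -> dict[str, float]:
    """Sum each bubble's weight over all detouring graphs."""
    total: dict[str, float] = {}
    for weights in per_graph:
        for bubble, w in weights.items():
            total[bubble] = total.get(bubble, 0.0) + w
    return total


def min_bubble_weight_path(
    graph: Mapping[str, Sequence[str]],  # detouring graph: node -> successors
    weight: Mapping[str, float],         # bubble weights (non-negative)
    src: str,
    dst: str,
    consumed: frozenset = frozenset(),   # bubbles no longer available
) -> Optional[tuple[float, list[str]]]:
    """Dijkstra where entering a node costs its bubble weight (0 for src/dst)."""
    dist = {src: 0.0}
    prev: dict[str, str] = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:  # rebuild the path back to src
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph.get(u, ()):
            if v in consumed:
                continue
            nd = d + weight.get(v, 0.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return None  # no detour path left: the IIT cannot be split


# Hypothetical example: an IIT from node "5" to node "10" via bubbles b, c, d.
w = total_bubble_weights([{"b": 0.0, "c": 0.5}, {"c": 0.5, "d": 0.5}])
g = {"5": ["b", "c"], "b": ["d"], "c": ["10"], "d": ["10"]}
```

Here the detour through b and d (total 0.5) is preferred over the more popular bubble c (total 1.0); once d is consumed, the search falls back to c, and with both c and d gone it reports that the IIT cannot be split.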

Overall, the procedure in Fig. 14 attempts to detour the IITs of the target soft IIC in increasing order of their slack. If any IIT cannot be split, all IITs already split for this IIC are restored, and the target IIC is marked as fixed. Conversely, if all IITs in the target IIC are successfully split, the IIC can be safely removed. In either case, the procedure proceeds to the next candidate soft IIC. Since detours are routed only over existing fixed IICs, data detouring never increases the number of IICs; in the worst case, all soft IICs become fixed and no reduction is achieved. Finally, the DFG resulting from data detouring is shown in Fig. 11(b), where the number of IICs is reduced from 6 to 4.
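The control flow just described (soft IICs with fewer IITs first, IITs in increasing slack, rollback on failure) can be sketched in Python. The per-IIT detour search is abstracted as a hypothetical try_detour callback that returns the bubbles a detour consumes, or None; in practice it would be the minimum-bubble-weight search. All identifiers, slacks, and candidate detours below are illustrative.

```python
from typing import Callable, Hashable, Optional, Sequence

IITId = Hashable


def iterative_edge_splitting(
    soft_iics: dict[str, Sequence[tuple[IITId, int]]],            # IIC -> [(IIT, slack)]
    try_detour: Callable[[IITId, set], Optional[Sequence[str]]],  # bubbles used or None
) -> tuple[list[str], list[str], set]:
    """Returns (removed IICs, fixed IICs, consumed bubbles)."""
    consumed: set = set()
    removed: list[str] = []
    fixed: list[str] = []
    # Soft IICs containing fewer IITs are processed first.
    for iic, iits in sorted(soft_iics.items(), key=lambda kv: len(kv[1])):
        used_here: list[str] = []
        for iit, _ in sorted(iits, key=lambda e: e[1]):  # increasing slack
            bubbles = try_detour(iit, consumed)
            if bubbles is None:
                consumed.difference_update(used_here)  # restore split IITs
                fixed.append(iic)
                break
            consumed.update(bubbles)
            used_here.extend(bubbles)
        else:
            removed.append(iic)  # every IIT detoured: the IIC can be removed
    return removed, fixed, consumed


# Hypothetical candidate detours (bubble lists) for each IIT.
candidates = {"r": [["a"], ["c"]], "p": [["c"]], "q": [["a"]]}


def try_detour(iit: IITId, consumed: set) -> Optional[list[str]]:
    for path in candidates.get(iit, []):
        if consumed.isdisjoint(path):
            return path
    return None


soft = {"X": [("p", 1), ("q", 2)], "Y": [("r", 1)]}
result = iterative_edge_splitting(soft, try_detour)
```

In this run, IIC Y is removed, while X becomes fixed because q's only detour needs bubble a, which Y already consumed; bubble c, tentatively taken for p, is released again.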

$$\text{weight of a source node} = \frac{1}{\text{number of edges this IIC contains}} \qquad (2)$$
