A clustering- and probability-based approach for time-multiplexed FPGA partitioning


INTEGRATION, the VLSI journal 38 (2004) 245–265


Guang-Ming Wu (a), Mango Chia-Tso Chao (b), Yao-Wen Chang (c)

(a) Department of Information Management, Nan-Hua University, 32 Chung Keng, Dalin, Chiayi, Taiwan

(b) Computer and Information Science, National Chiao Tung University, Hsinchu 30010, Taiwan

(c) Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan

Received 14 February 2003; received in revised form 25 May 2004; accepted 3 June 2004

Abstract

Improving logic density by time-sharing, time-multiplexed FPGAs (TMFPGAs) have become an important research topic for reconfigurable computing. Due to the precedence and capacity constraints in TMFPGAs, the clustering and partitioning problems for TMFPGAs are different from the traditional ones. In this paper, we propose a two-phase hierarchical approach to solve the partitioning problem for TMFPGAs. With the precedence and capacity considerations for both phases, the first phase clusters nodes to reduce the problem size, and the second phase applies a probability-based iterative-improvement approach to minimize cut cost. Experimental results based on the Xilinx TMFPGA architecture show that our algorithm significantly outperforms previous works.

1. Introduction

Improving logic density by time-sharing, time-multiplexed FPGAs (TMFPGAs) have become an important research topic for reconfigurable computing. In TMFPGAs, a large virtual design is partitioned into multiple stages (or partitions) that share one physical device, which is smaller than the device a traditional FPGA would need for the same design. Several different architectures have been proposed, such as the Xilinx model [1] and Dharma [2]. All these models allow dynamic reuse of logic blocks and wire segments by using more than one on-chip SRAM bit to control them. The configurations of logic blocks and wire segments can be changed by reading different SRAM bits.

Corresponding author. Tel.: +88652721001x201; fax: +88652427136.

E-mail addresses: [email protected] (G.-M. Wu), [email protected] (M.C.-T. Chao), ywchang@cc.ee.ntu.edu.tw (Y.-W. Chang).

Fig. 1 shows the Xilinx TMFPGA configuration model [1]. The TMFPGA emulates a single circuit design in the sequencing of multiple configurations. In each micro-cycle, the TMFPGA reads in the circuit information from a corresponding configuration SRAM, and then the configurable logic blocks (CLBs) in the TMFPGA are reused to evaluate logic. A user cycle is a cycle passing through all micro-cycles. Each CLB contains micro registers to hold the CLB result. Micro registers hold the intermediate values of combinational logic for later micro-cycles in the same user cycle and reserve the status of flip-flops for the next user cycle. In Xilinx TMFPGAs, there are eight micro-cycles in a user cycle. A new configuration is loaded into active configuration memory after all CLB results in the last micro-cycle have been saved.

The objective of the TMFPGA partitioning problem is to minimize the interconnection (the number of micro registers required) between micro-cycles. Unlike in a traditional FPGA, the execution order of nodes in a TMFPGA must follow their precedence constraints. For example, in a combinational circuit a node must be executed no later than all of its outputs. This implies that a cut in a TMFPGA partitioning must be uni-directional. For the TMFPGA partitioning problem, several heuristics, such as list scheduling [3,4] and network-flow-based approaches [5–7], have been proposed for different architectures. The network-flow-based approach first finds a min cut. If the min cut is not at the balanced point, it randomly moves nodes to meet the balance constraint, so the adjusted solution may drift away from the optimum. In this paper, we propose a two-phase approach, the CPAT method (Clustering and Probability-based Algorithm for TMFPGA), to solve the TMFPGA partitioning problem. The first phase reduces the problem size using a clustering method; the second phase minimizes the interconnection by a probability-based iterative-improvement [8,9] method. For the first phase, we extend the method used in [10], which is effective in clustering traditional circuits but may generate a cluster whose size exceeds the capacity of a stage in the TMFPGA partitioning. Our solution to the capacity overflow problem is based on a rooted-tree subset-sum formulation; we prove that the rooted-tree subset-sum problem is NP-complete and present an exact exponential-time algorithm and a fully polynomial-time approximation scheme [11] for the problem. For the second phase, the probability-based method incorporates the precedence constraints into the 2nd-order probability estimation [12]. Thus, the probability-based method finds the potentially maximum gain among movable nodes. Our method can therefore monitor the changes globally and avoid the drawback of the network-flow-based approach. Experimental results, based on the Xilinx TMFPGA architecture [1] with eight micro-cycles (stages), show that our algorithm requires a smaller maximum number of micro registers than previous works.

Fig. 1. Xilinx TMFPGA configuration model [1] (labels in the figure: configurations 1, 2, ..., k-1, k, k+1, ..., 8; micro-cycles; user cycle).

2. Problem formulation

We follow the formulation and notation used in [5]. A circuit in a TMFPGA can be represented by a directed hypergraph $G(V, N)$, where $V$ is the set of nodes and $N$ is the set of nets in the circuit. There are two types of nodes in $V$: combinational nodes (C-nodes) and flip-flop nodes (FF-nodes). Each node $v \in V$ has a weight $w(v)$. The weight of a set $U$ ($U \subseteq V$), $W(U)$, is given by $\sum_{v \in U} w(v)$. For a net $n = \{v_1, v_2, \ldots, v_p\}$ with $p$ nodes, let $v_1$ be the fan-out node whose output signal is the input signal to $v_j \in n$ ($2 \le j \le p$), and let $v_j \in n$ ($2 \le j \le p$) be the fan-in nodes whose input signal is the output signal from $v_1$.

To fit into a TMFPGA, a circuit is partitioned into $k$ stages, such that the logic blocks and wire segments in different stages can share the same physical TMFPGA device. These $k$ stages form one user cycle, and one user cycle should produce the same results on the outputs as would be seen by a non-time-multiplexed device. To ensure that a user cycle produces correct results, every node must be evaluated in a proper order. According to the Xilinx architecture [1], the following two precedence constraints must be satisfied:

1. Each combinational node (C-node) must be scheduled in a stage no later than all its output nodes.

2. Each FF-node must be scheduled in a stage no earlier than all its output nodes. This rule guarantees that all the nodes that use the value of the flip-flop use the same value: the value of the flip-flop from the previous user cycle.

The above constraints define a partial temporal ordering on the nodes in the circuit. Let $Pre(v)$ be the precedence of a node $v$. For two nodes $u$ and $v$, let $Pre(u) \preceq Pre(v)$ denote that node $u$ must be scheduled no later than node $v$. In other words, for a net $n = \{v_1, v_2, \ldots, v_p\}$, where $v_1$ is the fan-out node and $v_j$, $2 \le j \le p$, are the fan-in nodes:

if $v_1$ is a C-node, then $Pre(v_1) \preceq Pre(v_j)$ for $2 \le j \le p$;

if $v_1$ is an FF-node, then $Pre(v_j) \preceq Pre(v_1)$ for $2 \le j \le p$.

By the two constraints, we can decide the directions of nets in the graph and classify nets into two types: a net is C-type if its $v_1$ is a C-node, and a net is FF-type if its $v_1$ is an FF-node, as shown in Fig. 2.

In TMFPGAs, micro registers are required between stages to store the data of nodes for use in later micro-cycles. Let $Cut(a, b)$ be the set of micro registers between stage $a$ and stage $b$. A $k$-stage TMFPGA contains $k$ cuts, $Cut(1, 2)$, $Cut(2, 3)$, $\ldots$, $Cut(k-1, k)$, and $Cut(k, 1)$. For a C-type net, the data of its fan-out node must be held until the last stage containing a fan-in node of the net. For an FF-type net, the data of its fan-out node must be held not only in the remaining stages of the current user cycle but also from the first stage to the last stage of all its fan-in nodes in the next user cycle. For a net $n = \{v_1, v_2, \ldots, v_p\}$, let $s(v) = j$ if $v$ belongs to stage $j$, let $a(n)$ denote the number of micro registers used in net $n$, and let $k$ denote the number of stages. $a(n)$ is given as follows:

$a(n) = \max\{s(v_j) \mid 2 \le j \le p\} - s(v_1)$, if net $n$ is C-type.

$a(n) = k - s(v_1) + \max\{s(v_j) \mid 2 \le j \le p\}$, if net $n$ is FF-type.

Fig. 3 shows the registers needed in a net for a 4-stage TMFPGA. In Fig. 3(a), the data of a C-type fan-out node is held from stage 1, the stage of the fan-out node, to stage 3, the last stage of the fan-in nodes. It uses two micro registers, one each for $Cut(1, 2)$ and $Cut(2, 3)$. In Fig. 3(b), the data of an FF-type fan-out node is held from stage 3, the stage of the fan-out node, to stage 4, then back to stage 1 of the next user cycle, and finally to stage 2, the last stage of the fan-in nodes. It uses three registers, one each for $Cut(3, 4)$, $Cut(4, 1)$, and $Cut(1, 2)$.
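The two cases of $a(n)$ can be checked numerically with a minimal Python sketch. The function name `micro_registers`, the dictionary-based stage map, and the stages chosen for the non-final fan-in nodes are assumptions made for illustration; only the two formulas and the Fig. 3 register counts come from the text.

```python
def micro_registers(stage, net, kind, k):
    """Count micro registers a(n) for one net.

    stage: dict mapping node name -> stage index s(v) in 1..k
    net:   list [v1, v2, ..., vp]; v1 is the fan-out node
    kind:  "C" for a C-type net, "FF" for an FF-type net
    k:     number of stages
    """
    fan_out, fan_ins = net[0], net[1:]
    last_fan_in = max(stage[v] for v in fan_ins)
    if kind == "C":
        # Data is held from the fan-out stage to the last fan-in stage.
        return last_fan_in - stage[fan_out]
    if kind == "FF":
        # Data is held to the end of this user cycle, then up to the
        # last fan-in stage of the next user cycle.
        return k - stage[fan_out] + last_fan_in
    raise ValueError("kind must be 'C' or 'FF'")


# Fig. 3(a): fan-out in stage 1, last fan-in in stage 3 (v2's stage is assumed).
regs_a = micro_registers({"v1": 1, "v2": 2, "v3": 3}, ["v1", "v2", "v3"], "C", 4)
# Fig. 3(b): fan-out in stage 3, last fan-in in stage 2 (v2's stage is assumed).
regs_b = micro_registers({"v1": 3, "v2": 1, "v3": 2}, ["v1", "v2", "v3"], "FF", 4)
print(regs_a, regs_b)
```

Under these assumptions the two calls give the register counts stated for Fig. 3(a) and Fig. 3(b).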

Fig. 2. Precedence constraints for a C-type net and an FF-type net $\{v_1, v_2, \ldots, v_p\}$. Shaded nodes and white nodes represent the fan-out nodes and fan-in nodes, respectively.

Fig. 3. Stages 1 to 4 with $Cut(1,2)$, $Cut(2,3)$, $Cut(3,4)$, and $Cut(4,1)$. (a) Two micro registers used in a C-type net $\{v_1, v_2, v_3\}$. (b) Three micro registers used in an FF-type net $\{v_1, v_2, v_3\}$.

The $k$-stage TMFPGA partitioning problem is to partition a circuit $G(V, N)$ into $k$ non-overlapping subsets $V_1, V_2, \ldots, V_k$, such that the maximum interconnection (the number of micro registers) between each two adjacent stages is minimized, and the following properties are satisfied:

(1) $\bigcup_{i=1}^{k} V_i = V$.

(2) Precedence constraint: Let $s(v) = j$ if $v \in V_j$. For each two nodes $u$ and $v$, if $Pre(u) \preceq Pre(v)$, then $s(u) \le s(v)$.

(3) Balance constraint: For each subset $V_i$, $W(V_i)$ is bounded by a factor $r$ as follows:

$\frac{W(V)}{k}(1 - r) \le W(V_i) \le \frac{W(V)}{k}(1 + r), \quad 0 \le r \le 1.$

(4) Timing constraint: Let $D$ be the length of the longest path in a circuit. The length of the longest path in each stage is upper bounded by $\lceil D/k \rceil$.
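As an illustration of properties (1) to (3), the following Python sketch tests whether a given stage assignment is feasible. It is only a checker under stated assumptions: the name `is_feasible`, the representation of precedence as a list of pairs `(u, v)` meaning $Pre(u) \preceq Pre(v)$, and the omission of the timing constraint (4) are choices made here, not part of the paper.

```python
def is_feasible(weights, stage, precedence, k, r):
    """Check coverage, precedence, and balance for a k-stage assignment.

    weights:    dict node -> w(v)
    stage:      dict node -> s(v), expected to lie in 1..k
    precedence: iterable of (u, v) pairs meaning Pre(u) <= Pre(v)
    r:          balance factor, 0 <= r <= 1
    """
    # (1) Every node is assigned to exactly one valid stage.
    if set(stage) != set(weights):
        return False
    if any(not 1 <= s <= k for s in stage.values()):
        return False
    # (2) Precedence: u must not be scheduled later than v.
    if any(stage[u] > stage[v] for u, v in precedence):
        return False
    # (3) Balance: each stage weight lies within W(V)/k * (1 -/+ r).
    total = sum(weights.values())
    low, high = total / k * (1 - r), total / k * (1 + r)
    load = [0] * (k + 1)
    for v, s in stage.items():
        load[s] += weights[v]
    return all(low <= load[i] <= high for i in range(1, k + 1))


unit = {"a": 1, "b": 1, "c": 1, "d": 1}
balanced = {"a": 1, "b": 1, "c": 2, "d": 2}
print(is_feasible(unit, balanced, [("a", "c"), ("b", "d")], k=2, r=0.5))
```

A checker like this is useful mainly as a sanity test around a partitioner; it says nothing about the cut cost of the assignment.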

3. The two-phase CPAT algorithm

The $k$-stage TMFPGA partitioning problem can be handled by repeatedly solving $k - 1$ TMFPGA bipartitioning problems. We shall focus our discussion on the approach for solving the TMFPGA bipartitioning problem. Our solution to this problem is based on a two-phase hierarchical approach: clustering followed by a probability-based iterative-improvement formulation.

3.1. Phase I: the clustering algorithm

An effective clustering algorithm can greatly improve the quality of the precedence-constrained partitioning results and speed up the later partitioning algorithm by reducing the problem size. The maximum fanout free subgraph (MFFS) algorithm is effective in clustering traditional circuits [10]. MFFS is a signal ﬂow-based clustering algorithm that considers simultaneous movement of logically dependent nodes during the node moves. However, MFFS may generate a cluster of size larger than the capacity of a stage in the TMFPGA partitioning. To consider the capacity constraint, we propose a clustering method based on the MFFS, which can control the size of a cluster. The deﬁnitions of FFS and MFFS are described as follows. For a given node v in a circuit,

$FFS_v = \{u \mid \text{every path from } u \text{ to some primary output passes through } v \text{ in the circuit}\}$.

$MFFS_v = \{u \mid \text{for all } FFS_v,\ u \in FFS_v\}$.

A circuit can be represented in the TMFPGA by a directed graph. For a given circuit $C_i$ and a node $v$, an MFFS cluster rooted at $v$ can be obtained by using the following procedure:

Convert $C_i$ to a directed graph $G(V, N)$, where $V$ is the set of nodes corresponding to $C_i$, and $N$ is the corresponding set of nets.

Cut all the fan-out edges of the root node $v$; search all other nodes in graph $G(V, N)$ starting from the primary outputs of $C_i$. The nodes in $G(V, N)$ that were not traversed belong to $MFFS_v$.
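The construction procedure above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `mffs`, the adjacency-dictionary graph format, and the decision to skip the root itself when it is a primary output are assumptions made for illustration.

```python
from collections import deque


def mffs(fanouts, primary_outputs, root):
    """Return the set of nodes in the MFFS cluster rooted at `root`.

    fanouts:         dict node -> list of fan-out (successor) nodes
    primary_outputs: iterable of primary-output nodes
    root:            the root node v of the cluster
    """
    # Build fan-in lists, dropping every fan-out edge of the root (the "cut").
    fanins = {u: [] for u in fanouts}
    for u, outs in fanouts.items():
        if u == root:
            continue
        for w in outs:
            fanins.setdefault(w, []).append(u)

    # Search backward from the primary outputs (assumption: the root itself
    # is not used as a starting point when it is a primary output).
    start = [p for p in primary_outputs if p != root]
    seen = set(start)
    queue = deque(start)
    while queue:
        w = queue.popleft()
        for u in fanins.get(w, []):
            if u not in seen:
                seen.add(u)
                queue.append(u)

    # Nodes that were not traversed belong to MFFS_root.
    return set(fanins) - seen


# Hypothetical circuit: a -> c, b -> c, b -> d, c -> e, d -> f; outputs e, f.
circuit = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["f"], "e": [], "f": []}
print(sorted(mffs(circuit, ["e", "f"], "c")))
```

In the hypothetical circuit, node `b` is excluded from the cluster rooted at `c` because it also reaches output `f` without passing through `c`.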

The MFFS construction algorithm described above is used to obtain one MFFS cluster. To cluster the entire circuit, we need to apply the MFFS construction algorithm repeatedly. The MFFS clustering algorithm works as follows: For a given circuit $C_i$, let Roots = {all primary outputs in $C_i$}. Then, extract a node $v \in$ Roots and use the MFFS construction algorithm to construct $MFFS_v$. This process is repeated until Roots is empty. Then remove all currently constructed MFFS clusters from $C_i$, resulting in a reduced circuit $C_i'$ whose primary outputs are the input nodes to the removed MFFS clusters. Repeat the same procedure for the new circuit $C_i'$ recursively until all nodes in $C_i$ are grouped into MFFS clusters. For example, the circuit depicted in Fig. 4(a) can be clustered into three clusters (see Fig. 4(b)).

In the following, we present two algorithms to handle a cluster whose size exceeds the capacity of a stage in the TMFPGA partitioning. Our method decomposes a cluster $C_i$ (rooted at $v$) according to two cases: (1) $C_i$ is a rooted tree, and (2) $C_i$ is an acyclic graph. Our target is to partition $C_i$ into two balanced sets with the minimal cut size.

We first consider the case where a circuit $C_i$ is a rooted tree. Let $T_{v_i}$ denote the subtree rooted at $v_i$, where $v_i \in C_i$. For nodes $v_1, v_2, \ldots, v_d$ in respective $T_{v_1}, T_{v_2}, \ldots, T_{v_d}$, let $k(v_1, v_2, \ldots, v_d)$ denote the total weight of the nodes in $T_{v_1}, T_{v_2}, \ldots, T_{v_d}$. We define an element $x = (k(v_1, v_2, \ldots, v_d), T_{v_1}, T_{v_2}, \ldots, T_{v_d})$, where $v_1, v_2, \ldots, v_d$ represent the respective roots of disjoint subtrees $T_{v_1}, T_{v_2}, \ldots, T_{v_d}$. For an element $x = (k(v_1, v_2, \ldots, v_d), T_{v_1}, T_{v_2}, \ldots, T_{v_d})$, let $|x| = d$ and $p(x) = k(v_1, v_2, \ldots, v_d)$. An element $y$ is called a singleton element if it contains only one subtree.

For an element $x_i = (k(v_{i,1}, v_{i,2}, \ldots, v_{i,d}), T_{v_{i,1}}, T_{v_{i,2}}, \ldots, T_{v_{i,d}})$ and a singleton element $y_j = (k(v_j), T_{v_j})$, the operation $\uplus$ is defined as follows:

if $T_{v_j} \not\subset T_{v_{i,l}}$, $1 \le l \le d$, let $x_i \uplus y_j = (k(v_{i,1}, v_{i,2}, \ldots, v_{i,d}, v_j), T_{v_{i,1}}, T_{v_{i,2}}, \ldots, T_{v_{i,d}}, T_{v_j})$;

if $T_{v_j} \subset T_{v_{i,l}}$, $1 \le l \le d$, let $x_i \uplus y_j = (k(\hat{V}), \hat{T})$, where $\hat{V} = \{v_{i,1}, v_{i,2}, \ldots, v_{i,d}, v_j\} - \{v_{i,l}\}$ and $\hat{T} = \{T_{v_{i,1}}, T_{v_{i,2}}, \ldots, T_{v_{i,d}}, T_{v_j}\} - \{T_{v_{i,l}}\}$.

Let $h$ denote half of the total weight of the nodes in $C_i$. The Rooted-Tree Subset-Sum problem is to cut $C_i$ into a minimal number of subtrees such that the total weight of the nodes in the subtrees is equal to $h$. We formulate the Rooted-Tree Subset-Sum problem as follows.

The Rooted-Tree Subset-Sum problem. Given a set $R$ of singleton elements associated with a rooted tree $C_i$ and an integer $h$, find an element $x$ derived by a sequence of $\uplus$ operations such that $p(x) = h$ and $|x|$ is minimized.

Theorem 1. The decision problem of the Rooted-Tree Subset-Sum problem is NP-complete.

Proof. We first show that the Rooted-Tree Subset-Sum problem is in NP. Given a set $R$ associated with a rooted tree and two integers $q$ and $h$, we let the subset $R'$ of $R$ be the certificate. Checking whether $h = p(\uplus_{x \in R'} x)$ and $|\uplus_{x \in R'} x| = q$ can be accomplished by a verification algorithm in polynomial time.

The SUBSET-SUM problem is an NP-complete problem [11]. We now show that SUBSET-SUM $\le_P$ Rooted-Tree Subset-Sum. Given an instance $\langle \hat{S}, t \rangle$ of the subset-sum problem, the reduction algorithm constructs a tree (a circuit) $C$ of the Rooted-Tree Subset-Sum problem such that there exists a subset of $\hat{S}$ whose sum is equal to $t$ if and only if there exists an element $x$ associated with $C$, where $p(x) = t$.

The heart of the reduction is a tree representation of $\hat{S}$. Let $\hat{S} = \{s_1, s_2, \ldots, s_i\}$ be a set consisting of $i$ integers. We construct the tree $C(V, N)$ with $i + 1$ nodes associated with $\hat{S}$ as follows:

Add a root $v_0$ with weight 1 to $V$.

For each integer $s_j \in \hat{S}$, add a node $v_j$ with weight $s_j$ to $V$ and a directed edge $(v_0, v_j)$ to $N$.

Every subtree of $C$ except $T_{v_0}$ has only one node, and these subtrees are disjoint from each other. We have $R = \{(k(v_0), T_{v_0}), (k(v_1), T_{v_1}), \ldots, (k(v_i), T_{v_i})\}$ associated with $C(V, N)$. Let $q$ equal $i$, let $\hat{S}' \subseteq \hat{S}$ be such that $t = \sum_{s_j \in \hat{S}'} s_j$, and let $y_k = (k(v_k), T_{v_k})$. Then we find the element $x = \uplus y_j$, where $y_j$ is associated with $s_j \in \hat{S}'$, such that $p(x) = t$ and $|x| \le q$.

Conversely, suppose that there exists an element $x = (k(v_1, v_2, \ldots, v_d), T_{v_1}, T_{v_2}, \ldots, T_{v_d})$. Let $|x|$ equal $d$ and $p(x)$ equal $k(v_1, v_2, \ldots, v_d)$ such that $p(x) = t$. Then, the sum of the subset $\{v_{j_1}, v_{j_2}, \ldots, v_{j_k}\}$ is equal to $t$. □

We give an exponential-time exact algorithm as well as a fully polynomial-time approximation scheme [11] for the Rooted-Tree Subset-Sum problem, listed in Figs. 6 and 7, respectively. For a sequence $L = \langle (k(v_{1,1}, \ldots, v_{1,i_1}), T_{v_{1,1}}, \ldots, T_{v_{1,i_1}}), (k(v_{2,1}, \ldots, v_{2,i_2}), T_{v_{2,1}}, \ldots, T_{v_{2,i_2}}), \ldots, (k(v_{m,1}, \ldots, v_{m,i_m}), T_{v_{m,1}}, \ldots, T_{v_{m,i_m}}) \rangle$ and a singleton element $(k(v_j), T_{v_j})$, let $L + (k(v_j), T_{v_j})$ denote the sequence derived by applying the $\uplus$ operation to each element of $L$ with the singleton element $(k(v_j), T_{v_j})$. For example, if $L = \langle (1, T_{v_1}), (3, T_{v_2}), (5, T_{v_3}), (6, T_{v_4}) \rangle$, then $L + (2, T_{v_5}) = \langle (3, T_{v_1}, T_{v_5}), (5, T_{v_2}, T_{v_5}), (7, T_{v_3}, T_{v_5}), (8, T_{v_4}, T_{v_5}) \rangle$ (if $T_{v_1}, \ldots, T_{v_5}$ do not share any node).

We use an auxiliary procedure merge-lists($L, L'$) that returns the sorted list obtained by merging its two sorted input lists $L$ and $L'$ and removing duplicate elements. Like the merge procedure used in merge sort [11], merge-lists runs in time $O(|L| + |L'|)$. Since the length of $L_i$ can be as much as $2^i$, Exact-Rooted-Tree-Subset-Sum is an exponential-time algorithm.

The polynomial-time approximation algorithm Approx-Rooted-Tree-Subset-Sum is performed by trimming each list $L_i$ after a $\uplus$ operation. We use a trimming parameter $\delta$ such that $0 \le \delta \le 1$. To trim a list $L$ by $\delta$ means to remove as many elements from $L$ as possible, in such a way that if $L'$ is the result of trimming $L$, then for each element $y$ removed from $L$, there exists an element $z$ still in $L'$, where $(1 - \delta)\,p(y) \le p(z) \le p(y)$. Line 3 initializes the list $L_0$ to be the list containing just the element $(0, \emptyset)$. Lines 4–5 perform the $\uplus$ operation in a topological order. Lines 6 and 7 remove each element $x$ with $p(x) > h$ and $|x| > q$. Line 8 performs the trimming operations. We can show that Approx-Rooted-Tree-Subset-Sum, listed in Fig. 7, runs in time polynomial in both $|R|$ and $1/\varepsilon$; i.e., it is a fully polynomial-time approximation scheme [11].
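The trimming condition can be illustrated on plain $p$ values with a short Python sketch. This is an assumption-laden simplification: the function `trim`, the parameter name `delta`, and the reduction of elements to their $p$ values are introduced here for illustration. The procedure in Fig. 7 works on full elements and, judging from the worked trace, treats equal-valued singleton elements differently, so this sketch is not a reproduction of it.

```python
def trim(values, delta):
    """Trim a sorted list of p values.

    A value y is dropped when the last kept value z already satisfies
    (1 - delta) * y <= z <= y, so z can represent y.
    """
    kept = []
    for y in values:
        # values are sorted ascending, so kept[-1] <= y always holds;
        # keep y only if no kept value is within the (1 - delta) factor.
        if not kept or kept[-1] < (1 - delta) * y:
            kept.append(y)
    return kept


print(trim([0, 10, 11, 12, 15, 20], 0.2))
print(trim([0, 1, 1, 2, 3, 4, 4], 0.025))
```

With a small parameter such as 0.025 only exact duplicates are merged in the second call, while the larger parameter in the first call also merges nearby values.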

We give an example of Approx-Rooted-Tree-Subset-Sum in the following. Suppose we have a list of singleton elements

$L = \langle (8, T_{v_1}), (3, T_{v_2}), (4, T_{v_3}), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_6}), (1, T_{v_7}), (1, T_{v_8}) \rangle$

associated with the rooted tree in Fig. 5, in which the weight of each vertex is equal to 1. The target is to find an element $x$, where $p(x) = h = 4$ and $|x| = q = 1$, with $\varepsilon = 0.2$. The trimming parameter $r$ is $\varepsilon/8 = 0.025$. Approx-Rooted-Tree-Subset-Sum computes the elements as follows (Figs. 6 and 7):

Line 2: $L_0 = \langle (0, \emptyset) \rangle$;

Line 4: pick $(8, T_{v_1})$;
Line 5: $L_1 = \langle (0, \emptyset), (8, T_{v_1}) \rangle$;
Lines 6–8: $L_1 = \langle (0, \emptyset) \rangle$;

Line 4: pick $(3, T_{v_2})$;
Line 5: $L_2 = \langle (0, \emptyset), (3, T_{v_2}) \rangle$;
Lines 6–8: $L_2 = \langle (0, \emptyset), (3, T_{v_2}) \rangle$;

Line 4: pick $(4, T_{v_3})$;
Line 5: $L_3 = \langle (0, \emptyset), (3, T_{v_2}), (4, T_{v_3}), (7, T_{v_2}, T_{v_3}) \rangle$;
Lines 6–8: $L_3 = \langle (0, \emptyset), (3, T_{v_2}), (4, T_{v_3}) \rangle$;

Line 4: pick $(1, T_{v_4})$;
Line 5: $L_4 = \langle (0, \emptyset), (1, T_{v_4}), (3, T_{v_2}), (4, T_{v_3}), (5, T_{v_3}, T_{v_4}) \rangle$;
Lines 6–8: $L_4 = \langle (0, \emptyset), (1, T_{v_4}), (3, T_{v_2}), (4, T_{v_3}) \rangle$;

Line 4: pick $(1, T_{v_5})$;
Line 5: $L_5 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_4}, T_{v_5}), (3, T_{v_2}), (4, T_{v_3}), (5, T_{v_3}, T_{v_5}) \rangle$;
Line 6: $L_5 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_4}, T_{v_5}), (3, T_{v_2}), (4, T_{v_3}) \rangle$;
Lines 7–8: $L_5 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (3, T_{v_2}), (4, T_{v_3}) \rangle$;

Line 4: pick $(2, T_{v_6})$;
Line 5: $L_6 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_6}), (3, T_{v_2}), (3, T_{v_4}, T_{v_6}), (3, T_{v_5}, T_{v_6}), (4, T_{v_3}), (5, T_{v_2}, T_{v_6}) \rangle$;
Line 6: $L_6 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_6}), (3, T_{v_2}), (3, T_{v_4}, T_{v_6}), (3, T_{v_5}, T_{v_6}), (4, T_{v_3}) \rangle$;
Lines 7–8: $L_6 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (2, T_{v_6}), (3, T_{v_2}), (4, T_{v_3}) \rangle$;

Line 4: pick $(1, T_{v_7})$;
Lines 5–6: $L_7 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (2, T_{v_6}), (2, T_{v_4}, T_{v_7}), (2, T_{v_5}, T_{v_7}), (3, T_{v_2}), (3, T_{v_6}, T_{v_7}), (4, T_{v_3}), (4, T_{v_2}, T_{v_7}) \rangle$;
Line 7: $L_7 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (2, T_{v_6}), (3, T_{v_2}), (4, T_{v_3}), (4, T_{v_2}, T_{v_7}) \rangle$;
Line 8: $L_7 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (2, T_{v_6}), (3, T_{v_2}), (4, T_{v_3}) \rangle$;

Line 4: pick $(1, T_{v_8})$;
Lines 5–6: $L_8 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (1, T_{v_8}), (2, T_{v_6}), (2, T_{v_4}, T_{v_8}), (2, T_{v_5}, T_{v_8}), (2, T_{v_7}, T_{v_8}), (3, T_{v_2}), (4, T_{v_3}), (4, T_{v_2}, T_{v_8}) \rangle$;
Line 7: $L_8 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (1, T_{v_8}), (2, T_{v_6}), (3, T_{v_2}), (4, T_{v_3}), (4, T_{v_2}, T_{v_8}) \rangle$;
Line 8: $L_8 = \langle (0, \emptyset), (1, T_{v_4}), (1, T_{v_5}), (1, T_{v_7}), (1, T_{v_8}), (2, T_{v_6}), (3, T_{v_2}), (4, T_{v_3}) \rangle$.

Fig. 5. A rooted tree with eight vertices. The tree has a minimum cut (cut-size = 1) which partitions the tree into two balanced parts.

The algorithm returns $(4, T_{v_3})$, where $p(4, T_{v_3}) = 4$, which is within $\varepsilon = 20\%$ of the optimal answer.

Theorem 2. Approx-Rooted-Tree-Subset-Sum is a fully polynomial-time approximation scheme for the Rooted-Tree Subset-Sum Problem.

Proof. Lines 6–8 trim $L_i$ and remove from $L_i$ each element $y$ where $p(y)$ is greater than $h$. The remaining elements of $L_i$ are generated by selecting a subset of $R$ and applying a sequence of $\uplus$ operations to the selected elements. Therefore, the element $x^*$ returned in line 9 is indeed derived from a subset of $R$. It remains to show that $p(x^*)$ is not smaller than $1 - \varepsilon$ times an optimal solution, and we must also show that the algorithm runs in polynomial time.

To show that the relative error of the returned answer is small, note that when list $L_i$ is trimmed, we introduce a relative error of at most $\varepsilon/l$ between the representative $p$ values of the elements remaining and the $p$ values of the elements before trimming. By induction on $i$, it can be shown that for each possible element $y$ in $L_i$ produced by the Exact-Rooted-Tree-Subset-Sum algorithm, there exists an element $x \in L_i$ produced by the Approx-Rooted-Tree-Subset-Sum algorithm such that

$(1 - \varepsilon/l)^i \, p(y) \le p(x) \le p(y). \quad (1)$

If $y^*$ denotes an optimal solution to the Rooted-Tree Subset-Sum problem, then there is an $x^* \in L_l$ such that

$(1 - \varepsilon/l)^l \, p(y^*) \le p(x^*) \le p(y^*), \quad (2)$

where $p(x^*)$ is the $p$ value of the element $x^*$ returned by Approx-Rooted-Tree-Subset-Sum. Since $l \ge 1 > \varepsilon$, it can be shown that

$\frac{d}{dl}\left(1 - \frac{\varepsilon}{l}\right)^l > 0. \quad (3)$

It implies that the function $(1 - \varepsilon/l)^l$ increases with $l$, so that $l > 1$ implies

$1 - \varepsilon < (1 - \varepsilon/l)^l \quad (4)$

and thus,

$(1 - \varepsilon)\, p(y^*) \le p(x^*). \quad (5)$

Therefore, the $p$ value of $x^*$ returned by Approx-Rooted-Tree-Subset-Sum is not smaller than $1 - \varepsilon$ times the $p$ value of the optimal solution $y^*$.

To show that this is a fully polynomial-time approximation scheme, we derive a bound on the length of $L_i$. After trimming, successive elements $x$ and $x'$ of $L_i$ must have the relationship $p(x)/p(x') > 1/(1 - \varepsilon/l)$. That is, their $p$ values must differ by a factor of at least $1/(1 - \varepsilon/l)$. Therefore, the number of elements in each $L_i$ is at most

$\log_{1/(1 - \varepsilon/l)} h = \frac{\ln h}{-\ln(1 - \varepsilon/l)} \le \frac{l \ln h}{\varepsilon} \quad (6)$

since $\ln(1 + i) \le i$ for $i > -1$. This bound is polynomial in the number $l$ of the given input elements, in the number of bits $\ln h$ needed to represent $h$, and in $1/\varepsilon$. Since the running time of Approx-Rooted-Tree-Subset-Sum is polynomial in the length of $L_i$, Approx-Rooted-Tree-Subset-Sum is a fully polynomial-time approximation scheme. □

Approx-Rooted-Tree-Subset-Sum tells us how to partition a rooted-tree circuit. If its results contain infeasible trees, we need to apply Approx-Rooted-Tree-Subset-Sum repeatedly.

For the case where $C_i$ (rooted at $v$) is an acyclic graph, we can perform a breadth-first search from node $v$ to obtain a rooted tree, and then apply Approx-Rooted-Tree-Subset-Sum to the tree.

3.2. Phase II: the probability-based algorithm

The probability-based iterative-improvement method extends the work in [12] to fit the architecture of Xilinx TMFPGAs.

3.2.1. Iterative-improvement approach

In the TMFPGA bipartitioning problem, the set $V$ of nodes is divided into two subsets $V_1$ and $V_2$, which represent the nodes in two stages. For any two nodes $u$, $v$ in $V$, if $Pre(u) \preceq Pre(v)$, then either $u$ and $v$ are in the same stage, or $u$ is in $V_1$ and $v$ is in $V_2$. Further, $V_1$ and $V_2$ must satisfy the balance constraint. The size of $Cut(2, 1)$ equals the total number of registers in the circuit, which cannot be reduced. Therefore, we only need to minimize the size of $Cut(1, 2)$ in the TMFPGA bipartitioning problem. In the second phase of CPAT, we present PAT (Probability-based Algorithm for TMFPGA), which applies a probability-based, iterative-improvement approach to minimize the size of $Cut(1, 2)$. (Fig. 10 summarizes PAT.) We first use a topological sort to obtain an initial partitioning that satisfies the balance and the precedence constraints (line 1 in Fig. 10). During the iterative improvement, each node is assigned a gain, representing the benefit of moving the node to the other subset. In each pass (lines 4–18 in Fig. 10), we choose a node with the largest gain and check whether moving it would violate the balance or the precedence constraint. If the move is feasible, the node is temporarily moved and locked. The best sequence of moves is then selected and made permanent. This process is repeated until no better cut size is found.

3.2.2. The precedence constraint

Because of the precedence constraint, moving a node to the other subset may not be valid. For C-type nodes, we use the following two rules to judge if a node can be moved:

R1: A C-type node v in V1 can be moved if all its successors in V1 have been moved.

R2: A C-type node v in V2 can be moved if all its ancestors in V2 have been moved.

For example, in Fig. 8(a), $v_2$ cannot be moved according to Rule R1. In Fig. 8(b), $v_3$ cannot be moved according to Rule R2.

For FF-type nodes, we use the following rule to judge if a node can be moved:

R3: An FF-type node v in V2 can be moved if all its successors and ancestors in V2 have been moved.

After a node v is moved to the other stage, some of its neighbors may also be blocked in that stage due to the precedence constraint. We use the following two rules to determine whether such neighbors should be blocked (see line 14 in Fig. 10):

R4: If v is moved from V1 to V2, all its successors should be blocked in V2.

R5: If v is moved from V2 to V1, all its ancestors should be blocked in V1.
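Rules R1 and R2 for C-type nodes can be expressed as a small predicate. The sketch below is illustrative only: the function name `can_move_c_node`, the dictionaries `succ` and `pred` holding each node's successors and ancestors, and the `moved` set of nodes already moved in the current pass are assumed data structures, and the rules for FF-type nodes and for blocking (R3 to R5) are not covered.

```python
def can_move_c_node(v, side, succ, pred, moved):
    """Apply R1/R2 to a C-type node v.

    side:  dict node -> 1 or 2, the subset each node started the pass in
    succ:  dict node -> list of successors of that node
    pred:  dict node -> list of ancestors of that node
    moved: set of nodes that have already been moved in this pass
    """
    if side[v] == 1:
        # R1: all successors of v that are in V1 must have been moved.
        return all(w in moved for w in succ.get(v, []) if side[w] == 1)
    # R2: all ancestors of v that are in V2 must have been moved.
    return all(w in moved for w in pred.get(v, []) if side[w] == 2)


# Hypothetical chain x -> y -> z with x, y in V1 and z in V2.
side = {"x": 1, "y": 1, "z": 2}
succ = {"x": ["y"], "y": ["z"], "z": []}
pred = {"x": [], "y": ["x"], "z": ["y"]}
print(can_move_c_node("x", side, succ, pred, moved=set()))
print(can_move_c_node("y", side, succ, pred, moved=set()))
```

In the hypothetical chain, `x` is blocked until `y` has been moved, while `y` and `z` are movable immediately because none of the nodes they depend on for the rule sits in the same subset.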

3.2.3. Gains of nodes

In the PAT, each node is given a probability for moving it to the other set. Based on these probabilities, an expected gain of moving a node to the other subset can be evaluated. Before detailing how to compute gains, we shall introduce some notation ﬁrst.

Fig. 8. Panels (a) and (b), each showing nodes $v_1$ to $v_5$ divided between $V_1$ and $V_2$.

Cutset: set of nets which need micro registers in $Cut(1, 2)$. In other words, a C-type net is not in Cutset if all its nodes are in $V_1$ or all are in $V_2$; an FF-type net is not in Cutset only if its fan-out node is in $V_2$ and all its fan-in nodes are in $V_1$.

$c(n)$: cost of net $n$.

$p(u)$: probability of moving node $u$ to the other stage.

$n_i^{1 \to 2}$: event that net $n_i$ is removed from Cutset by moving nodes to $V_2$. For a C-type net $n_i$ in Cutset, $n_i^{1 \to 2}$ means all its nodes in $V_1$ are moved to $V_2$; for an FF-type net $n_i$ in Cutset, $n_i^{1 \to 2}$ means all nodes are originally in $V_1$ and then its fan-out node is moved to $V_2$.

$n_i^{2 \to 1}$: event that net $n_i$ is removed from Cutset by moving nodes to $V_1$. For a C-type net $n_i$ in Cutset, $n_i^{2 \to 1}$ means all its nodes in $V_2$ are moved to $V_1$; for an FF-type net $n_i$ in Cutset, $n_i^{2 \to 1}$ means its fan-out node is originally in $V_2$ and then all its fan-in nodes are moved to $V_1$.

$p(n_i^{a \to b})$: probability of net $n_i$ being removed from Cutset by moving all of $n_i$'s nodes in $V_a$ to $V_b$.

$p(n_i^{a \to b} \mid u)$: probability of removing net $n_i$ from Cutset by moving nodes to $V_b$ in the condition that node $u$ is originally in $V_a$ and then is moved to $V_b$.

$p(n_i^{a \to b} \mid u^c)$: probability of removing net $n_i$ from Cutset by moving nodes to $V_b$ in the condition that node $u$ is originally in $V_b$ and then stays in $V_b$.

$f_{n_i}$: fan-out node of net $n_i$.

$S_a(u)$: set of successors of $u$ in stage $V_a$.

$A_a(u)$: set of ancestors of $u$ in stage $V_a$.

$E_a(n_i, n_j)$: set of nodes in $V_a$ that are both in nets $n_i$ and $n_j$. E.g., in Fig. 9, $E_1(n_5, n_6) = \emptyset$ and $E_2(n_5, n_6) = \{v_5\}$.

$N_a(u)$: set of $u$'s neighbors in $V_a$.

$I(u)$: set of nets which contain node $u$.

$M_a(n)$: set of nets that have common nodes with net $n$ in $V_a$. E.g., in Fig. 9, $M_1(n_5) = \{n_4, n_7\}$ and $M_2(n_5) = \{n_2, n_6\}$.

$e(n_i^{a \to b})$: expected gain in the condition that net $n_i$ is moved from $V_a$ to $V_b$.

$e_{n_j}(n_i^{a \to b})$: expected gain contributed by $n_j$ in the condition that net $n_i$ is moved from $V_a$ to $V_b$.

$g(u)$: gain of node $u$.

$g_{n_j}(u)$: gain of node $u$ contributed by $n_j$.

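According to the definitions of $n_i^{1 \to 2}$ and $n_i^{2 \to 1}$, we have the following equations. For a C-type net $n_i$ with node $u$, $u \in V_a$,

$p(n_i^{a \to b} \mid u) = \prod_{v \in N_a(n_i) - \{u\}} p(v), \qquad p(n_i^{b \to a} \mid u^c) = \prod_{v \in N_b(n_i)} p(v).$

For an FF-type net $n_i$ with node $u$, where $v \in V_1$ for all $v \in n_i$,

$p(n_i^{1 \to 2} \mid u) = \begin{cases} 1 & \text{if } u = f_{n_i} \\ 0 & \text{otherwise} \end{cases} \qquad p(n_i^{1 \to 2} \mid u^c) = \begin{cases} p(f_{n_i}) & \text{if } u \ne f_{n_i} \\ 0 & \text{otherwise} \end{cases} \qquad p(n_i^{2 \to 1} \mid u) = 0.$

For an FF-type net $n_i$ with node $u$, $f_{n_i} \in V_2$,

$p(n_i^{2 \to 1} \mid u) = \begin{cases} \prod_{v \in N_2(n_i) - \{f_{n_i}, u\}} p(v) & \text{if } u \in N_2(n_i) - \{f_{n_i}\} \\ 0 & \text{otherwise} \end{cases}$

$p(n_i^{2 \to 1} \mid u^c) = \begin{cases} \prod_{v \in N_2(n_i) - \{f_{n_i}\}} p(v) & \text{if } u \in N_1(n_i) \\ 0 & \text{otherwise} \end{cases} \qquad p(n_i^{1 \to 2} \mid u) = 0.$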
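For the C-type case, the two conditional probabilities are plain products over node probabilities, which the following Python sketch evaluates. The function names, the `side` and `prob` dictionaries, and the sample values are assumptions for illustration; the FF-type cases are not covered.

```python
from math import prod


def p_removed_if_u_moves(net_nodes, side, prob, u):
    """p(n^{a->b} | u): u leaves V_a, so every other node of the net
    that is still in V_a must also move."""
    a = side[u]
    return prod(prob[v] for v in net_nodes if side[v] == a and v != u)


def p_removed_if_u_stays(net_nodes, side, prob, u):
    """p(n^{b->a} | u^c): u stays in V_a, so every node of the net
    in the other subset V_b must move to V_a."""
    b = 3 - side[u]  # the subset (1 or 2) that u is not in
    return prod(prob[v] for v in net_nodes if side[v] == b)


# Hypothetical C-type net with two nodes on each side.
net = ["u", "v", "w", "x"]
side = {"u": 1, "v": 1, "w": 2, "x": 2}
prob = {"u": 0.9, "v": 0.5, "w": 0.4, "x": 0.5}
print(p_removed_if_u_moves(net, side, prob, "u"))
print(p_removed_if_u_stays(net, side, prob, "u"))
```

Note that the probability of `u` itself never enters either product, since both quantities are conditioned on what `u` does.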

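Moving a net $n_i$ to some stage will affect the move of the other nets that have common nodes with net $n_i$. This is called the 2nd-order information [12]. Therefore, the expected gain for removing a net from Cutset should be considered:

$e(n_i^{a \to b}) = \sum_{n_j \in M_a(n_i)} e_{n_j}(n_i^{a \to b}).$

For two C-type nets $n_i$ and $n_j$, $n_i \cap n_j \ne \emptyset$,

$e_{n_j}(n_i^{a \to b}) = \frac{c(n_j)\, p(n_j^{a \to b})}{\prod_{v \in E_a(n_i, n_j)} p(v)}. \quad (7)$

For a C-type net $n_i$ and an FF-type net $n_j$, $n_i \cap n_j \ne \emptyset$,

(1) if $f_{n_j} \in V_1$,

$e_{n_j}(n_i^{1 \to 2}) = \begin{cases} c(n_j) & \text{if } E_1(n_i, n_j) = \{f_{n_j}\} \\ 0 & \text{otherwise} \end{cases} \qquad e_{n_j}(n_i^{2 \to 1}) = 0.$

(2) if $f_{n_j} \in V_2$,

$e_{n_j}(n_i^{2 \to 1}) = \begin{cases} \dfrac{c(n_j)\, p(n_j^{2 \to 1})}{\prod_{v \in E_2(n_i, n_j)} p(v)} & \text{if } f_{n_j} \notin E_2(n_i, n_j) \\ 0 & \text{otherwise} \end{cases} \qquad e_{n_j}(n_i^{1 \to 2}) = 0.$

For an FF-type net $n_i$ and a C-type net $n_j$, $n_i \cap n_j \ne \emptyset$,

$e_{n_j}(n_i^{1 \to 2}) = \begin{cases} c(n_j)\, p(n_j^{1 \to 2}) / p(f_{n_i}) & \text{if } f_{n_i} \in n_j \\ 0 & \text{otherwise} \end{cases}$

$e_{n_j}(n_i^{2 \to 1}) = \begin{cases} \dfrac{c(n_j)\, p(n_j^{2 \to 1})}{\prod_{v \in E_2(n_i, n_j)} p(v)} & \text{if } f_{n_i} \notin n_j \\ 0 & \text{otherwise.} \end{cases}$

For two FF-type nets $n_i$ and $n_j$, $n_i \cap n_j \ne \emptyset$,

(1) if $f_{n_j} \in V_1$,

$e_{n_j}(n_i^{1 \to 2}) = \begin{cases} c(n_j) & \text{if } f_{n_i} = f_{n_j} \\ 0 & \text{otherwise} \end{cases} \qquad e_{n_j}(n_i^{2 \to 1}) = 0.$

(2) if $f_{n_j} \in V_2$,

$e_{n_j}(n_i^{2 \to 1}) = \begin{cases} \dfrac{c(n_j)\, p(n_j^{2 \to 1})}{\prod_{v \in E_2(n_i, n_j) - \{f_{n_i}\}} p(v)} & \text{if } f_{n_j} \notin n_i \\ 0 & \text{otherwise} \end{cases} \qquad e_{n_j}(n_i^{1 \to 2}) = 0.$

If net $n_j$ in the above cases is not in Cutset originally and is moved into Cutset in the condition of $n_i^{a \to b}$, the term $-c(n_j)$ should be incorporated into $e_{n_j}(n_i^{a \to b})$. For example, for two C-type nets $n_i$ and $n_j$, $n_j \in V_a$,

$e_{n_j}(n_i^{a \to b}) = -c(n_j) + \frac{c(n_j)\, p(n_j^{a \to b})}{\prod_{v \in E_a(n_i, n_j)} p(v)}. \quad (8)$

Using the above equations, we can compute $g_{n_i}(u)$ as follows:

(1) if $n_i$ is C-type,

$g_{n_i}(u) = \left(c(n_i) + e(n_i^{a \to b})\right) p(n_i^{a \to b} \mid u) - \left(c(n_i) + e(n_i^{b \to a})\right) p(n_i^{b \to a} \mid u^c). \quad (9)$

(2) if $n_i$ is FF-type,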


Thus, the gain of a node $u$ is given by

$g(u) = \sum_{n_i \in I(u)} g_{n_i}(u). \quad (11)$

The probability of a node represents the likelihood that the node will be moved. A node with a greater gain has a higher probability of being moved. Thus, we can obtain the probability of a node from a monotonically increasing mapping function of its gain. (In our experiments shown in the next section, we used an increasing linear function.) This creates an interdependency between probabilities and gains, since the gains are themselves computed from the probabilities of nodes, as shown in the above equations. To break this circular dependency, we initially give each node the probability 0.5 in our experiments. We then repeatedly compute gains and probabilities from each other until they are stable enough, which gives the initial probabilities (line 6 in Fig. 10). In practice, three iterations are enough to reach a stable state. The probability-based algorithm PAT is summarized in Fig. 10.

(18)

3.3. The timing constraint

The speed of a TMFPGA is determined by the maximum execution time of a micro-cycle. Therefore, we must reduce the longest path in a micro-cycle. In PAT, the lengths of the longest paths in both stages are upper bounded by $\lceil D/2 \rceil$, where $D$ is the length of the longest path in the circuit.

For a node $v$, let $d_O(v)$ denote the length of the longest path from $v$ to primary outputs and $d_I(v)$ denote the length of the longest path from primary inputs to $v$. A node $v$ cannot be put in $V_1$ if $d_I(v)$ is more than $\lceil D/2 \rceil$, because there would then exist a path of length more than $\lceil D/2 \rceil$ from a primary input to $v$ in $V_1$. For the same reason, a node $v$ cannot be put in $V_2$ if $d_O(v)$ is more than $\lceil D/2 \rceil$. According to the above rules, the nodes that may violate the timing constraint are fixed in proper stages before the clustering phase.
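The pre-assignment rule above can be sketched in a few lines. The sketch assumes the circuit is given as an acyclic directed graph and that path length is counted in edges; both are assumptions made for illustration, and the function names `longest_path_lengths` and `prefix_stages` are hypothetical.

```python
from collections import defaultdict, deque
from math import ceil


def longest_path_lengths(nodes, edges):
    """Return (d_in, d_out): longest path length, in edges, into and out of each node."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    # Kahn's algorithm gives a topological order of the DAG.
    order = []
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph must be acyclic")
    d_in = {v: 0 for v in nodes}
    for v in order:                      # forward pass: distance from inputs
        for w in succ[v]:
            d_in[w] = max(d_in[w], d_in[v] + 1)
    d_out = {v: 0 for v in nodes}
    for v in reversed(order):            # backward pass: distance to outputs
        for w in succ[v]:
            d_out[v] = max(d_out[v], d_out[w] + 1)
    return d_in, d_out


def prefix_stages(nodes, edges):
    """Fix nodes that would violate the ceil(D/2) bound in one of the two stages."""
    d_in, d_out = longest_path_lengths(nodes, edges)
    bound = ceil(max(d_in.values(), default=0) / 2)
    fixed = {}
    for v in nodes:
        if d_in[v] > bound:
            fixed[v] = 2                 # cannot be in V1, so fix it in V2
        elif d_out[v] > bound:
            fixed[v] = 1                 # cannot be in V2, so fix it in V1
    return fixed


# Toy usage: a five-node chain a -> b -> c -> d -> e (D = 4, bound = 2).
chain_nodes = ["a", "b", "c", "d", "e"]
chain_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
chain_fixed = prefix_stages(chain_nodes, chain_edges)
```

In the chain example, only the middle node `c` stays free to be placed in either stage.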

4. Experimental results

The probability-based algorithm, PAT, and the clustering- and probability-based algorithm, CPAT, were implemented in C++ on a PC with a Pentium II 300 microprocessor and 128 MB RAM, and tested on the MCNC Partitioning93 benchmark circuits. The characteristics of the circuits are shown in Table 1. Columns 2 and 3 in Table 1 list the numbers of nodes and nets, respectively, in each circuit. In Table 2, we compare PAT and CPAT. Columns 2–4 in Table 2 compare the maximum numbers of micro registers, and Columns 5–7 compare the runtimes. The results show that PAT performs better for the smaller circuits (those with fewer than 6000 nodes in the benchmark set), while CPAT obtains better results for the larger circuits (those with more than 6000 nodes). This implies that the clustering algorithm in CPAT leads to a considerable improvement once the circuit size exceeds a certain bound. In addition, the clustering algorithm in CPAT substantially reduces the problem size and thus the runtime. However, the clustering algorithm binds some nodes into a cluster, and in the subsequent probability-based phase all nodes of a cluster must be moved, or not moved, together. Therefore, when the circuit is small, the clustering in CPAT might hide the connectivity between nodes and nets, so that the subsequent probability-based algorithm does not have sufficient information to find a better result.

Table 1

Benchmark circuit characteristics

Circuit   No. of nodes   No. of nets
c3540     1038           1016
c5315     1778           1655
c6288     2856           2824
c7552     2247           2140
s820      340            314
s838      495            459
s1423     831            750
s9234     6098           5846
s13207    9445           8653
s15850    11071          10385
s35932    19880          17830
s38417    25589          23845
s38584    22451          20719


Table 2

Comparison between PAT and CPAT

Circuit   Max. no. of registers        Runtime (s)
          PAT    CPAT   Imprv. (%)     PAT     CPAT    Imprv. (%)
c3540     126    152    -17.1          3       3       0
c5315     157    174    -9.8           11      4       +63.7
s820      43     61     -29.5          4       2       +50.0
s838      72     93     -22.6          1       1       0
s1423     106    120    -11.7          3       2       +33.3
s9234     430    402    +6.5           29      25      +13.8
s13207    838    838    0              190     136     +28.4
s15850    808    767    +5.0           163     104     +36.2
s35932    2138   2018   +5.6           20131   15715   +21.9
s38417    2628   2468   +6.0           1125    926     +17.7
s38584    3611   1451   +59.8          1766    932     +47.2
Average                 -0.7                           +28.4

In Table 3 (Table 4), we compare the performance of PAT (CPAT) on the smaller (larger) circuits with the approach in [7], the network-flow-based approach FBP-m [5], and the list-scheduling approach List [3,4] on the Xilinx TMFPGA model, in which a circuit is partitioned into eight stages. The size of a stage is bounded by a balance factor of 5% (the same as in [5]). Columns 2–5 list the maximum numbers of micro registers used by List, FBP-m, [7], and PAT (CPAT), respectively. Columns 6–8 list the percentage improvements of PAT (CPAT) over List, FBP-m, and [7], respectively. The results show that our PAT and CPAT algorithms outperform List and FBP-m by respective average reductions of 35.8% (40.0%) and 15.5% (22.3%) in the maximum numbers of micro registers required, and are comparable to the algorithm in [7]. This implies that the probability-based scheme is effective in reducing the interconnection for TMFPGAs.

Table 3

Results for the 8-stage TMFPGA partitioning of the smaller circuits

Circuit   Max. no. of registers               PAT Imprv. (%)
          List   FBP-m   Ref. [7]   PAT       List    FBP-m   Ref. [7]
c3540     177    166     198        126       +28.8   +24.0   +33.4
c5315     265    165     140        157       +40.7   +5.1    -12.1
c6288     117    114     83         114       +2.6    0       -37.3
c7552     453    392     210        260       +42.6   +33.7   -23.8
s820      91     81      52         43        +52.7   +46.9   +17.3
s838      131    71      70         72        +64.8   -1.4    -2.9
s1423     130    120     101        106       +18.5   +11.7   -5.0
Average                                       +35.8   +15.5   +1.0
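The text does not state how the improvement percentages are computed. Most table entries appear consistent with the usual relative reduction, (baseline - ours) / baseline, so the sketch below uses that formula as an assumption. The function name `improvement_percent` is hypothetical; the register counts in the usage lines are taken from Tables 3 and 4.

```python
def improvement_percent(baseline: float, ours: float) -> float:
    """Relative reduction of `ours` with respect to `baseline`, in percent.

    Assumed formula: 100 * (baseline - ours) / baseline.
    A negative value means `ours` needs more registers than the baseline.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - ours) / baseline


# c3540 in Table 3: List needs 177 registers, PAT needs 126.
pat_vs_list_c3540 = improvement_percent(177, 126)

# s9234 in Table 4: List needs 640, Ref. [7] needs 381, CPAT needs 402.
cpat_vs_list_s9234 = improvement_percent(640, 402)
cpat_vs_ref7_s9234 = improvement_percent(381, 402)
```

Under this reading, an entry printed without a plus sign in Tables 3 and 4 is a case where the earlier method needs fewer registers.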

5. Conclusion

We have presented the probability-based algorithm PAT for the TMFPGA partitioning problem. Experimental results have shown that our probability-based algorithm outperforms the previous works, the List scheduling and the network-flow-based method, by significant margins. Moreover, incorporating a clustering algorithm into PAT further improves the results for large circuits and the runtimes for all circuits.

Acknowledgements

The authors would like to thank Dr. Huiqun Liu for providing the benchmark circuits and helpful discussions on [5], and Prof. Ting-Chi Wang for his constructive comments.

Table 4

Numbers of registers needed by CPAT and the previous works for the larger circuits

Circuit   Max. no. of registers               CPAT Imprv. (%)
          List   FBP-m   Ref. [7]   CPAT      List    FBP-m   Ref. [7]
s9234     640    502     381        402       +37.2   +19.9   -5.5
s13207    1118   901     688        838       +25.0   +7.0    -21.8
s15850    1070   877     761        767       +28.3   +12.5   -0.8
s35932    3806   2950    2729       2018      +47.0   +31.6   +26.1
s38417    3546   2892    2194       2468      +30.4   +14.7   -12.5
s38584    5131   2796    2280       1451      +71.7   +48.1   +36.4
Average                                       +40.0   +22.3   +3.7

References

[1] S. Trimberger, A time-multiplexed FPGA, in: Proceedings of FCCM, 1997, pp. 22–28.

[2] N.B. Bhat, et al., Performance-oriented fully routable dynamic architecture for a field programmable logic device, Memorandum No. UCB/RELM93/42, UC Berkeley, 1993.

[3] D. Chang, M. Marek-Sadowska, Buffer minimization and time-multiplexed I/O on dynamically reconfigurable FPGAs, in: Proceedings of the FPGA Symposium, 1997, pp. 142–148.

[4] D. Chang, M. Marek-Sadowska, Partitioning sequential circuits on dynamically reconfigurable FPGAs, IEEE Trans. Comput. (1999) 565–578.

[5] H. Liu, D.F. Wong, Network flow based circuit partitioning for time-multiplexed FPGAs, in: Proceedings of ICCAD, 1998, pp. 497–504.

[6] H. Liu, D.F. Wong, A graph theoretic optimal algorithm for schedule compression in time-multiplexed FPGA partitioning, in: Proceedings of ICCAD, 1999, pp. 400–405.

[7] W.K. Mak, F.Y. Young, Temporal logic replication for dynamically reconﬁgurable FPGA partitioning, in: Proceedings of ISPD, 2002, pp. 190–195.

[8] C.M. Fiduccia, R.M. Mattheyses, A linear-time heuristic for improving network partitions, in: Proceedings of DAC, 1982, pp. 175–181.

[9] B.W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Tech. J. 49 (1970) 291–307.

[10] J. Cong, et al., Large scale circuit partitioning with loose/stable net removal and signal ﬂow based clustering, in: Proceedings of ICCAD, 1997, pp. 441–446.

[11] T.H. Cormen, C.E. Leiserson, R.L. Rivest, in: Introduction to Algorithms, The MIT Press, Cambridge, MA, 1990, pp. 951–983.

[12] S. Dutt, W. Deng, Partitioning using second-order information and stochastic-gain functions, in: Proceedings of International Symposium Physical Design, 1998, pp. 112–117.