Physical Design for Reconfigurable Computing System


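National Science Council, Executive Yuan: Research Project Interim Progress Report

Subproject 5: Physical Design for Reconfigurable Computing System (2/3)

Project type: Integrated project

Project number: NSC92-2215-E-002-018-

Project period: August 1, 2003 to July 31, 2004 (ROC years 92 to 93)

Executing institution: Graduate Institute of Electronics Engineering, National Taiwan University

Principal investigator: Yao-Wen Chang (張耀文)

Report type: Condensed report

Availability: This project is open to public inquiry

June 1, 2004 (ROC year 93)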


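Research on Reconfigurable Computing Techniques for Multimedia Communication Systems

Subproject 5: Physical Design for Reconfigurable Computing System (2/3)

Project number: NSC 92-2215-E-002-018

Project period: August 1, 2003 to July 31, 2004

Principal investigator: Associate Professor Yao-Wen Chang (張耀文), Graduate Institute of Electronics Engineering, National Taiwan University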

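I. Chinese Abstract

Dynamically reconfigurable FPGAs (DRFPGAs) are used to handle designs of high complexity and functionality because they can improve logic efficiency through time-sharing. In this report, we represent each task as a 3D box to handle the three-dimensional floorplanning/placement problem that arises for DRFPGAs. We propose a new tree-based representation, the T-tree. Each node has at most three children, which represent the spatial and temporal relations among tasks. We propose an efficient packing method and derive the conditions required to satisfy any temporal-ordering constraints induced by the execution of DRFPGAs. Experimental results show that, compared with the most recent state-of-the-art representation, our tree-based representation obtains very good solutions in less time.

Keywords: reconfigurable system, reconfigurable computing, physical design, 3D floorplanning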

II. English Abstract

Improving logic capacity by time-sharing, dynamically reconfigurable FPGAs are employed to handle designs of high complexity and functionality. In this report, we model each task as a 3D-box and deal with the temporal floorplanning/placement problem for dynamically reconfigurable FPGA architectures. We present a tree-based formulation, called T-tree, to represent the spatial and temporal relations among tasks. Each node in a T-tree has at most three children which represent the dimensional relationship among tasks. We present an efficient packing method and derive the condition to ensure the satisfaction of precedence constraints which model the temporal ordering among tasks induced by the execution of dynamically reconfigurable FPGAs. Experimental results show that our tree-based formulation can achieve significantly better solution quality with less execution time than the most recent state-of-the-art work.

Keywords: reconfigurable system, reconfigurable computing, physical design, 3D floorplanning

III. Background and Objectives

1. Background

A Field Programmable Gate Array (FPGA) typically consists of regular, identical reconfigurable cells (logic blocks) and the interconnects around these blocks. Traditionally, an FPGA can only implement circuits by loading serial configuration bit-streams into the chip at start-up, and the whole chip must be reconfigured at once. Recently, vendors have proposed several new architectures, such as the Atmel AT40K series [1], the Xilinx XC6200 series [7], and the Xilinx Virtex II series [11]. These new-generation FPGAs are partitionable and partially reconfigurable: several tasks and circuits can share the same physical locations at different times, and part of the chip can be reconfigured at run-time.

Because recent FPGAs support partial reconfiguration, studies have shown that an FPGA-based computation hardware system can improve performance for many applications [7, 9]. A reconfigurable system usually consists of a host processor and an FPGA coprocessor called a reconfigurable function unit (RFU) [2]. During the execution of one program, an RFU may have several configurations at different times. Figure 1(a) shows a program code that can be mapped into four RFU operations (RFUOPs or modules). Since each RFUOP must be placed on the RFU and has its own execution time, we may denote each RFUOP as a 3D-box. Because of the area constraint, we cannot load all RFUOPs at the same time. Thus, at time 2, RFUOP 3 is swapped out and RFUOP 4 is swapped in. The question of how to place these RFUOPs becomes a 3D-placement problem. Each module is represented as a 3D-box with the spatial dimensions X and Y and the temporal dimension T. There exist temporal ordering requirements among tasks because one task's input may be another task's output. The goal of temporal floorplanning is to schedule all modules on an RFU so that the specified objective function (e.g., the product of chip area and execution time, which is the volume of the 3D floorplan/placement) is optimized and no two modules violate the temporal constraints.


Figure 1. (a) A running program. (b) A 3D-placement of the running program.

One significant purpose of a temporal floorplanner is to serve as a scheduler embedded in the compiler of the host CPU. For some applications (for example, DSP applications), the flow of the program is known in advance. Thus, the scheduler can schedule all RFUOPs that must be executed on the RFU before the program starts. The scheduler can also optimize the configuration of the RFU in various ways, for example by reducing the reconfiguration overhead.

IV. Research Methods

We discuss the underlying techniques, approaches, and solutions for handling the proposed problems.

1. Problem formulation for the temporal floorplanning

In the reconfigurable architecture, a task v is loaded into the device for a period of time for execution. Therefore, each task can be represented as a 3D module with the spatial dimensions x and y and the temporal dimension t. Throughout this report, we use task and module interchangeably. Let V = {v_1, v_2, ..., v_m} be a set of m tasks whose widths, heights, and execution times are given by W_i, H_i, and T_i, 1 ≤ i ≤ m. We use (x_i, y_i) and (x'_i, y'_i) to denote the coordinates of the bottom-left and top-right corners of a task v_i, and t_i and t'_i to denote its starting and ending times, 1 ≤ i ≤ m, as scheduled in the reconfigurable device. These tasks often need to be executed in a specific order because one task's input could be another task's output. The temporal ordering among tasks is referred to as the precedence constraint in the 3D floorplanning problem. Let D = {(v_i, v_j) | 1 ≤ i, j ≤ m, i ≠ j} denote the set of precedence constraints, where (v_i, v_j) means that v_i must be executed before v_j. The precedence constraints must not be violated during floorplanning/placement.

In order to measure the quality of a floorplan, we consider the same objectives as in [4] and [12]. They are volume, wirelength, communication and reconfiguration overheads.

The definitions of these four objective functions are given below.

Volume (the minimum bounding box of a placement): In temporal floorplanning, we need to consider the trade-off between the area of a device and the total execution time. If we use a larger device, the total execution time can be shortened. In contrast, execution takes longer if a smaller device is used. Therefore, we minimize the product of the area of the device and the total execution time.

Wirelength (the sum of the half-perimeters of the bounding boxes of the interconnections): Due to the special architecture of the reconfigurable device, the method used to estimate the wirelength in temporal floorplanning differs from that of the traditional floorplanning/placement problem. Given a net, the nodes in the net may be executed at the same time or at different times. If they are executed at the same time, we can estimate the wirelength directly from their geometric distance. If they are executed in different time frames, however, we have to project all nodes onto the same time frame before computing their wirelength.

Communication overhead: We quantify the communication overhead based on the Xilinx Virtex XCV1000. Similar to the work by Fekete et al. [4], we assume that a task communicates with another task (data dependence) in the following way: the results of a CLB, which are read by the successor task, are first written to external memory through a bus interface. The dependent task, which has been loaded at the specified position, then performs a read-in of the results. Recall that a frame is the atomic unit that can be written to or read from. Each frame contains 1248 bits and the bus width is only 8 bits. Thus, each read-in or read-out takes approximately 1248/8 + 24 = 180 clock cycles, where the 24 cycles are used to configure the bus interface as described in the Xilinx FPGA data book [10]. Therefore, if data in f columns need to be transferred, the communication overhead of each reconfiguration is 360 × f clock cycles, because the data must first be written to the external memory and then read back.

Reconfiguration overhead: As described in [10], the Xilinx Virtex XCV1000 is column-oriented (i.e., all bits in one column must be updated in each read-in or read-out). Suppose that a task v_i occupies W_i × H_i CLBs. We have to reconfigure W_i columns of CLBs in each reconfiguration. As an example, each CLB column in a Virtex FPGA consists of 48 frames, so configuring one CLB column takes (1248/8) × 48 + 24 = 7512 clock cycles. This means we need W_i × 7512 clock cycles in total if the addresses in the column are incrementally updated.
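The cycle counts and the projected wirelength described above can be checked with a short sketch. The constants are the ones quoted in the text; the function names, and the choice to represent each net node by a single (x, y, t) point, are assumptions made only for illustration.

```python
FRAME_BITS = 1248         # bits per configuration frame (from the text)
BUS_WIDTH_BITS = 8        # bus width in bits (from the text)
BUS_SETUP_CYCLES = 24     # cycles to configure the bus interface (from the text)
FRAMES_PER_COLUMN = 48    # frames per CLB column (from the text)


def frame_cycles():
    """Cycles for one read-in or read-out of a single frame."""
    return FRAME_BITS // BUS_WIDTH_BITS + BUS_SETUP_CYCLES


def communication_cycles(columns):
    """Write-out plus read-back for data spread over `columns` columns."""
    return 2 * frame_cycles() * columns


def column_config_cycles():
    """Cycles to configure all frames of one CLB column."""
    return (FRAME_BITS // BUS_WIDTH_BITS) * FRAMES_PER_COLUMN + BUS_SETUP_CYCLES


def reconfiguration_cycles(task_width):
    """Cycles to reconfigure a task that spans `task_width` CLB columns."""
    return task_width * column_config_cycles()


def projected_hpwl(pins):
    """Half-perimeter wirelength of one net projected onto one time frame.

    `pins` is a list of (x, y, t) points (hypothetical pin model);
    ignoring t is the projection onto a common time frame.
    """
    xs = [x for x, _, _ in pins]
    ys = [y for _, y, _ in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```

For instance, `communication_cycles(1)` reproduces the 360 cycles per column and `column_config_cycles()` the 7512 cycles per CLB column stated above.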

In this report, we treat a task v_i as a 3D box, and a placement is an assignment of (x_i, y_i, t_i) for each v_i, 1 ≤ i ≤ m, such that no two boxes overlap and all precedence constraints are satisfied. The goal of temporal floorplanning is to optimize a predefined cost metric (defined above) induced by a placement.
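As a minimal sketch of this definition, the following check tests whether an assignment of coordinates is a valid placement and computes the bounding-box volume. The `Box` type, its field names, and the dictionary-based interface are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: int    # bottom-left x
    y: int    # bottom-left y
    t: int    # starting time
    w: int    # width W_i
    h: int    # height H_i
    dur: int  # execution time T_i


def overlap(a, b):
    """True if the two boxes share interior volume in x, y, and t."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h and
            a.t < b.t + b.dur and b.t < a.t + a.dur)


def is_feasible(boxes, precedence):
    """boxes: dict name -> Box; precedence: iterable of (before, after) names."""
    names = list(boxes)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if overlap(boxes[p], boxes[q]):
                return False
    # A predecessor must finish no later than its successor starts.
    return all(boxes[u].t + boxes[u].dur <= boxes[v].t for u, v in precedence)


def volume(boxes):
    """Volume of the minimum bounding box of the placement."""
    bs = list(boxes.values())
    width = max(b.x + b.w for b in bs) - min(b.x for b in bs)
    height = max(b.y + b.h for b in bs) - min(b.y for b in bs)
    time = max(b.t + b.dur for b in bs) - min(b.t for b in bs)
    return width * height * time
```

The pairwise overlap test is quadratic in the number of tasks, which is adequate for a checker but is not meant to reflect how the report's packing method works.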

2. Techniques and Approaches

2.1. The T-tree representation:

2.1.1. The structure of the T-tree

Figure 2: The structure of a T-tree.

A T-tree has at most three children at each node, as shown in Figure 2. The T-tree represents the geometric relationships between two modules as follows. If node n_j is the left child of node n_i, module v_j must be placed adjacent to module v_i in the T+ direction, i.e., t_j = t_i + T_i. If node n_k is the middle child of node n_i, module v_k must be placed in the Y+ direction of module v_i, with the t-coordinate of v_k equal to that of v_i, i.e., t_k = t_i and y_k > y_i. If node n_l is the right child of node n_i, module v_l must be placed in the X+ direction of module v_i, with the t- and y-coordinates equal to those of v_i, i.e., t_l = t_i and y_l = y_i.
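One possible in-memory form of such a node is sketched below. The class name `TNode`, its fields, and the `preorder` helper are illustrative assumptions, not the data structure used in the reported implementation.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TNode:
    name: str
    width: int
    height: int
    duration: int
    left: Optional["TNode"] = None    # T+ child: t = parent.t + parent.duration
    middle: Optional["TNode"] = None  # Y+ child: same t as parent
    right: Optional["TNode"] = None   # X+ child: same t and y as parent


def preorder(root: Optional[TNode]) -> Iterator[TNode]:
    """Visit nodes in DFS order: node, left sub-tree, middle, right."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        yield node
        # Push in reverse so that the left child is popped first.
        stack.extend([node.right, node.middle, node.left])
```

An explicit stack is used instead of recursion so that deep, chain-like trees do not hit the interpreter's recursion limit.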

Figure 3: A compacted placement and the corresponding T-tree.

2.1.2. From a Compacted Placement to its T-tree

Given a compacted placement, we can represent it by a unique T-tree. A placement is said to be compacted if and only if no module can be moved in the X-, Y-, or T- direction while the other modules are fixed. The root of the tree corresponds to the task at the origin (bottom-left) of a placement. We construct a T-tree for a compacted placement in a DFS manner: starting from the root, we recursively construct the left sub-tree, then the middle sub-tree, and finally the right sub-tree. Let L_i denote the set of tasks that are adjacent to v_i in the T+ direction. The left child of node n_i corresponds to the lowest task of L_i in the X-Y plane. The middle child of node n_i corresponds to the first task in the Y+ direction whose t-coordinate equals that of v_i. The right child of node n_i represents the first task in the X+ direction whose y- and t-coordinates equal those of v_i. A compacted placement can be transformed to its corresponding T-tree in linear time. Figure 3 shows a compacted placement and its corresponding T-tree.

2.1.3. From a T-tree to its Placement

Now we describe the packing method for a T-tree. The t-coordinate of each module can be easily obtained by traversing the T-tree in DFS order. If node n_j is the left child of node n_i, t_j = t_i + T_i; otherwise, t_j = t_i. Once the t-coordinates are fixed, we can use the existing tree solutions in [5] and [3] to compute the y-coordinates. We first decompose a T-tree into a set of binary trees. The T-tree decomposition process is shown in Figure 4. Starting from the root, we traverse the T-tree in DFS order. When we encounter a node that has a right child (n_b in the example shown in Figure 4), we decompose the tree into two trees: one is the right sub-tree of n_b, and the other is the original tree without the right sub-tree of n_b. The same decomposition procedure is applied to each sub-tree until a leaf node is encountered. For each binary tree, we adopt the contour data structure presented in [5] and [3] to determine the y-coordinate of each module. The contour structure is a doubly linked list of modules that records the contour line of the current compaction. To compute the x-coordinates, we maintain a list lst that stores all tasks whose t- and y-coordinates are already determined. The x-coordinate of task v_i is

x_i = max{ x'_k | k is in lst and the projections of v_k and v_i overlap on the Y-T plane }.
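The two simpler steps of the packing method, the t-coordinate traversal and the x-coordinate rule, can be sketched as follows. The contour-based y-coordinate step from [5] and [3] is not reproduced here. The dictionary encoding of the tree (`children[name] = (left, middle, right)`) and the default x-coordinate of 0 when nothing overlaps are assumptions of this sketch.

```python
def assign_t(root, children, duration):
    """Return {task: t} from a T-tree given as children[name] = (left, middle, right)."""
    t = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        left, middle, right = children.get(node, (None, None, None))
        if left is not None:
            # Left child starts when its parent finishes.
            t[left] = t[node] + duration[node]
            stack.append(left)
        for child in (middle, right):
            if child is not None:
                # Middle and right children start together with the parent.
                t[child] = t[node]
                stack.append(child)
    return t


def _overlap_1d(lo1, len1, lo2, len2):
    return lo1 < lo2 + len2 and lo2 < lo1 + len1


def x_coordinate(task, placed):
    """Largest right edge x'_k among placed tasks overlapping `task` on the Y-T plane.

    `task` needs keys y, h, t, d; each entry of `placed` also needs x and w.
    Returns 0 when no placed task overlaps (assumed device origin).
    """
    right_edges = [p["x"] + p["w"] for p in placed
                   if _overlap_1d(task["y"], task["h"], p["y"], p["h"])
                   and _overlap_1d(task["t"], task["d"], p["t"], p["d"])]
    return max(right_edges, default=0)
```

Scanning the whole list for every task makes the x-coordinate step quadratic overall in this sketch.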

Figure 4: The T-tree decomposition process.

2.2. Temporal Floorplanning Algorithm

Our floorplanning algorithm is based on the simulated annealing method [8]. The cost function Φ used in the algorithm is given by

Φ = αV + βW + γO,

where V stands for the volume of the placement, W is the total wirelength, O is the sum of the reconfiguration and communication overheads, and α, β, and γ are user-specified constants. Given a T-tree (a feasible solution), we perturb the T-tree to obtain another feasible T-tree by using the following three operations:

Move: move a task to another place.

Swap: swap two tasks.

Rotate: rotate a task.
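A generic simulated-annealing loop around such a cost function might look like the sketch below. The cooling schedule, the parameter defaults, and the Metropolis acceptance rule are standard textbook choices assumed here; the report does not state which schedule it uses. `perturb` stands in for the Move, Swap, and Rotate operations.

```python
import math
import random


def weighted_cost(volume, wirelength, overhead, alpha=1.0, beta=1.0, gamma=1.0):
    """Cost = alpha * V + beta * W + gamma * O."""
    return alpha * volume + beta * wirelength + gamma * overhead


def anneal(initial, evaluate, perturb, t_start=100.0, t_end=0.1,
           cooling=0.9, moves_per_temp=50, rng=None):
    """Return (best_state, best_cost) found by simulated annealing.

    evaluate(state) -> cost; perturb(state, rng) -> new state (hypothetical
    callbacks supplied by the caller).
    """
    rng = rng or random.Random(0)
    current, current_cost = initial, evaluate(initial)
    best, best_cost = current, current_cost
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            candidate = perturb(current, rng)
            cand_cost = evaluate(candidate)
            delta = cand_cost - current_cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_cost = candidate, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost
```

Because the best state seen so far is tracked separately, the returned cost is never worse than that of the initial solution.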

2.2.1. Feasibility Detection and Tree Re-construction

To maintain the temporal orderings among tasks, we need to guarantee that a T-tree meets all the precedence constraints after each perturbation. Of the three operations mentioned above, Move and Swap might violate the temporal constraints. Therefore, in this section, we describe how to examine the feasibility of a T-tree and the procedure to re-construct a T-tree so that it meets the precedence constraints.

Based on the structure of the T-tree, we know that if node n_j is in the left sub-tree of n_i, task v_j must be executed after task v_i. Therefore, to ensure that no precedence constraint is violated, a node n_k must be placed in the left sub-tree of n_p, where n_p has the latest ending time among the tasks that must be executed before task v_k.
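The condition above can be turned into a simple structural check. In this sketch the tree is again encoded as `children[name] = (left, middle, right)`, `preds` maps each task to the tasks that must run before it, and `end_time` holds ending times from a previous packing; all three names are assumptions for illustration.

```python
def subtree_nodes(children, root):
    """Set of all node names in the sub-tree rooted at `root` (empty if None)."""
    found = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        found.add(node)
        stack.extend(children.get(node, (None, None, None)))
    return found


def latest_predecessor(task, preds, end_time):
    """Predecessor of `task` with the latest ending time, or None."""
    candidates = preds.get(task, ())
    return max(candidates, key=lambda p: end_time[p]) if candidates else None


def violating_tasks(children, preds, end_time):
    """Tasks that are not in the left sub-tree of their latest-ending predecessor."""
    bad = []
    for task in preds:
        p = latest_predecessor(task, preds, end_time)
        if p is None:
            continue
        left_child = children.get(p, (None, None, None))[0]
        if task not in subtree_nodes(children, left_child):
            bad.append(task)
    return bad
```

Each task triggers a fresh sub-tree traversal here, so a production version would likely cache ancestor information instead.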

Once we identify a node that violates the precedence constraint, we re-construct the T-tree to remove the violation. Assume that task v_i violates the precedence constraints and that v_p is the task with the latest ending time in I_i (I_i here denotes the set of tasks that must be executed before v_i). Let U = {all nodes in the left sub-tree of n_p} ∪ {n_p}. In U, we look for a node n_j that minimizes |t_j - t_i| with I_j = ∅. If n_j ≠ n_p, n_i is swapped with n_j; otherwise, n_j = n_p or I_j ≠ ∅, and in this case we move n_i to the left-child position of n_p. The tree re-construction process is summarized in Figure 5.

Figure 5: Summary of tree re-construction process.

3. Fixed-area Floorplanning:

For fixed-outline floorplanning, the area of the reconfigurable device is fixed. Let W_f and H_f denote the width and height of the reconfigurable device, and W_p and H_p the width and height of a placement. A feasible placement for fixed-outline floorplanning must satisfy the outline constraint, that is, W_p ≤ W_f and H_p ≤ H_f. Therefore, we include the excessive volume of a placement in the objective function for the fixed-outline floorplanning problem. The new objective function Φ' is:

Φ' = αV + βW + γO + δF,

where δ is also a user-specified constant, and F is given in the following equation:

F = min( max(W_p - W_f, 0) × H_p × Time + max(H_p - H_f, 0) × W_p × Time,
         max(W_p - H_f, 0) × H_p × Time + max(H_p - W_f, 0) × W_p × Time ),

where Time is the total execution time for a placement. Since the whole design can be rotated by 90 degrees, we choose the smaller excessive volume of the two orthogonal placements.
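A small sketch of the excessive-volume term follows. It assumes that the second alternative inside the min compares the placement against the device with its width and height exchanged, which is how the rotation remark above is read here; the function and parameter names are illustrative.

```python
def excess_volume(wp, hp, time, wf, hf):
    """Excessive volume F of a wp x hp placement on a wf x hf device.

    The smaller value over the two orientations of the design is returned.
    """
    def excess(dev_w, dev_h):
        # Volume sticking out in x plus volume sticking out in y.
        return (max(wp - dev_w, 0) * hp * time +
                max(hp - dev_h, 0) * wp * time)

    return min(excess(wf, hf), excess(hf, wf))
```

A placement that fits in either orientation yields F = 0, so the δF term then has no effect on the cost.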

Besides considering the excessive volume in the objective function, we bias the selection of the destination of the Move operation based on the value k/n, where k is the number of infeasible placements in the last n iterations. In this report, we set n to 500. A large k/n value indicates that the placement does not easily fit into the device area; therefore, we should try to place a module along the t direction to increase the probability of success. In contrast, if the k/n value is small, we should try to place a module in the x or y direction to reduce the task execution time.
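The k/n statistic can be maintained with a fixed-size sliding window, as in the sketch below. Using k/n directly as the probability of choosing the t direction is an assumption made for illustration; the report only says that the selection is biased by this value.

```python
import random
from collections import deque


class MoveBias:
    """Tracks infeasible placements over the last n iterations."""

    def __init__(self, n=500):
        self.n = n
        self.window = deque(maxlen=n)  # oldest outcome drops out automatically

    def record(self, infeasible):
        self.window.append(bool(infeasible))

    def ratio(self):
        # k / n; before the window has filled, missing iterations count as feasible.
        return sum(self.window) / self.n

    def choose_direction(self, rng):
        """Pick 't' with probability k/n, otherwise 'x' or 'y' uniformly."""
        if rng.random() < self.ratio():
            return "t"
        return rng.choice(["x", "y"])
```

With n = 500 as in the report, a call such as `MoveBias().choose_direction(random.Random(0))` returns "x" or "y" until infeasible placements start to be recorded.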

V. Results (Publications)

1. P.-H. Yuh, C.-L. Yang, Y.-W. Chang, and H.-L. Chen, "Temporal floorplanning using 3D-subTCG," Proc. ASP-DAC, pp. 725-730, Yokohama, Japan, January 2004. (Nominated for Best Paper)

2. P.-H. Yuh, C.-L. Yang, and Y.-W. Chang, "Temporal floorplanning using 3D transitive closure sub-graphs," in revision, IEEE Trans. VLSI Systems, 2004.

3. P.-H. Yuh, C.-L. Yang, and Y.-W. Chang, "Temporal Floorplanning using the T-tree Formulation," submitted to ICCAD 2004.

VI. References

[1] Atmel, "AT40K05102040AL_Complete," Atmel, Inc.

[2] K. Bazargan, R. Kastner, and M. Sarrafzadeh, "Fast Template Placement for Reconfigurable Computing Systems," IEEE Design & Test of Computers, vol. 17, no. 1, pp. 68-83, Mar. 2000.

[3] Y.-C. Chang, Y.-W. Chang, G.-M. Wu, and S.-W. Wu, "B*-trees: A new representation for non-slicing floorplan," Proc. DAC, pp. 458-462, June 2000.

[4] S. P. Fekete, E. Kohler, and J. Teich, "Optimal FPGA Module Placement with Temporal Precedence Constraints," Proc. DATE, pp. 658-665, Mar. 2001.

[5] P.-N. Guo, C.-K. Cheng, and T. Yoshimura, "An O-tree representation of non-slicing floorplan and its application," Proc. DAC, pp. 268-273, June 1999.

[6] S. Hauck, "The Roles of FPGAs in Reprogrammable Systems," Proc. of the IEEE, vol. 86, no. 4, pp. 615-639, Apr. 1998.

[7] S. Hauck, Z. Li, and E. J. Schwabe, "Configuration Compression for the Xilinx XC6200 FPGA," Proc. FCCM, pp. 138-146, 1998.

[8] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, pp. 671-680, May 1983.

[9] R. Tessier and W. Burleson, "Reconfigurable Computing for Digital Signal Processing: A Survey," Journal of VLSI Signal Processing, vol. 28, no. 1, pp. 7-27, May/June 2001.

[10] Xilinx, "XAPP151 Virtex Series Configuration Architecture User Guide v1.5," Xilinx, Inc., Sep. 2000.

[11] Xilinx, "Virtex-II Pro Platform FPGA User Guide," Xilinx, Inc.

[12] P.-H. Yuh, C.-L. Yang, Y.-W. Chang, and H.-L. Chen, "Temporal Floorplanning using 3D-subTCG," Proc. ASP-DAC, pp. 725-730, Jan. 2004.
