CHAPTER 1 INTRODUCTION

1.1. Conventional Architecture Level Synthesis Flow

Architecture level synthesis is a sequence of tasks that transforms a high-level behavioral description into an RTL design. It can be implemented in many ways, depending on the desired architecture style. Therefore, a wide variety of problems, algorithms and tools have been proposed.

Fig. 1. Conventional and simplified architecture level synthesis system

Fig. 1 shows a conventional, simplified architecture level synthesis system, comprising the target architecture and the corresponding synthesis flow [14]. First, the synthesis system compiles the input application into an internal representation of the design (here, a data flow graph). Second, the scheduling and binding tasks assign operations to feasible functional units and time slots while respecting the sequencing order. Finally, the system generates the detailed interconnections and the corresponding control signals, which map the data transfers and all configurations onto the circuit cycle by cycle.

The target architecture shown in Fig. 1 has several functional units, centralized registers and a corresponding control unit. The centralized registers store all temporary data and provide all operands to the functional units. From the viewpoint of data transfer, scheduling and binding are the tasks that decide when and where the temporary data go. In addition, the datapath synthesis task generates the multiplexers, de-multiplexers and wires needed for the interconnections.

In a centralized register architecture, any data generated by a functional unit is available to the other functional units in the next cycle. In other words, transferring data between functional units never takes an additional cycle. However, this convenient interconnection scheme is paid for with a longer cycle time: the cycle time equals the computation time plus the interconnection delay, which includes the delay of multiplexers, de-multiplexers and wires.

1.2. Distributed Register Architecture

In deep submicron (DSM) technology, the wiring delay is no longer negligible [1]. As process technology scales, it will gradually dominate the overall system delay because of RC delay, coupling noise, inductance, etc. [2][3]. The long wiring delay will then account for a large portion of the cycle time and worsen the system latency.

In such a situation, it is important to take the interconnection delay into consideration. Because the interconnection delay is only known after physical layout, the conventional architecture level synthesis flow cannot obtain an accurate cycle time. To overcome this problem, many researchers have used estimated interconnection delays at higher levels of design abstraction [4][5][6][7][8][9].

In architecture level synthesis, the target architecture and the corresponding synthesis algorithm determine how effectively the interconnection delay can be taken into account. Therefore, the distributed register architecture, which partitions the registers, has been proposed [10][11].

Fig. 2 shows the simplified model, which consists of clusters connected through a global interconnection. Each cluster contains functional units that can only access the dedicated registers in the same cluster. The global interconnection is responsible for transferring data among the clusters, which may take several cycles.

Fig. 2. Model of distributed register architecture

Additionally, partition constraints keep the interconnection delay inside a cluster from becoming excessive, whether it is caused by multiplexers (e.g. by constraining the number of register access ports) or by long wires (e.g. by constraining the number of functional units).

In contrast to the conventional centralized register architecture, the distributed register architecture spreads the interconnection delay over several cycles. Only the interconnection delay within a cluster contributes to the system cycle time; the delay of the global interconnection only adds extra cycles. This partitioning of the interconnection delay therefore implies multi-cycle communication, which allows computations and data transfers to execute in parallel.
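The difference between the two timing models can be illustrated with a minimal Python sketch. The function names and the delay values are hypothetical and chosen only for illustration; the sketch simply assumes that the global delay is rounded up to whole cycles.

```python
import math

def centralized_cycle_time(compute_ns, interconnect_ns):
    """Cycle time when every transfer must finish within one cycle."""
    return compute_ns + interconnect_ns

def distributed_timing(compute_ns, local_ns, global_ns):
    """Return (cycle_time, extra_cycles) for a distributed register design.

    Only the intra-cluster delay is paid in every cycle; the global
    delay is spread over whole extra cycles (an assumed rounding rule).
    """
    cycle = compute_ns + local_ns
    extra_cycles = math.ceil(global_ns / cycle)
    return cycle, extra_cycles

# Hypothetical delays in nanoseconds, chosen only for illustration.
print(centralized_cycle_time(4.0, 6.0))   # 10.0
print(distributed_timing(4.0, 1.0, 6.0))  # (5.0, 2)
```

In this example the cycle time is halved, while an inter-cluster transfer costs two extra cycles that can overlap with other computations.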

Based on the same concept, the Regular Distributed Register (RDR) architecture was proposed [12], which offers high regularity and direct support for multi-cycle communication. The RDR architecture divides the entire chip into an array of clusters.

Fig. 3 shows an example of the RDR architecture with a 2×3 cluster array. Thanks to the high regularity of the RDR architecture, the inter-cluster and intra-cluster interconnection delays can be pre-computed and accurately recorded in lookup tables once the parameters of the RDR structure (e.g. cluster size, clock period) are specified.

Fig. 3. RDR architecture with 2×3 cluster array
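The idea of a pre-computed delay lookup table can be sketched as follows. This is only an illustration: it assumes, hypothetically, that the wire delay grows linearly with the Manhattan distance between clusters, and `hop_delay_ns` and `clock_ns` are made-up parameters rather than values from [12].

```python
import math

def build_delay_table(rows, cols, hop_delay_ns, clock_ns):
    """Map each ordered cluster pair to a transfer latency in cycles.

    Assumes (hypothetically) that wire delay grows linearly with the
    Manhattan distance between clusters.
    """
    clusters = [(r, c) for r in range(rows) for c in range(cols)]
    table = {}
    for src in clusters:
        for dst in clusters:
            hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
            table[(src, dst)] = math.ceil(hops * hop_delay_ns / clock_ns)
    return table

table = build_delay_table(rows=2, cols=3, hop_delay_ns=3.0, clock_ns=5.0)
print(table[((0, 0), (1, 2))])  # 3 hops * 3.0 ns = 9.0 ns -> 2 cycles
```

Because the array is regular, the table depends only on the array size and the two delay parameters, so it can be built once and reused during scheduling.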

1.3. Extra Wire Loading in RDR-based Architecture

The corresponding synthesis flow for RDR is MCAS (Architectural Synthesis for Multi-cycle Communication). At the front end, after generating the Control Data Flow Graph (CDFG) of the application, MCAS performs resource allocation, functional unit binding and scheduling-driven placement, in that order. These tasks place the functional units in clusters and decide when and where each operation in the CDFG is executed. At the back end, MCAS performs register and port binding, followed by datapath and distributed controller generation. The experimental results reported in [12] show average improvements of 44% in cycle time and 37% in final latency for data-flow-intensive examples, and of 28% and 23%, respectively, for designs with control flow.

However, the RDR architecture may introduce extra global wiring overhead when many data transfers occur simultaneously and each one requires a dedicated global connection. This wiring overhead would eventually limit how far an application can scale on the RDR architecture. To overcome this problem, [13] presents an architecture level synthesis solution called RDR-pipe, which extends RDR to support automatic interconnection pipelining. Interconnection pipelining can improve wire utilization by sharing the wires between each pair of clusters. A 28% reduction in global wire length compared to the RDR architecture is reported in [13].

A good way to reduce resource demand is sharing. Interconnection pipelining reduces the wiring overhead by sharing the wires between each pair of clusters. However, this methodology still confines sharing to separate local regions, because the transferred data are still scheduled and allocated within dedicated wires. Therefore, a global sharing methodology in which the wires and registers are both shared by all transferred data might greatly reduce the interconnection overhead. In this thesis, we propose a formal formulation of the channel and register allocation problem in architecture level synthesis, which captures the behavior of the transferred data at each cycle in a distributed register architecture. Based on this formulation, the approach can easily be extended to minimize the required interconnection resources.

1.4. Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 addresses the channel and register allocation problem and gives a motivational example that shows the difference between previous methods and the desired optimal solution. Chapter 3 presents a detailed description of the proposed ILP formulation, including the problem definition, variable definitions, constraints and the objective function. Chapter 4 gives some useful extensions to the ILP model described in Chapter 3. Chapter 5 compares the experimental results with previous work. Finally, Chapter 6 draws conclusions and discusses future work.

Chapter 2 Channel and Register Allocation Problem

In this chapter, we introduce the channel and register allocation problem in architecture level synthesis for architectures with distributed registers. A motivational example then shows why a new methodology is needed to share the interconnection resources globally.

2.1. Channel and Register Allocation Problem

The RDR architecture provides many dedicated interconnection wires between clusters. Because it has no interconnection pipelining, the sender must hold the transferred data for several cycles. Therefore, one transferred datum occupies a long wire for several cycles, which wastes interconnection resources.

RDR-pipe is an extension of RDR. It places registers at appropriate positions to pipeline the long wires. By performing transfer scheduling, it achieves higher utilization of the wiring resources. However, the register station in RDR-pipe has no control signal. As a result, each pipeline register is dedicated to its own interconnection wire and cannot be shared by data generated in other clusters.

Therefore, we propose a further extension of the register station, making it capable of receiving incoming data and forwarding them in any direction. In addition, instead of dedicated interconnections, we use distributed wiring segments in the available channels between register stations. That is, any interconnection can be composed of several wiring segments.

Consequently, how to allocate the transfers to channels becomes a new problem, because the behavior of the transfers strongly affects the interconnection resources, including wires and registers. This is the proposed channel and register allocation problem.

Fig. 4. Scheduled and bound DFG of DCT

The input of this problem is the scheduled and bound data flow graph and the placement of functional units in the distributed register architecture. The input should be checked in advance to ensure that the latency available for each transfer is sufficient. Fig. 4 shows a data flow graph of the discrete cosine transform [12] with two ALUs and two multipliers. Each circle represents an operation and is bound to a specific functional unit. The operations are scheduled in time slots separated by horizontal lines. The scheduled and bound data flow graph thus describes when, and by which FU, each temporary datum is consumed.

Fig. 5 shows a 3×3 cluster array architecture and the placement of two ALUs and two multipliers. Each cluster contains FUs, drawn as a meshed square, which only access the dedicated registers in the white square, called the register station. According to the placement of functional units in Fig. 5(a) and the bound DFG, Fig. 5(b) places the corresponding operations in the clusters. Therefore, with the scheduling and binding information, we can specify where and when each transfer is generated and required.

Fig. 5. Placement of functional units in distributed register architecture

2.2. Previous Methodology

Because the channel and register allocation problem extends the RDR/RDR-pipe architecture, we take their register and port binding tasks as the previous work. Fig. 6 shows a simple example of the scheme used in RDR/MCAS. Operations 1 and 2 are executed in cluster A, and operations 3 and 4 in cluster C.

In this architecture, registers 1 and 2 of sender A hold the transferred values for at least two cycles. Therefore, two parallel inter-cluster wires are needed between clusters A and C, because the transfer times of the two data overlap.

Fig. 6. Register and port binding task with dedicated interconnection

Fig. 7. Register and port binding task with dedicated pipelining interconnection

Fig. 7 applies the RDR-pipe/MCAS-pipe scheme to the same example. With pipeline register 2 present in cluster B, register 1 can forward one datum to register 2 in the first cycle and the other in the next cycle. The two data are issued in different cycles and forwarded by the pipeline register (i.e. register 2) without stalling until they reach cluster C. Transfer scheduling thus serializes the transfers by issuing each datum in a different cycle. Therefore, only one interconnection wire is needed, one fewer than in RDR/MCAS.
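The wire counts of the two schemes can be reproduced with a small Python sketch. It is a simplified model, not the MCAS or MCAS-pipe algorithm: all transfers are assumed to run between the same pair of clusters, and the function names are hypothetical.

```python
from collections import Counter

def wires_without_pipelining(transfers):
    """Each transfer holds its whole wire from issue to arrival (RDR style).

    transfers: list of (issue_cycle, latency). All transfers are assumed
    to run between the same pair of clusters.
    """
    busy = Counter()
    for issue, latency in transfers:
        for cycle in range(issue, issue + latency):
            busy[cycle] += 1
    return max(busy.values(), default=0)

def wires_with_pipelining(transfers):
    """Each transfer uses one wire segment per cycle (RDR-pipe style).

    The datum issued at cycle t is on segment s during cycle t + s.
    """
    busy = Counter()
    for issue, latency in transfers:
        for segment in range(latency):
            busy[(segment, issue + segment)] += 1
    return max(busy.values(), default=0)

# Two data travel from cluster A to cluster C through B (two cycles each).
print(wires_without_pipelining([(1, 2), (1, 2)]))  # 2
print(wires_with_pipelining([(1, 2), (2, 2)]))     # 1
```

Issuing the second datum one cycle later is what removes the conflict; if both data were issued in the same cycle, the pipelined scheme would also need two wires.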

2.3. Motivational Example

Sharing is a way to reduce resource requirements. We use a motivational example to illustrate how the sharing capability affects the interconnection resource requirement.

We compare the results of RDR/MCAS, RDR-pipe/MCAS-pipe and an idealized global sharing method. Fig. 8 gives the scheduled and bound data flow graph and the mapping of operations onto a 3×3 cluster array. To compare the resource requirements, we count the number of registers and wiring segments. A wiring segment is simply defined as one wire between two adjacent clusters. This definition is intuitive, and the number of segments is directly proportional to the wire length.

Fig. 8. Given scheduled, bound DFG and placed functional units

Fig. 9 shows the result of transfer allocation in RDR/MCAS. The cluster that generates a datum always holds the transferred value for a time equal to the interconnection delay. Without transfer scheduling and pipeline registers, extra wiring is required. This scheme needs 12 wiring segments and 10 registers in total.

Fig. 10 shows the result of data transfer in RDR-pipe/MCAS-pipe. Pipeline registers are inserted in each register station through which an interconnection passes. In addition, transfer scheduling is performed to minimize the wire requirement. Consequently, the number of wiring segments is reduced to 7 at the cost of 2 additional registers (12 registers in total).

Fig. 9. Result of allocation in RDR/MCAS

Fig. 10. Result of allocation in RDR-pipe/MCAS-pipe

Sharing wires and registers among the transferred data reduces the resource requirement. Following this principle, we extend the capability of the register stations so that they can store data for several cycles and forward them in arbitrary directions. This extension enables the transferred data to use all interconnection wires and registers, which implies global sharing and a further reduction in resources. Fig. 11 shows the result: with hand scheduling, it uses only 4 wiring segments and 7 registers.

Fig. 11. Result of the global sharing of interconnection resource

Chapter 3 Proposed ILP Formulation for the Channel and Register Allocation Problem

In this chapter, we formulate the behavior of transfers in a distributed register architecture with extended capability, in which a register station may store or forward data at each cycle. We first define the problem and its input. We then explain the variables and determine their feasible region. Finally, we give the objective function that minimizes the interconnection resources, subject to uniqueness, continuity and resource counting constraints.

3.1. Definition of Problem Given

Before solving the channel and register allocation problem, we assume that scheduling and functional unit binding have been performed. Therefore, the scheduled and bound data flow graph and the functional unit placement are taken as input. In addition, the topology information, including the positions and connectivity of the clusters, is also needed.

First, we use a graph representation $(V_S, E_S)$ to describe the input data flow graph. The vertex set $V_S = \{o_i \mid i = 0, 1, 2, \ldots, |V_S| - 1\}$ represents the operations in the data flow graph. The directed edge set $E_S = \{e_i \mid i = 0, 1, 2, \ldots, |E_S| - 1\}$ represents the data dependencies, each implying a data transfer, such that $e_i = (o_j, o_k)$ means that the data generated by $o_j$ are required by $o_k$.

Fig. 12. Data flow graph as input application

Fig. 12 is a simple example of a data flow graph, shown together with its vertex set and edge set.

Second, we use a graph representation $(V_R, E_W)$, called the topology information, to specify the target distributed register architecture; it indicates the positions available for placing interconnection wires or registers. The vertex set $V_R = \{r_i \mid i = 0, 1, 2, \ldots, |V_R| - 1\}$ represents the available register stations, and the directed edge set $E_W = \{w_i \mid i = 0, 1, 2, \ldots, |E_W| - 1\}$ represents the available channels, such that a transfer of data from $r_j$ to $r_k$ through channel $w_i$ is written $w_i: r_j \rightarrow r_k$. Note that a self-loop allows the transferred data to stay at the same register station for one cycle.

Fig. 13 takes a 2×2 cluster array of the distributed register architecture as the topology information. It has the vertex set $V_R = \{r_i \mid i = 0, 1, 2, 3\}$ and an edge set $E_W$ whose channels allow each register station to transfer data to a horizontally or vertically adjacent station, or to keep the data for further cycles through a self-loop.

Fig. 13. Topology information
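A topology of this kind can be built programmatically, as the following Python sketch suggests. The row-by-row station numbering and the channel numbering (self-loops first) are assumptions made for this sketch; they do not necessarily match the indices used in Fig. 13.

```python
def build_topology(rows, cols):
    """Return (stations, channels) for a rows x cols cluster array.

    stations: list of station indices, numbered row by row.
    channels: list of (src, dst) pairs; the list position is the
    channel index. The numbering here is hypothetical.
    """
    stations = list(range(rows * cols))
    channels = [(r, r) for r in stations]  # self-loops: keep a datum
    for r in stations:
        row, col = divmod(r, cols)
        for d_row, d_col in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            n_row, n_col = row + d_row, col + d_col
            if 0 <= n_row < rows and 0 <= n_col < cols:
                channels.append((r, n_row * cols + n_col))
    return stations, channels

stations, channels = build_topology(2, 2)
print(len(stations), len(channels))  # 4 12
```

For a 2×2 array this yields four self-loops and eight directed channels between adjacent stations; diagonal stations are not connected directly.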

The input data flow graph is scheduled and bound. What do “scheduled” and “bound” mean? First, we can treat each data dependency in the data flow graph as a transfer. Then, from the scheduling and binding information, we know when each operation finishes and in which cluster its output is stored. Combined with the placement of functional units, this information lets us specify each transfer by a statement of the form: the transfer takes data from register station $r_x$ to $r_y$ between cycle $m$ and cycle $n$. Precisely, we use a set of pairs $\{et_i = (st_i, ft_i) \mid e_i \in E_S\}$ to specify that transfer $e_i$ is generated at cycle $st_i$ and required at cycle $ft_i$, and another set of pairs $\{er_i = (sr_i, fr_i) \mid e_i \in E_S\}$ to specify that transfer $e_i$ is generated at register station $sr_i$ and required at register station $fr_i$.

In Fig. 14, we focus on the edge $e_0$, which is a data dependency between $o_0$ and $o_2$. From $et_0$ and $er_0$, we know that a datum must be transferred from register station $sr_0 = 1$ at cycle $st_0 = 5$ to register station $fr_0 = 2$ at cycle $ft_0 = 8$.

Fig. 14. Specification of a transfer
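A transfer specification and the latency check on the input can be sketched in Python as follows. The class and function names are hypothetical, and the check assumes a two-column array with stations numbered row by row and one cycle per hop between adjacent stations; the values for $e_0$ are those read from the Fig. 14 example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    st: int  # cycle at which the datum is generated
    ft: int  # cycle at which the datum is required
    sr: int  # register station where the datum is generated
    fr: int  # register station where the datum is required

def has_enough_latency(t, cols):
    """True if the slack ft - st covers the Manhattan distance sr -> fr.

    Assumes stations are numbered row by row and that one hop between
    adjacent stations takes one cycle.
    """
    s_row, s_col = divmod(t.sr, cols)
    f_row, f_col = divmod(t.fr, cols)
    hops = abs(s_row - f_row) + abs(s_col - f_col)
    return t.ft - t.st >= hops

e0 = Transfer(st=5, ft=8, sr=1, fr=2)  # values from the Fig. 14 example
print(has_enough_latency(e0, cols=2))  # True
```

Here the datum has three cycles to cover two hops, so one cycle of slack remains and can be spent waiting in a register station.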

3.2. The Definition of Variables

Defining appropriate variables is important, because it affects the complexity and flexibility of an ILP formulation. To keep the proposed formulation as extensible as possible, we choose to capture the basic behavior of a transfer at each cycle. Rather than relying on many indirect specifications, we adopt a single simple type of zero-one integer variable, $x_{i,j,k}$, called the channel allocation variable.

Fig. 15. Specifying a transfer path with channel allocation variables

The variable $x_{i,j,k}$ indicates whether the transferred datum $e_i$ is allocated to channel $w_k$ at cycle $j$: a value of one means yes and zero means no. The set $X$ collects all these zero-one variables and can be written as:

$$X = \{x_{i,j,k} \mid i = 0, 1, \ldots, |E_S| - 1;\ ft_i \ge j > st_i;\ k = 0, 1, \ldots, |E_W| - 1\}$$
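To give a feeling for the size of $X$, the index set can be enumerated with a few lines of Python. The function name is hypothetical, and the channel count of 12 is an assumption corresponding to a 2×2 array with self-loops.

```python
def allocation_variable_indices(transfers, num_channels):
    """List every (i, j, k) with st_i < j <= ft_i and 0 <= k < num_channels.

    transfers: list of (st, ft) pairs, one per edge e_i.
    """
    return [
        (i, j, k)
        for i, (st, ft) in enumerate(transfers)
        for j in range(st + 1, ft + 1)
        for k in range(num_channels)
    ]

# One transfer generated at cycle 5 and required at cycle 8, 12 channels
# (the 12 is an assumed channel count for a 2x2 array with self-loops).
X = allocation_variable_indices([(5, 8)], num_channels=12)
print(len(X))       # 3 cycles * 12 channels = 36
print(X[0], X[-1])  # (0, 6, 0) (0, 8, 11)
```

The count grows with the product of transfer lifetime and channel count, which is why pruning variables that can never be one is worthwhile.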

How do these variables describe a transfer path? Fig. 15 shows an example. In this example, transfer $e_0$ must be taken from register station $r_1$ to $r_2$ between cycle 6 and cycle 8. Suppose we have chosen a transfer path that sends the data to $r_0$ in the first cycle, keeps it in $r_0$ for one cycle, and reaches $r_2$ at the end of cycle 8. We specify this path by setting the variables $x_{0,6,3}$, $x_{0,7,0}$ and $x_{0,8,4}$ to one, which means using the channels $w_3$, $w_0$ and $w_4$, respectively, cycle by cycle. All other allocation variables of transfer $e_0$ are kept at zero to prevent any ambiguity in the assignment of the transfer behavior. Therefore, one unambiguous assignment of allocation variables stands for exactly one transfer behavior of the data. This mapping is one-to-one and onto, which implies that the entire solution space is covered by the possible assignments.
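What "without ambiguity" means for a single transfer can be sketched as a small checker in Python. This is an illustration rather than the thesis's constraint set: the function name is hypothetical, and only the three channels named in the Fig. 15 example are listed, with endpoints inferred from the described path.

```python
def is_valid_path(assignment, channels, transfer):
    """Check uniqueness and continuity of one transfer's allocation.

    assignment: dict cycle -> list of channel indices set to one.
    channels:   dict channel index -> (source station, destination station).
    transfer:   (st, ft, sr, fr).
    """
    st, ft, sr, fr = transfer
    position = sr
    for cycle in range(st + 1, ft + 1):
        chosen = assignment.get(cycle, [])
        if len(chosen) != 1:  # uniqueness: one channel per cycle
            return False
        src, dst = channels[chosen[0]]
        if src != position:  # continuity: leave where we are
            return False
        position = dst
    return position == fr  # must end at the requiring station

# Only the three channels named in the Fig. 15 example are listed.
channels = {3: (1, 0), 0: (0, 0), 4: (0, 2)}
e0 = (5, 8, 1, 2)
print(is_valid_path({6: [3], 7: [0], 8: [4]}, channels, e0))  # True
print(is_valid_path({6: [3], 7: [4], 8: [0]}, channels, e0))  # False
```

The second assignment fails because the datum would already be at $r_2$ after cycle 7, so the self-loop on $r_0$ cannot follow it.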

In addition, to minimize the interconnection resources, resource counting variables are also defined. The interconnection resources include wires and registers. In our formulation, we are concerned with the resource requirements of the available channels and register stations. Therefore, two straightforward types of integer variables, $Nw_i$ and $Nr_i$, are used. They denote the number of wires in the available channel $w_i$ and the number of registers in the available register station $r_i$, respectively.
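As a purely illustrative assumption (the resource counting constraints of the formulation are not reproduced in this section), one plausible way to link the counting variables to the allocation variables is to require that, in every cycle, the number of transfers allocated to a channel never exceeds the number of wires counted for it:

```latex
\sum_{i \,:\, st_i < j \le ft_i} x_{i,j,k} \;\le\; Nw_k
\qquad \text{for every cycle } j \text{ and every channel } w_k
```

Under this sketch, minimizing the sum of the $Nw_k$ pushes each count down to the peak number of simultaneous transfers on that channel. How self-loop channels are charged to the register counts $Nr_i$ is left open here.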

3.3. The Feasible Region of Variables

In fact, the set $X$ contains many redundant allocation variables. Given the generating and requiring register stations, the region in which the transferred data can be active is limited. That is, many channels would never be used by a given transfer, and the corresponding allocation variables are always zero. To reduce the number of allocation variables while preserving the complete behavior of the transfer, we need to determine the feasible active region of the transfer at each cycle.

As mentioned above, the active region of a transfer is limited by the generating and requiring register stations. We take the same example in Fig. 16. Fig. 16(a) shows the channels that the transfer may use at cycle 6. These feasible channels are limited to the three channels that leave the generating register station. At the next cycle, the feasible channels are likewise limited to those leaving the register stations that could be reached in the previous cycle. Therefore, the feasible channels spread out from the generating register station, and the effect of the limitation can be traced cycle by cycle. Based on this idea, we define the set $WS_{i,j}$, which contains the indices of the feasible channels of transfer $e_i$ at cycle $j$, traced from the generating register station.

Fig. 16. Limitation of the active region of a transfer
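The cycle-by-cycle spreading of feasible channels can be sketched as a forward reachability computation. This is a minimal illustration under stated assumptions, not the definition used in the thesis: the function name is hypothetical, the channel numbering (self-loops first) is made up, and only the limitation imposed by the generating register station is traced.

```python
def feasible_channels(channels, sr, st, ft):
    """Return {cycle: sorted channel indices usable in that cycle}.

    channels: list of (src, dst) pairs; the list position is the index.
    Only forward reachability from the generating station sr is traced.
    """
    reachable = {sr}  # stations the datum may occupy
    result = {}
    for cycle in range(st + 1, ft + 1):
        usable = [k for k, (src, _) in enumerate(channels) if src in reachable]
        result[cycle] = usable
        reachable = {channels[k][1] for k in usable}
    return result

# 2x2 array, stations numbered row by row; channel numbering is hypothetical:
# indices 0..3 are self-loops, the rest connect adjacent stations.
channels = [(0, 0), (1, 1), (2, 2), (3, 3),
            (0, 1), (0, 2), (1, 0), (1, 3),
            (2, 0), (2, 3), (3, 1), (3, 2)]
ws = feasible_channels(channels, sr=1, st=5, ft=8)
print(ws[6])       # [1, 6, 7]
print(len(ws[7]))  # 9
```

With this numbering, three channels are feasible in the first cycle and nine in the second, so a quarter of the twelve channels can still be excluded at cycle 7.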

$WS_{i,j}$ is defined recursively and takes the generating cycle as the initial condition. Fig. 17 shows an example in which the sets of feasible channels of transfer $e_0$, namely $WS_{0,j}$, have been worked out. The first set includes channels 1, 3 and 6, which all leave the generating register station and each enter one of three register stations. At the next cycle, the
