

1.3 Organization of This Thesis

This thesis is organized as follows. Chapter 2 presents the preliminaries for our work, including the throughput calculation, the effect of floorplanning on throughput, and related work. Chapter 3 presents the proposed throughput-aware floorplanning strategy. Chapter 4 provides experimental results. Chapter 5 discusses the recommended overall design flow. Chapter 6 concludes this thesis.

Chapter 2

Preliminaries

2.1 System Throughput

Data rate is the major performance metric in system design, and it is calculated by multiplying the system throughput by the operating frequency. The throughput is therefore an important performance factor. As described earlier, the throughput depends not only on the latency of the functional blocks but also on the latency incurred by long wires. The following example shows how wire latency affects the throughput.

Figure 2-1. The system nFB without feedback loop.

Consider the system nFB shown in Figure 2-1, where the circles a, b, and c are functional blocks and the rectangle PE is a pipeline element. Assume that the interconnect delay between a and c is greater than one clock period but less than two. At least one pipeline element must then be inserted between a and c to preserve correct functionality, so the interconnect latency from a to c is 1 clock cycle. Assume that the latency of every functional block is 1 clock cycle and that each interconnect channel can properly queue the transferred data. The output sequence at functional block c is taken as the output of the system nFB.

At time slot t0, the functional blocks a, b, and c produce their first valid data tokens a0, b0, and c0, while PE initially contains only an empty data token τ. In each following time slot, a functional block produces valid data if all of its input data were ready in the previous time slot; otherwise it produces empty data. A pipeline element holds a token for one clock cycle and then forwards it from the sender to the receiver.

Figure 2-2. Time slot sequence in the system nFB.

In Figure 2-2, at time slot t1, one input of block c was τ in the previous time slot, so c produces τ and the data b0 is queued in the interconnect channel from b to c. Also at time slot t1, the pipeline element PE receives the data a0 and passes it to block c in the next time slot. From t2 onward, the empty data τ has left the system and all functional blocks produce valid data. Note that there is only one τ in the output sequence of the system nFB. Therefore the overall throughput of the system nFB approaches 1 as t→∞.

Now consider the system FB with a feedback loop, shown in Figure 2-3. Although it looks similar to the system nFB, its throughput is entirely different.

Figure 2-3. The system FB with feedback loop.

As Figure 2-4 shows, the empty data token τ introduced by PE circulates around the loop during t0 ~ t4. The empty data token never disappears and appears at the system output once every 4 clock cycles. As a result, the throughput of the system FB approaches only 3/4 as t→∞.

Figure 2-4. Time slot sequence in the system FB.

As the two systems nFB and FB demonstrate, a system can be modeled as a graph in which vertices represent the functional modules and edges represent the inter-module communication channels. Every vertex and edge has a weight indicating the number of clock cycles it requires for signal transmission. The throughput P of a cycle C is the reciprocal of its cycle mean λ, as shown in Equation (1).

$$P(C) = \frac{1}{\lambda(C)} = \frac{|C|}{W(C)} \tag{1}$$

In Equation (1), |C|, usually called the size of cycle C, is the number of vertices in C and W(C) is the sum of weights of all vertices and edges within the cycle C.
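As a small illustration of Equation (1), the sketch below computes the cycle mean and throughput of a single cycle from its vertex and edge weights. The function names are hypothetical, and the composition of the cycle in system FB (three unit-latency blocks and one pipeline element) is an assumption chosen to be consistent with the throughput of 3/4 stated above.

```python
from fractions import Fraction

def cycle_mean(vertex_weights, edge_weights):
    """Cycle mean W(C) / |C|, where |C| is the number of vertices in the cycle."""
    total_weight = sum(vertex_weights) + sum(edge_weights)
    return Fraction(total_weight, len(vertex_weights))

def cycle_throughput(vertex_weights, edge_weights):
    """Throughput of a cycle: the reciprocal of its cycle mean."""
    return 1 / cycle_mean(vertex_weights, edge_weights)

# Assumed model of system FB: three blocks of latency 1 and one edge
# that carries a single pipeline element (latency 1).
fb_mean = cycle_mean([1, 1, 1], [0, 0, 1])
fb_throughput = cycle_throughput([1, 1, 1], [0, 0, 1])
print(fb_mean, fb_throughput)  # 4/3 3/4
```

Exact fractions are used so that values such as 4/3 and 3/4 can be compared directly with the figures in the text.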

If a system contains more than one cycle, the system throughput is bounded by the most critical cycle, the one with the maximum cycle mean, as shown in Equation (2). The equation has been proven in [7-8] and [16].

$$P_{system} = \frac{1}{\max_{C} \lambda(C)} \tag{2}$$

The system shown in Figure 2-5 is an example in which the critical cycle bounds the system throughput. This system has two cycles, C1 and C2. The output functional block f belongs to C2, whose throughput is 4/5, and the output sequence at f shown in Figure 2-6 has a throughput approaching 4/5 before t19. However, the cycle C1 eventually blocks the input data of block d, so the throughput of functional block f is 3/4 after t19. Thus, the final throughput of this system is bounded by the cycle C1, which has the maximum cycle mean, and approaches 3/4 as t→∞.


Figure 2-5. The system with two cycles.

Figure 2-6. Time slot sequence of the system in Figure 2-5.
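The two-cycle example of Figure 2-5 can be checked with a short sketch that picks the cycle with the maximum cycle mean. The function name is hypothetical, and the (size, weight) pairs are assumptions chosen only to match the stated throughputs of 3/4 for C1 and 4/5 for C2; the actual cycle compositions are given in the figure.

```python
from fractions import Fraction

def system_throughput(cycles):
    """cycles maps a cycle name to (size |C|, total weight W(C)).

    Returns the name of the critical cycle and the system throughput,
    which is the reciprocal of the maximum cycle mean.
    """
    means = {name: Fraction(weight, size) for name, (size, weight) in cycles.items()}
    critical = max(means, key=means.get)
    return critical, 1 / means[critical]

# Assumed sizes and weights, consistent with throughputs 3/4 and 4/5.
cycles = {"C1": (3, 4), "C2": (4, 5)}
critical, throughput = system_throughput(cycles)
print(critical, throughput)  # C1 3/4
```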

2.2 The Effect of Floorplanning to Throughput

The previous section described how wire latency affects system throughput. The latency of an interconnect channel, however, depends on the floorplanning result. For example, two different floorplans, FP1 and FP2, for the same system are shown in Figure 2-7. This system has one cycle a-b-c-a, and the two floorplans have the same area and wire length. The only difference between them is the positions of blocks c and d.

Figure 2-7. Floorplanning results FP1 & FP2.

Assume that the data transfer from b to d in FP1 and from b to c in FP2 cannot be completed within one clock cycle. Hence, a pipeline element must be inserted in both cases. After the pipeline elements are inserted, the performance of the two floorplans differs significantly. As shown in Figure 2-8, the throughput of FP1 remains 1, while the throughput of FP2 drops from 1 to 3/4 (a 25% performance loss).

Figure 2-8. Floorplans FP1 and FP2 after pipelining (cycle mean λ = 3/3 = 1 for FP1 and λ = 4/3 for FP2).

This example shows the strong effect of floorplanning on throughput. If the floorplanner does not handle this issue carefully, the throughput degradation can be large. Therefore, we seek a new method that improves the system throughput at similar area and wire length cost.
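The FP1/FP2 comparison can be reproduced with a minimal sketch that enumerates the cycles of a small task graph and evaluates the maximum cycle mean. The edge set (a→b, b→c, c→a, b→d) is an assumption inferred from the description of the example, and the function names are hypothetical. The brute-force cycle enumeration is only meant for graphs of this size.

```python
from fractions import Fraction

def simple_cycles(edges):
    """Enumerate the simple cycles of a small directed graph as vertex lists."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    cycles = []

    def extend(start, path):
        for nxt in graph[path[-1]]:
            if nxt == start:
                cycles.append(list(path))
            elif nxt not in path and nxt > start:
                # Only visit vertices larger than the start vertex, so each
                # cycle is reported once, from its smallest vertex.
                extend(start, path + [nxt])

    for start in sorted(graph):
        extend(start, [start])
    return cycles

def throughput(edge_latency, block_latency=1):
    """edge_latency maps (u, v) to the number of pipeline elements on that edge."""
    means = []
    for cycle in simple_cycles(edge_latency):
        hops = zip(cycle, cycle[1:] + cycle[:1])
        weight = block_latency * len(cycle) + sum(edge_latency[h] for h in hops)
        means.append(Fraction(weight, len(cycle)))
    return 1 / max(means) if means else Fraction(1)

# Assumed task graph; FP1 pipelines b->d (outside the cycle),
# FP2 pipelines b->c (inside the cycle a-b-c-a).
fp1 = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 0, ("b", "d"): 1}
fp2 = {("a", "b"): 0, ("b", "c"): 1, ("c", "a"): 0, ("b", "d"): 0}
print(throughput(fp1), throughput(fp2))  # 1 3/4
```

The same number of pipeline elements is inserted in both floorplans; only the one that lands on a cycle edge lowers the throughput.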

2.3 Problem Formulation

The problem is formulated as a modified floorplanning problem. The modifications are the type of input data and an additional objective. It can be stated as follows.

Given:

1. A system task graph with physical information

2. The wire length that a signal can reach within one clock cycle, WCLK (WCLK is used to calculate the latency of an interconnect)

Constraint: A feasible, non-overlapping floorplan

Objective:

1. Maximize throughput (i.e., minimize the maximum cycle mean)

2. Minimize area and wire length

In an SA-based floorplanning algorithm, the cost function guides the search toward the objectives. Conventionally, the cost function contains only the objectives for area and wire length. We add a new term for maximizing throughput to the cost function Φ, which is now defined in Equation (3).

Cost function:

$$\Phi = \alpha A + \beta W + \gamma f(P) \tag{3}$$

In Equation (3), A is the total area, W is the total wire length, and f(P) represents the cost of throughput; a smaller f(P) means higher throughput. α, β, and γ are weighting constants. If one objective is emphasized, raising its weighting constant makes the floorplan better on that objective.
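The role of the weighting constants in Equation (3) can be seen in a minimal sketch. The function name, the two candidate floorplans, and all numeric values are hypothetical; the text does not state how the three terms are normalized, so the raw values here are for illustration only.

```python
def floorplan_cost(area, wire_length, throughput_cost,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Equation (3): weighted sum of area, wire length, and throughput cost f(P)."""
    return alpha * area + beta * wire_length + gamma * throughput_cost

# Two hypothetical candidates: x is slightly smaller, y has the better f(P).
x = dict(area=100.0, wire_length=50.0, throughput_cost=4 / 3)
y = dict(area=102.0, wire_length=50.0, throughput_cost=1.0)

for gamma in (1.0, 10.0):
    cost_x = floorplan_cost(**x, gamma=gamma)
    cost_y = floorplan_cost(**y, gamma=gamma)
    print(gamma, "prefers", "x" if cost_x < cost_y else "y")
```

With a small γ the smaller-area candidate has the lower cost; raising γ shifts the preference to the candidate with the lower throughput cost.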

2.4 Related Works

Floorplanning with throughput consideration has been studied before. Two related works are discussed in the following sections.

2.4.1 The Modified SA-based Adjacent Constraint Graph (ACG) Floorplanner [14]

In [14], the authors identify the maximum cycle mean among all cycles and use it directly as f(P) in the cost function of their SA-based floorplanner [15]. This method reflects the overall system throughput in the cost function immediately, without considering any minor factors, and thus provides a high correlation between cost and throughput. However, the maximum cycle mean might not be smooth enough to serve as a good cost function.

For example, consider the system S1 and the floorplan FS1-a shown in Figure 2-9. This system has three cycles C1, C2, and C3. In FS1-a, the cycle C2 is the critical cycle with the maximum cycle mean (4/3), and it bounds the system throughput at 3/4. The other two cycles each have a cycle mean of 5/4.

To improve the throughput, the floorplanner tends to choose a result that improves the critical cycle C2. In the result shown in Figure 2-10, the cycle mean of C2 improves from 4/3 to 1. However, C3, with a cycle mean of 3/2, becomes the new critical cycle, and the system throughput decreases from 3/4 (FS1-a) to 2/3 (FS1-b).


Figure 2-9. The system S1 and its resulting floorplan FS1-a.


Figure 2-10. The resulting floorplan FS1-b of the system S1.

As this example of the system S1 shows, directly using the maximum cycle mean as f(P) in the cost function may increase the other, minor cycle means. Thus, f(P) must be designed to keep both the critical and the minor cycle means from increasing.

2.4.2 The SA-based Floorplanner with Correlative Cost Function [11-13]

In [11-13], the authors are the first to investigate throughput-driven floorplanning. They introduce a simulated annealing (SA)-based floorplanner optimized for system throughput. However, they claim that the throughput is too sensitive to small local changes, so it is not a good idea to use it directly in the cost function.

Instead, they create a correlative f(P) that discourages pipeline element insertion into the edges of critical cycles. Initially, every edge is assigned a weight equal to the reciprocal of the size of the smallest cycle containing it, because inserting a pipeline element into a small cycle easily decreases the system throughput. If an edge has a large reciprocal value, inserting a pipeline element into it is more dangerous to the throughput. Then, at each iteration, the edge latency is updated according to the current floorplanning solution, and the SA-based floorplanner is guided to reduce the sum, over all edges, of the product of weight and latency. A small value means fewer pipeline elements on the critical edges, which should improve the system throughput. Taking the floorplan of the system S1 shown in Figure 2-9 as an example, the edge g-e has one pipeline element and its smallest cycle size is |C3| = 3, so its partial cost is 1/3. Similarly, the partial cost of edge c-d is 1/4 and that of edge c-e is 1/4. Thus the final cost f(P) of this method is 1/3 + 1/4 + 1/4 = 10/12.
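The weighted-edge cost described above can be written as a short sketch. The function name and the data layout are hypothetical; the three pipelined edges and their smallest cycle sizes are the ones quoted for FS1-a, and edges without pipeline elements are omitted because they contribute zero.

```python
from fractions import Fraction

def correlative_cost(edges):
    """Sum over edges of latency * (1 / size of the smallest cycle containing the edge).

    edges is an iterable of (pipeline_elements, smallest_cycle_size) pairs.
    """
    return sum((Fraction(latency, size) for latency, size in edges), Fraction(0))

# Pipelined edges of FS1-a as given in the text.
fs1a_edges = {"g-e": (1, 3), "c-d": (1, 4), "c-e": (1, 4)}
fs1a_cost = correlative_cost(fs1a_edges.values())
print(fs1a_cost)  # 5/6, i.e. 10/12
```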

The authors claim that the designated cost function is better than the real system throughput for an SA-based floorplanner because of its modest sensitivity to local changes and its fast computation. However, the overall system throughput depends solely on the most critical cycle, the one with the maximum cycle mean. Thus, it is arguable how strong the correlation between the designated cost function and the final system throughput is.

According to the claim in [11], a floorplan with a smaller cost has a better throughput. For example, to reduce the cost of FS1-a in Figure 2-9, which is 0.83 (10/12), FS1-b in Figure 2-10 could be chosen as the new floorplan because it has a smaller cost of 0.75 (9/12). However, FS1-a, with a throughput of 0.75 (3/4), is actually better than FS1-b, with a throughput of 0.67 (2/3). Therefore, neither the weighted edge cost nor the most critical cycle mean improves the actual system throughput very well.

Chapter 3

Throughput-Aware Floorplan by Dynamically Considering Multiple Critical Cycles

3.1 The Consideration of Multiple Critical Cycles

Following the discussion in Section 2.4, a new method for f(P) is needed to balance correlation and sensitivity. Therefore, instead of all cycles as in [11] or only the maximum cycle mean as in [14], our methodology considers a set of the most critical cycles. This set is denoted CTp and the number of cycles in it is its size |CTp|. The cycles in CTp are chosen according to their cycle means, and |CTp| is a parameter. For example, assume a system has ten cycles whose cycle means are shown in Figure 3-1. The number on the x-axis is the cycle number and the height of each bar is the corresponding cycle mean.

In this example, if we choose the 50% most critical cycles, then the five cycles with the largest cycle means (C3, C4, C5, C9, and C10, which are colored in Figure 3-1) form CTp, and |CTp| is five. f(P) is now designed to improve the cycle means in CTp: our objective is to minimize the cycle mean of each cycle in CTp. Hence, we use the average cycle mean of all cycles in CTp, called λTp, to represent the influence of throughput in the cost function. Therefore, the f(P) in Equation (3) can be rewritten as Equation (4).

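$$f(P) = \lambda_{Tp} = \frac{1}{|C_{Tp}|} \sum_{C \in C_{Tp}} \lambda(C) \tag{4}$$

Figure 3-1. The cycle means of 10 cycles in a system (x-axis: cycle number; y-axis: cycle mean).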
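The selection of CTp and the average λTp can be sketched as follows. The function names are hypothetical, and the ten cycle means are invented values chosen only so that C3, C4, C5, C9, and C10 are the five largest, as in the example; they are not read from Figure 3-1. Rounding the set size up with a ceiling is also an assumption.

```python
import math
from fractions import Fraction

def critical_set(cycle_means, fraction):
    """Names of the most critical cycles: the top `fraction` by cycle mean."""
    size = max(1, math.ceil(fraction * len(cycle_means)))
    ranked = sorted(cycle_means, key=cycle_means.get, reverse=True)
    return ranked[:size]

def average_critical_mean(cycle_means, fraction):
    """Average cycle mean over the critical set (lambda_Tp)."""
    chosen = critical_set(cycle_means, fraction)
    return sum(cycle_means[c] for c in chosen) / len(chosen)

# Hypothetical cycle means for ten cycles.
cycle_means = {
    "C1": Fraction(1), "C2": Fraction(5, 4), "C3": Fraction(3, 2),
    "C4": Fraction(5, 3), "C5": Fraction(2), "C6": Fraction(1),
    "C7": Fraction(6, 5), "C8": Fraction(4, 3), "C9": Fraction(7, 4),
    "C10": Fraction(3, 2),
}
chosen = critical_set(cycle_means, 0.5)
lambda_tp = average_critical_mean(cycle_means, 0.5)
print(sorted(chosen), lambda_tp)
```

Setting the fraction small enough that only one cycle is selected reduces λTp to the maximum cycle mean.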

In our work, the percentage of chosen cycles is set initially in the SA floorplanner. However, Equation (4) by itself is not an ideal form of f(P). Although this method helps prevent a minor cycle from becoming a worse critical cycle, it is open to the criticism raised in Section 2.4.2 that a small cost may not imply a good throughput. In addition, as mentioned in Section 2.1, the system throughput is determined only by the maximum cycle mean. Thus, keeping a fixed number of cycles in CTp throughout the SA iterations is not suitable. The next section develops a more practical approach.

3.2 Dynamic Cycle Set CTp

Although the cycle set CTp is highly correlated with the throughput and not very sensitive to local changes, λTp alone is still not good enough to represent improvement in the system throughput, because the maximum cycle mean always bounds the system throughput. Thus, when the SA process is near its end, we should concentrate on decreasing the maximum cycle mean only. In other words, CTp should eventually contain only the cycle with the maximum cycle mean at low annealing temperature (i.e., |CTp| becomes one at the end of the SA iterations). This is more practical than considering multiple cycles for the final floorplanning result.

An example is shown in Figure 3-2: (a) represents the cycle means at the first iterations of floorplanning and (b) the cycle means near the end. The cycle means in (a) are poor, so the most critical cycles need to be considered together. During the SA iterations, λTp decreases, so most cycle means in (b) are small. Therefore, considering multiple cycles is not suitable for (b). In other words, reducing the maximum cycle mean is the best approach under the conditions of (b).


Figure 3-2. The possible cycle-means at the beginning (a) and ending (b) in SA process.

However, a dramatic drop of |CTp| should be avoided in the SA process. SA is a gradually converging process, and an abrupt change may disrupt it. An abrupt decrease of |CTp| may therefore damage the smooth behavior of the SA-based floorplanning algorithm, so that the throughput is not improved. Thus, we develop a mechanism to decrease |CTp| smoothly during the SA process.

Assume that as the temperature goes down, most cycle means in CTp decrease, so that the number of non-critical cycles increases. Therefore, |CTp| can shrink as the temperature cools down, and CTp should contain only the cycle with the maximum cycle mean near the end of the SA iterations. Let T denote the annealing temperature in the SA algorithm and r the cooling ratio of T (i.e., Tnew = rTold, with 0 < r < 1). To guarantee that CTp is eventually left with only the cycle with the maximum cycle mean, a temperature threshold, called Tthreshold, is given. |CTp| decreases with the same ratio r as T while T is greater than or equal to Tthreshold. This formulation of |CTp| is given in Equation (5).

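$$|C_{Tp}|_{new} = \begin{cases} r \cdot |C_{Tp}|_{old}, & T \ge T_{threshold} \\ 1, & T < T_{threshold} \end{cases} \tag{5}$$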

Consequently, CTp is the set of most critical cycles, whose members change according to their cycle means, and |CTp| decreases with the temperature until T is smaller than Tthreshold. The initial value of |CTp| depends on how poor a cycle mean must be before we want to improve it. The other parameter, Tthreshold, is decided by when the consideration of multiple cycles should end. With this dynamic consideration of multiple critical cycles, λTp in Equation (4) becomes, in our opinion, the best way to compute f(P).
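The shrinking of |CTp| with the temperature can be sketched as a schedule. The function name and all parameter values are hypothetical. The text does not say how a fractional set size is rounded, so this sketch assumes a real-valued size that is scaled by r at every cooling step and rounded up when used; below Tthreshold the size is forced to one.

```python
import math

def set_size_schedule(initial_size, t_initial, r, t_threshold, steps):
    """Return a list of (temperature, |C_Tp|) pairs for successive cooling steps."""
    temperature, real_size = t_initial, float(initial_size)
    schedule = []
    for _ in range(steps):
        if temperature >= t_threshold:
            # Assumed rounding: keep at least one cycle, round up otherwise.
            size = max(1, math.ceil(real_size))
        else:
            size = 1  # only the cycle with the maximum cycle mean remains
        schedule.append((temperature, size))
        temperature *= r  # T_new = r * T_old
        real_size *= r    # |C_Tp| shrinks with the same ratio
    return schedule

schedule = set_size_schedule(initial_size=5, t_initial=100.0, r=0.5,
                             t_threshold=10.0, steps=6)
sizes = [size for _, size in schedule]
print(sizes)  # [5, 3, 2, 1, 1, 1]
```

Keeping the real-valued size separate from the rounded one avoids the set size getting stuck when r is close to one.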

3.3 The SA-based Flow of Our Approach

Combining the SA-based floorplanning algorithm with our approach, the flow chart in Figure 3-3 shows the whole process of the new throughput-aware floorplanning. The abbreviations in Figure 3-3 are described in Table 3-1. G is the system task graph with the physical information of the hard blocks. X is the representation of a floorplan. FP( ) is the floorplan generation step that generates a new X. p is the acceptance probability, determined by the cost and the temperature. R is a random number between 0 and 1.

Table 3-1: The meanings of the abbreviations in Figure 3-3.

Symbol   Meaning
G        The system task graph
T        Annealing temperature
r        The cooling speed of the temperature
X        Floorplan
FP( )    Floorplanning generation
Φ        Cost
p        Acceptance probability
R        Random number

This flow is separated into three main processes, discussed as follows:

1. Initial setup: The parameters, such as the initial temperature, chip aspect ratio, and temperature cooling ratio, are given. All cycles in the system task graph are identified, and the initial set CTp is determined.

2. SA iteration: This is the main process of the SA-based floorplanner. At every iteration it generates a new floorplan Xnew and computes its cost. If the cost no longer changes or T has decreased to zero, the iteration stops. Otherwise, the floorplanner decides whether to accept Xnew and decreases |CTp| with the temperature according to Equation (5). The cycles in CTp are also re-chosen according to their new cycle means. Xnew is accepted if its cost decreases or if the computed acceptance probability is greater than a random number. Then the next floorplan is generated.

3. Final result: The best floorplan found in the SA process and its throughput are output.
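The three processes above can be sketched as a generic SA loop. This is a minimal sketch under several assumptions, not the floorplanner itself: the function and parameter names are hypothetical, the acceptance probability uses the common exp(-ΔΦ/T) form (the text only says p depends on the cost and the temperature), the loop stops at an assumed minimum temperature rather than testing whether the cost has stopped changing, and the floorplan, its perturbation, and its cost are replaced by a toy one-dimensional problem.

```python
import math
import random

def anneal(x0, cost, perturb, t0=100.0, r=0.9, t_min=1e-3,
           set_size0=5, t_threshold=1.0, seed=0):
    """Minimal SA loop; cost(x, set_size) stands in for the cost function."""
    rng = random.Random(seed)
    x, t = x0, t0
    real_size, set_size = float(set_size0), set_size0
    phi = cost(x, set_size)
    best_x, best_phi = x, phi
    while t > t_min:
        x_new = perturb(x, rng)              # FP(): generate X_new
        phi_new = cost(x_new, set_size)
        delta = phi_new - phi
        # Accept a cheaper solution, or a worse one when p > R.
        if delta <= 0 or math.exp(-delta / t) > rng.random():
            x, phi = x_new, phi_new
            if phi < best_phi:
                best_x, best_phi = x, phi
        t *= r                               # T_new = r * T_old
        real_size *= r                       # shrink |C_Tp| with T
        set_size = max(1, math.ceil(real_size)) if t >= t_threshold else 1
        phi = cost(x, set_size)              # re-evaluate under the new set size
    return best_x, best_phi

# Toy stand-ins: the "floorplan" is an integer and the cost ignores set_size.
def toy_cost(x, set_size):
    return (x - 7) ** 2

def toy_perturb(x, rng):
    return x + rng.choice((-1, 1))

best_x, best_phi = anneal(0, toy_cost, toy_perturb)
print(best_x, best_phi)
```

Because the cost changes meaning as the set size shrinks, comparing costs across iterations to track the best solution is a simplification; a real implementation might track the best solution by its actual throughput instead.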


Figure 3-3. The flow chart of SA-based floorplaner.

Chapter 4

Experimental Results

4.1 Environment Setup and Benchmarks

The benchmarks we use come from two sets, MCNC and GSRC. However, the original benchmarks lack transfer direction information. To add data dependencies between the modules in each benchmark, we break each net into 2-pin nets and randomly assign each a direction. This direction assignment creates cycles in the system. To provide a moderate number of cycles, we limit the number of cycles to around one tenth of the number of blocks in every benchmark. The number of cycles in every benchmark after our modification is shown in Table 4-1.

Table 4-1. The number of blocks and cycles for each benchmark

Set    File name   Blocks   Cycles
MCNC   apte        9        4
MCNC   xerox       10       2
MCNC   hp          11       1
MCNC   ami33       33       5
MCNC   ami49       49       7
GSRC   n10         10       2
GSRC   n30         30       3
GSRC   n50         50       9
GSRC   n100        100      12

The experiments are conducted on a Linux workstation with two Intel Xeon 3.2 GHz CPUs and 3 GB of DRAM.

4.2 Weight Assignment

The weight of each block in a benchmark is set to one, meaning that each block requires one clock cycle for computation. The weight of an interconnect ei is the number of pipeline elements it needs, which is determined by how many clock cycles a signal takes to propagate from one end of the interconnect to the other. For example, an interconnect with a delay of 2.4 clock cycles needs two pipeline elements. Therefore, the weight of ei is defined by Equation (6).

$$w(e_i) = \left\lfloor \frac{\text{length}(e_i)}{W_{CLK}} \right\rfloor \tag{6}$$

The interconnect length is modelled as the Manhattan distance between the centers of the two blocks. Equation (6) computes the delay of ei in clock cycles and takes its floor, which is the minimum number of pipeline elements needed for ei.
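The weight assignment for an interconnect can be sketched directly from this description: the Manhattan distance between block centers, divided by WCLK, rounded down. The function names, coordinates, and die length are hypothetical.

```python
import math

def manhattan(center_a, center_b):
    """Manhattan distance between two block centers given as (x, y)."""
    return abs(center_a[0] - center_b[0]) + abs(center_a[1] - center_b[1])

def edge_weight(center_a, center_b, w_clk):
    """Number of pipeline elements on an interconnect: floor(length / WCLK)."""
    return math.floor(manhattan(center_a, center_b) / w_clk)

# Hypothetical die length of 20 units with k = 4, so WCLK = 5 units.
die_length, k = 20, 4
w_clk = die_length / k

# A wire of length 12 has a delay of 2.4 clock cycles and needs two pipeline elements.
print(edge_weight((0, 0), (7, 5), w_clk))  # 2
# A wire shorter than WCLK needs none.
print(edge_weight((0, 0), (2, 2), w_clk))  # 0
```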

To model the effect of advancing technology, we assume that WCLK shrinks to 1/k of the die length and try several values of k: 1, 2, 4, …, up to 32. A value of 1 means that a signal can cross the die length in one clock cycle, as in the 0.25 µm processor generation [2]. A value of 32 is the number of cycles required for cross-chip signal propagation through the top-level

