The SA-based Floorplanner with Correlative Cost Function[9]

2 Preliminaries

2.4 Related Works

2.4.2 The SA-based Floorplanner with Correlative Cost Function[9]

In [11-13], the authors are the first to investigate throughput-driven floorplanning. They introduce a simulated annealing (SA)-based floorplanner that optimizes for system throughput. However, they claim that the throughput is too sensitive to small local changes to be used directly in the cost function.

Instead, they create a correlative cost f(P) that discourages pipeline element insertion on the edges of critical cycles. Initially, every edge is assigned a weight equal to the reciprocal of the size of the smallest cycle containing it. The reason is that inserting a pipeline element into a small cycle easily decreases the system throughput, so an edge with a large reciprocal value is one where pipeline element insertion is especially harmful to the throughput. Then, at each following iteration, the edge latency is updated according to the current floorplanning solution, and the SA-based floorplanner is guided to reduce the sum, over all edges, of the product of the weight and the latency. A small value of this sum means that fewer pipeline elements lie on the critical edges, so the system throughput is expected to improve. Taking the resulting floorplan of the system S1 shown in Figure 2-9 as an example, the edge g-e has one pipeline element and its smallest cycle size is |C3| = 3. Hence its partial cost is 1/3. Similarly, the partial cost of edge c-d is 1/4 and that of edge c-e is 1/4. Thus the final cost f(P) of this method is 1/3 + 1/4 + 1/4 = 10/12.
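The arithmetic of this example can be checked with a minimal sketch. It is not the implementation of [11]; the function name correlative_cost and the dictionary layout are assumptions made here for illustration, and only the edges that carry at least one pipeline element are listed, since all other edges contribute zero to the sum.

```python
from fractions import Fraction

def correlative_cost(edges):
    """Sum of weight * latency over all edges.

    edges maps an edge name to (latency, smallest_cycle_size), where the
    weight of the edge is 1 / smallest_cycle_size (hypothetical layout).
    """
    return sum(Fraction(latency, size) for latency, size in edges.values())

# Edges of FS1-a that hold one pipeline element each (values from the text).
fs1_a = {"g-e": (1, 3), "c-d": (1, 4), "c-e": (1, 4)}

cost = correlative_cost(fs1_a)
print(cost)         # 5/6, which equals 10/12
print(float(cost))  # approximately 0.83
```

Exact fractions are used so that the result can be compared directly with the value 10/12 quoted in the text.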

The authors claim that the designated cost function is better than the real system throughput for an SA-based floorplanner because of its modest sensitivity to local changes and its fast computation. However, the overall system throughput depends solely on the most critical cycle, the one with the maximum cycle mean. Thus, it is arguable how strong the correlation between the designated cost function and the final system throughput is.

According to the claim in [11], the floorplan with the smaller cost should have the better throughput.

For example, to reduce the cost of FS1-a in Figure 2-9, which is computed to be 0.83 (10/12), FS1-b in Figure 2-10 could be chosen as the new floorplan because it has a smaller cost of 0.75 (9/12).

However, the throughput of FS1-a, 0.75 (3/4), is actually better than that of FS1-b, 0.67 (2/3). Therefore, neither the weighted edge cost nor the most critical cycle mean can improve the actual system throughput very well.

Chapter 3 Throughput-Aware Floorplan by Dynamically Considering Multiple Critical Cycles

3.1 The Consideration of Multiple Critical Cycles

According to the discussion in Section 2.4, a new method for f(P) is needed to balance correlation and sensitivity. Therefore, instead of all cycles as in [11] or only the cycle with the maximum cycle mean as in [14], our methodology takes a set of the most critical cycles into consideration. The set of most critical cycles is denoted CTp, and the number of cycles in CTp is its size |CTp|. The cycles in CTp are chosen according to their cycle means, and |CTp| is parameterizable. For example, assume a system has ten cycles whose cycle means are shown in Figure 3-1. The number on the x-axis is the cycle number, and the height of each bar is the corresponding cycle mean.

In this example, if we choose the 50% most critical cycles, then the half of the cycles with the larger cycle means (C3, C4, C5, C9, and C10, which are colored in Figure 3-1) are picked to form CTp, and |CTp| is five in the floorplanner. f(P) is now designed to improve the cycle means in CTp: minimizing the cycle mean of each cycle in CTp is our objective. Hence, we use the average cycle mean of all cycles in CTp, called λTp, to represent the influence of throughput in the cost function. Therefore, the f(P) in Equation (3) can be rewritten as Equation (4).

Figure 3-1. The cycle means of 10 cycles in a system (x-axis: cycle number; y-axis: cycle mean).
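The selection of CTp and the computation of λTp described above can be sketched as follows. This is only an illustration under stated assumptions: the cycle means below are hypothetical values chosen so that C3, C4, C5, C9 and C10 are the five largest, as in the example; the function names are not from the thesis; and rounding the chosen percentage up to a whole number of cycles is an assumption.

```python
import math

def critical_cycle_set(cycle_means, fraction):
    """Return the names of the most critical cycles (largest cycle means).

    fraction is the chosen share of cycles, e.g. 0.5 for the 50% most critical.
    """
    k = max(1, math.ceil(fraction * len(cycle_means)))
    ranked = sorted(cycle_means, key=cycle_means.get, reverse=True)
    return ranked[:k]

def average_cycle_mean(cycle_means, ctp):
    """Average cycle mean of the cycles in CTp (the quantity called λTp)."""
    return sum(cycle_means[c] for c in ctp) / len(ctp)

# Hypothetical cycle means for ten cycles (not the data of Figure 3-1).
cycle_means = {
    "C1": 1.20, "C2": 1.10, "C3": 1.80, "C4": 2.00, "C5": 1.60,
    "C6": 1.00, "C7": 1.30, "C8": 1.25, "C9": 1.70, "C10": 1.90,
}

ctp = critical_cycle_set(cycle_means, 0.5)
print(sorted(ctp), average_cycle_mean(cycle_means, ctp))
```

With these hypothetical values, |CTp| is five and the average is taken over C3, C4, C5, C9 and C10 only, so the five less critical cycles do not influence the cost.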

In our work, the percentage of chosen cycles is set initially in the SA floorplanner. However, Equation (4) is not an ideal form of f(P). Although this method helps prevent a minor cycle from becoming a worse critical cycle, it is open to the criticism raised in Section 2.4.2 that a small cost may not imply a good throughput. In addition, as mentioned in Section 2.1, the system throughput is decided only by the maximum cycle mean.

Thus, keeping a fixed number of cycles in CTp during all SA iterations does not seem suitable. Therefore, the next section develops a more feasible approach.

3.2 Dynamic Cycle Set CTp

Although the cycle set CTp is highly correlated with the throughput and not very sensitive to local changes, λTp alone is still not good enough to represent the improvement of the system throughput. The maximum cycle mean always bounds the system throughput. Thus, when the SA process is near its end, we should concentrate on decreasing the maximum cycle mean only. In other words, CTp should eventually contain only the cycle with the maximum cycle mean at low annealing temperature (i.e., |CTp| becomes one at the end of the SA iterations). This is more practical for the final floorplanning result than considering multiple cycles.

An example is shown in Figure 3-2: (a) represents the cycle means at the beginning of floorplanning, and (b) represents the cycle means near the end. The cycle means in (a) are poor, so the most critical cycles need to be considered together. During the SA iterations, however, λTp decreases, so most of the cycle means in (b) are small. Therefore, considering multiple cycles is not suitable for (b). In other words, reducing the maximum cycle mean is the best strategy under the condition of (b).

Figure 3-2. The possible cycle means at the beginning (a) and the end (b) of the SA process (y-axis: cycle mean).

However, a dramatic drop of |CTp| should be avoided in the SA process. SA is a gradually converging process, and an abrupt change may disrupt it. An abrupt decrease of |CTp| may therefore damage the smooth behavior of the SA-based floorplanning algorithm, so that the throughput may not be improved. Thus, we develop a mechanism to decrease |CTp| smoothly during the SA process.

Assume that, as the temperature goes down, most cycle means in CTp decrease, so that more cycles become non-critical. Therefore, |CTp| can shrink as the temperature cools down, and CTp should contain only the cycle with the maximum cycle mean near the end of the SA iterations. Let T be the annealing temperature in the SA algorithm and r be the cooling ratio of T (i.e., Tnew = r·Told, with 0 < r < 1). To guarantee that CTp finally retains only the cycle with the maximum cycle mean, a temperature threshold, called Tthreshold, is given. |CTp| decreases with the same ratio r as T while T is greater than or equal to Tthreshold. This formulation of |CTp| is given in Equation (5).

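\[
|CT_p|_{new} =
\begin{cases}
r \cdot |CT_p|_{old}, & T \ge T_{threshold} \\
1, & T < T_{threshold}
\end{cases}
\qquad (5)
\]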

Consequently, CTp is the set of most critical cycles, whose members change according to their cycle means, and |CTp| decreases with the temperature until T is smaller than Tthreshold. The initial size of CTp is decided by how bad a cycle mean must be before we want to improve it. The other parameter, Tthreshold, is decided by when the consideration of multiple cycles should end. With this dynamic consideration of multiple critical cycles, λTp in Equation (4) becomes, in our opinion, the best way to compute f(P).
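The shrinking of |CTp| with the temperature can be illustrated by a short sketch. It follows the description of Equation (5) in the text, but several details are assumptions made here: the size is tracked as a real number and rounded up to a whole number of cycles, it is never allowed to fall below one, and the function name ctp_schedule and the stopping temperature t_min are hypothetical.

```python
import math

def ctp_schedule(initial_size, t0, r, t_threshold, t_min):
    """Return a list of (T, |CTp|) pairs, one per cooling step.

    While T >= t_threshold, |CTp| shrinks with the same ratio r as T;
    below the threshold only the cycle with the maximum cycle mean is kept.
    """
    t = t0
    size = float(initial_size)   # real-valued size, rounded up when used
    trace = []
    while t > t_min:
        if t >= t_threshold:
            k = max(1, math.ceil(size))
        else:
            k = 1
        trace.append((t, k))
        t *= r       # Tnew = r * Told
        size *= r    # |CTp| follows the same cooling ratio
    return trace

# Ten cycles, Tthreshold set to 1/1000 of the initial temperature.
trace = ctp_schedule(10, t0=1000.0, r=0.9, t_threshold=1.0, t_min=0.01)
for t, k in trace[:5]:
    print(f"T = {t:8.2f}  |CTp| = {k}")
```

Because |CTp| decays at the same rate as T, the size changes by at most one cycle per cooling step in this example, which matches the goal of avoiding an abrupt drop.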

3.3 The SA-based Flow of Our Approach

Combining the SA-based floorplanning algorithm with our approach, the flow chart in Figure 3-3 shows the whole process of the new throughput-aware floorplanning. The abbreviations in Figure 3-3 are described in Table 3-1. G is the system task graph with the physical information of the hard blocks. X is the representation of a floorplan. FP( ) is the floorplan generation procedure that generates a new X. p is the acceptance probability decided by the cost and the temperature. R is a random number between 0 and 1.

Table 3-1: The meanings of the abbreviations in Figure 3-3.

Symbol   Meaning
G        The system task graph
T        Annealing temperature
r        The cooling speed of the temperature
X        Floorplan
FP( )    Floorplanning generation
Φ        Cost
p        Acceptance probability
R        Random number

This flow is separated into three main processes, discussed as follows:

1. Initial setup: In this process, parameters such as the initial temperature, the chip aspect ratio, and the temperature cooling ratio are given. All cycles in the system task graph are identified, and the initial set CTp is determined.

2. SA iteration: This is the main process of the SA-based floorplanner. At every iteration it generates a new floorplan Xnew and computes the cost of this floorplan. If the cost no longer changes or T has decreased to zero, the iteration stops. Otherwise, the floorplanner decides whether to take Xnew and decreases |CTp| with the temperature by Equation (5). The cycles in CTp are also re-chosen according to their new cycle means. The floorplan Xnew is taken if its cost decreases or if the computed acceptance probability is greater than a random number. Then the next new floorplan is generated.

3. Final result: It outputs the best floorplan found in the SA process and its throughput.

Figure 3-3. The flow chart of the SA-based floorplanner (input data: G, T).
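The three processes above can be sketched as a small, self-contained annealing loop. This is a toy model, not the floorplanner used in this work, and it rests on several assumptions that are labeled in the code: the cost is λTp only (area terms are omitted), the acceptance probability follows the usual Metropolis rule exp(-Δ/T), the "floorplan" is just a tuple of pipeline element counts per cycle, the cycle mean is modelled as (blocks + pipeline elements) / blocks, and the best solution is the one with the smallest maximum cycle mean. All names are hypothetical.

```python
import math
import random

CYCLE_SIZES = (3, 4, 4, 5)   # hypothetical number of blocks in each cycle

def cycle_means_of(x):
    """Toy model: cycle mean = (blocks + pipeline elements) / blocks."""
    return [(s + e) / s for s, e in zip(CYCLE_SIZES, x)]

def perturb(x, rng):
    """Stand-in for FP(): add or remove one pipeline element on one cycle."""
    i = rng.randrange(len(x))
    step = rng.choice((-1, 1))
    return x[:i] + (max(0, x[i] + step),) + x[i + 1:]

def lambda_tp(x, k):
    """Average of the k largest cycle means (the set CTp)."""
    top = sorted(cycle_means_of(x), reverse=True)[:k]
    return sum(top) / len(top)

def anneal(x0, t0, r, t_threshold, t_min, moves_per_t, rng):
    size = float(len(x0))    # initial setup: CTp contains all cycles
    t, x, best = t0, x0, x0
    while t > t_min:
        # |CTp| follows Equation (5); rounding up is an assumption.
        k = max(1, math.ceil(size)) if t >= t_threshold else 1
        cost = lambda_tp(x, k)
        for _ in range(moves_per_t):
            x_new = perturb(x, rng)
            cost_new = lambda_tp(x_new, k)   # CTp is re-chosen for Xnew
            delta = cost_new - cost
            # Metropolis acceptance probability (assumed form of p).
            p = 1.0 if delta <= 0 else math.exp(-delta / t)
            if rng.random() < p:             # R is uniform in [0, 1)
                x, cost = x_new, cost_new
                if max(cycle_means_of(x)) < max(cycle_means_of(best)):
                    best = x
        t *= r       # Tnew = r * Told
        size *= r
    return best

x0 = (3, 2, 4, 1)
best = anneal(x0, t0=10.0, r=0.9, t_threshold=0.01, t_min=0.001,
              moves_per_t=20, rng=random.Random(0))
print(best, 1 / max(cycle_means_of(best)))   # floorplan and its throughput
```

The seeded random generator keeps the run repeatable. In a real floorplanner, perturb would modify the floorplan representation X and the cycle means would be recomputed from the interconnect lengths of the new floorplan.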

Chapter 4 Experimental Results

4.1 Environment Setup and Benchmarks

The benchmarks we use come from two sets, MCNC and GSRC. However, the original benchmarks lack transfer direction information. To add data dependencies between the modules in each benchmark, we break each net into 2-pin nets and randomly assign each a direction. The system contains cycles because of this direction assignment. To provide a moderate number of cycles in a system, we limit the number of cycles to around one tenth of the number of blocks in every benchmark. The number of cycles in every benchmark after our modification is shown in Table 4-1.

Table 4-1. The number of blocks and cycles for each benchmark

Set    File name   Blocks   Cycles
MCNC   apte        9        4
MCNC   xerox       10       2
MCNC   hp          11       1
MCNC   ami33       33       5
MCNC   ami49       49       7
GSRC   n10         10       2
GSRC   n30         30       3
GSRC   n50         50       9
GSRC   n100        100      12

The experiments are conducted on a Linux workstation with two Intel Xeon 3.2 GHz CPUs and 3 GB of DRAM.

4.2 Weight Assignment

The weight of each block in a benchmark is assigned to be one, which means that each block requires one clock cycle for computation. The weight of an interconnect ei is the number of pipeline elements it needs, and this number is determined by how many clock cycles a signal takes to propagate from one end of the interconnect to the other. For example, an interconnect with a delay of 2.4 clock cycles needs two pipeline elements inserted.

Therefore, the weight of ei is defined by Equation (6):

\[
w(e_i) = \left\lfloor \frac{\mathrm{length}(e_i)}{W_{CLK}} \right\rfloor \qquad (6)
\]

The interconnect length is modelled as the Manhattan distance between the centers of the two blocks. Equation (6) computes the clock cycle delay of ei and takes its floor, which is the minimum number of pipeline elements needed for ei.

To model the effect of advancing technology, we assume that WCLK shortens to 1/k of the die length and consider several values of k: 1, 2, 4, ..., up to 32. A value of 1 means that a signal can cross the die length in one clock cycle, as in the 0.25 um processor generation [2]. A value of 32 is the number of cycles required for cross-chip signal propagation through the top-level metal wire in the 35 nm processor generation [18].
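The weight assignment of this section can be sketched in a few lines. The function name, the coordinate tuples and the parameter die_length are assumptions for illustration; in particular, which die dimension is meant by "die length" is not specified here, so a single scalar is used.

```python
import math

def pipeline_elements(center_a, center_b, die_length, k):
    """Interconnect weight: floor(Manhattan length / WCLK).

    WCLK, the distance a signal travels in one clock cycle, is assumed to be
    die_length / k, so length / WCLK is computed as length * k / die_length.
    """
    length = abs(center_a[0] - center_b[0]) + abs(center_a[1] - center_b[1])
    return math.floor(length * k / die_length)

# Manhattan length 60 on a die of length 100 with k = 4: WCLK = 25, so the
# delay is 2.4 clock cycles and two pipeline elements are needed.
print(pipeline_elements((0, 0), (40, 20), die_length=100, k=4))  # 2
```

The same interconnect needs no pipeline element at k = 1 and 19 at k = 32, which shows how quickly the edge weights grow as WCLK shortens.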

4.3 Results

4.3.1 Experiment I

We apply four different methods for cost computation in the same floorplanner: Method_A is the conventional floorplanner that optimizes area only. The other methods also consider throughput: Method_M considers the maximum cycle mean only [13]; Method_C is a re-implementation of the correlative cost in [10]; and Method_D is our approach. In our method, CTp initially contains all cycles and Tthreshold is set to 1/1000 of the initial temperature.

Under the same value of k, we run every benchmark ten times with each of the four methods. The average throughput and dead space are shown in Table 4-2.

P is the system throughput and DS is the percentage of dead space.

Table 4-2. Average throughput and dead space.

      Method_A       Method_M       Method_C       Method_D
k     P     DS(%)    P     DS(%)    P     DS(%)    P     DS(%)
1     0.73  8.15     0.92  8.41     0.99  8.38     1.00  8.49
2     0.45  7.75     0.64  8.68     0.76  9.92     0.88  10.69
4     0.26  8.04     0.39  8.35     0.49  9.96     0.63  12.51
8     0.15  7.94     0.23  8.58     0.26  9.94     0.41  12.18
16    0.07  7.95     0.12  8.59     0.14  9.84     0.22  12.69
32    0.04  7.74     0.06  8.50     0.07  10.26    0.11  12.54

From Table 4-2, the throughput decreases sharply as k increases. The dead space of our method is below 13%, while that of Method_A and Method_M is below 9% and that of Method_C is below 11%. Our method (Method_D) always achieves better throughput than the others.

The comparison of throughput among the four methods is shown in Figure 4-1. The area overhead is shown in Table 4-3. All the results are normalized to Method_A.

From Figure 4-1, the throughput improvement increases as the number of cycles increases, and our method achieves the best results. The area overhead, which does not exceed 5%, is quite small, as shown in Table 4-3.
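The normalization to Method_A can be illustrated with the average throughputs of Table 4-2. This is only a sketch: the function name is hypothetical, and because it divides rounded averages, its ratios may differ slightly from the values plotted in Figure 4-1.

```python
# Average throughput P per method and k, copied from Table 4-2.
TABLE_4_2 = {
    1:  {"Method_A": 0.73, "Method_M": 0.92, "Method_C": 0.99, "Method_D": 1.00},
    2:  {"Method_A": 0.45, "Method_M": 0.64, "Method_C": 0.76, "Method_D": 0.88},
    4:  {"Method_A": 0.26, "Method_M": 0.39, "Method_C": 0.49, "Method_D": 0.63},
    8:  {"Method_A": 0.15, "Method_M": 0.23, "Method_C": 0.26, "Method_D": 0.41},
    16: {"Method_A": 0.07, "Method_M": 0.12, "Method_C": 0.14, "Method_D": 0.22},
    32: {"Method_A": 0.04, "Method_M": 0.06, "Method_C": 0.07, "Method_D": 0.11},
}

def normalized_throughput(table, baseline="Method_A"):
    """Divide every throughput by the baseline throughput at the same k."""
    return {
        k: {method: p / row[baseline] for method, p in row.items()}
        for k, row in table.items()
    }

normalized = normalized_throughput(TABLE_4_2)
for k, row in normalized.items():
    print(k, {method: round(ratio, 2) for method, ratio in row.items()})
```

A ratio of 2.0 means twice the throughput of the area-only floorplanner, that is, a 100% improvement.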

Figure 4-1. Average throughput improvement for Method_M, Method_C and Method_D (y-axis: the improvement of throughput).

Table 4-3. Average area overhead for Method_M, Method_C, and Method_D

4.3.2 Experiment II

In this experiment, we allow more area overhead by decreasing the weighting constant α in our method; the resulting variant is called Method_D*. The comparisons of throughput and area overhead between Method_D and Method_D* are shown in Figure 4-2 and Table 4-4. The data are also normalized to Method_A.

Figure 4-2. Average throughput improvement of Method_D and Method_D* (y-axis: the improvement of throughput).

Table 4-4. Average area overhead for Method_D and Method_D*

According to Figure 4-2 and Table 4-4, if more area overhead is allowed, an even more significant improvement of throughput can be achieved as the value of k increases. The area overhead increases to about 10%, but the improvement of Method_D* rises to nearly 300% at k = 32.

4.4 Stability

For all methods, we compute the average standard deviation of the maximum cycle mean over the results of each benchmark. The results are shown in Figure 4-3.

Figure 4-3. The standard deviations for every method under different numbers of cycles (y-axis: the average standard deviation of the maximum cycle means; series: Method_A, Method_M, Method_C, Method_D, Method_D*).

From Figure 4-3, Method_A has the greatest standard deviation because it does not consider throughput at all. Among the throughput-aware floorplanners, Method_M has the greatest standard deviation because it considers only the maximum cycle mean. Our method has a smaller standard deviation than Method_C because improving the throughput of all cycles as in [10] may not actually improve the system throughput. Moreover, our method considers a set of multiple critical cycles with dynamically decreasing size, so it directly improves the system throughput at low temperature.

Method_D* has a smaller standard deviation than Method_D because more area overhead is allowed in Method_D*. Therefore, the floorplanner has more opportunities to place

4.5 Discussion

In Experiment I, the throughput degrades very quickly as the required number of cycles increases. Therefore, all methods that optimize throughput show an improvement, and the improvement grows as the number of cycles increases (i.e., as the wire delay becomes worse). When the number of cycles is small, the throughput improvements differ little, because the wire weights are still small and a high throughput is easy to achieve. Our throughput-aware floorplanning shows the greatest improvement in throughput. Even when the number of cycles increases, our approach improves more than the other methods. The improvement reaches almost 200% when the number of cycles is 16 or 32. Moreover, the area overhead of our method is no more than 5% relative to Method_A.

In Experiment II, Method_D* improves even more than Method_D. It achieves more than 250% improvement when the number of cycles is greater than 16, while the area overhead increases by only about 9%. Thus, if relaxing the area constraint is allowed, better throughput is achieved than under the area constraint.

In Section 4.4, our method has the smallest standard deviation among Method_A through Method_D*. Therefore, our method provides more stable results than the other methods. In particular, Method_D*, which relaxes the area constraint, has a smaller standard deviation than Method_D. A possible reason is that relaxing the area constraint enlarges the set of possible positions for block movements. Thus, the standard deviation with the relaxed area constraint is smaller than with the constraint.

Chapter 5 Throughput-Aware Design Flow

Our experimental results show that a great improvement in system throughput can be achieved by using our method. However, if the result given by the throughput-aware floorplanner still cannot satisfy the design constraint, a redesign iteration is unavoidable. Figure 5-1 shows a design process from system-level design to floorplanning.

Figure 5-1. The redesign iterations (stages: system-level design, RTL & logic design, throughput-aware floorplanner).

If the throughput-aware floorplanner cannot generate a satisfactory result, the RTL & logic design is modified again in the hope of an improved result after floorplanning is performed. However, if this redesign iteration fails to improve the throughput, the system-level architecture needs to be rebuilt. Nevertheless, the throughput of the rebuilt system architecture remains unknown until floorplanning is done. Thus, an unacceptable throughput after floorplanning makes the design process repeat from the system level to floorplanning. Such redesign iterations cost too much time and reduce the benefit.

Therefore, we propose a new methodology that helps reduce the possible design iterations. The flow of this throughput-evaluation methodology is shown in Figure 5-2. The gray part marks the proposed new design stages, which reduce the redesign iterations from floorplanning back to the system level by evaluating the throughput at an early design stage.

Figure 5-2. The recommended global design flow chart for preliminary throughput.

First, the functional blocks are given estimated physical information: area and clock latency. Then, our proposed method is applied to calculate a preliminary throughput value from floorplanning. If the resulting throughput is not satisfactory, the system-level design can be modified immediately, so the likelihood of redesigns from floorplanning back to the system level is greatly reduced.

Chapter 6 Conclusions

In this thesis, a new throughput-aware floorplanning strategy is proposed to maximize system performance. It optimizes a dynamic set of the most critical cycles simultaneously, so a more stable and better result can be obtained. Compared with previous works, our approach is not only highly correlated with the final system throughput but also insensitive to minor local changes during floorplanning refinement. The experimental results show that our floorplanner achieves about twice the throughput of area-only minimization when the applied clock cycles increase to 16~32, while the area increases by only about 3%~10% under the various settings of applied cycles. In addition, a recommended design flow is proposed for preliminary throughput-aware floorplanning at the system level, which helps reduce design iterations. Hence, we believe the proposed method and flow can provide better floorplan solutions in the coming multi-cycle communication era.

References

[1] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2003.

[2] D. Matzke, “Will physical scalability sabotage performance gains?” IEEE Computer, vol. 30, pp. 37-39, 1997.

[3] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation, Stanford Univ., CA, 1984.

[4] D. S. Bormann and P. Y. Cheung, “Asynchronous wrapper for heterogeneous systems,” in Proc. ICCD, pp. 307-314, 1997.

[5] J. Muttersbach, T. Villiger and W. Fichtner, “Practical design of globally-asynchronous locally-synchronous systems,” in Proc. ASYNC, pp. 52-59, 2000.

[6] L. P. Carloni, K. L. McMillan, A. Saldanha and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” in Proc. ICCAD, pp. 309-315, 1999.

[7] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Performance analysis and optimization of latency insensitive protocols,” in Proc. DAC, pp. 21-26, 2001.
