We have established the relationship between performance and floorplanning above. The process of floorplanning is usually iterative. For example, a simulated annealing floorplanner iterates as the temperature changes: at each temperature it makes moves, each of which changes the floorplan and may go uphill or downhill in cost, until a termination condition is met.
The cost of the changed floorplan is evaluated by the objective function, and the cost decides whether the move is accepted or rejected. The objective function therefore plays an important role in floorplanning, since it decides whether a floorplan is accepted. However, the objective functions generally used so far consist of factors such as area, total wire length, and aspect ratio. The question is whether these objective functions can guide the floorplanner toward better performance, and whether they can guide it to make changes of the kind discussed in Section 3.1.
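The accept/reject decision described above can be sketched as follows. This is a minimal sketch of a generic simulated annealing loop with the standard Metropolis acceptance rule, not the floorplanner used in this thesis. The function name `anneal`, the geometric cooling schedule, and all parameter values are assumptions made for illustration.

```python
import math
import random


def anneal(initial, cost, neighbor, t_start=10.0, t_end=0.01,
           alpha=0.9, moves_per_temp=50, seed=0):
    """Generic simulated annealing; all parameter defaults are hypothetical."""
    rng = random.Random(seed)
    current = best = initial
    current_cost = best_cost = cost(initial)
    t = t_start
    while t > t_end:                          # termination condition
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)   # a move changes the solution
            delta = cost(candidate) - current_cost
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / t) (Metropolis rule).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha                            # geometric cooling (assumed)
    return best, best_cost


if __name__ == "__main__":
    # Toy stand-in for a floorplan: an integer whose cost is (x - 7)^2.
    solution, value = anneal(0, lambda x: (x - 7) ** 2,
                             lambda x, rng: x + rng.choice((-1, 1)))
    print(solution, value)
```

In a floorplanner, `cost` would be the objective function and `neighbor` would perturb the floorplan representation, which is why the choice of objective function determines which floorplans survive.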
To resolve these issues, we propose a methodology intended to guide the floorplanning process from the viewpoint of performance. We believe that considering only total wire length and area is not enough for performance. The main concept of our methodology is that different kinds of instructions pass through different functional blocks in the processor. The collection of functional blocks that an instruction passes through forms a path. For example, an ALU-type instruction passes through a path consisting of the functional blocks IF, ID, RF, EX, ALU, MEM/WB, and a Shift-type instruction goes through IF, ID, RF, EX, SR, MEM/WB. We guide the floorplanning process from the viewpoint of these paths instead of the traditional objective function: our objective function consists of the "ingredients" of these paths. The ingredients are in fact the latencies of the interconnects along the paths. We model the delay of each path as a function of interconnect latencies, as shown in Equation 1.1 and Equation 1.2 and calculated in Section 3.1. These functions are then used as the objective function to guide the floorplanning process.
To model the latency of an interconnect, we use the Manhattan distance between the two functional blocks that the interconnect connects. Once the Manhattan distance is obtained, the corresponding latency can be calculated. We use the unit interconnect delay from [10], which is 55 ps/mm. The delay of an interconnect is the product of its length and the unit delay. The latency in clock cycles is then obtained by dividing this delay by the clock cycle time.
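The latency calculation above can be sketched as follows. This is a minimal sketch under stated assumptions: block positions are taken as (x, y) centers in millimetres, the 500 ps clock period in the example is hypothetical, and the optional rounding up to whole cycles is an assumption that the text does not state.

```python
import math

UNIT_DELAY_PS_PER_MM = 55.0  # unit interconnect delay quoted from [10]


def manhattan_distance(a, b):
    """Manhattan distance between two block centers given as (x, y) in mm."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def interconnect_latency(a, b, clock_period_ps, round_up=False):
    """Latency in clock cycles of the interconnect between blocks a and b."""
    length_mm = manhattan_distance(a, b)
    delay_ps = length_mm * UNIT_DELAY_PS_PER_MM   # length times unit delay
    cycles = delay_ps / clock_period_ps           # delay over clock cycle time
    # Rounding up to whole cycles is an assumption, not taken from the text.
    return math.ceil(cycles) if round_up else cycles


if __name__ == "__main__":
    # Hypothetical block centers 10 mm apart and a hypothetical 500 ps clock.
    print(interconnect_latency((0, 0), (6, 4), clock_period_ps=500.0))
```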
Since there are many kinds of instructions, there are correspondingly many paths in a processor. However, there is no need to optimize the floorplan for every kind of instruction; in fact, it is impossible to optimize for all of them at once. We need some trade-off between these instructions.
Here we use the instruction mix for weighting. By the definition in [6], the instruction mix is a measure of the dynamic frequency of instructions across one or many programs. For example, if a program consists of 14% Load/Store instructions, 57% ALU-type instructions, 15% Shift-ALU-type instructions, and 12% branch instructions, these percentages are the instruction mix of the program. We first run the benchmarks on a simulator to profile the instruction mix. The details of the simulator are given in the next section. Once the instruction mix is obtained, it is used as the weighting of the different paths.
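Profiling an instruction mix amounts to counting executed instructions per class. The following is a minimal sketch, assuming the simulator can emit a dynamic trace of mnemonics. The `CLASS_OF` table, the class names, and the example trace are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# Hypothetical mapping from mnemonic to instruction class (illustration only).
CLASS_OF = {
    "LD": "load_store", "SD": "load_store",
    "DADD": "alu", "DADDI": "alu",
    "DSLL": "shift",
    "BNE": "branch",
}


def instruction_mix(trace):
    """Return the fraction of executed instructions in each class."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    counts = Counter(CLASS_OF[op] for op in trace)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical dynamic trace of eight executed instructions.
    trace = ["LD", "DADD", "DADD", "DSLL", "BNE", "DADD", "SD", "DADDI"]
    print(instruction_mix(trace))
```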
By using this heuristic, the relationship between a floorplan and its performance is established. The floorplan influences the clock cycle count of instructions, and the clock cycle count of instructions influences performance; therefore the floorplan influences performance. Different paths in the processor are weighted according to their importance, thereby achieving a good trade-off. We believe this heuristic relates physical information to microarchitectural issues well.
The flow chart of the proposed methodology is shown in Figure 3.3.
Figure 3.3: The flow chart of the proposed methodology. The objective function is obtained by characterizing the given microarchitecture. The weights of the factors in the objective function are obtained from the instruction mix. After floorplanning, the resulting performance is verified by the Latency Configurable Instruction Set Simulator.
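The weighting of paths by instruction mix can be sketched as follows. This is a minimal sketch under stated assumptions: the delay of a path is taken here as the plain sum of the interconnect latencies between consecutive blocks, which may differ from Equations 1.1 and 1.2. The two paths come from the text, the weights reuse the example percentages given earlier, and the link latencies are hypothetical.

```python
# Paths named in the text for ALU-type and Shift-type instructions.
PATHS = {
    "alu": ["IF", "ID", "RF", "EX", "ALU", "MEM/WB"],
    "shift": ["IF", "ID", "RF", "EX", "SR", "MEM/WB"],
}


def path_latency(path, link_latency):
    """Sum of link latencies between consecutive blocks (assumed model)."""
    return sum(link_latency[(a, b)] for a, b in zip(path, path[1:]))


def objective(paths, weights, link_latency):
    """Weighted sum of path latencies; the weights play the role of the mix."""
    return sum(weights[name] * path_latency(path, link_latency)
               for name, path in paths.items())


if __name__ == "__main__":
    # Hypothetical interconnect latencies in clock cycles.
    links = {
        ("IF", "ID"): 1, ("ID", "RF"): 1, ("RF", "EX"): 1,
        ("EX", "ALU"): 2, ("ALU", "MEM/WB"): 1,
        ("EX", "SR"): 3, ("SR", "MEM/WB"): 2,
    }
    mix_weights = {"alu": 0.57, "shift": 0.15}   # example mix from the text
    equal_weights = {"alu": 1.0, "shift": 1.0}
    print(objective(PATHS, mix_weights, links))
    print(objective(PATHS, equal_weights, links))
```

With mix weighting, a floorplan change that shortens the frequently used ALU path lowers the cost more than the same change on the Shift path, which is the trade-off the heuristic is meant to capture.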
3.4 Latency Configurable Instruction Set Simulator
We must examine the performance of the floorplans produced by our methodology in order to verify its validity. This can be done by cycle-accurate simulation. However, existing simulators do not consider the latencies of interconnects. Furthermore, our target design implements a reduced subset of the MIPS64 instruction set, for which no existing simulator is available. Therefore we must implement an instruction set simulator for this subset. Besides functional simulation, the simulator must also be able to account for the latencies of interconnects, and these latencies should be adjustable. We call this simulator the Latency Configurable Instruction Set Simulator.
The simulator is written in C++ using the SystemC library. SystemC is used because it provides convenient data types such as fixed-width signed/unsigned integers and bit vectors, together with corresponding operations such as bit range selection and bit-wise logical operations. This simplifies the implementation of the instruction set simulator. The simulator accepts assembly code in hexadecimal format. Instructions are simulated one at a time, without parallelism. For each instruction, the required register values are obtained first. The result of the instruction and the target PC (program counter) are then calculated, and the result is written back to memory if needed. After an instruction completes, the simulator reads the next instruction according to the target PC calculated for the previous instruction, and simulation continues until the HALT instruction is fetched. During simulation, the number of instructions executed and the clock cycles needed, including the additional latencies introduced by interconnects, are tracked and recorded. Data dependencies, the corresponding forwarding, and the additional latencies introduced by pipeline stalls are also taken into account when calculating the cycle count. After simulation, the total instruction count and total cycle count of the executed program are displayed, and the CPI is calculated by dividing the total cycle count by the total instruction count. The instruction mix can also be analyzed and displayed.
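The cycle accounting described above can be illustrated with a much simplified sketch. It is written in Python rather than the C++/SystemC of the actual simulator, and it omits forwarding, data dependencies, and pipeline stalls. The function name `simulate_cpi`, the base cost of one cycle per instruction, and the latency values are all hypothetical.

```python
def simulate_cpi(trace, extra_latency, base_cycles=1):
    """Return (instruction count, cycle count, CPI) for a class-level trace.

    trace         -- sequence of instruction classes, in execution order
    extra_latency -- configurable interconnect latency (cycles) per class
    base_cycles   -- assumed cost of an instruction with no wire latency
    """
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    total_cycles = 0
    for op_class in trace:
        # Each instruction pays its base cost plus the interconnect latency
        # configured for the path it takes.
        total_cycles += base_cycles + extra_latency.get(op_class, 0)
    instruction_count = len(trace)
    return instruction_count, total_cycles, total_cycles / instruction_count


if __name__ == "__main__":
    latencies = {"alu": 4, "shift": 6, "load_store": 5}   # hypothetical
    print(simulate_cpi(["alu", "alu", "shift", "load_store"], latencies))
```

Because the latencies are passed in as a table, the same trace can be replayed against the latencies of different floorplans, which is the sense in which the simulator is latency configurable.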
Chapter 4
Experimental Results
We present the details of the experiment in this chapter. The experimental setup, including the target microarchitecture, the floorplanner, the new objective functions, and the experimental benchmarks, is introduced first. The results are then presented.
4.1 Experimental Setup
4.1.1 Target Microarchitecture
The target microarchitecture we use for implementing our methodology is a MIPS64[11]-like processor. This processor implements a subset of the MIPS64 instruction set, which is shown in Table 4.1.
The processor has a 5-stage pipeline, as shown in Appendix A.1 of [12]. It also supports forwarding; the forwarding architecture is as shown in Appendix A.4 of [12]. Its general purpose registers and data memory are 64 bits wide, and its instructions are 32 bits wide. For branch operations, the MIPS64-like processor calculates the branch condition and target address in the ID (instruction decode) stage, the second stage of the pipeline.
Table 4.1: Instruction set of the target design. Note that this is a reduced subset of the MIPS64 instruction set.
Type                                   Instruction
Data Transfer                          LW, LWU, SW, LH, LHU, SH, LD, SD, LB, LBU, SB
Arithmetic / logic / bit manipulation  DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU,
                                       AND, ANDI, OR, ORI, XOR, XORI, LUI,
                                       DSLL, DSRL, DSRA, DSRAV, DSLLV, DSRLV, DROTR, DROTRV,
                                       DSBH, DSHD, DEXT, DINS, SEB, SEH,
                                       SLT, SLTI, SLTU, SLTIU
Control                                BEQ, BNE, MOVN, MOVZ, J, JR, JAL, JALR
4.1.2 Floorplanner and New Objective Functions
For implementing our methodology, we need a floorplanner that supports our proposed objective function. Here we use the floorplanner from [13], which is based on simulated annealing. It represents floorplans as normalized Polish expressions, which enables an effective neighborhood search and hence speeds up the search procedure significantly. It minimizes area and total interconnection length simultaneously, so that interconnection information as well as area and shape information is considered at the same time. Its original objective function consists of area and wire length, with an adjustable weighting factor for wire length.
In this thesis we consider two new objective functions for verifying our methodology. The first objective function consists of latency factors, which are the ingredients of the paths that instructions pass through, as described in Section 3.1, instead of total wire length. In other words, the latencies of the interconnects that instructions may pass through are used as the factors of this objective function. These factors are equally weighted. This objective function targets the latencies of the critical paths without considering how frequently each path is used. We wish to show that it is superior to the original objective function, since it considers the latencies of interconnects instead of wire length.
The second objective function has the same factors as the first one. However, instead of being equally weighted, the factors are weighted by the instruction mix, as discussed in Section 3.3. We expect this difference to further improve the result compared with the first objective function.
We also run the floorplanner with its original objective function, in two different configurations. In the first, area and wire length are equally weighted. The second targets wire length optimization, with a weighting ratio between area and wire length of 1:30. We wish to show that shorter wire length does not necessarily mean better performance. We can also compare the area and wire length results of the original and new objective functions to determine the overhead of the proposed methodology in area and wire length.
4.1.3 Experimental Benchmarks
To verify the performance, we need benchmarks. Here we use five benchmarks written in MIPS64 assembly: DCT (discrete cosine transformation), iDCT (inverse discrete cosine transformation), FIR (finite impulse response), Bubble sort, and Hailstone. Hailstone accepts a number and does the following: if the number is odd, multiply it by 3 and add 1; if the number is even, divide it by 2; this iteration is repeated until the number is 1.
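The Hailstone iteration described above can be sketched as follows. The benchmark itself is written in MIPS64 assembly; this Python version is for illustration only, and returning the number of steps is an assumption about what the program reports.

```python
def hailstone_steps(n):
    """Count the iterations needed for the Hailstone sequence to reach 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        if n % 2:          # odd: multiply by 3 and add 1
            n = 3 * n + 1
        else:              # even: divide by 2
            n //= 2
        steps += 1
    return steps


if __name__ == "__main__":
    # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 takes 8 steps.
    print(hailstone_steps(6))
```

The loop mixes branches, shifts or divisions, and additions in a data-dependent pattern, which is presumably why it is useful alongside the regular DCT and FIR kernels.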
Table 4.2: CPI results. Note that reducing wire length actually degrades performance slightly. Comparing the results of the original and new objective functions shows that our methodology indeed improves performance.
Objective function                DCT   iDCT  FIR   Bubble Sort  Hailstone  Average CPI
Original (1:1)                    8.89  6.25  7.69  7.08         6.09       7.20
Original (1:30)                   9.01  6.25  7.75  7.28         6.18       7.29
All factors are equally weighted  7.07  5.25  6.25  6.09         5.28       5.99
Weighted by instruction mix       6.26  4.75  5.63  5.39         4.83       5.37
These five benchmarks are first profiled to obtain their average instruction mix. After floorplanning is completed, the resulting latencies are fed into our Latency Configurable Instruction Set Simulator to obtain the CPI of each individual benchmark. The average CPI is then calculated accordingly.
4.2 Results
Table 4.2 shows the CPI of each individual benchmark and the average CPI for the four objective functions.
The upper two rows of Table 4.2 are the results of the original objective function of the floorplanner from [13]. The difference between these two rows is the weighting of the factors area and wire length: the ratio is 1:1 in the first row and 1:30 in the second row. The objective function of the second row puts more weight on wire length, which means it aims to reduce wire length. The third and fourth rows present the results of the new objective functions. The third row shows the results of the objective function that consists of interconnect latencies with equal weighting. Finally, in the last row, the factors are weighted by the average instruction mix.
We compare the average CPI obtained from the different objective functions in Table 4.3. The values in Table 4.3 show the improvement in performance. The relative performance of two configurations, X and Y, is defined as in [6]:
\[
\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \tag{4.1}
\]
Also, we can use CPI to denote execution time, as discussed in Section 3.1. Therefore, we can rewrite Equation 4.1 as follows:
\[
\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{CPI}_Y}{\text{CPI}_X} = n \tag{4.2}
\]
The results in Table 4.3 are calculated according to Equation 4.2, where X is the configuration of the row and Y is the configuration of the column. One is then subtracted from the calculated n to obtain the net improvement, and the final value is expressed as a percentage.
Table 4.3: Comparison of average CPI between different objective functions. The improvement is up to 35.75% when comparing the original and new objective functions.
                                  Original (1:1)  Original (1:30)  All factors are equally weighted
Original (1:30)                   -1.23%
All factors are equally weighted  20.20%          21.70%
Weighted by instruction mix       34.08%          35.75%           11.55%
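The entries of Table 4.3 can be reproduced from the average CPI values in Table 4.2 with Equation 4.2. The following is a minimal sketch; the dictionary keys and the function name `improvement_percent` are chosen here for illustration.

```python
# Average CPI values taken from Table 4.2.
AVERAGE_CPI = {
    "Original (1:1)": 7.20,
    "Original (1:30)": 7.29,
    "All factors are equally weighted": 5.99,
    "Weighted by instruction mix": 5.37,
}


def improvement_percent(cpi_x, cpi_y):
    """Net improvement of X (row) over Y (column): (CPI_Y / CPI_X - 1) * 100."""
    return (cpi_y / cpi_x - 1.0) * 100.0


if __name__ == "__main__":
    names = list(AVERAGE_CPI)
    # Print the lower triangle: each row X against every earlier column Y.
    for i, row in enumerate(names):
        for col in names[:i]:
            value = improvement_percent(AVERAGE_CPI[row], AVERAGE_CPI[col])
            print(f"{row} vs {col}: {value:.2f}%")
```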
From the CPI results we find that the objective function oriented toward wire length reduction actually degrades performance, contrary to the intuition that reducing wire length improves performance. The new objective functions improve performance by up to 35.75% compared with the original objective functions. Comparing the equally weighted factors with the factors weighted by instruction mix, the latter shows a further improvement of 11.55%, which supports the validity of our methodology.
Despite the improvement in performance, there may be overheads, mainly in wire length and area. To calculate the overheads, we first need the actual values. Table 4.4 shows the wire length and area obtained from the four objective functions.
Table 4.4: Results of wire length and total area. The results of the two new objective functions are clearly larger than those of the original objective function.
Objective function                Wire Length  Total Area  Average CPI
Original (1:1)                    51375.45     673682.56   7.20
Original (1:30)                   49783.50     706621.25   7.29
All factors are equally weighted  55156.06     708309.44   5.99
Weighted by instruction mix       53932.86     741857.50   5.37
From Table 4.4 we find that the area and wire length obtained with the two new objective functions are larger than those obtained with the original objective function, which means that there are indeed overheads. Tables 4.5 and 4.6 show the overheads in wire length and area of the new objective functions relative to the original objective functions.
Table 4.5: Comparison of wire length and total area between the two new objective functions and the original objective function with equal weighting of area and wire length (1:1).
                                  Wire length  Area
All factors are equally weighted  7.36%        5.14%
Weighted by instruction mix       4.98%        10.12%
Table 4.6: Comparison of wire length and total area between the two new objective functions and the original objective function for wire length reduction (1:30).
                                  Wire length  Area
All factors are equally weighted  10.79%       0.24%
Weighted by instruction mix       8.33%        4.99%
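The overheads in Tables 4.5 and 4.6 follow from the absolute values in Table 4.4 as relative increases over the chosen baseline. The following is a minimal sketch; the data layout and the function name `overhead_percent` are chosen here for illustration.

```python
# (wire length, total area) taken from Table 4.4.
RESULTS = {
    "Original (1:1)": (51375.45, 673682.56),
    "Original (1:30)": (49783.50, 706621.25),
    "All factors are equally weighted": (55156.06, 708309.44),
    "Weighted by instruction mix": (53932.86, 741857.50),
}


def overhead_percent(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


def overheads(name, baseline_name):
    """Return (wire length overhead, area overhead) in percent."""
    wire, area = RESULTS[name]
    base_wire, base_area = RESULTS[baseline_name]
    return overhead_percent(wire, base_wire), overhead_percent(area, base_area)


if __name__ == "__main__":
    for baseline in ("Original (1:1)", "Original (1:30)"):
        for name in ("All factors are equally weighted",
                     "Weighted by instruction mix"):
            wire, area = overheads(name, baseline)
            print(f"{name} vs {baseline}: wire {wire:.2f}%, area {area:.2f}%")
```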
4.3 Discussion
From Tables 4.5 and 4.6, the wire length overheads range from 4.98% to 10.79% and the area overheads range from 0.24% to 10.12%. For the objective function weighted by instruction mix, the worst-case overhead is 8.33% in wire length and 10.12% in area. This may seem to be a large overhead. However, it comes with a performance improvement of over 30%. It is the designer's choice to determine the trade-off between performance and wire length/area.
Also note that the CPI results are significantly larger than the ideal value of one. The cause is the lack of wire pipelining. Pipelining can reduce the CPI toward the ideal case, which is determined by the longest delay among all pipeline stages. If each stage takes one clock cycle, the ideal pipelined CPI is one, since the longest stage delay is one cycle. Without wire pipelining, the additional latencies introduced by the interconnects in each pipeline stage are included in the delay of that stage, and the CPI increases accordingly.
Chapter 5
Conclusion and Future Works
This thesis proposed a methodology, based on a heuristic, that relates performance at the microarchitecture level to floorplanning, thereby achieving microarchitecture-aware floorplanning for processor performance optimization.
In the past, floorplanners used objective functions focused on reducing wire length and area. These objective functions were considered sufficient, since the latencies of interconnects were within a single clock cycle or could even be neglected. However, as feature size continues to shrink, the communication of signals on interconnects becomes multi-cycle. These latencies can no longer be ignored, and they have an impact on performance. Floorplanners, however, do not consider the performance aspect. Hence we propose the methodology described in this thesis in order to consider performance during floorplanning. The experimental results support the validity of our methodology, since it indeed enhances performance. The results also emphasize the importance of considering performance from the microarchitectural viewpoint.
As for future work, since the experiment is based on a MIPS64-like processor with a reduced subset of the instruction set, we wish to apply the proposed methodology to a fully functional MIPS64 processor. This would also make it easier to compare the results of our methodology with previous work. Our methodology also lacks the ability to consider wire pipelining, which may improve performance further. Future work may therefore also include wire pipelining/flip-flop insertion for the sake of performance.
Bibliography
[1] Jason Cong, Ashok Jagannathan, Glenn Reinman, and Michail Romesis. “Microarchitecture Evaluation with Physical Planning”. In Proceedings IEEE/ACM Design Automation Conference, pages 32–35, June 2003.
[2] Mongkol Ekpanyapong, Jacob R. Minz, Thaisiri Watewai, Hsien-Hsin S. Lee, and Sung Kyu Lim. “Profile-Guided Microarchitectural Floorplanning for Deep Submicron Processor Design”. In Proceedings IEEE/ACM Design Automation Conference, pages 634–639, June 2004.
[3] Changbo Long, Lucanus J. Simonson, Weiping Liao, and Lei He. “Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects”. In Proceedings IEEE/ACM Design Automation Conference, pages 640–645, June 2004.
[4] Vidyasagar Nookala, Ying Chen, David J. Lilja, and Sachin S. Sapatnekar. “Microarchitecture-Aware Floorplanning Using a Statistical Design of Experiments Approach”. In Proceedings IEEE/ACM Design Automation Conference, pages 579–584, June 2005.
[5] P. N. Glaskowsky. “Pentium 4 (Partially) Previewed”. Microprocessor Report, 14(8):11–13, August 2000.
[6] David A. Patterson and John L. Hennessy. “Computer Organization and Design: The Hardware/Software Interface”. Morgan Kaufmann Publishers, Inc., second edition, 1998.
[7] D. C. Burger and T. M. Austin. “The SimpleScalar tool set version 2.0”. Technical Report CS-TR-97-1342, The University of Wisconsin, Madison, June 1997.
[8] T. M. Austin. “Simplescalar tool suite”. http://www.simplescalar.com.
[9] A. J. KleinOsowski and David J. Lilja. “MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research”. IEEE Computer Architecture Letters, vol. 1, June 2002.
[10] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. “The Future of Wires”. In Proceedings of the IEEE, 2001.
[11] http://www.mips.com/content/Products/Architecture/MIPS64/.
[12] John L. Hennessy and David A. Patterson. “Computer Architecture - A Quantitative Approach”. Morgan Kaufmann Publishers, Inc., third edition, 2003.
[13] D. F. Wong and C. L. Liu. “A New Algorithm for Floorplan Design”. In Proceedings IEEE/ACM Design Automation Conference, pages 101–107, 1986.