Proposed Methodology - Microarchitecture-Aware Floorplanning for Processor Performance Optimization

We have established the relationship between performance and floorplanning above. Floorplanning is usually an iterative process. For example, a simulated annealing floorplanner iterates as the temperature changes, moving uphill or downhill in cost until a termination condition is met. Each move corresponds to a change in the floorplan.

The cost of the changed floorplan is evaluated by the objective function, and the cost decides whether the move is accepted or rejected. The objective function therefore plays an important role in floorplanning, since it decides whether a floorplan is accepted. However, the objective functions commonly used so far are based on area, total wire length, and aspect ratio. The question is whether these objective functions can guide the floorplan toward better performance, and whether they can guide it to make the kind of changes discussed in Section 3.1.
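The accept-or-reject decision described above can be sketched as follows. The text does not state which acceptance rule the floorplanner uses; the Metropolis-style rule below is the usual choice in simulated annealing and is an assumption here, as are the function and parameter names.

```python
import math
import random


def accept_move(old_cost, new_cost, temperature, rng=random):
    """Metropolis-style acceptance test (assumed rule, not taken from the text)."""
    if new_cost <= old_cost:
        # A floorplan that is no worse than the current one is always kept.
        return True
    # A worse floorplan is kept with a probability that shrinks as the
    # cost increase grows or as the temperature falls.
    return rng.random() < math.exp(-(new_cost - old_cost) / temperature)
```

With this rule, the objective function alone determines which floorplans survive, which is why changing it changes the direction of the search.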

To resolve these issues, we propose a methodology that guides the floorplanning process from the viewpoint of performance. We believe that considering only total wire length and area is not enough for performance. The main concept of our methodology is that different kinds of instructions pass through different functional blocks in the processor. The collection of functional blocks that an instruction passes through forms a path. For example, an ALU-type instruction passes through a path consisting of the functional blocks IF, ID, RF, EX, ALU, and MEM/WB, and a Shift-type instruction goes through IF, ID, RF, EX, SR, and MEM/WB. We guide the floorplanning process by these paths instead of by the traditional objective function: our objective function consists of the "ingredients" of these paths. The ingredients are in fact the latencies of the interconnects along the paths. We model the delay of each path as a function of its interconnects, as shown in Equation 1.1 and Equation 1.2 and calculated in Section 3.1. These functions are then used as the objective function to guide the floorplanning process.

To model the latency of an interconnect, we use the Manhattan distance between the two functional blocks that it connects. Once the Manhattan distance is obtained, the corresponding latency can be calculated. We use the unit interconnect delay from [10], which is 55 ps/mm. The delay of an interconnect is its length multiplied by the unit delay, and its latency is this delay divided by the clock cycle time.
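The latency calculation can be sketched as below. Only the 55 ps/mm unit delay comes from the text. Measuring the distance between block centers, the coordinate values, and the 100 ps clock period are assumptions made for illustration, and the text does not say whether the result is rounded up to whole cycles, so the sketch returns the plain quotient.

```python
UNIT_DELAY_PS_PER_MM = 55.0  # unit interconnect delay quoted in the text from [10]


def manhattan_distance_mm(center_a, center_b):
    """Manhattan distance between two block centers given as (x, y) in mm."""
    (xa, ya), (xb, yb) = center_a, center_b
    return abs(xa - xb) + abs(ya - yb)


def interconnect_latency_cycles(center_a, center_b, clock_period_ps):
    """Wire delay divided by the clock period (fractional, no rounding applied)."""
    wire_delay_ps = manhattan_distance_mm(center_a, center_b) * UNIT_DELAY_PS_PER_MM
    return wire_delay_ps / clock_period_ps


# Hypothetical block centers 4 mm apart and a hypothetical 100 ps clock period.
latency = interconnect_latency_cycles((0.0, 0.0), (3.0, 1.0), clock_period_ps=100.0)
```

In this hypothetical case the wire is 4 mm long, so its delay is 220 ps, or 2.2 clock cycles.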

Since there are many kinds of instructions, there are correspondingly many paths in a processor. However, there is no need to optimize the floorplan for every kind of instruction; in fact, it is impossible to optimize for all of them at once. We need a trade-off among these instructions.

Here we use the instruction mix for weighting. As defined in [6], the instruction mix is a measure of the dynamic frequency of instructions across one or many programs. For example, if a program consists of 14% Load/Store instructions, 57% ALU-type instructions, 15% Shift-ALU-type instructions, and 12% branch instructions, these percentages are the instruction mix of the program. We first run a simulator on benchmarks to profile the instruction mix. The details of the simulator are given in the next section. Once the instruction mix is obtained, it is used to weight the different paths.
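A minimal sketch of an instruction-mix-weighted objective follows. The two paths and the 57% and 15% weights are taken from the examples in the text. The per-interconnect latencies are invented, and summing the latencies hop by hop is a stand-in for the path delay functions of Equations 1.1 and 1.2, which are not reproduced in this section.

```python
def path_latency(path, latency):
    """Sum the latencies of the interconnects between consecutive blocks."""
    return sum(latency[(a, b)] for a, b in zip(path, path[1:]))


def weighted_cost(paths, mix, latency):
    """Weight each path's latency by the share of instructions that use it."""
    return sum(mix[name] * path_latency(blocks, latency)
               for name, blocks in paths.items())


# Paths named in the text for ALU-type and Shift-type instructions.
PATHS = {
    "alu": ["IF", "ID", "RF", "EX", "ALU", "MEM/WB"],
    "shift": ["IF", "ID", "RF", "EX", "SR", "MEM/WB"],
}
# Weights from the example instruction mix in the text.
MIX = {"alu": 0.57, "shift": 0.15}
# Hypothetical interconnect latencies in clock cycles.
LATENCY = {
    ("IF", "ID"): 1, ("ID", "RF"): 0, ("RF", "EX"): 1,
    ("EX", "ALU"): 0, ("ALU", "MEM/WB"): 2,
    ("EX", "SR"): 1, ("SR", "MEM/WB"): 2,
}

cost = weighted_cost(PATHS, MIX, LATENCY)
```

Setting every weight to 1 gives the equally weighted variant, so the same function can express both forms of the objective.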

By using this heuristic, the relationship between a floorplan and its performance is established. The floorplan influences the clock cycle count of instructions, and the clock cycle count of instructions influences performance; therefore the floorplan influences performance. Different paths in the processor are weighted according to their importance, which achieves a good trade-off. We believe this heuristic relates physical information to microarchitectural issues well.

The flow chart of the proposed methodology is shown in Figure 3.3.

Figure 3.3: The flow chart of the proposed methodology. The objective function is obtained by characterizing the given microarchitecture. The weights of the factors in the objective function are obtained from the instruction mix. After floorplanning, the resulting performance is verified by the Latency Configurable Instruction Set Simulator.

3.4 Latency Configurable Instruction Set Simulator

To validate our methodology, we must examine the performance of the floorplans it produces. This can be done by cycle-accurate simulation. However, existing simulators do not consider the latencies of interconnects. Furthermore, our target design implements a reduced subset of the MIPS64 instruction set, for which no existing simulator is available. We must therefore implement an instruction set simulator for this reduced subset. Besides functional simulation, the simulator must be able to take the latencies of interconnects into account, and these latencies must be adjustable. We call this simulator the Latency Configurable Instruction Set Simulator.

The simulator is written in C++ using the SystemC library. SystemC is used because it provides convenient data types, such as fixed-width signed/unsigned integers and bit vectors, and corresponding functions, such as bit-range selection and bit-wise logical operations. This simplifies the implementation of the instruction set simulator. The simulator accepts assembly code in hexadecimal format. Instructions are simulated without parallelism. For each instruction, the needed register values are obtained first. The result of the instruction and the target PC (program counter) are then calculated. The result is written back to memory if needed. After an instruction completes, the simulator reads the next instruction according to the target PC calculated during simulation of the previous instruction, and simulation continues until the HALT instruction is fetched. During simulation, the number of instructions executed and the clock cycles needed, including the additional latencies introduced by interconnects, are tracked and recorded. Data dependencies, the corresponding forwarding, and the additional latencies introduced by pipeline stalls are also taken into account when calculating the cycle count. After simulation, the total instruction count and total cycle count of the executed program are displayed, and the CPI is calculated by dividing the total cycle count by the total instruction count. The instruction mix can also be analyzed and displayed.
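The bookkeeping described above (instruction count, cycle count, CPI, and instruction mix) can be illustrated with a much simplified sketch. It is written in Python for brevity, whereas the simulator itself is written in C++ with SystemC. It works on a list of instruction types instead of decoding hexadecimal MIPS64 code, and it ignores forwarding and stalls. The function name, the one-cycle base cost, and the extra latencies are all assumptions.

```python
from collections import Counter


def run_trace(trace, extra_latency, base_cycles=1):
    """Accumulate cycles over a trace of instruction types; return (cycles, CPI, mix)."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    counts = Counter()
    total_cycles = 0
    for kind in trace:
        counts[kind] += 1
        # Assumed cost model: a fixed base cost plus the configurable
        # interconnect latency of the path this instruction type uses.
        total_cycles += base_cycles + extra_latency.get(kind, 0)
    cpi = total_cycles / len(trace)
    mix = {kind: count / len(trace) for kind, count in counts.items()}
    return total_cycles, cpi, mix


# Hypothetical four-instruction trace and per-type interconnect latencies.
cycles, cpi, mix = run_trace(
    ["alu", "alu", "load", "branch"],
    extra_latency={"alu": 2, "load": 4, "branch": 1},
)
```

Because the latencies are passed in as a table, the same trace can be replayed against the latencies of any floorplan, which is the sense in which the simulator is latency configurable.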

Chapter 4 Experimental Results

We present the details of the experiment in this chapter. The experimental setup, including the target microarchitecture, the floorplanner, the new objective functions, and the experimental benchmarks, is introduced first. The results are then presented.

4.1 Experimental Setup

4.1.1 Target Microarchitecture

The target microarchitecture used to implement our methodology is a MIPS64[11]-like processor. This processor implements a subset of the MIPS64 instruction set, which is shown in Table 4.1.

The processor has a 5-stage pipeline, as shown in Appendix A.1 of [12]. It also supports forwarding; the forwarding architecture is as shown in Appendix A.4 of [12]. The general-purpose registers and the data memory are 64 bits wide, and instructions are 32 bits wide. For branch operations, the MIPS64-like processor calculates the branch condition and target address in the ID (instruction decode) stage, the second stage of the pipeline.

Table 4.1: Instruction set of the target design. Note that this is a reduced subset of the MIPS64 instruction set.

Type                                    Instructions
Data transfer                           LW, LWU, SW, LH, LHU, SH, LD, SD, LB, LBU, SB
Arithmetic / logic / bit manipulation   DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU,
                                        AND, ANDI, OR, ORI, XOR, XORI, LUI,
                                        DSLL, DSRL, DSRA, DSRAV, DSLLV, DSRLV, DROTR, DROTRV,
                                        DSBH, DSHD, DEXT, DINS, SEB, SEH,
                                        SLT, SLTI, SLTU, SLTIU
Control                                 BEQ, BNE, MOVN, MOVZ, J, JR, JAL, JALR

4.1.2 Floorplanner and New Objective Functions

To implement our methodology, we need a floorplanner that uses our proposed objective function. Here we use the floorplanner from [13], which is based on simulated annealing. It represents floorplans as normalized Polish expressions, which makes the neighborhood search effective and hence speeds up the search procedure significantly. It minimizes area and total interconnection length simultaneously, so that interconnection information is considered together with area and shape information. Its original objective function consists of area and wire length, with an adjustable weighting factor for wire length.
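To make the representation concrete, the sketch below evaluates a Polish expression for a slicing floorplan into its bounding box. It assumes blocks of fixed size and orientation, which is a simplification of the floorplanner in [13]. The operator names `H` (horizontal cut, blocks stacked vertically) and `V` (vertical cut, blocks placed side by side), the block names, and the block sizes are chosen here for illustration.

```python
OPERATORS = ("H", "V")


def bounding_box(expression, sizes):
    """Evaluate a postfix (Polish) slicing expression to (width, height)."""
    stack = []
    for token in expression:
        if token in OPERATORS:
            right_w, right_h = stack.pop()
            left_w, left_h = stack.pop()
            if token == "H":
                # Horizontal cut: one sub-floorplan sits on top of the other.
                stack.append((max(left_w, right_w), left_h + right_h))
            else:
                # Vertical cut: the two sub-floorplans sit side by side.
                stack.append((left_w + right_w, max(left_h, right_h)))
        else:
            stack.append(sizes[token])
    if len(stack) != 1:
        raise ValueError("not a valid Polish expression")
    return stack[0]


def is_normalized(expression):
    """A normalized expression has no two identical operators in a row."""
    return all(not (a == b and a in OPERATORS)
               for a, b in zip(expression, expression[1:]))


# Hypothetical block dimensions (width, height).
SIZES = {"A": (2, 3), "B": (4, 1), "C": (1, 2)}
width, height = bounding_box(["A", "B", "V", "C", "H"], SIZES)
```

Here A and B are placed side by side and C is stacked with the pair, giving a 6 by 5 bounding box. The area term of an objective function would be the product of the two dimensions.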

In this thesis we consider two new objective functions to verify our methodology. The first objective function replaces total wire length with latency factors, which are the ingredients of the paths that instructions pass through, as described in Section 3.1. In other words, the latencies of the interconnects that instructions may pass through are the factors of this objective function. These factors are equally weighted. This objective function targets the latencies of the critical paths without considering how frequently each path is used. We wish to show that it is superior to the original objective functions, since it considers the latencies of interconnects instead of wire length.

The second objective function has the same factors as the first. However, unlike the first objective function, in which all factors are equally weighted, the factors are weighted by the instruction mix, as discussed in Section 3.3. We expect this difference to further improve the result compared with the first objective function.

We also run the floorplanner with its original objective function, in two different configurations. In the first, area and wire length are equally weighted. The second targets wire length optimization, with a weighting ratio between area and wire length of 1:30. We wish to show that shorter wire length does not necessarily mean better performance. We can also compare the area and wire length results of the original and new objective functions to find the overhead of the proposed methodology in area and wire length.

4.1.3 Experimental Benchmarks

To verify the performance, we need benchmarks. Here we use five benchmarks written in MIPS64 assembly: DCT (discrete cosine transform), iDCT (inverse discrete cosine transform), FIR (finite impulse response), Bubble sort, and Hailstone. Hailstone accepts a number and does the following: if the number is odd, it is multiplied by 3 and 1 is added; if the number is even, it is divided by 2. This iteration is repeated until the number is 1.
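The Hailstone iteration is short enough to state in full. The benchmark itself is written in MIPS64 assembly; the Python version below only illustrates the computation, and the function name and the choice to return the step count are assumptions.

```python
def hailstone_steps(number):
    """Count the iterations needed for the Hailstone sequence to reach 1."""
    if number < 1:
        raise ValueError("number must be a positive integer")
    steps = 0
    while number != 1:
        if number % 2:
            number = 3 * number + 1  # odd: multiply by 3 and add 1
        else:
            number //= 2             # even: divide by 2
        steps += 1
    return steps
```

The loop body is one data-dependent branch plus a handful of arithmetic operations, so the cycle count of this benchmark depends on the input value.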

Table 4.2: Results of CPI. Note that the objective function aimed at reducing wire length actually degrades performance slightly. Comparing the results of the original and new objective functions shows that our methodology indeed improves performance.

Objective function                 DCT    iDCT   FIR    Bubble Sort   Hailstone   Average CPI
Original (1:1)                     8.89   6.25   7.69   7.08          6.09        7.20
Original (1:30)                    9.01   6.25   7.75   7.28          6.18        7.29
All factors are equally weighted   7.07   5.25   6.25   6.09          5.28        5.99
Weighted by instruction mix        6.26   4.75   5.63   5.39          4.83        5.37

These five benchmarks are first profiled to obtain their average instruction mix. After floorplanning is completed, the latencies are fed into our Latency Configurable Instruction Set Simulator to obtain the CPI of each individual benchmark. The average CPI is then calculated from these values.

4.2 Results

Table 4.2 shows the CPI of each individual benchmark and the average CPI for the four objective functions.

The upper two rows in Table 4.2 are the results of the original objective function of the floorplanner from [13]. The difference between these two rows is the weighting of the area and wire length factors: the ratio is 1:1 in the first row and 1:30 in the second. The objective function of the second row puts more weight on wire length, which means that it aims to reduce wire length. The third and fourth rows present the results of the new objective functions. The third row shows the results of the objective function that consists of interconnect latencies with equal weighting. In the last row, the factors are weighted by the average instruction mix.

Table 4.3 compares the average CPI obtained from the different objective functions. The values in Table 4.3 show the improvement in performance. The relative performance of two configurations, X and Y, can be defined as described in [6]:

$$\frac{\mathrm{Performance}_X}{\mathrm{Performance}_Y} = \frac{\mathrm{Execution\ time}_Y}{\mathrm{Execution\ time}_X} = n \tag{4.1}$$

Also, we can use CPI to denote execution time, as discussed in Section 3.1. Therefore, we can rewrite Equation 4.1 as follows:

$$\frac{\mathrm{Performance}_X}{\mathrm{Performance}_Y} = \frac{\mathrm{Execution\ time}_Y}{\mathrm{Execution\ time}_X} = \frac{\mathrm{CPI}_Y}{\mathrm{CPI}_X} = n \tag{4.2}$$

The results in Table 4.3 are calculated according to Equation 4.2. For each entry, X in Equation 4.2 is the configuration of the row and Y is the configuration of the column. One is then subtracted from the calculated n to give the net improvement, and the final value is presented as a percentage.

Table 4.3: Comparison between average CPI from different objective functions. The improvement is up to 35.75% comparing original and new objective functions.

                                   Original (1:1)   Original (1:30)   All factors are equally weighted
Original (1:30)                    -1.23%
All factors are equally weighted   20.20%           21.70%
Weighted by instruction mix        34.08%           35.75%            11.55%
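The entries of Table 4.3 can be recomputed from the average CPI values in Table 4.2 with Equation 4.2. The sketch below does so; the function name and the dictionary keys are chosen here for illustration.

```python
def improvement_percent(cpi_x, cpi_y):
    """Net improvement of configuration X over Y: (CPI_Y / CPI_X - 1) in percent."""
    return (cpi_y / cpi_x - 1.0) * 100.0


# Average CPI values from Table 4.2.
AVERAGE_CPI = {
    "original_1_1": 7.20,
    "original_1_30": 7.29,
    "equal_weight": 5.99,
    "instruction_mix": 5.37,
}

# Row "Weighted by instruction mix", column "Original (1:30)".
best = improvement_percent(AVERAGE_CPI["instruction_mix"],
                           AVERAGE_CPI["original_1_30"])
```

Rounded to two decimals, `best` is the 35.75% figure quoted as the largest improvement.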

From the CPI results we find that the objective function oriented toward wire length reduction actually degrades performance, contrary to the intuition that reducing wire length improves performance. The new objective functions improve performance by up to 35.75% compared with the original objective functions. Comparing equally weighted factors with factors weighted by instruction mix, the latter shows an improvement of 11.55%, which supports the validity of our methodology.

Despite the improvement in performance, there may be overheads, mainly in wire length and area. To calculate the overheads, we first need their actual values. Table 4.4 shows the wire length and area results of the four objective functions.

Table 4.4: Results of wire length and total area. The results of the two new objective functions are clearly larger than the results of the original objective function.

Objective function                 Wire Length   Total Area   Average CPI
Original (1:1)                     51375.45      673682.56    7.20
Original (1:30)                    49783.50      706621.25    7.29
All factors are equally weighted   55156.06      708309.44    5.99
Weighted by instruction mix        53932.86      741857.50    5.37

From Table 4.4 we find that the area and wire length produced by the two new objective functions are larger than those produced by the original objective function, which means that there are indeed overheads. The following two tables show the overheads in wire length and area of the new objective functions relative to the original objective functions.

Table 4.5: Comparison of wire length and total area between the two new objective functions and the original objective function with equal weights for area and wire length (1:1).

Objective function                 Wire length   Area
All factors are equally weighted   7.36%         5.14%
Weighted by instruction mix        4.98%         10.12%

Table 4.6: Comparison of wire length and total area between the two new objective functions and the original objective function for wire length reduction (1:30).

Objective function                 Wire length   Area
All factors are equally weighted   10.79%        0.24%
Weighted by instruction mix        8.33%         4.99%
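The percentages in Tables 4.5 and 4.6 are consistent with a relative increase over the corresponding original result, computed from the values in Table 4.4. The formula below is inferred from the numbers, not stated in the text, and the symbols are chosen here for illustration. The worked case is the wire length of the equally weighted objective function against Original (1:1):

```latex
\mathrm{Overhead} = \frac{V_{\mathrm{new}} - V_{\mathrm{original}}}{V_{\mathrm{original}}} \times 100\%,
\qquad
\frac{55156.06 - 51375.45}{51375.45} \times 100\% \approx 7.36\%
```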

4.3 Discussion

From Tables 4.5 and 4.6 we see that the overheads range from 4.98% to 10.79% in wire length and from 0.24% to 10.12% in area. For the objective function weighted by instruction mix, the worst-case overhead is 8.33% in wire length and 10.12% in area. This may seem to be a large overhead. However, the overheads come with a performance improvement of over 30%. It is the designer's choice to determine the trade-off between performance and wire length or area.

Also note that the CPI results are significantly larger than the ideal value of one. The cause is the lack of wire pipelining. Pipelining can reduce the CPI toward the ideal case, which is set by the longest delay among all pipeline stages: if each stage takes one clock cycle, the ideal pipelined CPI is one. Without wire pipelining, the additional latencies introduced by interconnects in each pipeline stage are counted in the delay of that stage, so the CPI increases as well.

Chapter 5 Conclusion and Future Works

This thesis proposed a methodology based on a heuristic that relates performance, in terms of microarchitecture, to floorplanning, thereby achieving microarchitecture-aware floorplanning for processor performance optimization.

In the past, floorplanners used objective functions focused on reducing wire length and area. These objective functions were considered sufficient because the latencies of interconnects were within a single clock cycle or could even be neglected. However, as feature sizes continue to shrink, signal communication on interconnects becomes multi-cycle. These latencies can no longer be ignored, because they affect performance, yet existing floorplanners do not consider performance. Hence we proposed the methodology described in this thesis in order to consider performance during floorplanning. The experimental results support the validity of our methodology, since it indeed enhances performance. The results also emphasize the importance of considering performance from the microarchitectural point of view.

As for future work, since the experiment is based on a MIPS64-like processor with a reduced subset of the instruction set, we wish to apply the proposed methodology to a fully functional MIPS64 processor. This would also make it easier to compare the results of our methodology with previous work. Our methodology also lacks the ability to consider wire pipelining, which may improve performance further. Future work may therefore also add wire pipelining or flip-flop insertion with performance in mind.

Bibliography

[1] Jason Cong, Ashok Jagannathan, Glenn Reinman, and Michail Romesis. “Microarchitecture Evaluation with Physical Planning”. In Proceedings IEEE/ACM Design Automation Conference, pages 32–35, June 2003.

[2] Mongkol Ekpanyapong, Jacob R. Minz, Thaisiri Watewai, Hsien-Hsin S. Lee, and Sung Kyu Lim. “Profile-Guided Microarchitectural Floorplanning for Deep Submicron Processor Design”. In Proceedings IEEE/ACM Design Automation Conference, pages 634–639, June 2004.

[3] Changbo Long, Lucanus J. Simonson, Weiping Liao, and Lei He. “Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects”. In Proceedings IEEE/ACM Design Automation Conference, pages 640–645, June 2004.

[4] Vidyasagar Nookala, Ying Chen, David J. Lilja, and Sachin S. Sapatnekar. “Microarchitecture-Aware Floorplanning Using a Statistical Design of Experiments Approach”. In Proceedings IEEE/ACM Design Automation Conference, pages 579–584, June 2005.

[5] P. N. Glaskowsky. “Pentium 4 (Partially) Previewed”. Microprocessor Report, 14(8):11–13, August 2000.

[6] David A. Patterson and John L. Hennessy. “Computer Organization and Design: The Hardware/Software Interface”. Morgan Kaufmann Publishers, Inc., second edition, 1998.

[7] D. C. Burger and T. M. Austin. “The SimpleScalar tool set version 2.0”. Technical Report CS-TR-97-1342, The University of Wisconsin, Madison, June 1997.

[8] T. M. Austin. “Simplescalar tool suite”. http://www.simplescalar.com.

[9] A. J. KleinOsowski and David J. Lilja. “MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research”. IEEE Computer Architecture Letters, vol. 1, June 2002.

[10] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. “The Future of Wires”. In Proceedings of the IEEE, 2001.

[11] http://www.mips.com/content/Products/Architecture/MIPS64/.

[12] John L. Hennessy and David A. Patterson. “Computer Architecture - A Quantitative Approach”. Morgan Kaufmann Publishers, Inc., third edition, 2003.

[13] D. F. Wong and C. L. Liu. “A New Algorithm for Floorplan Design”. In Proceedings IEEE/ACM Design Automation Conference, pages 101–107, 1986.
