Optimal Solution - ISE Exploration - Instruction Set Extension Considering Pipestage Timing

Chapter 3 ISE Exploration

3.4 Optimal Solution

The optimal solution can be identified in three steps. In step 1, all possible patterns in the DFG are enumerated and tested against the input/output and convex constraints; those that pass the test are listed as legal ISEs. In step 2, the exact implementation option evaluation of each ISE in the list is calculated. Suppose the DFG size is n, k legal ISEs are listed, and the maximum number of hardware implementation options is c; then the time complexities of step 1 and step 2 are O(2ⁿ)·O(n) and k·O(cⁿ)·O(n), respectively. Finally, in step 3, all combinations of legal ISEs are enumerated, which yields the best cycle reduction and the corresponding minimum area cost of the DFG. The complexity of this step is O(2^k).
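Step 1 can be illustrated with a minimal sketch on a four-operation DFG. Everything named here (`legal_patterns`, `SUCC`, `OPS`, `LIVE_OUT`, the toy graph itself) is hypothetical and not taken from the thesis. The sketch assumes that an input is a distinct value produced outside the pattern, that an output is a pattern node whose value is used outside the pattern or is live-out, and it omits the exclusion of load/store operations.

```python
from itertools import combinations

def reachable(succ, start):
    """Return every node reachable from `start` by following edges in `succ`."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen

def is_convex(pattern, succ):
    """A pattern is convex if no path leaves it and later re-enters it."""
    for u in pattern:
        for v in succ.get(u, ()):
            if v not in pattern and reachable(succ, v) & pattern:
                return False
    return True

def count_io(pattern, succ, pred, live_out):
    """Count distinct external inputs and the pattern nodes visible outside."""
    inputs = {p for u in pattern for p in pred.get(u, ()) if p not in pattern}
    outputs = {u for u in pattern
               if u in live_out or any(v not in pattern for v in succ.get(u, ()))}
    return len(inputs), len(outputs)

def legal_patterns(ops, succ, live_out, max_in, max_out):
    """Enumerate all 2^n - 1 non-empty subsets of `ops` and keep the legal ones."""
    pred = {}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, set()).add(u)
    legal = []
    for size in range(1, len(ops) + 1):
        for combo in combinations(ops, size):
            pattern = frozenset(combo)
            n_in, n_out = count_io(pattern, succ, pred, live_out)
            if n_in <= max_in and n_out <= max_out and is_convex(pattern, succ):
                legal.append(pattern)
    return legal

# Toy DFG (assumed): a = r1 + r2; b = a * r3; c = a >> 1; d = b + c
SUCC = {"r1": ["a"], "r2": ["a"], "r3": ["b"],
        "a": ["b", "c"], "b": ["d"], "c": ["d"]}
OPS = ["a", "b", "c", "d"]
LIVE_OUT = {"d"}

legal_2_1 = legal_patterns(OPS, SUCC, LIVE_OUT, max_in=2, max_out=1)
legal_4_2 = legal_patterns(OPS, SUCC, LIVE_OUT, max_in=4, max_out=2)
print(len(legal_2_1), len(legal_4_2))
```

On this toy graph the pattern {a, d} is rejected even under the 4/2 constraint, because the path a → b → d leaves the pattern and re-enters it. Relaxing the port constraint from 2/1 to 4/2 also admits more legal patterns, which is why the later steps become more expensive.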

Chapter 4 ISE Selection

Due to the constraints of silicon area and the original ISA format, the subset of ISE candidates that gives the best performance improvement under these constraints should be selected. This problem is formulated as a multi-constrained 0/1 knapsack problem as follows:

ISE selection: Suppose there are n ISE candidates, the area of the ith ISE is a_i, and the performance improvement of the ith ISE is w_i. The total area of the selected ISEs cannot exceed A, and the number of extended instructions cannot exceed E. The goal is to maximize

$\sum_{i=1}^{n} w_i x_i$ (4.1)

subject to

$\sum_{i=1}^{n} a_i x_i \le A$, $\sum_{i=1}^{n} x_i \le E$ and $x_i \in \{0,1\}$, $1 \le i \le n$ (4.2)

Notably, the constraints on ISE selection in this paper are silicon area and the original ISA format. However, if new constraints are added to ISE selection, only equation (4.2) needs to be changed, as follows:

$C_j: \sum_{i=1}^{n} ct_{i,j}\, x_i \le b_j$ and $x_i \in \{0,1\}$, $1 \le i \le n$ (4.3)

where C_j denotes the jth constraint, ct_{i,j} is the amount of resource j consumed by the ith ISE, and b_j is the given limit on that resource.
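The general form (4.3) can be checked on a small instance with an exhaustive solver. This is only an illustrative sketch: the function name `select_ises` and all numbers are made up, and the brute-force search over every 0/1 vector is practical only for a handful of candidates. Column 0 of `ct` plays the role of the area constraint and column 1 the ISE number constraint.

```python
from itertools import product

def select_ises(w, ct, b):
    """Maximise sum(w[i] * x[i]) subject to sum(ct[i][j] * x[i]) <= b[j] for all j."""
    n = len(w)
    best_value, best_x = 0, (0,) * n
    for x in product((0, 1), repeat=n):  # all 2^n selections
        feasible = all(
            sum(ct[i][j] * x[i] for i in range(n)) <= b[j]
            for j in range(len(b))
        )
        if feasible:
            value = sum(w[i] * x[i] for i in range(n))
            if value > best_value:
                best_value, best_x = value, x
    return best_value, best_x

# Hypothetical candidates: performance improvement w_i and resource use ct[i][j].
w = [10, 7, 6, 2]
ct = [
    [50, 1],  # ISE 1: area 50, occupies one instruction slot
    [30, 1],  # ISE 2
    [25, 1],  # ISE 3
    [10, 1],  # ISE 4
]
b = [60, 2]   # area limit A = 60, at most E = 2 extended instructions

best_value, best_x = select_ises(w, ct, b)
print(best_value, best_x)
```

Adding a further constraint only means appending one more column to `ct` and one more entry to `b`; the objective is untouched, which mirrors the remark that only equation (4.2) changes.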

Chapter 5 Experimental Results

5.1 Experimental setup

We use the Portable Instruction Set Architecture (PISA) [10], a MIPS-like ISA, and MiBench [11] with different register file input/output port constraints to evaluate the proposed algorithm. Each benchmark is compiled by gcc 2.7.2.3 for PISA with -O0 and -O3 optimizations. Due to limitations in the library and compiler, 6 benchmarks (mad, typeset, ghostscript, rsynth, sphinx and pgp) cannot be compiled successfully. For both algorithms, we evaluate 6 cases: 2/1, 4/2 and 6/3 register file read/write ports, each with -O0 and -O3 optimization.

Table 5.1.1: Hardware implementation option setting

| Operation | Delay (ns) | Area (µm²) |
|---|---|---|
| multu | 5.65 | 79778.1 |
| nor | 2.00 | 250.00 |

In this simulation, we assume that: (1) the CPU core is synthesized in 0.13 µm CMOS technology and runs at 100 MHz; (2) the CPU core area is 1.5 mm²; (3) the read/write ports of the register file are 2/1, 4/2 and 6/3, respectively; and (4) the execution time of every instruction in PISA is one cycle, i.e. 10 ns. Table 5.1.1 shows the hardware implementation option settings (delay and area) of instructions in PISA.

Note that Table 5.1.1 lists only the instructions that can be grouped into an ISE. These settings are either taken from [12] or obtained by synthesis with Synopsys Design Compiler using standard cells. Since increasing the number of register file read/write ports requires extra silicon area, we also synthesize register files with different numbers of read/write ports. The silicon areas of the CPU core with 4/2 and 6/3 register file read/write ports are 1574138.80 µm² and 1631359.54 µm², respectively.

Because the ISE exploration algorithm is heuristic, the exploration is repeated 5 times within each basic block, and the result with the minimal execution cycle count and the lowest extra area cost among the 5 iterations is selected.

To keep tuning as simple as possible, the parameters are held fixed and adjusted one at a time. For the sake of clarity, each parameter is designed to influence only one thing, although in practice the parameters affect each other. For example, the parameter β decides how fast the merit decays when one of the constraints is violated. Once a fitting magnitude of β has been found, other things have also changed as a side effect of altering β. We then adjust the other parameters one at a time, just as we did for β. After many rounds of adjustment, the change needed in each parameter at each round becomes smaller and smaller. Finally, we obtain a set of suitable parameters.

The parameters used in this paper and their meanings are as follows.

α: The weight of merit and pheromone in the probability value p. Increasing α yields a solution more slowly; decreasing α yields a solution more quickly, but usually a worse one.

β1: The tendency to choose a hardware implementation option in a node.

β2: The decay speed when the input/output constraint is violated.

β3: The decay speed when the convex constraint is violated.

In our experiments, the initial value of the software implementation option is 100, the initial value of the hardware implementation option is 200, and P_END is 99%. The parameter α used in the calculation of the probability value and the parameters β1, β2 and β3 used in the merit function are 0.25, 0.9, 0.9 and 0.5, respectively.

In the simulation, ISE selection is implemented as a greedy algorithm. The ISE number and silicon area constraints can be easily applied within the greedy algorithm.

The ISE selection algorithm first sorts the ISE candidates according to their cycle count reduction. The ISEs are then selected sequentially from this sorted list until the number of ISEs would exceed the ISE number constraint or the total silicon area would exceed the area constraint. For the sake of clarity, the simulation results only show the impact of the ISE number constraint. In the figures, the execution time reduction is the total cycle count saved by the selected ISEs divided by the total cycle count of the original application.
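The greedy selection and the reduction metric described above can be sketched as follows. The candidate records, field names and numbers are hypothetical. The sketch also assumes that selection stops at the first candidate that would violate a constraint, rather than skipping it and trying smaller ones, which is one reading of "selected sequentially ... until".

```python
def greedy_select(candidates, max_ises, max_area):
    """Pick ISEs in descending order of saved cycles until a constraint would be violated."""
    ordered = sorted(candidates, key=lambda c: c["saved_cycles"], reverse=True)
    chosen, area = [], 0.0
    for cand in ordered:
        if len(chosen) + 1 > max_ises or area + cand["area"] > max_area:
            break  # assumption: stop here instead of skipping to smaller candidates
        chosen.append(cand)
        area += cand["area"]
    return chosen, area

def execution_time_reduction(chosen, total_cycles):
    """Total saved cycles of the selected ISEs over the original cycle count."""
    return sum(c["saved_cycles"] for c in chosen) / total_cycles

# Hypothetical candidates; areas in µm², cycle counts from an assumed profile.
candidates = [
    {"name": "ise_c", "saved_cycles": 1000, "area": 40000.0},
    {"name": "ise_a", "saved_cycles": 4000, "area": 30000.0},
    {"name": "ise_d", "saved_cycles": 500, "area": 5000.0},
    {"name": "ise_b", "saved_cycles": 2500, "area": 20000.0},
]

# 75000 µm² is 5% of the 1.5 mm² core assumed in the setup.
chosen, used_area = greedy_select(candidates, max_ises=8, max_area=75000.0)
reduction = execution_time_reduction(chosen, total_cycles=100000)
print([c["name"] for c in chosen], used_area, reduction)
```

Here `ise_c` does not fit in the remaining area, so selection stops after two ISEs even though the smaller `ise_d` would still fit; a skip-and-continue variant would trade this simplicity for slightly better area use.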

5.2 Experimental results

Figures 5.2.1 and 5.2.2 show the average execution time reduction and the average extra silicon area cost over MiBench, respectively, for different numbers of ISEs. In both figures, each bar consists of several segments, which indicate the result when the number of ISEs is 1, 2, 4, 8, 16 and 32, respectively.

To show how much the consideration of pipestage timing contributes, we also evaluate a version of the proposed algorithm that does not consider the effect of pipestage timing. In that version, there is only one hardware implementation option for each operation that can be included in an ISE, and we always take it to be the fastest implementation option.

The labels on the X axis in both Figure 5.2.1 and Figure 5.2.2 indicate the arguments used by the ISE exploration algorithm. The first and second symbols in the parentheses of each label give the number of register file read/write ports and the optimization scheme (-O0 or -O3). (4/2, O3), for example, means that the register file has 4 read ports and 2 write ports and that the -O3 optimization scheme is used. The third symbol, “T”, indicates that pipestage timing is taken into account.

Figure 5.2.1: Execution time reduction

Figure 5.2.2: Extra silicon area cost (Y axis: extra area cost in µm²; segments: ISE number 1, 2, 4, 8, 16 and 32)

For both algorithms, -O3 shows a better execution time reduction than -O0 under the same read/write port constraint. A possible reason is that -O3 speeds up program execution through various compiler techniques. Some of these techniques (such as loop unrolling and function inlining) remove branch instructions and increase the size of certain critical basic blocks. The bigger the basic block, the larger the search space and the more likely it is that ISEs with a larger cycle reduction can be found in it. It is also noteworthy that most of the execution time reduction is contributed by a few ISEs. This is because most programs spend the bulk of their execution time in a small fraction of the code, i.e. the execution time reduction is dominated by a few ISEs. In most cases, 8 ISEs achieve half or more of the execution time reduction and consume only a small fraction of the maximum extra area cost. For example, 8 ISEs with a 4/2 (read/write ports) register file save 14.95% of the execution time on average and cost 81467.5 µm² of silicon area, which is 5.43% of the original core area. On the other hand, if we select 32 ISEs, the average execution time reduction increases to 20.62%, but the extra area cost also rises to 345135.45 µm², which is 23.01% of the original core area.
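The area percentages quoted above follow directly from the 1.5 mm² (1,500,000 µm²) core area assumed in the setup. The short check below reproduces them; the variable names are ours, and the "marginal" figures are simple differences added here for illustration rather than values reported in the text.

```python
CORE_AREA_UM2 = 1.5e6  # 1.5 mm² expressed in µm²

def area_percent(extra_area_um2, core_area_um2=CORE_AREA_UM2):
    """Extra ISE area as a percentage of the original core area."""
    return extra_area_um2 / core_area_um2 * 100.0

pct_8 = round(area_percent(81467.5), 2)      # 8 ISEs, 4/2 ports
pct_32 = round(area_percent(345135.45), 2)   # 32 ISEs, 4/2 ports

# Going from 8 to 32 ISEs: extra reduction gained versus extra area paid.
marginal_reduction = round(20.62 - 14.95, 2)   # percentage points
marginal_area = round(345135.45 - 81467.5, 2)  # µm²

print(pct_8, pct_32, marginal_reduction, marginal_area)
```

The last two numbers make the diminishing return explicit: the step from 8 to 32 ISEs buys less additional execution time reduction than the first 8 ISEs delivered, at more than three times their area.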

One thing to notice in Figure 5.2.1 is that ACO (2/1, O0) seems to be better than ACO (2/1, O3). In fact, with 1, 2 and 4 ISEs, -O3 still performs better than -O0; the situation reverses only with larger ISE numbers. This is caused by the results of a few particular benchmarks. For example, in such a benchmark only 4 ISEs can be found with -O3, whereas more than 4 ISEs can be found with -O0, giving a larger total execution time reduction. Therefore, when the ISE number is greater than 4, -O0 appears better than -O3.

Figure 5.2.3: Extra area saving percentage (X axis: ISE number, 1, 2, 4, 8, 16 and 32; Y axis: extra area saving percentage; series: ACO(2/1, O0, T), ACO(2/1, O3, T), ACO(4/2, O0, T), ACO(4/2, O3, T), ACO(6/3, O0, T), ACO(6/3, O3, T))

Since the proposed ISE exploration algorithm explores not only ISE candidates but also their implementation options, less extra silicon area is used in all cases. Figure 5.2.3 illustrates the extra area saving percentage for all cases, and Figures 5.2.4 to 5.2.6 show the execution time reduction per unit area. These figures show that considering pipestage timing clearly reduces the extra area usage.

Figure 5.2.4: Execution time reduction per unit area (2/1) (X axis: ISE number; Y axis: execution time reduction per unit area; series: ACO(2/1, O0, T), ACO(2/1, O3, T), ACO(2/1, O0), ACO(2/1, O3))

Figure 5.2.5: Execution time reduction per unit area (4/2) (X axis: ISE number; Y axis: execution time reduction per unit area; series: ACO(4/2, O0, T), ACO(4/2, O3, T), ACO(4/2, O0), ACO(4/2, O3))

Figure 5.2.6: Execution time reduction per unit area (6/3) (X axis: ISE number, 1, 2, 4, 8, 16 and 32; Y axis: execution time reduction per unit area, 0.00 to 10.00; series: ACO(6/3, O0, T), ACO(6/3, O3, T), ACO(6/3, O0), ACO(6/3, O3))

From another perspective, under the same silicon area constraint, using more area-efficient implementation options allows more ISEs to be employed in the processor core, which leads to a better performance improvement. We illustrate this with Figures 5.2.7 and 5.2.8.

In Figure 5.2.7, each bar consists of several segments, which indicate the execution time reduction under silicon area constraints of 5%, 10%, 15%, 20%, 25% and 30% of the original CPU core, respectively. Figure 5.2.8 shows the number of ISEs that can be used under each silicon area constraint. Note that the silicon area of the CPU core differs with the number of register file read/write ports. In all cases, the proposed ISE exploration algorithm achieves a larger execution time reduction. More noteworthy, the improvement in execution time reduction is not proportional to the available silicon area, because most of the execution time reduction is contributed by a few ISEs. Table 5.2.1 shows the detailed results of Figures 5.2.7 and 5.2.8.

Figure 5.2.7: Execution time reduction under different silicon area constraint

Figure 5.2.8: ISE Number under different silicon area constraint

Table 5.2.1: Execution time reduction under different silicon area constraint

| Silicon area constraint | 5% | 10% | 15% | 20% | 25% | 30% |
|---|---|---|---|---|---|---|
| Number of ISE being selected | | | | | | |
| ACO(2/1, O0, T) | 13 | 28 | 50 | 71 | 102 | 128 |
| ACO(2/1, O3, T) | 12 | 27 | 40 | 55 | 79 | 100 |
| ACO(2/1, O0) | 10 | 21 | 34 | 50 | 64 | 86 |
| ACO(2/1, O3) | 10 | 19 | 31 | 44 | 55 | 72 |
| Execution time reduction | | | | | | |
| ACO(2/1, O0, T) | 8.60% | 9.81% | 10.38% | 10.61% | 10.76% | 10.83% |
| ACO(2/1, O3, T) | 7.57% | 8.49% | 8.94% | 9.29% | 9.54% | 9.64% |
| ACO(2/1, O0) | 6.49% | 7.21% | 7.61% | 7.90% | 8.02% | 8.11% |
| ACO(2/1, O3) | 6.65% | 7.28% | 7.72% | 8.01% | 8.20% | 8.36% |
| Number of ISE being selected | | | | | | |
| ACO(4/2, O0, T) | 8 | 18 | 23 | 34 | 46 | 56 |
| ACO(4/2, O3, T) | 6 | 14 | 20 | 26 | 34 | 45 |
| ACO(4/2, O0) | 6 | 13 | 20 | 24 | 32 | 42 |
| ACO(4/2, O3) | 5 | 12 | 17 | 22 | 27 | 33 |
| Execution time reduction | | | | | | |
| ACO(4/2, O0, T) | 13.61% | 17.26% | 18.19% | 19.31% | 20.13% | 20.64% |
| ACO(4/2, O3, T) | 14.98% | 19.04% | 20.46% | 21.30% | 22.09% | 22.84% |
| ACO(4/2, O0) | 12.13% | 15.06% | 16.69% | 17.27% | 18.08% | 18.77% |
| ACO(4/2, O3) | 14.15% | 17.51% | 18.64% | 19.43% | 20.04% | 20.62% |
| Number of ISE being selected | | | | | | |
| ACO(6/3, O0, T) | 5 | 12 | 19 | 25 | 31 | 39 |
| ACO(6/3, O3, T) | 6 | 9 | 14 | 19 | 25 | 32 |
| ACO(6/3, O0) | 4 | 9 | 15 | 19 | 24 | 29 |
| ACO(6/3, O3) | 4 | 7 | 11 | 15 | 20 | 25 |
| Execution time reduction | | | | | | |
| ACO(6/3, O0, T) | 14.95% | 19.25% | 20.97% | 21.77% | 22.33% | 22.87% |
| ACO(6/3, O3, T) | 18.76% | 20.92% | 22.72% | 23.77% | 24.61% | 25.32% |
| ACO(6/3, O0) | 13.83% | 17.19% | 19.37% | 20.19% | 20.94% | 21.56% |
| ACO(6/3, O3) | 16.91% | 19.76% | 21.74% | 22.92% | 24.00% | 24.81% |

5.3 Optimal Solution

To illustrate the quality of the ISEs explored by the proposed algorithm, a set of basic blocks is processed to obtain the optimal solution. Table 5.3.1 compares the results of the proposed algorithm with the optimal solution, and Table 5.3.2 lists the corresponding processing times. The legal pattern number is the number of patterns that are legal ISEs (they satisfy the input/output constraint, are convex, and contain no load/store operation). The processing time of the optimal solution for a DFG is determined by the DFG size or the legal pattern number, as can be seen from the time complexity of the optimal solution given earlier. In the cases where the optimal solution can be obtained, the proposed algorithm achieves a cycle reduction and extra area cost close to the optimal ones while saving a great deal of computing time. When the legal pattern number reaches 45 or even 108, the optimal solution needs considerable computing time and may not terminate within a reasonable interval, whereas the proposed algorithm takes only a few seconds to obtain a solution. Another observation is that relaxing the input/output constraint usually increases the legal pattern number. In this situation the optimal solution is harder to obtain, but the proposed algorithm still behaves well.
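The effect of the legal pattern number on the exhaustive search can be made concrete by evaluating the O(2^k) term of step 3 for a few values of k that appear in Table 5.3.2. The function name is ours, and the count ignores the step 1 and step 2 terms, so it is only a lower-bound illustration of the growth, not a prediction of the measured times.

```python
def combination_count(k):
    """Number of subsets of k legal ISEs that exhaustive step 3 would examine."""
    return 2 ** k

# Legal pattern numbers taken from Table 5.3.2.
for k in (4, 30, 45, 108):
    print(f"k = {k:3d}: 2^k = {combination_count(k):.3e}")
```

Between k = 30 and k = 45 the number of combinations grows by a factor of 2^15 = 32768, which is consistent with the optimal search finishing in minutes for the former and not finishing in practical time for the latter.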

Table 5.3.1: Comparison of optimal solution and ISE Exploration Algorithm (result)

Columns: DFG, Constraint, Cycle Reduction (Optimal Solution and Proposed Algorithm)

P.S. *: means the solution can’t be obtained in practical time.

Table 5.3.2: Comparison of optimal solution and ISE Exploration Algorithm (processing time)

| DFG Size | Legal Pattern Number | In / Out Constraint | Optimal Solution Processing Time | Proposed Algorithm Processing Time |
|---|---|---|---|---|
| 13 | 4 | 2 / 1 | 0.01s | 0.03s |
| 26 | 9 | 2 / 1 | 0.03s | 1.21s |
| 20 | 30 | 2 / 1 | 14m22.46s | 2.705s |
| 41 | 7 | 2 / 1 | 2m12.53s | 1.249s |
| 64 | 1 | 2 / 1 | 4.08s | 0.753s |
| 32 | 45 | 2 / 1 | --* | 4.49s |
| 23 | 75 | 2 / 1 | -- | 2.333s |
| | 13 | 2 / 1 | 0.01s | 0.438s |
| 12 | 28 | 4 / 2 | 2m15.33s | 0.777s |
| | 13 | 2 / 1 | 4.73s | 1.786s |
| | 46 | 4 / 2 | -- | 2.102s |
| | 108 | 6 / 3 | -- | 3.067s |

P.S. *: means the solution can’t be obtained in practical time.

Chapter 6 Conclusion

The proposed ISE exploration and selection algorithms can significantly reduce the extra silicon area cost with almost no performance loss. To achieve the highest speed-up ratio, previous studies overlook several important microarchitectural constraints, such as the pipestage timing constraint and the instruction set architecture (ISA) format. To conform to the pipestage timing constraint, we propose an ISE exploration algorithm that evaluates different implementation options of each operation in the DFG while exploring ISE candidates. In addition, we formulate ISE selection as a multi-constrained 0/1 knapsack problem to comply with different microarchitectural constraints. The benefits of our approach are: (1) it conforms to several important microarchitectural constraints; (2) it significantly reduces the extra silicon area cost; and (3) both algorithms are polynomial-time solvable. Experimental results show that our design further reduces the extra silicon area by 35.28%, 15.92% and 22.41% (max., min. and avg., respectively), with a performance loss of at most 1.06%.

In addition, we identify several issues that can be addressed in future work. First, when adjusting the parameters (α, β1, β2 and β3) used in the probability value, the ISE exploration algorithm and the merit function, we observe that these parameters greatly affect the experimental results. In this work we use the same set of parameters for all cases, i.e. for different combinations of register file read/write ports and basic block (BB) sizes; it would be interesting to study the dynamic adjustment of these parameters in our approach. Second, the running time of the ISE generation algorithm is a noteworthy issue. In this paper, the ISE exploration algorithm explores only one ISE candidate in each round. If the algorithm explored multiple ISE candidates simultaneously in each round, the running time could be reduced significantly. Third, [combination] raises an interesting issue, “ISE combination”. If we merge several similar ISE candidates into one, or use one hardware resource to execute identical operations in the same ISE, the silicon area can be further reduced without introducing any performance loss.

References

[1] Gang Wang, Wenrui Gong and Ryan Kastner, “Application Partitioning on Programmable Platforms Using the Ant Colony Optimization”, to appear in the Journal of Embedded Computing, Vol. 2, Issue 1, 2006.

[2] Mouloud Koudil, Karima Benatchba, Said Gharout, Nacer Hamani: Solving Partitioning Problem in Codesign with Ant Colonies. IWINAC (2) 2005: 324-337.

[3] Laura Pozzi, Kubilay Atasu, and Paolo Ienne. Exact and Approximate Algorithms for the Extension of Embedded Processor Instruction Sets. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on Volume 25, Issue 7, Jul 2006 Page(s):1209 – 1229.

[4] Partha Biswas, Sudarshan Banerjee, Nikil Dutt, Laura Pozzi, and Paolo Ienne. Fast automated generation of high-quality instruction set extensions for processor customization. In Proceedings of the 3rd Workshop on Application Specific Processors, Stockholm, September 2004.

[5] Pan Yu, Tulika Mitra: Characterizing embedded applications for instruction-set extensible processors. DAC 2004: 723-728.

[6] Pan Yu, Tulika Mitra: Satisfying real-time constraints with custom instructions. CODES+ISSS 2005: 166-171.

[7] Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang. Application-Specific Instruction Generation for Configurable Processor Architectures. Twelfth International Symposium on Field Programmable Gate Arrays, 183-189, 2004.

[8] Samik Das, P. P. Chakrabarti, Pallab Dasgupta: Instruction-Set-Extension Exploration Using Decomposable Heuristic Search. VLSI Design 2006: 293-298.

[9] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, "Custom-Instruction Synthesis for Extensible-Processor Platforms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, pp. 216--228, February 2004.

[10] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), 2002.

[11] M. R. Guthaus et al. Mibench: A free, commercially representative embedded benchmark suite. In IEEE Annual Workshop on Workload Characterization, 2001.

[12] A. Lindstrom and M. Nordseth. (2004, March). Arithmetic Database. [Online]. Available: http://www.ce.chalmers.se/arithdb/

[13] Edson Borin, Felipe Klein, Nahri Moreano, Rodolfo Azevedo, and Guido Araujo. “Fast Instruction Set Customization”, 2nd Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia'04). Stockholm - Sweden, September 2004.

[14] Nathan T. Clark, Hongtao Zhong, Scott A. Mahlke, "Automated Custom Instruction Generation for Domain-Specific Processor Acceleration," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1258-1270, Oct., 2005.

[15] Armita Peymandoust, Laura Pozzi, Paolo Ienne, and Giovanni De Micheli. Automatic Instruction-Set Extension and Utilization for Embedded Processors. In Proceedings of the 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, The Netherlands, June 2003.

[16] A. Lindström, M. Nordseth and L. Bengtsson. "0.13 µm CMOS Synthesis of Common Arithmetic Units", Technical Report 03-11, Department of Computer Engineering, Chalmers University of Technology, June 2003.

[17] Laura Pozzi, Paolo Ienne: Exploiting pipelining to relax register-file port constraints of instruction-set extensions. CASES 2005: 2-10.

[18] Maria Luisa Lopez-Vallejo, Jesus Grajal, Juan Carlos Lopez, "Constraint-Driven System Partitioning," Design, Automation and Test in Europe (DATE '00), p. 411, 2000.

Appendix A

A.1. Simulation results of ACO(Input/Output, T)

Figure A.1.1: Execution time reduction of ACO(2/1, T) (X axis: each benchmark at -O0 and -O3, namely basicmath, bitcount, qsort, susan, cjpeg, djpeg, lame, tiff2bw, tiff2rgba, tiffdither, tiffmedian, dijkstra, patricia, ispell, stringsearch, blowfish, rijndael, sha, CRC32, FFT, adpcm and gsm, plus the average; Y axis: execution time reduction, 0.00% to 70.00%; segments: ISE number 1, 2, 4, 8, 16 and 32)

Figure A.1.2: Execution time reduction of ACO(4/2, T) (Y axis: execution time reduction, 0.00% to 70.00%; segments: ISE number 1, 2, 4, 8, 16 and 32)

Figure A.1.3: Execution time reduction of ACO(6/3, T) (Y axis: execution time reduction, 0.00% to 70.00%; segments: ISE number 1, 2, 4, 8, 16 and 32)

Figure A.1.4: Execution time reduction of ACO(8/4, T) (Y axis: execution time reduction, 0.00% to 70.00%; segments: ISE number 1, 2, 4, 8, 16 and 32)

Figure A.1.5: Extra area cost of ACO(2/1, T) (Y axis: extra area cost in µm²; segments: ISE number 1, 2, 4, 8, 16 and 32)

Figure A.1.6: Extra area cost of ACO(4/2, T) (Y axis: extra area cost in µm²; segments: ISE number 1, 2, 4, 8, 16 and 32)

Extra area cost of ACO(6/3, T)
