
Chapter 5 Experimental Results

5.2 Experimental results

Figures 5.2.1 and 5.2.2 show the average execution time reduction and the average extra silicon area cost for Mibench, respectively, under different numbers of ISEs. In both figures, each bar consists of several segments, which indicate the value (execution time reduction in Figure 5.2.1, extra area cost in Figure 5.2.2) obtained with 1, 2, 4, 8, 16 and 32 ISEs, respectively.

To show the effect of considering pipestage timing, we also evaluate a version of the proposed algorithm that does not consider it. In that version there is only one hardware implementation option for each operation that can be included in an ISE, and we always take it to be the fastest implementation option.

The labels on the X axis of Figures 5.2.1 and 5.2.2 identify the arguments with which the ISE exploration algorithm is run. The first and second symbols in the parentheses of each label give the number of register file read/write ports and the optimization scheme (-O0 or -O3). For example, (4/2, O3) means that the register file has 4 read ports and 2 write ports and that the -O3 optimization scheme is used. The third symbol, “T”, indicates that pipestage timing is taken into consideration.
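The label convention described above can be decoded mechanically. The following is a minimal sketch in Python; the names `Config` and `parse_label` are hypothetical helpers introduced for illustration and are not part of the thesis tooling.

```python
import re
from typing import NamedTuple


class Config(NamedTuple):
    """Decoded form of a label such as "ACO(4/2, O3, T)" (hypothetical type)."""
    algorithm: str
    read_ports: int
    write_ports: int
    opt_level: str
    pipestage_timing: bool


# name(read/write, O<level>[, T])
_LABEL = re.compile(r"^(\w+)\((\d+)/(\d+),\s*(O\d)(?:,\s*(T))?\)$")


def parse_label(label: str) -> Config:
    """Split a figure label into its port counts, -O level and the T flag."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    algorithm, read_ports, write_ports, opt_level, timing = match.groups()
    return Config(algorithm, int(read_ports), int(write_ports),
                  opt_level, timing == "T")


if __name__ == "__main__":
    print(parse_label("ACO(4/2, O3, T)"))
    print(parse_label("ACO(2/1, O0)"))
```

A label without the trailing “T” decodes to `pipestage_timing=False`, which corresponds to the variant that always uses the fastest implementation option.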

Figure 5.2.1: Execution time reduction

[Plot: Y axis, Extra Area Cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure 5.2.2: Extra silicon area cost

For both algorithms, -O3 yields a larger execution time reduction than -O0 under the same read/write port constraint. A possible reason is that -O3 speeds up program execution through various compiler techniques. Some of these techniques (such as loop unrolling and function inlining) remove branch instructions and enlarge certain critical basic blocks. The larger a basic block is, the larger the search space becomes, and the more likely it is that ISEs with a greater cycle reduction can be found in it. Also noteworthy is that most of the execution time reduction is contributed by a few ISEs. This is because most programs spend the bulk of their execution time in a small fraction of the code, so the execution time reduction is dominated by a few ISEs. In most cases, 8 ISEs achieve half or more of the execution time reduction while consuming only a small fraction of the maximum extra area cost. For example, 8 ISEs with a 4/2 (read/write ports) register file save 14.95% of the execution time on average and cost 81467.5 µm² of silicon area, which is 5.43% of the original core area. If we instead select 32 ISEs, the average execution time reduction increases to 20.62%, but the extra area cost also rises to 345135.45 µm², which is 23.01% of the original core area.
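The figures quoted for the 4/2 register file can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the function names are hypothetical. It recovers the core area implied by each (area, percentage) pair and compares the 8-ISE point with the 32-ISE point.

```python
# Values quoted in the text for the 4/2 register file:
# ISE count -> (avg. execution time reduction in %, extra area in um^2,
#               extra area as % of the original core area)
POINTS = {
    8: (14.95, 81467.5, 5.43),
    32: (20.62, 345135.45, 23.01),
}


def implied_core_area(extra_area_um2: float, percent_of_core: float) -> float:
    """Core area (um^2) implied by an extra area and its share of the core."""
    return extra_area_um2 / (percent_of_core / 100.0)


def share(part: float, whole: float) -> float:
    """Fraction of `whole` that `part` represents."""
    return part / whole


if __name__ == "__main__":
    for count, (_, area, pct) in POINTS.items():
        print(count, "ISEs imply a core area of",
              round(implied_core_area(area, pct)), "um^2")
    red8, area8, _ = POINTS[8]
    red32, area32, _ = POINTS[32]
    print("share of the 32-ISE reduction reached with 8 ISEs:",
          round(share(red8, red32), 3))
    print("share of the 32-ISE area used by 8 ISEs:",
          round(share(area8, area32), 3))
```

Both data points imply a core area of roughly 1.5 × 10⁶ µm², so the two percentages are mutually consistent. Relative to the 32-ISE configuration, 8 ISEs deliver well over half of the reduction for less than a quarter of the area.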

One thing to note in Figure 5.2.1 is that ACO (2/1, O0) appears to outperform ACO (2/1, O3). In fact, with 1, 2 and 4 ISEs, -O3 still performs better than -O0; the situation reverses only with larger numbers of ISEs. This is caused by the results of a few particular benchmarks. For example, in some benchmarks only 4 ISEs can be found with -O3, whereas more than 4 ISEs can be found with -O0, which in total yield a larger execution time reduction. Hence, when the number of ISEs exceeds 4, -O0 appears better than -O3.

[Plot: X axis, ISE Number (1, 2, 4, 8, 16, 32); Y axis, Extra Area Saving Percentage (ticks 10% to 35%); series ACO(2/1, O0, T), ACO(2/1, O3, T), ACO(4/2, O0, T), ACO(4/2, O3, T), ACO(6/3, O0, T), ACO(6/3, O3, T)]

Figure 5.2.3: Extra area saving percentage

Because the proposed ISE exploration algorithm explores not only ISE candidates but also their implementation options, less extra silicon area is used in all cases. Figure 5.2.3 illustrates the extra area saving percentage for all cases, and Figures 5.2.4 to 5.2.6 show the execution time reduction per unit area. In these figures, considering pipestage timing clearly reduces the extra area usage.

[Plot: Y axis, Execution Time Reduction per Unit Area; series ACO(2/1, O0, T), ACO(2/1, O3, T), ACO(2/1, O0), ACO(2/1, O3)]

Figure 5.2.4: Execution time reduction per unit area (2/1)

[Plot: Y axis, Execution Time Reduction per Unit Area; series ACO(4/2, O0, T), ACO(4/2, O3, T), ACO(4/2, O0), ACO(4/2, O3)]

Figure 5.2.5: Execution time reduction per unit area (4/2)

[Plot: X axis, ISE Number (1, 2, 4, 8, 16, 32); Y axis, Execution Time Reduction per Unit Area (ticks 0.00 to 10.00); series ACO(6/3, O0, T), ACO(6/3, O3, T), ACO(6/3, O0), ACO(6/3, O3)]

Figure 5.2.6: Execution time reduction per unit area (6/3)

From another perspective, under the same silicon area constraint, using more area-frugal implementation options allows more ISEs to be employed in the processor core, which leads to a better performance improvement. We illustrate this with Figures 5.2.7 and 5.2.8.

In Figure 5.2.7, each bar consists of several segments, which indicate the execution time reduction under silicon area constraints of 5%, 10%, 15%, 20%, 25% and 30% of the original CPU core, respectively. Figure 5.2.8 shows the number of ISEs that can be used under each silicon area constraint. Note that the silicon area of the CPU core differs with the number of register file read/write ports. In all cases, the proposed ISE exploration algorithm achieves a larger execution time reduction. More noteworthy is that the improvement in execution time reduction is not proportional to the available silicon area, because most of the execution time reduction is dominated by a few ISEs. Table 5.2.1 shows the detailed results behind Figures 5.2.7 and 5.2.8.

Figure 5.2.7: Execution time reduction under different silicon area constraint

Figure 5.2.8: ISE number under different silicon area constraint

Table 5.2.1: Execution time reduction under different silicon area constraint

Silicon area constraint     5%       10%      15%      20%      25%      30%

Number of ISEs selected
ACO(2/1, O0, T)             13       28       50       71       102      128
ACO(2/1, O3, T)             12       27       40       55       79       100
ACO(2/1, O0)                10       21       34       50       64       86
ACO(2/1, O3)                10       19       31       44       55       72
Execution time reduction
ACO(2/1, O0, T)             8.60%    9.81%    10.38%   10.61%   10.76%   10.83%
ACO(2/1, O3, T)             7.57%    8.49%    8.94%    9.29%    9.54%    9.64%
ACO(2/1, O0)                6.49%    7.21%    7.61%    7.90%    8.02%    8.11%
ACO(2/1, O3)                6.65%    7.28%    7.72%    8.01%    8.20%    8.36%

Number of ISEs selected
ACO(4/2, O0, T)             8        18       23       34       46       56
ACO(4/2, O3, T)             6        14       20       26       34       45
ACO(4/2, O0)                6        13       20       24       32       42
ACO(4/2, O3)                5        12       17       22       27       33
Execution time reduction
ACO(4/2, O0, T)             13.61%   17.26%   18.19%   19.31%   20.13%   20.64%
ACO(4/2, O3, T)             14.98%   19.04%   20.46%   21.30%   22.09%   22.84%
ACO(4/2, O0)                12.13%   15.06%   16.69%   17.27%   18.08%   18.77%
ACO(4/2, O3)                14.15%   17.51%   18.64%   19.43%   20.04%   20.62%

Number of ISEs selected
ACO(6/3, O0, T)             5        12       19       25       31       39
ACO(6/3, O3, T)             6        9        14       19       25       32
ACO(6/3, O0)                4        9        15       19       24       29
ACO(6/3, O3)                4        7        11       15       20       25
Execution time reduction
ACO(6/3, O0, T)             14.95%   19.25%   20.97%   21.77%   22.33%   22.87%
ACO(6/3, O3, T)             18.76%   20.92%   22.72%   23.77%   24.61%   25.32%
ACO(6/3, O0)                13.83%   17.19%   19.37%   20.19%   20.94%   21.56%
ACO(6/3, O3)                16.91%   19.76%   21.74%   22.92%   24.00%   24.81%
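The two observations drawn from Table 5.2.1 (a consistent benefit from considering pipestage timing, and diminishing returns as the area budget grows) can be checked directly against the tabulated values. The following Python sketch transcribes the execution time reduction rows; the helper names `timing_gain` and `marginal_gains` are hypothetical.

```python
# Execution time reduction (%) from Table 5.2.1, keyed by
# (read/write ports, optimization level, pipestage timing considered).
# Columns are the area constraints 5%, 10%, 15%, 20%, 25%, 30%.
REDUCTION = {
    ("2/1", "O0", True): [8.60, 9.81, 10.38, 10.61, 10.76, 10.83],
    ("2/1", "O3", True): [7.57, 8.49, 8.94, 9.29, 9.54, 9.64],
    ("2/1", "O0", False): [6.49, 7.21, 7.61, 7.90, 8.02, 8.11],
    ("2/1", "O3", False): [6.65, 7.28, 7.72, 8.01, 8.20, 8.36],
    ("4/2", "O0", True): [13.61, 17.26, 18.19, 19.31, 20.13, 20.64],
    ("4/2", "O3", True): [14.98, 19.04, 20.46, 21.30, 22.09, 22.84],
    ("4/2", "O0", False): [12.13, 15.06, 16.69, 17.27, 18.08, 18.77],
    ("4/2", "O3", False): [14.15, 17.51, 18.64, 19.43, 20.04, 20.62],
    ("6/3", "O0", True): [14.95, 19.25, 20.97, 21.77, 22.33, 22.87],
    ("6/3", "O3", True): [18.76, 20.92, 22.72, 23.77, 24.61, 25.32],
    ("6/3", "O0", False): [13.83, 17.19, 19.37, 20.19, 20.94, 21.56],
    ("6/3", "O3", False): [16.91, 19.76, 21.74, 22.92, 24.00, 24.81],
}


def timing_gain(ports: str, opt: str) -> list:
    """Percentage points gained by considering pipestage timing, per column."""
    with_t = REDUCTION[(ports, opt, True)]
    without_t = REDUCTION[(ports, opt, False)]
    return [round(t - b, 2) for t, b in zip(with_t, without_t)]


def marginal_gains(key: tuple) -> list:
    """Extra percentage points obtained from each additional 5% of area."""
    row = REDUCTION[key]
    return [round(nxt - cur, 2) for cur, nxt in zip(row, row[1:])]


if __name__ == "__main__":
    print(timing_gain("4/2", "O3"))
    print(marginal_gains(("4/2", "O3", True)))
```

In every row the first step, from 5% to 10% of the core area, gives the largest marginal gain, which matches the remark that a few ISEs dominate the execution time reduction.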

5.3 Optimal Solution

To illustrate the quality of the ISEs explored by the proposed algorithm, a set of basic blocks is processed to obtain the optimal solution. Table 5.3.1 compares the results of the proposed algorithm with the optimal solution, and the corresponding processing times are listed in Table 5.3.2. The legal pattern number is the number of patterns that are legal as ISEs, i.e. that satisfy the input/output constraint, are convex, and contain no load/store operation. The processing time of the optimal solution for a DFG is determined by the DFG size or the legal pattern number, as can be seen from the time complexity of the optimal solution mentioned earlier. In the cases where the optimal solution can be obtained, the proposed algorithm achieves a cycle reduction and an extra area cost close to the optimal ones, with a large saving in computing time. When the legal pattern number reaches 45, or even 108, the optimal solution requires considerable computing time and may not terminate within a reasonable interval, whereas the proposed algorithm needs only a few seconds to produce a solution. Another observation is that relaxing the input/output constraint usually increases the legal pattern number. In this situation the optimal solution is harder to obtain, but the proposed algorithm still performs well.
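The three legality conditions named above (input/output constraint, convexity, no load/store operation) can be sketched as follows. This is an illustrative Python check, not the thesis implementation: the DFG encoding as a successor map, the operation names, and the functions `is_convex`, `is_legal` and `legal_patterns` are all assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, List, Set


def is_convex(succ: Dict[str, List[str]], pattern: Set[str]) -> bool:
    """A pattern is convex if no path leaves it and later re-enters it."""
    stack = [s for n in pattern for s in succ.get(n, []) if s not in pattern]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for s in succ.get(node, []):
            if s in pattern:      # an outside node feeds back into the pattern
                return False
            stack.append(s)
    return True


def is_legal(succ, ops, pattern, max_in, max_out) -> bool:
    """Check the no-load/store, input/output and convexity conditions."""
    if any(ops[n] in ("load", "store") for n in pattern):
        return False
    inputs = {n for n, ss in succ.items()
              if n not in pattern and any(s in pattern for s in ss)}
    outputs = {n for n in pattern
               if any(s not in pattern for s in succ.get(n, []))}
    return (len(inputs) <= max_in and len(outputs) <= max_out
            and is_convex(succ, pattern))


def legal_patterns(succ, ops, candidates, max_in, max_out):
    """Enumerate every non-empty subset of `candidates` and keep legal ones."""
    return [set(c)
            for r in range(1, len(candidates) + 1)
            for c in combinations(candidates, r)
            if is_legal(succ, ops, set(c), max_in, max_out)]


# Hypothetical DFG: n1 = a + b; n2 = n1 * b; n3 = n1 - n2; n3 is live-out.
SUCC = {"a": ["n1"], "b": ["n1", "n2"], "n1": ["n2", "n3"],
        "n2": ["n3"], "n3": ["out"], "out": []}
OPS = {"a": "in", "b": "in", "n1": "add", "n2": "mul", "n3": "sub", "out": "out"}

if __name__ == "__main__":
    for ports in ((2, 1), (4, 2)):
        found = legal_patterns(SUCC, OPS, ["n1", "n2", "n3"], *ports)
        print(ports, len(found))
```

In this toy DFG the pattern {n1, n3} is never legal because n2 lies on a path between its two nodes, while {n1, n2} becomes legal only once two outputs are allowed. The enumeration visits every subset of the candidate operations, which is why exhaustive search grows so quickly with DFG size.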

Table 5.3.1: Comparison of optimal solution and ISE exploration algorithm (result)

[Table: only the column headers survive: DFG, Constraint and Cycle Reduction, grouped under Optimal Solution and Proposed Algorithm]

P.S. *: the solution cannot be obtained in practical time.

Table 5.3.2: Comparison of optimal solution and ISE exploration algorithm (processing time)

DFG     Legal Pattern   In / Out      Optimal Solution    Proposed Algorithm
Size    Number          Constraint    Processing Time     Processing Time
13      4               2 / 1         0.01s               0.03s
26      9               2 / 1         0.03s               1.21s
20      30              2 / 1         14m22.46s           2.705s
41      7               2 / 1         2m12.53s            1.249s
64      1               2 / 1         4.08s               0.753s
32      45              2 / 1         --*                 4.49s
23      75              2 / 1         --                  2.333s
        13              2 / 1         0.01s               0.438s
12      28              4 / 2         2m15.33s            0.777s
        13              2 / 1         4.73s               1.786s
        46              4 / 2         --                  2.102s
        108             6 / 3         --                  3.067s

P.S. *: the solution cannot be obtained in practical time.
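To put the processing times of Table 5.3.2 on a common scale, the sketch below converts the tabulated strings to seconds and computes the ratio between the optimal-solution time and the proposed algorithm's time for the rows where the optimal solution terminated. The helpers `to_seconds` and `speedup` are hypothetical.

```python
import re

# (optimal solution time, proposed algorithm time) for the rows of
# Table 5.3.2 in which the optimal solution was obtained.
TIMES = [
    ("0.01s", "0.03s"),
    ("0.03s", "1.21s"),
    ("14m22.46s", "2.705s"),
    ("2m12.53s", "1.249s"),
    ("4.08s", "0.753s"),
    ("0.01s", "0.438s"),
    ("2m15.33s", "0.777s"),
    ("4.73s", "1.786s"),
]

_TIME = re.compile(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s")


def to_seconds(text: str) -> float:
    """Convert strings such as "14m22.46s" or "0.03s" to seconds."""
    match = _TIME.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes, seconds = match.groups()
    return 60.0 * int(minutes or 0) + float(seconds)


def speedup(optimal: str, proposed: str) -> float:
    """How many times faster the proposed algorithm is than the optimal one."""
    return to_seconds(optimal) / to_seconds(proposed)


if __name__ == "__main__":
    for optimal, proposed in TIMES:
        print(f"{optimal:>10} vs {proposed:>7}: {speedup(optimal, proposed):8.2f}x")
```

For the smallest cases the ratio is below 1, since the optimal solution is found in a few hundredths of a second. The advantage of the proposed algorithm appears in the rows where the optimal solution takes minutes.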

Chapter 6 Conclusion

The proposed ISE exploration and selection algorithms can significantly reduce the extra silicon area cost with almost no performance loss. To achieve the highest speed-up ratio, previous research overlooks several important microarchitectural constraints, such as the pipestage timing constraint and the instruction set architecture (ISA) format. To conform to the pipestage timing constraint, we propose an ISE exploration algorithm that evaluates different implementation options for each operation in the DFG while exploring ISE candidates. In addition, we formulate ISE selection as a multi-constrained 0/1 knapsack problem in order to comply with different microarchitectural constraints. The benefits of our approach are that (1) it conforms to several important microarchitectural constraints; (2) it significantly reduces the extra silicon area cost; and (3) both algorithms are polynomial-time solvable. Experimental results show that our design further reduces the extra silicon area by 35.28%, 15.92% and 22.41% (max., min. and avg.), with a performance loss of at most 1.06%.
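As an illustration of the multi-constrained 0/1 knapsack formulation mentioned above, the following Python sketch selects ISE candidates under two constraints by dynamic programming. The two constraints used here (an integer area budget and a maximum number of ISEs), the candidate values, and the function name `select_ises` are assumptions for the example; the thesis's actual constraint set and solver may differ.

```python
from typing import List, Tuple


def select_ises(candidates: List[Tuple[float, int]],
                area_budget: int,
                max_count: int) -> Tuple[float, Tuple[int, ...]]:
    """Pick candidates maximizing total gain under two constraints.

    candidates: (gain, area) pairs, with area in integer units.
    Returns (best total gain, indices of the chosen candidates).
    """
    # best[a][k] = best (gain, chosen) using area <= a and at most k ISEs.
    best = [[(0.0, ()) for _ in range(max_count + 1)]
            for _ in range(area_budget + 1)]
    for idx, (gain, area) in enumerate(candidates):
        # Descending loops ensure each candidate is taken at most once.
        for a in range(area_budget, area - 1, -1):
            for k in range(max_count, 0, -1):
                prev_gain, prev_chosen = best[a - area][k - 1]
                if prev_gain + gain > best[a][k][0]:
                    best[a][k] = (prev_gain + gain, prev_chosen + (idx,))
    return best[area_budget][max_count]


if __name__ == "__main__":
    # Hypothetical candidates: (cycle gain, area units).
    demo = [(5.0, 4), (4.0, 3), (3.0, 2), (1.0, 1)]
    print(select_ises(demo, area_budget=5, max_count=2))
    print(select_ises(demo, area_budget=5, max_count=1))
```

The table has (area_budget + 1) × (max_count + 1) entries and each candidate updates every entry once, so the running time is proportional to the number of candidates times the size of the table.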

In addition, we identify several issues that can be addressed in future work. First, in adjusting the parameters (α, β1, β2 and β3) used in the probability value, the ISE exploration algorithm and the merit function, we observe that these parameters greatly affect the experimental results. In this work we use a single set of parameters for all cases, i.e. for every combination of register file read/write ports and basic block (BB) size; it would be interesting to study the dynamic adjustment of these parameters. Second, the running time of the ISE generation algorithm is a noteworthy issue. In this work, the ISE exploration algorithm explores only one ISE candidate in each round. If the algorithm explored multiple ISE candidates simultaneously in each round, the running time could be reduced significantly. Third, [combination] raises an interesting issue, “ISE combination”. If we merge several analogous ISE candidates into one, or use one hardware resource to execute identical operations within the same ISE, the silicon area can be reduced further without introducing any performance loss.

References

[1] Gang Wang, Wenrui Gong and Ryan Kastner, “Application Partitioning on Programmable Platforms Using the Ant Colony Optimization”, to appear in the Journal of Embedded Computing, Vol. 2, Issue 1, 2006.

[2] Mouloud Koudil, Karima Benatchba, Said Gharout, Nacer Hamani: Solving Partitioning Problem in Codesign with Ant Colonies. IWINAC (2) 2005: 324-337.

[3] Laura Pozzi, Kubilay Atasu, and Paolo Ienne. Exact and Approximate Algorithms for the Extension of Embedded Processor Instruction Sets. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on Volume 25, Issue 7, Jul 2006 Page(s):1209 – 1229.

[4] Partha Biswas, Sudarshan Banerjee, Nikil Dutt, Laura Pozzi, and Paolo Ienne. Fast automated generation of high-quality instruction set extensions for processor customization. In Proceedings of the 3rd Workshop on Application Specific Processors, Stockholm, September 2004.

[5] Pan Yu, Tulika Mitra: Characterizing embedded applications for instruction-set extensible processors. DAC 2004: 723-728.

[6] Pan Yu, Tulika Mitra: Satisfying real-time constraints with custom instructions. CODES+ISSS 2005: 166-171.

[7] Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang. Application-Specific Instruction Generation for Configurable Processor Architectures. Twelfth International Symposium on Field Programmable Gate Arrays, 183-189, 2004.

[8] Samik Das, P. P. Chakrabarti, Pallab Dasgupta: Instruction-Set-Extension Exploration Using Decomposable Heuristic Search. VLSI Design 2006: 293-298.

[9] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, "Custom-Instruction Synthesis for Extensible-Processor Platforms," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, pp. 216--228, February 2004.

[10] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), 2002.

[11] M. R. Guthaus et al. Mibench: A free, commercially representative embedded benchmark suite. In IEEE Annual Workshop on Workload Characterization, 2001.

[12] A. Lindström and M. Nordseth. (2004, March). Arithmetic Database. [Online]. Available: http://www.ce.chalmers.se/arithdb/

[13] Edson Borin, Felipe Klein, Nahri Moreano, Rodolfo Azevedo, and Guido Araujo. “Fast Instruction Set Customization”, 2nd Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia'04). Stockholm, Sweden, September 2004.

[14] Nathan T. Clark, Hongtao Zhong, Scott A. Mahlke, "Automated Custom Instruction Generation for Domain-Specific Processor Acceleration," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1258-1270, Oct., 2005.

[15] Armita Peymandoust, Laura Pozzi, Paolo Ienne, and Giovanni De Micheli. Automatic Instruction-Set Extension and Utilization for Embedded Processors. In Proceedings of the 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, The Netherlands, June 2003.

[16] A. Lindström, M. Nordseth and L. Bengtsson. "0.13 µm CMOS Synthesis of Common Arithmetic Units", Technical Report 03-11, Department of Computer Engineering, Chalmers University of Technology, June 2003.

[17] Laura Pozzi, Paolo Ienne: Exploiting pipelining to relax register-file port constraints of instruction-set extensions. CASES 2005: 2-10.

[18] Maria Luisa Lopez-Vallejo, Jesus Grajal, Juan Carlos Lopez, "Constraint-Driven System Partitioning," Design, Automation and Test in Europe (DATE '00), p. 411, 2000.

Appendix A

A.1. Simulation results of ACO(Input/Output, T)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); X axis, each Mibench benchmark at -O0 and -O3 (basicmath, bitcount, qsort, susan, cjpeg, djpeg, lame, tiff2bw, tiff2rgba, tiffdither, tiffmedian, dijkstra, patricia, ispell, stringsearch, blowfish, rijndael, sha, CRC32, FFT, adpcm, gsm) followed by the average; bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.1: Execution time reduction of ACO(2/1, T)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.2: Execution time reduction of ACO(4/2, T)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.3: Execution time reduction of ACO(6/3, T)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.4: Execution time reduction of ACO(8/4, T)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.5: Extra area cost of ACO(2/1, T)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.6: Extra area cost of ACO(4/2, T)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.7: Extra area cost of ACO(6/3, T)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.1.8: Extra area cost of ACO(8/4, T)

A.2. Simulation results of ACO(Input/Output)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.1: Execution time reduction of ACO(2/1)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.2: Execution time reduction of ACO(4/2)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.3: Execution time reduction of ACO(6/3)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.4: Execution time reduction of ACO(8/4)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.5: Extra area cost of ACO(2/1)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.6: Extra area cost of ACO(4/2)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.7: Extra area cost of ACO(6/3)

[Plot: Y axis, Extra area cost (µm²); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.2.8: Extra area cost of ACO(8/4)

A.3. Simulation results of Genetic(Input/Output)

The genetic algorithm presented here follows [3] and does not consider pipestage timing, i.e. it has no multiple hardware implementation options. It serves as a reference for execution time reduction.

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.3.1: Execution time reduction of Genetic(2/1)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.3.2: Execution time reduction of Genetic(4/2)

[Plot: Y axis, Execution time reduction (ticks 0.00% to 70.00%); bar segments for 1, 2, 4, 8, 16 and 32 ISEs]

Figure A.3.3: Execution time reduction of Genetic(6/3)

[Plot: panel title, Execution time reduction of Genetic(8/4); Y axis ticks 0.00% to 70.00%]
