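Experimental Results - Research on an Integration Architecture for Dynamically Extensible Digital Signal Processing Cores in System-on-Chip (III)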

4.1 Experimental setup

The Portable Instruction Set Architecture (PISA) [17], a MIPS-like ISA, and MiBench [18] were employed to evaluate the proposed ISE exploration algorithm and the genetic algorithm [8]. Each benchmark was compiled with gcc 2.7.2.3 for PISA at the -O0 and -O3 optimization levels. Owing to limitations in the corresponding glibc and compiler, 6 benchmarks, namely mad, typeset, ghostscript, rsynth, sphinx and pgp, could not be compiled successfully. For both ISE exploration algorithms, 6 cases were evaluated: register files with 2/1, 4/2 and 6/3 read/write ports, each with -O0 and -O3 optimization.

Case 1. (The size of vS_x is equal to 1)
If (size(vS_x) = 1)
    For each hardware implementation option (j = 1 to k) in operation x
        merit_{x,j} = merit_{x,j} × β1;

Case 2. (Constraints are violated and the size of vS_x is larger than 1)
If (vS_x violates the in/out constraint)
    For each hardware implementation option (j = 1 to k) in operation x
        merit_{x,j} = merit_{x,j} × β2;
If (vS_x violates the convex constraint)
    For each hardware implementation option (j = 1 to k) in operation x
        merit_{x,j} = merit_{x,j} × β3;

Case 3. (Constraints are met and the size of vS_x is larger than 1)
If (vS_x observes the in/out and convex constraints)
    For each hardware implementation option (j = 1 to k) in operation x
        merit_{x,j} = (cycle_saving_{x,MAX} + 1) × merit_{x,0};
        If (cycle_saving_{x,j} = cycle_saving_{x,MAX})
            [expression in cycle_saving and merit; not recoverable from the source]
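The two decay cases of the merit update listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `decay_merits` and its arguments are hypothetical, the β values are those reported later in this section, and β3 is applied to convex violations on the assumption that the parameter list in this section is authoritative. Case 3 is omitted because its expression is incomplete in the source.

```python
from typing import List

# Values reported in the experimental setup of this section.
BETA1, BETA2, BETA3 = 0.9, 0.9, 0.5


def decay_merits(
    merits: List[float],
    group_size: int,
    violates_io: bool,
    violates_convex: bool,
) -> List[float]:
    """Decay the merits of the hardware options (j = 1..k) of one operation.

    Hypothetical helper covering Case 1 and Case 2 only.
    """
    if group_size == 1:
        # Case 1: the operation stands alone.
        return [m * BETA1 for m in merits]
    factor = 1.0
    if violates_io:
        # Case 2: the in/out constraint is violated.
        factor *= BETA2
    if violates_convex:
        # Case 2: the convex constraint is violated (assumed to use beta3).
        factor *= BETA3
    return [m * factor for m in merits]


# Two hardware options, both starting from the initial merit of 200.
print(decay_merits([200.0, 200.0], 1, False, False))
print(decay_merits([200.0, 200.0], 3, True, True))
```

With these values a convex violation halves the merit, so such options fade from selection much faster than options that only stand alone or only violate the port constraint.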

Table 1: Hardware implementation option setting

In the simulation, we assume that: (1) the CPU core is synthesized in 0.13 µm CMOS technology and runs at 100 MHz; (2) the area of the base CPU core, excluding the register file, is 1378855.84 µm²; (3) the register file has 2/1, 4/2 or 6/3 read/write ports; and (4) every PISA instruction executes in one cycle, i.e. 10 ns. Table 1 lists the hardware implementation option settings (delay and area) of the PISA instructions. Note that only instructions that can be grouped into ISEs are listed in Table 1. These settings were either obtained from Lindstrom [19] or modeled in Verilog and synthesized with Synopsys Design Compiler. Since adding read/write ports to the register file increases the required silicon area, register files with each port configuration were also synthesized. The resulting silicon areas of the CPU cores with 2/1, 4/2 and 6/3 register file read/write ports were 1500000 µm², 1574138.80 µm² and 1631359.54 µm², respectively.
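As a quick check of the figures above, the following Python sketch derives the cycle time from the clock frequency and the area attributable to each register file configuration. It assumes that the difference between a full core and the base core is entirely register file area, which the text implies but does not state; the variable names are illustrative only.

```python
CLOCK_HZ = 100e6                 # 100 MHz core clock
BASE_CORE_AREA_UM2 = 1378855.84  # base core without register file

# Full core areas by register file read/write ports, as reported above.
core_area_um2 = {"2/1": 1500000.00, "4/2": 1574138.80, "6/3": 1631359.54}

cycle_ns = 1e9 / CLOCK_HZ
print(f"cycle time: {cycle_ns:.0f} ns")

reference = core_area_um2["2/1"]
for ports, area in core_area_um2.items():
    register_file = area - BASE_CORE_AREA_UM2   # assumed register file area
    growth = 100.0 * (area - reference) / reference
    print(f"{ports}: register file {register_file:.2f} um^2, "
          f"core {growth:.2f}% larger than the 2/1 core")
```

Under this assumption, moving from 2/1 to 6/3 ports roughly doubles the register file area while enlarging the whole core by less than 9%.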

Because of the heuristic nature of the ISE exploration algorithm, the exploration was repeated 5 times within each basic block, and the best result among the 5 iterations was chosen.

To obtain the set of parameters (α, ρ, β1, β2 and β3), this work adopted a greedy-like method, described as follows:

Step 1: Randomly select a small basic block and manually compute its optimal result. Then randomly choose an initial set of parameters. Starting from this set, adjust only one parameter at a time and execute the proposed algorithm. When the simulation result is equal or very close to the optimal one, stop adjusting this parameter and start adjusting the next one. Once all the parameters have been obtained, go to Step 2.

Step 2: Randomly choose several larger basic blocks and execute the proposed algorithm on them using the set of parameters obtained in Step 1. If the proposed algorithm converges for these larger basic blocks, apply this set of parameters to all basic blocks. Otherwise, go back to Step 1 and use another small basic block to find a new set of parameters. Likewise, if the proposed algorithm fails to converge for some of the basic blocks, go back to Step 1.
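Step 1 amounts to a one-parameter-at-a-time search against a hand-computed optimum. The sketch below illustrates that idea under stated assumptions: `tune_one_at_a_time`, `toy_run`, the candidate values and the tolerance are all hypothetical, and `toy_run` merely stands in for one run of the exploration algorithm on a small basic block.

```python
from typing import Callable, Dict, List


def tune_one_at_a_time(
    run: Callable[[Dict[str, float]], float],
    optimal: float,
    start: Dict[str, float],
    candidates: Dict[str, List[float]],
    tolerance: float = 1e-9,
) -> Dict[str, float]:
    """Adjust one parameter at a time, keeping the others fixed (Step 1)."""
    params = dict(start)
    for name, values in candidates.items():
        best_value = params[name]
        best_gap = abs(optimal - run(params))
        for value in values:
            gap = abs(optimal - run({**params, name: value}))
            if gap < best_gap:
                best_value, best_gap = value, gap
            if best_gap <= tolerance:
                break  # equal or very close to the hand-computed optimum
        params[name] = best_value
    return params


def toy_run(p: Dict[str, float]) -> float:
    # Hypothetical stand-in for the exploration algorithm on a small basic
    # block; its result peaks at alpha = 0.25 and beta1 = 0.9.
    return 10.0 - 4.0 * abs(p["alpha"] - 0.25) - 2.0 * abs(p["beta1"] - 0.9)


tuned = tune_one_at_a_time(
    toy_run,
    optimal=10.0,
    start={"alpha": 0.5, "beta1": 0.5},
    candidates={"alpha": [0.1, 0.25, 0.5], "beta1": [0.5, 0.7, 0.9]},
)
print(tuned)
```

Step 2 would then rerun the algorithm with `tuned` on several larger basic blocks and return to Step 1 with a different small block if it does not converge.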

Note that the basic blocks used in Steps 1 and 2 must be chosen from the results of basic block (BB) selection in the ISE design flow. The parameters adopted in this paper and their meanings are listed below.

♦ α: the relative weight of merit and pheromone in p_{x,j}.

♦ ρ: the evaporating factor in the trail update.

♦ β1: the decay speed when an operation with a selected hardware implementation option stands alone.

♦ β2: the decay speed when the input/output constraint is violated.

♦ β3: the decay speed when the convex constraint is violated.

A large α makes the algorithm converge slowly, whereas a small α makes it converge quickly but to a poorer solution. ρ behaves in the same way as α. β1, β2 and β3 determine how likely an implementation option that violated a constraint is to be selected again: larger values of β1, β2 and β3 give such an option a higher chance of being selected in subsequent iterations.

In this experiment, the initial merit values of the software and hardware implementation options were 100 and 200, respectively, and P_END was 99%. The probability computation used α = 0.25, the evaporating factor ρ was 5, and the merit function used β1 = 0.9, β2 = 0.9 and β3 = 0.5.

Additionally, since [author of 3]’s approach [3] does not consider the pipeline-stage timing constraint, we assume that it always deploys the fastest implementation option for every operation in the ASFU.

4.2 Experimental results

Figures 7 and 8 depict the average execution time reduction and the average extra silicon area cost, respectively, of MiBench with different numbers of ISEs. Each bar in Figs. 7 and 8 comprises several segments, which indicate the result obtained using 1, 2, 4, 8, 16 and 32 ISEs. The first word of each label on the X-axis in both figures indicates which ISE exploration algorithm is adopted: “proposed” and “genetic” denote the proposed ISE exploration algorithm and that of [author of 3] [3], respectively. The two symbols in parentheses in each label are the number of register file read/write ports in use and the optimization level (-O0 or -O3). For instance, (4/2, O3) means that the register file has 4 read ports and 2 write ports, and that -O3 optimization is employed.

For both algorithms, -O3 yields a larger execution time reduction than -O0 in most cases. A likely reason is that -O3 applies various compiler optimization techniques, some of which (such as loop unrolling and function inlining) remove branch instructions and enlarge basic blocks. A larger basic block usually has a larger search space, and hence a greater opportunity to yield ISEs that consist of more operations. This results in a larger execution time reduction.

However, enlarging basic blocks also raises the chance of violating the register read/write port constraint when only a few read/write ports are available, e.g. 2/1. This is why -O3 achieves a smaller execution time reduction than -O0 in some cases.

Most of the execution time reduction is contributed by a few ISEs within hot basic blocks. In other words, the execution time reduction is not proportional to the number of ISEs. In most cases, 8 ISEs achieve over half of the execution time reduction achieved by 32 ISEs while using less than a quarter of the silicon area used by 32 ISEs. For instance, with a 4/2 register file, 8 ISEs save 14.95% of the execution time on average and cost 81467.5 µm² of silicon area, which is 5.43% of the original core area. If 32 ISEs are adopted, the average execution time reduction rises to 20.62%, but the extra area cost also increases to 345135.45 µm², which is 23.01% of the original core area.
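The percentages in this example can be reproduced with a short calculation. The sketch below assumes that the "original core area" is 1500000 µm², the area reported for the core with a 2/1 register file, since that value reproduces the quoted 5.43% and 23.01%; the variable names are illustrative.

```python
ORIGINAL_CORE_AREA_UM2 = 1_500_000.0  # assumed reference area (2/1 core)

area_8_ises, area_32_ises = 81467.5, 345135.45   # extra area in um^2
reduction_8, reduction_32 = 14.95, 20.62         # % execution time saved

overhead_8 = 100.0 * area_8_ises / ORIGINAL_CORE_AREA_UM2
overhead_32 = 100.0 * area_32_ises / ORIGINAL_CORE_AREA_UM2
benefit_share = 100.0 * reduction_8 / reduction_32
area_share = 100.0 * area_8_ises / area_32_ises

print(f"area overhead with 8 ISEs:  {overhead_8:.2f}%")    # 5.43%
print(f"area overhead with 32 ISEs: {overhead_32:.2f}%")   # 23.01%
print(f"8 ISEs give {benefit_share:.1f}% of the 32-ISE reduction")  # 72.5%
print(f"8 ISEs use {area_share:.1f}% of the 32-ISE area")           # 23.6%
```

In this instance, 8 ISEs deliver roughly 72.5% of the benefit of 32 ISEs for about 23.6% of the area, consistent with the "over half" and "less than a quarter" statements.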

Figure 7: Execution time reduction

Figure 8: Extra silicon area cost

Except for several cases with 6/3 register file read/write ports, the proposed ISE exploration algorithm achieved a larger execution time reduction than [author of 3] [3]. The proposed approach yields a smaller performance improvement in those cases because one set of parameters (α, β1, β2 and β3) does not work well in all cases. This problem can be mitigated by dynamically adjusting these parameters according to the situation. Theoretically, ISE candidates that always adopt the fastest hardware implementation option would have the best performance. However, as revealed in Fig. 7, the proposed ISE exploration algorithm achieves a larger execution time reduction in most cases, since even within the same BB, the operations grouped into an ISE candidate and the number of ISE candidates explored by the two algorithms may not be identical.

Because the proposed ISE exploration algorithm explores not only ISE candidates but also their implementation options, it uses less extra silicon area in all cases.

Figure 9 illustrates the silicon area saving of the proposed algorithm for all cases, as compared with the genetic algorithm [3]. The proposed algorithm significantly reduces the extra silicon area cost. Figure 9 also reveals that relaxing the constraint on register file read/write ports tends to decrease the silicon area saving.

Relaxing the constraint on register file read/write ports can increase the number of operations grouped into an ISE. However, the operations that are added as a result are usually logic operations, for which only one hardware implementation option is available. This leads to less silicon area being saved by our approach.


Figure 9: Silicon area saving

From another perspective, under the same silicon area constraint, adopting more area-frugal implementation options increases the number of ISEs that can be accommodated in the processor core and can thereby increase the performance improvement. Figure 10 illustrates this. In Fig. 10, each bar consists of several segments, which indicate the execution time reduction under different silicon area constraints, namely 5%, 10%, 15%, 20%, 25% and 30% of the original CPU core size. In all cases, the proposed ISE exploration algorithm achieves a larger performance improvement than the genetic algorithm [3].

Notably, the improvement in execution time reduction is not proportional to the available silicon area, since most of the execution time reduction is contributed by a few ISEs. Table 2 presents the detailed results of Fig. 10.

Figure 10: Execution time reduction under different silicon area constraint

Table 2: Execution time reduction under different silicon area constraints

Silicon area constraint   5%      10%     15%     20%     25%     30%

Register file read/write ports 2/1

Number of ISEs selected
proposed (2/1, O0)        13      28      50      71      102     128
proposed (2/1, O3)        12      27      40      55      79      100
genetic (2/1, O0)         8       15      22      35      48      61
genetic (2/1, O3)         8       12      18      26      34      46

Execution time reduction
proposed (2/1, O0)        8.60%   9.81%   10.38%  10.61%  10.76%  10.83%
proposed (2/1, O3)        7.57%   8.49%   8.94%   9.29%   9.54%   9.64%
genetic (2/1, O0)         6.23%   6.90%   7.25%   7.63%   7.87%   8.01%
genetic (2/1, O3)         6.42%   6.83%   7.23%   7.57%   7.79%   8.05%

Register file read/write ports 4/2

Number of ISEs selected
proposed (4/2, O0)        8       18      23      34      46      56
proposed (4/2, O3)        6       14      20      26      34      45
genetic (4/2, O0)         4       8       14      21      28      36
genetic (4/2, O3)         4       8       14      19      22      26

Execution time reduction
proposed (4/2, O0)        13.61%  17.26%  18.19%  19.31%  20.13%  20.64%
proposed (4/2, O3)        14.98%  19.04%  20.46%  21.30%  22.09%  22.84%
genetic (4/2, O0)         10.76%  13.17%  15.37%  16.86%  17.70%  18.40%
genetic (4/2, O3)         13.30%  15.99%  18.02%  18.99%  19.43%  19.93%

Register file read/write ports 6/3

Number of ISEs selected
proposed (6/3, O0)        5       12      19      25      31      39
proposed (6/3, O3)        6       9       14      19      25      32
genetic (6/3, O0)         4       7       12      17      22      28
genetic (6/3, O3)         4       7       10      15      19      23

Execution time reduction
proposed (6/3, O0)        14.95%  19.25%  20.97%  21.77%  22.33%  22.87%
proposed (6/3, O3)        18.76%  20.92%  22.72%  23.77%  24.61%  25.32%
genetic (6/3, O0)         13.83%  16.12%  18.47%  19.83%  20.66%  21.44%
genetic (6/3, O3)         16.91%  19.76%  21.35%  22.92%  23.81%  24.50%
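The diminishing return described for Fig. 10 can be read directly from Table 2. As an illustration, the following Python sketch takes the two (4/2, O3) rows and computes the gain from each additional 5% of area as well as the advantage of the proposed algorithm over the genetic one; only the table values come from the source, while the variable names are illustrative.

```python
# Execution time reduction (%) for (4/2, O3), copied from Table 2.
area_constraint = [5, 10, 15, 20, 25, 30]   # % of original core area
proposed = [14.98, 19.04, 20.46, 21.30, 22.09, 22.84]
genetic = [13.30, 15.99, 18.02, 18.99, 19.43, 19.93]

# Gain obtained from each additional 5% of silicon area.
marginal = [round(b - a, 2) for a, b in zip(proposed, proposed[1:])]

# Advantage of the proposed algorithm at each area constraint.
advantage = [round(p - g, 2) for p, g in zip(proposed, genetic)]

for step, gain in zip(area_constraint[1:], marginal):
    print(f"{step - 5:>2}% -> {step:>2}% area: +{gain:.2f} points")
print("proposed minus genetic:", advantage)
```

The first 5% of area accounts for 14.98 points of reduction, whereas the step from 25% to 30% adds only 0.75 points, which is why the benefit is not proportional to the area budget.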

4.3 Optimal Solution

To assess the quality of the ISE candidates explored by the proposed ISE exploration algorithm, its results are compared with the optimal solution, as shown in Table 3. Table 4 compares the processing times. The legal pattern number is the number of ISE candidates that obey the constraints (input/output constraint, convexity, no load/store operation). The processing time of the optimal solution strongly depends on the DFG size or the legal pattern number. The results of the proposed algorithm are very close to the optimal result whenever the optimal solution can be obtained. Moreover, the proposed algorithm significantly decreases the computing time. Additionally, relaxing the input/output constraint normally increases the legal pattern number significantly. In this case, the optimal solution is difficult to obtain, but the proposed algorithm still behaves well.

Table 3: Comparison of optimal solution and ISE exploration algorithm (result)

DFG    Optimal Solution    Proposed Algorithm

P.S. *: means the solution can’t be obtained in practical time.

Table 4: Comparison of optimal solution and ISE exploration algorithm (processing time)

DFG    Read/write ports    Optimal Solution    Proposed Algorithm
28     4/2                 2m15.33s            0.777s
13     2/1                 4.73s               1.786s
46     4/2                 --                  2.102s
108    6/3                 --                  3.067s

P.S. *: means the solution cannot be obtained in practical time.
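For the two rows of Table 4 in which the optimal solution was obtained, the speedup of the proposed algorithm can be computed as follows. This is a small illustrative sketch; the helper `to_seconds` is hypothetical and only handles the time format used in the table.

```python
import re


def to_seconds(text: str) -> float:
    """Convert a time such as '2m15.33s' or '0.777s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


# (optimal solution, proposed algorithm) times from Table 4.
rows = [("2m15.33s", "0.777s"), ("4.73s", "1.786s")]
speedups = [to_seconds(opt) / to_seconds(prop) for opt, prop in rows]

for (opt, prop), speedup in zip(rows, speedups):
    print(f"{opt} vs {prop}: {speedup:.1f}x faster")
```

The speedup is about 174x for the first row and about 2.6x for the second, so the advantage of the heuristic varies widely between these two cases.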
