6.3.3 The Difference between Proposed Methods and Other Methods

In the following, we describe the differences among RSOR, RSER, LPLS [4], power-conscious loop folding [24], and the method proposed in [28], all of which are based on the operand sharing technique. LPLS never applies the retiming technique; it only uses a modified list scheduling to take operand sharing into account. RSOR focuses on achieving the potential OPRs within an iteration and uses the retiming technique once to compact the schedule. The other three methods, RSER, power-conscious loop folding, and the method in [28], all use the retiming technique to generate instructions with common operands that are hidden inside the MDFG. Power-conscious loop folding is a basic method: after finding instructions that share an operand in different iterations, it uses the retiming technique to move them into the same iteration. The method in [28] contains a force-directed retiming mechanism that determines which instructions must be retimed, and aims to make as many instructions as possible take common operands as their inputs. These two methods thus apply the retiming technique only to achieve more OPRs. In RSER, on the other hand, the retiming technique is applied more than once and for different purposes. First, after the exploitable sharing sets are determined, it is used to gather instructions that share common operands. Note that a specific retiming base must be chosen before retiming; to remove more non-zero delay edges during MDFG reconstruction, we may therefore retime the MDFG several times with different retiming bases. Then, to compact the initial scheduling result, the retiming technique is used once more to partially overlap the execution of successive iterations. From the above description, we expect RSER to produce shorter schedules than the methods in [24, 28].

Figure 6.8. The corresponding TDAG of (a) Figure 6.6(a), (b) Figure 6.6(d).

Figure 6.9. Scheduling results of Figure 6.6(a). (a) RSOR, (b) RSOE.

However, applying the retiming technique generates corresponding prologue and epilogue codes that must be executed separately before and after the repetitive pattern. If the prologue and epilogue are too large, they increase both the overall execution time and the power consumption of the given loop. We have proven that the overall schedule length depends strongly on which schedule vector and which retiming base are selected [41]. Therefore, to avoid generating too much prologue and epilogue code, we allow only two retiming bases, (0, 1) and (1, 0), to be selected in the MDFG reconstruction algorithm. Under this restriction, RSER applies the retiming technique at most three times: at most twice during MDFG reconstruction and once to compact the schedule. Detailed evaluations of RSOR, RSER, and other energy-efficient instruction scheduling methods are given in section 6.4.

Finally, in section 5.4 we showed that, with minor modifications, our hypothetical machine model and RSSA can be applied to real DSP families with various architectural features. Since RSOR and RSER use the same mechanisms as RSSA to schedule instructions and insert spill codes, both RSOR and RSER are also suitable for real DSP families.

6.4 Performance Evaluations [58]

In this section, we evaluate RSOR and RSER using selected MDFGs and the hypothetical machine model. For comparison, LPLS [4] and Kim et al. [28] are also evaluated; both use the variable partition mechanism presented in RSVR [30], and the necessary spill codes are inserted into their schedules. As in subsection 6.2.2, the evaluation metrics are the schedule length, the instruction count, the number of OPRs, and the approximate current, and we show only the best result derived by RSOR and RSER.

Table 6.4 lists the number of OPRs for a single iteration in the repetitive pattern. Note that not all selected MDFGs contain exploitable sharing sets, so we only apply RSER to MDFGs that have potential operand sharing in different iterations.

                        1 FU, 2 acc, 4 reg, 2 mem     2 FU, 4 acc, 4 reg, 2 mem
MDFG                    LPLS   RSOR   Kim    RSER     LPLS   RSOR   Kim    RSER
Wave Digital Filter     0      0.5    1      1        0      0.5    0      1
Filter                  0      0      0      -        0      0      0      -
IIR2D                   0      0      4      4        0      0      4      4
forward-substitution    1      1.5    2      2        2      2      1      2
THCS                    1      1      1      -        2      2      0      -
DFT                     3      4      3      7        3      4      3      7
Floyd-Steinberg         9      9      9      -        9      9      6      -
Transmission            4      4      4      -        5      5      4      -
IIR1D                   2      3      4      4        2      3      3      4
Equation Solver         5      5      5      4        5      6      4      5
All-pole Lattice        3      3      2      -        6      6      2      -
Elliptic Filter         9      9      9      -        11     11     11     -

Table 6.4. The number of OPRs obtained by different scheduling methods. A dash marks an MDFG without exploitable sharing sets, to which RSER is not applied.

From this table, if the given MDFG has exploitable sharing sets, RSER and Kim et al. [28] can produce schedules with more OPRs than LPLS and RSOR. That is, for a single iteration in the repetitive pattern, the schedules generated by RSER and Kim et al. [28] consume less power at the function units. In addition, for an MDFG without exploitable sharing sets, RSOR still generates a number of OPRs similar to that of LPLS and Kim et al. [28]. This result shows that all three methods can successfully exploit potential operand sharing within an iteration. Comparing the two architectures, Table 6.4 shows that all methods except Kim et al. [28] perform at least as well when the target architecture has more function units. This indicates that, whether or not an MDFG is reconstructed, using more function units is beneficial for achieving more OPRs. Thus, we conclude that when the number of OPRs is taken as the evaluation metric, RSOR and RSER are at least as effective as the previous methods. Furthermore, if the given loop contains potential operand sharing in different iterations, applying the retiming technique to exploit it benefits energy-efficient instruction scheduling.
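The observations drawn from Table 6.4 can be checked mechanically. The following is a minimal sketch, assuming the table values are transcribed as shown and that a missing RSER entry means RSER was not applied; the function names are hypothetical and introduced only for illustration.

```python
# OPR counts transcribed from Table 6.4.
# Tuple order: (LPLS, RSOR, Kim, RSER); None means RSER was not applied.
METHODS = ("LPLS", "RSOR", "Kim", "RSER")

TABLE_6_4 = {
    "Wave Digital Filter":  {"1FU": (0, 0.5, 1, 1),    "2FU": (0, 0.5, 0, 1)},
    "Filter":               {"1FU": (0, 0, 0, None),   "2FU": (0, 0, 0, None)},
    "IIR2D":                {"1FU": (0, 0, 4, 4),      "2FU": (0, 0, 4, 4)},
    "forward-substitution": {"1FU": (1, 1.5, 2, 2),    "2FU": (2, 2, 1, 2)},
    "THCS":                 {"1FU": (1, 1, 1, None),   "2FU": (2, 2, 0, None)},
    "DFT":                  {"1FU": (3, 4, 3, 7),      "2FU": (3, 4, 3, 7)},
    "Floyd-Steinberg":      {"1FU": (9, 9, 9, None),   "2FU": (9, 9, 6, None)},
    "Transmission":         {"1FU": (4, 4, 4, None),   "2FU": (5, 5, 4, None)},
    "IIR1D":                {"1FU": (2, 3, 4, 4),      "2FU": (2, 3, 3, 4)},
    "Equation Solver":      {"1FU": (5, 5, 5, 4),      "2FU": (5, 6, 4, 5)},
    "All-pole Lattice":     {"1FU": (3, 3, 2, None),   "2FU": (6, 6, 2, None)},
    "Elliptic Filter":      {"1FU": (9, 9, 9, None),   "2FU": (11, 11, 11, None)},
}


def never_worse_with_two_fus(table):
    """Return the methods whose OPR count never drops when going from 1 FU to 2 FUs."""
    ok = set(METHODS)
    for row in table.values():
        for i, method in enumerate(METHODS):
            one_fu, two_fu = row["1FU"][i], row["2FU"][i]
            if one_fu is not None and two_fu is not None and two_fu < one_fu:
                ok.discard(method)
    return ok


def rser_vs_intra_iteration_methods(table):
    """Count the cases where RSER reaches at least as many OPRs as both LPLS and RSOR.

    Returns (cases where RSER >= max(LPLS, RSOR), cases where RSER was applied).
    """
    hits = total = 0
    for row in table.values():
        for arch in ("1FU", "2FU"):
            lpls, rsor, _kim, rser = row[arch]
            if rser is None:
                continue
            total += 1
            hits += rser >= max(lpls, rsor)
    return hits, total


if __name__ == "__main__":
    print(sorted(never_worse_with_two_fus(TABLE_6_4)))
    print(rser_vs_intra_iteration_methods(TABLE_6_4))
```

With the transcribed values, only Kim et al. [28] loses OPRs on some MDFG when a second function unit is added, and RSER matches or exceeds both LPLS and RSOR in all applicable cases except the Equation Solver.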

Table 6.5 lists the schedule length, instruction count, and approximate current of a single iteration on the Motorola DSP56000 architecture. From these results, we find that RSOR and RSER achieve shorter schedules than LPLS and Kim et al. [28] in most cases, because both of our methods apply the retiming technique to explore the instruction-level parallelism between successive iterations effectively. Which of RSOR and RSER is more effective, however, is not fixed; it depends on the topological difference between the MDFGs before and after reconstruction. Hence, we conclude that RSOR and RSER are more effective than the previous methods when the schedule length is the evaluation metric.

On the other hand, in most cases LPLS and Kim et al. [28] generate the schedules with the fewest and the most instructions, respectively. If an MDFG contains exploitable sharing sets, RSER requires a greater instruction count than RSOR but still a smaller one than Kim et al. [28]. Note that the number of ALU instructions for an MDFG is fixed regardless of which scheduling method is applied. That is, a schedule with a greater instruction count contains more inserted spill codes, which are usually extra memory accesses. According to the instruction-level power model presented in [45], executing any instruction incurs a base cost, so a schedule with a smaller instruction count benefits both code size and power consumption.

As for the approximate current, in most cases RSOR and RSER outperform LPLS and Kim et al. [28]. The main reason is that our methods obtain shorter schedules. Between RSOR and RSER, RSER is usually better, even though the number of memory accesses may increase after MDFG reconstruction. This is because RSER further lowers the power consumed at the function units while increasing the schedule length only slightly.

        LPLS                      RSOR                      Kim                       RSER
MDFG    length  count  current    length  count  current    length  count  current    length  count  current
[1]     6       13     820        5       13     770        6       13     830        4       11.5   640
[2]     8       11     920        5       10.5   655        6       10     760        -       -      -
[3]     20      37     2690       16      37     2435       18      42     2604       17.5    39     2428
[4]     7       10     802        5       10.5   717        7       15     922        5       12.5   717
[5]     6       10     712        4       9.5    587        6       10     712        -       -      -
[6]     14      30     1866       12.5    32     1756       15      34     1986       12.5    31.5   1691
[7]     20      39     1630       17.5    39     2440       19      39     1520       -       -      -
[8]     14      29     1940       12      28     1770       14      29     1930       -       -      -
[9]     10      18     1222       8       19     1111       10      23     1298       8.5     21     1118
[10]    14      24     1776       12      24.5   1651       13      30     1816       11.5    27.5   1646
[11]    21      35     2640       17      34.5   2300       17      39     2450       -       -      -
[12]    40      77     5162       35      75     4827       36      73     4782       -       -      -

count: instruction count. A dash marks an MDFG to which RSER is not applied.

[1] Wave Digital Filter                       [7] Floyd-Steinberg
[2] Filter                                    [8] Transmission Line
[3] Infinite Impulse Response Filter 2D       [9] Infinite Impulse Response Filter 1D
[4] forward-substitution                      [10] Differential Equation Solver
[5] Toeplitz Hyperbolic Cholesky Solver       [11] All-pole Lattice Filter
[6] Discrete Fourier Transform                [12] Elliptic Filter

Table 6.5. The comparison among four methods (under Motorola DSP56000).
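The pairwise comparisons discussed for Table 6.5 can be summarized with a short script. This is a sketch under stated assumptions: the values are transcribed from Table 6.5, rows with only three groups of values are taken to lack the RSER entry, and the helper names are hypothetical.

```python
# Values transcribed from Table 6.5 as (length, instruction count, current).
# None means the method was not applied to that MDFG.
METRICS = ("length", "instr_count", "current")

TABLE_6_5 = {
    "[1]":  {"LPLS": (6, 13, 820),   "RSOR": (5, 13, 770),     "Kim": (6, 13, 830),   "RSER": (4, 11.5, 640)},
    "[2]":  {"LPLS": (8, 11, 920),   "RSOR": (5, 10.5, 655),   "Kim": (6, 10, 760),   "RSER": None},
    "[3]":  {"LPLS": (20, 37, 2690), "RSOR": (16, 37, 2435),   "Kim": (18, 42, 2604), "RSER": (17.5, 39, 2428)},
    "[4]":  {"LPLS": (7, 10, 802),   "RSOR": (5, 10.5, 717),   "Kim": (7, 15, 922),   "RSER": (5, 12.5, 717)},
    "[5]":  {"LPLS": (6, 10, 712),   "RSOR": (4, 9.5, 587),    "Kim": (6, 10, 712),   "RSER": None},
    "[6]":  {"LPLS": (14, 30, 1866), "RSOR": (12.5, 32, 1756), "Kim": (15, 34, 1986), "RSER": (12.5, 31.5, 1691)},
    "[7]":  {"LPLS": (20, 39, 1630), "RSOR": (17.5, 39, 2440), "Kim": (19, 39, 1520), "RSER": None},
    "[8]":  {"LPLS": (14, 29, 1940), "RSOR": (12, 28, 1770),   "Kim": (14, 29, 1930), "RSER": None},
    "[9]":  {"LPLS": (10, 18, 1222), "RSOR": (8, 19, 1111),    "Kim": (10, 23, 1298), "RSER": (8.5, 21, 1118)},
    "[10]": {"LPLS": (14, 24, 1776), "RSOR": (12, 24.5, 1651), "Kim": (13, 30, 1816), "RSER": (11.5, 27.5, 1646)},
    "[11]": {"LPLS": (21, 35, 2640), "RSOR": (17, 34.5, 2300), "Kim": (17, 39, 2450), "RSER": None},
    "[12]": {"LPLS": (40, 77, 5162), "RSOR": (35, 75, 4827),   "Kim": (36, 73, 4782), "RSER": None},
}


def count_strictly_lower(metric, ours, baseline, table=TABLE_6_5):
    """Return (benchmarks where `ours` is strictly lower, benchmarks compared)."""
    idx = METRICS.index(metric)
    wins = compared = 0
    for row in table.values():
        if row[ours] is None or row[baseline] is None:
            continue  # skip MDFGs where one of the methods was not applied
        compared += 1
        wins += row[ours][idx] < row[baseline][idx]
    return wins, compared


def relative_reduction(benchmark, metric, ours, baseline, table=TABLE_6_5):
    """Fractional reduction of `metric` achieved by `ours` relative to `baseline`."""
    idx = METRICS.index(metric)
    base = table[benchmark][baseline][idx]
    return (base - table[benchmark][ours][idx]) / base


if __name__ == "__main__":
    for baseline in ("LPLS", "Kim"):
        print("RSOR shorter than", baseline, count_strictly_lower("length", "RSOR", baseline))
    print("RSER lower current than RSOR", count_strictly_lower("current", "RSER", "RSOR"))
```

Counting strict improvements rather than averaging keeps the summary close to the "in most cases" wording of the text and makes the exceptions in the transcribed values easy to locate.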

In the following, we focus on the entire retimed loop and compare the overall schedule length. In chapter 3 we introduced an analytic model to calculate the overall schedule length of a retimed MDFG. Formulas (A.1) to (A.5) can be used directly to evaluate RSOR and Kim et al. [28], and we extend the model further to handle RSER. Table 6.6 lists the variables used in the extended analytic model. Recall that only two retiming bases, (0, 1) and (1, 0), can be selected in the MDFG reconstruction algorithm. That is, the original MDFG is retimed at most twice during reconstruction, with the two retiming bases used in either order. In the extended analytic model, we directly assume that every MDFG is retimed twice and design the corresponding formulas to calculate the overall schedule length. If the given MDFG is only retimed once, the variables exp2, exe2, and exd2 can simply be set to zero. Detailed derivations of the new formulas are listed in appendix B.

Variable      Definition
N             Number of memory modules
m             Loop bound of the outer loop for a two-dimensional nested loop;
              loop bound for a one-dimensional loop
n             Loop bound of the inner loop for a two-dimensional nested loop
(s1, s2)      Schedule vector selected for retiming during instruction scheduling
list          Schedule length of an iteration in the repetitive pattern produced by the list scheduling method
length        Schedule length of an iteration in the repetitive pattern
prologue      Schedule length of the prologue generated during instruction scheduling
epilogue      Schedule length of the epilogue generated during instruction scheduling
d             Retiming depth obtained during instruction scheduling
half(k, N)    Schedule length of k original iterations under N memory modules
exp1          Schedule length of the prologue generated during MDFG reconstruction after the first retiming
exe1          Schedule length of the epilogue generated during MDFG reconstruction after the first retiming
exd1          Retiming depth obtained during MDFG reconstruction after the first retiming
exp2          Schedule length of the prologue generated during MDFG reconstruction after the second retiming
exe2          Schedule length of the epilogue generated during MDFG reconstruction after the second retiming
exd2          Retiming depth obtained during MDFG reconstruction after the second retiming

Table 6.6. Definitions of variables used in the analytic model.
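To illustrate why prologue and epilogue lengths matter for the overall schedule length, the sketch below uses a deliberately simplified cost model. It is not formulas (A.1) to (A.5) nor the extended formulas of appendix B: it ignores the schedule vector, the retiming depths, the memory modules, and half(k, N), and simply adds all prologue and epilogue lengths to the time spent in the repetitive pattern. The class, the function, and every number in the usage part are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetimedLoop:
    """Subset of the Table 6.6 variables used by this simplified sketch."""
    length: float          # schedule length of one iteration in the repetitive pattern
    prologue: float = 0.0  # prologue generated during instruction scheduling
    epilogue: float = 0.0  # epilogue generated during instruction scheduling
    exp1: float = 0.0      # prologue from the first reconstruction retiming
    exe1: float = 0.0      # epilogue from the first reconstruction retiming
    exp2: float = 0.0      # prologue from the second reconstruction retiming (0 if retimed once)
    exe2: float = 0.0      # epilogue from the second reconstruction retiming (0 if retimed once)


def simplified_overall_length(loop: RetimedLoop, pattern_iterations: int) -> float:
    """Assumed model: fixed prologue/epilogue overhead plus the repetitive pattern."""
    overhead = (loop.prologue + loop.epilogue
                + loop.exp1 + loop.exe1
                + loop.exp2 + loop.exe2)
    return overhead + pattern_iterations * loop.length


if __name__ == "__main__":
    # Hypothetical numbers: a loop retimed once during reconstruction (exp2 = exe2 = 0)
    # versus a schedule that is not retimed at all.
    retimed = RetimedLoop(length=5, prologue=3, epilogue=3, exp1=4, exe1=4)
    plain = RetimedLoop(length=6)
    for iterations in (10, 100):
        print(iterations,
              simplified_overall_length(retimed, iterations),
              simplified_overall_length(plain, iterations))
```

Under these hypothetical numbers the fixed overhead dominates for a small loop and is amortized for a larger one, which is the trade-off that motivates restricting the number of retiming bases.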

Figures 6.10 and 6.11 show the overall schedule lengths of the entire retimed loop when the target architecture has one and two function units, respectively. From these figures, for most applications RSOR obtains shorter overall schedule lengths than LPLS and Kim et al. [28]. If the given MDFG contains exploitable sharing sets, RSER may not produce a shorter overall schedule length than RSOR, but it still outperforms LPLS and Kim et al. [28]. These results are consistent with the evaluations based on a single iteration in the repetitive pattern. That is, although the two proposed methods, especially RSER, require more time to run the prologue and epilogue, their overall performance is still better because they effectively explore the instruction-level parallelism between successive iterations.

Finally, we summarize the above evaluations. The overall schedule lengths obtained by RSOR and RSER are shorter than those of the previous methods, although RSER may require more time to run the corresponding prologue and epilogue codes. If the number of OPRs is the evaluation metric, RSOR and RSER are at least as effective as LPLS and Kim et al. [28]. Recall that the average power consumption of a function unit is clearly lower when an operand remains unchanged, and that the total power consumption of a schedule equals the sum of the power consumed by all of its instructions. Since RSOR and RSER perform better on both metrics, schedule length and number of OPRs, we conclude that they are energy-efficient code generation methods. As for the instruction count, our proposed methods remain very effective for the repetitive pattern because fewer spill codes are inserted. However, their prologue and epilogue codes have to be stored in addition to the repetitive pattern, so RSER requires much more memory space to store the scheduling results than the other related methods.

[Plots omitted. x-axis: loop size (10x10 to 50x50 for two-dimensional loops, 100 to 1000 for one-dimensional loops); y-axis: clock cycle (x 1000). Surviving legend entries: allpole_LPLS, allpole_Kim, allpole_RSOR, elliptic_LPLS, elliptic_Kim, elliptic_RSOR.]

Figure 6.10. Experimental results of DSP applications (1 function unit, overall schedule length).

[Plots omitted. x-axis: loop size (10x10 to 50x50 for two-dimensional loops, 100 to 1000 for one-dimensional loops); y-axis: clock cycle (x 1000). Surviving legend entries: allpole_LPLS, allpole_Kim, allpole_RSOR, elliptic_LPLS, elliptic_Kim, elliptic_RSOR.]

Figure 6.11. Experimental results of DSP applications (2 function units, overall schedule length).