
3 Variable Partition Mechanisms

3.4 Performance Evaluations

3.4.3 Comparisons among RSF, RST, and RSP

Having evaluated RSF, RST, and RSP individually, we now compare their effectiveness.

Which method is most effective depends on the topology and the loop-carried dependencies of the nested loop. From formulas (A.6) and (A.7) listed in appendix A, we find that after applying RSP, the prologue, epilogue, and half terms occupy a considerable portion of the overall schedule length. Recall that the prologue, epilogue, and half are part of the overhead. That is, although Figures 3.14 and 3.15 show that RSP performs as well as the other methods, it incurs more overhead in both execution time and instruction count than RSF and RST, especially when the architecture contains more than two data memory banks. Therefore, we suggest using RSP only for DSP architectures with two data memory banks.

Table 3.3. Variables defined in the analytic model.

Variable      Definition
N             Number of memory banks
m             Loop bound of the outer loop for a two-dimensional nested loop; loop bound for a one-dimensional loop
n             Loop bound of the inner loop for a two-dimensional nested loop
prologue      Schedule length of the prologue part of a retimed loop
epilogue      Schedule length of the epilogue part of a retimed loop
length        Schedule length of a single iteration in the repetitive pattern of a retimed loop
list          Schedule length of a single iteration produced by list scheduling
d             Retiming depth, the number of iterations that must be moved into the prologue and epilogue
w             Skew factor used to parallelize the inner loop
half (k, N)   Schedule length of k original iterations under N memory banks

Figure 3.14. Overall schedule lengths of DSP applications (1 function unit, 2 memory banks).
[Plot not reproduced. Panels: WDF, Filter, IIR2D, forward. Horizontal axis: loop size; vertical axis: execution cycles (x1000); series: List, RSVR, RSF, RST, RSP.]

Figure 3.15. Overall schedule lengths of DSP applications (1 function unit, 2 memory banks).
[Plot not reproduced. Panels: DFT, Floyd, xmission, THCS. Horizontal axis: loop size; vertical axis: execution cycles (x1000); series: List, RSVR, RSF, RST, RSP.]

We now summarize some principles for choosing among RSF, RST, and RSP. If the nested loop contains dependencies with distance (1, 0, …, 0) but none with distance (0, …, 0, 1), RSF should obtain better results, because the original iterations in a single unfolded iteration are data independent. Conversely, RST should be better if the nested loop contains dependencies with distance (0, …, 0, 1) but none with distance (1, 0, …, 0), and the nested loop can be tiled directly. These two conclusions follow directly from the principles of the respective variable partition mechanisms. If the nested loop contains dependencies with both distances (0, …, 0, 1) and (1, 0, …, 0), then under neither RSF nor RST does an enlarged iteration consist of data-independent original iterations. In this case RSP is suitable when the architecture contains two data memory banks. For architectures with more than two data memory banks, we suggest using list scheduling to schedule an iteration of RSF and of RST separately and choosing the shorter one.
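The selection principles above can be read as a small decision rule. The following Python sketch is only an illustration of that reading: the function names, the `tileable` flag, and the returned strings are hypothetical, and the rule looks only at whether the two unit distances are present, ignoring any other dependence distances. It assumes nested loops of depth two or more.

```python
from typing import Iterable, Tuple

Distance = Tuple[int, ...]


def is_outermost_unit(d: Distance) -> bool:
    """True for a distance of the form (1, 0, ..., 0)."""
    return d[0] == 1 and all(x == 0 for x in d[1:])


def is_innermost_unit(d: Distance) -> bool:
    """True for a distance of the form (0, ..., 0, 1)."""
    return d[-1] == 1 and all(x == 0 for x in d[:-1])


def suggest_mechanism(distances: Iterable[Distance], banks: int,
                      tileable: bool = True) -> str:
    """Suggest a variable partition mechanism (hypothetical helper)."""
    ds = list(distances)
    has_outer = any(is_outermost_unit(d) for d in ds)
    has_inner = any(is_innermost_unit(d) for d in ds)
    if has_outer and not has_inner:
        return "RSF"
    if has_inner and not has_outer and tileable:
        return "RST"
    if has_outer and has_inner:
        if banks == 2:
            return "RSP"
        # More than two banks: compare RSF and RST under list scheduling.
        return "list-schedule RSF and RST, keep the shorter"
    # Cases the stated principles do not address.
    return "not covered by the stated principles"
```

For example, `suggest_mechanism([(1, 0), (0, 1)], banks=2)` suggests RSP, while the same dependencies with four banks fall back to comparing RSF and RST.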

In parallel processing systems, column-major access is a common obstacle to parallelism. If the target DSP architecture contains multiple data memory banks and more than one function unit, a similar problem also arises. In designing RSF, RST, and RSP, we simply assume that the given nested loop is executed in row-major order. After the given loop is enlarged, the variable partition mechanism is selected based on the distances of the loop-carried data dependencies.

As described in the preceding paragraph, our goal is to make the original iterations within an enlarged iteration data independent. That is, elements of the same array variable that are accessed in one enlarged iteration are placed in different memory banks as far as possible. This mechanism works well when there is only one DSP core with one or more function units. However, we do not consider the memory access sequence across different enlarged iterations. Therefore, if the target architecture has two or more DSP cores, the column-major problem may still occur with our proposed methods.

Chapter 4. Effective Code Generation Method for Motorola DSP56000

In this chapter and the next, we address our second issue: code generation methods for DSPs with multiple data memory banks and heterogeneous register sets.

As mentioned in section 2.4.3, a complete code generation process for a DSP with multiple data memory banks must include five phases: intermediate representation, code compaction, instruction scheduling, memory bank assignment (or variable partition), and register/accumulator assignment [17]. Our three methods RSF, RST, and RSP presented above use data memory directly to store temporary variables, so they cover all phases except accumulator/register assignment. In this chapter, we introduce a new method that targets the Motorola DSP56000, takes accumulator/register assignment into account, and further improves overall execution performance.

Section 4.1 gives a brief overview of the Motorola DSP56000. Section 4.2 lists our design motivations, and section 4.3 describes the detailed steps of the proposed method. Finally, section 4.4 presents some performance evaluations.

4.1 Motorola DSP56000 Architecture [10]

In our studies we target DSP architectures that consist of multiple data memory banks and heterogeneous register sets. Each data memory bank has its own address bus, data bus, and address calculation unit. The Motorola DSP56000/DSP56001 and the DSP56300 family members are examples of this architecture, and they are commonly used both in practice and in previous research. The DSP56300 family has many members, with various memory sizes and peripheral interfaces. However, the data ALU circuits of all DSP56300 family members are the same, and are almost identical to those of the DSP56000/DSP56001, as shown in Figure 4.1. The main difference is that all ALU instructions complete in one clock cycle on the DSP56000/DSP56001, whereas they take two clock cycles in a pipelined fashion on the DSP56300 family. Therefore, in the following we briefly introduce the Motorola DSP56000 architecture and design our code generation method based on it.

As shown in Figure 4.2, the DSP56000 architectural units of interest are the data ALU, Address Generation Unit (AGU), and X/Y memory banks. The data ALU consists of four input registers called X0, X1, Y0, and Y1, and two accumulators, A and B. The source operands for all ALU instructions, except multiplication, must be registers or accumulators and the destination operand must always be an accumulator.

Source operands of multiplication must always be input registers. Two buses XDB and YDB permit two input registers or accumulators to be read or written in conjunction during execution of an ALU instruction. Thus, up to two move operations (including memory access, register transfer, and immediate load) and one data ALU instruction may be executed simultaneously in one cycle.

Figure 4.1. Data ALU block diagram. (a) DSP56000/DSP56001, (b) DSP56300 family.

Two independent move operations executed in the same cycle are called parallel moves. However, due to the nature of the DSP56000 architecture, not all pairs of move operations can be performed in parallel. Detailed parallel move conditions can be found in [10]. In our studies we especially consider the following conditions: (1) the two move operations reference data in different data memory banks; (2) the two destination registers are different; (3) the X/Y memory access loads into restricted locations X0/Y0, X1/Y1, A, or B.
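The three conditions considered above can be sketched as a simple pairing test. This is a minimal illustration, not the full rule set of [10]. The `Load` class and `can_pair` function are hypothetical names, and condition (3) is read here as: an X-bank load may target X0, X1, A, or B, and a Y-bank load may target Y0, Y1, A, or B.

```python
from dataclasses import dataclass

# Assumed reading of condition (3): legal destinations per memory bank.
ALLOWED_DEST = {
    "X": {"X0", "X1", "A", "B"},
    "Y": {"Y0", "Y1", "A", "B"},
}


@dataclass(frozen=True)
class Load:
    bank: str  # "X" or "Y"
    dest: str  # destination register or accumulator


def can_pair(a: Load, b: Load) -> bool:
    """Check the three parallel-move conditions for two memory loads."""
    different_banks = a.bank != b.bank                     # condition (1)
    different_dests = a.dest != b.dest                     # condition (2)
    legal_dests = (a.dest in ALLOWED_DEST[a.bank]
                   and b.dest in ALLOWED_DEST[b.bank])     # condition (3)
    return different_banks and different_dests and legal_dests
```

Under these assumptions, a load from X into X0 can pair with a load from Y into Y0, but two loads from the X bank cannot pair, and neither can two loads that both target accumulator A.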

4.2 Design Motivations [37]

In section 2.4.3 we surveyed several code generation methods for DSPs with multiple data memory banks. Among them, the methods proposed in [9, 17] both focus on the Motorola DSP56000 and consider all five phases of the code generation process. In the following we summarize them and introduce our design motivations. First, these two methods perform variable partition after code compaction, which means that memory accesses are scheduled without any information about memory bank assignment. However, in the DSP56000 architecture, the memory accesses involved in a parallel move must reference variables in different banks [10]. That is, two memory accesses may be assumed to execute in parallel when in fact the variables they reference are stored in the same data memory bank. In this situation, an extra cycle (spill code) is required to access them separately. If spill codes occur frequently, computational performance clearly degrades. On the other hand, if variables are partitioned before code compaction, this kind of spill code does not occur. In our design we use the latter mechanism to avoid such spill codes.

Figure 4.2. Motorola DSP56000 architecture.
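The cost of partitioning after compaction can be illustrated with a toy count. In the sketch below, `access_cycles` and its arguments are hypothetical. It assumes each pair of memory accesses was compacted as one parallel move, and charges one extra cycle whenever both variables turn out to reside in the same bank.

```python
def access_cycles(pairs, bank_of):
    """Count cycles for memory-access pairs that compaction assumed parallel.

    pairs   -- iterable of (variable, variable) tuples scheduled as parallel moves
    bank_of -- mapping from variable name to its memory bank ("X" or "Y")
    """
    cycles = 0
    for u, v in pairs:
        if bank_of[u] != bank_of[v]:
            cycles += 1  # a true parallel move
        else:
            cycles += 2  # same bank: one extra cycle of spill code
    return cycles
```

With `bank_of = {"a": "X", "b": "Y", "c": "X"}`, the pair (a, b) costs one cycle while the pair (a, c) costs two.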

Apart from location conflicts in parallel moves, spill codes may also occur in the accumulator/register assignment phase. The methods proposed in [9, 17] store variables in an unlimited number of symbolic accumulators/registers during code compaction and leave accumulator/register assignment to the end. In a DSP, however, the numbers of accumulators and registers are usually strictly limited. When accumulator and register spills occur, spill codes are required, and their cost may exceed one extra cycle.

Therefore, we design mechanisms that predict the occurrence of register and accumulator spills in advance and generate the corresponding spill codes. These spill codes can then be scheduled in parallel with other instructions, which helps reduce the spill cost.

4.3 Rotation Scheduling with Spill Codes Predicting (RSSP) [37]

In this section we introduce a code generation method for the Motorola DSP56000 named rotation scheduling with spill codes predicting (RSSP). As listed in Figure 4.3, RSSP contains six parts: MDFG construction, TDAG construction, TDAG modification, ALU instruction scheduling, other instruction scheduling, and initial schedule retiming. Each part is described in detail below.

4.3.1 MDFG Construction

In the first part we construct the MDFG from the high-level language using the same mechanism as in RSF, RST, and RSP. During the MDFG construction operands are stored in memory, and reloaded into registers only when they are required for use.

This mechanism appears burdensome but is actually used in some DSP compilers, because the number of registers in a DSP is limited and memory is the only safe repository. In addition to constructing the MDFG, the first part also partitions variables using the four mechanisms proposed in RSVR, RSF, RST, and RSP. Constants are stored in advance at specific locations in both the X and Y memory banks.

4.3.2 TDAG Construction

1. Gc = Construct MDFG;
   1.1. Partition variables to X and Y memory banks;
   1.2. Unfold or tile Gc if necessary;
2. Gt = Construct TDAG;
3. Modify TDAG Gt;
   3.1. Gt = Insert register transfer nodes (Gt);
   3.2. (Gop, Gpr) = Construct DAG Gop and Gpr (Gt);
   3.3. Gop = Mark_Edge (Gop, Eop);
   3.4. Gop = Mark_Edge (Gop, Epr);
   3.5. Gop = Check_Cycle (Gop, Gt);
   3.6. Gt = Insert memory access nodes (Gop, Gt);
4. S = Schedule ALU instructions (Gop);
5. S = Schedule other instructions (S, Gt);
6. S = Retime the initial scheduling result (S);

Figure 4.3. The entire scheduling steps of RSSP.

If all instructions in the MDFG are scheduled as they are, accumulator and register spills clearly will not occur. But scheduling according to this complicated MDFG degrades computational performance, because ALU results could be held temporarily in accumulators or registers instead of being written back to memory directly. Hence, in RSSP we define a translated data acyclic graph (TDAG) constructed from the original MDFG, which is aimed at removing unnecessary memory accesses. The formal definition of the TDAG is given below.

Definition 4.1 A translated data acyclic graph (TDAG) G = (V, E, X, P) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes; E ⊆ V × V is the edge set that defines the precedence relations over the nodes in V; X(e) represents the variable accessed by an edge e; and P(v) represents the type of node v (see Figure 2.1(c)).

1. Input: Gc = (Vc, Ec, Xc, d, Pc);
2. Output: Gt = (Vt, Et, Xt, Pt);
3. Vt = Vc; Et = {e | e ∈ Ec, d(e) = (0,…, 0)};
4. Assume that vi, vj, vk, vl, vm, vn ∈ Vc, and their types are M, A, S, L, M, and A respectively;
   4.1. If (∃ a path vi → vk → vl → vm ∈ Gt) // M → M
        Insert node vx into Vt (set Pt(vx) = T); Insert edge eix into Et;
        ∀ elm ∈ Et delete edges elm from Et, insert edges exm into Et;
        Delete node vl from Vt; Delete edge ekl from Et;
        If (∃ ekl ∈ Ec such that d(ekl) ≠ (0,…, 0)) ; // retain vk, eik
        Else delete node vk from Vt, delete edge eik from Et;
   4.2. If (∃ a path vj → vk → vl → vm ∈ Gt) // A → M
        Insert node vx into Vt (set Pt(vx) = T); Insert edge ejx into Et;
        ∀ elm ∈ Et delete edges elm from Et, insert edges exm into Et;
        Delete node vl from Vt; Delete edge ekl from Et;
        If (∃ ekl ∈ Ec such that d(ekl) ≠ (0,…, 0)) ; // retain vk, ejk
        Else delete node vk from Vt, delete edge ejk from Et;
   4.3. If (∃ a path vi → vk → vl → vn ∈ Gt) // M → A
        ∀ eln ∈ Et delete edges eln from Et, insert edges ein into Et;
        Delete node vl from Vt; Delete edge ekl from Et;
        If (∃ ekl ∈ Ec such that d(ekl) ≠ (0,…, 0)) ; // retain vk, eik
        Else delete node vk from Vt, delete edge eik from Et;
   4.4. If (∃ a path vj → vk → vl → vn ∈ Gt) // A → A
        ∀ eln ∈ Et delete edges eln from Et, insert edges ejn into Et;
        Delete node vl from Vt; Delete edge ekl from Et;
        If (∃ ekl ∈ Ec such that d(ekl) ≠ (0,…, 0)) ; // retain vk, ejk
        Else delete node vk from Vt, delete edge ejk from Et;
5. Xt(e) = Xc(e), if e remains in Et;
6. Pt(v) = Pc(v), if v remains in Vt;
7. Return Gt;

Figure 4.4. The TDAG constructing algorithm.

Figure 4.4 shows the TDAG construction algorithm. For a given MDFG, the first step is to remove edges with non-zero delays. Then, an ALU result that is written back and reloaded within the same iteration can be held temporarily in an accumulator, so the corresponding instructions of types S and L are removed. If an ALU result will be used in a later iteration, its store instruction must be retained. In addition, on the Motorola DSP56000 both source operands of a multiplication must be input registers, so a register transfer (type T) is inserted when an ALU result feeds a multiplication. The resulting TDAG contains only the absolutely necessary memory accesses, which helps reduce the instruction count.

Figure 4.5. (a) Two cases of removing memory accesses, (b) TDAG of MDFG in Figure 2.1(b).
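As a rough illustration of the construction in Figure 4.4, the following Python sketch bypasses store/load pairs within one iteration. It is a simplification under stated assumptions, not the thesis implementation: `build_tdag` and the `t0`, `t1`, … transfer names are hypothetical, each store is assumed to have one M/A producer, loads are assumed to feed only M/A nodes, and at most one edge exists between any two nodes.

```python
def build_tdag(types, edges):
    """Simplified TDAG construction.

    types -- mapping node -> type ('M', 'A', 'S', 'L', ...)
    edges -- mapping (u, v) -> delay tuple, e.g. (0, 0) or (1, 0)
    Returns (nodes, zero_delay_edges) of the translated graph.
    """
    def is_zero(delay):
        return all(x == 0 for x in delay)

    nodes = dict(types)
    et = {e for e, d in edges.items() if is_zero(d)}  # keep zero-delay edges
    counter = 0
    for k in [v for v, t in types.items() if t == "S"]:
        producers = [u for (u, v) in et
                     if v == k and nodes.get(u) in ("M", "A")]
        loads = [v for (u, v) in et if u == k and nodes.get(v) == "L"]
        if not producers or not loads:
            continue
        p = producers[0]
        for l in loads:
            for (_, c) in [e for e in et if e[0] == l]:
                et.discard((l, c))
                if nodes[c] == "M":
                    # Multiplication needs input registers: add a transfer.
                    x = f"t{counter}"
                    counter += 1
                    nodes[x] = "T"
                    et.add((p, x))
                    et.add((x, c))
                else:
                    et.add((p, c))  # addition reads the accumulator directly
            et.discard((k, l))
            del nodes[l]
        # Keep the store only if a later iteration reloads its value.
        used_later = any(u == k and not is_zero(d)
                         for (u, _), d in edges.items())
        if not used_later:
            et.discard((p, k))
            del nodes[k]
    return nodes, et
```

For a chain A → S → L → A within one iteration, the sketch returns the two additions joined by a single edge. If the stored value is also reloaded in a later iteration, the store node and its incoming edge are kept.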

4.3.3 TDAG Modification

One of the main goals of RSSP is to avoid accumulator and register spills by predicting their occurrence in advance. In the third part of RSSP we analyze and modify the TDAG to resolve accumulator spills. Register spills will be dealt with in the fifth part later.

Three main steps are required for this TDAG modification: insertion of register transfers, analysis of the TDAG, and insertion of memory accesses. Recall that we assume an unlimited number of accumulators when constructing the TDAG. Hence, an ALU instruction of type M/A may have many immediate successors of type A in the TDAG. As shown in Figure 4.6(a), the ALU result of vi is a source operand of all additions vj1 to vjm. If the architecture consists of only one data ALU and n accumulators, we add a register transfer vk when m > n. Figure 4.6(b) shows the TDAG after inserting vk, and the corresponding algorithm is listed in Figure 4.7.

Figure 4.6. (a) A TDAG fragment, (b) after inserting the register transfer vk.
[Diagram not reproduced. Legend: P(vi) = M or A; P(vji) = A, for 1 ≤ i ≤ m; P(vk) = T. Successor nodes shown: vj1, vj2, …, vjm.]

1. Input: G = (V, E, X, P), n;
2. Output: Gt = (Vt, Et, Xt, Pt);
3. Gt = G;
4. Suppose that vi ∈ Vt and Pt(vi) = M or A;
5. If (vi has more than n immediate successors v1,…, vm with type A)
   Delete edges ei1,…, eim from Et;
   Insert nodes vx into Vt (set Pt(vx) = T);
   Insert edges ex1,…, exm into Et;
6. Return Gt;

Figure 4.7. The register transfer inserting algorithm.
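The register transfer insertion can be sketched in a few lines of Python. This is an illustrative reading of Figure 4.7 rather than a definitive implementation. The function name and the `t_<node>` naming are hypothetical, and the edge from vi to the new transfer node is an assumption: it is not listed in the algorithm, but it seems to be implied by Figure 4.6(b).

```python
def insert_register_transfers(types, edges, n):
    """Insert a transfer node after any M/A node with more than n
    immediate successors of type A.

    types -- mapping node -> type
    edges -- iterable of (u, v) pairs
    n     -- number of accumulators
    """
    nodes = dict(types)
    et = set(edges)
    for vi, t in types.items():
        if t not in ("M", "A"):
            continue
        succ = sorted(v for (u, v) in et if u == vi and nodes[v] == "A")
        if len(succ) > n:
            vx = f"t_{vi}"
            nodes[vx] = "T"
            for v in succ:
                et.discard((vi, v))  # delete e_i1 .. e_im
                et.add((vx, v))      # insert e_x1 .. e_xm
            et.add((vi, vx))         # assumed edge feeding the transfer
    return nodes, et
```

With two accumulators, a multiplication feeding three additions is rewritten so that the three additions read from one transfer node. With three accumulators, the graph is left unchanged.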

Then, we analyze TDAG topologies to predict the occurrence of accumulator spills. Two intermediate DAGs, Gop and Gpr, defined as follows, are constructed using the algorithm listed in Figure 4.8. Initially we set S(e) = F for all edges in Gop and Gpr to indicate that no accumulator spill will occur. After the algorithms listed in Figures 4.9 and 4.10 are applied, some edges in Gop are set to S(e) = T to represent the occurrence of an accumulator spill. Figure 4.11 shows two Gop fragments with accumulator spills that are detected by the Mark_Edge and Check_Cycle algorithms, respectively. Note that the Mark_Edge and Check_Cycle algorithms are designed based on our analyses of TDAG topologies. That is, they suit only architectures consisting of one data ALU and two accumulators, such as the DSP56000.

1. Input: G = (V, E, X, P);

Figure 4.8. The Gop and Gpr constructing algorithm.

1. Input: G = (V, E, S), Ei;
2. Output: Gr = (Vr, Er, Sr);
3. Gr = G;
4. label(v) = N, ∀ v ∈ V;
5. label(v) = S, ∀ v doesn’t have any immediate predecessor;
6. While (∃ label(v) == N)

Definition 4.2 A DAG Gop = (V, E, S) is a directed graph, where V is the node set representing ALU instructions; E ⊆ V × V is the edge set that defines the precedence relations over the nodes in V; and S(e) is an edge mark that indicates whether the two nodes joined by e must be scheduled at separate time steps.

Definition 4.3 Each DAG Gop corresponds to an undirected graph Gpr = (V, E, S) with the same topology and characteristics.

Finally, for each edge in Gop with S(e) = T, two memory accesses of types S and L are inserted into the TDAG using the algorithm listed in Figure 4.12. After completing steps 3.1 to 3.6 listed in Figure 4.3, we obtain a modified TDAG that can be scheduled without any accumulator spill.

1. Input: G = (V, E, S), Gt = (Vt, Et, Xt, Pt);
2. Output: Gr = (Vr, Er, Sr);
3. Gr = G;
4. Delete edge e from E, such that S(e) = T;
5. ∀ eij ∈ Et such that Pt(vj) = T
   ∀ ejk ∈ Et, insert edge eik into E (set S(eik) = X);
6. Remove edge direction in G;
7. Level each node v ∈ V (level(v) indicates the longest path length from v to any root node; level(v) = 1 if v is a root node)
8. If (∃ a cycle vi → vi+1 →…→ vk → vk+1 →…→ vj → vi in G)
   8.1. Suppose vi has the smallest level(v) value in this path;
   8.2. If ((level(vi) < level(vi+1) in path vi →…→ vk) and (level(vk) < level(vk+1) in path vk →…→ vi))
        Sr(eji) = T;
        else Sr(eij) = T, ∀ level(v) = level(vi) in this path;
9. Return Gr;

Figure 4.10. The Check_Cycle algorithm.

Figure 4.11. Two Gop fragments with accumulator spill.
[Diagram not reproduced. Legend: ∀ i, P(vi) = M or A; node labels recovered: v1, v2, v4, v5, v7; marked edges carry S(e) = T.]

4.3.4 ALU Instruction Scheduling

In the fourth part of RSSP, ALU instructions are scheduled considering the nature of Motorola DSP56000. We first list principles that a correct schedule must satisfy as follows, and propose scheduling rules based on these principles. For convenience, we only permit a variable or constant loaded from memory to be stored in a register.

1. For an edge eij of a TDAG, if P(vi) = L/C/T and P(vj) = M/A, vj must be executed no later than the next two instructions of type L/C/T (in the same memory bank as vi).

2. For an edge eij of a TDAG, if P(vi) = M/A and P(vj) = S, vj must be executed no later than the next two instructions of type M/A.

3. For an edge eij of a TDAG, if P(vi) = M/A and P(vj) = M/A, at most one ALU instruction can be executed between vi and vj.

Basically, ALU instructions are scheduled using list scheduling based on Gop = (V, E, S). Recall that the Motorola DSP56000 consists of one data ALU and two accumulators, and that all instructions complete in one time step. For an edge eij ∈ E, the edge mark S(eij) may be F, T, or X, each of which implies a different rule for scheduling vi and vj. Assume that vi ∈ V is scheduled at time step i, and that the ALU result rti of vi is stored in accumulator acci. If S(eij) = F/X, vj must be scheduled at time step i+1 or i+2 to prevent rti from being overwritten before it is used. Conversely, if S(eij) = T, vj can be scheduled at a time step later than i+2, because rti will be transferred to register regi. In addition, if S(eij) = X and vj is scheduled at time step i+1, an idle time step is inserted between vi and vj so that the register transfer instruction can be scheduled later. Because accumulator spills have already been taken into account, all ALU instructions can be scheduled exactly according to the above three rules. These rules for scheduling ALU instructions are essentially equivalent to the third principle listed above. Figure 4.13(a) shows a TDAG example, and Figure 4.13(b) lists the scheduling result for its ALU instructions only.

1. Input: G = (V, E, S), G1 = (V1, E1, X1, P1);
2. Output: Gt = (Vt, Et, Xt, Pt);
3. Gt = G1;
4. ∀ eij ∈ E such that S(eij) == T
   Delete edge eij from Et;
   Insert nodes vs, vl into Vt (set Pt(vs) = S, Pt(vl) = L);
   Insert edges eis, esl, elj into Et (set Xt(eis) = t, Xt(elj) = t, where t is a temporary variable);
5. Return Gt;

Figure 4.12. The memory access inserting algorithm.
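The memory access inserting algorithm of Figure 4.12 can be illustrated with a short Python sketch. The function name, the generated node names (`s0`, `l0`, …), and the temporary variable names (`tmp0`, …) are hypothetical. Edge marks are assumed to be given as a mapping from Gop edges to 'F', 'T', or 'X'.

```python
def insert_spill_accesses(marks, types, edges):
    """Replace every Gop edge marked 'T' by a store/load pair in the TDAG.

    marks -- mapping (vi, vj) -> 'F' | 'T' | 'X'
    types -- mapping node -> type for the TDAG
    edges -- iterable of (u, v) TDAG edges
    Returns (nodes, edges, var_of), where var_of maps the new edges
    to the temporary variable they access.
    """
    nodes = dict(types)
    et = set(edges)
    var_of = {}
    for idx, ((vi, vj), mark) in enumerate(sorted(marks.items())):
        if mark != "T":
            continue
        vs, vl, tmp = f"s{idx}", f"l{idx}", f"tmp{idx}"
        et.discard((vi, vj))                 # delete e_ij
        nodes[vs], nodes[vl] = "S", "L"      # insert store and load nodes
        et.update({(vi, vs), (vs, vl), (vl, vj)})
        var_of[(vi, vs)] = tmp               # X_t(e_is) = t
        var_of[(vl, vj)] = tmp               # X_t(e_lj) = t
    return nodes, et, var_of
```

An edge marked T between two additions is thus replaced by the path addition → store → load → addition, with both memory accesses referring to the same temporary variable.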

4.3.5 Other Instruction Scheduling

After the ALU instructions are scheduled, the other instructions, including memory accesses and register transfers, are scheduled based on the modified TDAG. We take the limited number of registers into account during this instruction scheduling, so no extra action is required to detect and handle register spills. In RSSP, we use two variables reg_x(t) and reg_y(t) to record the number of registers occupied at time step t for the X and Y memory banks, respectively. These two variables are updated dynamically as each instruction is scheduled. Clearly, if we can generate a schedule in which reg_x(t) and reg_y(t) do not exceed the number of available registers at any time step, register spills will not occur.

Figure 4.13. Scheduling steps of RSSP. (a) A TDAG example, (b) ALU instructions only, (c) initial scheduling result, (d) retimed scheduling result.
[Diagram not reproduced. Column headings recovered: ALU, X, Y, reg_x, reg_y.]

For a correct schedule, an operand residing in an accumulator/register obviously cannot be overwritten before being used. Recall that all instructions are completed in one time step in Motorola DSP56000. That is, if a variable (or constant) is loaded from memory at time step i and used at time step j, it will occupy a register from time step i to j-1. Similarly, an ALU result will occupy a register from time step i to j–1 if it is transferred from an accumulator at time step i and used at time step j. We conclude scheduling rules for memory accesses and register transfers as follows.
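The occupancy bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, each value is given as an (arrival step, use step) pair, every use step is at most `horizon`, and the default limit of two registers per bank reflects the X0/X1 and Y0/Y1 input registers mentioned in section 4.1.

```python
def register_pressure(lifetimes, horizon):
    """Return the number of occupied registers at each time step.

    lifetimes -- list of (arrive, use) steps; a value occupies a register
                 from step `arrive` through step `use - 1`
    horizon   -- number of time steps in the schedule
    """
    occupied = [0] * horizon
    for arrive, use in lifetimes:
        for t in range(arrive, use):
            occupied[t] += 1
    return occupied


def has_register_spill(lifetimes_x, lifetimes_y, horizon, limit=2):
    """True if reg_x(t) or reg_y(t) exceeds the register limit at any step."""
    reg_x = register_pressure(lifetimes_x, horizon)
    reg_y = register_pressure(lifetimes_y, horizon)
    return any(count > limit for count in reg_x + reg_y)
```

For instance, two X-bank values with lifetimes (0, 2) and (1, 3) give reg_x = [1, 2, 1, 0] over four steps, which stays within the limit. Adding a third value with lifetime (1, 2) raises reg_x(1) to 3 and signals a register spill.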
