Expected Value of Schedule Length

Chapter 3. Probabilistic Loop Scheduling Method

3.2 Basic Concept

3.2.3 Expected Value of Schedule Length

We have defined the executing probability of each long instruction. Because the MD-CdDFG is scheduled into several long instructions, we can sum their PL values to define the expected value of schedule length as follows.

Definition 3.4 Expected value of schedule length (ESL) is defined as

$$ESL = \sum_{i=1}^{MSL} PL(i)$$

where MSL is the maximum schedule length of MD-CdDFG G and PL(i) is the executing probability of the long instruction at C-Step i.
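As a minimal sketch of Definition 3.4, the sum can be written in Python as below. The function name and the PL values are illustrative assumptions, not values from this work; only the summation itself comes from the definition.

```python
def expected_schedule_length(pl):
    """Return ESL as the sum of PL(i) over C-Steps 1..MSL.

    pl[i-1] holds PL(i), the executing probability of the long
    instruction at C-Step i, so len(pl) plays the role of MSL.
    """
    for step, prob in enumerate(pl, start=1):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"PL({step}) = {prob} is not a probability")
    return sum(pl)


# Hypothetical schedule: three long instructions that always execute,
# followed by three conditional ones.
pl = [1.0, 1.0, 1.0, 0.5, 0.25, 0.125]
esl = expected_schedule_length(pl)
print(esl)  # 3.875
```

Because every PL(i) is at most 1, the result can never exceed the number of long instructions, which is why the ESL is bounded above by the MSL.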

The behavior of the fork nodes decides the average number of instructions executed in an iteration. We schedule the instructions into several long instructions and obtain the maximum schedule length. The schedule length alone cannot represent the performance accurately because it is static.

We propose the ESL, which is computed from the behavior of the fork nodes and represents the average computation time of the loop body. The ESL is therefore a fairer measure of the performance of a scheduling result.

Our scheduling goal is ESL reduction. To reduce the ESL, we collect conditional operations into conditional long instructions. Because the loop body usually accounts for a large part of the computation time, ESL reduction can decrease the entire execution time. In the next section, we propose a new method for ESL reduction.

3.3 Schedule Strategy

Our scheduling goal is to reduce the ESL, and we propose several heuristic strategies to achieve it. From the related work, MSL is defined as:

$$MSL = \max\left\{ \frac{\#\ \text{of operations}}{\#\ \text{of functional units}} \right\}$$

MSL is the upper bound of the schedule length. To reduce this upper bound, we need to reduce the number of operations by letting several nodes share one operation. We define the resource sharing condition to indicate which nodes can share one operation.

Definition 3.6 (Resource Sharing Condition) In a given MD-CdDFG G with u,v∈V, u and v can share the same operation if they satisfy:

(1) u and v use the same type of functional unit.

(2) u and v are exclusive.

(3) u and v do not form a “sharing-prevention cycle”.

Fig 3.5 An example of MD-CdDFG.

We use this definition to check all pairs of nodes in the MD-CdDFG and obtain a graph GS that represents the resource sharing policy. Condition (3) can be detected by traversing GS, as in the MDBA.

We use Fig 3.6 as our example in this section, and the resulting resource sharing policy is shown in Fig 3.7. It shows that a2/a3/a8, a6/a7 and a4/a5 each share one adder. By formula (4) in 3.3.2, we can compute that a2/a3/a8 and a4/a5 are unconditional. After the resource sharing policy is found, the number of operations is reduced, so the MSL can also be reduced by resource sharing. Thus, our first scheduling strategy is given below.

Schedule Strategy 1 To reduce the upper bound of the schedule length, we let as many nodes as possible share one operation.

We also need to reduce the lower bound of the schedule length. From (1), if a long instruction contains an unconditional operation, its executing probability PL is equal to 1. It is an unconditional long instruction, which is always executed in each iteration. Because the resource sharing policy has been found, we define the number of unconditional long instructions as the Minimum Schedule Length (NSL).

Fig 3.6 The resource sharing policy of our example

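Definition 3.7 Given an MD-CdDFG G after finding the resource sharing policy, the Minimum Schedule Length (NSL) is defined as

$$NSL = \max\left\{ \frac{\#\ \text{of unconditional operations}}{\#\ \text{of functional units}} \right\}$$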

For example, in Fig 3.7(b), we have found the resource sharing policy and obtained the unconditional operations { a1, m1, c1, c3, a4/a5, m5/m7, m2/m3, a2/a3/a8 }. If we have two adders and one multiplier in our architecture, then

$$NSL = \max\left\{ \frac{5}{2}, \frac{3}{1} \right\} = 3$$

NSL is the lower bound of the schedule length; its physical meaning is that three long instructions are always executed. However, there are also three conditional long instructions, each with its own executing probability, so

$$ESL = 3 + PL(4) + PL(5) + PL(6)$$

by Definition 3.5. Thus, given an MD-CdDFG G after finding the resource sharing policy, we can compute the MSL, NSL and ESL, and we know that

$$MSL \geq ESL \geq NSL.$$
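The relation between the three lengths can be checked with a short Python sketch. It rests on several assumptions: the ratio is rounded up to whole C-Steps, the counts 5 and 3 are assigned to the adders and the multiplier in the order the fractions appear in the text, the MSL is taken to be the three unconditional plus three conditional long instructions of the example, and the PL values are invented for illustration.

```python
from math import ceil


def length_bound(op_counts, fu_counts):
    """Largest per-type ratio of operations to functional units.

    Rounding up to whole C-Steps is an assumption of this sketch.
    """
    return max(ceil(op_counts[t] / fu_counts[t]) for t in op_counts)


fu_counts = {"adder": 2, "multiplier": 1}
# Counts taken from the example: max{5/2, 3/1}.
unconditional_ops = {"adder": 5, "multiplier": 3}
nsl = length_bound(unconditional_ops, fu_counts)

# Hypothetical PL(4), PL(5), PL(6) for the three conditional long instructions.
conditional_pl = [0.5, 0.25, 0.25]
msl = nsl + len(conditional_pl)
esl = nsl + sum(conditional_pl)

print(nsl, esl, msl)  # 3 4.0 6
assert msl >= esl >= nsl
```

Whatever probabilities are chosen for the conditional long instructions, the ESL stays between the NSL (all of them 0) and the MSL (all of them 1).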

To reduce the NSL, the unconditional operations must occupy as few long instructions as possible. Our second scheduling strategy is given below.

Schedule Strategy 2 To reduce the lower bound of the schedule length, we schedule the unconditional operations into the minimum number of long instructions.

After all unconditional operations have been scheduled into the minimum number of long instructions, there are still some conditional operations to be scheduled. We need to schedule them into conditional long instructions and also to reduce the executing probability of those long instructions. We consider the previous simple architecture with only two operations, where p and q are scheduled into the same long instruction.

If p and q are exclusive, the PL of the long instruction is p(p)+p(q). If p and q are independent, however, the PL is smaller, so we need a heuristic method that avoids placing exclusive nodes in the same long instruction. There are many ways to avoid exclusiveness, and we choose the easiest one: select one of the two operations and increase its retiming count, so that the two operations become independent. This heuristic reduces the ESL, which is our scheduling goal, at the cost of a larger retiming depth in the scheduling result.
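The benefit of making the two operations independent can be checked numerically. The sketch below assumes that PL is the probability that at least one operation in the long instruction executes, which is consistent with PL being 1 when an unconditional operation is present. The independent-case formula is the standard identity for independent events, not a formula quoted from this section, and the probabilities are hypothetical.

```python
def pl_exclusive(p_p, p_q):
    """PL when p and q are exclusive: at most one executes, so probabilities add."""
    return p_p + p_q


def pl_independent(p_p, p_q):
    """PL when p and q are independent: probability that at least one executes."""
    return p_p + p_q - p_p * p_q


p_p, p_q = 0.5, 0.25  # hypothetical executing probabilities
print(pl_exclusive(p_p, p_q))    # 0.75
print(pl_independent(p_p, p_q))  # 0.625
```

The difference between the two values is the product p(p)·p(q), so the gain is largest when both operations are likely to execute.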

Our third scheduling strategy is as follows.

Schedule Strategy 3 When two exclusive conditional operations are scheduled into the same long instruction, they must be given different retiming counts.

We have proposed three scheduling strategies. In the next section we present our scheduling method, which uses these three strategies to decrease the average schedule length ESL.

3.4 Probabilistic Loop Schedule Method

In this section, we present our algorithm in detail and use an example to explain how it works.

Our scheduling goal is a static schedule with a shorter ESL. The input of our algorithm is the MD-CdDFG G, and the output is the retiming of each node u∈V.

Our method, called the Probabilistic Loop Scheduling Method (PLSM), contains four steps. The first step computes the executing probability of each node. The second step finds the sharing policy of the graph. The third step allocates the operations into the schedule table, which is one of the outputs. The final step computes the retiming of each node u∈V.

Fig 3.7 shows the algorithm for calculating the probability of each u∈V. The algorithm is an implementation of Definition 3.1. Line 1 initially sets the probability of every node to 1. Lines 2 to 4 form a loop over the fork nodes. When the loop reaches ci, we multiply the probability of every node on the true path of ci by the true probability f(ci), and the probability of every node on the false path of ci by the false probability 1-f(ci). After all fork nodes have been processed, each node has its executing probability. For example, in Fig 3.8 the executing probabilities are assumed values, and the italic real numbers are the results of the probability calculation. After the first step we know the executing probability of each node; in the second step we need to find the resource sharing policy.

Input : MD-CdDFG G
Output : probability of each node
1. ∀ u∈V p(u)=1
2. For each fork node ci = c1…cn
3.   do ∀ p ∈ true path of ci ; p(p)=p(p)×f(ci)
4.      ∀ q ∈ false path of ci ; p(q)=p(q)×(1-f(ci))

Fig 3.7 Step 1 of PLSM : Calculating the executing probability for each node
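Step 1 is short enough to transcribe almost line by line. The following Python sketch follows the four lines of Fig 3.7; the data layout (a list of tuples per fork node) and the small example graph are assumptions made for illustration and do not come from the figures.

```python
def executing_probabilities(nodes, forks):
    """Step 1 of PLSM as in Fig 3.7.

    forks is a list of (f_true, true_path, false_path), where f_true is
    f(ci) and the paths are the sets of nodes on each branch of ci.
    """
    p = {u: 1.0 for u in nodes}                      # Line 1
    for f_true, true_path, false_path in forks:      # Line 2
        for u in true_path:                          # Line 3
            p[u] *= f_true
        for u in false_path:                         # Line 4
            p[u] *= 1.0 - f_true
    return p


# Hypothetical graph: c2 is nested inside the true path of c1.
nodes = ["a1", "a2", "a3", "a4"]
forks = [
    (0.25, {"a2", "a4"}, {"a3"}),  # c1
    (0.5, {"a4"}, set()),          # c2
]
p = executing_probabilities(nodes, forks)
print(p)  # {'a1': 1.0, 'a2': 0.25, 'a3': 0.75, 'a4': 0.125}
```

A node nested under several fork nodes, such as a4 here, ends up with the product of the branch probabilities along its path, while a node outside every branch keeps probability 1.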

Input : MD-CdDFG with p(u) ∀u∈V
Output : GS : MD-CdDFG with sharing indication edges
1 Sort fork nodes Ci=C1…Cm by decreasing order of conditional depth
2 Sort types of FU FUj=FU1…FUn by (number of nodes / FU in system)
3 For (FUj=FU1…FUn)
4   For (Ci=C1…Cm)
5     For each u s.t. { u ∈ true path with respect to Ci ; shared u first }
6       For each v s.t. { v ∈ false path with respect to Ci ; shared v first }
7         IF ( (FU(u)=FU(v)=FUj)
8           AND ( u and v are exclusive )
9           AND ( u and v do not form a sharing-prevention cycle ) )
10        THEN add a sharing indication edge u→v
11             p(u)=p(u)+p(v) ; p(v)=p(u)

Fig 3.9 Step 2 of PLSM : Algorithm for finding resource sharing policy

Fig 3.9 is the algorithm for finding the resource sharing policy. The input of this algorithm is the MD-CdDFG together with the executing probability of each node calculated in the first step. In Lines 5 to 9, we select a pair of nodes and check the resource sharing conditions of Definition 3.6. If two nodes satisfy the sharing conditions, we connect them with a sharing indication edge and sum their executing probabilities in Lines 10 and 11. Following the first schedule strategy, Lines 1 and 2 try to merge more nodes into one operation. Line 1 means that we search for the sharing policy in the deeper conditional blocks first. Line 2 means that we search for sharing nodes among the functional units that lead to a longer schedule length first. Both orderings help to increase the number of resource sharing nodes. Fig 3.10(a) shows the resource sharing policy after processing the additions, and Fig 3.10(b) shows the resource sharing policy after processing the multiplications as well. After the sharing policy is found, the operations are classified into unconditional and conditional ones by their executing probability, as in Fig 3.11.
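The effect of Lines 10 and 11 and of the final classification can be illustrated with a small Python sketch. It covers only the probability update and the classification, not the search for sharing pairs. The helper names, the tolerance and the probabilities are assumptions; the node names are borrowed from the example, where a4/a5 becomes unconditional and a6/a7 stays conditional.

```python
def share(p, u, v):
    """Lines 10-11 of Fig 3.9: u and v both take the summed probability."""
    p[u] = p[u] + p[v]
    p[v] = p[u]


def classify(p, tol=1e-9):
    """Split nodes into (unconditional, conditional) by executing probability."""
    unconditional = sorted(n for n, prob in p.items() if prob >= 1.0 - tol)
    conditional = sorted(n for n, prob in p.items() if prob < 1.0 - tol)
    return unconditional, conditional


# Hypothetical probabilities after Step 1.
p = {"a1": 1.0, "a4": 0.25, "a5": 0.75, "a6": 0.25, "a7": 0.5}
share(p, "a4", "a5")  # complementary branches: the shared operation always runs
share(p, "a6", "a7")  # shared operation still depends on the branches
unconditional, conditional = classify(p)
print(unconditional)  # ['a1', 'a4', 'a5']
print(conditional)    # ['a6', 'a7']
```

Sharing two nodes from complementary branches turns two conditional nodes into one unconditional operation, which is how resource sharing changes both the MSL and the NSL.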

For example, after finding the resource sharing policy, we can find the number of unconditional and conditional operations as shown in Fig 3.11. The MSL and NSL can be computed by the related works and Definition 3.7. In this example, if we have two adders and one multiplier, then

Fig 3.10 (a) After finding sharing policy among addition; (b) After finding all sharing policy.

Fig 3.11 Operations after finding resource sharing policy.

In the third step, we allocate all operations into a schedule table whose size is equal to the MSL. Following the second scheduling strategy, the unconditional operations should be scheduled into the minimum number of long instructions, from C-Step 1 to NSL. Fig 3.12 shows the schedule tables, where the dotted line marks the NSL. The algorithm for allocating operations is shown in Fig 3.13 and contains two phases. The first phase allocates all unconditional operations into the schedule table from C-Step 1 to NSL by BUSM [3] to reduce the retiming depth. After the first phase, some conditional operations remain to be scheduled. The second phase allocates the conditional operations into the schedule table in order of their executing probability. A conditional operation with a higher probability is scheduled into an earlier C-Step to reduce the PL of the later long instructions.

Fig 3.12 Scheduling processes for allocating operations

Now we use the example to show how the algorithm works. Initially a1 is placed in QueueV in Line 4, and an available FU is found for it at C-Step 1 in Lines 9 to 11. We schedule a1 into the schedule table and decrease the indegree count of its successors to find schedulable operations. Then m1 and c3 are added into QueueV, and we look for an available functional unit for m1 from C-Step 2 to 4 and schedule it into C-Step 2.

After a1, m1, c1 and c3 have been scheduled as in Fig 3.12(a), we need to schedule c2, but it is a conditional operation, which is handled in Line 19. We decrease the indegree count of its successors and add c2 into QueueS. Then the sharing operation m2/m3 has zero indegree and is schedulable. There is no available functional unit from ES(m2/m3)=3 to NSL, so we look for one from C-Step 1 to ES(m2/m3)-1 in Lines 13 to 16. This is the idea of BUSM for reducing the retiming depth. In Fig 3.12(b), when Queue is empty, all unconditional operations have been allocated into the schedule table, and the conditional operations still to be scheduled are in QueueS.

We sort these conditional operations by their executing probability and allocate them into the schedule table at C-Steps from 1 to MSL. The operation with the highest probability is m8/m9, and it is scheduled into the table first, in Fig 3.12(c). After allocating all operations, we get the schedule table in Fig 3.12(d). We can observe that the first three long instructions are always executed, while the other three depend on the branch tests. By the definition of ESL, we can compute ESL=3+PL(4)+PL(5)+PL(6) as the average performance of our scheduling result. However, the retiming count of each operation is still unknown, so we finally use an algorithm to set the retiming counts.

Input : MD-CdDFG GS, MSL, NSL
Output : An allocated schedule table
1. ES(∀v∈V)←1
2. Queue←v∈V ; QueueV=QueueS=∅
3. ∀e∈E, E←E-{e, s.t. d(e)≠(0...0)}
4. QueueV←QueueV∪{u∈V, s.t. Indegree(u)=0}
5. While (Queue≠∅)
6. DO u←DeQueue(QueueV)
7.   IF (u is unconditional)
8.   THEN for i=ES(u) to NSL
9.          if (exist an empty FU for u)
10.           ES(u)=i ;
11.           Assign operation u to an available FU at C-Step i
12.           Assign operation shared with u to an available FU at C-Step i
13.        for i=1 to ES(u)-1
14.          if (exist an empty resource for u)
15.           ES(u)=i ; Assign node u to an available FU at C-Step i
16.           Assign operation shared with u to an available FU at C-Step i
17.        ∀v s.t. u→v
18.          ES(v)←Max{ES(u),ES(u)+t(v)}
19.   ELSE QueueS=QueueS∪{u , nodes shared with u}
20.   ∀v s.t. u→v
21.     Indegree(v)=Indegree(v)-1
22.     if Indegree(v and nodes shared with v)=0
23.       QueueV=QueueV∪{v}
24. While (QueueS≠∅)
25.   u←DeQueue(QueueS) s.t. u has the highest p(u) in QueueS
26.   for i=1 to MSL
27.     if (exist an empty FU for u)
28.       Assign operation u to an available FU at C-Step i

Fig 3.13 Step 3 of PLSM : allocating operations into schedule table
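The second phase of Fig 3.13 (Lines 24 to 28) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the full allocation algorithm: the table layout, the function name, the operation m4, all probabilities and the schedule length of 4 are hypothetical, and the unconditional operations are assumed to be already placed by the first phase.

```python
def allocate_conditional(ops, table, fu_counts, msl):
    """Place conditional operations, highest executing probability first.

    ops:   list of (name, fu_type, probability)
    table: dict mapping C-Step -> list of (name, fu_type) already allocated
    Returns a dict mapping each operation name to its C-Step.
    """
    placement = {}
    for name, fu, _prob in sorted(ops, key=lambda o: o[2], reverse=True):
        for step in range(1, msl + 1):                 # Line 26
            used = sum(1 for _, t in table[step] if t == fu)
            if used < fu_counts[fu]:                   # Line 27
                table[step].append((name, fu))         # Line 28
                placement[name] = step
                break
        else:
            raise RuntimeError(f"no free {fu} for {name} within MSL")
    return placement


fu_counts = {"adder": 2, "multiplier": 1}
table = {
    1: [("a1", "adder"), ("m1", "multiplier")],
    2: [("a2/a3", "adder"), ("a4/a5", "adder"), ("m2/m3", "multiplier")],
    3: [],
    4: [],
}
ops = [("m4", "multiplier", 0.1), ("m8/m9", "multiplier", 0.6), ("a6/a7", "adder", 0.3)]
placement = allocate_conditional(ops, table, fu_counts, msl=4)
print(placement)  # {'m8/m9': 3, 'a6/a7': 1, 'm4': 4}
```

Because the most probable multiplication takes the earliest free multiplier slot, the least probable one is pushed to the last C-Step, which keeps the PL of that final long instruction low.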

Input : MD-CdDFG and scheduled table
Output : retiming function of each node v∈V
1. ∀ u∈V RC(u)=null
2. RCmax=0
3. E ← E - { e, s.t. d(e)=(0...0) }
4. ∀ u∈V s.t. Indegree(u)=0 RC(u)=0
5. While (∃ u∈V, RC(u)=null)
6. do For CS=1 to MSL
7.      for each u s.t. u is an operation in the control step
8.      do IF (RC(u)=null AND RC(all parents of u) ≠ null)
9.         THEN RC(u)=RCmax
10.             IF (exist operation v, s.t. u, v are exclusive)
11.               AND RCmax>NSL
12.               AND RC(v)=RCmax
13.             Then RC(u)=RC(u)+1
14.    RCmax=RCmax+1
15. Choose s=(s1,s2…sn) s.t. s•d(e) > 0 for any e∈E whose d(e)≠(0,…,0)
16. Choose r s.t. r ⊥ s
17. for each u∈V
18. do r(u) = (RCmax - RC(u))*r

Fig 3.14 Step 4 of PLSM : Algorithm for setting retiming count
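Lines 16 to 18 of Fig 3.14 turn the retiming counts into retiming vectors. A minimal Python sketch is given below; it uses the schedule vector (1,0) and retiming base (0,1) that appear later for the VerySmall benchmark, while the retiming counts themselves are hypothetical and RCmax is simply taken as the largest count, which is an assumption of this sketch.

```python
def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def retiming_functions(rc, rc_max, r):
    """Lines 17-18 of Fig 3.14: r(u) = (RCmax - RC(u)) * r."""
    return {u: tuple((rc_max - count) * x for x in r) for u, count in rc.items()}


s = (1, 0)  # schedule vector
r = (0, 1)  # retiming base, chosen so that r is orthogonal to s (Line 16)
assert dot(r, s) == 0

rc = {"a1": 0, "c2": 0, "m8/m9": 1, "m4": 2}  # hypothetical retiming counts
rc_max = max(rc.values())
print(retiming_functions(rc, rc_max, r))
# {'a1': (0, 2), 'c2': (0, 2), 'm8/m9': (0, 1), 'm4': (0, 0)}
```

Nodes with the smallest retiming count receive the largest multiple of r, so operations set early in the process are retimed the furthest along the base direction.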

The final step is setting the retiming counts. By Schedule Strategy 3, we need to avoid exclusive operations in the same long instruction. Nodes that do not violate the data dependences in the schedule table can use the same retiming count. The detailed algorithm is shown in Fig 3.14, and we use the allocation result in Fig 3.12 to show how the retiming count setting works.

The process of retiming count setting is shown in Fig 3.15. The input of the algorithm is the graph MD-CdDFG G and the scheduled table. Initially, we set the retiming count of each node to null in Line 1. We set the retiming count of the nodes without incoming edges to 0 in Line 4. In Fig 3.15(a), only a1 has retiming count 0 initially. Then, in Lines 5 to 13, we find all nodes in the schedule table that can take retiming count 0.

An operation can be given its retiming count once the retiming counts of all its parents have been set. For example, when we search the second C-Step and observe that all parents of c2 already have a retiming count, we can set the retiming count of c2 to zero. The first round, Lines 6 to 9, sets all operations that can take retiming count 0, and the result is shown in Fig 3.15(b).

After setting retiming count 0, we search the table twice more to find the nodes that can take retiming counts 1 and 2, as in Fig 3.15(c). In Lines 10 to 13, we avoid scheduling two exclusive operations into one long instruction with the same retiming count, following Schedule Strategy 3: we increase the retiming count of the later scheduled node. The while loop finishes when all operations have a retiming count, as in Fig 3.15(d). The maximum retiming count is recorded in RCmax.

Fig 3.15 Process for retiming count setting.

In Lines 15 to 18, we find the base retiming function by RPUSM [2] and compute the retiming function of each node v∈V.

So far, we have scheduled all operations into the schedule table and set the retiming counts. The example has shown how the algorithm works. By Property 2.5, the scheduling result is a legal retiming, so a realizable schedule can be obtained. In the next chapter, we use several benchmarks to evaluate the efficiency of our method.

Chapter 4. Preliminary Performance Evaluations

In this chapter, we show some performance evaluations on DSP programs. First, we introduce the formula for evaluating the execution time of nested loops with conditional branches and explain how the ESL affects the entire execution time. Second, we use some benchmarks to evaluate our method PLSM and MDBA in some detail. Finally, we summarize the time complexity and the entire execution time of MDBA and our method PLSM.

4.1 Evaluating the Execution Time

In DSP applications, most nested loops are 2-dimensional loops. Thus, we use four benchmarks to evaluate the performance of PLSM, and they are all 2-dimensional MD-CdDFGs.

In the following, we introduce the formula for calculating the execution time of 2-dimensional loops whose loop sizes are m and n. Before the retiming technique is applied to a 2-dimensional loop, its execution time can be represented by m×n×A, where A is the static schedule length.

After the retiming technique is applied, the execution time of a 2-dimensional loop can be divided into three parts: the loop body, the prologue and epilogue inside the first-level loop, and the prologue and epilogue outside the nested loop [2-3]. The formula for evaluating the execution time of a 2-dimensional loop is as follows [2]:

)

where (s1,s2) is the schedule vector, d is the maximum retiming depth, A is the average schedule length after applying some optimization algorithm, D is the static schedule length of an iteration after applying “List Scheduling”, B is the length of the prologue inside the first-level loop, and C is the length of the epilogue inside the first-level loop. In the following, we use this formula to compare the performance of PLSM and MDBA. For both methods, we use the average schedule length as A to calculate the entire execution time.

4.2 Experiments Results

We use some benchmarks to evaluate the effects of MDBA and PLSM. They are all nested loops with conditional branches. We use (7) to compute their entire execution time, including loop bodies and overheads. The benchmarks Floyd-Steinberg [4], VerySmall [6], SC [4,6], Kim [4,6,11,14] and VeryLarge [11,14] are shown in Appendix A.

These benchmarks are all 2-dimensional loops with conditional branches. We vary the behavior of the branch instructions and the number of functional units to obtain different results with our method, including MSL, NSL, ESL and retiming depth. We also use MDBA to schedule these benchmarks and obtain the MSL and retiming depth.

In the comparison of entire execution time, we use formula (7) to compute the execution time of each scheduling result. Because the execution time varies, we show the maximum, minimum and average execution times. We also change the size of the nested loop to observe its influence on the entire execution time.

Fig 4.1(a) is the benchmark “VerySmall”. We assume that there are one adder and two multipliers in our architecture and that the parameters are f(c1)=0.2 and f(c2)=0.8. We select the best schedule vector (1,0) by RPUSM [2] and use (0,1) as the retiming base. PLSM’s scheduling result is shown in Fig 4.1(b) and MDBA’s scheduling result is shown in Fig 4.1(c). The retiming depth of both PLSM and MDBA is 2. After scheduling, we compute by our average schedule length measurement that the ESL is 1.36 for PLSM and 2 for MDBA.

We compute the maximum, average and minimum execution times of a 2-dimensional nested loop by (7). Fig 4.2 shows the two scheduling results executed on a 5×5 nested loop. In this graph, the left three bars are the maximum, average and minimum execution times produced by MDBA in Fig 4.1(c). Each bar contains two parts, overhead and loop body. The overhead is the instructions in the prologues and epilogues outside the loop body. The loop body is the instructions executed by the nested loop after retiming. The integer in the center of each part of a bar is the execution time of the overhead or the loop body. Because PLSM and MDBA produce the same retiming depth, the overhead takes 20 cycles in both cases.

Fig 4.1 (a) Benchmark “VerySmall”; (b) Scheduling result by PLSM;

Fig 4.2 Maximum, average and minimum execution time (cycle) of the Fig 4.1(b)(c) in a 5x5 nested loop (bars: MDBA MSL, MDBA ESL, MDBA NSL, PLSM MSL, PLSM ESL, PLSM NSL; each bar is divided into Overhead and LoopBody)

In both scheduling results, we use the same retiming function and get the same retiming depth. We assume that the prologues and epilogues take the same execution time. In both scheduling results the loop body contains 15 iterations. In MDBA, the schedule length is always 2, so we always need 2×15=30 cycles to complete the loop body.
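The loop-body arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes that the loop-body time is the number of iterations multiplied by the (average) schedule length and that the total is overhead plus loop body, as the bars of Fig 4.2 are described; it does not implement formula (7). The PLSM figures are derived here from the stated ESL of 1.36 and are not read from the figure.

```python
def loop_body_cycles(iterations, schedule_length):
    """Cycles spent in the loop body for a given (average) schedule length."""
    return iterations * schedule_length


iterations = 15   # loop-body iterations of the 5x5 example
overhead = 20     # prologue and epilogue cycles, same for both methods

mdba_body = loop_body_cycles(iterations, 2)      # schedule length is always 2
plsm_body = loop_body_cycles(iterations, 1.36)   # ESL of PLSM

print(mdba_body, overhead + mdba_body)                       # 30 50
print(round(plsm_body, 2), round(overhead + plsm_body, 2))   # 20.4 40.4
```

Since the overhead is identical for both methods, any difference in total execution time comes entirely from the average schedule length of the loop body.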

In our PLSM, the second long instruction is not always executed. In the worst case, if the
