Introduction - A Power-Aware Instruction Scheduling Method for Nested Loops on Digital Signal Processors

In embedded systems, high-performance digital signal processing (DSP) used in image processing, multimedia, wireless security, etc., requires not only high data throughput but also low power consumption [14]. These applications usually contain time-critical sections consisting of nested loops of instructions. The optimization of such loops under processing resource constraints is required in order to reduce their computation time [6].

Push-Up Scheduling Method (PUSM) [4] and Relax Push-Up Scheduling Method (RPUSM) [6] are retiming-based methods used to schedule the instructions of nested loops under resource constraints. They can fully utilize functional units to achieve the minimum static schedule length, and RPUSM can further select a better schedule vector to reduce the entire execution time. However, they usually result in a larger maximum retiming depth, which lengthens the prologue and epilogue and increases the entire execution time. Hence, in this thesis, we propose a method named Bottom Retiming Scheduling Method (BRSM) to overcome this shortcoming. BRSM results in a smaller maximum retiming depth. The experimental results show that BRSM gives an improvement of 20.98% to 41.13% over RPUSM on the Floyd-Steinberg problem with various loop indexes [7].

As for low-power scheduling, many techniques for nested loops have been studied [2-6, 10-16, 19]. Based on the operand sharing approach, a loop pipelining methodology to reduce both latency and power was first proposed in [13]. After that, a list-based loop pipelining technique was proposed to first minimize power and then maximize throughput [12]. Since list scheduling only considers the node with the highest priority in the ready list, it can't get an optimal solution when the number of functional units is more than one in every scheduling step. In order to fully utilize functional units and reduce power consumption, we integrate the operand sharing technique into BRSM to design another method, Bottom Retiming with Operand Sharing method (BROS). It can bind operations with a common operand to the same functional unit and result in a smaller maximum retiming depth. BROS has the advantages of both BRSM and the operand sharing technique. The experimental results show that the performance of BROS is very close to that of BRSM, and that operand reutilization is very high.

This thesis is organized as follows. In chapter 2, we introduce the fundamental background and the related work. In chapter 3, BRSM is presented in detail, and the experimental results are shown. BROS is presented in detail in chapter 4, together with the corresponding experimental results. Finally, we conclude the thesis in chapter 5 and list the future work of our research.

Chapter 2. Fundamental Background & Related Work

In this chapter, we will introduce the Multidimensional Data Flow Graph (MDFG) to model the nested loop to be scheduled. Then, the retiming technique will be presented. Finally, we will go through some related work, including Push-up Scheduling Method [4], Relax Push-up Scheduling Method [6], and List Scheduling for Low Power Method [2].

2.1 Modeling the Problem [4-5]

Multidimensional data flow graph (MDFG) is used to model the nested loop to be scheduled [4, 6, 24-26]. Definition 2.1 defines what an MDFG is.

Definition 2.1. A Multidimensional Data Flow Graph (MDFG) G=(V,E,d,t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E is the set of dependence edges, d is a function from E to Zⁿ representing the multidimensional delays between two nodes, where n is the number of dimensions, and t is the computation time of each node.

Fig. 2.1 shows the high-level language code of a DSP program. We use d(e)=(d.x,d.y) to represent the delay of any edge e in a two-dimensional data flow graph. The equivalent two-dimensional data flow graph is shown in Fig. 2.2. In this thesis, we assume that the execution time of any operation is one time unit.

An iteration is the execution of each node in V once, i.e., the execution of one instance of the loop body [4]. An iteration is associated with a static schedule that is repeatedly executed for the loop. Iterations are identified by a vector i, equivalent to a multidimensional index, starting from (0,0,...,0). Inter-iteration dependencies are represented by vector-weighted edges in an MDFG. For any iteration j, an edge e from node u to node v with delay vector d(e) means that the computation of node v at iteration j depends on the execution of node u at iteration j−d(e). An edge with delay vector (0,0,...,0) in an MDFG represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e., the summation of the delay vectors along any cycle can't be (0,0,...,0).

Definition 2.2. A cell dependence graph (DG) of the MDFG G is the directed acyclic graph, showing the dependences between copies of nodes representing an MDFG G.

The cell dependence graph is bounded by the dimensions of the problem which it represents [5]. A computational cell is the DG node that represents a copy of the MDFG, excluding the edges with delay vectors different from (0,0,...,0). The computational cell is considered as an atomic execution unit. Fig. 2.3(a) shows the DG based on the replication of the MDFG in Fig. 2.1, and Fig. 2.3(b) shows its DG represented by computational cells.

Fig. 2.1 High-level language code of a DSP program

Fig. 2.2 An equivalent two-dimensional data flow graph

2.2 Retiming a Multidimensional Data Flow Graph [3-5]

A multidimensional retiming r is a function from V to Zⁿ that redistributes the nodes in the original dependence graph created by the replication of an MDFG G=(V,E,d,t) [27]. A new MDFG G_r=(V,E,d_r,t) is created after applying retiming function r, and each iteration still has one execution of each node in G. The purpose of using the retiming technique is to construct a new MDFG with better instruction level parallelism (ILP). The retiming vector r(u) of a node u∈V represents the offset between the original iteration and the one after retiming. The delay vectors change accordingly to preserve dependencies. The retiming vector r(u) of a node u represents delay components pushed into the edges u→v, and subtracted from the edges w→u, where u,v,w∈V. The execution of node u in iteration i, which is represented by a multidimensional vector, is moved to the iteration i−r(u). Here we give some definitions and properties of the retiming technique as follows.

Definition 2.3. For any MDFG G=(V,E,d,t), retiming function r, and retimed MDFG G_r=(V,E,d_r,t), we define the retimed delay vector for every edge e in E, the retimed delay vector for every path p in G, and the retimed delay vector for every cycle l in G, denoted as d_r(e), d_r(p), d_r(l) respectively, by the following formulas:

(a) d_r(e)=d(e)+r(u)−r(v) for every edge e from u to v, u,v∈V and e∈E;

(b) d_r(p)=d(p)+r(u)−r(v) for any path p from u to v, u,v∈V and p∈G;

(c) d_r(l)=d(l) for any cycle l∈G.
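Formula (a) can be applied mechanically to every edge. The following is a minimal Python sketch, not code from the thesis: the function name `retime_edges`, the dictionary representation of the graph, and the two-node example cycle are all assumptions made for illustration (the example is not one of the thesis figures, and a dictionary keyed by (u, v) cannot hold parallel edges).

```python
from typing import Dict, Tuple

Vector = Tuple[int, ...]
Edge = Tuple[str, str]  # (u, v) stands for an edge from u to v


def retime_edges(delays: Dict[Edge, Vector], r: Dict[str, Vector]) -> Dict[Edge, Vector]:
    """Apply d_r(e) = d(e) + r(u) - r(v) to every edge e from u to v."""
    retimed = {}
    for (u, v), d in delays.items():
        retimed[(u, v)] = tuple(de + ru - rv for de, ru, rv in zip(d, r[u], r[v]))
    return retimed


# Hypothetical cycle A -> B -> A with one zero-delay edge (illustrative only).
delays = {("A", "B"): (0, 0), ("B", "A"): (1, 1)}
r = {"A": (0, 1), "B": (0, 0)}  # assumed retiming vectors

retimed = retime_edges(delays, r)

# Formula (c): the total delay along the cycle is unchanged by retiming.
cycle_before = tuple(map(sum, zip(*delays.values())))
cycle_after = tuple(map(sum, zip(*retimed.values())))
```

In this example the zero-delay edge from A to B receives the delay pushed out of A, while the cycle sum stays the same, which is what formula (c) states.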

For example, Fig. 2.4 shows the retimed MDFG G_r after applying retiming function r on G. We can use the definition to obtain the retimed delay vector for every edge e in E.

In Fig. 2.5(a), we show the retimed DG based on the replication of the MDFG in Fig. 2.4, and the retimed DG represented by computational cells is shown in Fig. 2.5(b). The retiming function applied to an MDFG may create a prologue and an epilogue. The prologue is the set of instructions that must be executed to provide the necessary data for the beginning of the iterative process. The epilogue is the set of instructions that must be executed to complete the process. These two sets of instructions are complementary. For example, in Fig. 2.5(a) the instruction D becomes the prologue, and the instructions A, B and C become the epilogue for this problem. If the retiming function of node D of the MDFG in Fig. 2.2 is equal to (0,2) and the retiming function of the other nodes is equal to (0,0), the prologue and epilogue contain more instructions. So, the size of the prologue and epilogue varies with the retiming function applied to the MDFG.

A schedule vector s is the normal vector for a set of parallel equitemporal hyperplanes that define the sequence of execution of the cell dependence graph. To get a schedule vector s, we can solve the inequalities d(e)⋅ s≥0 for every e∈E [5]. For example, (1,0) is a schedule vector of the MDFG in Fig. 2.2.
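The inequality test for a candidate schedule vector is easy to automate. The sketch below is only an illustration under stated assumptions: the function name `is_schedule_vector` is hypothetical, and the list of delay vectors is made up for the example rather than read from Fig. 2.2.

```python
from typing import Iterable, Sequence


def is_schedule_vector(s: Sequence[int], delay_vectors: Iterable[Sequence[int]]) -> bool:
    """Return True if d . s >= 0 holds for every delay vector d."""
    return all(sum(si * di for si, di in zip(s, d)) >= 0 for d in delay_vectors)


# Assumed delay vectors of some two-dimensional MDFG (illustrative only).
example_delays = [(1, 1), (0, 0), (1, -1)]

ok_10 = is_schedule_vector((1, 0), example_delays)  # products are 1, 0, 1
ok_01 = is_schedule_vector((0, 1), example_delays)  # (1, -1) gives -1
```

For these assumed delays, (1,0) passes the test while (0,1) fails because of the delay vector (1,-1).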

Definition 2.6. A legal MDFG G=(V,E,d,t), i.e., one that has no zero-delay cycle, is realizable if there exists a schedule vector s for the cell dependence graph with respect to G, i.e., s⋅d≥0 for any d∈G.

Definition 2.7. Given a realizable MDFG G, a legal multidimensional retiming for G is the multidimensional retiming function r that transforms G into G_r, such that G_r is still realizable.

A legal multidimensional retiming on an MDFG G=(V,E,d,t) requires that the execution sequence of the corresponding retimed DG does not contain any cycle. This constraint is enforced through the use of a schedule vector that supports the realization of the retimed graph.

The selection of retiming function may result in illegal retiming. Fig. 2.6(a) shows an illegal retiming function applied to the MDFG in Fig. 2.2. By simple inspection of the cell dependence graph in Fig. 2.6(b), we can find that there exists a cycle created by the dependencies (0,1) and (0,-1).

To get a legal multidimensional retiming r, we need to find a schedule vector s, such that s⋅d≥0 for any d∈G. For a two-dimensional problem, we choose s=(s.x,s.y) such that s.x+s.y is minimum. Then, a legal multidimensional retiming r of node u is any vector orthogonal to s [5]. So, we can find that (0,1) is a legal retiming function on the MDFG in Fig. 2.2, and (1,0) is an illegal retiming function on the MDFG in Fig. 2.2. Further, we can have the corollary 2.1 [5].

Corollary 2.1. If r is a multidimensional retiming function orthogonal to a schedule vector s that realizes an MDFG G=(V,E,d,t), then (k×r) is also a legal multidimensional retiming on that MDFG.

From the corollary 2.1, we know that the retiming function of every node in the retimed MDFG can be in the form (k×r). Here, r is called the retiming base, and k is called the retiming depth [6].
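The relation between retiming base, retiming depth, and the per-node retiming vector can be sketched in a few lines of Python. The names `retiming_from_depths` and `dot`, as well as the depth values assigned to nodes A, B and D, are assumptions for illustration and do not come from the thesis figures.

```python
from typing import Dict, Sequence, Tuple


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Inner product of two integer vectors."""
    return sum(x * y for x, y in zip(a, b))


def retiming_from_depths(base: Sequence[int], depths: Dict[str, int]) -> Dict[str, Tuple[int, ...]]:
    """Build r(u) = k * base for every node u with retiming depth k."""
    return {u: tuple(k * c for c in base) for u, k in depths.items()}


s = (1, 0)                         # schedule vector
base = (0, 1)                      # retiming base, orthogonal to s
depths = {"A": 0, "B": 1, "D": 2}  # assumed retiming depths

r = retiming_from_depths(base, depths)

# Every multiple of the base stays orthogonal to s, as Corollary 2.1 requires.
all_orthogonal = all(dot(s, vec) == 0 for vec in r.values())
max_depth = max(depths.values())
```

The maximum retiming depth computed here is the quantity that later chapters try to keep small, since it governs the length of the prologue and epilogue.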

2.3 Related Work [1-2, 4, 6]

In this section, we’ll show how Push-Up Scheduling Method (PUSM) works, and then Relax Push-up Scheduling Method (RPUSM) will also be introduced. Finally, we will introduce the operand sharing technique and a scheduling method, List Scheduling for Low Power method (LPLS), using list scheduling method combined with operand sharing technique [2].

2.3.1 Push-up Scheduling Method [4]

In order to make the schedule length shorter, PUSM uses the retiming technique to change the dependences in the MDFG. PUSM first analyzes whether a node can be scheduled, and then uses the retiming technique to make the node schedulable as early as possible. We now define a schedulable node as follows.

Definition 2.8. (Scheduling Conditions): Given an MDFG G=(V,E,d,t) and a node u∈V, u is a schedulable node at a control step cs, if it satisfies one of the following conditions:

(a) u has no incoming edges;

(b) all incoming edges of u have a nonzero multidimensional delay;

(c) all predecessors of u, connected to u by a zero-delay edge, have been scheduled to earlier control steps.
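The three conditions can be checked with a single predicate, because (a) and (b) are the cases in which node u has no zero-delay predecessor at all. The following Python sketch assumes a hypothetical representation (a dictionary of incoming edges and a dictionary of already assigned control steps); the function name `is_schedulable` and the example graph are illustrative, not taken from the thesis.

```python
from typing import Dict, List, Tuple

Vector = Tuple[int, ...]


def is_schedulable(u: str, cs: int,
                   in_edges: Dict[str, List[Tuple[str, Vector]]],
                   scheduled_step: Dict[str, int]) -> bool:
    """Check the scheduling conditions of Definition 2.8 for node u at control step cs.

    in_edges maps a node to its incoming edges as (predecessor, delay) pairs.
    scheduled_step maps already scheduled nodes to their control steps.
    """
    # Predecessors connected to u by a zero-delay edge.
    zero_preds = [p for p, d in in_edges.get(u, []) if not any(d)]
    # (a) and (b): no zero-delay predecessor exists, so all() over an empty list is True.
    # (c): every zero-delay predecessor is already scheduled in an earlier control step.
    return all(p in scheduled_step and scheduled_step[p] < cs for p in zero_preds)


# Assumed example: D has no incoming edge, A depends on D across iterations,
# B depends on A within the same iteration, C depends on B across iterations.
in_edges = {
    "A": [("D", (1, 0))],
    "B": [("A", (0, 0))],
    "C": [("B", (0, 1))],
}
scheduled_step = {"A": 1}
```

With A placed in control step 1, node B is not schedulable in step 1 but becomes schedulable in step 2, while C and D are schedulable immediately.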

When scheduling an MDFG G, PUSM traverses G using the BFS algorithm and checks whether the node currently being traversed satisfies the scheduling conditions. If it does, the node is scheduled in that control step. Otherwise, the retiming technique is used to make the node satisfy the scheduling conditions and be scheduled in that control step. While traversing G, every traversed node records the multidimensional delay counting function MC(u), u∈V. MC(u) represents the upper bound on the number of extra nonzero delays required by any path from the roots of G to node u.

Before traversing the MDFG G, a schedule vector s realizing G and a legal retiming r on G will be found. After traversing G, PUSM uses multidimensional delay counting function MC to calculate the retiming function of every node by the following formula:

PUSM ignores the effect of the schedule vector and the retiming depth. Both of them affect the execution time of a scheduled nested loop. RPUSM, described next, provides a method to select a better schedule vector to reduce the execution time.

2.3.2 Relax Push-up Scheduling Method [6]

One of the main shortcomings of PUSM is that it doesn't consider the effect of the schedule vector on the execution time. RPUSM finds that if the schedule vector can be kept as (1,0), the execution time is minimal compared with any schedule vector other than (1,0). The author also proposed a method to check whether (1,0) can be a schedule vector, as shown in theorem 2.1.

Theorem 2.1. For an MDFG G=(V,E,d,t), the retiming depth of any node u in V is rd(u). We can use the schedule vector (1,0) under the following two conditions, which make the MDFG realizable:

(a) If there doesn’t exist any delay vector ( a0, ) for a>0 in the original MDFG;

(b) If there exists the delay vectors for , and after finding out the retiming depth of every node in V we must make sure that for all in the original MDFG;

Relax Push-up Scheduling Method (RPUSM) uses theorem 2.1 to modify PUSM to select a better schedule vector. The main difference is that RPUSM finds the schedule vector and the retiming base after traversing an MDFG. Thus, RPUSM can use the multidimensional delay counting function MC obtained after traversing an MDFG to get the retiming depth of every node. RPUSM uses the retiming depth of every node to check whether the conditions in theorem 2.1 are satisfied. If not, the schedule vector and the retiming base are found in the same way as in PUSM.

Although RPUSM provides a method to select (1,0) as a schedule vector, RPUSM, like PUSM, doesn’t consider the effect of the retiming depth of nodes. We’ll discuss this issue in chapter 3.

2.3.3 List Scheduling for Low Power Method [2]

There have been a few research results about power reduction using the operand sharing technique [1-2]. Here, we will briefly explain the operand sharing technique and then go through the key feature of the list scheduling for low power method (LPLS).

A functional unit in a data-path consumes both useful and useless power. It consumes useful power when it executes an operation and consumes useless power when there is an input operand transition while the functional unit is idle. The power consumption of a functional unit depends on the operand variability of its inputs. So, the operand sharing technique tries to bind operations with a common operand to the same functional unit such that the input activity of the shared functional unit decreases. In an MDFG, if a node has more than one outgoing edge, the children of that node share the same data generated by that node. So, these children can be bound to the same functional unit in continuous cycles, or in non-continuous cycles that are not interrupted by other operations. If two adjacent instructions executed on the same functional unit have one common operand, we say that one operand reutilization exists, i.e., one operand transition is avoided. Each operand reutilization reduces the input activity of the functional unit.
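To make the notion of operand reutilization concrete, the following Python sketch counts reutilizations for the sequence of operations bound to one functional unit. It is an illustration only: the function name `count_reutilizations`, the representation of an operation as a tuple of operand names, and the example sequences are assumptions, and the count treats any adjacent pair with at least one common operand as one reutilization.

```python
from typing import Sequence, Tuple

Operation = Tuple[str, ...]  # the input operands of one operation


def count_reutilizations(ops: Sequence[Operation]) -> int:
    """Count adjacent operation pairs on one functional unit that share an operand."""
    return sum(1 for prev, cur in zip(ops, ops[1:]) if set(prev) & set(cur))


# Assumed operations executed back to back on the same functional unit.
order_1 = [("a", "b"), ("a", "c"), ("d", "e"), ("e", "a")]
# The same operations, reordered so that common operands are adjacent.
order_2 = [("a", "b"), ("a", "c"), ("e", "a"), ("d", "e")]

reuse_1 = count_reutilizations(order_1)
reuse_2 = count_reutilizations(order_2)
```

The second ordering keeps a shared operand between every adjacent pair, so it has more reutilizations than the first for the same set of operations.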

LPLS uses a list scheduling approach. The priorities of the operations in the ready-operation queue are set in such a way that operations sharing the same operand are scheduled in control steps as close to each other as possible. So, the scheduling of the operations sharing the same operand is guided by giving more priority to the operations in the operand-ready queue. Because operations with common operands may be scheduled in non-continuous cycles that are not interrupted by other operations, some nodes may be delayed. Thus, the schedule length becomes longer, and the utilization of functional units decreases. Although LPLS can produce a schedule with good operand reutilization, the schedule length may be very long.

In chapter 3, we present the bottom retiming scheduling method, which aims at decreasing the retiming depth. In chapter 4, we combine the operand sharing technique with the bottom retiming scheduling method to get a schedule with both good operand reutilization and good performance, in order to reduce power.

Chapter 3. Bottom Retiming Scheduling Method

In this chapter, we introduce Bottom Retiming Scheduling Method (BRSM) in detail. First, we explain our motivation for proposing a method that reduces the execution time of a nested loop. Then, we describe the main concept and principle of BRSM. Finally, we give some basic experimental results. The results show that BRSM produces a schedule of a nested loop with less execution time than RPUSM.

3.1 Motivation

From the related work, we know that in order to schedule nodes as early as possible, PUSM first analyzes the scheduling conditions (definition 2.8). If necessary, it uses the retiming technique to make instructions satisfy the scheduling conditions so that they become schedulable earlier. Although PUSM can achieve the minimum static schedule length, the effects of the schedule vector and the retiming depth are not considered. Both of them affect the execution time of applications. In [6], the author proposed a method, RPUSM, to get a better schedule vector to reduce the execution time. Different from RPUSM, we focus on the effects of the retiming depth.

From Fig. 2.5, we find that some instructions become prologue and epilogue after a retiming function is applied to an MDFG. Further, when the retiming base is fixed, the prologue and epilogue become longer as the maximum retiming depth, i.e., the maximum value of the retiming depth over all the nodes in a retimed MDFG, becomes bigger. That is to say, the optimized portion of the nested loop, the loop body, is decreased, and the un-optimized portion, the prologue and epilogue, is increased. In order to reduce the execution time of nested loops, we have to increase the optimized portion of a nested loop. Thus, we need to decrease the maximum retiming depth.

By observing RPUSM, we find that in order to make nodes schedulable as early as possible, many zero-delay edges will be changed to nonzero-delay edges. Thus, RPUSM will result in a static schedule with a bigger maximum retiming depth. In the following, we will introduce our BRSM to produce a static schedule with a smaller maximum retiming depth to reduce the execution time.

3.2 Basic Concept

In this section, basic concepts are presented to explain how we decrease the maximum retiming depth while keeping the minimum static schedule length under limited resource constraints. We first describe how RPUSM works. If m adders and n multipliers are available, RPUSM schedules the first m add operations and the first n multiply operations in control step 1, the next m add operations and the next n multiply operations in control step 2, and so on until all operations are scheduled. If some node can't be scheduled in its control step, RPUSM uses the retiming technique to change the delay dependences to make it schedulable earlier.

Although RPUSM can fully utilize functional units to achieve minimum static schedule length, it will produce a static schedule with a bigger maximum retiming depth. A bigger maximum retiming depth will result in a longer prologue and epilogue to increase the execution time.

We will propose a new method which not only fully utilizes functional units to achieve minimum static schedule length but also has a smaller maximum retiming depth.

In order to fully utilize functional units, based on some information about the architecture and the application, we can calculate the minimum static schedule length before scheduling. A DSP application is usually composed of additions, multiplications, and assignments. Additions and multiplications are executed by adders and multipliers respectively. Assignments can be executed by adders or multipliers. To calculate the minimum static schedule length of an MDFG, denoted by ML(G), some information is needed: the total number of adders (A), multipliers (M), additions (ADD), multiplications (MUL), and assignments (AS). It costs ADD/A and MUL/M cycles to execute all additions and multiplications respectively. If no assignments exist, the bigger one of ADD/A and MUL/M is ML(G). If assignments exist, we need to calculate how many cycles it costs to execute them. The formula for calculating ML(G) is shown in line 1 and line 2 of Fig. 3.1. Because assignments can be executed by adders or multipliers, we can calculate the maximum number of assignments executed by adders and multipliers, as shown in line 3 and line 4 of Fig. 3.1.
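Since Fig. 3.1 is not reproduced here, the following Python sketch shows one way such a resource-based bound can be computed; it is an assumption and not necessarily the formula of Fig. 3.1. It assumes that every operation takes one time unit, that cycle counts are rounded up to whole control steps, and that dependences are ignored, so only functional-unit capacity is counted. The function name `min_schedule_length` is hypothetical.

```python
def ceil_div(x: int, y: int) -> int:
    """Integer division rounded up, for non-negative x and positive y."""
    return -(-x // y)


def min_schedule_length(A: int, M: int, ADD: int, MUL: int, AS: int) -> int:
    """Smallest L such that the operations fit on A adders and M multipliers.

    A schedule of length L offers A*L adder slots and M*L multiplier slots.
    Additions need adder slots, multiplications need multiplier slots, and
    assignments may use whatever slots of either kind remain.
    """
    return max(
        ceil_div(ADD, A),                  # additions alone
        ceil_div(MUL, M),                  # multiplications alone
        ceil_div(ADD + MUL + AS, A + M),   # all operations on all units
    )


# Assumed configurations (illustrative only).
ml_no_assign = min_schedule_length(A=1, M=1, ADD=4, MUL=1, AS=0)
ml_with_assign = min_schedule_length(A=2, M=1, ADD=5, MUL=2, AS=3)
```

In the second configuration, three control steps offer only nine slots for ten operations, so the bound is four; with four steps, the five additions leave three adder slots and the two multiplications leave two multiplier slots, which is enough for the three assignments.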

Generally speaking, in order to get a schedule with a smaller maximum retiming depth,
