THE CAD TOOL (FIT) AND BENCHMARK RESULTS

In [14], we reported a CAD tool for drawing, editing, and simulating a DFG and for finding critical loops, subcritical loops, and iteration bounds. Based on this CAD tool, we have developed and implemented a fast processor assignment algorithm that lets the designer graphically view critical loops, scheduling ranges, level diagrams, and processor assignment diagrams.

FIT integrates the following functions:

(1) DFG draw and edit capability, including erase, move, copy, paste, etc.;

(2) function assignment to each node and the ability to simulate the whole DFG to verify against the specifications;

(3) analysis of DFG performance and properties such as iteration bounds, critical loops, liveness, safeness, etc.;

(4) application of our theory of the final matrix to the scheduling of DFG nodes among processors.
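As an illustration of the analysis in function (3), the following minimal sketch computes the iteration bound of a small single-rate DFG as the maximum, over all loops, of the total node execution time divided by the number of delays in the loop, and reports the loops attaining that maximum as critical loops. It enumerates loops by brute force rather than using the final matrix, so it is suitable only for small graphs; the function names and the example graph (`EXEC_TIME`, `EDGES`) are assumptions made for this sketch and are not taken from FIT.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Return every simple loop once, as a list of nodes starting at its smallest node."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    cycles = []

    def dfs(start, node, path):
        for nxt in succ.get(node, []):
            if nxt == start:
                cycles.append(list(path))
            elif nxt > start and nxt not in path:
                # Only visit nodes larger than the start so each loop is found once.
                dfs(start, nxt, path + [nxt])

    for start in sorted(succ):
        dfs(start, start, [start])
    return cycles


def iteration_bound(exec_time, edges):
    """Return (bound, critical_loops) for a DFG given as {(u, v): delay_count}."""
    best, critical = Fraction(0), []
    for cyc in simple_cycles(edges):
        hops = zip(cyc, cyc[1:] + cyc[:1])
        delays = sum(edges[hop] for hop in hops)
        if delays == 0:
            raise ValueError("delay-free loop: the DFG is not live")
        ratio = Fraction(sum(exec_time[n] for n in cyc), delays)
        if ratio > best:
            best, critical = ratio, [cyc]
        elif ratio == best:
            critical.append(cyc)
    return best, critical


# Hypothetical four-node DFG: node execution times and edge delay counts.
EXEC_TIME = {1: 1, 2: 2, 3: 1, 4: 1}
EDGES = {(1, 2): 0, (2, 3): 0, (3, 1): 1, (2, 4): 0, (4, 2): 2}

bound, loops = iteration_bound(EXEC_TIME, EDGES)
print(bound, loops)
```

In this example, loop 1-2-3 has total execution time 4 and one delay, while loop 2-4 has total execution time 3 and two delays, so the first loop is critical and the second is subcritical.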

All the reported tools [1, 16, 17, 19, 22] deal only with

• processor assignment;

• single-rate DFGs (SRDFG), whereas our CAD tool is able to handle both SRDFG and MRDFG (multiple-rate DFG) [23];

• textual data only: no graphical display of either DFGs or processor assignments is available without printing PostScript files, that is, no on-line graphical display is available, whereas our CAD tool is able to handle both graphical and textual inputs (useful for very large DFGs);

• steady-state scheduling; initial transients were not discussed.

Often the designer either draws a new graphical DSP or starts with an existing one, and then translates it into a textual format to be processed by the software. In the former case, several iterations may be required before the design is finalized. A CAD tool such as ours, which can erase, move, and copy graphical objects, greatly facilitates the drawing process.

FIT will also serve as a computer-aided education tool for DFG design. Students can understand the operation of a DFG by simulating it in fine-grained steps using the Step mode. Further, the tool can simulate a DFG with the Trace mode on to help students understand the meaning of iteration bounds, critical loops, static rate-optimal scheduling, and so on.

The above algorithm has been incorporated into our CAD tool to allow the designer to draw the DFG and click on buttons to view critical loops (in thicker lines), scheduling ranges, level diagrams and processor assignments. See Fig. 7(a)-(c), where the FSASAP scheduling is used. Due to the limited size of the display window, we show the iteration numbers underneath each node number. The number of iterations shown in Fig. 7(c) is two, barely large enough to cover node n2, which has the longest node execution time. The iteration number of ni relative to nr is Hi/I. Note that among all five nodes, only one node is not on the critical loop, and that the level diagram can be trivially constructed. Thus, there is no need to adjust the level diagram to reduce the number of levels and hence the number of processors. This is indeed true for many benchmarks in which most nodes are in nontouching critical loops.

See Tables 1-3, where the program was run on a SUN SPARC 10 at 40 MHz with 32 MB of main memory. Tables 1 and 2 compare ART [16], Barnwell's result [19], and FIT for benchmarks with DFGs smaller than those in Table 3. Both tables indicate that ART and FIT use the same number of processors, which is fewer than that used by [19]. FIT uses no more than 0.01 second of CPU time, which is shorter than that of either ART or [19].

Table 3 shows a comparison between ART and FIT for benchmarks with large DFGs. FIT uses fewer processors and also runs much faster than ART. It is interesting to observe that for large benchmarks (Table 3), there is no need to perform optimize() to reduce the number of processors. This is because in such cases, there are many type n nodes to fill empty holes. The number of processors used in this case depends on how efficiently we pack type n nodes.

ART, developed by Wang et al. [16], runs faster than all the above schemes except FIT. Their method minimizes the number of processors and is implemented in a CAD tool called ART. The idea is that all fully static periodic schedules must be upper bounded by a cutoff time. Comparing all such schedules allows one to find a schedule that minimizes the number of processors. This finite cutoff time also leads to faster scheduling, as shown by their benchmark results. However, their worst-case time complexity is O(n²(d + e + 1) + nd), smaller than O(n³) used by [1] and our O(n), where n (e) is the total number of nodes (edges).

FIT is fast because the FSASAP and FSALAP time schedules and scheduling ranges come directly from the final matrix, while ART has to use graphs to redo some computations similar to those involved in the final matrix. In addition, ART incurs extra overhead in finding Tcutoff, within which it searches for the optimum time schedule. [1], on the other hand, relies on repeated schedule updates to fix the time schedule. Further, we do not need to search very much for empty holes, as we can arrange the nodes on a critical loop contiguously in the same levels.

FIT uses fewer processors because we schedule nodes on a critical or subcritical loop contiguously, more or less in the same levels, which tends to reduce the number of fragmented holes. Further, we eliminate as many fragmented holes as possible using type n nodes, that is, nodes not in any loop, which have infinitely large scheduling ranges and can be inserted anywhere in the level diagram. If the fragmented holes are too small to fit type n nodes, we squeeze out the fragmented holes by shifting all but one non-n type node into higher levels, and fill the remaining empty region completely with type n nodes. This creates the largest continuous filled region and hence leads to higher processor utilization, so that fewer processors are needed.
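The hole-filling idea can be sketched as follows. This is a minimal illustration, not the packing routine used in FIT: it assumes that each level (processor) is described by its busy intervals within one iteration period, that type n nodes may be placed at any time, and it swaps in a simple first-fit-decreasing heuristic. The names `first_gap` and `fill_holes` and the example data are hypothetical, and the squeezing of fragmented holes by shifting other nodes is not modeled.

```python
def first_gap(busy, period, length):
    """Return the earliest start of a free gap of at least `length`, or None."""
    t = 0
    for start, dur in sorted(busy):
        if start - t >= length:
            return t
        t = max(t, start + dur)
    return t if period - t >= length else None


def fill_holes(levels, period, free_nodes):
    """Place type n nodes (name -> execution time) into the holes of `levels`.

    `levels` is a list of per-processor lists of (start, length) busy intervals
    and is updated in place; a new level is opened only when no hole fits.
    Returns a dict name -> (level index, start time).
    """
    placement = {}
    # Longest nodes first, so that large holes are not wasted on short nodes.
    for name, length in sorted(free_nodes.items(), key=lambda kv: -kv[1]):
        if length > period:
            raise ValueError("node longer than one period is not handled here")
        for level, busy in enumerate(levels):
            start = first_gap(busy, period, length)
            if start is not None:
                break
        else:
            levels.append([])
            level, start = len(levels) - 1, 0
        levels[level].append((start, length))
        placement[name] = (level, start)
    return placement


# Hypothetical level diagram with period 6: level 0 has a hole at [2, 4),
# level 1 has a hole at [3, 6).
levels = [[(0, 2), (4, 2)], [(0, 3)]]
placement = fill_holes(levels, 6, {"a": 2, "b": 3})
print(placement, len(levels))
```

Here both type n nodes fit into existing holes, so the number of levels, and hence of processors, stays at two.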

Most benchmarks yield the optimal schedule under our fixed-time scheduling without scheduling updates. A few cases require scheduling updates to further reduce the number of processors. An example, a fourth-order all-pole lattice filter benchmark, is shown in Fig. 8. Note that nodes n16 and n17 are input and output nodes, respectively. They are created artificially, and their execution times are zero so that they will not be considered in processor assignment. Among the total of 15 nodes, 12 are in two nontouching critical loops and can be scheduled easily on two processors. The remaining 3 nodes can be assigned to two levels (Fig. 8(b)), which can be reduced to one by shifting forward the scheduling time of n14 by one time unit, as shown in Fig. 8(c). Note that there is no need to update the scheduling ranges as in [1]. As a result, the time complexity is greatly reduced. The scheduled times of the other nodes are not affected. This is because the corresponding mobilities are among the largest. Note that the higher the level, the fewer the nodes; hence it takes less time to remove, if possible, the last level.
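The removal of the last level can be illustrated with a small sketch. It assumes integer time units and that a node may be delayed by any amount up to its mobility without affecting other nodes; the function name `shift_into_level` and the numbers below are hypothetical and are not the actual times of Fig. 8.

```python
def shift_into_level(busy, asap_start, length, mobility):
    """Return the earliest start in [asap_start, asap_start + mobility] at which a
    node of the given length fits into a level without overlapping `busy`, or None.

    `busy` is a list of (start, length) intervals already assigned to the level.
    """
    for shift in range(mobility + 1):
        start = asap_start + shift
        end = start + length
        if all(end <= b_start or start >= b_start + b_len
               for b_start, b_len in busy):
            return start
    return None


# Hypothetical lower level that is busy during [0, 2) and [3, 4).
lower_level = [(0, 2), (3, 1)]

# A unit-time node scheduled at time 1 collides with [0, 2); delaying it by
# one time unit lets it join the lower level, so its own level can be removed.
new_start = shift_into_level(lower_level, asap_start=1, length=1, mobility=2)
print(new_start)
```

If no shift within the mobility fits, the function returns None and the extra level has to be kept.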

9. CONCLUSIONS

We have supplemented the scheduling theory of Heemstra de Groot et al. by showing that the final matrix is useful not only for finding the iteration bound and critical loop, but also for deriving explicit formulas for the slack time, the scheduling range and its update, and the initial scheduling that avoids transient periods. Thus, the theory presented in this paper eliminates some redundant steps (e.g., inequality graphs are no longer needed) from the scheduling algorithm of [1]. We have proved that both the steady-state ASAP and ALAP schedulings satisfy the firing rule, and that they are static rate-optimal. Since the start execution times of all the nodes are fixed, we can simplify the scheduling algorithm in [1] by eliminating most of the scheduling updates. We further simplify the scheduling by scheduling critical loops ahead of subcritical loops, since a major portion of the nodes are in nontouching critical loops. We have also considered the case of large node execution times, which was not considered in [1]. As in [1], no unfolding is required, in contrast to [2], where unfolding takes extra time and space. However, we did not consider the case in which I is a fraction, as in [2]. It is expected that, with slight modification, our simplified scheduling algorithm can handle such fractional cases.

We have incorporated the theory presented above into our earlier CAD tool, which can find iteration bounds and critical and subcritical loops. This results in a single tool for DFG scheduling, unlike [1] and [2]. Finally, we have enhanced our tool to handle PERT charts, which do not have any loops, and to perform simulations to verify both timing and functional behaviors.

REFERENCES

1. S. M. Heemstra de Groot, S. H. Gerez, and O. E. Hermann, Range-chart-guided iterative data-flow graph scheduling, IEEE Transactions on Circuits and Systems, Vol. CAS-39, No. 5, 1992, pp. 351-364.

2. K. K. Parhi, Algorithm transformation techniques for concurrent processors, IEEE Proceedings, Vol. 77, 1989, pp. 1879-1895.

3. S. M. Heemstra de Groot, Scheduling techniques for iterative data-flow graphs, an approach based on the range chart, Ph.D. Dissertation, Univ. Twente, Faculty of Electrical Engineering, Dec. 1990.

4. J. Blazewicz, Selected topics in scheduling theory, Surveys in Combinatorial Optimization, P. L. Hammer, Ed., Amsterdam, North-Holland, 1987, pp. 1-59.

5. W. A. Kohler, A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems, IEEE Transactions on Computers, Vol. C-24, 1975, pp. 1235-1238.

6. D. A. Schwartz, Synchronous multiprocessor realizations of shift invariant flow graphs, Ph.D. Dissertation, Technical Report DSPL-85-2, Georgia Institute of Technology, July 1985.

7. S. M. Heemstra de Groot and O. E. Hermann, Evaluation of some multiprocessor scheduling techniques of atomic operations for iterative DSP graphs, in Proceedings of European Conference on Circuit Theory and Design, 1989, pp. 400-404.

8. M. Renfors and Y. Neuvo, Fast multiprocessor realization of digital filters, in Proceedings of International Conference on Acoustics, Speech, Signal Processing, 1989, pp. 916-919.

9. M. Renfors and Y. Neuvo, The maximum sampling rate of digital filters under hardware speed constraints, IEEE Transactions on Circuits and Systems, Vol. CAS-28, No. 2, 1981, pp. 196-202.

10. A. Fettweis, Realizability of digital filter networks, IEEE Proceedings, Vol. 77, No. 12, 1989, pp. 1879-1895.

11. D. A. Schwartz and T. P. Barnwell, Cyclo-static multiprocessor scheduling on the optimal realization on shift invariant data flow graphs, in Proceedings of International Conference on Acoustics, Speech, Signal Processing, 1985, pp. 1384-1387.

12. Y. Yaw, B. Wei, C. V. Ramamoorthy, and W. T. Tsai, Extensions on performance evaluation technique for concurrent systems, International Computer Software and Application Conference, 1988, pp. 480-484.

13. S. H. Gerez, S. M. Heemstra de Groot, and O. E. Hermann, A polynomial time algorithm data flow graphs, IEEE Transactions on Circuits and Systems, CAS-I, Vol. 40, 1993, pp. 629-634.

14. D. Y. Chao and D. T. Wang, Iteration bounds of single-rate data flow graphs for concurrent processing, IEEE Transactions on VLSI, Vol. 3, No. 3, 1995, pp. 393-403.

15. R. W. Floyd, Algorithm 97: shortest path, Communications of ACM, Vol. 5, No. 6, 1962, p. 345.

16. D. J. Wang and Y. H. Hu, Multiprocessor implementation of real-time DSP algorithms, IEEE Transactions on VLSI, Vol. 3, No. 3, 1995, pp. 393-403.

17. L. G. Jeng and L. G. Chen, Rate-optimal DSP synthesis by pipeline and minimum unfolding, IEEE Transactions on VLSI, Vol. 2, No. 1, 1994, pp. 81-87.

18. C. V. Ramamoorthy and G. S. Ho, Performance evaluation of asynchronous concurrent systems using Petri nets, IEEE Transactions on Software Engineering, Vol. SE-6, No. 5, 1980, pp. 440-449.

19. P. R. Gelabert and T. P. Barnwell III, Optimal automatic periodic multiprocessor scheduler for fully specified flow graphs, IEEE Transactions on Signal Processing, Vol. 41, 1993, pp. 858-888.

20. R. Bellman, On a routing problem, Quarterly of Applied Mathematics, Vol. 16, No. 1, 1958, pp. 87-90.

21. L. R. Ford, Jr. and S. M. Johnson, A tournament problem, The American Mathematical Monthly, Vol. 66, 1959, pp. 387-389.

22. C. Y. Wang and K. K. Parhi, Dedicated DSP architecture synthesis using the MARS design system, in Proceedings of International Conference on Application of Special Array Processing, 1992, pp. 21-36.

23. D. Y. Chao, Performance of multi-rate data flow graphs for concurrent processing, Journal of Information Science and Engineering, Vol. 13, No. 1, 1997, pp. 85-123.

Daniel Y. Chao received the Ph.D. degree in electrical engineering and computer science from the University of California, Berkeley, in 1987. From 1987 to 1988, he worked at Bell Laboratories. In 1988, he joined the computer and information science department of New Jersey Institute. In 1994, he joined the MIS department of NCCU as an associate professor. In February 1997, he was promoted to full professor. His research interest was in the application of Petri nets to the design and synthesis of communication protocols. He is now working on CAD implementation of a multi-function Petri net graphic tool. He has published 77 papers (including 19 journal papers) in the areas of communication protocols, Petri nets, DQDB, networks, FMS, data flow graphs, and neural networks.
