Parafrase-2, a well-known parallelizing compiler developed at the University of Illinois, is a source-to-source multilingual restructuring compiler intended as a reliable, portable, easily extended, and powerful research tool for exploring program transformations in parallelizing compilers. A parallelizing compiler normally consists of two principal components, the front-end and the back-end. The front-end performs scalar analysis and data dependence analysis, which detect dependence relations between procedures or statements and extract the parallelizable code segments from which the back-end generates parallel executable code. The back-end generates parallel machine code for a multiprocessor system from the intermediate representation, using the analysis results gathered in the front-end [15,16]. Parafrase-2 supports preprocessors and postprocessors for C and FORTRAN together with an intermediate representation. The preprocessor transforms each input source program into the intermediate representation and its corresponding set of data structures, and the postprocessor recreates the source program. Parafrase-2 is driven by a pass list file consisting of several passes, each of which operates on the data structures and transforms its input into a form suitable for the subsequent passes. Written in C, Parafrase-2 has a convenient user interface, allows user interaction at several levels during the transformation process, and is highly portable.
A newer parallelizing compiler, Polaris, was developed at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois [2]. Polaris includes a powerful basic infrastructure for manipulating FORTRAN programs and a number of improved analysis and transformation passes, notably subroutine inline expansion, symbolic analysis, induction and reduction variable recognition, data dependence analysis, array privatization, and runtime analysis. The most important techniques implemented in Polaris resulted from a study of the effectiveness of commercial FORTRAN parallelizers. The Polaris developers compiled the Perfect Benchmarks, a collection of conventional FORTRAN programs representing the typical workload of high-performance computers, for the Alliant FX/80, an eight-processor multiprocessor popular in the late 1980s. For each program, they measured the quality of the parallelization by computing the speedup, that is, the ratio of the program's sequential execution time to the execution time of the automatically parallelized version. Their study showed that extending the four most important analysis and transformation techniques traditionally used for vectorization leads to significant increases in speedup. It is important to note, however, that the innovation of Polaris lies in its improved recognition of parallelism, which is a necessary step in porting programs to any parallel machine available today.
Because independently developing an entire compiler infrastructure is prohibitively expensive, compiler researchers benefit greatly from sharing their investments in infrastructure development. Toward that end, the Stanford compiler group has made the SUIF (Stanford University Intermediate Format) compiler system available to others. SUIF was developed as a platform for the group's research on compiler techniques for high-performance machines. It is powerful, modular, flexible, clearly documented, and complete enough to compile large benchmark programs. The group has successfully used SUIF for research on topics including scalar optimizations, array data dependence analysis, loop transformations for both locality and parallelism, software prefetching, and instruction scheduling. Ongoing research projects using SUIF include global data and computation decomposition for both shared and distributed address space machines, communication optimizations for distributed address space machines, array privatization, interprocedural parallelization, and efficient pointer analysis. The SUIF toolkit contains a variety of compiler passes. FORTRAN and ANSI C front-ends translate source programs into SUIF. The system includes a parallelizer that can automatically find parallel loops and generate parallelized code. A SUIF-to-C translator allows the parallelized code to be compiled on any platform to which the SUIF parallel runtime library has been ported. The system provides many features to support parallelization: data dependence analysis, reduction recognition, a set of symbolic analyses to improve the detection of parallelism, and unimodular transformations to increase parallelism and locality. Scalar optimizations such as partial redundancy elimination and register allocation are also included.
In our PPD, the ZIV/I test is used to check whether the linear equation formed by the array subscripts has an appropriate integer solution. We also proposed two ad hoc techniques that look for trivial contradictions among direction vectors to overcome the drawbacks of traditional subscript-by-subscript testing mechanisms. PPD can detect both DOALL loops and DOACROSS loops, the latter of which include synchronization directives. In Parafrase-2, only the GCD test and the Banerjee test are employed, in the builddep pass. These two tests are less accurate than the I test. Moreover, Parafrase-2 cannot detect the DOACROSS loops in a source program.
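To make the idea of subscript-based dependence testing concrete, the following is a minimal sketch of the classical GCD test, written in Python for illustration only. It is not the ZIV/I test used in PPD, and it ignores loop bounds, which is one reason the GCD test is less accurate. The function name `gcd_test` and the subscript form `a*i + b` versus `c*j + d` are assumptions made for this example.

```python
from math import gcd

def gcd_test(a: int, b: int, c: int, d: int) -> bool:
    """Conservative dependence check between A(a*i + b) and A(c*j + d).

    A dependence requires an integer solution of a*i - c*j = d - b.
    Such a solution exists if and only if gcd(a, c) divides (d - b).
    Loop bounds are ignored, so True means "dependence is possible",
    while False proves independence.
    """
    g = gcd(a, c)
    if g == 0:
        # Both coefficients are zero: the subscripts are constants
        # (the zero-index-variable case), so compare them directly.
        return b == d
    return (d - b) % g == 0

# A(2*i) is written and A(2*i + 1) is read: even and odd elements never meet.
print(gcd_test(2, 0, 2, 1))   # False, the references are independent
# A(4*i + 1) versus A(6*j + 3): gcd(4, 6) = 2 divides 2, so a dependence may exist.
print(gcd_test(4, 1, 6, 3))   # True
```

A result of `True` only says that a dependence cannot be ruled out; a more precise test that takes the loop bounds into account may still prove independence.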
Traditionally, a parallelizing compiler schedules loops by using only one scheduling algorithm, either static or dynamic. However, programs contain different kinds of loops, including loops with uniform, increasing, decreasing, and random workloads, and each scheduling algorithm performs well on a different loop style. To reduce the scheduling overhead and improve load balancing, a knowledge-based approach is a feasible solution for parallel loop scheduling.
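The difference between static and dynamic scheduling can be illustrated by the chunks of iterations each one hands out. The sketch below, in Python, contrasts static block partitioning with guided self-scheduling, a well-known dynamic scheme chosen here purely as an example; the text above does not name it. The function names and the (start, end) chunk representation are assumptions of this sketch.

```python
from math import ceil

def static_chunks(n_iters: int, n_procs: int) -> list[tuple[int, int]]:
    """Static scheduling: equal-sized blocks fixed before execution."""
    size = ceil(n_iters / n_procs)
    chunks = []
    start = 0
    while start < n_iters:
        end = min(start + size, n_iters)
        chunks.append((start, end))      # half-open range [start, end)
        start = end
    return chunks

def guided_chunks(n_iters: int, n_procs: int) -> list[tuple[int, int]]:
    """Guided self-scheduling: each request gets ceil(remaining / P) iterations.

    Chunks shrink as the loop drains, trading a few more dispatches
    for better load balance near the end of the loop.
    """
    chunks = []
    start = 0
    remaining = n_iters
    while remaining > 0:
        size = ceil(remaining / n_procs)
        chunks.append((start, start + size))
        start += size
        remaining -= size
    return chunks

print(static_chunks(100, 4))
print([end - start for start, end in guided_chunks(100, 4)])
```

With a uniform workload the four static blocks finish together with minimal overhead, whereas with an increasing or random workload the shrinking chunks of the guided scheme let idle processors pick up remaining work. This is the kind of trade-off a scheduler selection step has to weigh.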
We propose an approach that integrates existing static and dynamic scheduling algorithms and makes good use of their advantages. The approach chooses an appropriate scheduling algorithm based on features that include the loop style, loop bounds, system status, data locality, and synchronization overhead, and then applies the chosen algorithm to schedule the DOALL loop on the processors. In this paper, we concentrate on the fundamental phase, parallel loop scheduling, in parallelizing compilers running on multiprocessor systems. The resulting model uses knowledge-based techniques to integrate existing loop scheduling algorithms and to make good use of their abilities in extracting more parallelism.
Experimental results show that IPLS (intelligent parallel loop scheduling) obtains high speedup on multiprocessors.
Furthermore, our approach is easier to maintain and extend than the other approaches.
In addition, a runtime technique based on the inspector–executor scheme is proposed to find the available parallelism in loops. Our inspector determines the wavefronts by building a DEF-USE table for each loop of a program. The inspector's search for the wavefronts can be fully parallelized without any synchronization. Our executor then executes the loop iterations of each wavefront concurrently. Additionally, our compiler is highly modularized, so that porting it to other platforms is easy, and it can partition parallel loops into multithreaded code based on several loop-partitioning algorithms. The experimental results show that the compiler achieves good speedup on the Windows NT OS.
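The inspector–executor idea can be sketched for a loop of the form `A(w(i)) = f(A(r(i)))`, whose index arrays are known only at run time. The Python sketch below is a simplified, sequential illustration under our own assumptions: the dictionaries `def_wf` and `use_wf` stand in for a DEF-USE table, and the function names are hypothetical. It does not reproduce the synchronization-free parallel inspector described above.

```python
from collections import defaultdict

def inspect(writes, reads):
    """Inspector: assign a wavefront number to every iteration.

    def_wf[x] is the wavefront of the latest iteration that defined A[x];
    use_wf[x] is the highest wavefront of any iteration that used A[x].
    """
    def_wf, use_wf = {}, {}
    wavefront = []
    for w, r in zip(writes, reads):
        level = 1 + max(
            def_wf.get(r, 0),   # flow dependence: read after write
            def_wf.get(w, 0),   # output dependence: write after write
            use_wf.get(w, 0),   # anti dependence: write after read
        )
        wavefront.append(level)
        def_wf[w] = level
        use_wf[r] = max(use_wf.get(r, 0), level)
    return wavefront

def execute(a, writes, reads, wavefront, f):
    """Executor: run wavefronts in order.

    Iterations inside one wavefront are mutually independent, so a real
    executor could run them concurrently; here they run one by one.
    """
    by_level = defaultdict(list)
    for i, level in enumerate(wavefront):
        by_level[level].append(i)
    for level in sorted(by_level):
        for i in by_level[level]:
            a[writes[i]] = f(a[reads[i]])
    return a

writes = [0, 1, 2, 3, 1]
reads = [4, 0, 5, 1, 2]
wf = inspect(writes, reads)
print(wf)  # [1, 2, 1, 3, 4]
print(execute(list(range(6)), writes, reads, wf, lambda x: x + 1))
```

In this example iterations 0 and 2 share the first wavefront and could run in parallel, while the remaining iterations are serialized by flow, output, and anti dependences.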
5. CONCLUSIONS
This paper describes the design and implementation of an efficient parallelizing compiler that parallelizes loops and achieves high speedup on multiprocessor systems. We first introduced the design of a portable FORTRAN parallelizing compiler (PFPC) for a multiprocessor system running the multithreaded operating system OSF/1. The main contributions of this paper are as follows.
A model of a FORTRAN parallelizing compiler on multithreading OSF/1 has been proposed. This paper has also reviewed the practical parallel loop detector (PPD), which is implemented in PFPC to find the parallelism in loops. Furthermore, when DOACROSS loops are present, their synchronization statements are optimized. Experimental results have shown that PPD is more reliable and accurate than previous approaches. In addition, a new model using knowledge-based techniques was proposed to exploit more loop parallelism. The knowledge-based approach integrates existing loop scheduling algorithms to make good use of their abilities in extracting more parallelism.
Experimental results show that IPLS (intelligent parallel loop scheduling) obtains high speedup on multiprocessors.
Furthermore, our approach is easier to maintain and extend than the other approaches.
In addition, a runtime technique based on the inspector–executor scheme was proposed to find the available parallelism in loops. Our inspector determines the wavefronts by building a DEF-USE table for each loop of a program. The inspector's search for the wavefronts can be fully parallelized without any synchronization. Our ultimate goal is to construct a high-performance and portable FORTRAN parallelizing compiler for shared-memory multiprocessor systems.
ACKNOWLEDGEMENT
This work was supported in part by the National Science Council of the Republic of China under grant nos. NSC86-2213-E009-081 and NSC87-2213-E009-023.
REFERENCES
1. Zima HP, Chapman B. Supercompilers for Parallel and Vector Computers. Addison-Wesley Publishing and ACM Press: New York, 1990.
2. Blume W, Eigenmann R, Hoeflinger J, Padua D, Petersen P, Rauchwerger L, Tu P. Automatic detection of parallelism: A grand challenge for high-performance computing. IEEE Parallel & Distributed Technology 1994; 2(3):37–47.
3. Wolfe M. High-Performance Compilers for Parallel Computing. Addison-Wesley Publishing: New York, 1995; 137–162.
4. Yang CT, Tseng SS, Chen CS. The anatomy of parafrase-2. Proceedings of the National Science Council Republic of China (Part A) 1994; 18(5):450–462.
5. Cooper KD et al. The ParaScope parallel programming environment. Proceedings of the IEEE 1993; 81(2):244–263.
6. Wilson RP et al. SUIF: An infrastructure for research on parallelizing and optimizing compilers. ACM SIGPLAN Notices 1994; 29(12):31–37.
7. Boykin J, Kirschen D, Langerman A, LoVerso S. Programming under Mach. Addison-Wesley Publishing: New York, 1993.
8. Yang CT, Tseng SS, Hsiao MC. A model of parallelizing compiler on multithreading operating systems. International Journal of Modelling and Simulation 1998; 18(1):9–15.
9. Yang CT, Tseng SS, Hsiao MC, Kao SH. A portable parallelizing compiler with loop partitioning. Proceedings of the National Science Council Republic of China (Part A) 1999; 23(6):751–765.
10. Yang CT, Wu CT, Tseng SS. PPD: A practical parallel loop detector for parallelizing compilers on multiprocessor systems. IEICE Transactions on Information and Systems 1996; E79-D(11):1545–1560.
11. Yang CT, Tseng SS, Chuang CD, Shih WC. Using knowledge-based techniques on loop parallelization for parallelizing compilers. Parallel Computing 1997; 23(3):291–309.
12. Fann YW, Yang CT, Tsai CJ, Tseng SS. IPLS: An intelligent parallel loop scheduling for multiprocessor systems. Proceedings of the ICPADS’98, Tainan, Taiwan, 1998; 751–782.
13. Yang CT, Tseng SS, Hsieh MH, Kao SH. An efficient run-time parallelization for do loops. Journal of Information Science and Engineering—Special Issue on Compiler Techniques for High-Performance Computing 1998; 14(1):237–253.
14. Rauchwerger L, Amato NM, Padua D. Run-time methods for parallelizing partially parallel loops. Proceedings of the 1995 International Conference on Supercomputing, Barcelona, Spain, 1995.
15. Polychronopoulos CD. Parallel Programming and Compilers. Kluwer Academic Publishers, 1988.
16. Polychronopoulos CD. Parafrase-2: An environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors. International Journal of High Speed Computing 1989; 1(1):45–72.