Distributed Data Structure Design for Scientic Computation
Jan-Jan Wu
Institute of Information Science Academia Sinica
Taipei, Taiwan 11529, R.O.C.
[email protected]
Pangfeng Liu
Department of Computer Science National Chung Cheng University
Chia-Yi, Taiwan 62107, R.O.C.
[email protected]
AbstractThis paper gives an overview of the VGDS (Virtual Global Data Structure) project. The VGDS eort focuses on de- veloping an integrated, distributed environment that allows fast prototyping of a diverse set of simulation problems in scientic and engineering domains, including regular, irregular, and adap- tive problems. The framework denes three base libraries, Array, Graph, and Tree, that capture major data structures involved in scientic computation. The framework denes multiple layers of class libraries which work together to provide data-parallel rep- resentations to application developers while encapsulate parallel implementation details into lower layers of the framework. The layered approach enables easy extension of the base libraries to a variety of application-specic data structures. Experimental results on a Sun UltraSparc workstation cluster is reported.
Keywords: distributed data structures, parallel scientic compu- tation, object oriented framework
1 Introduction
With the advance of commercial o-the-shelf microproces- sors and interconnection technology, distributed-memorypar- allel machines (e.g. MPPs, workstation clusters, and SMP clusters) have become an attractive computing platform for solving large problems. In the past few years, a large number of scientic computing applications have benet from such parallel computers. Despite this success, parallel program- ming for distributed-memory machines remains a dicult task, mainly because it often involves many intricate details that are error-prone and dicult to debug. This has become the major bottleneck of HPC software development. This is particularly evident in development of irregular and adap- tive applications. The goal of this eort has been to identify and nurture one enabling technology for high performance scientic computation: object-oriented construction of vir- tual global data structures.
Fortran 90 and High Performance Fortran (HPF) are counted among HPC's software successes, because they demon- strate the utility of parallel arrays for hiding low-level dis- tributed memory details. This is possible because array structures are used in highly idiomatic ways in scientic ap-
plications, and compiler and library writers exploit those idioms to extract high performance|an HPF array being just one instance of a virtual global data structure.
The uses of irregular and adaptive data structures in commercial and scientic applications are even more styl- ized, precisely because of their irregularity and similarity.
For example, most of the tree-based scientic codes use simi- lar tree structures and exhibit similar computation patterns.
A uid mechanics code and a molecular dynamics code may dier only in the interaction formula. The tree structures are basically the same except for the data stored in tree nodes and the implementation-dependenttree representation. Fur- thermore, dierent simulation algorithms may use the same data structure. For instance, fast multipole method and Barnes-Hut's algorithm use the same oct-tree structure { they dier only in how they manipulate the trees. Therefore, a general tree framework helps in developing tree codes for dierent application domains, and in implementing dierent tree algorithms as well. Another example is unstructured mesh computation. Although unstructured mesh compu- tations may perform dierent calculations according to the systems being simulated, they use the mesh virtually the same way { A mesh point updates its stored data, which represent some physical quantities at that point, by retriev- ing the data from its neighbors and performing calculations on them. Such property gives distributed data structure de- signers the ability to map them to HPC architectures and extract their parallelism.
In this paper, we present a data structuring framework, Virtual Global Data Structures, for parallel and distributed scientic computation. The VGDS framework denes three base libraries, Array, Graph, and Tree, that capture ma- jor data structures involved in scientic and engineering computation. The framework implemented distributed data structures as object-oriented classes with explicit associated method interfaces. Application-specic data structures can be derived from these base libraries through class inheri- tance.
In the past few years, many research eorts have been devoted to designing ecient numerical libraries for matrix- based parallel computation (e.g. MultiMATLAB [20], PETSc [16], and ScaLAPACK [7]). Research eorts in providing suitable object-oriented parallel languages/libraries for cer- tain classes of applications have also been abundant [4, 8, 3, 13, 21]. Our VGDS eort distinguishes itself from others in two aspects. First, instead of tackling one particular data structure, we propose an integrated framework incorporat- ing a diverse set of data structure classes that are essential in scientic and engineering computation. These include
regular (in which data reference patterns are uniform), ir- regular (in which data reference patterns are non-uniform), and adaptive (in which data reference patterns keep chang- ing dynamically and incrementally) applications.
Secondly, we use layered object-oriented design and anal- ysis in the construction of the VGDS base libraries. System objects in the upper layers of the framework are relevant to application specic domains (e.g. computational uid dynamics simulation, molecular dynamics simulation, etc.) while objects lower in the framework capture the abstrac- tion of parallelism and ecient processor-level computation.
This layered approach provides a natural breakdown of re- sponsibility in designing a complete HPC system, and allows design eort and heavy-duty optimization to be easily ex- pended exactly where it is most needed.
This paper reports on the progress of our VGDS eort.
Section 2 describes the organization and functionality of the VGDS framework. Section 3 discusses some implementa- tion details of virtual global data structures in distributed- memory environments. Section 4 illustrates the core func- tionality of the VGDS framework using the Tree library as a running example. Section 5 reports our experimental re- sults on a Sun UltraSparc-2 workstation cluster. Section 6 describes related work and Section 7 concludes.
2 The VGDS Framework
The VGDS framework denes three layers of C++ classes:
the global layer, the parallel abstraction layer, and the lo- cal layer. Layers of application components can be built atop the VGDS basis. Figure 1 depicts the structure of this framework.
(Uniprocessor/Workstation Clusters/MPP) Machines
Array, Record, Pointer) (Integer, Real, Double, Base Language Data Types
Domain decomposition, Interprocessor communication
Local Layer
Communication Primitives MPI or
Primitive Layer Global Layer
basis VGDS
3D fast multipole method Nbody simulation parallel financial modeling CFD simulations
Parallel Abstraction Layer Adaptive multigrid methods
Local VGDS Data Types (LocArray, LocGraph, LocTree) Global VGDS Data Types (Array, Graph, Tree)
(Array-based) (Graph-based) (Tree-based)
s
nonadaptive multigridmethod linear algebra computation
solid mechanics simulations
Application Layer
Figure 1: The VGDS Framework
The global and local layers together dene various vir- tual global operations on regular (such as arrays), irreg- ular (such as graphs, unstructured meshes), and adaptive (such as adaptive trees, and adaptive grids) data structures.
The global layer denes global data types. Objects in the global layer are bookkeepers that delegate computational tasks to the local layer. The local layer implements generic, processor-local computational kernels for each VGDS com- ponent. The interactions between the global and the local layers are mediated by the parallel abstraction layer that captures the abstraction of parallelism, including data de- composition, interprocessor communication, and load bal- ancing.
Currently, the VGDS framework provides three basic
data structures: Array(for regular computation),Graph(for irregular graph-based computation), andTree(for adaptive tree-based computation). Application-specic data struc- tures can be derived from these basic data structures. For instance, a dense matrix class can be derived from theArray class by inheriting the Array class and dening additional methods essential in dense matrix computation. Table 1 outlines the classes and functionality of each layer, details of which are described in the following sections.
Table 1: VGDS Framework Functionalities
Layers Base ClassesDerived Functionality
Array Grid
Matrix data-parallel
Global Graph UMesh operations
Tree BHTreeAdaptGrid LocArray LocGrid
LocMatrix processor-local
Local LocGraph LocUMesh operations
LocTree LocBHTree LocAdaptGrid
Mapper BlockPartitioner data layoutmanagement
Parallel ORBPartitioner
Abstraction Communicator inter-processorcommunication
Message messageabstractions
2.1 Global and Local Layers
The VGDSs within the global layer provide a global view of the data, in which the data structure is treated as a mono- lithic whole, with operators that manipulate individual ele- ments and implicitly iterate over substructures. In the local view (the local layer), each processor contains only a part of the whole, with operators acting only on the local data.
When a global data structure is instantiated, it creates a constituent local data substructure on each processor. When- ever a kernel operator associated with the data structure is invoked, the operation is carried out by rst retrieving the handles to the local data, then delegating complete local computation to each local data substructure. If communi- cation is required, it is performed through system objects in the parallel abstraction layer.
2.2 Parallel Abstraction Layer
The parallel abstraction layer denes classes for data layout, interprocessor communication, and load balancing for vir- tual global data structures. Classes in this layer are imple- mented as abstract classes and can be shared among various data structures. The key features of this layer are encapsu- lated into two groups of classes { data decomposition classes that are responsible for processor geometry, data partition- ing and mapping, and load balancing, and communication classes that take care of data movement between processors.
Data Decomposition Classes
The global data structure are partitioned into local sub- structures on each processor according to theMapperclass.
Mapperis an abstract class that dene common service inter- face for nding the geometry of a VGDS, identifying global neighbor relations between their constituent local substruc- tures, and deriving logical send- and receive-sets for a given global subscript resolved into the local substructure. Con- crete mapping classes that are derived fromMapperprovide domain specic information and functionality that can be tuned to the need of the specic data structure. For ex- ample, VGDS supports a BlockPartitioner for arrays and aORBPartitioner (Orthogonal Recursive Bisection) for ir- regular and adaptive data structures. By instantiating the
Mapperclass, the user can also construct customized data decomposition strategies.
Communication Classes
Two groups of classes are implemented to support portable, transparent message-passing communication on distributed- memory machines {MessageandCommunicator. TheMes- sageclass is used to encapsulate data in a common format for easy data delivery and retrieval of dierent data structures.
TheCommunicatoris an abstract class that denes common service interfaces for buer allocation, message delivery, and data handling related to communication. These services are encapsulated into three methods: extract, communicate, andprocess. Communicating data elements between pro- cessors are performed in three steps. First, theCommuni- cator extracts data elements for sending by traversing the specied region in the VGDS data object and packing data elements into a Message object. Then theCommunicator delivers(communicates) theMessageobject according to the given communication scheduling algorithm. When a Mes- sage object is received, theCommunicator unpacks it and stores the data elements to the appropriate locations in the VGDS data object. The extraction and the restoring pro- cess requires interaction with the VGDS data object. The methodcommunicate is implemented on top of MPI, to as- sure portability.
Figure 2 depicts the interactions between major classes in the VGDS framework. When a VGDS data object in- vokes a method that requires remote data accesses, the data object consults theMapperobject for the identiers of the processors on which the global subscripts are mapped, and inserts them into a list of sends and receives (called com- munication schedule). The data object then requests the Communicator to carry out the planned data movement.
During the course of computation, if the VGDS data object decides that a remapping is necessary (e.g. for load balanc- ing purpose), it invokes the remap method inMapper, which in turn redistributes the data structure incrementally.
VG data object
Message Communicator,
Mapper load balancing
remapping
schedule communication data movement
Figure 2: Interaction of Classes in the VGDS Framework 3 Data Coherence and Synchronization
Since a virtual global data structure is distributed over local memories of processors, in order to eect the same computation as in the global view, the local computations must be coordinated. We adopt the owner-computes rule, which distributes computations according to the mapping of data across processors. However, a local substructure may require information from other processors to complete the computation of data assigned to it. When communications mostly occur between neighboring processors and the same communication patterns may occur many times during pro- gram execution, it is more ecient to duplicate boundary
P0 P1
P0 P1
P0
P0 P1
P1
P0 P1
P0 P1
(a) duplication for regular array structures
(b) duplication for unstructured meshes (irregular data structures)
(c) duplication for adaptive tree structures
Figure 3: Duplication for distributed data structures. The duplicated data are indicated by solid black.
data elements on adjacent processors. For example, in an unstructured mesh computation where the new data value of a mesh node is a function of its neighbors, by duplicating boundary mesh nodes to the other side of partitioning lines, computations on the local submeshes on individual proces- sors can all be performed locally without communication.
In reality, data elements may be read or updated, which raises the issues of data coherence and synchronization. We describe our approach next.
We classify the data into two categories,mastercopy and duplication. A master copy is a data region in the original global structure that is mapped to a processor. A master copy can make copies of itself, called duplication, on other processors. That is, all the data elements that are essential to the computations of the local master copies will be fetched into the local substructure on the processor which owns the master copies. As far as each master copy is concerned, there is no distinction between global and local structures. Note that we do not have the notion of global pointers because all the pointers address a local memory address, be it a master copy or a duplication. The computations read and update the master copy only { the duplications only provide data and are read-only. Therefore, data coherence is guaranteed by allowing only the master copy to be updated, and only one master copy exists for one data element.
Figure 3 shows the duplication mechanism for a regular array, an unstructured mesh, and an adaptive Barnes-Hut tree for N-body algorithms. We assume that the computa- tion of each element in the regular array and the unstruc- tured mesh requires its neighbors, and the per-particle force computation of the Barnes-Hut algorithm requires a traver- sal on the adaptive Barnes-Hut tree.
To assure synchronization, data elements are duplicated before the actual computation is performed. After data are
partitioned, system objects in the parallel abstraction layer duplicate the data to the processors where they are essen- tial to the computation. A barrier synchronization separates the duplication process from the computation, assuring that all the data are available and the computation can proceed without any further communication. This mechanism guar- antees safety in a distributed environment.
4 Case Study: BH Tree
In this section, we use theTreeclass (and aBH Treeclass derived fromTree) as an example to illustrate the VGDS framework.
4.1 N-body problem and tree codes
Computational methods to track the motions of bodies which interact with one another have been the subject of exten- sive research for centuries. So-called \N-body" methods have been applied to problems in astrophysics, semiconduc- tor device simulation, molecular dynamics, plasma physics, and uid mechanics.
The problem can be simply stated as follows. Given the initial states of N bodies, compute their interactions ac- cording to the underlining physic laws, usually described by a partial dierential equation, and derive their nal states at timeT. Fast algorithms have been reported in [2, 5, 10, 19].
All theseN-body algorithms explore the idea that the eect of a cluster of particles at a distant point can be approxi- mated by a small number of initial terms of an appropriate power-series. To apply the approximation eectively, these so called \tree codes" organize the bodies into a hierarchy tree in which a particle can easily nd the appropriate clus- ters for approximation purpose.
4.2 The Barnes-Hut algorithm
We will focus on the Barnes-Hut algorithm as an exam- ple ofN-body tree code. The Barnes-Hut algorithm pro- ceeds by rst computing an oct-tree partition of the three- dimensional box (region of space) enclosing the set of parti- cles. The partition is computed recursively by dividing the original box into eight octants of equal volume until each undivided box contains exactly one particle. An example of such a recursive partition in two dimensions and the corre- sponding BH-tree are shown in Figure 4. Note that each internal node of the BH-tree represents a cluster. Once the BH-tree has been built, the mass and the location of the centers-of-mass of the internal nodes are computed in one phase up the tree, starting at the leaves.
. . . .. . . . . .
. . .
.
..
Figure 4: BH tree decomposition
To compute accelerations, we loop over the set of par- ticles observing the following rules. Each particle starts at
the root of the BH-tree, and traverses down the tree trying to nd clusters that it can apply center-of-mass approxima- tion. If the distance between the particle and the cluster is far enough, with respect to the radius of the cluster, then the acceleration due to that cluster is approximated by a single interaction between the particle and a point mass lo- cated at the center-of-mass of the cluster. Otherwise the particle visits each of the children of the cluster. Note that nodes visited in the traversal form a sub-tree of the entire BH-tree and dierent particles will, in general, traverse dif- ferent subtrees. The leaves of the subtree traversed by a particle will be calledessential datafor the particle because it needs these nodes for interaction computation.
Once the accelerations on all the particles are known, the new positions and velocities can be computed. The entire process, starting with the construction of the BH-tree, is now repeated for the desired number of time steps.
4.3 Parallel Implementation
To make the paper self-contained, we brie y describe our parallel implementation of the BH algorithm, upon which the BH-Tree library is built.
4.3.1 Data partitioning
The default strategy that we use to distribute bodies among processors isorthogonal recursive bisection(ORB). The space bounding all the bodies is recursively partitioned into as many boxes as there are processors, and all bodies within a box are assigned to one processor. Each separator divides the workload within the region equally. The ORB decom- position can be represented by a binary tree and is used as a map to locate points in space to processors.
4.3.2 Building the BH-tree in parallel
We construct the BH tree as follows. Each processor rst builds a local BH-tree for the bodies within its domain. At the end of this stage, the local trees will not, in general, be structurally coherent. The next step is to make the lo- cal trees structurally coherent with the global BH-tree by adjusting the levels of all leaves which are split by ORB bisectors.
Once level-adjustment is complete, each processor com- putes the centers-of-mass on its local tree without any com- munication. Next, each processor sends its contribution to an internal node to the owner of the node, dened as the processor whose domain contains the center of the internal node. Once the transmitted data have been combined by the receiving processors, the construction of the global BH-tree is complete.
4.3.3 Collecting essential data
Once the global BH-tree has been constructed it is possible to start calculating accelerations. It is signicantly easier and faster for a processor to rst collect all the essential data for its local particles, then compute the interactions the same way as in the sequential Barnes-Hut method since all the essential data are now available. In other words, the owner of a data must determine the area (calledin u- ence area) where its data might be essential, and send the data there. Formally, for every BH-node, the owner of computes an annular region calledin uence ringforsuch that those particles is essential to must reside within 's
in uence ring. Those particles that are not within the in u- ence ring are either too close touto apply center-of-mass approximation, or far away enough to useu's parent's infor- mation. With the ORB map it is straightforward to locate the destination processors to whichmight be essential.
4.3.4 Communication
The communication phases can all be abstracted as an \all- to-some" problem, in which each processor sends a set of personalized messages to dynamically determined destina- tion processors. Therefore, the communication pattern is irregular and dynamically changing.
VGDS employs a randomized protocol for all-to-some communication. The protocol alternates sends with receives to avoid exhausting communication channels reserved for messages that are sent but not yet received, and randomly permutes the destination so that any processor will not be ooded by incoming messages at any given time. In an earlier paper [12] we developed the atomic message model to investigate message passing eciency. Consistent with the theory, we nd that sending messages in random order worked best.
Figure 5 gives a high-level description of the parallel im- plementation structure. Note that the local trees are built only at the start of the rst time step.
Build local BH trees.
For every time step do:
1. Construct the BH-tree representation (a) Adjust node levels
(b) Compute partial node values on local trees (c) Combine partial node values at owning processors 2. Owners send essential data
3. Calculate accelerations
4. Update velocities and positions of bodies 5. Update local BH-trees incrementally
6. If the workload is not balanced update the ORB incrementally
Figure 5: Outline of code structure 4.4 The Tree Framework
To eliminate duplicated programming investments in devel- oping similar tree-based scientic codes, we have developed a VGDS tree frameowrk. The tree framework denes three layers of classes: base tree layer,Barnes-Hut tree layer, and application layer. Each latter layer is built on top of the for- mer one. Thebase tree layersupports simple tree construc- tion and manipulation methods. System programmers can build domain-specic tree libraries (e.g. Barnes-Hut Tree) using the classes in thebase tree layer (Sec 4.4.2). Appli- cation programmers can write programs using classes in the Barnes-Hut tree layer, or any other special library developed from thebase tree layer. Figure 6 depicts the hierarchy of tree classes and their associated methods.
4.4.1 Base tree layer
Thebase tree layeris the foundation of our framework from which complex tree structures can be derived. We dene basic tree manipulation methods in the base tree layer, in- cluding inserting a new child from a leaf, deleting an existing leaf, and performing parallel tree reduction and traversal.
Tree Tree_node Tree_reduction
Tree_traversal_with_traverser Tree_traversal
Compute_cluster_data Check_Particle_bh_box_consistency
Find_edata Interaction
BH_id Vector BH_tree_node
BH_tree Grav_BH_tree
Grav_BH_node Particle_cluster Grav_interaction Particle
Cluster
Application
BH_tree
Tree
Figure 6: The class hierarchy inbase tree,Barnes-Hut tree, andapplicationlayers.
template <class Data, const int n_children>
class Tree_node { protected:
Data *data;
Tree_node *children[n_children];
};
template <class Data, class Tree_node, class Tree, const int n_children>
class Tree_reduction { public:
virtual void init(Data*) = 0;
virtual void combine(Data *parent, Data* child) = 0;
void reduction(Tree* tree);
};
template <class Data, class Tree_node, class Tree, const int n_children, class Node_id>
class Tree_traversal { public:
virtual bool process(Data*) = 0;
void traverse(Tree *tree);
};
template <class Data, class Tree_node, class Tree, class Traverser>
class Tree_traversal_with_traverser :
public Tree_traversal<Data,Tree_node,Tree,N_CHILD,BH_id>
{ protected:
Traverser *traverser; // who is traversing?
};
Figure 7: Base tree layer classes.
Tree reduction computes the data of a tree node ac- cording to the data of its children, e.g. computing the center of mass in Barnes-Hut's algorithm. Tree traversal walks over the tree nodes and perform a user-dened operation (denoted asper node function) on each tree node (Figure 7).
For tree reduction, users are required to provide two functions:init(Data*)andcombine(Data *parent, Data*
child), which tell reduction class how to initialize and combine the data in tree nodes, respectively. The classData is the data type stored in each node of the tree on which the reduction operation is to be performed. For tree traver- sal, users are required to provide the per node functionbool
process(Data*)that is to be performed on every tree node.
4.4.2 Barnes-Hut tree
TheBH treelayer supports tree operations required in most of the N-body tree algorithms { it supports tree opera- tions common to both BH algorithm and fast multipole method, and all the special operations used in the Barnes- Hut method.
By extending the Treeclass, each tree node inBH tree contains a data cluster, and the data cluster of each leaf node contains a list of bodies. The types of the particle and cluster are given by the user of theBH treeclass as template
parametersAppClusterandAppBody. This abstraction cap- tures the structure of a BH tree without any application specic details.
template<class AppBody>
class Cluster {
protected: Link_list<AppBody*> body_list;
public: void add(AppBody* b);
};
template<class AppCluster, class AppBody>
class BH_tree : public Tree<AppCluster, N_CHILD> { public:
void insert_body(AppBody*);
void remove_body(AppBody*, Tree_node<AppCluster, N_CHILD>*);
};
template<class AppCluster, class AppBody, class Tree_node, class Tree, const int n_children>
class Compute_cluster_data: public
Tree_reduction<AppCluster, Tree_node, Tree, n_children>{
public:
void init(AppCluster* cluster) { cluster->reset_data();
if (cluster->get_type() == Leaf) for (every body in cluster's body_list)
cluster->add_body(body); }
void combine(AppCluster* parent, AppCluster* child) {parent->add_cluster(child);}
};
Figure 8: BH tree layer classes.
TheBH treeclass also supports several operations: com- puting cluster data, nding essential data, computing inter- action, and checking particle and BH box for consistency.
Cluster data computation is implemented as a tree re- duction (Figure 8). init(AppCluster* cluster)resets the data in the cluster and if the cluster is a leaf, it combines the data of the bodies from the body list into the data of the cluster. The other functioncombine(AppCluster* parent, AppCluster* child)adds children's data to parent's.
The essential data nding classFind edata inherits
Tree traversal with traverser with two additional lists for essential clusters and bodies (Figure 9). Thetraverseris the particle that collects essential data. The per node func- tionprocess(AppCluster*)inserts the clusters that can be approximated into essential clusters list, and adds the bodies from leaf clusters that cannot be approximated into
essential bodieslist. The traversal continues only when traverser cannot apply approximation on an internal cluster.
After collecting the essential clusters and bodies, a body can start computing the interactions. The computation class
Interaction(Figure 9) goes through the essential data list1 and calls for functions to compute body-to-body and body- to-cluster interactions dened by the user ofInteraction. 4.4.3 Application Layer
VariousN-body applications can be built upon theBh tree layer. We brie y describe the implemenation of the grav- itational N-body computation. First we construct a class
Particle for bodies that attract one another by gravity, then we build the cluster typeParticle clusterfromParticle (Figure 10). Next, in theParticle clusterclass we dene the methods for computing/combining center of mass and the methods for testing essential data.
Then, in classGrav interaction, which is derived from the class templateInteraction, we dene methods to com- pute gravitational interactions. We specify the gravitation interaction rules in the denition ofbody body interaction
andbody cluster interaction.
1Lists obtained from the classFind Edata.
template<class AppCluster, class AppBody, class Tree_node, class Tree>
class Find_edata: public Tree_traversal_with_traverser
<AppCluster,Tree_node,Tree,AppBody> { Link_list<AppBody*> essential_bodies;
Link_list<AppCluster*> essential_clusters;
public:
bool process(AppCluster* c) { if (c->is_edata_for(traverser)) {
essential_clusters.insert(c); return(0);
} else if (c->get_type() == Leaf) { for (every body in c's body list)
if (body != traverser)
essential_bodies.insert(body);
return(0);
} return(1); } };
template<class AppBody, class AppCluster, class Result>
class Interaction { AppBody *subject;
Link_list<AppBody*>* body_list;
Link_list<AppCluster*>* cluster_list;
Result result;
public:
void compute() { result.reset();
for (every body in body_list)
result += body_body_interaction(subject, body);
for (every cluster cluster_list)
result += body_cluster_interaction(subject,cluster);}
virtual Result body_body_interaction(AppBody*,AppBody*)=0;
virtual Result body_cluster_interaction(AppBody*, AppCluster*)=0;
};
Figure 9: Class for nding essential data and interaction computation.
Finally, we dene the BH-tree type Grav BH tree and tree node type Grav BH node. These two data types serve as template parameters to instantiate BH-tree related oper- ations, likeCompute cluster data,Find edata, andCheck
particle bh box consistency. 4.4.4 Parallel Abstraction Mapper class
TheMapper class denes a data partitioner (e.g. the ORB partitioner), a remapping method, and two associated geom- etry resolution functions: data to processor (that trans- lates a data coordinate to a processor domain) anddataset
to processors (that translates multiple data coordinates to a set of processor domains). In addition, it denes a simple data structure MappingTable to store the mapping information.
template <class Data, class DataSet, class ProcessorDomain, class MappingTable>
class Mapper { protected:
MappingTable table;
public:
virtual ProcessorDomain data_to_processor(Data*)=0;
virtual Link_list<ProcessorDomain>
dataset_to_processors(DataSet*)=0;
};
Communicator class
The Communicator class supports general purpose all-to- some communications. A Communicatorclass denes three methods: extract (that, when given a data pointer, con- structs outgoing data and packs them into a Message ob- ject), communicate (that sends outgoing messages and re- ceives incoming messages according to a randomized schedul-
class Particle { protected:
Real mass;
Vector position;
Vector velocity;
};
class Particle_cluster: public Cluster<Particle> { protected:
Center_of_mass center_of_mass;
public:
void reset_data(); // center of mass computation void add_body(Particle *p);
void add_cluster(Particle_cluster* child);
bool is_edata_for(Particle*); // find essential data };
class Grav_interaction:
public Interaction<Particle, Particle_cluster, Vector> { public:
Vector body_body_interaction(Particle*, Particle*);
Vector body_cluster_interact(Particle*,Particle_cluster*);
};
typedef Tree_node<Particle_cluster, N_CHILD> Grav_BH_node;
typedef BH_tree<Particle_cluster, Particle> Grav_BH_tree;
Figure 10: Classes for a gravitationalN-body application.
ing algorithm), andprocess(that unpacks the received mes- sages and performs appropriate action on the received data).
template <class Data, class DataPacket>
class Communicator { protected:
Link_list<Data*> *data_list[MAX_NUM_PROCESSORS];
DataPacket send_buffer[MAX_BUFFER_SIZE];
DataPacket receive_buffer[MAX_BUFFER_SIZE];
public:
void communicate();
virtual DataPacket extract(Data*)=0;
virtual process(DataPacket*)=0;
};
5 Experimental Results
The experiments were conducted on a cluster of four Ultra- Sparc2 workstations located in the Institute of Information Science, Academia Sinica. The workstations are connected by a fast Ethernet network capable of 100M bps per node.
Each workstation has 128 mega bytes of memory and runs SUNOS 5.5.1.
In the following, we report our preliminary experiences with a set of application programs, the Shallow Water code developed using the Array class, a airfoil simulation code developed using the preliminary Umesh class, and a gravi- tational Nbody simulation code and a vortex CFD code de- veloped using the BHTree class. The VGDS class libraries signicantly reduced the code sizes and development time of these applications (e.g. for each of the two tree codes, from over ten thousand lines down to a few hundred lines in code sizes and from over six months down to a few days in development time), compared with their message-passing C counterparts. Table 2 shows the performance of the VGDS codes. The Shallow Water code developed using the Array class achieved a speedup factor of 3.5. The Nbody code and the CFD code achieved a speedup factor of 3.2 and 3.5 re- spectively. In all these cases, speedup factor increases as problem size is increased. This is because communication overhead becomes less signicant, compared with computa- tion time, for large problem sizes.
Furthermore, the codes developed using the VGDS classes achieved more than 90% of the performance of their message- passing C version implementing the same algoritm. The main sources of overhead in the libraries include dynamic
memory allocation/deallocation for data object creation and destruction, non-optimized computation kernel for long ex- pressions, and additional overhead in support of portability of the library. We expect that as the project grows more mature, this overhead can be further reduced.
Shallow Water (10 iterations)
problem size 642 1282 2562 5122 1k2 seq time 0.28 1.22 4.96 21.32 88.59 parallel time 0.13 0.45 1.50 6.18 25.16
speedup 2.16 2.71 3.31 3.45 3.52
Gravitational N-body
problem size 48k 56k 64k 128k 256k
seq time 65.62 81.23 93.59 186.12 413.75 parallel time 21.03 26.17 29.17 58.78 125.78
speedup 3.12 3.12 3.17 3.12 3.23
Vortex CFD
problem size 48k 56k 64k 128k 256k
seq time 148.60 175.46 204.18 404.34 801.81 parallel time 43.00 51.37 59.05 117.20 231.07
speedup 3.42 3.42 3.46 3.45 3.47
Table 2: Execution time of the VGDS codes. Time units are seconds
6 Related Work
The benet of data abstraction in object-oriented languages on scientic code development has been demonstrated by various eorts [9, 15]. Particularly in uential and relevant to our approach are the work reported by Angus [1] and Shart and Otto[18] where class-specic compiler optimiza- tions are introduced into a compiler written in an object- oriented fashion. Our approach has taken their class-specic philosophy further into the realm of runtime support for a diverse set of parallel and distributed data structures (be- yond simply array classes) on high performance platforms.
Another line of work uses objects to dene data struc- tures with built-in data distribution capabilities. This again relates directly to our approach. Examples of work along this line include the Paragon package [8], which supports a special class PARRAY for parallel programming, the P++
Array class library [14], PC++ proposed by Lee and Gannon [11, 21], which consists of a set of distributed data struc- tures (arrays, priority queues, lists, etc.) implemented as library routines, where data are automatically distributed based on directives. Interwork II Toolkit [4] described by Bain supports user programs with a logical name space on machines like iPSC. The user is responsible for supplying procedures mapping the object name space to processors.
In a related work by ourselves [6], we report abstractions of adaptive load balancing mechanisms and complex, many-to- many communications as C++ classes for supporting HPC challenging applications. Instead of tackling one particular data structure such as arrays or matrics, we propose an in- tegrated design framework for a diverse set of distributed data structures, where data distribution, data sharing, data coherence, and synchronization between data references are mediated by the runtime system.
Our data structuring framework has similar goals and approaches to the POOMA package [3] and the Chaos++
library [17]. POOMA supports a set of distributed data
structures (elds, matrices, particles) for scientic simula- tions. To our knowledge, POOMA has not supported adap- tive data structures as we do. Chaos++ is a general-purpose runtime library that supports pointer-based dynamic data structures through an inspector-executor-based runtime pre- processing technique. Our framework focuses on a more specic class of data structures essential to scientic simu- lations and engineering computation; therefore, we are able to exploit optimizations that would be dicult for a general preprocessing technique.
A large body of work in the literature can be catego- rized as \object-parallelism," whereobjects are mapped to processes that are driven by messages. Our use of object- orientation is for structuring the VGDS classes and their specializations for optimizations, which is entirely distinct in philosophy from that of object-parallelism.
7 Conclusion
In this paper, we have presented the VGDS framework for scientic applications. We have implemented a prototype of VGDS base libraries and a set of distributed data struc- tures derived from this basis, including array, unstructured mesh, and BH tree. We reported our experimental results on a workstation cluster. We demonstrated that the VGDS class libraries signicantly reduced application development cost, at the expense of slight performance penalty due to object orientation. We are currently investigating possible approaches to reduing such overhead.
We hope that the basic scientic results and concrete classes and templates libraries from our VGDS eort will encourage value-added development of mapping and opti- mizing methods for classes of parallel applications.
Acknowledgment
Support for this work is provided by National Science Coun- cil of Taiwan under grant 86-2213-E-001-009.
References
[1] Ian G. Angus. Applications demand class-specic op- timizations: The c++ compiler can do more. InPro- ceedings of the First Annual Object-Oriented Numerics conference, 1993.
[2] A.W. Appel. An ecient program for many-body simu- lation.SIAM Journal on Scientic and Statistical Com- puting, 6, 1985.
[3] Susan Atlas, Subhankar Benerjee, Julian C. Cummings, Paul J. Hinker, M. Srikant, John V.W. Reynders, and Marydell Tholburn. POOMA: A high performance dis- tributed simulation environment for scientic applica- tions. InSupercomputing95, 1995.
[4] W. L. Bain. Aggregate distributed objects for dis- tributed memory parallel systems. In The 5th Dis- tributed Memory Computing Conference, Vol. II, pages 1050{1055, Charleston, SC, April 1990. IEEE.
[5] J. Barnes and P. Hut. A hierarchicalO(NlogN) force- calculation algorithm. Nature, 324, 1986.
[6] S. Bhatt, M. Chen, C.-Y. Lin, and P. Liu. Abstractions for parallel n-body simulations. InScalable High Per- formance Computing Conference SHPCC-92, pages 38 { 45, Williamsburg, VA, April 1992.
[7] L.S. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D.W. Walker, and R.C. Wha- ley. SCALAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. InSupercomputing'96, 1996.
[8] C. M. Chase, A. L. Cheung, A. P. Reeves, and M. R.
Smith. Paragon: A parallel programming environment for scientic applications using communication struc- tures. In 1991 International Conference for Parallel Processing, Vol. II, pages 211{218, August 1991.
[9] D. W. Forslund, Wingate C., Ford P., Junkins S., Jack- son J., and Pope S. C. Experiences in writing a dis- tributed particle simulation code in C++. In 1990 USENIX C++ Conference, pages 1{19, 1990.
[10] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations.Journal of Computational Physics, 73, 1987.
[11] J. K. Lee and D. Gannon. Object oriented parallel pro- gramming experiments and results. InSupercomputing '91, pages 273{282, November 1991.
[12] P. Liu, W. Aiello, and S. Bhatt. An atomic model for message passing. In 5th Annual ACM Symposium on Parallel Algorithms and Architecture, 1993.
[13] Steve W. Otto. Parallel array classes and lightweight sharing mechanisms. InProceedings of the First Annual Object-Oriented Numerics conference, 1993.
[14] R. Parsons and D. Quinlan. A++/p++ array classes for architecture independent nite dierence calculations.
In Proceedings of the Second Annual Object-Oriented Numerics Conference, April 1994.
[15] J. S. Peery, K. G. Budge, and A. C. Robinson. Using C++ as a scientic programming language. InCUG11, 1991.
[16] Petsc 2.0 for mpi. Argonne National Laboratory, Mathematics and Computer Science Division, URL:
http://www.mcs.anl.gov/petsc/petsc.html, 1997.
[17] J. Saltz, A. Sussman, and C. Chang. Chaos++: A runtime library to support distributed dynamic data structures. Gregory V. Wilson, Editor, Parallel Pro- gramming Using C++, 1995.
[18] Michael D. Sharp and Steve W. Otto. A class specic optimizing compiler. InProceedings of the First Annual Object-Oriented Numerics conference, 1993.
[19] S. Sundaram. Fast Algorithms for N-body Simulations. PhD thesis, Cornell University, 1993.
[20] A.E. Trefethen, V.S. Menon, C.C. Chang, G.J. Cza- jkowski, C. Myers, and L.N. Trefethen. MultiMAT- LAB:MATLAB on multiple processors. Technical Re- port 96-239, Cornell Theory Center, 1996.
[21] S. X. Yang, J. K. Lee, S. P. Narayana, and D. Gan- non. Programming an astrophysics application in an object-oriented parallel language. InScalable High Per- formance Computing Conference SHPCC-92, pages 236 { 239, Williamsburg, VA, April 1992.