
Chapter 1 Introduction

1.2 Reviews of the DSMC Method

1.2.6 The Parallel DSMC Method

The DSMC method has become a widely used computational tool for the simulation of gas flows in which molecular effects are important. The advantage of using a particle method under these circumstances is that the molecular model can be applied directly to the calculation of particle collisions, whereas continuum methods use macroscopic averages to account for such effects. Particle methods can therefore predict these effects with much higher accuracy. In addition, DSMC is the only viable tool for analyzing gas flows in the transitional regime. Nevertheless, the main drawback of such a direct physical method is its high computational cost, which is why the DSMC method has mostly been applied to high Knudsen number (transitional) flows.

For lower Knudsen number gas flows near the continuum regime, the computational cost is prohibitively high, even on the most advanced computers available today. Hence, it is important to increase the computational speed in order to extend the application range of the DSMC method.

1.2.6.1 Domain Decomposition

Generally, there are two approaches to partitioning the simulated domain: geometry-based and graph-based domain decomposition. Geometry-based methods are usually faster but yield a poor edge cut (Ec), since they pay no heed to the connectivity of the points in the mesh. Many physical problems can be expressed within the framework of graph theory, such as discrete optimization problems and matrix reordering. A sketch of a graph and the corresponding mesh is shown in Fig. 1.2. A small portion of a triangular grid, of the kind typically generated by commercial mesh software, is shown as the thinner lines. The bold solid circles and bold lines represent the vertices and edges of the graph, respectively. The graph G(V, E) is the collection of these vertices (V) and edges (E) constructed on the basis of the connectivity between the cells. One of the advantages of expressing the problem in terms of a graph is that each of the edges and vertices can be assigned a weight to account for the specific numerical application. For example, in DSMC, each vertex (cell center) can be weighted with the number of particles in that cell, with all edges having unit weight. Brief overviews of each partitioning method are given in the following.

Geometry-based method

Geometry-based methods use the spatial (or coordinate) information of the mesh to partition the domain. These methods are usually simple and fast, but provide a poor edge cut (Ec) and poor load balancing.

Coordinate Partitioning

This method is the simplest form of geometric partitioning and works only for rectangular and cubic domains. The domain is split into Nx×Ny sub-domains for a 2-D mesh and Nx×Ny×Nz sub-domains for a 3-D mesh, based on the geometric positions of the cells. The chief advantages are that the partition is easy to obtain and the computational cost is low. Examples of this method are given in Haug et al. [47].
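As an illustration, the following short sketch (Python; the cell-centre coordinates, domain extents, and block counts are assumed for illustration) assigns each cell of a rectangular 2-D domain to one of Nx×Ny sub-domains purely from its position.

import numpy as np

def coordinate_partition(centres, nx, ny, xmin, xmax, ymin, ymax):
    # Map each cell centre to a block index (ix, iy) based on position alone.
    ix = np.minimum(((centres[:, 0] - xmin) / (xmax - xmin) * nx).astype(int), nx - 1)
    iy = np.minimum(((centres[:, 1] - ymin) / (ymax - ymin) * ny).astype(int), ny - 1)
    return ix * ny + iy                      # sub-domain number of every cell

centres = np.random.rand(1000, 2)            # assumed cell centres in the unit square
part = coordinate_partition(centres, nx=4, ny=2, xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0)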

Recursive Coordinate Bisection (RCB)

This method is similar to coordinate partitioning but attempts to minimize the boundaries between the sub-domains. The procedure is to find the longest coordinate direction of the domain and split the mesh in half along it; each sub-domain is then recursively divided by the same procedure. This method is also fast, but the partition quality is low and disconnected sub-domains are sometimes produced. Furthermore, it can only split the domain along directions normal to the coordinate axes.

Examples of this method are given in Simon [48].
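A minimal sketch of the RCB idea follows (Python; the cell-centre data and the recursion depth are assumed): each level finds the longest coordinate direction of the current sub-domain and splits it in half at the median.

import numpy as np

def rcb(centres, indices, levels):
    # Recursive coordinate bisection: split along the longest coordinate direction.
    if levels == 0:
        return [indices]
    extent = centres[indices].max(axis=0) - centres[indices].min(axis=0)
    axis = int(np.argmax(extent))                        # longest coordinate direction
    order = indices[np.argsort(centres[indices, axis])]
    half = len(order) // 2                               # median split keeps the halves balanced
    return rcb(centres, order[:half], levels - 1) + rcb(centres, order[half:], levels - 1)

centres = np.random.rand(1000, 2)                        # assumed 2-D cell centres
sub_domains = rcb(centres, np.arange(len(centres)), levels=3)   # 2^3 = 8 sub-domains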

Inertial Recursive Bisection (IRB)

This method is a modification of the RCB method, motivated by the fact that the geometry of the domain is not always aligned with the coordinate axes. The idea is to find the longest inertial axis of the domain, split the domain in half along it, and then apply the bisection recursively. This creates smaller sub-domain boundaries than the RCB method. IRB has been implemented within the context of dynamic load balancing for DSMC by Diniz et al. [49].
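The core difference from RCB can be sketched as follows (Python; the cell-centre data are assumed): the split direction is the principal inertial axis, i.e. the eigenvector of the largest eigenvalue of the scatter matrix of the cell centres, and the bisection along that axis is then applied recursively exactly as in RCB.

import numpy as np

def inertial_bisect(centres, indices):
    # Split the given cells in half along their longest inertial axis.
    pts = centres[indices] - centres[indices].mean(axis=0)
    scatter = pts.T @ pts                     # inertia (scatter) matrix of the cell centres
    _, vecs = np.linalg.eigh(scatter)
    axis = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    order = indices[np.argsort(pts @ axis)]   # sort cells by projection onto that axis
    half = len(order) // 2
    return order[:half], order[half:]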

Graph-based method

The conventional graph-partitioning problem is to subdivide the n vertices among the NP sub-domains while minimizing the number of edge cuts, Ec, and balancing the weight in each sub-domain. For example, in DSMC, each vertex can be weighted with the number of particles in the corresponding cell, with all edges connecting cell centers having unit weight. A truly dynamic load balancing technique is required for DSMC because the load (approximately proportional to the number of particles) in each sub-domain changes frequently, especially during the transient period of the simulation. Domain decomposition in DSMC may become very efficient by taking advantage of the successful developments in graph partitioning. However, the graph-partitioning problem is well known to be NP-complete, which means that an optimal solution cannot be computed in polynomially bounded time. Instead, near-optimal solutions are sought within a reasonable time. In computer science, several methods have been developed for achieving near-optimal solutions to this problem. Related descriptions and reviews of graph partitioning can be found in the references [48, 50-53].
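To make the two competing objectives concrete, the following sketch (Python, with an assumed toy graph and partition) evaluates the edge cut Ec and a simple load-imbalance measure of a candidate partition; the vertex weights stand for per-cell particle counts as described above.

def edge_cut(edges, part):
    # Number of graph edges whose end points lie in different sub-domains.
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(vertex_weight, part, n_parts):
    # Ratio of the heaviest sub-domain load to the average load (1.0 = perfect balance).
    loads = [0.0] * n_parts
    for v, w in vertex_weight.items():
        loads[part[v]] += w
    return max(loads) / (sum(loads) / n_parts)

edges = [(0, 1), (0, 2), (1, 3), (2, 3)]              # assumed cell connectivity
vertex_weight = {0: 120, 1: 85, 2: 240, 3: 60}        # assumed particle counts per cell
part = {0: 0, 1: 0, 2: 1, 3: 1}                       # assumed two-way partition
print(edge_cut(edges, part), load_imbalance(vertex_weight, part, 2))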

Greedy Partitioning

This method was first proposed by Farhat [50] to decompose the domain of a finite element computation. The method starts by selecting a vertex in the domain. This vertex "bites" the neighboring vertices until an appropriate proportion of the graph is accumulated. The seed vertex of the next sub-domain is then chosen on the border of the previous sub-domain, and the process is repeated until the whole domain is partitioned. The variant employed here differs from Farhat [50] only in that it works with a graph rather than with the nodes and elements of a finite element mesh.
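A minimal sketch of this graph-growing idea is given below (Python; the adjacency list, number of sub-domains, and seeding rule are assumptions for illustration, and leftover disconnected vertices are not handled): each sub-domain grows breadth-first from a seed until it reaches roughly its share of the vertices, and the next seed is taken from the border of the sub-domain just formed.

from collections import deque

def greedy_partition(adjacency, n_parts):
    n = len(adjacency)
    target = n // n_parts                      # desired number of vertices per sub-domain
    part = [-1] * n
    seed = 0
    for p in range(n_parts):
        size, queue = 0, deque([seed])
        while queue and (size < target or p == n_parts - 1):
            v = queue.popleft()
            if part[v] != -1:
                continue
            part[v] = p                        # the sub-domain "bites" this vertex
            size += 1
            queue.extend(u for u in adjacency[v] if part[u] == -1)
        # seed the next sub-domain on the border of the one just formed, if possible
        border = [u for v in range(n) if part[v] == p
                  for u in adjacency[v] if part[u] == -1]
        seed = border[0] if border else next((v for v in range(n) if part[v] == -1), 0)
    return part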

Recursive Spectral Bisection (RSB)

Simon [48] first presented this graph-partitioning approach, which is based on computing the Laplacian matrix of the graph. The eigenvector (Fiedler vector) corresponding to the second smallest eigenvalue of the Laplacian, when associated with the vertices of the graph, gives a measure of the distance between the vertices. Once these distance measures are computed for all vertices, the vertices can be sorted by their values and then split into two parts. This method seems to produce good-quality partitions, but computing the Fiedler vector is relatively expensive.
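A minimal one-level sketch of spectral bisection follows (Python; the dense adjacency matrix A is assumed and suitable only for small graphs, whereas a production code would use a sparse Lanczos-type eigensolver, which is the expensive step noted above): form the Laplacian L = D − A, take the Fiedler vector, and split the vertices at its median value.

import numpy as np

def spectral_bisect(A):
    # A: dense symmetric adjacency matrix of a small graph.
    L = np.diag(A.sum(axis=1)) - A               # graph Laplacian L = D - A
    _, vecs = np.linalg.eigh(L)                  # eigenpairs in ascending eigenvalue order
    fiedler = vecs[:, 1]                         # eigenvector of the 2nd-smallest eigenvalue
    median = np.median(fiedler)
    return np.where(fiedler <= median)[0], np.where(fiedler > median)[0]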

Multilevel Scheme

Barnard and Simon [51] first proposed multilevel partitioning to accelerate graph partitioning for CFD simulations. The first stage is to coarsen the graph to obtain a small graph, since it is easier to find a good partition of a small graph than of a large one. The second stage is to compute a high-quality bisection, that is, one with a small edge cut, of the coarse graph. Finally, the bisection is projected back to the original graph and refined, and these three stages are repeated until the partition reaches comparable quality.

Multilevel schemes are now considered the state-of-the-art static partitioners. METIS [52], developed at the University of Minnesota, is a variant of the multilevel graph-partitioning scheme. The idea of the multilevel scheme draws from the multi-grid technique.

Its reported performance was impressive in terms of CPU time.
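For completeness, a hedged sketch of how such a partitioner is typically invoked from application code is given below; it assumes the third-party pymetis wrapper around METIS [52] and its part_graph call, and the four-cell adjacency list is purely illustrative.

import pymetis   # third-party Python wrapper around METIS (assumed to be installed)

adjacency = [[1, 2], [0, 3], [0, 3], [1, 2]]    # assumed cell-connectivity graph
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
# membership[i] is the sub-domain assigned to cell i; n_cuts is the resulting edge cut Ec.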

Two-Step Method

Two-step methods may be seen as relatives of the multilevel partitioning methods.

The concept of the two-step partitioning method is first to generate an initial domain decomposition with a cheap, fast partitioning method and then to refine it. Moving the vertices on the partition boundaries optimizes the domain decomposition. A cost function F is proposed and minimized, F = αEc + (1−α)L, where Ec is the edge cut associated with the decomposition, L is the degree of load imbalance, and α is a penalty parameter. JOSTLE [54] uses an initial domain decomposition (generated by greedy partitioning) and successively adjusts the partition by moving vertices lying on the partition boundaries. In this method, vertex shedding is localized, since only the vertices along the partition boundaries are allowed to move, not vertices anywhere in the domain. Hence, this method possesses a high degree of concurrency and has the potential to be applied to dynamic domain decomposition in the event of load imbalance across the processor array.
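The decision of whether to accept a boundary-vertex move can be sketched directly from the cost function above (Python; the numerical values of Ec, L, and α are assumed):

def cost(edge_cut, load_imbalance, alpha=0.5):
    # F = alpha*Ec + (1 - alpha)*L; the decomposition with the smaller F is preferred.
    return alpha * edge_cut + (1.0 - alpha) * load_imbalance

# accept a candidate boundary-vertex move only if it lowers the cost function
f_before = cost(edge_cut=120, load_imbalance=0.30)
f_after = cost(edge_cut=118, load_imbalance=0.22)
move_accepted = f_after < f_before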

Others

In addition, Plimpton and Bartel [55] have proposed a random partitioning method for DSMC. The computational cells are randomly picked to form the separate sub-domains. This is the simplest and fastest method of domain decomposition, but it results in a maximal edge cut and disconnected sub-domains, although it may give a moderate load balance.
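The method amounts to little more than the following one-line assignment (Python; the cell and processor counts are assumed):

import numpy as np

membership = np.random.randint(0, 8, size=10000)   # assign 10,000 assumed cells to 8 processors at random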

Thus far, there seems to be no report that a mature graph-partitioning tool has been incorporated into the parallel DSMC method on an unstructured mesh. It is therefore interesting and technically important to learn whether graph-partitioning tools can be used efficiently in conjunction with the DSMC method. Hence, we used PJOSTLE [56] to dynamically decompose the computational domain for the parallel DSMC simulation in the early stage of this study and have recently switched to ParMETIS [57] for its robustness and its much larger user community around the world, which is often the driving force behind improvements to the tool.

1.2.6.2 Static Domain Decomposition (SDD)

In the past, several studies on the parallel implementation of DSMC have been published [58-62] using static domain decomposition and a structured mesh. Message passing was used to transfer molecules between processors and to provide the synchronization necessary for a physically correct simulation. Results showed that reasonable speedup and efficiency could be obtained if the problem is sized properly for the number of processors.

Recently, Boyd's group [29, 63] developed the parallel DSMC software MONACO, which emphasizes high data locality to match the hardware structure of modern workstations, while maintaining code efficiency on vectorized supercomputers.

In this code, unstructured grids were used to take advantage of their flexibility in handling complex object geometries. A static domain decomposition technique was used to distribute the cells among processors. Interactive human intervention is required to redistribute the cells among processors to maintain workload balance, which is unsatisfactory from a practical viewpoint. Timing results show the performance improvement on workstations and the necessity of load balancing for achieving high performance on parallel computers. Up to 400 IBM SP2 processors have been used to simulate the flow around a planetary probe with approximately 100 million particles, for which a parallel efficiency of 90% was reached by manually redistributing the cells among processors during the simulation. However, the parallel efficiency for n processors is unusually defined there as the ratio of the computational time to the sum of the computational and communication times, rather than as the conventional ratio of the true speedup to the ideal speedup (n) for n processors.
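To make the distinction explicit (the notation here is assumed): the efficiency quoted in [29, 63] is η = T_comp / (T_comp + T_comm), i.e., the fraction of time spent on computation, whereas the conventional parallel efficiency for n processors is η = S(n)/n = T(1) / (n·T(n)), where T(1) and T(n) are the execution times on one and on n processors, respectively; the two measures generally give different values.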

1.2.6.3 Dynamic Domain Decomposition (DDD)

Only very recently has the dynamic domain decomposition technique been used in conjunction with parallel implementations of the DSMC method. Ivanov's group [64] developed a parallel DSMC code called SMILE, which implements both static and dynamic load balancing techniques. SMILE unites the background structured cells into groups, so-called "clusters", which are the minimum spatial units distributed and transferred between the processors. The dynamic domain decomposition algorithm is scalable and requires only local knowledge of the load distribution in the system. The direction and amount of workload transfer are determined using the concept of a heat diffusion process [65]. In addition, an automatic granularity control is used to determine when to communicate data among processors [65].
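A minimal sketch of the diffusion idea is given below (Python; the processor graph, initial loads, and diffusion coefficient are assumed, and this is not the actual SMILE algorithm): each processor repeatedly exchanges work with its neighbours in proportion to the load difference, so that work flows from heavily to lightly loaded processors much like heat.

import numpy as np

def diffuse_load(load, neighbours, alpha=0.25, iterations=50):
    load = np.array(load, dtype=float)
    for _ in range(iterations):
        flow = np.zeros_like(load)
        for i, nbrs in enumerate(neighbours):
            for j in nbrs:
                flow[i] += alpha * (load[j] - load[i])   # work flows from heavy to light
        load += flow
    return load

# four processors in a chain, with strongly unbalanced initial loads (assumed numbers)
balanced = diffuse_load([400.0, 100.0, 250.0, 50.0], neighbours=[[1], [0, 2], [1, 3], [2]])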

Around the same period of time, a dynamic load balancing technique using Stop At Rise (SAR) [66], which compares the cost of re-mapping the decomposition with the cost of not re-mapping based on a degradation function, was used in conjunction with the parallel implementation of the DSMC method [60, 62]. In the study [60], a runtime library, CHAOS, was used for data communication and data structure manipulation on a structured mesh. The results show that it yields faster execution times than the scalar code, although a parallel efficiency of only 25% is achieved for 64 processors.
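The SAR decision rule can be sketched as follows (Python); the exact degradation function of [66] may differ in detail, so this is only a hedged illustration of the idea of weighing the accumulated imbalance against an assumed re-mapping cost.

def sar_should_remap(idle_time_history, remap_cost):
    # idle_time_history: per-step (max - mean) processor time since the last re-mapping.
    steps = len(idle_time_history)
    if steps < 2:
        return False
    avg_now = (sum(idle_time_history) + remap_cost) / steps
    avg_prev = (sum(idle_time_history[:-1]) + remap_cost) / (steps - 1)
    return avg_now > avg_prev        # "stop at rise": re-map once the average turns upward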

LeBeau [67] reported that a parallel efficiency of up to 90% is achieved for 128 processors for the flow over a sphere. It is not clear how the dynamic load balancing was implemented, although the use of the heat diffusion concept [65] is mentioned. In LeBeau's study [67], the surface geometry is discretized using an unstructured triangular grid representation, while a two-level embedded Cartesian grid is employed for the discretization of the computational domain.

In summary, past studies of DSMC using both a purely unstructured mesh and dynamic domain decomposition were relatively few [24-26], although an unstructured mesh offers greater flexibility in handling objects with complicated geometries and boundary conditions. Robinson [24-26] was the first to develop a heuristic, diffusive, hybrid graph-geometric, localized, concurrent scheme, ADDER, for repartitioning the domain on an unstructured mesh. A dramatic increase in parallel efficiency was reported compared with static domain decomposition. However, Robinson [24-26] also showed that the parallel efficiency begins to fall dramatically as the number of processors increases beyond a certain point, owing to the large runtime of repartitioning the domain relative to the DSMC computation. Thus, the use of a more efficient repartitioning runtime library is essential to improve the performance of a parallel DSMC method.

Based on the preceding reviews, the development of the parallel DSMC method has not yet taken advantage of the great success in graph partitioning. Considering the nature of DSMC, a truly dynamic domain decomposition technique is required because the load (approximately proportional to the number of particles) in each sub-domain changes frequently, especially during the transient period of a simulation. However, such an implementation is by no means an easy task to accomplish.