

3.3.2 Three-Dimensional Flows

Flow and Simulation Conditions

A supersonic flow past a sphere is simulated to demonstrate the applicability of the current mesh refinement and variable time-step scheme to a three-dimensional flow problem. The simulation is conducted on one-sixteenth of the sphere, taking advantage of the inherent axial symmetry of the problem; a sketch of the one-sixteenth sphere domain is shown in Fig. 3.26. Using this reduced geometry greatly decreases the numbers of particles and cells, and it is verified below that this domain is sufficient for the simulation. The flow conditions for the supersonic flow over the sphere are as follows:

Variable Hard Sphere (VHS) nitrogen gas, free-stream Mach number M = 4.2, free-stream number density n = 9.77E20 particles/m3, free-stream temperature T = 66.25 K, stagnation temperature To = 300 K, and a fully thermally accommodated, diffusive sphere wall at temperature Tw (equal to To). The corresponding free-stream Knudsen number Kn is 0.1035, based on the free-stream mean free path and the diameter of the sphere, which is 1.28 cm. These flow conditions correspond to the numerical simulation and experiments conducted by Russell et al. [82] and are listed in Table 3.5. Figure 3.27 presents the normalized density contours on three different cross sections, including the two symmetry planes and the center plane between them, to confirm that the computational domain is sufficient for this simulation; the results on the three planes are almost identical.
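As a quick consistency check on the quoted Knudsen number, the short calculation below recovers Kn ≈ 0.1035 from the hard-sphere mean-free-path formula; the nitrogen reference diameter used here (4.17E-10 m, Bird's commonly quoted value) is an assumption not stated in the text above.

```cpp
// Consistency check of the free-stream Knudsen number (a sketch; the N2
// reference diameter below is an assumed standard value, not taken from
// the thesis).
#include <cmath>
#include <cstdio>

int main() {
    const double pi    = 3.14159265358979323846;
    const double d_ref = 4.17e-10;   // assumed VHS reference diameter of N2 [m]
    const double n_inf = 9.77e20;    // free-stream number density [1/m^3]
    const double D     = 0.0128;     // sphere diameter [m]

    // Hard-sphere (VHS reference) mean free path and Knudsen number
    const double lambda = 1.0 / (std::sqrt(2.0) * pi * d_ref * d_ref * n_inf);
    const double Kn     = lambda / D;

    std::printf("lambda = %.3e m, Kn = %.4f\n", lambda, Kn);  // ~1.32e-3 m, ~0.104
    return 0;
}
```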

Mesh Adaptation Concerns

The local cell Knudsen number criterion (Kncc) and the free-stream parameter (φ0) for mesh adaptation are chosen as 2 and 1.05, respectively, since we are dealing with an external flow. The evolution of the adaptive surface mesh at each level (0-2) with cell quality control is presented in Fig. 3.28, and the corresponding results are summarized in Table 3.6.
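To make the two thresholds concrete, the sketch below flags a cell for refinement only when it is coarse relative to the local mean free path (the cell Knudsen number test against Kncc) and its density is more than 5% above the free-stream value (the φ0 = 1.05 test). The exact form of both tests in the thesis code may differ; this is an assumed variant for illustration.

```cpp
// Assumed form of the per-cell adaptation test combining the local cell
// Knudsen number criterion (Kncc) and the free-stream parameter (phi0);
// the thesis' exact definitions may differ slightly.
struct Cell {
    double dx;        // characteristic cell size [m]
    double lambda;    // local mean free path [m]
    double rhoRatio;  // local density normalized by the free-stream density
};

bool flagForRefinement(const Cell& c, double Kncc = 2.0, double phi0 = 1.05) {
    const bool tooCoarse     = (c.lambda / c.dx) < Kncc;  // cell too large for the local lambda
    const bool notFreeStream = c.rhoRatio > phi0;         // outside the nearly uniform free stream
    return tooCoarse && notFreeStream;
}
```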

The number of cells increases from 5,353 in the initial mesh to 164,276 in the level-2 adaptive mesh, and the growth is larger than in the two-dimensional case because a parent cell can be divided into 2, 4, or 8 child cells. Hence, the number of cells and the adaptation level should be controlled carefully to avoid generating too many cells. Figure 3.28(c) (level-2 adaptive surface mesh) illustrates that the high-density region caused by the bow shock around the sphere is clearly captured with a much finer mesh distribution. In addition, the mesh in the free-stream region remains comparatively coarse relative to the local mean free path because of the free-stream parameter mentioned earlier. Applying the free-stream parameter effectively reduces the number of cells in the free-stream region, where the cell-size requirement (∆x < λ) can be relaxed owing to the nearly uniform density distribution there.

Comparisons of the Results Using Different Meshes

The mesh distributions and normalized density contours of the initial and level-2 adaptive meshes on the x-y plane are illustrated in Fig. 3.29. In these figures, the upper and lower parts show the results of the initial and level-2 adaptive meshes, respectively. A strong bow shock is observed in front of the sphere in Fig. 3.29(a), and the density contour distributions are obviously different because of the mesh adaptation. Maximum values of 4.26 and 5.14 are observed at the stagnation point, and minimum values of 0.0215 and 0.016 just behind the sphere, for the initial and level-2 adaptive meshes, respectively. Comparisons of the normalized temperatures on the x-y plane are also presented in Fig. 3.30. The temperature increases as the flow crosses the bow shock, reaches a maximum at the stagnation point, and then decreases downstream of the shock and behind the sphere. The distributions differ considerably, especially in the wake region, which justifies the use of the adaptive mesh in the current study. In addition, the normalized density along the stagnation line obtained with the different meshes is compared with the experimental data of Russell [82] in Fig. 3.31. The results show that better agreement with the experimental data is obtained near the stagnation region in front of the sphere when the level-2 adaptive mesh is used.

Comparisons of Adaptive Mesh With or Without Cell Quality Control

Comparisons of the adaptive mesh and the normalized density contours with and without cell quality control on the x-y plane are presented in Fig. 3.32. The upper part of Fig. 3.32(a) shows the adaptive mesh without cell quality control. As this figure illustrates, cells with high aspect ratio are formed, especially at the interface between the adapted and un-adapted regions, because of the first type of anisotropic adaptation. These high-aspect-ratio refined cells are removed by the cell quality control procedure, as shown in the lower part. Figure 3.32(b) compares the flow density on the x-y plane, normalized by the free-stream density. The results are roughly similar, but the refined mesh with cell quality control yields a better solution in Fig. 3.32(b): the contour value of 1.05 lies along the interface between the adapted and un-adapted regions, and the contour lines obtained with cell quality control are smoother and straighter. Figure 3.33 presents the normalized density distribution along the stagnation streamline; both results agree excellently with the experiment.
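The quality measure itself is not spelled out in this section; one simple proxy consistent with the aspect-ratio discussion above is the ratio of the longest to the shortest edge of a candidate child cell, as sketched below (the threshold value and data layout are illustrative assumptions).

```cpp
// Sketch of a cell quality check of the kind described above (assumed
// measure: longest-to-shortest edge ratio of a tetrahedral cell; the
// thesis' actual quality metric is not given in this section).
#include <algorithm>
#include <array>
#include <cmath>

struct Vec3 { double x, y, z; };

static double dist(const Vec3& a, const Vec3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// Returns true if the tetrahedron's edge-length ratio stays below maxRatio,
// i.e. the refined cell is accepted by the quality control.
bool passesQualityControl(const std::array<Vec3, 4>& v, double maxRatio = 5.0) {
    double eMin = 1e300, eMax = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j) {
            const double e = dist(v[i], v[j]);
            eMin = std::min(eMin, e);
            eMax = std::max(eMax, e);
        }
    return (eMax / eMin) < maxRatio;
}
```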

Comparisons of the Results Using CTS and VTS Schemes

Comparisons of the constant time-step and variable time-step schemes are also made for this simulation using the level-2 adaptive mesh (164,276 cells). The variation of the particle distribution is reduced greatly, as illustrated in Fig. 3.34, which compares the distributions of the average number of particles per cell on the level-2 adaptive mesh using the constant time-step (CTS) and variable time-step (VTS) methods. With the VTS method, the reduction of the average number of particles per cell can be as large as 10-fold in the oblique bow shock region and 30-fold in the wake region behind the sphere. This is achieved by keeping the particle weight in the reference cell (the minimum cell, near the stagnation point in this case) the same for both the VTS and CTS methods; in other words, the statistical uncertainty in the minimum cell is kept the same when comparing the two methods. The simulated particle counts for the two schemes are shown in Fig. 3.35. As in the two-dimensional simulation, the number of particles with the variable time-step scheme is clearly lower than with the constant time-step scheme: the number of simulated particles at steady state is reduced from 1.7 million with CTS to 0.34 million with VTS in this case (Fig. 3.35). In addition, another benefit of applying the VTS method to an unstructured adaptive mesh is that it dramatically reduces the number of iterations in the transient period towards steady state, as also illustrated in Fig. 3.35: the number of iterations required for the transient period with VTS is only about 25% of that with CTS. In the current case, if roughly the same statistical uncertainty in the minimum cell is required for the macroscopic properties with both methods, the combination of VTS and the unstructured adaptive mesh can save computational time by up to an order of magnitude. In this simulation, the total computational times with the constant time-step and variable time-step schemes are about 31.4 and 5.7 hours, respectively, so the VTS scheme saves about a factor of 5.5 in computational time for the same number of time-steps.
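The bookkeeping that keeps the per-cell particle weights consistent under VTS can be sketched as follows. A common approach (assumed here, not necessarily the exact procedure of the thesis code) scales the particle weight with the local cell time-step relative to the reference (minimum) cell and clones or removes particles probabilistically when they cross into a cell with a different time-step, so that the number of real molecules represented is conserved on average.

```cpp
// Sketch of variable time-step (VTS) particle cloning/removal at a cell
// crossing, assuming the particle weight scales with the local cell
// time-step relative to the reference (minimum) cell. This is a generic
// VTS recipe, not necessarily the thesis' exact procedure.
#include <cmath>
#include <random>

// Returns how many copies of the particle should exist in the new cell
// (0 = delete, 1 = keep, >1 = clone) so that the number of real molecules
// represented is conserved on average.
int copiesAfterCrossing(double dtOld, double dtNew, std::mt19937& rng) {
    const double ratio = dtOld / dtNew;              // weightOld / weightNew
    const int base     = static_cast<int>(std::floor(ratio));
    const double frac  = ratio - base;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return base + (u(rng) < frac ? 1 : 0);
}
```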

Density Distributions

Figures 3.36 and 3.37 show the contours of normalized density and temperature, respectively, for the two time-step schemes. The upper and lower halves show the results of the CTS and VTS schemes, respectively. The trends are similar, with only small differences in the wake region. Figure 3.38 illustrates the normalized density along the stagnation line using the CTS and VTS schemes, with the experimental results included for comparison. Both results agree very well with the experiments, and they are almost identical even though the variable time-step scheme uses roughly five times fewer particles than the constant time-step scheme.

3.4 Concluding Remarks

In summary, this chapter introduced the DSMC method combined with adaptive mesh refinement and the variable time-step scheme, implemented in two- and three-dimensional simulations. Accurate results and more efficient computation are achieved through a more uniform particle distribution. The results serve both to validate the DSMC code and to provide some insight into the complicated nature of these flows. Taking the 3-D supersonic sphere flow as an example, the adaptive mesh refinement (AMR) refines the cells near the bow shock region, while the variable time-step (VTS) scheme reduces the number of simulated particles from 1.7 million with the constant time-step scheme (CTS) to 0.34 million, and the required transient period decreases to only 25% of that of the CTS scheme. The next chapter introduces the DSMC method incorporating parallel computing to reduce the computational cost and improve computational efficiency, which extends the DSMC method to a wider range of flow problems.

Chapter 4 Parallel Computing of DSMC

The direct simulation Monte Carlo (DSMC) method is a particle method for the simulation of gas flows. The method is statistical in nature, and the gas is modeled at the microscopic level using simulated particles, each of which represents a large number of physical molecules or atoms. Physical events such as collisions are handled probabilistically using largely phenomenological models designed to reproduce real fluid behavior when examined at the macroscopic level. The DSMC method has been proven valid for rarefied gas flows and even for near-continuum flows, but its computational cost is enormous. Parallelizing the DSMC method is therefore necessary, and this is introduced in this chapter.

4.1 The Parallel DSMC Method

The DSMC algorithm is readily parallelized through physical domain decomposition: the cells of the computational grid are distributed among the processors. Each processor executes the DSMC algorithm in serial for all particles and cells in its own domain. Parallel communication occurs when particles cross the domain (processor) boundaries and are then transferred between processors. High parallel performance can only be achieved if communication is minimized and the computational load is evenly distributed among the processors. To minimize the communication for domain decomposition, the boundaries between sub-domains should more or less lie along the streamlines of the flow field; however, it is nearly impossible to achieve such a partition for most practical flows. In practice, we can only minimize the number of edge cuts, Ec, within the framework of graph theory. Fortunately, the advancement of networking speed has reduced the communication time between processors to an acceptable level.

For the DSMC algorithm, the workload (or, equivalently, the number of particles) in each processor changes frequently, especially during the transient period of a simulation, while it attains a roughly constant value during steady-state sampling. Thus, a truly dynamic (or adaptive) domain decomposition technique is required to keep the workload balanced among the processors.

Figure 4.1 shows a simplified flowchart of the parallel DSMC method proposed in the current study, which incorporates the multi-level graph-partitioning technique. In general, this algorithm not only works for the DSMC method but is also suitable for other particle-based methods, such as Molecular Dynamics (MD), Particle-In-Cell (PIC), and Monte Carlo methods in plasma physics. Note that processors are numbered from 0 to np-1 in this figure. Before detailing the proposed procedures (Fig. 4.1), we first discuss the preprocessing required for this parallel implementation. In this implementation, an unstructured mesh is first constructed by a commercial code, HyperMesh [83], or another equivalent meshing tool. A preprocessing code then reorders the fully unstructured mesh data into globally sequential but locally unstructured mesh data for each processor, in accordance with the partitioning information from the graph partitioning tool JOSTLE [54], as schematically presented in Fig. 4.2. Another important output of this preprocessor is the cell-neighbor information, which is needed for particle tracing on an unstructured mesh. The original algorithm [84] for obtaining the cell-neighbor information, which loops over cells and identifies repeated node numbers, was found to be very inefficient once the total number of cells grows to several tens of thousands. It has been replaced by a much more efficient algorithm that loops over the nodes and searches through the cells sharing each node. For example, preprocessing 3 million unstructured 3-D cells takes less than 20 minutes on a 1.6-GHz (Intel) personal computer, and preliminary results show that the preprocessing time increases approximately linearly with the number of cells. Parallelizing this preprocessing step is currently in progress and will be incorporated into the parallel DSMC code in the near future.
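The node-based neighbor search described above can be sketched as follows; the tetrahedral face convention and container layout are illustrative assumptions rather than the thesis' actual data structures.

```cpp
// Sketch of the node-based cell-neighbor search: first build node-to-cell
// lists, then, for each face of each cell, examine only the few cells
// attached to that face's nodes.
#include <array>
#include <vector>

// cells[c] = node indices of tetrahedron c; returns neighbor[c][f] = cell
// sharing face f of cell c, or -1 on the boundary. Face f is taken opposite
// local node f (a common tetrahedral convention).
std::vector<std::array<int,4>> buildNeighbors(const std::vector<std::array<int,4>>& cells,
                                              int numNodes) {
    std::vector<std::vector<int>> nodeToCells(numNodes);
    for (int c = 0; c < (int)cells.size(); ++c)
        for (int n : cells[c]) nodeToCells[n].push_back(c);

    std::vector<std::array<int,4>> neighbor(cells.size(), {-1, -1, -1, -1});
    for (int c = 0; c < (int)cells.size(); ++c) {
        for (int f = 0; f < 4; ++f) {
            // the three nodes of face f (all nodes except local node f)
            std::array<int,3> face;
            for (int k = 0, j = 0; k < 4; ++k)
                if (k != f) face[j++] = cells[c][k];
            // candidate neighbors: cells attached to the first face node
            for (int cand : nodeToCells[face[0]]) {
                if (cand == c) continue;
                int shared = 0;
                for (int n : cells[cand])
                    if (n == face[0] || n == face[1] || n == face[2]) ++shared;
                if (shared == 3) { neighbor[c][f] = cand; break; }
            }
        }
    }
    return neighbor;
}
```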

Note that the partition information from JOSTLE provides the cell counts (m_n for the nth sub-domain, where n = 0 to np-1) and the mapping of cells onto each partitioned sub-domain. After the cell-number reordering, the cells in each sub-domain are renumbered such that the corresponding global starting and ending cell numbers for the nth sub-domain are (m_0 + m_1 + ... + m_(n-1) + 1) and (m_0 + m_1 + ... + m_n), respectively. In each processor the cell numbering is unordered (unstructured), but both the starting (smallest) and ending (largest) cell numbers increase with the processor number. We term this "globally sequential but locally unstructured". Thus, each processor only needs to store the starting and ending cell numbers of all processors, in addition to its own cell-related data. The mapping between global and local cell data can then be obtained by a simple arithmetic operation, owing to this special cell-numbering design. The required array size for cell-related data is approximately the number of cells in each sub-domain. For example, for a simulation with one million cells on 100 processors, each processor only needs to store arrays on the order of 10,000 entries, a memory reduction of approximately 100 times in this case. This simple reordering of cell numbers dramatically reduces the memory otherwise required to store the mapping between the local cell number in each processor and the global cell number in the computational domain if the unstructured cells were left un-reordered.
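A minimal sketch of the arithmetic implied by this cell-numbering scheme is given below; each processor stores only the global start and end cell numbers of every sub-domain, and the structure and names are illustrative.

```cpp
// Sketch of the global<->local cell-number mapping enabled by the
// "globally sequential but locally unstructured" reordering.
#include <vector>

struct CellMap {
    std::vector<int> startCell;  // first global cell number owned by processor p
    std::vector<int> endCell;    // last  global cell number owned by processor p

    // Which processor owns this global cell? (simple search; np is small)
    int ownerOf(int globalCell) const {
        for (int p = 0; p < (int)startCell.size(); ++p)
            if (globalCell >= startCell[p] && globalCell <= endCell[p]) return p;
        return -1;
    }

    // Local index inside the owning processor's arrays -- one subtraction.
    int localIndex(int globalCell, int owner) const {
        return globalCell - startCell[owner];
    }
};
```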

In addition, a processor neighbor-identifying array is created for each processor from the output of the preprocessor; it is used to identify the surrounding processors for particles crossing the inter-processor boundaries (IPB) during the simulation. From our practical experience, the number of neighboring processors is at most on the order of 10, so the additional memory cost of this array is negligible. The resulting globally sequential but locally unstructured mesh data, together with the partition information, are then imported into the parallel DSMC code as the initial mesh distribution.

Referring again to Fig. 4.1, after the preprocessed cell data are read on a master processor (CPU 0), they are distributed to all other processors according to the designated initial domain decomposition. All the particles in each processor then move as in the sequential DSMC algorithm. Particles that hit an inter-processor boundary (IPB) during their journey within a simulation time-step have their data copied to a buffer and numbered sequentially. After all the particles in a processor have been moved, the destination processor of each particle in the buffer is identified via a simple arithmetic computation, owing to the cell-numbering scheme described above, and the particle data are packed into arrays. For communication efficiency, each packed array is sent as a whole to the corresponding neighboring processor, in turn, based on the tags recorded earlier. Once a processor has sent out all its packed arrays, it waits to receive the packed arrays from its surrounding processors in turn. This "send" and "receive" operation serves practically as a synchronization step within each simulation time-step. The received particle data are then unpacked, and each particle continues its journey for the remainder of the time-step. The above procedure is repeated a second time, since some particles may cross an IPB twice during a simulation time-step. Theoretically this could happen more than twice, but in our practical experience twice is generally sufficient for a "normal" domain decomposition and a carefully chosen simulation time-step. After all particles on each processor have reached their final destinations at the end of a time-step, the program carries out the indexing of all particles and the collisions within each computational cell in each processor, as in a sequential DSMC code. The particles in each cell are then sampled at the appropriate times.
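The pack/send/receive cycle described above might look like the following MPI sketch, in which each packed array of particle data is sent whole to the corresponding neighboring processor and one array is received from each neighbor in turn; the buffer layout, function name, and neighbor list are illustrative assumptions rather than the thesis' actual code.

```cpp
// Sketch of the per-time-step particle migration: send one packed array per
// neighboring processor, then receive one packed array from each neighbor.
#include <mpi.h>
#include <vector>

void migrateParticles(const std::vector<int>& neighborRanks,
                      std::vector<std::vector<double>>& sendBuf,   // one packed array per neighbor
                      std::vector<std::vector<double>>& recvBuf) {
    const int nNbr = static_cast<int>(neighborRanks.size());
    std::vector<MPI_Request> reqs(nNbr);

    // Send each packed array as a whole to the corresponding neighbor.
    for (int i = 0; i < nNbr; ++i)
        MPI_Isend(sendBuf[i].data(), static_cast<int>(sendBuf[i].size()), MPI_DOUBLE,
                  neighborRanks[i], /*tag=*/0, MPI_COMM_WORLD, &reqs[i]);

    // Receive one packed array from each neighbor in turn, probing for its size first.
    recvBuf.assign(nNbr, std::vector<double>());
    for (int i = 0; i < nNbr; ++i) {
        MPI_Status st;
        MPI_Probe(neighborRanks[i], 0, MPI_COMM_WORLD, &st);
        int count = 0;
        MPI_Get_count(&st, MPI_DOUBLE, &count);
        recvBuf[i].resize(count);
        MPI_Recv(recvBuf[i].data(), count, MPI_DOUBLE, neighborRanks[i], 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Waitall(nNbr, reqs.data(), MPI_STATUSES_IGNORE);
    // Each received array is then unpacked and the particles finish the rest
    // of the time-step; the whole move/migrate pass is repeated once more for
    // particles that cross a second inter-processor boundary.
}
```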

High parallel efficiency can only be achieved if communication is minimized and the computational load is evenly distributed among the processors. To minimize the communication for static domain decomposition, the boundaries between sub-domains should lie along the streamlines of the flow field [63], as mentioned previously; however, it is nearly impossible to achieve such a partition for most practical flows. Fortunately, the advancement of networking speed has reduced the communication time between processors to a tolerable level.

4.1.1 Static Domain Decomposition (SDD)

In previous research on the parallel DSMC code, three static domain decomposition methods were used, with an estimated particle weight assigned to each vertex (cell center) [84]. To the best of the author's knowledge, this represents the first application of graph-partitioning techniques to the DSMC method. The weight on each vertex is estimated by running the sequential DSMC code with about 10% or fewer (for some cases only 3%) of the total number of particles used in the real parallel simulation. It is shown later that the vertex weights obtained in this way are roughly equivalent to those obtained by running all of the particles (100%). For the simple coordinate partition, the speedup (efficiency) on 25 processors is 13.6 (54.5%), while it increases to 15.6 (62.5%) for the multi-level method of JOSTLE and 15.5 (62.2%) for the multi-level scheme of METIS. Note that the percentage in parentheses after each speedup is the corresponding parallel efficiency.
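For orientation, the sketch below assembles the per-cell particle-count estimates into the weighted graph (CSR adjacency plus vertex weights) that a k-way partitioner such as JOSTLE or METIS consumes; the structure and names are illustrative, and no specific library call is shown.

```cpp
// Sketch of preparing the weighted graph for static domain decomposition:
// one vertex per cell, edges between face-neighboring cells, and the
// estimated particle count as the vertex weight.
#include <array>
#include <vector>

struct PartitionGraph {
    std::vector<int> xadj;    // CSR row pointers (size = numCells + 1)
    std::vector<int> adjncy;  // concatenated neighbor lists
    std::vector<int> vwgt;    // vertex weights = estimated particles per cell
};

PartitionGraph buildGraph(const std::vector<std::array<int,4>>& neighbor,   // e.g. from buildNeighbors()
                          const std::vector<int>& particlesPerCell) {
    PartitionGraph g;
    g.xadj.push_back(0);
    for (size_t c = 0; c < neighbor.size(); ++c) {
        for (int nb : neighbor[c])
            if (nb >= 0) g.adjncy.push_back(nb);   // skip boundary faces
        g.xadj.push_back(static_cast<int>(g.adjncy.size()));
        g.vwgt.push_back(particlesPerCell[c]);
    }
    return g;   // handed to the k-way graph partitioner
}
```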

Although the DSMC method possesses nearly 100% parallelism (except for initialization and final output), and an estimated particle distribution is used as the weight to obtain a suitable domain partition, both the speedup and the efficiency obtained with static domain decomposition are expected to be lower than the ideal values because of load imbalance and communication, as mentioned previously. Dynamic domain decomposition can improve the parallel performance by balancing the load on each processor and is described in the next section.

4.1.2 Dynamic Domain Decomposition (DDD)

This section presents an overview of the algorithms implemented for the dynamic domain decomposition scheme. As mentioned above, the parallel performance deteriorates because of communication and load imbalance. A dynamic domain decomposition scheme for an unstructured mesh is implemented in this thesis to speed up the parallel computation; this is a non-trivial task and is the main contribution of this PhD research. The basic concept is that the domain is repartitioned whenever the load on the processors becomes unbalanced, and the simulator aims to balance the number of particles among the sub-domains. The flowchart with dynamic domain decomposition is shown in Fig. 4.3. The procedures are almost the same as before except for some additional processes for the dynamic domain decomposition method. There are three main processes,
