
Chapter 4 Parallel Computing of DSMC

4.2 Parallel Performance of the Parallel DSMC Method

4.2.2 Simulations on IBM-SP2

Efficiency ≡ Speedup / N (4.4)

and is just the ratio of the true speedup to the ideal speedup, N; hence its value lies between zero and one. For an ideal parallel algorithm, the speedup and efficiency are the number of processors and unity, respectively.
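As a minimal numerical illustration of Eq. (4.4), the snippet below computes the speedup and efficiency from wall-clock times, assuming the usual definition of speedup as the single-processor time divided by the N-processor time, consistent with the text above. The timings used here are invented, not measurements from the IBM-SP2 runs.

#include <stdio.h>

int main(void)
{
    double t_serial   = 1000.0;  /* invented single-processor wall-clock time [s] */
    double t_parallel =   70.0;  /* invented wall-clock time on N processors [s]  */
    int    N          = 16;      /* number of processors                          */

    double speedup    = t_serial / t_parallel;  /* true speedup                        */
    double efficiency = speedup / N;            /* Eq. (4.4): true / ideal speedup (N) */

    /* an efficiency above 1 (100%) indicates super-linear speedup */
    printf("speedup = %.2f, efficiency = %.1f%%\n", speedup, 100.0 * efficiency);
    return 0;
}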

Speedup and Efficiency

Results of the parallel speedup and efficiency of the cavity-flow computation, at different problem sizes on the IBM-SP2 machine, are presented as a function of the number of processors in Fig. 4.6. In these figures, there are four curves, marked with circle, triangle, quadrilateral and cross symbols, corresponding to SDD and to DDD with the SAR scheme called at intervals of 2∆t, 10∆t and 20∆t, respectively; the dashed line represents the ideal (linear) case. As expected, the parallel performance of the cases using dynamic domain decomposition is much better than that of the cases using static domain decomposition. Several trends for the different problem sizes are described in detail as follows.

Small Problem Size

Super-linear speedup (efficiency > 100%) clearly occurs for 16 or fewer processors if dynamic domain decomposition is applied (Fig. 4.6(a) and Fig. 4.6(d)). This is mainly attributed to both cache effects and better load balancing among processors. In contrast, if static domain decomposition is applied, the efficiency decreases with increasing number of processors (up to 64), as expected, owing to load imbalance among processors. Figure 4.7 shows the computational time per particle as a function of particle number on a single processor (IBM-SP2); the minimum computational time per particle occurs at approximately 4,000 particles and increases with increasing particle number beyond that point. As the number of processors increases beyond 16, the negative effects of load imbalance and increased communication among processors begin to play a more important role than the positive cache effects. Thus, the parallel efficiency decreases monotonically with increasing number of processors (up to 64) even if dynamic domain decomposition is used, as shown in Fig. 4.6(a) and Fig. 4.6(d).

In addition, the results show that applying the SAR scheme less frequently generally gives better parallel performance than applying it more frequently when fewer than 64 processors are used. As the number of processors increases to 64, all three strategies of applying the SAR scheme result in roughly the same parallel efficiency, mainly because the relative load imbalance that develops among processors cancels the advantage gained by repartitioning the domain less frequently. Also, for the small problem size, it may be difficult for the repartitioning library to re-decompose the domain accurately, since too few particles reside on each processor when the number of processors is high, e.g., only approximately 3,500 particles per processor with 64 processors. Nevertheless, for the small problem size the parallel efficiency using dynamic domain decomposition improves appreciably, in the range of 30-50%, as compared with static domain decomposition.

Medium Problem Size

Similarly, super-linear speedup exists for the medium problem size, extending even up to 48 processors, if dynamic domain decomposition is activated (Fig. 4.6(b) and Fig. 4.6(e)). This unusually extended super-linear speedup should be attributed to the relatively small cache size available on the super-scalar workstation: once the problem is divided among more processors, the array data fit into cache and can be accessed very fast. However, the super-linear speedup is not seen at all if dynamic domain decomposition is deactivated, which further demonstrates the effectiveness of implementing dynamic domain decomposition. As the number of processors exceeds 48, this super-linear speedup disappears due to increasing communication among processors. For the medium problem size, the parallel performance (Fig. 4.6(b) and Fig. 4.6(e)) using dynamic domain decomposition is generally 50-100% higher than that using static domain decomposition for up to 64 processors. Note that approximately 90% parallel efficiency can be reached with 64 processors for the medium problem size. In addition, the advantage of activating the SAR scheme less frequently diminishes for the medium problem size because, with the larger problem, the repartitioning cost becomes comparatively less important than the useful particle computation.

Large Problem Size

For the large problem size, super-linear speedup generally disappears even if dynamic domain decomposition is activated (Fig. 4.6(c) and Fig. 4.6(f)). The only exception is the case that activates the SAR scheme most frequently (at intervals of 2∆t), which balances the workload among processors more effectively than the other two cases. A parallel efficiency of 107% can still be reached with 64 processors when the SAR scheme is activated at intervals of 2∆t.

In conclusion, the results show that simulations with larger problem sizes attain higher parallel performance. Simulations using the dynamic domain decomposition method perform better than those using static domain decomposition, and the size of the improvement depends on the problem size. Moreover, the optimal frequency of activating the SAR scheme generally increases with increasing problem size.

Dynamic Domain Decomposition

Typical evolutions of the dynamic domain decomposition using the graph partitioning technique are shown in Fig. 4.8 and Fig. 4.9 for the large problem size using 16 and 64 processors, respectively, when the SAR scheme is activated at intervals of 2∆t. As mentioned above, JOSTLE is used to form the initial partition by assigning a unit weight to each vertex, and PJOSTLE is used to repartition whenever the load becomes unbalanced. Although the initial sub-domains are approximately equal in size, the region covered by each sub-domain (processor) clearly changes as the simulation proceeds, owing to the repartitioning among processors. The smallest sub-domain lies in the lower right-hand corner of the cavity, where the density is highest (Figs. 4.8(c) and 4.9(c)). In addition, the sub-domains above the moving plate are generally larger than the others because of the rarefied conditions caused by the fast-moving plate (Figs. 4.8(c) and 4.9(c)). This clearly demonstrates that the current implementation of dynamic domain decomposition is very effective in following the dynamics of the flow problem under study.
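To make the weighting idea concrete, the sketch below (not taken from the thesis code) treats each DSMC cell as a graph vertex whose weight is its current particle count, which is the quantity the partitioner tries to balance across processors. The routine greedy_repartition() is a deliberately simplified, hypothetical stand-in for the actual JOSTLE/PJOSTLE calls, whose real interfaces are not reproduced here; it only balances summed vertex weights and ignores the edge cuts (communication cost) that a true graph partitioner would also minimize. The cell counts in main() are invented for illustration.

#include <stdio.h>

#define N_CELLS 8
#define N_PROCS 2

/* Hypothetical stand-in for JOSTLE/PJOSTLE: assign each cell (vertex) to
 * the currently lightest processor so that the summed particle weights
 * are roughly balanced.  A real graph partitioner would also keep
 * adjacent cells together to reduce communication.                     */
static void greedy_repartition(const int weight[], int n_cells,
                               int n_procs, int part[])
{
    long load[N_PROCS] = {0};
    for (int c = 0; c < n_cells; ++c) {
        int lightest = 0;
        for (int p = 1; p < n_procs; ++p)
            if (load[p] < load[lightest])
                lightest = p;
        part[c] = lightest;
        load[lightest] += weight[c];
    }
}

int main(void)
{
    /* invented particle counts per cell = vertex weights */
    int weight[N_CELLS] = {4200, 3900, 800, 650, 700, 620, 5100, 4800};
    int part[N_CELLS];

    greedy_repartition(weight, N_CELLS, N_PROCS, part);

    for (int c = 0; c < N_CELLS; ++c)
        printf("cell %d (%d particles) -> processor %d\n",
               c, weight[c], part[c]);
    return 0;
}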

Figure 4.10 illustrates both the number of particles in each processor and the partition count as a function of the number of simulation time-steps for the large problem size using 16 processors when the SAR scheme is activated at intervals of 2∆t. For clarity of presentation, only the time history of both quantities during the early stage of the simulation is shown for each processor. In this figure, we are not trying to identify the evolution of the particle number in any specific processor, although different lines represent different processors. The results show that the number of particles in each processor approaches the average number of particles per processor (225,000) right after each repartition, which shows that load balancing takes effect as soon as the repartitioning functions. Note that we have preset the balance tolerance to 3% in the PJOSTLE library [57], which means that the domain is not repartitioned when the load imbalance among processors is less than this value, even if the SAR scheme decides to do so. In addition, the deviation of the number of particles in some processors from the average value grows faster in the early stage right after each repartition, when the flow changes dramatically from the initially uniform distribution, as shown in Fig. 4.10.

As the flow reaches steady state, the repartitioning occurs less frequently, as expected, which can be seen clearly from the smaller slope in Figs. 4.11 and 4.12. These figures show the typical domain repartition counts as a function of simulation time-steps for 16 and 64 processors; each solid symbol indicates a repartitioning of the computational domain. Generally, the repartition count at first increases rapidly (larger slope) with simulation time during the transient period and then increases more slowly (smaller slope) as the flow approaches steady state (~10,000 steps in Fig. 4.12). Similar trends can be found for other flow conditions. Note that the repartition history for the larger problem size after 20,000 steps is not shown, owing to the limited access time to the parallel machine.
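The repartitioning decision combines the two tests described above: a SAR-type degradation monitor that asks whether continuing with the current partition costs more than repartitioning, and the 3% balance tolerance preset in PJOSTLE. The sketch below only illustrates this logic under stated assumptions and is not the thesis implementation: the degradation is measured here as the relative gap between the slowest and the average processor, the repartition cost and per-step timings in main() are invented, and the exact tests used by the actual SAR/PJOSTLE code may differ.

#include <stdbool.h>
#include <stdio.h>

#define BALANCE_TOLERANCE 0.03   /* 3% tolerance preset in PJOSTLE (see text) */

typedef struct {
    double repart_cost;      /* assumed cost of one repartition, in step units          */
    double degradation_sum;  /* accumulated (t_max - t_avg)/t_avg since last repartition */
    int    steps;            /* time-steps elapsed since last repartition               */
    double prev_metric;      /* averaged degradation at the previous step               */
} SarMonitor;

/* Returns true when the domain should be repartitioned at this step:
 * the averaged degradation (including the repartition cost) has started
 * to rise AND the imbalance exceeds the balance tolerance.             */
static bool should_repartition(SarMonitor *m, double t_max, double t_avg)
{
    double imbalance = (t_max - t_avg) / t_avg;

    m->degradation_sum += imbalance;
    m->steps           += 1;

    double metric = (m->degradation_sum + m->repart_cost) / m->steps;
    bool   rise   = (m->steps > 1) && (metric > m->prev_metric);
    m->prev_metric = metric;

    if (rise && imbalance > BALANCE_TOLERANCE) {
        m->degradation_sum = 0.0;   /* reset the monitor after repartitioning */
        m->steps           = 0;
        m->prev_metric     = 0.0;
        return true;
    }
    return false;
}

int main(void)
{
    SarMonitor m = { .repart_cost = 0.5 };           /* invented cost                    */
    double t_max[] = {1.02, 1.05, 1.10, 1.18, 1.30}; /* invented slowest-processor times */

    for (int i = 0; i < 5; ++i)                      /* average step time taken as 1.0   */
        printf("step %d: repartition = %s\n", i,
               should_repartition(&m, t_max[i], 1.0) ? "yes" : "no");
    return 0;
}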

An interesting feature of the small problem size is that the remapping process is called at every opportunity after steady state is reached, indicating that the load cannot be balanced for this flow field. This instability, known as “flip-flop”, occurs when the number of processors is large or the problem size is too small: the weight of the transferred vertices reaches such a level that their transfer takes a sub-domain from an over-loaded state to an under-loaded state, or vice versa, so the repartitioning repeats indefinitely. This problem dominates on 64 processors even for the simulation with the medium problem size, as shown in Fig. 4.12.

Figures 4.13 and 4.14 show the final normalized workload distribution (i.e., the number of particles per processor) among the 16 and 64 processors, respectively, for the three problem sizes using the different strategies of implementing SAR (2∆t, 10∆t and 20∆t). It is clear that the workload distribution is much more uniform with dynamic domain decomposition than without it. In addition, the workload distribution is found to be most uniform for the SAR-2∆t scheme, since it monitors the load imbalance more often than the others. However, this does not guarantee better parallel efficiency (see Fig. 4.6(d) and Fig. 4.6(e)), since frequent repartitioning is expensive compared with the normal DSMC computation.

Time Breakdown of Parallel Implementation

Figure 4.15 illustrates the typical fraction of time spent in DSMC computation and in dynamic domain decomposition per simulation time-step as a function of the number of processors, when the SAR scheme is employed at intervals of 2∆t. Note that the DSMC computational time includes the “useful” DSMC computational time, the idle time and the communication time for particle movement between adjacent processors. It can be seen that, for the small problem, the average fraction of time spent repartitioning the domain per time-step increases dramatically with the number of processors, which explains the rapid decrease of parallel efficiency in this case when SAR is employed every 2∆t (Fig. 4.6(d)). A broadly similar trend is found for the medium problem size. In contrast, for the large problem, the fraction of time for repartitioning the domain remains approximately constant (~0.04) as the number of processors increases up to 64. Correspondingly, the fraction of time for the DSMC computation varies insignificantly once the number of processors exceeds 24. This shows that the current parallel DSMC method may be highly scalable, at least for the large problem size.

Figure 4.16 shows the real CPU time, in seconds, required to complete one time-step of “useful” DSMC and one “repartition” for the different problem sizes and numbers of processors. The time per repartition is obtained by dividing the total repartitioning time by the number of repartitions performed throughout the computation. Several observations can be made from this figure. First, the time spent on one “useful” DSMC step increases with increasing problem size and with decreasing number of processors. Second, for each problem size, the time per repartition decreases as the number of processors increases. In general, the repartition time is proportional to the problem size, especially for small numbers of processors, because the larger load of the large problem takes the repartitioning process a relatively long time to handle; this effect diminishes as the number of processors increases. Finally, the cost of repartitioning is very small and can be neglected in comparison with the “useful” DSMC computation.

The relative cost of each step of the parallel DSMC program for the different problem sizes on the IBM SP2 is shown in Fig. 4.17. The data with dynamic domain decomposition are also included for comparison; the solid and dashed lines represent the data with static and dynamic domain decomposition, respectively. The cost of message passing is included in the MOVE subroutine and amounts to about 60-70% of the total time; it increases with increasing number of processors. The trend for the different problem sizes is qualitatively the same.

Degree of Imbalance

The degree of imbalance is a useful indicator of the workload non-uniformity among processors and is an important parameter for justifying dynamic domain decomposition in the current study. It is interesting to examine the maximal degree of imbalance in the system, Imax, which is defined as

Imax = (Wmax − Wmin) / W̄ (4.5)

where Wmax and Wmin are the maximum and minimum particle numbers across the processors, respectively, and W̄ is the average number of particles per sub-domain. Figure 4.18 shows the variation of the maximal imbalance with the number of processors for the three problem sizes; the solid and dashed lines represent the parallel simulations with static and dynamic domain decomposition, respectively. In general, the maximal imbalance developed with static domain decomposition (0.5-2) is much higher (2-6 times) than that with dynamic domain decomposition (0.1-0.5). In addition, the workload imbalance develops very fast among processors as the number of processors increases if dynamic domain decomposition is not used. Although the maximal degree of imbalance for the small problem size deteriorates with increasing number of processors, it stays fairly constant, within 0.4, for the medium and large problem sizes, which again demonstrates the effectiveness of the dynamic domain decomposition.
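As a small worked example of Eq. (4.5), the snippet below computes Imax from a set of per-processor particle counts; the four counts are invented for illustration and are not thesis data.

#include <stdio.h>

int main(void)
{
    /* invented particle counts on 4 processors */
    long W[] = {260000, 240000, 180000, 220000};
    int  n_procs = 4;

    long w_max = W[0], w_min = W[0], sum = 0;
    for (int p = 0; p < n_procs; ++p) {
        if (W[p] > w_max) w_max = W[p];
        if (W[p] < w_min) w_min = W[p];
        sum += W[p];
    }
    double w_avg = (double)sum / n_procs;

    /* Eq. (4.5): Imax = (Wmax - Wmin) / average workload per processor */
    double i_max = (double)(w_max - w_min) / w_avg;
    printf("Imax = %.3f\n", i_max);   /* prints Imax = 0.356 for these counts */
    return 0;
}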